Disclaimer

These materials have been revised and updated by successive teaching assistants over recent years. The dataset and coding practice exercises in Part 4 are adapted from materials originally developed for the ICPSR Summer Program in Quantitative Methods.

Revision history:

Part 1: R Refresher and OLS Review

1.1 The conditional expectation function

In most empirical social science, our goal is to estimate the conditional expectation function (CEF):

\[ \mu(x) = E[y_i \mid x_i = x] \]

This is the average value of the outcome \(y\) we would observe if we could hold the covariates \(x\) at a particular value. It is the population object we are trying to learn about. For example, if \(y\) is public debt and \(x\) is GDP growth, then \(E[\text{debt} \mid \text{growth} = 2\%]\) tells us the average level of public debt among country-years where growth is 2%.
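The sample analogue of this object is just a subgroup mean. A minimal sketch with simulated, purely hypothetical growth and debt figures:

```r
# Hypothetical data: average debt is 50 + 5 * growth, plus noise
set.seed(1)
growth <- sample(0:4, 500, replace = TRUE)     # growth rate in percent
debt   <- 50 + 5 * growth + rnorm(500, sd = 10)

# Sample analogue of E[debt | growth = 2%]: the mean among those observations
mean(debt[growth == 2])   # close to the population value 50 + 5 * 2 = 60
```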

The linear regression model imposes a strong assumption: that the CEF is linear in parameters:

\[ E[y_i \mid x_i] = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} \]

This is the linearity assumption. It says that the effect of each covariate on the expected outcome is constant — a one-unit increase in \(x_1\) always changes \(E[y]\) by \(\beta_1\), regardless of the level of \(x_1\) or any other covariate. In many applications this is a convenient approximation, but it is not always correct. If the true CEF is nonlinear (e.g., quadratic, logarithmic, or involves interactions), imposing linearity produces a systematic misfit — the regression line no longer traces out the true conditional expectation, and the estimated coefficients inherit bias. We will examine this problem concretely in Part 2.

The observed outcome is:

\[ y_i = E[y_i \mid x_i] + \varepsilon_i = X_i'\beta + \varepsilon_i \]

where \(X_i'\) is the row vector of covariates for individual \(i\) (including a 1 for the intercept), and \(\varepsilon_i\) is the deviation of individual \(i\) from the conditional mean. By construction, \(E[\varepsilon_i \mid x_i] = 0\) — the errors average to zero at every value of \(x\). In plain terms, the regressors are not systematically related to any unobserved factors that also affect \(y\). This is the exogeneity assumption, and it is the critical condition for OLS to deliver unbiased estimates of \(\beta\).

When exogeneity fails — i.e., \(E[\varepsilon_i \mid x_i] \neq 0\) — our estimates no longer recover the CEF. We call this problem endogeneity. The classical sources each represent a different mechanism by which a regressor becomes correlated with the error term.

1.2 Quick OLS review

Let’s remind ourselves how OLS works by simulating data from a known data generating process (DGP). The advantage of simulation is that we know the truth, so we can verify whether our estimator recovers it.

\[ y_i = 2 + 1.5 x_{1i} - 3 x_{2i} + \varepsilon_i, \quad \varepsilon_i \sim N(0, 4) \]

set.seed(512)  # for replicability

n <- 500
x1 <- rnorm(n, mean = 0, sd = 2)
x2 <- rnorm(n, mean = 0, sd = 1)
e  <- rnorm(n, mean = 0, sd = 2)

y <- 2 + 1.5 * x1 - 3 * x2 + e

simdat <- data.frame(y, x1, x2)

The function lm() estimates the model by ordinary least squares:

mod <- lm(y ~ x1 + x2, data = simdat)
summary(mod)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = simdat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9235 -1.3058 -0.1052  1.2736  6.5285 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.09317    0.08579   24.40   <2e-16 ***
## x1           1.58939    0.04351   36.53   <2e-16 ***
## x2          -3.07835    0.08514  -36.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.918 on 497 degrees of freedom
## Multiple R-squared:  0.8407, Adjusted R-squared:   0.84 
## F-statistic:  1311 on 2 and 497 DF,  p-value: < 2.2e-16

The summary() output gives us the coefficient estimates \(\hat{\beta}\), their standard errors, \(t\)-statistics, and \(p\)-values. The estimates should be close to the true values \((2, 1.5, -3)\).

Recall that the OLS estimator in matrix form is:

\[ \hat{\beta} = (X'X)^{-1}X'y \]

where \(X\) is the \(n \times (k+1)\) design matrix (including a column of ones for the intercept) and \(y\) is the \(n \times 1\) outcome vector. This formula can look intimidating, but it encodes a simple idea — the next section breaks it into two interpretable pieces.

What does the OLS formula actually do?

The formula is compact but its logic is clear once you decompose it. Both \(X'X\) and \(X'y\) are built from covariances, so before unpacking the formula we need to be clear about what covariance means and how it relates to correlation.

Covariance and correlation. These two concepts are closely related. The covariance between two variables measures how much they move together in their original units: \(\text{Cov}(x_1, x_2) = E[(x_1 - E[x_1])(x_2 - E[x_2])]\). For a set of variables, the covariance matrix collects all pairwise covariances:

\[ \Sigma = \begin{pmatrix} \text{Var}(y) & \text{Cov}(y, x_1) & \text{Cov}(y, x_2) \\ \text{Cov}(x_1, y) & \text{Var}(x_1) & \text{Cov}(x_1, x_2) \\ \text{Cov}(x_2, y) & \text{Cov}(x_2, x_1) & \text{Var}(x_2) \end{pmatrix} \]

The diagonal contains variances (how much each variable spreads); the off-diagonal contains covariances (how much each pair moves together, in the original units). The correlation matrix is the same information, but each entry is rescaled by the standard deviations of the two variables involved:

\[ R = \begin{pmatrix} 1 & \text{Cor}(y, x_1) & \text{Cor}(y, x_2) \\ \text{Cor}(x_1, y) & 1 & \text{Cor}(x_1, x_2) \\ \text{Cor}(x_2, y) & \text{Cor}(x_2, x_1) & 1 \end{pmatrix}, \quad \text{where} \quad \text{Cor}(a, b) = \frac{\text{Cov}(a, b)}{\text{SD}(a) \cdot \text{SD}(b)} \]

The diagonal is always 1 (every variable is perfectly correlated with itself), and all entries are bounded between \(-1\) and \(1\).

So the covariance is an unscaled correlation — it carries the same information about the direction and strength of the linear relationship, but in the units of the original variables rather than on the standardized \([-1, 1]\) scale. We can see this directly by comparing both matrices from our simulated data:

# Covariance matrix — entries in the original units of the variables
round(cov(simdat), 2)
##        y   x1    x2
## y  22.99 6.14 -3.11
## x1  6.14 3.89  0.01
## x2 -3.11 0.01  1.02
# Correlation matrix — same information, rescaled to [-1, 1]
round(cor(simdat), 2)
##        y   x1    x2
## y   1.00 0.65 -0.64
## x1  0.65 1.00  0.01
## x2 -0.64 0.01  1.00

The two matrices carry the same information — which pairs move together and in which direction — but on different scales. For instance, the covariance between \(y\) and \(x_1\) is large in absolute terms because both variables have wide spread; the corresponding correlation puts this on a standardized \([-1, 1]\) scale, making it easier to judge the strength of the association. Conversely, \(x_1\) and \(x_2\) are uncorrelated by construction, which shows up as near-zero entries in both matrices.

This matters because OLS works with covariances, not correlations. The matrices \(X'X\) and \(X'y\) are sums of products in the original units of the data — the scale of the variables directly affects both the magnitude of the coefficients and the precision with which they are estimated.

Piece 1: \(X'X\) — the (unnormalized) covariance of the regressors. The matrix \(X'X\) is a \((k+1) \times (k+1)\) matrix whose entries are the sums of squares and cross-products of the covariates. Dividing by \(n\) would give you the sample covariance matrix of \(X\) (exactly so when the covariates are mean-zero; the intercept column aside). It captures how much variation each regressor has (diagonal) and how much they overlap with each other (off-diagonal). Its inverse, \((X'X)^{-1}\), tells OLS how to “undo” the correlation among regressors to isolate each one’s separate contribution — this is why multicollinearity (near-singular \(X'X\)) is a problem: there isn’t enough independent variation to separate the effects.

Piece 2: \(X'y\) — the cross-covariance of regressors with the outcome. The vector \(X'y\) is \((k+1) \times 1\) and contains the sums of products between each regressor and \(y\). It measures how much each covariate co-moves with the outcome — the (unnormalized) covariance between each \(x\) and \(y\).

Putting them together: OLS asks “how much does each regressor co-move with \(y\) (\(X'y\)), after accounting for the overlap among the regressors themselves (\(X'X\))?”

Without the inversion, the raw covariance \(X'y\) would double-count any variation that two regressors share. If \(x_1\) and \(x_2\) are correlated, part of the co-movement between \(x_1\) and \(y\) is really due to \(x_2\), and vice versa. Multiplying by \((X'X)^{-1}\) strips out this shared variation, so that each coefficient reflects only the independent contribution of its regressor.

This is also why multicollinearity is a problem: when regressors overlap heavily, there is little independent variation left after the adjustment, and the estimates become imprecise (large standard errors).
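This “strip out the shared variation” logic can be checked numerically via the Frisch–Waugh–Lovell theorem: the multiple-regression coefficient on a regressor equals the slope from regressing \(y\) on the part of that regressor not explained by the others. A sketch with simulated data (variable names are illustrative):

```r
set.seed(1)
n  <- 1000
x2 <- rnorm(n)
x1 <- 0.6 * x2 + rnorm(n)            # x1 overlaps with x2 by construction
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

b_multiple <- coef(lm(y ~ x1 + x2))["x1"]
x1_resid   <- resid(lm(x1 ~ x2))     # the part of x1 not shared with x2
b_partial  <- coef(lm(y ~ x1_resid))["x1_resid"]

all.equal(unname(b_multiple), unname(b_partial))   # TRUE: they coincide
```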

The geometric view: orthogonal projection

This section uses linear algebra. If the vector-space framing is unfamiliar, the covariance intuition above is sufficient — you can come back to this later.

Think of \(y\) as a point in \(n\)-dimensional space, and the columns of \(X\) as spanning a lower-dimensional “model space.” OLS finds the point \(\hat{y} = X\hat{\beta}\) in that space that is closest to \(y\). The residual \(\hat{e} = y - \hat{y}\) is perpendicular to the model space:

\[ X'\hat{e} = X'(y - X\hat{\beta}) = 0 \]

This orthogonality condition is the system of normal equations that yields the OLS formula. The key implication: OLS always finds the best linear approximation to the CEF. When the CEF is linear, OLS recovers the truth. When it is not, OLS gives the best straight line — but that line may be a poor fit (the misspecification problem in Part 2).

An important distinction: The condition \(X'\hat{e} = 0\) — that the residuals are uncorrelated with the regressors — is not an assumption. It is a mechanical result of the OLS procedure. OLS forces the residuals to be orthogonal to \(X\) by construction; that is how it finds \(\hat{\beta}\). The assumption is \(E[\varepsilon_i \mid x_i] = 0\) — that the true population errors are uncorrelated with the regressors. This is a statement about the data-generating process, not about the fitted residuals. A model can satisfy \(X'\hat{e} = 0\) perfectly (every OLS model does) while badly violating exogeneity. The residuals always “look clean” in-sample; the question is whether the errors are clean in the population.
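The mechanical nature of \(X'\hat{e} = 0\) is easy to verify on any fitted model — a quick self-contained sketch:

```r
# X'e_hat is zero by construction for any OLS fit, regardless of whether
# the exogeneity assumption holds in the population
set.seed(42)
d   <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
crossprod(model.matrix(fit), resid(fit))   # ~ 0 up to floating-point error
```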

The following diagram illustrates this geometry. The grey band represents the “model space” — the set of all fitted value vectors \(X\beta\) for any choice of \(\beta\). OLS finds \(\hat{\mathbf{y}}\), the point in this space closest to the observed \(\mathbf{y}\). The residual \(\hat{\mathbf{e}}\) connects \(\hat{\mathbf{y}}\) to \(\mathbf{y}\) and is perpendicular to the model space — that perpendicularity is the normal equations \(X'\hat{e} = 0\).

OLS as orthogonal projection. The vector y lives in n-dimensional space. Its projection onto the column space of X is the fitted value vector ŷ. The residual ê is perpendicular to that space.


We can verify the OLS formula by computing it manually:

X <- cbind(1, x1, x2)  # design matrix: n x 3

# Step by step:
XtX <- t(X) %*% X       # 3 x 3: covariance structure of regressors
Xty <- t(X) %*% y       # 3 x 1: covariance of regressors with outcome

beta_hat <- solve(XtX) %*% Xty   # (X'X)^{-1} X'y
beta_hat
##         [,1]
##     2.093174
## x1  1.589385
## x2 -3.078347

These values are identical to the lm() output. In this course we will sometimes compute estimators “by hand” to make the mechanics transparent — it helps demystify what functions like lm() are doing internally.
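The same by-hand approach extends to the standard errors in the summary() output. A sketch using the classical (homoskedastic) formula, reusing X, y, and beta_hat from the chunk above:

```r
e_hat  <- y - X %*% beta_hat                 # residuals
df     <- nrow(X) - ncol(X)                  # n - (k + 1) degrees of freedom
sigma2 <- sum(e_hat^2) / df                  # residual variance estimate
V      <- sigma2 * solve(t(X) %*% X)         # classical vcov of beta_hat
sqrt(diag(V))                                # matches the Std. Error column
```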

1.3 dplyr essentials

We use dplyr for data manipulation throughout this course. Its verbs compose via the pipe operator |>. Quick refresher:

# Create a small example dataset
toy <- tibble(
  country = rep(c("USA", "UK", "France"), each = 3),
  year    = rep(2010:2012, 3),
  gdp     = c(14.5, 15.0, 15.5, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7),
  debt    = c(60, 65, 70, 75, 80, 85, 55, 58, 60)
)

toy
## # A tibble: 9 × 4
##   country  year   gdp  debt
##   <chr>   <int> <dbl> <dbl>
## 1 USA      2010  14.5    60
## 2 USA      2011  15      65
## 3 USA      2012  15.5    70
## 4 UK       2010   2.2    75
## 5 UK       2011   2.3    80
## 6 UK       2012   2.4    85
## 7 France   2010   2.5    55
## 8 France   2011   2.6    58
## 9 France   2012   2.7    60

The core verbs:

# select: choose columns by name
toy |>
  select(country, year, debt)
## # A tibble: 9 × 3
##   country  year  debt
##   <chr>   <int> <dbl>
## 1 USA      2010    60
## 2 USA      2011    65
## 3 USA      2012    70
## 4 UK       2010    75
## 5 UK       2011    80
## 6 UK       2012    85
## 7 France   2010    55
## 8 France   2011    58
## 9 France   2012    60
# filter: keep rows that satisfy a condition
toy |>
  filter(year >= 2011, debt > 60)
## # A tibble: 4 × 4
##   country  year   gdp  debt
##   <chr>   <int> <dbl> <dbl>
## 1 USA      2011  15      65
## 2 USA      2012  15.5    70
## 3 UK       2011   2.3    80
## 4 UK       2012   2.4    85
# mutate: create or transform variables
toy |>
  mutate(debt_to_gdp = debt / gdp,
         log_gdp     = log(gdp))
## # A tibble: 9 × 6
##   country  year   gdp  debt debt_to_gdp log_gdp
##   <chr>   <int> <dbl> <dbl>       <dbl>   <dbl>
## 1 USA      2010  14.5    60        4.14   2.67 
## 2 USA      2011  15      65        4.33   2.71 
## 3 USA      2012  15.5    70        4.52   2.74 
## 4 UK       2010   2.2    75       34.1    0.788
## 5 UK       2011   2.3    80       34.8    0.833
## 6 UK       2012   2.4    85       35.4    0.875
## 7 France   2010   2.5    55       22      0.916
## 8 France   2011   2.6    58       22.3    0.956
## 9 France   2012   2.7    60       22.2    0.993
# summarize + group_by: compute grouped summaries
toy |>
  group_by(country) |>
  summarize(mean_debt = mean(debt),
            sd_debt   = sd(debt),
            n_years   = n())
## # A tibble: 3 × 4
##   country mean_debt sd_debt n_years
##   <chr>       <dbl>   <dbl>   <int>
## 1 France       57.7    2.52       3
## 2 UK           80      5          3
## 3 USA          65      5          3
# arrange: sort rows
toy |>
  arrange(desc(debt))
## # A tibble: 9 × 4
##   country  year   gdp  debt
##   <chr>   <int> <dbl> <dbl>
## 1 UK       2012   2.4    85
## 2 UK       2011   2.3    80
## 3 UK       2010   2.2    75
## 4 USA      2012  15.5    70
## 5 USA      2011  15      65
## 6 USA      2010  14.5    60
## 7 France   2012   2.7    60
## 8 France   2011   2.6    58
## 9 France   2010   2.5    55

These verbs chain together into pipelines — especially useful for panel data, where we often operate within groups (countries, time periods).
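For example, a within-country first difference — a typical panel operation — combines group_by(), arrange(), and lag(). A sketch using the toy data above:

```r
toy |>
  group_by(country) |>
  arrange(year, .by_group = TRUE) |>            # sort within each country
  mutate(debt_change = debt - lag(debt)) |>     # NA for each country's first year
  ungroup()
```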

1.4 Project organization and output management

A well-organized project structure pays off for reproducibility and collaboration. A typical research project:

my_project/
├── data/               # raw input datasets
├── output/
│   ├── plots/          # saved figures (.pdf, .png)
│   ├── tables/         # exported regression tables (.tex, .html)
│   ├── data/           # cleaned or processed datasets (.rds, .csv)
│   └── models/         # saved estimated model objects (.rds)
├── my_script.R         # or .Rmd analysis file(s)
└── README.md           # brief description of the project

In this lab, we use this structure: raw data lives in data/, and everything we produce goes into output/, organized by type. This separation keeps your working directory clean, makes it easy to distinguish inputs from generated artifacts, and helps collaborators find what they need.
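A setup chunk can create this structure programmatically, so scripts don’t fail on a fresh clone. A sketch (folder names match the layout above):

```r
# Create the project folders if they don't already exist
for (d in c("data", "output/plots", "output/tables", "output/data", "output/models")) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
```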

Displaying regression tables with texreg

The summary() function is fine for quick inspection, but for comparing models side by side — or for exporting publication-ready tables — we use the texreg package. It provides three functions for different output targets:

Function      Output                 Use case
screenreg()   Console (plain text)   Quick inspection during analysis
htmlreg()     HTML                   RMarkdown documents knitted to HTML
texreg()      LaTeX                  RMarkdown documents knitted to PDF, or papers via Overleaf

Tip: Since htmlreg() and texreg() produce format-specific output, this lab defines a helper show_reg() (in the setup chunk) that automatically picks the right one based on how you knit the document. Use show_reg() in place of htmlreg()/texreg() when you want tables to render correctly in both HTML and PDF.
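The actual definition lives in the setup chunk; a minimal sketch of such a helper, using knitr’s output-format detection, might look like this (the lab’s own version may differ in details):

```r
# Hypothetical minimal version of show_reg(): dispatch on the knit target
show_reg <- function(...) {
  if (knitr::is_html_output()) {
    texreg::htmlreg(...)   # HTML output
  } else {
    texreg::texreg(...)    # LaTeX/PDF output
  }
}
```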

Let’s see them in action with two simple models:

# Fit two models for demonstration
demo_dat <- data.frame(
  y  = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100)
)

m1 <- lm(y ~ x1, data = demo_dat)
m2 <- lm(y ~ x1 + x2, data = demo_dat)

# Console output — good for quick comparison
screenreg(list(m1, m2),
          custom.model.names = c("Model 1", "Model 2"))
## 
## =============================
##              Model 1  Model 2
## -----------------------------
## (Intercept)   -0.10    -0.10 
##               (0.10)   (0.10)
## x1            -0.16    -0.16 
##               (0.11)   (0.11)
## x2                     -0.02 
##                        (0.10)
## -----------------------------
## R^2            0.02     0.02 
## Adj. R^2       0.01     0.00 
## Num. obs.    100      100    
## =============================
## *** p < 0.001; ** p < 0.01; * p < 0.05

For RMarkdown documents, use htmlreg() for HTML output or texreg() for PDF output, with results='asis' in the chunk options so the table renders properly. The show_reg() helper defined in the setup chunk does this automatically:

# show_reg() picks htmlreg() for HTML output or texreg() for PDF automatically
show_reg(list(m1, m2),
         custom.model.names = c("Model 1", "Model 2"),
         caption = "Example regression table",
         caption.above = TRUE)
Example regression table

              Model 1   Model 2
(Intercept)   -0.10     -0.10
              (0.10)    (0.10)
x1            -0.16     -0.16
              (0.11)    (0.11)
x2                      -0.02
                        (0.10)
R2            0.02      0.02
Adj. R2       0.01      0.00
Num. obs.    100       100

*** p < 0.001; ** p < 0.01; * p < 0.05

Saving tables to files

To export a table as a standalone file — for inclusion in a paper or for sharing — use the file argument:

# Export as LaTeX (.tex) — for Overleaf or LaTeX documents
texreg(list(m1, m2),
       custom.model.names = c("Model 1", "Model 2"),
       file = "output/tables/example_table.tex",
       caption = "Example regression table")

# Export as HTML — for web documents or quick sharing
htmlreg(list(m1, m2),
        custom.model.names = c("Model 1", "Model 2"),
        file = "output/tables/example_table.html",
        caption = "Example regression table")

The texreg family accepts many customization options: custom.coef.names to rename coefficients, include.ci = TRUE to show confidence intervals instead of standard errors, stars to control significance stars, digits to set decimal places, etc. See ?texreg for the full list.

Saving figures with ggsave

To export ggplot2 figures to files, use ggsave(). By default it saves the last plot displayed, but you can also pass a plot object explicitly:

# Save the last displayed plot
ggsave(filename = "output/plots/my_figure.pdf",
       width = 6, height = 6 / 1.618)  # golden ratio

# Common formats: .pdf (vector, best for papers), .png (raster, good for slides)

We use the golden ratio (\(\phi \approx 1.618\)) for the width-to-height proportion — it produces aesthetically balanced figures and is a sensible default for academic plots.

Saving processed data

After cleaning or transforming a dataset, save the result so downstream scripts can load it directly without re-running the cleaning pipeline:

# CSV — portable, human-readable, but loses R-specific metadata (factor levels, etc.)
write_csv(my_clean_data, file = "output/data/clean_data.csv")

# RDS — compact, preserves all R object attributes, but R-only
saveRDS(my_clean_data, file = "output/data/clean_data.rds")
# To reload: my_clean_data <- readRDS("output/data/clean_data.rds")

Use .csv when the data needs to be shared with non-R users or loaded into other software. Use .rds when the data stays within R and you want to preserve factor levels, date classes, or other metadata.

Saving estimated models

You can also save fitted model objects so you (or a co-author) can reload them later without re-running the estimation — especially useful for models that take a long time to fit:

# Save a fitted model object
saveRDS(my_model, file = "output/models/my_model.rds")
# To reload: my_model <- readRDS("output/models/my_model.rds")

The .rds format preserves the entire model object, including coefficients, residuals, and the variance-covariance matrix — everything you need for summary(), predict(), or coeftest().


Part 2: Endogeneity Problems

2.0 The CEF and endogeneity bias

Recall that the OLS estimator targets the conditional expectation function:

\[ E[y_i \mid x_i] = X_i'\beta \]

This works when \(E[\varepsilon_i \mid x_i] = 0\). But when a regressor is endogenous — correlated with the error term — we have \(E[\varepsilon_i \mid x_i] \neq 0\), and OLS no longer recovers the true \(\beta\). The estimated regression line no longer traces out the CEF; it traces out a distorted version of it.

A note on terminology. “Endogeneity” is an umbrella term — it simply means that the mean independence assumption \(E[\varepsilon_i \mid x_i] = 0\) is violated. It tells you that something is wrong, but not what is wrong. In practice, there are several distinct mechanisms through which exogeneity can fail, and they differ in their causes, the direction and magnitude of the bias they produce, and the remedies available. Saying “the model suffers from endogeneity” is like saying “the patient is sick” — it is not a diagnosis. When communicating results, be specific: say “omitted variable bias,” “measurement error,” “reverse causality,” etc. This precision matters because different problems call for different solutions (controls, better data, instrumental variables, etc.), and a vague appeal to “endogeneity” obscures which solution is appropriate.

In this lab we review four classical forms of endogeneity, each producing bias and inconsistency through a different channel:

  • Omitted variables. Mechanism: a relevant variable is left out; its effect is absorbed into \(\varepsilon\), which becomes correlated with \(x\). Effect on \(\hat{\beta}\): direction depends on the sign of the omitted variable’s effect and its correlation with \(x\).
  • Measurement error. Mechanism: the true regressor \(x\) is observed with noise, \(x^* = x + \upsilon\); the noise makes the observed regressor correlated with the error term. Effect on \(\hat{\beta}\): under classical error, attenuation toward zero; under non-classical error, the direction can be unpredictable.
  • Functional misspecification. Mechanism: the true CEF is nonlinear but we impose a linear model; the omitted nonlinear terms (e.g., \(x^2\)) are absorbed into \(\varepsilon\) and are mechanically correlated with \(x\). Effect on \(\hat{\beta}\): a special case of OVB where the “omitted variable” is a transformation of an included regressor.
  • Simultaneity (Appendix B). Mechanism: \(x\) and \(y\) are jointly determined; \(x\) depends on \(\varepsilon\) through the feedback loop. Effect on \(\hat{\beta}\): direction depends on the structural parameters of the system.

We now illustrate the first three with simulations where we know the true DGP, so we can measure the bias exactly. Simultaneity is covered in Appendix B (self-study).

2.1 Omitted Variable Bias (OVB)

The problem

Suppose the true model is:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon \]

but we estimate a model that omits one or more regressors. If the omitted variable is correlated with an included regressor and affects the outcome, the coefficient on the included regressor is biased. In the simple case where we omit \(x_2\):

\[ \hat{\beta}_1^{\text{short}} \xrightarrow{p} \beta_1 + \beta_2 \cdot \delta_{x_2 \mid x_1} \]

The arrow \(\xrightarrow{p}\) reads “converges in probability to” — it tells us what the estimator approaches as the sample grows large. In other words, this is not sampling noise; it is a systematic shift that does not vanish with more data. Here \(\delta_{x_2 \mid x_1}\) is the coefficient from regressing the omitted variable \(x_2\) on the included variable \(x_1\). The bias is the product of two things: (1) how much the omitted variable affects \(y\) (i.e., \(\beta_2\)), and (2) how correlated the omitted variable is with the included regressor (i.e., \(\delta\)).

For example, if we estimate the effect of education on income but omit cognitive ability, the education coefficient absorbs part of the ability effect — biasing it upward if ability and education are positively correlated.
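Before turning to the full simulation below, the bias formula itself can be verified in a minimal two-regressor sketch (illustrative parameter values):

```r
set.seed(99)
n  <- 1e5
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)        # so delta (x2 regressed on x1) is 0.5
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

coef(lm(y ~ x1))["x1"]           # short regression: ~ 2 + 3 * 0.5 = 3.5
coef(lm(x2 ~ x1))["x1"]          # auxiliary regression delta: ~ 0.5
```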

Simulation

In Part 1 we simulated covariates one at a time using rnorm() — each drawn independently. But OVB is driven by correlations among covariates, so we need a way to generate variables that are correlated by design. The multivariate normal distribution does exactly this. For a random vector \(\mathbf{x} = (x_1, x_2, \dots, x_p)'\):

\[ \mathbf{x} \sim \mathcal{N}_p(\boldsymbol{\mu}, \, \Sigma) \]

where \(\boldsymbol{\mu}\) is the \(p \times 1\) mean vector (the center of the distribution) and \(\Sigma\) is the \(p \times p\) covariance matrix that controls both the spread of each variable (diagonal entries = variances) and the linear dependencies between them (off-diagonal entries = covariances). When the diagonal entries are all ones (unit variance), the off-diagonal entries are correlations directly.

In R, mvrnorm(n, mu, Sigma) from the MASS package draws \(n\) observations from this distribution. We set up our DGP with four covariates, mean zero, and the following correlation structure:

  • \(x_1\) and \(x_2\) are negatively correlated (\(\rho = -0.4\))
  • \(x_2\) and \(x_3\) are positively correlated (\(\rho = 0.7\))
  • \(x_4\) is uncorrelated with all others

\[ \boldsymbol{\mu} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 1 & -0.4 & 0 & 0 \\ -0.4 & 1 & 0.7 & 0 \\ 0 & 0.7 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \]

Since the diagonal is all ones (unit variance for each covariate), this \(\Sigma\) is simultaneously a covariance and a correlation matrix — the off-diagonal entries are the pairwise correlations directly. Compare this to the code below:

set.seed(123)

# True parameters: intercept = 2, b1 = 1, b2 = -5, b3 = 3, b4 = 5
beta <- as.matrix(c(2, 1, -5, 3, 5))

# Correlation/covariance matrix (unit variances, so correlations = covariances)
sigma <- rbind(
  c(1.0, -0.4, 0.0, 0.0),
  c(-0.4, 1.0, 0.7, 0.0),
  c(0.0,  0.7, 1.0, 0.0),
  c(0.0,  0.0, 0.0, 1.0)
)
colnames(sigma) <- rownames(sigma) <- c("x1", "x2", "x3", "x4")

# Draw 1000 observations from the multivariate normal
X <- mvrnorm(n = 1000, mu = rep(0, 4), Sigma = sigma)
Xs <- cbind(1, X)
simdat_ovb <- as.data.frame(Xs)
colnames(simdat_ovb) <- c("int", "x1", "x2", "x3", "x4")

# Generate y = Xs %*% beta + epsilon, with epsilon ~ N(0, 1)
simdat_ovb$y <- rnorm(1000, mean = Xs %*% beta)

Before modeling, it is good practice to verify that the data look as expected. Check the correlation matrix of the covariates — does it match the structure we specified?

# Always inspect your data before modeling
round(cor(simdat_ovb[, c("x1", "x2", "x3", "x4")]), 2)
##       x1    x2   x3    x4
## x1  1.00 -0.33 0.06  0.03
## x2 -0.33  1.00 0.71 -0.03
## x3  0.06  0.71 1.00  0.01
## x4  0.03 -0.03 0.01  1.00

We see the correlations we built in: \(x_1\) and \(x_2\) are negatively correlated (sample value \(-0.33\), versus the specified \(-0.4\)), \(x_2\) and \(x_3\) are positively correlated (\(\approx 0.7\)), and \(x_4\) is essentially uncorrelated with the rest. This structure is what drives the OVB patterns we are about to observe.

Estimation and misspecification

We compare the true model against progressively more misspecified models:

true_model <- lm(y ~ x1 + x2 + x3 + x4, data = simdat_ovb)
omit_x4    <- lm(y ~ x1 + x2 + x3,       data = simdat_ovb)
omit_x3    <- lm(y ~ x1 + x2 + x4,       data = simdat_ovb)
just_x1    <- lm(y ~ x1,                  data = simdat_ovb)

# Compare estimates to truth
round(cbind(Estimates = coef(true_model), Truth = as.numeric(beta)), 3)
##             Estimates Truth
## (Intercept)     1.963     2
## x1              1.071     1
## x2             -4.989    -5
## x3              3.057     3
## x4              4.983     5

The true model recovers the parameters well. We can compare all four models side by side using screenreg():

screenreg(list(true_model, omit_x4, omit_x3, just_x1),
          custom.model.names = c("True", "Omit x4", "Omit x3", "Just x1"))
## 
## ===============================================================
##              True         Omit x4      Omit x3      Just x1    
## ---------------------------------------------------------------
## (Intercept)     1.96 ***     1.86 ***     1.98 ***     1.81 ***
##                (0.03)       (0.16)       (0.07)       (0.19)   
## x1              1.07 ***     1.02 ***     2.14 ***     3.08 ***
##                (0.04)       (0.19)       (0.08)       (0.19)   
## x2             -4.99 ***    -5.45 ***    -2.37 ***             
##                (0.05)       (0.27)       (0.07)                
## x3              3.06 ***     3.42 ***                          
##                (0.05)       (0.24)                             
## x4              4.98 ***                  5.08 ***             
##                (0.03)                    (0.07)                
## ---------------------------------------------------------------
## R^2             0.98         0.44         0.89         0.20    
## Adj. R^2        0.98         0.44         0.89         0.20    
## Num. obs.    1000         1000         1000         1000       
## ===============================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

Visualizing the bias

bias_df <- bind_rows(
  tibble(model = "Omit x4",
         coefficient = names(coef(omit_x4)),
         bias = coef(omit_x4) - beta[1:4]),
  tibble(model = "Omit x3",
         coefficient = names(coef(omit_x3)),
         bias = coef(omit_x3) - beta[c(1, 2, 3, 5)]),
  tibble(model = "Just x1",
         coefficient = names(coef(just_x1)),
         bias = coef(just_x1) - beta[1:2])
)

ggplot(bias_df, aes(x = coefficient, y = bias, fill = model)) +
  geom_col(position = position_dodge(width = 0.7), width = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  scale_fill_brewer(palette = "Set1", name = "Model") +
  labs(title = "Omitted Variable Bias across Misspecified Models",
       y = "Bias (Estimate - Truth)", x = "Coefficient") +
  theme_minimal(base_size = 14)
Coefficient bias from omitting correlated regressors. Omitting x4 (uncorrelated) produces no bias; omitting x3 (correlated with x2) biases the x2 estimate.


# Save figure to output folder
ggsave("output/plots/ovb_bias.pdf", width = 6, height = 6 / 1.618)

Key takeaways:

  • Omitting \(x_4\) (uncorrelated with the rest) produces no bias on the remaining coefficients. This is the best-case scenario for an omitted variable.
  • Omitting \(x_3\) (correlated with \(x_2\) at \(\rho = 0.7\)) biases \(\hat{\beta}_2\) substantially. The coefficient on \(x_2\) absorbs part of the effect of \(x_3\).
  • When we include only \(x_1\), it absorbs effects from all omitted and correlated regressors, and the bias can be severe.

In terms of the CEF, the misspecified model no longer estimates \(E[y \mid x_1, x_2, x_3, x_4]\). Instead, it estimates a distorted conditional expectation where the omitted effects are conflated with the included regressors.

2.2 Measurement Error

The problem

Suppose the true model includes \(x_2\), but we do not observe \(x_2\) directly. Instead, we observe a noisy proxy:

\[ x_2^* = x_2 + \upsilon, \quad \upsilon \sim N(0, \sigma_\upsilon^2) \]

Under classical measurement error — where the noise \(\upsilon\) is independent of both the true \(x_2\) and the structural error \(\varepsilon\) — the OLS coefficient on \(x_2^*\) suffers from attenuation bias: it is biased toward zero by a factor known as the reliability ratio:

\[ \hat{\beta}_2^* \xrightarrow{p} \beta_2 \cdot \underbrace{\frac{\sigma_{x_2}^2}{\sigma_{x_2}^2 + \sigma_\upsilon^2}}_{\text{reliability ratio} < 1} \]

The reliability ratio is always between 0 and 1: the fraction of variance that is “real signal.” More noise → ratio closer to zero → coefficient shrinks toward zero.
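Before the multivariate simulation below, the attenuation formula can be checked in a minimal univariate sketch (self-contained; the variable names and parameter values here are illustrative, not from this lab's DGP):

```r
set.seed(99)

n      <- 1e5
beta   <- 2
x_true <- rnorm(n, sd = 2)            # signal variance: 4
x_obs  <- x_true + rnorm(n, sd = 1)   # noise variance: 1
y      <- 1 + beta * x_true + rnorm(n)

reliability <- 4 / (4 + 1)            # sigma_x^2 / (sigma_x^2 + sigma_v^2) = 0.8
coef(lm(y ~ x_obs))["x_obs"]          # close to beta * reliability = 1.6
```

With a large sample, the estimated slope sits almost exactly at \(\beta \times 0.8 = 1.6\) rather than at the true \(\beta = 2\).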

Important caveat: The clean attenuation result above holds only under classical measurement error. In practice, measurement error is often non-classical — the noise may be correlated with the true value of \(x\) (e.g., high earners underreport income), with the outcome \(y\), or with other covariates. When this happens, the bias is no longer predictably toward zero: it can go in either direction and its magnitude depends on the specific correlation structure of the error. For example, if survey respondents who are more politically engaged both report their ideology more accurately and participate more, the measurement error in ideology is correlated with participation — and the resulting bias could inflate or deflate the coefficient unpredictably. The bottom line: do not assume attenuation by default; think carefully about the likely structure of the measurement error in your specific application.

Additionally, measurement error can mask nonlinearity — noise in \(x\) smooths out curvature, potentially making a nonlinear relationship appear linear.

Simulation: attenuation bias

Using the same simulated data from the OVB section, we add measurement error to \(x_2\):

set.seed(123)

noisydat <- simdat_ovb
noise <- rnorm(1000, mean = 0, sd = 1.5)
noisydat$x2 <- noisydat$x2 + noise  # observed with error

noisy_model <- lm(y ~ x1 + x2 + x3 + x4, data = noisydat)

round(cbind(
  Noisy    = coef(noisy_model),
  True     = coef(true_model),
  DGP      = beta
), 3)
##              Noisy   True  DGP
## (Intercept)  1.963  1.963  2
## x1          -0.021  1.071  1
## x2          -2.788 -4.989 -5
## x3           4.968  3.057  3
## x4           4.983  4.983  5

Notice: the coefficient on \(x_2\) is attenuated — pulled toward zero relative to the truth (\(\beta_2 = -5\)). The other coefficients that are correlated with \(x_2\) may also be affected, as the model compensates for the weakened signal in \(x_2^*\) by adjusting the estimates on correlated covariates. This is known as induced bias.

Visualizing attenuation bias

bind_rows(
  simdat_ovb |> mutate(type = "True x2"),
  noisydat   |> mutate(type = "Noisy x2")
) |>
  ggplot(aes(x = x2, y = y, color = type)) +
  geom_point(alpha = 0.3, size = 0.8) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  facet_wrap(~type) +
  labs(x = "x2", y = "y",
       title = "Attenuation Bias from Measurement Error") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")
Measurement error in x2 flattens the estimated regression line, pulling the slope toward zero.


ggsave("output/plots/attenuation_bias.pdf", width = 6, height = 6 / 1.618)

Visualizing how measurement error masks data features

Beyond attenuation in linear models, measurement error can obscure the functional form of the relationship. If the true DGP is nonlinear, noise in \(x\) spreads out the data cloud horizontally, smoothing out the curvature and making the relationship appear flatter and more linear than it actually is.

set.seed(123)

n_mask <- 200
x_true <- runif(n_mask, -3, 3)
y_mask <- sin(x_true) + rnorm(n_mask, 0, 0.3)

# Add substantial measurement error
u_mask <- rnorm(n_mask, mean = 0, sd = 3)
x_obs  <- x_true + u_mask

df_mask <- data.frame(x_true, x_obs, y = y_mask)

p1 <- ggplot(df_mask, aes(x = x_true, y = y)) +
  geom_point(shape = 1, size = 2, alpha = 0.8) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(title = "True Covariate (x)", x = "x (true)", y = "y") +
  ylim(-2, 2) +
  theme_minimal(base_size = 13)

p2 <- ggplot(df_mask, aes(x = x_obs, y = y)) +
  geom_point(shape = 8, size = 2, alpha = 0.8) +
  geom_smooth(method = "loess", se = FALSE, color = "red") +
  labs(title = "Observed Covariate with Measurement Error (x*)",
       x = "x* (observed)", y = "y") +
  ylim(-2, 2) +
  theme_minimal(base_size = 13)

# Stack vertically (grid.arrange() is from the gridExtra package)
grid.arrange(p1, p2, nrow = 2)
When the true relationship is nonlinear (top), measurement error in x can mask that structure entirely (bottom), leading to incorrect modeling decisions.


In the top panel, the sinusoidal relationship is clearly visible. In the bottom panel, the horizontal spread introduced by measurement error washes out the curvature almost entirely. A researcher working only with \(x^*\) might conclude the relationship is essentially flat — a fundamentally wrong conclusion driven entirely by data quality, not by the underlying phenomenon.

2.3 Functional Misspecification

The problem

Functional misspecification is a special case of omitted variable bias that arises from the linearity assumption itself. If the true CEF includes nonlinear terms — say, \(x_2^2\) — but we fit a linear model with only \(x_2\), then the omitted term \(x_2^2\) is absorbed into the error. Since \(x_2^2\) is mechanically correlated with \(x_2\) (and with any variable correlated with \(x_2\)), this is formally equivalent to OVB: the “omitted variable” happens to be a transformation of an included regressor.

The OVB formula applies directly. If we omit \(x_2^2\) from a model that includes \(x_2\):

\[ \hat{\beta}_2^{\text{linear}} \xrightarrow{p} \beta_2 + \beta_3 \cdot \delta_{x_2^2 \mid x_2, \dots} \]

where \(\beta_3\) is the true coefficient on \(x_2^2\) and \(\delta\) captures the relationship between \(x_2^2\) and \(x_2\) net of other covariates.

The same logic applies to omitted interactions: if the true model includes \(x_1 \times x_2\) but we impose additivity, the omitted interaction term is absorbed into the coefficients of \(x_1\) and \(x_2\).

This matters in practice because researchers often assume linearity and additivity by default. But if the true relationship involves quadratic terms or interactions, imposing a linear additive model does not merely produce a “less precise” estimate — it produces biased estimates of every coefficient that is correlated with the omitted nonlinear terms.
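The equivalence to OVB can be verified numerically in a stripped-down one-regressor sketch (names and coefficient values here are illustrative, not the simulation used in this section): regress the "omitted" term \(x^2\) on the included \(x\) to get \(\delta\), and compare \(\beta_2 + \beta_3 \delta\) to the misspecified linear fit.

```r
set.seed(99)

n <- 1e5
x <- rnorm(n, mean = 2, sd = 1)       # nonzero mean, so x and x^2 are correlated
y <- 1 + 3 * x - 0.5 * x^2 + rnorm(n)

# Auxiliary regression: the "omitted" term x^2 on the included x
delta <- coef(lm(I(x^2) ~ x))["x"]

ovb_prediction <- 3 + (-0.5) * delta    # beta_2 + beta_3 * delta
linear_fit     <- coef(lm(y ~ x))["x"]  # misspecified model's estimate

c(predicted = unname(ovb_prediction), estimated = unname(linear_fit))  # nearly identical
```

The OVB formula predicts the misspecified coefficient almost exactly, confirming that functional misspecification is just OVB with a transformed regressor as the omitted variable.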

Simulation

We simulate a DGP with three covariates where \(x_2\) has a quadratic effect:

\[ y = 1 + 2 x_1 + 4 x_2 - 0.6 x_2^2 + 0.8 (x_1 \times x_2) + 1.5 x_3 + \varepsilon, \quad \varepsilon \sim N(0, 4) \]

The true model includes both a quadratic in \(x_2\) and an interaction between \(x_1\) and \(x_2\). Think of this as a variable with diminishing returns (\(x_2\)) whose effect also varies with context (\(x_1\)) — a common pattern in social science (e.g., the effect of government spending on growth depends on institutional quality, and has diminishing returns regardless). The covariates are correlated: \(\text{Cor}(x_1, x_2) = 0.5\), \(\text{Cor}(x_2, x_3) = 0.5\), \(\text{Cor}(x_1, x_3) = 0.2\).

The marginal effect of \(x_1\) in the true DGP is \(2 + 0.8 x_2\), which depends on \(x_2\). When we omit the nonlinear terms and fit a linear model, the estimated coefficient on \(x_1\) converges to the average marginal effect across the sample rather than the structural parameter — producing clear bias on \(x_1\) even though \(x_1\) itself is correctly specified. The distinction matters because the average marginal effect hides the fact that \(x_1\)’s effect varies with \(x_2\): reporting a single number masks real heterogeneity in the relationship.

set.seed(512)

n <- 800

# Correlation structure among covariates
sigma_ms <- rbind(
  c(1.0, 0.5, 0.2),
  c(0.5, 1.0, 0.5),
  c(0.2, 0.5, 1.0)
)

# Draw correlated covariates — x2 centered near 3 with wider spread
sds_ms <- c(1, 1.5, 1)
cov_ms <- diag(sds_ms) %*% sigma_ms %*% diag(sds_ms)
X_ms <- mvrnorm(n, mu = c(0, 3, 0), Sigma = cov_ms)   # mvrnorm() is from MASS
x1_ms <- X_ms[, 1]
x2_ms <- X_ms[, 2]
x3_ms <- X_ms[, 3]

# True DGP: quadratic in x2 + interaction x1*x2
y_ms <- 1 + 2 * x1_ms + 4 * x2_ms - 0.6 * x2_ms^2 + 0.8 * x1_ms * x2_ms + 1.5 * x3_ms + rnorm(n, 0, 2)

msdat <- data.frame(y = y_ms, x1 = x1_ms, x2 = x2_ms, x3 = x3_ms)

Before fitting models, let’s visualize the relationship we built in. A scatter plot of \(y\) against \(x_2\) should reveal the curvature — the inverted-U that the linear model will miss:

# Quick EDA: is there visible nonlinearity?
ggplot(msdat, aes(x = x2, y = y)) +
  geom_point(alpha = 0.2, size = 0.8) +
  geom_smooth(method = "loess", se = FALSE, color = "steelblue", linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = "firebrick", linetype = "dashed", linewidth = 0.8) +
  labs(title = "y vs. x2: Is the relationship linear?",
       subtitle = "Blue = loess (flexible); Red dashed = linear fit",
       x = expression(x[2]), y = "y") +
  theme_minimal(base_size = 13)

The loess smoother (blue) traces the curvature; the linear fit (red dashed) misses it. In practice, this kind of plot is your first defense against misspecification — always plot your key relationships before assuming linearity.

Estimation: correct vs. misspecified

# Correct specification: includes the quadratic and interaction
correct_mod <- lm(y ~ x1 * x2 + I(x2^2) + x3, data = msdat)

# Misspecified: assumes linearity, omits x2^2 and x1:x2
misspec_mod <- lm(y ~ x1 + x2 + x3, data = msdat)

# Compare estimates
screenreg(list(correct_mod, misspec_mod),
          custom.model.names = c("Correct", "Misspecified (linear)"))
## 
## ==============================================
##              Correct     Misspecified (linear)
## ----------------------------------------------
## (Intercept)    0.74 *      5.74 ***           
##               (0.31)      (0.26)              
## x1             1.82 ***    4.26 ***           
##               (0.18)      (0.10)              
## x2             4.23 ***    0.40 ***           
##               (0.19)      (0.08)              
## x2^2          -0.63 ***                       
##               (0.03)                          
## x3             1.29 ***    1.31 ***           
##               (0.09)      (0.11)              
## x1:x2          0.84 ***                       
##               (0.06)                          
## ----------------------------------------------
## R^2            0.87        0.80               
## Adj. R^2       0.87        0.80               
## Num. obs.    800         800                  
## ==============================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

Two things stand out. First, the coefficient on \(x_2\) in the misspecified model is dramatically biased: it should be 4, but the omitted quadratic pulls it toward zero. Second — and this is the key insight — the coefficient on \(x_1\) is also badly biased. The true structural effect of \(x_1\) is 2, but the misspecified model estimates it at approximately \(2 + 0.8 \times \bar{x}_2 \approx 4.4\). The omitted interaction \(x_1 \times x_2\) gets absorbed into the \(x_1\) coefficient, inflating it to reflect the average marginal effect rather than the structural parameter. This is OVB where the “omitted variables” are nonlinear transformations of included regressors.

Predicting the CEF

The key consequence of misspecification is that it distorts the estimated conditional expectation function. We can visualize this by predicting \(E[y \mid x_2]\) under both models, holding \(x_1\) and \(x_3\) at their means.

The prediction workflow has three steps: (1) set up a grid of values for the variable of interest, (2) compute \(\hat{y} = X_{\text{new}} \hat{\beta}\) at each grid point, and (3) get confidence intervals. We show the manual approach first (for understanding), then the predict() shortcut.

# Step 1: Set up a prediction grid
# Vary x2 across its range, hold x1 and x3 at their sample means
x2_grid <- seq(min(msdat$x2), max(msdat$x2), length.out = 200)
newdata <- data.frame(
  x1 = mean(msdat$x1),
  x2 = x2_grid,
  x3 = mean(msdat$x3)
)

# Step 2: Manual prediction — y_hat = X_new %*% beta_hat
# Build the design matrix for each model (columns must match coefficient order)
beta_correct <- coef(correct_mod)
beta_misspec <- coef(misspec_mod)

X_new_correct <- cbind(1, newdata$x1, newdata$x2, newdata$x2^2, newdata$x3,
                       newdata$x1 * newdata$x2)  # intercept, x1, x2, x2^2, x3, x1:x2
X_new_misspec <- cbind(1, newdata$x1, newdata$x2, newdata$x3)

yhat_correct <- X_new_correct %*% beta_correct
yhat_misspec <- X_new_misspec %*% beta_misspec

# Step 3: Use predict() — does the same thing in one call, plus SEs
pred_correct <- predict(correct_mod, newdata = newdata, se.fit = TRUE)
pred_misspec <- predict(misspec_mod, newdata = newdata, se.fit = TRUE)

# Verify: manual matches predict()
head(data.frame(
  manual = as.numeric(yhat_correct),
  predict = pred_correct$fit
), 4)
##      manual   predict
## 1 -8.863865 -8.863865
## 2 -8.547384 -8.547384
## 3 -8.233924 -8.233924
## 4 -7.923483 -7.923483

The predict() function with se.fit = TRUE also returns standard errors, which we use to build 95% confidence intervals. The SE at each point reflects the uncertainty in \(\hat{\beta}\) propagated through the design matrix: \(\text{SE}(\hat{y}) = \sqrt{x_{\text{new}}' \hat{V} \, x_{\text{new}}}\), where \(\hat{V}\) is the variance-covariance matrix of the estimated coefficients. In words: the prediction uncertainty comes from the coefficient uncertainty, weighted by the covariate values at which we are predicting.

For self-study (Appendix): You can also compute these SEs manually using the variance-covariance matrix vcov(model), or simulate from the sampling distribution of \(\hat{\beta}\) using mvrnorm(). The Appendix on interaction effects demonstrates both approaches.
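As a quick illustration of the SE formula above — a self-contained sketch on toy data, not the models from this section:

```r
set.seed(99)

d   <- data.frame(x = rnorm(100))
d$y <- 1 + 2 * d$x + rnorm(100)
m   <- lm(y ~ x, data = d)

# SE of the prediction at x = 0.5, via sqrt(x_new' V x_new)
x_new     <- c(1, 0.5)                                 # covariate row: intercept, x
se_manual <- sqrt(as.numeric(t(x_new) %*% vcov(m) %*% x_new))

pr <- predict(m, newdata = data.frame(x = 0.5), se.fit = TRUE)
c(manual = se_manual, predict = unname(pr$se.fit))     # identical
```

The manual calculation reproduces `se.fit` exactly: `predict()` is doing nothing more than propagating `vcov(m)` through the new design row.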

# 95% confidence intervals from predict()
ci_correct_lb <- pred_correct$fit - 1.96 * pred_correct$se.fit
ci_correct_ub <- pred_correct$fit + 1.96 * pred_correct$se.fit
ci_misspec_lb <- pred_misspec$fit - 1.96 * pred_misspec$se.fit
ci_misspec_ub <- pred_misspec$fit + 1.96 * pred_misspec$se.fit

Visualizing the two CEFs

cef_df <- data.frame(
  x2 = rep(x2_grid, 2),
  yhat = c(as.numeric(yhat_correct), as.numeric(yhat_misspec)),
  lb = c(ci_correct_lb, ci_misspec_lb),
  ub = c(ci_correct_ub, ci_misspec_ub),
  model = rep(c("Correct", "Misspecified (linear)"), each = 200)
)

ggplot() +
  # Raw data in the background
  geom_point(data = msdat, aes(x = x2, y = y),
             alpha = 0.15, size = 0.8, color = "grey50") +
  # Predicted CEFs with confidence bands
  geom_ribbon(data = cef_df, aes(x = x2, ymin = lb, ymax = ub, fill = model),
              alpha = 0.25) +
  geom_line(data = cef_df, aes(x = x2, y = yhat, color = model),
            linewidth = 1) +
  scale_color_manual(values = c("Correct" = "steelblue",
                                "Misspecified (linear)" = "firebrick")) +
  scale_fill_manual(values = c("Correct" = "steelblue",
                               "Misspecified (linear)" = "firebrick")) +
  labs(title = "CEF Comparison: Correct vs. Misspecified Model",
       subtitle = expression("Predicted E[y | x"[2]*"] with x"[1]*" and x"[3]*" held at their means"),
       x = expression(x[2]), y = expression(hat(y)),
       color = "Model", fill = "Model") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "bottom")
Estimated CEF under the correct and misspecified models. The linear model imposes a straight line where the truth is curved, and its coefficient on x1 is inflated because the omitted interaction is absorbed into the x1 estimate.


ggsave("output/plots/cef_misspecification.pdf", width = 7, height = 5)

Key takeaways:

  • The correct model traces the true inverted-U shape of the CEF. The linear model imposes a straight line that misses this structure entirely.
  • The misspecified CEF systematically diverges from the correct one — especially at the tails of \(x_2\), where the quadratic curvature is strongest.
  • The confidence bands of the misspecified model are narrower in places, which is misleading: they reflect precision around the wrong function, not precision around the truth.
  • The bias on \(x_1\) is large and substantively meaningful: the misspecified model estimates the average marginal effect of \(x_1\) (around \(2 + 0.8 \bar{x}_2 \approx 4.4\)) rather than the structural parameter (2). This happens because the omitted interaction \(x_1 \times x_2\) gets absorbed into the \(x_1\) coefficient. The misspecified estimate is “right on average” but wrong about the structure — a crucial distinction.

Part 3: Clustered Standard Errors

3.0 Estimation vs. inference

So far in this lab we have focused on estimation — obtaining the coefficient vector \(\hat{\beta}\) — and on the conditions under which OLS recovers the true CEF (exogeneity, correct functional form). Notice that none of this required distributional assumptions: we never assumed the errors are normal, homoskedastic, or independent. OLS produces \(\hat{\beta} = (X'X)^{-1}X'y\) regardless of the error distribution; whether \(\hat{\beta}\) is unbiased depends on exogeneity, not on distributional properties.

Inference is a separate step. Once we have \(\hat{\beta}\), we want to know how precise it is: standard errors, confidence intervals, hypothesis tests. This is where the structure of the errors starts to matter. The standard OLS variance formula:

\[ \hat{V}_{\text{OLS}} = \hat{\sigma}^2 (X'X)^{-1}, \quad \text{where} \quad \hat{\sigma}^2 = \frac{\hat{e}'\hat{e}}{n - k} \]

assumes that the errors are independent and have constant variance (homoskedasticity) — i.e., \(\text{Var}(\varepsilon_i \mid X) = \sigma^2\) for all \(i\), and \(\text{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0\) for \(i \neq j\). Under these conditions, the formula above is correct and the usual \(t\)-tests and confidence intervals are valid.
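Both pieces — the estimator \((X'X)^{-1}X'y\) and the textbook variance above — can be reproduced by hand and checked against `lm()` and `vcov()` (a sketch on toy data; names are illustrative):

```r
set.seed(99)

n <- 200
x <- rnorm(n)
X <- cbind(1, x)                                # design matrix: intercept + regressor
y <- as.numeric(X %*% c(1, 2) + rnorm(n))

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y    # (X'X)^{-1} X'y
e_hat    <- y - X %*% beta_hat
sigma2   <- sum(e_hat^2) / (n - ncol(X))        # e'e / (n - k)
V_hat    <- sigma2 * solve(t(X) %*% X)          # classical OLS variance

m <- lm(y ~ x)
all.equal(as.numeric(beta_hat), unname(coef(m)))                       # TRUE
all.equal(unname(sqrt(diag(V_hat))), unname(sqrt(diag(vcov(m)))))      # TRUE
```

Note that the point estimates require no assumptions about the errors at all; only the variance step uses \(\hat{\sigma}^2\) and hence the homoskedasticity/independence assumptions.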

But in many real datasets these conditions fail. Three common violations:

  • Heteroskedasticity: error variance differs across observations (\(\text{Var}(\varepsilon_i) \neq \text{Var}(\varepsilon_j)\))
  • Clustering: errors are correlated within groups (students in the same school, country-years in the same country)
  • Autocorrelation: errors are correlated over time — a central topic in this course

When any of these hold, the OLS point estimates \(\hat{\beta}\) are still fine, but the standard errors are wrong — typically too small. This leads us to overstate the precision of our estimates and reject null hypotheses too often. The rest of Part 3 addresses clustering. Later parts of the course will tackle autocorrelation in detail.

3.1 Why clustering matters

In many cross-sectional datasets, observations are not independent — they are grouped. Students are nested within schools, workers within firms, survey respondents within countries. If there are unobserved group-level factors that affect the outcome (e.g., school quality, firm culture, national institutions), the errors will be correlated within groups:

\[ \text{Cov}(\varepsilon_i, \varepsilon_j) \neq 0 \quad \text{for } i, j \text{ in the same group} \]

This within-group correlation does not bias the OLS coefficient estimates — \(\hat{\beta}\) is still unbiased and consistent. The problem is entirely about inference: OLS standard errors assume each observation contributes an independent piece of information. When errors are positively correlated within groups, the data contain less independent information than OLS assumes. The result:

  • Standard errors are too small
  • \(t\)-statistics are too large
  • Confidence intervals are too narrow
  • We reject null hypotheses too often

The solution is to cluster standard errors by group. This allows for arbitrary within-group error correlation while maintaining the assumption of independence across groups. As we will see, the reliability of this correction depends critically on the number of clusters.
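The "too small" claim is easy to verify by Monte Carlo (a sketch whose design loosely mimics the classroom example in the next section; all names and parameter values are illustrative): compare the actual sampling variability of \(\hat{\beta}_1\) across replications to the average OLS standard error.

```r
set.seed(99)

G <- 40   # clusters
m <- 25   # observations per cluster

one_draw <- function() {
  a  <- rep(rnorm(G, sd = 1), each = m)   # unobserved cluster effect (in the error)
  xg <- rep(rnorm(G, sd = 1), each = m)   # cluster-level component of the regressor
  x  <- rnorm(G * m) + xg
  y  <- 1 + 2 * x + a + rnorm(G * m)
  s  <- summary(lm(y ~ x))$coefficients["x", ]
  c(est = unname(s["Estimate"]), se = unname(s["Std. Error"]))
}

sims <- t(replicate(500, one_draw()))

# Actual sampling sd of beta_hat vs. the average reported OLS SE
c(true_sampling_sd = sd(sims[, "est"]),
  mean_OLS_se      = mean(sims[, "se"]))  # the OLS SE is noticeably smaller
</imports-placeholder>
```

The gap between the two numbers is the inference problem clustering is meant to fix: the coefficient is centered on the truth, but OLS reports far more precision than the data actually contain.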

3.2 The cluster-robust variance estimator

The standard OLS variance assumes errors are independent. The cluster-robust estimator relaxes this: it allows errors within the same group to be correlated in any way, while still assuming independence across groups. It does so using a “sandwich” structure:

\[ \hat{V}_{CL} = \underbrace{(X'X)^{-1}}_{\text{bread}} \left( \sum_{g=1}^{G} X_g' \hat{e}_g \hat{e}_g' X_g \right) \underbrace{(X'X)^{-1}}_{\text{bread}} \]

Let’s unpack each piece in plain terms:

  • \(g = 1, \dots, G\) indexes the clusters (e.g., classrooms, countries). The data are split into \(G\) groups.

  • \(X_g\) is the portion of the design matrix belonging to cluster \(g\) — just the rows for observations in that group.

  • \(\hat{e}_g\) is the vector of OLS residuals for the observations in cluster \(g\).

  • The bread, \((X'X)^{-1}\): This is the same matrix that appears in the OLS formula. It adjusts for the scale and correlation of the regressors — the same role it plays in estimation.

  • The meat, \(\sum_g X_g' \hat{e}_g \hat{e}_g' X_g\): This is where the correction happens. It helps to read this from the inside out:

    1. \(\hat{e}_g \hat{e}_g'\) is the outer product of the residuals within cluster \(g\) — an \(n_g \times n_g\) matrix whose \((i,j)\) entry is \(\hat{e}_i \hat{e}_j\). It captures all pairwise products of residuals within the group, without assuming any particular correlation pattern.

    2. \(X_g' \hat{e}_g\) is a \(k \times 1\) vector — the cluster’s score: it sums up, for each coefficient, how much the residuals in that cluster co-move with the regressors. Think of it as cluster \(g\)’s “contribution to estimation uncertainty.” If the residuals within a cluster are all large and same-signed (correlated errors), this vector will be large; if they cancel out (independent errors), it will be small.

    3. \(X_g' \hat{e}_g \hat{e}_g' X_g = (X_g' \hat{e}_g)(X_g' \hat{e}_g)'\) is the outer product of this score with itself — a \(k \times k\) matrix capturing how much cluster \(g\) contributes to the variance of \(\hat{\beta}\). Summing over all \(G\) clusters gives the total meat.

    The key insight: if errors are truly independent within clusters, the individual residuals in \(\hat{e}_g\) would tend to cancel when summed in \(X_g' \hat{e}_g\), and the meat would shrink toward the standard OLS variance. When errors are positively correlated within clusters, they reinforce each other, making \(X_g' \hat{e}_g\) large and inflating the variance estimate — which is exactly the correction we need.

The standard OLS variance is a special case: if every cluster contained exactly one observation, \(\hat{e}_g \hat{e}_g'\) would be a \(1 \times 1\) scalar (\(\hat{e}_i^2\)), the meat would reduce to \(X' \text{diag}(\hat{e}^2) X\), and we would get the heteroskedasticity-robust (HC) estimator. Clustering generalizes this by allowing the off-diagonal terms (cross-products of residuals within a group) to be nonzero.
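This special case can be checked directly with the sandwich package (a sketch under the assumption that `vcovCL()` and `vcovHC()` behave as used elsewhere in this lab): clustering on singleton groups should reproduce the HC1 heteroskedasticity-robust estimator.

```r
library(sandwich)

set.seed(99)
n <- 300
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n) * (1 + 0.5 * abs(x))   # heteroskedastic errors
m <- lm(y ~ x)

se_hc <- sqrt(diag(vcovHC(m, type = "HC1")))
se_cl <- sqrt(diag(vcovCL(m, cluster = 1:n, type = "HC1")))  # one obs per "cluster"
all.equal(se_hc, se_cl)   # should be TRUE: the finite-sample factors coincide
```

With \(G = n\), the cluster correction \(\frac{n-1}{n-k}\cdot\frac{G}{G-1}\) collapses to the HC1 factor \(\frac{n}{n-k}\), and the meat reduces to \(X'\text{diag}(\hat{e}^2)X\) — exactly the HC estimator.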

3.3 Example 1: Many clusters (classrooms)

When the number of clusters is large, the sandwich estimator performs well. We simulate a cross-sectional dataset of students nested within 50 classrooms.

Data generating process

Each student’s test score depends on a student-level covariate (e.g., hours studied) and an unobserved classroom effect (teacher quality):

\[ y_{ig} = \beta_0 + \beta_1 x_{ig} + \alpha_g + \varepsilon_{ig} \]

where \(\alpha_g \sim N(0, \sigma_\alpha^2)\) is the classroom effect and \(\varepsilon_{ig} \sim N(0, \sigma_\varepsilon^2)\) is the student-level error. The classroom effect \(\alpha_g\) is unobserved — it lives in the composite error \(u_{ig} = \alpha_g + \varepsilon_{ig}\), creating within-classroom correlation: any two students in the same classroom share the same \(\alpha_g\), so \(\text{Cov}(u_{ig}, u_{jg}) = \sigma_\alpha^2 \neq 0\).

Is this omitted variable bias? Technically, \(\alpha_g\) is an omitted variable that affects \(y\). But in our DGP, \(\alpha_g\) is drawn independently of the regressor hours — better teachers do not systematically get students who study more or less. So \(E[\alpha_g \mid x_{ig}] = 0\), and the OVB formula gives zero bias: \(\hat{\beta}_1\) is unbiased. The only problem is that \(\alpha_g\) induces correlated errors within classrooms, which makes the OLS standard errors too small. This example deliberately isolates the inference problem from the estimation problem.

If \(\alpha_g\) were correlated with hours — e.g., better teachers also assign more homework — we would have both OVB and clustering, and the coefficient itself would be biased. In practice the two problems often co-occur, but it is important to understand them separately.

set.seed(512)

# 50 classrooms, 20 students each
G_many <- 50
n_per  <- 20
N_many <- G_many * n_per

# True parameters
beta_0_cl <- 70   # average test score
beta_1_cl <- 3    # effect of hours studied

# Classroom IDs
classroom <- rep(paste0("Class_", sprintf("%02d", 1:G_many)), each = n_per)

# Student-level covariate: hours studied
# Includes a classroom-level component (some classrooms assign more homework),
# so hours is partially shared within classrooms — realistic and important
# for making clustering affect the slope SE, not just the intercept
hours_group <- rep(rnorm(G_many, mean = 0, sd = 1.5), each = n_per)
hours <- rnorm(N_many, mean = 5, sd = 2) + hours_group

# Classroom-level effect: unobserved teacher quality
# Each classroom gets one draw; all students in that classroom share it
# (drawn independently of hours_group — so OLS is unbiased, but SEs are wrong)
alpha <- rep(rnorm(G_many, mean = 0, sd = 4), each = n_per)

# Student-level noise
epsilon <- rnorm(N_many, mean = 0, sd = 3)

# Outcome: test score
score <- beta_0_cl + beta_1_cl * hours + alpha + epsilon

# Assemble the dataset
students <- data.frame(score, hours, classroom)

Before modeling, let’s visualize the clustering structure. If there is within-classroom correlation, we should see that classroom means vary substantially:

# EDA: do scores cluster by classroom?
students |>
  group_by(classroom) |>
  summarize(mean_score = mean(score), mean_hours = mean(hours), n = n()) |>
  ggplot(aes(x = mean_hours, y = mean_score)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "grey50") +
  labs(title = "Classroom-Level Averages",
       subtitle = "Each point = one classroom; spread reflects unobserved classroom effects",
       x = "Mean hours studied", y = "Mean test score") +
  theme_minimal(base_size = 13)

The classroom means are dispersed — some classrooms score much higher or lower than others, even at similar average study hours. This is the unobserved teacher quality (\(\alpha_g\)) at work, and it is exactly the within-group correlation that OLS standard errors will miss.

Step 1: Fit OLS ignoring clustering

mod_many <- lm(score ~ hours, data = students)
summary(mod_many)
## 
## Call:
## lm(formula = score ~ hours, data = students)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2117  -3.0904  -0.1897   2.7914  15.3904 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 70.34653    0.30236  232.66   <2e-16 ***
## hours        2.97560    0.05598   53.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.461 on 998 degrees of freedom
## Multiple R-squared:  0.739,  Adjusted R-squared:  0.7387 
## F-statistic:  2825 on 1 and 998 DF,  p-value: < 2.2e-16

The coefficient on hours should be close to \(3\) — OLS is unbiased. But the standard errors treat all 1000 students as independent, ignoring the within-classroom correlation introduced by \(\alpha_g\).

Step 2: Build the sandwich estimator by hand

# Extract the design matrix and residuals
X_mat <- model.matrix(mod_many)      # N x 2 (intercept + hours)
e_hat <- residuals(mod_many)         # N x 1
k     <- ncol(X_mat)                 # number of coefficients

# The "bread": (X'X)^{-1}
bread <- solve(t(X_mat) %*% X_mat)

# The "meat": sum over clusters of X_g' e_g e_g' X_g
meat <- matrix(0, k, k)

for (g in unique(students$classroom)) {
  # Identify rows belonging to this cluster
  idx <- which(students$classroom == g)

  # Subset the design matrix and residual vector for this cluster
  Xg <- X_mat[idx, , drop = FALSE]   # n_g x k
  eg <- e_hat[idx]                    # n_g x 1

  # The outer product eg %*% t(eg) is n_g x n_g — it captures
  # all pairwise residual products within the cluster
  meat <- meat + t(Xg) %*% (eg %o% eg) %*% Xg
}

# Assemble the sandwich: bread x meat x bread
V_cl_hand <- bread %*% meat %*% bread

# Small-sample correction (HC1-style for clusters)
# Adjusts for finite G and degrees of freedom
V_cl_hand <- V_cl_hand * ((N_many - 1) / (N_many - k)) * (G_many / (G_many - 1))

# Clustered SEs
se_cl_hand <- sqrt(diag(V_cl_hand))

Step 3: Verify with the sandwich package

In practice, use vcovCL() from the sandwich package — it does exactly what we just built, in one line:

V_cl_pkg <- vcovCL(mod_many, cluster = students$classroom, type = "HC1")

# Verify: our by-hand calculation matches
data.frame(
  By_Hand  = se_cl_hand,
  Package  = sqrt(diag(V_cl_pkg))
)
##               By_Hand   Package
## (Intercept) 0.5666881 0.5666881
## hours       0.1257683 0.1257683

Step 4: Compare OLS vs. clustered standard errors

V_ols <- vcov(mod_many)

data.frame(
  OLS_SE     = sqrt(diag(V_ols)),
  Cluster_SE = sqrt(diag(V_cl_pkg)),
  Ratio      = sqrt(diag(V_cl_pkg)) / sqrt(diag(V_ols))
)
##                 OLS_SE Cluster_SE   Ratio
## (Intercept) 0.30236270  0.5666881 1.87420
## hours       0.05598211  0.1257683 2.24658

The clustered SEs are substantially larger — reflecting the within-classroom correlation that OLS ignores. The SE on hours inflates the most (ratio ≈ 2.2 vs. 1.9 for the intercept): because hours itself contains a shared within-classroom component (hours_group), the correlated regressor and the correlated errors compound, making the correction to the slope even larger than the correction to the intercept.

Step 5: Display corrected inference

# OLS inference (standard)
coeftest(mod_many)
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 70.346534   0.302363 232.656 < 2.2e-16 ***
## hours        2.975602   0.055982  53.153 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Corrected inference (clustered by classroom)
coeftest(mod_many, vcov. = V_cl_pkg)
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 70.34653    0.56669 124.136 < 2.2e-16 ***
## hours        2.97560    0.12577  23.659 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficient estimates are identical — clustering only changes the standard errors, \(t\)-statistics, and \(p\)-values. With 50 clusters, the sandwich estimator is reliable and the correction is well-behaved.

ct_many <- coeftest(mod_many, vcov. = V_cl_pkg)

show_reg(list(mod_many, mod_many),
         custom.model.names = c("OLS SEs", "Clustered SEs"),
         override.se = list(sqrt(diag(V_ols)),
                            sqrt(diag(V_cl_pkg))),
         override.pvalues = list(coef(summary(mod_many))[, 4],
                                 ct_many[, 4]),
         caption = "OLS vs. clustered standard errors (50 classrooms)",
         caption.above = TRUE)
OLS vs. clustered standard errors (50 classrooms)

              OLS SEs     Clustered SEs
(Intercept)   70.35***    70.35***
              (0.30)      (0.57)
hours          2.98***     2.98***
              (0.06)      (0.13)
R2             0.74        0.74
Adj. R2        0.74        0.74
Num. obs.   1000        1000
***p < 0.001; **p < 0.01; *p < 0.05

3.4 Example 2: Few clusters (countries)

The sandwich estimator relies on the law of large numbers applied to the cluster-level contributions to the meat. When the number of clusters \(G\) is small, this approximation is poor — the estimator can be biased downward, producing clustered SEs that are still too small. This matters in practice because many political science and public health datasets have a small number of country- or state-level clusters (often 10–30), putting them squarely in the zone where the standard sandwich may be unreliable.

We now simulate a similar setup but with only 8 countries:

set.seed(512)

# 8 countries, 150 individuals each
G_few  <- 8
n_per_c <- 150
N_few  <- G_few * n_per_c

# Country IDs
country <- rep(paste0("Country_", LETTERS[1:G_few]), each = n_per_c)

# Individual covariate — includes a country-level component
x_group_c <- rep(rnorm(G_few, mean = 0, sd = 1), each = n_per_c)
x_few <- rnorm(N_few, mean = 0, sd = 1) + x_group_c

# Country-level effect (large, creating strong within-country correlation)
# Drawn independently of x_group_c
alpha_c <- rep(rnorm(G_few, mean = 0, sd = 3), each = n_per_c)

# Individual error
eps_c <- rnorm(N_few, mean = 0, sd = 2)

# Outcome
y_few <- 5 + 2 * x_few + alpha_c + eps_c

country_dat <- data.frame(y = y_few, x = x_few, country)

OLS and clustered SEs

mod_few <- lm(y ~ x, data = country_dat)

V_ols_few <- vcov(mod_few)
V_cl_few  <- vcovCL(mod_few, cluster = country_dat$country, type = "HC1")

data.frame(
  OLS_SE     = sqrt(diag(V_ols_few)),
  Cluster_SE = sqrt(diag(V_cl_few)),
  Ratio      = sqrt(diag(V_cl_few)) / sqrt(diag(V_ols_few))
)
##                 OLS_SE Cluster_SE     Ratio
## (Intercept) 0.09921432  1.0560607 10.644236
## x           0.07611123  0.5878487  7.723548
coeftest(mod_few)
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 4.419880   0.099214  44.549 < 2.2e-16 ***
## x           1.450636   0.076111  19.059 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(mod_few, vcov. = V_cl_few)
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  4.41988    1.05606  4.1853 3.057e-05 ***
## x            1.45064    0.58785  2.4677   0.01374 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With only 8 clusters, the sandwich estimator is formed from just 8 terms in the meat — a small sample from which to estimate a covariance matrix. The clustered SEs may still be too small, leading to over-rejection. Researchers typically recommend a minimum of 30–50 clusters for the standard sandwich to be reliable.

What to do with few clusters: Several alternatives exist, including bias-reduced linearization (BRL), the wild cluster bootstrap (fwildclusterboot package in R), and degrees-of-freedom adjustments. These are beyond the scope of this lab, but be aware that simply plugging in vcovCL() with \(G < 20\) may not fully solve the problem.
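As a sketch of what those alternatives look like in code — assuming the clubSandwich (for bias-reduced linearization) and fwildclusterboot packages are installed; check each package's documentation for the current interface:

```r
# Few-cluster remedies, applied to the 8-country model above (sketch)
library(clubSandwich)       # bias-reduced linearization (CR2)
library(fwildclusterboot)   # wild cluster bootstrap

# CR2 with Satterthwaite degrees of freedom -- widens CIs, lowers the df
coef_test(mod_few, vcov = "CR2", cluster = country_dat$country)

# Wild cluster bootstrap inference for the coefficient on x
boottest(mod_few, clustid = "country", param = "x", B = 9999)
```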

3.5 How the number of clusters matters

To build intuition, let’s simulate the same DGP many times but vary the number of clusters, and track how well the clustered SE approximates the true standard error:

set.seed(2026)

n_sims <- 500
G_values <- c(5, 8, 15, 25, 50, 100)
n_per_g <- 30   # observations per cluster (held fixed)
true_beta <- 2

results <- list()

for (G in G_values) {
  N <- G * n_per_g
  beta_hats <- numeric(n_sims)
  se_ols    <- numeric(n_sims)
  se_cl     <- numeric(n_sims)

  for (s in 1:n_sims) {
    # Simulate data with group structure
    group   <- rep(1:G, each = n_per_g)
    x_grp   <- rep(rnorm(G, 0, 1), each = n_per_g)  # group-level x component
    x       <- rnorm(N) + x_grp                      # x is partially shared within groups
    alpha   <- rep(rnorm(G, 0, 3), each = n_per_g)   # group effect (indep. of x_grp)
    y       <- 1 + true_beta * x + alpha + rnorm(N)

    mod <- lm(y ~ x)
    beta_hats[s] <- coef(mod)["x"]
    se_ols[s] <- sqrt(vcov(mod)["x", "x"])
    se_cl[s]  <- sqrt(vcovCL(mod, cluster = group, type = "HC1")["x", "x"])
  }

  # True SE = SD of beta_hat across simulations
  true_se <- sd(beta_hats)

  results[[as.character(G)]] <- data.frame(
    G = G,
    mean_se_ols = mean(se_ols),
    mean_se_cl  = mean(se_cl),
    true_se     = true_se,
    ratio_ols   = mean(se_ols) / true_se,
    ratio_cl    = mean(se_cl) / true_se
  )
}

sim_df <- bind_rows(results)

# Plot: SE / True SE as a function of G
sim_long <- sim_df |>
  select(G, ratio_ols, ratio_cl) |>
  pivot_longer(cols = c(ratio_ols, ratio_cl),
               names_to = "type", values_to = "ratio") |>
  mutate(type = ifelse(type == "ratio_ols", "OLS SE", "Clustered SE"))

ggplot(sim_long, aes(x = G, y = ratio, color = type, shape = type)) +
  geom_point(size = 3) +
  geom_line(linewidth = 0.8) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "grey40") +
  scale_x_continuous(breaks = G_values) +
  labs(title = "Standard Error Accuracy by Number of Clusters",
       subtitle = paste0("Based on ", n_sims, " simulations per G value (",
                         n_per_g, " obs per cluster)"),
       x = "Number of clusters (G)",
       y = "Mean estimated SE / True SE",
       color = NULL, shape = NULL) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")
As the number of clusters increases, the ratio of the clustered SE to the true SE converges to 1. With fewer than ~20 clusters, the sandwich estimator tends to underestimate the true standard error.


ggsave("output/plots/cluster_tradeoff.pdf", width = 7, height = 4.5)

Key takeaways:

  • The OLS SE (ignoring clustering) consistently underestimates the true standard error, regardless of \(G\). This is the core problem that clustering addresses.
  • The clustered SE converges toward the true SE as \(G\) increases, reaching a good approximation around \(G = 30\)–\(50\).
  • With very few clusters (\(G < 10\)), even the clustered SE tends to underestimate uncertainty — the sandwich estimator itself is unreliable.
  • The practical rule of thumb: if you have fewer than ~20 clusters, report clustered SEs but flag the limitation. Consider the wild cluster bootstrap as an alternative.

Part 4: Coding Practice — Franzese Public Debt Data

In this section, you will apply the concepts from today’s lab to a real political economy dataset. The data come from Franzese’s study of electoral and partisan cycles in public debt (see Franzese, 2002). The observations are country-years — a grouped (clustered) data structure where each country contributes multiple observations. This is exactly the kind of data where omitted variable bias, functional misspecification, and within-group error correlation are all relevant concerns.

Variable   Description
debtx      Public debt as a fraction of GDP
growth     Real GDP growth rate
ue         Unemployment rate
cog        Partisan center of gravity (left-right position of government)
enop       Effective number of parties in government (party-system fractionalization)

Exercise 1: Load, explore, and visualize the data

Load the dataset and get to know it before modeling. Good EDA is your first defense against misspecification, missing data, and surprises in the results.

# Load the data
debt <- read_csv("data/debt_699")

Tasks:

  1. Use names() or glimpse() to inspect what variables are available. Then use select() to keep only the five variables listed above (plus the country identifier — you’ll need it later for clustering).
  2. Use summary() to inspect distributions. Are there missing values? On which variables?
  3. Use drop_na() to create a complete-case dataset. How many rows did you lose?
  4. Use summarize() to compute the mean and standard deviation of debtx and growth.
  5. How many countries are in the data, and how many years per country? Use group_by() and summarize(n = n()) to check. Is the panel balanced?
  6. Create a scatter plot of debtx vs. growth. Add a linear fit (geom_smooth(method = "lm")) and a loess smoother (geom_smooth(method = "loess")). Does the relationship look linear?
  7. Compute the correlation matrix of the five analysis variables using cor() (on the complete-case dataset). Which pairs are most strongly correlated? How might this affect OVB?
# Your code here:
# Tasks 1-4:
# debt |>
#   select(...) |>
#   drop_na() |>
#   summary()
#
# Task 5:
# your_data |> group_by(country_var) |> summarize(n = n())
#
# Task 6:
# ggplot(your_data, aes(x = growth, y = debtx)) +
#   geom_point(alpha = 0.3) +
#   geom_smooth(method = "lm", color = "red", se = FALSE) +
#   geom_smooth(method = "loess", color = "blue", se = FALSE)
#
# Task 7:
# cor(your_data |> select(debtx, growth, ue, cog, enop))

Task 6 is especially important: if the loess and linear fits diverge, that is a visual signal that a quadratic or other nonlinear term might be warranted — exactly the misspecification problem from Part 2.3.

Exercise 2: OVB in practice

The goal is to examine whether omitting controls biases the estimated effect of economic growth on public debt.

Tasks:

  1. Estimate Model A: debtx ~ growth + ue + cog + enop
  2. Estimate Model B: debtx ~ growth
  3. Compare the coefficient on growth across the two models using screenreg(). In which direction is the bias? Why might that be, substantively?
  4. (Bonus) Use the OVB formula to analytically calculate the bias. You will need auxiliary regressions of the omitted variables on growth.
# Your code here:
# Hint: use screenreg(list(model_a, model_b)) for side-by-side comparison
# For the bonus, recall: bias = sum of (beta_omitted * delta_omitted|included)
# where delta comes from regressing each omitted variable on growth

Exercise 3: Clustered standard errors

The Franzese data are country-years — observations are grouped by country. Unobserved country-level factors (institutions, political culture) affect all years within a country, creating within-group error correlation. Cluster your standard errors to account for this.

Tasks:

  1. Identify the country variable in the original dataset. Hint: use names(debt) to find it.
  2. Compute cluster-robust standard errors for Model A using vcovCL() from the sandwich package.
  3. Display the results with coeftest(). How do the clustered standard errors compare to the OLS standard errors? Does any coefficient lose significance?
  4. How many clusters (countries) are in this dataset? Is this a “many clusters” or “few clusters” situation? What does that imply for the reliability of the sandwich estimator?
# Your code here:
# Hint: vcovCL(model_a, cluster = your_data$country_var, type = "HC1")
# Then: coeftest(model_a, vcov. = your_clustered_vcov)
# For task 4: length(unique(your_data$country_var))

Exercise 4: Misspecification check

Check whether the linear specification of Model A is adequate, or whether a nonlinear term is needed.

Tasks:

  1. Add a quadratic term for growth (using I(growth^2)) to Model A. Does the quadratic term appear significant?
  2. Use predict() to compute predicted values of debtx as a function of growth (holding other covariates at their means) for both the linear and quadratic specifications.
  3. Plot the two predicted CEFs against growth. Does the linear specification seem adequate?
# Your code here:
# Hint for 1: lm(debtx ~ growth + I(growth^2) + ue + cog + enop, data = ...)
# Hint for 2: create a newdata data.frame with growth varying and other vars at mean
#   newdata <- data.frame(growth = seq(...), ue = mean(...), cog = mean(...), enop = mean(...))
#   predict(model, newdata = newdata)

Exercise 5: Export your work

Practice saving outputs to the output/ subfolders.

Tasks:

  1. Use screenreg() to display Models A and B side by side in the console. Then export the same table as a .tex file using texreg() with the file argument: output/tables/debt_models.tex.
  2. Save the predicted CEF plot from Exercise 4 with ggsave(): output/plots/cef_debt.pdf.
  3. Save the complete-case dataset from Exercise 1 as output/data/debt_clean.rds using saveRDS().
  4. Save Model A as output/models/model_a.rds using saveRDS().
# Your code here:
# Hint for 1:
#   screenreg(list(model_a, model_b), custom.model.names = c("Full", "Bivariate"))
#   texreg(list(model_a, model_b), file = "output/tables/debt_models.tex")
#
# Hint for 2:
#   ggsave("output/plots/cef_debt.pdf", width = 6, height = 6/1.618)
#
# Hint for 3:
#   saveRDS(your_clean_data, file = "output/data/debt_clean.rds")
#
# Hint for 4:
#   saveRDS(model_a, file = "output/models/model_a.rds")

Appendix A: Interaction Effects (Self-Study)

This appendix covers interaction effects and how to compute and visualize conditional marginal effects step by step. The material is self-contained — work through it on your own after the lab session.

A.1 Why interactions?

In many social science applications, the effect of one variable on the outcome depends on the level of another variable. For example:

  • The effect of electoral rules on party-system fragmentation may depend on ethnic diversity (Clark & Golder, 2006).
  • The effect of economic shocks on protest may depend on the level of state repression.
  • The effect of foreign aid on growth may depend on the quality of governance.

The simplest interaction model captures this conditional relationship:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2) + \varepsilon \]

In this model, the marginal effect of \(x_1\) on \(y\) is no longer a single number. It is a function of \(x_2\):

\[ \frac{\partial E[y \mid x_1, x_2]}{\partial x_1} = \beta_1 + \beta_3 x_2 \]

This means \(\beta_1\) alone is not “the effect of \(x_1\).” It is the effect of \(x_1\) only when \(x_2 = 0\), which may not even be a meaningful value in the data. To properly interpret an interaction model, we must compute and visualize the marginal effect across the observed range of the moderating variable.

However, the simple linear interaction produces a marginal effect that is always a straight line — it increases or decreases monotonically with \(x_2\). In many real applications, the conditional effect is nonlinear: the effect of \(x_1\) may be strongest at moderate values of \(x_2\) and weaker (or even reversed) at the extremes. To capture this, we can include a quadratic interaction:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2 + \beta_4 (x_1 \times x_2) + \beta_5 (x_1 \times x_2^2) + \varepsilon \]

Now the marginal effect of \(x_1\) is:

\[ \frac{\partial E[y \mid x_1, x_2]}{\partial x_1} = \beta_1 + \beta_4 x_2 + \beta_5 x_2^2 \]

This is a parabola in \(x_2\) — it can peak at intermediate values and cross zero at both ends, producing a much richer and more substantively interesting picture. Think of a context like: the effect of economic openness on growth is positive at moderate levels of institutional quality, but diminishes in both very weak states (which cannot absorb the benefits) and very strong states (which have already captured them).

A.2 Simulate interaction data

We simulate a DGP where the marginal effect of \(x_1\) follows an inverted-U shape in \(x_2\):

\[ y = 1 + 2 x_1 + 0.5 x_2 - 0.3 x_2^2 + 1.5 (x_1 \times x_2) - 0.8 (x_1 \times x_2^2) + \varepsilon \]

The marginal effect of \(x_1\) is therefore: \(\frac{\partial y}{\partial x_1} = 2 + 1.5 x_2 - 0.8 x_2^2\)

This is a downward-opening parabola. It is positive for a range of \(x_2\) values and turns negative at the extremes, crossing zero at approximately \(x_2 \approx -0.9\) and \(x_2 \approx 2.8\).
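These crossings can be checked numerically with polyroot(), which takes the polynomial's coefficients in increasing order of degree:

```r
# Roots of the true marginal effect 2 + 1.5*x2 - 0.8*x2^2
Re(polyroot(c(2, 1.5, -0.8)))
## approximately -0.90 and 2.78
```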

set.seed(512)

b0 <- 1; b1 <- 2; b2 <- 0.5; b3 <- -0.3; b4 <- 1.5; b5 <- -0.8
nobs <- 300

vcv <- rbind(c(1.0, 0.3), c(0.3, 1.0))
X_int <- mvrnorm(n = nobs, mu = c(0, 0), Sigma = vcv)
x1 <- X_int[, 1]; x2 <- X_int[, 2]

e <- rnorm(nobs, mean = 0, sd = 2)
y <- b0 + b1*x1 + b2*x2 + b3*x2^2 + b4*x1*x2 + b5*x1*x2^2 + e

intdat <- data.frame(y, x1, x2)
rm(X_int, x1, x2, e, y, vcv, b0, b1, b2, b3, b4, b5, nobs)

A.3 Estimate the interaction model

We use I() to include the squared term and : for interactions:

fit_int <- lm(y ~ x1 + x2 + I(x2^2) + x1:x2 + x1:I(x2^2), data = intdat)
summary(fit_int)
## 
## Call:
## lm(formula = y ~ x1 + x2 + I(x2^2) + x1:x2 + x1:I(x2^2), data = intdat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7123 -1.3581  0.0417  1.4456  4.9332 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.06932    0.15028   7.116 8.54e-12 ***
## x1           2.05733    0.15646  13.150  < 2e-16 ***
## x2           0.61518    0.14460   4.254 2.82e-05 ***
## I(x2^2)     -0.31462    0.11578  -2.717  0.00697 ** 
## x1:x2        1.43932    0.12630  11.396  < 2e-16 ***
## x1:I(x2^2)  -0.69758    0.09934  -7.022 1.52e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.048 on 294 degrees of freedom
## Multiple R-squared:  0.5232, Adjusted R-squared:  0.5151 
## F-statistic: 64.53 on 5 and 294 DF,  p-value: < 2.2e-16

Important: Use R’s formula syntax rather than creating interaction terms manually: the : operator builds the interaction, and I() tells R to evaluate its argument arithmetically (so I(x2^2) squares \(x_2\)) instead of interpreting ^ as a formula operator.

A.4 Computing marginal effects step by step

Step 1: Point estimates

\[ \frac{\partial \hat{y}}{\partial x_1} = \hat{\beta}_{x_1} + \hat{\beta}_{x_1:x_2} \cdot x_2 + \hat{\beta}_{x_1:x_2^2} \cdot x_2^2 \]

B <- coef(fit_int)
names(B)
## [1] "(Intercept)" "x1"          "x2"          "I(x2^2)"     "x1:x2"      
## [6] "x1:I(x2^2)"
x2_vals <- seq(min(intdat$x2), max(intdat$x2), length.out = 200)
me_x1 <- B["x1"] + B["x1:x2"] * x2_vals + B["x1:I(x2^2)"] * x2_vals^2

Quick plot to check the shape:

plot(x2_vals, me_x1, type = "l", lwd = 2,
     xlab = "x2", ylab = "Marginal effect of x1",
     main = "ME of x1 conditional on x2 (point estimates)")
abline(h = 0, lty = 2, col = "red")

Step 2: Standard errors

The variance of the conditional marginal effect involves all pairwise variances and covariances among the three relevant coefficients:

\[ \text{Var}\left(\frac{\partial \hat{y}}{\partial x_1}\right) = \text{Var}(\hat{\beta}_1) + x_2^2 \text{Var}(\hat{\beta}_4) + x_2^4 \text{Var}(\hat{\beta}_5) + 2 x_2 \text{Cov}(\hat{\beta}_1, \hat{\beta}_4) + 2 x_2^2 \text{Cov}(\hat{\beta}_1, \hat{\beta}_5) + 2 x_2^3 \text{Cov}(\hat{\beta}_4, \hat{\beta}_5) \]

V  <- vcov(fit_int)
c1 <- "x1"; c4 <- "x1:x2"; c5 <- "x1:I(x2^2)"

var_me <- V[c1,c1] +
          x2_vals^2 * V[c4,c4] +
          x2_vals^4 * V[c5,c5] +
          2 * x2_vals   * V[c1,c4] +
          2 * x2_vals^2 * V[c1,c5] +
          2 * x2_vals^3 * V[c4,c5]

se_x1 <- sqrt(var_me)
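The six-term sum is the expanded form of a single quadratic form: stacking the multipliers \((1, x_2, x_2^2)\) into a matrix \(L\), the variance of the marginal effect is the diagonal of \(L V L'\). An equivalent, more compact computation using the objects already defined in this step:

```r
# Same variance via the quadratic form diag(L %*% V_sub %*% t(L))
L     <- cbind(1, x2_vals, x2_vals^2)
V_sub <- V[c(c1, c4, c5), c(c1, c4, c5)]

var_me_mat <- rowSums((L %*% V_sub) * L)  # row-wise diagonal, avoids the full 200x200 matrix
all.equal(sqrt(var_me_mat), se_x1)        # should be TRUE
```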

Step 3: Confidence intervals

lb <- me_x1 - qnorm(0.975) * se_x1
ub <- me_x1 + qnorm(0.975) * se_x1

Step 4: Visualize

me_df <- data.frame(x2 = x2_vals, me = me_x1, se = se_x1, lb, ub)

ggplot(me_df, aes(x = x2)) +
  geom_ribbon(aes(ymin = lb, ymax = ub), fill = "grey80", alpha = 0.5) +
  geom_line(aes(y = me), linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(title    = "Marginal Effect of x1 on y",
       subtitle = "Conditional on x2, with 95% confidence interval",
       x = "x2", y = "Estimated marginal effect of x1") +
  theme_minimal(base_size = 14)
Marginal effect of x1 on y, conditional on x2, with 95% CI. The inverted-U shape shows the effect is strongest at moderate x2 and reverses at the extremes.


ggsave("output/plots/marginal_effect_x1.pdf", width = 6, height = 6 / 1.618)

How to read this plot:

  • The solid curve traces the estimated marginal effect of \(x_1\) at each value of \(x_2\).
  • The shaded ribbon is the 95% confidence interval.
  • Where the ribbon excludes zero, the effect is statistically significant at the 5% level.
  • The peak indicates where \(x_1\) has its strongest positive effect on \(y\).
  • The zero-crossings indicate values of \(x_2\) at which the effect of \(x_1\) switches sign.
  • The width of the ribbon is narrowest near the center of the data and wider at the extremes.

A.5 Using corrected standard errors

To use robust or clustered SEs, simply swap the variance-covariance matrix in Step 2:

V_robust <- vcovHC(fit_int, type = "HC1")

var_me_robust <- V_robust[c1,c1] +
                 x2_vals^2 * V_robust[c4,c4] +
                 x2_vals^4 * V_robust[c5,c5] +
                 2 * x2_vals   * V_robust[c1,c4] +
                 2 * x2_vals^2 * V_robust[c1,c5] +
                 2 * x2_vals^3 * V_robust[c4,c5]

se_x1_robust <- sqrt(var_me_robust)
me_df$lb_robust <- me_x1 - qnorm(0.975) * se_x1_robust
me_df$ub_robust <- me_x1 + qnorm(0.975) * se_x1_robust

ggplot(me_df, aes(x = x2)) +
  geom_ribbon(aes(ymin = lb_robust, ymax = ub_robust), fill = "steelblue", alpha = 0.3) +
  geom_ribbon(aes(ymin = lb, ymax = ub), fill = "grey70", alpha = 0.3) +
  geom_line(aes(y = me), linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title    = "Marginal Effect of x1 on y",
       subtitle = "Grey = OLS SEs  |  Blue = HC-Robust SEs",
       x = "x2", y = "Estimated marginal effect of x1") +
  theme_minimal(base_size = 14)
Comparing OLS-based (grey) and HC-robust (blue) confidence intervals.


The same approach works with vcovCL() for clustered SEs — just plug in the cluster-robust vcov matrix. This flexibility is one reason why computing marginal effects by hand is valuable: you can use any variance-covariance estimator.


Appendix B: Simultaneity Bias (Self-Study)

This appendix covers simultaneity — the fourth classical source of endogeneity. Work through it on your own after the lab session.

B.1 The problem

Simultaneity arises when two variables are jointly determined — each is both a cause and an effect of the other. In such cases, regressing \(y\) on \(x\) with OLS is invalid because \(x\) is endogenous: it depends on \(y\), which means it is correlated with \(\varepsilon\).

Classic examples in social science:

  • Supply and demand: price affects quantity demanded, and quantity demanded affects price.
  • Democracy and growth: democratic institutions may promote growth, but economic development may also foster democratization.
  • Conflict and state capacity: weak states may be more prone to conflict, and conflict further weakens state capacity.

In each case, regressing one variable on the other with OLS conflates the two causal directions, and the coefficient estimate is a mix of both — not a clean estimate of either.

B.2 Formal setup

Consider the following structural system:

\[ \begin{cases} y = 0.5x + z + \varepsilon_1 \\ x = 0.5y + w + \varepsilon_2 \end{cases} \]

where \(z\) and \(w\) are exogenous variables, and \(\varepsilon_1, \varepsilon_2\) are independent errors. If we naively regress \(y\) on \(x\) and \(z\), the estimate of \(0.5\) will be biased because \(x\) is a function of \(y\) (and therefore of \(\varepsilon_1\)).

To obtain unbiased estimates, we need the reduced form: we solve the system algebraically to express each endogenous variable solely as a function of exogenous variables and errors:

\[ y = \frac{4}{3}z + \frac{2}{3}w + \frac{4}{3}\varepsilon_1 + \frac{2}{3}\varepsilon_2 \]

\[ x = \frac{2}{3}z + \frac{4}{3}w + \frac{2}{3}\varepsilon_1 + \frac{4}{3}\varepsilon_2 \]
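To see where these come from, substitute the second structural equation into the first and solve for \(y\):

\[ y = 0.5\,(0.5\,y + w + \varepsilon_2) + z + \varepsilon_1 \quad\Longrightarrow\quad 0.75\,y = z + 0.5\,w + \varepsilon_1 + 0.5\,\varepsilon_2 \]

Dividing through by \(0.75\) gives the reduced form for \(y\); substituting that back into \(x = 0.5y + w + \varepsilon_2\) yields the reduced form for \(x\).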

B.3 Simulation

set.seed(123)
n <- 1000
z  <- rnorm(n, 0, 3)
w  <- rnorm(n, 0, 3)
e1 <- rnorm(n, 0, 2)
e2 <- rnorm(n, 0, 2)

# Generate data from the reduced form
y_sim <- (4/3)*z + (2/3)*w + (4/3)*e1 + (2/3)*e2
x_sim <- (2/3)*z + (4/3)*w + (2/3)*e1 + (4/3)*e2

Compare the naive structural regression against the reduced form:

# Naive OLS on the structural equation (biased)
naive_model <- lm(y_sim ~ x_sim + z)

# Reduced form regression on exogenous variables only (unbiased)
rf_model <- lm(y_sim ~ w + z)

# Compare side by side
screenreg(list(naive_model, rf_model),
          custom.model.names = c("Naive (Structural)", "Reduced Form"))
## 
## =============================================
##              Naive (Structural)  Reduced Form
## ---------------------------------------------
## (Intercept)    -0.05               -0.07     
##                (0.06)              (0.09)    
## x_sim           0.62 ***                     
##                (0.01)                        
## z               0.89 ***            1.31 *** 
##                (0.02)              (0.03)    
## w                                   0.69 *** 
##                                    (0.03)    
## ---------------------------------------------
## R^2             0.88                0.70     
## Adj. R^2        0.88                0.70     
## Num. obs.    1000                1000        
## =============================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

The true effect of \(x\) on \(y\) is \(0.5\), but the naive estimate is biased upward. The reduced form recovers the true parameters:

data.frame(
  Variable = c("w", "z"),
  Truth    = round(c(2/3, 4/3), 3),
  Estimate = round(coef(rf_model)[2:3], 3)
)
##   Variable Truth Estimate
## w        w 0.667    0.688
## z        z 1.333    1.313

Key takeaway: When simultaneity is present, OLS on the structural equation is biased and inconsistent. The reduced form — regressing \(y\) on exogenous variables only — recovers unbiased parameters. In practice, this motivates the use of instrumental variables (IV), where we use exogenous variation in \(w\) to isolate the causal effect of \(x\) on \(y\).
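As a sketch of that IV logic on the simulated data — assuming the AER package, whose ivreg() takes a two-part formula with the instruments after the |. Here \(w\) shifts \(x\) but enters the \(y\) equation only through \(x\), so it can serve as an instrument:

```r
# Two-stage least squares using w as an instrument for x_sim (sketch)
library(AER)

iv_model <- ivreg(y_sim ~ x_sim + z | w + z)
coef(iv_model)["x_sim"]  # close to the true structural value of 0.5
```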