These materials have been revised and updated by successive teaching assistants over recent years. The dataset and coding practice exercises in Part 4 are adapted from materials originally developed for the ICPSR Summer Program in Quantitative Methods.
In most empirical social science, our goal is to estimate the conditional expectation function (CEF):
\[ \mu(x) = E[y_i \mid x_i] \]
This is the average value of the outcome \(y\) we would observe if we could hold the covariates \(x\) at a particular value. It is the population object we are trying to learn about. For example, if \(y\) is public debt and \(x\) is GDP growth, then \(E[\text{debt} \mid \text{growth} = 2\%]\) tells us the average level of public debt among country-years where growth is 2%.
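Before imposing any model, note that a CEF can be estimated directly by averaging. A minimal sketch with hypothetical simulated growth and debt data (not the lab's dataset):

```r
# Sketch: approximating the CEF by averaging y within narrow bins of x.
# The data here are simulated for illustration only.
set.seed(42)
growth <- rnorm(1000, mean = 2, sd = 1)           # hypothetical GDP growth (%)
debt   <- 80 - 5 * growth + rnorm(1000, sd = 10)  # hypothetical public debt
bins <- cut(growth, breaks = 5)                   # partition x into 5 intervals
tapply(debt, bins, mean)                          # average debt within each growth bin
```

With enough observations per bin, these bin means trace out the CEF without assuming any functional form; the regression model below replaces the bins with a parametric line.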
The linear regression model imposes a strong assumption: that the CEF is linear in parameters:
\[ E[y_i \mid x_i] = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} \]
This is the linearity assumption. It says that the effect of each covariate on the expected outcome is constant — a one-unit increase in \(x_1\) always changes \(E[y]\) by \(\beta_1\), regardless of the level of \(x_1\) or any other covariate. In many applications this is a convenient approximation, but it is not always correct. If the true CEF is nonlinear (e.g., quadratic, logarithmic, or involves interactions), imposing linearity produces a systematic misfit — the regression line no longer traces out the true conditional expectation, and the estimated coefficients inherit bias. We will examine this problem concretely in Part 2.
The observed outcome is:
\[ y_i = E[y_i \mid x_i] + \varepsilon_i = X_i'\beta + \varepsilon_i \]
where \(X_i'\) is the row vector of covariates for individual \(i\) (including a 1 for the intercept), and \(\varepsilon_i\) is the deviation of individual \(i\) from the conditional mean. By construction, \(E[\varepsilon_i \mid x_i] = 0\) — the errors average to zero at every value of \(x\). In plain terms, the regressors are not systematically related to any unobserved factors that also affect \(y\). This is the exogeneity assumption, and it is the critical condition for OLS to deliver unbiased estimates of \(\beta\).
When exogeneity fails — i.e., \(E[\varepsilon_i \mid x_i] \neq 0\) — our estimates no longer recover the CEF. We call this problem endogeneity. The classical sources each represent a different mechanism by which a regressor becomes correlated with the error term.
Let’s remind ourselves how OLS works by simulating data from a known data generating process (DGP). The advantage of simulation is that we know the truth, so we can verify whether our estimator recovers it.
\[ y_i = 2 + 1.5 x_{1i} - 3 x_{2i} + \varepsilon_i, \quad \varepsilon_i \sim N(0, 4) \]
set.seed(512) # for replicability
n <- 500
x1 <- rnorm(n, mean = 0, sd = 2)
x2 <- rnorm(n, mean = 0, sd = 1)
e <- rnorm(n, mean = 0, sd = 2)
y <- 2 + 1.5 * x1 - 3 * x2 + e
simdat <- data.frame(y, x1, x2)
The function lm() estimates the model by ordinary least squares:
mod <- lm(y ~ x1 + x2, data = simdat)
summary(mod)
##
## Call:
## lm(formula = y ~ x1 + x2, data = simdat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9235 -1.3058 -0.1052 1.2736 6.5285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.09317 0.08579 24.40 <2e-16 ***
## x1 1.58939 0.04351 36.53 <2e-16 ***
## x2 -3.07835 0.08514 -36.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.918 on 497 degrees of freedom
## Multiple R-squared: 0.8407, Adjusted R-squared: 0.84
## F-statistic: 1311 on 2 and 497 DF, p-value: < 2.2e-16
The summary() output gives us the coefficient estimates \(\hat{\beta}\), their standard errors, \(t\)-statistics, and \(p\)-values. The estimates should be close to the true values \((2, 1.5, -3)\).
Recall that the OLS estimator in matrix form is:
\[ \hat{\beta} = (X'X)^{-1}X'y \]
where \(X\) is the \(n \times (k+1)\) design matrix (including a column of ones for the intercept) and \(y\) is the \(n \times 1\) outcome vector. This formula can look intimidating, but it encodes a simple idea — the next section breaks it into two interpretable pieces.
The formula is compact but its logic is clear once you decompose it. Both \(X'X\) and \(X'y\) are built from covariances, so before unpacking the formula we need to be clear about what covariance means and how it relates to correlation.
Covariance and correlation. These two concepts are closely related. The covariance between two variables measures how much they move together in their original units: \(\text{Cov}(x_1, x_2) = E[(x_1 - E[x_1])(x_2 - E[x_2])]\). For a set of variables, the covariance matrix collects all pairwise covariances:
\[ \Sigma = \begin{pmatrix} \text{Var}(y) & \text{Cov}(y, x_1) & \text{Cov}(y, x_2) \\ \text{Cov}(x_1, y) & \text{Var}(x_1) & \text{Cov}(x_1, x_2) \\ \text{Cov}(x_2, y) & \text{Cov}(x_2, x_1) & \text{Var}(x_2) \end{pmatrix} \]
The diagonal contains variances (how much each variable spreads); the off-diagonal contains covariances (how much each pair moves together, in the original units). The correlation matrix is the same information, but each entry is rescaled by the standard deviations of the two variables involved:
\[ R = \begin{pmatrix} 1 & \text{Cor}(y, x_1) & \text{Cor}(y, x_2) \\ \text{Cor}(x_1, y) & 1 & \text{Cor}(x_1, x_2) \\ \text{Cor}(x_2, y) & \text{Cor}(x_2, x_1) & 1 \end{pmatrix}, \quad \text{where} \quad \text{Cor}(a, b) = \frac{\text{Cov}(a, b)}{\text{SD}(a) \cdot \text{SD}(b)} \]
The diagonal is always 1 (every variable is perfectly correlated with itself), and all entries are bounded between \(-1\) and \(1\).
So the covariance is an unscaled correlation — it carries the same information about the direction and strength of the linear relationship, but in the units of the original variables rather than on the standardized \([-1, 1]\) scale. We can see this directly by comparing both matrices from our simulated data:
# Covariance matrix — entries in the original units of the variables
round(cov(simdat), 2)
## y x1 x2
## y 22.99 6.14 -3.11
## x1 6.14 3.89 0.01
## x2 -3.11 0.01 1.02
# Correlation matrix — same information, rescaled to [-1, 1]
round(cor(simdat), 2)
## y x1 x2
## y 1.00 0.65 -0.64
## x1 0.65 1.00 0.01
## x2 -0.64 0.01 1.00
The two matrices carry the same information — which pairs move together and in which direction — but on different scales. For instance, the covariance between \(y\) and \(x_1\) is large in absolute terms because both variables have wide spread; the corresponding correlation puts this on a standardized \([-1, 1]\) scale, making it easier to judge the strength of the association. Conversely, \(x_1\) and \(x_2\) are uncorrelated by construction, which shows up as near-zero entries in both matrices.
This matters because OLS works with covariances, not correlations. The matrices \(X'X\) and \(X'y\) are sums of products in the original units of the data — the scale of the variables directly affects both the magnitude of the coefficients and the precision with which they are estimated.
Piece 1: \(X'X\) — the (unnormalized) covariance of the regressors. The matrix \(X'X\) is a \((k+1) \times (k+1)\) matrix whose entries are the sums of squares and cross-products of the covariates. Dividing by \(n\) would give you the sample covariance matrix of \(X\). It captures how much variation each regressor has (diagonal) and how much they overlap with each other (off-diagonal). Its inverse, \((X'X)^{-1}\), tells OLS how to “undo” the correlation among regressors to isolate each one’s separate contribution — this is why multicollinearity (near-singular \(X'X\)) is a problem: there isn’t enough independent variation to separate the effects.
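To see the covariance interpretation concretely: if we demean the (non-intercept) columns of \(X\), then \(X'X\) divided by \(n - 1\) is exactly the sample covariance matrix. A quick check with freshly simulated data:

```r
# Sketch: for demeaned columns, X'X / (n - 1) equals the sample covariance matrix.
set.seed(1)
n  <- 500
x1 <- rnorm(n, sd = 2)
x2 <- rnorm(n)
Xc <- scale(cbind(x1, x2), center = TRUE, scale = FALSE)  # subtract column means
crossprod(Xc) / (n - 1)   # t(Xc) %*% Xc, rescaled
cov(cbind(x1, x2))        # identical to the line above
```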
Piece 2: \(X'y\) — the cross-covariance of regressors with the outcome. The vector \(X'y\) is \((k+1) \times 1\) and contains the sums of products between each regressor and \(y\). It measures how much each covariate co-moves with the outcome — the (unnormalized) covariance between each \(x\) and \(y\).
Putting them together: OLS asks “how much does each regressor co-move with \(y\) (\(X'y\)), after accounting for the overlap among the regressors themselves (\(X'X\))?”
Without the inversion, the raw covariance \(X'y\) would double-count any variation that two regressors share. If \(x_1\) and \(x_2\) are correlated, part of the co-movement between \(x_1\) and \(y\) is really due to \(x_2\), and vice versa. Multiplying by \((X'X)^{-1}\) strips out this shared variation, so that each coefficient reflects only the independent contribution of its regressor.
This is also why multicollinearity is a problem: when regressors overlap heavily, there is little independent variation left after the adjustment, and the estimates become imprecise (large standard errors).
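The "strips out shared variation" logic can be verified directly; it is the Frisch-Waugh-Lovell result. In this sketch (simulated data), regressing \(y\) on the part of \(x_1\) not explained by \(x_2\) reproduces the multiple-regression coefficient exactly:

```r
# Sketch (Frisch-Waugh-Lovell): the multiple-regression coefficient on x1
# equals the simple slope of y on the part of x1 orthogonal to x2.
set.seed(2)
n  <- 500
x2 <- rnorm(n)
x1 <- 0.6 * x2 + rnorm(n)            # x1 and x2 correlated by construction
y  <- 1 + 1.5 * x1 - 3 * x2 + rnorm(n)
b_full <- coef(lm(y ~ x1 + x2))["x1"]
x1_res <- resid(lm(x1 ~ x2))          # x1 purged of its overlap with x2
b_fwl  <- coef(lm(y ~ x1_res))["x1_res"]
c(b_full, b_fwl)                      # identical up to floating point
```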
This section uses linear algebra. If the vector-space framing is unfamiliar, the covariance intuition above is sufficient — you can come back to this later.
Think of \(y\) as a point in \(n\)-dimensional space, and the columns of \(X\) as spanning a lower-dimensional “model space.” OLS finds the point \(\hat{y} = X\hat{\beta}\) in that space that is closest to \(y\). The residual \(\hat{e} = y - \hat{y}\) is perpendicular to the model space:
\[ X'\hat{e} = X'(y - X\hat{\beta}) = 0 \]
This orthogonality condition is the system of normal equations that yields the OLS formula. The key implication: OLS always finds the best linear approximation to the CEF. When the CEF is linear, OLS recovers the truth. When it is not, OLS gives the best straight line — but that line may be a poor fit (the misspecification problem in Part 2).
An important distinction: The condition \(X'\hat{e} = 0\) — that the residuals are uncorrelated with the regressors — is not an assumption. It is a mechanical result of the OLS procedure. OLS forces the residuals to be orthogonal to \(X\) by construction; that is how it finds \(\hat{\beta}\). The assumption is \(E[\varepsilon_i \mid x_i] = 0\) — that the true population errors are uncorrelated with the regressors. This is a statement about the data-generating process, not about the fitted residuals. A model can satisfy \(X'\hat{e} = 0\) perfectly (every OLS model does) while badly violating exogeneity. The residuals always “look clean” in-sample; the question is whether the errors are clean in the population.
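A short simulation makes the distinction concrete. Even with a deliberately misspecified model (a linear fit to a quadratic CEF), the fitted residuals are exactly orthogonal to the regressors — while still carrying the omitted structure:

```r
# Sketch: X'e_hat = 0 holds by construction even under misspecification.
set.seed(3)
x <- rnorm(500)
y <- 1 + x^2 + rnorm(500)     # nonlinear truth
fit <- lm(y ~ x)              # misspecified linear model
X <- cbind(1, x)
t(X) %*% resid(fit)           # numerically zero: the normal equations enforce it
cor(resid(fit), x^2)          # but the residuals still contain the omitted x^2 term
```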
The following diagram illustrates this geometry. The grey band represents the “model space” — the set of all fitted value vectors \(X\beta\) for any choice of \(\beta\). OLS finds \(\hat{\mathbf{y}}\), the point in this space closest to the observed \(\mathbf{y}\). The residual \(\hat{\mathbf{e}}\) connects \(\hat{\mathbf{y}}\) to \(\mathbf{y}\) and is perpendicular to the model space — that perpendicularity is the normal equations \(X'\hat{e} = 0\).
OLS as orthogonal projection. The vector y lives in n-dimensional space. Its projection onto the column space of X is the fitted value vector ŷ. The residual ê is perpendicular to that space.
We can verify the OLS formula by computing it manually:
X <- cbind(1, x1, x2) # design matrix: n x 3
# Step by step:
XtX <- t(X) %*% X # 3 x 3: covariance structure of regressors
Xty <- t(X) %*% y # 3 x 1: covariance of regressors with outcome
beta_hat <- solve(XtX) %*% Xty # (X'X)^{-1} X'y
beta_hat
## [,1]
## 2.093174
## x1 1.589385
## x2 -3.078347
These values are identical to the lm() output. In this course we will sometimes compute estimators “by hand” to make the mechanics transparent — it helps demystify what functions like lm() are doing internally.
dplyr essentials

We use dplyr for data manipulation throughout this course. Its verbs compose via the pipe operator |>. Quick refresher:
# Create a small example dataset
toy <- tibble(
country = rep(c("USA", "UK", "France"), each = 3),
year = rep(2010:2012, 3),
gdp = c(14.5, 15.0, 15.5, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7),
debt = c(60, 65, 70, 75, 80, 85, 55, 58, 60)
)
toy
## # A tibble: 9 × 4
## country year gdp debt
## <chr> <int> <dbl> <dbl>
## 1 USA 2010 14.5 60
## 2 USA 2011 15 65
## 3 USA 2012 15.5 70
## 4 UK 2010 2.2 75
## 5 UK 2011 2.3 80
## 6 UK 2012 2.4 85
## 7 France 2010 2.5 55
## 8 France 2011 2.6 58
## 9 France 2012 2.7 60
The core verbs:
# select: choose columns by name
toy |>
select(country, year, debt)
## # A tibble: 9 × 3
## country year debt
## <chr> <int> <dbl>
## 1 USA 2010 60
## 2 USA 2011 65
## 3 USA 2012 70
## 4 UK 2010 75
## 5 UK 2011 80
## 6 UK 2012 85
## 7 France 2010 55
## 8 France 2011 58
## 9 France 2012 60
# filter: keep rows that satisfy a condition
toy |>
filter(year >= 2011, debt > 60)
## # A tibble: 4 × 4
## country year gdp debt
## <chr> <int> <dbl> <dbl>
## 1 USA 2011 15 65
## 2 USA 2012 15.5 70
## 3 UK 2011 2.3 80
## 4 UK 2012 2.4 85
# mutate: create or transform variables
toy |>
mutate(debt_to_gdp = debt / gdp,
log_gdp = log(gdp))
## # A tibble: 9 × 6
## country year gdp debt debt_to_gdp log_gdp
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 USA 2010 14.5 60 4.14 2.67
## 2 USA 2011 15 65 4.33 2.71
## 3 USA 2012 15.5 70 4.52 2.74
## 4 UK 2010 2.2 75 34.1 0.788
## 5 UK 2011 2.3 80 34.8 0.833
## 6 UK 2012 2.4 85 35.4 0.875
## 7 France 2010 2.5 55 22 0.916
## 8 France 2011 2.6 58 22.3 0.956
## 9 France 2012 2.7 60 22.2 0.993
# summarize + group_by: compute grouped summaries
toy |>
group_by(country) |>
summarize(mean_debt = mean(debt),
sd_debt = sd(debt),
n_years = n())
## # A tibble: 3 × 4
## country mean_debt sd_debt n_years
## <chr> <dbl> <dbl> <int>
## 1 France 57.7 2.52 3
## 2 UK 80 5 3
## 3 USA 65 5 3
# arrange: sort rows
toy |>
arrange(desc(debt))
## # A tibble: 9 × 4
## country year gdp debt
## <chr> <int> <dbl> <dbl>
## 1 UK 2012 2.4 85
## 2 UK 2011 2.3 80
## 3 UK 2010 2.2 75
## 4 USA 2012 15.5 70
## 5 USA 2011 15 65
## 6 USA 2010 14.5 60
## 7 France 2012 2.7 60
## 8 France 2011 2.6 58
## 9 France 2010 2.5 55
These verbs chain together into pipelines — especially useful for panel data, where we often operate within groups (countries, time periods).
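A sketch of such a pipeline, assuming dplyr is loaded (the toy tibble is re-created here so the chunk stands alone): compute year-over-year GDP growth within each country via a lag, then summarize by group.

```r
library(dplyr)

# Re-create the toy panel from above for self-containment
toy <- tibble(
  country = rep(c("USA", "UK", "France"), each = 3),
  year    = rep(2010:2012, 3),
  gdp     = c(14.5, 15.0, 15.5, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7)
)

# Within-country operations: lag() respects the grouping, so growth is
# computed only against the same country's previous year
growth_summary <- toy |>
  group_by(country) |>
  arrange(year, .by_group = TRUE) |>
  mutate(gdp_growth = (gdp - lag(gdp)) / lag(gdp) * 100) |>
  summarize(mean_growth = mean(gdp_growth, na.rm = TRUE))
growth_summary
```

The group_by() step is what makes this panel-safe: without it, lag() would reach across country boundaries.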
A well-organized project structure pays off for reproducibility and collaboration. A typical research project:
my_project/
├── data/ # raw input datasets
├── output/
│ ├── plots/ # saved figures (.pdf, .png)
│ ├── tables/ # exported regression tables (.tex, .html)
│ ├── data/ # cleaned or processed datasets (.rds, .csv)
│ └── models/ # saved estimated model objects (.rds)
├── my_script.R # or .Rmd analysis file(s)
└── README.md # brief description of the project
In this lab, we use this structure: raw data lives in data/, and everything we produce goes into output/, organized by type. This separation keeps your working directory clean, makes it easy to distinguish inputs from generated artifacts, and helps collaborators find what they need.
texreg

The summary() function is fine for quick inspection, but for comparing models side by side — or for exporting publication-ready tables — we use the texreg package. It provides three functions for different output targets:
| Function | Output | Use case |
|---|---|---|
| screenreg() | Console (plain text) | Quick inspection during analysis |
| htmlreg() | HTML | RMarkdown documents knitted to HTML |
| texreg() | LaTeX | RMarkdown documents knitted to PDF, or papers via Overleaf |
Tip: Since htmlreg() and texreg() produce format-specific output, this lab defines a helper show_reg() (in the setup chunk) that automatically picks the right one based on how you knit the document. Use show_reg() in place of htmlreg()/texreg() when you want tables to render correctly in both HTML and PDF.
Let’s see them in action with two simple models:
# Fit two models for demonstration
demo_dat <- data.frame(
y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100)
)
m1 <- lm(y ~ x1, data = demo_dat)
m2 <- lm(y ~ x1 + x2, data = demo_dat)
# Console output — good for quick comparison
screenreg(list(m1, m2),
custom.model.names = c("Model 1", "Model 2"))
##
## =============================
## Model 1 Model 2
## -----------------------------
## (Intercept) -0.10 -0.10
## (0.10) (0.10)
## x1 -0.16 -0.16
## (0.11) (0.11)
## x2 -0.02
## (0.10)
## -----------------------------
## R^2 0.02 0.02
## Adj. R^2 0.01 0.00
## Num. obs. 100 100
## =============================
## *** p < 0.001; ** p < 0.01; * p < 0.05
For RMarkdown documents, use htmlreg() for HTML output or texreg() for PDF output, with results='asis' in the chunk options so the table renders properly. The show_reg() helper defined in the setup chunk does this automatically:
# show_reg() picks htmlreg() for HTML output or texreg() for PDF automatically
show_reg(list(m1, m2),
custom.model.names = c("Model 1", "Model 2"),
caption = "Example regression table",
caption.above = TRUE)
| | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | -0.10 | -0.10 |
| | (0.10) | (0.10) |
| x1 | -0.16 | -0.16 |
| | (0.11) | (0.11) |
| x2 | | -0.02 |
| | | (0.10) |
| R^2 | 0.02 | 0.02 |
| Adj. R^2 | 0.01 | 0.00 |
| Num. obs. | 100 | 100 |
| *** p < 0.001; ** p < 0.01; * p < 0.05 | | |
To export a table as a standalone file — for inclusion in a paper or for sharing — use the file argument:
# Export as LaTeX (.tex) — for Overleaf or LaTeX documents
texreg(list(m1, m2),
custom.model.names = c("Model 1", "Model 2"),
file = "output/tables/example_table.tex",
caption = "Example regression table")
# Export as HTML — for web documents or quick sharing
htmlreg(list(m1, m2),
custom.model.names = c("Model 1", "Model 2"),
file = "output/tables/example_table.html",
caption = "Example regression table")
The texreg family accepts many customization options: custom.coef.names to rename coefficients, include.ci = TRUE to show confidence intervals instead of standard errors, stars to control significance stars, digits to set decimal places, etc. See ?texreg for the full list.
ggsave

To export ggplot2 figures to files, use ggsave(). By default it saves the last plot displayed, but you can also pass a plot object explicitly:
# Save the last displayed plot
ggsave(filename = "output/plots/my_figure.pdf",
width = 6, height = 6 / 1.618) # golden ratio
# Common formats: .pdf (vector, best for papers), .png (raster, good for slides)
We use the golden ratio (\(\phi \approx 1.618\)) for the width-to-height proportion — it produces aesthetically balanced figures and is a sensible default for academic plots.
After cleaning or transforming a dataset, save the result so downstream scripts can load it directly without re-running the cleaning pipeline:
# CSV — portable, human-readable, but loses R-specific metadata (factor levels, etc.)
write_csv(my_clean_data, file = "output/data/clean_data.csv")
# RDS — compact, preserves all R object attributes, but R-only
saveRDS(my_clean_data, file = "output/data/clean_data.rds")
# To reload: my_clean_data <- readRDS("output/data/clean_data.rds")
Use .csv when the data needs to be shared with non-R users or loaded into other software. Use .rds when the data stays within R and you want to preserve factor levels, date classes, or other metadata.
You can also save fitted model objects so you (or a co-author) can reload them later without re-running the estimation — especially useful for models that take a long time to fit:
# Save a fitted model object
saveRDS(my_model, file = "output/models/my_model.rds")
# To reload: my_model <- readRDS("output/models/my_model.rds")
The .rds format preserves the entire model object, including coefficients, residuals, and the variance-covariance matrix — everything you need for summary(), predict(), or coeftest().
Recall that the OLS estimator targets the conditional expectation function:
\[ E[y_i \mid x_i] = X_i'\beta \]
This works when \(E[\varepsilon_i \mid x_i] = 0\). But when a regressor is endogenous — correlated with the error term — we have \(E[\varepsilon_i \mid x_i] \neq 0\), and OLS no longer recovers the true \(\beta\). The estimated regression line no longer traces out the CEF; it traces out a distorted version of it.
A note on terminology. “Endogeneity” is an umbrella term — it simply means that the mean independence assumption \(E[\varepsilon_i \mid x_i] = 0\) is violated. It tells you that something is wrong, but not what is wrong. In practice, there are several distinct mechanisms through which exogeneity can fail, and they differ in their causes, the direction and magnitude of the bias they produce, and the remedies available. Saying “the model suffers from endogeneity” is like saying “the patient is sick” — it is not a diagnosis. When communicating results, be specific: say “omitted variable bias,” “measurement error,” “reverse causality,” etc. This precision matters because different problems call for different solutions (controls, better data, instrumental variables, etc.), and a vague appeal to “endogeneity” obscures which solution is appropriate.
In this lab we review four classical forms of endogeneity, each producing bias and inconsistency through a different channel:
| Source | Mechanism | Effect on \(\hat{\beta}\) |
|---|---|---|
| Omitted variables | A relevant variable is left out; its effect is absorbed into \(\varepsilon\), which becomes correlated with \(x\) | Direction depends on the sign of the omitted variable’s effect and its correlation with \(x\) |
| Measurement error | The true regressor \(x\) is observed with noise \(x^* = x + \upsilon\); the noise makes the observed regressor correlated with the error term | Under classical error: attenuation toward zero. Under non-classical error: direction can be unpredictable |
| Functional misspecification | The true CEF is nonlinear but we impose a linear model; the omitted nonlinear terms (e.g., \(x^2\)) are absorbed into \(\varepsilon\) and are mechanically correlated with \(x\) | A special case of OVB where the “omitted variable” is a transformation of an included regressor |
| Simultaneity (Appendix B) | \(x\) and \(y\) are jointly determined; \(x\) depends on \(\varepsilon\) through the feedback loop | Direction depends on the structural parameters of the system |
We now illustrate the first three with simulations where we know the true DGP, so we can measure the bias exactly. Simultaneity is covered in Appendix B (self-study).
Suppose the true model is:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon \]
but we estimate a model that omits one or more regressors. If the omitted variable is correlated with an included regressor and affects the outcome, the coefficient on the included regressor is biased. In the simple case where we omit \(x_2\):
\[ \hat{\beta}_1^{\text{short}} \xrightarrow{p} \beta_1 + \beta_2 \cdot \delta_{x_2 \mid x_1} \]
The arrow \(\xrightarrow{p}\) reads “converges in probability to” — it tells us what the estimator approaches as the sample grows large. In other words, this is not sampling noise; it is a systematic shift that does not vanish with more data. Here \(\delta_{x_2 \mid x_1}\) is the coefficient from regressing the omitted variable \(x_2\) on the included variable \(x_1\). The bias is the product of two things: (1) how much the omitted variable affects \(y\) (i.e., \(\beta_2\)), and (2) how correlated the omitted variable is with the included regressor (i.e., \(\delta\)).
For example, if we estimate the effect of education on income but omit cognitive ability, the education coefficient absorbs part of the ability effect — biasing it upward if ability and education are positively correlated.
In Part 1 we simulated covariates one at a time using rnorm() — each drawn independently. But OVB is driven by correlations among covariates, so we need a way to generate variables that are correlated by design. The multivariate normal distribution does exactly this. For a random vector \(\mathbf{x} = (x_1, x_2, \dots, x_p)'\):
\[ \mathbf{x} \sim \mathcal{N}_p(\boldsymbol{\mu}, \, \Sigma) \]
where \(\boldsymbol{\mu}\) is the \(p \times 1\) mean vector (the center of the distribution) and \(\Sigma\) is the \(p \times p\) covariance matrix that controls both the spread of each variable (diagonal entries = variances) and the linear dependencies between them (off-diagonal entries = covariances). When the diagonal entries are all ones (unit variance), the off-diagonal entries are correlations directly.
In R, mvrnorm(n, mu, Sigma) from the MASS package draws \(n\) observations from this distribution. We set up our DGP with four covariates, mean zero, and the following correlation structure:
\[ \boldsymbol{\mu} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 1 & -0.4 & 0 & 0 \\ -0.4 & 1 & 0.7 & 0 \\ 0 & 0.7 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \]
Since the diagonal is all ones (unit variance for each covariate), this \(\Sigma\) is simultaneously a covariance and a correlation matrix — the off-diagonal entries are the pairwise correlations directly. Compare this to the code below:
set.seed(123)
# True parameters: intercept = 2, b1 = 1, b2 = -5, b3 = 3, b4 = 5
beta <- as.matrix(c(2, 1, -5, 3, 5))
# Correlation/covariance matrix (unit variances, so correlations = covariances)
sigma <- rbind(
c(1.0, -0.4, 0.0, 0.0),
c(-0.4, 1.0, 0.7, 0.0),
c(0.0, 0.7, 1.0, 0.0),
c(0.0, 0.0, 0.0, 1.0)
)
colnames(sigma) <- rownames(sigma) <- c("x1", "x2", "x3", "x4")
# Draw 1000 observations from the multivariate normal
X <- mvrnorm(n = 1000, mu = rep(0, 4), Sigma = sigma)
Xs <- cbind(1, X)
simdat_ovb <- as.data.frame(Xs)
colnames(simdat_ovb) <- c("int", "x1", "x2", "x3", "x4")
# Generate y = Xs %*% beta + epsilon
simdat_ovb$y <- rnorm(1000, mean = Xs %*% beta)
Before modeling, it is good practice to verify that the data look as expected. Check the correlation matrix of the covariates — does it match the structure we specified?
# Always inspect your data before modeling
round(cor(simdat_ovb[, c("x1", "x2", "x3", "x4")]), 2)
## x1 x2 x3 x4
## x1 1.00 -0.33 0.06 0.03
## x2 -0.33 1.00 0.71 -0.03
## x3 0.06 0.71 1.00 0.01
## x4 0.03 -0.03 0.01 1.00
We see the correlations we built in: \(x_1\) and \(x_2\) are negatively correlated (sample value \(-0.33\), close to the specified \(-0.4\)), \(x_2\) and \(x_3\) are positively correlated (\(\approx 0.7\)), and \(x_4\) is essentially uncorrelated with the rest. This structure is what drives the OVB patterns we are about to observe.
We compare the true model against progressively more misspecified models:
true_model <- lm(y ~ x1 + x2 + x3 + x4, data = simdat_ovb)
omit_x4 <- lm(y ~ x1 + x2 + x3, data = simdat_ovb)
omit_x3 <- lm(y ~ x1 + x2 + x4, data = simdat_ovb)
just_x1 <- lm(y ~ x1, data = simdat_ovb)
# Compare estimates to truth
round(cbind(Estimates = coef(true_model), Truth = beta), 3)
## Estimates
## (Intercept) 1.963 2
## x1 1.071 1
## x2 -4.989 -5
## x3 3.057 3
## x4 4.983 5
The true model recovers the parameters well. We can compare all four models side by side using screenreg():
screenreg(list(true_model, omit_x4, omit_x3, just_x1),
custom.model.names = c("True", "Omit x4", "Omit x3", "Just x1"))
##
## ===============================================================
## True Omit x4 Omit x3 Just x1
## ---------------------------------------------------------------
## (Intercept) 1.96 *** 1.86 *** 1.98 *** 1.81 ***
## (0.03) (0.16) (0.07) (0.19)
## x1 1.07 *** 1.02 *** 2.14 *** 3.08 ***
## (0.04) (0.19) (0.08) (0.19)
## x2 -4.99 *** -5.45 *** -2.37 ***
## (0.05) (0.27) (0.07)
## x3 3.06 *** 3.42 ***
## (0.05) (0.24)
## x4 4.98 *** 5.08 ***
## (0.03) (0.07)
## ---------------------------------------------------------------
## R^2 0.98 0.44 0.89 0.20
## Adj. R^2 0.98 0.44 0.89 0.20
## Num. obs. 1000 1000 1000 1000
## ===============================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
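The OVB formula is not only an asymptotic statement — in-sample, the decomposition holds exactly as an algebraic identity: the short-regression coefficient equals the long-regression coefficient plus the omitted variable's coefficient times the auxiliary-regression slope. A self-contained check (fresh simulated data, simpler than the four-covariate DGP above):

```r
# Sketch: verifying the omitted variable bias decomposition in-sample.
set.seed(4)
n  <- 1000
x1 <- rnorm(n)
x3 <- 0.7 * x1 + rnorm(n)            # omitted variable, correlated with x1
y  <- 2 + 1 * x1 + 3 * x3 + rnorm(n)
long  <- lm(y ~ x1 + x3)             # correctly specified
short <- lm(y ~ x1)                  # omits x3
aux   <- lm(x3 ~ x1)                 # delta: omitted variable on included regressor
coef(short)["x1"]
coef(long)["x1"] + coef(long)["x3"] * coef(aux)["x1"]   # identical
```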
bias_df <- bind_rows(
tibble(model = "Omit x4",
coefficient = names(coef(omit_x4)),
bias = coef(omit_x4) - beta[1:4]),
tibble(model = "Omit x3",
coefficient = names(coef(omit_x3)),
bias = coef(omit_x3) - beta[c(1, 2, 3, 5)]),
tibble(model = "Just x1",
coefficient = names(coef(just_x1)),
bias = coef(just_x1) - beta[1:2])
)
ggplot(bias_df, aes(x = coefficient, y = bias, fill = model)) +
geom_col(position = position_dodge(width = 0.7), width = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed") +
scale_fill_brewer(palette = "Set1", name = "Model") +
labs(title = "Omitted Variable Bias across Misspecified Models",
y = "Bias (Estimate - Truth)", x = "Coefficient") +
theme_minimal(base_size = 14)
Coefficient bias from omitting correlated regressors. Omitting x4 (uncorrelated) produces no bias; omitting x3 (correlated with x2) biases the x2 estimate.
# Save figure to output folder
ggsave("output/plots/ovb_bias.pdf", width = 6, height = 6 / 1.618)
Key takeaway: in terms of the CEF, the misspecified model no longer estimates \(E[y \mid x_1, x_2, x_3, x_4]\). Instead, it estimates a distorted conditional expectation in which the omitted effects are conflated with the included regressors.
Suppose the true model includes \(x_2\), but we do not observe \(x_2\) directly. Instead, we observe a noisy proxy:
\[ x_2^* = x_2 + \upsilon, \quad \upsilon \sim N(0, \sigma_\upsilon^2) \]
Under classical measurement error — where the noise \(\upsilon\) is independent of both the true \(x_2\) and the structural error \(\varepsilon\) — the OLS coefficient on \(x_2^*\) suffers from attenuation bias: it is biased toward zero by a factor known as the reliability ratio:
\[ \hat{\beta}_2^* \xrightarrow{p} \beta_2 \cdot \underbrace{\frac{\sigma_{x_2}^2}{\sigma_{x_2}^2 + \sigma_\upsilon^2}}_{\text{reliability ratio} < 1} \]
The reliability ratio is always between 0 and 1: the fraction of variance that is “real signal.” More noise → ratio closer to zero → coefficient shrinks toward zero.
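A quick numerical check of the attenuation factor, using a univariate sketch in which the signal and noise variances are chosen equal, so the reliability ratio is 0.5:

```r
# Sketch: classical measurement error attenuates the slope by the reliability ratio.
set.seed(5)
n      <- 10000
x      <- rnorm(n, sd = 2)     # true regressor, variance 4
nu     <- rnorm(n, sd = 2)     # classical noise, variance 4, independent of x
x_star <- x + nu               # observed regressor
y      <- 1 + 2 * x + rnorm(n)
coef(lm(y ~ x_star))["x_star"] # approximately beta * reliability = 2 * 0.5 = 1
4 / (4 + 4)                    # reliability ratio: var(x) / (var(x) + var(nu))
```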
Important caveat: The clean attenuation result above holds only under classical measurement error. In practice, measurement error is often non-classical — the noise may be correlated with the true value of \(x\) (e.g., high earners underreport income), with the outcome \(y\), or with other covariates. When this happens, the bias is no longer predictably toward zero: it can go in either direction and its magnitude depends on the specific correlation structure of the error. For example, if survey respondents who are more politically engaged both report their ideology more accurately and participate more, the measurement error in ideology is correlated with participation — and the resulting bias could inflate or deflate the coefficient unpredictably. The bottom line: do not assume attenuation by default; think carefully about the likely structure of the measurement error in your specific application.
Additionally, measurement error can mask nonlinearity — noise in \(x\) smooths out curvature, potentially making a nonlinear relationship appear linear.
Using the same simulated data from the OVB section, we add measurement error to \(x_2\):
set.seed(123)
noisydat <- simdat_ovb
noise <- rnorm(1000, mean = 0, sd = 1.5)
noisydat$x2 <- noisydat$x2 + noise # observed with error
noisy_model <- lm(y ~ x1 + x2 + x3 + x4, data = noisydat)
round(cbind(
Noisy = coef(noisy_model),
True = coef(true_model),
DGP = beta
), 3)
## Noisy True
## (Intercept) 1.963 1.963 2
## x1 -0.021 1.071 1
## x2 -2.788 -4.989 -5
## x3 4.968 3.057 3
## x4 4.983 4.983 5
Notice: the coefficient on \(x_2\) is attenuated — pulled toward zero relative to the truth (\(\beta_2 = -5\)). The other coefficients that are correlated with \(x_2\) may also be affected, as the model compensates for the weakened signal in \(x_2^*\) by adjusting the estimates on correlated covariates. This is known as induced bias.
bind_rows(
simdat_ovb |> mutate(type = "True x2"),
noisydat |> mutate(type = "Noisy x2")
) |>
ggplot(aes(x = x2, y = y, color = type)) +
geom_point(alpha = 0.3, size = 0.8) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
facet_wrap(~type) +
labs(x = "x2", y = "y",
title = "Attenuation Bias from Measurement Error") +
theme_minimal(base_size = 14) +
theme(legend.position = "none")
Measurement error in x2 flattens the estimated regression line, pulling the slope toward zero.
ggsave("output/plots/attenuation_bias.pdf", width = 6, height = 6 / 1.618)
Beyond attenuation in linear models, measurement error can obscure the functional form of the relationship. If the true DGP is nonlinear, noise in \(x\) spreads out the data cloud horizontally, smoothing out the curvature and making the relationship appear flatter and more linear than it actually is.
set.seed(123)
n_mask <- 200
x_true <- runif(n_mask, -3, 3)
y_mask <- sin(x_true) + rnorm(n_mask, 0, 0.3)
# Add substantial measurement error
u_mask <- rnorm(n_mask, mean = 0, sd = 3)
x_obs <- x_true + u_mask
df_mask <- data.frame(x_true, x_obs, y = y_mask)
p1 <- ggplot(df_mask, aes(x = x_true, y = y)) +
geom_point(shape = 1, size = 2, alpha = 0.8) +
geom_smooth(method = "loess", se = FALSE, color = "blue") +
labs(title = "True Covariate (x)", x = "x (true)", y = "y") +
ylim(-2, 2) +
theme_minimal(base_size = 13)
p2 <- ggplot(df_mask, aes(x = x_obs, y = y)) +
geom_point(shape = 8, size = 2, alpha = 0.8) +
geom_smooth(method = "loess", se = FALSE, color = "red") +
labs(title = "Observed Covariate with Measurement Error (x*)",
x = "x* (observed)", y = "y") +
ylim(-2, 2) +
theme_minimal(base_size = 13)
# Stack vertically
grid.arrange(p1, p2, nrow = 2)
When the true relationship is nonlinear (top), measurement error in x can mask that structure entirely (bottom), leading to incorrect modeling decisions.
In the top panel, the sinusoidal relationship is clearly visible. In the bottom panel, the horizontal spread introduced by measurement error washes out the curvature almost entirely. A researcher working only with \(x^*\) might conclude the relationship is essentially flat — a fundamentally wrong conclusion driven entirely by data quality, not by the underlying phenomenon.
Functional misspecification is a special case of omitted variable bias that arises from the linearity assumption itself. If the true CEF includes nonlinear terms — say, \(x_2^2\) — but we fit a linear model with only \(x_2\), then the omitted term \(x_2^2\) is absorbed into the error. Since \(x_2^2\) is mechanically correlated with \(x_2\) (and with any variable correlated with \(x_2\)), this is formally equivalent to OVB: the “omitted variable” happens to be a transformation of an included regressor.
The OVB formula applies directly. If we omit \(x_2^2\) from a model that includes \(x_2\):
\[ \hat{\beta}_2^{\text{linear}} \xrightarrow{p} \beta_2 + \beta_3 \cdot \delta_{x_2^2 \mid x_2, \dots} \]
where \(\beta_3\) is the true coefficient on \(x_2^2\) and \(\delta\) captures the relationship between \(x_2^2\) and \(x_2\) net of other covariates.
The same logic applies to omitted interactions: if the true model includes \(x_1 \times x_2\) but we impose additivity, the omitted interaction term is absorbed into the coefficients of \(x_1\) and \(x_2\).
This matters in practice because researchers often assume linearity and additivity by default. But if the true relationship involves quadratic terms or interactions, imposing a linear additive model does not merely produce a “less precise” estimate — it produces biased estimates of every coefficient that is correlated with the omitted nonlinear terms.
We simulate a DGP with three covariates where \(x_2\) has a quadratic effect:
\[ y = 1 + 2 x_1 + 4 x_2 - 0.6 x_2^2 + 0.8 (x_1 \times x_2) + 1.5 x_3 + \varepsilon, \quad \varepsilon \sim N(0, 4) \]
The true model includes both a quadratic in \(x_2\) and an interaction between \(x_1\) and \(x_2\). Think of this as a variable with diminishing returns (\(x_2\)) whose effect also varies with context (\(x_1\)) — a common pattern in social science (e.g., the effect of government spending on growth depends on institutional quality, and has diminishing returns regardless). The covariates are correlated: \(\text{Cor}(x_1, x_2) = 0.5\), \(\text{Cor}(x_2, x_3) = 0.5\), \(\text{Cor}(x_1, x_3) = 0.2\).
The marginal effect of \(x_1\) in the true DGP is \(2 + 0.8 x_2\), which depends on \(x_2\). When we omit the nonlinear terms and fit a linear model, the estimated coefficient on \(x_1\) converges to the average marginal effect across the sample rather than the structural parameter — producing clear bias on \(x_1\) even though \(x_1\) itself is correctly specified. The distinction matters because the average marginal effect hides the fact that \(x_1\)’s effect varies with \(x_2\): reporting a single number masks real heterogeneity in the relationship.
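Before running the full simulation, the OVB arithmetic for the omitted quadratic can be checked in isolation. This sketch uses only the \(x_2\) terms of the DGP above (\(\beta_2 = 4\), \(\beta_3 = -0.6\), \(x_2 \sim N(3, 1)\)); for a normal regressor, the auxiliary slope of \(x_2^2\) on \(x_2\) is \(2\mu = 6\):

```r
# Sketch: verify the OVB formula for an omitted quadratic term
set.seed(1)
n_chk  <- 50000
x2_chk <- rnorm(n_chk, mean = 3)
y_chk  <- 4 * x2_chk - 0.6 * x2_chk^2 + rnorm(n_chk)

delta_chk <- coef(lm(I(x2_chk^2) ~ x2_chk))[2]  # auxiliary slope, ~ 2 * 3 = 6
coef(lm(y_chk ~ x2_chk))[2]                      # ~ 4 + (-0.6) * 6 = 0.4
```

The linear slope lands near 0.4 — the same heavily attenuated value the misspecified model reports in the comparison table below.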
set.seed(512)
n <- 800
# Correlation structure among covariates
sigma_ms <- rbind(
c(1.0, 0.5, 0.2),
c(0.5, 1.0, 0.5),
c(0.2, 0.5, 1.0)
)
# Draw correlated covariates — x2 centered near 3 with wider spread
sds_ms <- c(1, 1.5, 1)
cov_ms <- diag(sds_ms) %*% sigma_ms %*% diag(sds_ms)
X_ms <- mvrnorm(n, mu = c(0, 3, 0), Sigma = cov_ms)
x1_ms <- X_ms[, 1]
x2_ms <- X_ms[, 2]
x3_ms <- X_ms[, 3]
# True DGP: quadratic in x2 + interaction x1*x2
y_ms <- 1 + 2 * x1_ms + 4 * x2_ms - 0.6 * x2_ms^2 + 0.8 * x1_ms * x2_ms + 1.5 * x3_ms + rnorm(n, 0, 2)
msdat <- data.frame(y = y_ms, x1 = x1_ms, x2 = x2_ms, x3 = x3_ms)
Before fitting models, let’s visualize the relationship we built in. A scatter plot of \(y\) against \(x_2\) should reveal the curvature — the inverted-U that the linear model will miss:
# Quick EDA: is there visible nonlinearity?
ggplot(msdat, aes(x = x2, y = y)) +
geom_point(alpha = 0.2, size = 0.8) +
geom_smooth(method = "loess", se = FALSE, color = "steelblue", linewidth = 1) +
geom_smooth(method = "lm", se = FALSE, color = "firebrick", linetype = "dashed", linewidth = 0.8) +
labs(title = "y vs. x2: Is the relationship linear?",
subtitle = "Blue = loess (flexible); Red dashed = linear fit",
x = expression(x[2]), y = "y") +
theme_minimal(base_size = 13)
The loess smoother (blue) traces the curvature; the linear fit (red dashed) misses it. In practice, this kind of plot is your first defense against misspecification — always plot your key relationships before assuming linearity.
# Correct specification: includes the quadratic and interaction
correct_mod <- lm(y ~ x1 * x2 + I(x2^2) + x3, data = msdat)
# Misspecified: assumes linearity, omits x2^2 and x1:x2
misspec_mod <- lm(y ~ x1 + x2 + x3, data = msdat)
# Compare estimates
screenreg(list(correct_mod, misspec_mod),
custom.model.names = c("Correct", "Misspecified (linear)"))
##
## ==============================================
## Correct Misspecified (linear)
## ----------------------------------------------
## (Intercept) 0.74 * 5.74 ***
## (0.31) (0.26)
## x1 1.82 *** 4.26 ***
## (0.18) (0.10)
## x2 4.23 *** 0.40 ***
## (0.19) (0.08)
## x2^2 -0.63 ***
## (0.03)
## x3 1.29 *** 1.31 ***
## (0.09) (0.11)
## x1:x2 0.84 ***
## (0.06)
## ----------------------------------------------
## R^2 0.87 0.80
## Adj. R^2 0.87 0.80
## Num. obs. 800 800
## ==============================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
Two things stand out. First, the coefficient on \(x_2\) in the misspecified model is dramatically biased: it should be 4, but the omitted quadratic pulls it toward zero. Second — and this is the key insight — the coefficient on \(x_1\) is also badly biased. The true structural effect of \(x_1\) is 2, but the misspecified model estimates it at approximately \(2 + 0.8 \times \bar{x}_2 \approx 4.4\). The omitted interaction \(x_1 \times x_2\) gets absorbed into the \(x_1\) coefficient, inflating it to reflect the average marginal effect rather than the structural parameter. This is OVB where the “omitted variables” are nonlinear transformations of included regressors.
The key consequence of misspecification is that it distorts the estimated conditional expectation function. We can visualize this by predicting \(E[y \mid x_2]\) under both models, holding \(x_1\) and \(x_3\) at their means.
The prediction workflow has three steps: (1) set up a grid of values for the variable of interest, (2) compute \(\hat{y} = X_{\text{new}} \hat{\beta}\) at each grid point, and (3) get confidence intervals. We show the manual approach first (for understanding), then the predict() shortcut.
# Step 1: Set up a prediction grid
# Vary x2 across its range, hold x1 and x3 at their sample means
x2_grid <- seq(min(msdat$x2), max(msdat$x2), length.out = 200)
newdata <- data.frame(
x1 = mean(msdat$x1),
x2 = x2_grid,
x3 = mean(msdat$x3)
)
# Step 2: Manual prediction — y_hat = X_new %*% beta_hat
# Build the design matrix for each model (columns must match coefficient order)
beta_correct <- coef(correct_mod)
beta_misspec <- coef(misspec_mod)
X_new_correct <- cbind(1, newdata$x1, newdata$x2, newdata$x2^2, newdata$x3,
newdata$x1 * newdata$x2) # intercept, x1, x2, x2^2, x3, x1:x2
X_new_misspec <- cbind(1, newdata$x1, newdata$x2, newdata$x3)
yhat_correct <- X_new_correct %*% beta_correct
yhat_misspec <- X_new_misspec %*% beta_misspec
# Step 3: Use predict() — does the same thing in one call, plus SEs
pred_correct <- predict(correct_mod, newdata = newdata, se.fit = TRUE)
pred_misspec <- predict(misspec_mod, newdata = newdata, se.fit = TRUE)
# Verify: manual matches predict()
head(data.frame(
manual = as.numeric(yhat_correct),
predict = pred_correct$fit
), 4)
## manual predict
## 1 -8.863865 -8.863865
## 2 -8.547384 -8.547384
## 3 -8.233924 -8.233924
## 4 -7.923483 -7.923483
The predict() function with se.fit = TRUE also returns standard errors, which we use to build 95% confidence intervals. The SE at each point reflects the uncertainty in \(\hat{\beta}\) propagated through the design matrix: \(\text{SE}(\hat{y}) = \sqrt{x_{\text{new}}' \hat{V} \, x_{\text{new}}}\), where \(\hat{V}\) is the variance-covariance matrix of the estimated coefficients. In words: the prediction uncertainty comes from the coefficient uncertainty, weighted by the covariate values at which we are predicting.
For self-study (Appendix): You can also compute these SEs manually using the variance-covariance matrix vcov(model), or simulate from the sampling distribution of \(\hat{\beta}\) using mvrnorm(). The Appendix on interaction effects demonstrates both approaches.
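As a quick standalone illustration of the formula \(\text{SE}(\hat{y}) = \sqrt{x_{\text{new}}' \hat{V} \, x_{\text{new}}}\) — using the built-in mtcars data rather than our simulated dataset:

```r
# Sketch: manual prediction SE vs. predict(..., se.fit = TRUE)
mod   <- lm(mpg ~ wt, data = mtcars)
x_new <- c(1, 3)                                   # intercept, wt = 3
se_manual <- sqrt(t(x_new) %*% vcov(mod) %*% x_new)
pred <- predict(mod, newdata = data.frame(wt = 3), se.fit = TRUE)
c(manual = as.numeric(se_manual), predict = as.numeric(pred$se.fit))
```

The two numbers are identical: predict() is doing exactly this quadratic form under the hood, one grid point at a time.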
# 95% confidence intervals from predict()
ci_correct_lb <- pred_correct$fit - 1.96 * pred_correct$se.fit
ci_correct_ub <- pred_correct$fit + 1.96 * pred_correct$se.fit
ci_misspec_lb <- pred_misspec$fit - 1.96 * pred_misspec$se.fit
ci_misspec_ub <- pred_misspec$fit + 1.96 * pred_misspec$se.fit
cef_df <- data.frame(
x2 = rep(x2_grid, 2),
yhat = c(as.numeric(yhat_correct), as.numeric(yhat_misspec)),
lb = c(ci_correct_lb, ci_misspec_lb),
ub = c(ci_correct_ub, ci_misspec_ub),
model = rep(c("Correct", "Misspecified (linear)"), each = 200)
)
ggplot() +
# Raw data in the background
geom_point(data = msdat, aes(x = x2, y = y),
alpha = 0.15, size = 0.8, color = "grey50") +
# Predicted CEFs with confidence bands
geom_ribbon(data = cef_df, aes(x = x2, ymin = lb, ymax = ub, fill = model),
alpha = 0.25) +
geom_line(data = cef_df, aes(x = x2, y = yhat, color = model),
linewidth = 1) +
scale_color_manual(values = c("Correct" = "steelblue",
"Misspecified (linear)" = "firebrick")) +
scale_fill_manual(values = c("Correct" = "steelblue",
"Misspecified (linear)" = "firebrick")) +
labs(title = "CEF Comparison: Correct vs. Misspecified Model",
subtitle = expression("Predicted E[y | x"[2]*"] with x"[1]*" and x"[3]*" held at their means"),
x = expression(x[2]), y = expression(hat(y)),
color = "Model", fill = "Model") +
theme_minimal(base_size = 14) +
theme(legend.position = "bottom")
Estimated CEF under the correct and misspecified models. The linear model imposes a straight line where the truth is curved, and its coefficient on x1 is inflated because the omitted interaction is absorbed into the x1 estimate.
ggsave("output/plots/cef_misspecification.pdf", width = 7, height = 5)
Key takeaways:

- Functional misspecification is omitted variable bias in disguise: the "omitted variables" are transformations (squares, interactions) of included regressors.
- The bias spreads beyond the mis-modeled variable — every coefficient correlated with the omitted nonlinear terms is distorted, as we saw with \(x_1\).
- Plot your key relationships (loess vs. linear fit) before committing to a linear additive specification.
So far in this lab we have focused on estimation — obtaining the coefficient vector \(\hat{\beta}\) — and on the conditions under which OLS recovers the true CEF (exogeneity, correct functional form). Notice that none of this required distributional assumptions: we never assumed the errors are normal, homoskedastic, or independent. OLS produces \(\hat{\beta} = (X'X)^{-1}X'y\) regardless of the error distribution; whether \(\hat{\beta}\) is unbiased depends on exogeneity, not on distributional properties.
Inference is a separate step. Once we have \(\hat{\beta}\), we want to know how precise it is: standard errors, confidence intervals, hypothesis tests. This is where the structure of the errors starts to matter. The standard OLS variance formula:
\[ \hat{V}_{\text{OLS}} = \hat{\sigma}^2 (X'X)^{-1}, \quad \text{where} \quad \hat{\sigma}^2 = \frac{\hat{e}'\hat{e}}{n - k} \]
assumes that the errors are independent and have constant variance (homoskedasticity) — i.e., \(\text{Var}(\varepsilon_i \mid X) = \sigma^2\) for all \(i\), and \(\text{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0\) for \(i \neq j\). Under these conditions, the formula above is correct and the usual \(t\)-tests and confidence intervals are valid.
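To see that this formula is exactly what vcov() returns for an lm fit, here is a standalone check on the built-in mtcars data:

```r
# Sketch: the homoskedastic OLS variance formula, computed by hand
mod <- lm(mpg ~ wt + hp, data = mtcars)
X   <- model.matrix(mod)
s2  <- sum(residuals(mod)^2) / (nrow(X) - ncol(X))  # sigma^2_hat = e'e / (n - k)
V   <- s2 * solve(t(X) %*% X)
all.equal(V, vcov(mod))  # TRUE
```

Everything R reports in summary() — standard errors, \(t\)-statistics, \(p\)-values — flows from this matrix, which is why violations of its assumptions contaminate inference but not the coefficients themselves.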
But in many real datasets these conditions fail. Three common violations:

- Heteroskedasticity: the error variance \(\text{Var}(\varepsilon_i \mid X)\) differs across observations.
- Clustered (grouped) errors: observations within the same group — students in a school, respondents in a country — share unobserved shocks, so their errors are correlated.
- Autocorrelation: in time-series and panel data, errors are correlated across adjacent periods.
When any of these hold, the OLS point estimates \(\hat{\beta}\) are still fine, but the standard errors are wrong — typically too small. This leads us to overstate the precision of our estimates and reject null hypotheses too often. The rest of Part 3 addresses clustering. Later parts of the course will tackle autocorrelation in detail.
In many cross-sectional datasets, observations are not independent — they are grouped. Students are nested within schools, workers within firms, survey respondents within countries. If there are unobserved group-level factors that affect the outcome (e.g., school quality, firm culture, national institutions), the errors will be correlated within groups:
\[ \text{Cov}(\varepsilon_i, \varepsilon_j) \neq 0 \quad \text{for } i, j \text{ in the same group} \]
This within-group correlation does not bias the OLS coefficient estimates — \(\hat{\beta}\) is still unbiased and consistent. The problem is entirely about inference: OLS standard errors assume each observation contributes an independent piece of information. When errors are positively correlated within groups, the data contain less independent information than OLS assumes. The result: standard errors that are too small, \(t\)-statistics that are too large, and confidence intervals that are too narrow — so we reject true null hypotheses too often.
The solution is to cluster standard errors by group. This allows for arbitrary within-group error correlation while maintaining the assumption of independence across groups. As we will see, the reliability of this correction depends critically on the number of clusters.
The standard OLS variance assumes errors are independent. The cluster-robust estimator relaxes this: it allows errors within the same group to be correlated in any way, while still assuming independence across groups. It does so using a “sandwich” structure:
\[ \hat{V}_{CL} = \underbrace{(X'X)^{-1}}_{\text{bread}} \left( \sum_{g=1}^{G} X_g' \hat{e}_g \hat{e}_g' X_g \right) \underbrace{(X'X)^{-1}}_{\text{bread}} \]
Let’s unpack each piece in plain terms:
\(g = 1, \dots, G\) indexes the clusters (e.g., classrooms, countries). The data are split into \(G\) groups.
\(X_g\) is the portion of the design matrix belonging to cluster \(g\) — just the rows for observations in that group.
\(\hat{e}_g\) is the vector of OLS residuals for the observations in cluster \(g\).
The bread, \((X'X)^{-1}\): This is the same matrix that appears in the OLS formula. It adjusts for the scale and correlation of the regressors — the same role it plays in estimation.
The meat, \(\sum_g X_g' \hat{e}_g \hat{e}_g' X_g\): This is where the correction happens. It helps to read this from the inside out:
\(\hat{e}_g \hat{e}_g'\) is the outer product of the residuals within cluster \(g\) — an \(n_g \times n_g\) matrix whose \((i,j)\) entry is \(\hat{e}_i \hat{e}_j\). It captures all pairwise products of residuals within the group, without assuming any particular correlation pattern.
\(X_g' \hat{e}_g\) is a \(k \times 1\) vector — the cluster’s score: it sums up, for each coefficient, how much the residuals in that cluster co-move with the regressors. Think of it as cluster \(g\)’s “contribution to estimation uncertainty.” If the residuals within a cluster are all large and same-signed (correlated errors), this vector will be large; if they cancel out (independent errors), it will be small.
\(X_g' \hat{e}_g \hat{e}_g' X_g = (X_g' \hat{e}_g)(X_g' \hat{e}_g)'\) is the outer product of this score with itself — a \(k \times k\) matrix capturing how much cluster \(g\) contributes to the variance of \(\hat{\beta}\). Summing over all \(G\) clusters gives the total meat.
The key insight: if errors are truly independent within clusters, the individual residuals in \(\hat{e}_g\) would tend to cancel when summed in \(X_g' \hat{e}_g\), and the meat would shrink toward the standard OLS variance. When errors are positively correlated within clusters, they reinforce each other, making \(X_g' \hat{e}_g\) large and inflating the variance estimate — which is exactly the correction we need.
The standard OLS variance is a special case: if every cluster contained exactly one observation, \(\hat{e}_g \hat{e}_g'\) would be a \(1 \times 1\) scalar (\(\hat{e}_i^2\)), the meat would reduce to \(X' \text{diag}(\hat{e}^2) X\), and we would get the heteroskedasticity-robust (HC) estimator. Clustering generalizes this by allowing the off-diagonal terms (cross-products of residuals within a group) to be nonzero.
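We can verify this special case directly — a standalone check on the built-in mtcars data, using vcovCL() and vcovHC() from the sandwich package that this lab relies on:

```r
# Sketch: with one observation per cluster, the cluster-robust estimator
# collapses to the heteroskedasticity-robust (HC1) estimator.
library(sandwich)

mod <- lm(mpg ~ wt, data = mtcars)
V_singleton <- vcovCL(mod, cluster = seq_len(nrow(mtcars)), type = "HC1")
V_hc1       <- vcovHC(mod, type = "HC1")
all.equal(V_singleton, V_hc1)  # the two estimators should agree
```

With singleton clusters the off-diagonal residual cross-products vanish by construction, so only the \(\hat{e}_i^2\) terms survive — the HC meat.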
When the number of clusters is large, the sandwich estimator performs well. We simulate a cross-sectional dataset of students nested within 50 classrooms.
Each student’s test score depends on a student-level covariate (e.g., hours studied) and an unobserved classroom effect (teacher quality):
\[ y_{ig} = \beta_0 + \beta_1 x_{ig} + \alpha_g + \varepsilon_{ig} \]
where \(\alpha_g \sim N(0, \sigma_\alpha^2)\) is the classroom effect and \(\varepsilon_{ig} \sim N(0, 1)\) is the student-level error. The classroom effect \(\alpha_g\) is unobserved — it lives in the composite error \(u_{ig} = \alpha_g + \varepsilon_{ig}\), creating within-classroom correlation: any two students in the same classroom share the same \(\alpha_g\), so \(\text{Cov}(u_{ig}, u_{jg}) = \sigma_\alpha^2 \neq 0\).
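A quick standalone sketch of that covariance claim (with \(\sigma_\alpha = 2\), an arbitrary choice): simulate two students per classroom and check that their composite errors covary at \(\sigma_\alpha^2 = 4\).

```r
# Sketch: shared alpha_g => Cov(u_1g, u_2g) = sigma_alpha^2
set.seed(42)
G <- 5000
alpha <- rnorm(G, 0, 2)   # one classroom effect per classroom
u1 <- alpha + rnorm(G)    # student 1's composite error
u2 <- alpha + rnorm(G)    # student 2's composite error
cov(u1, u2)               # ~ 4 = sigma_alpha^2
```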
Is this omitted variable bias? Technically, \(\alpha_g\) is an omitted variable that affects \(y\). But in our DGP, \(\alpha_g\) is drawn independently of the regressor hours — better teachers do not systematically get students who study more or less. So \(E[\alpha_g \mid x_{ig}] = 0\), and the OVB formula gives zero bias: \(\hat{\beta}_1\) is unbiased. The only problem is that \(\alpha_g\) induces correlated errors within classrooms, which makes the OLS standard errors too small. This example deliberately isolates the inference problem from the estimation problem.

If \(\alpha_g\) were correlated with hours — e.g., better teachers also assign more homework — we would have both OVB and clustering, and the coefficient itself would be biased. In practice the two problems often co-occur, but it is important to understand them separately.
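To see the contrast, here is a standalone sketch in which teacher quality does feed into hours (the 0.5 loading is an arbitrary choice); now the point estimate itself is off, not just its standard error:

```r
# Sketch: alpha correlated with hours => OVB on top of clustering
set.seed(21)
G <- 50; m <- 20
alpha <- rep(rnorm(G, 0, 4), each = m)
hours <- rnorm(G * m, 5, 2) + 0.5 * alpha   # alpha now drives hours too
score <- 70 + 3 * hours + alpha + rnorm(G * m, 0, 3)
coef(lm(score ~ hours))["hours"]            # well above the true 3: biased
```

No standard-error correction can rescue this — clustering fixes inference, not endogeneity.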
set.seed(512)
# 50 classrooms, 20 students each
G_many <- 50
n_per <- 20
N_many <- G_many * n_per
# True parameters
beta_0_cl <- 70 # average test score
beta_1_cl <- 3 # effect of hours studied
# Classroom IDs
classroom <- rep(paste0("Class_", sprintf("%02d", 1:G_many)), each = n_per)
# Student-level covariate: hours studied
# Includes a classroom-level component (some classrooms assign more homework),
# so hours is partially shared within classrooms — realistic and important
# for making clustering affect the slope SE, not just the intercept
hours_group <- rep(rnorm(G_many, mean = 0, sd = 1.5), each = n_per)
hours <- rnorm(N_many, mean = 5, sd = 2) + hours_group
# Classroom-level effect: unobserved teacher quality
# Each classroom gets one draw; all students in that classroom share it
# (drawn independently of hours_group — so OLS is unbiased, but SEs are wrong)
alpha <- rep(rnorm(G_many, mean = 0, sd = 4), each = n_per)
# Student-level noise
epsilon <- rnorm(N_many, mean = 0, sd = 3)
# Outcome: test score
score <- beta_0_cl + beta_1_cl * hours + alpha + epsilon
# Assemble the dataset
students <- data.frame(score, hours, classroom)
Before modeling, let’s visualize the clustering structure. If there is within-classroom correlation, we should see that classroom means vary substantially:
# EDA: do scores cluster by classroom?
students |>
group_by(classroom) |>
summarize(mean_score = mean(score), mean_hours = mean(hours), n = n()) |>
ggplot(aes(x = mean_hours, y = mean_score)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "grey50") +
labs(title = "Classroom-Level Averages",
subtitle = "Each point = one classroom; spread reflects unobserved classroom effects",
x = "Mean hours studied", y = "Mean test score") +
theme_minimal(base_size = 13)
The classroom means are dispersed — some classrooms score much higher or lower than others, even at similar average study hours. This is the unobserved teacher quality (\(\alpha_g\)) at work, and it is exactly the within-group correlation that OLS standard errors will miss.
mod_many <- lm(score ~ hours, data = students)
summary(mod_many)
##
## Call:
## lm(formula = score ~ hours, data = students)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2117 -3.0904 -0.1897 2.7914 15.3904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.34653 0.30236 232.66 <2e-16 ***
## hours 2.97560 0.05598 53.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.461 on 998 degrees of freedom
## Multiple R-squared: 0.739, Adjusted R-squared: 0.7387
## F-statistic: 2825 on 1 and 998 DF, p-value: < 2.2e-16
The coefficient on hours should be close to \(3\) — OLS is unbiased. But the standard errors treat all 1000 students as independent, ignoring the within-classroom correlation introduced by \(\alpha_g\).
# Extract the design matrix and residuals
X_mat <- model.matrix(mod_many) # N x 2 (intercept + hours)
e_hat <- residuals(mod_many) # N x 1
k <- ncol(X_mat) # number of coefficients
# The "bread": (X'X)^{-1}
bread <- solve(t(X_mat) %*% X_mat)
# The "meat": sum over clusters of X_g' e_g e_g' X_g
meat <- matrix(0, k, k)
for (g in unique(students$classroom)) {
# Identify rows belonging to this cluster
idx <- which(students$classroom == g)
# Subset the design matrix and residual vector for this cluster
Xg <- X_mat[idx, , drop = FALSE] # n_g x k
eg <- e_hat[idx] # n_g x 1
# The outer product eg %*% t(eg) is n_g x n_g — it captures
# all pairwise residual products within the cluster
meat <- meat + t(Xg) %*% (eg %o% eg) %*% Xg
}
# Assemble the sandwich: bread x meat x bread
V_cl_hand <- bread %*% meat %*% bread
# Small-sample correction (HC1-style for clusters)
# Adjusts for finite G and degrees of freedom
V_cl_hand <- V_cl_hand * ((N_many - 1) / (N_many - k)) * (G_many / (G_many - 1))
# Clustered SEs
se_cl_hand <- sqrt(diag(V_cl_hand))
The sandwich package: In practice, use vcovCL() — it does exactly what we just built, in one line:
V_cl_pkg <- vcovCL(mod_many, cluster = students$classroom, type = "HC1")
# Verify: our by-hand calculation matches
data.frame(
By_Hand = se_cl_hand,
Package = sqrt(diag(V_cl_pkg))
)
## By_Hand Package
## (Intercept) 0.5666881 0.5666881
## hours 0.1257683 0.1257683
V_ols <- vcov(mod_many)
data.frame(
OLS_SE = sqrt(diag(V_ols)),
Cluster_SE = sqrt(diag(V_cl_pkg)),
Ratio = sqrt(diag(V_cl_pkg)) / sqrt(diag(V_ols))
)
## OLS_SE Cluster_SE Ratio
## (Intercept) 0.30236270 0.5666881 1.87420
## hours 0.05598211 0.1257683 2.24658
The clustered SEs are substantially larger — reflecting the within-classroom correlation that OLS ignores. The intercept SE increases the most because \(\alpha_g\) shifts the mean score for entire classrooms.
# OLS inference (standard)
coeftest(mod_many)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.346534 0.302363 232.656 < 2.2e-16 ***
## hours 2.975602 0.055982 53.153 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Corrected inference (clustered by classroom)
coeftest(mod_many, vcov. = V_cl_pkg)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.34653 0.56669 124.136 < 2.2e-16 ***
## hours 2.97560 0.12577 23.659 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient estimates are identical — clustering only changes the standard errors, \(t\)-statistics, and \(p\)-values. With 50 clusters, the sandwich estimator is reliable and the correction is well-behaved.
ct_many <- coeftest(mod_many, vcov. = V_cl_pkg)
show_reg(list(mod_many, mod_many),
custom.model.names = c("OLS SEs", "Clustered SEs"),
override.se = list(sqrt(diag(V_ols)),
sqrt(diag(V_cl_pkg))),
override.pvalues = list(coef(summary(mod_many))[, 4],
ct_many[, 4]),
caption = "OLS vs. clustered standard errors (50 classrooms)",
caption.above = TRUE)
|  | OLS SEs | Clustered SEs |
|---|---|---|
| (Intercept) | 70.35*** | 70.35*** |
|  | (0.30) | (0.57) |
| hours | 2.98*** | 2.98*** |
|  | (0.06) | (0.13) |
| R2 | 0.74 | 0.74 |
| Adj. R2 | 0.74 | 0.74 |
| Num. obs. | 1000 | 1000 |
| ***p < 0.001; **p < 0.01; *p < 0.05 |  |  |
The sandwich estimator relies on the law of large numbers applied to the cluster-level contributions to the meat. When the number of clusters \(G\) is small, this approximation is poor — the estimator can be biased downward, producing clustered SEs that are still too small. This matters in practice because many political science and public health datasets have a small number of country- or state-level clusters (often 10–30), putting them squarely in the zone where the standard sandwich may be unreliable.
We now simulate a similar setup but with only 8 countries:
set.seed(512)
# 8 countries, 150 individuals each
G_few <- 8
n_per_c <- 150
N_few <- G_few * n_per_c
# Country IDs
country <- rep(paste0("Country_", LETTERS[1:G_few]), each = n_per_c)
# Individual covariate — includes a country-level component
x_group_c <- rep(rnorm(G_few, mean = 0, sd = 1), each = n_per_c)
x_few <- rnorm(N_few, mean = 0, sd = 1) + x_group_c
# Country-level effect (large, creating strong within-country correlation)
# Drawn independently of x_group_c
alpha_c <- rep(rnorm(G_few, mean = 0, sd = 3), each = n_per_c)
# Individual error
eps_c <- rnorm(N_few, mean = 0, sd = 2)
# Outcome
y_few <- 5 + 2 * x_few + alpha_c + eps_c
country_dat <- data.frame(y = y_few, x = x_few, country)
mod_few <- lm(y ~ x, data = country_dat)
V_ols_few <- vcov(mod_few)
V_cl_few <- vcovCL(mod_few, cluster = country_dat$country, type = "HC1")
data.frame(
OLS_SE = sqrt(diag(V_ols_few)),
Cluster_SE = sqrt(diag(V_cl_few)),
Ratio = sqrt(diag(V_cl_few)) / sqrt(diag(V_ols_few))
)
## OLS_SE Cluster_SE Ratio
## (Intercept) 0.09921432 1.0560607 10.644236
## x 0.07611123 0.5878487 7.723548
coeftest(mod_few)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.419880 0.099214 44.549 < 2.2e-16 ***
## x 1.450636 0.076111 19.059 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(mod_few, vcov. = V_cl_few)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.41988 1.05606 4.1853 3.057e-05 ***
## x 1.45064 0.58785 2.4677 0.01374 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With only 8 clusters, the sandwich estimator is formed from just 8 terms in the meat — a small sample from which to estimate a covariance matrix. The clustered SEs may still be too small, leading to over-rejection. Researchers typically recommend a minimum of 30–50 clusters for the standard sandwich to be reliable.
What to do with few clusters: Several alternatives exist, including bias-reduced linearization (BRL), the wild cluster bootstrap (fwildclusterboot package in R), and degrees-of-freedom adjustments. These are beyond the scope of this lab, but be aware that simply plugging in vcovCL() with \(G < 20\) may not fully solve the problem.
To build intuition, let’s simulate the same DGP many times but vary the number of clusters, and track how well the clustered SE approximates the true standard error:
set.seed(2026)
n_sims <- 500
G_values <- c(5, 8, 15, 25, 50, 100)
n_per_g <- 30 # observations per cluster (held fixed)
true_beta <- 2
results <- list()
for (G in G_values) {
N <- G * n_per_g
beta_hats <- numeric(n_sims)
se_ols <- numeric(n_sims)
se_cl <- numeric(n_sims)
for (s in 1:n_sims) {
# Simulate data with group structure
group <- rep(1:G, each = n_per_g)
x_grp <- rep(rnorm(G, 0, 1), each = n_per_g) # group-level x component
x <- rnorm(N) + x_grp # x is partially shared within groups
alpha <- rep(rnorm(G, 0, 3), each = n_per_g) # group effect (indep. of x_grp)
y <- 1 + true_beta * x + alpha + rnorm(N)
mod <- lm(y ~ x)
beta_hats[s] <- coef(mod)["x"]
se_ols[s] <- sqrt(vcov(mod)["x", "x"])
se_cl[s] <- sqrt(vcovCL(mod, cluster = group, type = "HC1")["x", "x"])
}
# True SE = SD of beta_hat across simulations
true_se <- sd(beta_hats)
results[[as.character(G)]] <- data.frame(
G = G,
mean_se_ols = mean(se_ols),
mean_se_cl = mean(se_cl),
true_se = true_se,
ratio_ols = mean(se_ols) / true_se,
ratio_cl = mean(se_cl) / true_se
)
}
sim_df <- bind_rows(results)
# Plot: SE / True SE as a function of G
sim_long <- sim_df |>
select(G, ratio_ols, ratio_cl) |>
pivot_longer(cols = c(ratio_ols, ratio_cl),
names_to = "type", values_to = "ratio") |>
mutate(type = ifelse(type == "ratio_ols", "OLS SE", "Clustered SE"))
ggplot(sim_long, aes(x = G, y = ratio, color = type, shape = type)) +
geom_point(size = 3) +
geom_line(linewidth = 0.8) +
geom_hline(yintercept = 1, linetype = "dashed", color = "grey40") +
scale_x_continuous(breaks = G_values) +
labs(title = "Standard Error Accuracy by Number of Clusters",
subtitle = paste0("Based on ", n_sims, " simulations per G value (",
n_per_g, " obs per cluster)"),
x = "Number of clusters (G)",
y = "Mean estimated SE / True SE",
color = NULL, shape = NULL) +
theme_minimal(base_size = 13) +
theme(legend.position = "bottom")
As the number of clusters increases, the ratio of the clustered SE to the true SE converges to 1. With fewer than ~20 clusters, the sandwich estimator tends to underestimate the true standard error.
ggsave("output/plots/cluster_tradeoff.pdf", width = 7, height = 4.5)
Key takeaways:

- Within-group error correlation leaves \(\hat{\beta}\) unbiased but makes conventional OLS standard errors too small.
- The cluster-robust sandwich estimator allows arbitrary error correlation within clusters while assuming independence across them.
- Its reliability depends on the number of clusters: with fewer than roughly 20–30 clusters, even clustered SEs tend to understate the true uncertainty.
In this section, you will apply the concepts from today’s lab to a real political economy dataset. The data come from Franzese’s study of electoral and partisan cycles in public debt (see Franzese, 2002). The observations are country-years — a grouped (clustered) data structure where each country contributes multiple observations. This is exactly the kind of data where omitted variable bias, functional misspecification, and within-group error correlation are all relevant concerns.
| Variable | Description |
|---|---|
| debtx | Public debt as a fraction of GDP |
| growth | Real GDP growth rate |
| ue | Unemployment rate |
| cog | Partisan center of gravity (left-right position of government) |
| enop | Effective number of parties in government (party-system fractionalization) |
Load the dataset and get to know it before modeling. Good EDA is your first defense against misspecification, missing data, and surprises in the results.
# Load the data
debt <- read_csv("data/debt_699")
Tasks:

1. Use names() or glimpse() to inspect what variables are available. Then use select() to keep only the five variables listed above (plus the country identifier — you'll need it later for clustering).
2. Use summary() to inspect distributions. Are there missing values? On which variables?
3. Use drop_na() to create a complete-case dataset. How many rows did you lose?
4. Use summarize() to compute the mean and standard deviation of debtx and growth.
5. Count observations per country with group_by() and summarize(n = n()) to check. Is the panel balanced?
6. Plot debtx vs. growth. Add a linear fit (geom_smooth(method = "lm")) and a loess smoother (geom_smooth(method = "loess")). Does the relationship look linear?
7. Compute the correlation matrix with cor() (on the complete-case dataset). Which pairs are most strongly correlated? How might this affect OVB?

# Your code here:
# Tasks 1-4:
# debt |>
# select(...) |>
# drop_na() |>
# summary()
#
# Task 5:
# your_data |> group_by(country_var) |> summarize(n = n())
#
# Task 6:
# ggplot(your_data, aes(x = growth, y = debtx)) +
# geom_point(alpha = 0.3) +
# geom_smooth(method = "lm", color = "red", se = FALSE) +
# geom_smooth(method = "loess", color = "blue", se = FALSE)
#
# Task 7:
# cor(your_data |> select(debtx, growth, ue, cog, enop))
Task 6 is especially important: if the loess and linear fits diverge, that is a visual signal that a quadratic or other nonlinear term might be warranted — exactly the misspecification problem from Part 2.3.
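To see why the lm/loess comparison is diagnostic, here is a minimal sketch on simulated data (not the debt dataset, and all names are hypothetical): when the true CEF is quadratic, the linear fit misses systematically while a quadratic specification captures it.

```r
# Simulated illustration: linear fit vs. the true quadratic CEF
set.seed(42)
x <- runif(500, -2, 2)
y <- 1 + 0.5 * x - 1.2 * x^2 + rnorm(500, sd = 1)
sim <- data.frame(x, y)

fit_lin  <- lm(y ~ x, data = sim)            # misspecified
fit_quad <- lm(y ~ x + I(x^2), data = sim)   # matches the DGP

# The quadratic specification fits far better than the linear one
c(linear = summary(fit_lin)$r.squared, quadratic = summary(fit_quad)$r.squared)

# The same divergence shows up visually with ggplot2:
# ggplot(sim, aes(x, y)) +
#   geom_point(alpha = 0.3) +
#   geom_smooth(method = "lm", se = FALSE, color = "red") +
#   geom_smooth(method = "loess", se = FALSE, color = "blue")
```

In a plot of this simulated data, the loess smoother would trace the parabola while the red linear fit cuts straight through it — the visual signal Task 6 asks you to look for.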
The goal is to examine whether omitting controls biases the estimated effect of economic growth on public debt.
Tasks:

1. Estimate Model A (full): debtx ~ growth + ue + cog + enop
2. Estimate Model B (bivariate): debtx ~ growth
3. Compare the coefficient on growth across the two models using screenreg(). In which direction is the bias? Why might that be, substantively?
4. Bonus: verify the OVB formula by running auxiliary regressions of each omitted variable on growth.

# Your code here:
# Hint: use screenreg(list(model_a, model_b)) for side-by-side comparison
# For the bonus, recall: bias = sum of (beta_omitted * delta_omitted|included)
# where delta comes from regressing each omitted variable on growth
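The bias formula in the hint can be checked exactly on simulated data — a sketch with hypothetical variable names, not a solution to the exercise. With one omitted variable, the in-sample identity is coef_short = coef_long + beta_omitted × delta, where delta comes from regressing the omitted variable on the included one.

```r
# Simulated check of the omitted variable bias identity
set.seed(99)
n  <- 500
x1 <- rnorm(n)                    # included regressor (think: growth)
x2 <- 0.6 * x1 + rnorm(n)         # omitted regressor, correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

long  <- lm(y ~ x1 + x2)          # "full" model
short <- lm(y ~ x1)               # omits x2
aux   <- lm(x2 ~ x1)              # auxiliary regression

bias <- coef(long)["x2"] * coef(aux)["x1"]
coef(short)["x1"] - coef(long)["x1"]   # equals `bias` up to rounding
```

The identity holds exactly in-sample (it is an algebraic property of OLS, not just an approximation), which makes it a useful sanity check for the bonus task.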
The Franzese data are country-years — observations are grouped by country. Unobserved country-level factors (institutions, political culture) affect all years within a country, creating within-group error correlation. Cluster your standard errors to account for this.
Tasks:

1. Identify the country identifier variable — use names(debt) to find it.
2. Compute the cluster-robust variance-covariance matrix for Model A with vcovCL() from the sandwich package.
3. Report the results with coeftest(). How do the clustered standard errors compare to the OLS standard errors? Does any coefficient lose significance?
4. Count the number of clusters (countries). Given the simulation results from Part 3, is that enough clusters for the sandwich estimator to be reliable?

# Your code here:
# Hint: vcovCL(model_a, cluster = your_data$country_var, type = "HC1")
# Then: coeftest(model_a, vcov. = your_clustered_vcov)
# For task 4: length(unique(your_data$country_var))
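The workflow in the hints can be sketched end to end on simulated clustered data (all names here are hypothetical, not from the debt dataset):

```r
# Simulated clustered data: OLS vs. cluster-robust standard errors
library(sandwich)
library(lmtest)

set.seed(7)
G <- 30; n_g <- 20
g <- rep(1:G, each = n_g)              # cluster identifier
u <- rnorm(G)[g]                       # cluster-level error shock
x <- rnorm(G)[g] + rnorm(G * n_g)      # regressor with a cluster component
y <- 1 + 0.5 * x + u + rnorm(G * n_g)

m  <- lm(y ~ x)
vc <- vcovCL(m, cluster = g, type = "HC1")

se_ols  <- sqrt(diag(vcov(m)))["x"]
se_clus <- sqrt(diag(vc))["x"]
c(ols = se_ols, clustered = se_clus)   # clustered SE is typically larger here

coeftest(m, vcov. = vc)                # inference with clustered SEs
length(unique(g))                      # number of clusters (task 4 analogue)
```

Because both x and the errors have a cluster component, the clustered SE typically exceeds the naive OLS SE in this design — the same pattern you should look for in the debt data.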
Check whether the linear specification of Model A is adequate, or whether a nonlinear term is needed.
Tasks:

1. Add a quadratic term in growth (using I(growth^2)) to Model A. Does the quadratic term appear significant?
2. Use predict() to compute predicted values of debtx as a function of growth (holding other covariates at their means) for both the linear and quadratic specifications.
3. Plot both sets of predicted values against growth. Does the linear specification seem adequate?

# Your code here:
# Hint for 1: lm(debtx ~ growth + I(growth^2) + ue + cog + enop, data = ...)
# Hint for 2: create a newdata data.frame with growth varying and other vars at mean
# newdata <- data.frame(growth = seq(...), ue = mean(...), cog = mean(...), enop = mean(...))
# predict(model, newdata = newdata)
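The newdata pattern in the hint can be sketched on simulated data (variable names mirror the debt exercise but everything here — DGP, coefficients, ranges — is hypothetical):

```r
# Simulated sketch: predicted values from linear vs. quadratic specifications
set.seed(21)
n      <- 400
growth <- rnorm(n, 2, 2)
ue     <- rnorm(n, 6, 2)
debtx  <- 50 - 3 * growth + 0.4 * growth^2 + 1.5 * ue + rnorm(n, sd = 5)
d      <- data.frame(debtx, growth, ue)

m_lin  <- lm(debtx ~ growth + ue, data = d)
m_quad <- lm(debtx ~ growth + I(growth^2) + ue, data = d)

# Vary growth over its observed range; hold ue at its mean
newdata <- data.frame(growth = seq(min(d$growth), max(d$growth), length.out = 100),
                      ue     = mean(d$ue))
pred_lin  <- predict(m_lin,  newdata = newdata)
pred_quad <- predict(m_quad, newdata = newdata)
# Plot pred_lin and pred_quad against newdata$growth to compare the fits
```

Note that predict() applies I(growth^2) automatically — newdata only needs the raw variables that appear in the formula, not the constructed terms.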
Practice saving outputs to the output/ subfolders.
Tasks:

1. Use screenreg() to display Models A and B side by side in the console. Then export the same table as a .tex file using texreg() with the file argument: output/tables/debt_models.tex.
2. Save your predicted-values plot with ggsave(): output/plots/cef_debt.pdf.
3. Save the cleaned dataset as output/data/debt_clean.rds using saveRDS().
4. Save Model A as output/models/model_a.rds using saveRDS().

# Your code here:
# Hint for 1:
# screenreg(list(model_a, model_b), custom.model.names = c("Full", "Bivariate"))
# texreg(list(model_a, model_b), file = "output/tables/debt_models.tex")
#
# Hint for 2:
# ggsave("output/plots/cef_debt.pdf", width = 6, height = 6/1.618)
#
# Hint for 3:
# saveRDS(your_clean_data, file = "output/data/debt_clean.rds")
#
# Hint for 4:
# saveRDS(model_a, file = "output/models/model_a.rds")
This appendix covers interaction effects and how to compute and visualize conditional marginal effects step by step. The material is self-contained — work through it on your own after the lab session.
In many social science applications, the effect of one variable on the outcome depends on the level of another variable — for example, the effect of economic openness on growth may depend on a country’s institutional quality, a case we return to below.
The simplest interaction model captures this conditional relationship:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2) + \varepsilon \]
In this model, the marginal effect of \(x_1\) on \(y\) is no longer a single number. It is a function of \(x_2\):
\[ \frac{\partial E[y \mid x_1, x_2]}{\partial x_1} = \beta_1 + \beta_3 x_2 \]
This means \(\beta_1\) alone is not “the effect of \(x_1\).” It is the effect of \(x_1\) only when \(x_2 = 0\), which may not even be a meaningful value in the data. To properly interpret an interaction model, we must compute and visualize the marginal effect across the observed range of the moderating variable.
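For the simple linear interaction, the computation is a minimal sketch on simulated data (names and true coefficients here are illustrative): the marginal effect is \(\hat{\beta}_1 + \hat{\beta}_3 x_2\), with variance \(\text{Var}(\hat{\beta}_1) + x_2^2 \text{Var}(\hat{\beta}_3) + 2 x_2 \text{Cov}(\hat{\beta}_1, \hat{\beta}_3)\).

```r
# Simulated sketch: conditional marginal effect for a linear interaction
set.seed(11)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 1 * x2 + 2 * x1 * x2 + rnorm(n)

m <- lm(y ~ x1 * x2)           # expands to x1 + x2 + x1:x2
B <- coef(m); V <- vcov(m)

x2_grid <- seq(-2, 2, length.out = 100)
me <- B["x1"] + B["x1:x2"] * x2_grid                 # point estimates
se <- sqrt(V["x1", "x1"] +
           x2_grid^2 * V["x1:x2", "x1:x2"] +
           2 * x2_grid * V["x1", "x1:x2"])           # delta-method SE
# 95% CI: me - 1.96 * se and me + 1.96 * se, plotted against x2_grid
```

This is the same recipe the appendix walks through below for the quadratic interaction, just with one fewer term in the effect and the variance.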
However, the simple linear interaction produces a marginal effect that is always a straight line — it increases or decreases monotonically with \(x_2\). In many real applications, the conditional effect is nonlinear: the effect of \(x_1\) may be strongest at moderate values of \(x_2\) and weaker (or even reversed) at the extremes. To capture this, we can include a quadratic interaction:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2 + \beta_4 (x_1 \times x_2) + \beta_5 (x_1 \times x_2^2) + \varepsilon \]
Now the marginal effect of \(x_1\) is:
\[ \frac{\partial E[y \mid x_1, x_2]}{\partial x_1} = \beta_1 + \beta_4 x_2 + \beta_5 x_2^2 \]
This is a parabola in \(x_2\) — it can peak at intermediate values and cross zero at both ends, producing a much richer and more substantively interesting picture. Think of a context like: the effect of economic openness on growth is positive at moderate levels of institutional quality, but diminishes in both very weak states (which cannot absorb the benefits) and very strong states (which have already captured them).
We simulate a DGP where the marginal effect of \(x_1\) follows an inverted-U shape in \(x_2\):
\[ y = 1 + 2 x_1 + 0.5 x_2 - 0.3 x_2^2 + 1.5 (x_1 \times x_2) - 0.8 (x_1 \times x_2^2) + \varepsilon \]
The marginal effect of \(x_1\) is therefore: \(\frac{\partial y}{\partial x_1} = 2 + 1.5 x_2 - 0.8 x_2^2\)
This is a downward-opening parabola. It is positive for a range of \(x_2\) values and turns negative at the extremes, crossing zero at approximately \(x_2 \approx -0.9\) and \(x_2 \approx 2.8\).
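The zero crossings can be verified numerically with base R’s polyroot(), which takes polynomial coefficients in increasing order of power:

```r
# Roots of the true marginal effect 2 + 1.5 * x2 - 0.8 * x2^2
roots <- sort(Re(polyroot(c(2, 1.5, -0.8))))
round(roots, 3)   # approximately -0.901 and 2.776
```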
set.seed(512)
b0 <- 1; b1 <- 2; b2 <- 0.5; b3 <- -0.3; b4 <- 1.5; b5 <- -0.8
nobs <- 300
vcv <- rbind(c(1.0, 0.3), c(0.3, 1.0))
X_int <- mvrnorm(n = nobs, mu = c(0, 0), Sigma = vcv)
x1 <- X_int[, 1]; x2 <- X_int[, 2]
e <- rnorm(nobs, mean = 0, sd = 2)
y <- b0 + b1*x1 + b2*x2 + b3*x2^2 + b4*x1*x2 + b5*x1*x2^2 + e
intdat <- data.frame(y, x1, x2)
rm(X_int, x1, x2, e, y, vcv, b0, b1, b2, b3, b4, b5, nobs)
We use I() to include the squared term and
: for interactions:
fit_int <- lm(y ~ x1 + x2 + I(x2^2) + x1:x2 + x1:I(x2^2), data = intdat)
summary(fit_int)
##
## Call:
## lm(formula = y ~ x1 + x2 + I(x2^2) + x1:x2 + x1:I(x2^2), data = intdat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7123 -1.3581 0.0417 1.4456 4.9332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.06932 0.15028 7.116 8.54e-12 ***
## x1 2.05733 0.15646 13.150 < 2e-16 ***
## x2 0.61518 0.14460 4.254 2.82e-05 ***
## I(x2^2) -0.31462 0.11578 -2.717 0.00697 **
## x1:x2 1.43932 0.12630 11.396 < 2e-16 ***
## x1:I(x2^2) -0.69758 0.09934 -7.022 1.52e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.048 on 294 degrees of freedom
## Multiple R-squared: 0.5232, Adjusted R-squared: 0.5151
## F-statistic: 64.53 on 5 and 294 DF, p-value: < 2.2e-16
Important: Use R’s formula syntax rather than
creating interaction terms manually. The I() function tells
R to interpret the expression arithmetically (squaring \(x_2\)) rather than as a formula
operator.
\[ \frac{\partial \hat{y}}{\partial x_1} = \hat{\beta}_{x_1} + \hat{\beta}_{x_1:x_2} \cdot x_2 + \hat{\beta}_{x_1:x_2^2} \cdot x_2^2 \]
B <- coef(fit_int)
names(B)
## [1] "(Intercept)" "x1" "x2" "I(x2^2)" "x1:x2"
## [6] "x1:I(x2^2)"
x2_vals <- seq(min(intdat$x2), max(intdat$x2), length.out = 200)
me_x1 <- B["x1"] + B["x1:x2"] * x2_vals + B["x1:I(x2^2)"] * x2_vals^2
Quick plot to check the shape:
plot(x2_vals, me_x1, type = "l", lwd = 2,
xlab = "x2", ylab = "Marginal effect of x1",
main = "ME of x1 conditional on x2 (point estimates)")
abline(h = 0, lty = 2, col = "red")
The variance of the conditional marginal effect involves all pairwise variances and covariances among the three relevant coefficients:
\[ \text{Var}\left(\frac{\partial \hat{y}}{\partial x_1}\right) = \text{Var}(\hat{\beta}_1) + x_2^2 \text{Var}(\hat{\beta}_4) + x_2^4 \text{Var}(\hat{\beta}_5) + 2 x_2 \text{Cov}(\hat{\beta}_1, \hat{\beta}_4) + 2 x_2^2 \text{Cov}(\hat{\beta}_1, \hat{\beta}_5) + 2 x_2^3 \text{Cov}(\hat{\beta}_4, \hat{\beta}_5) \]
V <- vcov(fit_int)
c1 <- "x1"; c4 <- "x1:x2"; c5 <- "x1:I(x2^2)"
var_me <- V[c1,c1] +
x2_vals^2 * V[c4,c4] +
x2_vals^4 * V[c5,c5] +
2 * x2_vals * V[c1,c4] +
2 * x2_vals^2 * V[c1,c5] +
2 * x2_vals^3 * V[c4,c5]
se_x1 <- sqrt(var_me)
lb <- me_x1 - qnorm(0.975) * se_x1
ub <- me_x1 + qnorm(0.975) * se_x1
me_df <- data.frame(x2 = x2_vals, me = me_x1, se = se_x1, lb, ub)
ggplot(me_df, aes(x = x2)) +
geom_ribbon(aes(ymin = lb, ymax = ub), fill = "grey80", alpha = 0.5) +
geom_line(aes(y = me), linewidth = 1) +
geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
labs(title = "Marginal Effect of x1 on y",
subtitle = "Conditional on x2, with 95% confidence interval",
x = "x2", y = "Estimated marginal effect of x1") +
theme_minimal(base_size = 14)
Marginal effect of x1 on y, conditional on x2, with 95% CI. The inverted-U shape shows the effect is strongest at moderate x2 and reverses at the extremes.
ggsave("output/plots/marginal_effect_x1.pdf", width = 6, height = 6 / 1.618)
How to read this plot:

- The solid line is the point estimate of the marginal effect of \(x_1\) at each value of \(x_2\).
- The shaded ribbon is the 95% confidence interval around that estimate.
- Wherever the ribbon excludes the dashed zero line, the marginal effect is statistically distinguishable from zero at that value of \(x_2\).
- The inverted-U shape matches the DGP: the effect of \(x_1\) is strongest at moderate \(x_2\) and reverses at the extremes.
To use robust or clustered SEs, simply swap the variance-covariance matrix in Step 2:
V_robust <- vcovHC(fit_int, type = "HC1")
var_me_robust <- V_robust[c1,c1] +
x2_vals^2 * V_robust[c4,c4] +
x2_vals^4 * V_robust[c5,c5] +
2 * x2_vals * V_robust[c1,c4] +
2 * x2_vals^2 * V_robust[c1,c5] +
2 * x2_vals^3 * V_robust[c4,c5]
se_x1_robust <- sqrt(var_me_robust)
me_df$lb_robust <- me_x1 - qnorm(0.975) * se_x1_robust
me_df$ub_robust <- me_x1 + qnorm(0.975) * se_x1_robust
ggplot(me_df, aes(x = x2)) +
geom_ribbon(aes(ymin = lb_robust, ymax = ub_robust), fill = "steelblue", alpha = 0.3) +
geom_ribbon(aes(ymin = lb, ymax = ub), fill = "grey70", alpha = 0.3) +
geom_line(aes(y = me), linewidth = 1) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Marginal Effect of x1 on y",
subtitle = "Grey = OLS SEs | Blue = HC-Robust SEs",
x = "x2", y = "Estimated marginal effect of x1") +
theme_minimal(base_size = 14)
Comparing OLS-based (grey) and HC-robust (blue) confidence intervals.
The same approach works with vcovCL() for clustered SEs
— just plug in the cluster-robust vcov matrix. This flexibility is one
reason why computing marginal effects by hand is valuable: you can use
any variance-covariance estimator.
This appendix covers simultaneity — the fourth classical source of endogeneity. Work through it on your own after the lab session.
Simultaneity arises when two variables are jointly determined — each is both a cause and an effect of the other. In such cases, regressing \(y\) on \(x\) with OLS is invalid because \(x\) is endogenous: it depends on \(y\), which means it is correlated with \(\varepsilon\).
Classic examples in social science:
In each case, regressing one variable on the other with OLS conflates the two causal directions, and the coefficient estimate is a mix of both — not a clean estimate of either.
Consider the following structural system:
\[ \begin{cases} y = 0.5x + z + \varepsilon_1 \\ x = 0.5y + w + \varepsilon_2 \end{cases} \]
where \(z\) and \(w\) are exogenous variables, and \(\varepsilon_1, \varepsilon_2\) are independent errors. If we naively regress \(y\) on \(x\) and \(z\), the estimate of \(0.5\) will be biased because \(x\) is a function of \(y\) (and therefore of \(\varepsilon_1\)).
To obtain unbiased estimates, we need the reduced form: we solve the system algebraically to express each endogenous variable solely as a function of exogenous variables and errors:
\[ y = \frac{4}{3}z + \frac{2}{3}w + \frac{4}{3}\varepsilon_1 + \frac{2}{3}\varepsilon_2 \]
\[ x = \frac{2}{3}z + \frac{4}{3}w + \frac{2}{3}\varepsilon_1 + \frac{4}{3}\varepsilon_2 \]
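To see where these coefficients come from, substitute the \(x\) equation into the \(y\) equation and solve:

\[ y = 0.5(0.5y + w + \varepsilon_2) + z + \varepsilon_1 = 0.25y + z + 0.5w + \varepsilon_1 + 0.5\varepsilon_2 \]

\[ \Rightarrow\ 0.75\,y = z + 0.5w + \varepsilon_1 + 0.5\varepsilon_2 \quad\Rightarrow\quad y = \tfrac{4}{3}z + \tfrac{2}{3}w + \tfrac{4}{3}\varepsilon_1 + \tfrac{2}{3}\varepsilon_2 \]

Substituting this expression for \(y\) back into the \(x\) equation yields the reduced form for \(x\) in the same way.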
set.seed(123)
n <- 1000
z <- rnorm(n, 0, 3)
w <- rnorm(n, 0, 3)
e1 <- rnorm(n, 0, 2)
e2 <- rnorm(n, 0, 2)
# Generate data from the reduced form
y_sim <- (4/3)*z + (2/3)*w + (4/3)*e1 + (2/3)*e2
x_sim <- (2/3)*z + (4/3)*w + (2/3)*e1 + (4/3)*e2
Compare the naive structural regression against the reduced form:
# Naive OLS on the structural equation (biased)
naive_model <- lm(y_sim ~ x_sim + z)
# Reduced form regression on exogenous variables only (unbiased)
rf_model <- lm(y_sim ~ w + z)
# Compare side by side
screenreg(list(naive_model, rf_model),
custom.model.names = c("Naive (Structural)", "Reduced Form"))
##
## =============================================
## Naive (Structural) Reduced Form
## ---------------------------------------------
## (Intercept) -0.05 -0.07
## (0.06) (0.09)
## x_sim 0.62 ***
## (0.01)
## z 0.89 *** 1.31 ***
## (0.02) (0.03)
## w 0.69 ***
## (0.03)
## ---------------------------------------------
## R^2 0.88 0.70
## Adj. R^2 0.88 0.70
## Num. obs. 1000 1000
## =============================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
The true effect of \(x\) on \(y\) is \(0.5\), but the naive estimate is biased upward. The reduced form recovers the true parameters:
data.frame(
Variable = c("w", "z"),
Truth = round(c(2/3, 4/3), 3),
Estimate = round(coef(rf_model)[2:3], 3)
)
## Variable Truth Estimate
## w w 0.667 0.688
## z z 1.333 1.313
Key takeaway: When simultaneity is present, OLS on the structural equation is biased and inconsistent. The reduced form — regressing \(y\) on exogenous variables only — recovers unbiased parameters. In practice, this motivates the use of instrumental variables (IV), where we use exogenous variation in \(w\) to isolate the causal effect of \(x\) on \(y\).