class: center, middle, inverse, title-slide .title[ # CSSS/POLS 512 ] .subtitle[ ## Lab 1: Linear Regression Review — Endogeneity, Clustering, and Inference ] .author[ ### Ramses Llobet ] .date[ ### Spring 2026 ] --- # Welcome! - Welcome to **CS&SS / POLS 512** lab sections! - I am **Ramses Llobet** ([rllobet@uw.edu](mailto:rllobet@uw.edu)) - Ph.D. candidate in Political Science, UW - Research: political economy and applied statistics -- - **Please** do not hesitate to stop me if you don't hear or understand me - **Please** ask questions. No question is silly. --- # Quick Logistics .pull-left[ **Lab sessions** - Fridays 1:30 - 3:20 pm (Zoom) - Materials on course website (`.zip`) - Recordings available afterward **Office hours** - By appointment (Zoom) - Email me with topic + availability ] .pull-right[ **Communication** - **Slack** is preferred for coding Qs - Post a *minimal reproducible example* - Not screenshots of code **Homework** - 3 problem sets, submit PDF on Canvas - Use LaTeX (Overleaf) or RMarkdown - Format: `CSSS512HW1NameSurname` ] --- # Today's Plan .pull-left[ **Part 1: R Refresher & OLS Review** - The CEF and the linearity assumption - What the OLS formula actually does - dplyr essentials, project organization **Part 2: Endogeneity Problems** - 2.1 Omitted variable bias - 2.2 Measurement error - 2.3 Functional misspecification ] .pull-right[ **Part 3: Clustered Standard Errors** - Why clustering matters - The sandwich estimator (by hand + `sandwich`) - Many vs. few clusters **Part 4: Practice** - Franzese public debt data **Self-study (Appendix):** - A: Interaction effects - B: Simultaneity bias ] --- class: inverse, center, middle # Part 1: What Does Regression Estimate? --- # The Conditional Expectation Function Our goal in most empirical work is to estimate the **CEF**: `$$\mu(x) = E[y_i \mid x_i]$$` The average value of `\(y\)` we would observe at a given value of `\(x\)`.
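With a discrete regressor, the sample CEF is just a set of group means, no model required. A minimal sketch with simulated (hypothetical) data:

```r
# Hypothetical simulation: with discrete x, the empirical CEF is a group mean
set.seed(512)
x <- sample(1:5, 1000, replace = TRUE)   # discrete regressor
y <- 2 + 3 * x + rnorm(1000)             # true CEF: E[y | x] = 2 + 3x
cef_hat <- tapply(y, x, mean)            # sample analog of E[y | x = k]
cef_hat                                  # close to 5, 8, 11, 14, 17
```

Each entry estimates `\(E[y \mid x = k]\)` directly, with no functional-form assumption.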
-- The **linear regression model** assumes the CEF is linear: `$$E[y_i \mid x_i] = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}$$` This is a strong assumption: a one-unit change in `\(x_1\)` always shifts `\(E[y]\)` by `\(\beta_1\)`, regardless of the level of `\(x_1\)` or other covariates. If the true CEF is nonlinear, **the linear model is misspecified**. -- `$$y_i = X_i'\beta + \varepsilon_i, \quad E[\varepsilon_i \mid x_i] = 0$$` This last condition — **exogeneity** — is what makes OLS work. --- # What Does the OLS Formula Do? `$$\hat{\beta} = (X'X)^{-1}X'y$$` -- | Piece | Matrix | What it captures | |:------|:-------|:-----------------| | Regressor variation | `\(X'X\)` | How much each `\(x\)` varies and how much they overlap | | Regressor-outcome covariation | `\(X'y\)` | How much each `\(x\)` co-moves with `\(y\)` | -- **Covariance vs. correlation:** Same information, different scale. `$$\Sigma = \begin{pmatrix} \text{Var}(y) & \text{Cov}(y, x_1) \\ \text{Cov}(x_1, y) & \text{Var}(x_1) \end{pmatrix} \quad \xrightarrow{\text{rescale by SDs}} \quad R = \begin{pmatrix} 1 & \text{Cor}(y, x_1) \\ \text{Cor}(x_1, y) & 1 \end{pmatrix}$$` Covariance preserves units; correlation standardizes to `\([-1, 1]\)`. OLS works with covariances. --- # What Does the OLS Formula Do? (cont.) **OLS asks:** *"How much does each regressor co-move with `\(y\)`, after removing the overlap among regressors?"* Without `\((X'X)^{-1}\)`, the raw `\(X'y\)` double-counts shared variation. The inversion strips out overlap so each `\(\hat{\beta}\)` reflects only its regressor's *independent* contribution. -- OLS finds the **best linear approximation** to the CEF — a projection onto the space you *specify*, not the space the data *need*. 
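A quick sanity check in R, with simulated (hypothetical) data: the matrix formula reproduces `lm()`, and the residuals come out orthogonal to the regressors by construction.

```r
# Hypothetical data: compute (X'X)^{-1} X'y by hand and compare with lm()
set.seed(512)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)                      # overlapping regressors
y  <- 1 + 2 * x1 - x2 + rnorm(n)
X  <- cbind(1, x1, x2)                         # design matrix with intercept
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
all.equal(as.numeric(beta_hat), unname(coef(lm(y ~ x1 + x2))))  # TRUE

e_hat <- y - X %*% beta_hat
round(crossprod(X, e_hat), 10)                 # X'e = 0, mechanically
```

Note that `\(X'\hat{e} = 0\)` holds here even though the data are simulated; it is a property of the fit, not evidence of correct specification.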
-- **Key distinction:** | | What it says | Status | |:--|:------------|:-------| | `\(X'\hat{e} = 0\)` | Residuals uncorrelated with regressors | **Result** — OLS forces this mechanically | | `\(E[\varepsilon_i \mid x_i] = 0\)` | Population errors uncorrelated with regressors | **Assumption** — can be violated | Every OLS model satisfies the first; only correctly specified models satisfy the second. .center[**→ Switch to RMarkdown: Part 1 (OLS review, dplyr, project setup)**] --- class: inverse, center, middle # Part 2: What Can Go Wrong? --- # When Exogeneity Fails When `\(E[\varepsilon_i \mid x_i] \neq 0\)`, the estimated regression line **no longer traces the CEF**. -- **"Endogeneity"** is an **umbrella term** — it tells you the mean independence assumption is violated, but not *why*. Different sources → different biases → different remedies. Be specific: say "OVB" or "measurement error," not just "endogeneity." -- Four forms we cover (+ simultaneity in self-study): | Source | Mechanism | Effect on `\(\hat{\beta}\)` | |--------|-----------|----------------| | **Omitted variables** | Relevant `\(x\)` left out, absorbed into `\(\varepsilon\)` | Depends on signs | | **Measurement error** | True `\(x\)` observed with noise | Classical: toward zero; non-classical: unpredictable | | **Functional misspecification** | True CEF nonlinear, linear model omits `\(x^2\)` etc. | OVB via omitted transformation | | *Simultaneity (Appendix B)* | `\(x\)` and `\(y\)` jointly determined | Depends on system | --- # Omitted Variable Bias True model: `\(\quad y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\)` We estimate: `\(\quad y = \beta_0 + \beta_1 x_1 + \varepsilon^*\)` -- `$$\hat{\beta}_1^{\text{short}} \xrightarrow{p} \beta_1 + \underbrace{\beta_2 \cdot \delta_{x_2 \mid x_1}}_{\text{bias}}$$` The bias is the product of: 1. **How much the omitted variable affects `\(y\)`** → `\(\beta_2\)` 2. 
**How correlated the omitted variable is with the included regressor** → `\(\delta_{x_2 \mid x_1}\)` (slope from regressing `\(x_2\)` on `\(x_1\)`) If `\(\beta_2 = 0\)` or `\(\delta = 0\)` → no bias. Both conditions must hold. -- .center[**→ Switch to RMarkdown: Part 2.1 (OVB simulation)**] --- # Measurement Error We observe a noisy version of `\(x_2\)`: `\(\quad x_2^* = x_2 + \upsilon\)` -- `$$\hat{\beta}_2^* \xrightarrow{p} \beta_2 \cdot \underbrace{\frac{\sigma_{x_2}^2}{\sigma_{x_2}^2 + \sigma_\upsilon^2}}_{\text{reliability ratio} < 1}$$` **Attenuation bias:** more noise → lower reliability → coefficient shrinks toward zero. **Caveat:** This holds only under **classical** error (`\(\upsilon\)` independent of `\(x\)` and `\(\varepsilon\)`). Non-classical error (e.g., noise correlated with the true value) can bias in **either direction**. Many social science variables are measured with error: survey responses, coded text, GDP in developing countries, historical data. -- .center[**→ Switch to RMarkdown: Part 2.2 (measurement error simulation)**] --- # Functional Misspecification True model includes `\(x_2^2\)` (and possibly `\(x_1 \times x_2\)`), but we impose linearity. -- `$$\hat{\beta}_2^{\text{linear}} \xrightarrow{p} \beta_2 + \beta_3 \cdot \delta_{x_2^2 \mid x_2, \dots}$$` The omitted nonlinear terms are **mechanically correlated** with the included regressors — this is OVB where the "omitted variable" is a transformation of an included regressor. -- **Key insight:** The misspecified model estimates the **wrong CEF** — a straight line where the truth is curved. Narrower confidence intervals do not mean better estimation if the model is wrong. -- .center[**→ Switch to RMarkdown: Part 2.3 (misspecification + CEF prediction)**] --- class: inverse, center, middle # Part 3: Clustered Standard Errors --- # Estimation vs. Inference So far: **estimation** — getting `\(\hat{\beta}\)`. Required exogeneity, not distributional assumptions. 
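A minimal base-R sketch of that split, using simulated (hypothetical) grouped data: the slope estimate stays on target even though the errors are correlated within clusters, while the i.i.d. standard error is too optimistic. The cluster-robust variance is assembled by hand here (a CR0-style calculation, previewing Part 3).

```r
# Hypothetical sketch: grouped errors leave beta_hat on target
# but the i.i.d. formula understates the standard error
set.seed(512)
G <- 50; n_g <- 20
g <- rep(1:G, each = n_g)                     # cluster ids
x <- rnorm(G)[g] + rnorm(G * n_g)             # x has a shared cluster component
e <- rnorm(G)[g] + rnorm(G * n_g)             # so do the errors
y <- 1 + 2 * x + e
fit <- lm(y ~ x)
naive_se <- summary(fit)$coefficients["x", "Std. Error"]

# cluster-robust variance by hand: bread %*% meat %*% bread
X  <- model.matrix(fit)
eh <- resid(fit)
bread <- solve(crossprod(X))
meat  <- Reduce(`+`, lapply(split(seq_along(g), g), function(i) {
  s <- crossprod(X[i, , drop = FALSE], eh[i]) # cluster score X_g' e_g
  tcrossprod(s)                               # (X_g' e_g)(X_g' e_g)'
}))
cl_se <- sqrt((bread %*% meat %*% bread)["x", "x"])
c(coef(fit)["x"], naive = naive_se, clustered = cl_se)
```

The coefficient lands near the true value of 2; the clustered SE is several times the naive one, so i.i.d.-based tests would over-reject.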
**Inference** (SEs, CIs, tests) is a separate step — it *does* depend on the error structure: `$$\hat{V}_{\text{OLS}} = \hat{\sigma}^2 (X'X)^{-1} \quad \text{assumes i.i.d. errors}$$` If errors are heteroskedastic, clustered, or autocorrelated: `\(\hat{\beta}\)` is fine, but **SEs are wrong** (usually too small). --- # Why Clustering Matters Observations are often **grouped**: students in schools, workers in firms, respondents in countries. Unobserved group-level factors create within-group error correlation: `$$\text{Cov}(\varepsilon_i, \varepsilon_j) \neq 0 \quad \text{for } i, j \text{ in the same group}$$` -- **Good news:** `\(\hat{\beta}\)` is still **unbiased** — clustering is an inference problem, not an estimation problem. **Bad news:** OLS standard errors are **too small** → we reject `\(H_0\)` too often. -- **Note:** An unobserved group effect `\(\alpha_g\)` is technically an omitted variable — but if `\(\alpha_g\)` is **uncorrelated** with the regressors, there is no OVB (the coefficient is unbiased). The only problem is correlated errors → wrong SEs. Our simulation isolates this case. --- # The Sandwich Estimator `$$\hat{V}_{CL} = \underbrace{(X'X)^{-1}}_{\text{bread}} \left( \sum_{g=1}^{G} X_g' \hat{e}_g \hat{e}_g' X_g \right) \underbrace{(X'X)^{-1}}_{\text{bread}}$$` - **Bread** `\((X'X)^{-1}\)`: adjusts for regressor scale/overlap (same as in estimation) - **`\(X_g' \hat{e}_g\)`** = cluster `\(g\)`'s **score**: a `\(k \times 1\)` vector summarizing how much that cluster's residuals co-move with the regressors. Correlated errors within a cluster → large score; independent errors → scores cancel out - **Meat** = `\(\sum_g (X_g' \hat{e}_g)(X_g' \hat{e}_g)'\)`: sums each cluster's score times itself → total contribution to variance of `\(\hat{\beta}\)` - Special case: if each cluster = 1 obs → HC (heteroskedasticity-robust) SEs -- In R: `sandwich::vcovCL(model, cluster = data$group, type = "HC1")` -- **How many clusters?** `\(G < 20\)`: unreliable.
`\(G \geq 40\)`–50: generally reliable. In between: use with caution. -- .center[**→ Switch to RMarkdown: Part 3 (sandwich by hand, many vs. few clusters)**] --- class: inverse, center, middle # Part 4: Coding Practice --- # Franzese Public Debt Data Open `Lab1.Rmd` → **Part 4: Coding Practice** **Exercises:** 1. **EDA:** Load, explore, visualize (`dplyr`, `ggplot2`, correlation matrix) 2. **OVB:** Compare full vs. bivariate model 3. **Clustering:** Compute clustered SEs by country 4. **Misspecification:** Linear vs. quadratic CEF 5. **Export:** Tables, figures, data -- Work through the exercises. I'm here for questions. --- # Wrap-up **Today we covered:** - The CEF, the linearity assumption, and what OLS actually computes - Four sources of endogeneity: OVB, measurement error, functional misspecification, simultaneity - Prediction workflow: manual `\(\hat{y} = X_{\text{new}}\hat{\beta}\)` and `predict()` - Clustered standard errors: why, how, and when they work -- **Self-study (Appendix A & B):** - Interaction effects and conditional marginal effects - Simultaneity bias and reduced-form estimation -- **Next week:** Autocorrelation, ACF/PACF, and time series diagnostics --- class: inverse, center, middle # Questions? rllobet@uw.edu