class: center, middle, inverse, title-slide .title[ # CSSS/POLS 512 ] .subtitle[ ## Lab 1: Linear Regression Review — Endogeneity, Clustering, and Inference ] .author[ ### Ramses Llobet ] .date[ ### Spring 2026 ] --- # Welcome! - Welcome to **CS&SS / POLS 512** lab sections! - I am **Ramses Llobet** ([rllobet@uw.edu](mailto:rllobet@uw.edu)) - Ph.D. candidate in Political Science, UW - Research: political economy and applied statistics -- - **Please** do not hesitate to stop me if you don't hear or understand me - **Please** ask questions. No question is silly. --- # Quick Logistics .pull-left[ **Lab sessions** - Fridays 1:30 - 3:20 pm (Zoom) - Materials on course website (`.zip`) - Recordings available afterward **Office hours** - By appointment (Zoom) - Email me with topic + availability ] .pull-right[ **Communication** - **Slack** is preferred for coding Qs - Post a *minimal reproducible example* - Not screenshots of code **Homework** - 3 problem sets, submit PDF on Canvas - Use LaTeX (Overleaf) or RMarkdown - Format: `CSSS512HW1NameSurname` ] --- # Today's Plan .pull-left[ **Part 1: R Refresher & OLS Review** - The CEF and the linearity assumption - What the OLS formula actually does - dplyr essentials, project organization **Part 2: Endogeneity Problems** - 2.1 Omitted variable bias - 2.2 Measurement error - 2.3 Functional misspecification ] .pull-right[ **Part 3: Clustered Standard Errors** - Why clustering matters - The sandwich estimator (by hand + `sandwich`) - Many vs. few clusters **Part 4: Practice** - Franzese public debt data **Self-study (Appendix):** - A: Interaction effects - B: Simultaneity bias ] --- class: inverse, center, middle # Part 1: What Does Regression Estimate? --- # The Conditional Expectation Function Our goal in most empirical work is to estimate the **CEF**: `$$\mu(x) = E[y_i \mid x_i]$$` The average value of `\(y\)` we would observe at a given value of `\(x\)`.
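With a discrete regressor, the sample CEF is just a set of group means, no model required. A minimal sketch with simulated (hypothetical) data:

```r
# Hypothetical simulation: with discrete x, the empirical CEF is a group mean
set.seed(512)
x <- sample(1:5, 1000, replace = TRUE)   # discrete regressor
y <- 2 + 3 * x + rnorm(1000)             # true CEF: E[y | x] = 2 + 3x
cef_hat <- tapply(y, x, mean)            # sample analog of E[y | x = k]
cef_hat                                  # close to 5, 8, 11, 14, 17
```

Each entry estimates `\(E[y \mid x = k]\)` directly, with no functional-form assumption.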
-- The **linear regression model** assumes the CEF is linear: `$$E[y_i \mid x_i] = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}$$` This is a strong assumption: a one-unit change in `\(x_1\)` always shifts `\(E[y]\)` by `\(\beta_1\)`, regardless of the level of `\(x_1\)` or other covariates. If the true CEF is nonlinear, **the linear model is misspecified**. -- `$$y_i = X_i'\beta + \varepsilon_i, \quad E[\varepsilon_i \mid x_i] = 0$$` This last condition — **exogeneity** — is what makes OLS work. --- # What Does the OLS Formula Do? `$$\hat{\beta} = (X'X)^{-1}X'y$$` -- | Piece | Matrix | What it captures | |:------|:-------|:-----------------| | Regressor variation | `\(X'X\)` | How much each `\(x\)` varies and how much they overlap | | Regressor-outcome covariation | `\(X'y\)` | How much each `\(x\)` co-moves with `\(y\)` | -- **Covariance vs. correlation:** Same information, different scale. `$$\Sigma = \begin{pmatrix} \text{Var}(y) & \text{Cov}(y, x_1) \\ \text{Cov}(x_1, y) & \text{Var}(x_1) \end{pmatrix} \quad \xrightarrow{\text{rescale by SDs}} \quad R = \begin{pmatrix} 1 & \text{Cor}(y, x_1) \\ \text{Cor}(x_1, y) & 1 \end{pmatrix}$$` Covariance preserves units; correlation standardizes to `\([-1, 1]\)`. OLS works with covariances. --- # What Does the OLS Formula Do? (cont.) **OLS asks:** *"How much does each regressor co-move with `\(y\)`, after removing the overlap among regressors?"* Without `\((X'X)^{-1}\)`, the raw `\(X'y\)` double-counts shared variation. The inversion strips out overlap so each `\(\hat{\beta}\)` reflects only its regressor's *independent* contribution. -- OLS finds the **best linear approximation** to the CEF — a projection onto the space you *specify*, not the space the data *need*. 
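A quick sanity check in R, with simulated (hypothetical) data: the matrix formula reproduces `lm()`, and the residuals come out orthogonal to the regressors by construction.

```r
# Hypothetical data: compute (X'X)^{-1} X'y by hand and compare with lm()
set.seed(512)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)                      # overlapping regressors
y  <- 1 + 2 * x1 - x2 + rnorm(n)
X  <- cbind(1, x1, x2)                         # design matrix with intercept
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
all.equal(as.numeric(beta_hat), unname(coef(lm(y ~ x1 + x2))))  # TRUE

e_hat <- y - X %*% beta_hat
round(crossprod(X, e_hat), 10)                 # X'e = 0, mechanically
```

Note that `\(X'\hat{e} = 0\)` holds here even though the data are simulated; it is a property of the fit, not evidence of correct specification.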
-- **Key distinction:** | | What it says | Status | |:--|:------------|:-------| | `\(X'\hat{e} = 0\)` | Residuals uncorrelated with regressors | **Result** — OLS forces this mechanically | | `\(E[\varepsilon_i \mid x_i] = 0\)` | Population errors uncorrelated with regressors | **Assumption** — can be violated | Every OLS model satisfies the first; only correctly specified models satisfy the second. .center[**→ Switch to RMarkdown: Part 1 (OLS review, dplyr, project setup)**] --- class: inverse, center, middle # Part 2: What Can Go Wrong? --- # When Exogeneity Fails When `\(E[\varepsilon_i \mid x_i] \neq 0\)`, the estimated regression line **no longer traces the CEF**. -- **"Endogeneity"** is an **umbrella term** — it tells you the mean independence assumption is violated, but not *why*. Different sources → different biases → different remedies. Be specific: say "OVB" or "measurement error," not just "endogeneity." -- Four forms we cover (+ simultaneity in self-study): | Source | Mechanism | Effect on `\(\hat{\beta}\)` | |--------|-----------|----------------| | **Omitted variables** | Relevant `\(x\)` left out, absorbed into `\(\varepsilon\)` | Depends on signs | | **Measurement error** | True `\(x\)` observed with noise | Classical: toward zero; non-classical: unpredictable | | **Functional misspecification** | True CEF nonlinear, linear model omits `\(x^2\)` etc. | OVB via omitted transformation | | *Simultaneity (Appendix B)* | `\(x\)` and `\(y\)` jointly determined | Depends on system | --- # Omitted Variable Bias True model: `\(\quad y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\)` We estimate: `\(\quad y = \beta_0 + \beta_1 x_1 + \varepsilon^*\)` -- `$$\hat{\beta}_1^{\text{short}} \xrightarrow{p} \beta_1 + \underbrace{\beta_2 \cdot \delta_{x_2 \mid x_1}}_{\text{bias}}$$` The bias is the product of: 1. **How much the omitted variable affects `\(y\)`** → `\(\beta_2\)` 2. 
**How correlated the omitted variable is with the included regressor** → `\(\delta_{x_2 \mid x_1}\)` (slope from regressing `\(x_2\)` on `\(x_1\)`) If `\(\beta_2 = 0\)` or `\(\delta = 0\)` → no bias. Both conditions must hold. -- .center[**→ Switch to RMarkdown: Part 2.1 (OVB simulation)**] --- # Measurement Error We observe a noisy version of `\(x_2\)`: `\(\quad x_2^* = x_2 + \upsilon\)` -- `$$\hat{\beta}_2^* \xrightarrow{p} \beta_2 \cdot \underbrace{\frac{\sigma_{x_2}^2}{\sigma_{x_2}^2 + \sigma_\upsilon^2}}_{\text{reliability ratio} < 1}$$` **Attenuation bias:** more noise → lower reliability → coefficient shrinks toward zero. **Caveat:** This holds only under **classical** error (`\(\upsilon\)` independent of `\(x\)` and `\(\varepsilon\)`). Non-classical error (e.g., noise correlated with the true value) can bias in **either direction**. Many social science variables are measured with error: survey responses, coded text, GDP in developing countries, historical data. -- .center[**→ Switch to RMarkdown: Part 2.2 (measurement error simulation)**] --- # Functional Misspecification True model includes `\(x_2^2\)` (and possibly `\(x_1 \times x_2\)`), but we impose linearity. -- `$$\hat{\beta}_2^{\text{linear}} \xrightarrow{p} \beta_2 + \beta_3 \cdot \delta_{x_2^2 \mid x_2, \dots}$$` The omitted nonlinear terms are **mechanically correlated** with the included regressors — this is OVB where the "omitted variable" is a transformation of an included regressor. -- **Key insight:** The misspecified model estimates the **wrong CEF** — a straight line where the truth is curved. Narrower confidence intervals do not mean better estimation if the model is wrong. -- .center[**→ Switch to RMarkdown: Part 2.3 (misspecification + CEF prediction)**] --- class: inverse, center, middle # Part 3: Clustered Standard Errors --- # Estimation vs. Inference So far: **estimation** — getting `\(\hat{\beta}\)`. Required exogeneity, not distributional assumptions. 
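A minimal base-R sketch of that split, using simulated (hypothetical) grouped data: the slope estimate stays on target even though the errors are correlated within clusters, while the i.i.d. standard error is too optimistic. The cluster-robust variance is assembled by hand here (a CR0-style calculation, previewing Part 3).

```r
# Hypothetical sketch: grouped errors leave beta_hat on target
# but the i.i.d. formula understates the standard error
set.seed(512)
G <- 50; n_g <- 20
g <- rep(1:G, each = n_g)                     # cluster ids
x <- rnorm(G)[g] + rnorm(G * n_g)             # x has a shared cluster component
e <- rnorm(G)[g] + rnorm(G * n_g)             # so do the errors
y <- 1 + 2 * x + e
fit <- lm(y ~ x)
naive_se <- summary(fit)$coefficients["x", "Std. Error"]

# cluster-robust variance by hand: bread %*% meat %*% bread
X  <- model.matrix(fit)
eh <- resid(fit)
bread <- solve(crossprod(X))
meat  <- Reduce(`+`, lapply(split(seq_along(g), g), function(i) {
  s <- crossprod(X[i, , drop = FALSE], eh[i]) # cluster score X_g' e_g
  tcrossprod(s)                               # (X_g' e_g)(X_g' e_g)'
}))
cl_se <- sqrt((bread %*% meat %*% bread)["x", "x"])
c(coef(fit)["x"], naive = naive_se, clustered = cl_se)
```

The coefficient lands near the true value of 2; the clustered SE is several times the naive one, so i.i.d.-based tests would over-reject.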
**Inference** (SEs, CIs, tests) is a separate step — it *does* depend on the error structure: `$$\hat{V}_{\text{OLS}} = \hat{\sigma}^2 (X'X)^{-1} \quad \text{assumes i.i.d. errors}$$` If errors are heteroskedastic, clustered, or autocorrelated: `\(\hat{\beta}\)` is fine, but **SEs are wrong** (usually too small). --- # Why Clustering Matters Observations are often **grouped**: students in schools, workers in firms, respondents in countries. Unobserved group-level factors create within-group error correlation: `$$\text{Cov}(\varepsilon_i, \varepsilon_j) \neq 0 \quad \text{for } i, j \text{ in the same group}$$` -- **Good news:** `\(\hat{\beta}\)` is still **unbiased** — clustering is an inference problem, not an estimation problem. **Bad news:** OLS standard errors are **too small** → we reject `\(H_0\)` too often. -- **Note:** An unobserved group effect `\(\alpha_g\)` is technically an omitted variable — but if `\(\alpha_g\)` is **uncorrelated** with the regressors, there is no OVB (the coefficient is unbiased). The only problem is correlated errors → wrong SEs. Our simulation isolates this case. --- # The Sandwich Estimator `$$\hat{V}_{CL} = \underbrace{(X'X)^{-1}}_{\text{bread}} \left( \sum_{g=1}^{G} X_g' \hat{e}_g \hat{e}_g' X_g \right) \underbrace{(X'X)^{-1}}_{\text{bread}}$$` - **Bread** `\((X'X)^{-1}\)`: adjusts for regressor scale/overlap (same as in estimation) - **`\(X_g' \hat{e}_g\)`** = cluster `\(g\)`'s **score**: a `\(k \times 1\)` vector summarizing how much that cluster's residuals co-move with the regressors. Correlated errors within a cluster → large score; independent errors → scores cancel out - **Meat** = `\(\sum_g (X_g' \hat{e}_g)(X_g' \hat{e}_g)'\)`: sums each cluster's score times itself → total contribution to variance of `\(\hat{\beta}\)` - Special case: if each cluster = 1 obs → HC (heteroskedasticity-robust) SEs -- In R: `sandwich::vcovCL(model, cluster = data$group, type = "HC1")` -- **How many clusters?** `\(G < 20\)`: unreliable.
`\(G \geq 40\)`–50: generally reliable. In between: use with caution. -- .center[**→ Switch to RMarkdown: Part 3 (sandwich by hand, many vs. few clusters)**] --- class: inverse, center, middle # Part 4: Coding Practice --- # Franzese Public Debt Data Open `Lab1.Rmd` → **Part 4: Coding Practice** **Exercises:** 1. **EDA:** Load, explore, visualize (`dplyr`, `ggplot2`, correlation matrix) 2. **OVB:** Compare full vs. bivariate model 3. **Clustering:** Compute clustered SEs by country 4. **Misspecification:** Linear vs. quadratic CEF 5. **Export:** Tables, figures, data -- Work through the exercises. I'm here for questions. --- # Wrap-up **Today we covered:** - The CEF, the linearity assumption, and what OLS actually computes - Four sources of endogeneity: OVB, measurement error, functional misspecification, simultaneity - Prediction workflow: manual `\(\hat{y} = X_{\text{new}}\hat{\beta}\)` and `predict()` - Clustered standard errors: why, how, and when they work -- **Self-study (Appendix A & B):** - Interaction effects and conditional marginal effects - Simultaneity bias and reduced-form estimation -- **Next week:** Autocorrelation, ACF/PACF, and time series diagnostics --- class: inverse, center, middle # Questions? rllobet@uw.edu