Disclaimer

These materials have been revised and updated by successive teaching assistants over recent years.


Packages and Functions Used in This Lab

| Package | Purpose | Key functions |
|---|---|---|
| plm | Panel models, dynamic-panel GMM | `plm()`, `pgmm()`, `pdata.frame()` |
| lmtest / sandwich | Test stats and robust SEs | `coeftest()`, `vcovHC()` |
| MASS | MVN draws for parametric simulation | `mvrnorm()` |
| simcf | King-Tomz-Wittenberg counterfactual simulators (Adolph’s package) | `cfMake()`, `cfChange()`, `ldvsimev()`, `ldvsimfd()`, `ldvsimrr()` |
| texreg | Cross-format regression tables | `screenreg()`, `htmlreg()`, `texreg()` |
| tidyverse | Data manipulation, plotting | `ggplot()`, `tibble()`, pipes (`\|>`) |
| patchwork | Side-by-side ggplot panels | `+`, `/` |

We also use EmplUK from data(EmplUK, package = "plm") — the Arellano-Bond (1991) labour-demand panel — for the canonical demonstration of difference and system GMM in Part 3.


Today’s roadmap

| Part | Topic |
|---|---|
| 1 | Nickell bias and a hand-built GMM fix. Derivation + a small simulated panel where we reproduce the bias, build Anderson-Hsiao IV and one-step Arellano-Bond GMM from raw matrix algebra, and verify against pgmm(). |
| 2 | pgmm() and the diagnostic battery. Walk through the function’s formula syntax and arguments, then the five diagnostics on a worked example (Sargan/Hansen, AR(1), AR(2), Wald on time dummies, instrument count). |
| 3 | The IV/GMM family demonstrated on EmplUK. Anderson-Hsiao → Arellano-Bond → Blundell-Bond. One-step vs two-step (Windmeijer). Collapsed instruments (Roodman 2009). |
| 4 | Applied workflow on the Przeworski democracy panel. FE-ARDL(1,1) → AH → Diff-GMM → Sys-GMM, including time-invariant oil so students see when sys-GMM helps. Misspecified examples included. |
| 5 | Cigarette tax counterfactual. Diff-GMM fit on the 1985–1995 cigarette panel + a +60-cent tax-hike scenario, with EV / FD / RR forecasts via simcf. |

Lab 5 closed by fitting FE-ARDL(1,1) on the democracy panel and flagged the Nickell bias on the lagged DV as a residual concern. Today we tackle that concern head-on. The pedagogical thread is continuity: same data, same outcome — but now interpreted through three bias-corrected estimators.


Part 1 — Nickell bias and a hand-built GMM fix

1.1 The Nickell mechanism

Take the simplest dynamic panel:

\[ y_{it} = \alpha_i + \phi\, y_{i,t-1} + \beta\, x_{it} + \varepsilon_{it}, \qquad |\phi| < 1, \quad \varepsilon_{it} \sim \text{iid}(0, \sigma^2) \]

The within transformation subtracts unit means: \(\tilde y_{it} = y_{it} - \bar y_i\). This sweeps out \(\alpha_i\) — the whole reason we use FE. But it introduces a new problem. The within-transformed lag \(\tilde y_{i,t-1}\) contains \(\bar y_{i,-1}\), which depends on \(\varepsilon_{i,1}, \ldots, \varepsilon_{i,T-1}\). The within-transformed error \(\tilde\varepsilon_{it}\) also contains \(\bar\varepsilon_i\), which depends on \(\varepsilon_{i,1}, \ldots, \varepsilon_{i,T}\). They share terms.

Therefore \(\tilde y_{i,t-1}\) and \(\tilde\varepsilon_{it}\) are correlated, OLS on the within-transformed data is no longer consistent, and Nickell (1981) showed that as \(N \to \infty\) with \(T\) fixed,

\[ \boxed{\;\text{plim}(\hat\phi_{\text{FE}} - \phi) \approx -\,\frac{1+\phi}{T-1}\;} \]

Three consequences worth flagging:

  1. Bias is downward on \(\phi\). FE understates persistence. If your true \(\phi = 0.85\), expect FE to give you something around 0.7 with \(T = 15\). The bias is of order \(1/T\) — small \(T\) is bad.
  2. Bias travels. Because \(\hat\phi\) absorbs less of the persistence, \(\hat\beta\) on contemporaneous regressors picks up some of the slack — typically biased upward when \(x\) is correlated with the outcome’s history, though the direction is not guaranteed (with white-noise \(x\), as in our toy DGP below, it can go either way). Long-run effects \(\beta/(1-\phi)\) are also biased; the direction depends on which bias dominates.
  3. Adding more units does not fix it. Nickell’s bias is asymptotic in \(N\) with \(T\) fixed — so doubling the number of countries (or firms, or states) leaves the bias exactly where it was. Only adding more time periods helps. When more time periods isn’t an option, change estimator.
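To see the predicted magnitudes, we can evaluate the boxed approximation \(-(1+\phi)/(T-1)\) directly. A quick sketch for \(\phi = 0.7\) across panel lengths:

```r
# Nickell's approximate bias -(1 + phi) / (T - 1), evaluated for phi = 0.7
phi    <- 0.7
T_vals <- c(5, 8, 15, 30, 50)
data.frame(T = T_vals, approx_bias = round(-(1 + phi) / (T_vals - 1), 3))
##    T approx_bias
## 1  5      -0.425
## 2  8      -0.243
## 3 15      -0.121
## 4 30      -0.059
## 5 50      -0.035
```

Going from \(T = 8\) to \(T = 15\) roughly halves the bias; doubling \(N\) at fixed \(T\) would leave every number above unchanged.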

The rest of Part 1 builds that “change estimator” path by hand, on a small simulated panel. We will (i) generate data with known parameters, (ii) reproduce the Nickell bias on an FE-LDV fit, (iii) difference to kill \(\alpha_i\) and watch the residual endogeneity appear, (iv) construct the Anderson-Hsiao instrumental-variable estimator from raw matrix algebra, and (v) extend to an over-identified one-step Arellano-Bond GMM. The pedagogical goal is to demystify GMM: it is just a weighted moment-condition system with a closed-form solution.

1.2 A toy DGP

We simulate a small panel from the DGP

\[ y_{it} = \alpha_i + \phi\, y_{i,t-1} + \beta\, x_{it} + \varepsilon_{it} \]

with \(\phi = 0.7\), \(\beta = 0.5\), \(\alpha_i \sim N(0, 1)\), \(\varepsilon_{it} \sim N(0, 1)\), \(x_{it} \sim N(0, 1)\). We keep \(N = 50\) and \(T = 8\) — small enough that the matrix algebra fits in your head, large enough that estimates are well-defined. A burn-in period lets the series escape its arbitrary initial value before we collect observations.

We keep the true \(\varepsilon_{it}\) alongside the observed data — students never have this in real life, but here it lets us see the endogeneity in §1.4.

set.seed(20260515)

# DGP parameters
phi_true  <- 0.7    # true persistence
beta_true <- 0.5    # true effect of x on y
N         <- 50     # number of units
T_obs     <- 8      # observed periods per unit
burn      <- 50     # burn-in length

# Unit fixed effects, drawn once
alpha <- rnorm(N, mean = 0, sd = 1)

# Generate one unit at a time and stack
toy_list <- vector("list", N)
for (i in seq_len(N)) {
  Tlong <- T_obs + burn
  x_i   <- rnorm(Tlong, 0, 1)
  eps_i <- rnorm(Tlong, 0, 1)
  y_i   <- numeric(Tlong)
  # initial value from the stationary distribution of an AR(1) with intercept alpha_i
  y_i[1] <- alpha[i] / (1 - phi_true) +
            rnorm(1, sd = 1 / sqrt(1 - phi_true^2))
  for (tt in 2:Tlong) {
    y_i[tt] <- alpha[i] + phi_true * y_i[tt - 1] +
               beta_true * x_i[tt] + eps_i[tt]
  }
  # Pre-compute the slices BEFORE building the tibble — tibble evaluates
  # columns sequentially, so y_lag = y_i[...] inside tibble() would shadow the
  # outer 'y_i' with the just-created 'y' column. Pre-computing avoids that.
  #
  # Lags are constructed from the OBSERVED window only — NA where the lag
  # would reach into pre-window data. This mirrors what pgmm sees on a real
  # panel, so the manual fits in §1.5–§1.6 use the same sample pgmm does.
  y_obs   <- y_i  [(burn + 1):Tlong]
  y_lag1  <- c(NA,             y_obs[1:(T_obs - 1)])
  y_lag2  <- c(NA, NA,         y_obs[1:(T_obs - 2)])
  y_lag3  <- c(NA, NA, NA,     y_obs[1:(T_obs - 3)])
  y_lag4  <- c(NA, NA, NA, NA, y_obs[1:(T_obs - 4)])
  x_obs   <- x_i  [(burn + 1):Tlong]
  eps_obs <- eps_i[(burn + 1):Tlong]
  toy_list[[i]] <- tibble(
    unit   = i,
    time   = seq_len(T_obs),
    y      = y_obs,
    y_lag1 = y_lag1,
    y_lag2 = y_lag2,
    y_lag3 = y_lag3,
    y_lag4 = y_lag4,
    x      = x_obs,
    eps    = eps_obs
  )
}
toy <- bind_rows(toy_list)
glimpse(toy)
## Rows: 400
## Columns: 9
## $ unit   <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, …
## $ time   <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, …
## $ y      <dbl> -4.5417604, -6.1090964, -7.9833272, -6.8805058, -6.9750735, -6.…
## $ y_lag1 <dbl> NA, -4.5417604, -6.1090964, -7.9833272, -6.8805058, -6.9750735,…
## $ y_lag2 <dbl> NA, NA, -4.5417604, -6.1090964, -7.9833272, -6.8805058, -6.9750…
## $ y_lag3 <dbl> NA, NA, NA, -4.5417604, -6.1090964, -7.9833272, -6.8805058, -6.…
## $ y_lag4 <dbl> NA, NA, NA, NA, -4.5417604, -6.1090964, -7.9833272, -6.8805058,…
## $ x      <dbl> 1.38427471, 1.02050701, -0.61394292, -0.03089128, 0.77359679, 0…
## $ eps    <dbl> -0.255258495, -1.821209281, -1.781079872, 0.342177224, -0.92660…

The columns we will use downstream:

  • y — outcome at time \(t\).
  • y_lag1, y_lag2, y_lag3, y_lag4 — lagged outcomes at \(t-1, t-2, t-3, t-4\). NA for the first \(k\) rows of each unit since those lags would reach pre-window. This is exactly what pgmm() sees internally, so downstream drop_na() steps will yield the same effective sample as pgmm().
  • x — covariate at time \(t\).
  • eps — the true error \(\varepsilon_{it}\) (visible to us, not to a real researcher).

1.3 Step 1 — Reproduce the Nickell bias on an FE-LDV fit

Within-demean every variable inside its unit, then OLS the demeaned outcome on the demeaned lag and demeaned \(x\). No intercept (it’s absorbed by the demeaning). This is exactly what plm(model = "within") does internally.

# Listwise-delete the t = 1 rows where y_lag1 is unobservable (pre-window),
# then within-demean over the remaining 7 rows per unit. This mirrors what
# plm(model = "within") does internally.
toy_dm <- toy |>
  drop_na(y, y_lag1, x) |>
  group_by(unit) |>
  mutate(y_dm     = y      - mean(y),
         y_lag_dm = y_lag1 - mean(y_lag1),
         x_dm     = x      - mean(x)) |>
  ungroup()

# OLS on the within-demeaned variables, no intercept
fit_fe_manual <- lm(y_dm ~ y_lag_dm + x_dm - 1, data = toy_dm)
phi_fe   <- unname(coef(fit_fe_manual)["y_lag_dm"])
beta_fe  <- unname(coef(fit_fe_manual)["x_dm"])

tibble(parameter = c("phi", "beta"),
       true      = c(phi_true, beta_true),
       FE_hat    = c(phi_fe,  beta_fe),
       bias      = FE_hat - true) |>
  knitr::kable(digits = 3,
               caption = "FE-LDV on the toy panel. True phi = 0.7; FE estimate is biased downward by Nickell. True beta = 0.5; the bias is mild in this DGP but you can see it travelling.")
FE-LDV on the toy panel. True phi = 0.7; FE estimate is biased downward by Nickell. True beta = 0.5; the bias is mild in this DGP but you can see it travelling.
| parameter | true | FE_hat | bias |
|-----------|------|--------|--------|
| phi       | 0.7  | 0.511  | -0.189 |
| beta      | 0.5  | 0.427  | -0.073 |

\(\hat\phi_{FE}\) comes in below 0.7 — the Nickell bias is visible. The next steps build an estimator that closes that gap.

One-realization caveat. Nickell’s formula predicts the expected bias under repeated sampling, \(\text{plim}(\hat\phi_{FE} - \phi) \approx -(1+\phi)/(T-1) \approx -0.243\) here. The number in the bias column above is just one realization around that expectation, scattered by Monte-Carlo noise. Re-run with a different seed and the magnitude moves; the sign and rough scale stay. Read the table as evidence the bias is real, not as a precise estimate of its size.
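That caveat can be checked directly. Below is a compact Monte-Carlo sketch that averages the FE bias on \(\phi\) over many replications; the DGP is simplified (no \(x\)), and the helper name `fe_phi_bias` is ours, not from any package:

```r
# Monte-Carlo check of the one-realization caveat: average the FE bias on phi
# over many simulated panels. Simplified DGP (no x); names are ad hoc.
fe_phi_bias <- function(N = 50, T_obs = 8, phi = 0.7, burn = 50) {
  Tlong <- T_obs + burn
  alpha <- rnorm(N)
  y <- matrix(0, N, Tlong)
  y[, 1] <- alpha / (1 - phi) + rnorm(N, sd = 1 / sqrt(1 - phi^2))
  for (tt in 2:Tlong) y[, tt] <- alpha + phi * y[, tt - 1] + rnorm(N)
  y  <- y[, (burn + 1):Tlong]                        # observed window, N x T_obs
  y1 <- y[, 2:T_obs];  y0 <- y[, 1:(T_obs - 1)]      # outcome and its lag
  y1 <- y1 - rowMeans(y1); y0 <- y0 - rowMeans(y0)   # within-demean per unit
  sum(y0 * y1) / sum(y0^2) - phi                     # FE slope minus truth
}
set.seed(20260515)
mean(replicate(200, fe_phi_bias()))  # near Nickell's -(1 + 0.7)/7 = -0.243
```

The Monte-Carlo average lands near the approximation; any single replication scatters around it, which is exactly what the table above shows.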

1.4 Step 2 — Difference to kill \(\alpha_i\) and expose the endogeneity

Take first differences within each unit. The \(\alpha_i\) vanishes (\(\Delta \alpha_i = 0\)), but the differenced lag \(\Delta y_{i,t-1}\) correlates with the differenced error \(\Delta \varepsilon_{it}\) — both contain \(\varepsilon_{i,t-1}\). Let’s see this correlation directly.

# Compute differences within each unit.  drop_na() removes rows whose lags
# would reach into the pre-window — t = 1 (where y_lag1 is NA) and t = 2
# (where y_lag2 is NA).  The remaining sample is t = 3..8 per unit, exactly
# the rows pgmm uses internally for transformation = "d" with lag(y, 2)
# as the instrument.
#
# We force a final unit-then-time row ordering with arrange(unit, time) so
# downstream matrix construction (Part 1.7) lines up rows correctly.
toy_d <- toy |>
  arrange(unit, time) |>
  group_by(unit) |>
  mutate(dy      = y      - y_lag1,
         dy_lag  = y_lag1 - y_lag2,    # = Δy_{t-1}
         dx      = x      - dplyr::lag(x),
         deps    = eps    - dplyr::lag(eps)) |>
  ungroup() |>
  drop_na(dy, dy_lag, dx, deps) |>
  arrange(unit, time)

# The endogeneity in the differenced model:
# Δy_{t-1} and Δε_{t} should be correlated (both contain ε_{t-1});
# Δx and Δε should not be (x is exogenous in our DGP).
tibble(
  pair = c("Δy_lag vs Δε  (endogeneity in the differenced equation)",
           "Δx     vs Δε  (sanity check — x is exogenous in the DGP)"),
  correlation = c(cor(toy_d$dy_lag, toy_d$deps),
                  cor(toy_d$dx,     toy_d$deps))
) |>
  knitr::kable(digits = 3,
               caption = "Correlations among differenced variables and the differenced error in the simulated data.")
Correlations among differenced variables and the differenced error in the simulated data.
| pair | correlation |
|---|---|
| Δy_lag vs Δε (endogeneity in the differenced equation) | -0.575 |
| Δx vs Δε (sanity check — x is exogenous in the DGP) | -0.043 |

The first row is sizeable and clearly non-zero — that’s exactly the endogeneity problem an instrumental variable will fix. The second is near zero, confirming that \(\Delta x\) is exogenous to \(\Delta \varepsilon\) in this DGP and does not itself need an instrument.

Why \(y_{i,t-2}\) is the natural instrument. It depends on \(\varepsilon_{i,1}, \ldots, \varepsilon_{i,t-2}\) but neither \(\varepsilon_{i,t-1}\) nor \(\varepsilon_{i,t}\). So \(y_{i,t-2}\) is correlated with \(\Delta y_{i,t-1} = y_{i,t-1} - y_{i,t-2}\) (the relevance condition) but uncorrelated with \(\Delta \varepsilon_{it}\) under the assumption that \(\varepsilon\) is iid (the exclusion condition). In the simulated data we can check both directly:

tibble(
  condition = c("Relevance: cor(y_lag2, Δy_lag) should be far from 0",
                "Exclusion: cor(y_lag2, Δε)     should be near 0"),
  correlation = c(cor(toy_d$y_lag2, toy_d$dy_lag),
                  cor(toy_d$y_lag2, toy_d$deps))
) |>
  knitr::kable(digits = 3,
               caption = "Relevance and exclusion checks for the lag-2 instrument.")
Relevance and exclusion checks for the lag-2 instrument.
| condition | correlation |
|---|---|
| Relevance: cor(y_lag2, Δy_lag) should be far from 0 | -0.134 |
| Exclusion: cor(y_lag2, Δε) should be near 0 | -0.009 |

Both conditions hold in the simulated data, by construction.

1.5 Step 3 — Anderson-Hsiao IV by hand

We have one endogenous regressor (\(\Delta y_{i,t-1}\)) and one valid instrument (\(y_{i,t-2}\)). \(\Delta x\) instruments itself. The system is just-identified.

The closed-form IV (= 2SLS = GMM with any weighting matrix) estimator is

\[ \hat \theta_{IV} \;=\; (Z' X)^{-1} Z' \Delta y \]

where \(\Delta y\) is the stacked outcome vector, \(X\) is the regressor matrix with columns \([\Delta y_{i,t-1}, \Delta x_{it}]\), and \(Z\) is the instrument matrix with columns \([y_{i,t-2}, \Delta x_{it}]\). Let’s build each piece explicitly.
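Why “any weighting matrix” gives the same answer here: in the just-identified case \(Z'X\) is square and (generically) invertible, so in the general GMM formula the weighting matrix cancels,

\[ (X' Z\, W\, Z' X)^{-1} X' Z\, W\, Z' \Delta y \;=\; (Z'X)^{-1} W^{-1} (X'Z)^{-1}\, X'Z\, W\, Z' \Delta y \;=\; (Z'X)^{-1} Z' \Delta y. \]

This is why §1.6 must move to an over-identified system before the choice of \(W\) can matter at all.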

We have 300 rows (50 units × 6 periods after dropping \(t=1, 2\) where the lags reach pre-window), 2 columns in \(X\), and 2 columns in \(Z\) — a 2×2 system to solve.

# Stack the data as plain vectors / matrices.  drop_na in the previous chunk
# already kept only rows where every needed lag is observed (t >= 2 here).
dy_vec  <- toy_d$dy                       # outcome vector
X_mat   <- cbind(dy_lag = toy_d$dy_lag,   # endogenous regressor
                 dx     = toy_d$dx)        # exogenous regressor
Z_mat   <- cbind(y_lag2 = toy_d$y_lag2,   # instrument for dy_lag
                 dx     = toy_d$dx)        # dx instruments itself

# Build Z'X (a 2x2 matrix) and Z'dy (a 2-vector)
ZpX  <- crossprod(Z_mat, X_mat)
Zpdy <- crossprod(Z_mat, dy_vec)

# Closed-form IV estimator
theta_AH <- solve(ZpX, Zpdy)
rownames(theta_AH) <- c("phi (dy_lag)", "beta (dx)")

The two small matrices we just built:

knitr::kable(ZpX,  digits = 2, caption = "Z'X")
Z’X
| | dy_lag | dx |
|---|---|---|
| y_lag2 | -214.58 | 26.73 |
| dx | -151.49 | 660.48 |
knitr::kable(Zpdy, digits = 2, caption = "Z'Δy")
Z’Δy
| | |
|---|---|
| y_lag2 | -156.21 |
| dx | 197.32 |

And the resulting IV estimates:

knitr::kable(theta_AH, digits = 3,
             caption = "Anderson-Hsiao IV estimates from the closed-form solve.")
Anderson-Hsiao IV estimates from the closed-form solve.
| | |
|---|---|
| phi (dy_lag) | 0.788 |
| beta (dx) | 0.479 |

Read off \(\hat\phi_{AH}\) — it should be close to the true value 0.7. The Nickell bias has been corrected by replacing the endogenous \(\Delta y_{i,t-1}\) with its instrumented counterpart.

Comparison: FE-LDV vs Anderson-Hsiao on the same panel.

tibble(parameter = c("phi", "beta"),
       true      = c(phi_true,        beta_true),
       FE        = c(phi_fe,          beta_fe),
       AH        = c(theta_AH[1, 1],  theta_AH[2, 1])) |>
  knitr::kable(digits = 3,
               caption = "FE-LDV is biased; Anderson-Hsiao IV recovers the truth (up to sampling error).")
FE-LDV is biased; Anderson-Hsiao IV recovers the truth (up to sampling error).
| parameter | true | FE | AH |
|-----------|------|-------|-------|
| phi       | 0.7  | 0.511 | 0.788 |
| beta      | 0.5  | 0.427 | 0.479 |

1.6 Step 4 — Add more lags → one-step GMM

What if we use two lags as instruments — say \(y_{i,t-2}\) and \(y_{i,t-3}\)? Each is exogenous under the iid-\(\varepsilon\) assumption, and each is correlated with \(\Delta y_{i,t-1}\). Adding the second moment makes the system over-identified (more moments than parameters), and we have to choose a weighting matrix to combine them.

The (one-step) GMM estimator with weighting matrix \(W\) is

\[ \hat\theta_{GMM} \;=\; \big( X' Z\, W\, Z' X \big)^{-1} \, X' Z\, W\, Z' \Delta y \]

For one-step GMM, the standard choice is \(W = (Z'Z)^{-1}\). (Two-step would use the residual-based optimal \(W\) — see §3.4. We stop at one-step here to keep the algebra clean.)

To use lag-3 as an instrument we need \(y_{i,t-3}\) observable, which means restricting the equation to \(t \ge 3\) — we lose one period of usable rows.

Heads-up on the verification step. With the same sample as pgmm uses, the just-identified AH case in §1.5 will match pgmm(..., | lag(y, 2)) to numerical precision in §1.7 — one instrument and one endogenous regressor leave no degree of freedom for stacking conventions. The over-identified GMM below (\(y_{i,t-2}\) and \(y_{i,t-3}\) both as instruments) will still differ slightly from pgmm(..., | lag(y, 2:3)). Why? We stack the two lags as two ordinary IV columns that apply at every row; pgmm uses a richer per-period moment structure (a separate moment for each lag at each period) — the convention used in the Arellano-Bond literature. Same sample, different moments, slightly different estimates.

The instrument matrix now has 3 columns (two lags of \(y\) plus the exogenous \(\Delta x\)) against only 2 parameters in \(X\) — the system is over-identified by one moment. Because \(y_{i,t-3}\) must come from the observed window, we further drop the \(t=3\) row, leaving 250 rows (5 periods \(\times\) 50 units).

# Adding y_lag3 as an instrument means we need y_lag3 to be observed, which
# requires t >= 4 (the t = 3 row in toy_d would have y_lag3 reaching pre-
# window data).  Drop those rows so every instrument in Z is genuinely
# available from the observed sample — same logic as listwise deletion.
toy_g <- toy_d |> drop_na(y_lag3)

dy_vec <- toy_g$dy
X_mat  <- cbind(dy_lag = toy_g$dy_lag,
                dx     = toy_g$dx)
# Wider instrument matrix: two lags of y plus the exogenous dx
Z_mat  <- cbind(y_lag2 = toy_g$y_lag2,
                y_lag3 = toy_g$y_lag3,
                dx     = toy_g$dx)

# One-step weighting matrix W = (Z'Z)^{-1}
ZpZ   <- crossprod(Z_mat)             # 3x3
W_one <- solve(ZpZ)

# Z'X and Z'dy
ZpX  <- crossprod(Z_mat, X_mat)        # 3x2
Zpdy <- crossprod(Z_mat, dy_vec)       # 3x1

# GMM normal equation: θ = (X'Z W Z'X)^{-1} X'Z W Z'Δy
XpZ_W   <- t(ZpX) %*% W_one
A_mat   <- XpZ_W %*% ZpX               # 2x2 — the bread
b_vec   <- XpZ_W %*% Zpdy              # 2x1 — the meat

theta_GMM <- solve(A_mat, b_vec)
rownames(theta_GMM) <- c("phi (dy_lag)", "beta (dx)")

The one-step weighting matrix:

knitr::kable(W_one, digits = 4,
             caption = "W = (Z'Z)^{-1}, the one-step GMM weighting matrix (3 × 3).")
W = (Z’Z)^{-1}, the one-step GMM weighting matrix (3 × 3).
| | y_lag2 | y_lag3 | dx |
|---|---|---|---|
| y_lag2 | 0.0027 | -0.0026 | 0.0000 |
| y_lag3 | -0.0026 | 0.0027 | 0.0000 |
| dx | 0.0000 | 0.0000 | 0.0018 |

The GMM estimates:

knitr::kable(theta_GMM, digits = 3,
             caption = "One-step Arellano-Bond GMM estimates from the closed-form solve.")
One-step Arellano-Bond GMM estimates from the closed-form solve.
| | |
|---|---|
| phi (dy_lag) | 0.563 |
| beta (dx) | 0.444 |

A side-by-side of the three estimators on the same data:

tibble(parameter = c("phi", "beta"),
       true      = c(phi_true,         beta_true),
       FE        = c(phi_fe,           beta_fe),
       AH        = c(theta_AH[1, 1],   theta_AH[2, 1]),
       GMM       = c(theta_GMM[1, 1],  theta_GMM[2, 1])) |>
  knitr::kable(digits = 3,
               caption = "FE biased; AH and GMM both consistent. GMM's efficiency gain over AH is asymptotic; with only one extra (and weak) moment, the over-identified point estimate can move noticeably in a single sample. The principle generalizes to the larger moment sets used in pgmm.")
FE biased; AH and GMM both consistent. GMM’s efficiency gain over AH is asymptotic; with only one extra (and weak) moment, the over-identified point estimate can move noticeably in a single sample. The principle generalizes to the larger moment sets used in pgmm.

| parameter | true | FE | AH | GMM |
|-----------|------|-------|-------|-------|
| phi       | 0.7  | 0.511 | 0.788 | 0.563 |
| beta      | 0.5  | 0.427 | 0.479 | 0.444 |

In this single draw the GMM point estimate lands on the other side of the truth from AH. That is not a bug: the added lag-3 instrument is only weakly correlated with \(\Delta y_{i,t-1}\), so the over-identified estimate is sensitive to sampling noise, and its efficiency advantage is asymptotic. The conceptual payoff is the machinery: we constructed the moment conditions \(Z' (\Delta y - X \theta) = 0\), picked a weighting matrix, and solved the over-identified system in one closed-form line. Every estimator in Parts 3 and 4 is just this template scaled up.
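Since that template recurs, it can be written once as a generic helper. A sketch (the function name `gmm_onestep` is ours, not from any package):

```r
# Closed-form linear GMM for the moment conditions Z'(y - X theta) = 0,
# minimizing the quadratic form with weighting matrix W.
# When ncol(Z) == ncol(X) (just-identified) the choice of W is irrelevant;
# a wider Z gives the over-identified case. Ad hoc helper for this lab.
gmm_onestep <- function(y, X, Z, W = solve(crossprod(Z))) {
  ZpX <- crossprod(Z, X)      # Z'X
  Zpy <- crossprod(Z, y)      # Z'y
  A   <- t(ZpX) %*% W %*% ZpX
  b   <- t(ZpX) %*% W %*% Zpy
  drop(solve(A, b))           # (X'Z W Z'X)^{-1} X'Z W Z'y
}
```

Feeding it the \(X\) and \(Z\) built in §1.5 or §1.6 reproduces `theta_AH` and `theta_GMM`; swapping in the period-stacked \(Z\) and H-based \(W\) of §1.7 reproduces the pgmm-style estimate.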

1.7 Verify against pgmm()

Run the just-identified pgmm AH fit on the same sample and put the coefficients side by side with our manual result:

pdat_toy <- pdata.frame(toy, index = c("unit", "time"))
ah_pgmm <- pgmm(y ~ lag(y, 1) + x | lag(y, 2),
                data = pdat_toy,
                effect = "individual",
                model  = "onestep",
                transformation = "d")

tibble(parameter           = c("phi", "beta"),
       manual_literal_AH   = c(theta_AH[1, 1], theta_AH[2, 1]),
       pgmm_AH             = c(unname(coef(ah_pgmm)["lag(y, 1)"]),
                               unname(coef(ah_pgmm)["x"]))) |>
  knitr::kable(digits = 6,
               caption = "Our literal Anderson-Hsiao (one stacked moment for y_{t-2}) vs pgmm with the same instrument specification.")
Our literal Anderson-Hsiao (one stacked moment for y_{t-2}) vs pgmm with the same instrument specification.
| parameter | manual_literal_AH | pgmm_AH |
|-----------|-------------------|----------|
| phi       | 0.787692          | 0.675650 |
| beta      | 0.479428          | 0.456828 |

The numbers are close but do not match exactly. The reason is not the sample — both fits use the same 300 rows. It is the moment structure. Look at the instrument matrix pgmm builds for the first unit:

# pgmm's per-unit instrument block: rows are periods (t = 3..8),
# columns are: 6 period-specific copies of y_{t-2}, plus 1 column for x.
ah_pgmm$W[[1]] |>
  round(3) |>
  as.data.frame() |>
  knitr::kable(caption = "pgmm's instrument matrix for the first unit (6 × 7). The block-diagonal structure of the first six columns reveals that pgmm splits the lag-2 instrument into one moment **per period**, not a single stacked moment.")
pgmm’s instrument matrix for the first unit (6 × 7). The block-diagonal structure of the first six columns reveals that pgmm splits the lag-2 instrument into one moment per period, not a single stacked moment.
| | V1 | V2 | V3 | V4 | V5 | V6 | x |
|---|---|---|---|---|---|---|---|
| 3 | -4.542 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | -1.634 |
| 4 | 0.000 | -6.109 | 0.000 | 0.000 | 0.000 | 0.000 | 0.583 |
| 5 | 0.000 | 0.000 | -7.983 | 0.000 | 0.000 | 0.000 | 0.804 |
| 6 | 0.000 | 0.000 | 0.000 | -6.881 | 0.000 | 0.000 | 0.001 |
| 7 | 0.000 | 0.000 | 0.000 | 0.000 | -6.975 | 0.000 | -0.859 |
| 8 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | -6.133 | -0.364 |

What our literal AH did with one y-instrument moment, pgmm does with six — one per period \(t = 3, \ldots, 8\). Each column of the block-diagonal section is “\(y_{i,t-2}\) at period \(t\), zero elsewhere”. So pgmm’s “just-identified” AH spec is actually over-identified internally, with \(T - 2 = 6\) moments for y plus 1 moment for x. Different moment system, slightly different estimates in finite samples.

To replicate pgmm exactly by hand, we need two things:

  1. Build the period-stacked block-diagonal \(Z\) (one column per period × instrument pair).
  2. Use Arellano-Bond’s one-step weighting matrix \(W = \big(\sum_i Z_i' H Z_i\big)^{-1}\), where \(H\) is the \((T{-}2) \times (T{-}2)\) tridiagonal matrix with \(2\) on the diagonal and \(-1\) on the off-diagonals. This \(H\) accounts for the MA(1) structure of \(\Delta\varepsilon\) implied by iid \(\varepsilon\) in levels.

# Rebuild X and y from toy_d (the AH-sample, 300 rows) — §1.6 overwrote
# X_mat and dy_vec with the smaller §1.6 GMM sample (toy_g, 250 rows),
# so we can't reuse those here.
dy_vec_AH <- toy_d$dy
X_mat_AH  <- cbind(dy_lag = toy_d$dy_lag, dx = toy_d$dx)

unit_ids    <- sort(unique(toy_d$unit))
n_per       <- length(unique(toy_d$time))   # 6 differenced periods (t = 3..8)
period_vals <- sort(unique(toy_d$time))

# Build per-unit Z blocks: rows = periods kept for unit i, cols = (6 y instrs + 1 x instr).
Z_blocks <- lapply(unit_ids, function(i) {
  d_i <- toy_d |> filter(unit == i) |> arrange(time)
  Z_i <- matrix(0, nrow = nrow(d_i), ncol = n_per)
  for (k in seq_len(nrow(d_i))) {
    p_idx <- match(d_i$time[k], period_vals)
    Z_i[k, p_idx] <- d_i$y_lag2[k]
  }
  cbind(Z_i, dx = d_i$dx)
})

# Arellano-Bond H matrix: tridiagonal with 2 on the diagonal, -1 off
H_AB <- matrix(0, n_per, n_per)
diag(H_AB) <- 2
for (k in seq_len(n_per - 1)) {
  H_AB[k, k + 1] <- -1
  H_AB[k + 1, k] <- -1
}

# One-step W = (sum_i Z_i' H Z_i)^{-1}
ZHZ <- Reduce("+", lapply(Z_blocks, function(Z_i) t(Z_i) %*% H_AB %*% Z_i))
W_AB <- solve(ZHZ)

# Stack Z, X, y across units and solve the GMM normal equation
Z_all <- do.call(rbind, Z_blocks)
ZtX <- crossprod(Z_all, X_mat_AH)
Zty <- crossprod(Z_all, dy_vec_AH)
theta_AH_pgmm_style <- solve(t(ZtX) %*% W_AB %*% ZtX,
                             t(ZtX) %*% W_AB %*% Zty)
rownames(theta_AH_pgmm_style) <- c("phi (dy_lag)", "beta (dx)")

Side by side with pgmm:

tibble(parameter             = c("phi", "beta"),
       manual_pgmm_style_AH  = c(theta_AH_pgmm_style[1, 1],
                                 theta_AH_pgmm_style[2, 1]),
       pgmm_AH               = c(unname(coef(ah_pgmm)["lag(y, 1)"]),
                                 unname(coef(ah_pgmm)["x"]))) |>
  knitr::kable(digits = 10,
               caption = "Manual GMM-style AH (period-stacked Z + Arellano-Bond H weighting) versus pgmm. The two now agree to many decimal places.")
Manual GMM-style AH (period-stacked Z + Arellano-Bond H weighting) versus pgmm. The two now agree to many decimal places.
| parameter | manual_pgmm_style_AH | pgmm_AH |
|-----------|----------------------|-----------|
| phi       | 0.6756498            | 0.6756498 |
| beta      | 0.4568281            | 0.4568281 |

Exact match. Three pedagogical takeaways:

  1. There is no single “Anderson-Hsiao estimator”. The original 1981 proposal used a single stacked moment for \(y_{t-2}\) (the “literal AH” in §1.5). Modern software — pgmm, Stata’s xtabond/xtabond2 — implements a richer period-stacked version with the Arellano-Bond H-weighting. Both are consistent; they give slightly different point estimates in finite samples.
  2. The weighting matrix matters. Even with the same instruments, choosing \(W = (Z'Z)^{-1}\) vs \(W = (\sum_i Z_i' H Z_i)^{-1}\) produces different one-step estimators. Two-step GMM (§3.4) iterates from any sensible one-step starting point, but the one-step result depends on \(W\).
  3. The matrix algebra scales linearly. Adding more instrument lags or more endogenous regressors just makes \(Z\) and \(X\) wider; the closed-form normal equation is the same. Part 3 is pgmm packaging exactly this structure.

Part 2 — pgmm() and the diagnostic battery for dynamic-panel GMM

Part 1 built the mechanics of one-step GMM from scratch. The real workhorse for routine work is plm::pgmm(), which packages the same machinery (Anderson-Hsiao, Arellano-Bond, Blundell-Bond) with built-in diagnostics. Part 2 explains the function and walks through the five tests summary.pgmm prints by default, demonstrating each on a small worked example.

GMM is not a magic correction for endogeneity. It is a moment-based inference technique that delivers consistent estimates only if its identifying assumptions hold. Four of those assumptions get a formal test on every summary.pgmm printout; a fifth (the instrument count) is a structural check.

The five-row checklist (the order in which to read the printout):

| Assumption | Test | \(H_0\) | What we want to see |
|---|---|---|---|
| Instrument validity (overid) | Sargan / Hansen | \(E[Z'\varepsilon] = 0\) | Fail to reject (\(p > 0.10\)) |
| Error process is AR(1), not AR(2)+ | AR(1) on \(\Delta\hat\varepsilon\) | no first-order serial correlation in \(\Delta\hat\varepsilon\) | Reject (\(p < 0.05\) — mechanical) |
| Same | AR(2) on \(\Delta\hat\varepsilon\) | no second-order serial correlation in \(\Delta\hat\varepsilon\) | Fail to reject (\(p > 0.05\)) |
| Time effects matter | Wald on \(\tau_t\) | all year dummies \(= 0\) | Reject → keep year FE; fail to reject → safe to drop |
| Instruments not too many | # instruments \(< N\) (rule of thumb) | | Check by hand |

2.1 The pgmm() function — formula syntax and arguments

pgmm() is the workhorse for difference and system GMM in R. Its call has a few moving parts that deserve close reading before any fit.

Anatomy of a pgmm formula. The formula uses up to three blocks separated by |:

y ~ <regressors> | <GMM-style instruments> | <regular instruments (optional)>
  • Left of the first | — the regression equation, written in levels. Use lag(y, 1) for the lagged outcome; pgmm will internally difference it (under transformation = "d") or stack it with the level equation (under transformation = "ld"). For other regressors, write the variable name for the contemporaneous value; lag(x, 0:1) adds both the contemporaneous and the lag-1 term.
  • First | block — GMM-style instruments. Write lag(y, 2:99) to use lags 2, 3, …, up to whatever the panel supports, each one stacked per period (one moment for each lag × period combination — see the demonstration in §1.7). To include lagged levels of other regressors as instruments, list them too: lag(y, 2:99) + lag(x, 2:99). This is where the over-identification lives.
  • Second | block (optional) — regular instruments. Variables that enter as one stacked moment (standard IV-style), no per-period expansion. Use this for strictly exogenous regressors that you do not want spread across periods. If you put x here, pgmm treats it as its own instrument.
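To make the three-block syntax concrete, here is a hypothetical fit on EmplUK (syntax illustration only, not a specification we use later) that puts GMM-style instruments in the first block and an ordinary IV-style instrument in the second:

```r
library(plm)
data("EmplUK", package = "plm")

# Hypothetical three-block specification, purely to show the syntax:
#   regression:       log(emp) on its first lag and log(wage)
#   GMM instruments:  lags 2..99 of log(emp), expanded per period
#   regular block:    log(capital), entering as one stacked IV-style moment
syntax_demo <- pgmm(
  log(emp) ~ lag(log(emp), 1) + log(wage) |
             lag(log(emp), 2:99) |
             log(capital),
  data           = EmplUK,
  effect         = "individual",
  model          = "onestep",
  transformation = "d"
)
coef(syntax_demo)
```

Whether log(capital) is a sensible instrument here is beside the point; the chunk only demonstrates where each block goes in the formula.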

The five main arguments:

| Argument | Choices | What it controls |
|---|---|---|
| data | a pdata.frame | Panel data with index = c(unit, time). Construct with pdata.frame(df, index = c("country", "year")). |
| effect | "individual", "twoways", "time" | "individual" adds only unit FE; "twoways" adds period dummies too (and a Wald test on them); "time" only period FE. Default: "twoways". |
| model | "onestep", "twosteps" | One-step uses the Arellano-Bond H-based weighting matrix (§1.7). Two-step re-estimates \(W\) using one-step residuals — asymptotically efficient, but two-step SEs need the Windmeijer correction (§3.4). |
| transformation | "d", "ld" | "d" = difference GMM (one differenced equation). "ld" = levels-and-differences = system GMM. System GMM adds the level equation, instrumented by lagged differences. |
| collapse | TRUE, FALSE | Collapse the per-period instruments into a single column per lag (cuts moment count; see Roodman 2009 and §2.5). Available in newer plm versions. |

A worked example. Fit a small Diff-GMM on EmplUK and print the summary so we have something to point at when we walk through each test below.

data("EmplUK", package = "plm")

# A modest Diff-GMM specification on the canonical AB-91 panel.
# Read the formula in three pieces:
#   * Left of |:   the regression equation. lag(log(emp), 1:2) puts both AR(1)
#                  and AR(2) lags on the RHS; lag(log(wage), 0:1) puts the
#                  contemporaneous wage and its lag; log(capital) and
#                  lag(log(output), 0:1) are exogenous regressors.
#   * Between the
#     two |s:      GMM-style instruments. lag(log(emp), 2:4) uses lags 2, 3,
#                  and 4 of log(emp), each stacked per period (one moment per
#                  (lag, period) pair — see §1.7).  Capping at 2:4 keeps the
#                  instrument count moderate (avoiding §2.5's pathologies).
#   * After second |: there is no third block here. We could add regular IV
#                     style instruments by writing "| ... | log(capital)" but
#                     for this demo we let log(capital) be a regressor only.
fit_demo <- pgmm(
  log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1) +
             log(capital)       + lag(log(output), 0:1) |
             lag(log(emp), 2:4),
  data           = EmplUK,
  effect         = "twoways",      # unit FE + time FE → Wald-on-tau_t will print
  model          = "twosteps",     # use optimal two-step weighting
  transformation = "d"             # difference GMM
)

# Capture summary once — the diagnostics below pull components out of this object.
sum_demo <- summary(fit_demo, robust = TRUE)  # robust = TRUE → Windmeijer SEs
# Print the full summary so you have it on screen for the §2.2-§2.5 walkthrough
sum_demo
## Twoways effects Two-steps model Difference GMM 
## 
## Call:
## pgmm(formula = log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 
##     0:1) + log(capital) + lag(log(output), 0:1) | lag(log(emp), 
##     2:4), data = EmplUK, effect = "twoways", model = "twosteps", 
##     transformation = "d")
## 
## Unbalanced Panel: n = 140, T = 7-9, N = 1031
## 
## Number of Observations Used: 611
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.587153 -0.021510  0.000000  0.001498  0.035743  0.702491 
## 
## Coefficients:
##                          Estimate Std. Error z-value  Pr(>|z|)    
## lag(log(emp), 1:2)1     0.0331317  0.2429704  0.1364   0.89154    
## lag(log(emp), 1:2)2     0.0042604  0.0578536  0.0736   0.94130    
## lag(log(wage), 0:1)0   -0.3289821  0.1460541 -2.2525   0.02429 *  
## lag(log(wage), 0:1)1    0.0123661  0.1050457  0.1177   0.90629    
## log(capital)            0.3786318  0.0603133  6.2777 3.435e-10 ***
## lag(log(output), 0:1)0  0.4403456  0.1786435  2.4649   0.01370 *  
## lag(log(output), 0:1)1 -0.0313526  0.1760058 -0.1781   0.85862    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(15) = 15.4708 (p-value = 0.41807)
## Autocorrelation test (1): normal = 0.1924172 (p-value = 0.84742)
## Autocorrelation test (2): normal = -0.4885348 (p-value = 0.62517)
## Wald test for coefficients: chisq(7) = 108.6394 (p-value = < 2.22e-16)
## Wald test for time dummies: chisq(6) = 10.85012 (p-value = 0.093121)

The rest of Part 2 walks through each of the five diagnostic rows on this single fit, then transitions to the family of estimators in Part 3.

2.2 Sargan / Hansen overidentification test

The first concern with GMM is: are our instruments actually exogenous? With more moment conditions than parameters, we can ask whether they are jointly consistent with the data.

The null hypothesis is

\[H_0: \quad E[Z' \varepsilon] = 0 \quad (\text{the instrument set is jointly exogenous}).\]

The test statistic. GMM minimizes the quadratic form \(\hat g(\theta)' W \hat g(\theta)\) where \(\hat g\) is the sample-moment vector. At the GMM optimum this quadratic form is asymptotically \(\chi^2_{m-k}\) under \(H_0\), where \(m\) is the number of moment conditions and \(k\) the number of estimated parameters. So the value of the GMM objective at the optimum serves as the test statistic.

Two variants:

Name Built on Robust to heteroscedasticity?
Sargan one-step weighting matrix No (assumes spherical errors)
Hansen-J two-step (optimal) weighting matrix Yes

summary.pgmm reports Sargan by default; Stata’s xtabond2 also reports Hansen. They share the same null and the same \(\chi^2\) distribution, but Hansen is more robust to heteroscedastic / serially correlated errors.

Decision rule.

  • Do not reject (\(p > 0.10\)): instruments pass the joint exogeneity test — cautious go-ahead.
  • Reject (\(p < 0.05\)): at least one moment condition is invalid. Diagnose by tightening lag depth (start instruments at lag 3 instead of lag 2) or by adding more lags of the DV.

Two caveats that every student gets wrong on first read:

  1. Joint test. A non-rejecting Sargan does not vouch for any individual instrument. It rules out gross collective violations only.
  2. Power-loss trap with many instruments. With too many instruments (see §2.5), the test asymptotically always fails to reject — even when instruments are invalid. A Sargan \(p\) exactly equal to 1.00 should be read as “instrument overfit”, not “all good”. Always print the instrument count alongside the \(p\).
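The p-value in that printout is just a \(\chi^2\) tail probability, so it can be recomputed by hand from the statistic and the degrees of freedom. A quick base-R check using the numbers printed in the fit_demo summary above:

```r
# Recompute the Sargan p-value from the printed statistic and degrees of freedom.
sargan_stat <- 15.4708                                 # chisq value from the summary
sargan_df   <- 15                                      # m - k
p_val <- pchisq(sargan_stat, df = sargan_df, lower.tail = FALSE)
round(p_val, 5)                                        # matches the printed 0.41807
```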

On fit_demo. Pull the Sargan slot directly out of the summary object — summary.pgmm stores it as a list with $statistic, $parameter (the degrees of freedom \(m - k\)), and $p.value.

# The Sargan slot inside the summary object
sarg <- sum_demo$sargan
tibble(
  statistic = sarg$statistic,             # the value of the GMM objective at the optimum
  df        = sarg$parameter,             # m - k = (# instruments) - (# parameters)
  p_value   = sarg$p.value,               # P(chi^2_df > statistic)
  decision  = ifelse(sarg$p.value > 0.10,
                     "Fail to reject — instruments pass (cautious)",
                     "Reject — instrument validity in doubt")
) |>
  knitr::kable(digits = 3, caption = "Sargan test extracted from sum_demo$sargan.")
Sargan test extracted from sum_demo$sargan.
statistic df p_value decision
15.471 15 0.418 Fail to reject — instruments pass (cautious)

2.3 Arellano-Bond AR(1) and AR(2) tests on differenced residuals

The Arellano-Bond identification strategy assumes the levels error \(\varepsilon_{it}\) is iid — in particular, no serial correlation. If that fails, the instrument \(y_{i,t-2}\) (which depends on \(\varepsilon_{i,t-2}\)) becomes correlated with \(\Delta \varepsilon_{it} = \varepsilon_{it} - \varepsilon_{i,t-1}\), breaking exogeneity.

We can’t test the levels error directly — we never observe it — but we can test its differenced counterpart \(\Delta\hat\varepsilon_{it}\), and translate the results.

The mechanical fact. If \(\varepsilon_{it}\) is iid, then \(\Delta\varepsilon_{it}\) has an MA(1) structure because consecutive differences share a \(\varepsilon_{i,t-1}\) term. So:

  • First-order serial correlation in \(\Delta\hat\varepsilon\) is expected. The AR(1) test should reject.
  • Second-order serial correlation in \(\Delta\hat\varepsilon\) is not expected. Differences two periods apart share no \(\varepsilon\) terms under iid. The AR(2) test should fail to reject.
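These two facts can be verified directly by simulation. A base-R sketch of the mechanical claim (illustrative only; one long iid series stands in for the panel): differencing iid noise induces a lag-1 autocorrelation of exactly \(-0.5\) and a lag-2 autocorrelation of zero.

```r
# Differencing iid errors: correlation at lag 1 is -1/2, at lag 2 is 0.
set.seed(42)
eps <- rnorm(1e6)                            # iid "levels" error
d   <- diff(eps)                             # differenced error, MA(1) by construction
r1  <- cor(d[-1], d[-length(d)])             # adjacent differences share -eps_{t-1}
r2  <- cor(d[-(1:2)], d[1:(length(d) - 2)])  # differences two apart share nothing
round(c(lag1 = r1, lag2 = r2), 3)            # approx -0.500 and 0.000
```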

Decision rules.

Test \(H_0\) What you want
AR(1) \(\text{Corr}(\Delta\hat\varepsilon_{it}, \Delta\hat\varepsilon_{i,t-1}) = 0\) Reject (\(p < 0.05\)). If you fail to reject, your model is not actually dynamic — recheck the LDV.
AR(2) \(\text{Corr}(\Delta\hat\varepsilon_{it}, \Delta\hat\varepsilon_{i,t-2}) = 0\) Fail to reject (\(p > 0.05\)). Rejecting means levels error is not iid — lag-2 instrument is invalid.

What to do if AR(2) rejects. Three options, in order of cost:

  1. Drop lag-2 from the instrument set: change lag(y, 2:99) to lag(y, 3:99). Loses one lag’s worth of moments.
  2. Add more lags of the DV to the regression: try lag(y, 1:2) on the RHS instead of just lag(y, 1). The extra lag soaks up the residual autocorrelation.
  3. Reconsider whether the dependent variable is mis-specified (e.g., should be in growth rates rather than levels).

On fit_demo. The summary object stores AR(1) and AR(2) as $m1 and $m2, each a list with $statistic (a \(z\)-statistic) and $p.value.

# AR(1) and AR(2) slots from the summary
ar1 <- sum_demo$m1   # first-order serial correlation in Δ residuals
ar2 <- sum_demo$m2   # second-order serial correlation in Δ residuals

tibble(
  test       = c("AR(1) on Δresid", "AR(2) on Δresid"),
  z_stat     = c(ar1$statistic, ar2$statistic),
  p_value    = c(ar1$p.value,   ar2$p.value),
  decision   = c(
    ifelse(ar1$p.value < 0.05,
           "Reject (expected — mechanical from differencing)",
           "Fail to reject (concern: model may not be dynamic)"),
    ifelse(ar2$p.value > 0.05,
           "Fail to reject (levels error looks iid — lag-2 instrument OK)",
           "Reject (levels error is NOT iid — lag-2 instrument suspect)")
  )
) |>
  knitr::kable(digits = 3, caption = "Arellano-Bond AR tests on the differenced residuals.")
Arellano-Bond AR tests on the differenced residuals.
test z_stat p_value decision
AR(1) on Δresid 0.192 0.847 Fail to reject (concern: model may not be dynamic)
AR(2) on Δresid -0.489 0.625 Fail to reject (levels error looks iid — lag-2 instrument OK)

2.4 Wald test for time dummies

When pgmm is fit with effect = "twoways", the summary prints a Wald-type test on the joint significance of the year dummies.

  • \(H_0\): all \(\tau_t = 0\) (no time effects).
  • Reject → year effects matter; keep them in the spec.
  • Fail to reject → safe to drop, simplifying the spec and reducing the instrument count.

In our Part 4 democracy fits, we use effect = "individual" for numerical reasons (see §4.1), so this test does not print. But on shorter panels where two-way effects are feasible, this is a routine check.

On fit_demo. Because we set effect = "twoways" above, the summary stores a Wald slot at $wald.td (Wald-on-time-dummies). It’s a list with $statistic (a \(\chi^2\) value), $parameter (df = number of period dummies), and $p.value.

wd <- sum_demo$wald.td
tibble(
  test       = "Wald on time dummies (H0: all tau_t = 0)",
  chi_sq     = wd$statistic,
  df         = wd$parameter,
  p_value    = wd$p.value,
  decision   = ifelse(wd$p.value < 0.05,
                      "Reject — keep year FE in the spec",
                      "Fail to reject — safe to drop year FE")
) |>
  knitr::kable(digits = 3, caption = "Wald test on the period dummies.")
Wald test on the period dummies.
test chi_sq df p_value decision
Wald on time dummies (H0: all tau_t = 0) 10.85 6 0.093 Fail to reject — safe to drop year FE

2.5 Instrument count — and how to regulate it

This is the only diagnostic in the battery that is not a statistical test. It’s a structural property of your specification, and getting it wrong can sabotage every other test in the printout.

The mechanics. Difference GMM with lag(y, 2:99) uses one moment for each (period, lag) combination available in the panel. The instrument count on the LDV alone grows roughly as

\[\frac{(T-1)(T-2)}{2}\]

— quadratic in \(T\). For \(T = 9\) (EmplUK) this gives 28 moments per regressor; for \(T = 20\) (typical macro panel) it gives 171; for \(T = 40\) (our democracy panel) it can exceed 700. System GMM roughly doubles these because the level equation contributes its own moments on top of the differenced ones.
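To see the quadratic growth concretely, here is the formula evaluated at the three panel lengths just mentioned (base R, numbers only):

```r
# Moments on the LDV in difference GMM with the full lag set: (T-1)(T-2)/2.
inst_count <- function(T) (T - 1) * (T - 2) / 2
sapply(c(`T=9` = 9, `T=20` = 20, `T=40` = 40), inst_count)  # 28, 171, 741 moments
```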

Three pathologies of too-many-instruments. When the moment count grows large relative to \(N\) (the number of panel units):

  1. The optimal weighting matrix becomes singular. pgmm falls back to a generalized inverse and emits “general inverse is used” warnings. The two-step estimator is still consistent but loses its precision guarantee.
  2. Sargan / Hansen tests lose power. With many moments, the test statistic becomes diffuse and the p-value drifts toward 1.00 even when instruments are invalid. Roodman calls this “instrument overfit”; reading a Sargan \(p \approx 1\) as a “passed test” is exactly backwards. A pristine p-value of 1.00 is a red flag, not a green light.
  3. Two-step standard errors are downward-biased, even with the Windmeijer correction. The estimator overfits the moment conditions and produces deceptively tight confidence intervals — the betas look more precise than they really are.

These pathologies often coexist. A fit with 700 instruments on \(N = 130\) countries will pass every diagnostic on the printout and be a poor approximation to the truth. The instrument count is the cheapest sanity check.

Two remedies for too many instruments.

  1. Limit lag depth. Replace lag(y, 2:99) with lag(y, 2:k) for a small \(k\). Common choices: \(k = 3\), \(4\), or \(5\). The moment count drops from quadratic in \(T\) to linear, killing most of the over-identification problem.
  2. Collapse the instrument matrix. Stata’s xtabond2 popularized the collapse option, which column-sums the per-period moments into a single moment per (regressor, lag). Newer versions of plm::pgmm expose the same idea through the collapse = TRUE argument (see the argument table above); where it is unavailable, limiting lag depth gives a similar reduction in moment count and is the practical fallback in R.

Rule of thumb (Roodman 2009b). Keep the moment count below \(N\), ideally well below. EmplUK has \(N = 140\) firms, so the full AB spec (~30 moments) sits comfortably within bounds. The democracy panel has \(N \approx 130\) countries with \(T\) up to 40, so the full lag(y, 2:99) spec will balloon past \(N\) — exactly the regime where the pathologies bite hardest. We will demonstrate this in Part 4.

The practical workflow (applied in Part 4):

  • Fit the same specification at multiple lag depths: full (2:99), moderate (2:4 or 2:5), and aggressively truncated (2:3).
  • Compare point estimates of \(\hat\phi\) and the key betas across columns. Substantial drift → instruments are doing some heavy lifting; stable estimates → robustness check passes.
  • For the main results table, pick a moderate lag depth where Sargan is informative (not trivially 1.00) and report the others in an appendix.

On fit_demo. fit_demo$W is a list with one entry per unit; each entry is the per-unit instrument matrix. ncol(fit_demo$W[[1]]) gives the total moment count (same across units).

# Instrument count and units, from the fitted object directly
n_inst  <- ncol(fit_demo$W[[1]])           # columns of the per-unit instrument matrix
n_units <- length(fit_demo$W)              # number of unit-level blocks

tibble(
  metric      = c("# instruments", "# units (N)", "ratio (# inst / N)"),
  value       = c(n_inst, n_units, round(n_inst / n_units, 2)),
  decision    = c(NA, NA,
                  ifelse(n_inst < n_units,
                         "Passes rule of thumb (# inst < N)",
                         "Fails — too many instruments"))
) |>
  knitr::kable(caption = "Instrument count and the Roodman rule-of-thumb check.")
Instrument count and the Roodman rule-of-thumb check.
metric value decision
# instruments 28.0 NA
# units (N) 140.0 NA
ratio (# inst / N) 0.2 Passes rule of thumb (# inst < N)

2.6 A pointer to likelihood-based alternatives

When GMM is fragile (small \(T\), near-unit-root outcome, weak instruments), an alternative path is transformed-likelihood estimation. The idea: instead of instrumenting your way past the \(\alpha_i\) problem, marginalize \(\alpha_i\) out of the likelihood through an information-orthogonal reparameterization.

Pickup et al. (2017) introduced this for political-science panels; Pickup & Hopkins (2022) compared it to GMM in a Monte Carlo and concluded that the orthogonal-panel model (OrthoPanels::opm()) is more robust for \(T \in [5, 20]\). It also avoids the instrument-count headache entirely.

We will not fit OPM in this lab — it is a different inferential paradigm (Bayesian, posterior samples) and deserves its own treatment. If you find yourself in the small-\(T\) regime where Diff-GMM and Sys-GMM disagree, OPM is the next thing to try.


Part 3 — The IV/GMM family demonstrated on EmplUK

Bridge from Part 1. Part 1 built Anderson-Hsiao and one-step GMM by hand on a toy panel — explicit \(X\), \(Z\), \(W\) matrices and a single solve() for the closed form. Part 3 does the same logic on real data via pgmm(), which packages the matrix algebra for us and uses a richer per-period moment structure than our manual stacking (compare §1.6 and §1.7). The machinery is identical; the conveniences are notational.

For Part 3 we use the canonical AB/BB demonstration data, EmplUK. It is shipped with plm and is the panel that Arellano & Bond (1991) used in the original paper: \(N = 140\) UK firms, \(T = 7\)–\(9\) years (1976–1984), outcome emp (employment).

data("EmplUK", package = "plm")
glimpse(EmplUK)
## Rows: 1,031
## Columns: 7
## $ firm    <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3,…
## $ year    <dbl> 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1977, 1978, 1979, 19…
## $ sector  <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,…
## $ emp     <dbl> 5.041, 5.600, 5.015, 4.715, 4.093, 3.166, 2.936, 71.319, 70.64…
## $ wage    <dbl> 13.1516, 12.3018, 12.8395, 13.8039, 14.2897, 14.8681, 13.7784,…
## $ capital <dbl> 0.5894, 0.6318, 0.6771, 0.6171, 0.5076, 0.4229, 0.3920, 16.936…
## $ output  <dbl> 95.7072, 97.3569, 99.6083, 100.5501, 99.5581, 98.6151, 100.030…
length(unique(EmplUK$firm))
## [1] 140
range(EmplUK$year)
## [1] 1976 1984

EmplUK is in exactly the regime where Nickell bias is severe — small \(T\), persistent outcome. Perfect classroom data.

3.1 First differences and Anderson-Hsiao (1981)

Difference the dynamic panel equation:

\[ \Delta y_{it} = \phi\, \Delta y_{i,t-1} + \beta\, \Delta x_{it} + \Delta \varepsilon_{it} \]

The unit fixed effects \(\alpha_i\) vanish — they are constant in \(t\), so \(\Delta \alpha_i = 0\). Differencing achieves what within-demeaning achieves, but without mixing future \(\varepsilon\) into the lag through \(\bar y_{i,-1}\). This is the core idea behind every estimator in the GMM family.

Of course, the differenced lag \(\Delta y_{i,t-1}\) is still correlated with \(\Delta\varepsilon_{it}\) — both contain \(\varepsilon_{i,t-1}\) mechanically. We need an instrument.

Anderson and Hsiao proposed using \(y_{i,t-2}\) as IV for \(\Delta y_{i,t-1}\). The exclusion logic is the same as in §1.4: \(y_{i,t-2}\) is correlated with \(\Delta y_{i,t-1}\) (relevance) but, under iid \(\varepsilon\), uncorrelated with \(\Delta\varepsilon_{it}\) (exclusion — precisely the assumption probed by the AR(2) diagnostic in §2.3).
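All three correlations in that argument (the endogeneity problem and the two instrument conditions) can be checked by simulation. A base-R sketch, assuming an AR(1) data-generating process with \(\phi = 0.5\) and no fixed effects (illustrative only, not the lab's estimators):

```r
# Endogeneity of the differenced lag, plus relevance and exclusion of y_{t-2}.
set.seed(7)
phi <- 0.5
eps <- rnorm(1e6)
y   <- as.numeric(stats::filter(eps, phi, method = "recursive"))  # y_t = phi*y_{t-1} + eps_t
t   <- 3:length(y)
dy_lag <- y[t - 1] - y[t - 2]       # differenced lag (the endogenous regressor)
de     <- eps[t] - eps[t - 1]       # differenced error
z      <- y[t - 2]                  # the Anderson-Hsiao instrument
out <- round(c(endogeneity = cor(dy_lag, de),  # nonzero: OLS on differences fails
               relevance   = cor(z, dy_lag),   # nonzero: instrument predicts Δy_{t-1}
               exclusion   = cor(z, de)), 3)   # ~0: instrument orthogonal to Δeps_t
out
```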

The pgmm call below implements that logic. Read the formula in two halves:

  • Left of | — the regression equation in level form. lag(log(emp), 1) is the lagged outcome; pgmm internally differences it.
  • Right of | — the GMM instrument block. lag(log(emp), 2) says use only the second lag of log(emp) as an instrument — that is the Anderson-Hsiao restriction.
ah_fit <- pgmm(log(emp) ~ lag(log(emp), 1) + log(wage) + log(capital) |
                 lag(log(emp), 2),
               data = EmplUK,
               effect = "individual",   # only unit FE (no time dummies)
               model  = "onestep",      # one-step is enough for just-identified IV
               transformation = "d")    # "d" = difference GMM
summary(ah_fit)
## Oneway (individual) effect One-step model Difference GMM 
## 
## Call:
## pgmm(formula = log(emp) ~ lag(log(emp), 1) + log(wage) + log(capital) | 
##     lag(log(emp), 2), data = EmplUK, effect = "individual", model = "onestep", 
##     transformation = "d")
## 
## Unbalanced Panel: n = 140, T = 7-9, N = 1031
## 
## Number of Observations Used: 751
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.903849 -0.045858  0.000000 -0.001085  0.041154  0.844502 
## 
## Coefficients:
##                   Estimate Std. Error z-value  Pr(>|z|)    
## lag(log(emp), 1)  0.801824   0.157098  5.1040 3.326e-07 ***
## log(wage)        -0.631281   0.195599 -3.2274  0.001249 ** 
## log(capital)      0.241204   0.056267  4.2868 1.813e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(6) = 34.79026 (p-value = 4.732e-06)
## Autocorrelation test (1): normal = -3.923134 (p-value = 8.7404e-05)
## Autocorrelation test (2): normal = -1.10812 (p-value = 0.26781)
## Wald test for coefficients: chisq(3) = 605.8932 (p-value = < 2.22e-16)

summary(ah_fit) prints the coefficients, the Sargan test, and the Arellano-Bond AR(1)/AR(2) tests on the differenced residuals. Note that the textbook AH estimator, with a single stacked instrument, is just-identified (\(m - k = 0\)), so the Sargan test would not apply; here, though, pgmm expands lag(log(emp), 2) into one moment per usable period, so the fit is overidentified (df = 6) and the Sargan rejects emphatically (\(p \approx 5 \times 10^{-6}\)), a sign this sparse instrument block sits poorly with the data. AH is rarely used in practice (it leaves valid moments on the table) but is the right conceptual stepping stone between “no IV” and “many IVs”.

3.2 Arellano-Bond Difference GMM (1991)

The key insight from Arellano & Bond: under the AH exogeneity assumption, all lags \(y_{i,t-2}, y_{i,t-3}, \ldots\) are valid instruments for \(\Delta y_{i,t-1}\). With each successive period \(t\) we gain one more usable lag. Stacking them all up turns AH into an over-identified GMM problem with a corresponding efficiency gain.

The standard Arellano-Bond difference GMM specification on EmplUK. Pay attention to the formula syntax:

  • lag(log(emp), 1:2) — both AR(1) and AR(2) terms enter the equation. This is heavier dynamics than what we’ll fit on the democracy panel, but it matches the published AB specification.
  • lag(log(wage), 0:1) — both contemporaneous wage and lagged wage. The 0 means “the contemporaneous value”, 1 means “the first lag”.
  • | lag(log(emp), 2:99) — instrument block. 2:99 means “use every lag from 2 up to 99 (or as many as the panel allows)”. This is the full Arellano-Bond moment set.
ab_fit <- pgmm(log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1) +
                 log(capital) + lag(log(output), 0:1) |
                 lag(log(emp), 2:99),
               data = EmplUK,
               effect = "twoways",      # unit + time fixed effects (AB-91 spec)
               model  = "twosteps",     # use the optimal two-step weighting matrix
               transformation = "d")    # difference GMM
summary(ab_fit, robust = TRUE)          # robust = TRUE → Windmeijer-corrected SEs
## Twoways effects Two-steps model Difference GMM 
## 
## Call:
## pgmm(formula = log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 
##     0:1) + log(capital) + lag(log(output), 0:1) | lag(log(emp), 
##     2:99), data = EmplUK, effect = "twoways", model = "twosteps", 
##     transformation = "d")
## 
## Unbalanced Panel: n = 140, T = 7-9, N = 1031
## 
## Number of Observations Used: 611
## Residuals:
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.6190677 -0.0255683  0.0000000 -0.0001339  0.0332013  0.6410272 
## 
## Coefficients:
##                         Estimate Std. Error z-value  Pr(>|z|)    
## lag(log(emp), 1:2)1     0.474151   0.185398  2.5575 0.0105437 *  
## lag(log(emp), 1:2)2    -0.052967   0.051749 -1.0235 0.3060506    
## lag(log(wage), 0:1)0   -0.513205   0.145565 -3.5256 0.0004225 ***
## lag(log(wage), 0:1)1    0.224640   0.141950  1.5825 0.1135279    
## log(capital)            0.292723   0.062627  4.6741 2.953e-06 ***
## lag(log(output), 0:1)0  0.609775   0.156263  3.9022 9.530e-05 ***
## lag(log(output), 0:1)1 -0.446373   0.217302 -2.0542 0.0399605 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(25) = 30.11247 (p-value = 0.22011)
## Autocorrelation test (1): normal = -1.53845 (p-value = 0.12394)
## Autocorrelation test (2): normal = -0.2796829 (p-value = 0.77972)
## Wald test for coefficients: chisq(7) = 142.0353 (p-value = < 2.22e-16)
## Wald test for time dummies: chisq(6) = 16.97046 (p-value = 0.0093924)

What to read off the printout, applying the §2 checklist top-to-bottom:

  • Coefficient table. \(\hat\phi_1 \approx 0.47\) on lag(log(emp), 1), \(\hat\phi_2 \approx -0.05\) on lag(log(emp), 2) — modest persistence with slight overshooting. Standard errors are the Windmeijer-corrected two-step SEs because we set robust = TRUE.
  • Sargan test (bottom of summary, §2.2). Overidentification test for the differenced equation. We want \(p > 0.10\); here \(p \approx 0.22\), a pass.
  • AR(1) on differenced residuals (§2.3). Should reject. Here it does not at conventional levels (\(p \approx 0.12\)), a borderline result worth flagging rather than ignoring.
  • AR(2) on differenced residuals (§2.3). Should not reject, and does not (\(p \approx 0.78\)).
  • Wald test for time dummies (§2.4). Tests whether the year dummies as a block are non-zero. Only printed because effect = "twoways".

3.3 Blundell-Bond System GMM (1998)

When the outcome is highly persistent (\(\phi\) near 1), the moment \(E[y_{i,t-2} \cdot \Delta y_{i,t-1}]\) shrinks toward zero — lagged levels become weak instruments for the differenced equation. In macro panels this is the typical regime, not the exception.
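Under an AR(1) data-generating process this weakening has a closed form, \(\mathrm{Corr}(y_{i,t-2}, \Delta y_{i,t-1}) = -\sqrt{(1-\phi)/2}\), which vanishes as \(\phi \to 1\). A simulation sketch in base R (illustrative only; no fixed effects, unit error variance):

```r
# The AH instrument weakens as persistence rises: Corr(y_{t-2}, dy_{t-1}) -> 0 as phi -> 1.
weak_corr <- function(phi, n = 1e6) {
  eps <- rnorm(n)
  y   <- as.numeric(stats::filter(eps, phi, method = "recursive"))  # AR(1) recursion
  t   <- 3:n
  cor(y[t - 2], y[t - 1] - y[t - 2])
}
set.seed(1)
out_weak <- round(sapply(c(phi_0.5 = 0.5, phi_0.9 = 0.9, phi_0.99 = 0.99), weak_corr), 2)
out_weak   # theory: -sqrt((1 - phi)/2), i.e. about -0.50, -0.22, -0.07
```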

Blundell and Bond’s solution: stack a level equation alongside the differenced equation, and instrument the level equation with lagged differences.

\[ \begin{aligned} \text{Differenced eq:} &\quad \Delta y_{it} = \phi\, \Delta y_{i,t-1} + \beta\, \Delta x_{it} + \Delta\varepsilon_{it} \quad \text{(IVs: } y_{i,t-2}, y_{i,t-3}, \ldots\text{)} \\ \text{Level eq:} &\quad y_{it} = \alpha_i + \phi\, y_{i,t-1} + \beta\, x_{it} + \varepsilon_{it} \quad \text{(IVs: } \Delta y_{i,t-1}, \Delta x_{i,t-1}\text{)} \end{aligned} \]

The new exogeneity assumption: lagged differences are uncorrelated with \(\alpha_i\). Equivalent to a stationarity condition on the initial conditions; reasonable in steady-state panels, less so when the panel is observed mid-transition.

In pgmm we get system GMM by setting transformation = "ld" (levels and differences):

bb_fit <- pgmm(log(emp) ~ lag(log(emp), 1) + lag(log(wage), 0:1) +
                 lag(log(capital), 0:1) |
                 lag(log(emp), 2:99) + lag(log(wage), 2:99) +
                 lag(log(capital), 2:99),
               data = EmplUK,
               effect = "twoways",
               model  = "onestep",
               transformation = "ld")   # "ld" = levels + differences (system GMM)
summary(bb_fit, robust = TRUE)
## Twoways effects One-step model System GMM 
## 
## Call:
## pgmm(formula = log(emp) ~ lag(log(emp), 1) + lag(log(wage), 0:1) + 
##     lag(log(capital), 0:1) | lag(log(emp), 2:99) + lag(log(wage), 
##     2:99) + lag(log(capital), 2:99), data = EmplUK, effect = "twoways", 
##     model = "onestep", transformation = "ld")
## 
## Unbalanced Panel: n = 140, T = 7-9, N = 1031
## 
## Number of Observations Used: 1642
## Residuals:
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.7530341 -0.0369030  0.0000000  0.0002882  0.0466069  0.6001503 
## 
## Coefficients:
##                          Estimate Std. Error z-value  Pr(>|z|)    
## lag(log(emp), 1)         0.935605   0.026295 35.5810 < 2.2e-16 ***
## lag(log(wage), 0:1)0    -0.630976   0.118054 -5.3448 9.050e-08 ***
## lag(log(wage), 0:1)1     0.482620   0.136887  3.5257 0.0004224 ***
## lag(log(capital), 0:1)0  0.483930   0.053867  8.9838 < 2.2e-16 ***
## lag(log(capital), 0:1)1 -0.424393   0.058479 -7.2572 3.952e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(100) = 118.763 (p-value = 0.097096)
## Autocorrelation test (1): normal = -4.808434 (p-value = 1.5212e-06)
## Autocorrelation test (2): normal = -0.2800133 (p-value = 0.77947)
## Wald test for coefficients: chisq(5) = 11174.82 (p-value = < 2.22e-16)
## Wald test for time dummies: chisq(7) = 14.71138 (p-value = 0.039882)

Two things change compared with diff-GMM:

  1. We can — and should — list instrument blocks for every endogenous regressor on the right side of |, not just the lagged DV. Each block contributes its own moment conditions.
  2. The instrument count roughly doubles (level moments stack on top of difference moments). Efficiency gain is large in persistent panels; the cost is a stronger identifying assumption (lagged differences uncorrelated with \(\alpha_i\)) and a much larger instrument matrix — see §2.5 and §3.5.

3.4 One-step vs two-step + Windmeijer

GMM minimizes a quadratic form \(\hat g(\theta)' \, W \, \hat g(\theta)\) where \(\hat g\) is the sample moment vector and \(W\) is a weighting matrix. Two strategies:

  • One-step: \(W\) is the identity (or a fixed matrix that’s optimal under iid errors). Simple; consistent; not asymptotically efficient when errors are heteroscedastic.
  • Two-step: estimate one-step first, use the residuals to build the optimal weighting matrix \(W^* = (\hat E[g g'])^{-1}\), then re-estimate. Asymptotically efficient.
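The same two-strategy logic can be hand-rolled in a stripped-down cross-sectional IV problem, echoing the matrix construction from Part 1. A base-R sketch on hypothetical simulated data (one endogenous regressor, two instruments, heteroscedastic errors): one-step uses a fixed \(W\), two-step rebuilds \(W\) from the first-step residuals.

```r
# One-step vs two-step GMM for a single endogenous regressor, two instruments.
set.seed(3)
n  <- 5e4
z1 <- rnorm(n); z2 <- rnorm(n)
u  <- rnorm(n)                                   # shared shock -> endogeneity
x  <- z1 + 0.5 * z2 + u
e  <- 0.5 * u + rnorm(n) * (1 + 0.5 * abs(z1))   # heteroscedastic error
y  <- 1.5 * x + e                                # true beta = 1.5
Z  <- cbind(z1, z2); X <- cbind(x)
gmm_beta <- function(W)
  solve(t(X) %*% Z %*% W %*% t(Z) %*% X, t(X) %*% Z %*% W %*% t(Z) %*% y)
b1 <- gmm_beta(solve(crossprod(Z)))      # one-step: fixed W (here the 2SLS choice)
r  <- as.vector(y - X %*% b1)            # first-step residuals
b2 <- gmm_beta(solve(crossprod(Z * r)))  # two-step: W* = (sum z_i z_i' e_i^2)^{-1}
c(onestep = b1, twostep = b2)            # both consistent for 1.5
```

Both estimates land near the true coefficient; the gain from the second step is in efficiency, not consistency, which is why the one-step fit is a legitimate stopping point when errors are close to spherical.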

In pgmm we picked model = "twosteps" for AB above. The catch: two-step standard errors are downward biased in finite samples — sometimes severely. Windmeijer (2005) derived a correction. Invoke it via robust = TRUE in summary.pgmm:

# Two-step AB fit, plain SEs
summary(ab_fit, robust = FALSE)$coefficients[, 2] |> head(3)
##  lag(log(emp), 1:2)1  lag(log(emp), 1:2)2 lag(log(wage), 0:1)0 
##           0.08530307           0.02728433           0.04934539
# Two-step AB fit, Windmeijer-style robust SEs
summary(ab_fit, robust = TRUE)$coefficients[, 2] |> head(3)
##  lag(log(emp), 1:2)1  lag(log(emp), 1:2)2 lag(log(wage), 0:1)0 
##            0.1853985            0.0517491            0.1455653

The robust SEs in the second call are larger — that’s the Windmeijer correction in action. Always report the robust two-step SEs; the plain ones are not safe in finite samples.
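To quantify the correction, take the ratio of the two sets of standard errors (values copied from the two calls above):

```r
# Windmeijer inflation factors for the first three coefficients of ab_fit.
plain  <- c(0.08530307, 0.02728433, 0.04934539)   # robust = FALSE
robust <- c(0.18539850, 0.05174910, 0.14556530)   # robust = TRUE
round(robust / plain, 2)    # the correction roughly doubles to triples the SEs
```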

3.5 Collapsed instruments — §2.5 applied

We argued in §2.5 that too many instruments degrade both the weighting matrix and the Sargan diagnostic. Here we demonstrate the remedy: refit the same AB specification but cap the lag depth at 2:4 so the moment count stays moderate. (Strictly speaking this is the lag-depth remedy rather than a collapse = TRUE fit; the object name ab_fit_collapsed is kept for continuity with the xtabond2 terminology.)

ab_fit_collapsed <- pgmm(log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1) +
                           log(capital) + lag(log(output), 0:1) |
                           lag(log(emp), 2:4),
                         data = EmplUK,
                         effect = "twoways",
                         model = "twosteps",
                         transformation = "d")

And the instrument-count table across the EmplUK fits from this part:

# Helper functions to pull instrument count and N from a pgmm object.
# fit$W is a list of unit-level instrument matrices; ncol() of one block
# gives the moment count (same for every unit).
n_inst  <- function(fit) ncol(fit$W[[1]])
n_units <- function(fit) length(fit$W)

tibble(model = c("AH (just-id)", "AB (full)", "AB (lag 2:4)", "BB (full)"),
       n_instruments = c(n_inst(ah_fit), n_inst(ab_fit),
                         n_inst(ab_fit_collapsed), n_inst(bb_fit)),
       n_units       = c(n_units(ah_fit), n_units(ab_fit),
                         n_units(ab_fit_collapsed), n_units(bb_fit))) |>
  mutate(ratio        = round(n_instruments / n_units, 2),
         passes_rule  = n_instruments < n_units) |>
  knitr::kable(caption = "Instrument count vs. sample size N. The Roodman rule of thumb (§2.5) wants ratio < 1.")
Instrument count vs. sample size N. The Roodman rule of thumb (§2.5) wants ratio < 1.
model n_instruments n_units ratio passes_rule
AH (just-id) 9 140 0.06 TRUE
AB (full) 38 140 0.27 TRUE
AB (lag 2:4) 28 140 0.20 TRUE
BB (full) 113 140 0.81 TRUE

Capping the lags at 2:4 trims the moment count here from 38 to 28. The saving is modest because with \(T \le 9\) the full 2:99 set is already small; on longer panels the same cap turns quadratic-in-\(T\) growth into linear growth and cuts the count dramatically. The point estimates usually move only a little, and the diagnostic power of Sargan recovers. On EmplUK with \(N = 140\), every spec above passes the rule of thumb — but Part 4 will show what happens on a panel where the full spec blows past \(N\).


Part 4 — Applied workflow on the democracy panel

We now bring the full estimator menu back to the same panel Lab 5 used: the Przeworski democracy data with outcome ln_gdpw and treatment reg, moderator edt, and time-invariant control oil.

4.1 Data and six-fit menu

We re-build democracy_dyn exactly as in Lab 5, plus two pre-computed interaction columns so the GMM formulas stay readable.

democracy <- read_csv("data/democracy.csv") |>
  rename_with(tolower) |>
  mutate(ln_gdpw   = log(gdpw),
         country_f = factor(country),
         year_f    = factor(year))

democracy_dyn <- democracy |>
  group_by(country) |>
  arrange(year) |>
  mutate(ln_gdpw_lag   = dplyr::lag(ln_gdpw),
         reg_lag       = dplyr::lag(reg),
         edt_lag       = dplyr::lag(edt),
         reg_x_edt     = reg * edt,
         reg_x_edt_lag = reg_lag * edt_lag) |>
  ungroup()

# Number of countries and median T
democracy_dyn |>
  drop_na(ln_gdpw, reg, edt, oil) |>
  group_by(country) |>
  summarise(T_i = n(), .groups = "drop") |>
  summarise(N = n(),
            T_med = median(T_i),
            T_min = min(T_i),
            T_max = max(T_i))
## # A tibble: 1 × 4
##       N T_med T_min T_max
##   <int> <int> <int> <int>
## 1   113    28     5    28
pdat <- pdata.frame(democracy_dyn, index = c("country", "year"))

The panel is unbalanced. Median \(T\) is well above 15 — meaning Nickell bias is moderate rather than catastrophic — but it is still worth correcting, especially for countries with shorter observation windows.

ARDL(1,1) form throughout, mirroring Lab 5:

\[ \ln \text{gdpw}_{it} = \alpha_i + \tau_t + \phi\, \ln\text{gdpw}_{i,t-1} + \beta_0\,\text{reg}_{it} + \beta_1\,\text{reg}_{i,t-1} + \beta_e\,\text{edt}_{it} + \gamma_0\,(\text{reg}\!\cdot\!\text{edt})_{it} + \gamma_1\,(\text{reg}\!\cdot\!\text{edt})_{i,t-1} + \beta_o\,\text{oil}_i + \varepsilon_{it} \]

Six fits:

Col Model Transformation Instruments
(1) FE-ARDL(1,1) within, twoways — (Lab 5 baseline; biased)
(2) Anderson-Hsiao “d”, onestep lag(ln_gdpw, 2)
(3) Diff-GMM (full) “d”, twosteps, indiv lag(ln_gdpw, 2:99)
(4) Diff-GMM (collapsed) “d”, twosteps, indiv lag(ln_gdpw, 2:4)
(5) Sys-GMM (full) “ld”, twosteps, indiv lag(ln_gdpw, 2:99)
(6) Sys-GMM (collapsed) “ld”, twosteps, indiv lag(ln_gdpw, 2:4)

Design notes. Two practical adjustments before fitting:

  • oil is strictly time-invariant. Differencing makes the oil column identically zero in every “d” specification (AH, Diff-GMM), producing a singular normal-equations matrix. We drop oil from the three differenced specifications and keep it only in Sys-GMM, where the level equation identifies it. The two-way FE absorbs oil into the unit fixed effects. This is exactly the pedagogical point: when you have a time-invariant control you care about, diff-GMM cannot help you — sys-GMM can.
  • effect = "individual" for all GMM fits, not "twoways". With \(T \approx 40\) unique years, an unbalanced panel, and a large instrument set, the combination of two-way effects + GMM moments produces a rank-deficient weighting matrix that breaks the optimal two-step estimator. (The FE-ARDL baseline (1) keeps twoways because plm’s within transformation handles the rank deficiency cleanly.)
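The first design note can be verified in one line of base R: differencing any within-unit constant yields zeros, which is why oil cannot survive the "d" transformation.

```r
# A time-invariant column differences to identically zero, so a "d"-transformed
# design matrix with oil in it is singular.
oil_i <- rep(1, 6)   # oil status for one country across six years
diff(oil_i)          # all zeros -- the differenced column carries no information
```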
# (1) FE-ARDL(1,1) — Lab 5 baseline, refit with plm so it merges into a
#     single comparison table with the GMM fits.  oil is collinear with the
#     unit FE and gets dropped silently by plm.
fit_fe <- plm(ln_gdpw ~ ln_gdpw_lag + reg + reg_lag + edt +
                reg_x_edt + reg_x_edt_lag + oil,
              data = pdat,
              effect = "twoways",
              model  = "within")

# (2) Anderson-Hsiao — lag-2 of the outcome as instrument.  (pgmm stacks the
#     lag-2 moment per period, so this fit is mildly over-identified rather
#     than textbook just-identified AH.)  Drop oil (would difference to 0).
fit_ah <- pgmm(ln_gdpw ~ lag(ln_gdpw, 1) + reg + reg_lag + edt +
                 reg_x_edt + reg_x_edt_lag |
                 lag(ln_gdpw, 2),
               data = pdat,
               effect = "individual",
               model  = "onestep",
               transformation = "d")

# (3) Diff-GMM (full lag depth). Drop oil.
fit_diff_full <- pgmm(ln_gdpw ~ lag(ln_gdpw, 1) + reg + reg_lag + edt +
                        reg_x_edt + reg_x_edt_lag |
                        lag(ln_gdpw, 2:99),
                      data = pdat,
                      effect = "individual",
                      model  = "twosteps",
                      transformation = "d")

# (4) Diff-GMM (collapsed-style: lag 2:4 only). Drop oil.
fit_diff_coll <- pgmm(ln_gdpw ~ lag(ln_gdpw, 1) + reg + reg_lag + edt +
                        reg_x_edt + reg_x_edt_lag |
                        lag(ln_gdpw, 2:4),
                      data = pdat,
                      effect = "individual",
                      model  = "twosteps",
                      transformation = "d")

# (5) Sys-GMM (full). Keep oil — the level equation can identify it.
fit_sys_full <- pgmm(ln_gdpw ~ lag(ln_gdpw, 1) + reg + reg_lag + edt +
                       reg_x_edt + reg_x_edt_lag + oil |
                       lag(ln_gdpw, 2:99),
                     data = pdat,
                     effect = "individual",
                     model  = "twosteps",
                     transformation = "ld")

# (6) Sys-GMM (collapsed-style). Keep oil.
fit_sys_coll <- pgmm(ln_gdpw ~ lag(ln_gdpw, 1) + reg + reg_lag + edt +
                       reg_x_edt + reg_x_edt_lag + oil |
                       lag(ln_gdpw, 2:4),
                     data = pdat,
                     effect = "individual",
                     model  = "twosteps",
                     transformation = "ld")

4.2 Summaries and comparison

We lean on two GMM fits most: full Diff-GMM and full Sys-GMM. We print the full Diff-GMM summary so you see what a pgmm printout looks like end-to-end; for Sys-GMM we extract the coefficient table and the four diagnostic statistics into compact tables so the prose flow is not buried under a wall of console output.

# Canonical full printout — read it once to know the layout
summary(fit_diff_full, robust = TRUE)
## Oneway (individual) effect Two-steps model Difference GMM 
## 
## Call:
## pgmm(formula = ln_gdpw ~ lag(ln_gdpw, 1) + reg + reg_lag + edt + 
##     reg_x_edt + reg_x_edt_lag | lag(ln_gdpw, 2:99), data = pdat, 
##     effect = "individual", model = "twosteps", transformation = "d")
## 
## Unbalanced Panel: n = 135, T = 6-40, N = 4126
## 
## Number of Observations Used: 2674
## Residuals:
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.7504327 -0.0020965  0.0000000  0.0002401  0.0023728  0.5862290 
## 
## Coefficients:
##                   Estimate Std. Error z-value Pr(>|z|)    
## lag(ln_gdpw, 1)  0.9507663  0.0148429 64.0551  < 2e-16 ***
## reg             -0.0150909  0.0272604 -0.5536  0.57987    
## reg_lag         -0.0018121  0.0280968 -0.0645  0.94858    
## edt             -0.0080669  0.0039462 -2.0442  0.04093 *  
## reg_x_edt        0.0045252  0.0048995  0.9236  0.35570    
## reg_x_edt_lag    0.0008659  0.0047661  0.1817  0.85583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(740) = 111.2742 (p-value = 1)
## Autocorrelation test (1): normal = -5.493184 (p-value = 3.9475e-08)
## Autocorrelation test (2): normal = -0.06936944 (p-value = 0.9447)
## Wald test for coefficients: chisq(6) = 12971.02 (p-value = < 2.22e-16)

Sys-GMM (full), focused excerpt. Coefficients with Windmeijer-corrected standard errors, followed by the four §2 diagnostic statistics in a compact table:

s_sys <- summary(fit_sys_full, robust = TRUE)

# Coefficients
as.data.frame(s_sys$coefficients) |>
  tibble::rownames_to_column("term") |>
  knitr::kable(digits = 3,
               caption = "Sys-GMM (full) — coefficients with Windmeijer-corrected SEs.")
Sys-GMM (full) — coefficients with Windmeijer-corrected SEs.
term Estimate Std. Error z-value Pr(>|z|)
lag(ln_gdpw, 1) 1.010 0.002 591.855 0.000
reg -0.020 0.020 -0.993 0.321
reg_lag -0.030 0.019 -1.570 0.116
edt -0.012 0.003 -4.144 0.000
reg_x_edt 0.004 0.004 1.160 0.246
reg_x_edt_lag 0.005 0.004 1.234 0.217
oil -0.026 0.010 -2.515 0.012
# Diagnostics
tibble(
  test           = c("Sargan", "AR(1) on Δresid", "AR(2) on Δresid", "# instruments / N"),
  statistic      = c(s_sys$sargan$statistic,
                     s_sys$m1$statistic,
                     s_sys$m2$statistic,
                     length(fit_sys_full$W[[1]][1, ])),
  parameter_df_N = c(s_sys$sargan$parameter,
                     NA_real_,
                     NA_real_,
                     length(fit_sys_full$W)),
  p_value        = c(s_sys$sargan$p.value,
                     s_sys$m1$p.value,
                     s_sys$m2$p.value,
                     NA_real_),
  decision       = c("want fail to reject (p > 0.10)",
                     "want reject (p < 0.05)",
                     "want fail to reject (p > 0.05)",
                     "want ratio < 1")
) |>
  knitr::kable(digits = 3,
               caption = "Sys-GMM (full) — the four §2 diagnostics on one line each.")
Sys-GMM (full) — the four §2 diagnostics on one line each.
test statistic parameter_df_N p_value decision
Sargan 111.643 784 1.000 want fail to reject (p > 0.10)
AR(1) on Δresid -5.497 NA 0.000 want reject (p < 0.05)
AR(2) on Δresid -0.084 NA 0.933 want fail to reject (p > 0.05)
# instruments / N 791.000 135 NA want ratio < 1

Reading the printouts against the §2 checklist:

  • Sargan p-value (§2.1). Both fits have Sargan \(p\) very close to 1.00 — the Roodman power-loss pathology from §2.4. With several hundred instruments on ~130 countries we are deep in the over-identification regime where the test cannot reject. The instrument-count line at the bottom of the Sys-GMM excerpt makes this concrete.
  • AR(1) on differenced residuals (§2.2). Rejects in both — mechanical and expected.
  • AR(2) on differenced residuals (§2.2). Fails to reject in both — the AR(1) error assumption holds and lag-2 is valid.
  • Instrument count (§2.4). Confirmed in the bottom line of the Sys-GMM excerpt above and in the diagnostic table below.

Side-by-side coefficient table.

fits <- list(`(1) FE-ARDL`     = fit_fe,
             `(2) AH`          = fit_ah,
             `(3) Diff-GMM (full)`     = fit_diff_full,
             `(4) Diff-GMM (lag 2:4)`  = fit_diff_coll,
             `(5) Sys-GMM (full)`      = fit_sys_full,
             `(6) Sys-GMM (lag 2:4)`   = fit_sys_coll)

show_reg(fits,
         custom.model.names = names(fits),
         caption = "Six estimators on the democracy panel: ARDL(1,1) form throughout. Coefficient names follow plm/pgmm conventions; lag(ln_gdpw, 1) and ln_gdpw_lag refer to the same lagged outcome.",
         label = "tab:p4-compare",
         single.row = FALSE,
         digits = 3,
         stars = c(0.01, 0.05, 0.1))
Six estimators on the democracy panel: ARDL(1,1) form throughout. Coefficient names follow plm/pgmm conventions; lag(ln_gdpw, 1) and ln_gdpw_lag refer to the same lagged outcome.
 
  1. FE-ARDL
  1. AH
  1. Diff-GMM (full)
  1. Diff-GMM (lag 2:4)
  1. Sys-GMM (full)
  1. Sys-GMM (lag 2:4)
ln_gdpw_lag 0.925***          
  (0.008)          
reg 0.013 -0.050 -0.015 -0.057 -0.020 -0.044*
  (0.018) (0.034) (0.027) (0.037) (0.020) (0.023)
reg_lag -0.018 0.013 -0.002 0.009 -0.030 -0.031
  (0.017) (0.033) (0.028) (0.033) (0.019) (0.020)
edt -0.004 -0.028*** -0.008** -0.024*** -0.012*** -0.016***
  (0.003) (0.005) (0.004) (0.006) (0.003) (0.003)
reg_x_edt -0.001 0.012** 0.005 0.013* 0.004 0.009**
  (0.004) (0.006) (0.005) (0.007) (0.004) (0.004)
reg_x_edt_lag 0.001 -0.001 0.001 -0.001 0.005 0.004
  (0.004) (0.005) (0.005) (0.006) (0.004) (0.004)
lag(ln_gdpw, 1)   1.031*** 0.951*** 1.013*** 1.010*** 1.013***
    (0.022) (0.015) (0.021) (0.002) (0.002)
oil         -0.026** -0.036***
          (0.010) (0.014)
R2 0.844          
Adj. R2 0.836          
Num. obs. 2787 4126 4126 4126 4126 4126
n   135 135 135 135 135
T   40 40 40 40 40
Num. obs. used   2674 2674 2674 5461 5461
Sargan Test: chisq   52.624 111.274 97.629 111.643 105.778
Sargan Test: df   37.000 740.000 110.000 784.000 154.000
Sargan Test: p-value   0.046 1.000 0.794 1.000 0.999
Wald Test Coefficients: chisq   6666.721 12971.017 10251.020 25896930.378 17727569.930
Wald Test Coefficients: df   6 6 6 7 7
Wald Test Coefficients: p-value   0.000 0.000 0.000 0.000 0.000
***p < 0.01; **p < 0.05; *p < 0.1

Custom diagnostic table for a cleaner read of the §2 checklist across all six fits:

extract_diags <- function(fit, label, fit_class) {
  if (fit_class == "plm") {
    return(tibble(model = label,
                  phi_hat = unname(coef(fit)["ln_gdpw_lag"]),
                  sargan_p = NA_real_, ar1_p = NA_real_, ar2_p = NA_real_,
                  n_inst = NA_integer_,
                  n_obs = nobs(fit)))
  }
  s <- summary(fit, robust = TRUE)
  tibble(model = label,
         phi_hat  = unname(coef(fit)["lag(ln_gdpw, 1)"]),
         sargan_p = if (!is.null(s$sargan)) s$sargan$p.value else NA_real_,
         ar1_p    = if (!is.null(s$m1)) s$m1$p.value else NA_real_,
         ar2_p    = if (!is.null(s$m2)) s$m2$p.value else NA_real_,
         n_inst   = length(fit$W[[1]][1, ]),
         n_obs    = sum(sapply(fit$residuals, length)))
}

diag_tbl <- bind_rows(
  extract_diags(fit_fe,        "(1) FE-ARDL",            "plm"),
  extract_diags(fit_ah,        "(2) AH",                 "pgmm"),
  extract_diags(fit_diff_full, "(3) Diff-GMM (full)",    "pgmm"),
  extract_diags(fit_diff_coll, "(4) Diff-GMM (lag 2:4)", "pgmm"),
  extract_diags(fit_sys_full,  "(5) Sys-GMM (full)",     "pgmm"),
  extract_diags(fit_sys_coll,  "(6) Sys-GMM (lag 2:4)",  "pgmm")
)

diag_tbl |>
  mutate(across(c(phi_hat, sargan_p, ar1_p, ar2_p), ~ round(.x, 3))) |>
  knitr::kable(caption = "Diagnostics across estimators. phi_hat is the lagged-DV coefficient; sargan_p is the overid test p-value; ar1_p / ar2_p are the Arellano-Bond differenced-residual tests; n_inst is the instrument count.")
Diagnostics across estimators. phi_hat is the lagged-DV coefficient; sargan_p is the overid test p-value; ar1_p / ar2_p are the Arellano-Bond differenced-residual tests; n_inst is the instrument count.
model phi_hat sargan_p ar1_p ar2_p n_inst n_obs
(1) FE-ARDL 0.925 NA NA NA NA 2787
(2) AH 1.031 0.046 0 0.924 43 5130
(3) Diff-GMM (full) 0.951 1.000 0 0.945 746 5130
(4) Diff-GMM (lag 2:4) 1.013 0.794 0 0.923 116 5130
(5) Sys-GMM (full) 1.010 1.000 0 0.933 791 10395
(6) Sys-GMM (lag 2:4) 1.013 0.999 0 0.920 161 10395

Two patterns leap out of this table:

  • \(\hat\phi\) is larger in every bias-corrected column than in the FE baseline — the Nickell-bias correction, exactly as predicted. (The ordering among AH, Diff-GMM, and Sys-GMM is not monotone; the jump up from FE is the signal.)
  • The full-lag-depth GMM specifications have absurd instrument counts (~700) on a panel with ~130 countries. The collapsed lag(2:4) versions drop the count by an order of magnitude, and that’s when Sargan becomes informative.

4.3 Misspecified examples

Before drawing substantive conclusions, it is worth seeing what a failed specification looks like. Diff-GMM with lag(y, 2:99) uses all available lags. What happens if we choke off the instrument set to just lag 2 and lag 3? Or to lags 2 through 5?

# Misspec 1: lag-2 and lag-3 only — a deliberately shallow instrument set,
#            though the per-period stacking still leaves it over-identified.
fit_miss23 <- pgmm(ln_gdpw ~ lag(ln_gdpw, 1) + reg + reg_lag + edt +
                     reg_x_edt + reg_x_edt_lag |
                     lag(ln_gdpw, 2:3),
                   data = pdat, effect = "individual",
                   model = "twosteps", transformation = "d")

# Misspec 2: lag-2 through lag-5 — still tight, but a bit more identification.
fit_miss25 <- pgmm(ln_gdpw ~ lag(ln_gdpw, 1) + reg + reg_lag + edt +
                     reg_x_edt + reg_x_edt_lag |
                     lag(ln_gdpw, 2:5),
                   data = pdat, effect = "individual",
                   model = "twosteps", transformation = "d")

# Compare the two against the full Diff-GMM in (3).
diag_misspec <- bind_rows(
  extract_diags(fit_miss23,    "Misspec (2:3)",    "pgmm"),
  extract_diags(fit_miss25,    "Misspec (2:5)",    "pgmm"),
  extract_diags(fit_diff_full, "Diff-GMM (full)",  "pgmm")
)
diag_misspec |>
  mutate(across(c(phi_hat, sargan_p, ar1_p, ar2_p), ~ round(.x, 3))) |>
  knitr::kable(caption = "Misspecified Diff-GMM fits with truncated instrument sets, compared to the full specification. Watch what happens to $\\hat\\phi$ and the diagnostics as the instrument set shrinks.")
Misspecified Diff-GMM fits with truncated instrument sets, compared to the full specification. Watch what happens to \(\hat\phi\) and the diagnostics as the instrument set shrinks.
model phi_hat sargan_p ar1_p ar2_p n_inst n_obs
Misspec (2:3) 1.029 0.228 0 0.918 80 5130
Misspec (2:5) 1.003 0.997 0 0.923 151 5130
Diff-GMM (full) 0.951 1.000 0 0.945 746 5130

Reading this table against the §2 checklist:

  • \(\hat\phi\) moves around depending on instrument depth. With just lag 2 and lag 3, the moment conditions are sparse and the estimate becomes noisy. The full 2:99 spec stabilizes it.
  • Sargan \(p\)-values in the truncated specs are finite and informative — exactly the value of not having too many instruments. In the full spec, Sargan = 1.00 because of the power-loss problem (§2.4).
  • AR(1) and AR(2) tests track residual autocorrelation in the differenced equation, so they are roughly invariant to instrument depth — and indeed they look similar across the three fits.

The teaching point. There is no single “correct” instrument depth. The Sargan test is informative only when the moment count is moderate. Standard practice (Roodman 2009): try multiple lag depths, report all, flag if estimates move substantially. We do this above with 2:99 (full), 2:4 (collapsed), and 2:3 / 2:5 (heavily truncated). If \(\hat\phi\) is reasonably stable across them, you have evidence of robustness.

4.4 Substantive read-out

Three things to track across the comparison table in §4.2:

  1. Lagged DV coefficient \(\hat\phi\). FE-ARDL gives the smallest value — Nickell bias pushes it down. AH, Diff-GMM, and Sys-GMM give larger values. The gap is a direct read of the bias correction.
  2. The interaction \(\widehat{\text{reg}\times\text{edt}}\). Lab 5 found that the contemporaneous interaction is small and the lagged interaction carries the moderation story. Does this pattern survive once we correct the bias? Read both reg_x_edt and reg_x_edt_lag rows.
  3. The time-invariant oil. Diff-GMM kills it (differencing); Sys-GMM identifies it through the level equation. If you actually need to estimate the oil effect, you need sys-GMM.

The diagnostic table is the second-order read: are the GMM fits credible? Sargan trivially passes in the full-lag-depth columns (Roodman pathology), but the collapsed columns give us informative tests. If you wanted to publish from this panel, the lag(2:4) specifications are the ones to report.


Part 5 — Tax and cigarettes: a Diff-GMM fit with counterfactual forecast

For a second applied example with a sharper substantive story — and an actual policy counterfactual to interpret — we turn to the cigarette consumption panel that the old version of this lab used (and that you saw in lecture). The question is simple: by how much would a tax hike reduce cigarette consumption?

The data are US state-level cigarette sales, prices, taxes, and income from 1985–1995. 48 states × 11 years = 528 observations, a balanced panel — short enough that Nickell bias matters and long enough that Diff-GMM is well-identified. The variables we need:

Variable Description
packpc packs of cigarettes per capita (the outcome)
income state personal income (total)
pop state population
tax average federal + state + local excise tax (cents/pack)
taxs excise taxes including sales taxes (cents/pack)
avgprs average price including sales taxes (cents/pack)
cpi consumer price index (for inflation adjustment)

5.1 Load and inflation-adjust to 1995 dollars

We need dollar amounts in a constant unit. The trick: every dollar variable gets multiplied by the 1995 CPI and divided by the CPI in the observed year — bringing the figure to “1995 dollars”. We do this inline (no helper function) so the operation is fully transparent.

# Load the panel. 48 states × 11 years (1985–1995) = 528 rows.
cig <- read_csv("data/cigarette.csv", show_col_types = FALSE)

# Quick panel check
length(unique(cig$state))    # number of states
## [1] 48
length(unique(cig$year))     # number of years
## [1] 11
# Inflation-adjust. The CPI is the same across states within a given year,
# so we just pull the 1995 value once and rescale.
cpi_base <- unique(cig$cpi[cig$year == 1995])      # 1995 CPI value (a scalar)

cig <- cig |>
  mutate(
    income95   = cpi_base * income / cpi,           # state income, 1995 dollars
    tax95      = cpi_base * tax    / cpi,           # avg excise tax, 1995 cents/pack
    taxs95     = cpi_base * taxs   / cpi,           # excise + sales tax
    avgprs95   = cpi_base * avgprs / cpi,           # avg price (incl. sales tax)
    income95pc = income95 / pop,                    # per-capita income (1995 dollars)
    pretax95   = avgprs95 - taxs95                  # price net of tax
  )

# Quick look at the post-adjustment numbers
cig |>
  summarise(across(c(packpc, income95pc, avgprs95, taxs95),
                   list(mean = mean, sd = sd))) |>
  knitr::kable(digits = 2,
               caption = "Means and standard deviations of the key variables after inflation adjustment.")
Means and standard deviations of the key variables after inflation adjustment.
packpc_mean packpc_sd income95pc_mean income95pc_sd avgprs95_mean avgprs95_sd taxs95_mean taxs95_sd
106.45 23.12 21.25 3.26 175.14 23.61 53.84 13.86

5.2 Diff-GMM fit on cigarette consumption

The Diff-GMM spec — packs per capita as a dynamic function of per-capita income and the (tax-inclusive) average price, with lag-2 instruments stacked GMM-style up to whatever the panel supports:

# Convert to plm's pdata.frame: tells plm which columns identify the
# unit-time grid.  Order matters: c("unit", "time").
pdat_cig <- pdata.frame(cig, index = c("state", "year"))

# Diff-GMM (Arellano-Bond) fit on cigarette consumption.
# Reading the formula:
#   * Left of |:   packpc is the outcome; lag(packpc, 1) is the lagged DV
#                  (pgmm will internally difference everything).  income95pc
#                  and avgprs95 enter as contemporaneous regressors.
#   * First |:    lag(packpc, 2:99) is the GMM-style instrument block —
#                  every available lag from 2 onward, stacked per period.
# Arguments:
#   * effect = "individual": only state FE.  We do NOT use "twoways" here
#     because year effects would soak up the common, nationwide movement in
#     avgprs95 (federal tax changes hit every state at once), absorbing much
#     of the price variation we want to identify.
#   * model = "twosteps": Arellano-Bond two-step efficient weighting matrix.
#     summary(..., robust = TRUE) below applies the Windmeijer correction.
#   * transformation = "d": difference GMM (not system).
fit_cig <- pgmm(
  packpc ~ lag(packpc, 1) + income95pc + avgprs95 |
           lag(packpc, 2:99),
  data           = pdat_cig,
  effect         = "individual",
  model          = "twosteps",
  transformation = "d"
)

# Capture the summary once — the §2 diagnostic battery hangs off this object
sum_cig <- summary(fit_cig, robust = TRUE)
sum_cig
## Oneway (individual) effect Two-steps model Difference GMM 
## 
## Call:
## pgmm(formula = packpc ~ lag(packpc, 1) + income95pc + avgprs95 | 
##     lag(packpc, 2:99), data = pdat_cig, effect = "individual", 
##     model = "twosteps", transformation = "d")
## 
## Balanced Panel: n = 48, T = 11, N = 528
## 
## Number of Observations Used: 432
## Residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -25.4405  -2.4992   0.1460   0.1232   2.7499  25.6071 
## 
## Coefficients:
##                 Estimate Std. Error z-value  Pr(>|z|)    
## lag(packpc, 1)  0.639465   0.055354 11.5523 < 2.2e-16 ***
## income95pc     -0.479040   0.496258 -0.9653    0.3344    
## avgprs95       -0.179869   0.028086 -6.4042 1.511e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(44) = 47.09887 (p-value = 0.34693)
## Autocorrelation test (1): normal = -3.443569 (p-value = 0.00057409)
## Autocorrelation test (2): normal = -0.5365189 (p-value = 0.5916)
## Wald test for coefficients: chisq(3) = 2446.946 (p-value = < 2.22e-16)

Read off (against the §2 checklist):

  • \(\hat\phi\) on lag(packpc, 1) should be positive and below 1 (sticky consumption from year to year).
  • \(\hat\beta\) on avgprs95 should be negative (higher price → fewer packs sold).
  • \(\hat\beta\) on income95pc is harder to sign a priori; in this fit it is small relative to its standard error (the point estimate is actually negative) and statistically indistinguishable from zero once price is controlled.
  • Sargan, AR(1), AR(2) at the bottom — apply the §2.2–§2.3 decision rules.
  • Instrument count: with only \(T = 11\), the lag(packpc, 2:99) block yields roughly 45 lag moments, so the total instrument count sits just under \(N = 48\) states — close enough to the Roodman threshold to watch, but not past it, and the Sargan test above stays informative (\(p \approx 0.35\) rather than the degenerate \(p \approx 1\)). The §2.5 instrument-limited spec gives qualitatively the same coefficients on the substantive variables, which is further reassurance.
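The instrument count can be tallied with simple lag-moment arithmetic (lag moments of packpc only; pgmm also uses the exogenous regressors as their own instruments):

```r
# Count the lag moments in lag(packpc, 2:99) on a T = 11 panel: at each usable
# period t = 3, ..., 11 the valid lags 2, ..., t-1 contribute one moment each.
lag_moments <- sum(sapply(3:11, function(t) t - 2))
lag_moments   # -> 45
# Adding income95pc and avgprs95 as their own instruments brings the total to
# about 47, consistent with Sargan df = 44 (47 instruments minus 3 parameters)
# in the printout above.
```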

5.3 Counterfactual scenario: a 60-cent tax hike in 1996

This is the policy question: what if the average state raised its tax by 60 cents per pack, starting in 1996? Sixty cents is roughly 3 standard deviations of the 1995 tax distribution — a deliberately large shock to make the response visible. We compute the trajectory of packpc 3 years out (1996, 1997, 1998) and report three quantities of interest:

  • EV (Expected Value): the predicted packs/capita under the tax hike.
  • FD (First Difference): the gap between the treatment trajectory and the no-change baseline.
  • RR (Relative Risk): that same gap expressed as a percent of the baseline.
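The three quantities are related by simple arithmetic. A toy sketch — the numbers below are made up for illustration, not simcf output:

```r
# FD and RR derive mechanically from the two EV trajectories.
ev_treat <- c(86.4, 80.0, 75.9)   # toy packs/capita under the hike
ev_base  <- c(97.2, 98.1, 98.9)   # toy packs/capita, no change
fd <- ev_treat - ev_base                     # absolute gap, packs/capita
rr <- 100 * (ev_treat - ev_base) / ev_base   # same gap, percent of baseline
round(rbind(FD = fd, RR = rr), 1)
```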
# Sample statistics that motivate the +60 cent scenario:
# the average tax in 1995 is about 60 cents and 1 sd is ~18 cents,
# so +60 is roughly a 3-sd intervention — large but informative.
summary(pdat_cig$taxs95[pdat_cig$year == 1995])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34.44   48.75   59.84   61.87   74.78  112.63
sd(pdat_cig$taxs95[pdat_cig$year == 1995])
## [1] 18.47741
# Forecast horizon
periods_out <- 3

# Number of parameter draws for the Monte-Carlo simulation
sims <- 1000

# We anchor the forecast at the last observed year (1995). lagY is the most
# recent ΔY for each state — we average across states for the average-state
# forecast. initialY is the level of Y in 1995, same averaging.
last_year <- pdat_cig[pdat_cig$year == 1995, ]
prev_year <- pdat_cig[pdat_cig$year == 1994, ]

# Match states between the two slices (panel is balanced so this is trivial)
initialY <- mean(last_year$packpc, na.rm = TRUE)
lagY     <- mean(last_year$packpc - prev_year$packpc, na.rm = TRUE)

tibble(
  constant = c("periods_out", "initialY (avg packpc, 1995)", "lagY (avg ΔY, 1995)"),
  value    = c(periods_out, initialY, lagY)
) |>
  knitr::kable(digits = 2,
               caption = "Anchor constants for the cigarette counterfactual.")
Anchor constants for the cigarette counterfactual.
constant value
periods_out 3.00
initialY (avg packpc, 1995) 96.33
lagY (avg ΔY, 1995) 1.30

Parameter draws. Sample 1000 parameter vectors from \(\text{MVN}(\hat\theta, \hat V_{\text{Windmeijer}})\) — the joint distribution of \((\phi, \beta_{\text{income}}, \beta_{\text{price}})\) implied by the fit’s Windmeijer-corrected variance. The first column of the resulting matrix is \(\phi\); the remaining two are the betas on income and price.

set.seed(20260520)

# Joint MVN draws from the fit. vcovHC() gives the Windmeijer-corrected
# variance for two-step GMM (this matches what summary(..., robust = TRUE)
# uses for SEs).
simparam <- mvrnorm(n     = sims,
                    mu    = coefficients(fit_cig),
                    Sigma = vcovHC(fit_cig))

# Split: phi is the LDV coef (column 1); betas are the rest (columns 2 onward)
simphi  <- simparam[, 1]                              # length-1000 vector
simbeta <- simparam[, 2:ncol(simparam), drop = FALSE] # 1000 × (p-1) matrix

# Sanity check
dim(simbeta)
## [1] 1000    2
head(simphi)
## [1] 0.5722195 0.6450104 0.5157122 0.5672364 0.5571531 0.6867599

Build the counterfactual matrices. Two cfMake objects:

  • xhyp represents the treatment trajectory (60-cent tax hike happens in 1996, persists).
  • xbase represents the baseline (nothing changes).

cfMake returns an object whose $x field holds the change in each covariate at each forecast period (because transform = "diff" below will undo the differencing). We zero out everything by default, then cfChange() the price variable in period 1 only — sustained higher price thereafter is encoded by the LDV propagating it forward.

# Covariate-only formula for cfMake (LDV is fed in separately via phi)
formula_cf <- packpc ~ income95pc + avgprs95 - 1

# Treatment: +60 cent change in avgprs95 at period 1, no other changes
xhyp <- cfMake(formula = formula_cf, data = pdat_cig, nscen = periods_out)
xhyp$x    <- 0 * xhyp$x       # zero out every period's change
xhyp$xpre <- 0 * xhyp$xpre    # zero out the "previous" matrix too
xhyp <- cfChange(xhyp, covname = "avgprs95", x = 60, scen = 1)

# Baseline: zero change in everything
xbase <- xhyp
xbase$x <- xbase$xpre

# Inspect: the 60-cent shock appears in row 1 of xhyp$x, zeros elsewhere
xhyp$x
##   packpc income95pc avgprs95
## 1      0          0       60
## 2      0          0        0
## 3      0          0        0
xbase$x
##   packpc income95pc avgprs95
## 1      0          0        0
## 2      0          0        0
## 3      0          0        0
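Before calling the simulators, it may help to see the deterministic recursion they are built around. This is a hedged base-R sketch of what ldvsimev does with transform = "diff", as we understand it — point estimates only, toy coefficients, no parameter uncertainty:

```r
# The fit is in first differences, so each horizon's predicted change is
# phi * (last change) + x'beta, and levels are rebuilt by cumulating from
# initialY.  Toy coefficients, not the fitted values.
ldv_forecast <- function(phi, beta, x, initialY, lagY) {
  y   <- initialY
  dy  <- lagY
  out <- numeric(nrow(x))
  for (h in seq_len(nrow(x))) {
    dy     <- phi * dy + sum(beta * x[h, ])  # predicted delta-Y at horizon h
    y      <- y + dy                         # cumulate back to the level scale
    out[h] <- y
  }
  out
}

# +60-cent shock in period 1 only; the LDV propagates it forward.
x_toy <- rbind(c(0, 60), c(0, 0), c(0, 0))   # columns: income95pc, avgprs95
ldv_forecast(phi = 0.64, beta = c(-0.48, -0.18), x = x_toy,
             initialY = 96.33, lagY = 1.30)
```

Note how a one-period change in the covariate matrix produces a persistent level gap: the shock enters \(\Delta Y\) once, then \(\phi\) carries it forward.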

5.4 Simulate EV, FD, and RR

Four simcf calls — EV under the tax hike, EV under the baseline, then FD and RR. Each takes the parameter draws (simbeta, simphi), the relevant counterfactual matrix (xhyp or xbase), the initial conditions (initialY, lagY), and transform = "diff" to say "the fit is in first differences; undo that when forecasting in levels".

# Expected value of Y under the +60 cent shock (level scale)
sev_treat <- ldvsimev(
  xhyp,
  b        = simbeta,
  ci       = 0.95,
  constant = NA,         # NA = no intercept term (FE absorbed)
  phi      = simphi,
  lagY     = lagY,
  transform = "diff",
  initialY  = initialY
)

# Expected value under the no-shock baseline
sev_base <- ldvsimev(
  xbase,
  b        = simbeta,
  ci       = 0.95,
  constant = NA,
  phi      = simphi,
  lagY     = lagY,
  transform = "diff",
  initialY  = initialY
)

# First difference: treatment level - baseline level, with CIs
sfd <- ldvsimfd(
  xhyp,
  b        = simbeta,
  ci       = 0.95,
  constant = NA,
  phi      = simphi,
  lagY     = lagY,
  transform = "diff"
)

# Relative risk: (treatment - baseline) / baseline × 100, with CIs
srr <- ldvsimrr(
  xhyp,
  b        = simbeta,
  ci       = 0.95,
  constant = NA,
  phi      = simphi,
  lagY     = lagY,
  transform = "diff",
  initialY  = initialY
)

# Stack the three quantities into one tidy frame
plot_data <- bind_rows(
  tibble(quantity = "EV (packs/capita)",
         h = seq_len(periods_out),
         pe = sev_treat$pe,
         lo = sev_treat$lower[, 1],
         hi = sev_treat$upper[, 1],
         baseline_pe = sev_base$pe),
  tibble(quantity = "FD (treat − baseline)",
         h = seq_len(periods_out),
         pe = sfd$pe,
         lo = sfd$lower[, 1],
         hi = sfd$upper[, 1],
         baseline_pe = NA_real_),
  tibble(quantity = "RR (% change)",
         h = seq_len(periods_out),
         pe = srr$pe,
         lo = srr$lower[, 1],
         hi = srr$upper[, 1],
         baseline_pe = NA_real_)
) |>
  mutate(quantity = factor(quantity,
                           levels = c("EV (packs/capita)",
                                      "FD (treat − baseline)",
                                      "RR (% change)")))

5.5 The three-panel forecast plot

ggplot(plot_data, aes(h, pe)) +
  geom_ribbon(aes(ymin = lo, ymax = hi), alpha = 0.25, fill = "steelblue") +
  geom_line(linewidth = 1, colour = "steelblue") +
  geom_line(aes(y = baseline_pe), linetype = "dashed", colour = "grey40", linewidth = 0.8) +
  geom_hline(yintercept = 0, linetype = "dotted", colour = "grey60") +
  facet_wrap(~ quantity, scales = "free_y") +
  scale_x_continuous(breaks = seq_len(periods_out),
                     labels = paste0("199", 5 + seq_len(periods_out))) +
  labs(x = NULL, y = NULL) +
  theme_minimal(base_size = 11) +
  theme(strip.text = element_text(size = 11, face = "bold"),
        panel.spacing.x = unit(1, "lines"))
Counterfactual forecast of a 60-cent tax hike on cigarette consumption. EV: predicted packs/capita under the hike (solid) and under baseline (dashed). FD: absolute difference (treat - baseline). RR: percent change. Horizon = 1996-1998. Bands are 95% intervals from 1000 parametric draws.


Read the three panels.

  • EV. The blue band is the predicted trajectory under the +60-cent hike; the grey dashed line is the no-change baseline. The vertical gap is exactly what FD plots in the middle panel. Even a year after the shock, predicted consumption is several packs/capita below baseline.
  • FD. The absolute drop in packs/capita, year by year. The effect is largest in the first year (the immediate response to a higher price) and damps slowly as \(\hat\phi\) propagates the past forward.
  • RR. Same effect expressed as a percent. This is the metric a public-health analyst would report: by year 3, consumption is roughly X% below where it would have been without the tax.

Caveats to flag in class.

  1. The shock is large. +60 cents is a 3-SD intervention — a useful demonstration of the model’s dynamics, not a real policy proposal. For policy work, scale the shock to a realistic intervention size.
  2. Average-state forecast. We averaged initialY and lagY across states. A state-specific forecast would let each state start at its own 1995 level; this is the natural extension and is straightforward to code.
  3. Diff-GMM only. We only used the Diff-GMM coefficients. Comparing Sys-GMM or FE-LDV here would show the same kind of bias-correction story we saw in §4 on the democracy panel.
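Caveat 2 really is a small amount of code. A hedged sketch, with toy slices standing in for the real last_year / prev_year frames from §5.3:

```r
# Per-state anchors: each state starts at its own 1995 level with its own most
# recent change.  Toy numbers below, not the real cigarette data.
toy_last <- data.frame(state = c("AL", "AZ"), packpc = c(101.1, 72.0))
toy_prev <- data.frame(state = c("AL", "AZ"), packpc = c(102.5, 74.3))

anchors <- merge(toy_last, toy_prev, by = "state", suffixes = c("_95", "_94"))
anchors$initialY <- anchors$packpc_95
anchors$lagY     <- anchors$packpc_95 - anchors$packpc_94
anchors[, c("state", "initialY", "lagY")]
# Then loop ldvsimev() over these rows, passing each state's initialY and lagY
# in place of the cross-state averages.
```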

References