Disclaimer

These materials have been revised and updated by successive teaching assistants over recent years.

Revision history:

Packages and Functions Used in This Lab

Package Purpose Key functions
plm Panel models, dynamic-panel GMM plm(), pgmm(), pdata.frame()
sandwich Robust variance-covariance matrices vcovHC()
MASS MVN draws for parametric simulation mvrnorm()
simcf King-Tomz-Wittenberg counterfactual simulators (Adolph’s package) cfMake(), cfChange(), ldvsimev(), ldvsimfd(), ldvsimrr()
texreg Side-by-side model tables htmlreg()
tidyverse Data manipulation, plotting ggplot(), tibble(), pipes (|>)

The lab uses EmplUK from data(EmplUK, package = "plm") — the Arellano-Bond (1991) labor-demand panel — for the main demonstration of fixed-effects LDV, Anderson-Hsiao, Difference GMM, and System GMM in Part 2.


Today’s roadmap

Part Topic
1 Why dynamic-panel GMM? Nickell bias, the estimator menu, assumptions, and diagnostic tests.
2 Dynamic panel estimators in practice with EmplUK. FE-LDV -> Anderson-Hsiao -> Difference GMM -> System GMM, with diagnostics and forecasts.
3 Cigarette price counterfactual. Diff-GMM fit on the 1985-1995 cigarette panel + a +60-cent price scenario, with EV / FD / RR forecasts via simcf.
Appendix A Self-study. Nickell bias and hand-built GMM from raw matrix algebra.

Lab 5 ended with dynamic panel models. Once a lagged dependent variable enters a fixed-effects model, the within transformation creates Nickell bias: the transformed lagged outcome is mechanically correlated with the transformed error. This lab shows why that happens, how the IV/GMM estimators respond to it, and how to check whether the GMM specification is credible.

The detailed matrix algebra is now in Appendix A. The main lab focuses on the applied workflow: choose the estimator, inspect the moment conditions, read the diagnostics, and simulate interpretable quantities of interest.


Part 1 — Dynamic panels, Nickell bias, and the GMM roadmap

1.1 Why fixed-effects LDV models are biased when \(T\) is short

Start from the dynamic panel model:

\[ y_{it} = \alpha_i + \phi y_{i,t-1} + x_{it}\beta + \varepsilon_{it}. \]

The fixed effect \(\alpha_i\) captures stable differences across units. The problem is that the within transformation subtracts unit means from every variable. The transformed lagged outcome contains past shocks through its unit mean, and the transformed error also contains those same shocks through its unit mean. As a result, \(\tilde y_{i,t-1}\) and \(\tilde\varepsilon_{it}\) are correlated even when the original \(\varepsilon_{it}\) is serially independent.

Nickell’s leading approximation is:

\[ \text{plim}(\hat\phi_{FE} - \phi) \approx -\frac{1 + \phi}{T - 1}. \]

The main implication is practical: adding more units does not solve the problem when \(T\) is fixed. The bias shrinks as the number of time periods grows. With many units and short panels, a different estimator is needed.

1.2 Estimator assumptions and diagnostic tests

Read this table before fitting any GMM model. It tells us what each estimator assumes, what it is trying to fix, and how the diagnostics should look if the specification is credible.

Estimator Core idea Main assumptions What it identifies Diagnostics / tests Target result Red flags
FE-LDV Estimate the dynamic model with unit FE and \(y_{i,t-1}\) on the RHS. Strict exogeneity apart from the lagged outcome; large enough \(T\) for Nickell bias to be small. Within-unit short-run effects; long-run effects use \(\beta/(1-\phi)\). Residual serial-correlation checks; compare \(\hat\phi\) to GMM estimates. Useful baseline when \(T\) is moderate/large. Small \(T\); \(\hat\phi\) much lower than GMM; persistent residuals.
Anderson-Hsiao IV First-difference to remove \(\alpha_i\); instrument \(\Delta y_{i,t-1}\) with \(y_{i,t-2}\). No serial correlation in level errors; lag 2 is relevant and exogenous. Dynamic effect after removing unit FE. Conceptually, the textbook AH estimator is just-identified. In pgmm(), the lag instrument is period-stacked, so an overidentification test may still print. Inspect AR tests too. Coefficients move away from FE-LDV in the expected direction. Very noisy estimates; weak lag instruments.
Difference GMM / Arellano-Bond Difference the equation and use deeper lags as GMM-style instruments. No AR(2) in differenced residuals; instruments valid; enough within variation after differencing. Dynamic effects in the differenced equation, interpreted back in the level model. Sargan/Hansen; AR(1); AR(2); instrument count. Sargan/Hansen fail to reject; AR(1) rejects; AR(2) fails to reject; instruments < units. AR(2) rejects; Sargan/Hansen rejects; p-values near 1 with too many instruments.
System GMM / Blundell-Bond Add a level equation instrumented by lagged differences. Difference-GMM assumptions plus stationarity/initial-condition restrictions for the level equation. Dynamic effects with better information when series are persistent. Can retain time-invariant regressors. Same tests as Diff-GMM, plus sensitivity to Diff-GMM results. Similar substantive conclusions to Diff-GMM, often more precise. Big disagreement with Diff-GMM; too many instruments; questionable stationarity restrictions.
Two-step GMM + Windmeijer Re-estimate the weighting matrix using one-step residuals. Same moments as the chosen GMM estimator; finite-sample SE correction needed. More efficient GMM estimates. Compare robust vs non-robust two-step SEs. Report Windmeijer-corrected robust SEs. Plain two-step SEs are too small.
Limited / collapsed instruments Restrict lag depth or collapse instruments to avoid instrument proliferation. Same moment assumptions, but fewer moments. More stable diagnostics and less overfitting of endogenous variables. Instrument count and estimate stability across lag windows. Instruments fewer than units; estimates stable across reasonable lag limits. Estimates change dramatically when lag depth changes.

1.3 How to read the diagnostic battery

For a credible Difference-GMM or System-GMM fit, the diagnostic pattern is usually:

  1. Sargan/Hansen overidentification test: fail to reject. A small p-value says at least some instruments are suspect. A p-value near 1 is not automatically good; with too many instruments, the test loses power.
  2. AR(1) test on differenced residuals: reject. This is expected because differencing an iid level error creates first-order serial correlation in \(\Delta\varepsilon_{it}\).
  3. AR(2) test on differenced residuals: fail to reject. This is the crucial serial-correlation test. If AR(2) rejects, lag-2 instruments are not valid.
  4. Wald test on time effects: reject if common shocks matter. If it rejects, keep time dummies when the specification can support them.
  5. Instrument count: keep the number of instruments below the number of units as a practical rule of thumb. If the instrument count is large, reduce lag depth or collapse instruments.

The rest of the lab applies this checklist. Appendix A shows the matrix algebra behind the estimators and explains where the moment conditions come from.


Part 2 — Dynamic panel estimators in practice with EmplUK

The real workhorse for routine dynamic-panel GMM in R is plm::pgmm(), which packages the Anderson-Hsiao, Arellano-Bond, and Blundell-Bond logic with built-in diagnostics. The applied workflow begins with a dynamic fixed-effects model, explains why that model is limited in a short panel, and then moves estimator by estimator through the IV/GMM family.

Part 2 uses one dataset throughout: EmplUK, the classic Arellano-Bond labor-demand panel. The substantive question is: how does firm employment respond to wages, capital, and output when employment is persistent over time?

GMM is not a magic correction for endogeneity. It is a moment-based inference technique that delivers consistent estimates only if its identifying assumptions hold. Four of those assumptions get a formal test on every summary.pgmm printout; a fifth (the instrument count) is a structural check.

The five-row checklist (the order in which to read the printout):

Assumption Test \(H_0\) Target result
Instrument validity (overid) Sargan / Hansen \(E[Z'\varepsilon] = 0\) Fail to reject (\(p > 0.10\))
Error process is AR(1), not AR(2)+ AR(1) on \(\Delta\hat\varepsilon\) no first-order serial correlation in \(\Delta\hat\varepsilon\) Reject (\(p < 0.05\) — mechanical)
Same AR(2) on \(\Delta\hat\varepsilon\) no second-order serial correlation in \(\Delta\hat\varepsilon\) Fail to reject (\(p > 0.05\))
Time effects matter Wald on \(\tau_t\) all year dummies \(= 0\) Reject -> keep year FE; fail to reject -> consider dropping only if theory allows
Instruments not too many # instruments \(< N\) (rule of thumb) Check by hand

2.1 The data and the fixed-effects baseline

EmplUK contains a short panel of UK firms from 1976 to 1984. The variables are economic quantities measured at the firm-year level. Logs are used so the coefficients can be read approximately as elasticities.

Variable Meaning Role in this example
emp firm employment outcome, modeled as log(emp)
wage firm wage measure labor cost, modeled as log(wage)
capital firm capital stock production input, modeled as log(capital)
output firm output demand / production scale, modeled as log(output)
firm, year panel identifiers unit and time indexes
# Load the classic Arellano-Bond labor-demand panel.
data("EmplUK", package = "plm")

# Count observations per firm to show the short-panel structure.
firm_T <- table(EmplUK$firm)

# Summarize the panel dimensions in a small table.
tibble(
  quantity = c("Firms", "Firm-years", "First year", "Last year",
               "Minimum T per firm", "Maximum T per firm"),
  value    = c(length(unique(EmplUK$firm)),
               nrow(EmplUK),
               min(EmplUK$year),
               max(EmplUK$year),
               min(firm_T),
               max(firm_T))
) |>
  knitr::kable(caption = "Panel structure of the EmplUK labor-demand data.")
Panel structure of the EmplUK labor-demand data.
quantity value
Firms 140
Firm-years 1031
First year 1976
Last year 1984
Minimum T per firm 7
Maximum T per firm 9
# Informally check balance
table(table(EmplUK$firm))
## 
##   7   8   9 
## 103  23  14

2.1.1 Seeing pooled, between, and within variation

Before fitting a dynamic panel model, it helps to see where the variation in the variables lives. In panel data, the same pooled association can be decomposed into:

  • Overall / pooled: the correlation across all firm-years.
  • Between: the correlation across firm means.
  • Within: the correlation after subtracting each firm’s mean from the outcome and predictor.

This distinction matters because fixed effects use within-firm variation, while pooled and random-effects estimators can also use between-firm differences.

# Build logged variables because the labor-demand models use logs.
# The lag is created within firm so that the first observation for each firm is
# missing, as it should be.
empl_wb_df <- EmplUK |>
  arrange(firm, year) |>
  group_by(firm) |>
  mutate(log_emp      = log(emp),
         log_emp_lag1 = dplyr::lag(log_emp, 1),
         log_wage     = log(wage),
         log_capital  = log(capital),
         log_output   = log(output)) |>
  ungroup()

# These are the predictors in the dynamic labor-demand equation.
empl_wb_vars <- c("log_emp_lag1", "log_wage", "log_capital", "log_output")

# Stack the predictors into long format so the same operations can be repeated
# for each variable with group_by().
empl_wb_long <- empl_wb_df |>
  select(firm, log_emp, all_of(empl_wb_vars)) |>
  pivot_longer(cols = all_of(empl_wb_vars),
               names_to = "variable",
               values_to = "value")

# Overall correlation: all firm-years are pooled together.
empl_overall <- empl_wb_long |>
  group_by(variable) |>
  summarise(overall = cor(value, log_emp, use = "pairwise.complete.obs"),
            .groups = "drop")

# Between correlation: first take each firm's mean, then correlate firm means.
empl_between <- empl_wb_long |>
  group_by(variable, firm) |>
  summarise(value_mean   = mean(value,   na.rm = TRUE),
            log_emp_mean = mean(log_emp, na.rm = TRUE),
            .groups = "drop") |>
  group_by(variable) |>
  summarise(between = cor(value_mean, log_emp_mean, use = "pairwise.complete.obs"),
            .groups = "drop")

# Within correlation: subtract each firm's mean from both variables, then
# correlate the deviations.
empl_within <- empl_wb_long |>
  group_by(variable, firm) |>
  mutate(value_dev   = value   - mean(value,   na.rm = TRUE),
         log_emp_dev = log_emp - mean(log_emp, na.rm = TRUE)) |>
  ungroup() |>
  group_by(variable) |>
  summarise(within = cor(value_dev, log_emp_dev, use = "pairwise.complete.obs"),
            .groups = "drop")

# Within-variance share: how much of each predictor's variance is movement
# within firms rather than differences between firms.
empl_within_share <- empl_wb_long |>
  group_by(variable, firm) |>
  mutate(value_dev = value - mean(value, na.rm = TRUE)) |>
  ungroup() |>
  group_by(variable) |>
  summarise(within_share = var(value_dev, na.rm = TRUE) / var(value, na.rm = TRUE),
            .groups = "drop")

# Join the four summaries into one table.
empl_wb_tab <- empl_overall |>
  left_join(empl_between,      by = "variable") |>
  left_join(empl_within,       by = "variable") |>
  left_join(empl_within_share, by = "variable") |>
  arrange(desc(abs(overall)))

empl_wb_tab |>
  knitr::kable(digits = 3,
               col.names = c("Variable", "Overall", "Between", "Within",
                             "Within var. share"),
               caption = "Correlation with log employment, decomposed into overall, between-firm, and within-firm correlations.")
Correlation with log employment, decomposed into overall, between-firm, and within-firm correlations.
Variable Overall Between Within Within var. share
log_emp_lag1 0.995 1.000 0.754 0.015
log_capital 0.911 0.917 0.739 0.020
log_output 0.089 0.037 0.533 0.828
log_wage -0.014 -0.011 -0.286 0.100
empl_wb_tab |>
  mutate(variable = fct_reorder(variable, between)) |>
  ggplot(aes(y = variable)) +
  geom_segment(aes(x = between, xend = within, yend = variable),
               colour = "grey70") +
  geom_point(aes(x = between, colour = "Between"), size = 2.6) +
  geom_point(aes(x = within, colour = "Within"), size = 2.6) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey50") +
  scale_colour_manual(values = c("Between" = "firebrick",
                                 "Within" = "steelblue")) +
  labs(x = "Correlation with log employment", y = NULL, colour = NULL) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom")
Between- and within-firm correlations with log employment in EmplUK.

Between- and within-firm correlations with log employment in EmplUK.

Two details matter for the models below. First, lagged employment is strongly correlated with current employment, which is why the dynamic term belongs in the model. Second, wages, capital, and output can tell different stories across firms than within the same firm over time. That is exactly why it is useful to inspect the between and within pieces before deciding what a coefficient is identifying.

The natural first model is a fixed-effects dynamic regression:

\[ \log(emp)_{it} = \alpha_i + \tau_t + \phi_1 \log(emp)_{i,t-1} + \phi_2 \log(emp)_{i,t-2} + x_{it}\beta + \varepsilon_{it}. \]

This model is useful because it matches the substantive story: employment is persistent, and firms differ in stable ways. The problem is that this is a short panel. With only 7 to 9 observations per firm, the lagged dependent variable is correlated with the transformed error after firm fixed effects are removed. This is the Nickell-bias problem from Part 1.

# Convert the data to a pdata.frame so plm knows the firm-year structure.
empl_pdat <- pdata.frame(EmplUK, index = c("firm", "year"))

# FE-LDV baseline. This mirrors the Arellano-Bond labor-demand specification
# but estimates it with the within transformation rather than GMM instruments.
fe_ldv_fit <- plm(
  log(emp) ~ lag(log(emp), 1:2) +
             lag(log(wage), 0:1) +
             log(capital) +
             lag(log(output), 0:1),
  data   = empl_pdat,
  effect = "twoways",   # firm FE and year FE
  model  = "within"     # fixed-effects / within estimator
)

summary(fe_ldv_fit)
## Twoways effects Within Model
## 
## Call:
## plm(formula = log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 
##     0:1) + log(capital) + lag(log(output), 0:1), data = empl_pdat, 
##     effect = "twoways", model = "within")
## 
## Unbalanced Panel: n = 140, T = 5-7, N = 751
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -0.474827 -0.043536  0.001307  0.044602  0.451831 
## 
## Coefficients:
##                         Estimate Std. Error t-value  Pr(>|t|)    
## lag(log(emp), 1:2)1     0.700274   0.037557 18.6458 < 2.2e-16 ***
## lag(log(emp), 1:2)2    -0.169027   0.035413 -4.7731 2.284e-06 ***
## lag(log(wage), 0:1)0   -0.559251   0.056941 -9.8215 < 2.2e-16 ***
## lag(log(wage), 0:1)1    0.292637   0.059360  4.9299 1.067e-06 ***
## log(capital)            0.348173   0.026848 12.9682 < 2.2e-16 ***
## lag(log(output), 0:1)0  0.490632   0.123499  3.9728 7.971e-05 ***
## lag(log(output), 0:1)1 -0.607329   0.122101 -4.9740 8.584e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    15.748
## Residual Sum of Squares: 5.3411
## R-Squared:      0.66085
## Adj. R-Squared: 0.57465
## F-statistic: 166.463 on 7 and 598 DF, p-value: < 2.22e-16

This model is fit first because it is the most transparent specification, not because it is the best estimator here. In a short dynamic panel, the FE-LDV estimate gives a baseline for comparison against the IV/GMM estimators.

Before moving on, check whether the model residuals look independent. Lab 5 introduced three variance-covariance corrections for panel models:

Correction Allows Main use
Cluster-robust by unit arbitrary heteroskedasticity and serial correlation within firm default correction when residual dependence is mainly within units
Beck-Katz PCSE contemporaneous correlation across units TSCS panels where many units move together in the same year
Driscoll-Kraay serial correlation and cross-sectional dependence longer panels where both problems are present

EmplUK has a very short time dimension, so the main comparison table below uses firm-clustered standard errors for the FE-LDV and FD-LDV baselines. The diagnostics still matter because they show why conventional standard errors are not credible.

# Serial-correlation test for FE-LDV residuals.
# H0: no serial correlation in the idiosyncratic errors.
fe_serial <- pbgtest(fe_ldv_fit)

# Pesaran CD test for cross-sectional dependence.
# H0: residuals are cross-sectionally independent across firms.
fe_csd <- pcdtest(fe_ldv_fit, test = "cd")

tibble(
  diagnostic = c("Serial correlation: pbgtest",
                 "Cross-sectional dependence: Pesaran CD"),
  null = c("No serial correlation within firms",
           "Cross-sectional independence across firms"),
  p_value = c(fe_serial$p.value, fe_csd$p.value)
) |>
  knitr::kable(digits = 3,
               caption = "Post-estimation dependence checks for the FE-LDV baseline.")
Post-estimation dependence checks for the FE-LDV baseline.
diagnostic null p_value
Serial correlation: pbgtest No serial correlation within firms 0.000
Cross-sectional dependence: Pesaran CD Cross-sectional independence across firms 0.274
# Firm-clustered covariance matrix for the FE-LDV baseline.
# method = "arellano" is the plm convention for a covariance matrix robust to
# general heteroskedasticity and serial correlation within firm.
fe_vcov_cluster <- vcovHC(fe_ldv_fit,
                          method = "arellano",
                          type = "HC1",
                          cluster = "group")

# Coefficient table with firm-clustered standard errors.
fe_clustered <- lmtest::coeftest(fe_ldv_fit, vcov = fe_vcov_cluster)
fe_clustered
## 
## t test of coefficients:
## 
##                         Estimate Std. Error t value  Pr(>|t|)    
## lag(log(emp), 1:2)1     0.700274   0.062394 11.2234 < 2.2e-16 ***
## lag(log(emp), 1:2)2    -0.169027   0.072151 -2.3427 0.0194731 *  
## lag(log(wage), 0:1)0   -0.559251   0.151552 -3.6901 0.0002446 ***
## lag(log(wage), 0:1)1    0.292637   0.134858  2.1700 0.0304026 *  
## log(capital)            0.348173   0.048716  7.1470 2.593e-12 ***
## lag(log(output), 0:1)0  0.490632   0.169683  2.8915 0.0039738 ** 
## lag(log(output), 0:1)1 -0.607329   0.177043 -3.4304 0.0006442 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The next step keeps the dynamic model but changes the estimation strategy.

2.1.2 A first-differenced bridge model

Because the GMM estimators below work with first-differenced equations, it is useful to fit the first-differenced model directly before introducing instruments:

Start from the same dynamic equation in levels and subtract the previous period from the current period:

\[ \begin{aligned} &\left[ \log(emp)_{it} - \log(emp)_{i,t-1} \right] \\ &\quad = \phi_1 \left[ \log(emp)_{i,t-1} - \log(emp)_{i,t-2} \right] + \phi_2 \left[ \log(emp)_{i,t-2} - \log(emp)_{i,t-3} \right] \\ &\qquad + \left(x_{it} - x_{i,t-1}\right)\beta + \delta_t + \left(\varepsilon_{it} - \varepsilon_{i,t-1}\right). \end{aligned} \]

The unit fixed effect drops out because \(\alpha_i - \alpha_i = 0\). The term \(\delta_t\) is a period effect for changes: it captures shocks that move all firms between two adjacent years. But the lagged dependent-variable problem remains visible: the first lagged change contains \(\log(emp)_{i,t-1}\), which contains \(\varepsilon_{i,t-1}\), and the differenced error also contains \(-\varepsilon_{i,t-1}\).

Using compact notation:

\[ \Delta \log(emp)_{it} = \delta_t + \phi_1 \Delta \log(emp)_{i,t-1} + \phi_2 \Delta \log(emp)_{i,t-2} + \Delta x_{it}\beta + \Delta\varepsilon_{it}. \]

This model removes firm fixed effects by subtracting adjacent years instead of subtracting firm means. The period effects are now effects on changes between years, not level differences across years. The model is useful as a bridge because it has the same differenced structure as Difference GMM, but it is still not a valid final estimator for a short dynamic panel. The reason is the same as before: \(\Delta \log(emp)_{i,t-1}\) shares \(\varepsilon_{i,t-1}\) with \(\Delta\varepsilon_{it}\).

The code below fits the model with lm() and reports firm-clustered standard errors. Clustering is an inference correction: it allows residuals from the same firm to be correlated. It does not solve the dynamic endogeneity problem.

# Build the first-differenced variables one firm at a time.
# The d_ prefix means "change from the previous year".
empl_fd <- EmplUK |>
  as_tibble() |>
  arrange(firm, year) |>
  group_by(firm) |>
  mutate(log_emp     = log(emp),
         log_wage    = log(wage),
         log_capital = log(capital),
         log_output  = log(output),
         d_log_emp     = log_emp     - dplyr::lag(log_emp),
         d_log_wage    = log_wage    - dplyr::lag(log_wage),
         d_log_capital = log_capital - dplyr::lag(log_capital),
         d_log_output  = log_output  - dplyr::lag(log_output),
         d_log_emp_lag1    = dplyr::lag(d_log_emp, 1),
         d_log_emp_lag2    = dplyr::lag(d_log_emp, 2),
         d_log_wage_lag1   = dplyr::lag(d_log_wage, 1),
         d_log_output_lag1 = dplyr::lag(d_log_output, 1)) |>
  ungroup()

# Keep only rows where every variable in the FD model is observed.
# This makes the sample explicit before fitting lm().
empl_fd_model <- empl_fd |>
  drop_na(d_log_emp, d_log_emp_lag1, d_log_emp_lag2,
          d_log_wage, d_log_wage_lag1,
          d_log_capital,
          d_log_output, d_log_output_lag1,
          firm, year)

# Naive first-differenced LDV model with period effects.
# factor(year) captures common shocks to annual changes.
fd_ldv_fit <- lm(
  d_log_emp ~ d_log_emp_lag1 + d_log_emp_lag2 +
              d_log_wage + d_log_wage_lag1 +
              d_log_capital +
              d_log_output + d_log_output_lag1 +
              factor(year),
  data = empl_fd_model
)

# Firm-clustered covariance matrix for the FD-LDV bridge model.
# The cluster vector is the firm ID in the exact sample used by lm().
fd_vcov_cluster <- sandwich::vcovCL(fd_ldv_fit,
                                    cluster = empl_fd_model$firm,
                                    type = "HC1")

# Coefficient table with firm-clustered standard errors.
fd_clustered <- lmtest::coeftest(fd_ldv_fit, vcov = fd_vcov_cluster)
fd_clustered
## 
## t test of coefficients:
## 
##                     Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)        0.0082133  0.0085202  0.9640 0.3354451    
## d_log_emp_lag1     0.1991876  0.0420364  4.7385 2.693e-06 ***
## d_log_emp_lag2    -0.0171433  0.0312393 -0.5488 0.5833651    
## d_log_wage        -0.5650252  0.1683421 -3.3564 0.0008397 ***
## d_log_wage_lag1    0.1527075  0.0767036  1.9909 0.0469491 *  
## d_log_capital      0.4023066  0.0585539  6.8707 1.608e-11 ***
## d_log_output       0.6001617  0.1727582  3.4740 0.0005499 ***
## d_log_output_lag1 -0.3690729  0.1596989 -2.3111 0.0211690 *  
## factor(year)1980   0.0060845  0.0144073  0.4223 0.6729404    
## factor(year)1981  -0.0342071  0.0187360 -1.8257 0.0683872 .  
## factor(year)1982  -0.0382944  0.0164787 -2.3239 0.0204661 *  
## factor(year)1983  -0.0278807  0.0189011 -1.4751 0.1407156    
## factor(year)1984  -0.0125293  0.0168219 -0.7448 0.4566727    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# For diagnostics, refit the same already-differenced equation with plm.
# model = "pooling" means OLS on the differenced variables. The pdata.frame
# structure tells pbgtest() and pcdtest() which observations belong to the same
# firm and which observations share the same year.
empl_fd_pdat <- pdata.frame(empl_fd_model, index = c("firm", "year"))
fd_ldv_plm <- plm(
  d_log_emp ~ d_log_emp_lag1 + d_log_emp_lag2 +
              d_log_wage + d_log_wage_lag1 +
              d_log_capital +
              d_log_output + d_log_output_lag1 +
              factor(year),
  data = empl_fd_pdat,
  model = "pooling"
)

# The key question for SEs: is there still residual dependence after fitting
# the first-differenced model? If yes, conventional SEs are too optimistic.
fd_serial <- pbgtest(fd_ldv_plm)
fd_csd <- pcdtest(fd_ldv_plm, test = "cd")

tibble(
  diagnostic = c("Serial correlation: pbgtest",
                 "Cross-sectional dependence: Pesaran CD"),
  null = c("No serial correlation within firms",
           "Cross-sectional independence across firms"),
  p_value = c(fd_serial$p.value, fd_csd$p.value)
) |>
  knitr::kable(digits = 3,
               caption = "Post-estimation dependence checks for the FD-LDV bridge model.")
Post-estimation dependence checks for the FD-LDV bridge model.
diagnostic null p_value
Serial correlation: pbgtest No serial correlation within firms 0.005
Cross-sectional dependence: Pesaran CD Cross-sectional independence across firms 0.605

2.2 From the FE baseline to IV/GMM estimators

The rest of this part keeps the same research question and the same dataset, but changes the estimator. The sequence moves from the familiar FE-LDV baseline, to a first-differenced bridge model, to Anderson-Hsiao, Difference GMM, limited-instrument Difference GMM, and System GMM. For each estimator, the logic is the same: state the identifying assumption, fit the model, read the diagnostics, and decide what the estimate is buying relative to the previous specification.

Before the first pgmm() call, read its formula as two pieces:

y ~ <regressors> | <GMM-style instruments>

The left side of | is the dynamic equation written in levels. The right side of | is the instrument set. Under transformation = "d", pgmm() differences the equation internally; under transformation = "ld", it stacks the differenced equation and the level equation for System GMM. This is why the formula is written in levels even when the estimator is Difference GMM.

The GMM logic is simple: after differencing, valid instruments should be uncorrelated with the differenced model residuals. For a differenced dynamic panel,

\[ \Delta y_{it} = \phi\,\Delta y_{i,t-1} + \Delta x_{it}\beta + \Delta\varepsilon_{it}. \]

If \(z_{it}\) is a valid instrument, the population moment condition is:

\[ E\left[z_{it}\Delta\varepsilon_{it}\right] = 0. \]

In sample form, GMM chooses the coefficients that make the instrument-residual moments as close to zero as possible:

\[ \hat g(\theta) = \frac{1}{N}\sum_{i=1}^{N} Z_i' \left(\Delta y_i - \Delta X_i\theta\right), \qquad \hat\theta_{GMM} = \arg\min_{\theta} \hat g(\theta)'W\hat g(\theta). \]

The matrix \(W\) tells the estimator how to weight the different moment conditions. One-step GMM uses a simple initial weighting matrix; two-step GMM first estimates the model, uses the residuals to estimate a better weighting matrix, and then estimates the same structural coefficients again. This is different from the two stages in 2SLS, where the first stage predicts the endogenous regressor.

The instrument strategy only fixes the variables that are actually treated as endogenous or predetermined and supplied with instruments. In the main Difference-GMM example below, the GMM-style instruments are aimed at the lagged outcome. They correct the mechanical endogeneity of \(\Delta y_{i,t-1}\) after differencing. The other covariates in \(X_{it}\) are not automatically “purged” by the lagged-outcome instruments. Their coefficients are credible only under the maintained assumption that those covariates are exogenous or predetermined in the way the model specifies. If a covariate is also endogenous, it needs its own instrument block.

The DAG below shows the main logic. The lagged outcome is endogenous because it shares the previous shock with the differenced error. Deeper lags of the outcome are useful instruments because they predict the lagged outcome but should not be correlated with the current differenced error if the no-serial-correlation assumption is right.

Dynamic-panel IV/GMM logic. The lag instruments target the endogenous lagged outcome; covariates in $X$ require their own exogeneity assumption or their own instruments.

Dynamic-panel IV/GMM logic. The lag instruments target the endogenous lagged outcome; covariates in \(X\) require their own exogeneity assumption or their own instruments.

The weighting matrix \(W\) does not make invalid instruments valid. The validity of the lag instruments comes from an assumption about the level errors. For example, \(y_{i,t-2}\) is useful as an instrument for \(\Delta y_{i,t-1}\) only if it predicts the lagged outcome but does not predict the current differenced error:

\[ E[y_{i,t-2}\Delta\varepsilon_{it}] = 0, \qquad \Delta\varepsilon_{it} = \varepsilon_{it} - \varepsilon_{i,t-1}. \]

This requires \(y_{i,t-2}\) to be uncorrelated with both \(\varepsilon_{it}\) and \(\varepsilon_{i,t-1}\). Since \(y_{i,t-2}\) contains older shocks such as \(\varepsilon_{i,t-2}\), this assumption fails if the level errors are serially correlated. In that case, the path from old shocks to later shocks remains open, and lagged outcomes inherit the error contamination.

This is why the Arellano-Bond serial-correlation tests matter. Some AR(1) correlation in the differenced residuals is expected mechanically because \(\Delta\varepsilon_{it}\) and \(\Delta\varepsilon_{i,t-1}\) both contain \(\varepsilon_{i,t-1}\). The more important diagnostic is AR(2): if AR(2) is present in the differenced residuals, that signals serial correlation in the level errors, which makes lag-2 instruments suspect. The preferred diagnostic pattern is therefore: AR(1) rejects, AR(2) does not reject.

One important implication follows. If the lagged-outcome coefficient \(\hat\phi\) is badly biased, the other coefficients can also be distorted because the model is misspecifying the dynamic part of the outcome. Instrumenting the lagged outcome can therefore change the estimates on \(X_{it}\) indirectly. But those \(X\) coefficients are not made unbiased by the lag instruments alone. They still require the relevant exogeneity assumption for \(X\), or separate instruments if \(X\) is endogenous.

2.3 Anderson-Hsiao: the first IV solution

Difference the dynamic panel equation:

\[ \Delta y_{it} = \phi\, \Delta y_{i,t-1} + \beta\, \Delta x_{it} + \Delta \varepsilon_{it} \]

The unit fixed effects \(\alpha_i\) vanish — they are constant in \(t\), so \(\Delta \alpha_i = 0\). Differencing achieves what within-demeaning achieves, but without mixing future \(\varepsilon\) into the lag through \(\bar y_{i,-1}\). This is the core idea behind every estimator in the GMM family.

Of course, the differenced lag \(\Delta y_{i,t-1}\) is still correlated with \(\Delta\varepsilon_{it}\) — both contain \(\varepsilon_{i,t-1}\) mechanically. An instrument is needed.

Anderson and Hsiao proposed using \(y_{i,t-2}\) as IV for \(\Delta y_{i,t-1}\). The exclusion logic is the same as in Appendix A: \(y_{i,t-2}\) is correlated with \(\Delta y_{i,t-1}\) (relevance) but, under iid \(\varepsilon\), uncorrelated with \(\Delta\varepsilon_{it}\) (exclusion — precisely the assumption tested by the AR(2) diagnostic in §2.4).

This is an IV estimator, not OLS. OLS on the differenced equation would regress \(\Delta y_{it}\) directly on \(\Delta y_{i,t-1}\), but \(\Delta y_{i,t-1}\) contains \(\varepsilon_{i,t-1}\) and \(\Delta\varepsilon_{it}\) also contains \(-\varepsilon_{i,t-1}\). That mechanical overlap makes OLS inconsistent. Anderson-Hsiao fixes this by replacing the endogenous differenced lag with variation predicted by a valid lagged-level instrument.

Why use pgmm() here, then? Because IV is a special case of GMM. The moment condition is:

\[ E\left[y_{i,t-2}\left(\Delta y_{it} - \phi\Delta y_{i,t-1} - \Delta x_{it}\beta\right)\right] = 0. \]

If there is exactly one valid instrument for one endogenous regressor, a just-identified IV/2SLS estimate and a GMM estimate give the same coefficient, because the GMM weighting matrix cannot change a just-identified solution. In pgmm(), however, the instrument is stacked by period, so even the AH-style call can create multiple moment conditions. That is why the software estimates it as a restricted one-step GMM model and may print an overidentification test.

The pgmm call below implements that logic. Read the formula in two halves:

  • Left of | — the regression equation in level form. lag(log(emp), 1) is the lagged outcome; pgmm internally differences it.
  • Right of | — the GMM instrument block. lag(log(emp), 2) says use only the second lag of log(emp) as an instrument — that is the Anderson-Hsiao restriction.
ah_fit <- pgmm(log(emp) ~ lag(log(emp), 1) + log(wage) + log(capital) |
                 lag(log(emp), 2),
               data = EmplUK,
               effect = "individual",   # only unit FE (no time dummies)
               model  = "onestep",      # one-step keeps the AH-style example simple
               transformation = "d")    # "d" = difference GMM
sum_ah <- summary(ah_fit, robust = TRUE)
sum_ah
## Oneway (individual) effect One-step model Difference GMM 
## 
## Call:
## pgmm(formula = log(emp) ~ lag(log(emp), 1) + log(wage) + log(capital) | 
##     lag(log(emp), 2), data = EmplUK, effect = "individual", model = "onestep", 
##     transformation = "d")
## 
## Unbalanced Panel: n = 140, T = 7-9, N = 1031
## 
## Number of Observations Used: 751
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.903849 -0.045858  0.000000 -0.001085  0.041154  0.844502 
## 
## Coefficients:
##                   Estimate Std. Error z-value  Pr(>|z|)    
## lag(log(emp), 1)  0.801824   0.157098  5.1040 3.326e-07 ***
## log(wage)        -0.631281   0.195599 -3.2274  0.001249 ** 
## log(capital)      0.241204   0.056267  4.2868 1.813e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(6) = 34.79026 (p-value = 4.732e-06)
## Autocorrelation test (1): normal = -3.923134 (p-value = 8.7404e-05)
## Autocorrelation test (2): normal = -1.10812 (p-value = 0.26781)
## Wald test for coefficients: chisq(3) = 605.8932 (p-value = < 2.22e-16)

In this fit, the AH-style model is best read as a bridge, not as the final specification. The AR(2) test does not reject, which is good for the lag-2 exclusion logic. But the Sargan test rejects, meaning the period-stacked instruments are not jointly credible under this simple specification. That is exactly why the lab moves next to richer Difference-GMM specifications: AH shows the IV idea clearly, but it often leaves useful valid moments out of the estimator.

Conceptually, the textbook AH estimator uses one lag as the instrument for one endogenous differenced lag. In the simplest just-identified IV case, there is no Sargan test because there are no extra instruments to test. pgmm() implements the same idea with period-stacked moments: even lag(log(emp), 2) can create more than one moment condition because the lag-2 instrument is available in several periods. That is why the software may still print a Sargan overidentification test for this AH-style fit. If the model is overidentified, read the Sargan test the same way as in the later GMM models: the null is that the instruments are jointly valid, so the preferred result is to fail to reject.

The lag range is the pedagogical bridge from AH to Difference GMM:

  • lag(y, 2) keeps the clean Anderson-Hsiao idea: one deeper lag is used to break the mechanical correlation between \(\Delta y_{i,t-1}\) and \(\Delta\varepsilon_{it}\).
  • lag(y, 2:3) or lag(y, 2:6) adds more valid lagged levels. This can improve efficiency if those lags are relevant instruments.
  • lag(y, 2:99) means “use every available lag from 2 onward.” In a short panel like EmplUK, the panel only has 7 to 9 years per firm, so 99 is not literally 99 lags. It just tells pgmm() to use as many eligible lags as the data allow.

The trade-off is instrument proliferation. More lags can increase efficiency, but they also increase the number of moment conditions, can overfit the endogenous regressor, and can make the Sargan/Hansen test weak. In EmplUK, lag(y, 2:99) is tolerable mainly because the realized number of instruments remains below the number of firms (\(N = 140\)). That is not guaranteed in other panels. With longer \(T\) or more endogenous regressors, 2:99 can quickly create more instruments than units, and then a restricted range such as 2:6 is usually safer.

AH is rarely used as the final model because it leaves valid moments on the table, but it is the right conceptual stepping stone between OLS on differences, one-instrument IV, and many-instrument Difference GMM.

2.4 Arellano-Bond Difference GMM (1991)

The key insight from Arellano & Bond: under the AH exogeneity assumption, all lags \(y_{i,t-2}, y_{i,t-3}, \ldots\) are valid instruments for \(\Delta y_{i,t-1}\). With each successive period \(t\), one more usable lag becomes available. Stacking them all up turns AH into an over-identified GMM problem with a corresponding efficiency gain.

The standard Arellano-Bond difference GMM specification on EmplUK. Pay attention to the formula syntax:

  • lag(log(emp), 1:2) — both AR(1) and AR(2) terms enter the equation. This is heavier dynamics than the cigarette model in Part 3, but it matches the published AB specification.
  • lag(log(wage), 0:1) — both contemporaneous wage and lagged wage. The 0 means “the contemporaneous value”, 1 means “the first lag”.
  • | lag(log(emp), 2:99) — instrument block. 2:99 means “use every lag from 2 up to 99 (or as many as the panel allows)”. This is the full Arellano-Bond moment set.
ab_fit <- pgmm(log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1) +
                 log(capital) + lag(log(output), 0:1) |
                 lag(log(emp), 2:99),
               data = EmplUK,
               effect = "twoways",      # unit + time fixed effects (AB-91 spec)
               model  = "twosteps",     # use the optimal two-step weighting matrix
               transformation = "d")    # difference GMM
sum_ab <- summary(ab_fit, robust = TRUE) # robust = TRUE -> Windmeijer-corrected SEs
sum_ab
## Twoways effects Two-steps model Difference GMM 
## 
## Call:
## pgmm(formula = log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 
##     0:1) + log(capital) + lag(log(output), 0:1) | lag(log(emp), 
##     2:99), data = EmplUK, effect = "twoways", model = "twosteps", 
##     transformation = "d")
## 
## Unbalanced Panel: n = 140, T = 7-9, N = 1031
## 
## Number of Observations Used: 611
## Residuals:
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.6190677 -0.0255683  0.0000000 -0.0001339  0.0332013  0.6410272 
## 
## Coefficients:
##                         Estimate Std. Error z-value  Pr(>|z|)    
## lag(log(emp), 1:2)1     0.474151   0.185398  2.5575 0.0105437 *  
## lag(log(emp), 1:2)2    -0.052967   0.051749 -1.0235 0.3060506    
## lag(log(wage), 0:1)0   -0.513205   0.145565 -3.5256 0.0004225 ***
## lag(log(wage), 0:1)1    0.224640   0.141950  1.5825 0.1135279    
## log(capital)            0.292723   0.062627  4.6741 2.953e-06 ***
## lag(log(output), 0:1)0  0.609775   0.156263  3.9022 9.530e-05 ***
## lag(log(output), 0:1)1 -0.446373   0.217302 -2.0542 0.0399605 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(25) = 30.11247 (p-value = 0.22011)
## Autocorrelation test (1): normal = -1.53845 (p-value = 0.12394)
## Autocorrelation test (2): normal = -0.2796829 (p-value = 0.77972)
## Wald test for coefficients: chisq(7) = 142.0353 (p-value = < 2.22e-16)
## Wald test for time dummies: chisq(6) = 16.97046 (p-value = 0.0093924)

What to read off the printout, applying the §2 checklist top-to-bottom:

  • Coefficient table. \(\hat\phi_1 \approx 0.47\) on lag(log(emp), 1), \(\hat\phi_2 \approx -0.05\) on lag(log(emp), 2) — modest persistence with slight overshooting. Standard errors are the Windmeijer-corrected two-step SEs because robust = TRUE.
  • Sargan test (bottom of summary, §2.3). Overidentification test for the differenced equation. Target result: \(p > 0.10\).
  • AR(1) on differenced residuals (§2.4). Should usually reject. If it does not reject clearly in this short panel, read that as a reason to inspect the dynamic specification rather than as a pass/fail rule by itself.
  • AR(2) on differenced residuals (§2.4). Should not reject — and does not.
  • Wald test for time dummies (bottom of the printout). Tests whether the year dummies as a block are non-zero. Only printed because effect = "twoways".

Overall, the Difference-GMM fit is a credible candidate model in this application. The Sargan test does not reject, so there is no strong evidence against the overidentifying restrictions. The AR(2) test also does not reject, so the lag-2 instruments remain defensible. The AR(1) test is weaker than the textbook pattern, but this is not the decisive diagnostic; the key threat to lag-2 instruments is second-order serial correlation in the differenced residuals.

2.5 Blundell-Bond System GMM (1998)

When the outcome is highly persistent (\(\phi\) near 1), the moment \(E[y_{i,t-2} \cdot \Delta y_{i,t-1}]\) shrinks toward zero. Lagged levels then become weak instruments for the differenced equation. In macro panels this is common: variables such as employment, income, democracy, or consumption often move slowly over time.

Blundell and Bond’s solution is to estimate a system of two equations. The first equation is the familiar differenced equation from Arellano-Bond. The second equation is the original level equation. The level equation is not estimated by ordinary fixed effects; it is instrumented with lagged differences.

\[ \begin{aligned} \text{Differenced eq:} &\quad \Delta y_{it} = \phi\, \Delta y_{i,t-1} + \beta\, \Delta x_{it} + \Delta\varepsilon_{it} \quad \text{(IVs: } y_{i,t-2}, y_{i,t-3}, \ldots\text{)} \\ \text{Level eq:} &\quad y_{it} = \alpha_i + \phi\, y_{i,t-1} + \beta\, x_{it} + \varepsilon_{it} \quad \text{(IVs: } \Delta y_{i,t-1}, \Delta x_{i,t-1}\text{)} \end{aligned} \]

The extra equation increases the amount of information used by the estimator. It does not literally create more firms or years, but it adds level-equation moments to the difference-equation moments. In that sense, it increases the effective information available for estimating the dynamic coefficients. This can be especially helpful when the differenced equation has weak instruments.

The cost is a stronger assumption. The new moment conditions require lagged differences to be uncorrelated with the unit effect \(\alpha_i\). This is often described as a restriction on the initial conditions: units may have different levels, but their recent changes should not be systematically related to their fixed effects. That assumption is more plausible when the panel is close to a stable long-run process. It is less plausible when units are observed during a transition, a reform period, a crisis, or any setting where units with high fixed effects are also systematically growing faster or slower.

Use System GMM when the outcome is persistent and Difference GMM appears weak or imprecise, but only if the level-equation assumption is substantively defensible and the instrument count remains controlled. Do not use it mechanically. Be cautious when the series have strong trends, when the panel begins far from steady state, when Difference GMM and System GMM tell very different stories, or when the number of instruments approaches the number of units.

In pgmm, system GMM is fit by setting transformation = "ld" (levels + differences):

bb_fit <- pgmm(log(emp) ~ lag(log(emp), 1) + lag(log(wage), 0:1) +
                 lag(log(capital), 0:1) |
                 lag(log(emp), 2:99) + lag(log(wage), 2:99) +
                 lag(log(capital), 2:99),
               data = EmplUK,
               effect = "twoways",
               model  = "onestep",
               transformation = "ld")   # "ld" = levels + differences (system GMM)
sum_bb <- summary(bb_fit, robust = TRUE)
sum_bb
## Twoways effects One-step model System GMM 
## 
## Call:
## pgmm(formula = log(emp) ~ lag(log(emp), 1) + lag(log(wage), 0:1) + 
##     lag(log(capital), 0:1) | lag(log(emp), 2:99) + lag(log(wage), 
##     2:99) + lag(log(capital), 2:99), data = EmplUK, effect = "twoways", 
##     model = "onestep", transformation = "ld")
## 
## Unbalanced Panel: n = 140, T = 7-9, N = 1031
## 
## Number of Observations Used: 1642
## Residuals:
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.7530341 -0.0369030  0.0000000  0.0002882  0.0466069  0.6001503 
## 
## Coefficients:
##                          Estimate Std. Error z-value  Pr(>|z|)    
## lag(log(emp), 1)         0.935605   0.026295 35.5810 < 2.2e-16 ***
## lag(log(wage), 0:1)0    -0.630976   0.118054 -5.3448 9.050e-08 ***
## lag(log(wage), 0:1)1     0.482620   0.136887  3.5257 0.0004224 ***
## lag(log(capital), 0:1)0  0.483930   0.053867  8.9838 < 2.2e-16 ***
## lag(log(capital), 0:1)1 -0.424393   0.058479 -7.2572 3.952e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(100) = 118.763 (p-value = 0.097096)
## Autocorrelation test (1): normal = -4.808434 (p-value = 1.5212e-06)
## Autocorrelation test (2): normal = -0.2800133 (p-value = 0.77947)
## Wald test for coefficients: chisq(5) = 11174.82 (p-value = < 2.22e-16)
## Wald test for time dummies: chisq(7) = 14.71138 (p-value = 0.039882)

Two things change compared with Difference GMM:

  1. Instrument blocks can — and should — be listed for every endogenous regressor on the right side of |, not just the lagged DV. Each block contributes its own moment conditions.
  2. The instrument count often grows quickly because level moments stack on top of difference moments. Efficiency gains can be large in persistent panels; the cost is a stronger identifying assumption and a larger instrument matrix — see §2.6 and §2.7.

The System-GMM output looks different from Difference GMM because it uses additional level-equation moments. The lagged employment coefficient is much larger, which implies stronger persistence. The AR(1) and AR(2) pattern is textbook: AR(1) rejects and AR(2) does not. The Sargan p-value is acceptable but closer to the margin, and the instrument count is larger. This makes System GMM useful as a comparison, but not automatically preferable. It buys precision and persistence by adding stronger assumptions.

2.6 One-step vs two-step + Windmeijer

GMM minimizes a quadratic form \(\hat g(\theta)' \, W \, \hat g(\theta)\) where \(\hat g\) is the sample moment vector and \(W\) is a weighting matrix. Two strategies:

  • One-step: \(W\) is the identity (or a fixed matrix that’s optimal under iid errors). Simple; consistent; not asymptotically efficient when errors are heteroscedastic.
  • Two-step: estimate one-step first, use the residuals to build the optimal weighting matrix \(W^* = (\hat E[g g'])^{-1}\), then re-estimate. Asymptotically efficient.

This “two-step” language can be confusing because it is different from 2SLS. In 2SLS, the first stage predicts the endogenous regressor with the instruments, and the second stage regresses the outcome on that fitted value. In two-step GMM, both steps estimate the same structural coefficients from the same moment conditions; the second step only changes the weighting matrix used to combine the moments.

In pgmm, model = "twosteps" was selected for AB above. The catch: two-step standard errors are downward biased in finite samples — sometimes severely. Windmeijer (2005) derived a correction. Invoke it via robust = TRUE in summary.pgmm:

# Two-step AB fit, plain SEs
summary(ab_fit, robust = FALSE)$coefficients[, 2] |> head(3)
##  lag(log(emp), 1:2)1  lag(log(emp), 1:2)2 lag(log(wage), 0:1)0 
##           0.08530307           0.02728433           0.04934539
# Two-step AB fit, Windmeijer-style robust SEs
summary(ab_fit, robust = TRUE)$coefficients[, 2] |> head(3)
##  lag(log(emp), 1:2)1  lag(log(emp), 1:2)2 lag(log(wage), 0:1)0 
##            0.1853985            0.0517491            0.1455653

The robust SEs in the second call are larger — that’s the Windmeijer correction in action. Always report the robust two-step SEs; the plain ones are not safe in finite samples.

2.7 Limited instruments — instrument count applied

Section 2.6 argued that too many instruments degrades both the weighting matrix and the Sargan diagnostic. Here the same AB specification is refit with lag depth capped at 2:6 so the moment count stays moderate.

Why 2:6? Lag 2 is the first valid lag under the Arellano-Bond logic: it is far enough back to avoid the mechanical correlation between \(\Delta y_{i,t-1}\) and \(\Delta\varepsilon_{it}\). Lags 3 through 6 add more moment conditions, but still keep the instrument set smaller than the full 2:99 specification. In this dataset, a very narrow window such as 2:4 is so conservative that it removes much of the identifying variation for the lagged outcome. The 2:6 window is a better sensitivity check: it reduces the instrument count while still preserving a plausible amount of employment persistence.

The important point is not that 2:6 is a universal rule. The lag window must use valid instruments. If the AR(2) test rejected, lag 2 would be suspect and the model would need to be changed. If the Sargan test rejected, the instrument set would also be questionable. Limiting instruments helps only when the remaining instruments are valid and relevant; it does not make bad instruments good.

ab_fit_lag26 <- pgmm(log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1) +
                       log(capital) + lag(log(output), 0:1) |
                       lag(log(emp), 2:6),
                     data = EmplUK,
                     effect = "twoways",
                     model = "twosteps",
                     transformation = "d")

sum_ab_lag26 <- summary(ab_fit_lag26, robust = TRUE)
sum_ab_lag26
## Twoways effects Two-steps model Difference GMM 
## 
## Call:
## pgmm(formula = log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 
##     0:1) + log(capital) + lag(log(output), 0:1) | lag(log(emp), 
##     2:6), data = EmplUK, effect = "twoways", model = "twosteps", 
##     transformation = "d")
## 
## Unbalanced Panel: n = 140, T = 7-9, N = 1031
## 
## Number of Observations Used: 611
## Residuals:
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.6339252 -0.0233820  0.0000000 -0.0002934  0.0294973  0.6617657 
## 
## Coefficients:
##                         Estimate Std. Error z-value  Pr(>|z|)    
## lag(log(emp), 1:2)1     0.354649   0.214933  1.6500  0.098933 .  
## lag(log(emp), 1:2)2    -0.044811   0.055247 -0.8111  0.417303    
## lag(log(wage), 0:1)0   -0.436421   0.141283 -3.0890  0.002008 ** 
## lag(log(wage), 0:1)1    0.153272   0.125533  1.2210  0.222097    
## log(capital)            0.309765   0.068780  4.5037 6.677e-06 ***
## lag(log(output), 0:1)0  0.569246   0.152034  3.7442  0.000181 ***
## lag(log(output), 0:1)1 -0.297321   0.199777 -1.4883  0.136683    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(22) = 27.24218 (p-value = 0.20217)
## Autocorrelation test (1): normal = -1.009434 (p-value = 0.31277)
## Autocorrelation test (2): normal = -0.1774997 (p-value = 0.85912)
## Wald test for coefficients: chisq(7) = 109.2295 (p-value = < 2.22e-16)
## Wald test for time dummies: chisq(6) = 15.96688 (p-value = 0.013933)

And the diagnostic table across the EmplUK fits from this part:

# fit$W is a list of unit-level instrument matrices.
# length(fit$W[[1]][1, ]) gives the number of instruments for the first unit.
# length(fit$W) gives the number of units used by pgmm.
gmm_diagnostics <- tibble(
  model = c("AH-style", "Diff-GMM", "Diff-GMM lag 2:6", "System GMM"),
  sargan_p = c(sum_ah$sargan$p.value,
               sum_ab$sargan$p.value,
               sum_ab_lag26$sargan$p.value,
               sum_bb$sargan$p.value),
  ar1_p = c(sum_ah$m1$p.value,
            sum_ab$m1$p.value,
            sum_ab_lag26$m1$p.value,
            sum_bb$m1$p.value),
  ar2_p = c(sum_ah$m2$p.value,
            sum_ab$m2$p.value,
            sum_ab_lag26$m2$p.value,
            sum_bb$m2$p.value),
  n_instruments = c(length(ah_fit$W[[1]][1, ]),
                    length(ab_fit$W[[1]][1, ]),
                    length(ab_fit_lag26$W[[1]][1, ]),
                    length(bb_fit$W[[1]][1, ])),
  n_units = c(length(ah_fit$W),
              length(ab_fit$W),
              length(ab_fit_lag26$W),
              length(bb_fit$W))
) |>
  mutate(instrument_ratio = round(n_instruments / n_units, 2),
         instruments_lt_N = n_instruments < n_units)

gmm_diagnostics |>
  knitr::kable(digits = 3,
               caption = "Model diagnostics. Preferred pattern: Sargan not rejected, AR(2) not rejected, and instrument count below N.")
Model diagnostics. Preferred pattern: Sargan not rejected, AR(2) not rejected, and instrument count below N.
model sargan_p ar1_p ar2_p n_instruments n_units instrument_ratio instruments_lt_N
AH-style 0.000 0.000 0.268 9 140 0.06 TRUE
Diff-GMM 0.220 0.124 0.780 38 140 0.27 TRUE
Diff-GMM lag 2:6 0.202 0.313 0.859 35 140 0.25 TRUE
System GMM 0.097 0.000 0.779 113 140 0.81 TRUE

Limiting from 2:99 to 2:6 reduces the moment count while keeping the same substantive equation. In longer panels this kind of restriction can be a large reduction; in EmplUK it is more modest but still illustrates the diagnostic logic. On EmplUK with \(N = 140\), every spec above passes the rule of thumb — but the instrument-count logic becomes even more important when the full instrument set approaches or exceeds \(N\).

The limited model should be read as a sensitivity check. In this application, 2:6 keeps the Sargan and AR(2) diagnostics in the acceptable range and recovers a more plausible dynamic coefficient than a very narrow 2:4 window. That comparison is useful: it shows that limiting instruments is not only about reducing a number in the diagnostic table. It can also change the information used to estimate persistence.

2.8 Compare the estimators

Before putting the models side by side, return to the standard-error problem. The FE-LDV and FD-LDV diagnostics above check for serial correlation within firms and cross-sectional dependence across firms. For these OLS-style baselines, residual dependence is an inference problem: the coefficient may be biased because of dynamic endogeneity, and conventional standard errors are also too optimistic. The table therefore reports firm-clustered standard errors for both baseline columns.

For GMM, the logic is different. The Arellano-Bond AR tests are not merely a prompt to “cluster the standard errors.” The AR(2) test is part of the instrument-validity argument: if second-order serial correlation remains in the differenced residuals, then lag-2 instruments are suspect. Robust GMM standard errors help with heteroskedasticity and finite-sample inference, but they do not rescue invalid moment conditions. If AR(2) rejects, the specification needs to be changed.

Now put the baseline and the GMM estimators side by side. This is the table to read slowly in class:

  • FE-LDV is the familiar within estimator, but Nickell-biased when \(T\) is short.
  • FD-LDV removes firm fixed effects by first differencing and uses firm-clustered standard errors, but it still treats the endogenous differenced lag as if it were exogenous.
  • Anderson-Hsiao uses one lag as an instrument; it is conceptually clean but inefficient.
  • Diff-GMM uses the full set of valid lagged levels as instruments.
  • Diff-GMM (lag 2:6) limits the instrument count; compare it with the full Diff-GMM column to see whether estimates are sensitive to instrument depth.
  • System GMM adds a level equation, which can help when the outcome is highly persistent but requires stronger assumptions.

The choice is not mechanical. FE-LDV is the natural baseline and can be reasonable when \(T\) is not very small. FD-LDV is a useful bridge because it uses the same differenced equation as Difference GMM, but without instruments. Anderson-Hsiao is useful for seeing the IV logic clearly, but it is usually inefficient. Difference GMM is the standard starting point for many short dynamic panels because it removes fixed effects by differencing and uses deeper lags as instruments. If the instrument count grows quickly or estimates change sharply with lag depth, restrict the instrument set. System GMM is most attractive when the outcome is highly persistent and lagged levels are weak instruments in the differenced equation, but it requires the stronger level-equation assumption and can generate many instruments quickly.

# Keep only substantively meaningful coefficients and drop the many time-dummy
# coefficients that pgmm prints for two-way models.
coef_order <- c("lag(log(emp), 1)",
                "lag(log(emp), 1:2)1",
                "lag(log(emp), 1:2)2",
                "d_log_emp_lag1",
                "d_log_emp_lag2",
                "log(wage)",
                "lag(log(wage), 0:1)0",
                "lag(log(wage), 0:1)1",
                "d_log_wage",
                "d_log_wage_lag1",
                "log(capital)",
                "lag(log(capital), 0:1)0",
                "lag(log(capital), 0:1)1",
                "d_log_capital",
                "lag(log(output), 0:1)0",
                "lag(log(output), 0:1)1",
                "d_log_output",
                "d_log_output_lag1")

coef_labels <- c("lag(log(emp), 1)"        = "log(emp) lag 1",
                 "lag(log(emp), 1:2)1"     = "log(emp) lag 1",
                 "lag(log(emp), 1:2)2"     = "log(emp) lag 2",
                 "d_log_emp_lag1"          = "log(emp) lag 1",
                 "d_log_emp_lag2"          = "log(emp) lag 2",
                 "log(wage)"               = "log(wage)",
                 "lag(log(wage), 0:1)0"    = "log(wage)",
                 "lag(log(wage), 0:1)1"    = "log(wage) lag 1",
                 "d_log_wage"              = "log(wage)",
                 "d_log_wage_lag1"         = "log(wage) lag 1",
                 "log(capital)"            = "log(capital)",
                 "lag(log(capital), 0:1)0" = "log(capital)",
                 "lag(log(capital), 0:1)1" = "log(capital) lag 1",
                 "d_log_capital"           = "log(capital)",
                 "lag(log(output), 0:1)0"  = "log(output)",
                 "lag(log(output), 0:1)1"  = "log(output) lag 1",
                 "d_log_output"            = "log(output)",
                 "d_log_output_lag1"       = "log(output) lag 1")

# Extract FE-LDV coefficients with firm-clustered standard errors.
fe_names <- intersect(coef_order, rownames(fe_clustered))
fe_tab <- fe_clustered[fe_names, , drop = FALSE]

# Extract FD-LDV coefficients with firm-clustered standard errors.
fd_names <- intersect(coef_order, rownames(fd_clustered))
fd_tab <- fd_clustered[fd_names, , drop = FALSE]

# Extract GMM coefficients from the robust pgmm summaries.
ah_names <- intersect(coef_order, rownames(sum_ah$coefficients))
ab_names <- intersect(coef_order, rownames(sum_ab$coefficients))
ab26_names <- intersect(coef_order, rownames(sum_ab_lag26$coefficients))
bb_names <- intersect(coef_order, rownames(sum_bb$coefficients))

ah_tab <- sum_ah$coefficients[ah_names, , drop = FALSE]
ab_tab <- sum_ab$coefficients[ab_names, , drop = FALSE]
ab26_tab <- sum_ab_lag26$coefficients[ab26_names, , drop = FALSE]
bb_tab <- sum_bb$coefficients[bb_names, , drop = FALSE]

# Create clean texreg objects manually. This avoids default GOF rows such as
# Wald chi-square statistics that are less useful here than the GMM diagnostics.
tr_fe <- texreg::createTexreg(
  coef.names = unname(coef_labels[fe_names]),
  coef = fe_tab[, 1],
  se = fe_tab[, 2],
  pvalues = fe_tab[, 4],
  gof.names = c("Serial corr. p", "Cross-sec dep. p", "Instruments", "Instruments/N",
                "Sargan p", "AR(1) p", "AR(2) p"),
  gof = c(fe_serial$p.value, fe_csd$p.value, NA, NA, NA, NA, NA),
  gof.decimal = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  model.name = "FE-LDV"
)

tr_fd <- texreg::createTexreg(
  coef.names = unname(coef_labels[fd_names]),
  coef = fd_tab[, 1],
  se = fd_tab[, 2],
  pvalues = fd_tab[, 4],
  gof.names = c("Serial corr. p", "Cross-sec dep. p", "Instruments", "Instruments/N",
                "Sargan p", "AR(1) p", "AR(2) p"),
  gof = c(fd_serial$p.value, fd_csd$p.value, NA, NA, NA, NA, NA),
  gof.decimal = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  model.name = "FD-LDV"
)

tr_ah <- texreg::createTexreg(
  coef.names = unname(coef_labels[ah_names]),
  coef = ah_tab[, 1],
  se = ah_tab[, 2],
  pvalues = ah_tab[, 4],
  gof.names = c("Serial corr. p", "Cross-sec dep. p", "Instruments", "Instruments/N",
                "Sargan p", "AR(1) p", "AR(2) p"),
  gof = c(NA,
          NA,
          length(ah_fit$W[[1]][1, ]),
          length(ah_fit$W[[1]][1, ]) / length(ah_fit$W),
          sum_ah$sargan$p.value,
          sum_ah$m1$p.value,
          sum_ah$m2$p.value),
  gof.decimal = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  model.name = "AH-style"
)

tr_ab <- texreg::createTexreg(
  coef.names = unname(coef_labels[ab_names]),
  coef = ab_tab[, 1],
  se = ab_tab[, 2],
  pvalues = ab_tab[, 4],
  gof.names = c("Serial corr. p", "Cross-sec dep. p", "Instruments", "Instruments/N",
                "Sargan p", "AR(1) p", "AR(2) p"),
  gof = c(NA,
          NA,
          length(ab_fit$W[[1]][1, ]),
          length(ab_fit$W[[1]][1, ]) / length(ab_fit$W),
          sum_ab$sargan$p.value,
          sum_ab$m1$p.value,
          sum_ab$m2$p.value),
  gof.decimal = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  model.name = "Diff-GMM"
)

tr_ab26 <- texreg::createTexreg(
  coef.names = unname(coef_labels[ab26_names]),
  coef = ab26_tab[, 1],
  se = ab26_tab[, 2],
  pvalues = ab26_tab[, 4],
  gof.names = c("Serial corr. p", "Cross-sec dep. p", "Instruments", "Instruments/N",
                "Sargan p", "AR(1) p", "AR(2) p"),
  gof = c(NA,
          NA,
          length(ab_fit_lag26$W[[1]][1, ]),
          length(ab_fit_lag26$W[[1]][1, ]) / length(ab_fit_lag26$W),
          sum_ab_lag26$sargan$p.value,
          sum_ab_lag26$m1$p.value,
          sum_ab_lag26$m2$p.value),
  gof.decimal = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  model.name = "Diff-GMM lag 2:6"
)

tr_bb <- texreg::createTexreg(
  coef.names = unname(coef_labels[bb_names]),
  coef = bb_tab[, 1],
  se = bb_tab[, 2],
  pvalues = bb_tab[, 4],
  gof.names = c("Serial corr. p", "Cross-sec dep. p", "Instruments", "Instruments/N",
                "Sargan p", "AR(1) p", "AR(2) p"),
  gof = c(NA,
          NA,
          length(bb_fit$W[[1]][1, ]),
          length(bb_fit$W[[1]][1, ]) / length(bb_fit$W),
          sum_bb$sargan$p.value,
          sum_bb$m1$p.value,
          sum_bb$m2$p.value),
  gof.decimal = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  model.name = "System GMM"
)

model_list <- list(tr_fe, tr_fd, tr_ah, tr_ab, tr_ab26, tr_bb)

if (knitr::is_html_output()) {
  texreg::htmlreg(
    model_list,
    caption = "Dynamic labor-demand models on EmplUK. FE-LDV and FD-LDV report firm-clustered SEs; GMM columns report robust/Windmeijer-corrected SEs where applicable.",
    digits = 3,
    doctype = FALSE
  )
} else {
  texreg::texreg(
    model_list,
    caption = "Dynamic labor-demand models on EmplUK. FE-LDV and FD-LDV report firm-clustered SEs; GMM columns report robust/Windmeijer-corrected SEs where applicable.",
    digits = 3,
    table = FALSE
  )
}
Dynamic labor-demand models on EmplUK. FE-LDV and FD-LDV report firm-clustered SEs; GMM columns report robust/Windmeijer-corrected SEs where applicable.
  FE-LDV FD-LDV AH-style Diff-GMM Diff-GMM lag 2:6 System GMM
log(emp) lag 1 0.700*** 0.199*** 0.802*** 0.474* 0.355 0.936***
  (0.062) (0.042) (0.157) (0.185) (0.215) (0.026)
log(emp) lag 2 -0.169* -0.017   -0.053 -0.045  
  (0.072) (0.031)   (0.052) (0.055)  
log(wage) -0.559*** -0.565*** -0.631** -0.513*** -0.436** -0.631***
  (0.152) (0.168) (0.196) (0.146) (0.141) (0.118)
log(wage) lag 1 0.293* 0.153*   0.225 0.153 0.483***
  (0.135) (0.077)   (0.142) (0.126) (0.137)
log(capital) 0.348*** 0.402*** 0.241*** 0.293*** 0.310*** 0.484***
  (0.049) (0.059) (0.056) (0.063) (0.069) (0.054)
log(output) 0.491** 0.600***   0.610*** 0.569***  
  (0.170) (0.173)   (0.156) (0.152)  
log(output) lag 1 -0.607*** -0.369*   -0.446* -0.297  
  (0.177) (0.160)   (0.217) (0.200)  
log(capital) lag 1           -0.424***
            (0.058)
Serial corr. p 0.000 0.005        
Cross-sec dep. p 0.274 0.605        
Instruments     9 38 35 113
Instruments/N     0.064 0.271 0.250 0.807
Sargan p     0.000 0.220 0.202 0.097
AR(1) p     0.000 0.124 0.313 0.000
AR(2) p     0.268 0.780 0.859 0.779
***p < 0.001; **p < 0.01; *p < 0.05

The degrees of freedom from Wald or chi-square tests are not shown here because they mainly summarize omnibus model tests. They are mathematically interpretable — they count the restrictions being tested — but they are not the diagnostic quantities needed for this workflow. The relevant checks are serial correlation and cross-sectional dependence for the OLS-style baselines, and Sargan/AR(1)/AR(2)/instrument count for the GMM columns.

The coefficients are not expected to be identical. The Anderson-Hsiao column is intentionally simpler than the published AB-style specification, and the System-GMM column uses a closely related but not identical equation because system GMM is more sensitive to instrument growth in this short panel. The point of the table is to ask whether the substantive story travels across estimators: employment is persistent, wages tend to reduce employment, capital and output tend to raise employment, and the GMM estimators give us alternatives to the Nickell-biased FE-LDV benchmark.

After the estimator comparison, the last step is to translate coefficients into quantities of interest. The forecast below asks what the fitted dynamics imply for employment under a concrete wage scenario. The exercise uses the two main GMM estimators from the table: Difference GMM and System GMM.

2.9 Prediction and forecasting from dynamic GMM models

There are several useful ways to turn a fitted dynamic panel model into interpretable quantities:

  1. Expected-value forecast with simcf. This is the main applied tool. It uses the estimated dynamic equation, simulated coefficients, initial outcome values, and a hypothetical covariate path to generate expected outcomes, first differences, and relative risks.
  2. Manual recursive forecast. This is the same calculation written out step by step. It is useful for seeing how lagged outcomes, covariates, and dynamic coefficients produce a path over time, but it is code-heavy.
  3. Dynamic multiplier / impulse response. This strips away the baseline outcome path and focuses on the response to the wage shock itself: how much happens immediately, and how much is transmitted through persistence?

The scenario throughout this section is:

What would expected employment look like over the next six forecast periods if firm wages increased by 10%, holding the other included covariates fixed?

Wages are the clearest labor-demand shock in this dataset. A wage increase raises the cost of labor, so the expected sign is negative: firms should employ fewer workers, all else equal.

This section compares two estimators from the table above:

  • Difference GMM (ab_fit) estimates the differenced equation. It uses lagged levels of employment as instruments for the differenced lagged outcome. In this lab it has two employment lags, current and lagged wages, capital, and current and lagged output.
  • System GMM (bb_fit) stacks the differenced equation with a level equation. It can be more efficient when employment is persistent, but it adds the stronger assumption that lagged differences are valid instruments for the level equation. In this lab it has one employment lag, current and lagged wages, and current and lagged capital.

So the predictions below compare both the estimator and the dynamic equation. A difference in forecasts should not be read as only “Diff-GMM vs System-GMM”; it also reflects AR(2) versus AR(1) dynamics and a slightly different control set.

This matters for interpretation. Difference GMM often has wider intervals in persistent panels because lagged levels can be weak instruments for changes. That uncertainty enters the coefficient draws and then compounds as the dynamic model is iterated forward. System GMM can produce tighter intervals because it adds level-equation moments, increasing the effective information used by the estimator. That precision comes from stronger assumptions, especially the validity of lagged differences as instruments for the level equation.

The shapes can also differ. In this lab, Difference GMM uses two employment lags, and the second lag partly offsets the first. That makes the response look relatively flat after the first few periods. System GMM uses one employment lag with stronger persistence, so the forecast adjusts more gradually over time.

2.9.1 A simcf forecast: 10% wage increase

The fitted models are in logs, so a 10% increase in wages is approximately:

\[ \log(1.10). \]

The forecast horizon should be read relative to the empirical panel. EmplUK observes firms for only 7 to 9 years, so a six-period forecast is already a fairly long extrapolation from a short time window. There is no universal rule for the maximum valid horizon, but the farther the forecast moves beyond the observed panel, the more it depends on the model’s dynamic assumptions rather than on observed data. For that reason, long horizons should be treated as model-implied scenarios, not as precise forecasts.

set.seed(20260522)

# Forecast horizon: six future years after the last observed year, 1984.
# This is long enough to see the dynamic adjustment, but still short enough
# that the forecast is not entirely driven by the model's long-run extrapolation.
empl_periods_out <- 6

# Number of coefficient simulations.
empl_sims <- 1000

# Convert the data to a tibble and anchor the forecast at the last two years.
empl_df <- as_tibble(EmplUK)
empl_last <- empl_df |> filter(year == 1984)
empl_prev <- empl_df |> filter(year == 1983)

# A 10% increase in wages on the log scale.
wage_shock <- log(1.10)

# Last observed average log employment levels.
# Difference GMM has two lags of employment; System GMM has one.
empl_lagY_ab <- c(mean(log(empl_last$emp), na.rm = TRUE),
                  mean(log(empl_prev$emp), na.rm = TRUE))
empl_lagY_bb <- mean(log(empl_last$emp), na.rm = TRUE)

Now set up the Difference-GMM forecast. simcf takes the lagged-outcome coefficients through phi = and the other coefficients through b =.

# Simulate coefficient uncertainty from the Windmeijer-corrected covariance
# matrix of the full Difference-GMM model.
empl_draws_ab <- mvrnorm(n = empl_sims,
                         mu = coef(ab_fit),
                         Sigma = vcovHC(ab_fit))

# Difference GMM has two lagged-outcome coefficients.
empl_phi_ab <- empl_draws_ab[, c("lag(log(emp), 1:2)1",
                                 "lag(log(emp), 1:2)2")]

# Keep only the substantive covariates used in the forecast.
# The year dummies are omitted, so future year effects are set to zero.
empl_beta_names_ab <- c("lag(log(wage), 0:1)0",
                        "lag(log(wage), 0:1)1",
                        "log(capital)",
                        "lag(log(output), 0:1)0",
                        "lag(log(output), 0:1)1")

empl_beta_ab <- empl_draws_ab[, empl_beta_names_ab]

# Baseline path: hold wages, capital, and output at their 1984 averages.
empl_xbase_ab <- matrix(NA,
                        nrow = empl_periods_out,
                        ncol = length(empl_beta_names_ab))
colnames(empl_xbase_ab) <- empl_beta_names_ab

empl_xbase_ab[, "lag(log(wage), 0:1)0"]   <- mean(log(empl_last$wage), na.rm = TRUE)
empl_xbase_ab[, "lag(log(wage), 0:1)1"]   <- mean(log(empl_last$wage), na.rm = TRUE)
empl_xbase_ab[, "log(capital)"]           <- mean(log(empl_last$capital), na.rm = TRUE)
empl_xbase_ab[, "lag(log(output), 0:1)0"] <- mean(log(empl_last$output), na.rm = TRUE)
empl_xbase_ab[, "lag(log(output), 0:1)1"] <- mean(log(empl_last$output), na.rm = TRUE)

# Treatment path: wages are 10% higher in every future year.
# The lagged-wage column changes only from horizon 2 onward because the first
# forecast year still lags the observed 1984 wage level.
empl_xhyp_ab <- empl_xbase_ab
empl_xhyp_ab[, "lag(log(wage), 0:1)0"] <- empl_xhyp_ab[, "lag(log(wage), 0:1)0"] +
  wage_shock
empl_xhyp_ab[2:empl_periods_out, "lag(log(wage), 0:1)1"] <-
  empl_xhyp_ab[2:empl_periods_out, "lag(log(wage), 0:1)1"] + wage_shock

empl_xbase_ab
##      lag(log(wage), 0:1)0 lag(log(wage), 0:1)1 log(capital)
## [1,]             3.107475             3.107475    -1.192489
## [2,]             3.107475             3.107475    -1.192489
## [3,]             3.107475             3.107475    -1.192489
## [4,]             3.107475             3.107475    -1.192489
## [5,]             3.107475             3.107475    -1.192489
## [6,]             3.107475             3.107475    -1.192489
##      lag(log(output), 0:1)0 lag(log(output), 0:1)1
## [1,]                4.58738                4.58738
## [2,]                4.58738                4.58738
## [3,]                4.58738                4.58738
## [4,]                4.58738                4.58738
## [5,]                4.58738                4.58738
## [6,]                4.58738                4.58738
empl_xhyp_ab
##      lag(log(wage), 0:1)0 lag(log(wage), 0:1)1 log(capital)
## [1,]             3.202785             3.107475    -1.192489
## [2,]             3.202785             3.202785    -1.192489
## [3,]             3.202785             3.202785    -1.192489
## [4,]             3.202785             3.202785    -1.192489
## [5,]             3.202785             3.202785    -1.192489
## [6,]             3.202785             3.202785    -1.192489
##      lag(log(output), 0:1)0 lag(log(output), 0:1)1
## [1,]                4.58738                4.58738
## [2,]                4.58738                4.58738
## [3,]                4.58738                4.58738
## [4,]                4.58738                4.58738
## [5,]                4.58738                4.58738
## [6,]                4.58738                4.58738

Now set up the System-GMM forecast. This model has one lag of employment and uses current and lagged wages and capital.

# Simulate coefficient uncertainty from the robust covariance matrix of the
# System-GMM model.
empl_draws_bb <- mvrnorm(n = empl_sims,
                         mu = coef(bb_fit),
                         Sigma = vcovHC(bb_fit))

# System GMM has one lagged-outcome coefficient in this lab specification.
empl_phi_bb <- matrix(empl_draws_bb[, "lag(log(emp), 1)"],
                      ncol = 1)
colnames(empl_phi_bb) <- "lag(log(emp), 1)"

# Keep the substantive covariates in the System-GMM equation.
empl_beta_names_bb <- c("lag(log(wage), 0:1)0",
                        "lag(log(wage), 0:1)1",
                        "lag(log(capital), 0:1)0",
                        "lag(log(capital), 0:1)1")

empl_beta_bb <- empl_draws_bb[, empl_beta_names_bb]

# Baseline path: hold wages and capital at their 1984 averages.
empl_xbase_bb <- matrix(NA,
                        nrow = empl_periods_out,
                        ncol = length(empl_beta_names_bb))
colnames(empl_xbase_bb) <- empl_beta_names_bb

empl_xbase_bb[, "lag(log(wage), 0:1)0"]     <- mean(log(empl_last$wage), na.rm = TRUE)
empl_xbase_bb[, "lag(log(wage), 0:1)1"]     <- mean(log(empl_last$wage), na.rm = TRUE)
empl_xbase_bb[, "lag(log(capital), 0:1)0"]  <- mean(log(empl_last$capital), na.rm = TRUE)
empl_xbase_bb[, "lag(log(capital), 0:1)1"]  <- mean(log(empl_last$capital), na.rm = TRUE)

# Treatment path: wages are 10% higher in every future year.
empl_xhyp_bb <- empl_xbase_bb
empl_xhyp_bb[, "lag(log(wage), 0:1)0"] <- empl_xhyp_bb[, "lag(log(wage), 0:1)0"] +
  wage_shock
empl_xhyp_bb[2:empl_periods_out, "lag(log(wage), 0:1)1"] <-
  empl_xhyp_bb[2:empl_periods_out, "lag(log(wage), 0:1)1"] + wage_shock

empl_xbase_bb
##      lag(log(wage), 0:1)0 lag(log(wage), 0:1)1 lag(log(capital), 0:1)0
## [1,]             3.107475             3.107475               -1.192489
## [2,]             3.107475             3.107475               -1.192489
## [3,]             3.107475             3.107475               -1.192489
## [4,]             3.107475             3.107475               -1.192489
## [5,]             3.107475             3.107475               -1.192489
## [6,]             3.107475             3.107475               -1.192489
##      lag(log(capital), 0:1)1
## [1,]               -1.192489
## [2,]               -1.192489
## [3,]               -1.192489
## [4,]               -1.192489
## [5,]               -1.192489
## [6,]               -1.192489
empl_xhyp_bb
##      lag(log(wage), 0:1)0 lag(log(wage), 0:1)1 lag(log(capital), 0:1)0
## [1,]             3.202785             3.107475               -1.192489
## [2,]             3.202785             3.202785               -1.192489
## [3,]             3.202785             3.202785               -1.192489
## [4,]             3.202785             3.202785               -1.192489
## [5,]             3.202785             3.202785               -1.192489
## [6,]             3.202785             3.202785               -1.192489
##      lag(log(capital), 0:1)1
## [1,]               -1.192489
## [2,]               -1.192489
## [3,]               -1.192489
## [4,]               -1.192489
## [5,]               -1.192489
## [6,]               -1.192489

Now simulate expected values, first differences, and relative risks. Because the outcome is logged employment, transform = "log" returns expected values on the employment scale rather than the log-employment scale.

# Difference-GMM expected values under the wage shock and baseline.
empl_ev_treat_ab <- ldvsimev(empl_xhyp_ab, b = empl_beta_ab, ci = 0.95,
                             constant = NA, phi = empl_phi_ab,
                             lagY = empl_lagY_ab, transform = "log")
empl_ev_base_ab <- ldvsimev(empl_xbase_ab, b = empl_beta_ab, ci = 0.95,
                            constant = NA, phi = empl_phi_ab,
                            lagY = empl_lagY_ab, transform = "log")

# Difference-GMM first difference and relative risk.
empl_fd_ab <- ldvsimfd(empl_xhyp_ab, b = empl_beta_ab, ci = 0.95,
                       constant = NA, xpre = empl_xbase_ab,
                       phi = empl_phi_ab, lagY = empl_lagY_ab,
                       transform = "log")
empl_rr_ab <- ldvsimrr(empl_xhyp_ab, b = empl_beta_ab, ci = 0.95,
                       constant = NA, xpre = empl_xbase_ab,
                       phi = empl_phi_ab, lagY = empl_lagY_ab,
                       transform = "log")

# System-GMM expected values under the wage shock and baseline.
empl_ev_treat_bb <- ldvsimev(empl_xhyp_bb, b = empl_beta_bb, ci = 0.95,
                             constant = NA, phi = empl_phi_bb,
                             lagY = empl_lagY_bb, transform = "log")
empl_ev_base_bb <- ldvsimev(empl_xbase_bb, b = empl_beta_bb, ci = 0.95,
                            constant = NA, phi = empl_phi_bb,
                            lagY = empl_lagY_bb, transform = "log")

# System-GMM first difference and relative risk.
empl_fd_bb <- ldvsimfd(empl_xhyp_bb, b = empl_beta_bb, ci = 0.95,
                       constant = NA, xpre = empl_xbase_bb,
                       phi = empl_phi_bb, lagY = empl_lagY_bb,
                       transform = "log")
empl_rr_bb <- ldvsimrr(empl_xhyp_bb, b = empl_beta_bb, ci = 0.95,
                       constant = NA, xpre = empl_xbase_bb,
                       phi = empl_phi_bb, lagY = empl_lagY_bb,
                       transform = "log")

# Put the main quantities of interest in one table.
empl_qoi <- bind_rows(
  tibble(model = "Diff-GMM",
         quantity = "EV: wage +10%",
         horizon = seq_len(empl_periods_out),
         pe = as.numeric(empl_ev_treat_ab$pe),
         lower = as.numeric(empl_ev_treat_ab$lower[, 1]),
         upper = as.numeric(empl_ev_treat_ab$upper[, 1])),
  tibble(model = "Diff-GMM",
         quantity = "FD: difference in employment",
         horizon = seq_len(empl_periods_out),
         pe = as.numeric(empl_fd_ab$pe),
         lower = as.numeric(empl_fd_ab$lower[, 1]),
         upper = as.numeric(empl_fd_ab$upper[, 1])),
  tibble(model = "Diff-GMM",
         quantity = "RR: percent change",
         horizon = seq_len(empl_periods_out),
         pe = 100 * (as.numeric(empl_rr_ab$pe) - 1),
         lower = 100 * (as.numeric(empl_rr_ab$lower[, 1]) - 1),
         upper = 100 * (as.numeric(empl_rr_ab$upper[, 1]) - 1)),
  tibble(model = "System GMM",
         quantity = "EV: wage +10%",
         horizon = seq_len(empl_periods_out),
         pe = as.numeric(empl_ev_treat_bb$pe),
         lower = as.numeric(empl_ev_treat_bb$lower[, 1]),
         upper = as.numeric(empl_ev_treat_bb$upper[, 1])),
  tibble(model = "System GMM",
         quantity = "FD: difference in employment",
         horizon = seq_len(empl_periods_out),
         pe = as.numeric(empl_fd_bb$pe),
         lower = as.numeric(empl_fd_bb$lower[, 1]),
         upper = as.numeric(empl_fd_bb$upper[, 1])),
  tibble(model = "System GMM",
         quantity = "RR: percent change",
         horizon = seq_len(empl_periods_out),
         pe = 100 * (as.numeric(empl_rr_bb$pe) - 1),
         lower = 100 * (as.numeric(empl_rr_bb$lower[, 1]) - 1),
         upper = 100 * (as.numeric(empl_rr_bb$upper[, 1]) - 1))
)

# The full empl_qoi object is used below for plotting. To keep the printed
# table readable, show only the first five horizons for the percent-change
# quantity from each estimator.
empl_qoi |>
  filter(quantity == "RR: percent change") |>
  group_by(model) |>
  slice_head(n = 5) |>
  ungroup() |>
  knitr::kable(digits = 3,
               caption = "Preview of the counterfactual percent-change response to a sustained 10% wage increase: first five horizons from each estimator.")
Preview of the counterfactual percent-change response to a sustained 10% wage increase: first five horizons from each estimator.
model quantity horizon pe lower upper
Diff-GMM RR: percent change 1 -4.782 -7.387 -2.211
Diff-GMM RR: percent change 2 -4.928 -7.265 -2.555
Diff-GMM RR: percent change 3 -4.796 -7.258 -2.155
Diff-GMM RR: percent change 4 -4.699 -7.298 -1.845
Diff-GMM RR: percent change 5 -4.654 -7.364 -1.584
System GMM RR: percent change 1 -5.842 -7.850 -3.726
System GMM RR: percent change 2 -6.805 -8.977 -4.761
System GMM RR: percent change 3 -7.697 -10.436 -5.051
System GMM RR: percent change 4 -8.525 -11.887 -5.161
System GMM RR: percent change 5 -9.296 -13.423 -5.334

The same objects can be plotted directly. The first row shows expected employment under the baseline and the 10% wage scenario. The second row shows the first difference: treatment employment minus baseline employment.

# Expected-value paths for both models.
empl_ev_plot <- bind_rows(
  tibble(model = "Diff-GMM", quantity = "Baseline",
         horizon = seq_len(empl_periods_out),
         pe = as.numeric(empl_ev_base_ab$pe),
         lower = as.numeric(empl_ev_base_ab$lower[, 1]),
         upper = as.numeric(empl_ev_base_ab$upper[, 1])),
  tibble(model = "Diff-GMM", quantity = "Wage +10%",
         horizon = seq_len(empl_periods_out),
         pe = as.numeric(empl_ev_treat_ab$pe),
         lower = as.numeric(empl_ev_treat_ab$lower[, 1]),
         upper = as.numeric(empl_ev_treat_ab$upper[, 1])),
  tibble(model = "System GMM", quantity = "Baseline",
         horizon = seq_len(empl_periods_out),
         pe = as.numeric(empl_ev_base_bb$pe),
         lower = as.numeric(empl_ev_base_bb$lower[, 1]),
         upper = as.numeric(empl_ev_base_bb$upper[, 1])),
  tibble(model = "System GMM", quantity = "Wage +10%",
         horizon = seq_len(empl_periods_out),
         pe = as.numeric(empl_ev_treat_bb$pe),
         lower = as.numeric(empl_ev_treat_bb$lower[, 1]),
         upper = as.numeric(empl_ev_treat_bb$upper[, 1]))
)

# First differences for both models.
empl_fd_plot <- bind_rows(
  tibble(model = "Diff-GMM", quantity = "Difference",
         horizon = seq_len(empl_periods_out),
         pe = as.numeric(empl_fd_ab$pe),
         lower = as.numeric(empl_fd_ab$lower[, 1]),
         upper = as.numeric(empl_fd_ab$upper[, 1])),
  tibble(model = "System GMM", quantity = "Difference",
         horizon = seq_len(empl_periods_out),
         pe = as.numeric(empl_fd_bb$pe),
         lower = as.numeric(empl_fd_bb$lower[, 1]),
         upper = as.numeric(empl_fd_bb$upper[, 1]))
)

# Combine expected values and differences into one faceted figure.
empl_simcf_plot <- bind_rows(
  empl_ev_plot |>
    mutate(panel = "Expected employment"),
  empl_fd_plot |>
    mutate(panel = "Employment difference")
)

# Use the same estimator colors in every Part 2 estimator plot.
gmm_colors <- c("Diff-GMM" = "steelblue",
                "System GMM" = "firebrick")

ggplot(empl_simcf_plot,
       aes(x = horizon, y = pe,
           colour = model, fill = model,
           linetype = quantity,
           group = interaction(model, quantity))) +
  geom_hline(data = tibble(panel = "Employment difference"),
             aes(yintercept = 0), inherit.aes = FALSE,
             linetype = "dashed", colour = "grey50") +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.12,
              colour = NA) +
  geom_line(linewidth = 0.9) +
  geom_point(size = 2) +
  facet_grid(panel ~ model, scales = "free_y") +
  scale_x_continuous(breaks = seq_len(empl_periods_out)) +
  scale_colour_manual(values = gmm_colors) +
  scale_fill_manual(values = gmm_colors) +
  scale_linetype_manual(values = c("Baseline" = "dashed",
                                   "Wage +10%" = "solid",
                                   "Difference" = "solid")) +
  labs(x = "Forecast horizon", y = NULL,
       colour = "Estimator", fill = "Estimator",
       linetype = "Quantity") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom")
Simulated employment paths and counterfactual difference from a sustained 10% wage increase.

Simulated employment paths and counterfactual difference from a sustained 10% wage increase.

The first visual difference between the two estimators is the adjustment pattern. The Difference-GMM forecast reaches most of its implied effect quickly and then changes only modestly after the first few horizons. The System-GMM forecast is more gradual because more of the previous-period response is carried forward.

The comparison is useful because the models rest on different identifying assumptions. Difference GMM is less demanding, but in persistent panels lagged levels may be weak instruments for differences. System GMM adds more information by using a level equation, but only if lagged differences are valid instruments for that level equation. If the two forecasts differ, read that difference together with the diagnostics, the instrument count, and the substantive plausibility of the wage effect.

2.9.2 Optional: manual recursive forecast

This subsection is optional and shows what simcf is doing under the hood. The calculation is just the recursive forecast from the estimated dynamic model. The Difference-GMM model is AR(2):

\[ y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + x_t\beta, \]

while the System-GMM model used here is AR(1):

\[ y_t = \phi_1 y_{t-1} + x_t\beta. \]

The manual forecast below uses the same simulated coefficients, the same baseline scenario, and the same 10% wage scenario.

# Number of coefficient draws and forecast periods.
n_sims <- empl_sims
H <- empl_periods_out

# Empty matrices to store Difference-GMM forecast paths.
manual_log_base_ab  <- matrix(NA, nrow = n_sims, ncol = H)
manual_log_treat_ab <- matrix(NA, nrow = n_sims, ncol = H)

# Empty matrices to store System-GMM forecast paths.
manual_log_base_bb  <- matrix(NA, nrow = n_sims, ncol = H)
manual_log_treat_bb <- matrix(NA, nrow = n_sims, ncol = H)

# Difference-GMM lag coefficients.
phi1_ab <- empl_phi_ab[, "lag(log(emp), 1:2)1"]
phi2_ab <- empl_phi_ab[, "lag(log(emp), 1:2)2"]

# System-GMM lag coefficient.
phi1_bb <- as.numeric(empl_phi_bb[, "lag(log(emp), 1)"])

# Forecast each horizon recursively. The loop is useful because horizon h
# depends on previous forecasted outcomes.
for (h in seq_len(H)) {

  # Difference-GMM lag 1.
  if (h == 1) {
    lag1_base_ab  <- empl_lagY_ab[1]
    lag1_treat_ab <- empl_lagY_ab[1]
  } else {
    lag1_base_ab  <- manual_log_base_ab[, h - 1]
    lag1_treat_ab <- manual_log_treat_ab[, h - 1]
  }

  # Difference-GMM lag 2.
  if (h == 1) {
    lag2_base_ab  <- empl_lagY_ab[2]
    lag2_treat_ab <- empl_lagY_ab[2]
  } else if (h == 2) {
    lag2_base_ab  <- empl_lagY_ab[1]
    lag2_treat_ab <- empl_lagY_ab[1]
  } else {
    lag2_base_ab  <- manual_log_base_ab[, h - 2]
    lag2_treat_ab <- manual_log_treat_ab[, h - 2]
  }

  # Difference-GMM x_t beta.
  xb_base_ab_h  <- as.numeric(empl_beta_ab %*% empl_xbase_ab[h, ])
  xb_treat_ab_h <- as.numeric(empl_beta_ab %*% empl_xhyp_ab[h, ])

  # Difference-GMM recursive forecast on the log scale.
  manual_log_base_ab[, h] <-
    phi1_ab * lag1_base_ab + phi2_ab * lag2_base_ab + xb_base_ab_h
  manual_log_treat_ab[, h] <-
    phi1_ab * lag1_treat_ab + phi2_ab * lag2_treat_ab + xb_treat_ab_h

  # System-GMM lag 1.
  if (h == 1) {
    lag1_base_bb  <- empl_lagY_bb
    lag1_treat_bb <- empl_lagY_bb
  } else {
    lag1_base_bb  <- manual_log_base_bb[, h - 1]
    lag1_treat_bb <- manual_log_treat_bb[, h - 1]
  }

  # System-GMM x_t beta.
  xb_base_bb_h  <- as.numeric(empl_beta_bb %*% empl_xbase_bb[h, ])
  xb_treat_bb_h <- as.numeric(empl_beta_bb %*% empl_xhyp_bb[h, ])

  # System-GMM recursive forecast on the log scale.
  manual_log_base_bb[, h]  <- phi1_bb * lag1_base_bb  + xb_base_bb_h
  manual_log_treat_bb[, h] <- phi1_bb * lag1_treat_bb + xb_treat_bb_h
}

# Convert from log employment to employment.
manual_base_ab  <- exp(manual_log_base_ab)
manual_treat_ab <- exp(manual_log_treat_ab)
manual_fd_ab    <- manual_treat_ab - manual_base_ab
manual_rr_ab    <- manual_treat_ab / manual_base_ab

manual_base_bb  <- exp(manual_log_base_bb)
manual_treat_bb <- exp(manual_log_treat_bb)
manual_fd_bb    <- manual_treat_bb - manual_base_bb
manual_rr_bb    <- manual_treat_bb / manual_base_bb

# Summarize Difference-GMM simulated paths by horizon.
manual_fd_tab_ab <- tibble(
  model = "Diff-GMM",
  quantity = "FD: difference in employment",
  horizon = seq_len(H),
  pe = colMeans(manual_fd_ab),
  lower = apply(manual_fd_ab, 2, quantile, probs = 0.025),
  upper = apply(manual_fd_ab, 2, quantile, probs = 0.975)
)

manual_rr_tab_ab <- tibble(
  model = "Diff-GMM",
  quantity = "RR: percent change",
  horizon = seq_len(H),
  pe = 100 * (colMeans(manual_rr_ab) - 1),
  lower = 100 * (apply(manual_rr_ab, 2, quantile, probs = 0.025) - 1),
  upper = 100 * (apply(manual_rr_ab, 2, quantile, probs = 0.975) - 1)
)

# Summarize System-GMM simulated paths by horizon.
manual_fd_tab_bb <- tibble(
  model = "System GMM",
  quantity = "FD: difference in employment",
  horizon = seq_len(H),
  pe = colMeans(manual_fd_bb),
  lower = apply(manual_fd_bb, 2, quantile, probs = 0.025),
  upper = apply(manual_fd_bb, 2, quantile, probs = 0.975)
)

manual_rr_tab_bb <- tibble(
  model = "System GMM",
  quantity = "RR: percent change",
  horizon = seq_len(H),
  pe = 100 * (colMeans(manual_rr_bb) - 1),
  lower = 100 * (apply(manual_rr_bb, 2, quantile, probs = 0.025) - 1),
  upper = 100 * (apply(manual_rr_bb, 2, quantile, probs = 0.975) - 1)
)

manual_qoi <- bind_rows(manual_fd_tab_ab, manual_rr_tab_ab,
                        manual_fd_tab_bb, manual_rr_tab_bb)

# Again, keep the full manual_qoi object for plotting, but print a compact
# preview: five horizons from each estimator for the percent-change quantity.
manual_qoi |>
  filter(quantity == "RR: percent change") |>
  group_by(model) |>
  slice_head(n = 5) |>
  ungroup() |>
  knitr::kable(digits = 3,
               caption = "Preview of the manual recursive percent-change forecast for a sustained 10% wage increase: first five horizons from each estimator.")
Preview of the manual recursive percent-change forecast for a sustained 10% wage increase: first five horizons from each estimator.
model quantity horizon pe lower upper
Diff-GMM RR: percent change 1 -4.782 -7.387 -2.211
Diff-GMM RR: percent change 2 -4.928 -7.265 -2.555
Diff-GMM RR: percent change 3 -4.796 -7.258 -2.155
Diff-GMM RR: percent change 4 -4.699 -7.298 -1.845
Diff-GMM RR: percent change 5 -4.654 -7.364 -1.584
System GMM RR: percent change 1 -5.842 -7.850 -3.726
System GMM RR: percent change 2 -6.805 -8.977 -4.761
System GMM RR: percent change 3 -7.697 -10.436 -5.051
System GMM RR: percent change 4 -8.525 -11.887 -5.161
System GMM RR: percent change 5 -9.296 -13.423 -5.334

The manual calculation follows the same logic as ldvsimev(): draw coefficients, set a counterfactual path for \(x_t\), anchor the forecast with the last observed outcomes, and iterate the dynamic equation forward. At horizon 1, only the contemporaneous wage coefficient enters the shock. From horizon 2 onward, lagged wages and lagged employment both transmit the change.

bind_rows(manual_fd_tab_ab, manual_fd_tab_bb) |>
  ggplot(aes(x = horizon, y = pe, colour = model, fill = model)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.10,
              colour = NA) +
  geom_line(linewidth = 0.9) +
  geom_point(size = 2) +
  facet_wrap(~ model, nrow = 1, ncol = 2, scales = "free_y") +
  scale_x_continuous(breaks = seq_len(H)) +
  scale_colour_manual(values = gmm_colors) +
  scale_fill_manual(values = gmm_colors) +
  labs(x = "Forecast horizon", y = "Employment difference",
       colour = NULL, fill = NULL) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none")
Manual recursive forecast of the employment difference from a sustained 10% wage increase.

Manual recursive forecast of the employment difference from a sustained 10% wage increase.

2.9.3 Dynamic multiplier / impulse response

The forecast above compared employment under one baseline path and one treatment path. A related question is the dynamic multiplier: how does a sustained 10% wage increase move employment toward its new equilibrium?

For Difference GMM, the response follows the AR(2) equation:

\[ \Delta y_h = \phi_1 \Delta y_{h-1} + \phi_2 \Delta y_{h-2} + \beta_{wage,0}\log(1.10) + \beta_{wage,1}\log(1.10), \]

after the first horizon. For System GMM, the response follows the AR(1) equation:

\[ \Delta y_h = \phi_1 \Delta y_{h-1} + \beta_{wage,0}\log(1.10) + \beta_{wage,1}\log(1.10). \]

The long-run multipliers therefore differ:

\[ \text{Diff-GMM: } \frac{(\beta_{wage,0}+\beta_{wage,1})\log(1.10)} {1-\phi_1-\phi_2} \qquad \text{System GMM: } \frac{(\beta_{wage,0}+\beta_{wage,1})\log(1.10)} {1-\phi_1}. \]

# Use the same parametric coefficient draws created in Section 2.9.
# Each row is one simulated set of coefficients from the estimated sampling
# distribution of the model.
phi1_sim_ab <- empl_draws_ab[, "lag(log(emp), 1:2)1"]
phi2_sim_ab <- empl_draws_ab[, "lag(log(emp), 1:2)2"]
b_wage0_sim_ab <- empl_draws_ab[, "lag(log(wage), 0:1)0"]
b_wage1_sim_ab <- empl_draws_ab[, "lag(log(wage), 0:1)1"]

phi1_sim_bb <- empl_draws_bb[, "lag(log(emp), 1)"]
b_wage0_sim_bb <- empl_draws_bb[, "lag(log(wage), 0:1)0"]
b_wage1_sim_bb <- empl_draws_bb[, "lag(log(wage), 0:1)1"]

# Trace the response for twelve forecast periods.
irf_H <- 12

# Each row is a simulated coefficient draw; each column is a forecast horizon.
irf_log_ab <- matrix(NA, nrow = empl_sims, ncol = irf_H)
irf_log_bb <- matrix(NA, nrow = empl_sims, ncol = irf_H)

for (h in seq_len(irf_H)) {

  # Difference-GMM previous responses. Before the shock, responses are zero.
  if (h == 1) {
    irf_lag1_ab <- 0
    irf_lag2_ab <- 0
  } else if (h == 2) {
    irf_lag1_ab <- irf_log_ab[, h - 1]
    irf_lag2_ab <- 0
  } else {
    irf_lag1_ab <- irf_log_ab[, h - 1]
    irf_lag2_ab <- irf_log_ab[, h - 2]
  }

  # System-GMM previous response.
  if (h == 1) {
    irf_lag1_bb <- 0
  } else {
    irf_lag1_bb <- irf_log_bb[, h - 1]
  }

  # Current wage enters immediately. Lagged wage enters from horizon 2 onward.
  wage_effect_ab <- b_wage0_sim_ab * wage_shock
  wage_effect_bb <- b_wage0_sim_bb * wage_shock
  if (h >= 2) {
    wage_effect_ab <- wage_effect_ab + b_wage1_sim_ab * wage_shock
    wage_effect_bb <- wage_effect_bb + b_wage1_sim_bb * wage_shock
  }

  # Dynamic propagation for every simulated coefficient draw.
  irf_log_ab[, h] <- phi1_sim_ab * irf_lag1_ab +
                     phi2_sim_ab * irf_lag2_ab +
                     wage_effect_ab

  irf_log_bb[, h] <- phi1_sim_bb * irf_lag1_bb +
                     wage_effect_bb
}

# Convert the simulated log responses into percent changes in employment.
irf_pct_ab <- 100 * (exp(irf_log_ab) - 1)
irf_pct_bb <- 100 * (exp(irf_log_bb) - 1)

# Long-run multipliers, one for each simulated coefficient draw.
# This is a nonlinear quantity because the denominator includes the dynamic
# coefficient(s). If a few simulated draws put the denominator close to zero,
# the mean can be dominated by extreme values. The median is more stable and
# more readable for the dashed long-run reference line.
irf_long_run_pct_ab <- 100 * (exp(((b_wage0_sim_ab + b_wage1_sim_ab) * wage_shock) /
                                    (1 - phi1_sim_ab - phi2_sim_ab)) - 1)
irf_long_run_pct_bb <- 100 * (exp(((b_wage0_sim_bb + b_wage1_sim_bb) * wage_shock) /
                                    (1 - phi1_sim_bb)) - 1)

irf_tab <- bind_rows(
  tibble(model = "Diff-GMM",
         horizon = seq_len(irf_H),
         percent_response = colMeans(irf_pct_ab),
         lower = apply(irf_pct_ab, 2, quantile, probs = 0.025),
         upper = apply(irf_pct_ab, 2, quantile, probs = 0.975),
         long_run_percent = median(irf_long_run_pct_ab),
         long_run_lower = quantile(irf_long_run_pct_ab, probs = 0.025),
         long_run_upper = quantile(irf_long_run_pct_ab, probs = 0.975)),
  tibble(model = "System GMM",
         horizon = seq_len(irf_H),
         percent_response = colMeans(irf_pct_bb),
         lower = apply(irf_pct_bb, 2, quantile, probs = 0.025),
         upper = apply(irf_pct_bb, 2, quantile, probs = 0.975),
         long_run_percent = median(irf_long_run_pct_bb),
         long_run_lower = quantile(irf_long_run_pct_bb, probs = 0.025),
         long_run_upper = quantile(irf_long_run_pct_bb, probs = 0.975))
)

# Print a compact preview of the impulse-response table. The full irf_tab
# object is used in the plot below.
irf_tab |>
  group_by(model) |>
  slice_head(n = 5) |>
  ungroup() |>
  knitr::kable(digits = 3,
               caption = "Preview of the dynamic multiplier from a sustained 10% wage increase: first five horizons from each estimator.")
Preview of the dynamic multiplier from a sustained 10% wage increase: first five horizons from each estimator.
model horizon percent_response lower upper long_run_percent long_run_lower long_run_upper
Diff-GMM 1 -4.782 -7.387 -2.211 -4.668 -7.433 -1.340
Diff-GMM 2 -4.928 -7.265 -2.555 -4.668 -7.433 -1.340
Diff-GMM 3 -4.796 -7.258 -2.155 -4.668 -7.433 -1.340
Diff-GMM 4 -4.699 -7.298 -1.845 -4.668 -7.433 -1.340
Diff-GMM 5 -4.654 -7.364 -1.584 -4.668 -7.433 -1.340
System GMM 1 -5.842 -7.850 -3.726 -19.546 -65.474 -2.717
System GMM 2 -6.805 -8.977 -4.761 -19.546 -65.474 -2.717
System GMM 3 -7.697 -10.436 -5.051 -19.546 -65.474 -2.717
System GMM 4 -8.525 -11.887 -5.161 -19.546 -65.474 -2.717
System GMM 5 -9.296 -13.423 -5.334 -19.546 -65.474 -2.717
ggplot(irf_tab,
       aes(x = horizon, y = percent_response, colour = model, fill = model)) +
  geom_hline(aes(yintercept = long_run_percent, colour = model),
             linetype = "dashed") +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.12,
              colour = NA) +
  geom_line(linewidth = 0.9) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(1, irf_H, by = 1)) +
  scale_colour_manual(values = gmm_colors) +
  scale_fill_manual(values = gmm_colors) +
  labs(x = "Forecast horizon",
       y = "Employment response (%)",
       colour = NULL, fill = NULL,
       caption = "Ribbons = 95% parametric-simulation intervals. Dashed lines = median long-run multipliers.") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom")
Dynamic multiplier from a sustained 10% wage increase.

Dynamic multiplier from a sustained 10% wage increase.

This plot strips the forecast down to the response itself. Difference GMM and System GMM can imply different adjustment paths because they use different moment conditions and different dynamic equations. The long-run multiplier is especially sensitive because it divides by \(1-\phi_1-\phi_2\) or \(1-\phi_1\). If that denominator is close to zero for some simulated draws, the long-run distribution becomes very skewed. For that reason, the dashed line uses the median simulated long-run multiplier, while the table still reports the 95% simulation interval.


Part 3 — Tax and cigarettes: a Diff-GMM fit with counterfactual forecast

For a second applied example with a sharper substantive story — and an actual policy counterfactual to interpret — the lab turns to the cigarette consumption panel used in the old version of this lab and covered in lecture. The question is simple: by how much would a tax-induced price increase reduce cigarette consumption?

The data are US state-level cigarette sales, prices, taxes, and income from 1985–1995. 48 states × 11 years = 528 observations, a balanced panel — short enough that Nickell bias matters and long enough that Diff-GMM is well-identified. The variables needed here are:

Variable Description
packpc packs of cigarettes per capita (the outcome)
income state personal income (total)
pop state population
tax average federal + state + local excise tax (cents/pack)
taxs excise taxes including sales taxes (cents/pack)
avgprs average price including sales taxes (cents/pack)
cpi consumer price index (for inflation adjustment)

3.1 Load and inflation-adjust to 1995 dollars

Dollar amounts need to be in a constant unit. The trick: every dollar variable gets multiplied by the 1995 CPI and divided by the CPI in the observed year — bringing the figure to “1995 dollars”. This is done inline (no helper function) so the operation is fully transparent.

# Load the panel. 48 states × 11 years (1985–1995) = 528 rows.
cig <- read_csv("data/cigarette.csv", show_col_types = FALSE)

# Quick panel check
length(unique(cig$state))    # number of states
## [1] 48
length(unique(cig$year))     # number of years
## [1] 11
# Inflation-adjust. The CPI is the same across states within a given year,
# so pull the 1995 value once and rescale.
cpi_base <- unique(cig$cpi[cig$year == 1995])      # 1995 CPI value (a scalar)

cig <- cig |>
  mutate(
    income95   = cpi_base * income / cpi,           # state income, 1995 dollars
    tax95      = cpi_base * tax    / cpi,           # avg excise tax, 1995 cents/pack
    taxs95     = cpi_base * taxs   / cpi,           # excise + sales tax
    avgprs95   = cpi_base * avgprs / cpi,           # avg price (incl. sales tax)
    income95pc = income95 / pop,                    # per-capita income (1995 dollars)
    pretax95   = avgprs95 - taxs95                  # price net of tax
  )

# Quick look at the post-adjustment numbers
cig |>
  summarise(across(c(packpc, income95pc, avgprs95, taxs95),
                   list(mean = mean, sd = sd))) |>
  knitr::kable(digits = 2,
               caption = "Means and standard deviations of the key variables after inflation adjustment.")
Means and standard deviations of the key variables after inflation adjustment.
packpc_mean packpc_sd income95pc_mean income95pc_sd avgprs95_mean avgprs95_sd taxs95_mean taxs95_sd
106.45 23.12 21.25 3.26 175.14 23.61 53.84 13.86

3.1.1 Seeing pooled, between, and within variation

The cigarette panel is balanced, but the same pooled/between/within distinction still matters. A high pooled correlation may come from stable differences across states, while a within correlation asks whether changes inside the same state over time move with changes in consumption.

# Create the lagged outcome within state. The first year for each state has no
# lagged value and is therefore missing.
cig_wb_df <- cig |>
  arrange(state, year) |>
  group_by(state) |>
  mutate(packpc_lag1 = dplyr::lag(packpc, 1)) |>
  ungroup()

# These are the variables used in the model or policy scenario.
cig_wb_vars <- c("packpc_lag1", "income95pc", "avgprs95",
                 "tax95", "taxs95", "pretax95")

# Stack predictors into long format: one row is one state-year-variable.
cig_wb_long <- cig_wb_df |>
  select(state, packpc, all_of(cig_wb_vars)) |>
  pivot_longer(cols = all_of(cig_wb_vars),
               names_to = "variable",
               values_to = "value")

# Overall correlation: all state-years pooled together.
cig_overall <- cig_wb_long |>
  group_by(variable) |>
  summarise(overall = cor(value, packpc, use = "pairwise.complete.obs"),
            .groups = "drop")

# Between correlation: collapse to state means, then correlate those means.
cig_between <- cig_wb_long |>
  group_by(variable, state) |>
  summarise(value_mean  = mean(value,  na.rm = TRUE),
            packpc_mean = mean(packpc, na.rm = TRUE),
            .groups = "drop") |>
  group_by(variable) |>
  summarise(between = cor(value_mean, packpc_mean, use = "pairwise.complete.obs"),
            .groups = "drop")

# Within correlation: remove each state's mean from both variables, then
# correlate the state-year deviations.
cig_within <- cig_wb_long |>
  group_by(variable, state) |>
  mutate(value_dev  = value  - mean(value,  na.rm = TRUE),
         packpc_dev = packpc - mean(packpc, na.rm = TRUE)) |>
  ungroup() |>
  group_by(variable) |>
  summarise(within = cor(value_dev, packpc_dev, use = "pairwise.complete.obs"),
            .groups = "drop")

# Within-variance share: the fraction of each predictor's variance that comes
# from movement over time within states.
cig_within_share <- cig_wb_long |>
  group_by(variable, state) |>
  mutate(value_dev = value - mean(value, na.rm = TRUE)) |>
  ungroup() |>
  group_by(variable) |>
  summarise(within_share = var(value_dev, na.rm = TRUE) / var(value, na.rm = TRUE),
            .groups = "drop")

# Join the pieces into one table.
cig_wb_tab <- cig_overall |>
  left_join(cig_between,      by = "variable") |>
  left_join(cig_within,       by = "variable") |>
  left_join(cig_within_share, by = "variable") |>
  arrange(desc(abs(overall)))

cig_wb_tab |>
  knitr::kable(digits = 3,
               col.names = c("Variable", "Overall", "Between", "Within",
                             "Within var. share"),
               caption = "Correlation with cigarette packs per capita, decomposed into overall, between-state, and within-state correlations.")
Correlation with cigarette packs per capita, decomposed into overall, between-state, and within-state correlations.
Variable Overall Between Within Within var. share
packpc_lag1 0.979 0.999 0.902 0.207
avgprs95 -0.614 -0.544 -0.843 0.542
taxs95 -0.539 -0.496 -0.678 0.269
tax95 -0.497 -0.472 -0.581 0.248
pretax95 -0.467 -0.477 -0.706 0.847
income95pc -0.129 -0.047 -0.681 0.080
cig_wb_tab |>
  mutate(variable = fct_reorder(variable, between)) |>
  ggplot(aes(y = variable)) +
  geom_segment(aes(x = between, xend = within, yend = variable),
               colour = "grey70") +
  geom_point(aes(x = between, colour = "Between"), size = 2.6) +
  geom_point(aes(x = within, colour = "Within"), size = 2.6) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey50") +
  scale_colour_manual(values = c("Between" = "firebrick",
                                 "Within" = "steelblue")) +
  labs(x = "Correlation with packs per capita", y = NULL, colour = NULL) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom")
Between- and within-state correlations with cigarette packs per capita.

Between- and within-state correlations with cigarette packs per capita.

The lagged outcome captures persistence in cigarette consumption. The tax and price variables are also informative within states over time, which is the relevant variation for the counterfactual price increase. The between correlations should not be read as the causal effect of price: states with high prices may differ from low-price states in many stable ways that are not in the model.

3.2 Diff-GMM fit on cigarette consumption

The Diff-GMM spec — packs per capita as a dynamic function of per-capita income and the (tax-inclusive) average price, with lag-2 instruments stacked GMM-style up to whatever the panel supports:

The model is:

\[ \text{packpc}_{it} = \alpha_i + \phi\,\text{packpc}_{i,t-1} + \beta_1\,\text{income95pc}_{it} + \beta_2\,\text{avgprs95}_{it} + \varepsilon_{it}. \]

pgmm(..., transformation = "d") estimates the differenced equation internally:

\[ \Delta \text{packpc}_{it} = \phi\,\Delta \text{packpc}_{i,t-1} + \beta_1\,\Delta \text{income95pc}_{it} + \beta_2\,\Delta \text{avgprs95}_{it} + \Delta\varepsilon_{it}. \]

The lagged difference \(\Delta \text{packpc}_{i,t-1}\) is endogenous because it shares \(\varepsilon_{i,t-1}\) with \(\Delta\varepsilon_{it}\). Difference GMM instruments it with deeper lags of the level outcome, starting at \(t-2\).

The other covariates are not automatically instrumented by this specification. The model treats changes in income95pc and avgprs95 as exogenous or predetermined enough to enter directly. That assumption is part of the research design: if price changes respond to unobserved shocks in cigarette consumption, then price would need its own instrument block or a different identification strategy.

# Convert to plm's pdata.frame: tells plm which columns identify the
# unit-time grid.  Order matters: c("unit", "time").
pdat_cig <- pdata.frame(cig, index = c("state", "year"))

# Diff-GMM (Arellano-Bond) fit on cigarette consumption.
# Reading the formula:
#   * Left of |:   packpc is the outcome; lag(packpc, 1) is the lagged DV
#                  (pgmm will internally difference everything).  income95pc
#                  and avgprs95 enter as contemporaneous regressors.
#   * First |:    lag(packpc, 2:99) is the GMM-style instrument block —
#                  every available lag from 2 onward, stacked per period.
# Arguments:
#   * effect = "individual": only state FE. This keeps the example simple
#     and preserves a direct price counterfactual. A two-way version
#     is a useful robustness check, but it shifts identification toward
#     within-year cross-state price differences and can produce weighting-
#     matrix warnings in short panels.
#   * model = "twosteps": Arellano-Bond two-step efficient weighting matrix.
#     summary(..., robust = TRUE) below applies the Windmeijer correction.
#   * transformation = "d": difference GMM (not system).
fit_cig <- pgmm(
  packpc ~ lag(packpc, 1) + income95pc + avgprs95 |
           lag(packpc, 2:99),
  data           = pdat_cig,
  effect         = "individual",
  model          = "twosteps",
  transformation = "d"
)

# Capture the summary once — the Part 2 diagnostic battery hangs off this object
sum_cig <- summary(fit_cig, robust = TRUE)
sum_cig
## Oneway (individual) effect Two-steps model Difference GMM 
## 
## Call:
## pgmm(formula = packpc ~ lag(packpc, 1) + income95pc + avgprs95 | 
##     lag(packpc, 2:99), data = pdat_cig, effect = "individual", 
##     model = "twosteps", transformation = "d")
## 
## Balanced Panel: n = 48, T = 11, N = 528
## 
## Number of Observations Used: 432
## Residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -25.4405  -2.4992   0.1460   0.1232   2.7499  25.6071 
## 
## Coefficients:
##                 Estimate Std. Error z-value  Pr(>|z|)    
## lag(packpc, 1)  0.639465   0.055354 11.5523 < 2.2e-16 ***
## income95pc     -0.479040   0.496258 -0.9653    0.3344    
## avgprs95       -0.179869   0.028086 -6.4042 1.511e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(44) = 47.09887 (p-value = 0.34693)
## Autocorrelation test (1): normal = -3.443569 (p-value = 0.00057409)
## Autocorrelation test (2): normal = -0.5365189 (p-value = 0.5916)
## Wald test for coefficients: chisq(3) = 2446.946 (p-value = < 2.22e-16)

Read off (against the Part 2 checklist):

  • \(\hat\phi\) on lag(packpc, 1) should be positive and below 1 (sticky consumption from year to year).
  • \(\hat\beta\) on avgprs95 should be negative (higher price → fewer packs sold).
  • \(\hat\beta\) on income95pc is an empirical control, not the main object of interest. Its sign can be sensitive to the short panel, the price measure, and the dynamic specification.
  • Sargan, AR(1), AR(2) at the bottom — apply the §2.3–§2.4 decision rules.
  • Instrument count below: this panel is \(N = 48\) states, so the full lag(packpc, 2:99) spec can get close to the \(N\) rule quickly. Read the Sargan test together with the instrument count, not by itself.
# Put the core diagnostic statistics in one small table.
# These are the same objects printed at the bottom of summary(fit_cig).
cig_n_inst <- length(fit_cig$W[[1]][1, ])
cig_n_units <- length(fit_cig$W)

tibble(
  diagnostic = c("Sargan overidentification test",
                 "AR(1) on differenced residuals",
                 "AR(2) on differenced residuals",
                 "Instrument count / N"),
  statistic = c(as.numeric(sum_cig$sargan$statistic),
                as.numeric(sum_cig$m1$statistic),
                as.numeric(sum_cig$m2$statistic),
                cig_n_inst),
  p_value = c(sum_cig$sargan$p.value,
              sum_cig$m1$p.value,
              sum_cig$m2$p.value,
              NA_real_),
  desired_pattern = c("fail to reject",
                      "reject",
                      "fail to reject",
                      "ratio below 1"),
  decision_rule = c("p > 0.10",
                    "p < 0.05",
                    "p > 0.05",
                    paste0(cig_n_inst, " / ", cig_n_units,
                           " = ", round(cig_n_inst / cig_n_units, 2)))
) |>
  knitr::kable(digits = 3,
               caption = "Cigarette Diff-GMM diagnostics, read against the opening checklist.")
Cigarette Diff-GMM diagnostics, read against the opening checklist.
diagnostic statistic p_value desired_pattern decision_rule
Sargan overidentification test 47.099 0.347 fail to reject p > 0.10
AR(1) on differenced residuals -3.444 0.001 reject p < 0.05
AR(2) on differenced residuals -0.537 0.592 fail to reject p > 0.05
Instrument count / N 47.000 NA ratio below 1 47 / 48 = 0.98

If the AR(2) test rejected, the analysis would stop: lag-2 instruments would not be valid. If the Sargan test rejected, a more conservative instrument set would be needed. If the instrument count were much larger than \(N\), a high Sargan p-value would be suspicious rather than reassuring.

3.3 Counterfactual scenario: a 60-cent price increase in 1996

This is the policy question: what if the average state’s tax-inclusive price increased by 60 cents per pack, starting in 1996? This can be interpreted as a tax-induced price scenario only under a full pass-through assumption: a 60-cent tax increase becomes a 60-cent increase in the price paid by consumers. Sixty cents is roughly 3 standard deviations of the 1995 tax distribution — a deliberately large shock to make the response visible. The forecast computes the trajectory of packpc 3 years out (1996, 1997, 1998) and reports three quantities of interest:

  • EV (Expected Value): the predicted packs/capita under the price increase.
  • FD (First Difference in expected outcomes): the gap between the treatment trajectory and the no-change baseline.
  • RR (Relative Ratio): the treatment trajectory divided by the baseline trajectory. It is converted to a percent change in the plot.

There are three ingredients in any dynamic counterfactual:

  1. A fitted dynamic model. Use fit_cig, the Diff-GMM model from §3.2.
  2. Initial conditions. Dynamic forecasts need a starting level of the outcome and a recent lagged change in the outcome.
  3. A hypothetical covariate path. Here, the average price rises by 60 cents in 1996 and all other covariate changes are set to zero.

Because the model is estimated in differences, a one-period change in the covariate path means something specific. Setting avgprs95 to increase by 60 in the first forecast period means price jumps by 60 cents in 1996. Setting later changes to zero means the new higher price level persists in 1997 and 1998; it does not mean the price returns to its old level.

# Sample statistics that motivate the +60 cent scenario:
# the average tax in 1995 is about 60 cents and 1 sd is ~18 cents,
# so +60 is roughly a 3-sd intervention — large but informative.
summary(pdat_cig$taxs95[pdat_cig$year == 1995])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34.44   48.75   59.84   61.87   74.78  112.63
sd(pdat_cig$taxs95[pdat_cig$year == 1995])
## [1] 18.47741
# Forecast horizon
periods_out <- 3

# Number of parameter draws for the Monte-Carlo simulation
sims <- 1000

# Anchor the forecast at the last observed year (1995).
# initialY is the average level of packpc in 1995.
# lagY is the average change in packpc from 1994 to 1995.
# Because this is an "average-state" forecast, both are averaged
# across states.
last_year <- pdat_cig[pdat_cig$year == 1995, ]
prev_year <- pdat_cig[pdat_cig$year == 1994, ]

# Match states between the two slices (panel is balanced so this is trivial)
initialY <- mean(last_year$packpc, na.rm = TRUE)
lagY     <- mean(last_year$packpc - prev_year$packpc, na.rm = TRUE)

tibble(
  constant = c("periods_out", "initialY (avg packpc, 1995)", "lagY (avg ΔY, 1995)"),
  value    = c(periods_out, initialY, lagY)
) |>
  knitr::kable(digits = 2,
               caption = "Anchor constants for the cigarette counterfactual.")
Anchor constants for the cigarette counterfactual.
constant value
periods_out 3.00
initialY (avg packpc, 1995) 96.33
lagY (avg ΔY, 1995) 1.30

Parameter draws. Model uncertainty is simulated through draws from the coefficient distribution. Each row of simparam is one possible parameter vector drawn from

\[ \tilde\theta^{(s)} \sim \mathcal{N}(\hat\theta,\widehat{V}_{Windmeijer}). \]

The lagged-outcome coefficient is \(\phi\). The remaining columns are the coefficients on income and price. They are separated because simcf asks for the LDV coefficient through phi = and the non-LDV coefficients through b =.

set.seed(20260520)

# Joint MVN draws from the fit. vcovHC() gives the Windmeijer-corrected
# variance for two-step GMM (this matches what summary(..., robust = TRUE)
# uses for SEs).
simparam <- mvrnorm(n     = sims,
                    mu    = coefficients(fit_cig),
                    Sigma = vcovHC(fit_cig))

# Split by coefficient names so the code stays correct if the formula changes.
# phi is the LDV coefficient; betas are the substantive covariates.
simphi <- simparam[, "lag(packpc, 1)"]                 # length-1000 vector
simbeta <- simparam[, c("income95pc", "avgprs95"),
                    drop = FALSE]                     # 1000 x 2 matrix

# Sanity check
dim(simbeta)
## [1] 1000    2
head(simphi)
## [1] 0.5722195 0.6450104 0.5157122 0.5672364 0.5571531 0.6867599

Build the counterfactual matrices. Two cfMake objects:

  • xhyp represents the treatment trajectory (60-cent price increase happens in 1996, persists).
  • xbase represents the baseline (nothing changes).

cfMake returns an object whose $x field holds the change in each covariate at each forecast period. Because the model was estimated in first differences, this is exactly the needed input: a row for the 1996 change, a row for the 1997 change, and a row for the 1998 change.

Every covariate change is set to zero first. Then avgprs95 changes by 60 in the first forecast period. The lagged dependent variable propagates that initial shock forward through the dynamic model.

# Covariate-only formula for cfMake (LDV is fed in separately via phi)
formula_cf <- packpc ~ income95pc + avgprs95 - 1

# Treatment: +60 cent change in avgprs95 at period 1, no other changes
xhyp <- cfMake(formula = formula_cf, data = pdat_cig, nscen = periods_out)
xhyp$x    <- 0 * xhyp$x       # zero out every period's change
xhyp$xpre <- 0 * xhyp$xpre    # zero out the "previous" matrix too
xhyp <- cfChange(xhyp, covname = "avgprs95", x = 60, scen = 1)

# Baseline: zero change in everything
xbase <- xhyp
xbase$x <- xbase$xpre

# Inspect: the 60-cent shock appears in row 1 of xhyp$x, zeros elsewhere
xhyp$x
##   packpc income95pc avgprs95
## 1      0          0       60
## 2      0          0        0
## 3      0          0        0
xbase$x
##   packpc income95pc avgprs95
## 1      0          0        0
## 2      0          0        0
## 3      0          0        0

3.4 Simulate EV, FD, and RR

simcf can report three different quantities. The inputs are the same each time:

  • x is the counterfactual covariate path.
  • b = simbeta passes the simulated non-LDV coefficients.
  • phi = simphi passes the simulated lagged-outcome coefficient.
  • lagY = lagY gives the most recent observed change in the outcome.
  • transform = "diff" tells simcf that the model was estimated in differences.
  • initialY = initialY lets simcf convert the differenced forecast back to the level scale.

The outputs differ because EV, FD, and RR answer different questions.

# EV under treatment:
# What is expected packpc if the 60-cent price increase happens?
# Because transform = "diff" and initialY is provided, the result is returned
# on the original level scale of packs per capita.
sev_treat <- ldvsimev(
  xhyp,
  b        = simbeta,
  ci       = 0.95,
  constant = NA,         # NA = no intercept term (FE absorbed)
  phi      = simphi,
  lagY     = lagY,
  transform = "diff",
  initialY  = initialY
)

# EV under baseline:
# What is expected packpc if there is no price increase?
# Compute this separately so the dashed baseline path can be drawn in the EV
# panel and so the FD comparison is explicit.
sev_base <- ldvsimev(
  xbase,
  b        = simbeta,
  ci       = 0.95,
  constant = NA,
  phi      = simphi,
  lagY     = lagY,
  transform = "diff",
  initialY  = initialY
)

# FD:
# What is the absolute effect of the treatment path?
# ldvsimfd() compares xhyp$x to xhyp$xpre. Since xhyp$xpre is all zeros,
# this is "60-cent shock" minus "no shock" at each forecast horizon.
sfd <- ldvsimfd(
  xhyp,
  b        = simbeta,
  ci       = 0.95,
  constant = NA,
  phi      = simphi,
  lagY     = lagY,
  transform = "diff"
)

# RR:
# What is the treatment path divided by the baseline path?
# This is a ratio. Below it is converted to percent change as 100 * (RR - 1).
srr <- ldvsimrr(
  xhyp,
  b        = simbeta,
  ci       = 0.95,
  constant = NA,
  phi      = simphi,
  lagY     = lagY,
  transform = "diff",
  initialY  = initialY
)

# Stack the three quantities into one tidy frame for plotting.
# Each simcf object returns:
#   pe    = point estimate
#   lower = lower confidence interval
#   upper = upper confidence interval
plot_data <- bind_rows(
  tibble(quantity = "EV (packs/capita)",
         h = seq_len(periods_out),
         pe = sev_treat$pe,
         lo = sev_treat$lower[, 1],
         hi = sev_treat$upper[, 1],
         baseline_pe = sev_base$pe),
  tibble(quantity = "FD (treat − baseline)",
         h = seq_len(periods_out),
         pe = sfd$pe,
         lo = sfd$lower[, 1],
         hi = sfd$upper[, 1],
         baseline_pe = NA_real_),
  tibble(quantity = "RR (% change)",
         h = seq_len(periods_out),
         pe = 100 * (srr$pe - 1),
         lo = 100 * (srr$lower[, 1] - 1),
         hi = 100 * (srr$upper[, 1] - 1),
         baseline_pe = NA_real_)
) |>
  mutate(quantity = factor(quantity,
                           levels = c("EV (packs/capita)",
                                      "FD (treat − baseline)",
                                      "RR (% change)")))

3.5 The three-panel forecast plot

ggplot(plot_data, aes(h, pe)) +
  geom_ribbon(aes(ymin = lo, ymax = hi), alpha = 0.25, fill = "steelblue") +
  geom_line(linewidth = 1, colour = "steelblue") +
  geom_line(aes(y = baseline_pe), linetype = "dashed", colour = "grey40", linewidth = 0.8) +
  geom_hline(yintercept = 0, linetype = "dotted", colour = "grey60") +
  facet_wrap(~ quantity, scales = "free_y") +
  scale_x_continuous(breaks = seq_len(periods_out),
                     labels = paste0("199", 5 + seq_len(periods_out))) +
  labs(x = NULL, y = NULL) +
  theme_minimal(base_size = 11) +
  theme(strip.text = element_text(size = 11, face = "bold"),
        panel.spacing.x = unit(1, "lines"))
Counterfactual forecast of a 60-cent price increase on cigarette consumption. EV: predicted packs/capita under the increase (solid) and under baseline (dashed). FD: absolute difference (treat - baseline). RR: percent change. Horizon = 1996-1998. Bands are 95% intervals from 1000 parametric draws.

Counterfactual forecast of a 60-cent price increase on cigarette consumption. EV: predicted packs/capita under the increase (solid) and under baseline (dashed). FD: absolute difference (treat - baseline). RR: percent change. Horizon = 1996-1998. Bands are 95% intervals from 1000 parametric draws.

Read the three panels.

  • EV. The blue band is the predicted trajectory under the +60-cent hike; the grey dashed line is the no-change baseline. The vertical gap is exactly what FD plots in the middle panel. Even a year after the shock, predicted consumption is several packs/capita below baseline.
  • FD. The absolute drop in packs/capita, year by year. The effect is largest in the first year (the immediate response to a higher price) and damps slowly as \(\hat\phi\) propagates the past forward.
  • RR. Same effect expressed as a percent. This is the metric a public-health analyst would report; read the third-year percentage directly from the right panel.

Caveats to flag in class.

  1. The shock is large. +60 cents is a 3-SD intervention — a useful demonstration of the model’s dynamics, not a real policy proposal. For policy work, scale the shock to a realistic intervention size.
  2. Average-state forecast. initialY and lagY are averaged across states. A state-specific forecast would let each state start at its own 1995 level; this is the natural extension and is straightforward to code.
  3. Diff-GMM only. The simulation uses only the Diff-GMM coefficients. Comparing Sys-GMM or FE-LDV here would be a natural extension, but this section keeps the simulation focused on one fitted Diff-GMM model.

Appendix A — Self-study: Nickell bias and a hand-built GMM fix

This appendix is optional self-study. It intentionally uses matrix algebra and a few explicit loops because the goal is to show how GMM is constructed under the hood. These are not required coding patterns for the applied sections above; they are included to make the mechanics visible one step at a time.

A.1 The Nickell mechanism

Take the simplest dynamic panel:

\[ y_{it} = \alpha_i + \phi\, y_{i,t-1} + \beta\, x_{it} + \varepsilon_{it}, \qquad |\phi| < 1, \quad \varepsilon_{it} \sim \text{iid}(0, \sigma^2) \]

The within transformation subtracts unit means: \(\tilde y_{it} = y_{it} - \bar y_i\). This sweeps out \(\alpha_i\) — the whole reason FE is used. But it introduces a new problem. The within-transformed lag \(\tilde y_{i,t-1}\) contains \(\bar y_{i,-1}\), which depends on \(\varepsilon_{i,1}, \ldots, \varepsilon_{i,T-1}\). The within-transformed error \(\tilde\varepsilon_{it}\) also contains \(\bar\varepsilon_i\), which depends on \(\varepsilon_{i,1}, \ldots, \varepsilon_{i,T}\). They share terms.

Therefore \(\tilde y_{i,t-1}\) and \(\tilde\varepsilon_{it}\) are correlated, OLS on the within-transformed data is no longer consistent, and Nickell (1981) showed that as \(N \to \infty\) with \(T\) fixed,

\[ \boxed{\;\text{plim}(\hat\phi_{\text{FE}} - \phi) \approx -\,\frac{1+\phi}{T-1}\;} \]

Three consequences worth flagging:

  1. Bias is downward on \(\phi\). FE understates persistence. If your true \(\phi = 0.85\), expect FE to give you something around 0.7 with \(T = 15\). The bias is of order \(1/T\) — small \(T\) is bad.
  2. Bias travels. Because \(\hat\phi\) absorbs less of the persistence, \(\hat\beta\) on contemporaneous regressors picks up some of the slack — typically biased upward. Long-run effects \(\beta/(1-\phi)\) are also biased; the direction depends on which bias dominates.
  3. Adding more units does not fix it. Nickell’s bias is asymptotic in \(N\) with \(T\) fixed — so doubling the number of countries (or firms, or states) leaves the bias exactly where it was. Only adding more time periods helps. When more time periods isn’t an option, change estimator.

This self-study appendix builds that “change estimator” path by hand, on a small simulated panel. The sequence is: (i) generate data with known parameters, (ii) reproduce the Nickell bias on an FE-LDV fit, (iii) difference to kill \(\alpha_i\) and watch the residual endogeneity appear, (iv) construct the Anderson-Hsiao instrumental-variable estimator from raw matrix algebra, and (v) extend to an over-identified one-step Arellano-Bond GMM. The goal is to demystify GMM: it is just a weighted moment-condition system with a closed-form solution.

A.2 A toy DGP

A small panel is simulated from the DGP

\[ y_{it} = \alpha_i + \phi\, y_{i,t-1} + \beta\, x_{it} + \varepsilon_{it} \]

with \(\phi = 0.7\), \(\beta = 0.5\), \(\alpha_i \sim N(0, 1)\), \(\varepsilon_{it} \sim N(0, 1)\), \(x_{it} \sim N(0, 1)\). The design uses \(N = 50\) and \(T = 8\) — small enough for the matrix algebra to stay visible, large enough that estimates are well-defined. A burn-in period lets the series escape its arbitrary initial value before observations are collected.

The true \(\varepsilon_{it}\) is kept alongside the observed data. This is not available in real applications, but it makes the endogeneity in §A.4 visible.

set.seed(20260515)

# DGP parameters
phi_true  <- 0.7    # true persistence
beta_true <- 0.5    # true effect of x on y
N         <- 50     # number of units
T_obs     <- 8      # observed periods per unit
burn      <- 50     # burn-in length

# Unit fixed effects, drawn once
alpha <- rnorm(N, mean = 0, sd = 1)

# Generate one unit at a time and stack
toy_list <- vector("list", N)
for (i in seq_len(N)) {
  Tlong <- T_obs + burn
  x_i   <- rnorm(Tlong, 0, 1)
  eps_i <- rnorm(Tlong, 0, 1)
  y_i   <- numeric(Tlong)
  # initial value from the stationary distribution of an AR(1) with intercept alpha_i
  y_i[1] <- alpha[i] / (1 - phi_true) +
            rnorm(1, sd = 1 / sqrt(1 - phi_true^2))
  for (tt in 2:Tlong) {
    y_i[tt] <- alpha[i] + phi_true * y_i[tt - 1] +
               beta_true * x_i[tt] + eps_i[tt]
  }
  # Pre-compute the slices BEFORE building the tibble — tibble evaluates
  # columns sequentially, so y_lag = y_i[...] inside tibble() would shadow the
  # outer 'y_i' with the just-created 'y' column. Pre-computing avoids that.
  #
  # Lags are constructed from the OBSERVED window only — NA where the lag
  # would reach into pre-window data. This mirrors what pgmm sees on a real
  # panel, so the manual fits in §A.5–§A.6 use the same sample pgmm does.
  y_obs   <- y_i  [(burn + 1):Tlong]
  y_lag1  <- c(NA,             y_obs[1:(T_obs - 1)])
  y_lag2  <- c(NA, NA,         y_obs[1:(T_obs - 2)])
  y_lag3  <- c(NA, NA, NA,     y_obs[1:(T_obs - 3)])
  y_lag4  <- c(NA, NA, NA, NA, y_obs[1:(T_obs - 4)])
  x_obs   <- x_i  [(burn + 1):Tlong]
  eps_obs <- eps_i[(burn + 1):Tlong]
  toy_list[[i]] <- tibble(
    unit   = i,
    time   = seq_len(T_obs),
    y      = y_obs,
    y_lag1 = y_lag1,
    y_lag2 = y_lag2,
    y_lag3 = y_lag3,
    y_lag4 = y_lag4,
    x      = x_obs,
    eps    = eps_obs
  )
}
toy <- bind_rows(toy_list)
glimpse(toy)
## Rows: 400
## Columns: 9
## $ unit   <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, …
## $ time   <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, …
## $ y      <dbl> -4.5417604, -6.1090964, -7.9833272, -6.8805058, -6.9750735, -6.…
## $ y_lag1 <dbl> NA, -4.5417604, -6.1090964, -7.9833272, -6.8805058, -6.9750735,…
## $ y_lag2 <dbl> NA, NA, -4.5417604, -6.1090964, -7.9833272, -6.8805058, -6.9750…
## $ y_lag3 <dbl> NA, NA, NA, -4.5417604, -6.1090964, -7.9833272, -6.8805058, -6.…
## $ y_lag4 <dbl> NA, NA, NA, NA, -4.5417604, -6.1090964, -7.9833272, -6.8805058,…
## $ x      <dbl> 1.38427471, 1.02050701, -0.61394292, -0.03089128, 0.77359679, 0…
## $ eps    <dbl> -0.255258495, -1.821209281, -1.781079872, 0.342177224, -0.92660…

The columns used downstream:

  • y — outcome at time \(t\).
  • y_lag1, y_lag2, y_lag3, y_lag4 — lagged outcomes at \(t-1, t-2, t-3, t-4\). NA for the first \(k\) rows of each unit since those lags would reach pre-window. This is exactly what pgmm() sees internally, so downstream drop_na() steps will yield the same effective sample as pgmm().
  • x — covariate at time \(t\).
  • eps — the true error \(\varepsilon_{it}\) (visible to us, not to a real researcher).

A.3 Step 1 — Reproduce the Nickell bias on an FE-LDV fit

Within-demean every variable inside its unit, then OLS the demeaned outcome on the demeaned lag and demeaned \(x\). No intercept (it’s absorbed by the demeaning). This is exactly what plm(model = "within") does internally.

# Listwise-delete the t = 1 rows where y_lag1 is unobservable (pre-window),
# then within-demean over the remaining 7 rows per unit. This mirrors what
# plm(model = "within") does internally.
toy_dm <- toy |>
  drop_na(y, y_lag1, x) |>
  group_by(unit) |>
  mutate(y_dm     = y      - mean(y),
         y_lag_dm = y_lag1 - mean(y_lag1),
         x_dm     = x      - mean(x)) |>
  ungroup()

# OLS on the within-demeaned variables, no intercept
fit_fe_manual <- lm(y_dm ~ y_lag_dm + x_dm - 1, data = toy_dm)
phi_fe   <- unname(coef(fit_fe_manual)["y_lag_dm"])
beta_fe  <- unname(coef(fit_fe_manual)["x_dm"])

tibble(parameter = c("phi", "beta"),
       true      = c(phi_true, beta_true),
       FE_hat    = c(phi_fe,  beta_fe),
       bias      = FE_hat - true) |>
  knitr::kable(digits = 3,
               caption = "FE-LDV on the toy panel. True phi = 0.7; FE estimate is biased downward by Nickell. True beta = 0.5; the bias is mild in this DGP but you can see it travelling.")
FE-LDV on the toy panel. True phi = 0.7; FE estimate is biased downward by Nickell. True beta = 0.5; the bias is mild in this DGP but you can see it travelling.
parameter true FE_hat bias
phi 0.7 0.511 -0.189
beta 0.5 0.427 -0.073

\(\hat\phi_{FE}\) comes in below 0.7 — the Nickell bias is visible. The next steps build an estimator that closes that gap.

One-realization caveat. Nickell’s formula predicts the expected bias under repeated sampling, \(\text{plim}(\hat\phi_{FE} - \phi) \approx -(1+\phi)/(T-1) \approx -0.243\) here. The number in the bias column above is just one realization around that expectation, scattered by Monte-Carlo noise. Re-run with a different seed and the magnitude moves; the sign and rough scale stay. Read the table as evidence the bias is real, not as a precise estimate of its size.

A.4 Step 2 — Difference to kill \(\alpha_i\) and expose the endogeneity

Take first differences within each unit. The \(\alpha_i\) vanishes (\(\Delta \alpha_i = 0\)), but the differenced lag \(\Delta y_{i,t-1}\) correlates with the differenced error \(\Delta \varepsilon_{it}\) — both contain \(\varepsilon_{i,t-1}\). Let’s see this correlation directly.

# Compute differences within each unit.  drop_na() removes rows whose lags
# would reach into the pre-window — t = 1 (where y_lag1 is NA) and t = 2
# (where y_lag2 is NA).  The remaining sample is t = 3..8 per unit, exactly
# the rows pgmm uses internally for transformation = "d" with lag(y, 2)
# as the instrument.
#
# Force a final unit-then-time row ordering with arrange(unit, time) so
# downstream matrix construction (Appendix A.7) lines up rows correctly.
toy_d <- toy |>
  arrange(unit, time) |>
  group_by(unit) |>
  mutate(dy      = y      - y_lag1,
         dy_lag  = y_lag1 - y_lag2,    # = Δy_{t-1}
         dx      = x      - dplyr::lag(x),
         deps    = eps    - dplyr::lag(eps)) |>
  ungroup() |>
  drop_na(dy, dy_lag, dx, deps) |>
  arrange(unit, time)

# The endogeneity in the differenced model:
# Δy_{t-1} and Δε_{t} should be correlated (both contain ε_{t-1});
# Δx and Δε should not be correlated because x is exogenous in the DGP.
tibble(
  pair = c("Δy_lag vs Δε  (endogeneity in the differenced equation)",
           "Δx     vs Δε  (sanity check — x is exogenous in the DGP)"),
  correlation = c(cor(toy_d$dy_lag, toy_d$deps),
                  cor(toy_d$dx,     toy_d$deps))
) |>
  knitr::kable(digits = 3,
               caption = "Correlations among differenced variables and the differenced error in the simulated data.")
Correlations among differenced variables and the differenced error in the simulated data.
pair correlation
Δy_lag vs Δε (endogeneity in the differenced equation) -0.575
Δx vs Δε (sanity check — x is exogenous in the DGP) -0.043

The first row is sizeable and clearly non-zero — that’s exactly the endogeneity problem an instrumental variable will fix. The second is near zero, confirming that \(\Delta x\) is exogenous to \(\Delta \varepsilon\) in this DGP and does not itself need an instrument.

Why \(y_{i,t-2}\) is the natural instrument. It depends on \(\varepsilon_{i,1}, \ldots, \varepsilon_{i,t-2}\) but neither \(\varepsilon_{i,t-1}\) nor \(\varepsilon_{i,t}\). So \(y_{i,t-2}\) is correlated with \(\Delta y_{i,t-1} = y_{i,t-1} - y_{i,t-2}\) (the relevance condition) but uncorrelated with \(\Delta \varepsilon_{it}\) under the assumption that \(\varepsilon\) is iid (the exclusion condition). In the simulated data, both can be checked directly:

tibble(
  condition = c("Relevance: cor(y_lag2, Δy_lag) should be far from 0",
                "Exclusion: cor(y_lag2, Δε)     should be near 0"),
  correlation = c(cor(toy_d$y_lag2, toy_d$dy_lag),
                  cor(toy_d$y_lag2, toy_d$deps))
) |>
  knitr::kable(digits = 3,
               caption = "Relevance and exclusion checks for the lag-2 instrument.")
Relevance and exclusion checks for the lag-2 instrument.
condition correlation
Relevance: cor(y_lag2, Δy_lag) should be far from 0 -0.134
Exclusion: cor(y_lag2, Δε) should be near 0 -0.009

Both conditions hold in the simulated data, by construction.

A.5 Step 3 — Anderson-Hsiao IV by hand

There is one endogenous regressor (\(\Delta y_{i,t-1}\)) and one valid instrument (\(y_{i,t-2}\)). \(\Delta x\) instruments itself. The system is just-identified.

The closed-form IV (= 2SLS = GMM with any weighting matrix) estimator is

\[ \hat \theta_{IV} \;=\; (Z' X)^{-1} Z' \Delta y \]

where \(\Delta y\) is the stacked outcome vector, \(X\) is the regressor matrix with columns \([\Delta y_{i,t-1}, \Delta x_{it}]\), and \(Z\) is the instrument matrix with columns \([y_{i,t-2}, \Delta x_{it}]\). Let’s build each piece explicitly.

The data have 300 rows (50 units × 6 periods after dropping \(t=1, 2\) where the lags reach pre-window), 2 columns in \(X\), and 2 columns in \(Z\) — a 2×2 system to solve.

# Stack the data as plain vectors / matrices.  drop_na in the previous chunk
# already kept only rows where every needed lag is observed (t >= 2 here).
dy_vec  <- toy_d$dy                       # outcome vector
X_mat   <- cbind(dy_lag = toy_d$dy_lag,   # endogenous regressor
                 dx     = toy_d$dx)        # exogenous regressor
Z_mat   <- cbind(y_lag2 = toy_d$y_lag2,   # instrument for dy_lag
                 dx     = toy_d$dx)        # dx instruments itself

# Build Z'X (a 2x2 matrix) and Z'dy (a 2-vector)
ZpX  <- crossprod(Z_mat, X_mat)
Zpdy <- crossprod(Z_mat, dy_vec)

# Closed-form IV estimator
theta_AH <- solve(ZpX, Zpdy)
rownames(theta_AH) <- c("phi (dy_lag)", "beta (dx)")

The two small matrices just built:

knitr::kable(ZpX,  digits = 2, caption = "Z'X")
Z’X
dy_lag dx
y_lag2 -214.58 26.73
dx -151.49 660.48
knitr::kable(Zpdy, digits = 2, caption = "Z'Δy")
Z’Δy
y_lag2 -156.21
dx 197.32

And the resulting IV estimates:

knitr::kable(theta_AH, digits = 3,
             caption = "Anderson-Hsiao IV estimates from the closed-form solve.")
Anderson-Hsiao IV estimates from the closed-form solve.
phi (dy_lag) 0.788
beta (dx) 0.479

Read off \(\hat\phi_{AH}\) — it should be close to the true value 0.7. The Nickell bias has been corrected by replacing the endogenous \(\Delta y_{i,t-1}\) with its instrumented counterpart.

Comparison: FE-LDV vs Anderson-Hsiao on the same panel.

tibble(parameter = c("phi", "beta"),
       true      = c(phi_true,        beta_true),
       FE        = c(phi_fe,          beta_fe),
       AH        = c(theta_AH[1, 1],  theta_AH[2, 1])) |>
  knitr::kable(digits = 3,
               caption = "FE-LDV is biased; Anderson-Hsiao IV recovers the truth (up to sampling error).")
FE-LDV is biased; Anderson-Hsiao IV recovers the truth (up to sampling error).
parameter true FE AH
phi 0.7 0.511 0.788
beta 0.5 0.427 0.479

A.6 Step 4 — Add more lags → one-step GMM

What if two lags are used as instruments — say \(y_{i,t-2}\) and \(y_{i,t-3}\)? Each is exogenous under the iid-\(\varepsilon\) assumption, and each is correlated with \(\Delta y_{i,t-1}\). Adding the second moment makes the system over-identified (more moments than parameters), and a weighting matrix is needed to combine them.

The (one-step) GMM estimator with weighting matrix \(W\) is

\[ \hat\theta_{GMM} \;=\; \big( X' Z\, W\, Z' X \big)^{-1} \, X' Z\, W\, Z' \Delta y \]

For one-step GMM, the standard choice is \(W = (Z'Z)^{-1}\). (Two-step would use the residual-based optimal \(W\) — see §2.6. This appendix stops at one-step to keep the algebra clean.)

To use lag-3 as an instrument, \(y_{i,t-3}\) must be observable, which means restricting the equation to \(t \ge 3\) — one period of usable rows is lost.

Heads-up on the verification step. The literal AH estimator in §A.5 uses one stacked lag-2 instrument column for the lagged outcome. pgmm(..., | lag(y, 2)) uses the same basic lag-2 logic, but it period-stacks the instrument into separate moments. The estimates should be close, but they need not match exactly in finite samples. The over-identified GMM below (\(y_{i,t-2}\) and \(y_{i,t-3}\) both as instruments) can also differ slightly from pgmm(..., | lag(y, 2:3)) for the same reason: same sample, different moment structure.

The instrument matrix now has 3 columns (two lags of \(y\) plus the exogenous \(\Delta x\)) against only 2 parameters in \(X\) — the system is over-identified by one moment. Because \(y_{i,t-3}\) must come from the observed window, the \(t=3\) row is also dropped, leaving 250 rows (5 periods \(\times\) 50 units).

# Adding y_lag3 as an instrument means y_lag3 must be observed, which
# requires t >= 4 (the t = 3 row in toy_d would have y_lag3 reaching pre-
# window data).  Drop those rows so every instrument in Z is genuinely
# available from the observed sample — same logic as listwise deletion.
toy_g <- toy_d |> drop_na(y_lag3)

dy_vec <- toy_g$dy
X_mat  <- cbind(dy_lag = toy_g$dy_lag,
                dx     = toy_g$dx)
# Wider instrument matrix: two lags of y plus the exogenous dx
Z_mat  <- cbind(y_lag2 = toy_g$y_lag2,
                y_lag3 = toy_g$y_lag3,
                dx     = toy_g$dx)

# One-step weighting matrix W = (Z'Z)^{-1}
ZpZ   <- crossprod(Z_mat)             # 3x3
W_one <- solve(ZpZ)

# Z'X and Z'dy
ZpX  <- crossprod(Z_mat, X_mat)        # 3x2
Zpdy <- crossprod(Z_mat, dy_vec)       # 3x1

# GMM normal equation: θ = (X'Z W Z'X)^{-1} X'Z W Z'Δy
XpZ_W   <- t(ZpX) %*% W_one
A_mat   <- XpZ_W %*% ZpX               # 2x2 — the bread
b_vec   <- XpZ_W %*% Zpdy              # 2x1 — the meat

theta_GMM <- solve(A_mat, b_vec)
rownames(theta_GMM) <- c("phi (dy_lag)", "beta (dx)")

The one-step weighting matrix:

knitr::kable(W_one, digits = 4,
             caption = "W = (Z'Z)^{-1}, the one-step GMM weighting matrix (3 × 3).")
W = (Z’Z)^{-1}, the one-step GMM weighting matrix (3 × 3).
y_lag2 y_lag3 dx
y_lag2 0.0027 -0.0026 0.0000
y_lag3 -0.0026 0.0027 0.0000
dx 0.0000 0.0000 0.0018

The GMM estimates:

knitr::kable(theta_GMM, digits = 3,
             caption = "One-step Arellano-Bond GMM estimates from the closed-form solve.")
One-step Arellano-Bond GMM estimates from the closed-form solve.
phi (dy_lag) 0.563
beta (dx) 0.444

A side-by-side of the three estimators on the same data:

tibble(parameter = c("phi", "beta"),
       true      = c(phi_true,         beta_true),
       FE        = c(phi_fe,           beta_fe),
       AH        = c(theta_AH[1, 1],   theta_AH[2, 1]),
       GMM       = c(theta_GMM[1, 1],  theta_GMM[2, 1])) |>
  knitr::kable(digits = 3,
               caption = "FE biased, literal AH unbiased (just-identified), GMM unbiased and more efficient (over-identified). With only one extra moment the GMM gain is small, but the principle generalizes to the larger moment sets used in pgmm.")
FE biased, literal AH unbiased (just-identified), GMM unbiased and more efficient (over-identified). With only one extra moment the GMM gain is small, but the principle generalizes to the larger moment sets used in pgmm.
parameter true FE AH GMM
phi 0.7 0.511 0.788 0.563
beta 0.5 0.427 0.479 0.444

The GMM estimate sits very close to the AH estimate — that is the expected behavior with only one extra moment. The conceptual payoff is the machinery: construct the moment conditions \(Z' (\Delta y - X \theta) = 0\), pick a weighting matrix, and solve the over-identified system in one closed-form line. The estimators in the main lab are this same template scaled up.

A.7 Verify against pgmm()

This final verification is the most technical part of the appendix. It is mainly for readers who want to understand why a literal hand-built Anderson-Hsiao estimator does not exactly match the object returned by pgmm(). The difference is not a coding error; it comes from the way pgmm() period-stacks the instrument moments.

Run the AH-style pgmm fit on the same sample and put the coefficients side by side with the manual result:

pdat_toy <- pdata.frame(toy, index = c("unit", "time"))
ah_pgmm <- pgmm(y ~ lag(y, 1) + x | lag(y, 2),
                data = pdat_toy,
                effect = "individual",
                model  = "onestep",
                transformation = "d")

tibble(parameter           = c("phi", "beta"),
       manual_literal_AH   = c(theta_AH[1, 1], theta_AH[2, 1]),
       pgmm_AH             = c(unname(coef(ah_pgmm)["lag(y, 1)"]),
                               unname(coef(ah_pgmm)["x"]))) |>
  knitr::kable(digits = 6,
               caption = "Literal Anderson-Hsiao (one stacked moment for y_{t-2}) vs pgmm with the same instrument specification.")
Literal Anderson-Hsiao (one stacked moment for y_{t-2}) vs pgmm with the same instrument specification.
parameter manual_literal_AH pgmm_AH
phi 0.787692 0.675650
beta 0.479428 0.456828

The numbers are close but do not match exactly. The reason is not the sample — both fits use the same 300 rows. It is the moment structure. Look at the instrument matrix pgmm builds for the first unit:

# pgmm's per-unit instrument block: rows are periods (t = 3..8),
# columns are: 6 period-specific copies of y_{t-2}, plus 1 column for x.
ah_pgmm$W[[1]] |>
  round(3) |>
  as.data.frame() |>
  knitr::kable(caption = "pgmm's instrument matrix for the first unit (6 × 7). The block-diagonal structure of the first six columns reveals that pgmm splits the lag-2 instrument into one moment **per period**, not a single stacked moment.")
pgmm’s instrument matrix for the first unit (6 × 7). The block-diagonal structure of the first six columns reveals that pgmm splits the lag-2 instrument into one moment per period, not a single stacked moment.
V1 V2 V3 V4 V5 V6 x
3 -4.542 0.000 0.000 0.000 0.000 0.000 -1.634
4 0.000 -6.109 0.000 0.000 0.000 0.000 0.583
5 0.000 0.000 -7.983 0.000 0.000 0.000 0.804
6 0.000 0.000 0.000 -6.881 0.000 0.000 0.001
7 0.000 0.000 0.000 0.000 -6.975 0.000 -0.859
8 0.000 0.000 0.000 0.000 0.000 -6.133 -0.364

What literal AH did with one y-instrument moment, pgmm does with six — one per period \(t = 3, \ldots, 8\). Each column of the block-diagonal section is “\(y_{i,t-2}\) at period \(t\), zero elsewhere”. So the AH-style pgmm spec is over-identified internally, with \(T - 2 = 6\) moments for y plus 1 moment for x. Different moment system, slightly different estimates in finite samples.

To replicate pgmm exactly by hand, two things are needed:

  1. Build the period-stacked block-diagonal \(Z\) (one column per period × instrument pair).
  2. Use Arellano-Bond’s one-step weighting matrix \(W = \big(\sum_i Z_i' H Z_i\big)^{-1}\), where \(H\) is the \((T{-}2) \times (T{-}2)\) tridiagonal matrix with \(2\) on the diagonal and \(-1\) on the off-diagonals. This \(H\) accounts for the MA(1) structure of \(\Delta\varepsilon\) implied by iid \(\varepsilon\) in levels.
# Rebuild X and y from toy_d (the AH-sample, 300 rows) — §A.6 overwrote
# X_mat and dy_vec with the smaller §A.6 GMM sample (toy_g, 250 rows),
# so those objects cannot be reused here.
dy_vec_AH <- toy_d$dy
X_mat_AH  <- cbind(dy_lag = toy_d$dy_lag, dx = toy_d$dx)

unit_ids    <- sort(unique(toy_d$unit))
n_per       <- length(unique(toy_d$time))   # 6 differenced periods (t = 3..8)
period_vals <- sort(unique(toy_d$time))

# Build per-unit Z blocks.
# Rows are periods kept for unit i.
# Columns are the 6 period-specific y instruments plus 1 column for dx.
Z_blocks <- vector("list", length(unit_ids))

for (j in seq_along(unit_ids)) {
  # Work one unit at a time.
  d_i <- toy_d |>
    filter(unit == unit_ids[j]) |>
    arrange(time)

  # Start with zeros, then place y_{t-2} in the column for its period.
  Z_i <- matrix(0, nrow = nrow(d_i), ncol = n_per)

  for (k in seq_len(nrow(d_i))) {
    p_idx <- match(d_i$time[k], period_vals)
    Z_i[k, p_idx] <- d_i$y_lag2[k]
  }

  # Add dx as the final ordinary instrument column.
  Z_blocks[[j]] <- cbind(Z_i, dx = d_i$dx)
}

# Arellano-Bond H matrix: tridiagonal with 2 on the diagonal, -1 off
H_AB <- matrix(0, n_per, n_per)
diag(H_AB) <- 2
for (k in seq_len(n_per - 1)) {
  H_AB[k, k + 1] <- -1
  H_AB[k + 1, k] <- -1
}

# One-step W = (sum_i Z_i' H Z_i)^{-1}.
# Build the sum one unit at a time so the matrix formula is visible.
ZHZ <- matrix(0, nrow = ncol(Z_blocks[[1]]), ncol = ncol(Z_blocks[[1]]))
for (j in seq_along(Z_blocks)) {
  Z_i <- Z_blocks[[j]]
  ZHZ <- ZHZ + t(Z_i) %*% H_AB %*% Z_i
}
W_AB <- solve(ZHZ)

# Stack Z across units.
Z_all <- Z_blocks[[1]]
for (j in 2:length(Z_blocks)) {
  Z_all <- rbind(Z_all, Z_blocks[[j]])
}

# Solve the GMM normal equation.
ZtX <- crossprod(Z_all, X_mat_AH)
Zty <- crossprod(Z_all, dy_vec_AH)
theta_AH_pgmm_style <- solve(t(ZtX) %*% W_AB %*% ZtX,
                             t(ZtX) %*% W_AB %*% Zty)
rownames(theta_AH_pgmm_style) <- c("phi (dy_lag)", "beta (dx)")

Side by side with pgmm:

tibble(parameter             = c("phi", "beta"),
       manual_pgmm_style_AH  = c(theta_AH_pgmm_style[1, 1],
                                 theta_AH_pgmm_style[2, 1]),
       pgmm_AH               = c(unname(coef(ah_pgmm)["lag(y, 1)"]),
                                 unname(coef(ah_pgmm)["x"]))) |>
  knitr::kable(digits = 10,
               caption = "Manual GMM-style AH (period-stacked Z + Arellano-Bond H weighting) versus pgmm. The two now agree to many decimal places.")
Manual GMM-style AH (period-stacked Z + Arellano-Bond H weighting) versus pgmm. The two now agree to many decimal places.
parameter manual_pgmm_style_AH pgmm_AH
phi 0.6756498 0.6756498
beta 0.4568281 0.4568281

Exact match. Three pedagogical takeaways:

  1. There is no single “Anderson-Hsiao estimator”. The original 1981 proposal used a single stacked moment for \(y_{t-2}\) (the “literal AH” in §A.5). Modern software — pgmm, Stata’s xtabond/xtabond2 — implements a richer period-stacked version with the Arellano-Bond H-weighting. Both are consistent; they give slightly different point estimates in finite samples.
  2. The weighting matrix matters. Even with the same instruments, choosing \(W = (Z'Z)^{-1}\) vs \(W = (\sum_i Z_i' H Z_i)^{-1}\) produces different one-step estimators. Two-step GMM (§2.6) iterates from any sensible one-step starting point, but the one-step result depends on \(W\).
  3. The matrix algebra scales linearly. Adding more instrument lags or more endogenous regressors just makes \(Z\) and \(X\) wider; the closed-form normal equation is the same. Part 2 is pgmm packaging exactly this structure.

References