Disclaimer

These materials have been revised and updated by successive teaching assistants over recent years.

Revision history:

Packages and Functions Used in This Lab

  • forecast (ARIMA estimation, forecasting): Arima(), auto.arima(), forecast(), checkresiduals()
  • tseries (unit-root tests): adf.test(), kpss.test(), pp.test()
  • MASS (multivariate normal simulation): mvrnorm()
  • lmtest (coefficient testing; autocorrelation tests on OLS residuals): coeftest(), bgtest(), dwtest()
  • sandwich (HAC and other robust covariance matrices): NeweyWest(), vcovHAC()
  • nlme (GLS with structured error correlations): gls(), corAR1(), corARMA()
  • patchwork (multi-panel figures): the +, |, and / operators
  • tidyverse (data manipulation and plotting): ggplot(), tibble(), the pipe (|>)

Part 1: Diagnosing Non-Stationarity (Properly)

In Labs 2 and 3, we focused on stationary processes — series with constant mean, constant variance, and autocovariance depending only on the lag distance. The ARMA toolkit, ACF/PACF identification, AIC/BIC selection, and out-of-sample cross-validation all rely on stationarity.

A natural temptation is to treat “non-stationarity” as a single problem with a single fix: difference the series and re-apply the ARMA toolkit. That is sometimes correct, but it is also one of the most common ways applied analyses go wrong.

A series that looks non-stationary may be stationary in disguise

For example, stationary around a deterministic trend, or stationary except for one or two structural breaks. Differencing such a series is over-differencing: it introduces a non-invertible MA root, biases long-run and level-shift effects toward zero, and often produces ARIMA models that fit the wrong dynamics.

This lab is built around the discipline of diagnosing before differencing. Part 1 lays out how to do that — what the three main flavors of non-stationarity look like, what unit-root tests actually test (and where they fail), and what to do when test verdicts disagree with visual or substantive evidence. Parts 2–6 then put these diagnostics to work on real data, including a worked example (the Bush approval series) where the standard tests give the wrong answer.

1.1 Stationarity

A process \(\{y_t\}\) is (weakly) stationary if all three of the following hold:

  1. Constant mean: \(\mathbb{E}[y_t] = \mu\) for all \(t\),
  2. Constant variance: \(\text{Var}(y_t) = \sigma^2\) for all \(t\),
  3. Time-invariant autocovariance: \(\text{Cov}(y_t, y_{t-k}) = \gamma_k\) depends only on the lag \(k\), not on \(t\).

Under weak stationarity, the ACF is a property of the process, not the sample, and ACF/PACF can be read for AR/MA orders as in Lab 2. The unconditional mean exists and forecasts revert to it. None of these properties survive when the mean drifts, the variance grows, or the covariance structure changes over time.

Weak stationarity requires the first two moments (mean, variance, autocovariances) to be time-invariant. Strict stationarity is stronger: it requires the entire joint distribution to be time-invariant, so all moments that exist are also time-invariant, but the requirement is on the distribution itself, not just its moments. For Gaussian processes the two coincide (a Gaussian distribution is pinned down by its first two moments), which is why “weakly stationary Gaussian = strictly stationary” gets used as a shortcut.

Each condition rules out a specific way the data-generating process could be drifting through time, and each violation has a recognizable visual signature. We use two rolling-window diagnostics throughout Part 1: the rolling mean (centered moving average) tracks Condition 1, and the rolling standard deviation tracks Condition 2. Condition 3 needs a richer tool — a rolling lag-1 autocorrelation. The three subsections below show what each condition looks like when it holds and when it fails.
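The rolling helpers roll_mean() and roll_sd() used in the code below are assumed to be defined earlier in the lab; if they are not, a minimal centered-window version (width 30, matching the figure subtitles) could look like this:

```r
# Hypothetical helpers (an assumption, not code from the original lab):
# centered rolling statistics with NA padding at the edges, so the output is
# the same length as the input and drops straight into a tibble alongside
# the raw series.
roll_stat <- function(x, w = 30, fun = mean) {
  n <- length(x); out <- rep(NA_real_, n); h <- w %/% 2
  for (i in (h + 1):(n - h)) out[i] <- fun(x[(i - h):(i + h - 1)])
  out
}
roll_mean <- function(x, w = 30) roll_stat(x, w, mean)
roll_sd   <- function(x, w = 30) roll_stat(x, w, sd)
```

The NA padding means the first and last w/2 points of each rolling series are simply not drawn, which is the behavior you want for a centered window.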

1.1.1 Constant mean

A stationary AR(1) oscillates around a fixed point with no systematic drift. A violation looks like a series whose level moves through time via a deterministic trend, a random walk with drift, or a step shift. The cleanest visual diagnostic is the rolling mean: under stationarity, it hovers; under violation, it traces out the time-varying mean function.

set.seed(101)
T_n   <- 300
t_idx <- seq_len(T_n)

y1_stat <- as.numeric(arima.sim(n = T_n, model = list(ar = 0.5), sd = 1))
y1_viol <- 0.04 * t_idx +
           as.numeric(arima.sim(n = T_n, model = list(ar = 0.5), sd = 1))

df1 <- tibble(t = t_idx,
              stat       = y1_stat,
              viol       = y1_viol,
              roll_stat  = roll_mean(y1_stat),
              roll_viol  = roll_mean(y1_viol))

p1a <- ggplot(df1, aes(t)) +
  geom_line(aes(y = stat), color = "grey60") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black", alpha = 0.5) +
  geom_line(aes(y = roll_stat), color = "steelblue", linewidth = 1) +
  labs(title = "Stationary mean",
       subtitle = "AR(1); blue = 30-period rolling mean",
       y = expression(y[t]), x = "t") +
  theme_minimal(base_size = 11)

p1b <- ggplot(df1, aes(t)) +
  geom_line(aes(y = viol), color = "grey60") +
  geom_line(aes(y = roll_viol), color = "firebrick", linewidth = 1) +
  labs(title = "Drifting mean",
       subtitle = "y_t = 0.04*t + AR(1); rolling mean trends with t",
       y = expression(y[t]), x = "t") +
  theme_minimal(base_size = 11)

p1a + p1b

The dashed reference at zero on the left makes it clear that the rolling mean stays put. On the right, the rolling mean climbs steadily — the unconditional mean is no longer a single number but a function of \(t\). Condition 1 fails, and any technique that assumes a single \(\mu\) (the AR equilibrium, mean-reversion forecasts, the standard ACF interpretation) becomes unreliable.

1.1.2 Constant variance

Stationarity also requires the variance of \(y_t\) to be the same everywhere. A violation looks like a series whose amplitude grows or shrinks over time. The diagnostic is the rolling standard deviation: a constant-width band signals stationarity; an expanding band signals heteroskedasticity. Random walks, ARCH/GARCH-type processes, and series with regime-dependent volatility all violate this condition, but for very different reasons.

set.seed(102)

y2_stat <- as.numeric(arima.sim(n = T_n, model = list(ar = 0.5), sd = 1))

# Violation: AR(1) with innovation SD growing linearly with t
sigma_t <- 1 + 0.015 * t_idx
y2_viol <- numeric(T_n)
y2_viol[1] <- rnorm(1, sd = sigma_t[1])
for (i in 2:T_n) y2_viol[i] <- 0.5 * y2_viol[i-1] + rnorm(1, sd = sigma_t[i])

df2 <- tibble(t = t_idx,
              stat    = y2_stat,
              viol    = y2_viol,
              sd_stat = roll_sd(y2_stat),
              sd_viol = roll_sd(y2_viol))

p2a <- ggplot(df2, aes(t)) +
  geom_ribbon(aes(ymin = -sd_stat, ymax = sd_stat),
              fill = "steelblue", alpha = 0.25) +
  geom_line(aes(y = stat), color = "grey50") +
  labs(title = "Constant variance",
       subtitle = "AR(1) with sigma = 1",
       y = expression(y[t]), x = "t") +
  theme_minimal(base_size = 11)

p2b <- ggplot(df2, aes(t)) +
  geom_ribbon(aes(ymin = -sd_viol, ymax = sd_viol),
              fill = "firebrick", alpha = 0.25) +
  geom_line(aes(y = viol), color = "grey50") +
  labs(title = "Growing variance",
       subtitle = "AR(1) with sigma_t = 1 + 0.015*t",
       y = expression(y[t]), x = "t") +
  theme_minimal(base_size = 11)

p2a + p2b

The left ribbon stays flat around \(\pm 1\); the right ribbon fans out from about \(\pm 1\) to about \(\pm 6\). When variance is non-constant, ordinary inference with standard errors is misleading and forecast intervals understate uncertainty in the high-variance regime. (Note: this condition is about the unconditional variance. Even GARCH processes — where the conditional variance moves around — can be unconditionally stationary if the variance equation itself has a stationary solution.)

1.1.3 Time-invariant autocovariance

The third condition is the one most people forget: stationarity is not just about levels and variances, but about the entire dependence structure being stable over time. The autocovariance

\[\gamma_k = \text{Cov}(y_t, y_{t-k})\]

must depend only on the lag \(k\), never on the time index \(t\). The most direct diagnostic is to compute autocorrelation in a rolling window and watch whether it changes with \(t\). Under stationarity the rolling autocorrelation should be flat; under violation it traces out a time-varying dependence structure.

set.seed(103)

# Stationary AR(1): same phi throughout
y3_stat <- as.numeric(arima.sim(n = T_n, model = list(ar = 0.5), sd = 1))

# Violation: phi shifts halfway from 0.2 to 0.85
half <- T_n / 2
y3_first  <- as.numeric(arima.sim(n = half, model = list(ar = 0.2), sd = 1))
y3_second <- numeric(half)
y3_second[1] <- 0.85 * y3_first[half] + rnorm(1)
for (i in 2:half) y3_second[i] <- 0.85 * y3_second[i-1] + rnorm(1)
y3_viol <- c(y3_first, y3_second)

# Rolling lag-1 autocorrelation (window centered on t)
roll_ac1 <- function(x, w = 60) {
  n <- length(x); out <- rep(NA_real_, n); h <- w %/% 2
  for (i in (h + 1):(n - h)) {
    sub <- x[(i - h):(i + h - 1)]
    out[i] <- cor(sub[-1], sub[-length(sub)])
  }
  out
}

df3 <- tibble(
  t        = t_idx,
  y_stat   = y3_stat,
  y_viol   = y3_viol,
  ac1_stat = roll_ac1(y3_stat, w = 60),
  ac1_viol = roll_ac1(y3_viol, w = 60)
)

# Top row: time series themselves
ts_a <- ggplot(df3, aes(t, y_stat)) +
  geom_line(color = "grey55") +
  labs(title = "Stationary (constant phi=0.5)",
       subtitle = "Time series",
       y = expression(y[t]), x = NULL) +
  theme_minimal(base_size = 11)

ts_b <- ggplot(df3, aes(t, y_viol)) +
  geom_line(color = "grey55") +
  geom_vline(xintercept = half, linetype = "dashed", color = "firebrick") +
  labs(title = "Shifting (phi: 0.2 -> 0.85 at t=150)",
       subtitle = "Time series",
       y = expression(y[t]), x = NULL) +
  theme_minimal(base_size = 11)

# Bottom row: rolling lag-1 autocorrelation over time
ac_a <- ggplot(df3, aes(t, ac1_stat)) +
  geom_hline(yintercept = 0.5, linetype = "dashed",
             color = "black", alpha = 0.5) +
  geom_line(color = "steelblue", linewidth = 1) +
  coord_cartesian(ylim = c(-0.2, 1)) +
  labs(subtitle = "Rolling lag-1 ACF (window = 60); dashed line = true rho_1",
       y = expression(rho[1]), x = "t") +
  theme_minimal(base_size = 11)

ac_b <- ggplot(df3, aes(t, ac1_viol)) +
  geom_hline(yintercept = c(0.2, 0.85), linetype = "dashed",
             color = "black", alpha = 0.5) +
  geom_vline(xintercept = half, linetype = "dashed", color = "firebrick") +
  geom_line(color = "firebrick", linewidth = 1) +
  coord_cartesian(ylim = c(-0.2, 1)) +
  labs(subtitle = "Rolling lag-1 ACF (window = 60); dashed lines = true rho_1 in each regime",
       y = expression(rho[1]), x = "t") +
  theme_minimal(base_size = 11)

(ts_a | ts_b) / (ac_a | ac_b)

Read the figure top to bottom. The time series in the top row look broadly comparable in level and amplitude — there is no obvious mean shift or variance change at \(t = 150\). But the texture differs: the right-hand series is jagged in the first half (fast mean reversion, \(\phi = 0.2\)) and visibly smoother in the second half (slow mean reversion, \(\phi = 0.85\)). The rolling lag-1 autocorrelation in the bottom row makes the difference precise. On the left it hovers around its true value \(\rho_1 = 0.5\) for the entire sample. On the right it sits near \(0.2\) for the first 150 periods, then climbs to roughly \(0.85\) — the autocorrelation function is itself a function of \(t\).

Why the third condition matters. Mean and variance are easy to spot. The autocovariance shift is the silent killer: a series can have a flat rolling mean and a flat rolling SD but still violate Condition 3 if its dependence structure shifts (regime changes, parameter drift, structural breaks in the dynamics). The standard ACF/PACF interpretation, lag selection, and forecast intervals all assume the autocovariance structure is the same on the sample we use to fit the model and on the sample we use to forecast. When that fails, none of those tools is trustworthy — even if the level and amplitude look stable.

1.2 Three flavors of non-stationarity

Most non-stationary series we encounter in applied work fall into one of three categories. They look superficially similar — all three drift, wander, or shift over time — but they require very different modeling strategies.

(a) Trend stationary (TS). The series has a deterministic mean function: \[y_t = \alpha + \gamma t + u_t, \quad u_t \text{ stationary}.\] The mean is non-constant, but it is a known function of time. Subtract it (or include \(t\) as a regressor) and the residuals are stationary. Cure: detrend.

(b) Difference stationary (DS), a.k.a. unit root. Each shock is permanent: \[y_t = y_{t-1} + \varepsilon_t \quad \text{(simplest case: random walk)}.\] The first difference \(\Delta y_t = \varepsilon_t\) is stationary, but the level has no fixed mean — past shocks accumulate forever. Cure: difference. This is the case that ARIMA(\(p,1,q\)) models target.

(c) Stationary with structural breaks (SSB). The data-generating process is otherwise stationary, but the parameters shift at one or more known dates: \[y_t = \alpha_1 \mathbf{1}\{t < T^*\} + \alpha_2 \mathbf{1}\{t \geq T^*\} + u_t, \quad u_t \text{ stationary}.\] Within each regime the series is stationary; the apparent non-stationarity is cross-regime, driven by the level shift at \(T^*\). Cure: model the breaks (intervention dummies) and let \(u_t\) be ARMA.

The three look alike in raw plots but require fundamentally different cures:

Flavor | DGP | Right cure | Wrong cure \(\Rightarrow\)
TS (trend + stationary noise) | \(y_t = \gamma t + u_t\) | Detrend or include \(t\) as regressor | Differencing \(\Rightarrow\) MA unit root, lost trend information
DS (unit root) | \(y_t = y_{t-1} + \varepsilon_t\) | Difference once | Detrending \(\Rightarrow\) residuals still I(1), spurious significance
SSB (breaks) | \(y_t = \alpha_g + u_t\), \(g\) depends on \(t\) | Model the breaks (dummies) + ARMA on residuals | Differencing \(\Rightarrow\) over-differencing; residuals look like MA(1) with \(\theta \approx -1\)

The three subsections below work each flavor end-to-end: simulate the violation, plot the rolling-mean diagnostic, apply the cure, and confirm with a residual-autocorrelation test that the non-stationarity has been removed. Test choice depends on the cure:

  • For pure transformations (e.g., diff(y)), no parameters are estimated, and Box.test(..., type = "Ljung-Box") applies cleanly with the standard \(\chi^2_h\) reference.
  • For OLS-based cures (e.g., detrending or break dummies), the right tool is lmtest::bgtest() (Breusch-Godfrey LM test). It accounts for estimated regressors and remains valid in the presence of lagged dependent variables, where Ljung-Box is biased toward not rejecting.

A subtlety to flag up front: the cure addresses non-stationarity, not serial dependence. After a successful cure, the residuals are stationary, but they may still have ARMA structure. A test that rejects “no autocorrelation” therefore does not mean the cure failed — it means the next modeling step is to fit an ARMA model on the cured series.

1.2.1 (a) Trend stationary

DGP: \(y_t = 0.05\, t + u_t\) with \(u_t\) a stationary AR(1), \(\phi = 0.15\).

set.seed(104)
T_n   <- 200
t_idx <- seq_len(T_n)

# Simulate trend + AR(1) noise
y_TS <- 0.05 * t_idx +
        as.numeric(arima.sim(n = T_n, model = list(ar = 0.15), sd = 1))

# Cure: detrend by regressing on t and keeping the residuals
fit_TS <- lm(y_TS ~ t_idx)
res_TS <- as.numeric(residuals(fit_TS))

ts_df <- tibble(t = t_idx, raw = y_TS, cured = res_TS,
                m_raw   = roll_mean(y_TS),
                m_cured = roll_mean(res_TS))

p_TS_raw <- ggplot(ts_df, aes(t)) +
  geom_line(aes(y = raw), color = "grey60") +
  geom_line(aes(y = m_raw), color = "firebrick", linewidth = 1) +
  labs(title = "Raw: trend + AR(1)",
       subtitle = "rolling mean climbs with t",
       y = expression(y[t]), x = "t") +
  theme_minimal(base_size = 11)

p_TS_cured <- ggplot(ts_df, aes(t)) +
  geom_line(aes(y = cured), color = "grey60") +
  geom_line(aes(y = m_cured), color = "steelblue", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
  labs(title = "After cure: residuals from lm(y ~ t)",
       subtitle = "rolling mean flat at zero",
       y = "residuals", x = "t") +
  theme_minimal(base_size = 11)

p_TS_raw + p_TS_cured

lmtest::bgtest(fit_TS, order = 4)
## 
##  Breusch-Godfrey test for serial correlation of order up to 4
## 
## data:  fit_TS
## LM test = 2.8672, df = 4, p-value = 0.5803

The trend is gone — the rolling mean of the residuals sits at zero with constant amplitude. Breusch-Godfrey does not reject the null of no residual autocorrelation here (p ≈ 0.58): with \(\phi = 0.15\), the AR structure that detrending leaves behind is weak enough that the test has little power to detect it in this sample. The residuals are nonetheless stationary AR(1) by construction, and that dependence would be modeled with Arima(res_TS, order = c(1, 0, 0)) next. The cure removes the non-stationarity, not the serial dependence.

1.2.2 (b) Difference stationary (random walk)

DGP: \(y_t = y_{t-1} + \varepsilon_t\).

set.seed(105)

# Simulate random walk
y_DS  <- cumsum(rnorm(T_n))

# Cure: take first differences
dy_DS <- diff(y_DS)

ds_raw_df  <- tibble(t = t_idx,
                     raw = y_DS,
                     m_raw = roll_mean(y_DS))
ds_diff_df <- tibble(t = t_idx[-1],
                     cured = as.numeric(dy_DS),
                     m_cured = roll_mean(as.numeric(dy_DS)))

p_DS_raw <- ggplot(ds_raw_df, aes(t)) +
  geom_line(aes(y = raw), color = "grey60") +
  geom_line(aes(y = m_raw), color = "firebrick", linewidth = 1) +
  labs(title = "Raw: random walk",
       subtitle = "rolling mean wanders with no anchor",
       y = expression(y[t]), x = "t") +
  theme_minimal(base_size = 11)

p_DS_cured <- ggplot(ds_diff_df, aes(t)) +
  geom_line(aes(y = cured), color = "grey60") +
  geom_line(aes(y = m_cured), color = "steelblue", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
  labs(title = "After cure: diff(y)",
       subtitle = "rolling mean flat at zero",
       y = expression(Delta * y[t]), x = "t") +
  theme_minimal(base_size = 11)

p_DS_raw + p_DS_cured

Box.test(dy_DS, lag = 10, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  dy_DS
## X-squared = 5.3047, df = 10, p-value = 0.8699

Differencing the random walk gives white-noise innovations. Ljung-Box fails to reject — the cure is complete and there is no residual ARMA structure to model. This is the cleanest of the three cases.

1.2.3 (c) Stationary with structural break

DGP: a stationary AR(1) plus a single level shift of \(+5\) at \(t = 100\).

set.seed(106)
break_t <- 100

# Simulate AR(1) + level shift
y_SSB <- as.numeric(arima.sim(n = T_n, model = list(ar = 0.15), sd = 1)) +
         5 * (t_idx >= break_t)

# Cure: include a regime dummy in OLS
fit_SSB <- lm(y_SSB ~ I(t_idx >= break_t))
res_SSB <- as.numeric(residuals(fit_SSB))

ssb_df <- tibble(t = t_idx,
                 raw   = y_SSB,
                 cured = res_SSB,
                 m_raw   = roll_mean(y_SSB),
                 m_cured = roll_mean(res_SSB))

p_SSB_raw <- ggplot(ssb_df, aes(t)) +
  geom_line(aes(y = raw), color = "grey60") +
  geom_line(aes(y = m_raw), color = "firebrick", linewidth = 1) +
  geom_vline(xintercept = break_t, linetype = "dashed", color = "firebrick") +
  labs(title = "Raw: AR(1) + level shift at t=100",
       subtitle = "rolling mean steps up",
       y = expression(y[t]), x = "t") +
  theme_minimal(base_size = 11)

p_SSB_cured <- ggplot(ssb_df, aes(t)) +
  geom_line(aes(y = cured), color = "grey60") +
  geom_line(aes(y = m_cured), color = "steelblue", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
  labs(title = "After cure: residuals from lm(y ~ break_dummy)",
       subtitle = "rolling mean flat at zero",
       y = "residuals", x = "t") +
  theme_minimal(base_size = 11)

p_SSB_raw + p_SSB_cured

lmtest::bgtest(fit_SSB, order = 4)
## 
##  Breusch-Godfrey test for serial correlation of order up to 4
## 
## data:  fit_SSB
## LM test = 5.2415, df = 4, p-value = 0.2634

The break dummy absorbs the level shift; the residuals are flat at zero. As in the TS case, the underlying \(u_t\) is AR(1) — BG does not reject here (p ≈ 0.26) because \(\phi = 0.15\) is weak, but any remaining autocorrelation in the residuals is substantive dynamics to model next, not evidence that the cure failed.


The asymmetric misdiagnosis cost. Each cure removes exactly the kind of non-stationarity its DGP put in. Apply the wrong cure — for instance, differencing the SSB series in §1.2.3 instead of using a break dummy — and you get a transformed series whose ACF still looks “fine,” but whose model parameters now reflect the artifact of over-differencing rather than the substantive dynamics. That is the asymmetric misdiagnosis cost: differencing a stationary-with-breaks series corrupts inference in a way that is hard to detect from residual diagnostics alone, because the differenced series will look “good enough” by ACF/PACF standards. Part 2 walks through exactly this trap on the presidential approval data.
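To see the over-differencing artifact directly, one can re-create the §1.2.3 series, apply the wrong cure, and fit an ARMA(1,1) to the differences. The MA estimate is pushed toward \(-1\); exact values depend on the seed, so this is a sketch rather than a guaranteed output:

```r
set.seed(106)
T_n     <- 200
t_idx   <- seq_len(T_n)
break_t <- 100

# Re-create the Section 1.2.3 DGP: stationary AR(1) noise plus a +5 level shift
y_SSB <- as.numeric(arima.sim(n = T_n, model = list(ar = 0.15), sd = 1)) +
         5 * (t_idx >= break_t)

# Wrong cure: difference instead of modeling the break. Differencing collapses
# the level shift into a single pulse and injects a near-noninvertible MA root
# into the otherwise stationary noise.
dy_SSB   <- diff(y_SSB)
fit_over <- forecast::Arima(dy_SSB, order = c(1, 0, 1))
coef(fit_over)  # look at ma1: estimated near the -1 invertibility boundary
```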

1.3 Unit-root tests: ADF and KPSS

Two complementary tests are the standard tools:

Augmented Dickey–Fuller (ADF) tests the null hypothesis that the series has a unit root:

  • \(H_0\): \(y_t\) is integrated of order 1 — i.e., I(1), needs differencing.
  • \(H_1\): \(y_t\) is I(0) — stationary (possibly around a constant or a linear trend, depending on the deterministic terms included).

Reject \(\Rightarrow\) evidence of stationarity. Fail to reject \(\Rightarrow\) no evidence against a unit root (which is not the same as evidence for it).

Kwiatkowski–Phillips–Schmidt–Shin (KPSS) flips the null:

  • \(H_0\): \(y_t\) is stationary (possibly around a deterministic trend).
  • \(H_1\): \(y_t\) has a unit root.

Reject \(\Rightarrow\) evidence of non-stationarity. Fail to reject \(\Rightarrow\) no evidence against stationarity.

Used together, the tests cross-check each other:

ADF | KPSS | Implication
Reject I(1) | Fail to reject I(0) | Both say stationary \(\Rightarrow\) I(0)
Fail to reject I(1) | Reject I(0) | Both say non-stationary \(\Rightarrow\) I(1)
Reject I(1) | Reject I(0) | Conflicting: investigate (often a structural break)
Fail to reject I(1) | Fail to reject I(0) | Both inconclusive: short series, low power, or near-unit-root

The third row is where most applied trouble lives. The both-reject pattern is the hallmark of stationarity-with-breaks (Section 1.2, case c): neither null hypothesis is right, because the truth is “stationary within regimes.”

Three pitfalls every analyst should know

  1. Low power against near-unit-root alternatives. If \(y_t\) is a stationary AR(1) with \(\phi = 0.95\), ADF will fail to reject in moderate samples — not because there is a unit root, but because the test has trouble distinguishing \(\phi = 0.95\) from \(\phi = 1\). Sample size matters: power improves with \(T\).

  2. Sensitivity to deterministic regressors. ADF can be run with a constant only, with a constant and a linear trend, or with neither, and the wrong choice changes the verdict. Rule of thumb: if the series visibly trends, include the trend; otherwise, just the constant. (Note that tseries::adf.test() always includes both a constant and a trend; for the constant-only or no-deterministics variants, use urca::ur.df() with type = "drift" or type = "none".)

  3. Vulnerability to structural breaks (Perron 1989). A stationary series with a one-time level shift has the same slow ACF decay as a unit root. ADF tends to fail to reject the unit-root null; KPSS tends to reject the stationarity null. Both tests are misled — and, crucially, in opposite directions, which is what generates the conflicting verdict in row 3 of the table above. The fix is to model the break explicitly and re-test on the residuals.
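Pitfall 1 can be demonstrated by simulation: generate stationary AR(1) series with \(\phi = 0.95\) and count how often ADF correctly rejects the unit-root null. The exact rate depends on the seed (so no specific number is promised here), but at \(T = 100\) it is typically low:

```r
set.seed(107)

# Monte Carlo: ADF rejection rate when the truth is a stationary AR(1)
# with phi = 0.95 -- i.e., I(0), but close to the unit-root boundary.
n_sims <- 200
pvals  <- replicate(n_sims, {
  y <- as.numeric(arima.sim(n = 100, model = list(ar = 0.95), sd = 1))
  suppressWarnings(tseries::adf.test(y)$p.value)
})
mean(pvals < 0.05)  # power: the share of correct rejections at the 5% level
```

Re-running with a larger n inside arima.sim() shows the rejection rate climbing, which is the “power improves with \(T\)” point in the text.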

The practical lesson: unit-root tests are diagnostics, not oracles. Always pair them with (a) a careful plot of the series, (b) substantive knowledge of any events that might have shifted the level, (c) ACF/PACF of both levels and residuals, and (d) comparison of competing specifications side-by-side.

1.4 What to do instead

Before differencing, the disciplined workflow is:

  1. Plot the series. Visible level shifts, slow drifts, or trends should be modeled, not differenced.
  2. Inspect the correlogram (ACF and PACF). Slow, near-linear decay of the ACF over many lags is the visual signature of a unit root or near-unit-root persistence; a sharp cut-off in the PACF points to AR structure on a stationary series. The correlogram is cheap to compute and tells you what the formal tests will be facing before you run them.
  3. List candidate breaks. Substantive knowledge — known interventions, regime changes, events — generates the dummy variables we need.
  4. Run ADF and KPSS in parallel, with and without trend terms. Compare verdicts to steps 1–2’s visual evidence.
  5. Fit competing specifications (e.g., ARMAX with breaks vs. ARIMA on differences) and compare residual diagnostics, AIC/BIC, and out-of-sample fit. The data should pick the model, not the test.
  6. Re-run the unit-root tests on the residuals of the leading specification. If the residuals are stationary by both tests, you have ruled out the unit-root explanation conditional on your modeled breaks.

We will use exactly this workflow in Part 2 to show that the presidential approval series — which on a naive ADF/KPSS read looks like a textbook unit-root case — is actually stationary once 9/11 and the Iraq War are accounted for.

1.5 ARIMA notation (when differencing is the right cure)

When the diagnosis genuinely points to a unit root, differencing is the right move and the model is ARIMA(\(p, d, q\)):

\[\underbrace{\Delta^d y_t}_{\text{differenced } d \text{ times}} \sim \text{ARMA}(p, q).\]

  • \(p\) = AR order on the differenced series,
  • \(d\) = number of differences applied to reach stationarity,
  • \(q\) = MA order on the differenced series.
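In R, \(\Delta^d\) corresponds to diff() with the differences argument; a quick sanity check for \(d = 2\):

```r
set.seed(1)
y <- cumsum(cumsum(rnorm(50)))   # an I(2) series: twice-accumulated white noise

# diff() applied twice and diff() with differences = 2 are the same operator
d1 <- diff(diff(y))
d2 <- diff(y, differences = 2)
all.equal(d1, d2)                # TRUE; both are Delta^2 y_t, with length 48
```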

Special cases:

  • ARIMA(0,0,0) = white noise.
  • ARIMA(1,0,0) = stationary AR(1) — no differencing.
  • ARIMA(0,1,0) = random walk.
  • ARIMA(1,1,0) = AR(1) on the differenced series.
  • ARIMA(\(p, d, q\))\((P, D, Q)_m\) = seasonal extension at period \(m\).

Part 2: ARMAX with Breaks — The Approval Misdiagnosis

This part walks through the diagnostic discipline from Part 1 on a real series — monthly Bush approval, February 2001 to June 2006. The textbook recipe is “the ACF decays slowly, so it’s non-stationary, so difference it, so fit ARIMA.” We’ll see why that’s the wrong move here, what the right specification is, and how a small change in diagnostic discipline rescues a meaningful AR(1) coefficient that the textbook approach throws away.

2.1 The data and the substantive question

approval <- read_csv("data/approval.csv")
glimpse(approval)
## Rows: 65
## Columns: 8
## $ month         <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7,…
## $ year          <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 20…
## $ approve       <dbl> 58.67, 58.00, 60.50, 55.00, 54.00, 56.50, 56.00, 75.67, …
## $ disapprove    <dbl> 23.67, 26.67, 29.50, 33.33, 34.00, 34.00, 35.00, 18.33, …
## $ unsure        <dbl> 17.67, 15.33, 10.00, 11.67, 12.00, 9.50, 9.00, 6.00, 3.0…
## $ sept.oct.2001 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ iraq.war      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ avg.price     <dbl> 144.975, 140.925, 155.160, 170.175, 161.625, 142.060, 14…

The dataset is monthly, \(T = 65\), covering Feb 2001 – Jun 2006. The variables we use:

  • approve: percent approving of George W. Bush.
  • sept.oct.2001: a pulse dummy equal to 1 in September and October 2001 only — capturing the immediate “rally around the flag” effect after 9/11.
  • iraq.war: a pulse dummy equal to 1 in March, April, and May 2003 — capturing the Iraq-War rally bump.
  • avg.price: monthly average oil price index.

Both intervention dummies are pulses (one in a handful of months, zero otherwise), not steps (one forever after the event). They model the immediate spike, not a permanent regime shift.
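For constructing such dummies by hand, a brief sketch (the observation indices below are illustrative, keyed to a series starting in Feb 2001):

```r
# Pulse vs. step dummies for an event at observation t_star.
# A pulse is 1 only during the event window; a step is 1 from the event onward.
T_obs  <- 65
t_star <- 8                       # Sep 2001 is observation 8 of a Feb-2001 start
pulse  <- as.numeric(seq_len(T_obs) %in% c(t_star, t_star + 1))  # Sep-Oct only
step   <- as.numeric(seq_len(T_obs) >= t_star)                   # Sep onward

sum(pulse)   # 2: on for two months, then off again
sum(step)    # 58: on for the rest of the sample
```

A pulse models a temporary bump; a step models a permanent regime shift. The approval dummies are pulses, which is what lets the rally effects fade.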

Substantive question. What was the immediate effect of 9/11 and the Iraq War on approval, do oil-price movements affect approval, and how persistent are the unmodeled dynamics — once we account for the visible long downward drift?

approve <- ts(approval$approve, start = c(2001, 2), frequency = 12)

xcov_approve <- cbind(
  sept.oct.2001 = approval$sept.oct.2001,
  iraq.war      = approval$iraq.war,
  avg.price     = approval$avg.price
)

2.2 Visual diagnosis

p_approve <- autoplot(approve) +
  geom_vline(xintercept = 2001 + 8/12, linetype = "dashed", color = "firebrick") +
  geom_vline(xintercept = 2003 + 2/12, linetype = "dashed", color = "steelblue") +
  annotate("text", x = 2002,   y = 45, label = "9/11",     color = "firebrick", size = 4) +
  annotate("text", x = 2003.7, y = 45, label = "Iraq War", color = "steelblue", size = 4) +
  labs(title = "U.S. Presidential Approval (Bush, 2001–2006)",
       y = "% Approving", x = "Year") +
  theme_minimal(base_size = 13)

p_approve

Three features stand out. (i) A massive spike at 9/11 (Sep 2001), peaking around 88% approval. (ii) A smaller bump after the Iraq invasion (March 2003). (iii) A long downward drift from late 2002 onward, ending around 35% in mid-2006. There is no visible “level shift” that persists — the post-rally trajectory always reverts toward a declining baseline.

p_approve_acf <-
  (ggAcf(approve)  + labs(title = "ACF — Approval levels")  + theme_minimal(base_size = 11)) |
  (ggPacf(approve) + labs(title = "PACF — Approval levels") + theme_minimal(base_size = 11))

p_approve_acf

The ACF decays slowly — a textbook signature that students often read as “unit root, must difference.” But Part 1 warned us: a slow ACF decay is also what trend-stationary series and stationary series with breaks look like. The visual evidence here points clearly to a deterministic decline (a downward trend, plus the two pulse interventions), not a stochastic drift. We need formal tests to triangulate.

2.3 Unit-root tests: which null to choose?

adf.test(approve)                             # H0: unit root
## 
##  Augmented Dickey-Fuller Test
## 
## data:  approve
## Dickey-Fuller = -3.9565, Lag order = 3, p-value = 0.01721
## alternative hypothesis: stationary
kpss.test(approve, null = "Level")            # H0: stationary around a constant
## 
##  KPSS Test for Level Stationarity
## 
## data:  approve
## KPSS Level = 1.2485, Truncation lag parameter = 3, p-value = 0.01
kpss.test(approve, null = "Trend")            # H0: stationary around a linear trend
## 
##  KPSS Test for Trend Stationarity
## 
## data:  approve
## KPSS Trend = 0.11109, Truncation lag parameter = 3, p-value = 0.1

Test | \(H_0\) | \(p\)-value | Verdict
ADF (default: constant + trend) | unit root | \(0.017\) | reject \(\Rightarrow\) stationary
KPSS (Level) | stationary around \(\mu\) | \(0.010\) | reject \(\Rightarrow\) non-stationary
KPSS (Trend) | stationary around trend | \(0.100\) | don't reject \(\Rightarrow\) trend-stationary

The verdicts conflict — but only superficially. ADF and KPSS-Trend agree that the series is stationary once we allow for a deterministic trend. KPSS-Level disagrees because it assumes the mean is constant, which is plainly false (approval falls from ~85 to ~35 over five years). This is Pitfall 2 from Part 1: the deterministic specification matters. The same series gives opposite verdicts when the null is “stationary around a constant” vs. “stationary around a trend.”

Diagnosis: the approval series is trend-stationary with two pulse interventions. Not a unit root. The slow ACF decay is real persistence in the levels, not a stochastic trend.

2.4 The naive approach: difference and fit ARIMA(1,1,0)

A practitioner who only ran KPSS-Level — or who took the slow ACF decay at face value — would conclude “non-stationary” and difference. Let’s see what that gives us.

fit_naive <- Arima(approve, order = c(1, 1, 0), xreg = xcov_approve)
fit_naive
## Series: approve 
## Regression with ARIMA(1,1,0) errors 
## 
## Coefficients:
##          ar1  sept.oct.2001  iraq.war  avg.price
##       0.0701        11.1213    5.3167    -0.0669
## s.e.  0.1320         2.5128    2.5723     0.0352
## 
## sigma^2 = 13.12:  log likelihood = -171.11
## AIC=352.22   AICc=353.26   BIC=363.02

The output reports ar1 \(\approx 0.07\) with standard error \(\approx 0.13\): not significant, and essentially zero. This is the textbook signature of over-differencing: the AR component on the differenced series finds nothing to model, because differencing has stripped out the persistence that was actually in the levels.

What we are left with is a model that says: “the changes in approval are essentially uncorrelated noise, except for jumps at 9/11 and the Iraq War, plus a small oil-price effect.” That is not the dynamic story a slow-decay ACF was telling us. We have thrown that story away.
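A minimal simulation makes the collapse reproducible. This is a sketch on fabricated data (not the approval series): a stationary AR(1) around a deterministic downward trend, handled both the wrong way (difference first) and the right way (model the levels with a trend regressor). It uses base R's arima(), which fits the same class of models as forecast::Arima().

```r
set.seed(42)
n <- 500
u <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))  # stationary AR(1), phi = 0.9
y <- 50 - 0.05 * seq_len(n) + u                            # deterministic downward trend

# Wrong move: difference, then fit AR(1) -- the coefficient collapses toward zero
fit_diff <- arima(diff(y), order = c(1, 0, 0))

# Right move: keep the levels, add a trend regressor -- phi is recovered
fit_lvl <- arima(y, order = c(1, 0, 0), xreg = seq_len(n))

round(c(ar1_after_differencing = unname(coef(fit_diff)["ar1"]),
        ar1_on_levels          = unname(coef(fit_lvl)["ar1"])), 2)
```

The differenced fit lands near zero (the population lag-1 autocorrelation of \(\Delta u_t\) for an AR(1) with \(\phi = 0.9\) is only \(-(1-\phi)/2 = -0.05\)), while the levels fit recovers \(\phi \approx 0.9\) — the same contrast as the approval results above.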

2.5 The right specification: ARMAX(1,0,0)

The right model treats the levels as the dependent variable and lets an AR(1) error process carry the persistence. This is sometimes called a regression with ARMA errors, sometimes ARMAX — the X stands for “eXogenous” regressors:

\[ y_t = \beta_0 + \beta_1\,\mathbf{1}\{911\}_t + \beta_2\,\mathbf{1}\{Iraq\}_t + \beta_3\,\text{oil}_t + u_t, \qquad u_t = \phi\, u_{t-1} + \varepsilon_t. \]

The \(\beta\) coefficients are effects on the level of approval. The persistence is in the error process, not in the covariates — so \(\beta\) is the immediate-and-long-run effect of a covariate (no lagged-\(y\) propagation; we’ll add that in Part 3).

In R: Arima(y, order = c(1, 0, 0), xreg = X).

fit_armax <- Arima(approve, order = c(1, 0, 0), xreg = xcov_approve)
fit_armax
## Series: approve 
## Regression with ARIMA(1,0,0) errors 
## 
## Coefficients:
##         ar1  intercept  sept.oct.2001  iraq.war  avg.price
##       0.918    71.4311        11.5152    5.8123    -0.0892
## s.e.  0.049     8.2425         2.5738    2.5299     0.0357
## 
## sigma^2 = 12.81:  log likelihood = -173.43
## AIC=358.86   AICc=360.31   BIC=371.91

Compare this to the ARIMA(1,1,0) above:

  • AR(1) coefficient is now 0.92 (s.e. 0.05), highly significant. The slow ACF decay was real persistence — captured cleanly when we model the levels.
  • 9/11 dummy ≈ 11.5: approval rose by about 11.5 percentage points in the rally months.
  • Iraq War dummy ≈ 5.8: about a 5.8-point bump for the war’s first three months.
  • Oil price ≈ −0.09: a $10 increase in the oil-price index lowers approval by about 0.9 points.
  • Intercept ≈ 71: long-run baseline approval, absent shocks.
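As a sanity check on the machinery, the regression-with-AR(1)-errors model can be verified on a simulated DGP where the truth is known. This is a sketch with made-up parameter values (a two-period pulse of +12 on a level of 70, AR(1) errors with \(\phi = 0.9\)), again via base R's arima():

```r
set.seed(7)
n <- 500
pulse <- as.numeric(seq_len(n) %in% 101:102)               # two-period pulse dummy
u <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))  # AR(1) errors
y <- 70 + 12 * pulse + u                                   # beta = 12 on the LEVEL

fit_sim <- arima(y, order = c(1, 0, 0), xreg = pulse)
round(coef(fit_sim), 2)   # ar1 near 0.9, intercept near 70, pulse near 12
```

All three quantities are recovered: the pulse coefficient is an effect on the level, and the persistence sits cleanly in the error process.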

Residual diagnostics

If the ARMAX captures the dynamics, residuals should be white noise — and crucially, they should be stationary by both ADF and KPSS, since the entire point of this section is to establish that no unit root remains once we model trend-and-breaks-and-AR(1).

res_armax <- residuals(fit_armax)

p_armax_resid <-
  (ggAcf(res_armax)  + labs(title = "ACF — ARMAX residuals")  + theme_minimal(base_size = 11)) |
  (ggPacf(res_armax) + labs(title = "PACF — ARMAX residuals") + theme_minimal(base_size = 11))

p_armax_resid

adf.test(res_armax)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  res_armax
## Dickey-Fuller = -4.0106, Lag order = 3, p-value = 0.01481
## alternative hypothesis: stationary
kpss.test(res_armax)
## 
##  KPSS Test for Level Stationarity
## 
## data:  res_armax
## KPSS Level = 0.28673, Truncation lag parameter = 3, p-value = 0.1
Box.test(res_armax, lag = 12, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  res_armax
## X-squared = 5.9234, df = 12, p-value = 0.9199

All three diagnostics agree:

  • ADF \(p \approx 0.015\) — reject the unit-root null (residuals are stationary).
  • KPSS \(p \approx 0.10\) — fail to reject the stationary null.
  • Ljung-Box at lag 12, \(p \approx 0.92\) — fail to reject white noise.

The unit-root hypothesis is ruled out conditional on the modeled trend-and-breaks-and-AR(1) structure. This is exactly the disciplined workflow from Section 1.4: re-test on the residuals of the leading specification.

auto.arima as a cross-check (not a decision rule)

auto.arima() selects \(d\) with unit-root tests (KPSS by default) and then searches \((p, q)\) (and seasonal orders) for the best AICc. Treat it as a cross-check, not a decision rule: information-criterion search tends to over-parameterize in short series, and “best AICc” is not “right substantive specification.”

fit_auto <- auto.arima(approve, xreg = xcov_approve,
                       stepwise = FALSE, approximation = FALSE)
fit_auto
## Series: approve 
## Regression with ARIMA(1,0,0)(0,0,1)[12] errors 
## 
## Coefficients:
##          ar1     sma1  intercept  sept.oct.2001  iraq.war  avg.price
##       0.9322  -0.2920    72.8397        11.2855    5.9419    -0.0941
## s.e.  0.0478   0.1755     7.5448         2.4999    2.3586     0.0344
## 
## sigma^2 = 12.31:  log likelihood = -172.09
## AIC=358.19   AICc=360.15   BIC=373.41

auto.arima returns ARIMA(1,0,0)(0,0,1)[12]:

  • Where it helps. It picks \(d = 0\): no differencing. Independent confirmation of our diagnosis.
  • Where it reaches. It adds a seasonal MA(1) with \(\hat\theta_{12} \approx -0.29\), s.e. \(0.18\), \(p \approx 0.10\). Three reasons to be cautious: (i) substantive grounding: no obvious theoretical reason for a 12-month cycle in approval over a five-year window; (ii) sample size: \(T = 65\) months is roughly five seasonal cycles, too few to identify a stable seasonal structure; (iii) statistical strength: \(p \approx 0.10\) is borderline, and AIC happily includes such terms.

We retain the parsimonious ARMAX(1,0,0): its residuals are already white noise, the seasonal MA is borderline and unsupported by theory, and parsimony pays off in the small-sample dynamic-effects calculations to come. Use auto.arima to ask “did I miss something?” — never “what is the right model?”

2.6 Side-by-side: ARMAX(1,0,0) vs. ARIMA(1,1,0)

extract_coefs <- function(fit, label) {
  ct <- lmtest::coeftest(fit)
  tibble(
    Coefficient = rownames(ct),
    Estimate    = ct[, "Estimate"],
    `Std. Err`  = ct[, "Std. Error"],
    `p-value`   = ct[, "Pr(>|z|)"]
  ) |> mutate(Model = label)
}

compare_df <- bind_rows(
  extract_coefs(fit_naive, "ARIMA(1,1,0)"),
  extract_coefs(fit_armax, "ARMAX(1,0,0)")
)

compare_df |>
  mutate(across(c(Estimate, `Std. Err`), ~ round(.x, 3)),
         `p-value` = round(`p-value`, 3)) |>
  pivot_wider(names_from = Model,
              values_from = c(Estimate, `Std. Err`, `p-value`),
              names_glue = "{Model}: {.value}") |>
  knitr::kable(caption = "ARIMA(1,1,0) vs ARMAX(1,0,0): coefficient comparison.")
ARIMA(1,1,0) vs ARMAX(1,0,0): coefficient comparison.
Coefficient ARIMA(1,1,0): Estimate ARMAX(1,0,0): Estimate ARIMA(1,1,0): Std. Err ARMAX(1,0,0): Std. Err ARIMA(1,1,0): p-value ARMAX(1,0,0): p-value
ar1 0.070 0.918 0.132 0.049 0.596 0.000
sept.oct.2001 11.121 11.515 2.513 2.574 0.000 0.000
iraq.war 5.317 5.812 2.572 2.530 0.039 0.022
avg.price -0.067 -0.089 0.035 0.036 0.057 0.013
intercept NA 71.431 NA 8.243 NA 0.000

A coefficient plot makes the AR(1) collapse — the central headline of this part — immediately visible:

plot_df <- compare_df |>
  filter(Coefficient %in% c("ar1", "sept.oct.2001", "iraq.war", "avg.price")) |>
  mutate(Coefficient = factor(Coefficient,
           levels = c("ar1", "sept.oct.2001", "iraq.war", "avg.price"),
           labels = c("AR(1)", "9/11 dummy", "Iraq War dummy", "Oil price")),
         lo = Estimate - 1.96 * `Std. Err`,
         hi = Estimate + 1.96 * `Std. Err`)

p_coefs <- ggplot(plot_df,
       aes(x = Estimate, y = Coefficient, color = Model)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey40") +
  geom_pointrange(aes(xmin = lo, xmax = hi),
                  position = position_dodge(width = 0.5), size = 0.6) +
  scale_color_manual(values = c("ARIMA(1,1,0)" = "firebrick",
                                "ARMAX(1,0,0)" = "steelblue")) +
  facet_wrap(~ Coefficient, scales = "free", ncol = 2) +
  labs(title = "Coefficient estimates with 95% CI",
       subtitle = "ARMAX recovers the AR(1) persistence that differencing destroys",
       x = NULL, y = NULL, color = NULL) +
  theme_minimal(base_size = 11) +
  theme(axis.text.y = element_blank(),
        legend.position = "bottom",
        strip.text = element_text(face = "bold"))

p_coefs

Where the two models differ:

  1. AR(1) coefficient: 0.07 vs 0.92. The headline result. The ARMAX captures persistent stationary dynamics that the ARIMA(1,1,0) — by differencing first — has destroyed.
  2. Substantive scale. ARIMA(1,1,0) coefficients describe effects on the change \(\Delta y_t\) (a 9/11 shock raises the change in approval by about 11 points). ARMAX(1,0,0) coefficients describe effects on the level of \(y_t\) (a 9/11 shock raises approval by about 11.5 points). The level interpretation is more direct.
  3. Intercept. ARIMA(1,1,0) has no fitted intercept (differencing absorbs it). ARMAX has 71.4, the conditional baseline of approval.

Where they agree: the pulse-shock coefficients are similar in magnitude, because pulse interventions transfer cleanly between specifications — they are 1 only in a few months and 0 otherwise, so differencing changes the regressor pattern but not the integrated impact much.

Note on AIC. AIC is reported as 352 for ARIMA(1,1,0) and 359 for ARMAX(1,0,0). These are not directly comparable. When Arima is called with \(d=1\), the likelihood is computed on the differenced series (\(T-1\) observations) and on differently-transformed data; AIC across different values of \(d\) measures different things. The right comparators are residual stationarity, white-noise residuals, and substantive interpretability — and ARMAX(1,0,0) wins on all three.
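The "computed on the differenced series" point can be seen directly: fitting ARIMA(1,1,0) to \(y\) and fitting ARIMA(1,0,0) with no mean to \(\Delta y\) by hand give the same likelihood and the same AIC. A sketch on simulated data (a random walk, so \(d = 1\) is genuinely appropriate here):

```r
set.seed(1)
y <- cumsum(rnorm(200))   # pure random walk

a1 <- arima(y, order = c(1, 1, 0))                             # internally differences y
a2 <- arima(diff(y), order = c(1, 0, 0), include.mean = FALSE) # same model, written out

c(AIC_d1 = AIC(a1), AIC_on_diff = AIC(a2))   # equal, up to optimizer noise
```

Since the \(d = 1\) likelihood is literally the \(\Delta y\) likelihood, its AIC lives on a different dataset than any \(d = 0\) fit on the levels — which is why the 352-vs-359 comparison above is meaningless.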

2.7 What ARMAX gives us — and what it doesn’t

Looking ahead to Part 3. The ARMAX(1,0,0) we just estimated is one of two distinct ways to handle persistence in a regression. The other adds a lagged dependent variable (LDV), \(y_t = \alpha + \phi y_{t-1} + X_t\beta + \varepsilon_t\), putting the \(\phi\) on \(y\) itself rather than on the error. These are different models, not different estimators of the same model — they predict different long-run responses to a covariate shock. We develop this contrast carefully in §3.2 once we have the unrestricted ARDL(1,1) on the table; for now, the takeaway is that ARMAX puts the AR coefficient on the errors, so the \(\beta\)’s are immediate-and-long-run effects with no further propagation.

What ARMAX gives us

The ARMAX(1,0,0) gives us:

  • Clean estimates of intervention effects on the level of approval.
  • Stationary, white-noise residuals (verified by ADF, KPSS, and Ljung-Box).
  • A persistent AR(1) error (\(\hat{\phi} \approx 0.92\)) that captures slow, unmodeled dynamics — the kind of “mood drift” that no single covariate could explain.

But there is a key limitation. In ARMAX, covariate effects do not propagate. The 9/11 dummy is +11.5 in September–October 2001; from November 2001 onward it is 0 and contributes nothing further. The AR(1) propagation lives in the errors, not in the covariates. So the \(\beta\)’s are immediate effects, equal to long-run effects.

For pulse interventions like 9/11, this is fine — we want the immediate effect, and we have a story for residual persistence. For continuous covariates like oil price, however, a static \(\beta\) can miss real dynamics: an oil-price increase in March may take a few months to show up in approval, propagating gradually through public opinion. To model that, we need:

  1. Lagged values of \(x\) in the regression — a distributed-lag (DL) model, and
  2. A lagged dependent variable so the effects of \(x\) propagate through \(y\) — an autoregressive distributed-lag (ARDL) model.

That is the move in Part 3. ARDL is also the algebraic gateway to error-correction models, which we use in Parts 4–5.


Part 3: ARDL — Distributed Lags and Dynamic Effects

In Part 2 we landed on ARMAX(1,0,0) as the right specification for the approval series: regression on level \(y_t\) with AR(1) errors. That model captured persistence in the errors, but it left covariates entirely static: a one-unit increase in oil price affects approval by exactly \(\hat\beta_3 = -0.09\) in the same period and contributes nothing further. The error process can transmit innovations forward through time, but covariate effects do not propagate.

For pulse interventions like 9/11 or the Iraq War rally, the static-effect interpretation is fine, since those are brief, well-defined events. For continuous covariates like oil price, however, a shift in the input may take time to filter through public opinion. To model that propagation we move from ARMAX to ARDL (autoregressive distributed lag).

Two specifications introduce dynamics:

  • Distributed lag (DL): include lagged values of \(x\) as separate regressors, so the effect of \(x\) at lag \(\ell\) gets its own coefficient \(\beta_\ell\):

\[y_t \;=\; \alpha + \beta_0 x_t + \beta_1 x_{t-1} + u_t,\]

  • ARDL: also include a lagged \(y\). The lagged dependent variable propagates all covariate effects forward through its dynamics:

\[y_t \;=\; \alpha + \phi\, y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + u_t,\]

  • DL is finite-horizon (effects vanish after lag \(L\)).
  • ARDL is infinite-horizon with geometric decay.

They have different identification, different residual properties, and — as we will see in Part 4 — very different connections to error-correction.
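The finite-vs-infinite horizon contrast can be checked numerically: iterate the ARDL(1,1) recursion under a permanent unit step in \(x\) and compare the limit with the long-run-multiplier formula \((\beta_0+\beta_1)/(1-\phi)\). Illustrative parameter values, not estimates:

```r
phi <- 0.8; beta0 <- 1; beta1 <- 0.5
H <- 200
x <- rep(1, H)                       # permanent +1 step in x from t = 1
y <- numeric(H)
y[1] <- beta0 * x[1]                 # x_0 = 0 and y_0 = 0 before the step
for (t in 2:H) y[t] <- phi * y[t - 1] + beta0 * x[t] + beta1 * x[t - 1]

c(iterated_limit = y[H],
  lrm_formula    = (beta0 + beta1) / (1 - phi))   # both 7.5
```

A DL model would stop accumulating after lag \(L\); the ARDL keeps compounding through \(\phi y_{t-1}\), converging geometrically to the LRM.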

3.1 Two stories of persistence: AR errors vs LDV

Before §3.2 patches the AR-error model and §3.3 moves to ARDL, we set up the conceptual contrast that distinguishes them. Moving from “AR on errors” to “AR on \(y\)” is not a different estimator of the same model — it is a different model, with different long-run predictions about what happens when a covariate moves.

df_lab4 <- approval |>
  mutate(approve_lag = lag(approve, 1),
         oil         = avg.price,
         oil_lag     = lag(avg.price, 1)) |>
  drop_na()

Two narratives for persistence

When a series shows autocorrelation, the question is where in the model the persistence lives.

Story A — “shocks themselves persist.” Random innovations have memory; a shock today affects \(y\) today, lingers into tomorrow, and decays out. Persistence lives in the errors.

Story B — “yesterday’s outcome shapes today’s outcome.” The system has state. Whatever happened to \(y\) yesterday is partly who \(y\) is today. Persistence lives in the dependent variable itself.

These are different beliefs about the data-generating process, not different statistical fixes.

The two specifications, side by side

Story A — regression with AR(1) errors (the ARMAX of §2.5): \[ y_t \;=\; X_t \beta + u_t, \qquad u_t = \phi u_{t-1} + \varepsilon_t. \] The \(\phi\) governs the noise’s persistence. Fitted by Arima(y, xreg = X, order = c(1,0,0)) — GLS/ML handles the AR(1) covariance.

Story B — lagged dependent variable (LDV): \[ y_t \;=\; \alpha + \phi y_{t-1} + X_t \beta + \varepsilon_t, \qquad \varepsilon_t \sim \text{WN}. \] The \(\phi\) governs the outcome’s persistence. Fitted by lm(y ~ lag(y) + X) — OLS, provided residuals come out white noise (Breusch-Godfrey gateway, cf. Lab 3 §2.4).

What “restriction” means here

Both stories sit inside a more general parent — the unrestricted ARDL(1,1):

\[ y_t \;=\; \alpha + \phi y_{t-1} + \beta_0 X_t + \beta_1 X_{t-1} + \varepsilon_t. \]

This has four free coefficients. A restriction is a constraint imposed before estimation that ties one coefficient to the others (or fixes its value). Once imposed, that coefficient is no longer free; its value follows from the constraint plus the others.

Specification \(\alpha\) \(\phi\) (\(y_{t-1}\)) \(\beta_0\) (\(X_t\)) \(\beta_1\) (\(X_{t-1}\)) LRM of \(X\)
Unrestricted ARDL(1,1) free free free free \((\beta_0+\beta_1)/(1-\phi)\)
Story A (AR errors / COMFAC) free free free \(= -\phi\beta_0\) \(\beta_0\)
Story B (LDV-only / partial adjustment) free free free \(= 0\) \(\beta_0/(1-\phi)\)

Reading the rows: Story B drops \(X_{t-1}\) entirely. Story A keeps \(X_{t-1}\) but pins its coefficient to \(-\phi\beta_0\). Both reduce the free-parameter count from four to three; they constrain the same coefficient (\(\beta_1\)) in different ways, and they pick out different long-run multipliers as a consequence.

Why the AR-errors restriction takes the value \(-\phi\beta_0\)

The constraint isn’t arbitrary. Substitute \(u_{t-1} = y_{t-1} - X_{t-1}\beta_0\) into the AR-errors equation:

\[ y_t \;=\; X_t\beta_0 + \phi(y_{t-1} - X_{t-1}\beta_0) + \varepsilon_t \;=\; \phi y_{t-1} + \beta_0 X_t \;\underbrace{-\phi\beta_0}_{\text{the constraint}}\, X_{t-1} + \varepsilon_t. \]

The lagged-\(X\) coefficient \(-\phi\beta_0\) is exactly what falls out of “this is really an AR-errors model.” Substantively, \(-\phi\beta_0\) is the precise amount needed to cancel the LDV’s propagation of yesterday’s \(X\) effect — which is how Story A keeps covariates static.
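The algebra can be confirmed by simulation: generate data from Story A (static mean, AR(1) errors), fit the unrestricted ARDL(1,1) by OLS, and check that the lagged-\(X\) coefficient lands near \(-\phi\beta_0\). A sketch on fabricated data with \(\beta_0 = 1\), \(\phi = 0.8\):

```r
set.seed(11)
n <- 2000
beta0 <- 1; phi <- 0.8
x <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))  # a persistent covariate
u <- as.numeric(arima.sim(model = list(ar = phi), n = n))  # AR(1) errors
y <- beta0 * x + u                                         # Story A DGP

d <- data.frame(y = y[-1], y_lag = y[-n], x = x[-1], x_lag = x[-n])
fit_comfac <- lm(y ~ y_lag + x + x_lag, data = d)
round(coef(fit_comfac), 3)   # x_lag should sit near -phi * beta0 = -0.8
```

The unrestricted fit recovers \(\phi \approx 0.8\), \(\beta_0 \approx 1\), and \(\beta_1 \approx -0.8\): the COMFAC restriction holds in the estimates because it holds in the DGP.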

The substantive payoff

The LRM column of the restriction table above already makes the point: with \(\beta_0 = 1\) and \(\phi = 0.8\), Story A’s long-run effect is \(\beta_0 = 1\); Story B’s is \(\beta_0/(1-\phi) = 5\) — a factor of five from the same nominal coefficient. The figure below makes the asymmetry visible: panel (a) shows the response to a permanent step in \(X\) (covariate shock); panel (b) shows the response to a unit innovation \(\varepsilon_0\) — where Story A does propagate, but only the noise.

beta_demo <- 1
phi_demo  <- 0.8
H_demo    <- 30
t_demo    <- 0:H_demo

# Top panel: COVARIATE IRF — permanent +1 step in X at t=0, no innovations
resp_cov_A <- rep(beta_demo, H_demo + 1)                              # flat at beta
resp_cov_B <- beta_demo * (1 - phi_demo^(t_demo + 1)) / (1 - phi_demo) # geometric to beta/(1-phi)

cov_df <- bind_rows(
  tibble(t = t_demo, y = resp_cov_A, story = "Story A: AR errors"),
  tibble(t = t_demo, y = resp_cov_B, story = "Story B: LDV")
)

p_cov <- ggplot(cov_df, aes(t, y, color = story)) +
  geom_hline(yintercept = c(1, 5), linetype = "dashed", color = "grey50") +
  geom_line(linewidth = 1.1) + geom_point(size = 1.6) +
  scale_color_manual(values = c("Story A: AR errors" = "steelblue",
                                "Story B: LDV"       = "firebrick")) +
  annotate("text", x = H_demo, y = 1.35, label = "Story A LRM = beta = 1",
           color = "steelblue", size = 3.4, hjust = 1) +
  annotate("text", x = H_demo, y = 4.65, label = "Story B LRM = beta/(1-phi) = 5",
           color = "firebrick", size = 3.4, hjust = 1) +
  labs(title = "(a) Covariate IRF -- response to a permanent +1 step in X",
       subtitle = "Story A flat; Story B rises geometrically. Covariate dynamics differ.",
       x = NULL, y = "Cumulative response of y", color = NULL) +
  ylim(0, 6) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none")

# Bottom panel: INNOVATION IRF — unit shock in epsilon_0, X held at 0
# Both stories: u_t (Story A) or y_t deviation (Story B) decays as phi^t — identical
resp_inn <- phi_demo^t_demo

inn_df <- tibble(t = t_demo, y = resp_inn)

p_inn <- ggplot(inn_df, aes(t, y)) +
  geom_hline(yintercept = 0, linetype = "dotted", color = "grey60") +
  geom_line(color = "purple4", linewidth = 1.1) +
  geom_point(color = "purple4", size = 1.6) +
  annotate("text", x = H_demo / 2, y = 0.65,
           label = "Story A and Story B trace the SAME phi^t decay\n(innovations propagate identically)",
           hjust = 0.5, size = 3.4, color = "purple4", lineheight = 1.05) +
  labs(title = "(b) Innovation IRF -- response to a unit shock in epsilon at t=0",
       subtitle = "Identical geometric decay -- innovation dynamics are the same in both stories.",
       x = "Periods after shock", y = "Response of y") +
  ylim(0, 1.1) +
  theme_minimal(base_size = 11)

p_two_stories <- p_cov / p_inn +
  patchwork::plot_layout(heights = c(1.05, 1))

if (!dir.exists("output/slides")) dir.create("output/slides", recursive = TRUE)
ggsave("output/slides/p3_two_stories_irf.png", p_two_stories,
       width = 10, height = 6.3, dpi = 110)
p_two_stories

Read both panels together: the two stories produce identical responses to innovation shocks, but radically different responses to covariate shocks. That asymmetry is the COMFAC restriction made visible. Story A propagates persistence in the errors (panel b) but holds covariates static (panel a, flat line). Story B propagates both — a covariate shock at \(t=0\) feeds into \(y_0\), then through the LDV into \(y_1\), and so on, producing the rising trajectory in panel (a). The factor-of-five gap between panel (a)’s two lines is what students should remember; the identical curves in panel (b) explain why the AR coefficient \(\phi\) alone — without context about where it lives — is not enough information to read the long-run effect.

Estimation: which OLS works for which?

OLS properties differ depending on whether the spec matches the DGP:

Specification Estimator \(\hat\beta\) unbiased? Consistent? SE correct?
Static + AR errors naive OLS ✓ ✓ ✗
Static + AR errors GLS / ML via Arima(...) ✓ ✓ ✓
LDV + WN errors OLS via lm(...) ✗ (\(O(1/T)\), Hurwicz) ✓ ✓
LDV + AR errors OLS ✗ ✗ ✗

This is not a “don’t use OLS” table — it’s a “match the estimator to the spec” table. The ✗ marks describe what happens when the estimator ignores a dynamic component the data actually contain. The remedy is never “abandon OLS”; it is “fix the specification so OLS has nothing left to ignore.”

  • Row 1 — naive OLS on AR-errors data: the coefficients are unbiased and consistent, only the variance formula is wrong. Cure: HAC standard errors (§3.2), or model the AR explicitly via Arima() (row 2).
  • Row 4 — OLS with both LDV and unmodeled AR residuals: now \(y_{t-1}\) is correlated with \(u_t\) through \(u_{t-1}\), so the regressor is endogenous and OLS becomes biased and inconsistent. Cure: fix the specification — either add more lags of \(y\) until residuals pass Breusch-Godfrey (still OLS, just a richer LDV), or move to an ML-based estimator that handles the residual AR structure jointly. Keele and Kelly (2006, Tables 3–4) document 50–200% bias when this row’s misspecification is left unaddressed.

The Breusch-Godfrey test on LDV residuals is the gateway between rows 3 and 4: pass and OLS-LDV is exactly the right tool; reject and the spec needs to change before reading any coefficients off the regression. In practice, almost all of our applied work in Parts 3–6 lives in rows 2 or 3 — the rows where OLS or its ML cousin work cleanly because we matched the estimator to the DGP.

Choose by testing — and by theory

The literature (De Boef and Keele 2008; Cook and Webb 2021) recommends fitting the unrestricted ARDL(1,1) and testing the restrictions before imposing either:

  • Partial adjustment \(H_0\): \(\beta_1 = 0\) — standard \(t\)-test on lagged \(X\).
  • COMFAC \(H_0\): \(\beta_1 + \phi\beta_0 = 0\): a nonlinear Wald test, \(\chi^2_1\) via the delta method.
# Unrestricted ARDL(1,1) — fit here for the restriction test, reused in §3.3
fit_ardl <- lm(approve ~ approve_lag + oil + oil_lag +
                 sept.oct.2001 + iraq.war,
               data = df_lab4)

# Delta-method Wald test of COMFAC: g(theta) = beta_oil_lag + phi * beta_oil = 0
b0  <- coef(fit_ardl)["oil"]
b1  <- coef(fit_ardl)["oil_lag"]
phi <- coef(fit_ardl)["approve_lag"]
g_hat <- b1 + phi * b0

# Gradient of g w.r.t. (phi, beta_0, beta_1), zero elsewhere
G <- numeric(length(coef(fit_ardl)))
names(G) <- names(coef(fit_ardl))
G["approve_lag"] <- b0
G["oil"]         <- phi
G["oil_lag"]     <- 1

se_g   <- sqrt(as.numeric(t(G) %*% vcov(fit_ardl) %*% G))
W_stat <- as.numeric(g_hat^2 / se_g^2)
p_chi  <- pchisq(W_stat, df = 1, lower.tail = FALSE)

tibble(
  Quantity = c("g_hat = beta_oil_lag + phi * beta_oil",
               "Standard error (delta method)",
               "Wald statistic (chi^2_1)",
               "p-value"),
  Value = c(round(unname(g_hat), 4), round(se_g, 4),
            round(W_stat, 3), round(p_chi, 4))
) |>
  knitr::kable(caption = "Wald test of the COMFAC restriction on the approval/oil ARDL(1,1).")
Wald test of the COMFAC restriction on the approval/oil ARDL(1,1).
Quantity Value
g_hat = beta_oil_lag + phi * beta_oil -0.0258
Standard error (delta method) 0.0136
Wald statistic (chi^2_1) 3.5960
p-value 0.0579

Result: \(\hat g = -0.026\), \(W \approx 3.6\), \(p \approx 0.06\): borderline. We just barely fail to reject COMFAC at the 5% level. The data do not decisively distinguish between Story A and Story B for our application, and that is a typical empirical situation.

In such cases the choice falls to theory. What does the persistence in \(y\) substantively mean?

  • Story A reading of approval. Today’s approval is fully determined by today’s events; slow autocorrelation reflects that the news cycle is itself persistent — wars drag on, oil shocks take months to filter through gas prices. The AR(1) on the errors absorbs that unmodeled news-cycle drift. People are not anchored on yesterday’s approval.
  • Story B reading of approval. Approval has real state-dependence: today’s evaluations are partly shaped by yesterday’s through anchoring, social transmission, slow updating, or simple cognitive inertia. The LDV captures genuine psychological persistence in opinion.

Neither reading is obviously right. A reasonable analyst commits to one based on theory and reports the LRM with the matching formula; a very reasonable analyst presents both as a robustness check. The methodological sin Cook and Webb (2021) warn against is taking coefficients estimated under one story and plugging them into the other story’s LRM formula — as the figure showed, that conflation can imply long-run effects off by a factor of five.

For the rest of Part 3 we proceed with Story B (LDV via lm()) and read the LRM as \((\beta_0+\beta_1)/(1-\phi)\). Before that, §3.2 starts with the simplest distributed-lag regression — no LDV — and shows how its residuals force the choice on us, first via patches that stay in Story A space (HAC, GLS-AR(1)).

3.2 Distributed lag (DL): finite-horizon dynamics with AR-error patches

The simplest DL adds one lag of the continuous covariate:

\[y_t \;=\; \alpha + \beta_0 x_t + \beta_1 x_{t-1} + \gamma_1 \mathbf{1}\{911\}_t + \gamma_2 \mathbf{1}\{Iraq\}_t + u_t,\]

with \(u_t\) assumed white noise. The shock dummies stay as static controls — they don’t need lags because they’re nonzero for only a few months. The lag structure on \(x\) is what introduces dynamic propagation. In Story-A-vs-B language: this is a distributed-lag in the mean (a variant of Story A’s static structure with an extra lag of \(X\)), not yet an LDV.

fit_dl <- lm(approve ~ oil + oil_lag + sept.oct.2001 + iraq.war, data = df_lab4)
coeftest(fit_dl)
## 
## t test of coefficients:
## 
##                Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   98.982404   3.400564 29.1076 < 2.2e-16 ***
## oil           -0.144284   0.061960 -2.3286 0.0233239 *  
## oil_lag       -0.096806   0.064444 -1.5022 0.1383845    
## sept.oct.2001 17.559949   4.661709  3.7668 0.0003837 ***
## iraq.war       6.975436   3.818350  1.8268 0.0727867 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The contemporaneous oil coefficient is significant (\(\sim -0.14\)); the lagged coefficient is borderline. The cumulative or “long-run” effect of a permanent unit increase in oil under DL is just the sum of the lag coefficients:

\[\theta_{DL} = \beta_0 + \beta_1.\]

Aside: when one lag coefficient is not individually significant

A practical question: if one of the two oil coefficients is not individually significant, can we still interpret the cumulative \(\theta_{DL}\)? Don’t read significance off individual coefficients — test the linear combination directly. The sum \(\hat\beta_0 + \hat\beta_1\) has its own variance, \(\text{Var}(\hat\beta_0) + \text{Var}(\hat\beta_1) + 2\text{Cov}(\hat\beta_0, \hat\beta_1)\), and its own \(t\)-statistic for “is the long-run effect zero?”:

# H0: beta_0 + beta_1 = 0  (no cumulative oil effect)
car::linearHypothesis(fit_dl, "oil + oil_lag = 0")
## 
## Linear hypothesis test:
## oil  + oil_lag = 0
## 
## Model 1: restricted model
## Model 2: approve ~ oil + oil_lag + sept.oct.2001 + iraq.war
## 
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     60 9333.9                                  
## 2     59 2414.8  1    6919.1 169.06 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is distinct from a joint significance test (\(H_0: \beta_0 = \beta_1 = 0\), “does \(X\) matter at any lag?”), which would use linearHypothesis(fit_dl, c("oil = 0", "oil_lag = 0")). The two can give different verdicts: a significant sum with insignificant individuals means we have evidence for a long-run effect but can’t pin down the lag distribution; a rejected joint test with insignificant sum means short-run effects that cancel in the long run. Report the cumulative-effect test alongside individual coefficients whenever the long-run effect is the substantive quantity.
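The variance formula in the aside can be hand-rolled with vcov(): the SE of the sum is \(\sqrt{c'Vc}\) with \(c = (0, 1, 1)'\), which expands to exactly \(\text{Var}(\hat\beta_0) + \text{Var}(\hat\beta_1) + 2\,\text{Cov}(\hat\beta_0, \hat\beta_1)\). A sketch on a simulated regression (hypothetical variable names, true sum \(= 0.8\)):

```r
set.seed(3)
n <- 200
x <- rnorm(n)
x_lag <- c(NA, x[-n])
y <- 2 + 0.5 * x + 0.3 * x_lag + rnorm(n)
fit_sum <- lm(y ~ x + x_lag)                   # lm() drops the one NA row

cvec <- c(0, 1, 1)                             # picks out beta_0 + beta_1
V <- vcov(fit_sum)
sum_hat <- sum(cvec * coef(fit_sum))
se_sum  <- sqrt(drop(t(cvec) %*% V %*% cvec))  # = sqrt(V[2,2] + V[3,3] + 2*V[2,3])
round(c(sum = sum_hat, se = se_sum, t = sum_hat / se_sum), 3)
```

The squared \(t\) here equals the \(F\) statistic car::linearHypothesis would report for the same one-degree-of-freedom restriction.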

But the residuals tell us the DL model is incomplete:

dwtest(fit_dl)
## 
##  Durbin-Watson test
## 
## data:  fit_dl
## DW = 0.4108, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
Box.test(residuals(fit_dl), lag = 12, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(fit_dl)
## X-squared = 78.27, df = 12, p-value = 8.815e-12

Durbin-Watson is near \(0.4\) (\(p < 0.001\)) and Ljung-Box at lag 12 also rejects (\(p < 0.001\)). The residuals carry strong serial dependence — the same high-persistence structure we know is in approval (AR(1) at 0.92 in the ARMAX). DL captured the dynamic effect of oil, but it left the persistence in approval itself unmodeled. We have two routes that stay within Story A (patch the inference with HAC, or model the error dynamics with GLS, leaving the static-mean specification alone), and a third route, covered in §3.3, that moves to Story B.

Patch 1: HAC (Newey-West) standard errors

With autocorrelated residuals, OLS coefficients are still consistent (provided the regressors are exogenous), but the OLS variance formula \(\hat\sigma^2 (X'X)^{-1}\) understates sampling variability — confidence intervals are too narrow and \(t\)-stats too large. The standard remedy is Heteroskedasticity- and Autocorrelation-Consistent (HAC) standard errors, due to Newey and West (1987). HAC replaces the OLS “meat” with a kernel-weighted sum of sample autocovariances of the score:

\[\hat V_{HAC} = (X'X)^{-1} \left( \sum_{j = -L}^{L} k\!\left(\tfrac{j}{L}\right) \hat\Gamma_j \right) (X'X)^{-1}, \qquad \hat\Gamma_j = \tfrac{1}{T}\sum_{t} (X_t \hat\varepsilon_t)(X_{t-j} \hat\varepsilon_{t-j})'.\]

The truncation lag \(L\) controls how far out the sum runs; the kernel \(k(\cdot)\) (Bartlett by default) downweights distant lags so the estimator stays positive-definite. NeweyWest() uses a data-driven default for \(L\) (Newey and West 1994), sensible for \(T \approx 60\).

coeftest(fit_dl, vcov = sandwich::NeweyWest(fit_dl, prewhite = FALSE))
## 
## t test of coefficients:
## 
##                Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   98.982404   7.621352 12.9875 < 2.2e-16 ***
## oil           -0.144284   0.066233 -2.1784  0.033381 *  
## oil_lag       -0.096806   0.056359 -1.7177  0.091102 .  
## sept.oct.2001 17.559949   2.523597  6.9583 3.183e-09 ***
## iraq.war       6.975436   2.081917  3.3505  0.001411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The point estimates are identical to OLS (HAC only patches the variance), but the SEs widen. What HAC does and doesn’t do. It restores valid inference under arbitrary serial correlation, but it does not make OLS efficient and it does not protect against omitted-variable bias. To recover efficiency we either (a) specify the error correlation explicitly and re-estimate by GLS, or (b) move the dynamics into the mean equation (the Story-B move of §3.3).
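The HAC formula above can be hand-rolled with a Bartlett kernel to see exactly what the kernel-weighted sum does. This is a sketch on simulated data with strong serial correlation in both \(x\) and \(u\) — not a replacement for sandwich::NeweyWest(), which adds data-driven lag selection and prewhitening options:

```r
set.seed(5)
n <- 400
x <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))
u <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))
y <- 1 + 2 * x + u

fit_hac <- lm(y ~ x)
X <- cbind(1, x)
e <- residuals(fit_hac)

L <- floor(4 * (n / 100)^(2 / 9))              # a common rule-of-thumb truncation lag
S <- crossprod(X * e) / n                      # Gamma_0: lag-0 score covariance
for (j in 1:L) {
  Gj <- t(X[-(1:j), ] * e[-(1:j)]) %*% (X[1:(n - j), ] * e[1:(n - j)]) / n
  S  <- S + (1 - j / (L + 1)) * (Gj + t(Gj))   # Bartlett weight downweights lag j
}
XtXinv <- solve(crossprod(X) / n)
V_hac  <- XtXinv %*% S %*% XtXinv / n          # sandwich: bread %*% meat %*% bread

c(OLS_se = summary(fit_hac)$coef["x", "Std. Error"],
  HAC_se = sqrt(V_hac[2, 2]))                  # HAC SE comes out wider here
```

The positive autocovariances of the score inflate the "meat," which is why the HAC interval widens relative to OLS when both regressor and error are persistent.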

Patch 2: GLS with AR(1) errors

If we are willing to commit to a model of the error dynamics, we re-estimate the same regression with a structured covariance:

\[y_t = X_t'\beta + u_t, \qquad u_t = \rho\, u_{t-1} + \varepsilon_t, \qquad \varepsilon_t \stackrel{\text{iid}}{\sim} (0, \sigma^2).\]

Feasible GLS estimates \(\rho\) from the OLS residuals, then re-weights observations by an estimate of \(\Omega^{-1/2}\) (the inverse Cholesky of the implied AR(1) covariance). The result is BLUE if the AR(1) assumption is correct — both estimates and SEs are valid and efficient. This is, structurally, exactly the AR-errors model from Story A, fit a different way than Arima().
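A minimal sketch of the mechanics on simulated data (names here are illustrative, not the lab objects): estimate \(\rho\) from the OLS residuals, quasi-difference every variable including the intercept column, and re-run OLS. gls(..., correlation = corAR1()) estimates \(\rho\) and \(\beta\) jointly by (restricted) maximum likelihood rather than in two steps, but the logic is the same Cochrane-Orcutt idea.

```r
set.seed(2)
T   <- 200
x   <- rnorm(T)
u   <- as.numeric(arima.sim(list(ar = 0.7), n = T))  # AR(1) errors
y   <- 2 + 1 * x + u

ols <- lm(y ~ x)
e   <- residuals(ols)
rho <- sum(e[-1] * e[-T]) / sum(e[-T]^2)  # AR(1) coefficient of the residuals

# Quasi-difference y, x, and the intercept column, then re-run OLS
y_s  <- y[-1] - rho * y[-T]
x_s  <- x[-1] - rho * x[-T]
i_s  <- rep(1 - rho, T - 1)
fgls <- lm(y_s ~ 0 + i_s + x_s)
coef(fgls)   # should be close to the true (2, 1)
```

After quasi-differencing, the transformed errors \(\varepsilon_t = u_t - \rho u_{t-1}\) are approximately white noise, so OLS on the starred variables is (approximately) efficient.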

fit_dl_gls <- gls(approve ~ oil + oil_lag + sept.oct.2001 + iraq.war,
                  data = df_lab4, correlation = corAR1())
summary(fit_dl_gls)$tTable |> round(4)
##                 Value Std.Error t-value p-value
## (Intercept)   73.2582    9.7346  7.5256  0.0000
## oil           -0.0714    0.0363 -1.9679  0.0538
## oil_lag       -0.0297    0.0366 -0.8133  0.4193
## sept.oct.2001 11.5642    2.6494  4.3648  0.0001
## iraq.war       6.0833    2.6404  2.3039  0.0248

The estimated AR(1) parameter on the residuals (Phi in the correlation summary) lands around \(0.93\) — the same high persistence we already saw in the ARMAX of Part 2. Side-by-side comparison:

tibble(
  Coefficient     = names(coef(fit_dl)),
  `OLS est`       = round(coef(fit_dl), 3),
  `OLS SE`        = round(sqrt(diag(vcov(fit_dl))), 3),
  `HAC SE`        = round(sqrt(diag(sandwich::NeweyWest(fit_dl, prewhite = FALSE))), 3),
  `GLS-AR(1) est` = round(coef(fit_dl_gls), 3),
  `GLS-AR(1) SE`  = round(sqrt(diag(vcov(fit_dl_gls))), 3)
) |>
  knitr::kable(caption = "DL specification: OLS vs. OLS+HAC vs. GLS-AR(1).")
DL specification: OLS vs. OLS+HAC vs. GLS-AR(1).
Coefficient OLS est OLS SE HAC SE GLS-AR(1) est GLS-AR(1) SE
(Intercept) 98.982 3.401 7.621 73.258 9.735
oil -0.144 0.062 0.066 -0.071 0.036
oil_lag -0.097 0.064 0.056 -0.030 0.037
sept.oct.2001 17.560 4.662 2.524 11.564 2.649
iraq.war 6.975 3.818 2.082 6.083 2.640

OLS and HAC share point estimates by construction; only the SEs differ. GLS estimates differ from OLS because GLS reweights observations using the AR(1) structure; the intervention dummies move the most. GLS SEs are typically smaller than HAC SEs if the AR(1) model is correct — the efficiency gain HAC cannot deliver.

Both HAC and GLS are legitimate fixes within Story A — they accept the static-mean specification and patch (HAC) or reweight (GLS) accordingly. Neither changes the model. §3.3 takes the alternative route: move to Story B by adding the lagged dependent variable, and recover both efficient estimation and propagating dynamics in one move.

3.3 ARDL(1,1): reading the model we just fit

The unrestricted ARDL(1,1) we estimated in §3.1 for the COMFAC test is our workhorse going forward:

\[y_t \;=\; \alpha + \phi\, y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \gamma_1 \mathbf{1}\{911\}_t + \gamma_2 \mathbf{1}\{Iraq\}_t + \varepsilon_t.\]

The “(1,1)” labels the orders: one lag of \(y\), one lag of \(x\) beyond the contemporaneous \(x_t\). The lagged \(y\) does double duty — it captures persistent dynamics in approval that have nothing to do with our regressors, and it propagates the effect of \(x\) forward through time. Today’s approval depends on yesterday’s, and yesterday’s approval was already affected by yesterday’s oil price, so today’s approval inherits a piece of yesterday’s oil shock through the LDV.

coeftest(fit_ardl)
## 
## t test of coefficients:
## 
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   14.0065812  5.4416046  2.5740  0.01263 *  
## approve_lag    0.8347474  0.0514990 16.2090  < 2e-16 ***
## oil           -0.0322845  0.0274584 -1.1758  0.24449    
## oil_lag        0.0011796  0.0282933  0.0417  0.96689    
## sept.oct.2001 17.2786953  1.9994764  8.6416  5.2e-12 ***
## iraq.war       4.1543960  1.6469076  2.5225  0.01442 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Read off the coefficients:

  • \(\hat\phi \approx 0.83\) — the LDV. Yesterday’s approval explains most of today’s; this is the persistence ARMAX put in its AR(1) error and ARDL puts in the lagged \(y\).
  • \(\hat\beta_0 \approx -0.03\), \(\hat\beta_1 \approx 0.001\) — individual oil-price coefficients are small and not significant on their own. This is normal for ARDL on a persistent series. Once the LDV is included, contemporaneous covariates only have to explain the part of \(y_t\) that is not already predicted by \(y_{t-1}\) — a small target. The substantively meaningful quantity is the combined long-run multiplier, computed below in §3.4.
  • \(\hat\gamma_{911} \approx 17.3\), \(\hat\gamma_{\text{Iraq}} \approx 4.2\) — immediate-period effects of each pulse on approval. These then propagate forward through the LDV in subsequent months.

Residual diagnostics on the LDV fit:

dwtest(fit_ardl)
## 
##  Durbin-Watson test
## 
## data:  fit_ardl
## DW = 2.4701, p-value = 0.9274
## alternative hypothesis: true autocorrelation is greater than 0
Box.test(residuals(fit_ardl), lag = 12, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(fit_ardl)
## X-squared = 19.705, df = 12, p-value = 0.07287

Durbin-Watson is near \(2\) (\(p \approx 0.93\)); Ljung-Box at lag 12 has \(p \approx 0.07\) — borderline but acceptable. The autocorrelation that DL had has been absorbed by the LDV — we are solidly in row 3 of the §3.1 OLS-properties table (LDV with white-noise errors), where OLS is consistent and standard errors are correct. This is a precondition for the dynamic-effects calculations that follow.

3.4 Long-run multiplier and speed of adjustment

The headline formulas for ARDL(1,1):

\[ \theta_{LR} \;=\; \frac{\beta_0 + \beta_1}{1 - \phi}, \qquad \lambda \;=\; 1 - \phi, \qquad \text{half-life} \;=\; \frac{\ln(0.5)}{\ln(\phi)}. \]

  • \(\theta_{LR}\): the long-run multiplier — eventual response of \(y\) to a permanent unit increase in \(x\), once all dynamics have played out.
  • \(\lambda\): the speed of adjustment — fraction of the remaining gap to equilibrium closed each period.
  • Half-life: number of periods until the system closes half the remaining gap (the gap decays as \(\phi^t\), so solving \(\phi^h = 0.5\) gives \(h = \ln(0.5)/\ln(\phi)\)).
phi  <- unname(coef(fit_ardl)["approve_lag"])
b0   <- unname(coef(fit_ardl)["oil"])
b1   <- unname(coef(fit_ardl)["oil_lag"])

theta_LR  <- (b0 + b1) / (1 - phi)
lambda    <- 1 - phi
half_life <- log(0.5) / log(phi)

tibble(
  Quantity = c("phi (LDV coefficient)",
               "Long-run multiplier on oil (theta)",
               "Speed of adjustment (lambda = 1 - phi)",
               "Half-life (months)"),
  Value = round(c(phi, theta_LR, lambda, half_life), 3)
) |>
  knitr::kable(caption = "ARDL(1,1) summary statistics for the approval/oil relationship.")
ARDL(1,1) summary statistics for the approval/oil relationship.
Quantity Value
phi (LDV coefficient) 0.835
Long-run multiplier on oil (theta) -0.188
Speed of adjustment (lambda = 1 - phi) 0.165
Half-life (months) 3.837

A permanent one-unit increase in the oil-price index lowers approval by about \(0.19\) percentage points in the long run — about double the static ARMAX coefficient (\(-0.09\)). The contemporaneous coefficient \(\hat\beta_0\) sees only a fraction of this; the LDV pulls in the rest over the following months. Roughly \(17\%\) of the remaining gap closes each month, with a half-life around four months. The system has effectively reached its new equilibrium after about two years.

For substantive context: a \(10\)-unit oil-price increase (roughly the monthly standard deviation in this sample) implies a long-run drop of about \(1.9\) percentage points in approval — modest in any single month, but meaningful when stacked over multiple shocks.

3.5 The impulse response: tracing the dynamic adjustment

The ARDL parameters give us the destination (\(\theta_{LR}\)) and the speed (\(\lambda\)). To see the path — when most of the adjustment happens, when the system effectively converges — we trace the period-by-period response to a unit step shock in \(x\) starting from pre-intervention equilibrium. This is the same simulation logic from Lab 3 Section 2.5, generalized to ARDL.

For the deviation \(\delta_t = y_t - y_{eq}\) from equilibrium under a unit step in \(x\) at \(t = 0\) (held forever after):

\[ \delta_0 \;=\; \beta_0, \qquad \delta_t \;=\; \phi\, \delta_{t-1} + \beta_0 + \beta_1 \quad (t \geq 1). \]

The recursion has a closed-form solution that converges to \(\theta_{LR}\) as \(t \to \infty\) — but the simulation form is more transparent and generalizes immediately to higher-order ARDL.
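For ARDL(1,1) the closed form is \(\delta_t = \theta_{LR} + (\beta_0 - \theta_{LR})\,\phi^t\): subtract \(\theta_{LR}\) from both sides of the recursion and the deviation from the long-run value decays geometrically. A quick numerical check, using illustrative coefficient values close to the Part 3 estimates (suffixed names, deliberately distinct from the lab objects):

```r
# Illustrative values, not the fitted objects
phi_ <- 0.83; b0_ <- -0.032; b1_ <- 0.001
theta_ <- (b0_ + b1_) / (1 - phi_)

H_   <- 36
irf_ <- numeric(H_ + 1)
irf_[1] <- b0_
for (t in 2:(H_ + 1)) irf_[t] <- phi_ * irf_[t - 1] + b0_ + b1_

closed_ <- theta_ + (b0_ - theta_) * phi_^(0:H_)   # closed-form path
max(abs(irf_ - closed_))                           # ~ 0 (floating point)
```

The two paths agree term by term; the simulation form simply avoids re-deriving the closed form every time the lag structure changes.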

H <- 36
irf <- numeric(H + 1)
irf[1] <- b0
for (t in 2:(H + 1)) irf[t] <- phi * irf[t - 1] + b0 + b1

irf_df <- tibble(month = 0:H, response = irf)
hl_idx <- round(half_life)

p3_irf <- ggplot(irf_df, aes(month, response)) +
  geom_hline(yintercept = 0,        linetype = "dotted", color = "grey40") +
  geom_hline(yintercept = theta_LR, linetype = "dashed", color = "firebrick") +
  geom_segment(aes(x = hl_idx, xend = hl_idx, y = 0,
                   yend = irf[hl_idx + 1]),
               linetype = "dotted", color = "grey50") +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "steelblue", size = 1.6) +
  annotate("text", x = H - 0.5, y = theta_LR + 0.012,
           label = sprintf("Long-run multiplier  theta = %.3f", theta_LR),
           color = "firebrick", size = 3.5, hjust = 1) +
  annotate("text", x = hl_idx + 0.5, y = irf[hl_idx + 1] - 0.025,
           label = sprintf("Half-life ~ %.1f months", half_life),
           color = "grey25", size = 3.2, hjust = 0) +
  labs(title = "Step-response of approval to a unit shock in oil price",
       subtitle = "ARDL(1,1) impulse response, deviations from equilibrium",
       x = "Months after shock", y = "Response (percentage points)") +
  theme_minimal(base_size = 12)

p3_irf

The trajectory tells the dynamic story. The immediate effect at \(t = 0\) is small (\(-0.03\), just \(\hat\beta_0\)). Each subsequent month, the LDV transmits a fraction \(\phi\) of yesterday’s deviation forward, while the cumulative oil pressure \((\beta_0 + \beta_1)\) keeps acting. The series converges geometrically to the long-run equilibrium \(\theta = -0.19\).

Reporting at multiple horizons makes the speed of approach concrete:

key_horizons <- c(0, 3, 6, 12, 24, 36)
tibble(
  `Months after shock`        = key_horizons,
  `Cumulative response (pp)`  = round(irf[key_horizons + 1], 3),
  `Fraction of long-run`      =
    sprintf("%.1f%%", 100 * irf[key_horizons + 1] / theta_LR)
) |>
  knitr::kable(caption = "Approach to equilibrium at selected horizons.")
Approach to equilibrium at selected horizons.
Months after shock Cumulative response (pp) Fraction of long-run
0 -0.032 17.2%
3 -0.098 51.8%
6 -0.135 72.0%
12 -0.170 90.5%
24 -0.186 98.9%
36 -0.188 99.9%

About half of the long-run effect is in place within the first three months, \(90\%\) by month 12, and the system is essentially converged by month 24. This dynamic-speed information was nowhere in the ARMAX summary. With the LDV doing the propagation, ARDL gives us not just the destination but the path — which is what most policy and political-economy questions actually want to know.

A note on inference validity

Parametric simulation requires the coefficient estimator’s sampling distribution to be approximately \(\mathcal{N}(\hat\beta, V(\hat\beta))\), which holds when the regression is balanced — both sides share the same order of integration, or are I(1) and cointegrated. Approval is I(0) (Part 2), and the ARDL absorbs any I(1) component of oil through the LDV, in the spirit of the Pesaran-Shin bounds-testing framework, so the condition is satisfied here. Part 5 treats balance and cointegration formally on genuinely I(1) data, and explains exactly when a level-on-level regression is well-behaved.

Confidence intervals via parametric simulation

The idea. Instead of computing the IRF once at \(\hat\beta\), draw many vectors from the sampling distribution of \(\hat\beta\), compute the IRF on each draw, and read pointwise quantiles off the bundle of trajectories. This is the King, Tomz, and Wittenberg (2000) approach to presenting dynamic uncertainty — the same logic §5.5 uses for the cointegration example.

Step 1: one draw, one IRF. vcov(fit_ardl) is the joint covariance of \(\hat\beta\). A single draw \(\beta^{(1)} \sim \mathcal{N}(\hat\beta, V(\hat\beta))\) is one realization of “what the coefficients could plausibly have been.” Plug it into exactly the same recursion as the point IRF:

set.seed(2026)
beta_hat <- coef(fit_ardl)
V        <- vcov(fit_ardl)

# One draw from the joint sampling distribution of beta_hat
beta_draw <- MASS::mvrnorm(1, mu = beta_hat, Sigma = V)

# Same recursion as the point IRF, but with this draw's coefficients
phi_d <- beta_draw["approve_lag"]
b0_d  <- beta_draw["oil"]
b1_d  <- beta_draw["oil_lag"]

irf_one <- numeric(H + 1)
irf_one[1] <- b0_d
for (t in 2:(H + 1)) irf_one[t] <- phi_d * irf_one[t - 1] + b0_d + b1_d

tibble(month     = 0:5,
       point     = round(irf[1:6],     4),
       one_draw  = round(irf_one[1:6], 4))
## # A tibble: 6 × 3
##   month   point one_draw
##   <int>   <dbl>    <dbl>
## 1     0 -0.0323  -0.0404
## 2     1 -0.0581  -0.0793
## 3     2 -0.0796  -0.111 
## 4     3 -0.0975  -0.137 
## 5     4 -0.112   -0.159 
## 6     5 -0.125   -0.176

Different draws give different trajectories — pointwise quantiles across \(S\) draws give the confidence band.

Step 2: a reusable function. Wrap the procedure so it generalizes to any dynamic quantity fn(beta, H) derivable from the fitted coefficients (re-used for the GECM in §4):

# Parametric simulation of any dynamic quantity fn(beta, H) for a fitted model.
# Returns tibble(horizon, median, lo, hi) with pointwise (1 - alpha) bands.
simulate_dynamic <- function(fit, fn, H, S = 1000, alpha = 0.05, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  beta_hat <- coef(fit)
  V        <- vcov(fit)

  # S x p matrix of coefficient draws from the multivariate normal
  draws <- MASS::mvrnorm(n = S, mu = beta_hat, Sigma = V)

  # Apply fn() to each draw -> S x (H+1) matrix of trajectories
  sim_mat <- t(apply(draws, 1, fn, H = H))

  tibble(
    horizon = 0:H,
    median  = apply(sim_mat, 2, median),
    lo      = apply(sim_mat, 2, quantile, probs = alpha / 2),
    hi      = apply(sim_mat, 2, quantile, probs = 1 - alpha / 2)
  )
}

# IRF for ARDL(1,1) given a coefficient vector
ardl11_irf <- function(beta, H) {
  phi <- beta["approve_lag"]
  b0  <- beta["oil"]
  b1  <- beta["oil_lag"]

  irf <- numeric(H + 1)
  irf[1] <- b0
  for (t in 2:(H + 1)) irf[t] <- phi * irf[t - 1] + b0 + b1
  irf
}

Step 3: apply and plot.

irf_band <- simulate_dynamic(fit_ardl, fn = ardl11_irf, H = 36,
                             S = 1000, seed = 2026)
head(irf_band)
## # A tibble: 6 × 4
##   horizon  median      lo       hi
##     <int>   <dbl>   <dbl>    <dbl>
## 1       0 -0.0307 -0.0869  0.0208 
## 2       1 -0.0581 -0.115  -0.00173
## 3       2 -0.0796 -0.144  -0.00935
## 4       3 -0.0984 -0.168  -0.0133 
## 5       4 -0.113  -0.187  -0.0180 
## 6       5 -0.125  -0.202  -0.0204
ggplot(irf_band, aes(horizon, median)) +
  geom_hline(yintercept = 0,        linetype = "dotted", color = "grey40") +
  geom_hline(yintercept = theta_LR, linetype = "dashed", color = "firebrick") +
  geom_ribbon(aes(ymin = lo, ymax = hi), fill = "steelblue", alpha = 0.25) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "steelblue", size = 1.4) +
  annotate("text", x = 35, y = theta_LR + 0.012,
           label = sprintf("Long-run multiplier  theta = %.3f", theta_LR),
           color = "firebrick", size = 3.5, hjust = 1) +
  labs(title = "Step-response of approval to a unit shock in oil price (95% CI)",
       subtitle = "ARDL(1,1) parametric simulation, S = 1000",
       x = "Months after shock", y = "Response (percentage points)") +
  theme_minimal(base_size = 12)

The band widens with horizon — uncertainty in \(\hat\phi\) and \(\hat\beta_0 + \hat\beta_1\) compounds through the recursion, and the long-run multiplier \(\theta_{LR}\) is itself a non-linear function of the coefficients. The horizon-0 band straddles zero (the contemporaneous effect alone is not significant), but from horizon 1 onward the 95% band stays clear of zero, so the negative oil effect on approval is statistically distinguishable from “no effect” — modest in magnitude (Section 3.4) but real.


Part 4: ARDL → ECM-Form (Algebraic Bridge)

Part 3 left us with the ARDL(1,1)

\[ y_t = \alpha + \phi\, y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t \tag{ARDL} \]

and three substantively interesting quantities computed from its coefficients:

  1. The long-run multiplier \(\theta_{LR} = (\beta_0 + \beta_1)/(1 - \phi)\) — the destination of the system after a permanent shock to \(x\).
  2. The speed of adjustment \(\lambda = 1 - \phi\) — the fraction of the remaining gap that closes each period.
  3. The impulse-response trajectory — the path between today and that destination, traced by simulation.

The catch is that all three are derived from the fitted ARDL. \(\theta_{LR}\) and \(\lambda\) are non-linear combinations of the coefficients, so their standard errors require the delta method (or a parametric simulation, as in Part 5); the IRF is recovered by recursion. None of these quantities appears directly in the ARDL regression output.
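The delta-method route can be sketched in a few lines. Everything below is simulated and self-contained (fit_sim and the suffixed names are illustrative, not the lab's fit_ardl); the gradient of \(\theta = (\beta_0+\beta_1)/(1-\phi)\) with respect to \((\phi, \beta_0, \beta_1)\) is \(\bigl(\theta/(1-\phi),\; 1/(1-\phi),\; 1/(1-\phi)\bigr)\).

```r
set.seed(3)
n <- 300
x <- rnorm(n)
y <- numeric(n)
for (t in 2:n) y[t] <- 0.7 * y[t - 1] + 0.5 * x[t] + 0.2 * x[t - 1] + rnorm(1)
d_sim   <- data.frame(y = y[-1], y_lag = y[-n], x = x[-1], x_lag = x[-n])
fit_sim <- lm(y ~ y_lag + x + x_lag, data = d_sim)

b       <- coef(fit_sim)
phi_s   <- b["y_lag"]; b0_s <- b["x"]; b1_s <- b["x_lag"]
theta_s <- unname((b0_s + b1_s) / (1 - phi_s))  # true value 0.7/0.3 = 2.33

# Gradient of theta w.r.t. (intercept, phi, beta0, beta1)
g <- c(0,
       theta_s / (1 - phi_s),    # d theta / d phi
       1 / (1 - phi_s),          # d theta / d beta0
       1 / (1 - phi_s))          # d theta / d beta1
se_theta <- drop(sqrt(t(g) %*% vcov(fit_sim) %*% g))
c(theta = theta_s, se = se_theta)
```

The parametric simulation of §3.5 is the alternative that avoids computing the gradient entirely, at the cost of Monte Carlo noise.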

This part rewrites the same model in an algebraically equivalent form — the error-correction (EC) form — that places those quantities as coefficients in a single regression:

\[ \Delta y_t \;=\; \alpha + \beta_0\, \Delta x_t \;-\; \lambda \bigl(y_{t-1} - \theta\, x_{t-1}\bigr) + \gamma_1 \mathbf{1}\{911\}_t + \gamma_2 \mathbf{1}\{Iraq\}_t + \varepsilon_t. \tag{ECM} \]

\(\lambda\) is now a coefficient on the lagged level of \(y\), and the long-run multiplier \(\theta\) appears inside the bracketed deviation \((y_{t-1} - \theta\, x_{t-1})\) — the gap between yesterday’s \(y\) and where it “should” be given yesterday’s \(x\). Same data, same fitted values, identical residuals — just a different way of presenting the same coefficient information. It is an algebraic identity, not a new model.

Why bother. Three reasons.

  1. Direct inference on dynamics. \(\lambda\) is a regression coefficient, not a derived quantity. We get a \(t\)-test on the speed of adjustment for free; \(\theta\) is also identified inside the regression and can be read off directly with a delta-method SE.
  2. Pedagogical clarity. The short-run/long-run decomposition is visible. One part of the regression captures transient dynamics (\(\beta_0 \Delta x_t\) — the immediate change in \(x\)), another part captures the long-run pull back toward equilibrium (\(-\lambda\) times the deviation from the long-run relation \(y - \theta x\)).
  3. Bridge to cointegration. When \(y\) and \(x\) are I(1), the EC form has a structural interpretation: the bracketed term measures deviations from a cointegrating relationship, and \(\lambda\) is the rate at which those deviations get corrected. Approval is stationary, so what we do in this part is the algebra — a reparameterization of the ARDL we already fit. The structural cointegration reading — where the EC form earns its name — comes with new, genuinely I(1) data in Part 5.

This is the De Boef and Keele (2008, eqs. 3–5, pp. 189–190) construction. Cook and Webb (2021, pp. 561–565) document why the same algebra implies that fitting AR-errors and reading off LDV-style multipliers is a model conflation — the EC-form derivation makes the implied long-run structure transparent so the conflation is harder to commit by accident.

4.1 Step-by-step derivation

Start from the ARDL(1,1): \[ y_t = \alpha + \phi\, y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t. \]

Step 1. Subtract \(y_{t-1}\) from both sides: \[ y_t - y_{t-1} = \alpha + (\phi - 1) y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t. \] Define \(\lambda \equiv 1 - \phi\) (the speed of adjustment) and \(\Delta y_t \equiv y_t - y_{t-1}\): \[ \Delta y_t = \alpha - \lambda\, y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t. \]

Step 2. Add and subtract \(\beta_0 x_{t-1}\) on the right: \[ \Delta y_t = \alpha - \lambda\, y_{t-1} + \beta_0\, (x_t - x_{t-1}) + (\beta_0 + \beta_1)\, x_{t-1} + \varepsilon_t. \] With \(\Delta x_t \equiv x_t - x_{t-1}\), this is the single-equation generalized ECM (GECM): \[ \Delta y_t = \alpha + \beta_0 \Delta x_t - \lambda\, y_{t-1} + (\beta_0 + \beta_1)\, x_{t-1} + \varepsilon_t. \tag{GECM} \]

Step 3. Factor the long-run relation by defining \(\theta \equiv (\beta_0+\beta_1)/(1-\phi) = (\beta_0+\beta_1)/\lambda\): \[ \boxed{\;\Delta y_t = \alpha + \beta_0 \Delta x_t - \lambda\,\bigl(y_{t-1} - \theta\, x_{t-1}\bigr) + \varepsilon_t.\;} \tag{ECM} \] The bracketed term \(z_{t-1} \equiv y_{t-1} - \theta\, x_{t-1}\) is the error-correction term: the gap between yesterday’s \(y\) and the long-run relation \(\theta x\).

Three coefficients with clean substantive meaning:

Coefficient Meaning Correspondence / expected sign
\(\beta_0\) on \(\Delta x_t\) Short-run effect: contemporaneous response of \(\Delta y\) to \(\Delta x\) Matches \(\beta_0\) from the ARDL
\(\lambda\) on \(-(y_{t-1} - \theta x_{t-1})\) Speed of adjustment: fraction of the gap closed each period \(0 < \lambda < 1\) for a stationary system; negative sign on the EC term
\(\theta\) in the EC term Long-run multiplier: equilibrium relationship between \(y\) and \(x\) Matches \((\beta_0+\beta_1)/(1-\phi)\) from the ARDL

The (ARDL), (GECM), and (ECM) equations are the same model — three notations, identical fit. Switching among them is purely an interpretive convenience.

4.2 Estimating the GECM directly

Step 2’s (GECM) form is what we actually estimate by OLS, because the EC form (Step 3) has \(\theta\) inside a nonlinear function of the level coefficients. We let lm() fit the linear GECM and back out \(\theta\) from the coefficient ratio.

df_lab4 <- df_lab4 |>
  mutate(d_approve = approve - approve_lag,
         d_oil     = oil     - oil_lag)
fit_gecm <- lm(d_approve ~ d_oil + approve_lag + oil_lag +
                 sept.oct.2001 + iraq.war,
               data = df_lab4)
coeftest(fit_gecm)
## 
## t test of coefficients:
## 
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   14.006581   5.441605  2.5740 0.012630 *  
## d_oil         -0.032285   0.027458 -1.1758 0.244494    
## approve_lag   -0.165253   0.051499 -3.2089 0.002172 ** 
## oil_lag       -0.031105   0.015201 -2.0462 0.045277 *  
## sept.oct.2001 17.278695   1.999476  8.6416  5.2e-12 ***
## iraq.war       4.154396   1.646908  2.5225 0.014417 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Read the GECM coefficients in the language of the (GECM) form:

  • d_oil coefficient \(\hat\beta_0\): the short-run effect — how much approval changes contemporaneously per unit change in oil price.
  • approve_lag coefficient: this estimates \(-\lambda\) (negative of the speed of adjustment). We want it negative — a negative coefficient on \(y_{t-1}\) means the system pulls \(y\) down when it is above equilibrium and up when below.
  • oil_lag coefficient: this estimates \(\beta_0 + \beta_1\) (the combined effect of the lagged oil level). Together with the lagged-\(y\) coefficient, these two pin down the long-run multiplier.

The implied dynamic quantities:

gy <- coef(fit_gecm)["approve_lag"]  # = -lambda
gx <- coef(fit_gecm)["oil_lag"]      # = beta_0 + beta_1
b0 <- coef(fit_gecm)["d_oil"]        # short-run effect

lambda_hat <- -gy
theta_hat  <- -gx / gy
phi_hat    <- 1 - lambda_hat

tibble(
  Quantity = c("Speed of adjustment (lambda = 1 - phi)",
               "Long-run multiplier (theta)",
               "Implied LDV coefficient (phi)",
               "Short-run effect (beta_0 on d_oil)",
               "Half-life (months)"),
  Value = round(c(lambda_hat, theta_hat, phi_hat, b0,
                  log(0.5)/log(phi_hat)), 4)
) |>
  knitr::kable(caption = "Dynamic quantities backed out of the GECM coefficients.")
Dynamic quantities backed out of the GECM coefficients.
Quantity Value
Speed of adjustment (lambda = 1 - phi) 0.1653
Long-run multiplier (theta) -0.1882
Implied LDV coefficient (phi) 0.8347
Short-run effect (beta_0 on d_oil) -0.0323
Half-life (months) 3.8375

These match exactly what we computed from the ARDL(1,1) in Part 3 — as they must, because the two parameterizations are algebraically identical.

4.3 Verification: ARDL and GECM are the same model

To make the equivalence concrete, fit the ARDL again and compare.

fit_ardl_p4 <- lm(approve ~ approve_lag + oil + oil_lag +
                    sept.oct.2001 + iraq.war,
                  data = df_lab4)

tibble(
  Diagnostic = c("Residual standard error",
                 "Sum of squared residuals",
                 "Maximum |residual difference| (ARDL vs GECM)"),
  ARDL  = c(round(summary(fit_ardl_p4)$sigma, 4),
            round(sum(residuals(fit_ardl_p4)^2), 4),
            NA),
  GECM  = c(round(summary(fit_gecm)$sigma, 4),
            round(sum(residuals(fit_gecm)^2), 4),
            NA),
  `ARDL − GECM` = c(NA, NA,
            signif(max(abs(residuals(fit_ardl_p4) - residuals(fit_gecm))), 3))
) |>
  knitr::kable(caption = "ARDL and GECM produce identical fits to numerical precision.")
ARDL and GECM produce identical fits to numerical precision.
Diagnostic ARDL GECM ARDL − GECM
Residual standard error 2.7439 2.7439 NA
Sum of squared residuals 436.6754 436.6754 NA
Maximum |residual difference| (ARDL vs GECM) NA NA 0

Residuals match to floating-point precision. The two equations are different parameterizations of the same dynamic model, not different models. Choose whichever form makes the substantive question easier to answer:

  • Parameter-counting question (“does the law have an immediate effect?”): ARDL gives \(\beta_0\) directly with a \(t\)-test.
  • Speed question (“how fast does the system correct toward equilibrium?”): GECM gives \(\lambda\) directly with a \(t\)-test on approve_lag.
  • Long-run question (“what is the equilibrium response?”): both give the same multiplier — ARDL via \((\beta_0+\beta_1)/(1-\phi)\), GECM via the negative ratio of the oil_lag coefficient to the approve_lag coefficient.

4.4 The two-step (Engle-Granger style) ECM

A close cousin of the GECM imposes the long-run relationship in a first stage and then estimates only the dynamic correction in a second stage. This is the form most directly tied to cointegration analysis — and it is what we will use in Part 5 for the US fiscal-sustainability example.

The two steps are:

Stage 1. Estimate the long-run relationship by static OLS: \[ y_t = \mu + \theta\, x_t + z_t. \tag{Stage 1} \] The residual \(\hat z_t = y_t - \hat\mu - \hat\theta\, x_t\) is the cointegrating residual — the gap between \(y\) and its long-run relationship with \(x\). Under cointegration this residual is stationary even when \(y\) and \(x\) individually are not (Part 5).

Stage 2. Estimate the dynamic correction by regressing \(\Delta y_t\) on \(\Delta x_t\) and the lagged Stage-1 residual: \[ \Delta y_t = a + b_0\, \Delta x_t + \rho\, \hat z_{t-1} + \varepsilon_t. \tag{Stage 2} \] The coefficient \(\rho\) is \(-\lambda\): it tells us what fraction of yesterday’s deviation gets corrected today.

# Stage 1: long-run relationship by static OLS
stage1 <- lm(approve ~ oil, data = df_lab4)
df_lab4 <- df_lab4 |>
  mutate(z_hat     = residuals(stage1),
         z_hat_lag = dplyr::lag(z_hat, 1))

# Long-run multiplier from Stage 1
theta_stage1 <- coef(stage1)["oil"]

# Stage 2: dynamic correction
fit_2step <- lm(d_approve ~ d_oil + z_hat_lag + sept.oct.2001 + iraq.war,
                data = df_lab4 |> drop_na())
coeftest(fit_2step)
## 
## t test of coefficients:
## 
##                Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   -0.962851   0.372704 -2.5834  0.012325 *  
## d_oil         -0.033958   0.027892 -1.2175  0.228351    
## z_hat_lag     -0.170037   0.052495 -3.2391  0.001986 ** 
## sept.oct.2001 16.969486   2.007533  8.4529 1.072e-11 ***
## iraq.war       4.012653   1.661927  2.4145  0.018934 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Stage-1 long-run multiplier \(\hat\theta\) is similar in spirit to the GECM-derived value but not identical — Stage 1 estimates \(\theta\) off the levels regression alone, ignoring short-run dynamics, while the GECM/ARDL estimates \(\theta\) jointly with the dynamics. With well-behaved I(1) data the two converge as \(T \to \infty\) (Engle and Granger 1987), but in finite samples they differ. For our (stationary) approval data they will not match exactly, which is itself a useful diagnostic — it tells us we should not be relying on the cointegration interpretation here.

The Stage-2 coefficient on z_hat_lag is the analog of \(-\lambda\) from the GECM. Compare:

tibble(
  Quantity            = c("Long-run multiplier (theta)",
                          "Speed of adjustment (lambda)"),
  `GECM (single-eq)`  = c(round(theta_hat,    4),
                          round(lambda_hat,   4)),
  `Engle-Granger 2-step` = c(round(unname(theta_stage1),                       4),
                             round(-unname(coef(fit_2step)["z_hat_lag"]),      4))
) |>
  knitr::kable(caption =
    "GECM vs Engle-Granger two-step. Differences shrink with cointegration; here approval is stationary, so they do not exactly match.")
GECM vs Engle-Granger two-step. Differences shrink with cointegration; here approval is stationary, so they do not exactly match.
Quantity GECM (single-eq) Engle-Granger 2-step
Long-run multiplier (theta) -0.1882 -0.2455
Speed of adjustment (lambda) 0.1653 0.1700

4.5 The cointegration caveat

The (ECM) equation is algebraically valid for any ARDL(1,1) — stationary or not. What it does not do, on its own, is establish that \(z_t = y_t - \theta x_t\) is a meaningful equilibrium error. Three cases to keep distinct:

  1. \(y\) and \(x\) are individually stationary (I(0)). This is our approval/oil case. The ECM is just a different way of writing the same dynamic regression. The “long-run multiplier” \(\theta\) is well defined as a property of the joint distribution; the EC term has no special structural meaning beyond “deviation from \(\theta x\).”

  2. \(y\) and \(x\) are individually I(1) and cointegrated. Now \(\theta\) is the cointegrating coefficient — the unique linear combination that produces a stationary residual \(z_t\). The ECM has a structural interpretation: \(\lambda > 0\) is the system’s mean-reversion speed back to the cointegrating relation. Under cointegration the OLS estimator of \(\theta\) in Stage 1 is super-consistent (Engle and Granger 1987): it converges at rate \(T\) rather than \(\sqrt{T}\), even though the levels regression is “spurious” in the standard sense.

  3. \(y\) and \(x\) are individually I(1) but not cointegrated. Now there is no stable long-run relation; the ECM is misspecified; the levels regression \(y_t = \mu + \theta x_t + z_t\) is a textbook spurious regression with non-stationary residuals. Fitting an ECM in this case will produce coefficients that “look fine” but describe nothing real. Estimating an ECM is not the same as having found cointegration (Philips 2018, p. 232, fn. 7).
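The super-consistency claim in case 2 is easy to see by simulation: regress a cointegrated pair (\(y_t = 2x_t + \varepsilon_t\) with \(x_t\) a random walk) in levels at increasing sample sizes, and the average estimation error shrinks roughly in proportion to \(1/T\) rather than \(1/\sqrt{T}\). A small Monte Carlo sketch (illustrative parameters):

```r
set.seed(5)
coint_slope_err <- function(T) {
  x <- cumsum(rnorm(T))       # I(1) regressor (random walk)
  y <- 2 * x + rnorm(T)       # cointegrated: true theta = 2, residual I(0)
  abs(coef(lm(y ~ x))["x"] - 2)
}
Ts  <- c(100, 400, 1600)
err <- sapply(Ts, function(T) mean(replicate(200, coint_slope_err(T))))
round(setNames(err, paste0("T=", Ts)), 5)
```

Quadrupling \(T\) cuts the average error by roughly a factor of four, not the factor of two that \(\sqrt{T}\)-consistency would give.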

So: this part shows you the algebra; Part 5 shows you the test. We move to a genuinely I(1) example — quarterly US federal receipts and outlays — and ask whether the long-run fiscal relation cointegrates. If yes, the ECM machinery we built here gets a structural reading. If no, we have to interpret in differences only.


Part 5 (if time permits): Cointegration and a Real ECM (US Fiscal Sustainability)

In Part 4 we wrote the algebra of the error-correction model (ECM) on stationary data. The ECM there was a parameterization of the ARDL, not a structural claim — there was no I(1) noise to correct, no equilibrium relation in the sense the cointegration literature uses the term. The mechanics are unchanged when we move to genuinely I(1) data, but now the EC term acquires a meaning: it measures deviations from a cointegrating relationship and \(\lambda\) becomes the speed at which those deviations are corrected.

The application is US federal fiscal sustainability. The intertemporal government budget constraint says that, for the government’s debt to remain bounded, federal receipts and expenditures must move together in the long run — they must be cointegrated (Hakkio and Rush 1991). When cointegrated, deviations between the two are mean-reverting; when not, debt explodes. So the question “is US fiscal policy sustainable?” reduces to a testable cointegration claim about two I(1) macro series.

We use quarterly data 1947Q1–2025Q4, downloaded from FRED:

  • FGRECPT — Federal Government Current Receipts (billions of dollars).
  • FGEXPND — Federal Government Current Expenditures (billions of dollars).

Both series grow exponentially over time. We work with logs, which stabilizes the variance and converts the growth into a roughly linear trend.

Balance: when is a regression in levels well-behaved?

Before testing cointegration, we need a methodological preamble that organizes the rest of Part 5. A regression is balanced when both sides have the same order of integration:

\[\underbrace{y_t}_{I(d_y)} \;=\; X_t'\beta + u_t \quad \text{requires}\quad d_y = d_X, \text{ or cointegration when } d > 0.\]

Why this matters depends on the orders involved:

  • Both I(0). Standard. OLS is consistent and \(\hat\beta \to \mathcal{N}(\beta, V(\hat\beta))\) at the usual \(\sqrt{T}\) rate. Inference is conventional, \(t\)-statistics are pivotal, and vcov(fit) is a valid input to parametric simulation. This is the case in §3.3’s approval-oil ARDL.
  • Both I(1) but the residual is I(1). Spurious regression (Granger and Newbold 1974; Phillips 1986). \(\hat\beta\) does not converge in any usual sense; \(t\)-statistics diverge with \(T\); \(R^2\) is inflated. Two independent random walks will routinely produce a “highly significant” relationship that is pure stochastic accident. Standard inference is invalid; the cure when no long-run relation exists is to difference both sides.
  • Both I(1) and the residual is I(0). Cointegration. \(\hat\beta\) is super-consistent — it converges at rate \(T\), faster than the usual \(\sqrt{T}\). The two non-stationary series share a common stochastic trend, and the cointegrating coefficient pins down the long-run relationship. The residual is a stationary deviation around that long-run, which the ECM uses as its error-correction term.
  • Mixed orders (\(d_y \neq d_X\)). Only well-behaved under specific configurations: an ARDL with the LDV on the right, or the Pesaran–Shin–Smith bounds-testing setup, both of which can absorb the higher-order regressor. Without one of those, test statistics are not pivotal and standard inference fails.

Cointegration is therefore the I(1) version of balance. Without it, regressing one trending series on another is the spurious-regression trap; with it, the same regression recovers a well-defined long-run relationship and the ECM gives you the dynamics of return-to-equilibrium.
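The spurious-regression trap is easy to see by simulation. A minimal sketch (illustrative only, not part of the fiscal analysis): regress one random walk on an independent one, then build a cointegrated pair for contrast.

```r
set.seed(123)
T_sim <- 300

# Two independent random walks: both I(1), not cointegrated
x_rw <- cumsum(rnorm(T_sim))
y_rw <- cumsum(rnorm(T_sim))   # generated independently of x_rw

fit_spurious <- lm(y_rw ~ x_rw)
# Spurious fits routinely produce |t| >> 2 despite independence
summary(fit_spurious)$coefficients["x_rw", "t value"]

# A cointegrated pair: y is a scaled random walk plus stationary noise
x_ci <- cumsum(rnorm(T_sim))
y_ci <- 2 * x_ci + rnorm(T_sim)    # residual is I(0) by construction

fit_ci   <- lm(y_ci ~ x_ci)
coef(fit_ci)[["x_ci"]]             # super-consistent: very close to 2
resid_ci <- residuals(fit_ci)      # stationary, mean-reverting around zero
```

Re-running the first regression with fresh seeds shows how often pure stochastic accident passes a conventional significance test.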

The Engle-Granger procedure operationalizes this distinction in three steps:

  1. Verify each series is I(1) (ADF/KPSS on levels and on differences).
  2. Estimate the candidate long-run relationship by OLS in levels and test whether its residual is stationary — that is the cointegration test.
  3. If cointegration holds, estimate the ECM (one-step GECM, or the two-step EG that uses the Stage-1 residual).

The next three subsections walk through each step on the fiscal-sustainability data.

5.1 The data

rec <- read_csv("data/FGRECPT.csv", show_col_types = FALSE) |>
  rename(date = 1, receipts = FGRECPT)
exp_ <- read_csv("data/FGEXPND.csv", show_col_types = FALSE) |>
  rename(date = 1, expenditures = FGEXPND)

fiscal <- inner_join(rec, exp_, by = "date") |>
  mutate(log_receipts = log(receipts),
         log_outlays  = log(expenditures))

nrow(fiscal)
## [1] 316
range(fiscal$date)
## [1] "1947-01-01" "2025-10-01"
fiscal |>
  pivot_longer(c(log_receipts, log_outlays),
               names_to = "series", values_to = "value") |>
  mutate(series = recode(series,
                         log_receipts = "log Receipts",
                         log_outlays  = "log Expenditures")) |>
  ggplot(aes(date, value, color = series)) +
  geom_line(linewidth = 0.9) +
  scale_color_manual(values = c("log Receipts" = "steelblue",
                                "log Expenditures" = "firebrick")) +
  labs(title = "US Federal Receipts and Expenditures, 1947Q1–2025Q4 (logs)",
       subtitle = "Two trending series that may share a long-run equilibrium",
       y = "Log of nominal billions of dollars",
       x = NULL, color = NULL) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

The two log series move closely together for ~80 years. The visible gap (expenditures consistently above receipts) is the persistent deficit. The question is whether the gap is mean-reverting: do expenditures and receipts drift back into alignment after temporary divergences, or does the deficit grow without bound?

5.2 Step 1 — Are both series I(1)?

The cointegration framework requires that each individual series be I(1) (non-stationary in levels, stationary in first differences). Test that first.

adf.test(fiscal$log_receipts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  fiscal$log_receipts
## Dickey-Fuller = -1.2606, Lag order = 6, p-value = 0.8882
## alternative hypothesis: stationary
adf.test(fiscal$log_outlays)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  fiscal$log_outlays
## Dickey-Fuller = -1.1382, Lag order = 6, p-value = 0.9152
## alternative hypothesis: stationary
adf.test(diff(fiscal$log_receipts))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(fiscal$log_receipts)
## Dickey-Fuller = -7.2248, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
adf.test(diff(fiscal$log_outlays))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(fiscal$log_outlays)
## Dickey-Fuller = -7.6744, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

ADF on the levels fails to reject the unit-root null for both series (\(p\)-values around \(0.9\)). ADF on the first differences rejects strongly (the reported \(p = 0.01\) is the smallest value tseries tabulates; the true \(p\)-values are smaller still). A KPSS test with a trend-stationary null on the levels also rejects stationarity — the opposite null, pointing the same way. The verdict is unambiguous: both log_receipts and log_outlays are I(1).
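The KPSS check referenced above can be run with tseries::kpss.test; a sketch (exact statistics will depend on the data vintage downloaded from FRED):

```r
# KPSS with a trend-stationary null: rejection (small p-value) says the
# series is NOT stationary around a deterministic trend
kpss.test(fiscal$log_receipts, null = "Trend")
kpss.test(fiscal$log_outlays,  null = "Trend")

# Complementary check on the differences: failing to reject here is
# consistent with the first differences being stationary
kpss.test(diff(fiscal$log_receipts), null = "Level")
kpss.test(diff(fiscal$log_outlays),  null = "Level")
```

Note that kpss.test truncates printed p-values at 0.01 and 0.1, so read verdicts, not exact magnitudes.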

We can now ask whether they are cointegrated.

5.3 Step 2 — The Engle-Granger cointegration test

If \(y_t\) and \(x_t\) are both I(1), they are cointegrated if there exists a linear combination \(z_t = y_t - \theta x_t\) that is stationary (I(0)). The cointegrating coefficient \(\theta\) is what the Stage-1 long-run regression estimates.

Engle and Granger (1987) propose a two-step test:

  1. Stage 1. Estimate \(\theta\) by static OLS: \(y_t = \mu + \theta x_t + z_t\). Obtain residuals \(\hat z_t\).
  2. Cointegration test. Run an ADF test on \(\hat z_t\). If we can reject the unit-root null on the residuals, the linear combination is stationary and the series are cointegrated.

Critical values for the Engle-Granger test differ from the standard ADF table — they account for the fact that we are testing a residual whose coefficient was estimated from the same data. MacKinnon (1991, 2010) provides the relevant tables; the approximate 5% critical value for two variables and no trend in the cointegrating regression is \(-3.34\). Because adf.test consults the ordinary ADF tables, its printed p-value is not calibrated for residual-based testing; compare the test statistic to \(-3.34\) directly.

# Stage 1: long-run regression
stage1 <- lm(log_receipts ~ log_outlays, data = fiscal)
coeftest(stage1)
## 
## t test of coefficients:
## 
##              Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept) 0.1694256  0.0265661   6.3775 6.432e-10 ***
## log_outlays 0.9508386  0.0039439 241.0905 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
theta_hat <- coef(stage1)["log_outlays"]
fiscal$z_hat <- residuals(stage1)

# Cointegration test: ADF on the residuals
adf_z <- adf.test(fiscal$z_hat)
adf_z
## 
##  Augmented Dickey-Fuller Test
## 
## data:  fiscal$z_hat
## Dickey-Fuller = -4.3992, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

The cointegrating coefficient is \(\hat\theta \approx 0.95\) — close to (but slightly below) one. The interpretation: in the long run, a one-percent increase in expenditures is associated with about a 0.95-percent increase in receipts. Receipts cover almost — but not quite — all expenditures in the long-run equilibrium. The remaining ~5% is, in expectation, the long-run share of expenditures financed by debt.

The ADF statistic on the Stage-1 residuals is about \(-4.4\), well below MacKinnon’s 5% critical value of \(-3.34\). We reject the unit-root null on \(\hat z_t\): receipts and expenditures are cointegrated. There is a stable long-run fiscal relationship; deviations correct over time.

ggplot(fiscal, aes(date, z_hat)) +
  geom_hline(yintercept = 0, color = "grey40", linetype = "dashed") +
  geom_line(color = "steelblue") +
  labs(title = "Cointegrating residual z_hat = log_receipts - 0.951 * log_outlays - 0.169",
       subtitle = "Stationary mean-reverting deviations from the long-run fiscal relation",
       y = "Cointegrating residual", x = NULL) +
  theme_minimal(base_size = 12)

The residual hovers around zero with mean-reverting fluctuations — visually consistent with the ADF verdict. Notable departures (e.g., the surge in 2008–09 and 2020–21) reflect recession-driven deficit spikes; in each case the residual eventually returns toward zero.

5.4 Step 3 — Estimate the ECM

With cointegration confirmed, the Engle-Granger Stage-2 ECM gives the dynamics of correction:

\[ \Delta y_t \;=\; a + b_0 \Delta x_t \;+\; \rho\, \hat z_{t-1} \;+\; \varepsilon_t, \]

where \(\rho = -\lambda\). (The single-equation GECM yields the same \(\lambda\) and a similar \(\theta\); here we estimate the two-step version, using the Stage-1 residual.)

fiscal <- fiscal |>
  mutate(d_log_receipts = log_receipts - dplyr::lag(log_receipts, 1),
         d_log_outlays  = log_outlays  - dplyr::lag(log_outlays,  1),
         z_lag          = dplyr::lag(z_hat, 1)) |>
  drop_na()

fit_ecm <- lm(d_log_receipts ~ d_log_outlays + z_lag, data = fiscal)
coeftest(fit_ecm)
## 
## t test of coefficients:
## 
##                 Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)    0.0167842  0.0018896  8.8826 < 2.2e-16 ***
## d_log_outlays -0.0713822  0.0368275 -1.9383  0.053490 .  
## z_lag         -0.0588984  0.0165036 -3.5688  0.000415 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lambda_hat   <- -unname(coef(fit_ecm)["z_lag"])
half_life_q  <- log(0.5) / log(1 - lambda_hat)
short_run_b0 <- unname(coef(fit_ecm)["d_log_outlays"])

tibble(
  Quantity = c("Long-run multiplier (theta, from Stage 1)",
               "Speed of adjustment (lambda) per quarter",
               "Half-life of deviations (quarters)",
               "Short-run effect (b0 on d_log_outlays)"),
  Value = round(c(unname(theta_hat), lambda_hat, half_life_q, short_run_b0), 4)
) |>
  knitr::kable(caption = "Cointegration and ECM summary (US fiscal data).")
Cointegration and ECM summary (US fiscal data).
Quantity Value
Long-run multiplier (theta, from Stage 1) 0.9508
Speed of adjustment (lambda) per quarter 0.0589
Half-life of deviations (quarters) 11.4185
Short-run effect (b0 on d_log_outlays) -0.0714

Reading the dynamics. Each quarter, about \(5.9\%\) of the gap between receipts and the long-run relationship \(0.951 \cdot \text{log\_outlays} + 0.169\) gets corrected. The half-life of a deviation is about \(11.4\) quarters — roughly three years. After a fiscal shock (e.g., a recession-driven jump in expenditures or a tax-cut–driven drop in receipts), the system takes about a decade to converge \(\sim 90\%\) of the way back to equilibrium. This is slow but bounded — exactly what fiscal sustainability requires.

The short-run coefficient on \(\Delta\text{log\_outlays}\) is negative and only marginally significant. Substantively, this reflects the cyclical nature of fiscal policy: when expenditures jump quarter-on-quarter (often during recessions), receipts tend to fall in the same quarter (also recession-driven). The long-run cointegration (\(\theta \approx 0.95\)) handles the equilibrium; the short-run dynamic captures the within-quarter cyclical co-movement.
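The half-life computed above follows from the geometric decay the error-correction term implies for a deviation \(z\), holding \(x\) fixed:

\[ z_{t+h} \approx (1-\lambda)^h z_t \quad\Longrightarrow\quad (1-\hat\lambda)^{h^*} = \tfrac{1}{2} \quad\Longrightarrow\quad h^* = \frac{\ln(1/2)}{\ln(1-\hat\lambda)} = \frac{\ln 0.5}{\ln 0.9411} \approx 11.4 \text{ quarters}. \]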

5.5 Parametric simulation: uncertainty around the dynamic quantities

The point estimates above hide uncertainty. The cointegrating coefficient \(\theta\), the speed of adjustment \(\lambda\), and the half-life are all estimates with sampling distributions. We use parametric simulation (King, Tomz, and Wittenberg 2000) to propagate that uncertainty into the dynamic quantities. The recipe:

  1. Draw \(N\) coefficient vectors from the multivariate-normal sampling distribution of the ECM coefficients.
  2. For each draw, compute the implied dynamic quantities (\(\lambda\), half-life, simulated impulse response).
  3. Summarize across draws — report point estimates plus 95% intervals.
set.seed(512)
n_sims <- 2000

# Draw from the ECM coefficient sampling distribution
sim_coefs <- mvrnorm(n = n_sims, mu = coef(fit_ecm), Sigma = vcov(fit_ecm))

# For each draw, compute lambda, half-life, and a 24-quarter step-response
H <- 24
sim_lambda  <- -sim_coefs[, "z_lag"]
sim_b0      <-  sim_coefs[, "d_log_outlays"]

# Simulated step-response of log_receipts to a permanent +1 shock
# in log_outlays, expressed as deviations from pre-shock equilibrium.
# Using GECM-equivalent dynamics with theta fixed at Stage-1 estimate:
sim_irf <- matrix(NA_real_, nrow = n_sims, ncol = H + 1)
for (i in seq_len(n_sims)) {
  lam <- sim_lambda[i]; b0 <- sim_b0[i]; th <- unname(theta_hat)
  # cumulative response of log_receipts to a permanent +1 shock in log_outlays
  resp <- numeric(H + 1)
  resp[1] <- b0  # impact (horizon-0) response: d_log_receipts moves by b0
  for (t in 2:(H + 1)) {
    # error-correction: pull resp toward th (the long-run multiplier)
    resp[t] <- resp[t-1] - lam * (resp[t-1] - th)
  }
  sim_irf[i, ] <- resp
}

# Summarize: 95% CI per horizon
irf_summary <- tibble(
  horizon = 0:H,
  mean    = colMeans(sim_irf),
  lo      = apply(sim_irf, 2, quantile, 0.025),
  hi      = apply(sim_irf, 2, quantile, 0.975)
)
ggplot(irf_summary, aes(horizon, mean)) +
  geom_hline(yintercept = unname(theta_hat),
             linetype = "dashed", color = "firebrick") +
  geom_ribbon(aes(ymin = lo, ymax = hi), fill = "steelblue", alpha = 0.25) +
  geom_line(color = "steelblue", linewidth = 1) +
  annotate("text", x = H - 1, y = unname(theta_hat) + 0.01,
           label = sprintf("Long-run multiplier theta ≈ %.3f", theta_hat),
           color = "firebrick", size = 3.5, hjust = 1) +
  labs(title = "Cumulative response of log_receipts to a +1 shock in log_outlays",
       subtitle = "ECM dynamics with parametric-simulation 95% CI (n_sims = 2000)",
       x = "Quarters after shock",
       y = "Cumulative response (logs)") +
  theme_minimal(base_size = 12)

The trajectory climbs from the slightly negative impact response toward \(\hat\theta\) over many quarters. The 95% ribbon widens as the horizon extends, reflecting accumulated uncertainty about \(\lambda\). By quarter 24 (six years), most of the long-run adjustment has occurred, but the band is still wide — even with more than 300 quarterly observations, the speed of adjustment is hard to pin down precisely on macro data.

tibble(
  Quantity = c("Long-run multiplier (theta)",
               "Speed of adjustment (lambda) per quarter",
               "Half-life (quarters)"),
  Mean = round(c(unname(theta_hat),
                 mean(sim_lambda),
                 mean(log(0.5) / log(1 - sim_lambda[sim_lambda > 0 & sim_lambda < 1]))), 3),
  `2.5%` = round(c(unname(theta_hat),
                   quantile(sim_lambda, 0.025),
                   quantile(log(0.5) / log(1 - sim_lambda[sim_lambda > 0 & sim_lambda < 1]), 0.025)), 3),
  `97.5%` = round(c(unname(theta_hat),
                    quantile(sim_lambda, 0.975),
                    quantile(log(0.5) / log(1 - sim_lambda[sim_lambda > 0 & sim_lambda < 1]), 0.975)), 3)
) |>
  knitr::kable(caption = "Point estimates and 95% intervals from parametric simulation. (theta is taken from Stage 1 and treated as fixed here; richer schemes would jointly resample theta and the ECM coefficients — see Philips 2018 for a bounds-testing approach.)")
Point estimates and 95% intervals from parametric simulation. (theta is taken from Stage 1 and treated as fixed here; richer schemes would jointly resample theta and the ECM coefficients — see Philips 2018 for a bounds-testing approach.)
Quantity Mean 2.5% 97.5%
Long-run multiplier (theta) 0.951 0.951 0.951
Speed of adjustment (lambda) per quarter 0.059 0.028 0.090
Half-life (quarters) 12.608 7.317 24.270

The 95% interval for the half-life is roughly 7–24 quarters — substantial uncertainty. The point estimate of ~11 quarters is our best read, but a reader of this analysis should treat anything from about two to six years as plausible.

5.6 Bottom line

The combined evidence — both series I(1) by ADF, cointegrated by Engle-Granger, ECM with \(\hat\lambda \approx 0.06\) — supports the view that US federal fiscal policy is sustainable in the long run, but adjusts slowly. Receipts and expenditures share a stable long-run relationship in which receipts cover roughly 95% of expenditures over a multi-decade equilibrium horizon. Deviations are corrected at about 6% per quarter, with a half-life around three years.

This is not a normative claim about whether a steady state in which receipts cover only ~95% of expenditures is desirable, only a positive claim about the data-generating process: the historical series show no evidence of an explosive deficit path. Whether the cointegrating relationship will continue to hold under future shocks is a forecasting question, not a cointegration-testing question, and is beyond what this lab covers.

For an applied paper, the next steps would be:

  1. Robustness to break dates. Test for structural breaks in the cointegrating relationship, particularly around major policy regime changes.
  2. Bounds-testing approach. Re-estimate using Pesaran, Shin, and Smith’s (2001) bounds test, which is robust to mixed I(0)/I(1) regressors. Philips (2018) provides the political-science walkthrough.
  3. Multivariate cointegration. Add additional macro regressors (debt/GDP, interest rates) and use the Johansen procedure if multiple cointegrating relations are plausible.

We have done what the lab was for: to take the ECM machinery built in Part 4 as algebra, and demonstrate it on data where the cointegration interpretation has structural content.


Part 6 (if time permits): Interrupted Time Series (ITS) Design

Parts 2–5 dealt with one form of dynamic causal inference: estimating coefficients and propagating shocks within a fully parameterized dynamic model (ARMAX, ARDL, ECM). Part 6 introduces a different family — interrupted time series (ITS) designs — where the causal claim comes not from a parametric model of the dynamics, but from a quasi-experimental contrast: the trajectory of \(y_t\) before the intervention versus the trajectory after, with the pre-intervention path serving as the counterfactual.

ITS is the canonical causal-inference tool for time-series settings with a sharp, datable intervention, and it is widely used in public-policy evaluation (Bernal, Cummins, and Gasparrini 2017). The key insight: when randomization is impossible — as with a national policy that takes effect on one date — we can still estimate a causal effect by comparing the observed post-intervention path to the projected pre-intervention path.

6.1 The standard ITS model

The standard ITS specification is:

\[ y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t - T^*) \cdot D_t + \varepsilon_t \]

where:

  • \(\beta_0\): baseline level (intercept before intervention)
  • \(\beta_1\): pre-intervention trend (slope before intervention)
  • \(D_t\): intervention indicator (\(0\) before, \(1\) after)
  • \(\beta_2\): level change — the immediate jump (or drop) at the intervention
  • \(\beta_3\): slope change — how the trend changes after the intervention
  • \(T^*\): the time of the intervention

The key causal parameters are \(\beta_2\) (did the level shift immediately?) and \(\beta_3\) (did the trajectory’s slope change going forward?). Together they characterize the intervention’s effect: a step shift, a slope change, or both. Identification rests on the assumption that, absent the intervention, the pre-trend would have continued — the standard parallel-trends-style assumption translated to a single-unit time series.
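A minimal estimation sketch on simulated data with known parameters (all names and values here are illustrative, not from the lab's datasets; tidyverse assumed loaded):

```r
set.seed(7)
T_n <- 80; T_star <- 40
b0 <- 10; b1 <- 0.5; b2 <- -3; b3 <- -0.4   # true parameters (illustrative)

its <- tibble(t = 1:T_n) |>
  mutate(D = as.integer(t >= T_star),       # intervention indicator
         S = pmax(t - T_star, 0),           # time since intervention
         y = b0 + b1*t + b2*D + b3*S + rnorm(T_n, 0, 0.4))

fit_its <- lm(y ~ t + D + S, data = its)
coef(fit_its)   # recovers (b0, b1, b2, b3) up to sampling noise
```

The coefficient on D is the estimated level change \(\beta_2\) and the coefficient on S is the slope change \(\beta_3\), exactly as in the specification above.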

6.1.1 What ITS assumes — and why you must defend each piece

Before we list functional forms or fit anything, we name the assumptions that make ITS causal rather than merely descriptive. The umbrella assumption — sometimes called conditional ignorability of treatment timing, or in the time-series setting sequential ignorability — says this: once we condition on the covariates and dynamic structure our model includes, the post-period trajectory would have followed the pre-period model in expectation, absent the intervention. Plain language: anything that would have stopped the pre-trend from continuing has been captured by the model. Asserting this assumption by default — without defending it — is the single most common form of weak inference in applied ITS work.

Within this umbrella, four threats to the assumption are worth naming individually, in roughly the order they tend to bite:

  1. No anticipation. Agents must not respond to the intervention before \(T^*\). The date of the policy is the date of the response, full stop. Threats: announcements that precede implementation, expectation effects, leaked drafts, behavioral adjustment to “this is coming.” Defense: announcement dummies plus a ramp variable for the anticipation window (covered in §6.4); restricting the analysis to a window where anticipation is implausible; documenting (with sources) what was publicly known before \(T^*\). Anticipation is the threat to single-unit ITS that is hardest to fix after the fact — better to design around it than to patch.

  2. Stable pre-trend (conditional continuation). The pre-period model — once we include covariates and dynamic structure — captures everything that would have continued absent the policy. Threats: unmodeled time-varying confounders, structural breaks unrelated to the policy, secular shifts. Defense: substantively-justified covariate inclusion (§6.3); AR or ARMA error structure where appropriate (§6.5); placebo tests at fake intervention dates; robustness to pre-period length.

  3. No co-occurring shocks (exclusion). Nothing else that affects \(y\) also changes at \(T^*\). Threats: bundled reforms, contemporaneous economic events, weather. Defense: condition on observables (§6.3) or, when available, use a true control unit that did not receive the intervention but did receive the common shocks (the CITS rung of the ladder, §6.1.4).

  4. Stable measurement. The data-generating process for measurement does not change at \(T^*\). Threats: survey instruments redesigned, sample frames altered, administrative coding changed. Defense: check for measurement breaks separately; restrict to consistently-measured periods.

Modeling choices are the operational expression of the assumption. Every covariate you condition on, every lag you add to the AR error, every anticipation window you include — each is a substantive commitment to a story about what could have caused \(y\) in the post-period if the policy hadn’t intervened. We tabulate this connection in §6.1.3, after §6.1.2 walks through the two specifications most often used.

6.1.2 Two example specifications

Most ITS analyses use one of two functional forms. The choice is whether the intervention plausibly produces a permanent shift (Example 1) or a transient one (Example 2). Notation in both: \(D_t \equiv \mathbf{1}\{t \geq T^*\}\) is the regime indicator, \(S_t \equiv (t-T^*)\,D_t\) is the time-since-intervention counter, \(P_t \equiv \mathbf{1}\{t = T^*\}\) is a pulse.

Example 1 — Level + slope change (segmented regression)

\[ y_t \;=\; \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 S_t + \varepsilon_t. \]

The default ITS specification. \(\beta_2\) is the immediate level shift at \(T^*\) (the vertical jump); \(\beta_3\) is the change in slope after the intervention (the kink). Both have intervention-time interpretations because \(S_t\) is re-zeroed at \(T^*\). Use when the policy plausibly produces both an immediate jump and a different growth rate going forward.

set.seed(42)
T_n     <- 60
T_star  <- 30
alpha   <- 10; beta1 <- 0.5; beta2 <- -3; beta3 <- -0.4

df_ls <- tibble(t = 1:T_n) |>
  mutate(D      = as.integer(t >= T_star),
         S      = pmax(t - T_star, 0),
         y_mean = alpha + beta1*t + beta2*D + beta3*S,
         y      = y_mean + rnorm(T_n, 0, 0.4),
         cf     = alpha + beta1*t)

p_ls <- ggplot(df_ls, aes(t, y)) +
  geom_ribbon(data = df_ls |> filter(t >= T_star),
              aes(ymin = y_mean, ymax = cf),
              fill = "firebrick", alpha = 0.18) +
  geom_line(aes(y = cf), color = "grey40", linetype = "dashed", linewidth = 0.8) +
  geom_line(aes(y = y_mean), color = "steelblue", linewidth = 1.0) +
  geom_point(aes(color = factor(D)), size = 1.3, show.legend = FALSE) +
  scale_color_manual(values = c("0" = "steelblue", "1" = "firebrick")) +
  geom_vline(xintercept = T_star, linetype = "dashed", color = "grey40") +
  annotate("segment", x = T_star, xend = T_star,
           y = alpha + beta1*T_star, yend = alpha + beta1*T_star + beta2,
           arrow = arrow(length = unit(0.22, "cm")), color = "darkred", linewidth = 0.7) +
  annotate("text", x = T_star + 0.7,
           y = (alpha + beta1*T_star) + beta2/2,
           label = "beta_2 (level shift)", hjust = 0, color = "darkred", size = 3.5) +
  annotate("text", x = T_n - 0.5,
           y = (df_ls$y_mean[T_n] + df_ls$cf[T_n])/2,
           label = "Effect at horizon h\n= y_observed - y_counterfactual",
           hjust = 1, color = "firebrick", size = 3.2, lineheight = 0.95) +
  annotate("text", x = T_star + 14,
           y = alpha + beta1*(T_star + 14) + beta2 + beta3*14 + 3.5,
           label = "beta_3 changes the slope", hjust = 0, color = "darkblue", size = 3.4) +
  annotate("text", x = T_star - 0.5, y = max(df_ls$y) + 0.5,
           label = "T*", color = "grey30", size = 4, hjust = 1, fontface = "italic") +
  labs(title = "Example 1: Level + slope change ITS",
       subtitle = "Solid blue = mean function; dashed grey = counterfactual (extrapolated pre-trend)",
       x = "Time t", y = "Outcome y") +
  theme_minimal(base_size = 11)

if (!dir.exists("output/slides")) dir.create("output/slides", recursive = TRUE)
ggsave("output/slides/p6_its_levelslope.png", p_ls, width = 9, height = 4, dpi = 110)
p_ls

Read the figure: the counterfactual (dashed grey) is what the pre-period model predicts for the post-period — extrapolated from the same intercept and slope. The observed mean (solid blue) drops at \(T^*\) by \(\beta_2\), then continues at the new slope \(\beta_1 + \beta_3\). The substantive causal effect at horizon \(h\) is the vertical gap between observed and counterfactual at \(T^* + h\) — which keeps growing whenever \(\beta_3 \neq 0\). This is what the shaded region is showing.

Example 2 — Pulse intervention

\[ y_t \;=\; \beta_0 + \beta_1 t + \omega P_t + \varepsilon_t. \]

Use when the intervention is a one-off shock with no expected persistence — a single news event, an election day, a scandal that briefly moves opinion, a one-week supply disruption. \(\omega\) is the size of the spike at \(T^*\); the system is back on trend by \(t = T^* + 1\). The contrast with Example 1 is sharp: transient vs. permanent.

set.seed(43)
alpha_p <- 10; beta1_p <- 0.05; omega <- 4

df_pl <- tibble(t = 1:T_n) |>
  mutate(P      = as.integer(t == T_star),
         y_mean = alpha_p + beta1_p*t + omega*P,
         y      = y_mean + rnorm(T_n, 0, 0.3),
         cf     = alpha_p + beta1_p*t)

p_pl <- ggplot(df_pl, aes(t, y)) +
  geom_line(aes(y = cf), color = "grey40", linetype = "dashed", linewidth = 0.8) +
  geom_line(aes(y = y_mean), color = "steelblue", linewidth = 1.0) +
  geom_point(color = "steelblue", size = 1.3) +
  geom_point(data = df_pl |> filter(t == T_star), aes(y = y),
             color = "firebrick", size = 2.6) +
  geom_vline(xintercept = T_star, linetype = "dashed", color = "grey40") +
  annotate("segment", x = T_star, xend = T_star,
           y = alpha_p + beta1_p*T_star, yend = alpha_p + beta1_p*T_star + omega,
           arrow = arrow(length = unit(0.22, "cm")), color = "darkred", linewidth = 0.7) +
  annotate("text", x = T_star + 0.7, y = alpha_p + beta1_p*T_star + omega/2,
           label = "omega (one-period spike)", hjust = 0, color = "darkred", size = 3.5) +
  annotate("text", x = T_n - 0.5, y = alpha_p + beta1_p*(T_n - 0.5) - 0.8,
           label = "Returns to baseline trend by t = T* + 1",
           hjust = 1, color = "grey25", size = 3.3) +
  annotate("text", x = T_star - 0.5, y = max(df_pl$y) + 0.3,
           label = "T*", color = "grey30", size = 4, hjust = 1, fontface = "italic") +
  labs(title = "Example 2: Pulse intervention",
       subtitle = "One-period shock at T*; immediate return to trend",
       x = "Time t", y = "Outcome y") +
  theme_minimal(base_size = 11)

ggsave("output/slides/p6_its_pulse.png", p_pl, width = 9, height = 4, dpi = 110)
p_pl

Other functional forms — separate-regime trends (algebraically equivalent to Example 1), level-only or slope-only (sub-cases of Example 1 obtained by testing \(\beta_2 = 0\) or \(\beta_3 = 0\)), anticipation/ramp (§6.4), multiple interventions (one \((D, S)\) pair per break date) — are all built from the same primitives. Don’t drop terms a priori; fit the full Example 1 specification and let the data plus theory decide which sub-form is supported.
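A sketch of the multiple-interventions construction, with two hypothetical break dates (simulated data; all names and dates are illustrative):

```r
set.seed(11)
T_n <- 120
T1  <- 40; T2 <- 80   # hypothetical break dates

multi <- tibble(t = 1:T_n) |>
  mutate(D1 = as.integer(t >= T1), S1 = pmax(t - T1, 0),
         D2 = as.integer(t >= T2), S2 = pmax(t - T2, 0),
         # simulated outcome: a level and slope change at each break
         y  = 5 + 0.3*t - 2*D1 - 0.2*S1 + 1.5*D2 + 0.1*S2 + rnorm(T_n, 0, 0.5))

fit_multi <- lm(y ~ t + D1 + S1 + D2 + S2, data = multi)
coef(fit_multi)   # one level and one slope parameter per break date
```

Each \((D, S)\) pair is re-zeroed at its own break date, so every level and slope coefficient keeps its intervention-time interpretation.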

6.1.3 Modeling choices as identification strengtheners

Each modeling choice is the operational tool for a specific assumption from §6.1.1. The connection is what makes ITS causal rather than just curve-fitting:

Modeling choice Threat it addresses When it is justified
Covariates \(X_t\) in the regression (§6.3) Time-varying observable confounders (e.g., world oil price for gasoline; weather for emissions; unemployment for approval) Theory says these are confounders. Including them strengthens conditional ignorability from “the pre-trend would have continued unconditionally” to “the pre-trend conditional on \(X\) would have continued.”
AR error structure (§6.5) Persistent residual structure that the mean equation didn’t capture Substantive process plausibly has dynamic noise (slow-moving omitted variables, autocorrelated measurement). Strengthens the conditioning set without committing to causal momentum in \(y\).
Lagged dependent variable (Story B from §3.1) Genuine state-dependence: past \(y\) causally produces current \(y\) Theory predicts momentum (sticky prices, habit formation, capital accumulation). Not a routine fix for autocorrelation — a substantive commitment about the data-generating process.
Anticipation / ramp window (§6.4) Agents respond before \(T^*\) in expectation The policy was credibly announced before implementation, with a documented announcement date.
True control unit (CITS) Common shocks that hit both treated and control (recessions, weather, contemporaneous policies) A unit exists that does not receive the intervention. Without one, CITS doesn’t identify anything new (see §6.1.4).

Theory-first ordering. Each row above is a substantive claim about what could have caused \(y\) in the post-period absent the policy. AIC/BIC will allow you to add or drop any of them; only theory tells you which are defensible. Modeling without that theory produces fits that may be statistically respectable but identify nothing causal.

6.1.4 A credibility ladder for ITS

Three levels of evidence, weakest to strongest. All three are still single-policy designs; none is "stronger than ITS" in the sense of resting on weaker assumptions — each rung demands more of the data, not fewer modeling choices.

  1. Single-unit ITS. The base case. Relies entirely on the extrapolated pre-trend as the counterfactual. Conditional ignorability has to do all the work; if any of the four assumptions in §6.1.1 fails materially, the estimate is biased.

  2. Single-unit ITS with falsification. Run placebo tests at fake intervention dates a few years before \(T^*\) — the fake “effect” should be near zero. Vary the pre-period length. Add and drop covariates. Each robustness check the design survives makes the conditional-ignorability assumption more credible. This is what good single-unit ITS papers actually do.

  3. Comparative ITS (CITS) — when, and only when, a true control unit exists. The control must be a series that does not receive the intervention at \(T^*\) but is plausibly subject to the same common shocks (recessions, weather, contemporaneous policies). The treatment-by-post interactions (\(G_i D_t\), \(G_i S_t\)) absorb shocks that affect both units identically and identify the differential effect — the DiD-with-time-series logic. Without a true untreated control, CITS doesn’t apply: if the policy hits all units uniformly (a federal law with no state opt-out, a global event), the would-be control receives the same shock, the differential is zero by symmetry, and the CITS interactions identify nothing new.
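
The level-2 falsification step can be sketched in a few lines. This is a hedged illustration using the bc data frame built in §6.2: the fake date (1 July 2004) is arbitrary — any point comfortably inside the pre-period works — and the pre-period is truncated before the February 2008 announcement so no real treatment leaks in.

```r
# Placebo ITS: pretend the intervention arrived 1 July 2004 (no actual policy
# change), estimate the same segmented model on pre-announcement data only,
# and check that the fake level/slope changes are near zero.
bc_pre <- bc |>
  filter(date_d < ymd("2008-02-01")) |>        # stop before the announcement
  mutate(D_fake = as.integer(date_d >= ymd("2004-07-01")),
         S_fake = pmax(0, t - which(date_d == ymd("2004-07-01"))[1]))

fit_placebo <- lm(log_gas ~ t + D_fake + S_fake + log(gas_price_c) + month_f,
                  data = bc_pre)
round(coeftest(fit_placebo)[c("D_fake", "S_fake"), ], 4)
```

A credible design shows D_fake and S_fake statistically indistinguishable from zero; a significant placebo “effect” signals pre-trend instability that would also contaminate the real estimate.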

When a true control is unavailable but you still want a comparative design, the natural next steps live in design space, not model space — and rest on different identifying assumptions:

  • Synthetic control constructs a counterfactual for a single treated unit by reweighting a pool of donor units (Abadie, Diamond, & Hainmueller 2010). The identifying assumption is a factor-model structure shared between the treated unit and the donor pool, plus pre-period fit.
  • Panel difference-in-differences uses many treated and untreated units, identifying the effect from variation across both time and units. The identifying assumption is parallel trends across units — a much-weakened form of pre-trend stability that is checked via leads, lags, and event-study designs.

Neither is a strict upgrade over single-unit ITS — they are different designs for different settings, with their own assumptions to defend. We flag them here only to mark the natural exit ramps when single-unit ITS or CITS doesn’t fit.

6.1.5 Dynamics: AR errors vs. lagged \(y\)

ITS data is almost always serially correlated. The two routes for handling it map directly onto Story A and Story B from §3.1: model the residuals as ARMA (Story A — Arima(y, xreg = X)), or include \(y_{t-1}\) on the right-hand side (Story B — lm(y ~ lag(y) + X)). The choice has the same stakes as in §3.1: Story A keeps the intervention coefficients’ moment-of-impact reading; Story B reinterprets them as direct effects propagated forward through the LDV’s own dynamics, with full effects requiring an IRF (cf. §3.5).

Default to Story A unless theory genuinely predicts that past \(y\) causes future \(y\) — sticky prices, habit formation, capital stocks (the canonical defensible cases discussed in §3.1). The Cook & Webb (2021) warning against treating \(\hat\rho\) on \(y_{t-1}\) as a “persistence parameter” applies in full force here. We use Option A in §6.5.
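
For concreteness, the two routes differ by a single term. A minimal sketch using the §6.2 variable names (Story B is shown only to illustrate the reinterpretation, not as a recommendation; the covariates are trimmed for brevity):

```r
# Story A: segmented regression with AR(1) errors — D and S keep their
# moment-of-impact reading.
xreg_A <- model.matrix(~ t + D + S, data = bc)[, -1]
fit_A  <- Arima(bc$log_gas, order = c(1, 0, 0), xreg = xreg_A)

# Story B: lagged dependent variable — D and S become direct effects only;
# full effects accumulate through the coefficient on lagged y.
fit_B <- lm(log_gas ~ dplyr::lag(log_gas) + t + D + S, data = bc)

# Under Story B the implied long-run level effect of D is direct / (1 - rho):
rho <- coef(fit_B)["dplyr::lag(log_gas)"]
coef(fit_B)["D"] / (1 - rho)
```

The last line is the point of the contrast: under Story B, reading the raw D coefficient as “the effect” understates the model’s own implied long-run effect whenever \(\hat\rho > 0\).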

6.2 ITS applied to BC’s carbon tax

We apply the standard ITS model to a setting where the issues from §6.1 (covariate identification of the pre-trend, anticipation effects) have substantive bite.

The intervention. In its 19 February 2008 budget, the Government of British Columbia announced a revenue-neutral carbon tax of $10/tCO₂ effective 1 July 2008, scheduled to escalate by $5/tCO₂ each year through 2012. For motor gasoline, the initial rate translated to an additional 2.34 ¢/L on the retail price, climbing to 6.67 ¢/L by 2012. Tax revenue was rebated through personal-income-tax cuts and a low-income climate-action credit, so the policy was neutral on aggregate fiscal balance (Murray and Rivers 2015).

The substantive question. Did the carbon tax cause a measurable decline in BC gasoline consumption beyond what would have happened anyway given world oil prices?

The data. Monthly net sales of gasoline in BC, January 1993 to December 2015 (Statistics Canada PID 23-10-0080-01), merged with monthly Vancouver retail gasoline prices (PID 18-10-0001-01). \(T = 276\) months — 186 pre-tax, 90 post-tax. We work with log gasoline sales as the outcome, so the linear trend captures growth in log volume (population + economic activity) and the intervention coefficient is a percent change.

bc <- read_csv("data/bc_carbon_tax.csv", show_col_types = FALSE) |>
  mutate(date_d   = ym(date),
         t        = row_number(),
         log_gas  = log(gasoline_kl),
         month_f  = factor(month(date_d, label = FALSE)),
         D        = as.integer(date_d >= ymd("2008-07-01")),                 # post-tax
         S        = pmax(0, t - which(date_d == ymd("2008-07-01"))[1]),      # time since
         D_announce = as.integer(date_d >= ymd("2008-02-01")))               # post-announcement
glimpse(bc)
## Rows: 276
## Columns: 10
## $ date        <chr> "1993-01", "1993-02", "1993-03", "1993-04", "1993-05", "19…
## $ gasoline_kl <dbl> 263175, 266268, 318488, 305159, 321572, 340359, 357239, 37…
## $ gas_price_c <dbl> 55.6, 54.9, 53.5, 54.3, 53.7, 54.5, 55.3, 55.4, 56.4, 56.4…
## $ date_d      <date> 1993-01-01, 1993-02-01, 1993-03-01, 1993-04-01, 1993-05-0…
## $ t           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ log_gas     <dbl> 12.48057, 12.49226, 12.67134, 12.62859, 12.68098, 12.73776…
## $ month_f     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7…
## $ D           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ S           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ D_announce  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
ggplot(bc, aes(date_d, log_gas)) +
  geom_line(color = "grey55") +
  geom_vline(xintercept = ymd("2008-07-01"),
             linetype = "dashed", color = "firebrick") +
  annotate("text", x = ymd("2008-09-01"), y = max(bc$log_gas) - 0.04,
           label = "Carbon tax\n(1 Jul 2008)",
           color = "firebrick", size = 3.2, hjust = 0) +
  labs(title = "BC monthly gasoline sales (log thousands of litres), 1993–2015",
       subtitle = "Strong seasonality + secular growth; intervention point marked",
       y = "log(gasoline sales)", x = NULL) +
  theme_minimal(base_size = 12)

The series has a positive secular trend, strong monthly seasonality, and a pronounced mid-2008 spike followed by lower post-tax levels. Crucially, global oil prices peaked in mid-2008 and crashed in the subsequent recession — so the post-period decline confounds the carbon tax with a worldwide price shock. This is exactly the identification problem §6.1 was set up to address: a naïve segmented regression has no way to separate the local tax from the simultaneous global shock.

A bare segmented ITS (no covariates)

Start with the level + slope specification from §6.1.2 (Example 1) plus month dummies — without conditioning on the gas price.

fit_bare <- lm(log_gas ~ t + D + S + month_f, data = bc)
round(coeftest(fit_bare)[c("t", "D", "S"), ], 4)
##   Estimate Std. Error t value Pr(>|t|)
## t   0.0008     0.0001 13.5272        0
## D  -0.0486     0.0111 -4.3900        0
## S  -0.0008     0.0002 -4.3719        0

The level change D and slope change S are both negative and significant: the bare model concludes that, at implementation, BC gasoline sales dropped by about \(4.9\%\) relative to trend, and continued declining at \(-0.08\%\) per month thereafter. Read at face value, this is a substantial effect.

But it conflates two stories. The 2008 oil-price spike pushed retail prices in BC from about $1.20/L in early 2008 to over $1.50/L by mid-summer; prices then crashed below $0.90/L by late 2008. Subsequent years saw global prices recover and stay high through 2014, then crash again. None of that is captured in the bare ITS. Whatever portion of the post-tax decline is globally driven, the bare model misattributes to the carbon tax.

6.3 Conditioning on covariates: identifying the pre-trend

The bare ITS leans on a strong assumption: absent the carbon tax, the pre-trend would have continued linearly into the post-period. That’s hard to defend here because gasoline consumption is mechanically driven by retail price, and the post-2008 period contains the largest oil-price swings in the sample.

The fix is to condition on the price. The identifying assumption becomes:

Conditional on retail gasoline price and seasonal pattern, the post-period would have followed the pre-period’s pattern had the carbon tax not been imposed.

This is the conditional-ignorability translation of parallel trends. The counterfactual is no longer “what does the pre-trend extrapolate to?” but “what does the pre-period model, evaluated at the post-period covariates, predict?”. The latter is much more credible when the pre-trend has known mechanical drivers we can observe.

fit_cov <- lm(log_gas ~ t + D + S + log(gas_price_c) + month_f, data = bc)
round(coeftest(fit_cov)[c("t", "D", "S", "log(gas_price_c)"), ], 4)
##                  Estimate Std. Error t value Pr(>|t|)
## t                  0.0015     0.0001 14.2202    0e+00
## D                 -0.0389     0.0101 -3.8607    1e-04
## S                 -0.0011     0.0002 -6.6105    0e+00
## log(gas_price_c)  -0.1656     0.0212 -7.8034    0e+00

Comparison with the bare specification:

tibble(
  Coefficient                              =
    c("Pre-trend (per month)",
      "Level change at intervention (D)",
      "Slope change after intervention (S)",
      "log(retail price) elasticity"),
  `Bare ITS`        = c(coef(fit_bare)[c("t", "D", "S")], NA) |> round(4),
  `With log(price)` = c(coef(fit_cov) [c("t", "D", "S", "log(gas_price_c)")]) |> round(4)
) |>
  knitr::kable(caption = "ITS estimates with and without conditioning on retail gasoline price.")
ITS estimates with and without conditioning on retail gasoline price.

| Coefficient | Bare ITS | With log(price) |
|---|---|---|
| Pre-trend (per month) | 0.0008 | 0.0015 |
| Level change at intervention (D) | -0.0486 | -0.0389 |
| Slope change after intervention (S) | -0.0008 | -0.0011 |
| log(retail price) elasticity | NA | -0.1656 |

Two patterns in the table:

  • D shrinks from \(-0.049\) to \(-0.039\) once we condition on price. Roughly a fifth of the implementation-date drop the bare ITS attributed to the tax was actually driven by the contemporaneous global oil-price spike.
  • The price elasticity is about \(-0.17\). A 1% increase in retail price reduces sales by ~0.17%. This is in the standard short-run gasoline-elasticity range (-0.1 to -0.3 in the literature) and is a useful sanity check that the covariate is doing real identifying work — not just being a free parameter.

A counterfactual plot makes the conditional-ignorability story visible. We compute two predicted paths from fit_cov: the realized post-tax fit, and a counterfactual that turns off the post-tax dummies (D = 0, S = 0), holding price and season at their actual values.

bc_cf <- bc |>
  mutate(D = 0, S = 0)

bc <- bc |>
  mutate(fit_real = predict(fit_cov, newdata = bc),
         fit_cf   = predict(fit_cov, newdata = bc_cf))

ggplot(bc, aes(date_d)) +
  geom_line(aes(y = log_gas), color = "grey60", alpha = 0.7) +
  geom_line(aes(y = fit_real), color = "steelblue", linewidth = 0.9) +
  geom_line(aes(y = fit_cf),
            data = filter(bc, date_d >= ymd("2008-07-01")),
            color = "firebrick", linewidth = 0.9, linetype = "dashed") +
  geom_vline(xintercept = ymd("2008-07-01"),
             linetype = "dotted", color = "firebrick") +
  annotate("text", x = ymd("2014-01-01"), y = 12.93,
           label = "No-tax counterfactual",
           color = "firebrick", size = 3.2, hjust = 0) +
  annotate("text", x = ymd("2014-01-01"), y = 12.86,
           label = "Realized fit",
           color = "steelblue", size = 3.2, hjust = 0) +
  labs(title = "BC carbon tax: realized fit vs. no-tax counterfactual",
       subtitle = "Both paths condition on price and season; gap = ITS-implied tax effect",
       y = "log(gasoline sales)", x = NULL) +
  theme_minimal(base_size = 12)

The dashed red line is the prediction the model would make for the post-period if the carbon tax had not happened — but using the realized post-period prices and seasonal pattern. The gap between the realized fit (blue) and the counterfactual (red) is the ITS-implied carbon-tax effect, conditional on covariates.
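
The visual gap can also be summarized as a single number. A small follow-up sketch using the fit_real and fit_cf columns computed above (the mean-gap summary is our addition, not part of the original specification):

```r
# Average post-period gap between realized fit and no-tax counterfactual:
# the ITS-implied tax effect in log points, converted to a percent change.
bc |>
  filter(date_d >= ymd("2008-07-01")) |>
  summarise(mean_gap_log = mean(fit_real - fit_cf),
            pct_change   = 100 * (exp(mean_gap_log) - 1))
```

Because the counterfactual differs from the realized fit only through D and S, this average gap is just \(\hat\beta_D + \hat\beta_S \bar{S}\) over the post-period — a convenient way to report level and slope changes on one scale.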

What’s still being assumed exogenous. We assume retail price is exogenous to BC gasoline consumption — i.e., world oil prices drive retail prices, retail prices drive consumption, but BC consumption doesn’t drive retail prices. For BC’s small share of the global market, this is defensible; in larger economies, simultaneity would call for IV or two-stage methods.

6.4 Anticipation: announcement vs. implementation

The carbon tax was announced on 19 February 2008 and took effect on 1 July 2008 — a five-month gap during which consumers had time to adjust expectations. If they pre-bought gasoline (filled tanks ahead of the tax) or made longer-term changes (deferred a vehicle purchase, switched commutes), the immediate-pre-tax data points would already be polluted, biasing the implementation-date level shift.

This is a structural concern about the design, not the model fit. Three ways to address it (per the modeling-choices table in §6.1.3):

  1. Anticipation step. Add an indicator D_announce that turns on at the announcement date. Tests whether the announcement itself shifted the level.
  2. Drop the announcement window. Exclude observations February–June 2008 from the regression — unbiased pre/post-tax estimates at the cost of five observations.
  3. Treat the announcement date as \(T^*\). Use 19 February as the intervention date if theory predicts anticipation is the dominant channel.

We demonstrate (1):

fit_anticip <- lm(log_gas ~ t + D_announce + D + S + log(gas_price_c) + month_f,
                  data = bc)
round(coeftest(fit_anticip)[c("t", "D_announce", "D", "S", "log(gas_price_c)"), ], 4)
##                  Estimate Std. Error t value Pr(>|t|)
## t                  0.0015     0.0001 14.1796   0.0000
## D_announce        -0.0108     0.0191 -0.5680   0.5705
## D                 -0.0293     0.0196 -1.4910   0.1372
## S                 -0.0011     0.0002 -6.6204   0.0000
## log(gas_price_c)  -0.1631     0.0217 -7.5202   0.0000

D_announce is the level shift starting at the announcement; D is the additional level shift at implementation, on top of any announcement-window adjustment.

tibble(
  Coefficient                  = c("D (level change at impl.)",
                                   "D_announce (announcement effect)",
                                   "S (slope change)"),
  `With price (no anticip.)`  =
    c(coef(fit_cov)["D"], NA, coef(fit_cov)["S"]) |> round(4),
  `With price + announcement` =
    c(coef(fit_anticip)["D"], coef(fit_anticip)["D_announce"],
      coef(fit_anticip)["S"]) |> round(4)
) |>
  knitr::kable(caption = "Implementation effect with and without an announcement adjustment.")
Implementation effect with and without an announcement adjustment.

| Coefficient | With price (no anticip.) | With price + announcement |
|---|---|---|
| D (level change at impl.) | -0.0389 | -0.0293 |
| D_announce (announcement effect) | NA | -0.0108 |
| S (slope change) | -0.0011 | -0.0011 |

Reading the announcement coefficient: \(-0.011\) (\(-1.1\%\), \(p \approx 0.57\)) — small and indistinguishable from zero. Anticipation does not appear to be a major channel here. As a complementary check, remedy (2) — dropping the Feb–Jun 2008 window — gives a similar story:

fit_dropwin <- lm(log_gas ~ t + D + S + log(gas_price_c) + month_f,
                  data = bc |> filter(date_d < ymd("2008-02-01") |
                                       date_d >= ymd("2008-07-01")))
round(coef(fit_dropwin)["D"], 4)
##       D 
## -0.0402

The level-change estimate barely moves. Either remedy yields the same conclusion: we keep §6.3 as the headline. The sensitivity check is what gives us the right to do so.
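
For completeness, remedy (3) — re-dating the intervention to the announcement — is a one-line change. We do not adopt it here, since the announcement coefficient above was null; this sketch only shows what the re-dated specification would look like:

```r
# Treat 19 February 2008 (the announcement) as T*: rebuild the level and
# slope indicators from that date and re-estimate.
bc_ann <- bc |>
  mutate(D_a = as.integer(date_d >= ymd("2008-02-01")),
         S_a = pmax(0, t - which(date_d == ymd("2008-02-01"))[1]))

fit_ann <- lm(log_gas ~ t + D_a + S_a + log(gas_price_c) + month_f,
              data = bc_ann)
round(coeftest(fit_ann)[c("D_a", "S_a"), ], 4)
```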

6.5 ITS with AR errors

Apply Option A from §6.1.5 — model the residuals as AR(1) — using Arima() with the §6.3 covariates and intervention indicators as xreg.

xcov <- model.matrix(~ t + D + S + log(gas_price_c) + month_f, data = bc)[, -1]

fit_its_ar <- Arima(bc$log_gas, order = c(1, 0, 0), xreg = xcov)
fit_its_ar
## Series: bc$log_gas 
## Regression with ARIMA(1,0,0) errors 
## 
## Coefficients:
##          ar1  intercept       t        D        S  log(gas_price_c)  month_f2
##       0.0539    13.2568  0.0015  -0.0390  -0.0011           -0.1645   -0.0278
## s.e.  0.0616     0.0824  0.0001   0.0103   0.0002            0.0216    0.0109
##       month_f3  month_f4  month_f5  month_f6  month_f7  month_f8  month_f9
##         0.0697    0.0755    0.1500    0.1584    0.2111    0.2313    0.1239
## s.e.    0.0112    0.0113    0.0114    0.0114    0.0114    0.0114    0.0113
##       month_f10  month_f11  month_f12
##          0.1199     0.0359     0.0708
## s.e.     0.0112     0.0112     0.0109
## 
## sigma^2 = 0.001523:  log likelihood = 512.32
## AIC=-988.64   AICc=-985.98   BIC=-923.48
checkresiduals(fit_its_ar)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(1,0,0) errors
## Q* = 14.841, df = 9, p-value = 0.0954
## 
## Model df: 1.   Total lags used: 10

Compare the OLS and ARIMA point estimates on the carbon-tax coefficients:

tibble(
  Parameter             = c("Pre-trend (t)", "Level change (D)",
                            "Slope change (S)", "log(price) elasticity"),
  `OLS (with price)`    = round(coef(fit_cov)[c("t", "D", "S", "log(gas_price_c)")], 4),
  `ARIMA(1,0,0) errors` = round(coef(fit_its_ar)[c("t", "D", "S", "log(gas_price_c)")], 4)
) |>
  knitr::kable(caption = "ITS coefficients: OLS vs. ARIMA(1,0,0) errors.")
ITS coefficients: OLS vs. ARIMA(1,0,0) errors.

| Parameter | OLS (with price) | ARIMA(1,0,0) errors |
|---|---|---|
| Pre-trend (t) | 0.0015 | 0.0015 |
| Level change (D) | -0.0389 | -0.0390 |
| Slope change (S) | -0.0011 | -0.0011 |
| log(price) elasticity | -0.1656 | -0.1645 |

The AR(1) parameter on the residuals is small (\(\approx 0.05\)). With price and season already absorbing the systematic post-period variation, the OLS estimates from §6.3 are essentially unchanged. We use Story A (AR errors) here — there is no theoretical reason to think last month’s gasoline-sales residual causes this month’s beyond what the price covariate already captures.
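
An alternative to modeling the error process is to leave the OLS point estimates alone and fix only the inference, using a HAC (Newey–West) covariance from the sandwich package listed at the top of the lab. With residual \(\hat\rho \approx 0.05\), the two routes should agree closely here; NeweyWest()’s automatic bandwidth is the package default, not a tuned choice:

```r
# HAC inference on the §6.3 OLS fit: identical coefficients,
# serial-correlation-robust standard errors (Newey & West 1987).
round(coeftest(fit_cov,
               vcov = NeweyWest(fit_cov))[c("t", "D", "S",
                                            "log(gas_price_c)"), ], 4)
```

The general trade-off: AR-errors estimation is more efficient when the error model is right; HAC standard errors are robust to a wider class of error processes at some cost in power.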

Bottom line

Across specifications, the BC carbon tax is associated with about a 3–4% drop in monthly gasoline sales at implementation, plus a small additional negative slope change. The bare ITS over-states the effect by attributing the global 2008 oil-price spike to the tax; conditioning on retail price brings the estimate into the range the energy-economics literature finds (Murray and Rivers 2015). Anticipation is not a major confound, and the AR-errors correction barely moves the headline numbers.

The pedagogy: the substantive coefficient depends on which counterfactual you defend. Bare ITS posits a linear-in-time counterfactual; covariate-conditional ITS posits a price-and-season-conditional counterfactual and gets a smaller, more credible answer. Neither is “right” without theory — but the second is much harder to be wrong about, because gasoline consumption is mechanically a function of price, and the elasticity recovered comes out at a textbook value.


Appendix: Self-Study

A.1 ADF and KPSS false positive simulation

A targeted simulation to put numbers behind the three pitfalls from §1.3. Four series with known DGPs (\(T = 250\), seed 100); we run ADF and KPSS (with both Level and Trend nulls) on each and tabulate verdicts:

set.seed(100); T_n <- 250; t_idx <- 1:T_n
y_A <- arima.sim(n = T_n, model = list(ar = 0.4), sd = 1)                      # I(0)
y_B <- cumsum(rnorm(T_n, sd = 1))                                              # I(1)
y_C <- rnorm(T_n, sd = 1) + 5 * (t_idx >= T_n / 2)                             # I(0) + break
y_D <- 0.04 * t_idx + arima.sim(n = T_n, model = list(ar = 0.4), sd = 1)       # I(0) around trend
run_tests <- function(y, label) {
  adf    <- suppressWarnings(adf.test(y))
  kpss_L <- suppressWarnings(kpss.test(y, null = "Level"))
  kpss_T <- suppressWarnings(kpss.test(y, null = "Trend"))
  tibble(Series = label,
         `ADF p`    = round(adf$p.value, 3),
         `KPSS-L p` = round(kpss_L$p.value, 3),
         `KPSS-T p` = round(kpss_T$p.value, 3))
}

bind_rows(
  run_tests(y_A, "A: AR(1) phi=0.4              [I(0)]"),
  run_tests(y_B, "B: Random walk               [I(1)]"),
  run_tests(y_C, "C: i.i.d. + level break       [I(0)]"),
  run_tests(y_D, "D: Trend + AR(1)             [I(0) around trend]")
) |>
  knitr::kable(caption = "ADF and KPSS verdicts on four series with known truth.")
ADF and KPSS verdicts on four series with known truth.

| Series | ADF p | KPSS-L p | KPSS-T p |
|---|---|---|---|
| A: AR(1) phi=0.4 [I(0)] | 0.010 | 0.10 | 0.10 |
| B: Random walk [I(1)] | 0.366 | 0.01 | 0.01 |
| C: i.i.d. + level break [I(0)] | 0.282 | 0.01 | 0.01 |
| D: Trend + AR(1) [I(0) around trend] | 0.010 | 0.01 | 0.10 |

The headline results: Series A and B are consensus-correct (all three tests agree with the truth). Series C is the Perron (1989) trap — ADF fails to reject its unit-root null while KPSS-Level rejects stationarity, so both verdicts point toward a unit root, yet the truth is I(0) within each regime; the unmodeled break fools both tests. Series D shows the deterministic-regressor pitfall — the same series gives opposite KPSS verdicts depending on whether the null is Level or Trend. Both lessons map directly onto row 3 of the verdict matrix in §1.3 and onto the diagnostic move in Part 2’s approval analysis.
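
The Perron point can be checked directly: remove the break from Series C and re-test. In applications the break date must be estimated or argued for; this sketch cheats by using the known break point from the simulation:

```r
# Regress out the (known) level break, then re-run ADF on the residual.
# Once the break is modeled, ADF should reject the unit root, reversing
# the misleading verdict in the table above.
break_dummy <- as.integer(t_idx >= T_n / 2)
y_C_debrk   <- residuals(lm(y_C ~ break_dummy))
suppressWarnings(adf.test(y_C_debrk))    # expect a small p-value
```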


References

Audited core (literature folder):

  • Box-Steffensmeier, J. M., Freeman, J. R., Hitt, M. P., & Pevehouse, J. C. W. (2014). Time Series Analysis for the Social Sciences. Cambridge University Press.
  • Cook, S. J., & Webb, C. (2021). Lagged outcomes, lagged predictors, and lagged errors: A clarification on common factors. Political Analysis, 29(4), 561–569. doi:10.1017/pan.2020.43
  • De Boef, S., & Keele, L. (2008). Taking time seriously. American Journal of Political Science, 52(1), 184–200. doi:10.1111/j.1540-5907.2007.00307.x
  • Keele, L., & Kelly, N. J. (2006). Dynamic models for dynamic theories: The ins and outs of lagged dependent variables. Political Analysis, 14(2), 186–205. doi:10.1093/pan/mpj006
  • Philips, A. Q. (2018). Have your cake and eat it too? Cointegration and dynamic inference from autoregressive distributed lag models. American Journal of Political Science, 62(1), 230–244. doi:10.1111/ajps.12318

Methods cited inline (load-bearing for Parts 1, 3–6):

  • Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program. Journal of the American Statistical Association, 105(490), 493–505. doi:10.1198/jasa.2009.ap08746
  • Bernal, J. L., Cummins, S., & Gasparrini, A. (2017). Interrupted time series regression for the evaluation of public health interventions: A tutorial. International Journal of Epidemiology, 46(1), 348–355. doi:10.1093/ije/dyw098
  • Engle, R. F., & Granger, C. W. J. (1987). Co-integration and error correction: Representation, estimation, and testing. Econometrica, 55(2), 251–276. doi:10.2307/1913236
  • Hakkio, C. S., & Rush, M. (1991). Is the budget deficit “too large”? Economic Inquiry, 29(3), 429–445. doi:10.1111/j.1465-7295.1991.tb00837.x
  • King, G., Tomz, M., & Wittenberg, J. (2000). Making the most of statistical analyses: Improving interpretation and presentation. American Journal of Political Science, 44(2), 347–361. doi:10.2307/2669316
  • MacKinnon, J. G. (2010). Critical values for cointegration tests. Queen’s Economics Department Working Paper No. 1227.
  • Murray, B., & Rivers, N. (2015). British Columbia’s revenue-neutral carbon tax: A review of the latest “grand experiment” in environmental policy. Energy Policy, 86, 674–683. doi:10.1016/j.enpol.2015.08.011
  • Newey, W. K., & West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3), 703–708. doi:10.2307/1913610
  • Perron, P. (1989). The great crash, the oil price shock, and the unit root hypothesis. Econometrica, 57(6), 1361–1401. doi:10.2307/1913712
  • Pesaran, M. H., Shin, Y., & Smith, R. J. (2001). Bounds testing approaches to the analysis of level relationships. Journal of Applied Econometrics, 16(3), 289–326. doi:10.1002/jae.616