These materials have been revised and updated by successive teaching assistants over recent years.
Packages used in this lab:
| Package | Purpose | Key functions |
|---|---|---|
| forecast | ARIMA estimation, forecasting | Arima(), auto.arima(), forecast(), checkresiduals() |
| tseries | Unit root tests | adf.test(), kpss.test(), PP.test() |
| MASS | Multivariate normal simulation | mvrnorm() |
| lmtest | Coefficient testing; autocorrelation tests on OLS residuals | coeftest(), bgtest(), dwtest() |
| sandwich | HAC and other robust covariance matrices | NeweyWest(), vcovHAC() |
| nlme | GLS with structured error correlations | gls(), corAR1(), corARMA() |
| patchwork | Multi-panel figures | +, / operators |
| tidyverse | Data manipulation and plotting | ggplot(), tibble(), pipes (\|>) |
In Labs 2 and 3, we focused on stationary processes — series with constant mean, constant variance, and autocovariance depending only on the lag distance. The ARMA toolkit, ACF/PACF identification, AIC/BIC selection, and out-of-sample cross-validation all rely on stationarity.
A natural temptation is to treat “non-stationarity” as a single problem with a single fix: difference the series and re-apply the ARMA toolkit. That is sometimes correct, but it is also one of the most common ways applied analyses go wrong.
A series that looks non-stationary may be stationary in disguise: for example, stationary around a deterministic trend, or stationary except for one or two structural breaks. Differencing such a series is over-differencing: it introduces a non-invertible MA root, biases long-run and level-shift effects toward zero, and often produces ARIMA models that fit the wrong dynamics.
This lab is built around the discipline of diagnosing before differencing. Part 1 lays out how to do that — what the three main flavors of non-stationarity look like, what unit-root tests actually test (and where they fail), and what to do when test verdicts disagree with visual or substantive evidence. Parts 2–6 then put these diagnostics to work on real data, including a worked example (the Bush approval series) where the standard tests give the wrong answer.
A process \(\{y_t\}\) is (weakly) stationary if all three of the following hold:

1. Constant mean: \(E[y_t] = \mu\) for all \(t\).
2. Constant, finite variance: \(\text{Var}(y_t) = \sigma^2 < \infty\) for all \(t\).
3. Stable dependence: \(\text{Cov}(y_t, y_{t-k}) = \gamma_k\) depends only on the lag \(k\), never on \(t\).
Under weak stationarity, the ACF is a property of the process, not the sample, and ACF/PACF can be read for AR/MA orders as in Lab 2. The unconditional mean exists and forecasts revert to it. None of these properties survive when the mean drifts, the variance grows, or the covariance structure changes over time.
Weak stationarity requires the first two moments (mean, variance, autocovariances) to be time-invariant. Strict stationarity is stronger: it requires the entire joint distribution to be time-invariant, so all moments that exist are also time-invariant, but the requirement is on the distribution itself, not just its moments. For Gaussian processes the two coincide (a Gaussian distribution is pinned down by its first two moments), which is why “weakly stationary Gaussian = strictly stationary” gets used as a shortcut.
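To see the gap between the two definitions, consider a construction (a sketch of ours, not part of the lab's code): alternate between a normal draw and a centered exponential draw. The first two moments are identical at every \(t\), so the process is weakly stationary, but the marginal distribution depends on whether \(t\) is odd or even, so it is not strictly stationary.

# Weakly but not strictly stationary: moments constant, distribution not.
# Even t: N(0, 1); odd t: Exp(1) - 1 (also mean 0, variance 1, but skewed).
set.seed(1)
n_demo <- 10000
idx <- seq_len(n_demo)
y_mix <- ifelse(idx %% 2 == 0, rnorm(n_demo), rexp(n_demo) - 1)
c(mean = mean(y_mix), var = var(y_mix))  # ~ (0, 1), the same at every t
# Third moments differ by parity: ~0 for even t (symmetric), ~2 for odd t
c(even = mean(y_mix[idx %% 2 == 0]^3), odd = mean(y_mix[idx %% 2 == 1]^3))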
Each condition rules out a specific way the data-generating process could be drifting through time, and each violation has a recognizable visual signature. We use two rolling-window diagnostics throughout Part 1: the rolling mean (centered moving average) tracks Condition 1, and the rolling standard deviation tracks Condition 2. Condition 3 needs a richer tool — a rolling lag-1 autocorrelation. The three subsections below show what each condition looks like when it holds and when it fails.
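The rolling helpers roll_mean() and roll_sd() called in the code below are presumably defined earlier in the lab materials; if you are running this section standalone, a minimal centered-window version (defaulting to the 30-period window the plot subtitles mention) would be:

# Centered rolling statistics with NA padding at the window edges
roll_stat <- function(x, w = 30, fun = mean) {
  n <- length(x); out <- rep(NA_real_, n); h <- w %/% 2
  for (i in (h + 1):(n - h)) out[i] <- fun(x[(i - h):(i + h - 1)])
  out
}
roll_mean <- function(x, w = 30) roll_stat(x, w, mean)
roll_sd   <- function(x, w = 30) roll_stat(x, w, sd)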
A stationary AR(1) oscillates around a fixed point with no systematic drift. A violation looks like a series whose level moves through time via a deterministic trend, a random walk with drift, or a step shift. The cleanest visual diagnostic is the rolling mean: under stationarity, it hovers; under violation, it traces out the time-varying mean function.
set.seed(101)
T_n <- 300
t_idx <- seq_len(T_n)
y1_stat <- as.numeric(arima.sim(n = T_n, model = list(ar = 0.5), sd = 1))
y1_viol <- 0.04 * t_idx +
as.numeric(arima.sim(n = T_n, model = list(ar = 0.5), sd = 1))
df1 <- tibble(t = t_idx,
stat = y1_stat,
viol = y1_viol,
roll_stat = roll_mean(y1_stat),
roll_viol = roll_mean(y1_viol))
p1a <- ggplot(df1, aes(t)) +
geom_line(aes(y = stat), color = "grey60") +
geom_hline(yintercept = 0, linetype = "dashed", color = "black", alpha = 0.5) +
geom_line(aes(y = roll_stat), color = "steelblue", linewidth = 1) +
labs(title = "Stationary mean",
subtitle = "AR(1); blue = 30-period rolling mean",
y = expression(y[t]), x = "t") +
theme_minimal(base_size = 11)
p1b <- ggplot(df1, aes(t)) +
geom_line(aes(y = viol), color = "grey60") +
geom_line(aes(y = roll_viol), color = "firebrick", linewidth = 1) +
labs(title = "Drifting mean",
subtitle = "y_t = 0.04*t + AR(1); rolling mean trends with t",
y = expression(y[t]), x = "t") +
theme_minimal(base_size = 11)
p1a + p1b
The dashed reference at zero on the left makes it clear that the rolling mean stays put. On the right, the rolling mean climbs steadily — the unconditional mean is no longer a single number but a function of \(t\). Condition 1 fails, and any technique that assumes a single \(\mu\) (the AR equilibrium, mean-reversion forecasts, the standard ACF interpretation) becomes unreliable.
Stationarity also requires the variance of \(y_t\) to be the same everywhere. A violation looks like a series whose amplitude grows or shrinks over time. The diagnostic is the rolling standard deviation: a constant-width band signals stationarity; an expanding band signals heteroskedasticity. Random walks, ARCH/GARCH-type processes, and series with regime-dependent volatility all violate this condition, but for very different reasons.
set.seed(102)
y2_stat <- as.numeric(arima.sim(n = T_n, model = list(ar = 0.5), sd = 1))
# Violation: AR(1) with innovation SD growing linearly with t
sigma_t <- 1 + 0.015 * t_idx
y2_viol <- numeric(T_n)
y2_viol[1] <- rnorm(1, sd = sigma_t[1])
for (i in 2:T_n) y2_viol[i] <- 0.5 * y2_viol[i-1] + rnorm(1, sd = sigma_t[i])
df2 <- tibble(t = t_idx,
stat = y2_stat,
viol = y2_viol,
sd_stat = roll_sd(y2_stat),
sd_viol = roll_sd(y2_viol))
p2a <- ggplot(df2, aes(t)) +
geom_ribbon(aes(ymin = -sd_stat, ymax = sd_stat),
fill = "steelblue", alpha = 0.25) +
geom_line(aes(y = stat), color = "grey50") +
labs(title = "Constant variance",
subtitle = "AR(1) with sigma = 1",
y = expression(y[t]), x = "t") +
theme_minimal(base_size = 11)
p2b <- ggplot(df2, aes(t)) +
geom_ribbon(aes(ymin = -sd_viol, ymax = sd_viol),
fill = "firebrick", alpha = 0.25) +
geom_line(aes(y = viol), color = "grey50") +
labs(title = "Growing variance",
subtitle = "AR(1) with sigma_t = 1 + 0.015*t",
y = expression(y[t]), x = "t") +
theme_minimal(base_size = 11)
p2a + p2b
The left ribbon stays flat around \(\pm 1\); the right ribbon fans out from about \(\pm 1\) to about \(\pm 6\). When variance is non-constant, ordinary inference with standard errors is misleading and forecast intervals understate uncertainty in the high-variance regime. (Note: this condition is about the unconditional variance. Even GARCH processes — where the conditional variance moves around — can be unconditionally stationary if the variance equation itself has a stationary solution.)
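The parenthetical is easy to check by simulation. A sketch (ours, under a simple ARCH(1) assumption): with \(\sigma_t^2 = \omega + a_1 y_{t-1}^2\) and \(a_1 < 1\), the conditional variance moves every period, but the unconditional variance is the constant \(\omega/(1 - a_1)\), so the rolling SD hovers rather than fanning out.

# ARCH(1): time-varying conditional variance, constant unconditional variance
set.seed(107)
omega <- 0.5; a1 <- 0.4  # unconditional variance = 0.5 / 0.6 ~ 0.83
y_arch <- numeric(T_n)
y_arch[1] <- rnorm(1, sd = sqrt(omega / (1 - a1)))
for (i in 2:T_n) {
  y_arch[i] <- rnorm(1, sd = sqrt(omega + a1 * y_arch[i - 1]^2))
}
summary(roll_sd(y_arch))  # hovers near sqrt(0.83); no systematic trend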
The third condition is the one most people forget: stationarity is not just about levels and variances, but about the entire dependence structure being stable over time. The autocovariance
\[\gamma_k = \text{Cov}(y_t, y_{t-k})\]
must depend only on the lag \(k\), never on the time index \(t\). The most direct diagnostic is to compute autocorrelation in a rolling window and watch whether it changes with \(t\). Under stationarity the rolling autocorrelation should be flat; under violation it traces out a time-varying dependence structure.
set.seed(103)
# Stationary AR(1): same phi throughout
y3_stat <- as.numeric(arima.sim(n = T_n, model = list(ar = 0.5), sd = 1))
# Violation: phi shifts halfway from 0.2 to 0.85
half <- T_n / 2
y3_first <- as.numeric(arima.sim(n = half, model = list(ar = 0.2), sd = 1))
y3_second <- numeric(half)
y3_second[1] <- 0.85 * y3_first[half] + rnorm(1)
for (i in 2:half) y3_second[i] <- 0.85 * y3_second[i-1] + rnorm(1)
y3_viol <- c(y3_first, y3_second)
# Rolling lag-1 autocorrelation (window centered on t)
roll_ac1 <- function(x, w = 60) {
n <- length(x); out <- rep(NA_real_, n); h <- w %/% 2
for (i in (h + 1):(n - h)) {
sub <- x[(i - h):(i + h - 1)]
out[i] <- cor(sub[-1], sub[-length(sub)])
}
out
}
df3 <- tibble(
t = t_idx,
y_stat = y3_stat,
y_viol = y3_viol,
ac1_stat = roll_ac1(y3_stat, w = 60),
ac1_viol = roll_ac1(y3_viol, w = 60)
)
# Top row: time series themselves
ts_a <- ggplot(df3, aes(t, y_stat)) +
geom_line(color = "grey55") +
labs(title = "Stationary (constant phi=0.5)",
subtitle = "Time series",
y = expression(y[t]), x = NULL) +
theme_minimal(base_size = 11)
ts_b <- ggplot(df3, aes(t, y_viol)) +
geom_line(color = "grey55") +
geom_vline(xintercept = half, linetype = "dashed", color = "firebrick") +
labs(title = "Shifting (phi: 0.2 -> 0.85 at t=150)",
subtitle = "Time series",
y = expression(y[t]), x = NULL) +
theme_minimal(base_size = 11)
# Bottom row: rolling lag-1 autocorrelation over time
ac_a <- ggplot(df3, aes(t, ac1_stat)) +
geom_hline(yintercept = 0.5, linetype = "dashed",
color = "black", alpha = 0.5) +
geom_line(color = "steelblue", linewidth = 1) +
coord_cartesian(ylim = c(-0.2, 1)) +
labs(subtitle = "Rolling lag-1 ACF (window = 60); dashed line = true rho_1",
y = expression(rho[1]), x = "t") +
theme_minimal(base_size = 11)
ac_b <- ggplot(df3, aes(t, ac1_viol)) +
geom_hline(yintercept = c(0.2, 0.85), linetype = "dashed",
color = "black", alpha = 0.5) +
geom_vline(xintercept = half, linetype = "dashed", color = "firebrick") +
geom_line(color = "firebrick", linewidth = 1) +
coord_cartesian(ylim = c(-0.2, 1)) +
labs(subtitle = "Rolling lag-1 ACF (window = 60); dashed lines = true rho_1 in each regime",
y = expression(rho[1]), x = "t") +
theme_minimal(base_size = 11)
(ts_a | ts_b) / (ac_a | ac_b)
Read the figure top to bottom. The time series in the top row look broadly comparable in level and amplitude — there is no obvious mean shift or variance change at \(t = 150\). But the texture differs: the right-hand series is jagged in the first half (fast mean reversion, \(\phi = 0.2\)) and visibly smoother in the second half (slow mean reversion, \(\phi = 0.85\)). The rolling lag-1 autocorrelation in the bottom row makes the difference precise. On the left it hovers around its true value \(\rho_1 = 0.5\) for the entire sample. On the right it sits near \(0.2\) for the first 150 periods, then climbs to roughly \(0.85\) — the autocorrelation function is itself a function of \(t\).
Why the third condition matters. Mean and variance are easy to spot. The autocovariance shift is the silent killer: a series can have a flat rolling mean and a flat rolling SD but still violate Condition 3 if its dependence structure shifts (regime changes, parameter drift, structural breaks in the dynamics). The standard ACF/PACF interpretation, lag selection, and forecast intervals all assume the autocovariance structure is the same on the sample we use to fit the model and on the sample we use to forecast. When that fails, none of those tools is trustworthy — even if the level and amplitude look stable.
Most non-stationary series we encounter in applied work fall into one of three categories. They look superficially similar — all three drift, wander, or shift over time — but they require very different modeling strategies.
(a) Trend stationary (TS). The series has a deterministic mean function: \[y_t = \alpha + \gamma t + u_t, \quad u_t \text{ stationary}.\] The mean is non-constant, but it is a known function of time. Subtract it (or include \(t\) as a regressor) and the residuals are stationary. Cure: detrend.
(b) Difference stationary (DS), a.k.a. unit root. Each shock is permanent: \[y_t = y_{t-1} + \varepsilon_t \quad \text{(simplest case: random walk)}.\] The first difference \(\Delta y_t = \varepsilon_t\) is stationary, but the level has no fixed mean — past shocks accumulate forever. Cure: difference. This is the case that ARIMA(\(p,1,q\)) models target.
(c) Stationary with structural breaks (SSB). The data-generating process is otherwise stationary, but the parameters shift at one or more known dates: \[y_t = \alpha_1 \mathbf{1}\{t < T^*\} + \alpha_2 \mathbf{1}\{t \geq T^*\} + u_t, \quad u_t \text{ stationary}.\] Within each regime the series is stationary; the apparent non-stationarity is cross-regime, driven by the level shift at \(T^*\). Cure: model the breaks (intervention dummies) and let \(u_t\) be ARMA.
The three look alike in raw plots but require fundamentally different cures:
| Flavor | DGP | Right cure | Wrong cure \(\Rightarrow\) |
|---|---|---|---|
| TS (trend + stationary noise) | \(y_t = \gamma t + u_t\) | Detrend or include \(t\) as regressor | Differencing \(\Rightarrow\) MA unit root, lost trend information |
| DS (unit root) | \(y_t = y_{t-1} + \varepsilon_t\) | Difference once | Detrending \(\Rightarrow\) residuals still I(1), spurious significance |
| SSB (breaks) | \(y_t = \alpha_g + u_t\), \(g\) depends on \(t\) | Model the breaks (dummies) + ARMA on residuals | Differencing \(\Rightarrow\) over-differencing; residuals look like MA(1) with \(\theta \approx -1\) |
The three subsections below work each flavor end-to-end: simulate the violation, plot the rolling-mean diagnostic, apply the cure, and confirm with a residual-autocorrelation test that the non-stationarity has been removed. Test choice depends on the cure:

- When the cure is pure differencing (diff(y)), no parameters are estimated, and Box.test(..., type = "Ljung-Box") applies cleanly with the standard \(\chi^2_h\) reference.
- When the cure is a regression (detrending, break dummies), use lmtest::bgtest() (Breusch-Godfrey LM test). It accounts for estimated regressors and remains valid in the presence of lagged dependent variables, where Ljung-Box is biased toward not rejecting.

A subtlety to flag up front: the cure addresses non-stationarity, not serial dependence. After a successful cure, the residuals are stationary, but they may still have ARMA structure. A test that rejects “no autocorrelation” therefore does not mean the cure failed — it means the next modeling step is to fit an ARMA model on the cured series.
DGP: \(y_t = 0.05\, t + u_t\) with \(u_t\) a stationary AR(1), \(\phi = 0.15\).
set.seed(104)
T_n <- 200
t_idx <- seq_len(T_n)
# Simulate trend + AR(1) noise
y_TS <- 0.05 * t_idx +
as.numeric(arima.sim(n = T_n, model = list(ar = 0.15), sd = 1))
# Cure: detrend by regressing on t and keeping the residuals
fit_TS <- lm(y_TS ~ t_idx)
res_TS <- as.numeric(residuals(fit_TS))
ts_df <- tibble(t = t_idx, raw = y_TS, cured = res_TS,
m_raw = roll_mean(y_TS),
m_cured = roll_mean(res_TS))
p_TS_raw <- ggplot(ts_df, aes(t)) +
geom_line(aes(y = raw), color = "grey60") +
geom_line(aes(y = m_raw), color = "firebrick", linewidth = 1) +
labs(title = "Raw: trend + AR(1)",
subtitle = "rolling mean climbs with t",
y = expression(y[t]), x = "t") +
theme_minimal(base_size = 11)
p_TS_cured <- ggplot(ts_df, aes(t)) +
geom_line(aes(y = cured), color = "grey60") +
geom_line(aes(y = m_cured), color = "steelblue", linewidth = 1) +
geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
labs(title = "After cure: residuals from lm(y ~ t)",
subtitle = "rolling mean flat at zero",
y = "residuals", x = "t") +
theme_minimal(base_size = 11)
p_TS_raw + p_TS_cured
lmtest::bgtest(fit_TS, order = 4)
##
## Breusch-Godfrey test for serial correlation of order up to 4
##
## data: fit_TS
## LM test = 2.8672, df = 4, p-value = 0.5803
The trend is gone — the rolling mean of the residuals sits at zero with constant amplitude. Breusch-Godfrey does not reject here (\(p \approx 0.58\)): with \(\phi = 0.15\), the AR(1) noise the DGP put on top of the trend is too weak to detect at this sample size. The substantive point stands either way: detrending leaves the AR structure intact, so a rejection at this stage would signal dynamics to model, not a failed cure. Non-stationarity is removed; the residuals are stationary AR(1), which would be modeled with Arima(res_TS, order = c(1, 0, 0)) next.
DGP: \(y_t = y_{t-1} + \varepsilon_t\).
set.seed(105)
# Simulate random walk
y_DS <- cumsum(rnorm(T_n))
# Cure: take first differences
dy_DS <- diff(y_DS)
ds_raw_df <- tibble(t = t_idx,
raw = y_DS,
m_raw = roll_mean(y_DS))
ds_diff_df <- tibble(t = t_idx[-1],
cured = as.numeric(dy_DS),
m_cured = roll_mean(as.numeric(dy_DS)))
p_DS_raw <- ggplot(ds_raw_df, aes(t)) +
geom_line(aes(y = raw), color = "grey60") +
geom_line(aes(y = m_raw), color = "firebrick", linewidth = 1) +
labs(title = "Raw: random walk",
subtitle = "rolling mean wanders with no anchor",
y = expression(y[t]), x = "t") +
theme_minimal(base_size = 11)
p_DS_cured <- ggplot(ds_diff_df, aes(t)) +
geom_line(aes(y = cured), color = "grey60") +
geom_line(aes(y = m_cured), color = "steelblue", linewidth = 1) +
geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
labs(title = "After cure: diff(y)",
subtitle = "rolling mean flat at zero",
y = expression(Delta * y[t]), x = "t") +
theme_minimal(base_size = 11)
p_DS_raw + p_DS_cured
Box.test(dy_DS, lag = 10, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: dy_DS
## X-squared = 5.3047, df = 10, p-value = 0.8699
Differencing the random walk gives white-noise innovations. Ljung-Box fails to reject — the cure is complete and there is no residual ARMA structure to model. This is the cleanest of the three cases.
DGP: a stationary AR(1) plus a single level shift of \(+5\) at \(t = 100\).
set.seed(106)
break_t <- 100
# Simulate AR(1) + level shift
y_SSB <- as.numeric(arima.sim(n = T_n, model = list(ar = 0.15), sd = 1)) +
5 * (t_idx >= break_t)
# Cure: include a regime dummy in OLS
fit_SSB <- lm(y_SSB ~ I(t_idx >= break_t))
res_SSB <- as.numeric(residuals(fit_SSB))
ssb_df <- tibble(t = t_idx,
raw = y_SSB,
cured = res_SSB,
m_raw = roll_mean(y_SSB),
m_cured = roll_mean(res_SSB))
p_SSB_raw <- ggplot(ssb_df, aes(t)) +
geom_line(aes(y = raw), color = "grey60") +
geom_line(aes(y = m_raw), color = "firebrick", linewidth = 1) +
geom_vline(xintercept = break_t, linetype = "dashed", color = "firebrick") +
labs(title = "Raw: AR(1) + level shift at t=100",
subtitle = "rolling mean steps up",
y = expression(y[t]), x = "t") +
theme_minimal(base_size = 11)
p_SSB_cured <- ggplot(ssb_df, aes(t)) +
geom_line(aes(y = cured), color = "grey60") +
geom_line(aes(y = m_cured), color = "steelblue", linewidth = 1) +
geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
labs(title = "After cure: residuals from lm(y ~ break_dummy)",
subtitle = "rolling mean flat at zero",
y = "residuals", x = "t") +
theme_minimal(base_size = 11)
p_SSB_raw + p_SSB_cured
lmtest::bgtest(fit_SSB, order = 4)
##
## Breusch-Godfrey test for serial correlation of order up to 4
##
## data: fit_SSB
## LM test = 5.2415, df = 4, p-value = 0.2634
The break dummy absorbs the level shift; the residuals are flat at zero. As in the TS case, the underlying \(u_t\) is AR(1), but with \(\phi = 0.15\) it is weak enough that BG does not reject (\(p \approx 0.26\)). Any remaining autocorrelation is the substantive dynamics we now want to model, not evidence that the cure failed.
The asymmetric misdiagnosis cost. Each cure removes exactly the kind of non-stationarity its DGP put in. Apply the wrong cure — for instance, differencing the SSB series in §1.2.3 instead of using a break dummy — and you get a transformed series whose ACF still looks “fine,” but whose model parameters now reflect the artifact of over-differencing rather than the substantive dynamics. That is the asymmetric misdiagnosis cost: differencing a stationary-with-breaks series corrupts inference in a way that is hard to detect from residual diagnostics alone, because the differenced series will look “good enough” by ACF/PACF standards. Part 2 walks through exactly this trap on the presidential approval data.
Two complementary tests are the standard tools:
Augmented Dickey–Fuller (ADF) tests the null hypothesis that the series has a unit root:

\[H_0: y_t \sim I(1) \text{ (unit root)} \qquad \text{vs.} \qquad H_1: y_t \sim I(0) \text{ (stationary)}.\]
Reject \(\Rightarrow\) evidence of stationarity. Fail to reject \(\Rightarrow\) no evidence against a unit root (which is not the same as evidence for it).
Kwiatkowski–Phillips–Schmidt–Shin (KPSS) flips the null:

\[H_0: y_t \sim I(0) \text{ (stationary around a constant or trend)} \qquad \text{vs.} \qquad H_1: y_t \sim I(1).\]
Reject \(\Rightarrow\) evidence of non-stationarity. Fail to reject \(\Rightarrow\) no evidence against stationarity.
Used together, the tests cross-check each other:
| ADF | KPSS | Implication |
|---|---|---|
| Reject I(1) | Fail to reject I(0) | Both say stationary \(\Rightarrow\) I(0) |
| Fail to reject I(1) | Reject I(0) | Both say non-stationary \(\Rightarrow\) I(1) |
| Reject I(1) | Reject I(0) | Conflicting — investigate (often a structural break) |
| Fail to reject I(1) | Fail to reject I(0) | Both inconclusive — short series, low power, or near-unit-root |
The third row is where most applied trouble lives. “Both tests reject” is the hallmark of stationarity-with-breaks (Section 1.2, case c) — neither hypothesis is right, because the truth is “stationary within regimes.”
Low power against near-unit-root alternatives. If \(y_t\) is a stationary AR(1) with \(\phi = 0.95\), ADF will fail to reject in moderate samples — not because there is a unit root, but because the test has trouble distinguishing \(\phi = 0.95\) from \(\phi = 1\). Sample size matters: power improves with \(T\).
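A quick Monte Carlo (our sketch, not the lab's code) makes the power problem concrete: the rejection rate of adf.test() for a stationary AR(1) with \(\phi = 0.95\) at \(T = 150\) is far below one.

# Rejection rate of ADF for a stationary-but-persistent AR(1)
set.seed(108)
pvals <- replicate(200, {
  y <- as.numeric(arima.sim(n = 150, model = list(ar = 0.95)))
  suppressWarnings(adf.test(y)$p.value)
})
mean(pvals < 0.05)  # low power: most samples fail to reject the unit root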
Sensitivity to deterministic regressors. ADF is
run with a constant only, with a constant and a linear trend,
or with neither. The wrong choice changes the verdict. Default rule: if
the series visibly trends, include the trend (adf.test()
does this automatically); otherwise, just the constant.
Vulnerability to structural breaks (Perron 1989). A stationary series with a one-time level shift has the same slow ACF decay as a unit root. ADF tends to fail to reject the unit-root null; KPSS tends to reject the stationarity null. Both tests are misled — and, crucially, in opposite directions, which is what generates the conflicting verdict in row 3 of the table above. The fix is to model the break explicitly and re-test on the residuals.
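The Perron scenario is equally easy to reproduce (again a sketch of ours): a stationary AR(1) plus a mid-sample level shift typically lands in row 3 of the table, with ADF failing to reject and KPSS rejecting.

# A level shift fools both tests, in opposite directions
set.seed(109)
y_break <- as.numeric(arima.sim(n = 200, model = list(ar = 0.5))) +
  5 * (seq_len(200) > 100)
suppressWarnings(adf.test(y_break)$p.value)   # typically large: no rejection
suppressWarnings(kpss.test(y_break)$p.value)  # typically 0.01: rejects I(0)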
The practical lesson: unit-root tests are diagnostics, not oracles. Always pair them with (a) a careful plot of the series, (b) substantive knowledge of any events that might have shifted the level, (c) ACF/PACF of both levels and residuals, and (d) comparison of competing specifications side-by-side.
Before differencing, the disciplined workflow is:

1. Plot the series with rolling-mean and rolling-SD diagnostics; note any visible trend, breaks, or variance changes.
2. Run ADF and KPSS (level and trend variants) and compare verdicts against the table above.
3. If verdicts conflict or a break/trend is plausible, model it explicitly (trend term, intervention dummies) and re-run the unit-root tests on the residuals.
4. Difference only if the unit-root verdict survives this conditioning.
We will use exactly this workflow in Part 2 to show that the presidential approval series — which on a naive ADF/KPSS read looks like a textbook unit-root case — is actually stationary once 9/11 and the Iraq War are accounted for.
When the diagnosis genuinely points to a unit root, differencing is the right move and the model is ARIMA(\(p, d, q\)):
\[\underbrace{\Delta^d y_t}_{\text{differenced } d \text{ times}} \sim \text{ARMA}(p, q).\]
Special cases: ARIMA(\(p,0,q\)) is just ARMA(\(p,q\)) on the levels; ARIMA(0,1,0) is a random walk; and fitting ARIMA(\(p,1,q\)) to \(y_t\) is equivalent to fitting ARMA(\(p,q\)) to \(\Delta y_t\), as the sketch below shows.
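A sketch on our own simulated data (no drift term): fit both forms and compare the AR coefficient.

# ARIMA(1,1,0) on levels vs. AR(1) on first differences: same model
set.seed(110)
y_rw <- cumsum(as.numeric(arima.sim(n = 300, model = list(ar = 0.4))))
fit_levels <- Arima(y_rw, order = c(1, 1, 0))
fit_diffs <- Arima(diff(y_rw), order = c(1, 0, 0), include.mean = FALSE)
rbind(levels = coef(fit_levels), differenced = coef(fit_diffs))  # ar1 agrees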
This part walks through the diagnostic discipline from Part 1 on a real series — monthly Bush approval, February 2001 to June 2006. The textbook recipe is “the ACF decays slowly, so it’s non-stationary, so difference it, so fit ARIMA.” We’ll see why that’s the wrong move here, what the right specification is, and how a small change in diagnostic discipline rescues a meaningful AR(1) coefficient that the textbook approach throws away.
approval <- read_csv("data/approval.csv")
glimpse(approval)
## Rows: 65
## Columns: 8
## $ month <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7,…
## $ year <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 20…
## $ approve <dbl> 58.67, 58.00, 60.50, 55.00, 54.00, 56.50, 56.00, 75.67, …
## $ disapprove <dbl> 23.67, 26.67, 29.50, 33.33, 34.00, 34.00, 35.00, 18.33, …
## $ unsure <dbl> 17.67, 15.33, 10.00, 11.67, 12.00, 9.50, 9.00, 6.00, 3.0…
## $ sept.oct.2001 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ iraq.war <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ avg.price <dbl> 144.975, 140.925, 155.160, 170.175, 161.625, 142.060, 14…
The dataset is monthly, \(T = 65\), covering Feb 2001 – Jun 2006. The variables we use:
- approve: percent approving of George W. Bush.
- sept.oct.2001: a pulse dummy equal to 1 in September and October 2001 only — capturing the immediate “rally around the flag” effect after 9/11.
- iraq.war: a pulse dummy equal to 1 in March, April, and May 2003 — capturing the Iraq-War rally bump.
- avg.price: monthly average oil price index.

Both intervention dummies are pulses (one in a handful of months, zero otherwise), not steps (one forever after the event). They model the immediate spike, not a permanent regime shift.
Substantive question. What was the immediate effect of 9/11 and the Iraq War on approval, do oil-price movements affect approval, and how persistent are the unmodeled dynamics — once we account for the visible long downward drift?
approve <- ts(approval$approve, start = c(2001, 2), frequency = 12)
xcov_approve <- cbind(
sept.oct.2001 = approval$sept.oct.2001,
iraq.war = approval$iraq.war,
avg.price = approval$avg.price
)
p_approve <- autoplot(approve) +
geom_vline(xintercept = 2001 + 8/12, linetype = "dashed", color = "firebrick") +
geom_vline(xintercept = 2003 + 2/12, linetype = "dashed", color = "steelblue") +
annotate("text", x = 2002, y = 45, label = "9/11", color = "firebrick", size = 4) +
annotate("text", x = 2003.7, y = 45, label = "Iraq War", color = "steelblue", size = 4) +
labs(title = "U.S. Presidential Approval (Bush, 2001–2006)",
y = "% Approving", x = "Year") +
theme_minimal(base_size = 13)
p_approve
Three features stand out. (i) A massive spike at 9/11 (Sep 2001), peaking around 88% approval. (ii) A smaller bump after the Iraq invasion (March 2003). (iii) A long downward drift from late 2002 onward, ending around 35% in mid-2006. There is no visible “level shift” that persists — the post-rally trajectory always reverts toward a declining baseline.
p_approve_acf <-
(ggAcf(approve) + labs(title = "ACF — Approval levels") + theme_minimal(base_size = 11)) |
(ggPacf(approve) + labs(title = "PACF — Approval levels") + theme_minimal(base_size = 11))
p_approve_acf
The ACF decays slowly — a textbook signature that students often read as “unit root, must difference.” But Part 1 warned us: a slow ACF decay is also what trend-stationary series and stationary series with breaks look like. The visual evidence here points clearly to a deterministic decline (a downward trend, plus the two pulse interventions), not a stochastic drift. We need formal tests to triangulate.
adf.test(approve) # H0: unit root
##
## Augmented Dickey-Fuller Test
##
## data: approve
## Dickey-Fuller = -3.9565, Lag order = 3, p-value = 0.01721
## alternative hypothesis: stationary
kpss.test(approve, null = "Level") # H0: stationary around a constant
##
## KPSS Test for Level Stationarity
##
## data: approve
## KPSS Level = 1.2485, Truncation lag parameter = 3, p-value = 0.01
kpss.test(approve, null = "Trend") # H0: stationary around a linear trend
##
## KPSS Test for Trend Stationarity
##
## data: approve
## KPSS Trend = 0.11109, Truncation lag parameter = 3, p-value = 0.1
| Test | \(H_0\) | \(p\)-value | Verdict |
|---|---|---|---|
| ADF (default = constant + trend) | unit root | \(0.017\) | reject \(\Rightarrow\) stationary |
| KPSS (Level) | stationary around \(\mu\) | \(0.010\) | reject \(\Rightarrow\) non-stationary |
| KPSS (Trend) | stationary around trend | \(0.100\) | don’t reject \(\Rightarrow\) trend-stationary |
The verdicts conflict — but only superficially. ADF and KPSS-Trend agree that the series is stationary once we allow for a deterministic trend. KPSS-Level disagrees because it assumes the mean is constant, which is plainly false (approval falls from ~85 to ~35 over five years). This is Pitfall 2 from Part 1: the deterministic specification matters. The same series gives opposite verdicts when the null is “stationary around a constant” vs. “stationary around a trend.”
Diagnosis: the approval series is trend-stationary with two pulse interventions. Not a unit root. The slow ACF decay is real persistence in the levels, not a stochastic trend.
A practitioner who only ran KPSS-Level — or who took the slow ACF decay at face value — would conclude “non-stationary” and difference. Let’s see what that gives us.
fit_naive <- Arima(approve, order = c(1, 1, 0), xreg = xcov_approve)
fit_naive
## Series: approve
## Regression with ARIMA(1,1,0) errors
##
## Coefficients:
## ar1 sept.oct.2001 iraq.war avg.price
## 0.0701 11.1213 5.3167 -0.0669
## s.e. 0.1320 2.5128 2.5723 0.0352
##
## sigma^2 = 13.12: log likelihood = -171.11
## AIC=352.22 AICc=353.26 BIC=363.02
The output reports ar1 \(\approx 0.07\) with standard error \(\approx 0.13\) — not significant; the coefficient is essentially zero. This is the
textbook signature of overdifferencing: the AR
component on the differenced series finds nothing to model, because
differencing has stripped out all the persistence that was actually in
the levels. The coefficient on the AR term collapses toward zero
precisely because the wrong differencing destroyed the structure it was
supposed to capture.
What we are left with is a model that says: “the changes in approval are essentially uncorrelated noise, except for jumps at 9/11 and the Iraq War, plus a small oil-price effect.” That is not the dynamic story a slow-decay ACF was telling us. We have thrown that story away.
The right model treats the levels as the dependent variable and lets an AR(1) error process carry the persistence. This is sometimes called a regression with ARMA errors, sometimes ARMAX — the X stands for “eXogenous” regressors:
\[ y_t = \beta_0 + \beta_1\,\mathbf{1}\{911\}_t + \beta_2\,\mathbf{1}\{Iraq\}_t + \beta_3\,\text{oil}_t + u_t, \qquad u_t = \phi\, u_{t-1} + \varepsilon_t. \]
The \(\beta\) coefficients are effects on the level of approval. The persistence is in the error process, not in the covariates — so \(\beta\) is the immediate-and-long-run effect of a covariate (no lagged-\(y\) propagation; we’ll add that in Part 3).
In R: Arima(y, order = c(1, 0, 0), xreg = X).
fit_armax <- Arima(approve, order = c(1, 0, 0), xreg = xcov_approve)
fit_armax
## Series: approve
## Regression with ARIMA(1,0,0) errors
##
## Coefficients:
## ar1 intercept sept.oct.2001 iraq.war avg.price
## 0.918 71.4311 11.5152 5.8123 -0.0892
## s.e. 0.049 8.2425 2.5738 2.5299 0.0357
##
## sigma^2 = 12.81: log likelihood = -173.43
## AIC=358.86 AICc=360.31 BIC=371.91
Compare this to the ARIMA(1,1,0) above:

- The AR(1) coefficient jumps from \(0.07\) (insignificant) to \(0.918\) (s.e. \(0.049\)): the persistence that differencing destroyed is recovered in the levels.
- The pulse effects are stable across specifications (9/11: \(11.5\) vs. \(11.1\); Iraq: \(5.8\) vs. \(5.3\)).
- The oil-price coefficient strengthens from \(-0.067\) to \(-0.089\) and is now clearly significant.
If the ARMAX captures the dynamics, residuals should be white noise — and crucially, they should be stationary by both ADF and KPSS, since the entire point of this section is to establish that no unit root remains once we model trend-and-breaks-and-AR(1).
res_armax <- residuals(fit_armax)
p_armax_resid <-
(ggAcf(res_armax) + labs(title = "ACF — ARMAX residuals") + theme_minimal(base_size = 11)) |
(ggPacf(res_armax) + labs(title = "PACF — ARMAX residuals") + theme_minimal(base_size = 11))
p_armax_resid
adf.test(res_armax)
##
## Augmented Dickey-Fuller Test
##
## data: res_armax
## Dickey-Fuller = -4.0106, Lag order = 3, p-value = 0.01481
## alternative hypothesis: stationary
kpss.test(res_armax)
##
## KPSS Test for Level Stationarity
##
## data: res_armax
## KPSS Level = 0.28673, Truncation lag parameter = 3, p-value = 0.1
Box.test(res_armax, lag = 12, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: res_armax
## X-squared = 5.9234, df = 12, p-value = 0.9199
All three diagnostics agree:

- ADF rejects the unit-root null (\(p = 0.015\)).
- KPSS fails to reject level stationarity (\(p = 0.1\)).
- Ljung-Box finds no residual autocorrelation (\(p = 0.92\)).
The unit-root hypothesis is ruled out conditional on the modeled trend-and-breaks-and-AR(1) structure. This is exactly the disciplined workflow from Section 1.4: re-test on the residuals of the leading specification.
auto.arima as a cross-check (not a decision rule)

auto.arima() minimizes AIC over \((p, d, q)\) (and seasonal orders). Treat it as a cross-check, not a decision rule: AIC-minimizing search tends to over-parameterize in short series, and “best AIC” is not “right substantive specification.”
fit_auto <- auto.arima(approve, xreg = xcov_approve,
stepwise = FALSE, approximation = FALSE)
fit_auto
## Series: approve
## Regression with ARIMA(1,0,0)(0,0,1)[12] errors
##
## Coefficients:
## ar1 sma1 intercept sept.oct.2001 iraq.war avg.price
## 0.9322 -0.2920 72.8397 11.2855 5.9419 -0.0941
## s.e. 0.0478 0.1755 7.5448 2.4999 2.3586 0.0344
##
## sigma^2 = 12.31: log likelihood = -172.09
## AIC=358.19 AICc=360.15 BIC=373.41
auto.arima returns ARIMA(1,0,0)(0,0,1)[12]: the same AR(1)-with-regressors structure we chose by hand, plus a marginal seasonal MA(1) term (sma1 \(= -0.29\), s.e. \(0.18\)).
We retain the parsimonious ARMAX(1,0,0): its residuals are already
white noise, the seasonal MA is borderline and unsupported by theory,
and parsimony pays off in the small-sample dynamic-effects calculations
to come. Use auto.arima to ask “did I miss something?” —
never “what is the right model?”
extract_coefs <- function(fit, label) {
ct <- lmtest::coeftest(fit)
tibble(
Coefficient = rownames(ct),
Estimate = ct[, "Estimate"],
`Std. Err` = ct[, "Std. Error"],
`p-value` = ct[, "Pr(>|z|)"]
) |> mutate(Model = label)
}
compare_df <- bind_rows(
extract_coefs(fit_naive, "ARIMA(1,1,0)"),
extract_coefs(fit_armax, "ARMAX(1,0,0)")
)
compare_df |>
mutate(across(c(Estimate, `Std. Err`), ~ round(.x, 3)),
`p-value` = round(`p-value`, 3)) |>
pivot_wider(names_from = Model,
values_from = c(Estimate, `Std. Err`, `p-value`),
names_glue = "{Model}: {.value}") |>
knitr::kable(caption = "ARIMA(1,1,0) vs ARMAX(1,0,0): coefficient comparison.")
| Coefficient | ARIMA(1,1,0): Estimate | ARMAX(1,0,0): Estimate | ARIMA(1,1,0): Std. Err | ARMAX(1,0,0): Std. Err | ARIMA(1,1,0): p-value | ARMAX(1,0,0): p-value |
|---|---|---|---|---|---|---|
| ar1 | 0.070 | 0.918 | 0.132 | 0.049 | 0.596 | 0.000 |
| sept.oct.2001 | 11.121 | 11.515 | 2.513 | 2.574 | 0.000 | 0.000 |
| iraq.war | 5.317 | 5.812 | 2.572 | 2.530 | 0.039 | 0.022 |
| avg.price | -0.067 | -0.089 | 0.035 | 0.036 | 0.057 | 0.013 |
| intercept | NA | 71.431 | NA | 8.243 | NA | 0.000 |
A coefficient plot makes the AR(1) collapse — the central headline of this part — immediately visible:
plot_df <- compare_df |>
filter(Coefficient %in% c("ar1", "sept.oct.2001", "iraq.war", "avg.price")) |>
mutate(Coefficient = factor(Coefficient,
levels = c("ar1", "sept.oct.2001", "iraq.war", "avg.price"),
labels = c("AR(1)", "9/11 dummy", "Iraq War dummy", "Oil price")),
lo = Estimate - 1.96 * `Std. Err`,
hi = Estimate + 1.96 * `Std. Err`)
p_coefs <- ggplot(plot_df,
aes(x = Estimate, y = Coefficient, color = Model)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey40") +
geom_pointrange(aes(xmin = lo, xmax = hi),
position = position_dodge(width = 0.5), size = 0.6) +
scale_color_manual(values = c("ARIMA(1,1,0)" = "firebrick",
"ARMAX(1,0,0)" = "steelblue")) +
facet_wrap(~ Coefficient, scales = "free", ncol = 2) +
labs(title = "Coefficient estimates with 95% CI",
subtitle = "ARMAX recovers the AR(1) persistence that differencing destroys",
x = NULL, y = NULL, color = NULL) +
theme_minimal(base_size = 11) +
theme(axis.text.y = element_blank(),
legend.position = "bottom",
strip.text = element_text(face = "bold"))
p_coefs
Where the two models differ: the AR(1) coefficient (essentially zero under differencing, \(0.92\) and tightly estimated under ARMAX) and the oil-price effect (insignificant at \(-0.067\) vs. significant at \(-0.089\)).
Where they agree: the pulse-shock coefficients are similar in magnitude, because pulse interventions transfer cleanly between specifications — they are 1 only in a few months and 0 otherwise, so differencing changes the regressor pattern but not the integrated impact much.
Note on AIC. AIC is reported as 352 for ARIMA(1,1,0)
and 359 for ARMAX(1,0,0). These are not directly
comparable. When Arima is called with \(d=1\), the likelihood is computed on the
differenced series (\(T-1\)
observations) and on differently-transformed data; AIC across different
values of \(d\) measures different
things. The right comparators are residual stationarity, white-noise
residuals, and substantive interpretability — and ARMAX(1,0,0) wins on
all three.
Looking ahead to Part 3. The ARMAX(1,0,0) we just estimated is one of two distinct ways to handle persistence in a regression. The other adds a lagged dependent variable (LDV), \(y_t = \alpha + \phi y_{t-1} + X_t\beta + \varepsilon_t\), putting the \(\phi\) on \(y\) itself rather than on the error. These are different models, not different estimators of the same model — they predict different long-run responses to a covariate shock. We develop this contrast carefully in §3.2 once we have the unrestricted ARDL(1,1) on the table; for now, the takeaway is that ARMAX puts the AR coefficient on the errors, so the \(\beta\)’s are immediate-and-long-run effects with no further propagation.
The ARMAX(1,0,0) gives us: a precisely estimated persistence parameter (\(\hat\phi = 0.92\)), interpretable level effects for the two rallies (\(+11.5\) and \(+5.8\) points) and for oil (\(-0.089\)), and residuals that pass every stationarity and white-noise check.
But there is a key limitation. In ARMAX, covariate effects do not propagate. The 9/11 dummy is +11.5 in September–October 2001; from November 2001 onward it is 0 and contributes nothing further. The AR(1) propagation lives in the errors, not in the covariates. So the \(\beta\)’s are immediate effects, equal to long-run effects.
For pulse interventions like 9/11, this is fine — we want the immediate effect, and we have a story for residual persistence. For continuous covariates like oil price, however, a static \(\beta\) can miss real dynamics: an oil-price increase in March may take a few months to show up in approval, propagating gradually through public opinion. To model that, we need lags of the covariate (a distributed lag) and, eventually, a lagged dependent variable to carry the propagation — the ARDL family.
That is the move in Part 3. ARDL is also the algebraic gateway to error-correction models, which we use in Parts 4–5.
In Part 2 we landed on ARMAX(1,0,0) as the right
specification for the approval series: regression on level \(y_t\) with AR(1) errors.
That model captured persistence in the errors, but it left
covariates entirely static: a one-unit increase in oil price
affects approval by exactly \(\hat\beta_3 =
-0.09\) in the same period and contributes nothing further. The
error process can transmit innovations forward through time, but
covariate effects do not propagate.
For pulse interventions like 9/11 or the Iraq War rally, the static-effect interpretation is fine, as those are brief, well-defined events. For continuous covariates like oil price, however, a shift in the input may take time to filter through public opinion. To model that propagation we move from ARMAX to ARDL (autoregressive distributed lag).
Two specifications introduce dynamics. The distributed lag (DL) adds lags of \(x\) only:

\[y_t \;=\; \alpha + \beta_1 x_t + \beta_2 x_{t-1} + u_t,\]

while the autoregressive distributed lag (ARDL) also adds a lag of \(y\):

\[y_t \;=\; \alpha + \phi\, y_{t-1} + \beta_1 x_t + \beta_2 x_{t-1} + u_t.\]

They have different identification, different residual properties, and — as we will see in Part 4 — very different connections to error-correction.
Before §3.2 patches the AR-error model and §3.3 moves to ARDL, we set up the conceptual contrast that distinguishes them. Moving from “AR on errors” to “AR on \(y\)” is not a different estimator of the same model — it is a different model, with different long-run predictions about what happens when a covariate moves.
df_lab4 <- approval |>
mutate(approve_lag = lag(approve, 1),
oil = avg.price,
oil_lag = lag(avg.price, 1)) |>
drop_na()
When a series shows autocorrelation, the question is where in the model the persistence lives.
Story A — “shocks themselves persist.” Random innovations have memory; a shock today affects \(y\) today, lingers into tomorrow, and decays out. Persistence lives in the errors.
Story B — “yesterday’s outcome shapes today’s outcome.” The system has state. Whatever happened to \(y\) yesterday is partly who \(y\) is today. Persistence lives in the dependent variable itself.
These are different beliefs about the data-generating process, not different statistical fixes.
Story A — regression with AR(1) errors (the ARMAX of
§2.5): \[
y_t \;=\; X_t \beta + u_t, \qquad u_t = \phi u_{t-1} + \varepsilon_t.
\] The \(\phi\) governs the
noise’s persistence. Fitted by
Arima(y, xreg = X, order = c(1,0,0)) — GLS/ML handles the
AR(1) covariance.
Story B — lagged dependent variable (LDV): \[
y_t \;=\; \alpha + \phi y_{t-1} + X_t \beta + \varepsilon_t, \qquad
\varepsilon_t \sim \text{WN}.
\] The \(\phi\) governs the
outcome’s persistence. Fitted by
lm(y ~ lag(y) + X) — OLS, provided residuals come
out white noise (Breusch-Godfrey gateway, cf. Lab 3 §2.4).
Both stories sit inside a more general parent — the unrestricted ARDL(1,1):
\[ y_t \;=\; \alpha + \phi y_{t-1} + \beta_0 X_t + \beta_1 X_{t-1} + \varepsilon_t. \]
This has four free coefficients. A restriction is a constraint imposed before estimation that ties one coefficient to the others (or fixes its value). Once imposed, that coefficient is no longer free; its value follows from the constraint plus the others.
| Specification | \(\alpha\) | \(\phi\) (\(y_{t-1}\)) | \(\beta_0\) (\(X_t\)) | \(\beta_1\) (\(X_{t-1}\)) | LRM of \(X\) |
|---|---|---|---|---|---|
| Unrestricted ARDL(1,1) | free | free | free | free | \((\beta_0+\beta_1)/(1-\phi)\) |
| Story A (AR errors / COMFAC) | free | free | free | \(= -\phi\beta_0\) | \(\beta_0\) |
| Story B (LDV-only / partial adjustment) | free | free | free | \(= 0\) | \(\beta_0/(1-\phi)\) |
Reading the rows: Story B drops \(X_{t-1}\) entirely. Story A keeps \(X_{t-1}\) but pins its coefficient to \(-\phi\beta_0\). Both reduce the free-parameter count from four to three; they constrain the same coefficient (\(\beta_1\)) in different ways, and they pick out different long-run multipliers as a consequence.
The constraint isn’t arbitrary. Substitute \(u_{t-1} = y_{t-1} - X_{t-1}\beta_0\) into the AR-errors equation:
\[ y_t \;=\; X_t\beta_0 + \phi(y_{t-1} - X_{t-1}\beta_0) + \varepsilon_t \;=\; \phi y_{t-1} + \beta_0 X_t \;\underbrace{-\phi\beta_0}_{\text{the constraint}}\, X_{t-1} + \varepsilon_t. \]
The lagged-\(X\) coefficient \(-\phi\beta_0\) is exactly what falls out of “this is really an AR-errors model.” Substantively, \(-\phi\beta_0\) is the precise amount needed to cancel the LDV’s propagation of yesterday’s \(X\) effect — which is how Story A keeps covariates static.
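The algebra can be checked by simulation (our sketch; the names and parameter values are illustrative): generate data from Story A and fit the unrestricted ARDL(1,1) by OLS. The coefficient on lagged \(X\) should land near \(-\phi\beta_0\).

# Simulate Story A (static mean, AR(1) errors), then fit unrestricted ARDL(1,1)
set.seed(111)
n_sim <- 5000
x_sim <- rnorm(n_sim)
u_sim <- as.numeric(arima.sim(n = n_sim, model = list(ar = 0.8)))
y_sim <- 2 * x_sim + u_sim  # beta_0 = 2, phi = 0.8
fit_sim <- lm(y_sim[-1] ~ y_sim[-n_sim] + x_sim[-1] + x_sim[-n_sim])
round(coef(fit_sim), 2)  # lagged-x coefficient ~ -phi * beta_0 = -1.6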
The LRM column of the restriction table above already makes the point: with \(\beta_0 = 1\) and \(\phi = 0.8\), Story A’s long-run effect is \(\beta_0 = 1\); Story B’s is \(\beta_0/(1-\phi) = 5\) — a factor of five from the same nominal coefficient. The figure below makes the asymmetry visible: panel (a) shows the response to a permanent step in \(X\) (covariate shock); panel (b) shows the response to a unit innovation \(\varepsilon_0\) — where Story A does propagate, but only the noise.
beta_demo <- 1
phi_demo <- 0.8
H_demo <- 30
t_demo <- 0:H_demo
# Top panel: COVARIATE IRF — permanent +1 step in X at t=0, no innovations
resp_cov_A <- rep(beta_demo, H_demo + 1) # flat at beta
resp_cov_B <- beta_demo * (1 - phi_demo^(t_demo + 1)) / (1 - phi_demo) # geometric to beta/(1-phi)
cov_df <- bind_rows(
tibble(t = t_demo, y = resp_cov_A, story = "Story A: AR errors"),
tibble(t = t_demo, y = resp_cov_B, story = "Story B: LDV")
)
p_cov <- ggplot(cov_df, aes(t, y, color = story)) +
geom_hline(yintercept = c(1, 5), linetype = "dashed", color = "grey50") +
geom_line(linewidth = 1.1) + geom_point(size = 1.6) +
scale_color_manual(values = c("Story A: AR errors" = "steelblue",
"Story B: LDV" = "firebrick")) +
annotate("text", x = H_demo, y = 1.35, label = "Story A LRM = beta = 1",
color = "steelblue", size = 3.4, hjust = 1) +
annotate("text", x = H_demo, y = 4.65, label = "Story B LRM = beta/(1-phi) = 5",
color = "firebrick", size = 3.4, hjust = 1) +
labs(title = "(a) Covariate IRF -- response to a permanent +1 step in X",
subtitle = "Story A flat; Story B rises geometrically. Covariate dynamics differ.",
x = NULL, y = "Cumulative response of y", color = NULL) +
ylim(0, 6) +
theme_minimal(base_size = 11) +
theme(legend.position = "none")
# Bottom panel: INNOVATION IRF — unit shock in epsilon_0, X held at 0
# Both stories: u_t (Story A) or y_t deviation (Story B) decays as phi^t — identical
resp_inn <- phi_demo^t_demo
inn_df <- tibble(t = t_demo, y = resp_inn)
p_inn <- ggplot(inn_df, aes(t, y)) +
geom_hline(yintercept = 0, linetype = "dotted", color = "grey60") +
geom_line(color = "purple4", linewidth = 1.1) +
geom_point(color = "purple4", size = 1.6) +
annotate("text", x = H_demo / 2, y = 0.65,
label = "Story A and Story B trace the SAME phi^t decay\n(innovations propagate identically)",
hjust = 0.5, size = 3.4, color = "purple4", lineheight = 1.05) +
labs(title = "(b) Innovation IRF -- response to a unit shock in epsilon at t=0",
subtitle = "Identical geometric decay -- innovation dynamics are the same in both stories.",
x = "Periods after shock", y = "Response of y") +
ylim(0, 1.1) +
theme_minimal(base_size = 11)
p_two_stories <- p_cov / p_inn +
patchwork::plot_layout(heights = c(1.05, 1))
if (!dir.exists("output/slides")) dir.create("output/slides", recursive = TRUE)
ggsave("output/slides/p3_two_stories_irf.png", p_two_stories,
width = 10, height = 6.3, dpi = 110)
p_two_stories
Read both panels together: the two stories produce identical responses to innovation shocks, but radically different responses to covariate shocks. That asymmetry is the COMFAC restriction made visible. Story A propagates persistence in the errors (panel b) but holds covariates static (panel a, flat line). Story B propagates both — a covariate shock at \(t=0\) feeds into \(y_0\), then through the LDV into \(y_1\), and so on, producing the rising trajectory in panel (a). The factor-of-five gap in the cumulative effect between panels (a)’s two lines is what students should remember; the identical curves in panel (b) explain why the AR coefficient \(\phi\) alone — without context about where it lives — is not enough information to read the long-run effect.
OLS properties differ depending on whether the spec matches the DGP:
| Specification | Estimator | \(\hat\beta\) unbiased? | Consistent? | SE correct? |
|---|---|---|---|---|
| Static + AR errors | naive OLS | ✓ | ✓ | ✗ |
| Static + AR errors | GLS / ML via Arima(...) | ✓ | ✓ | ✓ |
| LDV + WN errors | OLS via lm(...) | ✗ (\(O(1/T)\), Hurwicz) | ✓ | ✓ |
| LDV + AR errors | OLS | ✗ | ✗ | ✗ |
This is not a “don’t use OLS” table — it’s a “match the estimator to the spec” table. The ✗ marks describe what happens when the estimator ignores a dynamic component the data actually contain. The remedy is never “abandon OLS”; it is “fix the specification so OLS has nothing left to ignore.”
For a static regression with AR errors, that means estimating by GLS/ML via Arima() (row 2). The Breusch-Godfrey test on LDV residuals is the gateway between rows 3 and 4: pass, and OLS-LDV is exactly the right tool; reject, and the spec needs to change before reading any coefficients off the regression. In practice, almost all of our applied work in Parts 3–6 lives in rows 2 or 3 — the rows where OLS or its ML cousin work cleanly because we matched the estimator to the DGP.
The literature (De Boef and Keele 2008; Cook and Webb 2021) recommends fitting the unrestricted ARDL(1,1) and testing the restrictions before imposing either:
# Unrestricted ARDL(1,1) — fit here for the restriction test, reused in §3.3
fit_ardl <- lm(approve ~ approve_lag + oil + oil_lag +
sept.oct.2001 + iraq.war,
data = df_lab4)
# Delta-method Wald test of COMFAC: g(theta) = beta_oil_lag + phi * beta_oil = 0
b0 <- coef(fit_ardl)["oil"]
b1 <- coef(fit_ardl)["oil_lag"]
phi <- coef(fit_ardl)["approve_lag"]
g_hat <- b1 + phi * b0
# Gradient of g w.r.t. (phi, beta_0, beta_1), zero elsewhere
G <- numeric(length(coef(fit_ardl)))
names(G) <- names(coef(fit_ardl))
G["approve_lag"] <- b0
G["oil"] <- phi
G["oil_lag"] <- 1
se_g <- sqrt(as.numeric(t(G) %*% vcov(fit_ardl) %*% G))
W_stat <- as.numeric(g_hat^2 / se_g^2)
p_chi <- pchisq(W_stat, df = 1, lower.tail = FALSE)
tibble(
Quantity = c("g_hat = beta_oil_lag + phi * beta_oil",
"Standard error (delta method)",
"Wald statistic (chi^2_1)",
"p-value"),
Value = c(round(unname(g_hat), 4), round(se_g, 4),
round(W_stat, 3), round(p_chi, 4))
) |>
knitr::kable(caption = "Wald test of the COMFAC restriction on the approval/oil ARDL(1,1).")
| Quantity | Value |
|---|---|
| g_hat = beta_oil_lag + phi * beta_oil | -0.0258 |
| Standard error (delta method) | 0.0136 |
| Wald statistic (chi^2_1) | 3.5960 |
| p-value | 0.0579 |
Result: \(\hat g = -0.026\), \(W \approx 3.6\), \(p \approx 0.06\) — borderline. We just barely fail to reject COMFAC at the 5% level. The data do not decisively distinguish between Story A and Story B for our application, and that is a typical empirical situation.
In such cases the choice falls to theory. What does the persistence in \(y\) substantively mean? Under Story A, approval itself has no memory; what persists are the shocks (rally events, news cycles) that buffet it. Under Story B, approval is sticky: today's evaluations are anchored in yesterday's, so any shock keeps echoing through the level of approval itself.
Neither reading is obviously right. A reasonable analyst commits to one based on theory and reports the LRM with the matching formula; a very reasonable analyst presents both as a robustness check. The methodological sin Cook and Webb (2021) warn against is taking coefficients estimated under one story and plugging them into the other story’s LRM formula — as the figure showed, that conflation can imply long-run effects off by a factor of five.
For the rest of Part 3 we proceed with Story B (LDV
via lm()) and read the LRM as \((\beta_0+\beta_1)/(1-\phi)\). Before that,
§3.2 starts with the simplest distributed-lag regression — no LDV — and
shows how its residuals force the choice on us, first via patches that
stay in Story A space (HAC, GLS-AR(1)).
The simplest DL adds one lag of the continuous covariate:
\[y_t \;=\; \alpha + \beta_0 x_t + \beta_1 x_{t-1} + \gamma_1 \mathbf{1}\{911\}_t + \gamma_2 \mathbf{1}\{Iraq\}_t + u_t,\]
with \(u_t\) assumed white noise. The shock dummies stay as static controls — they don’t need lags because they’re nonzero for only a few months. The lag structure on \(x\) is what introduces dynamic propagation. In Story-A-vs-B language: this is a distributed-lag in the mean (a variant of Story A’s static structure with an extra lag of \(X\)), not yet an LDV.
fit_dl <- lm(approve ~ oil + oil_lag + sept.oct.2001 + iraq.war, data = df_lab4)
coeftest(fit_dl)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.982404 3.400564 29.1076 < 2.2e-16 ***
## oil -0.144284 0.061960 -2.3286 0.0233239 *
## oil_lag -0.096806 0.064444 -1.5022 0.1383845
## sept.oct.2001 17.559949 4.661709 3.7668 0.0003837 ***
## iraq.war 6.975436 3.818350 1.8268 0.0727867 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The contemporaneous oil coefficient is significant (\(\sim -0.14\)); the lagged coefficient is borderline. The cumulative or “long-run” effect of a permanent unit increase in oil under DL is just the sum of the lag coefficients:
\[\theta_{DL} = \beta_0 + \beta_1.\]
A practical question: if one of the two oil coefficients is not individually significant, can we still interpret the cumulative \(\theta_{DL}\)? Don’t read significance off individual coefficients — test the linear combination directly. The sum \(\hat\beta_0 + \hat\beta_1\) has its own variance, \(\text{Var}(\hat\beta_0) + \text{Var}(\hat\beta_1) + 2\text{Cov}(\hat\beta_0, \hat\beta_1)\), and its own \(t\)-statistic for “is the long-run effect zero?”:
# H0: beta_0 + beta_1 = 0 (no cumulative oil effect)
car::linearHypothesis(fit_dl, "oil + oil_lag = 0")
##
## Linear hypothesis test:
## oil + oil_lag = 0
##
## Model 1: restricted model
## Model 2: approve ~ oil + oil_lag + sept.oct.2001 + iraq.war
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 60 9333.9
## 2 59 2414.8 1 6919.1 169.06 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is distinct from a joint significance test
(\(H_0: \beta_0 = \beta_1 = 0\), “does
\(X\) matter at any lag?”), which would
use linearHypothesis(fit_dl, c("oil = 0", "oil_lag = 0")).
The two can give different verdicts: a significant sum with
insignificant individuals means we have evidence for a long-run effect
but can’t pin down the lag distribution; a rejected joint test
with insignificant sum means short-run effects that cancel in
the long run. Report the cumulative-effect test alongside individual
coefficients whenever the long-run effect is the substantive
quantity.
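The same number can be computed by hand from the fitted covariance matrix, which is just the variance formula above written out (and for a single restriction, the squared \(t\) reproduces the \(F\) statistic):

# SE of the cumulative effect beta_0 + beta_1, straight from vcov()
b_sum <- sum(coef(fit_dl)[c("oil", "oil_lag")])
V <- vcov(fit_dl)[c("oil", "oil_lag"), c("oil", "oil_lag")]
se_sum <- sqrt(V[1, 1] + V[2, 2] + 2 * V[1, 2])
c(estimate = b_sum, se = se_sum, t = b_sum / se_sum)  # t^2 matches F above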
But the residuals tell us the DL model is incomplete:
dwtest(fit_dl)
##
## Durbin-Watson test
##
## data: fit_dl
## DW = 0.4108, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
Box.test(residuals(fit_dl), lag = 12, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(fit_dl)
## X-squared = 78.27, df = 12, p-value = 8.815e-12
Durbin-Watson is near \(0.4\) (\(p < 0.001\)) and Ljung-Box at lag 12 also rejects (\(p < 0.001\)). The residuals carry strong serial dependence — the same high-persistence structure we know is in approval (AR(1) at 0.92 in the ARMAX). DL captured the dynamic effect of oil, but it left the persistence in approval itself unmodeled. We have two routes that stay within Story A (patch the inference; leave the static-mean specification alone), and a third route covered in §3.3 that moves to Story B.
With autocorrelated residuals, OLS coefficients are still consistent (provided the regressors are exogenous), but the OLS variance formula \(\hat\sigma^2 (X'X)^{-1}\) understates sampling variability — confidence intervals are too narrow and \(t\)-stats too large. The standard remedy is Heteroskedasticity- and Autocorrelation-Consistent (HAC) standard errors, due to Newey and West (1987). HAC replaces the OLS “meat” with a kernel-weighted sum of sample autocovariances of the score:
\[\hat V_{HAC} = (X'X)^{-1} \left( \sum_{j = -L}^{L} k\!\left(\tfrac{j}{L}\right) \hat\Gamma_j \right) (X'X)^{-1}, \qquad \hat\Gamma_j = \tfrac{1}{T}\sum_{t} (X_t \hat\varepsilon_t)(X_{t-j} \hat\varepsilon_{t-j})'.\]
The truncation lag \(L\) controls
how far out the sum runs; the kernel \(k(\cdot)\) (Bartlett by default)
downweights distant lags so the estimator stays positive-definite.
NeweyWest() uses a data-driven default for \(L\) (Newey and West 1994), sensible for
\(T \approx 60\).
coeftest(fit_dl, vcov = sandwich::NeweyWest(fit_dl, prewhite = FALSE))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.982404 7.621352 12.9875 < 2.2e-16 ***
## oil -0.144284 0.066233 -2.1784 0.033381 *
## oil_lag -0.096806 0.056359 -1.7177 0.091102 .
## sept.oct.2001 17.559949 2.523597 6.9583 3.183e-09 ***
## iraq.war 6.975436 2.081917 3.3505 0.001411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The point estimates are identical to OLS (HAC only patches the variance), but the SEs widen. What HAC does and doesn’t do. It restores valid inference under arbitrary serial correlation, but it does not make OLS efficient and it does not protect against omitted-variable bias. To recover efficiency we either (a) specify the error correlation explicitly and re-estimate by GLS, or (b) move the dynamics into the mean equation (the Story-B move of §3.3).
If we are willing to commit to a model of the error dynamics, we re-estimate the same regression with a structured covariance:
\[y_t = X_t'\beta + u_t, \qquad u_t = \rho\, u_{t-1} + \varepsilon_t, \qquad \varepsilon_t \stackrel{\text{iid}}{\sim} (0, \sigma^2).\]
Feasible GLS estimates \(\rho\) from
the OLS residuals, then re-weights observations by an estimate of \(\Omega^{-1/2}\) (the inverse Cholesky of
the implied AR(1) covariance). The result is BLUE if the AR(1)
assumption is correct — both estimates and SEs are valid and efficient.
This is, structurally, exactly the AR-errors model from Story
A, fit a different way than Arima().
fit_dl_gls <- gls(approve ~ oil + oil_lag + sept.oct.2001 + iraq.war,
data = df_lab4, correlation = corAR1())
summary(fit_dl_gls)$tTable |> round(4)
## Value Std.Error t-value p-value
## (Intercept) 73.2582 9.7346 7.5256 0.0000
## oil -0.0714 0.0363 -1.9679 0.0538
## oil_lag -0.0297 0.0366 -0.8133 0.4193
## sept.oct.2001 11.5642 2.6494 4.3648 0.0001
## iraq.war 6.0833 2.6404 2.3039 0.0248
The estimated AR(1) parameter on the residuals (Phi in
the correlation summary) lands around \(0.93\) — the same high persistence we
already saw in the ARMAX of Part 2. Side-by-side comparison:
tibble(
Coefficient = names(coef(fit_dl)),
`OLS est` = round(coef(fit_dl), 3),
`OLS SE` = round(sqrt(diag(vcov(fit_dl))), 3),
`HAC SE` = round(sqrt(diag(sandwich::NeweyWest(fit_dl, prewhite = FALSE))), 3),
`GLS-AR(1) est` = round(coef(fit_dl_gls), 3),
`GLS-AR(1) SE` = round(sqrt(diag(vcov(fit_dl_gls))), 3)
) |>
knitr::kable(caption = "DL specification: OLS vs. OLS+HAC vs. GLS-AR(1).")
| Coefficient | OLS est | OLS SE | HAC SE | GLS-AR(1) est | GLS-AR(1) SE |
|---|---|---|---|---|---|
| (Intercept) | 98.982 | 3.401 | 7.621 | 73.258 | 9.735 |
| oil | -0.144 | 0.062 | 0.066 | -0.071 | 0.036 |
| oil_lag | -0.097 | 0.064 | 0.056 | -0.030 | 0.037 |
| sept.oct.2001 | 17.560 | 4.662 | 2.524 | 11.564 | 2.649 |
| iraq.war | 6.975 | 3.818 | 2.082 | 6.083 | 2.640 |
OLS and HAC share point estimates by construction; only the SEs differ. GLS estimates differ from OLS because GLS reweights observations using the AR(1) structure; the intervention dummies move the most. GLS SEs are typically smaller than HAC SEs if the AR(1) model is correct — the efficiency gain HAC cannot deliver.
Both HAC and GLS are legitimate fixes within Story A — they accept the static-mean specification and patch (HAC) or reweight (GLS) accordingly. Neither changes the model. §3.3 takes the alternative route: move to Story B by adding the lagged dependent variable, and recover both efficient estimation and propagating dynamics in one move.
The unrestricted ARDL(1,1) we estimated in §3.1 for the COMFAC test is our workhorse going forward:
\[y_t \;=\; \alpha + \phi\, y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \gamma_1 \mathbf{1}\{911\}_t + \gamma_2 \mathbf{1}\{Iraq\}_t + \varepsilon_t.\]
The “(1,1)” labels the orders: one lag of \(y\), one lag of \(x\) beyond the contemporaneous \(x_t\). The lagged \(y\) does double duty — it captures persistent dynamics in approval that have nothing to do with our regressors, and it propagates the effect of \(x\) forward through time. Today’s approval depends on yesterday’s, and yesterday’s approval was already affected by yesterday’s oil price, so today’s approval inherits a piece of yesterday’s oil shock through the LDV.
coeftest(fit_ardl)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.0065812 5.4416046 2.5740 0.01263 *
## approve_lag 0.8347474 0.0514990 16.2090 < 2e-16 ***
## oil -0.0322845 0.0274584 -1.1758 0.24449
## oil_lag 0.0011796 0.0282933 0.0417 0.96689
## sept.oct.2001 17.2786953 1.9994764 8.6416 5.2e-12 ***
## iraq.war 4.1543960 1.6469076 2.5225 0.01442 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Read off the coefficients: \(\hat\phi \approx 0.83\) on approve_lag signals high persistence; the oil terms are individually small and insignificant (\(\hat\beta_0 \approx -0.032\), \(\hat\beta_1 \approx 0.001\)); and the intervention dummies remain large and significant (roughly 17 and 4 points on impact). Small short-run oil coefficients do not preclude a sizable long-run effect once scaled by \(1/(1-\phi)\) (§3.4).
Residual diagnostics on the LDV fit:
dwtest(fit_ardl)
##
## Durbin-Watson test
##
## data: fit_ardl
## DW = 2.4701, p-value = 0.9274
## alternative hypothesis: true autocorrelation is greater than 0
Box.test(residuals(fit_ardl), lag = 12, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(fit_ardl)
## X-squared = 19.705, df = 12, p-value = 0.07287
Durbin-Watson is \(2.47\), showing no positive residual autocorrelation (\(p \approx 0.93\)); with an LDV on the right-hand side DW is biased toward 2 anyway, so the Ljung-Box check carries more weight — at lag 12 it gives \(p \approx 0.07\), borderline but acceptable. The autocorrelation the DL specification had has been absorbed by the LDV — we are solidly in row 3 of the §3.1 OLS-properties table (LDV with white-noise errors), where OLS is consistent and standard errors are correct. This is a precondition for the dynamic-effects calculations that follow.
The headline formulas for ARDL(1,1):
\[ \theta_{LR} \;=\; \frac{\beta_0 + \beta_1}{1 - \phi}, \qquad \lambda \;=\; 1 - \phi, \qquad \text{half-life} \;=\; \frac{\ln(0.5)}{\ln(\phi)}. \]
phi <- unname(coef(fit_ardl)["approve_lag"])
b0 <- unname(coef(fit_ardl)["oil"])
b1 <- unname(coef(fit_ardl)["oil_lag"])
theta_LR <- (b0 + b1) / (1 - phi)
lambda <- 1 - phi
half_life <- log(0.5) / log(phi)
tibble(
Quantity = c("phi (LDV coefficient)",
"Long-run multiplier on oil (theta)",
"Speed of adjustment (lambda = 1 - phi)",
"Half-life (months)"),
Value = round(c(phi, theta_LR, lambda, half_life), 3)
) |>
knitr::kable(caption = "ARDL(1,1) summary statistics for the approval/oil relationship.")
| Quantity | Value |
|---|---|
| phi (LDV coefficient) | 0.835 |
| Long-run multiplier on oil (theta) | -0.188 |
| Speed of adjustment (lambda = 1 - phi) | 0.165 |
| Half-life (months) | 3.837 |
A permanent one-unit increase in the oil-price index lowers approval by about \(0.19\) percentage points in the long run — about double the static ARMAX coefficient (\(-0.09\)). The contemporaneous coefficient \(\hat\beta_0\) sees only a fraction of this; the LDV pulls in the rest over the following months. Roughly \(17\%\) of the remaining gap closes each month, with a half-life around four months. The system has effectively reached its new equilibrium after about two years.
For substantive context: a \(10\)-unit oil-price increase (roughly the monthly standard deviation in this sample) implies a long-run drop of about \(1.9\) percentage points in approval — modest in any single month, but meaningful when stacked over multiple shocks.
The ARDL parameters give us the destination (\(\theta_{LR}\)) and the speed (\(\lambda\)). To see the path — when most of the adjustment happens, when the system effectively converges — we trace the period-by-period response to a unit step shock in \(x\) starting from pre-intervention equilibrium. This is the same simulation logic from Lab 3 Section 2.5, generalized to ARDL.
For the deviation \(\delta_t = y_t - y_{eq}\) from equilibrium under a unit step in \(x\) at \(t = 0\) (held forever after):
\[ \delta_0 \;=\; \beta_0, \qquad \delta_t \;=\; \phi\, \delta_{t-1} + \beta_0 + \beta_1 \quad (t \geq 1). \]
The recursion has a closed-form solution that converges to \(\theta_{LR}\) as \(t \to \infty\) — but the simulation form is more transparent and generalizes immediately to higher-order ARDL.
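For completeness, the closed form (verifiable by induction from the recursion above) is
\[ \delta_t \;=\; \theta_{LR} \;+\; \phi^{\,t}\bigl(\beta_0 - \theta_{LR}\bigr), \qquad t \geq 0, \]
which makes the geometric convergence to \(\theta_{LR}\) at rate \(\phi\) explicit.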
H <- 36
irf <- numeric(H + 1)
irf[1] <- b0
for (t in 2:(H + 1)) irf[t] <- phi * irf[t - 1] + b0 + b1
irf_df <- tibble(month = 0:H, response = irf)
hl_idx <- round(half_life)
p3_irf <- ggplot(irf_df, aes(month, response)) +
geom_hline(yintercept = 0, linetype = "dotted", color = "grey40") +
geom_hline(yintercept = theta_LR, linetype = "dashed", color = "firebrick") +
geom_segment(aes(x = hl_idx, xend = hl_idx, y = 0,
yend = irf[hl_idx + 1]),
linetype = "dotted", color = "grey50") +
geom_line(color = "steelblue", linewidth = 1) +
geom_point(color = "steelblue", size = 1.6) +
annotate("text", x = H - 0.5, y = theta_LR + 0.012,
label = sprintf("Long-run multiplier theta = %.3f", theta_LR),
color = "firebrick", size = 3.5, hjust = 1) +
annotate("text", x = hl_idx + 0.5, y = irf[hl_idx + 1] - 0.025,
label = sprintf("Half-life ~ %.1f months", half_life),
color = "grey25", size = 3.2, hjust = 0) +
labs(title = "Step-response of approval to a unit shock in oil price",
subtitle = "ARDL(1,1) impulse response, deviations from equilibrium",
x = "Months after shock", y = "Response (percentage points)") +
theme_minimal(base_size = 12)
p3_irf
The trajectory tells the dynamic story. The immediate effect at \(t = 0\) is small (\(-0.03\), just \(\hat\beta_0\)). Each subsequent month, the LDV transmits a fraction \(\phi\) of yesterday’s deviation forward, while the cumulative oil pressure \((\beta_0 + \beta_1)\) keeps acting. The series converges geometrically to the long-run equilibrium \(\theta = -0.19\).
Reporting at multiple horizons makes the speed of approach concrete:
key_horizons <- c(0, 3, 6, 12, 24, 36)
tibble(
`Months after shock` = key_horizons,
`Cumulative response (pp)` = round(irf[key_horizons + 1], 3),
`Fraction of long-run` =
sprintf("%.1f%%", 100 * irf[key_horizons + 1] / theta_LR)
) |>
knitr::kable(caption = "Approach to equilibrium at selected horizons.")
| Months after shock | Cumulative response (pp) | Fraction of long-run |
|---|---|---|
| 0 | -0.032 | 17.2% |
| 3 | -0.098 | 51.8% |
| 6 | -0.135 | 72.0% |
| 12 | -0.170 | 90.5% |
| 24 | -0.186 | 98.9% |
| 36 | -0.188 | 99.9% |
Month 0 already delivers about \(17\%\) of the long-run effect, half arrives by the half-life (\(\sim 4\) months; month 3 is already at \(52\%\)), \(90\%\) by month 12, and the system is essentially converged by month 24. This dynamic-speed information was nowhere in the ARMAX summary. With the LDV doing the propagation, ARDL gives us not just the destination but the path — which is what most policy and political-economy questions actually want to know.
Parametric simulation requires the regression’s sampling distribution to be approximately \(\mathcal{N}(\hat\beta, V(\hat\beta))\), which holds when the regression is balanced — both sides share the same order of integration, or are I(1) and cointegrated. Approval is I(0) (Part 2) and the ARDL absorbs any I(1) component of oil through the LDV under the Pesaran-Shin bounds-testing framework, so the condition is satisfied here. Part 5 treats balance and cointegration formally on genuinely I(1) data, and explains exactly when a level-on-level regression is well-behaved.
The idea. Instead of computing the IRF once at \(\hat\beta\), draw many vectors from the sampling distribution of \(\hat\beta\), compute the IRF on each draw, and read pointwise quantiles off the bundle of trajectories. This is the King, Tomz, and Wittenberg (2000) approach to presenting dynamic uncertainty — the same logic §5.5 uses for the cointegration example.
Step 1: one draw, one IRF. vcov(fit_ardl) is the joint covariance of \(\hat\beta\). A single draw \(\beta^{(1)} \sim \mathcal{N}(\hat\beta, V(\hat\beta))\) is one realization of “what the coefficients could plausibly have been.” Plug it into exactly the same recursion as the point IRF:
set.seed(2026)
beta_hat <- coef(fit_ardl)
V <- vcov(fit_ardl)
# One draw from the joint sampling distribution of beta_hat
beta_draw <- MASS::mvrnorm(1, mu = beta_hat, Sigma = V)
# Same recursion as the point IRF, but with this draw's coefficients
phi_d <- beta_draw["approve_lag"]
b0_d <- beta_draw["oil"]
b1_d <- beta_draw["oil_lag"]
irf_one <- numeric(H + 1)
irf_one[1] <- b0_d
for (t in 2:(H + 1)) irf_one[t] <- phi_d * irf_one[t - 1] + b0_d + b1_d
tibble(month = 0:5,
point = round(irf[1:6], 4),
one_draw = round(irf_one[1:6], 4))
## # A tibble: 6 × 3
## month point one_draw
## <int> <dbl> <dbl>
## 1 0 -0.0323 -0.0404
## 2 1 -0.0581 -0.0793
## 3 2 -0.0796 -0.111
## 4 3 -0.0975 -0.137
## 5 4 -0.112 -0.159
## 6 5 -0.125 -0.176
Different draws give different trajectories — pointwise quantiles across \(S\) draws give the confidence band.
Step 2: a reusable function. Wrap the procedure so it generalizes to any dynamic quantity fn(beta, H) derivable from the fitted coefficients (re-used for the GECM in §4):
# Parametric simulation of any dynamic quantity fn(beta, H) for a fitted model.
# Returns tibble(horizon, median, lo, hi) with pointwise (1 - alpha) bands.
simulate_dynamic <- function(fit, fn, H, S = 1000, alpha = 0.05, seed = NULL) {
if (!is.null(seed)) set.seed(seed)
beta_hat <- coef(fit)
V <- vcov(fit)
# S x p matrix of coefficient draws from the multivariate normal
draws <- MASS::mvrnorm(n = S, mu = beta_hat, Sigma = V)
# Apply fn() to each draw -> S x (H+1) matrix of trajectories
sim_mat <- t(apply(draws, 1, fn, H = H))
tibble(
horizon = 0:H,
median = apply(sim_mat, 2, median),
lo = apply(sim_mat, 2, quantile, probs = alpha / 2),
hi = apply(sim_mat, 2, quantile, probs = 1 - alpha / 2)
)
}
# IRF for ARDL(1,1) given a coefficient vector
ardl11_irf <- function(beta, H) {
phi <- beta["approve_lag"]
b0 <- beta["oil"]
b1 <- beta["oil_lag"]
irf <- numeric(H + 1)
irf[1] <- b0
for (t in 2:(H + 1)) irf[t] <- phi * irf[t - 1] + b0 + b1
irf
}
Step 3: apply and plot.
irf_band <- simulate_dynamic(fit_ardl, fn = ardl11_irf, H = 36,
S = 1000, seed = 2026)
head(irf_band)
## # A tibble: 6 × 4
## horizon median lo hi
## <int> <dbl> <dbl> <dbl>
## 1 0 -0.0307 -0.0869 0.0208
## 2 1 -0.0581 -0.115 -0.00173
## 3 2 -0.0796 -0.144 -0.00935
## 4 3 -0.0984 -0.168 -0.0133
## 5 4 -0.113 -0.187 -0.0180
## 6 5 -0.125 -0.202 -0.0204
ggplot(irf_band, aes(horizon, median)) +
geom_hline(yintercept = 0, linetype = "dotted", color = "grey40") +
geom_hline(yintercept = theta_LR, linetype = "dashed", color = "firebrick") +
geom_ribbon(aes(ymin = lo, ymax = hi), fill = "steelblue", alpha = 0.25) +
geom_line(color = "steelblue", linewidth = 1) +
geom_point(color = "steelblue", size = 1.4) +
annotate("text", x = 35, y = theta_LR + 0.012,
label = sprintf("Long-run multiplier theta = %.3f", theta_LR),
color = "firebrick", size = 3.5, hjust = 1) +
labs(title = "Step-response of approval to a unit shock in oil price (95% CI)",
subtitle = "ARDL(1,1) parametric simulation, S = 1000",
x = "Months after shock", y = "Response (percentage points)") +
theme_minimal(base_size = 12)
The band widens with horizon — uncertainty in \(\hat\phi\) and \(\widehat{\beta_0+\beta_1}\) compounds through the recursion, and the long-run multiplier \(\theta_{LR}\) is itself a non-linear function of the coefficients. The band includes zero at month 0, but from month 1 onward it stays clear of zero across the rest of the 36-month horizon, so the negative oil effect on approval is statistically distinguishable from “no effect” — modest in magnitude (Section 3.4) but real.
Part 3 left us with the ARDL(1,1)
\[ y_t = \alpha + \phi\, y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t \tag{ARDL} \]
and three substantively interesting quantities computed from its coefficients: the long-run multiplier \(\theta_{LR}\), the speed of adjustment \(\lambda\), and the step-response (IRF) path.
The catch is that all three are derived from the fitted ARDL. \(\theta_{LR}\) and \(\lambda\) are non-linear combinations of the coefficients, so their standard errors require the delta method (or a parametric simulation, as in Part 5); the IRF is recovered by recursion. None of these quantities appears directly in the ARDL regression output.
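As an aside, the delta-method route is short enough to sketch here (assuming fit_ardl from Part 3 is still in memory; the gradient of \(\theta_{LR} = (\beta_0+\beta_1)/(1-\phi)\) is computed analytically):
b   <- coef(fit_ardl); V <- vcov(fit_ardl)
phi <- unname(b["approve_lag"]); b0 <- unname(b["oil"]); b1 <- unname(b["oil_lag"])
# gradient of theta with respect to (phi, beta_0, beta_1)
g   <- c((b0 + b1) / (1 - phi)^2, 1 / (1 - phi), 1 / (1 - phi))
idx <- c("approve_lag", "oil", "oil_lag")
se_theta <- sqrt(drop(t(g) %*% V[idx, idx] %*% g))
round(c(theta_LR = (b0 + b1) / (1 - phi), se = se_theta), 4)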
This part rewrites the same model in an algebraically equivalent form — the error-correction (EC) form — that places those quantities as coefficients in a single regression:
\[ \Delta y_t \;=\; \alpha + \beta_0\, \Delta x_t \;-\; \lambda \bigl(y_{t-1} - \theta\, x_{t-1}\bigr) + \gamma_1 \mathbf{1}\{911\}_t + \gamma_2 \mathbf{1}\{Iraq\}_t + \varepsilon_t. \tag{ECM} \]
\(\lambda\) is now a coefficient on the lagged level of \(y\), and the long-run multiplier \(\theta\) appears inside the bracketed deviation \((y_{t-1} - \theta\, x_{t-1})\) — the gap between yesterday’s \(y\) and where it “should” be given yesterday’s \(x\). Same data, same fitted values, identical residuals — just a different way of presenting the same coefficient information. It is an algebraic identity, not a new model.
Why bother? Three reasons: the dynamic quantities of interest become (near-)direct regression coefficients rather than post-estimation algebra; the short-run/long-run decomposition is explicit in the specification itself; and the same form carries directly into the cointegration setting of Part 5.
This is the De Boef and Keele (2008, eqs. 3–5, pp. 189–190) construction. Cook and Webb (2021, pp. 561–565) document why the same algebra implies that fitting AR-errors and reading off LDV-style multipliers is a model conflation — the EC-form derivation makes the implied long-run structure transparent so the conflation is harder to commit by accident.
Start from the ARDL(1,1): \[ y_t = \alpha + \phi\, y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t. \]
Step 1. Subtract \(y_{t-1}\) from both sides: \[ y_t - y_{t-1} = \alpha + (\phi - 1) y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t. \] Define \(\lambda \equiv 1 - \phi\) (the speed of adjustment) and \(\Delta y_t \equiv y_t - y_{t-1}\): \[ \Delta y_t = \alpha - \lambda\, y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t. \]
Step 2. Add and subtract \(\beta_0 x_{t-1}\) on the right: \[ \Delta y_t = \alpha - \lambda\, y_{t-1} + \beta_0\, (x_t - x_{t-1}) + (\beta_0 + \beta_1)\, x_{t-1} + \varepsilon_t. \] With \(\Delta x_t \equiv x_t - x_{t-1}\), this is the single-equation generalized ECM (GECM): \[ \Delta y_t = \alpha + \beta_0 \Delta x_t - \lambda\, y_{t-1} + (\beta_0 + \beta_1)\, x_{t-1} + \varepsilon_t. \tag{GECM} \]
Step 3. Factor the long-run relation by defining \(\theta \equiv (\beta_0+\beta_1)/(1-\phi) = (\beta_0+\beta_1)/\lambda\): \[ \boxed{\;\Delta y_t = \alpha + \beta_0 \Delta x_t - \lambda\,\bigl(y_{t-1} - \theta\, x_{t-1}\bigr) + \varepsilon_t.\;} \tag{ECM} \] The bracketed term \(z_{t-1} \equiv y_{t-1} - \theta\, x_{t-1}\) is the error-correction term: the gap between yesterday’s \(y\) and the long-run relation \(\theta x\).
Three coefficients with clean substantive meaning:
| Coefficient | Meaning | Sign expectation |
|---|---|---|
| \(\beta_0\) on \(\Delta x_t\) | Short-run effect: contemporaneous response of \(\Delta y\) to \(\Delta x\) | matches \(\beta_0\) from the ARDL |
| \(\lambda\) on \(-(y_{t-1} - \theta x_{t-1})\) | Speed of adjustment: fraction of the gap closed each period | \(0 < \lambda < 1\) for a stationary system; negative sign on the EC term |
| \(\theta\) in the EC term | Long-run multiplier: equilibrium relationship between \(y\) and \(x\) | matches \((\beta_0+\beta_1)/(1-\phi)\) from the ARDL |
The (ARDL), (GECM), and (ECM) equations are the same model — three notations, identical fit. Switching among them is purely an interpretive convenience.
Step 2’s (GECM) form is what we actually estimate by OLS, because the EC form (Step 3) has \(\theta\) inside a nonlinear function of the level coefficients. We let lm() fit the linear GECM and back out \(\theta\) from the coefficient ratio.
df_lab4 <- df_lab4 |>
mutate(d_approve = approve - approve_lag,
d_oil = oil - oil_lag)
fit_gecm <- lm(d_approve ~ d_oil + approve_lag + oil_lag +
sept.oct.2001 + iraq.war,
data = df_lab4)
coeftest(fit_gecm)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.006581 5.441605 2.5740 0.012630 *
## d_oil -0.032285 0.027458 -1.1758 0.244494
## approve_lag -0.165253 0.051499 -3.2089 0.002172 **
## oil_lag -0.031105 0.015201 -2.0462 0.045277 *
## sept.oct.2001 17.278695 1.999476 8.6416 5.2e-12 ***
## iraq.war 4.154396 1.646908 2.5225 0.014417 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Read the GECM coefficients in the language of the (GECM) form:

- d_oil coefficient \(\hat\beta_0\): the short-run effect — how much approval changes contemporaneously per unit change in oil price.
- approve_lag coefficient: this estimates \(-\lambda\) (negative of the speed of adjustment). We want it negative — a negative coefficient on \(y_{t-1}\) means the system pulls \(y\) down when it is above equilibrium and up when below.
- oil_lag coefficient: this estimates \(\beta_0 + \beta_1\) (the combined effect of the lagged oil level). Together with the lagged-\(y\) coefficient, these two pin down the long-run multiplier.

The implied dynamic quantities:
gy <- coef(fit_gecm)["approve_lag"] # = -lambda
gx <- coef(fit_gecm)["oil_lag"] # = beta_0 + beta_1
b0 <- coef(fit_gecm)["d_oil"] # short-run effect
lambda_hat <- -gy
theta_hat <- -gx / gy
phi_hat <- 1 - lambda_hat
tibble(
Quantity = c("Speed of adjustment (lambda = 1 - phi)",
"Long-run multiplier (theta)",
"Implied LDV coefficient (phi)",
"Short-run effect (beta_0 on d_oil)",
"Half-life (months)"),
Value = round(c(lambda_hat, theta_hat, phi_hat, b0,
log(0.5)/log(phi_hat)), 4)
) |>
knitr::kable(caption = "Dynamic quantities backed out of the GECM coefficients.")
| Quantity | Value |
|---|---|
| Speed of adjustment (lambda = 1 - phi) | 0.1653 |
| Long-run multiplier (theta) | -0.1882 |
| Implied LDV coefficient (phi) | 0.8347 |
| Short-run effect (beta_0 on d_oil) | -0.0323 |
| Half-life (months) | 3.8375 |
These match exactly what we computed from the ARDL(1,1) in Part 3 — as they must, because the two parameterizations are algebraically identical.
To make the equivalence concrete, fit the ARDL again and compare.
fit_ardl_p4 <- lm(approve ~ approve_lag + oil + oil_lag +
sept.oct.2001 + iraq.war,
data = df_lab4)
tibble(
Diagnostic = c("Residual standard error",
"Sum of squared residuals",
"Maximum |residual difference| (ARDL vs GECM)"),
ARDL = c(round(summary(fit_ardl_p4)$sigma, 4),
round(sum(residuals(fit_ardl_p4)^2), 4),
NA),
GECM = c(round(summary(fit_gecm)$sigma, 4),
round(sum(residuals(fit_gecm)^2), 4),
NA),
`ARDL − GECM` = c(NA, NA,
signif(max(abs(residuals(fit_ardl_p4) - residuals(fit_gecm))), 3))
) |>
knitr::kable(caption = "ARDL and GECM produce identical fits to numerical precision.")
| Diagnostic | ARDL | GECM | ARDL − GECM |
|---|---|---|---|
| Residual standard error | 2.7439 | 2.7439 | NA |
| Sum of squared residuals | 436.6754 | 436.6754 | NA |
| Maximum |residual difference| (ARDL vs GECM) | NA | NA | 0 |
Residuals match to floating-point precision. The two equations are different parameterizations of the same dynamic model, not different models. Choose whichever form makes the substantive question easier to answer:

- The ARDL form when the question is about persistence and period-by-period dynamics: \(\hat\phi\) is read directly off approve_lag.
- The GECM form when the question is about equilibrium correction: \(\hat\lambda\) is the negated approve_lag coefficient, and \(\hat\theta\) is minus the ratio of the oil_lag coefficient to the approve_lag coefficient.

A close cousin of the GECM imposes the long-run relationship in a first stage and then estimates only the dynamic correction in a second stage. This is the form most directly tied to cointegration analysis — and it is what we will use in Part 5 for the US fiscal-sustainability example.
The two steps are:
Stage 1. Estimate the long-run relationship by static OLS: \[ y_t = \mu + \theta\, x_t + z_t. \tag{Stage 1} \] The residual \(\hat z_t = y_t - \hat\mu - \hat\theta\, x_t\) is the cointegrating residual — the gap between \(y\) and its long-run relationship with \(x\). Under cointegration this residual is stationary even when \(y\) and \(x\) individually are not (Part 5).
Stage 2. Estimate the dynamic correction by regressing \(\Delta y_t\) on \(\Delta x_t\) and the lagged Stage-1 residual: \[ \Delta y_t = a + b_0\, \Delta x_t + \rho\, \hat z_{t-1} + \varepsilon_t. \tag{Stage 2} \] The coefficient \(\rho\) is \(-\lambda\): it tells us what fraction of yesterday’s deviation gets corrected today.
# Stage 1: long-run relationship by static OLS
stage1 <- lm(approve ~ oil, data = df_lab4)
df_lab4 <- df_lab4 |>
mutate(z_hat = residuals(stage1),
z_hat_lag = dplyr::lag(z_hat, 1))
# Long-run multiplier from Stage 1
theta_stage1 <- coef(stage1)["oil"]
# Stage 2: dynamic correction
fit_2step <- lm(d_approve ~ d_oil + z_hat_lag + sept.oct.2001 + iraq.war,
data = df_lab4 |> drop_na())
coeftest(fit_2step)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.962851 0.372704 -2.5834 0.012325 *
## d_oil -0.033958 0.027892 -1.2175 0.228351
## z_hat_lag -0.170037 0.052495 -3.2391 0.001986 **
## sept.oct.2001 16.969486 2.007533 8.4529 1.072e-11 ***
## iraq.war 4.012653 1.661927 2.4145 0.018934 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Stage-1 long-run multiplier \(\hat\theta\) is similar in spirit to the GECM-derived value but not identical — Stage 1 estimates \(\theta\) off the levels regression alone, ignoring short-run dynamics, while the GECM/ARDL estimates \(\theta\) jointly with the dynamics. With well-behaved I(1) data the two converge as \(T \to \infty\) (Engle and Granger 1987), but in finite samples they differ. For our (stationary) approval data they will not match exactly, which is itself a useful diagnostic — it tells us we should not be relying on the cointegration interpretation here.
The Stage-2 coefficient on z_hat_lag is the analog of \(-\lambda\) from the GECM. Compare:
tibble(
Quantity = c("Long-run multiplier (theta)",
"Speed of adjustment (lambda)"),
`GECM (single-eq)` = c(round(theta_hat, 4),
round(lambda_hat, 4)),
`Engle-Granger 2-step` = c(round(unname(theta_stage1), 4),
round(-unname(coef(fit_2step)["z_hat_lag"]), 4))
) |>
knitr::kable(caption =
"GECM vs Engle-Granger two-step. Differences shrink with cointegration; here approval is stationary, so they do not exactly match.")
| Quantity | GECM (single-eq) | Engle-Granger 2-step |
|---|---|---|
| Long-run multiplier (theta) | -0.1882 | -0.2455 |
| Speed of adjustment (lambda) | 0.1653 | 0.1700 |
The (ECM) equation is algebraically valid for any ARDL(1,1) — stationary or not. What it does not do, on its own, is establish that \(z_t = y_t - \theta x_t\) is a meaningful equilibrium error. Three cases to keep distinct:
\(y\) and \(x\) are individually stationary (I(0)). This is our approval/oil case. The ECM is just a different way of writing the same dynamic regression. The “long-run multiplier” \(\theta\) is well defined as a property of the joint distribution; the EC term has no special structural meaning beyond “deviation from \(\theta x\).”
\(y\) and \(x\) are individually I(1) and cointegrated. Now \(\theta\) is the cointegrating coefficient — the unique linear combination that produces a stationary residual \(z_t\). The ECM has a structural interpretation: \(\lambda > 0\) is the system’s mean-reversion speed back to the cointegrating relation. Under cointegration the OLS estimator of \(\theta\) in Stage 1 is super-consistent (Engle and Granger 1987): it converges at rate \(T\) rather than \(\sqrt{T}\), even though the levels regression is “spurious” in the standard sense.
\(y\) and \(x\) are individually I(1) but not cointegrated. Now there is no stable long-run relation; the ECM is misspecified; the levels regression \(y_t = \mu + \theta x_t + z_t\) is a textbook spurious regression with non-stationary residuals. Fitting an ECM in this case will produce coefficients that “look fine” but describe nothing real. Estimating an ECM is not the same as having found cointegration (Philips 2018, p. 232, fn. 7).
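The third case is easy to demonstrate by simulation (a self-contained sketch; the seed and sample size are arbitrary):
# Spurious regression: two independent random walks, no true relationship
set.seed(99)
y_rw <- cumsum(rnorm(300))
x_rw <- cumsum(rnorm(300))
fit_sp <- lm(y_rw ~ x_rw)
summary(fit_sp)$coefficients["x_rw", ]  # slope is typically wildly "significant"
adf.test(residuals(fit_sp))             # the giveaway: residuals keep the unit root
The t-statistic on the slope is routinely enormous despite zero true relationship; the non-stationary residual is what exposes the regression as spurious.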
So: this part shows you the algebra; Part 5 shows you the test. We move to a genuinely I(1) example — quarterly US federal receipts and outlays — and ask whether the long-run fiscal relation cointegrates. If yes, the ECM machinery we built here gets a structural reading. If no, we have to interpret in differences only.
In Part 4 we wrote the algebra of the error-correction model (ECM) on stationary data. The ECM there was a parameterization of the ARDL, not a structural claim — there was no I(1) noise to correct, no equilibrium relation in the sense the cointegration literature uses the term. The mechanics are unchanged when we move to genuinely I(1) data, but now the EC term acquires a meaning: it measures deviations from a cointegrating relationship and \(\lambda\) becomes the speed at which those deviations are corrected.
The application is US federal fiscal sustainability. The intertemporal government budget constraint says that, for the government’s debt to remain bounded, federal receipts and expenditures must move together in the long run — they must be cointegrated (Hakkio and Rush 1991). When cointegrated, deviations between the two are mean-reverting; when not, debt explodes. So the question “is US fiscal policy sustainable?” reduces to a testable cointegration claim about two I(1) macro series.
We use quarterly data 1947Q1–2025Q4, downloaded from FRED:

- FGRECPT — Federal Government Current Receipts (billions of dollars).
- FGEXPND — Federal Government Current Expenditures (billions of dollars).

Both series grow exponentially over time. We work with logs, which stabilizes the variance and converts the growth into a roughly linear trend.
Before testing cointegration, a methodological preamble that organizes the rest of Part 5. A regression is balanced when both sides have the same order of integration:
\[\underbrace{y_t}_{I(d_y)} \;=\; X_t'\beta + u_t \quad \text{requires}\quad d_y = d_X, \text{ or cointegration when } d > 0.\]
Why this matters depends on the orders involved:

- Both sides I(0): standard OLS asymptotics apply, and vcov(fit) is a valid input to parametric simulation. This is the case in §3.3’s approval-oil ARDL.
- Both sides I(1) but not cointegrated: the regression is spurious; t-statistics diverge and \(R^2\) is meaningless.
- Both sides I(1) and cointegrated: the levels regression estimates a genuine long-run relation, and the ECM machinery applies with a structural reading.

Cointegration is therefore the I(1) version of balance. Without it, regressing one trending series on another is the spurious-regression trap; with it, the same regression recovers a well-defined long-run relationship and the ECM gives you the dynamics of return-to-equilibrium.
The Engle-Granger procedure operationalizes this distinction in three steps:

1. Pre-test each series for its order of integration (ADF on levels and on first differences) — both must be I(1).
2. Estimate the long-run (Stage-1) levels regression by OLS and test its residual for stationarity, using Engle-Granger critical values.
3. If the residual is stationary, estimate the Stage-2 ECM with the lagged residual as the error-correction term.

The next three subsections walk through each step on the fiscal-sustainability data.
rec <- read_csv("data/FGRECPT.csv", show_col_types = FALSE) |>
rename(date = 1, receipts = FGRECPT)
exp_ <- read_csv("data/FGEXPND.csv", show_col_types = FALSE) |>
rename(date = 1, expenditures = FGEXPND)
fiscal <- inner_join(rec, exp_, by = "date") |>
mutate(log_receipts = log(receipts),
log_outlays = log(expenditures))
nrow(fiscal)
## [1] 316
range(fiscal$date)
## [1] "1947-01-01" "2025-10-01"
fiscal |>
pivot_longer(c(log_receipts, log_outlays),
names_to = "series", values_to = "value") |>
mutate(series = recode(series,
log_receipts = "log Receipts",
log_outlays = "log Expenditures")) |>
ggplot(aes(date, value, color = series)) +
geom_line(linewidth = 0.9) +
scale_color_manual(values = c("log Receipts" = "steelblue",
"log Expenditures" = "firebrick")) +
labs(title = "US Federal Receipts and Expenditures, 1947Q1–2025Q4 (logs)",
subtitle = "Two trending series that may share a long-run equilibrium",
y = "Log of nominal billions of dollars",
x = NULL, color = NULL) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")
The two log series move closely together for ~80 years. The visible gap (expenditures consistently above receipts) is the persistent deficit. The question is whether the gap is mean-reverting: do expenditures and receipts drift back into alignment after temporary divergences, or does the deficit grow without bound?
The cointegration framework requires that each individual series be I(1) (non-stationary in levels, stationary in first differences). Test that first.
adf.test(fiscal$log_receipts)
##
## Augmented Dickey-Fuller Test
##
## data: fiscal$log_receipts
## Dickey-Fuller = -1.2606, Lag order = 6, p-value = 0.8882
## alternative hypothesis: stationary
adf.test(fiscal$log_outlays)
##
## Augmented Dickey-Fuller Test
##
## data: fiscal$log_outlays
## Dickey-Fuller = -1.1382, Lag order = 6, p-value = 0.9152
## alternative hypothesis: stationary
adf.test(diff(fiscal$log_receipts))
##
## Augmented Dickey-Fuller Test
##
## data: diff(fiscal$log_receipts)
## Dickey-Fuller = -7.2248, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
adf.test(diff(fiscal$log_outlays))
##
## Augmented Dickey-Fuller Test
##
## data: diff(fiscal$log_outlays)
## Dickey-Fuller = -7.6744, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
ADF on the levels fails to reject the unit-root null for both series (large \(p\)-values, around \(0.9\)). ADF on the first differences rejects strongly (\(p < 0.01\) for both). KPSS with a trend on the levels (not shown) also rejects stationarity, in agreement. The verdict is unambiguous: both log_receipts and log_outlays are I(1).
We can now ask whether they are cointegrated.
If \(y_t\) and \(x_t\) are both I(1), they are cointegrated if there exists a linear combination \(z_t = y_t - \theta x_t\) that is stationary (I(0)). The cointegrating coefficient \(\theta\) is what the Stage-1 long-run regression estimates.
Engle and Granger (1987) propose a two-step test:
Critical values for the Engle-Granger test are different from the standard ADF table — they account for the fact that we are testing a residual whose coefficient was estimated from the same data. MacKinnon (1991, 2010) provides the relevant tables; the rough 5% critical value for two variables and no trend in the cointegrating regression is \(-3.34\).
# Stage 1: long-run regression
stage1 <- lm(log_receipts ~ log_outlays, data = fiscal)
coeftest(stage1)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1694256 0.0265661 6.3775 6.432e-10 ***
## log_outlays 0.9508386 0.0039439 241.0905 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
theta_hat <- coef(stage1)["log_outlays"]
fiscal$z_hat <- residuals(stage1)
# Cointegration test: ADF on the residuals
adf_z <- adf.test(fiscal$z_hat)
adf_z
##
## Augmented Dickey-Fuller Test
##
## data: fiscal$z_hat
## Dickey-Fuller = -4.3992, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
The cointegrating coefficient is \(\hat\theta \approx 0.95\) — close to (but slightly below) one. The interpretation: in the long run, a one-percent increase in expenditures is associated with about a 0.95-percent increase in receipts. Receipts cover almost — but not quite — all expenditures in the long-run equilibrium. The remaining ~5% is, in expectation, the long-run share of expenditures financed by debt.
The ADF statistic on the Stage-1 residuals is about \(-4.4\), well below MacKinnon’s 5% critical value of \(-3.34\). We reject the unit-root null on \(\hat z_t\): receipts and expenditures are cointegrated. There is a stable long-run fiscal relationship; deviations correct over time.
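As a cross-check, tseries also provides the Phillips-Ouliaris residual-based cointegration test (null hypothesis: no cointegration; the first column of the input matrix is treated as the left-hand-side variable). We would expect it to agree with the Engle-Granger verdict here:
# Phillips-Ouliaris cointegration test on (log receipts, log outlays)
po.test(cbind(fiscal$log_receipts, fiscal$log_outlays))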
ggplot(fiscal, aes(date, z_hat)) +
geom_hline(yintercept = 0, color = "grey40", linetype = "dashed") +
geom_line(color = "steelblue") +
labs(title = "Cointegrating residual z_hat = log_receipts - 0.951 * log_outlays - 0.169",
subtitle = "Stationary mean-reverting deviations from the long-run fiscal relation",
y = "Cointegrating residual", x = NULL) +
theme_minimal(base_size = 12)
The residual hovers around zero with mean-reverting fluctuations — visually consistent with the ADF verdict. Notable departures (e.g., the surge in 2008–09 and 2020–21) reflect recession-driven deficit spikes; in each case the residual eventually returns toward zero.
With cointegration confirmed, the Engle-Granger Stage-2 ECM gives the dynamics of correction:
\[ \Delta y_t \;=\; a + b_0 \Delta x_t \;+\; \rho\, \hat z_{t-1} \;+\; \varepsilon_t, \]
where \(\rho = -\lambda\). (The single-equation GECM gives the same \(\lambda\) and a similar \(\theta\); we’ll show both.)
fiscal <- fiscal |>
mutate(d_log_receipts = log_receipts - dplyr::lag(log_receipts, 1),
d_log_outlays = log_outlays - dplyr::lag(log_outlays, 1),
z_lag = dplyr::lag(z_hat, 1)) |>
drop_na()
fit_ecm <- lm(d_log_receipts ~ d_log_outlays + z_lag, data = fiscal)
coeftest(fit_ecm)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0167842 0.0018896 8.8826 < 2.2e-16 ***
## d_log_outlays -0.0713822 0.0368275 -1.9383 0.053490 .
## z_lag -0.0588984 0.0165036 -3.5688 0.000415 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lambda_hat <- -unname(coef(fit_ecm)["z_lag"])
half_life_q <- log(0.5) / log(1 - lambda_hat)
short_run_b0 <- unname(coef(fit_ecm)["d_log_outlays"])
tibble(
Quantity = c("Long-run multiplier (theta, from Stage 1)",
"Speed of adjustment (lambda) per quarter",
"Half-life of deviations (quarters)",
"Short-run effect (b0 on d_log_outlays)"),
Value = round(c(unname(theta_hat), lambda_hat, half_life_q, short_run_b0), 4)
) |>
knitr::kable(caption = "Cointegration and ECM summary (US fiscal data).")
| Quantity | Value |
|---|---|
| Long-run multiplier (theta, from Stage 1) | 0.9508 |
| Speed of adjustment (lambda) per quarter | 0.0589 |
| Half-life of deviations (quarters) | 11.4185 |
| Short-run effect (b0 on d_log_outlays) | -0.0714 |
Reading the dynamics. Each quarter, about \(5.9\%\) of the gap between receipts and the long-run relationship \(0.951 \cdot \text{log\_outlays} + 0.169\) gets corrected. The half-life of a deviation is about \(11.4\) quarters — roughly three years. After a fiscal shock (e.g., a recession-driven jump in expenditures or a tax-cut–driven drop in receipts), the system takes roughly a decade to close about \(90\%\) of the gap. This is slow but bounded — exactly what fiscal sustainability requires.
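The decay arithmetic behind those statements generalizes to any target fraction (one line, given \(\hat\lambda\)):
# quarters needed to close 50%, 90%, 99% of a deviation at rate (1 - lambda)
frac <- c(0.50, 0.90, 0.99)
round(log(1 - frac) / log(1 - lambda_hat), 1)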
The short-run coefficient on \(\Delta\text{log\_outlays}\) is negative and only marginally significant. Substantively, this reflects the cyclical nature of fiscal policy: when expenditures jump quarter-on-quarter (often during recessions), receipts tend to fall in the same quarter (also recession-driven). The long-run cointegration (\(\theta \approx 0.95\)) handles the equilibrium; the short-run dynamic captures the within-quarter cyclical co-movement.
The point estimates above hide uncertainty. The cointegrating coefficient \(\theta\), the speed of adjustment \(\lambda\), and the half-life are all estimates with sampling distributions. We use parametric simulation (King, Tomz, and Wittenberg 2000) to propagate that uncertainty into the dynamic quantities. The recipe: draw coefficient vectors from \(\mathcal{N}(\hat\beta, V(\hat\beta))\), recompute each dynamic quantity on every draw, and read off pointwise quantiles.
set.seed(512)
n_sims <- 2000
# Draw from the ECM coefficient sampling distribution
sim_coefs <- mvrnorm(n = n_sims, mu = coef(fit_ecm), Sigma = vcov(fit_ecm))
# For each draw, compute lambda, half-life, and a 24-quarter step-response
H <- 24
sim_lambda <- -sim_coefs[, "z_lag"]
sim_b0 <- sim_coefs[, "d_log_outlays"]
# Simulated step-response of log_receipts to a permanent +1 shock
# in log_outlays, expressed as deviations from pre-shock equilibrium.
# Using GECM-equivalent dynamics with theta fixed at Stage-1 estimate:
sim_irf <- matrix(NA_real_, nrow = n_sims, ncol = H + 1)
for (i in seq_len(n_sims)) {
  lam <- sim_lambda[i]; b0 <- sim_b0[i]; th <- unname(theta_hat)
  # cumulative response of log_receipts to a permanent +1 shock in log_outlays
  resp <- numeric(H + 1)
  resp[1] <- b0   # immediate response via d_log_outlays
  for (t in 2:(H + 1)) {
    # error-correction: pull resp toward th (the long-run multiplier)
    resp[t] <- resp[t-1] - lam * (resp[t-1] - th)
  }
  sim_irf[i, ] <- resp
}
# Summarize: 95% CI per horizon
irf_summary <- tibble(
horizon = 0:H,
mean = colMeans(sim_irf),
lo = apply(sim_irf, 2, quantile, 0.025),
hi = apply(sim_irf, 2, quantile, 0.975)
)
ggplot(irf_summary, aes(horizon, mean)) +
geom_hline(yintercept = unname(theta_hat),
linetype = "dashed", color = "firebrick") +
geom_ribbon(aes(ymin = lo, ymax = hi), fill = "steelblue", alpha = 0.25) +
geom_line(color = "steelblue", linewidth = 1) +
annotate("text", x = H - 1, y = unname(theta_hat) + 0.01,
label = sprintf("Long-run multiplier theta ≈ %.3f", theta_hat),
color = "firebrick", size = 3.5, hjust = 1) +
labs(title = "Cumulative response of log_receipts to a +1 shock in log_outlays",
subtitle = "ECM dynamics with parametric-simulation 95% CI (n_sims = 2000)",
x = "Quarters after shock",
y = "Cumulative response (logs)") +
theme_minimal(base_size = 12)
The trajectory rises from the immediate short-run response toward \(\hat\theta\) over many quarters. The 95% ribbon widens as we extend the horizon, reflecting accumulated uncertainty about \(\lambda\). By quarter 24 (six years), most of the long-run adjustment has occurred, but the band still has substantial width — even with \(T = 316\) observations, the speed of adjustment is hard to pin down precisely on macro data.
tibble(
Quantity = c("Long-run multiplier (theta)",
"Speed of adjustment (lambda) per quarter",
"Half-life (quarters)"),
Mean = round(c(unname(theta_hat),
mean(sim_lambda),
mean(log(0.5) / log(1 - sim_lambda[sim_lambda > 0 & sim_lambda < 1]))), 3),
`2.5%` = round(c(unname(theta_hat),
quantile(sim_lambda, 0.025),
quantile(log(0.5) / log(1 - sim_lambda[sim_lambda > 0 & sim_lambda < 1]), 0.025)), 3),
`97.5%` = round(c(unname(theta_hat),
quantile(sim_lambda, 0.975),
quantile(log(0.5) / log(1 - sim_lambda[sim_lambda > 0 & sim_lambda < 1]), 0.975)), 3)
) |>
knitr::kable(caption = "Point estimates and 95% intervals from parametric simulation. (theta is taken from Stage 1 and treated as fixed here; richer schemes would jointly resample theta and the ECM coefficients — see Philips 2018 for a bounds-testing approach.)")
| Quantity | Mean | 2.5% | 97.5% |
|---|---|---|---|
| Long-run multiplier (theta) | 0.951 | 0.951 | 0.951 |
| Speed of adjustment (lambda) per quarter | 0.059 | 0.028 | 0.090 |
| Half-life (quarters) | 12.608 | 7.317 | 24.270 |
The 95% interval for the half-life is roughly 7–24 quarters — substantial uncertainty. The point estimate of \(\sim 11\) quarters is our best read, but a reader of this analysis should treat anything from about two to six years as plausible.
The combined evidence — both series I(1) by ADF, cointegrated by Engle-Granger, ECM with \(\hat\lambda \approx 0.06\) — supports the view that US federal fiscal policy is sustainable in the long run, but adjusts slowly. Receipts and expenditures share a stable long-run relationship in which receipts cover roughly 95% of expenditures over a multi-decade equilibrium horizon. Deviations are corrected at about 6% per quarter, with a half-life around three years.
This is not a normative claim about whether 5% steady-state debt growth is desirable, only a positive claim about the data-generating process: the historical series do not show evidence of an explosive deficit path. Whether the cointegrating relationship will continue to hold under future shocks is a forecasting question, not a cointegration-testing question, and is beyond what this lab covers.
For an applied paper, the next steps would be robustness work on the long-run relation: sensitivity to the sample period, allowance for structural breaks in the cointegrating vector, and comparison with system-based estimators (e.g., Johansen). For this lab, we have done what it was for: to take the ECM machinery built in Part 4 as algebra, and demonstrate it on data where the cointegration interpretation has structural content.
Parts 2–5 dealt with one form of dynamic causal inference: estimating coefficients and propagating shocks within a fully parameterized dynamic model (ARMAX, ARDL, ECM). Part 6 introduces a different family — interrupted time series (ITS) designs — where the causal claim comes not from a parametric model of the dynamics, but from a quasi-experimental contrast: the trajectory of \(y_t\) before the intervention versus the trajectory after, with the pre-intervention path serving as the counterfactual.
ITS is the canonical causal-inference tool for time-series settings with a sharp, datable intervention, and it is widely used in public-policy evaluation (Bernal, Cummins, and Gasparrini 2017). The key insight: when randomization is impossible — as with a national policy that takes effect on one date — we can still estimate a causal effect by comparing the observed post-intervention path to the projected pre-intervention path.
The standard ITS specification is:
\[ y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t - T^*) \cdot D_t + \varepsilon_t \]
where \(t\) is a linear time trend, \(T^*\) is the intervention date, \(D_t = \mathbf{1}\{t \geq T^*\}\) is the post-intervention indicator (so \(\beta_2\) is the immediate level shift), and \((t - T^*) \cdot D_t\) counts time since the intervention (so \(\beta_3\) is the post-intervention change in slope).
The key causal parameters are \(\beta_2\) (did the level shift immediately?) and \(\beta_3\) (did the trajectory’s slope change going forward?). Together they characterize the intervention’s effect: a step shift, a slope change, or both. Identification rests on the assumption that, absent the intervention, the pre-trend would have continued — the standard parallel-trends-style assumption translated to a single-unit time series.
Before we list functional forms or fit anything, we name the assumptions that make ITS causal rather than merely descriptive. The umbrella assumption — sometimes called conditional ignorability of treatment timing, or in the time-series setting sequential ignorability — says this: once we condition on the covariates and dynamic structure our model includes, the post-period trajectory would have followed the pre-period model in expectation, absent the intervention. Plain language: anything that would have stopped the pre-trend from continuing has been captured by the model. Asserting this assumption by default — without defending it — is the single most common form of weak inference in applied ITS work.
Within this umbrella, four threats to the assumption are worth naming individually, in roughly the order they tend to bite:
No anticipation. Agents must not respond to the intervention before \(T^*\). The date of the policy is the date of the response, full stop. Threats: announcements that precede implementation, expectation effects, leaked drafts, behavioral adjustment to “this is coming.” Defense: announcement dummies plus a ramp variable for the anticipation window (covered in §6.4); restricting the analysis to a window where anticipation is implausible; documenting (with sources) what was publicly known before \(T^*\). Anticipation is the threat to single-unit ITS that is hardest to fix after the fact — better to design around it than to patch.
Stable pre-trend (conditional continuation). The pre-period model — once we include covariates and dynamic structure — captures everything that would have continued absent the policy. Threats: unmodeled time-varying confounders, structural breaks unrelated to the policy, secular shifts. Defense: substantively-justified covariate inclusion (§6.3); AR or ARMA error structure where appropriate (§6.5); placebo tests at fake intervention dates; robustness to pre-period length.
No co-occurring shocks (exclusion). Nothing else that affects \(y\) also changes at \(T^*\). Threats: bundled reforms, contemporaneous economic events, weather. Defense: condition on observables (§6.3) or, when available, use a true control unit that did not receive the intervention but did receive the common shocks (the CITS rung of the ladder, §6.1.4).
Stable measurement. The data-generating process for measurement does not change at \(T^*\). Threats: survey instruments redesigned, sample frames altered, administrative coding changed. Defense: check for measurement breaks separately; restrict to consistently-measured periods.
Modeling choices are the operational expression of the assumption. Every covariate you condition on, every lag you add to the AR error, every anticipation window you include — each is a substantive commitment to a story about what could have caused \(y\) in the post-period if the policy hadn’t intervened. We tabulate this connection in §6.1.3, after §6.1.2 walks through the two specifications most often used.
Most ITS analyses use one of two functional forms. The choice is whether the intervention plausibly produces a permanent shift (Example 1) or a transient one (Example 2). Notation in both: \(D_t \equiv \mathbf{1}\{t \geq T^*\}\) is the regime indicator, \(S_t \equiv (t-T^*)\,D_t\) is the time-since-intervention counter, \(P_t \equiv \mathbf{1}\{t = T^*\}\) is a pulse.
\[ y_t \;=\; \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 S_t + \varepsilon_t. \]
The default ITS specification. \(\beta_2\) is the immediate level shift at \(T^*\) (the vertical jump); \(\beta_3\) is the change in slope after the intervention (the kink). Both have intervention-time interpretations because \(S_t\) is re-zeroed at \(T^*\). Use when the policy plausibly produces both an immediate jump and a different growth rate going forward.
set.seed(42)
T_n <- 60
T_star <- 30
alpha <- 10; beta1 <- 0.5; beta2 <- -3; beta3 <- -0.4
df_ls <- tibble(t = 1:T_n) |>
mutate(D = as.integer(t >= T_star),
S = pmax(t - T_star, 0),
y_mean = alpha + beta1*t + beta2*D + beta3*S,
y = y_mean + rnorm(T_n, 0, 0.4),
cf = alpha + beta1*t)
p_ls <- ggplot(df_ls, aes(t, y)) +
geom_ribbon(data = df_ls |> filter(t >= T_star),
aes(ymin = y_mean, ymax = cf),
fill = "firebrick", alpha = 0.18) +
geom_line(aes(y = cf), color = "grey40", linetype = "dashed", linewidth = 0.8) +
geom_line(aes(y = y_mean), color = "steelblue", linewidth = 1.0) +
geom_point(aes(color = factor(D)), size = 1.3, show.legend = FALSE) +
scale_color_manual(values = c("0" = "steelblue", "1" = "firebrick")) +
geom_vline(xintercept = T_star, linetype = "dashed", color = "grey40") +
annotate("segment", x = T_star, xend = T_star,
y = alpha + beta1*T_star, yend = alpha + beta1*T_star + beta2,
arrow = arrow(length = unit(0.22, "cm")), color = "darkred", linewidth = 0.7) +
annotate("text", x = T_star + 0.7,
y = (alpha + beta1*T_star) + beta2/2,
label = "beta_2 (level shift)", hjust = 0, color = "darkred", size = 3.5) +
annotate("text", x = T_n - 0.5,
y = (df_ls$y_mean[T_n] + df_ls$cf[T_n])/2,
label = "Effect at horizon h\n= y_observed - y_counterfactual",
hjust = 1, color = "firebrick", size = 3.2, lineheight = 0.95) +
annotate("text", x = T_star + 14,
y = alpha + beta1*(T_star + 14) + beta2 + beta3*14 + 3.5,
label = "beta_3 changes the slope", hjust = 0, color = "darkblue", size = 3.4) +
annotate("text", x = T_star - 0.5, y = max(df_ls$y) + 0.5,
label = "T*", color = "grey30", size = 4, hjust = 1, fontface = "italic") +
labs(title = "Example 1: Level + slope change ITS",
subtitle = "Solid blue = mean function; dashed grey = counterfactual (extrapolated pre-trend)",
x = "Time t", y = "Outcome y") +
theme_minimal(base_size = 11)
if (!dir.exists("output/slides")) dir.create("output/slides", recursive = TRUE)
ggsave("output/slides/p6_its_levelslope.png", p_ls, width = 9, height = 4, dpi = 110)
p_ls
Read the figure: the counterfactual (dashed grey) is what the pre-period model predicts for the post-period — extrapolated from the same intercept and slope. The observed mean (solid blue) drops at \(T^*\) by \(\beta_2\), then continues at the new slope \(\beta_1 + \beta_3\). The substantive causal effect at horizon \(h\) is the vertical gap between observed and counterfactual at \(T^* + h\) — which keeps growing whenever \(\beta_3 \neq 0\). This is what the shaded region is showing.
\[ y_t \;=\; \beta_0 + \beta_1 t + \omega P_t + \varepsilon_t. \]
Use when the intervention is a one-off shock with no expected persistence — a single news event, an election day, a scandal that briefly moves opinion, a one-week supply disruption. \(\omega\) is the size of the spike at \(T^*\); the system is back on trend by \(t = T^* + 1\). The contrast with Example 1 is sharp: transient vs. permanent.
set.seed(43)
alpha_p <- 10; beta1_p <- 0.05; omega <- 4
df_pl <- tibble(t = 1:T_n) |>
mutate(P = as.integer(t == T_star),
y_mean = alpha_p + beta1_p*t + omega*P,
y = y_mean + rnorm(T_n, 0, 0.3),
cf = alpha_p + beta1_p*t)
p_pl <- ggplot(df_pl, aes(t, y)) +
geom_line(aes(y = cf), color = "grey40", linetype = "dashed", linewidth = 0.8) +
geom_line(aes(y = y_mean), color = "steelblue", linewidth = 1.0) +
geom_point(color = "steelblue", size = 1.3) +
geom_point(data = df_pl |> filter(t == T_star), aes(y = y),
color = "firebrick", size = 2.6) +
geom_vline(xintercept = T_star, linetype = "dashed", color = "grey40") +
annotate("segment", x = T_star, xend = T_star,
y = alpha_p + beta1_p*T_star, yend = alpha_p + beta1_p*T_star + omega,
arrow = arrow(length = unit(0.22, "cm")), color = "darkred", linewidth = 0.7) +
annotate("text", x = T_star + 0.7, y = alpha_p + beta1_p*T_star + omega/2,
label = "omega (one-period spike)", hjust = 0, color = "darkred", size = 3.5) +
annotate("text", x = T_n - 0.5, y = alpha_p + beta1_p*(T_n - 0.5) - 0.8,
label = "Returns to baseline trend by t = T* + 1",
hjust = 1, color = "grey25", size = 3.3) +
annotate("text", x = T_star - 0.5, y = max(df_pl$y) + 0.3,
label = "T*", color = "grey30", size = 4, hjust = 1, fontface = "italic") +
labs(title = "Example 2: Pulse intervention",
subtitle = "One-period shock at T*; immediate return to trend",
x = "Time t", y = "Outcome y") +
theme_minimal(base_size = 11)
ggsave("output/slides/p6_its_pulse.png", p_pl, width = 9, height = 4, dpi = 110)
p_pl
Other functional forms — separate-regime trends (algebraically equivalent to Example 1), level-only or slope-only (sub-cases of Example 1 obtained by testing \(\beta_2 = 0\) or \(\beta_3 = 0\)), anticipation/ramp (§6.4), multiple interventions (one \((D, S)\) pair per break date) — are all built from the same primitives. Don’t drop terms a priori; fit the full Example 1 specification and let the data plus theory decide which sub-form is supported.
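Because every form is built from the same primitives, a small helper that generates them for any break date keeps multi-intervention specifications tidy (a sketch; its_vars() is our name, not a package function):
# Design variables for an intervention at Tstar on a 1..T time index
its_vars <- function(t, Tstar) {
  tibble(D = as.integer(t >= Tstar),  # regime indicator
         S = pmax(t - Tstar, 0),      # time since intervention
         P = as.integer(t == Tstar))  # one-period pulse
}
# e.g., two hypothetical break dates on a 60-period index:
# bind_cols(its_vars(1:60, 30),
#           its_vars(1:60, 45) |> rename_with(\(x) paste0(x, "_2")))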
Each modeling choice is the operational tool for a specific assumption from §6.1.1. The connection is what makes ITS causal rather than just curve-fitting:
| Modeling choice | Threat it addresses | When it is justified |
|---|---|---|
| Covariates \(X_t\) in the regression (§6.3) | Time-varying observable confounders (e.g., world oil price for gasoline; weather for emissions; unemployment for approval) | Theory says these are confounders. Including them strengthens conditional ignorability from “the pre-trend would have continued unconditionally” to “the pre-trend conditional on \(X\) would have continued.” |
| AR error structure (§6.5) | Persistent residual structure that the mean equation didn’t capture | Substantive process plausibly has dynamic noise (slow-moving omitted variables, autocorrelated measurement). Strengthens the conditioning set without committing to causal momentum in \(y\). |
| Lagged dependent variable (Story B from §3.1) | Genuine state-dependence: past \(y\) causally produces current \(y\) | Theory predicts momentum (sticky prices, habit formation, capital accumulation). Not a routine fix for autocorrelation — a substantive commitment about the data-generating process. |
| Anticipation / ramp window (§6.4) | Agents respond before \(T^*\) in expectation | The policy was credibly announced before implementation, with a documented announcement date. |
| True control unit (CITS) | Common shocks that hit both treated and control (recessions, weather, contemporaneous policies) | A unit exists that does not receive the intervention. Without one, CITS doesn’t identify anything new (see §6.1.4). |
Theory-first ordering. Each row above is a substantive claim about what could have caused \(y\) in the post-period absent the policy. AIC/BIC will allow you to add or drop any of them; only theory tells you which are defensible. Modeling without that theory produces fits that may be statistically respectable but identify nothing causal.
Three levels of evidence, weakest to strongest. All three are still single-policy designs; none of them is “stronger than ITS” in the sense of resting on weaker assumptions — each requires more of the data, not fewer modeling choices.
Single-unit ITS. The base case. Relies entirely on the extrapolated pre-trend as the counterfactual. Conditional ignorability has to do all the work; if any of the four assumptions in §6.1.1 fails materially, the estimate is biased.
Single-unit ITS with falsification. Run placebo tests at fake intervention dates a few years before \(T^*\) — the fake “effect” should be near zero (a code sketch follows at the end of this subsection). Vary the pre-period length. Add and drop covariates. Each robustness check the design survives makes the conditional-ignorability assumption more credible. This is what good single-unit ITS papers actually do.
Comparative ITS (CITS) — when, and only when, a true control unit exists. The control must be a series that does not receive the intervention at \(T^*\) but is plausibly subject to the same common shocks (recessions, weather, contemporaneous policies). The treatment-by-post interactions (\(G_i D_t\), \(G_i S_t\)) absorb shocks that affect both units identically and identify the differential effect — the DiD-with-time-series logic. Without a true untreated control, CITS doesn’t apply: if the policy hits all units uniformly (a federal law with no state opt-out, a global event), the would-be control receives the same shock, the differential is zero by symmetry, and the CITS interactions identify nothing new.
When a true control is unavailable but you still want a comparative design, the natural next steps live in design space, not model space — and rest on different identifying assumptions:
Neither is a strict upgrade over single-unit ITS — they are different designs for different settings, with their own assumptions to defend. We flag them here only to mark the natural exit ramps when single-unit ITS or CITS doesn’t fit.
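The placebo idea from the falsification rung is mechanical enough to sketch on the simulated Example-1 data (df_ls and T_star from §6.1.2; the fake dates are arbitrary choices of ours):
# Refit the ITS at fake intervention dates inside the true pre-period;
# the estimated "level shift" should be near zero at every fake date.
placebo_D <- sapply(c(10, 15, 20, 25), function(Tfake) {
  d <- df_ls |>
    filter(t < T_star) |>             # true pre-period only
    mutate(Df = as.integer(t >= Tfake),
           Sf = pmax(t - Tfake, 0))
  coef(lm(y ~ t + Df + Sf, data = d))["Df"]
})
round(placebo_D, 3)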
ITS data is almost always serially correlated. The two routes for handling it map directly onto Story A and Story B from §3.1: model the residuals as ARMA (Story A — Arima(y, xreg = X)), or include \(y_{t-1}\) on the right-hand side (Story B — lm(y ~ lag(y) + X)). The choice has the same stakes as in §3.1: Story A keeps the intervention coefficients’ moment-of-impact reading; Story B reinterprets them as direct effects propagated forward through the LDV’s own dynamics, with full effects requiring an IRF (cf. §3.5).

Default to Story A unless theory genuinely predicts that past \(y\) causes future \(y\) — sticky prices, habit formation, capital stocks (the canonical defensible cases discussed in §3.1). The Cook & Webb (2021) warning against treating \(\hat\rho\) on \(y_{t-1}\) as a “persistence parameter” applies in full force here. We use Story A in §6.5.
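On the simulated Example-1 data, the Story-A route is a single call (a sketch; §6.5 applies the same pattern to the real BC series):
# Intervention terms stay in the mean equation; AR(1) noise around it
X_its <- as.matrix(df_ls[, c("t", "D", "S")])
forecast::Arima(df_ls$y, order = c(1, 0, 0), xreg = X_its)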
We apply the standard ITS model to a setting where the issues from §6.1 (covariate identification of the pre-trend, anticipation effects) have substantive bite.
The intervention. In its 19 February 2008 budget, the Government of British Columbia announced a revenue-neutral carbon tax of $10/tCO₂, effective 1 July 2008, scheduled to escalate by $5/tCO₂ each year through 2012. For motor gasoline, the initial rate translated to an additional 2.34 ¢/L on the retail price, climbing to 6.67 ¢/L by 2012. Tax revenue was rebated through personal-income-tax cuts and a low-income climate-action credit, so the policy was neutral on aggregate fiscal balance (Murray and Rivers 2015).
The substantive question. Did the carbon tax cause a measurable decline in BC gasoline consumption beyond what would have happened anyway given world oil prices?
The data. Monthly net sales of gasoline in BC, January 1993 to December 2015 (Statistics Canada PID 23-10-0080-01), merged with monthly Vancouver retail gasoline prices (PID 18-10-0001-01). \(T = 276\) months — 186 pre-tax, 90 post-tax. We work with log gasoline sales as the outcome, so the linear trend captures growth in log volume (population + economic activity) and the intervention coefficient is a percent change.
bc <- read_csv("data/bc_carbon_tax.csv", show_col_types = FALSE) |>
mutate(date_d = ym(date),
t = row_number(),
log_gas = log(gasoline_kl),
month_f = factor(month(date_d, label = FALSE)),
D = as.integer(date_d >= ymd("2008-07-01")), # post-tax
S = pmax(0, t - which(date_d == ymd("2008-07-01"))[1]), # time since
D_announce = as.integer(date_d >= ymd("2008-02-01"))) # post-announcement
glimpse(bc)
## Rows: 276
## Columns: 10
## $ date <chr> "1993-01", "1993-02", "1993-03", "1993-04", "1993-05", "19…
## $ gasoline_kl <dbl> 263175, 266268, 318488, 305159, 321572, 340359, 357239, 37…
## $ gas_price_c <dbl> 55.6, 54.9, 53.5, 54.3, 53.7, 54.5, 55.3, 55.4, 56.4, 56.4…
## $ date_d <date> 1993-01-01, 1993-02-01, 1993-03-01, 1993-04-01, 1993-05-0…
## $ t <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ log_gas <dbl> 12.48057, 12.49226, 12.67134, 12.62859, 12.68098, 12.73776…
## $ month_f <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7…
## $ D <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ S <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ D_announce <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
ggplot(bc, aes(date_d, log_gas)) +
geom_line(color = "grey55") +
geom_vline(xintercept = ymd("2008-07-01"),
linetype = "dashed", color = "firebrick") +
annotate("text", x = ymd("2008-09-01"), y = max(bc$log_gas) - 0.04,
label = "Carbon tax\n(1 Jul 2008)",
color = "firebrick", size = 3.2, hjust = 0) +
labs(title = "BC monthly gasoline sales (log thousands of litres), 1993–2015",
subtitle = "Strong seasonality + secular growth; intervention point marked",
y = "log(gasoline sales)", x = NULL) +
theme_minimal(base_size = 12)
The series has a positive secular trend, strong monthly seasonality, and a pronounced mid-2008 spike followed by lower post-tax levels. Crucially, global oil prices peaked in mid-2008 and crashed in the subsequent recession — so the post-period decline confounds the carbon tax with a worldwide price shock. This is exactly the identification problem §6.1 was set up to address: a naïve segmented regression has no way to separate the local tax from the simultaneous global shock.
Start with the level + slope specification from §6.1.2 (Example 1) plus month dummies — without conditioning on the gas price.
fit_bare <- lm(log_gas ~ t + D + S + month_f, data = bc)
round(coeftest(fit_bare)[c("t", "D", "S"), ], 4)
## Estimate Std. Error t value Pr(>|t|)
## t 0.0008 0.0001 13.5272 0
## D -0.0486 0.0111 -4.3900 0
## S -0.0008 0.0002 -4.3719 0
The level change D and slope change S are
both negative and significant: the bare model concludes that, at
implementation, BC gasoline sales dropped by about \(4.9\%\) relative to trend, and continued
declining at \(-0.08\%\) per month
thereafter. Read at face value, this is a substantial
effect.
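Since the outcome is in logs, the exact percent change is \(100(e^{\hat\beta} - 1)\); for coefficients this small, the log-point reading above is a close approximation (a quick conversion, not part of the lab's printed output):

# Exact percent-change conversion of the bare-model coefficients
round(100 * (exp(coef(fit_bare)[c("D", "S")]) - 1), 2)  # D: about -4.7% at impact; S: about -0.08% per month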
But it conflates two stories. The 2008 oil-price spike pushed retail prices in BC from about \(\$1.20\)/L in early 2008 to over \(\$1.50\)/L by mid-summer; prices then crashed below \(\$0.90\)/L by late 2008. Subsequent years saw global prices recover and stay high through 2014, then crash again. None of that is captured in the bare ITS. Whatever portion of the post-tax decline is globally driven, the bare model misattributes to the carbon tax.
The bare ITS leans on a strong assumption: absent the carbon tax, the pre-trend would have continued linearly into the post-period. That’s hard to defend here because gasoline consumption is mechanically driven by retail price, and the post-2008 period contains the largest oil-price swings in the sample.
The fix is to condition on the price. The identifying assumption becomes:
Conditional on retail gasoline price and seasonal pattern, the post-period would have followed the pre-period’s pattern had the carbon tax not been imposed.
This is the conditional-ignorability translation of parallel trends. The counterfactual is no longer “what does the pre-trend extrapolate to?” but “what does the pre-period model, evaluated at the post-period covariates, predict?”. The latter is much more credible when the pre-trend has known mechanical drivers we can observe.
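A literal rendering of that sentence fits the model on pre-period data only and projects it over the realized post-period covariates (a minimal sketch; fit_pre and cf_pre are our illustrative names, and the counterfactual plot further below instead zeroes the dummies in a full-sample fit):

# Pre-period-only fit: D and S are constant (zero) before July 2008, so they drop out
fit_pre <- lm(log_gas ~ t + log(gas_price_c) + month_f,
              data = filter(bc, date_d < ymd("2008-07-01")))
# Evaluate the pre-period model at the realized post-period covariates
cf_pre <- predict(fit_pre, newdata = filter(bc, date_d >= ymd("2008-07-01")))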
fit_cov <- lm(log_gas ~ t + D + S + log(gas_price_c) + month_f, data = bc)
round(coeftest(fit_cov)[c("t", "D", "S", "log(gas_price_c)"), ], 4)
## Estimate Std. Error t value Pr(>|t|)
## t 0.0015 0.0001 14.2202 0e+00
## D -0.0389 0.0101 -3.8607 1e-04
## S -0.0011 0.0002 -6.6105 0e+00
## log(gas_price_c) -0.1656 0.0212 -7.8034 0e+00
Comparison with the bare specification:
tibble(
Coefficient =
c("Pre-trend (per month)",
"Level change at intervention (D)",
"Slope change after intervention (S)",
"log(retail price) elasticity"),
`Bare ITS` = c(coef(fit_bare)[c("t", "D", "S")], NA) |> round(4),
`With log(price)` = c(coef(fit_cov) [c("t", "D", "S", "log(gas_price_c)")]) |> round(4)
) |>
knitr::kable(caption = "ITS estimates with and without conditioning on retail gasoline price.")
| Coefficient | Bare ITS | With log(price) |
|---|---|---|
| Pre-trend (per month) | 0.0008 | 0.0015 |
| Level change at intervention (D) | -0.0486 | -0.0389 |
| Slope change after intervention (S) | -0.0008 | -0.0011 |
| log(retail price) elasticity | NA | -0.1656 |
Two patterns in the table:

1. D shrinks from \(-0.049\) to \(-0.039\) once we condition on price. Roughly a fifth of the implementation-date drop the bare ITS attributed to the tax was actually driven by the contemporaneous global oil-price spike.
2. The trend terms strengthen: the pre-trend roughly doubles (\(0.0008\) to \(0.0015\)) and the slope change deepens (\(-0.0008\) to \(-0.0011\)), because the price covariate now absorbs the oil-driven swings that the bare trend had to soak up.

A counterfactual plot makes the conditional-ignorability story visible. We compute two predicted paths from fit_cov: the realized post-tax fit, and a counterfactual that turns off the post-tax dummies (D = 0, S = 0), holding price and season at their actual values.
bc_cf <- bc |>
mutate(D = 0, S = 0)
bc <- bc |>
mutate(fit_real = predict(fit_cov, newdata = bc),
fit_cf = predict(fit_cov, newdata = bc_cf))
ggplot(bc, aes(date_d)) +
geom_line(aes(y = log_gas), color = "grey60", alpha = 0.7) +
geom_line(aes(y = fit_real), color = "steelblue", linewidth = 0.9) +
geom_line(aes(y = fit_cf),
data = filter(bc, date_d >= ymd("2008-07-01")),
color = "firebrick", linewidth = 0.9, linetype = "dashed") +
geom_vline(xintercept = ymd("2008-07-01"),
linetype = "dotted", color = "firebrick") +
annotate("text", x = ymd("2014-01-01"), y = 12.93,
label = "No-tax counterfactual",
color = "firebrick", size = 3.2, hjust = 0) +
annotate("text", x = ymd("2014-01-01"), y = 12.86,
label = "Realized fit",
color = "steelblue", size = 3.2, hjust = 0) +
labs(title = "BC carbon tax: realized fit vs. no-tax counterfactual",
subtitle = "Both paths condition on price and season; gap = ITS-implied tax effect",
y = "log(gasoline sales)", x = NULL) +
theme_minimal(base_size = 12)
The dashed red line is the prediction the model would make for the post-period if the carbon tax had not happened — but using the realized post-period prices and seasonal pattern. The gap between the realized fit (blue) and the counterfactual (red) is the ITS-implied carbon-tax effect, conditional on covariates.
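To attach a number to that gap (a back-of-envelope check, not part of the lab's printed output): since the model is linear, the post-period gap is exactly \(\hat D + \hat S \cdot S_t\), which averages to about \(-0.039 - 0.0011 \times 44.5 \approx -0.088\) log points over the 90 post-tax months, i.e. roughly an \(8\%\) average shortfall relative to the no-tax counterfactual.

# Mean post-period gap between realized fit and counterfactual, in percent
bc |>
  filter(date_d >= ymd("2008-07-01")) |>
  summarise(mean_gap_log = mean(fit_real - fit_cf),
            mean_gap_pct = 100 * (exp(mean_gap_log) - 1))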
What’s still being assumed exogenous. We assume retail price is exogenous to BC gasoline consumption — i.e., world oil prices drive retail prices, retail prices drive consumption, but BC consumption doesn’t drive retail prices. For BC’s small share of the global market, this is defensible; in larger economies, simultaneity would call for IV or two-stage methods.
The carbon tax was announced on 19 February 2008 and took effect on 1 July 2008 — a five-month gap during which consumers had time to adjust expectations. If they pre-bought gasoline (filled tanks ahead of the tax) or made longer-term changes (deferred a vehicle purchase, switched commutes), the immediate-pre-tax data points would already be polluted, biasing the implementation-date level shift.
This is a structural concern about the design, not the model fit. Three ways to address it (per the modeling-choices table in §6.1.3):

1. Add a separate dummy D_announce that turns on at the announcement date. This tests whether the announcement itself shifted the level.
2. Drop the announcement-to-implementation window (February–June 2008) and re-estimate, so the level change is identified off clean pre- and post-tax observations.
3. Re-date the intervention to the announcement, treating February 2008 as the start of treatment.

We demonstrate (1):
fit_anticip <- lm(log_gas ~ t + D_announce + D + S + log(gas_price_c) + month_f,
data = bc)
round(coeftest(fit_anticip)[c("t", "D_announce", "D", "S", "log(gas_price_c)"), ], 4)
## Estimate Std. Error t value Pr(>|t|)
## t 0.0015 0.0001 14.1796 0.0000
## D_announce -0.0108 0.0191 -0.5680 0.5705
## D -0.0293 0.0196 -1.4910 0.1372
## S -0.0011 0.0002 -6.6204 0.0000
## log(gas_price_c) -0.1631 0.0217 -7.5202 0.0000
D_announce is the level shift starting at the
announcement; D is the additional level shift at
implementation, on top of any announcement-window adjustment.
tibble(
Coefficient = c("D (level change at impl.)",
"D_announce (announcement effect)",
"S (slope change)"),
`With price (no anticip.)` =
c(coef(fit_cov)["D"], NA, coef(fit_cov)["S"]) |> round(4),
`With price + announcement` =
c(coef(fit_anticip)["D"], coef(fit_anticip)["D_announce"],
coef(fit_anticip)["S"]) |> round(4)
) |>
knitr::kable(caption = "Implementation effect with and without an announcement adjustment.")
| Coefficient | With price (no anticip.) | With price + announcement |
|---|---|---|
| D (level change at impl.) | -0.0389 | -0.0293 |
| D_announce (announcement effect) | NA | -0.0108 |
| S (slope change) | -0.0011 | -0.0011 |
Reading the announcement coefficient: \(-0.011\) (\(-1.1\%\), \(p \approx 0.57\)) — small and indistinguishable from zero. Anticipation does not appear to be a major channel here. As a complementary check, remedy (2) — dropping the Feb–Jun 2008 window — gives a similar story:
fit_dropwin <- lm(log_gas ~ t + D + S + log(gas_price_c) + month_f,
data = bc |> filter(date_d < ymd("2008-02-01") |
date_d >= ymd("2008-07-01")))
round(coef(fit_dropwin)["D"], 4)
## D
## -0.0402
The level-change estimate barely moves. Either remedy yields the same conclusion, so we keep the §6.3 specification as the headline; the sensitivity check is what gives us the right to do so.
Apply Option A from §6.1.5 — model the residuals as AR(1) — using
Arima() with the §6.3 covariates and intervention
indicators as xreg.
xcov <- model.matrix(~ t + D + S + log(gas_price_c) + month_f, data = bc)[, -1]
fit_its_ar <- Arima(bc$log_gas, order = c(1, 0, 0), xreg = xcov)
fit_its_ar
## Series: bc$log_gas
## Regression with ARIMA(1,0,0) errors
##
## Coefficients:
## ar1 intercept t D S log(gas_price_c) month_f2
## 0.0539 13.2568 0.0015 -0.0390 -0.0011 -0.1645 -0.0278
## s.e. 0.0616 0.0824 0.0001 0.0103 0.0002 0.0216 0.0109
## month_f3 month_f4 month_f5 month_f6 month_f7 month_f8 month_f9
## 0.0697 0.0755 0.1500 0.1584 0.2111 0.2313 0.1239
## s.e. 0.0112 0.0113 0.0114 0.0114 0.0114 0.0114 0.0113
## month_f10 month_f11 month_f12
## 0.1199 0.0359 0.0708
## s.e. 0.0112 0.0112 0.0109
##
## sigma^2 = 0.001523: log likelihood = 512.32
## AIC=-988.64 AICc=-985.98 BIC=-923.48
checkresiduals(fit_its_ar)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(1,0,0) errors
## Q* = 14.841, df = 9, p-value = 0.0954
##
## Model df: 1. Total lags used: 10
The Ljung-Box test (\(p \approx 0.10\)) shows no strong evidence of leftover residual autocorrelation. Compare the OLS and ARIMA point estimates on the carbon-tax coefficients:
tibble(
Parameter = c("Pre-trend (t)", "Level change (D)",
"Slope change (S)", "log(price) elasticity"),
`OLS (with price)` = round(coef(fit_cov)[c("t", "D", "S", "log(gas_price_c)")], 4),
`ARIMA(1,0,0) errors` = round(coef(fit_its_ar)[c("t", "D", "S", "log(gas_price_c)")], 4)
) |>
knitr::kable(caption = "ITS coefficients: OLS vs. ARIMA(1,0,0) errors.")
| Parameter | OLS (with price) | ARIMA(1,0,0) errors |
|---|---|---|
| Pre-trend (t) | 0.0015 | 0.0015 |
| Level change (D) | -0.0389 | -0.0390 |
| Slope change (S) | -0.0011 | -0.0011 |
| log(price) elasticity | -0.1656 | -0.1645 |
The AR(1) parameter on the residuals is small (\(\approx 0.05\)). With price and season already absorbing the systematic post-period variation, the OLS estimates from §6.3 are essentially unchanged. We use Story A (AR errors) here — there is no theoretical reason to think last month’s gasoline-sales residual causes this month’s beyond what the price covariate already captures.
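A complementary cross-check (ours, not part of the lab's main flow) keeps the OLS point estimates from §6.3 and replaces only the covariance matrix with a HAC estimator:

# Newey-West (HAC) standard errors on the covariate ITS: identical point
# estimates to fit_cov, inference robust to residual autocorrelation
# (NeweyWest() is from the sandwich package)
round(coeftest(fit_cov, vcov. = NeweyWest(fit_cov))[c("D", "S"), ], 4)

With residual autocorrelation this weak, the HAC and conventional standard errors should be close, mirroring the OLS-vs-ARIMA comparison above.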
Across specifications, the BC carbon tax is associated with about a 3–4% drop in monthly gasoline sales at implementation, plus a small additional negative slope change. The bare ITS over-states the effect by attributing the global 2008 oil-price spike to the tax; conditioning on retail price brings the estimate into the range the energy-economics literature finds (Murray and Rivers 2015). Anticipation is not a major confound, and the AR-errors correction barely moves the headline numbers.
The pedagogy: the substantive coefficient depends on which counterfactual you defend. Bare ITS posits a linear-in-time counterfactual; covariate-conditional ITS posits a price-and-season-conditional counterfactual and gets a smaller, more credible answer. Neither is “right” without theory, but the second is much harder to be wrong about: gasoline consumption is mechanically a function of price, and the recovered price elasticity (\(\approx -0.17\)) lands at a textbook value.
A targeted simulation puts numbers behind the three pitfalls from §1.3. We generate four series with known DGPs (\(T = 250\), seed 100), run ADF and KPSS (with both Level and Trend nulls) on each, and tabulate the verdicts:
set.seed(100); T_n <- 250; t_idx <- 1:T_n
y_A <- arima.sim(n = T_n, model = list(ar = 0.4), sd = 1) # I(0)
y_B <- cumsum(rnorm(T_n, sd = 1)) # I(1)
y_C <- rnorm(T_n, sd = 1) + 5 * (t_idx >= T_n / 2) # I(0) + break
y_D <- 0.04 * t_idx + arima.sim(n = T_n, model = list(ar = 0.4), sd = 1) # I(0) around trend
run_tests <- function(y, label) {
adf <- suppressWarnings(adf.test(y))
kpss_L <- suppressWarnings(kpss.test(y, null = "Level"))
kpss_T <- suppressWarnings(kpss.test(y, null = "Trend"))
tibble(Series = label,
`ADF p` = round(adf$p.value, 3),
`KPSS-L p` = round(kpss_L$p.value, 3),
`KPSS-T p` = round(kpss_T$p.value, 3))
}
bind_rows(
run_tests(y_A, "A: AR(1) phi=0.4 [I(0)]"),
run_tests(y_B, "B: Random walk [I(1)]"),
run_tests(y_C, "C: i.i.d. + level break [I(0)]"),
run_tests(y_D, "D: Trend + AR(1) [I(0) around trend]")
) |>
knitr::kable(caption = "ADF and KPSS verdicts on four series with known truth.")
| Series | ADF p | KPSS-L p | KPSS-T p |
|---|---|---|---|
| A: AR(1) phi=0.4 [I(0)] | 0.010 | 0.10 | 0.10 |
| B: Random walk [I(1)] | 0.366 | 0.01 | 0.01 |
| C: i.i.d. + level break [I(0)] | 0.282 | 0.01 | 0.01 |
| D: Trend + AR(1) [I(0) around trend] | 0.010 | 0.01 | 0.10 |
The headline results: Series A and B are consensus-correct (all three tests agree with the truth). Series C is the Perron (1989) trap: ADF fails to reject and KPSS-Level rejects, so both verdicts point toward a unit root, yet the truth is I(0) within each regime; the unmodeled break fools both tests. Series D shows the deterministic-regressor pitfall: the same series gives opposite KPSS verdicts depending on whether the null is Level or Trend. Both lessons map directly onto row 3 of the §1.3 verdict matrix and onto the diagnostic move in Part 2's approval analysis.
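A quick follow-up on Series C (our addition, reusing run_tests from above) shows the escape route: re-test within each regime. If the level break is the only non-stationarity, both halves should look I(0):

# Split Series C at the known break (t = 125) and re-run the tests;
# ADF should now reject the unit root and KPSS should fail to reject
bind_rows(
  run_tests(y_C[1:(T_n / 2 - 1)], "C: pre-break half"),
  run_tests(y_C[(T_n / 2):T_n],  "C: post-break half")
) |>
  knitr::kable(caption = "Series C re-tested within each regime.")

In practice the break date must be estimated rather than assumed, which is exactly why the diagnose-before-differencing discipline of Part 1 has to precede any differencing decision.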
Audited core (literature folder):
Methods cited inline (load-bearing for Parts 1, 3–6):