These materials have been revised and updated by successive teaching assistants over recent years.
| Package | Purpose | Key functions |
|---|---|---|
| forecast | ARIMA estimation, diagnostics, cross-validation | Arima(), auto.arima(), checkresiduals(), tsCV(), forecast() |
| tseries | Stationarity and specification tests | adf.test(), kpss.test(), jarque.bera.test() |
| lmtest | Regression diagnostic tests | bgtest() (Breusch-Godfrey for serial correlation) |
| patchwork | Composing multi-panel figures | +, / operators |
| tidyverse | Data manipulation and plotting | ggplot(), tibble(), pipes (\|>) |
In Lab 2, we learned to identify which dynamic components are present in a time series — trend, seasonality, AR, MA — using ACF/PACF plots and the Box-Jenkins identification table. Now we take the next step: estimating the model, interpreting the coefficients, comparing candidate specifications, and validating the chosen model with cross-validation.
The key question shifts from “what kind of process is this?” to “how well does my model capture the dynamics, and how do I know?”
We use a classic dataset from Harvey (1989): monthly counts of deaths and serious injuries in UK road accidents from January 1969 to December 1984. In February 1983, a seatbelt law was introduced — providing a natural policy intervention that we can model.
# Load the UK road deaths dataset
ukdata <- read_csv("data/ukdeaths.csv")
#ukdata <- read_csv(here("CSSS512_labs", "Lab3", "data", "ukdeaths.csv"))
ukdata
## # A tibble: 192 × 5
## death law year mt month
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1687 0 1969 1 January
## 2 1508 0 1969 2 February
## 3 1507 0 1969 3 March
## 4 1385 0 1969 4 April
## 5 1632 0 1969 5 May
## 6 1511 0 1969 6 June
## 7 1559 0 1969 7 July
## 8 1630 0 1969 8 August
## 9 1579 0 1969 9 September
## 10 1653 0 1969 10 October
## # ℹ 182 more rows
# Create time series object: monthly data starting Jan 1969
deaths <- ts(ukdata$death, start = c(1969, 1), frequency = 12)
# The seatbelt law indicator: 0 before Feb 1983, 1 after
law <- ts(ukdata$law, start = c(1969, 1), frequency = 12)
autoplot(deaths) +
geom_vline(xintercept = 1983 + 1/12, linetype = "dashed", color = "firebrick") +
annotate("text", x = 1983.5, y = 2600, label = "Seatbelt law\n(Feb 1983)",
color = "firebrick", size = 3.5) +
labs(title = "UK Road Deaths, 1969–1984",
y = "Deaths and Serious Injuries", x = "Year") +
theme_minimal(base_size = 13)
The series has two visible features: a strong seasonal pattern (deaths peak in winter months — dark, icy roads) and a notable drop after the seatbelt law. There may also be a slight downward trend over the full period. The substantive question is: how large was the predicted effect of the seatbelt law, accounting for the seasonal and autoregressive dynamics?
Before estimating, we apply the Box-Jenkins diagnostic workflow from Lab 2. This is a quick review — the focus of this lab is estimation, not identification.
We start with an STL decomposition to visually separate the three components of the series: trend, seasonality, and remainder. This gives us a qualitative sense of which features we need to account for in the model.
# STL decomposition with a periodic seasonal window (12 months)
decomp <- stl(deaths, s.window = "periodic")
# visualize with forecast (ggplot)
autoplot(decomp) +
labs(title = "STL Decomposition of UK Road Deaths",
x = "Year") +
theme_minimal(base_size = 11)
The decomposition confirms our visual impressions: a dominant seasonal component, a slight downward trend, and a drop in the level of the series after the 1983 seatbelt law.
Now we examine the ACF and PACF of the raw series to identify any autoregressive or moving average structure:
p1 <- ggAcf(deaths, lag.max = 36) +
labs(title = "ACF — UK Road Deaths") + theme_minimal(base_size = 11)
p2 <- ggPacf(deaths, lag.max = 36) +
labs(title = "PACF — UK Road Deaths") + theme_minimal(base_size = 11)
p1 + p2
The ACF shows slow decay (trend or persistence) with seasonal spikes at lags 12, 24, 36. The PACF shows significant spikes at lags 1 and 12. This suggests an AR process with seasonality — both of which we need to account for in our model.
The raw ACF is dominated by the trend (slow decay) and seasonality
(spikes at 12, 24, 36), making it hard to read the AR/MA structure
underneath. Using the Lab 2 approach (Approach 2), we remove both
features in one step with an lm() that regresses deaths on
a linear trend, month dummies, and the seatbelt law indicator. The
residuals from this regression are the cleaned series —
trend, seasonality, and the law effect all removed — so any remaining
autocorrelation must come from the ARMA dynamics we want to
identify.
# Build a data frame with a time index, month factor, and the law indicator
clean_df <- ukdata |>
mutate(
t = row_number(), # linear time index
month = factor(month, levels = month.name) # month factor
)
# Regress deaths on trend + month dummies + law
# Residuals = deaths with trend, seasonality, and law effect removed
fit_clean <- lm(death ~ t + month + law, data = clean_df)
# check results
summary(fit_clean)
##
## Call:
## lm(formula = death ~ t + month + law, data = clean_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -333.70 -105.83 4.71 85.02 414.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1872.6884 43.2957 43.253 < 2e-16 ***
## t -1.7649 0.2406 -7.337 7.47e-12 ***
## monthFebruary -184.0861 53.9429 -3.413 0.000797 ***
## monthMarch -131.7587 53.9381 -2.443 0.015551 *
## monthApril -243.1814 53.9343 -4.509 1.18e-05 ***
## monthMay -104.1040 53.9317 -1.930 0.055160 .
## monthJune -158.0892 53.9300 -2.931 0.003818 **
## monthJuly -80.6993 53.9295 -1.496 0.136324
## monthAugust -64.0594 53.9300 -1.188 0.236485
## monthSeptember -9.4821 53.9317 -0.176 0.860638
## monthOctober 131.6578 53.9343 2.441 0.015622 *
## monthNovember 332.9851 53.9381 6.173 4.40e-09 ***
## monthDecember 451.3750 53.9429 8.368 1.67e-14 ***
## law -226.3850 41.0372 -5.517 1.20e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 152.4 on 178 degrees of freedom
## Multiple R-squared: 0.7419, Adjusted R-squared: 0.723
## F-statistic: 39.35 on 13 and 178 DF, p-value: < 2.2e-16
# Convert residuals back to a ts object
deaths_clean <- ts(residuals(fit_clean), start = c(1969, 1), frequency = 12)
autoplot(deaths_clean) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
labs(title = "UK Road Deaths After Removing Trend, Seasonality, and Law Effect",
y = "Residuals", x = "Year") +
theme_minimal(base_size = 13)
Now we compare the ACF and PACF of the raw and cleaned series side by side:
p_acf_raw <- ggAcf(deaths, lag.max = 36) +
labs(title = "ACF — Raw Series") + theme_minimal(base_size = 10)
p_pacf_raw <- ggPacf(deaths, lag.max = 36) +
labs(title = "PACF — Raw Series") + theme_minimal(base_size = 10)
p_acf_clean <- ggAcf(deaths_clean, lag.max = 36) +
labs(title = "ACF — Cleaned Series") + theme_minimal(base_size = 10)
p_pacf_clean <- ggPacf(deaths_clean, lag.max = 36) +
labs(title = "PACF — Cleaned Series") + theme_minimal(base_size = 10)
(p_acf_raw + p_pacf_raw) / (p_acf_clean + p_pacf_clean)
The progression is striking: after removing trend, seasonality, and the law effect, the slow decay and the seasonal spikes largely disappear. What remains is the underlying dynamic structure — the ACF tails off geometrically (lag 1 ≈ 0.59, lag 2 ≈ 0.50, lag 3 ≈ 0.41, etc), and the PACF shows its strongest spike at lag 1 (lag 1 ≈ 0.59), with a smaller spike at lag 2 (lag 2 ≈ 0.22). This pattern points to a low-order AR process — likely AR(1) or AR(2) — as the main remaining dynamic.
We can check this by fitting AR(1) and AR(2) models to the cleaned series with Arima():
# Fit dynamic hypotheses to the cleaned series
h1 <- Arima(deaths_clean, order = c(1, 0, 0))
h2 <- Arima(deaths_clean, order = c(2, 0, 0))
# Strip the new "fc_model" class so texreg falls back to Arima
class(h1) <- "Arima"
class(h2) <- "Arima"
# See how the AR(1) and AR(2) models compare
show_reg(list(h1, h2),
custom.model.names = c("AR(1)", "AR(2)"),
custom.coef.map = list(
"ar1" = "AR(1)",
"ar2" = "AR(2)",
"intercept" = "Intercept"
))
| | AR(1) | AR(2) |
|---|---|---|
| AR(1) | 0.54*** | 0.42*** |
| | (0.06) | (0.07) |
| AR(2) | | 0.22** |
| | | (0.07) |
| Intercept | -1.08 | -1.88 |
| | (19.21) | (23.87) |
| AIC | 2400.98 | 2393.64 |
| BIC | 2410.75 | 2406.67 |
| Log Likelihood | -1197.49 | -1192.82 |
| Num. obs. | 192 | 192 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | | |
The AR(2) model fits best by both AIC and BIC. Now we can check its residuals to see whether it captures the dynamics adequately:
# test Ljung-Box on the residuals of best fitting model
Box.test(residuals(h2), lag = 24, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(h2)
## X-squared = 33.298, df = 24, p-value = 0.09795
With a Ljung-Box \(p\)-value of about 0.10, we cannot reject the null that the AR(2) residuals are white noise. These diagnostics tell us how to approach modeling the raw series: handle the trend, seasonality, and law effect as covariates, and model the remaining AR dynamics with arima() with xreg. In Part 3 we will compare several candidate specifications to decide between AR(1), AR(2), and ARMA alternatives.
Before running arima(), let’s understand what
it computes. ARMA models are estimated by maximum
likelihood — the same principle you used in generalized linear
models. The logic is familiar from introductory econometrics.
For a regression model with AR(1) errors, the joint density of the data given the parameters is \(f(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{\beta}, \phi_1, \sigma^2)\). Viewed as a function of the parameters (with the data treated as fixed), this is the likelihood \(\mathcal{L}(\boldsymbol{\beta}, \phi_1, \sigma^2;\, \boldsymbol{y}, \boldsymbol{X})\). Its logarithm splits into two distinct contributions:
\[ \begin{aligned} \ell(\boldsymbol{\beta}, \phi_1, \sigma^2;\, \boldsymbol{y}, \boldsymbol{X}) =\; & \underbrace{-\frac{1}{2}\log\!\left(\frac{\sigma^2}{1-\phi_1^2}\right) - \frac{\left(y_1 - \frac{\boldsymbol{x}_1 \boldsymbol{\beta}}{1-\phi_1}\right)^{2}}{2\sigma^2 / (1-\phi_1^2)}}_{\text{initial observation } y_1 \sim \text{ stationary distribution of the AR(1)}} \\[4pt] & \underbrace{- \frac{T-1}{2}\log\sigma^2 - \sum_{t=2}^{T}\frac{(y_t - \boldsymbol{x}_t\boldsymbol{\beta} - \phi_1 y_{t-1})^{2}}{2\sigma^2}}_{\text{normal likelihood for } y_2, \ldots, y_T \text{ given the immediate past (same as OLS)}} \end{aligned} \]
A note on notation: we write the parameters on the left of the semicolon (treating them as arguments of the function) and the data on the right (treating them as fixed). This is the standard frequentist convention for writing a likelihood — distinct from Bayesian posterior notation like \(p(\boldsymbol{\theta} \mid \boldsymbol{y})\), which reads “density of parameters given data.”
The two parts of the likelihood. The first term is what makes this time series MLE different from ordinary regression: it models \(y_1\) as a draw from the stationary (marginal) distribution of the AR(1) process, \(\mathcal{N}\!\left(\boldsymbol{x}_1 \boldsymbol{\beta} / (1-\phi_1),\; \sigma^2 / (1-\phi_1^2)\right)\). The second term is the familiar normal regression likelihood applied to observations \(y_2, \ldots, y_T\), treating their immediate lag as a predictor.
Joint MLE vs. lm with a lagged DV. If
you tried to estimate the same model with
lm(y ~ lag(y) + x), OLS would drop \(y_1\) (since there is no \(y_0\) to serve as its lag) and estimate
\(\boldsymbol{\beta}\) and \(\phi_1\) using only the second term of the
likelihood — the conditional likelihood. arima()
does more in two senses: (i) it uses the exact likelihood including
\(y_1\), so no information is wasted;
and (ii) it estimates \(\boldsymbol{\beta}\), \(\phi_1\), and \(\sigma^2\) jointly by
maximizing this full objective, accounting for the serial correlation in
the errors. OLS, by contrast, assumes independent errors and estimates
\(\boldsymbol{\beta}\) first — which,
combined with a lagged dependent variable, is biased (see Section
2.4).
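A small simulation makes the bias concrete (a sketch; the seed, sample size, and parameter values are arbitrary, and base arima() stands in for Arima()):

```r
set.seed(512)
n <- 500
# DGP: y_t = 2 x_t + eta_t, with AR(1) errors and an autocorrelated covariate
x   <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))
eta <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y   <- 2 * x + eta

# OLS with a lagged dependent variable: conditional likelihood only,
# assumes iid errors, so it is inconsistent for beta here
ols <- lm(y[-1] ~ y[-n] + x[-1])

# Joint MLE: regression with AR(1) errors, exact likelihood
mle <- arima(y, order = c(1, 0, 0), xreg = cbind(x = x), include.mean = FALSE)

c(ols_beta = unname(coef(ols)[3]), mle_beta = unname(coef(mle)["x"]))
# The MLE lands close to the true beta = 2; the OLS estimate is pulled
# away by the serial correlation it ignores
```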
Generalization to AR(\(p\)). The structure above extends naturally to higher-order AR processes. For AR(\(p\)), the second term of the log-likelihood uses the first \(p\) observations as initial conditions, and the main sum runs from \(t = p+1\) to \(T\):
\[ -\frac{T-p}{2}\log\sigma^2 - \sum_{t=p+1}^{T}\frac{\left(y_t - \boldsymbol{x}_t\boldsymbol{\beta} - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p}\right)^{2}}{2\sigma^2} \]
The first term (initial conditions) becomes the joint density of \((y_1, \ldots, y_p)\) under the stationary distribution of the AR(\(p\)) — a multivariate normal with a more involved covariance matrix. For AR(2) specifically, the second term expands to the squared residual \(\left(y_t - \boldsymbol{x}_t\boldsymbol{\beta} - \phi_1 y_{t-1} - \phi_2 y_{t-2}\right)^2\) summed from \(t = 3\) onward.
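To connect the formula to what arima() reports, we can code the exact AR(1) log-likelihood directly (zero-mean case with no covariates; we include the \(2\pi\) constants, which the display above drops, so the value matches R's):

```r
# Exact AR(1) log-likelihood: stationary density for y_1, conditional
# normal densities for y_2, ..., y_n
loglik_ar1 <- function(phi, sigma2, y) {
  n <- length(y)
  # Initial observation: y_1 ~ N(0, sigma2 / (1 - phi^2))
  ll_first <- dnorm(y[1], mean = 0, sd = sqrt(sigma2 / (1 - phi^2)), log = TRUE)
  # Remaining observations: y_t | y_{t-1} ~ N(phi * y_{t-1}, sigma2)
  ll_rest <- sum(dnorm(y[-1], mean = phi * y[-n], sd = sqrt(sigma2), log = TRUE))
  ll_first + ll_rest
}

set.seed(512)
y <- as.numeric(arima.sim(model = list(ar = 0.6), n = 200))
fit <- arima(y, order = c(1, 0, 0), include.mean = FALSE, method = "ML")

# Evaluated at the MLE, the hand-coded likelihood reproduces arima()'s value
c(manual = loglik_ar1(coef(fit)["ar1"], fit$sigma2, y), arima = fit$loglik)
```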
The arima() function with xreg

In Lab 2, we used Arima() to estimate ARMA models on
simulated data with no covariates. In applied work, we almost always
have covariates — variables we believe explain part of
the variation in the outcome. The arima() and
Arima() functions accept covariates through the
xreg argument.
The model we estimate can be written as:
\[ y_t = \boldsymbol{x}_t \boldsymbol{\beta} + \eta_t, \quad \eta_t = \phi_1 \eta_{t-1} + \cdots + \phi_p \eta_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} \]
In words: this is a regression model where the errors have
temporal structure — an ARMA(\(p,q\)) process, rather than the i.i.d.
errors assumed by OLS. The arima() function estimates the
regression coefficients \(\boldsymbol{\beta}\) and the ARMA
parameters \(\phi, \theta\)
simultaneously by maximum likelihood.
Arima()

Before estimating our first model, let’s break down the function’s
signature. Arima() (from forecast) and base
R’s arima() share the same core arguments;
Arima() adds a few conveniences for interpretation and
forecasting.
Arima(y,
order = c(p, d, q), # non-seasonal orders
seasonal = list(order = c(P, D, Q), period = m), # seasonal orders
xreg = NULL, # matrix of covariates
include.mean = TRUE, # estimate an intercept/mean?
include.drift = FALSE, # estimate a linear drift (only if d >= 1)?
method = c("CSS-ML", "ML", "CSS")) # estimation method
Key arguments, one by one:
y — a univariate time series. Must
be a ts object (so R knows the frequency, needed for
seasonal models).
order = c(p, d, q) — the
non-seasonal ARIMA orders. \(p\) is the
AR order, \(d\) is the number of
differences, \(q\) is the MA order. For
now we are working with stationary data, so \(d = 0\). We will discuss \(d \neq 0\) in Lab 4.
seasonal = list(order = c(P, D, Q), period = m)
— the seasonal ARIMA orders at period \(m\) (e.g., \(m =
12\) for monthly data). Leave it at its default (no seasonal
terms) if you are handling seasonality with dummies in xreg
— which is what we do in this lab.
xreg — a matrix or data frame of
covariates. Rows must align with observations in y. If you
pass a single covariate, it can be a vector. The coefficients on these
covariates are what we interpret substantively — the effect of the
seatbelt law, seasonal shifts, etc.
include.mean — if TRUE
(the default), the model estimates a constant (the intercept in a
regression, or the unconditional mean in a pure ARMA). Set to
FALSE only if you have reason to believe the series has
zero mean.
include.drift — adds a linear drift
term \(\delta \cdot t\) to the model.
This is essentially a shortcut for fitting a linear trend
without manually constructing a time index
(t = 1, 2, ..., n) and passing it through
xreg. It only applies when \(d
\geq 1\) (differenced series); for stationary models (\(d = 0\)), use xreg with a time
index instead.
method — how the likelihood is
maximized. "CSS-ML" (default) uses conditional
sum-of-squares for starting values and then exact ML; "ML"
is exact ML throughout; "CSS" is conditional ML only
(faster but slightly less efficient — this is closest to what
lm does, as we discussed in Section 2.1).
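The estimation methods usually agree closely, but on short series the difference is visible because the first observation carries relatively more information (a sketch on simulated data; the seed and series length are arbitrary):

```r
set.seed(512)
y <- as.numeric(arima.sim(model = list(ar = 0.6), n = 60))

fit_css <- arima(y, order = c(1, 0, 0), method = "CSS")    # conditional sum-of-squares only
fit_ml  <- arima(y, order = c(1, 0, 0), method = "ML")     # exact likelihood throughout
fit_mix <- arima(y, order = c(1, 0, 0), method = "CSS-ML") # default: CSS start values, then ML

rbind(CSS = coef(fit_css), ML = coef(fit_ml), `CSS-ML` = coef(fit_mix))
# CSS differs slightly because it conditions on y_1 rather than modeling it;
# ML and CSS-ML agree because both end with exact maximum likelihood
```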
Once you have a fitted Arima object, these extractors
work the same way as for lm:
| Function | Returns |
|---|---|
| coef(fit) | Named vector of estimated coefficients |
| vcov(fit) | Variance-covariance matrix of coefficient estimates |
| residuals(fit) | One-step-ahead forecast errors (what you check for white noise) |
| fitted(fit) | In-sample fitted values |
| AIC(fit), BIC(fit), logLik(fit) | Goodness-of-fit statistics |
| forecast(fit, h = H, xreg = ...) | Out-of-sample forecasts \(H\) periods ahead |
| checkresiduals(fit) | Residual diagnostics (time plot + ACF + Ljung-Box) |
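The first two extractors are enough to hand-compute a z-statistic and 95% confidence interval for any coefficient. A self-contained sketch on simulated data (base arima() is used so the example runs without forecast, reading the covariance matrix from the equivalent fit$var.coef slot):

```r
set.seed(512)
# Stand-in for the lab model: AR(1) errors plus a step "policy" covariate
law_sim <- rep(c(0, 1), each = 50)
y_sim   <- 10 - 3 * law_sim + as.numeric(arima.sim(model = list(ar = 0.5), n = 100))
fit_sim <- arima(y_sim, order = c(1, 0, 0), xreg = cbind(law = law_sim))

est <- coef(fit_sim)["law"]                  # point estimate
se  <- sqrt(diag(fit_sim$var.coef))["law"]   # standard error from the vcov matrix
round(c(estimate = unname(est),
        z        = unname(est / se),
        lower95  = unname(est - 1.96 * se),
        upper95  = unname(est + 1.96 * se)), 2)
```

The same lines work on any fitted model in this lab once you substitute the coefficient name.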
Let’s put it together with the simplest useful specification: AR(1) errors plus the seatbelt law as a covariate. No seasonal term, no differencing (\(d = 0\)).
# xreg must be a matrix (or vector) of covariates — here, just the law indicator
fit_ar1 <- Arima(deaths, order = c(1, 0, 0), xreg = law)
fit_ar1
## Series: deaths
## Regression with ARIMA(1,0,0) errors
##
## Coefficients:
## ar1 intercept xreg
## 0.6439 1719.193 -377.4542
## s.e. 0.0553 42.078 107.6521
##
## sigma^2 = 39913: log likelihood = -1288.26
## AIC=2584.52 AICc=2584.73 BIC=2597.55
Reading the output:

- ar1: the estimated AR(1) coefficient \(\hat{\phi}_1\). This tells you how strongly last month’s deaths predict this month’s, after accounting for the law.
- intercept: the estimated mean level of the series when the law is not in effect (\(\text{law} = 0\)).
- xreg (the law indicator): the estimated effect of the seatbelt law on road deaths, holding the AR dynamics constant. This is the immediate (impact) effect — the change in the expected value of \(y_t\) in the period the law takes effect. (The coefficient is labeled xreg because we passed the covariate as an unnamed vector.)
- sigma^2, log likelihood, AIC, AICc, BIC: the innovation variance \(\hat{\sigma}^2\) and fit statistics (we use these for model comparison in Part 3).

The ACF showed clear seasonal spikes. We handle this with month dummy
variables, as we learned in Lab 2 (Approach 2: lm() with
dummies). Here, we include the dummies as additional xreg
covariates in arima().
# Create month dummy variables (11 dummies; January = reference)
# Order the factor by calendar month first; otherwise model.matrix() uses
# alphabetical levels and the Feb-Dec labels below would attach to the
# wrong columns
month_fac <- factor(ukdata$month, levels = month.name)
month_dummies <- model.matrix(~ month_fac - 1)[, -1]
colnames(month_dummies) <- month.abb[-1] # Feb, Mar, ..., Dec
# Combine law + month dummies into a covariate matrix
xcov <- cbind(law = ukdata$law, month_dummies)
head(xcov, 3)
## law Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 1 0 0 0 0 0 0 0 0 0 0
## 3 0 0 1 0 0 0 0 0 0 0 0 0
# AR(1) with seatbelt law + seasonal dummies
fit_ar1_seas <- Arima(deaths, order = c(1, 0, 0), xreg = xcov)
fit_ar1_seas
## Series: deaths
## Regression with ARIMA(1,0,0) errors
##
## Coefficients:
## ar1 intercept law Feb Mar Apr May
## 0.6442 1481.2826 -370.0694 172.1115 679.4141 62.2094 238.6466
## s.e. 0.0550 42.9918 70.2727 53.0982 53.0102 44.9728 50.0669
## Jun Jul Aug Sep Oct Nov Dec
## 157.3445 81.677 113.0147 137.4017 563.2579 364.0130 224.8335
## s.e. 50.2149 45.020 35.1794 35.1859 54.5760 55.0383 54.5765
##
## sigma^2 = 17618: log likelihood = -1204
## AIC=2437.99 AICc=2440.72 BIC=2486.85
Notice how the seasonal dummies absorb the monthly pattern, shifting each month relative to the January baseline. The law coefficient also changes slightly (from about −377 to −370) because we are now controlling for the seasonal pattern that was previously left to the AR dynamics.
Why not lm()? The Breusch-Godfrey test

A natural question: if this is “just a regression with covariates,” why not use lm() and add a lagged dependent variable? The answer is that OLS with a lagged dependent variable and serially correlated errors produces biased and inconsistent estimates. Let’s see this empirically.
# OLS with a lag of deaths + law + month dummies
ukdata_lagged <- ukdata |>
mutate(death_lag1 = lag(death, 1)) |>
na.omit()
fit_ols <- lm(death ~ death_lag1 + law + factor(month), data = ukdata_lagged)
The Breusch-Godfrey (BG) test (Breusch 1978; Godfrey 1978) is a Lagrange multiplier (LM) test for serial correlation in the residuals of a regression model. It generalizes the classic Durbin-Watson test in two important ways: (i) it remains valid when the regression includes a lagged dependent variable (which DW does not), and (ii) it can detect serial correlation of any order \(p\), not just first-order.
How the test works (auxiliary regression). The BG test is a two-step procedure: first, estimate the original regression by OLS and save the residuals \(\hat{e}_t\); second, regress those residuals on the original covariates \(\boldsymbol{x}_t\) and \(p\) of their own lags:
\[ \hat{e}_t = \boldsymbol{x}_t \boldsymbol{\gamma} + \rho_1 \hat{e}_{t-1} + \rho_2 \hat{e}_{t-2} + \cdots + \rho_p \hat{e}_{t-p} + v_t \]
Intuition. The test statistic is \(BG = n R^2\) from this auxiliary regression, which is asymptotically \(\chi^2_p\) under the null of no serial correlation. If the residuals are truly white noise, their lagged values should have no predictive power — the \(\hat{\rho}_k\) coefficients should all be zero, \(R^2\) should be small, and \(BG\) should be small. If lagged residuals do predict the current residual, the auxiliary \(R^2\) grows and \(BG\) rejects the null.
Relationship to Ljung-Box (Lab 2). Both tests detect serial correlation, but they operate in different contexts:
For OLS with a lagged dependent variable, BG is the right diagnostic.
For ARIMA residuals, Ljung-Box (via checkresiduals()) is
standard.
Choosing the order \(p\). Test a few lags — typically 1, 2, and the seasonal period (12 for monthly data). If any rejects, the residuals have structure OLS failed to capture.
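To demystify the statistic, here is the two-step procedure coded by hand on simulated data with AR(1) errors (a sketch; pre-sample residuals are set to zero, matching lmtest's default fill):

```r
set.seed(512)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + as.numeric(arima.sim(model = list(ar = 0.5), n = n))
ols <- lm(y ~ x)

# Step 1: residuals from the original OLS regression
e <- residuals(ols)

# Step 2: auxiliary regression of e_t on the original covariates and p lags of e
p <- 2
e_lags <- sapply(1:p, function(k) c(rep(0, k), e[1:(n - k)]))
aux <- lm(e ~ x + e_lags)

# LM statistic: n * R^2, asymptotically chi-squared with p degrees of freedom
bg_stat <- n * summary(aux)$r.squared
p_value <- pchisq(bg_stat, df = p, lower.tail = FALSE)
c(BG = bg_stat, p.value = p_value)
# With phi = 0.5 in the errors, the test rejects decisively
```

This should reproduce bgtest(ols, order = 2) up to rounding.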
# Test for first-order serial correlation
bgtest(fit_ols, order = 1)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: fit_ols
## LM test = 11.546, df = 1, p-value = 0.000679
# Test for second-order serial correlation
bgtest(fit_ols, order = 2)
##
## Breusch-Godfrey test for serial correlation of order up to 2
##
## data: fit_ols
## LM test = 11.984, df = 2, p-value = 0.002498
# Test at the seasonal lag (monthly data)
bgtest(fit_ols, order = 12)
##
## Breusch-Godfrey test for serial correlation of order up to 12
##
## data: fit_ols
## LM test = 30.006, df = 12, p-value = 0.002787
If the Breusch-Godfrey test rejects (small \(p\)-value), OLS residuals are serially
correlated — meaning your standard errors are wrong and the coefficient
on law is potentially biased. This is why we use
arima() with ARMA errors instead: it models the serial
correlation explicitly, producing consistent estimates and correct
inference.
# Compare the law coefficient from OLS vs. ARIMA
cat("OLS estimate of law effect:", round(coef(fit_ols)["law"], 1), "\n")
## OLS estimate of law effect: -145.3
cat("ARIMA estimate of law effect:", round(coef(fit_ar1_seas)["law"], 1), "\n")
## ARIMA estimate of law effect: -370.1
In a regression with AR errors, the coefficient on law
captures the immediate (impact) effect of the policy —
the change in \(y_t\) in the first
period after the law takes effect. But because the errors carry their
own temporal dependence, the effect propagates through time
before the series settles at a new equilibrium. Two quantities matter: the immediate (impact) effect, \(\hat{\beta}\) itself, and the long-run (equilibrium) effect, which accumulates the AR feedback over time.
The long-run effect is derived from the equilibrium mean of the process. For AR(1) errors:
\[ \mathbb{E}(y_t) = \frac{\boldsymbol{x}_t \boldsymbol{\beta}}{1 - \phi_1} \qquad \Longrightarrow \qquad \text{Long-run effect} = \frac{\hat{\beta}}{1 - \hat{\phi}_1} \]
For AR(2) errors, both \(\phi_1\) and \(\phi_2\) contribute to persistence, so the denominator involves both:
\[ \mathbb{E}(y_t) = \frac{\boldsymbol{x}_t \boldsymbol{\beta}}{1 - \phi_1 - \phi_2} \qquad \Longrightarrow \qquad \text{Long-run effect} = \frac{\hat{\beta}}{1 - \hat{\phi}_1 - \hat{\phi}_2} \]
Generalizing to AR(\(p\)):
\[ \mathbb{E}(y_t) = \frac{\boldsymbol{x}_t \boldsymbol{\beta}}{1 - \sum_{k=1}^{p}\phi_k} \qquad \Longrightarrow \qquad \text{Long-run effect} = \frac{\hat{\beta}}{1 - \sum_{k=1}^{p}\hat{\phi}_k} \]
The denominator is the stationarity condition written backwards: \(1 - \sum \phi_k\) must be positive (and the AR polynomial must have all roots outside the unit circle) for the equilibrium to be well defined. As the sum of AR coefficients approaches 1, the long-run multiplier explodes — mirroring the variance explosion we saw in Lab 2 for AR(1) as \(\phi \to 1\).
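The long-run effect is a nonlinear function of the estimates, so arima() does not print a standard error for it; the delta method provides one. A sketch as a reusable helper (lr_effect_se is our own name, demonstrated on simulated data; it can then be applied to a fitted model like fit_ar2_seas):

```r
# Delta-method s.e. for the long-run effect beta / (1 - sum(phi))
# fit: an arima()/Arima() fit with AR terms "ar1", ..., "arp" and a
# named xreg coefficient; coef_name: which covariate to evaluate
lr_effect_se <- function(fit, coef_name) {
  cf   <- coef(fit)
  ars  <- grep("^ar[0-9]+$", names(cf), value = TRUE)
  beta <- cf[coef_name]
  s    <- 1 - sum(cf[ars])                  # 1 - sum of AR coefficients
  lr   <- beta / s
  # Gradient of beta / (1 - sum(phi)) w.r.t. (beta, phi_1, ..., phi_p)
  g <- setNames(numeric(length(cf)), names(cf))
  g[coef_name] <- 1 / s
  g[ars]       <- beta / s^2
  se <- sqrt(drop(t(g) %*% fit$var.coef %*% g))
  c(long_run = unname(lr), se = unname(se))
}

# Demo on a self-contained simulated AR(2)-with-covariate fit
set.seed(512)
x <- rep(c(0, 1), each = 100)
y <- 5 - 2 * x + as.numeric(arima.sim(model = list(ar = c(0.4, 0.3)), n = 200))
fit_demo <- arima(y, order = c(2, 0, 0), xreg = cbind(law = x))
lr_effect_se(fit_demo, "law")
```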
Since the cleaned ACF/PACF in Section 1.2 suggested an AR(2) process, let’s estimate that specification with the law and month dummies, and compute both effects.
fit_ar2_seas <- Arima(deaths, order = c(2, 0, 0), xreg = xcov)
fit_ar2_seas
## Series: deaths
## Regression with ARIMA(2,0,0) errors
##
## Coefficients:
## ar1 ar2 intercept law Feb Mar Apr
## 0.4696 0.2711 1477.8553 -347.9213 172.0374 679.6911 62.2433
## s.e. 0.0692 0.0694 45.7778 80.5683 45.0545 45.1889 37.9386
## May Jun Jul Aug Sep Oct Nov
## 240.9785 157.2316 81.6359 113.1874 137.3939 563.2884 363.9678
## s.e. 42.8920 42.8448 37.9681 35.0594 35.0610 46.6042 47.0010
## Dec
## 224.7363
## s.e. 46.5606
##
## sigma^2 = 16399: log likelihood = -1196.65
## AIC=2425.3 AICc=2428.41 BIC=2477.42
# Extract estimates
beta_law <- coef(fit_ar2_seas)["law"]
phi1 <- coef(fit_ar2_seas)["ar1"]
phi2 <- coef(fit_ar2_seas)["ar2"]
# Long-run multiplier for AR(2): 1 / (1 - phi1 - phi2)
lr_multiplier <- 1 / (1 - phi1 - phi2)
# Long-run effect of the law
lr_effect <- beta_law * lr_multiplier
tibble(
Quantity = c("phi_1", "phi_2", "Sum of AR coefficients",
"Long-run multiplier: 1 / (1 - phi1 - phi2)",
"Immediate (impact) effect of law",
"Long-run effect of law"),
Value = c(round(phi1, 3),
round(phi2, 3),
round(phi1 + phi2, 3),
round(lr_multiplier, 3),
round(beta_law, 1),
round(lr_effect, 1))
)
## # A tibble: 6 × 2
## Quantity Value
## <chr> <dbl>
## 1 phi_1 0.47
## 2 phi_2 0.271
## 3 Sum of AR coefficients 0.741
## 4 Long-run multiplier: 1 / (1 - phi1 - phi2) 3.86
## 5 Immediate (impact) effect of law -348.
## 6 Long-run effect of law -1342.
Reading the output substantively:

- The coefficient on law is the immediate (impact) effect: the estimated drop in deaths in the first month the law takes effect.
- The long-run effect scales this by the long-run multiplier \(1 / (1 - \hat{\phi}_1 - \hat{\phi}_2)\): the total equilibrium drop once the AR dynamics have fully played out.

To make the impact-vs-equilibrium distinction concrete, let’s trace out the predicted response of the series to a permanent switch of the law from 0 to 1, starting from its pre-law equilibrium. This is called an impulse-response or step-response trajectory.
We simulate forward using the estimated AR(2) recursion:
\[ y_t = \alpha + \beta_{\text{law}} \cdot \text{law}_t + \phi_1 y_{t-1} + \phi_2 y_{t-2} \]
# Extract intercept (mean of the de-meaned process) and build a step-response
alpha_hat <- coef(fit_ar2_seas)["intercept"]
# Periods to simulate (24 months after intervention)
h <- 24
# Pre-law equilibrium: mean when law = 0
y_pre_eq <- alpha_hat / (1 - phi1 - phi2)
# Post-law equilibrium: mean when law = 1
y_post_eq <- (alpha_hat + beta_law) / (1 - phi1 - phi2)
# Simulate y_t from t = 1 to h, starting at pre-law equilibrium
# Law switches on at t = 1 and stays on
y_path <- numeric(h)
y_path[1] <- alpha_hat + beta_law + phi1 * y_pre_eq + phi2 * y_pre_eq
y_path[2] <- alpha_hat + beta_law + phi1 * y_path[1] + phi2 * y_pre_eq
for (t in 3:h) {
y_path[t] <- alpha_hat + beta_law + phi1 * y_path[t-1] + phi2 * y_path[t-2]
}
# Assemble for plotting: include pre-intervention baseline
step_df <- tibble(
t = -3:h,
y = c(rep(y_pre_eq, 4), y_path)
)
ggplot(step_df, aes(x = t, y = y)) +
geom_hline(yintercept = y_pre_eq, linetype = "dotted", color = "grey40") +
geom_hline(yintercept = y_post_eq, linetype = "dashed", color = "firebrick") +
geom_vline(xintercept = 0, linetype = "dashed", color = "steelblue") +
geom_line(color = "steelblue", linewidth = 1) +
geom_point(color = "steelblue", size = 1.5) +
annotate("text", x = h - 2, y = y_pre_eq + 50,
label = "Pre-law equilibrium", color = "grey30", size = 3.5, hjust = 1) +
annotate("text", x = h - 2, y = y_post_eq - 50,
label = "Post-law equilibrium (long-run effect)",
color = "firebrick", size = 3.5, hjust = 1) +
annotate("text", x = 0.3, y = y_pre_eq - 80,
label = "Law takes effect", color = "steelblue", size = 3.5, hjust = 0) +
labs(title = "Predicted Response of UK Road Deaths to the Seatbelt Law",
subtitle = "AR(2) dynamics: series gradually adjusts from pre- to post-law equilibrium",
x = "Months after intervention", y = "Predicted deaths/month") +
theme_minimal(base_size = 13)
The plot shows the core idea: the series drops sharply in month 1 (the impact effect), but continues to fall for several more months as the AR dynamics transmit the shock through time. Eventually the series stabilizes at the new, lower equilibrium — the long-run effect.
This distinction between impact and equilibrium shift is one of the most important ideas in applied time series analysis, and it generalizes directly to more complex dynamic responses. In Lab 4 we will build on this to compute counterfactual forecasts with uncertainty (confidence intervals around the predicted trajectory).
The horizon \(h\) controls how far forward we trace the dynamic response. A natural question is whether the substantive conclusion depends on this choice. Two things to keep in mind:
The trajectory itself does not change with \(h\). The values at periods \(1, 2, \ldots\) are fully determined by the estimated coefficients and the initial conditions. Extending from \(h = 24\) to \(h = 48\) just shows more of the same curve — the predictions at months 1–24 are identical. So \(h\) affects the visual presentation, not the underlying forecast.
The visible “approach to equilibrium” does depend on \(h\), because the series converges exponentially and never exactly reaches equilibrium in finite time. Choosing \(h\) too short makes the curve look like it is still adjusting; choosing it too long pads the plot with visually redundant periods. For the UK deaths AR(2) model, the series has already completed about 99.5% of the total long-run change by month 24 and is numerically indistinguishable from equilibrium by month 36.
There are three principled approaches in the applied literature:
Approach 1: Tie \(h\) to the half-life of shocks. From the AR characteristic roots, compute the dominant decay rate and the implied half-life of a shock. Simulate roughly 5–10 half-lives to capture essentially all the dynamic adjustment.
# Characteristic roots of the AR(2) polynomial: 1 - phi1*z - phi2*z^2 = 0
ar_roots <- polyroot(c(1, -phi1, -phi2))
decay_rate <- max(1 / abs(ar_roots)) # dominant inverse-root modulus
half_life <- log(0.5) / log(decay_rate) # shock half-life, in periods
tibble(
`Dominant decay rate` = round(decay_rate, 3),
`Shock half-life (periods)` = round(half_life, 2),
`Suggested h (7 half-lives)` = ceiling(7 * half_life)
)
## # A tibble: 1 × 3
## `Dominant decay rate` `Shock half-life (periods)` `Suggested h (7 half-lives)`
## <dbl> <dbl> <dbl>
## 1 0.806 3.21 23
Approach 2: Convergence tolerance. Simulate a long horizon, then take \(h\) to be the first period where the gap to the equilibrium drops below a tolerance (e.g., 1% of the total long-run change). This is the most “objective” criterion because you specify exactly what “converged” means.
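Approach 2 takes only a few lines. A sketch with the fitted AR(2) values hard-coded (rounded, so the exact cutoff may shift by a period or two; phi1_hat, phi2_hat, and impact are our shorthand for the estimates above):

```r
# Approach 2: first horizon within 1% of the total long-run change
phi1_hat <- 0.47; phi2_hat <- 0.27; impact <- -348
long_run <- impact / (1 - phi1_hat - phi2_hat)

h_max <- 200
resp <- numeric(h_max)   # step response, as deviation from pre-law equilibrium
resp[1] <- impact
resp[2] <- impact + phi1_hat * resp[1]
for (t in 3:h_max) resp[t] <- impact + phi1_hat * resp[t - 1] + phi2_hat * resp[t - 2]

rel_gap <- abs(resp - long_run) / abs(long_run)  # relative distance to equilibrium
h_converged <- which(rel_gap < 0.01)[1]
h_converged
```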
Approach 3: Substantively meaningful horizon. Pick \(h\) based on the policy/research question — 1–3 years for monthly policy data, 3–5 years for quarterly macro data, etc. For the seatbelt law, 24 months matches the natural evaluation window governments use for policy interventions.
Practical rule. Take \(h\) as the maximum of (a) 5–10 half-lives and (b) a substantively meaningful horizon. Extending beyond this doesn’t change conclusions — it just adds flat visual real estate. For our UK deaths AR(2), both criteria give \(h \approx 24\) (our choice above), confirming that the 24-month window is adequate. If we were modeling a more persistent series (e.g., \(\sum\phi_k \approx 0.95\)), we would need a much longer horizon to display the full dynamic adjustment.
As a practical check, report results at multiple horizons (e.g., effect at 6, 12, 24 months) rather than a single endpoint — this shows both the speed of adjustment and the long-run total, and makes the simulation’s time-path interpretation transparent to readers (Jordan & Philips, 2018, “dynamac” R package; Aptech impulse-response analysis guide; Hyndman & Athanasopoulos FPP3).
The coefficients may differ because OLS is biased by the lagged dependent variable + serial correlation, while ARIMA estimates everything jointly by MLE.
After estimating one model, we need to assess how well it fits and compare it against alternatives. Common metrics:
| Metric | Formula | Properties |
|---|---|---|
| AIC | \(-2 \ln L + 2k\) | Penalizes complexity; lower = better |
| BIC | \(-2 \ln L + k \ln n\) | Heavier penalty than AIC; favors parsimony |
| MSE | \(\frac{1}{n}\sum \hat{e}_t^2\) | Mean squared error; scale-dependent |
| RMSE | \(\sqrt{\text{MSE}}\) | Same units as \(y\); easier to interpret |
| MAE | \(\frac{1}{n}\sum \lvert\hat{e}_t\rvert\) | Less sensitive to outliers than RMSE |
Both AIC (Akaike Information Criterion) and
BIC (Bayesian Information Criterion) are
likelihood-based model selection criteria. They are
computed directly from the maximized log-likelihood \(\ln L\) (the same quantity
arima() optimizes — see Section 2.1) plus a penalty that
grows with the number of estimated parameters \(k\):
\[ \text{AIC} = -2 \ln L + 2k \qquad\qquad \text{BIC} = -2 \ln L + k \ln n \]
Lower values are better. What matters is differences between models, not absolute levels.
For a normal linear regression (or any model with Gaussian i.i.d. errors fit by MLE), the maximum likelihood estimate of the residual variance is \(\hat{\sigma}^2 = \text{MSE}\). Substituting this into the Gaussian log-likelihood and simplifying gives:
\[ \text{AIC} = n \ln(\text{MSE}) + 2k + C \]
where \(C = n + n \ln(2\pi)\) is a constant that does not depend on the model and therefore cancels when comparing AIC values across specifications. So for model comparison on the same data, AIC is equivalent to \(n \ln(\text{MSE}) + 2k\) (up to a constant) — a log-transformed, complexity-penalized MSE. Two intuitions follow:
When you fit a model with Arima() (maximum likelihood),
AIC and BIC are the natural criteria — they’re computed
from the same quantity the estimator already maximized, and they
incorporate the full likelihood (not just the sum of squared residuals).
MSE/RMSE are useful as a reality check (they tell you
how big a typical residual is in the original units) and for comparing
models on an out-of-sample test set, but they should not drive
specification choice for MLE-fit models.
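The identity above is easy to verify numerically. This sketch uses a plain lm() fit on simulated data; note that \(k\) counts \(\hat{\sigma}^2\) as an estimated parameter, matching how R's AIC() counts degrees of freedom:

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
mse <- mean(residuals(fit)^2)        # the MLE of sigma^2
k <- length(coef(fit)) + 1           # intercept + slope + sigma^2
C <- n + n * log(2 * pi)             # constant: cancels in model comparisons
all.equal(AIC(fit), n * log(mse) + 2 * k + C)   # TRUE
```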
Both balance fit against complexity, but they embody different model-selection philosophies and can disagree. The trade-off:
| Property | AIC | BIC |
|---|---|---|
| Penalty per parameter | \(2\) | \(\ln n\) (heavier whenever \(n > 7\)) |
| Asymptotic goal | Best predictive model | Identify the true model (if it’s in the candidate set) |
| Tends to select | Slightly larger models | Smaller, more parsimonious models |
| Consistent? | No (may asymptotically select too-large models) | Yes (converges to the true model as \(n \to \infty\)) |
| Efficient for prediction? | Yes (minimizes out-of-sample loss asymptotically) | No (may under-select for prediction) |
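To see where the crossover happens, compare the two per-parameter penalties directly (our sample has \(n = 192\)):

```r
# BIC's per-parameter penalty log(n) overtakes AIC's 2 once n >= 8
n <- c(5, 7, 8, 20, 100, 192)
data.frame(n, AIC_penalty = 2, BIC_penalty = round(log(n), 2))
```

At \(n = 192\), BIC charges about \(5.26\) per parameter, roughly 2.6 times AIC's penalty, which is why BIC tends to pick smaller models here.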
Rules of thumb:
- If the goal is forecasting, lean on AIC; if the goal is a parsimonious, interpretable specification, lean on BIC.
- In small samples, prefer the small-sample correction AICc; Arima() reports AICc automatically.
# Extract GOF metrics for a single model
resid_ar1 <- residuals(fit_ar1_seas)
tibble(
Metric = c("AIC", "BIC", "Log-Likelihood", "MSE", "RMSE", "MAE"),
Value = c(
AIC(fit_ar1_seas),
BIC(fit_ar1_seas),
fit_ar1_seas$loglik,
mean(resid_ar1^2),
sqrt(mean(resid_ar1^2)),
mean(abs(resid_ar1))
)
) |> mutate(Value = round(Value, 2))
## # A tibble: 6 × 2
## Metric Value
## <chr> <dbl>
## 1 AIC 2438.
## 2 BIC 2487.
## 3 Log-Likelihood -1204
## 4 MSE 16333.
## 5 RMSE 128.
## 6 MAE 102.
We fit several candidate specifications and compare them. The candidates span different dynamic structures, all with the law + seasonal dummies as covariates.
# Candidate models — all include law + month dummies via xreg
models <- list(
"AR(1)" = Arima(deaths, order = c(1, 0, 0), xreg = xcov),
"AR(2)" = Arima(deaths, order = c(2, 0, 0), xreg = xcov),
"MA(1)" = Arima(deaths, order = c(0, 0, 1), xreg = xcov),
"ARMA(1,1)" = Arima(deaths, order = c(1, 0, 1), xreg = xcov),
"ARMA(2,1)" = Arima(deaths, order = c(2, 0, 1), xreg = xcov),
"ARMA(1,2)" = Arima(deaths, order = c(1, 0, 2), xreg = xcov)
)
# Build comparison table
comparison <- tibble(
Model = names(models),
AIC = map_dbl(models, AIC),
BIC = map_dbl(models, BIC),
RMSE = map_dbl(models, ~ sqrt(mean(residuals(.x)^2))),
LB_pvalue = map_dbl(models, ~ Box.test(residuals(.x), lag = 24,
type = "Ljung-Box")$p.value)
) |>
arrange(AIC) |>
mutate(across(where(is.numeric), ~ round(.x, 2)),
delta_AIC = AIC - min(AIC))
comparison
## # A tibble: 6 × 6
## Model AIC BIC RMSE LB_pvalue delta_AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARMA(2,1) 2417. 2472. 120. 0.05 0
## 2 ARMA(1,2) 2418. 2473. 120. 0.05 0.970
## 3 ARMA(1,1) 2418. 2470. 121. 0.05 1.71
## 4 AR(2) 2425. 2477. 123. 0.01 8.64
## 5 AR(1) 2438. 2487. 128. 0 21.3
## 6 MA(1) 2482. 2531. 143. 0 65.3
How to read this table: models are sorted by AIC, and delta_AIC is each model's gap to the best one; gaps under about 2 mean the models are essentially indistinguishable (the dashed line in the plot below). LB_pvalue is the Ljung-Box p-value at lag 24; values near zero (the AR(1) and MA(1) rows) indicate autocorrelation remaining in the residuals.
comparison |>
mutate(Model = fct_reorder(Model, AIC)) |>
ggplot(aes(x = Model, y = delta_AIC)) +
geom_col(fill = "steelblue", alpha = 0.8) +
geom_hline(yintercept = 2, linetype = "dashed", color = "firebrick") +
labs(title = "Model Comparison: \u0394AIC from Best Model",
y = "\u0394AIC", x = NULL) +
coord_flip() +
theme_minimal(base_size = 12)
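One optional companion to \(\Delta\text{AIC}\) is Akaike weights, \(w_i \propto e^{-\Delta_i / 2}\), which rescale the gaps into relative support within the candidate set. Using the delta_AIC values from the table above:

```r
# Akaike weights: relative support for each model within this candidate set
delta <- c(`ARMA(2,1)` = 0, `ARMA(1,2)` = 0.97, `ARMA(1,1)` = 1.71,
           `AR(2)` = 8.64, `AR(1)` = 21.3, `MA(1)` = 65.3)
akaike_w <- exp(-delta / 2) / sum(exp(-delta / 2))
round(akaike_w, 3)   # the three ARMA models carry nearly all the weight
```

The best model gets well under half the total weight, reinforcing the table's message: the top three specifications are statistically hard to tell apart.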
Take the best model and verify it’s correctly specified by checking one step above and below:
# Identify the best model by AIC
best_name <- comparison$Model[1]
cat("Best model by AIC:", best_name, "\n\n")
## Best model by AIC: ARMA(2,1)
# Check residuals of the best model
best_fit <- models[[best_name]]
checkresiduals(best_fit)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(2,0,1) errors
## Q* = 36.164, df = 21, p-value = 0.02096
##
## Model df: 3. Total lags used: 24
auto.arima() picks

As a sanity check, let’s see what auto.arima() selects:
fit_auto <- auto.arima(deaths,
xreg = xcov,
stationary = TRUE,
seasonal = FALSE, # we handle seasonality via dummies
stepwise = FALSE, # thorough search
approximation = FALSE)
fit_auto
## Series: deaths
## Regression with ARIMA(2,0,1) errors
##
## Coefficients:
## ar1 ar2 ma1 intercept law Feb Mar
## 1.1899 -0.2157 -0.7950 1469.1318 -321.2201 171.8751 679.9702
## s.e. 0.1071 0.0976 0.0724 68.9326 78.8301 41.0435 41.1674
## Apr May Jun Jul Aug Sep Oct
## 62.5233 241.9387 157.0544 81.4898 113.1761 137.2673 563.4235
## s.e. 39.3198 40.6465 40.5352 39.3223 35.1482 35.1484 41.3888
## Nov Dec
## 363.9178 224.6292
## s.e. 41.4152 41.3042
##
## sigma^2 = 15582: log likelihood = -1191.33
## AIC=2416.66 AICc=2420.18 BIC=2472.04
Compare with our manual selection: do they agree? If
auto.arima() picks a different specification, investigate
why — it may have found a better fit, or it may have been misled by the
stepwise algorithm.
In-sample metrics (AIC, RMSE) tell us how well the model fits the data it was trained on. But what we really care about is out-of-sample performance — how well the model predicts data it has not seen. Cross-validation (CV) provides a more honest assessment of model quality.
The key difference from standard CV: In cross-sectional data, we can randomly split the data into training and test sets. In time series, observations are ordered — we cannot predict the past from the future. Instead, we use a rolling forecast window: the training set always comes before the test set.
Two common schemes:
The expanding window uses all available history; the sliding window uses a fixed-length lookback. Expanding is more common in practice; sliding is useful when the process may change over time (non-stationarity).
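The index arithmetic behind the two schemes can be sketched with a toy series (length 20, minimum window 10, one-step-ahead forecasts):

```r
# Train/test indices generated by each scheme for a toy series
n <- 20; w <- 10
origins <- w:(n - 1)   # last training observation for each CV iteration
expanding <- lapply(origins, function(o) list(train = 1:o,             test = o + 1))
sliding   <- lapply(origins, function(o) list(train = (o - w + 1):o,   test = o + 1))
expanding[[5]]$train   # 1:14 — training set grows with each origin
sliding[[5]]$train     # 5:14 — training set keeps fixed length 10
```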
Let’s implement a sliding-window CV for our UK deaths data, comparing three candidate models: AR(1), AR(2), and ARMA(1,1) — all with the seatbelt law and month dummies as covariates.
Before wrapping everything in a loop, let’s walk through what happens in a single iteration. Once the logic is clear, the loop is just “do this many times and store the results.”
We set the hyperparameters first:
min_window <- 170 # training window length (ensures seatbelt law is inside training)
h_forward <- 12 # forecast horizon: 12 months ahead
n_obs <- length(deaths)
n_iter <- n_obs - min_window - h_forward + 1 # number of CV iterations
n_iter
## [1] 11
Each iteration \(i\) does four things: (1) define the train/test windows, (2) subset the data, (3) fit the model, (4) forecast and store errors. Let’s run one iteration for \(i = 1\) and inspect each step.
Step 1: Define the train/test windows for iteration \(i = 1\).
i <- 1
# Training window: observations i through i + min_window - 1
train_idx <- i:(i + min_window - 1)
# Test window: the 12 observations immediately after the training window
test_idx <- (i + min_window):(i + min_window + h_forward - 1)
range(train_idx) # 1 to 170 (Jan 1969 – Feb 1983)
## [1] 1 170
range(test_idx) # 171 to 182 (Mar 1983 – Feb 1984)
## [1] 171 182
The training set covers observations 1–170 (Jan 1969 through Feb 1983, including the seatbelt law introduction); the test set is the next 12 months.
Step 2: Subset the outcome and covariate matrix.
y_train <- deaths[train_idx]
y_test <- deaths[test_idx]
x_train <- xcov[train_idx, ]
x_test <- xcov[test_idx, ]
length(y_train) # 170 training observations
## [1] 170
length(y_test) # 12 test observations
## [1] 12
Step 3: Fit the three candidate models on the training data.
fit1 <- Arima(y_train, order = c(1, 0, 0), xreg = x_train) # AR(1)
fit2 <- Arima(y_train, order = c(2, 0, 0), xreg = x_train) # AR(2)
fit3 <- Arima(y_train, order = c(1, 0, 1), xreg = x_train) # ARMA(1,1)
Step 4: Forecast 12 months ahead and compute absolute errors.
Each forecast() call uses the fitted model to generate
predictions for the test period.
The forecast() function (from the forecast
package) takes a fitted model object and returns point forecasts plus
prediction intervals. The signature for ARIMA-type models is:
forecast(object,
h = NULL, # number of periods to forecast
xreg = NULL, # future values of covariates (if the model uses xreg)
level = c(80, 95), # confidence levels for prediction intervals
...)
Key arguments:
- object — the fitted model (an Arima object in our case).
- h — number of future periods to predict. If the model has xreg, this is usually omitted (inferred from the number of rows in xreg), but being explicit is clearer.
- xreg — a matrix of future covariate values covering the forecast horizon. Must have the same columns (same names, same order) as the xreg used at fit time. Without this, forecast() does not know what the regressors will be in the future.
- level — the confidence levels for prediction intervals. The default c(80, 95) returns 80% and 95% intervals.

The object forecast() returns is a list with several components. The ones we use most often:
- $mean — point forecasts (a ts vector of length h).
- $lower, $upper — matrices of lower/upper bounds, one column per confidence level.
- $x — the original training series (useful for plotting).

# Generate 12-month-ahead forecasts for each model
# xreg = x_test supplies the future values of law + month dummies
fc1 <- forecast(fit1, h = h_forward, xreg = x_test)
fc2 <- forecast(fit2, h = h_forward, xreg = x_test)
fc3 <- forecast(fit3, h = h_forward, xreg = x_test)
# fc1$mean is the vector of 12 point forecasts; compare to y_test
abs_err_ar1 <- abs(as.numeric(fc1$mean) - y_test)
abs_err_ar2 <- abs(as.numeric(fc2$mean) - y_test)
abs_err_arma <- abs(as.numeric(fc3$mean) - y_test)
# Inspect: forecast errors at each horizon (months 1, 2, ..., 12 ahead)
tibble(
Horizon = 1:h_forward,
`AR(1) err` = round(abs_err_ar1, 1),
`AR(2) err` = round(abs_err_ar2, 1),
`ARMA(1,1) err` = round(abs_err_arma, 1)
)
## # A tibble: 12 × 4
## Horizon `AR(1) err` `AR(2) err` `ARMA(1,1) err`
## <int> <dbl> <dbl> <dbl>
## 1 1 68.5 120. 115.
## 2 2 96 151. 172.
## 3 3 1.6 64.8 94
## 4 4 130. 67.8 28
## 5 5 119. 59.1 12.6
## 6 6 175. 120 68.1
## 7 7 80 131. 186.
## 8 8 8.4 38 96.1
## 9 9 231. 189. 129
## 10 10 331. 293. 232.
## 11 11 25.8 8.6 68.4
## 12 12 40.2 9.8 48.4
That’s it for one iteration. Each row tells us “how far off was each
model’s forecast at each horizon?” We store these 12-element vectors.
For CV we only need the point forecasts ($mean), but the
same object also carries prediction intervals — which we will use in
Part 5 when we forecast beyond the end of the data.
The loop does exactly this for \(i = 1, 2, \ldots, n\_iter\): it slides the training window forward by one month, re-fits all three models, and stores the absolute errors as a row of a matrix. After all iterations, we average across rows to get the mean absolute error at each horizon, and across horizons to get the overall out-of-sample MAE.
# Pre-allocate storage: one row per iteration, one column per horizon
mae_ar1 <- matrix(NA, nrow = n_iter, ncol = h_forward)
mae_ar2 <- matrix(NA, nrow = n_iter, ncol = h_forward)
mae_arma <- matrix(NA, nrow = n_iter, ncol = h_forward)
for (i in 1:n_iter) {
# Step 1: train/test windows
train_idx <- i:(i + min_window - 1)
test_idx <- (i + min_window):(i + min_window + h_forward - 1)
if (max(test_idx) > n_obs) next
# Step 2: subset
y_train <- deaths[train_idx]; y_test <- deaths[test_idx]
x_train <- xcov[train_idx, ]; x_test <- xcov[test_idx, ]
# Step 3: fit three candidates
fit1 <- Arima(y_train, order = c(1, 0, 0), xreg = x_train)
fit2 <- Arima(y_train, order = c(2, 0, 0), xreg = x_train)
fit3 <- Arima(y_train, order = c(1, 0, 1), xreg = x_train)
# Step 4: forecast and store absolute errors
fc1 <- forecast(fit1, h = h_forward, xreg = x_test)
fc2 <- forecast(fit2, h = h_forward, xreg = x_test)
fc3 <- forecast(fit3, h = h_forward, xreg = x_test)
mae_ar1[i, ] <- abs(as.numeric(fc1$mean) - y_test)
mae_ar2[i, ] <- abs(as.numeric(fc2$mean) - y_test)
mae_arma[i, ] <- abs(as.numeric(fc3$mean) - y_test)
}
Average the absolute errors across iterations at each horizon, then compare the three models:
cv_results <- tibble(
Horizon = 1:h_forward,
`AR(1)` = colMeans(mae_ar1, na.rm = TRUE),
`AR(2)` = colMeans(mae_ar2, na.rm = TRUE),
`ARMA(1,1)` = colMeans(mae_arma, na.rm = TRUE)
) |>
pivot_longer(-Horizon, names_to = "Model", values_to = "MAE")
ggplot(cv_results, aes(x = Horizon, y = MAE, color = Model)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(title = "Cross-Validation: Mean Absolute Error by Forecast Horizon",
x = "Periods Ahead", y = "MAE (deaths)", color = NULL) +
theme_minimal(base_size = 12)
# Overall average MAE across all horizons
tibble(
Model = c("AR(1)", "AR(2)", "ARMA(1,1)"),
`Avg MAE` = c(mean(mae_ar1, na.rm = TRUE),
mean(mae_ar2, na.rm = TRUE),
mean(mae_arma, na.rm = TRUE))
) |> mutate(`Avg MAE` = round(`Avg MAE`, 1)) |>
arrange(`Avg MAE`)
## # A tibble: 3 × 2
## Model `Avg MAE`
## <chr> <dbl>
## 1 ARMA(1,1) 91.8
## 2 AR(1) 96.3
## 3 AR(2) 98.3
Two things to read from these outputs: the horizon plot shows how forecast error grows with distance ahead (and whether the model ranking changes by horizon), and the averaged table gives a single overall out-of-sample ranking.
If the CV winner agrees with the in-sample AIC winner (Part 3), you have a robust choice. If they disagree, investigate why — often one metric is favoring a slightly more (or less) complex specification.
forecast::tsCV()

The forecast package provides tsCV() to
automate time series cross-validation. It handles the rolling window
logic internally — you supply a function that fits and forecasts, and
tsCV() returns the forecast errors for each origin and
horizon.
Let’s scale up from three models to a fuller grid — the same set used in the old CSSS 512 lab — so we can see how both the dynamic structure (AR/MA orders) and the covariate specification (how we handle seasonality) affect out-of-sample performance.
The candidates differ along two dimensions:
| Model | Dynamics | Covariates |
|---|---|---|
| 1a | AR(1) | law only |
| 1b | AR(1) | law + Q4 dummy (Oct–Dec) |
| 1c | AR(1) | law + 11 month dummies |
| 1d | AR(1) | law + selected months (Jan, Sep, Oct, Nov, Dec) |
| 1e | AR(1) with seasonal AR(1)\(_{12}\) | law only |
| 2a | AR(2) | law + 11 month dummies |
| 2b | MA(1) | law + 11 month dummies |
| 2c | ARMA(1,1) | law + 11 month dummies |
| 2d | ARMA(2,1) | law + 11 month dummies |
| 2e | ARMA(1,2) | law + 11 month dummies |
| 2f | ARMA(2,2) | law + 11 month dummies |
# 11 month dummies (January = reference)
# Order the factor levels explicitly: factor() defaults to alphabetical order,
# which would make model.matrix() drop April (not January) and mislabel columns
month_dummies <- model.matrix(~ factor(ukdata$month, levels = month.name) - 1)[, -1]
colnames(month_dummies) <- month.abb[-1]
# Q4 dummy = 1 for October–December
q4_dummy <- as.numeric(ukdata$month %in% c("October", "November", "December"))
# Selected months for model 1d: dummies for Jan, Sep, Oct, Nov, Dec only,
# so the implicit baseline is the remaining months (Feb-Aug)
jan_dummy <- as.numeric(ukdata$month == "January")
# Covariate matrices for each specification
xcov_1a <- cbind(law = ukdata$law)
xcov_1b <- cbind(law = ukdata$law, q4 = q4_dummy)
xcov_1c <- cbind(law = ukdata$law, month_dummies) # all 11 months
xcov_1d <- cbind(law = ukdata$law, jan = jan_dummy,
month_dummies[, c("Sep", "Oct", "Nov", "Dec")]) # selected months
xcov_1e <- cbind(law = ukdata$law) # seasonal AR handles seasonality
tsCV() expects a function with signature
function(y, h, xreg, newxreg) — given a training series
y and future covariates newxreg, return a
forecast object. We write one function per candidate,
keeping each as explicit as possible:
# Family 1: AR(1) dynamics with different covariate specifications
fc_1a <- function(y, h, xreg, newxreg) {
fit <- Arima(y, order = c(1, 0, 0), xreg = xreg)
forecast(fit, h = h, xreg = newxreg)
}
fc_1b <- function(y, h, xreg, newxreg) {
fit <- Arima(y, order = c(1, 0, 0), xreg = xreg)
forecast(fit, h = h, xreg = newxreg)
}
fc_1c <- function(y, h, xreg, newxreg) {
fit <- Arima(y, order = c(1, 0, 0), xreg = xreg)
forecast(fit, h = h, xreg = newxreg)
}
fc_1d <- function(y, h, xreg, newxreg) {
fit <- Arima(y, order = c(1, 0, 0), xreg = xreg)
forecast(fit, h = h, xreg = newxreg)
}
# Model 1e adds a seasonal AR(1) at lag 12 on top of the non-seasonal AR(1)
fc_1e <- function(y, h, xreg, newxreg) {
fit <- Arima(y, order = c(1, 0, 0),
seasonal = list(order = c(1, 0, 0), period = 12),
xreg = xreg)
forecast(fit, h = h, xreg = newxreg)
}
# Family 2: higher-order ARMA dynamics, all with the 11 month-dummy covariates
fc_2a <- function(y, h, xreg, newxreg) { # AR(2)
fit <- Arima(y, order = c(2, 0, 0), xreg = xreg)
forecast(fit, h = h, xreg = newxreg)
}
fc_2b <- function(y, h, xreg, newxreg) { # MA(1)
fit <- Arima(y, order = c(0, 0, 1), xreg = xreg)
forecast(fit, h = h, xreg = newxreg)
}
fc_2c <- function(y, h, xreg, newxreg) { # ARMA(1,1)
fit <- Arima(y, order = c(1, 0, 1), xreg = xreg)
forecast(fit, h = h, xreg = newxreg)
}
fc_2d <- function(y, h, xreg, newxreg) { # ARMA(2,1)
fit <- Arima(y, order = c(2, 0, 1), xreg = xreg)
forecast(fit, h = h, xreg = newxreg)
}
fc_2e <- function(y, h, xreg, newxreg) { # ARMA(1,2)
fit <- Arima(y, order = c(1, 0, 2), xreg = xreg)
forecast(fit, h = h, xreg = newxreg)
}
fc_2f <- function(y, h, xreg, newxreg) { # ARMA(2,2)
fit <- Arima(y, order = c(2, 0, 2), xreg = xreg)
forecast(fit, h = h, xreg = newxreg)
}
Each function does the same two steps: fit an Arima()
model on the training chunk supplied by tsCV(), then
forecast() forward. What differs between them is only the
order (and for fc_1e, the
seasonal argument).
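Since the wrappers differ only in those arguments, a closure factory could generate them all. This is an optional design alternative, not something tsCV() requires; the explicit versions above are easier to read, the factory is less repetitive:

```r
# A closure factory that builds tsCV()-compatible forecast functions from an order
make_fc <- function(order, seasonal = NULL) {
  function(y, h, xreg, newxreg) {
    fit <- if (is.null(seasonal)) {
      forecast::Arima(y, order = order, xreg = xreg)
    } else {
      forecast::Arima(y, order = order, seasonal = seasonal, xreg = xreg)
    }
    forecast::forecast(fit, h = h, xreg = newxreg)
  }
}
fc_2d_alt <- make_fc(c(2, 0, 1))   # behaves like fc_2d above
```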
tsCV() on all 11 candidates

h = 12 means tsCV() produces forecasts at
horizons 1, 2, …, 12 months ahead at each origin. The result is a matrix
with rows = origins and columns = horizons; we average columns to get
the mean absolute error at each horizon.
# Sliding window of 170 training observations; forecast 12 months ahead
W <- 170 # training window length
H <- 12 # forecast horizon
# For each model, tsCV() returns a (T x H) matrix of forecast errors:
# rows = forecast origins (one per period moved forward)
# columns = horizons 1, 2, ..., 12 months ahead
# We take the absolute value and average down each column to get MAE at each horizon.
# Family 1 models (each uses its own covariate matrix)
err_1a <- tsCV(deaths, forecastfunction = fc_1a, h = H, window = W, xreg = xcov_1a)
err_1b <- tsCV(deaths, forecastfunction = fc_1b, h = H, window = W, xreg = xcov_1b)
err_1c <- tsCV(deaths, forecastfunction = fc_1c, h = H, window = W, xreg = xcov_1c)
err_1d <- tsCV(deaths, forecastfunction = fc_1d, h = H, window = W, xreg = xcov_1d)
err_1e <- tsCV(deaths, forecastfunction = fc_1e, h = H, window = W, xreg = xcov_1e)
# Family 2 models (all share the full month-dummies covariate matrix xcov_1c)
err_2a <- tsCV(deaths, forecastfunction = fc_2a, h = H, window = W, xreg = xcov_1c)
err_2b <- tsCV(deaths, forecastfunction = fc_2b, h = H, window = W, xreg = xcov_1c)
err_2c <- tsCV(deaths, forecastfunction = fc_2c, h = H, window = W, xreg = xcov_1c)
err_2d <- tsCV(deaths, forecastfunction = fc_2d, h = H, window = W, xreg = xcov_1c)
err_2e <- tsCV(deaths, forecastfunction = fc_2e, h = H, window = W, xreg = xcov_1c)
err_2f <- tsCV(deaths, forecastfunction = fc_2f, h = H, window = W, xreg = xcov_1c)
# Average absolute errors across origins → MAE at each horizon (length-H vector)
cv_1a <- colMeans(abs(err_1a), na.rm = TRUE)
cv_1b <- colMeans(abs(err_1b), na.rm = TRUE)
cv_1c <- colMeans(abs(err_1c), na.rm = TRUE)
cv_1d <- colMeans(abs(err_1d), na.rm = TRUE)
cv_1e <- colMeans(abs(err_1e), na.rm = TRUE)
cv_2a <- colMeans(abs(err_2a), na.rm = TRUE)
cv_2b <- colMeans(abs(err_2b), na.rm = TRUE)
cv_2c <- colMeans(abs(err_2c), na.rm = TRUE)
cv_2d <- colMeans(abs(err_2d), na.rm = TRUE)
cv_2e <- colMeans(abs(err_2e), na.rm = TRUE)
cv_2f <- colMeans(abs(err_2f), na.rm = TRUE)
We compare all 11 models on one plot. To make the ranking visually
intuitive, we order the models by their average MAE across
horizons (best → worst) and assign colors from a
viridis gradient — so the best-performing models appear at
one end of the gradient (dark purple) and the worst at the other
(yellow). This makes it easy to spot the winners at a glance: they form
the darker cluster along the bottom of the plot.
# Assemble a matrix: rows = horizons, columns = models
cv_matrix <- cbind(cv_1a, cv_1b, cv_1c, cv_1d, cv_1e,
cv_2a, cv_2b, cv_2c, cv_2d, cv_2e, cv_2f)
colnames(cv_matrix) <- c("1a", "1b", "1c", "1d", "1e",
"2a", "2b", "2c", "2d", "2e", "2f")
# Compute the average MAE per model and rank models from best (lowest) to worst
model_rank <- names(sort(colMeans(cv_matrix, na.rm = TRUE)))
# Viridis palette ordered from best → worst (dark = best, light = worst)
model_colors <- setNames(viridisLite::viridis(length(model_rank), end = 0.9),
model_rank)
# Tidy for ggplot, with Model as a factor ordered by rank
cv_long <- as_tibble(cv_matrix) |>
mutate(Horizon = 1:H) |>
pivot_longer(-Horizon, names_to = "Model", values_to = "MAE") |>
mutate(Model = factor(Model, levels = model_rank)) # legend order = rank order
ggplot(cv_long, aes(x = Horizon, y = MAE, color = Model)) +
geom_line(linewidth = 0.8) +
geom_point(size = 1.5) +
scale_color_manual(values = model_colors) +
scale_x_continuous(breaks = 1:H) +
labs(title = "Cross-Validation of UK Road Deaths Models",
subtitle = "Models ordered by average MAE across horizons (best = darkest)",
x = "Periods Ahead", y = "MAE (deaths)", color = "Model") +
theme_minimal(base_size = 12)
A single summary ranking averages MAE across the 12 horizons:
avg_cv <- colMeans(cv_matrix, na.rm = TRUE)
tibble(
Model = names(avg_cv),
`Avg MAE` = round(avg_cv, 1)
) |>
arrange(`Avg MAE`)
## # A tibble: 11 × 2
## Model `Avg MAE`
## <chr> <dbl>
## 1 2f 78.1
## 2 2c 79.5
## 3 2e 80.1
## 4 2d 80.3
## 5 1c 80.4
## 6 2a 81.4
## 7 2b 83.7
## 8 1b 102.
## 9 1e 102.
## 10 1a 171.
## 11 1d 196.
Interpreting the ranking. The best model by average MAE is the one that produces the most accurate forecasts across horizons. As in Part 3, this should be read alongside:
- checkresiduals() on candidates near the top to ensure residuals are white noise.
- the in-sample AIC/BIC comparison from Part 3 (agreement across criteria strengthens the choice).

tsCV() vs. the manual loop

tsCV() is more concise and handles the rolling-window
logic for you. The manual loop gives more control (custom error metrics,
storing point forecasts, inspecting individual iterations). In practice,
use tsCV() for quick model comparison, and the manual loop
when you need to inspect the internals of the CV process or compute
non-standard quantities.
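For intuition, here is the manual-loop idea in miniature using only base R (stats::arima and predict() on a simulated AR(1) series; these are not our models or data):

```r
# Expanding-window one-step-ahead CV comparing AR(1) vs AR(2) on simulated data
set.seed(512)
y <- as.numeric(arima.sim(list(ar = 0.7), n = 120))
w <- 80                                   # minimum training length
err1 <- err2 <- numeric(length(y) - w)
for (o in w:(length(y) - 1)) {
  f1 <- arima(y[1:o], order = c(1, 0, 0))
  f2 <- arima(y[1:o], order = c(2, 0, 0))
  err1[o - w + 1] <- abs(predict(f1, n.ahead = 1)$pred - y[o + 1])
  err2[o - w + 1] <- abs(predict(f2, n.ahead = 1)$pred - y[o + 1])
}
c(MAE_AR1 = mean(err1), MAE_AR2 = mean(err2))
```

With a true AR(1) generating process, the correctly specified AR(1) should be at least competitive; the extra AR(2) parameter buys nothing out of sample.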
After selecting a model — say the AR(2) specification with seatbelt law and month dummies — we often want to use it to forecast forward, producing predicted values with uncertainty bounds.
Forecasting with xreg requires us to supply the
future values of the covariates over the forecast
horizon. For the UK deaths data, this means providing values of the law
indicator and all 11 month dummies for each future month we want to
predict.
Suppose we want to forecast 5 years (60 months) beyond the end of the observed data (Dec 1984), assuming the seatbelt law remains in effect:
# Use the AR(2) model we already estimated in Section 2.5
fit_final <- fit_ar2_seas
# Forecast horizon: 60 months (5 years)
n_ahead <- 60
# Future law indicator: law = 1 everywhere (law kept in effect)
law_future <- rep(1, n_ahead)
# Future month dummies: the pattern cycles every 12 months
# Observed data ends in December 1984, so next month is January 1985
month_cycle <- c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec")
future_months <- factor(rep(month_cycle, length.out = n_ahead),
levels = month_cycle)
future_dummies <- model.matrix(~ future_months - 1)[, -1] # drop Jan (reference)
colnames(future_dummies) <- month.abb[-1]
# Combine into the future covariate matrix (must have same columns as training xcov)
xcov_future <- cbind(law = law_future, future_dummies)
head(xcov_future, 3)
## law Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 1 0 0 0 0 0 0 0 0 0 0 0
## 2 1 1 0 0 0 0 0 0 0 0 0 0
## 3 1 0 1 0 0 0 0 0 0 0 0 0
The forecast() function returns point forecasts plus
prediction intervals (default 80% and 95%):
fc <- forecast(fit_final, h = n_ahead, xreg = xcov_future)
# Summary of the first few months
head(as.data.frame(fc))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1985 1119.910 955.7962 1284.024 868.9196 1370.900
## Feb 1985 1284.622 1103.3147 1465.929 1007.3365 1561.908
## Mar 1985 1798.760 1600.3092 1997.211 1495.2555 2102.265
## Apr 1985 1182.371 975.3963 1389.346 865.8306 1498.912
## May 1985 1363.362 1150.5540 1576.169 1037.9004 1688.823
## Jun 1985 1280.961 1064.5773 1497.344 950.0308 1611.891
Key columns:
- Point Forecast: the predicted mean under the scenario.
- Lo 80 / Hi 80: the 80% prediction interval (±1.28 standard errors).
- Lo 95 / Hi 95: the 95% prediction interval (±1.96 standard errors).

The quickest way to plot a forecast is autoplot(), which displays observed data + point forecast + prediction ribbon.
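The interval multipliers quoted above are just two-sided normal quantiles:

```r
# A level-L prediction interval is: point forecast +/- qnorm((1 + L/100) / 2) * SE
round(qnorm((1 + c(80, 95) / 100) / 2), 2)   # 1.28 and 1.96
```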
We build the plot manually (rather than using autoplot()
on the forecast object) so we can control the colors — observed data in
dark gray, the forecast mean as a solid orange line, and the two
prediction intervals in distinct, clearly separable shades.
# Assemble the forecast into a data frame
obs_df <- tibble(
time = as.numeric(time(deaths)),
y = as.numeric(deaths)
)
last_time <- as.numeric(tail(time(deaths), 1))
fc_df <- tibble(
time = last_time + (1:length(fc$mean)) / 12,
mean = as.numeric(fc$mean),
lo80 = as.numeric(fc$lower[, 1]), # 80% lower
hi80 = as.numeric(fc$upper[, 1]), # 80% upper
lo95 = as.numeric(fc$lower[, 2]), # 95% lower
hi95 = as.numeric(fc$upper[, 2]) # 95% upper
)
ggplot() +
# observed data
geom_line(data = obs_df, aes(x = time, y = y), color = "grey25", linewidth = 0.5) +
# 95% prediction interval (outer ribbon — light)
geom_ribbon(data = fc_df, aes(x = time, ymin = lo95, ymax = hi95),
fill = "#fdd0a2", alpha = 0.8) +
# 80% prediction interval (inner ribbon — darker)
geom_ribbon(data = fc_df, aes(x = time, ymin = lo80, ymax = hi80),
fill = "#fd8d3c", alpha = 0.7) +
# forecast mean
geom_line(data = fc_df, aes(x = time, y = mean),
color = "#d94801", linewidth = 0.9) +
# intervention marker + end-of-data marker
geom_vline(xintercept = 1983 + 1/12, linetype = "dashed", color = "firebrick") +
geom_vline(xintercept = last_time, linetype = "dotted", color = "grey40") +
annotate("text", x = 1983.5, y = 2700, label = "Seatbelt law",
color = "firebrick", size = 3.3, hjust = 0) +
annotate("text", x = last_time + 0.1, y = 2700, label = "End of data",
color = "grey30", size = 3.3, hjust = 0) +
labs(title = "5-Year Forecast: UK Road Deaths Under Continued Seatbelt Law",
subtitle = "Dark ribbon = 80% PI; light ribbon = 95% PI",
y = "Deaths and Serious Injuries", x = "Year") +
theme_minimal(base_size = 13)
The forecast captures the seasonal pattern (winter peaks, summer troughs) and the reduced baseline level from the seatbelt law. Prediction intervals widen slightly as we forecast further ahead — reflecting accumulated uncertainty from the AR(2) dynamics.
A common use of xreg-based forecasting is
scenario analysis. What if the seatbelt law were
reversed? We re-run the forecast with law = 0 over the
future period and compare.
# Scenario: law reversed (law = 0 everywhere)
xcov_no_law <- xcov_future
xcov_no_law[, "law"] <- 0
fc_no_law <- forecast(fit_final, h = n_ahead, xreg = xcov_no_law)
To keep each scenario readable, we split them into two vertically stacked panels that share the same axes. This way, the observed history is visible in both panels and the forecast trajectory is clear without overlapping ribbons.
# Build plotting frames
obs_df <- tibble(
time = as.numeric(time(deaths)),
mean = as.numeric(deaths)
)
build_forecast_df <- function(fc_obj) {
last_time <- as.numeric(tail(time(deaths), 1))
times_future <- last_time + (1:length(fc_obj$mean)) / 12
tibble(
time = times_future,
mean = as.numeric(fc_obj$mean),
lwr = as.numeric(fc_obj$lower[, 2]), # 95% lower
upr = as.numeric(fc_obj$upper[, 2]) # 95% upper
)
}
fc_kept_df <- build_forecast_df(fc)
fc_reversed_df <- build_forecast_df(fc_no_law)
# Shared y-axis range so the two panels are directly comparable
y_range <- range(c(obs_df$mean, fc_kept_df$lwr, fc_kept_df$upr,
fc_reversed_df$lwr, fc_reversed_df$upr))
# Helper: build one panel for a given scenario
plot_scenario <- function(fc_df, panel_title, color) {
ggplot() +
geom_line(data = obs_df, aes(x = time, y = mean), color = "black") +
geom_ribbon(data = fc_df, aes(x = time, ymin = lwr, ymax = upr),
fill = color, alpha = 0.25) +
geom_line(data = fc_df, aes(x = time, y = mean),
color = color, linewidth = 0.9) +
geom_vline(xintercept = 1983 + 1/12, linetype = "dashed", color = "grey40") +
geom_vline(xintercept = max(obs_df$time), linetype = "dotted", color = "grey50") +
annotate("text", x = 1983.5, y = y_range[2] * 0.98, label = "Seatbelt law",
color = "grey30", size = 3.2, hjust = 0) +
coord_cartesian(ylim = y_range) +
labs(title = panel_title, y = "Deaths", x = NULL) +
theme_minimal(base_size = 12)
}
p_kept <- plot_scenario(fc_kept_df,
"Scenario 1 — Law kept (law = 1)",
"steelblue")
p_reversed <- plot_scenario(fc_reversed_df,
"Scenario 2 — Law reversed (law = 0)",
"firebrick") +
labs(x = "Year")
p_kept / p_reversed +
patchwork::plot_annotation(
title = "5-Year Forecast: UK Road Deaths Under Two Policy Scenarios",
subtitle = "Shaded ribbons = 95% prediction intervals from the AR(2) model"
)
The difference in level between the two panels — compare the forecast ribbon in the top panel to the bottom — is the model’s estimated effect of keeping the law in place. The AR(2) persistence amplifies the law coefficient \(\beta_{\text{law}}\) into a long-run difference of \(\beta_{\text{law}} / (1 - \phi_1 - \phi_2)\), the equilibrium shift we computed in Section 2.5.
Caveat on scenario analysis. These are predictive comparisons under the assumption that the estimated model is correctly specified and that the covariate path is truly exogenous. For causal interpretation of policy counterfactuals — including explicit modeling of the intervention timing and uncertainty around treatment effects — we will use interrupted time series (ITS) designs in Lab 4, which formalize this logic.