Disclaimer

These materials have been revised and updated by successive teaching assistants over recent years.

Revision history:

Packages and Functions Used in This Lab

  • forecast — ARIMA estimation, diagnostics, cross-validation: Arima(), auto.arima(), checkresiduals(), tsCV(), forecast()
  • tseries — stationarity and specification tests: adf.test(), kpss.test(), jarque.bera.test()
  • lmtest — regression diagnostic tests: bgtest() (Breusch-Godfrey for serial correlation)
  • patchwork — composing multi-panel figures: the + and / operators
  • tidyverse — data manipulation and plotting: ggplot(), tibble(), pipes (|>)

Part 1: From Diagnostics to Estimation

In Lab 2, we learned to identify which dynamic components are present in a time series — trend, seasonality, AR, MA — using ACF/PACF plots and the Box-Jenkins identification table. Now we take the next step: estimating the model, interpreting the coefficients, comparing candidate specifications, and validating the chosen model with cross-validation.

The key question shifts from “what kind of process is this?” to “how well does my model capture the dynamics, and how do I know?”

1.1 The applied example: UK road deaths and the seatbelt law

We use a classic dataset from Harvey (1989): monthly counts of deaths and serious injuries in UK road accidents from January 1969 to December 1984. In February 1983, a seatbelt law was introduced — providing a natural policy intervention that we can model.

# Load the UK road deaths dataset
ukdata <- read_csv("data/ukdeaths.csv")
#ukdata <- read_csv(here("CSSS512_labs", "Lab3", "data", "ukdeaths.csv"))
ukdata
## # A tibble: 192 × 5
##    death   law  year    mt month    
##    <dbl> <dbl> <dbl> <dbl> <chr>    
##  1  1687     0  1969     1 January  
##  2  1508     0  1969     2 February 
##  3  1507     0  1969     3 March    
##  4  1385     0  1969     4 April    
##  5  1632     0  1969     5 May      
##  6  1511     0  1969     6 June     
##  7  1559     0  1969     7 July     
##  8  1630     0  1969     8 August   
##  9  1579     0  1969     9 September
## 10  1653     0  1969    10 October  
## # ℹ 182 more rows
# Create time series object: monthly data starting Jan 1969
deaths <- ts(ukdata$death, start = c(1969, 1), frequency = 12)

# The seatbelt law indicator: 0 before Feb 1983, 1 after
law <- ts(ukdata$law, start = c(1969, 1), frequency = 12)
autoplot(deaths) +
  geom_vline(xintercept = 1983 + 1/12, linetype = "dashed", color = "firebrick") +
  annotate("text", x = 1983.5, y = 2600, label = "Seatbelt law\n(Feb 1983)",
           color = "firebrick", size = 3.5) +
  labs(title = "UK Road Deaths, 1969–1984",
       y = "Deaths and Serious Injuries", x = "Year") +
  theme_minimal(base_size = 13)

The series has two visible features: a strong seasonal pattern (deaths peak in the winter months — dark, icy roads) and a notable drop after the seatbelt law. There may also be a slight downward trend over the full period. The substantive question is: how large was the estimated effect of the seatbelt law, once we account for the seasonal and autoregressive dynamics?

1.2 Diagnosing the series (review)

Before estimating, we apply the Box-Jenkins diagnostic workflow from Lab 2. This is a quick review — the focus of this lab is estimation, not identification.

Seasonal-Trend decomposition (STL)

We start with an STL decomposition to visually separate the three components of the series: trend, seasonality, and remainder. This gives us a qualitative sense of which features we need to account for in the model.

# STL decomposition with a periodic seasonal window (12 months)
decomp <- stl(deaths, s.window = "periodic")

# visualize with forecast (ggplot)
autoplot(decomp) +
  labs(title = "STL Decomposition of UK Road Deaths",
       x = "Year") +
  theme_minimal(base_size = 11)

The decomposition confirms our visual impressions:

  • Trend: a gentle decline over the 1970s, with a sharper drop in early 1983 (the seatbelt law).
  • Seasonal: a strong, regular 12-month cycle with winter peaks and summer troughs.
  • Remainder: what is left after removing trend and seasonality — this is where we expect to see the AR/MA dynamics.
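A useful property to keep in mind: STL is an exact additive decomposition, so the three extracted components sum back to the original series. A quick sketch with a built-in monthly series (base R's co2, used here so the block is self-contained):

```r
# STL components live in the time.series matrix of the fitted object
decomp_demo <- stl(co2, s.window = "periodic")   # co2: built-in monthly ts
components <- decomp_demo$time.series            # columns: seasonal, trend, remainder

# The decomposition is exact: seasonal + trend + remainder = original
all.equal(as.numeric(co2), as.numeric(rowSums(components)))
```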

ACF and PACF

Now we examine the ACF and PACF of the raw series to identify any autoregressive or moving average structure:

p1 <- ggAcf(deaths, lag.max = 36) +
  labs(title = "ACF — UK Road Deaths") + theme_minimal(base_size = 11)

p2 <- ggPacf(deaths, lag.max = 36) +
  labs(title = "PACF — UK Road Deaths") + theme_minimal(base_size = 11)

p1 + p2

The ACF shows slow decay (trend or persistence) with seasonal spikes at lags 12, 24, 36. The PACF shows significant spikes at lags 1 and 12. This suggests an AR process with seasonality — both of which we need to account for in our model.

Cleaning the series: detrend and deseasonalize

The raw ACF is dominated by the trend (slow decay) and seasonality (spikes at 12, 24, 36), making it hard to read the AR/MA structure underneath. Using the Lab 2 approach (Approach 2), we remove both features in one step with an lm() that regresses deaths on a linear trend, month dummies, and the seatbelt law indicator. The residuals from this regression are the cleaned series — trend, seasonality, and the law effect all removed — so any remaining autocorrelation must come from the ARMA dynamics we want to identify.

# Build a data frame with a time index, month factor, and the law indicator
clean_df <- ukdata |>
  mutate(
    t     = row_number(),                             # linear time index
    month = factor(month, levels = month.name)        # month factor
  )

# Regress deaths on trend + month dummies + law
# Residuals = deaths with trend, seasonality, and law effect removed
fit_clean <- lm(death ~ t + month + law, data = clean_df)

# check results
summary(fit_clean)
## 
## Call:
## lm(formula = death ~ t + month + law, data = clean_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -333.70 -105.83    4.71   85.02  414.65 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1872.6884    43.2957  43.253  < 2e-16 ***
## t                -1.7649     0.2406  -7.337 7.47e-12 ***
## monthFebruary  -184.0861    53.9429  -3.413 0.000797 ***
## monthMarch     -131.7587    53.9381  -2.443 0.015551 *  
## monthApril     -243.1814    53.9343  -4.509 1.18e-05 ***
## monthMay       -104.1040    53.9317  -1.930 0.055160 .  
## monthJune      -158.0892    53.9300  -2.931 0.003818 ** 
## monthJuly       -80.6993    53.9295  -1.496 0.136324    
## monthAugust     -64.0594    53.9300  -1.188 0.236485    
## monthSeptember   -9.4821    53.9317  -0.176 0.860638    
## monthOctober    131.6578    53.9343   2.441 0.015622 *  
## monthNovember   332.9851    53.9381   6.173 4.40e-09 ***
## monthDecember   451.3750    53.9429   8.368 1.67e-14 ***
## law            -226.3850    41.0372  -5.517 1.20e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 152.4 on 178 degrees of freedom
## Multiple R-squared:  0.7419, Adjusted R-squared:  0.723 
## F-statistic: 39.35 on 13 and 178 DF,  p-value: < 2.2e-16
# Convert residuals back to a ts object
deaths_clean <- ts(residuals(fit_clean), start = c(1969, 1), frequency = 12)
autoplot(deaths_clean) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
  labs(title = "UK Road Deaths After Removing Trend, Seasonality, and Law Effect",
       y = "Residuals", x = "Year") +
  theme_minimal(base_size = 13)

Now we compare the ACF and PACF of the raw and cleaned series side by side:

p_acf_raw   <- ggAcf(deaths, lag.max = 36) +
  labs(title = "ACF — Raw Series") + theme_minimal(base_size = 10)
p_pacf_raw  <- ggPacf(deaths, lag.max = 36) +
  labs(title = "PACF — Raw Series") + theme_minimal(base_size = 10)
p_acf_clean  <- ggAcf(deaths_clean, lag.max = 36) +
  labs(title = "ACF — Cleaned Series") + theme_minimal(base_size = 10)
p_pacf_clean <- ggPacf(deaths_clean, lag.max = 36) +
  labs(title = "PACF — Cleaned Series") + theme_minimal(base_size = 10)

(p_acf_raw + p_pacf_raw) / (p_acf_clean + p_pacf_clean)

The progression is striking: after removing trend, seasonality, and the law effect, the slow decay and the seasonal spikes largely disappear. What remains is the underlying dynamic structure — the ACF tails off geometrically (lag 1 ≈ 0.59, lag 2 ≈ 0.50, lag 3 ≈ 0.41, etc.), and the PACF shows its strongest spike at lag 1 (≈ 0.59), with a smaller spike at lag 2 (≈ 0.22). This pattern points to a low-order AR process — likely AR(1) or AR(2) — as the main remaining dynamic.
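These patterns have exact theoretical counterparts that base R's ARMAacf() can compute. A sketch with illustrative AR(2) coefficients (chosen for the example, not estimated from our data): the ACF tails off geometrically, while the PACF is exactly zero beyond lag 2.

```r
# Theoretical ACF of an AR(2): tails off rather than cutting off
acf_theory <- ARMAacf(ar = c(0.4, 0.2), lag.max = 5)

# Theoretical PACF: spikes at lags 1-2, exactly zero afterwards
pacf_theory <- ARMAacf(ar = c(0.4, 0.2), lag.max = 5, pacf = TRUE)

round(acf_theory, 3)
round(pacf_theory, 3)
```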

We can check this by fitting AR(1) and AR(2) models to the cleaned series with Arima():

# Fit dynamic hypotheses to the cleaned series
h1 <- Arima(deaths_clean, order = c(1, 0, 0))
h2 <- Arima(deaths_clean, order = c(2, 0, 0))

# Strip the new "fc_model" class so texreg falls back to Arima
class(h1) <- "Arima"
class(h2) <- "Arima"

# See how the AR(1) and AR(2) models compare
show_reg(list(h1, h2),
          custom.model.names = c("AR(1)", "AR(2)"),
          custom.coef.map = list(
            "ar1"       = "AR(1)",
            "ar2"       = "AR(2)",
            "intercept" = "Intercept"
          ))
Statistical models
==========================================
                  AR(1)         AR(2)
------------------------------------------
AR(1)              0.54***       0.42***
                  (0.06)        (0.07)
AR(2)                            0.22**
                                (0.07)
Intercept         -1.08         -1.88
                 (19.21)       (23.87)
------------------------------------------
AIC             2400.98       2393.64
BIC             2410.75       2406.67
Log Likelihood  -1197.49      -1192.82
Num. obs.        192           192
==========================================
*** p < 0.001; ** p < 0.01; * p < 0.05

The AR(2) fits best on both AIC and BIC. Now we check its residuals to see whether it captures the dynamics adequately:

# test Ljung-Box on the residuals of best fitting model
Box.test(residuals(h2), lag = 24, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(h2)
## X-squared = 33.298, df = 24, p-value = 0.09795

The Ljung-Box test fails to reject the null of white-noise residuals (p ≈ 0.10), so an AR(2) absorbs the remaining dependence adequately. These diagnostics tell us how to approach the modeling:

  • include the seatbelt law as a covariate,
  • handle seasonality with month dummies, and
  • fit a low-order AR(2) structure to absorb the remaining temporal dependence — all in one joint model via arima() with xreg.

In Part 3 we will compare several candidate specifications to decide between AR(1), AR(2), and ARMA alternatives.


Part 2: Estimating ARMA Models with Covariates

2.1 Maximum likelihood estimation (MLE) of ARMA models

Before running arima(), let’s understand what it computes. ARMA models are estimated by maximum likelihood — the same principle you used in generalized linear models. The logic, familiar from introductory econometrics:

  1. Express the joint probability of the observed data as a function of the parameters (the likelihood).
  2. Take the logarithm to convert the product of probabilities into a sum (easier to optimize).
  3. Substitute the linear predictor \(\mu_t = \boldsymbol{x}_t \boldsymbol{\beta}\) (the “systematic component” — the deterministic part of the model).
  4. Maximize numerically to find the parameter values that make the observed data most likely.

For a regression model with AR(1) errors, the joint density of the data given the parameters is \(f(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{\beta}, \phi_1, \sigma^2)\). Viewed as a function of the parameters (with the data treated as fixed), this is the likelihood \(\mathcal{L}(\boldsymbol{\beta}, \phi_1, \sigma^2;\, \boldsymbol{y}, \boldsymbol{X})\). Its logarithm splits into two distinct contributions:

\[ \begin{aligned} \ell(\boldsymbol{\beta}, \phi_1, \sigma^2;\, \boldsymbol{y}, \boldsymbol{X}) =\; & \underbrace{-\frac{1}{2}\log\!\left(\frac{\sigma^2}{1-\phi_1^2}\right) - \frac{\left(y_1 - \frac{\boldsymbol{x}_1 \boldsymbol{\beta}}{1-\phi_1}\right)^{2}}{2\sigma^2 / (1-\phi_1^2)}}_{\text{initial observation } y_1 \sim \text{ stationary distribution of the AR(1)}} \\[4pt] & \underbrace{- \frac{T-1}{2}\log\sigma^2 - \sum_{t=2}^{T}\frac{(y_t - \boldsymbol{x}_t\boldsymbol{\beta} - \phi_1 y_{t-1})^{2}}{2\sigma^2}}_{\text{normal likelihood for } y_2, \ldots, y_T \text{ given the immediate past (same as OLS)}} \end{aligned} \]

A note on notation: we write the parameters on the left of the semicolon (treating them as arguments of the function) and the data on the right (treating them as fixed). This is the standard frequentist convention for writing a likelihood — distinct from Bayesian posterior notation like \(p(\boldsymbol{\theta} \mid \boldsymbol{y})\), which reads “density of parameters given data.”

The two parts of the likelihood. The first term is what makes this time series MLE different from ordinary regression: it models \(y_1\) as a draw from the stationary (marginal) distribution of the AR(1) process, \(\mathcal{N}\!\left(\boldsymbol{x}_1 \boldsymbol{\beta} / (1-\phi_1),\; \sigma^2 / (1-\phi_1^2)\right)\). The second term is the familiar normal regression likelihood applied to observations \(y_2, \ldots, y_T\), treating their immediate lag as a predictor.

Joint MLE vs. lm with a lagged DV. If you tried to estimate the same model with lm(y ~ lag(y) + x), OLS would drop \(y_1\) (since there is no \(y_0\) to serve as its lag) and estimate \(\boldsymbol{\beta}\) and \(\phi_1\) using only the second term of the likelihood — the conditional likelihood. arima() does more in two senses: (i) it uses the exact likelihood including \(y_1\), so no information is wasted; and (ii) it estimates \(\boldsymbol{\beta}\), \(\phi_1\), and \(\sigma^2\) jointly by maximizing this full objective, accounting for the serial correlation in the errors. OLS, by contrast, assumes independent errors and estimates \(\boldsymbol{\beta}\) first — which, combined with a lagged dependent variable, is biased (see Section 2.4).

Generalization to AR(\(p\)). The structure above extends naturally to higher-order AR processes. For AR(\(p\)), the second term of the log-likelihood uses the first \(p\) observations as initial conditions, and the main sum runs from \(t = p+1\) to \(T\):

\[ -\frac{T-p}{2}\log\sigma^2 - \sum_{t=p+1}^{T}\frac{\left(y_t - \boldsymbol{x}_t\boldsymbol{\beta} - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p}\right)^{2}}{2\sigma^2} \]

The first term (initial conditions) becomes the joint density of \((y_1, \ldots, y_p)\) under the stationary distribution of the AR(\(p\)) — a multivariate normal with a more involved covariance matrix. For AR(2) specifically, the second term expands to the squared residual \(\left(y_t - \boldsymbol{x}_t\boldsymbol{\beta} - \phi_1 y_{t-1} - \phi_2 y_{t-2}\right)^2\) summed from \(t = 3\) onward.
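To make the conditional term concrete, here is a direct implementation for the simplest case — AR(1), no covariates (so \(\boldsymbol{x}_t\boldsymbol{\beta} = 0\)) — evaluated on simulated data. This is a sketch using a grid search, not the numerical optimizer arima() actually uses:

```r
set.seed(1)
y <- as.numeric(arima.sim(model = list(ar = 0.5), n = 500))
n <- length(y)

# Conditional log-likelihood of a zero-mean AR(1):
# -((n-1)/2) * log(sigma2) - sum_{t=2}^{n} (y_t - phi * y_{t-1})^2 / (2 * sigma2)
cond_loglik <- function(phi, sigma2 = 1) {
  resid <- y[-1] - phi * y[-n]   # y_t - phi * y_{t-1}, for t = 2, ..., n
  -((n - 1) / 2) * log(sigma2) - sum(resid^2) / (2 * sigma2)
}

# Grid search: the maximizer should land near the true phi = 0.5
phis <- seq(-0.9, 0.9, by = 0.01)
ll <- sapply(phis, cond_loglik)
phis[which.max(ll)]
```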

2.2 The arima() function with xreg

In Lab 2, we used Arima() to estimate ARMA models on simulated data with no covariates. In applied work, we almost always have covariates — variables we believe explain part of the variation in the outcome. The arima() and Arima() functions accept covariates through the xreg argument.

The model we estimate can be written as:

\[ y_t = \boldsymbol{x}_t \boldsymbol{\beta} + \eta_t, \quad \eta_t = \phi_1 \eta_{t-1} + \cdots + \phi_p \eta_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} \]

In words: this is a regression model where the errors have temporal structure — an ARMA(\(p,q\)) process, rather than the i.i.d. errors assumed by OLS. The arima() function estimates the regression coefficients \(\boldsymbol{\beta}\) and the ARMA parameters \(\phi, \theta\) simultaneously by maximum likelihood.

Anatomy of Arima()

Before estimating our first model, let’s break down the function’s signature. Arima() (from forecast) and base R’s arima() share the same core arguments; Arima() adds a few conveniences for interpretation and forecasting.

Arima(y,
      order    = c(p, d, q),                          # non-seasonal orders
      seasonal = list(order = c(P, D, Q), period = m), # seasonal orders
      xreg          = NULL,      # matrix of covariates
      include.mean  = TRUE,      # estimate an intercept/mean?
      include.drift = FALSE,     # estimate a linear drift (only if d >= 1)?
      method        = c("CSS-ML", "ML", "CSS"))       # estimation method

Key arguments, one by one:

  • y — a univariate time series. Must be a ts object (so R knows the frequency, needed for seasonal models).

  • order = c(p, d, q) — the non-seasonal ARIMA orders. \(p\) is the AR order, \(d\) is the number of differences, \(q\) is the MA order. For now we are working with stationary data, so \(d = 0\). We will discuss \(d \neq 0\) in Lab 4.

  • seasonal = list(order = c(P, D, Q), period = m) — the seasonal ARIMA orders at period \(m\) (e.g., \(m = 12\) for monthly data). Leave it at its default (no seasonal terms) if you are handling seasonality with dummies in xreg — which is what we do in this lab.

  • xreg — a matrix or data frame of covariates. Rows must align with observations in y. If you pass a single covariate, it can be a vector. The coefficients on these covariates are what we interpret substantively — the effect of the seatbelt law, seasonal shifts, etc.

  • include.mean — if TRUE (the default), the model estimates a constant (the intercept in a regression, or the unconditional mean in a pure ARMA). Set to FALSE only if you have reason to believe the series has zero mean.

  • include.drift — adds a linear drift term \(\delta \cdot t\) to the model. This is essentially a shortcut for fitting a linear trend without manually constructing a time index (t = 1, 2, ..., n) and passing it through xreg. It only applies when \(d \geq 1\) (differenced series); for stationary models (\(d = 0\)), use xreg with a time index instead.

  • method — how the likelihood is maximized. "CSS-ML" (default) uses conditional sum-of-squares for starting values and then exact ML; "ML" is exact ML throughout; "CSS" is conditional ML only (faster but slightly less efficient — this is closest to what lm does, as we discussed in Section 2.1).
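The practical difference between the methods is easy to see on a short simulated series — a sketch with base R's arima(), whose method argument Arima() passes through unchanged:

```r
set.seed(512)
y_demo <- arima.sim(model = list(ar = 0.6), n = 60)  # short series exaggerates the gap

fit_css <- arima(y_demo, order = c(1, 0, 0), method = "CSS")  # conditional likelihood only
fit_ml  <- arima(y_demo, order = c(1, 0, 0), method = "ML")   # exact likelihood, includes y_1

# Slightly different AR estimates; they converge as the series grows
c(CSS = unname(coef(fit_css)["ar1"]), ML = unname(coef(fit_ml)["ar1"]))
```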

Useful companion functions

Once you have a fitted Arima object, these extractors work the same way as for lm:

  • coef(fit) — named vector of estimated coefficients
  • vcov(fit) — variance-covariance matrix of the coefficient estimates
  • residuals(fit) — one-step-ahead forecast errors (what you check for white noise)
  • fitted(fit) — in-sample fitted values
  • AIC(fit), BIC(fit), logLik(fit) — goodness-of-fit statistics
  • forecast(fit, h = H, xreg = ...) — out-of-sample forecasts \(H\) periods ahead
  • checkresiduals(fit) — residual diagnostics (time plot + ACF + Ljung-Box)
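A quick demonstration of these extractors on a small simulated fit (base R's arima() here, so the block is self-contained; the same generic calls work on forecast::Arima objects):

```r
set.seed(42)
y_sim <- arima.sim(model = list(ar = 0.5), n = 200)
fit <- arima(y_sim, order = c(1, 0, 0))

coef(fit)               # named vector: ar1, intercept
sqrt(diag(vcov(fit)))   # standard errors from the variance-covariance matrix
AIC(fit)                # goodness of fit, for model comparison
mean(residuals(fit))    # one-step forecast errors; near zero if well specified
```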

A first model: AR(1) with the seatbelt law

Let’s put it together with the simplest useful specification: AR(1) errors plus the seatbelt law as a covariate. No seasonal term, no differencing (\(d = 0\)).

# xreg must be a matrix (or vector) of covariates — here, just the law indicator
fit_ar1 <- Arima(deaths, order = c(1, 0, 0), xreg = law)
fit_ar1
## Series: deaths 
## Regression with ARIMA(1,0,0) errors 
## 
## Coefficients:
##          ar1  intercept       xreg
##       0.6439   1719.193  -377.4542
## s.e.  0.0553     42.078   107.6521
## 
## sigma^2 = 39913:  log likelihood = -1288.26
## AIC=2584.52   AICc=2584.73   BIC=2597.55

Reading the output:

  • ar1: the estimated AR(1) coefficient \(\hat{\phi}_1\). This tells you how strongly last month’s deaths predict this month’s, after accounting for the law.
  • intercept: the estimated mean level of the series when the law is not in effect (\(\text{law} = 0\)).
  • xreg: the estimated effect of the seatbelt law on road deaths, holding the AR dynamics constant (the coefficient is labeled xreg because we passed a bare vector rather than a named matrix). This is the immediate (impact) effect — the change in the expected value of \(y_t\) in the period the law takes effect.
  • Standard errors appear in the s.e. row beneath the coefficients. A coefficient is significant at the 5% level if \(|\text{coef}/\text{SE}| > 2\).
  • sigma^2, log likelihood, AIC, AICc, BIC: the innovation variance \(\hat{\sigma}^2\) and fit statistics (we use these for model comparison in Part 3).
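The |coef/SE| > 2 screen can be applied mechanically with coef() and vcov(). A sketch on simulated data — law_demo is a hypothetical binary covariate built for the example, not the real law variable:

```r
set.seed(7)
n <- 300
x_demo <- rbinom(n, 1, 0.5)                      # hypothetical binary covariate
eta <- arima.sim(model = list(ar = 0.6), n = n)  # AR(1) errors
y_demo <- 10 + 2 * x_demo + eta                  # true covariate effect: +2

fit <- arima(y_demo, order = c(1, 0, 0), xreg = cbind(law_demo = x_demo))

z <- coef(fit) / sqrt(diag(vcov(fit)))   # approximate z-statistics
abs(z) > 2                               # rough 5% significance screen
```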

2.3 Adding seasonality: month dummies

The ACF showed clear seasonal spikes. We handle this with month dummy variables, as we learned in Lab 2 (Approach 2: lm() with dummies). Here, we include the dummies as additional xreg covariates in arima().

# Create month dummy variables (11 dummies; January = reference)
# Set the factor levels explicitly: factor() alone would order the months
# alphabetically, misaligning the dummy columns and their labels
month_fac <- factor(ukdata$month, levels = month.name)
month_dummies <- model.matrix(~ month_fac - 1)[, -1]
colnames(month_dummies) <- month.abb[-1]  # Feb, Mar, ..., Dec

# Combine law + month dummies into a covariate matrix
xcov <- cbind(law = ukdata$law, month_dummies)

head(xcov, 3)
##   law Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1   0   0   0   0   0   0   0   0   0   0   0   0
## 2   0   1   0   0   0   0   0   0   0   0   0   0
## 3   0   0   1   0   0   0   0   0   0   0   0   0
# AR(1) with seatbelt law + seasonal dummies
fit_ar1_seas <- Arima(deaths, order = c(1, 0, 0), xreg = xcov)
fit_ar1_seas
## Series: deaths 
## Regression with ARIMA(1,0,0) errors 
## 
## Coefficients:
##          ar1  intercept        law       Feb       Mar      Apr       May
##       0.6442  1481.2826  -370.0694  172.1115  679.4141  62.2094  238.6466
## s.e.  0.0550    42.9918    70.2727   53.0982   53.0102  44.9728   50.0669
##            Jun     Jul       Aug       Sep       Oct       Nov       Dec
##       157.3445  81.677  113.0147  137.4017  563.2579  364.0130  224.8335
## s.e.   50.2149  45.020   35.1794   35.1859   54.5760   55.0383   54.5765
## 
## sigma^2 = 17618:  log likelihood = -1204
## AIC=2437.99   AICc=2440.72   BIC=2486.85

Notice how the seasonal dummies capture the winter peak (October–December coefficients are positive and large) and the summer trough (June–August are negative). The law coefficient may change because we’re now controlling for the seasonal pattern that was previously confounded with the AR dynamics.

2.4 Why not just use lm()? The Breusch-Godfrey test

A natural question: if this is “just a regression with covariates,” why not use lm() and add a lagged dependent variable? The answer is that OLS with a lagged dependent variable and serially correlated errors produces biased and inconsistent estimates. Let’s see this empirically.

# OLS with a lag of deaths + law + month dummies
ukdata_lagged <- ukdata |>
  mutate(death_lag1 = lag(death, 1)) |>
  na.omit()

fit_ols <- lm(death ~ death_lag1 + law + factor(month), data = ukdata_lagged)

The Breusch-Godfrey test in detail

The Breusch-Godfrey (BG) test (Breusch 1978; Godfrey 1978) is a Lagrange multiplier (LM) test for serial correlation in the residuals of a regression model. It generalizes the classic Durbin-Watson test in two important ways: (i) it remains valid when the regression includes a lagged dependent variable (which DW does not), and (ii) it can detect serial correlation of any order \(p\), not just first-order.

  • \(H_0\): The residuals are serially uncorrelated (up to order \(p\)).
  • \(H_1\): The residuals exhibit serial correlation at one or more lags up to \(p\).

How the test works (auxiliary regression). The BG test is a two-step procedure:

  1. Fit your main regression and save the residuals \(\hat{e}_t\).
  2. Run an auxiliary regression of the residuals on (a) all the original regressors \(\boldsymbol{x}_t\) and (b) their own lags \(\hat{e}_{t-1}, \hat{e}_{t-2}, \ldots, \hat{e}_{t-p}\):

\[ \hat{e}_t = \boldsymbol{x}_t \boldsymbol{\gamma} + \rho_1 \hat{e}_{t-1} + \rho_2 \hat{e}_{t-2} + \cdots + \rho_p \hat{e}_{t-p} + v_t \]

  3. Compute the test statistic \(BG = n \cdot R^2\), where \(n\) is the sample size and \(R^2\) comes from the auxiliary regression. Under \(H_0\), \(BG \sim \chi^2_p\) asymptotically.

Intuition. If the residuals are truly white noise, their lagged values should have no predictive power — the \(\hat{\rho}_k\) coefficients should all be zero, \(R^2\) should be small, and \(BG\) should be small. If lagged residuals do predict the current residual, the auxiliary \(R^2\) grows and \(BG\) rejects the null.
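The recipe above is short enough to run by hand. A sketch on simulated data with AR(1) errors (lmtest::bgtest() automates exactly this \(n \cdot R^2\) computation; like bgtest(), we pad the pre-sample residual with zero):

```r
set.seed(512)
n <- 200
x <- rnorm(n)
e <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))  # serially correlated errors
y <- 1 + 2 * x + e

# Step 1: main regression, save residuals
ehat <- residuals(lm(y ~ x))

# Step 2: auxiliary regression of residuals on x and their first lag (order p = 1)
ehat_lag1 <- c(0, ehat[-n])
aux <- lm(ehat ~ x + ehat_lag1)

# Step 3: BG = n * R^2, compared to chi-squared with p = 1 df
BG <- n * summary(aux)$r.squared
c(BG = BG, p = pchisq(BG, df = 1, lower.tail = FALSE))
```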

Relationship to Ljung-Box (Lab 2). Both tests detect serial correlation, but they operate in different contexts:

  • Ljung-Box tests whether a time series (or ARMA model residuals) is white noise, based directly on sample autocorrelations. It assumes the regressors include no lags of the dependent variable.
  • Breusch-Godfrey is designed for regression residuals, and critically, it remains valid when the regression includes lagged dependent variables — the scenario where Durbin-Watson and (sometimes) Ljung-Box give misleading results.

For OLS with a lagged dependent variable, BG is the right diagnostic. For ARIMA residuals, Ljung-Box (via checkresiduals()) is standard.

Choosing the order \(p\). Test a few lags — typically 1, 2, and the seasonal period (12 for monthly data). If any rejects, the residuals have structure OLS failed to capture.

# Test for first-order serial correlation
bgtest(fit_ols, order = 1)
## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  fit_ols
## LM test = 11.546, df = 1, p-value = 0.000679
# Test for second-order serial correlation
bgtest(fit_ols, order = 2)
## 
##  Breusch-Godfrey test for serial correlation of order up to 2
## 
## data:  fit_ols
## LM test = 11.984, df = 2, p-value = 0.002498
# Test at the seasonal lag (monthly data)
bgtest(fit_ols, order = 12)
## 
##  Breusch-Godfrey test for serial correlation of order up to 12
## 
## data:  fit_ols
## LM test = 30.006, df = 12, p-value = 0.002787

If the Breusch-Godfrey test rejects (small \(p\)-value), OLS residuals are serially correlated — meaning your standard errors are wrong and the coefficient on law is potentially biased. This is why we use arima() with ARMA errors instead: it models the serial correlation explicitly, producing consistent estimates and correct inference.

# Compare the law coefficient from OLS vs. ARIMA
cat("OLS estimate of law effect:", round(coef(fit_ols)["law"], 1), "\n")
## OLS estimate of law effect: -145.3
cat("ARIMA estimate of law effect:", round(coef(fit_ar1_seas)["law"], 1), "\n")
## ARIMA estimate of law effect: -370.1

2.5 Interpreting AR coefficients: impact, long-run, and predicted trajectory

In a regression with AR errors, the coefficient on law captures the immediate (impact) effect of the policy — the change in \(y_t\) in the first period after the law takes effect. But because the errors carry their own temporal dependence, the effect propagates through time before the series settles at a new equilibrium. Two quantities matter:

  1. Impact effect: \(\hat{\beta}_{\text{law}}\) — the change realized in the first period.
  2. Long-run (equilibrium) effect: the cumulative total once all dynamics have played out.

The long-run multiplier, from AR(1) to AR(\(p\))

The long-run effect is derived from the equilibrium mean of the process. For AR(1) errors:

\[ \mathbb{E}(y_t) = \frac{\boldsymbol{x}_t \boldsymbol{\beta}}{1 - \phi_1} \qquad \Longrightarrow \qquad \text{Long-run effect} = \frac{\hat{\beta}}{1 - \hat{\phi}_1} \]

For AR(2) errors, both \(\phi_1\) and \(\phi_2\) contribute to persistence, so the denominator involves both:

\[ \mathbb{E}(y_t) = \frac{\boldsymbol{x}_t \boldsymbol{\beta}}{1 - \phi_1 - \phi_2} \qquad \Longrightarrow \qquad \text{Long-run effect} = \frac{\hat{\beta}}{1 - \hat{\phi}_1 - \hat{\phi}_2} \]

Generalizing to AR(\(p\)):

\[ \mathbb{E}(y_t) = \frac{\boldsymbol{x}_t \boldsymbol{\beta}}{1 - \sum_{k=1}^{p}\phi_k} \qquad \Longrightarrow \qquad \text{Long-run effect} = \frac{\hat{\beta}}{1 - \sum_{k=1}^{p}\hat{\phi}_k} \]

The denominator is the stationarity condition written backwards: \(1 - \sum \phi_k\) must be positive (and the AR polynomial must have all roots outside the unit circle) for the equilibrium to be well defined. As the sum of AR coefficients approaches 1, the long-run multiplier explodes — mirroring the variance explosion we saw in Lab 2 for AR(1) as \(\phi \to 1\).
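Two lines of arithmetic make the explosion visible (illustrative persistence values; 0.74 happens to be close to the total AR(2) persistence we estimate below):

```r
# Long-run multiplier 1 / (1 - sum of AR coefficients)
persistence <- c(0.3, 0.5, 0.74, 0.9, 0.99)
multiplier <- 1 / (1 - persistence)
round(multiplier, 2)   # grows without bound as persistence approaches 1
```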

Fitting an AR(2) with seasonality

Since the cleaned ACF/PACF in Section 1.2 suggested an AR(2) process, let’s estimate that specification with the law and month dummies, and compute both effects.

fit_ar2_seas <- Arima(deaths, order = c(2, 0, 0), xreg = xcov)
fit_ar2_seas
## Series: deaths 
## Regression with ARIMA(2,0,0) errors 
## 
## Coefficients:
##          ar1     ar2  intercept        law       Feb       Mar      Apr
##       0.4696  0.2711  1477.8553  -347.9213  172.0374  679.6911  62.2433
## s.e.  0.0692  0.0694    45.7778    80.5683   45.0545   45.1889  37.9386
##            May       Jun      Jul       Aug       Sep       Oct       Nov
##       240.9785  157.2316  81.6359  113.1874  137.3939  563.2884  363.9678
## s.e.   42.8920   42.8448  37.9681   35.0594   35.0610   46.6042   47.0010
##            Dec
##       224.7363
## s.e.   46.5606
## 
## sigma^2 = 16399:  log likelihood = -1196.65
## AIC=2425.3   AICc=2428.41   BIC=2477.42
# Extract estimates
beta_law <- coef(fit_ar2_seas)["law"]
phi1     <- coef(fit_ar2_seas)["ar1"]
phi2     <- coef(fit_ar2_seas)["ar2"]

# Long-run multiplier for AR(2): 1 / (1 - phi1 - phi2)
lr_multiplier <- 1 / (1 - phi1 - phi2)

# Long-run effect of the law
lr_effect <- beta_law * lr_multiplier

tibble(
  Quantity = c("phi_1", "phi_2", "Sum of AR coefficients",
               "Long-run multiplier: 1 / (1 - phi1 - phi2)",
               "Immediate (impact) effect of law",
               "Long-run effect of law"),
  Value    = c(round(phi1, 3),
               round(phi2, 3),
               round(phi1 + phi2, 3),
               round(lr_multiplier, 3),
               round(beta_law, 1),
               round(lr_effect, 1))
)
## # A tibble: 6 × 2
##   Quantity                                       Value
##   <chr>                                          <dbl>
## 1 phi_1                                          0.47 
## 2 phi_2                                          0.271
## 3 Sum of AR coefficients                         0.741
## 4 Long-run multiplier: 1 / (1 - phi1 - phi2)     3.86 
## 5 Immediate (impact) effect of law            -348.   
## 6 Long-run effect of law                     -1342.

Reading the output substantively:

  • The impact effect is the coefficient on law — the estimated drop in deaths in the first month the law takes effect.
  • The sum of AR coefficients \(\hat{\phi}_1 + \hat{\phi}_2\) summarizes the total persistence of the process. For a stationary AR(2), this must be less than 1.
  • The long-run effect is \(\hat{\beta}_{\text{law}} \cdot [1/(1 - \hat{\phi}_1 - \hat{\phi}_2)]\) — larger in magnitude than the impact effect because each period’s reduction feeds through the AR dynamics.

Visualizing the predicted trajectory

To make the impact-vs-equilibrium distinction concrete, let’s trace out the predicted response of the series to a permanent switch of the law from 0 to 1, starting from its pre-law equilibrium. Because the intervention is a permanent step rather than a one-period shock, this is a step-response trajectory (the one-period-shock analogue is the impulse response).

We simulate forward using the estimated AR(2) recursion:

\[ y_t = \alpha + \beta_{\text{law}} \cdot \text{law}_t + \phi_1 y_{t-1} + \phi_2 y_{t-2} \]

# Extract intercept (mean of the de-meaned process) and build a step-response
alpha_hat <- coef(fit_ar2_seas)["intercept"]

# Periods to simulate (24 months after intervention)
h <- 24

# Pre-law equilibrium: mean when law = 0
y_pre_eq <- alpha_hat / (1 - phi1 - phi2)

# Post-law equilibrium: mean when law = 1
y_post_eq <- (alpha_hat + beta_law) / (1 - phi1 - phi2)

# Simulate y_t from t = 1 to h, starting at pre-law equilibrium
# Law switches on at t = 1 and stays on
y_path <- numeric(h)
y_path[1] <- alpha_hat + beta_law + phi1 * y_pre_eq + phi2 * y_pre_eq
y_path[2] <- alpha_hat + beta_law + phi1 * y_path[1] + phi2 * y_pre_eq
for (t in 3:h) {
  y_path[t] <- alpha_hat + beta_law + phi1 * y_path[t-1] + phi2 * y_path[t-2]
}

# Assemble for plotting: include pre-intervention baseline
step_df <- tibble(
  t = -3:h,
  y = c(rep(y_pre_eq, 4), y_path)
)

ggplot(step_df, aes(x = t, y = y)) +
  geom_hline(yintercept = y_pre_eq, linetype = "dotted", color = "grey40") +
  geom_hline(yintercept = y_post_eq, linetype = "dashed", color = "firebrick") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "steelblue") +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "steelblue", size = 1.5) +
  annotate("text", x = h - 2, y = y_pre_eq + 50,
           label = "Pre-law equilibrium", color = "grey30", size = 3.5, hjust = 1) +
  annotate("text", x = h - 2, y = y_post_eq - 50,
           label = "Post-law equilibrium (long-run effect)",
           color = "firebrick", size = 3.5, hjust = 1) +
  annotate("text", x = 0.3, y = y_pre_eq - 80,
           label = "Law takes effect", color = "steelblue", size = 3.5, hjust = 0) +
  labs(title = "Predicted Response of UK Road Deaths to the Seatbelt Law",
       subtitle = "AR(2) dynamics: series gradually adjusts from pre- to post-law equilibrium",
       x = "Months after intervention", y = "Predicted deaths/month") +
  theme_minimal(base_size = 13)

The plot shows the core idea: the series drops sharply in month 1 (the impact effect), but continues to fall for several more months as the AR dynamics transmit the shock through time. Eventually the series stabilizes at the new, lower equilibrium — the long-run effect.

This distinction between impact and equilibrium shift is one of the most important ideas in applied time series analysis, and it generalizes directly to more complex dynamic responses. In Lab 4 we will build on this to compute counterfactual forecasts with uncertainty (confidence intervals around the predicted trajectory).

How long should we simulate? Choosing the horizon

The horizon \(h\) controls how far forward we trace the dynamic response. A natural question is whether the substantive conclusion depends on this choice. Two things to keep in mind:

  1. The trajectory itself does not change with \(h\). The values at periods \(1, 2, \ldots\) are fully determined by the estimated coefficients and the initial conditions. Extending from \(h = 24\) to \(h = 48\) just shows more of the same curve — the predictions at months 1–24 are identical. So \(h\) affects the visual presentation, not the underlying forecast.

  2. The visible “approach to equilibrium” does depend on \(h\), because the series converges exponentially and never exactly reaches equilibrium in finite time. Choosing \(h\) too short makes the curve look like it is still adjusting; choosing it too long pads the plot with flat, visually redundant periods. For the UK deaths AR(2) model, the series has already completed about 99.5% of the total long-run change by month 24 and is numerically indistinguishable from equilibrium by month 36.
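
The convergence figures quoted in point 2 can be sanity-checked from the dominant decay rate alone: after \(h\) periods, roughly a fraction \(r^h\) of the gap to equilibrium remains. A quick sketch, assuming a dominant decay rate of about 0.806 as estimated for this model (the exact fractions also depend on the subdominant root and initial conditions, so expect small discrepancies):

```r
r <- 0.806                 # dominant decay rate of the fitted AR(2)
h <- c(6, 12, 24, 36)
frac_complete <- 1 - r^h   # approximate share of the long-run change realized by month h
round(frac_complete, 3)
```

By month 24 this gives roughly 99.4% of the total change, consistent with the ~99.5% quoted above; by month 36 the series is within a few hundredths of a percent of equilibrium.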

There are three principled approaches in the applied literature:

Approach 1: Tie \(h\) to the half-life of shocks. From the AR characteristic roots, compute the dominant decay rate and the implied half-life of a shock. Simulate roughly 5–10 half-lives to capture essentially all the dynamic adjustment.

# Characteristic roots of the AR(2) polynomial: 1 - phi1*z - phi2*z^2 = 0
ar_roots   <- polyroot(c(1, -phi1, -phi2))
decay_rate <- max(1 / abs(ar_roots))          # dominant inverse-root modulus
half_life  <- log(0.5) / log(decay_rate)       # shock half-life, in periods
tibble(
  `Dominant decay rate` = round(decay_rate, 3),
  `Shock half-life (periods)` = round(half_life, 2),
  `Suggested h (7 half-lives)` = ceiling(7 * half_life)
)
## # A tibble: 1 × 3
##   `Dominant decay rate` `Shock half-life (periods)` `Suggested h (7 half-lives)`
##                   <dbl>                       <dbl>                        <dbl>
## 1                 0.806                        3.21                           23

Approach 2: Convergence tolerance. Simulate a long horizon, then take \(h\) to be the first period where the gap to the equilibrium drops below a tolerance (e.g., 1% of the total long-run change). This is the most “objective” criterion because you specify exactly what “converged” means.
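
Approach 2 takes only a couple of lines. The sketch below mocks y_path, y_pre_eq, and y_post_eq with a simple exponential approach (illustrative decay rate and equilibria) so it runs standalone; with the step-response simulation above you would use those objects directly:

```r
# Mock step response: exponential approach from 1700 to 1400 at decay rate 0.806
r <- 0.806
y_pre_eq  <- 1700
y_post_eq <- 1400
y_path    <- y_post_eq + (y_pre_eq - y_post_eq) * r^(1:60)

# First period where the remaining gap is under 1% of the total long-run change
tol    <- 0.01 * abs(y_post_eq - y_pre_eq)
h_conv <- which(abs(y_path - y_post_eq) < tol)[1]
h_conv
## [1] 22
```

The tolerance (1% here) is the analyst’s choice; a 0.1% tolerance would push the “converged” horizon out by another ten periods or so.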

Approach 3: Substantively meaningful horizon. Pick \(h\) based on the policy/research question — 1–3 years for monthly policy data, 3–5 years for quarterly macro data, etc. For the seatbelt law, 24 months matches the natural evaluation window governments use for policy interventions.

Practical rule. Take \(h\) as the maximum of (a) 5–10 half-lives and (b) a substantively meaningful horizon. Extending beyond this doesn’t change conclusions — it just adds flat visual real estate. For our UK deaths AR(2), both criteria give \(h \approx 24\) (our choice above), confirming that the 24-month window is adequate. If we were modeling a more persistent series (e.g., \(\sum\phi_k \approx 0.95\)), we would need a much longer horizon to display the full dynamic adjustment.

As a practical check, report results at multiple horizons (e.g., effect at 6, 12, 24 months) rather than a single endpoint — this shows both the speed of adjustment and the long-run total, and makes the simulation’s time-path interpretation transparent to readers (Jordan & Philips, 2018, “dynamac” R package; Aptech impulse-response analysis guide; Hyndman & Athanasopoulos FPP3).

A final note on estimation: OLS and ARIMA coefficient estimates may differ because OLS is biased when a lagged dependent variable is combined with serially correlated errors, while ARIMA estimates all parameters jointly by maximum likelihood.


Part 3: Model Selection

3.1 Goodness-of-fit metrics

After estimating one model, we need to assess how well it fits and compare it against alternatives. Common metrics:

Metric Formula Properties
AIC \(-2 \ln L + 2k\) Penalizes complexity; lower = better
BIC \(-2 \ln L + k \ln n\) Heavier penalty than AIC; favors parsimony
MSE \(\frac{1}{n}\sum \hat{e}_t^2\) Mean squared error; scale-dependent
RMSE \(\sqrt{\text{MSE}}\) Same units as \(y\); easier to interpret
MAE \(\frac{1}{n}\sum \lvert\hat{e}_t\rvert\) Less sensitive to outliers than RMSE

What AIC and BIC actually are

Both AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are likelihood-based model selection criteria. They are computed directly from the maximized log-likelihood \(\ln L\) (the same quantity arima() optimizes — see Section 2.1) plus a penalty that grows with the number of estimated parameters \(k\):

\[ \text{AIC} = -2 \ln L + 2k \qquad\qquad \text{BIC} = -2 \ln L + k \ln n \]

  • The \(-2 \ln L\) term rewards fit: a better-fitting model has higher \(\ln L\), so a smaller \(-2 \ln L\). If you only looked at this term, you would always prefer the most complex model — it fits training data best.
  • The \(+2k\) (AIC) or \(+k \ln n\) (BIC) term is the complexity penalty: every extra parameter you add costs you some AIC/BIC. The criterion answers the question “does the improvement in fit justify the extra parameter?”

Lower values are better. What matters is differences between models, not absolute levels.

How AIC/BIC relate to MSE

For a normal linear regression (or any model with Gaussian i.i.d. errors fit by MLE), the maximum likelihood estimate of the residual variance is \(\hat{\sigma}^2 = \text{MSE}\). Substituting this into the Gaussian log-likelihood and simplifying gives:

\[ \text{AIC} = n \ln(\text{MSE}) + 2k + C \]

where \(C = n + n \ln(2\pi)\) is a constant that does not depend on the model and therefore cancels when comparing AIC values across specifications. So for model comparison on the same data, AIC is equivalent to \(n \ln(\text{MSE}) + 2k\) (up to a constant) — a log-transformed, complexity-penalized MSE. Two intuitions follow:

  1. If you compare two models with the same \(k\) (same number of parameters), ranking by AIC is equivalent to ranking by MSE/RMSE.
  2. When models have different \(k\), AIC and MSE can disagree: AIC may prefer a smaller model because the extra parameters in the larger one don’t reduce MSE enough to justify their cost. This is why we use AIC/BIC instead of raw MSE for model selection — raw MSE always favors the most complex model, which overfits.
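
This equivalence is easy to verify numerically for a Gaussian model. The sketch below uses a plain lm() fit on the built-in mtcars data (any Gaussian MLE fit would do); note that \(k\) must count the error variance as an estimated parameter:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)
k   <- length(coef(fit)) + 1    # regression coefficients + error variance
mse <- mean(residuals(fit)^2)   # MLE of sigma^2 (divides by n, not n - k)

aic_by_hand <- n * log(mse) + 2 * k + n + n * log(2 * pi)
all.equal(aic_by_hand, AIC(fit))
## [1] TRUE
```

The constant \(n + n\ln(2\pi)\) is what makes the two agree exactly; it is the same for every model fit to the same \(n\) observations, which is why it drops out of model comparisons.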

If the model was fit by MLE, use AIC/BIC (not MSE)

When you fit a model with Arima() (maximum likelihood), AIC and BIC are the natural criteria — they’re computed from the same quantity the estimator already maximized, and they incorporate the full likelihood (not just the sum of squared residuals). MSE/RMSE are useful as a reality check (they tell you how big a typical residual is in the original units) and for comparing models on an out-of-sample test set, but they should not drive specification choice for MLE-fit models.

AIC vs. BIC: which to pick?

Both balance fit against complexity, but they embody different model-selection philosophies and can disagree. The trade-off:

Property AIC BIC
Penalty per parameter \(2\) \(\ln n\) (heavier for \(n \geq 8\), since \(\ln n > 2\) once \(n > e^2 \approx 7.4\))
Asymptotic goal Best predictive model Identify the true model (if it’s in the candidate set)
Tends to select Slightly larger models Smaller, more parsimonious models
Consistent? No (may asymptotically select too-large models) Yes (converges to the true model as \(n \to \infty\))
Efficient for prediction? Yes (minimizes out-of-sample loss asymptotically) No (may under-select for prediction)
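
For our data the per-parameter penalties are concrete: with \(n = 192\) monthly observations, each extra parameter costs about 5.3 BIC points versus 2 AIC points, so BIC demands a substantially larger likelihood gain before admitting a new term:

```r
n <- 192                                   # UK deaths sample size
c(aic_penalty = 2, bic_penalty = log(n))   # BIC penalty is ~2.6x larger here
```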

Rules of thumb:

  • For forecasting (minimizing prediction error): AIC is usually preferred.
  • For inference (finding a parsimonious, interpretable model): BIC is usually preferred.
  • With small samples: use AICc (corrected AIC) — it adds a finite-sample correction that prevents AIC from overfitting. Arima() reports AICc automatically.
  • When AIC and BIC disagree: prefer the more parsimonious model (the BIC choice) unless you have strong reasons to believe the extra parameter is substantively important — Occam’s razor.
  • Do not compare AIC/BIC across models fit to different datasets (e.g., different sample sizes after differencing), or across models whose likelihoods are on different scales (e.g., an ARIMA fit to \(y\) vs. a model fit to Box-Cox-transformed \(y\)).

# Extract GOF metrics for a single model
resid_ar1 <- residuals(fit_ar1_seas)

tibble(
  Metric = c("AIC", "BIC", "Log-Likelihood", "MSE", "RMSE", "MAE"),
  Value = c(
    AIC(fit_ar1_seas),
    BIC(fit_ar1_seas),
    fit_ar1_seas$loglik,
    mean(resid_ar1^2),
    sqrt(mean(resid_ar1^2)),
    mean(abs(resid_ar1))
  )
) |> mutate(Value = round(Value, 2))
## # A tibble: 6 × 2
##   Metric          Value
##   <chr>           <dbl>
## 1 AIC             2438.
## 2 BIC             2487.
## 3 Log-Likelihood -1204 
## 4 MSE            16333.
## 5 RMSE             128.
## 6 MAE              102.

3.2 Comparing candidate models

We fit several candidate specifications and compare them. The candidates span different dynamic structures, all with the law + seasonal dummies as covariates.

# Candidate models — all include law + month dummies via xreg
models <- list(
  "AR(1)"      = Arima(deaths, order = c(1, 0, 0), xreg = xcov),
  "AR(2)"      = Arima(deaths, order = c(2, 0, 0), xreg = xcov),
  "MA(1)"      = Arima(deaths, order = c(0, 0, 1), xreg = xcov),
  "ARMA(1,1)"  = Arima(deaths, order = c(1, 0, 1), xreg = xcov),
  "ARMA(2,1)"  = Arima(deaths, order = c(2, 0, 1), xreg = xcov),
  "ARMA(1,2)"  = Arima(deaths, order = c(1, 0, 2), xreg = xcov)
)

# Build comparison table
comparison <- tibble(
  Model = names(models),
  AIC   = map_dbl(models, AIC),
  BIC   = map_dbl(models, BIC),
  RMSE  = map_dbl(models, ~ sqrt(mean(residuals(.x)^2))),
  LB_pvalue = map_dbl(models, ~ Box.test(residuals(.x), lag = 24,
                                          type = "Ljung-Box")$p.value)
) |>
  arrange(AIC) |>
  mutate(across(where(is.numeric), ~ round(.x, 2)),
         delta_AIC = AIC - min(AIC))

comparison
## # A tibble: 6 × 6
##   Model       AIC   BIC  RMSE LB_pvalue delta_AIC
##   <chr>     <dbl> <dbl> <dbl>     <dbl>     <dbl>
## 1 ARMA(2,1) 2417. 2472.  120.      0.05     0    
## 2 ARMA(1,2) 2418. 2473.  120.      0.05     0.970
## 3 ARMA(1,1) 2418. 2470.  121.      0.05     1.71 
## 4 AR(2)     2425. 2477.  123.      0.01     8.64 
## 5 AR(1)     2438. 2487.  128.      0       21.3  
## 6 MA(1)     2482. 2531.  143.      0       65.3

How to read this table:

  • AIC/BIC: Lower is better. The model with the lowest AIC is preferred.
  • RMSE: Lower means better in-sample fit, but doesn’t penalize complexity.
  • LB_pvalue: Ljung-Box \(p\)-value on the residuals. We want \(p > 0.05\) (residuals are white noise). Models that fail this test are inadequately specified.
  • delta_AIC: The AIC gap from the best model. Models within \(\Delta\)AIC \(< 2\) are roughly equivalent.

comparison |>
  mutate(Model = fct_reorder(Model, AIC)) |>
  ggplot(aes(x = Model, y = delta_AIC)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  geom_hline(yintercept = 2, linetype = "dashed", color = "firebrick") +
  labs(title = "Model Comparison: \u0394AIC from Best Model",
       y = "\u0394AIC", x = NULL) +
  coord_flip() +
  theme_minimal(base_size = 12)

Over-fitting and under-fitting check

Take the best model and verify it’s adequately specified: check that its residuals are white noise, and compare it against neighboring specifications one order above and below (several of which already appear in the comparison table).

# Identify the best model by AIC
best_name <- comparison$Model[1]
cat("Best model by AIC:", best_name, "\n\n")
## Best model by AIC: ARMA(2,1)
# Check residuals of the best model
best_fit <- models[[best_name]]
checkresiduals(best_fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(2,0,1) errors
## Q* = 36.164, df = 21, p-value = 0.02096
## 
## Model df: 3.   Total lags used: 24

Note that here the Ljung-Box \(p\)-value (0.021) falls below 0.05 even for the AIC-best model: some residual autocorrelation remains at longer lags. (The comparison table used Box.test() without the fitdf adjustment for estimated parameters, which is why its \(p\)-value differs slightly.) A marginal failure like this is a prompt to consider richer dynamics, not necessarily to discard the model.

3.3 What auto.arima() picks

As a sanity check, let’s see what auto.arima() selects:

fit_auto <- auto.arima(deaths,
                       xreg = xcov,
                       stationary = TRUE,
                       seasonal = FALSE,   # we handle seasonality via dummies
                       stepwise = FALSE,    # thorough search
                       approximation = FALSE)

fit_auto
## Series: deaths 
## Regression with ARIMA(2,0,1) errors 
## 
## Coefficients:
##          ar1      ar2      ma1  intercept        law       Feb       Mar
##       1.1899  -0.2157  -0.7950  1469.1318  -321.2201  171.8751  679.9702
## s.e.  0.1071   0.0976   0.0724    68.9326    78.8301   41.0435   41.1674
##           Apr       May       Jun      Jul       Aug       Sep       Oct
##       62.5233  241.9387  157.0544  81.4898  113.1761  137.2673  563.4235
## s.e.  39.3198   40.6465   40.5352  39.3223   35.1482   35.1484   41.3888
##            Nov       Dec
##       363.9178  224.6292
## s.e.   41.4152   41.3042
## 
## sigma^2 = 15582:  log likelihood = -1191.33
## AIC=2416.66   AICc=2420.18   BIC=2472.04

Compare with our manual selection: do they agree? If auto.arima() picks a different specification, investigate why — it may have found a better fit, or it may have been misled by the stepwise algorithm.


Part 4: Cross-Validation for Time Series

4.1 Why cross-validation?

In-sample metrics (AIC, RMSE) tell us how well the model fits the data it was trained on. But what we really care about is out-of-sample performance — how well the model predicts data it has not seen. Cross-validation (CV) provides a more honest assessment of model quality.

The key difference from standard CV: In cross-sectional data, we can randomly split the data into training and test sets. In time series, observations are ordered — we cannot predict the past from the future. Instead, we use a rolling forecast window: the training set always comes before the test set.

4.2 Expanding and sliding windows

Two common schemes:

  • Expanding window: The training set grows as we move forward. All observations before the test window are included.
  • Sliding (rolling) window: The training set has a fixed length. As the test window moves forward, old observations are dropped from the back of the training set.

The expanding window uses all available history; the sliding window uses a fixed-length lookback. Expanding is more common in practice; sliding is useful when the process may change over time (non-stationarity).
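
The index bookkeeping is the only real difference between the two schemes. A toy sketch (10 observations, minimum training length 5, one-step-ahead test; all values illustrative):

```r
n_obs <- 10; min_train <- 5

for (origin in min_train:(n_obs - 1)) {
  expanding <- 1:origin                          # grows: all history up to the origin
  sliding   <- (origin - min_train + 1):origin   # fixed length: last min_train obs
  cat(sprintf("origin %d | expanding: 1-%d | sliding: %d-%d | test: %d\n",
              origin, max(expanding), min(sliding), max(sliding), origin + 1))
}
```

The sliding window always has length min_train, while the expanding window grows by one observation per origin. The manual loop in Section 4.3 implements the sliding scheme with a window of 170.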

4.3 Implementing CV from scratch

Let’s implement a sliding-window CV for our UK deaths data, comparing three candidate models: AR(1), AR(2), and ARMA(1,1) — all with the seatbelt law and month dummies as covariates.

Before wrapping everything in a loop, let’s walk through what happens in a single iteration. Once the logic is clear, the loop is just “do this many times and store the results.”

Step by step: one CV iteration

We set the hyperparameters first:

min_window <- 170   # training window length (ensures seatbelt law is inside training)
h_forward  <- 12    # forecast horizon: 12 months ahead
n_obs      <- length(deaths)
n_iter     <- n_obs - min_window - h_forward + 1   # number of CV iterations
n_iter
## [1] 11

Each iteration \(i\) does four things: (1) define the train/test windows, (2) subset the data, (3) fit the model, (4) forecast and store errors. Let’s run one iteration for \(i = 1\) and inspect each step.

Step 1: Define the train/test windows for iteration \(i = 1\).

i <- 1

# Training window: observations i through i + min_window - 1
train_idx <- i:(i + min_window - 1)

# Test window: the 12 observations immediately after the training window
test_idx  <- (i + min_window):(i + min_window + h_forward - 1)

range(train_idx)   # 1 to 170 (Jan 1969 – Feb 1983)
## [1]   1 170
range(test_idx)    # 171 to 182 (Mar 1983 – Feb 1984)
## [1] 171 182

The training set covers observations 1–170 (Jan 1969 through Feb 1983, including the seatbelt law introduction); the test set is the next 12 months.

Step 2: Subset the outcome and covariate matrix.

y_train <- deaths[train_idx]
y_test  <- deaths[test_idx]
x_train <- xcov[train_idx, ]
x_test  <- xcov[test_idx, ]

length(y_train)   # 170 training observations
## [1] 170
length(y_test)    # 12 test observations
## [1] 12

Step 3: Fit the three candidate models on the training data.

fit1 <- Arima(y_train, order = c(1, 0, 0), xreg = x_train)   # AR(1)
fit2 <- Arima(y_train, order = c(2, 0, 0), xreg = x_train)   # AR(2)
fit3 <- Arima(y_train, order = c(1, 0, 1), xreg = x_train)   # ARMA(1,1)

Step 4: Forecast 12 months ahead and compute absolute errors.

Each forecast() call uses the fitted model to generate predictions for the test period.

The forecast() function (from the forecast package) takes a fitted model object and returns point forecasts plus prediction intervals. The signature for ARIMA-type models is:

forecast(object,
         h = NULL,           # number of periods to forecast
         xreg = NULL,        # future values of covariates (if the model uses xreg)
         level = c(80, 95),  # confidence levels for prediction intervals
         ...)

Key arguments:

  • object — the fitted model (an Arima object in our case).
  • h — number of future periods to predict. If the model has xreg, this is usually omitted (inferred from the number of rows in xreg), but being explicit is clearer.
  • xreg — a matrix of future covariate values covering the forecast horizon. Must have the same columns (same names, same order) as the xreg used at fit time. Without this, forecast() does not know what the regressors will be in the future.
  • level — the confidence levels for prediction intervals. The default c(80, 95) returns 80% and 95% intervals.

The object forecast() returns is a list with several components. The ones we use most often:

  • $mean — point forecasts (a ts vector of length h).
  • $lower, $upper — matrices of lower/upper bounds, one column per confidence level.
  • $x — the original training series (useful for plotting).

# Generate 12-month-ahead forecasts for each model
# xreg = x_test supplies the future values of law + month dummies
fc1 <- forecast(fit1, h = h_forward, xreg = x_test)
fc2 <- forecast(fit2, h = h_forward, xreg = x_test)
fc3 <- forecast(fit3, h = h_forward, xreg = x_test)

# fc1$mean is the vector of 12 point forecasts; compare to y_test
abs_err_ar1  <- abs(as.numeric(fc1$mean) - y_test)
abs_err_ar2  <- abs(as.numeric(fc2$mean) - y_test)
abs_err_arma <- abs(as.numeric(fc3$mean) - y_test)

# Inspect: forecast errors at each horizon (months 1, 2, ..., 12 ahead)
tibble(
  Horizon     = 1:h_forward,
  `AR(1) err`   = round(abs_err_ar1, 1),
  `AR(2) err`   = round(abs_err_ar2, 1),
  `ARMA(1,1) err` = round(abs_err_arma, 1)
)
## # A tibble: 12 × 4
##    Horizon `AR(1) err` `AR(2) err` `ARMA(1,1) err`
##      <int>       <dbl>       <dbl>           <dbl>
##  1       1        68.5       120.            115. 
##  2       2        96         151.            172. 
##  3       3         1.6        64.8            94  
##  4       4       130.         67.8            28  
##  5       5       119.         59.1            12.6
##  6       6       175.        120              68.1
##  7       7        80         131.            186. 
##  8       8         8.4        38              96.1
##  9       9       231.        189.            129  
## 10      10       331.        293.            232. 
## 11      11        25.8         8.6            68.4
## 12      12        40.2         9.8            48.4

That’s it for one iteration. Each row tells us “how far off was each model’s forecast at each horizon?” We store these 12-element vectors. For CV we only need the point forecasts ($mean), but the same object also carries prediction intervals — which we will use in Part 5 when we forecast beyond the end of the data.

The loop does exactly this for \(i = 1, 2, \ldots, n\_iter\): it slides the training window forward by one month, re-fits all three models, and stores the absolute errors as a row of a matrix. After all iterations, we average across rows to get the mean absolute error at each horizon, and across horizons to get the overall out-of-sample MAE.

The full loop

# Pre-allocate storage: one row per iteration, one column per horizon
mae_ar1  <- matrix(NA, nrow = n_iter, ncol = h_forward)
mae_ar2  <- matrix(NA, nrow = n_iter, ncol = h_forward)
mae_arma <- matrix(NA, nrow = n_iter, ncol = h_forward)

for (i in 1:n_iter) {
  # Step 1: train/test windows
  train_idx <- i:(i + min_window - 1)
  test_idx  <- (i + min_window):(i + min_window + h_forward - 1)
  if (max(test_idx) > n_obs) next

  # Step 2: subset
  y_train <- deaths[train_idx]; y_test <- deaths[test_idx]
  x_train <- xcov[train_idx, ]; x_test <- xcov[test_idx, ]

  # Step 3: fit three candidates
  fit1 <- Arima(y_train, order = c(1, 0, 0), xreg = x_train)
  fit2 <- Arima(y_train, order = c(2, 0, 0), xreg = x_train)
  fit3 <- Arima(y_train, order = c(1, 0, 1), xreg = x_train)

  # Step 4: forecast and store absolute errors
  fc1 <- forecast(fit1, h = h_forward, xreg = x_test)
  fc2 <- forecast(fit2, h = h_forward, xreg = x_test)
  fc3 <- forecast(fit3, h = h_forward, xreg = x_test)

  mae_ar1[i, ]  <- abs(as.numeric(fc1$mean) - y_test)
  mae_ar2[i, ]  <- abs(as.numeric(fc2$mean) - y_test)
  mae_arma[i, ] <- abs(as.numeric(fc3$mean) - y_test)
}

Visualizing CV results

Average the absolute errors across iterations at each horizon, then compare the three models:

cv_results <- tibble(
  Horizon     = 1:h_forward,
  `AR(1)`     = colMeans(mae_ar1,  na.rm = TRUE),
  `AR(2)`     = colMeans(mae_ar2,  na.rm = TRUE),
  `ARMA(1,1)` = colMeans(mae_arma, na.rm = TRUE)
) |>
  pivot_longer(-Horizon, names_to = "Model", values_to = "MAE")

ggplot(cv_results, aes(x = Horizon, y = MAE, color = Model)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(title = "Cross-Validation: Mean Absolute Error by Forecast Horizon",
       x = "Periods Ahead", y = "MAE (deaths)", color = NULL) +
  theme_minimal(base_size = 12)

# Overall average MAE across all horizons
tibble(
  Model = c("AR(1)", "AR(2)", "ARMA(1,1)"),
  `Avg MAE` = c(mean(mae_ar1,  na.rm = TRUE),
                mean(mae_ar2,  na.rm = TRUE),
                mean(mae_arma, na.rm = TRUE))
) |> mutate(`Avg MAE` = round(`Avg MAE`, 1)) |>
  arrange(`Avg MAE`)
## # A tibble: 3 × 2
##   Model     `Avg MAE`
##   <chr>         <dbl>
## 1 ARMA(1,1)      91.8
## 2 AR(1)          96.3
## 3 AR(2)          98.3

Two things to read from these outputs:

  • Horizon-by-horizon plot: where is each model strong or weak? Short-horizon forecasts are usually very close across models; long-horizon differences reflect the quality of the dynamic structure.
  • Overall MAE table: a single summary for ranking. The model with the lowest average MAE is the best out-of-sample performer.

If the CV winner agrees with the in-sample AIC winner (Part 3), you have a robust choice. If they disagree, investigate why — often one metric is favoring a slightly more (or less) complex specification.

4.4 Modern alternative: forecast::tsCV()

The forecast package provides tsCV() to automate time series cross-validation. It handles the rolling window logic internally — you supply a function that fits and forecasts, and tsCV() returns the forecast errors for each origin and horizon.

A broader candidate set

Let’s scale up from three models to a fuller grid — the same set used in the old CSSS 512 lab — so we can see how both the dynamic structure (AR/MA orders) and the covariate specification (how we handle seasonality) affect out-of-sample performance.

The candidates differ along two dimensions:

Model Dynamics Covariates
1a AR(1) law only
1b AR(1) law + Q4 dummy (Oct–Dec)
1c AR(1) law + 11 month dummies
1d AR(1) law + selected months (Jan, Sep, Oct, Nov, Dec)
1e AR(1) with seasonal AR(1)\(_{12}\) law only
2a AR(2) law + 11 month dummies
2b MA(1) law + 11 month dummies
2c ARMA(1,1) law + 11 month dummies
2d ARMA(2,1) law + 11 month dummies
2e ARMA(1,2) law + 11 month dummies
2f ARMA(2,2) law + 11 month dummies

Build the covariate matrices

# 11 month dummies (January = reference)
# Order the factor levels explicitly: otherwise factor() sorts month names
# alphabetically and [, -1] would drop April, not January
month_dummies <- model.matrix(~ factor(ukdata$month, levels = month.name) - 1)[, -1]
colnames(month_dummies) <- month.abb[-1]

# Q4 dummy = 1 for October–December
q4_dummy <- as.numeric(ukdata$month %in% c("October", "November", "December"))

# Selected months: separate dummies for Jan, Sep, Oct, Nov, Dec
# (all remaining months form the baseline)
jan_dummy <- as.numeric(ukdata$month == "January")

# Covariate matrices for each specification
xcov_1a <- cbind(law = ukdata$law)
xcov_1b <- cbind(law = ukdata$law, q4 = q4_dummy)
xcov_1c <- cbind(law = ukdata$law, month_dummies)                       # all 11 months
xcov_1d <- cbind(law = ukdata$law, jan = jan_dummy,
                 month_dummies[, c("Sep", "Oct", "Nov", "Dec")])        # selected months
xcov_1e <- cbind(law = ukdata$law)                                       # seasonal AR handles seasonality

Define forecast functions for each candidate

tsCV() expects a function with signature function(y, h, xreg, newxreg) — given a training series y and future covariates newxreg, return a forecast object. We write one function per candidate, keeping each as explicit as possible:

# Family 1: AR(1) dynamics with different covariate specifications
fc_1a <- function(y, h, xreg, newxreg) {
  fit <- Arima(y, order = c(1, 0, 0), xreg = xreg)
  forecast(fit, h = h, xreg = newxreg)
}

fc_1b <- function(y, h, xreg, newxreg) {
  fit <- Arima(y, order = c(1, 0, 0), xreg = xreg)
  forecast(fit, h = h, xreg = newxreg)
}

fc_1c <- function(y, h, xreg, newxreg) {
  fit <- Arima(y, order = c(1, 0, 0), xreg = xreg)
  forecast(fit, h = h, xreg = newxreg)
}

fc_1d <- function(y, h, xreg, newxreg) {
  fit <- Arima(y, order = c(1, 0, 0), xreg = xreg)
  forecast(fit, h = h, xreg = newxreg)
}

# Model 1e adds a seasonal AR(1) at lag 12 on top of the non-seasonal AR(1)
fc_1e <- function(y, h, xreg, newxreg) {
  fit <- Arima(y, order = c(1, 0, 0),
               seasonal = list(order = c(1, 0, 0), period = 12),
               xreg = xreg)
  forecast(fit, h = h, xreg = newxreg)
}

# Family 2: higher-order ARMA dynamics, all with the 11 month-dummy covariates
fc_2a <- function(y, h, xreg, newxreg) {              # AR(2)
  fit <- Arima(y, order = c(2, 0, 0), xreg = xreg)
  forecast(fit, h = h, xreg = newxreg)
}

fc_2b <- function(y, h, xreg, newxreg) {              # MA(1)
  fit <- Arima(y, order = c(0, 0, 1), xreg = xreg)
  forecast(fit, h = h, xreg = newxreg)
}

fc_2c <- function(y, h, xreg, newxreg) {              # ARMA(1,1)
  fit <- Arima(y, order = c(1, 0, 1), xreg = xreg)
  forecast(fit, h = h, xreg = newxreg)
}

fc_2d <- function(y, h, xreg, newxreg) {              # ARMA(2,1)
  fit <- Arima(y, order = c(2, 0, 1), xreg = xreg)
  forecast(fit, h = h, xreg = newxreg)
}

fc_2e <- function(y, h, xreg, newxreg) {              # ARMA(1,2)
  fit <- Arima(y, order = c(1, 0, 2), xreg = xreg)
  forecast(fit, h = h, xreg = newxreg)
}

fc_2f <- function(y, h, xreg, newxreg) {              # ARMA(2,2)
  fit <- Arima(y, order = c(2, 0, 2), xreg = xreg)
  forecast(fit, h = h, xreg = newxreg)
}

Each function does the same two steps: fit an Arima() model on the training chunk supplied by tsCV(), then forecast() forward. Within family 1 the functions are identical; those specifications differ only in the covariate matrix passed to tsCV() (plus, for fc_1e, the seasonal argument). The family 2 functions differ in the order argument.

Run tsCV() on all 11 candidates

h = 12 means tsCV() produces forecasts at horizons 1, 2, …, 12 months ahead at each origin. The result is a matrix with rows = origins and columns = horizons; we average columns to get the mean absolute error at each horizon.

# Sliding window of 170 training observations; forecast 12 months ahead
W <- 170    # training window length
H <- 12     # forecast horizon

# For each model, tsCV() returns a (T x H) matrix of forecast errors:
#   rows    = forecast origins (one per period moved forward)
#   columns = horizons 1, 2, ..., 12 months ahead
# We take the absolute value and average down each column to get MAE at each horizon.

# Family 1 models (each uses its own covariate matrix)
err_1a <- tsCV(deaths, forecastfunction = fc_1a, h = H, window = W, xreg = xcov_1a)
err_1b <- tsCV(deaths, forecastfunction = fc_1b, h = H, window = W, xreg = xcov_1b)
err_1c <- tsCV(deaths, forecastfunction = fc_1c, h = H, window = W, xreg = xcov_1c)
err_1d <- tsCV(deaths, forecastfunction = fc_1d, h = H, window = W, xreg = xcov_1d)
err_1e <- tsCV(deaths, forecastfunction = fc_1e, h = H, window = W, xreg = xcov_1e)

# Family 2 models (all share the full month-dummies covariate matrix xcov_1c)
err_2a <- tsCV(deaths, forecastfunction = fc_2a, h = H, window = W, xreg = xcov_1c)
err_2b <- tsCV(deaths, forecastfunction = fc_2b, h = H, window = W, xreg = xcov_1c)
err_2c <- tsCV(deaths, forecastfunction = fc_2c, h = H, window = W, xreg = xcov_1c)
err_2d <- tsCV(deaths, forecastfunction = fc_2d, h = H, window = W, xreg = xcov_1c)
err_2e <- tsCV(deaths, forecastfunction = fc_2e, h = H, window = W, xreg = xcov_1c)
err_2f <- tsCV(deaths, forecastfunction = fc_2f, h = H, window = W, xreg = xcov_1c)

# Average absolute errors across origins → MAE at each horizon (length-H vector)
cv_1a <- colMeans(abs(err_1a), na.rm = TRUE)
cv_1b <- colMeans(abs(err_1b), na.rm = TRUE)
cv_1c <- colMeans(abs(err_1c), na.rm = TRUE)
cv_1d <- colMeans(abs(err_1d), na.rm = TRUE)
cv_1e <- colMeans(abs(err_1e), na.rm = TRUE)
cv_2a <- colMeans(abs(err_2a), na.rm = TRUE)
cv_2b <- colMeans(abs(err_2b), na.rm = TRUE)
cv_2c <- colMeans(abs(err_2c), na.rm = TRUE)
cv_2d <- colMeans(abs(err_2d), na.rm = TRUE)
cv_2e <- colMeans(abs(err_2e), na.rm = TRUE)
cv_2f <- colMeans(abs(err_2f), na.rm = TRUE)

Visualizing the full CV grid

We compare all 11 models on one plot. To make the ranking visually intuitive, we order the models by their average MAE across horizons (best → worst) and assign colors from a viridis gradient, so the best-performing models sit at one end of the gradient (dark purple) and the worst at the other (yellow). This makes it easy to spot the winners at a glance: they form the darker cluster along the bottom of the plot.

# Assemble a matrix: rows = horizons, columns = models
cv_matrix <- cbind(cv_1a, cv_1b, cv_1c, cv_1d, cv_1e,
                   cv_2a, cv_2b, cv_2c, cv_2d, cv_2e, cv_2f)
colnames(cv_matrix) <- c("1a", "1b", "1c", "1d", "1e",
                         "2a", "2b", "2c", "2d", "2e", "2f")

# Compute the average MAE per model and rank models from best (lowest) to worst
model_rank <- names(sort(colMeans(cv_matrix, na.rm = TRUE)))

# Viridis palette ordered from best → worst (dark = best, light = worst)
model_colors <- setNames(viridisLite::viridis(length(model_rank), end = 0.9),
                         model_rank)

# Tidy for ggplot, with Model as a factor ordered by rank
cv_long <- as_tibble(cv_matrix) |>
  mutate(Horizon = 1:H) |>
  pivot_longer(-Horizon, names_to = "Model", values_to = "MAE") |>
  mutate(Model = factor(Model, levels = model_rank))   # legend order = rank order

ggplot(cv_long, aes(x = Horizon, y = MAE, color = Model)) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 1.5) +
  scale_color_manual(values = model_colors) +
  scale_x_continuous(breaks = 1:H) +
  labs(title = "Cross-Validation of UK Road Deaths Models",
       subtitle = "Models ordered by average MAE across horizons (best = darkest)",
       x = "Periods Ahead", y = "MAE (deaths)", color = "Model") +
  theme_minimal(base_size = 12)

Average CV across all 11 candidates

A single summary ranking averages MAE across the 12 horizons:

avg_cv <- colMeans(cv_matrix, na.rm = TRUE)

tibble(
  Model  = names(avg_cv),
  `Avg MAE` = round(avg_cv, 1)
) |>
  arrange(`Avg MAE`)
## # A tibble: 11 × 2
##    Model `Avg MAE`
##    <chr>     <dbl>
##  1 2f         78.1
##  2 2c         79.5
##  3 2e         80.1
##  4 2d         80.3
##  5 1c         80.4
##  6 2a         81.4
##  7 2b         83.7
##  8 1b        102. 
##  9 1e        102. 
## 10 1a        171. 
## 11 1d        196.

Interpreting the ranking. The best model by average MAE is the one that produces the most accurate forecasts across horizons. As in Part 3, this should be read alongside:

  • AIC/BIC (in-sample fit, penalized for complexity).
  • Parsimony: if two models have similar CV MAE and similar AIC, prefer the simpler one.
  • Residual diagnostics: check checkresiduals() on candidates near the top to ensure residuals are white noise.
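As a reminder of what that residual check involves: checkresiduals() combines a residual time plot, a residual ACF, and a Ljung-Box test; the test itself is also available in base R as Box.test(). A self-contained sketch on simulated data (simulated here because the lab's fitted objects are defined in earlier chunks):

```r
# Simulate an AR(1) series and fit the correctly specified model
set.seed(512)
y   <- arima.sim(model = list(ar = 0.6), n = 200)
fit <- arima(y, order = c(1, 0, 0))

# Ljung-Box test on the residuals; fitdf adjusts the degrees of freedom
# for the one estimated AR parameter
lb <- Box.test(residuals(fit), lag = 12, type = "Ljung-Box", fitdf = 1)
lb$p.value  # a large p-value means no evidence against white-noise residuals
```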

tsCV() vs. the manual loop

tsCV() is more concise and handles the rolling-window logic for you. The manual loop gives more control (custom error metrics, storing point forecasts, inspecting individual iterations). In practice, use tsCV() for quick model comparison, and the manual loop when you need to inspect the internals of the CV process or compute non-standard quantities.
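A base-R sketch of the manual rolling-window loop that tsCV() automates, on simulated data so the chunk is self-contained (the window length, horizon, and AR(1) data-generating process are all illustrative choices):

```r
set.seed(512)
y <- arima.sim(model = list(ar = 0.6), n = 120)

W <- 60                               # rolling training-window length
H <- 4                                # forecast horizon
origins <- W:(length(y) - H)          # last observation of each training window
err <- matrix(NA_real_, nrow = length(origins), ncol = H)

for (i in seq_along(origins)) {
  t0    <- origins[i]
  train <- y[(t0 - W + 1):t0]                 # fixed-width window, as with window = W
  fit   <- arima(train, order = c(1, 0, 0))   # re-fit at every origin
  fc    <- predict(fit, n.ahead = H)$pred     # point forecasts for h = 1..H
  err[i, ] <- y[(t0 + 1):(t0 + H)] - fc       # forecast errors
}

mae_by_h <- colMeans(abs(err))        # horizon-wise MAE, mirroring tsCV()
round(mae_by_h, 2)
```

Because every intermediate object (fit, fc, err) is kept, it is straightforward to store point forecasts or compute a non-standard loss inside the loop — exactly the flexibility that tsCV() trades away for concision.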


Part 5: Predicted Values and Prediction Intervals

After selecting a model — say the AR(2) specification with seatbelt law and month dummies — we often want to use it to forecast forward, producing predicted values with uncertainty bounds.

5.1 Setting up the forecast covariates

Forecasting with xreg requires us to supply the future values of the covariates over the forecast horizon. For the UK deaths data, this means providing values of the law indicator and all 11 month dummies for each future month we want to predict.

Suppose we want to forecast 5 years (60 months) beyond the end of the observed data (Dec 1984), assuming the seatbelt law remains in effect:

# Use the AR(2) model we already estimated in Section 2.5
fit_final <- fit_ar2_seas

# Forecast horizon: 60 months (5 years)
n_ahead <- 60

# Future law indicator: law = 1 everywhere (law kept in effect)
law_future <- rep(1, n_ahead)

# Future month dummies: the pattern cycles every 12 months
# Observed data ends in December 1984, so next month is January 1985
month_cycle <- c("Jan","Feb","Mar","Apr","May","Jun",
                 "Jul","Aug","Sep","Oct","Nov","Dec")
future_months <- factor(rep(month_cycle, length.out = n_ahead),
                        levels = month_cycle)
future_dummies <- model.matrix(~ future_months - 1)[, -1]  # drop Jan (reference)
colnames(future_dummies) <- month.abb[-1]

# Combine into the future covariate matrix (must have same columns as training xcov)
xcov_future <- cbind(law = law_future, future_dummies)
head(xcov_future, 3)
##   law Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1   1   0   0   0   0   0   0   0   0   0   0   0
## 2   1   1   0   0   0   0   0   0   0   0   0   0
## 3   1   0   1   0   0   0   0   0   0   0   0   0

5.2 Generating forecasts and prediction intervals

The forecast() function returns point forecasts plus prediction intervals (default 80% and 95%):

fc <- forecast(fit_final, h = n_ahead, xreg = xcov_future)

# Summary of the first few months
head(as.data.frame(fc))
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 1985       1119.910  955.7962 1284.024  868.9196 1370.900
## Feb 1985       1284.622 1103.3147 1465.929 1007.3365 1561.908
## Mar 1985       1798.760 1600.3092 1997.211 1495.2555 2102.265
## Apr 1985       1182.371  975.3963 1389.346  865.8306 1498.912
## May 1985       1363.362 1150.5540 1576.169 1037.9004 1688.823
## Jun 1985       1280.961 1064.5773 1497.344  950.0308 1611.891

Key columns:

  • Point Forecast: the predicted mean under the scenario.
  • Lo 80 / Hi 80: the 80% prediction interval (1.28 standard errors).
  • Lo 95 / Hi 95: the 95% prediction interval (1.96 standard errors).
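The interval multipliers quoted above are normal quantiles, which apply when the forecast errors are assumed Gaussian. A quick check, with a made-up forecast standard error chosen to roughly match the January 1985 row:

```r
# Normal quantiles behind the default interval widths
z80 <- qnorm(0.90)    # 80% PI multiplier, ~1.28
z95 <- qnorm(0.975)   # 95% PI multiplier, ~1.96

# Reconstruct an 80% interval around a hypothetical point forecast
# (se_h = 128 is an illustrative value, not extracted from the model)
mean_fc <- 1119.9
se_h    <- 128
c(lo80 = mean_fc - z80 * se_h, hi80 = mean_fc + z80 * se_h)
# close to the Lo 80 / Hi 80 shown for Jan 1985 above
```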

5.3 Visualizing the forecast

The quickest way to plot a forecast object is autoplot(), which displays the observed data, the point forecast, and the prediction ribbons in one call. Here, however, we build the plot manually so we can control the colors: observed data in dark gray, the forecast mean as a solid orange line, and the two prediction intervals in distinct, clearly separable shades.

# Assemble the forecast into a data frame
obs_df <- tibble(
  time = as.numeric(time(deaths)),
  y    = as.numeric(deaths)
)

last_time <- as.numeric(tail(time(deaths), 1))
fc_df <- tibble(
  time  = last_time + (1:length(fc$mean)) / 12,
  mean  = as.numeric(fc$mean),
  lo80  = as.numeric(fc$lower[, 1]),   # 80% lower
  hi80  = as.numeric(fc$upper[, 1]),   # 80% upper
  lo95  = as.numeric(fc$lower[, 2]),   # 95% lower
  hi95  = as.numeric(fc$upper[, 2])    # 95% upper
)

ggplot() +
  # observed data
  geom_line(data = obs_df, aes(x = time, y = y), color = "grey25", linewidth = 0.5) +
  # 95% prediction interval (outer ribbon — light)
  geom_ribbon(data = fc_df, aes(x = time, ymin = lo95, ymax = hi95),
              fill = "#fdd0a2", alpha = 0.8) +
  # 80% prediction interval (inner ribbon — darker)
  geom_ribbon(data = fc_df, aes(x = time, ymin = lo80, ymax = hi80),
              fill = "#fd8d3c", alpha = 0.7) +
  # forecast mean
  geom_line(data = fc_df, aes(x = time, y = mean),
            color = "#d94801", linewidth = 0.9) +
  # intervention marker + end-of-data marker
  geom_vline(xintercept = 1983 + 1/12, linetype = "dashed", color = "firebrick") +
  geom_vline(xintercept = last_time, linetype = "dotted", color = "grey40") +
  annotate("text", x = 1983.5, y = 2700, label = "Seatbelt law",
           color = "firebrick", size = 3.3, hjust = 0) +
  annotate("text", x = last_time + 0.1, y = 2700, label = "End of data",
           color = "grey30", size = 3.3, hjust = 0) +
  labs(title = "5-Year Forecast: UK Road Deaths Under Continued Seatbelt Law",
       subtitle = "Dark ribbon = 80% PI; light ribbon = 95% PI",
       y = "Deaths and Serious Injuries", x = "Year") +
  theme_minimal(base_size = 13)

The forecast captures the seasonal pattern (winter peaks, summer troughs) and the reduced baseline level from the seatbelt law. Prediction intervals widen over the first several months and then level off: for a stationary AR(2), the forecast-error variance accumulates with the horizon but converges toward the unconditional variance.

5.4 Comparing scenarios: law kept vs. law reversed

A common use of xreg-based forecasting is scenario analysis. What would the forecast look like if the seatbelt law were reversed? We re-run the forecast with law = 0 over the future period and compare.

# Scenario: law reversed (law = 0 everywhere)
xcov_no_law <- xcov_future
xcov_no_law[, "law"] <- 0

fc_no_law <- forecast(fit_final, h = n_ahead, xreg = xcov_no_law)

To keep each scenario readable, we split them into two vertically stacked panels that share the same axes. This way, the observed history is visible in both panels and the forecast trajectory is clear without overlapping ribbons.

# Build plotting frames
obs_df <- tibble(
  time = as.numeric(time(deaths)),
  mean = as.numeric(deaths)
)

build_forecast_df <- function(fc_obj) {
  last_time    <- as.numeric(tail(time(deaths), 1))
  times_future <- last_time + (1:length(fc_obj$mean)) / 12
  tibble(
    time = times_future,
    mean = as.numeric(fc_obj$mean),
    lwr  = as.numeric(fc_obj$lower[, 2]),   # 95% lower
    upr  = as.numeric(fc_obj$upper[, 2])    # 95% upper
  )
}

fc_kept_df     <- build_forecast_df(fc)
fc_reversed_df <- build_forecast_df(fc_no_law)

# Shared y-axis range so the two panels are directly comparable
y_range <- range(c(obs_df$mean, fc_kept_df$lwr, fc_kept_df$upr,
                   fc_reversed_df$lwr, fc_reversed_df$upr))

# Helper: build one panel for a given scenario
plot_scenario <- function(fc_df, panel_title, color) {
  ggplot() +
    geom_line(data = obs_df, aes(x = time, y = mean), color = "black") +
    geom_ribbon(data = fc_df, aes(x = time, ymin = lwr, ymax = upr),
                fill = color, alpha = 0.25) +
    geom_line(data = fc_df, aes(x = time, y = mean),
              color = color, linewidth = 0.9) +
    geom_vline(xintercept = 1983 + 1/12, linetype = "dashed", color = "grey40") +
    geom_vline(xintercept = max(obs_df$time), linetype = "dotted", color = "grey50") +
    annotate("text", x = 1983.5, y = y_range[2] * 0.98, label = "Seatbelt law",
             color = "grey30", size = 3.2, hjust = 0) +
    coord_cartesian(ylim = y_range) +
    labs(title = panel_title, y = "Deaths", x = NULL) +
    theme_minimal(base_size = 12)
}

p_kept     <- plot_scenario(fc_kept_df,
                            "Scenario 1 — Law kept (law = 1)",
                            "steelblue")
p_reversed <- plot_scenario(fc_reversed_df,
                            "Scenario 2 — Law reversed (law = 0)",
                            "firebrick") +
              labs(x = "Year")

p_kept / p_reversed +
  patchwork::plot_annotation(
    title    = "5-Year Forecast: UK Road Deaths Under Two Policy Scenarios",
    subtitle = "Shaded ribbons = 95% prediction intervals from the AR(2) model"
  )

The difference in level between the two panels — compare the forecast ribbon in the top panel to the bottom — is the model’s estimated effect of keeping the law in place. Because the AR(2) dynamics propagate the law coefficient forward period by period, the gap between the two forecast paths settles at roughly the long-run equilibrium shift we computed in Section 2.5.
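The long-run shift follows from the AR(2) steady state: in equilibrium y_t = y_(t-1) = y_(t-2), so a permanent change in a covariate with coefficient beta moves the level by beta / (1 - phi1 - phi2). A numeric sketch with hypothetical coefficients (illustrative values, not the fitted estimates from Section 2.5):

```r
# Hypothetical AR(2) coefficients and impact effect of the law (illustrative)
phi1     <- 0.30
phi2     <- 0.15
beta_law <- -250        # immediate shift when law switches from 0 to 1

# Long-run equilibrium shift: the impact effect scaled up by AR persistence
long_run <- beta_law / (1 - phi1 - phi2)
long_run
## ≈ -454.5: the gap the two forecast paths converge toward
```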

Caveat on scenario analysis. These are predictive comparisons under the assumption that the estimated model is correctly specified and that the covariate path is truly exogenous. For causal interpretation of policy counterfactuals — including explicit modeling of the intervention timing and uncertainty around treatment effects — we will use interrupted time series (ITS) designs in Lab 4, which formalize this logic.