---
title: "CSSS/POLS 512 - Time Series and Panel Data Methods"
subtitle: "Lab 3: Modeling Stationary Time Series"
author: "Ramses Llobet"
fontfamily: accanthis
linkcolor: blue
output:
html_document:
toc: true
toc_depth: 4
editor_options:
chunk_output_type: console
fontsize: 11pt
header-includes:
\usepackage{float}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error=FALSE, warning=FALSE, message = FALSE)
# clean environment
rm(list=ls())
# load packages
library(tseries)
library(MASS)
library(tidyverse)
# load source
source("source/TS_code.R") # Chris's TS simulation code
```
# Estimation and Interpretation of ARMA models
Recall that we use a maximum likelihood approach to estimate the parameters of an ARMA model (although we also discussed estimation by least squares).
This should be familiar:
1. Express the joint probability of the data under the chosen probability distribution (i.e. the likelihood of the data given the parameters)
2. Take the logarithm, turning the product of probabilities into a sum of log-probabilities (because $\sum$ is easier to optimize than $\prod$)
3. Substitute in the linear predictor $\mu = X\boldsymbol{\beta}$ (sometimes called the "systematic component")
$$
\begin{aligned}
\mathcal{L}(\boldsymbol{\beta}, \phi_1 | \boldsymbol{y}, \boldsymbol{X}) & = -\frac{1}{2}\text{log}\Bigg(\frac{\sigma^2}{1-\phi^2_1}\Bigg)-\frac{\Bigg(y_1-\frac{\boldsymbol{x_{1}\beta}}{1-\phi_1}\Bigg)^2}{\frac{2\sigma^2}{1-\phi^2_1}} \\
& \qquad \qquad -\frac{T-1}{2}\text{log}\sigma^2-\sum^T_{t=2}\frac{(y_t-\boldsymbol{x_t\beta}-\phi_1y_{t-1})^2}{2\sigma^2}
\end{aligned}
$$
We can estimate the parameters $\boldsymbol{\beta}$ and $\phi_1$ that make the data most likely using numerical optimization. \newline
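To see what this numerical optimization looks like under the hood, the log-likelihood above can be coded and maximized with `optim()`. Below is a minimal intercept-only sketch on simulated data, parameterized by the unconditional mean $\mu$ (so $\boldsymbol{x_t\beta} = \mu(1-\phi_1)$); `ar1.loglik` and all values here are illustrative, not part of the lab's code.

```r
# Sketch: exact AR(1) log-likelihood, maximized numerically (illustrative code)
set.seed(123)
phi.true <- 0.6
y <- arima.sim(list(ar = phi.true), n = 500) + 10  # AR(1) around a mean of 10

ar1.loglik <- function(par, y) {
  mu <- par[1]; phi <- par[2]; sigma2 <- exp(par[3])  # log scale keeps sigma2 > 0
  if (abs(phi) >= 1) return(-1e10)                    # enforce stationarity
  n <- length(y)
  # stationary density of the first observation
  ll1 <- dnorm(y[1], mean = mu, sd = sqrt(sigma2 / (1 - phi^2)), log = TRUE)
  # conditional densities of y_2, ..., y_T given the previous observation
  llt <- dnorm(y[-1], mean = mu + phi * (y[-n] - mu), sd = sqrt(sigma2), log = TRUE)
  ll1 + sum(llt)
}

fit <- optim(c(mean(y), 0, 0), ar1.loglik, y = y,
             control = list(fnscale = -1))  # fnscale = -1 turns optim into a maximizer
fit$par[1:2]  # estimates of mu and phi, close to 10 and 0.6
```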
The `arima()` function in `R` will compute these for us.
```{r}
# Load data
# Number of deaths and serious injuries in UK road accidents each month.
# Jan 1969 - Dec 1984. Seatbelt law introduced in Feb 1983
# (indicator in second column). Source: Harvey, 1989, p.519ff.
# http://www.staff.city.ac.uk/~sc397/courses/3ts/datasets.html
#
# Variable names: death law
ukdata <- read.csv("data/ukdeaths.csv", header = TRUE)
names(ukdata)
attach(ukdata) # recall detach()
## Estimate an AR(1) using arima
xcovariates <- law # we want to use the seatbelt law to explain changes in the number of deaths
arima.res1a <- arima(
death, # dependent variable
order = c(1, 0, 0), # AR(1) term
xreg = xcovariates, # feed covariates - it should be a matrix in which each column is a time series of a covariate
include.mean = TRUE # include intercept
)
print(arima.res1a) # How do we interpret these parameter estimates?
```
- The numbers in the first row are **coefficients**, and the numbers in the second row are the **standard errors** of those coefficients.
- The table does not show significance stars, but a convenient rule of thumb for significance at the 0.05 level is to check whether `|coef/se| > 2`.
- Introducing the seatbelt law is associated, on average, with a decrease of about 378 road deaths in the next period, all else constant.
- We know that in an AR(1) process the effect accumulates over time and converges to a fixed mean: $$\mathbb{E}(y_t) = \frac{x_t \boldsymbol{\beta}}{1-\phi_1}$$
- For the AR(1) model above, the coefficient on the AR(1) term is about 0.644, so we have $$\mathbb{E}(y_t) = \frac{x_t \boldsymbol{\beta}}{1-\phi_1} = \frac{1719.193}{1-0.644} \approx 4828$$
- In general, we will forecast these expected values and also predicted values using simulation.
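The accumulation logic in the bullets above can be checked by iterating the AR(1) recursion forward. A quick sketch using round numbers close to the estimates just discussed (illustrative values, not the exact output of `arima.res1a`):

```r
# Sketch: in an AR(1), a permanent one-unit change in x with coefficient beta
# accumulates geometrically toward beta / (1 - phi). Illustrative values:
phi  <- 0.644
beta <- -378  # one-period association of the seatbelt law with deaths
effect <- numeric(30)
effect[1] <- beta
for (t in 2:30) effect[t] <- phi * effect[t - 1] + beta  # accumulate period by period
effect[c(1, 5, 30)]  # approaches the long-run value
beta / (1 - phi)     # long-run effect, about -1062
```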
# Model Selection
How can we know the above AR(1) model is better than other candidate models?
- We often assess and select models in three general ways:
- In-sample fit
- Out-of-sample fit: examine model performance on a new dataset
- Split sample into training and test sets and perform cross-validation
- What are common goodness-of-fit metrics?
- log likelihood: the optimized value of loss function
- Akaike Information Criterion (AIC) / Bayesian Information Criterion (BIC)
- mean squared error (MSE)
- root mean squared error (RMSE)
- mean absolute error (MAE)
- ...
- Suppose we don't know whether an AR(1) model captures the underlying process in the observed time series better than an AR(2) model does. We can compare them on their goodness-of-fit metrics.
## What Performance Metrics Should We Choose?
- There are a lot of GOF metrics, e.g. MSE, RMSE, MAE, etc. Do we have particular reasons to choose certain metrics?
- difference between MAE and MSE/RMSE
- Because the errors are squared before they are averaged, MSE/RMSE give relatively high weight to large errors, which makes them more sensitive to outliers than the MAE.
- A forecast method that minimizes the MAE will lead to forecasts of the median^[A technical proof is given at [https://en.wikipedia.org/wiki/Mean_absolute_error#Optimality_property](https://en.wikipedia.org/wiki/Mean_absolute_error#Optimality_property), and an intuitive explanation is given at [https://stats.stackexchange.com/questions/355538/why-does-minimizing-the-mae-lead-to-forecasting-the-median-and-not-the-mean](https://stats.stackexchange.com/questions/355538/why-does-minimizing-the-mae-lead-to-forecasting-the-median-and-not-the-mean).], while minimizing the RMSE will lead to forecasts of the mean.
- However, the MAE, MSE, and RMSE are all scale-dependent.
- we cannot compare model performance across different time series using them, because different series have different scales and units.
- some scholars recommend using scale-invariant metrics such as mean absolute percentage error (MAPE) to compare model performance on different time series.^[For more information, see [https://otexts.com/fpp3/accuracy.html](https://otexts.com/fpp3/accuracy.html) and [https://towardsdatascience.com/time-series-forecast-error-metrics-you-should-know-cc88b8c67f27](https://towardsdatascience.com/time-series-forecast-error-metrics-you-should-know-cc88b8c67f27).]
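The median-versus-mean property noted above can be verified numerically. A toy sketch with made-up numbers containing one large outlier: over a grid of constant forecasts, the MAE is minimized at the median and the MSE at the mean.

```r
# Sketch: MAE is minimized at the median, MSE at the mean (toy data, one outlier)
y <- c(1, 2, 3, 4, 100)
mae_at <- function(k) mean(abs(y - k))  # MAE of the constant forecast k
mse_at <- function(k) mean((y - k)^2)   # MSE of the constant forecast k
grid <- 0:100
grid[which.min(sapply(grid, mae_at))]  # 3  (the median)
grid[which.min(sapply(grid, mse_at))]  # 22 (the mean)
```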
## In-sample Fit
```{r}
# Extract estimation results from the list object `arima.res1a`
pe.1a <- arima.res1a$coef # parameter estimates (betas)
se.1a <- sqrt(diag(arima.res1a$var.coef)) # standard errors of coefficients
ll.1a <- arima.res1a$loglik # log likelihood at its maximum
sigma2hat.1a <- arima.res1a$sigma2 # MLE of the innovation (error) variance
aic.1a <- arima.res1a$aic # Akaike Information Criterion
resid.1a <- arima.res1a$resid # residuals
# We can manually calculate AIC.
# Recall that the AIC is equal to the deviance (-2*log likelihood at its maximum)
# of the model plus 2 * the dimension of the model
# (number of free parameters of the model)
-2*ll.1a + 2*length(pe.1a)
# And the MSE is just the mean of the squared residuals
mean(resid.1a^2)
# RMSE
sqrt(mean(resid.1a^2))
# MAE
mean(abs(resid.1a))
```
## Cross Validation for Time Series
- We assume that the relationship between a subsample and the sample is very close to the relationship between the sample and the population. If this analogy holds, we should be able to split the data and accurately predict one half of the data using a model fit on the other half.
- Using CV to calculate a GOF metric generally yields more reliable results than its in-sample version.
- Types of cross-validation
- Hold-out cross-validation: split the sample into a training set and a test set. We train the model on the training set and validate it on the test set.
- $k$-fold cross-validation: repeat over many different splits; this approach is fast but more biased and less robust.
- Leave-one-out cross-validation (LOOCV): leave out just one case at each iteration (an extreme version of $k$-fold CV); this approach is slow (especially on large datasets!) but less biased and more robust.
![Hold Out Cross-Validation](images/holdoutcv.jpeg)
![$K$-fold Cross-Validation](images/kfoldcv.png)
- Conventional CV techniques assume the data are *i.i.d.* In time series, however, observations are temporally dependent, so we cannot shuffle the data randomly; otherwise we would be predicting the past from the future, which makes no sense. Instead, we cross-validate the model with a **rolling forecast window**.
- The length of each test set is fixed. The test set is rolling forward at each iteration.
- But we can have two variants of training set:
- **expanding window**: all observations prior to test set constitute a training set.
- **rolling/sliding window**: the length of training set is fixed. At each iteration we forget older observations.
- This variant is typically adopted when older data become outdated, which is common in non-stationary environments.^[See [https://arxiv.org/pdf/1905.11744.pdf](https://arxiv.org/pdf/1905.11744.pdf).]
- The forecast performance is computed by averaging over the test sets.
- In `R`, we can use `tsCV()` from the `forecast` package to perform time series CV.[^1]
- you can also use `stretch_tsibble()` from the `fable` package to generate data folds.[^2]
- A tidyverse style workflow is presented in .
- But for our homework, we need to perform time series CV from scratch. This is a way to check whether we really understand this concept.
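Both training-window variants boil down to index arithmetic. A minimal sketch, with illustrative sizes ($T = 15$ periods, minimum training length $K = 5$, test horizon $P = 3$) chosen to match the diagrams drawn in the next two subsections:

```r
# Sketch: fold indices for expanding vs. sliding windows (illustrative sizes)
T <- 15  # total periods
K <- 5   # minimum training length
P <- 3   # test horizon
folds <- lapply(1:(T - K), function(i) {
  list(expanding = 1:(K + i - 1),                  # keep all past observations
       sliding   = i:(K + i - 1),                  # fixed length; drop the oldest
       test      = (K + i):min(K + i + P - 1, T))  # rolls forward, clipped at T
})
folds[[1]]$test   # periods 6, 7, 8: the first test window
folds[[10]]$test  # period 15: only one test period remains at the end
```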
### Expanding Window Scheme
```{r}
expand_window <- matrix(
c(1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2),
ncol = 15,
byrow = TRUE,
dimnames = list(seq(1, 10), seq(1, 15))
) %>%
as.data.frame() %>%
# the dot "." is a way of referencing the current piped data frame
mutate(iteration = rownames(.)) %>%
pivot_longer(-iteration,
names_to = "period",
values_to = "value") %>%
mutate(iteration = as.integer(iteration),
period = as.integer(period),
value = factor(value))
expand_window %>%
ggplot(aes(x = period,
y = iteration,
fill = value)) +
geom_tile(color = "white") +
scale_x_continuous(breaks = seq(1, 15)) +
scale_y_reverse(breaks = seq(1, 10)) +
scale_fill_manual(labels = c("Dropped", "Train", "Test"),
values = c("gray", "blue", "red"),
name = "") +
theme_classic() +
labs(x = "Periods",
y = "Iterations",
title = "Expanding Window Scheme") +
theme(
axis.line.y = element_line(arrow = arrow(type='closed',
ends = "first",
length = unit(10,'pt'))),
axis.line.x = element_line(arrow = arrow(type='closed',
ends = "last",
length = unit(10,'pt')))
)
```
### Sliding Window Scheme
```{r}
sliding_window <- matrix(
c(
1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0,
0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 0,
0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2
),
ncol = 15,
byrow = TRUE,
dimnames = list(seq(1, 10), seq(1, 15))) %>%
as.data.frame() %>%
mutate(iteration = rownames(.)) %>%
pivot_longer(-iteration,
names_to = "period",
values_to = "value") %>%
mutate(iteration = as.integer(iteration),
period = as.integer(period),
value = factor(value))
sliding_window %>%
ggplot(aes(x = period, y = iteration, fill = value)) +
geom_tile(color = "white") +
scale_x_continuous(breaks = seq(1, 15)) +
scale_y_reverse(breaks = seq(1, 10)) +
scale_fill_manual(labels = c("Dropped", "Train", "Test"),
values = c("gray", "blue", "red"),
name = "") +
theme_classic() +
labs(x = "Periods", y = "Iterations", title = "Sliding Window Scheme") +
theme(
axis.line.y = element_line(arrow = arrow(type='closed',
ends = "first",
length = unit(10,'pt'))),
axis.line.x = element_line(arrow = arrow(type='closed',
ends = "last",
length = unit(10,'pt')))
)
```
## Implementing sliding window time series CV
- Hyper-parameters:
- minimum length of training set $K$
- number of forward periods $P$ ($P \geq 1$)
- What do we know?
- time series object
- starting point of time series: $t$
- The total length of time series $T$
- Model Setup: e.g. the order of AR process, etc.
- What do we want to know? training set and test set at each iteration
- number of iterations: $T - K$
- training set:
- starting point: $t + (i - 1)$
- end point: $t + (i - 1) + (K - 1)$
- test set:
- starting point: $t + K + (i - 1)$
- end point: $\min\{t + K + (i - 1) + (P - 1), \,\,\, T\}$
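The bookkeeping above can be wrapped in a small helper. This is an illustrative function (the name `cv_indices` is ours, not from the lab's source code), using the same $t$, $T$, $K$, $P$ notation:

```r
# Sketch: train/test index arithmetic for sliding-window CV, following the
# formulas above (cv_indices is an illustrative helper)
cv_indices <- function(i, t, T, K, P) {
  list(train = c(start = t + (i - 1),
                 end   = t + (i - 1) + (K - 1)),
       test  = c(start = t + K + (i - 1),
                 end   = min(t + K + (i - 1) + (P - 1), T)))
}
# First iteration for a series starting at t = 1 with T = 192, K = 170, P = 12
# (the dimensions used for the UK deaths series in this lab):
cv_indices(1, t = 1, T = 192, K = 170, P = 12)
```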
### How to program a time series CV?
```{r, include=FALSE, echo=FALSE}
# this is a for loop for time series CV
min_window <- 170
forward <- 12
# ts information
ts <- ts(death)
ts_start <- tsp(ts)[1]
ts_length <- length(ts)
# model setup
order <- c(1, 0, 0)
xcovariates <- law
include.mean <- TRUE
# let's look one example step-by-step:
# start and end of train set
(train_start <- ts_start + 21)
(train_end <- ts_start + 21 + (min_window - 1))
# start and end of test set
(test_start <- ts_start + min_window + 21)
(test_end <- min(ts_start + min_window + 21 + (forward - 1), ts_length))
# construct train and test set
(train_set <- window(ts, start = train_start, end = train_end))
(test_set <- window(ts, start = test_start, end = test_end))
# same, but for covariates
(x_train <- window(xcovariates, start = train_start, end = train_end))
(x_test <- window(xcovariates, start = test_start, end = test_end))
# fit model
(fit <- forecast::Arima(train_set,
order = order,
include.mean = include.mean,
xreg = x_train))
# predict outcomes in test set
(pred <- forecast::forecast(fit,
h = forward,
xreg = x_test))
# performance metrics: MAE
abs(pred$mean - test_set)
```
```{r}
## Now we repeat the above but with a loop
# this is a for loop for time series CV
min_window <- 170
forward <- 12
# ts information
ts <- ts(death)
ts_start <- tsp(ts)[1]
ts_length <- length(ts)
# model setup
order <- c(1, 0, 0)
xcovariates <- law
include.mean <- TRUE
# for loop for CV procedure
n_iter <- ts_length - min_window # iterations
mae <- matrix(NA, # matrix to store GoF
nrow = n_iter,
ncol = forward)
for (i in 1:n_iter) {
# start and end of train set
train_start <- ts_start + (i - 1)
train_end <- ts_start + (i - 1) + (min_window - 1)
# start and end of test set
test_start <- ts_start + min_window + (i -1)
test_end <- min(ts_start + min_window + (i -1) + (forward - 1), ts_length)
# construct train and test set
train_set <- window(ts, start = train_start, end = train_end)
test_set <- window(ts, start = test_start, end = test_end)
x_train <- window(xcovariates, start = train_start, end = train_end)
x_test <- window(xcovariates, start = test_start, end = test_end)
# fit model
fit <- forecast::Arima(train_set,
order = order,
include.mean = include.mean,
xreg = x_train)
# predict outcomes in test set
pred <- forecast::forecast(fit,
h = forward,
xreg = x_test)
# performance metrics: MAE
mae[i, 1:length(test_set)] <- abs(pred$mean - test_set)
}
colMeans(mae, na.rm = TRUE)
# Note: each entry above is the average MAE at that forecast horizon (1 to forward periods ahead).
```
### Using custom function `arimaCV`
Look at the custom function `arimaCV` in the `TS_code.R` file. In the following code chunk, we break it down and explain line by line how the function is programmed.
```{r, eval=FALSE}
# ARIMA Cross-validation by rolling windows
# Adapted from Rob J Hyndman's code:
# http://robjhyndman.com/hyndsight/tscvexample/
#
# Could use further generalization, e.g. to seasonality
# Careful! This can produce singularities using categorical covariates
arimaCV <- function(
x, # time series object
order, seasonal, xreg, include.mean, # options for ARMA model
# hyper-parameters for rolling-window cross-validation
forward=NULL,
minper=NULL
) {
require(forecast) # use package forecast
if (!any(class(x)=="ts")) x <- ts(x) # coerce the input to a ts object if needed
n <- length(x) # the length of time series
# pre-allocate a [(T-K) x P] matrix for performance metrics
mae <- matrix(NA, nrow=n-minper, ncol=forward) # pre-allocated to store the absolute forecast errors (for the MAE)
st <- tsp(x)[1]+(minper-2) # offset chosen so that st+i is the last period of the i-th training window
for(i in 1:(n-minper)) {
xshort <- window(x, start=st+(i-minper+1), end=st+i)
xnext <- window(x, start=st+(i+1), end=min(n, st+(i+forward)))
xregshort <- window(xreg, start=st+(i-minper+1), end=st+i)
xregnext <- window(xreg, start=st+(i+1), end=min(n, st+(i+forward)))
# fitting the model
fit <- Arima(xshort,
order=order,
seasonal=seasonal,
xreg=xregshort,
include.mean=include.mean)
# prediction of the model
fcast <- forecast(fit,
h=length(xnext),
xreg=xregnext)
mae[i,1:length(xnext)] <- abs(fcast[['mean']]-xnext)
}
colMeans(mae, na.rm=TRUE)
}
```
The next code chunk shows how to apply `arimaCV`.
```{r}
# Set rolling window length and look ahead period for cross-validation
minper <- 170 # minimum window: the seat belt law finally enters at the 170th observation!
forward <- 12 # forecast 1 to 12 periods ahead of each training window
# Attempt at rolling window cross-validation (see caveats)
cv.1a <- arimaCV(death,
order=c(1,0,0),
forward=forward,
seasonal = NULL,
xreg=xcovariates,
include.mean=TRUE,
minper=minper)
cv.1a
```
### Applied model selection
In the next section, we will construct a matrix of covariates and evaluate multiple candidate models. We will employ `arimaCV` to automate the calculation of out-of-sample goodness-of-fit metrics and compare the performance of all the models.
```{r}
# It looks like there is seasonality in the time series, so let's try controlling for each month or for Q4
# Make some month variables (there are easier ways!)
jan <- as.numeric(month=="January")
feb <- as.numeric(month=="February")
mar <- as.numeric(month=="March")
apr <- as.numeric(month=="April")
may <- as.numeric(month=="May")
jun <- as.numeric(month=="June")
jul <- as.numeric(month=="July")
aug <- as.numeric(month=="August")
sep <- as.numeric(month=="September")
oct <- as.numeric(month=="October")
nov <- as.numeric(month=="November")
dec <- as.numeric(month=="December")
# Make a fourth quarter indicator
q4 <- as.numeric(oct|nov|dec)
# Store all these variables in the dataframe
ukdata$jan <- jan
ukdata$feb <- feb
ukdata$mar <- mar
ukdata$apr <- apr
ukdata$may <- may
ukdata$jun <- jun
ukdata$jul <- jul
ukdata$aug <- aug
ukdata$sep <- sep
ukdata$oct <- oct
ukdata$nov <- nov
ukdata$dec <- dec
ukdata$q4 <- q4
## Model estimation
# time series, outcome:
ts_death <- ts(death)
## Model 1b: AR(1) model of death as function of law & q4
xcovariates <- cbind(law, q4)
arima.res1b <- arima(ts_death,
order = c(1, 0, 0),
xreg = xcovariates,
include.mean = TRUE)
cv.1b <- arimaCV(ts_death,
order = c(1,0,0),
forward = forward,
seasonal = NULL,
xreg = xcovariates,
include.mean = TRUE,
minper = minper)
## Model 1c: AR(1) model of death as function of law & months
xcovariates <- cbind(law, jan, feb, mar, apr, may, jun, aug, sep, oct, nov, dec)
arima.res1c <- arima(ts_death,
order = c(1,0,0),
xreg = xcovariates,
include.mean = TRUE)
cv.1c <- arimaCV(ts_death,
order=c(1,0,0),
forward=forward,
seasonal = NULL,
xreg=xcovariates,
include.mean=TRUE,
minper=minper)
## Model 1d: AR(1) model of death as function of law & select months
xcovariates <- cbind(law, jan, sep, oct, nov, dec)
arima.res1d <- arima(ts_death,
order = c(1,0,0),
xreg = xcovariates,
include.mean = TRUE)
cv.1d <- arimaCV(ts_death,
order=c(1,0,0),
forward=forward,
seasonal = NULL,
xreg=xcovariates,
include.mean=TRUE,
minper=minper)
## Model 1e: AR(1)AR(1)_12 model of death as function of law
xcovariates <- cbind(law)
arima.res1e <- arima(ts_death,
order = c(1,0,0),
seasonal = list(order = c(1,0,0),
period = 12),
xreg = xcovariates,
include.mean = TRUE)
cv.1e <- arimaCV(ts_death,
order=c(1,0,0),
forward=forward,
seasonal = list(order = c(1,0,0),
period = 12),
xreg=xcovariates,
include.mean=TRUE,
minper=minper)
## Model 2a: AR(2) model of death as function of law & months
xcovariates <- cbind(law, jan, feb, mar, apr, may, jun, aug, sep, oct, nov, dec)
arima.res2a <- arima(ts_death,
order = c(2,0,0),
xreg = xcovariates,
include.mean = TRUE)
cv.2a <- arimaCV(ts_death,
order=c(2,0,0),
forward=forward,
seasonal = NULL,
xreg=xcovariates,
include.mean=TRUE,
minper=minper)
## Model 2b: MA(1) model of death as function of law & months
xcovariates <- cbind(law, jan, feb, mar, apr, may, jun, aug, sep, oct, nov, dec)
arima.res2b <- arima(ts_death,
order = c(0,0,1),
xreg = xcovariates,
include.mean = TRUE)
cv.2b <- arimaCV(ts_death,
order=c(0,0,1),
forward=forward,
seasonal = NULL,
xreg=xcovariates,
include.mean=TRUE,
minper=minper)
## Model 2c: ARMA(1,1) model of death as function of law & months
xcovariates <- cbind(law, jan, feb, mar, apr, may, jun, aug, sep, oct, nov, dec)
arima.res2c <- arima(ts_death,
order = c(1,0,1),
xreg = xcovariates,
include.mean = TRUE)
cv.2c <- arimaCV(ts_death,
order=c(1,0,1),
forward=forward,
seasonal = NULL,
xreg=xcovariates,
include.mean=TRUE,
minper=minper)
## Model 2d: ARMA(2,1) model of death as function of law & months
xcovariates <- cbind(law, jan, feb, mar, apr, may, jun, aug, sep, oct, nov, dec)
arima.res2d <- arima(ts_death,
order = c(2,0,1),
xreg = xcovariates,
include.mean = TRUE)
cv.2d <- arimaCV(ts_death,
order=c(2,0,1),
forward=forward,
seasonal = NULL,
xreg=xcovariates,
include.mean=TRUE,
minper=minper)
## Model 2e: ARMA(1,2) model of death as function of law & months
xcovariates <- cbind(law, jan, feb, mar, apr, may, jun, aug, sep, oct, nov, dec)
arima.res2e <- arima(ts_death,
order = c(1,0,2),
xreg = xcovariates,
include.mean = TRUE)
cv.2e <- arimaCV(ts_death,
order=c(1,0,2),
forward=forward,
seasonal = NULL,
xreg=xcovariates,
include.mean=TRUE,
minper=minper)
## Model 2f: ARMA(2,2) model of death as function of law & months
xcovariates <- cbind(law, jan, feb, mar, apr, may, jun, aug, sep, oct, nov, dec)
arima.res2f <- arima(ts_death,
order = c(2,0,2),
xreg = xcovariates,
include.mean = TRUE)
cv.2f <- arimaCV(ts_death,
order=c(2,0,2),
forward=forward,
seasonal = NULL,
xreg=xcovariates,
include.mean=TRUE,
minper=minper)
```
We have fit several models; a convenient way to assess which one best approximates the DGP is visualization:
```{r}
library(RColorBrewer)
# Plot cross-validation results
allCV <- cbind(cv.1a, cv.1b, cv.1c, cv.1d, cv.1e, cv.2a, cv.2b, cv.2c, cv.2d, cv.2e, cv.2f)
labs <- c("1a", "1b", "1c", "1d", "1e", "2a", "2b", "2c", "2d", "2e", "2f")
col <- c(brewer.pal(7, "Reds")[3:7], brewer.pal(8, "Blues")[3:8])
matplot(allCV,
type="l",
col=col,
lty=1,
ylab="Mean Absolute Error",
xlab="Periods Forward",
main="Cross-validation of accident deaths models",
xlim=c(0.75,12.75))
text(labs, x=rep(12.5,length(labs)), y=allCV[nrow(allCV),], col=col)
# Average cross-validation results
avgCV12 <- apply(allCV, 2, mean) %>% sort()
avgCV12
# Based on AIC & cross-validation,
# let's select Model 2f to be our final model;
# there are other plausible models
arima.resF <- arima.res2f
arima.res2f
# Alas, is 2f the most parsimonious?
length(arima.res2f$coef)
length(arima.res2c$coef) # <-- best model, in my opinion
length(arima.res2e$coef)
length(arima.res2d$coef)
length(arima.res1c$coef) # <-- second best option?
# Let's check for evidence on white noise:
Box.test(arima.res2f$residuals)
Box.test(arima.res2c$residuals)
Box.test(arima.res1c$residuals) # fails to reject the null of white noise
```
### Least Squares vs. MLE-ARIMA
Now, let's compare how much least squares (`lm()`) differs from MLE-based ARIMA (`arima()`) when estimating similar models for this univariate time series.
```{r}
## What would happen if we used linear regression on a single lag of death?
lm.res1f <- lm(death ~ lag(death) + jan + feb + mar + apr + may + jun +
aug + sep + oct + nov + dec + law)
summary(lm.res1f)
library(lmtest)
# Check LS result for serial correlation in the first or second order
# Breusch-Godfrey Test:
bgtest(lm.res1f, 1)
bgtest(lm.res1f, 2)
# Q-statistic test:
Box.test(lm.res1f$residuals)
# Evidence of residual serial correlation is strong
# Rerun with two lags
lm.res1g <- lm(death ~ lag(death, 1) + lag(death, 2) + jan + feb + mar + apr + may + jun +
aug + sep + oct + nov + dec + law)
summary(lm.res1g)
# Check LS result for serial correlation in the first or second order
# Breusch-Godfrey Test:
bgtest(lm.res1g,1)
bgtest(lm.res1g,2)
# Q-statistic test:
Box.test(lm.res1g$residuals)
# Borderline evidence of serial correlation, but substantively different result.
# (Even small time series assumptions can have big implications for substance.)
# Comparing best fitted models: LS vs arima:
arima.res2ls <- arima(ts_death,
order = c(2,0,0),
xreg = xcovariates,
include.mean = TRUE)
stargazer::stargazer(arima.res2c,lm.res1g,arima.res2ls,
type="text")
```
### Automate search with `forecast::auto.arima()`
Tired of manual search? `auto.arima()` in the `forecast` package can automate it, but be careful!
In this case, `auto.arima()` guessed that the time series was non-stationary and, without the additive seasonal effects added manually, tried to fit complex seasonal ARMA terms. Restricting the search to stationary models and leaving seasonality to the month dummies produced a simpler, better-fitting model. One more warning: if you set `seasonal = TRUE`, also set **approximation** and **stepwise** to `TRUE`, or prepare to wait.
```{r}
xcovariates <- cbind(law, jan, feb, mar, apr, may, jun, aug, sep, oct, nov, dec)
arima.res3a <- auto.arima(ts_death,
stationary=TRUE,
seasonal=FALSE,
ic="aic",
approximation=FALSE,
stepwise=FALSE,
xreg = xcovariates)
print(arima.res3a) # same as model 2d
print(arima.res2d)
cv.3a <- arimaCV(ts_death,
order=c(2,0,1),
forward=forward,
seasonal = NULL,
xreg=xcovariates,
include.mean=TRUE,
minper=minper)
```
# Prediction and Visualization
## Counterfactual Forecasting with Hypothetical scenarios
Now that we've selected a model, the ARMA(2,2), let's interpret it using counterfactuals iterated over time.
```{r}
## Predict out five years (60 monthly periods) assuming the law is kept
# Make newdata data frame for prediction
(xcovariates <- cbind(law, jan, feb, mar, apr, may,
jun, aug, sep, oct, nov, dec))
(n.ahead <- 60)
(lawhyp0 <- rep(0, n.ahead))
(lawhyp1 <- rep(1, n.ahead))
janhyp <- rep( c( 1,0,0, 0,0,0, 0,0,0, 0,0,0 ), 5)
febhyp <- rep( c( 0,1,0, 0,0,0, 0,0,0, 0,0,0 ), 5)
marhyp <- rep( c( 0,0,1, 0,0,0, 0,0,0, 0,0,0 ), 5)
aprhyp <- rep( c( 0,0,0, 1,0,0, 0,0,0, 0,0,0 ), 5)
mayhyp <- rep( c( 0,0,0, 0,1,0, 0,0,0, 0,0,0 ), 5)
junhyp <- rep( c( 0,0,0, 0,0,1, 0,0,0, 0,0,0 ), 5)
aughyp <- rep( c( 0,0,0, 0,0,0, 0,1,0, 0,0,0 ), 5)
sephyp <- rep( c( 0,0,0, 0,0,0, 0,0,1, 0,0,0 ), 5)
octhyp <- rep( c( 0,0,0, 0,0,0, 0,0,0, 1,0,0 ), 5)
novhyp <- rep( c( 0,0,0, 0,0,0, 0,0,0, 0,1,0 ), 5)
dechyp <- rep( c( 0,0,0, 0,0,0, 0,0,0, 0,0,1 ), 5)
## Set data for scenario law == 0 (absence of law)
newdata0 <- cbind(lawhyp0, janhyp, febhyp, marhyp,
aprhyp, mayhyp, junhyp, aughyp,
sephyp, octhyp, novhyp, dechyp) %>%
as.data.frame() %>%
`colnames<-`(c("law", "jan", "feb", "mar", "apr",
"may", "jun", "aug", "sep", "oct",
"nov", "dec"))
# Run predict
ypred0 <- predict(arima.resF,
n.ahead = n.ahead, # number of periods to predict
newxreg = newdata0) # covariates
## Set data for scenario law == 1 (implementation of law)
newdata1 <- cbind(lawhyp1, janhyp, febhyp, marhyp,
aprhyp, mayhyp, junhyp, aughyp,
sephyp, octhyp, novhyp, dechyp) %>%
as.data.frame() %>%
`colnames<-`(c("law", "jan", "feb", "mar", "apr",
"may", "jun", "aug", "sep", "oct",
"nov", "dec"))
# Run predict
ypred <- predict(arima.resF,
n.ahead = n.ahead,
newxreg = newdata1)
```
## Plot Predicted Values and Prediction Interval with base R
```{r}
plot.new()
par(usr = c(0, length(death) + n.ahead, 1000, 3000))
# make the x-axis
axis(1,
at = seq(from = 10, to = 252, by = 12),
labels = 1969:1989)
# make the y-axis
axis(2)
title(xlab = "Time",
ylab = "Deaths",
main="Predicted effect of reversing seat belt law")
# Polygon of predictive interval for no law (optional)
(x0 <- (length(death)+1):(length(death) + n.ahead))
(y0 <- c(ypred0$pred - 2*ypred0$se,
rev(ypred0$pred + 2*ypred0$se),
(ypred0$pred - 2*ypred0$se)[1]))
polygon(x = c(x0, rev(x0), x0[1]),
y = y0,
border=NA,
col="#FFBFBFFF")
# Plot the actual data
lines(x = 1:length(death),
y = death)
# Add the predictions for no law
lines(x = length(death):(length(death)+n.ahead),
y = c(death[length(death)],ypred0$pred), # link up the actual data to the prediction
col = "red")
# Add the lower predictive interval for no law
lines(x = length(death):(length(death) + n.ahead),
y = c(death[length(death)], ypred0$pred - 2*ypred0$se),
col = "red",
lty="dashed")
# Add the upper predictive interval for no law
lines(x = length(death):(length(death) + n.ahead),
y = c(death[length(death)], ypred0$pred + 2*ypred0$se),
col = "red",
lty = "dashed")
# Add the predictions for keeping law
lines(x = length(death):(length(death)+n.ahead),
y = c(death[length(death)], ypred$pred), # use ypred (law kept), not ypred0
col = "blue")
```