---
title: "CSSS/POLS 512 - Time Series and Panel Data Methods"
subtitle: "Lab 4: Modeling Non-Stationary Time Series"
author: "Ramses Llobet"
fontfamily: accanthis
linkcolor: blue
output:
html_document:
toc: true
toc_depth: 4
editor_options:
chunk_output_type: console
fontsize: 11pt
header-includes:
\usepackage{float}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error=FALSE, warning=FALSE, message = FALSE)
# clean environment
rm(list=ls())
# load packages
library(tseries) # For unit root tests
library(lmtest) # For Breusch-Godfrey LM test of serial correlation
library(urca) # For estimating cointegration models
library(stargazer)
library(RColorBrewer)
library(MASS) # For mvrnorm()
library(tidyverse)
# load source
source("source/TS_code.R") # Chris's TS custom functions
# Custom functions to quickly visualize TS
plot_time <- function(df) {
# layout of three plots
layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))
# Time series plot, with mean on 0
plot(df, main = "Time Series Plot")
abline(h = 0, col = "red", lty = 2)
# Auto-correlation function plot
acf(df, main = "Auto Correlation Function")
# Partial auto-correlation function plot
pacf(df, main = "Partial Auto Correlation Function")
# Q-statistic
Box.test(df)
}
```
# Review Problem Set 1
## Simulate to understand
If you are uncertain about the appearance of a specific dynamic process, it is advisable to simulate it.
Here, we use Chris's `makeTS()` function to simulate various dynamic processes for exploration. Its output is wrapped in `ts()` and passed to the custom `plot_time()` function defined above, which directly visualizes the time series plot, its **autocorrelation function** (ACF) and **partial autocorrelation function** (PACF), along with the Q-statistic test for **white noise**.
It is important to note that in the **Q-test** (Box-Pierce):
- the null hypothesis $H_0$ posits that the time series adheres to a **white noise** process.
- the alternative hypothesis $H_1$ posits that your time series **has autocorrelation**.
Therefore, if we reject the null hypothesis in favor of the alternative ($Q$: $p\text{-value}\leq 0.10$), we have evidence that the time series or its residuals still contain dynamic processes to model. In contrast, if we fail to reject the null hypothesis ($Q$: $p\text{-value} > 0.10$), we have evidence that our time series follows white noise and no remaining dynamics should confound our analyses.
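As a quick illustration of this decision rule, here is a minimal sketch on simulated data (using base R's `arima.sim()` rather than `makeTS()`):

```{r}
# Hypothetical illustration: white noise vs. an AR(1) process
set.seed(512)
wn  <- rnorm(200)                                  # pure white noise
ar1 <- arima.sim(model = list(ar = 0.7), n = 200)  # AR(1), rho = 0.7

Box.test(wn)   # large p-value expected: fail to reject white noise
Box.test(ar1)  # tiny p-value expected: reject in favor of autocorrelation
```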
In your simulations, remember to `set.seed()` to replicate the same simulations (below, I am not setting any seed).
```{r}
# What does a time-trend process look like?
plot_time(ts(makeTS(n=100, trend=0.02)))
# What does an AR process look like?
plot_time(ts(makeTS(n=100, ar=0.6)))
# What does an MA process look like?
plot_time(ts(makeTS(n=100, ma=0.8)))
# What about additive seasonality?
plot_time(ts(makeTS(n=100,
seasons=c(0,4,0,
0,0,6,
0,0,0,
-3,0,0),
adjust="level")))
# What does an ARMA process look like?
plot_time(ts(makeTS(n=100, ar=0.6, ma=0.8)))
```
## Problem Set 1
For the initial twelve time series (A-L), each time series' data-generating process consists of at most one dynamic component. Hence, you can initially infer the process by visually inspecting the plot, autocorrelation function (ACF), and partial autocorrelation function (PACF).
If you are unsure of what is going on, you can employ the **Box-Jenkins diagnostics** method to systematically identify and isolate the dynamic components until the residual series exhibit characteristics of white noise. Remember:
1. Visualize the time series, using `plot()` or `decompose()`.
2. If a deterministic trend or additive seasonality seems to be present, verify it by fitting `time()` or `decompose()$seasonal` in `lm()`, and check the residuals.
3. If neither trend nor additive seasonality seems to be present, check for AR or MA processes, but be sure to first adjust for trends and/or seasonality, or it may be difficult to identify what is going on.
4. If you are still unsure, narrow the field down to two or three potential candidates and estimate them using `arima()`. Select the models that provide the lowest **AIC**.
5. If at any of the above steps the Q-statistic in `Box.test()` provides evidence of **white noise**, then very likely you have found the best candidate model of the underlying DGP.
6. If you have more than one final candidate, use the most **parsimonious** specification, that is, the one that estimates fewer parameters but still returns white noise.
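Steps 4 through 6 can be sketched as follows on a hypothetical simulated series (the candidate orders and the ARMA(1,1) DGP below are illustrative assumptions, not part of the problem set):

```{r}
# Simulate a hypothetical ARMA(1,1) series as the "unknown" DGP
set.seed(512)
y <- arima.sim(model = list(ar = 0.5, ma = 0.3), n = 200)

# Step 4: estimate a few candidate specifications and compare AIC
cands <- list(ar2    = arima(y, order = c(2, 0, 0)),
              ma2    = arima(y, order = c(0, 0, 2)),
              arma11 = arima(y, order = c(1, 0, 1)))
sapply(cands, AIC)

# Step 5: Q-test on the residuals of the lowest-AIC candidate
best <- cands[[which.min(sapply(cands, AIC))]]
Box.test(best$residuals)
```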
```{r}
# load data
df <- read_csv(file="data/mysterytsUW.csv")
# time series: C
df_1c <- df %>% select(c) %>% ts(frequency = 12)
plot_time(df_1c)
# time series: D
df_1d <- df %>% select(d) %>% ts(frequency = 12)
plot_time(df_1d)
# time series: G
df_1g <- df %>% select(g) %>% ts(frequency = 12)
plot_time(df_1g)
# time series: H
df_1h <- df %>% select(h) %>% ts(frequency = 12)
plot_time(df_1h)
# time series: I
df_1i <- df %>% select(i) %>% ts(frequency = 12)
plot_time(df_1i)
# in this last one, the Q-test is spurious. Look at the visualizations.
```
The six remaining time series M-R could have up to two components, which makes them harder to identify by eyeballing alone.
```{r}
# time series: Q
df_1q <- df %>% select(q) %>% ts(frequency = 12)
plot_time(df_1q)
# could it be AR(2)? or MA(2)? or ARMA(1,1)?
res1 <- arima(df_1q, order = c(2,0,0))
res2 <- arima(df_1q, order = c(0,0,2))
res3 <- arima(df_1q, order = c(1,0,1))
stargazer(res1,res2,res3,
type = "text")
plot_time(res1$residuals)
```
For the second part, we know that all the series follow an `AR(p)` process. However, some of these series may not be stationary.
If you were uncertain, you could test for a unit root with the function `PP.test()`, whose null hypothesis $H_0$ states that the time series **has** a unit root.
If the sum of your `AR(p)` estimates is below 1, you have evidence of stationarity; if their sum is at or above 1, your series is non-stationary.
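A minimal sketch of this check on simulated (hypothetical) data, contrasting a stationary AR(2) with a random walk:

```{r}
# Stationary AR(2): true coefficient sum is 0.5 + 0.3 = 0.8 < 1
set.seed(512)
y_stat <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 300)
fit    <- arima(y_stat, order = c(2, 0, 0))
sum(fit$coef[c("ar1", "ar2")])  # well below 1: evidence of stationarity

# A random walk is the limiting AR(1) with rho = 1
y_rw   <- cumsum(rnorm(300))
fit_rw <- arima(y_rw, order = c(1, 0, 0), method = "ML")
fit_rw$coef["ar1"]              # close to 1: suspect a unit root
```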
```{r}
# load data
df2 <- read_csv(file="data/mysterytsUW2.csv")
# time series: S
df_2s <- df2 %>% select(s) %>% ts(frequency = 12)
plot_time(df_2s)
adf.test(df_2s)
PP.test(df_2s)
# time series: U
df_2u <- df2 %>% select(u) %>% ts(frequency = 12)
plot_time(df_2u)
(res <- arima(df_2u, order = c(2,0,0)))
plot_time(res$residuals)
```
# Non-Stationary Time Series
For Problem Set 2, the most important techniques are unit root tests, model estimation, model selection, and counterfactual forecasting.
Up to this point, our focus has primarily been on **autoregressive** $AR(p)$ and **moving average** $MA(q)$ processes, as the data-generating processes (DGPs) we have encountered have either been **stationary** or possessed **unit roots**. A nonstationary time series, however, may be **integrated** of order $d$, written $I(d)$. Adding the possibility of multiplicative seasonality, we arrive at the comprehensive dynamic specification of the *Autoregressive Integrated Moving Average (ARIMA)* model.
$$\text{ARIMA}(p,d,q)(P,D,Q)_m $$
Recall that stationary time series have three properties:
1. **Mean stationarity**
+ mean does not depend on $t$, constant over time
+ variance also does not depend on $t$, constant over time
2. **Covariance stationarity**
+ covariance of $y_t$ and $y_{t-1}$ do not depend on $t$
+ does not depend on the actual time the covariance is computed
3. **Ergodicity**
+ sample moments converge in probability to the population moments
+ sample mean and variance tend to be the same as entire series
Nonstationary processes lack these properties. Note that we are abstracting from trends and other covariates, which means that the social processes are treated as a black box in the time dimension.
Why do nonstationary processes matter?
1. ACF and PACF are **not defined**, since the covariances in the numerator depend on $t$.
2. **Spurious regression**: we may detect strong correlations between nonstationary processes although they could be conditionally (mean) independent.
3. Long-run forecasts are **infeasible**, since the process does not revert to its mean.
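The spurious-regression problem (point 2) is easy to demonstrate with a small simulation on hypothetical data: two independent random walks will often appear strongly related in levels, while differencing removes the artifact.

```{r}
# Two independent random walks: no true relationship between them
set.seed(512)
x_rw <- cumsum(rnorm(200))
y_rw <- cumsum(rnorm(200))

# In levels, the slope often looks "significant" despite independence
summary(lm(y_rw ~ x_rw))$coefficients

# In differences, the spurious finding typically disappears
summary(lm(diff(y_rw) ~ diff(x_rw)))$coefficients
```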
How can we treat nonstationary processes?
1. Analyze nonstationary process using **ARIMA** (differencing)
+ effective at capturing short run changes.
+ outcome is transformed to a difference.
+ long-run predictions not feasible.
2. Analyze nonstationary process using cointegration
+ effective at capturing long run relationships between nonstationary processes.
+ outcome is left as a level.
+ short-run and long-run predictions feasible.
+ appropriate for analyzing multiple time series.
We will explore non-stationary modeling using the presidential approval example discussed in [topic 4](https://faculty.washington.edu/cadolph/index.php?page=102&loc=panUW&pdf=topic4.pw).
```{r}
# clean environment
rm(list = ls())
if (dev.cur() > 1) dev.off() # close any open graphics device
# load source
source("source/TS_code.R") # Chris's TS custom functions
# load data
approval <- read_csv(file="data/approval.csv")
attach(approval)
colnames(approval)
```
As always, the first step is to plot the time series and try to make sense of its underlying DGP.
```{r}
# plot presidential approval
plot(approve,
type="l",
ylab="Percent Approving",
xlab="Time",
main = "US Presidential Approval")
lines(x=c(8,8),
y=c(-1000,1000),
col="red")
lines(x=c(26,26),
y=c(-1000,1000),
col="blue")
text("9/11",
x = 10, y = 40,
col="red",cex=0.7)
text("Iraq \n War",
x = 28, y = 40,
col="blue",cex=0.7)
```
In the plot above, there is a clear, large shock in the 8th period, followed by a weaker one around the 26th period. These shocks correspond to two significant events: the 9/11 attacks and the Iraq War. We call these types of shocks **structural breaks**.
In U.S. politics, domestic political shocks often propagate through the stock market. Specifically, presidential approval ratings are highly correlated with oil prices, with a correlation of `r round(cor(approve,avg.price),2)`.
```{r}
# plot average oil price
plot(avg.price,
type="l",
ylab="$ per Barrel",
xlab="Time",
main = "Average Price of Oil")
lines(x=c(8,8),
y=c(-1000,1000),
col="red")
lines(x=c(26,26),
y=c(-1000,1000),
col="blue")
text("9/11",x = 10, y = 175,
col="red",cex=0.7)
text("Iraq \n War",x = 28, y = 175,
col="blue",cex=0.7)
```
As always, after looking at the time series plots, we need to look at the ACF and PACF.
```{r}
# Look at the ACF and PACF of approvals
acf(approve)
pacf(approve)
# Look at the ACF and PACF of oil prices
acf(avg.price)
pacf(avg.price)
```
## Unit Root Tests
**Unit-root tests are the first step to study and analyze non-stationary time series.**
Intuition: if a time series is stationary, then regressing $y_{t}-y_{t-1}$ on $y_{t-1}$ should produce a negative coefficient. Why?
In a stationary series, knowing the past value of the series helps to predict the next period's change. Positive shifts should be followed by negative shifts (mean reversion).
$$
\begin{aligned}
y_t & = \rho y_{t-1} + \epsilon_{t} \\
y_t - y_{t-1} & = \rho y_{t-1} - y_{t-1} + \epsilon_{t} \\
\Delta y_{t} & = \gamma y_{t-1} + \epsilon_{t} \quad \quad \text{, where } \gamma=(\rho - 1) \\
\end{aligned}
$$
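The derivation above can be checked by hand on simulated (hypothetical) data: regress $\Delta y_t$ on $y_{t-1}$ and inspect the sign of $\gamma$.

```{r}
# Dickey-Fuller-style regression by hand: Delta y_t on y_{t-1}
set.seed(512)
y_stat <- arima.sim(model = list(ar = 0.5), n = 300)  # stationary AR(1)
y_rw   <- cumsum(rnorm(300))                          # random walk

gamma_stat <- coef(lm(diff(y_stat) ~ head(y_stat, -1)))[2]
gamma_rw   <- coef(lm(diff(y_rw)   ~ head(y_rw,   -1)))[2]
gamma_stat  # clearly negative: mean reversion (true gamma = rho - 1 = -0.5)
gamma_rw    # close to zero: no mean reversion
```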
### ADF test
The Augmented Dickey-Fuller (ADF) test has a null hypothesis of a unit root.
The same holds for the Phillips-Perron (PP) test, which differs in how the AR($p$) time series is modeled with respect to lags, serial correlation, and heteroskedasticity.
The ADF test regression can be expressed as follows:
$$\Delta y_t = \boldsymbol{\beta D_t} + \pi y_{t-1} + \sum^p_{j=1}\psi_j\Delta y_{t-j}+\epsilon_t$$
where $\boldsymbol{D}_t$ is a vector of deterministic trends. The $p$ lagged difference terms, $\Delta y_{t-j}$, are used to approximate the ARMA structure of the errors (consider the equivalence of AR and MA models), and the value of $p$ is set so that the error is serially uncorrelated. The error term is assumed to be homoskedastic.
Under the null hypothesis ($\rho - 1 = 0$, i.e., a unit root, hence non-stationarity), $\pi = 0$, and the estimate of $\pi$ is used to construct the ADF test statistic. If the ADF test does **not** return a statistically significant result, we fail to reject the unit-root null, so the time series is likely non-stationary; a significant result is evidence of stationarity.
Conflicting results from PP and ADF tests may suggest a structural break within the time series of presidential approval.
```{r}
adf.test(approve)
adf.test(avg.price)
```
Notice that the ADF test suggests that presidential approval follows a stationary process, while oil prices are nonstationary.
### PP test
The PP test regression can be expressed as follows:
$$\Delta y_t = \boldsymbol{\beta D_t} + \pi y_{t-1} +u_t$$
The PP test **ignores any serial correlation** in the test regression. Instead, it corrects for serial correlation and heteroskedasticity in the errors $u_t$ by directly modifying the test statistic, $t_{\pi=0}$
As we can see, both the ADF and PP tests take $I(1)$ non-stationarity as the null hypothesis, but they differ in how the time series is modeled: the PP test allows the residuals to be **serially correlated**.
```{r}
PP.test(approve)
PP.test(avg.price)
```
In contrast to the ADF test, the PP test suggests that both presidential approval and oil prices are nonstationary.
Sometimes the ADF and PP tests provide **different evidence**. This happens when some underlying temporal dependence is not incorporated into the assumed models, or when the sample is too small. For example, if there is a **structural break** (i.e., persistence arises from large and infrequent shocks) in the time series, the ADF test tends to provide evidence of a non-stationary process.
If we suspect that our time series have particular forms of temporal dependence ignored by those tests, there are other [potential statistical tests](https://www.aptech.com/blog/unit-root-tests-with-structural-breaks/):
- Higher-order differencing non-stationarity: Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.
- Structural break: Zivot-Andrews test, etc.
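A minimal sketch of both alternatives on a hypothetical random walk (the `tseries` and `urca` packages loaded at the top provide `kpss.test()` and `ur.za()`):

```{r}
library(tseries) # kpss.test()
library(urca)    # ur.za()

# Hypothetical random walk for illustration
set.seed(512)
y_rw <- cumsum(rnorm(200))

# KPSS reverses the burden of proof: H0 is stationarity
kpss.test(y_rw)  # small p-value expected: reject stationarity

# Zivot-Andrews: unit-root test allowing one endogenous structural break
summary(ur.za(y_rw, model = "both"))
```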
## Differencing the Time Series
Recall that we can use differencing $d$ to address integration processes $I(d)$ of nonstationarity. We can difference a random walk as follows:
$$
\begin{aligned}
y_{t} =& y_{t-1} + \boldsymbol{x_t \beta} + e_t \\
y_{t}-y_{t-1} =& y_{t-1} - y_{t-1} + \boldsymbol{x_t \beta} + e_t \\
\Delta y_t =& \boldsymbol{x_t\beta}+e_t
\end{aligned}
$$
If $y_{t}$'s DGP is an $I(1)$ process, differencing once yields an `AR(0)`, stationary `I(0)` series. But note that we are now explaining the **short-run** difference or change in the variable, not its level.
```{r}
diff(approve) # get an order-1 differenced time series
diff(approve, differences = 2) # order-2 differenced time series
# Look at the ACF and PACF of approvals
diff(approve) %>% acf()
diff(approve) %>% pacf()
# Look at the ACF and PACF of oil prices
diff(avg.price) %>% acf()
diff(avg.price) %>% pacf()
# Check unit root tests
diff(approve) %>% adf.test()
diff(avg.price) %>% adf.test()
diff(approve) %>% PP.test()
diff(avg.price) %>% PP.test()
```
This model can be estimated using the `arima()` function.
## Forecasting with a Non-stationary Time Series
As explained in the lecture (Topic 4), we estimate an **ARIMA(0,1,0)**, which fits a little better than an **ARIMA(2,1,2)** on the AIC criterion.
```{r}
xcovariates <- cbind(sept.oct.2001, iraq.war, avg.price)
(arima.res1a <-
arima(approve,
order = c(0,1,0),
xreg = xcovariates,
include.mean = FALSE)) # does FALSE matter? (arima() ignores include.mean when d > 0)
(arima.res2a <-
arima(approve,
order = c(2,1,2),
xreg = xcovariates,
include.mean = FALSE)) # does FALSE matter? (arima() ignores include.mean when d > 0)
# Check out the Q-test statistic:
Box.test(arima.res1a$residuals)
Box.test(arima.res2a$residuals)
# extract estimates and statistics from final model:
(pe.1a <- arima.res1a$coef) # parameter estimates (betas)
(se.1a <- sqrt(diag(arima.res1a$var.coef))) # standard errors
(ll.1a <- arima.res1a$loglik) # log likelihood at its maximum
(sigma2hat.1a <- arima.res1a$sigma2) # standard error of the regression
(aic.1a <- arima.res1a$aic) # Akaike Information Criterion
(resid.1a <- arima.res1a$resid) # residuals (white noise)
```
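With the parameter estimates and standard errors in hand, approximate 95% confidence intervals follow from the normal approximation. A minimal sketch on simulated (hypothetical) data; the same recipe applies to `pe.1a` and `se.1a` above:

```{r}
# Approximate 95% CIs: estimate +/- 1.96 * standard error
set.seed(512)
y   <- arima.sim(model = list(ar = 0.5), n = 200)
fit <- arima(y, order = c(1, 0, 0))
pe  <- fit$coef                  # parameter estimates
se  <- sqrt(diag(fit$var.coef))  # standard errors
cbind(estimate = pe, lwr = pe - 1.96 * se, upr = pe + 1.96 * se)
```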
Some counterfactual stories might help....
+ What if 9/11 had not happened?
+ What if Bush had not invaded Iraq?
+ What if the price of oil had remained at pre-war levels?
### Step 1: construct counterfactual scenarios
We will create new data objects with hypothetical scenarios. In **xhyp_no911**, we will set the September 2001 dummy to zero, a counterfactual in which the 9/11 attacks never happened. In **xhyp_noiraq**, we will apply the same type of counterfactual to the Iraq War dummy, and in **xhyp_oilprewar** we will hold oil prices at their pre-war level.
```{r, results='hide'}
(xhyp_no911 <- xcovariates[8:65, ]) # recall 8th period -> 9/11 shock
(xhyp_no911[, "sept.oct.2001"] <- 0)
(xhyp_noiraq <- xcovariates[26:65, ])
(xhyp_noiraq[, "iraq.war"] <- 0)
(xhyp_oilprewar <- xcovariates[26:65, ])
(xhyp_oilprewar[, "avg.price"] <- xcovariates[25, "avg.price"])
```
### Step 2: Generate Predicted Values and Prediction Intervals
Now we will use the base R `predict()` function to forecast how presidential approval may have fluctuated absent the 9/11 shock. We then create a new data object **pred_no911**, which contains the original approval values up to the counterfactual period, followed by the forecasts.
```{r}
approve_no911 <-
predict(arima.res1a,
newxreg = xhyp_no911) # predict after september 2001; it will return a list containing expected predictions and standard errors of predictions
# construct a data.frame for plot
# we input the original values up until the counterfactual period
pred_no911 <-
tibble(
mean = c(approve[1:7], approve_no911$pred),
# uncertainty of forecast:
    se = c(rep(0, 7), approve_no911$se), # one se per forecast period (65-8+1 = 58)
lwr = mean - 1.96*se,
upr = mean + 1.96*se
)
# let's see how it goes compared to original time series
plot(ts(approve,
start = c(2001, 2),
frequency = 12),
ylab="Percent Approving",
xlab="Time",
main = "US Presidential Approval")
lines(ts(c(approve[7],
approve_no911$pred),
start = c(2001, 8),
frequency = 12),
lty = "dashed")
```
### Step 3: Simulate Expected Values and Confidence Intervals
**Note**: there seems to be a bug in Chris's code for this topic. I will contact him to double-check, but my sense is that the bug originates in his function `simcf::ldvsimev`. Below we simulate the same expected values ourselves.
```{r}
# First we need the simulated parameters
sims <- 10000
simparams <- mvrnorm(n = sims,
mu = arima.res1a$coef,
Sigma = arima.res1a$var.coef)
head(simparams)
dim(simparams)
sim_approve <- matrix(NA, nrow = sims, ncol = 65 - 8 + 1)
model <- arima.res1a
## Let's break the process step by step
# For the first simulation
(model$coef <- simparams[1, ])
(sim_approve[1, ] <- predict(model, newxreg = xhyp_no911)[["pred"]])
# For the second simulation
(model$coef <- simparams[2, ])
(sim_approve[2, ] <- predict(model, newxreg = xhyp_no911)[["pred"]])
# and so on...
# Loop predict function for each multivariate simulation
for (i in 1:sims) {
model$coef <- simparams[i, ]
sim_approve[i, ] <- predict(model, newxreg = xhyp_no911)[["pred"]]
}
dim(sim_approve)
# Now we create a new data frame with the mean of all the simulations
# Question: why do we take the mean? Why do all these simulations?
exp_no911 <-
data.frame(
mean = c(approve[1:7],
colMeans(sim_approve)),
# confidence intervals using the quantile function
lwr = c(approve[1:7],
apply(sim_approve, 2, quantile, 0.025)),
upr = c(approve[1:7],
apply(sim_approve, 2, quantile, 0.975))
)
```
### Step 4: Visualize
Finally, we visualize our expected values of the counterfactual on presidential approvals absent the 9/11 shock.
```{r}
plot_data <-
bind_rows(pred_no911, exp_no911, .id = "type") %>%
mutate(
type = factor(type,
levels=1:2,
labels=c("pred.","expect."))
)
plot_data$time <-
paste0(approval$year, "-",approval$month) %>% lubridate::ym() %>% rep(2)
ggplot(plot_data, aes(x = time,
y = mean,
ymax = upr,
ymin = lwr)) +
geom_line() +
geom_ribbon(aes(fill = type), alpha = 0.3) +
labs(x = "Time", y = "Approval Rate", title = "US Presidential Approval") +
coord_cartesian(ylim = c(10,70))+
theme_classic()
```
## Cointegration Analysis
- Thus far, we have been examining a single time series with covariates. This assumes that there is no feedback between variables.
- Yet, we may be interested in the relationship between **two** potentially **non-stationary** time series that influence each other.
- The original idea behind cointegration is that two time series may be in equilibrium in the **long run** but in the short run the two series deviate from that equilibrium.
- Cointegrated time series are two non-stationary time series that are **causally** connected and do not tend toward any particular level but tend toward each other.
- **Cointegration** means that a specific combination of two non-stationary series may be stationary. We say that these two series are cointegrated and the vector that defines the stationary linear combination is called the cointegrating vector.
More on ECM applications in political science: [Grant and Lebo (2016)](https://www.cambridge.org/core/journals/political-analysis/article/abs/error-correction-methods-with-political-time-series/654241453806EC8B6601DDC2B6680110)
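The definition above can be illustrated directly on simulated (hypothetical) data: each series alone is $I(1)$, but the linear combination defined by the cointegrating vector $(1, -0.6)$ is stationary.

```{r}
library(tseries) # adf.test()

# Hypothetical cointegrated pair
set.seed(512)
x <- cumsum(rnorm(300))    # I(1) random walk
y <- 0.6 * x + rnorm(300)  # cointegrated with x

adf.test(y)$p.value            # y alone behaves like a unit-root process
adf.test(y - 0.6 * x)$p.value  # the linear combination is stationary
```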
### Engle-Granger Two Step
Below you have the code from Chris's slides in Topic 4, so you can reproduce the same example.
Step 1
```{r}
set.seed(98105)
# Generate cointegrated data
e1 <- rnorm(100)
x <- cumsum(e1) # x is a cumulative sum of e1, so it must be non-stationary
e2 <- rnorm(100)
y <- 0.6*x + e2 # define the relationship between x and y. They are both non-stationary
#Run step 1 of the Engle-Granger two step
coint.reg <- lm(y ~ x - 1)
#Estimate the cointegration vector by least squares with no constant
coint.err <- residuals(coint.reg)
#These residuals are the cointegration (equilibrium) error
#Check for stationarity of the cointegration vector
adf.test(coint.err)$statistic %>% # get the ADF test statistic
punitroot(trend="nc") # punitroot() comes from the urca package; it computes the cumulative probability of the ADF test statistic. "nc" means a test regression without intercept (see ?urca::punitroot)
#Make the first lag of the cointegration error term
coint.err.lag <- coint.err[2:(length(coint.err)-1)]
#Make the difference of y and x
dy <- diff(y)
dx <- diff(x)
#And their lags
dy.lag <- dy[1:(length(dy)-1)]
dx.lag <- dx[1:(length(dx)-1)]
#Delete the first dy, because we are missing lags for this obs
dy <- dy[2:length(dy)]
```
Step 2
```{r}
#Estimate an Error Correction Model with LS
ecm1 <- lm(dy ~ coint.err.lag + dy.lag + dx.lag)
summary(ecm1)
```
### Cointegration Analysis: Johansen Estimator
```{r}
#Alternatively, we can use the Johansen estimator
#Create a matrix of the cointegrated variables
cointvars <- cbind(y,x)
# Perform cointegration tests
coint.test1 <- ca.jo(cointvars,
ecdet = "const",
type="eigen",
K=2,
spec="longrun")
summary(coint.test1)
ecm.res1 <- cajorls(coint.test1,
r = 1, # Cointegration rank
reg.number = 1) # which variable(s) to put on LHS
#(column indexes of cointvars)
summary(ecm.res1$rlm)
```
```{r warning=FALSE, message=FALSE}
#For the presidential approval example, estimate an ECM with the same
#covariates used in the ARIMA models above
cointvars <- cbind(approve,avg.price)
ecm.test1 <- ca.jo(cointvars,
ecdet = "const",
type="eigen",
K=2,
spec="longrun",
dumvar=cbind(sept.oct.2001,iraq.war))
summary(ecm.test1)
ecm.res1 <- cajorls(ecm.test1,
r = 1,
reg.number = 1)
summary(ecm.res1$rlm)
```
```{r warning=FALSE, message=FALSE}
#Cointegration analysis with a spurious regressor
phony <- rnorm(length(approve))
for (i in 2:length(phony)){
phony[i] <- phony[i-1] + rnorm(1)
}
cointvars <- cbind(approve,avg.price,phony)
ecm.test1 <- ca.jo(cointvars,
ecdet = "const",
type="eigen",
K=2,
spec="longrun",
dumvar=cbind(sept.oct.2001,iraq.war))
summary(ecm.test1)
ecm.res1 <- cajorls(ecm.test1,
r = 1,
reg.number = 1)
summary(ecm.res1$rlm)
```