Review Problem Set 1

Simulate to understand

If you are uncertain about the appearance of a specific dynamic process, it is advisable to simulate it.

Here, we use Chris’s makeTS() function to simulate various dynamic processes. Its output is wrapped in ts() and passed to plot_time(), a custom function I wrote that plots the time series, its autocorrelation function (ACF) and partial autocorrelation function (PACF), and reports the Q-statistic test for white noise.
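
Since plot_time() is a custom helper that is not shown here, the following is a minimal sketch of what such a function might look like (the actual plot_time() may differ):

# Hypothetical sketch of a plot_time()-style helper; the real function may differ
plot_time_sketch <- function(y) {
  op <- par(mfrow = c(3, 1))               # stack three panels
  plot(y, main = "Time series")            # the series itself
  acf(y, main = "ACF")                     # autocorrelation function
  pacf(y, main = "PACF")                   # partial autocorrelation function
  par(op)                                  # restore the plotting layout
  print(Box.test(y, type = "Box-Pierce"))  # Q-statistic test for white noise
  invisible(y)
}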

It is important to note that in the Q-test (Box-Pierce):

  • the null hypothesis \(H_0\) posits that the time series adheres to a white noise process.

  • the alternative hypothesis \(H_1\) posits that your time series has autocorrelation.

Therefore, if we reject the null hypothesis in favor of the alternative, \(Q: p\text{-value}\leq 0.10\), we have evidence that the time series (or its residuals) still contains dynamics left to model. In contrast, if we fail to reject the null hypothesis, \(Q: p\text{-value} > 0.10\), we have evidence that our time series behaves like white noise and no remaining dynamics should confound our analyses.
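
As a quick sketch of that decision rule (here y stands for any series or vector of residuals):

# Sketch of the decision rule above; y is any series or vector of residuals
q <- Box.test(y, type = "Box-Pierce")
q$p.value <= 0.10  # TRUE: reject H0, dynamics remain to model; FALSE: consistent with white noise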

In your simulations, remember to set.seed() to replicate the same simulations (below, I am not setting any seed).

# What does a time trend process look like?

plot_time(ts(makeTS(n=100, trend=0.02)))

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 7.6741, df = 1, p-value = 0.005602
# What does an AR process look like?

plot_time(ts(makeTS(n=100, ar=0.6)))

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 50.122, df = 1, p-value = 1.445e-12
# What does an MA process look like?

plot_time(ts(makeTS(n=100, ma=0.8)))

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 30.442, df = 1, p-value = 3.44e-08
# What about additive seasonality?

plot_time(ts(makeTS(n=100,
             seasons=c(0,4,0, 
                       0,0,6,
                       0,0,0,
                       -3,0,0), 
             adjust="level")))

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 0.13482, df = 1, p-value = 0.7135
# What does an ARMA process look like?

plot_time(ts(makeTS(n=100, ar=0.6, ma=0.8)))

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 58.217, df = 1, p-value = 2.343e-14

Problem Set 1

For the initial twelve time series (A-L), each time series’ data-generating process consists of at most one dynamic component. Hence, you can initially infer the process by visually inspecting the plot, autocorrelation function (ACF), and partial autocorrelation function (PACF).

If you are unsure of what is going on, you can employ the Box-Jenkins diagnostics method to systematically identify and isolate the dynamic components until the residual series exhibit characteristics of white noise. Remember:

1- Visualize the time series, using plot() or decompose().

2- If a deterministic trend or additive seasonality seems to be present, you can verify it by fitting time() or decompose()$seasonal in lm() and checking the residuals.

3- If neither a trend nor additive seasonality seems to be present, check for AR or MA processes; but be sure to adjust for any trend and/or seasonality first, otherwise it may be difficult to identify what is going on.

4- If you are still unsure, narrow it down to two or three potential candidates and estimate them with arima(). Select the model with the lowest AIC.

5- If, in any of the above steps, the Q-statistic from Box.test() provides evidence of white noise, then you have very likely found the best candidate model of the underlying DGP.

6- If you have more than one final candidate, use the most parsimonious specification, that is, the one that estimates fewer parameters but still returns white-noise residuals. (A sketch of this workflow appears below.)
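
A minimal sketch of this workflow for a generic ts object y (the orders below are placeholders, not recommendations):

# Sketch of the Box-Jenkins workflow for a generic ts object y
plot(y)                                  # 1. visualize (or plot(decompose(y)) if seasonal)
trend_fit <- lm(y ~ time(y))             # 2. check for a deterministic trend...
plot_time(ts(resid(trend_fit)))          # ...and inspect the residuals
cand1 <- arima(y, order = c(1, 0, 0))    # 4. estimate a few candidate models
cand2 <- arima(y, order = c(0, 0, 1))
AIC(cand1, cand2)                        # ...prefer the lowest AIC
Box.test(residuals(cand1))               # 5. Q-test: white-noise residuals?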

# load data

df <- read_csv(file="data/mysterytsUW.csv")


# time series: C

df_1c <- df %>% select(c) %>% ts(frequency = 12)

plot_time(df_1c)

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 11.868, df = 1, p-value = 0.0005712
# time series: D

df_1d <- df %>% select(d) %>% ts(frequency = 12)

plot_time(df_1d)

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 25.41, df = 1, p-value = 4.634e-07
# time series: G

df_1g <- df %>% select(g) %>% ts(frequency = 12)

plot_time(df_1g)

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 59.936, df = 1, p-value = 9.77e-15
# time series: H

df_1h <- df %>% select(h) %>% ts(frequency = 12)

plot_time(df_1h)

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 21.993, df = 1, p-value = 2.736e-06
# time series: I

df_1i <- df %>% select(i) %>% ts(frequency = 12)

plot_time(df_1i)

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 3.7496, df = 1, p-value = 0.05282
# in this last one, the Q-test is spurious. Look at the visualizations.

The six remaining time series (M-R) could have up to two components. These are more difficult to identify by eyeballing alone.

# time series: Q

df_1q <- df %>% select(q) %>% ts(frequency = 12)

plot_time(df_1q)

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 10.661, df = 1, p-value = 0.001094
# could it be AR(2)? or MA(2)? or ARMA(1,1)?

res1 <- arima(df_1q, order = c(2,0,0))
res2 <- arima(df_1q, order = c(0,0,2))
res3 <- arima(df_1q, order = c(1,0,1))

stargazer(res1,res2,res3,
          type = "text")
## 
## ===============================================
##                        Dependent variable:     
##                   -----------------------------
##                               df_1q            
##                      (1)       (2)       (3)   
## -----------------------------------------------
## ar1               0.257***            0.763*** 
##                    (0.097)             (0.127) 
##                                                
## ar2               0.294***                     
##                    (0.099)                     
##                                                
## ma1                           0.165   -0.459***
##                              (0.103)   (0.163) 
##                                                
## ma2                         0.352***           
##                              (0.100)           
##                                                
## intercept           0.211     0.222     0.213  
##                    (0.210)   (0.146)   (0.217) 
##                                                
## -----------------------------------------------
## Observations         100       100       100   
## Log Likelihood    -137.752  -139.042  -139.119 
## sigma2              0.918     0.942     0.944  
## Akaike Inf. Crit.  283.505   286.083   286.238 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
plot_time(res1$residuals)

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 0.092565, df = 1, p-value = 0.7609

For the second part, we know that all the series follow an AR(p) process. However, some of these series may not be stationary.

If you were uncertain, you could test for a unit root with PP.test() (the Phillips-Perron test), where the null hypothesis \(H_0\) states that the time series has a unit root.

If the sum of your AR(p) estimates is below 1, you have evidence of stationarity; if the sum is at or above 1, your series is non-stationary.
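
For example, after fitting an AR(2) to a series y (a placeholder for any of the series below), you could inspect the sum of the AR coefficients:

# Sketch: does the sum of the AR coefficients suggest a unit root?
fit <- arima(y, order = c(2, 0, 0))
sum(fit$coef[c("ar1", "ar2")])  # a sum at or near 1 points to non-stationarity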

# load data

df2 <- read_csv(file="data/mysterytsUW2.csv")


# time series: S

df_2s <- df2 %>% select(s) %>% ts(frequency = 12)

plot_time(df_2s)

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 984.31, df = 1, p-value < 2.2e-16
adf.test(df_2s)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  df_2s
## Dickey-Fuller = -1.9746, Lag order = 9, p-value = 0.5891
## alternative hypothesis: stationary
PP.test(df_2s)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  df_2s
## Dickey-Fuller = -2.0696, Truncation lag parameter = 7, p-value = 0.5489
# time series: U

df_2u <- df2 %>% select(u) %>% ts(frequency = 12)

plot_time(df_2u)

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 910.15, df = 1, p-value < 2.2e-16
(res <- arima(df_2u, order = c(2,0,0)))
## 
## Call:
## arima(x = df_2u, order = c(2, 0, 0))
## 
## Coefficients:
##          ar1      ar2  intercept
##       1.1517  -0.2078     0.4623
## s.e.  0.0309   0.0309     0.5734
## 
## sigma^2 estimated as 1.062:  log likelihood = -1450.33,  aic = 2908.66
plot_time(res$residuals)

## 
##  Box-Pierce test
## 
## data:  df
## X-squared = 0.002517, df = 1, p-value = 0.96

Non-Stationary Time Series

For Problem Set 2, the most important techniques are unit-root testing, model estimation, model selection, and counterfactual forecasting.

Up to this point, our focus has primarily been on autoregressive \(AR(p)\) and moving average \(MA(q)\) processes, since the data-generating processes (DGPs) we have encountered were either stationary or contained unit roots. More generally, a nonstationary time series may be integrated of order \(d\), written \(I(d)\), meaning it must be differenced \(d\) times to become stationary. Allowing additionally for multiplicative seasonality, we arrive at the comprehensive dynamic specification, the (seasonal) Autoregressive Integrated Moving Average (ARIMA) model:

\[\text{ARIMA}(p,d,q)(P,D,Q)_m \]
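
In R's arima(), the non-seasonal order (p, d, q) is passed through order and the seasonal part (P, D, Q)_m through seasonal; a sketch with purely illustrative orders for a monthly series y:

# Sketch: fitting an ARIMA(p,d,q)(P,D,Q)_m with arima(); the orders are illustrative
arima(y, order = c(1, 1, 1),
      seasonal = list(order = c(0, 1, 1), period = 12))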

Recall that stationary time series have three properties:

  1. Mean stationarity
    • mean does not depend on \(t\), constant over time
    • variance also does not depend on \(t\), constant over time
  2. Covariance stationarity
    • covariance of \(y_t\) and \(y_{t-1}\) do not depend on \(t\)
    • does not depend on the actual time the covariance is computed
  3. Ergodicity
    • sample moments converge in probability to the population moments
    • sample mean and variance tend to be the same as entire series

Nonstationary processes lack these properties. Note that we are abstracting away from trends and other covariates, which means that the social processes remain a black box in the time dimension.

Why do nonstationary processes matter?

  1. The ACF and PACF are not well defined, since the covariances in the numerator depend on \(t\)

  2. Spurious regression: we may detect strong correlations between nonstationary processes even though they are conditionally (mean) independent, as the short simulation after this list illustrates.

  3. Long-run forecasts are infeasible, since the process does not revert to its mean.
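
The spurious-regression problem is easy to see by simulation; here is a minimal sketch with two independent random walks:

# Sketch: spurious regression between two independent random walks
set.seed(123)               # illustrative seed
rw_x <- cumsum(rnorm(200))  # random walk 1
rw_y <- cumsum(rnorm(200))  # random walk 2, independent of rw_x
summary(lm(rw_y ~ rw_x))    # the slope will often look "significant" anyway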

How can we treat nonstationary processes?

  1. Analyze nonstationary process using ARIMA (differencing)
    • effective at capturing short run changes.
    • outcome is transformed to a difference.
    • long-run predictions not feasible.
  2. Analyze nonstationary process using cointegration
    • effective at capturing long run relationships between nonstationary processes.
    • outcome is left as a level.
    • short-run and long-run predictions feasible.
    • appropriate for analyzing multiple time series.

We will explore non-stationary modeling using the example discussed in Topic 4 on presidential approval.

# clean environment
rm(list = ls())
dev.off()
## null device 
##           1
# load source
source("source/TS_code.R")    # Chris's TS custom functions


# load data
approval <-  read_csv(file="data/approval.csv")

attach(approval)
colnames(approval)
## [1] "month"         "year"          "approve"       "disapprove"   
## [5] "unsure"        "sept.oct.2001" "iraq.war"      "avg.price"

As always, the first step is to plot the time series and try to make sense of its underlying DGP.

# plot presidential approval
plot(approve,
     type="l",
     ylab="Percent Approving",
     xlab="Time",
     main = "US Presidential Approval")

lines(x=c(8,8),
      y=c(-1000,1000),
      col="red")

lines(x=c(26,26),
      y=c(-1000,1000),
      col="blue")

text("9/11",
     x = 10, y = 40, 
     col="red",cex=0.7)

text("Iraq \n War",
     x = 28, y = 40, 
     col="blue",cex=0.7)

In the plot above, there is a large shock in the 8th period, followed by a weaker one around the 26th period. These shocks correspond to two major events: the 9/11 attacks and the Iraq War. We call these kinds of shocks structural breaks.

In U.S. politics, domestic political shocks often propagate through markets. In particular, presidential approval ratings are strongly negatively correlated with oil prices, with a correlation of about -0.84.
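
That correlation is easy to check directly (the -0.84 is the figure reported above):

# check the correlation between approval and oil prices
cor(approve, avg.price)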

# plot average oil price
plot(avg.price,
     type="l",
     ylab="$ per Barrel",
     xlab="Time",
     main = "Average Price of Oil")

lines(x=c(8,8),
      y=c(-1000,1000),
      col="red")

lines(x=c(26,26),
      y=c(-1000,1000),
      col="blue")

text("9/11",x = 10, y = 175, 
     col="red",cex=0.7)

text("Iraq \n War",x = 28, y = 175, 
     col="blue",cex=0.7) 

As always, after looking at the time series plots, we need to look at the ACF and PACF.

# Look at the ACF and PACF of approvals
acf(approve)

pacf(approve)

# Look at the ACF and PACF of oil prices
acf(avg.price)
pacf(avg.price)

Unit Root Tests

Unit-root tests are the first step to study and analyze non-stationary time series.

Intuition: if a time series is stationary, then regressing \(y_{t}-y_{t-1}\) on \(y_{t-1}\) should produce a negative coefficient. Why?

In a stationary series, knowing the past value of the series helps to predict the next period’s change. Positive shifts should be followed by negative shifts (mean reversion).

\[ \begin{aligned} y_t & = \rho y_{t-1} + \epsilon_{t} \\ y_t - y_{t-1} & = \rho y_{t-1} - y_{t-1} + \epsilon_{t} \\ \Delta y_{t} & = \gamma y_{t-1} + \epsilon_{t} \quad \quad \text{, where } \gamma=(\rho - 1) \\ \end{aligned} \]
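
You can run this regression by hand to build intuition; here is a sketch with a simulated stationary AR(1), where the estimated \(\gamma\) should come out negative:

# Sketch: regress the first difference on the lagged level of a stationary AR(1)
set.seed(123)                                      # illustrative seed
y_sim <- arima.sim(model = list(ar = 0.6), n = 200)
dy_sim <- diff(y_sim)                              # y_t - y_{t-1}
y_lag  <- y_sim[1:(length(y_sim) - 1)]             # y_{t-1}
coef(lm(dy_sim ~ y_lag))["y_lag"]                  # gamma = rho - 1, negative for a stationary series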

ADF test

Augmented Dickey-Fuller test: null hypothesis of unit root.

The Phillips-Perron test has the same null hypothesis, but differs in how the AR(\(p\)) time series is modeled with respect to lags, serial correlation, and heteroskedasticity.

The ADF test regression can be expressed as follows:

\[\Delta y_t = \boldsymbol{\beta D_t} + \pi y_{t-1} + \sum^p_{j=1}\psi_j\Delta y_{t-j}+\epsilon_t\] where \(\boldsymbol{D}_t\) is a vector of deterministic trends. The \(p\) lagged difference terms, \(\Delta y_{t-j}\), are used to approximate the ARMA structure of the errors (consider the equivalence of AR and MA models), and the value of \(p\) is set so that the error is serially uncorrelated. The error term is assumed to be homoskedastic.

Under the null hypothesis of a unit root (\(\rho = 1\), so \(\pi = \rho - 1 = 0\)), the \(t\)-statistic on \(\pi\) is the ADF test statistic. If the ADF statistic is not statistically significant, we cannot reject the null of a unit root, and the time series is likely non-stationary.

Conflicting results from PP and ADF tests may suggest a structural break within the time series of presidential approval.

adf.test(approve)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  approve
## Dickey-Fuller = -3.9565, Lag order = 3, p-value = 0.01721
## alternative hypothesis: stationary
adf.test(avg.price)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  avg.price
## Dickey-Fuller = -3.0115, Lag order = 3, p-value = 0.1649
## alternative hypothesis: stationary

Notice that the ADF test suggests that presidential approval follows a stationary process, while oil prices appear nonstationary.

PP test

The PP test regression can be expressed as follows:

\[\Delta y_t = \boldsymbol{\beta D_t} + \pi y_{t-1} +u_t\]

The PP test ignores any serial correlation in the test regression. Instead, it corrects for serial correlation and heteroskedasticity in the errors \(u_t\) by directly modifying the test statistic, \(t_{\pi=0}\)

Both the ADF and PP tests take \(I(1)\) non-stationarity (a unit root) as the null hypothesis, but they differ in how the time series is modeled: the PP test allows the residuals to be serially correlated.

PP.test(approve)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  approve
## Dickey-Fuller = -2.8387, Truncation lag parameter = 3, p-value = 0.235
PP.test(avg.price)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  avg.price
## Dickey-Fuller = -2.3318, Truncation lag parameter = 3, p-value = 0.4405

In contrast to the ADF test, the PP test suggests that both presidential approval and oil prices are nonstationary.

Sometimes the ADF and PP tests provide conflicting evidence. This happens because some underlying temporal dependence is not incorporated into the assumed models, or because our sample is too small. For example, if there is a structural break in the time series (i.e., persistence arises from large and infrequent shocks), the ADF test tends to provide evidence of a non-stationary process.

If we suspect that our time series have particular forms of temporal dependence ignored by those tests, there are other statistical tests to consider (sketched after the list below):

  • Higher-order differencing non-stationarity: Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.

  • Structural break: Zivot-Andrews test, etc.
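
A hedged sketch of how these could be called for the approval series (kpss.test() is in the tseries package, ur.za() in urca; the argument choices here are illustrative):

# Sketch: alternative tests, with illustrative arguments
kpss.test(approve, null = "Level")            # KPSS: the null here is stationarity
summary(ur.za(approve, model = "intercept"))  # Zivot-Andrews: unit root allowing one structural break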

Differencing the Time Series

Recall that we can use differencing \(d\) to address integration processes \(I(d)\) of nonstationarity. We can difference a random walk as follows:

\[ \begin{aligned} y_{t} =& y_{t-1} + \boldsymbol{x_t \beta} + e_t \\ y_{t}-y_{t-1} =& y_{t-1} - y_{t-1} + \boldsymbol{x_t \beta} + e_t \\ \Delta y_t =& \boldsymbol{x_t\beta}+e_t \end{aligned} \]

If \(y_{t}\)’s DGP is an \(I(1)\) process, differencing once yields a stationary \(I(0)\) series (here, AR(0) noise around the covariates). But note that we are now explaining the short-run change in the variable, not its level.

diff(approve) # get an order-1 differenced time series
##  [1] -0.67  2.50 -5.50 -1.00  2.50 -0.50 19.67 12.33 -1.00 -1.00 -2.33 -1.67
## [13] -2.75 -3.00  0.08 -2.93 -2.90 -4.00  0.70 -2.45  1.58 -3.58 -2.58 -1.42
## [25]  6.45  4.80 -3.67 -4.33 -2.33 -0.17 -8.50  3.67 -3.00  6.83 -3.25 -3.92
## [37] -0.66  1.33 -4.67  1.17 -0.75  2.25  2.67 -3.50  4.83 -3.00  0.67  0.58
## [49] -2.58 -1.17 -0.50 -1.75  1.08 -3.58  0.25 -3.25 -2.42  3.92  0.75 -3.33
## [61] -3.17 -0.83  1.33  3.00
diff(approve, differences = 2) # order-2 differenced time series
##  [1]   3.17  -8.00   4.50   3.50  -3.00  20.17  -7.34 -13.33   0.00  -1.33
## [11]   0.66  -1.08  -0.25   3.08  -3.01   0.03  -1.10   4.70  -3.15   4.03
## [21]  -5.16   1.00   1.16   7.87  -1.65  -8.47  -0.66   2.00   2.16  -8.33
## [31]  12.17  -6.67   9.83 -10.08  -0.67   3.26   1.99  -6.00   5.84  -1.92
## [41]   3.00   0.42  -6.17   8.33  -7.83   3.67  -0.09  -3.16   1.41   0.67
## [51]  -1.25   2.83  -4.66   3.83  -3.50   0.83   6.34  -3.17  -4.08   0.16
## [61]   2.34   2.16   1.67
# Look at the ACF and PACF of approvals
diff(approve) %>%  acf()

diff(approve) %>%  pacf()

# Look at the ACF and PACF of oil prices
diff(avg.price) %>% acf()

diff(avg.price) %>% pacf()

# Check unit root tests
diff(approve) %>%  adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -4.3461, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
diff(avg.price) %>% adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -5.3361, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
diff(approve) %>%  PP.test()
## 
##  Phillips-Perron Unit Root Test
## 
## data:  .
## Dickey-Fuller = -6.703, Truncation lag parameter = 3, p-value = 0.01
diff(avg.price) %>% PP.test()
## 
##  Phillips-Perron Unit Root Test
## 
## data:  .
## Dickey-Fuller = -5.4685, Truncation lag parameter = 3, p-value = 0.01

This model can be estimated with the arima() function.

Forecasting with a Non-stationary Time Series

As explained in the lecture (Topic 4), we estimate an ARIMA(0,1,0) which fits a little better than ARIMA(2,1,2) on the AIC criterion.

xcovariates <- cbind(sept.oct.2001, iraq.war, avg.price)

(arima.res1a <- 
  arima(approve,
        order = c(0,1,0),
        xreg = xcovariates, 
        include.mean = FALSE)) # does FALSE matter? (with d = 1, arima() ignores include.mean anyway)
## 
## Call:
## arima(x = approve, order = c(0, 1, 0), xreg = xcovariates, include.mean = FALSE)
## 
## Coefficients:
##       sept.oct.2001  iraq.war  avg.price
##             11.2072    5.6899    -0.0710
## s.e.         2.5192    2.4891     0.0337
## 
## sigma^2 estimated as 12.35:  log likelihood = -171.25,  aic = 350.5
(arima.res2a <- 
  arima(approve,
        order = c(2,1,2),
        xreg = xcovariates, 
        include.mean = FALSE)) # does FALSE matter? (ignored when d > 0)
## 
## Call:
## arima(x = approve, order = c(2, 1, 2), xreg = xcovariates, include.mean = FALSE)
## 
## Coefficients:
##           ar1      ar2     ma1    ma2  sept.oct.2001  iraq.war  avg.price
##       -0.7338  -0.6364  0.8715  1.000         9.7795    5.7884    -0.0511
## s.e.   0.1382   0.1162  0.0672  0.069         1.9202    2.7251     0.0307
## 
## sigma^2 estimated as 10.26:  log likelihood = -167.42,  aic = 350.84
# Check out the Q-test statistic:

Box.test(arima.res1a$residuals)
## 
##  Box-Pierce test
## 
## data:  arima.res1a$residuals
## X-squared = 0.23871, df = 1, p-value = 0.6251
Box.test(arima.res2a$residuals)
## 
##  Box-Pierce test
## 
## data:  arima.res2a$residuals
## X-squared = 0.0019896, df = 1, p-value = 0.9644
# extract estimates and statistics from final model:

(pe.1a <- arima.res1a$coef)                    # parameter estimates (betas)
## sept.oct.2001      iraq.war     avg.price 
##   11.20723939    5.68986003   -0.07101481
(se.1a <- sqrt(diag(arima.res1a$var.coef)))    # standard errors
## sept.oct.2001      iraq.war     avg.price 
##    2.51920817    2.48908383    0.03367682
(ll.1a <- arima.res1a$loglik)                  # log likelihood at its maximum
## [1] -171.2507
(sigma2hat.1a <- arima.res1a$sigma2)           # estimated innovation variance (sigma^2)
## [1] 12.35062
(aic.1a <- arima.res1a$aic)                    # Akaike Information Criterion
## [1] 350.5013
(resid.1a <- arima.res1a$resid)                # residuals (white noise)
## Time Series:
## Start = 1 
## End = 65 
## Frequency = 1 
##  [1]  0.06896534 -0.95760998  3.51089580 -4.43371265 -1.60717662  1.11059527
##  [7] -0.49893478  9.17823481 10.86638479  9.17823481 -1.60007513 -2.17909353
## [13] -1.62384037 -1.78774934 -1.95075620  0.04271723 -2.99746407 -2.79525316
## [19] -4.00887685  0.72734070 -2.12581740  1.39358613 -3.81576917 -2.07082382
## [25] -0.31571972  1.32825844  4.06144599 -4.32156087  1.32825844 -2.19010083
## [31]  0.59518456 -8.08420829  2.85119925 -3.36572627  6.59423083 -2.58991735
## [37] -3.38206282 -0.03151894  1.76851644 -3.35161007  1.06951405 -1.16188589
## [43]  2.01387576  2.61141278 -2.57858286  4.68726023 -3.98284495  0.59756490
## [49]  1.14243729 -1.37807436 -0.01068325 -1.07735040 -1.79047844  2.03514918
## [55] -2.18668945  3.21167260 -4.57407111 -5.68703628  3.41046875  1.67745340
## [61] -3.58281272 -2.14206064  1.42294481  2.50032405  2.84163698

Some counterfactual stories might help:

  • What if 9/11 had not happened?
  • What if Bush had not invaded Iraq?
  • What if the price of oil had remained at pre-war levels?

Step 1: construct counterfactual scenarios

We will create new covariate objects for the hypothetical scenarios. In xhyp_no911, we set the September-October 2001 dummy to 0, as if the 9/11 attacks had never happened. In xhyp_noiraq, we do the same for the Iraq war dummy, and in xhyp_oilprewar we hold the price of oil at its pre-war level.

(xhyp_no911 <- xcovariates[8:65, ]) # recall 8th period -> 9/11 shock
(xhyp_no911[, "sept.oct.2001"] <- 0)
(xhyp_noiraq <- xcovariates[26:65, ])
(xhyp_noiraq[, "iraq.war"] <- 0)
(xhyp_oilprewar <- xcovariates[26:65, ])
(xhyp_oilprewar[, "avg.price"] <- xcovariates[25, "avg.price"])

Step 2: Generate Predicted Values and Prediction Intervals

Now we use base R's predict() function to forecast how presidential approval might have evolved absent the 9/11 shock. We then build a data object, pred_no911, that contains the original approval values up to the counterfactual period, followed by the forecasts.

approve_no911 <- 
  predict(arima.res1a, 
          newxreg = xhyp_no911) # predict after september 2001; it will return a list containing expected predictions and standard errors of predictions

# construct a data.frame for plot

# we input the original values up until the counterfactual period

pred_no911 <- 
  tibble(
  mean = c(approve[1:7], approve_no911$pred),
  # uncertainty of forecast:
  se = c(rep(0, 7), approve_no911$se),  # one forecast SE per predicted period
  lwr = mean - 1.96*se,
  upr = mean + 1.96*se
)

# let's see how it goes compared to original time series
plot(ts(approve, 
        start = c(2001, 2), 
        frequency = 12),
     ylab="Percent Approving",
     xlab="Time",
     main = "US Presidential Approval")

lines(ts(c(approve[7], 
           approve_no911$pred), 
         start = c(2001, 8), 
         frequency = 12), 
      lty = "dashed")

Step 3: Simulate Expected Values and Confidence Intervals

Note: there seems to be a code bug in Chris’s code for this topic. I will contact him to double-check, but my sense is that the bug originates in his function simcf::ldvsimev. Below we simulate the same expected values directly.

# First we need the simulated parameters
sims <- 10000

simparams <- mvrnorm(n = sims, 
                     mu = arima.res1a$coef, 
                     Sigma = arima.res1a$var.coef)

head(simparams)
##      sept.oct.2001 iraq.war   avg.price
## [1,]     10.877991 6.204078 -0.04375834
## [2,]     13.510364 6.628567 -0.06750429
## [3,]     15.163841 4.682793 -0.07748665
## [4,]     12.662988 3.812799 -0.05939149
## [5,]      9.364514 7.997687 -0.04502996
## [6,]     10.682281 1.374317 -0.09134948
dim(simparams)
## [1] 10000     3
sim_approve <- matrix(NA, nrow = sims, ncol = 65 - 8 + 1)

model <- arima.res1a

## Let's break the process step by step

# For the first simulation

(model$coef <- simparams[1, ])
## sept.oct.2001      iraq.war     avg.price 
##   10.87799111    6.20407762   -0.04375834
(sim_approve[1, ] <- predict(model, newxreg = xhyp_no911)[["pred"]])
## Time Series:
## Start = 66 
## End = 123 
## Frequency = 1 
##  [1] 53.82639 54.72825 55.36231 55.73207 55.63908 55.61064 55.01771 54.37118
##  [9] 54.39416 54.43573 54.37118 54.37665 54.35980 54.16005 54.27491 54.42019
## [17] 54.10644 53.42600 53.07594 53.53102 53.93250 53.95198 53.86577 53.39428
## [25] 53.13807 53.64261 53.86796 54.01324 53.60650 53.27504 52.88777 52.61757
## [33] 51.80519 51.86711 52.12091 52.26641 52.30251 51.73474 51.82270 52.42831
## [41] 52.47295 52.12638 51.38577 50.67141 51.02717 51.05211 50.46356 49.60502
## [49] 47.78008 48.59596 50.60906 50.92302 50.35154 50.50732 49.87392 48.48569
## [57] 47.76455 47.86213
# For the second simulation

(model$coef <- simparams[2, ])
## sept.oct.2001      iraq.war     avg.price 
##   13.51036446    6.62856663   -0.06750429
(sim_approve[2, ] <- predict(model, newxreg = xhyp_no911)[["pred"]])
## Time Series:
## Start = 66 
## End = 123 
## Frequency = 1 
##  [1] 50.21344 51.60471 52.58284 53.15326 53.00981 52.96593 52.05125 51.05387
##  [9] 51.08931 51.15344 51.05387 51.06231 51.03632 50.72816 50.90536 51.12948
## [17] 50.64547 49.59578 49.05574 49.75779 50.37714 50.40718 50.27420 49.54684
## [25] 49.15160 49.92993 50.27757 50.50169 49.87423 49.36289 48.76548 48.34864
## [33] 47.09542 47.19094 47.58246 47.80692 47.86261 46.98674 47.12242 48.05668
## [41] 48.12554 47.59090 46.44839 45.34638 45.89519 45.93367 45.02574 43.70130
## [49] 40.88604 42.14466 45.25019 45.73453 44.85293 45.09324 44.11612 41.97454
## [57] 40.86207 41.01261
# and so on... 

# Loop predict function for each multivariate simulation

for (i in 1:sims) {
  model$coef <- simparams[i, ]
  sim_approve[i, ] <- predict(model, newxreg = xhyp_no911)[["pred"]]
}


dim(sim_approve)
## [1] 10000    58
# Now we create a new data frame with the mean of all the simulations

# Question: why do we take the mean? Why do all these simulations?

exp_no911 <- 
  data.frame(
  mean = c(approve[1:7], 
           colMeans(sim_approve)),
  # confidence intervals using the quantile function
  lwr = c(approve[1:7], 
          apply(sim_approve, 2, quantile, 0.025)),
  upr = c(approve[1:7], 
          apply(sim_approve, 2, quantile, 0.975))
)

Step 4: Visualize

Finally, we visualize the expected values of presidential approval under the counterfactual in which the 9/11 shock never occurred.

plot_data <- 
  bind_rows(pred_no911, exp_no911, .id = "type") %>% 
  mutate(
    type = factor(type,
                  levels=1:2,
                  labels=c("pred.","expect."))
  )

plot_data$time <- 
  paste0(approval$year, "-",approval$month) %>% lubridate::ym()  %>%  rep(2)

ggplot(plot_data, aes(x = time, 
                      y = mean, 
                      ymax = upr, 
                      ymin = lwr)) +
  geom_line() +
  geom_ribbon(aes(fill = type), alpha = 0.3) +
  labs(x = "Time", y = "Approval Rate", title = "US Presidential Approval") +
  coord_cartesian(ylim = c(10,70))+
  theme_classic()

Cointegration Analysis

  • Thus far, we have been examining a single time series with covariates. This assumes that there is no feedback between variables.

  • Yet, we may be interested in the relationship between two potentially non-stationary time series that influence each other.

  • The original idea behind cointegration is that two time series may be in equilibrium in the long run but in the short run the two series deviate from that equilibrium.

  • Cointegrated time series are two non-stationary time series that are causally connected and do not tend toward any particular level but tend toward each other.

  • Cointegration means that a specific combination of two non-stationary series may be stationary. We say that these two series are cointegrated and the vector that defines the stationary linear combination is called the cointegrating vector.

More on ECM applications in political science: Grant and Lebo (2016).

Engle-Granger Two Step

Below is the code from Chris’s slides in Topic 4, so you can reproduce the same example.

Step 1

set.seed(98105)

# Generate cointegrated data
e1 <- rnorm(100)
x <- cumsum(e1) # x is a cumulative sum of e1, so it must be non-stationary
e2 <- rnorm(100)
y <- 0.6*x + e2 # define the relationship between x and y. They are both non-stationary

#Run step 1 of the Engle-Granger two step
coint.reg <- lm(y ~ x - 1)              
#Estimate the cointegration vector by least squares with no constant
coint.err <- residuals(coint.reg)       
#The residuals are the cointegration errors (deviations from the long-run equilibrium).

#Check for stationarity of the cointegration errors
adf.test(coint.err)$statistic  %>%  # get the ADF test statistic
  punitroot(trend="nc") # punitroot() comes from the urca package; it computes the cumulative probability (p-value) of the ADF test statistic. "nc" means the test regression has no intercept (see ?urca::punitroot)
## Dickey-Fuller 
##  1.835502e-06
#Make the lag of the cointegration error term
coint.err.lag <- coint.err[1:(length(coint.err)-2)]

#Make the difference of y and x
dy <- diff(y)
dx <- diff(x)

#And their lags
dy.lag <- dy[1:(length(dy)-1)]
dx.lag <- dx[1:(length(dx)-1)]

#Delete the first dy, because we are missing lags for this obs
dy <- dy[2:length(dy)]

Step 2

#Estimate an Error Correction Model with LS
ecm1 <- lm(dy ~ coint.err.lag + dy.lag + dx.lag)
summary(ecm1)
## 
## Call:
## lm(formula = dy ~ coint.err.lag + dy.lag + dx.lag)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.05134 -0.64656  0.02462  0.76963  2.04272 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.08009    0.11676   0.686    0.494    
## coint.err.lag -0.89254    0.14294  -6.244 1.22e-08 ***
## dy.lag        -0.85110    0.11222  -7.584 2.36e-11 ***
## dx.lag         0.56191    0.13088   4.293 4.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.152 on 94 degrees of freedom
## Multiple R-squared:  0.4014, Adjusted R-squared:  0.3822 
## F-statistic: 21.01 on 3 and 94 DF,  p-value: 1.688e-10

Cointegration Analysis: Johansen Estimator

#Alternatively, we can use the Johansen estimator
#Create a matrix of the cointegrated variables
cointvars <- cbind(y,x)
# Perform cointegration tests
coint.test1 <- ca.jo(cointvars,
                   ecdet = "const",
                   type="eigen",
                   K=2,
                   spec="longrun")
summary(coint.test1)
## 
## ###################### 
## # Johansen-Procedure # 
## ###################### 
## 
## Test type: maximal eigenvalue statistic (lambda max) , without linear trend and constant in cointegration 
## 
## Eigenvalues (lambda):
## [1] 3.520082e-01 2.399643e-02 1.460788e-17
## 
## Values of teststatistic and critical values of test:
## 
##           test 10pct  5pct  1pct
## r <= 1 |  2.38  7.52  9.24 12.97
## r = 0  | 42.52 13.75 15.67 20.20
## 
## Eigenvectors, normalised to first column:
## (These are the cointegration relations)
## 
##                y.l2      x.l2   constant
## y.l2      1.0000000  1.000000  1.0000000
## x.l2     -0.5565254 -4.955866  0.4169868
## constant -0.2979546 32.290892 32.4988636
## 
## Weights W:
## (This is the loading matrix)
## 
##            y.l2        x.l2      constant
## y.d -0.92569359 0.003754100 -1.582064e-17
## x.d  0.02021318 0.007824302 -4.297376e-18
ecm.res1 <- cajorls(coint.test1,
                    r = 1,           # Cointegration rank
                    reg.number = 1)  # which variable(s) to put on LHS
#(column indexes of cointvars)
summary(ecm.res1)
##      Length Class  Mode   
## rlm  12     lm     list   
## beta  3     -none- numeric
#For the presidential approval example, use an ECM roughly comparable to the
#ARIMA models we considered earlier

cointvars <- cbind(approve,avg.price)
ecm.test1 <- ca.jo(cointvars,
                   ecdet = "const",
                   type="eigen",
                   K=2,
                   spec="longrun",
                   dumvar=cbind(sept.oct.2001,iraq.war))
summary(ecm.test1)
## 
## ###################### 
## # Johansen-Procedure # 
## ###################### 
## 
## Test type: maximal eigenvalue statistic (lambda max) , without linear trend and constant in cointegration 
## 
## Eigenvalues (lambda):
## [1] 2.391542e-01 1.367744e-01 1.387779e-16
## 
## Values of teststatistic and critical values of test:
## 
##           test 10pct  5pct  1pct
## r <= 1 |  9.27  7.52  9.24 12.97
## r = 0  | 17.22 13.75 15.67 20.20
## 
## Eigenvectors, normalised to first column:
## (These are the cointegration relations)
## 
##               approve.l2 avg.price.l2 constant
## approve.l2     1.0000000    1.0000000  1.00000
## avg.price.l2   0.1535049    0.3382829 -1.05616
## constant     -76.0019182 -120.7778161 90.24200
## 
## Weights W:
## (This is the loading matrix)
## 
##              approve.l2 avg.price.l2     constant
## approve.d   -0.12619985   0.02231967 5.407866e-17
## avg.price.d -0.02220281  -0.58771490 3.727850e-16
ecm.res1 <- cajorls(ecm.test1,
                    r = 1,
                    reg.number = 1)
summary(ecm.res1$rlm)
## 
## Call:
## lm(formula = substitute(form1), data = data.mat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1403 -1.6753 -0.2261  1.6430  5.9537 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## ect1          -0.12620    0.03006  -4.198 9.37e-05 ***
## sept.oct.2001 19.55846    2.11737   9.237 5.40e-13 ***
## iraq.war       5.01870    1.62432   3.090  0.00307 ** 
## approve.dl1   -0.31757    0.09448  -3.361  0.00138 ** 
## avg.price.dl1 -0.05055    0.02593  -1.949  0.05613 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.668 on 58 degrees of freedom
## Multiple R-squared:  0.6301, Adjusted R-squared:  0.5983 
## F-statistic: 19.76 on 5 and 58 DF,  p-value: 1.915e-11
#Cointegration analysis with a spurious regressor
phony <- rnorm(length(approve))
for (i in 2:length(phony)){
    phony[i] <- phony[i-1] + rnorm(1) 
}

cointvars <- cbind(approve,avg.price,phony)

ecm.test1 <- ca.jo(cointvars,
                   ecdet = "const",
                   type="eigen",
                   K=2,
                   spec="longrun",
                   dumvar=cbind(sept.oct.2001,iraq.war))
summary(ecm.test1)
## 
## ###################### 
## # Johansen-Procedure # 
## ###################### 
## 
## Test type: maximal eigenvalue statistic (lambda max) , without linear trend and constant in cointegration 
## 
## Eigenvalues (lambda):
## [1] 2.522926e-01 2.289387e-01 7.351248e-02 4.440892e-16
## 
## Values of teststatistic and critical values of test:
## 
##           test 10pct  5pct  1pct
## r <= 2 |  4.81  7.52  9.24 12.97
## r <= 1 | 16.38 13.75 15.67 20.20
## r = 0  | 18.32 19.77 22.00 26.81
## 
## Eigenvectors, normalised to first column:
## (These are the cointegration relations)
## 
##                approve.l2 avg.price.l2     phony.l2     constant
## approve.l2     1.00000000    1.0000000   1.00000000   1.00000000
## avg.price.l2   0.09303864    0.6851653  -0.03481344   0.07966019
## phony.l2       1.00806669   -3.8070344  -0.83878975   3.65291691
## constant     -69.12999967 -160.7372332 -59.59236067 -92.56171519
## 
## Weights W:
## (This is the loading matrix)
## 
##              approve.l2 avg.price.l2     phony.l2      constant
## approve.d   -0.12669497  0.009643857 -0.003510338  6.872469e-16
## avg.price.d  0.03091869 -0.420502736 -0.053024935 -7.633789e-16
## phony.d      0.01535096  0.012992853 -0.012809245 -1.347668e-16
ecm.res1 <- cajorls(ecm.test1,
                    r = 1,
                    reg.number = 1)
summary(ecm.res1$rlm) 
## 
## Call:
## lm(formula = substitute(form1), data = data.mat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9919 -1.5593 -0.1705  1.7087  6.4629 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## ect1          -0.12669    0.02919  -4.340 5.90e-05 ***
## sept.oct.2001 19.10637    2.11016   9.054 1.26e-12 ***
## iraq.war       5.52734    1.65178   3.346  0.00145 ** 
## approve.dl1   -0.32650    0.09473  -3.447  0.00107 ** 
## avg.price.dl1 -0.04073    0.02578  -1.580  0.11972    
## phony.dl1      0.12774    0.31229   0.409  0.68404    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.664 on 57 degrees of freedom
## Multiple R-squared:  0.6377, Adjusted R-squared:  0.5996 
## F-statistic: 16.72 on 6 and 57 DF,  p-value: 5.14e-11