If you are uncertain about what a specific dynamic process looks like, it is advisable to simulate it. Here, we use Chris’s makeTS() function to simulate various dynamic processes for exploration. Its output is wrapped in ts() and passed to plot_time(), a custom function I wrote that directly displays the time series plot, its autocorrelation function (ACF) and partial autocorrelation function (PACF), and the Q-statistic test for white noise.
It is important to note that in the Q-test (Box-Pierce):
the null hypothesis \(H_0\) posits that the time series follows a white noise process;
the alternative hypothesis \(H_1\) posits that the time series exhibits autocorrelation.
Therefore, if we reject the null hypothesis in favor of the alternative, \(Q: p\text{-value}\leq 0.10\), we have evidence that the time series (or its residuals) still contains dynamics to model. In contrast, if we do not reject the null hypothesis, \(Q: p\text{-value} > 0.10\), we have evidence that our time series follows white noise and no remaining dynamics should confound our analyses.
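As a quick sanity check of this logic, here is a minimal sketch comparing the Q-test on pure white noise and on an AR(1) series (the seed, sample size, and AR coefficient are arbitrary illustrative choices):
# Sketch: Box-Pierce Q-test on white noise vs. an AR(1) series
set.seed(123)
wn  <- rnorm(100)                          # pure white noise
ar1 <- arima.sim(list(ar = 0.6), n = 100)  # AR(1) with rho = 0.6
Box.test(wn)   # should give a large p-value: fail to reject H0 (consistent with white noise)
Box.test(ar1)  # should give a small p-value: reject H0 (remaining dynamics to model)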
In your simulations, remember to use set.seed() so you can replicate the same simulations (below, I am not setting any seed).
# What does a time-trend process look like?
plot_time(ts(makeTS(n=100, trend=0.02)))
##
## Box-Pierce test
##
## data: df
## X-squared = 7.6741, df = 1, p-value = 0.005602
# What does an AR process look like?
plot_time(ts(makeTS(n=100, ar=0.6)))
##
## Box-Pierce test
##
## data: df
## X-squared = 50.122, df = 1, p-value = 1.445e-12
# What does an MA process look like?
plot_time(ts(makeTS(n=100, ma=0.8)))
##
## Box-Pierce test
##
## data: df
## X-squared = 30.442, df = 1, p-value = 3.44e-08
# What about additive seasonality?
plot_time(ts(makeTS(n=100,
seasons=c(0,4,0,
0,0,6,
0,0,0,
-3,0,0),
adjust="level")))
##
## Box-Pierce test
##
## data: df
## X-squared = 0.13482, df = 1, p-value = 0.7135
# What does an ARMA process look like?
plot_time(ts(makeTS(n=100, ar=0.6, ma=0.8)))
##
## Box-Pierce test
##
## data: df
## X-squared = 58.217, df = 1, p-value = 2.343e-14
For the initial twelve time series (A-L), each time series’ data-generating process consists of at most one dynamic component. Hence, you can initially infer the process by visually inspecting the plot, autocorrelation function (ACF), and partial autocorrelation function (PACF).
If you are unsure of what is going on, you can employ the Box-Jenkins diagnostics method to systematically identify and isolate the dynamic components until the residual series exhibit characteristics of white noise. Remember:
1- Visualize the time series using plot() or decompose().
2- If a deterministic trend or additive seasonality seems to be present, you can verify it by fitting time() or decompose()$seasonal in lm() and checking the residuals (see the sketch after this list).
3- If neither a trend nor additive seasonality seems to be present, check for AR or MA processes, but be sure to first adjust for trends and/or seasonality; otherwise it may be difficult to identify what’s going on.
4- If you are still unsure, narrow the choice down to two or three potential candidates and estimate them using arima(). Select the models that yield the lowest AIC.
5- If, in any of the above steps, the Q-statistic in Box.test() provides evidence of white noise, then very likely you have found the best candidate model for the underlying DGP.
6- If you have more than one final candidate, then use the most parsimonious specification. That is, the specification that estimates fewer parameters but returns white noise.
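Here is a minimal sketch of steps 2 and 5, assuming makeTS() and the helpers sourced above are available (the trend value 0.02 and the use of monthly dummies are just illustrative choices):
# Sketch: regress out a deterministic trend and monthly dummies, then re-check the residuals
y <- ts(makeTS(n = 100, trend = 0.02), frequency = 12)
fit <- lm(y ~ time(y) + factor(cycle(y)))   # deterministic trend + additive monthly seasonality
resid_y <- ts(residuals(fit), frequency = 12)
Box.test(resid_y)   # p > 0.10 -> residuals look like white noise
acf(resid_y)        # otherwise, inspect the remaining AR/MA structure
pacf(resid_y)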
# load data
df <- read_csv(file="data/mysterytsUW.csv")
# time series: C
df_1c <- df %>% select(c) %>% ts(frequency = 12)
plot_time(df_1c)
##
## Box-Pierce test
##
## data: df
## X-squared = 11.868, df = 1, p-value = 0.0005712
# time series: D
df_1d <- df %>% select(d) %>% ts(frequency = 12)
plot_time(df_1d)
##
## Box-Pierce test
##
## data: df
## X-squared = 25.41, df = 1, p-value = 4.634e-07
# time series: G
df_1g <- df %>% select(g) %>% ts(frequency = 12)
plot_time(df_1g)
##
## Box-Pierce test
##
## data: df
## X-squared = 59.936, df = 1, p-value = 9.77e-15
# time series: H
df_1h <- df %>% select(h) %>% ts(frequency = 12)
plot_time(df_1h)
##
## Box-Pierce test
##
## data: df
## X-squared = 21.993, df = 1, p-value = 2.736e-06
# time series: I
df_1i <- df %>% select(i) %>% ts(frequency = 12)
plot_time(df_1i)
##
## Box-Pierce test
##
## data: df
## X-squared = 3.7496, df = 1, p-value = 0.05282
# in this last one, the Q-test is spurious. Look at the visualizations.
The six remaining time series, M-R, could have up to two components. These are more difficult to identify by eyeballing alone.
# time series: Q
df_1q <- df %>% select(q) %>% ts(frequency = 12)
plot_time(df_1q)
##
## Box-Pierce test
##
## data: df
## X-squared = 10.661, df = 1, p-value = 0.001094
# could it be AR(2)? or MA(2)? or ARMA(1,1)?
res1 <- arima(df_1q, order = c(2,0,0))
res2 <- arima(df_1q, order = c(0,0,2))
res3 <- arima(df_1q, order = c(1,0,1))
stargazer(res1,res2,res3,
type = "text")
##
## ===============================================
## Dependent variable:
## -----------------------------
## df_1q
## (1) (2) (3)
## -----------------------------------------------
## ar1 0.257*** 0.763***
## (0.097) (0.127)
##
## ar2 0.294***
## (0.099)
##
## ma1 0.165 -0.459***
## (0.103) (0.163)
##
## ma2 0.352***
## (0.100)
##
## intercept 0.211 0.222 0.213
## (0.210) (0.146) (0.217)
##
## -----------------------------------------------
## Observations 100 100 100
## Log Likelihood -137.752 -139.042 -139.119
## sigma2 0.918 0.942 0.944
## Akaike Inf. Crit. 283.505 286.083 286.238
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
plot_time(res1$residuals)
##
## Box-Pierce test
##
## data: df
## X-squared = 0.092565, df = 1, p-value = 0.7609
For the second part, we know that all the series follow an AR(p) process. However, some of these series may not be stationary.
If you are uncertain, you can test for a unit root with adf.test() or PP.test(), where the null \(H_0\) states that the time series has a unit root.
If the sum of your AR(p) estimates is below 1, then you have evidence of stationarity; if their sum is at or above 1, then you have evidence that your series is non-stationary.
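As a quick illustration of this rule of thumb, here is a hypothetical check on simulated AR(2) data (the seed and the coefficients 0.5 and 0.3 are arbitrary choices, not part of the assignment):
# Sketch: fit an AR(2) to simulated data and sum the AR estimates
set.seed(98105)
y_sim  <- arima.sim(list(ar = c(0.5, 0.3)), n = 500)   # stationary by construction (0.5 + 0.3 < 1)
fit_ar <- arima(y_sim, order = c(2, 0, 0))
sum(fit_ar$coef[c("ar1", "ar2")])   # should be well below 1 -> evidence of stationarity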
# load data
df2 <- read_csv(file="data/mysterytsUW2.csv")
# time series: S
df_2s <- df2 %>% select(s) %>% ts(frequency = 12)
plot_time(df_2s)
##
## Box-Pierce test
##
## data: df
## X-squared = 984.31, df = 1, p-value < 2.2e-16
adf.test(df_2s)
##
## Augmented Dickey-Fuller Test
##
## data: df_2s
## Dickey-Fuller = -1.9746, Lag order = 9, p-value = 0.5891
## alternative hypothesis: stationary
PP.test(df_2s)
##
## Phillips-Perron Unit Root Test
##
## data: df_2s
## Dickey-Fuller = -2.0696, Truncation lag parameter = 7, p-value = 0.5489
# time series: U
df_2u <- df2 %>% select(u) %>% ts(frequency = 12)
plot_time(df_2u)
##
## Box-Pierce test
##
## data: df
## X-squared = 910.15, df = 1, p-value < 2.2e-16
(res <- arima(df_2u, order = c(2,0,0)))
##
## Call:
## arima(x = df_2u, order = c(2, 0, 0))
##
## Coefficients:
## ar1 ar2 intercept
## 1.1517 -0.2078 0.4623
## s.e. 0.0309 0.0309 0.5734
##
## sigma^2 estimated as 1.062: log likelihood = -1450.33, aic = 2908.66
plot_time(res$residuals)
##
## Box-Pierce test
##
## data: df
## X-squared = 0.002517, df = 1, p-value = 0.96
For Problem Set 2, the most important techniques are unit root test, model estimation, model selection, and counterfactual forecasting.
Up to this point, our focus has primarily been on autoregressive \(AR(p)\) and moving average \(MA(q)\) processes, as the data-generating processes (DGPs) we’ve encountered have either been stationary or possessed unit roots. A nonstationary time series, however, may be integrated of order \(d\), \(I(d)\). Adding the possibility of multiplicative seasonality, we arrive at a comprehensive dynamic specification: the (seasonal) Autoregressive Integrated Moving Average (ARIMA) model.
\[\text{ARIMA}(p,d,q)(P,D,Q)_m \]
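As a sketch of how the seasonal orders enter arima(), here is the classic airline model fit to the built-in AirPassengers series (this dataset and the chosen orders are illustrative only, not part of the problem set):
# Sketch: ARIMA(0,1,1)(0,1,1)_12 ("airline model") via the seasonal argument of arima()
fit_sarima <- arima(log(AirPassengers),
                    order = c(0, 1, 1),                               # (p, d, q)
                    seasonal = list(order = c(0, 1, 1), period = 12)) # (P, D, Q)_m
fit_sarima$aic
Box.test(fit_sarima$residuals)   # check the residuals for remaining dynamics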
Recall that stationary time series have three properties: a constant mean, a constant (finite) variance, and autocovariances that depend only on the lag between observations, not on \(t\).
Nonstationary processes lack these properties. Note that we are abstracting from trends and other covariates, which means that the social processes are a black box in the time dimension.
Why do nonstationary processes matter?
The ACF and PACF are not well defined, since the covariances in the numerator depend on \(t\).
Spurious regression: we may detect strong correlations between nonstationary processes even though they are conditionally (mean) independent (see the sketch after this list).
Long-run forecasts are infeasible, since the process does not revert to its mean.
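To see the spurious regression problem concretely, here is a small hypothetical simulation with two independent random walks (the seed and sample size are arbitrary):
# Sketch: two independent random walks look "related" in levels but not in differences
set.seed(98105)
rw1 <- cumsum(rnorm(200))
rw2 <- cumsum(rnorm(200))
summary(lm(rw1 ~ rw2))               # often a large, misleading t-statistic in levels
summary(lm(diff(rw1) ~ diff(rw2)))   # in differences, the spurious association disappears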
How can we treat nonstationary processes?
We will explore non-stationary modeling using the presidential approval example discussed in Topic 4.
# clean environment
rm(list = ls())
dev.off()
## null device
## 1
# load source
source("source/TS_code.R") # Chris's TS custom functions
# load data
approval <- read_csv(file="data/approval.csv")
attach(approval)
colnames(approval)
## [1] "month" "year" "approve" "disapprove"
## [5] "unsure" "sept.oct.2001" "iraq.war" "avg.price"
As always, the first step is to plot the time series and try to make sense of its underlying DGP.
# plot presidential approval
plot(approve,
type="l",
ylab="Percent Approving",
xlab="Time",
main = "US Presidential Approval")
lines(x=c(8,8),
y=c(-1000,1000),
col="red")
lines(x=c(26,26),
y=c(-1000,1000),
col="blue")
text("9/11",
x = 10, y = 40,
col="red",cex=0.7)
text("Iraq \n War",
x = 28, y = 40,
col="blue",cex=0.7)
In the plot above, it is evident that there is a large shock in the 8th period, followed by a weaker one around the 27th period. These shocks correspond to two major events: the 9/11 attacks and the Iraq War. We call these types of shocks structural breaks.
In U.S. politics, domestic political shocks often propagate through the stock market. Specifically, presidential approval ratings are highly correlated with oil prices, with a correlation of about -0.84.
# plot average oil price
plot(avg.price,
type="l",
ylab="$ per Barrel",
xlab="Time",
main = "Average Price of Oil")
lines(x=c(8,8),
y=c(-1000,1000),
col="red")
lines(x=c(26,26),
y=c(-1000,1000),
col="blue")
text("9/11",x = 10, y = 175,
col="red",cex=0.7)
text("Iraq \n War",x = 28, y = 175,
col="blue",cex=0.7)
As always, after looking at the time series plots, we need to look at the ACF and PACF.
# Look at the ACF and PACF of approvals
acf(approve)
pacf(approve)
# Look at the ACF and PACF of oil prices
acf(avg.price)
pacf(avg.price)
Unit-root tests are the first step in studying and analyzing non-stationary time series.
Intuition: if a time series is stationary, then regressing \(y_{t}-y_{t-1}\) on \(y_{t-1}\) should produce a negative coefficient. Why?
In a stationary series, knowing the past value of the series helps to predict the next period’s change: values above the mean tend to be followed by negative changes, and vice versa (mean reversion).
\[ \begin{aligned} y_t & = \rho y_{t-1} + \epsilon_{t} \\ y_t - y_{t-1} & = \rho y_{t-1} - y_{t-1} + \epsilon_{t} \\ \Delta y_{t} & = \gamma y_{t-1} + \epsilon_{t} \quad \quad \text{, where } \gamma=(\rho - 1) \\ \end{aligned} \]
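A minimal sketch of this intuition, running the Dickey-Fuller regression \(\Delta y_t = \gamma y_{t-1} + \epsilon_t\) by hand with lm() (both series below are simulated placeholders; the seed and coefficients are arbitrary):
# Sketch: regress the change on the lagged level; gamma = rho - 1
set.seed(98105)
y_stat <- arima.sim(list(ar = 0.5), n = 200)   # stationary AR(1)
y_rw   <- cumsum(rnorm(200))                   # random walk (unit root)
df_gamma <- function(y) {
  dy    <- diff(y)
  y_lag <- y[-length(y)]
  coef(lm(dy ~ y_lag))["y_lag"]
}
df_gamma(y_stat)   # should be clearly negative -> mean reversion
df_gamma(y_rw)     # should be close to zero -> consistent with a unit root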
Augmented Dickey-Fuller (ADF) test: null hypothesis of a unit root.
The Phillips-Perron (PP) test has the same null, but differs in how the AR(\(p\)) time series is modeled with respect to lags, serial correlation, and heteroskedasticity.
The ADF test regression can be expressed as follows:
\[\Delta y_t = \boldsymbol{\beta D_t} + \pi y_{t-1} + \sum^p_{j=1}\psi_j\Delta y_{t-j}+\epsilon_t\] where \(\boldsymbol{D}_t\) is a vector of deterministic trends. The \(p\) lagged difference terms, \(\Delta y_{t-j}\), are used to approximate the ARMA structure of the errors (consider the equivalence of AR and MA models), and the value of \(p\) is set so that the error is serially uncorrelated. The error term is assumed to be homoskedastic.
Under the null hypothesis (\(\rho - 1 = 0\), i.e., a unit root and hence non-stationarity), \(\pi = 0\); the t-statistic on \(\pi\) is used to construct the ADF test statistic. If we do not get a statistically significant result from the ADF test, we cannot reject the unit root, and the time series is likely non-stationary.
Conflicting results from PP and ADF tests may suggest a structural break within the time series of presidential approval.
adf.test(approve)
##
## Augmented Dickey-Fuller Test
##
## data: approve
## Dickey-Fuller = -3.9565, Lag order = 3, p-value = 0.01721
## alternative hypothesis: stationary
adf.test(avg.price)
##
## Augmented Dickey-Fuller Test
##
## data: avg.price
## Dickey-Fuller = -3.0115, Lag order = 3, p-value = 0.1649
## alternative hypothesis: stationary
Notice that the ADF test suggests that presidential approval follows a stationary process while oil prices are nonstationary.
The PP test regression can be expressed as follows:
\[\Delta y_t = \boldsymbol{\beta D_t} + \pi y_{t-1} +u_t\]
The PP test ignores any serial correlation in the test regression. Instead, it corrects for serial correlation and heteroskedasticity in the errors \(u_t\) by directly modifying the test statistic, \(t_{\pi=0}\).
As we can see, both the ADF and PP tests take \(I(1)\) non-stationarity as the null, but they differ in how the time series is modeled: the PP test allows the residuals to be serially correlated.
PP.test(approve)
##
## Phillips-Perron Unit Root Test
##
## data: approve
## Dickey-Fuller = -2.8387, Truncation lag parameter = 3, p-value = 0.235
PP.test(avg.price)
##
## Phillips-Perron Unit Root Test
##
## data: avg.price
## Dickey-Fuller = -2.3318, Truncation lag parameter = 3, p-value = 0.4405
In contrast to the ADF test, the PP test suggests that both presidential approval and oil prices are nonstationary.
Sometimes the ADF test and the PP test provide different evidence. This can happen because some underlying temporal dependence is not incorporated into the assumed models, or because our sample is too small. For example, if there is a structural break in the time series (i.e., persistence arises from large and infrequent shocks), then the ADF test tends to provide evidence of a non-stationary process.
If we suspect that our time series have particular forms of temporal dependence ignored by those tests, there are other potential statistical tests (a brief sketch of both follows this list):
Higher-order differencing non-stationarity: Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.
Structural break: Zivot-Andrews test, etc.
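Here is a brief sketch of how these tests could be called, assuming the tseries and urca packages are available (the calls use default arguments and are shown only for illustration):
# Sketch: KPSS and Zivot-Andrews tests on presidential approval
library(tseries)   # kpss.test()
library(urca)      # ur.za()
kpss.test(approve)        # KPSS: note the reversed null -- H0 is stationarity
summary(ur.za(approve))   # Zivot-Andrews: unit root test allowing one structural break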
Recall that we can use differencing \(d\) to address integration processes \(I(d)\) of nonstationarity. We can difference a random walk as follows:
\[ \begin{aligned} y_{t} =& y_{t-1} + \boldsymbol{x_t \beta} + e_t \\ y_{t}-y_{t-1} =& y_{t-1} - y_{t-1} + \boldsymbol{x_t \beta} + e_t \\ \Delta y_t =& \boldsymbol{x_t\beta}+e_t \end{aligned} \]
If \(y_{t}\)’s DGP contains an \(I(1)\) process, differencing yields an AR(0), stationary \(I(0)\) series. But note that we are now explaining the short-run difference or change in the variable, not its level.
diff(approve) # get an order-1 differenced time series
## [1] -0.67 2.50 -5.50 -1.00 2.50 -0.50 19.67 12.33 -1.00 -1.00 -2.33 -1.67
## [13] -2.75 -3.00 0.08 -2.93 -2.90 -4.00 0.70 -2.45 1.58 -3.58 -2.58 -1.42
## [25] 6.45 4.80 -3.67 -4.33 -2.33 -0.17 -8.50 3.67 -3.00 6.83 -3.25 -3.92
## [37] -0.66 1.33 -4.67 1.17 -0.75 2.25 2.67 -3.50 4.83 -3.00 0.67 0.58
## [49] -2.58 -1.17 -0.50 -1.75 1.08 -3.58 0.25 -3.25 -2.42 3.92 0.75 -3.33
## [61] -3.17 -0.83 1.33 3.00
diff(approve, differences = 2) # order-2 differenced time series
## [1] 3.17 -8.00 4.50 3.50 -3.00 20.17 -7.34 -13.33 0.00 -1.33
## [11] 0.66 -1.08 -0.25 3.08 -3.01 0.03 -1.10 4.70 -3.15 4.03
## [21] -5.16 1.00 1.16 7.87 -1.65 -8.47 -0.66 2.00 2.16 -8.33
## [31] 12.17 -6.67 9.83 -10.08 -0.67 3.26 1.99 -6.00 5.84 -1.92
## [41] 3.00 0.42 -6.17 8.33 -7.83 3.67 -0.09 -3.16 1.41 0.67
## [51] -1.25 2.83 -4.66 3.83 -3.50 0.83 6.34 -3.17 -4.08 0.16
## [61] 2.34 2.16 1.67
# Look at the ACF and PACF of approvals
diff(approve) %>% acf()
diff(approve) %>% pacf()
# Look at the ACF and PACF of oil prices
diff(avg.price) %>% acf()
diff(avg.price) %>% pacf()
# Check unit root tests
diff(approve) %>% adf.test()
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -4.3461, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
diff(avg.price) %>% adf.test()
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -5.3361, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
diff(approve) %>% PP.test()
##
## Phillips-Perron Unit Root Test
##
## data: .
## Dickey-Fuller = -6.703, Truncation lag parameter = 3, p-value = 0.01
diff(avg.price) %>% PP.test()
##
## Phillips-Perron Unit Root Test
##
## data: .
## Dickey-Fuller = -5.4685, Truncation lag parameter = 3, p-value = 0.01
This model can be estimated using the arima() function.
As explained in the lecture (Topic 4), we estimate an ARIMA(0,1,0) which fits a little better than ARIMA(2,1,2) on the AIC criterion.
xcovariates <- cbind(sept.oct.2001, iraq.war, avg.price)
(arima.res1a <-
arima(approve,
order = c(0,1,0),
xreg = xcovariates,
include.mean = FALSE)) # does FALSE matter?
##
## Call:
## arima(x = approve, order = c(0, 1, 0), xreg = xcovariates, include.mean = FALSE)
##
## Coefficients:
## sept.oct.2001 iraq.war avg.price
## 11.2072 5.6899 -0.0710
## s.e. 2.5192 2.4891 0.0337
##
## sigma^2 estimated as 12.35: log likelihood = -171.25, aic = 350.5
(arima.res2a <-
arima(approve,
order = c(2,1,2),
xreg = xcovariates,
include.mean = FALSE)) # does FALSE matter?
##
## Call:
## arima(x = approve, order = c(2, 1, 2), xreg = xcovariates, include.mean = FALSE)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sept.oct.2001 iraq.war avg.price
## -0.7338 -0.6364 0.8715 1.000 9.7795 5.7884 -0.0511
## s.e. 0.1382 0.1162 0.0672 0.069 1.9202 2.7251 0.0307
##
## sigma^2 estimated as 10.26: log likelihood = -167.42, aic = 350.84
# Check out the Q-test statistic:
Box.test(arima.res1a$residuals)
##
## Box-Pierce test
##
## data: arima.res1a$residuals
## X-squared = 0.23871, df = 1, p-value = 0.6251
Box.test(arima.res2a$residuals)
##
## Box-Pierce test
##
## data: arima.res2a$residuals
## X-squared = 0.0019896, df = 1, p-value = 0.9644
# extract estimates and statistics from final model:
(pe.1a <- arima.res1a$coef) # parameter estimates (betas)
## sept.oct.2001 iraq.war avg.price
## 11.20723939 5.68986003 -0.07101481
(se.1a <- sqrt(diag(arima.res1a$var.coef))) # standard errors
## sept.oct.2001 iraq.war avg.price
## 2.51920817 2.48908383 0.03367682
(ll.1a <- arima.res1a$loglik) # log likelihood at its maximum
## [1] -171.2507
(sigma2hat.1a <- arima.res1a$sigma2) # standard error of the regression
## [1] 12.35062
(aic.1a <- arima.res1a$aic) # Akaike Information Criterion
## [1] 350.5013
(resid.1a <- arima.res1a$resid) # residuals (white noise)
## Time Series:
## Start = 1
## End = 65
## Frequency = 1
## [1] 0.06896534 -0.95760998 3.51089580 -4.43371265 -1.60717662 1.11059527
## [7] -0.49893478 9.17823481 10.86638479 9.17823481 -1.60007513 -2.17909353
## [13] -1.62384037 -1.78774934 -1.95075620 0.04271723 -2.99746407 -2.79525316
## [19] -4.00887685 0.72734070 -2.12581740 1.39358613 -3.81576917 -2.07082382
## [25] -0.31571972 1.32825844 4.06144599 -4.32156087 1.32825844 -2.19010083
## [31] 0.59518456 -8.08420829 2.85119925 -3.36572627 6.59423083 -2.58991735
## [37] -3.38206282 -0.03151894 1.76851644 -3.35161007 1.06951405 -1.16188589
## [43] 2.01387576 2.61141278 -2.57858286 4.68726023 -3.98284495 0.59756490
## [49] 1.14243729 -1.37807436 -0.01068325 -1.07735040 -1.79047844 2.03514918
## [55] -2.18668945 3.21167260 -4.57407111 -5.68703628 3.41046875 1.67745340
## [61] -3.58281272 -2.14206064 1.42294481 2.50032405 2.84163698
Some counterfactual stories might help….
We will create new datasets with hypothetical scenarios. In the object xhyp_no911, we set the sept.oct.2001 dummy to zero, representing a counterfactual in which the 9/11 attacks never happened. In xhyp_noiraq, we apply the same type of counterfactual to the iraq.war dummy.
(xhyp_no911 <- xcovariates[8:65, ]) # recall 8th period -> 9/11 shock
(xhyp_no911[, "sept.oct.2001"] <- 0)
(xhyp_noiraq <- xcovariates[26:65, ])
(xhyp_no911[, "iraq.war"] <- 0)
(xhyp_oilprewar <- xcovariates[26:65, ])
(xhyp_oilprewar[, "avg.price"] <- xcovariates[25, "avg.price"])
Now we will use the base R predict() function to forecast how presidential approval may have fluctuated under the new scenario, absent the 9/11 shock. After that, we will create a new data object, pred_no911, which contains the original values of presidential approval up until the counterfactual period.
approve_no911 <-
predict(arima.res1a,
newxreg = xhyp_no911) # predict after september 2001; it will return a list containing expected predictions and standard errors of predictions
# construct a data.frame for plot
# we input the original values up until the counterfactual period
pred_no911 <-
tibble(
mean = c(approve[1:7], approve_no911$pred),
# uncertainty of forecast:
se = c(rep(0, 7), approve_no911$se),  # forecast s.e. for periods 8-65 (already length 65-8+1)
lwr = mean - 1.96*se,
upr = mean + 1.96*se
)
# let's see how it goes compared to original time series
plot(ts(approve,
start = c(2001, 2),
frequency = 12),
ylab="Percent Approving",
xlab="Time",
main = "US Presidential Approval")
lines(ts(c(approve[7],
approve_no911$pred),
start = c(2001, 8),
frequency = 12),
lty = "dashed")
Note: there seems to be a code bug in Chris’s code for this topic. I will contact him to double-check it, but my sense is that the bug originates in his function simcf::ldvsimev.
Below we are simulating the same expected values.
# First we need the simulated parameters
sims <- 10000
simparams <- mvrnorm(n = sims,
mu = arima.res1a$coef,
Sigma = arima.res1a$var.coef)
head(simparams)
## sept.oct.2001 iraq.war avg.price
## [1,] 10.877991 6.204078 -0.04375834
## [2,] 13.510364 6.628567 -0.06750429
## [3,] 15.163841 4.682793 -0.07748665
## [4,] 12.662988 3.812799 -0.05939149
## [5,] 9.364514 7.997687 -0.04502996
## [6,] 10.682281 1.374317 -0.09134948
dim(simparams)
## [1] 10000 3
sim_approve <- matrix(NA, nrow = sims, ncol = 65 - 8 + 1)
model <- arima.res1a
## Let's break the process step by step
# For the first simulation
(model$coef <- simparams[1, ])
## sept.oct.2001 iraq.war avg.price
## 10.87799111 6.20407762 -0.04375834
(sim_approve[1, ] <- predict(model, newxreg = xhyp_no911)[["pred"]])
## Time Series:
## Start = 66
## End = 123
## Frequency = 1
## [1] 53.82639 54.72825 55.36231 55.73207 55.63908 55.61064 55.01771 54.37118
## [9] 54.39416 54.43573 54.37118 54.37665 54.35980 54.16005 54.27491 54.42019
## [17] 54.10644 53.42600 53.07594 53.53102 53.93250 53.95198 53.86577 53.39428
## [25] 53.13807 53.64261 53.86796 54.01324 53.60650 53.27504 52.88777 52.61757
## [33] 51.80519 51.86711 52.12091 52.26641 52.30251 51.73474 51.82270 52.42831
## [41] 52.47295 52.12638 51.38577 50.67141 51.02717 51.05211 50.46356 49.60502
## [49] 47.78008 48.59596 50.60906 50.92302 50.35154 50.50732 49.87392 48.48569
## [57] 47.76455 47.86213
# For the second simulation
(model$coef <- simparams[2, ])
## sept.oct.2001 iraq.war avg.price
## 13.51036446 6.62856663 -0.06750429
(sim_approve[2, ] <- predict(model, newxreg = xhyp_no911)[["pred"]])
## Time Series:
## Start = 66
## End = 123
## Frequency = 1
## [1] 50.21344 51.60471 52.58284 53.15326 53.00981 52.96593 52.05125 51.05387
## [9] 51.08931 51.15344 51.05387 51.06231 51.03632 50.72816 50.90536 51.12948
## [17] 50.64547 49.59578 49.05574 49.75779 50.37714 50.40718 50.27420 49.54684
## [25] 49.15160 49.92993 50.27757 50.50169 49.87423 49.36289 48.76548 48.34864
## [33] 47.09542 47.19094 47.58246 47.80692 47.86261 46.98674 47.12242 48.05668
## [41] 48.12554 47.59090 46.44839 45.34638 45.89519 45.93367 45.02574 43.70130
## [49] 40.88604 42.14466 45.25019 45.73453 44.85293 45.09324 44.11612 41.97454
## [57] 40.86207 41.01261
# and so on...
# Loop predict function for each multivariate simulation
for (i in 1:sims) {
model$coef <- simparams[i, ]
sim_approve[i, ] <- predict(model, newxreg = xhyp_no911)[["pred"]]
}
dim(sim_approve)
## [1] 10000 58
# Now we create a new data frame with the mean of all the simulations
# Question: why do we take the mean? Why do all these simulations?
exp_no911 <-
data.frame(
mean = c(approve[1:7],
colMeans(sim_approve)),
# confidence intervals using the quantile function
lwr = c(approve[1:7],
apply(sim_approve, 2, quantile, 0.025)),
upr = c(approve[1:7],
apply(sim_approve, 2, quantile, 0.975))
)
Finally, we visualize our expected values of the counterfactual on presidential approvals absent the 9/11 shock.
plot_data <-
bind_rows(pred_no911, exp_no911, .id = "type") %>%
mutate(
type = factor(type,
levels=1:2,
labels=c("pred.","expect."))
)
plot_data$time <-
paste0(approval$year, "-",approval$month) %>% lubridate::ym() %>% rep(2)
ggplot(plot_data, aes(x = time,
y = mean,
ymax = upr,
ymin = lwr)) +
geom_line() +
geom_ribbon(aes(fill = type), alpha = 0.3) +
labs(x = "Time", y = "Approval Rate", title = "US Presidential Approval") +
coord_cartesian(ylim = c(10,70))+
theme_classic()
Thus far, we have been examining a single time series with covariates. This assumes that there is no feedback between variables.
Yet, we may be interested in the relationship between two potentially non-stationary time series that influence each other.
The original idea behind cointegration is that two time series may be in equilibrium in the long run but in the short run the two series deviate from that equilibrium.
Cointegrated time series are two non-stationary time series that are causally connected and do not tend toward any particular level but tend toward each other.
Cointegration means that a specific combination of two non-stationary series may be stationary. We say that these two series are cointegrated and the vector that defines the stationary linear combination is called the cointegrating vector.
More on ECM Applications in political science:
Below you have the code from Chris’ slides in Topic 4, so you can reproduce the same example.
Step 1
set.seed(98105)
# Generate cointegrated data
e1 <- rnorm(100)
x <- cumsum(e1) # x is a cumulative sum of e1, so it must be non-stationary
e2 <- rnorm(100)
y <- 0.6*x + e2 # define the relationship between x and y. They are both non-stationary
#Run step 1 of the Engle-Granger two step
coint.reg <- lm(y ~ x - 1)
#Estimate the cointegration vector by least squares with no constant
coint.err <- residuals(coint.reg)
#This gives us the cointegration error, i.e., deviations from the long-run equilibrium
#Check for stationarity of the cointegration vector
adf.test(coint.err)$statistic %>% # get the test statistic of ADF test
punitroot(trend="nc") # punitroot() comes from the urca package; it computes the cumulative probability of the ADF test statistic. "nc" means a test regression without intercept (see ?urca::punitroot)
## Dickey-Fuller
## 1.835502e-06
#Make the lag of the cointegration error term
coint.err.lag <- coint.err[1:(length(coint.err)-2)]
#Make the difference of y and x
dy <- diff(y)
dx <- diff(x)
#And their lags
dy.lag <- dy[1:(length(dy)-1)]
dx.lag <- dx[1:(length(dx)-1)]
#Delete the first dy, because we are missing lags for this obs
dy <- dy[2:length(dy)]
Step 2
#Estimate an Error Correction Model with LS
ecm1 <- lm(dy ~ coint.err.lag + dy.lag + dx.lag)
summary(ecm1)
##
## Call:
## lm(formula = dy ~ coint.err.lag + dy.lag + dx.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.05134 -0.64656 0.02462 0.76963 2.04272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.08009 0.11676 0.686 0.494
## coint.err.lag -0.89254 0.14294 -6.244 1.22e-08 ***
## dy.lag -0.85110 0.11222 -7.584 2.36e-11 ***
## dx.lag 0.56191 0.13088 4.293 4.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.152 on 94 degrees of freedom
## Multiple R-squared: 0.4014, Adjusted R-squared: 0.3822
## F-statistic: 21.01 on 3 and 94 DF, p-value: 1.688e-10
#Alternatively, we can use the Johansen estimator
#Create a matrix of the cointegrated variables
cointvars <- cbind(y,x)
# Perform cointegration tests
coint.test1 <- ca.jo(cointvars,
ecdet = "const",
type="eigen",
K=2,
spec="longrun")
summary(coint.test1)
##
## ######################
## # Johansen-Procedure #
## ######################
##
## Test type: maximal eigenvalue statistic (lambda max) , without linear trend and constant in cointegration
##
## Eigenvalues (lambda):
## [1] 3.520082e-01 2.399643e-02 1.460788e-17
##
## Values of teststatistic and critical values of test:
##
## test 10pct 5pct 1pct
## r <= 1 | 2.38 7.52 9.24 12.97
## r = 0 | 42.52 13.75 15.67 20.20
##
## Eigenvectors, normalised to first column:
## (These are the cointegration relations)
##
## y.l2 x.l2 constant
## y.l2 1.0000000 1.000000 1.0000000
## x.l2 -0.5565254 -4.955866 0.4169868
## constant -0.2979546 32.290892 32.4988636
##
## Weights W:
## (This is the loading matrix)
##
## y.l2 x.l2 constant
## y.d -0.92569359 0.003754100 -1.582064e-17
## x.d 0.02021318 0.007824302 -4.297376e-18
ecm.res1 <- cajorls(coint.test1,
r = 1, # Cointegration rank
reg.number = 1) # which variable(s) to put on LHS
#(column indexes of cointvars)
summary(ecm.res1)
## Length Class Mode
## rlm 12 lm list
## beta 3 -none- numeric
#For the presidential approval example, use an ECM equivalent to the
#ARIMA(1,0,1) model that we chose earlier
cointvars <- cbind(approve,avg.price)
ecm.test1 <- ca.jo(cointvars,
ecdet = "const",
type="eigen",
K=2,
spec="longrun",
dumvar=cbind(sept.oct.2001,iraq.war))
summary(ecm.test1)
##
## ######################
## # Johansen-Procedure #
## ######################
##
## Test type: maximal eigenvalue statistic (lambda max) , without linear trend and constant in cointegration
##
## Eigenvalues (lambda):
## [1] 2.391542e-01 1.367744e-01 1.387779e-16
##
## Values of teststatistic and critical values of test:
##
## test 10pct 5pct 1pct
## r <= 1 | 9.27 7.52 9.24 12.97
## r = 0 | 17.22 13.75 15.67 20.20
##
## Eigenvectors, normalised to first column:
## (These are the cointegration relations)
##
## approve.l2 avg.price.l2 constant
## approve.l2 1.0000000 1.0000000 1.00000
## avg.price.l2 0.1535049 0.3382829 -1.05616
## constant -76.0019182 -120.7778161 90.24200
##
## Weights W:
## (This is the loading matrix)
##
## approve.l2 avg.price.l2 constant
## approve.d -0.12619985 0.02231967 5.407866e-17
## avg.price.d -0.02220281 -0.58771490 3.727850e-16
ecm.res1 <- cajorls(ecm.test1,
r = 1,
reg.number = 1)
summary(ecm.res1$rlm)
##
## Call:
## lm(formula = substitute(form1), data = data.mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1403 -1.6753 -0.2261 1.6430 5.9537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## ect1 -0.12620 0.03006 -4.198 9.37e-05 ***
## sept.oct.2001 19.55846 2.11737 9.237 5.40e-13 ***
## iraq.war 5.01870 1.62432 3.090 0.00307 **
## approve.dl1 -0.31757 0.09448 -3.361 0.00138 **
## avg.price.dl1 -0.05055 0.02593 -1.949 0.05613 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.668 on 58 degrees of freedom
## Multiple R-squared: 0.6301, Adjusted R-squared: 0.5983
## F-statistic: 19.76 on 5 and 58 DF, p-value: 1.915e-11
#Cointegration analysis with a spurious regressor
phony <- rnorm(length(approve))
for (i in 2:length(phony)){
phony[i] <- phony[i-1] + rnorm(1)
}
cointvars <- cbind(approve,avg.price,phony)
ecm.test1 <- ca.jo(cointvars,
ecdet = "const",
type="eigen",
K=2,
spec="longrun",
dumvar=cbind(sept.oct.2001,iraq.war))
summary(ecm.test1)
##
## ######################
## # Johansen-Procedure #
## ######################
##
## Test type: maximal eigenvalue statistic (lambda max) , without linear trend and constant in cointegration
##
## Eigenvalues (lambda):
## [1] 2.522926e-01 2.289387e-01 7.351248e-02 4.440892e-16
##
## Values of teststatistic and critical values of test:
##
## test 10pct 5pct 1pct
## r <= 2 | 4.81 7.52 9.24 12.97
## r <= 1 | 16.38 13.75 15.67 20.20
## r = 0 | 18.32 19.77 22.00 26.81
##
## Eigenvectors, normalised to first column:
## (These are the cointegration relations)
##
## approve.l2 avg.price.l2 phony.l2 constant
## approve.l2 1.00000000 1.0000000 1.00000000 1.00000000
## avg.price.l2 0.09303864 0.6851653 -0.03481344 0.07966019
## phony.l2 1.00806669 -3.8070344 -0.83878975 3.65291691
## constant -69.12999967 -160.7372332 -59.59236067 -92.56171519
##
## Weights W:
## (This is the loading matrix)
##
## approve.l2 avg.price.l2 phony.l2 constant
## approve.d -0.12669497 0.009643857 -0.003510338 6.872469e-16
## avg.price.d 0.03091869 -0.420502736 -0.053024935 -7.633789e-16
## phony.d 0.01535096 0.012992853 -0.012809245 -1.347668e-16
ecm.res1 <- cajorls(ecm.test1,
r = 1,
reg.number = 1)
summary(ecm.res1$rlm)
##
## Call:
## lm(formula = substitute(form1), data = data.mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9919 -1.5593 -0.1705 1.7087 6.4629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## ect1 -0.12669 0.02919 -4.340 5.90e-05 ***
## sept.oct.2001 19.10637 2.11016 9.054 1.26e-12 ***
## iraq.war 5.52734 1.65178 3.346 0.00145 **
## approve.dl1 -0.32650 0.09473 -3.447 0.00107 **
## avg.price.dl1 -0.04073 0.02578 -1.580 0.11972
## phony.dl1 0.12774 0.31229 0.409 0.68404
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.664 on 57 degrees of freedom
## Multiple R-squared: 0.6377, Adjusted R-squared: 0.5996
## F-statistic: 16.72 on 6 and 57 DF, p-value: 5.14e-11