Dynamic Panel Estimators

Nickell Bias

Recall that we can remove fixed effects by differencing:

\[ \begin{aligned} y_{it} & = \phi y_{it-1} + \alpha_i + \epsilon_{it} \\ y_{it} - \bar{y_i} & = \phi(y_{it-1} - \bar{y_i})+(x_{it}-\bar{x}_{it})\beta+(\epsilon_{it}-\bar{\epsilon}_{it}) \end{aligned} \]

Alternatively, we can include dummy variables for each group. This is the “within” estimator or fixed effects model. However, we introduce bias when we difference the model in this way.

This is because \(\bar{y}_i\) is computed using the past \(y\)’s. This is correlated with \(\bar{\epsilon}_{it}\), which is computed using the past \(\epsilon\)’s. Or:

\[ \bar{y_i} = \frac{\bar{X_i}\beta + \bar{\epsilon_i}}{1-\phi} \]

Specifically, this creates bias in the lagged dependent variable (LDV). If the other regressors are correlated with the LDV, then their coefficients may also be seriously biased.

The degree of bias is order \(1/T\), so it is big for small \(T\).

Furthermore,

  • The bias increases as \(\beta\) decreases

  • The bias increases as \(\phi\) increases

  • Small \(N\) is not the problem. Small \(T\) is the problem

Increasing \(N\) does not mitigate the problem. Purging serial correlation in the errors or getting the specification right (including other regressors) also doesn’t solve the problem.

Instrumental variables

We therefore turn to instrumental variables.

Recall that an instrumental variable must fulfill two conditions:

  1. Relevance: \(z\) is correlated with \(x\);

  2. Exogeneity: \(z\) is uncorrelated with \(\epsilon\). It must influence \(y\) only through \(x\).

\[ \begin{aligned} y_{it} - \bar{y_i} & = \phi(y_{it-1} - \bar{y_i})+(x_{it}-\bar{x}_{it})\beta+(\epsilon_{it}-\bar{\epsilon}_{it}) \\ \Delta y_{it} & = \phi \Delta y_{it-1} + \Delta x_{it}\beta + \Delta \epsilon_{it} \end{aligned} \]

We use lagged levels and lagged differences of \(\Delta y_{it}\) as instruments for the LDV. These help to predict \(\Delta y_{it}\) but not \(\Delta \epsilon_{it}\) if the errors are iid (see lecture slides for why).

Note: This is different than instrumenting \(x_{it}\). We are not addressing the endogeneity that may exist there. One should not be conflated with the other.

Generalized Method of Moments

Recall what is a moment in statistics. Moments are used to describe features of a distribution or a dataset, and provide information about the central tendency, the shape, and many other distributional properties. For example, the mean and the variance are the first and second central moments of a distribution. The motivation behind GMM estimation is that the usual least squares IV approach cannot handle multiple equations.

To give a brief overview of the intuition behind GMM, consider a linear model:

\[y_t = \boldsymbol{x}_t \boldsymbol{\beta} + e_t\]

Recall that the exogeneity assumption (aka no misspsecification) holds under Gauss-Markov.

\[E[\boldsymbol{x}_t\epsilon_t]=0\]

Assume there exists some combination instrumental variables \(\boldsymbol{z}_t\) that implies the following:

\[ \begin{aligned} E[\boldsymbol{z}_t\epsilon_t] & =0 \\ E[\boldsymbol{z}_t\epsilon_t] & = E[\boldsymbol{z}_t(y_t-\boldsymbol{x}_t\boldsymbol{\beta})]=0 \end{aligned} \]

Intuition: GMM estimates a General Weighting Matrix \(W\) that returns uncorrelated covariances between the set of instruments \(Z_t\) and the predicted errors \(\hat{e_t}\) from the sample.

Recall the distinction between DGP/population and sample in statistical inference:

  • Population moment:

\[E[\boldsymbol{z}_t(y_t-\boldsymbol{x}_t\boldsymbol{\beta})]=0\]

  • Sample moments:

\[\frac{1}{n}\sum^n_{t=1}\boldsymbol{z}_t(y-\boldsymbol{x_t}\boldsymbol{\beta})\]

\[\substack{\frac{1}{n}\sum^n_{t=1}z_{1t}(y-\boldsymbol{x}_t \boldsymbol{\beta})\\ \vdots\\ \frac{1}{n}\sum^n_{t=1}z_{Kt}(y-\boldsymbol{x}_t \boldsymbol{\beta}) }\]

Therefore, a generalized method of moments (GMM) esitmator of \(\beta\) is a vector of \(\hat{\beta}\) that solves the problem:

\[ \hat{\beta}_{\text{gmm}} = \underset{b}{\text{arg min}} \left[ \sum Z_i'(y_i - X_ib) \right]' \hat{W} \left[ \sum Z_i'(y_i - X_ib) \right] \]

We can compute the above analytically or numerically, where weighting matrix \(\hat{W}\) provides a solution that minimizes the weighted sum fo squares of variances between instruments and residuals. Thus,

\[\boldsymbol{S}_{zy}-\boldsymbol{S}_{zx}\boldsymbol{\beta}=0 \;\;\; \text{where}\]

\[\boldsymbol{S}_{zy}=n^{-1}\sum^n_{t=1}\boldsymbol{z}_t y_{t} \;\;\; \text{and} \;\;\; \boldsymbol{S}_{zx}=n^{-1}\sum^n_{t=1}\boldsymbol{z}_t x_{t}\]

We solve for

\[\hat{\boldsymbol{\beta}}=\boldsymbol{S}^{-1}_{zx}\boldsymbol{S}_{xy}\]

\[\hat{\boldsymbol{\beta}}=\boldsymbol{S}^{-1}_{zx}\boldsymbol{S}_{xy}\]

Important: GMM is not magical solution to endogeneity or misspecification. GMM is only and inference technique, such as least squares or maximum likelihood. Rather, is the estimator specification of some authors that tries to address Nickell bias from the lagged panel models.

  • Anderson-Hsiao estimator: uses the second and third lagged levels as instruments.

  • Arellano-Bond Difference GMM: uses \(\Delta y\) as the outcome and all available lagged levels as instruments in each period.

  • Arellano-Bover/Blundell-Bond System GMM: adds the available lagged differences as instruments.

Dynamic Panel Model: Verify Key Assumptions

When using panel GMM, we need to verify some modeling assumptions:

  • The number or the weakness of instruments
    • Sargan test: if \(\mathbb{E}[\hat{\varepsilon} Z] = 0\), then the combination of instruments is strong enough (over-identifying restrictions).
    • If we don’t reject the null (p > 0.05 or 0.1), then the instruments are strong enough.
  • Serial correlation of error terms – should be AR(1) only, otherwise, for example, the lag 2 term becomes a weak instrument.
    • Arellano-Bond’s test for serial correlation
  • Difference or system GMM (whether to use lagged differences as IV)
  • Use of year fixed effects
    • Wald Chi-squared test on year FE dummies: to see whether the inclusion of FE dummies is statistically significant
  • Use of linear or log-log specification \(\rightarrow\) interpretation of results will be different
    • linear: one unit increase in \(x\) will induce \(\beta\) increase in \(y\)
    • log-log: 1% increase in \(x\) will induce \(100\times\beta\)% increase in \(y\). (elasticity)

Appplication: How Does Tax Affect Cigarrete Consumption?

Cigarette Consumption Data

Variable Description1
cpi consumer price index
pop state population
packpc number of packs per capita
income95 state personal income (total)
income95pc state personal income (per capita)
tax95 average state, federal, and average local excise taxes for fiscal year
taxs95 average excise taxes for fiscal year, including sales taxes
avgprs95 average price during fiscal year, including sales taxes (cents)
pretax95 pretax95 = avgprs95 - taxs95
# Load cigarette consumption data (Jonathan Gruber, MIT)
# Variables (see codebook):
# state year    cpi pop packpc  income  tax avgprs  taxs
data <- read.csv("data/cigarette.csv")
colnames(data)
## [1] "state"  "year"   "cpi"    "pop"    "packpc" "income" "tax"    "avgprs"
## [9] "taxs"
data$state %>% unique() %>% length() # N = 48
## [1] 48
data$year %>% unique() %>% length() # T = 11
## [1] 11
# Quick inflation adjustment to 1995 dollars
inflAdjust <- function(x,cpi,year,target) {
    unique(cpi[year==target])*x/cpi     
  #Multiply x with cpi in target year then divide by cpi in observed year
}


data <-
  data %>% 
  mutate(
    # Make adjustments to state personal income
    income95 = inflAdjust(income, cpi, year, 1995),
    # Average state, federal, and average local excise taxes
    tax95 = inflAdjust(tax, cpi, year, 1995),
    # Average price, including sales taxes
    avgprs95 = inflAdjust(avgprs, cpi, year, 1995),
    # Average excise taxes, including sales taxes
    taxs95 = inflAdjust(taxs, cpi, year, 1995),
    # Create per capita income (in k)
    income95pc = income95/pop,
    # Create pretax price, 1995 dollars
    pretax95 = avgprs95 - taxs95
  )

attach(data)

Examining the time series

# Look at the time series

data %>% 
  ggplot(aes(x = year,
                 y = packpc,
                 group = state,
                 color = state)) +
  geom_line() +
  scale_x_continuous(breaks = seq(1985, 1995))

# Look at the ACF and PACF

acf_data <- 
  split(data$packpc, data$state) %>%
  map(function(x) acf(x, plot = FALSE)$acf) %>%
  map(as.vector) %>%
  bind_cols() %>%
  mutate(lag = 0:10) %>%
  pivot_longer(!lag, names_to = "state", values_to = "acf") %>%
  arrange(state, lag)


acf_data %>% 
  ggplot(aes(x = lag, y = acf)) +
  geom_bar(stat = "identity", width = 0.1) +
  geom_hline(yintercept = 0.2, color="red", 
             linetype="dashed", linewidth=0.2) +
  geom_hline(yintercept = -0.2, color="red", 
             linetype="dashed", linewidth=0.2) +
  facet_wrap(state~.) +
  labs(title = "ACF of US states")

pacf_data <- 
  split(data$packpc, data$state) %>%
  map(function(x) pacf(x, plot = FALSE)$acf) %>%
  map(as.vector) %>%
  bind_cols() %>%
  mutate(lag = 1:10) %>%
  pivot_longer(!lag, names_to = "state", values_to = "pacf") %>%
  arrange(state, lag)


pacf_data %>% 
  ggplot(aes(x = lag, y = pacf)) +
  geom_bar(stat = "identity", width = 0.1) +
  geom_hline(yintercept = 0.2, color="red", 
             linetype="dashed", linewidth=0.2) +
  geom_hline(yintercept = -0.2, color="red", 
             linetype="dashed", linewidth=0.2) +
  facet_wrap(state~.) +
  labs(title = "PACF of US states")

# Are the series stationary?

pseries <- split(data$packpc, data$state)
pp_pval <- sapply(pseries, function(x){pp.test(x)[["p.value"]]})
adf_pval <- sapply(pseries, function(x){adf.test(x)[["p.value"]]})
hist(pp_pval, breaks = 10)

hist(adf_pval, breaks = 10)

Fixed Effect Model

# Check for time invariant variables:
pvar(data, index = c("state", "year"))
## no time variation:       state 
## no individual variation: year cpi
# "within" option tells plm to do fixed effects
# Note that if you want to add year fixed effects then set effect="time" and for both state
# and year fixed effects set effect effect="twoway"
fe_res <- plm(packpc ~ income95pc + pretax95 + taxs95, data = data, model="within", effect="twoway")
coeftest(fe_res, vcov. = vcovHC, method = "arellano")
## 
## t test of coefficients:
## 
##             Estimate Std. Error  t value Pr(>|t|)    
## income95pc  0.969966   0.643947   1.5063  0.13267    
## pretax95   -0.188551   0.077797  -2.4236  0.01575 *  
## taxs95     -0.481852   0.044020 -10.9462  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Some tests for serial correlation of errors (needed because we have a linear regression
# with lags of the dependent variable on the RHS
# the standard LM test (note we could specify order)
pbgtest(fe_res)
## 
##  Breusch-Godfrey/Wooldridge test for serial correlation in panel models
## 
## data:  packpc ~ income95pc + pretax95 + taxs95
## chisq = 129.6, df = 11, p-value < 2.2e-16
## alternative hypothesis: serial correlation in idiosyncratic errors

Ramdom Effect Model

# Estimate a random effects AR(I)MA(p,q) model using lme (Restricted ML)
re_res <- lme(# A formula object including the response,
              # the fixed covariates, and any grouping variables
              fixed = packpc ~ income95pc + pretax95 + taxs95,  
              # i.e. response variable and explanatory variables 

              # The random effects component
              random = ~ 1 | state,                     
              # 1 indicates the intercept and state indicates the grouping

              # The TS dynamics: specify the time & group variables,
              # and the order of the ARMA(p,q) process
              correlation = corARMA(form = ~ year | state,
                                    p = 1,  # AR(p) order
                                    q = 0   # MA(q) order
                                    ) 
              )

summary(re_res)
## Linear mixed-effects model fit by REML
##   Data: NULL 
##       AIC     BIC    logLik
##   3253.21 3283.04 -1619.605
## 
## Random effects:
##  Formula: ~1 | state
##         (Intercept) Residual
## StdDev:  0.01123977 20.92621
## 
## Correlation Structure: AR(1)
##  Formula: ~year | state 
##  Parameter estimate(s):
##       Phi 
## 0.9764735 
## Fixed effects:  packpc ~ income95pc + pretax95 + taxs95 
##                 Value Std.Error  DF    t-value p-value
## (Intercept) 173.08136  8.574965 477  20.184498  0.0000
## income95pc   -1.05746  0.387602 477  -2.728198  0.0066
## pretax95     -0.14537  0.024800 477  -5.861684  0.0000
## taxs95       -0.46630  0.040769 477 -11.437827  0.0000
##  Correlation: 
##            (Intr) incm95 prtx95
## income95pc -0.856              
## pretax95   -0.099 -0.223       
## taxs95     -0.160 -0.097 -0.035
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -2.79963471 -0.57137010 -0.08122771  0.45749547  3.97887773 
## 
## Number of Observations: 528
## Number of Groups: 48

Dynamic Panel Model

First, you will need to create an object data with class pdata.frame.

# Panel based diagnostics available in the plm library
# (This package recently expanded to contain many many panel data tests
#  for serial correlation, fixed effects, and unit roots)

# First, create a plm data frame (special data frame that "knows" the
# unit variable and time variable
pdata <- pdata.frame(data, index=c("state", "year"))
head(pdata)
##         state year   cpi     pop   packpc   income  tax   avgprs     taxs
## AL-1985    AL 1985 1.076 3973000 116.4863 46014968 32.5 102.1817 33.34834
## AL-1986    AL 1986 1.096 3992000 117.1593 48703940 32.5 107.9892 33.40584
## AL-1987    AL 1987 1.136 4016000 115.8367 51846312 32.5 113.5273 33.46067
## AL-1988    AL 1988 1.183 4024000 115.2584 55698852 32.5 120.0334 33.52509
## AL-1989    AL 1989 1.240 4030000 109.2060 60044480 32.5 133.2560 33.65600
## AL-1990    AL 1990 1.307 4048508 111.7449 64094948 32.5 143.4486 33.75692
##         income95    tax95 avgprs95   taxs95 income95pc  pretax95
## AL-1985 65173615 46.03160 144.7257 47.23314   16.40413  97.49257
## AL-1986 67723361 45.19161 150.1601 46.45118   16.96477 103.70893
## AL-1987 69554378 43.60035 152.3025 44.88914   17.31932 107.41336
## AL-1988 71754049 41.86813 154.6331 43.18869   17.83152 111.44437
## AL-1989 73796599 39.94355 163.7759 41.36431   18.31181 122.41162
## AL-1990 74736574 37.89595 167.2652 39.36155   18.46028 127.90366
# Do an panel unit root test on the undifferenced cigarette data;
# there are many options; see ?purtest

# Note: this purtest() function isn't working due to unbalanced panel
#purtest(packpc~1, data=pdata, test="ips")

To estimate a dynamic panel model, we use the function plm::pgmm. To estimate Arellano-Bond with GMM for fixed effects with lagged DV, pgmm needs formulas in a specific format:

  1. in the first part of the RHS, include lags of DV and covariates, as shown below.
  2. in the second part, include the panel data instruments (99 here means use up to the 99th lag of the difference as an instrument).
  3. in an optional (not shown) third part of the RHS, include any other instruments.
# Estimate Arellano-Bond GMM for fixed effects with lagged DV

# note that pgmm formulas construct lag() properly for pdata.frame,
# if the data is not pdata.frame, lag() will create a wrong lag structure

formula_linear <- 
  packpc ~ 
  lag(packpc, 1) + income95pc + avgprs95 | lag(packpc, 2:99)

# if you don't want to include too many instruments, you can just set the second part as `lag(packpc, 2:...)`

pgmm_1a <- pgmm(
  formula_linear,
  data = pdata,
  effect = "individual", # should consider two-way (default) for small T; can also set `individual` FE
  transformation = "d" # should do `ld` if T=3, `d` for difference GMM and `ld` for system GMM; default is `d` (difference GMM)
)

Dynamic Panel Model: Investigate AR(1)/AR(2) Autocorrelation and Instrument Weakness

Sargan test has a null of the instruments as a group being exogenous- higher p is good. The residuals of the differences equations should exhibit AR(1) but not AR(2) behavior higher p for AR(2) test is good

summary(pgmm_1a)
## Oneway (individual) effect One-step model Difference GMM 
## 
## Call:
## pgmm(formula = formula_linear, data = pdata, effect = "individual", 
##     transformation = "d")
## 
## Balanced Panel: n = 48, T = 11, N = 528
## 
## Number of Observations Used: 432
## Residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -25.4308  -2.5080   0.1463   0.1238   2.7384  25.6025 
## 
## Coefficients:
##                 Estimate Std. Error z-value  Pr(>|z|)    
## lag(packpc, 1)  0.638987   0.055342 11.5462 < 2.2e-16 ***
## income95pc     -0.475568   0.486760 -0.9770    0.3286    
## avgprs95       -0.180791   0.027799 -6.5035 7.848e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(44) = 47.99763 (p-value = 0.31401)
## Autocorrelation test (1): normal = -3.948861 (p-value = 7.8524e-05)
## Autocorrelation test (2): normal = -0.5688819 (p-value = 0.56944)
## Wald test for coefficients: chisq(3) = 2496.07 (p-value = < 2.22e-16)
# Good Sargan test, Good AR(2) test

Some examples of Mis-specified models.

# only include lag 2 and lag 3 terms
pgmm(
  packpc ~ lag(packpc, 1) + income95pc + avgprs95 | lag(packpc, 2:3),
  data = pdata, effect = "individual", transformation = "d"
) %>% summary()
## Oneway (individual) effect One-step model Difference GMM 
## 
## Call:
## pgmm(formula = packpc ~ lag(packpc, 1) + income95pc + avgprs95 | 
##     lag(packpc, 2:3), data = pdata, effect = "individual", transformation = "d")
## 
## Balanced Panel: n = 48, T = 11, N = 528
## 
## Number of Observations Used: 432
## Residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -25.9207  -2.4789   0.1232   0.1159   2.7106  25.5997 
## 
## Coefficients:
##                 Estimate Std. Error z-value  Pr(>|z|)    
## lag(packpc, 1)  0.660475   0.052854 12.4963 < 2.2e-16 ***
## income95pc     -0.258571   0.456052 -0.5670    0.5707    
## avgprs95       -0.175368   0.026891 -6.5215  6.96e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(16) = 40.04482 (p-value = 0.00076694)
## Autocorrelation test (1): normal = -3.831746 (p-value = 0.00012724)
## Autocorrelation test (2): normal = -0.4941905 (p-value = 0.62117)
## Wald test for coefficients: chisq(3) = 2405.391 (p-value = < 2.22e-16)
# only include lag 2-5 terms
pgmm(
  packpc ~ lag(packpc, 1) + income95pc + avgprs95 | lag(packpc, 2:5),
  data = pdata, effect = "individual", transformation = "d"
) %>% summary()
## Oneway (individual) effect One-step model Difference GMM 
## 
## Call:
## pgmm(formula = packpc ~ lag(packpc, 1) + income95pc + avgprs95 | 
##     lag(packpc, 2:5), data = pdata, effect = "individual", transformation = "d")
## 
## Balanced Panel: n = 48, T = 11, N = 528
## 
## Number of Observations Used: 432
## Residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -25.6948  -2.5357   0.1483   0.1217   2.6744  25.5813 
## 
## Coefficients:
##                 Estimate Std. Error z-value  Pr(>|z|)    
## lag(packpc, 1)  0.650350   0.055545 11.7086 < 2.2e-16 ***
## income95pc     -0.325994   0.497744 -0.6549    0.5125    
## avgprs95       -0.181297   0.027242 -6.6550 2.834e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(29) = 42.46279 (p-value = 0.050998)
## Autocorrelation test (1): normal = -3.907354 (p-value = 9.3312e-05)
## Autocorrelation test (2): normal = -0.5460146 (p-value = 0.58506)
## Wald test for coefficients: chisq(3) = 2503.149 (p-value = < 2.22e-16)

Let’s try different model specifications:

  • model 1a: linear + difference GMM + individual FE
  • model 1b: linear + system GMM + individual FE
  • model 1c: linear + difference GMM + two-way FE
  • model 1b: linear + system GMM + two-way FE
# model 1a: linear + difference GMM + individual FE
summary(pgmm_1a)
## Oneway (individual) effect One-step model Difference GMM 
## 
## Call:
## pgmm(formula = formula_linear, data = pdata, effect = "individual", 
##     transformation = "d")
## 
## Balanced Panel: n = 48, T = 11, N = 528
## 
## Number of Observations Used: 432
## Residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -25.4308  -2.5080   0.1463   0.1238   2.7384  25.6025 
## 
## Coefficients:
##                 Estimate Std. Error z-value  Pr(>|z|)    
## lag(packpc, 1)  0.638987   0.055342 11.5462 < 2.2e-16 ***
## income95pc     -0.475568   0.486760 -0.9770    0.3286    
## avgprs95       -0.180791   0.027799 -6.5035 7.848e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(44) = 47.99763 (p-value = 0.31401)
## Autocorrelation test (1): normal = -3.948861 (p-value = 7.8524e-05)
## Autocorrelation test (2): normal = -0.5688819 (p-value = 0.56944)
## Wald test for coefficients: chisq(3) = 2496.07 (p-value = < 2.22e-16)
# model 1b: linear + system GMM + individual FE
pgmm_1b <- pgmm(
  packpc ~ lag(packpc, 1) + income95pc + avgprs95 | lag(packpc, 2:99),
  data = pdata, effect = "individual", transformation = "ld"
)
summary(pgmm_1b)
## Oneway (individual) effect One-step model System GMM 
## 
## Call:
## pgmm(formula = packpc ~ lag(packpc, 1) + income95pc + avgprs95 | 
##     lag(packpc, 2:99), data = pdata, effect = "individual", transformation = "ld")
## 
## Balanced Panel: n = 48, T = 11, N = 528
## 
## Number of Observations Used: 912
## Residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -35.0188  -2.5248   0.1318   0.2344   2.8275  29.0107 
## 
## Coefficients:
##                  Estimate Std. Error z-value Pr(>|z|)    
## lag(packpc, 1)  0.9372190  0.0149757 62.5826   <2e-16 ***
## income95pc      0.2033459  0.1245084  1.6332   0.1024    
## avgprs95       -0.0021312  0.0098748 -0.2158   0.8291    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(55) = 47.79775 (p-value = 0.74372)
## Autocorrelation test (1): normal = -3.451448 (p-value = 0.00055759)
## Autocorrelation test (2): normal = 0.5944316 (p-value = 0.55222)
## Wald test for coefficients: chisq(3) = 109661.9 (p-value = < 2.22e-16)
# model 1c: linear + difference GMM + two-way FE
pgmm_1c <- pgmm(
  packpc ~ lag(packpc, 1) + income95pc + avgprs95 | lag(packpc, 2:99),
  data = pdata, effect = "twoways", transformation = "d"
)
summary(pgmm_1c)
## Twoways effects One-step model Difference GMM 
## 
## Call:
## pgmm(formula = packpc ~ lag(packpc, 1) + income95pc + avgprs95 | 
##     lag(packpc, 2:99), data = pdata, effect = "twoways", transformation = "d")
## 
## Balanced Panel: n = 48, T = 11, N = 528
## 
## Number of Observations Used: 432
## Residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -18.939  -1.890  -0.259   0.000   1.824  20.425 
## 
## Coefficients:
##                 Estimate Std. Error z-value  Pr(>|z|)    
## lag(packpc, 1)  0.252415   0.117744  2.1438   0.03205 *  
## income95pc      1.062384   0.674055  1.5761   0.11500    
## avgprs95       -0.285703   0.060572 -4.7168 2.396e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(44) = 45.92782 (p-value = 0.39225)
## Autocorrelation test (1): normal = -3.571845 (p-value = 0.00035448)
## Autocorrelation test (2): normal = 0.02846648 (p-value = 0.97729)
## Wald test for coefficients: chisq(3) = 35.65439 (p-value = 8.8603e-08)
## Wald test for time dummies: chisq(9) = 70.99223 (p-value = 9.7256e-12)
# model 1d: linear + system GMM + two-way FE
pgmm_1d <- pgmm(
  packpc ~ lag(packpc, 1) + income95pc + avgprs95 | lag(packpc, 2:99),
  data = pdata, effect = "twoways", transformation = "ld"
)
summary(pgmm_1d)
## Twoways effects One-step model System GMM 
## 
## Call:
## pgmm(formula = packpc ~ lag(packpc, 1) + income95pc + avgprs95 | 
##     lag(packpc, 2:99), data = pdata, effect = "twoways", transformation = "ld")
## 
## Balanced Panel: n = 48, T = 11, N = 528
## 
## Number of Observations Used: 912
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -31.81880  -2.37490  -0.03745   0.00000   2.18303  27.94124 
## 
## Coefficients:
##                 Estimate Std. Error z-value  Pr(>|z|)    
## lag(packpc, 1)  0.912506   0.035086 26.0074 < 2.2e-16 ***
## income95pc     -0.016144   0.110832 -0.1457  0.884186    
## avgprs95       -0.101336   0.032190 -3.1481  0.001644 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(55) = 45.80834 (p-value = 0.80677)
## Autocorrelation test (1): normal = -3.133862 (p-value = 0.0017252)
## Autocorrelation test (2): normal = 0.7322591 (p-value = 0.46401)
## Wald test for coefficients: chisq(3) = 2942.746 (p-value = < 2.22e-16)
## Wald test for time dummies: chisq(9) = 87.03407 (p-value = 6.3969e-15)
stargazer(pgmm_1a,pgmm_1b,pgmm_1c,pgmm_1d,
          type="text")
## 
## =====================================================
##                         Dependent variable:          
##                --------------------------------------
##                                packpc                
##                   (1)      (2)       (3)       (4)   
## -----------------------------------------------------
## lag(packpc, 1) 0.639***  0.937***  0.252**  0.913*** 
##                 (0.055)  (0.015)   (0.118)   (0.035) 
##                                                      
## income95pc      -0.476    0.203     1.062    -0.016  
##                 (0.487)  (0.125)   (0.674)   (0.111) 
##                                                      
## avgprs95       -0.181***  -0.002  -0.286*** -0.101***
##                 (0.028)  (0.010)   (0.061)   (0.032) 
##                                                      
## -----------------------------------------------------
## Observations      48        48       48        48    
## =====================================================
## Note:                     *p<0.1; **p<0.05; ***p<0.01

Now, we will fit the log-log version:

  • model 2a: log-log + difference GMM + individual FE
  • model 2b: log-log + system GMM + individual FE
  • model 2c: log-log + difference GMM + two-way FE
  • model 2b: log-log + system GMM + two-way FE
formula_loglog <- 
  log(packpc) ~ lag(log(packpc), 1) + log(income95pc) + log(avgprs95) | lag(log(packpc), 2:99)

# model 2a: log-log + difference GMM + individual FE
pgmm_2a <- pgmm(formula_loglog, data = pdata, effect = "individual", transformation = "d")
summary(pgmm_2a)
## Oneway (individual) effect One-step model Difference GMM 
## 
## Call:
## pgmm(formula = formula_loglog, data = pdata, effect = "individual", 
##     transformation = "d")
## 
## Balanced Panel: n = 48, T = 11, N = 528
## 
## Number of Observations Used: 432
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.263020 -0.024090  0.002736  0.001092  0.027682  0.216378 
## 
## Coefficients:
##                      Estimate Std. Error z-value  Pr(>|z|)    
## lag(log(packpc), 1)  0.674051   0.061660 10.9317 < 2.2e-16 ***
## log(income95pc)     -0.066044   0.111880 -0.5903     0.555    
## log(avgprs95)       -0.305656   0.043073 -7.0963 1.281e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(44) = 46.02872 (p-value = 0.38824)
## Autocorrelation test (1): normal = -4.050332 (p-value = 5.1145e-05)
## Autocorrelation test (2): normal = 0.05678179 (p-value = 0.95472)
## Wald test for coefficients: chisq(3) = 2140.806 (p-value = < 2.22e-16)
# model 2b: log-log + system GMM + individual FE
pgmm_2b <- pgmm(formula_loglog, data = pdata, effect = "individual", transformation = "ld")
summary(pgmm_2b)
## Oneway (individual) effect One-step model System GMM 
## 
## Call:
## pgmm(formula = formula_loglog, data = pdata, effect = "individual", 
##     transformation = "ld")
## 
## Balanced Panel: n = 48, T = 11, N = 528
## 
## Number of Observations Used: 912
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.366760 -0.022702  0.002000  0.001797  0.025148  0.278323 
## 
## Coefficients:
##                       Estimate Std. Error  z-value Pr(>|z|)    
## lag(log(packpc), 1)  0.9860119  0.0094198 104.6742   <2e-16 ***
## log(income95pc)     -0.0057655  0.0154370  -0.3735   0.7088    
## log(avgprs95)        0.0110743  0.0077600   1.4271   0.1535    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(55) = 47.59902 (p-value = 0.75038)
## Autocorrelation test (1): normal = -3.352045 (p-value = 0.00080217)
## Autocorrelation test (2): normal = 0.982078 (p-value = 0.32606)
## Wald test for coefficients: chisq(3) = 4737095 (p-value = < 2.22e-16)
# model 2c: log-log + difference GMM + two-way FE
pgmm_2c <- pgmm(formula_loglog, data = pdata, effect = "twoways", transformation = "d")
summary(pgmm_2c)
## Twoways effects One-step model Difference GMM 
## 
## Call:
## pgmm(formula = formula_loglog, data = pdata, effect = "twoways", 
##     transformation = "d")
## 
## Balanced Panel: n = 48, T = 11, N = 528
## 
## Number of Observations Used: 432
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.181013 -0.019492 -0.001678  0.000000  0.017812  0.204343 
## 
## Coefficients:
##                     Estimate Std. Error z-value  Pr(>|z|)    
## lag(log(packpc), 1)  0.33315    0.14500  2.2976   0.02158 *  
## log(income95pc)      0.18131    0.18166  0.9981   0.31825    
## log(avgprs95)       -0.62271    0.11127 -5.5965 2.188e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(44) = 43.30027 (p-value = 0.5015)
## Autocorrelation test (1): normal = -3.620059 (p-value = 0.00029454)
## Autocorrelation test (2): normal = 0.5219654 (p-value = 0.60169)
## Wald test for coefficients: chisq(3) = 45.27327 (p-value = 8.0945e-10)
## Wald test for time dummies: chisq(9) = 71.0946 (p-value = 9.2856e-12)
# model 2d: log-log + system GMM + two-way FE
pgmm_2d <- pgmm(formula_loglog, data = pdata, effect = "twoways", transformation = "ld")
summary(pgmm_2d)
## Twoways effects One-step model System GMM 
## 
## Call:
## pgmm(formula = formula_loglog, data = pdata, effect = "twoways", 
##     transformation = "ld")
## 
## Balanced Panel: n = 48, T = 11, N = 528
## 
## Number of Observations Used: 912
## Residuals:
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.3421969 -0.0233093  0.0006081  0.0000000  0.0224292  0.2716041 
## 
## Coefficients:
##                       Estimate Std. Error z-value  Pr(>|z|)    
## lag(log(packpc), 1)  0.9454089  0.0286753 32.9694 < 2.2e-16 ***
## log(income95pc)     -0.0072777  0.0214627 -0.3391  0.734544    
## log(avgprs95)       -0.1650673  0.0494761 -3.3363  0.000849 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sargan test: chisq(55) = 44.19932 (p-value = 0.85117)
## Autocorrelation test (1): normal = -3.066312 (p-value = 0.0021672)
## Autocorrelation test (2): normal = 1.1093 (p-value = 0.2673)
## Wald test for coefficients: chisq(3) = 4820.892 (p-value = < 2.22e-16)
## Wald test for time dummies: chisq(9) = 90.44993 (p-value = 1.3224e-15)

Simulate Conditional Forecasting

  • Unfortunately, plm package does not have predict method for plm() and pgmm() functions. So we have to manually construct predictions from new data using simcf package (literally the name of package means “simulate counterfactual”).

  • To simulate counterfactual outcomes, we need the following things:

    • Simulated parameters
    • Hypothetical Scenarios
    • Lagged outcome \(y_{t-1}\)
    • Initial outcome \(y_0\) (average packpc in 1995)
  • Simulations with and without year FE dummies are a bit different. Here we take pgmm_1a and pgmm_1c as two examples.

When deciding what could be a plausible scenario, first look at the sample statistics.

# How big a change in price to simulate?
# How about "double" the average tax in the most recent year?
summary(pdata$taxs95[pdata$year==1995])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34.44   48.75   59.84   61.87   74.78  112.63
# The average (and median) tax is about 60 cents/pack
sd(pdata$taxs95[pdata$year==1995])
## [1] 18.47741

If the mean is around 61, and the standard deviation is 18.47, a +60 cents increase would also be about 3 sd’s, and raise the tax to a bit more than the max observed. Other possibilities:

  • A standard deviation increase.
  • The interquartile change: an increase from the first quartile (48.75) to the third one (74.78).
  • Raise every state to the max observed for any state in 1995 (112.60 cents)

Now, let’s create a matrix with our year dummies. Here we are using he function simcf::makeFEdummies. Once we have created this matrix, we cbind it with the data.

# Construct the year dummies
# it is equivalent to `model.matrix(~ -1 + year, data = pdata)`
yearfe <- makeFEdummies(pdata$year)

# Why drop first 2 col's? (the first 2 years are missing due to instrumental variables y_{t-2})
yearfe <- yearfe[, 3:ncol(yearfe)]

# List all the years
yearlist <- unique(pdata$year)

# List the years and exclude the first two
yearlist <- yearlist[3:length(yearlist)]

# Create names for the year dummies
colnames(yearfe) <- paste0("y", yearlist)

head(yearfe)
##   y1987 y1988 y1989 y1990 y1991 y1992 y1993 y1994 y1995
## 1     0     0     0     0     0     0     0     0     0
## 2     0     0     0     0     0     0     0     0     0
## 3     1     0     0     0     0     0     0     0     0
## 4     0     1     0     0     0     0     0     0     0
## 5     0     0     1     0     0     0     0     0     0
## 6     0     0     0     1     0     0     0     0     0
# concatenate year FE dummies to the pdata.frame
datayearfe <- cbind(pdata, yearfe)

head(datayearfe)
##   state year   cpi     pop   packpc   income  tax   avgprs     taxs income95
## 1    AL 1985 1.076 3973000 116.4863 46014968 32.5 102.1817 33.34834 65173615
## 2    AL 1986 1.096 3992000 117.1593 48703940 32.5 107.9892 33.40584 67723361
## 3    AL 1987 1.136 4016000 115.8367 51846312 32.5 113.5273 33.46067 69554378
## 4    AL 1988 1.183 4024000 115.2584 55698852 32.5 120.0334 33.52509 71754049
## 5    AL 1989 1.240 4030000 109.2060 60044480 32.5 133.2560 33.65600 73796599
## 6    AL 1990 1.307 4048508 111.7449 64094948 32.5 143.4486 33.75692 74736574
##      tax95 avgprs95   taxs95 income95pc  pretax95 y1987 y1988 y1989 y1990 y1991
## 1 46.03160 144.7257 47.23314   16.40413  97.49257     0     0     0     0     0
## 2 45.19161 150.1601 46.45118   16.96477 103.70893     0     0     0     0     0
## 3 43.60035 152.3025 44.88914   17.31932 107.41336     1     0     0     0     0
## 4 41.86813 154.6331 43.18869   17.83152 111.44437     0     1     0     0     0
## 5 39.94355 163.7759 41.36431   18.31181 122.41162     0     0     1     0     0
## 6 37.89595 167.2652 39.36155   18.46028 127.90366     0     0     0     1     0
##   y1992 y1993 y1994 y1995
## 1     0     0     0     0
## 2     0     0     0     0
## 3     0     0     0     0
## 4     0     0     0     0
## 5     0     0     0     0
## 6     0     0     0     0

Now, we specify the formulas of each model, and save the specificaiton in an object.

# Construct formulas -- linear + difference GMM + individual FE (without year FE dummies)
formula_1a <- packpc ~ income95pc + avgprs95 - 1 #with Income and Price as covariates

# Construct formulas -- linear + system GMM + individual FE (without year dummies but with intercept)
formula_1b <- packpc ~ income95pc + avgprs95

# Construct formulas -- linear + difference GMM + two-way FE (with year FE dummies)
formula_1c <- update(
  formula_1a,
  paste(c("~ .", colnames(yearfe)), collapse = "+") # add year FE dummies to the formula
)

formula_1c
## packpc ~ income95pc + avgprs95 + y1987 + y1988 + y1989 + y1990 + 
##     y1991 + y1992 + y1993 + y1994 + y1995 - 1
# Construct formulas -- linear + system GMM + two-way FE (with year dummies and with intercept)
formula_1d <- update(formula_1b, paste(c("~ .", colnames(yearfe)), collapse = "+"))

formula_1d
## packpc ~ income95pc + avgprs95 + y1987 + y1988 + y1989 + y1990 + 
##     y1991 + y1992 + y1993 + y1994 + y1995

Simulation of parameters.

# Number of simulations
sims <- 1000

# Recall model 1a:Difference GMM with state fixed effects
#packpc ~ lag(packpc, 1) + income95pc + avgprs95 | lag(packpc, 2:99) 

# Simulate parameters from an multivariate normal distribution
simparam_1a <- mvrnorm(sims, coefficients(pgmm_1a), vcovHC(pgmm_1a))

simphi_1a <- simparam_1a[, 1] # Extract the simulated phis
head(simphi_1a)
## [1] 0.6305317 0.6135699 0.6745606 0.5531332 0.6445569 0.5076610
simbeta_1a <- simparam_1a[, 2:ncol(simparam_1a)] # Extract the simulated betas
head(simbeta_1a)
##      income95pc   avgprs95
## [1,] -0.7780931 -0.1853170
## [2,] -0.4285905 -0.1875193
## [3,] -0.2197715 -0.1790330
## [4,] -0.8106588 -0.2266810
## [5,] -0.6892746 -0.1731899
## [6,] -1.1943628 -0.2060910

Now, create matrix/data with hypothetical predictors. In this example:

  • Assume an average state raised taxes 60 cents starting 1996

Moreover, pgmm uses covariates in differenced form. Therefore, in your xhyp data object with counterfactuals, we must set all the covariates to 0 (no change) in the matrix x (which refers to the present). Exceptions:

  1. changes in covariates of interest.
  2. time dummies aren’t differenced

At this moment, we can ignore the state fixed effects for now and add them later because model is linear.

# Forecast for 3 years from 1996 to 1998
periods.out <- 3

# Make matrix of hypothetical x's: covariates
xhyp_1a <- cfMake(formula=formula_1a, 
                  data=datayearfe, 
                  nscen=periods.out)

#With mean packpc, income, and price for the forecast period

# make every element in 'x' matrix be zero
xhyp_1a$x <- xhyp_1a$xpre <- 0 * xhyp_1a$x

# change the value in the predictor of interest (again, 'x' matrix)
xhyp_1a <- cfChange(xhyp_1a, 
                    covname="avgprs95", 
                    x=60,  # only change avgprs to 60 (cents)
                    scen=1)

# Create baseline scenario
xbase_1a <- xhyp_1a
xbase_1a$x <- xbase_1a$xpre

# We need a lag of the price per pack
lagY_1a <- NULL

# Hypothetical previous change in Y for simulation
for (i in 1:length(pgmm_1a$model)){     #For 1 to 48
    lagY_1a <- c(lagY_1a, as.data.frame(pgmm_1a$model[[i]])["1995",]$packpc)
}

#Hypothetical change in packpc for each state in 1995

#Find the mean of these hypothetical previous changes
lagY_1a <- mean(lagY_1a, na.rm=TRUE)

# Hypothetical initial level of Y for simulation
initialY <- mean(pdata$packpc[pdata$year==1995], na.rm=TRUE)    #The mean of packpc in 1995

Now that we have everything that we need for the simulation, we can use simcf::ldvsimev, simcf::ldvsimfd and simcf::ldvsimrr to estimate the quantities of interest.

# Simulate expected values of Y (on original level scale)
# out to periods.out given hypothetical future values of X,
# initial lags of the change in Y, and an initial level of Y

sim_ev1a <- ldvsimev(xhyp_1a,            # The matrix of hypothetical x's
                     simbeta_1a,         # The matrix of simulated betas
                     ci=0.95,            # Desired confidence interval
                     constant=NA,        # NA indicates no constant!
                     phi=simphi_1a,      # estimated AR parameters; length must match lagY 
                     lagY=lagY_1a,       # lags of y, most recent last
                     transform="diff",   # "log" to undo log transformation,
                                         # "diff" to under first differencing
                                         # "difflog" to do both
                     initialY=initialY   # for differenced models, the lag of the level of y
                     )

# Simulate expected values of Y given no change in covariates
sim_base1a <- ldvsimev(xbase_1a,           # The matrix of hypothetical x's
                       simbeta_1a,         # The matrix of simulated betas
                       ci=0.95,            # Desired confidence interval
                       constant=NA,        # NA indicates no constant!
                       phi=simphi_1a,      # estimated AR parameters; length must match lagY 
                       lagY=lagY_1a,       # lags of y, most recent last
                       transform="diff",   # "log" to undo log transformation,
                                           # "diff" to under first differencing
                                           # "difflog" to do both
                       initialY=initialY   # for differenced models, the lag of the level of y
                       )    

# Simulate first differences in y 
# out to periods.out given hypothetical future values of x, xpre,
# and initial lags of the change in y
sim_fd1a <- ldvsimfd(xhyp_1a,            # The matrix of hypothetical x's
                     simbeta_1a,        # The matrix of simulated betas
                     ci=0.95,            # Desired confidence interval
                     constant=NA,        # Column containing the constant
                                         # set to NA for no constant
                     phi=simphi_1a,     # estimated AR parameters; length must match lagY 
                     lagY=lagY_1a,       # lags of y, most recent last
                     transform="diff"   # Model is differenced
                     #initialY=initialY   # Redundant in this case (fd of linear differenced Y)
                     )

# Simulate relative risks in y
# out to periods.out given hypothetical future values of x, xpre,
# and initial lags of the change in y
sim_rr1a <- ldvsimrr(xhyp_1a,            # The matrix of hypothetical x's
                     simbeta_1a,        # The matrix of simulated betas
                     ci=0.95,            # Desired confidence interval
                     constant=NA,        # Column containing the constant
                                         # set to NA for no constant
                     phi=simphi_1a,     # estimated AR parameters; length must match lagY 
                     lagY=lagY_1a,       # lags of y, most recent last
                     transform="diff",   # Model is differenced
                     initialY=initialY   # Required for differenced Y in ldvsimrr
                     )

Now we are going to plot the simulations using Chris’s tile package and the custom function cigLineplots. The latter is already saved in the TS_code.R script, in the source folder.

# Make plots of expected values, first differences, and percent changes
# using custom tile code in helperCigs.R

# Hypothetical initial level of Y for simulation
initialY <- mean(pdata$packpc[pdata$year==1995], na.rm=TRUE)

# axis limits
limitsEV <- c(50, 105)
limitsFD <- c(-40, 5)
limitsRR <- c(-40, 5)

# Model 1a Lineplot: Packs
cigLineplots(sim_ev1a, sim_base1a, sim_fd1a, sim_rr1a,
             limitsEV, limitsFD, limitsRR, initialY,
             main="Cigarette Taxes & Consumption: 1a. Linear Difference GMM, Individual Effects",
             file=NULL)

# Save pdf output
cigLineplots(sim_ev1a, sim_base1a, sim_fd1a, sim_rr1a,
             limitsEV, limitsFD, limitsRR, initialY,
             main="Cigarette Taxes & Consumption: 1a. Linear Difference GMM, Individual Effects",
             file="output/m1aEVFDRR")

What if we include year FE dummies?

# Recall model 1g: packpc ~ lag(packpc, 1) + income95pc + avgprs95 | lag(packpc, 2:99)
# Difference GMM with state and year fixed effects

# Simulate parameters
simparam_1c <- mvrnorm(sims, coefficients(pgmm_1c), vcovHC(pgmm_1c))        #Sample parameters
simphi_1c <- simparam_1c[,1, drop = FALSE]  #Extract the phis
head(simphi_1c)
##      lag(packpc, 1)
## [1,]      0.3161586
## [2,]      0.2152047
## [3,]      0.2021806
## [4,]      0.2948359
## [5,]      0.4313263
## [6,]      0.1955249
simbeta_1c <- simparam_1c[,2:ncol(simparam_1c)] #Extract the betas
head(simbeta_1c)
##      income95pc   avgprs95       1987       1988       1989      1990
## [1,]  0.4719479 -0.2614730 -1.0492861 -1.2055932 -3.0057420 -4.289267
## [2,]  0.7890498 -0.3076465 -0.4239592 -0.3645147 -2.4363030 -4.908887
## [3,]  0.8743731 -0.4148937  0.4281356  0.9074744  0.6340861 -1.022933
## [4,]  1.7670041 -0.2568263 -0.3245827 -1.2810467 -3.3943722 -5.475072
## [5,]  0.4427081 -0.2250346  0.9184665 -0.2594482 -1.1895247 -2.637772
## [6,]  1.3613348 -0.3162667 -0.8361219 -1.3616656 -3.0899072 -5.009428
##            1991      1992      1993       1994       1995
## [1,] -4.2187835 -3.555403 -8.189917 -10.131184 -10.158173
## [2,] -4.6489377 -3.368394 -9.584629 -12.749927  -9.765093
## [3,]  0.3525874  1.733858 -3.660171  -7.518644  -5.889305
## [4,] -4.8518883 -4.150860 -9.539685 -12.481546 -11.085336
## [5,] -2.1694373 -1.276903 -5.036891  -6.637915  -5.844811
## [6,] -4.7459816 -3.601002 -8.752850 -12.792358 -10.104429
# Make matrix of hypothetical x's:
# Assume an average state raised taxes 60 cents starting 1996
#
# Issues -- we need to somehow include the state and year FEs:
#           Let's set the state to be an "average" state in 1995,
#           and year to be like the last year (1995)

# Make matrix of hypothetical x's: covariates
xhyp_1c <- cfMake(formula=formula_1c, 
                  data=datayearfe, 
                  nscen=periods.out)        #Including the year fixed effects

# pgmm uses covariates in differenced form
# so we want most of them to be 0 (no change)
# exceptions:
# (1) changes in covariates of interest
# (2) differenced time dummies require special care
xhyp_1c$x <- xhyp_1c$xpre <- 0*xhyp_1c$x
xhyp_1c <- cfChange(xhyp_1c, "avgprs95", x=60, scen=1)      #Assume tax is raised 60 cents in 1996
xhyp_1c
## $x
##   packpc income95pc avgprs95 y1987 y1988 y1989 y1990 y1991 y1992 y1993 y1994
## 1      0          0       60     0     0     0     0     0     0     0     0
## 2      0          0        0     0     0     0     0     0     0     0     0
## 3      0          0        0     0     0     0     0     0     0     0     0
##   y1995
## 1     0
## 2     0
## 3     0
## 
## $xpre
##   packpc income95pc avgprs95 y1987 y1988 y1989 y1990 y1991 y1992 y1993 y1994
## 1      0          0        0     0     0     0     0     0     0     0     0
## 2      0          0        0     0     0     0     0     0     0     0     0
## 3      0          0        0     0     0     0     0     0     0     0     0
##   y1995
## 1     0
## 2     0
## 3     0
## 
## $model
## packpc ~ income95pc + avgprs95 + y1987 + y1988 + y1989 + y1990 + 
##     y1991 + y1992 + y1993 + y1994 + y1995 - 1
## 
## attr(,"class")
## [1] "list"           "counterfactual"
# We can "ignore" the state fixed effects for now and add them later
# because model is total linear
# Create baseline scenario
xbase_1c <- xhyp_1c
xbase_1c$x <- xbase_1c$xpre
xbase_1c
## $x
##   packpc income95pc avgprs95 y1987 y1988 y1989 y1990 y1991 y1992 y1993 y1994
## 1      0          0        0     0     0     0     0     0     0     0     0
## 2      0          0        0     0     0     0     0     0     0     0     0
## 3      0          0        0     0     0     0     0     0     0     0     0
##   y1995
## 1     0
## 2     0
## 3     0
## 
## $xpre
##   packpc income95pc avgprs95 y1987 y1988 y1989 y1990 y1991 y1992 y1993 y1994
## 1      0          0        0     0     0     0     0     0     0     0     0
## 2      0          0        0     0     0     0     0     0     0     0     0
## 3      0          0        0     0     0     0     0     0     0     0     0
##   y1995
## 1     0
## 2     0
## 3     0
## 
## $model
## packpc ~ income95pc + avgprs95 + y1987 + y1988 + y1989 + y1990 + 
##     y1991 + y1992 + y1993 + y1994 + y1995 - 1
## 
## attr(,"class")
## [1] "list"           "counterfactual"
# We need a lag of the price per pack
lagY_1c <- NULL # Hypothetical previous change in Y for simulation

for (i in 1:length(pgmm_1c$model)){
  lagY_1c <- c(lagY_1c, as.data.frame(pgmm_1c$model[[i]])["1995",]$packpc) 
}
    
#Store change in packpc 1995 for each state

lagY_1c <- mean(lagY_1c, na.rm=TRUE)    
#Find the mean for all packpc changes in 1995

# Hypothetical initial level of Y for simulation
pdata$packpc[pdata$year==1995]
##   AL-1995   AR-1995   AZ-1995   CA-1995   CO-1995   CT-1995   DE-1995   FL-1995 
## 101.08543 111.04297  71.95417  56.85931  82.58292  79.47219 124.46660  93.07455 
##   GA-1995   IA-1995   ID-1995   IL-1995   IN-1995   KS-1995   KY-1995   LA-1995 
##  97.47462  92.40160  74.84978  83.26508 134.25835  88.75344 172.64778 105.17613 
##   MA-1995   MD-1995   ME-1995   MI-1995   MN-1995   MO-1995   MS-1995   MT-1995 
##  76.62064  77.47355 102.46978  81.38825  82.94530 122.45028 105.58245  87.15957 
##   NC-1995   ND-1995   NE-1995   NH-1995   NJ-1995   NM-1995   NV-1995   NY-1995 
## 121.53806  79.80697  87.27071 156.33675  80.37137  64.66887  93.52612  70.81732 
##   OH-1995   OK-1995   OR-1995   PA-1995   RI-1995   SC-1995   SD-1995   TN-1995 
## 111.38010 108.68011  92.15575  95.64309  92.59980 108.08275  97.21923 122.32005 
##   TX-1995   UT-1995   VA-1995   VT-1995   WA-1995   WI-1995   WV-1995   WY-1995 
##  73.07931  49.27220 105.38687 122.33475  65.53092  92.46635 115.56883 112.23814
initialY <- mean(pdata$packpc[pdata$year==1995], na.rm=TRUE)
#Set the initial mean value of pack in 1995 across states
 
# Simulate expected values of Y (on original level scale)
# out to periods.out given hypothetical future values of X,
# initial lags of the change in Y, and an initial level of Y
sim_ev1c <- ldvsimev(xhyp_1c,               # The matrix of hypothetical x's
                     simbeta_1c,           # The matrix of simulated betas
                     ci=0.95,            # Desired confidence interval
                     constant=NA,        # NA indicates no constant!
                     phi=simphi_1c,            # estimated AR parameters; length must match lagY 
                     lagY=lagY_1c,       # lags of y, most recent last
                     transform="diff",   # "log" to undo log transformation,
                                         # "diff" to under first differencing
                                         # "difflog" to do both
                     initialY=initialY   # for differenced models, the lag of the level of y
                     )

# Simulate expected values of Y given no change in covariates
sim_base1c <- ldvsimev(xbase_1c,               # The matrix of hypothetical x's
                       simbeta_1c,           # The matrix of simulated betas
                       ci=0.95,            # Desired confidence interval
                       constant=NA,        # NA indicates no constant!
                       phi=simphi_1c,            # estimated AR parameters; length must match lagY 
                       lagY=lagY_1c,          # lags of y, most recent last
                       transform="diff",   # "log" to undo log transformation,
                                         # "diff" to under first differencing
                                         # "difflog" to do both
                       initialY=initialY   # for differenced models, the lag of the level of y
                       )    

# Simulate first differences in y 
# out to periods.out given hypothetical future values of x, xpre,
# and initial lags of the change in y
sim_fd1c <- ldvsimfd(xhyp_1c,            # The matrix of hypothetical x's
                     simbeta_1c,        # The matrix of simulated betas
                     ci=0.95,            # Desired confidence interval
                     constant=NA,        # Column containing the constant
                                         # set to NA for no constant
                     phi=simphi_1c,     # estimated AR parameters; length must match lagY 
                     lagY=lagY_1c,       # lags of y, most recent last
                     transform="diff",   # Model is differenced
                     #initialY=initialY   # Redundant in this case (fd of linear differenced Y)
                     )

# Simulate relative risks in y
# out to periods.out given hypothetical future values of x, xpre,
# and initial lags of the change in y
sim_rr1c <- ldvsimrr(xhyp_1c,            # The matrix of hypothetical x's
                     simbeta_1c,        # The matrix of simulated betas
                     ci=0.95,            # Desired confidence interval
                     constant=NA,        # Column containing the constant
                                         # set to NA for no constant
                     phi=simphi_1c,     # estimated AR parameters; length must match lagY 
                     lagY=lagY_1c,       # lags of y, most recent last
                     transform="diff",   # Model is differenced
                     initialY=initialY   # Required for differenced Y in ldvsimrr
                     )




# Model xhyp_1c Lineplot: Packs
cigLineplots(sim_ev1c, sim_base1c, sim_fd1c, sim_rr1c,
             limitsEV, limitsFD, limitsRR, initialY,
             main="Cigarette Taxes & Consumption: 1c. Linear Difference GMM, Two-way Effects",
             file=NULL
             )

# save output
cigLineplots(sim_ev1c, sim_base1c, sim_fd1c, sim_rr1c,
             limitsEV, limitsFD, limitsRR, initialY,
             main="Cigarette Taxes & Consumption: 1c. Linear Difference GMM, Two-way Effects",
             file="output/m1cEVFDRR"
             )

  1. See https://vincentarelbundock.github.io/Rdatasets/doc/Ecdat/Cigarette.html.↩︎