CSSS/POLS 512: Lab 6

Fixed Effects, Random Effects & Lagged Variables in Panel Data Analysis

Tao Lin

May 6, 2022

Agenda

  • Fixed Effects
  • Random Effects
  • Lagged Variables in Panel Data Models
  • Counterfactual Forecasting with RE and FE Models
  • Tips for Poster Session
    • How to make a nicely formatted table?
    • How to make a poster using Rmarkdown?

Case Study: Does Democracy Contribute to Economic Development? (Przeworski et al. 2003)

# Load packages
library(nlme)      # Estimation of mixed effects models
library(lme4)      # Alternative package for mixed effects models
library(plm)       # Econometrics package for linear panel models
library(fixest)    # Alternative package for fixed effect models
library(arm)       # Gelman & Hill code for mixed effects simulation
library(pcse)      # Calculate PCSEs for LS models (Beck & Katz)
library(tseries)   # For ADF unit root test
library(simcf)     # For panel functions and simulators
library(tidyverse) # For data wrangling
library(lmtest)    # Print results after correcting standard errors

# Load Democracy data
# Variable names:
# COUNTRY   YEAR    BRITCOL     CATH
# CIVLIB    EDT ELF60   GDPW    MOSLEM
# NEWC      OIL POLLIB  REG STRA
data <- read.csv("Lab6_data/democracy.csv", header = TRUE, na.strings = ".")
colnames(data)
 [1] "COUNTRY" "CTYNAME" "REGION"  "YEAR"    "BRITCOL" "CATH"    "CIVLIB" 
 [8] "EDT"     "ELF60"   "GDPW"    "MOSLEM"  "NEWC"    "OIL"     "POLLIB" 
[15] "REG"     "STRA"   
colnames(data) <- colnames(data) |> tolower() # change variable names to lower case
colnames(data)
 [1] "country" "ctyname" "region"  "year"    "britcol" "cath"    "civlib" 
 [8] "edt"     "elf60"   "gdpw"    "moslem"  "newc"    "oil"     "pollib" 
[15] "reg"     "stra"   
data$country |> unique() |> length()
[1] 135
data$year |> unique() |> length()
[1] 40
# so it is an N=135, T=40 panel data.
Variable Description1
britcol Dummy variable coded 1 for every year in countries that were British colonies any time after 1918, 0 otherwise.
cath Percentage of Catholics in the population.
civilib Civil liberty. This variable can take values from 1 (least free) to 7 (most free).
edt Cumulative years of education of the average member of the labor force.
elf60 Index of ethnolinguistic fractionalization, 1960.
gdpw Real GDP per worker, 1985 international prices.
moslem Percentage of Moslems in the population.
newc Dummy variable coded 1 for every year in countries that became independent after 1945, 0 otherwise.
oil Dummy variable coded 1 if the average ratio of fuel exports to total exports in 1984-86 exceeded 50%, 0 otherwise.
pollib Political liberty. This variable can take values from 1 (least free) to 7 (most free).
reg Dummy variable coded 1 for dictatorships and 0 for democracies. Transition years are coded as the regime that emerges in that year. For instance, there was a transition from democracy to dictatorship in Argentina in 1955. In that year, reg=1.
stra The sum of past transitions to authoritarianism (as defined by reg) in a country. If a country experienced one or more transitions to authoritarianism before 1950, stra was coded 1 in 1950.

An Overview of Panel Data Models

The full flexibility model:

\[\Delta^{di}y_{it} = \alpha_i + X_{it}\beta_i + \sum^{P_i}_{p=1}y_{i,t-p}\phi_i + \sum^{Q_i}_{q=1}y_{i,t-q}\rho_i + \epsilon_{it}\]

where intercept \(\alpha_i\), slope \(\beta_i\), ARIMA order \(P_i, D_i, Q_i\) and its coefficients \(\phi_i\), \(\rho_i\) vary and allows panel heteroskedasticity \(\sigma^2_i\). We can put unit-specific trend \(t\theta_i\) and seasonality \(\kappa_{k,i}\) inside this model too.

Fixed Effects

What is Fixed Effects?

Suppose there is unobserved group-level or individual-level heterogeneity \(u_i\) in the true data generating process (DGP) (such group-level or unit-level heterogeneity that does not change over individual or time periods): \[ y_{it} = \alpha + \beta^\prime x_{it} + \color{red}{\delta^\prime u_{i}} + \varepsilon_{it} \] Because it is unobserved, we cannot measure it. Now set \(\alpha_i = \alpha_i^* = \alpha + \delta^\prime u_i\), we can remove this unobservable group-level heterogeneity and rewrite the above equation as follows: \[ y_{it} = \alpha_i + \beta^\prime x_{it} + \varepsilon_{it} \]

The FE model can control for omitted unobservable variables.

  • if we add \(\alpha_i\): control for time-invariant unit-level heterogeneity, such as a country’s culture.
  • if we add \(\lambda_t\): control for common shocks at each period, such as global economic conditions each year.
  • Of course we can also control for omitted unobservables at other levels, such as group, cohort, etc.

However, FE models come at a large cost. - We cannot add time-invariant variables into FE models, otherwise it will cause perfect collinearity. - Incidental parameter problem: suppose we include individual-level FE \(\alpha_i\), the estimates is consistent as \(T \rightarrow \infty\), but not as \(N \rightarrow \infty\).

There are 3 ways to estimate this FE model: 1. Within Estimator 2. Least Square Dummy Variable (LSDV) Estimator 3. First Difference (FD) Estimator

How to Check Time- or Unit-invariant Variables?

pdata <- pdata.frame(data, index = c("country", "year"))
head(pdata) # it will create an id for each observation
       country ctyname region year britcol cath civlib   edt elf60 gdpw moslem
1-1962       1 Algeria Africa 1962       0  0.5     NA 1.160  0.43 5012   99.1
1-1963       1 Algeria Africa 1963       0  0.5     NA 1.250  0.43 6083   99.1
1-1964       1 Algeria Africa 1964       0  0.5     NA 1.345  0.43 6502   99.1
1-1965       1 Algeria Africa 1965       0  0.5     NA 1.450  0.43 6620   99.1
1-1966       1 Algeria Africa 1966       0  0.5     NA 1.560  0.43 6612   99.1
1-1967       1 Algeria Africa 1967       0  0.5     NA 1.675  0.43 6982   99.1
       newc oil pollib reg stra
1-1962    1   1     NA   0    0
1-1963    1   1     NA   0    0
1-1964    1   1     NA   0    0
1-1965    1   1     NA   0    0
1-1966    1   1     NA   0    0
1-1967    1   1     NA   0    0
pvar(pdata)
no time variation:       country ctyname region britcol cath elf60 moslem newc oil 
no individual variation: year edt elf60 
all NA in time dimension for at least one individual:  civlib edt elf60 pollib 
all NA in ind. dimension for at least one time period: civlib edt elf60 pollib 
# or
pvar(data, index = c("country", "year"))
no time variation:       country ctyname region britcol cath elf60 moslem newc oil 
no individual variation: year edt elf60 
all NA in time dimension for at least one individual:  civlib edt elf60 pollib 
all NA in ind. dimension for at least one time period: civlib edt elf60 pollib 

Within Estimator

We take each individuals’ deviation from group means (average over \(t\) within each group/unit \(i\)). Because the observable unit-level heterogeneity is time-invariant, its value should not changed after averaging over time within each unit. Thus, we can avoid estimating the varying intercepts. \[ \begin{aligned} \tilde{y}_{it} =& y_{it} - \bar{y}_{i} \\ =& \color{red}{(\alpha_i - \bar{\alpha}_i)} + \beta^\prime \underbrace{(x_{it} - \bar{x}_{i})}_{\tilde{x}_{it}} + \underbrace{(\varepsilon_{it} - \bar{\varepsilon}_i)}_{\tilde{\varepsilon}_{it}} \\ =& \beta^\prime \tilde{x}_{it} + \tilde{\varepsilon}_{it} \end{aligned} \]

Like OLS, the within estimator \(\hat{\beta}_W\) is asymptotically normal and efficient: \[ \widehat{\beta}_{W} \stackrel{\text { approx. }}{\sim} \mathcal{N}\left(\beta, \sigma^{2}\left(\tilde{\mathbf{X}}^{\top} \tilde{\mathbf{X}}\right)^{-1}\right) \]

However, the estimation of standard error is a little tricky. When we demean all the variables, we already used \(N\) degrees of freedom for \(N\) units. \[ \widehat{\sigma}^{2}=\frac{1}{N T-N-K} \sum_{i=1}^{N} \sum_{t=1}^{T}{\widehat{\tilde{\varepsilon}}}_{i}^{2} \] Hence, if we use a canned OLS function for \(\hat{\beta}_W\), the standard error is not correct, and we need to adjust it manually.

Within-FE Model in R

# within estimator using `plm`
plm_within <- plm(
  log(gdpw) ~ reg + edt, # if we include `oil`, plm() will automatically drop it. Why?
  data = data,
  index = c("country", "year"),
  model = "within", # calculate a within estimator
                    # you can also calculate a "pooling" estimator, which is exactly an OLS model without FE
  effect = "twoways" # estimate a two-way FE model; it means that we consider both country FE and year FE
                    # by default it will only create averages over year within country
)
summary(plm_within) # note that the standard error does not take serial correlation into account
Twoways effects Within Model

Call:
plm(formula = log(gdpw) ~ reg + edt, data = data, effect = "twoways", 
    model = "within", index = c("country", "year"))

Unbalanced Panel: n = 113, T = 5-28, N = 2900

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-0.6405721 -0.0799739  0.0078023  0.0794105  0.5976897 

Coefficients:
     Estimate Std. Error t-value  Pr(>|t|)    
reg 0.0274582  0.0126445  2.1716 0.0299740 *  
edt 0.0217672  0.0057988  3.7537 0.0001778 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    62.255
Residual Sum of Squares: 61.822
R-Squared:      0.0069488
Adj. R-Squared: -0.04382
F-statistic: 9.64946 on 2 and 2758 DF, p-value: 6.6663e-05
coeftest(plm_within, vcov. = vcovHC(plm_within, method = "arellano")) # Arellano (1987) heteroskedastic and serial correlation

t test of coefficients:

    Estimate Std. Error t value Pr(>|t|)
reg 0.027458   0.027678  0.9921   0.3213
edt 0.021767   0.020064  1.0849   0.2781
coeftest(plm_within, vcov. = vcovBK(plm_within)) # Beck and Katz (1995) panel corrected standard errors

t test of coefficients:

    Estimate Std. Error t value Pr(>|t|)
reg 0.027458   0.033323  0.8240     0.41
edt 0.021767   0.019322  1.1266     0.26
coeftest(plm_within, vcov. = vcovSCC(plm_within)) # Driscoll and Kraay panel corrected standard errors

t test of coefficients:

     Estimate Std. Error t value Pr(>|t|)  
reg 0.0274582  0.0194738  1.4100  0.15865  
edt 0.0217672  0.0096307  2.2602  0.02389 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# an alternative way is to use `fixest`; it is faster when we have a large number of FEs
fixest_model <- feols(
  log(gdpw) ~ reg + edt | country + year,
  data = data,
  cluster = ~ country + year
)
NOTE: 1,226 observations removed because of NA values (RHS: 1,226).
summary(fixest_model)
OLS estimation, Dep. Var.: log(gdpw)
Observations: 2,900 
Fixed-effects: country: 113,  year: 28
Standard-errors: Clustered (country & year) 
    Estimate Std. Error  t value Pr(>|t|) 
reg 0.027458   0.028661 0.958037  0.34654 
edt 0.021767   0.020583 1.057552  0.29963 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.146007     Adj. R2: 0.979069
                 Within R2: 0.006949

Recovering Fixed Effects

Although within estimator ignores varying intercepts, we can actually recover them by using: \[ \hat{\alpha}_i = \hat{y}_{it} - \hat{\beta}^\prime x_{it} \] Sometimes it is useful for us to look at the variation between groups.

plm::fixef(plm_within)
     1      2      3      4      5      6      7      9     12     14     15 
9.0359 7.2290 7.2571 8.0582 6.2773 6.5999 7.4875 6.9910 8.0574 7.9979 6.1111 
    16     17     18     19     21     22     23     24     25     26     27 
8.8052 7.3584 7.3129 7.0053 7.9941 7.3571 7.1240 7.5886 7.4434 6.8096 7.1167 
    28     29     30     31     33     34     35     37     39     40     41 
7.7384 8.5749 8.1579 7.2758 7.3074 6.9509 7.6513 7.7913 8.6077 7.7106 8.4986 
    42     43     44     45     46     47     48     50     52     53     54 
6.5958 6.9910 8.6073 7.0720 6.7182 7.8508 7.7992 9.1082 9.5395 8.6171 8.2586 
    55     57     58     59     60     61     62     63     64     65     66 
8.1832 8.4233 7.3688 7.9101 8.4038 9.1186 8.3928 8.4709 9.8077 9.6805 9.0808 
    67     68     69     70     71     72     73     74     76     77     78 
8.0791 8.6016 8.7394 8.4788 8.4000 8.4796 8.0331 8.5837 8.7413 9.5642 8.0126 
    79     80     81     82     83     84     85     86     87     89     91 
7.0471 7.1911 7.6276 9.1569 9.3784 9.1492 8.7839 8.7029 8.1206 8.3399 6.4892 
    92     93     94     95     96     97     98     99    101    102    103 
7.2936 7.6676 7.7916 9.1962 7.7352 9.1447 8.3748 7.5992 9.2921 9.4239 8.7944 
   104    105    106    107    108    110    111    112    113    114    116 
8.2487 9.3350 9.1709 9.4579 9.3866 8.8132 8.8980 9.2281 8.9777 9.3688 8.8795 
   117    118    119    120    122    123    124    125    126    128    129 
9.5394 9.3664 8.6816 8.6035 9.2327 9.4440 9.6242 8.1277 9.2703 8.7859 9.4947 
   130    131    132 
8.9919 9.4621 8.0091 

LSDV Estimator

Another option to directly estimate group-specific intercepts is to use the least squares dummy variables (LSDV) estimator, i.e., OLS with \(N\) group dummies excluding the global intercept.

The group dummies are often colloquially called “fixed effects.” For large \(T\), it should be very similar to FE estimator. But it is not a good idea for very small \(T\): estimates of \(\alpha_i\) will be poor due to small \(T\) within each group.

# LSDV estimator using `lm`
plm_lsdv <- plm(
  log(gdpw) ~ reg + edt + factor(country) + factor(year) - 1, # subtract 1 to drop global intercept
  index = c("country", "year"),
  model = "pooling", # run an OLS model
  data = data
)
coeftest(plm_lsdv, vcov. = vcovHC(plm_lsdv, method = "arellano"))[1:2,] # you can also use lm() to implement LSDV estimator, but it is difficult to adjust for standard errors
     Estimate Std. Error   t value  Pr(>|t|)
reg 0.0274582 0.02767787 0.9920634 0.3212536
edt 0.0217672 0.02006382 1.0848985 0.2780614

Within vs. LSDV

Within and LSDV estimators are equivalent to each other. They give exactly the same estimates.

But we favor one over another based on the following reasons:

  • Their standard errors will differ slightly. We don’t need to adjust degrees of freedom for \(\sigma^2\) manually in LSDV estimator because it directly estimate the fixed effects of N groups. By contrast, standard errors from vanilla OLS on the within estimator will be slightly off due to incorrect degrees of freedom (smart software such as plm() in R and areg in Stata will correct it).
  • Compared to within estimator, LSDV approach can be computationally demanding. The model matrix is a very large matrix to invert when \(N\) is large.
  • However, within and LSDV cannot estimate the intercept very well. The estimates of \(\alpha_i\) is inconsistent if \(N \rightarrow \infty\) and \(T\) fixed (i.e. incidental parameters problem). But there is no need to estimate \(\alpha_i\) unless the quantity of interest is a function of it, such as some non-linear models.

FD Estimator

Since the group-level time-invariant unobservables are constant across time, first-differences are an alter- native to purge out it. The first difference model is the following: \[ \begin{aligned} \Delta y_{it} =& y_{it} - y_{i,t-1} \\ =& \color{red}{(\alpha_i - \alpha_i)} + \beta^\prime \underbrace{(x_{it} - x_{i,t-1})}_{\Delta x_{it}} + \underbrace{(\varepsilon_{it} - \varepsilon_{i,t-1})}_{\Delta \varepsilon_{it}} \\ =& \beta^\prime \Delta x_{it} + \Delta \varepsilon_{it} \end{aligned} \] If \(\Delta \varepsilon_{it}\) are homoskedastic and without serial correlation, then usual OLS standard errors work just fine. Otherwise we need to use GLS to estimate uncertainties.

FD estimator in plm() will automatically include an intercept. The intercept in FD model is actually an coefficient for linear time trend (in the second step of the above equation, it is \(\theta (t+1 - t)\)).

FE vs. FD Estimators

  • FD and FE will give the same estimates when \(T = 2\), but for \(T > 2\), the two methods are different.
  • Note that FD often cannot account for two-way fixed effects. It is not intuitive to have \(y_{it} - y_{i-1,t}\).
  • Both FE and FD estimators are consistent under the assumption of strict exogeneity.
  • However, the primary tradeoff between FE and FD estimators lies in the i.i.d. assumption: if original errors, \(\varepsilon_{it}\), are serially uncorrelated, then FE estimator will be more efficient than FD estimator.
  • But if there are substantial serial dependence in the original errors, we may consider FD estimator to make the first-differences of errors, \(\Delta \varepsilon_{it}\), serially uncorrelated. In the latter case, FD estimator is more efficient than FE estimator.
  • We can always try both: if results are not very sensitive, good!
# FD estimator using `plm`
plm_fd <- plm(
  log(gdpw) ~ reg + edt - 1,
  index = c("country", "year"),
  model = "fd", # FD estimator
  data = data
)
summary(plm_fd)
Oneway (individual) effect First-Difference Model

Call:
plm(formula = log(gdpw) ~ reg + edt - 1, data = data, model = "fd", 
    index = c("country", "year"))

Unbalanced Panel: n = 113, T = 5-28, N = 2900
Observations used in estimation: 2787

Residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.6344 -0.0102  0.0190  0.0168  0.0469  0.5230 

Coefficients:
     Estimate Std. Error t-value  Pr(>|t|)    
reg 0.0061632  0.0082067  0.7510    0.4527    
edt 0.0395749  0.0055039  7.1904 8.264e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    10.679
Residual Sum of Squares: 11.628
R-Squared:      5.8796e-05
Adj. R-Squared: -0.00030025
F-statistic: 26.1857 on 2 and 2785 DF, p-value: 5.4114e-12
plm_owfe <- plm(log(gdpw) ~ reg + edt, data = data, index = c("country", "year"), model = "within", effect = "individual")
summary(plm_owfe)
Oneway (individual) effect Within Model

Call:
plm(formula = log(gdpw) ~ reg + edt, data = data, effect = "individual", 
    model = "within", index = c("country", "year"))

Unbalanced Panel: n = 113, T = 5-28, N = 2900

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-0.832562 -0.095557  0.015056  0.111956  0.714304 

Coefficients:
     Estimate Std. Error t-value Pr(>|t|)    
reg -0.014014   0.015048 -0.9313   0.3518    
edt  0.167823   0.003712 45.2107   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    158.01
Residual Sum of Squares: 91.097
R-Squared:      0.42346
Adj. R-Squared: 0.39986
F-statistic: 1022.77 on 2 and 2785 DF, p-value: < 2.22e-16

Random Effects / Mixed Effects

What is Random Effects?

Assumes the individuals come from a common population. Each individual’s unobservable heterogeneous effect on the outcome follows a normal distribution with an unknown variance \(\sigma_\alpha^2\). Then we may have the following model setup: \[ \begin{aligned} y_{i t} &=\beta_{0}+\beta_{1} x_{i t}+\alpha_{i}+\varepsilon_{i t} \\ \alpha_{i} & \sim \mathcal{N}\left(0, \sigma_{\alpha}^{2}\right) \\ \varepsilon_{i t} & \sim \mathcal{N}\left(0, \sigma_\varepsilon^{2}\right) \end{aligned} \]

Estimate an RE model in R

  • Three packages in R: plm (feasible generalized least squares by defaulty), nlme (maximum likelihood by default), and lme4 (restricted maximum likelihood by default)
  • Pros and Cons among different estimation strategies
    • FGLS2
      • preferred by some econometricians because it has fewer distributional assumptions.
      • sometimes FGLS is computationally complicated because it need to invert a large matrix.
    • ML: biased estimation of variance parameters; but such bias gets smaller for larger sample size
    • REML: unbiased estimation of variance parameters
    • If you prefer a Bayesian approach for more flexible modeling, you can use brms or even manually specify the model in JAGS and stan.
  • Pros and cons among different packages, besides estimation strategies
    • plm cannot deal with mixed effects, i.e., the mean of \(\alpha_i\) varies due to certain covariates, i.e. \(\alpha_i \sim \mathcal{N}(\gamma^\prime z_{it}, \sigma_\alpha^2)\), but nlme and lme4 can do so.
    • nlme can take ARMA structure into account, but plm and lme4 can only consider AR terms.
    • nlme is difficult to specify crossed (non-nested) random effect structure (i.e. multiple separated random effects), while lme4 is easy to do so.
  • In terms of the mixed effects model in nlme and lme4, you may be confused by their different use of “fixed effects” and “random effects.”3
    • fixed effect: coefficients that do not vary by group, such that they are fixed, not random.
    • random effect: group-specific effect; these coefficients vary across groups.

A typical mixed effect model is as follows: \[ y_{it} = \underbrace{\beta^\prime x_{it}}_\text{"fixed effect"} + \underbrace{\gamma^\prime z_i}_\text{"random effects"} + \varepsilon_{it} \] or can be specified as follows: \[ \begin{aligned} y_{i t} &=\beta^\prime x_{i t}+\alpha_{i}+\varepsilon_{i t} \\ \alpha_{i} & \sim \mathcal{N}\left(\gamma^\prime z_i, \sigma_{\alpha}^{2}\right) \\ \varepsilon_{i t} & \sim \mathcal{N}\left(0, \sigma_\varepsilon^{2}\right) \end{aligned} \]

Estimate an RE model in R

# RE model using `plm`
plm_re <- plm(
  log(gdpw) ~ reg + edt + oil,
  data = data,
  index = c("country", "year"),
  model = "random",
  effect = "individual"
)
summary(plm_re)
Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = log(gdpw) ~ reg + edt + oil, data = data, effect = "individual", 
    model = "random", index = c("country", "year"))

Unbalanced Panel: n = 113, T = 5-28, N = 2900

Effects:
                  var std.dev share
idiosyncratic 0.03271 0.18086 0.091
individual    0.32784 0.57258 0.909
theta:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.8601  0.9382  0.9404  0.9381  0.9404  0.9404 

Residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.82664 -0.10254  0.01505  0.00081  0.11712  0.75912 

Coefficients:
              Estimate Std. Error  z-value  Pr(>|z|)    
(Intercept)  7.7290183  0.0608325 127.0540 < 2.2e-16 ***
reg         -0.0030963  0.0150650  -0.2055  0.837159    
edt          0.1717169  0.0036682  46.8124 < 2.2e-16 ***
oil          0.5458284  0.1705342   3.2007  0.001371 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    173.89
Residual Sum of Squares: 96.136
R-Squared:      0.44728
Adj. R-Squared: 0.44671
Chisq: 2200.57 on 3 DF, p-value: < 2.22e-16
# RE model using `nlme`
lme_re <- lme(
  fixed = log(gdpw) ~ reg + edt + oil,
  random = ~ 1 | country,
  data = data,
  na.action = na.omit # need to specify how we deal with NA values, otherwise throws an error
)
summary(lme_re)
Linear mixed-effects model fit by REML
  Data: data 
        AIC       BIC   logLik
  -996.8293 -961.0027 504.4146

Random effects:
 Formula: ~1 | country
        (Intercept)  Residual
StdDev:   0.6733608 0.1808866

Fixed effects:  log(gdpw) ~ reg + edt + oil 
                Value  Std.Error   DF   t-value p-value
(Intercept)  7.735312 0.07004634 2785 110.43135  0.0000
reg         -0.006041 0.01498196 2785  -0.40321  0.6868
edt          0.170680 0.00366084 2785  46.62312  0.0000
oil          0.543484 0.19892569  111   2.73210  0.0073
 Correlation: 
    (Intr) reg    edt   
reg -0.066              
edt -0.258 -0.058       
oil -0.335  0.010  0.032

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-4.59951054 -0.52967154  0.08334205  0.61991189  4.00624582 

Number of Observations: 2900
Number of Groups: 113 
# RE model using `lme4`
lmer_re <- lmer(
  log(gdpw) ~ reg + edt + oil + (1 | country),
  data = data
)
summary(lmer_re)
Linear mixed model fit by REML ['lmerMod']
Formula: log(gdpw) ~ reg + edt + oil + (1 | country)
   Data: data

REML criterion at convergence: -1008.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.5995 -0.5297  0.0833  0.6199  4.0062 

Random effects:
 Groups   Name        Variance Std.Dev.
 country  (Intercept) 0.45341  0.6734  
 Residual             0.03272  0.1809  
Number of obs: 2900, groups:  country, 113

Fixed effects:
             Estimate Std. Error t value
(Intercept)  7.735312   0.070046 110.431
reg         -0.006041   0.014982  -0.403
edt          0.170680   0.003661  46.623
oil          0.543484   0.198926   2.732

Correlation of Fixed Effects:
    (Intr) reg    edt   
reg -0.066              
edt -0.258 -0.058       
oil -0.335  0.010  0.032

FE vs. RE

  • Different assumption about the relationship between group heterogeneity \(u_i\) and covariates \(x_{it}\).
    • In FE model, we assume conditional ignorability \(E[\varepsilon_{it}|x_{it}, u_i] = 0\). That is, we assume that conditional on unobservable confounder \(u_i\), \(\varepsilon_{it}\) and \(x_{it}\) is uncorrelated. \(x_{it}\) and \(u_i\) have some correlation.
    • In RE model, we assume there is no unmeasured confounder, so ignorability holds without conditioning on \(u_i\): \(E[\varepsilon_{it} | x_{it}] = E[u_i] = 0\). \(x_{it}\) and \(u_i\) have no correlation.
  • RE estimator is a matrix-weighted average of within estimator and between estimator.
    • But an intuitive way to think about it is that FE model rule out between-group variation; but in RE model, we preserve the information of between-group variation by modeling \(\sigma_\alpha^2\).
    • It is difficult to show it mathematically, because it requires some complicated algebra. We can think of RE model doing quasi-demeaning or partial-pooling:

\[ y_{it} - \theta\bar{y}_i = \beta^\prime (x_{it} - \theta \bar{x}_i) + (\varepsilon_{it} - \theta\bar{\varepsilon}_i) \] where \(\theta = 0\) implies pooled OLS and \(\theta = 1\) implies fixed effects. Doing some math shows that \[ \theta = 1 - \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T \cdot \sigma_\alpha^2}} \]

Dynamic TS Structure in Panel Data Model

Detect Time Series Structure in Panel Data

We know that for some countries which are good at growth maintenance, gdpw may be a non-stationary time series; whereas for some countries experienced economic fluctuation, their gdpw may have mean-reverting or even worse behavior in the long run.

ggplot(pdata, aes(x = year, y = gdpw, group = country, color = region)) +
  geom_line(alpha = 0.5) +
  theme_bw()

Detect Time Series Structure in Panel Data

We can use statistical tests to formally detect stationarity of time series in panel data.

  • purtest() in plm package
  • punitroots package: the author no longer maintain this package
  • apply adf.test() and pp.test() to each time series, and plot the distribution of p values.
pseries <- split(pdata$gdpw, pdata$country)
pp_pval <- sapply(pseries, function(x){pp.test(x)[["p.value"]]})
adf_pval <- sapply(pseries, function(x){adf.test(x)[["p.value"]]})
hist(pp_pval, breaks = 100)

hist(adf_pval, breaks = 100)

Detect Time Series Structure in Panel Data

We detect many countries have non-stationary time series for the level dependent variable, but what about a differenced dependent variable?

pdata$gdpw_diff <- diff(pdata$gdpw) # because we already have a pdata.frame, we can safely use plm::diff() to transform it
ggplot(pdata, aes(x = year, y = gdpw_diff, group = country, color = region)) +
  geom_line(alpha = 0.5) +
  theme_bw() # much more stationary!

pseries <- split(pdata$gdpw_diff, pdata$country) # have NA values in the first observation, need to remove them in statistical tests
pp_pval <- sapply(pseries, function(x){pp.test(na.omit(x))[["p.value"]]})
adf_pval <- sapply(pseries, function(x){adf.test(na.omit(x))[["p.value"]]})
hist(pp_pval, breaks = 100)

hist(adf_pval, breaks = 100)

Dynamic TS structure in RE model

re_arima <- lme( # random effects ARIMA(1, 1, 0)
  fixed = gdpw_diff ~ reg + edt + oil,
  random = ~ 1 | country,
  correlation = corARMA(
    form = ~ year | country, # year nested in country
    p = 1,
    q = 0
  ),
  data = pdata %>% 
    mutate(year = as.integer(year)), # the time variable need to be integer, otherwise it cannot perform ARMA estimation
  na.action = na.omit
)
summary(re_arima)
Linear mixed-effects model fit by REML
  Data: pdata %>% mutate(year = as.integer(year)) 
       AIC     BIC    logLik
  44165.63 44207.3 -22075.81

Random effects:
 Formula: ~1 | country
        (Intercept) Residual
StdDev:   0.2261208  581.998

Correlation Structure: AR(1)
 Formula: ~year | country 
 Parameter estimate(s):
      Phi 
0.2268233 
Fixed effects:  gdpw_diff ~ reg + edt + oil 
                Value Std.Error   DF   t-value p-value
(Intercept)  27.43342  26.89294 2732  1.020097  0.3078
reg         107.53138  33.46420 2732  3.213326  0.0013
edt          24.95241   5.35870 2732  4.656433  0.0000
oil         -14.76787  43.07012  111 -0.342880  0.7323
 Correlation: 
    (Intr) reg    edt   
reg  0.071              
edt -0.735 -0.561       
oil -0.328  0.018  0.137

Standardized Within-Group Residuals:
         Min           Q1          Med           Q3          Max 
-16.05507865  -0.28772677  -0.03423844   0.33234239  13.38099374 

Number of Observations: 2847
Number of Groups: 113 

Dynamic TS structure in FE model

pdata$gdpw_diff_lag1 <- lag(pdata$gdpw_diff)
pdata$gdpw_diff_lag2 <- lag(pdata$gdpw_diff, 2)
fe_ar1 <- plm( # fixed effect ARIMA(1, 1, 0); cannot incorporate MA structure
  gdpw_diff ~ gdpw_diff_lag1 + reg + edt,
  data = pdata, model = "within", effect = "individual"
)
fe_ar2 <- plm( # fixed effect ARIMA(2, 1, 0); cannot incorporate MA structure
  gdpw_diff ~ gdpw_diff_lag1 + gdpw_diff_lag2 + reg + edt,
  data = pdata, model = "within", effect = "individual"
)
pbgtest(fe_ar1) # null means no serial correlation; so there is serial correlation

    Breusch-Godfrey/Wooldridge test for serial correlation in panel models

data:  gdpw_diff ~ gdpw_diff_lag1 + reg + edt
chisq = 37.209, df = 3, p-value = 4.155e-08
alternative hypothesis: serial correlation in idiosyncratic errors
pbgtest(fe_ar2) # there is no serial correlation for AR(2) model with first-differenced outcome variable

    Breusch-Godfrey/Wooldridge test for serial correlation in panel models

data:  gdpw_diff ~ gdpw_diff_lag1 + gdpw_diff_lag2 + reg + edt
chisq = 4.0656, df = 2, p-value = 0.131
alternative hypothesis: serial correlation in idiosyncratic errors

FE vs. lagged dependent variable (LDV)

There is a tradeoff when we both incorporate FE and lagged dependent variable in one model.

The estimate of \(\phi\) will always be underestimated in dynamic FE model, but the bias vanishes as \(T\) increases. This is called “Nickell bias.”

To obtain an unbiased estimate with small \(T\), we may choose dynamic panel data model, or prefer either FE or LDV model. If we care about causal inference and want to use a different-in-difference design (suppose it is a simple \(2\times 2\) DID) to estimate treatment effect, then we may need to think about the underlying causal identification assumption:

  • FE: conditioning on time-invariant unobservable individual-level confounder \(\alpha_i\), \(x_{it}\) and \(y_{it}\) are independent.
  • LDV: conditioning on time-varying unobservable individual-level confounder \(y_{i,t-1}\), \(x_{it}\) and \(y_{it}\) are independent.

FE and LDV have a very useful bracketing property (Angrist and Pischke 2009; Ding and Li 2019):

  • If LDV is correct, but we mistakenly use FE, then the estimate of treatment effect will be biased upward.
  • If FE is correct, but we mistakenly use LDV, then the estimate of treatment effect will be biased downward.
  • We can think of FE and LDV as bounding the causal effect of interest.
  • A better way is to always check the robustness of our findings using alternative identifying assumptions.

Recent literature attempts to go beyond the additive assumption of unobservable confounders. We can allow unit-level effect and time effect interact with each other so that we can account for time-invariant and time-varying unit-level confounders at the same time. Recent literature consider decomposing the matrix of error terms to several latent factors (e.g. use PCA) to account for this interactive structure (Bai 2009; Xu 2017).

Counterfactual Forecasting

Counterfactual Forecasting with RE Models

  • Extract model information for simulation, such as coefficients and variance-covariance matrix.
    • In nlme package, we use fixed.effects() or fixef() to extract coefficients that are fixed across all groups. (note that here we have different meaning of “fixed effect” and “random effect.”)
    • You can extract group-varying coefficients or intercepts using random.effects() or ranef().
    • Simply using coefficients() or coef() will aggregate “fixed effect” and “random effect” for each group.
  • Simulate parameters from multivariate Normal distribution
  • Create hypothetical scenarios
  • Use simulated parameters and hypothetical scenarios to calculate quantity of interest, such as expected values, first difference, relative risk, etc.
sims <- 1000
simbetas <- mvrnorm(sims, fixef(re_arima), vcov(re_arima))

# Make matrix of hypothetical x's
formula <- gdpw_diff ~ reg + edt + oil
periods.out <- 50
xhyp <- cfMake(formula, pdata, periods.out)
for (i in 1:periods.out) 
  xhyp <- cfChange(xhyp, "edt", x = mean(pdata$edt, na.rm=TRUE) + sd(pdata$edt, na.rm=TRUE), scen=i) # we assume a permanent increase in education (+ 1 SD) over the next 50 years

phi <- 0.25 # from model summary()
lagY <- mean(pdata$gdpw_diff, na.rm = TRUE) # Hypothetical previous change in Y for simulation
initialY <- mean(pdata$gdpw, na.rm = TRUE) # Hypothetical initial level of Y for simulation
# Simulate expected values of Y (on original level scale)
# out to periods.out given hypothetical future values of X,
# initial lags of the change in Y, and an initial level of Y
sim_evre <- ldvsimev(xhyp,               # The matrix of hypothetical x's
                    simbetas,           # The matrix of simulated betas
                    ci=0.95,            # Desired confidence interval
                    constant=1,         # Column containing the constant
                                        # set to NA for no constant
                    phi=phi,            # estimated AR parameters; length must match lagY 
                    lagY=lagY,          # lags of y, most recent last
                    transform="diff",   # "log" to undo log transformation,
                                        # "diff" to under first differencing
                                        # "difflog" to do both
                    initialY=initialY   # for differenced models, the lag of the level of y
                    )

# Simulate first differences in Y (on original level scale)
# out to periods.out given hypothetical future values of X, X0,
# and initial lags of the change in Y
sim_fdre <- ldvsimfd(xhyp,               # The matrix of hypothetical x's
                    simbetas,           # The matrix of simulated betas
                    ci=0.95,            # Desired confidence interval
                    constant=1,         # Column containing the constant
                                        # set to NA for no constant
                    phi=phi,            # estimated AR parameters; length must match lagY 
                    lagY=lagY,          # lags of y, most recent last
                    transform="diff"    # "log" to undo log transformation,
                                        # "diff" to under first differencing
                                        # "difflog" to do both
                    )
plot.new()
par(usr=c(1,50,-15000,15000))
axis(1,at=seq(1,50,10))
axis(2,at=seq(-15000,15000,5000))
title(xlab="Time",ylab="Expected change in constant GDP ($ pc)",main="Random effects ARIMA(1,1,0)") 

# Make the x-coord of a confidence envelope polygon
xpoly <- c(1:periods.out,
           rev(1:periods.out),
           1)

# Make the y-coord of a confidence envelope polygon
ypoly <- c(sim_fdre$lower,
           rev(sim_fdre$upper),
           sim_fdre$lower[1])

# Choose the color of the polygon 
col <- "lightblue"

# Plot the polygon first, before the points & lines
polygon(x=xpoly,
        y=ypoly,
        col=col,
        border=FALSE
        )

# Plot the fitted line
lines(x=1:periods.out,y=sim_fdre$pe,col="darkblue")
lines(x=c(0,50),y=c(0,0),lty="solid")

Counterfactual Forecasting with FE Model

  • The general steps are similar to counterfactual forecasting with RE model.
  • We can use fixef() to extract group-specific intercepts.
  • Sometimes we may prefer using panel-corrected standard error as variance-covariance matrix for parameter simulation.
sims <- 1000
simparam <- mvrnorm(sims, coef(fe_ar1), vcovBK(fe_ar1))
# Pull off the simulated lag coefficient
simphi <- simparam[,1]
# Put together the "constant" term (avg of the FEs, or a specific FE if you like)
# with the rest of the regressors
simbetas <- cbind(rep(mean(fixef(fe_ar1)), sims), simparam[,2:ncol(simparam)])

# Make matrix of hypothetical x's: drop OIL!
#                                  and now we need hypothetical countries!
#                                  and no intercept!

# Make matrix of hypothetical x's
# Note the formula below is a little different; we've removed the lag
# and will let the simulator handle lag structure
formula <- gdpw_diff ~ reg + edt

periods.out <- 50
xhyp <- cfMake(formula, pdata, periods.out)
for (i in 1:periods.out) 
  xhyp <- cfChange(xhyp, "edt", x=mean(pdata$edt,na.rm=TRUE)+sd(pdata$edt,na.rm=TRUE), scen=i)

phi <- mean(simphi) 
lagY <- mean(pdata$gdpw_diff, na.rm = TRUE) # Hypothetical previous change in Y for simulation
initialY <- mean(pdata$gdpw, na.rm = TRUE) # Hypothetical initial level of Y for simulation
# Simulate expected values of Y (on original level scale)
# out to periods.out given hypothetical future values of X,
# initial lags of the change in Y, and an initial level of Y
sim_evfe <- ldvsimev(xhyp,             # The matrix of hypothetical x's
                    simbetas,           # The matrix of simulated betas
                    ci=0.95,            # Desired confidence interval
                    constant=NA,        # NA indicates no constant!
                    phi=phi,            # estimated AR parameters; length must match lagY 
                    lagY=lagY,          # lags of y, most recent last
                    transform="diff",   # "log" to undo log transformation,
                                        # "diff" to under first differencing
                                        # "difflog" to do both
                    initialY=initialY   # for differenced models, the lag of the level of y
                    )


# Simulate first differences in Y (on original level scale)
# out to periods.out given hypothetical future values of X, X0,
# and initial lags of the change in Y
sim_fdfe <- ldvsimfd(xhyp,               # The matrix of hypothetical x's
                      simbetas,           # The matrix of simulated betas
                      ci=0.95,            # Desired confidence interval
                      constant=NA,         # Column containing the constant
                                          # set to NA for no constant
                      phi=phi,            # estimated AR parameters; length must match lagY 
                      lagY=lagY,          # lags of y, most recent last
                      transform="diff"    # "log" to undo log transformation,
                                          # "diff" to under first differencing
                                          # "difflog" to do both
                      )
plot.new()
par(usr=c(1,50,-15000,15000))
axis(1,at=seq(1,50,10))
axis(2,at=seq(-15000,15000,5000))
title(xlab="Time",ylab="Expected change in constant GDP ($ pc)",main="Fixed effects ARIMA(1,1,0), pcse") 

# Make the x-coord of a confidence envelope polygon
xpoly <- c(1:periods.out,
           rev(1:periods.out),
           1)
# Make the y-coord of a confidence envelope polygon
ypoly <- c(sim_fdfe$lower,
           rev(sim_fdfe$upper),
           sim_fdfe$lower[1])
# Choose the color of the polygon 
col <- "lightgreen"
# Plot the polygon first, before the points & lines
polygon(x=xpoly,
        y=ypoly,
        col=col,
        border=FALSE
        )
# Plot the fitted line
lines(x=1:periods.out,y=sim_fdfe$pe,col="darkgreen")
lines(x=c(0,50),y=c(0,0),lty="solid")

Tips for Poster Session

  • How to produce a nicely formatted table?
    • Popular options for regression tables: stargazer, texreg, modelsummary, etc.
      • Note that the author of stargazer no longer maintain this package now.
    • If you just want to make a simple table: xtable, kableExtra, modelsummary.
  • If you want to make a reproducible poster with Rmarkdown: posterdown and postr.

Footnotes

  1. See http://www.u.arizona.edu/~mishler/PrzeworskiCodebook.PDF.

  2. See https://cran.r-project.org/web/packages/plm/vignettes/A_plmPackage.html#nlme.

  3. See https://statmodeling.stat.columbia.edu/2005/01/25/why_i_dont_use/.