Fixed Effects, Random Effects & Lagged Variables in Panel Data Analysis
Tao Lin
May 6, 2022
Agenda
Fixed Effects
Random Effects
Lagged Variables in Panel Data Models
Counterfactual Forecasting with RE and FE Models
Tips for Poster Session
How to make a nicely formatted table?
How to make a poster using Rmarkdown?
Case Study: Does Democracy Contribute to Economic Development? (Przeworski et al. 2003)
# Load packageslibrary(nlme) # Estimation of mixed effects modelslibrary(lme4) # Alternative package for mixed effects modelslibrary(plm) # Econometrics package for linear panel modelslibrary(fixest) # Alternative package for fixed effect modelslibrary(arm) # Gelman & Hill code for mixed effects simulationlibrary(pcse) # Calculate PCSEs for LS models (Beck & Katz)library(tseries) # For ADF unit root testlibrary(simcf) # For panel functions and simulatorslibrary(tidyverse) # For data wranglinglibrary(lmtest) # Print results after correcting standard errors# Load Democracy data# Variable names:# COUNTRY YEAR BRITCOL CATH# CIVLIB EDT ELF60 GDPW MOSLEM# NEWC OIL POLLIB REG STRAdata <-read.csv("Lab6_data/democracy.csv", header =TRUE, na.strings =".")colnames(data)
Dummy variable coded 1 for every year in countries that were British colonies any time after 1918, 0 otherwise.
cath
Percentage of Catholics in the population.
civilib
Civil liberty. This variable can take values from 1 (least free) to 7 (most free).
edt
Cumulative years of education of the average member of the labor force.
elf60
Index of ethnolinguistic fractionalization, 1960.
gdpw
Real GDP per worker, 1985 international prices.
moslem
Percentage of Moslems in the population.
newc
Dummy variable coded 1 for every year in countries that became independent after 1945, 0 otherwise.
oil
Dummy variable coded 1 if the average ratio of fuel exports to total exports in 1984-86 exceeded 50%, 0 otherwise.
pollib
Political liberty. This variable can take values from 1 (least free) to 7 (most free).
reg
Dummy variable coded 1 for dictatorships and 0 for democracies. Transition years are coded as the regime that emerges in that year. For instance, there was a transition from democracy to dictatorship in Argentina in 1955. In that year, reg=1.
stra
The sum of past transitions to authoritarianism (as defined by reg) in a country. If a country experienced one or more transitions to authoritarianism before 1950, stra was coded 1 in 1950.
where intercept \(\alpha_i\), slope \(\beta_i\), ARIMA order \(P_i, D_i, Q_i\) and its coefficients \(\phi_i\), \(\rho_i\) vary and allows panel heteroskedasticity \(\sigma^2_i\). We can put unit-specific trend \(t\theta_i\) and seasonality \(\kappa_{k,i}\) inside this model too.
Fixed Effects
What is Fixed Effects?
Suppose there is unobserved group-level or individual-level heterogeneity \(u_i\) in the true data generating process (DGP) (such group-level or unit-level heterogeneity that does not change over individual or time periods): \[
y_{it} = \alpha + \beta^\prime x_{it} + \color{red}{\delta^\prime u_{i}} + \varepsilon_{it}
\] Because it is unobserved, we cannot measure it. Now set \(\alpha_i = \alpha_i^* = \alpha + \delta^\prime u_i\), we can remove this unobservable group-level heterogeneity and rewrite the above equation as follows: \[
y_{it} = \alpha_i + \beta^\prime x_{it} + \varepsilon_{it}
\]
The FE model can control for omitted unobservable variables.
if we add \(\alpha_i\): control for time-invariant unit-level heterogeneity, such as a country’s culture.
if we add \(\lambda_t\): control for common shocks at each period, such as global economic conditions each year.
Of course we can also control for omitted unobservables at other levels, such as group, cohort, etc.
However, FE models come at a large cost. - We cannot add time-invariant variables into FE models, otherwise it will cause perfect collinearity. - Incidental parameter problem: suppose we include individual-level FE \(\alpha_i\), the estimates is consistent as \(T \rightarrow \infty\), but not as \(N \rightarrow \infty\).
There are 3 ways to estimate this FE model: 1. Within Estimator 2. Least Square Dummy Variable (LSDV) Estimator 3. First Difference (FD) Estimator
How to Check Time- or Unit-invariant Variables?
pdata <-pdata.frame(data, index =c("country", "year"))head(pdata) # it will create an id for each observation
country ctyname region year britcol cath civlib edt elf60 gdpw moslem
1-1962 1 Algeria Africa 1962 0 0.5 NA 1.160 0.43 5012 99.1
1-1963 1 Algeria Africa 1963 0 0.5 NA 1.250 0.43 6083 99.1
1-1964 1 Algeria Africa 1964 0 0.5 NA 1.345 0.43 6502 99.1
1-1965 1 Algeria Africa 1965 0 0.5 NA 1.450 0.43 6620 99.1
1-1966 1 Algeria Africa 1966 0 0.5 NA 1.560 0.43 6612 99.1
1-1967 1 Algeria Africa 1967 0 0.5 NA 1.675 0.43 6982 99.1
newc oil pollib reg stra
1-1962 1 1 NA 0 0
1-1963 1 1 NA 0 0
1-1964 1 1 NA 0 0
1-1965 1 1 NA 0 0
1-1966 1 1 NA 0 0
1-1967 1 1 NA 0 0
pvar(pdata)
no time variation: country ctyname region britcol cath elf60 moslem newc oil
no individual variation: year edt elf60
all NA in time dimension for at least one individual: civlib edt elf60 pollib
all NA in ind. dimension for at least one time period: civlib edt elf60 pollib
# orpvar(data, index =c("country", "year"))
no time variation: country ctyname region britcol cath elf60 moslem newc oil
no individual variation: year edt elf60
all NA in time dimension for at least one individual: civlib edt elf60 pollib
all NA in ind. dimension for at least one time period: civlib edt elf60 pollib
Within Estimator
We take each individuals’ deviation from group means (average over \(t\) within each group/unit \(i\)). Because the observable unit-level heterogeneity is time-invariant, its value should not changed after averaging over time within each unit. Thus, we can avoid estimating the varying intercepts. \[
\begin{aligned}
\tilde{y}_{it} =& y_{it} - \bar{y}_{i} \\
=& \color{red}{(\alpha_i - \bar{\alpha}_i)} + \beta^\prime \underbrace{(x_{it} - \bar{x}_{i})}_{\tilde{x}_{it}} + \underbrace{(\varepsilon_{it} - \bar{\varepsilon}_i)}_{\tilde{\varepsilon}_{it}} \\
=& \beta^\prime \tilde{x}_{it} + \tilde{\varepsilon}_{it}
\end{aligned}
\]
Like OLS, the within estimator \(\hat{\beta}_W\) is asymptotically normal and efficient: \[
\widehat{\beta}_{W} \stackrel{\text { approx. }}{\sim} \mathcal{N}\left(\beta, \sigma^{2}\left(\tilde{\mathbf{X}}^{\top} \tilde{\mathbf{X}}\right)^{-1}\right)
\]
However, the estimation of standard error is a little tricky. When we demean all the variables, we already used \(N\) degrees of freedom for \(N\) units. \[
\widehat{\sigma}^{2}=\frac{1}{N T-N-K} \sum_{i=1}^{N} \sum_{t=1}^{T}{\widehat{\tilde{\varepsilon}}}_{i}^{2}
\] Hence, if we use a canned OLS function for \(\hat{\beta}_W\), the standard error is not correct, and we need to adjust it manually.
Within-FE Model in R
# within estimator using `plm`plm_within <-plm(log(gdpw) ~ reg + edt, # if we include `oil`, plm() will automatically drop it. Why?data = data,index =c("country", "year"),model ="within", # calculate a within estimator# you can also calculate a "pooling" estimator, which is exactly an OLS model without FEeffect ="twoways"# estimate a two-way FE model; it means that we consider both country FE and year FE# by default it will only create averages over year within country)summary(plm_within) # note that the standard error does not take serial correlation into account
Twoways effects Within Model
Call:
plm(formula = log(gdpw) ~ reg + edt, data = data, effect = "twoways",
model = "within", index = c("country", "year"))
Unbalanced Panel: n = 113, T = 5-28, N = 2900
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.6405721 -0.0799739 0.0078023 0.0794105 0.5976897
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
reg 0.0274582 0.0126445 2.1716 0.0299740 *
edt 0.0217672 0.0057988 3.7537 0.0001778 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 62.255
Residual Sum of Squares: 61.822
R-Squared: 0.0069488
Adj. R-Squared: -0.04382
F-statistic: 9.64946 on 2 and 2758 DF, p-value: 6.6663e-05
coeftest(plm_within, vcov. =vcovHC(plm_within, method ="arellano")) # Arellano (1987) heteroskedastic and serial correlation
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
reg 0.027458 0.027678 0.9921 0.3213
edt 0.021767 0.020064 1.0849 0.2781
coeftest(plm_within, vcov. =vcovBK(plm_within)) # Beck and Katz (1995) panel corrected standard errors
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
reg 0.027458 0.033323 0.8240 0.41
edt 0.021767 0.019322 1.1266 0.26
coeftest(plm_within, vcov. =vcovSCC(plm_within)) # Driscoll and Kraay panel corrected standard errors
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
reg 0.0274582 0.0194738 1.4100 0.15865
edt 0.0217672 0.0096307 2.2602 0.02389 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# an alternative way is to use `fixest`; it is faster when we have a large number of FEsfixest_model <-feols(log(gdpw) ~ reg + edt | country + year,data = data,cluster =~ country + year)
NOTE: 1,226 observations removed because of NA values (RHS: 1,226).
Although within estimator ignores varying intercepts, we can actually recover them by using: \[
\hat{\alpha}_i = \hat{y}_{it} - \hat{\beta}^\prime x_{it}
\] Sometimes it is useful for us to look at the variation between groups.
Another option to directly estimate group-specific intercepts is to use the least squares dummy variables (LSDV) estimator, i.e., OLS with \(N\) group dummies excluding the global intercept.
The group dummies are often colloquially called “fixed effects.” For large \(T\), it should be very similar to FE estimator. But it is not a good idea for very small \(T\): estimates of \(\alpha_i\) will be poor due to small \(T\) within each group.
# LSDV estimator using `lm`plm_lsdv <-plm(log(gdpw) ~ reg + edt +factor(country) +factor(year) -1, # subtract 1 to drop global interceptindex =c("country", "year"),model ="pooling", # run an OLS modeldata = data)coeftest(plm_lsdv, vcov. =vcovHC(plm_lsdv, method ="arellano"))[1:2,] # you can also use lm() to implement LSDV estimator, but it is difficult to adjust for standard errors
Estimate Std. Error t value Pr(>|t|)
reg 0.0274582 0.02767787 0.9920634 0.3212536
edt 0.0217672 0.02006382 1.0848985 0.2780614
Within vs. LSDV
Within and LSDV estimators are equivalent to each other. They give exactly the same estimates.
But we favor one over another based on the following reasons:
Their standard errors will differ slightly. We don’t need to adjust degrees of freedom for \(\sigma^2\) manually in LSDV estimator because it directly estimate the fixed effects of N groups. By contrast, standard errors from vanilla OLS on the within estimator will be slightly off due to incorrect degrees of freedom (smart software such as plm() in R and areg in Stata will correct it).
Compared to within estimator, LSDV approach can be computationally demanding. The model matrix is a very large matrix to invert when \(N\) is large.
However, within and LSDV cannot estimate the intercept very well. The estimates of \(\alpha_i\) is inconsistent if \(N \rightarrow \infty\) and \(T\) fixed (i.e. incidental parameters problem). But there is no need to estimate \(\alpha_i\) unless the quantity of interest is a function of it, such as some non-linear models.
FD Estimator
Since the group-level time-invariant unobservables are constant across time, first-differences are an alter- native to purge out it. The first difference model is the following: \[
\begin{aligned}
\Delta y_{it} =& y_{it} - y_{i,t-1} \\
=& \color{red}{(\alpha_i - \alpha_i)} + \beta^\prime \underbrace{(x_{it} - x_{i,t-1})}_{\Delta x_{it}} + \underbrace{(\varepsilon_{it} - \varepsilon_{i,t-1})}_{\Delta \varepsilon_{it}} \\
=& \beta^\prime \Delta x_{it} + \Delta \varepsilon_{it}
\end{aligned}
\] If \(\Delta \varepsilon_{it}\) are homoskedastic and without serial correlation, then usual OLS standard errors work just fine. Otherwise we need to use GLS to estimate uncertainties.
FD estimator in plm() will automatically include an intercept. The intercept in FD model is actually an coefficient for linear time trend (in the second step of the above equation, it is \(\theta (t+1 - t)\)).
FE vs. FD Estimators
FD and FE will give the same estimates when \(T = 2\), but for \(T > 2\), the two methods are different.
Note that FD often cannot account for two-way fixed effects. It is not intuitive to have \(y_{it} - y_{i-1,t}\).
Both FE and FD estimators are consistent under the assumption of strict exogeneity.
However, the primary tradeoff between FE and FD estimators lies in the i.i.d. assumption: if original errors, \(\varepsilon_{it}\), are serially uncorrelated, then FE estimator will be more efficient than FD estimator.
But if there are substantial serial dependence in the original errors, we may consider FD estimator to make the first-differences of errors, \(\Delta \varepsilon_{it}\), serially uncorrelated. In the latter case, FD estimator is more efficient than FE estimator.
We can always try both: if results are not very sensitive, good!
Oneway (individual) effect First-Difference Model
Call:
plm(formula = log(gdpw) ~ reg + edt - 1, data = data, model = "fd",
index = c("country", "year"))
Unbalanced Panel: n = 113, T = 5-28, N = 2900
Observations used in estimation: 2787
Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.6344 -0.0102 0.0190 0.0168 0.0469 0.5230
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
reg 0.0061632 0.0082067 0.7510 0.4527
edt 0.0395749 0.0055039 7.1904 8.264e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 10.679
Residual Sum of Squares: 11.628
R-Squared: 5.8796e-05
Adj. R-Squared: -0.00030025
F-statistic: 26.1857 on 2 and 2785 DF, p-value: 5.4114e-12
plm_owfe <-plm(log(gdpw) ~ reg + edt, data = data, index =c("country", "year"), model ="within", effect ="individual")summary(plm_owfe)
Oneway (individual) effect Within Model
Call:
plm(formula = log(gdpw) ~ reg + edt, data = data, effect = "individual",
model = "within", index = c("country", "year"))
Unbalanced Panel: n = 113, T = 5-28, N = 2900
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.832562 -0.095557 0.015056 0.111956 0.714304
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
reg -0.014014 0.015048 -0.9313 0.3518
edt 0.167823 0.003712 45.2107 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 158.01
Residual Sum of Squares: 91.097
R-Squared: 0.42346
Adj. R-Squared: 0.39986
F-statistic: 1022.77 on 2 and 2785 DF, p-value: < 2.22e-16
Random Effects / Mixed Effects
What is Random Effects?
Assumes the individuals come from a common population. Each individual’s unobservable heterogeneous effect on the outcome follows a normal distribution with an unknown variance \(\sigma_\alpha^2\). Then we may have the following model setup: \[
\begin{aligned}
y_{i t} &=\beta_{0}+\beta_{1} x_{i t}+\alpha_{i}+\varepsilon_{i t} \\
\alpha_{i} & \sim \mathcal{N}\left(0, \sigma_{\alpha}^{2}\right) \\
\varepsilon_{i t} & \sim \mathcal{N}\left(0, \sigma_\varepsilon^{2}\right)
\end{aligned}
\]
Estimate an RE model in R
Three packages in R: plm (feasible generalized least squares by defaulty), nlme (maximum likelihood by default), and lme4 (restricted maximum likelihood by default)
Pros and Cons among different estimation strategies
preferred by some econometricians because it has fewer distributional assumptions.
sometimes FGLS is computationally complicated because it need to invert a large matrix.
ML: biased estimation of variance parameters; but such bias gets smaller for larger sample size
REML: unbiased estimation of variance parameters
If you prefer a Bayesian approach for more flexible modeling, you can use brms or even manually specify the model in JAGS and stan.
Pros and cons among different packages, besides estimation strategies
plm cannot deal with mixed effects, i.e., the mean of \(\alpha_i\) varies due to certain covariates, i.e. \(\alpha_i \sim \mathcal{N}(\gamma^\prime z_{it}, \sigma_\alpha^2)\), but nlme and lme4 can do so.
nlme can take ARMA structure into account, but plm and lme4 can only consider AR terms.
nlme is difficult to specify crossed (non-nested) random effect structure (i.e. multiple separated random effects), while lme4 is easy to do so.
In terms of the mixed effects model in nlme and lme4, you may be confused by their different use of “fixed effects” and “random effects.”3
fixed effect: coefficients that do not vary by group, such that they are fixed, not random.
random effect: group-specific effect; these coefficients vary across groups.
A typical mixed effect model is as follows: \[
y_{it} = \underbrace{\beta^\prime x_{it}}_\text{"fixed effect"} + \underbrace{\gamma^\prime z_i}_\text{"random effects"} + \varepsilon_{it}
\] or can be specified as follows: \[
\begin{aligned}
y_{i t} &=\beta^\prime x_{i t}+\alpha_{i}+\varepsilon_{i t} \\
\alpha_{i} & \sim \mathcal{N}\left(\gamma^\prime z_i, \sigma_{\alpha}^{2}\right) \\
\varepsilon_{i t} & \sim \mathcal{N}\left(0, \sigma_\varepsilon^{2}\right)
\end{aligned}
\]
Estimate an RE model in R
# RE model using `plm`plm_re <-plm(log(gdpw) ~ reg + edt + oil,data = data,index =c("country", "year"),model ="random",effect ="individual")summary(plm_re)
Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)
Call:
plm(formula = log(gdpw) ~ reg + edt + oil, data = data, effect = "individual",
model = "random", index = c("country", "year"))
Unbalanced Panel: n = 113, T = 5-28, N = 2900
Effects:
var std.dev share
idiosyncratic 0.03271 0.18086 0.091
individual 0.32784 0.57258 0.909
theta:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.8601 0.9382 0.9404 0.9381 0.9404 0.9404
Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.82664 -0.10254 0.01505 0.00081 0.11712 0.75912
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 7.7290183 0.0608325 127.0540 < 2.2e-16 ***
reg -0.0030963 0.0150650 -0.2055 0.837159
edt 0.1717169 0.0036682 46.8124 < 2.2e-16 ***
oil 0.5458284 0.1705342 3.2007 0.001371 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 173.89
Residual Sum of Squares: 96.136
R-Squared: 0.44728
Adj. R-Squared: 0.44671
Chisq: 2200.57 on 3 DF, p-value: < 2.22e-16
# RE model using `nlme`lme_re <-lme(fixed =log(gdpw) ~ reg + edt + oil,random =~1| country,data = data,na.action = na.omit # need to specify how we deal with NA values, otherwise throws an error)summary(lme_re)
Linear mixed-effects model fit by REML
Data: data
AIC BIC logLik
-996.8293 -961.0027 504.4146
Random effects:
Formula: ~1 | country
(Intercept) Residual
StdDev: 0.6733608 0.1808866
Fixed effects: log(gdpw) ~ reg + edt + oil
Value Std.Error DF t-value p-value
(Intercept) 7.735312 0.07004634 2785 110.43135 0.0000
reg -0.006041 0.01498196 2785 -0.40321 0.6868
edt 0.170680 0.00366084 2785 46.62312 0.0000
oil 0.543484 0.19892569 111 2.73210 0.0073
Correlation:
(Intr) reg edt
reg -0.066
edt -0.258 -0.058
oil -0.335 0.010 0.032
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-4.59951054 -0.52967154 0.08334205 0.61991189 4.00624582
Number of Observations: 2900
Number of Groups: 113
# RE model using `lme4`lmer_re <-lmer(log(gdpw) ~ reg + edt + oil + (1| country),data = data)summary(lmer_re)
Linear mixed model fit by REML ['lmerMod']
Formula: log(gdpw) ~ reg + edt + oil + (1 | country)
Data: data
REML criterion at convergence: -1008.8
Scaled residuals:
Min 1Q Median 3Q Max
-4.5995 -0.5297 0.0833 0.6199 4.0062
Random effects:
Groups Name Variance Std.Dev.
country (Intercept) 0.45341 0.6734
Residual 0.03272 0.1809
Number of obs: 2900, groups: country, 113
Fixed effects:
Estimate Std. Error t value
(Intercept) 7.735312 0.070046 110.431
reg -0.006041 0.014982 -0.403
edt 0.170680 0.003661 46.623
oil 0.543484 0.198926 2.732
Correlation of Fixed Effects:
(Intr) reg edt
reg -0.066
edt -0.258 -0.058
oil -0.335 0.010 0.032
FE vs. RE
Different assumption about the relationship between group heterogeneity \(u_i\) and covariates \(x_{it}\).
In FE model, we assume conditional ignorability \(E[\varepsilon_{it}|x_{it}, u_i] = 0\). That is, we assume that conditional on unobservable confounder \(u_i\), \(\varepsilon_{it}\) and \(x_{it}\) is uncorrelated. \(x_{it}\) and \(u_i\) have some correlation.
In RE model, we assume there is no unmeasured confounder, so ignorability holds without conditioning on \(u_i\): \(E[\varepsilon_{it} | x_{it}] = E[u_i] = 0\). \(x_{it}\) and \(u_i\) have no correlation.
RE estimator is a matrix-weighted average of within estimator and between estimator.
But an intuitive way to think about it is that FE model rule out between-group variation; but in RE model, we preserve the information of between-group variation by modeling \(\sigma_\alpha^2\).
It is difficult to show it mathematically, because it requires some complicated algebra. We can think of RE model doing quasi-demeaning or partial-pooling:
\[
y_{it} - \theta\bar{y}_i = \beta^\prime (x_{it} - \theta \bar{x}_i) + (\varepsilon_{it} - \theta\bar{\varepsilon}_i)
\] where \(\theta = 0\) implies pooled OLS and \(\theta = 1\) implies fixed effects. Doing some math shows that \[
\theta = 1 - \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T \cdot \sigma_\alpha^2}}
\]
Dynamic TS Structure in Panel Data Model
Detect Time Series Structure in Panel Data
We know that for some countries which are good at growth maintenance, gdpw may be a non-stationary time series; whereas for some countries experienced economic fluctuation, their gdpw may have mean-reverting or even worse behavior in the long run.
ggplot(pdata, aes(x = year, y = gdpw, group = country, color = region)) +geom_line(alpha =0.5) +theme_bw()
Detect Time Series Structure in Panel Data
We can use statistical tests to formally detect stationarity of time series in panel data.
purtest() in plm package
punitroots package: the author no longer maintain this package
apply adf.test() and pp.test() to each time series, and plot the distribution of p values.
We detect many countries have non-stationary time series for the level dependent variable, but what about a differenced dependent variable?
pdata$gdpw_diff <-diff(pdata$gdpw) # because we already have a pdata.frame, we can safely use plm::diff() to transform itggplot(pdata, aes(x = year, y = gdpw_diff, group = country, color = region)) +geom_line(alpha =0.5) +theme_bw() # much more stationary!
pseries <-split(pdata$gdpw_diff, pdata$country) # have NA values in the first observation, need to remove them in statistical testspp_pval <-sapply(pseries, function(x){pp.test(na.omit(x))[["p.value"]]})adf_pval <-sapply(pseries, function(x){adf.test(na.omit(x))[["p.value"]]})hist(pp_pval, breaks =100)
hist(adf_pval, breaks =100)
Dynamic TS structure in RE model
re_arima <-lme( # random effects ARIMA(1, 1, 0)fixed = gdpw_diff ~ reg + edt + oil,random =~1| country,correlation =corARMA(form =~ year | country, # year nested in countryp =1,q =0 ),data = pdata %>%mutate(year =as.integer(year)), # the time variable need to be integer, otherwise it cannot perform ARMA estimationna.action = na.omit)summary(re_arima)
Linear mixed-effects model fit by REML
Data: pdata %>% mutate(year = as.integer(year))
AIC BIC logLik
44165.63 44207.3 -22075.81
Random effects:
Formula: ~1 | country
(Intercept) Residual
StdDev: 0.2261208 581.998
Correlation Structure: AR(1)
Formula: ~year | country
Parameter estimate(s):
Phi
0.2268233
Fixed effects: gdpw_diff ~ reg + edt + oil
Value Std.Error DF t-value p-value
(Intercept) 27.43342 26.89294 2732 1.020097 0.3078
reg 107.53138 33.46420 2732 3.213326 0.0013
edt 24.95241 5.35870 2732 4.656433 0.0000
oil -14.76787 43.07012 111 -0.342880 0.7323
Correlation:
(Intr) reg edt
reg 0.071
edt -0.735 -0.561
oil -0.328 0.018 0.137
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-16.05507865 -0.28772677 -0.03423844 0.33234239 13.38099374
Number of Observations: 2847
Number of Groups: 113
Dynamic TS structure in FE model
pdata$gdpw_diff_lag1 <-lag(pdata$gdpw_diff)pdata$gdpw_diff_lag2 <-lag(pdata$gdpw_diff, 2)fe_ar1 <-plm( # fixed effect ARIMA(1, 1, 0); cannot incorporate MA structure gdpw_diff ~ gdpw_diff_lag1 + reg + edt,data = pdata, model ="within", effect ="individual")fe_ar2 <-plm( # fixed effect ARIMA(2, 1, 0); cannot incorporate MA structure gdpw_diff ~ gdpw_diff_lag1 + gdpw_diff_lag2 + reg + edt,data = pdata, model ="within", effect ="individual")pbgtest(fe_ar1) # null means no serial correlation; so there is serial correlation
Breusch-Godfrey/Wooldridge test for serial correlation in panel models
data: gdpw_diff ~ gdpw_diff_lag1 + reg + edt
chisq = 37.209, df = 3, p-value = 4.155e-08
alternative hypothesis: serial correlation in idiosyncratic errors
pbgtest(fe_ar2) # there is no serial correlation for AR(2) model with first-differenced outcome variable
Breusch-Godfrey/Wooldridge test for serial correlation in panel models
data: gdpw_diff ~ gdpw_diff_lag1 + gdpw_diff_lag2 + reg + edt
chisq = 4.0656, df = 2, p-value = 0.131
alternative hypothesis: serial correlation in idiosyncratic errors
FE vs. lagged dependent variable (LDV)
There is a tradeoff when we both incorporate FE and lagged dependent variable in one model.
The estimate of \(\phi\) will always be underestimated in dynamic FE model, but the bias vanishes as \(T\) increases. This is called “Nickell bias.”
To obtain an unbiased estimate with small \(T\), we may choose dynamic panel data model, or prefer either FE or LDV model. If we care about causal inference and want to use a different-in-difference design (suppose it is a simple \(2\times 2\) DID) to estimate treatment effect, then we may need to think about the underlying causal identification assumption:
FE: conditioning on time-invariant unobservable individual-level confounder \(\alpha_i\), \(x_{it}\) and \(y_{it}\) are independent.
LDV: conditioning on time-varying unobservable individual-level confounder \(y_{i,t-1}\), \(x_{it}\) and \(y_{it}\) are independent.
FE and LDV have a very useful bracketing property (Angrist and Pischke 2009; Ding and Li 2019):
If LDV is correct, but we mistakenly use FE, then the estimate of treatment effect will be biased upward.
If FE is correct, but we mistakenly use LDV, then the estimate of treatment effect will be biased downward.
We can think of FE and LDV as bounding the causal effect of interest.
A better way is to always check the robustness of our findings using alternative identifying assumptions.
Recent literature attempts to go beyond the additive assumption of unobservable confounders. We can allow unit-level effect and time effect interact with each other so that we can account for time-invariant and time-varying unit-level confounders at the same time. Recent literature consider decomposing the matrix of error terms to several latent factors (e.g. use PCA) to account for this interactive structure (Bai 2009; Xu 2017).
Counterfactual Forecasting
Counterfactual Forecasting with RE Models
Extract model information for simulation, such as coefficients and variance-covariance matrix.
In nlme package, we use fixed.effects() or fixef() to extract coefficients that are fixed across all groups. (note that here we have different meaning of “fixed effect” and “random effect.”)
You can extract group-varying coefficients or intercepts using random.effects() or ranef().
Simply using coefficients() or coef() will aggregate “fixed effect” and “random effect” for each group.
Simulate parameters from multivariate Normal distribution
Create hypothetical scenarios
Use simulated parameters and hypothetical scenarios to calculate quantity of interest, such as expected values, first difference, relative risk, etc.
sims <-1000simbetas <-mvrnorm(sims, fixef(re_arima), vcov(re_arima))# Make matrix of hypothetical x'sformula <- gdpw_diff ~ reg + edt + oilperiods.out <-50xhyp <-cfMake(formula, pdata, periods.out)for (i in1:periods.out) xhyp <-cfChange(xhyp, "edt", x =mean(pdata$edt, na.rm=TRUE) +sd(pdata$edt, na.rm=TRUE), scen=i) # we assume a permanent increase in education (+ 1 SD) over the next 50 yearsphi <-0.25# from model summary()lagY <-mean(pdata$gdpw_diff, na.rm =TRUE) # Hypothetical previous change in Y for simulationinitialY <-mean(pdata$gdpw, na.rm =TRUE) # Hypothetical initial level of Y for simulation
# Simulate expected values of Y (on original level scale)# out to periods.out given hypothetical future values of X,# initial lags of the change in Y, and an initial level of Ysim_evre <-ldvsimev(xhyp, # The matrix of hypothetical x's simbetas, # The matrix of simulated betasci=0.95, # Desired confidence intervalconstant=1, # Column containing the constant# set to NA for no constantphi=phi, # estimated AR parameters; length must match lagY lagY=lagY, # lags of y, most recent lasttransform="diff", # "log" to undo log transformation,# "diff" to under first differencing# "difflog" to do bothinitialY=initialY # for differenced models, the lag of the level of y )# Simulate first differences in Y (on original level scale)# out to periods.out given hypothetical future values of X, X0,# and initial lags of the change in Ysim_fdre <-ldvsimfd(xhyp, # The matrix of hypothetical x's simbetas, # The matrix of simulated betasci=0.95, # Desired confidence intervalconstant=1, # Column containing the constant# set to NA for no constantphi=phi, # estimated AR parameters; length must match lagY lagY=lagY, # lags of y, most recent lasttransform="diff"# "log" to undo log transformation,# "diff" to under first differencing# "difflog" to do both )
plot.new()par(usr=c(1,50,-15000,15000))axis(1,at=seq(1,50,10))axis(2,at=seq(-15000,15000,5000))title(xlab="Time",ylab="Expected change in constant GDP ($ pc)",main="Random effects ARIMA(1,1,0)") # Make the x-coord of a confidence envelope polygonxpoly <-c(1:periods.out,rev(1:periods.out),1)# Make the y-coord of a confidence envelope polygonypoly <-c(sim_fdre$lower,rev(sim_fdre$upper), sim_fdre$lower[1])# Choose the color of the polygon col <-"lightblue"# Plot the polygon first, before the points & linespolygon(x=xpoly,y=ypoly,col=col,border=FALSE )# Plot the fitted linelines(x=1:periods.out,y=sim_fdre$pe,col="darkblue")lines(x=c(0,50),y=c(0,0),lty="solid")
Counterfactual Forecasting with FE Model
The general steps are similar to counterfactual forecasting with RE model.
We can use fixef() to extract group-specific intercepts.
Sometimes we may prefer using panel-corrected standard error as variance-covariance matrix for parameter simulation.
sims <-1000simparam <-mvrnorm(sims, coef(fe_ar1), vcovBK(fe_ar1))# Pull off the simulated lag coefficientsimphi <- simparam[,1]# Put together the "constant" term (avg of the FEs, or a specific FE if you like)# with the rest of the regressorssimbetas <-cbind(rep(mean(fixef(fe_ar1)), sims), simparam[,2:ncol(simparam)])# Make matrix of hypothetical x's: drop OIL!# and now we need hypothetical countries!# and no intercept!# Make matrix of hypothetical x's# Note the formula below is a little different; we've removed the lag# and will let the simulator handle lag structureformula <- gdpw_diff ~ reg + edtperiods.out <-50xhyp <-cfMake(formula, pdata, periods.out)for (i in1:periods.out) xhyp <-cfChange(xhyp, "edt", x=mean(pdata$edt,na.rm=TRUE)+sd(pdata$edt,na.rm=TRUE), scen=i)phi <-mean(simphi) lagY <-mean(pdata$gdpw_diff, na.rm =TRUE) # Hypothetical previous change in Y for simulationinitialY <-mean(pdata$gdpw, na.rm =TRUE) # Hypothetical initial level of Y for simulation
# Simulate expected values of Y (on original level scale)# out to periods.out given hypothetical future values of X,# initial lags of the change in Y, and an initial level of Ysim_evfe <-ldvsimev(xhyp, # The matrix of hypothetical x's simbetas, # The matrix of simulated betasci=0.95, # Desired confidence intervalconstant=NA, # NA indicates no constant!phi=phi, # estimated AR parameters; length must match lagY lagY=lagY, # lags of y, most recent lasttransform="diff", # "log" to undo log transformation,# "diff" to under first differencing# "difflog" to do bothinitialY=initialY # for differenced models, the lag of the level of y )# Simulate first differences in Y (on original level scale)# out to periods.out given hypothetical future values of X, X0,# and initial lags of the change in Ysim_fdfe <-ldvsimfd(xhyp, # The matrix of hypothetical x's simbetas, # The matrix of simulated betasci=0.95, # Desired confidence intervalconstant=NA, # Column containing the constant# set to NA for no constantphi=phi, # estimated AR parameters; length must match lagY lagY=lagY, # lags of y, most recent lasttransform="diff"# "log" to undo log transformation,# "diff" to under first differencing# "difflog" to do both )
plot.new()par(usr=c(1,50,-15000,15000))axis(1,at=seq(1,50,10))axis(2,at=seq(-15000,15000,5000))title(xlab="Time",ylab="Expected change in constant GDP ($ pc)",main="Fixed effects ARIMA(1,1,0), pcse") # Make the x-coord of a confidence envelope polygonxpoly <-c(1:periods.out,rev(1:periods.out),1)# Make the y-coord of a confidence envelope polygonypoly <-c(sim_fdfe$lower,rev(sim_fdfe$upper), sim_fdfe$lower[1])# Choose the color of the polygon col <-"lightgreen"# Plot the polygon first, before the points & linespolygon(x=xpoly,y=ypoly,col=col,border=FALSE )# Plot the fitted linelines(x=1:periods.out,y=sim_fdfe$pe,col="darkgreen")lines(x=c(0,50),y=c(0,0),lty="solid")
Tips for Poster Session
How to produce a nicely formatted table?
Popular options for regression tables: stargazer, texreg, modelsummary, etc.
Note that the author of stargazer no longer maintain this package now.
If you just want to make a simple table: xtable, kableExtra, modelsummary.
If you want to make a reproducible poster with Rmarkdown: posterdown and postr.