CSSS/POLS 512: Lab 2

Time Series & Linear Regression Review

Tao Lin

April 8, 2022

Agenda

  • Linear Regression Review
    • Diagnostics
    • Interpretation
  • Time Series Diagnostics
    • Deterministic Trend and De-trend
    • Detect Seasonality
    • Discover Autocorrelation in Time Series
  • More on Linear Regression
    • What Does a Regression Coefficient “Average” Over? 🤔
    • Why Do I Lose Stars After Adding/Dropping Covariates? 😭

Linear Regression Review

Common Tricks for Regression Diagnostics

  • Check Linearity: plot residuals vs. fitted values
  • Detect Outliers: plot standardized residual vs. fitted values
    • in R, use rstandard() to extract standardized residual.
  • Check Heteroskedasticity: spread-location plot
    • given fitted values, residuals should have about the same variation
    • x-axis: fitted values
    • y-axis: square root of the absolute value of the standardized residuals, \(\sqrt{|e^*|}\)
    • in R, use plot(model, which = 3).
  • Check Multicollinearity: use variance inflation factor (VIF)
    • VIF measures how well predictor \(j\) can be predicted by the other covariates. A higher VIF leads to a higher sampling variance of the coefficient \(\hat{\beta}_j\): when most of the variation in variable \(j\) is already explained by the other covariates, little independent variation remains for estimating its effect on \(y\), which leads to large uncertainty in the coefficient estimate.
    • In R, use vif() in car package to calculate VIF for each coefficient.
    • There is no definite threshold for multicollinearity. A conventional rule is VIF > 5.
    • But we need more information to decide whether to remove certain variables.
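
Below is a minimal sketch of these diagnostics, assuming a hypothetical model of mpg fitted on the built-in mtcars data (the coding exercise later uses the same data) and assuming the car package is installed:

# a hypothetical model on the built-in mtcars data (the formula is our own choice)
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

plot(model, which = 1)                # linearity: residuals vs. fitted values
plot(fitted(model), rstandard(model), # outliers: standardized residuals vs. fitted values
     xlab = "Fitted values", ylab = "Standardized residuals")
plot(model, which = 3)                # heteroskedasticity: spread-location plot
car::vif(model)                       # multicollinearity: VIF for each predictor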

Visualizing Regression Results

  • Coefficient Plot: Plot the effect size of each coefficient
    • need to calculate confidence interval using point estimates and standard errors of each coefficient.
    • in R, we can use geom_errorbar to visualize that.
  • Interaction Effect Plot: Plot how the total effect of variable \(j\) changes with variable \(k\)
    • need to re-calculate the standard error, because \(V(\beta_1 + \beta_3 Z) = V(\beta_1) + 2 Z \cdot Cov(\beta_1, \beta_3) + Z^2 \cdot V(\beta_3)\)! (see the sketch after this list)
    • in R, use vcov() to fetch the variance-covariance matrix first.
    • There are also packages that can plot interaction effects automatically.
  • Simulating Predictions: we want to know how changes in variable \(j\) will induce changes in outcome \(y\) under certain conditions.
    • Use mvrnorm() to jointly simulate coefficients from the fitted model.
    • Hold the other covariates at certain values and change only the hypothetical value of variable \(j\) to see how the predicted values change.
    • People often hold the other covariates at their means or medians, but some scholars suggest holding them at their observed values.1
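
To illustrate the variance calculation for an interaction effect, here is a minimal sketch using a hypothetical interaction model on the built-in mtcars data (the variables and moderator values are our own choices):

# hypothetical interaction model: the conditional effect of wt at a given hp is beta_wt + beta_wt:hp * hp
m <- lm(mpg ~ wt * hp, data = mtcars)
b <- coef(m)
V <- vcov(m)
hp.values <- seq(50, 330, by = 10)  # hypothetical values of the moderator
effect <- b["wt"] + b["wt:hp"] * hp.values
se <- sqrt(V["wt", "wt"] + hp.values^2 * V["wt:hp", "wt:hp"] +
             2 * hp.values * V["wt", "wt:hp"])
data.frame(hp = hp.values, effect = effect,
           lower = effect - 1.96 * se, upper = effect + 1.96 * se)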

Coding Exercise: Regression Diagnostics

  1. Check linearity and heteroskedasticity of this model.
  2. Check multicollinearity of this model.
  3. Visualize the effect size of each coefficient.
    • extract point estimates and standard errors, and calculate the confidence interval using \(ev \pm 1.96 \cdot se\)
    • use ggplot2 to plot both point estimates and confidence intervals. In geom_errorbar, let y = ev, ymax = upper and ymin = lower.
  4. Simulate predicted mpg when disp varies, holding other covariates at their medians (one possible sketch follows below).
    • fetch the coefficients and vcov, and use MASS::mvrnorm to simulate the coefficients 1e4 times. Now we have 1e4 simulated regression lines.
    • create multiple hypothetical scenarios in a data.frame where the other covariates equal their medians and only disp varies from 70 to 500.
    • For each scenario, use the simulated coefficients to calculate predicted values. Now we should have a matrix with 431 columns and 10,000 rows.
    • Calculate the confidence interval at each hypothetical value of disp and plot the regression line and confidence interval using geom_ribbon in ggplot2.
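
One possible way to code this up, again assuming the hypothetical model mpg ~ disp + hp + wt + qsec on mtcars (treat this as a sketch rather than the unique solution):

library(ggplot2)
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)                # hypothetical model
sim.betas <- MASS::mvrnorm(1e4, mu = coef(model), Sigma = vcov(model)) # 1e4 simulated coefficient vectors

# hypothetical scenarios: disp varies from 70 to 500, other covariates at their medians
scenarios <- cbind(1,                        # intercept
                   disp = 70:500,
                   hp = median(mtcars$hp),
                   wt = median(mtcars$wt),
                   qsec = median(mtcars$qsec))
preds <- sim.betas %*% t(scenarios)          # 10,000 x 431 matrix of predicted mpg

plot.df <- data.frame(disp = 70:500,
                      pe = colMeans(preds),
                      lower = apply(preds, 2, quantile, 0.025),
                      upper = apply(preds, 2, quantile, 0.975))
ggplot(plot.df, aes(x = disp, y = pe)) +
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3) +
  labs(y = "Predicted mpg")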

Time Series Description

Time Series Object in R

  • Remember that we use the function ts() to coerce a vector or matrix into a time-series (ts) object.
  • A ts object includes:
    • information about time:
      • starting time period
      • ending time period
      • frequency
    • The corresponding elements are assumed to be random variables.
  • Note that each time series in a ts object corresponds to a column of a matrix. To see why, look at the following two examples:
# Monthly unemployment in Maine from January 1996 to August 2006
maine.month.ts <- read.table("Lab2_data/Maine.dat", header = TRUE) |>
  ts(start = c(1996, 1), freq = 12)
maine.month.ts # it seems that we transform it to an object with multiple columns
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1996 6.7 6.7 6.4 5.9 5.2 4.8 4.8 4.0 4.2 4.4 5.0 5.0
1997 6.4 6.5 6.3 5.9 4.9 4.8 4.5 4.0 4.1 4.3 4.8 5.0
1998 6.2 5.7 5.6 4.6 4.0 4.2 4.1 3.6 3.7 4.1 4.3 4.0
1999 4.9 5.0 4.6 4.3 3.9 4.0 3.6 3.3 3.1 3.3 3.7 3.7
2000 4.4 4.4 4.1 3.5 3.1 3.0 2.8 2.5 2.6 2.8 3.1 3.0
2001 3.9 4.2 4.0 4.1 3.5 3.5 3.4 3.1 3.4 3.7 4.0 4.0
2002 5.0 4.9 5.0 4.7 4.0 4.2 4.0 3.6 3.7 3.9 4.5 4.6
2003 5.6 5.8 5.6 5.5 4.8 4.9 4.8 4.3 4.5 4.6 4.8 4.7
2004 5.6 5.6 5.5 4.8 4.2 4.3 4.2 3.8 4.0 4.2 4.6 4.6
2005 5.5 5.8 5.5 5.2 4.7 4.6 4.5 4.1 4.4 4.4 4.8 4.6
2006 5.3 5.6 4.9 4.6 4.2 4.4 4.4 3.9                
as.matrix(maine.month.ts) |> as.data.frame() # but it is still one column if we transform it back to a data.frame
    unemploy
1        6.7
2        6.7
3        6.4
4        5.9
5        5.2
6        4.8
7        4.8
8        4.0
9        4.2
10       4.4
11       5.0
12       5.0
13       6.4
14       6.5
15       6.3
16       5.9
17       4.9
18       4.8
19       4.5
20       4.0
21       4.1
22       4.3
23       4.8
24       5.0
25       6.2
26       5.7
27       5.6
28       4.6
29       4.0
30       4.2
31       4.1
32       3.6
33       3.7
34       4.1
35       4.3
36       4.0
37       4.9
38       5.0
39       4.6
40       4.3
41       3.9
42       4.0
43       3.6
44       3.3
45       3.1
46       3.3
47       3.7
48       3.7
49       4.4
50       4.4
51       4.1
52       3.5
53       3.1
54       3.0
55       2.8
56       2.5
57       2.6
58       2.8
59       3.1
60       3.0
61       3.9
62       4.2
63       4.0
64       4.1
65       3.5
66       3.5
67       3.4
68       3.1
69       3.4
70       3.7
71       4.0
72       4.0
73       5.0
74       4.9
75       5.0
76       4.7
77       4.0
78       4.2
79       4.0
80       3.6
81       3.7
82       3.9
83       4.5
84       4.6
85       5.6
86       5.8
87       5.6
88       5.5
89       4.8
90       4.9
91       4.8
92       4.3
93       4.5
94       4.6
95       4.8
96       4.7
97       5.6
98       5.6
99       5.5
100      4.8
101      4.2
102      4.3
103      4.2
104      3.8
105      4.0
106      4.2
107      4.6
108      4.6
109      5.5
110      5.8
111      5.5
112      5.2
113      4.7
114      4.6
115      4.5
116      4.1
117      4.4
118      4.4
119      4.8
120      4.6
121      5.3
122      5.6
123      4.9
124      4.6
125      4.2
126      4.4
127      4.4
128      3.9
# Electricity (millions of kWh), beer (Ml), and chocolate production (tonnes) in Australia from January 1958 to December 1990 
cbe.ts <- read.table("Lab2_data/cbe.dat", header = TRUE) |>
  ts(start = 1958, freq = 12)
cbe.ts # this example of multiple time series shows that each column is a single time series
         choc  beer  elec
Jan 1958 1451  96.3  1497
Feb 1958 2037  84.4  1463
Mar 1958 2477  91.2  1648
Apr 1958 2785  81.9  1595
May 1958 2994  80.5  1777
Jun 1958 2681  70.4  1824
Jul 1958 3098  74.8  1994
Aug 1958 2708  75.9  1835
Sep 1958 2517  86.3  1787
Oct 1958 2445  98.7  1699
Nov 1958 2087 100.9  1633
Dec 1958 1801 113.8  1645
Jan 1959 1216  89.8  1597
Feb 1959 2173  84.4  1577
Mar 1959 2286  87.2  1709
Apr 1959 3121  85.6  1756
May 1959 3458  72.0  1936
Jun 1959 3511  69.2  2052
Jul 1959 3524  77.5  2105
Aug 1959 2767  78.1  2016
Sep 1959 2744  94.3  1914
Oct 1959 2603  97.7  1925
Nov 1959 2527 100.2  1824
Dec 1959 1846 116.4  1765
Jan 1960 1066  97.1  1721
Feb 1960 2327  93.0  1752
Mar 1960 3066  96.0  1914
Apr 1960 3048  80.5  1857
May 1960 3806  76.1  2159
Jun 1960 4042  69.9  2195
Jul 1960 3583  73.6  2287
Aug 1960 3438  92.6  2276
Sep 1960 2957  94.2  2096
Oct 1960 2885  93.5  2055
Nov 1960 2744 108.5  2004
Dec 1960 1837 109.4  1924
Jan 1961 1447 105.1  1851
Feb 1961 2504  92.5  1839
Mar 1961 3248  97.1  2019
Apr 1961 3098  81.4  1937
May 1961 4318  79.1  2270
Jun 1961 3561  72.1  2251
Jul 1961 3316  78.7  2382
Aug 1961 3379  87.1  2364
Sep 1961 2717  91.4  2129
Oct 1961 2354 109.9  2110
Nov 1961 2445 116.3  2072
Dec 1961 1542 113.0  1980
Jan 1962 1606 100.0  1995
Feb 1962 2590  84.8  1932
Mar 1962 3588  94.3  2171
Apr 1962 3202  87.1  2162
May 1962 4704  90.3  2489
Jun 1962 4005  72.4  2424
Jul 1962 3810  84.9  2641
Aug 1962 3488  92.7  2630
Sep 1962 2781  92.2  2324
Oct 1962 2944 114.9  2412
Nov 1962 2817 112.5  2284
Dec 1962 1960 118.3  2186
Jan 1963 1937 106.0  2184
Feb 1963 2903  91.2  2144
Mar 1963 3357  96.6  2379
Apr 1963 3552  96.3  2383
May 1963 4581  88.2  2717
Jun 1963 3905  70.2  2774
Jul 1963 4581  86.5  3051
Aug 1963 4037  88.2  2891
Sep 1963 3345 102.8  2613
Oct 1963 3175 119.1  2600
Nov 1963 2808 119.2  2493
Dec 1963 2050 125.1  2410
Jan 1964 1719 106.1  2390
Feb 1964 3143 102.1  2463
Mar 1964 3756 105.2  2616
Apr 1964 4776 101.0  2734
May 1964 4540  84.3  2970
Jun 1964 4309  87.5  3125
Jul 1964 4563  92.7  3342
Aug 1964 3506  94.4  3207
Sep 1964 3665 113.0  2964
Oct 1964 3361 113.9  2919
Nov 1964 3094 122.9  2764
Dec 1964 2440 132.7  2732
Jan 1965 1633 106.9  2622
Feb 1965 2935  96.6  2698
Mar 1965 4159 127.3  2950
Apr 1965 4159  98.2  2895
May 1965 4894 100.2  3200
Jun 1965 4921  89.4  3408
Jul 1965 4577  95.3  3679
Aug 1965 4155 104.2  3473
Sep 1965 3851 106.4  3154
Oct 1965 3429 116.2  3107
Nov 1965 3370 135.9  3052
Dec 1965 2726 134.0  2918
Jan 1966 1674 104.6  2786
Feb 1966 3257 107.1  2739
Mar 1966 4731 123.5  3125
Apr 1966 4481  98.8  3033
May 1966 5443  98.6  3486
Jun 1966 5566  90.6  3661
Jul 1966 5044  89.1  3927
Aug 1966 4781 105.2  3851
Sep 1966 4014 114.0  3456
Oct 1966 3561 122.1  3390
Nov 1966 3801 138.0  3280
Dec 1966 2685 142.2  3166
Jan 1967 1805 116.4  3080
Feb 1967 3756 112.6  3069
Mar 1967 4227 123.8  3340
Apr 1967 4595 103.6  3310
May 1967 5702 113.9  3798
Jun 1967 4681  98.6  3883
Jul 1967 4395  95.0  4191
Aug 1967 4459 116.0  4213
Sep 1967 4191 113.9  3766
Oct 1967 3742 127.5  3628
Nov 1967 3279 131.4  3520
Dec 1967 2468 145.9  3322
Jan 1968 1742 131.5  3250
Feb 1968 3366 131.0  3287
Mar 1968 3633 130.5  3552
Apr 1968 3701 118.9  3440
May 1968 4926 114.3  4153
Jun 1968 4522  85.7  4265
Jul 1968 5275 104.6  4655
Aug 1968 4717 105.1  4492
Sep 1968 4114 117.3  4051
Oct 1968 3851 142.5  3967
Nov 1968 3493 140.0  3807
Dec 1968 2654 159.8  3639
Jan 1969 2168 131.2  3647
Feb 1969 3561 125.4  3560
Mar 1969 4305 126.5  3929
Apr 1969 4413 119.4  3858
May 1969 5307 113.5  4485
Jun 1969 5361  98.7  4697
Jul 1969 4948 114.5  4977
Aug 1969 4472 113.8  4675
Sep 1969 3846 133.1  4596
Oct 1969 3715 143.4  4491
Nov 1969 3343 137.3  4127
Dec 1969 2939 165.2  4144
Jan 1970 1615 126.9  4014
Feb 1970 3497 124.0  3994
Mar 1970 3915 135.7  4320
Apr 1970 4858 130.0  4400
May 1970 4962 109.4  5002
Jun 1970 4504 117.8  5091
Jul 1970 4767 120.3  5471
Aug 1970 4291 121.0  5193
Sep 1970 4091 132.3  4997
Oct 1970 4164 142.9  4737
Nov 1970 3915 147.4  4546
Dec 1970 3130 175.9  4498
Jan 1971 1696 132.6  4350
Feb 1971 3887 123.7  4206
Mar 1971 4749 153.3  4743
Apr 1971 4781 134.0  4582
May 1971 5089 119.6  5191
Jun 1971 5484 116.2  5457
Jul 1971 5072 118.6  5891
Aug 1971 4611 130.7  5618
Sep 1971 4117 129.3  5158
Oct 1971 3910 144.4  5030
Nov 1971 4252 163.2  4800
Dec 1971 3624 179.4  4654
Jan 1972 1678 128.1  4453
Feb 1972 3851 138.4  4440
Mar 1972 5021 152.7  4945
Apr 1972 4581 120.0  4788
May 1972 6195 140.5  5425
Jun 1972 5338 116.2  5706
Jul 1972 4909 121.4  6061
Aug 1972 4640 127.8  5846
Sep 1972 3706 143.6  5242
Oct 1972 4113 157.6  5408
Nov 1972 3879 166.2  5114
Dec 1972 3411 182.3  5042
Jan 1973 2043 153.1  5008
Feb 1973 3736 147.6  4657
Mar 1973 4490 157.7  5359
Apr 1973 3715 137.2  5193
May 1973 5623 151.5  5891
Jun 1973 4671  98.7  5980
Jul 1973 5591 145.8  6390
Aug 1973 5461 151.7  6366
Sep 1973 4795 129.4  5756
Oct 1973 4846 174.1  5640
Nov 1973 4843 197.0  5429
Dec 1973 3278 193.9  5398
Jan 1974 2411 164.1  5413
Feb 1974 4278 142.8  5141
Mar 1974 4639 157.9  5695
Apr 1974 4559 159.2  5554
May 1974 6404 162.2  6369
Jun 1974 4851 123.1  6592
Jul 1974 6480 130.0  7107
Aug 1974 6394 150.1  6917
Sep 1974 5752 169.4  6353
Oct 1974 4718 179.7  6205
Nov 1974 4659 182.1  5830
Dec 1974 3842 194.3  5646
Jan 1975 2873 161.4  5379
Feb 1975 5556 169.4  5489
Mar 1975 5389 168.8  5824
Apr 1975 6135 158.1  5907
May 1975 6707 158.5  6482
Jun 1975 5220 135.3  6795
Jul 1975 6249 149.3  7028
Aug 1975 5281 143.4  6776
Sep 1975 4192 142.2  6274
Oct 1975 4867 188.4  6362
Nov 1975 3752 166.2  5940
Dec 1975 3492 199.2  5958
Jan 1976 1979 182.7  5769
Feb 1976 4584 145.2  5887
Mar 1976 5139 182.1  6367
Apr 1976 5044 158.7  6165
May 1976 5501 141.6  6868
Jun 1976 5044 132.6  7201
Jul 1976 5035 139.6  7601
Aug 1976 5167 147.0  7581
Sep 1976 4650 166.6  7090
Oct 1976 5298 157.0  6841
Nov 1976 4373 180.4  6408
Dec 1976 3941 210.2  6435
Jan 1977 2334 159.8  6176
Feb 1977 4381 157.8  6138
Mar 1977 5665 168.2  6717
Apr 1977 4393 158.4  6470
May 1977 5232 152.0  7312
Jun 1977 5876 142.2  7763
Jul 1977 5900 137.2  8171
Aug 1977 5704 152.6  7788
Sep 1977 4718 166.8  7311
Oct 1977 4650 165.6  6679
Nov 1977 4446 198.6  6704
Dec 1977 3061 201.5  6724
Jan 1978 2155 170.7  6552
Feb 1978 4274 164.4  6427
Mar 1978 4695 179.7  7105
Apr 1978 4362 157.0  6869
May 1978 4889 168.0  7683
Jun 1978 5370 139.3  8082
Jul 1978 5072 138.6  8555
Aug 1978 4985 153.4  8386
Sep 1978 3978 138.9  7553
Oct 1978 4139 172.1  7398
Nov 1978 3995 198.4  7112
Dec 1978 3025 217.8  6886
Jan 1979 1949 173.7  7077
Feb 1979 4357 153.8  6820
Mar 1979 4638 175.6  7426
Apr 1979 3994 147.1  7143
May 1979 6174 160.3  8261
Jun 1979 5656 135.2  8240
Jul 1979 4411 148.8  8977
Aug 1979 5504 151.0  8991
Sep 1979 4463 148.2  8026
Oct 1979 4458 182.2  7911
Nov 1979 4528 189.2  7510
Dec 1979 2830 183.1  7381
Jan 1980 1843 170.0  7366
Feb 1980 5042 158.4  7414
Mar 1980 5348 176.1  7824
Apr 1980 5257 156.2  7524
May 1980 6699 153.2  8279
Jun 1980 5388 117.9  8707
Jul 1980 6001 149.8  9486
Aug 1980 5966 156.6  8973
Sep 1980 4845 166.7  8231
Oct 1980 4507 156.8  8206
Nov 1980 4214 158.6  7927
Dec 1980 3460 210.8  7999
Jan 1981 1833 203.6  7834
Feb 1981 4978 175.2  7521
Mar 1981 6464 168.7  8284
Apr 1981 5820 155.9  7999
May 1981 6447 147.3  8940
Jun 1981 6191 137.0  9381
Jul 1981 6628 141.1 10078
Aug 1981 5452 167.4  9796
Sep 1981 5295 160.2  8471
Oct 1981 5080 191.9  8572
Nov 1981 5564 174.4  8150
Dec 1981 3965 208.2  8168
Jan 1982 2062 159.4  8166
Feb 1982 5099 161.1  7903
Mar 1982 6162 172.1  8606
Apr 1982 5529 158.4  8071
May 1982 6416 114.6  9178
Jun 1982 6382 159.6  9873
Jul 1982 5624 159.7 10476
Aug 1982 5785 159.4  9296
Sep 1982 4644 160.7  8818
Oct 1982 5331 165.5  8697
Nov 1982 5143 205.0  8381
Dec 1982 4596 205.2  8293
Jan 1983 2180 141.6  7942
Feb 1983 5786 148.1  8001
Mar 1983 5840 184.9  8744
Apr 1983 5666 132.5  8397
May 1983 6360 137.3  9115
Jun 1983 6219 135.5  9773
Jul 1983 6082 121.7 10358
Aug 1983 5653 166.1  9849
Sep 1983 5726 146.8  9083
Oct 1983 5049 162.8  9143
Nov 1983 5859 186.8  8800
Dec 1983 4091 185.5  8741
Jan 1984 2167 151.5  8492
Feb 1984 6480 158.1  8795
Mar 1984 7375 143.0  9354
Apr 1984 6583 151.2  8796
May 1984 7251 147.6 10072
Jun 1984 6730 130.7 10174
Jul 1984 6428 137.5 11326
Aug 1984 5228 146.1 10744
Sep 1984 4716 133.6  9806
Oct 1984 6101 167.9  9740
Nov 1984 5753 181.9  9373
Dec 1984 4000 202.0  9244
Jan 1985 2691 166.5  9407
Feb 1985 5898 151.3  8827
Mar 1985 6526 146.2  9880
Apr 1985 5840 148.3  9364
May 1985 6650 144.7 10580
Jun 1985 5717 123.6 10899
Jul 1985 7236 151.6 11687
Aug 1985 6523 133.9 11280
Sep 1985 5729 137.4 10208
Oct 1985 6004 181.6 10212
Nov 1985 5950 182.0  9725
Dec 1985 4690 190.0  9721
Jan 1986 3687 161.2  9846
Feb 1986 7791 155.5  9407
Mar 1986 7153 141.9 10265
Apr 1986 6434 164.6  9970
May 1986 7850 136.2 10801
Jun 1986 6809 126.8 11246
Jul 1986 8379 152.5 12167
Aug 1986 6914 126.6 11578
Sep 1986 6919 150.1 10645
Oct 1986 7265 186.3 10613
Nov 1986 6994 147.5 10104
Dec 1986 5503 200.4 10348
Jan 1987 3782 177.2 10263
Feb 1987 7502 127.4  9973
Mar 1987 8119 177.1 10803
Apr 1987 7292 154.4 10409
May 1987 6886 135.2 11458
Jun 1987 7049 126.4 11845
Jul 1987 7977 147.3 12559
Aug 1987 8519 140.6 12070
Sep 1987 6680 152.3 11221
Oct 1987 7994 151.2 11338
Nov 1987 7047 172.2 10761
Dec 1987 5782 215.3 11012
Jan 1988 3771 154.1 10923
Feb 1988 7906 159.3 10790
Mar 1988 8970 160.4 11427
Apr 1988 6077 151.9 10788
May 1988 7919 148.4 11772
Jun 1988 7340 139.6 12104
Jul 1988 7791 148.2 12634
Aug 1988 7368 153.5 12772
Sep 1988 8255 145.1 11764
Oct 1988 7816 183.7 11956
Nov 1988 7476 210.5 11646
Dec 1988 6696 203.3 11750
Jan 1989 4484 153.3 11485
Feb 1989 8274 144.3 11198
Mar 1989 8866 169.6 12265
Apr 1989 8572 143.7 11704
May 1989 9176 160.0 12419
Jun 1989 8645 135.5 13259
Jul 1989 8265 141.7 13945
Aug 1989 9558 159.9 13839
Sep 1989 7037 145.6 12387
Oct 1989 9101 183.4 12546
Nov 1989 8180 198.1 12038
Dec 1989 7072 186.7 11977
Jan 1990 3832 171.9 12336
Feb 1990 7253 150.5 11793
Mar 1990 8667 163.0 12877
Apr 1990 7658 153.6 11923
May 1990 8859 152.8 13306
Jun 1990 7291 135.4 13988
Jul 1990 7529 148.3 14002
Aug 1990 8715 148.3 14338
Sep 1990 8450 133.5 12867
Oct 1990 9085 193.8 12761
Nov 1990 8350 208.4 12449
Dec 1990 7080 197.0 12658

Box-Jenkins Method

  • Suppose that now we get a time series. How can we “reverse-engineer” its functional form or data generating process (DGP)?
    • For example, we expect that an observed time series can be written as \[y_t = \phi_1\overbrace{y_{t-1}}^\text{AR process} + \phi_{12} \underbrace{y_{t-12}}_\text{seasonality} + \overbrace{t\beta_1+\beta_0}^\text{deterministic trend} + \underbrace{\varepsilon_t}_\text{white noise}\]
    • How can we discover which terms should be included in this functional form?
    • How can we identify the coefficients \(\phi_1, \phi_{12}, \beta_0, \beta_1\)?
  • The Box-Jenkins Method provides a general guideline for diagnosing time series. It assumes that a time series is composed of multiple temporal processes (e.g. trend, seasonality, autoregressive process, etc.). It then performs diagnostics to compare the observed series with generic forms to decide which processes occur in the data (i.e. the DGP).
    • Step 1: Study generic forms and properties
    • Step 2: Study these realizations for an indication of which possibly applies to your data
    • Step 3: Assess your guess: diagnose and iterate
    • Step 4: Perform a meta-analysis at the end to determine which specification is the best
  • We will cover the first three steps in this lab.

Deterministic Trend

\[y_t = \beta_{0} + t\beta_{1} + e_t\] \[e_{t} \sim \mathcal{N}(0, \sigma^{2}) \]

  • Each period entails another \(\beta_1\) increase in \(\mathbb{E}(y_t)\)

  • Time has a purely systematic relationship with y

  • Once the time series is detrended, it is simply white noise

\[y_t - t\hat{\beta_1} = \hat{\beta_0} + \hat{e_t}\] \[\hat{e_t} \sim \mathcal{N}(0, \hat{\sigma}^2)\]

  • White noise is normally distributed with mean zero and constant variance

Deterministic Trend

Simulate a deterministic trend with noise, de-trend the data, and plot the time series.

# Set the slope of the trend
b1 <- 3/2

#Set the intercept
b0 <- 2

#Set the number of periods
n <- 50
t <- seq(0, n) # create a sequence of t from 0 to 50 (note: n + 1 = 51 periods)
y <- rep(NA, length(t)) # pre-allocate the time series (same length as t)

# simulate the data
set.seed(98105)
for (i in 1:length(t)){
    y[i] <- b1 * t[i] + b0 + rnorm(1, 0, 15) 
    #The rnorm gives us the noise with mean = 0 and sd = 15
}
y
 [1] -38.484116  19.457351  -8.496254   7.344132  35.161226  21.884259
 [7]  28.489241  32.197319  30.639978   2.229333  20.555835  34.128072
[13]  30.880365  27.779679  25.442973  31.197895  46.376686  49.433384
[19]  42.117503  44.285260  11.395613  37.742546  25.849821  42.118610
[25]  60.490933  51.781601  26.804279  33.634958  65.498712  23.719921
[31]  65.602853  60.516626  61.438060  33.378171  58.786994  55.863424
[37]  71.158262  85.087314  65.254664  74.966883  58.863447  33.258493
[43]  67.564891  39.195174  50.315585  59.863058  52.050526  59.059821
[49]  74.381470  61.656394  82.192530

Deterministic Trend

Plot the simulated time series before de-trending.

plot(y, type = "l", col = "red", ylab = "y", xlab = "Time",
     main = "Simulated Deterministic Trend, y=2+3/2t + Noise")
abline(a = 2, b = 3/2, lty = "dashed") # use slope and intercept to plot the trend

Deterministic Trend

Now de-trend the time series.

y.minus.tbeta <- rep(NA, n)
for (i in 1:length(t)){
    y.minus.tbeta[i] <- y[i] - b1*t[i]
}
#Plot and take a minute to inspect the residuals
plot(y.minus.tbeta, type = "l", col = "red", ylab = "y", xlab = "Time",
     main = "Detrended Time Series")
abline(a = 2, b = 0, lty = "dashed") # use intercept and plot the mean after de-trending

Suppose we don't know the true trend; we can use OLS to estimate it.

# Find the least squares estimate of the slope
slope1 <- lm(y ~ t)
slope1

Call:
lm(formula = y ~ t)

Coefficients:
(Intercept)            t  
     11.529        1.211  
# Why is the estimated slope different from the true slope?
plot(y, type="l", col="red",ylab="y",xlab="Time",main = "Simulated Deterministic Trend y=2+3/2t + Noise")
abline(a = 2, b = 3/2, lty = "dashed")
abline(a = slope1$coefficients[1], b = slope1$coefficients[2], lty = "dashed", col = "green")

As we learned in lecture, this is caused by the high leverage of the starting and ending points of a time series. It can also be caused by selection bias if we truncate the time series.

Setting selection bias aside, we can simply use least squares to estimate the model. Longer time series give better estimates!

Seasonality

Besides using the ACF and PACF to detect seasonality, we can use decompose() or stl() to extract seasonality from a time series. This is also known as STL decomposition (Seasonal and Trend decomposition using Loess2). It assumes that any time series can be decomposed in either an additive or a multiplicative way:

  • Additive model: \(S + T + L\)
  • Multiplicative model: \(S \times T \times L\), which becomes additive after taking logs: \(\log(S) + \log(T) + \log(L)\)

Now let’s compare and re-examine two previous examples: Australia Beer consumption and Airline Passenger Numbers.

aus.beer <- cbe.ts[, 2] # Beer consumption in Australia
plot(aus.beer)

The seasonal variation looks constant; it doesn’t change very much when the time series value increases. We should use the additive model.

AP <- AirPassengers
plot(AP)

It seems that as the time series increases in magnitude, the seasonal variation increases as well. Here we should use a multiplicative model.

Let’s check the beer example.

  • Note that if we don't specify the total periods of a cycle, decompose() and stl() will automatically use the pre-specified frequency information to extract seasonality. In the beer example, the frequency is 12, so decompose() will assume a 12-month cycle and extract seasonality.
  • If we think there should be a longer cycle in the time series, we need to tell decompose() or stl() to extract seasonality based on a new frequency.
decompose(aus.beer, type = "additive") |> plot()

# or
stl(aus.beer, s.window = 12) |> plot()

# or you can extract seasonality from scratch - this is roughly what decompose() does to extract seasonality (stl() uses LOESS smoothing instead)
beer.detrend <- lm(aus.beer ~ time(aus.beer)) |> resid()
beer.detrend |>
  matrix(ncol = 12, byrow = TRUE) |> # create a matrix in which each column denotes a month
  colMeans() |> # calculate the average value of each month over years
  rep(33) |>  # repeat the cycle - there are 33 years in the time series
  as.ts() |>
  plot() # we get the same seasonal pattern in decompose() and stl()

We need to use lm() to extract the deterministic trend and then use decompose() or stl() to extract the seasonality pattern. After de-trending and removing seasonality, we can use acf and pacf to determine whether there is an autoregressive process in the remaining part of the time series.

  • We can also feed the “random”/“remainder” component from decompose() or stl() into the ACF and PACF plots. However, because these functions fit a smooth local trend rather than a straight line, it is difficult to map the fitted trend back onto the slope of the deterministic trend in our hypothesized functional form of the time series.
stl(aus.beer, s.window = 12)$time.series[, "remainder"] |> pacf()

  • Hence, we recommend using lm() to de-trend the time series.
beer.detrend <- lm(aus.beer ~ time(aus.beer)) |> resid() |> ts(freq = 12)
beer.stl <- stl(beer.detrend, s.window = 12)
pacf(beer.detrend - beer.stl$time.series[, "seasonal"])

The two PACF plots are different because the first code block fits a LOESS curve to extract the trend, whereas the second fits a straight regression line. Again, if we want to recover the functional form of the time series, the linear assumption is more convenient.

If the time series has a multiplicative pattern, we can take the log so that the components become additive (or, equivalently, divide out the trend and seasonality on the original scale).
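
For example, a minimal sketch with the AirPassengers series (AP) introduced above:

decompose(AP, type = "multiplicative") |> plot() # decompose on the original scale, or...
log(AP) |> stl(s.window = 12) |> plot()          # ...take logs so the components become additive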

Autocorrelation

After de-trending and removing seasonality from the time series, we can check whether the time series has some form of serial correlation. For today's section, we will only focus on how to diagnose an autoregressive process.

Remember what population correlation formula looks like:

\[ corr(x,y) = \frac{cov(x,y)}{\sigma_x\sigma_y} = \frac{\overbrace{E[(x-\mu_x)(y-\mu_y)]}^\text{Covariance: measures how much x and y covary}}{\underbrace{\sigma_x\sigma_y}_\text{Standard Deviation: normalizes the degree of covarying}} \]

Remember what (sample) correlation formula looks like:

\[ \hat{corr}(x,y) = \frac{\sum_{i=1}^n{(x_i-\overline{x})(y_i-\overline{y}})}{(n-1) s_x s_y} \]

To see why there should be \(n-1\), see Pearson correlation coefficient on Wikipedia.

Population Autocorrelation

Suppose we want to investigate the population of a time series, that is, we assume the number of periods in the time series is infinite. Then the population autocorrelation \(\rho_k\) is \(corr(y_t, y_{t-k})\):

\[ \rho_k = \frac{cov(y_t, y_{t-k})}{\sigma_y^2}=\frac{\sum_{t=k+1}^T(y_t-\overline{y})(y_{t-k}-\overline{y})}{\sum_{t=1}^T(y_t-\overline{y})^2} \]

Plainly, it measures how much each observation is correlated with the observations lagged by \(k\).

Or, how much each observation covaries with the observations lagged by \(k\) (which we call the autocovariance) relative to the total variance of all observations \(y\).

We call this the autocorrelation function, or acf, and denote it by \(\rho_k\).

We denote the autocovariance by \(\gamma_k\).

Sample Autocorrelation

Since we only observe a finite sample of the time series, we have to estimate the population autocorrelation with the sample autocorrelation.

First, the sample autocovariance, \(c_k\), is calculated as:

\[ c_k = \frac{1}{n-k}\sum_{t=k+1}^{n}(y_t-\overline{y})(y_{t-k}-\overline{y}) \]

Why do we have \(n-k\) in the denominator? Consider the following example:

| Period | \(y_t\) | \(y_{t-1}\) | \(y_{t-3}\) |
|--------|---------|-------------|-------------|
| 1      | 0.8     | NA          | NA          |
| 2      | 7.6     | 0.8         | NA          |
| 3      | 3.2     | 7.6         | NA          |
| 4      | -5.9    | 3.2         | 0.8         |
| 5      | 8.7     | -5.9        | 7.6         |
| 6      | 6.6     | 8.7         | 3.2         |
  • When we create \(y_{t-1}\), we cannot create the lagged value for \(t = 1\), so the total number of pairs we can use to calculate the sample autocovariance is \(n-1\).
  • Likewise, when we create \(y_{t-3}\), we cannot create lagged values for \(t \in \{1,2,3\}\), so the total number of usable pairs is \(n-3\).
  • In general, if we create \(y_{t-k}\), the total number of usable pairs is \(n-k\).
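
A minimal sketch reproducing the toy table above in R (lag_k is our own helper, not a built-in function):

lag_k <- function(x, k) c(rep(NA, k), head(x, -k)) # lag a vector by k periods
y.toy <- c(0.8, 7.6, 3.2, -5.9, 8.7, 6.6)
data.frame(y_t = y.toy, y_lag1 = lag_k(y.toy, 1), y_lag3 = lag_k(y.toy, 3))
# only n - k complete pairs remain at lag k, hence the n - k in the denominator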

Note that \(c_0\) is just the sample variance of \(y\), \(s_y^2\). Therefore, the sample autocorrelation, \(r_k\), is:

\[ r_k = \frac{c_k}{c_0} = \frac{\sum_{t=k+1}^{n}(y_t-\overline{y})(y_{t-k}-\overline{y})}{\sum_{t=1}^{n}(y_t-\overline{y})^2} \cdot \frac{n}{n-k} \]

  • As \(n\) increases (for a fixed \(k\)), the \(n/(n-k)\) term converges to \(1\).
  • Also, \(r_k=r_{-k}\). That is, whether we compute the statistic from leads or lags, we arrive at identical estimates.
  • As with the correlation coefficient, we also have hypothesis tests and corresponding confidence intervals for \(\rho_k\).
  • This is essentially how R calculates the ACF. To see why, check the next slide.

Calculating ACF in R

Let's calculate \(r_1\) with some hypothetical data. A manual check shows that acf() uses essentially the sample autocorrelation formula to calculate \(r_k\).

set.seed(98105)
y <- rnorm(5, 10, 1) # simulate a time series with 5 periods
var.y <- var(y) # sample variance of y
mean <- mean(y)
c_1 <- ((y[1]-mean)*(y[2]-mean) + (y[2]-mean)*(y[3]-mean) + # manually calculate the sample autocovariance c_1
  (y[3]-mean)*(y[4]-mean) + (y[4]-mean)*(y[5]-mean))/(5-1)
r_1 <- c_1/var.y
r_1
[1] -0.3032223
acf(y, plot = FALSE) # acf() uses the same formula!

Autocorrelations of series 'y', by lag

     0      1      2      3      4 
 1.000 -0.303  0.057  0.148 -0.402 

ACF function in R

Let’s see another example for acf().

wave <- read.table("Lab2_data/wave.dat", header = TRUE) |> ts()
plot(wave)
abline(reg = lm(wave ~ time(wave))) # a compact version of plotting the trend

ACF function in R

Let’s see another example for acf().

acf(wave, plot = FALSE)

Autocorrelations of series 'wave', by lag

     0      1      2      3      4      5      6      7      8      9     10 
 1.000  0.470 -0.263 -0.499 -0.379 -0.215 -0.038  0.178  0.269  0.130 -0.074 
    11     12     13     14     15     16     17     18     19     20     21 
-0.079  0.029  0.070  0.063 -0.010 -0.102 -0.125 -0.109 -0.048  0.077  0.165 
    22     23     24     25 
 0.124  0.049 -0.005 -0.066 
# 0, 1, 2, ... means lag, which is k
# 1.000, 0.470, -0.263, ... means acf per each lag (k)

acf(wave) # so this is simply the plot of all acfs per k=1, 2, 3, ..., 25

We call this a correlogram.

Sample ACF as an estimator for historical (population) ACF

To estimate the population acf, we use the sample acf. Under the null hypothesis of no autocorrelation, the sampling distribution of \(r_k\) is approximately normal with a mean of \(-1/n\) and a variance of \(1/n\). Therefore, the 95% confidence bounds for testing the null hypothesis \(\rho_k=0\) (the blue dotted lines in the acf plot) are:

\[ \frac{-1}{n}\pm\frac{1.96}{\sqrt{n}} \]
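
For instance, a minimal sketch computing this bound for the wave series loaded above:

n <- length(wave)
c(lower = -1/n - 1.96/sqrt(n), upper = -1/n + 1.96/sqrt(n)) # approximate 95% bounds for r_k under the null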

How can we use ACF to identify the functional form of time series?

  • A time series with a trend or seasonal variation has high autocorrelation at many lags, which may bias the sample acf upward. Therefore, we generally de-trend a time series before obtaining the sample acf.
  • Suppose that we have already removed trend and seasonality from the time series.
    • If it is white noise, then there will be no significant autocorrelation at any lag.
    • If it is an autoregressive process AR(\(p\)), we expect the ACF to tail off gradually.
    • If it is a moving average process MA(\(q\)), we expect the ACF to cut off after lag \(q\). (we will learn about MA processes next week)

ACF Examples

Let’s see how ACF behaves when a time series has trend, seasonality, and AR(1) process.

plot(AP) # it has trend, seasonality, and potentially AR process

acf(AP) # before de-trending and removing seasonality, it has high autocorrelation and fluctuation

# acf after de-trending
AP.trend <- lm(AP ~ seq_along(AP))
AP.detrend <- AP.trend$residuals
acf(AP.detrend)  # now we only have fluctuation

# acf after removing seasonality
AP.remainder <- AP.detrend - decompose(AP)$seasonal
acf(AP.remainder)  # less fluctuation now

The acf diminishes as lag increases due to the increasing trend, and the fluctuation is due to the seasonal variation.

ACF of white noises

set.seed(2022)
whitenoise <- ts(rnorm(50))
acf(whitenoise)

As true white noise simply varies randomly through time, the ACFs should be zero (no correlation between the series and its lags). More precisely, we expect about 5% of the ACFs to appear statistically different from zero at the 5% significance level.

As this white noise has \(T=50\), the blue dotted lines lie at roughly \(\pm2/\sqrt{50}\approx\pm0.28\). The statistically significant value at lag 2 is due to sampling variation.

Partial ACF

But it is often hard to determine the components of a series based only on ACF, especially when we have trends, seasonal variation or mixed ARMA processes.

Therefore, we also use partial ACF, PACF, which is a measure of correlation between time series observations that are \(k\) units apart, after the correlation at intermediate lags has been controlled for or “partialled” out. In other words, the PACF(\(k\)) is the correlation between \(y_t\) and \(y_{t-k}\) after removing the effect of the intermediate \(y\)’s.

We simply use function pacf() to see the correlogram of PACF.

PACF can be derived by:

\[ PACF(j) = corr(y_{t-j}, \hat{e}_t) \]

where \(\hat{e}_t\) is the residual from regressing \(y_t\) on the intermediate lags \(y_{t-1}, \dots, y_{t-(j-1)}\), i.e., the part of \(y_t\) not explained by the intermediate observations.
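
A minimal sketch of this idea with the AP series from earlier (our own illustration of the partial-correlation logic, not how pacf() is implemented internally):

y.ap <- as.numeric(AP)
n.ap <- length(y.ap)
y0 <- y.ap[3:n.ap]          # y_t
y1 <- y.ap[2:(n.ap - 1)]    # y_{t-1}
y2 <- y.ap[1:(n.ap - 2)]    # y_{t-2}
e.y0 <- resid(lm(y0 ~ y1))  # y_t with the effect of y_{t-1} removed
e.y2 <- resid(lm(y2 ~ y1))  # y_{t-2} with the effect of y_{t-1} removed
cor(e.y0, e.y2)             # roughly the lag-2 PACF; compare with pacf(AP, plot = FALSE)$acf[2]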

How can we use PACF to identify the functional form of time series?

  • A time series with a trend doesn't create much upward bias in the pacf, because some of this information is already partialled out.
  • Seasonal variation becomes clearer in the pacf. Suppose there is a 12-month cycle; then we expect a spike at lag 12 in the pacf plot.
  • Suppose that we have already removed trend and seasonality from the time series.
    • If it is white noise, then there will be no significant partial autocorrelation at any lag.
    • If it is an autoregressive process AR(\(p\)), we expect the PACF to cut off after lag \(p\).
    • If it is a moving average process MA(\(q\)), we expect the PACF to tail off gradually. (we will learn about MA processes next week)

PACF Examples

Let’s see how PACF behaves for white noises.

pacf(whitenoise)

PACF Examples

Let’s see how PACF behaves under trend, seasonality, and autoregressive process.

acf(AP)

pacf(AP)

  • We can see that trend and seasonality strongly influence the ACF, whereas the PACF is not much affected by the trend (because this information is partialled out).
  • Moreover, we can more easily detect seasonality in the PACF. There is a spike at lag 12 in the PACF plot, which means that we should include \(y_{t-12}\) in our hypothesized functional form of the time series.

Practical Rules of ACF and PACF Patterns

We can easily see that the Airline Passengers series behaves like an AR(1) process with seasonality, because the ACF tails off and the PACF cuts off after lags 1 and 12. We will talk more about how to use the ACF and PACF to determine whether a time series is an AR or MA process. To give you a sense, here is a table showing the different patterns of the ACF and PACF for ARMA models:

|      | AR(\(p\)) | MA(\(q\)) | ARMA(\(p,q\)) |
|------|-----------|-----------|---------------|
| ACF  | Tails off | Cuts off after lag \(q\) | Tails off |
| PACF | Cuts off after lag \(p\); PACF(\(p\)) \(= \phi_p\) | Tails off | Tails off |

We will come back to this table and discuss the practical rules for identifying the functional form of a time series in more detail.

Use arima.sim() to Informally Verify Our Hypotheses

We can use arima.sim() to simulate time series and verify our guesses. For example, what are the differences in the ACF and PACF between an AR process with and without a deterministic trend?

set.seed(98105)
# Note: arima.sim() simulates only the ARMA part of a model, so we add the
# deterministic trend (intercept 10, slope 4/5) to the simulated AR(1) series ourselves.
ar <- arima.sim(
  list(order = c(1, 0, 0), # the functional form of the time series: (AR-order, integrate-order, MA-order)
       ar = 2/3,           # the coefficient for y_{t-1}
       ma = NULL),         # no MA terms
  n = 50                   # length of time series
)
ar.trend <- ar + 10 + (4/5) * seq_along(ar) # the same AR(1) realization plus a deterministic trend

As we discussed before, a trend will induce an upward bias in the ACF plot. That's why we want to de-trend a time series in the first place.

acf(ar)

acf(ar.trend)

A trend can induce a slight bias in the PACF at lag 1, but it does not affect the PACF plot very much.

pacf(ar)

pacf(ar.trend)

  • Note that although the true \(\phi_1\) is \(2/3 \approx 0.67\), the PACF correlogram of the AR(1)-with-trend series shows a lag-1 value higher than 0.67, even above 0.7. This is because the trend inflates the autocorrelation throughout the series.
  • Therefore, we need to de-trend the series and then obtain the correlogram, as in the sketch below.
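
For example, a quick sketch of de-trending the simulated series before looking at its PACF:

ar.detrend <- resid(lm(ar.trend ~ seq_along(ar.trend))) # remove the deterministic trend
pacf(ar.detrend) # the lag-1 PACF moves back toward the true phi_1 = 2/3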

Identifying Unknown Time Series Processes

global <- read.table("Lab2_data/global.dat") |>
  t() |>
  as.vector() |>
  as.data.frame()
global.month.ts <- ts(global, start = c(1856, 1), end = c(2005, 12), frequency = 12)

Coding Exercise: Identifying Unknown Time Series Processes

  1. Identify the deterministic trend in the time series.
  2. Identify the seasonality in the time series.
  3. Remove the deterministic trend and seasonality from the time series.
  4. Plot the pattern of ACF and PACF. What functional form do you think is reasonable in this time series?
  5. Use arima.sim() to verify your guess.
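
A possible starting point for steps 1-4, mirroring the beer example above (a sketch, not the full solution; we coerce to a plain univariate ts to be safe):

global.ts <- ts(as.numeric(global.month.ts), start = c(1856, 1), frequency = 12) # univariate ts
global.detrend <- lm(global.ts ~ time(global.ts)) |> resid() |> ts(freq = 12)    # remove the deterministic trend
global.stl <- stl(global.detrend, s.window = 12)                                 # extract seasonality
global.remainder <- global.detrend - global.stl$time.series[, "seasonal"]
acf(global.remainder)
pacf(global.remainder)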

Summary for Time Series Diagnostics

In conclusion, to inspect any unknown time series data, we will need to utilize all available resources, including:

  • Our generic knowledge about the components of our data (trend, seasonality, lag, etc)
  • Applying different tools to test whether the knowledge is actually right
  • Deciding whether our model, with the components we assume are in the data, fits well based on statistics such as the AIC (Akaike Information Criterion); see the sketch below
  • This whole process is called the Box-Jenkins Method. Next week we will continue the topic of time series diagnostics.
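
As a small illustration of the AIC idea (we will cover ARIMA model fitting in more detail later), a minimal sketch using the de-trended, de-seasonalized AP.remainder series from the ACF examples above:

ar1.fit <- arima(AP.remainder, order = c(1, 0, 0))
ar2.fit <- arima(AP.remainder, order = c(2, 0, 0))
AIC(ar1.fit, ar2.fit) # the specification with the smaller AIC fits better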

More on Linear Regression

Effective Sample

  • The coefficients of a linear regression are often interpreted as causal effects. They are actually weighted averages of something.
    • But a weighted average of what?
    • Past studies show that linear regression gives more weight to observations with a larger conditional variance of treatment status (Angrist and Pischke 2009).
    • In plain language, regression assigns more weight to observations whose variation in treatment status hasn't been explained by the other control variables.
      • This understanding of regression coefficients sheds some light on recent advances in difference-in-differences methods.
    • Hence, the “(weighted) effective sample” and the “nominal sample” of a regression can differ.
  • So, how can we recover these weights and obtain a better understanding of our regression models?
    • Step 1: regress the treatment \(D\) on the other covariates \(X\). We call this the “D model.”
    • Step 2: obtain the residuals of the D model
    • Step 3: square the residuals of the D model. These estimate the variance of treatment status conditional on \(X\), which is what we want.
    • We can also use the regweight package to investigate these weights.
  • Why should we care about the effective sample of regression?
    • Understand why we get certain coefficients.
    • Maybe it can inspire good qualitative studies!

Effective Sample - Replicating Jensen (2003)

  • Let's do a replication exercise
    • Jensen (2003) fits a fixed effects model and argues that democratic political institutions are associated with more FDI inflows.
    • If you don't know what a fixed effect is, don't worry. We will talk about it later this quarter. For now you can just think of it as feeding many binary variables into the regression.
    • You can also find the following replication in Aronow and Samii (2015).
library(tidyverse)
library(dataverse) # use this package to download their data from dataverse
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
jensen_data <- get_dataframe_by_name(
  "jensen-rep.tab",
  "10.7910/DVN/29098"
)
y_formula <- Fvar5 ~ regime + var5 + market + lgdppc + gdpgrowt + tradeofg + overallb + generalg + country + d2 + d3
y_model <- lm(y_formula, data = jensen_data) 
summary(y_model) # democratic political institutions are associated with higher levels of FDI inflows

Call:
lm(formula = y_formula, data = jensen_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.2092  -0.3934  -0.0639   0.2688  11.9197 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      7.780457   6.533424   1.191  0.23389    
regime                           0.021076   0.010401   2.026  0.04291 *  
var5                             0.364187   0.023856  15.266  < 2e-16 ***
market                          -0.554161   0.449721  -1.232  0.21805    
lgdppc                           0.834303   0.527355   1.582  0.11385    
gdpgrowt                         0.024384   0.007458   3.269  0.00110 ** 
tradeofg                         0.005536   0.003071   1.803  0.07162 .  
overallb                        -0.023144   0.008423  -2.748  0.00607 ** 
generalg                        -0.039327   0.014015  -2.806  0.00508 ** 
countryAlgeria                  -0.769471   1.759341  -0.437  0.66191    
countryArgentina                -0.217992   1.669869  -0.131  0.89615    
countryAustralia                -0.156684   1.514469  -0.103  0.91761    
countryAustria                  -1.586642   1.408756  -1.126  0.26023    
countryBahrain                  -2.871117   1.732004  -1.658  0.09759 .  
countryBangladesh                0.529181   2.106034   0.251  0.80164    
countryBelarus                  -1.314027   1.515115  -0.867  0.38593    
countryBenin                    -1.260464   1.500536  -0.840  0.40104    
countryBolivia                   0.488885   1.391487   0.351  0.72538    
countryBotswana                  0.009360   1.422880   0.007  0.99475    
countryBrazil                    0.454032   2.141688   0.212  0.83214    
countryBulgaria                 -0.938846   1.425483  -0.659  0.51024    
countryBurkina Faso             -0.593859   1.419675  -0.418  0.67578    
countryBurundi                  -0.966488   1.367211  -0.707  0.47973    
countryCameroon                 -0.528642   1.431846  -0.369  0.71203    
countryCanada                   -0.256858   1.635617  -0.157  0.87523    
countryCentral African Republic -0.408461   1.820240  -0.224  0.82248    
countryChad                     -0.487813   1.406777  -0.347  0.72882    
countryChile                     0.202560   1.453261   0.139  0.88917    
countryChina                     4.383751   3.002044   1.460  0.14443    
countryColombia                 -0.011690   1.670793  -0.007  0.99442    
countryComoros                  -1.629000   1.658470  -0.982  0.32614    
countryCongo, Dem. Rep.         -0.101006   1.704363  -0.059  0.95275    
countryCongo, Rep.              -1.230937   1.472940  -0.836  0.40346    
countryCosta Rica               -0.878026   1.332592  -0.659  0.51007    
countryCote d'Ivoire            -0.378930   1.476118  -0.257  0.79744    
countryCroatia                  -0.459311   1.516690  -0.303  0.76206    
countryCzech Republic            0.090824   1.538914   0.059  0.95295    
countryDenmark                  -1.366390   1.394951  -0.980  0.32748    
countryEcuador                  -0.818150   1.401128  -0.584  0.55936    
countryEgypt, Arab Rep.          1.142792   1.821610   0.627  0.53052    
countryEstonia                   0.415399   1.500746   0.277  0.78198    
countryEthiopia                  0.648511   1.905340   0.340  0.73363    
countryFiji                     -1.539227   1.490522  -1.033  0.30192    
countryFinland                  -1.674644   1.371825  -1.221  0.22238    
countryFrance                   -0.047935   1.829928  -0.026  0.97911    
countryGabon                    -1.806854   1.523626  -1.186  0.23585    
countryGambia, The              -1.103775   1.532171  -0.720  0.47139    
countryGermany                  -0.627451   1.974268  -0.318  0.75067    
countryGhana                    -0.268661   1.487119  -0.181  0.85666    
countryGreece                   -1.064810   1.424256  -0.748  0.45480    
countryGuatemala                -0.706671   1.414018  -0.500  0.61732    
countryGuinea                   -1.077860   1.467542  -0.734  0.46278    
countryGuyana                   -2.994982   1.502347  -1.994  0.04638 *  
countryHaiti                    -0.815099   1.428599  -0.571  0.56838    
countryHonduras                 -1.415105   1.578564  -0.896  0.37016    
countryHungary                   0.052412   1.429351   0.037  0.97075    
countryIceland                  -3.663865   1.848237  -1.982  0.04762 *  
countryIndia                     1.426339   2.852664   0.500  0.61715    
countryIndonesia                 1.030879   2.247355   0.459  0.64651    
countryIran, Islamic Rep.       -0.103101   1.838136  -0.056  0.95528    
countryIreland                  -1.647015   1.361439  -1.210  0.22656    
countryIsrael                   -1.396425   1.394339  -1.001  0.31675    
countryItaly                    -0.701067   1.833155  -0.382  0.70219    
countryJamaica                  -2.368045   1.366655  -1.733  0.08335 .  
countryJapan                    -0.472606   2.074262  -0.228  0.81980    
countryJordan                   -1.325352   1.357720  -0.976  0.32914    
countryKenya                    -0.005241   1.604597  -0.003  0.99739    
countryKorea, Rep.              -0.798335   1.738910  -0.459  0.64623    
countryKuwait                   -1.755731   1.493501  -1.176  0.23995    
countryLatvia                    2.503362   1.501608   1.667  0.09570 .  
countryLesotho                  -1.070562   1.389505  -0.770  0.44115    
countryLithuania                -0.951983   1.465559  -0.650  0.51607    
countryMadagascar               -0.915276   1.505808  -0.608  0.54339    
countryMalawi                   -0.543680   1.419650  -0.383  0.70180    
countryMalaysia                  1.191615   1.524981   0.781  0.43469    
countryMali                     -0.661622   1.427505  -0.463  0.64309    
countryMexico                    0.568644   1.932491   0.294  0.76860    
countryMongolia                 -1.823137   1.461429  -1.248  0.21241    
countryMorocco                  -0.134527   1.596473  -0.084  0.93286    
countryNamibia                  -0.042042   1.529543  -0.027  0.97808    
countryNepal                    -0.606929   1.557997  -0.390  0.69692    
countryNetherlands              -0.638559   1.502349  -0.425  0.67087    
countryNew Zealand              -0.348592   1.360599  -0.256  0.79783    
countryNicaragua                -0.875398   1.327978  -0.659  0.50987    
countryNiger                    -0.472562   1.478298  -0.320  0.74927    
countryNigeria                   1.471260   2.061675   0.714  0.47557    
countryNorway                   -1.610421   1.371560  -1.174  0.24052    
countryOman                     -1.189005   1.418525  -0.838  0.40205    
countryPakistan                  0.555077   2.068894   0.268  0.78851    
countryPanama                   -1.718419   1.399232  -1.228  0.21960    
countryParaguay                 -1.266933   1.326562  -0.955  0.33971    
countryPeru                     -0.224679   1.551209  -0.145  0.88486    
countryPhilippines               0.226883   1.849508   0.123  0.90238    
countryPoland                    0.348474   1.757334   0.198  0.84284    
countryPortugal                 -0.986743   1.420537  -0.695  0.48740    
countryRomania                  -0.238332   1.599100  -0.149  0.88154    
countryRussian Federation        0.540775   2.297226   0.235  0.81393    
countryRwanda                   -0.274914   1.417975  -0.194  0.84630    
countrySenegal                  -0.852630   1.402597  -0.608  0.54335    
countrySierra Leone             -1.337035   1.335865  -1.001  0.31705    
countrySingapore                 1.645610   1.662746   0.990  0.32248    
countrySouth Africa             -0.236653   1.831588  -0.129  0.89721    
countrySpain                     0.051776   1.718197   0.030  0.97596    
countrySri Lanka                -0.715195   1.520287  -0.470  0.63811    
countrySudan                    -0.288631   1.639942  -0.176  0.86032    
countrySwaziland                 1.005616   1.525757   0.659  0.50994    
countrySweden                   -0.639113   1.431268  -0.447  0.65527    
countrySwitzerland              -1.635414   1.449412  -1.128  0.25936    
countrySyrian Arab Republic     -0.650697   1.449605  -0.449  0.65358    
countryThailand                  0.223848   1.821157   0.123  0.90219    
countryTogo                     -0.061608   1.360865  -0.045  0.96390    
countryTrinidad and Tobago      -0.339657   1.453096  -0.234  0.81521    
countryTunisia                  -0.209754   1.382703  -0.152  0.87945    
countryTurkey                   -0.408241   1.808293  -0.226  0.82142    
countryUganda                   -0.478112   1.667380  -0.287  0.77435    
countryUnited Kingdom            0.543773   1.845459   0.295  0.76830    
countryUnited States             0.591666   2.316175   0.255  0.79841    
countryUruguay                  -1.515523   1.331549  -1.138  0.25523    
countryVenezuela                -0.816506   1.515953  -0.539  0.59024    
countryYemen, Rep.              -3.224999   1.668954  -1.932  0.05350 .  
countryZambia                    0.156704   1.392462   0.113  0.91041    
countryZimbabwe                 -0.938322   1.405607  -0.668  0.50452    
d2                              -0.076561   0.132325  -0.579  0.56296    
d3                               0.263350   0.206296   1.277  0.20195    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.28 on 1506 degrees of freedom
Multiple R-squared:  0.6137,    Adjusted R-squared:  0.5822 
F-statistic: 19.45 on 123 and 1506 DF,  p-value: < 2.2e-16

Effective Sample - Replicating Jensen (2003)

  • Let's recover the regression weights and the corresponding effective sample using the D model!
d_formula <- regime ~ var5 + market + lgdppc + gdpgrowt + tradeofg + overallb + generalg + country + d2 + d3
d_model <- lm(d_formula, data = jensen_data)
weights <- d_model$residuals^2

# we can look at how different the nominal and effective samples are.
jensen_data %>%
  select(-country, -year, -Fvar5, -regime) %>%
  apply(., 2, function(x) {
    c(mean(x), sd(x),
      weighted.mean(x, weights),
      sqrt(weighted.mean((x - weighted.mean(x, weights))^2, weights)))
  }) %>%
  t() %>%
  `colnames<-`(c("Nominal Mean", "Nominal SD", "Effective Mean", "Effective SD"))
         Nominal Mean Nominal SD Effective Mean Effective SD
var5        1.1752344  1.9642666      1.0396521    1.6495790
market     24.2338195  1.9218543     24.0152792    1.5606882
lgdppc      8.0542249  1.1339495      7.7979652    0.9124630
gdpgrowt    3.4257344  5.1275527      3.1994243    4.9895449
tradeofg   67.4539343 49.3379091     61.0085132   41.3038847
overallb   -4.0107131  6.3324700     -3.2134829    5.8089877
generalg   15.6689503  6.3112568     13.1417353    4.9461324
d2          0.4693252  0.4992113      0.3610634    0.4803089
d3          0.3110429  0.4630624      0.3803371    0.4854696

Effective Sample - Replicating Jensen (2003)

jensen_data$weights <- weights
weighted_countries <- jensen_data %>%
  select(country, weights) %>%
  group_by(country) %>%
  summarise(mean_weights = mean(weights)) %>% # average weight for each country over years
  mutate(nominal = 1) # all countries are included in nominal sample
world <- map_data("world") %>%
  left_join(., weighted_countries, by = c("region" = "country")) %>%
  replace_na(list(mean_weights = 0, nominal = 0))

# geographic distribution of nominal sample
ggplot() +
  geom_map(data = world, map = world,
           aes(x = long, y = lat, map_id = region, fill = factor(nominal))) + 
  scale_fill_manual("Nominal Sample", values = c("white", "black"))

# geographical distribution of effective sample
ggplot() +
  geom_map(data = world, map = world,
           aes(x = long, y = lat, map_id = region, fill = mean_weights)) +
  scale_fill_continuous("Effective Sample Weight %", limits = c(0, 16), low = "white", high = "black")

Backdoor Criterion

  • We are largely stargazers, and the unexpected appearance or disappearance of stars can be quite annoying. There are many possible explanations for this phenomenon.
  • One of them is that we may include bad controls or drop good controls in our regression. The following examples cover three cases:
    • Post-treatment variables
    • Colliders
    • (Unobserved) confounders

Post-treatment Bias

\[Y \leftarrow X \leftarrow D\]

  • We want to estimate the total effect of \(D\) on \(Y\)
D <- rnorm(1e3)
X <- 5 * D + rnorm(1e3)
Y <- 10 * X + rnorm(1e3)

lm(Y ~ D) |> summary()  # this gives us roughly the correct estimate of the total effect (5 * 10 = 50)

Call:
lm(formula = Y ~ D)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.428  -6.333   0.191   6.528  34.798 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.2348     0.3073   0.764    0.445    
D            49.3792     0.2975 165.959   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.715 on 998 degrees of freedom
Multiple R-squared:  0.965, Adjusted R-squared:  0.965 
F-statistic: 2.754e+04 on 1 and 998 DF,  p-value: < 2.2e-16
lm(Y ~ D + X) |> summary()  # including the post-treatment variable X will mask the relationship between D and Y

Call:
lm(formula = Y ~ D + X)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8764 -0.6670  0.0390  0.6896  3.1816 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02553    0.03149   0.811    0.418    
D            0.13366    0.16342   0.818    0.414    
X            9.98520    0.03255 306.734   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9953 on 997 degrees of freedom
Multiple R-squared:  0.9996,    Adjusted R-squared:  0.9996 
F-statistic: 1.359e+06 on 2 and 997 DF,  p-value: < 2.2e-16

Collider Bias

\[Y \rightarrow X \leftarrow D\]

D <- rnorm(1e3)                  # new treatment draw, independent of the Y simulated above
X <- 5 * D + 10 * Y + rnorm(1e3) # X is a collider: caused by both D and Y

lm(Y ~ D) |> summary()  # D and Y have no correlation

Call:
lm(formula = Y ~ D)

Residuals:
     Min       1Q   Median       3Q      Max 
-180.062  -32.761    0.583   36.685  163.049 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.059      1.641   0.645    0.519
D              2.186      1.625   1.345    0.179

Residual standard error: 51.91 on 998 degrees of freedom
Multiple R-squared:  0.001811,  Adjusted R-squared:  0.0008105 
F-statistic:  1.81 on 1 and 998 DF,  p-value: 0.1788
lm(Y ~ D + X) |> summary()  # including the collider X will create information flows between D and Y

Call:
lm(formula = Y ~ D + X)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.35055 -0.06536 -0.00104  0.06393  0.32794 

Coefficients:
              Estimate Std. Error   t value Pr(>|t|)    
(Intercept)  1.733e-03  3.088e-03     0.561    0.575    
D           -5.016e-01  3.060e-03  -163.948   <2e-16 ***
X            1.000e-01  5.954e-06 16796.220   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09763 on 997 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.413e+08 on 2 and 997 DF,  p-value: < 2.2e-16

Unobservable Confounders

\[D \leftarrow U \rightarrow Y\]

\[D \rightarrow Y\]

We want to estimate the direct effect of \(D\) on \(Y\).

U <- rnorm(1e3)
D <- 2 * U + rnorm(1e3)
Y <- 5 * D + 10 * U + rnorm(1e3)

lm(Y ~ D) |> summary()  # biased estimate of the direct effect (the true direct effect is 5)

Call:
lm(formula = Y ~ D)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.4206  -3.0485   0.0342   3.2417  13.4430 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.08417    0.15252  -0.552    0.581    
D            9.05345    0.06782 133.500   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.822 on 998 degrees of freedom
Multiple R-squared:  0.947, Adjusted R-squared:  0.9469 
F-statistic: 1.782e+04 on 1 and 998 DF,  p-value: < 2.2e-16
lm(Y ~ D + U) |> summary()  # including the confounder U will correct the estimate

Call:
lm(formula = Y ~ D + U)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7377 -0.7171 -0.0568  0.6967  3.1830 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.861e-04  3.264e-02   0.015    0.988    
D           4.950e+00  3.194e-02 155.000   <2e-16 ***
U           1.011e+01  7.011e-02 144.215   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.032 on 997 degrees of freedom
Multiple R-squared:  0.9976,    Adjusted R-squared:  0.9976 
F-statistic: 2.05e+05 on 2 and 997 DF,  p-value: < 2.2e-16

Shutting the Backdoor

  • Include all pre-treatment variables in regression models.
  • Control for as many confounders as possible.
  • Don’t include colliders.

Footnotes

  1. See https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1540-5907.2012.00602.x.

  2. This is the abbreviation of Locally Weighted Scatterplot Smoothing, or locally weighted smoothing.