CSSS/POLS 512: Lab 1

Logistics & R Refresher

Tao Lin

April 1, 2022

Agenda

  • Logistics
    • Labs, Office Hours, Homework
    • Goals and Expectations
    • R, RStudio, R Markdown, \(\LaTeX\)
  • R Refresher
    • Data Wrangling for Time Series and Panel Data
    • Tips for Debugging and Coding Efficiency
    • How to Look for Help? reprex, Piazza & Stack Overflow

Acknowledgement

The section materials are adapted from the original version made by Daniel Yoo and Inhwan Ko, former TAs to CSSS/POLS 512, Department of Political Science, University of Washington.

Logistics

Labs, Office Hours, and Homework

  1. Lab Sessions: Fri, 1:30-3:20 pm via Zoom (link available on the website)
  • Covers application of material from lecture using examples; clarification and extension of lecture material; Q&A for homework and lectures
  • Materials will be available on the course website, and will be offered in a compact .zip file (thanks to participants’ suggestions today!).
  • Every section is expected to have slides, a coding exercise, and data files. Solutions to the coding exercise will be posted after the section.
  2. Office Hours: Wed 11:00-12:00 pm & Fri 3:20-4:20 pm via Zoom (same link as the lab session)
  • Available for troubleshooting and specific questions about homework and lecture materials
  3. Homework: 3-4 assignments, due every 2 weeks or so
  • Use RStudio with R Markdown, with the write-up in \(\LaTeX\)
  • Many packages: tseries, forecast, lmtest, urca, quantmod, etc.

Expectation and Goals

  1. When this course is over, you should be able to do the following (and more):
  • Identify and understand time series dynamics: seasonality, deterministic trends, moving average processes, autoregressive processes
  • Distinguish between stationary and nonstationary time series, perform unit root tests, fit ARMA and ARIMA models, use cross validation for model assessment
  • Analyze multiple continuous time series using vector autoregression, perform cointegration tests, and estimate error correction models for cointegrated time series
  • Distinguish between random effects, fixed effects, and mixed effects and decide when each of these is appropriate
  • Understand Nickell bias and use an instrumental variable approach with GMM to address the issue
  • Perform multiple imputation and in-sample simulations for panel data

Expectation and Goals

  1. The course moves fast: you should be comfortable doing the following for the homework assignments and project
  • tidying and transforming data, especially time series and panel data
  • importing and exporting datasets
  • generating plots of your data and results
  • writing basic functions and loops for repeated procedures
  2. Fortunately, for those of you new to R, there are many resources to get you up to speed
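For instance, here is a minimal sketch of a basic function plus a loop of the kind these tasks involve (`lag_by` is a hypothetical helper written for illustration, not a course function):

```r
# lag_by() shifts a series back by k steps, padding the front with NA
lag_by <- function(x, k = 1) c(rep(NA, k), head(x, -k))

# a loop for a repeated procedure: lag the same series by 1 and by 2
for (k in 1:2) {
  print(lag_by(1:5, k))
}
```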

Three Levels of Methodological Sophistication

  • Blind Consumer 🤩
    • “I include dyad fixed effects in my model because I read some famous methodologists say in top academic journals that everyone should.”
    • “I want to apply causal forest to my project because it seems so fancy.”
  • Power User 🤔
    • Choose appropriate methods for your research question and data from a wide range of available tools
    • Understand how canned packages actually work internally
  • Developer 👨‍💻
    • Identify common methodological problems in your field and provide a solution
    • Write software packages for public use

Computation Software

  1. Please make sure that you have R and RStudio installed on your computer
  2. If you would like to learn how to use \(\LaTeX\), this is a great opportunity to do so
  • An easy way to get introduced to this is to use R Markdown within RStudio
  • Make sure you have TeX installed, which you can find here
  • Make sure you have R Markdown installed using install.packages("rmarkdown")
  • Now in RStudio, choose File \(\rightarrow\) New File \(\rightarrow\) R Markdown

R Markdown

  1. Using R Markdown
  • Choose to compile your document as a PDF or HTML file and give it a title
  • Now you will be given a template
  • Embed your code within ```{r} and ```, and write up your text outside
  • Then press Knit and it will produce a PDF or HTML document with your code, R output, and text nicely formatted
  • Please try to complete your homework in this way
  • However, other forms such as R and \(\LaTeX\) are also welcome (especially for those who took data visualization class)
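For reference, a minimal .Rmd source file looks roughly like this (the title and chunk contents are placeholders):

````markdown
---
title: "Homework 1"
output: pdf_document
---

Write-up text goes outside the chunks.

```{r}
summary(cars)  # code inside the chunk is run; code and output appear in the PDF
```
````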

More on R Markdown

  • You can use R Markdown to write an academic paper
    • Control chunk options such as include (hide code blocks) or fig.align (adjust the alignment of figures)
    • install.packages("tinytex") and install pandoc.
    • You can load \(\LaTeX\) packages or .tex/.sty files in the header
  • R Markdown can run not only R but also Python!
    • install.packages("reticulate")
    • Use py$... to call objects from a previous Python chunk in an R chunk.
  • In 2022, RStudio released Quarto.
    • An ambitious “next-gen” tool that aims to replace R Markdown and Jupyter Notebook.
    • The slides for today’s section are powered by it!
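For example, a sketch of mixing chunks in one document (assuming reticulate is installed and a Python interpreter is available on your machine):

````markdown
```{python}
x = [1, 2, 3]          # a python chunk
```

```{r}
library(reticulate)
sum(py$x)              # py$x pulls the python object into the R chunk
```
````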

How to Look for Help?

How to Look for Help?

  • Just google the error message or find it on Stack Overflow!
  • We use Piazza for Q&A and troubleshooting
    • People encountering similar problems can see how to solve them (avoid reinventing the wheel)
    • If you are not comfortable with a public post, you can post your questions anonymously, or just send me a private post.
    • If your questions cannot be covered in a single post, I encourage you to come to my office hours.
    • Piazza by default sends email notifications for new posts. If you find that annoying, you can turn off notifications in the settings.
  • Besides coding issues, you are also welcome to ask questions about the choice of methods, research design, etc.

How to Look for Help?

More on Troubleshooting

  • Minimal Reproducible Example (MRE)
    • “minimal”: “look in a smaller stack to find a needle”
      • inputs are small and simple
      • fewer packages loaded
      • fewer function calls
    • “reproducible”: provide code that someone else could run
    • Use reprex and datapasta to generate an MRE!

More on Troubleshooting

How to use reprex to generate MRE?

library(reprex)
library(datapasta)

tribble_paste(mtcars) # turns a small data.frame into copy-pastable tribble() code

reprex(
  {
    f <- function(x) x + 1 # create f
    g <- function(x) f(x)  # create g
    g("a")
    ggplot()
  },
  session_info = TRUE, # include information about your computer and current R session (important!)
  style = TRUE, # make your code prettier
  venue = "slack" # choose slack style
)

Also see: https://www.rstudio.com/resources/webinars/help-me-help-you-creating-reproducible-examples/

Potential Data Source for Your Final Project

Questions?

R Refresher

Data Wrangling

Let’s refresh some basic packages and functions for wrangling data:

  1. Loading the packages:
library(haven) #library(foreign) # for loading datasets
library(tidyverse) # for data wrangling
library(ggplot2) # for data visualization; already integrated in tidyverse
  2. Loading the data:
data <- read.csv("Lab1_data/Lab1data.csv", header = TRUE) 
# if NAs are coded as spaces, add an argument: na.strings = " "
# if NAs are coded as, say, 77, add an argument: na.strings = "77"
# using read_csv() will load data as a tibble

# How to view the data?

Data Wrangling

Let’s refresh some basic packages and functions for wrangling data:

  1. Loading the packages:
library(haven) #library(foreign) # for loading datasets
library(tidyverse) # for data wrangling
library(ggplot2) # for data visualization; already integrated in tidyverse
  2. Loading the data:
data <- read.csv("Lab1_data/Lab1data.csv", header = TRUE) 
# if NAs are coded as spaces, add an argument: na.strings = " "
# if NAs are coded as, say, 77, add an argument: na.strings = "77"
# using read_csv() will load data as a tibble

# How to view the data?
head(data, n = 10) # view the first 10 rows
               country Year GDP.per.capita.PPP.current.international polity2
1  Antigua and Barbuda 2000                                 12345.82      NA
2  Antigua and Barbuda 2001                                 12654.92      NA
3  Antigua and Barbuda 2002                                 12959.93      NA
4  Antigua and Barbuda 2003                                 13699.04      NA
5  Antigua and Barbuda 2004                                 14866.37      NA
6  Antigua and Barbuda 2005                                 15791.64      NA
7  Antigua and Barbuda 2006                                 18234.80      NA
8  Antigua and Barbuda 2007                                 20324.14      NA
9  Antigua and Barbuda 2008                                 20565.86      NA
10 Antigua and Barbuda 2009                                 18778.49      NA
data$country |> unique() |> length() # view the total number of countries
[1] 174
data$Year |> unique() # view all years
 [1] 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
# R 4.1 introduced the built-in pipe |>; otherwise we need to write ugly code like this:
# length(unique(data$country)) # will be very annoying when we deal with many parentheses
# anyway, we have N=174, T=11 (2000~2010), gdp per capita, polity
# It's panel data!
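A minimal side-by-side sketch of the two styles on a toy vector:

```r
x <- c("a", "b", "b", "c")

# nested calls: read inside-out
n_nested <- length(unique(x))

# native pipe (R >= 4.1): read left-to-right
n_piped <- x |> unique() |> length()

n_nested == n_piped # both count 3 distinct values
```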

Data Wrangling

Data Wrangling

  1. Change the name of variables:
rawdata <- data # save our rawdata before touching it
colnames(data)
[1] "country"                                 
[2] "Year"                                    
[3] "GDP.per.capita.PPP.current.international"
[4] "polity2"                                 
# Honestly, "GDP.per.capita.PPP.current.international" is too long. 
# How do we fix this?

Data Wrangling

  1. Change the name of variables:
rawdata <- data # save our rawdata before touching it
colnames(data)
[1] "country"                                 
[2] "Year"                                    
[3] "GDP.per.capita.PPP.current.international"
[4] "polity2"                                 
# Honestly, "GDP.per.capita.PPP.current.international" is too long. 
# How do we fix this?
colnames(data) <- c("country", "Year", "income", "polity2")
colnames(data)
[1] "country" "Year"    "income"  "polity2"
# or try this
data <- rawdata
data <- rawdata %>% # the %>% pipe operator comes from magrittr; already integrated in tidyverse
    rename("income" = "GDP.per.capita.PPP.current.international")
colnames(data)
[1] "country" "Year"    "income"  "polity2"

Data Wrangling

  1. Subset the data
## only include year = 2000
data2000 <- subset(data, subset = (Year == 2000))
## or
data2000 <- data[data$Year == 2000, ]
## or
data2000 <- data %>% 
    filter(Year == 2000)

head(data2000, n = 10)
               country Year    income polity2
1  Antigua and Barbuda 2000 12345.817      NA
2          Afghanistan 2000        NA      -7
3              Albania 2000  4259.308       5
4              Algeria 2000  5395.332      -3
5              Andorra 2000        NA      NA
6               Angola 2000  2277.127      -3
7            Argentina 2000  9112.116       8
8              Armenia 2000  2033.462       5
9                Aruba 2000        NA      NA
10          Azerbaijan 2000  2207.057      -7

Data Wrangling

  1. Summarize the data
summary(data)
   country               Year          income           polity2       
 Length:1914        Min.   :2000   Min.   :  219.2   Min.   :-10.000  
 Class :character   1st Qu.:2002   1st Qu.: 1625.0   1st Qu.: -4.000  
 Mode  :character   Median :2005   Median : 4299.2   Median :  5.000  
                    Mean   :2005   Mean   : 7874.9   Mean   :  2.431  
                    3rd Qu.:2008   3rd Qu.: 9818.6   3rd Qu.:  8.000  
                    Max.   :2010   Max.   :91712.3   Max.   : 10.000  
                                   NA's   :373       NA's   :542      
# if you only care about income and polity2, you can do this
data.mean <- sapply(data[, 3:4], mean, na.rm = TRUE) # sapply() applies the function mean() to each column
data.sd <- sapply(data[, 3:4], sd, na.rm = TRUE)
rbind(data.mean, data.sd)
             income  polity2
data.mean  7874.852 2.430758
data.sd   10194.125 6.413676
quantile(data$income, probs = .25, na.rm = TRUE)  # view the 25% quantile of income
     25% 
1624.957 
quantile(data$income, probs = seq(0, 1, .25), na.rm = TRUE) # view quartiles; seq() creates an evenly spaced sequence
        0%        25%        50%        75%       100% 
  219.1904  1624.9566  4299.1876  9818.5833 91712.3201 
# alternatively, some recent packages such as `modelsummary` support
# more advanced function for data summary

Data Wrangling

  1. Various plots

Useful site: http://www.sthda.com/english/wiki/ggplot2-essentials

p1 <- ggplot(data2000, aes(x=income)) # preparing a canvas...
p1 + geom_histogram() # histogram: ggplot adds a histogram layer on top of the canvas

Data Wrangling

  1. Various plots

Useful site: http://www.sthda.com/english/wiki/ggplot2-essentials

p2 <- ggplot(data2000, aes(x=polity2, y=income))
p2 + geom_point() # scatter plot

Data Wrangling

  1. Various plots

Useful site: http://www.sthda.com/english/wiki/ggplot2-essentials

p2 <- ggplot(data2000, aes(x=polity2, y=income))
p2 + geom_point() + 
  geom_smooth(method = "lm") # scatter plot + fitted line

Data Wrangling

  1. Various plots

Useful site: http://www.sthda.com/english/wiki/ggplot2-essentials

# create a threshold for high- and low-income countries
data2000$group <- NA
data2000$group[data2000$income > mean(data2000$income, na.rm = TRUE)] <- "High"
data2000$group[data2000$income <= mean(data2000$income, na.rm = TRUE)] <- "Low"
data2000$group <- as.factor(data2000$group) # make it as a "factor" variable

# look at the distribution of income for each group
p3 <- ggplot(na.omit(data2000), aes(group, income, color = group)) # na.omit() already drops rows with NA
p3 + geom_boxplot() # boxplot
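The two assignment lines above can also be collapsed into a single ifelse() call; a sketch on a toy income vector (the numbers are made up):

```r
income <- c(1000, 5000, 20000, NA)  # toy data, with one missing value
grp <- ifelse(income > mean(income, na.rm = TRUE), "High", "Low")
grp <- factor(grp)                  # NA stays NA through both steps
grp
```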

Time Series Data - Unemployment in Maine

# Acquire the data
#   Monthly unemployment in Maine from January 1996 to August 2006
maine.month <- read.table("Lab1_data/Maine.dat", header = TRUE)

# Attach the object and check its class (i.e. the type of data structure)
attach(maine.month) # attach() is generally discouraged; used here only for brevity
class(maine.month)
[1] "data.frame"
# Monthly unemployment data 
head(maine.month, n = 10)
   unemploy
1       6.7
2       6.7
3       6.4
4       5.9
5       5.2
6       4.8
7       4.8
8       4.0
9       4.2
10      4.4

Time Series Data - Unemployment in Maine

# use ts() to create a time series object
# try help(ts) or ?ts to see what this function does
maine.month.ts <- ts(
  unemploy,
  start = c(1996, 1), # the starting year is 1996 January
  freq = 12 # 12 observations per unit of time (here, one year = 12 months)
)
class(maine.month.ts)
[1] "ts"
maine.month.ts
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1996 6.7 6.7 6.4 5.9 5.2 4.8 4.8 4.0 4.2 4.4 5.0 5.0
1997 6.4 6.5 6.3 5.9 4.9 4.8 4.5 4.0 4.1 4.3 4.8 5.0
1998 6.2 5.7 5.6 4.6 4.0 4.2 4.1 3.6 3.7 4.1 4.3 4.0
1999 4.9 5.0 4.6 4.3 3.9 4.0 3.6 3.3 3.1 3.3 3.7 3.7
2000 4.4 4.4 4.1 3.5 3.1 3.0 2.8 2.5 2.6 2.8 3.1 3.0
2001 3.9 4.2 4.0 4.1 3.5 3.5 3.4 3.1 3.4 3.7 4.0 4.0
2002 5.0 4.9 5.0 4.7 4.0 4.2 4.0 3.6 3.7 3.9 4.5 4.6
2003 5.6 5.8 5.6 5.5 4.8 4.9 4.8 4.3 4.5 4.6 4.8 4.7
2004 5.6 5.6 5.5 4.8 4.2 4.3 4.2 3.8 4.0 4.2 4.6 4.6
2005 5.5 5.8 5.5 5.2 4.7 4.6 4.5 4.1 4.4 4.4 4.8 4.6
2006 5.3 5.6 4.9 4.6 4.2 4.4 4.4 3.9                

Time Series Data - Unemployment in Maine

# Find the mean unemployment per year
maine.annual.ts <- aggregate(maine.month.ts) / 12
maine.annual.ts
Time Series:
Start = 1996 
End = 2005 
Frequency = 1 
 [1] 5.258333 5.125000 4.508333 3.950000 3.275000 3.733333 4.341667 4.991667
 [9] 4.616667 4.841667
# what does aggregate() do?
# ?aggregate tells us that it sums data over subsets
# for a "ts" object, it splits the series by year and sums within each year
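A toy series makes the yearly split-and-sum easy to verify by hand:

```r
x.ts <- ts(1:24, start = c(2000, 1), frequency = 12) # two years of monthly data
aggregate(x.ts)      # yearly sums: 78 (= 1 + ... + 12) and 222 (= 13 + ... + 24)
aggregate(x.ts) / 12 # yearly means, as with maine.annual.ts
```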

Time Series Data - Unemployment in Maine

plot(maine.month.ts, ylab = "Unemployed (%)", main = "Monthly Unemployment Rate in Maine")

How would you describe the pattern of unemployment?

Time Series Data - Unemployment in Maine

plot(maine.annual.ts, ylab = "Unemployed (%)", main = "Annual Unemployment Rate in Maine")

How would you describe the pattern of unemployment?

Time Series Data - Unemployment in Maine

# Find unemployment rates for February and August
maine.feb <- window(maine.month.ts, start = c(1996, 2), freq = 1)
maine.aug <- window(maine.month.ts, start = c(1996, 8), freq = 1)
# Find ratio of mean unemployment in Feb and August versus grand mean
feb.ratio <- mean(maine.feb) / mean(maine.month.ts)
aug.ratio <- mean(maine.aug) / mean(maine.month.ts)
# what does window() do?
# what does `feb.ratio <- mean(maine.feb) / mean(maine.month.ts)` do?

Time Series Data - Unemployment in Maine

# Find unemployment rates for February and August
maine.feb <- window(
  maine.month.ts,
  start = c(1996, 2), # start from February 1996
  freq = 1) # one sampled observation per year
# if you specify freq = 2 instead, the sampling interval becomes half a year.
maine.aug <- window(maine.month.ts, start = c(1996, 8), freq = 1)
# Find ratio of mean unemployment in Feb and August versus grand mean
feb.ratio <- mean(maine.feb) / mean(maine.month.ts)
aug.ratio <- mean(maine.aug) / mean(maine.month.ts)
# what does window() do?
# what does `feb.ratio <- mean(maine.feb) / mean(maine.month.ts)` do?

maine.feb # window() can subset time series between two cutoffs
Time Series:
Start = 1996.083 
End = 2006.083 
Frequency = 1 
 [1] 6.7 6.5 5.7 5.0 4.4 4.2 4.9 5.8 5.6 5.8 5.6
feb.ratio # compare the mean February unemployment rate to the grand mean of the whole series
[1] 1.222529
aug.ratio
[1] 0.8163732
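window() also takes both start and end cutoffs; a toy example:

```r
x.ts <- ts(1:24, start = c(2000, 1), frequency = 12)
window(x.ts, start = c(2000, 6), end = c(2001, 2)) # Jun 2000 through Feb 2001
```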

What Does ts() Do To Your Data?

  • ts object is a matrix-like object with information about “time” as rownames and colnames and with “random variables” as elements (in the above example, the “random variable” is unemployment in Maine).
    • In the above example, we feed in a data.frame with a single column. ts() first uses data.matrix() to transform it into a matrix, then treats the number of columns as the total number of time series (here, just 1) and the number of rows as the total number of observations within each series.
    • Caveat ⚠️: if a vector, matrix, or data.frame contains any character value, R automatically coerces the other numeric values to characters as well. A true NA in your vector is fine: ts() simply returns NA. However, if you use white space " " or other characters (e.g. "NA") to code missing values, the whole column becomes character. data.matrix() inside ts() then converts these characters to a factor, and the factor to its integer codes, so the resulting numeric values are totally different from the original ones. Make sure your data contains only numeric values or NA.
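The factor detour can be seen in miniature: as.numeric() on a factor returns the internal level codes, not the printed labels.

```r
f <- factor(c("6.7", "4.8", "NA"))  # numbers stored as strings, missing value coded as text "NA"
as.numeric(f)                       # factor codes (level positions), not the original numbers!
as.numeric(as.character(f))         # the safe round trip; text "NA" becomes a real NA (with a warning)
```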
maine <- read.table("Lab1_data/Maine.dat", header = TRUE)
maine[6, ] <- NA # assign NA to June 1996
maine |> as.matrix() # we still get a numeric sequence
       unemploy
  [1,]      6.7
  [2,]      6.7
  [3,]      6.4
  [4,]      5.9
  [5,]      5.2
  [6,]       NA
  [7,]      4.8
  [8,]      4.0
  [9,]      4.2
 [10,]      4.4
 [11,]      5.0
 [12,]      5.0
 [13,]      6.4
 [14,]      6.5
 [15,]      6.3
 [16,]      5.9
 [17,]      4.9
 [18,]      4.8
 [19,]      4.5
 [20,]      4.0
 [21,]      4.1
 [22,]      4.3
 [23,]      4.8
 [24,]      5.0
 [25,]      6.2
 [26,]      5.7
 [27,]      5.6
 [28,]      4.6
 [29,]      4.0
 [30,]      4.2
 [31,]      4.1
 [32,]      3.6
 [33,]      3.7
 [34,]      4.1
 [35,]      4.3
 [36,]      4.0
 [37,]      4.9
 [38,]      5.0
 [39,]      4.6
 [40,]      4.3
 [41,]      3.9
 [42,]      4.0
 [43,]      3.6
 [44,]      3.3
 [45,]      3.1
 [46,]      3.3
 [47,]      3.7
 [48,]      3.7
 [49,]      4.4
 [50,]      4.4
 [51,]      4.1
 [52,]      3.5
 [53,]      3.1
 [54,]      3.0
 [55,]      2.8
 [56,]      2.5
 [57,]      2.6
 [58,]      2.8
 [59,]      3.1
 [60,]      3.0
 [61,]      3.9
 [62,]      4.2
 [63,]      4.0
 [64,]      4.1
 [65,]      3.5
 [66,]      3.5
 [67,]      3.4
 [68,]      3.1
 [69,]      3.4
 [70,]      3.7
 [71,]      4.0
 [72,]      4.0
 [73,]      5.0
 [74,]      4.9
 [75,]      5.0
 [76,]      4.7
 [77,]      4.0
 [78,]      4.2
 [79,]      4.0
 [80,]      3.6
 [81,]      3.7
 [82,]      3.9
 [83,]      4.5
 [84,]      4.6
 [85,]      5.6
 [86,]      5.8
 [87,]      5.6
 [88,]      5.5
 [89,]      4.8
 [90,]      4.9
 [91,]      4.8
 [92,]      4.3
 [93,]      4.5
 [94,]      4.6
 [95,]      4.8
 [96,]      4.7
 [97,]      5.6
 [98,]      5.6
 [99,]      5.5
[100,]      4.8
[101,]      4.2
[102,]      4.3
[103,]      4.2
[104,]      3.8
[105,]      4.0
[106,]      4.2
[107,]      4.6
[108,]      4.6
[109,]      5.5
[110,]      5.8
[111,]      5.5
[112,]      5.2
[113,]      4.7
[114,]      4.6
[115,]      4.5
[116,]      4.1
[117,]      4.4
[118,]      4.4
[119,]      4.8
[120,]      4.6
[121,]      5.3
[122,]      5.6
[123,]      4.9
[124,]      4.6
[125,]      4.2
[126,]      4.4
[127,]      4.4
[128,]      3.9
maine.month.ts <- ts(maine, start = c(1996, 1), freq = 12) # so it does not affect the type of data in the resulting time series
maine.month.ts
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1996 6.7 6.7 6.4 5.9 5.2  NA 4.8 4.0 4.2 4.4 5.0 5.0
1997 6.4 6.5 6.3 5.9 4.9 4.8 4.5 4.0 4.1 4.3 4.8 5.0
1998 6.2 5.7 5.6 4.6 4.0 4.2 4.1 3.6 3.7 4.1 4.3 4.0
1999 4.9 5.0 4.6 4.3 3.9 4.0 3.6 3.3 3.1 3.3 3.7 3.7
2000 4.4 4.4 4.1 3.5 3.1 3.0 2.8 2.5 2.6 2.8 3.1 3.0
2001 3.9 4.2 4.0 4.1 3.5 3.5 3.4 3.1 3.4 3.7 4.0 4.0
2002 5.0 4.9 5.0 4.7 4.0 4.2 4.0 3.6 3.7 3.9 4.5 4.6
2003 5.6 5.8 5.6 5.5 4.8 4.9 4.8 4.3 4.5 4.6 4.8 4.7
2004 5.6 5.6 5.5 4.8 4.2 4.3 4.2 3.8 4.0 4.2 4.6 4.6
2005 5.5 5.8 5.5 5.2 4.7 4.6 4.5 4.1 4.4 4.4 4.8 4.6
2006 5.3 5.6 4.9 4.6 4.2 4.4 4.4 3.9                
maine[6, ] <- "NA" # assign a string called "NA" to June 1996
maine |> as.matrix() # we get a character sequence!
       unemploy
  [1,] "6.7"   
  [2,] "6.7"   
  [3,] "6.4"   
  [4,] "5.9"   
  [5,] "5.2"   
  [6,] "NA"    
  [7,] "4.8"   
  [8,] "4"     
  [9,] "4.2"   
 [10,] "4.4"   
 [11,] "5"     
 [12,] "5"     
 [13,] "6.4"   
 [14,] "6.5"   
 [15,] "6.3"   
 [16,] "5.9"   
 [17,] "4.9"   
 [18,] "4.8"   
 [19,] "4.5"   
 [20,] "4"     
 [21,] "4.1"   
 [22,] "4.3"   
 [23,] "4.8"   
 [24,] "5"     
 [25,] "6.2"   
 [26,] "5.7"   
 [27,] "5.6"   
 [28,] "4.6"   
 [29,] "4"     
 [30,] "4.2"   
 [31,] "4.1"   
 [32,] "3.6"   
 [33,] "3.7"   
 [34,] "4.1"   
 [35,] "4.3"   
 [36,] "4"     
 [37,] "4.9"   
 [38,] "5"     
 [39,] "4.6"   
 [40,] "4.3"   
 [41,] "3.9"   
 [42,] "4"     
 [43,] "3.6"   
 [44,] "3.3"   
 [45,] "3.1"   
 [46,] "3.3"   
 [47,] "3.7"   
 [48,] "3.7"   
 [49,] "4.4"   
 [50,] "4.4"   
 [51,] "4.1"   
 [52,] "3.5"   
 [53,] "3.1"   
 [54,] "3"     
 [55,] "2.8"   
 [56,] "2.5"   
 [57,] "2.6"   
 [58,] "2.8"   
 [59,] "3.1"   
 [60,] "3"     
 [61,] "3.9"   
 [62,] "4.2"   
 [63,] "4"     
 [64,] "4.1"   
 [65,] "3.5"   
 [66,] "3.5"   
 [67,] "3.4"   
 [68,] "3.1"   
 [69,] "3.4"   
 [70,] "3.7"   
 [71,] "4"     
 [72,] "4"     
 [73,] "5"     
 [74,] "4.9"   
 [75,] "5"     
 [76,] "4.7"   
 [77,] "4"     
 [78,] "4.2"   
 [79,] "4"     
 [80,] "3.6"   
 [81,] "3.7"   
 [82,] "3.9"   
 [83,] "4.5"   
 [84,] "4.6"   
 [85,] "5.6"   
 [86,] "5.8"   
 [87,] "5.6"   
 [88,] "5.5"   
 [89,] "4.8"   
 [90,] "4.9"   
 [91,] "4.8"   
 [92,] "4.3"   
 [93,] "4.5"   
 [94,] "4.6"   
 [95,] "4.8"   
 [96,] "4.7"   
 [97,] "5.6"   
 [98,] "5.6"   
 [99,] "5.5"   
[100,] "4.8"   
[101,] "4.2"   
[102,] "4.3"   
[103,] "4.2"   
[104,] "3.8"   
[105,] "4"     
[106,] "4.2"   
[107,] "4.6"   
[108,] "4.6"   
[109,] "5.5"   
[110,] "5.8"   
[111,] "5.5"   
[112,] "5.2"   
[113,] "4.7"   
[114,] "4.6"   
[115,] "4.5"   
[116,] "4.1"   
[117,] "4.4"   
[118,] "4.4"   
[119,] "4.8"   
[120,] "4.6"   
[121,] "5.3"   
[122,] "5.6"   
[123,] "4.9"   
[124,] "4.6"   
[125,] "4.2"   
[126,] "4.4"   
[127,] "4.4"   
[128,] "3.9"   
maine.month.ts <- ts(maine, start = c(1996, 1), freq = 12) # it totally changes the original numeric values in the resulting time series!
maine.month.ts
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1996  35  35  33  30  24  36  21  13  15  17  23  23
1997  33  34  32  30  22  21  18  13  14  16  21  23
1998  31  28  27  19  13  15  14   9  10  14  16  13
1999  22  23  19  16  12  13   9   6   5   6  10  10
2000  17  17  14   8   5   4   3   1   2   3   5   4
2001  12  15  13  14   8   8   7   5   7  10  13  13
2002  23  22  23  20  13  15  13   9  10  12  18  19
2003  27  29  27  26  21  22  21  16  18  19  21  20
2004  27  27  26  21  15  16  15  11  13  15  19  19
2005  26  29  26  24  20  19  18  14  17  17  21  19
2006  25  27  22  19  15  17  17  12                
  • You can use debugonce(ts) to step through the function and see how the data is transformed inside it.
  • (statistical) definition of time series data:
    • realizations of sequences of random variables defined at fixed sampling intervals
    • sampling interval = one time unit of time-series (e.g. second, minute, hour, day, month, quarter)
  • For example, if we have a quarterly pattern in our time series data, the frequency should be 4:
maine <- read.table("Lab1_data/Maine.dat", header = TRUE)
maine.quarter.ts <- ts(
  maine,
  start = c(1996, 1), # this time we can interpret the starting point as the 1st quarter of 1996
  freq = 4 # every year has 4 quarters
)
maine.quarter.ts |> window(start = 1996, end = 2015)
     Qtr1 Qtr2 Qtr3 Qtr4
1996  6.7  6.7  6.4  5.9
1997  5.2  4.8  4.8  4.0
1998  4.2  4.4  5.0  5.0
1999  6.4  6.5  6.3  5.9
2000  4.9  4.8  4.5  4.0
2001  4.1  4.3  4.8  5.0
2002  6.2  5.7  5.6  4.6
2003  4.0  4.2  4.1  3.6
2004  3.7  4.1  4.3  4.0
2005  4.9  5.0  4.6  4.3
2006  3.9  4.0  3.6  3.3
2007  3.1  3.3  3.7  3.7
2008  4.4  4.4  4.1  3.5
2009  3.1  3.0  2.8  2.5
2010  2.6  2.8  3.1  3.0
2011  3.9  4.2  4.0  4.1
2012  3.5  3.5  3.4  3.1
2013  3.4  3.7  4.0  4.0
2014  5.0  4.9  5.0  4.7
2015  4.0               
# this doesn't make sense, because this is actually monthly data, but it's a good example of how `ts()` works.
  • Since the printed ts object looks like a data.frame, you may wonder whether we should transform it into one. The answer is: we don’t need to.
    • A ts object is essentially a sequence, but we use two “tags” to identify each element in the sequence. For example, we use “year” (as a cycle of 12 months) and “month” to identify each value of the unemployment rate. Similarly, we could use “hour” (as a cycle of 60 minutes) and “minute” to identify the position of each minute in a day. That’s why we see a matrix-like printout. But a single time series is just one column of a matrix!
    • To see why, use as.matrix() or as.data.frame(): either transforms the object into a matrix or data.frame with a single column, which is just the original form of our data. We will see evidence of this fact in the next example.
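A quick check on a toy series confirms the single-column claim:

```r
x.ts <- ts(1:6, start = c(2000, 1), frequency = 12)
as.matrix(x.ts)       # a 6 x 1 matrix: one series, one column
as.data.frame(x.ts)   # likewise a one-column data.frame
```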

Time Series Data - Global Temperature

# Acquire the data
# Average global temperature from Univ. East Anglia and UK Met Office
# Monthly from January 1856 to December 2005
global <- read.table("Lab1_data/global.dat")
global <- t(global) |> as.vector() |> as.data.frame()

Coding Exercise: Time Series Data - Global Temperature

  1. Create a time series object using the data that starts in Jan 1856 and ends in Dec 2005 with monthly observations.

  2. Find the mean temperature for each year and save in a new time series object.

  3. Plot the two objects.

  4. Observe global temperature from 1970 to 2005 using the window function and plot.

Multiple Time Series - Electricity, Beer, Chocolate Production

# Acquire the data
#   Electricity (millions of kWh), beer (Ml), and chocolate production (tonnes) 
# in Australia from January 1958 to December 1990 
# from the Australian Bureau of Statistics

cbe <- read.table("Lab1_data/cbe.dat", header = TRUE)
class(cbe)
[1] "data.frame"

Multiple Time Series - Electricity, Beer, Chocolate Production

Multiple Time Series - Electricity, Beer, Chocolate Production

# Create separate time series objects for each
choc.ts <- ts(cbe[, 1], start = 1958, freq = 12)
beer.ts <- ts(cbe[, 2], start = 1958, freq = 12)
elec.ts <- ts(cbe[, 3], start = 1958, freq = 12)
# or, since they have the same starting point and the same frequency (by our prior knowledge), we can write this more compactly:
cbe.ts <- ts(cbe, start = 1958, freq = 12) # more evidence that a time series is just a column of a matrix: here we get three series, one per column
cbe.ts
         choc  beer  elec
Jan 1958 1451  96.3  1497
Feb 1958 2037  84.4  1463
Mar 1958 2477  91.2  1648
Apr 1958 2785  81.9  1595
May 1958 2994  80.5  1777
Jun 1958 2681  70.4  1824
Jul 1958 3098  74.8  1994
Aug 1958 2708  75.9  1835
Sep 1958 2517  86.3  1787
Oct 1958 2445  98.7  1699
Nov 1958 2087 100.9  1633
Dec 1958 1801 113.8  1645
Jan 1959 1216  89.8  1597
Feb 1959 2173  84.4  1577
Mar 1959 2286  87.2  1709
Apr 1959 3121  85.6  1756
May 1959 3458  72.0  1936
Jun 1959 3511  69.2  2052
Jul 1959 3524  77.5  2105
Aug 1959 2767  78.1  2016
Sep 1959 2744  94.3  1914
Oct 1959 2603  97.7  1925
Nov 1959 2527 100.2  1824
Dec 1959 1846 116.4  1765
Jan 1960 1066  97.1  1721
Feb 1960 2327  93.0  1752
Mar 1960 3066  96.0  1914
Apr 1960 3048  80.5  1857
May 1960 3806  76.1  2159
Jun 1960 4042  69.9  2195
Jul 1960 3583  73.6  2287
Aug 1960 3438  92.6  2276
Sep 1960 2957  94.2  2096
Oct 1960 2885  93.5  2055
Nov 1960 2744 108.5  2004
Dec 1960 1837 109.4  1924
Jan 1961 1447 105.1  1851
Feb 1961 2504  92.5  1839
Mar 1961 3248  97.1  2019
Apr 1961 3098  81.4  1937
May 1961 4318  79.1  2270
Jun 1961 3561  72.1  2251
Jul 1961 3316  78.7  2382
Aug 1961 3379  87.1  2364
Sep 1961 2717  91.4  2129
Oct 1961 2354 109.9  2110
Nov 1961 2445 116.3  2072
Dec 1961 1542 113.0  1980
Jan 1962 1606 100.0  1995
Feb 1962 2590  84.8  1932
[console output truncated: monthly chocolate, beer, and electricity production figures continue through Dec 1990]

Multiple Time Series - Electricity, Beer, Chocolate Production

cbind(choc.ts, beer.ts, elec.ts) |> plot()

# or
plot(cbe.ts)

Panel Data - Democracy and Income

Let’s use democracy and income data again here.

data <- read.csv("Lab1_data/Lab1data.csv", header = TRUE)
colnames(data) <- c("country", "Year", "income", "polity2")
head(unique(data$country)) # observations on 174 countries
[1] "Antigua and Barbuda" "Afghanistan"         "Albania"            
[4] "Algeria"             "Andorra"             "Angola"             
head(tapply(data$country, data$Year, length)) # inspect how many countries in each year
2000 2001 2002 2003 2004 2005 
 174  174  174  174  174  174 
head(tapply(data$Year, data$country, length)) # inspect how many years each country has
        Afghanistan             Albania             Algeria             Andorra 
                 11                  11                  11                  11 
             Angola Antigua and Barbuda 
                 11                  11 
nrow(data)
[1] 1914

Panel Data - Democracy and Income

p <- ggplot(data = na.omit(data), aes(x = Year, y = income, 
                                      group = country, color = country))
p + geom_line(alpha = 0.5) + # make lines semi-transparent
  guides(color = "none") # turn off the legend because there are so many countries

Coding Exercise: Panel Data - Democracy and Income

  1. Subset the data frame to show only country name and GDP per capita

  2. Rearrange the columns of the data frame ascending by polity score

  3. Show only values of GDP per capita for South Africa from 2002 to 2008

  4. Create a new variable that takes the first letter of the country and attaches it to the year of observation

  5. Find the mean of GDP per capita for each year of observation
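One possible dplyr-based sketch of the five exercises. Since the CSV is not reproduced here, a toy two-country panel (with the same column names as the Lab 1 data: country, Year, income, polity2, but made-up values) stands in for it:

```r
library(dplyr)

# Toy panel standing in for Lab1data.csv (illustrative values only)
data <- data.frame(
  country = rep(c("Albania", "South Africa"), each = 3),
  Year    = rep(2002:2004, times = 2),
  income  = c(4500, 4700, 4900, 10200, 10400, 10600),
  polity2 = c(7, 7, 8, 9, 9, 9)
)

# 1. Only country name and GDP per capita
sub <- select(data, country, income)

# 2. Rows arranged ascending by polity score
sorted <- arrange(data, polity2)

# 3. GDP per capita for South Africa, 2002 to 2008
sa <- filter(data, country == "South Africa", Year >= 2002, Year <= 2008)$income

# 4. First letter of the country attached to the year, e.g. "A2002"
data$id <- paste0(substr(data$country, 1, 1), data$Year)

# 5. Mean GDP per capita for each year of observation
by_year <- data |>
  group_by(Year) |>
  summarize(mean_income = mean(income, na.rm = TRUE))
```

The full solution will be posted after the section; the verbs above (select, arrange, filter, mutate/paste0, group_by + summarize) are the ones you will need.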

More on TSCS Data Wrangling

  • use pivot_longer and pivot_wider from tidyverse to reshape data (non-pipe solution: reshape2)
    • long version of data (see the income and democracy data)
    • wide version of data (see next slide)
  • use tidylog or ViewPipeSteps to see exactly what each step changed in your data
    • whenever you transform your data, make sure the machine did what you intended.
  • pitfall of lag
    • there are several lag() functions in R. Unless you specify which package a call comes from, R will pick stats::lag() or dplyr::lag() by default.
    • Neither stats::lag() nor dplyr::lag() creates correct lags in panel data: they shift the whole column, ignoring country boundaries.
    • We need plm::lag(), or to force the data into a panel structure with panelr.
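A minimal illustration of the pitfall with a toy two-country panel (illustrative data; grouping by country before lagging is what plm::lag() effectively does for you on a pdata.frame):

```r
library(dplyr)

panel <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(2000:2002, times = 2),
  y       = 1:6
)

# Wrong: dplyr::lag() on the ungrouped data shifts the whole column,
# so country B's first observation "inherits" country A's last value
panel$bad_lag <- dplyr::lag(panel$y)   # B/2000 gets 3 instead of NA

# Right: lag within each country
panel <- panel |>
  group_by(country) |>
  mutate(good_lag = dplyr::lag(y)) |>
  ungroup()
```

Compare the two new columns: bad_lag crosses the A/B boundary, good_lag restarts with NA for each country.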

Wide Version of Panel Data

data %>%
  pivot_wider(id_cols = country, names_from = Year, values_from = income) %>%
  DT::datatable()

Tricks to Make You Code Faster

  • Avoid growing objects inside a loop; instead, pre-allocate them before the loop runs
    • Suppose you want a for loop to fill a vector called obj with 10^4 elements
    • Before the loop, write: obj <- NA 👎; obj <- rep(NA, 10^4) 👍
  • Vectorization
  • Parallelization (less relevant for this class)
  • Find a better package: some packages use C++ or support parallelization, so they can run faster
  • Find a better machine, e.g. cloud computing (less relevant for this class)
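A small sketch of the first two tricks. All three versions below compute the same vector; wrap each in system.time() on your own machine to compare speeds (timings are machine-dependent, so none are quoted here):

```r
n <- 10^4

# Growing the object inside the loop: R re-copies obj on every iteration 👎
obj <- NA
for (i in 1:n) obj[i] <- sqrt(i)

# Pre-allocating the full vector avoids the repeated copies 👍
obj2 <- rep(NA_real_, n)
for (i in 1:n) obj2[i] <- sqrt(i)

# Vectorization: let sqrt() operate on the whole vector at once (fastest)
obj3 <- sqrt(1:n)
```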

Tricks to Make Debugging Easier

  • Write well-commented code; name your variables meaningfully and consistently
  • Print some information while wrangling data or running loops
    • printing progress from a time-consuming loop can also reduce your anxiety
    • use progress to track the progress of complex loops (to track progress under parallelization: pbmcapply)
  • Reduce the number of iterations in a loop while debugging
  • Fix the random seed so runs are reproducible
  • Use debug() or debugonce() to isolate a function in a debugger environment
    • you cannot change the body of a function whose source code you do not have...
    • ...but you can still inspect how the values of its variables change within it!

A Simple Example for Debugging Functions

dat <- data.frame(
  row.names = c("calories", "weight", "yumminess"),
  blackberry = c(4L, 9L, 6L),
  blueberry = c(1L, 2L, 8L),
  peach = c(59L, 150L, 10L),
  plum = c(30L, 78L, 5L)
)

fruit_avg <- function(dat, pattern) {
  cols <- grep(pattern, names(dat))  # columns whose names match the pattern
  mini_dat <- dat[ , cols]           # NB: a single match drops to a vector!
  message("Found ", ncol(mini_dat), " fruits!")
  rowMeans(mini_dat)                 # fails when mini_dat is not a data frame
}
dat
fruit_avg(dat, pattern = "berry")    # two matching columns: works
fruit_avg(dat, pattern = "black")    # one matching column: error -- but why?
debugonce(fruit_avg)                 # flag the next call for the debugger
fruit_avg(dat, pattern = "black")    # now step through and inspect mini_dat

Footnotes

  1. See https://rstudio.github.io/reticulate/articles/calling_python.html.

  2. To see why coercing factors to numeric values can completely change the original values, check https://adv-r.hadley.nz/vectors-chap.html#factors.

  3. The good news is that if we use tibble, assigning a character value to a numeric column raises an error saying that the data types are incompatible. This is one reason why some people prefer tibble to the traditional data.frame.
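A quick illustration of footnotes 2 and 3 (base R only; the tibble comparison is left as a comment so the sketch runs without the tibble package installed):

```r
# Footnote 2: as.numeric() on a factor returns the internal level codes,
# not the labels you see printed
f <- factor(c("10", "20", "30"))
as.numeric(f)                # 1 2 3  -- the level codes, not 10 20 30
as.numeric(as.character(f))  # 10 20 30 -- convert via character first

# Footnote 3: a plain data.frame silently coerces on incompatible assignment
df <- data.frame(x = c(1, 2))
df[1, "x"] <- "oops"         # no error; column x is now character
# A tibble refuses instead:
# tb <- tibble::tibble(x = c(1, 2))
# tb[1, "x"] <- "oops"       # error: incompatible data types
```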