---
title: "lab1_key"
author: "Inhwan Ko"
date: "Oct 1, 2021"
output:
  html_document:
    df_print: paged
  pdf_document: default
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Prerequisites
```{r, message=FALSE}
rm(list = ls()) # Clear memory
library(tidyverse) # Load package
```
# Vector Practice
1. `vector1` : The numbers one through five and then the number six five times
2. `vector2` : 10 randomly drawn numbers from a normal distribution with a mean of 10 and an s.d. of 1
3. `vector3` : Results of 10 single binomial trials with a probability of 0.4
4. `vector4` : Sample 100 observations from a 5-trial binomial distribution with a probability of success of 0.4
5. `vector5` : The numbers one through three and the word apple
```{r}
vector1 <- c(1:5, rep(x = 6, times = 5))
vector2 <- rnorm(n=10, mean=10, sd=1)
vector3 <- rbinom(n=10, size=1, prob=0.4)
vector4 <- rbinom(n=100, size=5, prob=0.4)
vector5 <- c(1:3, "apple")
```
6. What type of data is `vector2`?
7. Round `vector2` to two decimal places
8. What happened in `vector5`?
```{r}
class(vector2)
round(vector2, digits = 2)
vector5
```
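What happened in `vector5` is implicit coercion: atomic vectors hold a single type, so combining numbers with a string converts every element to character. A minimal sketch:

```{r}
# Atomic vectors store one type; mixing numeric and character
# values coerces every element to character (the most flexible type here).
v <- c(1:3, "apple")
class(v)  # "character"
v[1]      # "1" -- the number 1 is now the string "1"
```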
# Matrices Practice
1. `matrix1`: Create a 5-by-5 matrix containing all NAs
2. Assign matrix1 the row names (a,b,c,d,e) and the column names (1,2,3,4,5)
3. Replace the NAs in the first column of `matrix1` with `Inf`
```{r}
matrix1 <- matrix(data=NA, nrow=5, ncol=5)
rownames(matrix1) <- c("a","b","c","d","e")
colnames(matrix1) <- c(1,2,3,4,5)
matrix1[,1] <- rep(x = Inf, times = 5) # or just matrix1[,1] <- Inf
```
# List Practice
1. Create a list `list1` that contains `vector1`, `vector2`, `vector3`, and `matrix1`
2. Name each list component as `vector1`, `vector2`, `vector3`, and `matrix1` respectively
3. Locate `vector2` from the list
```{r}
list1 <- list(vector1, vector2, vector3, matrix1)
names(list1) <- c("vector1", "vector2", "vector3", "matrix1")
list1$vector2
```
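When locating elements, note the difference between single- and double-bracket extraction on lists; a small sketch with toy data:

```{r}
# `$` and `[[` extract the element itself; single `[` returns a sub-list.
lst <- list(a = 1:3, b = "apple")
class(lst$a)       # "integer"
class(lst[["a"]])  # "integer"
class(lst["a"])    # "list"
```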
# Data Frames Practice 1
## Working directory
Check that your working directory is correct (i.e., where you have saved `lab1_data.csv`).
```{r}
basedir <- getwd()
rawdata.folder <- paste(basedir, "Specify if you create a folder", sep = "/")
```
## 1. Load `lab1_data.csv` into R
```{r}
# Load data
data <- read.csv("lab1_data.csv", header = TRUE, stringsAsFactors = FALSE)
# ?read.csv
```
## 2. What is the data structure? What does that tell us about type?
```{r}
# Check structure
dim(data)
class(data)
is.data.frame(data)
is.matrix(data)
# Alternatively
str(data)
```
## 3. Check the names and summary statistics of the data. Fix any names that are less than good.
```{r}
# Check and fix names
names(data)
names(data)[3] <- "gdp.per.cap"
names(data) # Check again
# Summary Statistics
summary(data)
```
## 4. Remove observations with missing values
```{r}
# Remove NAs
dataClean <- na.omit(data) # listwise deletion!!
dim(data)
dim(dataClean)
```
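`complete.cases()` is a base-R alternative that performs the same listwise deletion but also lets you restrict the check to particular columns; a sketch with hypothetical data:

```{r}
# complete.cases() returns TRUE for rows with no missing values.
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
df[complete.cases(df), ]    # same rows as na.omit(df)
df[complete.cases(df$x), ]  # drop rows only where x is missing
```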
## 5. Calculate the average GDP per capita for Brazil for the observed period. Repeat the calculation for all countries.
```{r}
# Base R
mean(dataClean[dataClean$country == "Brazil", "gdp.per.cap"])
# Tidy way
dataClean %>%
  filter(country == "Brazil") %>%
  summarize(mean(gdp.per.cap))
# Average gdp.per.cap for all countries
dataClean %>%
  group_by(country) %>%
  summarize(mean(gdp.per.cap))
```
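In base R, `tapply()` gives the same per-group means without dplyr; a sketch with toy data (the values below are illustrative, not from the lab data):

```{r}
# tapply() applies a function to a vector within groups,
# the base-R analogue of group_by() + summarize().
toy <- data.frame(country = c("A", "A", "B"),
                  gdp.per.cap = c(10, 30, 50))
tapply(toy$gdp.per.cap, toy$country, mean)  # A = 20, B = 50
```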
## 6. Plot GDP per capita (on the x-axis) and polity2 (on the y-axis).
```{r}
# Base Graphics
plot(x = dataClean$gdp.per.cap,
     y = dataClean$polity2)
# Try logging GDP
plot(x = log(dataClean$gdp.per.cap),
     y = dataClean$polity2,
     xlab = "Logged GDP per capita",
     ylab = "Polity2")
# ggplot2
ggplot(dataClean, aes(y = polity2, x = log(gdp.per.cap))) +
  geom_point() +
  labs(x = "Logged GDP per capita", y = "Polity2") +
  theme_classic()
```
## 7. Create a new variable called "democracy". Assign 0 to observations with a zero or negative polity2 score, and assign 1 to those with a positive score.
```{r, results='hide'}
# Create a variable called "democracy"
dataClean$democracy <- NA
head(dataClean)
# You can subset data based on a logical statement
dataClean$polity2 <= 0
dataClean[dataClean$polity2 <= 0, ]
# Take advantage of this: Assign values to "democracy" based on polity2 values
dataClean$democracy[dataClean$polity2 <= 0] <- 0
# Do the same for positive Polity2 score
dataClean$democracy[dataClean$polity2 > 0] <- 1
# Tidy way
dataClean %>%
  mutate(democracy = case_when(polity2 <= 0 ~ 0,
                               TRUE ~ 1))
```
## 8. Use a loop to do the same coding.
```{r}
dataClean$democracy <- NA # reset this variable
n <- nrow(dataClean) # how many iterations do you need?
for (i in 1:n) {
  if (dataClean$polity2[i] <= 0) dataClean$democracy[i] <- 0
  else dataClean$democracy[i] <- 1
}
## or try this
for (i in 1:n) {
  dataClean$democracy[i] <- ifelse(dataClean$polity2[i] <= 0, 0, 1)
}
```
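The loop is instructive, but `ifelse()` is vectorized, so a single call over the whole column is simpler and faster; a sketch with a toy vector:

```{r}
# ifelse() evaluates the condition element-wise over the whole vector,
# so no explicit loop is needed.
polity2 <- c(-5, 0, 3, 8)
democracy <- ifelse(polity2 <= 0, 0, 1)
democracy  # 0 0 1 1
```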
# Data Frames Practice 2
## 1. Read in the data "lab1_survey.csv"
```{r}
# Clear and load data
rm(list = ls())
survey_data <- read.csv(file = "lab1_survey.csv")
```
## 2. Inspect the data. What format are they in? What values do the data take, and how do those values correspond with the survey?
```{r}
str(survey_data)
```
## 3. Generate some summary statistics.
```{r}
summary(survey_data)
mean(survey_data$R)
mean(survey_data$latex)
median(survey_data$R)
median(survey_data$latex)
sd(survey_data$R)
sd(survey_data$latex)
# Tidy way
survey_data %>%
  summarize_all(list(mean = mean, median = median, sd = sd, min = min, max = max))
# %>% gather(key = "stat")
```
## 4. How are these two variables related to each other (assuming equal intervals between categories)?
```{r}
cor1 <- cor(survey_data$R, survey_data$latex)
```
The correlation between R knowledge and LaTeX knowledge is `r cor1`, or more nicely, `r round(cor1, 2)`.
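`cor()` alone gives a point estimate; `cor.test()` adds a confidence interval and p-value. A sketch on simulated data (not the survey data):

```{r}
# cor.test() runs a Pearson correlation test with a CI and p-value.
set.seed(42)
x <- rnorm(30)
y <- x + rnorm(30)
ct <- cor.test(x, y)
ct$estimate
ct$p.value
```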
## 5. Are there any problems with the way the data are coded? (Think about yesterday's lecture.)
## 6. Recode the data
```{r}
survey_data %>%
  mutate(# Recode R into categories
         R_cat = case_when(R == 0 ~ "What's that?",
                           R == 1 ~ "I've heard of it",
                           R == 2 ~ "I can use it or apply it",
                           TRUE ~ "I understand it well"),
         # Recode latex into categories
         latex_cat = case_when(latex == 0 ~ "What's that?",
                               latex == 1 ~ "I've heard of it",
                               latex == 2 ~ "I can use it or apply it",
                               TRUE ~ "I understand it well"))
# We're repeating ourselves... There must be a faster way
survey_data <- survey_data %>%
  mutate_at(vars(R, latex),
            function(x) case_when(x == 0 ~ "What's that?",
                                  x == 1 ~ "I've heard of it",
                                  x == 2 ~ "I can use it or apply it",
                                  TRUE ~ "I understand it well"))
```
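A base-R alternative to `case_when()` here is a named lookup vector, which states the mapping once and recodes by indexing; a sketch (not part of the original key):

```{r}
# A named vector maps codes to labels; indexing with the codes
# (as character) recodes the whole vector in one step.
labels <- c("0" = "What's that?",
            "1" = "I've heard of it",
            "2" = "I can use it or apply it",
            "3" = "I understand it well")
labels[as.character(c(0, 2, 3, 1))]
```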
## 7. Why is this coding method better?
## 8. Generate some plots of the data: bar charts are good here, scatterplots even better.
```{r, echo = FALSE}
# Bar charts
ggplot(survey_data, aes(x = R)) +
  geom_bar() +
  labs(x = "R knowledge")
ggplot(survey_data, aes(x = latex)) +
  geom_bar() +
  labs(x = "LaTeX knowledge")
# Scatter plot
ggplot(survey_data, aes(x = R, y = latex)) +
  geom_jitter(alpha = .7, height = .2, width = .2) +
  labs(x = "R knowledge", y = "LaTeX knowledge") +
  theme_classic()
##### Something is wrong? #####
# Convert the two variables into factors with ordered levels
knowledge_levels <- c("What's that?",
                      "I've heard of it",
                      "I can use it or apply it",
                      "I understand it well")
survey_data <- survey_data %>%
  mutate(R = factor(R, levels = knowledge_levels),
         latex = factor(latex, levels = knowledge_levels))
# Redo the scatter plot
ggplot(survey_data, aes(x = R, y = latex)) +
  geom_jitter(alpha = .7, height = .2, width = .2) +
  labs(x = "R knowledge", y = "LaTeX knowledge") +
  scale_x_discrete(limits = knowledge_levels) +
  theme_classic()
```
# LaTeX in R Markdown
$$
1 + 1 = 2
$$
$$
11 \times 11 = 121
$$
$$
E = mc^2
$$
I think it's Einstein who proposed $E = mc^2$.
$$
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
$$
$$
\begin{split}
X & = (x+a)(x-b) \\
& = x(x-b) + a(x-b) \\
& = x^2 + x(a-b) - ab
\end{split}
$$