---
title: "Lab 3: Practice Code"
subtitle: "Ordinary Least Squares and Maximum Likelihood Estimation"
author: "Your name"
date: \today
output: pdf_document
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# Good practice to remove all objects from the workspace
rm(list = ls())
# Use library() for packages you need, or source() for other R files.
library(tidyverse)
```
# Optimization algorithms
The purpose of this short coding exercise is to start practicing with the `optim()` function. This function allows us to estimate the parameters of any Probability Density Function (PDF) using Maximum Likelihood Estimation (MLE). Recall that `optim()` has several key arguments:
- **par**:
The initial guess or starting values for the parameters to be optimized. You must provide as many starting values as there are parameters to estimate.
- **fn**:
The objective function to be optimized. For OLS, we can provide the Sum of Squared Residuals (SSR). In contrast, for MLE, we should provide a function that characterizes the log-likelihood of the data-generating process (DGP) of interest.
- **IMPORTANT**: `optim()` is a minimizer by default. Therefore, when estimating likelihoods, multiply the log-likelihood inside your function by $-1$: minimizing the negative log-likelihood is equivalent to maximizing the likelihood. (Alternatively, set `control = list(fnscale = -1)` to make `optim()` maximize.)
- **method**:
Specifies the optimization algorithm to be used (e.g., BFGS, Nelder-Mead, etc.).
- **hessian**: whether the Hessian matrix should be returned. Set `hessian = TRUE` to compute standard errors for the ML estimates; it is not needed for OLS.
- **Other arguments**:
Additional arguments for the objective function can be passed to `optim()`. For example, you might pass the predictor variables ($X_1$, $X_2$) and response ($Y$) as additional arguments.
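Before the exercises, here is a minimal warm-up sketch (toy data, not the lab's model) showing how `par`, `fn`, `method`, `hessian`, and the extra data arguments fit together, including the sign flip for MLE:

```r
# Toy data: the sample mean minimizes the sum of squared deviations
set.seed(42)
y <- rnorm(50, mean = 3, sd = 1)

# fn: the objective to minimize (SSR for a single location parameter mu)
ssr_mu <- function(par, y) sum((y - par[1])^2)

# par: one starting value, because there is one parameter to estimate
fit <- optim(par = 0, fn = ssr_mu, y = y,
             method = "BFGS", hessian = TRUE)
fit$par # should be very close to mean(y)

# For MLE, return the NEGATIVE log-likelihood, so that the default
# minimizer effectively maximizes the likelihood
negll_mu <- function(par, y) -sum(dnorm(y, mean = par[1], sd = 1, log = TRUE))
fit_ml <- optim(par = 0, fn = negll_mu, y = y, method = "BFGS")
fit_ml$par # also very close to mean(y)
```

Note how the data `y` is passed through `optim()`'s `...` argument straight to the objective function.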
# Data
## Simulation example
Below, we simulate a linear model with three parameters, $\beta_0$, $\beta_1$, and $\beta_2$, which we will later estimate by minimizing the sum of squared residuals (SSR) with `optim()`. The model is:
$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon
$$
where:
- $\varepsilon \sim \mathcal{N}(0,1)$, and therefore $Y$ is normally distributed.
- $X_1 \sim \mathcal{N}(10,2)$ and $X_2 \sim \mathcal{N}(5,1)$.
- The true parameter values are $\beta_0 = 5$, $\beta_1 = 2$, and $\beta_2 = -1.5$.
```{r, fig.align='center', fig.width=5, fig.height=3.5}
# Set simulation parameters
set.seed(123) # for reproducibility
n <- 100 # Sample size
beta_0 <- 5 # True intercept
beta_1 <- 2 # True coefficient for X1
beta_2 <- -1.5 # True coefficient for X2
# Simulate random variables
## Note: rnorm() takes a standard deviation (sd), not a variance, hence the sqrt()
## Predictors
X_1 <- rnorm(n, mean = 10, sd = sqrt(2))
X_2 <- rnorm(n, mean = 5, sd = sqrt(1))
## Residuals
epsilon <- rnorm(n, mean = 0, sd = sqrt(1))
# DGP of the outcome variable
Y <- beta_0 + beta_1 * X_1 + beta_2 * X_2 + epsilon
# Visualize the distribution of Y
hist(Y, main = "Histogram of Y", xlab = "Y")
```
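As a quick sanity check on the simulation, the `lm()` estimates should land near the true values $(5, 2, -1.5)$; the exact numbers depend on the seed. (This block re-creates the simulated data so it also runs standalone.)

```r
# Re-create the simulated data from the chunk above, then check that
# lm() recovers coefficients near the true values (5, 2, -1.5)
set.seed(123)
n <- 100
X_1 <- rnorm(n, mean = 10, sd = sqrt(2))
X_2 <- rnorm(n, mean = 5, sd = sqrt(1))
epsilon <- rnorm(n, mean = 0, sd = 1)
Y <- 5 + 2 * X_1 - 1.5 * X_2 + epsilon
coef(lm(Y ~ X_1 + X_2)) # estimates near 5, 2, and -1.5
```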
## Salaries dataset
**Description:** This dataset contains information on the 2008-09 nine-month academic salaries for Assistant Professors, Associate Professors, and Professors at a U.S. college. The data were collected as part of the college’s ongoing efforts to monitor salary differences between male and female faculty members.
Below, I am cleaning the data for you.
```{r}
# Load the data
data("Salaries", package = "carData")
# Prepare data
df <-
  Salaries |>
  # optional: turn it into a tibble
  as_tibble() |>
  # re-scale salary to thousands of dollars
  mutate(salary = salary / 1000) |>
  # select the variables for analysis
  select(salary, sex, yrs.since.phd)
```
# Ordinary Least Squares
In this section, we will again use the built-in *Salaries* dataset from the `car` / `carData` package, which we used in the previous lab code practice. Before that, however, we will fit a toy example with simulated data. The goal is to obtain least-squares estimates using `optim()` instead of `lm()`.
## 1. Ordinary Least Squares estimation using `optim()`
Create a function called `SSR` that:
1- Maps a parameter argument `params` to each parameter to be estimated ($\beta_0$, $\beta_1$, and $\beta_2$).
2- Computes the linear projection of $Y$, that is, $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2$. Do not forget to include an intercept.
3- Returns the Sum of Squared Residuals (SSR). This returned value is the one `optim()` will minimize until it finds the global minimum.
Once you have created `SSR()`, use `optim()` to estimate $\hat{\beta_0}$, $\hat{\beta_1}$, and $\hat{\beta_2}$. Double-check your results with `lm()`.
```{r}
# Define the residual sum of squares (SSR) function
SSR <- function(...) {
  # 1- Map the parameter arguments
  beta_0 <- ... # Intercept
  beta_1 <- ... # Coefficient for X1
  beta_2 <- ... # Coefficient for X2
  # 2- Linear projection model of Y (aka model to predict Y)
  Y_hat <- ...
  # 3- Objective function to estimate SSR
  SSR <- ...
  return(SSR)
}
# Initial parameter guesses for beta_0, beta_1, and beta_2
init_params <- ...
# Use optim to minimize SSR and find the best estimates for the parameters
optim_ols <- optim(...)
# Compare with lm()
lm_ols <- lm(...)
```
## 2. Use `SSR()` and `optim()` to estimate the model $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$ on the *Salaries* dataset. Compare the `optim()` results with `lm()`.
where:
- **salary** ($Y$): Outcome variable representing the nine-month salary (rescaled above to thousands of dollars).
- **sex** ($X_1$): Binary indicator for the respondent’s reported sex (Female or Male).
- **yrs.since.phd** ($X_2$): Count variable indicating the number of years since the respondent received their PhD.
- Notice that the model to be fit is exactly the same as Model 3 from `lab2_practice.Rmd`. Do not forget to input the variable *sex* as **numeric**.
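One way to do that numeric conversion (shown here on toy values, since you may prefer the opposite coding) is a logical comparison coerced to numeric:

```r
# Sketch of the conversion: a factor becomes a 0/1 numeric indicator
# (here 1 = Female; flip the comparison if you prefer 1 = Male)
sex <- factor(c("Male", "Female", "Female", "Male"))
as.numeric(sex == "Female") # 0 1 1 0
```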
```{r}
# Set vectors/matrix of covariates
# remember that the intercept adds an extra parameter!
# Set initial parameters
# Use optim to minimize SSR
# Compare with lm()
```
# Maximum Likelihood Estimation
From `lab2_practice.Rmd`, we know that model 3 (the one estimated in the previous section) violates the assumption of homoskedasticity. Therefore, in this section, we will re-estimate the model using a heteroskedastic normal model via Maximum Likelihood Estimation (MLE).
## 3. Estimating a Heteroskedastic Normal Model Using MLE
To estimate the heteroskedastic normal model, you will first need to create a function that connects the parameters to the objective function for estimation.
- You can modify the function we covered during the lab session or create your own. This time, ensure that the covariates/predictors are set into matrices for each parameter of the normal distribution that we want to model ($\beta$ and $\gamma$).
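For reference, assuming the usual exponential parameterization of the variance (an assumption here; adapt it if your lab session used a different link), the heteroskedastic normal log-likelihood, up to an additive constant, is:
$$
\ln L(\beta, \gamma \mid y) = -\frac{1}{2} \sum_{i=1}^{n} \left[ \ln \sigma_i^2 + \frac{(y_i - x_i \beta)^2}{\sigma_i^2} \right], \qquad \sigma_i^2 = \exp(z_i \gamma)
$$
Your function should return this quantity multiplied by $-1$ so that `optim()` can minimize it.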
```{r}
# Set vectors/matrix of covariates
## vector of outcomes
Y <- ...
## matrix of covariates for mean and variance parameters
xcov <- ...
zcov <- ...
llk.hetnorm <- function(...) {
  # 1- Map the parameter arguments; do not forget to include an intercept!
  # 2- Define the vectors of parameters (b and g)
  # 3- Define the systematic components
  # Systematic component for the mean
  # Systematic component for the variance
  # 4- Log-likelihood we want to maximize (return its negative for optim)
}
# Initial guesses: how many parameters are we going to estimate?
stval <- c(...)
# Run ML, and get the output we need
optim_mle <- optim(...)
# Extract point estimates and standard errors
# Compare the MLE and lm() results
```