The Least Squares (LS) estimator can be derived as the Maximum Likelihood Estimator (MLE) under the assumption that the errors in the linear regression model are normally distributed and homoskedastic.
Consider the standard linear regression model:
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} + \varepsilon_i, \quad i = 1, \ldots, n, \]
In the classical linear model we assume that the error terms \(\varepsilon_i\) are normally distributed with constant variance: \(\varepsilon_i \sim \mathcal{N}(0,\sigma^2)\).
If the errors are normally distributed, then, conditional on the covariates, \(Y_i\) is normally distributed as well (the marginal distribution of \(Y\), however, need not be normal). Thus, we can express \(Y_i\) as following the probability distribution \(Y_i \sim \mathcal{N}(\mu_i, \sigma^2)\). The gist is that the likelihood theory of inference allows us to model any parameter of any distribution in order to better represent the DGP, and to examine which predictors explain variation in the distributional parameters.
If we further assume that each observation \(i\) is independent and that \(\sigma^2\) is constant, we can express the PDF as a function of the systematic component for the parameter of interest, in this case the mean \(\mu_i\).
\[ \begin{aligned} Y_i & \sim \mathcal{N}(\mu_i, \sigma^2) & [\text{stochastic component}] \\ \mu_i & = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} \quad \quad & [\text{systematic component}] \\ Y_i &\mid X_i \sim \mathcal{N} \left( \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi}, \, \sigma^2 \right) & [\text{substitute in the parameter}] \\ \end{aligned} \]
The equation form of the model (with an additive error term) and the distributional form above are equivalent; the systematic and stochastic components simply appear in different places. They are two different notations for the same model:
LS notation: \[ \begin{aligned} \varepsilon_i & \sim \mathcal{N}(0, \sigma^2) & [\text{stochastic}] \\ \mathbb{E}(Y_i \mid X_i) & = \mathbf{x}_i \boldsymbol{\beta} & [\text{systematic}] \\ Y_i & = \mathbf{x}_i \boldsymbol{\beta} + \varepsilon_i & [\text{stochastic + systematic}] \\ \end{aligned} \]
MLE notation: \[ \begin{aligned} Y_i & \sim \mathcal{N}(\mu_i, \sigma^2) & [\text{stochastic}] \\ \mu_i & = \mathbf{x}_i \boldsymbol{\beta} & [\text{systematic}] \\ Y_i & \sim \mathcal{N}(\mathbf{x}_i \boldsymbol{\beta}, \sigma^2) & [\text{stochastic + systematic}] \\ \end{aligned} \]
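As a quick sanity check of this equivalence, simulating from either notation with the same random seed yields identical data. This is a minimal sketch with hypothetical values for the intercept, slope, and error standard deviation (all names are my own):
## Sketch: the two notations describe the same DGP (hypothetical toy values)
n_demo <- 5
x_demo <- runif(n_demo)
b0_demo <- 1; b1_demo <- 2; sigma_demo <- 0.5
## LS notation: add a normal error to the systematic component
set.seed(42)
y_ls <- b0_demo + b1_demo * x_demo + rnorm(n_demo, mean = 0, sd = sigma_demo)
## MLE notation: draw Y directly from a normal centered at the systematic component
set.seed(42)
y_mle <- rnorm(n_demo, mean = b0_demo + b1_demo * x_demo, sd = sigma_demo)
all.equal(y_ls, y_mle) # should be TRUE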
Below we derive the MLE estimates for a normal homoskedastic model. The same steps apply to any probability distribution whose parameters we want to estimate. The steps are:
1. Define a PDF that represents the data-generating process (DGP) of interest.
2. Define the likelihood of the full sample (using the independence assumption).
3. Take the natural logarithm of the likelihood to simplify the objective function.
4. Drop the terms that do not depend on the parameters of interest.
The PDF for a single observation \(Y_i\) following a normal distribution with mean \(\mu_i\) and variance \(\sigma^2\) is:
\[ f(Y_i \mid \mu_i, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(Y_i - \mu_i)^2}{2\sigma^2} \right) \]
In the context of a linear model, the mean \(\mu_i\) is modeled as:
\[ \mu_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} = \mathbf{X}_i \boldsymbol{\beta} \]
For \(n\) independent observations, the likelihood function is the product of the individual PDFs:
\[ L(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(Y_i - \mathbf{X}_i \boldsymbol{\beta})^2}{2\sigma^2} \right) \]
Taking the natural logarithm of the likelihood simplifies the product to a sum:
\[ \log L(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) = \sum_{i=1}^{n} \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right) + \sum_{i=1}^{n} \log \left( \exp\left( -\frac{(Y_i - \mathbf{X}_i \boldsymbol{\beta})^2}{2\sigma^2} \right) \right) \]
This simplifies to:
\[ \log L(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( Y_i - \mathbf{X}_i \boldsymbol{\beta} \right)^2 \]
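As a quick numerical check of this simplification (a sketch with made-up values for \(y\), \(\mu\), and \(\sigma^2\); all object names are my own), summing dnorm(..., log = TRUE) reproduces the closed-form expression:
## Sketch: numeric check of the simplified log-likelihood (made-up values)
y_chk <- c(1.2, 0.7, 2.5)
mu_chk <- c(1.0, 1.0, 2.0) # stands in for the fitted means X_i %*% beta
sigma2_chk <- 0.8
n_chk <- length(y_chk)
ll_dnorm <- sum(dnorm(y_chk, mean = mu_chk, sd = sqrt(sigma2_chk), log = TRUE))
ll_formula <- -(n_chk / 2) * log(2 * pi * sigma2_chk) -
  sum((y_chk - mu_chk)^2) / (2 * sigma2_chk)
all.equal(ll_dnorm, ll_formula) # should be TRUE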
To find the maximum likelihood estimates (MLE) of \(\boldsymbol{\beta}\), we focus on maximizing the log-likelihood with respect to \(\boldsymbol{\beta}\). The first term \(-\frac{n}{2} \log(2\pi\sigma^2)\) does not depend on \(\boldsymbol{\beta}\), so we can drop it.
The remaining term enters the log-likelihood with a negative sign, so maximizing the log-likelihood is equivalent to minimizing:
\[ \arg\min_{\boldsymbol{\beta}} \, \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( Y_i - \mathbf{X}_i \boldsymbol{\beta} \right)^2 \]
Since \(\sigma^2\) is constant, the factor \(\frac{1}{2\sigma^2}\) does not depend on \(\boldsymbol{\beta}\), so we can simplify further to:
\[ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \, \sum_{i=1}^{n} \left( Y_i - \mathbf{X}_i \boldsymbol{\beta} \right)^2 \]
This is exactly the Least Squares Objective Function. The task is to find \(\boldsymbol{\beta}\) that minimizes the sum of squared residuals (SSR):
\[ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \, \sum_{i=1}^{n} \varepsilon_i^2 \]
Where \(\varepsilon_i = Y_i - \mathbf{X}_i \boldsymbol{\beta}\) are the residuals.
Thus, we have shown that minimizing the sum of squared residuals (SSR) is equivalent to maximizing the log-likelihood function under the assumption of normally distributed errors with constant variance. This means that the OLS estimator is also the MLE for the linear model with normally distributed dependent variable and homoskedastic errors.
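As a quick check of this equivalence with more than one parameter (a sketch on simulated toy data, separate from the example below; the data and function names are my own), minimizing the SSR with optim() reproduces the lm() estimates:
## Sketch: minimizing the SSR with optim() reproduces lm() (simulated toy data)
set.seed(99)
n_toy <- 200
x_toy <- rnorm(n_toy)
y_toy <- 1 + 2 * x_toy + rnorm(n_toy, sd = 1.5)
ssr_fn <- function(beta, y, x) sum((y - beta[1] - beta[2] * x)^2)
optim(par = c(0, 0), fn = ssr_fn, method = "BFGS", y = y_toy, x = x_toy)$par
coef(lm(y_toy ~ x_toy)) # essentially identical estimates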
A profile likelihood is a method for assessing the likelihood of a specific parameter in a model by fixing that parameter at different values and maximizing the likelihood with respect to all other parameters.
Each value in the profile likelihood represents the best possible fit of the model when the parameter of interest is fixed, while the other parameters are optimized. The peak of the profile likelihood corresponds to the maximum likelihood estimate (MLE) for that parameter, and it is commonly used for constructing confidence intervals and evaluating uncertainty around the parameter of interest.
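In the toy example below, \(\sigma^2\) is simply fixed at its known simulated value rather than profiled out. For the normal mean, the nuisance parameter can also be profiled out analytically, since for any fixed \(\mu\) the maximizing variance is \(\hat{\sigma}^2(\mu) = \frac{1}{n}\sum_i (y_i - \mu)^2\). A minimal sketch of such a profile log-likelihood (the function name is my own):
## Sketch: profile log-likelihood for mu, maximizing over sigma2 at each fixed mu
profile_loglik_mu <- function(mu, y) {
  sapply(mu, function(m) {
    s2_hat <- mean((y - m)^2) # MLE of sigma2 given this fixed mu
    sum(dnorm(y, mean = m, sd = sqrt(s2_hat), log = TRUE))
  })
}
## usage (once y and mu_range are defined below):
## profile_loglik_mu(mu = mu_range, y = y)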
Below I show a simple simulation where we can see the equivalence of the OLS and MLE estimators in a toy example that estimates only one parameter: the intercept.
# Let's simulate a normal distribution
## sample size
n <- 100
## set simulation parameters
mu <- 67 # mean
sigma2 <- 10 # variance
## generate random normal distribution
y <- rnorm(n, mu, sd = sqrt(sigma2))
## mean and variance of y
mean(y) ; var(y)
## [1] 66.88789
## [1] 9.816897
## histogram
hist(y)
## beta_hat from intercept only linear model
lm(y ~ 1)$coef
## (Intercept)
## 66.88789
## define the residual sum of squares function with respect to mu
SSR <- function(mu, y) {
# compute the sum of squared residuals for a single value of mu
SSR <- sum((y - mu) ^ 2)
return(SSR)
}
## Question: what is the value of mu that minimizes the SSR?
SSR(mu = 120, y = y)
## [1] 283061.5
## Use optim to find the minima
optim_mu <- optim(par=0,
fn=SSR,
method = "BFGS",
y=y)
optim_mu$value # minimum SSR
## [1] 971.8728
optim_mu$par # optim point (parameter)
## [1] 66.88789
# Let's visualize the profile likelihoods and the SSR
## First, provide a range of mu to plot over.
# Note: the range must fit the distribution boundaries, if any.
mu_range <- seq(from=55,
to=75,
length.out= 1000)
## function that will iterate SSR for each value of mu
SSR <- function(mu, y) {
# create empty vector
SSR.vector <- numeric(length(mu))
for (i in 1:length(SSR.vector)){
# estimate SSR
SSR.vector[i] <- sum((y - mu[i]) ^ 2)
}
return(SSR.vector)
}
# for each value of mu, it returns the SSR given y
tibble(
SSR = SSR(mu=mu_range,y=y),
mu = mu_range
)
## # A tibble: 1,000 × 2
## SSR mu
## <dbl> <dbl>
## 1 15104. 55
## 2 15057. 55.0
## 3 15009. 55.0
## 4 14962. 55.1
## 5 14914. 55.1
## 6 14867. 55.1
## 7 14820. 55.1
## 8 14773. 55.1
## 9 14726. 55.2
## 10 14679. 55.2
## # ℹ 990 more rows
head(SSR(mu=mu_range,
y=y)) # for each value of mu, it returns the SSR given y
## [1] 15104.07 15056.51 15009.03 14961.63 14914.32 14867.08
## Visualize
tibble(
SSR = SSR(mu=mu_range,y=y),
mu = mu_range
) %>%
ggplot(aes(y=SSR,
x=mu)) +
geom_line() +
geom_vline(xintercept = optim_mu$par,
color="red",
linetype="dashed")
## Now let's define the profile likelihood of the same distribution
## returns a vector of likelihood values over a given range of mu values
like_norm <- function(mu, sigma2, y) {
# create empty vector
likelihood.vector <- vector(mode = "numeric", length = length(mu))
# for each mu, calculate the likelihood value
for (i in 1:length(likelihood.vector)) {
likelihood.vector[i] <- prod(dnorm(x = y,
mean = mu[i],
sd = sqrt(sigma2)))
}
# return vector of likelihood values spanning range of mu
return(likelihood.vector)
}
#like_norm(mu=mu_range, sigma2 = sigma2, y = y)
tibble(
like_norm =like_norm(mu=mu_range, sigma2 = sigma2, y = y),
mu = mu_range
) %>%
ggplot(aes(y=like_norm,
x=mu)) +
geom_line() +
geom_vline(xintercept = optim_mu$par,
color="red",
linetype="dashed")
## returns a vector of log-likelihood values over a given range of mu values
loglike_norm <- function(mu, sigma2, y) {
# create empty vector
likelihood.vector <- vector(mode = "numeric", length = length(mu))
# for each mu, calculate the likelihood value
for (i in 1:length(likelihood.vector)) {
likelihood.vector[i] <- sum(dnorm(x = y,
mean = mu[i],
sd = sqrt(sigma2),
log = T))
}
# return vector of likelihood values spanning range of mu
return(likelihood.vector)
}
#loglike_norm(mu=mu_range, sigma2 = sigma2, y = y)
## create dataset for visualization
df_vis <-
tibble(SSR = SSR(mu=mu_range,y=y),
like_norm =like_norm(mu=mu_range, sigma2 = sigma2, y = y),
loglike_norm =loglike_norm(mu=mu_range, sigma2 = sigma2, y = y),
mu = mu_range)
df_vis %>%
pivot_longer(cols = c("SSR":"loglike_norm"),
names_to = "profile",
values_to = "value") %>%
ggplot(aes(y=value,
x=mu)) +
geom_line() +
geom_vline(xintercept = optim_mu$par,
linetype = "dashed",
color="red") +
facet_wrap(~profile, scales = "free") +
theme_minimal()
We have seen that under the assumptions of homoskedasticity and normality, the Maximum Likelihood Estimator (MLE) of a homoskedastic normal model should theoretically provide the same estimates as Ordinary Least Squares (OLS). However, in practice, many of the Gauss-Markov assumptions are often violated. A common violation is heteroskedasticity, which, as we saw in the previous lab, leads to inefficient estimates.
With MLE, we can improve upon OLS. One advantage of the likelihood-based approach to inference is that it allows researchers to model any parameter of a distribution. In the case of the normal distribution, we can model the variance \(\sigma^2\), which enables us to test hypotheses or account for heteroskedasticity in the data-generating process (DGP), leading to more accurate estimates. This is done through the heteroskedastic normal model, which is the MLE estimator that we will estimate in this section.
The full R code can be found here, on Chris’ website.
In this exercise we are going to:
4.1 Generate Data
4.2 Fit OLS - lm()
4.3 Fit MLE - optim()
4.4 Calculate quantities of interest
Think back to linear regression, where you model \(\text{outcome}_i\) as \(\beta_0 + x_{1, i} \beta_1 + x_{2, i} \beta_2 + \cdots + x_{k, i} \beta_k\), for person \(i\) and a total of \(k\) covariates.
An efficient way to write this is to store all covariates in a matrix \(\mathbf{X}\) (think of your Excel spreadsheet or .csv file) and all coefficients in a vector \(\boldsymbol{\beta}\).
Let’s say we have one constant and two covariates, i.e. three \(\beta\): \[ \mathbf{X} = \left[\begin{array} {rrrr} x_{0} & x_{1, 1} & x_{2, 1} \\ x_{0} & x_{1, 2} & x_{2, 2} \\ x_{0} & x_{1, 3} & x_{2, 3} \\ \vdots & \vdots & \vdots \\ x_{0} & x_{1, i} & x_{2, i} \\ \end{array}\right] \quad \boldsymbol{\beta} = \left[\begin{array} {r} \beta_0 \\ \beta_1 \\ \beta_2 \\ \end{array}\right] \]
Matrix multiplication generates another matrix of systematic components easily:
\[ \mathbf{X \boldsymbol{\beta}} = \left[\begin{array}{c} x_{0} \cdot \beta_0 + x_{1, 1} \cdot \beta_1 + x_{2, 1} \cdot \beta_2 \\ x_{0} \cdot \beta_0 + x_{1, 2} \cdot \beta_1 + x_{2, 2} \cdot \beta_2 \\ x_{0} \cdot \beta_0 + x_{1, 3} \cdot \beta_1 + x_{2, 3} \cdot \beta_2 \\ \vdots \\ x_{0} \cdot \beta_0 + x_{1, i} \cdot \beta_1 + x_{2, i} \cdot \beta_2 \end{array}\right] \]
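In R this product is computed with the %*% operator; a tiny sketch with made-up numbers (the object names are my own):
## Sketch: systematic components via matrix multiplication (made-up numbers)
X_demo <- cbind(1, c(0.2, 0.5, 0.9), c(1.0, 0.3, 0.7)) # constant + two covariates
beta_demo <- c(0, 5, 15)
X_demo %*% beta_demo # one systematic component (mu_i) per row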
Steps:
1. Set the number of observations to 1500 (\(n\)).
2. Generate two equivalent covariate matrices (\(\mathbf{x}\) and \(\mathbf{z}\)), each with one constant and two covariates; the two covariates are drawn from a uniform distribution with min = 0 and max = 1.
3. Set a parameter vector (\(\boldsymbol{\beta}\)) for the mean, \(\mu_i\): 0, 5, 15.
4. Set a parameter vector (\(\boldsymbol{\gamma}\)) for the variance, \(\sigma_i^2\), with heteroskedasticity: 1, 0, 3.
5. Create the systematic component for the mean (\(\boldsymbol{x}_{i}\boldsymbol{\beta}\)).
6. Create the systematic component for the variance (\(\exp(\boldsymbol{z}_{i}\boldsymbol{\gamma})\)).
7. Generate the outcome variable (\(y_{i}\)).
8. Save the data to a data frame.
9. Plot the data.
rm(list=ls()) # Clear memory
set.seed(123456) # For reproducible random numbers
library(MASS) # Load packages
library(simcf)
library(tidyverse)
n <- 1500 # Generate 1500 observations
w0 <- rep(1, n) # Create the constant
w1 <- runif(n) # Create two covariates
w2 <- runif(n)
x <- cbind(w0, w1, w2) # Create a matrix of the covariates
z <- x # i.e., same covariates affect mu and sigma
beta <- c(0, 5, 15) # Set a parameter vector for the mean
# One for constant, one for covariate 1,
# one for covariate 2.
gamma <- c(1, 0, 3) # Set a parameter vector for the variance
# Gamma estimate for covariate 2 is set to be 3,
# which creates heteroskedasticity
mu <- x %*% beta # Create systematic component for the mean
sigma2 <- exp(z %*% gamma) # Create systematic component for the variance
# Since ith row of sigma2 = exp(1 + 0 + w2_i * 3)
# i.e., it is a function of w2 (heteroskedastic)
y <- rnorm(n = n, # Create the outcome variable
mean = mu, # Think about the stochastic component!
sd = sqrt(sigma2)
)
data <- cbind(y, w1, w2) # Save the data to a data frame
data <- as.data.frame(data)
names(data) <- c("y", "w1", "w2")
Note: the clearest way is to put the parameters in the correct matrix format, but %*% is smart enough to make a vector conformable when multiplied with a matrix.
4.2 Fit OLS - lm()
Assume, one day, the true values of the parameters become unknown to mankind, and we as scientists try to recover the “truth” by gathering and studying the data
Motivation: If our statistical model is valid, it should be able to recover the true parameter values from the data we synthesize
Fit a model using least squares (function lm()) and regress the outcome variable on the two covariates.
Calculate and print the AIC (more on this in lecture; I’ll show you this time).
ls.result <- lm(y ~ w1 + w2, data = data) # Fit a linear model
# using simulated data
summary(ls.result)
##
## Call:
## lm(formula = y ~ w1 + w2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5710 -2.3157 -0.0219 2.2124 21.9345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04867 0.28221 -0.172 0.863
## w1 4.78421 0.37699 12.690 <2e-16 ***
## w2 15.68283 0.37112 42.258 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.17 on 1497 degrees of freedom
## Multiple R-squared: 0.5678, Adjusted R-squared: 0.5672
## F-statistic: 983.3 on 2 and 1497 DF, p-value: < 2.2e-16
ls.aic <- AIC(ls.result) # Calculate and print the AIC
# i.e. Akaike Information Criterion
# to assess goodness of fit; lower AIC is better
print(ls.aic)
## [1] 8545.903
4.3 Fit MLE - optim()
Opening the black box: build our own regression machine using maximum likelihood
Lecture recap: how to derive maximum likelihood estimators in four easy steps:
1. Express the joint probability of the data, using the chosen probability distribution.
2. Convert the joint probability to the likelihood (trivial, as they are proportional).
3. Simplify the likelihood for easy maximization (take logs and reduce to “sufficient statistics”).
4. Substitute in the systematic component.
Then we find the set of parameters that maximize the likelihood using optim().
Recap: our heteroskedastic normal model
Stochastic component: \[y_i \sim f_\mathcal{N}(\mu_{i}, \sigma_{i}^{2})\]
Systematic components: \[\mu_i = \boldsymbol{x}_{i} \boldsymbol{\beta}\] \[\sigma_{i}^{2} = \exp(\boldsymbol{z}_{i} \boldsymbol{\gamma})\]
MLE for heteroskedastic normal data
\[ \begin{aligned} \mathrm{P}\left(\mathbf{y} | \boldsymbol{\mu}, \boldsymbol{\sigma}^{2}\right) &=\prod_{i=1}^{n} f_{\mathcal{N}}\left(y_{i} | \mu_{i}, \sigma_{i}^{2}\right) & \tiny[\text{Joint probability}] \\ \mathrm{P}\left(\mathbf{y} | \boldsymbol{\mu}, \boldsymbol{\sigma}^{2}\right) &=\prod_{i=1}^{n}\left(2 \pi \sigma_{i}^{2}\right)^{-1 / 2} \exp \left[\frac{-\left(y_{i}-\mu_{i}\right)^{2}}{2 \sigma_{i}^{2}}\right] & \tiny[\text{Expressed with the Normal PDF}]\\ \log \mathcal{L}\left(\boldsymbol{\mu}, \boldsymbol{\sigma}^{2} | \mathbf{y}\right) & \propto-\frac{1}{2} \sum_{i=1}^{n} \log \sigma_{i}^{2}-\frac{1}{2} \sum_{i=1}^{n} \frac{\left(y_{i}-\mu_{i} \right)^{2}}{\sigma_{i}^{2}} & \tiny[\text{Converted to log likelihood; simplify}] \\ \log \mathcal{L}(\boldsymbol{\beta}, \boldsymbol{\gamma} | \mathbf{y}) & \propto-\frac{1}{2} \sum_{i=1}^{n} \mathbf{z}_{i} \boldsymbol{\gamma}-\frac{1}{2} \sum_{i=1}^{n} \frac{\left(y_{i}-\mathbf{x}_{i} \boldsymbol{\beta}\right)^{2}}{\exp \left(\mathbf{z}_{i} \boldsymbol{\gamma}\right)} & \tiny[\text{Substitute systematic components; } \log \exp(\mathbf{z}_i\boldsymbol{\gamma}) = \mathbf{z}_i\boldsymbol{\gamma}] \end{aligned} \]
# A likelihood function for ML heteroskedastic Normal
llk.hetnormlin <- function(param, y, x, z) {
x <- as.matrix(x) # x (some covariates) as a matrix
z <- as.matrix(z) # z (some covariates) as a matrix
os <- rep(1, nrow(x)) # Set the intercept as 1 (constant)
x <- cbind(os, x) # Add intercept to covariates x
z <- cbind(os, z) # Add intercept to covariates z
b <- param[1 : ncol(x)] # Parameters for x
g <- param[(ncol(x) + 1) : (ncol(x) + ncol(z))] # Parameters for z
xb <- x %*% b # Systematic components for mean
s2 <- exp(z %*% g) # Systematic components for variance
sum(0.5 * (log(s2) + (y - xb)^2 / s2)) # -1 * log likelihood (up to a constant)
# optim is a minimizer by default:
# to maximize lnL we minimize -lnL,
# so the +/- signs are reversed
}
sum(0.5 * (log(s2) + (y - xb)^2 / s2))
Think back:
\[ \begin{aligned} \log \mathcal{L}(\boldsymbol{\beta}, \boldsymbol{\gamma} | \mathbf{y}) &\propto-\frac{1}{2} \sum_{i=1}^{n} \mathbf{z}_{i} \boldsymbol{\gamma}-\frac{1}{2} \sum_{i=1}^{n} \frac{\left(y_{i}-\mathbf{x}_{i} \boldsymbol{\beta}\right)^{2}}{\exp \left(\mathbf{z}_{i} \boldsymbol{\gamma}\right)}\\ &\propto -1 \times \sum_{i=1}^{n} 0.5 \times \left( \mathbf{z}_{i} \boldsymbol{\gamma} + \frac{\left(y_{i}-\mathbf{x}_{i} \boldsymbol{\beta}\right)^{2}}{\exp \left(\mathbf{z}_{i} \boldsymbol{\gamma}\right)} \right) \end{aligned} \]
Important: optim() is a minimizer by default, so you need to reverse the +/- signs to find the maximum point.
# Create input matrices
xcovariates <- cbind(w1, w2)
zcovariates <- cbind(w1, w2)
# Initial guesses of beta0, beta1, ..., gamma0, gamma1, ...
# We need one entry per parameter, in order!
# Note: also include beta and gamma estimates for the constants
stval <- c(0, 0, 0, 0, 0, 0)
# Run ML, and get the output we need
hetnorm.result <- optim(stval, # Initial guesses
llk.hetnormlin, # Likelihood function
method = "BFGS", # Gradient method
hessian = TRUE, # Return Hessian matrix
y = y, # Outcome variable
x = xcovariates, # Covariates x (w/o constant)
z = zcovariates # Covariates z (w/o constant)
)
Note: by default, optim() performs a minimizing procedure. You can make optim() a maximizer by adding control = list(fnscale = -1).
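For instance, a hypothetical maximizer version of the call above could wrap the likelihood function so it returns the (positive) log likelihood and set fnscale = -1. This sketch is an alternative, not the code used for the results below:
## Sketch: the same fit as a maximization problem (hypothetical alternative)
llk.hetnormlin.pos <- function(param, y, x, z) {
  -llk.hetnormlin(param, y, x, z) # flip the sign: now a log likelihood (up to a constant)
}
hetnorm.max <- optim(stval, llk.hetnormlin.pos,
                     method = "BFGS",
                     control = list(fnscale = -1), # tell optim to maximize
                     y = y, x = xcovariates, z = zcovariates)
round(hetnorm.max$par, 3) # point estimates should match hetnorm.result$par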
pe <- hetnorm.result$par # Point estimates
round(pe, 3)
## [1] -0.167 5.007 15.698 0.910 0.224 2.994
vc <- solve(hetnorm.result$hessian) # Var-cov matrix (for computing s.e.)
round(vc, 5)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.02909 -0.03265 -0.02810 0.00059 -0.00087 -0.00032
## [2,] -0.03265 0.06787 -0.00025 -0.00091 0.00099 0.00084
## [3,] -0.02810 -0.00025 0.10331 -0.00056 0.00143 -0.00033
## [4,] 0.00059 -0.00091 -0.00056 0.00920 -0.00832 -0.00749
## [5,] -0.00087 0.00099 0.00143 -0.00832 0.01693 -0.00042
## [6,] -0.00032 0.00084 -0.00033 -0.00749 -0.00042 0.01568
se <- sqrt(diag(vc)) # To compute standard errors (s.e.),
# take the diagonal of the var-cov matrix
# (the inverse Hessian), then take the square root
round(se, 3)
## [1] 0.171 0.261 0.321 0.096 0.130 0.125
mle.result <- cbind(pe[1:3], se[1:3]) # Report ML results
colnames(mle.result) <- c("Estimate", "Std.Error")
rownames(mle.result) <- c("(Intercept)", "w1", "w2")
round(mle.result, 3)
## Estimate Std.Error
## (Intercept) -0.167 0.171
## w1 5.007 0.261
## w2 15.698 0.321
round(coef(summary(ls.result))[, c(1, 2)], 3) # Compare with lm() results
## Estimate Std. Error
## (Intercept) -0.049 0.282
## w1 4.784 0.377
## w2 15.683 0.371
ll <- -hetnorm.result$value # Report likelihood at maximum
# No need to have negative sign
# if optim is set as maximizer
ll
## [1] -2620.334
# AIC is 2 * number of parameters - 2 * ll (i.e. likelihood at maximum)
hetnorm.aic <- 2 * length(stval) - 2 * ll
# Compare AIC from ML and from lm(); lower is better
print(hetnorm.aic)
## [1] 5252.668
print(ls.aic)
## [1] 8545.903
Let’s practice a little bit with Lab3_practice.Rmd.
How much do the OLS and heteroskedastic-normal MLE estimators differ in the presence of heteroskedasticity? We can explore this question using Monte Carlo simulations.
Below, I provide a Monte Carlo experiment to compare these two estimators under heteroskedasticity, based on 500 repeated samples from the same data-generating process (DGP). Note that this is the same experiment we covered in Lab2.rmd, where we compared OLS under ideal conditions (BLUE) and in the presence of heteroskedasticity.
### Monte Carlo Experiment OLS vs. MLE:---------------------------
set.seed(342342)
# sample size
n <- 600
# systematic component
b0 <- 0.2 # True value for the intercept
b1 <- 0.5 # True value for the slope
X <- rnorm(n, 0, 2) # independent variable
# stochastic component
e <- rnorm(n, 0, 1) # error term (homoskedastic; not used in the heteroskedastic DGP below)
# Heteroskedasticity parameter
gamma <- 0.35
# The DGP: Heteroskedasticity
Y <- b0 + b1*X + rnorm(n, 0, exp(X*gamma))
# Simulate repeated samples
sims <- 500
## Simulation for heteroskedasticity
par.est.1 <-
tibble(
b0_hat = rep(NA,sims),
b1_hat = rep(NA,sims),
model = "OLS"
)
par.est.2 <-
tibble(
b0_hat = rep(NA,sims),
b1_hat = rep(NA,sims),
model = "MLE"
)
## Likelihood function
llk.hetnormlin <- function(param, y, x, z) {
x <- as.matrix(x) # x (some covariates) as a matrix
z <- as.matrix(z) # z (some covariates) as a matrix
os <- rep(1, nrow(x)) # Set the intercept as 1 (constant)
x <- cbind(os, x) # Add intercept to covariates x
z <- cbind(os, z) # Add intercept to covariates z
b <- param[1 : ncol(x)] # Parameters for x
g <- param[(ncol(x) + 1) : (ncol(x) + ncol(z))] # Parameters for z
xb <- x %*% b # Systematic components for mean
s2 <- exp(z %*% g) # Systematic components for variance
sum(0.5 * (log(s2) + (y - xb)^2 / s2)) # -1 * log likelihood (optim will minimize this)
}
## starting values for searching algorithm
stval <- c(0, 0, 0, 0)
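The repeated-sampling loop and the final plot are not shown above. Below is a minimal sketch of what they might look like given the objects already defined; the loop structure and the ggplot call are my own reconstruction, not necessarily the lab's exact code:
## Sketch: repeated sampling loop (reconstruction, not the lab's exact code)
for (i in 1:sims) {
  Y.sim <- b0 + b1 * X + rnorm(n, 0, exp(X * gamma)) # new draw from the same DGP
  ## OLS estimates
  ols <- lm(Y.sim ~ X)
  par.est.1$b0_hat[i] <- coef(ols)[1]
  par.est.1$b1_hat[i] <- coef(ols)[2]
  ## MLE estimates (heteroskedastic normal)
  mle <- optim(stval, llk.hetnormlin, method = "BFGS",
               y = Y.sim, x = as.matrix(X), z = as.matrix(X))
  par.est.2$b0_hat[i] <- mle$par[1]
  par.est.2$b1_hat[i] <- mle$par[2]
}
## Visualize the sampling distributions of the two estimators
bind_rows(par.est.1, par.est.2) %>%
  pivot_longer(cols = c(b0_hat, b1_hat),
               names_to = "parameter", values_to = "estimate") %>%
  ggplot(aes(x = estimate, fill = model)) +
  geom_density(alpha = 0.5) +
  geom_vline(data = tibble(parameter = c("b0_hat", "b1_hat"),
                           truth = c(b0, b1)),
             aes(xintercept = truth), linetype = "dashed") +
  facet_wrap(~parameter, scales = "free") +
  theme_minimal()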
The visualization clearly shows that the MLE estimator is much more efficient than the OLS estimator. Another way to interpret the results, and to motivate the use of MLE, is that the MLE estimator is consistently closer to the true parameter values than the OLS estimator. This difference is especially pronounced in the estimation of \(\hat{\beta_1}\).