Least Squares is Maximum Likelihood

The Least Squares (LS) estimator can be derived as the Maximum Likelihood Estimator (MLE) under the assumption that the errors in the linear regression model are normally distributed and homoskedastic.

Consider the standard linear regression model:

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} + \varepsilon_i, \quad i = 1, \ldots, n. \]

In OLS, we assume that the error terms \(\varepsilon_i\) are normally distributed with constant variance \(\varepsilon_i \sim \mathcal{N}(0,\sigma^2)\).

If the errors are normally distributed, then \(Y_i\), conditional on the covariates, is normally distributed as well. Thus, we can express \(Y_i\) as following a probability distribution \(Y_i \sim \mathcal{N}(\mu_i, \sigma^2)\). The gist is that the likelihood theory of inference allows us to model any parameter of any distribution in order to better represent the DGP, and to examine which predictors explain variation in the distributional parameters.

If we further assume that each observation \(i\) is independent and that \(\sigma^2\) is constant, we can express the PDF as a function of the systematic component for the parameter of interest, in this case the mean \(\mu_i\).

\[ \begin{aligned} Y_i & \sim \mathcal{N}(\mu_i, \sigma^2) & [\text{stochastic component}] \\ \mu_i & = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} \quad \quad & [\text{systematic component}] \\ Y_i &\mid X_i \sim \mathcal{N} \left( \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi}, \, \sigma^2 \right) & [\text{substitute in the parameter}] \\ \end{aligned} \]

The LS and MLE notations below are equivalent; the systematic and stochastic components simply appear in different places. They are two different notations for the same model.

LS notation: \[ \begin{aligned} \varepsilon_i & \sim \mathcal{N}(0, \sigma^2) & [\text{stochastic}] \\ Y_i & = \mathbf{x}_i \boldsymbol{\beta} & [\text{systematic}] \\ Y_i & = \mathbf{x}_i \boldsymbol{\beta} + \varepsilon_i & [\text{stochastic + systematic}] \\ \end{aligned} \]

MLE notation: \[ \begin{aligned} Y_i & \sim \mathcal{N}(\mu_i, \sigma^2) & [\text{stochastic}] \\ \mu_i & = \mathbf{x}_i \boldsymbol{\beta} & [\text{systematic}] \\ Y_i & \sim \mathcal{N}( \mathbf{x}_i \boldsymbol{\beta}, \sigma^2) & [\text{stochastic + systematic}] \\ \end{aligned} \]
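A quick numerical check of this equivalence (a minimal sketch; the sample size and parameter values are arbitrary): adding a mean-zero normal error to \(\mathbf{x}_i \boldsymbol{\beta}\) produces the same distribution as drawing \(Y_i\) directly from a normal centered at \(\mathbf{x}_i \boldsymbol{\beta}\).

## Two notations, one DGP
n_chk  <- 10000
x_chk  <- runif(n_chk)
b_chk  <- c(2, 3)                 # arbitrary intercept and slope
s_chk  <- 1.5                     # arbitrary error sd
xb_chk <- b_chk[1] + b_chk[2] * x_chk

y_ls  <- xb_chk + rnorm(n_chk, mean = 0, sd = s_chk)  # LS notation
y_mle <- rnorm(n_chk, mean = xb_chk, sd = s_chk)      # MLE notation

## Same mean and spread, up to simulation noise
c(mean(y_ls), mean(y_mle))
c(sd(y_ls), sd(y_mle))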

Maximum Likelihood Estimation

Below we derive the MLE estimates for a normal-homoskedastic model. The same steps apply to any probability distribution whose parameters we want to model and estimate. The steps are:

1. Define a PDF that represents the data-generating process (DGP) of interest.
2. Define the likelihood (using the independence assumption).
3. Take the natural logarithm of the likelihood function to simplify the objective function.
4. Drop the terms that do not depend on the parameters of interest, and optimize what remains.

1. Define the PDF of the Normal Distribution:

The PDF for a single observation \(Y_i\) following a normal distribution with mean \(\mu_i\) and variance \(\sigma^2\) is:

\[ f(Y_i \mid \mu_i, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(Y_i - \mu_i)^2}{2\sigma^2} \right) \]

In the context of a linear model, the mean \(\mu_i\) is modeled as:

\[ \mu_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} = \mathbf{X}_i \boldsymbol{\beta} \]
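In R, dnorm() evaluates exactly this density; a one-line check with arbitrary values:

## dnorm() is the normal PDF written above
y1 <- 1.2; m1 <- 1; s2 <- 4       # arbitrary values for the check
dnorm(y1, mean = m1, sd = sqrt(s2))
1 / sqrt(2 * pi * s2) * exp(-(y1 - m1)^2 / (2 * s2))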

2. Likelihood for \(n\) Observations:

For \(n\) independent observations, the likelihood function is the product of the individual PDFs:

\[ L(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(Y_i - \mathbf{X}_i \boldsymbol{\beta})^2}{2\sigma^2} \right) \]

3. Take the Log of the Likelihood (Log-Likelihood):

Taking the natural logarithm of the likelihood simplifies the product to a sum:

\[ \log L(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) = \sum_{i=1}^{n} \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right) + \sum_{i=1}^{n} \log \left( \exp\left( -\frac{(Y_i - \mathbf{X}_i \boldsymbol{\beta})^2}{2\sigma^2} \right) \right) \]

This simplifies to:

\[ \log L(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( Y_i - \mathbf{X}_i \boldsymbol{\beta} \right)^2 \]
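We can verify this simplification numerically (a small sketch with made-up values; sum(dnorm(..., log = TRUE)) computes the left-hand side directly):

## Check: sum of log normal densities = closed-form log-likelihood
y_chk  <- c(1.3, 0.2, 2.5, -0.7)   # made-up data
mu_chk <- c(1.0, 0.5, 2.0,  0.0)   # made-up means (think X_i beta)
s2_chk <- 2                        # constant variance
n_chk  <- length(y_chk)

sum(dnorm(y_chk, mean = mu_chk, sd = sqrt(s2_chk), log = TRUE))
-n_chk / 2 * log(2 * pi * s2_chk) - sum((y_chk - mu_chk)^2) / (2 * s2_chk)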

4. Maximizing the Log-Likelihood:

To find the maximum likelihood estimates (MLE) of \(\boldsymbol{\beta}\), we focus on maximizing the log-likelihood with respect to \(\boldsymbol{\beta}\). The first term \(-\frac{n}{2} \log(2\pi\sigma^2)\) does not depend on \(\boldsymbol{\beta}\), so we can drop it.

Thus, we are left with minimizing the following term:

\[ \arg\min_{\boldsymbol{\beta}} \, \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( Y_i - \mathbf{X}_i \boldsymbol{\beta} \right)^2 \]

Since \(\sigma^2\) is constant, and hence \(\frac{1}{2\sigma^2}\) does not depend on \(\boldsymbol{\beta}\) we can further simplify this to:

\[ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \, \sum_{i=1}^{n} \left( Y_i - \mathbf{X}_i \boldsymbol{\beta} \right)^2 \]

5. Least Squares Objective Function:

This is exactly the Least Squares Objective Function. The task is to find \(\boldsymbol{\beta}\) that minimizes the sum of squared residuals (SSR):

\[ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \, \sum_{i=1}^{n} \varepsilon_i^2 \]

Where \(\varepsilon_i = Y_i - \mathbf{X}_i \boldsymbol{\beta}\) are the residuals.

Thus, we have shown that minimizing the sum of squared residuals (SSR) is equivalent to maximizing the log-likelihood function under the assumption of normally distributed errors with constant variance. This means that the OLS estimator is also the MLE for the linear model with normally distributed dependent variable and homoskedastic errors.
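Before the intercept-only example in the next section, here is a minimal sketch of this equivalence with a slope as well (simulated data; all object names are illustrative): maximizing the normal log-likelihood over \((\beta_0, \beta_1)\) with optim() recovers the same estimates as lm(). The variance is fixed at an arbitrary value because it does not affect where the maximum in \(\boldsymbol{\beta}\) lies.

## Sketch: MLE of (beta0, beta1) under a homoskedastic normal model vs. OLS
x_sim <- runif(200)
y_sim <- 1 + 2 * x_sim + rnorm(200, sd = 0.5)

negll <- function(b, y, x) {
  # negative log-likelihood in (beta0, beta1); sigma fixed at 1
  -sum(dnorm(y, mean = b[1] + b[2] * x, sd = 1, log = TRUE))
}

optim(c(0, 0), negll, method = "BFGS", y = y_sim, x = x_sim)$par
coef(lm(y_sim ~ x_sim))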

Visualizing profile likelihoods

A profile likelihood is a method for assessing the likelihood of a specific parameter in a model by fixing that parameter at different values and maximizing the likelihood with respect to all other parameters.

Each value in the profile likelihood represents the best possible fit of the model when the parameter of interest is fixed, while the other parameters are optimized. The peak of the profile likelihood corresponds to the maximum likelihood estimate (MLE) for that parameter, and it is commonly used for constructing confidence intervals and evaluating uncertainty around the parameter of interest.

Below I show a simple simulation where we can see the equivalence of the OLS and MLE estimators in a toy example with only one parameter to estimate: the intercept. In the likelihood calculations below, \(\sigma^2\) is fixed at its true value for simplicity.

# Let's simulate a normal distribution
library(tidyverse)  # for tibble(), ggplot2, and pivot_longer() used below

## sample
n <- 100

## set simulation parameters
mu <- 67       # mean
sigma2 <- 10   # variance

## generate random normal distribution
y <- rnorm(n, mu, sd = sqrt(sigma2))


## mean and variance of y
mean(y) ; var(y)
## [1] 66.88789
## [1] 9.816897
## histogram
hist(y)

## beta_hat from intercept only linear model
lm(y ~ 1)$coef
## (Intercept) 
##    66.88789
## define the residual sum of squares function with respect to mu

SSR <- function(mu, y) {
  # sum of squared residuals for a single candidate value of mu
  SSR <- sum((y - mu) ^ 2)
  
  return(SSR)
}

## Question: what is the value of mu that minimizes the SSR?
SSR(mu = 120, y = y)
## [1] 283061.5
## Use optim to find the minimum
optim_mu <- optim(par=0,
                  fn=SSR,
                  method = "BFGS",
                  y=y)

optim_mu$value  # minimum SSR
## [1] 971.8728
optim_mu$par    # optim point (parameter)
## [1] 66.88789
# Let's visualize the profile likelihoods and the SSR




## First, provide a range of mu to plot over. 
# Note: the range must fit the distribution boundaries, if any.

mu_range <- seq(from=55, 
                to=75,
                length.out= 1000)

## function that will iterate SSR for each value of mu



SSR <- function(mu, y) {
  # create empty vector
  SSR.vector <- numeric(length(mu))
  
  for (i in 1:length(SSR.vector)){
    # estimate SSR
    SSR.vector[i] <- sum((y - mu[i]) ^ 2)
  }
  
  return(SSR.vector)
}


# for each value of mu, it returns the SSR given y

tibble(
  SSR = SSR(mu=mu_range,y=y),
  mu = mu_range
)
## # A tibble: 1,000 × 2
##       SSR    mu
##     <dbl> <dbl>
##  1 15104.  55  
##  2 15057.  55.0
##  3 15009.  55.0
##  4 14962.  55.1
##  5 14914.  55.1
##  6 14867.  55.1
##  7 14820.  55.1
##  8 14773.  55.1
##  9 14726.  55.2
## 10 14679.  55.2
## # ℹ 990 more rows
head(SSR(mu=mu_range,
    y=y)) # for each value of mu, it returns the SSR given y
## [1] 15104.07 15056.51 15009.03 14961.63 14914.32 14867.08
## Visualize

tibble(
  SSR = SSR(mu=mu_range,y=y),
  mu = mu_range
) %>% 
  ggplot(aes(y=SSR,
             x=mu)) +
  geom_line() +
  geom_vline(xintercept = optim_mu$par,
             color="red",
             linetype="dashed")

## Now let's define the profile likelihood of the same distribution

## returns a vector of likelihood values over a given range of mu values

like_norm <- function(mu, sigma2, y) {
  # create empty vector
  likelihood.vector <- vector(mode = "numeric", length = length(mu))
  
  # for each mu, calculate the likelihood value
  for (i in 1:length(likelihood.vector)) {
    likelihood.vector[i] <- prod(dnorm(x = y, 
                                       mean = mu[i], 
                                       sd = sqrt(sigma2)))
  }
  
  # return vector of likelihood values spanning range of mu
  return(likelihood.vector)
}

#like_norm(mu=mu_range, sigma2 = sigma2, y = y)


tibble(
  like_norm =like_norm(mu=mu_range, sigma2 = sigma2, y = y),
  mu = mu_range
) %>% 
  ggplot(aes(y=like_norm,
             x=mu)) +
  geom_line() +
  geom_vline(xintercept = optim_mu$par,
             color="red",
             linetype="dashed")

## returns a vector of log-likelihood values over a given range of mu values

loglike_norm <- function(mu, sigma2, y) {
  # create empty vector
  likelihood.vector <- vector(mode = "numeric", length = length(mu))
  
  # for each mu, calculate the log-likelihood value
  for (i in 1:length(likelihood.vector)) {
    likelihood.vector[i] <- sum(dnorm(x = y, 
                                      mean = mu[i], 
                                      sd = sqrt(sigma2), 
                                      log = TRUE))
  }
  
  # return vector of log-likelihood values spanning the range of mu
  return(likelihood.vector)
}

#loglike_norm(mu=mu_range, sigma2 = sigma2, y = y)



## create dataset for visualization


df_vis <-
  tibble(SSR = SSR(mu=mu_range,y=y),
         like_norm =like_norm(mu=mu_range, sigma2 = sigma2, y = y),
         loglike_norm =loglike_norm(mu=mu_range, sigma2 = sigma2, y = y),
         mu = mu_range)

df_vis %>% 
  pivot_longer(cols = c("SSR":"loglike_norm"),
               names_to = "profile",
               values_to = "value") %>% 
  ggplot(aes(y=value,
             x=mu)) +
  geom_line() +
  geom_vline(xintercept = optim_mu$par,
             linetype = "dashed",
             color="red") +
  facet_wrap(~profile, scales = "free") +
  theme_minimal()

Heteroskedastic Normal

We have seen that under the assumptions of homoskedasticity and normality, the Maximum Likelihood Estimator (MLE) of a homoskedastic normal model should theoretically provide the same estimates as Ordinary Least Squares (OLS). However, in practice, many of the Gauss-Markov assumptions are often violated. A common violation is heteroskedasticity, which, as we saw in the previous lab, leads to inefficient estimates.

With MLE, we can improve upon OLS. One advantage of the likelihood-based approach to inference is that it allows researchers to model any parameter of a distribution. In the case of the normal distribution, we can model the variance \(\sigma^2\), which enables us to test hypotheses or account for heteroskedasticity in the data-generating process (DGP), leading to more accurate estimates. This is done through the heteroskedastic normal model, which is the MLE estimator that we will estimate in this section.

In this exercise we are going to generate heteroskedastic normal data, fit the model with OLS using lm(), and then fit the heteroskedastic normal model by maximum likelihood using optim().

Generating heteroskedastic normal data

  • Think back to linear regression, where you model \(\text{outcome}_i\) as \(\beta_0 + x_{1, i} \beta_1 + x_{2, i} \beta_2 + \cdots + x_{k, i} \beta_k\), for person \(i\) and a total of \(k\) covariates

  • An efficient way is to store all covariates in a matrix \(\mathbf{X}\) (think of your Excel spreadsheet or .csv) and all coefficients in a vector \(\boldsymbol{\beta}\).

  • Let’s say we have one constant and two covariates, i.e. three \(\beta\): \[ \mathbf{X} = \left[\begin{array} {rrr} x_{0} & x_{1, 1} & x_{2, 1} \\ x_{0} & x_{1, 2} & x_{2, 2} \\ x_{0} & x_{1, 3} & x_{2, 3} \\ \vdots & \vdots & \vdots \\ x_{0} & x_{1, i} & x_{2, i} \\ \end{array}\right] \quad \boldsymbol{\beta} = \left[\begin{array} {r} \beta_0 \\ \beta_1 \\ \beta_2 \\ \end{array}\right] \]

  • Matrix multiplication generates another matrix of systematic components easily:

\[ \mathbf{X \boldsymbol{\beta}} = \left[\begin{array}{c} x_{0} \cdot \beta_0 + x_{1, 1} \cdot \beta_1 + x_{2, 1} \cdot \beta_2 \\ x_{0} \cdot \beta_0 + x_{1, 2} \cdot \beta_1 + x_{2, 2} \cdot \beta_2 \\ x_{0} \cdot \beta_0 + x_{1, 3} \cdot \beta_1 + x_{2, 3} \cdot \beta_2 \\ \vdots \\ x_{0} \cdot \beta_0 + x_{1, i} \cdot \beta_1 + x_{2, i} \cdot \beta_2 \end{array}\right] \]

  • Fast rule: a \(\bf{4} \times \it{3}\) matrix multiplied by a \(\it{3} \times \bf{1}\) one \(\rightarrow\) a \(\bf{4} \times \bf{1}\) matrix (see the short check below)
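Here is a short R check of this rule (toy numbers, purely illustrative):

## A 4 x 3 matrix times a length-3 coefficient vector gives a 4 x 1 matrix
X_toy <- cbind(1, c(0.1, 0.2, 0.3, 0.4), c(2, 4, 6, 8))  # constant + two covariates
b_toy <- c(0, 5, 15)                                     # beta0, beta1, beta2
X_toy %*% b_toy        # row i is beta0 + x1_i * beta1 + x2_i * beta2
dim(X_toy %*% b_toy)   # 4 1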

Steps

  1. Set the number of observations to 1500 (\(n\))

  2. Generate two equivalent datasets (matrices \(\mathbf{x}\) and \(\mathbf{z}\)), both with one constant and two covariates

    • The constant takes the value of 1; two covariates are drawn from a uniform distribution with min = 0, max = 1
  3. Set a parameter vector (\(\boldsymbol{\beta}\)) for the mean, \(\mu_i\)

    • By “divine revelation”, we know the true values are 0, 5, 15
  4. Set a parameter vector (\(\boldsymbol{\gamma}\)) for the variance, \(\sigma_i\), with heteroskedasticity

    • Again, we know the true values are 1, 0, 3
    • Pause and think: why is there heteroskedasticity?
  5. Create the systematic component for the mean (\(\boldsymbol{x}_{i}\boldsymbol{\beta}\))

  6. Create the systematic component for the variance (\(\text{exp}(\boldsymbol{z}_{i}\boldsymbol{\gamma})\)),

    • Assume the same covariates affect both \(\mu_i\) and \(\sigma_i\), that is, \(\mathbf{x}\) and \(\mathbf{z}\) are the same
  7. Generate the outcome variable (\(y_{i}\))

  8. Save the data to a data frame

  9. Plot the data

rm(list=ls())       # Clear memory
set.seed(123456)    # For reproducible random numbers
library(MASS)       # Load packages
library(simcf) 
library(tidyverse)

n <- 1500           # Generate 1500 observations

w0 <- rep(1, n)     # Create the constant
w1 <- runif(n)      # Create two covariates
w2 <- runif(n)

x <- cbind(w0, w1, w2)       # Create a matrix of the covariates
z <- x                       # i.e., same covariates affect mu and sigma

beta <- c(0, 5, 15) # Set a parameter vector for the mean
                    # One for constant, one for covariate 1, 
                    # one for covariate 2.

gamma <- c(1, 0, 3) # Set a parameter vector for the variance 
                    # Gamma estimate for covariate 2 is set to be 3,
                    # which creates heteroskedasticity 
mu <- x %*% beta             # Create systematic component for the mean

sigma2 <- exp(z %*% gamma)   # Create systematic component for the variance
                             # Since ith row of sigma2 = exp(1 + 0 + w2_i * 3)
                             # i.e., it is a function of w2 (heteroskedastic)

y <- rnorm(n = n,            # Create the outcome variable
           mean = mu,        # Think about the stochastic component!
           sd = sqrt(sigma2)
           ) 

data <- cbind(y, w1, w2)     # Save the data to a data frame
data <- as.data.frame(data)
names(data) <- c("y", "w1", "w2")

The cleanest way is to put the parameters in the correct matrix format, but %*% is smart enough to treat a vector as conformable when it is multiplied with a matrix.
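For instance, using the x and beta created above, multiplying by the plain numeric vector or by an explicit 3 x 1 matrix gives the same result:

head(x %*% beta)                    # beta coerced to a conformable 3 x 1 matrix
head(x %*% matrix(beta, ncol = 1))  # explicit column matrix; identical result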

Fitting a model using OLS - lm()

Assume that, one day, the true values of the parameters become unknown to mankind, and we as scientists try to recover the “truth” by gathering and studying the data.

Motivation: If our statistical model is valid, it should be able to recover the true parameter values from the data we synthesize

  1. Fit a model using least squares (function lm()) and regress the outcome variable on the two covariates

  2. Calculate and print the AIC (More in lecture; I’ll show you this time)

ls.result <- lm(y ~ w1 + w2, data = data) # Fit a linear model
                                          # using simulated data
summary(ls.result)
## 
## Call:
## lm(formula = y ~ w1 + w2, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.5710  -2.3157  -0.0219   2.2124  21.9345 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.04867    0.28221  -0.172    0.863    
## w1           4.78421    0.37699  12.690   <2e-16 ***
## w2          15.68283    0.37112  42.258   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.17 on 1497 degrees of freedom
## Multiple R-squared:  0.5678, Adjusted R-squared:  0.5672 
## F-statistic: 983.3 on 2 and 1497 DF,  p-value: < 2.2e-16
ls.aic <- AIC(ls.result) # Calculate and print the AIC 
                         # i.e. Akaike Information Criterion
                         # to assess goodness of fit; lower AIC is better
print(ls.aic)
## [1] 8545.903

Taking Stock

  • What do the above exercises tell us about statistical models and inference?
    • What do we mean when we say “fitting a model to the data”?
  • Data generating process (DGP): the “true, underlying” process that generates the data we observe
  • Model: an attempt to capture important aspects of, and approximate crudely, that DGP
    • We just stipulate the parameters and synthesize the data
    • It’ll be very worrying if our model can’t recover the truth…
  • Specifying a “correct” model for real-world phenomena is incredibly hard, but also humbling, and hopefully exciting

Fitting the model using ML - optim()

  • Opening the black box: build our own regression machine using maximum likelihood

  • Lecture recap: How to derive maximum likelihood estimators in four easy steps:

    1. Express the joint probability of the data, using the chosen probability distribution

    2. Convert the joint probability to the likelihood (trivial, as they are proportional)

    3. Simplify the likelihood for easy maximization (take logs and reduce to “sufficient statistics”)

    4. Substitute in the systematic component

  • Then we find the set of parameters that maximize the likelihood

    • Using a gradient-based optimizer: optim()
  • Recap: our heteroskedastic normal model

  • Stochastic component: \[y_i \sim f_\mathcal{N}(\mu_{i}, \sigma_{i}^{2})\]

  • Systematic components: \[\mu_i = \boldsymbol{x}_{i} \boldsymbol{\beta}\] \[\sigma_{i}^{2} = \exp(\boldsymbol{z}_{i} \boldsymbol{\gamma})\]

  • MLE for a heteroskedastic normal data

\[ \begin{aligned} \mathrm{P}\left(\mathbf{y} | \boldsymbol{\mu}, \boldsymbol{\sigma}^{2}\right) &=\prod_{i=1}^{n} f_{\mathcal{N}}\left(y_{i} | \mu_{i}, \sigma_{i}^{2}\right) & \tiny[\text{Joint probability}] \\ \mathrm{P}\left(\mathbf{y} | \boldsymbol{\mu}, \boldsymbol{\sigma}^{2}\right) &=\prod_{i=1}^{n}\left(2 \pi \sigma_{i}^{2}\right)^{-1 / 2} \exp \left[\frac{-\left(y_{i}-\mu_{i}\right)^{2}}{2 \sigma_{i}^{2}}\right] & \tiny[\text{Expressed with the Normal PDF}]\\ \log \mathcal{L}\left(\boldsymbol{\mu}, \boldsymbol{\sigma}^{2} | \mathbf{y}\right) & \propto-\frac{1}{2} \sum_{i=1}^{n} \log \sigma_{i}^{2}-\frac{1}{2} \sum_{i=1}^{n} \frac{\left(y_{i}-\mu_{i} \right)^{2}}{\sigma_{i}^{2}} & \tiny[\text{Converted to log likelihood; simplified}] \\ \log \mathcal{L}(\boldsymbol{\beta}, \boldsymbol{\gamma} | \mathbf{y}) & \propto-\frac{1}{2} \sum_{i=1}^{n} \mathbf{z}_{i} \boldsymbol{\gamma}-\frac{1}{2} \sum_{i=1}^{n} \frac{\left(y_{i}-\mathbf{x}_{i} \boldsymbol{\beta}\right)^{2}}{\exp \left(\mathbf{z}_{i} \boldsymbol{\gamma}\right)} & \tiny[\text{Substitute systematic components; } \log \sigma_i^2 = \mathbf{z}_i \boldsymbol{\gamma}] \end{aligned} \]

# A likelihood function for ML heteroskedastic Normal
llk.hetnormlin <- function(param, y, x, z) {
  x <- as.matrix(x)                      # x (some covariates) as a matrix
  z <- as.matrix(z)                      # z (some covariates) as a matrix
  os <- rep(1, nrow(x))                  # Set the intercept as 1 (constant)
  x <- cbind(os, x)                      # Add intercept to covariates x
  z <- cbind(os, z)                      # Add intercept to covariates z
  
  b <- param[1 : ncol(x)]                         # Parameters for x
  g <- param[(ncol(x) + 1) : (ncol(x) + ncol(z))] # Parameters for z
  
  xb <- x %*% b                          # Systematic components for mean
  s2 <- exp(z %*% g)                     # Systematic components for variance
  
  sum(0.5 * (log(s2) + (y - xb)^2 / s2)) # Likelihood we want to maximize 
                                         # optim is a minimizer by default
                                         # To maximize lnL is to minimize -lnL
                                         # so the +/- signs are reversed
}
The last line of the function, sum(0.5 * (log(s2) + (y - xb)^2 / s2)), is the negative log-likelihood up to a constant. Think back:

\[ \begin{aligned} \log \mathcal{L}(\boldsymbol{\beta}, \boldsymbol{\gamma} | \mathbf{y}) &\propto-\frac{1}{2} \sum_{i=1}^{n} \mathbf{z}_{i} \boldsymbol{\gamma}-\frac{1}{2} \sum_{i=1}^{n} \frac{\left(y_{i}-\mathbf{x}_{i} \boldsymbol{\beta}\right)^{2}}{\exp \left(\mathbf{z}_{i} \boldsymbol{\gamma}\right)}\\ &\propto-1 \times \sum_{i=1}^{n} 0.5 \times \left(\mathbf{z}_{i} \boldsymbol{\gamma} + \frac{\left(y_{i}-\mathbf{x}_{i} \boldsymbol{\beta}\right)^{2}}{\exp \left(\mathbf{z}_{i} \boldsymbol{\gamma}\right)}\right) \end{aligned} \]

Important: optim() is a minimizer by default, so you need to reverse the +/- signs (i.e., minimize \(-\log \mathcal{L}\)) to find the maximum point. Note that \(\mathbf{z}_i \boldsymbol{\gamma} = \log \sigma_i^2\), which appears as log(s2) in the code.

# Create input matrices
xcovariates <- cbind(w1, w2)
zcovariates <- cbind(w1, w2)

# Initial guesses of beta0, beta1, ..., gamma0, gamma1, ...
# We need one entry per parameter, in order!
# Note: also include beta and gamma estimates for the constants
stval <- c(0, 0, 0, 0, 0, 0)  

# Run ML, and get the output we need
hetnorm.result <- optim(stval,           # Initial guesses
                        llk.hetnormlin,  # Likelihood function
                        method = "BFGS", # Gradient method
                        hessian = TRUE,  # Return Hessian matrix
                        y = y,           # Outcome variable
                        x = xcovariates, # Covariates x (w/o constant)
                        z = zcovariates  # Covariates z (w/o constant)
                        ) 

Note: By default, optim() performs a minimizing procedure. You can make optim a maximizer by adding control = list(fnscale = -1)
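For instance, a sketch of the maximizer version (llk.hetnormlin.max is a hypothetical name, not used elsewhere in this lab): negate the value returned by llk.hetnormlin() so it is proportional to +lnL, and tell optim() to maximize via fnscale = -1. The point estimates should match hetnorm.result$par.

## Sketch: the same fit, with optim() as a maximizer
llk.hetnormlin.max <- function(param, y, x, z) {
  -llk.hetnormlin(param, y, x, z)   # flip the sign: now proportional to +lnL
}

hetnorm.result.max <- optim(stval, llk.hetnormlin.max,
                            method = "BFGS",
                            y = y, x = xcovariates, z = zcovariates,
                            control = list(fnscale = -1))

round(hetnorm.result.max$par, 3)    # same point estimates as hetnorm.result$par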

pe <- hetnorm.result$par            # Point estimates
round(pe, 3)
## [1] -0.167  5.007 15.698  0.910  0.224  2.994
vc <- solve(hetnorm.result$hessian) # Var-cov matrix (for computing s.e.)
round(vc, 5)
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
## [1,]  0.02909 -0.03265 -0.02810  0.00059 -0.00087 -0.00032
## [2,] -0.03265  0.06787 -0.00025 -0.00091  0.00099  0.00084
## [3,] -0.02810 -0.00025  0.10331 -0.00056  0.00143 -0.00033
## [4,]  0.00059 -0.00091 -0.00056  0.00920 -0.00832 -0.00749
## [5,] -0.00087  0.00099  0.00143 -0.00832  0.01693 -0.00042
## [6,] -0.00032  0.00084 -0.00033 -0.00749 -0.00042  0.01568
se <- sqrt(diag(vc))                # To compute standard errors (s.e.),
                                    # take the diagonal of the var-cov matrix
                                    # (inverse Hessian), then the square root
round(se, 3)
## [1] 0.171 0.261 0.321 0.096 0.130 0.125
mle.result <- cbind(pe[1:3], se[1:3])         # Report ML results 
colnames(mle.result) <- c("Estimate", "Std.Error")
rownames(mle.result) <- c("(Intercept)", "w1", "w2")
round(mle.result, 3)
##             Estimate Std.Error
## (Intercept)   -0.167     0.171
## w1             5.007     0.261
## w2            15.698     0.321
round(coef(summary(ls.result))[, c(1, 2)], 3) # Compare with lm() results
##             Estimate Std. Error
## (Intercept)   -0.049      0.282
## w1             4.784      0.377
## w2            15.683      0.371
ll <- -hetnorm.result$value                   # Report log-likelihood at its maximum 
                                              # No need for the negative sign 
                                              # if optim is set as a maximizer
ll
## [1] -2620.334
# AIC is 2 * number of parameters - 2 * ll (i.e. the log-likelihood at its maximum)
hetnorm.aic <- 2 * length(stval) - 2 * ll  

# Compare AIC from ML and from lm(); lower is better
print(hetnorm.aic)
## [1] 5252.668
print(ls.aic)
## [1] 8545.903

Application: ML heteroskedastic normal

  • Let’s practice a little bit with Lab3_practice.Rmd.

    • Why do the estimates of LS and MLE differ?
      • Salary should be logged because most salary increments work on a percentage basis (e.g., a 2% raise), which is a multiplicative scale, not an additive one
    • Years since PhD might need to be logged for several reasons:
      • Because promotion opportunities and job switching get rarer later in the career
      • Because people’s ability (and any influence it has on salary) tends to become clear eventually

Monte Carlo Experiment

How much do the OLS and heteroskedastic-normal MLE estimators differ in the presence of heteroskedasticity? We can explore this question using Monte Carlo simulations.

Below, I provide a Monte Carlo experiment to compare these two estimators under heteroskedasticity, based on 500 repeated samples from the same data-generating process (DGP). Note that this is the same experiment we covered in Lab2.rmd, where we compared OLS under ideal conditions (BLUE) and in the presence of heteroskedasticity.

### Monte Carlo Experiment OLS vs. MLE:---------------------------

set.seed(342342)

# sample size
n <- 600

# systematic component
b0 <- 0.2 # True value for the intercept
b1 <- 0.5 # True value for the slope

X <- rnorm(n, 0, 2) # independent variable

# stochastic component
e <- rnorm(n, 0, 1) # homoskedastic error term (not used in the heteroskedastic DGP below)

# Heteroskedasticity parameter
gamma <- 0.35

# The DGP: Heteroskedasticity
Y <- b0 + b1*X + rnorm(n, 0, exp(X*gamma))

# Simulate repeated samples
sims <- 500

## Simulation for heteroskedasticity

par.est.1 <- 
  tibble(
    b0_hat = rep(NA,sims),
    b1_hat = rep(NA,sims),
    model = "OLS"
  )

par.est.2 <- 
  tibble(
    b0_hat = rep(NA,sims),
    b1_hat = rep(NA,sims),
    model = "MLE"
  )

## Likelihood function

llk.hetnormlin <- function(param, y, x, z) {
  x <- as.matrix(x)                      # x (some covariates) as a matrix
  z <- as.matrix(z)                      # z (some covariates) as a matrix
  os <- rep(1, nrow(x))                  # Set the intercept as 1 (constant)
  x <- cbind(os, x)                      # Add intercept to covariates x
  z <- cbind(os, z)                      # Add intercept to covariates z
  
  b <- param[1 : ncol(x)]                         # Parameters for x
  g <- param[(ncol(x) + 1) : (ncol(x) + ncol(z))] # Parameters for z
  
  xb <- x %*% b                          # Systematic components for mean
  s2 <- exp(z %*% g)                     # Systematic components for variance
  
  sum(0.5 * (log(s2) + (y - xb)^2 / s2)) # Likelihood we want to maximize 
}

## starting values for searching algorithm
stval <- c(0, 0, 0, 0)
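The simulation loop and the plot that the next paragraph refers to are not reproduced above. A minimal sketch of how they could be written, reusing llk.hetnormlin(), stval, and the par.est tibbles defined above (the plot styling is just one option):

## Simulation loop: re-draw Y each iteration, fit OLS and the MLE, store estimates
for (i in 1:sims) {
  Y <- b0 + b1 * X + rnorm(n, 0, exp(X * gamma))   # heteroskedastic DGP
  
  # OLS
  ols.fit <- lm(Y ~ X)
  par.est.1$b0_hat[i] <- coef(ols.fit)[1]
  par.est.1$b1_hat[i] <- coef(ols.fit)[2]
  
  # Heteroskedastic-normal MLE; the same covariate enters x and z
  mle.fit <- optim(stval, llk.hetnormlin, method = "BFGS",
                   y = Y, x = X, z = X)
  par.est.2$b0_hat[i] <- mle.fit$par[1]
  par.est.2$b1_hat[i] <- mle.fit$par[2]
}

## Visualize the sampling distributions of the estimates around the truth
bind_rows(par.est.1, par.est.2) %>% 
  pivot_longer(cols = c(b0_hat, b1_hat),
               names_to = "parameter",
               values_to = "estimate") %>% 
  ggplot(aes(x = estimate, fill = model)) +
  geom_density(alpha = 0.5) +
  geom_vline(data = tibble(parameter = c("b0_hat", "b1_hat"),
                           truth = c(b0, b1)),
             aes(xintercept = truth), linetype = "dashed") +
  facet_wrap(~parameter, scales = "free") +
  theme_minimal()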

The visualization clearly shows that the MLE estimator is much more efficient than the OLS estimator. Another way to interpret the results, and to motivate the use of MLE, is that the MLE estimator is consistently closer to the true parameter values than the OLS estimator. This difference is especially pronounced in the estimation of \(\hat{\beta_1}\).