Probability Distributions

There are two basic types of random variables: (1) discrete and (2) continuous. A discrete random variable maps events to a countable set (e.g., integers), where each value has a nonzero probability. In contrast, a continuous random variable maps events to an uncountable set (e.g., real numbers) and assigns a positive probability to intervals of values. However, the probability of any specific value for a continuous random variable is zero.

The expected value of a random variable is its mean, whatever distribution it follows. A notable exception is the Cauchy distribution, which has no defined mean or variance.

R has numerous built-in probability distributions. Each distribution has an abbreviated name in R, accessible using four different prefixes (d, p, q, and r) depending on the desired operation.

While individual outcomes are uncertain, the underlying probability distribution provides a predictable pattern of events in the long run.

Discrete Distributions

For a discrete variable, the probability distribution function (PDF) is also known as the probability mass function (PMF). In this case, the function returns an actual probability for each value, since the discrete values in its support are the only possible outcomes of the distribution.

Bernoulli

A Bernoulli distribution describes a random variable with only two possible values: 0 or 1. It has a single parameter, \(\pi\), the probability of success.

\[ \begin{aligned} y_i \sim f_{\text{Bern}}(y_i \mid \pi) & = \pi^{y_i}(1-\pi)^{1-y_i} \quad \quad \text{for } y_i = 0, 1 \\[2ex] E[y] & = \pi \\[2ex] V[y] & = \pi(1-\pi) \end{aligned} \]

This distribution models any dichotomous random variable (e.g., success vs. failure). A sequence of independent coin tosses or whether citizens vote in an election are examples of Bernoulli trials.

In R, there isn’t a dedicated Bernoulli function, but you can draw Bernoulli samples using sample() or rbinom() with size = 1.

set.seed(123)  # Setting seed for reproducibility

sample(
  # set of possible outcomes
  c(0, 1),
  # Number of samples to draw
  size = 10,
  # Sampling with replacement
  replace = TRUE,
  # (1-p, p) probabilities
  prob = c(0.4, 0.6)
)
##  [1] 1 0 1 0 0 1 1 0 1 1
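
Equivalently, rbinom() with size = 1 draws Bernoulli(\(\pi\)) samples directly; a minimal sketch using the same \(\pi = 0.6\) as above:

# Bernoulli draws via the binomial with a single trial
set.seed(123)
rbinom(n = 10, size = 1, prob = 0.6)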

Binomial

A binomial random variable counts the number of successes in a series of \(n\) Bernoulli trials. Two key components define a binomial random variable: (1) the number of trials (\(n\)) and (2) the probability of success on any given trial (\(\pi\)).

\[ \begin{aligned} y_i \sim f_{\text{Bin}}(y_i \mid n,\pi) & = \frac{n!}{y_i! (n-y_i)!} \, \pi^{y_i} (1-\pi)^{(n-y_i)} \quad \quad \text{for } y_i = 0, 1,...,n \\[2ex] E[y] & = n\pi \\[2ex] V[y] & = n\pi(1-\pi) \end{aligned} \]

Base R functions to work with the binomial distribution are:

Function Purpose Example
rbinom() Generate random Binomial samples rbinom(n = 10, size = 5, prob = 0.3)
dbinom() Compute PMF (probability mass function) dbinom(x = 3, size = 5, prob = 0.3)
pbinom() Compute CDF (cumulative distribution function) pbinom(q = 3, size = 5, prob = 0.3, lower.tail = TRUE)
qbinom() Compute quantiles qbinom(p = 0.5, size = 5, prob = 0.3, lower.tail = TRUE)

where:

  • n: Number of random samples to generate (for rbinom()).
  • x: Number of successes for calculating the PMF (for dbinom()).
  • q: Number of successes for calculating the CDF (for pbinom()).
  • p: Cumulative probability value for the quantile function (for qbinom()).
  • size: Number of trials (\(n\)).
  • prob: Probability of success (\(\pi\)).
  • lower.tail: Logical; if TRUE, calculates the lower tail probability (\(P(X \leq q)\)).
# Generate 10 random samples
set.seed(123)

rbinom(n=10, size = 5, prob = 0.3) 
##  [1] 1 2 1 3 3 0 1 3 2 1
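
As a quick check of the PMF and CDF functions in the table above (a minimal sketch with illustrative parameters), the cumulative probability \(P(X \leq 3)\) equals the sum of the point probabilities \(P(X = 0), \ldots, P(X = 3)\):

# P(X = 3) for Binomial(n = 5, pi = 0.3)
dbinom(x = 3, size = 5, prob = 0.3)

# P(X <= 3) two ways: CDF vs. summing the PMF
pbinom(q = 3, size = 5, prob = 0.3)
sum(dbinom(x = 0:3, size = 5, prob = 0.3))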

Extra: How to plot?

The code below shows how to plot the PMF and CDF of a binomial distribution.

# PMF
library(tidyverse)  # for tibble(), dplyr, and ggplot2 used below

# Create x axis in data.frame
tibble(x = seq(from = 0, to = 10, by = 1)) |>

# Create y axis 
  mutate(y = dbinom(x = x, size = 10, prob = 0.7)) |> 

# Plot, map data in aesthetic 
  ggplot(aes(x=x,y=y)) +

# Specify the geometry of plot
  geom_segment(aes(xend = x, yend = 0)) +
  
  geom_point(size=2) +
  
# Specify x axis breaks
  scale_x_continuous(breaks = 0:10) +
  
# Change label on y axis
  labs(y="P(X)", subtitle = "PMF")+

# (Optional: specify theme )
  theme_bw()

# CDF
# Create x axis in data.frame 
tibble(x = seq(from = 0, to = 10, by =1)) |> 

# Create y axis 
  mutate(y = pbinom(q = x, size =10, prob=.7)) |> 
  
# Plot, map data in aesthetic 
  ggplot(aes(x=x,y=y))+

# Specify the type of plot
  geom_segment(aes(xend = x, yend = 0)) +
  
  geom_point(size=2) +
  
# Specify x axis breaks
  scale_x_continuous(breaks = 0:10) +
  
# Change label on y axis
  labs(y = "P(X <= x)", subtitle = "CDF") +

# (Optional: specify theme )
  theme_bw()

Poisson

A Poisson random variable is used to model the number of occurrences of an event within a fixed interval of time or space, given that these events occur with a known constant rate and independently of the time since the last event. The Poisson distribution has a single key parameter: (1) the rate or mean number of occurrences (\(\lambda\)) over the specified interval.

The pmf of a Poisson random variable is given by:

\[ \begin{aligned} y_i \sim f_{\text{Pois}}(y_i \mid \lambda) & = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} \quad \quad \text{for } y_i = 0, 1, 2, \ldots \\[2ex] E[y] & = \lambda \\[2ex] V[y] & = \lambda \end{aligned} \]

Base R functions to work with the Poisson distribution are:

Function Purpose Example
rpois() Generate random Poisson samples rpois(n = 10, lambda = 2.5)
dpois() Compute PMF (probability mass function) dpois(x = 3, lambda = 2.5)
ppois() Compute CDF (cumulative distribution function) ppois(q = 3, lambda = 2.5, lower.tail = TRUE)
qpois() Compute quantiles qpois(p = 0.5, lambda = 2.5, lower.tail = TRUE)

where:

  • n: Number of random samples to generate (for rpois()).
  • x: Number of occurrences for calculating the PMF (for dpois()).
  • q: Number of occurrences for calculating the CDF (for ppois()).
  • p: Cumulative probability value for the quantile function (for qpois()).
  • lambda: Rate or expected number of occurrences (\(\lambda\)).
  • lower.tail: Logical; if TRUE, calculates the lower tail probability (\(P(X \leq q)\)).
# Generate 10 random samples
set.seed(123)

rpois(n = 10, lambda = 2.5)
##  [1] 2 4 2 4 5 0 2 5 3 2
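
Analogously to the binomial case, dpois() and ppois() return point and cumulative probabilities; a short sketch with \(\lambda = 2.5\):

# P(X = 3) and P(X <= 3) for Poisson(lambda = 2.5)
dpois(x = 3, lambda = 2.5)
ppois(q = 3, lambda = 2.5)

# The CDF is again the running sum of the PMF
sum(dpois(x = 0:3, lambda = 2.5))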

Both the Binomial and Poisson distributions model counting events but differ in context. The Binomial distribution counts the number of successes in a fixed number of trials, while the Poisson distribution models the number of events occurring in a fixed interval of time, assuming independence and a constant rate.

However, these distributions rely on strong assumptions that are often unrealistic for social processes (e.g., the Poisson's equidispersion assumption that the variance equals the mean). Later, we will explore more flexible alternatives, such as the negative-binomial, beta-binomial, and zero-inflated negative binomial distributions, among others.
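
To see the difference in their variance assumptions, a quick simulation (with illustrative parameter values, not tied to any application) can compare the sample mean and variance of binomial and Poisson draws with the same expected count: the binomial variance \(n\pi(1-\pi)\) falls below its mean, while the Poisson variance matches its mean.

set.seed(123)
x_bin  <- rbinom(n = 1e5, size = 10, prob = 0.3)  # E[X] = 3, V[X] = 2.1
x_pois <- rpois(n = 1e5, lambda = 3)              # E[X] = 3, V[X] = 3

c(mean = mean(x_bin),  variance = var(x_bin))
c(mean = mean(x_pois), variance = var(x_pois))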

Continuous Distributions

Probability with continuous variables is conceptualized in ranges, as in the probability of a realization falling between two values. This is computed from the PDF by finding the area under the curve in that range.

Uniform

A Uniform random variable is used to model scenarios where all outcomes within a specified interval are equally likely. It is defined by two parameters: (1) the minimum value (\(a\)) and (2) the maximum value (\(b\)). The uniform distribution has the following pdf:

\[ \begin{aligned} X & \sim \text{Uniform}(a, b) \\ f_X(x \mid a, b) & = \frac{1}{b - a} \quad \quad \text{for } a \leq x \leq b \\[2ex] E[X] & = \frac{a + b}{2} \\[2ex] V[X] & = \frac{(b - a)^2}{12} \end{aligned} \]

  • Support: \(a \leq x \leq b\).
  • The pdf is constant over the interval \([a, b]\) and zero outside of it.
  • The mean is the midpoint of the interval, \((a + b)/2\), and the variance grows with the square of its width, \((b - a)^2/12\).

Base R functions to work with the Uniform distribution are:

Function Purpose Example
runif() Generate random Uniform samples runif(n = 10, min = 0, max = 1)
dunif() Compute PDF (probability density function) dunif(x = 0.5, min = 0, max = 1)
punif() Compute CDF (cumulative distribution function) punif(q = 0.5, min = 0, max = 1, lower.tail = TRUE)
qunif() Compute quantiles qunif(p = 0.5, min = 0, max = 1, lower.tail = TRUE)

where:

  • n: Number of random samples to generate (for runif()).
  • x: Value for which to calculate the PDF (for dunif()).
  • q: Value for calculating the CDF (for punif()).
  • p: Cumulative probability value for the quantile function (for qunif()).
  • min: Minimum value of the Uniform distribution (\(a\)).
  • max: Maximum value of the Uniform distribution (\(b\)).
  • lower.tail: Logical; if TRUE, calculates the lower tail probability \(P(X \leq q)\).
# Generate 10 random samples from Uniform(0, 1)
set.seed(123)
runif(n = 10, min = 0, max = 1)
##  [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565 0.5281055
##  [8] 0.8924190 0.5514350 0.4566147
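
Because probability for a continuous variable is the area under the pdf over a range, punif() can be used to compute interval probabilities; for the standard Uniform(0, 1), the area over \([0.2, 0.5]\) is simply 0.3 (a minimal check of that idea):

# P(0.2 <= X <= 0.5) for X ~ Uniform(0, 1)
punif(q = 0.5, min = 0, max = 1) - punif(q = 0.2, min = 0, max = 1)

# A single point has probability zero; dunif() returns a density, not a probability
dunif(x = 0.5, min = 0, max = 1)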


Gaussian or Normal

A Normal random variable, often referred to as a Gaussian variable, is widely used to model continuous data that cluster around a mean value. It is characterized by two key parameters: (1) the mean (\(\mu\)) and (2) the variance (\(\sigma^2\)). The shape of the distribution is symmetric and bell-shaped, with the mean determining the center and the variance controlling the spread.

The pdf of a Normal random variable is given by:

\[ \begin{aligned} y_i \sim f_{\text{Norm}}(y_i \mid \mu, \sigma^2) & = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \quad \quad \text{for } y_i \in (-\infty, \infty) \\[2ex] E[y] & = \mu \\[2ex] V[y] & = \sigma^2 \end{aligned} \]

Base R functions to work with the Gaussian distribution are:

Function Purpose Example
rnorm() Generate random Normal samples rnorm(n = 10, mean = 0, sd = 1)
dnorm() Compute PDF (probability density function) dnorm(x = 1, mean = 0, sd = 1)
pnorm() Compute CDF (cumulative distribution function) pnorm(q = 1, mean = 0, sd = 1, lower.tail = TRUE)
qnorm() Compute quantiles qnorm(p = 0.5, mean = 0, sd = 1, lower.tail = TRUE)

where:

  • n: Number of random samples to generate (for rnorm()).
  • x: Value for which to calculate the PDF (for dnorm()).
  • q: Value for calculating the CDF (for pnorm()).
  • p: Cumulative probability value for the quantile function (for qnorm()).
  • mean: Mean of the Normal distribution (\(\mu\)).
  • sd: Standard deviation of the Normal distribution (\(\sigma\)).
  • lower.tail: Logical; if TRUE, calculates the lower tail probability (\(P(X \leq q)\)).
# Generate 10 random samples from N(0, 1)
set.seed(123)
rnorm(n = 10, mean = 0, sd = 1)
##  [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499
##  [7]  0.46091621 -1.26506123 -0.68685285 -0.44566197
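
The same logic applies to the normal distribution: interval probabilities come from differences of the CDF. For the standard normal, roughly 95% of the mass lies within \(\pm 1.96\) standard deviations of the mean (a quick check):

# P(-1.96 <= Z <= 1.96) for Z ~ N(0, 1); approximately 0.95
pnorm(q = 1.96, mean = 0, sd = 1) - pnorm(q = -1.96, mean = 0, sd = 1)

# The matching quantile: the value below which 97.5% of the mass lies
qnorm(p = 0.975, mean = 0, sd = 1)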

Standardized Normal (or Gaussian)

A standardized normal distribution, also known as the standard normal distribution, is a special case of the normal distribution where the mean (\(\mu\)) is 0 and the variance (\(\sigma^2\)) is 1. Its probability density function (pdf) is given by:

\[ \begin{aligned} y_i \sim f_{\text{Norm}}(y_i \mid 0, 1) & = \frac{1}{\sqrt{2\pi}} \exp \left(-\frac{y_i^2}{2}\right) \quad \quad \text{for } y_i \in (-\infty, \infty) \\[2ex] E[y] & = 0 \\[2ex] V[y] & = 1 \end{aligned} \]

The standard normal distribution plays a crucial role in statistics, particularly in hypothesis testing and confidence interval estimation, due to its fundamental properties. Other continuous distributions, like the Chi-Square or Student’s t distributions, are derived from the standardized normal.
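
Standardizing a normal variable, \(z = (y - \mu)/\sigma\), maps it onto the standard normal, so a probability can be computed either on the original scale or after standardizing; a small sketch with illustrative values \(\mu = 10\) and \(\sigma = 2\):

mu    <- 10   # illustrative mean
sigma <- 2    # illustrative standard deviation
y     <- 13

# P(Y <= 13) on the original scale ...
pnorm(q = y, mean = mu, sd = sigma)

# ... equals P(Z <= z) after standardizing
z <- (y - mu) / sigma
pnorm(q = z, mean = 0, sd = 1)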

Statistical Inference

In statistical inference, we are concerned with making predictions (inferences) about a data-generating process (DGP) or population based on information obtained from a sample.

A statistical model is a formal representation of the process by which a DGP produces outputs. Since no DGP is deterministic, statistical models are assumed to have both a systematic and a stochastic component.

\[ Y_i = \underbrace{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}}_{\text{Systematic Component}} \, + \, \underbrace{\varepsilon_i}_{\text{Stochastic Component}} \]

In the frequentist approach to statistical inference, the systematic component is fixed in the population and remains constant across samples. In contrast, the stochastic component varies from sample to sample, though its distribution remains the same.

It is common to distinguish between the population model, which represents the true data-generating process (DGP), and the sample (or empirical) model, which is the specified model fitted to the data.

\[ \begin{aligned} \text{Population model:} \quad Y_i & = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i \\ \text{Sample model:} \quad \hat{y}_i & = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \cdots + \hat{\beta}_p x_{pi} \end{aligned} \]

Notice that the assumption of correct specification (no misspecification) implies that the sample model has the same functional form as the population model, i.e., \(\text{Population model} = \text{Sample model}\).

Inference involves the interconnection among three concepts, illustrated with a short example after the list below:

  1. Estimand: The estimand is the true quantity of interest in the data-generating process (DGP) that we want to learn about or infer. It is a parameter (e.g., population mean, causal effect, or regression coefficient) that characterizes a specific aspect of the population. The estimand is fixed and unknown in the population and does not depend on the sample.
  2. Estimator: An estimator is a statistical procedure, rule, or formula used to estimate the estimand based on the sample data. It is a function of the observed data and represents a method for deriving an approximation of the estimand.
  3. Estimate: An estimate is the numerical value obtained by applying the estimator to the observed sample data. It is a realized value and serves as the best guess or approximation of the estimand based on the available information from the sample.
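
A minimal numeric sketch of the distinction (the DGP and its mean \(\mu = 5\) are made up purely for illustration): the population mean is the estimand, the sample-mean formula mean() is the estimator, and the number it returns for one particular sample is the estimate.

set.seed(123)

# Estimand: the population mean of the DGP (here, mu = 5 by construction)
mu <- 5

# A sample drawn from that DGP
y <- rnorm(n = 100, mean = mu, sd = 2)

# Estimator: the sample-mean formula applied to the data
# Estimate: the realized value for this particular sample
mean(y)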

Central Limit Theorem

The Central Limit Theorem (CLT) states that the distribution of the means of repeated samples drawn from any distribution with a finite mean and variance will be approximately normal, provided the samples from which the means are computed are large enough.

# Set parameters
set.seed(123)                   # For reproducibility
p <- 0.3                        # Probability of success for Bernoulli distribution
n_samples <- 1000               # Number of repeated samples for each sample size
sample_sizes <- c(5, 30, 100)   # Different sample sizes to demonstrate the CLT


## Simulation of CLT

# Create an empty data frame to store results
results <- data.frame()

# Loop through different sample sizes and calculate sample means
for (n in sample_sizes) {
  # Generate sample means for the given sample size
  sample_means <- replicate(n_samples, 
                            mean(sample(c(0, 1), 
                                        size = n, 
                                        replace = TRUE, 
                                        prob = c(1 - p, p))))
  
  # Store results in a data frame
  temp_df <- data.frame(Sample_Size = factor(n), 
                        Sample_Means = sample_means)
  
  results <- rbind(results, temp_df)
}


# Calculate theoretical normal distribution parameters for each sample size

results <- 
  results |> 
  group_by(Sample_Size) |> 
  mutate(Theoretical_Mean = p,
         Theoretical_SD = sqrt(p * (1 - p) / as.numeric(as.character(Sample_Size))))

Law of Large Numbers

The Law of Large Numbers (LLN) states that, as the sample size increases, the sample mean will converge to the true population mean with high probability. This means that, for a sufficiently large sample, the sample mean is a consistent estimator of the population mean.

# Coin Flips

# Set the seed for reproducible results
set.seed(23212)

# Number of trials
n <- seq(1, 200, length = 200)

# Empty vector for heads in one set of trials
heads <- numeric(length(n))

# Empty vector for cumulative proportions
c.heads <- numeric(length(n))

# Two sides of the coin
k <- c("Tails","Heads")

# Probabilities of heads and tails
p <- c(0.5, 0.5)

# Loop over the n trials

for (i in 1:length(n)) {
  # Count the number of heads from n[i] trials
  heads[i] <- sum(sample(k,
                         size = n[i],
                         prob = p,
                         replace = TRUE) == "Heads")
  
  # Compute the cumulative proportion of heads
  c.heads[i] <- sum(heads[1:i]) / sum(n[1:i])
}


dat <- tibble(probs=c.heads,
              trials=1:200)
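
Plotting the cumulative proportion of heads against the number of trials (same plotting conventions as above) shows it settling around the true probability of 0.5:

# Cumulative proportion of heads converging to 0.5
dat |>
  ggplot(aes(x = trials, y = probs)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  labs(x = "Number of trials", y = "Cumulative proportion of heads") +
  theme_bw()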

Least Squares Estimation

The equation for a simple linear regression model with only one predictor \(X\), an intercept \(\beta_0\), and a slope coefficient \(\beta_1\) is given by:

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]

For OLS to produce unbiased and efficient estimates of \(\beta_0\) and \(\beta_1\), several assumptions must hold. These assumptions are also required for the Gauss-Markov theorem, which states that under these conditions, the OLS estimators are the Best Linear Unbiased Estimators (BLUE).

  1. Linearity:
    • The relationship between \(X\) and \(Y\) is linear in parameters. This means that the model is correctly specified as:

\[ E[Y_i \mid X_i] = \beta_0 + \beta_1 X_i. \]

  • For linear regression, it is also implicitly assumed that the levels of \(Y\) are continuous and have equal intervals, meaning the difference between \(Y = 1\) and \(Y = 2\) is the same as the difference between \(Y = 2\) and \(Y = 3\).

  • This is crucial because the marginal effect (change in \(Y\) for a unit change in \(X\)) is meaningful only if \(Y\) is measured on a scale with equal-interval properties.

  2. Independence:
    • The error terms \(\varepsilon_i\) are independent of each other. This means there is no autocorrelation between the errors for different observations:

\[ \text{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for } i \neq j. \]

  3. Homoscedasticity:
    • The error terms \(\varepsilon_i\) have constant variance for all values of \(X\). This implies that the spread of the residuals is the same across different values of \(X\):

\[ \text{Var}(\varepsilon_i) = \sigma^2 \quad \forall \, i. \]

  4. No Perfect Collinearity:
    • For models with multiple predictors, this implies that no predictor is an exact linear combination of the others.
  5. Exogeneity or no endogeneity:
    • The error term has an expected value of zero and is uncorrelated with the predictor \(X\):

\[ E[\varepsilon_i \mid X_i] = 0. \]

  • This ensures that \(X\) does not contain any information about the error term.
  • This assumption implies other assumptions, such as no measurement error and no misspecification.
  6. Normality of Errors (for Hypothesis Testing):
    • The error terms \(\varepsilon_i\) are normally distributed:

\[ \varepsilon_i \sim \mathcal{N}(0, \sigma^2). \]

The Gauss-Markov theorem states that, under assumptions 1 through 5, the OLS estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the Best Linear Unbiased Estimators (BLUE). In other words, among all linear unbiased estimators, OLS has the smallest variance under the given assumptions.

What Does (Un)biasedness Look Like?

Below, we simulate an OLS estimation under perfect conditions: no assumption is violated and the estimates \(\hat{\beta}\) are BLUE. Assume the following DGP:

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]

Where \(X \sim \mathcal{N}(0, 2^2)\) (i.e., a standard deviation of 2), \(\varepsilon_i \sim \mathcal{N}(0,1)\), and the true parameters are \(\beta_0 = 0.2\) and \(\beta_1 = 0.5\). We can simulate this regression model with one sample of 600 observations.

set.seed(342342)

## BLUE

# sample size
n <- 600


# systematic component
b0 <- 0.2 # True value for the intercept
b1 <- 0.5 # True value for the slope

X <- rnorm(n, 0, 2) # independent variable

# stochastic component
e <- rnorm(n, 0, 1) # error term

# "True" DGP: BLUE
Y <- b0 + b1*X + e


mean(Y) # mean
## [1] 0.1533692
lm(Y ~ 1) # intercept only
## 
## Call:
## lm(formula = Y ~ 1)
## 
## Coefficients:
## (Intercept)  
##      0.1534
lm(Y ~ X) # one predictor
## 
## Call:
## lm(formula = Y ~ X)
## 
## Coefficients:
## (Intercept)            X  
##      0.1892       0.5083

Note that the above analysis is based on a single sample, as is common in most empirical research. Even under ideal conditions, the estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) will not perfectly match the true parameters \(\beta_0\) and \(\beta_1\). This is because unbiasedness does not guarantee that any particular estimate will equal the true parameter.

Instead, unbiasedness means that if we were to repeatedly draw different samples from the same data-generating process (DGP) and compute estimates using the same unbiased estimator, the average of these estimates would equal the true parameter value. In other words, an estimator is considered unbiased if the expected value of its estimates equals the true parameter value over many repeated samples.

## Simulate repeated samples

sims <- 500

par.est.1 <- 
  tibble(
    b0_hat = rep(NA,sims),
    b1_hat = rep(NA,sims),
    model = "BLUE"
  )

for (i in 1:sims) {
  # Start the loop
  
  # The DGP: BLUE
  Y <- b0 + b1*X + rnorm(n, 0, 1)
  
  # Estimate OLS model
  out <- lm(Y ~ X)
  
  # Extract estimates and store in the matrix columns
  
  par.est.1[i, 1] <- out$coef[1]
  par.est.1[i, 2] <- out$coef[2]
  
}
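
Averaging the estimates across the 500 simulated samples should give values very close to the true parameters (\(\beta_0 = 0.2\) and \(\beta_1 = 0.5\)); this is what unbiasedness looks like in practice. A quick summary (sketch):

# Average of the estimates across repeated samples
par.est.1 |>
  summarise(mean_b0_hat = mean(b0_hat),
            mean_b1_hat = mean(b1_hat))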

What Does (In)efficiency Look Like?

Now, we will violate one of the assumptions of the Gauss-Markov theorem: homoskedastic errors. This assumption states that the errors have constant variance. In contrast, we will introduce heteroskedasticity by allowing the variance of the errors to depend on the predictor \(X\). Specifically, as the value of \(X\) increases, the variance of \(Y\) (and of its error term \(\varepsilon\)) will also increase:

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad \text{where} \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2(X_i)). \]

Notice that in the standard model, the variance of \(\varepsilon_i\) is constant. Here, however, \(\sigma^2(X_i)\) indicates that the variance is a function of the independent variable \(X\). Alternatively, you may encounter the notation \(\sigma^2_i\), where the subscript \(i\) implies that the variance differs across observations, indicating non-constant variance in the error terms.

## Heteroskedasticity
set.seed(234242)

# Heteroskedasticity parameter
gamma <- 0.35

# "True" DGP: BLUE
Y <- b0 + b1*X + e

# "True" DGP: Heteroskedasticity
Y.2 <- b0 + b1*X + rnorm(n, 0, exp(X*gamma))

cor(X,rnorm(n, 0, exp(X*gamma)))
## [1] -0.005925086
tibble(X,Y,Y.2) |> 
  pivot_longer(cols=2:3,
               names_to = "outcome",
               values_to = "Y") |> 
  mutate(outcome=fct_recode(outcome,
                            "Homoskedasticity" = "Y",
                            "Heteroskedasticity" = "Y.2")) |> 
  ggplot(aes(x=X,
             y=Y)) +
  geom_point() +
  theme_bw() +
  facet_wrap(.~ outcome) +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        text = element_text(size = 14))

## Simulation for heteroskedasticity

par.est.2 <- 
  tibble(
    b0_hat = rep(NA,sims),
    b1_hat = rep(NA,sims),
    model = "Heteroskedastic"
  )


for (i in 1:sims) {
  # Start the loop
  
  # The DGP: Heteroskedasticity
  Y <- b0 + b1*X + rnorm(n, 0, exp(X*gamma))
  
  # Estimate OLS model
  out <- lm(Y ~ X)
  
  # Extract estimates and store in the matrix columns
  
  par.est.2[i, 1] <- out$coef[1]
  par.est.2[i, 2] <- out$coef[2]
  
}


# Compare variance: beta_0
var(par.est.1[,1]); var(par.est.2[,1])
##             b0_hat
## b0_hat 0.001603449
##             b0_hat
## b0_hat 0.005044727
# Compare variance: beta_1
var(par.est.1[,2]); var(par.est.2[,2])
##              b1_hat
## b1_hat 0.0003906309
##             b1_hat
## b1_hat 0.003061268

When comparing the variances, we can see that the variances of the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are higher in the model with heteroskedasticity than in the model with homoskedasticity. In other words, when the assumption of homoskedasticity is violated, the OLS estimates are inefficient and no longer BLUE (Best Linear Unbiased Estimators), even though they remain unbiased.
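
Below is a plotting sketch (same conventions as the earlier figures) comparing the sampling distributions of \(\hat{\beta}_1\) from the two simulations: both are centered near the true value \(\beta_1 = 0.5\), but the heteroskedastic one is visibly more spread out.

# Sampling distributions of the slope estimates under the two DGPs
bind_rows(par.est.1, par.est.2) |>
  ggplot(aes(x = b1_hat, linetype = model)) +
  geom_density() +
  geom_vline(xintercept = b1, linetype = "dotted") +
  labs(x = expression(hat(beta)[1]), y = "Density", linetype = "Model") +
  theme_bw()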

How good is a model?

We use goodness-of-fit statistics to evaluate how well our model predicts the outcome variable. For linear models estimated by least squares, the most common metrics are the Mean Squared Error (MSE) and the Adjusted R-Squared. MSE summarizes prediction error on the scale of the outcome, which makes it easy to interpret but not comparable across differently scaled outcomes. Adjusted R-Squared, being unit-free, is more useful for comparing the predictive power of models with differently scaled outcomes, such as a model for \(Y\) versus one for \(\log(Y)\). Both statistics are derived from the Sum of Squared Residuals (SSR).

Important considerations:

  • Consistency in Samples: When comparing models, always use the same sample. If you compare models with a different number of observations due to data cleaning issues, metrics like MSE are not comparable. This is especially crucial when interpreting coefficients.

  • In-Sample Metrics: These goodness-of-fit statistics are in-sample metrics and may be misleading, as models can overfit by learning irregular patterns. We will explore out-of-sample statistics to improve model assessment in this course.

  • Causal Inference: If you’re focused on causal effects, you can ignore these statistics. Models with high predictive power can sometimes produce biased coefficient estimates, e.g., due to collider or mediation bias (see Cinelli et al., 2022).

Mean Squared Error

The Mean Squared Error (MSE) is a metric used to evaluate the performance of an estimator by measuring the average squared difference between the observed values and the predicted or estimated values. It is commonly used in regression analysis to quantify the accuracy of predictions.

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

where:

  • \(n\): The number of observations.
  • \(y_i\): The observed (true) value for the \(i\)-th observation.
  • \(\hat{y}_i\): The predicted (estimated) value for the \(i\)-th observation.
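
With a fitted lm object, the in-sample MSE is simply the mean of the squared residuals. A minimal sketch, reusing the most recently simulated Y and X from the sections above:

# In-sample MSE of a fitted linear model
fit <- lm(Y ~ X)
mean(residuals(fit)^2)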

Adjusted R-Squared

The Adjusted R-squared (\(R^2_{adj}\)) is a modified version of the R-squared statistic that accounts for the number of predictors in the model. It indicates the proportion of variance explained by the model, adjusted for the number of predictors. Adjusted R-squared is especially useful when comparing models with different numbers of predictors or when the scale of the outcome variable varies.

\[ R^2_{\text{adj}} = 1 - \frac{\frac{\text{SSR}}{n - p - 1}}{\frac{\text{TSS}}{n - 1}} \]

where:

  • \(\text{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\): Sum of Squared Residuals, which measures the total squared error between the observed values \(y_i\) and the predicted values \(\hat{y}_i\).
  • \(\text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2\): Total Sum of Squares, which measures the total variability in the observed data around its mean \(\bar{y}\).
  • \(n\): Total number of observations.
  • \(p\): Number of predictors (excluding the intercept).
  • \(y_i\): Observed value of the outcome variable.
  • \(\hat{y}_i\): Predicted value from the model.
  • \(\bar{y}\): Mean of the observed values.

Intuition:

  • Numerator: \(\frac{\text{SSR}}{n - p - 1}\) represents the Mean Squared Error (MSE), adjusted for the number of predictors. This measures the prediction error of the specified model with covariates: \(Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} + \varepsilon_i\).

  • Denominator: \(\frac{\text{TSS}}{n - 1}\) represents the total variance in the observed values. You can think of it as the variance of the errors when fitting an intercept-only model: \(Y_i = \beta_0 + \varepsilon_i\).

  • Adjusted R-Squared: Indicates the proportion of the total variance explained by the model, penalized for adding predictors. In other words, it evaluates the ratio of variance explained between the fitted model with covariates and the intercept-only model, while penalizing for the effective number of degrees of freedom.
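
The formula above can be computed by hand from a fitted model and checked against the value R reports; a sketch using the fit object from the MSE example:

# Adjusted R-squared by hand vs. the value reported by summary()
y_obs  <- fit$model[[1]]             # observed outcome used in the fit
n_obs  <- length(y_obs)
p_pred <- length(coef(fit)) - 1      # number of predictors (excluding the intercept)

ssr <- sum((y_obs - fitted(fit))^2)  # sum of squared residuals
tss <- sum((y_obs - mean(y_obs))^2)  # total sum of squares

1 - (ssr / (n_obs - p_pred - 1)) / (tss / (n_obs - 1))
summary(fit)$adj.r.squared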