There are two basic types of random variables: (1) discrete and (2) continuous. A discrete random variable maps events to a countable set (e.g., integers), where each value has a nonzero probability. In contrast, a continuous random variable maps events to an uncountable set (e.g., real numbers) and assigns a positive probability to intervals of values. However, the probability of any specific value for a continuous random variable is zero.
The expected value of a random variable is its mean, regardless of the distribution it follows. A notable exception is the Cauchy distribution, whose mean and variance are undefined.
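To see why the Cauchy distribution misbehaves, the short sketch below (base R only; the number of draws is arbitrary) compares the running mean of normal draws, which settles near 0, with the running mean of Cauchy draws, which keeps being dragged around by extreme values:
set.seed(123)
n_draws <- 10000
# Running mean: cumulative sum divided by the number of draws so far
running_mean <- function(x) cumsum(x) / seq_along(x)
norm_means   <- running_mean(rnorm(n_draws))   # settles near 0
cauchy_means <- running_mean(rcauchy(n_draws)) # keeps jumping around
# Inspect the end of each sequence
tail(norm_means, 3)
tail(cauchy_means, 3)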
R has numerous built-in probability distributions. Each distribution has an abbreviated name in R, accessible through four different prefixes, depending on the desired operation: d for the density/PMF, p for the CDF, q for the quantile function, and r for random number generation.
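For instance, the four prefixes applied to the standard normal distribution (a minimal base R sketch) look like this:
dnorm(x = 0)    # d: density of the standard normal at 0
pnorm(q = 0)    # p: P(X <= 0) = 0.5
qnorm(p = 0.5)  # q: the median, which is 0
rnorm(n = 3)    # r: three random draws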
While individual outcomes are uncertain, the underlying probability distribution provides a predictable pattern of events in the long run.
For a discrete variable, the distribution is described by the probability mass function (PMF), the discrete counterpart of the probability density function (PDF). In this case, the function returns actual probabilities, since the discrete values are the only possible outcomes of the distribution.
A Bernoulli distribution describes a random variable with only two possible values: 0 or 1. It only has one parameter, \(\pi\), which is the probability of success.
\[ \begin{aligned} y_i \sim f_{\text{Bern}}(y_i \mid \pi) & = \pi^{y_i}(1-\pi)^{1-y_i} \quad \quad \text{for } y_i = 0, 1 \\[2ex] E[y] & = \pi \\[2ex] V[y] & = \pi(1-\pi) \end{aligned} \]
This distribution models any dichotomous random variable (e.g., success vs. failure). A sequence of independent coin tosses or whether citizens vote in an election are examples of Bernoulli trials.
In R, there isn’t a dedicated Bernoulli family of functions, but a Bernoulli variable is just a binomial with a single trial: you can simulate draws with sample() or with rbinom() using size = 1, and evaluate the PMF with dbinom() using size = 1.
set.seed(123) # Setting seed for reproducibility
sample(
# set of possible outcomes
c(0, 1),
# Number of samples to draw
size = 10,
# Sampling with replacement
replace = TRUE,
# (1-p, p) probabilities
prob = c(0.4, 0.6)
)
## [1] 1 0 1 0 0 1 1 0 1 1
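Alternatively, since a Bernoulli variable is just a binomial with a single trial, rbinom() with size = 1 produces the same kind of draws (a quick sketch with the same probability of success):
set.seed(123)
# 10 Bernoulli draws with probability of success 0.6
rbinom(n = 10, size = 1, prob = 0.6)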
A binomial random variable is a variable that counts the number of successes from a series of n Bernoulli trials. There are two key components to define a binomial random variable: (1) the number of trials (\(n\)) and (2) the probability of success on any given trial (\(\pi\)).
\[ \begin{aligned} y_i \sim f_{\text{Bin}}(y_i \mid n,\pi) & = \left( \frac{n!}{y_i! (n-y_i)!} \right) \pi^{y_i} (1-\pi)^{(n-y_i)} \quad \quad \text{for } y_i = 0, 1,...,n \\[2ex] E[y] & = n\pi \\[2ex] V[y] & = n\pi(1-\pi) \end{aligned} \]
Base R functions to work with the binomial distribution are:
Function | Purpose | Example |
---|---|---|
rbinom() | Generate random Binomial samples | rbinom(n = 10, size = 5, prob = 0.3) |
dbinom() | Compute PMF (probability mass function) | dbinom(x = 3, size = 5, prob = 0.3) |
pbinom() | Compute CDF (cumulative distribution function) | pbinom(q = 3, size = 5, prob = 0.3, lower.tail = TRUE) |
qbinom() | Compute quantiles | qbinom(p = 0.5, size = 5, prob = 0.3, lower.tail = TRUE) |
where:

- n: Number of random samples to generate (for rbinom()).
- x: Number of successes for calculating the PMF (for dbinom()).
- q: Number of successes for calculating the CDF (for pbinom()).
- p: Cumulative probability value for the quantile function (for qbinom()).
- size: Number of trials (\(n\)).
- prob: Probability of success (\(p\)).
- lower.tail: Logical; if TRUE, calculates the lower tail probability (\(P(X \leq q)\)).

# Generate 10 random samples
set.seed(123)
rbinom(n=10, size = 5, prob = 0.3)
## [1] 1 2 1 3 3 0 1 3 2 1
See below how to plot the PMF and CDF of a binomial distribution (the plotting code assumes ggplot2 and tibble are loaded, e.g., via the tidyverse).
# PMF
# Create x axis in data.frame
tibble(x = seq(from = 0, to = 10, by =1)) |>
# Create y axis
mutate(y = dbinom(x = x, size = 10, prob = 0.7)) |>
# Plot, map data in aesthetic
ggplot(aes(x=x,y=y)) +
# Specify the geometry of plot
geom_segment(aes(xend = x, yend = 0)) +
geom_point(size=2) +
# Specify x axis breaks
scale_x_continuous(breaks = 0:10) +
# Change label on y axis
labs(y="P(X)", subtitle = "PMF")+
# (Optional: specify theme )
theme_bw()
# CDF
# Create x axis in data.frame
tibble(x = seq(from = 0, to = 10, by =1)) |>
# Create y axis
mutate(y = pbinom(q = x, size =10, prob=.7)) |>
# Plot, map data in aesthetic
ggplot(aes(x=x,y=y))+
# Specify the type of plot
geom_segment(aes(xend = x, yend = 0)) +
geom_point(size=2) +
# Specify x axis breaks
scale_x_continuous(breaks = 0:10) +
# Change label on y axis
labs(y="P(X)", subtitle = "CDF")+
# (Optional: specify theme )
theme_bw()
A Poisson random variable is used to model the number of occurrences of an event within a fixed interval of time or space, given that these events occur with a known constant rate and independently of the time since the last event. The Poisson distribution has a single key parameter: (1) the rate or mean number of occurrences (\(\lambda\)) over the specified interval.
The pmf of a Poisson random variable is given by:
\[ \begin{aligned} y_i \sim f_{\text{Pois}}(y_i \mid \lambda) & = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} \quad \quad \text{for } y_i = 0, 1, 2, \ldots \\[2ex] E[y] & = \lambda \\[2ex] V[y] & = \lambda \end{aligned} \]
Base R functions to work with the Poisson distribution are:
Function | Purpose | Example |
---|---|---|
rpois() | Generate random Poisson samples | rpois(n = 10, lambda = 2.5) |
dpois() | Compute PMF (probability mass function) | dpois(x = 3, lambda = 2.5) |
ppois() | Compute CDF (cumulative distribution function) | ppois(q = 3, lambda = 2.5, lower.tail = TRUE) |
qpois() | Compute quantiles | qpois(p = 0.5, lambda = 2.5, lower.tail = TRUE) |
where:

- n: Number of random samples to generate (for rpois()).
- x: Number of occurrences for calculating the PMF (for dpois()).
- q: Number of occurrences for calculating the CDF (for ppois()).
- p: Cumulative probability value for the quantile function (for qpois()).
- lambda: Rate or expected number of occurrences (\(\lambda\)).
- lower.tail: Logical; if TRUE, calculates the lower tail probability (\(P(X \leq q)\)).

# Generate 10 random samples
set.seed(123)
rpois(n = 10, lambda = 2.5)
## [1] 2 4 2 4 5 0 2 5 3 2
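As a quick sanity check on the equal mean and variance of the Poisson (a sketch; the number of draws is arbitrary), the empirical moments of a large sample should both be close to \(\lambda\):
set.seed(123)
draws <- rpois(n = 100000, lambda = 2.5)
mean(draws) # close to lambda = 2.5
var(draws)  # also close to lambda = 2.5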
Both the Binomial and Poisson distributions model counting events but differ in context. The Binomial distribution counts the number of successes in a fixed number of trials, while the Poisson distribution models the number of events occurring in a fixed interval of time, assuming independence and a constant rate.
However, these distributions rely on strong assumptions that are often unrealistic in social processes (e.g., fixed time intervals in Poisson). Later, we will explore more flexible alternatives, such as the negative-binomial, beta-binomial, and zero-inflated negative binomial distributions, among others.
Probability with continuous variables is conceptualized in ranges, as in the probability of a realization falling between two values. This is computed from the PDF by finding the area under the curve in that range.
A Uniform random variable is used to model scenarios where all outcomes within a specified interval are equally likely. It is defined by two parameters: (1) the minimum value (\(a\)) and (2) the maximum value (\(b\)). The Uniform distribution has the following pdf:
\[ \begin{aligned} X & \sim \text{Uniform}(a, b) \\ f_X(x \mid a, b) & = \frac{1}{b - a} \quad \quad \text{for } a \leq x \leq b \\[2ex] E[X] & = \frac{a + b}{2} \\[2ex] V[X] & = \frac{(b - a)^2}{12} \end{aligned} \]
Base R functions to work with the Uniform distribution are:
Function | Purpose | Example |
---|---|---|
runif() | Generate random Uniform samples | runif(n = 10, min = 0, max = 1) |
dunif() | Compute PDF (probability density function) | dunif(x = 0.5, min = 0, max = 1) |
punif() | Compute CDF (cumulative distribution function) | punif(q = 0.5, min = 0, max = 1, lower.tail = TRUE) |
qunif() | Compute quantiles | qunif(p = 0.5, min = 0, max = 1, lower.tail = TRUE) |
where:

- n: Number of random samples to generate (for runif()).
- x: Value at which to calculate the PDF (for dunif()).
- q: Value for calculating the CDF (for punif()).
- p: Cumulative probability value for the quantile function (for qunif()).
- min: Minimum value of the Uniform distribution (\(a\)).
- max: Maximum value of the Uniform distribution (\(b\)).
- lower.tail: Logical; if TRUE, calculates the lower tail probability \(P(X \leq q)\).

# Generate 10 random samples from Uniform(0, 1)
set.seed(123)
runif(n = 10, min = 0, max = 1)
## [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565 0.5281055
## [8] 0.8924190 0.5514350 0.4566147
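As a sketch (the sample size is arbitrary), the empirical mean and variance of a large Uniform(0, 1) sample can be checked against the formulas above, \((a + b)/2 = 0.5\) and \((b - a)^2/12 \approx 0.083\):
set.seed(123)
u <- runif(n = 100000, min = 0, max = 1)
mean(u) # close to (0 + 1) / 2 = 0.5
var(u)  # close to (1 - 0)^2 / 12, about 0.083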
A Normal random variable, often referred to as a Gaussian variable, is widely used to model continuous data that cluster around a mean value. It is characterized by two key parameters: (1) the mean (\(\mu\)) and (2) the variance (\(\sigma^2\)). The shape of the distribution is symmetric and bell-shaped, with the mean determining the center and the variance controlling the spread.
The pdf of a Normal random variable is given by:
\[ \begin{aligned} y_i \sim f_{\text{Norm}}(y_i \mid \mu, \sigma^2) & = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \quad \quad \text{for } y_i \in (-\infty, \infty) \\[2ex] E[y] & = \mu \\[2ex] V[y] & = \sigma^2 \end{aligned} \]
Base R functions to work with the Gaussian (Normal) distribution are:
Function | Purpose | Example |
---|---|---|
rnorm() | Generate random Normal samples | rnorm(n = 10, mean = 0, sd = 1) |
dnorm() | Compute PDF (probability density function) | dnorm(x = 1, mean = 0, sd = 1) |
pnorm() | Compute CDF (cumulative distribution function) | pnorm(q = 1, mean = 0, sd = 1, lower.tail = TRUE) |
qnorm() | Compute quantiles | qnorm(p = 0.5, mean = 0, sd = 1, lower.tail = TRUE) |
where:

- n: Number of random samples to generate (for rnorm()).
- x: Value at which to calculate the PDF (for dnorm()).
- q: Value for calculating the CDF (for pnorm()).
- p: Cumulative probability value for the quantile function (for qnorm()).
- mean: Mean of the Normal distribution (\(\mu\)).
- sd: Standard deviation of the Normal distribution (\(\sigma\)).
- lower.tail: Logical; if TRUE, calculates the lower tail probability (\(P(X \leq q)\)).

# Generate 10 random samples from N(0, 1)
set.seed(123)
rnorm(n = 10, mean = 0, sd = 1)
## [1] -0.56047565 -0.23017749 1.55870831 0.07050839 0.12928774 1.71506499
## [7] 0.46091621 -1.26506123 -0.68685285 -0.44566197
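In practice, the CDF and quantile functions are the ones used most often for probabilities and critical values. A minimal sketch with the standard normal:
# Probability of falling within about 1.96 standard deviations of the mean
pnorm(q = 1.96, mean = 0, sd = 1) - pnorm(q = -1.96, mean = 0, sd = 1)
# Quantile that leaves 2.5% of the mass in the upper tail
qnorm(p = 0.975, mean = 0, sd = 1)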
A standardized normal distribution, also known as the standard normal distribution, is a special case of the normal distribution where the mean (\(\mu\)) is 0 and the variance (\(\sigma^2\)) is 1. Its probability density function (pdf) is given by:
\[ \begin{aligned} y_i \sim f_{\text{Norm}}(y_i \mid 0, 1) & = \frac{1}{\sqrt{2\pi}} \exp \left(-\frac{y_i^2}{2}\right) \quad \quad \text{for } y_i \in (-\infty, \infty) \\[2ex] E[y] & = 0 \\[2ex] V[y] & = 1 \end{aligned} \]
The standard normal distribution plays a crucial role in statistics, particularly in hypothesis testing and confidence interval estimation, due to its fundamental properties. Other continuous distributions, like the Chi-Square or Student’s t distributions, are derived from the standardized normal.
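Any normal variable can be brought to the standard normal by subtracting its mean and dividing by its standard deviation, \(z_i = (y_i - \mu)/\sigma\). A short sketch with arbitrary parameters (scale() performs the same transformation):
set.seed(123)
y <- rnorm(n = 1000, mean = 10, sd = 3)
# Standardize: subtract the sample mean, divide by the sample standard deviation
z <- (y - mean(y)) / sd(y) # equivalent to as.numeric(scale(y))
mean(z) # essentially 0 (up to floating-point error)
sd(z)   # exactly 1 by construction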
In statistical inference, we are concerned with making predictions (inferences) about a data-generating process (DGP) or population based on information obtained from a sample.
A statistical model is a formal representation of the process by which a DGP produces outputs. Since no DGP is deterministic, statistical models are assumed to have both a systematic and a stochastic component.
\[ Y_i = \underbrace{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}}_{\text{Systematic Component}} \, + \, \underbrace{\varepsilon_i}_{\text{Stochastic Component}} \]
In the frequentist approach to statistical inference, the systematic component is fixed in the population and remains constant across samples. In contrast, the stochastic component varies from sample to sample, though its distribution remains the same.
It is common to distinguish between the population model, which represents the true data-generating process (DGP), and the sample (or empirical) model, which is the specified model fitted to the data.
\[ \begin{aligned} \text{Population model:} \quad Y_i & = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i \\ \text{Sample model:} \quad \hat{y}_i & = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \cdots + \hat{\beta}_p x_{pi} \end{aligned} \]
Notice that the assumption of correct specification (no misspecification) means that the sample model includes the same variables and functional form as the population model.
Inference involves the interconnection between three concepts:
The Central Limit Theorem (CLT) states that the distribution of the means of repeated samples drawn from any distribution with finite variance will approach a normal distribution, as long as the samples from which the means are computed are large enough.
# Set parameters
set.seed(123) # For reproducibility
p <- 0.3 # Probability of success for Bernoulli distribution
n_samples <- 1000 # Number of repeated samples for each sample size
sample_sizes <- c(5, 30, 100) # Different sample sizes to demonstrate the CLT
## Simulation of CLT
# Create an empty data frame to store results
results <- data.frame()
# Loop through different sample sizes and calculate sample means
for (n in sample_sizes) {
# Generate sample means for the given sample size
sample_means <- replicate(n_samples,
mean(sample(c(0, 1),
size = n,
replace = TRUE,
prob = c(1 - p, p))))
# Store results in a data frame
temp_df <- data.frame(Sample_Size = factor(n),
Sample_Means = sample_means)
results <- rbind(results, temp_df)
}
# Calculate theoretical normal distribution parameters for each sample size
results <-
results |>
group_by(Sample_Size) |>
mutate(Theoretical_Mean = p,
Theoretical_SD = sqrt(p * (1 - p) / as.numeric(as.character(Sample_Size))))
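The simulated sampling distributions can then be compared to their normal approximations, \(\mathcal{N}(p, \, p(1-p)/n)\). Below is one way to sketch this with ggplot2 (assumed to be loaded, as in the other plots in these notes):
# Sampling distribution of the mean for each sample size,
# with the CLT normal approximation overlaid
results |>
  ggplot(aes(x = Sample_Means)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30, fill = "grey80", color = "white") +
  geom_line(aes(y = dnorm(Sample_Means,
                          mean = Theoretical_Mean,
                          sd = Theoretical_SD))) +
  facet_wrap(~ Sample_Size, scales = "free") +
  labs(x = "Sample mean", y = "Density") +
  theme_bw()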
The Law of Large Numbers (LLN) states that, as the sample size increases, the sample mean will converge to the true population mean with high probability. This means that, for a sufficiently large sample, the sample mean is a consistent estimator of the population mean.
# Coin Flips
# Set the seed for reproducible results
set.seed(23212)
# Number of trials
n <- seq(1, 200, length = 200)
# Empty vector for heads in one set of trials
heads <- numeric(length(n))
# Empty vector for cumulative proportions
c.heads <- numeric(length(n))
# Two sides of the coin
k <- c("Tails","Heads")
# Probabilities of heads and tails
p <- c(0.5, 0.5)
# Loop over the n trials
for (i in 1:length(n)) {
# Count the number of heads from n[i] trials
heads[i] <- sum(sample(k,
size = n[i],
prob = p,
replace = TRUE) == "Heads")
# Compute the cumulative proportion of heads
c.heads[i] <- sum(heads[1:i]/sum(n[1:i]))
}
dat <- tibble(probs=c.heads,
trials=1:200)
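Plotting the cumulative proportion of heads against the number of trial sets shows the convergence toward the true probability of 0.5 (a minimal ggplot2 sketch):
dat |>
  ggplot(aes(x = trials, y = probs)) +
  geom_line() +
  # Reference line at the true probability of heads
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  labs(x = "Number of trial sets",
       y = "Cumulative proportion of heads") +
  theme_bw()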
The equation for a simple linear regression model with only one predictor \(X\), an intercept \(\beta_0\), and a slope coefficient \(\beta_1\) is given by:
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]
For OLS to produce unbiased and efficient estimates of \(\beta_0\) and \(\beta_1\), several assumptions must hold. These assumptions are also required for the Gauss-Markov theorem, which states that under these conditions, the OLS estimators are the Best Linear Unbiased Estimators (BLUE).
Assumption 1 (linearity): the model is linear in the parameters, so the conditional mean of \(Y\) given \(X\) is
\[ E[Y_i \mid X_i] = \beta_0 + \beta_1 X_i. \]
For linear regression, it is also implicitly assumed that the levels of \(Y\) are continuous and have equal intervals, meaning the difference between \(Y = 1\) and \(Y = 2\) is the same as the difference between \(Y = 2\) and \(Y = 3\).
This is crucial because the marginal effect (change in \(Y\) for a unit change in \(X\)) is meaningful only if \(Y\) is measured on a scale with equal-interval properties.
Assumption 2 (no autocorrelation): the errors are uncorrelated across observations,
\[ \text{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for } i \neq j. \]
Assumption 3 (homoskedasticity): the errors have constant variance,
\[ \text{Var}(\varepsilon_i) = \sigma^2 \quad \forall \, i. \]
Assumption 4 (zero conditional mean, or exogeneity): the errors have mean zero given the predictor,
\[ E[\varepsilon_i \mid X_i] = 0. \]
Assumption 5 (normality of the errors; used for exact inference rather than for BLUE):
\[ \varepsilon_i \sim \mathcal{N}(0, \sigma^2). \]
The Gauss-Markov theorem states that, under assumptions 1 through 4 (normality is not needed for this result), the OLS estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the Best Linear Unbiased Estimators (BLUE). In other words, OLS is the most efficient linear unbiased estimator under those assumptions.
Below, we simulate an OLS estimation under perfect conditions: no assumption is violated and the estimates \(\hat{\beta}\) are BLUE. Assume the following DGP:
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]
where \(X\) is drawn from a normal distribution with mean 0 and standard deviation 2, \(\varepsilon_i \sim \mathcal{N}(0,1)\), and the true parameters are \(\beta_0 = 0.2\) and \(\beta_1 = 0.5\). We can simulate this regression model with one sample of 600 observations.
set.seed(342342)
## BLUE
# sample size
n <- 600
# systematic component
b0 <- 0.2 # True value for the intercept
b1 <- 0.5 # True value for the slope
X <- rnorm(n, 0, 2) # independent variable
# stochastic component
e <- rnorm(n, 0, 1) # error term
# "True" DGP: BLUE
Y <- b0 + b1*X + e
mean(Y) # mean
## [1] 0.1533692
lm(Y ~ 1) # intercept only
##
## Call:
## lm(formula = Y ~ 1)
##
## Coefficients:
## (Intercept)
## 0.1534
lm(Y ~ X) # one predictor
##
## Call:
## lm(formula = Y ~ X)
##
## Coefficients:
## (Intercept) X
## 0.1892 0.5083
Note that the above analysis is based on a single sample, as is common in most empirical research. Even under ideal conditions, the estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) will not perfectly match the true parameters \(\beta_0\) and \(\beta_1\). This is because unbiasedness does not guarantee that any particular estimate will equal the true parameter.
Instead, unbiasedness means that if we were to repeatedly draw different samples from the same data-generating process (DGP) and compute estimates using the same unbiased estimator, the average of these estimates would equal the true parameter value. In other words, an estimator is considered unbiased if the expected value of its estimates equals the true parameter value over many repeated samples.
## Simulate repeated samples
sims <- 500
par.est.1 <-
tibble(
b0_hat = rep(NA,sims),
b1_hat = rep(NA,sims),
model = "BLUE"
)
for (i in 1:sims) {
# Start the loop
# The DGP: BLUE
Y <- b0 + b1*X + rnorm(n, 0, 1)
# Estimate OLS model
out <- lm(Y ~ X)
# Extract estimates and store in the matrix columns
par.est.1[i, 1] <- out$coef[1]
par.est.1[i, 2] <- out$coef[2]
}
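Averaging the estimates over the 500 simulated samples shows them centered on the true values \(\beta_0 = 0.2\) and \(\beta_1 = 0.5\), which is exactly what unbiasedness means. A quick check (a sketch using dplyr, assumed loaded as elsewhere in these notes):
# Means of the estimates across repeated samples
par.est.1 |>
  summarize(mean_b0_hat = mean(b0_hat),
            mean_b1_hat = mean(b1_hat))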
Now, we will violate one of the assumptions of the Gauss-Markov theorem: homoskedastic errors. This assumption states that the estimated errors have a constant variance. In contrast, we will introduce heteroskedasticity by allowing the variance of the errors to be correlated with the predictor \(X\). Specifically, as the value of \(X\) increases, the variance of \(Y\) (and its error term \(\varepsilon\)) will also increase:
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad \text{where} \quad \varepsilon_i \sim \mathcal{N}(0, \sigma(X_i)). \]
Notice that in the standard model, the variance of \(\varepsilon_i\) is constant. However, here we use \(\sigma(X_i)\) to indicate that the variance is a function of the independent variable \(X\). Alternatively, you may encounter the notation \(\sigma_i\), where the subscript \(i\) implies that the standard deviation varies for each observation, indicating non-constant variance in the error terms.
## Heteroskedasticity
set.seed(234242)
# Heteroskedasticity parameter
gamma <- 0.35
# "True" DGP: BLUE
Y <- b0 + b1*X + e
# "True" DGP: Heteroskedasticity
Y.2 <- b0 + b1*X + rnorm(n, 0, exp(X*gamma))
cor(X,rnorm(n, 0, exp(X*gamma)))
## [1] -0.005925086
tibble(X,Y,Y.2) |>
pivot_longer(cols=2:3,
names_to = "outcome",
values_to = "Y") |>
mutate(outcome=fct_recode(outcome,
"Homoskedasticity" = "Y",
"Heteroskedasticity" = "Y.2")) |>
ggplot(aes(x=X,
y=Y)) +
geom_point() +
theme_bw() +
facet_wrap(.~ outcome) +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
text = element_text(size = 14))
## Simulation for heteroskedasticity
par.est.2 <-
tibble(
b0_hat = rep(NA,sims),
b1_hat = rep(NA,sims),
model = "Heteroskedastic"
)
for (i in 1:sims) {
# Start the loop
# The DGP: Heteroskedasticity
Y <- b0 + b1*X + rnorm(n, 0, exp(X*gamma))
# Estimate OLS model
out <- lm(Y ~ X)
# Extract estimates and store in the matrix columns
par.est.2[i, 1] <- out$coef[1]
par.est.2[i, 2] <- out$coef[2]
}
# Compare variance: beta_0
var(par.est.1[,1]); var(par.est.2[,1])
## b0_hat
## b0_hat 0.001603449
## b0_hat
## b0_hat 0.005044727
# Compare variance: beta_1
var(par.est.1[,2]); var(par.est.2[,2])
## b1_hat
## b1_hat 0.0003906309
## b1_hat
## b1_hat 0.003061268
When comparing the variances, we can see that the variances of the estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are higher in the model with heteroskedasticity compared to the model with homoskedasticity. In other words, the OLS estimates become inefficient and are no longer BLUE (Best Linear Unbiased Estimators) when the assumption of homoskedasticity is violated.
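The loss of efficiency is easier to see by comparing the full sampling distributions of \(\hat{\beta_1}\) under the two scenarios. A sketch using dplyr and ggplot2 (both assumed loaded):
# Compare the spread of the slope estimates across the two simulations
bind_rows(par.est.1, par.est.2) |>
  ggplot(aes(x = b1_hat, fill = model)) +
  geom_density(alpha = 0.5) +
  # True value of the slope
  geom_vline(xintercept = b1, linetype = "dashed") +
  labs(x = expression(hat(beta)[1]), y = "Density", fill = "Model") +
  theme_bw()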
As we can see, under heteroskedasticity, the OLS estimator is inefficient.
We use goodness-of-fit statistics to evaluate how well our model projects or predicts the outcome variable. For linear models estimated using least squares, the most common metrics are the Mean Squared Error (MSE) and Adjusted R-Squared. MSE is a standard measure for assessing bias and efficiency. However, Adjusted R-Squared is useful for comparing the predictive power of models with differently scaled outcomes, such as comparing a model with \(Y\) and \(\log(Y)\). Both statistics are derived from the Sum of Squared Residuals (SSR).
Important considerations:
Consistency in Samples: When comparing models, always use the same sample. If you compare models with a different number of observations due to data cleaning issues, metrics like MSE are not comparable. This is especially crucial when interpreting coefficients.
In-Sample Metrics: These goodness-of-fit statistics are in-sample metrics and may be misleading, as models can overfit by learning irregular patterns. We will explore out-of-sample statistics to improve model assessment in this course.
Causal Inference: If you’re focused on causal effects, you can ignore these statistics. Models with high predictive power can sometimes produce biased coefficient estimates, e.g., due to collider or mediation bias (see Cinelli et al., 2022).
The Mean Squared Error (MSE) is a metric used to evaluate the performance of an estimator by measuring the average squared difference between the observed values and the predicted or estimated values. It is commonly used in regression analysis to quantify the accuracy of predictions.
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
where:

- \(y_i\): observed value of the outcome for observation \(i\).
- \(\hat{y}_i\): predicted value from the model for observation \(i\).
- \(n\): number of observations.
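In R, the MSE of a fitted linear model can be computed directly from its residuals. A minimal sketch, refitting the model on the simulated Y and X from above (substitute any other lm object):
# MSE = average squared residual of the fitted model
fit <- lm(Y ~ X)
mean(residuals(fit)^2)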
The Adjusted R-squared (\(R^2_{adj}\)) is a modified version of the R-squared statistic that accounts for the number of predictors in the model. It indicates the proportion of variance explained by the model, adjusted for the number of predictors. Adjusted R-squared is especially useful when comparing models with different numbers of predictors or when the scale of the outcome variable varies.
\[ R^2_{\text{adj}} = 1 - \frac{\frac{\text{SSR}}{n - p - 1}}{\frac{\text{TSS}}{n - 1}} \]
where:

- \(\text{SSR}\): sum of squared residuals, \(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\).
- \(\text{TSS}\): total sum of squares, \(\sum_{i=1}^{n} (y_i - \bar{y})^2\).
- \(n\): number of observations.
- \(p\): number of predictors in the model.
Intuition:
Numerator: \(\frac{\text{SSR}}{n - p - 1}\) represents the Mean Squared Error (MSE), adjusted for the number of predictors. This measures the prediction error of the specified model with covariates: \(Y_i = \beta_0 + \mathbf{X}\beta_k + \varepsilon_i\).
Denominator: \(\frac{\text{TSS}}{n - 1}\) represents the total variance in the observed values. You can think of it as the variance of the errors when fitting an intercept-only model: \(Y_i = \beta_0 + \varepsilon_i\).
Adjusted R-Squared: Indicates the proportion of the total variance explained by the model, penalized for adding predictors. In other words, it evaluates the ratio of variance explained between the fitted model with covariates and the intercept-only model, while penalizing for the effective number of degrees of freedom.
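The formula can be checked against R's built-in computation. A sketch, again using the simulated Y and X from above:
# Manual adjusted R-squared versus the value reported by summary()
fit <- lm(Y ~ X)
y_obs  <- model.response(model.frame(fit)) # observed outcome
n_obs  <- length(y_obs)                    # number of observations
n_pred <- length(coef(fit)) - 1            # predictors, excluding the intercept
ssr <- sum(residuals(fit)^2)               # sum of squared residuals
tss <- sum((y_obs - mean(y_obs))^2)        # total sum of squares
1 - (ssr / (n_obs - n_pred - 1)) / (tss / (n_obs - 1))
summary(fit)$adj.r.squared # should match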