In today’s lab, we will explore several goodness-of-fit techniques to assess the performance of our binary models.
To start, we’ll re-fit the same binary models from last week, using data from the 2000 American National Election Survey (ANES). In this dataset, turnout is the binary outcome: either individuals voted (\(\text{Vote}_i = 1\)) or they did not (\(\text{Vote}_i = 0\)).
We’ll model the probability of voting in the 2000 U.S. presidential election as a function of Age, High School Degree (HS Degree), and College Degree (ColDeg):
\[ \begin{aligned} \text{Vote}_i & \sim \text{Bernoulli}(\pi_i) \\ \pi_i & = \text{logit}^{-1}(\mathbf{x}_i \boldsymbol{\beta}) \end{aligned} \]
Recall that we also fitted a second model that adds an indicator for being currently married (marriedo) as a predictor. Therefore, we have two binary logit models to compare: one that adjusts for marital status and one that does not.
# Load packages used below
library(tidyverse)      # read_csv(), dplyr verbs, and tibble()
library(RColorBrewer)   # brewer.pal()
# Get nice colors
col <- brewer.pal(5, "Set1")
blue <- col[2]
orange <- col[5]
# Models in R formula format
m1 <- vote00 ~ age + I(age^2) + hsdeg + coldeg
m2 <- vote00 ~ age + I(age^2) + hsdeg + coldeg + marriedo
# Note: the variable marriedo is current marrieds,
# the variable married is ever-marrieds
# Load data
fulldata <- read_csv("data/nes00a.csv")
# or
fulldata <- read_csv("https://faculty.washington.edu/cadolph/mle/nes00a.csv")
# Keep only cases observed for all models
# (extractdata() is from Chris Adolph's simcf package)
data <- extractdata(m2, fulldata, na.rm = TRUE)
# alternatively, with dplyr
data <- fulldata |> select(vote00,age,hsdeg,coldeg,marriedo) |> na.omit()
# Construct variables and model objects
y <- data$vote00
x1 <-
data |>
mutate(age2 = age^2) |>
select(age,age2,hsdeg,coldeg) |> as.matrix()
x2 <-
data |>
mutate(age2 = age^2) |>
select(age,age2,hsdeg,coldeg,marriedo) |> as.matrix()
# Likelihood function for logit (returns the negative log-likelihood)
llk.logit <- function(param, y, x) {
  os <- rep(1, length(x[, 1]))   # column of 1s for the intercept
  x <- cbind(os, x)              # prepend the intercept column
  b <- param[1:ncol(x)]
  xb <- x %*% b
  sum(y*log(1 + exp(-xb)) + (1 - y)*log(1 + exp(xb)))
  # optim is a minimizer, so we minimize -ln L(param|y)
}
# Fit logit model using optim
ls.result <- lm(y~x1) # use ls estimates as starting values
stval <- ls.result$coefficients # initial guesses
logit.m1 <- optim(par=stval,
fn=llk.logit,
method="Nelder-Mead",
hessian=T,
y=y,
x=x1)
# call minimizer procedure
pe.m1 <- logit.m1$par # point estimates
vc.m1 <- solve(logit.m1$hessian) # var-cov matrix
se.m1 <- sqrt(diag(vc.m1)) # standard errors
ll.m1 <- -logit.m1$value # likelihood at maximum
# Alternative estimation technique: GLM
glm.m1 <- glm(m1,
family=binomial(link="logit"),
data=data)
coef(glm.m1)
## (Intercept) age I(age^2) hsdeg coldeg
## -3.0193890720 0.0747251922 -0.0004427014 1.1243907749 1.0795702357
# Fit logit model with added covariate: marriedo (currently married)
ls.result <- lm(y~x2) # use ls estimates as starting values
stval <- ls.result$coefficients # initial guesses
logit.m2 <- optim(par=stval,
fn=llk.logit,
method="Nelder-Mead",
hessian=T,y=y,x=x2)
# call minimizer procedure
pe.m2 <- logit.m2$par # point estimates
vc.m2 <- solve(logit.m2$hessian) # var-cov matrix
se.m2 <- sqrt(diag(vc.m2)) # standard errors
ll.m2 <- -logit.m2$value # likelihood at maximum
# GLM estimation of model with marriedo
glm.m2 <- glm(m2,
family=binomial(link="logit"),
data=data)
coef(glm.m2)
## (Intercept) age I(age^2) hsdeg coldeg
## -2.8661484043 0.0614366824 -0.0003175198 1.0994894781 1.0525405326
## marriedo
## 0.3729890203
summary.glm(glm.m2)
##
## Call:
## glm(formula = m2, family = binomial(link = "logit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8661484 0.4215276 -6.799 1.05e-11 ***
## age 0.0614367 0.0173301 3.545 0.000392 ***
## I(age^2) -0.0003175 0.0001701 -1.867 0.061944 .
## hsdeg 1.0994895 0.1806362 6.087 1.15e-09 ***
## coldeg 1.0525405 0.1317486 7.989 1.36e-15 ***
## marriedo 0.3729890 0.1099446 3.393 0.000693 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2293.5 on 1782 degrees of freedom
## Residual deviance: 2057.5 on 1777 degrees of freedom
## AIC: 2069.5
##
## Number of Fisher Scoring iterations: 4
Now that we have fitted two models, we can ask: which one best fits the data-generating process (DGP)? Below we explore several techniques to assess this question.
For models where we assume a binomial distribution (e.g., logistic regression for binary outcomes), the deviance is a measure of goodness of fit that compares the fitted model with a saturated model (a model that fits the data perfectly).
The deviance for a model \(\mathcal{M}\) with categorical outcomes, particularly in logistic regression, is defined as:
\[ D(\mathcal{M}) = -2 \left( \log \mathcal{L}(\mathcal{M}) - \log \mathcal{L}(\text{saturated model}) \right), \] where \(\log \mathcal{L}(\mathcal{M})\) is the maximized log-likelihood of the fitted model and \(\log \mathcal{L}(\text{saturated model})\) is the log-likelihood of the saturated model.
In practical terms this simplifies, because the deviance of the saturated model is a constant (zero for ungrouped binary data), so in relative comparisons we can treat the deviance as \(-2 \log \mathcal{L}(\mathcal{M})\).
So, while \(D(\mathcal{M}) = -2 \log \mathcal{L}(\mathcal{M})\) is commonly used as shorthand, strictly speaking, deviance is defined relative to a saturated model in categorical outcome models.
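For ungrouped binary data, the saturated model assigns probability one to each observed outcome, so its log-likelihood is zero and the deviance reduces exactly to \(-2 \log \mathcal{L}(\mathcal{M})\). A quick check against the fitted glm objects:
# Check: residual deviance equals -2 * log-likelihood for these binary models
all.equal(deviance(glm.m1), -2*as.numeric(logLik(glm.m1)))
all.equal(deviance(glm.m2), -2*as.numeric(logLik(glm.m2)))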
Given two nested models:
\[ \begin{aligned} \mathcal{M}_{1} &: \text{logit}^{-1}(\beta_{0} + \beta_{1}x_{1} + \dots +\beta_{i}x_{i})\\ \mathcal{M}_{2} &: \text{logit}^{-1}(\beta_{0} + \beta_{1}x_{1} + \dots +\beta_{i}x_{i}+\beta_{i+1}x_{i+1}+ \dots +\beta_{i+j}x_{i+j}) \end{aligned} \]
The likelihood ratio statistic tests the null hypothesis that the additional parameters are equal to zero:
\[H_{0}: \beta_{i+1}= \dots =\beta_{i+j}=0\]
The likelihood ratio test (LRT) is a statistical test for comparing the fit of two nested models: a simpler, restricted model (\(\mathcal{M}_1\)) and a more complex, unrestricted model (\(\mathcal{M}_2\)).
The likelihood ratio statistic is calculated as:
\[ \text{LR} = -2 \log \frac{\mathcal{L}(\mathcal{M}_1)}{\mathcal{L}(\mathcal{M}_2)} = 2 \left( \log \mathcal{L}(\mathcal{M}_2) - \log \mathcal{L}(\mathcal{M}_1) \right) \]
Under the null hypothesis that the restricted model \(\mathcal{M}_1\) is true, the test statistic \(\text{LR}\) follows a chi-squared distribution, \(f_{\chi^2}(m)\), where \(m\) is the number of restrictions imposed by \(\mathcal{M}_1\) (here, the \(j\) additional parameters constrained to zero). This test is commonly used to assess whether the additional complexity of \(\mathcal{M}_2\) significantly improves model fit.
# Likelihood ratio analysis
## Extract log-likelihood values
# from optim
(ll.m1 <- -logit.m1$value) # likelihood at maximum
## [1] -1034.478
(ll.m2 <- -logit.m2$value) # likelihood at maximum
## [1] -1040.629
# from glm
(ll.m1 <- logLik(glm.m1) |> as.numeric())
## [1] -1034.478
(ll.m2 <- logLik(glm.m2) |> as.numeric())
## [1] -1028.728
# Deviance (the "-0" terms refer to the log-likelihood of
# the saturated model, which is zero for categorical outcomes)
(deviance.m1 <- -2*(ll.m1 - 0))
## [1] 2068.956
(deviance.m2 <- -2*(ll.m2 - 0))
## [1] 2057.455
## Check number of parameters in each model
(k.m1 <- length(pe.m1))
## [1] 5
(k.m2 <- length(pe.m2))
## [1] 6
## Likelihood ratio (LR) test
(lr.test <- 2*(ll.m2 - ll.m1))
## [1] 11.50118
(lr.test.p <- pchisq(lr.test,
df=(k.m2 - k.m1),
lower.tail=FALSE))
## [1] 0.0006955193
# Alternatively, you can perform the test with anova()
## Perform likelihood ratio test between nested models
(lrt_result <- anova(glm.m1, glm.m2, test = "LRT"))
## Analysis of Deviance Table
##
## Model 1: vote00 ~ age + I(age^2) + hsdeg + coldeg
## Model 2: vote00 ~ age + I(age^2) + hsdeg + coldeg + marriedo
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1778 2069.0
## 2 1777 2057.5 1 11.501 0.0006955 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
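If the lmtest package is installed (an optional alternative, not part of the original lab code), its lrtest() function reports the same statistic and p-value directly:
# Equivalent LR test via the lmtest package (optional)
library(lmtest)
lrtest(glm.m1, glm.m2)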
For non-nested models, we cannot use likelihood ratio tests. Instead we turn to several information criteria measures to assess model fit, which can be thought of as penalized LR tests.
The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) compare models by penalizing the likelihood for the number of parameters, thus accounting for model complexity. Lower values of AIC or BIC indicate a better fit.
Both criteria provide a quantitative means to assess model performance, with BIC often favoring simpler models due to a stronger penalty on model complexity compared to AIC.
\[ \begin{aligned} \text{AIC}_{\mathcal{M}_2} &= 2 \times p_2 - 2 \log \mathcal{L}(\mathcal{M}_2) \\ \text{AIC}_{\mathcal{M}_1} &= 2 \times p_1 - 2 \log \mathcal{L}(\mathcal{M}_1) \end{aligned} \]
where \(p\) is the number of parameters in each model's likelihood \(\mathcal{L}(\mathcal{M})\). Notice that as the number of parameters increases, the penalty term \(2 \times p\) grows, raising the AIC unless the added parameters substantially improve the model's likelihood. Thus, lower AIC values indicate a better fit.
# Akaike Information Criterion (AIC)
(aic.m1 <- 2*k.m1 - 2*ll.m1)
## [1] 2078.956
(aic.m2 <- 2*k.m2 - 2*ll.m2)
## [1] 2069.455
(aic.test <- aic.m2 - aic.m1)
## [1] -9.501183
The BIC is based on a Bayesian comparison of models proposed by Raftery (1996).
\[ \begin{aligned} \text{BIC}_{\mathcal{M}_2} & = \log(n) \times p_2 - 2 \log \mathcal{L}(\mathcal{M}_2) \\ \text{BIC}_{\mathcal{M}_1} &= \log(n) \times p_1 - 2 \log \mathcal{L}(\mathcal{M}_1) \end{aligned} \]
where \(p\) is the number of parameters, and \(n\) is the number of observations. Lower BIC values are preferred, as they indicate a better balance between model fit and complexity.
More formally, recall Bayes theorem:
\[ \begin{aligned} P(\boldsymbol{\theta}|\boldsymbol{y}) &= \frac{P(\boldsymbol{y}|\boldsymbol{\theta})P(\boldsymbol{\theta})}{P(\boldsymbol{y})}\\ \Bigg[\frac{\text{P}(\mathcal{M}_{1}|\text{Observed Data})}{\text{P}(\mathcal{M}_{2}|\text{Observed Data})} \Bigg] &\approx \Bigg[\frac{\text{P}(\text{Observed Data}|\mathcal{M}_{1})}{\text{P}(\text{Observed Data}|\mathcal{M}_{2})} \times \frac{\text{P}(\mathcal{M}_{1})}{\text{P}(\mathcal{M}_{2})}\Bigg] \end{aligned} \]
If we assume that \(\frac{\text{P}(\mathcal{M}_{1})}{\text{P}(\mathcal{M}_{2})}=1\), Raftery shows that
\[ 2 \log \Bigg[\frac{\text{P}(\text{Observed Data}|\mathcal{M}_{1})}{\text{P}(\text{Observed Data}|\mathcal{M}_{2})} \Bigg] \approx \text{BIC}_{\mathcal{M}_2} - \text{BIC}_{\mathcal{M}_1} \]
A model with a smaller BIC is again preferred. Raftery (1995) suggested rough guidelines for the strength of evidence favoring \(\mathcal{M}_1\) over \(\mathcal{M}_2\) based on the absolute BIC difference: a difference of 0–2 is weak evidence, 2–6 is positive evidence, 6–10 is strong evidence, and more than 10 is very strong evidence.
However, these cutoffs carry some degree of arbitrariness. Domain knowledge is what ultimately helps decide whether a small difference is worth modeling.
# Bayesian Information Criterion (BIC)
(bic.m1 <- log(nrow(x1))*k.m1 - 2*ll.m1)
## [1] 2106.387
(bic.m2 <- log(nrow(x2))*k.m2 - 2*ll.m2)
## [1] 2102.372
(bic.test <- bic.m2 - bic.m1)
## [1] -4.01513
# compare information criteria
aic.test ; bic.test
## [1] -9.501183
## [1] -4.01513
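Since \(2 \log B_{12} \approx \text{BIC}_{\mathcal{M}_2} - \text{BIC}_{\mathcal{M}_1}\), the BIC difference computed above can be translated into an approximate Bayes factor; a small sketch using the quantities already in memory:
# Approximate Bayes factor implied by the BIC difference
(bf.12 <- exp(bic.test/2))   # evidence for M1 over M2; values below 1 favor M2
(bf.21 <- 1/bf.12)           # evidence for M2 over M1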
The Percent Correctly Predicted (PCP) is a metric used to assess the predictive accuracy of binary classification models by calculating the percentage of correctly classified observations based on a threshold. Specifically, PCP is defined as the proportion of observations where \(\hat{y}_i \geq 0.5\) and \(y_i = 1\) or \(\hat{y}_i < 0.5\) and \(y_i = 0\).
We can also compute the percentage of observations that our model correctly predicts, such that
\[\hat{y}\geq0.5 \quad \text{and} \quad y=1\]
or
\[\hat{y}<0.5 \quad \text{and} \quad y=0\]
In other words, we check to see how often the predicted probability from our model is greater than or equal to 0.5 when the observed value is 1 and how often it is less than 0.5 when the observed value is 0.
Recall that we want to report the improvement over the baseline, e.g., \(\text{PCP} - \overline{y}\). (The improve option of pcp.glm() used below reports the proportional reduction in error relative to the null model, \((\text{PCP} - \text{PCP}_{\text{null}})/(1 - \text{PCP}_{\text{null}})\), rather than the raw difference.)
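Before turning to the course helper function pcp.glm(), here is a minimal base-R sketch of the same quantities computed directly from the fitted probabilities, using the 0.5 threshold defined above:
# PCP "by hand" from the fitted probabilities, model 1
pcp1.hand <- mean((fitted(glm.m1) >= 0.5) == data$vote00)   # percent correctly predicted
pcp0.hand <- max(mean(data$vote00), 1 - mean(data$vote00))  # null model: always predict the modal category
pcp1.hand - pcp0.hand                                       # raw improvement over the baseline
(pcp1.hand - pcp0.hand)/(1 - pcp0.hand)                     # proportional reduction in error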
# Percent correctly predicted (using glm result and Chris' source code)
# goodness of fit for the null model:
(pcp.null <- pcp.glm(res=glm.m1,
y=data$vote00,
type="null"))
## [1] 0.6567583
# goodness of fit for model 1:
(pcp.m1 <- pcp.glm(res=glm.m1,
y=data$vote00,
type="model"))
## [1] 0.6999439
(pcp.m2 <- pcp.glm(res=glm.m2,
y=data$vote00,
type="model"))
## [1] 0.6977005
# percentage of improvement
(pcpi.m1 <- pcp.glm(res=glm.m1,
y=data$vote00,
type="improve"))
## [1] 0.125817
(pcpi.m2 <- pcp.glm(res=glm.m2,
y=data$vote00,
type="improve"))
## [1] 0.119281
tibble(
PCP = round(c(pcp.null,pcp.m1,pcp.m2,pcpi.m1,pcpi.m2,pcpi.m2-pcpi.m1),3),
type = c("null","model1","model2","improve1","improve2","diff")
)
## # A tibble: 6 × 2
## PCP type
## <dbl> <chr>
## 1 0.657 null
## 2 0.7 model1
## 3 0.698 model2
## 4 0.126 improve1
## 5 0.119 improve2
## 6 -0.007 diff
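The classification counts underlying these PCP figures can be checked with a simple cross-tabulation of 0.5-threshold predictions against the observed outcomes:
# Confusion matrix at the 0.5 threshold, model 1
table(predicted = as.numeric(fitted(glm.m1) >= 0.5),
      observed = data$vote00)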
## Another way to compute PCP with the pscl package
library(pscl)
hitmiss(glm.m1)
## Classification Threshold = 0.5
## y=0 y=1
## yhat=0 205 128
## yhat=1 407 1043
## Percent Correctly Predicted = 69.99%
## Percent Correctly Predicted = 33.5%, for y = 0
## Percent Correctly Predicted = 89.07% for y = 1
## Null Model Correctly Predicts 65.68%
## [1] 69.99439 33.49673 89.06917
hitmiss(glm.m2)
## Classification Threshold = 0.5
## y=0 y=1
## yhat=0 210 137
## yhat=1 402 1034
## Percent Correctly Predicted = 69.77%
## Percent Correctly Predicted = 34.31%, for y = 0
## Percent Correctly Predicted = 88.3% for y = 1
## Null Model Correctly Predicts 65.68%
## [1] 69.77005 34.31373 88.30060
Interpretation: A PCP below 50% would be undesirable, as it indicates worse-than-random prediction. Additionally, it’s important to consider the baseline mean of the outcome \(\bar{y} = 0.65675\); thus, these PCP values (around 70%) may not represent a substantial improvement. The difference \(\text{PCP} - \bar{y}\) indicates the added predictive value from including covariates in the model.
We can also visualize the PCP using separation plots.
Sort the observations by their predicted probability of the outcome (smallest to largest).
Plot 1s as red lines and 0s as tan lines.
Trace a horizontal black line showing the predicted probability for each case.
Red lines to the left, and tan lines to the right, of the point where the black line crosses the halfway mark are mispredictions.
A model that predicts accurately will be mostly tan on the left and red on the right.
# Separation plots
library(separationplot)   # for separationplot()
separationplot(pred=glm.m1$fitted.values, actual=glm.m1$y, newplot = F)
separationplot(pred=glm.m2$fitted.values, actual=glm.m2$y, newplot = F)
In regression analysis, actual vs. predicted plots are used to assess model fit by comparing observed values, \(y\), with predicted values, \(\hat{y}\). Ideally, these values should align closely with the 45° line, indicating accurate predictions. Deviations from this line can reveal noise (variance \(\sigma^2\)) or systematic patterns due to misspecification, such as omitted variables or an incorrect functional form.
For binary outcomes (0 or 1), directly plotting \(y\) against \(\hat{y}\) is less informative because the response is discrete. In this context, a binned approach is often used: sort the observations by their predicted probability, group them into bins, and compare the average observed outcome in each bin with the average predicted probability.
In a well-fitted model, the points in the actual vs. predicted plot should align closely with the 45° line. Deviations can indicate that the model over- or under-predicts the probability within certain ranges.
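The binning itself is simple to do by hand; a minimal dplyr sketch, assuming decile bins of the fitted probabilities for model 1 (the binPredict() helper introduced below packages this up with additional options):
# Binned actual vs. predicted "by hand": decile bins of fitted probabilities, model 1
data |>
  mutate(phat = fitted(glm.m1),
         bin = ntile(phat, 10)) |>
  group_by(bin) |>
  summarize(actual = mean(vote00), predicted = mean(phat))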
In R, a function like binPredict() can automate the binning and calculation of averages for actual vs. predicted plots, supporting options for equal-width bins (quantiles = FALSE) or an equal number of observations per bin (quantiles = TRUE). Example code to generate binned plots:
# Bin and plot for Model 1 with equal number of observations per bin
binnedM1 <- binPredict(glm.m1, col=blue, label="M1: Age, Edu", quantiles=TRUE)
# Bin and plot for Model 2 with equal number of observations per bin
binnedM2 <- binPredict(glm.m2, col=orange, label="M2: Age, Edu, Married", quantiles=TRUE)
Using actual vs. predicted plots provides valuable insights into the accuracy and reliability of predictions, especially for binary data, where careful binning allows for meaningful comparisons between observed frequencies and predicted probabilities across ranges.
# Show actual vs predicted of M1 on screen
plot(binnedM1, display="avp", hide=TRUE, labx=0.35)
# Show actual vs predicted of M1 and M2 on screen
plot(binnedM1, binnedM2, display="avp", hide=TRUE, labx=0.35)
The EVP plot is similar to the AVP plot but uses the ratio of the actual to the predicted probabilities on the y-axis.
# Send error vs predicted of M1 and M2 on screen
plot(binnedM1, binnedM2, display="evp", hide=TRUE, labx=0.35)
Receiver Operating Characteristic (ROC) plots evaluate binary classification models across varying thresholds, illustrating the trade-off between sensitivity (true positive rate) and specificity (true negative rate). The ROC plot places sensitivity on the y-axis and \((1 - \text{specificity})\) on the x-axis, providing a comprehensive view of model performance at all possible thresholds.
Without a fixed threshold \(\pi^*\), ROC plots enable the examination of model performance across the full range of thresholds to help identify one that best balances sensitivity and specificity.
Sensitivity and Specificity are metrics used to evaluate a model’s performance in binary classification by assessing its ability to correctly identify positive and negative cases.
\[ \text{Sensitivity} = \Pr(\hat{y} \geq \pi^* | y = 1) \]
\[ \text{Specificity} = \Pr(\hat{y} < \pi^* | y = 0) \]
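For a fixed threshold such as \(\pi^* = 0.5\), both quantities can be computed directly from the fitted probabilities; a short sketch for model 1, which should match the per-category percentages reported by hitmiss() above:
# Sensitivity and specificity at the 0.5 threshold, model 1
phat <- fitted(glm.m1)
mean(phat >= 0.5 & data$vote00 == 1)/mean(data$vote00 == 1)  # sensitivity: Pr(yhat >= 0.5 | y = 1)
mean(phat < 0.5 & data$vote00 == 0)/mean(data$vote00 == 0)   # specificity: Pr(yhat < 0.5 | y = 0)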
To create an ROC curve, sort the observations from the highest to the lowest predicted probability and trace a path from the origin: each observed 1 moves the path one step up and each observed 0 moves it one step to the right, with each axis scaled by the total number of 1s and 0s, respectively.
Each upward movement corresponds to a correct prediction of \(y=1\), while each rightward movement represents an incorrect prediction of a positive outcome. A well-performing model yields a curve that rises quickly toward the top left corner, indicating a strong true positive rate and a low false positive rate.
The area under the ROC curve (AUC), also called the concordance index, quantifies the model's capability to discriminate between positive and negative cases. A higher AUC reflects better discriminatory power, with values close to 1 suggesting near-perfect discrimination and values near 0.5 indicating discrimination no better than chance.
While the AUC offers a succinct performance metric, the shape of the ROC curve itself reveals more nuanced information, showing where a model might perform better or worse across specific thresholds.
In summary, ROC plots provide an effective means of visualizing model predictive power across all thresholds, where a steeper curve rising toward the top left implies superior model performance.
# Send actual vs predicted, error rate vs predicted, and ROC on screen
plot(binnedM1, binnedM2,
thresholds=c(0.9, 0.8, 0.7, 0.6),
hide=TRUE, labx=0.35)
# Also see ROCR package for ROC curves and many other prediction metrics
# and the verification package for a rudimentary roc plot function roc.plot()
# Concordance Indexes / AUC (using glm result and my source code)
concord.null <- concord.glm(res=glm.m1, y=data$vote00, type="null")
concord.m1 <- concord.glm(res=glm.m1, y=data$vote00, type="model")
concord.m2 <- concord.glm(res=glm.m2, y=data$vote00, type="model")
concordi.m1 <- concord.glm(res=glm.m1, y=data$vote00, type="improve")
concordi.m2 <- concord.glm(res=glm.m2, y=data$vote00, type="improve")
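As a cross-check on the concordance index, the AUC can also be computed directly from the ranks of the fitted probabilities via the Mann–Whitney formulation; a base-R sketch for model 1:
# AUC via the rank (Mann-Whitney) formulation, model 1
phat <- fitted(glm.m1)
yobs <- data$vote00
n1 <- sum(yobs == 1); n0 <- sum(yobs == 0)
(auc.m1 <- (sum(rank(phat)[yobs == 1]) - n1*(n1 + 1)/2)/(n1*n0))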
Finally, we can compare the three versions of actual vs. predicted plots:
# Send ROC plots for M1 and M2 on screen
plot(binnedM1, binnedM2, display="roc", thresholds=c(0.9, 0.8, 0.7, 0.6),
labx=0.35)
For cross-validation, we first select an error metric: in this case, we use the percent correctly predicted.
In leave-one-out cross-validation, we remove each observation in turn, refit the model on the remaining data, and record how often the model correctly predicts the left-out observation.
We can then compare the cross-validated PCP across models.
### Cross-validation (takes a few minutes to run)
library(boot)   # for cv.glm()
## A percent-correctly-predicted-style cost function
## r is actual y, pi is expected y
## Rate of inaccuracy: mean(vote00!=round(yp))
costpcp <- function(r, pi=0) mean(r!=round(pi))
## an alternative cost function for binary data
## cost <- function(r, pi=0) mean(abs(r-pi)>0.5)
cv.m1 <- cv.glm(data, glm.m1, costpcp)
cvPCP.m1 <- 1 - cv.m1$delta[2]
cv.m2 <- cv.glm(data, glm.m2, costpcp)
cvPCP.m2 <- 1 - cv.m2$delta[2]
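By default boot::cv.glm() performs leave-one-out cross-validation, which is what makes this block slow; its K argument switches to K-fold cross-validation for a much faster (if slightly noisier) estimate:
# Faster alternative: 10-fold cross-validation
cv10.m1 <- cv.glm(data, glm.m1, costpcp, K = 10)
cv10.m2 <- cv.glm(data, glm.m2, costpcp, K = 10)
1 - cv10.m1$delta[2] ; 1 - cv10.m2$delta[2]   # cross-validated PCP for each model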
#### More cross-validation
## A simple leave-one-out cross-validation function for a logit glm; returns predicted probabilities
loocv <- function(obj) {
  data <- obj$data                # data used to fit the original model
  m <- nrow(data)
  form <- formula(obj)
  fam <- obj$family$family
  loo <- rep(NA, m)
  for (i in 1:m) {
    # refit leaving out observation i, then predict its probability
    i.glm <- glm(form, data = data[-i, ], family = fam)
    loo[i] <- predict(i.glm, newdata = data[i, ], type = "response")
  }
  loo
}
# LOOCV for models 1 and 2
predCVm1 <- loocv(glm.m1)
predCVm2 <- loocv(glm.m2)
# Make cross-validated AVP and ROC plots; note use of newpred input in binPredict
binnedM1cv <- binPredict(glm.m1, newpred=predCVm1, col=blue, label="M1: LOO-CV", quantiles=TRUE)
plot(binnedM1cv, display=c("avp","roc"), hide=TRUE, thresholds=c(0.9, 0.8, 0.7, 0.6),
labx=0.25)
binnedM2cv <- binPredict(glm.m2, newpred=predCVm2, col=orange, label="M2: LOO-CV", quantiles=TRUE)
plot(binnedM2cv,
display=c("avp","roc"),
hide=TRUE,
thresholds=c(0.9, 0.8, 0.7, 0.6),
labx=0.25)
plot(binnedM1cv,
binnedM2cv,
display=c("avp","roc"),
hide=TRUE,
thresholds=c(0.9, 0.8, 0.7, 0.6),
labx=0.25)