In today’s lab, we will explore several goodness-of-fit techniques to assess the performance of our binary models.
To start, we’ll re-fit the same binary models from last week, using data from the 2000 American National Election Survey (ANES). In this dataset, turnout is the binary outcome: either individuals voted (\(\text{Vote}_i = 1\)) or they did not (\(\text{Vote}_i = 0\)).
We’ll model the probability of voting in the 2000 U.S. presidential election as a function of Age, High School Degree (HS Degree), and College Degree (ColDeg):
\[ \begin{aligned} \text{Vote}_i & \sim \text{Bernoulli}(\pi_i) \\ \pi_i & = \text{logit}^{-1}(\mathbf{x}_i \boldsymbol{\beta}) \end{aligned} \]
Recall that we also fitted a second model that adds an indicator for being currently married (marriedo) as a predictor. Therefore, we have two binary logit models to compare: one that adjusts for marital status and one that does not.
# Load packages used below
library(tidyverse)      # read_csv(), dplyr verbs, and tibble()
library(RColorBrewer)   # brewer.pal()
# Get nice colors
col <- brewer.pal(5, "Set1")
blue <- col[2]
orange <- col[5]
# Models in R formula format
m1 <- vote00 ~ age + I(age^2) + hsdeg + coldeg
m2 <- vote00 ~ age + I(age^2) + hsdeg + coldeg + marriedo
# Note: the variable marriedo is current marrieds,
# the variable married is ever-marrieds
# Load data
fulldata <- read_csv("data/nes00a.csv")
# or
fulldata <- read_csv("https://faculty.washington.edu/cadolph/mle/nes00a.csv")
# Keep only cases observed for all models
# (extractdata() is from Chris Adolph's simcf package)
data <- extractdata(m2, fulldata, na.rm = TRUE)
# alternatively, with dplyr
data <- fulldata |> select(vote00,age,hsdeg,coldeg,marriedo) |> na.omit()
# Construct variables and model objects
y <- data$vote00
x1 <-
data |>
mutate(age2 = age^2) |>
select(age,age2,hsdeg,coldeg) |> as.matrix()
x2 <-
data |>
mutate(age2 = age^2) |>
select(age,age2,hsdeg,coldeg,marriedo) |> as.matrix()
# Likelihood function for logit (returns the negative log-likelihood)
llk.logit <- function(param, y, x) {
  os <- rep(1, length(x[, 1]))   # column of 1s for the intercept
  x <- cbind(os, x)              # prepend the intercept column
  b <- param[1:ncol(x)]
  xb <- x %*% b
  sum(y*log(1 + exp(-xb)) + (1 - y)*log(1 + exp(xb)))
  # optim is a minimizer, so we minimize -ln L(param|y)
}
# Fit logit model using optim
ls.result <- lm(y~x1) # use ls estimates as starting values
stval <- ls.result$coefficients # initial guesses
logit.m1 <- optim(par=stval,
fn=llk.logit,
method="Nelder-Mead",
hessian=T,
y=y,
x=x1)
# call minimizer procedure
pe.m1 <- logit.m1$par # point estimates
vc.m1 <- solve(logit.m1$hessian) # var-cov matrix
se.m1 <- sqrt(diag(vc.m1)) # standard errors
ll.m1 <- -logit.m1$value # likelihood at maximum
# Alternative estimation technique: GLM
glm.m1 <- glm(m1,
family=binomial(link="logit"),
data=data)
coef(glm.m1)
## (Intercept) age I(age^2) hsdeg coldeg
## -3.0193890720 0.0747251922 -0.0004427014 1.1243907749 1.0795702357
# Fit logit model with added covariate: marriedo (currently married)
ls.result <- lm(y~x2) # use ls estimates as starting values
stval <- ls.result$coefficients # initial guesses
logit.m2 <- optim(par=stval,
fn=llk.logit,
method="Nelder-Mead",
hessian=T,y=y,x=x2)
# call minimizer procedure
pe.m2 <- logit.m2$par # point estimates
vc.m2 <- solve(logit.m2$hessian) # var-cov matrix
se.m2 <- sqrt(diag(vc.m2)) # standard errors
ll.m2 <- -logit.m2$value # likelihood at maximum
# GLM estimation of model with marriedo
glm.m2 <- glm(m2,
family=binomial(link="logit"),
data=data)
coef(glm.m2)
## (Intercept) age I(age^2) hsdeg coldeg
## -2.8661484043 0.0614366824 -0.0003175198 1.0994894781 1.0525405326
## marriedo
## 0.3729890203
summary.glm(glm.m2)
##
## Call:
## glm(formula = m2, family = binomial(link = "logit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8661484 0.4215276 -6.799 1.05e-11 ***
## age 0.0614367 0.0173301 3.545 0.000392 ***
## I(age^2) -0.0003175 0.0001701 -1.867 0.061944 .
## hsdeg 1.0994895 0.1806362 6.087 1.15e-09 ***
## coldeg 1.0525405 0.1317486 7.989 1.36e-15 ***
## marriedo 0.3729890 0.1099446 3.393 0.000693 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2293.5 on 1782 degrees of freedom
## Residual deviance: 2057.5 on 1777 degrees of freedom
## AIC: 2069.5
##
## Number of Fisher Scoring iterations: 4
Now that we have fitted two models, we can ask: which one best fits the data-generating process (DGP)? Below we explore several techniques to assess this question.
For models where we assume a binomial distribution (e.g., logistic regression for binary outcomes), the deviance is a measure of goodness of fit that compares the fitted model with a saturated model (a model that fits the data perfectly).
The deviance for a model \(\mathcal{M}\) with categorical outcomes, particularly in logistic regression, is defined as:
\[ D(\mathcal{M}) = -2 \left( \log \mathcal{L}(\mathcal{M}) - \log \mathcal{L}(\text{saturated model}) \right), \] where \(\log \mathcal{L}(\mathcal{M})\) is the maximized log-likelihood of the fitted model and \(\log \mathcal{L}(\text{saturated model})\) is the log-likelihood of the saturated model.
In practical terms this simplifies, because the deviance of the saturated model is a constant (zero for ungrouped binary data), so in relative comparisons we can treat the deviance as \(-2 \log \mathcal{L}(\mathcal{M})\).
So, while \(D(\mathcal{M}) = -2 \log \mathcal{L}(\mathcal{M})\) is commonly used as shorthand, strictly speaking, deviance is defined relative to a saturated model in categorical outcome models.
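For ungrouped binary data, the saturated model assigns probability one to each observed outcome, so its log-likelihood is zero and the deviance reduces exactly to \(-2 \log \mathcal{L}(\mathcal{M})\). A quick check against the fitted glm objects:
# Check: residual deviance equals -2 * log-likelihood for these binary models
all.equal(deviance(glm.m1), -2*as.numeric(logLik(glm.m1)))
all.equal(deviance(glm.m2), -2*as.numeric(logLik(glm.m2)))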
Given two nested models:
\[ \begin{aligned} \mathcal{M}_{1} &: \text{logit}^{-1}(\beta_{0} + \beta_{1}x_{1} + \dots +\beta_{i}x_{i})\\ \mathcal{M}_{2} &: \text{logit}^{-1}(\beta_{0} + \beta_{1}x_{1} + \dots +\beta_{i}x_{i}+\beta_{i+1}x_{i+1}+ \dots +\beta_{i+j}x_{i+j}) \end{aligned} \]
The likelihood ratio statistic tests the null hypothesis that the additional parameters are equal to zero:
\[H_{0}: \beta_{i+1}= \dots =\beta_{i+j}=0\]
The likelihood ratio test (LRT) is a statistical test for comparing the fit of two nested models: a simpler, restricted model (\(\mathcal{M}_1\)) and a more complex, unrestricted model (\(\mathcal{M}_2\)).
The likelihood ratio statistic is calculated as:
\[ \text{LR} = -2 \log \frac{\mathcal{L}(\mathcal{M}_1)}{\mathcal{L}(\mathcal{M}_2)} = 2 \left( \log \mathcal{L}(\mathcal{M}_2) - \log \mathcal{L}(\mathcal{M}_1) \right) \]
Under the null hypothesis that the restricted model \(\mathcal{M}_1\) is true, the test statistic \(\text{LR}\) follows a chi-squared distribution, \(f_{\chi^2}(m)\), where \(m\) is the number of restrictions imposed by \(\mathcal{M}_1\) (here, the \(j\) additional parameters constrained to zero). This test is commonly used to assess whether the additional complexity of \(\mathcal{M}_2\) significantly improves model fit.
# Likelihood ratio analysis
## Extract log-likelihood values
# from optim
(ll.m1 <- -logit.m1$value) # likelihood at maximum
## [1] -1034.478
(ll.m2 <- -logit.m2$value) # likelihood at maximum
## [1] -1040.629
# from glm
(ll.m1 <- logLik(glm.m1) |> as.numeric())
## [1] -1034.478
(ll.m2 <- logLik(glm.m2) |> as.numeric())
## [1] -1028.728
# Deviance (the "-0" terms refer to the log-likelihood of
# the saturated model, which is zero for categorical outcomes)
(deviance.m1 <- -2*(ll.m1 - 0))
## [1] 2068.956
(deviance.m2 <- -2*(ll.m2 - 0))
## [1] 2057.455
## Check number of parameters in each model
(k.m1 <- length(pe.m1))
## [1] 5
(k.m2 <- length(pe.m2))
## [1] 6
## Likelihood ratio (LR) test
(lr.test <- 2*(ll.m2 - ll.m1))
## [1] 11.50118
(lr.test.p <- pchisq(lr.test,
df=(k.m2 - k.m1),
lower.tail=FALSE))
## [1] 0.0006955193
# Alternatively, you can perform the test with anova()
## Perform likelihood ratio test between nested models
(lrt_result <- anova(glm.m1, glm.m2, test = "LRT"))
## Analysis of Deviance Table
##
## Model 1: vote00 ~ age + I(age^2) + hsdeg + coldeg
## Model 2: vote00 ~ age + I(age^2) + hsdeg + coldeg + marriedo
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1778 2069.0
## 2 1777 2057.5 1 11.501 0.0006955 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
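If the lmtest package is installed (an optional alternative, not part of the original lab code), its lrtest() function reports the same statistic and p-value directly:
# Equivalent LR test via the lmtest package (optional)
library(lmtest)
lrtest(glm.m1, glm.m2)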
For non-nested models, we cannot use likelihood ratio tests. Instead we turn to several information criteria measures to assess model fit, which can be thought of as penalized LR tests.
The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) compare models by penalizing the likelihood for the number of parameters, thus accounting for model complexity. Lower values of AIC or BIC indicate a better fit.
Both criteria provide a quantitative means to assess model performance, with BIC often favoring simpler models due to a stronger penalty on model complexity compared to AIC.
\[ \begin{aligned} \text{AIC}_{\mathcal{M}_2} &= 2 \times p_2 - 2 \log \mathcal{L}(\mathcal{M}_2) \\ \text{AIC}_{\mathcal{M}_1} &= 2 \times p_1 - 2 \log \mathcal{L}(\mathcal{M}_1) \end{aligned} \]
where \(p\) is the number of parameters in each model's likelihood \(\mathcal{L}(\mathcal{M})\). Notice that as the number of parameters increases, the penalty term \(2 \times p\) grows, raising the AIC unless the added parameters substantially improve the model's likelihood. Thus, lower AIC values indicate a better fit.
# Akaike Information Criterion (AIC)
(aic.m1 <- 2*k.m1 - 2*ll.m1)
## [1] 2078.956
(aic.m2 <- 2*k.m2 - 2*ll.m2)
## [1] 2069.455
(aic.test <- aic.m2 - aic.m1)
## [1] -9.501183
The BIC is based on a Bayesian comparison of models proposed by Raftery (1996).
\[ \begin{aligned} \text{BIC}_{\mathcal{M}_2} & = \log(n) \times p_2 - 2 \log \mathcal{L}(\mathcal{M}_2) \\ \text{BIC}_{\mathcal{M}_1} &= \log(n) \times p_1 - 2 \log \mathcal{L}(\mathcal{M}_1) \end{aligned} \]
where \(p\) is the number of parameters, and \(n\) is the number of observations. Lower BIC values are preferred, as they indicate a better balance between model fit and complexity.
More formally, recall Bayes theorem:
\[ \begin{aligned} P(\boldsymbol{\theta}|\boldsymbol{y}) &= \frac{P(\boldsymbol{y}|\boldsymbol{\theta})P(\boldsymbol{\theta})}{P(\boldsymbol{y})}\\ \Bigg[\frac{\text{P}(\mathcal{M}_{1}|\text{Observed Data})}{\text{P}(\mathcal{M}_{2}|\text{Observed Data})} \Bigg] &\approx \Bigg[\frac{\text{P}(\text{Observed Data}|\mathcal{M}_{1})}{\text{P}(\text{Observed Data}|\mathcal{M}_{2})} \times \frac{\text{P}(\mathcal{M}_{1})}{\text{P}(\mathcal{M}_{2})}\Bigg] \end{aligned} \]
If we assume that \(\frac{\text{P}(\mathcal{M}_{1})}{\text{P}(\mathcal{M}_{2})}=1\), Raftery shows that
\[ 2 \log \Bigg[\frac{\text{P}(\text{Observed Data}|\mathcal{M}_{1})}{\text{P}(\text{Observed Data}|\mathcal{M}_{2})} \Bigg] \approx \text{BIC}_{\mathcal{M}_2} - \text{BIC}_{\mathcal{M}_1} \]
A model with a smaller BIC is again preferred. Raftery (1995) suggested rough guidelines for the strength of evidence favoring \(\mathcal{M}_1\) over \(\mathcal{M}_2\) based on the absolute BIC difference: a difference of 0–2 is weak evidence, 2–6 is positive evidence, 6–10 is strong evidence, and more than 10 is very strong evidence.
However, these cutoffs carry some degree of arbitrariness. Domain knowledge is what ultimately helps decide whether a small difference is worth modeling.
# Bayesian Information Criterion (BIC)
(bic.m1 <- log(nrow(x1))*k.m1 - 2*ll.m1)
## [1] 2106.387
(bic.m2 <- log(nrow(x2))*k.m2 - 2*ll.m2)
## [1] 2102.372
(bic.test <- bic.m2 - bic.m1)
## [1] -4.01513
# compare information criteria
aic.test ; bic.test
## [1] -9.501183
## [1] -4.01513
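Since \(2 \log B_{12} \approx \text{BIC}_{\mathcal{M}_2} - \text{BIC}_{\mathcal{M}_1}\), the BIC difference computed above can be translated into an approximate Bayes factor; a small sketch using the quantities already in memory:
# Approximate Bayes factor implied by the BIC difference
(bf.12 <- exp(bic.test/2))   # evidence for M1 over M2; values below 1 favor M2
(bf.21 <- 1/bf.12)           # evidence for M2 over M1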
The Percent Correctly Predicted (PCP) is a metric used to assess the predictive accuracy of binary classification models by calculating the percentage of correctly classified observations based on a threshold. Specifically, PCP is defined as the proportion of observations where \(\hat{y}_i \geq 0.5\) and \(y_i = 1\) or \(\hat{y}_i < 0.5\) and \(y_i = 0\).
We can also compute the percentage of observations that our model correctly predicts, such that
\[\hat{y}\geq0.5 \quad \text{and} \quad y=1\]
or
\[\hat{y}<0.5 \quad \text{and} \quad y=0\]
In other words, we check to see how often the predicted probability from our model is greater than or equal to 0.5 when the observed value is 1 and how often it is less than 0.5 when the observed value is 0.
Recall that we want to report the improvement over the baseline, e.g., \(\text{PCP} - \overline{y}\). (The improve option of pcp.glm() used below reports the proportional reduction in error relative to the null model, \((\text{PCP} - \text{PCP}_{\text{null}})/(1 - \text{PCP}_{\text{null}})\), rather than the raw difference.)
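Before turning to the course helper function pcp.glm(), here is a minimal base-R sketch of the same quantities computed directly from the fitted probabilities, using the 0.5 threshold defined above:
# PCP "by hand" from the fitted probabilities, model 1
pcp1.hand <- mean((fitted(glm.m1) >= 0.5) == data$vote00)   # percent correctly predicted
pcp0.hand <- max(mean(data$vote00), 1 - mean(data$vote00))  # null model: always predict the modal category
pcp1.hand - pcp0.hand                                       # raw improvement over the baseline
(pcp1.hand - pcp0.hand)/(1 - pcp0.hand)                     # proportional reduction in error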
# Percent correctly predicted (using glm result and Chris' source code)
# goodness of fit for the null model:
(pcp.null <- pcp.glm(res=glm.m1,
y=data$vote00,
type="null"))
## [1] 0.6567583
# goodness of fit for model 1:
(pcp.m1 <- pcp.glm(res=glm.m1,
y=data$vote00,
type="model"))
## [1] 0.6999439
(pcp.m2 <- pcp.glm(res=glm.m2,
y=data$vote00,
type="model"))
## [1] 0.6977005
# percentage of improvement
(pcpi.m1 <- pcp.glm(res=glm.m1,
y=data$vote00,
type="improve"))
## [1] 0.125817
(pcpi.m2 <- pcp.glm(res=glm.m2,
y=data$vote00,
type="improve"))
## [1] 0.119281
tibble(
PCP = round(c(pcp.null,pcp.m1,pcp.m2,pcpi.m1,pcpi.m2,pcpi.m2-pcpi.m1),3),
type = c("null","model1","model2","improve1","improve2","diff")
)
## # A tibble: 6 × 2
## PCP type
## <dbl> <chr>
## 1 0.657 null
## 2 0.7 model1
## 3 0.698 model2
## 4 0.126 improve1
## 5 0.119 improve2
## 6 -0.007 diff
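The classification counts underlying these PCP figures can be checked with a simple cross-tabulation of 0.5-threshold predictions against the observed outcomes:
# Confusion matrix at the 0.5 threshold, model 1
table(predicted = as.numeric(fitted(glm.m1) >= 0.5),
      observed = data$vote00)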
## Another way to compute PCP with the pscl package
library(pscl)
hitmiss(glm.m1)
## Classification Threshold = 0.5
## y=0 y=1
## yhat=0 205 128
## yhat=1 407 1043
## Percent Correctly Predicted = 69.99%
## Percent Correctly Predicted = 33.5%, for y = 0
## Percent Correctly Predicted = 89.07% for y = 1
## Null Model Correctly Predicts 65.68%
## [1] 69.99439 33.49673 89.06917
hitmiss(glm.m2)
## Classification Threshold = 0.5
## y=0 y=1
## yhat=0 210 137
## yhat=1 402 1034
## Percent Correctly Predicted = 69.77%
## Percent Correctly Predicted = 34.31%, for y = 0
## Percent Correctly Predicted = 88.3% for y = 1
## Null Model Correctly Predicts 65.68%
## [1] 69.77005 34.31373 88.30060
Interpretation: A PCP below 50% would be undesirable, as it indicates worse-than-random prediction. Additionally, it’s important to consider the baseline mean of the outcome \(\bar{y} = 0.65675\); thus, these PCP values (around 70%) may not represent a substantial improvement. The difference \(\text{PCP} - \bar{y}\) indicates the added predictive value from including covariates in the model.
We can also visualize the PCP using separation plots.
Sort the observations by their predicted probability of the outcome (smallest to largest).
Plot 1s as red lines and 0s as tan lines.
Trace a horizontal black line showing the predicted probability for each case.
Red lines to the left, and tan lines to the right, of the point where the black line crosses the halfway mark are mispredictions.
A model that predicts accurately will be mostly tan on the left and red on the right.
# Separation plots
library(separationplot)   # for separationplot()
separationplot(pred=glm.m1$fitted.values, actual=glm.m1$y, newplot = F)
separationplot(pred=glm.m2$fitted.values, actual=glm.m2$y, newplot = F)
In regression analysis, actual vs. predicted plots are used to assess model fit by comparing observed values, \(y\), with predicted values, \(\hat{y}\). Ideally, these values should align closely with the 45° line, indicating accurate predictions. Deviations from this line can reveal noise (variance \(\sigma^2\)) or systematic patterns due to misspecification, such as omitted variables or an incorrect functional form.
For binary outcomes (0 or 1), directly plotting \(y\) against \(\hat{y}\) is less informative because the response is discrete. In this context, a binned approach is often used: sort the observations by their predicted probability, group them into bins, and compare the average observed outcome in each bin with the average predicted probability.
In a well-fitted model, the points in the actual vs. predicted plot should align closely with the 45° line. Deviations can indicate that the model over- or under-predicts the probability within certain ranges.
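The binning itself is simple to do by hand; a minimal dplyr sketch, assuming decile bins of the fitted probabilities for model 1 (the binPredict() helper introduced below packages this up with additional options):
# Binned actual vs. predicted "by hand": decile bins of fitted probabilities, model 1
data |>
  mutate(phat = fitted(glm.m1),
         bin = ntile(phat, 10)) |>
  group_by(bin) |>
  summarize(actual = mean(vote00), predicted = mean(phat))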
In R, a function like binPredict() can automate the binning and calculation of averages for actual vs. predicted plots, supporting options for equal-width bins (quantiles = FALSE) or an equal number of observations per bin (quantiles = TRUE). Example code to generate binned plots:
# Bin and plot for Model 1 with equal number of observations per bin
binnedM1 <- binPredict(glm.m1, col=blue, label="M1: Age, Edu", quantiles=TRUE)
# Bin and plot for Model 2 with equal number of observations per bin
binnedM2 <- binPredict(glm.m2, col=orange, label="M2: Age, Edu, Married", quantiles=TRUE)
Using actual vs. predicted plots provides valuable insights into the accuracy and reliability of predictions, especially for binary data, where careful binning allows for meaningful comparisons between observed frequencies and predicted probabilities across ranges.
# Show actual vs predicted of M1 on screen
plot(binnedM1, display="avp", hide=TRUE, labx=0.35)
# Show actual vs predicted of M1 and M2 on screen
plot(binnedM1, binnedM2, display="avp", hide=TRUE, labx=0.35)
The EVP plot is similar to the AVP plot but uses the ratio of the actual to the predicted probabilities on the y-axis.
# Send error vs predicted of M1 and M2 on screen
plot(binnedM1, binnedM2, display="evp", hide=TRUE, labx=0.35)
Receiver Operating Characteristic (ROC) plots evaluate binary classification models across varying thresholds, illustrating the trade-off between sensitivity (true positive rate) and specificity (true negative rate). The ROC plot places sensitivity on the y-axis and \((1 - \text{specificity})\) on the x-axis, providing a comprehensive view of model performance at all possible thresholds.
Without a fixed threshold \(\pi^*\), ROC plots enable the examination of model performance across the full range of thresholds to help identify one that best balances sensitivity and specificity.
Sensitivity and Specificity are metrics used to evaluate a model’s performance in binary classification by assessing its ability to correctly identify positive and negative cases.
\[ \text{Sensitivity} = \Pr(\hat{y} \geq \pi^* | y = 1) \]
\[ \text{Specificity} = \Pr(\hat{y} < \pi^* | y = 0) \]
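For a fixed threshold such as \(\pi^* = 0.5\), both quantities can be computed directly from the fitted probabilities; a short sketch for model 1, which should match the per-category percentages reported by hitmiss() above:
# Sensitivity and specificity at the 0.5 threshold, model 1
phat <- fitted(glm.m1)
mean(phat >= 0.5 & data$vote00 == 1)/mean(data$vote00 == 1)  # sensitivity: Pr(yhat >= 0.5 | y = 1)
mean(phat < 0.5 & data$vote00 == 0)/mean(data$vote00 == 0)   # specificity: Pr(yhat < 0.5 | y = 0)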
To create an ROC curve, sort the observations from the highest to the lowest predicted probability and trace a path from the origin: each observed 1 moves the path one step up and each observed 0 moves it one step to the right, with each axis scaled by the total number of 1s and 0s, respectively.
Each upward movement corresponds to a correct prediction of \(y=1\), while each rightward movement represents an incorrect prediction of a positive outcome. A well-performing model yields a curve that rises quickly toward the top left corner, indicating a strong true positive rate and a low false positive rate.
The area under the ROC curve (AUC), also called the concordance index, quantifies the model's capability to discriminate between positive and negative cases. A higher AUC reflects better discriminatory power, with values close to 1 suggesting near-perfect discrimination and values near 0.5 indicating discrimination no better than chance.
While the AUC offers a succinct performance metric, the shape of the ROC curve itself reveals more nuanced information, showing where a model might perform better or worse across specific thresholds.
In summary, ROC plots provide an effective means of visualizing model predictive power across all thresholds, where a steeper curve rising toward the top left implies superior model performance.
# Send actual vs predicted, error rate vs predicted, and ROC on screen
plot(binnedM1, binnedM2,
thresholds=c(0.9, 0.8, 0.7, 0.6),
hide=TRUE, labx=0.35)
# Also see ROCR package for ROC curves and many other prediction metrics
# and the verification package for a rudimentary roc plot function roc.plot()
# Concordance Indexes / AUC (using glm result and my source code)
concord.null <- concord.glm(res=glm.m1, y=data$vote00, type="null")
concord.m1 <- concord.glm(res=glm.m1, y=data$vote00, type="model")
concord.m2 <- concord.glm(res=glm.m2, y=data$vote00, type="model")
concordi.m1 <- concord.glm(res=glm.m1, y=data$vote00, type="improve")
concordi.m2 <- concord.glm(res=glm.m2, y=data$vote00, type="improve")
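As a cross-check on the concordance index, the AUC can also be computed directly from the ranks of the fitted probabilities via the Mann–Whitney formulation; a base-R sketch for model 1:
# AUC via the rank (Mann-Whitney) formulation, model 1
phat <- fitted(glm.m1)
yobs <- data$vote00
n1 <- sum(yobs == 1); n0 <- sum(yobs == 0)
(auc.m1 <- (sum(rank(phat)[yobs == 1]) - n1*(n1 + 1)/2)/(n1*n0))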
Finally, we can compare the three versions of actual vs. predicted plots:
# Send ROC plots for M1 and M2 on screen
plot(binnedM1, binnedM2, display="roc", thresholds=c(0.9, 0.8, 0.7, 0.6),
labx=0.35)
For cross-validation, we first select an error metric: in this case, we use the percent correctly predicted.
In leave-one-out cross-validation, we remove each observation in turn, refit the model on the remaining data, and record how often the model correctly predicts the left-out observation.
We can then compare the cross-validated PCP across models.
### Cross-validation (takes a few minutes to run)
library(boot)   # for cv.glm()
## A percent-correctly-predicted-style cost function
## r is actual y, pi is expected y
## Rate of inaccuracy: mean(vote00!=round(yp))
costpcp <- function(r, pi=0) mean(r!=round(pi))
## an alternative cost function for binary data
## cost <- function(r, pi=0) mean(abs(r-pi)>0.5)
cv.m1 <- cv.glm(data, glm.m1, costpcp)
cvPCP.m1 <- 1 - cv.m1$delta[2]
cv.m2 <- cv.glm(data, glm.m2, costpcp)
cvPCP.m2 <- 1 - cv.m2$delta[2]
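By default boot::cv.glm() performs leave-one-out cross-validation, which is what makes this block slow; its K argument switches to K-fold cross-validation for a much faster (if slightly noisier) estimate:
# Faster alternative: 10-fold cross-validation
cv10.m1 <- cv.glm(data, glm.m1, costpcp, K = 10)
cv10.m2 <- cv.glm(data, glm.m2, costpcp, K = 10)
1 - cv10.m1$delta[2] ; 1 - cv10.m2$delta[2]   # cross-validated PCP for each model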
#### More cross-validation
## A simple leave-one-out cross-validation function for a logit glm; returns predicted probabilities
loocv <- function(obj) {
  data <- obj$data                # data used to fit the original model
  m <- nrow(data)
  form <- formula(obj)
  fam <- obj$family$family
  loo <- rep(NA, m)
  for (i in 1:m) {
    # refit leaving out observation i, then predict its probability
    i.glm <- glm(form, data = data[-i, ], family = fam)
    loo[i] <- predict(i.glm, newdata = data[i, ], type = "response")
  }
  loo
}
# LOOCV for models 1 and 2
predCVm1 <- loocv(glm.m1)
predCVm2 <- loocv(glm.m2)
# Make cross-validated AVP and ROC plots; note use of newpred input in binPredict
binnedM1cv <- binPredict(glm.m1, newpred=predCVm1, col=blue, label="M1: LOO-CV", quantiles=TRUE)
plot(binnedM1cv, display=c("avp","roc"), hide=TRUE, thresholds=c(0.9, 0.8, 0.7, 0.6),
labx=0.25)
binnedM2cv <- binPredict(glm.m2, newpred=predCVm2, col=orange, label="M2: LOO-CV", quantiles=TRUE)
plot(binnedM2cv,
display=c("avp","roc"),
hide=TRUE,
thresholds=c(0.9, 0.8, 0.7, 0.6),
labx=0.25)
plot(binnedM1cv,
binnedM2cv,
display=c("avp","roc"),
hide=TRUE,
thresholds=c(0.9, 0.8, 0.7, 0.6),
labx=0.25)