Simulation of Quantities of Interest

Simulation is a powerful method for generating meaningful quantities of interest (QI) from statistical models. Instead of only reporting coefficients, researchers can simulate expected values, expected probabilities, first differences, and measures of uncertainty to give a clearer picture of the model results.

QI are especially valuable because they go beyond mere coefficients to provide a more intuitive understanding of the model’s output. While it is possible to compute expected probabilities using the model’s coefficients alone, simulation adds value by incorporating confidence intervals, thus addressing uncertainty.

In any statistical model, the coefficients are subject to estimation uncertainty—we estimate parameters from a sample but want to generalize to the broader population. If a different sample were collected, the estimated coefficients could vary, leading to uncertainty in any computed quantities from those coefficients. Simulation quantifies this uncertainty by drawing from the sampling distribution of the model’s estimates.

Steps for simulating quantities of interest:

  1. Estimate the model, storing the coefficients and their variance-covariance matrix.
  2. Draw from a multivariate normal distribution based on the estimated coefficients and their variance-covariance matrix.
  3. Repeat the process (e.g., 1,000 times), generating many sets of plausible coefficient estimates.

These draws reflect the variability of each coefficient and their covariances, resulting in a large number of plausible estimates. Rather than simply summarizing these draws, it is more informative to compute a substantively meaningful QI from the simulated estimates, such as a predicted or expected value of the dependent variable.
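
As a minimal sketch of steps 1–3 (assuming a hypothetical fitted model object named fit; the application below walks through the full workflow with real data):

library(MASS)                      # provides mvrnorm()

# Hypothetical model, e.g. fit <- lm(y ~ x1 + x2, data = mydata)
pe  <- coef(fit)                   # Step 1: point estimates
vcv <- vcov(fit)                   # Step 1: variance-covariance matrix

# Steps 2-3: draw 1,000 sets of plausible coefficient estimates
sim.betas <- mvrnorm(n = 1000, mu = pe, Sigma = vcv)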

Two Types of Uncertainty:

When computing predictions and simulations of quantities of interest (QoI), it is essential to recognize that the model we use to approximate the data-generating process (DGP) is subject to two types of uncertainties:

  • Estimation uncertainty arises because the model’s parameters are estimated from a sample, making their true values uncertain.
  • Fundamental uncertainty reflects the inherent randomness of the world and applies when making predictions beyond the model’s estimation.

Thus, the choice between predicted and expected values depends on the type of uncertainty the analyst wants to highlight—whether focusing on the uncertainty about the model itself (estimation uncertainty) or on the variability in the world (fundamental uncertainty).

In a simple linear regression model, we can highlight both sources of uncertainty. The equation is as follows:

\[ \hat{y_i} = \color{red}{\hat{\beta_0}} + \color{red}{\hat{\beta_1}} x_i + \color{blue}{\epsilon_i} \]

  • \(\color{red}{\text{estimation uncertainty}}\): Represents uncertainty in the estimation of the coefficients \(\hat{\beta_0}\) and \(\hat{\beta_1}\), which are estimates derived from the data.

  • \(\color{blue}{\text{fundamental uncertainty}}\): Reflects the \(\color{blue}{\text{stochastic}}\) component of the model, \(\epsilon_i\), accounting for the randomness in the dependent variable that the \(\color{red}{\text{systematic}}\) component cannot explain.

Which quantity of interest should we report: expected values or predicted values? When analyzing estimands at the population or data-generating process (DGP) level, we focus on expected values. However, if the goal is to make specific predictions for real-world cases—such as in policy analysis for government agencies, private banks, or consulting firms—predicted values should be reported.

In summary, estimation uncertainty relates to the betas in the model, while fundamental uncertainty incorporates the error term and reflects real-world randomness.

Application: Ehrlich (1973) Crime Data

The following application is based on Carsey and Harden (2014, Chapter 9) and uses the Ehrlich (1973) crime data.

The Ehrlich (1973) crime data were originally collected by Isaac Ehrlich from various U.S. government sources, including the FBI's Uniform Crime Reports. They have been used in multiple studies to analyze how different variables influence the rate of crime at the state level in the United States in 1960.

We will use the following variables:

  • Ed: Mean years of schooling of the population aged 25 years or over.
  • Pop: State population in 1960 (in hundreds of thousands).
  • Wealth: Median value of transferable assets or family income.
  • Ineq: Income inequality, defined as the percentage of families earning below half the median income.
  • Prob: Probability of imprisonment, calculated as the ratio of the number of commitments to the number of offenses.

Simulating Expected Values

If we are interested in analyzing the marginal effects of laws and socioeconomic characteristics at the DGP/population level, we aim to generalize the results (external validity). In this case, we are mainly interested in reporting expected values. In particular:

\[ \hat{\mathbf{y}} = \mathbf{X} \color{red}{\hat{\boldsymbol{\beta}}} + \boldsymbol{\epsilon} \]

We want to simulate expected values from the estimates of the \(\color{red}{\text{systematic component}}\) (\(\hat{\boldsymbol{\beta}}\)), and in doing so we should reflect the estimation uncertainty from our sample. The code below shows an application using Ehrlich’s crime data.

crime <- read.csv(file="data/crime.csv")

names(crime)
##  [1] "M"      "So"     "Ed"     "Po1"    "Po2"    "LF"     "M.F"    "Pop"   
##  [9] "NW"     "U1"     "U2"     "Wealth" "Ineq"   "Prob"   "Time"   "Crime"
# OLS model from the crime data

ols.crime <- lm(Crime ~ Prob + Ed + Wealth + Ineq + Pop,
                data= crime)

summary(ols.crime)
## 
## Call:
## lm(formula = Crime ~ Prob + Ed + Wealth + Ineq + Pop, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -525.23 -178.94  -25.09  145.62  771.65 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4213.3894  1241.3857  -3.394 0.001538 ** 
## Prob        -3537.8472  2379.4467  -1.487 0.144709    
## Ed            113.6824    65.2770   1.742 0.089088 .  
## Wealth          0.3938     0.1151   3.422 0.001420 ** 
## Ineq          101.9513    25.9314   3.932 0.000318 ***
## Pop             1.0250     1.3726   0.747 0.459458    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 297 on 41 degrees of freedom
## Multiple R-squared:  0.4744, Adjusted R-squared:  0.4103 
## F-statistic:   7.4 on 5 and 41 DF,  p-value: 5.054e-05
# Required libraries: MASS to draw from a multivariate normal distribution;
# tidyverse (tibble, ggplot2) for the data handling and plotting steps below
library(MASS)
library(tidyverse)

set.seed(139742)

# Number of simulated sets of estimates
sims <- 1000

# Create empty vector to store expected values
crime.ev <- numeric(sims)

# Extract coefficient estimates
(ols.pe <- ols.crime$coef)
##   (Intercept)          Prob            Ed        Wealth          Ineq 
## -4213.3894097 -3537.8472359   113.6824213     0.3937634   101.9513371 
##           Pop 
##     1.0250155
# Extract variance-covariance matrix
(ols.vcv <- vcov(ols.crime))
##               (Intercept)          Prob            Ed       Wealth
## (Intercept) 1541038.49030 -709769.19103 -44771.256796 -89.61390595
## Prob        -709769.19103 5661766.66944   8786.410915  59.75016554
## Ed           -44771.25680    8786.41091   4261.086906  -2.08689151
## Wealth          -89.61391      59.75017     -2.086892   0.01323908
## Ineq         -29148.57228     520.23898    481.051641   2.14549156
## Pop              96.04510     718.81280     26.650630  -0.06370406
##                      Ineq          Pop
## (Intercept) -29148.572284  96.04509692
## Prob           520.238980 718.81279515
## Ed             481.051641  26.65063046
## Wealth           2.145492  -0.06370406
## Ineq           672.439541  -7.51149784
## Pop             -7.511498   1.88397661
# create matrix of random draws from point estimates and variance-covariance
crime.sim <- MASS::mvrnorm(n = sims,
                           mu = ols.pe,
                           Sigma = ols.vcv)

# Observe the output
head(crime.sim)
##      (Intercept)       Prob       Ed    Wealth      Ineq       Pop
## [1,]   -5783.511 -1514.5448 184.0508 0.4133287 131.70995 2.7579768
## [2,]   -2606.329  -197.3193 103.6119 0.2170619  57.52924 4.1641241
## [3,]   -3902.777 -4009.8277 132.8276 0.3606297  87.84978 0.9450562
## [4,]   -4298.787 -2933.4136 127.6321 0.3662950 100.72840 1.0143874
## [5,]   -3048.107   418.8794  81.7009 0.3237534  64.57730 3.1171501
## [6,]   -4638.457 -4482.6739 110.7449 0.4561119 112.09244 0.8862654
# Check out the mean
apply(crime.sim, 2, mean)
##   (Intercept)          Prob            Ed        Wealth          Ineq 
## -4191.7842430 -3532.9459074   115.1026667     0.3893421   101.2327120 
##           Pop 
##     1.0611060
# Set hypothetical values for the independent variables (at their means)
crime.x <- tibble(intercept = 1,
                  Prob = mean(crime$Prob),
                  Ed = mean(crime$Ed),
                  Wealth = mean(crime$Wealth),
                  Ineq = mean(crime$Ineq),
                  Pop = mean(crime$Pop))

# Compute one expected value to get the intuition

(ev1 <- as.matrix(crime.x) %*% crime.sim[1,]) # arguments must be conformable!
##          [,1]
## [1,] 917.1692
# Once you get the desired output, program the loop

for (i in 1:sims) {
  crime.ev[i] <- as.matrix(crime.x) %*% crime.sim[i,]
}

# Browse expected values
head(crime.ev)
## [1] 917.1692 887.8687 945.1400 827.0851 902.5784 923.7179
(mean_ev <- mean(crime.ev))
## [1] 906.0755
# We can now represent the expected value and its uncertainty

# CIs at a 0.05 significance level
(lwr <- quantile(crime.ev, probs = 0.025))
##     2.5% 
## 816.9564
(upr <- quantile(crime.ev, probs = 0.975))
##    97.5% 
## 991.2142

Although displaying a distribution of expected values is an effective way to communicate model results, by itself it doesn’t add much beyond simply reporting the coefficient table. You could essentially achieve the same thing without simulating the estimated parameters.

The real advantage of reporting estimands or quantities of interest (QoI) via simulation is to provide and contrast meaningful scenarios that address your research questions. For example, suppose we hypothesize that economic inequality affects the crime rate at the state level. We can compute expected values under two scenarios and then their difference:

  1. Low inequality: Compute the expected value at the minimum observed level of inequality.
  2. High inequality: Compute the expected value at the maximum observed level of inequality.
  3. First Difference (QoI): Compute the difference between the expected values under high and low inequality and determine whether this difference is significantly different from zero.
# Set hypothetical values for the independent variables

(crime.x2 <- tibble(intercept = 1,
                    Prob = mean(crime$Prob),
                    Ed = mean(crime$Ed),
                    Wealth = mean(crime$Wealth),
                    # notice values of inequality
                    Ineq = c(min(crime$Ineq),
                             max(crime$Ineq)),
                    Pop = mean(crime$Pop)))
## # A tibble: 2 × 6
##   intercept   Prob    Ed Wealth  Ineq   Pop
##       <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1         1 0.0471  10.6  5254.  12.6  36.6
## 2         1 0.0471  10.6  5254.  27.6  36.6
# Matrix-multiply both sets of hypothetical values by
# each set of simulated coefficients

crime.ev2 <- matrix(NA,
                    nrow = sims,
                    ncol = 3)

# Because we have two scenarios, we need parallel simulations

for (i in 1:sims) {
  # two scenarios: low and high inequality
  crime.ev2[i, 1] <- as.matrix(crime.x2[1, ]) %*% crime.sim[i, ]
  crime.ev2[i, 2] <- as.matrix(crime.x2[2, ]) %*% crime.sim[i, ]
  
  # estimand - quantity of interest: first difference
  crime.ev2[i, 3] <-  crime.ev2[i, 2] - crime.ev2[i, 1]
}


## Prepare the data for visualization (long format for ggplot)

data_vis <- tibble(
  pe=c(crime.ev2[,1], # low inequality estimates
       crime.ev2[,2], # high inequality estimates
       crime.ev2[,3]), # high - low = difference
  
  # labels for scenarios
  scenario = c(rep("Low Ineq",sims),
               rep("High Ineq",sims),
               rep("First-Diff",sims)),
  
  # labels of quantities
  QIs = c(rep("Expected Values",sims*2),
          rep("First Differences",sims))
)
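
One way to visualize data_vis is to plot the distributions of the two expected-value scenarios next to the distribution of first differences; the geoms and faceting below are one possible sketch, not the only choice. If the 95% interval of the first differences excludes zero, the high- and low-inequality scenarios differ significantly.

# Sketch: density plots of the simulated quantities (requires ggplot2,
# loaded above via library(tidyverse))
ggplot(data_vis, aes(x = pe, fill = scenario)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ QIs, scales = "free") +
  labs(x = "Simulated values", y = "Density", fill = "Scenario") +
  theme_classic()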

Simulation results show that the simulated coefficients average out to be very close to the actual estimated coefficients. Importantly, you could increase the precision of these simulations by increasing the number of iterations (e.g., the sims object).

Simulating Predicted Values

When we are interested in providing predictions for real, concrete cases (e.g., a forecast of crime in Washington state), we need to compute predicted values:

\[ \hat{\mathbf{y}} = \mathbf{X} \color{red}{\hat{\boldsymbol{\beta}}} + \color{blue}{\boldsymbol{\epsilon}} \]

To compute predicted values, we need to simulate both the estimates of the \(\color{red}{\text{systematic component}}\) (\(\hat{\boldsymbol{\beta}}\)) and the \(\color{blue}{\text{stochastic component}}\) (\(\boldsymbol{\epsilon}\)).

To generate predicted values of the dependent variable, we first compute the expected values, and then add random draws from the appropriate distribution. For OLS, we assume a normal distribution for the dependent variable, with the independent variables affecting the parameter \(\mu\).

However, we must also account for the normal distribution’s other parameter, \(\sigma\). The lm() function in R estimates \(\hat{\sigma}\) using the formula:

\[ \hat{\sigma} = \sqrt{\frac{\sum_{i=1}^n \left( Y_i - X_i \hat{\beta} \right)^2}{n - p}} \]

Where:

  • \(Y\) is the dependent variable,
  • \(X\) is the matrix of independent variables,
  • \(\hat{\beta}\) is the vector of estimated coefficients,
  • \(n\) is the sample size,
  • \(p\) is the number of estimated coefficients.

The numerator represents the sum of squared residuals (SSR), while the denominator adjusts for degrees of freedom.
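
As a quick sanity check (a sketch using base R functions on the crime model estimated above), the formula can be compared against the residual standard error that summary() reports:

# Sketch: manual sigma-hat vs. the residual standard error from lm()
sqrt(sum(residuals(ols.crime)^2) / ols.crime$df.residual)
summary(ols.crime)$sigma   # should match (about 297 for this model)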

Simulating \(\hat{\sigma}\)

We cannot simply use the model’s \(\hat{\sigma}\) for generating predicted values because it, too, is an estimate. To generate predicted values, we must simulate estimates of both the coefficients and variance. We simulate \(\hat{\sigma}\) by applying the formula above to each set of simulated coefficients.

# Model residual degrees of freedom
crime.DoF <- ols.crime$df.residual

# Empty vector
crime.sigma <- numeric(sims)

# Loop through sets of simulated coefficients

for (i in 1:sims) {
  # create the matrix of independent variables
  x <- as.matrix(cbind(1, ols.crime$model[ , -1]))
  
  # Compute the SSR for each set of simulated coefficients
  ssr <- sum((crime$Crime - (x %*% crime.sim[i, ]))^2)
  
  # Estimate sigma
  crime.sigma[i] <- sqrt(ssr / crime.DoF)
}

To generate a single predicted value, we provide two quantities to rnorm():

  1. Mean: The expected value calculated from the simulated estimates of \(\hat{\beta}\).
  2. Standard deviation: The value of \(\hat{\sigma}\) calculated from the simulated estimates.

By doing this for each set of simulated estimates, we incorporate both estimation uncertainty (from the coefficients \(\hat{\beta}\)) and fundamental uncertainty (from the error term \(\hat{\sigma}\)) into the predicted values.

The code example below shows how to do this when all predictors are set to their means:

## Example: one predicted value with both estimation and fundamental uncertainty

(mu <-  as.matrix(crime.x) %*% crime.sim[1, ])
##          [,1]
## [1,] 917.1692
(sigma <- crime.sigma[1])
## [1] 313.4067
(example.pred <- rnorm(1, mean=mu, sd=sigma))
## [1] 1390.018
# Once you get the intuition, and the desired output, you can program the loop.

# Matrix multiply x predictors and each set of simulated coefficients to create mu
# then take a random draw from the normal distribution with mean equal to mu and
# sd equal to sigma

crime.pred <- numeric(sims)

for (i in 1:sims) {
  mu <- as.matrix(crime.x) %*% crime.sim[i, ]
  sigma <- crime.sigma[i] 
  crime.pred[i] <- rnorm(1, mean=mu, sd=sigma)
}


# Browse predicted values
head(crime.pred)
## [1] 648.2879 818.5393 889.4261 621.8942 994.8853 867.0308
(mean_pred <- mean(crime.pred))
## [1] 891.8464
# We can now represent the predicted value and its uncertainty

# CIs at a 0.05 significance level
(lwr <- quantile(crime.pred, probs = 0.025))
##     2.5% 
## 284.6418
(upr <- quantile(crime.pred, probs = 0.975))
##    97.5% 
## 1519.909

This process ensures that both estimation uncertainty (affecting the betas) and fundamental uncertainty (affecting the error term) are accounted for in the generated predicted values.
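
As a quick comparison (a sketch using the objects computed above), we can place the expected-value and predicted-value summaries side by side; the predicted-value interval should be noticeably wider because it adds fundamental uncertainty on top of estimation uncertainty.

# Sketch: compare simulation-based intervals (all covariates at their means)
tibble(quantity = c("Expected value", "Predicted value"),
       mean     = c(mean(crime.ev), mean(crime.pred)),
       lower    = c(quantile(crime.ev, 0.025), quantile(crime.pred, 0.025)),
       upper    = c(quantile(crime.ev, 0.975), quantile(crime.pred, 0.975)))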

Recap: Heteroskedastic Normal Example

In this section, we will revisit the Heteroskedastic Normal model from last week. We will later compute quantities of interest using the simcf package.

rm(list=ls())       # Clear memory

set.seed(123456)    # For reproducible random numbers
library(MASS)       # Load packages
library(simcf) 
library(tidyverse)

n <- 1500           # Generate 1500 observations

w0 <- rep(1, n)     # Create the constant
w1 <- runif(n)      # Create two covariates
w2 <- runif(n)

x <- cbind(w0, w1, w2)       # Create a matrix of the covariates
z <- x                       # i.e., same covariates affect mu and sigma

beta <- c(0, 5, 15) # Set a parameter vector for the mean
                    # One for constant, one for covariate 1, 
                    # one for covariate 2.

gamma <- c(1, 0, 3) # Set a parameter vector for the variance 
                    # Gamma estimate for covariate 2 is set to be 3,
                    # which creates heteroskedasticity 

mu <- x %*% beta             # Create systematic component for the mean

sigma2 <- exp(z %*% gamma)   # Create systematic component for the variance
                             # Since ith row of sigma2 = exp(1 + 0 + w2_i * 3)
                             # i.e., it is a function of w2 (heteroskedastic)

y <- rnorm(n = n,            # Create the outcome variable
           mean = mu,        # Think about the stochastic component!
           sd = sqrt(sigma2)
           ) 

data <- cbind(y, w1, w2)     # Save the data to a data frame
data <- as.data.frame(data)
names(data) <- c("y", "w1", "w2")


ls.result <- lm(y ~ w1 + w2, data = data) # Fit a linear model
                                          # using simulated data

ls.aic <- AIC(ls.result) # Calculate and print the AIC 
                         # i.e. Akaike Information Criterion
                         # to assess goodness of fit; lower AIC is better


# A likelihood function for ML heteroskedastic Normal
llk.hetnormlin <- function(param, y, x, z) {
  x <- as.matrix(x)                      # x (some covariates) as a matrix
  z <- as.matrix(z)                      # z (some covariates) as a matrix
  os <- rep(1, nrow(x))                  # Set the intercept as 1 (constant)
  x <- cbind(os, x)                      # Add intercept to covariates x
  z <- cbind(os, z)                      # Add intercept to covariates z
  
  b <- param[1 : ncol(x)]                         # Parameters for x
  g <- param[(ncol(x) + 1) : (ncol(x) + ncol(z))] # Parameters for z
  
  xb <- x %*% b                          # Systematic components for mean
  s2 <- exp(z %*% g)                     # Systematic components for variance
  
  sum(0.5 * (log(s2) + (y - xb)^2 / s2)) # Negative log-likelihood (up to a constant)
                                         # optim() is a minimizer by default:
                                         # maximizing lnL is equivalent to
                                         # minimizing -lnL, hence the reversed signs
}


# Create input matrices
xcovariates <- cbind(w1, w2)
zcovariates <- cbind(w1, w2)

# Initial guesses of beta0, beta1, ..., gamma0, gamma1, ...
# We need one entry per parameter, in order!
# Note: also include beta and gamma estimates for constants
stval <- c(0, 0, 0, 0, 0, 0)  

# Run ML, and get the output we need
hetnorm.result <- optim(stval,           # Initial guesses
                        llk.hetnormlin,  # Likelihood function
                        method = "BFGS", # Gradient method
                        hessian = TRUE,  # Return Hessian matrix
                        y = y,           # Outcome variable
                        x = xcovariates, # Covariates x (w/o constant)
                        z = zcovariates  # Covariates z (w/o constant)
                        ) 

Note: By default, optim() performs minimization. You can make optim() a maximizer by adding control = list(fnscale = -1).

pe <- hetnorm.result$par            # Point estimates
round(pe, 3)
## [1] -0.167  5.007 15.698  0.910  0.224  2.994
vc <- solve(hetnorm.result$hessian) # Var-cov matrix (for computing s.e.)
round(vc, 5)
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
## [1,]  0.02909 -0.03265 -0.02810  0.00059 -0.00087 -0.00032
## [2,] -0.03265  0.06787 -0.00025 -0.00091  0.00099  0.00084
## [3,] -0.02810 -0.00025  0.10331 -0.00056  0.00143 -0.00033
## [4,]  0.00059 -0.00091 -0.00056  0.00920 -0.00832 -0.00749
## [5,] -0.00087  0.00099  0.00143 -0.00832  0.01693 -0.00042
## [6,] -0.00032  0.00084 -0.00033 -0.00749 -0.00042  0.01568
se <- sqrt(diag(vc))                # To compute standard errors (s.e.),
                                    # take the diagonal of the var-cov matrix
                                    # (the inverse Hessian), then the square root
round(se, 3)
## [1] 0.171 0.261 0.321 0.096 0.130 0.125
mle.result <- cbind(pe[1:3], se[1:3])         # Report ML results 
colnames(mle.result) <- c("Estimate", "Std.Error")
rownames(mle.result) <- c("(Intercept)", "w1", "w2")
round(mle.result, 3)
##             Estimate Std.Error
## (Intercept)   -0.167     0.171
## w1             5.007     0.261
## w2            15.698     0.321
round(coef(summary(ls.result))[, c(1, 2)], 3) # Compare with lm() results
##             Estimate Std. Error
## (Intercept)   -0.049      0.282
## w1             4.784      0.377
## w2            15.683      0.371
ll <- -hetnorm.result$value                   # Report likelihood at maximum 
                                              # No need to have negative sign 
                                              # if optim is set as maximizer
ll
## [1] -2620.334
# AIC is 2 * number of parameters - 2 * ll (i.e. likelihood at maximum)
hetnorm.aic <- 2 * length(stval) - 2 * ll  

# Compare AIC from ML and from lm(); lower is better
print(hetnorm.aic)
## [1] 5252.668
print(ls.aic)
## [1] 8545.903

Predict Quantities of Interest

Motivation: We want to study how a change in a particular explanatory variable affects the outcome variable, all else being equal.

Scenario 1: Vary covariate 1; hold covariate 2 constant

  1. Create a data frame with a set of hypothetical scenarios for covariate 1, while keeping covariate 2 at its mean

    • What is the sensible range of some hypothetical scenarios for covariate 1? Consider the original range of w1.
  2. Calculate the predicted values using the predict() function

    • Hint: you need at least the following arguments: predict(object = ... , newdata = ... , interval = ... , level = ...)
  3. Plot the prediction intervals

  4. Similarly, calculate the confidence intervals using the predict() function

  5. Plot the confidence intervals; compare them with the predictive intervals

# Set up

# Set up hypothetical values for w1 and w2
(w1range <- seq(from = 0, to = 1, by = 0.05))
##  [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
## [16] 0.75 0.80 0.85 0.90 0.95 1.00
(w2mean <- mean(w2))                            
## [1] 0.4912698
# Create a new dataset using crossing()
(xhypo <- crossing(w1 = w1range, w2 = w2mean))
## # A tibble: 21 × 2
##       w1    w2
##    <dbl> <dbl>
##  1  0    0.491
##  2  0.05 0.491
##  3  0.1  0.491
##  4  0.15 0.491
##  5  0.2  0.491
##  6  0.25 0.491
##  7  0.3  0.491
##  8  0.35 0.491
##  9  0.4  0.491
## 10  0.45 0.491
## # ℹ 11 more rows
# Note: if you are more comfortable, you can directly create a data frame
# with the hypothetical values

# Let's predict values using base R function

# Calculate predicted values
simPI.w1 <- predict(ls.result,                 # A model object
                    newdata = xhypo,           # New dataset 
                    interval = "prediction",   # What kind of intervals?
                    level = 0.95)              # What levels of confidence?

simPI.w1 <- simPI.w1 |>
  as_tibble() |>         # Coerce it into a tibble
  bind_cols(w1 = w1range) # Combine hypo w1 with predicted y

# Calculate confidence intervals using predict()
simCI.w1 <- predict(ls.result,
                    newdata = xhypo,
                    interval = "confidence",
                    level = 0.95)

simCI.w1 <- simCI.w1 |>
  as_tibble() |>
  bind_cols(w1 = w1range)


# ggplot2
theme_set(theme_classic())
simALL.w1 <- bind_rows(
  simPI.w1 |> mutate(type = "PI"),
  simCI.w1 |> mutate(type = "CI")
  )
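
The stacked data frame simALL.w1 can then be plotted; the sketch below mirrors the ribbon-style plot used later in this document for the w2 comparison.

# Sketch: prediction vs. confidence intervals across hypothetical w1 values
ggplot(simALL.w1, aes(x = w1, y = fit, ymin = lwr, ymax = upr)) +
  geom_line() +
  geom_ribbon(alpha = 0.1) +
  labs(y = "Predicted Y", x = "Covariate 1") +
  facet_grid(~ type)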

Scenario 2: Vary covariate 2; hold covariate 1 constant

Now, instead of providing a range of values for w1, we hold that variable constant and instead compute the marginal effect of a change in w2 by simulating a range of values for it.

w2range <- seq(from = 0, to = 1, by = 0.05)   # Set up hypothetical values for w2
w1mean <- mean(w1)                            # Set up mean for w1
xhypo <- crossing(w1 = w1mean,               # Create a new dataset 
                  w2 = w2range)                # crossing() is from tidyr package
# Calculate predicted values
simPI.w2 <- predict(ls.result,                 # A model object
                    newdata = xhypo,           # New dataset 
                    interval = "prediction",   # What kind of intervals?
                    level = 0.95)              # What levels of confidence?

simPI.w2 <- simPI.w2 |>
  as_tibble() |>
  bind_cols(w2 = w2range)


# Calculate confidence intervals using predict()
simCI.w2 <- predict(ls.result,
                    newdata = xhypo,
                    interval = "confidence",
                    level = 0.95)

simCI.w2 <- simCI.w2 |>
  as_tibble() |>
  bind_cols(w2 = w2range)


# ggplot2
theme_set(theme_classic())
simALL.w2 <- bind_rows(
  simPI.w2 |> mutate(type = "PI"),
  simCI.w2 |> mutate(type = "CI")
  )
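
The same plotting sketch applies to the w2 scenarios.

# Sketch: prediction vs. confidence intervals across hypothetical w2 values
ggplot(simALL.w2, aes(x = w2, y = fit, ymin = lwr, ymax = upr)) +
  geom_line() +
  geom_ribbon(alpha = 0.1) +
  labs(y = "Predicted Y", x = "Covariate 2") +
  facet_grid(~ type)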

Simulating QoI using simcf

Motivation: Can we use simulation methods to produce the same prediction and confidence intervals? Recall the lecture: we can draw a bunch of \(\tilde{\beta}\) from a multivariate normal distribution.

Scenario 1: Vary covariate 1; hold covariate 2 constant

  1. Create a data frame with a set of hypothetical scenarios for covariate 1 while keeping covariate 2 at its mean

  2. Simulate the predicted values using MASS and simcf

  3. Plot the results

In order to use simcf to generate quantities of interest, the following steps are needed:

  1. Estimate: MLE \(\hat{\beta}\) and its variance \(\hat{V}(\hat{\beta})\)
    \(\color{red}{\rightarrow \texttt{optim(), glm()}}\)

  2. Simulate estimation uncertainty from a multivariate normal distribution:
    Draw \(\tilde{\beta} \sim MVN \big[\hat{\beta}, \hat{V}(\hat{\beta})\big]\)
    \(\color{red}{\rightarrow \texttt{MASS::mvrnorm()}}\)

  3. Create hypothetical scenarios of your substantive interest:
    Choose values of X: \(X_c\) \(\color{red}{\rightarrow \texttt{simcf::cfMake(), cfChange()} \dots}\)

  4. Calculate expected values:
    \(\tilde{\mu_c} = g(X_c, \tilde{\beta})\)  

  5. Simulate fundamental uncertainty:
    \(\tilde{y_c} \sim f(\tilde{\mu_c}, \tilde{\alpha})\)
    \(\color{red}{\rightarrow \texttt{simcf::hetnormsimpv()} \dots}\)

    or

  6. Compute EVs, First Differences or Relative Risks
    EV: \(\mathbb{E}(y|X_{c1})\)
    \(\color{red}{\rightarrow \texttt{simcf::logitsimev()} \dots}\)
    FD: \(\mathbb{E}(y|X_{c2}) - \mathbb{E}(y|X_{c1})\)
    \(\color{red}{\rightarrow \texttt{simcf::logitsimfd()} \dots}\)
    RR: \(\frac{\mathbb{E}(y|X_{c2})}{\mathbb{E}(y|X_{c1})}\)
    \(\color{red}{\rightarrow \texttt{simcf::logitsimrr()} \dots}\)

Recall our likelihood function is:

\[ \begin{aligned} \mathrm{P}\left(\mathbf{y} | \boldsymbol{\mu}, \boldsymbol{\sigma}^{2}\right) &=\prod_{i=1}^{n} f_{\mathcal{N}}\left(y_{i} | \mu_{i}, \sigma_{i}^{2}\right) & \tiny[\text{Joint probability}] \\ \mathrm{P}\left(\mathbf{y} | \boldsymbol{\mu}, \boldsymbol{\sigma}^{2}\right) &=\prod_{i=1}^{n}\left(2 \pi \sigma_{i}^{2}\right)^{-1 / 2} \exp \left[\frac{-\left(y_{i}-\mu_{i}\right)^{2}}{2 \sigma_{i}^{2}}\right] & \tiny[\text{Expressed in Normal distribution}]\\ \log \mathcal{L}\left(\boldsymbol{\mu}, \boldsymbol{\sigma}^{2} | \mathbf{y}\right) & \propto-\frac{1}{2} \sum_{i=1}^{n} \log \sigma_{i}^{2}-\frac{1}{2} \sum_{i=1}^{n} \frac{\left(y_{i}-\mu_{i} \right)^{2}}{\sigma_{i}^{2}} & \tiny[\text{Converted to log likelihood; simplify}] \\ \log \mathcal{L}(\boldsymbol{\beta}, \boldsymbol{\gamma} | \mathbf{y}) & \propto-\frac{1}{2} \sum_{i=1}^{n} \mathbf{z}_{i} \boldsymbol{\gamma}-\frac{1}{2} \sum_{i=1}^{n} \frac{\left(y_{i}-\mathbf{x}_{i} \boldsymbol{\beta}\right)^{2}}{\exp \left(\mathbf{z}_{i} \boldsymbol{\gamma}\right)} & \tiny[\text{Substitute in systematic components}] \end{aligned} \]

# Draw parameters from the estimated sampling distribution of the MLEs
sims <- 10000
simparam <- mvrnorm(n = sims,
                    mu = pe,    # M.N.D. with population mean = pe (from ML);
                    Sigma = vc) # population var-covar matrix = vc (from ML)
head(simparam)                  # Inspect
##             [,1]     [,2]     [,3]      [,4]      [,5]     [,6]
## [1,]  0.24211743 4.253787 15.94193 0.8538881 0.1676888 3.187143
## [2,] -0.10028149 4.866367 15.88825 0.8131899 0.2369895 3.038976
## [3,] -0.15729804 4.813402 15.81664 0.9172401 0.3016272 2.864065
## [4,] -0.01521245 4.482252 16.12675 0.9590520 0.2337601 2.975280
## [5,]  0.10941227 4.821406 15.16535 0.9812625 0.2492318 2.763246
## [6,] -0.22838155 4.947090 15.62518 0.9596191 0.1447425 2.870572
# Separate into the simulated betas and simulated gammas
simbetas <- simparam[ , 1:(ncol(xcovariates) + 1)]
simgammas <- simparam[ , (ncol(simbetas) + 1):ncol(simparam)]

# Specify model formulas
varmodel <- (y ~ w1 + w2)

# Create hypothetical scenarios for covariate 1 
w1range <- seq(from = 0, to = 1, by = 0.2) 


# Use `cfMake` to create a baseline dataset
xhypo <- cfMake(varmodel, 
                data, 
                nscen = length(w1range)) 

xhypo
## $x
##          y        w1        w2
## 1 10.06525 0.5036167 0.4912698
## 2 10.06525 0.5036167 0.4912698
## 3 10.06525 0.5036167 0.4912698
## 4 10.06525 0.5036167 0.4912698
## 5 10.06525 0.5036167 0.4912698
## 6 10.06525 0.5036167 0.4912698
## 
## $xpre
##          y        w1        w2
## 1 10.06525 0.5036167 0.4912698
## 2 10.06525 0.5036167 0.4912698
## 3 10.06525 0.5036167 0.4912698
## 4 10.06525 0.5036167 0.4912698
## 5 10.06525 0.5036167 0.4912698
## 6 10.06525 0.5036167 0.4912698
## 
## $model
## y ~ w1 + w2
## 
## attr(,"class")
## [1] "list"           "counterfactual"
# Use `cfChange` and a loop to set each hypothetical w1 value
for (i in 1:length(w1range)) {
  xhypo <- cfChange(xscen = xhypo, 
                    covname = "w1", 
                    x = w1range[i], 
                    scen = i) 
}

xhypo
## $x
##          y  w1        w2
## 1 10.06525 0.0 0.4912698
## 2 10.06525 0.2 0.4912698
## 3 10.06525 0.4 0.4912698
## 4 10.06525 0.6 0.4912698
## 5 10.06525 0.8 0.4912698
## 6 10.06525 1.0 0.4912698
## 
## $xpre
##          y        w1        w2
## 1 10.06525 0.5036167 0.4912698
## 2 10.06525 0.5036167 0.4912698
## 3 10.06525 0.5036167 0.4912698
## 4 10.06525 0.5036167 0.4912698
## 5 10.06525 0.5036167 0.4912698
## 6 10.06525 0.5036167 0.4912698
## 
## $model
## y ~ w1 + w2
## 
## attr(,"class")
## [1] "list"           "counterfactual"
# Repeat the same procedures for z 
zhypo <- cfMake(varmodel, data, nscen = length(w1range))

for (i in 1:length(w1range)) {
  zhypo <- cfChange(zhypo, "w1", x = w1range[i], scen = i) 
  }

zhypo
## $x
##          y  w1        w2
## 1 10.06525 0.0 0.4912698
## 2 10.06525 0.2 0.4912698
## 3 10.06525 0.4 0.4912698
## 4 10.06525 0.6 0.4912698
## 5 10.06525 0.8 0.4912698
## 6 10.06525 1.0 0.4912698
## 
## $xpre
##          y        w1        w2
## 1 10.06525 0.5036167 0.4912698
## 2 10.06525 0.5036167 0.4912698
## 3 10.06525 0.5036167 0.4912698
## 4 10.06525 0.5036167 0.4912698
## 5 10.06525 0.5036167 0.4912698
## 6 10.06525 0.5036167 0.4912698
## 
## $model
## y ~ w1 + w2
## 
## attr(,"class")
## [1] "list"           "counterfactual"
# Simulate predictive intervals for heteroskedastic linear models
# `hetnormsimpv()` is from simcf package
simRES.w1 <- hetnormsimpv(xhypo, simbetas,
                          zhypo, simgammas,
                          ci = 0.95,
                          constant = 1, 
                          varconstant = 1)

simRES.w1
## $pe
## [1]  7.515323  8.566149  9.575781 10.579808 11.552741 12.524484
## 
## $lower
##          [,1]
## low 0.8847932
## low 1.9381952
## low 2.8502531
## low 3.5495829
## low 4.5126707
## low 5.3290850
## 
## $upper
##        [,1]
## up 13.87476
## up 15.15291
## up 16.41457
## up 17.46719
## up 18.47127
## up 19.71323
simRES.w1 <- simRES.w1 |>
  bind_rows() |>         # Collapse the list into a d.f.
  bind_cols(w1 = w1range) # Combine with hypo. w1 values

simRES.w1                 # Inspect
## # A tibble: 6 × 4
##      pe lower[,1] upper[,1]    w1
##   <dbl>     <dbl>     <dbl> <dbl>
## 1  7.52     0.885      13.9   0  
## 2  8.57     1.94       15.2   0.2
## 3  9.58     2.85       16.4   0.4
## 4 10.6      3.55       17.5   0.6
## 5 11.6      4.51       18.5   0.8
## 6 12.5      5.33       19.7   1
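
Step 3 calls for a plot; the sketch below follows the same pattern used for the w2 scenarios later in this section.

# Sketch: simulated predictive intervals for hypothetical w1 values
ggplot(simRES.w1, aes(x = w1, y = pe, ymax = upper, ymin = lower)) +
  geom_line() +
  geom_ribbon(alpha = 0.1) +
  labs(y = "Predicted Y", x = "Covariate 1")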

Scenario 2: Vary covariate 2; hold covariate 1 constant

  1. Create a data frame with a set of hypothetical scenarios for covariate 2 while keeping covariate 1 at its mean

  2. Simulate the predicted values using MASS and simcf

  3. Plot the results

# Create hypothetical scenarios for w2
w2range <- seq(0, 1, by = 0.05)

xhypo <- cfMake(varmodel, data, nscen = length(w2range))

for (i in 1:length(w2range)) {
  xhypo <- cfChange(xhypo, "w2", x = w2range[i], scen = i) 
}

zhypo <- cfMake(varmodel, data, nscen = length(w2range))

for (i in 1:length(w2range)) {
  zhypo <- cfChange(zhypo, "w2", x = w2range[i], scen = i) 
}


# Simulate the predicted Y's and PI's
simRES.w2 <- hetnormsimpv(xhypo, simbetas,
                          zhypo, simgammas,
                          ci = 0.95,
                          constant = 1, 
                          varconstant = 1)

simRES.w2 <- simRES.w2 |>
  bind_rows() |>
  bind_cols(w2 = w2range)
# Plot the predictive intervals for hypothetical w2
ggplot(simRES.w2, aes(x = w2, y = pe, ymax = upper, ymin = lower)) +
  geom_line() +
  geom_ribbon(alpha = 0.1) +
  labs(y = "Predicted Y", x = "Covariate 2")

Can we compare them with the prediction intervals generated by predict()?

# Combine two dataframes
simPI.w2 <- simPI.w2 |> 
  rename(pe = fit, lower = lwr, upper = upr)

simALL.w2 <- bind_rows(
  simRES.w2 |> mutate(method = "sim"),
  simPI.w2 |> mutate(method = "lm")
)
# Plot the predictive intervals for hypothetical w2
ggplot(simALL.w2, aes(x = w2, y = pe, ymax = upper, ymin = lower)) +
  geom_line() +
  geom_ribbon(alpha = 0.1) +
  labs(y = "Predicted Y", x = "Covariate 2") +
  facet_grid(~ method)