Simulation is a powerful method for generating meaningful quantities of interest (QI) from statistical models. Instead of only reporting coefficients, researchers can simulate expected values, expected probabilities, first differences, and measures of uncertainty to give a clearer picture of the model results.
QI are especially valuable because they go beyond mere coefficients to provide a more intuitive understanding of the model’s output. While it is possible to compute expected probabilities using the model’s coefficients alone, simulation adds value by incorporating confidence intervals, thus addressing uncertainty.
In any statistical model, the coefficients are subject to estimation uncertainty—we estimate parameters from a sample but want to generalize to the broader population. If a different sample were collected, the estimated coefficients could vary, leading to uncertainty in any computed quantities from those coefficients. Simulation quantifies this uncertainty by drawing from the sampling distribution of the model’s estimates.
These draws reflect the variability of each coefficient and their covariances, resulting in a large number of plausible estimates. Rather than simply summarizing these draws, it is more informative to compute a substantively meaningful QI from the simulated estimates, such as a predicted or expected value of the dependent variable.
When computing predictions and simulations of quantities of interest (QoI), it is essential to recognize that the model we use to approximate the data-generating process (DGP) is subject to two types of uncertainty: estimation uncertainty (from estimating the parameters with a finite sample) and fundamental uncertainty (from the inherent randomness of the outcome).
Thus, the choice between predicted and expected values depends on the type of uncertainty the analyst wants to highlight—whether focusing on the uncertainty about the model itself (estimation uncertainty) or on the variability in the world (fundamental uncertainty).
In a simple linear regression model, we can highlight both sources of uncertainty. The equation is as follows:
\[ \hat{y_i} = \color{red}{\hat{\beta_0}} + \color{red}{\hat{\beta_1}} x_i + \color{blue}{\hat{\epsilon}_i} \]
\(\color{red}{\text{Estimation uncertainty}}\): Represents uncertainty in the estimation of the coefficients \(\hat{\beta_0}\) and \(\hat{\beta_1}\), which are estimates derived from the data.
\(\color{blue}{\text{Fundamental uncertainty}}\): Represented by the \(\color{blue}{\text{stochastic}}\) component \(\hat{\epsilon}_i\), accounting for the randomness in the dependent variable that the model cannot explain even after the \(\color{red}{\text{systematic}}\) component is taken into account.
Which quantity of interest should we report: expected values or predicted values? When analyzing estimands at the population or data-generating process (DGP) level, we focus on expected values. However, if the goal is to make specific predictions for real-world cases—such as in policy analysis for government agencies, private banks, or consulting firms—predicted values should be reported.
In summary, estimation uncertainty relates to the estimates \(\hat{\beta}\) in the model, while fundamental uncertainty incorporates the residuals \(\hat{\epsilon}\) and reflects real-world randomness.
The following application is based on Carsey and Harden (2014, Chapter 9) and uses Ehrlich's (1973) crime data.
The Ehrlich (1973) Crime Data was originally collected by Isaac Ehrlich from various U.S. government sources, including the Uniform Crime Report of the FBI. It has been used in multiple studies to analyze how different variables influence the rate of crime at the state level in the United States during 1960.
We will use the following variables: Crime (the state crime rate, our dependent variable), Prob (probability of imprisonment), Ed (mean years of schooling), Wealth (a measure of state wealth), Ineq (income inequality), and Pop (state population).
If we are interested in analyzing the marginal effects of laws and socioeconomic characteristics at the DGP/population level, we aim to generalize results (external validity). In these cases, we primarily report expected values. Specifically:
\[ \hat{\mathbf{y}} = \mathbf{X} \color{red}{\hat{\boldsymbol{\beta}}} + \hat{\epsilon} \]
We simulate expected values based on the estimates of the \(\color{red}{\text{systematic component}}\) (\(\hat{\boldsymbol{\beta}}\)), reflecting estimation uncertainty from the sample.
To compute expected values \(\mathbb{E}[y \mid X_c, \tilde{\beta}] = \tilde{\mu}_c\) via simulation:

1. Extract the model's \(k\) point estimates \(\hat{\beta}_k\) and their variance-covariance matrix \(\hat{V}(\hat{\beta}_k)\). Use these to take \(n\) random draws from the sampling distribution of the coefficients: \(\boldsymbol{\tilde{\beta}} \sim \mathcal{MVN}(\boldsymbol{\hat{\beta}},\hat{V}[\boldsymbol{\hat{\beta}}])\). Here, \(\boldsymbol{\tilde{\beta}}\) is an \(n \times k\) matrix of simulated coefficients.
2. Create a \(1 \times k\) row vector of predictor values \(\boldsymbol{x}_c'\) for some scenario \(c\) of interest, e.g., all predictors at their means.
3. Compute the expected values via matrix algebra: \(\boldsymbol{\tilde{\mu}}_c = \boldsymbol{\tilde{\beta}} \boldsymbol{x}_c\), an \(n \times 1\) vector with one expected value per simulated coefficient vector (equivalently, loop over the draws as in the code below).
4. Visualize \(\tilde{\mu}_c\) and/or report statistics from its simulated distribution, such as the mean and the quantiles that represent the confidence interval.
The code below demonstrates this using Ehrlich's crime data.
crime <- read.csv(file="data/crime.csv")
names(crime)
## [1] "M" "So" "Ed" "Po1" "Po2" "LF" "M.F" "Pop"
## [9] "NW" "U1" "U2" "Wealth" "Ineq" "Prob" "Time" "Crime"
# OLS model from the crime data
ols.crime <- lm(Crime ~ Prob + Ed + Wealth + Ineq + Pop,
data= crime)
summary(ols.crime)
##
## Call:
## lm(formula = Crime ~ Prob + Ed + Wealth + Ineq + Pop, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -525.23 -178.94 -25.09 145.62 771.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4213.3894 1241.3857 -3.394 0.001538 **
## Prob -3537.8472 2379.4467 -1.487 0.144709
## Ed 113.6824 65.2770 1.742 0.089088 .
## Wealth 0.3938 0.1151 3.422 0.001420 **
## Ineq 101.9513 25.9314 3.932 0.000318 ***
## Pop 1.0250 1.3726 0.747 0.459458
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 297 on 41 degrees of freedom
## Multiple R-squared: 0.4744, Adjusted R-squared: 0.4103
## F-statistic: 7.4 on 5 and 41 DF, p-value: 5.054e-05
# Required libraries: MASS to draw from a multivariate normal distribution,
# tidyverse for tibble() and ggplot2 used below
library(MASS)
library(tidyverse)
set.seed(139742)
### Simulation EV: step 1
# Number of simulated sets of estimates
sims <- 1000
# Create empty vector to store expected values
crime.ev <- numeric(sims)
# Extract coefficient estimates
(ols.pe <- ols.crime$coef)
## (Intercept) Prob Ed Wealth Ineq
## -4213.3894097 -3537.8472359 113.6824213 0.3937634 101.9513371
## Pop
## 1.0250155
# Extract variance-covariance matrix
(ols.vcv <- vcov(ols.crime))
## (Intercept) Prob Ed Wealth
## (Intercept) 1541038.49030 -709769.19103 -44771.256796 -89.61390595
## Prob -709769.19103 5661766.66944 8786.410915 59.75016554
## Ed -44771.25680 8786.41091 4261.086906 -2.08689151
## Wealth -89.61391 59.75017 -2.086892 0.01323908
## Ineq -29148.57228 520.23898 481.051641 2.14549156
## Pop 96.04510 718.81280 26.650630 -0.06370406
## Ineq Pop
## (Intercept) -29148.572284 96.04509692
## Prob 520.238980 718.81279515
## Ed 481.051641 26.65063046
## Wealth 2.145492 -0.06370406
## Ineq 672.439541 -7.51149784
## Pop -7.511498 1.88397661
# create matrix of random draws from point estimates and variance-covariance
crime.sim <- MASS::mvrnorm(n = sims,
mu = ols.pe,
Sigma = ols.vcv)
# Observe the output
head(crime.sim)
## (Intercept) Prob Ed Wealth Ineq Pop
## [1,] -5783.511 -1514.5448 184.0508 0.4133287 131.70995 2.7579768
## [2,] -2606.329 -197.3193 103.6119 0.2170619 57.52924 4.1641241
## [3,] -3902.777 -4009.8277 132.8276 0.3606297 87.84978 0.9450562
## [4,] -4298.787 -2933.4136 127.6321 0.3662950 100.72840 1.0143874
## [5,] -3048.107 418.8794 81.7009 0.3237534 64.57730 3.1171501
## [6,] -4638.457 -4482.6739 110.7449 0.4561119 112.09244 0.8862654
# Check out the mean
apply(crime.sim, 2, mean)
## (Intercept) Prob Ed Wealth Ineq
## -4191.7842430 -3532.9459074 115.1026667 0.3893421 101.2327120
## Pop
## 1.0611060
### Simulation EV: step 2
# Set hypothetical values for the independent variables (at their means)
crime.x <- tibble(intercept = 1,
Prob = mean(crime$Prob),
Ed = mean(crime$Ed),
Wealth = mean(crime$Wealth),
Ineq = mean(crime$Ineq),
Pop = mean(crime$Pop))
### Simulation EV: step 3
# Compute one expected value to get the intuition
(ev1 <- as.matrix(crime.x) %*% crime.sim[1,]) # arguments must be conformable!
## [,1]
## [1,] 917.1692
# Once you get the desired output, program the loop
for (i in 1:sims) {
crime.ev[i] <- as.matrix(crime.x) %*% crime.sim[i,]
}
# Browse expected values
head(crime.ev)
## [1] 917.1692 887.8687 945.1400 827.0851 902.5784 923.7179
### Simulation EV: step 4
(mean_ev <- mean(crime.ev))
## [1] 906.0755
# We can now represent the expected value and its uncertainty
# CIs at a 0.05 significance level
(lwr <- quantile(crime.ev, probs = 0.025))
## 2.5%
## 816.9564
(upr <- quantile(crime.ev, probs = 0.975))
## 97.5%
## 991.2142
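A quick way to see the whole simulated distribution is to plot it. The sketch below is illustrative only (the histogram, binning, and axis labels are choices made for this example, not part of the original code):
# Sketch: visualize the simulated distribution of expected values
tibble(ev = crime.ev) |>
  ggplot(aes(x = ev)) +
  geom_histogram(bins = 30) +                              # 1000 simulated expected values
  geom_vline(xintercept = mean_ev, linetype = "dashed") +  # mean of the simulated EVs
  geom_vline(xintercept = c(lwr, upr), linetype = "dotted") + # 95% interval bounds
  labs(x = "Expected crime rate", y = "Count")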
Although displaying a distribution of expected values is an effective way to communicate model results, by itself it does not add much beyond simply reporting the average expected value.
The real advantage of reporting estimands or quantities of interest (QoI) via simulation is the ability to construct and contrast meaningful scenarios that address your research questions. For example, suppose we hypothesize that economic inequality affects the crime rate at the state level. We can compute expected values for two scenarios by adding the following steps:
# Set hypothetical values for the independent variables
(crime.x2 <- tibble(intercept = 1,
Prob = mean(crime$Prob),
Ed = mean(crime$Ed),
Wealth = mean(crime$Wealth),
# notice values of inequality
Ineq = c(min(crime$Ineq),
max(crime$Ineq)),
Pop = mean(crime$Pop)))
## # A tibble: 2 × 6
## intercept Prob Ed Wealth Ineq Pop
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.0471 10.6 5254. 12.6 36.6
## 2 1 0.0471 10.6 5254. 27.6 36.6
# Matrix-multiply both sets of hypothetical values
# with each set of simulated coefficients
crime.ev2 <- matrix(NA,
nrow = sims,
ncol = 3)
# Because we have two scenarios, we need parallel simulations
for (i in 1:sims) {
# two scenarios: low and high inequality
crime.ev2[i, 1] <- as.matrix(crime.x2[1, ]) %*% crime.sim[i, ]
crime.ev2[i, 2] <- as.matrix(crime.x2[2, ]) %*% crime.sim[i, ]
# estimand - quantity of interest: first difference
crime.ev2[i, 3] <- crime.ev2[i, 2] - crime.ev2[i, 1]
}
## Prepare the data for visualization (long format for ggplot)
data_vis <- tibble(
pe=c(crime.ev2[,1], # low inequality estimates
crime.ev2[,2], # high inequality estimates
crime.ev2[,3]), # high - low = difference
# labels for scenarios
scenario = c(rep("Low Ineq",sims),
rep("High Ineq",sims),
rep("First-Diff",sims)),
# labels of quantities
QIs = c(rep("Expected Values",sims*2),
rep("First Differences",sims))
)
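The data_vis object can then be summarized or plotted. The code below is a minimal sketch; the density plot and faceting are illustrative choices rather than part of the original output:
# Sketch: summarize the first differences and compare the scenario distributions
quantile(crime.ev2[, 3], probs = c(0.025, 0.5, 0.975)) # first difference: median and 95% interval
ggplot(data_vis, aes(x = pe, fill = scenario)) +
  geom_density(alpha = 0.4) +                 # one density per scenario / quantity
  facet_wrap(~ QIs, scales = "free") +        # expected values vs. first differences
  labs(x = "Simulated quantity of interest", y = "Density", fill = "Scenario")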
Simulation results show that the simulated coefficients average out to be very close to the actual estimated coefficients. Importantly, you could increase the precision of these simulations by increasing the number of iterations (i.e., the sims object).
When we are interested in providing predictions for real, concrete cases (e.g., a forecast of crime in Washington state), we need to compute predicted values:
\[ \hat{\mathbf{y}} = \mathbf{X} \color{red}{\hat{\boldsymbol{\beta}}} + \color{blue}{\boldsymbol{\hat{\epsilon}}} \]
To compute predicted values, we need to simulate both the estimates from the \(\color{red}{\text{systematic component}}\) (\(\hat{\boldsymbol{\beta}}\)) and the \(\color{blue}{\text{stochastic component}}\) (\(\hat{\boldsymbol{\epsilon}}\))
To generate predicted values of the dependent variable, we first compute the expected values, and then add random draws from the appropriate distribution. For OLS, we assume a normal distribution for the dependent variable, with the independent variables affecting the parameter \(\mu\).
However, we must also account for the normal distribution's other parameter, \(\sigma\). The lm() function in R estimates \(\hat{\sigma}\) using the formula:
\[ \hat{\sigma} = \sqrt{\frac{\sum_{i=1}^n \left( Y_i - X_i \hat{\beta} \right)^2}{n - p}} \]
where \(Y_i\) is the observed outcome, \(X_i \hat{\beta}\) is the fitted value for observation \(i\), \(n\) is the number of observations, and \(p\) is the number of estimated coefficients. The numerator is the sum of squared residuals (SSR), while the denominator adjusts for degrees of freedom.
We cannot simply use the model’s \(\hat{\sigma}\) for generating predicted values because it, too, is an estimate. To generate predicted values, we must simulate estimates of both the coefficients and variance. We simulate \(\hat{\sigma}\) by applying the formula above to each set of simulated coefficients.
# Model residual degrees of freedom
crime.DoF <- ols.crime$df.residual
# Empty vector to store the simulated sigmas
crime.sigma <- numeric(sims)
# Matrix of independent variables (with a column of 1s for the intercept)
x <- as.matrix(cbind(1, ols.crime$model[ , -1]))
# Loop through sets of simulated coefficients
for (i in 1:sims) {
# Compute the SSR for each set of simulated coefficients
ssr <- sum((crime$Crime - (x %*% crime.sim[i, ]))^2)
# Estimate sigma
crime.sigma[i] <- sqrt(ssr/crime.DoF)
}
To generate a single predicted value, we provide two quantities to rnorm(): the expected value \(\tilde{\mu}_c\) as the mean and the simulated \(\tilde{\sigma}\) as the standard deviation.
By doing this for each set of simulated estimates, we incorporate both estimation uncertainty (from the coefficients \(\hat{\beta}\)) and fundamental uncertainty (from the error term \(\hat{\sigma}\)) into the predicted values.
The code example below shows how to do this when all predictors are set to their means:
## Example: one predicted value with both estimation and fundamental uncertainty
(mu <- as.matrix(crime.x) %*% crime.sim[1, ])
## [,1]
## [1,] 917.1692
(sigma <- crime.sigma[1])
## [1] 313.4067
(example.pred <- rnorm(1, mean=mu, sd=sigma))
## [1] 1390.018
# Once you get the intuition, and the desired output, you can program the loop.
# Matrix multiply x predictors and each set of simulated coefficients to create mu
# then take a random draw from the normal distribution with mean equal to mu and
# sd equal to sigma
crime.pred <- numeric(sims)
for (i in 1:sims) {
mu <- as.matrix(crime.x) %*% crime.sim[i, ]
sigma <- crime.sigma[i]
crime.pred[i] <- rnorm(1, mean=mu, sd=sigma)
}
# Browse predicted values
head(crime.pred)
## [1] 648.2879 818.5393 889.4261 621.8942 994.8853 867.0308
(mean_pred <- mean(crime.pred))
## [1] 891.8464
# We can now represent the predicted value and its uncertainty
# CIs at a 0.05 significance level
(lwr <- quantile(crime.pred, probs = 0.025))
## 2.5%
## 284.6418
(upr <- quantile(crime.pred, probs = 0.975))
## 97.5%
## 1519.909
This process ensures our predictions account for fundamental uncertainty.
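As a quick, illustrative check (not part of the original code), we can compare the width of the 95% intervals for expected and predicted values; the predicted-value interval should be much wider because it adds fundamental uncertainty on top of estimation uncertainty:
# Sketch: compare interval widths for expected vs. predicted values
(ev.width <- unname(diff(quantile(crime.ev, c(0.025, 0.975)))))
(pv.width <- unname(diff(quantile(crime.pred, c(0.025, 0.975)))))
# pv.width is substantially larger than ev.width because predicted values
# incorporate the stochastic component (sigma) as well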
In this section, we will revisit the heteroskedastic normal model from last week. We will later compute quantities of interest using the simcf package.
rm(list=ls()) # Clear memory
set.seed(123456) # For reproducible random numbers
library(MASS) # Load packages
library(simcf)
library(tidyverse)
n <- 1500 # Generate 1500 observations
w0 <- rep(1, n) # Create the constant
w1 <- runif(n) # Create two covariates
w2 <- runif(n)
x <- cbind(w0, w1, w2) # Create a matrix of the covariates
z <- x # i.e., same covariates affect mu and sigma
beta <- c(0, 5, 15) # Set a parameter vector for the mean
# One for constant, one for covariate 1,
# one for covariate 2.
gamma <- c(1, 0, 3) # Set a parameter vector for the variance
# Gamma estimate for covariate 2 is set to be 3,
# which creates heteroskedasticity
mu <- x %*% beta # Create systematic component for the mean
sigma2 <- exp(z %*% gamma) # Create systematic component for the variance
# Since ith row of sigma2 = exp(1 + 0 + w2_i * 3)
# i.e., it is a function of w2 (heteroskedastic)
y <- rnorm(n = n, # Create the outcome variable
mean = mu, # Think about the stochastic component!
sd = sqrt(sigma2)
)
data <- cbind(y, w1, w2) # Save the data to a data frame
data <- as.data.frame(data)
names(data) <- c("y", "w1", "w2")
ls.result <- lm(y ~ w1 + w2, data = data) # Fit a linear model
# using simulated data
ls.aic <- AIC(ls.result) # Calculate and print the AIC
# i.e. Akaike Information Criterion
# to assess goodness of fit; lower AIC is better
# A likelihood function for ML heteroskedastic Normal
llk.hetnormlin <- function(param, y, x, z) {
x <- as.matrix(x) # x (some covariates) as a matrix
z <- as.matrix(z) # z (some covariates) as a matrix
os <- rep(1, nrow(x)) # Set the intercept as 1 (constant)
x <- cbind(os, x) # Add intercept to covariates x
z <- cbind(os, z) # Add intercept to covariates z
b <- param[1 : ncol(x)] # Parameters for x
g <- param[(ncol(x) + 1) : (ncol(x) + ncol(z))] # Parameters for z
xb <- x %*% b # Systematic components for mean
s2 <- exp(z %*% g) # Systematic components for variance
sum(0.5 * (log(s2) + (y - xb)^2 / s2)) # Negative log-likelihood (up to a constant)
# optim() is a minimizer by default;
# maximizing lnL is equivalent to minimizing -lnL,
# so the +/- signs are reversed here
}
# Create input matrices
xcovariates <- cbind(w1, w2)
zcovariates <- cbind(w1, w2)
# Initial guesses of beta0, beta1, ..., gamma0, gamma1, ...
# We need one entry per parameter, in order!
# Note: also include beta and gamma estimates for the constants
stval <- c(0, 0, 0, 0, 0, 0)
# Run ML, and get the output we need
hetnorm.result <- optim(stval, # Initial guesses
llk.hetnormlin, # Likelihood function
method = "BFGS", # Gradient method
hessian = TRUE, # Return Hessian matrix
y = y, # Outcome variable
x = xcovariates, # Covariates x (w/o constant)
z = zcovariates # Covariates z (w/o constant)
)
Note: by default, optim() performs minimization. You can make optim() a maximizer by adding control = list(fnscale = -1).
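For illustration only, a minimal sketch of that maximizer option; the negated wrapper llk.hetnormlin.pos is introduced here for the example and is not part of the original code:
# Sketch: run optim() as a maximizer of the log-likelihood
llk.hetnormlin.pos <- function(param, y, x, z) -llk.hetnormlin(param, y, x, z) # returns +lnL
hetnorm.max <- optim(stval, llk.hetnormlin.pos,
                     method = "BFGS",
                     control = list(fnscale = -1), # negative fnscale makes optim() maximize
                     y = y, x = xcovariates, z = zcovariates)
hetnorm.max$value # log-likelihood at the maximum (no sign flip needed)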
pe <- hetnorm.result$par # Point estimates
round(pe, 3)
## [1] -0.167 5.007 15.698 0.910 0.224 2.994
vc <- solve(hetnorm.result$hessian) # Var-cov matrix (for computing s.e.)
round(vc, 5)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.02909 -0.03265 -0.02810 0.00059 -0.00087 -0.00032
## [2,] -0.03265 0.06787 -0.00025 -0.00091 0.00099 0.00084
## [3,] -0.02810 -0.00025 0.10331 -0.00056 0.00143 -0.00033
## [4,] 0.00059 -0.00091 -0.00056 0.00920 -0.00832 -0.00749
## [5,] -0.00087 0.00099 0.00143 -0.00832 0.01693 -0.00042
## [6,] -0.00032 0.00084 -0.00033 -0.00749 -0.00042 0.01568
se <- sqrt(diag(vc)) # To compute standard errors (s.e.)
# take the diagonal of the var-cov matrix
# (the inverse Hessian), then the square root
round(se, 3)
## [1] 0.171 0.261 0.321 0.096 0.130 0.125
mle.result <- cbind(pe[1:3], se[1:3]) # Report ML results
colnames(mle.result) <- c("Estimate", "Std.Error")
rownames(mle.result) <- c("(Intercept)", "w1", "w2")
round(mle.result, 3)
## Estimate Std.Error
## (Intercept) -0.167 0.171
## w1 5.007 0.261
## w2 15.698 0.321
round(coef(summary(ls.result))[, c(1, 2)], 3) # Compare with lm() results
## Estimate Std. Error
## (Intercept) -0.049 0.282
## w1 4.784 0.377
## w2 15.683 0.371
ll <- -hetnorm.result$value # Report likelihood at maximum
# No need to have negative sign
# if optim is set as maximizer
ll
## [1] -2620.334
# AIC is 2 * number of parameters - 2 * ll (i.e. likelihood at maximum)
hetnorm.aic <- 2 * length(stval) - 2 * ll
# Compare AIC from ML and from lm(); lower is better
print(hetnorm.aic)
## [1] 5252.668
print(ls.aic)
## [1] 8545.903
Computing prediction and confidence intervals with the predict() function. Motivation: we want to study how a change in a particular explanatory variable affects the outcome variable, all else being equal.
1. Create a data frame with a set of hypothetical scenarios for covariate 1 (w1), while keeping covariate 2 at its mean.
2. Calculate the predicted values using the predict() function: predict(object = ..., newdata = ..., interval = ..., level = ...)
3. Plot the prediction intervals.
4. Similarly, calculate the confidence intervals using the predict() function.
5. Plot the confidence intervals; compare them with the prediction intervals.
# Set up
# Set up hypothetical values for w1 and w2
(w1range <- seq(from = 0, to = 1, by = 0.05))
## [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
## [16] 0.75 0.80 0.85 0.90 0.95 1.00
(w2mean <- mean(w2))
## [1] 0.4912698
# Create a new dataset using crossing()
(xhypo <- crossing(w1 = w1range, w2 = w2mean))
## # A tibble: 21 × 2
## w1 w2
## <dbl> <dbl>
## 1 0 0.491
## 2 0.05 0.491
## 3 0.1 0.491
## 4 0.15 0.491
## 5 0.2 0.491
## 6 0.25 0.491
## 7 0.3 0.491
## 8 0.35 0.491
## 9 0.4 0.491
## 10 0.45 0.491
## # ℹ 11 more rows
# Note: if you are more comfortable, you can directly create a data frame
# with the hypothetical values
# Let's predict values using base R function
# Calculate predicted values
simPI.w1 <- predict(ls.result, # A model object
newdata = xhypo, # New dataset
interval = "prediction", # What kind of intervals?
level = 0.95) # What levels of confidence?
simPI.w1 <- simPI.w1 |>
as_tibble() |> # Coerce it into a tibble
bind_cols(w1 = w1range) # Combine hypo w1 with predicted y
# Calculate confidence intervals using predict()
simCI.w1 <- predict(ls.result,
newdata = xhypo,
interval = "confidence",
level = 0.95)
simCI.w1 <- simCI.w1 |>
as_tibble() |>
bind_cols(w1 = w1range)
# ggplot2
theme_set(theme_classic())
simALL.w1 <- bind_rows(
simPI.w1 |> mutate(type = "PI"),
simCI.w1 |> mutate(type = "CI")
)
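A plot of these intervals, in the same style as the plots later in this document, might look like the following (the faceting by interval type is an illustrative choice):
# Plot prediction and confidence intervals for hypothetical w1
ggplot(simALL.w1, aes(x = w1, y = fit, ymin = lwr, ymax = upr)) +
  geom_line() +
  geom_ribbon(alpha = 0.1) +
  labs(y = "Predicted Y", x = "Covariate 1") +
  facet_grid(~ type) # the prediction interval (PI) is wider than the confidence interval (CI)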
Now, instead of providing a range of values for w1, we hold this variable constant at its mean and compute the marginal effect of a change in w2 by simulating a range of its values.
w2range <- seq(from = 0, to = 1, by = 0.05) # Set up hypothetical values for w2
w1mean <- mean(w1) # Hold w1 at its mean
xhypo <- crossing(w1 = w1mean, # Create a new dataset
w2 = w2range) # crossing() is from tidyr package
# Calculate predicted values
simPI.w2 <- predict(ls.result, # A model object
newdata = xhypo, # New dataset
interval = "prediction", # What kind of intervals?
level = 0.95) # What levels of confidence?
simPI.w2 <- simPI.w2 |>
as_tibble() |>
bind_cols(w2 = w2range)
# Calculate confidence intervals using predict()
simCI.w2 <- predict(ls.result,
newdata = xhypo,
interval = "confidence",
level = 0.95)
simCI.w2 <- simCI.w2 |>
as_tibble() |>
bind_cols(w2 = w2range)
# ggplot2
theme_set(theme_classic())
simALL.w2 <- bind_rows(
simPI.w2 |> mutate(type = "PI"),
simCI.w2 |> mutate(type = "CI")
)
Computing quantities of interest with the simcf package. Motivation: can we use simulation methods to produce the same prediction and confidence intervals? Recall from lecture: we can draw many \(\tilde{\beta}\) values from a multivariate normal distribution.
1. Create a data frame with a set of hypothetical scenarios for covariate 1 (w1) while keeping covariate 2 at its mean.
2. Simulate the predicted values using MASS and simcf.
3. Plot the results.
In order to use simcf to generate quantities of interest, the following steps are needed:

1. Estimate the MLE \(\hat{\beta}\) and its variance \(\hat{V}(\hat{\beta})\) \(\color{red}{\rightarrow \texttt{optim(), glm()}}\)
2. Simulate estimation uncertainty from a multivariate normal distribution: draw \(\tilde{\beta} \sim MVN \big[\hat{\beta}, \hat{V}(\hat{\beta})\big]\) \(\color{red}{\rightarrow \texttt{MASS::mvrnorm()}}\)
3. Create hypothetical scenarios of your substantive interest: choose values of X, \(X_c\) \(\color{red}{\rightarrow \texttt{simcf::cfMake(), cfChange()} \dots}\)
4. Calculate expected values: \(\tilde{\mu}_c = g(X_c, \tilde{\beta})\)
5. Simulate fundamental uncertainty, \(\tilde{y}_c \sim f(\tilde{\mu}_c, \tilde{\alpha})\) \(\color{red}{\rightarrow \texttt{simcf::hetnormsimpv()} \dots}\), or compute EVs, first differences, or relative risks:
   - EV: \(\mathbb{E}(y|X_{c1})\) \(\color{red}{\rightarrow \texttt{simcf::logitsimev()} \dots}\)
   - FD: \(\mathbb{E}(y|X_{c2}) - \mathbb{E}(y|X_{c1})\) \(\color{red}{\rightarrow \texttt{simcf::logitsimfd()} \dots}\)
   - RR: \(\frac{\mathbb{E}(y|X_{c2})}{\mathbb{E}(y|X_{c1})}\) \(\color{red}{\rightarrow \texttt{simcf::logitsimrr()} \dots}\)
Recall our likelihood function is:
\[ \begin{aligned} \mathrm{P}\left(\mathbf{y} | \boldsymbol{\mu}, \boldsymbol{\sigma}^{2}\right) &=\prod_{i=1}^{n} f_{\mathcal{N}}\left(y_{i} | \mu_{i}, \sigma_{i}^{2}\right) & \tiny[\text{Joint probability}] \\ \mathrm{P}\left(\mathbf{y} | \boldsymbol{\mu}, \boldsymbol{\sigma}^{2}\right) &=\prod_{i=1}^{n}\left(2 \pi \sigma_{i}^{2}\right)^{-1 / 2} \exp \left[\frac{-\left(y_{i}-\mu_{i}\right)^{2}}{2 \sigma_{i}^{2}}\right] & \tiny[\text{Expressed in Normal distribution}]\\ \log \mathcal{L}\left(\mathbf{\mu}, \boldsymbol{\sigma}^{2} | \mathbf{y}\right) & \propto-\frac{1}{2} \sum_{i=1}^{n} \log \sigma_{i}^{2}-\frac{1}{2} \sum_{i=1}^{n} \frac{\left(y_{i}-\mathbf{\mu}_{i} \right)^{2}}{\sigma_{i}^{2}} & \tiny[\text{Converted to log likelihood; simplify}] \\ \log \mathcal{L}(\boldsymbol{\beta}, \boldsymbol{\gamma} | \mathbf{y}) & \propto-\frac{1}{2} \sum_{i=1}^{n} \log \mathbf{z}_{i} \gamma-\frac{1}{2} \sum_{i=1}^{n} \frac{\left(y_{i}-\mathbf{x}_{i} \boldsymbol{\beta}\right)^{2}}{\exp \left(\mathbf{z}_{i} \gamma\right)} & \tiny[\text{Substitute in systematic components}] \end{aligned} \]
# Draw simulated parameters from the estimated sampling distribution
sims <- 10000
simparam <- mvrnorm(n = sims,
mu = pe, # M.N.D. with population mean = pe (from ML);
Sigma = vc) # population var-covar matrix = vc (from ML)
head(simparam) # Inspect
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.24211743 4.253787 15.94193 0.8538881 0.1676888 3.187143
## [2,] -0.10028149 4.866367 15.88825 0.8131899 0.2369895 3.038976
## [3,] -0.15729804 4.813402 15.81664 0.9172401 0.3016272 2.864065
## [4,] -0.01521245 4.482252 16.12675 0.9590520 0.2337601 2.975280
## [5,] 0.10941227 4.821406 15.16535 0.9812625 0.2492318 2.763246
## [6,] -0.22838155 4.947090 15.62518 0.9596191 0.1447425 2.870572
# Separate into the simulated betas and simulated gammas
simbetas <- simparam[ , 1:(ncol(xcovariates) + 1)]
simgammas <- simparam[ , (ncol(simbetas) + 1):ncol(simparam)]
# Specify model formulas
varmodel <- (y ~ w1 + w2)
# Create hypothetical scenarios for covariate 1
w1range <- seq(from = 0, to = 1, by = 0.2)
# Use `cfMake` to create a baseline dataset
xhypo <- cfMake(varmodel,
data,
nscen = length(w1range))
xhypo
## $x
## y w1 w2
## 1 10.06525 0.5036167 0.4912698
## 2 10.06525 0.5036167 0.4912698
## 3 10.06525 0.5036167 0.4912698
## 4 10.06525 0.5036167 0.4912698
## 5 10.06525 0.5036167 0.4912698
## 6 10.06525 0.5036167 0.4912698
##
## $xpre
## y w1 w2
## 1 10.06525 0.5036167 0.4912698
## 2 10.06525 0.5036167 0.4912698
## 3 10.06525 0.5036167 0.4912698
## 4 10.06525 0.5036167 0.4912698
## 5 10.06525 0.5036167 0.4912698
## 6 10.06525 0.5036167 0.4912698
##
## $model
## y ~ w1 + w2
##
## attr(,"class")
## [1] "list" "counterfactual"
# Use `cfChange` in a loop to set each hypothetical w1 value
for (i in 1:length(w1range)) {
xhypo <- cfChange(xscen = xhypo,
covname = "w1",
x = w1range[i],
scen = i)
}
xhypo
## $x
## y w1 w2
## 1 10.06525 0.0 0.4912698
## 2 10.06525 0.2 0.4912698
## 3 10.06525 0.4 0.4912698
## 4 10.06525 0.6 0.4912698
## 5 10.06525 0.8 0.4912698
## 6 10.06525 1.0 0.4912698
##
## $xpre
## y w1 w2
## 1 10.06525 0.5036167 0.4912698
## 2 10.06525 0.5036167 0.4912698
## 3 10.06525 0.5036167 0.4912698
## 4 10.06525 0.5036167 0.4912698
## 5 10.06525 0.5036167 0.4912698
## 6 10.06525 0.5036167 0.4912698
##
## $model
## y ~ w1 + w2
##
## attr(,"class")
## [1] "list" "counterfactual"
# Repeat the same procedures for z
zhypo <- cfMake(varmodel, data, nscen = length(w1range))
for (i in 1:length(w1range)) {
zhypo <- cfChange(zhypo, "w1", x = w1range[i], scen = i)
}
zhypo
## $x
## y w1 w2
## 1 10.06525 0.0 0.4912698
## 2 10.06525 0.2 0.4912698
## 3 10.06525 0.4 0.4912698
## 4 10.06525 0.6 0.4912698
## 5 10.06525 0.8 0.4912698
## 6 10.06525 1.0 0.4912698
##
## $xpre
## y w1 w2
## 1 10.06525 0.5036167 0.4912698
## 2 10.06525 0.5036167 0.4912698
## 3 10.06525 0.5036167 0.4912698
## 4 10.06525 0.5036167 0.4912698
## 5 10.06525 0.5036167 0.4912698
## 6 10.06525 0.5036167 0.4912698
##
## $model
## y ~ w1 + w2
##
## attr(,"class")
## [1] "list" "counterfactual"
# Simulate predictive intervals for heteroskedastic linear models
# `hetnormsimpv()` is from simcf package
simRES.w1 <- hetnormsimpv(xhypo, simbetas,
zhypo, simgammas,
ci = 0.95,
constant = 1,
varconstant = 1)
simRES.w1
## $pe
## [1] 7.515323 8.566149 9.575781 10.579808 11.552741 12.524484
##
## $lower
## [,1]
## low 0.8847932
## low 1.9381952
## low 2.8502531
## low 3.5495829
## low 4.5126707
## low 5.3290850
##
## $upper
## [,1]
## up 13.87476
## up 15.15291
## up 16.41457
## up 17.46719
## up 18.47127
## up 19.71323
simRES.w1 <- simRES.w1 |>
bind_rows() |> # Collapse the list into a data frame
bind_cols(w1 = w1range) # Combine with hypo. w1 values
simRES.w1 # Inspect
## # A tibble: 6 × 4
## pe lower[,1] upper[,1] w1
## <dbl> <dbl> <dbl> <dbl>
## 1 7.52 0.885 13.9 0
## 2 8.57 1.94 15.2 0.2
## 3 9.58 2.85 16.4 0.4
## 4 10.6 3.55 17.5 0.6
## 5 11.6 4.51 18.5 0.8
## 6 12.5 5.33 19.7 1
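The simulated predictive intervals for w1 can be plotted in the same way as the w2 plot shown further below; a minimal sketch:
# Plot the predictive intervals for hypothetical w1
ggplot(simRES.w1, aes(x = w1, y = pe, ymax = upper, ymin = lower)) +
  geom_line() +
  geom_ribbon(alpha = 0.1) +
  labs(y = "Predicted Y", x = "Covariate 1")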
1. Create a data frame with a set of hypothetical scenarios for covariate 2 (w2) while keeping covariate 1 at its mean.
2. Simulate the predicted values using MASS and simcf.
3. Plot the results.
# Create hypothetical scenarios for w2
w2range <- seq(0, 1, by = 0.05)
xhypo <- cfMake(varmodel, data, nscen = length(w2range))
for (i in 1:length(w2range)) {
xhypo <- cfChange(xhypo, "w2", x = w2range[i], scen = i)
}
zhypo <- cfMake(varmodel, data, nscen = length(w2range))
for (i in 1:length(w2range)) {
zhypo <- cfChange(zhypo, "w2", x = w2range[i], scen = i)
}
# Simulate the predicted Y's and PI's
simRES.w2 <- hetnormsimpv(xhypo, simbetas,
zhypo, simgammas,
ci = 0.95,
constant = 1,
varconstant = 1)
simRES.w2 <- simRES.w2 |>
bind_rows() |>
bind_cols(w2 = w2range)
# Plot the predictive intervals for hypothetical w2
ggplot(simRES.w2, aes(x = w2, y = pe, ymax = upper, ymin = lower)) +
geom_line() +
geom_ribbon(alpha = 0.1) +
labs(y = "Predicted Y", x = "Covariate 2")
Can we compare them with the prediction intervals generated by predict()?
# Combine two dataframes
simPI.w2 <- simPI.w2 |>
rename(pe = fit, lower = lwr, upper = upr)
simALL.w2 <- bind_rows(
simRES.w2 |> mutate(method = "sim"),
simPI.w2 |> mutate(method = "lm")
)
# Plot the predictive intervals for hypothetical w2
ggplot(simALL.w2, aes(x = w2, y = pe, ymax = upper, ymin = lower)) +
geom_line() +
geom_ribbon(alpha = 0.1) +
labs(y = "Predicted Y", x = "Covariate 2") +
facet_grid(~ method)