Count models are used for dependent variables that take on non-negative integer values, such as the number of births at a hospital on a given day or the number of patents awarded during a specified period. A typical starting point for estimating the parameters of a count model is Poisson regression, which assumes that counts follow a Poisson distribution.
Recall from problem set 2 that the Poisson distribution is characterized by a single parameter, \(\lambda\), which represents both the mean and the variance of the distribution. Specifically, for a Poisson-distributed random variable \(Y\), we have:
\[ \mathbb{E}(Y) = \text{Var}(Y) = \lambda \]
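As a quick sanity check of this property (a sketch, not part of the original problem set material), we can draw a large Poisson sample and compare its sample mean and variance:
# Sample mean and variance of a Poisson draw should both be close to lambda
set.seed(42)
y <- rpois(n = 100000, lambda = 3)
mean(y)   # approximately 3
var(y)    # approximately 3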
Poisson regression links the mean of the dependent variable to a linear combination of predictors, using a log-link function to ensure that the mean remains positive. This relationship is specified through \(\lambda\), the expected count, by modeling the log of the expected value of \(Y\) as a linear function of the predictors.
The model is defined as follows: \[ \log \left[ \mathbb{E}(Y \mid X) \right] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \]
Exponentiating both sides yields:
\[ \mathbb{E}(Y \mid X) = \exp \left( \beta_0 + \beta_1 X_1 + \beta_2 X_2 \right) \]
where \(\mathbb{E}(Y \mid X)\) is the expected count, and \(\beta_0\), \(\beta_1\), and \(\beta_2\) are coefficients to be estimated.
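Because the model is log-linear, each coefficient has a multiplicative interpretation: a one-unit increase in \(X_1\) multiplies the expected count by \(\exp(\beta_1)\), holding \(X_2\) fixed. A minimal numerical sketch (the coefficient values here are purely illustrative, not estimates):
# Illustrative coefficient values
b0 <- 0.2; b1 <- 0.5; b2 <- -0.3
# Expected counts at X1 = 0 and X1 = 1, holding X2 = 0
mu_x1_0 <- exp(b0 + b1 * 0 + b2 * 0)
mu_x1_1 <- exp(b0 + b1 * 1 + b2 * 0)
mu_x1_1 / mu_x1_0   # equals exp(b1), about a 65% increase in the expected count
exp(b1)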
We can simulate data from a Poisson model using the rpois() function in R. Here, we specify a systematic component (i.e., a linear predictor) and exponentiate it to obtain the mean parameter \(\lambda\), ensuring that the expected value is positive.
# Poisson Simulation
set.seed(1234)
# Number of samples
n <- 1000
# "True" parameters
b0 <- 0.2
b1 <- 0.5
# Independent variable
X <- runif(n = n, min = -1, max = 1)
# Poisson DGP
Y1 <- rpois(n = n, lambda = exp(b0 + b1 * X))
# Fit model
model.pois <- glm(Y1 ~ X, family = "poisson")
summary(model.pois)
##
## Call:
## glm(formula = Y1 ~ X, family = "poisson")
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.13931 0.03037 4.587 4.5e-06 ***
## X 0.56232 0.05073 11.085 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1189.6 on 999 degrees of freedom
## Residual deviance: 1062.7 on 998 degrees of freedom
## AIC: 2737.3
##
## Number of Fisher Scoring iterations: 5
The above code generates a Poisson-distributed response variable \(Y_1\) based on a linear predictor of \(X\). The model parameters are then estimated using glm() with a Poisson family specification.
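As a quick follow-up (a sketch, not part of the original code), we can compare the estimates with the values used in the DGP and translate them back to the count scale:
# Compare estimated coefficients with the "true" values from the DGP
cbind(truth = c(b0, b1), estimate = coef(model.pois))
# Exponentiated coefficients are multiplicative effects on the expected count
exp(coef(model.pois))
# Fitted expected count at X = 0
predict(model.pois, newdata = data.frame(X = 0), type = "response")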
To evaluate the stability of our estimates, we can repeat the Poisson model estimation multiple times and observe the distribution of the parameter estimates:
# Number of simulations
sims <- 1000
# Store estimates
par.est.pois <-
tibble(
b0_hat = rep(NA, sims),
b1_hat = rep(NA, sims),
b0_se = rep(NA, sims),
b1_se = rep(NA, sims),
model = "Poisson"
)
# Loop to estimate the model multiple times
for (i in 1:sims) {
Y <- rpois(n, exp(b0 + b1 * X))
model.pois <- glm(Y ~ X, family = "poisson")
vcv <- vcov(model.pois)
par.est.pois[i, 1] <- model.pois$coef[1]
par.est.pois[i, 2] <- model.pois$coef[2]
par.est.pois[i, 3] <- sqrt(diag(vcv)[1])
par.est.pois[i, 4] <- sqrt(diag(vcv)[2])
}
# Check out estimates
par.est.pois |>
pivot_longer(cols = 1:4, names_to = "estimates") |>
group_by(estimates) |>
summarize(mean = mean(value), sd = sd(value))
## # A tibble: 4 × 3
## estimates mean sd
## <chr> <dbl> <dbl>
## 1 b0_hat 0.200 0.0276
## 2 b0_se 0.0293 0.000453
## 3 b1_hat 0.497 0.0487
## 4 b1_se 0.0492 0.000685
This code performs 1000 simulations of the Poisson model and stores the estimated parameters (\(\hat{\beta}_0\), \(\hat{\beta}_1\)) and their standard errors for each run. By examining the mean and standard deviation of these estimates, we can assess their variability and reliability.
An important limitation of the Poisson model is its assumption that the mean and variance are equal. In many real-world count processes, especially in social science applications, the variance exceeds the mean, a phenomenon known as overdispersion. When overdispersion is present, the Poisson model yields standard errors that are too small, leading to overconfident inferences.
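A simple informal diagnostic (a sketch, not from the original code) is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom, which should be close to 1 if the Poisson assumption holds.
# Pearson dispersion statistic for the Poisson fit above
phi <- sum(residuals(model.pois, type = "pearson")^2) / model.pois$df.residual
phi   # near 1: consistent with Poisson; well above 1: overdispersion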
The Negative Binomial (NB) model is a more flexible approach for modeling count data because it can accommodate overdispersion by introducing a second parameter, which allows the variance to differ from the mean.
The NB model can be viewed as a “mixture” model because it combines the Poisson distribution with a Gamma distribution to model the variability. In contrast to the Poisson distribution, which assumes that the mean and variance are equal, the NB model relaxes this assumption by allowing the variance to exceed the mean.
In the last simulation, we set \(\lambda\) equal to the exponentiated systematic component of the model. In the NB model, \(\lambda\) combines the systematic component with a stochastic (Gamma-distributed) component, yielding a mean \(\mu\) and a variance \(\nu\) defined as follows:
\[ \begin{aligned} \mu &= \exp \left( \beta_0 + \beta_1 X_1 + \beta_2 X_2 \right) \\ \nu &= \mu + \frac{\mu^2}{\theta} \end{aligned} \]
where \(\theta\) is a dispersion parameter that controls the degree of overdispersion. Because the NB model has a separate parameter for the variance, it is robust to overdispersion, unlike the Poisson model.
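We can verify this mean-variance relationship by simulation (a sketch; the values of \(\theta\) and \(\mu\) below are chosen only for illustration):
# Check Var(Y) = mu + mu^2 / theta for the Negative Binomial
theta <- 0.5
mu <- 2
y_nb <- rnbinom(n = 100000, size = theta, mu = mu)
mean(y_nb)          # approximately mu = 2
var(y_nb)           # approximately mu + mu^2 / theta = 10
mu + mu^2 / theta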
To illustrate the impact of overdispersion, we simulate overdispersed data, estimate Poisson and NB models, and compare their results. To simulate overdispersed count data, we use the rnbinom() function, where the size argument represents the dispersion parameter \(\theta\) and mu is the mean \(\mu\).
# Negative Binomial DGP
Y2 <- rnbinom(n = n, size = 0.5, mu = exp(b0 + b1 * X))
# run NB regression with MASS package
model.nb <- MASS::glm.nb(Y2 ~ X)
summary(model.nb)
##
## Call:
## MASS::glm.nb(formula = Y2 ~ X, init.theta = 0.5210878262, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.18333 0.05305 3.456 0.000548 ***
## X 0.61331 0.09114 6.729 1.7e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.5211) family taken to be 1)
##
## Null deviance: 957.71 on 999 degrees of freedom
## Residual deviance: 911.51 on 998 degrees of freedom
## AIC: 3019.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.5211
## Std. Err.: 0.0433
##
## 2 x log-likelihood: -3013.1000
In the comparison below, a solid line represents the mean of \(Y\), and a dashed line denotes its variance. Observe that in Panel (a) (Poisson model), the mean and variance are nearly identical, whereas in Panel (b) (Negative Binomial model), the variance is substantially larger than the mean.
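A minimal sketch of such a comparison, using the simulated Y1 (Poisson) and Y2 (Negative Binomial) vectors from above (the plotting choices here are illustrative, not the original figure code):
# Sketch of the mean/variance comparison for Y1 and Y2
bind_rows(
tibble(Y = Y1, model = "(a) Poisson"),
tibble(Y = Y2, model = "(b) Negative Binomial")
) |>
group_by(model) |>
mutate(mean_Y = mean(Y), var_Y = var(Y)) |>
ggplot(aes(x = Y)) +
geom_histogram(binwidth = 1) +
geom_vline(aes(xintercept = mean_Y), linetype = "solid") +
geom_vline(aes(xintercept = var_Y), linetype = "dashed") +
facet_wrap(~ model)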
Next, we assess the implications of overdispersion on parameter estimation by comparing the Poisson and Negative Binomial estimators.
# Poisson vs. Negative Binomial
set.seed(1234)
# Store estimates
par.est.nb <- tibble(
b1_pois = rep(NA, sims),
b1_nb = rep(NA, sims),
b1_pois_se = rep(NA, sims),
b1_nb_se = rep(NA, sims)
)
# Loop for Negative Binomial regression
for (i in 1:sims) {
# Create count data with overdispersion
Y <- rnbinom(n = n, size = 0.5, mu = exp(b0 + b1 * X))
# Estimate models
model.pois <- glm(Y ~ X, family = "poisson")
model.nb <- MASS::glm.nb(Y ~ X)
# Variance-covariance matrices
vcv.pois <- vcov(model.pois)
vcv.nb <- vcov(model.nb)
# Store point estimates
par.est.nb[i, 1] <- model.pois$coef[2]
par.est.nb[i, 2] <- model.nb$coef[2]
# Store standard errors
par.est.nb[i, 3] <- sqrt(diag(vcv.pois)[2])
par.est.nb[i, 4] <- sqrt(diag(vcv.nb)[2])
}
# store results
results <- par.est.nb |>
pivot_longer(cols = 1:4, names_to = "estimates") |>
group_by(estimates) |>
summarize(mean = mean(value), sd = sd(value))
# Check estimates
results
## # A tibble: 4 × 3
## estimates mean sd
## <chr> <dbl> <dbl>
## 1 b1_nb 0.501 0.0944
## 2 b1_nb_se 0.0919 0.00271
## 3 b1_pois 0.501 0.0952
## 4 b1_pois_se 0.0493 0.00131
# ratio of mean SE to SD of estimates: negative binomial
round(results$mean[2], 3) / round(results$sd[1], 3)
## [1] 0.9787234
# ratio of mean SE to SD of estimates: poisson
round(results$mean[4], 3) / round(results$sd[3], 3)
## [1] 0.5157895
The simulation above highlights two important insights:
1. Both models recover essentially unbiased point estimates of \(\beta_1\): the mean \(\hat{\beta}_1\) is 0.501 for the Poisson model and 0.501 for the Negative Binomial model.
2. However, the mean of the simulated standard errors is 0.092 for the Negative Binomial model but only 0.049 for the Poisson model, while the standard deviation of the point estimates \(\sigma(\hat{\beta}_1)\) is 0.094 and 0.095, respectively. The ratio of the average standard error to the standard deviation of the estimates is therefore approximately 0.98 for the Negative Binomial model but only 0.52 for the Poisson model.
The low ratio for the Poisson model implies that when data are overdispersed, the Poisson model's equal mean-variance assumption produces standard errors that are too small relative to the true sampling variability of the estimates. This inflates the risk of Type I errors (false positives): we may reject the null hypothesis even when there is no real effect.
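To make this concrete, we can compute the empirical coverage of nominal 95% confidence intervals across the stored simulations (a sketch using the par.est.nb object from above; with overdispersed data we would expect coverage close to 0.95 for the Negative Binomial intervals and well below 0.95 for the Poisson intervals):
# Empirical coverage of nominal 95% CIs for b1 (true value b1 = 0.5)
par.est.nb |>
summarize(
coverage_pois = mean(b1_pois - 1.96 * b1_pois_se <= b1 &
b1 <= b1_pois + 1.96 * b1_pois_se),
coverage_nb = mean(b1_nb - 1.96 * b1_nb_se <= b1 &
b1 <= b1_nb + 1.96 * b1_nb_se)
)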
Another potential issue with count models arises when there is a large number of cases where the value of \(Y\) is zero. These excess zeros may either be a natural part of the data-generating process (DGP) governing other counts (e.g., 1s, 2s, 3s, etc.) or they may be “inflated” by an additional process unrelated to counts greater than zero. In other words, the DGP may involve two separate systematic processes:
\[ \Pr(Y = 0) = \pi(X) \quad \text{(typically modeled with a binary logit or probit)} \]
where \(\pi(X)\) represents the probability of an observation being zero, given predictors \(X\).
\[ \Pr(Y = y \mid Y > 0) = f(y; \lambda) \quad \text{(e.g., a Poisson or Negative Binomial distribution)} \] where \(f(y; \lambda)\) represents the count distribution with parameter \(\lambda\).
Zero-inflated count models combine these two processes by estimating:
\[ \log \left( \frac{\pi(X)}{1 - \pi(X)} \right) = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k \]
\[ \log(\lambda) = \alpha_0 + \alpha_1 X_1 + \ldots + \alpha_k X_k \]
In summary, zero-inflation models work by modeling the probability of zeros separately from the count-generating process, allowing flexibility to handle excess zeros present in the data. They can be applied using both Poisson and Negative Binomial distributions for the count component.
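To make the two-part DGP concrete, here is a minimal sketch of a zero-inflated Poisson data-generating process (the inflation probability pi_zero is illustrative, and the code reuses n, b0, b1, and X from the earlier simulations):
# Sketch of a zero-inflated Poisson DGP
set.seed(99)
pi_zero <- 0.3                                # probability of a structural zero
z <- rbinom(n = n, size = 1, prob = pi_zero)  # z = 1 indicates a structural zero
Y_zip <- ifelse(z == 1, 0, rpois(n, exp(b0 + b1 * X)))
# Compare the share of zeros with what a plain Poisson DGP would produce
mean(Y_zip == 0)
mean(rpois(n, exp(b0 + b1 * X)) == 0)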
Below, we review an application that will be discussed in class:
Foreclosure Filings of Homeowners’ Associations in Harris County (Houston), TX, from 1995 to 2001.
The data include the following variables:
- Foreclosure Filings (file9501): The number of filings in a neighborhood during 1995–2001.
- Median Valuation (mdvaluation): The median home price in the neighborhood (entered in the models in logs).
- Post-1975 Neighborhood: An indicator of whether the median home in the neighborhood was built after 1975.
The unit of observation is the neighborhood, which may or may not have an organized Homeowners’ Association (HOA), though this is unobserved.
## Load data
data <- read_csv("https://faculty.washington.edu/cadolph/mle/hoaCounts.csv")[,-1]
names(data)
## [1] "index" "filings" "file9501" "file8594" "harhoa"
## [6] "file8587" "file8894" "filb9500" "nhomes" "mdvaluation"
## [11] "mdyear" "pctwhite" "pcthispanic" "pctblack" "pctasian"
## data management
data <-
data %>%
## Restrict to places with >=100 homes
filter(nhomes >= 100) %>%
mutate(
## Construct exposure variable and add to dataset
exposure = nhomes * length(1995:2001),
## Explore relationship between ever-filers and age of neighborhood
everfile = ifelse(file9501 > 0, 1, 0),
## GAM suggests constructing new neighborhood (after 1975) covariate
post1975construction = ifelse(mdyear > 1975, 1, 0)
)
## Create alternate "no zeros" dataset
dataNZ <- data %>% filter(file9501 > 0)
## Process data for model
m1 <- file9501 ~ log(mdvaluation) + post1975construction
m1data <- extractdata(m1, data, extra=~exposure, na.rm = TRUE)
m1dataNZ <- extractdata(m1, dataNZ, extra=~exposure, na.rm = TRUE)
First, we run the analysis on the full data, including the observations with \(Y = 0\).
### Model 1 - Poisson using GLM with zeros included
glm.poisWZ <- glm(m1,
data = m1data,
family = "poisson",
# include exposure as a log offset
offset = log(m1data$exposure))
pe.poisWZ <- coef(glm.poisWZ) # point estimates
vc.poisWZ <- vcov(glm.poisWZ) # var-cov matrix
se.poisWZ <- sqrt(diag(vc.poisWZ)) # standard errors
ll.poisWZ <- logLik(glm.poisWZ) # likelihood at maximum
aic.poisWZ <- AIC(glm.poisWZ)
## Set up counterfactuals (range over mdvaluation, all else equal)
valueseq <- seq(35000, 400000, 1000)
xhypWZ <- cfMake(formula = m1,
data = m1data,
nscen = length(valueseq))
for (i in 1:length(valueseq)) {
xhypWZ <- cfChange(xhypWZ, "mdvaluation", x=valueseq[i], scen=i)
}
## Simulate results
sims <- 10000
simbetas.poisWZ <- mvrnorm(sims, pe.poisWZ, vc.poisWZ) # draw parameters
yhyp.poisWZ <- loglinsimev(xhypWZ, simbetas.poisWZ)
# Transform expected counts to expected rates
yhyp.poisWZ$pe <- yhyp.poisWZ$pe*1000
yhyp.poisWZ$lower <- yhyp.poisWZ$lower*1000
yhyp.poisWZ$upper <- yhyp.poisWZ$upper*1000
# Model 1 - ggplot visual
vis_poisWZ <-
yhyp.poisWZ %>% as.data.frame() %>%
bind_cols() %>% as_tibble() %>%
mutate(model="poisson",
valueseq=valueseq/1000)
vis_poisWZ %>%
ggplot(aes(y=pe,
ymin=lower,
ymax=upper,
x=valueseq)) +
geom_line(color=purple) +
geom_ribbon(alpha=0.2, fill=purple) +
scale_y_continuous(limits = c(0,25)) +
scale_x_continuous(breaks = seq(50,400,50),
limits = c(50,300),
labels = scales::dollar_format(suffix = "k")) +
labs(title= "Expected HOA foreclosure fillings per 1000 homes per year",
y=NULL,
x="Median home price of neighborhood") +
theme_caviz_hgrid +
coord_cartesian(xlim=c(50,225))+
annotate("text", x = 150, y = 10,
label = "Poisson including excess zeros",
color=purple, size=6)
Now, we run the analysis excluding the observations with \(Y = 0\).
### Model 2 - Poisson using GLM with zeros removed
glm.poisNZ <- glm(m1,
data = m1dataNZ,
family = "poisson",
# include exposure as a log offset
offset = log(m1dataNZ$exposure))
pe.poisNZ <- coef(glm.poisNZ) # point estimates
vc.poisNZ <- vcov(glm.poisNZ) # var-cov matrix
se.poisNZ <- sqrt(diag(vc.poisNZ)) # standard errors
ll.poisNZ <- logLik(glm.poisNZ) # likelihood at maximum
aic.poisNZ <- AIC(glm.poisNZ)
stargazer::stargazer(glm.poisWZ,glm.poisNZ, type="text")
##
## =================================================
## Dependent variable:
## ----------------------------
## file9501
## (1) (2)
## -------------------------------------------------
## log(mdvaluation) -0.904*** -1.559***
## (0.029) (0.042)
##
## post1975construction 2.706*** 0.737***
## (0.043) (0.037)
##
## Constant 1.845*** 11.648***
## (0.309) (0.478)
##
## -------------------------------------------------
## Observations 1,417 326
## Log Likelihood -7,799.816 -3,025.548
## Akaike Inf. Crit. 15,605.630 6,057.096
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
# Set up counterfactuals (range over mdvaluation, all else equal)
valueseq <- seq(35000, 400000, 1000)
xhypNZ <- cfMake(m1, m1dataNZ, nscen = length(valueseq))
for (i in 1:length(valueseq)) {
xhypNZ <- cfChange(xhypNZ, "mdvaluation", x=valueseq[i], scen=i)
}
# Simulate results
simbetas.poisNZ <- mvrnorm(sims, pe.poisNZ, vc.poisNZ) # draw parameters
yhyp.poisNZ <- loglinsimev(xhypNZ, simbetas.poisNZ)
# Transform expected counts to expected rates
yhyp.poisNZ$pe <- yhyp.poisNZ$pe*1000
yhyp.poisNZ$lower <- yhyp.poisNZ$lower*1000
yhyp.poisNZ$upper <- yhyp.poisNZ$upper*1000
### Model 2 - ggplot visual
vis_poisNZ <-
yhyp.poisNZ %>% as.data.frame() %>%
bind_cols() %>% as_tibble() %>%
mutate(model="poisson_NZ",
valueseq=valueseq/1000)
vis_poisNZ %>%
ggplot(aes(y=pe,
ymin=lower,
ymax=upper,
x=valueseq)) +
geom_line(color=red) +
geom_ribbon(alpha=0.2, fill=red) +
scale_y_continuous(limits = c(0,25)) +
scale_x_continuous(labels = scales::dollar_format(suffix = "k")) +
labs(title= "Expected HOA foreclosure fillings per 1000 homes per year",
y=NULL,
x="Median home price of neighborhood") +
theme_caviz_hgrid +
coord_cartesian(xlim=c(40,225))+
annotate("text", x = 150, y = 10,
label = "Poisson including excess zeros",
color=red, size=6)
Now, we address the issue using the Negative Binomial (NB) model.
Recall that the NB model is a combination of the Poisson and Gamma distributions, where the Gamma distribution models the random error in the Poisson model. For more details, refer to slides 198-216.
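A quick simulation (a sketch; the parameter values here are illustrative) makes the mixture interpretation concrete: draw a Gamma-distributed rate for each observation, then draw Poisson counts around those rates, and the result matches a Negative Binomial draw.
# Sketch: Negative Binomial as a Poisson-Gamma mixture
set.seed(2023)
theta <- 0.5
mu <- 2
# Gamma multiplier with mean 1 and variance 1/theta
lambda_i <- mu * rgamma(100000, shape = theta, rate = theta)
y_mix <- rpois(100000, lambda = lambda_i)
y_nb  <- rnbinom(100000, size = theta, mu = mu)
c(mean(y_mix), mean(y_nb))   # both approximately mu
c(var(y_mix), var(y_nb))     # both approximately mu + mu^2 / theta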
### Model 3 - Negative binomial regression, zeros excluded
## Note: glm.nb requires the offset to be included in the model formula, thus:
m1nb <- file9501 ~ log(mdvaluation) + post1975construction + offset(log(exposure))
# NB is similar to the Beta-Binomial, but only bounded on one side
nb.result <- glm.nb(m1nb, data=m1dataNZ) # glm.nb is from MASS package
pe.nb <- coef(nb.result)
vc.nb <- vcov(nb.result)
se.nb <- sqrt(diag(vc.nb))
ll.nb <- logLik(nb.result)
aic.nb <- AIC(nb.result)
theta.nb <- nb.result$theta
setheta.nb <- nb.result$SE.theta
# Simulate results (we re-use xhyp from above)
simbetas.nb <- mvrnorm(sims, pe.nb, vc.nb) # draw parameters
yhyp.nb <- loglinsimev(xhypNZ, simbetas.nb)
# Transform expected counts to expected rates
yhyp.nb$pe <- yhyp.nb$pe*1000
yhyp.nb$lower <- yhyp.nb$lower*1000
yhyp.nb$upper <- yhyp.nb$upper*1000
### Model 3 - ggplot visual
vis_nb <-
yhyp.nb %>% as.data.frame() %>%
bind_cols() %>% as_tibble() %>%
mutate(model="nb",
valueseq=valueseq/1000)
vis_nb %>%
ggplot(aes(y=pe,
ymin=lower,
ymax=upper,
x=valueseq)) +
geom_line(color=blue) +
geom_ribbon(alpha=0.2, fill=blue) +
scale_y_continuous(limits = c(0,25)) +
scale_x_continuous(labels = scales::dollar_format(suffix = "k")) +
labs(title= "Expected HOA foreclosure filings per 1000 homes per year",
y=NULL,
x="Median home price of neighborhood") +
theme_caviz_hgrid +
coord_cartesian(xlim=c(40,225))+
annotate("text", x = 150, y = 10,
label = "Negative Binomial with all zeros deleted",
color=blue, size=6)
The Negative Binomial (NB) model constructs a probability model for overdispersion using the Gamma distribution.
In contrast, the Quasi-Poisson approach retains the Poisson-like mean-variance relationship but multiplicatively re-scales the variance up or down as needed.
As a result, the Quasi-Poisson model is more flexible than the NB model because it can scale down the variance to accommodate underdispersion, although underdispersion is very rare.
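As a sketch of this rescaling (not part of the original code), the dispersion can be estimated from the Pearson residuals of the zero-deleted Poisson fit glm.poisNZ from Model 2, and the quasi-Poisson standard errors are then the Poisson standard errors inflated by its square root:
# Moment-based dispersion estimate from Model 2's Pearson residuals
phi_hat <- sum(residuals(glm.poisNZ, type = "pearson")^2) / glm.poisNZ$df.residual
phi_hat
# Quasipoisson SEs are (approximately) the Poisson SEs scaled by sqrt(phi_hat)
se.poisNZ * sqrt(phi_hat)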
### Model 4 - Quasipoisson using GLM with zeros removed
## Quasipoisson using GLM with zeros removed
glm.qpoisNZ <- glm(m1, data= m1dataNZ,
family=quasipoisson,
offset=log(m1dataNZ$exposure))
pe.qpoisNZ <- coef(glm.qpoisNZ) # point estimates
vc.qpoisNZ <- vcov(glm.qpoisNZ) # var-cov matrix
se.qpoisNZ <- sqrt(diag(vc.qpoisNZ)) # standard errors
ll.qpoisNZ <- logLik(glm.qpoisNZ) # NA: quasi-likelihood fits do not define a true log-likelihood
aic.qpoisNZ <- AIC(glm.qpoisNZ)   # NA for the same reason
## Simulate results
simbetas.qpoisNZ <- mvrnorm(sims, pe.qpoisNZ, vc.qpoisNZ) # draw parameters
yhyp.qpoisNZ <- loglinsimev(xhypNZ, simbetas.qpoisNZ)
## Transform expected counts to expected rates
yhyp.qpoisNZ$pe <- yhyp.qpoisNZ$pe*1000
yhyp.qpoisNZ$lower <- yhyp.qpoisNZ$lower*1000
yhyp.qpoisNZ$upper <- yhyp.qpoisNZ$upper*1000
### Model 4 - ggplot visual
vis_qpoisNZ <-
yhyp.qpoisNZ %>% as.data.frame() %>%
bind_cols() %>% as_tibble() %>%
mutate(model="qpoisNZ",
valueseq=valueseq/1000)
vis_qpoisNZ %>%
ggplot(aes(y=pe,
ymin=lower,
ymax=upper,
x=valueseq)) +
geom_line(color=brown) +
geom_ribbon(alpha=0.2, fill=brown) +
scale_y_continuous(limits = c(0,25)) +
scale_x_continuous(labels = scales::dollar_format(suffix = "k")) +
labs(title= "Expected HOA foreclosure filings per 1000 homes per year",
y=NULL,
x="Median home price of neighborhood") +
theme_caviz_hgrid +
coord_cartesian(xlim=c(40,225))+
annotate("text", x = 150, y = 10,
label = "Quasipoisson with all zeros deleted",
color=brown, size=6)
Here, we compare the models fitted so far.
## We can compare model 4 vs 3
vis_qpoisNZ %>%
ggplot(aes(y=pe,
ymin=lower,
ymax=upper,
x=valueseq)) +
geom_line(color=brown) +
geom_ribbon(alpha=0.2, fill=brown) +
geom_line(inherit.aes = F,
data=vis_nb,
aes(y=pe, x=valueseq),
color=blue) +
geom_ribbon(data=vis_nb,
aes(ymin=lower,
ymax=upper),
alpha=0,
color=blue,
linetype="dashed") +
scale_y_continuous(limits = c(0,25)) +
scale_x_continuous(labels = scales::dollar_format(suffix = "k")) +
labs(title= "Expected HOA foreclosure filings per 1000 homes per year",
y=NULL,
x="Median home price of neighborhood") +
theme_caviz_hgrid +
coord_cartesian(xlim=c(40,225))+
annotate("text", x = 100, y = 12,
label = "Quasipoisson with all zeros deleted",
color=brown, size=6) +
annotate("text", x = 150, y = 6,
label = "Negative Binomial with all zeros deleted",
color=blue, size=6)
## or model 4 vs 2
vis_qpoisNZ %>%
ggplot(aes(y=pe,
ymin=lower,
ymax=upper,
x=valueseq)) +
geom_line(color=brown) +
geom_ribbon(alpha=0.2, fill=brown) +
geom_line(inherit.aes = F,
data=vis_poisNZ,
aes(y=pe, x=valueseq),
color=red) +
geom_ribbon(data=vis_poisNZ,
aes(ymin=lower,
ymax=upper),
alpha=0,
color=red,
linetype="dashed") +
scale_y_continuous(limits = c(0,25)) +
scale_x_continuous(labels = scales::dollar_format(suffix = "k")) +
labs(title= "Expected HOA foreclosure filings per 1000 homes per year",
y=NULL,
x="Median home price of neighborhood") +
theme_caviz_hgrid +
coord_cartesian(xlim=c(40,225))+
annotate("text", x = 100, y = 12,
label = "Quasipoisson with all zeros deleted",
color=brown, size=6) +
annotate("text", x = 150, y = 6,
label = "Poisson with all zeros deleted",
color=red, size=6)
### Model 5 - Zero-Inflated Negative Binomial
## Estimate ZINB using zeroinfl in the pscl library
## Notice the particular form of the formula -- count first, then zero-inflation
m1zinb <- file9501 ~ log(mdvaluation) + post1975construction | log(mdvaluation) +
post1975construction + log(exposure)
## zeroinfl function comes from the pscl package
zinb.result <- zeroinfl(m1zinb,
data=m1data, # we are now using the full data
dist="negbin", # character specification of count model family
offset=log(m1data$exposure))
## Extract results, taking care to keep track of both equations
pe.zinb.count <- zinb.result$coefficients$count
pe.zinb.zeros <- zinb.result$coefficients$zero
pe.zinb <- c(pe.zinb.count,pe.zinb.zeros)
vc.zinb <- vcov(zinb.result)
se.zinb <- sqrt(diag(vc.zinb))
se.zinb.count <- se.zinb[1:length(pe.zinb.count)]
se.zinb.zeros <- se.zinb[ (length(pe.zinb.count)+1) : length(se.zinb) ]
ll.zinb <- logLik(zinb.result)
aic.zinb <- AIC(zinb.result)
theta.zinb <- zinb.result$theta
setheta.zinb <- zinb.result$SE.theta
# Simulate results (we re-use xhyp from above)
# Option 1: E(count | not a structural zero)
# { other options include unconditional E(count) and Pr(Structural Zero) }
simparam.zinb <- mvrnorm(sims,pe.zinb,vc.zinb) # draw parameters
simbetas.zinb <- simparam.zinb[,1:length(pe.zinb.count)]
simgammas.zinb <- simparam.zinb[,(length(pe.zinb.count)+1) : length(se.zinb) ]
yhyp.zinbC <- loglinsimev(xhypWZ, simbetas.zinb)
## Transform expected counts to expected rates
yhyp.zinbC$pe <- yhyp.zinbC$pe*1000
yhyp.zinbC$lower <- yhyp.zinbC$lower*1000
yhyp.zinbC$upper <- yhyp.zinbC$upper*1000
## Option 2: Pr(Structural Zero)
## { other options include E(count|not a structural zero) and unconditional count }
xhypWZ0 <- xhypWZ
xhypWZ0$model <- file9501 ~ log(mdvaluation) + post1975construction + log(exposure)
xhypWZ0$x$exposure <- xhypWZ0$xpre$exposure <- mean(m1data$exposure)
yhyp.zinbZ <- logitsimev(xhypWZ0, simgammas.zinb)
### Model 5 - ggplot visual option 1
vis_zinbC <-
yhyp.zinbC %>% as.data.frame() %>%
bind_cols() %>% as_tibble() %>%
mutate(model="zinbC",
valueseq=valueseq/1000)
(p1 <-
vis_zinbC %>%
ggplot(aes(y=pe,
ymin=lower,
ymax=upper,
x=valueseq)) +
geom_line(color=orange) +
geom_ribbon(alpha=0.2, fill=orange) +
scale_y_continuous(limits = c(0,40)) +
scale_x_continuous(labels = scales::dollar_format(suffix = "k")) +
labs(title= "Expected HOA foreclosure filings per 1000 homes per year",
y=NULL,
x="Median home price of neighborhood") +
theme_caviz_hgrid +
coord_cartesian(xlim=c(40,225))+
annotate("text", x = 150, y = 10,
label = "Zero-Inflated Negative Binomial",
color=orange, size=6))
### Model 5 - ggplot visual option 2
vis_zinbZ <-
yhyp.zinbZ %>% as.data.frame() %>%
bind_cols() %>% as_tibble() %>%
mutate(model="zinbZ",
pe = 1 - pe,
lower = 1 - lower,
upper = 1 - upper,
valueseq=valueseq/1000)
(p2 <-
vis_zinbZ %>%
ggplot(aes(y=pe,
ymin=lower,
ymax=upper,
x=valueseq)) +
geom_line(color=orange) +
geom_ribbon(alpha=0.2, fill=orange) +
scale_y_continuous(limits = c(0,1),breaks=seq(0,1,0.2)) +
scale_x_continuous(labels = scales::dollar_format(suffix = "k")) +
labs(title= "Probability neighborhood has a filing HOA",
y=NULL,
x="Median home price of neighborhood") +
theme_caviz_hgrid +
coord_cartesian(xlim=c(40,225))+
annotate("text", x = 100, y = 0.7,
label = "Zero-Inflated Negative Binomial",
color=orange, size=6))
### Combine options 1 and 2
bind_rows(vis_zinbC,vis_zinbZ) %>%
ggplot(aes(y=pe,
ymin=lower,
ymax=upper,
x=valueseq)) +
geom_line(color=orange) +
geom_ribbon(alpha=0.2, fill=orange) +
scale_x_continuous(labels = scales::dollar_format(suffix = "k")) +
labs(title= "Expected HOA foreclosure filings per 1000 homes per year",
y=NULL,
x="Median home price of neighborhood") +
theme_caviz_hgrid +
coord_cartesian(xlim=c(40,225)) +
facet_wrap(~model, scales = "free")
## You can also use packages like gridExtra
gridExtra::grid.arrange(p1, p2, ncol=2) # adjust the size of the labels if you choose this visual
### Model 6 - Zero-Inflated Poisson
## Estimate ZIP using zeroinfl in the pscl library
zip.result <- zeroinfl(m1zinb,
data=m1data,
dist="poisson",
offset=log(m1data$exposure))
## Extract results, taking care to keep track of both equations
pe.zip.count <- zip.result$coefficients$count
pe.zip.zeros <- zip.result$coefficients$zero
pe.zip <- c(pe.zip.count, pe.zip.zeros)
vc.zip <- vcov(zip.result)
se.zip <- sqrt(diag(vc.zip))
se.zip.count <- se.zip[1:length(pe.zip.count)]
se.zip.zeros <- se.zip[ (length(pe.zip.count)+1) : length(se.zip) ]
ll.zip <- logLik(zip.result)
aic.zip <- AIC(zip.result)
## Simulate results (we re-use xhyp from above)
## Option 1: E(count | not a structural zero)
## { other options include unconditional E(count) and Pr(Structural Zero) }
simparam.zip <- mvrnorm(sims,pe.zip,vc.zip) # draw parameters
simbetas.zip <- simparam.zip[,1:length(pe.zip.count)]
simgammas.zip <- simparam.zip[,(length(pe.zip.count)+1) : length(se.zip) ]
yhyp.zipC <- loglinsimev(xhypWZ, simbetas.zip)
## Transform expected counts to expected rates
yhyp.zipC$pe <- yhyp.zipC$pe*1000
yhyp.zipC$lower <- yhyp.zipC$lower*1000
yhyp.zipC$upper <- yhyp.zipC$upper*1000
## Option 2: Pr(Structural Zero)
## { other options include E(count|not a structural zero) and unconditional count }
yhyp.zipZ <- logitsimev(xhypWZ0, simgammas.zip)
### Model 6 - ggplot visual option 1
vis_zipC <-
yhyp.zipC %>% as.data.frame() %>%
bind_cols() %>% as_tibble() %>%
mutate(model="zipC",
valueseq=valueseq/1000)
(p3 <-
vis_zipC %>%
ggplot(aes(y=pe,
ymin=lower,
ymax=upper,
x=valueseq)) +
geom_line(color=green) +
geom_ribbon(alpha=0.2, fill=green) +
scale_y_continuous(limits = c(0,25)) +
scale_x_continuous(labels = scales::dollar_format(suffix = "k")) +
labs(title= "Expected HOA foreclosure filings per 1000 homes per year",
y=NULL,
x="Median home price of neighborhood") +
theme_caviz_hgrid +
coord_cartesian(xlim=c(40,225))+
annotate("text", x = 150, y = 10,
label = "Expected counts Poisson",
color=green, size=6))
### Model 6 - ggplot visual option 2
vis_zipZ <-
yhyp.zipZ %>% as.data.frame() %>%
bind_cols() %>% as_tibble() %>%
mutate(model="zinbZ",
pe = 1 - pe,
lower = 1 - lower,
upper = 1 - upper,
valueseq=valueseq/1000)
(p4 <-
vis_zipZ %>%
ggplot(aes(y=pe,
ymin=lower,
ymax=upper,
x=valueseq)) +
geom_line(color=green) +
geom_ribbon(alpha=0.2, fill=green) +
scale_y_continuous(limits = c(0,1),breaks=seq(0,1,0.2)) +
scale_x_continuous(labels = scales::dollar_format(suffix = "k")) +
labs(title= "Probability neighborhood has a filing HOA",
y=NULL,
x="Median home price of neighborhood") +
theme_caviz_hgrid +
coord_cartesian(xlim=c(40,225))+
annotate("text", x = 100, y = 0.7,
label = "Zero-Inflated Poisson",
color=green, size=6))
### Combine options 1 and 2
## You can also use packages like gridExtra
gridExtra::grid.arrange(p3, p4, ncol=2) # adjust the size of the labels if you choose this visual
Recreating the figures above with the tile package
## NOTE: to run the code below you need all the quantities estimated in the previous section
### Model 1 - Poisson using GLM with zeros included
# Set up lineplot trace of result for plotting using tile
trace1poisWZ <- lineplot(x = valueseq/1000,
y = yhyp.poisWZ$pe,
lower = yhyp.poisWZ$lower,
upper = yhyp.poisWZ$upper,
ci = list(mark="shaded"),
lex=1.5,
extrapolate = list(data=model.frame(m1, m1data),
cfact=model.frame(m1, xhypWZ$x),
omit.extrapolated=TRUE),
col = purple,
plot = 1
)
poisLabWZ <- textTile(labels="Poisson including excess zeros",
x=77,
y=3,
col=purple,
cex=1,
plot=1)
rug1 <- rugTile(x=m1data$mdvaluation[(m1data$mdvaluation>35000)
&(m1data$mdvaluation<400000)]/1000,
type="jitter",
col=nicegray, alpha=0.4, plot=1)
xat <- c(50, 100, 200)
xlabs <- c("$50k", "$100k", "$200k")
## Plot traces using tile
tile(trace1poisWZ, rug1, poisLabWZ,
RxC = c(1,1),
limits = c(35,400,0,28.5),
yaxis=list(major=FALSE),
xaxis=list(log=TRUE, at=xat, labels=xlabs),
xaxistitle = list(labels="Median home price of neighborhood"),
plottitle = list(labels="Expected HOA foreclosure filings per 1000 homes per year",
x=0.39, y=0.75),
height = list(plot="golden", xaxistitle=3.5, plottitle=3.5),
gridlines = list(type="y")
)
### Model 2 - Poisson using GLM with zeros removed
# Set up lineplot trace of result for plotting using tile
trace1poisNZ <- lineplot(x = valueseq/1000,
y = yhyp.poisNZ$pe,
lower = yhyp.poisNZ$lower,
upper = yhyp.poisNZ$upper,
ci = list(mark="shaded"),
lex=1.5,
extrapolate = list(data=model.frame(m1, m1dataNZ),
cfact=model.frame(m1, xhypNZ$x),
omit.extrapolated=TRUE),
col = red,
plot = 1
)
poisLabNZ <- textTile(labels="Poisson with all zeros deleted",
x=81,
y=12.5,
col=red,
cex=1,
plot=1)
rug1 <- rugTile(x=m1dataNZ$mdvaluation[(m1dataNZ$mdvaluation>35000)
&(m1dataNZ$mdvaluation<400000)]/1000,
type="jitter",
col=nicegray, alpha=0.4, plot=1)
xat <- c(50, 100, 200)
xlabs <- c("$50k", "$100k", "$200k")
## Plot traces using tile
tile(trace1poisNZ, rug1, poisLabNZ,
RxC = c(1,1),
limits = c(35,400,0,28.5),
yaxis=list(major=FALSE),
xaxis=list(log=TRUE, at=xat, labels=xlabs),
xaxistitle = list(labels="Median home price of neighborhood"),
plottitle = list(labels="Expected HOA foreclosure filings per 1000 homes per year",
x=0.39, y=0.75),
height = list(plot="golden", xaxistitle=3.5, plottitle=3.5),
gridlines = list(type="y")
)
### Model 3 - Negative binomial regression, zeros excluded
# Set up lineplot trace of result for plotting using tile
trace1nb <- lineplot(x = valueseq/1000,
y = yhyp.nb$pe,
lower = yhyp.nb$lower,
upper = yhyp.nb$upper,
ci = list(mark="shaded"),
lex=1.5,
extrapolate = list(data=model.frame(m1, m1dataNZ),
cfact=model.frame(m1, xhypNZ$x),
omit.extrapolated=TRUE),
col = blue,
plot = 1
)
poisLabNB <- textTile(labels="Negative Binomial with all zeros deleted",
x=98,
y=12.5,
col=blue,
cex=1,
plot=1)
xat <- c(50, 100, 200)
xlabs <- c("$50k", "$100k", "$200k")
## Plot traces using tile
tile(trace1nb, rug1, poisLabNB,
limits = c(35,400,0,28.5),
yaxis=list(major=FALSE),
xaxis=list(log=TRUE, at=xat, labels=xlabs),
xaxistitle = list(labels="Median home price of neighborhood"),
plottitle = list(labels="Expected HOA foreclosure filings per 1000 homes per year",
x=0.39, y=0.75),
height = list(plot="golden", xaxistitle=3.5, plottitle=3.5),
gridlines = list(type="y")
)
### Model 4 - Quasipoisson using GLM with zeros removed
## Set up lineplot trace of result for plotting using tile
trace1qpoisNZ <- lineplot(x = valueseq/1000,
y = yhyp.qpoisNZ$pe,
lower = yhyp.qpoisNZ$lower,
upper = yhyp.qpoisNZ$upper,
ci = list(mark="shaded"),
lex=1.5,
extrapolate = list(data=model.frame(m1, m1dataNZ),
cfact=model.frame(m1, xhypNZ$x),
omit.extrapolated=TRUE),
col = brown,
plot = 1
)
qpoisLabNZ <- textTile(labels="Quasipoisson with all zeros deleted",
x=93,
y=12.5,
col=brown,
cex=1,
plot=1)
# Switch negative binomial CIs to dashed lines
trace1nb$ci$mark <- "dashed"
# Move negative binomial labels
poisLabNB$y <- 7.5
poisLabNB$x <- 140
## Plot traces using tile
tile(trace1qpoisNZ, trace1nb, rug1, qpoisLabNZ, poisLabNB,
limits = c(35,400,0,28.5),
yaxis=list(major=FALSE),
xaxis=list(log=TRUE, at=xat, labels=xlabs),
xaxistitle = list(labels="Median home price of neighborhood"),
plottitle = list(labels="Expected HOA foreclosure filings per 1000 homes per year",
x=0.39, y=0.75),
height = list(plot="golden", xaxistitle=3.5, plottitle=3.5),
gridlines = list(type="y")
)
## Switch poisson CIs to dashed lines
trace1poisNZ$ci$mark <- "dashed"
## Move poisson labels
poisLabNZ$y <- 7.5
poisLabNZ$x <- 125
## Plot traces using tile
tile(trace1qpoisNZ, rug1, qpoisLabNZ, poisLabNZ, trace1poisNZ,
limits = c(35,400,0,28.5),
yaxis=list(major=FALSE),
xaxis=list(log=TRUE, at=xat, labels=xlabs),
xaxistitle = list(labels="Median home price of neighborhood"),
plottitle = list(labels="Expected HOA foreclosure filings per 1000 homes per year",
x=0.39, y=0.75),
height = list(plot="golden", xaxistitle=3.5, plottitle=3.5),
gridlines = list(type="y")
)
### Model 5 - Zero-Inflated Negative Binomial (Option 1)
## Set up lineplot trace of result for plotting using tile
trace1zinbC <- lineplot(x = valueseq/1000,
y = yhyp.zinbC$pe,
lower = yhyp.zinbC$lower,
upper = yhyp.zinbC$upper,
ci = list(mark="shaded"),
lex=1.5,
extrapolate = list(data=model.frame(m1, m1data),
cfact=model.frame(m1, xhypWZ$x),
omit.extrapolated=TRUE),
col = orange,
plot = 1
)
poisLabZINB2 <- textTile(labels="Zero-Inflated Negative Binomial",
x=80,
y=17.5,
col=orange,
cex=1,
plot=1)
xat <- c(50, 100, 200)
xlabs <- c("$50k", "$100k", "$200k")
## Plot traces using tile
tile(trace1zinbC, rug1, poisLabZINB2,
RxC = c(1,1),
limits = c(35,400,0,28.5),
yaxis=list(major=FALSE),
xaxis=list(log=TRUE, at=xat, labels=xlabs),
xaxistitle = list(labels="Median home price of neighborhood"),
plottitle = list(labels="Expected HOA foreclosure filings per 1000 homes per year",
x=0.39, y=0.75),
height = list(plot="golden", xaxistitle=3.5, plottitle=3.5),
gridlines = list(type="y")
)
### Model 5 - Zero-Inflated Negative Binomial (Option 2)
# recall: this option is Pr(Structural Zero); other options include E(count | not a structural zero) and the unconditional count
## Set up lineplot trace of result for plotting using tile
trace1zinbZ <- lineplot(x = valueseq/1000,
y = 1-yhyp.zinbZ$pe,
lower = 1-yhyp.zinbZ$lower,
upper = 1-yhyp.zinbZ$upper,
ci = list(mark="shaded"),
lex=1.5,
extrapolate = list(data=model.frame(m1, m1data),
cfact=model.frame(m1, xhypWZ0$x),
omit.extrapolated=TRUE),
col = orange,
plot = 1
)
poisLabZINB1 <- textTile(labels="Zero-Inflated Negative Binomial",
x=189,
y=0.3,
col=orange,
cex=1,
plot=1)
xat <- c(50, 100, 200)
xlabs <- c("$50k", "$100k", "$200k")
## Plot traces using tile
tile(trace1zinbZ, rug1, poisLabZINB1,
RxC = c(1,1),
limits = c(35,400,-0.01,1.01),
yaxis=list(major=FALSE),
xaxis=list(log=TRUE, at=xat, labels=xlabs),
xaxistitle = list(labels="Median home price of neighborhood"),
plottitle = list(labels="Probability neighborhood has a foreclosure-filing HOA",
x=0.341, y=0.75),
height = list(plot="golden", xaxistitle=3.5, plottitle=3.5),
gridlines = list(type="y")
)
## Combined plot: revise some details
trace1nb$plot <- 2
trace1nb$ci <- list(mark="dashed")
trace1zinbC$plot <- 2
rug1$plot <- c(1,2)
poisLabZINB2$plot <- 2
poisLabNB2 <- textTile(labels="Negative Binomial",
x=115,
y=7.5,
col=blue,
cex=1,
plot=2)
poisLabZINB1$x <- 205
poisLabZINB2$x <- 94
## Plot traces using tile
tile(trace1zinbZ, trace1zinbC, rug1,
poisLabZINB1, poisLabZINB2,
RxC = c(1,2),
limits = rbind(c(35,400,0,1.0),
c(35,400,0,28.5)),
yaxis=list(major=FALSE),
xaxis=list(log=TRUE, at=xat, labels=xlabs),
xaxistitle = list(labels="Median home price of neighborhood"),
plottitle = list(labels1="Probability neighborhood has a filing HOA",
labels2="Expected HOA foreclosure filings per 1000 homes",
x1=0.32625, y1=0.75, x2=0.42025, y2=0.75), # 355 405
height = list(plot="golden", xaxistitle=3.5, plottitle=3.5),
width = list(spacer=3),
gridlines = list(type="y")
)
## Plot traces using tile
tile(trace1zinbZ, trace1zinbC, trace1nb, rug1,
poisLabZINB1, poisLabZINB2, poisLabNB2,
RxC = c(1,2),
limits = rbind(c(35,400,0,1.0),
c(35,400,0,28.5)),
yaxis=list(major=FALSE),
xaxis=list(log=TRUE, at=xat, labels=xlabs),
xaxistitle = list(labels="Median home price of neighborhood"),
plottitle = list(labels1="Probability neighborhood has a filing HOA",
labels2="Expected HOA foreclosure filings per 1000 homes",
x1=0.32625, y1=0.75, x2=0.42025, y2=0.75), # 355 405
height = list(plot="golden", xaxistitle=3.5, plottitle=3.5),
width = list(spacer=3),
gridlines = list(type="y")
)
### Model 6 - Zero-Inflated Poisson
## Set up lineplot trace of result for plotting using tile
trace1zipC <- lineplot(x = valueseq/1000,
y = yhyp.zipC$pe,
lower = yhyp.zipC$lower,
upper = yhyp.zipC$upper,
ci = list(mark="shaded"),
lex=1.5,
extrapolate = list(data=model.frame(m1, m1data),
cfact=model.frame(m1, xhypWZ$x),
omit.extrapolated=TRUE),
col = green,
plot = 1
)
poisLabZIP2 <- textTile(labels="Zero-Inflated Poisson",
x=120,
y=7.5,
col=green,
cex=1,
plot=1)
## Option 2: Pr(Structural Zero)
## { other options include E(count|not a structural zero) and unconditional count }
yhyp.zipZ <- logitsimev(xhypWZ0, simgammas.zip)
## Set up lineplot trace of result for plotting using tile
trace1zipZ <- lineplot(x = valueseq/1000,
y = 1-yhyp.zipZ$pe,
lower = 1-yhyp.zipZ$lower,
upper = 1-yhyp.zipZ$upper,
ci = list(mark="shaded"),
lex=1.5,
extrapolate = list(data=model.frame(m1, m1data),
cfact=model.frame(m1, xhypWZ0$x),
omit.extrapolated=TRUE),
col = green,
plot = 1
)
poisLabZIP1 <- textTile(labels="Zero-Inflated Poisson",
x=205,
y=0.133,
col=green,
cex=1,
plot=1)
## Combined plot: revise some details
trace1zipC$plot <- 2
poisLabZIP2$plot <- 2
trace1zinbZ$ci$mark <- "dashed"
trace1zinbC$ci$mark <- "dashed"
poisLabZINB1$x <- 75
poisLabZINB1$y <- 0.5
## Plot traces using tile
tile(trace1zipZ, trace1zipC, rug1,
poisLabZIP1, poisLabZIP2,
RxC = c(1,2),
limits = rbind(c(35,400,0,1.0),
c(35,400,0,28.5)),
yaxis=list(major=FALSE),
xaxis=list(log=TRUE, at=xat, labels=xlabs),
xaxistitle = list(labels="Median home price of neighborhood"),
plottitle = list(labels1="Probability neighborhood has a filing HOA",
labels2="Expected HOA foreclosure filings per 1000 homes",
x1=0.32625, y1=0.75, x2=0.42025, y2=0.75), # 355 405
height = list(plot="golden", xaxistitle=3.5, plottitle=3.5),
width = list(spacer=3),
gridlines = list(type="y")
)
## Plot traces using tile
tile(trace1zipZ, trace1zipC, rug1, trace1zinbZ, trace1zinbC,
poisLabZIP1, poisLabZIP2, poisLabZINB1, poisLabZINB2,
RxC = c(1,2),
limits = rbind(c(35,400,0,1.0),
c(35,400,0,28.5)),
yaxis=list(major=FALSE),
xaxis=list(log=TRUE, at=xat, labels=xlabs),
xaxistitle = list(labels="Median home price of neighborhood"),
plottitle = list(labels1="Probability neighborhood has a filing HOA",
labels2="Expected HOA foreclosure filings per 1000 homes",
x1=0.32625, y1=0.75, x2=0.42025, y2=0.75), # 355 405
height = list(plot="golden", xaxistitle=3.5, plottitle=3.5),
width = list(spacer=3),
gridlines = list(type="y")
)