This lesson contains discussion of child welfare, intimate partner violence, incarceration, mental health, and other sensitive topics. You may always choose not to engage with distressing material. This lesson takes care to treat these topics with respect and compassion.

Welcome to the introduction to the Fragile Families dataset. The goal of this lesson is to introduce the study, its social context, and a first exploratory look at the data.

Illustration: Devon Holzwarth

The Fragile Families & Child Wellbeing Study

Between 1998 and 2000, nearly 5,000 newborn children were recruited into the Fragile Families & Child Wellbeing Study. The term “fragile” refers to unmarried parents, who were oversampled at a ratio of roughly 3:1 (about 75% of the families are unmarried households and 25% are married households). These families have now been followed longitudinally for 20 years.

The study aims to approach the following questions:

  1. What are the conditions and capabilities of unmarried parents, especially fathers?

  2. What is the nature of the relationships between unmarried parents?

  3. How do children born into these families fare?

  4. How do policies and environmental conditions affect families and children?

In the Introduction to the Special Collection on the Fragile Families Challenge we learn about the six main outcome variables that were addressed in the Fragile Families Challenge:

  1. child grade point average (GPA)
  2. child grit
  3. household eviction
  4. household material hardship
  5. caregiver layoff
  6. caregiver participation in job training

Social Determinants of Health

To really situate ourselves in this data, we need to understand a few things about the social determinants of health.

social determinants

The social determinants of health refers to the well-documented phenomenon that social factors play a significant role in health outcomes. This might seem intuitive, but health used to be framed as a simple equation: practice health-promoting behaviors and you turn out healthy. In fact, many other (social) factors contribute to a person’s overall health outcomes. For example, having a strong community increases overall wellbeing, and access to education can decrease the risk factors for certain diseases.

Child Welfare Services

Illustration: Devon Holzwarth

There are many resources for families struggling with the kinds of problems raised in this study. From substance abuse (drugs and alcohol) to recovery from intimate partner violence, parents may be court-mandated to complete certain programs in order to ensure the proper care of their children. Not all cases escalate to needing government involvement, and many cases that do need it never get reported. It is important as a society that we value children and their health and wellbeing, and protect them from violence. Parents need to meet a “minimally sufficient level of care” for their children, including food, shelter, education, safety, and emotional wellbeing.

Critical Data Science

As Data Scientists we are not automatically neutral or objective, and it is our responsibility to approach sensitive topics with great care. Findings from our analyses can be used to affect real families and their children! We must be careful to ask where the data came from, how it’s being used, how the families are being protected, what all the variables mean, and what kinds of work would benefit these families the most.

The Frequently Asked Questions page is informative and helpful, and should be read before continuing.

Not so “Fragile”, and Certainly Not “Broken Homes”

As a critical data scientist and a child advocate, I cannot in good conscience move forward without commenting on the poor choice of name for this project. No one likes to be called “fragile” or “broken”, and one of the factors that plays a key role in this data set is actually “grit”, or how resilient and determined the child turns out to be. We often see increased resilience and perseverance in children who had adverse childhood experiences! Critical data science also stresses how important it is to return the findings and the analysis process to the participants. This allows for a cycle of advocacy, analysis, and education. So I would like to ask:

what is a better name than Fragile Families?

Access to the Data

Please note that in order to access the data you need to make an account and register your reasons for wanting the data. I was able to get access in a few days, and specified that I was using the data for demonstration purposes with a Data Science class. Register here

In this case, I downloaded the SPSS version and read it in with read.spss() from the R library foreign:

library(foreign)
# keep the SPSS value labels as factor levels and return a data frame
df <- read.spss("FF_allwaves.sav", use.value.labels=TRUE, to.data.frame=TRUE)

Data Size

This data is

BIG!

Let’s investigate by calling:

dim(df)
## [1]  4898 17002

There are 17,002 columns, meaning 17,002 variables collected or recorded, and 4,898 rows, one per family.

Calling summary() on the full data frame actually hits a maximum output limit on my machine, and it doesn’t give us very much information anyway, since the variable names are all opaque codes.

As one study put it: “I wished to remove these variables to reduce false positives and speed up processing time, as machine-learning and imputation techniques can be processor intensive. However, eliminating all the unrelated variables in advance of analyses is a substantial task for a small team, not to mention a single person.” Leveraging Multiple Machine-Learning Techniques to Predict Major Life Outcomes from a Small Set of Psychological and Socioeconomic Variables: A Combined Bottom-up/Top-down Approach

Codebook

The codebooks give each column name and what it corresponds to. This is vital for choosing what we want to look at. You might go through the codebook and select a few variables of interest for your beginning EDA. Here’s where you start to get to know the data schema and the kinds of data you have. For instance, we can look at how many families in this dataset live in public housing projects.

colnames(df[1:50])
##  [1] "idnum"     "cf1intmon" "cf1intyr"  "cf1lenhr"  "cf1lenmin" "cf1twoc"  
##  [7] "cf1fint"   "cf1natsm"  "f1natwt"   "cf1natsmx" "f1natwtx"  "cf1citsm" 
## [13] "f1citywt"  "f1a2"      "f1a3"      "f1a4"      "f1a4a"     "f1a5"     
## [19] "f1a5a"     "f1a6"      "f1a6a"     "f1a7"      "cf1age"    "f1b1a"    
## [25] "f1b1b"     "f1b2"      "f1b3"      "f1b4a"     "f1b4b"     "f1b4c"    
## [31] "f1b4d"     "f1b4e"     "f1b4f"     "f1b4g"     "f1b5a"     "f1b5b"    
## [37] "f1b5c"     "f1b5d"     "f1b6a"     "f1b6b"     "f1b6c"     "f1b6d"    
## [43] "f1b6e"     "f1b6f"     "f1b7a"     "f1b7b"     "f1b7c"     "f1b7d"    
## [49] "f1b7e"     "f1b8"
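The codebook is the authoritative place to look variables up, but a quick pattern search over the column names can help shortlist candidates, e.g. for the public-housing question above. A minimal sketch, assuming df is loaded as above ("house" is just an illustrative search term; any matches still need to be verified against the codebook):

```r
# list column names containing "house" (case-insensitive); the pattern is a
# guess -- confirm each match's meaning in the codebook before using it
housing_vars <- grep("house", colnames(df), value = TRUE, ignore.case = TRUE)
head(housing_vars)
```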

Data Types

Each column is of type factor, but some can be converted to numeric or treated as ordinal data. Using summary() on the first 20 variables (columns), we can see the factor levels of each. Note that several of the factor levels do not indicate an actual response but a code for something like “Not in Wave” (the family didn’t participate in that cycle) or “Skip”. The “Skip” code can itself be meaningful: a purposefully skipped question may carry information later in the analysis. Interesting!

summary(df[1:20])
##     idnum                    cf1intmon               cf1intyr   
##  Length:4898        -9 Not in wave:1068   2000           :2073  
##  Class :character   7 Jul         : 796   1999           :1233  
##  Mode  :character   6 Jun         : 681   -9 Not in wave :1068  
##                     5 May         : 628   1998           : 524  
##                     4 Apr         : 503   -8 Out of range:   0  
##                     8 Aug         : 466   -7 N/A         :   0  
##                     (Other)       : 756   (Other)        :   0  
##            cf1lenhr             cf1lenmin               cf1twoc    
##  0             :2587   -9 Not in wave:1068   0 No           :3306  
##  -9 Not in wave:1068   40            : 551   -9 Not in wave :1068  
##  -3 Missing    : 518   45            : 542   1 Yes          : 524  
##  1             : 473   35            : 430   -8 Out of range:   0  
##  -6 Skip       : 235   30            : 428   -7 N/A         :   0  
##  2             :   5   50            : 272   -6 Skip        :   0  
##  (Other)       :  12   (Other)       :1607   (Other)        :   0  
##             cf1fint                cf1natsm       f1natwt       
##  1 Yes          :3830   1 Yes          :2726   Min.   :   2.02  
##  0 No           :1068   0 No           :1104   1st Qu.:  28.37  
##  -9 Not in wave :   0   -9 Not in wave :1068   Median : 109.14  
##  -8 Out of range:   0   -8 Out of range:   0   Mean   : 415.01  
##  -7 N/A         :   0   -7 N/A         :   0   3rd Qu.: 376.69  
##  -6 Skip        :   0   -6 Skip        :   0   Max.   :8622.66  
##  (Other)        :   0   (Other)        :   0   NA's   :2172     
##            cf1natsmx       f1natwtx                   cf1citsm   
##  1 Yes          :2469   Min.   :   2.097   1 Yes          :3742  
##  0 No           :1361   1st Qu.:  36.115   -9 Not in wave :1068  
##  -9 Not in wave :1068   Median : 132.674   0 No           :  88  
##  -8 Out of range:   0   Mean   : 458.205   -8 Out of range:   0  
##  -7 N/A         :   0   3rd Qu.: 434.828   -7 N/A         :   0  
##  -6 Skip        :   0   Max.   :9028.847   -6 Skip        :   0  
##  (Other)        :   0   NA's   :2429       (Other)        :   0  
##     f1citywt                     f1a2                   f1a3     
##  Min.   :   1.174   1 Yes          :2987   1 Yes          :3503  
##  1st Qu.:  16.040   -9 Not in wave :1068   -9 Not in wave :1068  
##  Median :  35.703   2 No           : 828   2 No           : 317  
##  Mean   :  92.795   -3 Missing     :   8   -3 Missing     :   9  
##  3rd Qu.:  74.697   -2 Don't know  :   7   -2 Don't know  :   1  
##  Max.   :3672.841   -8 Out of range:   0   -8 Out of range:   0  
##  NA's   :1156       (Other)        :   0   (Other)        :   0  
##               f1a4                 f1a4a                   f1a5     
##  1 Yes          :3504   -6 Skip       :3504   1 Yes          :3617  
##  -9 Not in wave :1068   -9 Not in wave:1068   -9 Not in wave :1068  
##  2 No           : 231   1 Yes         : 181   2 No           :  99  
##  -2 Don't know  :  91   2 No          :  81   -2 Don't know  :  94  
##  -3 Missing     :   4   -2 Don't know :  49   -3 Missing     :  20  
##  -8 Out of range:   0   -3 Missing    :  14   -8 Out of range:   0  
##  (Other)        :   0   (Other)       :   1   (Other)        :   0  
##             f1a5a                   f1a6     
##  -6 Skip       :3617   1 Yes          :2207  
##  -9 Not in wave:1068   2 No           :1604  
##  1 Yes         : 128   -9 Not in wave :1068  
##  2 No          :  33   -3 Missing     :  15  
##  -3 Missing    :  27   -2 Don't know  :   4  
##  -2 Don't know :  25   -8 Out of range:   0  
##  (Other)       :   0   (Other)        :   0
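Before converting any of these columns, it helps to see the classic factor-to-numeric pitfall on a toy vector (made-up values, not from the dataset): calling as.numeric() directly on a factor returns the internal level indices, not the recorded values, so you must go through as.character() first.

```r
# toy vector imitating a Fragile Families column (not real data)
x <- factor(c("0", "1", "-9 Not in wave", "2"))
as.numeric(x)               # level indices: NOT the recorded values
as.numeric(as.character(x)) # 0, 1, NA, 2 (the "-9 ..." code coerces to NA with a warning)
```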

To convert a column to a numeric type that we can work with (mean, min, max, etc.), we can combine cf1lenhr and cf1lenmin, which together record the length of the interview with the father:

library(ggplot2)
df$fatherinterview <- (as.numeric(as.character(df$cf1lenhr))*60) + as.numeric(as.character(df$cf1lenmin)) # convert hours to minutes, then add the minutes; non-numeric codes become NA
## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion
df$fatherinterview <- df$fatherinterview / 60 #convert back to hours (optional)
hrs <-na.omit(df$fatherinterview) #remove NAs
hrs <- as.data.frame(hrs)
mean(hrs$hrs)
## [1] 0.7614316
min(hrs$hrs)
## [1] 0.01666667
max(hrs$hrs)
## [1] 35.73333
plt <- ggplot(hrs,aes(x="",y=hrs)) +
  geom_boxplot()+
  geom_hline(yintercept=mean(hrs$hrs),linetype="dashed",color="blue")+
  theme_bw()
plt

How many school days per week do kids eat breakfast?

# school days per week the youth eats breakfast
t <- as.data.frame(prop.table(table(df$p6d30)))
plt <- ggplot(t,aes(Var1, y=Freq,fill=Var1))+
  geom_bar(stat="identity")+
  ggtitle("Number of School Days Youth Eats Breakfast")+
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))+
  theme(legend.position = "none")
plt

How many times a week does the family eat dinner together?

# nights per week the family eats dinner together
t <- as.data.frame(prop.table(table(df$p6d31)))
plt <- ggplot(t,aes(Var1, y=Freq,fill=Var1))+
  geom_bar(stat="identity")+
  ggtitle("Number of Nights Eating Dinner Together as a Family")+
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))+
  theme(legend.position = "none")
plt

Subjective Measure of Youth’s Health

# youth description of personal health
t <- as.data.frame(prop.table(table(df$k6d3)))
plt <- ggplot(t,aes(Var1, y=Freq,fill=Var1))+
  geom_bar(stat="identity")+
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))+
  ggtitle("Youth's Likert Rating of their own Health")+
  theme(legend.position = "none")
plt

Measuring “Grit”

One of the six outcome variables that the Fragile Families Challenge cared about was “grit”.

grit: courage and resolve; strength of character.

Different studies come up with different measures of grit, but it’s important to note that measuring such a thing does not come from a single column in the data. It’s actually a combination of responses to various questions that home in on a quality of grit. Shown below are some of the survey questions used to measure grit in the youth participating in this study.

# I finish whatever I begin
t <- as.data.frame(prop.table(table(df$k6d2m)))
plt <- ggplot(t,aes(Var1, y=Freq,fill=Var1))+
  geom_bar(stat="identity")+
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))+
  ggtitle("'I finish whatever I begin'")+
  theme(legend.position = "none")
plt

# I am a hard worker
t <- as.data.frame(prop.table(table(df$k6d2v)))
plt <- ggplot(t,aes(Var1, y=Freq,fill=Var1))+
  geom_bar(stat="identity")+
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))+
  ggtitle("'I am a hard worker'")+
  theme(legend.position = "none")
plt

# Once I make a plan to get something done, I stick to it
t <- as.data.frame(prop.table(table(df$k6d2k)))
plt <- ggplot(t,aes(Var1, y=Freq,fill=Var1))+
  geom_bar(stat="identity")+
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))+
  ggtitle("'Once I make a plan to get something done, I stick to it'")+
  theme(legend.position = "none")
plt
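One common way to operationalize grit from items like these is to recode each Likert response to a number and average across items. A toy sketch with made-up responses; the level labels and the simple-average scoring are assumptions for illustration, not the Challenge's actual procedure:

```r
# hypothetical responses for three grit items (toy data, invented labels)
finish <- c("4 Strongly agree", "2 Somewhat disagree", "3 Somewhat agree")
worker <- c("4 Strongly agree", "3 Somewhat agree",    "4 Strongly agree")
plan   <- c("3 Somewhat agree", "2 Somewhat disagree", "4 Strongly agree")

# strip the trailing label text, keeping the leading numeric code
score <- function(x) as.numeric(sub(" .*", "", x))

# composite grit: mean of the three item scores per youth
grit <- (score(finish) + score(worker) + score(plan)) / 3
grit
```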

Why is this such a Big Data Problem?

In the example above, you end up triangulating a measure of “grit” through several columns that each provide information about the phenomenon. And that doesn’t even touch the predictors! It’s simply to operationalize a single outcome variable. It could be the case that all 17,002 columns matter for the outcomes we care about (not likely). It could be that only a few matter! How does someone go about investigating which variables to use, and finding the ones that matter?

Much later on in your data science career, you will learn about dimensionality reduction and other ways of interpreting large data. For now, this is just to get you excited about working with real data that matters, and about thinking critically about what might be important and how it ought to be used.
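As a first taste, one crude screening pass is to drop columns that cannot help any model: columns that are entirely missing or take only a single value. A sketch on a toy data frame (the values and thresholds are arbitrary illustrative choices, not the Challenge's method):

```r
# toy stand-in for a slice of the real data (invented values)
toy <- data.frame(a = c(1, 1, 1, 1),     # constant: carries no information
                  b = c(1, 2, NA, 4),    # varies: worth keeping
                  c = c(NA, NA, NA, NA)) # entirely missing: drop

frac_missing <- sapply(toy, function(col) mean(is.na(col)))
n_distinct   <- sapply(toy, function(col) length(unique(na.omit(col))))

# keep columns that are not all-NA and have more than one observed value
keep <- names(toy)[frac_missing < 1 & n_distinct > 1]
keep # only "b" survives
```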

Keep going!