Module 1: Drug Poisoning Mortality by State
Ayush Noori | EduSTEM Data Science
Welcome to Module 1 of EduSTEM Data Science! This course is designed for motivated middle school and high school students who are curious to learn about diverse contemporary challenges through the lens of data science.
Module 1 will give you a hands-on introduction to data visualization in R. We will use the popular ggplot2 package to create an animated bar plot representation of the drug poisoning epidemic over time in the United States. Please refer to the background reading of this module to learn more about unintentional drug overdose and the opioid epidemic.
This module will use data from the National Center for Health Statistics, available here.
This dataset describes drug poisoning deaths at the U.S. and state level by selected demographic characteristics and includes age-adjusted death rates for drug poisoning. Deaths are classified using the International Classification of Diseases, Tenth Revision (ICD–10).
Drug-poisoning deaths are defined as having ICD–10 underlying cause-of-death codes
Estimates are based on the National Vital Statistics System multiple cause-of-death mortality files (1). Age-adjusted death rates (deaths per 100,000 U.S. standard population for 2000) are calculated using the direct method. Populations used for computing death rates for 2011–2016 are postcensal estimates based on the 2010 U.S. census. Rates for census years are based on populations enumerated in the corresponding censuses. Rates for noncensus years before 2010 are revised using updated intercensal population estimates and may differ from rates previously published. Death rates for some states and years may be low due to a high number of unresolved pending cases or misclassification of ICD–10 codes for unintentional poisoning as R99, “Other ill-defined and unspecified causes of mortality” (2). For example, this issue is known to affect New Jersey in 2009 and West Virginia in 2005 and 2009 but also may affect other years and other states. Drug poisoning death rates may be underestimated in those instances.
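To make the direct method of age adjustment concrete, here is a minimal sketch in R. The age groups, death counts, populations, and standard weights below are invented purely for illustration; they are not NCHS figures and are not the actual 2000 U.S. standard population.

```r
# Hypothetical inputs for three age groups (illustrative numbers only)
deaths     = c(10, 40, 30)              # deaths in each age group
population = c(200000, 250000, 50000)   # people in each age group
std_weight = c(0.40, 0.45, 0.15)        # assumed standard-population shares (sum to 1)

# age-specific death rates per 100,000 people
age_rates = deaths / population * 100000

# crude rate: all deaths divided by the whole population
crude_rate = sum(deaths) / sum(population) * 100000

# direct method: weight each age-specific rate by its standard-population share
adjusted_rate = sum(age_rates * std_weight)

print(age_rates)
print(crude_rate)
print(adjusted_rate)
```

In this toy example the adjusted rate differs from the crude rate because the assumed standard population weights the age groups differently than the observed population does. Weighting every state and year by the same standard is what makes age-adjusted rates comparable.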
References:
- National Center for Health Statistics. National Vital Statistics System: Mortality data. Available from: http://www.cdc.gov/nchs/deaths.htm.
- CDC. CDC Wonder: Underlying cause of death 1999–2016. Available from: Underlying Cause of Death 1999-2020.
Background Reading
- Module 1 Background #1.pdf (287.4 KB)
- Module 1 Background #2.pdf (724.3 KB)
- Module 1 Background #3.pdf (627.5 KB)
Publication from ncbi.nlm.nih.gov
Worldwide Prevalence and Trends in Unintentional Drug Overdose: A Systematic Review of the Literature.
Setup
Load the requisite libraries.
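# load each package, installing it first if it is not already available
if (!require(data.table)) {
  install.packages("data.table")
  library(data.table)
}
if (!require(ggplot2)) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require(gganimate)) {
  install.packages("gganimate")
  library(gganimate)
}
if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}
# gifski provides the GIF renderer used in the final animate() call
if (!require(gifski)) {
  install.packages("gifski")
  library(gifski)
}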
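If you would like to check which packages are missing before installing anything, one possible approach is sketched below. The helper name missing_packages is hypothetical (it is not part of any package), and the install line is left commented out so that nothing is installed by accident.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet
missing_packages = function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
  installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed = c("data.table", "ggplot2", "gganimate", "dplyr")
to_install = missing_packages(needed)
print(to_install)

# uncomment the next line to install whatever is missing
# install.packages(to_install)
```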
Set the working directory to where you have saved the NCHS data.
dir = "<insert appropriate directory address here>"
setwd(dir)
Read and Process Data
Read the data into the RStudio workspace. Then, select the needed columns from the dataset and create a rank column to facilitate plot animation.
dat = fread("Module 1 Data.csv", header = TRUE)
# select the needed columns from the dataset
dat = dat[, c("State", "Year", "Age-adjusted Rate")]
colnames(dat) = c("State", "Year", "Rate")
# create a rank column which will allow plot animation
dat = dat %>%
  group_by(Year) %>%
  # the * 1 makes it possible to have non-integer ranks while sliding
  mutate(rank = min_rank(-Rate) * 1) %>%
  ungroup()
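To see what the rank column contains, the same pipeline can be run on a tiny made-up table. The states "A", "B", "C" and their rates below are hypothetical values chosen only to show the behavior, including how ties are handled.

```r
library(dplyr)

# hypothetical toy data: three states over two years
toy = data.frame(
  State = c("A", "B", "C", "A", "B", "C"),
  Year  = c(2000, 2000, 2000, 2001, 2001, 2001),
  Rate  = c(5.0, 9.5, 7.2, 8.1, 6.3, 8.1)
)

# rank states within each year; the highest rate gets rank 1
ranked = toy %>%
  group_by(Year) %>%
  mutate(rank = min_rank(-Rate) * 1) %>%
  ungroup()

print(ranked)
```

Within each year the state with the highest rate receives rank 1. In 2001 the two tied states both receive rank 1 and the remaining state receives rank 3, because min_rank() gives tied values the smallest rank in the tie and leaves a gap afterward.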
Build Static Plots
Build all the static plots using the popular ggplot2 package, which produces elegant, aesthetically pleasing graphics but uses very different syntax from base R graphics.
ggplot2 works with dataframes. Here, we supply the dat object as a dataframe to the ggplot() function. Aesthetic information from the source dataset, including the X and Y axes, is specified inside the aes() function.
The layers in ggplot2 are called geoms. Once the dataframe is specified and the base setup is complete, you can stack geoms on top of one another by calling their respective functions. Here, we use geom_tile() to create the bar plot and geom_text() to create the data labels along the y-axis. The documentation has an extensive list of available geoms.
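As a minimal sketch of how layers are stacked, the example below builds a small static bar chart from made-up data. It uses geom_col() for simplicity rather than the geom_tile() used in this module, and the states and rates are hypothetical.

```r
library(ggplot2)

# hypothetical toy data
toy = data.frame(State = c("A", "B", "C"), Rate = c(5.0, 9.5, 7.2))

# base setup: the dataframe plus the aesthetics (no layers yet)
base = ggplot(toy, aes(x = State, y = Rate))

# each geom added with + becomes one more layer drawn on top of the last
layered = base +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = Rate), vjust = -0.5)

print(layered)
```

Printing base on its own would show empty axes, since no geom has been added yet; the bars and the labels appear only once their layers are appended.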
Finally, the key function here is transition_states(), which stitches all the individual static plots together by year to allow us to animate the plot.
p = ggplot(dat, aes(rank, group = State,
                    fill = as.factor(State), color = as.factor(State))) +
  # each bar is a tile centered at Rate/2 with height Rate, so it spans 0 to Rate
  geom_tile(aes(y = Rate/2,
                height = Rate,
                width = 0.9), alpha = 0.8, color = NA) +
  # text labels along y-axis after coordinates are flipped (requires clip = "off" in coord_*)
  geom_text(aes(y = 0, label = paste0(State, ": ", Rate, "  ")),
            vjust = 0.2, hjust = 1) +
  coord_flip(clip = "off", expand = FALSE) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_reverse() +
  # hide the legends ("none" replaces the deprecated FALSE)
  guides(color = "none", fill = "none") +
  labs(title = '{closest_state}', x = "", y = "Drug Poisoning Mortality by State") +
  theme(plot.title = element_text(hjust = 0, size = 22),
        axis.ticks.y = element_blank(), # these relate to the axes post-flip
        axis.text.y = element_blank(), # these relate to the axes post-flip
        plot.margin = margin(1, 1, 1, 4, "cm")) +
  # the transition_states() function stitches all the individual plots together by year
  transition_states(Year, transition_length = 4, state_length = 1) +
  ease_aes('cubic-in-out')
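The y = Rate/2 and height = Rate settings in geom_tile() can look odd at first. A tile is drawn centered on its y value, so centering it at half the rate and making it as tall as the rate produces a bar that starts at zero and ends at the rate. The short sketch below works through that arithmetic with hypothetical rates.

```r
# hypothetical rates for three states
Rate = c(5.0, 9.5, 7.2)

y_center    = Rate / 2   # where the tile is centered
tile_height = Rate       # how tall the tile is

# a tile extends half its height on either side of its center
ymin = y_center - tile_height / 2
ymax = y_center + tile_height / 2

print(data.frame(Rate, ymin, ymax))
```

Every ymin is 0 and every ymax equals the rate, which is exactly the extent of an ordinary bar.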
Create Animated Plot
Finally, animate the plot! You will find the animated plot generated in the working directory.
animate(p, fps = 25, duration = 30, width = 800, height = 2000, renderer = gifski_renderer("Drug Poisoning Mortality by State.gif"))
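Rendering can take a while, so it helps to know how many frames you are asking for. The gganimate documentation describes nframes, fps, and duration as linked, so that any two of them determine the third. Assuming that relationship, the sketch below computes the frame count for the settings above and for a hypothetical lighter preview (the preview values are illustrative, not recommendations from the course).

```r
# settings used in the animate() call above
fps = 25        # frames per second
duration = 30   # seconds
n_frames = fps * duration
print(n_frames)  # 750

# hypothetical lighter settings for a quick preview render
preview_fps = 10
preview_duration = 10
preview_frames = preview_fps * preview_duration
print(preview_frames)  # 100
```

Passing smaller fps and duration values to animate() while you are still adjusting the plot should shorten each render; you can switch back to the full settings for the final GIF.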
With these courses, we hope to further our mission to make high-quality STEMX education accessible for all. For questions or support, please feel free to reach out to me at anooristudent@outlook.com.
Best Regards,
Ayush Noori
EduSTEM Boston Chapter Founder
Resources:
- R and RStudio Desktop
R is a free programming language and software environment for statistical computing, bioinformatics, and data visualization. RStudio is the associated free integrated development environment. Please download them via the instructions here to complete the course activities.
- Stack Overflow
Stack Overflow is a question and answer site for programmers, and hosts a wide variety of answers to common R questions. It is an indispensable resource for the nascent R programmer. You can also refer to the RStudio Community.