PUMS 3: Script multiyear functions
Time: the next dimension
Now that you’ve learned to calculate single-survey results from the PUMS microdata, what’s to stop you from calculating trends across multiple surveys? As it turns out, there are a few potholes in that road to avoid.
Hint #1 - Be mindful of span
This item relates to confidence levels. The Census Bureau strongly advises against comparing surveys with overlapping spans (e.g. the 2015-19 and 2016-20 5-year data): because identical observations are present in both surveys, you may underestimate change or overestimate certainty. 5-year data are best if 5-year intervals answer the need; to get annual trends, you’ll need to use the 1-year data (which involve more uncertainty). Remember to use Z-scores to determine whether trend values can be considered statistically distinct.
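As a rough illustration of the Z-score check, the sketch below compares two independent estimates using their margins of error. It assumes the MOEs are expressed at the 90% confidence level (as the Census Bureau publishes them), so the standard error is MOE / 1.645. The helper name `z_from_moe()` and the numbers are hypothetical, chosen only to show the arithmetic.

```r
# Hypothetical helper: Z-score for the difference between two independent estimates.
# Assumes each MOE is at the 90% confidence level (critical value 1.645).
z_from_moe <- function(est1, moe1, est2, moe2, crit = 1.645) {
  se1 <- moe1 / crit                      # standard error of first estimate
  se2 <- moe2 / crit                      # standard error of second estimate
  (est1 - est2) / sqrt(se1^2 + se2^2)     # difference over its combined standard error
}

# Illustrative values only, not survey results
z <- z_from_moe(est1 = 52000, moe1 = 1500, est2 = 49000, moe2 = 1800)

# At the 90% level, the estimates are treated as distinct when |Z| exceeds 1.645
is_distinct <- abs(z) > 1.645
```

This applies only to non-overlapping surveys; for overlapping spans the two estimates are not independent and the formula understates the uncertainty of the difference.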
Hint #2 - Compare data dictionaries
Although they may seem consistent at first blush, many PUMS variable codes, values, and labels have changed during the course of the program. If you plan to compare data across multiple surveys, you’ll either want to confirm the variables of interest have remained consistent, or write your code to handle the differences among them. In some cases, the way data were reported (or the way the question was asked) might preclude accurate multi-year comparisons at your desired level of detail.
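One simple way to check consistency is to compare the set of labels a variable takes in each survey year. The sketch below uses base R set operations; the label vectors are made-up placeholders, not actual PUMS data dictionary entries, and `compare_labels()` is a hypothetical helper.

```r
# Placeholder label sets for the same variable in two survey years (illustrative only)
labels_a <- c("Owned", "Rented", "Occupied without rent")
labels_b <- c("Owned", "Rented", "No rent paid")

# Hypothetical helper: report labels unique to each year and those shared by both
compare_labels <- function(a, b) {
  list(only_first  = setdiff(a, b),
       only_second = setdiff(b, a),
       shared      = intersect(a, b))
}

diffs <- compare_labels(labels_a, labels_b)
```

Any entries in `only_first` or `only_second` flag labels that would need recoding (or a coarser grouping) before the years can be compared.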
Hint #3 - Use real dollar variables
Due to inflation, the value of a dollar declines over time; to achieve a true multi-year comparison of price or income variables, these must be adjusted to real terms, i.e., expressed in the dollar value of a common reference year. This involves multiplying by a ratio (known as a ‘deflator’): the value of a price index in the reference year divided by its value in the survey year. The index is generally the Personal Consumption Expenditures (PCE) Index, as it is updated to remain valid across years. Prior to running the statistical functions, use the psrccensus function `real_dollars()` on your survey data object to create real versions of your monetary variables. This leaves the original variables intact; the real versions will be suffixed with the reference year you specify (e.g., if converting HINCP in the 2015 survey to 2020 values, the new variable will be HINCP2020). Note you will need an API key for FRED, the data service of the Federal Reserve Bank of St. Louis.
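The arithmetic behind the adjustment can be sketched in a few lines of base R. This is not the internal implementation of `real_dollars()`; the helper `to_real()` is hypothetical and the index values are round placeholders, not actual PCE figures.

```r
# Placeholder annual price index values (NOT actual PCE figures)
price_index <- c("2015" = 100, "2020" = 110)

# Hypothetical helper: convert nominal dollars from one year to reference-year dollars
to_real <- function(nominal, from_year, ref_year, index) {
  # Deflator = index in the reference year / index in the survey year
  deflator <- index[[as.character(ref_year)]] / index[[as.character(from_year)]]
  nominal * deflator
}

# Illustrative 2015 household incomes expressed in 2020 dollars
hincp_2015 <- c(40000, 75000, 120000)
hincp_2020_dollars <- to_real(hincp_2015, 2015, 2020, price_index)
```

With these placeholder values the deflator is 1.1, so each 2015 amount is scaled up by 10% to express it in 2020 dollars.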
Hint #4 - Minimize downloads
While writing multi-year functions, keep in mind that `get_psrc_pums()` downloads and combines all possible variables before returning those you requested, so it’s more efficient to group operations on the same survey (year/span) than to call `get_psrc_pums()` separately for each desired measure. We recommend combining your data retrieval, manipulation and summarization operations for a single year into a function, which you can then apply across multiple surveys. This approach requires only as many downloads as you have surveys, resulting in faster operations and lower demand on memory.
Example
```r
library(psrccensus)
library(magrittr)
library(dplyr)

# Build a single year function first
# -- it can include as many individual stat analyses as needed (see list items at end)
# -- notice `real_dollars()` creates the variable HINCP2020 later used for median statistic
pums_singleyear <- function(dyear, span=1){
  # One download per survey: retrieve every variable needed for all statistics
  hh_df <- get_psrc_pums(span, dyear, "h", c("HINCP","AGEP","HRACE","LNGI","SCHL"))
  # Add real-dollar income and two derived grouping variables
  hh_df %<>% real_dollars(2020) %>% mutate(
    ed_attain = factor(case_when(grepl("(Bach|Mast|Prof|Doct)", SCHL) ~ "Bachelor's degree or higher",
                                 !is.na(SCHL) ~ "Less than a Bachelor's degree")),
    lmtd_engl = factor(case_when(grepl("^No one", LNGI) ~ "Limited English proficiency",
                                 !is.na(LNGI) ~ "English proficient")))
  dvars <- c("HRACE","lmtd_engl","ed_attain") %>% as.list()
  # Each list item holds one statistic, calculated for every grouping variable
  singleyr_rs <- list()
  singleyr_rs[[1]] <- pums_bulk_stat(hh_df, "count", group_var_list=dvars, incl_na=FALSE)
  singleyr_rs[[2]] <- pums_bulk_stat(hh_df, "median", "HINCP2020", dvars, incl_na=FALSE)
  # singleyr_rs[[3]] <- ...
  return(singleyr_rs)
}

# Multiyear function runs the single-year function across years and combines results
# -- the result is a list with one combined table per statistic
pums_multiyear <- function(dyears){
  multiyear_rs <- lapply(dyears, pums_singleyear) %>% lapply(as.vector) %>%
    do.call(rbind, .) %>% as.data.frame() %>% lapply(data.table::rbindlist)
  return(multiyear_rs)
}

x <- pums_multiyear(2015:2019)
```
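To see the combining step in isolation, the sketch below swaps in a stand-in single-year function that needs no download and stacks the results with base R `rbind()` rather than `data.table::rbindlist()`. The function `fake_singleyear()` and its columns are hypothetical; only the shape (a list of tables per year, stacked into one table per statistic) mirrors the example above.

```r
# Hypothetical stand-in for a single-year function: returns one table per statistic
fake_singleyear <- function(dyear) {
  list(
    data.frame(year = dyear, stat = "count",  value = NA_real_),
    data.frame(year = dyear, stat = "median", value = NA_real_)
  )
}

# One list of tables per year
per_year <- lapply(2015:2017, fake_singleyear)

# For each statistic, stack that statistic's table from every year
combined <- lapply(seq_along(per_year[[1]]), function(i) {
  do.call(rbind, lapply(per_year, `[[`, i))
})
```

Here `combined[[1]]` holds the count rows for all years and `combined[[2]]` the median rows, which is the same layout the multiyear function aims to return.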