PUMS 4: Download optimization
pums-4-local-downloads.Rmd
Under the hood: sources for PUMS microdata
By default, the psrccensus::get_psrc_pums() function downloads microdata for the entire state from the Census FTP site. Because each call to get_psrc_pums() can retrieve variables from both the household and person datasets, it downloads both, which can take around a minute for 5-yr surveys, depending on the internet connection. (An earlier build pulled only the requested variables via the Census API, but the package authors found it slower than the FTP approach and subject to unexplained, almost daily downtimes after 5pm Eastern time.)
Strategies to minimize downloads
One way to minimize download time is to reduce separate calls to get_psrc_pums() by requesting all the variables you need for a given span-year-level combination in one set. This also reduces memory load. Related efficiency hints include batching multiyear analysis by year rather than by variable, and summarizing with the incl_na=FALSE option instead of creating multiple filtered objects.
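As a sketch of that pattern, the example below requests several variables in a single call and then summarizes directly. The variable choices and the psrc_pums_count() arguments shown here are illustrative assumptions; check the package documentation for exact signatures.

```r
library(psrccensus)

# One download serves multiple analyses: request every variable you need
# for the 2022 1-yr person level in a single get_psrc_pums() call
pp <- get_psrc_pums(1, 2022, "p", c("AGEP", "SEX", "RAC1P"))

# Summarize with incl_na=FALSE to drop records missing the grouping
# variable, rather than creating a separately filtered copy of the object
rs <- psrc_pums_count(pp, group_vars = "RAC1P", incl_na = FALSE)
```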
Shift to local files using dir=
You can also skip the download altogether by loading prepared household and person tables stored locally. This shortcut is activated by specifying the directory in which to find the data in the dir= argument to get_psrc_pums(). The data must be stored as gzip-compressed .rds files with a specific naming convention: concatenated data year, level, and span, e.g. "2022h1.gz" or "2017p5.gz". (For PSRC staff, these files already exist on the network; see the PSRC Data wiki entry, "Working_with_PUMS_data".)
Create the local file
library(data.table)
library(magrittr)

pums_rds <- "your/desired/storage/directory/path" # Specify your directory, or change your working directory to the storage location

make_offline_pums <- function(yr_lv_sp){
  dyear <- as.integer(stringr::str_sub(yr_lv_sp, 1L, 4L))  # e.g. "2022h1" -> 2022
  level <- stringr::str_sub(yr_lv_sp, 5L, 5L)              #               -> "h"
  span  <- as.integer(stringr::str_sub(yr_lv_sp, 6L, 6L))  #               -> 1
  filenm <- paste0(pums_rds, "/", yr_lv_sp, ".gz")
  dt <- psrccensus:::fetch_ftp(span, dyear, level)
  readr::write_rds(dt, file = filenm, compress = "gz")     # export compressed .rds
  rm(dt)
  return(NULL)
}

arg_vector <- expand.grid(dyear=c(2005:2019, 2021:2022),
                          level=c("p","h"),
                          span=c(1,5)) %>%                 # Not using (discontinued) 3-yr
  transpose() %>% lapply(paste0, collapse="") %>%
  grep("0[5-8]\\w5$", ., value=TRUE, invert=TRUE)          # First 5-yr ACS is 2005-2009, so drop earlier 5-yr combinations

lapply(arg_vector, make_offline_pums)
Utilize the local file option
Once the prepared files are in place, i.e. both the household file and the person file for a given year and span, it is straightforward to reference the directory location in the get_psrc_pums(dir=) call:
library(psrccensus)
pums_rds <- "your/desired/storage/directory/path"
my_data <- get_psrc_pums(5, 2022, "p", c("HINCP", "SOC2"), dir = pums_rds)
Use cases and considerations
Loading from these files can improve speed significantly, so you may want to consider it if you work with PUMS frequently or are building a server-hosted app that calls get_psrc_pums().
The downside is the required file management: for example, when the Census Bureau releases a new dataset, you'll need to make sure the corresponding .gz file is created before the option will work for that data year.
To increase speed for a server-hosted app, you may want to consider going a step further and carrying out all the potential statistical analyses in advance, so the app draws on stored summary results rather than calling any psrccensus functions itself. Although this involves an upfront investment of thought, time, and data processing, it will pay off in dramatically lower response times for app users.
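A minimal sketch of that precomputation pattern follows; the file paths, variable, and the psrc_pums_count() summary call are illustrative assumptions, not part of any particular app.

```r
library(psrccensus)

## Offline preprocessing script, re-run whenever new data arrive:
pp <- get_psrc_pums(5, 2022, "p", c("RAC1P"), dir = "your/storage/path")
rs <- psrc_pums_count(pp, group_vars = "RAC1P")
saveRDS(rs, "app/data/rac1p_counts_2022p5.rds")

## In the app itself (e.g. a Shiny server function): no psrccensus calls,
## just read the stored summary result
rs <- readRDS("app/data/rac1p_counts_2022p5.rds")
```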