PUMS 2: Make your own variables
pums-2-make-vars.Rmd
Defining your own variables in PUMS
Why aren’t the variables included in the PUMS dataset enough? Many PUMS categories offer more detail than is useful for an analysis, and so need to be simplified to reflect its purpose (and to increase sample sizes per category). Other cases may require defining a new variable conditioned on multiple original variables. And finally, at times the analysis may apply only to part of the population.
The srvyr object delivered by get_psrc_pums() can be altered
dplyr commands while maintaining the associated weights
and structure (see its vignette). We
recommend using the mutate()
command, in combination with
case_when()
or ifelse()
to define a variable
that directly captures the needs of your analysis, so
psrccensus can deliver your intended statistics–and
especially, the associated margin of error–rather than attempting to
re-aggregate these from summary results.
Hints and things to watch out for when recoding PUMS variables:
Categorical (i.e. grouping) variables in srvyr are of Factor datatype, and new categorical variables should also be Factor datatype. This means the new or altered variable you create should use either the
factor()
command or its quicker alternate,as.factor()
.To sort your new variable categories in a particular order, specify
factor(levels=)
; the default is alphabetical (the only option if using theas.factor()
command).To assign
NA
use a constant as the right hand side, such asNA_character_
,NA_integer_
, orNA_real_
, depending on what datatype you want. R assigns type based on the first expression, and without context interprets normalNA
as a logical value.For a catch-all (aka “else”) category, consider using the
!is.na()
criterion rather thanTRUE
. GroupingNA
with other categories may obscure non-applicable cases.For logical conditions using value labels, the
grepl()
function can be very handy, as regular expressions can match one or many labels without typing out the entire label (although you’ll want to craft your regex pattern carefully, so it matches only the labels you intend it to).For complex recoding you may find it convenient to call get_psrc_pums() with the
labels=FALSE
option, as the underlying value codes are shorter and more easily handled with rules than are descriptive labels (use the data dictionary to interpret value codes). Be aware this leaves all columns as values.If you are defining a variable using PUMS values, be aware of a trap: R stores Factors as a hidden value that maps to a set of character “levels”, which are what is displayed. While string comparisons with a Factor will use the displayed “level”, numerical comparisons with a Factor will use the hidden value, even if the “level” looks like an integer (as PUMS values do). The way to handle this is to convert Factor to character first in your
mutate()
statement, i.e.as.integer(as.character(SOME_PUMS_VAR))
, as part of your logical expression.
Handle subsets via a universal variable & the
incl_na=FALSE
option
Rather than removing observations with the dplyr::filter() command,
we recommend that you assign those cases NA
in a custom
categorical variable–that way, you can use the full survey object in
repeated analyses without needing to manage different filtered subsets.
To exclude the NA
category from a results
table–particularly useful when reporting subset shares via the
psrc_pums_count()
function–specify the
incl_na=FALSE
option in any of the statistical functions.
This effectively filters the survey prior to running the statistic,
without affecting the data object itself. The default
incl_na=TRUE
option includes NA
groups and
gives accurate shares of the full population (or households, if that is
your unit of analysis).
Examples
Adding two custom variables with one mutate command:
library(psrccensus)
library(magrittr)
library(dplyr)
pums19_5p <- get_psrc_pums(5, 2019, "p", c("AGEP","SCHL","ESR")) # Pull the data;
# rather than pipe the result, use a separate assignment
pums19_5p %<>% mutate( # so any issues with mutate() don't negate your download
ed_25up = factor( # Use Factor datatype for categorical variables
case_when(AGEP<25 ~ NA_character_, # Type-specific NA constant
grepl("^(Bach|Mast|Prof|Doct)", SCHL) ~ "Bachelor's degree or higher", # Regex is concise; handy since PUMS labels can be wordy
!is.na(SCHL) ~ "Less than a Bachelor's degree")), # !is.na() criteria
emp_25up = factor( # Mutate() can assign more than one variable
case_when(AGEP<25 ~ NA_character_, # Type-specific NA constant
grepl("at work$", ESR) ~ "Employed", # Concise regex again; use care, checking the data dictionary
!is.na(ESR) ~ as.character(ESR)), # Retain NA for children under 16
levels=c("Employed","Unemployed","Not in labor force"))) # Preferred ordering via `levels=`
emp_ed_all <- psrc_pums_count(pums19_5p, group_vars=c("emp_25up", "ed_25up")) # These shares reflect the entire population
emp_ed_25up <- psrc_pums_count(pums19_5p, group_vars=c("emp_25up", "ed_25up"), incl_na=FALSE) # No NA; same counts but shares for only age 25+, as intended
Using labels=FALSE
and as.character()
for
more a complex recode:
- Notice conversion first to character, then integer to allow numeric comparisons.
- This approach has particular advantage over label matching when using value ranges
- All variables use values (including ESR; grepl() used above only works with labels)
pvars <- c("AGEP","FOD1P","FOD2P","INDP","ESR")
ftr_int <- function(x){as.integer(as.character(x))} # Micro-helper conversion function
pums18_5p <- get_psrc_pums(5, 2018, "p", pvars, labels=FALSE) # Labels=FALSE leaves values--but they are still Factors!
pums18_5p %<>% mutate(
med_deg = factor(
case_when(between(ftr_int(FOD1P), 6100, 6199)|between(ftr_int(FOD2P), 6100, 6199) ~ "Medical Degree",
TRUE ~ "No medical degree")), # Regardless of age
med_field = factor(
case_when(ftr_int(ESR) %in% c(1,2,4,5) & between(ftr_int(INDP), 7970, 8290) ~ "Medical industry, employed",
ftr_int(ESR)==3 & between(ftr_int(INDP), 7970, 8290) ~ "Medical industry, not currently employed",
ftr_int(ESR) %in% c(1,2,4,5) & !is.na(INDP) ~ "Non-medical industry, employed",
ftr_int(ESR)==3 & !is.na(INDP) ~ "Non-medical industry, not currently employed",
TRUE ~ NA_character_))) # Leaving out those not in the workforce
med_deg_work_all <- psrc_pums_count(pums18_5p, group_vars=c("med_field","med_deg")) # These shares reflect the entire population
med_deg_work_only <- psrc_pums_count(pums18_5p, group_vars=c("med_field","med_deg"), incl_na=FALSE) # These shares limited to workforce