Retrieve and summarize data
retrieve-and-summarize-data.Rmd
Works with travelSurveyTools
The psrc.travelsurvey package is meant to complement
and enhance the travelSurveyTools package for PSRC
users. It delivers data in the hts_data
format appropriate
as input to subsequent travelSurveyTools functions. It
also provides wrapper functions that provide PSRC defaults to
travelSurveyTools functions, resulting in simpler
workflows with fewer required arguments.
psrc.travelsurvey is constrained to an extent by
fitting into the travelSurveyTools system, but if you
have suggestions for other features or functions you’d find helpful,
please contact the authors and we’ll see if we can implement them.
Data retrieval
Since psrc.travelsurvey uses Elmer, the agency’s central database, to
retrieve PSRC household travel survey data you’ll need to be connected
(in office or via VPN). Then use get_psrc_hts()
, which
takes two arguments:
-
survey_years - a vector of one or more years; the
default is all available i.e.,
c(2017,2019,2021,2023)
as of this writing. - survey_vars - a vector of variable names (regardless of the table in which they occur)
The return object is a list with separate data.table elements for
each potential unit of analysis, i.e. household (hh
),
person
, day
, trip
, and
vehicle
. The travelSurveyTools package
workflow uses this central object for related or independent variable
summaries. It can include the full set of variables for all your
summaries, so it’s unnecessary to request separate data objects to feed
each separate summary data table. It also supports multiyear summaries
(trends/comparisons) via the survey_year
field.
For efficiency, get_psrc_hts()
only requests the
variables you specify, so you must know their precise names. A lookup
function, psrc_hts_varsearch()
can assist; it takes a
single string argument as the codebook search term and matches either
the variable name or description fields. Regular expressions are
supported for more complex searches; for example, the word boundary
operator \\b
can be used to omit vehicle model from results
when searching for “mode”.
View(psrc_hts_varsearch("\\bmode\\b"))
To browse the entire variable list, open the internal data object
init_variable_list
:
View(psrc.travelsurvey:::init_variable_list)
Note: It is unnecessary to include the survey year, table ids,
weights, or survey stratification variable (sample_segment
)
in your survey_vars
argument; since they are necessary for
later steps, get_psrc_hts()
includes them implicitly.
Factor variables and levels are applied from the codebook, and missing
or skip-value codes are recoded as NA
.
Summarization
To summarize data, use psrc_hts_stat()
. This has four
primary arguments, and one option. The arguments are:
-
hts_data - the
hts_data
object - analysis_unit - string specifying the scale at which statistics are calculated, either “hh”, “person”, “day”,“trip”, or “vehicle”.
- group_vars - any grouping variables, in nesting order (share summaries are calculated per categories of the variable listed last).
- stat_var (optional) - Numeric variable if min/max/median/mean is desired; omit for count/share summaries.
As an example, we’ll retrieve the following survey variables, then do
one count/share summary, and one numeric summary. Notice,
travelSurveyTools handles the relational element of the data, so tables
lower in the organizational heirarchy (like trips
) inherit
attributes from those above them (such as hh
); this is a
handy feature and avoids unnecessary data duplication among tables.
library(psrc.travelsurvey)
library(data.table)
# Specify which variables to retrieve
vars <- c("hhsize", "employment", "commute_mode", "dest_purpose",
"mode_class_5", "distance_miles")
# Retrieve the data
hts_data <- get_psrc_hts(survey_vars = vars) # default includes all survey_years
# Calculate a numeric summary, i.e. min/max/median/mean
rs1 <- psrc_hts_stat(hts_data,
analysis_unit="trip",
group_vars="hhsize",
stat_var="distance_miles")
head(dplyr::select(rs1[survey_year==2023], -c(survey_year, min, max)))
#> hhsize count mean mean_moe median
#> <ord> <int> <num> <num> <num>
#> 1: 1 person 10081 6.977442 1.4370516 2.222023
#> 2: 2 people 12719 11.073412 3.2645435 3.135158
#> 3: 3 people 3915 7.009381 0.7154436 3.035398
#> 4: 4 people 3858 6.123352 0.5365826 3.197576
#> 5: 5 people 1152 23.016527 12.1648649 2.116390
#> 6: 6 people 404 5.836721 1.1624158 2.054875
# Calculate a categorical summary, i.e. count/share
rs2 <- psrc_hts_stat(hts_data, "trip", c("hhsize", "mode_class_5"))
head(dplyr::select(rs2[survey_year==2023 & as.character(mode_class_5)=="Transit"],
-c(survey_year, mode_class_5)))
#> hhsize count prop prop_moe est est_moe
#> <ord> <int> <num> <num> <num> <num>
#> 1: 1 person 850 0.07819693 0.014343472 144804.30 27878.08
#> 2: 2 people 553 0.02450678 0.005668314 121841.55 28456.42
#> 3: 3 people 124 0.01886768 0.006262066 55321.98 18366.04
#> 4: 4 people 80 0.01031612 0.003696247 37591.82 13442.13
#> 5: 5 people 28 0.02357459 0.008536841 35762.24 12806.51
#> 6: 6 people 14 0.05805984 0.028137362 57373.12 28499.56
Custom variables and filtering
Since each hts_data
element is a data.table, they can be
manipulated with any combination of base r, tidyverse
(e.g. mutate()
), or data.table (e.g :=
)
syntax. The example below illustrates adding variables specific to the
hts_data$trip
element. Notice, the inheritance capability
mentioned earlier applies to custom variables as well, so if you want to
recode a household-level variable to feature in a trip-level analysis,
just add it to hts_data$hh
(not to
hts_data$trip
).
Standard category groupings not present in the base data can be added
via the functions described in the
accompanying vignette. These take the entire hts_data
object as argument, not a specific table (the package knows which is
involved). This lends itself to piping multiple standard recodes.
Keep in mind that filtering any table removes records available for
subsequent analyses. Rather than manage multiple filtered copies of the
data object, you may want to create a new variable on the relevant table
with NA
values for the records you intend to exclude from
the summary. You can then use the incl_na=FALSE
option to
exclude that line from summary results (of particular value when you
want shares calculated from only reported categories). Here we calculate
mode split percentages for work trips only by adding a variable coded
NA
for all non-work-related trips, then using the
incl_na=FALSE
option to limit statistics to that
category.
library(dplyr)
library(stringr)
# Use a standard variable recode to get simpler trip purpose
hts_data <- hts_bin_dest_purpose(hts_data)
# Use dplyr::mutate to add a simplified mode field & work purpose specifier
hts_data$trip <- mutate(
hts_data$trip,
purpose_work = case_when(
dest_purpose_bin4=="Work" ~ "Work",
!is.na(dest_purpose_bin4) ~ NA_character_)
)
# Calculate a categorical summary, i.e. count/share
rs3 <- psrc_hts_stat(hts_data, "trip", c("purpose_work", "mode_class_5"), incl_na=FALSE)
head(dplyr::select(rs3[survey_year==2023],
-c(survey_year, purpose_work)))
#> mode_class_5 count prop prop_moe est est_moe
#> <ord> <int> <num> <num> <num> <num>
#> 1: Drive 3211 0.865739816 0.018028620 1835881.03 125651.228
#> 2: Transit 402 0.029027564 0.007653030 61555.62 16093.543
#> 3: Walk 728 0.079432698 0.013730924 168444.35 29070.846
#> 4: Bike 153 0.008963787 0.002837518 19008.54 5922.878
#> 5: Other 118 0.016836136 0.008717198 35702.58 18613.062
Note: travelSurveyTools summary functions require additional preparatory steps that psrc.travelsurvey handles for you, for simplicity. To use native travelSurveyTools summary functions, you’ll need to handle these extra steps yourself, since travelSurveyTools does not have a dependency on the psrc.travelsurvey package.