Reliability Measures for ACS data tables
calculate-reliability-moe-transformed-acs.Rmd
Reliability measures for ACS estimates
Background
This vignette explains how to calculate reliability measures for ACS estimates.
The American Community Survey (ACS), the household travel survey, and other population surveys are all samples of the population with usually less 10% of the population answering the survey.
Because these surveys are a sample, they have several sources of potential error such as sampling error, nonresponse error, coverage error, measurement error, and processing error.Here is a document from Census about these concepts as they apply to ACS.
The data team wanted to develop a set of rough guidelines to help team analysts to better understanding when specifically could mean that the sample has quality issues and needs to be examined with additional scrutiny.
A coefficient of variation (CV) measures the relative amount of sampling error that is associated with a sample estimate. The CV is calculated as the ratio of the SE for an estimate to the estimate itself and is usually expressed as a percent.
The internal recommendation for understanding CVs are:
CV <= 15% is good
CV >15% and <=30% is fair
CV >30% and <=50% estimate should be used with caution
CV >50% examine with great caution whether it is usable
Note that since both the ACS and household travel survey are reported using a 90 percent confidence interval, where the Margin of Error (MOE) is reported in place of standard error, you can convert it to standard error by dividing by 1.645.
These breakpoints are based on guidance from the Census Bureau and the [National Center for Transit Research] (https://www.nctr.usf.edu/wp-content/uploads/2010/04/77802.pdf). It is important that analysts consider the nuances of their datasets to understand potential sources of error in their estimates. These internal categories are part of an ongoing conversation and may change.
Reliability is provided on data tables retrieved via psrccensus
The psrccensus library contains functionality has been added to provide coefficients of variation, standard errors, and reliability labels to ACS data retrieved directly from the api. Every ACS data table output from psrccensus via the function get_acs_recs() includes fields for standard error, coefficient of variation, and reliability. But often you need to transform the variables that come from the ACS to get at what you need to summarize, and this is not directly built into psrccensus.
How to calculate reliability estimates on transformed ACS dataframes
To calculate reliability using psrccensus, on user transformed tables, you may want to apply the function reliability_calcs(). This function requires you have a dataframe, a column holding the margins of error, and a column holding the estimate. So you need to have pre-calculated the margin of error transformed, along the way to be able to retrieve the reliability of the estimate.
At each stage of calculation and aggregation, you need to re-calculate the margins of error. The R library tidycensus has built in functions for re-calculating margins of error after data transformation. We recommend that users rely on these functions to derive margins of error when data needs to be aggregated. Another option would be to code the margin of error calculations into a user defined functions.
For more background on the types of calculations performed by tidycensus for margins of error, you can also read this document from Census
Common functions from tidycensus for calculating margins of error, are the following:
moe_product:
Calculate the margin of error for a derived productmoe_prop(): Calculate the margin of error for a derived proportion
moe_ratio() Calculate the margin of error for a derived ratio
moe_sum() Calculate the margin of error for a derived sum (or difference)
Example: Housing Cost Burden
In this example, we want to retrieve the percent of households that are cost-burdened by Census tract. Cost-burdened is defined in this case as spending more than 30% of household income on housing costs. This data will be used to create a map by Census Tract to show areas of greater housing cost burden geographically.
To do the summary, first you need to aggregate the data provided by households spending over 30% based on the available fields-Less than 10.0 percent,10.0 to 14.9 percent, 15.0 to 19.9 percent, 20.0 to 24.9 percent, 25.0 to 29.9 percent. These summed items by group provide the numerator of your calculation.
Next you need to redefine the total to remove when the values are not compute- this is denominator. Then you need to calculate the shares.
For each data transformation, you need to recalculate the margin of error.
Read in Libraries
## Have you updated tidycensus since the release date of the data you'll use?
## devtools::install_github('walkerke/tidycensus')
##
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Get the data from the ACS api via psrccensus
# Finding the corresponding ACS table - this works if you know the correct concept label. If not, another option would be to visit https://data.census.gov/table and search for the right subject table and skip to the next step ----
#x <- tidycensus::load_variables(2021,"acs5") %>%
# dplyr::filter(grepl("percent",
# concept, ignore.case=TRUE) &
# (geography=="tract"))
# getting data by tract
acs_data <- get_acs_recs(geography ='tract',
table.names = 'B25070', #subject table code
years = c(2021),
acs.type = 'acs5')
## Getting data from the 2017-2021 5-year ACS
Add a grouping
# data wrangling for grouping >30% and under 30% - census tract and population
acs_data <- acs_data %>% rename_at('label', ~'rent_burden')
# Create variables
acs_data <- acs_data %>%
mutate(rent_burden_group=factor(
case_when(grepl("Less than 10.0 percent|10.0 to 14.9 percent|15.0 to 19.9 percent|
20.0 to 24.9 percent|25.0 to 29.9 percent", rent_burden) ~
"Less than 30 percent",
grepl("30.0 to 34.9 percent|35.0 to 39.9 percent|35.0 to 39.9 percent|
40.0 to 49.9 percent|50.0 percent or more", rent_burden) ~
"Cost Burdened (More than 30 percent)",
grepl("Not computed", rent_burden) ~ "Not computed",
!is.na(rent_burden) ~ "Total")))
Aggregate and calculate
The moe_sum uses tidycensus.
# filter only fields of interest - census tract and estimate
# use the moe_sum function from tidycensus
acs_data_group <- acs_data %>%
dplyr::group_by(GEOID, rent_burden_group)%>%
dplyr::summarise(estimate=sum(estimate),
moe=tidycensus::moe_sum(estimate=estimate, moe=moe))
## `summarise()` has grouped output by 'GEOID'. You can override using the
## `.groups` argument.
Data wrangling
# pivot data to calculate percentage by census tract, calculate estimates
acs_data_group<- acs_data_group%>%
pivot_wider(names_from = rent_burden_group, values_from = c(estimate,moe))%>%
rename('estimate_cost_burdened'='estimate_Cost Burdened (More than 30 percent)',
'moe_cost_burdened'= 'moe_Cost Burdened (More than 30 percent)')
# weight averages for population & calculating percentage/share of population per tract instead of integer
acs_data_group<- acs_data_group%>%
mutate(estimate_new_total=(estimate_Total-`estimate_Not computed`))%>%
mutate(estimate_perc_cost_burdened=((estimate_cost_burdened/estimate_new_total)))
Calculate margins of error
Because you are applying the function across the columns, as opposed to down the grouped rows, you need to apply the function rowwise. You would not need to do this if the data were not pivoted.
Note the moe_sum is applied although you are taking a difference. Then moe_prop is applied when you are finding the shares.
#calculate moes
acs_data_group<- acs_data_group%>%
rowwise()%>%
mutate(moe_new_total=moe_sum(estimate=c(estimate_Total, `estimate_Not computed`),
moe=c(moe_Total,`moe_Not computed`)))%>%
mutate(moe_perc_cost_burdened=moe_prop(num=estimate_cost_burdened,
denom=estimate_new_total, moe_num=moe_cost_burdened, moe_denom=moe_new_total))
acs_data_final<- acs_data_group %>%
dplyr::select(GEOID, estimate_cost_burdened, estimate_new_total, estimate_perc_cost_burdened,
moe_cost_burdened, moe_new_total, moe_perc_cost_burdened) %>%
dplyr::filter(!is.na(estimate_cost_burdened))
acs_data_final<-reliability_calcs(acs_data_final, estimate='estimate_perc_cost_burdened',
moe='moe_perc_cost_burdened')
head(acs_data_final%>%select(GEOID, estimate_perc_cost_burdened, reliability))
## # A tibble: 6 × 3
## # Rowwise: GEOID
## GEOID estimate_perc_cost_burdened reliability
## <chr> <dbl> <chr>
## 1 53033000101 0.298 fair
## 2 53033000102 0.325 fair
## 3 53033000201 0.339 fair
## 4 53033000202 0.241 use with caution
## 5 53033000300 0.313 use with caution
## 6 53033000402 0.292 fair
Now the data is ready to be joined to the tract map, if desired.