When the same code is repeated in several places, functions are a great way to reduce duplication. Even a function that is called only once can be a nice way to break up a large, complicated process.

What’s in a function?

The Formals
The Body
The Environment
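
Base R can show each of these three parts for any function written in R. A minimal sketch, using a hypothetical function add_numbers that is not part of the CHAS cleaning example:

```r
# hypothetical example function, for illustration only
add_numbers <- function(x, y = 1) {
  x + y
}

formals(add_numbers)      # the formals: the arguments x and y (y defaults to 1)
body(add_numbers)         # the body: the code inside the braces
environment(add_numbers)  # the environment: where the function was defined
```

When a function is defined at the top level of a script or in the console, its environment is the global environment.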

To define a function, here’s the basic skeleton:

my_function_name <- function() {
  
}

Let’s create a custom function!

Here’s a CHAS table. Each csv will look similar to this:

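# read_csv() comes from readr (part of the tidyverse); here() comes from the here package
library(tidyverse)
library(here)
file_01 <- read_csv(here('data', '050', 'Table9.csv'))
head(file_01, 10)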

# A tibble: 10 × 152
   source     sumlevel geoid name  st    cnty  T9_est1 T9_est2 T9_est3
   <chr>      <chr>    <chr> <chr> <chr> <chr>   <dbl>   <dbl>   <dbl>
 1 2015thru2… 050      0500… Auta… 01    001     21395   15680   12835
 2 2015thru2… 050      0500… Bald… 01    003     80930   60895   54425
 3 2015thru2… 050      0500… Barb… 01    005      9345    5690    3460
 4 2015thru2… 050      0500… Bibb… 01    007      6890    5130    4330
 5 2015thru2… 050      0500… Blou… 01    009     20845   16425   15090
 6 2015thru2… 050      0500… Bull… 01    011      3520    2505     750
 7 2015thru2… 050      0500… Butl… 01    013      6505    4550    2925
 8 2015thru2… 050      0500… Calh… 01    015     44605   31255   25110
 9 2015thru2… 050      0500… Cham… 01    017     13450    9070    5745
10 2015thru2… 050      0500… Cher… 01    019     10735    8305    7730
# ℹ 143 more variables: T9_est4 <dbl>, T9_est5 <dbl>, T9_est6 <dbl>,
#   T9_est7 <dbl>, T9_est8 <dbl>, T9_est9 <dbl>, T9_est10 <dbl>,
#   T9_est11 <dbl>, T9_est12 <dbl>, T9_est13 <dbl>, T9_est14 <dbl>,
#   T9_est15 <dbl>, T9_est16 <dbl>, T9_est17 <dbl>, T9_est18 <dbl>,
#   T9_est19 <dbl>, T9_est20 <dbl>, T9_est21 <dbl>, T9_est22 <dbl>,
#   T9_est23 <dbl>, T9_est24 <dbl>, T9_est25 <dbl>, T9_est26 <dbl>,
#   T9_est27 <dbl>, T9_est28 <dbl>, T9_est29 <dbl>, T9_est30 <dbl>, …

Suppose we’d like to clean every CHAS table in the same way. Let’s create a function that does the following:

filter for WA state and PSRC counties
pivot longer (so the columns that start with ‘T’ become rows instead of spreading across the table)
create 3 more columns that dissect the column containing the former ‘T…’ headers:
- create a ‘table’ field extracting T and the numbers before the underscore
- create a ‘type’ field to identify whether values are ‘est’ or ‘moe’
- create a ‘sort’ field extracting the numeric digits at the end

# define the skeleton of our function
# add table as a parameter
clean_table <- function(table) {
  
  # fill it in!
  
}

Fill in the body with the cleaning steps, applied to the table argument

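clean_table <- function(table) {
  table %>% 
    # st and cnty are character columns, so compare against strings
    filter(st == '53' & cnty %in% c('033', '035', '053', '061')) %>% 
    # stack every column whose name starts with T into header/value pairs
    pivot_longer(cols = str_subset(colnames(table), "^T.*"), 
                 names_to = 'header', 
                 values_to = 'value') %>% 
    # split the former column names into their parts
    mutate(table = str_extract(header, "^T\\d*(?=_)"), 
           type = str_extract(header, "(?<=_)\\w{3}"), 
           sort = str_extract(header, "\\d+$")) 
}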

# Regex used:
# table: "^T\\d*(?=_)" T and any digits at the start of the string, up to (not including) the _
# type: "(?<=_)\\w{3}" the 3 word characters right after the _
# sort: "\\d+$" the run of numeric digits at the end of the string
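
To see what each pattern picks out, here is a small sketch that runs them on a few made-up header names written in the style of the CHAS columns (the headers vector is illustrative, not read from the data):

```r
library(stringr)

# illustrative header names in the style of the CHAS columns
headers <- c("T9_est1", "T9_moe12", "T10_est3")

table_part <- str_extract(headers, "^T\\d*(?=_)")   # "T9"  "T9"  "T10"
type_part  <- str_extract(headers, "(?<=_)\\w{3}")  # "est" "moe" "est"
sort_part  <- str_extract(headers, "\\d+$")         # "1"   "12"  "3"
```

Note that str_extract() returns character values, so the sort digits come back as strings.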

An R function returns the value of the last expression it evaluates. Because our cleaning steps are chained together with the pipe (%>%), the body of our example is a single expression, and its result is what the function returns. You can also add return(<name of object>) to return a specific object explicitly.
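
A minimal sketch of the two styles, using hypothetical functions that are not part of the CHAS example:

```r
# hypothetical functions, for illustration only

# implicit return: the value of the last expression is returned
square_implicit <- function(x) {
  x^2
}

# explicit return: return() hands back the named object
square_explicit <- function(x) {
  result <- x^2
  return(result)
}

square_implicit(4)  # 16
square_explicit(4)  # 16
```

Both give the same result; the explicit form is mostly useful when a function needs to exit early or builds several intermediate objects.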

Call the function

t9 <- clean_table(file_01)

Try with other files

file_02 <- read_csv(here('data', '050', 'Table10.csv'))
file_03 <- read_csv(here('data', '050', 'Table11.csv'))

t10 <- clean_table(file_02)
t11 <- clean_table(file_03)

If we forgot a step in the cleaning process, we can always edit the function and re-run our script.

# Let's make this edit to our function that will convert the sort column from string to numeric
sort = as.numeric(str_extract(header, "\\d+$"))
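
For reference, here is a sketch of how the full function might look with that edit in place, run on a small made-up tibble. The toy object and its values are illustrative only, not real CHAS estimates, and the sketch assumes dplyr, tidyr, and stringr are installed.

```r
library(dplyr)
library(tidyr)
library(stringr)

clean_table <- function(table) {
  table %>% 
    filter(st == '53' & cnty %in% c('033', '035', '053', '061')) %>% 
    pivot_longer(cols = str_subset(colnames(table), "^T.*"), 
                 names_to = 'header', 
                 values_to = 'value') %>% 
    mutate(table = str_extract(header, "^T\\d*(?=_)"), 
           type = str_extract(header, "(?<=_)\\w{3}"), 
           # sort is now numeric rather than character
           sort = as.numeric(str_extract(header, "\\d+$")))
}

# made-up rows for illustration only (not real CHAS estimates)
toy <- tibble(
  st = c('53', '53', '01'),
  cnty = c('033', '001', '001'),
  T9_est1 = c(100, 200, 300),
  T9_moe1 = c(10, 20, 30)
)

cleaned <- clean_table(toy)
cleaned
```

Only the first toy row passes the state and county filter, so the result has one row per T column, and sort is stored as a number.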

Benefits of creating functions

Easier editing of code
Reduce redundancy
Break long processes into chunks