The data used for the code-along in the Intro to ggplot2 module was cleaned and transformed ahead of time. While we won’t spend time on the details of that process, it shouldn’t be overlooked. The code below documents how it was done.

Having tidy data is an important step when working with ggplot2. It becomes especially important if the dataset is updated regularly with new data. By cleaning and transforming first, you will save yourself many lines of code and avoid hardcoding.

The original data source can be downloaded here from the Office of Financial Management (OFM).

To run this code on your own machine, download the Excel file from OFM, install the packages loaded at the top of the script, and update my.dir to point to the folder where you saved the file.

For more information about tidy data, see the Tidying Messy Datasets section in this summary.
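
As a quick illustration of what pivoting to a tidy (long) format looks like, the sketch below melts a small made-up wide table, one column per year, into one row per jurisdiction and year. The toy.wide and toy.long objects and their values are purely illustrative and are not part of the OFM data.

library(reshape2)

# a made-up wide table: one row per jurisdiction, one column per year
toy.wide <- data.frame(Jurisdiction = c("City A", "City B"),
                       `2020` = c(100, 200),
                       `2021` = c(110, 210),
                       check.names = FALSE)

# melt (pivot longer): one row per jurisdiction-year pair,
# with the former column headers collected into a 'Year' column
toy.long <- melt(toy.wide, id.vars = "Jurisdiction",
                 variable.name = "Year", value.name = "Estimate")

The full script below applies the same idea to the OFM data.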

# This script uses base R with help from a few other packages to read in OFM April 1 Total Population 
# data and transform it so that it is ggplot2-ready.

library(openxlsx) # reads and writes Excel files
library(reshape2) # melt() pivots data from wide (cross-tab) to long
library(stringr) # tools to extract parts of strings using regular expressions
library(lubridate) # tools to convert values into various date-time formats

my.dir <- "C:/Users/clam/Documents/github/intro-ggplot2"

# read in the Excel data; startRow = 4 skips the rows above the column headers
raw <- read.xlsx(file.path(my.dir, "ofm_april1_population_final.xlsx"), startRow = 4)

# filter for PSRC region and remove Line column
region <- subset(raw, County %in% c("King", "Kitsap", "Pierce", "Snohomish"))
keep.cols <- setdiff(colnames(region), "Line")
region <- region[, keep.cols]

# OFM publishes data in a cross-tab (wider) format 
# melt table (pivot it longer) so that we have a new column called 'melted_cols'
# 'melted_cols' will contain those former column headers 
# that contain information such as year, the data attribute (population), 
# and data source type (Census/Estimate)
id.cols <- c("Filter", "County", "Jurisdiction")
df <- melt(region, id.vars = id.cols, variable.name = "melted_cols", value.name = "Estimate")

# create separate columns for data year (20##), attribute (Population), and source (Census or Estimate)
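# for example, a melted_cols value like '2020.Population.Estimate' (illustrative; the exact
# headers come from the OFM file) would yield Year_chr = '2020', Attribute = 'Population',
# and Source = 'Estimate'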
df$Year_chr <- str_extract(df$melted_cols, "^\\d+") # year as character datatype
df$Year_dt <- ymd(as.numeric(df$Year_chr), truncated = 2L) # year as a formal date, e.g. 2020 becomes 2020-01-01 (in case we need it)
df$Attribute <- str_extract(df$melted_cols, "(?:[^\\d+\\.])\\w+")
df$Source <- str_extract(df$melted_cols, "\\w+$")

# exclude melted_cols and re-order columns
cols.order <- c(id.cols, "Attribute", "Source", "Year_chr", "Year_dt", "Estimate")
df <- df[, cols.order]

# check out the structure of df
str(df)

# the Estimate column should contain numeric values, not characters!
# convert Estimate to numeric and double-check the structure
df$Estimate <- as.numeric(df$Estimate)
str(df)

# export table
write.xlsx(df, file.path(my.dir, "ofm_april1_population_final_tidied.xlsx"))