artwork by @allison_horst

artwork by @allison_horst

This module will introduce ggplot2 syntax to create bar and line graphs. Facets and extensions such as plotting using ggplotly() will also be demonstrated.

Data for the code-a-long is a munged OFM April 1 dataset which you can find here. For more information about April 1 Estimates and its original format see Office of Financial Management (OFM).

Handy keyboard shortcuts for Windows:

  • Comment a line or lines: Ctrl+Shift+C
  • Toggle cursor between script and console panes: Ctrl+1 and Ctrl+2
  • Create arrow <-: Alt+-
  • Run Line(s): Ctrl+Enter

Intro to ggplot2

You can copy and paste snippets from this outline into your script or type along. Some code may differ from what will be presented during the session.

Setup

Load the first three libraries. The fourth, extrafont, is optional depending on if you installed the package and followed additional steps here.

library(openxlsx)
library(ggplot2)
library(scales)
library(extrafont) # optional

You can replace the path below to where you’ve stored the data. Use forward slashes / or double back slashes \\

setwd("C:/Users/clam/Documents/github/intro-ggplot2/data")

df <- read.xlsx("ofm_april1_population_final_tidied.xlsx", detectDates = TRUE)

Before diving into ggplot2, get to know the data. Use unique() and str() to explore our columns and data frame.

Bar Graph

To create a simple bar graph focused on total population by county and year, we’ll use a subset of the the data. Query for only total county estimates (where column Filter equals 1)

df1 <- df[df$Filter == 1, ]

Initiate ggplot

g <- ggplot()

Geoms

Geoms are geometric objects drawn to represent data such as lines, bars, and points1. Within the geom function is a place to assign aesthetics aes(). aes() takes a handful of arguments and it will vary across geoms. Use the console to look up help on geom_col() and find its aesthetics.

# adding a geom and using aesthetics
g +
  geom_col(data = df1, 
           aes(x = Year_chr, y = Estimate, fill = County),
           color = "black",
           alpha = .5,
           position = "stack"
           )

Scales

Scales control the mapping from the values of the data space to the values of the aesthetic space2. When we map a variable, default settings are used. To change the default and customize those aesthetics, apply a scale function.

Most follow the format: scale_{aesthetic}_{method} or scale_{aesthetic}_{datatype}

# Using brewers to change the color palette. Select from a list of available palettes.
g +
  geom_col(data = df1, 
           aes(x = Year_chr, y = Estimate, fill = County),
           color = "black",
           alpha = .5,
           position = "stack"
  ) +
  scale_fill_brewer(palette = "Accent")
# Creating and using a custom color palette
my.pal <- c('#f6eff7','#bdc9e1','#67a9cf','#02818a')

g +
  geom_col(data = df1, 
           aes(x = Year_chr, y = Estimate, fill = County),
           color = "black",
           alpha = .5,
           position = "stack"
  ) +
  scale_fill_manual(values = my.pal)
# Formatting the axis scales
g +
  geom_col(data = df1, 
           aes(x = Year_chr, y = Estimate, fill = County),
           color = "black",
           alpha = .5,
           position = "stack"
  ) +
  scale_fill_manual(values = my.pal) +
  scale_y_continuous(labels = label_comma())
# Setting breaks and new labels for axis
g +
  geom_col(data = df1, 
           aes(x = Year_chr, y = Estimate, fill = County),
           color = "black",
           alpha = .5,
           position = "stack"
  ) +
  scale_fill_manual(values = my.pal) +  
  scale_y_continuous(labels = c("None", "A lot"),
                     breaks = c(0, 4e+06)) +
  scale_x_discrete(labels = c("2020" = "Worst Year Ever"))

Labs

An easy way to include plot titles, change axis labels, or legend titles is to use the labs function.

# Labels
g +
  geom_col(data = df1, 
           aes(x = Year_chr, y = Estimate, fill = County),
           color = "black",
           alpha = .5,
           position = "stack"
  ) +
  scale_fill_manual(values = my.pal) +
  scale_y_continuous(labels = c("None", "A lot"),
                     breaks = c(0, 4e+06)) +
  scale_x_discrete(labels = c("2020" = "Worst Year Ever")) +  
  labs(title = "My Plot", 
       x = "Year", 
       y = NULL, 
       caption = "footnote", 
       subtitle = "My subtitle")

Theme

The theme function controls the non-data part of your plot such as titles, labels, fonts, background, gridlines, and legends. There are a ton of keyword arguments you could use in theme(): https://ggplot2.tidyverse.org/reference/theme.html

Depending on what you’re changing, you’ll have to wrap your specs with one of the following functions:

  • element_line(): modify the line elements of the theme
  • element_text(): to modify the text elements
  • element_rect(): to modify the rectangle elements
  • element_blank(): to remove the element
# Changing the non-data part  
g +
  geom_col(data = df1, 
           aes(x = Year_chr, y = Estimate, fill = County),
           color = "black",
           alpha = .5,
           position = "stack"
  ) +
  scale_fill_manual(values = my.pal) +
  scale_y_continuous(labels = c("None", "A lot"),
                     breaks = c(0, 4e+06)) +
  scale_x_discrete(labels = c("2020" = "Worst Year Ever")) +  
  labs(title = "My Plot", 
     x = "Year", 
     y = NULL, 
     caption = "footnote", 
     subtitle = "My subtitle") +
  theme(plot.title = element_text(size = 10, face = 'bold'),
        axis.text.x = element_text(angle = 45,
                                   hjust = 1, 
                                   vjust = 1, 
                                   family = 'Comic Sans MS'),
        axis.ticks.x = element_blank(),
        axis.line = element_line(color = 'blue'),
        panel.background = element_rect(fill = 'pink'),
        legend.position = c(.5, .5),
        text = element_text(family = "Segoe UI")) 

If customizing the non-data elements is too much work, you can opt for some preset themes:

  • theme_bw()
  • theme_minimal()
  • theme_classic()
  • theme_dark()

Save the plot as a bitmap output file (e.g. .jpg)

# set your ggplot and everything that's chained to it to a variable
my.plot <- g + <other stuff>

ggsave("my_plot.jpg", plot = my.plot, width = 5, units = "in")

Or save as a PDF

ggsave("my_plot.pdf", plot = my.plot, width = 5, units = "in", device = cairo_pdf)

Other geoms

We can transform our bar graph into a line graph, but instead of Year_chr, use Year_dt. The datatype of the variable can influence what geoms can be used and how it is displayed on the axis.

Lines and Points

# Use Year_dt, a date datatype, to plot a line graph 
l <- ggplot(data = df1, aes(x = Year_dt, y = Estimate, color = County))

l +
  geom_line()

# add points to disinguish the intermediate years
l +
  geom_line() +
  geom_point()

Note that initializing ggplot this time was slightly different than before. The ggplot() function can also take data and aesthetic arguments. The difference is that arguments in ggplot() are global and will trickle down to the geoms. You won’t have to copy the same arguments for each geom.

Bar and Line

# Create a data frame of regional totals by year
# Add a new column 'Region' for labeling purposes
reg.tot <- aggregate(x = df1['Estimate'], by = df1['Year_dt'], FUN = sum)
reg.tot$Region <- "Region"

g +
  geom_col(data = df1, 
           aes(x = Year_dt, y = Estimate, fill = County), 
           position = 'dodge') +
  geom_line(data = reg.tot, 
            aes(x = Year_dt, y = Estimate))

# add the regional line to the legend
g +
  geom_col(data = df1, 
           aes(x = Year_dt, y = Estimate, fill = County), 
           position = 'dodge') +
  geom_line(data = reg.tot, 
            aes(x = Year_dt, y = Estimate, color = Region)) +
  scale_color_manual(name = NULL, 
                     values = c('Region' = 'black'))

Facets

Also known as Trellis displays, facets are an easy way to display groups of data alongside each other.

Facet Wrap

You can use facet_wrap() if you want your subplots laid out horizonally and wrapped around. Put your variable(s) within vars(). vars() like the aes() checks to see what the unique values of your variable(s) are and allows facet_ to do what it needs to do to subplot.

# facet wrap
l +
  geom_line() +
  geom_point() +
  facet_wrap(vars(County), ncol = 2) +
  theme(legend.position = "none") # the legend may be redundant, let's turn it off

*For longtime users of ggplot2, you may have used the ~ notation in facet_ functions. You’ve probably seen the ~ used in books or stack overflow answers. You can still get away with using ~ but it has been soft-deprecated and superceded with vars() like in the examples shown.

Facet Grid

Use facet_grid() when you want to split by at least two variables that generate vertical and horizontal panels.

Let’s create a facet grid where vertical panels are based on Filter values (where 1 = Total County, 2 = Unincorporated County, and 3 = Incorporated County estimates) and horizontal panels based on Counties. Query df for all records except for individual cities and towns.

df2 <- df[df$Filter != 4, ]

ggplot(df2, aes(x = Year_dt, y = Estimate)) +
  geom_col() +
  facet_grid(rows = vars(County), cols = vars(Filter))  

Factors

There are several ways of changing the labels of facet panels. One is by using a column that is a factor data type.

Factors are used to work with categorical variables, variables that have a fixed and known set of possible values3. The advantage of factors is that you can set an order to values and sort in non-alphabetical order.

Factor variables can be used for any aesthetic mapping, not just facet panels!

# Create Factor column
df2$Filter_name <- factor(df2$Filter, 
                          levels = c("1", "2", "3"), 
                          labels = c( "Total","Unincorporated","Incorporated"))

ggplot(df2, aes(x = Year_dt, y = Estimate)) +
  geom_col() +
  facet_grid(rows = vars(County), cols = vars(Filter_name))
# Change order of grid columns by changing the levels and labels
df2$Filter_name <- factor(df2$Filter, 
                          levels = c("1", "3", "2"), 
                          labels = c( "Total","Incorporated", "Unincorporated"))

ggplot(df2, aes(x = Year_dt, y = Estimate)) +
  geom_col() +
  facet_grid(rows = vars(County), cols = vars(Filter_name))

Extensions

Plotly

Plotly for R is a separate graphing library that specializes in interactive graphs. Plotly is part of the htmlwidgets family – libraries that are originally written in JavaScript then bound with R. It has its own set of graphing functions and syntax but it doesn’t mean you have to learn a completely new system inorder to have interactive graphs!

Load the plotly and htmlwidgets library

library(ggplot2)
library(plotly)
library(htmlwidgets)

Just use Plotly’s ggplotly() around your ggplot object to integrate ggplot2 with Plotly.

p <- g +
  geom_col(data = df2, 
           aes(x = Year_chr, Estimate, fill = County)
  )

p1 <- ggplotly(p)

# when saving a ggplotly or plotly object to disk, use a .html extension
saveWidget(p1, file="my_plot.html")  

Others

Sometimes ggplot2 on its own can only do so much. Where there is a gap, R users have built on top of ggplot2 and created their own libraries to enhance ggplot2 graphs either with animation, better fitted labeling, or specialty geoms to name a few. You can check out the ggplot2 extension gallery here:

https://exts.ggplot2.tidyverse.org/gallery/

Resources

Below are some my favorite resources for help and inspiration

The official ggplot2 Cheatsheet: A two pager with a decent list of geoms available and their aesthetics among other things. A good way to sample what’s in ggplot2 without being overwhelmed.

https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf

Labs by Alison Hill et al: These labs are a tutorial on data munging and visualization with ggplot2 and some of the tidyverse family. You can explore interesting datasets with guidance. Challenges, tips, and solution code are all included.

https://apreshill.github.io/data-vis-labs-2018/

R Graphics Cookbook by Winston Chang: Any questions on how to do something in ggplot2 can be answered in this book. The text is easy to understand and the combination of explanation, visuals and code make it a book (or e-book) worth having close by.

https://r-graphics.org/

Datacamp courses: Put your ggplot2 knowledge to practice with ggplot2 related courses. You’ll go more in-depth on the ggplot2 components and learn how to munge data as well.

https://learn.datacamp.com/courses/introduction-to-data-visualization-with-ggplot2

Homework

Now that you have a framework for how ggplot2 works, let’s practice with homework. The homework is a mix of guided steps and challenges that might not have been covered in the module.

If you are stumped by a question or task, give yourself 10 minutes to experiment and google for a solution, but after that, please ask for help. We are here for you!

allison horst image
artwork by @allison_horst

  1. R Graphics Cookbook, pg 409↩︎

  2. R Graphics Cookbook, pg 409↩︎

  3. R for Data Science, https://r4ds.had.co.nz/factors.html↩︎