The solutions below will use base R syntax as learned in modules 1-3.
First, let’s load our libraries:
library(ggplot2)
library(openxlsx)
library(reshape2)
library(scales)
Given the following graph…
title <- 'Fuel Economy of Popular Cars'
legend.title <- 'Type of Car'
p <- ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point()
The dataset mpg is built into the ggplot2 package. Base R and many of the packages we install come with datasets. As with functions, you can type ?mpg in the console to read its documentation. Built-in datasets such as mpg allow us to create reproducible examples: code that anyone can run on their computer as long as they have R and that particular package.
To see the built-in datasets available in the packages you have loaded, type data() into your console. Load some additional libraries and run data() again.
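As a small sketch of the same idea, data() can also be pointed at a single package through its package argument. The variable name ggplot2_datasets is only illustrative, and the example assumes ggplot2 is installed.

```r
# data() returns an object whose `results` matrix has one row per dataset;
# the "Item" column holds the dataset names
ggplot2_datasets <- data(package = "ggplot2")$results[, "Item"]

# Print the names of the datasets that ship with ggplot2
print(ggplot2_datasets)

# Check whether mpg is among them
"mpg" %in% ggplot2_datasets
```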
a. scale_color_discrete(name = legend.title)
b. theme(legend.title = element_text(legend.title))
c. theme(legend.text = element_text(title = legend.title))
d. labs(color = legend.title)
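We can change the legend title with a and d. theme() can change the appearance (e.g. font, text size, text face, position) of an existing legend title, but we can’t create or rename it with the theme() layer. In b, the string is passed as the first argument of element_text(), which is the font family, and in c, element_text() has no title argument.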
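The split between naming and styling can be sketched as follows: labs() supplies the legend title text, and theme() then adjusts how that existing title looks. The object names p_named and p_styled and the particular styling values are illustrative choices, not part of the exercise.

```r
library(ggplot2)

legend.title <- 'Type of Car'
p <- ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point()

# labs() sets the text of the legend title
p_named <- p + labs(color = legend.title)

# theme() styles the title that already exists; it does not set its text
p_styled <- p_named +
  theme(legend.title = element_text(face = 'bold', size = 10),
        legend.position = 'bottom')

p_styled
```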
a. labs(title = title)
b. ggtitle(title)
c. annotate('text', label = title, x = min(mpg$displ) + 3.5, y = max(mpg$hwy), size = 4)
d. theme(plot.title = element_text(title))
Options a, b, and c are valid. Again, theme() can change the appearance of an existing title but can’t generate one.
labs(), ggtitle(), and annotate()
ggtitle('My title', 'My subtitle') is the equivalent of using labs(title = 'My title', subtitle = 'My subtitle').
p + ggtitle(title, 'Engine displacement (L) by highway miles per gallon')
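One way to check this equivalence yourself is to compare the label objects that the two helpers create. This is a minimal sketch; same_labels, p_ggtitle, and p_labs are illustrative names.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point()

title <- 'Fuel Economy of Popular Cars'
subtitle <- 'Engine displacement (L) by highway miles per gallon'

# Two spellings of the same request
p_ggtitle <- p + ggtitle(title, subtitle)
p_labs <- p + labs(title = title, subtitle = subtitle)

# Compare the label objects that the two helpers create
same_labels <- identical(ggtitle(title, subtitle),
                         labs(title = title, subtitle = subtitle))
same_labels
```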
With ggtitle() and labs(), you can adjust the title’s position with theme(), but the plot will still reserve blank space above the plotting region for the title.
# Moving your title inside with ggtitle() or labs()
p +
  ggtitle(title) +
  theme(plot.title = element_text(vjust = -8, hjust = .8))
# Moving your title inside with annotation
p +
  annotate('text', label = title, x = min(mpg$displ) + 3.5, y = max(mpg$hwy), size = 4.5)
It appears that cereals with more sugar receive a lower rating. Perhaps a particular demographic was rating these cereals or maybe these ratings were derived from consumer reports…
# read in excel sheet
my.dir <- 'C:/Users/clam/Documents/github/intro-ggplot2/data'
df <- read.xlsx(file.path(my.dir, 'cereal.xlsx'))
# calculate sugars per oz in new column
df$sugars_per_oz <- df$sugars/df$weight
# plot it
ggplot(df,
       aes(x = sugars_per_oz,
           y = rating,
           color = mfr,
           shape = type)) +
  geom_point() +
  scale_shape_manual(values = c(3, 15))
Using graphs to explore this dataset, it looks like negative values exist! The values in the sugars_per_oz column range from -1 to 15.
range(df$sugars_per_oz)
## [1] -1 15
Quaker Oatmeal has -1 sugars/oz?!
df[df$sugars_per_oz == -1, ]
## name mfr type calories protein fat sodium fiber carbo sugars
## 58 Quaker Oatmeal Q H 100 5 2 0 2.7 -1 -1
## potass vitamins shelf weight cups rating sugars_per_oz
## 58 110 0 1 1 0.67 50.82839 -1
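A negative sugar content is not physically possible, so one plausible reading is that -1 is a placeholder for a missing value. That is an assumption; nothing in the data shown here confirms it. Under that assumption, a minimal sketch of recoding the placeholder to NA before computing the ratio looks like this. The data frame cereal is a toy subset built from three rows printed in this section, not the full spreadsheet.

```r
# Toy subset; the values are copied from the output in this section
cereal <- data.frame(
  name   = c("Quaker Oatmeal", "Golden Crisp", "Puffed Rice"),
  sugars = c(-1, 15, 0),
  weight = c(1, 1, 0.5)
)

# ASSUMPTION: negative sugars are a missing-value code, so recode them to NA
cereal$sugars[cereal$sugars < 0] <- NA

# Recompute sugars per ounce; the NA carries through the division
cereal$sugars_per_oz <- cereal$sugars / cereal$weight

# Ignore the missing value when summarising
range(cereal$sugars_per_oz, na.rm = TRUE)
```

Once a column contains NA, a comparison such as cereal$sugars_per_oz == 15 also returns NA for that row, so wrapping the condition in which() keeps the subset clean.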
Golden Crisp and Smacks have the most sugars/oz.
df[df$sugars_per_oz == 15, ]
## name mfr type calories protein fat sodium fiber carbo sugars potass
## 31 Golden Crisp P C 100 2 0 45 0 11 15 40
## 67 Smacks K C 110 2 1 70 1 9 15 40
## vitamins shelf weight cups rating sugars_per_oz
## 31 25 1 1 0.88 35.25244 15
## 67 25 2 1 0.75 31.23005 15
The cereals with no sugars all have the words ‘Wheat’, ‘Rice’, or ‘Fiber’ in their names.
df[df$sugars_per_oz == 0, ]
## name mfr type calories protein fat sodium fiber carbo
## 4 All-Bran with Extra Fiber K C 50 4 0 140 14 8
## 21 Cream of Wheat (Quick) N H 100 3 0 80 1 21
## 55 Puffed Rice Q C 50 1 0 0 0 13
## 56 Puffed Wheat Q C 50 2 0 0 1 10
## 64 Shredded Wheat N C 80 2 0 0 3 16
## 65 Shredded Wheat 'n'Bran N C 90 3 0 0 4 19
## 66 Shredded Wheat spoon size N C 90 3 0 0 3 20
## sugars potass vitamins shelf weight cups rating sugars_per_oz
## 4 0 330 25 3 1.00 0.50 93.70491 0
## 21 0 -1 0 2 1.00 1.00 64.53382 0
## 55 0 15 0 3 0.50 1.00 60.75611 0
## 56 0 50 0 3 0.50 1.00 63.00565 0
## 64 0 95 0 1 0.83 1.00 68.23588 0
## 65 0 140 0 1 1.00 0.67 74.47295 0
## 66 0 120 0 1 1.00 0.67 72.80179 0
We can find the highest rated cereal (All-Bran with Extra Fiber) on the third shelf.
# create a new data frame with the highest rated cereal
df2 <- df[df$rating == max(df$rating), ]
ggplot(df,
       aes(x = sugars_per_oz,
           y = rating,
           color = mfr,
           shape = type)) +
  geom_point() +
  scale_shape_manual(values = c(3, 15),
                     labels = c("Cold", "Hot")) +
  facet_wrap(vars(shelf)) +
  labs(x = 'Sugars(g) per Ounce',
       y = 'Rating',
       color = 'Manufacturer',
       shape = 'Type') +
  scale_color_discrete(labels = c("American Home\nFood Products",
                                  "General Mills",
                                  "Kelloggs",
                                  "Nabisco",
                                  "Post",
                                  "Quaker Oats",
                                  "Ralston Purina")) +
  geom_text(data = df2,
            aes(x = sugars_per_oz + 1,
                y = rating,
                label = name),
            size = 2,
            hjust = 0) +
  theme(legend.text = element_text(size = 6))
# read excel sheet
my.dir <- 'C:/Users/clam/Documents/github/intro-ggplot2/data'
df <- read.xlsx(file.path(my.dir, "ofm_april1_population_final_tidied.xlsx"), detectDates = TRUE)
# subset the data for only cities & towns for years 2010 and 2020
df1 <- df[df$Filter == 4 & df$Year_chr %in% c(2010, 2020), ]
# cast the data so that each observation has a 2010 and 2020 estimate
df2 <- dcast(df1, County + Jurisdiction ~ paste0("Year_", Year_chr), value.var = "Estimate")
# calculate the difference
df2$diff <- df2$Year_2020 - df2$Year_2010
# sort the data based on the difference column in descending order
# and take only the top 10 observations
df3 <- head(df2[order(-df2$diff), ], 10)
df3
## County Jurisdiction Year_2010 Year_2020 diff
## 33 King Seattle 608660 761100 152440
## 19 King Kirkland 48787 90660 41873
## 18 King Kent 92411 130500 38089
## 4 King Bellevue 122363 148100 25737
## 31 King Sammamish 45780 65100 19320
## 7 King Burien 33313 52300 18987
## 29 King Redmond 54144 69900 15756
## 65 Pierce Tacoma 198397 213300 14903
## 30 King Renton 90927 105500 14573
## 6 King Bothell (part) 17090 29730 12640
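The long-to-wide cast used above can be sketched on a toy table. The estimates below are copied from the Seattle and Kirkland rows of the output. The column name Year_col and the data frame names long and wide are illustrative. The sketch builds the new column names in a separate step before calling dcast() rather than inside the formula, which is only a stylistic variation.

```r
library(reshape2)

# Toy long-format data: one row per jurisdiction and year
long <- data.frame(
  County       = "King",
  Jurisdiction = rep(c("Seattle", "Kirkland"), each = 2),
  Year         = rep(c(2010, 2020), times = 2),
  Estimate     = c(608660, 761100, 48787, 90660)
)

# Build the future column names first, then cast from long to wide
long$Year_col <- paste0("Year_", long$Year)
wide <- dcast(long, County + Jurisdiction ~ Year_col, value.var = "Estimate")

# Each jurisdiction now has both estimates on one row
wide$diff <- wide$Year_2020 - wide$Year_2010
wide
```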
At this point, if we graphed our data, the cities and towns would be in alphabetical order.
# create our graph and the x-axis is arranged in alphabetical order
ggplot() +
  geom_col(data = df3, aes(x = Jurisdiction, y = diff, fill = County)) +
  scale_y_continuous(labels = label_comma()) +
  theme(axis.text.x = element_text(angle = 45,
                                   hjust = 1,
                                   vjust = 1),
        plot.title = element_text(size = 13, face = 'bold')) +
  labs(title = 'Top 10 Cities and Towns',
       subtitle = 'With the greatest nominal growth between 2010 and 2020',
       caption = 'OFM April 1, version Sept 2020',
       x = NULL,
       y = 'Persons')
Using as.factor() to convert Jurisdiction into a factor and then reorder() to reorder its levels by another column sorts the bars by the values of the difference column instead.
# convert Jurisdiction column into a factor and reorder it based on the difference column
df3$Jurisdiction <- as.factor(df3$Jurisdiction)
df3$Jurisdiction <- reorder(df3$Jurisdiction, df3$diff)
# print the plot again
ggplot() +
  geom_col(data = df3, aes(x = Jurisdiction, y = diff, fill = County)) +
  scale_y_continuous(labels = label_comma()) +
  theme(axis.text.x = element_text(angle = 45,
                                   hjust = 1,
                                   vjust = 1),
        plot.title = element_text(size = 13, face = 'bold')) +
  labs(title = 'Top 10 Cities and Towns',
       subtitle = 'With the greatest nominal growth between 2010 and 2020',
       caption = 'OFM April 1, version Sept 2020',
       x = NULL,
       y = 'Persons')
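To see what reorder() does to the factor levels, here is a minimal sketch on three rows copied from the top-10 table. The names growth, alphabetical, ascending, and descending are illustrative.

```r
# Values copied from the top-10 table above
growth <- data.frame(
  Jurisdiction = c("Bellevue", "Kirkland", "Tacoma"),
  diff         = c(25737, 41873, 14903)
)

# A plain factor sorts its levels alphabetically
alphabetical <- levels(factor(growth$Jurisdiction))

# reorder() sorts the levels by a second variable, smallest first
ascending <- levels(reorder(growth$Jurisdiction, growth$diff))

# Negating the variable puts the largest value first
descending <- levels(reorder(growth$Jurisdiction, -growth$diff))

print(alphabetical)
print(ascending)
print(descending)
```

Because ggplot2 draws a discrete axis in level order, reorder(df3$Jurisdiction, df3$diff) arranges the bars from smallest to largest growth; using -df3$diff instead would place Seattle first.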