Back to class overview
R script from this class

Motivation

The authors of the data.table package are Matt Dowle and Arun Srinivasan. There are at least five good reasons to learn the package:

  1. Concise syntax
  2. Operations are insanely fast and
  3. Memory efficient
  4. Package is feature rich and
  5. Dependency free

Regarding efficiency, it has been shown that data.table can be 75x faster than dplyr. Here is a comparison with other systems. We will talk about some of the reasons later in the class.

It is not the purpose of this class to convince you that data.table is superior to dplyr or vice versa. It is to show you another powerful tool that you can use to tackle big or small data problems efficiently in R. One can use the two packages to complement each other. If you end up liking the dplyr syntax more but want to efficiency advantage, consider using the dtplyr package. It uses dplyr syntax while working with data.table objects.

Creating data.table object

Let’s start our session by loading the library. Uncomment the second line and set the working directory where your data for this session is located (don’t forget to use forward slashes, e.g. “C:/Users/hana/Documents”):

library(data.table)
#setwd("/Users/hana/psrc/rworkshops2020/data.table")

To use data.table features, we need to create a data.table object, which is an enhanced version of data.frame. There are several ways to do it:

  1. Create it from scratch:

    d <- data.table(x = 1:10, y = letters[1:10])
    d
    ##      x y
    ##  1:  1 a
    ##  2:  2 b
    ##  3:  3 c
    ##  4:  4 d
    ##  5:  5 e
    ##  6:  6 f
    ##  7:  7 g
    ##  8:  8 h
    ##  9:  9 i
    ## 10: 10 j
  2. Convert existing data.frame:

    df <- data.frame(x = 1:10, y = letters[1:10])
    d <- as.data.table(df)
    d
    ##      x y
    ##  1:  1 a
    ##  2:  2 b
    ##  3:  3 c
    ##  4:  4 d
    ##  5:  5 e
    ##  6:  6 f
    ##  7:  7 g
    ##  8:  8 h
    ##  9:  9 i
    ## 10: 10 j

    One can do such conversion also on a tibble object and thus, it allows to easily switch between dplyr and data.table. To switch back from data.table to dplyr, use the as_tibble() function.

  3. Coercion by reference, i.e. no need to reassign an object:

    dft <- data.frame(x = 1:10, y = letters[1:10])
    class(dft)
    ## [1] "data.frame"
    setDT(dft)
    class(dft)
    ## [1] "data.table" "data.frame"

    The class of dft changed without a re-assignment, thus saving memory. To reverse the operation to make dft being data.frame again, use the setDF() function.

  4. Reading from file (extremely fast!):

    dt <- fread("ofm_april1_population_final.csv")
    class(dt)
    ## [1] "data.table" "data.frame"

    For reading from Excel spreadsheets, one can use the read.xlsx() function from the openxlsx library (as shown in the Rmarkdown class) and then convert to data.table via as.data.table() as shown above. However, for big datasets the fread() function is superior to anything else in R. Reading in the PSRC’s parcels dataset (1.3 million rows, 285M) is 13x faster than with read.csv().

    The equivalent to fread() for writing data into a file is fwrite(). It’s is also amazingly fast.

    Now let’s explore the dataset structure.

    str(dt)
    View(dt)

About data.table syntax

The picture at the top of the page shows the basic syntax of all data.tables. While tidyverse tends to break up operations step-by-step using verbs for each step, data.table aims to do everything in one concise expression.

Working with rows: DT[i,]

Selecting rows

When subsetting by rows, things work similary like in data.frame, except the syntax is simpler. As an example, select rows from dt that are in King county and are not cities:

dt[County == "King" & Filter < 4]

Or select all cities that have the word “Wood” in their name:

dt[grepl("Wood", Jurisdiction)]

Indices work too. Also notice that the comma is optional if no j and by is used.

dt[1:3] # is equivalent to
dt[1:3,]

For further processing, we select rows that correspond to cities in the PSRC’s four counties:

dt4c <- dt[Filter == 4 & County %chin% c("King", "Kitsap", "Pierce", "Snohomish")] 
dim(dt4c)
## [1] 87 14

The %chin% operator is like %in% but optimized for speed and is for character vectors only.

Ordering rows

The i dimension is also used for putting rows into specific order. Here, ordering by Jurisdiction:

dt4c[order(Jurisdiction)]
##     Line Filter    County      Jurisdiction 2010.Population.Census
##  1:  143      4      King            Algona                   3014
##  2:  325      4 Snohomish         Arlington                  17926
##  3:  144      4      King     Auburn (part)                  62761
##  4:  274      4    Pierce     Auburn (part)                   7419
##  5:  186      4    Kitsap Bainbridge Island                  23025
## ---                                                               
## 83:  296      4    Pierce  University Place                  31144
## 84:  297      4    Pierce          Wilkeson                    477
## 85:  180      4      King       Woodinville                  10938
## 86:  344      4 Snohomish           Woodway                   1307
## 87:  181      4      King      Yarrow Point                   1001
## 9 variables not shown: [2011.Population.Estimate, 2012.Population.Estimate, 2013.Population.Estimate, 2014.Population.Estimate, 2015.Population.Estimate, 2016.Population.Estimate, 2017.Population.Estimate, 2018.Population.Estimate, 2019.Population.Estimate]

Ordering by reference (in place) implements optimized ordering:

setorder(dt4c, County, Jurisdiction)

What does “by reference” mean?

Base R assigns objects by value, i.e. when an object is assigned, a copy of the original object is created, i.e. the value occupies a different address in the memory. Python, C, Java and other programming languages work this way.

Assignment by reference (or “in place”) differs as it does not create a separate copy; it rather references the original object, i.e. it points to the same address in the memory. In addition to data.table, C++ or Python pandas support this as well. It is well suited for working with big data as it is more memory efficient, but one has to be careful not to overwrite the original object.

Working with columns: DT[,j]

Renaming columns

The column names of our dataset are not very pretty, so let’s change it.

colnames(dt4c)
##  [1] "Line"                     "Filter"                  
##  [3] "County"                   "Jurisdiction"            
##  [5] "2010.Population.Census"   "2011.Population.Estimate"
##  [7] "2012.Population.Estimate" "2013.Population.Estimate"
##  [9] "2014.Population.Estimate" "2015.Population.Estimate"
## [11] "2016.Population.Estimate" "2017.Population.Estimate"
## [13] "2018.Population.Estimate" "2019.Population.Estimate"

There are at least two ways to rename columns:

  1. By reference: (we replace Jurisdiction by City)

    setnames(dt4c, "Jurisdiction", "City")
    Note that both, the second and third arguments can include more than one name, if renaming of multiple columns is desired.
  2. Using colnames() as with data.frame: (we extract the first four characters from the time columns)

    colnames(dt4c)[5:ncol(dt4c)] <- substr(colnames(dt4c)[5:ncol(dt4c)], 1, 4)
    colnames(dt4c)
    ##  [1] "Line"   "Filter" "County" "City"   "2010"   "2011"   "2012"   "2013"  
    ##  [9] "2014"   "2015"   "2016"   "2017"   "2018"   "2019"

Selecting columns

Columns can be selected in various ways. The most common case is probably a selection by column names:

dt4c[, .(County, City)]
##        County               City
##  1:      King             Algona
##  2:      King      Auburn (part)
##  3:      King Beaux Arts Village
##  4:      King           Bellevue
##  5:      King      Black Diamond
## ---                             
## 83: Snohomish           Mukilteo
## 84: Snohomish          Snohomish
## 85: Snohomish           Stanwood
## 86: Snohomish             Sultan
## 87: Snohomish            Woodway

Selection by an index still works:

dt4c[1:10, 1:5]
##     Line Filter County               City   2010
##  1:  143      4   King             Algona   3014
##  2:  144      4   King      Auburn (part)  62761
##  3:  145      4   King Beaux Arts Village    299
##  4:  146      4   King           Bellevue 122363
##  5:  147      4   King      Black Diamond   4153
##  6:  148      4   King     Bothell (part)  17090
##  7:  149      4   King             Burien  33313
##  8:  150      4   King          Carnation   1786
##  9:  151      4   King         Clyde Hill   2984
## 10:  152      4   King          Covington  17575

Names can also be used in the same way as indices. Use backquotes if the column name starts with a number:

dt4c[1:5, Filter:City]
##    Filter County               City
## 1:      4   King             Algona
## 2:      4   King      Auburn (part)
## 3:      4   King Beaux Arts Village
## 4:      4   King           Bellevue
## 5:      4   King      Black Diamond
dt4c[1:5, Filter:`2014`]
##    Filter County               City   2010   2011   2012   2013   2014
## 1:      4   King             Algona   3014   3055   3070   3075   3090
## 2:      4   King      Auburn (part)  62761  63050  63390  64320  65350
## 3:      4   King Beaux Arts Village    299    300    300    290    295
## 4:      4   King           Bellevue 122363 123400 124600 132100 134400
## 5:      4   King      Black Diamond   4153   4160   4170   4170   4180

The .() is a synonym for list():

dt4c[1:5, list(County, City)]
##    County               City
## 1:   King             Algona
## 2:   King      Auburn (part)
## 3:   King Beaux Arts Village
## 4:   King           Bellevue
## 5:   King      Black Diamond

Or use the same way as in data.frame, using the c() function:

dt4c[1:5, c("County", "City")]
##    County               City
## 1:   King             Algona
## 2:   King      Auburn (part)
## 3:   King Beaux Arts Village
## 4:   King           Bellevue
## 5:   King      Black Diamond

However, if you assign column names to an object, say cols, then dt4c[, cols] will not work. Instead use

cols <- c("County", "City")
dt4c[, ..cols] # or
dt4c[, cols, with = FALSE]

Negation can be used to exclude columns. Here, select all columns except the ones given:

dt4c <- dt4c[, !c("Line", "Filter")]
dim(dt4c)
## [1] 87 12

Assigning values to columns

The := operator

Assignments to columns is done via := which assigns by reference:

dt4c[, id := 1:nrow(dt4c)]
dt4c[, id := id + 5000]

Be aware of the implications of assigning by reference. Here is an example:

d                 # d is our toy data.table
##      x y
##  1:  1 a
##  2:  2 b
##  3:  3 c
##  4:  4 d
##  5:  5 e
##  6:  6 f
##  7:  7 g
##  8:  8 h
##  9:  9 i
## 10: 10 j
a <- d            # assign d to a
a[ , x := NULL][] # delete column in a
##     y
##  1: a
##  2: b
##  3: c
##  4: d
##  5: e
##  6: f
##  7: g
##  8: h
##  9: i
## 10: j
d                 # d changed as well!
##     y
##  1: a
##  2: b
##  3: c
##  4: d
##  5: e
##  6: f
##  7: g
##  8: h
##  9: i
## 10: j

Here is the right way to do it using the copy() function:

d <- data.table(x = 1:10, y = letters[1:10]) # re-create d
a <- copy(d)      # use the copy() function to create a separate object
a[ , x := NULL][] # delete column in a
##     y
##  1: a
##  2: b
##  3: c
##  4: d
##  5: e
##  6: f
##  7: g
##  8: h
##  9: i
## 10: j
d                 # d is not impacted
##      x y
##  1:  1 a
##  2:  2 b
##  3:  3 c
##  4:  4 d
##  5:  5 e
##  6:  6 f
##  7:  7 g
##  8:  8 h
##  9:  9 i
## 10: 10 j

Note that in the above example I used [] at the end of lines to print out the values of the object. They are not needed tfor the actual assignment.

Partial assignment

You do not have to assign values to all rows - a partial assignment can be done with the help of the i dimension. Say we want to increase the id column by 1000 but only for King county:

dt4c[County == "King", id := id + 1000]
dt4c[1:4, .(County, City, id)]
##    County               City   id
## 1:   King             Algona 6001
## 2:   King      Auburn (part) 6002
## 3:   King Beaux Arts Village 6003
## 4:   King           Bellevue 6004
dt4c[50:53, .(County, City, id)]
##    County            City   id
## 1: Pierce        Edgewood 5050
## 2: Pierce Enumclaw (part) 5051
## 3: Pierce            Fife 5052
## 4: Pierce        Fircrest 5053

If the column does not exist, a partial assignment fills in NAs:

d[x < 5, z := TRUE][]
##      x y    z
##  1:  1 a TRUE
##  2:  2 b TRUE
##  3:  3 c TRUE
##  4:  4 d TRUE
##  5:  5 e   NA
##  6:  6 f   NA
##  7:  7 g   NA
##  8:  8 h   NA
##  9:  9 i   NA
## 10: 10 j   NA

Let’s now delete the id column:

dt4c[, id := NULL]

Multiple things at once

One can assign multiple columns in one command. Here we add two columns (u and v) and delete one column (z) in one go:

d[ , ':='(u = x^2, v = x^3, z = NULL)]

We can also chain multiple operations into one line. Here we first add a column w, then we add a column dummy via a partial assignment, then we order the rows:

d[, w := -4:5][w > 0, dummy := TRUE][order(x, decreasing = TRUE)]
##      x y   u    v  w dummy
##  1: 10 j 100 1000  5  TRUE
##  2:  9 i  81  729  4  TRUE
##  3:  8 h  64  512  3  TRUE
##  4:  7 g  49  343  2  TRUE
##  5:  6 f  36  216  1  TRUE
##  6:  5 e  25  125  0    NA
##  7:  4 d  16   64 -1    NA
##  8:  3 c   9   27 -2    NA
##  9:  2 b   4    8 -3    NA
## 10:  1 a   1    1 -4    NA

Note that the first two operations are by reference, i.e. the d object is changed. The last operation does not change the object, unless we would re-assign it to d.

To assign multiple categories to one variable, there is a new function called fcase() which is inspired by CASE WHEN in SQL, and it is optimized for speed. For example.

d[, category := fcase(w > 0, "positive", 
                      w %between% c(-2, 0), "slightly negative",
                      w < -2, "negative")]

Similarly to %chin%, the %between% operator is data.table specific.

Finally, an assignment can be also used to rename columns:

d[, dummy_new := dummy][, dummy := NULL][]
##      x y   u    v  w          category dummy_new
##  1:  1 a   1    1 -4          negative        NA
##  2:  2 b   4    8 -3          negative        NA
##  3:  3 c   9   27 -2 slightly negative        NA
##  4:  4 d  16   64 -1 slightly negative        NA
##  5:  5 e  25  125  0 slightly negative        NA
##  6:  6 f  36  216  1          positive      TRUE
##  7:  7 g  49  343  2          positive      TRUE
##  8:  8 h  64  512  3          positive      TRUE
##  9:  9 i  81  729  4          positive      TRUE
## 10: 10 j 100 1000  5          positive      TRUE

This way of renaming columns changes the order of the columns. You can use the function setcolorder() to reorder columns in place.

Subsetting with .SD

.SD (Subsetting Data) is a very powerful operator in data.table. It allows to perform a given function on any subset of columns.

To demonstrate it, as part of our cleaning of dt4c we will convert all time-related columns into the numeric type. We could do it column by column as

dt4c[, `2010` := as.numeric(`2010`)]

That would be a little tedious to do it for many columns. Thus, we use the .SD symbol to apply the as.numeric function to the 2010-2019 columns at once:

cols <- as.character(2010:2019) # column names to apply the conversion to
dt4c[, (cols) := lapply(.SD, as.numeric), .SDcols = cols] 
str(dt4c)
## Classes 'data.table' and 'data.frame':   87 obs. of  12 variables:
##  $ County: chr  "King" "King" "King" "King" ...
##  $ City  : chr  "Algona" "Auburn (part)" "Beaux Arts Village" "Bellevue" ...
##  $ 2010  : num  3014 62761 299 122363 4153 ...
##  $ 2011  : num  3055 63050 300 123400 4160 ...
##  $ 2012  : num  3070 63390 300 124600 4170 ...
##  $ 2013  : num  3075 64320 290 132100 4170 ...
##  $ 2014  : num  3090 65350 295 134400 4180 ...
##  $ 2015  : num  3105 65950 300 135000 4200 ...
##  $ 2016  : num  3175 67340 300 139400 4305 ...
##  $ 2017  : num  3180 69060 300 140700 4335 ...
##  $ 2018  : num  3180 70650 300 142400 4360 ...
##  $ 2019  : num  3190 71740 300 145300 4525 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "index")= int(0) 
##   ..- attr(*, "__County")= int(0)

The lapply() function is a standard base R function that applies the given function (here as.numeric) to a list or a dataset (given by the first argument), element by element, or columnn by column. In the data.table context, the .SD symbol is replaced by a dataset composed by the columns given by .SDcols. Thus, the function as.numeric is applied to all columns defined in the cols object, and the result is then assigned to columns given by left side of :=, which here have the same names.

The .SD symbol is often used in aggreagations, and we will see an example of it in the next section.

Group by: DT[,,by]

Simple aggregations

The most simple aggregations are done by applying an aggregating functions to columns. For example,

dt4c[, sum(`2019`)]
## [1] 2991715
dt4c[, mean(`2019`)]
## [1] 34387.53

For counts use the special symbol .N:

dt4c[, .N]
## [1] 87

Group by

Things get more interesting when we aggregate by groups. Here summing by a column, by a conditional expression and by multiple expressions. If you want to name the aggregated column, use .() on the j dimension:

dt4c[, sum(`2019`), by = County]
##       County      V1
## 1:      King 1978025
## 2:    Kitsap   92170
## 3:    Pierce  468300
## 4: Snohomish  453220
dt4c[, .(Population = sum(`2019`)), by = County]
##       County Population
## 1:      King    1978025
## 2:    Kitsap      92170
## 3:    Pierce     468300
## 4: Snohomish     453220
dt4c[, .(Population = sum(`2019`)), by = `2019` > 100000]
##    `2019` > 1e+05 Population
## 1:          FALSE    1541415
## 2:           TRUE    1450300
dt4c[, .(Population = sum(`2019`), Count = .N), by = `2019` > 100000]
##    `2019` > 1e+05 Population Count
## 1:          FALSE    1541415    81
## 2:           TRUE    1450300     6
dt4c[, .(Population = sum(`2019`), Count = .N), by = .(County, `2019` > 100000)]
##       County `2019` > 1e+05 Population Count
## 1:      King          FALSE     850925    35
## 2:      King           TRUE    1127100     4
## 3:    Kitsap          FALSE      92170     4
## 4:    Pierce          FALSE     256900    23
## 5:    Pierce           TRUE     211400     1
## 6: Snohomish          FALSE     341420    19
## 7: Snohomish           TRUE     111800     1

The above commands do not change the dt4c dataset. To add an aggregated column to the dataset, use the := symbol:

dt4c[, county_sum := sum(`2019`), by = County]

Now one can use the aggregated column to compute county shares. If it was done correctly, the sum of shares by county should yield 1.

dt4c[, county_share := round(`2019`/county_sum*100, 2)]
dt4c[, sum(county_share), by = County] 
##       County     V1
## 1:      King 100.02
## 2:    Kitsap  99.99
## 3:    Pierce 100.00
## 4: Snohomish  99.99

Using .SD

The power of the .SD feature shows especially in aggregations. It allows to easily aggregate for multiple columns. Here we sum all columns from 2010 to 2019 (alternatively, we could use our object cols created earlier):

dt4c[, lapply(.SD, sum), by = County, .SDcols = `2010`:`2019`]
##       County    2010    2011    2012    2013    2014    2015    2016    2017
## 1:      King 1606249 1657335 1701280 1728800 1765200 1799520 1859180 1906640
## 2:    Kitsap   81111   82505   83880   83495   84465   86260   88280   89080
## 3:    Pierce  428487  430040  432245  436005  439330  444070  452230  458920
## 4: Snohomish  411043  412565  414455  418000  420665  427340  433865  439600
## 2 variables not shown: [2018, 2019]

Joining datasets

To demonstrate how to join datasets, let’s read in the dataset of counties provided with the class:

counties <- fread("counties.csv")

We now merge this dataset with our dt4c. The joining columns are County for dt4c and county_name for dataset counties. There are two ways of doing the join.

The merge function

The first one is using the merge() function:

dt4cm <- merge(dt4c, counties, by.x = "County", by.y = "county_name")

Note that if the joining columns would be the same, just use the argument by.

Join via the i dimension and setting keys

The second way of doing the join is using the first (i) dimension. For this example we first create another county dataset by aggregating dt4c and renaming the County column to be county_name, chaining the operations one after another:

popcty <- dt4c[ , .(Pop = sum(`2019`)), by = County][, county_name := County][, County := NULL]

The left and right joins can be done via

popcty[counties, , on = .(county_name)] # left join
##        Pop county_name county_id   acres
## 1: 1978025        King        33 1403205
## 2:   92170      Kitsap        35  246076
## 3:  468300      Pierce        53  675448
## 4:  453220   Snohomish        61 1240445
counties[popcty, , on = .(county_name)] # right join
##    county_id   acres county_name     Pop
## 1:        33 1403205        King 1978025
## 2:        35  246076      Kitsap   92170
## 3:        53  675448      Pierce  468300
## 4:        61 1240445   Snohomish  453220

Setting a key can significantly decrease the runtime. For example, merging the PSRC’s parcels with the buildings dataset was 16x faster than when merging using base R.

setkey(popcty, "county_name")
setkey(counties, "county_name")

When a key is set, the on argument can be omitted:

counties[popcty]
##    county_id   acres county_name     Pop
## 1:        33 1403205        King 1978025
## 2:        35  246076      Kitsap   92170
## 3:        53  675448      Pierce  468300
## 4:        61 1240445   Snohomish  453220

To explicitely use columns from the left dataset, use the prefix i.. Below, the Pop column is from the dataset popcty, therefore it is used as i.Pop, while the column acres is from counties and therefore does not have any prefix.

counties[popcty, .(county_name, density = i.Pop/acres)]
##    county_name   density
## 1:        King 1.4096479
## 2:      Kitsap 0.3745591
## 3:      Pierce 0.6933176
## 4:   Snohomish 0.3653689

We now add the density column to the counties dataset:

counties[popcty, density := i.Pop/acres][]
##    county_id   acres county_name   density
## 1:        33 1403205        King 1.4096479
## 2:        35  246076      Kitsap 0.3745591
## 3:        53  675448      Pierce 0.6933176
## 4:        61 1240445   Snohomish 0.3653689

Notice that by default we are doing an outer join:

popcty[county_name != "King"][counties]
##       Pop county_name county_id   acres   density
## 1:     NA        King        33 1403205 1.4096479
## 2:  92170      Kitsap        35  246076 0.3745591
## 3: 468300      Pierce        53  675448 0.6933176
## 4: 453220   Snohomish        61 1240445 0.3653689

To do an inner join, use

popcty[county_name != "King"][counties, nomatch = NULL]
##       Pop county_name county_id   acres   density
## 1:  92170      Kitsap        35  246076 0.3745591
## 2: 468300      Pierce        53  675448 0.6933176
## 3: 453220   Snohomish        61 1240445 0.3653689

One can also do a join that would give records not in common:

counties[!popcty[county_name != "King"]] 
##    county_id   acres county_name  density
## 1:        33 1403205        King 1.409648

Reshaping data

The format of the data we’ve been working so far is often called wide. It is a typical data format of demographic datasets.

head(dt4c)
##    County               City   2010   2011   2012   2013   2014   2015   2016
## 1:   King             Algona   3014   3055   3070   3075   3090   3105   3175
## 2:   King      Auburn (part)  62761  63050  63390  64320  65350  65950  67340
## 3:   King Beaux Arts Village    299    300    300    290    295    300    300
## 4:   King           Bellevue 122363 123400 124600 132100 134400 135000 139400
## 5:   King      Black Diamond   4153   4160   4170   4170   4180   4200   4305
## 6:   King     Bothell (part)  17090  17150  17280  17440  24610  25410  26590
## 5 variables not shown: [2017, 2018, 2019, county_sum, county_share]

Let’s remove columns that we will not need anymore.

dt4c[, ':='(county_sum = NULL, county_share = NULL)]

Wide to long

Various packages, including ggplot2 require the data to be in the so called long format (or tidy) where each value (here population) is its own record, i.e. each value occupies one row. To convert from wide to long, use the melt() function. In our example, we need to specify that County and City are non-population columns:

dtl <- melt(dt4c, id.vars = c("County", "City"))
head(dtl)
##    County               City variable  value
## 1:   King             Algona     2010   3014
## 2:   King      Auburn (part)     2010  62761
## 3:   King Beaux Arts Village     2010    299
## 4:   King           Bellevue     2010 122363
## 5:   King      Black Diamond     2010   4153
## 6:   King     Bothell (part)     2010  17090

To name the variable and value columns explicitely, use the following arguments:

dtl <- melt(dt4c, id.vars = c("County", "City"), variable.name = "Year", value.name = "Population")
head(dtl)
##    County               City Year Population
## 1:   King             Algona 2010       3014
## 2:   King      Auburn (part) 2010      62761
## 3:   King Beaux Arts Village 2010        299
## 4:   King           Bellevue 2010     122363
## 5:   King      Black Diamond 2010       4153
## 6:   King     Bothell (part) 2010      17090

Now this dataset can be plugged directly into ggplot:

library(ggplot2)
g <- ggplot(dtl[County == "Pierce"]) + 
      geom_col(aes(x = City, y = Population, fill = as.factor(Year)),
               position = "dodge") + coord_flip() + labs(fill = "Year")
print(g)

Long to wide

If you have a dataset in the long format and need to convert it to the wide format, for example for reporting purposes, use the dcast() function. Its second argument is a formula (LHS ~ RHS) that determines which columns stay (RHS) and which columns are cast (LHS). ... represents all remaining variables:

dtw <- dcast(dtl, County + City ~ Year, value.var = "Population")
dtw <- dcast(dtl, ... ~ Year, value.var = "Population")
head(dtw)
##    County               City   2010   2011   2012   2013   2014   2015   2016
## 1:   King             Algona   3014   3055   3070   3075   3090   3105   3175
## 2:   King      Auburn (part)  62761  63050  63390  64320  65350  65950  67340
## 3:   King Beaux Arts Village    299    300    300    290    295    300    300
## 4:   King           Bellevue 122363 123400 124600 132100 134400 135000 139400
## 5:   King      Black Diamond   4153   4160   4170   4170   4180   4200   4305
## 6:   King     Bothell (part)  17090  17150  17280  17440  24610  25410  26590
## 3 variables not shown: [2017, 2018, 2019]

Exporting dataset

As mentioned previously, the counterpart of the reading function fread() is the function fwrite(). It is the most efficient way to export data. Let’s write out our dtw dataset:

fwrite(dtw, "my_ofm.csv")

See ?fwrite for more options.