R Fundamentals I

Data structures

Learning Objectives

  • To be aware of the different types of data
  • To be aware of the different basic data structures commonly encountered in R
  • To be able to ask questions from R about the type, class, and structure of an object.

Data Types

Before we can analyse any data, we’ll need to have a strong understanding of the basic data types and data structures. It is Very Important to understand because these are the things you will manipulate on a day-to-day basis in R, and are the source of most frustration encountered by beginners.

R has 5 basic atomic types (meaning they can’t be broken down into anything smaller):

  • logical (e.g., TRUE, FALSE)
  • numeric
    • integer (e.g, 2L, as.integer(3))
    • double (i.e. decimal) (e.g, -24.57, 2.0, pi)
  • complex (i.e. complex numbers) (e.g, 1 + 0i, 1 + 4i)
  • text (called “character” in R) (e.g, "a", "swc", 'This is a cat')

There are a few functions we can use to interrogate data in R to determine its type:

Data Structures

There are five data structures you will commonly encounter in R. These are:

  • vector
  • matrix
  • factor
  • list
  • data.frame

For now, let’s focus on vectors in more detail, to discover more about data types.

Vectors

A vector is the most common and basic data structure in R and is pretty much the workhorse of R. They are sometimes referred to as atomic vectors, because importantly, they can only contain one data type. They are the building blocks of every other data structure.

Create a vector of empty strings of length 10

 [1] "" "" "" "" "" "" "" "" "" ""

Initialize a vector with the same value

 [1] 0 0 0 0 0 0 0 0 0 0

Or we can use the c function to combine any values we like into a vector (so long as they’re the same atomic type!).

[1] 10 12 45 33

You can also create vectors as sequence of numbers

 [1]  1  2  3  4  5  6  7  8  9 10
 [1]  1  2  3  4  5  6  7  8  9 10
 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
[91] 10.0

You can also use the concatenate function to add elements to a vector:

[1] 10 12 45 33 57
[1] 10 12 45 33 57 10 12 45 33

Create a logical vector

[1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
[1] TRUE
[1] FALSE

Character vectors can be created using the paste function:

[1] "1,6"  "2,7"  "3,8"  "4,9"  "5,10"
[1] "1,6 ; 2,7 ; 3,8 ; 4,9 ; 5,10"
[1] "Section 1" "Section 2" "Section 3"
[1] "Section1" "Section2" "Section3"
[1] "Section1" "Section2" "Section3"

Challenge 1

Vectors can only contain one atomic type. If you try to combine different types, R will create a vector that is the least common denominator: the type that is easiest to coerce to.

Guess what the following do without running them first:

This is called implicit coercion.

The coercion rule goes logical -> integer -> numeric -> complex -> character.

You can also coerce vectors explicitly using the as.<class_name>. Example

R will try to do whatever makes the most sense for that value:

[1] "10" "12" "45" "33" "57"
[1] 10+0i 12+0i 45+0i 33+0i 57+0i
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

This is behaviour you will find in many programming languages. 0 is FALSE, while every other number is treated as TRUE. Sometimes coercions, especially nonsensical ones won’t work.

In some cases, R won’t be able to do anything sensible:

Warning: NAs introduced by coercion
[1] NA NA NA

A vector of “NAs” was returned, and so was a warning.

You can ask questions about the structure of vectors:

[1]  9 10
[1] 0
[1] 11
 int [1:11] 0 1 2 3 4 5 6 7 8 9 ...

Vectors can be named:

a b c d 
1 2 3 4 

Matrices

Another data structure you’ll likely encounter are matrices. Underneath the hood, they are really just atomic vectors, with added dimension attributes.

We can create one with the matrix function. Let’s generate some random data:

           [,1]       [,2]      [,3]       [,4]       [,5]        [,6]
[1,] -0.6264538  1.5952808 0.4874291 -0.3053884 -0.6212406 -0.04493361
[2,]  0.1836433  0.3295078 0.7383247  1.5117812 -2.2146999 -0.01619026
[3,] -0.8356286 -0.8204684 0.5757814  0.3898432  1.1249309  0.94383621
 num [1:3, 1:6] -0.626 0.184 -0.836 1.595 0.33 ...
[1] 3 6
[1] 3
[1] 6

You can use rownames, colnames, and dimnames to set or retrieve the column and rownames of a matrix. The function length will tell you the number of elements.

Challenge 2

What do you think will be the result of length(x)? Try it. Were you right? Why / why not?

Factors

Factors are special vectors that represent categorical data. Factors can be ordered or unordered and are important for modeling functions such as aov(), lm() and glm() and also in plot methods.

We can create factors using the factor function:

[1] yes no  no  yes yes
Levels: no yes

So we can see that the output is very similar to a character vector, but with an attached levels component. This becomes clearer when we look at its structure:

 Factor w/ 2 levels "no","yes": 2 1 1 2 2

This reveals something important: while factors look (and often behave) like character vectors, they are actually integers under the hood, and here, we can see that “no” is represented by a 1, and “yes” a 2.

Lists

If you want to combine different types of data, you will need to use lists. Lists act as containers, and can contain any type of data structure, even themselves!

Lists can be created using list or coerced from other objects using as.list():

[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i

Each element of the list is denoted by a [[ in the output. Inside each list element is an atomic vector of length one containing the given values.

Lists can contain more complex objects:

$a
[1] "Research Bazaar"

$b
 [1]  1  2  3  4  5  6  7  8  9 10

$data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

In this case our list contains a character vector of length one, a numeric vector with 10 entries, and a small data frame from one of R’s many preloaded datasets (see ?data). We’ve also given each list element a name, which is why you see $a instead of [[1]].

Add an element to list:

$a
[1] "Research Bazaar"

$b
 [1]  1  2  3  4  5  6  7  8  9 10

$data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

$c
[1]  TRUE FALSE  TRUE FALSE  TRUE FALSE

Lists are extremely useful inside functions. You can “staple” together lots of different kinds of results into a single object that a function can return. In fact many R functions which return complex output store their results in a list.

Explore objects in a list:

[1] "a"    "b"    "data" "c"   
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Challenge solutions

Solution to challenge 1

Vectors can only contain one atomic type. If you try to combine different types, R will create a vector that is the least common denominator: the type that is easiest to coerce to.

[1] "1.7" "a"  
[1] "character"
[1] 1 2
[1] "double"
[1] "a"    "TRUE"
[1] "character"

Solution to challenge 2

What do you think will be the result of length(x)?

[1] 18

Because a matrix is really just a vector with added dimension attributes, length gives you the total number of elements in the matrix.