4 Data structures in R

Data in R

  • Until now, you’ve created fairly simple data in R and stored it as a vector.

  • However, most (if not all) of you will have much more complicated datasets from your various experiments and surveys that go well beyond what a vector can handle.

  • In previous lectures we went through the four main data types (i.e., vector types) in R: logical, integer, double, and character.

  • Now let’s have a look at some of the main structures that we have for storing these data.

Data Structures

  • R has many data structures. Some of the important ones are:

1. Atomic vectors
2. Matrices
3. Arrays
4. Factors
5. Lists
6. Data frames
7. Tibbles

1 Atomic Vectors

  • Perhaps the simplest type of data structure is the vector
  • You’ve already been introduced to vectors
  • Vectors that have a single value (length 1) are called scalars
  • The key thing to remember is that all the elements inside a vector must be of the same data type
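
As a small sketch of the same-type rule (the variable names below are illustrative and not part of the lecture), combining values of different types with c() makes R coerce them all to the most flexible type involved:

```r
# Mixing types inside c() triggers implicit coercion along
# logical -> integer -> double -> character
v_num <- c(1L, 2.5)        # integer and double
v_chr <- c(1, "a", TRUE)   # double, character and logical
v_lgl <- c(TRUE, 2L)       # logical and integer

typeof(v_num)
#> [1] "double"
typeof(v_chr)
#> [1] "character"
typeof(v_lgl)
#> [1] "integer"
```

Every element of v_chr is now a string, which is why a vector cannot hold, say, a name and an age together.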

2 Matrices

  • When a rectangular data structure contains a single type of data in all its cells (i.e., in all its rows and columns), we have a matrix of data.

  • In R, a matrix really is an atomic vector that is tweaked into another shape (i.e., a re-shaped vector).

  • Internally, this is implemented by taking a vector and adding attributes that describe its shape and the names of its rows or columns


  • The R function matrix() is used to create a matrix from an atomic vector.
# creating matrix using matrix()
A = matrix(
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  nrow = 3,
  ncol = 3,
  byrow = TRUE
)
A
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6
#> [3,]    7    8    9

matrix(1:6, nrow = 2) # default: byrow = FALSE
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
# Creating matrix using rbind() or cbind()
rbind(1:3, 11:13, 33:35)
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]   11   12   13
#> [3,]   33   34   35
cbind(letters[1:2], c("k", "m"), letters[18:19])
#>      [,1] [,2] [,3]
#> [1,] "a"  "k"  "r" 
#> [2,] "b"  "m"  "s"

# Creating matrix by adding dim attribute
z <- 1:6
dim(z) <- c(2, 3)
z
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6

3 Arrays

  • Arrays generalize matrices to more than two dimensions
# Creating array using array()
A = array(c(1, 2, 3, 4, 5, 6, 7, 8), dim = c(2, 2, 2))
A
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    5    7
#> [2,]    6    8

# Creating array by adding dim attribute
z <- 1:18
dim(z) <- c(2, 3, 3)
z
#> , , 1
#> 
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3]
#> [1,]    7    9   11
#> [2,]    8   10   12
#> 
#> , , 3
#> 
#>      [,1] [,2] [,3]
#> [1,]   13   15   17
#> [2,]   14   16   18

Summary:

  • Like vectors and matrices, arrays must contain elements that are all of the same data type.

  • Data structures like matrices and arrays are built on top of atomic vectors by adding attributes

  • In other words, matrices and arrays are just atomic vectors with a dim (dimension) attribute
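
The summary can be checked with a short sketch (the object names m, v and a are chosen here for illustration): removing the dim attribute from a matrix gives back the plain vector it was built from.

```r
m <- matrix(1:6, nrow = 2)
attributes(m)          # the only attribute is dim
#> $dim
#> [1] 2 3
is.atomic(m)           # a matrix is still an atomic vector
#> [1] TRUE

v <- m
dim(v) <- NULL         # drop the dim attribute
v
#> [1] 1 2 3 4 5 6
is.matrix(v)
#> [1] FALSE

# An array is the same idea with a longer dim attribute
a <- array(1:24, dim = c(2, 3, 4))
dim(a)
#> [1] 2 3 4
length(a)              # still 24 elements stored in one vector
#> [1] 24
```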

Calculations on matrices

  • Sometimes it’s also useful to define row and column names for your matrix
my_mat <- matrix(1:16, nrow = 4, byrow = TRUE)
rownames(my_mat) <- c("A", "B", "C", "D")
colnames(my_mat) <- c("a", "b", "c", "d")
my_mat
#>    a  b  c  d
#> A  1  2  3  4
#> B  5  6  7  8
#> C  9 10 11 12
#> D 13 14 15 16

  • The usual matrix addition, multiplication, etc. can be performed. Note the use of the %*% operator to perform matrix multiplication; the * operator multiplies element by element.
mat1 <- matrix(c(2, 0, 1, 1), nrow = 2)
mat1
#>      [,1] [,2]
#> [1,]    2    1
#> [2,]    0    1
mat2 <- matrix(c(1, 1, 0, 2), nrow = 2)
mat2
#>      [,1] [,2]
#> [1,]    1    0
#> [2,]    1    2
mat1 + mat2  # matrix addition
#>      [,1] [,2]
#> [1,]    3    1
#> [2,]    1    3
# element by element products
mat1 * mat2
#>      [,1] [,2]
#> [1,]    2    0
#> [2,]    0    2
# matrix multiplication
mat1 %*% mat2
#>      [,1] [,2]
#> [1,]    3    2
#> [2,]    1    2
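
The difference between the two products also shows up in the dimension rules. In the sketch below (P and Q are made-up matrices, not from the lecture), %*% needs the number of columns of the first matrix to equal the number of rows of the second, whereas * needs both matrices to have the same shape:

```r
P <- matrix(1:6, nrow = 2)   # 2 x 3
Q <- matrix(1:6, nrow = 3)   # 3 x 2

P %*% Q                      # (2 x 3) times (3 x 2) gives 2 x 2
#>      [,1] [,2]
#> [1,]   22   49
#> [2,]   28   64
dim(Q %*% P)                 # (3 x 2) times (2 x 3) gives 3 x 3
#> [1] 3 3

# Element-wise multiplication fails because the shapes differ;
# try() is used here only so that the script keeps running
res <- try(P * Q, silent = TRUE)
inherits(res, "try-error")
#> [1] TRUE
```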

Basic matrix functions

  • R has numerous built-in functions to perform matrix operations

  • For example, to transpose a matrix we use the transposition function t()

my_mat_t <- t(my_mat)
my_mat_t
#>   A B  C  D
#> a 1 5  9 13
#> b 2 6 10 14
#> c 3 7 11 15
#> d 4 8 12 16
  • To extract the diagonal elements of a matrix and store them as a vector we can use the diag() function.
my_mat_diag <- diag(my_mat)
my_mat_diag
#> [1]  1  6 11 16

Function      Description
chol(x)       Cholesky decomposition of x
t(x)          Transpose of a matrix x
diag(x)       Extracts the diagonal elements of a matrix x
ncol(x)       Returns the number of columns of x
nrow(x)       Returns the number of rows of x
colSums(x)    Returns the sum of each column of x
rowSums(x)    Returns the sum of each row of x
solve(A, b)   Solves the system Ax = b for x
solve(x)      Calculates the inverse of a square matrix x
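
A brief sketch of solve() and chol() from the table, using a small symmetric matrix M and right-hand side rhs that are invented here purely for illustration:

```r
M   <- matrix(c(2, 1, 1, 3), nrow = 2)   # hypothetical 2 x 2 matrix
rhs <- c(3, 5)                           # hypothetical right-hand side

x_hat <- solve(M, rhs)    # solves M x = rhs directly
M_inv <- solve(M)         # inverse of M

# The inverse times M gives the identity (up to rounding error)
all.equal(M %*% M_inv, diag(2))
#> [1] TRUE
# Both routes give the same solution
all.equal(x_hat, as.vector(M_inv %*% rhs))
#> [1] TRUE

# chol() returns an upper triangular R with t(R) %*% R equal to M
R <- chol(M)
all.equal(t(R) %*% R, M)
#> [1] TRUE
```

When only the solution of a linear system is needed, calling solve(M, rhs) is generally preferable to forming the inverse first.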

Exercise 4

  1. Create a vector called x consisting of the first fifteen integers of the number line.

  2. Use the function dim() to assign dimension to vector x with three rows and five columns. What is the class of x now?

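  3. Given the following matrix A and vector b,

     $$A = \begin{bmatrix} 2 & 9 & 0 & 0 \\ 0 & 4 & 1 & 4 \\ 7 & 5 & 5 & 1 \\ 7 & 8 & 7 & 4 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 6 \\ 0 \\ 9 \end{bmatrix}$$

     a. Calculate $A^{T}b$.
     b. Find the inverse of matrix A.
     c. Solve the equation $Ax = b$ for x.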

  4. Generate a vector x0 of length 20 with all elements equal to 1

  5. Generate a vector x1 of length 20 with elements randomly selected from 30:70, using 80 as the seed

  6. Create a matrix X with x0 as the first column and x1 as the second column

  7. Generate a vector Y of length 20 using the equation $y_i = 1.2 + 1.8\,x_{1i} + \epsilon_i$, where $x_{1i}$ is the i-th element of x1 and $\epsilon_i \sim N(0, 9)$

  8. Obtain the value of $(X^{T}X)^{-1}X^{T}Y$; use the R function solve() to obtain the inverse of a square matrix.

S3 Atomic Vectors:

  • Remember: matrices and arrays are just atomic vectors that are reshaped

  • In addition to these regular atomic vectors, there are some S3 atomic vectors

  • One of the most important vector attributes is class, which underlies the S3 object system

  • A class attribute turns an object into an S3 object, which means it will behave differently from a regular vector when passed to a generic function

  • Every S3 object is built on top of a base type, and often stores additional information in other attributes


  • Some important S3 vectors used in base R are

    • Categorical data are stored in factor vectors

    • Dates are stored in Date vectors

    • Date-times are stored in POSIXct and POSIXlt vectors

  • Among these, we will discuss only the factor vector.
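
Although only factors are covered in detail, a quick sketch with a Date (the object name d is illustrative) shows what "built on top of a base type" means in practice: the value is stored as a double counting days since 1970-01-01, and the class attribute changes how it is printed.

```r
d <- as.Date("1970-01-10")
d                  # printed as a date because of its class
#> [1] "1970-01-10"
typeof(d)          # the underlying base type
#> [1] "double"
attributes(d)
#> $class
#> [1] "Date"
unclass(d)         # number of days since 1970-01-01
#> [1] 9
```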


4 Factors

  • Factors are used to store categorical information in R. A categorical variable can take only a fixed set of pre-defined values called levels, e.g., gender recorded with the two levels male and female

  • Factors are similar to character data, except that they can take only predefined values

  • Factors are built on top of an integer vector with two attributes:

    • a class, “factor”, which makes it behave differently from regular integer vectors, and

    • levels, which defines the set of allowed values

Factors look like strings, but behave like integers


  • The function factor() is used to create a factor vector from an atomic vector. Its main arguments are:

    • x: a data vector

    • levels: the values of x that will be used as the levels of the factor

    • labels: a vector of labels for the levels

# Creating factor using factor()
fac = factor(x = c(1, 2, 1, 1, 2, 1, 2), 
              levels = c(1, 2),
              labels = c("Male", "Female"))
fac
#> [1] Male   Female Male   Male   Female Male   Female
#> Levels: Male Female
as.character(fac)
#> [1] "Male"   "Female" "Male"   "Male"   "Female" "Male"   "Female"

# not providing levels
fac1 = factor(c("Male", "Female", "Male",
               "Male", "Female", "Male", "Female"))
fac1
#> [1] Male   Female Male   Male   Female Male   Female
#> Levels: Female Male
# providing levels 
fac2 = factor(c("Male", "Female", "Male",
               "Male", "Female", "Male", "Female"), 
             levels = c("Male", "Female"))
fac2
#> [1] Male   Female Male   Male   Female Male   Female
#> Levels: Male Female
typeof(fac1)
#> [1] "integer"
attributes(fac1)
#> $levels
#> [1] "Female" "Male"  
#> 
#> $class
#> [1] "factor"
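
The remark that factors look like strings but behave like integers can be illustrated with a short sketch (sizes, sizes2 and nums are made-up objects used only for this illustration):

```r
sizes <- factor(c("small", "large", "small", "medium"),
                levels = c("small", "medium", "large"))
as.integer(sizes)          # the integer codes stored underneath
#> [1] 1 3 1 2

# A value outside the allowed levels becomes NA (with a warning)
sizes2 <- sizes
suppressWarnings(sizes2[2] <- "huge")
is.na(sizes2[2])
#> [1] TRUE

# Common pitfall: converting a factor of number-like strings
nums <- factor(c("10", "20", "10"))
as.integer(nums)                    # codes, not the original numbers
#> [1] 1 2 1
as.numeric(as.character(nums))      # go through character first
#> [1] 10 20 10
```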

5 Lists

  • A list is a vector with heterogeneous elements, i.e., each element of a list can be of any type

  • The function list() is used to create a list

list1 <- list(1:3, 
              "a", 
              c(TRUE, FALSE, FALSE), 
              c(2.5, 5.1, 9))
list1
#> [[1]]
#> [1] 1 2 3
#> 
#> [[2]]
#> [1] "a"
#> 
#> [[3]]
#> [1]  TRUE FALSE FALSE
#> 
#> [[4]]
#> [1] 2.5 5.1 9.0

typeof(list1)
#> [1] "list"
is.list(list1)
#> [1] TRUE
length(list1)
#> [1] 4
str(list1)
#> List of 4
#>  $ : int [1:3] 1 2 3
#>  $ : chr "a"
#>  $ : logi [1:3] TRUE FALSE FALSE
#>  $ : num [1:3] 2.5 5.1 9

  • Lists are sometimes called recursive vectors because a list can contain other lists. This makes them fundamentally different from atomic vectors
l3 <- list(list(list("First list")))
str(l3)
#> List of 1
#>  $ :List of 1
#>   ..$ :List of 1
#>   .. ..$ : chr "First list"
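
To contrast lists with atomic vectors, here is a minimal sketch (mixed and flat are illustrative names): list() keeps each element's own type and treats a nested list as a single element, whereas c() on the same values coerces everything to one type.

```r
mixed <- list(1:3, "a", list(TRUE, 2.5))
sapply(mixed, typeof)      # each element keeps its own type
#> [1] "integer"   "character" "list"
length(mixed)              # the nested list counts as one element
#> [1] 3

flat <- c(1:3, "a", TRUE)  # an atomic vector instead
typeof(flat)
#> [1] "character"
length(flat)
#> [1] 5
```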

S3 Lists:

  • Recall: Data structures like matrices, arrays, or factors are built on top of atomic vectors by adding attributes.

  • Similarly, in addition to the regular lists, there are two important S3 vectors built on top of lists

    • They are data frames and tibbles.

6 Data Frames

  • A data frame is a named list of vectors with attributes for (column) names, row.names, and its class, “data.frame”

  • In contrast to a regular list, a data frame has an additional constraint:

    • the length of each of its vectors must be the same
  • This gives data frames their rectangular structure

  • Columns are variables, and rows are observations

  • A data frame is R’s equivalent of a spreadsheet

  • If you do data analysis in R, you’re going to be using data frames


  • The R function data.frame() is used to create a new data frame, where atomic vectors can be used as inputs
# Creating a data frame
name = c("Fahim", "Abir", "Aman")
language = c("R", "Python", "Java")
age = c(22, 25, 45)
df = data.frame(name, language, age)
df
#>    name language age
#> 1 Fahim        R  22
#> 2  Abir   Python  25
#> 3  Aman     Java  45
typeof(df)
#> [1] "list"
attributes(df)
#> $names
#> [1] "name"     "language" "age"     
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1 2 3
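
Because a data frame is a list of equal-length vectors, the usual list functions work on it column by column. The sketch below uses a made-up data frame called scores:

```r
scores <- data.frame(id = 1:3, grade = c("A", "B", "A"))

is.list(scores)            # a data frame is a list underneath
#> [1] TRUE
length(scores)             # number of columns, not rows
#> [1] 2
sapply(scores, length)     # every column has the same length (3)
dim(scores)                # rows and columns
#> [1] 3 2

# Columns of incompatible lengths (3 and 4) are rejected;
# try() is used only so that the script keeps running
res <- try(data.frame(id = 1:3, grade = c("A", "B", "C", "D")),
           silent = TRUE)
inherits(res, "try-error")
#> [1] TRUE
```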

  • While the attributes of a data frame and a matrix are different, a matrix can be transformed into a data frame using the function as.data.frame()
mat1 <- matrix(1:12, nrow = 3)
attributes(mat1)
#> $dim
#> [1] 3 4
mat2 <- as.data.frame(mat1)
attributes(mat2)
#> $names
#> [1] "V1" "V2" "V3" "V4"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1 2 3
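
The reverse conversion, with as.matrix(), highlights the difference between the two structures. In this sketch (people is a made-up data frame), the mixed columns are coerced to a single type because a matrix can hold only one:

```r
people <- data.frame(name = c("Fahim", "Abir"), age = c(22, 25))

m_chr <- as.matrix(people)         # character and numeric columns
typeof(m_chr)                      # everything becomes character
#> [1] "character"

m_num <- as.matrix(people["age"])  # numeric column only
typeof(m_num)
#> [1] "double"
dim(m_num)
#> [1] 2 1
```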

There are various ways to inspect a data frame, such as:

  • str() gives a very brief description of the data
  • names() gives the name of each variable in the data
  • summary() gives some very basic summary statistics for each variable
  • head() shows the first few rows
  • tail() shows the last few rows
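
These inspection functions can be tried on the built-in mtcars data frame. The sketch below stores some results in illustrative variables (first_rows, last_rows) and leaves the printed output for you to explore:

```r
str(mtcars[, 1:3])        # structure of the first three columns
names(mtcars)             # variable names
summary(mtcars$mpg)       # summary statistics for one variable

first_rows <- head(mtcars, 3)   # first three rows (default is 6)
last_rows  <- tail(mtcars, 3)   # last three rows
first_rows
last_rows
```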

7 Tibbles

  • The data frame is one of the most important ideas in R, and it is one of the things that make R different from other programming languages

  • Data frames were created more than 20 years ago, and over the years the way people use R has changed

  • Some of the design decisions behind data frames do not fit well with the current way of using R

  • Tibbles are similar to data frames and overcome some of their limitations


  • Tibbles are not part of base R; they are provided by the R package tibble

  • To use tibbles, one needs to load the tibble package into the current R session

# Load the tibble package
library(tibble)
# Create a tibble with three columns: name, age, and city
my_data <- tibble(
  name = c("Samir", "Amir", "Aman"),
  age = c(25, 30, 35),
  city = c("Dhaka", "Khulna", "Jashore")
)
my_data
#> # A tibble: 3 × 3
#>   name    age city   
#>   <chr> <dbl> <chr>  
#> 1 Samir    25 Dhaka  
#> 2 Amir     30 Khulna 
#> 3 Aman     35 Jashore

Data frame vs. Tibble

Data frame

df2 <- data.frame(
  x = 1:3,
  y = LETTERS[1:3],
  z = c(2, 4, 6)
)
df2
#>   x y z
#> 1 1 A 2
#> 2 2 B 4
#> 3 3 C 6
typeof(df2)
#> [1] "list"

Tibble

tb2 <- tibble(
  x = 1:3,
  y = LETTERS[1:3],
  z = x * 2
)
tb2
#> # A tibble: 3 × 3
#>       x y         z
#>   <int> <chr> <dbl>
#> 1     1 A         2
#> 2     2 B         4
#> 3     3 C         6
typeof(tb2)
#> [1] "list"

Data frame

attributes(df2)
#> $names
#> [1] "x" "y" "z"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1 2 3
str(df2)
#> 'data.frame':    3 obs. of  3 variables:
#>  $ x: int  1 2 3
#>  $ y: chr  "A" "B" "C"
#>  $ z: num  2 4 6

Tibble

attributes(tb2)
#> $class
#> [1] "tbl_df"     "tbl"        "data.frame"
#> 
#> $row.names
#> [1] 1 2 3
#> 
#> $names
#> [1] "x" "y" "z"
str(tb2)
#> tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ x: int [1:3] 1 2 3
#>  $ y: chr [1:3] "A" "B" "C"
#>  $ z: num [1:3] 2 4 6

  • While data frames automatically recycle columns whose length divides the length of the longest column exactly, tibbles will only recycle vectors of length one

Data frame

data.frame(x = 1:4, y = 1:2)
#>   x y
#> 1 1 1
#> 2 2 2
#> 3 3 1
#> 4 4 2
data.frame(x = 1:4, y = 1:3)
#> Error in data.frame(x = 1:4, y = 1:3): arguments imply differing number of rows: 4, 3

Tibble

tibble(x = 1:4, y = 1:2)
#> Error in `tibble()`:
#> ! Tibble columns must have compatible sizes.
#> • Size 4: Existing data.
#> • Size 2: Column `y`.
#> ℹ Only values of size one are recycled.
tibble(x = 1:4, y = 1)
#> # A tibble: 4 × 2
#>       x     y
#>   <int> <dbl>
#> 1     1     1
#> 2     2     1
#> 3     3     1
#> 4     4     1

  • The data frame mtcars is available in base R (in the datasets package). Converting it with as_tibble() drops the row names by default
head(mtcars, 2)
#>               mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
mtcars_t <- as_tibble(mtcars)
mtcars_t
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows