8 More on data frames

(AST230) R for Data Science

Md Rasel Biswas

Data frame

  • A data frame is a two-dimensional structure made up of atomic vectors of the same length, one vector per column; unlike a matrix, the columns may have different types

  • Most useful storage structure for data analysis

  • Columns are variables, and rows are observations

  • R’s equivalent of a spreadsheet

  • If you do data analysis in R, you’re going to be using data frames

# Creating four atomic vectors
age <- c(11, 9, 8, 10, 5)
name <- c("Raju", "Raj", "Raba", "Rahul", "Rimi")
sex <- c("boy", "boy", "girl", "boy", "girl")
loc <- c(1, 2, 1, 1, 2)

# Creating data frame
df <- data.frame(age, name, gender=sex, loc)

# Convert categorical variables to factor
df$gender <- factor(df$gender)
df$loc <- factor(df$loc, labels=c("Urban", "Rural"))

# Print df
df
  age  name gender   loc
1  11  Raju    boy Urban
2   9   Raj    boy Rural
3   8  Raba   girl Urban
4  10 Rahul    boy Urban
5   5  Rimi   girl Rural
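The following is a minimal sketch, not part of the original slides, of how the columns of the data frame built above can be reached. It rebuilds df so the block runs on its own; the names age_dollar, age_double and age_bracket are illustrative only.

```r
# Rebuild the example data frame so this block is self-contained
df <- data.frame(
  age    = c(11, 9, 8, 10, 5),
  name   = c("Raju", "Raj", "Raba", "Rahul", "Rimi"),
  gender = factor(c("boy", "boy", "girl", "boy", "girl")),
  loc    = factor(c(1, 2, 1, 1, 2), labels = c("Urban", "Rural"))
)

# A data frame is stored as a list of equal-length columns
is.list(df)
lengths(df)

# Three ways to extract the same column (illustrative names)
age_dollar  <- df$age        # by name with $
age_double  <- df[["age"]]   # by name with [[ ]]
age_bracket <- df[, 1]       # by position

# Levels of the factor created with labels = c("Urban", "Rural")
levels(df$loc)
```

All three extractions return the same numeric vector; which one to use is mostly a matter of taste and of whether the column name is known in advance.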

Some useful functions

# Variable names of the data frame
names(df)
[1] "age"    "name"   "gender" "loc"   
# Dimension of the data frame
dim(df)
[1] 5 4
# Structure of the data frame
str(df)
'data.frame':   5 obs. of  4 variables:
 $ age   : num  11 9 8 10 5
 $ name  : chr  "Raju" "Raj" "Raba" "Rahul" ...
 $ gender: Factor w/ 2 levels "boy","girl": 1 1 2 1 2
 $ loc   : Factor w/ 2 levels "Urban","Rural": 1 2 1 1 2
# Summary of the data frame
summary(df)
      age           name            gender     loc   
 Min.   : 5.0   Length:5           boy :3   Urban:3  
 1st Qu.: 8.0   Class :character   girl:2   Rural:2  
 Median : 9.0   Mode  :character                     
 Mean   : 8.6                                        
 3rd Qu.:10.0                                        
 Max.   :11.0                                        
# Summary of a specific variable
summary(df$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    5.0     8.0     9.0     8.6    10.0    11.0 
# Frequency table of a variable
table(df$gender)

 boy girl 
   3    2 
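A few related base R helpers can be sketched in the same spirit. This block is an addition to the slides, rebuilds df so that it runs on its own, and uses the illustrative names n_obs, n_vars and col_classes.

```r
# Rebuild the example data frame so this block is self-contained
df <- data.frame(
  age    = c(11, 9, 8, 10, 5),
  name   = c("Raju", "Raj", "Raba", "Rahul", "Rimi"),
  gender = factor(c("boy", "boy", "girl", "boy", "girl")),
  loc    = factor(c(1, 2, 1, 1, 2), labels = c("Urban", "Rural"))
)

n_obs  <- nrow(df)   # number of observations (rows)
n_vars <- ncol(df)   # number of variables (columns)

# Class of every column, returned as a named character vector
col_classes <- sapply(df, class)
col_classes

head(df, 3)   # first three rows
tail(df, 2)   # last two rows
```

head() and tail() are mainly useful for larger data sets, where printing the whole data frame is impractical.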

Ordering data frames

We want to reorder the observations of the data frame df by the variable age.

Recall: order() returns the positions (indices) that would sort an atomic vector by its values. Remember the following example?

age <- c(11, 9, 8, 10, 5)
sort(age)
[1]  5  8  9 10 11
order(age)
[1] 5 3 2 4 1
age[order(age)] # equivalent to sort(age)
[1]  5  8  9 10 11
# Original data
df
  age  name gender   loc
1  11  Raju    boy Urban
2   9   Raj    boy Rural
3   8  Raba   girl Urban
4  10 Rahul    boy Urban
5   5  Rimi   girl Rural
# Ordering the data by `age`
df[order(df$age), ]
  age  name gender   loc
5   5  Rimi   girl Rural
3   8  Raba   girl Urban
2   9   Raj    boy Rural
4  10 Rahul    boy Urban
1  11  Raju    boy Urban
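The same indexing idea extends to descending order and to ordering by more than one variable. The sketch below is an addition to the slides; it rebuilds a reduced copy of df, and df_desc and df_multi are illustrative names.

```r
# Reduced copy of the example data so this block is self-contained
df <- data.frame(
  age    = c(11, 9, 8, 10, 5),
  name   = c("Raju", "Raj", "Raba", "Rahul", "Rimi"),
  gender = c("boy", "boy", "girl", "boy", "girl")
)

# Descending order of age
df_desc <- df[order(df$age, decreasing = TRUE), ]
df_desc

# Order by gender first, then by age (descending) within each gender;
# negating a numeric column reverses its ordering
df_multi <- df[order(df$gender, -df$age), ]
df_multi
```

Note that the row names keep the original row numbers, which makes it easy to see where each observation came from.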

Handling missing data

  • NA (Not Available) is used as a placeholder for a missing observation in R

  • Many R summary functions (e.g. mean(), sum(), sd()) have an argument na.rm, which takes a logical value; na.rm = TRUE excludes missing values from the calculation

mean(c(1:10, NA, 14:16), 
     na.rm = TRUE)
[1] 7.692308
  • na.omit() drops every row of a data frame that contains at least one missing value
xmd <- data.frame(
  x = c(NA, 11:14),
  y = c(rep("boy", 4), NA))
xmd # Data with missing values
   x    y
1 NA  boy
2 11  boy
3 12  boy
4 13  boy
5 14 <NA>
# Data after omitting missing values
na.omit(xmd)
   x   y
2 11 boy
3 12 boy
4 13 boy
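Before dropping rows it can help to see how many values are missing and where. The following sketch is an addition to the slides; it reuses the vector and the xmd data frame from above, and the names mean_with_na, mean_no_na, n_missing, keep and xmd_complete are illustrative.

```r
x <- c(1:10, NA, 14:16)

mean_with_na <- mean(x)                # NA: one missing value spoils the result
mean_no_na   <- mean(x, na.rm = TRUE)  # missing value excluded
n_missing    <- sum(is.na(x))          # number of missing values in x

xmd <- data.frame(
  x = c(NA, 11:14),
  y = c(rep("boy", 4), NA))

# Number of missing values in each column
colSums(is.na(xmd))

# complete.cases() flags rows without any missing value;
# indexing with it keeps the same rows that na.omit() keeps
keep <- complete.cases(xmd)
xmd_complete <- xmd[keep, ]
xmd_complete
```

Indexing with complete.cases() is handy when the logical vector itself is needed, for example to count how many rows would be lost.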

Adding new columns or rows

Adding a new variable using $

df$place <- c("UK", "BN", "PK", "IN", "BN")
df
  age  name gender   loc place
1  11  Raju    boy Urban    UK
2   9   Raj    boy Rural    BN
3   8  Raba   girl Urban    PK
4  10 Rahul    boy Urban    IN
5   5  Rimi   girl Rural    BN
# Removing the variable loc
df$loc <- NULL
df
  age  name gender place
1  11  Raju    boy    UK
2   9   Raj    boy    BN
3   8  Raba   girl    PK
4  10 Rahul    boy    IN
5   5  Rimi   girl    BN
# rbind for rows
df1 <- data.frame(id = 1:4, height = c(120, 150, 132, 122),
                  weight = c(44, 56, 49, 45))
df1
  id height weight
1  1    120     44
2  2    150     56
3  3    132     49
4  4    122     45
df2 <- data.frame(id = 5:6, height = c(119, 110), weight = c(39, 35))
df2
  id height weight
1  5    119     39
2  6    110     35
rbind(df1, df2)
  id height weight
1  1    120     44
2  2    150     56
3  3    132     49
4  4    122     45
5  5    119     39
6  6    110     35
# cbind for columns
df1
  id height weight
1  1    120     44
2  2    150     56
3  3    132     49
4  4    122     45
df3 <- data.frame(location = c("UK", "CZ", "CZ", "UK"))
df3
  location
1       UK
2       CZ
3       CZ
4       UK
cbind(df1, df3)
  id height weight location
1  1    120     44       UK
2  2    150     56       CZ
3  3    132     49       CZ
4  4    122     45       UK
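One detail worth knowing is that rbind() on data frames matches columns by name, not by position. The sketch below is an addition to the slides; df2_shuffled, df_bad, combined and result are hypothetical names introduced for illustration.

```r
df1 <- data.frame(id = 1:4, height = c(120, 150, 132, 122),
                  weight = c(44, 56, 49, 45))

# Same column names as df1 but in a different order:
# rbind() lines the columns up by name
df2_shuffled <- data.frame(weight = c(39, 35), id = 5:6,
                           height = c(119, 110))
combined <- rbind(df1, df2_shuffled)
combined

# A data frame with a different column name ("mass" instead of "weight"):
# rbind() stops with an error, caught here with tryCatch()
df_bad <- data.frame(id = 7, height = 125, mass = 40)
result <- tryCatch(rbind(df1, df_bad),
                   error = function(e) "names do not match")
result
```

In practice it is safest to check names() of both data frames before binding rows.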

Analyse a subset of data

# Full data
df
  age  name gender place
1  11  Raju    boy    UK
2   9   Raj    boy    BN
3   8  Raba   girl    PK
4  10 Rahul    boy    IN
5   5  Rimi   girl    BN
# A subset of boys' data
df_boy <- df[df$gender == "boy", ]
df_boy
  age  name gender place
1  11  Raju    boy    UK
2   9   Raj    boy    BN
4  10 Rahul    boy    IN
# Mean age of boys
mean(df_boy$age)
[1] 10
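The same subset can be obtained in other ways. The following sketch is an addition to the slides, shown under the assumption that base R's subset() and tapply() are acceptable alternatives here; df_boy2, older_boys and mean_by_gender are illustrative names.

```r
# Rebuild the example data so this block is self-contained
df <- data.frame(
  age    = c(11, 9, 8, 10, 5),
  name   = c("Raju", "Raj", "Raba", "Rahul", "Rimi"),
  gender = c("boy", "boy", "girl", "boy", "girl"),
  place  = c("UK", "BN", "PK", "IN", "BN")
)

# subset() evaluates the condition inside the data frame,
# so columns can be referred to without df$
df_boy2  <- subset(df, gender == "boy")
mean_boy <- mean(df_boy2$age)

# Combine conditions with & and keep only selected columns
older_boys <- df[df$gender == "boy" & df$age >= 10, c("name", "age")]
older_boys

# Mean age for every gender in one call
mean_by_gender <- tapply(df$age, df$gender, mean)
mean_by_gender
```

tapply() avoids creating a separate data frame for each group when only a summary per group is needed.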

Exercise 8.1

The data set mtcars comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Load the data by running data(mtcars).

  • Obtain the variable list of the data frame mtcars
  • How many observations and variables do the mtcars data have?
  • Check the types of the variables of mtcars
  • Rename the variable hp to horsepower
  • Order the dataset in ascending order of the variable mpg (miles per gallon)
  • Convert the variable cyl (number of cylinders) to factor type variable
  • Create a subset of the mtcars dataset where mpg is less than 30, retaining only the first five variables. Save the resulting dataset as mtcars_subset.

Frequency table

A frequency table (also known as a frequency distribution) is a tabular summary of data that shows the frequency corresponding to each distinct value.

A frequency table of ungrouped data is often not very useful, because most values occur only once or twice.

table(mtcars$mpg)

10.4 13.3 14.3 14.7   15 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2 19.7   21 
   2    1    1    1    1    2    1    1    1    1    1    1    1    2    1    2 
21.4 21.5 22.8 24.4   26 27.3 30.4 32.4 33.9 
   2    1    2    1    1    1    2    1    1 

Therefore, dividing the values into groups or class intervals is useful.

mtcars$mpg_cat <- cut(mtcars$mpg, breaks = c(10, 20, 30, 40),
                      labels = c("low", "med", "high"), right = TRUE)
table(mtcars$mpg_cat)

 low  med high 
  18   10    4 

We can extend this further by producing a frequency for each combination of mpg_cat and cyl:

table(mtcars$mpg_cat, mtcars$cyl)
        4  6  8
  low   0  4 14
  med   7  3  0
  high  4  0  0
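Counts are often easier to read as proportions or with totals attached. The sketch below is an addition to the slides, using base R's prop.table() and addmargins(); it works on a separate vector mpg_cat rather than modifying mtcars, and freq, rel_freq and two_way are illustrative names.

```r
data(mtcars)

# Same grouping as above, stored in a separate vector
mpg_cat <- cut(mtcars$mpg, breaks = c(10, 20, 30, 40),
               labels = c("low", "med", "high"), right = TRUE)

freq     <- table(mpg_cat)    # absolute frequencies
rel_freq <- prop.table(freq)  # relative frequencies (sum to 1)
round(100 * rel_freq, 1)      # as percentages

# Two-way table with row and column totals added
two_way <- table(mpg_cat, cyl = mtcars$cyl)
addmargins(two_way)
```

With right = TRUE each interval is closed on the right, so a value of exactly 20 falls in "low" and not in "med".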

Exercise 8.1 (continued)

Using the mtcars data:

  • Find the mean, median, mode, range, standard deviation, and IQR of the variable Miles/(US) gallon (mpg)

  • Find the frequency table of Number of cylinders (cyl)

  • Find the 2-way contingency table of Number of cylinders (cyl) and Number of forward gears (gear)