8 More on data frames

(AST230) R for Data Science

Md Rasel Biswas

Data frame

  • A two-dimensional array with two or more atomic vectors of the same length is known as a data frame

  • Most useful storage structure for data analysis

  • Columns are variables, and rows are observations

  • R’s equivalent to spreadsheet

  • If you do data analysis in R, you’re going to be using data frames

# Creating three atomic vectors
age <- c(11, 9, 8, 10, 5)
name <- c("Raju", "Raj", "Raba", "Rahul", "Rimi")
sex <- c("boy", "boy", "girl", "boy", "girl")
loc <- c(1, 2, 1, 1, 2)

# Creating data frame
df <- data.frame(age, name, gender=sex, loc)

# Convert categorical variables to factor
df$gender <- factor(df$gender)
df$loc <- factor(df$loc, labels=c("Urban", "Rural"))

# Print df
df
  age  name gender   loc
1  11  Raju    boy Urban
2   9   Raj    boy Rural
3   8  Raba   girl Urban
4  10 Rahul    boy Urban
5   5  Rimi   girl Rural

Some useful functions

# Variable names of the data frame
names(df)
[1] "age"    "name"   "gender" "loc"   
# Dimension of the data frame
dim(df)
[1] 5 4
# Details of a df
str(df)
'data.frame':   5 obs. of  4 variables:
 $ age   : num  11 9 8 10 5
 $ name  : chr  "Raju" "Raj" "Raba" "Rahul" ...
 $ gender: Factor w/ 2 levels "boy","girl": 1 1 2 1 2
 $ loc   : Factor w/ 2 levels "Urban","Rural": 1 2 1 1 2
# Summary of the data frame
summary(df)
      age           name            gender     loc   
 Min.   : 5.0   Length:5           boy :3   Urban:3  
 1st Qu.: 8.0   Class :character   girl:2   Rural:2  
 Median : 9.0   Mode  :character                     
 Mean   : 8.6                                        
 3rd Qu.:10.0                                        
 Max.   :11.0                                        
# Summary of a specific variable
summary(df$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    5.0     8.0     9.0     8.6    10.0    11.0 
# Frequency table of a variable
table(df$gender)

 boy girl 
   3    2 

Ordering data frames

We want to reorder the observations of the data df by the variable age.

Recall: order() is used to order an atomic vector by its value. Remember the following example?

age <- c(11, 9, 8, 10, 5)
sort(age)
[1]  5  8  9 10 11
order(age)
[1] 5 3 2 4 1
age[order(age)] #equivalent to sort()
[1]  5  8  9 10 11
# Original data
df
  age  name gender   loc
1  11  Raju    boy Urban
2   9   Raj    boy Rural
3   8  Raba   girl Urban
4  10 Rahul    boy Urban
5   5  Rimi   girl Rural
# Ordering the data by `age`
df[order(df$age), ]
  age  name gender   loc
5   5  Rimi   girl Rural
3   8  Raba   girl Urban
2   9   Raj    boy Rural
4  10 Rahul    boy Urban
1  11  Raju    boy Urban

Handling missing data

  • The NA (Not Applicable) character is used as a placeholder of missing observation in R

  • Most of the R functions have an argument na.rm, which takes a logical value to exclude the missing value from the calculation

mean(c(1:10, NA, 14:16), 
     na.rm = TRUE)
[1] 7.692308
  • na.omit() is used to exclude all rows of a data frame that include a missing observation
xmd <- data.frame(
  x = c(NA, 11:14),
  y = c(rep("boy", 4), NA))
xmd # Data with missing values
   x    y
1 NA  boy
2 11  boy
3 12  boy
4 13  boy
5 14 <NA>
# Data after omitting missing values
na.omit(xmd)
   x   y
2 11 boy
3 12 boy
4 13 boy

Adding new column or rows

Adding a new variable using $

df$place <- c("UK", "BN", "PK", "IN", "BN")
df
  age  name gender   loc place
1  11  Raju    boy Urban    UK
2   9   Raj    boy Rural    BN
3   8  Raba   girl Urban    PK
4  10 Rahul    boy Urban    IN
5   5  Rimi   girl Rural    BN
# removing loc
df$loc <- NULL
df
  age  name gender place
1  11  Raju    boy    UK
2   9   Raj    boy    BN
3   8  Raba   girl    PK
4  10 Rahul    boy    IN
5   5  Rimi   girl    BN
# rbind for rows
df1 <- data.frame(id = 1:4, height = c(120, 150, 132, 122),
                        weight = c(44, 56, 49, 45))
df1
  id height weight
1  1    120     44
2  2    150     56
3  3    132     49
4  4    122     45
df2 <- data.frame(id = 5:6, height = c(119, 110), weight = c(39, 35))
df2
  id height weight
1  5    119     39
2  6    110     35
rbind(df1, df2)
  id height weight
1  1    120     44
2  2    150     56
3  3    132     49
4  4    122     45
5  5    119     39
6  6    110     35
# cbind for columns
df1
  id height weight
1  1    120     44
2  2    150     56
3  3    132     49
4  4    122     45
df3 <- data.frame(location = c("UK", "CZ", "CZ", "UK"))
df3
  location
1       UK
2       CZ
3       CZ
4       UK
cbind(df1, df3)
  id height weight location
1  1    120     44       UK
2  2    150     56       CZ
3  3    132     49       CZ
4  4    122     45       UK

Analyse a subset of data

# Full data
df
  age  name gender place
1  11  Raju    boy    UK
2   9   Raj    boy    BN
3   8  Raba   girl    PK
4  10 Rahul    boy    IN
5   5  Rimi   girl    BN
# A subset of boy's data
df_boy <- df[df$gender == "boy", ]
df_boy
  age  name gender place
1  11  Raju    boy    UK
2   9   Raj    boy    BN
4  10 Rahul    boy    IN
# Mean age of boys
mean(df_boy$age)
[1] 10

Exercise 8.1

The data mtcars comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Load the data by running data(mtcars)

  • Obtain the variable list of the data frame mtcars
  • How many observations and variables do the mtcars data have?
  • Check the types of the variables of mtcars
  • Rename the variable hp to horsepower
  • Order the dataset in ascending order of the variable mpg (miles per gallon)
  • Convert the variable cyl (number of cylinders) to factor type variable
  • Create a subset of the mtcars dataset where mpg is less than 30, retaining only the first five variables. Save the resulting dataset as mtcars_subset.

Frequency table

A frequency table (known as frequency distribution) is a tabular format of summarizing data where frequency corresponding each data point is presented

Frequency table of ungrouped data is often not so useful.

table(mtcars$mpg)

10.4 13.3 14.3 14.7   15 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2 19.7   21 
   2    1    1    1    1    2    1    1    1    1    1    1    1    2    1    2 
21.4 21.5 22.8 24.4   26 27.3 30.4 32.4 33.9 
   2    1    2    1    1    1    2    1    1 

Therefore, dividing the values into groups or class intervals is useful.

mtcars$mpg_cat <- cut(mtcars$mpg, breaks = c(10, 20, 30, 40),
                      labels= c("low", "med", "high"), right = T)
table(mtcars$mpg_cat)

 low  med high 
  18   10    4 

We can extend this further by producing a frequency for each combination of mpg_cat and cyl

table(mtcars$mpg_cat, mtcars$cyl)
      
        4  6  8
  low   0  4 14
  med   7  3  0
  high  4  0  0

Exercise 8.1 (continued)

Using the mtcars data:

  • Find the mean, median, mode, range, standard deviation, and IQR of the variable Miles/(US) gallon (mpg)

  • Find the frequency table of Number of cylinders (cyl)

  • Find the 2-way contingency table of Number of cylinders (cyl) and Number of forward gears (gear)