#> age name gender loc
#> Min. : 5.0 Length:5 boy :3 Urban:3
#> 1st Qu.: 8.0 Class :character girl:2 Rural:2
#> Median : 9.0 Mode :character
#> Mean : 8.6
#> 3rd Qu.:10.0
#> Max. :11.0
# Summary of a specific variable
summary(df$age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 5.0 8.0 9.0 8.6 10.0 11.0
# Frequency table of a variable
table(df$gender)
#>
#> boy girl
#> 3 2
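Counts are often easier to read as proportions. The following is a minimal sketch, assuming the gender column holds the five values shown earlier; the standalone `gender` vector is rebuilt here only so the snippet runs on its own.

```r
# Rebuild the gender values from the example (assumed to be a factor)
gender <- factor(c("boy", "boy", "girl", "boy", "girl"))

freq <- table(gender)       # counts per category
prop <- prop.table(freq)    # proportions that sum to 1

freq
prop
round(100 * prop, 1)        # percentages
```

prop.table() divides each count by the total, so the result keeps the same labels as the original table.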
Ordering data frames
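We want to reorder the observations of the data frame df by the variable age.
Recall: order() returns the index positions that arrange an atomic vector in ascending order of its values, whereas sort() returns the sorted values themselves. Remember the following example?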
age <- c(11, 9, 8, 10, 5)
sort(age)
#> [1] 5 8 9 10 11
order(age)
#> [1] 5 3 2 4 1
age[order(age)] # equivalent to sort(age)
#> [1] 5 8 9 10 11
# Original data
df
#> age name gender loc
#> 1 11 Raju boy Urban
#> 2 9 Raj boy Rural
#> 3 8 Raba girl Urban
#> 4 10 Rahul boy Urban
#> 5 5 Rimi girl Rural
# Ordering the data by `age`
df[order(df$age), ]
#> age name gender loc
#> 5 5 Rimi girl Rural
#> 3 8 Raba girl Urban
#> 2 9 Raj boy Rural
#> 4 10 Rahul boy Urban
#> 1 11 Raju boy Urban
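The same indexing idea extends to descending order and to ordering by more than one variable. This is a sketch under the assumption that df looks like the data printed above; it is rebuilt here (without loc, and with gender as plain character) only to keep the snippet self-contained.

```r
# Rebuild a data frame resembling the example (an assumption for illustration)
df <- data.frame(
  age    = c(11, 9, 8, 10, 5),
  name   = c("Raju", "Raj", "Raba", "Rahul", "Rimi"),
  gender = c("boy", "boy", "girl", "boy", "girl"),
  stringsAsFactors = FALSE
)

# Descending order of age
df_desc <- df[order(df$age, decreasing = TRUE), ]
df_desc

# Order by gender first, then by age within each gender
df_multi <- df[order(df$gender, df$age), ]
df_multi
```

When several vectors are passed to order(), later ones are used only to break ties in the earlier ones.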
Handling missing data
NA (Not Available) is used in R as a placeholder for a missing observation.
Many R summary functions, such as mean(), sum() and sd(), have an argument na.rm that takes a logical value; setting na.rm = TRUE excludes missing values from the calculation.
mean(c(1:10, NA, 14:16), na.rm = TRUE)
#> [1] 7.692308
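To see why na.rm matters, it helps to compare the result with and without it. A short sketch using the same vector as above (the name `x` is introduced here for convenience):

```r
x <- c(1:10, NA, 14:16)

mean(x)                  # NA: a single missing value propagates to the result
mean(x, na.rm = TRUE)    # 7.692308, computed from the 13 non-missing values
sum(is.na(x))            # number of missing values in x
```

is.na() returns a logical vector, so summing it counts the missing entries.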
na.omit() is used to exclude all rows of a data frame that include a missing observation.
xmd <- data.frame(x = c(NA, 11:14), y = c(rep("boy", 4), NA))
xmd # Data with missing values
#> x y
#> 1 NA boy
#> 2 11 boy
#> 3 12 boy
#> 4 13 boy
#> 5 14 <NA>
# Data after omitting missing values
na.omit(xmd)
#> x y
#> 2 11 boy
#> 3 12 boy
#> 4 13 boy
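Before dropping rows it is usually worth checking how many values are missing and where. The sketch below reuses xmd from above and shows complete.cases() as one possible alternative to na.omit(); the names `keep`, `xmd_complete` and `xmd_x` are introduced here for illustration.

```r
xmd <- data.frame(x = c(NA, 11:14), y = c(rep("boy", 4), NA))

colSums(is.na(xmd))            # missing values per column

keep <- complete.cases(xmd)    # TRUE for rows without any NA
xmd_complete <- xmd[keep, ]    # keeps the same rows as na.omit(xmd)
xmd_complete

# Drop rows only when `x` is missing; row 5 (missing y) is kept
xmd_x <- xmd[!is.na(xmd$x), ]
xmd_x
```

Filtering on a single column is useful when a missing value in another column does not affect the analysis at hand.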
Adding new columns or rows
Adding a new variable using $
df$place <- c("UK", "BN", "PK", "IN", "BN")
df
#> age name gender loc place
#> 1 11 Raju boy Urban UK
#> 2 9 Raj boy Rural BN
#> 3 8 Raba girl Urban PK
#> 4 10 Rahul boy Urban IN
#> 5 5 Rimi girl Rural BN
# Removing loc
df$loc <- NULL
df
#> age name gender place
#> 1 11 Raju boy UK
#> 2 9 Raj boy BN
#> 3 8 Raba girl PK
#> 4 10 Rahul boy IN
#> 5 5 Rimi girl BN
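The heading of this section mentions rows as well as columns. One way to append a row is rbind(), sketched below. The data frame is rebuilt here with character columns for simplicity, and the new observation ("Rina") is hypothetical, invented only for illustration.

```r
# Rebuild df as it looks after removing loc (character columns assumed)
df <- data.frame(
  age    = c(11, 9, 8, 10, 5),
  name   = c("Raju", "Raj", "Raba", "Rahul", "Rimi"),
  gender = c("boy", "boy", "girl", "boy", "girl"),
  place  = c("UK", "BN", "PK", "IN", "BN"),
  stringsAsFactors = FALSE
)

# Hypothetical new observation; column names must match those of df
new_row <- data.frame(age = 7, name = "Rina", gender = "girl", place = "IN",
                      stringsAsFactors = FALSE)

df2 <- rbind(df, new_row)
df2
```

rbind() matches columns by name, so the new row needs the same column names as the existing data frame.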
# A subset of the boys' data
df_boy <- df[df$gender == "boy", ]
df_boy
#> age name gender place
#> 1 11 Raju boy UK
#> 2 9 Raj boy BN
#> 4 10 Rahul boy IN
# Mean age of boys
mean(df_boy$age)
#> [1] 10
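Logical conditions can be combined, and subset() offers a more compact way to write the same selection. The sketch below assumes df as printed above (rebuilt here with character columns so the snippet runs on its own); the object names `older_boys` and `mean_by_gender` are introduced for illustration.

```r
df <- data.frame(
  age    = c(11, 9, 8, 10, 5),
  name   = c("Raju", "Raj", "Raba", "Rahul", "Rimi"),
  gender = c("boy", "boy", "girl", "boy", "girl"),
  place  = c("UK", "BN", "PK", "IN", "BN"),
  stringsAsFactors = FALSE
)

# Boys older than 9, combining two conditions with &
df[df$gender == "boy" & df$age > 9, ]

# The same rows with subset(), keeping only two columns
older_boys <- subset(df, gender == "boy" & age > 9, select = c(name, age))
older_boys

# Mean age for every gender at once, without creating separate subsets
mean_by_gender <- tapply(df$age, df$gender, mean)
mean_by_gender
```

tapply() applies a function to one vector within the groups defined by another, which avoids building one subset per group.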
Exercise 8.1
The mtcars dataset comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Load the data by running data(mtcars).
Obtain the variable list of the data frame mtcars
How many observations and variables does the mtcars data frame have?
Check the types of the variables of mtcars
Rename the variable hp to horsepower
Order the dataset in ascending order of the variable mpg (miles per gallon)
Convert the variable cyl (number of cylinders) to a factor
Create a subset of the mtcars dataset where mpg is less than 30, retaining only the first five variables. Save the resulting dataset as mtcars_subset.
Frequency table
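A frequency table (also known as a frequency distribution) summarizes data in tabular form by presenting the frequency corresponding to each distinct value.
A frequency table of ungrouped data is often not very useful: every distinct value gets its own row, so a numeric variable with many distinct values produces a long table of small counts.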
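One common remedy is to group the values into intervals before tabulating. The following is a minimal sketch using the ages from the earlier example; the break points 4, 8 and 12 are an arbitrary choice made here for illustration.

```r
age <- c(11, 9, 8, 10, 5)

# Ungrouped: every distinct value appears once
table(age)

# Grouped: intervals (4,8] and (8,12]; break points are an assumption
age_group <- cut(age, breaks = c(4, 8, 12))
grouped <- table(age_group)
grouped
```

By default cut() builds intervals that are open on the left and closed on the right, so an age of 8 falls in (4,8].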