15 ggplot2

Data visualization with ggplot2

ggplot2

  • Base R plotting functions can require considerable effort and expertise to produce publication-quality graphs

  • Over the years, several R packages (e.g. lattice, grid) were introduced to overcome the limitations of base R plotting functions

  • The ggplot2 package is a widely used alternative that can produce elegant plots with little effort


  • The ggplot2 package implements the grammar of graphics, a coherent system for describing and building graphs

  • To load the ggplot2 package into the current R session:

library(ggplot2)
library(palmerpenguins)  # provides the penguins data used in the examples

1 Scatter plot

ggplot(data = penguins) + 
  geom_point(
    mapping = aes(x = bill_length_mm, 
                  y = flipper_length_mm)
    )

Creating a ggplot

ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = bill_length_mm, 
                  y = flipper_length_mm)
    )

ggplot(data = penguins) +

  • A blank slate: It creates a coordinate system to which several layers can be added

  • Every ggplot2 plot begins with a call to the ggplot() function

  • data is the first argument of ggplot(); it specifies the data frame to be used for the plot

  • One or more layers can be added to ggplot() using a plus (+) sign
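
The layering idea can be checked by inspecting the plot object itself. This is a minimal sketch, assuming the penguins data frame comes from the palmerpenguins package; the object names base and scatter are illustrative only.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data frame

# ggplot() alone records the data and sets up the plot, but has no layers yet
base <- ggplot(data = penguins)
length(base$layers)

# Each `+` appends one layer to the plot object
scatter <- base +
  geom_point(mapping = aes(x = bill_length_mm, y = flipper_length_mm))
length(scatter$layers)
```

The first length() call should report 0 layers and the second 1. Printing base draws an empty panel, while printing scatter draws the points.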

geom_point

  • Geometric objects (called geom) are the shapes we put on a plot (e.g. points, bars, etc.).

  • A plot can have any number of layers, but it needs at least one geom to display the data

    • geom_point() makes a scatter plot by adding a layer of points.
    • geom_line() adds a layer of lines connecting data points.
    • geom_col() adds bars for bar charts.
    • geom_histogram() makes a histogram.
    • geom_boxplot() adds boxes for boxplots

mapping = aes()

  • Each type of geom usually has a required set of aesthetics to be set. Aesthetic mappings are set with the aes() function. Examples include

    • x and y (the position on the x and y axes)
    • color (“outside” color, like the line around a bar)
    • fill (“inside” color, like the color of the bar itself)
    • shape (the type of point, like a dot, square, triangle, etc.)
    • linetype (solid, dashed, dotted etc.)
    • size (of geoms)
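
A mapping can be built and inspected on its own, which also shows that col, color, and colour are alternative spellings of one aesthetic. The sketch below only constructs mappings; the object name m is illustrative, and the column names are those of the penguins data used throughout.

```r
library(ggplot2)

# aes() records which variable feeds which visual property;
# the variables are not looked up until the plot is built
m <- aes(x = bill_length_mm, y = flipper_length_mm, col = species)

# ggplot2 standardises the spelling of the colour aesthetic
names(m)
names(aes(color = species))
```

Both calls report the aesthetic as colour, so the examples in these notes that use col behave the same as ones written with color.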

Adding labels, title, and caption to a graph

ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = bill_length_mm, 
                  y = flipper_length_mm)
    ) +
  labs(
    x = "Bill length",
    y = "Flipper length",
    title = "Scatter plot of bill and flipper length",
    caption = "R package palmerpenguins") +
  theme_bw()


2 Histogram

  • geom_histogram() draws a histogram of a single continuous variable

  • Only the x aesthetic needs to be supplied in its aes() call

ggplot(data = penguins) +
      geom_histogram( 
        mapping = aes(x = body_mass_g) 
        )
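
When no binning is specified, geom_histogram() falls back to 30 bins and prints a message suggesting a better choice. The sketch below shows two ways of choosing the bins explicitly, assuming penguins comes from palmerpenguins; the values 10 and 100 are arbitrary illustrations, not recommendations.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data frame

# bins sets the number of bins: fewer, wider bins give a coarser picture
coarse <- ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = body_mass_g), bins = 10)

# binwidth sets the width of each bin in data units (grams here)
fine <- ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = body_mass_g), binwidth = 100)

# layer_data() returns the computed bins; every non-missing penguin
# is counted exactly once, whatever the binning
sum(layer_data(coarse)$count)
sum(layer_data(fine)$count)
```

Both sums equal the number of penguins with a recorded body mass; only the way those penguins are grouped into bars changes.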


  • The fill argument of geom_histogram() sets the color of the bars; mapping y to after_stat(density) puts the histogram on the density scale instead of counts

ggplot(data = penguins) +
      geom_histogram( 
        mapping = aes(x = body_mass_g, 
                      y = after_stat(density)), 
        fill = "steelblue")  

  • The col argument of geom_histogram() sets the outline color of the bars

ggplot(data = penguins) +
      geom_histogram( 
        mapping = aes(x = body_mass_g, 
                      y = after_stat(density)), 
        fill = "steelblue",
        col = "white")  


Histogram and density function

  • geom_density() draws a kernel density estimate of a variable

ggplot(data = penguins) +
      geom_histogram( 
        mapping = aes(x = body_mass_g, 
                      y = after_stat(density)), 
        fill = "steelblue",
        col = "white") +
      geom_density(
        mapping = aes(x = body_mass_g, 
                      y = after_stat(density)), 
        col = "brown", linewidth = 1) 


  • A mapping supplied to ggplot() is shared by all geom_*() layers, so it does not need to be repeated in each one

ggplot(data = penguins, 
       mapping = aes(x = body_mass_g, 
                     y = after_stat(density))) + 
      geom_histogram(fill = "steelblue", 
                     col = "white") + 
      geom_density(col = "brown", 
                   linewidth = 1) 
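
The histogram and the density curve can share one y-axis because, on the density scale, the bars have a total area of 1, just like a density curve. A minimal check, assuming penguins comes from palmerpenguins (the choice of 30 bins is arbitrary):

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data frame

p <- ggplot(data = penguins,
            mapping = aes(x = body_mass_g, y = after_stat(density))) +
  geom_histogram(bins = 30, fill = "steelblue", col = "white")

# One row per bar, including its limits and its height on the density scale
bars <- layer_data(p)

# Bar height times bar width, summed over all bars, is 1
sum(bars$density * (bars$xmax - bars$xmin))
```

With the default count scale the bars would be hundreds of times taller than the density curve, and the curve would appear as a flat line along the x-axis.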

Exercise 3.2.1

(use the diamonds data to answer the following)

  • Create a histogram of carat and check how the number of bins affects the histogram

  • Add a density line to the histogram from the previous question


3 Boxplot

  • geom_boxplot() draws a boxplot

ggplot(data = penguins) +
  geom_boxplot(
    mapping = aes(x = species, 
                  y = body_mass_g)
    )

Boxplot

ggplot(data = penguins) +
  geom_boxplot(
    mapping = aes(x = species, 
                  y = body_mass_g)
    ) +
  coord_flip() 


Boxplot with original data points

ggplot(data = penguins, 
       mapping = aes(x = species, 
                  y = body_mass_g)) +
  geom_boxplot() +  
  geom_jitter(width = .2, 
              mapping = aes(col = species), 
              size = .75) 

  • geom_jitter() adds a small amount of random variation to the position of each point, which helps separate points that would otherwise overlap at each level of a categorical variable
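
Because the jitter is random, the plot changes slightly every time it is drawn, and by default geom_jitter() also adds a little vertical noise. One possible variation, sketched here under the assumption that penguins comes from palmerpenguins, uses geom_point() with position_jitter() to jitter sideways only and to fix the random seed (the seed value 1 is arbitrary).

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data frame

p <- ggplot(data = penguins,
            mapping = aes(x = species, y = body_mass_g)) +
  geom_boxplot() +
  geom_point(
    # jitter sideways only, with a fixed seed so the plot is reproducible
    position = position_jitter(width = .2, height = 0, seed = 1),
    mapping = aes(col = species),
    size = .75)

# Layer 2 holds the jittered points; y is unchanged because height = 0
pts <- layer_data(p, 2)
head(pts[, c("x", "y")])
```

Keeping height = 0 means every point still sits at its true body mass, so the points can be read against the y-axis alongside the boxes.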

Exercise 3.2.2

(use the diamonds data to answer the following)

  • Create a boxplot of carat at different levels of cut

  • Create a scatter plot to examine the effect of carat on price

Aesthetic mappings

  • A third variable can be added to a two-dimensional scatter plot by mapping it to an aesthetic

  • An aesthetic is a visual property of the objects in the plot (such as the size, shape, and color of the points)

  • Points can be displayed in different ways by changing the values of their aesthetic properties (e.g. the size, shape, or color of the points)


  • Variables can be linked to the graph using the following properties

    • positions (x, y)
    • colors (color, fill)
    • shapes (shape, linetype)
    • size (size)
    • transparency (alpha)
    • groupings (group)


ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = bill_length_mm, 
                  y = flipper_length_mm, 
                  col = species) 
    )

  • col is mapped to species, so each species is drawn in its own color

ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = bill_length_mm, 
                  y = flipper_length_mm, 
                  col =  bill_length_mm > 50), 
    show.legend = FALSE 
    )

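  • col is mapped to a logical expression of bill_length_mm, so points are colored by whether the bill is longer than 50 mm

  • show.legend is a logical argument of geom_*() that controls whether the layer appears in the legend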


  • Besides col, some other aesthetic types are useful in ggplot2

    • size assigns different sizes of the points to different values of the variable

    • alpha controls the transparency of the point

    • shape assigns different (at most six) shapes to different values of the variable

  • ggplot2 creates a legend for the variables used in the arguments of aes() except for x and y


ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = bill_length_mm, 
                  y = flipper_length_mm, 
                  alpha = species) 
    )


ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = bill_length_mm, 
                  y = flipper_length_mm, 
                  shape = species)
    )


  • Aesthetic properties can also be set manually, e.g. col = "blue" will make all the points blue, which does not convey any information about a variable but only changes the appearance of the plot

  • To set an aesthetic manually, it needs to be given outside of aes(), as an argument of the geom_*() function


ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = bill_length_mm, 
                  y = flipper_length_mm), 
    col = "blue") 


ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = bill_length_mm, 
                  y = flipper_length_mm), 
    col = "blue", alpha = .5) 
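
A common slip is to put the fixed value inside aes(). The sketch below contrasts the two forms, assuming penguins comes from palmerpenguins; the object names set_blue and mapped_blue are illustrative only.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data frame

# Setting: col is outside aes(), so every point is literally blue
set_blue <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = bill_length_mm, y = flipper_length_mm),
             col = "blue")

# Mapping by mistake: inside aes(), "blue" is treated as a variable with
# a single level, so ggplot2 assigns a palette color and adds a legend
mapped_blue <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = bill_length_mm, y = flipper_length_mm,
                           col = "blue"))

# Colors actually used for the points in each plot
unique(layer_data(set_blue)$colour)
unique(layer_data(mapped_blue)$colour)
```

In the second plot the points are not blue: the string is only a label, and the color comes from the default discrete palette.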

geom_smooth()

ggplot(data = penguins, 
       mapping = aes(x = bill_length_mm, 
                  y = flipper_length_mm)) +
  geom_point() + 
  geom_smooth() 

  • geom_smooth() describes the relationship between two quantitative variables using a smoothing method; method = "lm" fits a straight line instead of the default smoother

ggplot(data = penguins, 
       mapping = aes(x = bill_length_mm, 
                     y = flipper_length_mm)) +
  geom_point() + 
  geom_smooth(method = "lm") 


ggplot(data = penguins, 
       mapping = aes(x = bill_length_mm, 
                  y = flipper_length_mm,
                  col = species)) + 
  geom_point(size = .75) +
  geom_smooth(method = "lm", se = FALSE) 
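
With method = "lm", the line drawn by geom_smooth() should be the same least-squares line that lm() fits. The following sketch compares the two, assuming penguins comes from palmerpenguins; the object names fit, line, and from_lm are illustrative.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data frame

p <- ggplot(data = penguins,
            mapping = aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point() +
  # formula = y ~ x is the default for "lm"; stating it silences the message
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The same straight line fitted directly with lm()
fit <- lm(flipper_length_mm ~ bill_length_mm, data = penguins)

# Layer 2 holds the points along the drawn line
line <- layer_data(p, 2)
from_lm <- predict(fit, newdata = data.frame(bill_length_mm = line$x))

# TRUE when the drawn line matches the lm() predictions
all.equal(line$y, unname(from_lm))
```

When col = species is mapped in ggplot(), the data are grouped by species, so geom_smooth() fits one such line per species rather than a single overall line.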

Exercise 3.2.3

(use the diamonds data to answer the following)

  • Create a scatter plot to examine the effect of carat on price and assign different colors to different levels of cut

  • Show a fit of a linear model on the scatter plot of carat and price

  • Show different fits of linear models (price on carat) corresponding to different levels of cut on the scatter plot of price and carat

Facets

  • Adding information about another variable to an existing plot can help data analysis; mapping the variable to an aesthetic is one way to do it

  • Facets add information about a categorical variable by splitting the plot into panels, one for each level of the variable

    • facet_wrap() wraps the panels for a variable (or a combination of variables) into rows and columns

    • facet_grid() arranges the panels in a grid, with one variable defining the rows and another the columns


ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, 
                           y = bill_length_mm)) +
  facet_wrap(~species) 


ggplot(data = penguins,
       mapping = aes(
         x = flipper_length_mm, y = bill_length_mm, col = species)) +
  geom_point() +
  geom_smooth(
    method = "lm", se = FALSE, col = "black") +
  facet_wrap(~species) 


ggplot(data = penguins[!is.na(penguins$sex), ]) +
  geom_point(mapping = aes(x = flipper_length_mm, 
                           y = bill_length_mm)) +
  facet_wrap(~ sex + species) 


ggplot(data = penguins[!is.na(penguins$sex), ]) +
  geom_point(mapping = aes(x = flipper_length_mm, 
                           y = bill_length_mm)) +
  facet_grid(sex ~ species) 
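
The panel assignment can be inspected without drawing anything. This is a minimal sketch, assuming penguins comes from palmerpenguins; known_sex, wrapped, and gridded are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data frame

known_sex <- penguins[!is.na(penguins$sex), ]

wrapped <- ggplot(data = known_sex) +
  geom_point(mapping = aes(x = flipper_length_mm, y = bill_length_mm)) +
  facet_wrap(~species)

gridded <- ggplot(data = known_sex) +
  geom_point(mapping = aes(x = flipper_length_mm, y = bill_length_mm)) +
  facet_grid(sex ~ species)

# PANEL records which facet each point was assigned to
table(layer_data(wrapped)$PANEL)            # one panel per species
length(unique(layer_data(gridded)$PANEL))   # 2 sexes x 3 species
```

Each point lands in exactly one panel, so the panel counts for facet_wrap(~species) match the number of penguins of each species in known_sex.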


ggplot(data = penguins) +
  geom_histogram(
    mapping = aes(x = body_mass_g),
    col = "brown", fill = "yellow" 
    ) +
  facet_wrap(~species, ncol = 1)


ggplot(data = penguins) +
  geom_histogram(
    mapping = aes(x = body_mass_g),
    col = "brown", fill = "yellow" 
    ) +
  facet_wrap(~species, ncol = 1) +
  theme_minimal() 


ggplot(data = penguins) +
  geom_histogram(
    mapping = aes(x = body_mass_g),
    col = "brown", fill = "yellow" 
    ) +
  facet_wrap(~species, ncol = 1) +
  theme_minimal() + 
  theme(
    panel.grid = element_blank()
  )


Exercise 3.2.4

(use the diamonds data to answer the following)

  • Create histograms of x at different levels of cut

4 Density plot

Distribution of penguins’ body mass

ggplot(data = penguins) +
  geom_density(
    mapping = aes(x = body_mass_g, 
                  fill = species), 
    alpha = .5) 


Density plot

Distribution of penguins’ body mass

ggplot(data = penguins) +
  geom_density(
    mapping = aes(x = body_mass_g, 
                  fill = species), 
    alpha = .5) +
  theme(
    legend.position = "top",
    legend.title = element_blank()
  )


Distribution of penguins’ body mass

ggplot(data = penguins) +
  geom_density(
    mapping = aes(x = body_mass_g, 
                  fill = species), 
    alpha = .5) +
  theme_minimal(base_size = 7) +
  theme(
    legend.position = "top",
    legend.title = element_blank()
  )


Distribution of penguins’ body mass

ggplot(data = penguins) +
  geom_density(
    mapping = aes(x = body_mass_g, 
                  fill = species), 
    alpha = .5) +
  theme_minimal(base_size = 7) +
  theme(
    legend.position = "top",
    legend.key.size = unit(.75, "lines"),
    legend.title = element_blank()
  )

5 Barchart

Frequency distribution of species

ggplot(data = penguins) + 
  geom_bar(aes(x = species)) 

Frequency distribution of species

ggplot(data = penguins) +
  geom_bar(aes(x = species)) +
  theme_minimal(base_size = 16) 

Barchart with two variables

Distribution of species by year

ggplot(data = penguins) +
  geom_bar(aes(x = species, fill = factor(year)),
           position = "dodge")

ggplot(data = penguins) +
  geom_bar(aes(x = species, fill = factor(year)), 
           position = "stack")

ggplot(data = penguins) +
  geom_bar(aes(x = species, fill = factor(year)), 
           position = "fill")
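
The three position values differ in how the bars for each year are arranged within a species: "dodge" places them side by side, "stack" piles the counts on top of each other, and "fill" stacks them and rescales each bar to a height of 1 so that proportions can be compared. A small check of the last two, assuming penguins comes from palmerpenguins (bars() is a hypothetical helper written for this sketch, not a ggplot2 function):

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data frame

# Hypothetical helper: build the bar chart and return its computed bars
bars <- function(position) {
  p <- ggplot(data = penguins) +
    geom_bar(aes(x = species, fill = factor(year)), position = position)
  layer_data(p)
}

stacked <- bars("stack")
filled  <- bars("fill")

# Top of each stacked bar = number of penguins of that species
tapply(stacked$ymax, as.integer(stacked$x), max)

# With "fill", every bar is rescaled to a total height of 1
tapply(filled$ymax, as.integer(filled$x), max)
```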

Exercise 3.2.5

(use the diamonds data to answer the following)

  • Create a barplot of cut

  • Create a barplot of color

  • Create a barplot of cut that shows the distribution of color at different levels of cut

  • Compare the three different values of the position argument when creating a barplot with cut and color

Homework

  • Use the gapminder package to access the gapminder data

  • gapminder has 6 variables and 1704 observations, where the variables are:

#> [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
  • Create a scatter plot to examine how gdpPercap affects lifeExp

  • Change the scale of the x-axis to log base 10

  • Add a color layer corresponding to continent to the previous graph


  • Create a scatter plot of gdpPercap versus lifeExp for different continents in different plotting regions

  • Add smooth lines to describe relationship between gdpPercap and lifeExp for different continents separately

  • Draw a boxplot of lifeExp to compare the distribution of life expectancy across continents

  • Draw a histogram of lifeExp and check its shape for different bin sizes

  • Draw density plots of lifeExp for different continents in a single plot


  • Make a scatter plot of lifeExp on the y-axis against year on the x

  • Fit a straight line for each country to estimate mean life expectancy as a function of year

  • Split the plot for different continents

  • Add a continent-specific mean line to the plot

Statistical layers geom_*() vs stat_*()

ggplot(penguins, 
       aes(bill_length_mm, flipper_length_mm)) +
  geom_smooth(stat = "smooth")

ggplot(penguins, 
       aes(bill_length_mm, flipper_length_mm)) +
  stat_smooth(geom = "smooth")


ggplot(penguins, 
       aes(bill_length_mm, flipper_length_mm)) +
  geom_point(stat = "identity")

ggplot(penguins, 
       aes(bill_length_mm, flipper_length_mm)) +
  stat_identity(geom = "point")


ggplot(penguins, aes(species)) +
  geom_bar(stat = "count")

ggplot(penguins, aes(species)) +
  stat_count(geom = "bar")
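
Each pair above describes the same layer from two sides: a geom_*() function has a default stat, and a stat_*() function has a default geom. The sketch below checks this for the bar chart pair, assuming penguins comes from palmerpenguins; via_geom and via_stat are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data frame

via_geom <- ggplot(penguins, aes(species)) +
  geom_bar(stat = "count")

via_stat <- ggplot(penguins, aes(species)) +
  stat_count(geom = "bar")

# Both routes compute the same count per species
layer_data(via_geom)$count
layer_data(via_stat)$count
table(penguins$species)
```

Since stat = "count" is already the default for geom_bar(), writing geom_bar() alone produces the same plot.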

Statistical summaries

ggplot(penguins, 
       aes(species, flipper_length_mm)) +
  stat_summary()

ggplot(penguins, 
       aes(species, flipper_length_mm)) +
  stat_summary(
    fun.data = mean_se,
    geom = "pointrange"
  )


ggplot(penguins, 
       aes(species, flipper_length_mm)) +
  geom_boxplot() +
  stat_summary(
    fun = mean,
    geom = "point",
    col = "red",
    size = 3
  )


ggplot(penguins, 
       aes(species, flipper_length_mm)) +
  stat_summary(
    fun = mean,
    fun.max = function(y) mean(y) + sd(y),
    fun.min = function(y) mean(y) - sd(y)
  )
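
stat_summary() computes, for each x value, a central value y and limits ymin and ymax; called without arguments it uses mean_se(), the mean plus and minus one standard error. The last plot replaces those limits with the mean plus and minus one standard deviation. A minimal sketch comparing what is drawn with the same summaries computed directly, assuming penguins comes from palmerpenguins (drawn, by_hand_mean, and by_hand_sd are illustrative names):

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data frame

p <- ggplot(penguins, aes(species, flipper_length_mm)) +
  stat_summary(
    fun = mean,
    fun.max = function(y) mean(y) + sd(y),
    fun.min = function(y) mean(y) - sd(y)
  )

# One row per species, with the computed y, ymin, and ymax
drawn <- layer_data(p)

# The same summaries computed directly, ignoring missing values
by_hand_mean <- tapply(penguins$flipper_length_mm, penguins$species,
                       mean, na.rm = TRUE)
by_hand_sd <- tapply(penguins$flipper_length_mm, penguins$species,
                     sd, na.rm = TRUE)

# Both comparisons return TRUE when the plot matches the hand calculation
all.equal(drawn$y, as.vector(by_hand_mean))
all.equal(drawn$ymax, as.vector(by_hand_mean + by_hand_sd))
```

Missing values are dropped by stat_summary() before the summary functions are called, which is why na.rm = TRUE is needed only in the hand calculation.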