library(ggplot2)
15 ggplot2
Data visualization with ggplot2
ggplot2
Base R plot functions require more effort and expertise to create high-quality, publishable graphs.
Over the years, many R packages (e.g. lattice, grid) were introduced to overcome the limitations of base R plot functions.
The ggplot2 package is a more recent addition to R's plotting tools, and it can be used to produce elegant plots without much effort!
The ggplot2 package implements the grammar of graphics, a coherent system for describing and building graphs.
To load the ggplot2 package into the current R environment, call library(ggplot2).
1 Scatter plot
ggplot(data = penguins) +
geom_point(
mapping = aes(x = bill_length_mm,
y = flipper_length_mm)
)
Creating a ggplot
ggplot(data = penguins) +
A blank slate: ggplot() creates a coordinate system to which several layers can be added.
All plots made with the ggplot2 package begin with the ggplot() function.
data is the first argument of ggplot(), and it specifies the data frame to be used for the plot.
One or more layers can be added to ggplot() using a plus sign (+).
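As a minimal sketch of the "blank slate" idea, assuming the penguins data frame comes from the palmerpenguins package: ggplot() on its own holds no layers, and each + adds exactly one. Inspecting the layers element of the plot object is only one convenient way to see this, not something the notes themselves rely on.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data frame

# ggplot() alone sets up the plot but draws no data: it has zero layers
p_empty <- ggplot(data = penguins)
length(p_empty$layers)

# Each `+` adds one layer on top of the blank slate
p_points <- p_empty +
  geom_point(mapping = aes(x = bill_length_mm, y = flipper_length_mm))
length(p_points$layers)
```

Printing p_empty shows an empty panel, while printing p_points shows the scatter plot.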
geom_point
Geometric objects (called geoms) are the shapes we put on a plot (e.g. points, bars).
A plot can have any number of layers, but it must have at least one geom. Common geoms include:
- geom_point() makes a scatter plot by adding a layer of points.
- geom_line() adds a layer of lines connecting data points.
- geom_col() adds bars for bar charts.
- geom_histogram() makes a histogram.
- geom_boxplot() adds boxes for boxplots.
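geom_line() and geom_col() are not used elsewhere in these notes, so here is a small sketch of both. It assumes the penguins data frame from the palmerpenguins package; the helper data frames species_counts and year_counts are made up for this illustration, and the Freq column name comes from as.data.frame(table(...)).

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data frame

# geom_col() expects the bar heights to be in the data already,
# so count the penguins per species first
species_counts <- as.data.frame(table(species = penguins$species))
p_col <- ggplot(data = species_counts) +
  geom_col(mapping = aes(x = species, y = Freq))

# geom_line() connects the observations in order of x;
# here: the number of penguins recorded in each study year
year_counts <- as.data.frame(table(year = penguins$year))
year_counts$year <- as.integer(as.character(year_counts$year))
p_line <- ggplot(data = year_counts) +
  geom_line(mapping = aes(x = year, y = Freq))
```

Print p_col or p_line to draw the corresponding plot.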
mapping = aes()
Each type of geom usually requires a certain set of aesthetics. Aesthetic mappings are set with the aes() function. Examples include:
- x and y (the position on the x and y axes)
- color (“outside” color, like the line around a bar)
- fill (“inside” color, like the color of the bar itself)
- shape (the type of point, like a dot, square, triangle, etc.)
- linetype (solid, dashed, dotted, etc.)
- size (of geoms)
Adding labels, title, and caption to a graph
ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = bill_length_mm,
                  y = flipper_length_mm)
  ) +
  labs(
    x = "Bill length",
    y = "Flipper length",
    title = "Scatter plot of bill and flipper length",
    caption = "R package palmerpenguins") +
  theme_bw()
2 Histogram
geom_histogram() draws a histogram.
Only the x aesthetic is needed in its aes() call.
ggplot(data = penguins) +
geom_histogram(
mapping = aes(x = body_mass_g)
)
The fill argument of geom_histogram() sets the color of the bars.
ggplot(data = penguins) +
geom_histogram(
mapping = aes(x = body_mass_g,
y = after_stat(density)),
fill = "steelblue")
The col argument of geom_histogram() sets the color of the bar outlines.
ggplot(data = penguins) +
geom_histogram(
mapping = aes(x = body_mass_g,
y = after_stat(density)),
fill = "steelblue",
col = "white")
Histogram and density function
geom_density() draws a kernel density estimate of a variable.
ggplot(data = penguins) +
geom_histogram(
mapping = aes(x = body_mass_g,
y = after_stat(density)),
fill = "steelblue",
col = "white") +
geom_density(
mapping = aes(x = body_mass_g,
y = after_stat(density)),
# linewidth replaces size for lines as of ggplot2 3.4.0
col = "brown", linewidth = 1)
- A mapping set in ggplot() is shared by all geom_*() layers
ggplot(data = penguins,
mapping = aes(x = body_mass_g,
y = after_stat(density))) +
geom_histogram(fill = "steelblue",
col = "white") +
geom_density(col = "brown",
linewidth = 1)
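The histograms above map y to after_stat(density) so that the bars and the density curve share one scale. A minimal sketch of what that scale means, assuming the penguins data frame from the palmerpenguins package: on the density scale, the bar areas add up to 1. The bins = 30 setting and the use of layer_data() to inspect the computed values are choices made for this illustration.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data frame

# Drop missing body masses up front so no rows are removed with a warning
mass <- penguins[!is.na(penguins$body_mass_g), ]

p <- ggplot(data = mass,
            mapping = aes(x = body_mass_g, y = after_stat(density))) +
  geom_histogram(bins = 30, fill = "steelblue", col = "white")

# layer_data() returns the values computed by the stat for a layer
bars <- layer_data(p, 1)
bar_width <- bars$xmax - bars$xmin

# On the density scale, the bar areas (height x width) add up to 1
total_area <- sum(bars$density * bar_width)
total_area
```

With the default count scale the bar heights would instead sum to the number of observations, which is why a density curve drawn on top of a count histogram looks flat.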
Exercise 3.2.1
(use the diamonds data to answer the following)
1. Create a histogram of carat and check the effect of bins on the histogram.
2. Add a density line to the plot obtained in Question 1.
3 Boxplot
geom_boxplot() draws a boxplot.
ggplot(data = penguins) +
geom_boxplot(
mapping = aes(x = species,
y = body_mass_g)
)
Boxplot
ggplot(data = penguins) +
  geom_boxplot(
    mapping = aes(x = species,
                  y = body_mass_g)
  ) +
  coord_flip()
Boxplot with original data points
ggplot(data = penguins,
mapping = aes(x = species,
y = body_mass_g)) +
geom_boxplot() +
geom_jitter(width = .2,
mapping = aes(col = species),
size = .75)
geom_jitter() adds a small amount of random variation to the position of each point, which helps to show individual points that would otherwise overlap at each level of a categorical variable.
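To make the effect of the jitter concrete, the sketch below compares the x positions drawn by geom_point() and geom_jitter(). It assumes the penguins data frame from the palmerpenguins package; the set.seed() call and the use of layer_data() are additions for this illustration only.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data frame

# Drop missing body masses so no rows are removed with a warning
mass <- penguins[!is.na(penguins$body_mass_g), ]

set.seed(1)  # jitter is random; a fixed seed makes the result repeatable

base <- ggplot(data = mass, mapping = aes(x = species, y = body_mass_g))

# Without jitter, all points of a species share one x position (1, 2, or 3)
x_plain <- as.numeric(layer_data(base + geom_point(), 1)$x)
sort(unique(x_plain))

# width = .2 moves each point by at most 0.2 units to the left or right
x_jitter <- as.numeric(layer_data(base + geom_jitter(width = .2), 1)$x)

# Horizontal distance of each jittered point from its species position
shift <- x_jitter - round(x_jitter)
range(shift)
```

A larger width spreads the points further apart; keeping it well below 0.5 stops points from drifting into the neighbouring category.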
Exercise 3.2.2
(use the diamonds data to answer the following)
1. Create a boxplot of carat at different levels of cut.
2. Create a scatter plot to examine the effect of carat on price.
Aesthetic mappings
A third variable can be added to a two-dimensional scatter plot by mapping it to an aesthetic
An aesthetic is a visual property of the plot (such as the size, shape, and color of the points).
Points of a plot can be displayed in different ways by changing the values of their aesthetic properties (e.g. the size, shape, or color of points can be changed).
Variables can be linked to the graph using the following properties:
- positions (x, y)
- colors (color, fill)
- shapes (shape, linetype)
- size (size)
- transparency (alpha)
- groupings (group)
ggplot(data = penguins) +
geom_point(
mapping = aes(x = bill_length_mm,
y = flipper_length_mm,
col = species)
)
col is mapped to species, so each level of species gets its own color.
ggplot(data = penguins) +
geom_point(
mapping = aes(x = bill_length_mm,
y = flipper_length_mm,
col = bill_length_mm > 50),
show.legend = FALSE
)
col is mapped to a function of bill_length_mm (here, whether it exceeds 50).
show.legend is a logical argument of the geom_*() functions; FALSE leaves the layer out of the legend.
Besides col, some other aesthetic types are useful in ggplot2:
- size assigns different point sizes to different values of the variable
- alpha controls the transparency of the points
- shape assigns different shapes (at most six by default) to different values of the variable
ggplot2 creates a legend for each variable mapped in aes(), except for x and y.
ggplot(data = penguins) +
geom_point(
mapping = aes(x = bill_length_mm,
y = flipper_length_mm,
alpha = species)
)
ggplot(data = penguins) +
geom_point(
mapping = aes(x = bill_length_mm,
y = flipper_length_mm,
shape = species)
)
Aesthetic properties can also be set manually, e.g. col = "blue" makes all the points blue. This does not convey any information about a variable; it only changes the appearance of the plot.
To set an aesthetic manually, define it outside of aes(), as an argument of the geom_*() function.
ggplot(data = penguins) +
geom_point(
mapping = aes(x = bill_length_mm,
y = flipper_length_mm),
col = "blue")
ggplot(data = penguins) +
geom_point(
mapping = aes(x = bill_length_mm,
y = flipper_length_mm),
col = "blue", alpha = .5)
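A common slip is to put the fixed color inside aes(). The sketch below contrasts the two cases; it assumes the penguins data frame from the palmerpenguins package, and it uses layer_data() only to show which color each layer ends up drawing.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data frame

bills <- penguins[!is.na(penguins$bill_length_mm) &
                    !is.na(penguins$flipper_length_mm), ]

# Set manually: col is outside aes(), so every point is drawn in blue
p_set <- ggplot(data = bills) +
  geom_point(mapping = aes(x = bill_length_mm, y = flipper_length_mm),
             col = "blue")

# Mapped by mistake: inside aes(), "blue" is treated as a variable with a
# single level, so ggplot2 picks a color from its scale and adds a legend
p_mapped <- ggplot(data = bills) +
  geom_point(mapping = aes(x = bill_length_mm, y = flipper_length_mm,
                           col = "blue"))

set_colour <- unique(layer_data(p_set, 1)$colour)        # the literal "blue"
mapped_colour <- unique(layer_data(p_mapped, 1)$colour)  # a palette color
```

If a plot unexpectedly shows a legend with a single entry named after a color, a constant inside aes() is the usual cause.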
geom_smooth()
ggplot(data = penguins,
mapping = aes(x = bill_length_mm,
y = flipper_length_mm)) +
geom_point() +
geom_smooth()
geom_smooth() fits the relationship between two quantitative variables using a smoothing method and adds the fitted line to the plot.
ggplot(data = penguins,
mapping = aes(x = bill_length_mm,
y = flipper_length_mm)) +
geom_point() +
geom_smooth(method = "lm")
ggplot(data = penguins,
mapping = aes(x = bill_length_mm,
y = flipper_length_mm,
col = species)) +
geom_point(size = .75) +
geom_smooth(method = "lm", se = FALSE)
Exercise 3.2.3
(use the diamonds data to answer the following)
1. Create a scatter plot to examine the effect of carat on price, and assign different colors to different levels of cut.
2. Show a fit of a linear model on the scatter plot of carat and price.
3. Show different fits of linear models (price on carat) corresponding to different levels of cut on the scatter plot of price and carat.
Facets
Adding information about another variable to an existing plot can be helpful for data analysis (e.g. by mapping it to an aesthetic).
Facets add information about a categorical variable to an existing plot by splitting the plot into panels according to the levels of that variable.
- facet_wrap() splits the plot into panels, usually by a single variable
- facet_grid() splits the plot by the combination of two variables, laid out in rows and columns
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm,
y = bill_length_mm)) +
facet_wrap(~species)
ggplot(data = penguins,
mapping = aes(
x = flipper_length_mm, y = bill_length_mm, col = species)) +
geom_point() +
geom_smooth(
method = "lm", se = FALSE, col = "black") +
facet_wrap(~species)
ggplot(data = penguins[!is.na(penguins$sex), ]) +
geom_point(mapping = aes(x = flipper_length_mm,
y = bill_length_mm)) +
facet_wrap(~ sex + species)
ggplot(data = penguins[!is.na(penguins$sex), ]) +
geom_point(mapping = aes(x = flipper_length_mm,
y = bill_length_mm)) +
facet_grid(sex ~ species)
ggplot(data = penguins) +
  geom_histogram(
    mapping = aes(x = body_mass_g),
    col = "brown", fill = "yellow"
  ) +
  facet_wrap(~species, ncol = 1)
ggplot(data = penguins) +
  geom_histogram(
    mapping = aes(x = body_mass_g),
    col = "brown", fill = "yellow"
  ) +
  facet_wrap(~species, ncol = 1) +
  theme_minimal()
ggplot(data = penguins) +
  geom_histogram(
    mapping = aes(x = body_mass_g),
    col = "brown", fill = "yellow"
  ) +
  facet_wrap(~species, ncol = 1) +
  theme_minimal() +
  theme(
    panel.grid = element_blank()
  )
Exercise 3.2.4
(use the diamonds data to answer the following)
- Create a histogram of x at different levels of cut.
4 Density plot
Distribution of penguins’ body mass
ggplot(data = penguins) +
geom_density(
mapping = aes(x = body_mass_g,
fill = species),
alpha = .5)
Density plot
Distribution of penguins’ body mass
ggplot(data = penguins) +
geom_density(
mapping = aes(x = body_mass_g,
fill = species),
alpha = .5) +
theme(
legend.position = "top",
legend.title = element_blank()
)
Distribution of penguins’ body mass
ggplot(data = penguins) +
geom_density(
mapping = aes(x = body_mass_g,
fill = species),
alpha = .5) +
theme_minimal(base_size = 7) +
theme(
legend.position = "top",
legend.title = element_blank()
)
Distribution of penguins’ body mass
ggplot(data = penguins) +
geom_density(
mapping = aes(x = body_mass_g,
fill = species),
alpha = .5) +
theme_minimal(base_size = 7) +
theme(
legend.position = "top",
legend.key.size = unit(.75, "lines"),
legend.title = element_blank()
)
5 Barchart
Frequency distribution of species
ggplot(data = penguins) +
geom_bar(aes(x = species))
Frequency distribution of species
ggplot(data = penguins) +
geom_bar(aes(x = species)) +
theme_minimal(base_size = 16)
Barchart with two variables
Distribution of species by year
ggplot(data = penguins) +
geom_bar(aes(x = species, fill = factor(year)),
position = "dodge")
ggplot(data = penguins) +
geom_bar(aes(x = species, fill = factor(year)),
position = "stack")
ggplot(data = penguins) +
geom_bar(aes(x = species, fill = factor(year)),
position = "fill")
Exercise 3.2.5
(use the diamonds data to answer the following)
1. Create a barplot of cut.
2. Create a barplot of color.
3. Create a barplot of cut that shows the distribution of color at different levels of cut.
4. Check the effect of three different values of the position argument when creating a barplot with cut and color.
Homework
Use the gapminder package to get access to the gapminder data.
gapminder has 6 variables and 1704 observations, where the variables are:
#> [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
1. Create a scatter plot to examine how gdpPercap affects lifeExp.
2. Change the scale of the x-axis to log base 10.
3. Add a color layer corresponding to continent to the previous graph.
4. Create a scatter plot of gdpPercap versus lifeExp for different continents in different plotting regions.
5. Add smooth lines to describe the relationship between gdpPercap and lifeExp for different continents separately.
6. Draw a boxplot of lifeExp to compare the distribution of life expectancy for different continents.
7. Draw a histogram of lifeExp and check its shape for different bin sizes.
8. Draw density plots of lifeExp for different continents in a single plot.
9. Make a scatter plot of lifeExp on the y-axis against year on the x-axis.
10. Fit a straight line to estimate the mean life expectancy for a year across different countries.
11. Split the plot for different continents.
12. Add a continent-specific mean line to the plot.
Statistical layers: geom_*() vs stat_*()
ggplot(penguins,
aes(bill_length_mm, flipper_length_mm)) +
geom_smooth(stat = "smooth")
ggplot(penguins,
aes(bill_length_mm, flipper_length_mm)) +
stat_smooth(geom = "smooth")
ggplot(penguins,
aes(bill_length_mm, flipper_length_mm)) +
geom_point(stat = "identity")
ggplot(penguins,
aes(bill_length_mm, flipper_length_mm)) +
stat_identity(geom = "point")
ggplot(penguins, aes(species)) +
geom_bar(stat = "count")
ggplot(penguins, aes(species)) +
stat_count(geom = "bar")
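Each pair above describes the same layer, once from the geom side and once from the stat side. One way to check this, sketched here for the bar chart and assuming the penguins data frame from the palmerpenguins package, is to compare the data that each layer computes. layer_data() is used purely as an inspection tool.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data frame

# The same layer, written from the geom side and from the stat side
p_geom <- ggplot(penguins, aes(species)) +
  geom_bar(stat = "count")
p_stat <- ggplot(penguins, aes(species)) +
  stat_count(geom = "bar")

# layer_data() exposes the data frame that each layer actually draws
d_geom <- layer_data(p_geom, 1)
d_stat <- layer_data(p_stat, 1)

# Both routes count the rows per species and give the same bar heights
d_geom$count
same_counts <- identical(d_geom$count, d_stat$count)
same_counts
```

Which form to write is mostly a matter of emphasis: geom_*() when the shape is the point, stat_*() when the computation is.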
Statistical summaries
ggplot(penguins,
aes(species, flipper_length_mm)) +
stat_summary()
ggplot(penguins,
aes(species, flipper_length_mm)) +
stat_summary(
fun.data = mean_se,
geom = "pointrange"
)
ggplot(penguins,
aes(species, flipper_length_mm)) +
geom_boxplot() +
stat_summary(
fun = mean,
geom = "point",
col = "red",
size = 3
)
ggplot(penguins,
aes(species, flipper_length_mm)) +
stat_summary(
fun = mean,
fun.max = function(y) mean(y) + sd(y),
fun.min = function(y) mean(y) - sd(y)
)
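The summaries above rely on functions that turn a vector into a center and a range. As a small sketch of what happens behind stat_summary(), mean_se() (the function it falls back to when no summary function is supplied) can be called directly. The vector x and the helper mean_sd() are hypothetical and exist only for this illustration.

```r
library(ggplot2)

# mean_se() returns the mean (y) and the mean minus/plus one
# standard error (ymin, ymax) as a one-row data frame
x <- c(1, 2, 3)
res <- mean_se(x)
res

# The same three numbers computed by hand
se <- sd(x) / sqrt(length(x))
c(y = mean(x), ymin = mean(x) - se, ymax = mean(x) + se)

# The last plot above uses mean -/+ one standard deviation instead;
# an equivalent function that could be passed as fun.data
mean_sd <- function(y) {  # hypothetical helper, not part of ggplot2
  data.frame(y = mean(y), ymin = mean(y) - sd(y), ymax = mean(y) + sd(y))
}
mean_sd(x)
```

Passing fun.data = mean_sd to stat_summary() would then replace the three separate fun, fun.min, and fun.max arguments.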