| id | gender | ast101 | ast102 |
|---|---|---|---|
| 101 | F | 57 | 49 |
| 102 | M | 51 | 51 |
| 103 | F | 72 | 26 |
| 104 | F | 58 | 58 |
| 105 | M | 65 | 32 |
| 106 | M | 57 | 62 |
| 107 | F | 65 | 66 |
13 Joining
Joining
Joins
It’s rare that a data analysis involves only a single data frame.
Typically you have many data frames, and you must join them together to answer the questions that you’re interested in.
Motivational example
Year 1
Year 2
| id | ast201 | ast202 |
|---|---|---|
| 101 | 77 | 43 |
| 102 | 72 | 34 |
| 103 | 65 | 41 |
| 104 | 76 | 39 |
| 105 | 75 | 37 |
| 106 | 70 | 35 |
| 201 | 76 | 65 |
-
Create a variable grade which takes the following values:
4.0 (for
), 3.75 (for ),3.5 (for
), 3.25 (for ),3.0 (for
), , 2.5 (for ), and0 (for
)
Suppose, ast101 and ast201 are 4-credit course and other courses are of three credits
Calculate the GPA for each student for two years separately
Calculate CGPA, i.e., overall performance of each student
Compare the performance of male and female on the basis of CGPA
Joins
-
dplyrprovides six join functions:left_join(),inner_join(),right_join(), andfull_join()semi_join(), andanti_join()
-
They all have the same interface:
- they take a pair of data frames (
xandy) and return a data frame
- they take a pair of data frames (
Mutating joins
-
A mutating join allows you to combine variables from two data frames:
it first matches observations by their keys, then copies across variables from one data frame to the other
Like
mutate(), the join functions add variables to the right
-
There are four types of mutating join
-
left_join(),inner_join(),right_join(),full_join()
-
- Let’s define two simple tibbles x and y.
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
To understand how joins work, it’s useful to think of every possible match.
Here we show that with a grid of connecting lines

Inner join

xy <- inner_join(x = x, y = y, by = "key")xy#> # A tibble: 2 × 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2

xy_left <- left_join(x = x, y = y, by = "key")xy_left#> # A tibble: 3 × 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
#> 3 3 x3 <NA>

xy_right <- right_join(x = x, y = y,
by = "key")xy_right#> # A tibble: 3 × 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
#> 3 4 <NA> y3

xy_full <- full_join(x = x, y = y, by = "key")xy_full#> # A tibble: 4 × 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
#> 3 3 x3 <NA>
#> 4 4 <NA> y3
The following Venn diagrams showing the difference between inner, left, right, and full joins.

Filtering joins
Filtering joins
Mutating joins add columns from y to x, matching rows based on the key.
Filtering joins filter rows from
based on the presence or absence of matches in .-
Two types of filtering join:
-
semi_join(), andanti_join()
-
-
semi_join()return all rows from x with a match in y

-
anti_join()keeps rows in x that match zero rows in y

Exam data
Year 1
| id | gender | ast101 | ast102 |
|---|---|---|---|
| 101 | F | 57 | 49 |
| 102 | M | 51 | 51 |
| 103 | F | 72 | 26 |
| 104 | F | 58 | 58 |
| 105 | M | 65 | 32 |
| 106 | M | 57 | 62 |
| 107 | F | 65 | 66 |
Year 2
| id | ast201 | ast202 |
|---|---|---|
| 101 | 77 | 43 |
| 102 | 72 | 34 |
| 103 | 65 | 41 |
| 104 | 76 | 39 |
| 105 | 75 | 37 |
| 106 | 70 | 35 |
| 201 | 76 | 65 |
Calculate CGPA, i.e., overall performance of each student
Compare the performance of male and female on the basis of CGPA