#> # A tibble: 3 × 3
#> id score_2022 score_2023
#> <int> <dbl> <dbl>
#> 1 1 70 75
#> 2 2 85 82
#> 3 3 90 95
12 Reshaping
Introduction
Data rarely comes in the format you want. Reshaping means changing the structure of your dataset so that it becomes:
- Easier to analyse
- Easier to visualize
- Easier to model
In the tidyverse, reshaping is mainly done with:
pivot_longer()pivot_wider()separate()unite()
Conceptual Foundations
Long vs Wide Data
-
Wide format: One row per unit, multiple columns for repeated measures. Example:
id | score_2022 | score_2023 | -
Long format: One row per observation.
id | year | score |
Most statistical models and ggplot2 visualizations prefer long format. Human-readable summary tables often prefer wide format.
pivot_longer() — Wide → Long
Basic Idea
Convert many columns into two columns:
- A “key” column (the name of the original columns)
- A “value” column (the data inside those columns)
Example Data
Converting to Long Format
df_long <- df |>
pivot_longer(
cols = starts_with("score_"),
names_to = "year",
names_prefix = "score_",
values_to = "score"
)
df_long#> # A tibble: 6 × 3
#> id year score
#> <int> <chr> <dbl>
#> 1 1 2022 70
#> 2 1 2023 75
#> 3 2 2022 85
#> 4 2 2023 82
#> 5 3 2022 90
#> 6 3 2023 95
pivot_wider() — Long → Wide
Basic Idea
The reverse of pivot_longer().
Takes a “key” column and spreads it across multiple columns.
Example Long Data
Converting to Wide Format
df_wide <- df_long |>
pivot_wider(
names_from = year,
values_from = score,
names_prefix = "score_"
)
df_wide#> # A tibble: 3 × 3
#> id score_2022 score_2023
#> <dbl> <dbl> <dbl>
#> 1 1 70 NA
#> 2 2 85 82
#> 3 3 90 95
separate() and unite()
separate()
Split one column into multiple columns.
df |>
separate(date, into = c("year", "month", "day"), sep = "-")unite()
Combine multiple columns into a single column.
df |>
unite("full_date", year, month, day, sep = "-")These are very useful before or after pivoting.
Avoid pivoting all columns unintentionally — using
pivot_longer(cols = everything())will also melt ID variables; always select columns explicitly.Column names containing spaces or special characters cause reshaping errors; clean them first (e.g., with
janitor::clean_names()).
Exercises
Exercise 1
| year | Adelie | Chinstrap | Gentoo |
|---|---|---|---|
| 2007 | 50 | 26 | 34 |
| 2008 | 50 | 18 | 46 |
| 2009 | 52 | 24 | 44 |
Convert this to the following:
| year | species | frequency |
|---|---|---|
| 2007 | Adelie | 50 |
| 2007 | Chinstrap | 26 |
| 2007 | Gentoo | 34 |
| 2008 | Adelie | 50 |
| 2008 | Chinstrap | 18 |
| 2008 | Gentoo | 46 |
| 2009 | Adelie | 52 |
| 2009 | Chinstrap | 24 |
| 2009 | Gentoo | 44 |
Exercise 2
The billboard dataset records the billboard rank of songs in the year 2000:
| artist | track | date.entered | wk1 | wk2 | wk3 | wk4 | wk5 | wk6 | wk7 | wk8 | wk9 | wk10 | wk11 | wk12 | wk13 | wk14 | wk15 | wk16 | wk17 | wk18 | wk19 | wk20 | wk21 | wk22 | wk23 | wk24 | wk25 | wk26 | wk27 | wk28 | wk29 | wk30 | wk31 | wk32 | wk33 | wk34 | wk35 | wk36 | wk37 | wk38 | wk39 | wk40 | wk41 | wk42 | wk43 | wk44 | wk45 | wk46 | wk47 | wk48 | wk49 | wk50 | wk51 | wk52 | wk53 | wk54 | wk55 | wk56 | wk57 | wk58 | wk59 | wk60 | wk61 | wk62 | wk63 | wk64 | wk65 | wk66 | wk67 | wk68 | wk69 | wk70 | wk71 | wk72 | wk73 | wk74 | wk75 | wk76 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 Pac | Baby Don’t Cry (Keep… | 2000-02-26 | 87 | 82 | 72 | 77 | 87 | 94 | 99 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2Ge+her | The Hardest Part Of … | 2000-09-02 | 91 | 87 | 92 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 3 Doors Down | Kryptonite | 2000-04-08 | 81 | 70 | 68 | 67 | 66 | 57 | 54 | 53 | 51 | 51 | 51 | 51 | 47 | 44 | 38 | 28 | 22 | 18 | 18 | 14 | 12 | 7 | 6 | 6 | 6 | 5 | 5 | 4 | 4 | 4 | 4 | 3 | 3 | 3 | 4 | 5 | 5 | 9 | 9 | 15 | 14 | 13 | 14 | 16 | 17 | 21 | 22 | 24 | 28 | 33 | 42 | 42 | 49 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 3 Doors Down | Loser | 2000-10-21 | 76 | 76 | 72 | 69 | 67 | 65 | 55 | 59 | 62 | 61 | 61 | 59 | 61 | 66 | 72 | 76 | 75 | 67 | 73 | 70 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 504 Boyz | Wobble Wobble | 2000-04-15 | 57 | 34 | 25 | 17 | 17 | 31 | 36 | 49 | 53 | 57 | 64 | 70 | 75 | 76 | 78 | 85 | 92 | 96 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 98^0 | Give Me Just One Nig… | 2000-08-19 | 51 | 39 | 34 | 26 | 26 | 19 | 2 | 2 | 3 | 6 | 7 | 22 | 29 | 36 | 47 | 67 | 66 | 84 | 93 | 94 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
In this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song. Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week. Here, the column names are one variable (the week) and the cell values are another (the rank). Convert this data to tidy format (long format)
| artist | track | date.entered | week | rank |
|---|---|---|---|---|
| 2 Pac | Baby Don’t Cry (Keep… | 2000-02-26 | 1 | 87 |
| 2 Pac | Baby Don’t Cry (Keep… | 2000-02-26 | 2 | 82 |
| 2 Pac | Baby Don’t Cry (Keep… | 2000-02-26 | 3 | 72 |
| 2 Pac | Baby Don’t Cry (Keep… | 2000-02-26 | 4 | 77 |
| 2 Pac | Baby Don’t Cry (Keep… | 2000-02-26 | 5 | 87 |
| 2 Pac | Baby Don’t Cry (Keep… | 2000-02-26 | 6 | 94 |
Exercise 3
The who dataset contains tuberculosis (TB) case counts in a wide format, where variables related to new TB cases begin with prefixes such as:
new_sp_m014
new_sp_f1524
newrel_f65
| country | iso2 | iso3 | year | new_sp_m014 | new_sp_m1524 | new_sp_m2534 | new_sp_m3544 | new_sp_m4554 | new_sp_m5564 | new_sp_m65 | new_sp_f014 | new_sp_f1524 | new_sp_f2534 | new_sp_f3544 | new_sp_f4554 | new_sp_f5564 | new_sp_f65 | new_sn_m014 | new_sn_m1524 | new_sn_m2534 | new_sn_m3544 | new_sn_m4554 | new_sn_m5564 | new_sn_m65 | new_sn_f014 | new_sn_f1524 | new_sn_f2534 | new_sn_f3544 | new_sn_f4554 | new_sn_f5564 | new_sn_f65 | new_ep_m014 | new_ep_m1524 | new_ep_m2534 | new_ep_m3544 | new_ep_m4554 | new_ep_m5564 | new_ep_m65 | new_ep_f014 | new_ep_f1524 | new_ep_f2534 | new_ep_f3544 | new_ep_f4554 | new_ep_f5564 | new_ep_f65 | newrel_m014 | newrel_m1524 | newrel_m2534 | newrel_m3544 | newrel_m4554 | newrel_m5564 | newrel_m65 | newrel_f014 | newrel_f1524 | newrel_f2534 | newrel_f3544 | newrel_f4554 | newrel_f5564 | newrel_f65 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | AF | AFG | 1980 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Afghanistan | AF | AFG | 1981 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Afghanistan | AF | AFG | 1982 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Afghanistan | AF | AFG | 1983 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Afghanistan | AF | AFG | 1984 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Afghanistan | AF | AFG | 1985 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Using the who dataset, convert all variables whose names begin with "new_" into a long format dataset.
Create:
- a new column named
keycontaining the original column names - a new column named
casescontaining the corresponding values
Display the first six rows of the resulting dataset.
Using the long-format dataset produced in Question 1, the variable key encodes multiple components separated by underscores (e.g., "new_sp_m014").
Split the key variable into the following four new variables:
prefixtypesexage
Use "_" as the separator. Display the first six rows of the resulting dataset.
From the dataset obtained in Question 2, create a new variable by combining:
sexage
into a single variable named sex_age, with the two components separated by a hyphen ("-").
Display the first six rows of the resulting dataset.