12 Reshaping

Introduction

Data rarely comes in the format you want. Reshaping means changing the structure of your dataset so that it becomes:

  • Easier to analyse
  • Easier to visualize
  • Easier to model

In the tidyverse, reshaping is mainly done with:

  • pivot_longer()
  • pivot_wider()
  • separate()
  • unite()

Conceptual Foundations

Long vs Wide Data

  • Wide format: One row per unit, multiple columns for repeated measures. Example:

    id | score_2022 | score_2023 |
  • Long format: One row per observation.

    id | year | score |

Most statistical models and ggplot2 visualizations prefer long format. Human-readable summary tables often prefer wide format.

pivot_longer() — Wide → Long

Basic Idea

Convert many columns into two columns:

  • A “key” column (the name of the original columns)
  • A “value” column (the data inside those columns)

Example Data

df <- tibble(
  id = 1:3,
  score_2022 = c(70, 85, 90),
  score_2023 = c(75, 82, 95)
)
df
#> # A tibble: 3 × 3
#>      id score_2022 score_2023
#>   <int>      <dbl>      <dbl>
#> 1     1         70         75
#> 2     2         85         82
#> 3     3         90         95

Converting to Long Format

df_long <- df |>
  pivot_longer(
    cols = starts_with("score_"),
    names_to = "year",
    names_prefix = "score_",
    values_to = "score"
  )
df_long
#> # A tibble: 6 × 3
#>      id year  score
#>   <int> <chr> <dbl>
#> 1     1 2022     70
#> 2     1 2023     75
#> 3     2 2022     85
#> 4     2 2023     82
#> 5     3 2022     90
#> 6     3 2023     95

pivot_wider() — Long → Wide

Basic Idea

The reverse of pivot_longer().

Takes a “key” column and spreads it across multiple columns.

Example Long Data

df_long <- tibble(
  id = c(1,1,2,2,3,3),
  year = c(2022,2023,2022,2023,2022,2023),
  score = c(70,NA,85,82,90,95)
)
df_long
#> # A tibble: 6 × 3
#>      id  year score
#>   <dbl> <dbl> <dbl>
#> 1     1  2022    70
#> 2     1  2023    NA
#> 3     2  2022    85
#> 4     2  2023    82
#> 5     3  2022    90
#> 6     3  2023    95

Converting to Wide Format

df_wide <- df_long |>
  pivot_wider(
    names_from = year,
    values_from = score,
    names_prefix = "score_"
  )
df_wide
#> # A tibble: 3 × 3
#>      id score_2022 score_2023
#>   <dbl>      <dbl>      <dbl>
#> 1     1         70         NA
#> 2     2         85         82
#> 3     3         90         95

separate() and unite()

separate()

Split one column into multiple columns.

df |>
  separate(date, into = c("year", "month", "day"), sep = "-")

unite()

Combine multiple columns into a single column.

df |>
  unite("full_date", year, month, day, sep = "-")

These are very useful before or after pivoting.


  • Avoid pivoting all columns unintentionally — using pivot_longer(cols = everything()) will also melt ID variables; always select columns explicitly.

  • Column names containing spaces or special characters cause reshaping errors; clean them first (e.g., with janitor::clean_names()).

Exercises

Exercise 1

year Adelie Chinstrap Gentoo
2007 50 26 34
2008 50 18 46
2009 52 24 44

Convert this to the following:

year species frequency
2007 Adelie 50
2007 Chinstrap 26
2007 Gentoo 34
2008 Adelie 50
2008 Chinstrap 18
2008 Gentoo 46
2009 Adelie 52
2009 Chinstrap 24
2009 Gentoo 44

Exercise 2

The billboard dataset records the billboard rank of songs in the year 2000:

billboard |> head() |> knitr::kable()
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9 wk10 wk11 wk12 wk13 wk14 wk15 wk16 wk17 wk18 wk19 wk20 wk21 wk22 wk23 wk24 wk25 wk26 wk27 wk28 wk29 wk30 wk31 wk32 wk33 wk34 wk35 wk36 wk37 wk38 wk39 wk40 wk41 wk42 wk43 wk44 wk45 wk46 wk47 wk48 wk49 wk50 wk51 wk52 wk53 wk54 wk55 wk56 wk57 wk58 wk59 wk60 wk61 wk62 wk63 wk64 wk65 wk66 wk67 wk68 wk69 wk70 wk71 wk72 wk73 wk74 wk75 wk76
2 Pac Baby Don’t Cry (Keep… 2000-02-26 87 82 72 77 87 94 99 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2Ge+her The Hardest Part Of … 2000-09-02 91 87 92 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 Doors Down Kryptonite 2000-04-08 81 70 68 67 66 57 54 53 51 51 51 51 47 44 38 28 22 18 18 14 12 7 6 6 6 5 5 4 4 4 4 3 3 3 4 5 5 9 9 15 14 13 14 16 17 21 22 24 28 33 42 42 49 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 Doors Down Loser 2000-10-21 76 76 72 69 67 65 55 59 62 61 61 59 61 66 72 76 75 67 73 70 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
504 Boyz Wobble Wobble 2000-04-15 57 34 25 17 17 31 36 49 53 57 64 70 75 76 78 85 92 96 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
98^0 Give Me Just One Nig… 2000-08-19 51 39 34 26 26 19 2 2 3 6 7 22 29 36 47 67 66 84 93 94 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

In this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song. Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week. Here, the column names are one variable (the week) and the cell values are another (the rank). Convert this data to tidy format (long format)

artist track date.entered week rank
2 Pac Baby Don’t Cry (Keep… 2000-02-26 1 87
2 Pac Baby Don’t Cry (Keep… 2000-02-26 2 82
2 Pac Baby Don’t Cry (Keep… 2000-02-26 3 72
2 Pac Baby Don’t Cry (Keep… 2000-02-26 4 77
2 Pac Baby Don’t Cry (Keep… 2000-02-26 5 87
2 Pac Baby Don’t Cry (Keep… 2000-02-26 6 94

Exercise 3

The who dataset contains tuberculosis (TB) case counts in a wide format, where variables related to new TB cases begin with prefixes such as:

new_sp_m014
new_sp_f1524
newrel_f65
who |> head(n = 6) |> knitr::kable()
country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 new_sp_m4554 new_sp_m5564 new_sp_m65 new_sp_f014 new_sp_f1524 new_sp_f2534 new_sp_f3544 new_sp_f4554 new_sp_f5564 new_sp_f65 new_sn_m014 new_sn_m1524 new_sn_m2534 new_sn_m3544 new_sn_m4554 new_sn_m5564 new_sn_m65 new_sn_f014 new_sn_f1524 new_sn_f2534 new_sn_f3544 new_sn_f4554 new_sn_f5564 new_sn_f65 new_ep_m014 new_ep_m1524 new_ep_m2534 new_ep_m3544 new_ep_m4554 new_ep_m5564 new_ep_m65 new_ep_f014 new_ep_f1524 new_ep_f2534 new_ep_f3544 new_ep_f4554 new_ep_f5564 new_ep_f65 newrel_m014 newrel_m1524 newrel_m2534 newrel_m3544 newrel_m4554 newrel_m5564 newrel_m65 newrel_f014 newrel_f1524 newrel_f2534 newrel_f3544 newrel_f4554 newrel_f5564 newrel_f65
Afghanistan AF AFG 1980 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Afghanistan AF AFG 1981 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Afghanistan AF AFG 1982 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Afghanistan AF AFG 1983 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Afghanistan AF AFG 1984 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Afghanistan AF AFG 1985 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Using the who dataset, convert all variables whose names begin with "new_" into a long format dataset.

Create:

  • a new column named key containing the original column names
  • a new column named cases containing the corresponding values

Display the first six rows of the resulting dataset.


Using the long-format dataset produced in Question 1, the variable key encodes multiple components separated by underscores (e.g., "new_sp_m014").

Split the key variable into the following four new variables:

  • prefix
  • type
  • sex
  • age

Use "_" as the separator. Display the first six rows of the resulting dataset.


From the dataset obtained in Question 2, create a new variable by combining:

  • sex
  • age

into a single variable named sex_age, with the two components separated by a hyphen ("-").

Display the first six rows of the resulting dataset.