12 Reshaping – R for Data Science

Introduction

Data rarely comes in the format you want. Reshaping means changing the structure of your dataset so that it becomes:

Easier to analyse
Easier to visualize
Easier to model

In the tidyverse, reshaping is mainly done with:

pivot_longer()
pivot_wider()
separate()
unite()

Conceptual Foundations

Long vs Wide Data

Wide format: One row per unit, multiple columns for repeated measures. Example:

id | score_2022 | score_2023 |
Long format: One row per observation.

id | year | score |

Most statistical models and ggplot2 visualizations prefer long format. Human-readable summary tables often prefer wide format.

`pivot_longer()` — Wide → Long

Basic Idea

Convert many columns into two columns:

A “key” column (the name of the original columns)
A “value” column (the data inside those columns)

Example Data

df <- tibble(
  id = 1:3,
  score_2022 = c(70, 85, 90),
  score_2023 = c(75, 82, 95)
)
df

#> # A tibble: 3 × 3
#>      id score_2022 score_2023
#>   <int>      <dbl>      <dbl>
#> 1     1         70         75
#> 2     2         85         82
#> 3     3         90         95

Converting to Long Format

df_long <- df |>
  pivot_longer(
    cols = starts_with("score_"),
    names_to = "year",
    names_prefix = "score_",
    values_to = "score"
  )
df_long

#> # A tibble: 6 × 3
#>      id year  score
#>   <int> <chr> <dbl>
#> 1     1 2022     70
#> 2     1 2023     75
#> 3     2 2022     85
#> 4     2 2023     82
#> 5     3 2022     90
#> 6     3 2023     95

`pivot_wider()` — Long → Wide

Basic Idea

The reverse of pivot_longer().

Takes a “key” column and spreads it across multiple columns.

Example Long Data

df_long <- tibble(
  id = c(1,1,2,2,3,3),
  year = c(2022,2023,2022,2023,2022,2023),
  score = c(70,NA,85,82,90,95)
)
df_long

#> # A tibble: 6 × 3
#>      id  year score
#>   <dbl> <dbl> <dbl>
#> 1     1  2022    70
#> 2     1  2023    NA
#> 3     2  2022    85
#> 4     2  2023    82
#> 5     3  2022    90
#> 6     3  2023    95

Converting to Wide Format

df_wide <- df_long |>
  pivot_wider(
    names_from = year,
    values_from = score,
    names_prefix = "score_"
  )
df_wide

#> # A tibble: 3 × 3
#>      id score_2022 score_2023
#>   <dbl>      <dbl>      <dbl>
#> 1     1         70         NA
#> 2     2         85         82
#> 3     3         90         95

`separate()` and `unite()`

`separate()`

Split one column into multiple columns.

df |>
  separate(date, into = c("year", "month", "day"), sep = "-")

`unite()`

Combine multiple columns into a single column.

df |>
  unite("full_date", year, month, day, sep = "-")

These are very useful before or after pivoting.

Avoid pivoting all columns unintentionally — using pivot_longer(cols = everything()) will also melt ID variables; always select columns explicitly.
Column names containing spaces or special characters cause reshaping errors; clean them first (e.g., with janitor::clean_names()).

Exercises

Exercise 1

year	Adelie	Chinstrap	Gentoo
2007	50	26	34
2008	50	18	46
2009	52	24	44

Convert this to the following:

year	species	frequency
2007	Adelie	50
2007	Chinstrap	26
2007	Gentoo	34
2008	Adelie	50
2008	Chinstrap	18
2008	Gentoo	46
2009	Adelie	52
2009	Chinstrap	24
2009	Gentoo	44

Exercise 2

The billboard dataset records the billboard rank of songs in the year 2000:

billboard |> head() |> knitr::kable()

artist	track	date.entered	wk1	wk2	wk3	wk4	wk5	wk6	wk7	wk8	wk9	wk10	wk11	wk12	wk13	wk14	wk15	wk16	wk17	wk18	wk19	wk20	wk21	wk22	wk23	wk24	wk25	wk26	wk27	wk28	wk29	wk30	wk31	wk32	wk33	wk34	wk35	wk36	wk37	wk38	wk39	wk40	wk41	wk42	wk43	wk44	wk45	wk46	wk47	wk48	wk49	wk50	wk51	wk52	wk53	wk54	wk55	wk56	wk57	wk58	wk59	wk60	wk61	wk62	wk63	wk64	wk65	wk66	wk67	wk68	wk69	wk70	wk71	wk72	wk73	wk74	wk75	wk76
2 Pac	Baby Don’t Cry (Keep…	2000-02-26	87	82	72	77	87	94	99	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2Ge+her	The Hardest Part Of …	2000-09-02	91	87	92	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
3 Doors Down	Kryptonite	2000-04-08	81	70	68	67	66	57	54	53	51	51	51	51	47	44	38	28	22	18	18	14	12	7	6	6	6	5	5	4	4	4	4	3	3	3	4	5	5	9	9	15	14	13	14	16	17	21	22	24	28	33	42	42	49	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
3 Doors Down	Loser	2000-10-21	76	76	72	69	67	65	55	59	62	61	61	59	61	66	72	76	75	67	73	70	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
504 Boyz	Wobble Wobble	2000-04-15	57	34	25	17	17	31	36	49	53	57	64	70	75	76	78	85	92	96	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
98^0	Give Me Just One Nig…	2000-08-19	51	39	34	26	26	19	2	2	3	6	7	22	29	36	47	67	66	84	93	94	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

In this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song. Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week. Here, the column names are one variable (the week) and the cell values are another (the rank). Convert this data to tidy format (long format)

artist	track	date.entered	week	rank
2 Pac	Baby Don’t Cry (Keep…	2000-02-26	1	87
2 Pac	Baby Don’t Cry (Keep…	2000-02-26	2	82
2 Pac	Baby Don’t Cry (Keep…	2000-02-26	3	72
2 Pac	Baby Don’t Cry (Keep…	2000-02-26	4	77
2 Pac	Baby Don’t Cry (Keep…	2000-02-26	5	87
2 Pac	Baby Don’t Cry (Keep…	2000-02-26	6	94

Exercise 3

The who dataset contains tuberculosis (TB) case counts in a wide format, where variables related to new TB cases begin with prefixes such as:

new_sp_m014
new_sp_f1524
newrel_f65

who |> head(n = 6) |> knitr::kable()

country	iso2	iso3	year	new_sp_m014	new_sp_m1524	new_sp_m2534	new_sp_m3544	new_sp_m4554	new_sp_m5564	new_sp_m65	new_sp_f014	new_sp_f1524	new_sp_f2534	new_sp_f3544	new_sp_f4554	new_sp_f5564	new_sp_f65	new_sn_m014	new_sn_m1524	new_sn_m2534	new_sn_m3544	new_sn_m4554	new_sn_m5564	new_sn_m65	new_sn_f014	new_sn_f1524	new_sn_f2534	new_sn_f3544	new_sn_f4554	new_sn_f5564	new_sn_f65	new_ep_m014	new_ep_m1524	new_ep_m2534	new_ep_m3544	new_ep_m4554	new_ep_m5564	new_ep_m65	new_ep_f014	new_ep_f1524	new_ep_f2534	new_ep_f3544	new_ep_f4554	new_ep_f5564	new_ep_f65	newrel_m014	newrel_m1524	newrel_m2534	newrel_m3544	newrel_m4554	newrel_m5564	newrel_m65	newrel_f014	newrel_f1524	newrel_f2534	newrel_f3544	newrel_f4554	newrel_f5564	newrel_f65
Afghanistan	AF	AFG	1980	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Afghanistan	AF	AFG	1981	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Afghanistan	AF	AFG	1982	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Afghanistan	AF	AFG	1983	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Afghanistan	AF	AFG	1984	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Afghanistan	AF	AFG	1985	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

Using the who dataset, convert all variables whose names begin with "new_" into a long format dataset.

Create:

a new column named key containing the original column names
a new column named cases containing the corresponding values

Display the first six rows of the resulting dataset.

Using the long-format dataset produced in Question 1, the variable key encodes multiple components separated by underscores (e.g., "new_sp_m014").

Split the key variable into the following four new variables:

prefix
type
sex
age

Use "_" as the separator. Display the first six rows of the resulting dataset.

From the dataset obtained in Question 2, create a new variable by combining:

sex
age

into a single variable named sex_age, with the two components separated by a hyphen ("-").

Display the first six rows of the resulting dataset.

Introduction

Conceptual Foundations

Long vs Wide Data

pivot_longer() — Wide → Long

Basic Idea

Example Data

Converting to Long Format

pivot_wider() — Long → Wide

Basic Idea

Example Long Data

Converting to Wide Format

separate() and unite()

separate()

unite()

Exercises

Exercise 1

Exercise 2

Exercise 3

`pivot_longer()` — Wide → Long

`pivot_wider()` — Long → Wide

`separate()` and `unite()`

`separate()`

`unite()`