[Image: our illustrated penguins have reached the tidyr package! The photo backdrop is a snowy Antarctic wonderland featuring a Gentoo penguin with outstretched flippers.]

tidyr: info

tidyr helps us transform a dataset into a tidy format.

There are three interrelated rules which make a dataset tidy:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

[Figure: schematic representing the three rules above]
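
As a small illustration of the third rule, here is a minimal sketch using a made-up table in which one cell holds two values. The table `toy`, its column `bill_mm`, and the numbers in it are hypothetical and only for illustration; `separate()` is an existing tidyr function (newer tidyr versions also offer `separate_wider_delim()` for the same job).

```r
library(tidyr)

# Hypothetical toy data: bill length and bill depth share one cell,
# so "each value must have its own cell" is violated.
toy <- tibble::tibble(
  penguin_id = 1:2,
  bill_mm    = c("39.1/18.7", "39.5/17.4")
)

# Split the combined cell into two columns, one value per cell.
# sep = "/" is given explicitly so the decimal points are not treated
# as separators; convert = TRUE turns the pieces into numbers.
tidy_toy <- toy |>
  separate(bill_mm,
           into = c("bill_length_mm", "bill_depth_mm"),
           sep = "/",
           convert = TRUE)

tidy_toy
```

After the split, every column holds one variable and every cell holds one value, which is the shape the three rules describe.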

R for Data Science: Ch 12 Tidy data

Package documentation: https://tidyr.tidyverse.org

tidyr: exercise

Both penguin datasets are already tidy!

We can pretend that penguins wasn’t tidy and instead looked like untidy_penguins below, where body_mass_g is split across separate columns for male, female, and NA (sex not recorded) penguins.

# Spread body_mass_g into one column per value of sex (male, female, NA)
untidy_penguins <- penguins |>
  pivot_wider(names_from = sex, values_from = body_mass_g)
untidy_penguins
# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm  year  male
   <fct>   <fct>              <dbl>         <dbl>             <int> <int> <int>
 1 Adelie  Torgersen           39.1          18.7               181  2007  3750
 2 Adelie  Torgersen           39.5          17.4               186  2007    NA
 3 Adelie  Torgersen           40.3          18                 195  2007    NA
 4 Adelie  Torgersen           NA            NA                  NA  2007    NA
 5 Adelie  Torgersen           36.7          19.3               193  2007    NA
 6 Adelie  Torgersen           39.3          20.6               190  2007  3650
 7 Adelie  Torgersen           38.9          17.8               181  2007    NA
 8 Adelie  Torgersen           39.2          19.6               195  2007  4675
 9 Adelie  Torgersen           34.1          18.1               193  2007    NA
10 Adelie  Torgersen           42            20.2               190  2007    NA
# ℹ 334 more rows
# ℹ 2 more variables: female <int>, `NA` <int>

Now let’s make it tidy again!

We’ll do this with pivot_longer().

untidy_penguins |>
  pivot_longer(cols = male:`NA`,            # columns to gather into rows
               names_to = "sex",            # old column names go here
               values_to = "body_mass_g")   # old cell values go here
# A tibble: 1,032 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm  year sex   
   <fct>   <fct>              <dbl>         <dbl>             <int> <int> <chr> 
 1 Adelie  Torgersen           39.1          18.7               181  2007 male  
 2 Adelie  Torgersen           39.1          18.7               181  2007 female
 3 Adelie  Torgersen           39.1          18.7               181  2007 NA    
 4 Adelie  Torgersen           39.5          17.4               186  2007 male  
 5 Adelie  Torgersen           39.5          17.4               186  2007 female
 6 Adelie  Torgersen           39.5          17.4               186  2007 NA    
 7 Adelie  Torgersen           40.3          18                 195  2007 male  
 8 Adelie  Torgersen           40.3          18                 195  2007 female
 9 Adelie  Torgersen           40.3          18                 195  2007 NA    
10 Adelie  Torgersen           NA            NA                  NA  2007 male  
# ℹ 1,022 more rows
# ℹ 1 more variable: body_mass_g <int>
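
Note that the result above has 1,032 rows rather than the original 344, because every penguin now gets one row per former column, and that sex is a character column in which the missing category appears as the string "NA". The following is a minimal sketch of one way to get closer to the original shape, shown on a small made-up table. The table `toy_untidy` and its values are hypothetical, not taken from penguins; `values_drop_na` is an existing argument of `pivot_longer()`.

```r
library(tidyr)

# Hypothetical toy data shaped like untidy_penguins: body mass is
# recorded in exactly one of the three columns, or in none (id 4).
toy_untidy <- tibble::tibble(
  id     = 1:4,
  male   = c(3750L, NA, NA, NA),
  female = c(NA, 3800L, NA, NA),
  `NA`   = c(NA, NA, 4100L, NA)
)

# values_drop_na = TRUE drops the rows whose value cell is empty,
# so the padding rows created by pivoting disappear.
toy_tidy <- toy_untidy |>
  pivot_longer(cols = male:`NA`,
               names_to = "sex",
               values_to = "body_mass_g",
               values_drop_na = TRUE)

# The former column name "NA" is a string; turn it back into a real
# missing value and store sex as a factor.
toy_tidy$sex <- factor(replace(toy_tidy$sex, toy_tidy$sex == "NA", NA))

toy_tidy
```

One caveat with this approach: a row with no body mass in any of the three columns (id 4 in the toy table) is dropped entirely, so penguins without a recorded body mass would not survive the round trip.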