Summary
Rectangling is the process of untangling lists into tables.Table of Contents
Overview
This post is an abbreviated version of the tidyr
vignette on rectangling, along with a few diversions. According to the vignette, “[r]ectangling is the art and craft of taking a deeply nested list (often sourced from wild caught JSON or XML) and taming it into a tidy data set of rows and columns.” If you prefer to interactively explore the vignette from the command line, you’ll need to install tidyr
, dplyr
and repurrrsive
packages and call the vignette.
#load packages
library(tidyr)
library(dplyr)
library(repurrrsive)
#load vignette in help window
vignette("rectangle", package = "tidyr")
Or the vignette is available on its own web page. The workhorse of the effort are permutations of the unnest_*
function and hoist()
. Both are in the tidyr
package.
The general workfow of rectangling is to populate a tibble with a list. Nested lists will be converted to list-columns whereby the various “unnest” functions can be applied like unnest_wider
, unnest_longer
and unnest_auto
. When only small portions of a list are to be retained, hoist()
may be a more efficient alternative.
Dragons – Base Case
The examples under ?tidyr::hoist()
are a good place to start for rectangling. Here, we’ll begin with the “dragon” example. The data set is a two-column tibble with the first column being an atomic vector of character type. The second column is a list-column.
Insert into tibble
df <- tibble(
character = c("Toothless", "Dory"),
metadata = list(
list(
species = "dragon",
color = "black",
films = c(
"How to Train Your Dragon",
"How to Train Your Dragon 2",
"How to Train Your Dragon: The Hidden World"
)
),
list(
species = "blue tang",
color = "blue",
films = c("Finding Nemo", "Finding Dory")
)
)
)
Inspect
# A tibble: 2 × 2
character metadata
<chr> <list>
1 Toothless <named list [3]>
2 Dory <named list [3]>
unnest_wider()
The first step is to use tidyr::unnest_wider()
.
df %>% unnest_wider(metadata)
# A tibble: 2 × 4
character species color films
<chr> <chr> <chr> <list>
1 Toothless dragon black <chr [3]>
2 Dory blue tang blue <chr [2]>
hoist()
The hoist()
function is helpful when you want to extract specific components.
df %>% hoist(metadata,
"species",
first_film = list("films", 1L),
third_film = list("films", 3L)
)
# A tibble: 2 × 5
character species first_film third_film metadata
<chr> <chr> <chr> <chr> <list>
1 Toothless dragon How to Train Your Dragon How to Train Your D… <named list>
2 Dory blue tang Finding Nemo <NA> <named list>
Full tibble
df %>%
unnest_wider(metadata) %>%
unnest_longer(films)
# A tibble: 5 × 4
character species color films
<chr> <chr> <chr> <chr>
1 Toothless dragon black How to Train Your Dragon
2 Toothless dragon black How to Train Your Dragon 2
3 Toothless dragon black How to Train Your Dragon: The Hidden World
4 Dory blue tang blue Finding Nemo
5 Dory blue tang blue Finding Dory
GitHub users – unnest_wider()
Insert into tibble
users <- tibble(user = gh_users)
Inspect
names(users$user[[1]])
[1] "login" "id" "avatar_url"
[4] "gravatar_id" "url" "html_url"
[7] "followers_url" "following_url" "gists_url"
[10] "starred_url" "subscriptions_url" "organizations_url"
[13] "repos_url" "events_url" "received_events_url"
[16] "type" "site_admin" "name"
[19] "company" "blog" "location"
[22] "email" "hireable" "bio"
[25] "public_repos" "public_gists" "followers"
[28] "following" "created_at" "updated_at"
unnest_wider()
users |> unnest_wider(user)
# A tibble: 6 × 30
login id avatar_url gravatar_id url html_url followers_url following_url
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
1 gabo… 6.60e5 https://a… "" http… https:/… https://api.… https://api.…
2 jenn… 5.99e5 https://a… "" http… https:/… https://api.… https://api.…
3 jtle… 1.57e6 https://a… "" http… https:/… https://api.… https://api.…
4 juli… 1.25e7 https://a… "" http… https:/… https://api.… https://api.…
5 leep… 3.51e6 https://a… "" http… https:/… https://api.… https://api.…
6 masa… 8.36e6 https://a… "" http… https:/… https://api.… https://api.…
# … with 22 more variables: gists_url <chr>, starred_url <chr>,
# subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
# events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
# name <chr>, company <chr>, blog <chr>, location <chr>, email <chr>,
# hireable <lgl>, bio <chr>, public_repos <int>, public_gists <int>,
# followers <int>, following <int>, created_at <chr>, updated_at <chr>
hoist()
users %>% hoist(
user,
followers = "followers",
login = "login",
url = "html_url"
)
# A tibble: 6 × 4
followers login url user
<int> <chr> <chr> <list>
1 303 gaborcsardi https://github.com/gaborcsardi <named list [27]>
2 780 jennybc https://github.com/jennybc <named list [27]>
3 3958 jtleek https://github.com/jtleek <named list [27]>
4 115 juliasilge https://github.com/juliasilge <named list [27]>
5 213 leeper https://github.com/leeper <named list [27]>
6 34 masalmon https://github.com/masalmon <named list [27]>
GitHub repos - unnest_longer()
Insert into tibble
repos <- tibble(repo = repurrrsive::gh_repos)
Inspect
repos
# A tibble: 6 × 1
repo
<list>
1 <list [30]>
2 <list [30]>
3 <list [30]>
4 <list [26]>
5 <list [30]>
6 <list [30]>
unnest_longer()
repos <- repos %>% unnest_longer(repo)
repos
# A tibble: 176 × 1
repo
<list>
1 <named list [68]>
2 <named list [68]>
3 <named list [68]>
4 <named list [68]>
5 <named list [68]>
6 <named list [68]>
7 <named list [68]>
8 <named list [68]>
9 <named list [68]>
10 <named list [68]>
# … with 166 more rows
hoist()
repos %>% hoist(
repo,
login = c("owner", "login"), #<===
name = "name",
homepage = "homepage",
watchers = "watchers_count"
)
# A tibble: 176 × 5
login name homepage watchers repo
<chr> <chr> <chr> <int> <list>
1 gaborcsardi after <NA> 5 <named list [65]>
2 gaborcsardi argufy <NA> 19 <named list [65]>
3 gaborcsardi ask <NA> 5 <named list [65]>
4 gaborcsardi baseimports <NA> 0 <named list [65]>
5 gaborcsardi citest <NA> 0 <named list [65]>
6 gaborcsardi clisymbols "" 18 <named list [65]>
7 gaborcsardi cmaker <NA> 0 <named list [65]>
8 gaborcsardi cmark <NA> 0 <named list [65]>
9 gaborcsardi conditions <NA> 0 <named list [65]>
10 gaborcsardi crayon <NA> 52 <named list [65]>
# … with 166 more rows
Game of Throne Characters
Insert into tibble
Inspect
# A tibble: 30 × 1
char
<list>
1 <named list [18]>
2 <named list [18]>
3 <named list [18]>
4 <named list [18]>
5 <named list [18]>
6 <named list [18]>
7 <named list [18]>
8 <named list [18]>
9 <named list [18]>
10 <named list [18]>
# … with 20 more rows
unnest_wider()
# A tibble: 30 × 18
url id name gender culture born died alive titles aliases father
<chr> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <list> <list> <chr>
1 https://w… 1022 Theo… Male "Ironb… "In … "" TRUE <chr> <chr> ""
2 https://w… 1052 Tyri… Male "" "In … "" TRUE <chr> <chr> ""
3 https://w… 1074 Vict… Male "Ironb… "In … "" TRUE <chr> <chr> ""
4 https://w… 1109 Will Male "" "" "In … FALSE <chr> <chr> ""
5 https://w… 1166 Areo… Male "Norvo… "In … "" TRUE <chr> <chr> ""
6 https://w… 1267 Chett Male "" "At … "In … FALSE <chr> <chr> ""
7 https://w… 1295 Cres… Male "" "In … "In … FALSE <chr> <chr> ""
8 https://w… 130 Aria… Female "Dorni… "In … "" TRUE <chr> <chr> ""
9 https://w… 1303 Daen… Female "Valyr… "In … "" TRUE <chr> <chr> ""
10 https://w… 1319 Davo… Male "Weste… "In … "" TRUE <chr> <chr> ""
# … with 20 more rows, and 7 more variables: mother <chr>, spouse <chr>,
# allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
# playedBy <list>
When the tibble was unnested, there are a collection of list-columns. To see which columns are of the class list
, you can use dplyr’s select_if
function.
chars2 %>% select_if(is.list)
# A tibble: 30 × 7
titles aliases allegiances books povBooks tvSeries playedBy
<list> <list> <list> <list> <list> <list> <list>
1 <chr [3]> <chr [4]> <chr [1]> <chr [3]> <chr [2]> <chr [6]> <chr [1]>
2 <chr [2]> <chr [11]> <chr [1]> <chr [2]> <chr [4]> <chr [6]> <chr [1]>
3 <chr [2]> <chr [1]> <chr [1]> <chr [3]> <chr [2]> <chr [1]> <chr [1]>
4 <chr [1]> <chr [1]> <NULL> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
5 <chr [1]> <chr [1]> <chr [1]> <chr [3]> <chr [2]> <chr [2]> <chr [1]>
6 <chr [1]> <chr [1]> <NULL> <chr [2]> <chr [1]> <chr [1]> <chr [1]>
7 <chr [1]> <chr [1]> <NULL> <chr [2]> <chr [1]> <chr [1]> <chr [1]>
8 <chr [1]> <chr [1]> <chr [1]> <chr [4]> <chr [1]> <chr [1]> <chr [1]>
9 <chr [5]> <chr [11]> <chr [1]> <chr [1]> <chr [4]> <chr [6]> <chr [1]>
10 <chr [4]> <chr [5]> <chr [2]> <chr [1]> <chr [3]> <chr [5]> <chr [1]>
# … with 20 more rows
Build table
# A tibble: 180 × 3
name media value
<chr> <chr> <chr>
1 Theon Greyjoy books A Game of Thrones
2 Theon Greyjoy books A Storm of Swords
3 Theon Greyjoy books A Feast for Crows
4 Theon Greyjoy tvSeries Season 1
5 Theon Greyjoy tvSeries Season 2
6 Theon Greyjoy tvSeries Season 3
7 Theon Greyjoy tvSeries Season 4
8 Theon Greyjoy tvSeries Season 5
9 Theon Greyjoy tvSeries Season 6
10 Tyrion Lannister books A Feast for Crows
# … with 170 more rows
Conclusion
This post loosely followed the tidyr
vignette “Rectangling.” That said, the section on google geocoding was omitted. The vignette demonstrated the use of unnest_wider()
, unnest_longer()
, and hoist()
. It also showed how to solve an error when names in a list are duplicated by adding the argument names_repair = "unique"
.
Acknowledgements
This blog post was made possible thanks to:
- Jenny Bryan, the maintainer of the
repurrrsive
package.
References
Disclaimer
The views, analysis and conclusions presented within this paper represent the author’s alone and not of any other person, organization or government entity. While I have made every reasonable effort to ensure that the information in this article was correct, it will nonetheless contain errors, inaccuracies and inconsistencies. It is a working paper subject to revision without notice as additional information becomes available. Any liability is disclaimed as to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause. The author(s) received no financial support for the research, authorship, and/or publication of this article.
Reproducibility
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.1.3 (2022-03-10)
os macOS Big Sur/Monterey 10.16
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Chicago
date 2022-04-22
pandoc 2.14.1 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
blogdown * 1.9 2022-03-28 [1] CRAN (R 4.1.2)
bookdown 0.25 2022-03-16 [1] CRAN (R 4.1.2)
brio 1.1.3 2021-11-30 [1] CRAN (R 4.1.0)
bslib 0.3.1.9000 2022-03-04 [1] Github (rstudio/bslib@888fbe0)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.0)
callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0)
cli 3.2.0 2022-02-14 [1] CRAN (R 4.1.2)
colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.1.2)
crayon 1.5.1 2022-03-26 [1] CRAN (R 4.1.0)
DBI 1.1.2 2021-12-20 [1] CRAN (R 4.1.0)
desc 1.4.1 2022-03-06 [1] CRAN (R 4.1.2)
devtools * 2.4.3 2021-11-30 [1] CRAN (R 4.1.0)
digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.0)
dplyr * 1.0.8 2022-02-08 [1] CRAN (R 4.1.2)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
evaluate 0.15 2022-02-18 [1] CRAN (R 4.1.2)
fansi 1.0.3 2022-03-24 [1] CRAN (R 4.1.2)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.0)
generics 0.1.2 2022-01-31 [1] CRAN (R 4.1.2)
ggplot2 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
ggthemes * 4.2.4 2021-01-20 [1] CRAN (R 4.1.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.2)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.0)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.1.0)
jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.1.2)
knitr 1.38 2022-03-25 [1] CRAN (R 4.1.0)
lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.1.2)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.0)
munsell 0.5.0.9000 2021-10-19 [1] Github (cwickham/munsell@e539541)
pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2)
pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.1.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
pkgload 1.2.4 2021-11-30 [1] CRAN (R 4.1.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
processx 3.5.3 2022-03-25 [1] CRAN (R 4.1.0)
ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0)
purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.0)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.1.0)
repurrrsive * 1.0.0 2019-07-15 [1] CRAN (R 4.1.0)
rlang 1.0.2 2022-03-04 [1] CRAN (R 4.1.2)
rmarkdown 2.13 2022-03-10 [1] CRAN (R 4.1.2)
rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.1.0)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
sass 0.4.1 2022-03-23 [1] CRAN (R 4.1.2)
scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.0)
stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.0)
stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
testthat 3.1.3 2022-03-29 [1] CRAN (R 4.1.2)
tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.0)
tidyr * 1.2.0 2022-02-01 [1] CRAN (R 4.1.2)
tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.1.2)
usethis * 2.1.5 2021-12-09 [1] CRAN (R 4.1.0)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
vctrs 0.4.0 2022-03-30 [1] CRAN (R 4.1.2)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.0)
xfun 0.30 2022-03-02 [1] CRAN (R 4.1.2)
yaml 2.3.5 2022-02-21 [1] CRAN (R 4.1.2)
[1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────