Summary
When creating a package, documenting your data is a crucial step. While important, it can also be time-consuming: a high-dimensional dataset requires a description of each variable. This post gives a quick method to document data and pass R CMD check.
Overview
When creating a package, documenting your data is a crucial step. While important, it can also be time-consuming: a high-dimensional dataset requires a description of each variable. This post gives a quick method to document data and pass R CMD check. The method relies on paste, cat, and the multiple cursor feature to speed up documentation. It reduces the mistakes that come with manual entry and ensures that every variable is included.
Roxygen for Package Documentation
Chapter 8 of the R-Packages book deals with data. This example pertains to the situation where an author saves the data to the data/ folder, meaning that it is “effectively exported”. Only exported data needs to be documented.
# the 'usethis' package has a specific function
my_pkg_data <- sample(1000)
usethis::use_data(my_pkg_data)
Here’s some sample roxygen code taken from R-Packages. Two roxygen tags are important to note. First, the @format tag describes the dataset. From R-Packages, “you should include a definition list that describes each variable. It’s usually a good idea to describe the variables’ units . . . .” Second, the @source tag records where the data originated.
#' World Health Organization TB data
#'
#' A subset of data from the World Health Organization Global Tuberculosis
#' Report ...
#'
#' @format ## `who`
#' A data frame with 7,240 rows and 60 columns:
#' \describe{
#' \item{country}{Country name}
#' \item{iso2, iso3}{2 & 3 letter ISO country codes}
#' \item{year}{Year}
#' ...
#' }
#' @source <https://www.who.int/teams/global-tuberculosis-programme/data>
"my_pkg_data"
Create Sample Table
To start, let’s create a sample table with 26 variables. (I chose 26 for the number of letters in the alphabet.) Each column of the dataframe/tibble must be described, so each one gets a \item{} line. For a dataframe with a lot of columns, this is time-consuming. Here’s one way to get started.
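library(tibble)
# start with an empty tibble, then overwrite it with one row of 26 values;
# note that rbind() on a bare vector returns a 1 x 26 matrix, not a tibble
df <- tibble()
df <- rbind(1:26)
# build one random lowercase name that is x characters long
variable_names <- function(x) {
  paste0(sample(letters, x, replace = TRUE), collapse = "")
}
# draw 26 lengths between 5 and 20 and assign one random name per column;
# on a matrix this sets a "names" attribute, which names(df) returns later
names(df) <- sample(5:20, 26, replace = TRUE) |>
  purrr::map_chr(variable_names)
df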
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] 1 2 3 4 5 6 7 8 9 10 11 12 13 14
[,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
[1,] 15 16 17 18 19 20 21 22 23 24 25 26
attr(,"names")
[1] "zmrcdtyqpupbpzexf" "tomeqbnwwwbbxkrh" "dxnlulqrannhklhcg"
[4] "mzpjnvzihayykrdeov" "umbdfndihlrgngsxj" "ribzrzlshevtorcchni"
[7] "qjpha" "efpgomezmlrdv" "hvicguldtcegmopbsme"
[10] "ktxieovedtfbxypssqv" "lbahehftmxfdhpcb" "potmymsftvhzxyp"
[13] "qrwfu" "tegbrsjbvoqiprtc" "adqgcascafcopar"
[16] "wbotzawrradsniozhoo" "xupzvlqtutkxfuka" "xhangidttsgzpbdlntv"
[19] "gaqnpmlxp" "llybzoymeidhgehg" "knrknkbxgdl"
[22] "llcwhgcymcveptsytu" "wascpqsewkkrfxhczj" "bidmngrrsjibdgfu"
[25] "wioubalfvexykmj" "fvbdgozzybrojqly"
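The printout above shows that df ends up as a matrix carrying a names attribute rather than a tibble. That is enough for names(df) in the next step, but if you would rather practice on an actual data frame, the following is a minimal sketch in base R. The object name df_tbl, the helper random_name, and the seed are my own choices for illustration, not part of the example above.

```r
# Assumption: a one-row data frame is an acceptable stand-in for package data.
set.seed(42)  # hypothetical seed, only so the random names are repeatable

# build one random lowercase name that is n characters long
random_name <- function(n) {
  paste0(sample(letters, n, replace = TRUE), collapse = "")
}

# as.data.frame() turns the 1 x 26 matrix into a data frame with 26 columns
df_tbl <- as.data.frame(rbind(1:26))

# one random name per column, each between 5 and 20 characters long
names(df_tbl) <- vapply(sample(5:20, 26, replace = TRUE), random_name, character(1))

dim(df_tbl)
```

If you prefer a tibble, tibble::as_tibble(df_tbl) should convert the result.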
Generate Entries
Each variable requires a description. If you send this code to the console, you can copy and paste the output into your documentation. Each description is only a placeholder that repeats the variable name, so it still has to be rewritten by hand, but you won’t miss any variables or spend a lot of time comparing your data to your documentation.
paste0("#' \\item{", names(df), "}{", names(df), "}") |>
  cat(sep = "\n")
#' \item{zmrcdtyqpupbpzexf}{zmrcdtyqpupbpzexf}
#' \item{tomeqbnwwwbbxkrh}{tomeqbnwwwbbxkrh}
#' \item{dxnlulqrannhklhcg}{dxnlulqrannhklhcg}
#' \item{mzpjnvzihayykrdeov}{mzpjnvzihayykrdeov}
#' \item{umbdfndihlrgngsxj}{umbdfndihlrgngsxj}
#' \item{ribzrzlshevtorcchni}{ribzrzlshevtorcchni}
#' \item{qjpha}{qjpha}
#' \item{efpgomezmlrdv}{efpgomezmlrdv}
#' \item{hvicguldtcegmopbsme}{hvicguldtcegmopbsme}
#' \item{ktxieovedtfbxypssqv}{ktxieovedtfbxypssqv}
#' \item{lbahehftmxfdhpcb}{lbahehftmxfdhpcb}
#' \item{potmymsftvhzxyp}{potmymsftvhzxyp}
#' \item{qrwfu}{qrwfu}
#' \item{tegbrsjbvoqiprtc}{tegbrsjbvoqiprtc}
#' \item{adqgcascafcopar}{adqgcascafcopar}
#' \item{wbotzawrradsniozhoo}{wbotzawrradsniozhoo}
#' \item{xupzvlqtutkxfuka}{xupzvlqtutkxfuka}
#' \item{xhangidttsgzpbdlntv}{xhangidttsgzpbdlntv}
#' \item{gaqnpmlxp}{gaqnpmlxp}
#' \item{llybzoymeidhgehg}{llybzoymeidhgehg}
#' \item{knrknkbxgdl}{knrknkbxgdl}
#' \item{llcwhgcymcveptsytu}{llcwhgcymcveptsytu}
#' \item{wascpqsewkkrfxhczj}{wascpqsewkkrfxhczj}
#' \item{bidmngrrsjibdgfu}{bidmngrrsjibdgfu}
#' \item{wioubalfvexykmj}{wioubalfvexykmj}
#' \item{fvbdgozzybrojqly}{fvbdgozzybrojqly}
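The same paste-and-cat idea can be stretched to cover the whole documentation block, not just the \item{} lines. Here is a minimal sketch, assuming the data is a data frame. The function name roxygen_skeleton and the TODO placeholders are hypothetical, and the column class is used as a stand-in description until you write a real one.

```r
# Hypothetical helper: build a roxygen skeleton for a data frame.
# Every TODO marks text that still has to be written by hand.
roxygen_skeleton <- function(data, name) {
  vars <- names(data)
  # first class of each column, used as a placeholder description
  types <- vapply(data, function(col) class(col)[1], character(1))
  c(
    paste0("#' ", name, " (TODO: title)"),
    "#'",
    paste0("#' @format A data frame with ", nrow(data), " rows and ",
           ncol(data), " columns:"),
    "#' \\describe{",
    paste0("#'   \\item{", vars, "}{TODO (", types, ")}"),
    "#' }",
    "#' @source TODO",
    paste0("\"", name, "\"")
  )
}

# usage on a small toy data frame
toy <- data.frame(id = 1:3, label = c("a", "b", "c"))
cat(roxygen_skeleton(toy, "toy"), sep = "\n")
```

The sketch does not escape characters that are special in Rd files, such as %, so variable names containing them would need extra care.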
Multiple Cursors
Once you’ve copied and pasted the console output into your documentation, you may have additional editing to do. One handy feature is the multiple cursor function, which lets you edit several lines at once. On a Mac, you can use it by holding the option key while using the mouse.
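Hand-editing many lines at once makes it easy to delete or mistype an entry, so it can help to confirm afterwards that every column still has an \item{} line. The following is a rough sketch of such a check. The function name documented_vars and the toy inputs are hypothetical, and in practice you might read the lines from your own documentation file with readLines().

```r
# Hypothetical check: pull the variable names out of \item{name}{...} lines.
documented_vars <- function(doc_lines) {
  # keep only the lines that contain an \item{ entry
  item_lines <- grep("\\\\item\\{", doc_lines, value = TRUE)
  # extract the text inside the first pair of braces after \item
  sub(".*\\\\item\\{([^}]*)\\}.*", "\\1", item_lines)
}

# toy documentation block and toy data; column "c" is deliberately undocumented
doc <- c(
  "#' \\describe{",
  "#' \\item{a}{First variable}",
  "#' \\item{b}{Second variable}",
  "#' }"
)
toy <- data.frame(a = 1, b = 2, c = 3)

# columns present in the data but missing from the documentation
setdiff(names(toy), documented_vars(doc))
# → "c"
```

Entries that group several variables, such as \item{iso2, iso3}, come back as a single string, so they would need to be split before comparing.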
Conclusion
Documenting your data is important. Many experts recommend the package form for programming because it makes you adhere to community norms and conform to good coding practices. When you are building a package, you may want to quickly document some data. This post gives you one strategy to program, cut, and paste your way to documenting your data and passing R CMD check.
Disclaimer
The views, analysis and conclusions presented within this paper are the author’s alone and not those of any other person, organization or government entity. While I have made every reasonable effort to ensure that the information in this article was correct, it will nonetheless contain errors, inaccuracies and inconsistencies. It is a working paper subject to revision without notice as additional information becomes available. Any liability is disclaimed as to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause. The author(s) received no financial support for the research, authorship, and/or publication of this article.
Reproducibility
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.1.3 (2022-03-10)
os macOS Big Sur/Monterey 10.16
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2022-10-21
pandoc 2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
blogdown * 1.13 2022-09-24 [1] CRAN (R 4.1.2)
bookdown 0.29 2022-09-12 [1] CRAN (R 4.1.3)
bslib 0.4.0.9000 2022-08-26 [1] Github (rstudio/bslib@fa2e03c)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.0)
callr 3.7.2 2022-08-22 [1] CRAN (R 4.1.2)
cli 3.4.1 2022-09-23 [1] CRAN (R 4.1.2)
codetools 0.2-18 2020-11-04 [1] CRAN (R 4.1.3)
colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.1.2)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.1.3)
DBI 1.1.3 2022-06-18 [1] CRAN (R 4.1.2)
devtools * 2.4.4 2022-07-20 [1] CRAN (R 4.1.2)
digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.0)
dplyr 1.0.10 2022-09-01 [1] CRAN (R 4.1.2)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
evaluate 0.16 2022-08-09 [1] CRAN (R 4.1.2)
fansi 1.0.3 2022-03-24 [1] CRAN (R 4.1.2)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
formatR 1.12 2022-03-31 [1] CRAN (R 4.1.2)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.1.2)
ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.1.2)
ggthemes * 4.2.4 2021-01-20 [1] CRAN (R 4.1.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.2)
gtable 0.3.1 2022-09-01 [1] CRAN (R 4.1.2)
htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.1.2)
htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.1.0)
httpuv 1.6.6 2022-09-08 [1] CRAN (R 4.1.2)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.1.0)
jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.1.2)
knitr 1.40 2022-08-24 [1] CRAN (R 4.1.3)
later 1.3.0 2021-08-18 [1] CRAN (R 4.1.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.1.2)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.1.2)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.1.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.1.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
pillar 1.8.1 2022-08-19 [1] CRAN (R 4.1.2)
pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.1.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
pkgload 1.3.0 2022-06-27 [1] CRAN (R 4.1.2)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
processx 3.7.0 2022-07-07 [1] CRAN (R 4.1.2)
profvis 0.3.7 2020-11-02 [1] CRAN (R 4.1.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.1.0)
ps 1.7.1 2022-06-18 [1] CRAN (R 4.1.2)
purrr 0.3.5 2022-10-06 [1] CRAN (R 4.1.2)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.0)
Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.1.2)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.1.0)
rlang 1.0.6 2022-09-24 [1] CRAN (R 4.1.2)
rmarkdown 2.16 2022-08-24 [1] CRAN (R 4.1.2)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.1.2)
sass 0.4.2 2022-07-16 [1] CRAN (R 4.1.2)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.1.2)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.0)
shiny 1.7.2 2022-07-19 [1] CRAN (R 4.1.2)
stringi 1.7.8 2022-07-11 [1] CRAN (R 4.1.2)
stringr 1.4.1 2022-08-20 [1] CRAN (R 4.1.2)
tibble 3.1.8 2022-07-22 [1] CRAN (R 4.1.2)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.1.2)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.1.0)
usethis * 2.1.6 2022-05-25 [1] CRAN (R 4.1.2)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
vctrs 0.4.2 2022-09-29 [1] CRAN (R 4.1.3)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.0)
xfun 0.33 2022-09-12 [1] CRAN (R 4.1.2)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.1.0)
yaml 2.3.5 2022-02-21 [1] CRAN (R 4.1.2)
[1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────