Summary
With the ubiquitous “FIPS” codes, there is no consistency in how they are formatted. It seems like on every project, the merge is dependent upon the format matching in the datasets and then I’ve usually got a mess on my hands. A missing “0” is often the problem.Table of Contents
Overview
As an illustration, FIPS codes will often look like this where the lengths of the strings are different. Some datasets will have the zero in front, while others will not. Sometimes, the problem isn’t how the data was formatted, but how the analyst imports it.
df <- tibble(fips = c("1000", "01000"))
df
# A tibble: 2 × 1
fips
<chr>
1 1000
2 01000
This wreaks havoc on your project when you go to merge.
Strategy 1 - Set column type
One strategy to dealing with this problem is to read in the data as character so that the formatting is clear. Otherwise, the function will just drop the leading zeros and convert the column to an integer. For example, in the data.table
and utils
package, the code might look like:
utils::read.csv(file = file, colClasses = "character")
#or
data.table::fread(input = input, colClasses = "character")
Strategy 2 - the readr
package
I tried out the readr
package for this problem and it correctly converted the example to a character string and retained the leading zero.
readr::read_csv(file = file)
Strategy 3 - the “Hacky Way”
I’m embarrased to admit that this was some code I routinely used in projects. And I cringe looking at it. The line of code was interrupting a readable magrittr
pipe. (Check out the new pipe |>
). There is a better, more elegant solution.
df$fips[which(nchar(df$fips)==4)] <- paste0("0", df$fips[which(nchar(df$fips)==4)])
df
# A tibble: 2 × 1
fips
<chr>
1 01000
2 01000
Strategy 4 - the “Elegant Way”
library(stringr)
df %>%
mutate(fips = str_pad(fips, width = 5, side = "left", pad = "0"))
# A tibble: 2 × 1
fips
<chr>
1 01000
2 01000
Conclusion
The best strategy is to avoid this problem in the first place. Import the data correctly. Importing all data as character first allows you to see the data as it is without getting transformed by an import package. The next best strategy is to use str_pad
in the stringr
package. It makes for readable code and describes what occurred. Check the Stackoverflow post, “How to add leading zeros?”, for additional ideas.
Sometimes the hardest part of coding is knowing what the best search term is. I wished I’d used the term “pad” and I could’ve avoided years of hacky code. Ugh.
References
Disclaimer
The views, analysis and conclusions presented within this paper represent the author’s alone and not of any other person, organization or government entity. While I have made every reasonable effort to ensure that the information in this article was correct, it will nonetheless contain errors, inaccuracies and inconsistencies. It is a working paper subject to revision without notice as additional information becomes available. Any liability is disclaimed as to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause. The author(s) received no financial support for the research, authorship, and/or publication of this article.
Reproducibility
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.1.0 (2021-05-18)
os macOS Catalina 10.15.7
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Chicago
date 2021-08-29
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
backports 1.2.1 2020-12-09 [1] CRAN (R 4.1.0)
blogdown * 1.4 2021-07-23 [1] CRAN (R 4.1.0)
bookdown 0.23 2021-08-13 [1] CRAN (R 4.1.0)
broom 0.7.9 2021-07-27 [1] CRAN (R 4.1.0)
bslib 0.2.5.1 2021-05-18 [1] CRAN (R 4.1.0)
cachem 1.0.5 2021-05-15 [1] CRAN (R 4.1.0)
callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.1.0)
cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0)
codetools 0.2-18 2020-11-04 [1] CRAN (R 4.1.0)
colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0)
crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
desc 1.3.0 2021-03-05 [1] CRAN (R 4.1.0)
devtools * 2.4.2 2021-06-07 [1] CRAN (R 4.1.0)
digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.1.0)
fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
ggthemes * 4.2.4 2021-01-20 [1] CRAN (R 4.1.0)
glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
haven 2.4.3 2021-08-04 [1] CRAN (R 4.1.0)
hms 1.1.0 2021-05-17 [1] CRAN (R 4.1.0)
htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.1.0)
jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.1.0)
knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0)
lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.1.0)
magrittr * 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
memoise 2.0.0 2021-01-26 [1] CRAN (R 4.1.0)
modelr 0.1.8 2020-05-19 [1] CRAN (R 4.1.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
pillar 1.6.2 2021-07-29 [1] CRAN (R 4.1.0)
pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.1.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
pkgload 1.2.1 2021-04-06 [1] CRAN (R 4.1.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
processx 3.5.2 2021-04-30 [1] CRAN (R 4.1.0)
prompt * 1.0.1 2021-03-12 [1] CRAN (R 4.1.0)
ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0)
purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
R6 2.5.0 2020-10-28 [1] CRAN (R 4.1.0)
Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0)
readr * 2.0.1 2021-08-10 [1] CRAN (R 4.1.0)
readxl 1.3.1 2019-03-13 [1] CRAN (R 4.1.0)
remotes 2.4.0 2021-06-02 [1] CRAN (R 4.1.0)
reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.0)
rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
rmarkdown 2.10 2021-08-06 [1] CRAN (R 4.1.0)
roxygen2 * 7.1.1 2020-06-27 [1] CRAN (R 4.1.0)
rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.1.0)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
rvest 1.0.1 2021-07-26 [1] CRAN (R 4.1.0)
sass 0.4.0 2021-05-12 [1] CRAN (R 4.1.0)
scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
stringi 1.7.3 2021-07-16 [1] CRAN (R 4.1.0)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
testthat 3.0.4 2021-07-01 [1] CRAN (R 4.1.0)
tibble * 3.1.3 2021-07-23 [1] CRAN (R 4.1.0)
tidyr * 1.1.3 2021-03-03 [1] CRAN (R 4.1.0)
tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.1.0)
tzdb 0.1.2 2021-07-20 [1] CRAN (R 4.1.0)
usethis * 2.0.1 2021-02-10 [1] CRAN (R 4.1.0)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
xfun 0.25 2021-08-06 [1] CRAN (R 4.1.0)
xml2 1.3.2 2020-04-23 [1] CRAN (R 4.1.0)
yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
[1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library