Multilingual Letter Frequency in Literature


An Enigma encoding machine from WWII

Summary

This post compares letter frequencies in four works of literature, each in a different language. Letter frequency analysis is the standard way to crack simple substitution ciphers in cryptography.

Overview

Recently, I’ve been studying cryptography. Or, more accurately, a couple of family members have been studying it. For inspiration, we watched “The Imitation Game,” starring Benedict Cumberbatch and Keira Knightley. It’s a great movie and a celebration of one of the great intellects of the twentieth century, Alan Turing. He is widely considered the father of modern computer science. Along with many others at Bletchley Park, he broke Germany’s Enigma code, accelerating the Allied victory.

Caesar Cipher

A “Caesar cipher” encodes a message by shifting every letter a fixed number of places along the alphabet. With a shift of 7, an “A” becomes an “H”, the letter seven places later. This kind of encoding is easily broken by studying the frequency of letters within a language. In English, “E” is the most common letter, accounting for roughly 12.5% of all letters. A seven-letter shift turns every “E” into an “L”, so “L” would then make up about 12.5% of the characters in the encoded text, which gives the shift away.
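
The idea can be sketched in a few lines of base R. This is only an illustration, not code from the analysis below: the function names `caesar_shift` and `guess_shift` are made up for this example, and the guess simply assumes that the most common letter in the encoded text stands for “e”, which holds for reasonably long English text but can fail on short messages.

```r
# Shift every letter of `text` by `shift` places, wrapping around after "z".
# Text is lowercased first; spaces and punctuation are left unchanged.
caesar_shift <- function(text, shift) {
    chars <- strsplit(tolower(text), "")[[1]]
    idx <- match(chars, letters)          # position in the alphabet, NA for non-letters
    is_letter <- !is.na(idx)
    chars[is_letter] <- letters[(idx[is_letter] - 1 + shift) %% 26 + 1]
    paste(chars, collapse = "")
}

# Guess the shift by assuming the most frequent encoded letter stands for "e".
guess_shift <- function(ciphertext) {
    chars <- strsplit(tolower(ciphertext), "")[[1]]
    counts <- table(factor(chars[chars %in% letters], levels = letters))
    top <- names(counts)[which.max(counts)]
    (match(top, letters) - match("e", letters)) %% 26
}

plain <- "the whale seemed to be everywhere when we entered the sea"
secret <- caesar_shift(plain, 7)          # encode with a shift of 7
shift <- guess_shift(secret)              # recover the shift from letter counts
decoded <- caesar_shift(secret, -shift)   # shift back to decode
```

Here the sample sentence was chosen so that “e” is clearly its most common letter; with real ciphertext you would compare the whole frequency profile rather than a single letter.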

Gutenbergr Package

The gutenbergr package in R makes public domain texts from Project Gutenberg freely available. The collection is large and contains 60,000+ free eBooks in different languages. I chose four well-known works of literature, one each in English (en), Spanish (es), German (de), and French (fr). The texts are “Moby Dick” by Herman Melville, “Don Quijote” by Cervantes, “Faust” by Goethe, and “Les Misérables” by Victor Hugo. Each text is in its original language. The letter distributions turned out to be broadly similar across the four languages.

Code

# devtools::install_github('ropensci/gutenbergr')
library(dplyr)
library(tidytext)
library(gutenbergr)

# Moby Dick; Or, The Whale by Herman Melville - English
moby_dick <- gutenberg_download(gutenberg_id = 2701, meta_fields = c("title", "author",
    "language"), verbose = FALSE)

# Les misérables Tome V: Jean Valjean by Victor Hugo - French
les_miserab <- gutenberg_download(gutenberg_id = 17519, meta_fields = c("title",
    "author", "language"), verbose = FALSE)

# Don Quijote by Cervantes - Spanish
don_quijote <- gutenberg_download(gutenberg_id = 2000, meta_fields = c("title", "author",
    "language"), verbose = FALSE)

# Faust by Goethe - German
faust <- gutenberg_download(gutenberg_id = 2229, meta_fields = c("title", "author",
    "language"), verbose = FALSE)
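
# count each letter a-z in a book and express it as a percentage of all letters
find_letter_frequency <- function(book) {
    book |>
        # convert the text encoding to latin1 before tokenizing
        mutate(text = iconv(text, to = "latin1")) |>
        # one row per character (unnest_characters lowercases by default)
        unnest_characters(characters, text) |>
        group_by(author, title, language, characters) |>
        summarize(n = n(), .groups = "drop") |>
        # keep only characters matching a-z
        filter(grepl("[a-z]", characters)) |>
        # share of each letter among the letters kept, in percent
        mutate(total = sum(n)) |>
        mutate(pct = n/total) |>
        mutate(pct = round(pct * 100, 1)) |>
        arrange(desc(pct))
}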
# combine the four books into one data frame
data <- dplyr::bind_rows(
    find_letter_frequency(book = moby_dick),
    find_letter_frequency(book = les_miserab),
    find_letter_frequency(book = don_quijote),
    find_letter_frequency(book = faust)
)

# map the full Gutenberg titles to short display names
convert_names <- tribble(
    ~title, ~title1,
    "Moby Dick; Or, The Whale", "Moby Dick",
    "Les misérables Tome V: Jean Valjean", "Les miserables",
    "Don Quijote", "Don Quijote",
    "Faust: Der Tragödie erster Teil", "Faust"
)
data <- dplyr::left_join(data, convert_names, by = "title") |>
    dplyr::select(-title) |>
    dplyr::rename(title = title1)

# reorder characters high --> low
characters_reordered <- data |>
    filter(language == "en") |>
    arrange(desc(pct)) |>
    select(characters) |>
    pull()
data$characters <- factor(data$characters, levels = characters_reordered)


# reorder plots
data$title <- factor(data$title)
data$title <- forcats::fct_relevel(data$title, "Moby Dick")
# reorder language
data$language <- factor(data$language)
data$language <- forcats::fct_relevel(data$language, c("en", "es", "de", "fr"))
# set colors
colors <- colorspace::qualitative_hcl(4, palette = "dark2")
# plot
library(ggplot2)
data |>
    ggplot() +
    aes(characters, pct, group = language, color = language) +
    geom_point(size = 3, shape = 18) +
    scale_color_manual(values = colors) +
    scale_y_continuous(limits = c(0, 16), breaks = c(0, 5, 10, 15),
        labels = paste0(seq(0, 15, 5), "%"), name = "") +
    scale_x_discrete(name = "") +
    facet_wrap(. ~ title, ncol = 1) +
    theme_minimal() +
    theme(text = element_text(size = 14))
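
For readers who want to try the counting step without downloading any books, here is a minimal base R sketch of the same count-and-percentage idea. It is not the function used above: `letter_frequency` and the sample text are invented for illustration, and it works on a plain character vector rather than a gutenbergr data frame.

```r
# Count the letters a-z in a character vector and return each letter's
# count and percentage share, most frequent first.
letter_frequency <- function(text) {
    chars <- strsplit(tolower(paste(text, collapse = " ")), "")[[1]]
    chars <- chars[chars %in% letters]                  # drop anything that is not a-z
    counts <- table(factor(chars, levels = letters))    # zero counts are kept
    out <- data.frame(
        characters = letters,
        n = as.vector(counts),
        pct = round(100 * as.vector(counts) / sum(counts), 1)
    )
    out[order(-out$n), ]
}

sample_text <- c("Call me Ishmael.", "Some years ago, never mind how long precisely.")
freq <- letter_frequency(sample_text)
head(freq, 3)
```

With only two sentences the percentages are noisy; the profile only settles toward the familiar English distribution on much longer texts.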

Acknowledgements

This blog post was made possible thanks to:

References

[1]
R Core Team, R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2022 [Online]. Available: https://www.R-project.org/
[2]
Y. Xie, C. Dervieux, and A. Presmanes Hill, Blogdown: Create blogs and websites with R Markdown. 2022 [Online]. Available: https://CRAN.R-project.org/package=blogdown
[3]
R. Ihaka et al., Colorspace: A toolbox for manipulating and assessing colors and palettes. 2022 [Online]. Available: https://CRAN.R-project.org/package=colorspace
[4]
H. Wickham et al., ggplot2: Create elegant data visualisations using the grammar of graphics. 2022 [Online]. Available: https://CRAN.R-project.org/package=ggplot2
[5]
D. Robinson and J. Silge, Tidytext: Text mining using dplyr, ggplot2, and other tidy tools. 2022 [Online]. Available: https://github.com/juliasilge/tidytext

Disclaimer

The views, analysis and conclusions presented within this post are those of the author alone and not of any other person, organization or government entity. While I have made every reasonable effort to ensure that the information in this article is correct, it may nonetheless contain errors, inaccuracies and inconsistencies. It is a working paper subject to revision without notice as additional information becomes available. Any liability is disclaimed as to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause. The author(s) received no financial support for the research, authorship, and/or publication of this article.

Reproducibility

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.3 (2022-03-10)
 os       macOS Big Sur/Monterey 10.16
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2022-10-21
 pandoc   2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version    date (UTC) lib source
 assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.1.0)
 blogdown    * 1.13       2022-09-24 [1] CRAN (R 4.1.2)
 bookdown      0.29       2022-09-12 [1] CRAN (R 4.1.3)
 bslib         0.4.0.9000 2022-08-26 [1] Github (rstudio/bslib@fa2e03c)
 cachem        1.0.6      2021-08-19 [1] CRAN (R 4.1.0)
 callr         3.7.2      2022-08-22 [1] CRAN (R 4.1.2)
 cli           3.4.1      2022-09-23 [1] CRAN (R 4.1.2)
 colorspace  * 2.0-3      2022-02-21 [1] CRAN (R 4.1.2)
 crayon        1.5.2      2022-09-29 [1] CRAN (R 4.1.3)
 DBI           1.1.3      2022-06-18 [1] CRAN (R 4.1.2)
 devtools    * 2.4.4      2022-07-20 [1] CRAN (R 4.1.2)
 digest        0.6.29     2021-12-01 [1] CRAN (R 4.1.0)
 dplyr       * 1.0.10     2022-09-01 [1] CRAN (R 4.1.2)
 ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.1.0)
 evaluate      0.16       2022-08-09 [1] CRAN (R 4.1.2)
 fansi         1.0.3      2022-03-24 [1] CRAN (R 4.1.2)
 fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.1.0)
 formatR       1.12       2022-03-31 [1] CRAN (R 4.1.2)
 fs            1.5.2      2021-12-08 [1] CRAN (R 4.1.0)
 generics      0.1.3      2022-07-05 [1] CRAN (R 4.1.2)
 ggplot2     * 3.3.6      2022-05-03 [1] CRAN (R 4.1.2)
 ggthemes    * 4.2.4      2021-01-20 [1] CRAN (R 4.1.0)
 glue          1.6.2      2022-02-24 [1] CRAN (R 4.1.2)
 gtable        0.3.1      2022-09-01 [1] CRAN (R 4.1.2)
 highr         0.9        2021-04-16 [1] CRAN (R 4.1.0)
 htmltools     0.5.3      2022-07-18 [1] CRAN (R 4.1.2)
 htmlwidgets   1.5.4      2021-09-08 [1] CRAN (R 4.1.0)
 httpuv        1.6.6      2022-09-08 [1] CRAN (R 4.1.2)
 janeaustenr   1.0.0      2022-08-26 [1] CRAN (R 4.1.3)
 jquerylib     0.1.4      2021-04-26 [1] CRAN (R 4.1.0)
 jsonlite      1.8.0      2022-02-22 [1] CRAN (R 4.1.2)
 knitr         1.40       2022-08-24 [1] CRAN (R 4.1.3)
 later         1.3.0      2021-08-18 [1] CRAN (R 4.1.0)
 lattice       0.20-45    2021-09-22 [1] CRAN (R 4.1.3)
 lifecycle     1.0.3      2022-10-07 [1] CRAN (R 4.1.2)
 magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.1.2)
 Matrix        1.5-1      2022-09-13 [1] CRAN (R 4.1.2)
 memoise       2.0.1      2021-11-26 [1] CRAN (R 4.1.0)
 mime          0.12       2021-09-28 [1] CRAN (R 4.1.0)
 miniUI        0.1.1.1    2018-05-18 [1] CRAN (R 4.1.0)
 munsell       0.5.0      2018-06-12 [1] CRAN (R 4.1.0)
 pillar        1.8.1      2022-08-19 [1] CRAN (R 4.1.2)
 pkgbuild      1.3.1      2021-12-20 [1] CRAN (R 4.1.0)
 pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.1.0)
 pkgload       1.3.0      2022-06-27 [1] CRAN (R 4.1.2)
 prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.1.0)
 processx      3.7.0      2022-07-07 [1] CRAN (R 4.1.2)
 profvis       0.3.7      2020-11-02 [1] CRAN (R 4.1.0)
 promises      1.2.0.1    2021-02-11 [1] CRAN (R 4.1.0)
 ps            1.7.1      2022-06-18 [1] CRAN (R 4.1.2)
 purrr         0.3.5      2022-10-06 [1] CRAN (R 4.1.2)
 R6            2.5.1      2021-08-19 [1] CRAN (R 4.1.0)
 Rcpp          1.0.9      2022-07-08 [1] CRAN (R 4.1.2)
 remotes       2.4.2      2021-11-30 [1] CRAN (R 4.1.0)
 rlang         1.0.6      2022-09-24 [1] CRAN (R 4.1.2)
 rmarkdown     2.16       2022-08-24 [1] CRAN (R 4.1.2)
 rstudioapi    0.14       2022-08-22 [1] CRAN (R 4.1.2)
 sass          0.4.2      2022-07-16 [1] CRAN (R 4.1.2)
 scales        1.2.1      2022-08-20 [1] CRAN (R 4.1.2)
 sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.1.0)
 shiny         1.7.2      2022-07-19 [1] CRAN (R 4.1.2)
 SnowballC     0.7.0      2020-04-01 [1] CRAN (R 4.1.0)
 stringi       1.7.8      2022-07-11 [1] CRAN (R 4.1.2)
 stringr       1.4.1      2022-08-20 [1] CRAN (R 4.1.2)
 tibble        3.1.8      2022-07-22 [1] CRAN (R 4.1.2)
 tidyselect    1.2.0      2022-10-10 [1] CRAN (R 4.1.2)
 tidytext    * 0.3.4      2022-08-20 [1] CRAN (R 4.1.2)
 tokenizers    0.2.3      2022-09-23 [1] CRAN (R 4.1.2)
 urlchecker    1.0.1      2021-11-30 [1] CRAN (R 4.1.0)
 usethis     * 2.1.6      2022-05-25 [1] CRAN (R 4.1.2)
 utf8          1.2.2      2021-07-24 [1] CRAN (R 4.1.0)
 vctrs         0.4.2      2022-09-29 [1] CRAN (R 4.1.3)
 withr         2.5.0      2022-03-03 [1] CRAN (R 4.1.0)
 xfun          0.33       2022-09-12 [1] CRAN (R 4.1.2)
 xtable        1.8-4      2019-04-21 [1] CRAN (R 4.1.0)
 yaml          2.3.5      2022-02-21 [1] CRAN (R 4.1.2)

 [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────