Summary
This post compares letter frequencies in four works of literature, each in a different language. Letter frequency analysis is how simple substitution ciphers are cracked in cryptography.
Overview
Recently, I’ve been studying cryptography. Or, more accurately, a couple of family members have been studying it. For inspiration, we watched “The Imitation Game,” starring Benedict Cumberbatch and Keira Knightley. It’s a great movie and a celebration of one of the great intellects of the twentieth century, Alan Turing. He is widely considered the creator of the modern computer. Along with many others at Bletchley Park, he broke Germany’s Enigma code, accelerating the Allied victory.
Caesar Cipher
A “Caesar cipher” encodes data with a simple shift of the alphabet. If you shift the alphabet 7 places, an “A” becomes “H”, the eighth letter. This kind of encoding is easily broken by studying the frequency of letters within a language. In English, “E” is the most common letter, accounting for roughly 12.5% of all letters. A seven-letter shift moves “E” to “L”, so every “E” becomes an “L”, and “L” would then make up about 12.5% of the characters in the encoded text.
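To make the shift concrete, here is a minimal sketch in base R. The function name caesar_shift is made up for this illustration and is not part of any package. It assumes the plain 26-letter alphabet, so accented letters, spaces and punctuation pass through unchanged.

```r
# Shift every letter of `text` by `shift` places, wrapping around after "Z".
# Illustrative helper only: handles A-Z and a-z, leaves everything else as is.
caesar_shift <- function(text, shift) {
  shift <- shift %% 26
  # the alphabet rotated by `shift` places
  shifted <- c(LETTERS, LETTERS)[(1:26) + shift]
  chartr(
    old = paste(c(LETTERS, letters), collapse = ""),
    new = paste(c(shifted, tolower(shifted)), collapse = ""),
    x = text
  )
}

caesar_shift("A", 7)  # "H"
caesar_shift("E", 7)  # "L"

# encoding and then shifting back by the same amount recovers the message
secret <- caesar_shift("Call me Ishmael.", 7)
caesar_shift(secret, -7)
```

Decoding is the same operation with the shift reversed, which is why knowing the shift is all it takes to read the message.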
Gutenbergr Package
With the gutenbergr package in R, public domain texts from Project Gutenberg are freely available. The collection is large and contains 60,000+ free eBooks in different languages. I chose four well-known works of literature, one each in English (en), Spanish (es), German (de), and French (fr). The texts are “Moby Dick” by Herman Melville, “Don Quijote” by Cervantes, “Faust” by Goethe, and “Les Misérables” by Victor Hugo, each in its original language. The letter distributions in the other three languages were similar to the English one.
Code
# devtools::install_github('ropensci/gutenbergr')
library(dplyr)
library(tidytext)
library(gutenbergr)
# Moby Dick; Or, The Whale by Herman Melville - English
moby_dick <- gutenberg_download(
  gutenberg_id = 2701,
  meta_fields = c("title", "author", "language"),
  verbose = FALSE
)
# Les misérables Tome V: Jean Valjean by Victor Hugo - French
les_miserab <- gutenberg_download(
  gutenberg_id = 17519,
  meta_fields = c("title", "author", "language"),
  verbose = FALSE
)
# Don Quijote by Cervantes - Spanish
don_quijote <- gutenberg_download(
  gutenberg_id = 2000,
  meta_fields = c("title", "author", "language"),
  verbose = FALSE
)
# Faust by Goethe - German
faust <- gutenberg_download(
  gutenberg_id = 2229,
  meta_fields = c("title", "author", "language"),
  verbose = FALSE
)
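# build function: percentage share of each letter within one book
find_letter_frequency <- function(book) {
  book |>
    # re-encode the text as latin1
    mutate(text = iconv(text, to = "latin1")) |>
    # one row per character
    unnest_characters(characters, text) |>
    # count occurrences of each character
    group_by(author, title, language, characters) |>
    summarize(n = n(), .groups = "drop") |>
    # keep only characters matching [a-z]
    filter(grepl("[a-z]", characters)) |>
    # share of the retained letters, in percent
    mutate(total = sum(n)) |>
    mutate(pct = n / total) |>
    mutate(pct = round(pct * 100, 1)) |>
    arrange(desc(pct))
}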
# combine
data <- dplyr::bind_rows(
  find_letter_frequency(book = moby_dick),
  find_letter_frequency(book = les_miserab),
  find_letter_frequency(book = don_quijote),
  find_letter_frequency(book = faust)
)
# map the full Gutenberg titles to short display titles
convert_names <- tribble(
  ~title, ~title1,
  "Moby Dick; Or, The Whale", "Moby Dick",
  "Les misérables Tome V: Jean Valjean", "Les miserables",
  "Don Quijote", "Don Quijote",
  "Faust: Der Tragödie erster Teil", "Faust"
)
data <- dplyr::left_join(data, convert_names, by = "title") |>
  dplyr::select(-title) |>
  dplyr::rename(title = title1)
# reorder characters high --> low, using the English frequencies
characters_reordered <- data |>
  filter(language == "en") |>
  arrange(desc(pct)) |>
  pull(characters)
data$characters <- factor(data$characters, levels = characters_reordered)
# reorder plots
data$title <- factor(data$title)
data$title <- forcats::fct_relevel(data$title, "Moby Dick")
# reorder language
data$language <- factor(data$language)
data$language <- forcats::fct_relevel(data$language, c("en", "es", "de", "fr"))
# set colors
colors <- colorspace::qualitative_hcl(4, palette = "dark2")
# plot
library(ggplot2)
data |>
  ggplot() +
  aes(characters, pct, group = language, color = language) +
  geom_point(size = 3, shape = 18) +
  scale_color_manual(values = colors) +
  scale_y_continuous(
    limits = c(0, 16),
    breaks = c(0, 5, 10, 15),
    labels = paste0(seq(0, 15, 5), "%"),
    name = ""
  ) +
  scale_x_discrete(name = "") +
  facet_wrap(. ~ title, ncol = 1) +
  theme_minimal() +
  theme(text = element_text(size = 14))
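The counting step can also be tried without downloading anything, and it shows how a frequency table breaks a shift cipher. The sketch below uses base R only. The helper names letter_share and guess_shift are hypothetical, the sample sentence is invented, and the guess assumes that the most common letter in the hidden message is “e”. That assumption is reasonable for long English texts but can easily fail for short ones.

```r
# Share of each letter a-z among all letters in `text` (illustrative helper).
letter_share <- function(text) {
  chars <- strsplit(tolower(text), "")[[1]]
  chars <- chars[chars %in% letters]          # drop spaces, digits, punctuation
  counts <- table(factor(chars, levels = letters))
  counts / sum(counts)
}

# Guess the shift by assuming the most common ciphertext letter stands for "e".
guess_shift <- function(ciphertext) {
  share <- letter_share(ciphertext)
  top <- names(share)[which.max(share)]
  (match(top, letters) - match("e", letters)) %% 26
}

# an invented sample in which "e" is clearly the most common letter
plain <- "tell me where the sheep sleep"

# encode with a shift of 7: a -> h, b -> i, ..., e -> l, ...
cipher <- chartr(
  old = paste(letters, collapse = ""),
  new = paste(c(letters[8:26], letters[1:7]), collapse = ""),
  x = plain
)

round(100 * letter_share(plain)["e"], 1)  # share of "e" in the plain text, in percent
guess_shift(cipher)                       # 7 for this sample
```

With a whole novel instead of one sentence, the shares settle close to the language-wide values plotted above, which is what makes the guess dependable.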
Disclaimer
The views, analysis and conclusions presented within this paper are the author’s alone and not those of any other person, organization or government entity. While I have made every reasonable effort to ensure that the information in this article is correct, it will nonetheless contain errors, inaccuracies and inconsistencies. It is a working paper subject to revision without notice as additional information becomes available. Any liability is disclaimed as to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause. The author(s) received no financial support for the research, authorship, and/or publication of this article.
Reproducibility
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.1.3 (2022-03-10)
os macOS Big Sur/Monterey 10.16
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2022-10-21
pandoc 2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
blogdown * 1.13 2022-09-24 [1] CRAN (R 4.1.2)
bookdown 0.29 2022-09-12 [1] CRAN (R 4.1.3)
bslib 0.4.0.9000 2022-08-26 [1] Github (rstudio/bslib@fa2e03c)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.0)
callr 3.7.2 2022-08-22 [1] CRAN (R 4.1.2)
cli 3.4.1 2022-09-23 [1] CRAN (R 4.1.2)
colorspace * 2.0-3 2022-02-21 [1] CRAN (R 4.1.2)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.1.3)
DBI 1.1.3 2022-06-18 [1] CRAN (R 4.1.2)
devtools * 2.4.4 2022-07-20 [1] CRAN (R 4.1.2)
digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.0)
dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.1.2)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
evaluate 0.16 2022-08-09 [1] CRAN (R 4.1.2)
fansi 1.0.3 2022-03-24 [1] CRAN (R 4.1.2)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
formatR 1.12 2022-03-31 [1] CRAN (R 4.1.2)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.1.2)
ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.1.2)
ggthemes * 4.2.4 2021-01-20 [1] CRAN (R 4.1.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.2)
gtable 0.3.1 2022-09-01 [1] CRAN (R 4.1.2)
highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.1.2)
htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.1.0)
httpuv 1.6.6 2022-09-08 [1] CRAN (R 4.1.2)
janeaustenr 1.0.0 2022-08-26 [1] CRAN (R 4.1.3)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.1.0)
jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.1.2)
knitr 1.40 2022-08-24 [1] CRAN (R 4.1.3)
later 1.3.0 2021-08-18 [1] CRAN (R 4.1.0)
lattice 0.20-45 2021-09-22 [1] CRAN (R 4.1.3)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.1.2)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.1.2)
Matrix 1.5-1 2022-09-13 [1] CRAN (R 4.1.2)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.1.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.1.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
pillar 1.8.1 2022-08-19 [1] CRAN (R 4.1.2)
pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.1.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
pkgload 1.3.0 2022-06-27 [1] CRAN (R 4.1.2)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
processx 3.7.0 2022-07-07 [1] CRAN (R 4.1.2)
profvis 0.3.7 2020-11-02 [1] CRAN (R 4.1.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.1.0)
ps 1.7.1 2022-06-18 [1] CRAN (R 4.1.2)
purrr 0.3.5 2022-10-06 [1] CRAN (R 4.1.2)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.0)
Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.1.2)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.1.0)
rlang 1.0.6 2022-09-24 [1] CRAN (R 4.1.2)
rmarkdown 2.16 2022-08-24 [1] CRAN (R 4.1.2)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.1.2)
sass 0.4.2 2022-07-16 [1] CRAN (R 4.1.2)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.1.2)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.0)
shiny 1.7.2 2022-07-19 [1] CRAN (R 4.1.2)
SnowballC 0.7.0 2020-04-01 [1] CRAN (R 4.1.0)
stringi 1.7.8 2022-07-11 [1] CRAN (R 4.1.2)
stringr 1.4.1 2022-08-20 [1] CRAN (R 4.1.2)
tibble 3.1.8 2022-07-22 [1] CRAN (R 4.1.2)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.1.2)
tidytext * 0.3.4 2022-08-20 [1] CRAN (R 4.1.2)
tokenizers 0.2.3 2022-09-23 [1] CRAN (R 4.1.2)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.1.0)
usethis * 2.1.6 2022-05-25 [1] CRAN (R 4.1.2)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
vctrs 0.4.2 2022-09-29 [1] CRAN (R 4.1.3)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.0)
xfun 0.33 2022-09-12 [1] CRAN (R 4.1.2)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.1.0)
yaml 2.3.5 2022-02-21 [1] CRAN (R 4.1.2)
[1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────