Summary
The #tidytuesday release was the Hidden Gems project, a collection of notebook reviews from Kaggle.
Overview
Begun in 2018, TidyTuesday releases a dataset each week and challenges the R community to draw insight from it, doing what data analysts do. For the week of April 26, the challenge was the “Kaggle Hidden Gems” dataset. “Hidden gems” are items that are good but unrecognized or overlooked by others. In this spirit, in each episode Martin Henze of Heads or Tails finds and promotes three underrated notebooks (formerly “kernels”) that he believes deserve a wider audience. The Kaggle Hidden Gems dataset comprises 300 notebooks compiled over 100 episodes.
Background
The original dataset has twelve columns and 300 observations:
Rows: 300
Columns: 12
$ vol <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, …
$ date <date> 2020-05-12, 2020-05-12, 2020-05-12, 2020-05-19, 2020-…
$ link_forum <chr> "https://www.kaggle.com/general/150603", "https://www.…
$ link_twitter <chr> "https://twitter.com/heads0rtai1s/status/1260289884943…
$ notebook <chr> "https://www.kaggle.com/hansjoerg/glmnet-xgboost-and-s…
$ author_kaggle <chr> "hansjoerg", "parulpandey", "jonathanbouchet", "andrad…
$ title <chr> "Glmnet, XGBoost, and SVM Using tidymodels", "Breathe …
$ review <chr> "A well-structured and documented tutorial on how to u…
$ author_name <chr> "Hansjoerg", "Parul Pandey", "Jonathan Bouchet", "Andr…
$ author_twitter <chr> "https://twitter.com/hansjoerg_me", "https://twitter.c…
$ author_linkedin <chr> NA, "https://www.linkedin.com/in/parul-pandey-a5498975…
$ notes <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
The most interesting variables were the title and Martin Henze’s review. Both are text, so I went in search of some continuous variables, and potentially categorical ones too. The notes field also had some interesting items: at regular intervals, users got to choose outstanding notebooks.
Feature Creation
With some wrangling, additional features were constructed: (1) title_length, the length of the title in characters (the first title, “Glmnet, XGBoost, and SVM Using tidymodels”, has 41 characters, matching the first value below); (2) review_length, the length of the review in words; (3) special, for notebooks chosen for a special milestone; (4) community, for notebooks chosen by users; (5) sentiment_afinn, a sentiment score for the review; (6) frequency, the number of notebooks in the dataset from a single author; (7) prior_submission, a logical value for authors with more than one notebook; and (8) link, an HTML link to the actual Kaggle notebook.
Rows: 300
Columns: 8
$ title_length <int> 41, 43, 35, 33, 26, 35, 34, 47, 36, 32, 44, 50, 41, 4…
$ review_length <int> 162, 129, 190, 107, 170, 119, 178, 259, 182, 276, 159…
$ special <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ community <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ sentiment_afinn <dbl> NA, NA, 15, 2, 2, 2, 4, NA, 5, 3, 1, 4, -1, 7, 6, 5, …
$ frequency <fct> 1, 3, 9, 2, 3, 3, 2, 1, 2, 3, 3, 1, 1, 1, 1, 2, 1, 1,…
$ prior_submission <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRU…
$ link <chr> "<a href='https://www.kaggle.com/hansjoerg/glmnet-xgb…
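As a rough illustration of the wrangling described above, here is a minimal base R sketch on a three-row toy data frame. It is not the code behind this post: the second and third rows, the example.com URLs, the wording of the notes entry, the "community" search pattern, and the choice of the title as link text are all assumptions made for the example.

```r
# Toy stand-in for the Hidden Gems data. Only the first title comes from the
# dataset; the other rows, the URLs, and the notes text are made up.
gems <- data.frame(
  title = c("Glmnet, XGBoost, and SVM Using tidymodels",
            "A second notebook",
            "A third notebook"),
  author_kaggle = c("hansjoerg", "author_b", "author_b"),
  notebook = c("https://example.com/notebook-1",
               "https://example.com/notebook-2",
               "https://example.com/notebook-3"),
  notes = c(NA, "Community pick", NA),
  stringsAsFactors = FALSE
)

# Title length in characters
gems$title_length <- nchar(gems$title)

# Number of notebooks per author, looked up for every row
author_counts <- table(gems$author_kaggle)
gems$frequency <- as.integer(author_counts[gems$author_kaggle])

# TRUE when the author appears more than once
gems$prior_submission <- gems$frequency > 1

# Flag community picks from the notes field (the pattern is hypothetical)
gems$community <- !is.na(gems$notes) &
  grepl("community", gems$notes, ignore.case = TRUE)

# HTML anchor pointing at the notebook (title used as link text by assumption)
gems$link <- sprintf("<a href='%s'>%s</a>", gems$notebook, gems$title)

gems[, c("title_length", "frequency", "prior_submission", "community")]
```

In the real data, frequency is stored as a factor rather than an integer; the sketch keeps it as an integer so that the comparison for prior_submission stays simple.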
Sentiment Indicator
The sentiment indicator was created using the tidytext package, following the methodology described in Chapter 2 of Text Mining with R by Julia Silge and David Robinson. Text mining can be used to provide the “emotional content” of a text programmatically. Here, the AFINN sentiment dictionary was applied to the review field. Each word in the dictionary is assigned an integer score between -5 (most negative) and 5 (most positive). The words at the extremes of the dictionary include the following:
Words of Negative Sentiment
# A tibble: 10 × 2
word value
<chr> <dbl>
1 bastard -5
2 bastards -5
3 bitch -5
4 bitches -5
5 cock -5
6 cocksucker -5
7 cocksuckers -5
8 cunt -5
9 motherfucker -5
10 motherfucking -5
Words of Positive Sentiment
# A tibble: 5 × 2
word value
<chr> <dbl>
1 breathtaking 5
2 hurrah 5
3 outstanding 5
4 superb 5
5 thrilled 5
“Dictionary-based methods like the ones we are discussing find the total sentiment of a piece of text by adding up the individual sentiment scores for each word in the text,” according to Silge and Robinson. The individual word scores were then summed for each notebook title, resulting in the sentiment_afinn variable. The AFINN dictionary contains 2,477 words, and “NA” was returned for 87 of the 300 entries. Presumably, those 87 reviews did not contain any words in the dictionary.
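To make the scoring step concrete, here is a small sketch of dictionary-based sentiment scoring. It uses base R rather than tidytext so that it runs without extra packages, and a four-word toy lexicon rather than the real AFINN dictionary; the three reviews, the id column, and the score given to “poor” are assumptions for illustration. It also shows how a review with no dictionary words ends up as NA.

```r
# Toy lexicon standing in for AFINN (scores are illustrative only)
lexicon <- data.frame(
  word  = c("outstanding", "superb", "thrilled", "poor"),
  value = c(5, 5, 5, -2),
  stringsAsFactors = FALSE
)

# Made-up reviews; the second contains no lexicon words
reviews <- data.frame(
  id = 1:3,
  review = c("An outstanding and superb tutorial",
             "A walkthrough of the data",
             "Superb plots but poor documentation"),
  stringsAsFactors = FALSE
)

# Tokenize: lower-case, then split on anything that is not a letter or apostrophe
tokens_list <- strsplit(tolower(reviews$review), "[^a-z']+")
tokens <- data.frame(
  id   = rep(reviews$id, lengths(tokens_list)),
  word = unlist(tokens_list),
  stringsAsFactors = FALSE
)

# Keep only tokens that appear in the lexicon, then sum the scores per review
matched <- merge(tokens, lexicon, by = "word")
scores  <- aggregate(value ~ id, data = matched, FUN = sum)
names(scores)[2] <- "sentiment_afinn"

# Left join back onto the reviews: unmatched reviews get NA
result <- merge(reviews, scores, by = "id", all.x = TRUE)
result[, c("id", "sentiment_afinn")]
```

Because the join back to the reviews keeps every row, a review without any scored words shows up as NA instead of 0, which is the behaviour I assume explains the 87 missing values.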
Conclusion
All 300 notebooks were considered overlooked and among the best. There was very little correlation among the variables, but sentiment tended to be higher where Martin Henze was familiar with an author’s past work. Alternatively, this pattern could mean that Kagglers’ notebook skills improved with additional submissions. By creating the plot and table, I have a nice index of outstanding Kaggle notebooks to consult in the future.
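For readers who want to check a correlation like this themselves, the sketch below shows one way to handle the missing sentiment scores. It uses only the first eleven values of review_length and sentiment_afinn visible in the output above, so the resulting number is not the result for the full dataset; the variable names n_complete and r are mine.

```r
# First eleven values shown in the feature table above
review_length   <- c(162, 129, 190, 107, 170, 119, 178, 259, 182, 276, 159)
sentiment_afinn <- c(NA, NA, 15, 2, 2, 2, 4, NA, 5, 3, 1)

# How many rows have both values present
n_complete <- sum(complete.cases(review_length, sentiment_afinn))

# Pearson correlation, dropping rows where either value is NA
r <- cor(review_length, sentiment_afinn, use = "complete.obs")

n_complete
r
```

Without use = "complete.obs", cor() returns NA as soon as a single sentiment score is missing.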
Acknowledgements
This blog post was made possible thanks to:
References
Disclaimer
The views, analysis and conclusions presented within this paper are the author’s alone and not those of any other person, organization or government entity. While I have made every reasonable effort to ensure that the information in this article is correct, it may nonetheless contain errors, inaccuracies and inconsistencies. It is a working paper subject to revision without notice as additional information becomes available. Any liability is disclaimed as to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause. The author(s) received no financial support for the research, authorship, and/or publication of this article.
Reproducibility
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.1.3 (2022-03-10)
os macOS Big Sur/Monterey 10.16
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Chicago
date 2022-04-28
pandoc 2.14.1 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.0)
bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0)
blogdown * 1.9 2022-03-28 [1] CRAN (R 4.1.2)
bookdown 0.25 2022-03-16 [1] CRAN (R 4.1.2)
brio 1.1.3 2021-11-30 [1] CRAN (R 4.1.0)
broom 0.7.12 2022-01-28 [1] CRAN (R 4.1.2)
bslib 0.3.1.9000 2022-03-04 [1] Github (rstudio/bslib@888fbe0)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.0)
callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0)
cli 3.2.0 2022-02-14 [1] CRAN (R 4.1.2)
colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.1.2)
crayon 1.5.1 2022-03-26 [1] CRAN (R 4.1.0)
crosstalk 1.2.0 2021-11-04 [1] CRAN (R 4.1.0)
curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.0)
data.table 1.14.2 2021-09-27 [1] CRAN (R 4.1.0)
DBI 1.1.2 2021-12-20 [1] CRAN (R 4.1.0)
desc 1.4.1 2022-03-06 [1] CRAN (R 4.1.2)
devtools * 2.4.3 2021-11-30 [1] CRAN (R 4.1.0)
digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.0)
dplyr * 1.0.8 2022-02-08 [1] CRAN (R 4.1.2)
DT * 0.22 2022-03-28 [1] CRAN (R 4.1.2)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
evaluate 0.15 2022-02-18 [1] CRAN (R 4.1.2)
fansi 1.0.3 2022-03-24 [1] CRAN (R 4.1.2)
farver 2.1.0 2021-02-28 [1] CRAN (R 4.1.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.0)
generics 0.1.2 2022-01-31 [1] CRAN (R 4.1.2)
ggforce 0.3.3 2021-03-05 [1] CRAN (R 4.1.0)
ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
ggraph * 2.0.5 2021-02-23 [1] CRAN (R 4.1.0)
ggrepel 0.9.1 2021-01-15 [1] CRAN (R 4.1.0)
ggthemes * 4.2.4 2021-01-20 [1] CRAN (R 4.1.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.2)
graphlayouts 0.8.0 2022-01-03 [1] CRAN (R 4.1.2)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.1.0)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
highcharter * 0.9.4 2022-01-03 [1] CRAN (R 4.1.2)
hms 1.1.1 2021-09-26 [1] CRAN (R 4.1.0)
htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.0)
htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.1.0)
igraph * 1.2.11 2022-01-04 [1] CRAN (R 4.1.2)
janeaustenr 0.1.5 2017-06-10 [1] CRAN (R 4.1.0)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.1.0)
jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.1.2)
knitr 1.38 2022-03-25 [1] CRAN (R 4.1.0)
lattice 0.20-45 2021-09-22 [1] CRAN (R 4.1.3)
lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.0)
lubridate 1.8.0 2021-10-07 [1] CRAN (R 4.1.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.1.2)
MASS 7.3-55 2022-01-16 [1] CRAN (R 4.1.3)
Matrix 1.4-0 2021-12-08 [1] CRAN (R 4.1.3)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.0)
munsell 0.5.0.9000 2021-10-19 [1] Github (cwickham/munsell@e539541)
pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2)
pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.1.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
pkgload 1.2.4 2021-11-30 [1] CRAN (R 4.1.0)
polyclip 1.10-0 2019-03-14 [1] CRAN (R 4.1.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
processx 3.5.3 2022-03-25 [1] CRAN (R 4.1.0)
ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0)
purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
quantmod 0.4.18 2020-12-09 [1] CRAN (R 4.1.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.0)
rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.1.0)
Rcpp 1.0.8.3 2022-03-17 [1] CRAN (R 4.1.2)
readr 2.1.2 2022-01-30 [1] CRAN (R 4.1.2)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.1.0)
rlang 1.0.2 2022-03-04 [1] CRAN (R 4.1.2)
rlist 0.4.6.2 2021-09-03 [1] CRAN (R 4.1.0)
rmarkdown 2.13 2022-03-10 [1] CRAN (R 4.1.2)
rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.1.0)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
sass 0.4.1 2022-03-23 [1] CRAN (R 4.1.2)
scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.0)
SnowballC 0.7.0 2020-04-01 [1] CRAN (R 4.1.0)
stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.0)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
testthat 3.1.3 2022-03-29 [1] CRAN (R 4.1.2)
textdata * 0.4.1 2020-05-04 [1] CRAN (R 4.1.0)
tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.0)
tidygraph 1.2.0 2020-05-12 [1] CRAN (R 4.1.0)
tidyr 1.2.0 2022-02-01 [1] CRAN (R 4.1.2)
tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.1.2)
tidytext * 0.3.2 2021-09-30 [1] CRAN (R 4.1.0)
tokenizers 0.2.1 2018-03-29 [1] CRAN (R 4.1.0)
TTR 0.24.3 2021-12-12 [1] CRAN (R 4.1.0)
tweenr 1.0.2 2021-03-23 [1] CRAN (R 4.1.0)
tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.1.2)
usethis * 2.1.5 2021-12-09 [1] CRAN (R 4.1.0)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
vctrs 0.4.0 2022-03-30 [1] CRAN (R 4.1.2)
viridis 0.6.2 2021-10-13 [1] CRAN (R 4.1.0)
viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.1.0)
vroom 1.5.7 2021-11-30 [1] CRAN (R 4.1.0)
widyr * 0.1.4 2021-08-12 [1] CRAN (R 4.1.0)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.0)
xfun 0.30 2022-03-02 [1] CRAN (R 4.1.2)
xts 0.12.1 2020-09-09 [1] CRAN (R 4.1.0)
yaml 2.3.5 2022-02-21 [1] CRAN (R 4.1.2)
zoo 1.8-9 2021-03-09 [1] CRAN (R 4.1.0)
[1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────