Tidy Tuesday Hidden Gems


broken stone revealing jeweled interior
What looks like a rock could be a hidden gem. Photo by "lilartsy" on Unsplash.

Summary

This week's #tidytuesday release was the Kaggle Hidden Gems dataset, a collection of notebook reviews from Kaggle.

Overview

Begun in 2018, “tidy Tuesday” releases a dataset every week and challenges the R community to draw insight from it, doing what data analysts do. For the week of April 26, 2022, the challenge was the “Kaggle Hidden Gems” dataset. “Hidden gems” are items that are good but unrecognized or overlooked by others. In this spirit, Martin Henze of Heads or Tails finds and promotes three underrated notebooks (formerly “kernels”) per episode that he believes deserve a wider audience. The Kaggle Hidden Gems dataset comprises 300 notebooks compiled over 100 episodes.

Background

The original dataset has twelve columns and 300 observations:

Rows: 300
Columns: 12
$ vol             <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, …
$ date            <date> 2020-05-12, 2020-05-12, 2020-05-12, 2020-05-19, 2020-…
$ link_forum      <chr> "https://www.kaggle.com/general/150603", "https://www.…
$ link_twitter    <chr> "https://twitter.com/heads0rtai1s/status/1260289884943…
$ notebook        <chr> "https://www.kaggle.com/hansjoerg/glmnet-xgboost-and-s…
$ author_kaggle   <chr> "hansjoerg", "parulpandey", "jonathanbouchet", "andrad…
$ title           <chr> "Glmnet, XGBoost, and SVM Using tidymodels", "Breathe …
$ review          <chr> "A well-structured and documented tutorial on how to u…
$ author_name     <chr> "Hansjoerg", "Parul Pandey", "Jonathan Bouchet", "Andr…
$ author_twitter  <chr> "https://twitter.com/hansjoerg_me", "https://twitter.c…
$ author_linkedin <chr> NA, "https://www.linkedin.com/in/parul-pandey-a5498975…
$ notes           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

The most interesting variables were the title and Martin Henze’s review. Both are free text, so I went in search of continuous variables and, potentially, categorical ones too. The notes field also had some interesting items. At regular intervals, users got to choose outstanding notebooks.

Feature Creation

With some wrangling, additional features were constructed: (1) title_length, the title length in characters; (2) review_length, the review length in characters; (3) special, for notebooks chosen for a special milestone; (4) community, for notebooks chosen by users; (5) sentiment_afinn, a sentiment score on the review; (6) frequency, the number of notebooks submitted by a single author; (7) prior_submission, a logical value for authors who submitted more than one notebook; and (8) link, a link to the actual Kaggle notebook.

Rows: 300
Columns: 8
$ title_length     <int> 41, 43, 35, 33, 26, 35, 34, 47, 36, 32, 44, 50, 41, 4…
$ review_length    <int> 162, 129, 190, 107, 170, 119, 178, 259, 182, 276, 159…
$ special          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ community        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ sentiment_afinn  <dbl> NA, NA, 15, 2, 2, 2, 4, NA, 5, 3, 1, 4, -1, 7, 6, 5, …
$ frequency        <fct> 1, 3, 9, 2, 3, 3, 2, 1, 2, 3, 3, 1, 1, 1, 1, 2, 1, 1,…
$ prior_submission <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRU…
$ link             <chr> "<a href='https://www.kaggle.com/hansjoerg/glmnet-xgb…
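
The length, frequency, and prior-submission features can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the wrangling code used for this post: the `gems` data frame and all of its rows are invented stand-ins, and lengths are computed with `nchar()` on the assumption that they are character counts. That assumption rests on the first title in the dataset, "Glmnet, XGBoost, and SVM Using tidymodels", which has 41 characters, matching the first title_length value in the output above.

```r
# Toy stand-in for the Hidden Gems data. Every row here is invented
# for illustration; none of it comes from the real dataset.
gems <- data.frame(
  author_kaggle = c("author_a", "author_b", "author_a", "author_c"),
  title = c("A first notebook", "Maps with leaflet",
            "A second notebook", "EDA basics"),
  review = c("A clear and well documented tutorial.",
             "Great interactive maps.",
             "Another superb walkthrough.",
             "Solid exploratory work."),
  stringsAsFactors = FALSE
)

# Sanity check of the character-count assumption against the real first title.
nchar("Glmnet, XGBoost, and SVM Using tidymodels")  # 41

# Title and review lengths, counted in characters (assumption, see above).
gems$title_length  <- nchar(gems$title)
gems$review_length <- nchar(gems$review)

# Number of notebooks per author, repeated on each of that author's rows.
# The post stores this as a factor; a plain integer is kept here.
counts <- table(gems$author_kaggle)
gems$frequency <- as.integer(counts[gems$author_kaggle])

# TRUE when the author appears more than once in the collection.
gems$prior_submission <- gems$frequency > 1

gems[, c("author_kaggle", "title_length", "frequency", "prior_submission")]
```

The special and community flags are left out of the sketch because they depend on how the notes field encodes those picks, which is not shown here.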

Sentiment Indicator

The sentiment indicator was created using the tidytext package, following the methodology described in Chapter 2 of Text Mining with R by Julia Silge and David Robinson. Text mining can be used to assess the “emotional content” of a text programmatically. Here, the AFINN sentiment dictionary was applied to the review field. AFINN assigns each word an integer score between -5 and 5. The most extreme entries in the dictionary illustrate the range:

Words of Negative Sentiment

# A tibble: 10 × 2
   word          value
   <chr>         <dbl>
 1 bastard          -5
 2 bastards         -5
 3 bitch            -5
 4 bitches          -5
 5 cock             -5
 6 cocksucker       -5
 7 cocksuckers      -5
 8 cunt             -5
 9 motherfucker     -5
10 motherfucking    -5

Words of Positive Sentiment

# A tibble: 5 × 2
  word         value
  <chr>        <dbl>
1 breathtaking     5
2 hurrah           5
3 outstanding      5
4 superb           5
5 thrilled         5

“Dictionary-based methods like the ones we are discussing find the total sentiment of a piece of text by adding up the individual sentiment scores for each word in the text,” according to Silge and Robinson. The individual word scores were then summed for each notebook title, resulting in the sentiment_afinn variable. The AFINN dictionary contains 2,477 words, and “NA” was returned for 87 of the 300 entries. Presumably, those 87 reviews did not contain any words found in the dictionary.
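
The dictionary-based scoring described above can be sketched in base R. This is an illustration only: the post itself used tidytext, where the equivalent steps would be `unnest_tokens()`, a join against `get_sentiments("afinn")`, and a grouped sum. The mini-lexicon, the example reviews, and the helper names `tokenize()` and `score_review()` are all hypothetical; the three positive words carry the scores shown in the table above, while the negative entry is an illustrative value that has not been checked against AFINN.

```r
# Hypothetical mini-lexicon in the AFINN layout (word, value).
# "confusing" = -2 is an illustrative score, not verified against AFINN.
lexicon <- data.frame(
  word  = c("outstanding", "superb", "thrilled", "confusing"),
  value = c(5, 5, 5, -2),
  stringsAsFactors = FALSE
)

# Invented reviews, one per notebook title.
reviews <- data.frame(
  title  = c("Notebook A", "Notebook B", "Notebook C"),
  review = c("An outstanding and superb tutorial!",
             "Superb plots, but a confusing layout.",
             "A tidy walkthrough of the data."),
  stringsAsFactors = FALSE
)

# Lowercase the text and split on anything that is not a letter or apostrophe.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Sum the scores of all words found in the lexicon.
# Returns NA when no word of the review is in the lexicon.
score_review <- function(text, lexicon) {
  words <- tokenize(text)
  hits  <- lexicon$value[match(words, lexicon$word)]
  hits  <- hits[!is.na(hits)]
  if (length(hits) == 0) NA_real_ else sum(hits)
}

reviews$sentiment_afinn <- vapply(
  reviews$review, score_review, numeric(1),
  lexicon = lexicon, USE.NAMES = FALSE
)

reviews[, c("title", "sentiment_afinn")]
```

In this toy example the first review scores 10 (5 + 5), the second scores 3 (5 - 2), and the third is NA because none of its words appear in the lexicon, which is the same mechanism that would produce NA entries in sentiment_afinn.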

Hidden Gems Plot

Hidden Gems Table

Conclusion

All 300 notebooks were considered to be overlooked and among the best. There was very little correlation among the variables, but sentiment tended to be higher for authors with prior submissions. This may reflect Martin Henze’s familiarity with an author’s past work. It could also mean that Kagglers’ notebook skills increased with additional submissions. By creating the plot and table, I have a nice index of outstanding Kaggle notebooks to consult in the future.

References

[1] R Core Team, R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2022 [Online]. Available: https://www.R-project.org/
[2] Y. Xie, C. Dervieux, and A. Presmanes Hill, blogdown: Create blogs and websites with R Markdown. 2022 [Online]. Available: https://CRAN.R-project.org/package=blogdown
[3] H. Wickham, R. François, L. Henry, and K. Müller, dplyr: A grammar of data manipulation. 2022 [Online]. Available: https://CRAN.R-project.org/package=dplyr
[4] Y. Xie, J. Cheng, and X. Tan, DT: A wrapper of the JavaScript library DataTables. 2022 [Online]. Available: https://github.com/rstudio/DT
[5] J. Kunst, highcharter: A wrapper for the highcharts library. 2022 [Online]. Available: https://CRAN.R-project.org/package=highcharter
[6] D. Robinson and J. Silge, tidytext: Text mining using dplyr, ggplot2, and other tidy tools. 2021 [Online]. Available: https://github.com/juliasilge/tidytext

Disclaimer

The views, analysis and conclusions presented within this paper are the author’s alone and not those of any other person, organization or government entity. While I have made every reasonable effort to ensure that the information in this article is correct, it will nonetheless contain errors, inaccuracies and inconsistencies. It is a working paper subject to revision without notice as additional information becomes available. Any liability is disclaimed as to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause. The author received no financial support for the research, authorship, and/or publication of this article.

Reproducibility

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.3 (2022-03-10)
 os       macOS Big Sur/Monterey 10.16
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Chicago
 date     2022-04-28
 pandoc   2.14.1 @ /usr/local/bin/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package      * version    date (UTC) lib source
 assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.1.0)
 backports      1.4.1      2021-12-13 [1] CRAN (R 4.1.0)
 bit            4.0.4      2020-08-04 [1] CRAN (R 4.1.0)
 bit64          4.0.5      2020-08-30 [1] CRAN (R 4.1.0)
 blogdown     * 1.9        2022-03-28 [1] CRAN (R 4.1.2)
 bookdown       0.25       2022-03-16 [1] CRAN (R 4.1.2)
 brio           1.1.3      2021-11-30 [1] CRAN (R 4.1.0)
 broom          0.7.12     2022-01-28 [1] CRAN (R 4.1.2)
 bslib          0.3.1.9000 2022-03-04 [1] Github (rstudio/bslib@888fbe0)
 cachem         1.0.6      2021-08-19 [1] CRAN (R 4.1.0)
 callr          3.7.0      2021-04-20 [1] CRAN (R 4.1.0)
 cli            3.2.0      2022-02-14 [1] CRAN (R 4.1.2)
 colorspace     2.0-3      2022-02-21 [1] CRAN (R 4.1.2)
 crayon         1.5.1      2022-03-26 [1] CRAN (R 4.1.0)
 crosstalk      1.2.0      2021-11-04 [1] CRAN (R 4.1.0)
 curl           4.3.2      2021-06-23 [1] CRAN (R 4.1.0)
 data.table     1.14.2     2021-09-27 [1] CRAN (R 4.1.0)
 DBI            1.1.2      2021-12-20 [1] CRAN (R 4.1.0)
 desc           1.4.1      2022-03-06 [1] CRAN (R 4.1.2)
 devtools     * 2.4.3      2021-11-30 [1] CRAN (R 4.1.0)
 digest         0.6.29     2021-12-01 [1] CRAN (R 4.1.0)
 dplyr        * 1.0.8      2022-02-08 [1] CRAN (R 4.1.2)
 DT           * 0.22       2022-03-28 [1] CRAN (R 4.1.2)
 ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.1.0)
 evaluate       0.15       2022-02-18 [1] CRAN (R 4.1.2)
 fansi          1.0.3      2022-03-24 [1] CRAN (R 4.1.2)
 farver         2.1.0      2021-02-28 [1] CRAN (R 4.1.0)
 fastmap        1.1.0      2021-01-25 [1] CRAN (R 4.1.0)
 fs             1.5.2      2021-12-08 [1] CRAN (R 4.1.0)
 generics       0.1.2      2022-01-31 [1] CRAN (R 4.1.2)
 ggforce        0.3.3      2021-03-05 [1] CRAN (R 4.1.0)
 ggplot2      * 3.3.5      2021-06-25 [1] CRAN (R 4.1.0)
 ggraph       * 2.0.5      2021-02-23 [1] CRAN (R 4.1.0)
 ggrepel        0.9.1      2021-01-15 [1] CRAN (R 4.1.0)
 ggthemes     * 4.2.4      2021-01-20 [1] CRAN (R 4.1.0)
 glue           1.6.2      2022-02-24 [1] CRAN (R 4.1.2)
 graphlayouts   0.8.0      2022-01-03 [1] CRAN (R 4.1.2)
 gridExtra      2.3        2017-09-09 [1] CRAN (R 4.1.0)
 gtable         0.3.0      2019-03-25 [1] CRAN (R 4.1.0)
 highcharter  * 0.9.4      2022-01-03 [1] CRAN (R 4.1.2)
 hms            1.1.1      2021-09-26 [1] CRAN (R 4.1.0)
 htmltools      0.5.2      2021-08-25 [1] CRAN (R 4.1.0)
 htmlwidgets    1.5.4      2021-09-08 [1] CRAN (R 4.1.0)
 igraph       * 1.2.11     2022-01-04 [1] CRAN (R 4.1.2)
 janeaustenr    0.1.5      2017-06-10 [1] CRAN (R 4.1.0)
 jquerylib      0.1.4      2021-04-26 [1] CRAN (R 4.1.0)
 jsonlite       1.8.0      2022-02-22 [1] CRAN (R 4.1.2)
 knitr          1.38       2022-03-25 [1] CRAN (R 4.1.0)
 lattice        0.20-45    2021-09-22 [1] CRAN (R 4.1.3)
 lifecycle      1.0.1      2021-09-24 [1] CRAN (R 4.1.0)
 lubridate      1.8.0      2021-10-07 [1] CRAN (R 4.1.0)
 magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.1.2)
 MASS           7.3-55     2022-01-16 [1] CRAN (R 4.1.3)
 Matrix         1.4-0      2021-12-08 [1] CRAN (R 4.1.3)
 memoise        2.0.1      2021-11-26 [1] CRAN (R 4.1.0)
 munsell        0.5.0.9000 2021-10-19 [1] Github (cwickham/munsell@e539541)
 pillar         1.7.0      2022-02-01 [1] CRAN (R 4.1.2)
 pkgbuild       1.3.1      2021-12-20 [1] CRAN (R 4.1.0)
 pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.1.0)
 pkgload        1.2.4      2021-11-30 [1] CRAN (R 4.1.0)
 polyclip       1.10-0     2019-03-14 [1] CRAN (R 4.1.0)
 prettyunits    1.1.1      2020-01-24 [1] CRAN (R 4.1.0)
 processx       3.5.3      2022-03-25 [1] CRAN (R 4.1.0)
 ps             1.6.0      2021-02-28 [1] CRAN (R 4.1.0)
 purrr          0.3.4      2020-04-17 [1] CRAN (R 4.1.0)
 quantmod       0.4.18     2020-12-09 [1] CRAN (R 4.1.0)
 R6             2.5.1      2021-08-19 [1] CRAN (R 4.1.0)
 rappdirs       0.3.3      2021-01-31 [1] CRAN (R 4.1.0)
 Rcpp           1.0.8.3    2022-03-17 [1] CRAN (R 4.1.2)
 readr          2.1.2      2022-01-30 [1] CRAN (R 4.1.2)
 remotes        2.4.2      2021-11-30 [1] CRAN (R 4.1.0)
 rlang          1.0.2      2022-03-04 [1] CRAN (R 4.1.2)
 rlist          0.4.6.2    2021-09-03 [1] CRAN (R 4.1.0)
 rmarkdown      2.13       2022-03-10 [1] CRAN (R 4.1.2)
 rprojroot      2.0.3      2022-04-02 [1] CRAN (R 4.1.0)
 rstudioapi     0.13       2020-11-12 [1] CRAN (R 4.1.0)
 sass           0.4.1      2022-03-23 [1] CRAN (R 4.1.2)
 scales         1.1.1      2020-05-11 [1] CRAN (R 4.1.0)
 sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.1.0)
 SnowballC      0.7.0      2020-04-01 [1] CRAN (R 4.1.0)
 stringi        1.7.6      2021-11-29 [1] CRAN (R 4.1.0)
 stringr      * 1.4.0      2019-02-10 [1] CRAN (R 4.1.0)
 testthat       3.1.3      2022-03-29 [1] CRAN (R 4.1.2)
 textdata     * 0.4.1      2020-05-04 [1] CRAN (R 4.1.0)
 tibble         3.1.6      2021-11-07 [1] CRAN (R 4.1.0)
 tidygraph      1.2.0      2020-05-12 [1] CRAN (R 4.1.0)
 tidyr          1.2.0      2022-02-01 [1] CRAN (R 4.1.2)
 tidyselect     1.1.2      2022-02-21 [1] CRAN (R 4.1.2)
 tidytext     * 0.3.2      2021-09-30 [1] CRAN (R 4.1.0)
 tokenizers     0.2.1      2018-03-29 [1] CRAN (R 4.1.0)
 TTR            0.24.3     2021-12-12 [1] CRAN (R 4.1.0)
 tweenr         1.0.2      2021-03-23 [1] CRAN (R 4.1.0)
 tzdb           0.3.0      2022-03-28 [1] CRAN (R 4.1.2)
 usethis      * 2.1.5      2021-12-09 [1] CRAN (R 4.1.0)
 utf8           1.2.2      2021-07-24 [1] CRAN (R 4.1.0)
 vctrs          0.4.0      2022-03-30 [1] CRAN (R 4.1.2)
 viridis        0.6.2      2021-10-13 [1] CRAN (R 4.1.0)
 viridisLite    0.4.0      2021-04-13 [1] CRAN (R 4.1.0)
 vroom          1.5.7      2021-11-30 [1] CRAN (R 4.1.0)
 widyr        * 0.1.4      2021-08-12 [1] CRAN (R 4.1.0)
 withr          2.5.0      2022-03-03 [1] CRAN (R 4.1.0)
 xfun           0.30       2022-03-02 [1] CRAN (R 4.1.2)
 xts            0.12.1     2020-09-09 [1] CRAN (R 4.1.0)
 yaml           2.3.5      2022-02-21 [1] CRAN (R 4.1.2)
 zoo            1.8-9      2021-03-09 [1] CRAN (R 4.1.0)

 [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────