Merge branch 'master' of github.com:erikaduan/r_tips

knightvvitch · May 28, 2022 · e512754
2 parents 96f7a74 + 7d857b6

Showing 7 changed files with 74 additions and 31 deletions.
diff --git a/README.md b/README.md
@@ -50,6 +50,12 @@ Many kudos to [Dr Chuanxin Liu](https://github.com/codetrainee), my former PhD s
 + [Introduction to hypergeometric, geometric, negative binomial and multinomial distributions](https://github.com/erikaduan/R_tips/blob/master/tutorials/2020-09-22_hypergeometric-and-other-discrete-distributions/2020-09-22_hypergeometric-and-other-discrete-distributions.md)  
 
 
+# Other resources 
+These resources also cover a comprehensive range of practical R usage tutorials.  
+
++ [Statistical Computing](https://36-750.github.io/) by Alex Reinhart and Christopher Genovese  
++ [Data Science Toolkit](https://benkeser.github.io/info550/lectures/) by David Benkeser  
+
 # Tutorial style guide  
 
 A painful form of technical debt is inconsistent code style. This repository now contains the following file naming and code style rules.  
@@ -81,31 +87,7 @@ A painful form of technical debt is inconsistent code style. This repository now
   version 1.4.0.
    https://CRAN.R-project.org/package=stringr
 
-+ Max Kuhn. (2019). `caret`: Classification and Regression
-  Training. R package version 6.0-84. https://CRAN.R-project.org/package=caret  
-    + Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony
-  Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem,
-  Luca Scrucca, Yuan Tang, Can Candan and Tyler Hunt.  
-
-+ Jacob Kaplan (2020). `fastDummies`: Fast Creation of Dummy (Binary) Columns and Rows from Categorical
-  Variables. R package version 1.6.1. https://CRAN.R-project.org/package=fastDummies  
-
 + Kirill Müller (2017). `here`: A Simpler Way to Find Your Files. R package version 0.1.
   https://CRAN.R-project.org/package=here  
 
-+ Paul Murrell (2015). `compare`: Comparing Objects for Differences. R package version 0.2-6.
-  https://CRAN.R-project.org/package=compare  
-
-+ A. Liaw and M. Wiener (2002). Classification and Regression by `randomForest`. R News 2(3), 18--22.  
-
-+ Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, Rory
-  Mitchell, Ignacio Cano, Tianyi Zhou, Mu Li, Junyuan Xie, Min Lin, Yifeng Geng and Yutian Li (2020).
-  `xgboost`: Extreme Gradient Boosting. R package version 1.0.0.2. https://CRAN.R-project.org/package=xgboost  
-
-+ Alexandros Karatzoglou, Alex Smola, Kurt Hornik, Achim Zeileis (2004). `kernlab` - An S4 Package for Kernel
-  Methods in R. Journal of Statistical Software 11(9), 1-20. URL http://www.jstatsoft.org/v11/i09/  
-
-+ Microsoft Corporation and Steve Weston (2019). `doParallel`: Foreach Parallel Adaptor for the `parallel`
-  Package. R package version 1.0.15. https://CRAN.R-project.org/package=doParallel  
-
 + Richard Iannone (2020). `DiagrammeR`: Graph/Network Visualization. R package version 1.0.6.1.  https://CRAN.R-project.org/package=DiagrammeR  
diff --git a/tutorials/dc-data_table_vs_dplyr/dc-data_table_vs_dplyr.Rmd b/tutorials/dc-data_table_vs_dplyr/dc-data_table_vs_dplyr.Rmd
@@ -7,25 +7,27 @@ output:
     toc: true
 ---
 
-```{r setup, include = FALSE}
-knitr::opts_chunk$set(echo = TRUE, results = 'hide', message = FALSE)   
+```{r setup, include=FALSE}
+# Set up global environment ----------------------------------------------------
+knitr::opts_chunk$set(echo=TRUE, results="hide", message=FALSE)   
 ```
 
-```{r, message = FALSE}
-#-----load required packages-----  
+```{r, message=FALSE}
+# Load required packages -------------------------------------------------------  
 if (!require("pacman")) install.packages("pacman")
 pacman::p_load(here,
-               ids, # for generating random ids
+               ids, # Generate random ids
                tidyverse,
                data.table,
-               compare, # compare between data frames
+               compare, # Compare between data frames
                microbenchmark)
 ```
 
 
 # Introduction   
 
-One of the great benefits of following Rstats conversations on Twitter is its access to user insights. I became curious about `data.table` after reading conversations about its superior performance yet decreased visibility compared to `tidyverse`.      
+I became curious about `data.table` after reading Twitter conversations about its superior performance yet decreased visibility compared to `tidyverse`.  
+
 
 Fast forward a few years and the [data processing efficiency](https://h2oai.github.io/db-benchmark/) of `data.table` has become extremely handy:  
 
@@ -960,6 +962,8 @@ In contrast, `data.table` is efficient because it contains a very fast ordering
 
 # Other resources   
 
++ https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
+
 + The definitive [stack overflow discussion](https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly/27840349#27840349) about the best use cases for data.table versus dplyr (from tidyverse).   
 
 + A great side by side comparison of data.table versus dplyr operations by [Atrebas](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/).      
@@ -974,3 +978,7 @@ In contrast, `data.table` is efficient because it contains a very fast ordering
 Robin Lovelace](https://csgillespie.github.io/efficientR/data-processing-with-data-table.html).      
 
 + A more detailed explanation of the usage of binary search based subset in `data.table` by [Arun Srinivasan](https://gist.github.com/arunsrinivasan/dacb9d1cac301de8d9ff).      
+
++ https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html
+
++ http://www.john-ros.com/Rcourse/parallel.html
diff --git a/tutorials/dc-data_table_vs_dplyr/dc-dataset_generation_script.R b/tutorials/dc-data_table_vs_dplyr/dc-dataset_generation_script.R
diff --git a/tutorials/dc-using_arrow/dc-using_arrow.Rmd b/tutorials/dc-using_arrow/dc-using_arrow.Rmd
@@ -0,0 +1,13 @@
+---
+title: "Using arrow with tidyverse and data.table"
+author: Erika Duan
+date: "`r Sys.Date()`"
+output:
+  github_document:
+    toc: true
+---
+
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
diff --git a/tutorials/dc-using_arrow/dc-using_arrow.md b/tutorials/dc-using_arrow/dc-using_arrow.md
@@ -0,0 +1,38 @@
+Using arrow with tidyverse and data.table
+================
+Erika Duan
+2022-03-05
+
+-   [R Markdown](#r-markdown)
+-   [Including Plots](#including-plots)
+
+## R Markdown
+
+This is an R Markdown document. Markdown is a simple formatting syntax
+for authoring HTML, PDF, and MS Word documents. For more details on
+using R Markdown see <http://rmarkdown.rstudio.com>.
+
+When you click the **Knit** button a document will be generated that
+includes both content as well as the output of any embedded R code
+chunks within the document. You can embed an R code chunk like this:
+
+``` r
+summary(cars)
+```
+
+    ##      speed           dist       
+    ##  Min.   : 4.0   Min.   :  2.00  
+    ##  1st Qu.:12.0   1st Qu.: 26.00  
+    ##  Median :15.0   Median : 36.00  
+    ##  Mean   :15.4   Mean   : 42.98  
+    ##  3rd Qu.:19.0   3rd Qu.: 56.00  
+    ##  Max.   :25.0   Max.   :120.00
+
+## Including Plots
+
+You can also embed plots, for example:
+
+![](dc-using_arrow_files/figure-gfm/pressure-1.png)<!-- -->
+
+Note that the `echo = FALSE` parameter was added to the code chunk to
+prevent printing of the R code that generated the plot.
diff --git a/tutorials/dc-using_arrow/dc-using_arrow_files/figure-gfm/pressure-1.png b/tutorials/dc-using_arrow/dc-using_arrow_files/figure-gfm/pressure-1.png
diff --git a/tutorials/p-automating_rmd_reports/p-automating_rmd_reports_part_2.Rmd b/tutorials/p-automating_rmd_reports/p-automating_rmd_reports_part_2.Rmd
@@ -320,3 +320,5 @@ jobs:
 + A [YouTube tutorial](https://www.youtube.com/watch?v=NwUijrm2U2w) by DVC on using GitHub Actions with R to automate data visualisation tasks.   
 + A useful [online resource](https://explainshell.com/) for explaining shell commands required to create components of the GitHub Actions YAML workflow.  
 + https://amitlevinson.com/blog/automated-plot-with-github-actions/  
++ https://rstats.wtf/index.html  
++ https://goodresearch.dev/pipelines.html