-
Notifications
You must be signed in to change notification settings - Fork 13
/
intro.Rmd
144 lines (87 loc) · 6.31 KB
/
intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
# Introduction {-}
This document provides some tools, demonstrations, and more to make data processing, programming, modeling, visualization, and presentation easier. The goal here is to instill a foundation for sound data science, and also to provide some of the *why* behind the approaches demonstrated, so that one can better implement them. Focus is given to tools and methods that are generalizable, replicable, and efficient.
## Intended Audience
If you are a budding data scientist in academia, industry, or elsewhere, the content in this document should provide enough knowledge for you to do better work under a variety of circumstances, and from data creation to presentation of results. You probably have already tried to analyze data in some form before this, but may be struggling to do so in an efficient way. Hopefully this will allow one to extend what they know, fill in gaps, and otherwise try some new tricks.
## Programming Language
While the programming language focus is on R, where applicable (which is most of the time), [Python notebooks are also available](https://github.com/m-clark/data-processing-and-visualization/tree/master/jupyter_notebooks), and they include the same exercises. Furthermore, much of the actual content and concepts discussed would apply to any programming language used for data science, and the concepts for programming, modeling, visualization, reproducibility and more are applicable for anyone engaged in data science.
To use R effectively enough for the content here you need to also install RStudio (which is not R itself), and know how to install and load packages.
## Additional Practice
Exercises are provided to get more practice with the things learned. In addition, references are available to dive into even more extensions of the things demonstrated here.
## Outline
The content is divided into five sections. While parts assume at least the information in Part 1, they are otherwise fairly independent and can be explored on their own.
### Part 1: Information Processing
- Understanding Basic R Approaches to Gathering and Processing Data
- Overview of Data Structures
- Getting data in and out
- Indexing
- Getting Acquainted with Other Approaches to Data Processing
- Pipes, and how to use them
- tidyverse
- data.table
- Misc.
### Part 2: Programming Basics
- Using R more fully
- Dealing with objects
- Iterative programming
- Writing functions
- Going further
- Code style
- Vectorization
- Regular expressions
### Part 3: Modeling
- Model Exploration
- Key concepts
- Understanding and fitting models
- Overview of extensions
- Model Criticism
- Model Assessment
- Model Comparison
- Machine Learning
- Concepts
- Demonstration of techniques
### Part 4: Visualization
- Thinking Visually
- Visualizing Information
- Color
- Contrast
- and more...
- Using ggplot2
- Aesthetics
- Layers
- Themes
- and more...
- Adding Interactivity
- Package demos
- Shiny
### Part 5: Presentation
- Building Better Data-Driven Products
- Reproducibility concepts
- Starting out with R markdown
- Standard documents
- Customization and more
- Themes, CSS, etc.
## Workshops
This document also serves as a basis for several workshops. To follow along with the examples, clone/download the related section repos. Downloading any one of them will have an R project and associated data, such that the code from any section should run.
- [R I: Information Processing](https://github.com/m-clark/R-I-Basics)
- [R II: Programming](https://github.com/m-clark/R-II-Programming)
- [R III: Modeling](https://github.com/m-clark/R-III-Modeling)
- [R IV: Visualization](https://github.com/m-clark/R-IV-Visualization)
- [R V: Presentation](https://github.com/m-clark/R-V-Presentation)
## Other
Color coding in text:
- <span class="emph">emphasis</span>
- <span class="pack">package</span>
- <span class="func">function</span>
- <span class="objclass">object/class</span>
- [link]()
Some key packages used in the following demonstrations and exercises:
<span class="pack">tidyverse</span> (several packages), <span class="pack">data.table</span>, <span class="pack">tidymodels</span>, <span class="pack">rmarkdown</span>
### Python notebooks
The related Python notebooks may be found [here](https://github.com/m-clark/data-processing-and-visualization/tree/master/jupyter_notebooks). The primary modules used and demonstrated are <span class="pack" style = "">numpy</span> and <span class="pack" style = "">pandas</span> for the first two parts, and throughout the rest. Besides that, <span class="pack" style = "">statsmodels</span>, <span class="pack" style = "">scikit-learn</span>, <span class="pack" style = "">plotnine</span>, <span class="pack" style = "">plotly</span>, and others are employed.
### Other R packages
Many other packages are also used for data or minor demonstration, so feel free to install as we come across them. Here are a few.
<span class="pack">ggplot2movies</span>, <span class="pack">nycflights13</span>, <span class="pack">DT</span>, <span class="pack">highcharter</span>, <span class="pack">magrittr</span>, <span class="pack">maps</span>, <span class="pack">mgcv</span> (already comes with base R), <span class="pack">plotly</span>, <span class="pack">quantmod</span>, <span class="pack">readr</span>, <span class="pack">visNetwork</span>, <span class="pack">emmeans</span>, <span class="pack">ggeffects</span>
### History
This was initially a document that only contained the information processing and visualization chapters, and it filled a need specifically for workshops. Over time, it has increased to include other things I think are generally missing from those who simply learn how to get to the end results. The content is born out of the gaps I see in those I consult with, and also just 'I wish I'd known that' type of experience. In the interim, textbooks have come out, authored by some who develop the packages being used here, that could be used as a next step, as they offer more detail. They are listed in the [references][references] section.
### Current Efforts
At present most of the content is more or less as I want it, though I'm filling it out with some minor additions here and there over time. Currently I'm improving the Python documents as well.