# (PART) Module 01 {-}
```{r include = FALSE}
source("common.R")
library(tweetrmd) #... embedding tweets
library(vembedr)
ds4p_urls <- read.csv("./admin/csv/ds4p_urls.csv")
```
# Welcome to Data Science
This module is designed to introduce you to some of the tools and workflow we will use in the course (and in data science and programming more broadly).
Each section of this module has links to some videos related to the topic.
You can find the module playlist [here][pl_01]. Most of the slides used to make the videos in this module can be found [here](https://github.com/DataScience4Psych/slides).
## Module Materials
* Videos
    * Located in the subchapters of this module
* Slide decks
    * [Welcome Slides][d01_welcome]
    * [Meet the toolkit][d02_toolkit]
* Suggested Readings
    * All subchapters of this module, including
        * [R basics and workflow](#r_basics)
    * R4DS
        * [Book Introduction](https://r4ds.had.co.nz/introduction.html)
        * [Data exploration Introduction](https://r4ds.had.co.nz/explore-intro.html)
    * **Optional:** [Short Happy Git](https://datascience4psych.github.io/DataScience4Psych/shorthappygit.html)
        * If Short Happy Git is too much, start with [Oh My Git](https://ohmygit.org/)
        * For more depth, [Happy Git with R](https://happygitwithr.com/)
* Activities
    * [Bechdel Test][ae02_bechdel]
    * [Oh My Git](https://ohmygit.org/)
* Lab
    * [Hello R](#lab01)
# What is Data Science?
```{r, echo=FALSE}
"https://www.youtube.com/watch?v=BpKXkkU-NiY" %>%
embed_url() %>%
use_align("center")
```
You can follow along with the slides [here][d01_welcome] if they do not appear below.
```{r, echo=FALSE,error=TRUE}
slide_url(ds4p_urls,"d01_welcome") %>%
include_url(height = "400px")
```
## See for yourselves!
I've embedded a few examples below.
### Shiny App
```{r, echo=FALSE}
knitr::include_app("https://minecr.shinyapps.io/unvotes/")
```
### Hans Rosling
The video below is the shorter version: Hans Rosling's *200 Countries, 200 Years, 4 Minutes*, from *The Joy of Stats*.
```{r, echo=FALSE}
"https://www.youtube.com/watch?v=jbkSRLYSojo" %>%
embed_url() %>%
use_align("center")
```
You can find a longer talk-length version below.
```{r, echo=FALSE}
"https://www.youtube.com/watch?v=hVimVzgtD6w" %>%
embed_url() %>%
use_align("center")
```
### Social Media
Social media contains a ton of great (and terrible) examples of data science in action. These examples range from entire subreddits, such as
[/r/DataisBeautiful](https://www.reddit.com/r/dataisbeautiful/) ([be sure to check out the highest voted posts](https://www.reddit.com/r/dataisbeautiful/top/?sort=top&t=all)) to celebrity tweets about data scientists.
```{r, echo=FALSE}
include_tweet("https://twitter.com/Lesdoggg/status/1346584128368508930")
include_tweet("https://twitter.com/kareem_carr/status/1352655262054834182")
```
### Read for yourselves!
<!-- source: https://raw.githubusercontent.com/academic/awesome-datascience/live/README.md -->
| Link | Preview |
| --- | --- |
| [What is Data Science @ O'reilly](https://www.oreilly.com/ideas/what-is-data-science) | _Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: “here’s a lot of data, what can you make from it?”_ |
| [What is Data Science @ Quora](https://www.quora.com/Data-Science/What-is-data-science) | Data Science is a combination of a number of aspects of Data such as Technology, Algorithm development, and data interference to study the data, analyze it, and find innovative solutions to difficult problems. Basically Data Science is all about Analyzing data and driving for business growth by finding creative ways. |
| [The sexiest job of 21st century](https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century) | _Data scientists today are akin to Wall Street "quants" of the 1980s and 1990s. In those days people with backgrounds in physics and math streamed to investment banks and hedge funds, where they could devise entirely new algorithms and data strategies. Then a variety of universities developed master’s programs in financial engineering, which churned out a second generation of talent that was more accessible to mainstream firms. The pattern was repeated later in the 1990s with search engineers, whose rarefied skills soon came to be taught in computer science programs._ |
| [Wikipedia](https://en.wikipedia.org/wiki/Data_science) | _Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data. Data science is related to data mining, machine learning and big data._ |
| [How to Become a Data Scientist](https://www.mastersindatascience.org/careers/data-scientist/) | _Data scientists are big data wranglers, gathering and analyzing large sets of structured and unstructured data. A data scientist’s role combines computer science, statistics, and mathematics. They analyze, process, and model data then interpret the results to create actionable plans for companies and other organizations._ |
| [a very short history of #datascience](http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/) | _The story of how data scientists became sexy is mostly the story of the coupling of the mature discipline of statistics with a very young one--computer science. The term “Data Science” has emerged only recently to specifically designate a new profession that is expected to make sense of the vast stores of big data. But making sense of data has a long history and has been discussed by scientists, statisticians, librarians, computer scientists and others for years. The following timeline traces the evolution of the term “Data Science” and its use, attempts to define it, and related terms._ |
# Meet our toolbox!
```{r, echo=FALSE}
"https://www.youtube.com/watch?v=SJaQtRLFb-Y" %>%
embed_url() %>%
use_align("center")
```
You can follow along with the slides [here][d02_toolkit] if they do not appear below. I recommend installing [R, RStudio](#install), [Git, and GitHub](#installgit) before starting [the Bechdel activity](#bechdal).
```{r, echo=FALSE}
slide_url(ds4p_urls,"d02_toolkit") %>%
include_url(height = "400px")
```
## R and RStudio
### Install R and RStudio {#install}
```{r, echo=FALSE}
"https://www.youtube.com/watch?v=kVIZGCT5p9U" %>%
embed_url() %>%
use_align("center")
```
* Install [R, a free software environment for statistical computing and graphics](https://www.r-project.org) from [CRAN][cran], the Comprehensive R Archive Network. I __highly recommend__ you install a precompiled binary distribution for your operating system -- use the links up at the top of the CRAN page linked above!
* Install RStudio's IDE (short for _integrated development environment_), a powerful user interface for R. Get the Open Source Edition of RStudio Desktop.
    - You can run either the [Preview version](https://www.rstudio.com/products/rstudio/download/preview/) or the official releases available [here](https://www.rstudio.com/products/rstudio/#Desktop).
    - RStudio comes with a __text editor__, so there is no immediate need to install a separate stand-alone editor.
    - RStudio can __interface with Git(Hub)__. However, you must do all the Git(Hub) set up [described elsewhere](https://happygitwithr.com) before you can take advantage of this.

If you have a pre-existing installation of R and/or RStudio, I __highly recommend__ that you reinstall both and get as current as possible. It can be considerably harder to run old software than new.

* When you upgrade R, you generally also need to update any packages you have installed.
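
If you do upgrade R, here is one way you might bring your packages along. Treat it as a minimal sketch and not the only approach: `installed.packages()` and `update.packages()` are base R functions, while the `interactive()` guard is my own addition so that the download step only runs when you are sitting at the console.

```r
# List the packages currently installed in your library (no network needed).
installed <- rownames(installed.packages())
length(installed)

# Update everything that has a newer version on CRAN.
# checkBuilt = TRUE also refreshes packages that were built under an older R.
# The interactive() guard is an assumption of this sketch: it keeps the
# download from running inside scripts or knitted documents.
if (interactive()) {
  update.packages(ask = FALSE, checkBuilt = TRUE)
}
```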
### Testing testing
* Do whatever is appropriate for your OS to launch RStudio. You should get a window similar to the screenshot you see [here](https://www.rstudio.com/wp-content/uploads/2014/04/rstudio-workbench.png), but yours will be more boring because you haven't written any code or made any figures yet!
* Put your cursor in the pane labeled Console, which is where you interact with the live R process. Create a simple object with code like `x <- 3 * 4` (followed by enter or return). Then inspect the `x` object by typing `x` followed by enter or return. You should see the value 12 print to screen. If yes, you've succeeded in installing R and RStudio.
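
Written out, the console check described above looks like this. The last line is an extra that is not part of the steps above; it simply reports which version of R you are running, which is handy to know when asking for help.

```r
# Create a simple object, then inspect it by typing its name.
x <- 3 * 4
x

# Optional extra: report the version of R behind the console.
R.version.string
```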
### Add-on packages
R is an extensible system and many people share useful code they have developed as a _package_ via [CRAN][cran] and GitHub. To install a package from [CRAN][cran], for example the [dplyr](https://CRAN.R-project.org/package=dplyr) package for data manipulation, here is one way to do it in the R console (there are others).
```r
install.packages("dplyr", dependencies = TRUE)
```
By including `dependencies = TRUE`, we are being explicit and extra-careful to install any additional packages the target package, dplyr in the example above, needs to have around.
You could use the above method to install the following packages, all of which we will use:
* palmerpenguins, [package webpage](https://allisonhorst.github.io/palmerpenguins/)
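
If you would rather not reinstall a package you already have, a small wrapper along these lines may help. This is a sketch under stated assumptions: `install_if_missing()` is a hypothetical helper I made up for illustration (it is not part of base R or of this course's code), and the `interactive()` guard is there so the download only happens at the console.

```r
# Hypothetical helper: install a package only when it is not already available.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Only attempt the download when working interactively.
if (interactive()) {
  install_if_missing("palmerpenguins")
}
```

The helper returns `TRUE` invisibly when the package is available afterwards, so you can check the result if you want to.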
### Further resources
The above will get your basic setup ready but here are some links if you are interested in reading a bit further.
* [How to Use RStudio](https://support.rstudio.com/hc/en-us)
* [RStudio's leads for learning R](https://support.rstudio.com/hc/en-us/articles/200552336-Getting-Help-with-R)
* [R FAQ](https://cran.r-project.org/faqs.html)
* [R Installation and Administration](http://cran.r-project.org/doc/manuals/R-admin.html)
* [More about add-on packages in the R Installation and Administration Manual](https://cran.r-project.org/doc/manuals/R-admin.html#Add_002don-packages)
# Bechdel Activity {#bechdal}
```{r, echo=FALSE}
slide_url(ds4p_urls,"d02_toolkit","#24") %>%
include_url(height = "400px")
```
You can find the materials for the Bechdel activity [here][ae02_bechdel]. The compiled version should look something like the following...
```{r, echo=FALSE}
"https://datascience4psych.github.io/ae-02-bechdel-rmarkdown/bechdel.html" %>%
include_url(height = "400px")
```
# Thoughtful Workflow
At this point, I recommend you pause and think about your workflow. I give you permission to spend some time and energy sorting this out! It can be as or more important than learning a new R function or package. The experts don't talk about this much, because they've already got a workflow; it's something they do almost without thinking.
Working through subsequent material in R Markdown documents, possibly using Git and GitHub to track and share your progress, is a great idea and will leave you more prepared for your future data analysis projects. Typing individual lines of R code is but a small part of data analysis and it pays off to think holistically about your workflow.
If you want a lot more detail on workflows, you can wander over to the optional bit on [R basics and workflow](#r_basics).
## R Markdown {#rmarkdown}
```{r, echo=FALSE}
"https://www.youtube.com/watch?v=fIhIqTy8PVw?start=412" %>%
embed_url() %>%
use_align("center")
```
If you are in the mood to be entertained, start the video from the beginning. But if you'd rather just get on with it, start watching at 6:52.
You can follow along with the slides [here][d02_toolkit] if they do not appear below.
```{r, echo=FALSE}
slide_url(ds4p_urls,"d02_toolkit","#26") %>%
include_url(height = "400px")
```
<!--Original content: https://stat545.com/block007_first-use-rmarkdown.html-->
R Markdown is an accessible way to create computational documents that combine prose and tables and figures produced by R code.
An introductory R Markdown workflow, including how it intersects with Git, GitHub, and RStudio, is now maintained within the Happy Git site:
[Test drive R Markdown](https://happygitwithr.com/rmd-test-drive.html)
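
To make the "prose plus code" idea concrete, here is a minimal sketch that writes a tiny R Markdown file and knits it. The file contents and title are made up for illustration, and I am assuming you have the rmarkdown package (and pandoc, which ships with RStudio) available; if not, the sketch still writes the file and just skips the knitting step.

```r
# The code fence is built with strrep() so this example can itself live
# inside a fenced block without closing it early.
fence <- strrep("`", 3)

# A tiny document: YAML header, one line of prose, one R chunk.
rmd_lines <- c(
  "---",
  "title: \"My first R Markdown document\"",
  "output: html_document",
  "---",
  "",
  "Some prose describing the analysis.",
  "",
  paste0(fence, "{r}"),
  "x <- 3 * 4",
  "x",
  fence
)

# Write it to a temporary .Rmd file.
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Knit to HTML only if rmarkdown and pandoc are both available.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
}
```

In day-to-day use you would create the `.Rmd` file in RStudio and press the Knit button instead; the point here is only to show what goes into the file.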
## Git and GitHub {#installgit}
![XKCD on Git](https://imgs.xkcd.com/comics/git.png)
<!-- source https://github.com/uo-ec607/lectures/blob/master/02-git/02-Git.Rmd -->
First, it's important to realize that Git and GitHub are distinct things. GitHub is an online hosting platform that provides an array of services built on top of the Git system. (Similar platforms include Bitbucket and GitLab.) Just like we don't *need* RStudio to run R code, we don't *need* GitHub to use Git... but it will make our lives so much easier.
Git can be very powerful and useful, but it can also take some getting used to.
In this class, we are going to work with some of its most basic functions.
We will do all of our interfacing with Git using the GitHub app and website.
You can follow along with the slides [here][d02_toolkit] if they do not appear below.
```{r, echo=FALSE}
slide_url(ds4p_urls,"d02_toolkit","#30") %>%
include_url(height = "400px")
```
### What is GitHub?
```{r, echo=FALSE}
"https://www.youtube.com/watch?v=w3jLJU7DT5E" %>%
embed_url() %>%
use_align("center")
```
### Git
Git is a **distributed Version Control System (VCS)**. It is a useful tool for easily tracking changes to your code, collaborating, and sharing.
> (Wait, what?)
> Okay, try this: Imagine if Dropbox and the "Track changes" feature in MS Word had a baby. Git would be that baby.
> In fact, it's even better than that because Git is optimized for the things that social scientists and data scientists spend a lot of time working on (e.g. code).
The learning curve is worth it -- I promise you!
With Git, you can track the changes you make to your project, so you always have a record of what you've worked on and can easily revert to an older version if need be. It also makes working with others easier: groups of people can work together on the same project and merge their changes into one final source!
GitHub is a way to use the same power of Git all online with an easy-to-use interface. It's used across the software world and beyond to collaborate and maintain the history of projects.
> There's a high probability that your favorite app, program or package is built using Git-based tools. (RStudio is a case in point.)
Scientists and academic researchers are starting to use it as well. Benefits of version control and collaboration tools aside, Git(Hub) helps to operationalize the ideals of open science and reproducibility. Journals have increasingly strict requirements regarding reproducibility and data access. GitHub makes this easy (DOI integration, off-the-shelf licenses, etc.). I run my [entire lab on GitHub](https://github.com/R-Computing-Lab); this entire course is running on GitHub; these lecture notes are hosted on GitHub...
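
If you are curious what "track changes and revert" looks like under the hood, here is a rough sketch that drives the `git` command line from R in a throwaway folder. In this class we will use the GitHub app and website instead, so this is purely illustrative. The `git()` wrapper, the demo identity, and the file name are all made up for the example, and the whole thing is skipped if Git is not installed.

```r
# Hypothetical wrapper around the git command line (not a package function).
# -C runs git inside `repo`; the -c flags supply a throwaway demo identity.
git <- function(repo, ...) {
  args <- c("-C", shQuote(repo),
            "-c", "user.name=Demo", "-c", "user.email=demo@example.com",
            ...)
  system2("git", args, stdout = TRUE)
}

git_available <- nzchar(Sys.which("git"))

if (git_available) {
  # Start a fresh repository in a temporary folder.
  repo <- tempfile("git-demo-")
  dir.create(repo)
  git(repo, "init", "--quiet")

  # First version of a script, recorded as a commit.
  script <- file.path(repo, "analysis.R")
  writeLines("x <- 3 * 4", script)
  git(repo, "add", "analysis.R")
  git(repo, "commit", "--quiet", "-m", shQuote("Add first analysis script"))

  # Change the script and record a second commit.
  writeLines("x <- 3 * 5", script)
  git(repo, "commit", "--quiet", "-a", "-m", shQuote("Change multiplier"))

  # The project history: one line per commit.
  history <- git(repo, "log", "--oneline")

  # Bring back the older version of the file from the previous commit.
  git(repo, "checkout", "HEAD~1", "--", "analysis.R")
  restored <- readLines(script)
}
```

Each commit is a snapshot you can return to, which is exactly what the GitHub app is doing for you behind its buttons.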
## Getting Help with R
You can follow along with the slides [here][d02_toolkit] if they do not appear below.
```{r, echo=FALSE}
"https://www.youtube.com/watch?v=O2wfi7Z0Py4" %>%
embed_url() %>%
use_align("center")
```
```{r, echo=FALSE}
slide_url(ds4p_urls,"d02_toolkit","#41") %>%
include_url(height = "400px")
```
```{r links, child="admin/md/courselinks.md"}
```