forked from moderndive/ModernDive_book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
04-tidy.Rmd
executable file
·421 lines (272 loc) · 24.6 KB
/
04-tidy.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
# Data Tidying via tidyr {#tidy}
```{r setup_tidy, include=FALSE}
chap <- 4
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
tidy = FALSE,
out.width = '\\textwidth'
)
# This bit of code is a bug fix on asis blocks, which we use to show/not show LC
# solutions, which are written like markdown text. In theory, it shouldn't be
# necessary for knitr versions <=1.11.6, but I've found I still need to for
# everything to knit properly in asis blocks. More info here:
# https://stackoverflow.com/questions/32944715/conditionally-display-block-of-markdown-text-using-knitr
library(knitr)
knit_engines$set(asis = function(options) {
if (options$echo && options$eval) knit_child(text = options$code)
})
# This controls which LC solutions to show. Options for solutions_shown: "ALL"
# (to show all solutions), or subsets of c('4-4', '4-5'), including the
# null vector c('') to show no solutions.
solutions_shown <- c('4-1', '4-2', '4-3', '4-4')
show_solutions <- function(section){
return(solutions_shown == "ALL" | section %in% solutions_shown)
}
```
In Subsection \@ref(programming-concepts) we introduced the concept of a data frame: a rectangular spreadsheet-like representation of data in R where the rows correspond to observations and the columns correspond to variables describing each observation. In Section \@ref(nycflights13), we started explorations of our first data frame `flights` included in the `nycflights13` package. In Chapter \@ref(viz) we made graphics using data contained in `flights` and other data frames.
In this chapter, we extend some of these ideas by discussing a type of data formatting called "tidy" data. You will see that having data stored in "tidy" format is about more than what the colloquial definition of the term "tidy" might suggest of having your data "neatly organized" in a spreadsheet. Instead, we define the term "tidy" in a more rigorous fashion, outlining a set of rules by which data can be stored and the implications of these rules on analyses.
Although knowledge of this type of data formatting was not necessary in our treatment of data visualization in Chapter \@ref(viz) since all the data was already in tidy format, we'll see going forward that having tidy data will allow you to more easily create data visualizations in a wide range of settings. Furthermore, it will also help you with data wrangling in Chapter \@ref(wrangling) and in all subsequent chapters in this book when we cover regression and discuss statistical inference.
### Needed packages {-}
Let's load all the packages needed for this chapter (this assumes you've already installed them). If needed, read Section \@ref(packages) for information on how to install and load R packages.
```{r warning=FALSE, message=FALSE}
library(nycflights13)
library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
```
```{r message=FALSE, warning=FALSE, echo=FALSE}
# Packages needed internally, but not in text.
library(knitr)
```
---
<!--Subsection on Tidy Data -->
## What is tidy data?
You have surely heard the word "tidy" in your life:
- "Tidy up your room!"
- "Please write your homework in a tidy way so that it is easier to grade and to provide feedback."
- Marie Kondo's best-selling book [_The Life-Changing Magic of Tidying Up: The Japanese Art of Decluttering and Organizing_](https://www.amazon.com/Life-Changing-Magic-Tidying-Decluttering-Organizing/dp/1607747308/ref=sr_1_1?ie=UTF8&qid=1469400636&sr=8-1&keywords=tidying+up)
- "I am not by any stretch of the imagination a tidy person, and the piles of unread books on the coffee table and by my bed have a plaintive, pleading quality to me - 'Read me, please!'" - Linda Grant
What does it mean for your data to be "tidy"? Beyond just being organized, in the context of this book having "tidy" data means that your data follows a standardized format. This makes it easier for you and others to visualize your data, to wrangle/transform your data, and to model your data. We will follow Hadley Wickham's definition of *tidy data* here [@tidy]:
> A dataset is a collection of values, usually either numbers (if quantitative)
or strings AKA text data (if qualitative). Values are organised in two ways.
Every value belongs to a variable and an observation. A variable contains all
values that measure the same underlying attribute (like height, temperature,
duration) across units. An observation contains all values measured on the same
unit (like a person, or a day, or a city) across attributes.
> Tidy data is a standard way of mapping the meaning of a dataset to its
structure. A dataset is messy or tidy depending on how rows, columns and tables
are matched up with observations, variables and types. In *tidy data*:
> 1. Each variable forms a column.
> 2. Each observation forms a row.
> 3. Each type of observational unit forms a table.
```{r tidyfig, echo=FALSE, fig.cap="Tidy data graphic from http://r4ds.had.co.nz/tidy-data.html"}
knitr::include_graphics("images/tidy-1.png")
```
For example, say the following table consists of stock prices:
```{r echo=FALSE}
stocks <- data_frame(
Date = as.Date('2009-01-01') + 0:4,
`Boeing Stock Price` = paste("$", c("173.55", "172.61", "173.86", "170.77", "174.29"), sep = ""),
`Amazon Stock Price` = paste("$", c("174.90", "171.42", "171.58", "173.89", "170.16"), sep = ""),
`Google Stock Price` = paste("$", c("174.34", "170.04", "173.65", "174.87", "172.19") ,sep = "")
) %>%
slice(1:2)
stocks %>%
kable(
digits = 2,
caption = "Stock Prices (Non-Tidy Format)",
booktabs = TRUE
)
```
Although the data are neatly organized in a spreadsheet-type format, they are not in tidy format since there are three variables corresponding to three unique pieces of information (Date, Stock Name, and Stock Price), but there are not three columns. In tidy data format each variable should be its own column, as shown below. Notice that both tables present the same information, but in different formats.
```{r echo=FALSE}
stocks_tidy <- stocks %>%
rename(
Boeing = `Boeing Stock Price`,
Amazon = `Amazon Stock Price`,
Google = `Google Stock Price`
) %>%
gather(`Stock Name`, `Stock Price`, -Date)
stocks_tidy %>%
kable(
digits = 2,
caption = "Stock Prices (Tidy Format)",
booktabs = TRUE
)
```
However, consider the following table
```{r echo=FALSE}
stocks <- data_frame(
Date = as.Date('2009-01-01') + 0:4,
`Boeing Price` = paste("$", c("173.55", "172.61", "173.86", "170.77", "174.29"), sep = ""),
`Weather` = c("Sunny", "Overcast", "Rain", "Rain", "Sunny")
) %>%
slice(1:2)
stocks %>%
kable(
digits = 2,
caption = "Date, Boeing Price, Weather Data",
booktabs = TRUE
)
```
In this case, even though the variable "Boeing Price" occurs again, the data *is* tidy since there are three variables corresponding to three unique pieces of information (Date, Boeing stock price, and the weather that particular day).
The non-tidy data format in the original table is also known as ["wide"](https://en.wikipedia.org/wiki/Wide_and_narrow_data) format whereas the tidy data format in the second table is also known as ["long"/"narrow"](https://en.wikipedia.org/wiki/Wide_and_narrow_data#Narrow) data format.
In this book, we will work with work with datasets that are already in tidy format. But data isn't always in this nice format that the `tidyverse` gets its name from. Data actually may come to you in a variety of different formats that require data cleaning and reshaping beyond the scope of this book. For a thorough example of the steps needed to take a messy dataset and turn it into a tidy one, check out the different functions available for data tidying and a case study using data from the World Health Organization in [R for Data Science](http://r4ds.had.co.nz/tidy-data.html) [@rds2016].
Most frequently though, data that isn't in long format and is instead in wide format can be converted into "tidy" format by using the `tidyr` package [@R-tidyr] in the `tidyverse`. We'll now investigate how that can be done using the `gather()` function in `tidyr`. Before we proceed with reshaping our data, we will discuss how to read data stored in CSV format into R as a data frame.
---
## Importing CSVs via readr
Up to this point, we've used data either stored inside of an R package or we've manually created the data such as the `fruits` and `fruits_counted` data in Subsection \@ref(geombar). Another common way to get data into R is via reading in data from a spreadsheet either stored on your computer or stored online. For our purposes here, we will work with downloading data stored online.
First, let's download a *Comma Separated Values* (CSV) file of ratings of the level of democracy in different countries spanning 1952 to 1992: <http://ismayc.github.io/dem_score.csv>. After downloading it open it and take a look. You can think of a CSV file as a barebones spreadsheet where:
* Each line in the file corresponds to a row of data/one observation.
* Values for each line are separated with commas. In other words, the values of different variables are separated by commas.
* The first line is usually a *header* row indicating the names of the columsn/variables.
As opposed to a barebones CSV file, Excel files contain a lot of *metadata*, or put more simply, data about the data. Examples include the used of bold and italic fonts, colored cells, different column widths, etc. However, going forward we will only avail ourselves of just the data, and not the metadata, as saved in a CSV file.
There are many ways to read in this data into RStudio. Here are two of the simplest; for the purposes of practice, we suggest you try both. First, we can use the `read_csv()` function from the `readr` package to read in the data directly off the web:
```{r message=FALSE}
dem_score <- read_csv("http://ismayc.github.io/dem_score.csv")
dem_score
```
Second, let's read in the same data, but using the file you just downloaded on
to your computer: Go to the Files pane of RStudio -> Navigate the directories to
where your downloaded files are -> Right click `dem_score.csv` -> Click "Import
Dataset..." -> Click "Import". You'll see two things happen:
1. The RStudio Viewer will pop open with your data.
1. In the console, the command that read-in the data will run. You can copy and paste this code to reload your data again later.
In this `dem_score` data frame, the minimum value of -10 corresponds to a highly autocratic nation whereas a value of 10 corresponds to a highly democratic nation. Note also that backticks surround the different names of the columns here. Variable names are not allowed to start with a number but this can be worked around by surrounding the column name in backticks. Variable names also can't include spaces so if you'd like to refer to the variable **Stock Names** above, for example, you'll need to surround it in backticks: `` `Stock Names` ``.
---
## Converting from wide to long {#tidying}
Let's focus on only the data corresponding to the country of Guatemala.
```{r }
guat_dem <- dem_score %>%
filter(country == "Guatemala")
guat_dem
```
Now let's produce a plot showing how the democracy scores have changed over the 40 years from 1952 to 1992 for Guatemala. Let's start by laying out how we would map our aesthetics to variables in the data frame:
- The `data` frame is `guat_dem` by setting `data = guat_dem`
What are the names of the variables to plot? We'd like to see how the democracy score has changed over the years. Now we are stuck in a predicament. We see that we have a variable named `country` but its only value is `"Guatemala"`. We have other variables denoted by different year values. Unfortunately, we've run into a dataset that is not in the appropriate format to apply the Grammar of Graphics and `ggplot2`. Remember that `ggplot2` is a package in the `tidyverse` and, thus, needs data to be in a tidy format. We'd like to finish off our mapping of aesthetics to variables by doing something like
- The `aes`thetic mapping is set by `aes(x = year, y = democracy_score)`
but this is not possible with our wide-formatted data. We need to take the values of the current column names in `guat_dem` (aside from `country`) and convert them into a new variable that will act as a key called `year`. Then, we'd like to take the numbers on the inside of the table and turn them into a column that will act as values called `democracy_score`. Our resulting data frame will have three columns: `country`, `year`, and `democracy_score`.
The `gather()` function in the `tidyr` package can complete this task for us. The first argument to `gather()`, just as with `ggplot2()`, is the `data` argument where we specify which data frame we would like to tidy. The next two arguments to `gather()` are `key` and `value`, which specify what we'd like to call the new columns that convert our wide data into long format. Lastly, we include a specification for variables we'd like to NOT include in this tidying process using a `-`.
<!-- Should we include a mention of also including all the variables you'd like to include? I rarely do this and use the negation instead. -->
<!-- I like not teaching the pipe here since the data argument is the same as what they are used to with ggplot2 -->
```{r}
guat_tidy <- gather(data = guat_dem,
key = year,
value = democracy_score,
- country)
guat_tidy
```
We can now create the plot to show how the democracy score of Guatemala changed from 1952 to 1992 using a linegraph and `ggplot2`.
```{r errors=TRUE}
ggplot(data = guat_tidy, mapping = aes(x = year, y = democracy_score)) +
geom_line()
```
<!-- Arg, this is really annoying that gather() doesn't see that these are all numbers. Do you know a way around this? I usually just go mutate(year = as.numeric(year) but they don't know mutate() yet. -->
Observe that the `year` variable in `guat_tidy` is stored as a character vector since we had to circumvent the naming rules in R by adding backticks around the different year columns in `guat_dem`. This is leading to `ggplot` not knowing exactly how to plot a line using a categorical variable. We can fix this by using the `parse_number()` function in the `readr` package and then specify the horizontal axis label to be `"year"`:
```{r guatline, fig.cap="Guatemala's democracy score ratings from 1952 to 1992"}
ggplot(data = guat_tidy, mapping = aes(x = parse_number(year), y = democracy_score)) +
geom_line() +
labs(x = "year")
```
We'll see in Chapter \@ref(wrangling) how we could use the `mutate()` function to change `year` to be a numeric variable instead after we have done our tidying. Notice now that the mappings of aesthetics to variables make sense in Figure \@ref(fig:guatline):
- The `data` frame is `guat_tidy` by setting `data = dem_score`
- The `x` `aes`thetic is mapped to `year`
- The `y` `aes`thetic is mapped to `democracy_score`
- The `geom_`etry chosen is `line`
```{block lc-tidying, type='learncheck', purl=FALSE}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Convert the `dem_score` data frame into
a tidy data frame and assign the name of `dem_tidy` to the resulting long-formatted data frame.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Read in the life expectancy data stored at http://ismayc.github.io/le_mess.csv and convert it to a tidy data frame.
```{asis lc4-1solutions, include=show_solutions('4-1')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc - 1), ")")`** `dem_tidy <- gather(data = dem_score, key = year, value = democracy_score, - country)`
**`r paste0("(LC", chap, ".", (lc), ")")`**
`life_expectancy <- read_csv('http://ismayc.github.io/le_mess.csv')`
`life_ex_tidy <- gather(data = life_expectancy, key = year, value = life_expectancy, -country)`
```
```{asis lc4-2solutions, include=show_solutions('4-2'), echo=show_solutions('4-2')}
`dem_tidy <- gather(data = dem_score, key = year, value = democracy_score, - country)`
`life_expectancy <- read_csv('http://ismayc.github.io/le_mess.csv')`
`life_ex_tidy <- gather(data = life_expectancy, key = year, value = life_expectancy, -country)`
```
```{block, type='learncheck', purl=FALSE}
```
---
<!--Subsection on nycflights13 -->
## Back to nycflights
Recall the `nycflights13` package with data about all domestic flights departing from New York City in 2013 that we introduced in Chapter \@ref(nycflights13) and used extensively in Chapter \@ref(viz) to create visualizations. In particular, let's revisit the `flights` data frame by running `View(flights)` in your console. We see that `flights` has a rectangular shape with each row corresponding to a different flight and each column corresponding to a characteristic of that flight. This matches exactly with how Hadley Wickham defined tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
But what about the third property?
> 3. Each type of observational unit forms a table.
### Observational units
We identified earlier that the observational unit in the `flights` dataset is an individual flight. And we have shown that this dataset consists of `r scales::comma(nrow(flights))` flights with `r ncol(flights)` variables. In other words, rows of this dataset don't refer to a measurement on an airline or on an airport; they refer to characteristics/measurements on a given flight from New York City in 2013.
Also included in the `nycflights13` package are datasets with different observational units [@R-nycflights13]:
* `airlines`: translation between two letter IATA carrier codes and names (`r nrow(nycflights13::airlines)` in total)
* `planes`: construction information about each of `r scales::comma(nrow(nycflights13::planes))` planes used
* `weather`: hourly meteorological data (about `r nycflights13::weather %>% count(origin) %>% .[["n"]] %>% mean() %>% round()` observations) for each of the three NYC airports
* `airports`: airport names and locations
The organization of this data follows the third "tidy" data property: observations corresponding to the same observational unit should be saved in the same table/data frame.
### Identification vs measurement variables
There is a subtle difference between the kinds of variables that you will encounter in data frames: *measurement variables* and *identification variables*. The `airports` data frame you worked with above contains both these types of variables. Recall that in `airports` the observational unit is an airport, and thus each row corresponds to one particular airport. Let's pull them apart using the `glimpse` function:
```{r}
glimpse(airports)
```
The variables `faa` and `name` are what we will call *identification variables*: variables that uniquely identify each observational unit. They are mainly used to provide a name to the observational unit. `faa` gives the code provided by the FAA for that airport while the `name` variable gives the longer more natural name of the airport. The remaining variables (`lat`, `lon`, `alt`, `tz`, `dst`, `tzone`) are often called *measurement* or *characteristic* variables: variables that describe properties of each observational unit, in other words each observation in each row. For example, `lat` and `long` describe the latitude and longitude of each airport.
While it is not an absolute rule, for organizational purposes it considered good practice to have your identification variables in the far left-most columns of your data frame.
```{block lc3-3c, type='learncheck'}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What properties of the observational unit do each of `lat`, `lon`, `alt`, `tz`, `dst`, and `tzone` describe for the `airports` data frame? Note that you may want to use `?airports` to get more information.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Provide the names of variables in a data frame with at least three variables in which one of them is an identification variable and the other two are not. In other words, create your own tidy dataset that matches these conditions.
```{asis lc4-3solutions, include=show_solutions('4-3')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc - 1), ")")`** See `?airports` help file
**`r paste0("(LC", chap, ".", (lc), ")")`** Identification Variables
* In the `weather` example in LC3.8, the combination of `origin`, `year`, `month`, `day`, `hour` are identification variables as they identify the observation in question.
* Anything else pertains to observations: `temp`, `humid`, `wind_speed`, etc.
```
```{block, type='learncheck', purl=FALSE}
```
---
## Optional: Normal forms of data
The datasets included in the `nycflights13` package are in a form that minimizes redundancy of data. We will see that there are ways to _merge_ (or _join_) the different tables together easily. We are capable of doing so because each of the tables have _keys_ in common to relate one to another. This is an important property of **normal forms** of data. The process of decomposing data frames into less redundant tables without losing information is called **normalization**. More information is available on [Wikipedia](https://en.wikipedia.org/wiki/Database_normalization).
We saw an example of this above with the `airlines` dataset. While the `flights` data frame could also include a column with the names of the airlines instead of the carrier code, this would be repetitive since there is a unique mapping of the carrier code to the name of the airline/carrier.
Below an example is given showing how to **join** the `airlines` data frame together with the `flights` data frame by linking together the two datasets via a common **key** of `"carrier"`. Note that this "joined" data frame is assigned to a new data frame called `joined_flights`. The **key** variable that we frequently join by is one of the *identification variables* mentioned above.
```{r message=FALSE}
library(dplyr)
joined_flights <- inner_join(x = flights, y = airlines, by = "carrier")
```
```{r eval=FALSE}
View(joined_flights)
```
If we `View` this dataset, we see a new variable has been created called `name`. (We will see in Subsection \@ref(rename) ways to change `name` to a more descriptive variable name.) More discussion about joining data frames together will be given in Chapter \@ref(wrangling). We will see there that the names of the columns to be linked need not match as they did here with `"carrier"`.
```{block tidy_review, type='learncheck'}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are common characteristics of "tidy" datasets?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What makes "tidy" datasets useful for organizing data?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some advantages of data in normal forms? What are some disadvantages?
```{asis lc4-4solutions, include=show_solutions('4-4')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc - 2), ")")` What are common characteristics of “tidy” datasets?** Rows correspond to observations, while columns correspond to variables.
**`r paste0("(LC", chap, ".", (lc - 1), ")")` What makes “tidy” datasets useful for organizing data?** Tidy datasets are an organized way of viewing data. We'll see later that this format is required for the `ggplot2` and `dplyr` packages for data visualization and wrangling.
**`r paste0("(LC", chap, ".", (lc), ")")`** When datasets are in normal form, we can easily `_join` them with other datasets! For example, can we join the `flights` data with the `planes` data? We'll see this more in Chapter 5!
```
```{block, type='learncheck', purl=FALSE}
```
---
## Conclusion
### Review questions
<!-- Need to include an exercise in the DataCamp course on using gather() to turn the `police_locals` data frame into a tidy data frame. -->
Review questions have been designed using the `fivethirtyeight` R package [@R-fivethirtyeight] with links to the corresponding FiveThirtyEight.com articles in our free DataCamp course **Effective Data Storytelling using the `tidyverse`**. The material in this chapter is covered in the **Tidy Data** chapter of the DataCamp course available [here](https://campus.datacamp.com/courses/effective-data-storytelling-using-the-tidyverse/tidy-data).
### What's to come?
In Chapter \@ref(wrangling), we'll further explore data in tidy format by grouping our data, creating summaries based on those groupings, filtering our data to match conditions, and performing other wranglings with our data including defining new columns/variables. These data wrangling procedures will go hand-in-hand with the data visualizations you've produced in Chapter \@ref(viz).
### Script of R code
An R script file of all R code used in this chapter is available [here](http://moderndive.com/scripts/04-tidy.R).