forked from rstudio-conf-2020/r-for-excel
-
Notifications
You must be signed in to change notification settings - Fork 0
/
tidying.Rmd
319 lines (210 loc) · 12.1 KB
/
tidying.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
# Tidying {#tidying}
## Summary
In previous sessions, we learned to read in data, do some wrangling, and create a graph and table.
Here, we'll continue by *reshaping* data frames (converting from long-to-wide, or wide-to-long format), *separating* and *uniting* variable (column) contents, and finding and replacing string patterns.
### Tidy data
"Tidy" might sound like a generic way to describe non-messy looking data, but it is actually a specific data structure. When data is *tidy*, it is rectangular with each variable as a column, each row an observation, and each cell contains a single value (see: [Ch. 12 in R for Data Science by Grolemund & Wickham](https://r4ds.had.co.nz/tidy-data.html)).
![](img/tidy_data.png)
### Objectives
In this session we'll learn some tools to help make our data **tidy** and more coder-friendly. Those include:
- Use `tidyr::pivot_wider()` and `tidyr::pivot_longer()` to reshape data frames
- `janitor::clean_names()` to make column headers more manageable
- `tidyr::unite()` and `tidyr::separate()` to merge or separate information from different columns
- Detect or replace a string with `stringr` functions
### Resources
-- [Ch. 12 *Tidy Data*, in R for Data Science](https://r4ds.had.co.nz/tidy-data.html) by Grolemund & Wickham
- [`tidyr` documentation from tidyverse.org](https://tidyr.tidyverse.org/)
- [`janitor` repo / information](https://github.com/sfirke/janitor) from Sam Firke
## Set-up
### Create a new R Markdown and attach packages
- Open your project from Day 1 (click on the .Rproj file)
- PULL to make sure your project is up to date
- Create a new R Markdown file called `my_tidying.Rmd`
- Remove all example code / text below the first code chunk
- Attach the packages we'll use here (`library(package_name)`):
- `tidyverse`
- `here`
- `janitor`
- `readxl`
Knit and save your new .Rmd within the project folder.
```{r, message = FALSE, warning = FALSE}
# Attach packages
library(tidyverse)
library(janitor)
library(here)
library(readxl)
```
### `read_excel()` to read in data from an Excel worksheet
We've used both `read_csv()` and `read_excel()` to import data from spreadsheets into R.
Use `read_excel()` to read in the **inverts.xlsx** data as an objected called **inverts**.
```{r, message = FALSE}
inverts <- read_excel(here("data", "inverts.xlsx"))
```
Be sure to explore the imported data a bit:
```{r, eval = FALSE}
View(inverts)
names(inverts)
summary(inverts)
```
## `tidyr::pivot_longer()` to reshape from wider-to-longer format
If we look at *inverts*, we can see that the *year* variable is actually split over 3 columns, so we'd say this is currently in **wide format**.
There may be times when you want to have data in wide format, but often with code it is more efficient to convert to **long format** by gathering together observations for a variable that is currently split into multiple columns.
Schematically, converting from wide to long format using `pivot_longer()` looks like this:
![](img/tidyr_pivot_longer.png)
We'll use `tidyr::pivot_longer()` to gather data from all years in *inverts* (columns `2016`, `2017`, and `2018`) into two columns:
- one called *year*, which contains the year
- one called *sp_count* containing the number of each species observed.
The new data frame will be stored as *inverts_long*:
```{r}
# Note: Either single-quotes, double-quotes, OR backticks around years work!
inverts_long <- pivot_longer(data = inverts,
cols = '2016':'2018',
names_to = "year",
values_to = "sp_count")
```
The outcome is the new long-format *inverts_long* data frame:
```{r}
inverts_long
```
Hooray, long format!
One thing that isn't obvious at first (but would become obvious if you continued working with this data) is that since those year numbers were initially column names (characters), when they are stacked into the *year* column, their class wasn't auto-updated to numeric.
Explore the class of *year* in *inverts_long*:
```{r}
class(inverts_long$year)
```
That's a good thing! We don't want R to update classes of our data without our instruction. We'll use `dplyr::mutate()` in a different way here: to create a new column (that's how we've used `mutate()` previously) that has the same name of an existing column, in order to update and overwrite the existing column.
In this case, we'll `mutate()` to add a column called *year*, which contains an `as.numeric()` version of the existing *year* variable:
```{r}
# Coerce "year" class to numeric:
inverts_long <- inverts_long %>%
mutate(year = as.numeric(year))
```
Checking the class again, we see that *year* has been updated to a numeric variable:
```{r}
class(inverts_long$year)
```
## `tidyr::pivot_wider()` to convert from longer-to-wider format
In the previous example, we had information spread over multiple columns that we wanted to *gather*. Sometimes, we'll have data that we want to *spread* over multiple columns.
For example, imagine that starting from *inverts_long* we want each species in the *common_name* column to exist as its **own column**. In that case, we would be converting from a longer to a wider format, and will use `tidyr::pivot_wider()`.
Specifically for our data, we'll use `pivot_wider()` to spread the *common_name* across multiple columns as follows:
```{r}
inverts_wide <- inverts_long %>%
pivot_wider(names_from = common_name,
values_from = sp_count)
```
```{r}
inverts_wide
```
We can see that now each *species* has its own column (wider format). But also notice that those column headers (since they have spaces) might not be in the most coder-friendly format...
## `janitor::clean_names()` to clean up column names
The `janitor` package by Sam Firke is a great collection of functions for some quick data cleaning, like:
- `janitor::clean_names()`: update column headers to a case of your choosing
- `janitor::get_dupes()`: see all rows that are duplicates within variables you choose
- `janitor::remove_empty()`: remove empty rows and/or columns
- `janitor::adorn_*()`: jazz up tables
Here, we'll use `janitor::clean_names()` to convert all of our column headers to a more convenient case - the default is **lower_snake_case**, which means all spaces and symbols are replaced with an underscore (or a word describing the symbol), all characters are lowercase, and a few other nice adjustments.
For example, `janitor::clean_names()` would update these nightmare column names into much nicer forms:
- `My...RECENT-income!` becomes `my_recent_income`
- `SAMPLE2.!test1` becomes `sample2_test1`
- `ThisIsTheName` becomes `this_is_the_name`
- `2015` becomes `x2015`
If we wanted to then use these columns (which we probably would, since we created them), we could clean the names to get them into more coder-friendly lower_snake_case with `janitor::clean_names()`:
```{r}
inverts_wide <- inverts_wide %>%
clean_names()
```
```{r}
names(inverts_wide)
```
And there are other case options in `clean_names()`, like:
- "snake" produces snake_case (the default)
- "lower_camel" or "small_camel" produces lowerCamel
- "upper_camel" or "big_camel" produces UpperCamel
- "screaming_snake" or "all_caps" produces ALL_CAPS
- "lower_upper" produces lowerUPPER
- "upper_lower" produces UPPERlower
## `tidyr::unite()` and `tidyr::separate()` to combine or separate information in column(s)
Sometimes we'll want to *separate* contents of a single column into multiple columns, or *combine* entries from different columns into a single column.
For example, the following data frame has *genus* and *species* in separate columns:
```{r, echo = FALSE, eval=FALSE}
unite_ex <- tribble(
~id, ~genus, ~species, ~common_name,
1, "Scorpaena", "guttata", "sculpin",
2, "Sebastes", "miniatus", "vermillion"
)
kable(unite_ex)
```
We may want to combine the genus and species into a single column, *scientific_name*:
```{r, echo = FALSE, eval=FALSE}
unite_ex_sci_name <- unite_ex %>%
unite("scientific_name",
c(genus, species),
sep = " ")
kable(unite_ex_sci_name)
```
Or we may want to do the reverse (separate information from a single column into multiple columns). Here, we'll learn `tidyr::unite()` and `tidyr::separate()` to help us do both.
### `tidyr::unite()` to merge information from separate columns
Use `tidyr::unite()` to combine information from multiple columns into a single column (as for the scientific name example above)
![](img/rstudio-cheatsheet-unite.png)
To demonstrate uniting information from separate columns, we'll make a single column that has the combined information from *site* abbreviation and *year* in *inverts_long*.
We need to give `tidyr::unite()` several arguments:
- **data:** the data frame containing columns we want to combine (or pipe into the function from the data frame)
- **col:** the name of the new "united" column
- the **columns you are uniting**
- **sep:** the symbol, value or character to put between the united information from each column
```{r}
inverts_unite <- inverts_long %>%
unite(col = "site_year", # What to name the new united column
c(site, year), # The columns we'll unite (site, year)
sep = "_") # How to separate the things we're uniting
```
```{r, echo = FALSE}
head(inverts_unite)
```
#### Activity:
**Task:** Create a new object called 'inverts_moyr', starting from inverts_long, that unites the month and year columns into a single column named "mo_yr", using a slash "/" as the separator. Then try updating the separator to something else! Like "_hello!_".
**Solution:**
```{r}
inverts_moyr <- inverts_long %>%
unite(col = "mo_yr", # What to name the new united column
c(month, year), # The columns we'll unite (site, year)
sep = "/")
```
**Merging information from > 2 columns (not done in workshop)**
`tidyr::unite()` can also combine information from *more* than two columns. For example, to combine the *site*, *common_name* and *year* columns from *inverts_long*, we could use:
```{r}
# Uniting more than 2 columns:
inverts_triple_unite <- inverts_long %>%
tidyr::unite(col = "year_site_name",
c(year, site, common_name),
sep = "-") # Note: this is a dash
```
```{r}
head(inverts_triple_unite)
```
### `tidyr::separate()` to separate information into multiple columns
While `tidyr::unite()` allows us to combine information from multiple columns, it's more likely that you'll *start* with a single column that you want to split up into pieces.
For example, I might want to split up a column containing the *genus* and *species* (*Scorpaena guttata*) into two separate columns (*Scorpaena* | *guttata*), so that I can count how many *Scorpaena* organisms exist in my dataset at the genus level.
Use `tidyr::separate()` to "separate a character column into multiple columns using a regular expression separator."
![](img/rstudio-cheatsheet-separate.png)
Let's start again with *inverts_unite*, where we have combined the *site* and *year* into a single column called *site_year*. If we want to **separate** those, we can use:
```{r}
inverts_sep <- inverts_unite %>%
tidyr::separate(site_year, into = c("my_site", "my_year"))
```
## `stringr::str_replace()` to replace a pattern
Was data entered in a way that's difficult to code with, or is just plain annoying? Did someone wrongly enter "fish" as "fsh" throughout the spreadsheet, and you want to update it everywhere?
Use `stringr::str_replace()` to automatically replace a string pattern.
**Warning**: The pattern will be replaced everywhere - so if you ask to replace "fsh" with "fish", then "offshore" would be updated to "offishore". Be careful to ensure that when you think you're making one replacement, you're not also replacing something else unexpectedly.
Starting with inverts, let's any place we find "california" we want to replace it with the abbreviation "CA":
```{r}
ca_abbr <- inverts %>%
mutate(
common_name =
str_replace(common_name,
pattern = "california",
replacement = "CA")
)
```
Now, check to confirm that "california" has been replaced with "CA".
### END **tidying** session!