forked from rstudio-conf-2020/r-for-excel
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathfilter-join.Rmd
366 lines (238 loc) · 15.9 KB
/
filter-join.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
# Filters and joins {#filter-join}
## Summary
In previous sessions, we've learned to do some basic wrangling and find summary information with functions in the `dplyr` package, which exists within the `tidyverse`. In this session, we'll expand our data wrangling toolkit using:
- `filter()` to conditionally subset our data by **rows**, and
- `*_join()` functions to merge data frames together
- And we'll make a nicely formatted HTML table with `kable()` and `kableExtra`
The combination of `filter()` and `*_join()` - to return rows satisfying a condition we specify, and merging data frames by like variables - is analogous to the useful VLOOKUP function in Excel.
### Objectives
- Use `filter()` to subset data frames, returning **rows** that satisfy variable conditions
- Use `full_join()`, `left_join()`, and `inner_join()` to merge data frames, with different endpoints in mind
- Use `filter()` and `*_join()` as part of a wrangling sequence
### Resources
- [`filter()` documentation from tidyverse.org](https://dplyr.tidyverse.org/reference/filter.html)
- [`join()` documentation from tidyverse.org](https://dplyr.tidyverse.org/reference/join.html)
- [Chapters 5 and 13 in *R for Data Science* by Garrett Grolemund and Hadley Wickham](https://r4ds.had.co.nz/)
- ["Create awesome HTML tables with knitr::kable() and kableExtra" by Hao Zhu](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html)
## Set-up: Create a new .Rmd, attach packages & get data
Create a new R Markdown document in your r-workshop project and knit to save as **filter_join.Rmd**. Remove all the example code (everything below the set-up code chunk).
In this session, we'll attach four packages:
- `tidyverse`
- `readxl`
- `here`
- `kableExtra`
Attach the packages in the setup code chunk in your .Rmd:
```{r, message = FALSE, warning = FALSE}
library(tidyverse)
library(readxl)
library(here)
library(kableExtra)
```
Then create a new code chunk to read in two files from your 'data' subfolder:
- fish.csv
- kelp_fronds.xlsx (read in only the "abur" worksheet by adding argument `sheet = "abur"` to `read_excel()`)
```{r, message = FALSE}
# Read in data:
fish <- read_csv(here("data", "fish.csv"))
kelp_abur <- read_excel(here("data", "kelp_fronds.xlsx"), sheet = "abur")
```
We should always explore the data we've read in. Use functions like `View()`, `names()`, `summary()`, `head()` and `tail()` to check them out.
Now, let's use `filter()` to decide which observations (rows) we'll keep or exclude in new subsets, similar to using Excel's VLOOKUP function or filter tool.
## `dplyr::filter()` to conditionally subset by rows
Use `filter()` to let R know which **rows** you want to keep or exclude, based whether or not their contents match conditions that you set for one or more variables.
![](img/rstudio-cheatsheet-filter.png)
Some examples in words that might inspire you to use `filter()`:
- "I only want to keep rows where the temperature is greater than 90°F."
- "I want to keep all observations **except** those where the tree type is listed as **unknown**."
- "I want to make a new subset with only data for mountain lions (the species variable) in California (the state variable)."
When we use `filter()`, we need to let R know a couple of things:
- What data frame we're filtering from
- What condition(s) we want observations to **match** and/or **not match** in order to keep them in the new subset
Here, we'll learn some common ways to use `filter()`.
### Filter rows by matching a single character string
Let's say we want to keep all observations from the **fish** data frame where the common name is "garibaldi" (fun fact: that's California's official marine state fish, protected in California coastal waters!).
Here, we need to tell R to only *keep rows* from the **fish** data frame when the common name (**common_name** variable) exactly matches **garibaldi**.
Use `==` to ask R to look for exact matches:
```{r}
fish_garibaldi <- fish %>%
filter(common_name == "garibaldi")
```
Check out the **fish_garibaldi** object to ensure that only *garibaldi* observations remain.
#### Activity
**Task**: Create a subset starting from the **fish** data frame, stored as object **fish_mohk**, that only contains observations from Mohawk Reef (site entered as "mohk").
**Solution**:
```{r}
fish_mohk <- fish %>%
filter(site == "mohk")
```
Explore the subset you just created to ensure that only Mohawk Reef observations are returned.
### Filter rows based on numeric conditions
Use expected operators (>, <, >=, <=, ==) to set conditions for a numeric variable when filtering. For this example, we only want to retain observations when the **total_count** column value is >= 50:
```{r}
fish_over50 <- fish %>%
filter(total_count >= 50)
```
### Filter to return rows that match *this* OR *that* OR *that*
What if we want to return a subset of the **fish** df that contains *garibaldi*, *blacksmith* OR *black surfperch*?
There are several ways to write an "OR" statement for filtering, which will keep any observations that match Condition A *or* Condition B *or* Condition C. In this example, we will create a subset from **fish** that only contains rows where the **common_name** is *garibaldi* or *blacksmith* or *black surfperch*.
Way 1: You can indicate **OR** using the vertical line operator `|` to indicate "OR":
```{r}
fish_3sp <- fish %>%
filter(common_name == "garibaldi" |
common_name == "blacksmith" |
common_name == "black surfperch")
```
Alternatively, if you're looking for multiple matches in the *same variable*, you can use the `%in%` operator instead. Use `%in%` to ask R to look for *any matches* within a vector:
```{r}
fish_3sp <- fish %>%
filter(common_name %in% c("garibaldi", "blacksmith", "black surfperch"))
```
Notice that the two methods above return the same thing.
**Critical thinking:** In what scenario might you *NOT* want to use `%in%` for an "or" filter statement? Hint: What if the "or" conditions aren't different outcomes for the same variable?
#### Activity
**Task:** Create a subset from **fish** called **fish_gar_2016** that keeps all observations if the year is 2016 *OR* the common name is "garibaldi."
**Solution:**
```{r}
fish_gar_2016 <- fish %>%
filter(year == 2016 | common_name == "garibaldi")
```
### Filter to return observations that match **this** AND **that**
In the examples above, we learned to keep observations that matched any of a number of conditions (**or** statements).
Sometimes we'll only want to keep observations that satisfy multiple conditions (e.g., to keep this observation it must satisfy this condition **AND** that condition). For example, we may want to create a subset that only returns rows from **fish** where the **year** is 2018 *and* the **site** is Arroyo Quemado "aque"
In `filter()`, add a comma (or ampersand '&') between arguments for multiple "and" conditions:
```{r}
aque_2018 <- fish %>%
filter(year == 2018, site == "aque")
```
Check it out to see that only observations where the site is "aque" in 2018 are retained:
```{r}
aque_2018
```
Like most things in R, there are other ways to do the same thing. For example, you could do the same thing using `&` (instead of a comma) between "and" conditions:
```{r, eval = FALSE}
# Use the ampersand (&) to add another condition "and this must be true":
aque_2018 <- fish %>%
filter(year == 2018 & site == "aque")
```
Or you could just do two filter steps in sequence:
```{r, eval = FALSE}
# Written as sequential filter steps:
aque_2018 <- fish %>%
filter(year == 2018) %>%
filter(site == "aque")
```
### Activity: combined filter conditions
**Challenge task:** Create a subset from the **fish** data frame, called **low_gb_wr** that only contains:
- Observations for *garibaldi* or *rock wrasse*
- AND the *total_count* is *less than or equal to 10*
**Solution:**
```{r}
low_gb_wr <- fish %>%
filter(common_name %in% c("garibaldi", "rock wrasse"),
total_count <= 10)
```
### `stringr::str_detect()` to filter by a partial pattern
Sometimes we'll want to keep observations that contain a specific string pattern within a variable of interest.
For example, consider the fantasy data below:
```{r, echo = FALSE}
df_str_ex <- tribble(
~id, ~species,
1, "rainbow rockfish",
2, "blue rockfish",
3, "sparkle urchin",
4, "royal blue fish"
)
kable(df_str_ex)
```
There might be a time when we would want to use observations that:
- Contain the string "fish", in isolation or within a larger string (like "rockfish")
- Contain the string "blue"
In those cases, it would be useful to **detect** a string pattern, and potentially keep any rows that contain it. Here, we'll use `stringr::str_detect()` to find and keep observations that contain our specified string pattern.
Let's detect and keep observations from **fish** where the **common_name** variable contains string pattern "black". Note that there are two fish, blacksmith and black surfperch, that would satisfy this condition.
Using `filter()` + `str_detect()` in combination to find and keep observations where the **site** variable contains pattern "sc":
```{r}
fish_bl <- fish %>%
filter(str_detect(common_name, pattern = "black"))
```
So `str_detect()` returns is a series of TRUE/FALSE responses for each row, based on whether or not they contain the specified pattern. In that example, any row that *does* contain "black" returns `TRUE`, and any row that *does not* contain "black" returns `FALSE`.
### Activity
**Task:** Create a new object called **fish_it**, starting from **fish**, that only contains observations if the **common_name** variable contains the string pattern "it". What species remain?
**Solution:**
```{r}
fish_it <- fish %>%
filter(str_detect(common_name, pattern = "it"))
# blacksmITh and senorITa remain!
```
We can also *exclude* observations that contain a set string pattern by adding the `negate = TRUE` argument within `str_detect()`.
**Sync your local project to your repo on GitHub.**
## `dplyr::*_join()` to merge data frames
There are a number of ways to merge data frames in R. We'll use `full_join()`, `left_join()`, and `inner_join()` in this session.
From R Documentation (`?join`):
- `full_join()`: "returns all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing." Basically, nothing gets thrown out, even if a match doesn't exist - making `full_join()` the safest option for merging data frames. When in doubt, `full_join()`.
- `left_join()`: "return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned."
- `inner_join()`: "returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned." This will drop observations that don't have a match between the merged data frames, which makes it a riskier merging option if you're not sure what you're trying to do.
Schematic (from RStudio data wrangling cheat sheet):
![](img/rstudio-cheatsheet-combine-options1.png)
We will use **kelp_abur** as our "left" data frame, and **fish** as our "right" data frame, to explore different join outcomes.
### `full_join()` to merge data frames, keeping everything
When we join data frames in R, we need to tell R a couple of things (and it does the hard joining work for us):
- Which data frames we want to merge together
- Which variables to merge by
Use `full_join()` to safely combine two data frames, keeping everything from both and populating with `NA` as necessary.
Example: use `full_join()` to combine **kelp_abur** and **fish**:
```{r}
abur_kelp_fish <- kelp_abur %>%
full_join(fish, by = c("year", "site"))
```
Let's look at the merged data frame with `View(abur_kelp_fish)`. A few things to notice about how `full_join()` has worked:
1. All columns that existed in **both data frames** still exist
2. All observations are retained, even if they don't have a match. In this case, notice that for other sites (not 'abur') the observation for fish still exists, even though there was no corresponding kelp data to merge with it.
3. The kelp frond data is joined to *all observations* where the joining variables (*year*, *site*) are a match, which is why it is repeated 5 times for each year (once for each fish species).
Because all data (observations & columns) are retained, `full_join()` is the safest option if you're unclear about how to merge data frames.
### `left_join(x,y)` to merge data frames, keeping everything in the 'x' data frame and only matches from the 'y' data frame
Now, we want to keep all observations in *kelp_abur*, and merge them with *fish* while only keeping observations from *fish* that match an observation within *abur*. When we use `left_join()`, any information from *fish* that don't have a match (by year and site) in *kelp_abur* won't be retained, because those wouldn't have a match in the left data frame.
```{r}
kelp_fish_left <- kelp_abur %>%
left_join(fish, by = c("year","site"))
```
Notice when you look at **kelp_fish_left**, data for other sites that exist in fish do **not** get joined, because `left_join(df_a, df_b)` will only keep observations from `df_b` if they have a match in `df_a`!
### `inner_join()` to merge data frames, only keeping observations with a match in **both**
Use `inner_join()` if you **only** want to retain observations that have matches across **both data** frames. Caution: this is built to exclude any observations that don't match across data frames by joined variables - double check to make sure this is actually what you want to do!
For example, if we use `inner_join()` to merge fish and kelp_abur, then we are asking R to **only return observations where the joining variables (*year* and *site*) have matches in both data frames.** Let's see what the outcome is:
```{r}
kelp_fish_injoin <- kelp_abur %>%
inner_join(fish, by = c("year", "site"))
# kelp_fish_injoin
```
Here, we see that only observations (rows) where there is a match for *year* and *site* in both data frames are returned.
### `filter()` and `join()` in a sequence
Now let's combine what we've learned about piping, filtering and joining!
Let's complete the following as part of a single sequence (remember, check to see what you've produced after each step) to create a new data frame called **my_fish_join**:
- Start with **fish** data frame
- Filter **fish** to only including observations for 2017 at Arroyo Burro
- Join the **kelp_abur** data frame to the resulting subset using `left_join()`
- Add a new column that contains the 'fish per kelp fronds' density (total_count / total_fronds)
That sequence might look like this:
```{r}
my_fish_join <- fish %>%
filter(year == 2017, site == "abur") %>%
left_join(kelp_abur, by = c("year", "site")) %>%
mutate(fish_per_frond = total_count / total_fronds)
```
Explore the resulting **my_fish_join** data frame.
## An HTML table with `kable()` and `kableExtra`
With any data frame, you can a nicer looking table in your knitted HTML using `knitr::kable()` and functions in the `kableExtra` package.
Start by using `kable()` with my_fish_join, and see what the default HTML table looks like in your knitted document:
```{r, eval=FALSE}
kable(my_fish_join)
```
Simple, but quick to get a clear & useful table! Now let's spruce it up a bit with `kableExtra::kable_styling()` to modify HTML table styles:
```{r, eval=FALSE}
my_fish_join %>%
kable() %>%
kable_styling(bootstrap_options = "striped",
full_width = FALSE)
```
...with many other options for customizing HTML tables! Make sure to check out ["Create awesome HTML tables with knitr::kable() and kableExtra" by Hao Zhu](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html) for more examples and options.
**Sync your project with your repo on GitHub**
### End `filter()` + `_join()` section!