forked from rstudio-conf-2020/r-for-excel
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathggplot2.Rmd
361 lines (253 loc) · 16 KB
/
ggplot2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
# Graphs with ggplot2 {#ggplot2}
## Summary
In this session, we'll first learn how to read some external data (from .xls, .xlsx, and CSV files) into R with the `readr` and `readxl` packages (both part of the `tidyverse`).
Then, we'll write reproducible code to build graphs piece-by-piece. In Excel, graphs are made by manually selecting options - which, as we've discussed previously, may not be the best option for reproducibility. Also, if we haven't built a graph with reproducible code, then we might not be able to easily recreate a graph *or* use that code again to make the same style graph with different data.
Using `ggplot2`, the graphics package within the `tidyverse`, we'll write reproducible code to manually and thoughtfully build our graphs.
> "ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places." - [R4DS](http://r4ds.had.co.nz/data-visualisation.html)
So yeah...that `gg` is from "grammar of graphics."
We'll use the `ggplot2` package, but the function we use to initialize a graph will be `ggplot`, which works best for data in tidy format (i.e., a column for every variable, and a row for every observation). Graphics with `ggplot` are built step-by-step, adding new elements as layers with a plus sign (`+`) between layers (note: this is different from the pipe operator, `%>%`. Adding layers in this fashion allows for extensive flexibility and customization of plots.
<br>
![](img/rstudio-cheatsheet-ggplot.png)
### Objectives
- Read in external data (Excel files, CSVs) with `readr` and `readxl`
- Initial data exploration
- Build several common types of graphs (scatterplot, column, line) in ggplot2
- Customize gg-graph aesthetics (color, style, themes, etc.)
- Update axis labels and titles
- Combine compatible graph types (geoms)
- Build multiseries graphs
- Split up data into faceted graphs
- Export figures with `ggsave()`
### Resources
- [https://readr.tidyverse.org/](`readr` documentation) from tidyverse.org
- [https://readxl.tidyverse.org/](`readxl` documentation) from tidyverse.org
- [https://readxl.tidyverse.org/articles/articles/readxl-workflows.html](`readxl` workflows article) from tidyverse.org
- [https://r4ds.had.co.nz/data-visualisation.html](Chapter 3 - Data Visualization in R for Data Science by Grolemund and Wickham)
- [ggplot2-cheatsheet-2.1.pdf](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf)
- [Graphs with ggplot2 - Cookbook for R](http://www.cookbook-r.com/Graphs/#graphs-with-ggplot2)
- ["Why I use ggplot2" - David Robinson Blog Post](http://varianceexplained.org/r/why-I-use-ggplot2/)
## Getting started - In existing .Rmd, attach packages
In your existing `plots-ggplot.Rmd` from Session 2, remove everything below the first code chunk.
The `ggplot2` package is part of the `tidyverse`, so we don't need to attach it separately. Attach the `tidyverse`, `readxl` and `here` packages in the top-most code chunk of your .Rmd.
```{r, message = FALSE, warning = FALSE}
library(tidyverse)
library(readxl)
library(here)
```
## Read in .xlsx and .csv files
In this session, we'll use data for parks visitation from two files:
- A comma-separated-value (CSV) file containing visitation data for all National Parks in California (ca_np.csv)
- A single Excel worksheet containing only visitation for Channel Islands National Park (ci_np.xlsx)
### `read_csv()` to read in comma-separated-value (.csv) files
There are many types of files containing data that you might want to work with in R. A common one is a comma separated value (CSV) file, which contains values with each column entry separated by a comma delimiter. CSVs can be opened, viewed, and worked with in Excel just like an .xls or .xlsx file - but let's learn how to get data directly from a CSV into R where we can work with it more reproducibly.
To read in the ca_np.csv file, we need to:
- insert a new code chunk
- use `read_csv()` to read in the file
- use `here()` within `read_csv()` to tell it where to look
- assign the stored data an object name (we'll store ours as ca_np)
```{r message=FALSE}
ca_np <- read_csv(here("data", "ca_np.csv"))
```
Look in your Environment to see that **ca_np** now shows up. Click on the object in the Environment, and R will automatically run the `View()` function for you to pull up your data in a separate viewing tab. Now we can look at it in the spreadsheet format we're used to.
We can explore our data frame a bit more to see what it contains. For example:
- `names()`: to see the variable (column) names
- `head()`: to see the first *x* rows (6 is the default)
- `summary()`: see a quick summary of each variable
Cool! Next, let's read in ci_np.xlsx an Excel file) using `read_excel()`.
### `readxl` to read in Excel files
We also have an Excel file (ci_np.xlsx) that contains observations **only** for Channel Islands National Park visitation. Both `readr` and `readxl` are part of the `tidyverse`, which means we should expect their functions to have similar syntax and structure.
*Note: If `readxl` is part of the `tidyverse`, then why did I have to attach it separately?* Great question! Too keep the `tidyverse` manageable, there are "core" packages (`readr`, `dplyr`, `tidyr`, `ggplot2`, `forcats`, `purrr`, `tibble`, `stringr`) that you would expect to use frequently, and those are automatically attached when you use `library(tidyverse)`. But there are *also* more specialized `tidyverse` packages (e.g. `readxl`, `reprex`, `lubridate`, `rvest`) that are built with similar design philosophy, but are not automatically attached with `library(tidyverse)`. Those specialized packages are **installed** along with the `tidyverse`, but need to be attached individually (e.g. with `library(readxl)`).
Use `read_excel()` to get the ci_np.xlsx data into R:
```{r, message = FALSE}
ci_np <- read_excel(here("data", "ci_np.xlsx"))
```
*Note: If you want to explicitly read in an .xlsx or .xls file, you can use `read_xlsx()` or `read_xls()` instead. `read_excel()` will make its best guess about which type of Excel file you're reading in, so is the generic form.*
Explore the ci_np data frame as above using functions like `View()`, `names()`, `head()`, and `summary()`.
Now that we have read in the National Parks visitation data, let's use `ggplot2` to visualize it.
## Our first ggplot graph: Visitors to Channel Islands NP
To create a bare-bones ggplot graph, we need to tell R three basic things:
1. We're using `ggplot2::ggplot()`
2. Data we're using & variables we're plotting (i.e., what is x and/or y?)
3. What type of graph we're making (the type of *geom*)
Generally, that structure will look like this:
```{r, eval = FALSE}
ggplot(data = df_name, aes(x = x_var_name, y = y_var_name)) +
geom_type()
```
Breaking that down:
- First, tell R you're using `ggplot()`
- Then, tell it the object name where variables exist (`data = df_name`)
- Next, tell it the aesthetics `aes()` to specify which variables you want to plot
- Then add a layer for the type of geom (graph type) with `geom_*()` - for example, `geom_point()` is a scatterplot, `geom_line()` is a line graph, `geom_col()` is a column graph, etc.
Let's do that to create a line graph of visitors to Channel Islands National Park:
```{r}
ggplot(data = ci_np, aes(x = year, y = visitors)) +
geom_line()
```
We're going to be doing a lot of plot variations with those same variables. Let's store the first line as object `gg_base` so that we don't need to retype it each time:
```{r}
gg_base <- ggplot(data = ci_np, aes(x = year, y = visitors))
```
Or, we could change that to a scatterplot just by updating the `geom_*`:
```{r}
gg_base +
geom_point()
```
We could even do that for a column graph:
```{r}
gg_base +
geom_col()
```
Or an area plot...
```{r}
gg_base +
geom_area()
```
We can see that updating to different `geom_*` types is quick, so long as the types of graphs we're switching between are compatible.
The data are there, now let's do some data viz customization.
## Intro to customizing `ggplot` graphs
First, we'll customize some aesthetics (e.g. colors, styles, axis labels, etc.) of our graphs based on non-variable values.
> We can change the aesthetics of elements in a ggplot graph by adding arguments within the layer where that element is created.
Some common arguments we'll use first are:
- `color = ` or `colour = `: update point or line colors
- `fill = `: update fill color for objects with areas
- `linetype = `: update the line type (dashed, long dash, etc.)
- `pch = `: update the point style
- `size = `: update the element size (e.g. of points or line thickness)
- `alpha = `: update element opacity (1 = opaque, 0 = transparent)
Building on our first line graph, let's update the line color to "purple" and make the line type "dashed":
```{r}
gg_base +
geom_line(
color = "purple",
linetype = "dashed"
)
```
How do we know which color names ggplot will recognize? If you google "R colors ggplot2" you'll find a lot of good resources. Here's one: [SAPE ggplot2 colors quick reference guide](http://sape.inf.usi.ch/quick-reference/ggplot2/colour)
Now let's update the point, style and size of points on our previous scatterplot graph using `color = `, `size = `, and `pch = ` (see `?pch` for the different point styles, which can be further customized).
```{r}
gg_base +
geom_point(color = "purple",
pch = 17,
size = 4,
alpha = 0.5)
```
### Activity: customize your own ggplot graph
Update one of the example graphs you created above to customize **at least** an element color and size!
## Mapping variables onto aesthetics
In the examples above, we have customized aesthetics based on constants that we input as arguments (e.g., the color / style / size isn't changing based on a variable characteristic or value). Sometimes, however, we **do** want the aesthetics of a graph to depend on a variable. To do that, we'll **map variables onto graph aesthetics**, meaning we'll change how an element on the graph looks based on a variable characteristic (usually, character or value).
> When we want to customize a graph element based on a variable's characteristic or value, add the argument within `aes()` in the appropriate `geom_*()` layer
In short, if updating aesthetics based on a variable, make sure to put that argument inside of `aes()`.
**Example:** Create a ggplot scatterplot graph where the **size** and **color** of the points change based on the **number of visitors**, and make all points the same level of opacity (`alpha = 0.5`). Notice the `aes()` around the `size = ` and `color = ` arguments.
Also: this is overmapped and unnecessary. Avoid excessive / overcomplicated aesthetic mapping in data visualization.
```{r}
gg_base +
geom_point(
aes(size = visitors,
color = visitors),
alpha = 0.5
)
```
In the example above, notice that the two arguments that **do** depend on variables are within `aes()`, but since `alpha = 0.5` doesn't depend on a variable then it is *outside the `aes()` but still within the `geom_point()` layer*.
### Activity: map variables onto graph aesthetics
Create a column plot of Channel Islands National Park visitation over time, where the **fill color** (argument: `fill = `) changes based on the number of **visitors**.
```{r}
gg_base +
geom_col(aes(fill = visitors))
```
**Sync your project with your GitHub repo.**
## ggplot2 complete themes
While every element of a ggplot graph is manually customizable, there are also built-in themes (`theme_*()`) that you can add to your ggplot code to make some major headway before making smaller tweaks manually.
Here are a few to try today (but also notice all the options that appear as we start typing `theme_` into our ggplot graph code!):
- `theme_light()`
- `theme_minimal()`
- `theme_bw()`
Here, let's update our previous graph with `theme_minimal()`:
```{r}
gg_base +
geom_point(
aes(size = visitors,
color = visitors),
alpha = 0.5
) +
theme_minimal()
```
## Updating axis labels and titles
Use `labs()` to update axis labels, and add a title and/or subtitle to your ggplot graph.
```{r}
gg_base +
geom_line(linetype = "dotted") +
theme_bw() +
labs(
x = "Year",
y = "Annual park visitors",
title = "Channel Islands NP Visitation",
subtitle = "(1963 - 2016)"
)
```
**Note**: If you want to update the formatting of axis values (for example, to convert to comma format instead of scientific format above), you can use the `scales` package options (see more from the [R Cookbook](http://www.cookbook-r.com/Graphs/Axes_(ggplot2)/)).
## Combining compatible geoms
As long as the geoms are compatible, we can layer them on top of one another to further customize a graph.
For example, adding points to a line graph:
```{r}
gg_base +
geom_line(color = "purple") +
geom_point(color = "orange",
aes(size = year),
alpha = 0.5)
```
Or, combine a column and line graph (not sure why you'd want to do this, but you can):
```{r}
gg_base +
geom_col(fill = "orange",
color = "purple") +
geom_line(color = "green")
```
## Multi-series ggplot graphs
In the examples above, we only had a single series - visitation at Channel Islands National Park. Often we'll want to visualize multiple series. For example, from the `ca_np` object we have stored, we might want to plot visitation for *all* California National Parks.
To do that, we need to add an aesthetic that lets `ggplot` know how things are going to be grouped. A demonstration of why that's important - what happens if we *don't* let ggplot know how to group things?
```{r}
ggplot(data = ca_np, aes(x = year, y = visitors)) +
geom_line()
```
Well that's definitely a mess, and it's because ggplot has no idea that these **should be different series based on the different parks that appear in the 'park_name' column**.
We can make sure R does know by adding an explicit grouping argument (`group = `), or by updating an aesthetic based on *park_name*:
```{r}
ggplot(data = ca_np, aes(x = year, y = visitors, group = park_name)) +
geom_line()
```
**Note**: You could also add an aesthetic (`color = park_name`) in the `geom_line()` layer to create groupings, instead of in the topmost `ggplot()` layer.
Let's store that topmost line so that we can use it more quickly later on in the lesson:
```{r}
gg_np <- ggplot(data = ca_np, aes(x = year, y = visitors, group = park_name))
```
## Faceting ggplot graphs
When we facet graphs, we split them up into multiple plotting panels, where each panel contains a subset of the data. In our case, we'll split the graph above into different panels, each containing visitation data for a single park.
Also notice that any general theme changes made will be applied to *all* of the graphs.
```{r}
gg_np +
geom_line(show.legend = FALSE) +
theme_light() +
labs(x = "year", y = "annual visitors") +
facet_wrap(~ park_name)
```
## Exporting a ggplot graph with `ggsave()`
If we want our graph to appear in a knitted html, then we don't need to do anything else. But often we'll need a saved image file, of specific size and resolution, to share or for publication.
`ggsave()` will export the *most recently run* ggplot graph by default (`plot = last_plot()`), unless you give it the name of a different saved ggplot object. Some common arguments for `ggsave()`:
- `width = `: set exported image width (default inches)
- `height = `: set exported image height (default height)
- `dpi = `: set dpi (dots per inch)
So to export the faceted graph above at 180 dpi, width a width of 8" and a height of 7", we can use:
```{r, include = FALSE}
ggsave(here("img","np_graph.jpg"), dpi = 180, width = 8, height = 7)
```
```{r, eval = FALSE}
ggsave(here("figures", "np_graph.jpg"), dpi = 180, width = 8, height = 7)
```
Notice that a .jpg image of that name and size is now stored in the `figures\` folder within your working directory. You can change the type of exported image, too (e.g. pdf, tiff, eps, png, mmp, svg).
**Sync your project with your GitHub repo.**
- Stage
- Commit
- Pull (to check for remote changes)
- Push!
### End `ggplot` session!