forked from hadley/r4ds
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdata-structures.Rmd
464 lines (316 loc) · 16 KB
/
data-structures.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
# Data structures
```{r, include = FALSE}
library(purrr)
library(dplyr)
```
As you start to write more functions, and as you want your functions to work with more types of inputs, it's useful to have some grounding in the underlying data structures that R is built on. This chapter will dive deeper into the objects that you've already used, helping you better understand how things work.
The most important class of objects in R is the __vector__. Every vector has two key properties:
1. Its type, whether it's logical, numeric, character, and so on. You
can determine the type of any R object with `typeof()`.
2. Its length, which you can retrieve with `length()`.
Vectors are broken down into __atomic__ vectors, and __lists__. I call factors, dates, and date times __augmented vectors__ because they're built on top of atomic vectors. Data frames are also augmented vectors as they built on top of lists.
Note that R does not have "scalars". In R, a single number is a vector of length 1. The impacts of this are mostly on how functions work. Because there are no scalars, most built-in functions are vectorised, meaning that they will operate on a vector of numbers. That's why, for example, you can write `1:10 + 10:1`.
## Atomic vectors
There are four important types of atomic vector:
* logical
* integer
* double
* character
Collectively, integer and double vectors are known as numeric vectors. Most of the time the distinction between integers and doubles is not important in R, so we'll discuss them together.
(There are also two rarer atomic vectors: raw and complex. They're beyond the scope of this book because they are rarely needed to do data analysis)
### Logical
Logical vectors are the simplest type of atomic vector because they can take only three possible values: `FALSE`, `TRUE`, and `NA`. Logical vectors are usually constructed with comparison operators, as described in [comparisons].
In numeric contexts, `TRUE` is converted to `1`, `FALSE` converted to 0. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues.
```{r}
x <- sample(20, 100, replace = TRUE)
y <- x > 10
sum(y)
mean(y)
```
### Numeric
Numeric vectors encompasses both integers and doubles (real numbers). For large data, there is some small advantage to using the integer data type if you really have integers, but in most cases the differences are immaterial. In R, numbers are doubles by default. To make an integer, use a `L` after the number:
```{r}
typeof(1)
typeof(1L)
```
There are two cases where you need to be aware of the differences between doubles and integers. Firstly, never test for equality on a double. There can be very small differences that don't print out by default. These differences arise because a double is represented using a fixed number of (binary) digits. For example, what should you get if you square the square-root of two?
```{r}
x <- sqrt(2) ^ 2
x
```
It certainly looks like we get what we expect: `2`. But things are not exactly as they seem:
```{r}
x == 2
x - 2
```
The number we've computed is actually slightly different to 2. To avoid this sort of comparison difficulty, you can use the `near()` function from dplyr (available in 0.5).
```{r, eval = packageVersion("dplyr") >= "0.4.3.9000"}
dplyr::near(x, 2)
```
The other important thing to know about doubles is that they have three special values in addition to `NA`:
```{r}
c(-1, 0, 1) / 0
```
Like with missing values, you should avoid using `==` to check for these other special values. Instead use `is.finite()`, `is.infinite()`, and `is.nan()`:
| | 0 | Inf | NA | NaN |
|------------------|-----|-----|-----|-----|
| `is.finite()` | x | | | |
| `is.infinite()` | | x | | |
| `is.na()` | | | x | x |
| `is.nan()` | | | | x |
Note that `is.finite(x)` is not the same as `!is.infinite(x)`.
### Character
Each element of a character vector is a string.
```{r}
x <- c("abc", "def", "ghijklmnopqrs")
typeof(x)
```
You learned how to manipulate these vectors in [strings].
R uses a global string pool. This reduces the amount of memory strings take up because
```{r}
x <- "This is a reasonably long string."
pryr::object_size(x)
y <- rep(x, 1000)
pryr::object_size(y)
```
`y` doesn't take up 1,000x as much memory as `x`, because each element of `y` is just a pointer to that same string. A pointer is 8 bytes, so 1000 pointers to a 136 B string is about 8.13 kB.
### Missing values
There are four types of missing value, one for each type of atomic vector:
```{r}
NA # logical
NA_integer_ # integer
NA_real_ # double
NA_character_ # character
```
It is not usually necessary to know about these different types because in most cases `NA` is automatically converted to the type that you need. However, there are some functions that are strict about their inputs, and you'll need to give them an missing value of the correct type.
## Subsetting
## Augmented vectors
There are three important types of vector that are built on top of atomic vectors: factors, dates, and date times. I call these augmented vectors, because they are atomic vectors with additional __attributes__. Attributes are a way of adding arbitrary additional metadata to a vector. Each attribute is a named vector. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
```{r}
x <- 1:10
attr(x, "greeting")
attr(x, "greeting") <- "Hi!"
attr(x, "farewell") <- "Bye!"
attributes(x)
```
There are three very important attributes that are used to implement fundamental parts of R:
* "names" are used to name the elements of a vector.
* "dims" make a vector behave like a matrix or array.
* "class" is used to implemenet the S3 object oriented system.
Class is particularly important because it changes what __generic functions__ do with the object. Generic functions are key to OO in R. Here's what a typical generic function looks like:
```{r}
as.Date
```
The call to "UseMethod" means that this is a generic function, and it will call a specific __method__, based on the class of the first argument. You can list all the methods for a generic with `methods()`:
```{r}
methods("as.Date")
```
And you can see the specific implementation of a method with `getS3method()`:
```{r}
getS3method("as.Date", "default")
getS3method("as.Date", "numeric")
```
The most important S3 generic is `print()`: it controls how the object is printed when you type its name on the console. Other important generics are the subsetting functions `[`, `[[`, and `$`.
A detailed discussion of S3 is beyond the scope of this book, but you can read more about it at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
### Factors
Factors are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute:
```{r}
x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
typeof(x)
attributes(x)
```
Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [stringsAsFactors: An unauthorized biography](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [stringsAsFactors = \<sigh\>](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. The motivation for factors is the modelling context. If you're going to fit a model to categorical data, you need to know in advance all the possible values. There's no way to make a prediction for "green" if all you've ever seen is "red", "blue", and "yellow"
The packages in this book keep characters as is, but you will need to deal with them if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can avoid creating it in the first. Often there will be `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can apply `as.character()` to the column to explicitly turn back into a factor.
```{r}
x <- factor(letters[1:5])
is.factor(x)
as.factor(letters[1:5])
```
### Dates and date times
Dates in R are numeric vectors (sometimes integers, sometimes doubles) that represent the number of days since 1 January 1970.
```{r}
x <- as.Date("1971-01-01")
unclass(x)
typeof(x)
attributes(x)
```
Date times are numeric vectors (sometimes integers, sometimes doubles) that represent the number of seconds since 1 January 1970:
```{r}
x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)
typeof(x)
attributes(x)
```
The `tzone` is optional, and only controls the way the date is printed not what it means.
There is another type of datetimes called POSIXlt. These are built on top of named lists.
```{r}
y <- as.POSIXlt(x)
typeof(y)
attributes(y)
```
If you use the packages outlined in this book, you should never encounter a POSIXlt. They do crop up in base R, because they are used extract specific components of a date (like the year or month). However, lubridate provides helpers for you to do this instead. Otherwise POSIXct's are always easier to work with, so if you find you have a POSIXlt, you should always convert it to a POSIXct with `as.POSIXct()`.
## Recursive vectors (lists)
Lists are the data structure R uses for hierarchical objects. You can create a hierarchical structure with a list because unlike vectors, a list can contain other lists.
You create a list with `list()`:
```{r}
x <- list(1, 2, 3)
str(x)
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
```
Unlike atomic vectors, `lists()` can contain a mix of objects:
```{r}
y <- list("a", 1L, 1.5, TRUE)
str(y)
```
Lists can even contain other lists!
```{r}
z <- list(list(1, 2), list(3, 4))
str(z)
```
`str()` is very helpful when looking at lists because it focusses on the structure, not the contents.
### Visualising lists
To explain more complicated list manipulation functions, it's helpful to have a visual representation of lists. For example, take these three lists:
```{r}
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
```
I draw them as follows:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-structure.png")
```
* Lists are rounded rectangles that contain their children.
* I draw each child a little darker than its parent to make it easier to see
the hierarchy.
* The orientation of the children (i.e. rows or columns) isn't important,
so I'll pick a row or column orientation to either save space or illustrate
an important property in the example.
### Subsetting
There are three ways to subset a list, which I'll illustrate with `a`:
```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
```
* `[` extracts a sub-list. The result will always be a list.
```{r}
str(a[1:2])
str(a[4])
```
Like subsetting vectors, you can use an integer vector to select by
position, or a character vector to select by name.
* `[[` extracts a single component from a list. It removes a level of
hierarchy from the list.
```{r}
str(y[[1]])
str(y[[4]])
```
* `$` is a shorthand for extracting named elements of a list. It works
similarly to `[[` except that you don't need to use quotes.
```{r}
a$a
a[["b"]]
```
Or visually:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-subsetting.png")
```
### Lists of condiments
It's easy to get confused between `[` and `[[`, but it's important to understand the difference. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember these differences:
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper.jpg")
```
If this pepper shaker is your list `x`, then, `x[1]` is a pepper shaker containing a single pepper packet:
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper-1.jpg")
```
`x[2]` would look the same, but would contain the second packet. `x[1:2]` would be a pepper shaker containing two pepper packets.
`x[[1]]` is:
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper-2.jpg")
```
If you wanted to get the content of the pepper package, you'd need `x[[1]][[1]]`:
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper-3.jpg")
```
### Exercises
1. Draw the following lists as nested sets.
1. Generate the lists corresponding to these nested set diagrams.
1. What happens if you subset a data frame as if you're subsetting a list?
What are the key differences between a list and a data frame?
## Data frames
Data frames are augmented lists: they have class "data.frame", and `names` (column) and `row.names` attributes:
```{r}
df1 <- data.frame(x = 1:5, y = 5:1)
typeof(df1)
attributes(df1)
```
The difference between a data frame and a list is that all the elements of a data frame must be the same length. All functions that work with data frames enforce this constraint.
Generally, I recommend using `dplyr::data_frame()` instead of `data.frame`. It creates an object that "extends" the data frame. That means it has all the existing behaviour of a data frame:
```{r}
df2 <- dplyr::data_frame(x = 1:5, y = 5:1)
typeof(df2)
attributes(df2)
```
The additional `tbl_df` class makes the print method more informative (and only prints the first 10 rows, not the first 10,000), and makes the subsetting methods more strict:
```{r, error = TRUE}
df1
df2
df1$z
df2$z
```
There are a few other ways in `data_frame()` behaves differently to `data.frame()`
* `data.frame()` does a number of transformations to its inputs. For example,
unless you `stringsAsFactors = FALSE` it always converts character vectors
to factors. `data_frame()` does not conversion:
```{r}
data.frame(x = letters) %>% sapply(class)
data_frame(x = letters) %>% sapply(class)
```
* `data.frame()` automatically transforms names, `data_frame()` does not.
```{r}
data.frame(`crazy name` = 1) %>% names()
data_frame(`crazy name` = 1) %>% names()
```
* In `data_frame()` you can refer to variables that you just created:
```{r}
data_frame(x = 1:5, y = x ^ 2)
```
* It never uses row names. The whole point of tidy data is to store variables
in a consistent way. Row names are a variable stored in a unique way,
so I don't recommend using them.
* It only recycles vectors of length 1. Recycling vectors of greater lengths
is a frequent source of silent mistakes.
```{r, error = TRUE}
data.frame(x = 1:2, y = 1:4)
data_frame(x = 1:2, y = 1:4)
```
## Predicates
| | lgl | int | dbl | chr | list | null |
|------------------|-----|-----|-----|-----|------|------|
| `is_logical()` | x | | | | | |
| `is_integer()` | | x | | | | |
| `is_double()` | | | x | | | |
| `is_numeric()` | | x | x | | | |
| `is_character()` | | | | x | | |
| `is_atomic()` | x | x | x | x | | |
| `is_list()` | | | | | x | |
| `is_vector()` | x | x | x | x | x | |
| `is_null()` | | | | | | x |
Compared to the base R functions, they only inspect the type of the object, not its attributes. This means they tend to be less surprising:
```{r}
is.atomic(NULL)
is_atomic(NULL)
is.vector(factor("a"))
is_vector(factor("a"))
```
I recommend using these instead of the base functions.
Each predicate also comes with "scalar" and "bare" versions. The scalar version checks that the length is 1 and the bare version checks that the object is a bare vector with no S3 class.
```{r}
y <- factor(c("a", "b", "c"))
is_integer(y)
is_scalar_integer(y)
is_bare_integer(y)
```
### Exercises
1. Carefully read the documentation of `is.vector()`. What does it actually
test for?