---
title: "Ch11"
output:
  html_document:
    df_print: paged
---
```{r}
library(tidyverse)
# stringr now belongs to the tidyverse core
```
## Exercises 14.2.5
In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?
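`paste()` and `paste0()` both concatenate vectors after converting them to character. The only difference is the default separator: `paste()` uses `sep = " "`, while `paste0()` uses `sep = ""`.
`str_c()` is the equivalent `stringr` function; like `paste0()`, it defaults to `sep = ""`.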
```{r}
str_c(c("a", "b"), collapse = ", ")
```
```{r}
str_c(c("a", "b"), NA)
# In `str_c()`, anything combined with NA becomes NA
paste0(c("a", "b"), NA)
# `paste0()` instead converts NA to the string "NA" before pasting. To get the same behaviour from `str_c()`, convert the NA to a string first:
str_c(c("a", "b"), str_replace_na(NA))
```
In your own words, describe the difference between the sep and collapse arguments to str_c().
`sep` is the string inserted between the arguments when they are combined element by element, so the result is still a vector. `collapse` is the string used to join the elements of that resulting vector into a single string.
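A small sketch of the difference, using made-up input vectors:

```{r}
library(stringr)

# sep goes between the arguments, element by element: the result has length 2
str_c(c("a", "b"), c("x", "y"), sep = "-")

# collapse then joins the elements of that result into a single string
str_c(c("a", "b"), c("x", "y"), sep = "-", collapse = ", ")
```

The first call returns one string per element pair; the second returns a single string.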
Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?
```{r}
uneven <- "one"
even <- "thre"
str_sub(even, str_length(even) / 2, str_length(even) / 2)
# For an odd length, 3 / 2 = 1.5 is not a whole position, so the result depends on how str_sub() coerces it
str_sub(uneven, str_length(uneven) / 2, str_length(uneven) / 2)
```
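A more robust solution is to round up to a whole position with `ceiling()`, which gives the middle character for odd lengths and the left-of-centre character for even lengths.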
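One possible way to make the choice explicit is sketched below. The helper name `middle_char()` is made up for illustration, and returning both central characters for even-length strings is just one reasonable convention.

```{r}
library(stringr)

# Hypothetical helper: returns the middle character for odd lengths
# and the two central characters for even lengths
middle_char <- function(x) {
  n <- str_length(x)
  # ceiling(n / 2) is the middle for odd n and the left-of-centre position for even n
  start <- ceiling(n / 2)
  # floor(n / 2) + 1 equals start for odd n and start + 1 for even n
  end <- floor(n / 2) + 1
  str_sub(x, start, end)
}

middle_char(c("one", "thre"))
```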
What does str_wrap() do? When might you want to use it?
```{r}
str_wrap(
"Hey, so this is one paragraph
I'm interested in writing but I
think it might be too long. I just
want to make sure this is in the right format",
width = 60, indent = 2, exdent = 1
) %>% cat()
```
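`str_wrap()` reflows a paragraph into lines of roughly `width` characters, indenting the first line by `indent` and the remaining lines by `exdent`. It is useful for formatting long messages printed while running scripts or from packages.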
What does str_trim() do? What’s the opposite of str_trim()?
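`str_trim()` removes whitespace from the start and end of a string (one side only with `side`), and `str_pad()` does the opposite by adding padding up to a given width. A short sketch with a made-up input:

```{r}
library(stringr)

x <- "  abc  "
str_trim(x)                  # strips both sides
str_trim(x, side = "left")   # strips leading whitespace only

# str_pad() adds whitespace until the string reaches the requested width
str_pad("abc", width = 7, side = "both")
```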
Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.
```{r}
str_paster <- function(x, collapse = ", ") {
str_c(x, collapse = collapse)
}
tr <- letters[1:3]
str_paster(tr)
tr <- letters[1:2]
str_paster(tr)
tr <- letters[1]
str_paster(tr)
tr <- letters[0]
str_paster(tr)
```
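This version only joins the elements with `collapse`, so it does not add the final "and" that the exercise asks for. It always returns a character vector, even if the input is empty.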
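A sketch of a version that also inserts the "and" could look like this. The name `str_commasep()` and the choice to return an empty string for empty input are assumptions made for illustration.

```{r}
library(stringr)

# Hypothetical helper: c("a", "b", "c") becomes "a, b, and c"
str_commasep <- function(x) {
  n <- length(x)
  if (n == 0) {
    ""                                  # nothing to join
  } else if (n == 1) {
    x                                   # a single element is returned as is
  } else if (n == 2) {
    str_c(x[1], " and ", x[2])          # no comma for two elements
  } else {
    # join all but the last with commas, then append ", and <last>"
    head_part <- str_c(x[-n], collapse = ", ")
    str_c(head_part, ", and ", x[n])
  }
}

str_commasep(letters[1:3])
str_commasep(letters[1:2])
str_commasep(letters[1])
str_commasep(letters[0])
```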
## Exercises 14.3.1.1
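Explain why each of these strings don’t match a \: "\", "\\", "\\\".
`"\"` is not even a valid string: the backslash escapes the closing quote, so R keeps waiting for the string to end.
`"\\"` is the one-character string `\`. As a regular expression, that is a lone escape character with nothing after it, so the pattern is invalid.
`"\\\"` fails for the same reason as the first case: the first two backslashes form a literal `\`, and the third escapes the closing quote, leaving the string unterminated. To match a literal `\` you need the regular expression `\\`, which is written as the string `"\\\\"`.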
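The two layers of escaping can be checked directly. This is a minimal sketch with made-up test strings:

```{r}
library(stringr)

# Four backslashes in the source are two characters in the string,
# which the regex engine reads as one literal backslash
pattern <- "\\\\"
str_length(pattern)
writeLines(pattern)   # prints the regex the engine actually receives

# "a\\b" is the three-character string a\b
str_detect(c("a\\b", "ab"), pattern)
```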
How would you match the sequence "'\?
`str_view("\"'\\", "\"'\\\\")`
What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
It matches any string containing a sequence like `.a.b.c`: each `\.` matches a literal dot and each `.` matches any single character except a newline. As a string, it is written `"\\..\\..\\.."`.
```{r}
str_view(".a.b.c", "\\..\\..\\..")
```
## Exercises 14.3.2.1
How would you match the literal string "$^$"?
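Both `$` and `^` are special in regular expressions, so each one needs escaping. A sketch with two made-up test strings, comparing an anchored regex with `fixed()`:

```{r}
library(stringr)

x <- c("$^$", "ab$^$sfas")

# Escaped and anchored: the whole string must be exactly $^$
str_detect(x, "^\\$\\^\\$$")

# fixed() treats the pattern literally and matches it anywhere in the string
str_detect(x, fixed("$^$"))
```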
Given the corpus of common words in stringr::words, create regular expressions that find all words that:
Start with "y".
```{r}
# Keep only the strings that match the pattern (same idea as str_subset())
str_bring <- function(string, pattern) {
  string[str_detect(string, pattern)]
}
str_bring(words, "^y")
```
End with "x"
```{r}
str_bring(words, "x$")
```
Are exactly three letters long. (Don’t cheat by using str_length()!)
```{r}
str_bring(words, "^.{3}$")
```
Have seven letters or more.
```{r}
str_bring(words, "^.{7,}$")
```
Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.
## Exercises 14.3.3.1
Create regular expressions to find all words that:
Start with a vowel.
```{r}
str_bring(words, "^[aeiou]")
```
That only contain consonants. (Hint: thinking about matching “not”-vowels.)
```{r}
str_bring(words, "^[^aeiou]+$")
```
End with ed, but not with eed.
```{r}
str_bring(words, "(^|[^e])ed$")
```
End with ing or ise.
```{r}
str_bring(words, "i(ng|se)$")
```
Empirically verify the rule "i before e except after c".
```{r}
str_bring(words, "cei|[^c]ie") # words that follow the rule
str_bring(words, "cie|[^c]ei") # words that break the rule
```
Is "q" always followed by a "u"?
```{r}
str_bring(words, "q[^u]")
```
Yes, at least in `stringr::words`: no word contains a "q" followed by anything other than a "u".
Write a regular expression that matches a word if it’s probably written in British English, not American English.
This is hard to do reliably. A rough approximation is `"ou|ise$|ae|oe|yse$"`, with `$` anchoring the `ise` and `yse` endings:
```{r}
str_bring(words, "ou|ise$|ae|oe|yse$")
```
But see: https://jrnold.github.io/r4ds-exercise-solutions/strings.html
Create a regular expression that will match telephone numbers as commonly written in your country.
```{r}
x <- c("34697382009", "18093438932", "18098462020")
str_bring(x, "^34\\d{9}$")
```
or
```{r}
x <- c("123-456-7890", "1235-2351")
str_bring(x, "^\\d{3}-\\d{3}-\\d{4}$")
```
## Exercises 14.3.4.1
Describe the equivalents of ?, +, * in {m,n} form.
`?` is `{0,1}`
`+` is `{1,}`
`*` is `{0,}`
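These equivalences can be checked empirically. The sketch below compares each shorthand with its `{m,n}` form on a small made-up test vector:

```{r}
library(stringr)

x <- c("", "a", "aa", "aaa")

# Each pair of patterns should classify every test string the same way
identical(str_detect(x, "^a?$"), str_detect(x, "^a{0,1}$"))
identical(str_detect(x, "^a+$"), str_detect(x, "^a{1,}$"))
identical(str_detect(x, "^a*$"), str_detect(x, "^a{0,}$"))
```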
Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
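`^.*$`
Matches any string that contains no newline, including the empty string.
`"\\{.+\\}"`
Matches any string containing a pair of curly braces with at least one character between them.
`\d{4}-\d{2}-\d{2}`
Matches four digits, a hyphen, two digits, a hyphen, and two more digits, i.e. the format dddd-dd-dd.
`"\\\\{4}"`
Matches four backslashes in a row.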
```{r}
str_bring("\\\\\\\\", "\\\\{4}")
```
Create regular expressions to find all words that:
Start with three consonants.
```{r}
str_bring(words, "^[^aeiou]{3}")
```
Have three or more vowels in a row.
```{r}
str_bring(words, "[aeiou]{3,}")
```
Have two or more vowel-consonant pairs in a row.
```{r}
str_bring(words, "([aeiou][^aeiou]){2,}")
```
Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.
## Exercises 14.3.5.1
Describe, in words, what these expressions will match:
(.)\1\1
Any character repeated three times in a row.
```{r}
str_bring(c("aaa", "aaba"), "(.)\\1\\1")
```
"(.)(.)\\2\\1"
Two characters followed by the same two characters in reverse order
```{r}
str_bring(c("abba", "aabb"), "(.)(.)\\2\\1")
```
(..)\1
A pair of characters repeated twice in a row, e.g. abab
```{r}
str_bring(c("abab", "abba"), "(..)\\1")
```
"(.).\\1.\\1"
A character repeated three times with any single character between each repetition, e.g. abaca
```{r}
str_bring(c("abaca", "aabb"), "(.).\\1.\\1")
```
"(.)(.)(.).*\\3\\2\\1"
Three characters, followed by zero or more of any character, and then the same three characters in reverse order.
```{r}
str_bring(c("abc312131cba", "aaabbbccc"), "(.)(.)(.).*\\3\\2\\1")
```
Construct regular expressions to match words that:
Start and end with the same character.
```{r}
str_bring(words, "^(.)(.*\\1)?$")
```
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
```{r}
str_bring(words, "(..).*\\1")
```
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
```{r}
str_bring(words, "([a-z]).*\\1.*\\1")
```
## Exercises 14.4.2
For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.
Find all words that start or end with x.
```{r}
str_bring(words, "^x|x$")
```
or
```{r}
start_r <- str_detect(words, "^x")
end_r <- str_detect(words, "x$")
words[start_r | end_r]
```
Find all words that start with a vowel and end with a consonant.
```{r}
str_bring(words, "^[aeiou].*[^aeiou]$")
```
or
```{r}
start_r <- str_detect(words, "^[aeiou]")
end_r <- str_detect(words, "[^aeiou]$")
words[start_r & end_r]
```
Are there any words that contain at least one of each different vowel?
```{r}
vowels <-
str_detect(words, "a") & str_detect(words, "e") & str_detect(words, "i") &
str_detect(words, "o") & str_detect(words, "u")
words[vowels]
```
What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
```{r}
vowels <- str_count(words, "[aeiou]")
words[which.max(vowels)]
```
```{r}
words[which.max(vowels / str_length(words))]
```
## Exercises 14.4.3.1
In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.
```{r}
colors <- c(
"red", "orange", "yellow", "green", "blue", "purple"
)
color_match <- str_c(str_c("\\b", colors, "\\b"), collapse = "|")
sentences[str_count(sentences, color_match) > 1]
```
From the Harvard sentences data, extract:
The first word from each sentence.
```{r}
str_extract(sentences, "^[a-zA-Z]+")
```
All words ending in ing.
```{r}
end_ing <- str_extract_all(sentences, "\\b[a-zA-Z]+ing\\b")
unique(unlist(end_ing))
```
All plurals.
```{r}
unique(unlist(str_extract_all(sentences, "\\b[a-zA-Z]{3,}s\\b"))) %>%
head()
```
## Exercises 14.4.4.1
Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.
```{r}
numbers <- c(
"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"
)
number_regexp <- str_c("\\b(", str_c(numbers, collapse = "|"), ")")
regexp <- str_c(number_regexp, " ([^ ]+)")
all_match <- str_match(sentences, regexp)
all_match[complete.cases(all_match), ] %>% head()
```
Find all contractions. Separate out the pieces before and after the apostrophe.
```{r}
contract_re <- "([a-zA-Z]+)'([a-zA-Z]+)"
contract <- sentences[str_detect(sentences, contract_re)]
str_match(contract, contract_re)
```
## Exercises 14.4.5.1
Replace all forward slashes in a string with backslashes.
```{r}
str_replace_all(c("hey this is a /", "and another / in the pic"),
"/", "\\\\")
```
Implement a simple version of str_to_lower() using replace_all().
```{r}
my_str_lower <- function(x) {
lower_let <- letters
names(lower_let) <- LETTERS
str_replace_all(x, lower_let)
}
identical(my_str_lower(sentences), str_to_lower(sentences))
```
Switch the first and last letters in words. Which of those strings are still words?
```{r}
swapped <- str_replace(words, "^([a-z])(.*)([a-z])$", "\\3\\2\\1")
intersect(swapped, words) # swapped strings that are still words
```
## Exercises 14.4.6.1
Split up a string like "apples, pears, and bananas" into individual components.
```{r}
str_split("apples, pears, and bananas", boundary("word"))[[1]]
```
Why is it better to split up by boundary("word") than " "?
Because it takes care of punctuation such as commas and full stops, so you get only the words.
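A side-by-side sketch on the example string from the exercise shows the difference:

```{r}
library(stringr)

x <- "apples, pears, and bananas"

# Splitting on a space keeps the commas attached to the words
str_split(x, " ")[[1]]

# Splitting on word boundaries returns only the words
str_split(x, boundary("word"))[[1]]
```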
What does splitting with an empty string ("") do? Experiment, and then read the documentation.
```{r}
str_split("apples, pears, and bananas", "")[[1]]
```
Splitting with `""` breaks the string into its individual characters; according to the documentation, an empty pattern is equivalent to `boundary("character")`.
## Exercises 14.5.1
How would you find all strings containing \ with regex() vs. with fixed()?
```{r}
str_ing <- c("contains \\", "and \\ another", "ad")
str_subset(str_ing, regex("\\\\"))
str_subset(str_ing, fixed("\\")) # fixed() skips regex interpretation and compares
# the literal bytes, so the string only needs one escaped backslash.
```
What are the five most common words in sentences?
```{r}
unlist(str_split(sentences, boundary("word"))) %>%
  str_to_lower() %>%
  tibble() %>%
  set_names("words") %>%
  count(words) %>%
  arrange(desc(n)) %>%
  head(5)
```
## Exercises 14.7.1
Find the stringi functions that:
Count the number of words.
`stri_count_words()`
Find duplicated strings.
`stri_duplicated()`
Generate random text.
The `stri_rand_*()` functions, e.g. `stri_rand_strings()` and `stri_rand_lipsum()`
How do you control the language that stri_sort() uses for sorting?
With the `locale` argument, which is passed through `...` to the collator options.
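A quick sketch of these functions in use. The inputs are made up, and the `"en"` locale is only an example value:

```{r}
library(stringi)

# Count the words in a string
stri_count_words("one two three")

# Flag elements that repeat an earlier element
stri_duplicated(c("a", "b", "a"))

# Generate two random strings of five characters each
stri_rand_strings(2, 5)

# Sort using the collation rules of a given locale
stri_sort(c("b", "a", "c"), locale = "en")
```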