More hacking of string chapter
hadley committed Apr 22, 2021
1 parent 554890b commit 2505136
Showing 3 changed files with 227 additions and 198 deletions.
43 changes: 43 additions & 0 deletions prog-strings.Rmd
@@ -6,6 +6,49 @@ library(tidyr)
library(tibble)
```

### str_c

`NULL`s are silently dropped.
This is particularly useful in conjunction with `if`:

```{r}
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE
str_c(
"Good ", time_of_day, " ", name,
if (birthday) " and HAPPY BIRTHDAY",
"."
)
```
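
As a minimal sketch of the same idea, the greeting can be wrapped in a function so both branches are easy to compare. `greet()` is a hypothetical helper introduced only for illustration; it is not part of stringr.

```{r}
library(stringr)

# Hypothetical helper (not part of stringr) wrapping the greeting above.
greet <- function(name, time_of_day, birthday = FALSE) {
  str_c(
    "Good ", time_of_day, " ", name,
    # `if` without `else` evaluates to NULL when the condition is FALSE,
    # and str_c() drops NULL arguments, so this piece simply disappears.
    if (birthday) " and HAPPY BIRTHDAY",
    "."
  )
}

greet("Hadley", "morning")
greet("Hadley", "morning", birthday = TRUE)
```

The first call should return `"Good morning Hadley."`, and the second should add the birthday clause before the final period.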

## Performance

`fixed()` matches exactly the specified sequence of bytes.
It ignores the special meaning of all regular expression characters and operates at a very low level.
This allows you to avoid complex escaping and can be much faster than regular expressions.
The following microbenchmark shows that it's about 3x faster for a simple example.

```{r}
microbenchmark::microbenchmark(
fixed = str_detect(sentences, fixed("the")),
regex = str_detect(sentences, "the"),
times = 20
)
```
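
The escaping benefit is easiest to see with a pattern that contains a metacharacter. The following is a small sketch using made-up input strings; it compares a plain regular expression, `fixed()`, and a manually escaped regular expression.

```{r}
library(stringr)

# Illustrative inputs: only the first contains a literal "a.c".
x <- c("a.c", "abc")

# As a regular expression, `.` matches any character, so both strings match.
regex_hits <- str_detect(x, "a.c")

# With fixed(), `.` is just a literal dot, so only the first string matches.
fixed_hits <- str_detect(x, fixed("a.c"))

# The regular expression equivalent needs the dot escaped.
escaped_hits <- str_detect(x, "a\\.c")

regex_hits
fixed_hits
escaped_hits
```

Here `fixed_hits` and `escaped_hits` should agree, while `regex_hits` also matches `"abc"`.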

As you saw with `str_split()`, you can use `boundary()` to match boundaries.
You can also use it with the other functions:

```{r}
x <- "This is a sentence."
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
```
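
As one more sketch along the same lines, `boundary("word")` can also be passed to `str_count()` to count words, assuming the default behaviour of skipping spaces and punctuation between words:

```{r}
library(stringr)

x <- "This is a sentence."

# Count the words rather than extracting them.
n_words <- str_count(x, boundary("word"))

# The same pieces as a character vector, for comparison.
words <- str_extract_all(x, boundary("word"))[[1]]

n_words
words
```

The count should equal the length of the extracted vector.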

###

### Extract

```{r}
50 changes: 49 additions & 1 deletion regexps.Rmd
@@ -296,6 +296,55 @@ There are two useful function in base R that also use regular expressions:
(If you're more comfortable with "globs" like `*.Rmd`, you can convert them to regular expressions with `glob2rx()`):
## Options
When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`:
```{r, eval = FALSE}
# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))
```

You can use the other arguments of `regex()` to control details of the match:

- `ignore_case = TRUE` allows characters to match either their uppercase or lowercase forms.
This always uses the current locale.

```{r}
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
str_view(bananas, regex("banana", ignore_case = TRUE))
```
- `multiline = TRUE` allows `^` and `$` to match the start and end of each line rather than the start and end of the complete string.
```{r}
x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
```
- `comments = TRUE` allows you to use comments and white space to make complex regular expressions more understandable.
Spaces are ignored, as is everything after `#`.
To match a literal space, you'll need to escape it: `"\\ "`.
```{r}
phone <- regex("
\\(? # optional opening parens
(\\d{3}) # area code
[) -]? # optional closing parens, space, or dash
(\\d{3}) # another three numbers
[ -]? # optional space or dash
(\\d{4}) # four more numbers
", comments = TRUE)
str_match("514-791-8141", phone)
```
- `dotall = TRUE` allows `.` to match everything, including `\n`.
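
The `dotall` option has no example in the list above, so here is a minimal sketch with an illustrative two-line string. By default `.` stops at a newline, so a pattern that has to cross the line break only matches once `dotall = TRUE` is set.

```{r}
library(stringr)

x <- "Line 1\nLine 2"

# By default `.` does not match "\n", so the pattern cannot cross the line break.
default_match <- str_detect(x, "1.Line")

# With dotall = TRUE, `.` also matches the newline.
dotall_match <- str_detect(x, regex("1.Line", dotall = TRUE))

default_match
dotall_match
```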
## A caution
A word of caution before we continue: because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression.
@@ -394,4 +443,3 @@ See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for mor
Don't forget that you're in a programming language and you have other tools at your disposal.
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
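
As a rough sketch of that advice, suppose you want the strings that contain both a lowercase letter and a digit, in any order. The input vector and the task are made up for illustration. A single regexp needs an alternation to cover both orders, whereas two simple regexps combined with `&` are easier to read and to extend.

```{r}
library(stringr)

# Illustrative inputs.
x <- c("abc123", "ABCDEF", "12345", "1a", "xyz")

# One complex regexp: letter then digit, or digit then letter.
single <- str_detect(x, "[a-z].*[0-9]|[0-9].*[a-z]")

# A series of simpler regexps, combined with ordinary logical operators.
has_lower <- str_detect(x, "[a-z]")
has_digit <- str_detect(x, "[0-9]")
combined <- has_lower & has_digit

x[combined]
```

Both approaches should select the same strings here; the second scales more gracefully if a third condition is added later.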