Commit

Small copy edits edits to strings.Rmd
garrettgman committed Apr 7, 2016
1 parent 90bd4fb commit b724f66
Showing 1 changed file with 12 additions and 12 deletions.
24 changes: 12 additions & 12 deletions strings.Rmd
@@ -10,7 +10,7 @@ sentences <- readr::read_lines("harvard-sentences.txt")

<!-- look at http://d-rug.github.io/blog/2015/regex.fick/, http://qntm.org/files/re/re.html -->

This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions. Character variables typically unstructured or semi-structured data so you need some tools to make order from madness. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely. The goal of this chapter is not to teach you every detail of regular expressions. Instead we'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions. Character variables typically come as unstructured or semi-structured data. When this happens, you need some tools to make order from madness. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely. The goal of this chapter is not to teach you every detail of regular expressions. Instead I'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.

This chapter will focus on the __stringr__ package. This package provides a consistent set of functions that all work the same way and are easier to learn than the base R equivalents. We'll also take a brief look at the __stringi__ package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.

@@ -30,9 +30,9 @@ double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```

That means if you want to include a literal `\`, you'll need to double it up: `"\\"`.
That means if you want to include a literal backslash, you'll need to double it up: `"\\"`.

Beware that the printed representation of the string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`:
Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`:

```{r}
x <- c("\"", "\\")
@@ -83,7 +83,7 @@ Use the `sep` argument to control how they're separated:
str_c("x", "y", sep = ", ")
```

Like most other functions in R, missing values are infectious. If you want them to print as `NA`, use `str_replace_na()`:
Like most other functions in R, missing values are contagious. If you want them to print as `"NA"`, use `str_replace_na()`:

```{r}
x <- c("abc", NA)
@@ -118,7 +118,7 @@ str_c(c("x", "y", "z"), collapse = ", ")

### Subsetting strings

You can extract parts of a string using `str_sub()`. As well as the string, `str_sub()` takes `start` and `end` argument which give the (inclusive) position of the substring:
You can extract parts of a string using `str_sub()`. As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring:

```{r}
x <- c("Apple", "Banana", "Pear")
@@ -186,7 +186,7 @@ str_sort(x, locale = "haw") # Hawaiian

Regular expressions, regexps for short, are a very terse language that allow to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.

To learn regular expressions, we'll use `str_show()` and `str_show_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.

### Basic matches

@@ -203,7 +203,7 @@ The next step up in complexity is `.`, which matches any character (except a new
str_view(x, ".a.")
```

But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use `\\.`.
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. In other words, you need to make the regular expression `\.`, but this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So the string `"\."` reduces to the special character written as `\.` In this case, `\.` is not a recognized special character and the string would lead to an error; but `"\n"` would reduce to a new line, `"\t"` would reduce to a tab, and `"\\"` would reduce to a literal `\`, which provides a way forward. To create a string that reduces to a literal backslash followed by a period, you need to escape the backslash. So to match a literal "`.`" you need to use `"\\."`, which simplifies to the regular expression `\.`.

```{r, cache = FALSE}
# To create the regular expression, we need \\
@@ -216,7 +216,7 @@ writeLines(dot)
str_view(c("abc", "a.c", "bef"), "a\\.c")
```

If `\` is used an escape character, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` - you need four backslashes to match one!
If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` - you need four backslashes to match one!

```{r, cache = FALSE}
x <- "a\\b"
@@ -372,7 +372,7 @@ str_view(fruit, "(..)\\1", match = TRUE)

(You'll also see how they're useful in conjunction with `str_match()` in a few pages.)

Unfortunately `()` in regexps serve two purposes: you usually use them to disambiguate precedence, but you can also use for grouping. If you're using one set for grouping and one set for disambiguation, things can get confusing. You might want to use `(?:)` instead: it only disambiguates, and doesn't modify the grouping. They are called non-capturing parentheses.
Unfortunately `()` in regexps serve two purposes: you usually use them to disambiguate precedence, but you can also use them for grouping. If you're using one set for grouping and one set for disambiguation, things can get confusing. You might want to use `(?:)` instead: it only disambiguates, and doesn't modify the grouping. `(?:)` are called non-capturing parentheses.

For example:

@@ -401,7 +401,7 @@ Now that you've learned the basics of regular expressions, it's time to learn ho
* Find the positions of matches.
* Extract the content of matches.
* Replace matches with new values.
* How can you split a string based on a match.
* Split a string based on a match.

Because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression. But since you're in a programming language, it's often easy to break the problem down into smaller pieces. If you find yourself getting stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
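
As a minimal sketch of that advice, the example below finds words that both start and end with a vowel in two ways: with one regexp, and with two simpler regexps combined by a logical operator. The `words` vector is made-up sample data, not something from the chapter.

```r
library(stringr)

# Hypothetical sample data (an assumption for illustration)
words <- c("apple", "banana", "orange", "kiwi", "idea")

# Approach 1: a single regexp that checks both ends at once
single <- str_detect(words, "^[aeiou].*[aeiou]$")

# Approach 2: two small regexps, each solving one piece of the problem
starts_vowel <- str_detect(words, "^[aeiou]")
ends_vowel <- str_detect(words, "[aeiou]$")
combined <- starts_vowel & ends_vowel

# Both approaches select the same words for this input
words[combined]
```

The second approach tends to stay readable as conditions pile up, because each piece can be checked on its own before it is combined with the others.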

@@ -459,7 +459,7 @@ str_count("abababa", "aba")
str_view_all("abababa", "aba")
```

Note the use of `str_view_all()`. As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches.
Note the use of `str_view_all()`. As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches. The second function will have the suffix `_all`.
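
To illustrate the pairing, here is a short sketch using `str_extract()` and `str_extract_all()`, one such pair in stringr. It reuses the `"abababa"` string from the example above; the variable names are only illustrative.

```r
library(stringr)

x <- "abababa"

# Single-match version: returns a character vector holding the first match
first_match <- str_extract(x, "aba")

# _all version: returns a list with one character vector per input string,
# containing every non-overlapping match
all_matches <- str_extract_all(x, "aba")

first_match
all_matches[[1]]
```

Because matches never overlap, the `_all` version finds two matches here, which agrees with the count reported by `str_count()`.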

### Exercises

@@ -633,7 +633,7 @@ sentences %>%
You can also request a maximum number of pieces:

```{r}
fields <- c("Name: Hadley", "County: NZ", "Age: 35")
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
```
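
The effect of `n` is easiest to see when the separator occurs more than once. The sketch below uses a made-up field (an assumption, not from the chapter) whose value itself contains `": "`.

```r
library(stringr)

# Hypothetical field whose value also contains the separator
field <- "Title: R: a language"

# Without n, every occurrence of the separator produces a split
unlimited <- str_split(field, ": ")[[1]]

# With n = 2, splitting stops after the first separator and the
# remainder of the string is kept intact in the last piece
limited <- str_split(field, ": ", n = 2)[[1]]

unlimited
limited
```

Capping the number of pieces this way is handy for key-value data, where only the first separator is meaningful.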
