ch11.Rmd

---
title: "Ch11"
output:
  html_document:
    df_print: paged
---

```{r}
library(tidyverse)
# stringr now belongs to the tidyverse core
```


## Exercises 14.2.5

In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

`paste` and `paste0` are the same but `paste0` has `sep = ""` by default and `paste` has `sep = " "` by default.

`str_c` is the equivalent `stringr` function.

```{r}
str_c(c("a", "b"), collapse = ", ")
```

```{r}
str_c(c("a", "b"), NA)
# In `str_c` everything that is pasted with an NA is an NA

paste0(c("a", "b"), NA)
# But in paste0 NA gets converted to a character string a pasted together. To mimic the same behaviour, replace the NA to a string with:
str_c(c("a", "b"), str_replace_na(NA))
```

In your own words, describe the difference between the sep and collapse arguments to str_c().

`sep` is what divides what you paste together within a vector of strings. `collapse` is the divider of a single pasted vector of strings.

Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

```{r}
uneven <- "one"
even <- "thre"

str_sub(even, str_length(even) / 2, str_length(even) / 2)

# Automatically rounds up the lower digit
str_sub(uneven, str_length(uneven) / 2, str_length(uneven) / 2)
```
One solution would be to round the the highest digit with `ceiling`.

What does str_wrap() do? When might you want to use it?

```{r}
str_wrap(
    "Hey, so this is one paragraph
    I'm interested in writing but I
    think it might be too long. I just
    want to make sure this is in the right format",
    width = 60, indent = 2, exdent = 1
) %>% cat()
```

This might be interesting to output messages while running scripts or in packages.
What does str_trim() do? What’s the opposite of str_trim()?

Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.

```{r}

str_paster <- function(x, collapse = ", ") {
  str_c(x, collapse = collapse)
}

tr <- letters[1:3]
str_paster(tr)

tr <- letters[1:2]
str_paster(tr)

tr <- letters[1]
str_paster(tr)

tr <- letters[0]
str_paster(tr)
```
It always returns a character, even if the vector is empty.

## 14.3.1.1 Exercises

Explain why each of these strings don’t match a \: "\", "\\", "\\\".

"\" won't match anything because "\" needs to be accompanied by two "\\" to escape "\"
"\" won't match "\\" because because "\" is actualy "\\" and needs double escaping so "\\\\" will match it.

Same for "\\\".


How would you match the sequence "'\?

str_view("\"'\\", "\"'\\\\")

What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

It matches a string similar to .a.b.c So every '\.' matches a literal dot and . matches any character except a new line.

```{r}
str_view(".a.b.c", "\\..\\..\\..")
```

## Exercises 14.3.2.1

How would you match the literal string "$^$"?

Given the corpus of common words in stringr::words, create regular expressions that find all words that:

Start with "y".
```{r}
str_bring <- function(string, pattern) {
  string[str_detect(string, pattern)]
}

str_bring(words, "^y")
```

End with "x"

```{r}
str_bring(words, "x$")
```

Are exactly three letters long. (Don’t cheat by using str_length()!)

```{r}
str_bring(words, "^.{3}$")
```

Have seven letters or more.

```{r}
str_bring(words, "^.{7,}$")
```

Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.

## Exercises 14.3.3.1

Create regular expressions to find all words that:

Start with a vowel.

```{r}
str_bring(words, "^[aeiou]")
```

That only contain consonants. (Hint: thinking about matching “not”-vowels.)

```{r}
str_bring(words, "^[^aeiou]")
```

End with ed, but not with eed.

```{r}
str_bring(words, "[^e]ed$")
```

End with ing or ise.

```{r}
str_bring(words, "i(ng|se)$")
```

Empirically verify the rule "i before e except after c".

```{r}
str_bring(words, "ie|[^c]ie")
```


Is "q"" always followed by a "u"?

```{r}
str_bring(words, "q[^u]")
```

Yes!

Write a regular expression that matches a word if it’s probably written in British English, not American English.

A bit hard. The closest is: "ou|ise^|ae|oe|yse^"

```{r}
str_bring(words, "ou|ise^|ae|oe|yse^")
```

But see: https://jrnold.github.io/r4ds-exercise-solutions/strings.html

Create a regular expression that will match telephone numbers as commonly written in your country.

```{r}
x <- c("34697382009", "18093438932", "18098462020")
str_bring(x, "^34.{9}$")
```
or

```{r}
x <- c("123-456-7890", "1235-2351")
str_bring(x, "\\d{3}-\\d{3}-\\d{4}")
```

## Exercises 14.3.4.1

Describe the equivalents of ?, +, * in {m,n} form.

? is {,1}
+ is {1,}
* has no equivalent

Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

^.*$ 

Matches any string

"\\{.+\\}"

Matches any string with curly braces.

\d{4}-\d{2}-\d{2}

Matches a set of numbers in this format dddd-dd-dd

"\\\\{4}"

It matches four back slashes.

```{r}
str_bring("\\\\\\\\", "\\\\{4}")
```

Create regular expressions to find all words that:

Start with three consonants.

```{r}
str_bring(words, "^[^aeiou]{3}")
```

Have three or more vowels in a row.

```{r}
str_bring(words, "[aeiou]{3,}")
```


Have two or more vowel-consonant pairs in a row.

```{r}
str_bring(words, "[^aeiou][aeiou]{2,}")
```

Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.


## Exercises 14.3.5.1
Describe, in words, what these expressions will match:

(.)\1\1

Any character repeated three times in a row.

```{r}
str_bring(c("aaa", "aaba"), "(.)\\1\\1")
```

"(.)(.)\\2\\1"

Two characters followed by the same two characters in reverse order
```{r}
str_bring(c("aabb"), "(.)(.)\\2\\1")
```


(..)\1

Two charachters repeated twice

```{r}
str_bring(c("abab", "abba"), "(..)\\1")
```


"(.).\\1.\\1"

A character repeated three times with characters in between each repitition, e.g. abaca

```{r}
str_bring(c("abaca", "aabb"), "(.).\\1.\\1")
```


"(.)(.)(.).*\\3\\2\\1"

The characters followed by any character repeate 0 or more times and then then the same three characters in reverse order.

```{r}
str_bring(c("abc312131cba", "aaabbbccc"), "(.)(.)(.).*\\3\\2\\1")
```

Construct regular expressions to match words that:

Start and end with the same character.

```{r}
str_bring(words, "^(.).*\\1$")
```

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

```{r}
str_bring(words, "(..).*\\1")
```


Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)


```{r}
str_bring(words, "([a-z]).*\\1.*\\1")
```

 
## Exercises 14.4.2

For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

Find all words that start or end with x.

```{r}
str_bring(words, "^x|x$")
```

or

```{r}
start_r <- str_detect(words, "^x")
end_r <- str_detect(words, "x$")

words[start_r | end_r]
```


Find all words that start with a vowel and end with a consonant.

```{r}
str_bring(words, "^[aeiou].*[^aeiou]$")
```

or

```{r}
start_r <- str_detect(words, "^[aeiou]")
end_r <- str_detect(words, "[^aeiou]$")

words[start_r & end_r]
```


Are there any words that contain at least one of each different vowel?

```{r}

vowels <-
  str_detect(words, "a") & str_detect(words, "e") & str_detect(words, "i") &
  str_detect(words, "o") & str_detect(words, "u")

words[vowels]
```


What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)

```{r}
vowels <- str_count(words, "[aeiou]")

words[which.max(vowels)]
```

```{r}
words[which.max(vowels / str_length(words))]
```

## Exercises 14.4.3.1

In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.

```{r}
colors <- c(
  "red", "orange", "yellow", "green", "blue", "purple"
)

color_match <- str_c(str_c("\\b", colors, "\\b"), collapse = "|")

sentences[str_count(sentences, color_match) > 1]
```


From the Harvard sentences data, extract:

The first word from each sentence.

```{r}
str_extract(sentences, "^[a-zA-Z]+")
```

All words ending in ing.

```{r}

end_ing <- str_extract(sentences, "\\b[a-zA-Z]+ing\\b")
end_ing[!is.na(end_ing)]
```


All plurals.

```{r}
unique(unlist(str_extract_all(sentences, "\\b[a-zA-Z]{3,}s\\b"))) %>%
  head()
```

## Exercises 14.4.4.1

Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.

```{r}
numbers <- c(
  "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"
)

number_regexp <- str_c("(", str_c(numbers, collapse = "|"), ")")

regexp <- str_c(number_regexp, " ([^ ]+)")
all_match <- str_match(sentences, regexp)

all_match[complete.cases(all_match), ] %>% head()
```

Find all contractions. Separate out the pieces before and after the apostrophe.

```{r}
contract_re <- "([a-zA-Z]+)'([a-zA-Z]+)"
contract <- sentences[str_detect(sentences, contract_re)]

str_match(contract, contract_re)
```

## Exercises 14.4.5.1
Replace all forward slashes in a string with backslashes.

```{r}
str_replace_all(c("hey this is a /", "and another / in the pic"),
                "/", "\\\\")
```


Implement a simple version of str_to_lower() using replace_all().

```{r}
my_str_lower <- function(x) {
  lower_let <- letters
  names(lower_let) <- LETTERS
  
  str_replace_all(x, lower_let)
}

identical(my_str_lower(sentences), str_to_lower(sentences))
```


Switch the first and last letters in words. Which of those strings are still words?

```{r}
str_replace_all(words, "^([a-z])(.*)([a-z])$", c("\\3\\2\\1"))
```

## Exercises 14.4.6.1
Split up a string like "apples, pears, and bananas" into individual components.

```{r}
str_split("apples, pears, and bananas", boundary("word"))[[1]]
```


Why is it better to split up by boundary("word") than " "?

Becaise ot tales care of commas and dots.

What does splitting with an empty string ("") do? Experiment, and then read the documentation.

```{r}
str_split("apples, pears, and bananas", "")[[1]]
```


## Exercises 14.5.1
How would you find all strings containing \ with regex() vs. with fixed()?

```{r}
str_ing <- c("contains \\", "and \\ another", "ad")

str_subset(str_ing, regex("\\\\"))
str_subset(str_ing, fixed("\\")) # ignores regular expressions and matches
# on byte by byte.
```


What are the five most common words in sentences?
```{r}

unlist(str_split(sentences, boundary("word"))) %>%
  str_to_lower() %>%
  tibble() %>%
  set_names("words") %>%
  count(words) %>%
  arrange(desc(n)) %>%
  head()

```

## Exercises 14.7.1
Find the stringi functions that:

Count the number of words.

stri_count

Find duplicated strings.

stri_duplicated

Generate random text.

stri_rand_* functions

How do you control the language that stri_sort() uses for sorting?

With the `locale` argument which is specified through `...`