Commit

Use US spelling of summarize()
hadley committed Nov 18, 2022
1 parent d06d412 commit 3045d05
Showing 17 changed files with 87 additions and 88 deletions.
4 changes: 2 additions & 2 deletions base-R.qmd
@@ -247,7 +247,7 @@ There are a number of other base approaches to creating new columns including with
Hadley collected a few examples at <https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf>.

Using `$` directly is convenient when performing quick summaries.
-For example, if you just want to find the size of the biggest diamond or the possible values of `cut`, there's no need to use `summarise()`:
+For example, if you just want to find the size of the biggest diamond or the possible values of `cut`, there's no need to use `summarize()`:

```{r}
max(diamonds$carat)
@@ -423,7 +423,7 @@ Another important member of the apply family is `tapply()` which computes a single
```{r}
diamonds |>
group_by(cut) |>
-summarise(price = mean(price))
+summarize(price = mean(price))
tapply(diamonds$price, diamonds$cut, mean)
```
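
As a rough illustration of the `tapply()` hunk above, the following base R sketch uses the built-in `mtcars` data rather than `diamonds` (an assumption made only so that it runs without ggplot2). It shows that `tapply()` returns a named array rather than a data frame, so a `summarize()`-style table has to be rebuilt by hand; the names `by_cyl` and `by_cyl_df` are hypothetical.

```r
# Grouped mean of mpg by number of cylinders, using built-in data.
by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# tapply() returns a named array, one element per group,
# with the group labels stored as (character) names.
names(by_cyl)

# To get a table shaped like the summarize() result, rebuild it manually.
by_cyl_df <- data.frame(
  cyl = as.numeric(names(by_cyl)),
  mpg = as.vector(by_cyl)
)
by_cyl_df
```

The dplyr pipeline in the hunk keeps the grouping variable as a regular column, which is usually more convenient when the result feeds into further data frame operations.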
4 changes: 2 additions & 2 deletions communicate-plots.qmd
@@ -187,7 +187,7 @@ It's not wonderful for this plot, but it isn't too bad.
```{r}
class_avg <- mpg |>
group_by(class) |>
-summarise(
+summarize(
displ = median(displ),
hwy = median(hwy)
)
@@ -208,7 +208,7 @@ Often, you want the label in the corner of the plot, so it's convenient to creat

```{r}
label_info <- mpg |>
-summarise(
+summarize(
displ = max(displ),
hwy = max(hwy),
label = "Increasing engine size is \nrelated to decreasing fuel economy."
9 changes: 4 additions & 5 deletions data-transform.qmd
@@ -423,11 +423,10 @@ This means subsequent operations will now work "by month".

### `summarize()` {#sec-summarize}

-The most important grouped operation is a summary.
-It collapses each group to a single row[^data-transform-3].
-Here we compute the average departure delay by month:
+The most important grouped operation is a summary, which collapses each group to a single row.
+In dplyr, this operation is performed by `summarize()`[^data-transform-3], as shown by the following example, which computes the average departure delay by month:

-[^data-transform-3]: This is a slight simplification; later on you'll learn how to use `summarize()` to produce multiple summary rows for each group.
+[^data-transform-3]: Or `summarise()`, if you prefer British English.

```{r}
flights |>
@@ -673,7 +672,7 @@ You can find a good explanation of this problem and how to overcome it at <http:
## Summary

In this chapter, you've learned the tools that dplyr provides for working with data frames.
-The tools are roughly grouped into three categories: those that manipulate the rows (like `filter()` and `arrange()`), those that manipulate the columns (like `select()` and `mutate()`), and those that manipulate groups (like `group_by()` and `summarise()`).
+The tools are roughly grouped into three categories: those that manipulate the rows (like `filter()` and `arrange()`), those that manipulate the columns (like `select()` and `mutate()`), and those that manipulate groups (like `group_by()` and `summarize()`).
In this chapter, we've focused on these "whole data frame" tools, but you haven't yet learned much about what you can do with individual variables.
We'll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.
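
The footnote added in this file mentions `summarise()` as the British spelling. As a minimal sketch, assuming dplyr is installed and using the built-in `mtcars` data instead of `flights`, the two spellings can be checked to give the same result; the object names `us` and `uk` are hypothetical.

```r
library(dplyr)

# Same grouped summary, written with each spelling.
us <- mtcars |>
  group_by(cyl) |>
  summarize(mpg = mean(mpg))

uk <- mtcars |>
  group_by(cyl) |>
  summarise(mpg = mean(mpg))

# Compare the two results.
identical(us, uk)
```

Because both spellings behave the same way, a rename like the one in this commit should not change any computed output in the book.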

14 changes: 7 additions & 7 deletions databases.qmd
@@ -310,7 +310,7 @@ flights |>
```{r}
flights |>
group_by(dest) |>
-summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
+summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
show_query()
```

@@ -393,14 +393,14 @@ You'll see more complex examples once we hit the join functions.

### GROUP BY

-`group_by()` is translated to the `GROUP BY`[^databases-6] clause and `summarise()` is translated to the `SELECT` clause:
+`group_by()` is translated to the `GROUP BY`[^databases-6] clause and `summarize()` is translated to the `SELECT` clause:

[^databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.

```{r}
diamonds_db |>
group_by(cut) |>
-summarise(
+summarize(
n = n(),
avg_price = mean(price, na.rm = TRUE)
) |>
@@ -445,7 +445,7 @@ dbplyr will remind you about this behavior the first time you hit it:
```{r}
flights |>
group_by(dest) |>
-summarise(delay = mean(arr_delay))
+summarize(delay = mean(arr_delay))
```

If you want to learn more about how NULLs work, you might enjoy "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand.
@@ -471,7 +471,7 @@ This is one of the idiosyncrasies of SQL created because `WHERE` is evaluated
```{r}
diamonds_db |>
group_by(cut) |>
-summarise(n = n()) |>
+summarize(n = n()) |>
filter(n > 100) |>
show_query()
```
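
The hunks above rely on a `diamonds_db` table that is set up elsewhere in the chapter. As a hedged, self-contained sketch, dbplyr's `lazy_frame()` can stand in for a real connection so that the translation can be inspected without a database; the columns and the names `fake_diamonds`, `query`, and `sql_text` are hypothetical, and the exact SQL text depends on the dbplyr version and the simulated backend.

```r
library(dplyr)
library(dbplyr)

# A lazy table backed by a simulated connection: no database is needed,
# and no query is ever executed.
fake_diamonds <- lazy_frame(cut = "Ideal", price = 326)

query <- fake_diamonds |>
  group_by(cut) |>
  summarize(n = n(), avg_price = mean(price, na.rm = TRUE)) |>
  filter(n > 100)

# Print the generated SQL, then keep it as a string for inspection.
show_query(query)
sql_text <- as.character(sql_render(query))
```

The grouping variable should appear in a `GROUP BY` clause and the summaries in the `SELECT` clause; how the trailing `filter()` is rendered may differ between dbplyr versions.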
@@ -579,13 +579,13 @@ The easiest way to see the full set of what's currently available is to visit th
So far we've focused on the big picture of how dplyr verbs are translated to the clauses of a query.
Now we're going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use `mean(x)` in a `summarize()`?

-To help see what's going on, we'll use a couple of little helper functions that run a `summarise()` or `mutate()` and show the generated SQL.
+To help see what's going on, we'll use a couple of little helper functions that run a `summarize()` or `mutate()` and show the generated SQL.
That will make it a little easier to explore a few variations and see how summaries and transformations can differ.

```{r}
summarize_query <- function(df, ...) {
df |>
-summarise(...) |>
+summarize(...) |>
show_query()
}
mutate_query <- function(df, ...) {
4 changes: 2 additions & 2 deletions datetimes.qmd
@@ -351,7 +351,7 @@ It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays
flights_dt |>
mutate(minute = minute(dep_time)) |>
group_by(minute) |>
-summarise(
+summarize(
avg_delay = mean(dep_delay, na.rm = TRUE),
n = n()) |>
ggplot(aes(minute, avg_delay)) +
@@ -369,7 +369,7 @@ Interestingly, if we look at the *scheduled* departure time we don't see such a
sched_dep <- flights_dt |>
mutate(minute = minute(sched_dep_time)) |>
group_by(minute) |>
-summarise(
+summarize(
avg_delay = mean(arr_delay, na.rm = TRUE),
n = n())
4 changes: 2 additions & 2 deletions factors.qmd
@@ -179,7 +179,7 @@ For example, imagine you want to explore the average number of hours spent watch
#| any sense of overall pattern.
relig_summary <- gss_cat |>
group_by(relig) |>
-summarise(
+summarize(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
@@ -232,7 +232,7 @@ What if we create a similar plot looking at how average age varies across report
#| then $8000-9999.
rincome_summary <- gss_cat |>
group_by(rincome) |>
-summarise(
+summarize(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
18 changes: 9 additions & 9 deletions functions.qmd
@@ -441,7 +441,7 @@ So the key challenge in writing data frame functions is figuring out which argum
Fortunately this is easy because you can look it up from the documentation 😄.
There are two terms to look for in the docs which correspond to the two most common sub-types of tidy evaluation:

-- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarise()` that compute with variables.
+- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarize()` that compute with variables.

- **Tidy-selection**: this is used for functions like `select()`, `relocate()`, and `rename()` that select variables.

@@ -455,7 +455,7 @@ If you commonly perform the same set of summaries when doing initial data explor

```{r}
summary6 <- function(data, var) {
-data |> summarise(
+data |> summarize(
min = min({{ var }}, na.rm = TRUE),
mean = mean({{ var }}, na.rm = TRUE),
median = median({{ var }}, na.rm = TRUE),
@@ -468,9 +468,9 @@ summary6 <- function(data, var) {
diamonds |> summary6(carat)
```

-(Whenever you wrap `summarise()` in a helper, we think it's good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.)
+(Whenever you wrap `summarize()` in a helper, we think it's good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.)
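
To make the `.groups = "drop"` advice concrete, here is a minimal sketch assuming dplyr is installed and using the built-in `mtcars` data. The helper `mean_by()` is hypothetical and is not the book's `summary6()`; it only illustrates embracing an argument with `{{ }}` and returning an ungrouped result.

```r
library(dplyr)

# Hypothetical helper: mean and count of one variable per existing group.
mean_by <- function(data, var) {
  data |>
    summarize(
      mean = mean({{ var }}, na.rm = TRUE),
      n = n(),
      .groups = "drop"  # return an ungrouped result and suppress the message
    )
}

by_cyl_gear <- mtcars |>
  group_by(cyl, gear) |>
  mean_by(mpg)

# Check whether any grouping is left on the result.
is_grouped_df(by_cyl_gear)
```

Without `.groups = "drop"`, summarizing data grouped by two variables would typically leave the result grouped by the first one, which can surprise callers of a helper.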

-The nice thing about this function is that, because it wraps `summarise()`, you can use it on grouped data:
+The nice thing about this function is that, because it wraps `summarize()`, you can use it on grouped data:

```{r}
diamonds |>
@@ -489,7 +489,7 @@ diamonds |>

To summarize multiple variables you'll need to wait until @sec-across, where you'll learn how to use `across()`.

-Another popular `summarise()` helper function is a version of `count()` that also computes proportions:
+Another popular `summarize()` helper function is a version of `count()` that also computes proportions:

```{r}
# https://twitter.com/Diabb6/status/1571635146658402309
@@ -547,7 +547,7 @@ You might try writing something like:
count_missing <- function(df, group_vars, x_var) {
df |>
group_by({{ group_vars }}) |>
-summarise(n_miss = sum(is.na({{ x_var }})))
+summarize(n_miss = sum(is.na({{ x_var }})))
}
flights |>
count_missing(c(year, month, day), dep_time)
@@ -560,7 +560,7 @@ We can work around that problem by using the handy `pick()` which allows you to
count_missing <- function(df, group_vars, x_var) {
df |>
group_by(pick({{ group_vars }})) |>
-summarise(n_miss = sum(is.na({{ x_var }})))
+summarize(n_miss = sum(is.na({{ x_var }})))
}
flights |>
count_missing(c(year, month, day), dep_time)
@@ -602,7 +602,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
```{r}
#| eval: false
-flights |> group_by(dest) |> summarise_severe()
+flights |> group_by(dest) |> summarize_severe()
```
3. Finds all flights that were cancelled or delayed by more than a user supplied number of hours:
@@ -616,7 +616,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
```{r}
#| eval: false
-weather |> summarise_weather(temp)
+weather |> summarize_weather(temp)
```
5. Converts the user supplied variable that uses clock time (e.g. `dep_time`, `arr_time`, etc) into a decimal time (i.e. hours + minutes / 60).