forked from tidyverse/dplyr
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 110 changed files with 31,786 additions and 5 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,3 +29,5 @@ demo/pandas
^TAGS$
^\.dir-locals\.el$
^vignettes/rsconnect$
^docs$
^_pkgdown\.yml$
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@
template:
  package: tidytemplate
  default_assets: false
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@
YEAR: 2013-2015
COPYRIGHT HOLDER: RStudio
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,182 @@ | ||
<!--
%\VignetteEngine{knitr}
%\VignetteIndexEntry{Baseball benchmarks}
-->

```{r, echo = FALSE, message = FALSE}
library(dplyr)
library(microbenchmark)
library(data.table)
library(Lahman)
knitr::opts_chunk$set(
  comment = "#>",
  error = FALSE,
  tidy = FALSE
)
options(digits = 3, microbenchmark.unit = "ms")
```

# Benchmarks: baseball data

The purpose of these benchmarks is to be as fair as possible, to help understand the relative performance tradeoffs of the different approaches. If you think my implementation of the base or data.table equivalents is suboptimal, please let me know of better ways.

Also note that I consider any significant performance difference between `dplyr_dt` and `dt_raw` to be a bug in dplyr: for individual operations there should be very little overhead to calling data.table via dplyr. However, data.table may be significantly faster when performing the same sequence of operations as dplyr. This is because dplyr currently uses an eager evaluation approach, so the individual calls to `[.data.table` don't get as much information about the desired result as the single call to `[.data.table` would if you did it by hand.
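
As a rough illustration of the difference between a sequence of `[.data.table` calls and a single hand-written call, here is a minimal sketch. The toy table and its column names (`player`, `year`, `ab`) are made up for this example and are not part of the benchmarks; the sketch shows only that both forms give the same result, not how dplyr translates its verbs.

```r
library(data.table)

# Toy data, invented for illustration (not the Lahman data).
dt <- data.table(
  player = c("a", "a", "b", "b", "c"),
  year   = c(2001L, 2002L, 2001L, 2003L, 2002L),
  ab     = c(10, 30, 20, 40, 50)
)

# Step by step: each `[` call only sees its own part of the request.
filtered <- dt[year >= 2002L]
by_steps <- filtered[, list(ab = mean(ab)), by = player]

# By hand: one `[` call sees the filter, the summary and the grouping together.
one_call <- dt[year >= 2002L, list(ab = mean(ab)), by = player]

# Both routes should produce the same table.
all.equal(by_steps, one_call)
```

The single call gives data.table the whole request at once, which is the extra information the eager, step-by-step approach cannot pass along.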

Thanks go to Matt Dowle and Arun Srinivasan for their extensive feedback on these benchmarks.

## Data setup

The following benchmarks explore the performance on a somewhat realistic example: the `Batting` dataset from the Lahman package. It contains `r nrow(Batting)` records on the batting careers of `r length(unique(Batting$playerID))` players from `r min(Batting$yearID)` to `r max(Batting$yearID)`.

The first code block defines two alternative backends for the `Batting` dataset. Grouping operations are performed inline in each benchmark. This represents the common scenario where you group the data and immediately use it.

```{r setup}
batting_df <- tbl_df(Batting)
batting_dt <- tbl_dt(Batting)
```

## Summarise

Compute the average number of at bats for each player:

```{r summarise-mean}
microbenchmark(
  dplyr_df = batting_df %>% group_by(playerID) %>% summarise(ab = mean(AB)),
  dplyr_dt = batting_dt %>% group_by(playerID) %>% summarise(ab = mean(AB)),
  dt_raw   = batting_dt[, list(ab = mean(AB)), by = playerID],
  base     = tapply(batting_df$AB, batting_df$playerID, FUN = mean),
  times = 5
)
```

NB: the base implementation captures the computation but not the output format: `tapply()` returns a named array rather than a data frame, so it produces considerably less output.
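
To make the output-format difference concrete, here is a small sketch on made-up vectors (`ab` and `player` are illustrative names, not the Lahman columns). It shows the extra step base R would need to turn the `tapply()` result into a data frame; that step is not timed in the benchmark above.

```r
# Toy inputs, invented for illustration.
ab     <- c(10, 30, 20, 40, 50)
player <- c("a", "a", "b", "b", "c")

# tapply() returns a one-dimensional array named by group.
res <- tapply(ab, player, FUN = mean)

# Extra work needed to get a data frame with one row per group.
res_df <- data.frame(
  player = names(res),
  ab     = as.vector(res),
  stringsAsFactors = FALSE
)
res_df
```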

However, this comparison is slightly unfair because both data.table and `summarise()` use tricks to find a more efficient implementation of `mean()`. data.table calls a C implementation of `mean()` (using `.External(Cfastmean, B, FALSE)`, thus avoiding the overhead of S3 method dispatch). `dplyr::summarise()` uses a hybrid evaluation technique, where common functions are implemented purely in C++, avoiding R function call overhead. The following benchmark repeats the comparison with a wrapper, `mean_()`, in place of `mean()`:

```{r summarise-mean_}
mean_ <- function(x) .Internal(mean(x))
microbenchmark(
  dplyr_df = batting_df %>% group_by(playerID) %>% summarise(ab = mean_(AB)),
  dplyr_dt = batting_dt %>% group_by(playerID) %>% summarise(ab = mean_(AB)),
  dt_raw   = batting_dt[, list(ab = mean_(AB)), by = playerID],
  base     = tapply(batting_df$AB, batting_df$playerID, FUN = mean_),
  times = 5
)
```

## Arrange

Arrange by year within each player:

```{r arrange}
microbenchmark(
  dplyr_df = batting_df %>% arrange(playerID, yearID),
  dplyr_dt = batting_dt %>% arrange(playerID, yearID),
  dt_raw   = setkey(copy(batting_dt), playerID, yearID),
  # Index the data frame, not the data.table, so this measures base R only
  base     = batting_df[order(batting_df$playerID, batting_df$yearID), ],
  times = 2
)
```

## Filter

Find the year for which each player played the most games:

```{r filter}
microbenchmark(
  dplyr_df = batting_df %>% group_by(playerID) %>% filter(G == max(G)),
  dplyr_dt = batting_dt %>% group_by(playerID) %>% filter(G == max(G)),
  dt_raw   = batting_dt[batting_dt[, .I[G == max(G)], by = playerID]$V1],
  base     = batting_df[ave(batting_df$G, batting_df$playerID, FUN = max) ==
    batting_df$G, ],
  times = 2
)
```

I'm not aware of a single-line data.table equivalent ([see SO 16573995](http://stackoverflow.com/questions/16573995/)). Suggestions welcome. dplyr currently doesn't support hybrid evaluation for logical comparison, but it is scheduled for 0.2 (see [#113](https://github.com/hadley/dplyr/issues/113)); this should give an additional speed-up.
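
The `dt_raw` expression above is fairly dense, so here is a small sketch that unpacks it on a made-up table (the values are invented; only the column name `G` mirrors the benchmark). The inner call collects, per group, the row numbers (`.I`) where `G` equals the group maximum; the outer call then subsets by those row numbers.

```r
library(data.table)

# Toy data, invented for illustration.
dt <- data.table(
  player = c("a", "a", "b", "b", "b"),
  year   = c(2001L, 2002L, 2001L, 2002L, 2003L),
  G      = c(10L, 20L, 30L, 30L, 5L)
)

# Inner step: row numbers in `dt` where G is the per-player maximum.
# The unnamed result column is called V1 by default.
idx <- dt[, .I[G == max(G)], by = player]

# Outer step: subset the original table by those row numbers.
best <- dt[idx$V1]
best
```

Ties are kept, so player "b" contributes two rows here, which matches what `filter(G == max(G))` does on grouped data.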

## Mutate

Rank years based on number of at bats:

```{r mutate}
microbenchmark(
  dplyr_df = batting_df %>% group_by(playerID) %>% mutate(r = rank(desc(AB))),
  dplyr_dt = batting_dt %>% group_by(playerID) %>% mutate(r = rank(desc(AB))),
  dt_raw   = copy(batting_dt)[, rank := rank(desc(AB)), by = playerID],
  times = 2
)
```

(The `dt_raw` code needs to explicitly copy the data.table so that it doesn't modify it in place, as is the data.table default. This is an example where it's difficult to compare data.table and dplyr directly because of different underlying philosophies.)
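
A minimal sketch of why the copy is needed, using a made-up three-row table (all names here are illustrative): `:=` adds the column by reference, so without `copy()` the change is visible through every name bound to the same data.table.

```r
library(data.table)

original <- data.table(x = 1:3)

# With copy(): `:=` adds the column to the copy only.
copied <- copy(original)
copied[, y := x * 2L]
has_y_after_copy <- "y" %in% names(original)

# Without copy(): `same_table` and `original` refer to the same table,
# so `:=` through one name is visible through the other.
same_table <- original
same_table[, y := x * 2L]
has_y_after_alias <- "y" %in% names(original)

c(after_copy = has_y_after_copy, after_alias = has_y_after_alias)
```

In the benchmark, skipping `copy()` would leave `batting_dt` with the new column after the first run, so the copy is part of the cost of making the two approaches comparable.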

Compute year of career:

```{r mutate2}
microbenchmark(
  dplyr_df = batting_df %>% group_by(playerID) %>%
    mutate(cyear = yearID - min(yearID) + 1),
  dplyr_dt = batting_dt %>% group_by(playerID) %>%
    mutate(cyear = yearID - min(yearID) + 1),
  dt_raw   = copy(batting_dt)[, cyear := yearID - min(yearID) + 1,
    by = playerID],
  times = 5
)
```

`rank()` is a relatively expensive operation and `min()` is relatively cheap, showing the relative performance overhead of the different techniques.

dplyr currently has some support for hybrid evaluation of window functions. This yields substantial speed-ups where available:

```{r mutate_hybrid}
min_rank_ <- min_rank
microbenchmark(
  hybrid  = batting_df %>% group_by(playerID) %>% mutate(r = min_rank(AB)),
  regular = batting_df %>% group_by(playerID) %>% mutate(r = min_rank_(AB)),
  times = 2
)
```

## Joins

We conclude with some quick comparisons of joins. First we create two new datasets: `master`, which contains demographic information on each player, and `hall_of_fame`, which contains all players inducted into the Hall of Fame.

```{r}
master_df <- tbl_df(Master) %>% select(playerID, birthYear)
hall_of_fame_df <- tbl_df(HallOfFame) %>% filter(inducted == "Y") %>%
  select(playerID, votedBy, category)
master_dt <- tbl_dt(Master) %>% select(playerID, birthYear)
hall_of_fame_dt <- tbl_dt(HallOfFame) %>% filter(inducted == "Y") %>%
  select(playerID, votedBy, category)
```

```{r}
microbenchmark(
  dplyr_df = left_join(master_df, hall_of_fame_df, by = "playerID"),
  dplyr_dt = left_join(master_dt, hall_of_fame_dt, by = "playerID"),
  base     = merge(master_df, hall_of_fame_df, by = "playerID", all.x = TRUE),
  times = 10
)
microbenchmark(
  dplyr_df = inner_join(master_df, hall_of_fame_df, by = "playerID"),
  dplyr_dt = inner_join(master_dt, hall_of_fame_dt, by = "playerID"),
  base     = merge(master_df, hall_of_fame_df, by = "playerID"),
  times = 10
)
microbenchmark(
  dplyr_df = semi_join(master_df, hall_of_fame_df, by = "playerID"),
  dplyr_dt = semi_join(master_dt, hall_of_fame_dt, by = "playerID"),
  times = 10
)
microbenchmark(
  dplyr_df = anti_join(master_df, hall_of_fame_df, by = "playerID"),
  dplyr_dt = anti_join(master_dt, hall_of_fame_dt, by = "playerID"),
  times = 10
)
```
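
The semi and anti join benchmarks above have no `base` entry. As a rough sketch of what those two joins return, and of one possible base R counterpart using `%in%` (an assumption of this example, not something the benchmarks measure), consider two made-up tables:

```r
library(dplyr)

# Toy data, invented for illustration (not the Lahman tables).
players <- data.frame(
  playerID  = c("a", "b", "c"),
  birthYear = c(1900L, 1910L, 1920L),
  stringsAsFactors = FALSE
)
inducted <- data.frame(
  playerID = c("a", "a", "c"),
  votedBy  = c("x", "y", "x"),
  stringsAsFactors = FALSE
)

# Semi join: rows of `players` with a match in `inducted`, each kept once.
semi <- semi_join(players, inducted, by = "playerID")

# Anti join: rows of `players` with no match in `inducted`.
anti <- anti_join(players, inducted, by = "playerID")

# Inner join, for contrast: player "a" appears once per matching row.
inner <- inner_join(players, inducted, by = "playerID")

# One possible base R counterpart for the filtering joins.
base_semi <- players[players$playerID %in% inducted$playerID, ]
base_anti <- players[!players$playerID %in% inducted$playerID, ]
```

Unlike `inner_join()`, the filtering joins never duplicate rows of the left table and never add columns from the right one.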