Materialize Altrep character vectors before re-encoding
When every element of an Altrep vector is going to be visited, materializing the vector up front is often much faster than accessing it element by element. Element-wise access has more overhead overall, and the full vector frequently ends up being materialized at a later stage anyway (as happens in `group_by()`), so the cost of retrieving the data can be paid more than once.

This can be seen with the current devel version of vroom (jimhester/vroom@cc08248): a naive `dplyr::group_by()` on an Altrep vector (df2) is significantly slower than explicitly materializing the vector (df1) before the `group_by()` call.

```r
# Generate a 2-column tbl with factor and double columns. A factor is used so
# there is a limited number of distinct values, but it is read back as character.
input <- vroom::gen_tbl(1e6, 2, col_types = "fd")
readr::write_csv(input, "test.csv")

df1 <- vroom::vroom("test.csv")
bench::system_time({
  vroom:::force_materialization(df1[["V1"]])
  dplyr::group_by(df1, V1)
})
#> process    real
#>   286ms   287ms

df2 <- vroom::vroom("test.csv")
bench::system_time({
  dplyr::group_by(df2, V1)
})
#> process    real
#>   583ms   583ms
```

After this change the naive call performs at least as well as the explicitly materialized one.

```r
input <- vroom::gen_tbl(1e6, 2, col_types = "fd")
readr::write_csv(input, "test.csv")

df1 <- vroom::vroom("test.csv")
bench::system_time({
  vroom:::force_materialization(df1[["V1"]])
  dplyr::group_by(df1, V1)
})
#> process    real
#>   298ms   299ms

df2 <- vroom::vroom("test.csv")
bench::system_time({
  dplyr::group_by(df2, V1)
})
#> process    real
#>   208ms   209ms
```
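The reasoning about paying the retrieval cost more than once can be illustrated with a toy model in base R. This is only a sketch: `make_lazy()` and its `elt()`, `materialize()`, `size()`, and `decodes()` helpers are hypothetical names invented for illustration, not the real ALTREP API or anything from vroom or dplyr, and `toupper()` merely stands in for whatever work is needed to produce each string.

```r
# Toy model of a lazily decoded character vector (NOT the real ALTREP API).
# A counter records how many element decodes have been performed.
make_lazy <- function(raw) {
  decodes <- 0L
  list(
    # Element-wise access: decode a single value on every call.
    elt = function(i) {
      decodes <<- decodes + 1L
      toupper(raw[[i]])
    },
    # Materialization: decode every value exactly once into a plain vector.
    materialize = function() {
      decodes <<- decodes + length(raw)
      toupper(raw)
    },
    size = function() length(raw),
    decodes = function() decodes
  )
}

raw <- rep(c("a", "b", "c"), times = 1000)

# Strategy 1: two separate passes, each using element-wise access.
lazy1 <- make_lazy(raw)
n <- lazy1$size()
pass1 <- vapply(seq_len(n), function(i) nchar(lazy1$elt(i)), integer(1))
pass2 <- vapply(seq_len(n), function(i) lazy1$elt(i), character(1))
elementwise_decodes <- lazy1$decodes()

# Strategy 2: materialize once, then run both passes on the plain vector.
lazy2 <- make_lazy(raw)
full <- lazy2$materialize()
pass1_m <- nchar(full)
pass2_m <- full
materialized_decodes <- lazy2$decodes()
```

In this model the two element-wise passes decode each of the 3000 values twice (6000 decodes), while materializing first decodes each value once (3000 decodes) and yields the same results. The real speedup in the commit also depends on per-call overhead that this counter does not capture.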