Materialize Altrep character vectors before re-encoding
If you are going to visit every element of an Altrep vector,
materializing it up front is often much faster than element-wise
access. Element-wise access has more overhead per element, and the full
vector often ends up being materialized at a later stage anyway (as
happens in `group_by()`), so the cost of retrieving the data can be
paid more than once.
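
The cost argument above can be sketched with a small, self-contained
C++ model. This is only an illustration under stated assumptions:
`LazyStrings` and `count_fetches` are hypothetical names, not part of
R's ALTREP API or of vroom, and the model counts only how often the
backing source is consulted, not the per-call dispatch overhead.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a lazily evaluated (Altrep-like) string vector.
// It does NOT use R's ALTREP API; it only models where the retrieval cost lands.
class LazyStrings {
 public:
  LazyStrings(std::size_t n, std::function<std::string(std::size_t)> source)
      : n_(n), source_(std::move(source)) {}

  std::size_t size() const { return n_; }

  // Element-wise access: each call goes back to the source until the
  // vector has been materialized.
  std::string get(std::size_t i) {
    assert(i < n_);
    if (materialized_) return cache_[i];
    ++fetches_;
    return source_(i);
  }

  // Retrieve every element once; later reads are served from the cache.
  void materialize() {
    if (materialized_) return;
    cache_.reserve(n_);
    for (std::size_t i = 0; i < n_; ++i) {
      ++fetches_;
      cache_.push_back(source_(i));
    }
    materialized_ = true;
  }

  // Number of times the backing source was consulted so far.
  std::size_t fetches() const { return fetches_; }

 private:
  std::size_t n_;
  std::function<std::string(std::size_t)> source_;
  std::vector<std::string> cache_;
  bool materialized_ = false;
  std::size_t fetches_ = 0;
};

// Visit all n elements `passes` times, optionally materializing first,
// and return how many source retrievals that required.
std::size_t count_fetches(std::size_t n, int passes, bool materialize_first) {
  LazyStrings v(n, [](std::size_t i) {
    return "level" + std::to_string(i % 10);  // few distinct values, like a factor
  });
  if (materialize_first) v.materialize();
  std::size_t total_chars = 0;
  for (int p = 0; p < passes; ++p) {
    for (std::size_t i = 0; i < v.size(); ++i) total_chars += v.get(i).size();
  }
  (void)total_chars;  // the value is irrelevant; only the fetch count matters
  return v.fetches();
}
```

In this model, `count_fetches(1000, 2, false)` consults the source 2000
times, while `count_fetches(1000, 2, true)` consults it 1000 times: the
second full pass is where unmaterialized access pays again.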

This can be seen with the current devel version of vroom
(jimhester/vroom@cc08248): a naive `dplyr::group_by()` on an Altrep
vector (df2) is significantly slower than explicitly materializing the
vector (df1) before the `group_by()` call.

    # Generate a 2-column tbl with a factor and a double column. A factor keeps
    # the number of distinct values small; the column is read back as character.
    input <- vroom::gen_tbl(1e6, 2, col_types = "fd")
    readr::write_csv(input, "test.csv")

    df1 <- vroom::vroom("test.csv")

    bench::system_time({
      vroom:::force_materialization(df1[["V1"]])
      dplyr::group_by(df1, V1)
    })
    #> process    real
    #>   286ms   287ms

    df2 <- vroom::vroom("test.csv")

    bench::system_time({
      dplyr::group_by(df2, V1)
    })
    #> process    real
    #>   583ms   583ms

After this change the naive call (df2) performs at least as well as the
explicitly materialized one (df1).

    input <- vroom::gen_tbl(1e6, 2, col_types = "fd")
    readr::write_csv(input, "test.csv")

    df1 <- vroom::vroom("test.csv")

    bench::system_time({
      vroom:::force_materialization(df1[["V1"]])
      dplyr::group_by(df1, V1)
    })
    #> process    real
    #>   298ms   299ms

    df2 <- vroom::vroom("test.csv")

    bench::system_time({
      dplyr::group_by(df2, V1)
    })
    #> process    real
    #>   208ms   209ms
jimhester committed Apr 2, 2019
1 parent 2980546 commit 29eff25
Showing 1 changed file with 8 additions and 0 deletions.
8 changes: 8 additions & 0 deletions src/encoding.cpp
@@ -21,6 +21,14 @@ R_xlen_t get_first_reencode_pos(const Rcpp::CharacterVector& x) {
Rcpp::CharacterVector reencode_char(SEXP x) {
  if (Rf_isFactor(x)) return reencode_factor(x);

#if (defined(R_VERSION) && R_VERSION >= R_Version(3, 5, 0))
  // If ret is an Altrep call DATAPTR to materialize it fully here, since we
  // will be touching all the elements anyway.
  if (ALTREP(x)) {
    DATAPTR(x);
  }
#endif

  Rcpp::CharacterVector ret(x);
  R_xlen_t first = get_first_reencode_pos(ret);
  if (first >= ret.length()) return ret;