Materialize Altrep character vectors before re-encoding · rfaelens/dplyr@29eff25

Materialize Altrep character vectors before re-encoding

If you are going to visit every element of an Altrep vector, materializing
it up front is often much faster than element-wise access. Element-wise
access has more overhead per element, and the full vector often ends up
being materialized at a later stage anyway (as happens in `group_by()`),
so the cost of retrieving the data can be paid more than once.
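
The double-retrieval cost can be illustrated with a small standalone model. This is only a sketch, not dplyr or vroom code: `LazyStrings`, `elt()`, `materialize()` and the retrieval counter are hypothetical stand-ins for an Altrep vector and its backing store.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Altrep character vector. Every read that has
// to go to the backing store is counted as one retrieval.
class LazyStrings {
 public:
  explicit LazyStrings(std::vector<std::string> source)
      : source_(std::move(source)) {}

  std::size_t size() const { return source_.size(); }

  // Element-wise access: hits the backing store unless already materialized.
  std::string elt(std::size_t i) {
    if (materialized_) return cache_[i];
    ++retrievals_;
    return source_[i];
  }

  // Bulk materialization: retrieves every element once and caches it.
  void materialize() {
    if (materialized_) return;
    cache_.reserve(source_.size());
    for (const std::string& s : source_) {
      ++retrievals_;
      cache_.push_back(s);
    }
    materialized_ = true;
  }

  std::size_t retrievals() const { return retrievals_; }

 private:
  std::vector<std::string> source_;
  std::vector<std::string> cache_;
  bool materialized_ = false;
  std::size_t retrievals_ = 0;
};

// Scan element-wise first; a later stage then materializes the vector.
std::size_t scan_then_materialize(std::size_t n) {
  LazyStrings x{std::vector<std::string>(n, "a")};
  for (std::size_t i = 0; i < x.size(); ++i) (void)x.elt(i);
  x.materialize();
  return x.retrievals();
}

// Materialize up front; the scan is then served from the cache.
std::size_t materialize_then_scan(std::size_t n) {
  LazyStrings x{std::vector<std::string>(n, "a")};
  x.materialize();
  for (std::size_t i = 0; i < x.size(); ++i) (void)x.elt(i);
  return x.retrievals();
}
```

In this model `scan_then_materialize(1000)` returns 2000 and `materialize_then_scan(1000)` returns 1000: the element-wise scan pays for each element once, and the later materialization pays for all of them again.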

This can be seen with the current devel version of vroom
(jimhester/vroom@cc08248): a naive `dplyr::group_by()` on an Altrep
vector (df2) is significantly slower than explicitly materializing the
vector (df1) prior to the `group_by()` call.

    # Generate a 2-column tbl with factor and double columns. Factors are used so
    # there is a limited number of distinct values, but the column is read back as
    # character.
    input <- vroom::gen_tbl(1e6, 2, col_types = "fd")
    readr::write_csv(input, "test.csv")

    df1 <- vroom::vroom("test.csv")

    bench::system_time({
      vroom:::force_materialization(df1[["V1"]])
      dplyr::group_by(df1, V1)
    })
    #> process    real
    #>   286ms   287ms

    df2 <- vroom::vroom("test.csv")

    bench::system_time({
      dplyr::group_by(df2, V1)
    })
    #> process    real
    #>   583ms   583ms

After this change the naive call (df2) is no longer slower than the
explicitly materialized one (df1).

    input <- vroom::gen_tbl(1e6, 2, col_types = "fd")
    readr::write_csv(input, "test.csv")

    df1 <- vroom::vroom("test.csv")

    bench::system_time({
      vroom:::force_materialization(df1[["V1"]])
      dplyr::group_by(df1, V1)
    })
    #> process    real
    #>   298ms   299ms

    df2 <- vroom::vroom("test.csv")

    bench::system_time({
      dplyr::group_by(df2, V1)
    })
    #> process    real
    #>   208ms   209ms

jimhester committed Apr 2, 2019

1 parent 2980546 commit 29eff25

src/encoding.cpp

     Rcpp::CharacterVector reencode_char(SEXP x) {
       if (Rf_isFactor(x)) return reencode_factor(x);
+    #if (defined(R_VERSION) && R_VERSION >= R_Version(3, 5, 0))
+      // If x is an Altrep, call DATAPTR to materialize it fully here, since we
+      // will be touching all the elements anyway.
+      if (ALTREP(x)) {
+        DATAPTR(x);
+      }
+    #endif
       Rcpp::CharacterVector ret(x);
       R_xlen_t first = get_first_reencode_pos(ret);
       if (first >= ret.length()) return ret;
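
The `#if` guard in the diff compiles the Altrep check only when building against R 3.5.0 or later, presumably because the Altrep API is not available in earlier releases. The following is a minimal standalone sketch of that kind of packed-version comparison. The `DEMO_*` macros and `has_lazy_vectors()` are hypothetical analogs written for illustration; they are not R's real headers.

```cpp
// Hypothetical analog of a packed version code: major, minor and patch are
// combined into one integer so versions can be compared with >=.
#define DEMO_VERSION_CODE(major, minor, patch) \
  (((major) * 65536) + ((minor) * 256) + (patch))

// Assumed version of the library being compiled against (illustrative only).
#ifndef DEMO_VERSION
#define DEMO_VERSION DEMO_VERSION_CODE(3, 5, 2)
#endif

// Reports whether the version-gated code path was compiled in.
constexpr bool has_lazy_vectors() {
#if defined(DEMO_VERSION) && DEMO_VERSION >= DEMO_VERSION_CODE(3, 5, 0)
  return true;
#else
  return false;
#endif
}
```

Because the check happens in the preprocessor, builds against an older version never see the gated calls at all, so missing symbols cannot cause compile errors there.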
