Commit

ARROW-18079: [R] Improve efficiency of schema creation to prevent performance regressions (apache#14447)

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
thisisnic authored Oct 18, 2022
1 parent c33bb50 commit e991644
Showing 9 changed files with 52 additions and 13 deletions.
1 change: 1 addition & 0 deletions r/NAMESPACE
@@ -412,6 +412,7 @@ importFrom(purrr,map_dfr)
 importFrom(purrr,map_int)
 importFrom(purrr,map_lgl)
 importFrom(purrr,reduce)
+importFrom(purrr,walk)
 importFrom(rlang,"%||%")
 importFrom(rlang,":=")
 importFrom(rlang,.data)
2 changes: 1 addition & 1 deletion r/R/arrow-package.R
@@ -18,7 +18,7 @@
 #' @importFrom stats quantile median na.omit na.exclude na.pass na.fail
 #' @importFrom R6 R6Class
 #' @importFrom purrr as_mapper map map2 map_chr map2_chr map_dbl map_dfr map_int map_lgl keep imap imap_chr
-#' @importFrom purrr flatten reduce
+#' @importFrom purrr flatten reduce walk
 #' @importFrom assertthat assert_that is.string
 #' @importFrom rlang list2 %||% is_false abort dots_n warn enquo quo_is_null enquos is_integerish quos quo
 #' @importFrom rlang eval_tidy new_data_mask syms env new_environment env_bind set_names exec
8 changes: 6 additions & 2 deletions r/R/arrowExports.R

Some generated files are not rendered by default.

6 changes: 6 additions & 0 deletions r/R/dplyr-collect.R
@@ -115,6 +115,12 @@ implicit_schema <- function(.data) {
   # want to go one level up (where we may have called implicit_schema() before)
   .data <- ensure_group_vars(.data)
   old_schm <- .data$.data$schema
+
+  if (is.null(.data$aggregations) && is.null(.data$join) && !needs_projection(.data$selected_columns, old_schm)) {
+    # Just use the schema we have
+    return(old_schm)
+  }
+
   # Add in any augmented fields that may exist in the query but not in the
   # real data, in case we have FieldRefs to them
   old_schm[["__filename"]] <- string()
3 changes: 2 additions & 1 deletion r/R/dplyr-eval.R
@@ -95,8 +95,9 @@ arrow_mask <- function(.data, aggregation = FALSE) {
     }
   }
 
+  schema <- .data$.data$schema
   # Assign the schema to the expressions
-  map(.data$selected_columns, ~ (.$schema <- .data$.data$schema))
+  walk(.data$selected_columns, ~ (.$schema <- schema))
 
   # Add the column references and make the mask
   out <- new_data_mask(
3 changes: 2 additions & 1 deletion r/R/dplyr-select.R
@@ -45,7 +45,8 @@ relocate.arrow_dplyr_query <- function(.data, ..., .before = NULL, .after = NULL
   .data <- as_adq(.data)
 
   # Assign the schema to the expressions
-  map(.data$selected_columns, ~ (.$schema <- .data$.data$schema))
+  schema <- .data$.data$schema
+  walk(.data$selected_columns, ~ (.$schema <- schema))
 
   # Create a mask for evaluating expressions in tidyselect helpers
   mask <- new_environment(.cache$functions, parent = caller_env())
6 changes: 3 additions & 3 deletions r/R/schema.R
@@ -182,9 +182,9 @@ Schema$create <- function(...) {
   }
 
   if (all(map_lgl(.list, ~ inherits(., "Field")))) {
-    schema_(.list)
+    Schema__from_fields(.list)
   } else {
-    schema_(.fields(.list))
+    Schema__from_list(imap(.list, as_type))
   }
 }
 #' @include arrowExports.R
@@ -298,7 +298,7 @@ length.Schema <- function(x) x$num_fields
       call. = FALSE
     )
   }
-  schema_(fields)
+  Schema__from_fields(fields)
 }
 
 #' @export
17 changes: 13 additions & 4 deletions r/src/arrowExports.cpp

Some generated files are not rendered by default.

19 changes: 18 additions & 1 deletion r/src/schema.cpp
@@ -22,11 +22,28 @@
 #include <arrow/util/key_value_metadata.h>
 
 // [[arrow::export]]
-std::shared_ptr<arrow::Schema> schema_(
+std::shared_ptr<arrow::Schema> Schema__from_fields(
     const std::vector<std::shared_ptr<arrow::Field>>& fields) {
   return arrow::schema(fields);
 }
 
+// [[arrow::export]]
+std::shared_ptr<arrow::Schema> Schema__from_list(cpp11::list field_list) {
+  int n = field_list.size();
+
+  bool nullable = true;
+  cpp11::strings names(field_list.attr(R_NamesSymbol));
+
+  std::vector<std::shared_ptr<arrow::Field>> fields(n);
+
+  for (int i = 0; i < n; i++) {
+    fields[i] = arrow::field(
+        names[i], cpp11::as_cpp<std::shared_ptr<arrow::DataType>>(field_list[i]),
+        nullable);
+  }
+  return arrow::schema(fields);
+}
+
 // [[arrow::export]]
 std::string Schema__ToString(const std::shared_ptr<arrow::Schema>& s) {
   return s->ToString();
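
Note on Schema__from_list: the new function takes a named list of data types and builds every field in one pass on the C++ side, marking each field nullable. The sketch below mirrors only that loop so it can be compiled without Arrow or cpp11. The types DataType, Field, Schema and NamedType and the function schema_from_named_types are hypothetical stand-ins invented for illustration; they are not part of the Arrow or cpp11 APIs and not the code in this commit.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for arrow::DataType, arrow::Field and arrow::Schema.
struct DataType {
  std::string name;
};

struct Field {
  std::string name;
  std::shared_ptr<DataType> type;
  bool nullable;
};

struct Schema {
  std::vector<std::shared_ptr<Field>> fields;
};

// Stand-in for one element of the named R list: (name, type).
using NamedType = std::pair<std::string, std::shared_ptr<DataType>>;

// Build all fields in a single pass, as the loop in Schema__from_list does.
std::shared_ptr<Schema> schema_from_named_types(
    const std::vector<NamedType>& named_types) {
  const bool nullable = true;  // every field is created as nullable

  std::vector<std::shared_ptr<Field>> fields;
  fields.reserve(named_types.size());  // one allocation for the whole vector

  for (const auto& named_type : named_types) {
    fields.push_back(std::make_shared<Field>(
        Field{named_type.first, named_type.second, nullable}));
  }
  return std::make_shared<Schema>(Schema{std::move(fields)});
}
```

Calling schema_from_named_types with two (name, type) pairs yields a schema with two nullable fields in the same order as the input, which is the shape of result Schema$create relies on when it is given types rather than Field objects.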
