% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dataset-scan.R
\name{map_batches}
\alias{map_batches}
\title{Apply a function to a stream of RecordBatches}
\usage{
map_batches(X, FUN, ..., .schema = NULL, .lazy = TRUE, .data.frame = NULL)
}
\arguments{
\item{X}{A \code{Dataset} or \code{arrow_dplyr_query} object, as returned by the
\code{dplyr} methods on \code{Dataset}.}
\item{FUN}{A function or \code{purrr}-style lambda expression to apply to each
batch. It must return a \code{RecordBatch} or something coercible to one via
\code{as_record_batch()}.}
\item{...}{Additional arguments passed to \code{FUN}.}
\item{.schema}{An optional \code{\link[=schema]{schema()}}. If \code{NULL}, the schema will be inferred
from the first batch.}
\item{.lazy}{Use \code{TRUE} to evaluate \code{FUN} lazily as batches are read from
the result; use \code{FALSE} to evaluate \code{FUN} on all batches before returning
the reader.}
\item{.data.frame}{Deprecated argument; it is ignored.}
}
\value{
An \code{arrow_dplyr_query}.
}
\description{
As an alternative to calling \code{collect()} on a \code{Dataset} query, you can
use this function to access the stream of \code{RecordBatch}es in the \code{Dataset}.
This lets you perform more complex operations in R on chunks of data
without having to hold the entire \code{Dataset} in memory at once. You can include
\code{map_batches()} in a dplyr pipeline and apply additional dplyr verbs to the
stream of data in Arrow after it.
}
\details{
This function is experimental and not recommended for production use. It is
also single-threaded and runs in R rather than C++, so it will be slower than
core Arrow methods.
}
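\examples{
\dontrun{
# A minimal sketch of the batch-wise workflow described above, not an
# official example. Assumptions: the arrow and dplyr packages are installed,
# R >= 4.1 is used (for the native pipe), and the built-in mtcars data stands
# in for a real on-disk Dataset. The derived column name kpl is hypothetical.
library(arrow)
library(dplyr)

# Wrap a data frame as an in-memory Dataset so the example is self-contained.
ds <- InMemoryDataset$create(mtcars)

result <- ds |>
  filter(cyl > 4) |>
  map_batches(function(batch) {
    # Each batch arrives as a RecordBatch; convert it to do the work in R.
    df <- as.data.frame(batch)
    # Derive kilometres per litre from miles per US gallon.
    df$kpl <- df$mpg * 0.425144
    # FUN must return a RecordBatch or something coercible to one.
    as_record_batch(df)
  }) |>
  collect()
}
}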