- Explore a multi-file dataset with `open_dataset()` and then use `dplyr` methods to `select()`, `filter()`, etc., and work will be done where possible in Arrow memory. When necessary, data is pulled into R for further computation. `dplyr` methods are conditionally loaded if you have `dplyr` available; it is not a hard dependency.
- Tables and RecordBatches also have `dplyr` methods.
- For exploration without `dplyr`, `[` methods for Tables, RecordBatches, Arrays, and ChunkedArrays now support natural row extraction operations. These use the C++ `Filter`, `Slice`, and `Take` methods for efficient access, depending on the type of selection vector.
- An experimental, lazily evaluated `array_expression` class has also been added, enabling among other things the ability to filter a Table with some function of Arrays, such as `arrow_table[arrow_table$var1 > 5, ]` without having to pull everything into R first.
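The dataset and row-extraction features above can be sketched as follows; the directory name and column names are hypothetical stand-ins for your own multi-file dataset, and `dplyr` is assumed to be installed:

```r
library(arrow)
library(dplyr)

# Open a directory of files as a single dataset (hypothetical path)
ds <- open_dataset("nyc-taxi/")

# select()/filter() run in Arrow where possible; collect() pulls into R
result <- ds %>%
  select(passenger_count, fare_amount) %>%
  filter(passenger_count > 1) %>%
  collect()

# Row extraction without dplyr, using the `[` methods and an
# array_expression built from a comparison on a column:
tab <- Table$create(data.frame(var1 = 1:10))
tab[tab$var1 > 5, ]
```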
- `write_parquet()` now supports compression
- `codec_is_available()` returns `TRUE` or `FALSE` depending on whether the Arrow C++ library was built with support for a given compression library (e.g. gzip, lz4, snappy)
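A minimal sketch of checking codec availability before writing compressed Parquet; the `compression` argument name is an assumption about how `write_parquet()` exposes the option:

```r
library(arrow)

# Returns TRUE or FALSE depending on how the C++ library was built
codec_is_available("gzip")
codec_is_available("snappy")

# Only attempt compressed output if the codec is actually available
if (codec_is_available("snappy")) {
  write_parquet(data.frame(x = 1:5), "example.parquet", compression = "snappy")
}
```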
- This patch release includes bugfixes in the C++ library around dictionary types and Parquet reading.
- The R6 classes that wrap the C++ classes are now documented and exported and have been renamed to be more R-friendly. Users of the high-level R interface in this package are not affected. Those who want to interact with the Arrow C++ API more directly should work with these objects and methods. As part of this change, many functions that instantiated these R6 objects have been removed in favor of `Class$create()` methods. Notably, `arrow::array()` and `arrow::table()` have been removed in favor of `Array$create()` and `Table$create()`, eliminating the package startup message about masking `base` functions. For more information, see the new `vignette("arrow")`.
- Due to a subtle change in the Arrow message format, data written by the 0.15 version libraries may not be readable by older versions. If you need to send data to a process that uses an older version of Arrow (for example, an Apache Spark server that hasn't yet updated to Arrow 0.15), you can set the environment variable `ARROW_PRE_0_15_IPC_FORMAT=1`.
- The `as_tibble` argument in the `read_*()` functions has been renamed to `as_data_frame` (ARROW-6337, @jameslamb)
- The `arrow::Column` class has been removed, as it was removed from the C++ library
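The `Class$create()` convention and the IPC compatibility switch described above can be sketched as:

```r
library(arrow)

# Old constructor functions are gone; use the R6 classes' create() methods
a <- Array$create(c(1, 2, 3))
t <- Table$create(data.frame(x = 1:3, y = letters[1:3]))

# Opt into the pre-0.15 IPC format before writing, for consumers on
# older Arrow versions (e.g. a Spark server not yet on Arrow 0.15)
Sys.setenv(ARROW_PRE_0_15_IPC_FORMAT = 1)
```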
- `Table` and `RecordBatch` objects have S3 methods that enable you to work with them more like `data.frame`s. Extract columns, subset, and so on. See `?Table` and `?RecordBatch` for examples.
- Initial implementation of bindings for the C++ File System API. (ARROW-6348)
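A brief sketch of the `data.frame`-like S3 methods on a `Table`; the exact set of methods is documented in `?Table`:

```r
library(arrow)

tab <- Table$create(data.frame(x = 1:5, y = letters[1:5]))

dim(tab)            # number of rows and columns
names(tab)          # column names
tab$x               # extract a column (a ChunkedArray)
tab[1:2, "y"]       # subset rows and columns
as.data.frame(tab)  # convert back to a plain data.frame
```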
- Compressed streams are now supported on Windows (ARROW-6360), and you can also specify a compression level (ARROW-6533)
- Parquet file reading is much, much faster, thanks to improvements in the Arrow C++ library.
- `read_csv_arrow()` supports more parsing options, including `col_names`, `na`, `quoted_na`, and `skip`
- `read_parquet()` and `read_feather()` can ingest data from a `raw` vector (ARROW-6278)
- File readers now properly handle paths that need expanding, such as `~/file.parquet` (ARROW-6323)
- Improved support for creating types in a schema: the types' printed names (e.g. "double") are guaranteed to be valid to use in instantiating a schema (e.g. `double()`), and time types can be created with human-friendly resolution strings ("ms", "s", etc.). (ARROW-6338, ARROW-6364)
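A sketch of the parsing options and schema-creation helpers described above; the file name, column names, and `na` strings are hypothetical:

```r
library(arrow)

# The option names mirror readr's conventions
df <- read_csv_arrow(
  "data.csv",
  col_names = c("id", "value"),  # supply names instead of reading a header
  skip = 1,                      # skip the original header row
  na = c("", "NA", "-999"),      # strings to read as missing
  quoted_na = TRUE               # treat quoted missing strings as missing
)

# Type names match their printed forms, and time types accept
# human-friendly resolution strings such as "ms"
sch <- schema(id = int32(), value = double(), ts = timestamp("ms"))
```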
Initial CRAN release of the `arrow` package. Key features include:
- Read and write support for various file formats, including Parquet, Feather/Arrow, CSV, and JSON.
- API bindings to the C++ library for Arrow data types and objects, as well as mapping between Arrow types and R data types.
- Tools for helping with C++ library configuration and installation.