From b77e8ae6ff7e987d878f90877421afdcd52dec88 Mon Sep 17 00:00:00 2001
From: Neal Richardson
Date: Thu, 10 Sep 2020 14:59:13 -0700
Subject: [PATCH] ARROW-9854: [R] Support reading/writing data to/from S3

- [x] read_parquet/feather/etc. from S3 (use FileSystem->OpenInputFile(path))
- [x] write_$FORMAT via FileSystem->OpenOutputStream(path)
- [x] write_dataset (done? at least via URI)
- [x] ~~for linux, an argument to install_arrow to help, assuming you've installed aws-sdk-cpp already (turn on ARROW_S3, AWSSDK_SOURCE=SYSTEM)~~ Turns out there are no official deb/rpm packages for aws-sdk-cpp, so there's no value in making this part easier; it would be more confusing than helpful.
- [x] set up a real test bucket and user for e2e testing (credentials available on request)
- [x] add a few tests that use S3, if credentials are set (which I'll set locally)
- [x] add vignette showing how to use S3 (via URI)
- [x] update docs, news

Out of the current scope:

- [ ] testing with minio on CI
- [ ] download dataset, i.e. copy files/directory recursively (needs ARROW-9867, ARROW-9868)
- [ ] friendlier methods for interacting with/viewing a filesystem (ls, mkdir, etc.) (ARROW-9870)
- [ ] direct construction of S3FileSystem object with S3Options (i.e. not only URI) (ARROW-9869)

Closes #8058 from nealrichardson/r-s3

Authored-by: Neal Richardson
Signed-off-by: Neal Richardson
---
 r/NEWS.md                   |  9 +++---
 r/R/csv.R                   |  2 +-
 r/R/dataset-factory.R       | 15 ++--------
 r/R/dataset.R               | 13 ++------
 r/R/feather.R               |  6 ++--
 r/R/filesystem.R            | 22 +++++++++++++-
 r/R/io.R                    | 23 +++++++++++++--
 r/R/ipc_stream.R            |  8 ++---
 r/R/parquet.R               |  5 ++--
 r/man/make_readable_file.Rd |  6 +++-
 r/man/read_delim_arrow.Rd   |  2 +-
 r/man/read_feather.Rd       |  4 +--
 r/man/read_ipc_stream.Rd    |  4 +--
 r/man/read_json_arrow.Rd    |  2 +-
 r/man/read_parquet.Rd       |  4 +--
 r/man/write_feather.Rd      |  2 +-
 r/man/write_ipc_stream.Rd   |  2 +-
 r/man/write_parquet.Rd      |  3 +-
 r/tests/testthat/test-s3.R  | 52 ++++++++++++++++++++++++++++++++
 r/vignettes/fs.Rmd          | 59 +++++++++++++++++++++++++++++++++++++
 20 files changed, 191 insertions(+), 52 deletions(-)
 create mode 100644 r/tests/testthat/test-s3.R
 create mode 100644 r/vignettes/fs.Rmd

diff --git a/r/NEWS.md b/r/NEWS.md
index b3c423201fbde..5b2eac82539c5 100644
--- a/r/NEWS.md
+++ b/r/NEWS.md
@@ -25,6 +25,11 @@
 * Datasets now have `head()`, `tail()`, and take (`[`) methods. `head()` is optimized but the others may not be performant.
 * `collect()` gains an `as_data_frame` argument, default `TRUE` but when `FALSE` allows you to evaluate the accumulated `select` and `filter` query but keep the result in Arrow, not an R `data.frame`
 
+## AWS S3 support
+
+* S3 support is now enabled in binary macOS and Windows (Rtools40 only, i.e. R >= 4.0) packages. To enable it on Linux, you will need to build and install `aws-sdk-cpp` from source, then set the environment variable `EXTRA_CMAKE_FLAGS="-DARROW_S3=ON -DAWSSDK_SOURCE=SYSTEM"` prior to building the R package (with bundled C++ build, not with Arrow system libraries) from source.
+* File readers and writers (`read_parquet()`, `write_feather()`, et al.) now accept an `s3://` URI as the source or destination file, as do `open_dataset()` and `write_dataset()`. See `vignette("fs", package = "arrow")` for details.
+
 ## Computation
 
 * Comparison (`==`, `>`, etc.) and boolean (`&`, `|`, `!`) operations, along with `is.na`, `%in%` and `match` (called `match_arrow()`), on Arrow Arrays and ChunkedArrays are now implemented in the C++ library.
@@ -32,10 +37,6 @@
 * `dplyr` filter expressions on Arrow Tables and RecordBatches are now evaluated in the C++ library, rather than by pulling data into R and evaluating. This yields significant performance improvements.
 * `dim()` (`nrow`) for dplyr queries on Table/RecordBatch is now supported
 
-## Packaging
-
-* S3 support is now enabled in binary macOS and Windows (Rtools40 only, i.e. R >= 4.0) packages
-
 ## Other improvements
 
 * `arrow` now depends on [`cpp11`](https://cpp11.r-lib.org/), which brings more robust UTF-8 handling and faster compilation

diff --git a/r/R/csv.R b/r/R/csv.R
index e145a907e28c1..62dfad7d52906 100644
--- a/r/R/csv.R
+++ b/r/R/csv.R
@@ -32,7 +32,7 @@
 #' `parse_options`, `convert_options`, or `read_options` arguments, or you can
 #' use [CsvTableReader] directly for lower-level access.
 #'
-#' @param file A character file name, `raw` vector, or an Arrow input stream.
+#' @param file A character file name or URI, `raw` vector, or an Arrow input stream.
 #' If a file name, a memory-mapped Arrow [InputStream] will be opened and
 #' closed when finished; compression will be detected from the file extension
 #' and handled automatically. If an input stream is provided, it will be left
diff --git a/r/R/dataset-factory.R b/r/R/dataset-factory.R
index 767f0b7c02a42..00039faed0fd2 100644
--- a/r/R/dataset-factory.R
+++ b/r/R/dataset-factory.R
@@ -48,17 +48,8 @@ DatasetFactory$create <- function(x,
     stop("'x' must be a string or a list of DatasetFactory", call. = FALSE)
   }
 
-  if (!inherits(filesystem, "FileSystem")) {
-    if (grepl("://", x)) {
-      fs_from_uri <- FileSystem$from_uri(x)
-      filesystem <- fs_from_uri$fs
-      x <- fs_from_uri$path
-    } else {
-      filesystem <- LocalFileSystem$create()
-      x <- clean_path_abs(x)
-    }
-  }
-  selector <- FileSelector$create(x, allow_not_found = FALSE, recursive = TRUE)
+  path_and_fs <- get_path_and_filesystem(x, filesystem)
+  selector <- FileSelector$create(path_and_fs$path, allow_not_found = FALSE, recursive = TRUE)
 
   if (is.character(format)) {
     format <- FileFormat$create(match.arg(format), ...)
@@ -74,7 +65,7 @@ DatasetFactory$create <- function(x,
       partitioning <- DirectoryPartitioningFactory$create(partitioning)
     }
   }
-  FileSystemDatasetFactory$create(filesystem, selector, format, partitioning)
+  FileSystemDatasetFactory$create(path_and_fs$fs, selector, format, partitioning)
 }
 
 #' Create a DatasetFactory
diff --git a/r/R/dataset.R b/r/R/dataset.R
index ec86dc56f083a..7661c33292e8c 100644
--- a/r/R/dataset.R
+++ b/r/R/dataset.R
@@ -164,17 +164,8 @@ Dataset <- R6Class("Dataset", inherit = ArrowObject,
     NewScan = function() unique_ptr(ScannerBuilder, dataset___Dataset__NewScan(self)),
     ToString = function() self$schema$ToString(),
     write = function(path, filesystem = NULL, schema = self$schema, format, partitioning, ...) {
-      if (!inherits(filesystem, "FileSystem")) {
-        if (grepl("://", path)) {
-          fs_from_uri <- FileSystem$from_uri(path)
-          filesystem <- fs_from_uri$fs
-          path <- fs_from_uri$path
-        } else {
-          filesystem <- LocalFileSystem$create()
-          path <- clean_path_abs(path)
-        }
-      }
-      dataset___Dataset__Write(self, schema, format, filesystem, path, partitioning)
+      path_and_fs <- get_path_and_filesystem(path, filesystem)
+      dataset___Dataset__Write(self, schema, format, path_and_fs$fs, path_and_fs$path, partitioning)
       invisible(self)
     }
   ),
diff --git a/r/R/feather.R b/r/R/feather.R
index 9b8dc8c512100..7026de4dbabfc 100644
--- a/r/R/feather.R
+++ b/r/R/feather.R
@@ -24,7 +24,7 @@
 #' and the version 2 specification, which is the Apache Arrow IPC file format.
 #'
 #' @param x `data.frame`, [RecordBatch], or [Table]
-#' @param sink A string file path or [OutputStream]
+#' @param sink A string file path, URI, or [OutputStream]
 #' @param version integer Feather file version. Version 2 is the current.
 #' Version 1 is the more limited legacy format.
 #' @param chunk_size For V2 files, the number of rows that each chunk of data
@@ -106,7 +106,7 @@ write_feather <- function(x,
   assert_is(x, "Table")
 
   if (is.string(sink)) {
-    sink <- FileOutputStream$create(sink)
+    sink <- make_output_stream(sink)
     on.exit(sink$close())
   }
   assert_is(sink, "OutputStream")
@@ -142,7 +142,7 @@ write_feather <- function(x,
 #' df <- read_feather(tf, col_select = starts_with("d"))
 #' }
 read_feather <- function(file, col_select = NULL, as_data_frame = TRUE, ...) {
-  if (!inherits(file, "InputStream")) {
+  if (!inherits(file, "RandomAccessFile")) {
     file <- make_readable_file(file)
     on.exit(file$close())
   }
diff --git a/r/R/filesystem.R b/r/R/filesystem.R
index f0e123ac4cd0d..4cde03eb6b42a 100644
--- a/r/R/filesystem.R
+++ b/r/R/filesystem.R
@@ -228,7 +228,7 @@ FileSystem <- R6Class("FileSystem", inherit = ArrowObject,
       shared_ptr(InputStream, fs___FileSystem__OpenInputStream(self, clean_path_rel(path)))
     },
     OpenInputFile = function(path) {
-      shared_ptr(InputStream, fs___FileSystem__OpenInputFile(self, clean_path_rel(path)))
+      shared_ptr(RandomAccessFile, fs___FileSystem__OpenInputFile(self, clean_path_rel(path)))
     },
     OpenOutputStream = function(path) {
       shared_ptr(OutputStream, fs___FileSystem__OpenOutputStream(self, clean_path_rel(path)))
@@ -242,11 +242,31 @@ FileSystem <- R6Class("FileSystem", inherit = ArrowObject,
   )
 )
 FileSystem$from_uri <- function(uri) {
+  assert_that(is.string(uri))
   out <- fs___FileSystemFromUri(uri)
   out$fs <- shared_ptr(FileSystem, out$fs)$..dispatch()
   out
 }
 
+get_path_and_filesystem <- function(x, filesystem = NULL) {
+  # Wrapper around FileSystem$from_uri that handles local paths
+  # and an optional explicit filesystem
+  assert_that(is.string(x))
+  if (is_url(x)) {
+    if (!is.null(filesystem)) {
+      # Stop? Can't have URL (which yields a fs) and another fs
+    }
+    FileSystem$from_uri(x)
+  } else {
+    list(
+      fs = filesystem %||% LocalFileSystem$create(),
+      path = clean_path_abs(x)
+    )
+  }
+}
+
+is_url <- function(x) grepl("://", x)
+
 #' @usage NULL
 #' @format NULL
 #' @rdname FileSystem
diff --git a/r/R/io.R b/r/R/io.R
index c14c5ce1abcd6..3b607a4e2b74f 100644
--- a/r/R/io.R
+++ b/r/R/io.R
@@ -224,15 +224,25 @@ mmap_open <- function(path, mode = c("read", "write", "readwrite")) {
 #' with this compression codec, either a [Codec] or the string name of one.
 #' If `NULL` (default) and `file` is a string file name, the function will try
 #' to infer compression from the file extension.
+#' @param filesystem If not `NULL`, `file` will be opened via the
+#' `filesystem$OpenInputFile()` filesystem method, rather than the `io` module's
+#' `MemoryMappedFile` or `ReadableFile` constructors.
 #' @return An `InputStream` or a subclass of one.
 #' @keywords internal
-make_readable_file <- function(file, mmap = TRUE, compression = NULL) {
+make_readable_file <- function(file, mmap = TRUE, compression = NULL, filesystem = NULL) {
   if (is.string(file)) {
+    if (is_url(file)) {
+      fs_and_path <- FileSystem$from_uri(file)
+      filesystem <- fs_and_path$fs
+      file <- fs_and_path$path
+    }
     if (is.null(compression)) {
       # Infer compression from the file path
       compression <- detect_compression(file)
     }
-    if (isTRUE(mmap)) {
+    if (!is.null(filesystem)) {
+      file <- filesystem$OpenInputFile(file)
+    } else if (isTRUE(mmap)) {
       file <- mmap_open(file)
     } else {
       file <- ReadableFile$create(file)
@@ -247,6 +257,15 @@ make_readable_file <- function(file, mmap = TRUE, compression = NULL) {
   file
 }
 
+make_output_stream <- function(x) {
+  if (is_url(x)) {
+    fs_and_path <- FileSystem$from_uri(x)
+    fs_and_path$fs$OpenOutputStream(fs_and_path$path)
+  } else {
+    FileOutputStream$create(x)
+  }
+}
+
 detect_compression <- function(path) {
   assert_that(is.string(path))
   switch(tools::file_ext(path),
diff --git a/r/R/ipc_stream.R b/r/R/ipc_stream.R
index 0c728b26b5341..618ace52f49e2 100644
--- a/r/R/ipc_stream.R
+++ b/r/R/ipc_stream.R
@@ -41,7 +41,7 @@ write_ipc_stream <- function(x, sink, ...) {
     x <- Table$create(x)
   }
   if (is.string(sink)) {
-    sink <- FileOutputStream$create(sink)
+    sink <- make_output_stream(sink)
     on.exit(sink$close())
   }
   assert_is(sink, "OutputStream")
@@ -82,10 +82,10 @@ write_to_raw <- function(x, format = c("stream", "file")) {
 #' `read_arrow()`, a wrapper around `read_ipc_stream()` and `read_feather()`,
 #' is deprecated. You should explicitly choose
 #' the function that will read the desired IPC format (stream or file) since
-#' a file or `InputStream` may contain either. 
+#' a file or `InputStream` may contain either.
 #'
-#' @param file A character file name, `raw` vector, or an Arrow input stream.
-#' If a file name, a memory-mapped Arrow [InputStream] will be opened and
+#' @param file A character file name or URI, `raw` vector, or an Arrow input stream.
+#' If a file name or URI, an Arrow [InputStream] will be opened and
 #' closed when finished. If an input stream is provided, it will be left
 #' open.
 #' @param as_data_frame Should the function return a `data.frame` (default) or
diff --git a/r/R/parquet.R b/r/R/parquet.R
index caf93f5284b92..0b6357316eed7 100644
--- a/r/R/parquet.R
+++ b/r/R/parquet.R
@@ -59,7 +59,8 @@ read_parquet <- function(file,
 #' This function enables you to write Parquet files from R.
 #'
 #' @param x An [arrow::Table][Table], or an object convertible to it.
-#' @param sink an [arrow::io::OutputStream][OutputStream] or a string which is interpreted as a file path
+#' @param sink an [arrow::io::OutputStream][OutputStream] or a string
+#' interpreted as a file path or URI
 #' @param chunk_size chunk size in number of rows. If NULL, the total number of rows is used.
 #' @param version parquet version, "1.0" or "2.0". Default "1.0". Numeric values
 #' are coerced to character.
@@ -129,7 +130,7 @@ write_parquet <- function(x,
   }
 
   if (is.string(sink)) {
-    sink <- FileOutputStream$create(sink)
+    sink <- make_output_stream(sink)
     on.exit(sink$close())
   } else if (!inherits(sink, "OutputStream")) {
     abort("sink must be a file path or an OutputStream")
diff --git a/r/man/make_readable_file.Rd b/r/man/make_readable_file.Rd
index 11d302c0b04d1..fe2e29826120d 100644
--- a/r/man/make_readable_file.Rd
+++ b/r/man/make_readable_file.Rd
@@ -4,7 +4,7 @@
 \alias{make_readable_file}
 \title{Handle a range of possible input sources}
 \usage{
-make_readable_file(file, mmap = TRUE, compression = NULL)
+make_readable_file(file, mmap = TRUE, compression = NULL, filesystem = NULL)
 }
 \arguments{
 \item{file}{A character file name, \code{raw} vector, or an Arrow input stream}
@@ -15,6 +15,10 @@ make_readable_file(file, mmap = TRUE, compression = NULL)
 with this compression codec, either a \link{Codec} or the string name of one.
 If \code{NULL} (default) and \code{file} is a string file name, the function will try
 to infer compression from the file extension.}
+
+\item{filesystem}{If not \code{NULL}, \code{file} will be opened via the
+\code{filesystem$OpenInputFile()} filesystem method, rather than the \code{io} module's
+\code{MemoryMappedFile} or \code{ReadableFile} constructors.}
 }
 \value{
 An \code{InputStream} or a subclass of one.
diff --git a/r/man/read_delim_arrow.Rd b/r/man/read_delim_arrow.Rd
index 124abdcb91281..abc2d4b058199 100644
--- a/r/man/read_delim_arrow.Rd
+++ b/r/man/read_delim_arrow.Rd
@@ -59,7 +59,7 @@ read_tsv_arrow(
 )
 }
 \arguments{
-\item{file}{A character file name, \code{raw} vector, or an Arrow input stream.
+\item{file}{A character file name or URI, \code{raw} vector, or an Arrow input stream.
 If a file name, a memory-mapped Arrow \link{InputStream} will be opened and
 closed when finished; compression will be detected from the file extension
 and handled automatically. If an input stream is provided, it will be left
diff --git a/r/man/read_feather.Rd b/r/man/read_feather.Rd
index f507edb456ed9..b84d07f61768b 100644
--- a/r/man/read_feather.Rd
+++ b/r/man/read_feather.Rd
@@ -7,8 +7,8 @@
 read_feather(file, col_select = NULL, as_data_frame = TRUE, ...)
 }
 \arguments{
-\item{file}{A character file name, \code{raw} vector, or an Arrow input stream.
-If a file name, a memory-mapped Arrow \link{InputStream} will be opened and
+\item{file}{A character file name or URI, \code{raw} vector, or an Arrow input stream.
+If a file name or URI, an Arrow \link{InputStream} will be opened and
 closed when finished. If an input stream is provided, it will be left
 open.}
 
diff --git a/r/man/read_ipc_stream.Rd b/r/man/read_ipc_stream.Rd
index 1cc969b922e80..01b64350a8c71 100644
--- a/r/man/read_ipc_stream.Rd
+++ b/r/man/read_ipc_stream.Rd
@@ -10,8 +10,8 @@ read_arrow(file, ...)
 read_ipc_stream(file, as_data_frame = TRUE, ...)
 }
 \arguments{
-\item{file}{A character file name, \code{raw} vector, or an Arrow input stream.
-If a file name, a memory-mapped Arrow \link{InputStream} will be opened and
+\item{file}{A character file name or URI, \code{raw} vector, or an Arrow input stream.
+If a file name or URI, an Arrow \link{InputStream} will be opened and
 closed when finished. If an input stream is provided, it will be left
 open.}
 
diff --git a/r/man/read_json_arrow.Rd b/r/man/read_json_arrow.Rd
index 37fff64daf097..8501b19c392d1 100644
--- a/r/man/read_json_arrow.Rd
+++ b/r/man/read_json_arrow.Rd
@@ -7,7 +7,7 @@
 read_json_arrow(file, col_select = NULL, as_data_frame = TRUE, ...)
 }
 \arguments{
-\item{file}{A character file name, \code{raw} vector, or an Arrow input stream.
+\item{file}{A character file name or URI, \code{raw} vector, or an Arrow input stream.
 If a file name, a memory-mapped Arrow \link{InputStream} will be opened and
 closed when finished; compression will be detected from the file extension
 and handled automatically. If an input stream is provided, it will be left
diff --git a/r/man/read_parquet.Rd b/r/man/read_parquet.Rd
index 6bd7335c40d2e..f4a3897643c2d 100644
--- a/r/man/read_parquet.Rd
+++ b/r/man/read_parquet.Rd
@@ -13,8 +13,8 @@ read_parquet(
 )
 }
 \arguments{
-\item{file}{A character file name, \code{raw} vector, or an Arrow input stream.
-If a file name, a memory-mapped Arrow \link{InputStream} will be opened and
+\item{file}{A character file name or URI, \code{raw} vector, or an Arrow input stream.
+If a file name or URI, an Arrow \link{InputStream} will be opened and
 closed when finished. If an input stream is provided, it will be left
 open.}
 
diff --git a/r/man/write_feather.Rd b/r/man/write_feather.Rd
index e9639480a5d02..e079aeb893434 100644
--- a/r/man/write_feather.Rd
+++ b/r/man/write_feather.Rd
@@ -16,7 +16,7 @@ write_feather(
 \arguments{
 \item{x}{\code{data.frame}, \link{RecordBatch}, or \link{Table}}
 
-\item{sink}{A string file path or \link{OutputStream}}
+\item{sink}{A string file path, URI, or \link{OutputStream}}
 
 \item{version}{integer Feather file version. Version 2 is the current.
 Version 1 is the more limited legacy format.}
diff --git a/r/man/write_ipc_stream.Rd b/r/man/write_ipc_stream.Rd
index 2bf4fdd2430a5..8274eddb3b1e1 100644
--- a/r/man/write_ipc_stream.Rd
+++ b/r/man/write_ipc_stream.Rd
@@ -12,7 +12,7 @@ write_ipc_stream(x, sink, ...)
 \arguments{
 \item{x}{\code{data.frame}, \link{RecordBatch}, or \link{Table}}
 
-\item{sink}{A string file path or \link{OutputStream}}
+\item{sink}{A string file path, URI, or \link{OutputStream}}
 
 \item{...}{extra parameters passed to \code{write_feather()}.}
 }
diff --git a/r/man/write_parquet.Rd b/r/man/write_parquet.Rd
index 0253a2fb5a36e..f532ce06c4c5d 100644
--- a/r/man/write_parquet.Rd
+++ b/r/man/write_parquet.Rd
@@ -22,7 +22,8 @@ write_parquet(
 \arguments{
 \item{x}{An \link[=Table]{arrow::Table}, or an object convertible to it.}
 
-\item{sink}{an \link[=OutputStream]{arrow::io::OutputStream} or a string which is interpreted as a file path}
+\item{sink}{an \link[=OutputStream]{arrow::io::OutputStream} or a string
+interpreted as a file path or URI}
 
 \item{chunk_size}{chunk size in number of rows. If NULL, the total number of rows is used.}
 
diff --git a/r/tests/testthat/test-s3.R b/r/tests/testthat/test-s3.R
new file mode 100644
index 0000000000000..9dfadfdfb5878
--- /dev/null
+++ b/r/tests/testthat/test-s3.R
@@ -0,0 +1,52 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+context("S3 integration tests")
+
+run_these <- tryCatch({
+  if (arrow_with_s3() &&
+      identical(tolower(Sys.getenv("ARROW_R_DEV")), "true") &&
+      !identical(Sys.getenv("AWS_ACCESS_KEY_ID"), "") &&
+      !identical(Sys.getenv("AWS_SECRET_ACCESS_KEY"), "")) {
+    # See if we have access to the test bucket
+    bucket <- FileSystem$from_uri("s3://ursa-labs-r-test?region=us-west-2")
+    bucket$fs$GetFileInfo(bucket$path)
+    TRUE
+  } else {
+    FALSE
+  }
+}, error = function(e) FALSE)
+
+bucket_uri <- function(..., bucket = "s3://ursa-labs-r-test/%s?region=us-west-2") {
+  segments <- paste(..., sep = "/")
+  sprintf(bucket, segments)
+}
+
+if (run_these) {
+  now <- as.numeric(Sys.time())
+  on.exit(bucket$fs$DeleteDir(paste0("ursa-labs-r-test/", now)))
+
+  test_that("read/write Feather on S3", {
+    write_feather(example_data, bucket_uri(now, "test.feather"))
+    expect_identical(read_feather(bucket_uri(now, "test.feather")), example_data)
+  })
+
+  test_that("read/write Parquet on S3", {
+    write_parquet(example_data, bucket_uri(now, "test.parquet"))
+    expect_identical(read_parquet(bucket_uri(now, "test.parquet")), example_data)
+  })
+}
diff --git a/r/vignettes/fs.Rmd b/r/vignettes/fs.Rmd
new file mode 100644
index 0000000000000..03730bc1269a8
--- /dev/null
+++ b/r/vignettes/fs.Rmd
@@ -0,0 +1,59 @@
+---
+title: "Working with Cloud Storage (S3)"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+The Arrow C++ library includes a generic filesystem interface and specific
+implementations for some cloud storage systems. This setup allows various
+parts of the project to read and write data with different storage
+backends. In the `arrow` R package, support has been enabled for AWS S3 on
+macOS and Windows. This vignette provides an overview of working with S3 data
+using Arrow.
+
+> Note that S3 support is not enabled by default on Linux due to packaging complications. To enable it, you will need to build and install [aws-sdk-cpp](https://aws.amazon.com/sdk-for-cpp/) from source, then set the environment variable `EXTRA_CMAKE_FLAGS="-DARROW_S3=ON -DAWSSDK_SOURCE=SYSTEM"` prior to building the R package (with bundled C++ build, not with Arrow system libraries) from source.
+
+## URIs
+
+File readers and writers (`read_parquet()`, `write_feather()`, et al.)
+now accept an S3 URI as the source or destination file,
+as do `open_dataset()` and `write_dataset()`.
+An S3 URI looks like:
+
+```
+s3://[id:secret@]bucket/path[?region=]
+```
+
+For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
+
+```
+s3://ursa-labs-taxi-data/2019/06/data.parquet?region=us-east-2
+```
+
+`region` defaults to `us-east-1` and can be omitted if the bucket is in that region.
+
+Given this URI, we can pass it to `read_parquet()` just as if it were a local file path:
+
+```r
+df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet?region=us-east-2")
+```
+
+Note that this will be slower to read than if the file were local,
+though if you're running on a machine in the same AWS region as the file in S3,
+the cost of reading the data over the network should be much lower.
+
+## Authentication
+
+To access private S3 buckets, you need two secret parameters:
+an `AWS_ACCESS_KEY_ID`, which is like a user ID,
+and an `AWS_SECRET_ACCESS_KEY`, which is like a token.
+There are a few options for passing these credentials:
+
+1. Include them in the URI, like `s3://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@bucket-name/path/to/file`. Be sure to [URL-encode](https://en.wikipedia.org/wiki/Percent-encoding) your secrets if they contain special characters like "/".
+
+2. Set them as environment variables named `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
+
+3. Define them in a `~/.aws/credentials` file, according to the [AWS documentation](https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/credentials.html).
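+
+As a minimal sketch of option 2 (the code below is illustrative and is not evaluated
+when building this vignette): you can set the environment variables for the current
+R session with `Sys.setenv()` before reading or writing. The key values and the
+`my-private-bucket` URI are placeholders, not real credentials or a real bucket.
+
+```r
+# Placeholder credentials: substitute your own values before running
+Sys.setenv(
+  AWS_ACCESS_KEY_ID = "my-access-key-id",
+  AWS_SECRET_ACCESS_KEY = "my-secret-access-key"
+)
+
+# With the environment variables set, an s3:// URI needs no inline secrets
+df <- read_parquet("s3://my-private-bucket/path/to/data.parquet?region=us-east-2")
+```
+
+Compared to option 1, this keeps secrets out of URIs that you might print, log, or share.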