ARROW-16204: [C++][Dataset] Default error existing_data_behaviour for writing dataset ignores a single file

Closes apache#12898 from jorisvandenbossche/ARROW-16204

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
jorisvandenbossche committed Apr 22, 2022
1 parent 16638a4 commit 912e2bb
Showing 3 changed files with 15 additions and 3 deletions.
2 changes: 1 addition & 1 deletion cpp/src/arrow/dataset/dataset_writer.cc
@@ -423,7 +423,7 @@ Status EnsureDestinationValid(const FileSystemDatasetWriteOptions& options) {
       // If the path doesn't exist then continue
       return Status::OK();
     }
-    if (maybe_files->size() > 1) {
+    if (maybe_files->size() > 0) {
       return Status::Invalid(
           "Could not write to ", options.base_dir,
           " as the directory is not empty and existing_data_behavior is to error");
12 changes: 12 additions & 0 deletions cpp/src/arrow/dataset/dataset_writer_test.cc
@@ -453,6 +453,18 @@ TEST_F(DatasetWriterTestFixture, ErrOnExistingData) {
   ASSERT_RAISES(Invalid, DatasetWriter::Make(write_options_));
   AssertEmptyFiles(
       {"testdir/chunk-0.arrow", "testdir/chunk-5.arrow", "testdir/blah.txt"});
+
+  // only a single file in the target directory
+  fs::TimePoint mock_now2 = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(
+      std::shared_ptr<fs::FileSystem> fs2,
+      MockFileSystem::Make(
+          mock_now2, {::arrow::fs::Dir("testdir"), fs::File("testdir/part-0.arrow")}));
+  filesystem_ = std::dynamic_pointer_cast<MockFileSystem>(fs2);
+  write_options_.filesystem = filesystem_;
+  write_options_.base_dir = "testdir";
+  ASSERT_RAISES(Invalid, DatasetWriter::Make(write_options_));
+  AssertEmptyFiles({"testdir/part-0.arrow"});
 }

 }  // namespace internal
4 changes: 2 additions & 2 deletions docs/source/python/dataset.rst
@@ -570,7 +570,7 @@ dataset to be partitioned. For example:
     part = ds.partitioning(
         pa.schema([("c", pa.int16())]), flavor="hive"
     )
-    ds.write_dataset(table, "sample_dataset", format="parquet", partitioning=part)
+    ds.write_dataset(table, "partitioned_dataset", format="parquet", partitioning=part)

 This will create two files. Half our data will be in the dataset_root/c=1 directory and
 the other half will be in the dataset_root/c=2 directory.
@@ -701,7 +701,7 @@ into memory:
     new_part = ds.partitioning(
         pa.schema([("c", pa.int16())]), flavor=None
     )
-    input_dataset = ds.dataset("sample_dataset", partitioning=old_part)
+    input_dataset = ds.dataset("partitioned_dataset", partitioning=old_part)

     # A scanner can act as an iterator of record batches but you could also receive
     # data from the network (e.g. via flight), from your own scanning, or from any
     # other method that yields record batches. In addition, you can pass a dataset