From 5e8db3156c733a31e196683011db113e76ce6a32 Mon Sep 17 00:00:00 2001 From: Ben Harkins <60872452+benibus@users.noreply.github.com> Date: Wed, 12 Apr 2023 09:01:29 -0400 Subject: [PATCH] GH-34653: [CI][C++] Fix for arrow-dataset-file-json-test segfault on alpine-linux-cpp (#35047) ### What changes are included in this PR? Increases the block size used in the `ScanWithParallelDecoding` test to reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to roughly 60 while increasing the runtime of each job. This should still satisfy the purpose of test without going completely over the top. ### Are these changes tested? Yes, tested locally on the alpine docker image many times after successfully reproducing the original issue. ### Are there any user-facing changes? No ### Notes This doesn't solve the underlying cause (although the testing parameters were arguably far too unusual in the first place), however I do believe that I've identified the issue via a core dump. The problem starts [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369), where a `MappingGenerator` gets stacked on top of a generator that applies readahead. It seems that the underlying futures were completing very quickly, resulting in `AddCallback` being called recursively many, many times - starting [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_generator.h#L240). This leads to a stack overflow under specific circumstances. So, to fully guard against the problem, you'd probably want to change the logic of `MappingGenerator` to use `TryAddCallback` + an inner loop to avoid overflowing the stack. Not entirely sure if doing this would be worthwhile though. * Closes: #34653 Authored-by: benibus Signed-off-by: Weston Pace --- cpp/src/arrow/dataset/file_json_test.cc | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/cpp/src/arrow/dataset/file_json_test.cc b/cpp/src/arrow/dataset/file_json_test.cc index f3ea1941daea6..b34b5cddf1a85 100644 --- a/cpp/src/arrow/dataset/file_json_test.cc +++ b/cpp/src/arrow/dataset/file_json_test.cc @@ -233,7 +233,8 @@ class JsonScanMixin { // inter-fragment parallelism (when threading is enabled). JsonFragmentScanOptions json_options; json_options.read_options.use_threads = true; - json_options.read_options.block_size = 256; + // Should amount to roughly 60 blocks per fragment. + json_options.read_options.block_size = 1 << 13; this_->SetJsonOptions(std::move(json_options)); this_->TestScan(); }