GH-34653: [CI][C++] Fix for arrow-dataset-file-json-test segfault on … · khwilson/arrow@5e8db31

Commit

apacheGH-34653: [CI][C++] Fix for arrow-dataset-file-json-test segfau…

…lt on alpine-linux-cpp (apache#35047)

### What changes are included in this PR?

Increases the block size used in the `ScanWithParallelDecoding` test to reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to roughly 60 while increasing the runtime of each job. This should still satisfy the purpose of test without going completely over the top.

### Are these changes tested?

Yes, tested locally on the alpine docker image many times after successfully reproducing the original issue.

### Are there any user-facing changes?

No

### Notes

This doesn't solve the underlying cause (although the testing parameters were arguably far too unusual in the first place), however I do believe that I've identified the issue via a core dump.

The problem starts [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369), where a `MappingGenerator` gets stacked on top of a generator that applies readahead. It seems that the underlying futures were completing very quickly, resulting in `AddCallback` being called recursively many, many times - starting [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_generator.h#L240). This leads to a stack overflow under specific circumstances.

So, to fully guard against the problem, you'd probably want to change the logic of `MappingGenerator` to use `TryAddCallback` + an inner loop to avoid overflowing the stack. Not entirely sure if doing this would be worthwhile though.
* Closes: apache#34653

Authored-by: benibus <[email protected]>
Signed-off-by: Weston Pace <[email protected]>

Loading branch information

benibus authored Apr 12, 2023

1 parent 6120345 commit 5e8db31

cpp/src/arrow/dataset/file_json_test.cc

-Original file line number
+Diff line change
@@ Expand Up / @@ -233,7 +233,8 @@ class JsonScanMixin { @@
         // inter-fragment parallelism (when threading is enabled).
         JsonFragmentScanOptions json_options;
         json_options.read_options.use_threads = true;
-        json_options.read_options.block_size = 256;
+        // Should amount to roughly 60 blocks per fragment.
+        json_options.read_options.block_size = 1 << 13;
         this_->SetJsonOptions(std::move(json_options));
         this_->TestScan();
       }
@@ Expand Down @@

0 comments on commit `5e8db31`

Please sign in to comment.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Commit

There are no files selected for viewing

0 comments on commit `5e8db31`

Commit

There are no files selected for viewing

0 comments on commit 5e8db31

0 comments on commit `5e8db31`