Skip to content

Commit

Permalink
apacheGH-34653: [CI][C++] Fix for arrow-dataset-file-json-test segfau…
Browse files Browse the repository at this point in the history
…lt on alpine-linux-cpp (apache#35047)

### What changes are included in this PR?

Increases the block size used in the `ScanWithParallelDecoding` test to reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to roughly 60 while increasing the runtime of each job. This should still satisfy the purpose of test without going completely over the top.

### Are these changes tested?

Yes, tested locally on the alpine docker image many times after successfully reproducing the original issue.

### Are there any user-facing changes?

No

### Notes

This doesn't solve the underlying cause (although the testing parameters were arguably far too unusual in the first place), however I do believe that I've identified the issue via a core dump.

The problem starts [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369), where a `MappingGenerator` gets stacked on top of a generator that applies readahead. It seems that the underlying futures were completing very quickly, resulting in `AddCallback` being called recursively many, many times - starting [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_generator.h#L240). This leads to a stack overflow under specific circumstances.

So, to fully guard against the problem, you'd probably want to change the logic of `MappingGenerator` to use `TryAddCallback` + an inner loop to avoid overflowing the stack. Not entirely sure if doing this would be worthwhile though.
* Closes: apache#34653

Authored-by: benibus <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
  • Loading branch information
benibus authored Apr 12, 2023
1 parent 6120345 commit 5e8db31
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion cpp/src/arrow/dataset/file_json_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -233,7 +233,8 @@ class JsonScanMixin {
// inter-fragment parallelism (when threading is enabled).
JsonFragmentScanOptions json_options;
json_options.read_options.use_threads = true;
json_options.read_options.block_size = 256;
// Should amount to roughly 60 blocks per fragment.
json_options.read_options.block_size = 1 << 13;
this_->SetJsonOptions(std::move(json_options));
this_->TestScan();
}
Expand Down

0 comments on commit 5e8db31

Please sign in to comment.