Commit
apacheGH-34653: [CI][C++] Fix for arrow-dataset-file-json-test segfault on alpine-linux-cpp (apache#35047)

### What changes are included in this PR?

Increases the block size used in the `ScanWithParallelDecoding` test to reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to roughly 60, while increasing the runtime of each job. This should still satisfy the purpose of the test without going completely over the top.

### Are these changes tested?

Yes, tested locally on the alpine docker image many times after successfully reproducing the original issue.

### Are there any user-facing changes?

No

### Notes

This doesn't solve the underlying cause (although the testing parameters were arguably far too unusual in the first place). However, I do believe that I've identified the issue via a core dump.

The problem starts [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369), where a `MappingGenerator` gets stacked on top of a generator that applies readahead. It seems that the underlying futures were completing very quickly, so each callback ran immediately and `AddCallback` was called recursively many, many times, starting [here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_generator.h#L240). This leads to a stack overflow under specific circumstances.

So, to fully guard against the problem, you'd probably want to change the logic of `MappingGenerator` to use `TryAddCallback` plus an inner loop to avoid overflowing the stack. Not entirely sure if doing this would be worthwhile though.

* Closes: apache#34653

Authored-by: benibus <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
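
For illustration only, the `TryAddCallback` plus inner loop idea from the notes can be sketched with a toy model. Nothing below is Arrow code: `ToyFuture`, `DepthStats`, `PumpRecursive` and `PumpLoop` are hypothetical names, and the toy assumes every future is already finished, which is the situation the notes describe. The toy `TryAddCallback` is assumed to behave like its namesake in that it declines to register a callback on a finished future and reports that to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Toy stand-in for a future (hypothetical, not arrow::Future).
// Only the "already finished" behaviour is modelled.
struct ToyFuture {
  int value;
  bool finished;

  // Runs the callback inline when the future is already finished.
  void AddCallback(const std::function<void(int)>& cb) const {
    if (finished) cb(value);
  }

  // Returns false, without running anything, when already finished,
  // so the caller can consume the value in its own stack frame.
  bool TryAddCallback(const std::function<void(int)>& cb) const {
    if (finished) return false;
    (void)cb;  // a real implementation would store the callback here
    return true;
  }
};

// Tracks how deeply item handlers are nested on the call stack.
struct DepthStats {
  std::size_t current = 0;
  std::size_t max = 0;
  long long sum = 0;
};

// Recursive style: each callback pulls the next item from inside the
// previous callback, so nesting grows with the number of finished items.
void PumpRecursive(int next, int count, DepthStats& stats) {
  if (next == count) return;
  ToyFuture fut{next, true};
  fut.AddCallback([&stats, next, count](int v) {
    ++stats.current;
    if (stats.current > stats.max) stats.max = stats.current;
    stats.sum += v;
    PumpRecursive(next + 1, count, stats);
    --stats.current;
  });
}

// Loop style: finished items are handled in the loop body, so the
// nesting depth stays constant however many items are ready.
void PumpLoop(int count, DepthStats& stats) {
  for (int next = 0; next < count; ++next) {
    ToyFuture fut{next, true};
    auto resume = [](int) { /* would restart the loop later */ };
    if (fut.TryAddCallback(resume)) {
      return;  // pending: never taken in this toy model
    }
    ++stats.current;
    if (stats.current > stats.max) stats.max = stats.current;
    stats.sum += fut.value;
    --stats.current;
  }
}
```

Calling `PumpRecursive(0, n, stats)` and `PumpLoop(n, stats)` with the same `n` processes the same items, but only the recursive variant's `stats.max` grows with `n`. That growth is the stack-depth hazard described in the notes; whether the real `MappingGenerator` can be restructured this way is left open, as the notes say.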