[SPARK-50661][CONNECT][SS] Fix Spark Connect Scala foreachBatch impl. to support Dataset[T]

### What changes were proposed in this pull request?

This PR fixes the incorrect implementation of Scala streaming foreachBatch in Spark Connect mode when the input dataset is not a DataFrame but a typed Dataset[T]. **Note** that this only affects `Scala`.

In `DataStreamWriter`:
- Serialize the foreachBatch function together with the dataset's encoder (see the client/server round-trip sketch after this message).
- Reuse `ForeachWriterPacket` for foreachBatch, since both are sink operations that only require a function/writer object and the encoder of the input. Optionally, `ForeachWriterPacket` could be renamed to something more general that covers both cases.

In `SparkConnectPlanner` / `StreamingForeachBatchHelper`:
- Use the encoder passed from the client to recover the Dataset[T] object, so the foreachBatch function is called with the correct type (see the server-side sketch after this message).

### Why are the changes needed?

Without the fix, Scala foreachBatch fails or gives wrong results when the input dataset is not a DataFrame. Below is a simple reproduction:

```scala
import org.apache.spark.sql._

spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/test")

val q = spark.readStream
  .format("parquet")
  .schema("id LONG")
  .load("/tmp/test")
  .as[java.lang.Long]
  .writeStream
  .foreachBatch((ds: Dataset[java.lang.Long], batchId: Long) =>
    println(ds.collect().map(_.asInstanceOf[Long]).sum))
  .start()

Thread.sleep(1000)
q.stop()
```

The foreachBatch function above should print 45. Without the fix, the code fails because the function is called with a DataFrame object instead of a Dataset[java.lang.Long].

### Does this PR introduce _any_ user-facing change?

Yes. This PR changes the Spark Connect client around the foreachBatch API: the encoder is now serialized together with the foreachBatch function.

### How was this patch tested?

1. Ran an end-to-end test with spark-shell (with the Spark Connect server running and the client in Connect mode).
2. Added new and updated unit tests that would have failed without the fix.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#49323 from haiyangsun-db/SPARK-50661.

Authored-by: Haiyang Sun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
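To make the mechanism concrete, here is a minimal, self-contained model of the round trip the PR describes. This is not Spark's actual code: `FunctionPacket`, `ForeachBatchPacketDemo`, and the `Seq`-based batch are hypothetical stand-ins for `ForeachWriterPacket`, the dataset's `AgnosticEncoder`, and the real Dataset machinery.

```scala
import java.io._

// Hypothetical stand-in for ForeachWriterPacket: the user function is shipped
// together with enough type information (a Class tag here, standing in for
// the dataset's AgnosticEncoder) to rebuild the typed call on the server.
case class FunctionPacket(function: AnyRef, elementClass: Class[_])

object ForeachBatchPacketDemo {
  def serialize(obj: AnyRef): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    try oos.writeObject(obj) finally oos.close()
    bos.toByteArray
  }

  def deserialize(bytes: Array[Byte]): AnyRef = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try ois.readObject() finally ois.close()
  }

  def main(args: Array[String]): Unit = {
    // "Client side": package the foreachBatch-style function with its element
    // type, mirroring DataStreamWriter serializing function + encoder.
    val fn: (Seq[java.lang.Long], Long) => Unit = (batch, batchId) =>
      println(s"batch $batchId sum = ${batch.map(_.longValue).sum}")
    val payload = serialize(FunctionPacket(fn, classOf[java.lang.Long]))

    // "Server side": recover both pieces and invoke the function on a typed
    // batch, mirroring StreamingForeachBatchHelper recovering Dataset[T].
    // The pre-fix behavior corresponds to ignoring elementClass and handing
    // the function an untyped batch, which is exactly what broke Dataset[T].
    val packet = deserialize(payload).asInstanceOf[FunctionPacket]
    val typedFn = packet.function.asInstanceOf[(Seq[java.lang.Long], Long) => Unit]
    typedFn((0L until 10L).map(l => java.lang.Long.valueOf(l)), 0L) // prints: batch 0 sum = 45
  }
}
```

The design point the PR calls out is that foreach and foreachBatch both reduce to this same shape, a function plus the input's encoder, which is why `ForeachWriterPacket` can serve both sinks.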
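And here is a hedged sketch of roughly what the server-side recovery in `StreamingForeachBatchHelper` looks like under this scheme. Only `ForeachWriterPacket` and its role come from the PR text; the `deserialize` helper is a hypothetical stand-in, and the `ExpressionEncoder`/`Dataset.as` calls are an assumption about how the shipped encoder is turned back into a typed view.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.catalyst.encoders.{AgnosticEncoder, ExpressionEncoder}

// Packet shape as named in this PR: the user function plus the input encoder.
case class ForeachWriterPacket(foreachWriter: Any, datasetEncoder: AgnosticEncoder[_])

object ServerSideSketch {
  // Hypothetical stand-in for the server's Java-deserialization helper.
  def deserialize(payload: Array[Byte]): AnyRef = ???

  def recoverTypedCall(payload: Array[Byte]): (DataFrame, Long) => Unit = {
    val packet = deserialize(payload).asInstanceOf[ForeachWriterPacket]
    val fn = packet.foreachWriter.asInstanceOf[(Dataset[Any], Long) => Unit]
    val enc = ExpressionEncoder(packet.datasetEncoder.asInstanceOf[AgnosticEncoder[Any]])
    (df: DataFrame, batchId: Long) =>
      // The core of the fix: rebuild the typed Dataset[T] view from the
      // shipped encoder before invoking the user function. Pre-fix, fn was
      // handed df itself, so any Dataset[T] function saw the wrong type.
      fn(df.as(enc), batchId)
  }
}
```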
1 parent af53ee4 · commit 51b011f
Showing 4 changed files with 108 additions and 22 deletions.