-
Notifications
You must be signed in to change notification settings - Fork 237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the retry support for nondeterministic expressions #11789
base: branch-25.02
Are you sure you want to change the base?
Conversation
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
build |
Signed-off-by: Firestarman <[email protected]>
build |
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
build |
The failure in CI is a known issue, pls refer to #11790. |
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a great first step. I mostly want to understand the 3.5.1 issue around generate and if we need to file a follow on issue.
I would also love to see some rand tests around
- join (we can restrict the range of rand to make it actually match)
- hash aggregate with some distinct operators so expand is called
- (Do we need/want rand in the key for an aggregate?)
- window operations
Then I think we will have covered most of the major cases when this could be called.
|
||
# See https://github.com/apache/spark/commit/9c0b803ba124a6e70762aec1e5559b0d66529f4d | ||
@ignore_order(local=True) | ||
@pytest.mark.skipif(is_before_spark_351(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be a follow on issue for us to handle this properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally we don't need to do anything. It is a pure Spark bug.
Spark 3.5.0 will complain the exception as below when evaluating a rand()
in Generate with the codegen disabled and fail this test.
E Caused by: java.lang.IllegalArgumentException: requirement failed: Nondeterministic expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized before eval.
E at scala.Predef$.require(Predef.scala:281)
E at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497)
E at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495)
E at org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35)
E at org.apache.spark.sql.catalyst.expressions.CreateArray.$anonfun$eval$1(complexTypeCreator.scala:95)
E at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
...
E at org.apache.spark.sql.catalyst.expressions.CreateArray.eval(complexTypeCreator.scala:95)
E at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375)
E at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
And Spark before 3.5.0 will throw the following exception and fail the test too.
E pyspark.errors.exceptions.captured.AnalysisException: nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window, found:
E explode(array(rand(42L))),col
E in operator Generate explode(array(rand(42))), false, [col#2].;
E Project [col#2]
E +- Generate explode(array(rand(42))), false, [col#2]
E +- LogicalRDD [a#0], false
While GPU supports Generate with rand
for all the versions. So to make this test work we have to ignore it for Spark before 3.5.1.
Thx for review. Would it be ok to add them by a following PR ? And merging this PR will not close the issue #11649. |
@revans2 Could you help review this again? I'v addressed your comments. Thx in advance. |
Contributes to #11649
This PR is trying to address some requirements described in issue #11649, but not all of them.
It introduces two new classes named
GpuExpressionRetryable
andRetryStateTracker
to initially set up a fundamental support for the items 2 and 3 as below, and adds in the relevant unit tests.And it also adds the integration tests for the function
rand()
being used inHashAggregate
,Generate
,Projection
,ArrowEvalPython
andFilter
. This is for the item 4, and it still does not cover all the cases where a nondeterministic expression can be used, but we are closer than before.