Commit
[SPARK-41391][SQL] The output column name of groupBy.agg(count_distinct) is incorrect

### What changes were proposed in this pull request?

Correct the output column name of groupBy.agg(count_distinct), so that "*" is expanded into the actual column names and the output column name includes the DISTINCT keyword.

### Why are the changes needed?

The output column name for groupBy.agg(count_distinct) is incorrect, whereas the equivalent Spark SQL queries return the correct output column name. For groupBy.agg queries on a DataFrame, "*" is not expanded in the output column name and the DISTINCT keyword is missing from it.

```
// initializing data
scala> val df = spark.range(1, 10).withColumn("value", lit(1))
df: org.apache.spark.sql.DataFrame = [id: bigint, value: int]

scala> df.createOrReplaceTempView("table")

// DataFrame aggregate queries with incorrect output column
scala> df.groupBy("id").agg(count_distinct($"*"))
res3: org.apache.spark.sql.DataFrame = [id: bigint, count(unresolvedstar()): bigint]

scala> df.groupBy("id").agg(count_distinct($"value"))
res1: org.apache.spark.sql.DataFrame = [id: bigint, count(value): bigint]

// Spark SQL aggregate queries with correct output column
scala> spark.sql(" SELECT id, COUNT(DISTINCT *) FROM table GROUP BY id ")
res4: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT id, value): bigint]

scala> spark.sql(" SELECT id, COUNT(DISTINCT value) FROM table GROUP BY id ")
res2: org.apache.spark.sql.DataFrame = [id: bigint, count(DISTINCT value): bigint]
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added UT

Closes apache#40116 from ritikam2/master.

Authored-by: Ritika Maheshwari <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
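
The comparison described above can also be run outside the shell. The following is a minimal sketch, not the unit test added by this patch: it assumes Spark SQL 3.2 or later on the classpath, and the object name `CountDistinctColumnNames`, the helper `columnNames`, and the view name `t` are hypothetical names chosen for illustration. It prints the output column names produced by the DataFrame API and by Spark SQL for the same aggregate, so the two can be compared on whichever Spark version is in use.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count_distinct, lit}

// Hypothetical helper object, for illustration only.
object CountDistinctColumnNames {

  // Returns (column names via the DataFrame API, column names via Spark SQL)
  // for the same COUNT(DISTINCT value) aggregate grouped by id.
  def columnNames(spark: SparkSession): (Seq[String], Seq[String]) = {
    val df = spark.range(1, 10).withColumn("value", lit(1))
    df.createOrReplaceTempView("t") // "t" is an assumed view name

    val viaApi = df.groupBy("id").agg(count_distinct(col("value"))).columns.toSeq
    val viaSql = spark
      .sql("SELECT id, COUNT(DISTINCT value) FROM t GROUP BY id")
      .columns
      .toSeq
    (viaApi, viaSql)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("count-distinct-column-names")
      .getOrCreate()
    try {
      val (viaApi, viaSql) = columnNames(spark)
      println(s"DataFrame API: ${viaApi.mkString(", ")}")
      println(s"Spark SQL:     ${viaSql.mkString(", ")}")
      println(s"Names match:   ${viaApi == viaSql}")
    } finally {
      spark.stop()
    }
  }
}
```

Based on the session output quoted in this message, the two name lists would be expected to differ before the fix and to agree after it; the sketch deliberately prints the result rather than asserting it.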