[SPARK-11567] [PYTHON] Add Python API for corr Aggregate function
like `df.agg(corr("col1", "col2"))`

davies

Author: felixcheung <[email protected]>

Closes apache#9536 from felixcheung/pyfunc.
felixcheung authored and davies committed Nov 10, 2015
1 parent 638c51d commit 32790fe
Showing 1 changed file with 16 additions and 0 deletions.
16 changes: 16 additions & 0 deletions python/pyspark/sql/functions.py
@@ -255,6 +255,22 @@ def coalesce(*cols):
    return Column(jc)


@since(1.6)
def corr(col1, col2):
    """Returns a new :class:`Column` for the Pearson Correlation Coefficient for ``col1``
    and ``col2``.
    >>> a = [x * x - 2 * x + 3.5 for x in range(20)]
    >>> b = range(20)
    >>> corrDf = sqlContext.createDataFrame(zip(a, b))
    >>> corrDf = corrDf.agg(corr(corrDf._1, corrDf._2).alias('c'))
    >>> corrDf.selectExpr('abs(c - 0.9572339139475857) < 1e-16 as t').collect()
    [Row(t=True)]
    """
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.corr(_to_java_column(col1), _to_java_column(col2)))


@since(1.3)
def countDistinct(col, *cols):
    """Returns a new :class:`Column` for distinct count of ``col`` or ``cols``.
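
The value the doctest compares against can be cross-checked without a Spark session. The following is a minimal sketch in plain Python, not part of this commit or of PySpark: the helper name `pearson` is hypothetical, and it applies the textbook definition of the Pearson coefficient to the same two sequences the doctest builds. It says nothing about how Spark computes the aggregate internally.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Hypothetical helper for illustration only; it is not a PySpark API.
    """
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = math.fsum(xs) / len(xs)
    mean_y = math.fsum(ys) / len(ys)
    # Deviations from the mean for each sequence.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sum of cross products and sums of squared deviations.
    cov = math.fsum(p * q for p, q in zip(dx, dy))
    var_x = math.fsum(p * p for p in dx)
    var_y = math.fsum(q * q for q in dy)
    return cov / math.sqrt(var_x * var_y)


# Same inputs as the doctest in corr().
a = [x * x - 2 * x + 3.5 for x in range(20)]
b = list(range(20))
r = pearson(a, b)
print(r)
```

The printed value should agree with the doctest constant 0.9572339139475857 to many decimal places. The last digits can differ between implementations, so a looser tolerance than the doctest's 1e-16 is the safer choice when comparing.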