[SPARK-44401][PYTHON][DOCS] Arrow Python UDF Use Guide
### What changes were proposed in this pull request?
Add use guide for Arrow Python UDF.

### Why are the changes needed?
Better documentation and usability of Arrow Python UDF.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A.

Closes apache#41974 from xinrong-meng/guide.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
xinrong-meng authored and HyukjinKwon committed Jul 19, 2023
1 parent f35c814 commit bdea4c5
Showing 2 changed files with 47 additions and 0 deletions.
21 changes: 21 additions & 0 deletions examples/src/main/python/sql/arrow.py
@@ -275,6 +275,27 @@ def merge_ordered(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
# +--------+---+---+----+


def arrow_python_udf_example(spark: SparkSession) -> None:
    from pyspark.sql.functions import udf

    @udf(returnType='int')  # A default, pickled Python UDF
    def slen(s):  # type: ignore[no-untyped-def]
        return len(s)

    @udf(returnType='int', useArrow=True)  # An Arrow Python UDF
    def arrow_slen(s):  # type: ignore[no-untyped-def]
        return len(s)

    df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))

    df.select(slen("name"), arrow_slen("name")).show()
    # +----------+----------------+
    # |slen(name)|arrow_slen(name)|
    # +----------+----------------+
    # |         8|               8|
    # +----------+----------------+


if __name__ == "__main__":
spark = SparkSession \
.builder \
26 changes: 26 additions & 0 deletions python/docs/source/user_guide/sql/arrow_pandas.rst
@@ -333,6 +333,32 @@ The following example shows how to use ``DataFrame.groupby().cogroup().applyInPandas``

For detailed usage, please see :meth:`PandasCogroupedOps.applyInPandas`

Arrow Python UDFs
-----------------

Arrow Python UDFs are user-defined functions that are executed row by row, utilizing Arrow for efficient batch data
transfer and serialization. To define an Arrow Python UDF, you can use the :meth:`udf` decorator or wrap the function
with the :meth:`udf` method, ensuring the ``useArrow`` parameter is set to ``True``. Alternatively, you can enable Arrow
optimization for Python UDFs throughout the entire SparkSession by setting the Spark configuration
``spark.sql.execution.pythonUDF.arrow.enabled`` to ``true``. Note that the Spark configuration takes effect only
when ``useArrow`` is either not set or set to ``None``.

The type hints for Arrow Python UDFs should be specified in the same way as for default, pickled Python UDFs.

Here's an example that demonstrates the usage of both a default, pickled Python UDF and an Arrow Python UDF:

.. literalinclude:: ../../../../../examples/src/main/python/sql/arrow.py
:language: python
:lines: 279-297
:dedent: 4

Compared to the default, pickled Python UDFs, Arrow Python UDFs provide a more coherent type-coercion mechanism. UDF
type coercion poses challenges when the Python instances returned by a UDF do not match the user-specified
return type. The default, pickled Python UDFs' type coercion has certain limitations, such as relying on ``None`` as a
fallback for type mismatches, which can lead to ambiguity and data loss. Additionally, converting dates, datetimes,
and tuples to strings can yield ambiguous results. Arrow Python UDFs, on the other hand, leverage Arrow's
capabilities to standardize type coercion and address these issues.

Usage Notes
-----------

