[SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect

### What changes were proposed in this pull request?

This PR proposes to add "Live Notebook: DataFrame with Spark Connect" to the [Getting Started](https://spark.apache.org/docs/latest/api/python/getting_started/index.html) documentation, as shown below:

<img width="794" alt="Screen Shot 2023-02-23 at 1 15 41 PM" src="https://user-images.githubusercontent.com/44108233/220820191-ca0e5705-1694-4eaa-8658-67d522af1bf8.png">

The notebook copies the contents of 'Live Notebook: DataFrame' and updates the parts related to Spark Connect. The notebook looks like this:

<img width="814" alt="Screen Shot 2023-02-23 at 1 15 54 PM" src="https://user-images.githubusercontent.com/44108233/220820218-bbfb6a58-7009-4327-aea4-72ed6496d77c.png">

### Why are the changes needed?

To help users who are new to Spark Connect quickly get started with the DataFrame API.

### Does this PR introduce _any_ user-facing change?

No, it is documentation only.

### How was this patch tested?

Manually built the docs and ran the CI.

Closes apache#40092 from itholic/SPARK-42475.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
1 parent f15414e · commit 3830285 · 5 changed files with 159 additions and 34 deletions
142 changes: 142 additions & 0 deletions in python/docs/source/getting_started/quickstart_connect.ipynb
@@ -0,0 +1,142 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Quickstart: Spark Connect\n",
    "\n",
    "Spark Connect introduced a decoupled client-server architecture for Spark that allows remote connectivity to Spark clusters using the [DataFrame API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html?highlight=dataframe#pyspark.sql.DataFrame).\n",
    "\n",
    "This notebook walks through a simple step-by-step example of how to use Spark Connect to build any type of application that needs to leverage the power of Spark when working with data.\n",
    "\n",
"Spark Connect includes both client and server components and we will show you how to set up and use both." | ||
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Launch Spark server with Spark Connect\n",
    "\n",
    "To launch Spark with support for Spark Connect sessions, run the `start-connect-server.sh` script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "!./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:$SPARK_VERSION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Connect to Spark Connect server\n",
    "\n",
    "Now that the Spark server is running, we can connect to it remotely using Spark Connect. We do this by creating a remote Spark session on the client where our application runs. Before we can do that, we need to make sure to stop the existing regular Spark session because it cannot coexist with the remote Spark Connect session we are about to create."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession\n",
    "\n",
    "SparkSession.builder.master(\"local[*]\").getOrCreate().stop()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"The command we used above to launch the server configured Spark to run as `localhost:15002`. So now we can create a remote Spark session on the client using the following command." | ||
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "spark = SparkSession.builder.remote(\"sc://localhost:15002\").getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create DataFrame\n",
    "\n",
    "Once the remote Spark session is created successfully, it can be used the same way as a regular Spark session. Therefore, you can create a DataFrame with the following command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---+---+-------+----------+-------------------+\n",
      "|  a|  b|      c|         d|                  e|\n",
      "+---+---+-------+----------+-------------------+\n",
      "|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|\n",
      "|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|\n",
      "|  4|5.0|string3|2000-03-01|2000-01-03 12:00:00|\n",
      "+---+---+-------+----------+-------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from datetime import datetime, date\n",
    "from pyspark.sql import Row\n",
    "\n",
    "df = spark.createDataFrame([\n",
    "    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),\n",
    "    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),\n",
    "    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))\n",
    "])\n",
    "df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"See 'Live Notebook: DataFrame' at [the quickstart page](https://spark.apache.org/docs/latest/api/python/getting_started/index.html) for more detail usage of DataFrame API." | ||
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.11"
  },
  "name": "quickstart",
  "notebookId": 1927513300154480
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
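
For readers skimming the diff, the client-side flow of the notebook condenses to a short standalone script. The following is a minimal sketch, assuming a Spark Connect server is already listening on `localhost:15002` (e.g. started with `./sbin/start-connect-server.sh` as in the notebook) and that a PySpark client with Spark Connect support is installed; the `filter`/`groupBy` step is an illustrative extra, not part of the notebook:

```python
# Minimal standalone sketch of the notebook's client-side steps.
# Assumes a Spark Connect server is already listening on localhost:15002.
from datetime import date, datetime

from pyspark.sql import Row, SparkSession

# The "sc://" URL tells the builder to create a Spark Connect client
# session instead of a regular local session.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.createDataFrame([
    Row(a=1, b=2.0, c="string1", d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3.0, c="string2", d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5.0, c="string3", d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
])

# The remote DataFrame supports the usual transformations; they execute on
# the server and only the results come back to the client.
df.filter(df.a > 1).groupBy("c").count().show()

spark.stop()  # closes the client connection; the server keeps running
```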