Commit

doc: Read parquet files with PySpark (#3020)
* Basic doc for pyspark

* Fix toctree
AndreaFrancis authored Aug 19, 2024
1 parent e4bb30c commit e000fcd
Showing 3 changed files with 62 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -46,6 +46,8 @@
    title: Polars
  - local: mlcroissant
    title: mlcroissant
  - local: pyspark
    title: PySpark
- title: Conceptual Guides
  sections:
  - local: configs_and_splits
1 change: 1 addition & 0 deletions docs/source/parquet_process.md
@@ -12,3 +12,4 @@ There are several different libraries you can use to work with the published Parquet files:
- [Pandas](https://pandas.pydata.org/docs/index.html), a data analysis tool for working with data structures
- [Polars](https://pola-rs.github.io/polars-book/user-guide/), a Rust based DataFrame library
- [mlcroissant](https://github.com/mlcommons/croissant/tree/main/python/mlcroissant), a library for loading datasets from Croissant metadata
- [pyspark](https://spark.apache.org/docs/latest/api/python), the Python API for Apache Spark
59 changes: 59 additions & 0 deletions docs/source/pyspark.md
@@ -0,0 +1,59 @@
# PySpark

[pyspark](https://spark.apache.org/docs/latest/api/python) is the Python interface for Apache Spark, enabling large-scale data processing and real-time analytics in a distributed environment using Python.

<Tip>

For a detailed guide on how to analyze datasets on the Hub with PySpark, check out this [blog](https://huggingface.co/blog/asoria/pyspark-hugging-face-datasets).

</Tip>

To start working with Parquet files in PySpark, you'll first need to add the file(s) to a Spark context. Below is an example of how to read a single Parquet file:

```py
from pyspark import SparkFiles
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("WineReviews").getOrCreate()

# Add the Parquet file to the Spark context
spark.sparkContext.addFile("https://huggingface.co/api/datasets/james-burton/wine_reviews/parquet/default/train/0.parquet")

# Read the Parquet file into a DataFrame
df = spark.read.parquet(SparkFiles.get("0.parquet"))

```

If your dataset is sharded into multiple Parquet files, you'll need to add each file to the Spark context individually. Here's how to do it:

```py
import requests

# Fetch the URLs of the Parquet files for the train split
r = requests.get('https://huggingface.co/api/datasets/james-burton/wine_reviews/parquet')
train_parquet_files = r.json()['default']['train']

# Add each Parquet file to the Spark context
for url in train_parquet_files:
spark.sparkContext.addFile(url)

# Read all Parquet files into a single DataFrame
df = spark.read.parquet(SparkFiles.getRootDirectory() + "/*.parquet")

```
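
Because Spark evaluates transformations lazily, the files are re-read every time you run an action on this DataFrame. If you plan to run several queries over it, caching it in memory can help; a minimal sketch using the standard `cache` API:

```py
# Keep the combined DataFrame in memory across queries
df = df.cache()

# The first action materializes the cache; subsequent actions reuse it
print(df.count())
```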

Once you've loaded the data into a PySpark DataFrame, you can perform various operations to explore and analyze it:

```py
print(f"Shape of the dataset: {df.count()}, {len(df.columns)}")

# Display first 10 rows
df.show(n=10)

# Get a statistical summary of the data
df.describe().show()

# Print the schema of the DataFrame
df.printSchema()

```
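
Beyond a quick look at the data, you can run regular DataFrame transformations or Spark SQL queries over it. The sketch below assumes the dataset exposes `country` and `points` columns; check the actual names with `df.printSchema()` first and adjust accordingly:

```py
from pyspark.sql import functions as F

# DataFrame API: average points per country, highest first
(df.groupBy("country")
   .agg(F.avg("points").alias("avg_points"))
   .orderBy(F.desc("avg_points"))
   .show(n=10))

# Equivalent Spark SQL: register the DataFrame as a temporary view and query it
df.createOrReplaceTempView("wine_reviews")
spark.sql(
    "SELECT country, AVG(points) AS avg_points "
    "FROM wine_reviews GROUP BY country ORDER BY avg_points DESC LIMIT 10"
).show()
```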
