---
title: How to query data in an S3 bucket
sidebar_label: Querying data in S3
slug: /en/chdb/guides/querying-s3
description: Learn how to query data in an S3 bucket with chDB.
keywords: [chdb, s3]
---

A lot of the world's data lives in Amazon S3 buckets.
In this guide, we'll learn how to query that data using chDB.

## Setup

Let's first create a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate
```

Next, we'll install chDB.
Make sure you have version 2.0.2 or higher:

```bash
pip install "chdb>=2.0.2"
```

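If you're not sure which version you have, you can check from Python; a quick sketch, assuming chDB exposes the standard `__version__` attribute:

```python
import chdb

# Print the installed chDB version; if it's older than 2.0.2,
# upgrade with: pip install --upgrade "chdb>=2.0.2"
print(chdb.__version__)
```
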
We'll also install IPython:

```bash
pip install ipython
```

We're going to use `ipython` to run the commands in the rest of this guide. You can launch it by running:

```bash
ipython
```

You can also use the code in a Python script or in your favorite notebook.

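For example, everything in this guide could live in a standalone script instead; a minimal sketch (the file name `query_s3.py` is just an illustration):

```python
# query_s3.py - run with: python query_s3.py
import chdb

# Any query from this guide can be run this way; the result is returned
# as a string in the requested output format, so we print it explicitly.
result = chdb.query("SELECT version()", "PrettyCompact")
print(result)
```
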
## Listing files in an S3 bucket

Let's start by listing all the files in an S3 bucket.
To do this, we can use the [`s3` table function](/docs/en/sql-reference/table-functions/s3) and pass in the path to a single file or a wildcard that matches a set of files.

:::tip
If you pass in just the bucket name, it will throw an exception.
:::

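If you want to see what that looks like, here's a sketch that triggers the error deliberately (we catch `Exception` broadly, since the exact exception type may vary between chDB versions):

```python
import chdb

try:
    # Bucket name only - no path or wildcard - so this should fail
    chdb.query("SELECT * FROM s3('s3://datasets-documentation')")
except Exception as e:
    print(f"Query failed as expected: {e}")
```
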
We're also going to use the [`One`](/docs/en/interfaces/formats#data-format-one) input format so that the files aren't parsed; instead, a single row is returned per file, and we can access each file via the `_file` virtual column and its path via the `_path` virtual column.

```python
import chdb

chdb.query("""
SELECT
    _file,
    _path
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet', One)
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
```

```text
┌─_file───────────────────────────────┬─_path─────────────────────────────────────────────────────────────────────┐
│ amazon_reviews_2010.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2010.snappy.parquet  │
│ amazon_reviews_1990s.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_1990s.snappy.parquet │
│ amazon_reviews_2013.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2013.snappy.parquet  │
│ amazon_reviews_2015.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet  │
│ amazon_reviews_2014.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2014.snappy.parquet  │
│ amazon_reviews_2012.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2012.snappy.parquet  │
│ amazon_reviews_2000s.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2000s.snappy.parquet │
│ amazon_reviews_2011.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2011.snappy.parquet  │
└─────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────┘
```

This bucket contains only Parquet files.

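Since the glob in the previous query only matches Parquet files, that's everything in the bucket. For buckets that mix formats, the `_file` virtual column gives us a way to narrow a listing down; a sketch using a broader wildcard and a `LIKE` filter:

```python
import chdb

# Match everything under the prefix, then keep only the Parquet files.
# With the One format nothing is parsed, so non-Parquet files are harmless here.
chdb.query("""
SELECT
    _file,
    _path
FROM s3('s3://datasets-documentation/amazon_reviews/*', One)
WHERE _file LIKE '%.parquet'
""", 'PrettyCompact')
```
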
## Querying files in an S3 bucket

Next, let's learn how to query those files.
If we want to count the number of rows in each of those files, we can run the following query:

```python
chdb.query("""
SELECT
    _file,
    count() AS count,
    formatReadableQuantity(count) AS readableCount
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet')
GROUP BY ALL
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
```

```text
┌─_file───────────────────────────────┬────count─┬─readableCount───┐
│ amazon_reviews_2013.snappy.parquet  │ 28034255 │ 28.03 million   │
│ amazon_reviews_1990s.snappy.parquet │   639532 │ 639.53 thousand │
│ amazon_reviews_2011.snappy.parquet  │  6112495 │ 6.11 million    │
│ amazon_reviews_2015.snappy.parquet  │ 41905631 │ 41.91 million   │
│ amazon_reviews_2012.snappy.parquet  │ 11541011 │ 11.54 million   │
│ amazon_reviews_2000s.snappy.parquet │ 14728295 │ 14.73 million   │
│ amazon_reviews_2014.snappy.parquet  │ 44127569 │ 44.13 million   │
│ amazon_reviews_2010.snappy.parquet  │  3868472 │ 3.87 million    │
└─────────────────────────────────────┴──────────┴─────────────────┘
```

We can also pass in the HTTP URI of the bucket and get the same results:

```python
chdb.query("""
SELECT
    _file,
    count() AS count,
    formatReadableQuantity(count) AS readableCount
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/*.parquet')
GROUP BY ALL
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
```

Let's have a look at the schema of these Parquet files using the `DESCRIBE` statement:

```python
chdb.query("""
DESCRIBE s3('s3://datasets-documentation/amazon_reviews/*.parquet')
SETTINGS describe_compact_output=1
""", 'PrettyCompact')
```

```text
    ┌─name──────────────┬─type─────────────┐
 1. │ review_date       │ Nullable(UInt16) │
 2. │ marketplace       │ Nullable(String) │
 3. │ customer_id       │ Nullable(UInt64) │
 4. │ review_id         │ Nullable(String) │
 5. │ product_id        │ Nullable(String) │
 6. │ product_parent    │ Nullable(UInt64) │
 7. │ product_title     │ Nullable(String) │
 8. │ product_category  │ Nullable(String) │
 9. │ star_rating       │ Nullable(UInt8)  │
10. │ helpful_votes     │ Nullable(UInt32) │
11. │ total_votes       │ Nullable(UInt32) │
12. │ vine              │ Nullable(Bool)   │
13. │ verified_purchase │ Nullable(Bool)   │
14. │ review_headline   │ Nullable(String) │
15. │ review_body       │ Nullable(String) │
    └───────────────────┴──────────────────┘
```

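Now that we know the schema, we don't have to rely on inference: the `s3` table function also accepts optional format and structure arguments. A sketch that spells out just the columns we want to read (declaring a subset of the file's columns works for Parquet, which is read by column name):

```python
import chdb

# Explicit format ('Parquet') and structure instead of schema inference
chdb.query("""
SELECT product_category, star_rating
FROM s3(
    's3://datasets-documentation/amazon_reviews/amazon_reviews_2010.snappy.parquet',
    'Parquet',
    'product_category Nullable(String), star_rating Nullable(UInt8)'
)
LIMIT 5
""", 'PrettyCompact')
```
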
Let's now compute the number of reviews per product category, along with the average star rating:

```python
chdb.query("""
SELECT product_category, count() AS reviews, round(avg(star_rating), 2) AS avg
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet')
GROUP BY ALL
LIMIT 10
""", 'PrettyCompact')
```

```text
    ┌─product_category─┬──reviews─┬──avg─┐
 1. │ Toys             │  4864056 │ 4.21 │
 2. │ Apparel          │  5906085 │ 4.11 │
 3. │ Luggage          │   348644 │ 4.22 │
 4. │ Kitchen          │  4880297 │ 4.21 │
 5. │ Books            │ 19530930 │ 4.34 │
 6. │ Outdoors         │  2302327 │ 4.24 │
 7. │ Video            │   380596 │ 4.19 │
 8. │ Grocery          │  2402365 │ 4.31 │
 9. │ Shoes            │  4366757 │ 4.24 │
10. │ Jewelry          │  1767667 │ 4.14 │
    └──────────────────┴──────────┴──────┘
```

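So far we've rendered results as text, but chDB can also hand query results straight to Pandas; a sketch, assuming the `DataFrame` output format available in recent chDB releases:

```python
import chdb

# Ask for a Pandas DataFrame instead of pretty-printed text
df = chdb.query("""
SELECT product_category, count() AS reviews, round(avg(star_rating), 2) AS avg
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet')
GROUP BY ALL
LIMIT 10
""", "DataFrame")

# Now it's a regular DataFrame, so we can sort it locally
print(df.sort_values("reviews", ascending=False))
```
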
## Querying files in a private S3 bucket

If we're querying files in a private S3 bucket, we need an access key and secret.
We can pass those credentials to the `s3` table function:

```python
chdb.query("""
SELECT product_category, count() AS reviews, round(avg(star_rating), 2) AS avg
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet', 'access-key', 'secret')
GROUP BY ALL
LIMIT 10
""", 'PrettyCompact')
```

:::note
This query won't work because it's a public bucket!
:::

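For public buckets like this one, the `s3` table function accepts the `NOSIGN` keyword in place of credentials, which tells ClickHouse to skip signing requests entirely; a quick sketch:

```python
import chdb

# NOSIGN replaces the access key/secret pair for anonymous access
chdb.query("""
SELECT count()
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet', NOSIGN)
""", 'PrettyCompact')
```
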
An alternative way is to use [named collections](/docs/en/operations/named-collections), but this approach isn't yet supported by chDB.
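In the meantime, one way to keep credentials out of your source code is to read them from environment variables and interpolate them into the query; a sketch (the variable names follow the usual AWS conventions, and the bucket path is a placeholder):

```python
import os

import chdb

access_key = os.environ["AWS_ACCESS_KEY_ID"]
secret = os.environ["AWS_SECRET_ACCESS_KEY"]

# Placeholder bucket path - substitute your own private bucket
chdb.query(f"""
SELECT count()
FROM s3('s3://my-private-bucket/*.parquet', '{access_key}', '{secret}')
""", 'PrettyCompact')
```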