---
title: How to query data in an S3 bucket
sidebar_label: Querying data in S3
slug: /en/chdb/guides/querying-s3
description: Learn how to query data in an S3 bucket with chDB.
keywords: [chdb, s3]
---

A lot of the world's data lives in Amazon S3 buckets.
In this guide, we'll learn how to query that data using chDB.

## Setup

Let's first create a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate
```

Next, we'll install chDB.
Make sure you have version 2.0.2 or higher:

```bash
pip install "chdb>=2.0.2"
```

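If you're not sure which version you have, you can check from Python; a quick sketch, assuming chDB exposes the standard `__version__` attribute:

```python
import chdb

# Print the installed chDB version; if it's older than 2.0.2,
# upgrade with: pip install --upgrade "chdb>=2.0.2"
print(chdb.__version__)
```
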
We'll also install IPython:

```bash
pip install ipython
```

We're going to use `ipython` to run the commands in the rest of this guide. You can launch it by running:

```bash
ipython
```

You can also use the code in a Python script or in your favorite notebook.

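For example, everything in this guide could live in a standalone script instead; a minimal sketch (the file name `query_s3.py` is just an illustration):

```python
# query_s3.py - run with: python query_s3.py
import chdb

# Any query from this guide can be run this way; the result is returned
# as a string in the requested output format, so we print it explicitly.
result = chdb.query("SELECT version()", "PrettyCompact")
print(result)
```
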
## Listing files in an S3 bucket

Let's start by listing all the files in an S3 bucket.
To do this, we can use the [`s3` table function](/docs/en/sql-reference/table-functions/s3) and pass in the path to a single file or a wildcard that matches a set of files.

:::tip
If you pass in just the bucket name, it will throw an exception.
:::

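If you want to see what that looks like, here's a sketch that triggers the error deliberately (we catch `Exception` broadly, since the exact exception type may vary between chDB versions):

```python
import chdb

try:
    # Bucket name only - no path or wildcard - so this should fail
    chdb.query("SELECT * FROM s3('s3://datasets-documentation')")
except Exception as e:
    print(f"Query failed as expected: {e}")
```
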
We're also going to use the [`One`](/docs/en/interfaces/formats#data-format-one) input format so that the files aren't parsed; instead, a single row is returned per file, and we can access each file via the `_file` virtual column and its path via the `_path` virtual column.

```python
import chdb

chdb.query("""
SELECT
    _file,
    _path
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet', One)
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
```

```text
┌─_file───────────────────────────────┬─_path─────────────────────────────────────────────────────────────────────┐
│ amazon_reviews_2010.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2010.snappy.parquet  │
│ amazon_reviews_1990s.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_1990s.snappy.parquet │
│ amazon_reviews_2013.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2013.snappy.parquet  │
│ amazon_reviews_2015.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet  │
│ amazon_reviews_2014.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2014.snappy.parquet  │
│ amazon_reviews_2012.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2012.snappy.parquet  │
│ amazon_reviews_2000s.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2000s.snappy.parquet │
│ amazon_reviews_2011.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2011.snappy.parquet  │
└─────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────┘
```

This bucket contains only Parquet files.

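Since the glob in the previous query only matches Parquet files, that's everything in the bucket. For buckets that mix formats, the `_file` virtual column gives us a way to narrow a listing down; a sketch using a broader wildcard and a `LIKE` filter:

```python
import chdb

# Match everything under the prefix, then keep only the Parquet files.
# With the One format nothing is parsed, so non-Parquet files are harmless here.
chdb.query("""
SELECT
    _file,
    _path
FROM s3('s3://datasets-documentation/amazon_reviews/*', One)
WHERE _file LIKE '%.parquet'
""", 'PrettyCompact')
```
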
## Querying files in an S3 bucket

Next, let's learn how to query those files.
If we want to count the number of rows in each of those files, we can run the following query:

```python
chdb.query("""
SELECT
    _file,
    count() AS count,
    formatReadableQuantity(count) AS readableCount
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet')
GROUP BY ALL
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
```

```text
┌─_file───────────────────────────────┬────count─┬─readableCount───┐
│ amazon_reviews_2013.snappy.parquet  │ 28034255 │ 28.03 million   │
│ amazon_reviews_1990s.snappy.parquet │   639532 │ 639.53 thousand │
│ amazon_reviews_2011.snappy.parquet  │  6112495 │ 6.11 million    │
│ amazon_reviews_2015.snappy.parquet  │ 41905631 │ 41.91 million   │
│ amazon_reviews_2012.snappy.parquet  │ 11541011 │ 11.54 million   │
│ amazon_reviews_2000s.snappy.parquet │ 14728295 │ 14.73 million   │
│ amazon_reviews_2014.snappy.parquet  │ 44127569 │ 44.13 million   │
│ amazon_reviews_2010.snappy.parquet  │  3868472 │ 3.87 million    │
└─────────────────────────────────────┴──────────┴─────────────────┘
```

We can also pass in the HTTP URI of the bucket and get the same results:

```python
chdb.query("""
SELECT
    _file,
    count() AS count,
    formatReadableQuantity(count) AS readableCount
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/*.parquet')
GROUP BY ALL
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
```

Let's have a look at the schema of these Parquet files using the `DESCRIBE` statement:

```python
chdb.query("""
DESCRIBE s3('s3://datasets-documentation/amazon_reviews/*.parquet')
SETTINGS describe_compact_output=1
""", 'PrettyCompact')
```

```text
    ┌─name──────────────┬─type─────────────┐
 1. │ review_date       │ Nullable(UInt16) │
 2. │ marketplace       │ Nullable(String) │
 3. │ customer_id       │ Nullable(UInt64) │
 4. │ review_id         │ Nullable(String) │
 5. │ product_id        │ Nullable(String) │
 6. │ product_parent    │ Nullable(UInt64) │
 7. │ product_title     │ Nullable(String) │
 8. │ product_category  │ Nullable(String) │
 9. │ star_rating       │ Nullable(UInt8)  │
10. │ helpful_votes     │ Nullable(UInt32) │
11. │ total_votes       │ Nullable(UInt32) │
12. │ vine              │ Nullable(Bool)   │
13. │ verified_purchase │ Nullable(Bool)   │
14. │ review_headline   │ Nullable(String) │
15. │ review_body       │ Nullable(String) │
    └───────────────────┴──────────────────┘
```

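Now that we know the schema, we don't have to rely on inference: the `s3` table function also accepts optional format and structure arguments. A sketch that spells out just the columns we want to read (declaring a subset of the file's columns works for Parquet, which is read by column name):

```python
import chdb

# Explicit format ('Parquet') and structure instead of schema inference
chdb.query("""
SELECT product_category, star_rating
FROM s3(
    's3://datasets-documentation/amazon_reviews/amazon_reviews_2010.snappy.parquet',
    'Parquet',
    'product_category Nullable(String), star_rating Nullable(UInt8)'
)
LIMIT 5
""", 'PrettyCompact')
```
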
Let's now compute the number of reviews per product category, along with the average star rating:

```python
chdb.query("""
SELECT product_category, count() AS reviews, round(avg(star_rating), 2) AS avg
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet')
GROUP BY ALL
LIMIT 10
""", 'PrettyCompact')
```

```text
    ┌─product_category─┬──reviews─┬──avg─┐
 1. │ Toys             │  4864056 │ 4.21 │
 2. │ Apparel          │  5906085 │ 4.11 │
 3. │ Luggage          │   348644 │ 4.22 │
 4. │ Kitchen          │  4880297 │ 4.21 │
 5. │ Books            │ 19530930 │ 4.34 │
 6. │ Outdoors         │  2302327 │ 4.24 │
 7. │ Video            │   380596 │ 4.19 │
 8. │ Grocery          │  2402365 │ 4.31 │
 9. │ Shoes            │  4366757 │ 4.24 │
10. │ Jewelry          │  1767667 │ 4.14 │
    └──────────────────┴──────────┴──────┘
```

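So far we've rendered results as text, but chDB can also hand query results straight to Pandas; a sketch, assuming the `DataFrame` output format available in recent chDB releases:

```python
import chdb

# Ask for a Pandas DataFrame instead of pretty-printed text
df = chdb.query("""
SELECT product_category, count() AS reviews, round(avg(star_rating), 2) AS avg
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet')
GROUP BY ALL
LIMIT 10
""", "DataFrame")

# Now it's a regular DataFrame, so we can sort it locally
print(df.sort_values("reviews", ascending=False))
```
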
## Querying files in a private S3 bucket

If we're querying files in a private S3 bucket, we need an access key and secret.
We can pass those credentials to the `s3` table function:

```python
chdb.query("""
SELECT product_category, count() AS reviews, round(avg(star_rating), 2) AS avg
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet', 'access-key', 'secret')
GROUP BY ALL
LIMIT 10
""", 'PrettyCompact')
```

:::note
This query won't work because it's a public bucket!
:::

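For public buckets like this one, the `s3` table function accepts the `NOSIGN` keyword in place of credentials, which tells ClickHouse to skip signing requests entirely; a quick sketch:

```python
import chdb

# NOSIGN replaces the access key/secret pair for anonymous access
chdb.query("""
SELECT count()
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet', NOSIGN)
""", 'PrettyCompact')
```
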
An alternative way is to use [named collections](/docs/en/operations/named-collections), but this approach isn't yet supported by chDB.
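In the meantime, one way to keep credentials out of your source code is to read them from environment variables and interpolate them into the query; a sketch (the variable names follow the usual AWS conventions, and the bucket path is a placeholder):

```python
import os

import chdb

access_key = os.environ["AWS_ACCESS_KEY_ID"]
secret = os.environ["AWS_SECRET_ACCESS_KEY"]

# Placeholder bucket path - substitute your own private bucket
chdb.query(f"""
SELECT count()
FROM s3('s3://my-private-bucket/*.parquet', '{access_key}', '{secret}')
""", 'PrettyCompact')
```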