query s3 bucket
mneedham committed Oct 9, 2024
1 parent 384a7af commit ac24a00
Showing 2 changed files with 198 additions and 2 deletions.
195 changes: 195 additions & 0 deletions docs/en/chdb/guides/querying-s3-bucket.md
@@ -0,0 +1,195 @@
---
title: How to query data in an S3 bucket
sidebar_label: Querying data in S3
slug: /en/chdb/guides/querying-s3
description: Learn how to query data in an S3 bucket with chDB.
keywords: [chdb, s3]
---

A lot of the world's data lives in Amazon S3 buckets.
In this guide, we'll learn how to query that data using chDB.

## Setup

Let's first create a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate
```

And now we'll install chDB.
Make sure you have version 2.0.2 or higher:

```bash
pip install "chdb>=2.0.2"
```

Next, we'll install IPython:

```bash
pip install ipython
```

We'll use `ipython` to run the code in the rest of the guide. You can launch it by running:

```bash
ipython
```

You can also use the code in a Python script or in your favorite notebook.
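
If you're not sure which chDB version ended up in the virtual environment, a quick check like the following can help. This is a minimal sketch using only the standard library; `parse_release` and `meets_minimum` are hypothetical helper names, and the comparison only looks at the leading numeric part of each version component.

```python
import itertools
from importlib.metadata import PackageNotFoundError, version


def parse_release(text: str) -> tuple:
    """Turn '2.0.10' (or '2.1.0b1') into a tuple of integers such as (2, 0, 10)."""
    numbers = []
    for piece in text.split("."):
        # Keep only the leading digits of each component; stop at the first
        # component that has none.
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def meets_minimum(installed: str, minimum: str = "2.0.2") -> bool:
    """Compare release numbers numerically rather than as strings."""
    return parse_release(installed) >= parse_release(minimum)


try:
    installed = version("chdb")
    status = "OK" if meets_minimum(installed) else "too old for this guide"
    print(f"chdb {installed}: {status}")
except PackageNotFoundError:
    print("chdb is not installed in this environment")
```

Comparing tuples of integers avoids the string-comparison trap where `"2.0.10"` sorts before `"2.0.2"`.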

## Listing files in an S3 bucket

Let's start by listing all the files in an S3 bucket.
To do this, we can use the [`s3` table function](/docs/en/sql-reference/table-functions/s3) and pass in either the path to a single file or a wildcard pattern that matches a set of files.

:::tip
If you pass in only the bucket name, the query throws an exception.
:::

We're also going to use the [`One`](/docs/en/interfaces/formats#data-format-one) input format so that the files aren't parsed. Instead, a single row is returned per file, and we can access the file name via the `_file` virtual column and the path via the `_path` virtual column.

```python
import chdb

chdb.query("""
SELECT
_file,
_path
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet', One)
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
```

```text
┌─_file───────────────────────────────┬─_path─────────────────────────────────────────────────────────────────────┐
│ amazon_reviews_2010.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2010.snappy.parquet  │
│ amazon_reviews_1990s.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_1990s.snappy.parquet │
│ amazon_reviews_2013.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2013.snappy.parquet  │
│ amazon_reviews_2015.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet  │
│ amazon_reviews_2014.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2014.snappy.parquet  │
│ amazon_reviews_2012.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2012.snappy.parquet  │
│ amazon_reviews_2000s.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2000s.snappy.parquet │
│ amazon_reviews_2011.snappy.parquet  │ datasets-documentation/amazon_reviews/amazon_reviews_2011.snappy.parquet  │
└─────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────┘
```

The pattern matches eight Parquet files under the `amazon_reviews` prefix.
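
If you'd rather work with the listing as Python objects than as a pretty-printed table, one option is to request `CSV` output and parse it. The following is a sketch under a few assumptions: `parse_listing` and `list_files` are hypothetical helper names, and the code relies on `str()` of the chDB query result returning the raw output text.

```python
import csv
import io


def parse_listing(csv_text: str) -> list:
    """Turn two-column CSV output (_file, _path) into a sorted list of tuples."""
    rows = csv.reader(io.StringIO(csv_text))
    return sorted(tuple(row) for row in rows if row)


def list_files(pattern: str) -> list:
    """Run the listing query for a wildcard pattern and return (file, path) pairs."""
    # Imported here so that parse_listing can be used without chDB installed.
    import chdb

    result = chdb.query(
        f"SELECT _file, _path FROM s3('{pattern}', One)",
        "CSV",
    )
    # Assumption: str(result) yields the CSV text produced by the query.
    return parse_listing(str(result))
```

Calling `list_files('s3://datasets-documentation/amazon_reviews/*.parquet')` should then return one `(file, path)` tuple per matching file, sorted by file name.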

## Querying files in an S3 bucket

Next, let's learn how to query those files.
If we want to count the number of rows in each of those files, we can run the following query:

```python
chdb.query("""
SELECT
_file,
count() AS count,
formatReadableQuantity(count) AS readableCount
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet')
GROUP BY ALL
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
```

```text
┌─_file───────────────────────────────┬────count─┬─readableCount───┐
│ amazon_reviews_2013.snappy.parquet  │ 28034255 │ 28.03 million   │
│ amazon_reviews_1990s.snappy.parquet │   639532 │ 639.53 thousand │
│ amazon_reviews_2011.snappy.parquet  │  6112495 │ 6.11 million    │
│ amazon_reviews_2015.snappy.parquet  │ 41905631 │ 41.91 million   │
│ amazon_reviews_2012.snappy.parquet  │ 11541011 │ 11.54 million   │
│ amazon_reviews_2000s.snappy.parquet │ 14728295 │ 14.73 million   │
│ amazon_reviews_2014.snappy.parquet  │ 44127569 │ 44.13 million   │
│ amazon_reviews_2010.snappy.parquet  │  3868472 │ 3.87 million    │
└─────────────────────────────────────┴──────────┴─────────────────┘
```

We can also pass in the HTTPS URL of the S3 bucket, which returns the same results:

```python
chdb.query("""
SELECT
_file,
count() AS count,
formatReadableQuantity(count) AS readableCount
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/*.parquet')
GROUP BY ALL
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
```
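
The two forms differ only in how the bucket is addressed. As an illustration, the sketch below converts an `s3://` URI into the virtual-hosted HTTPS form used in the query above. `s3_uri_to_https` is a hypothetical helper, and it assumes you already know the bucket's region (`eu-west-3` for this bucket, as shown in the URL above).

```python
def s3_uri_to_https(uri: str, region: str) -> str:
    """Convert 's3://bucket/key' into 'https://bucket.s3.<region>.amazonaws.com/key'."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # Split on the first slash only, so wildcards in the key are left untouched.
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {uri!r}")
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


print(s3_uri_to_https("s3://datasets-documentation/amazon_reviews/*.parquet", "eu-west-3"))
# → https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/*.parquet
```

Plain string splitting is used instead of a URL parser so that glob characters such as `?` in the key aren't misread as the start of a query string.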

Let's have a look at the schema of these Parquet files using the `DESCRIBE` statement:

```python
chdb.query("""
DESCRIBE s3('s3://datasets-documentation/amazon_reviews/*.parquet')
SETTINGS describe_compact_output=1
""", 'PrettyCompact')
```

```text
    ┌─name──────────────┬─type─────────────┐
 1. │ review_date       │ Nullable(UInt16) │
 2. │ marketplace       │ Nullable(String) │
 3. │ customer_id       │ Nullable(UInt64) │
 4. │ review_id         │ Nullable(String) │
 5. │ product_id        │ Nullable(String) │
 6. │ product_parent    │ Nullable(UInt64) │
 7. │ product_title     │ Nullable(String) │
 8. │ product_category  │ Nullable(String) │
 9. │ star_rating       │ Nullable(UInt8)  │
10. │ helpful_votes     │ Nullable(UInt32) │
11. │ total_votes       │ Nullable(UInt32) │
12. │ vine              │ Nullable(Bool)   │
13. │ verified_purchase │ Nullable(Bool)   │
14. │ review_headline   │ Nullable(String) │
15. │ review_body       │ Nullable(String) │
    └───────────────────┴──────────────────┘
```

Let's now count the reviews and compute the average star rating for each product category. The query has no `ORDER BY` clause, so `LIMIT 10` returns ten arbitrary categories rather than the ten most-reviewed ones:

```python
chdb.query("""
SELECT product_category, count() AS reviews, round(avg(star_rating), 2) AS avg
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet')
GROUP BY ALL
LIMIT 10
""", 'PrettyCompact')
```

```text
    ┌─product_category─┬──reviews─┬──avg─┐
 1. │ Toys             │  4864056 │ 4.21 │
 2. │ Apparel          │  5906085 │ 4.11 │
 3. │ Luggage          │   348644 │ 4.22 │
 4. │ Kitchen          │  4880297 │ 4.21 │
 5. │ Books            │ 19530930 │ 4.34 │
 6. │ Outdoors         │  2302327 │ 4.24 │
 7. │ Video            │   380596 │ 4.19 │
 8. │ Grocery          │  2402365 │ 4.31 │
 9. │ Shoes            │  4366757 │ 4.24 │
10. │ Jewelry          │  1767667 │ 4.14 │
    └──────────────────┴──────────┴──────┘
```
```
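
To rank categories by review count, one option is to add `ORDER BY reviews DESC` before the `LIMIT`. The sketch below shows that variant; `TOP_CATEGORIES_SQL` and `top_categories` are hypothetical names, and no output is shown because this version of the query wasn't run for this guide.

```python
# Same aggregation as above, with an explicit ordering so that LIMIT 10
# keeps the ten categories with the most reviews.
TOP_CATEGORIES_SQL = """
SELECT product_category, count() AS reviews, round(avg(star_rating), 2) AS avg
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet')
GROUP BY ALL
ORDER BY reviews DESC
LIMIT 10
"""


def top_categories():
    """Run the ranked query and return the chDB result object."""
    # Imported here so the SQL text can be inspected without chDB installed.
    import chdb

    return chdb.query(TOP_CATEGORIES_SQL, "PrettyCompact")
```

In `ipython`, evaluating `top_categories()` should display the table in the same way as the earlier queries.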

## Querying files in a private S3 bucket

If we're querying files in a private S3 bucket, we need to pass in an access key and secret.
We can pass in those credentials to the `s3` table function:

```python
chdb.query("""
SELECT product_category, count() AS reviews, round(avg(star_rating), 2) AS avg
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet', 'access-key', 'secret')
GROUP BY ALL
LIMIT 10
""", 'PrettyCompact')
```

:::note
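This query won't work as written: `datasets-documentation` is a public bucket, and `access-key` and `secret` are placeholder values. Substitute the path to your own bucket and your own credentials.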
:::
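
Hard-coding credentials in a query string makes them easy to leak, so you may prefer to read them from the environment. The following is a minimal sketch rather than an official chDB feature: `sql_quote`, `s3_source`, and `count_rows` are hypothetical helpers, `my-bucket` is a made-up bucket name, and the code assumes the credentials are stored in the conventional `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.

```python
import os


def sql_quote(value: str) -> str:
    """Render a Python string as a single-quoted SQL string literal."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def s3_source(path: str, access_key_id: str, secret_access_key: str) -> str:
    """Build the s3(path, key, secret) call used in the FROM clause."""
    args = ", ".join(sql_quote(arg) for arg in (path, access_key_id, secret_access_key))
    return f"s3({args})"


def count_rows(path: str):
    """Count rows in a private bucket using credentials from the environment."""
    # Imported here so the string helpers can be used without chDB installed.
    import chdb

    source = s3_source(
        path,
        os.environ["AWS_ACCESS_KEY_ID"],
        os.environ["AWS_SECRET_ACCESS_KEY"],
    )
    return chdb.query(f"SELECT count() FROM {source}", "PrettyCompact")
```

For example, `count_rows('s3://my-bucket/data/*.parquet')` would raise a `KeyError` if either environment variable is missing, which fails early instead of sending a request without credentials.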

An alternative is to use [named collections](/docs/en/operations/named-collections), but chDB doesn't support this approach yet.
5 changes: 3 additions & 2 deletions sidebars.js
@@ -1244,8 +1244,9 @@ const sidebars = {
"en/chdb/guides/jupysql",
"en/chdb/guides/querying-pandas",
"en/chdb/guides/querying-apache-arrow",
- "en/chdb/guides/clickhouse-local",
- "en/chdb/guides/query-remote-clickhouse"
+ "en/chdb/guides/query-remote-clickhouse",
+ "en/chdb/guides/querying-s3-bucket",
+ "en/chdb/guides/clickhouse-local",
],
},
