-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
218 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,218 @@ | ||
# WN 0007 - Nested Computations | ||
|
||
- **Status**: *Draft* | ||
- **Discussions-To:** https://github.com/malloydata/malloy/discussions/1428, slack, gchat | ||
- **Author:** lloyd tabb | ||
|
||
Malloy produces nested data. To access nested data today, malloy implicitly unnests while guaranteeing computational accuracy. There are times when we'd like to make computations against nested data without unnesting (count the number of occurrences with a filter) or to transform the nested data or to unnest the data more than once (to find co-occurrence). | ||
|
||
Assume a table with the following structure | ||
|
||
orders.parquet: | ||
``` | ||
order_id | ||
order_time | ||
shipped_time | ||
status | ||
user_id | ||
items [] | ||
product_id | ||
product_description | ||
sale_price | ||
is_returned | ||
product_category | ||
``` | ||
|
||
## How it works today | ||
|
||
Today, if you simply reference values in items and they are automatically unnested. | ||
|
||
``` | ||
source: order is duckdb.table('orders.parquet) extend { | ||
view: by_user is { | ||
where: not items.is_returned | ||
group_by: user_id | ||
aggregate: revenue is items.sale_price.sum() | ||
} | ||
} | ||
``` | ||
|
||
This query is writtens as | ||
|
||
``` | ||
SELECT user_id, SUM(items.sale_price) | ||
FROM 'orders.parquet', UNNEST(items) as items | ||
WHERE NOT items.is_returned | ||
GROUP_BY 1 | ||
``` | ||
|
||
## But 'order_value' ? | ||
The problem with this is 'order_value' is not available as a dimension? Certainly it is an attribute of the order but with unnest, it isn't easily representable. Suppose, for example I wanted to look at a distribution of orders by value? This proves difficult. | ||
|
||
## Dimensional Subqueries | ||
|
||
In the example below, we're creating a dimmension `order_value`. I can be used like any other dimension. The expression for order_value written as a query with `items` as its source. There can only be one return value and the field named `return` is that value. | ||
|
||
``` | ||
source: orders is duckdb.table('orders.parquet) extend { | ||
dimension: order_value is items->{ | ||
aggregate: return is saleprice.sum() | ||
where: not is_returned | ||
} | ||
} | ||
run: orders -> { | ||
// bucket order by $10 | ||
group_by: order_value_tiered is floor(order_value/10)*10 | ||
// compute the number of orders and total order value for the bucket | ||
aggregate: | ||
order_count is count() | ||
total_revenue is order_value.sum() | ||
} | ||
``` | ||
|
||
The SQL for query is. | ||
|
||
``` | ||
SELECT | ||
FLOOR( | ||
(SELECT SUM(sale_price) FROM UNNEST(items) WHERE not is_returned) | ||
/10) * 10 | ||
as order_value_tiered, | ||
COUNT(*) as order_count | ||
SUM( | ||
(SELECT SUM(sale_price) FROM UNNEST(items) WHERE not is_returned) | ||
) as total__revenue | ||
FROM 'orders.parquet' as orders | ||
``` | ||
|
||
## Attributes | ||
Annother common use case is to test for attributes. Suppose for example we wanted to know if an order contained both Shoes and Socks. | ||
|
||
``` | ||
source: orders is duckdb.table('orders.parquet) extend { | ||
dimension: contains_shoes is items -> { | ||
aggregate: return is count() {where: product_category = 'SHOES} > 0 | ||
} | ||
dimension: contains_socks is items -> { | ||
aggregate: return is count() {where: product_category = 'SOCKS} > 0 | ||
} | ||
dimension: contains_shoes_and_socks is contains_socks and contains_socks | ||
} | ||
``` | ||
|
||
This makes two passes through the array. it also can be written as a single pass. | ||
|
||
``` | ||
source: orders is duckdb.table('orders.parquet) extend { | ||
dimension: contains_shoes_and_socks is items -> { | ||
aggregate: return is ( | ||
pick 1 when count() {where: product_category = 'SHOES} > 0 | ||
+ pick 1 when count() {where: product_category = 'SOCKS'} > 0 | ||
) = 2 | ||
} | ||
} | ||
``` | ||
|
||
## Returning Dimensions | ||
|
||
Suppose we want to know the main category of the order (the category name that generated the largest part of the sale). The query must return a single row and | ||
must have a column named 'return'. | ||
|
||
``` | ||
source: orders is duckdb.table('orders.parquet) extend { | ||
dimension: largest_category is items -> { | ||
group_by: return is product_category | ||
aggregate: total_sales is sale_price.sum() | ||
order_by: total_sales // impiled but shown for clairity | ||
limit: 1 | ||
} | ||
} | ||
``` | ||
|
||
## Nested Sources | ||
In this example 'items' is a *nested source*. We've seen that nested sources can be used as a source in a dimensional subquery but they can be used for other things. There are functions that return nested sources. Nested sources can be used in pipelines. | ||
|
||
## Joining Nested Sources | ||
There are lots of use cases for rejoining a nested source but the a simple example is to find co-occurance. | ||
|
||
The query below, product category pairs and orders by the most common pattern. | ||
|
||
``` | ||
source: orders is duckdb.table('orders.parquet) extend { | ||
primary_key: order_id | ||
join_unnest: items2 is items | ||
} | ||
run: orders -> { | ||
where: items.product_category != items2.product_category | ||
group_by: | ||
category1 is items.product_category | ||
category2 is items2.product_category | ||
aggregate: | ||
order_count is count() | ||
} | ||
``` | ||
|
||
The SQL for this is | ||
|
||
``` | ||
SELECT | ||
items.product_category as category1, | ||
items2.product_category as category2, | ||
count(distinct order_id) as order_count | ||
FROM 'orders.parquet', | ||
UNNEST(items) as items, | ||
UNNEST(items) as items2 | ||
WHERE items.product_category <> items2.product_category | ||
GROUP_BY 1,2 | ||
``` | ||
|
||
## Joined Nested Sources on Queries | ||
|
||
``` | ||
source: orders is duckdb.table('orders.parquet) extend { | ||
primary_key: order_id | ||
join_unnest: returned_items is items->{ | ||
project:* | ||
where: is_returned | ||
} | ||
} | ||
// how many orders have returned items in a given category. | ||
run: orders -> { | ||
returned_items.product_category | ||
aggregate: | ||
order_count is count() | ||
} | ||
``` | ||
|
||
|
||
## Functions that return nested sources. | ||
There are several intrinic functions that return nested sources. These functions can be used anywhere a nested value. Function `split` takes a string and converts it into a source with a single column 'value'. `generate_series` generate a series of numbers or dates with a single column value. | ||
|
||
|
||
``` | ||
source: orders is duckdb.table('orders.parquet) extend { | ||
primary_key: order_id | ||
join_unnest: description_words is | ||
split(items.product_description) | ||
} | ||
// how many orders have returned items in a given category. | ||
run: orders -> { | ||
group_by: word is description_words.value | ||
aggregate: order_count is count() | ||
} | ||
``` |