Initial Draft
lloydtabb committed Oct 23, 2023
1 parent 79a56b3 commit c1f9d5c
Showing 1 changed file with 218 additions and 0 deletions.
218 changes: 218 additions & 0 deletions wns/WN-0007/wn-0007.md
@@ -0,0 +1,218 @@
# WN 0007 - Nested Computations

- **Status**: *Draft*
- **Discussions-To:** https://github.com/malloydata/malloy/discussions/1428, slack, gchat
- **Author:** lloyd tabb

Malloy produces nested data. To access nested data today, Malloy implicitly unnests it while guaranteeing computational accuracy. There are times when we'd like to compute against nested data without unnesting (for example, to count the occurrences that match a filter), to transform the nested data, or to unnest the data more than once (to find co-occurrence).

Assume a table with the following structure:

orders.parquet:
```
order_id
order_time
shipped_time
status
user_id
items []
  product_id
  product_description
  sale_price
  is_returned
  product_category
```

## How it works today

Today, if you simply reference values in `items`, they are automatically unnested.

```
source: orders is duckdb.table('orders.parquet') extend {
  view: by_user is {
    where: not items.is_returned
    group_by: user_id
    aggregate: revenue is items.sale_price.sum()
  }
}
```

This query is written in SQL as:

```
SELECT user_id, SUM(items.sale_price)
FROM 'orders.parquet', UNNEST(items) AS items
WHERE NOT items.is_returned
GROUP BY 1
```
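To make the implicit unnest concrete, here is a minimal Python sketch of the same computation over in-memory rows. The sample orders and the `revenue_by_user` helper are hypothetical illustrations, not part of Malloy; the sketch only mirrors the SQL above, assuming each order carries a list of item records.

```python
from collections import defaultdict

# Hypothetical sample data shaped like orders.parquet (values are made up).
ORDERS = [
    {"order_id": 1, "user_id": "a", "items": [
        {"sale_price": 10.0, "is_returned": False},
        {"sale_price": 5.0, "is_returned": True},
    ]},
    {"order_id": 2, "user_id": "a", "items": [
        {"sale_price": 7.0, "is_returned": False},
    ]},
    {"order_id": 3, "user_id": "b", "items": [
        {"sale_price": 3.0, "is_returned": False},
    ]},
]


def revenue_by_user(orders):
    """Unnest items, drop returned ones, then sum sale_price per user."""
    revenue = defaultdict(float)
    for order in orders:
        for item in order["items"]:  # plays the role of UNNEST(items)
            if not item["is_returned"]:  # WHERE NOT items.is_returned
                revenue[order["user_id"]] += item["sale_price"]
    return dict(revenue)


print(revenue_by_user(ORDERS))
```

After the unnest, each row being processed is an item rather than an order, so a per-order quantity has no natural place in the result.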

## But what about 'order_value'?
The problem with this approach is that `order_value` is not available as a dimension. It is certainly an attribute of the order, but once `items` is unnested it isn't easily representable. Suppose, for example, that I wanted to look at the distribution of orders by value. This proves difficult.

## Dimensional Subqueries

In the example below, we're creating a dimension `order_value`. It can be used like any other dimension. The expression for `order_value` is written as a query with `items` as its source. There can be only one return value, and the field named `return` is that value.

```
source: orders is duckdb.table('orders.parquet') extend {
  dimension: order_value is items -> {
    aggregate: return is sale_price.sum()
    where: not is_returned
  }
}
run: orders -> {
  // bucket orders into $10 tiers
  group_by: order_value_tiered is floor(order_value / 10) * 10
  // compute the number of orders and total order value for each bucket
  aggregate:
    order_count is count()
    total_revenue is order_value.sum()
}
```

The SQL for this query is:

```
SELECT
  FLOOR(
    (SELECT SUM(sale_price) FROM UNNEST(items) WHERE NOT is_returned)
    / 10) * 10
    AS order_value_tiered,
  COUNT(*) AS order_count,
  SUM(
    (SELECT SUM(sale_price) FROM UNNEST(items) WHERE NOT is_returned)
  ) AS total_revenue
FROM 'orders.parquet' AS orders
GROUP BY 1
```
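The same idea can be sketched in plain Python: the dimensional subquery becomes a function that reduces one order's items to a single value, and the outer query groups on that value. The sample data and the names `order_value` and `orders_by_value_tier` are hypothetical stand-ins, and the sketch assumes an empty sum is 0, whereas SQL `SUM` over no rows returns NULL.

```python
import math
from collections import defaultdict

# Hypothetical sample data (values are made up).
ORDERS = [
    {"order_id": 1, "items": [
        {"sale_price": 12.0, "is_returned": False},
        {"sale_price": 9.0, "is_returned": True},
    ]},
    {"order_id": 2, "items": [{"sale_price": 18.0, "is_returned": False}]},
    {"order_id": 3, "items": [
        {"sale_price": 20.0, "is_returned": False},
        {"sale_price": 5.0, "is_returned": False},
    ]},
]


def order_value(order):
    """Scalar subquery over one order: sum of non-returned sale_price."""
    return sum(i["sale_price"] for i in order["items"] if not i["is_returned"])


def orders_by_value_tier(orders, width=10):
    """Bucket orders by order_value; return {tier: (order_count, total_revenue)}."""
    tiers = defaultdict(lambda: [0, 0.0])
    for order in orders:
        value = order_value(order)  # computed once per order, no unnest
        tier = math.floor(value / width) * width
        tiers[tier][0] += 1
        tiers[tier][1] += value
    return {tier: tuple(stats) for tier, stats in sorted(tiers.items())}


print(orders_by_value_tier(ORDERS))
```

Because the value is computed per order, the outer loop still iterates over orders, so counting orders per tier needs no de-duplication.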

## Attributes
Another common use case is to test for attributes. Suppose, for example, we wanted to know whether an order contained both shoes and socks.

```
source: orders is duckdb.table('orders.parquet') extend {
  dimension: contains_shoes is items -> {
    aggregate: return is count() {where: product_category = 'SHOES'} > 0
  }
  dimension: contains_socks is items -> {
    aggregate: return is count() {where: product_category = 'SOCKS'} > 0
  }
  dimension: contains_shoes_and_socks is contains_shoes and contains_socks
}
```

This makes two passes through the array. It can also be written as a single pass.

```
source: orders is duckdb.table('orders.parquet') extend {
  dimension: contains_shoes_and_socks is items -> {
    aggregate: return is (
      (pick 1 when count() {where: product_category = 'SHOES'} > 0 else 0)
      + (pick 1 when count() {where: product_category = 'SOCKS'} > 0 else 0)
    ) = 2
  }
}
```
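As a rough illustration of what "a single pass" means here, the Python sketch below walks an order's items once and tracks both categories together. The function name `contains_shoes_and_socks` is reused from the example only for readability; this is a hypothetical stand-in, not how Malloy would evaluate the dimension.

```python
def contains_shoes_and_socks(items):
    """Walk the items once, tracking both categories at the same time."""
    has_shoes = False
    has_socks = False
    for item in items:
        category = item["product_category"]
        has_shoes = has_shoes or category == "SHOES"
        has_socks = has_socks or category == "SOCKS"
    return has_shoes and has_socks


# Hypothetical item lists (values are made up).
both = [{"product_category": "SHOES"}, {"product_category": "SOCKS"}]
only_shoes = [{"product_category": "SHOES"}, {"product_category": "SHOES"}]

print(contains_shoes_and_socks(both), contains_shoes_and_socks(only_shoes))
```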

## Returning Dimensions

Suppose we want to know the main category of the order (the category that generated the largest share of the sale). The subquery must return a single row, and that row must have a column named `return`.

```
source: orders is duckdb.table('orders.parquet') extend {
  dimension: largest_category is items -> {
    group_by: return is product_category
    aggregate: total_sales is sale_price.sum()
    order_by: total_sales desc // implied, but shown for clarity
    limit: 1
  }
}
```
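A minimal Python sketch of the same group, sort, and limit-1 logic over one order's items follows. The helper name and sample items are hypothetical; ties are broken here by whichever category is seen first, which the note above does not specify, and an order with no items yields `None` as an assumed analogue of a NULL result.

```python
from collections import defaultdict


def largest_category(items):
    """Return the category with the highest total sale_price, or None."""
    totals = defaultdict(float)
    for item in items:
        totals[item["product_category"]] += item["sale_price"]
    if not totals:
        return None  # assumption: an empty subquery result maps to NULL
    # Equivalent to ordering by total_sales descending and keeping one row.
    return max(totals.items(), key=lambda pair: pair[1])[0]


# Hypothetical items for one order (values are made up).
ITEMS = [
    {"product_category": "SOCKS", "sale_price": 5.0},
    {"product_category": "SHOES", "sale_price": 50.0},
    {"product_category": "SOCKS", "sale_price": 10.0},
]

print(largest_category(ITEMS))
```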

## Nested Sources
In these examples, `items` is a *nested source*. We've seen that a nested source can be used as the source of a dimensional subquery, but nested sources can be used for other things as well. There are functions that return nested sources, and nested sources can be used in pipelines.

## Joining Nested Sources
There are lots of use cases for rejoining a nested source, but a simple example is finding co-occurrence.

The query below finds pairs of product categories that appear in the same order and ranks them by how common each pairing is.

```
source: orders is duckdb.table('orders.parquet') extend {
  primary_key: order_id
  join_unnest: items2 is items
}
run: orders -> {
  where: items.product_category != items2.product_category
  group_by:
    category1 is items.product_category
    category2 is items2.product_category
  aggregate:
    order_count is count()
}
```

The SQL for this is:

```
SELECT
  items.product_category AS category1,
  items2.product_category AS category2,
  COUNT(DISTINCT order_id) AS order_count
FROM 'orders.parquet',
  UNNEST(items) AS items,
  UNNEST(items) AS items2
WHERE items.product_category <> items2.product_category
GROUP BY 1, 2
```
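The double unnest can be pictured as pairing every item in an order with every other item in the same order. The Python sketch below is a hypothetical stand-in for that query: it collects order ids per category pair in a set, which plays the role of `count(distinct order_id)`, and the explicit sort by descending count is an assumption about the intended ordering.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample data (values are made up).
ORDERS = [
    {"order_id": 1, "items": [
        {"product_category": "SHOES"},
        {"product_category": "SOCKS"},
        {"product_category": "SOCKS"},
    ]},
    {"order_id": 2, "items": [
        {"product_category": "SHOES"},
        {"product_category": "SOCKS"},
    ]},
    {"order_id": 3, "items": [
        {"product_category": "SHOES"},
        {"product_category": "HATS"},
    ]},
]


def category_pairs(orders):
    """Count distinct orders for each ordered pair of differing categories."""
    orders_per_pair = defaultdict(set)
    for order in orders:
        # items x items within one order: the two UNNEST(items) clauses
        for first, second in product(order["items"], repeat=2):
            c1 = first["product_category"]
            c2 = second["product_category"]
            if c1 != c2:
                orders_per_pair[(c1, c2)].add(order["order_id"])
    counts = {pair: len(ids) for pair, ids in orders_per_pair.items()}
    # Most common pairs first; pair name breaks ties for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


for pair, order_count in category_pairs(ORDERS):
    print(pair, order_count)
```

As in the SQL, each pairing shows up twice, once as (A, B) and once as (B, A), because the filter only excludes equal categories.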

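## Joining Nested Sources from Queries

A nested source produced by a query can be joined as well. In the example below, `returned_items` is `items` filtered down to the items that were returned.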

```
source: orders is duckdb.table('orders.parquet') extend {
  primary_key: order_id
  join_unnest: returned_items is items -> {
    project: *
    where: is_returned
  }
}
// how many orders have returned items in a given category?
run: orders -> {
  group_by: returned_items.product_category
  aggregate:
    order_count is count()
}
```
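A small Python sketch of this query, under stated assumptions: the nested query is modeled as a per-order filter on `items`, and `count()` is assumed to count distinct orders rather than joined rows. The helper name and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data (values are made up).
ORDERS = [
    {"order_id": 1, "items": [
        {"product_category": "SHOES", "is_returned": True},
        {"product_category": "SHOES", "is_returned": True},
        {"product_category": "SOCKS", "is_returned": False},
    ]},
    {"order_id": 2, "items": [
        {"product_category": "SHOES", "is_returned": True},
        {"product_category": "HATS", "is_returned": True},
    ]},
    {"order_id": 3, "items": [
        {"product_category": "SOCKS", "is_returned": False},
    ]},
]


def orders_with_returns_by_category(orders):
    """For each category, count the orders with at least one returned item."""
    orders_per_category = defaultdict(set)
    for order in orders:
        # The nested query: keep only the returned items of this order.
        returned_items = [i for i in order["items"] if i["is_returned"]]
        for item in returned_items:
            orders_per_category[item["product_category"]].add(order["order_id"])
    return {cat: len(ids) for cat, ids in orders_per_category.items()}


print(orders_with_returns_by_category(ORDERS))
```

Order 1 has two returned SHOES items but is counted once for that category, which is the behaviour the set of order ids is meant to capture.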


## Functions That Return Nested Sources
There are several intrinsic functions that return nested sources. These functions can be used anywhere a nested source can be used. The function `split` takes a string and converts it into a source with a single column, `value`. `generate_series` generates a series of numbers or dates, also as a source with a single column, `value`.


```
source: orders is duckdb.table('orders.parquet') extend {
  primary_key: order_id
  join_unnest: description_words is
    split(items.product_description)
}
// how many orders contain a given word in their product descriptions?
run: orders -> {
  group_by: word is description_words.value
  aggregate: order_count is count()
}
```
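The sketch below shows, in plain Python, what joining the output of `split` amounts to: every description is expanded into one row per word, and orders are counted per word. The note does not say which delimiter `split` uses, so splitting on whitespace is an assumption here, as are the helper name, the sample data, and counting distinct orders.

```python
from collections import defaultdict

# Hypothetical sample data (values are made up).
ORDERS = [
    {"order_id": 1, "items": [
        {"product_description": "red running shoes"},
        {"product_description": "red socks"},
    ]},
    {"order_id": 2, "items": [
        {"product_description": "blue running shoes"},
    ]},
]


def orders_by_description_word(orders):
    """Count the distinct orders whose item descriptions contain each word."""
    orders_per_word = defaultdict(set)
    for order in orders:
        for item in order["items"]:
            # Assumed behaviour of split: one 'value' row per whitespace-separated word.
            for value in item["product_description"].split():
                orders_per_word[value].add(order["order_id"])
    return {word: len(ids) for word, ids in orders_per_word.items()}


print(orders_by_description_word(ORDERS))
```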
