Minor fixes in dbt user guides
rfraposa committed May 4, 2022
1 parent 6ba0cd1 commit 3b9cde4
Showing 10 changed files with 138 additions and 119 deletions.
4 changes: 2 additions & 2 deletions docs/en/integrations/dbt/_category_.yml
@@ -1,8 +1,8 @@
position: 191
position: 42
label: 'dbt'
collapsible: true
collapsed: true
link:
type: generated-index
title: S3
title: dbt
slug: /en/integrations/dbt
8 changes: 4 additions & 4 deletions docs/en/integrations/dbt/dbt-connecting.md
@@ -17,9 +17,9 @@ description: Connecting dbt to ClickHouse

(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)
Enter a number: 1
16:53:21 No sample profile found for clickhouse.
16:53:21
Your new dbt project "imdb" was created!
For more information on how to configure the profiles.yml file,
@@ -28,13 +28,13 @@ description: Connecting dbt to ClickHouse
https://docs.getdbt.com/docs/configure-your-profile
```
2. `cd` into your project folder.
2. `cd` into your project folder:
```bash
cd imdb
```
3. At this point, you will need the text editor of your choice. In the examples below, we use the popular VSCode. Opening the IMDB directory, you should see a collection of yml and sql files.
3. At this point, you will need the text editor of your choice. In the examples below, we use the popular VSCode. Opening the IMDB directory, you should see a collection of yml and sql files:
<img src={require('./images/dbt_02.png').default} class="image" alt="New dbt project" style={{width: '100%'}}/>
106 changes: 54 additions & 52 deletions docs/en/integrations/dbt/dbt-incremental-model.md
@@ -1,5 +1,5 @@
---
sidebar_label: Incremental Materializations
sidebar_position: 6
description: Table materializations with dbt and ClickHouse
---
@@ -10,42 +10,44 @@ In the previous example, we created a table to materialize the model. This table

To illustrate this example, we will add the actor "Clicky McClickHouse", who will appear in an incredible 910 movies - ensuring he has appeared in more films than even [Mel Blanc](https://en.wikipedia.org/wiki/Mel_Blanc).

1. First, we modify our model to be of type incremental. This additional requires:
1. First, we modify our model to be of type incremental. This addition requires:

1. **unique_key** - To ensure the plugin can uniquely identify rows, we must provide a unique_key - in this case, the id field from our query will suffice. This ensures we will have no row duplicates in our materialized table. For more details on uniqueness constraints, see[ here](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models#defining-a-uniqueness-constraint-optional).
2. **Incremental filter** - We also need to tell dbt how it should identify which rows have changed on an incremental run. This is achieved by providing a delta expression. Typically this involves a timestamp for event data; hence our updated_at timestamp field. This column, which defaults to the value of now() when rows are inserted, allows new roles to be identified. Additionally, we need to identify the alternative case where new actors are added. Using the {{this}} variable, to denote the existing materialized table, this gives us the expression `where id > (select max(id) from {{ this }}) and updated_at > (select max(created_at) from {{this}})`. We embed this inside the `{% if is_incremental() %}` condition, ensuring it is only used on incremental runs and not when the table is first constructed. For more details on filtering rows for incremental models, see [here](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models#filtering-rows-on-an-incremental-run).
1. **unique_key** - To ensure the plugin can uniquely identify rows, we must provide a unique_key - in this case, the `id` field from our query will suffice. This ensures we will have no row duplicates in our materialized table. For more details on uniqueness constraints, see [here](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models#defining-a-uniqueness-constraint-optional).
2. **Incremental filter** - We also need to tell dbt how it should identify which rows have changed on an incremental run. This is achieved by providing a delta expression. Typically this involves a timestamp for event data; hence our updated_at timestamp field. This column, which defaults to the value of now() when rows are inserted, allows new roles to be identified. Using the {{ this }} variable to denote the existing materialized table, we get the expression `where id > (select max(id) from {{ this }}) and updated_at > (select max(created_at) from {{ this }})`. We embed this inside the `{% if is_incremental() %}` condition, ensuring it is only used on incremental runs and not when the table is first constructed. For more details on filtering rows for incremental models, see [this discussion in the dbt docs](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models#filtering-rows-on-an-incremental-run).

Update the file `actor_summary.sql` with the following:

```sql
{{ config(order_by='(updated_at, id, name)', engine='MergeTree()', materialized='incremental',
unique_key='id') }}
```
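
As a point of reference for where this filter lands, here is a minimal sketch of the tail of `actor_summary.sql` under the configuration above - the body of the SELECT is elided, and the filter mirrors the generated statement shown later on this page:

```sql
-- Sketch only: the full SELECT producing actor_summary is elided here.
select *
from actor_summary

{% if is_incremental() %}
-- Applied only on incremental runs; skipped when the table is first built.
where id > (select max(id) from {{ this }})
   or updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```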

Note that our model will only respond to updates and additions to the `roles` and `actors` tables. To respond to all tables, users are encouraged to split this model into multiple sub-models - each with its own incremental criteria. These models can in turn be referenced and connected. For further details on cross-referencing models see [here](https://docs.getdbt.com/reference/dbt-jinja-functions/ref).
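
As a hedged illustration of such cross-referencing (the sub-model names below are hypothetical and not part of this guide), dbt's `ref()` function connects models and orders their builds:

```sql
-- Sketch only: a parent model combining two hypothetical sub-models.
-- dbt resolves each ref() to the materialized relation and builds dependencies first.
select r.actor_id, r.num_roles, a.name
from {{ ref('roles_incremental') }} as r
join {{ ref('actors_incremental') }} as a on a.id = r.actor_id
```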

2. Execute a `dbt run` and confirm the results of the resulting table.
2. Execute a `dbt run` and confirm the results of the resulting table:

```bash
clickhouse-user@clickhouse:~/imdb$ dbt run
15:33:34 Running with dbt=1.0.4
15:33:34 Found 1 model, 0 tests, 1 snapshot, 0 analyses, 181 macros, 0 operations, 0 seed files, 6 sources, 0 exposures, 0 metrics
15:33:34
15:33:35 Concurrency: 1 threads (target='dev')
15:33:35
15:33:35 1 of 1 START incremental model imdb_dbt.actor_summary........................... [RUN]
15:33:41 1 of 1 OK created incremental model imdb_dbt.actor_summary...................... [OK in 6.33s]
15:33:41
15:33:41 Finished running 1 incremental model in 7.30s.
15:33:41
15:33:41 Completed successfully
15:33:41
15:33:41 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
```

```sql
SELECT * FROM imdb_dbt.actor_summary ORDER BY num_movies DESC LIMIT 5;
```

```response
+------+------------+----------+------------------+------+---------+-------------------+
|id |name |num_movies|avg_rank |genres|directors|updated_at |
+------+------------+----------+------------------+------+---------+-------------------+
@@ -57,13 +59,13 @@ To illustrate this example, we will add the actor "Clicky McClickHouse", who wil
+------+------------+----------+------------------+------+---------+-------------------+
```

3. We will now add data to our model to illustrate an incremental update. Add our actor "Clicky McClickHouse" to the `actors` table.
3. We will now add data to our model to illustrate an incremental update. Add our actor "Clicky McClickHouse" to the `actors` table:

```sql
INSERT INTO imdb.actors VALUES (845466, 'Clicky', 'McClickHouse', 'M');
```

4. Let's star Clicky in 910 random movies.
4. Let's star Clicky in 910 random movies:
```sql
INSERT INTO imdb.roles
@@ -72,7 +74,7 @@ To illustrate this example, we will add the actor "Clicky McClickHouse", who wil
LIMIT 910 OFFSET 10000;
```
5. Confirm he is indeed now the actor with the most appearances by querying the underlying source table and bypassing any dbt models.
5. Confirm he is indeed now the actor with the most appearances by querying the underlying source table and bypassing any dbt models:
```sql
SELECT id,
@@ -100,38 +102,41 @@ To illustrate this example, we will add the actor "Clicky McClickHouse", who wil
GROUP BY id
ORDER BY num_movies DESC
LIMIT 2;
```
```response
+------+-------------------+----------+------------------+------+---------+-------------------+
|id |name |num_movies|avg_rank |genres|directors|updated_at |
+------+-------------------+----------+------------------+------+---------+-------------------+
|845466|Clicky McClickHouse|910 |1.4687938697032283|21 |662 |2022-04-26 16:20:36|
|45332 |Mel Blanc |909 |5.7884792542982515|19 |148 |2022-04-26 16:17:42|
+------+-------------------+----------+------------------+------+---------+-------------------+
```
6. Execute a `dbt run` and confirm our model has been updated and matches the above results.
6. Execute a `dbt run` and confirm our model has been updated and matches the above results:
```bash
clickhouse-user@clickhouse:~/imdb$ dbt run
16:12:16 Running with dbt=1.0.4
16:12:16 Found 1 model, 0 tests, 1 snapshot, 0 analyses, 181 macros, 0 operations, 0 seed files, 6 sources, 0 exposures, 0 metrics
16:12:16
16:12:17 Concurrency: 1 threads (target='dev')
16:12:17
16:12:17 1 of 1 START incremental model imdb_dbt.actor_summary........................... [RUN]
16:12:24 1 of 1 OK created incremental model imdb_dbt.actor_summary...................... [OK in 6.82s]
16:12:24
16:12:24 Finished running 1 incremental model in 7.79s.
16:12:24
16:12:24 Completed successfully
16:12:24
16:12:24 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
```
```sql
SELECT * FROM imdb_dbt.actor_summary ORDER BY num_movies DESC LIMIT 2;
```
```response
+------+-------------------+----------+------------------+------+---------+-------------------+
|id |name |num_movies|avg_rank |genres|directors|updated_at |
+------+-------------------+----------+------------------+------+---------+-------------------+
@@ -145,42 +150,41 @@ We can identify the statements executed to achieve the above incremental update
```sql
SELECT event_time, query FROM system.query_log WHERE type='QueryStart' AND query LIKE '%dbt%'
AND event_time > subtractMinutes(now(), 15) ORDER BY event_time LIMIT 100;
```
Adjust the above query to the period of execution. We leave result inspection to the user but highlight the general strategy used by the plugin to perform incremental updates:
1. The plugin creates a temporary table `actor_summary__dbt_tmp` using the [Memory engine](https://clickhouse.com/docs/en/engines/table-engines/special/memory). Rows that have changed are streamed into this table.
```sql
create temporary table actor_summary__dbt_tmp
engine = Memory
order by ((updated_at, id, name))
as (with actor_summary as (SELECT id,
any(actor_name) as name,
uniqExact(movie_id) as num_movies,
avg(rank) as avg_rank,
uniqExact(genre) as genres,
uniqExact(director_name) as directors,
max(created_at) as updated_at
FROM (
SELECT imdb.actors.id as id,
concat(imdb.actors.first_name, ' ', imdb.actors.last_name) as actor_name,
imdb.movies.id as movie_id,
imdb.movies.rank as rank,
genre,
concat(imdb.directors.first_name, ' ', imdb.directors.last_name) as director_name,
created_at
FROM imdb.actors
JOIN imdb.roles ON imdb.roles.actor_id = imdb.actors.id
LEFT OUTER JOIN imdb.movies ON imdb.movies.id = imdb.roles.movie_id
LEFT OUTER JOIN imdb.genres ON imdb.genres.movie_id = imdb.movies.id
LEFT OUTER JOIN imdb.movie_directors ON imdb.movie_directors.movie_id = imdb.movies.id
LEFT OUTER JOIN imdb.directors ON imdb.directors.id = imdb.movie_directors.director_id
)
GROUP BY id)
select *
from actor_summary
@@ -189,8 +193,7 @@ Adjust the above query to the period of execution. We leave result inspection to
or updated_at > (select max(updated_at) from imdb_dbt.actor_summary));
```
2. The previous materialized table is renamed `actor_summary_old`. A new table `actor_summary` is created. The rows from the old table are, in turn, streamed from the old to new, with a check to make sure row ids do not exist in the temporary table. This effectively handles updates.
2. The previous materialized table is renamed `actor_summary_old`. A new table `actor_summary` is created. The rows from the old table are, in turn, streamed from the old to new, with a check to make sure row ids do not exist in the temporary table. This effectively handles updates:
```sql
insert into imdb_dbt.actor_summary ("id", "name", "num_movies", "avg_rank", "genres", "directors", "updated_at")
select "id", "name", "num_movies", "avg_rank", "genres", "directors", "updated_at"
@@ -199,12 +202,11 @@ Adjust the above query to the period of execution. We leave result inspection to
from actor_summary__dbt_tmp);
```
3. Finally, results from the temporary table are streamed into the new `actor_summary` table.
3. Finally, results from the temporary table are streamed into the new `actor_summary` table:
```sql
insert into imdb_dbt.actor_summary ("id", "name", "num_movies", "avg_rank", "genres", "directors", "updated_at")
select "id", "name", "num_movies", "avg_rank", "genres", "directors", "updated_at"
from actor_summary__dbt_tmp;
```
This strategy may encounter challenges on very large models. For further details see [Limitations](./dbt-limitations).
6 changes: 3 additions & 3 deletions docs/en/integrations/dbt/dbt-intro.md
@@ -1,12 +1,12 @@
---
sidebar_label: Introduction
sidebar_position: 1
description: Users can transform and model their data in ClickHouse using dbt
---

# ClickHouse and dbt

dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles materializing these select statements into objects in the database in the form of tables and views - performing the T of ELT. Users can create a model defined by a SELECT statement.
**dbt** (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles materializing these select statements into objects in the database in the form of tables and views - performing the T of ELT. Users can create a model defined by a SELECT statement.

Within dbt, these models can be cross-referenced and layered to allow the construction of higher-level concepts. The boilerplate SQL required to connect models is automatically generated. Furthermore, dbt identifies dependencies between models and ensures they are created in the appropriate order using a directed acyclic graph (DAG).

@@ -23,7 +23,7 @@ dbt provides 4 types of materialization:
* **ephemeral**: The model is not directly built in the database but is instead pulled into dependent models as common table expressions.
* **incremental**: The model is initially materialized as a table, and in subsequent runs, dbt inserts new rows and updates changed rows in the table.

Additional syntax and clauses define how these models should be updated if their underlying data changes. dbt generally recommends starting with the view materialization until performance becomes a concern. The table materialization provides a query time performance improvement by capturing the results of the model’s query as a table at the expense of increased storage. The incremental approach builds on this further to allow subsequent updates to the underlying data to be captured in the target table.
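
To make the choice concrete, a minimal sketch of how a model declares its materialization (the model body here is illustrative only):

```sql
-- Sketch only: start with a view, then switch the value to 'table' or
-- 'incremental' if query-time performance becomes a concern.
{{ config(materialized='view') }}

select id, first_name, last_name
from imdb.actors
```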

The [current plugin](https://github.com/silentsokolov/dbt-clickhouse) for ClickHouse supports the **view**, **table**, and **incremental** materializations. Ephemeral is not supported. The plugin also supports dbt [snapshots](https://docs.getdbt.com/docs/building-a-dbt-project/snapshots#check-strategy) and [seeds](https://docs.getdbt.com/docs/building-a-dbt-project/seeds) which we explore in this guide.
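
For example, a hedged sketch of a snapshot using the check strategy linked above (the file name, target schema, and checked columns are hypothetical):

```sql
-- Sketch only: snapshots/actor_summary_snapshot.sql
-- Records row changes over time by comparing the listed columns between runs.
{% snapshot actor_summary_snapshot %}
{{ config(target_schema='snapshots', unique_key='id',
          strategy='check', check_cols=['updated_at']) }}
select * from {{ ref('actor_summary') }}
{% endsnapshot %}
```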

