Minor fixes in dbt user guides
rfraposa committed May 4, 2022
1 parent 6ba0cd1 commit 3b9cde4
Showing 10 changed files with 138 additions and 119 deletions.
4 changes: 2 additions & 2 deletions docs/en/integrations/dbt/_category_.yml
@@ -1,8 +1,8 @@
position: 191
position: 42
label: 'dbt'
collapsible: true
collapsed: true
link:
type: generated-index
title: S3
title: dbt
slug: /en/integrations/dbt
8 changes: 4 additions & 4 deletions docs/en/integrations/dbt/dbt-connecting.md
@@ -17,9 +17,9 @@ description: Connecting dbt to ClickHouse

(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)
Enter a number: 1
16:53:21 No sample profile found for clickhouse.
16:53:21
Your new dbt project "imdb" was created!
For more information on how to configure the profiles.yml file,
@@ -28,13 +28,13 @@ description: Connecting dbt to ClickHouse
https://docs.getdbt.com/docs/configure-your-profile
```
2. `cd` into your project folder.
2. `cd` into your project folder:
```bash
cd imdb
```
3. At this point, you will need the text editor of your choice. In the examples below, we use the popular VSCode. Opening the IMDB directory, you should see a collection of yml and sql files.
3. At this point, you will need the text editor of your choice. In the examples below, we use the popular VSCode. Opening the IMDB directory, you should see a collection of yml and sql files:
<img src={require('./images/dbt_02.png').default} class="image" alt="New dbt project" style={{width: '100%'}}/>
106 changes: 54 additions & 52 deletions docs/en/integrations/dbt/dbt-incremental-model.md
@@ -1,5 +1,5 @@
---
sidebar_label: Incremental Materializations
sidebar_position: 6
description: Table materializations with dbt and ClickHouse
---
@@ -10,42 +10,44 @@ In the previous example, we created a table to materialize the model. This table

To illustrate this example, we will add the actor "Clicky McClickHouse", who will appear in an incredible 910 movies - ensuring he has appeared in more films than even [Mel Blanc](https://en.wikipedia.org/wiki/Mel_Blanc).

1. First, we modify our model to be of type incremental. This additional requires:
1. First, we modify our model to be of type incremental. This addition requires:

1. **unique_key** - To ensure the plugin can uniquely identify rows, we must provide a unique_key - in this case, the id field from our query will suffice. This ensures we will have no row duplicates in our materialized table. For more details on uniqueness constraints, see[ here](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models#defining-a-uniqueness-constraint-optional).
2. **Incremental filter** - We also need to tell dbt how it should identify which rows have changed on an incremental run. This is achieved by providing a delta expression. Typically this involves a timestamp for event data; hence our updated_at timestamp field. This column, which defaults to the value of now() when rows are inserted, allows new roles to be identified. Additionally, we need to identify the alternative case where new actors are added. Using the {{this}} variable, to denote the existing materialized table, this gives us the expression `where id > (select max(id) from {{ this }}) and updated_at > (select max(created_at) from {{this}})`. We embed this inside the `{% if is_incremental() %}` condition, ensuring it is only used on incremental runs and not when the table is first constructed. For more details on filtering rows for incremental models, see [here](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models#filtering-rows-on-an-incremental-run).
1. **unique_key** - To ensure the plugin can uniquely identify rows, we must provide a unique_key - in this case, the `id` field from our query will suffice. This ensures we will have no row duplicates in our materialized table. For more details on uniqueness constraints, see [here](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models#defining-a-uniqueness-constraint-optional).
2. **Incremental filter** - We also need to tell dbt how it should identify which rows have changed on an incremental run. This is achieved by providing a delta expression. Typically this involves a timestamp for event data; hence our updated_at timestamp field. This column, which defaults to the value of now() when rows are inserted, allows new roles to be identified. Using the {{ this }} variable to denote the existing materialized table, we get the expression `where id > (select max(id) from {{ this }}) and updated_at > (select max(created_at) from {{ this }})`. We embed this inside the `{% if is_incremental() %}` condition, ensuring it is only used on incremental runs and not when the table is first constructed. For more details on filtering rows for incremental models, see [this discussion in the dbt docs](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models#filtering-rows-on-an-incremental-run).

Update the file `actor_summary.sql` with the following:

```sql
{{ config(order_by='(updated_at, id, name)', engine='MergeTree()', materialized='incremental',
unique_key='id') }}
```
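
As a point of reference for where this filter lands, here is a minimal sketch of the tail of `actor_summary.sql` under the configuration above - the body of the SELECT is elided, and the filter mirrors the generated statement shown later on this page:

```sql
-- Sketch only: the full SELECT producing actor_summary is elided here.
select *
from actor_summary

{% if is_incremental() %}
-- Applied only on incremental runs; skipped when the table is first built.
where id > (select max(id) from {{ this }})
   or updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```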

Note that our model will only respond to updates and additions to the `roles` and `actors` tables. To respond to all tables, users are encouraged to split this model into multiple sub-models - each with its own incremental criteria. These models can in turn be referenced and connected. For further details on cross-referencing models see [here](https://docs.getdbt.com/reference/dbt-jinja-functions/ref).
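
As a hedged illustration of such cross-referencing (the sub-model names below are hypothetical and not part of this guide), dbt's `ref()` function connects models and orders their builds:

```sql
-- Sketch only: a parent model combining two hypothetical sub-models.
-- dbt resolves each ref() to the materialized relation and builds dependencies first.
select r.actor_id, r.num_roles, a.name
from {{ ref('roles_incremental') }} as r
join {{ ref('actors_incremental') }} as a on a.id = r.actor_id
```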

2. Execute a `dbt run` and confirm the results of the resulting table.
2. Execute a `dbt run` and confirm the results of the resulting table:

```bash
clickhouse-user@clickhouse:~/imdb$ dbt run
15:33:34 Running with dbt=1.0.4
15:33:34 Found 1 model, 0 tests, 1 snapshot, 0 analyses, 181 macros, 0 operations, 0 seed files, 6 sources, 0 exposures, 0 metrics
15:33:34
15:33:35 Concurrency: 1 threads (target='dev')
15:33:35
15:33:35 1 of 1 START incremental model imdb_dbt.actor_summary........................... [RUN]
15:33:41 1 of 1 OK created incremental model imdb_dbt.actor_summary...................... [OK in 6.33s]
15:33:41
15:33:41 Finished running 1 incremental model in 7.30s.
15:33:41
15:33:41 Completed successfully
15:33:41
15:33:41 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
```

```sql
SELECT * FROM imdb_dbt.actor_summary ORDER BY num_movies DESC LIMIT 5;
```

```response
+------+------------+----------+------------------+------+---------+-------------------+
|id |name |num_movies|avg_rank |genres|directors|updated_at |
+------+------------+----------+------------------+------+---------+-------------------+
@@ -57,13 +59,13 @@ To illustrate this example, we will add the actor "Clicky McClickHouse", who wil
+------+------------+----------+------------------+------+---------+-------------------+
```

3. We will now add data to our model to illustrate an incremental update. Add our actor "Clicky McClickHouse" to the `actors` table.
3. We will now add data to our model to illustrate an incremental update. Add our actor "Clicky McClickHouse" to the `actors` table:

```sql
INSERT INTO imdb.actors VALUES (845466, 'Clicky', 'McClickHouse', 'M');
```

4. Let's star Clicky in 910 random movies.
4. Let's star Clicky in 910 random movies:
```sql
INSERT INTO imdb.roles
@@ -72,7 +74,7 @@ To illustrate this example, we will add the actor "Clicky McClickHouse", who wil
LIMIT 910 OFFSET 10000;
```
5. Confirm he is indeed now the actor with the most appearances by querying the underlying source table and bypassing any dbt models.
5. Confirm he is indeed now the actor with the most appearances by querying the underlying source table and bypassing any dbt models:
```sql
SELECT id,
@@ -100,38 +102,41 @@ To illustrate this example, we will add the actor "Clicky McClickHouse", who wil
GROUP BY id
ORDER BY num_movies DESC
LIMIT 2;
```
```response
+------+-------------------+----------+------------------+------+---------+-------------------+
|id |name |num_movies|avg_rank |genres|directors|updated_at |
+------+-------------------+----------+------------------+------+---------+-------------------+
|845466|Clicky McClickHouse|910 |1.4687938697032283|21 |662 |2022-04-26 16:20:36|
|45332 |Mel Blanc |909 |5.7884792542982515|19 |148 |2022-04-26 16:17:42|
+------+-------------------+----------+------------------+------+---------+-------------------+
```
6. Execute a `dbt run` and confirm our model has been updated and matches the above results.
6. Execute a `dbt run` and confirm our model has been updated and matches the above results:
```bash
clickhouse-user@clickhouse:~/imdb$ dbt run
16:12:16 Running with dbt=1.0.4
16:12:16 Found 1 model, 0 tests, 1 snapshot, 0 analyses, 181 macros, 0 operations, 0 seed files, 6 sources, 0 exposures, 0 metrics
16:12:16
16:12:17 Concurrency: 1 threads (target='dev')
16:12:17
16:12:17 1 of 1 START incremental model imdb_dbt.actor_summary........................... [RUN]
16:12:24 1 of 1 OK created incremental model imdb_dbt.actor_summary...................... [OK in 6.82s]
16:12:24
16:12:24 Finished running 1 incremental model in 7.79s.
16:12:24
16:12:24 Completed successfully
16:12:24
16:12:24 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
```
```sql
SELECT * FROM imdb_dbt.actor_summary ORDER BY num_movies DESC LIMIT 2;
```
```response
+------+-------------------+----------+------------------+------+---------+-------------------+
|id |name |num_movies|avg_rank |genres|directors|updated_at |
+------+-------------------+----------+------------------+------+---------+-------------------+
@@ -145,42 +150,41 @@ We can identify the statements executed to achieve the above incremental update
```sql
SELECT event_time, query FROM system.query_log WHERE type='QueryStart' AND query LIKE '%dbt%'
AND event_time > subtractMinutes(now(), 15) ORDER BY event_time LIMIT 100;
```
Adjust the above query to the period of execution. We leave result inspection to the user but highlight the general strategy used by the plugin to perform incremental updates:
1. The plugin creates a temporary table `actor_summary__dbt_tmp` using the [Memory engine](https://clickhouse.com/docs/en/engines/table-engines/special/memory). Rows that have changed are streamed into this table.
```sql
create temporary table actor_summary__dbt_tmp
engine = Memory
order by ((updated_at, id, name))
as (with actor_summary as (SELECT id,
any(actor_name) as name,
uniqExact(movie_id) as num_movies,
avg(rank) as avg_rank,
uniqExact(genre) as genres,
uniqExact(director_name) as directors,
max(created_at) as updated_at
FROM (
SELECT imdb.actors.id as id,
concat(imdb.actors.first_name, ' ', imdb.actors.last_name) as actor_name,
imdb.movies.id as movie_id,
imdb.movies.rank as rank,
genre,
concat(imdb.directors.first_name, ' ', imdb.directors.last_name) as director_name,
created_at
FROM imdb.actors
JOIN imdb.roles ON imdb.roles.actor_id = imdb.actors.id
LEFT OUTER JOIN imdb.movies ON imdb.movies.id = imdb.roles.movie_id
LEFT OUTER JOIN imdb.genres ON imdb.genres.movie_id = imdb.movies.id
LEFT OUTER JOIN imdb.movie_directors ON imdb.movie_directors.movie_id = imdb.movies.id
LEFT OUTER JOIN imdb.directors ON imdb.directors.id = imdb.movie_directors.director_id
)
GROUP BY id)
select *
from actor_summary
@@ -189,8 +193,7 @@ Adjust the above query to the period of execution. We leave result inspection to
or updated_at > (select max(updated_at) from imdb_dbt.actor_summary));
```
2. The previous materialized table is renamed `actor_summary_old`. A new table `actor_summary` is created. The rows from the old table are, in turn, streamed from the old to new, with a check to make sure row ids do not exist in the temporary table. This effectively handles updates.
2. The previous materialized table is renamed `actor_summary_old`. A new table `actor_summary` is created. The rows from the old table are, in turn, streamed from the old to new, with a check to make sure row ids do not exist in the temporary table. This effectively handles updates:
```sql
insert into imdb_dbt.actor_summary ("id", "name", "num_movies", "avg_rank", "genres", "directors", "updated_at")
select "id", "name", "num_movies", "avg_rank", "genres", "directors", "updated_at"
@@ -199,12 +202,11 @@ Adjust the above query to the period of execution. We leave result inspection to
from actor_summary__dbt_tmp);
```
3. Finally, results from the temporary table are streamed into the new `actor_summary` table.
3. Finally, results from the temporary table are streamed into the new `actor_summary` table:
```sql
insert into imdb_dbt.actor_summary ("id", "name", "num_movies", "avg_rank", "genres", "directors", "updated_at")
select "id", "name", "num_movies", "avg_rank", "genres", "directors", "updated_at"
from actor_summary__dbt_tmp;
```
This strategy may encounter challenges on very large models. For further details see [Limitations](./dbt-limitations).
6 changes: 3 additions & 3 deletions docs/en/integrations/dbt/dbt-intro.md
@@ -1,12 +1,12 @@
---
sidebar_label: Introduction
sidebar_position: 1
description: Users can transform and model their data in ClickHouse using dbt
---

# ClickHouse and dbt

dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles materializing these select statements into objects in the database in the form of tables and views - performing the T of ELT. Users can create a model defined by a SELECT statement.
**dbt** (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles materializing these select statements into objects in the database in the form of tables and views - performing the T of ELT. Users can create a model defined by a SELECT statement.

Within dbt, these models can be cross-referenced and layered to allow the construction of higher-level concepts. The boilerplate SQL required to connect models is automatically generated. Furthermore, dbt identifies dependencies between models and ensures they are created in the appropriate order using a directed acyclic graph (DAG).

@@ -23,7 +23,7 @@ dbt provides 4 types of materialization:
* **ephemeral**: The model is not directly built in the database but is instead pulled into dependent models as common table expressions.
* **incremental**: The model is initially materialized as a table, and in subsequent runs, dbt inserts new rows and updates changed rows in the table.

Additional syntax and clauses define how these models should be updated if their underlying data changes. dbt generally recommends starting with the view materialization until performance becomes a concern. The table materialization provides a query time performance improvement by capturing the results of the model’s query as a table at the expense of increased storage. The incremental approach builds on this further to allow subsequent updates to the underlying data to be captured in the target table.
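
To make the choice concrete, a minimal sketch of how a model declares its materialization (the model body here is illustrative only):

```sql
-- Sketch only: start with a view, then switch the value to 'table' or
-- 'incremental' if query-time performance becomes a concern.
{{ config(materialized='view') }}

select id, first_name, last_name
from imdb.actors
```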

The [current plugin](https://github.com/silentsokolov/dbt-clickhouse) for ClickHouse supports the **view**, **table**, and **incremental** materializations. Ephemeral is not supported. The plugin also supports dbt [snapshots](https://docs.getdbt.com/docs/building-a-dbt-project/snapshots#check-strategy) and [seeds](https://docs.getdbt.com/docs/building-a-dbt-project/seeds) which we explore in this guide.
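
For example, a hedged sketch of a snapshot using the check strategy linked above (the file name, target schema, and checked columns are hypothetical):

```sql
-- Sketch only: snapshots/actor_summary_snapshot.sql
-- Records row changes over time by comparing the listed columns between runs.
{% snapshot actor_summary_snapshot %}
{{ config(target_schema='snapshots', unique_key='id',
          strategy='check', check_cols=['updated_at']) }}
select * from {{ ref('actor_summary') }}
{% endsnapshot %}
```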

