Skip to content

Commit

Permalink
Spark 3.3: Add doc for the changelog view procedure. (apache#7147)
Browse files Browse the repository at this point in the history
  • Loading branch information
flyrain authored Apr 11, 2023
1 parent 9f9e189 commit 982e493
Showing 1 changed file with 116 additions and 0 deletions.
116 changes: 116 additions & 0 deletions docs/spark-procedures.md
Original file line number Diff line number Diff line change
Expand Up @@ -587,3 +587,119 @@ Get all the snapshot ancestors by a particular snapshot
CALL spark_catalog.system.ancestors_of('db.tbl', 1)
CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl')
```
## Change Data Capture
### `create_changelog_view`
Creates a view that contains the changes from a given table.
#### Usage
| Argument Name | Required? | Type | Description |
|---------------|----------|------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `table` | ✔️ | string | Name of the source table for the changelog |
| `changelog_view` | | string | Name of the view to create |
| `options` | | map<string, string> | A map of Spark read options to use |
|`compute_updates`| | boolean | Whether to compute pre/post update images (see below for more information). Defaults to false. |
|`identifier_columns`| | array<string> | The list of identifier columns to compute updates. If the argument `compute_updates` is set to true and `identifier_columns` are not provided, the table’s current identifier fields will be used to compute updates. |
|`remove_carryovers`| | boolean | Whether to remove carry-over rows (see below for more information). Defaults to true. |
Here is a list of commonly used Spark read options:
* `start-snapshot-id`: the exclusive start snapshot ID. If not provided, it reads from the table’s first snapshot inclusively.
* `end-snapshot-id`: the inclusive end snapshot id, default to table's current snapshot.
* `start-timestamp`: the exclusive start timestamp. If not provided, it reads from the table’s first snapshot inclusively.
* `end-timestamp`: the inclusive end timestamp, default to table's current snapshot.
#### Output
| Output Name | Type | Description |
| ------------|------|----------------------------------------|
| `changelog_view` | string | The name of the created changelog view |
#### Examples
Create a changelog view `tbl_changes` based on the changes that happened between snapshot `1` (exclusive) and `2` (inclusive).
```sql
CALL spark_catalog.system.create_changelog_view(
table => 'db.tbl',
options => map('start-snapshot-id','1','end-snapshot-id', '2')
)
```
Create a changelog view `my_changelog_view` based on the changes that happened between timestamp `1678335750489` (exclusive) and `1678992105265` (inclusive).
```sql
CALL spark_catalog.system.create_changelog_view(
table => 'db.tbl',
options => map('start-timestamp','1678335750489','end-timestamp', '1678992105265'),
changelog_view => 'my_changelog_view'
)
```
Create a changelog view that computes updates based on the identifier columns `id` and `name`.
```sql
CALL spark_catalog.system.create_changelog_view(
table => 'db.tbl',
options => map('start-snapshot-id','1','end-snapshot-id', '2'),
identifier_columns => array('id', 'name')
)
```
Once the changelog view is created, you can query the view to see the changes that happened between the snapshots.
```sql
SELECT * FROM tbl_changes
```
```sql
SELECT * FROM tbl_changes where _change_type = 'INSERT' AND id = 3 ORDER BY _change_ordinal
```
Please note that the changelog view includes Change Data Capture(CDC) metadata columns
that provide additional information about the changes being tracked. These columns are:
- `_change_type`: the type of change. It has one of the following values: `INSERT`, `DELETE`, `UPDATE_BEFORE`, or `UPDATE_AFTER`.
- `_change_ordinal`: the order of changes
- `_commit_snapshot_id`: the snapshot ID where the change occurred
Here is an example of corresponding results. It shows that the first snapshot inserted 2 records, and the
second snapshot deleted 1 record.
| id | name |_change_type | _change_ordinal | _change_snapshot_id |
|---|--------|---|---|---|
|1 | Alice |INSERT |0 |5390529835796506035|
|2 | Bob |INSERT |0 |5390529835796506035|
|1 | Alice |DELETE |1 |8764748981452218370|
#### Carry-over Rows
The procedure removes the carry-over rows by default. Carry-over rows are the result of row-level operations(`MERGE`, `UPDATE` and `DELETE`)
when using copy-on-write. For example, given a file which contains row1 `(id=1, name='Alice')` and row2 `(id=2, name='Bob')`.
A copy-on-write delete of row2 would require erasing this file and preserving row1 in a new file. The changelog table
reports this as the following pair of rows, despite it not being an actual change to the table.
| id | name | _change_type |
|-----|-------|--------------|
| 1 | Alice | DELETE |
| 1 | Alice | INSERT |
By default, this view finds the carry-over rows and removes them from the result. User can disable this
behavior by setting the `remove_carryovers` option to `false`.
#### Pre/Post Update Images
The procedure computes the pre/post update images if configured. Pre/post update images are converted from a
pair of a delete row and an insert row. Identifier columns are used for determining whether an insert and a delete record
refer to the same row. If the two records share the same values for the identity columns they are considered to be before
and after states of the same row. You can either set identifier fields in the table schema or input them as the procedure parameters.
The following example shows pre/post update images computation with an identifier column(`id`), where a row deletion
and an insertion with the same `id` are treated as a single update operation. Specifically, suppose we have the following pair of rows:
| id | name | _change_type |
|-----|--------|--------------|
| 3 | Robert | DELETE |
| 3 | Dan | INSERT |
In this case, the procedure marks the row before the update as an `UPDATE_BEFORE` image and the row after the update
as an `UPDATE_AFTER` image, resulting in the following pre/post update images:
| id | name | _change_type |
|-----|--------|--------------|
| 3 | Robert | UPDATE_BEFORE|
| 3 | Dan | UPDATE_AFTER |

0 comments on commit 982e493

Please sign in to comment.