Notebooks for the 'Slowly Changing Dimensions (SCD) Type 2' online tech talk
The slowly changing dimension type two (SCD Type 2) is a classic data warehouse and star schema mainstay. This structure enables 'as-of' analytics over the point in time facts stored in the fact table(s). Customer, device, product, store, supplier are typical dimensions in a data warehouse. Facts such as orders and sales link dimensions together at the point in time the fact occured. However, dimensions, such as customers might change over time, customers may move, they may get reclassified as to their market segment as in this hypothetical demo scenario. The dimension is termed 'slowly changing' because the dimension doesn't change on a regular schedule.
When the data system receives the new customer information, the new customer dimension record is created. If an existing customer record of a matching key value exists, instead of over-writing the current record (a type 1 update) the current record is marked as in-active while preserving any foriegn key relationships to fact tables. The currrent record is also given an end date signifying the last date(time) the record was meaningful. A new record with the new information is inserted, with a first effect date (or start date) set to the effective date of the incoming customer information, often the job date(time) of the job running the load. The end date of the new record is set to NULL
to signify the end date is not known, or is set to a rather large (e.g. 12/31/9999) value to make coding between clauses easier. Thus the count of active customer records remains the same.
Additional constraints:
- It's important to perform the update and insert in the same transaction to maintain accuracy of the dimension table.
- Additional criteria include that there is only one natural key and active or current indicate is true and any point in time.
- The surrogate key must be unique within the table. Fact tables will join to to the surrogate key value (in most cases).
Big data systems have struggeled with the ability to update the prior customer record. To achieve scale and a lockless architecture, direct updates to storage were not permitted. Delta Lake changes this.
Delta Lake with ACID transactions make it much easier to reliably to perform UPDATE
, DELETE
and Upsert/Merge operations. Delta Lake introduces the MERGE INTO
operator to perform Upsert/Merge operations on Delta Lake tables
This demo will walk through loading a Customer dimension table and then selecting a percentage of the data to migrate to a 'SPORTS' market segment. We will then upsert/merge those modified records back into the customer_dim table and display the before & after summary of the market segments.