Generating Surrogate Keys for your Data Lakehouse with Spark SQL and Delta Lake

Introduction to Surrogate Key Generation for the Lakehouse

2020-08-25 | Watch the video | This folder contains the notebooks used in this tutorial.

  1. History of Surrogate Keys

  2. Why is it hard

  3. Requirements for a good strategy

  • Embarrassingly parallel
  • SQL based
  • No Collisions

  4. Strategies

  • monotonically_increasing_id()
  • row_number() rank over a window
  • zipWithIndex()
  • zipWithUniqueId()
  • Row Hash with hash()
  • Row Hash with md5()

  5. Validation

  • Uniqueness / Collisions
  • Distributability (Performance)
  • Distribution (Skewness)

  6. Summary

History of Surrogate Keys

"A surrogate key (or synthetic key, entity identifier, system-generated key, database sequence number, factless key, technical key, or arbitrary unique identifier[citation needed]) in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural (or business) key which is derived from application data.[1]" https://en.wikipedia.org/wiki/Surrogate_key

Ralph Kimball: "Actually, a surrogate key in a data warehouse is more than just a substitute for a natural key. In a data warehouse, a surrogate key is a necessary generalization of the natural production key and is one of the basic elements of data warehouse design. Let’s be very clear: Every join between dimension tables and fact tables in a data warehouse environment should be based on surrogate keys, not natural keys. It is up to the data extract logic to systematically look up and replace every incoming natural key with a data warehouse surrogate key each time either a dimension record or a fact record is brought into the data warehouse environment."

"A surrogate key is frequently a sequential number (e.g. a Sybase or SQL Server "identity column", a PostgreSQL or Informix serial, an Oracle or SQL Server SEQUENCE or a column defined with AUTO_INCREMENT in MySQL). Some databases provide UUID/GUID as a possible data type for surrogate keys (e.g. PostgreSQL UUID or SQL Server UNIQUEIDENTIFIER)."

Approaches to generating surrogate keys include monotonically_increasing_id(), row_number() over a window, the RDD methods zipWithIndex() and zipWithUniqueId(), and row hashes with hash() or md5() - each of these strategies is explored below.
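As a quick, hedged illustration of the simplest built-in option, here is a minimal PySpark sketch using monotonically_increasing_id(); the dim_team_members DataFrame and its columns are hypothetical examples, and the generated keys are unique but not contiguous:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("surrogate-keys").getOrCreate()

# Hypothetical dimension data with a natural (business) key.
dim_team_members = spark.createDataFrame(
    [("Samantha Carter", "SG-1"), ("Jack O'Neill", "SG-1")],
    ["name", "team"],
)

# monotonically_increasing_id() is computed independently per partition,
# so the surrogate keys are unique and increasing but not contiguous.
dim_with_sk = dim_team_members.withColumn(
    "member_sk", F.monotonically_increasing_id()
)
dim_with_sk.show()
```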

Why are Surrogate Keys Hard?

We're going back to the past...and getting a tad geeky (apologies in advance...not really).

SKs are hard even for single-node (traditional) data warehousing

In databases, using increment/identity/GUID/UUID columns allows you to generate a unique surrogate key (SK) for that database. If you sharded the databases (i.e. multiple databases with the same schema, each processing a different shard of the data), the SK would be unique only within that database. This could be mitigated with application code (or joins) that concatenated the shard database identifier with the SK, but this would have an undesirable performance impact.

Even if this approach worked, what if the same identifier showed up in two different shards? This would result in different SKs for the same identifier, which would be an undesirable business outcome. For example:

  • Initially, Samantha Carter is based on Earth (Tau'ri) and is part of the SG-1 team.
  • After 15 years of successfully saving the Milky Way galaxy, she is now leading the Atlantis team in the Pegasus galaxy.
  • In this example, the identifier is the PK and Name (technically just the name), and we're updating Carter's demographic information.

Shard | SK   | PK | Name            | Team     | Location / Point of Origin
------|------|----|-----------------|----------|---------------------------
0     | 100  | 1  | Samantha Carter | SG-1     | Tau'ri
0     | 102  | 1  | Samantha Carter | Atlantis | Subido
3     | 1182 | 1  | Samantha Carter | Atlantis | Subido
  • But what if our sharded databases received Carter's updates at the same time and the data was not distributed properly (whether due to design, scale, the multiverse, etc.)? This would result in two logically different surrogate keys for Carter.

  • As noted, this is undesirable, as you would want to associate the same SK with the same identifier - i.e. with Carter. And while this may seem obvious to avoid now, it was a real problem with traditional databases.

Perhaps let's centralize it all?

One short-lived approach was to have a central server or service whose only job was to provide unique SKs for all values. The problems with this approach included:

  • As the number of database transactions increased, so did the load to generate the SKs.
  • Even if the service could handle SK creation, it became resource intensive for SK lookups.
  • It ultimately required a resource-intensive OLTP database to continually generate or look up values for each provided identifier, saturating the server's resources or the network in between.

Let's Hash our way out of this

A potential solution for this would be to create a hash of the identifier (often multiple columns). The key advantage of this approach is that, provided the same hash function (and key columns) are used, any sharded database (or distributed system, for that matter) will get the exact same hash value for the same identifier. Thus, if you had two shards, they would run the same hash function and get something like the value below.

Hash(UserID) → Hash_of_the_UserID   (the same value on every shard that applies the same hash function)
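As a hedged PySpark sketch of this idea (the dim_team_members DataFrame, its columns, and the "||" separator are illustrative assumptions), hashing the concatenated business-key columns with Spark's md5() or 32-bit hash() function produces the same surrogate key on any cluster or shard that applies the same function to the same identifier:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("surrogate-keys").getOrCreate()

# Hypothetical dimension rows keyed by a multi-column business identifier.
dim_team_members = spark.createDataFrame(
    [("Samantha Carter", "SG-1"), ("Samantha Carter", "Atlantis")],
    ["name", "team"],
)

# Deterministic surrogate keys: any shard or cluster that applies the
# same hash to the same identifier columns gets the same value back.
dim_hashed = (
    dim_team_members
    .withColumn("sk_md5", F.md5(F.concat_ws("||", "name", "team")))  # 128-bit hex string
    .withColumn("sk_hash", F.hash("name", "team"))                   # 32-bit int (Murmur3)
)
dim_hashed.show(truncate=False)
```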

The potential problem with this approach, though, is hash collisions, where two different inputs result in the same output hash value, e.g.

Hash(UserID_1) → Hash_of_the_UserID

Hash(UserID_2) → Hash_of_the_UserID   (different inputs, same hash value)

This issue is also known as the "birthday problem", where two different UserIDs end up having the same Hash_of_the_UserID. For example, if you use a 32-bit hash (the equivalent of converting your RDBMS HashBytes hash value to an integer) with 100,000 users, we can use the Taylor series to approximate the chance (p) of a collision:

p ≈ 1 − e^(−n(n−1)/(2·2^32)) = 1 − e^(−(100,000 × 99,999)/(2 × 2^32)) = 68.787%

If you use a 64-bit hash (the equivalent of converting your RDBMS HashBytes hash value to big integer), the chance of collision essentially becomes zero for 100,000 users. It isn’t until you reach 100 million users that the chance of collision climbs to 0.027%.

Note: the Taylor series approximates the chance of only one collision. To determine the chances of more than one collision, a general approximation is to use the binomial distribution, based on the presumption that a good hash function imitates a uniform random spread over the hash space (that is, every hash value has a fixed chance of hitting something already in the set). Therefore, when using a 32-bit hash with 100,000 customers, the probability of exactly 2 hash collisions is 26.42% and the probability of 2 or more hash collisions is 67.56%!
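To make the arithmetic concrete, here is a small Python sketch of the Taylor-series approximation p ≈ 1 − e^(−n(n−1)/(2d)) used above; the function name is a hypothetical helper and the printed values are approximations:

```python
import math

def collision_probability(n_keys: int, hash_bits: int) -> float:
    """Taylor-series (birthday-problem) approximation of the chance of
    a hash collision among n_keys values in a 2**hash_bits hash space."""
    d = 2 ** hash_bits
    return 1.0 - math.exp(-n_keys * (n_keys - 1) / (2.0 * d))

# 32-bit hash, 100,000 users    -> roughly a 68.8% chance of a collision
print(f"{collision_probability(100_000, 32):.3%}")

# 64-bit hash, 100,000 users    -> essentially zero
print(f"{collision_probability(100_000, 64):.10%}")

# 64-bit hash, 100 million users -> roughly 0.027%
print(f"{collision_probability(100_000_000, 64):.3%}")
```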

In the past, we would solve the problem by using 64-bit hashes to minimize the likelihood of collisions. When working with hundreds of millions of hashes, the likelihood of collision is admittedly pretty low (0.027%). But what if you're working with billions or trillions of hashes? Distribution to the rescue...not really!

Wrap Up

  • monotonically_increasing_id()
  • zipWithIndex()
  • zipWithUniqueId()
  • row_number()
  • hash()
  • md5()
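As a hedged, minimal sketch of the remaining strategies listed above (the dim_team_members DataFrame and its columns are hypothetical), row_number() assigns a dense sequence over a global window, while the RDD methods zipWithIndex() and zipWithUniqueId() pair each row with an index that is either contiguous or unique per partition:

```python
from pyspark.sql import SparkSession, Row, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("surrogate-keys").getOrCreate()

# Hypothetical dimension rows with a natural (business) key.
dim_team_members = spark.createDataFrame(
    [("Samantha Carter", "SG-1"), ("Jack O'Neill", "SG-1")],
    ["name", "team"],
)

# row_number(): dense, gap-free keys, but the unpartitioned window
# moves every row into a single partition.
with_row_number = dim_team_members.withColumn(
    "sk", F.row_number().over(Window.orderBy("name"))
)

# zipWithIndex(): contiguous 0..N-1 indexes; requires an extra pass
# to count the rows in each partition.
with_zip_index = (
    dim_team_members.rdd.zipWithIndex()
    .map(lambda pair: Row(**pair[0].asDict(), sk=pair[1]))
    .toDF()
)

# zipWithUniqueId(): ids of the form k, k+n, k+2n, ... per partition
# (n = number of partitions); no extra pass, but not contiguous.
with_zip_unique = (
    dim_team_members.rdd.zipWithUniqueId()
    .map(lambda pair: Row(**pair[0].asDict(), sk=pair[1]))
    .toDF()
)

with_row_number.show()
with_zip_index.show()
with_zip_unique.show()
```

Which strategy fits best depends on the validation criteria covered in the notebooks: uniqueness / collisions, distributability (performance), and distribution (skewness).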