Generating Surrogate Keys for your Data Lakehouse with Spark SQL and Delta Lake

Introduction to Surrogate Key Generation for the Lakehouse

2020-08-25 | Watch the video | This folder contains the notebooks used in this tutorial.

  1. History of Surrogate Keys

  2. Why is it hard

  3. Requirements for a good strategy

  • Embarrassingly parallel
  • SQL based
  • No Collisions

  4. Strategies

  • monotonically_increasing_id()
  • row_number() rank over a window
  • zipWithIndex()
  • zipWithUniqueId()
  • Row Hash with hash()
  • Row Hash with md5()

  5. Validation

  • Uniqueness / Collisions
  • Distributability (Performance)
  • Distribution (Skewness)

  6. Summary

History of Surrogate Keys

"A surrogate key (or synthetic key, entity identifier, system-generated key, database sequence number, factless key, technical key, or arbitrary unique identifier[citation needed]) in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural (or business) key which is derived from application data.[1]" https://en.wikipedia.org/wiki/Surrogate_key

Ralph Kimball: "Actually, a surrogate key in a data warehouse is more than just a substitute for a natural key. In a data warehouse, a surrogate key is a necessary generalization of the natural production key and is one of the basic elements of data warehouse design. Let’s be very clear: Every join between dimension tables and fact tables in a data warehouse environment should be based on surrogate keys, not natural keys. It is up to the data extract logic to systematically look up and replace every incoming natural key with a data warehouse surrogate key each time either a dimension record or a fact record is brought into the data warehouse environment."

"A surrogate key is frequently a sequential number (e.g. a Sybase or SQL Server "identity column", a PostgreSQL or Informix serial, an Oracle or SQL Server SEQUENCE or a column defined with AUTO_INCREMENT in MySQL). Some databases provide UUID/GUID as a possible data type for surrogate keys (e.g. PostgreSQL UUID or SQL Server UNIQUEIDENTIFIER)."

Approaches to generating surrogate keys include monotonically_increasing_id(), row_number() over a window, the RDD methods zipWithIndex() and zipWithUniqueId(), and row hashes with hash() or md5() - each of these strategies is explored below.
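As a quick, hedged illustration of the simplest built-in option, here is a minimal PySpark sketch using monotonically_increasing_id(); the dim_team_members DataFrame and its columns are hypothetical examples, and the generated keys are unique but not contiguous:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("surrogate-keys").getOrCreate()

# Hypothetical dimension data with a natural (business) key.
dim_team_members = spark.createDataFrame(
    [("Samantha Carter", "SG-1"), ("Jack O'Neill", "SG-1")],
    ["name", "team"],
)

# monotonically_increasing_id() is computed independently per partition,
# so the surrogate keys are unique and increasing but not contiguous.
dim_with_sk = dim_team_members.withColumn(
    "member_sk", F.monotonically_increasing_id()
)
dim_with_sk.show()
```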

Why are Surrogate Keys Hard?

We're going back to the past...and getting a tad geeky (apologies in advance...not really).

SKs are hard even for single-node (traditional) data warehousing

In databases, using increment/identity/GUID/UUID columns allows you to generate a unique surrogate key (SK) for that database. If you sharded the databases (i.e. multiple databases with the same schema, each processing a different shard of the data), the SK would be unique only within that database. This could be mitigated with application code (or joins) that concatenated the shard database identifier with the SK, but this would have an undesirable performance impact.

Even if this approach worked, what if the same identifier showed up in two different shards? This would result in different SKs for the same identifier, which would be an undesirable business outcome. For example:

  • Initially, Samantha Carter is based on Earth (Tau'ri) and is part of the SG-1 team.
  • After 15 years of successfully saving the Milky Way galaxy, she is now leading the Atlantis team in the Pegasus galaxy.
  • In this example, the identifier is the PK and Name (technically just the name), and we're updating Carter's demographic information.

Shard | SK   | PK | Name            | Team     | Location / Point of Origin
------|------|----|-----------------|----------|---------------------------
0     | 100  | 1  | Samantha Carter | SG-1     | Tau'ri
0     | 102  | 1  | Samantha Carter | Atlantis | Subido
3     | 1182 | 1  | Samantha Carter | Atlantis | Subido
  • But what if our sharded databases received Carter's updates at the same time and the data was not distributed properly (whether due to design, scale, the multiverse, etc.)? This would result in two logically different surrogate keys for Carter.

  • As noted, this is undesirable, as you would want to associate the same SK with the same identifier - i.e. with Carter. And while this may seem obvious to avoid now, it was a real problem with traditional databases.

Perhaps let's centralize it all?

One short-lived approach was to have a central server or service whose only job was to provide unique SKs for all values. The problems with this approach included:

  • As the number of database transactions increased, so did the load to generate the SKs.
  • Even if the service could handle SK creation, it became resource intensive for SK lookups.
  • It ultimately required a resource-intensive OLTP database to continually generate or look up values for each provided identifier, saturating the server's resources or the network in between.

Let's Hash our way out of this

A potential solution for this would be to create a hash of the identifier (often multiple columns). The key advantage of this approach is that, provided the same hash function (and key columns) are used, any sharded database (or distributed system, for that matter) will get the exact same hash value for the same identifier. Thus, if you had two shards, they would run the same hash function and get something like the value below.

Hash(UserID) → Hash_of_the_UserID   (the same value on every shard that applies the same hash function)
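As a hedged PySpark sketch of this idea (the dim_team_members DataFrame, its columns, and the "||" separator are illustrative assumptions), hashing the concatenated business-key columns with Spark's md5() or 32-bit hash() function produces the same surrogate key on any cluster or shard that applies the same function to the same identifier:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("surrogate-keys").getOrCreate()

# Hypothetical dimension rows keyed by a multi-column business identifier.
dim_team_members = spark.createDataFrame(
    [("Samantha Carter", "SG-1"), ("Samantha Carter", "Atlantis")],
    ["name", "team"],
)

# Deterministic surrogate keys: any shard or cluster that applies the
# same hash to the same identifier columns gets the same value back.
dim_hashed = (
    dim_team_members
    .withColumn("sk_md5", F.md5(F.concat_ws("||", "name", "team")))  # 128-bit hex string
    .withColumn("sk_hash", F.hash("name", "team"))                   # 32-bit int (Murmur3)
)
dim_hashed.show(truncate=False)
```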

The potential problem with this approach, though, is hash collisions, where two different inputs result in the same output hash value, e.g.

Hash(UserID_1) → Hash_of_the_UserID

Hash(UserID_2) → Hash_of_the_UserID   (different inputs, same hash value)

This issue is also known as the "birthday problem", where two different UserIDs end up having the same Hash_of_the_UserID. For example, if you use a 32-bit hash (the equivalent of converting your RDBMS HashBytes hash value to an integer) with 100,000 users, we can use the Taylor series to approximate the chance (p) of a collision:

p ≈ 1 − e^(−n(n−1)/(2·2^32)) = 1 − e^(−(100,000 × 99,999)/(2 × 2^32)) = 68.787%

If you use a 64-bit hash (the equivalent of converting your RDBMS HashBytes hash value to big integer), the chance of collision essentially becomes zero for 100,000 users. It isn’t until you reach 100 million users that the chance of collision climbs to 0.027%.

Note: the Taylor series approximates the chance of only one collision. To determine the chances of more than one collision, a general approximation is to use the binomial distribution, based on the presumption that a good hash function imitates a uniform random spread over the hash space (that is, every hash value has a fixed chance of hitting something already in the set). Therefore, when using a 32-bit hash with 100,000 customers, the probability of exactly 2 hash collisions is 26.42% and the probability of 2 or more hash collisions is 67.56%!
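To make the arithmetic concrete, here is a small Python sketch of the Taylor-series approximation p ≈ 1 − e^(−n(n−1)/(2d)) used above; the function name is a hypothetical helper and the printed values are approximations:

```python
import math

def collision_probability(n_keys: int, hash_bits: int) -> float:
    """Taylor-series (birthday-problem) approximation of the chance of
    a hash collision among n_keys values in a 2**hash_bits hash space."""
    d = 2 ** hash_bits
    return 1.0 - math.exp(-n_keys * (n_keys - 1) / (2.0 * d))

# 32-bit hash, 100,000 users    -> roughly a 68.8% chance of a collision
print(f"{collision_probability(100_000, 32):.3%}")

# 64-bit hash, 100,000 users    -> essentially zero
print(f"{collision_probability(100_000, 64):.10%}")

# 64-bit hash, 100 million users -> roughly 0.027%
print(f"{collision_probability(100_000_000, 64):.3%}")
```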

In the past, we would solve the problem by using 64-bit hashes to minimize the likelihood of collisions. When working with hundreds of millions of hashes, the likelihood of collision is admittedly pretty low (0.027%). But what if you're working with billions or trillions of hashes? Distribution to the rescue...not really!

Wrap Up

  • monotonically_increasing_id()
  • zipWithIndex()
  • zipWithUniqueId()
  • row_number()
  • hash()
  • md5()
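As a hedged, minimal sketch of the remaining strategies listed above (the dim_team_members DataFrame and its columns are hypothetical), row_number() assigns a dense sequence over a global window, while the RDD methods zipWithIndex() and zipWithUniqueId() pair each row with an index that is either contiguous or unique per partition:

```python
from pyspark.sql import SparkSession, Row, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("surrogate-keys").getOrCreate()

# Hypothetical dimension rows with a natural (business) key.
dim_team_members = spark.createDataFrame(
    [("Samantha Carter", "SG-1"), ("Jack O'Neill", "SG-1")],
    ["name", "team"],
)

# row_number(): dense, gap-free keys, but the unpartitioned window
# moves every row into a single partition.
with_row_number = dim_team_members.withColumn(
    "sk", F.row_number().over(Window.orderBy("name"))
)

# zipWithIndex(): contiguous 0..N-1 indexes; requires an extra pass
# to count the rows in each partition.
with_zip_index = (
    dim_team_members.rdd.zipWithIndex()
    .map(lambda pair: Row(**pair[0].asDict(), sk=pair[1]))
    .toDF()
)

# zipWithUniqueId(): ids of the form k, k+n, k+2n, ... per partition
# (n = number of partitions); no extra pass, but not contiguous.
with_zip_unique = (
    dim_team_members.rdd.zipWithUniqueId()
    .map(lambda pair: Row(**pair[0].asDict(), sk=pair[1]))
    .toDF()
)

with_row_number.show()
with_zip_index.show()
with_zip_unique.show()
```

Which strategy fits best depends on the validation criteria covered in the notebooks: uniqueness / collisions, distributability (performance), and distribution (skewness).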