2020-08-25 | Watch the video | This folder contains the notebooks used in this tutorial.
- History of Surrogate Keys
- Why is it hard
- Requirements for a good strategy
  - Embarrassingly parallel
  - SQL based
  - No Collisions
- Strategies
  - monotonically_increasing_id()
  - row_number() Rank OVER
  - zipWithIndex()
  - zipWithUniqueId()
  - Row Hash with hash()
  - Row Hash with md5()
- Validation
  - Uniqueness / Collisions
  - Distributability (Performance)
  - Distribution (Skewness)
- Summary
"A surrogate key (or synthetic key, entity identifier, system-generated key, database sequence number, factless key, technical key, or arbitrary unique identifier[citation needed]) in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural (or business) key which is derived from application data.[1]" https://en.wikipedia.org/wiki/Surrogate_key
Ralph Kimball: "Actually, a surrogate key in a data warehouse is more than just a substitute for a natural key. In a data warehouse, a surrogate key is a necessary generalization of the natural production key and is one of the basic elements of data warehouse design. Let’s be very clear: Every join between dimension tables and fact tables in a data warehouse environment should be based on surrogate keys, not natural keys. It is up to the data extract logic to systematically look up and replace every incoming natural key with a data warehouse surrogate key each time either a dimension record or a fact record is brought into the data warehouse environment."
"A surrogate key is frequently a sequential number (e.g. a Sybase or SQL Server "identity column", a PostgreSQL or Informix serial, an Oracle or SQL Server SEQUENCE or a column defined with AUTO_INCREMENT in MySQL). Some databases provide UUID/GUID as a possible data type for surrogate keys (e.g. PostgreSQL UUID or SQL Server UNIQUEIDENTIFIER)."
Approaches to generating surrogates include:
- Universally Unique Identifiers (UUIDs)
- Globally Unique Identifiers (GUIDs)
- Object Identifiers (OIDs)
- Sybase or SQL Server identity column: IDENTITY or IDENTITY(n,n)
- Oracle SEQUENCE, or GENERATED AS IDENTITY (starting from version 12.1)[3]
- SQL Server SEQUENCE (starting from SQL Server 2012)[4]
- PostgreSQL or IBM Informix serial
- MySQL AUTO_INCREMENT
- SQLite AUTOINCREMENT
- AutoNumber data type in Microsoft Access
- AS IDENTITY GENERATED BY DEFAULT in IBM DB2
- Identity column (implemented in DDL) in Teradata
- SURROGATE_KEY user-defined function (UDF) in Hive HDP 3.5
- zipWithUniqueId() using RDDs in Spark 1.6
- zipWithIndex() using RDDs in Spark 1.6 (see the sketch after this list)
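Since the last two items are the ones this tutorial builds on, here is a minimal PySpark sketch (our own illustration, with made-up sample data) of what they look like against an RDD:

```python
# Minimal sketch of the two RDD-based approaches (sample data is ours).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zip-index-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["Carter", "Jackson", "O'Neill", "Teal'c"], numSlices=2)

# zipWithIndex(): consecutive ids 0..N-1; requires an extra pass to count partition sizes.
print(rdd.zipWithIndex().collect())

# zipWithUniqueId(): unique but non-consecutive ids (k * numPartitions + partitionIndex),
# computed without the extra counting pass.
print(rdd.zipWithUniqueId().collect())
```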
We're going back to the past...and getting a tad geeky (apologies in advance...not really).
In databases, using increment/identity/GUID/UUID columns allows you to generate a unique surrogate key (SK) within that database. If you sharded the databases (i.e. multiple databases with the same schema, each processing a different shard of the data), the SK would be unique only within its own database. This could be mitigated by adding application code (or joins) that concatenated the shard database identifier with the SK, but this would have an undesirable performance impact.
Even if this approach worked, what if the same identifier showed up in two different shards? This would result in different SKs for the same identifier, which would be an undesirable business outcome. For example:
- Initially, Samantha Carter is based on Earth (Tau'ri) and is part of the SG-1 team.
- After 15 years of successfully saving the Milky Way galaxy, she is now leading the Atlantis team in the Pegasus galaxy.
- In this example, the identifier is the PK and Name (technically just the name), and we're updating Carter's demographic information.
| Shard | SK | PK | Name | Team | Location | Point of Origin |
|---|---|---|---|---|---|---|
| 0 | 100 | 1 | Samantha Carter | SG-1 | Tau'ri | |
| 0 | 102 | 1 | Samantha Carter | Atlantis | Subido | |
| 3 | 1182 | 1 | Samantha Carter | Atlantis | Subido | |
But what if our sharded databases received Carter's updates at the same time and the data was not distributed properly (whether due to design, scale, the multiverse, etc.)? This would result in two logically different surrogate keys for Carter.
As noted, this is undesirable, as you would want to consistently associate the same SK with the same identifier - i.e. with Carter. And while this may seem obvious to avoid now, this was a real problem with traditional databases.
One short-lived approach was to have a central server or service whose only job was to provide unique SKs for all values. The problems with this approach included:
- As the database transactions increased, the higher the load to generate the SKs.
- Even if it could handle the SK creation, it became resource intensive for the SK lookup.
- It ultimately required a resource-intensive OLTP database to continually generate or look up values based on a provided identifier, saturating the server resources or the network in between.
A potential solution for this would be to create a hash of the identifier (often multiple columns). The key advantage of this approach is that, provided the same hash function (and key columns) is used, any sharded database (or distributed system, for that matter) will compute the exact same hash value for the same identifier. Thus, if you had two shards, they would run the same hash function and get the same value, something like the sketch below.
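As a rough illustration (this is our own minimal sketch, not one of the tutorial notebooks; the column names `pk` and `name` are assumed), hashing the business key columns yields the same surrogate value no matter which shard or cluster computes it:

```python
# Deterministic, hash-based surrogate keys on a Spark DataFrame (illustrative only).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hash-sk-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "Samantha Carter"), (2, "Daniel Jackson")],
    ["pk", "name"],
)

df_with_sk = (
    df
    # 32-bit Murmur3 hash of the business key columns (Spark's built-in hash())
    .withColumn("sk_hash", F.hash("pk", "name"))
    # 128-bit MD5 of the concatenated business key columns, returned as a hex string
    .withColumn("sk_md5", F.md5(F.concat_ws("||", F.col("pk").cast("string"), F.col("name"))))
)

df_with_sk.show(truncate=False)
```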
The potential problem with this approach, though, is hash collisions, where two different inputs result in the same output hash value, e.g. as in the sketch below.
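For example, this small plain-Python sketch (our own illustration; zlib.crc32 stands in as a generic 32-bit hash and uuid4() strings stand in as identifiers) finds a genuine 32-bit collision by brute force, typically within roughly 100,000 random identifiers:

```python
# Brute-force a 32-bit hash collision to show how quickly one appears.
import uuid
import zlib

seen = {}
attempts = 0
while True:
    attempts += 1
    identifier = str(uuid.uuid4())
    h = zlib.crc32(identifier.encode("utf-8"))
    if h in seen and seen[h] != identifier:
        print(f"Collision after {attempts:,} identifiers:")
        print(f"  {seen[h]}")
        print(f"  {identifier}")
        print(f"  both hash to {h}")
        break
    seen[h] = identifier
```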
This issue is also known as the "birthday problem", where two different UserIDs end up with the same Hash_of_the_UserID. For example, if you use a 32-bit hash (the equivalent of converting your RDBMS HashBytes hash value to an integer) with 100,000 users, we can use the Taylor series to approximate the chance (p) of a collision:
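The exact formula is not preserved here; the standard birthday-problem approximation it most likely refers to (our reconstruction) is

$$p \approx 1 - e^{-\frac{n(n-1)}{2d}}$$

where $n$ is the number of users and $d$ is the size of the hash space ($2^{32}$ for a 32-bit hash, $2^{64}$ for a 64-bit hash). Expanding the exponential with its Taylor series gives $p \approx \frac{n(n-1)}{2d}$ when that quantity is small.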
If you use a 64-bit hash (the equivalent of converting your RDBMS HashBytes hash value to a big integer), the chance of collision essentially becomes zero for 100,000 users. It isn’t until you reach 100 million users that the chance of collision climbs to 0.027%.
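A quick way to check these figures (our own sketch, using the approximation above):

```python
# Collision probability via p ≈ 1 - exp(-n*(n-1)/(2*d)).
from math import exp

def p_collision(n_users: int, hash_bits: int) -> float:
    d = 2 ** hash_bits
    return 1 - exp(-n_users * (n_users - 1) / (2 * d))

print(f"32-bit, 100,000 users:     {p_collision(100_000, 32):.4%}")       # roughly two-thirds
print(f"64-bit, 100,000 users:     {p_collision(100_000, 64):.10%}")      # effectively zero
print(f"64-bit, 100,000,000 users: {p_collision(100_000_000, 64):.4%}")   # ~0.027%
```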
Note: the Taylor series approximates the chance of only one collision. To determine the chance of more than one collision, a general approximation is to use the binomial distribution, based on the presumption that a good hash function imitates a uniform random spread over the hash space (that is, every selection of a hash value has a fixed chance of hitting something already in the set). Therefore, when using a 32-bit hash with 100,000 customers, the probability of exactly 2 hash collisions is 26.42% and the probability of >= 2 hash collisions is 67.56%!
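One way to reproduce those two figures (our assumption about the underlying model, which is not spelled out above) is to treat each of the n inserts as a Bernoulli trial that hits an already-occupied 32-bit hash value with probability p = n / 2^32, i.e. X ~ Binomial(n, p):

```python
# Reproduce the 26.42% / 67.56% figures under an assumed Binomial(n, p=n/2**32) model.
from math import comb

n = 100_000
p = n / 2**32

p0 = (1 - p) ** n                             # P(no collision)
p1 = n * p * (1 - p) ** (n - 1)               # P(exactly 1 collision)
p2 = comb(n, 2) * p**2 * (1 - p) ** (n - 2)   # P(exactly 2 collisions)

print(f"P(X = 2):  {p2:.2%}")            # ~26.42%
print(f"P(X >= 2): {1 - p0 - p1:.2%}")   # ~67.56%
```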
In the past, we would solve the problem by using 64-bit hashes to minimize the likelihood of collisions. When working with hundreds of millions of hashes, the likelihood of collision is admittedly pretty low (0.027%). But what if you're working with billions or trillions of hashes? Distribution to the rescue...not really!
- monotonically_increasing_id()
- zipWithIndex()
- zipWithUniqueId()
- row_number()
- hash()
- md5()
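As a rough preview of the DataFrame-based strategies listed above (a hedged sketch of our own with made-up sample data; the notebooks cover the details and trade-offs), here is what two of them look like in PySpark:

```python
# Sketch of two DataFrame-based surrogate-key strategies (illustrative only).
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sk-strategies-demo").getOrCreate()

df = spark.createDataFrame(
    [("Samantha Carter",), ("Daniel Jackson",), ("Jack O'Neill",)],
    ["name"],
)

# Strategy 1: monotonically_increasing_id() - unique and increasing, but not
# consecutive, because the partition id is encoded in the upper bits.
df1 = df.withColumn("sk", F.monotonically_increasing_id())

# Strategy 2: row_number() over an ordered window - consecutive ids, but with no
# partitionBy the data is pulled into a single partition, which limits scalability.
w = Window.orderBy("name")
df2 = df.withColumn("sk", F.row_number().over(w))

df1.show()
df2.show()
```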