A Tablet is a horizontal partition of a Kudu table, similar to tablets in BigTable or regions in HBase. Each tablet hosts a contiguous range of rows which does not overlap with any other tablet's range. Together, all the tablets in a table comprise the table's entire key space.
Each tablet is further subdivided into a number of sets of rows called RowSets. Each RowSet consists of the data for a set of rows. RowSets are disjoint, ie the set of rows for different RowSets do not intersect, so any given key is present in at most one RowSet. While RowSets are disjoint, their key spaces may overlap.
One RowSet is held in memory and is referred to as the MemRowSet. All inserts go directly into the MemRowSet, which is an in-memory B-Tree sorted by the table's primary key. As data is inserted, it is accumulated in the MemRowSet, where it is made immediately visible to future readers, subject to MVCC (see below).
NOTE: Unlike BigTable, only inserts and updates of recently-inserted data go into the MemRowSet -- mutations such as updates and deletions of on-disk rows are discussed in a later section of this document.
Each row exists in exactly one entry in the MemRowSet. The value of this entry consists of a special header, followed by the packed format of the row data (more detail below). Since the MemRowSet is fully in-memory, it will eventually fill up and "Flush" to disk -- this process is described in detail later in this document.
Kudu uses multi-version concurrency control in order to provide a number of useful features:
-
Snapshot scanners: when a scanner is created, it operates as of a point-in-time snapshot of the tablet. Any further updates to the tablet which occur during the course of the scan are ignored. In addition, this point-in-time can be stored and re-used for additional scans on the same tablet, for example if an application would like to perform analytics requiring multiple passes on a consistent view of the data.
-
Time-travel scanners: similar to the above, a user may create a scanner which operates as of some point in time from the past, providing a consistent "time travel read". This can be used to take point-in-time consistent backups.
-
Change-history queries: given two MVCC snapshots, the user may be able to query the set of deltas between those two snapshots for any given row. This can be leveraged to take incremental backups, perform cross-cluster synchronization, or for offline audit analysis.
-
Multi-row atomic updates within a tablet: a single mutation may apply to multiple rows within a tablet, and it will be made visible in a single atomic action.
In order to provide MVCC, each mutation is tagged with a timestamp. Timestamps are generated by a TS-wide Clock instance, and ensured to be unique within a tablet by the tablet's MvccManager. The state of the MvccManager determines the set of timestamps which are considered "committed" and thus visible to newly generated scanners. Upon creation, a scanner takes a snapshot of the MvccManager state, and any data which seen by that scanner is then compared against the MvccSnapshot to determine which insertions, updates, and deletes should be considered visible.
Timestamps are monotonically increasing per tablet. We use a technique called HybridTime (see OSDI'14 submission for details) to create timestamps which correspond to true wall clock time but also reflect causality between nodes.
In order to support these snapshot and time-travel reads, multiple versions of any given row must be stored in the database. To prevent unbounded space usage, the user may configure a retention period beyond which old transaction records may be GCed (thus preventing any snapshot reads from earlier than that point in history). (NOTE: history GC not currently implemented)
In order to support MVCC in the MemRowSet, each row is tagged with the timestamp which inserted the row. Additionally, the row contains a singly linked list containing any further mutations that were made to the row after its insertion, each tagged with the mutation's timestamp:
MemRowSet Row
+----------------------------------------------------+
| insertion timestamp | mutation head | row data... |
+-------------------------|--------------------------+
|
v First mutation
+-----------------------------------------------+
| mutation timestamp | next_mut | change record |
+--------------------|--------------------------+
__________/
/
| Second mutation
+--------v--------------------------------------+
| mutation timestamp | next_mut | change record |
+--------------------|--------------------------+
__________/
/
...
In traditional database terms, one can think of the mutation list forming a sort of "REDO log" containing all changes which affect this row.
Any reader traversing the MemRowSet needs to apply these mutations to read the correct snapshot of the row, via the following logic:
- If row.insertion_timestamp is not committed in scanner's MVCC snapshot, skip the row (it was not yet inserted when the scanner's snapshot was made).
- Otherwise, copy the row data into the output buffer.
- For each mutation in the list:
- if mutation.timestamp is committed in the scanner's MVCC snapshot, apply the change to the in-memory copy of the row. Otherwise, skip this mutation (it was not yet mutated at the time of the snapshot).
- if the mutation indicates a DELETE, mark the row as deleted in the output buffer of the scanner by zeroing its bit in the scanner's selection vector.
Note that "mutation" in this case can be one of three types:
- UPDATE: changes the value of one or more columns
- DELETE: removes the row from the database
- REINSERT: reinsert the row with a new set of data (only occurs on a MemRowSet row with a prior DELETE mutation)
As a concrete example, consider the following sequence on a table with schema (key STRING, val UINT32):
INSERT INTO t VALUES ("row", 1); [timestamp 1]
UPDATE t SET val = 2 WHERE key = "row"; [timestamp 2]
DELETE FROM t WHERE key = "row"; [timestamp 3]
INSERT INTO t VALUES ("row", 3); [timestamp 4]
This would result in the following structure in the MemRowSet:
+-----------------------------------+
| tx 1 | mutation head | ("row", 1) |
+----------|------------------------+
|
|
+---v--------------------------+
| tx 2 | next ptr | SET val=2 |
+-----------|------------------+
______/
|
+---v-------v----------------+
| tx 3 | next ptr | DELETE |
+-----------|----------------+
______/
|
+---v------------------------------------+
| tx 4 | next ptr | REINSERT ("row", 3) |
+----------------------------------------+
Note that this has a couple of undesirable properties when update frequency is high:
- readers must chase pointers through a singly linked list, likely causing many CPU cache misses.
- updates must append to the end of a singly linked list, which is O(n) where 'n' is the number of times this row has been updated.
However, we consider the above inefficiencies tolerable given the following assumptions:
- Kudu's target uses cases have a relatively low update rate: we assume that a single row won't have a high frequency of updates
- Only a very small fraction of the total database will be in the MemRowSet -- once the MemRowSet reaches some target size threshold, it will flush. So, even if scanning MemRowSet is slow due to update handling, it will make up only a small percentage of overall query time.
If it turns out that the above inefficiencies impact real applications, various optimizations can be applied in the future to reduce the overhead.
When the MemRowSet fills up, a Flush occurs, which persists the data to disk.
+------------+
| MemRowSet |
+------------+
|
| Flush process writes entries in memory to a new DiskRowSet on disk
v
+--------------+ +--------------+ +--------------+
| DiskRowSet 0 | | DiskRowSet 1 | .. | DiskRowSet N |
+-------------+- +--------------+ +--------------+
When the data is flushed, it is stored as a set of CFiles (see cfile.md). Each of the rows in the data is addressable by a sequential "rowid", which is dense, immutable, and unique within this DiskRowSet. For example, if a given DiskRowSet contains 5 rows, then they will be assigned rowid 0 through 4, in order of ascending key. Within a different DiskRowSet, there will be different rows with the same rowids.
Reads may map between primary keys (user-visible) and rowids (internal) using an index structure. In the case that the primary key is a simple key, the key structure is embedded within the primary key column's CFile. Otherwise, a separate index CFile stores the encoded compound key and provides a similar function.
NOTE: rowids are not explicitly stored with each row, but rather an implicit identifier based on the row's ordinal index in the file. Some parts of the source code refer to rowids as "row indexes" or "ordinal indexes".
NOTE: other systems such as C-Store call the MemRowSet the "write optimized store" (WOS), and the on-disk files the "read-optimized store" (ROS).
In order to continue to provide MVCC for on-disk data, each on-disk RowSet consists not only of the current columnar data, but also "UNDO" records which provide the ability to rollback a row's data to an earlier version.
+--------------+ +-----------+
| UNDO records | <--- | base data |
+--------------+ +-----------+
- time of data progresses to the right --->
When a user wants to read the most recent version of the data immediately after a flush, only the base data is required. Because the base data is stored in a columnar format, this common case is very efficient. If instead, the user wants to run a time-travel query, the read path consults the UNDO records in order to roll back the visible data to the earlier point in time.
When a scanner encounters a row, it processes the MVCC information as follows:
- Read base image of row
- For each UNDO record: -- If the associated timestamp is NOT committed, execute rollback change.
For example, recall the series of mutations used in "MVCC Mutations in MemRowSet" above:
INSERT INTO t VALUES ("row", 1); [timestamp 1]
UPDATE t SET val = 2 WHERE key = "row"; [timestamp 2]
DELETE FROM t WHERE key = "row"; [timestamp 3]
INSERT INTO t VALUES ("row", 3); [timestamp 4]
When this row is flushed to disk, we store it on disk in the following way:
Base data:
("row", 3)
UNDO records (roll-back):
Before Tx 4: DELETE
Before Tx 3: INSERT ("row", 2")
Before Tx 2: SET row=1
Before Tx 1: DELETE
Each UNDO record is the inverse of the transaction which triggered it -- for example the INSERT at transaction 1 turns into a "DELETE" when it is saved as an UNDO record.
The use of the UNDO record here acts to preserve the insertion timestamp: queries whose MVCC snapshot indicates Tx 1 is not yet committed will execute the DELETE "UNDO" record, such that the row is made invisible.
For example, consider two different example scanners:
Current time scanner (all txns committed)
-----------------------------------------
- Read base data
- Since tx 1-4 are committed, ignore all UNDO records
- No REDO records
Result: current row ("row", 3)
Scanner as of timestamp 1
---------------------
- Read base data. Buffer = ("row", 3)
- Rollback Tx 4: Buffer = <deleted>
- Rollback Tx 3: Buffer = ("row", 2)
- Rollback Tx 2: Buffer = ("row", 1)
Result: ("row", 1)
Each case processes the correct set of UNDO records to yield the state of the row as of the desired point of time.
Given that the most common case of queries will be running against "current" data. In that case, we would like to optimize query execution by avoiding the processing of any UNDO records. To do so, we include file-level metadata indicating the range of transactions for which UNDO records are present. If the scanner's MVCC snapshot indicates that all of these transactions are already committed, then the set of deltas may be short circuited, and the query can proceed with no MVCC overhead.
Updates or deletes of already-flushed rows do not go into the MemRowSet. Instead, the updated key is searched for among all RowSets in order to locate the unique RowSet which holds this key. This processes first uses an interval tree to locate a set of candidate rowsets which may contain the key in question. Following this, we consult a bloom filter for each of those candidates. For rowsets which pass both checks, we seek the primary key index to determine the row's rowid within that rowset.
Once the appropriate RowSet has been determined, the mutation will also be aware of the key's rowid within the RowSet (as a result of the same key search which verified that the key is present in the RowSet). The mutation can then enter an in-memory structure called the DeltaMemStore.
The DeltaMemStore is an in-memory concurrent BTree keyed by a composite key of the rowid and the mutating timestamp. At read time, these mutations are processed in the same manner as the mutations for newly inserted data.
When the Delta MemStore grows too large, it performs a flush to an on-disk DeltaFile, and resets itself to become empty:
+------------+ +---------+ +---------+ +----------------+
| base data | <--- | delta 0 | <-- | delta N | <-- | delta memstore |
+------------+ +---------+ +---------+ +----------------+
The DeltaFiles contain the same type of information as the Delta MemStore, but compacted to a dense on-disk serialized format. Because these delta files contain records of transactions that need to be re-applied to the base data in order to bring rows up-to-date, they are called "REDO" files, and the mutations contained are called "REDO" records. Similar to data resident in the MemRowSet, REDO mutations need to be applied to read newer versions of the data.
A given row may have delta information in multiple delta structures. In that case, the deltas are applied sequentially, with later modifications winning over earlier modifications.
Note that the mutation tracking structure for a given row does not necessarily include the entirety of the row. If only a single column of a row is updated, then the mutation structure will only include the updated column. This allows for fast updates of small columns without the overhead of reading or re-writing larger columns (an advantage compared to the MVCC techniques used by systems such as C-Store and PostgreSQL).
In summary, each DiskRowSet consists of three logical components:
+--------------+ +-----------+ +--------------+
| UNDO records | <--- | base data | ---> | REDO records |
+--------------+ +-----------+ +--------------+
Base data: the columnar data for the RowSet, at the time the RowSet was flushed
UNDO records: historical data which needs to be processed to rollback rows to points in time prior to the RowSet flush.
REDO records: data which needs to be processed in order to bring rows up to date with respect to modifications made after the RowSet was flushed.
UNDO records and REDO records are stored in the same file format, called a DeltaFile.
Within a RowSet, reads become less efficient as more mutations accumulate in the delta tracking structures; in particular, each flushed delta file will have to be seeked and merged as the base data is read. Additionally, if a record has been updated many times, many REDO records have to be applied in order to expose the most current version to a scanner.
In order to mitigate this and improve read performance, Kudu performs background processing which transforms a RowSet from inefficient physical layouts to more efficient ones, while maintaining the same logical contents. These types of transformations are called "delta compactions". Delta compactions serve several main goals:
- Reduce the number of delta files
The more delta files that have been flushed for a RowSet, the more separate files must be read in order to produce the current version of a row. In workloads that do not fit in RAM, each random read will result in a disk seek for each of the delta files, causing performance to suffer.
- Migrate REDO records to UNDO records
As described above, a RowSet consists of base data (stored per-column), a set of "undo" records (to move back in time), and a set of "redo" records (to move forward in time from the base data). Given that most queries will be made against the present version of the database, we would like to minimize the number of REDO records stored.
At any point, a row's REDO records may be merged into the base data, and replaced by an equivalent set of UNDO records containing the old versions of the cells.
- Garbage collect old UNDO records.
UNDO records need to be retained only as far back as a user-configured historical retention period. Beyond this period, we can remove old "undo" records to save disk space.
NOTE: In the BigTable design, timestamps are associated with data, not with changes. In the Kudu design, timestamps are associated with changes, not with data. After historical UNDO logs have been removed, there is no remaining record of when any row or cell was inserted or updated. If users need this functionality, they should keep their own "inserted_on" timestamp column, as they would in a traditional RDBMS.
A REDO delta compaction may be classified as either 'minor' or 'major':
A 'minor' compaction is one that does not include the base data. In this type of compaction, the resulting file is itself a delta file.
+------------+ +---------+ +---------+ +---------+ +---------+
| base data | <--- | delta 0 + <-- | delta 1 + <-- | delta 2 + <-- | delta 3 +
+------------+ +---------+ +---------+ +---------+ +---------+
\_________________________________________/
files selected for compaction
=====>
+------------+ +---------+ +-----------------------+
| base data | <--- | delta 0 + <-- | delta 1 (old delta 3) +
+------------+ +---------+ +-----------------------+
\_________/
compaction result
Minor REDO delta compactions serve only goal 1: because they do not read or re-write base data, they cannot transform REDO records into UNDO.
A 'major' REDO compaction is one that includes the base data along with any number of REDO delta files.
+------------+ +---------+ +---------+ +---------+ +---------+
| base data | <--- | delta 0 + <-- | delta 1 + <-- | delta 2 + <-- | delta 3 +
+------------+ +---------+ +---------+ +---------+ +---------+
\_____________________________________________/
files selected for compaction
=====>
+------------+ +----------------+ +-----------------------+ +-----------------------+
| new UNDOs | --> | new base data | <--- | delta 0 (old delta 2) + <-- | delta 1 (old delta 3) +
+------------+ +----------------+ +-----------------------+ +-----------------------+
\____________________________________/
compaction result
Major delta compactions satisfy delta compaction goals 1 and 2, but cost more than minor delta compactions since they must read and re-write the base data, which is typically larger than the delta data.
A major REDO delta compaction may be performed against any subset of the columns
in a DiskRowSet -- if only a single column has received a significant number of updates,
then a compaction can be performed which only reads and rewrites that column. It is
assumed that this is a common workload in many EDW-like applications (e.g updating
an order_status
column in an order table, or a visit_count
column in a user table).
Note that both types of delta compactions maintain the row ids within the RowSet: hence, they can be done entirely in the background with no locking. The resulting compaction file can be introduced into the RowSet by atomically swapping it with the compaction inputs. After the swap is complete, the pre-compaction files may be removed.
As more data is inserted into a tablet, more and more DiskRowSets will accumulate. This can hurt performance for the following cases:
a) Random access (get or update a single row by primary key)
In this case, each RowSet whose key range includes the probe key must be individually consulted to locate the specified key. Bloom filters can mitigate the number of physical seeks, but extra bloom filter accesses can impact CPU and also increase memory usage.
b) Scan with specified range (eg scan where primary key between 'A' and 'B')
In this case, each RowSet with an overlapping key range must be individually seeked, regardless of bloom filters. Specialized index structures might be able to assist, here, but again at the cost of memory, etc.
c) Sorted scans
If the user query requires that the scan result be yielded in primary-key-sorted order, then the results must be passed through a merge process. Merging is typically logarithmic in the number of inputs: as the number of inputs grows higher, the merge becomes more expensive.
Given the above, it is desirable to merge RowSets together to reduce the number of RowSets:
+------------+
| RowSet 0 |
+------------+
+------------+ \
| RowSet 1 | |
+------------+ |
|
+------------+ | +--------------+
| RowSet 2 | |===> RowSet compaction ===> | new RowSet 1 |
+------------+ | +--------------+
|
+------------+ |
| RowSet 3 | |
+------------+ /
Unlike Delta Compactions described above, note that row ids are not maintained in a Merging Compaction. This makes the handling of concurrent mutations a somewhat intricate dance. This process is described in more detail in 'compaction.txt' in this directory.
Go go gadget ASCII art!
+-----------+
| MemRowSet |
+-----------+
|
| flush: creates a new DiskRowSet 0
v
+---------------+
| DiskRowSet 0 |
+---------------+
DiskRowSet 1:
+---------+ +------------+ +---------+ +---------+ +---------+ +---------+
| UNDOs 0 | --> | base data | <--- | REDOs 0 | <-- | REDOS 1 | <-- | REDOs 2 | <-- | REDOs 3 |
+---------+ +------------+ +---------+ +---------+ +---------+ +---------+
\____________________________________________________________/
| major compaction
v
+---------+ +------------+ +---------+ +---------+
| UNDOs 0'| --> | base data' | <--- | REDOs 2 | <-- | REDOs 3 |
+---------+ +------------+ +---------+ +---------+
\____________________________/
compaction result
DiskRowSet 2:
+---------+ +------------+ +---------+ +---------+ +---------+ +---------+
| UNDOs 0 | --> | base data | <--- | REDOs 0 | <-- | REDOS 1 | <-- | REDOs 2 | <-- | REDOs 3 |
+---------+ +------------+ +---------+ +---------+ +---------+ +---------+
\_________________________/
| minor compaction
v
+---------+ +------------+ +---------+ +---------+ +---------+
| UNDOs 0 | --> | base data | <--- | REDOS 0'| <-- | REDOs 2 | <-- | REDOs 3 |
+---------+ +------------+ +---------+ +---------+ +---------+
\_________/
compaction result
+-----------------+ \
| DiskRowSet 3 | |
+-----------------+ |
|
+-----------------+ | +----------------+
| DiskRowSet 4 | |===> Merging compaction ===> | new DiskRowSet |
+-----------------+ | +----------------+
|
+-----------------+ |
| DiskRowSet 5 | |
+-----------------+ /
This design differs from the approach used in BigTable in a few key ways:
- A given key is only present in at most one RowSet in the tablet.
In BigTable, a key may be present in several different SSTables. An entire Tablet in BigTable looks more like the RowSet in Kudu -- any read of a key must merge together data found in all of the SSTables, just like a single row lookup in Kudu must merge together the base data with all of the DeltaFiles.
The advantage of the Kudu approach is that, when reading a row, or servicing a query for which sort-order is not important, no merge is required. For example, an aggregate over a range of keys can individually scan each RowSet (even in parallel) and then sum the results, since the order in which keys are presented is not important. Similarly, selects without an explicit 'ORDER BY primary_key' specification do not need to conduct a merge. It's obvious why this can result in more efficient scanning.
The disadvantage here is that, unlike BigTable, inserts and mutations are distinct operations: inserts must go into the MemRowSet, whereas mutations (delete/update) must go into the DeltaMemStore in the specific RowSet containing that key. This has performance impacts as follows:
a) Inserts must determine that they are in fact new keys.
This results in a bloom filter query against all present RowSets. If any RowSet indicates a possible match, then a seek must be performed against the key column(s) to determine whether it is in fact an insert or update.
It is assumed that, so long as the number of RowSets is small, and the bloom filters accurate enough, the vast majority of inserts will not require any physical disk seeks. Additionally, if the key pattern for inserts is locally sequential (eg '_' in a time-series application), then the blocks corresponding to those keys are likely to be kept in the data block cache due to their frequent usage.
b) Updates must determine which RowSet they correspond to.
Similar to above, this results in a bloom filter query against all RowSets, as well as a primary key lookup against any matching RowSets.
One advantage to this difference is that the semantics are more familiar to users who are accustomed to RDBMS systems where an INSERT of a duplicate primary key gives a Primary Key Violation error rather than replacing the existing row. Similarly, an UPDATE of a row which does not exist can give a key violation error, indicating that no rows were updated. These semantics are not generally provided by BigTable-like systems.
- Mutation applications of data on disk are performed on numeric rowids rather than arbitrary keys.
In order to reconcile a key on disk with its potentially-mutated form, BigTable performs a merge based on the row's key. These keys may be arbitrarily long strings, so comparison can be expensive. Additionally, even if the key column is not needed to service a query (e.g an aggregate computation), the key column must be read off disk and processed, which causes extra IO. Given that composite keys are often used in BigTable applications, the key size may dwarf the size of the column of interest by an order of magnitude, especially if the queried column is stored in a dense encoding.
In contrast, mutations in Kudu are stored by rowid. So, merges can proceed much more efficiently by maintaining counters: given the next mutation to apply, we can simply subtract to find how many rows of unmutated base data may be passed through unmodified. Alternatively, direct addressing can be used to efficiently "patch" entire blocks of base data given a set of mutations.
Additionally, if the key is not needed in the query results, the query plan need not consult the key except perhaps to determine scan boundaries.
As an example, consider the query:
> SELECT SUM(cpu_usage) FROM timeseries WHERE machine = 'foo.cloudera.com'
AND unix_time BETWEEN 1349658729 AND 1352250720;
... given a composite primary key (host, unix_time)
This may be evaluated in Kudu with the following pseudo-code:
sum = 0
foreach RowSet:
start_rowid = rowset.lookup_key(1349658729)
end_rowid = rowset.lookup_key(1352250720)
iter = rowset.new_iterator("cpu_usage")
iter.seek(start_rowid)
remaining = end_rowid - start_rowid
while remaining > 0:
block = iter.fetch_upto(remaining)
sum += sum(block)
The fetching of blocks can be done very efficiently since the application of any potential mutations can simply index into the block and replace any mutated values with their new data.
- timestamps are not part of the data model
In BigTable-like systems, the timestamp of each cell is exposed to the user, and essentially forms the last element of a composite row key. This means that it is efficient to directly access some particular version of a cell, and store entire time series as many different versions of a single cell. This is not efficient in Kudu -- timestamps should be considered an implementation detail used for MVCC, not another dimension in the row key. Instead, Kudu provides native composite row keys which can be useful for time series.
C-Store provides MVCC by adding two extra columns to each table: an insertion epoch and a deletion epoch. Epochs in Vertica are essentially equivalent to timestamps in Kudu. When a row is inserted, the transaction's epoch is written in the row's epoch column. The deletion epoch column is initially NULL. When a row is deleted, the epoch of the deletion transaction is written into that column. As a scanner iterates over the table, it only includes rows where the insertion epoch is committed and the deletion epoch is either NULL or uncommitted.
Updates in Vertica are always implemented as a transactional DELETE followed by a re-INSERT. So, the old version of the row has the update's epoch as its deletion epoch, and the new version of the row has the update's epoch as its insertion epoch.
This has the downside that even updates of one small column must read all of the columns for that row, incurring many seeks and additional IO overhead for logging the re-insertion. Additionally, while both versions of the row need to be retained, the space usage of the row has been doubled. If a row is being frequently updated, then the space usage will increase significantly, even if only a single column of the row has been changed.
In contrast, Kudu does not need to read the other columns, and only needs to re-store the columns which have changed, which should yield much improved UPDATE throughput for online applications.
References:
- http://vertica-forums.com/viewtopic.php?f=48&t=345&start=10
- http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf
PostgreSQL's MVCC implementation is very similar to Vertica's. Each tuple has an associated "xmin" and "xmax" column. "xmin" contains the timestamp when the row was inserted, and "xmax" contains the timestamp when the row was deleted or updated.
PostgreSQL has the same downsides as C-Store in that a frequently updated row will end up replicated many times in the tablespace, taking up extra storage and IO. The overhead is not as bad, though, since Postgres is a row-store, and thus re-reading all of the N columns for an update does not incur N separate seeks.
References:
- postgres source code
- http://www.packtpub.com/article/transaction-model-of-postgresql
Oracle's MVCC and time-travel implementations are somewhat similar to Kudu's. Its MVCC operates on physical blocks rather than records. Whenever a block is modified, it is modified in place and a compensating UNDO record is written to a Rollback Segment (RBS) in the transaction log. The block header is then modified to point to the Rollback Segment which contains the UNDO record.
When readers read a block, the read path looks at the data block header to determine if rollback is required. If so, it reads the associated rollback segment to apply UNDO logs.
This has the downside that the rollback segments are allocated based on the order of transaction commit, and thus are not likely to be sequentially laid out with regard to the order of rows being read. So, scanning through a table in a time travel query may require a random access to retrieve associated UNDO logs for each block, whereas in Kudu, the undo logs have been sorted and organized by row-id.
NOTE: the above is very simplified, but the overall idea is correct.
References: