Auto refresh iterator with snapshot #13354

mszeszko-meta · 2025-01-31T02:09:30Z

Problem

Once opened, iterator will preserve its' respective RocksDB snapshot for read consistency. Unless explicitly Refresh'ed, the iterator will hold on to the Init-time assigned SuperVersion throughout its lifetime. As time goes by, this might result in artificially long holdup of the obsolete memtables (potentially referenced by that superversion alone) consequently limiting the supply of the reclaimable memory on the DB instance. This behavior proved to be especially problematic in case of logical backups (outside of RocksDB BackupEngine).

Solution

Building on top of the Refresh(const Snapshot* snapshot) API introduced in #10594, we're adding a new ReadOptions opt-in knob that (when enabled) will instruct the iterator to automatically refresh itself to the latest superversion - all that while retaining the originally assigned, explicit snapshot (supplied in read_options.snapshot at the time of iterator creation) for consistency. To ensure minimal performance overhead we're leveraging relaxed atomic for superversion freshness lookups.

Test plan

Correctness: New test to demonstrate the auto refresh behavior in contrast to legacy iterator: ./db_iterator_test --gtest_filter=*AutoRefreshIterator*.

Stress testing: We're adding command line parameter controlling the feature and hooking it up to as many iterator use cases in db_stress as we reasonably can with random feature on/off configuration in db_crashtest.py.

Benchmarking

The goal of this benchmark is to validate that throughput did not regress substantially. Benchmark was run on optimized build 3-5 times for each respective category OR till convergence. Experiments have been run 'in parallel' (at the same time) on separate db instances within a single host to evenly spread the potential adverse impact of noisy neighbor activities. Host specs [1].

TLDR; Baseline & new solution are practically indistinguishable from performance standpoint. Difference (positive or negative) relative to the baseline, if any, is no more than 1% with Direct IO and no more than 2% when using OS page cache.

Approach:

This feature is only effective on iterators with well-defined snapshot passed via ReadOptions config. We modified the existing db_bench program to reflect that constraint. However, it quickly turned out that the actual Snapshot* initialization is quite expensive. Especially in case of 'tiny scans' (100 rows) contributing as much as 25-35 microseconds, which is ~20-30% of the average per/op latency unintentionally masking potentially adverse performance impact of this change. As a result, we ended up creating a single, explicit 'global' Snapshot* for all the future scans before running multiple experiments en masse. This is also a valuable data point for us to keep in mind in case of any future discussions about taking implicit snapshots - now we know what the lower bound cost could be.

Mixed access benchmark

DB Setup:

./db_bench -db=/tmp/testdb -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -write_buffer_size=655360  -disable_auto_compactions=true -target_file_size_base=16777216 -compression_type=none

Test:

./db_bench -use_existing_db=true -db=/tmp/testdb -statistics=false -cache_size=0 -cache_index_and_filter_blocks=false -benchmarks=seekrandom -preserve_internal_time_seconds=1 -explicit_snapshot=1 -use_direct_reads=? -async_io=? -num=?  -seek_nexts=? -auto_refresh_iterator_with_snapshot=?

Direct IO bypassing OS page cache

async io = true, direct io read = true

	seek_nexts = 100, num=2000000	seek_nexts = 20000, num=50000	seek_nexts = 400000, num=2000
baseline	29821.5 (± 186) ops/s, 309.05 (± 2.25) MB/s	96.5 (± 0.5) ops/s, 1004.25 (± 0.55) MB/s	204.8 (± 2.3) ms/op, 1013.05 (± 11.55) MB/s
auto refresh	29794.5 (± 262.5) ops/s, 309.65 (± 2.65) MB/s	95.5 (± 0.5) ops/s, 997.5 (± 4.1) MB/s	206 (± 1.8) ms/op, 1006.8 (± 8.4) MB/s

async io = false, direct io read = true

	seek_nexts = 100, num=2000000	seek_nexts = 20000, num=50000	seek_nexts = 400000, num=2000
baseline	23954.5 (± 56.5) ops/sec, 621.55 (± 1.25) MB/s	100.5 (± 0.5) ops/sec, 1046.6 (± 5.9) MB/s	202.4 (± 3.1) ms/op, 1025.3 (± 15.6) MB/s
auto refresh	24025 (± 148) ops/sec, 623.1 (± 3.9) MB/s	99.5 (± 0.5) ops/s, 1037.55 (± 8.45) MB/s	203.4 (± 3.8) ms/op, 1020.6 (± 18.8) MB/s

Using OS page cache

async io = true, direct io read = false

	seek_nexts=100, num=2000000	seek_nexts = 20000, num=50000	seek_nexts = 400000, num=2000
baseline	18618 (± 316) ops/sec, 483.1 (± 8.2) MB/s	71.5 (± 0.5) ops/sec, 748.2 (± 2.6) MB/s	275.4 (±2.7) ms/op, 753.6 (± 7.5) MB/s
auto refresh	17910 (± 192) ops/sec, 465.05 (± 4.55) MB/s	73.5 (± 0.5) ops/sec, 766.9 (± 2.3) MB/s	274.9 (± 2.3) ms/op, 755 (± 6.3) MB/s

async io = false, direct io read = false

	seek_nexts=100, num=2000000	seek_nexts = 20000, num=50000	seek_nexts = 400000, num=2000
baseline	23207 (± 248) ops/sec, 602.05 (± 6.35) MB/s	13644.5 (± 224.5) ops/sec, 756.5 (± 8.4) MB/s	272.5 (± 0.9), 761.45 (± 2.35) MB/s
auto refresh	22960 (± 302) ops/sec, 595.75 (± 8.05) MB/s	13721.5 (± 103) ops/sec, 756.2 (± 5.7) MB/s	272.9 (±0.7) ms/op, 760.45 (± 2.05) MB/s

"DB in memory" benchmark

DB Setup

Allow a single memtable to grow large enough (~572MB) to fit in all the rows. Upon shutdown all the rows will be flushed to the WAL file (inspected 000004.log file is 541MB in size).

./db_bench -db=/tmp/testdb_in_mem -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -write_buffer_size=600000000  max_write_buffer_number=2 -compression_type=none

As a part of recovery in subsequent DB open, WAL will be loaded into the memory and processed to one or more SST files during the recovery. We're selecting a large block cache (cache_size parameter in db_bench script) suitable for holding the entire DB to test the “hot path” CPU overhead. Note that with such configured write_buffer_size and max_write_buffer_number we'll produce a single SST file.

./db_bench -use_existing_db=true -db=/tmp/testdb_in_mem -statistics=false -cache_size=1000000000 -cache_index_and_filter_blocks=true -benchmarks=seekrandom -preserve_internal_time_seconds=1 -write_buffer_size=600000000  max_write_buffer_number=2 -explicit_snapshot=1 -use_direct_reads=? -async_io=? -num=? -seek_nexts=? -auto_refresh_iterator_with_snapshot={0|1}

	seek_nexts=100, num=2000000	seek_nexts = 20000, num=50000	seek_nexts = 400000, num=2000
baseline
auto refresh

[1]

Specs

Property	Value
RocksDB	version 10.0.0
Date	Mon Feb 3 23:21:03 2025
CPU	32 * Intel Xeon Processor (Skylake)
CPUCache	16384 KB
Keys	16 bytes each (+ 0 bytes user-defined timestamp)
Values	100 bytes each (50 bytes after compression)
Prefix	0 bytes
RawSize	5.5 MB (estimated)
FileSize	3.1 MB (estimated)
Compression	Snappy
Compression sampling rate	0
Memtablerep	SkipListFactory
Perf Level	1

pdillinger · 2025-01-31T22:44:08Z

problematic in case of backups

in the case of logical backups (not BackupEngine)

pdillinger

In addition to the performance test, I would like to see a unit test where we write some stuff, open an iterator, trigger and wait for a flush+full compaction, step the iterator, wait for background work, then verify some things like maybe size of all memtables is smaller, and all the old live sst files have been deleted. Something like that.

db/arena_wrapped_db_iter.cc

facebook-github-bot · 2025-02-04T15:54:01Z

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2025-02-04T18:28:10Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2025-02-04T18:30:45Z

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

include/rocksdb/options.h

unreleased_history/new_features/auto_refresh_iterator.md

include/rocksdb/options.h

pdillinger

Otherwise, test etc. looks good

db/db_iterator_test.cc

facebook-github-bot · 2025-02-04T22:41:07Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2025-02-04T23:29:27Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2025-02-04T23:50:11Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2025-02-05T00:11:00Z

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pdillinger

Along with updating unit tests for reverse iterator stuff, then we should be good.

db/arena_wrapped_db_iter.cc

pdillinger

Almost forgot: we also need to add this to db_stress. We should add a command line parameter controlling the feature (for differential diagnostic/debugging purposes), connect it up to as many iterator uses in db_stress as we reasonably can, and then set it randomly in db_crashtest.py. This could be another PR, but we can't forget about it as part of the same release cycle.

mszeszko-meta · 2025-02-05T22:11:09Z

Almost forgot: we also need to add this to db_stress. We should add a command line parameter controlling the feature (for differential diagnostic/debugging purposes), connect it up to as many iterator uses in db_stress as we reasonably can, and then set it randomly in db_crashtest.py. This could be another PR, but we can't forget about it as part of the same release cycle.

Agreed. Separately, with even lesser urgency (could be different release cycle), we would probably want to add similar plumbing for db_bench.

include/rocksdb/options.h

facebook-github-bot · 2025-02-07T03:29:07Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2025-02-07T03:38:16Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

tools/db_bench_tool.cc

facebook-github-bot · 2025-02-08T05:32:46Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2025-02-08T05:36:43Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2025-02-10T22:41:25Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2025-02-11T00:06:41Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2025-02-11T00:10:25Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

pdillinger

LGTM implementation-wise! Let's maybe chat in person about the latest performance tests before landing

pdillinger · 2025-02-10T18:44:16Z

unreleased_history/new_features/auto_refresh_iterator.md

@@ -0,0 +1 @@
+Introduced new `auto_refresh_iterator_with_snapshot` opt-in knob that (when enabled) will periodically release obsolete memory and storage resources for as long as the iterator is making progress and its supplied `read_options.snapshot` was initialized with well-defined (non-nullptr) value.


Nit: nullptr is well defined

facebook-github-bot · 2025-02-11T02:10:11Z

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

mszeszko-meta requested a review from pdillinger January 31, 2025 02:09

facebook-github-bot added the CLA Signed label Jan 31, 2025

mszeszko-meta changed the title ~~[EXPERIMENTAL] Auto refresh the iterator~~ [EXPERIMENTAL] Auto refresh iterator Jan 31, 2025

mszeszko-meta changed the title ~~[EXPERIMENTAL] Auto refresh iterator~~ [EXPERIMENTAL] Iterator auto refresh Jan 31, 2025

mszeszko-meta force-pushed the auto_refresh_iterator_option branch from 59fe197 to 777bad5 Compare January 31, 2025 03:33

pdillinger reviewed Jan 31, 2025

View reviewed changes

db/arena_wrapped_db_iter.cc Outdated Show resolved Hide resolved

mszeszko-meta force-pushed the auto_refresh_iterator_option branch 3 times, most recently from e03ba41 to e65144a Compare February 4, 2025 09:04

mszeszko-meta changed the title ~~[EXPERIMENTAL] Iterator auto refresh~~ Iterator auto refresh Feb 4, 2025

mszeszko-meta marked this pull request as ready for review February 4, 2025 15:53

mszeszko-meta requested a review from pdillinger February 4, 2025 15:53

pdillinger reviewed Feb 4, 2025

View reviewed changes

include/rocksdb/options.h Outdated Show resolved Hide resolved

unreleased_history/new_features/auto_refresh_iterator.md Outdated Show resolved Hide resolved

include/rocksdb/options.h Outdated Show resolved Hide resolved

pdillinger requested changes Feb 4, 2025

View reviewed changes

db/db_iterator_test.cc Outdated Show resolved Hide resolved

db/db_iterator_test.cc Outdated Show resolved Hide resolved

db/db_iterator_test.cc Outdated Show resolved Hide resolved

db/db_iterator_test.cc Outdated Show resolved Hide resolved

mszeszko-meta force-pushed the auto_refresh_iterator_option branch from 4d744f6 to 57abeeb Compare February 4, 2025 23:50

mszeszko-meta requested a review from pdillinger February 5, 2025 00:24

pdillinger requested changes Feb 5, 2025

View reviewed changes

db/arena_wrapped_db_iter.cc Show resolved Hide resolved

db/arena_wrapped_db_iter.cc Outdated Show resolved Hide resolved

db/arena_wrapped_db_iter.cc Outdated Show resolved Hide resolved

db/arena_wrapped_db_iter.cc Outdated Show resolved Hide resolved

pdillinger reviewed Feb 5, 2025

View reviewed changes

pdillinger reviewed Feb 6, 2025

View reviewed changes

include/rocksdb/options.h Outdated Show resolved Hide resolved

mszeszko-meta changed the title ~~Iterator auto refresh~~ Auto refresh iterator with snapshot Feb 7, 2025

mszeszko-meta force-pushed the auto_refresh_iterator_option branch from 02713d5 to 754758f Compare February 7, 2025 03:38

mszeszko-meta commented Feb 7, 2025

View reviewed changes

tools/db_bench_tool.cc Outdated Show resolved Hide resolved

mszeszko-meta added 5 commits February 7, 2025 21:32

Auto refresh iterator

4b99a90

Add feature release note

f808f73

Rephrase the commentary

e012a53

Address comments in db_iterator_test

546d76d

Reconcile PR to the latest privately discussed semantics

cd11974

mszeszko-meta force-pushed the auto_refresh_iterator_option branch from 754758f to 8b55a50 Compare February 8, 2025 05:32

Reduce the impact of expensive snapshot initialization

230c0f6

mszeszko-meta force-pushed the auto_refresh_iterator_option branch from 8b55a50 to 230c0f6 Compare February 8, 2025 05:36

mszeszko-meta requested a review from pdillinger February 10, 2025 17:52

Hook new feature into db_stress

2a64725

Simplify the contract for MaybeAutoRefresh

4791457

pdillinger approved these changes Feb 11, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Auto refresh iterator with snapshot #13354

Auto refresh iterator with snapshot #13354

mszeszko-meta commented Jan 31, 2025 •

edited

Loading

pdillinger commented Jan 31, 2025

pdillinger left a comment

facebook-github-bot commented Feb 4, 2025

facebook-github-bot commented Feb 4, 2025

facebook-github-bot commented Feb 4, 2025

pdillinger left a comment

facebook-github-bot commented Feb 4, 2025

facebook-github-bot commented Feb 4, 2025

facebook-github-bot commented Feb 4, 2025

facebook-github-bot commented Feb 5, 2025

pdillinger left a comment

pdillinger left a comment

mszeszko-meta commented Feb 5, 2025

facebook-github-bot commented Feb 7, 2025

facebook-github-bot commented Feb 7, 2025

facebook-github-bot commented Feb 8, 2025

facebook-github-bot commented Feb 8, 2025

facebook-github-bot commented Feb 10, 2025

facebook-github-bot commented Feb 11, 2025

facebook-github-bot commented Feb 11, 2025

pdillinger left a comment

pdillinger Feb 10, 2025

facebook-github-bot commented Feb 11, 2025

		@@ -0,0 +1 @@
		Introduced new `auto_refresh_iterator_with_snapshot` opt-in knob that (when enabled) will periodically release obsolete memory and storage resources for as long as the iterator is making progress and its supplied `read_options.snapshot` was initialized with well-defined (non-nullptr) value.

Auto refresh iterator with snapshot #13354

Are you sure you want to change the base?

Auto refresh iterator with snapshot #13354

Conversation

mszeszko-meta commented Jan 31, 2025 • edited Loading

Problem

Solution

Test plan

Benchmarking

Mixed access benchmark

Direct IO bypassing OS page cache

Using OS page cache

"DB in memory" benchmark

Specs

pdillinger commented Jan 31, 2025

pdillinger left a comment

Choose a reason for hiding this comment

facebook-github-bot commented Feb 4, 2025

facebook-github-bot commented Feb 4, 2025

facebook-github-bot commented Feb 4, 2025

pdillinger left a comment

Choose a reason for hiding this comment

facebook-github-bot commented Feb 4, 2025

facebook-github-bot commented Feb 4, 2025

facebook-github-bot commented Feb 4, 2025

facebook-github-bot commented Feb 5, 2025

pdillinger left a comment

Choose a reason for hiding this comment

pdillinger left a comment

Choose a reason for hiding this comment

mszeszko-meta commented Feb 5, 2025

facebook-github-bot commented Feb 7, 2025

facebook-github-bot commented Feb 7, 2025

facebook-github-bot commented Feb 8, 2025

facebook-github-bot commented Feb 8, 2025

facebook-github-bot commented Feb 10, 2025

facebook-github-bot commented Feb 11, 2025

facebook-github-bot commented Feb 11, 2025

pdillinger left a comment

Choose a reason for hiding this comment

pdillinger Feb 10, 2025

Choose a reason for hiding this comment

facebook-github-bot commented Feb 11, 2025

mszeszko-meta commented Jan 31, 2025 •

edited

Loading