Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto refresh iterator with snapshot #13354

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

mszeszko-meta
Copy link
Contributor

@mszeszko-meta mszeszko-meta commented Jan 31, 2025

Problem

Once opened, iterator will preserve its' respective RocksDB snapshot for read consistency. Unless explicitly Refresh'ed, the iterator will hold on to the Init-time assigned SuperVersion throughout its lifetime. As time goes by, this might result in artificially long holdup of the obsolete memtables (potentially referenced by that superversion alone) consequently limiting the supply of the reclaimable memory on the DB instance. This behavior proved to be especially problematic in case of logical backups (outside of RocksDB BackupEngine).

Solution

Building on top of the Refresh(const Snapshot* snapshot) API introduced in #10594, we're adding a new ReadOptions opt-in knob that (when enabled) will instruct the iterator to automatically refresh itself to the latest superversion - all that while retaining the originally assigned, explicit snapshot (supplied in read_options.snapshot at the time of iterator creation) for consistency. To ensure minimal performance overhead we're leveraging relaxed atomic for superversion freshness lookups.

Test plan

Correctness: New test to demonstrate the auto refresh behavior in contrast to legacy iterator: ./db_iterator_test --gtest_filter=*AutoRefreshIterator*.

Stress testing: We're adding command line parameter controlling the feature and hooking it up to as many iterator use cases in db_stress as we reasonably can with random feature on/off configuration in db_crashtest.py.

Benchmarking

The goal of this benchmark is to validate that throughput did not regress substantially. Benchmark was run on optimized build 3-5 times for each respective category OR till convergence. Experiments have been run 'in parallel' (at the same time) on separate db instances within a single host to evenly spread the potential adverse impact of noisy neighbor activities. Host specs [1].

TLDR; Baseline & new solution are practically indistinguishable from performance standpoint. Difference (positive or negative) relative to the baseline, if any, is no more than 1% with Direct IO and no more than 2% when using OS page cache.

Approach:

This feature is only effective on iterators with well-defined snapshot passed via ReadOptions config. We modified the existing db_bench program to reflect that constraint. However, it quickly turned out that the actual Snapshot* initialization is quite expensive. Especially in case of 'tiny scans' (100 rows) contributing as much as 25-35 microseconds, which is ~20-30% of the average per/op latency unintentionally masking potentially adverse performance impact of this change. As a result, we ended up creating a single, explicit 'global' Snapshot* for all the future scans before running multiple experiments en masse. This is also a valuable data point for us to keep in mind in case of any future discussions about taking implicit snapshots - now we know what the lower bound cost could be.

Mixed access benchmark

DB Setup:

./db_bench -db=/tmp/testdb -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -write_buffer_size=655360  -disable_auto_compactions=true -target_file_size_base=16777216 -compression_type=none

Test:

./db_bench -use_existing_db=true -db=/tmp/testdb -statistics=false -cache_size=0 -cache_index_and_filter_blocks=false -benchmarks=seekrandom -preserve_internal_time_seconds=1 -explicit_snapshot=1 -use_direct_reads=? -async_io=? -num=?  -seek_nexts=? -auto_refresh_iterator_with_snapshot=? 

Direct IO bypassing OS page cache

async io = true, direct io read = true

  seek_nexts = 100, num=2000000 seek_nexts = 20000, num=50000 seek_nexts = 400000, num=2000
baseline 29821.5 (± 186) ops/s, 309.05 (± 2.25) MB/s 96.5 (± 0.5) ops/s, 1004.25 (± 0.55) MB/s 204.8 (± 2.3) ms/op, 1013.05 (± 11.55) MB/s
auto refresh 29794.5 (± 262.5) ops/s, 309.65 (± 2.65) MB/s 95.5 (± 0.5) ops/s, 997.5 (± 4.1) MB/s 206 (± 1.8) ms/op, 1006.8 (± 8.4) MB/s

async io = false, direct io read = true

  seek_nexts = 100, num=2000000 seek_nexts = 20000, num=50000 seek_nexts = 400000, num=2000
baseline 23954.5 (± 56.5) ops/sec, 621.55 (± 1.25) MB/s 100.5 (± 0.5) ops/sec, 1046.6 (± 5.9) MB/s 202.4 (± 3.1) ms/op, 1025.3 (± 15.6) MB/s
auto refresh 24025 (± 148) ops/sec, 623.1 (± 3.9) MB/s 99.5 (± 0.5) ops/s, 1037.55 (± 8.45) MB/s 203.4 (± 3.8) ms/op, 1020.6 (± 18.8) MB/s

Using OS page cache

async io = true, direct io read = false

  seek_nexts=100, num=2000000 seek_nexts = 20000, num=50000 seek_nexts = 400000, num=2000
baseline 18618 (± 316) ops/sec, 483.1 (± 8.2) MB/s 71.5 (± 0.5) ops/sec, 748.2 (± 2.6) MB/s 275.4 (±2.7) ms/op, 753.6 (± 7.5) MB/s
auto refresh 17910 (± 192) ops/sec, 465.05 (± 4.55) MB/s 73.5 (± 0.5) ops/sec, 766.9 (± 2.3) MB/s 274.9 (± 2.3) ms/op, 755 (± 6.3) MB/s

async io = false, direct io read = false

  seek_nexts=100, num=2000000 seek_nexts = 20000, num=50000 seek_nexts = 400000, num=2000
baseline 23207 (± 248) ops/sec, 602.05 (± 6.35) MB/s 13644.5 (± 224.5) ops/sec, 756.5 (± 8.4) MB/s 272.5 (± 0.9), 761.45 (± 2.35) MB/s
auto refresh 22960 (± 302) ops/sec, 595.75 (± 8.05) MB/s 13721.5 (± 103) ops/sec, 756.2 (± 5.7) MB/s 272.9 (±0.7) ms/op, 760.45 (± 2.05) MB/s

"DB in memory" benchmark

DB Setup

  1. Allow a single memtable to grow large enough (~572MB) to fit in all the rows. Upon shutdown all the rows will be flushed to the WAL file (inspected 000004.log file is 541MB in size).
./db_bench -db=/tmp/testdb_in_mem -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -write_buffer_size=600000000  max_write_buffer_number=2 -compression_type=none
  1. As a part of recovery in subsequent DB open, WAL will be loaded into the memory and processed to one or more SST files during the recovery. We're selecting a large block cache (cache_size parameter in db_bench script) suitable for holding the entire DB to test the “hot path” CPU overhead. Note that with such configured write_buffer_size and max_write_buffer_number we'll produce a single SST file.
./db_bench -use_existing_db=true -db=/tmp/testdb_in_mem -statistics=false -cache_size=1000000000 -cache_index_and_filter_blocks=true -benchmarks=seekrandom -preserve_internal_time_seconds=1 -write_buffer_size=600000000  max_write_buffer_number=2 -explicit_snapshot=1 -use_direct_reads=? -async_io=? -num=? -seek_nexts=? -auto_refresh_iterator_with_snapshot={0|1}
  seek_nexts=100, num=2000000 seek_nexts = 20000, num=50000 seek_nexts = 400000, num=2000
baseline
auto refresh

[1]

Specs

Property Value
RocksDB version 10.0.0
Date Mon Feb 3 23:21:03 2025
CPU 32 * Intel Xeon Processor (Skylake)
CPUCache 16384 KB
Keys 16 bytes each (+ 0 bytes user-defined timestamp)
Values 100 bytes each (50 bytes after compression)
Prefix 0 bytes
RawSize 5.5 MB (estimated)
FileSize 3.1 MB (estimated)
Compression Snappy
Compression sampling rate 0
Memtablerep SkipListFactory
Perf Level 1

@mszeszko-meta mszeszko-meta changed the title [EXPERIMENTAL] Auto refresh the iterator [EXPERIMENTAL] Auto refresh iterator Jan 31, 2025
@mszeszko-meta mszeszko-meta changed the title [EXPERIMENTAL] Auto refresh iterator [EXPERIMENTAL] Iterator auto refresh Jan 31, 2025
@mszeszko-meta mszeszko-meta force-pushed the auto_refresh_iterator_option branch from 59fe197 to 777bad5 Compare January 31, 2025 03:33
@pdillinger
Copy link
Contributor

problematic in case of backups

in the case of logical backups (not BackupEngine)

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the performance test, I would like to see a unit test where we write some stuff, open an iterator, trigger and wait for a flush+full compaction, step the iterator, wait for background work, then verify some things like maybe size of all memtables is smaller, and all the old live sst files have been deleted. Something like that.

db/arena_wrapped_db_iter.cc Outdated Show resolved Hide resolved
@mszeszko-meta mszeszko-meta force-pushed the auto_refresh_iterator_option branch 3 times, most recently from e03ba41 to e65144a Compare February 4, 2025 09:04
@mszeszko-meta mszeszko-meta changed the title [EXPERIMENTAL] Iterator auto refresh Iterator auto refresh Feb 4, 2025
@mszeszko-meta mszeszko-meta marked this pull request as ready for review February 4, 2025 15:53
@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

include/rocksdb/options.h Outdated Show resolved Hide resolved
unreleased_history/new_features/auto_refresh_iterator.md Outdated Show resolved Hide resolved
include/rocksdb/options.h Outdated Show resolved Hide resolved
Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, test etc. looks good

db/db_iterator_test.cc Outdated Show resolved Hide resolved
db/db_iterator_test.cc Outdated Show resolved Hide resolved
db/db_iterator_test.cc Outdated Show resolved Hide resolved
db/db_iterator_test.cc Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

@mszeszko-meta mszeszko-meta force-pushed the auto_refresh_iterator_option branch from 4d744f6 to 57abeeb Compare February 4, 2025 23:50
@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Along with updating unit tests for reverse iterator stuff, then we should be good.

db/arena_wrapped_db_iter.cc Show resolved Hide resolved
db/arena_wrapped_db_iter.cc Outdated Show resolved Hide resolved
db/arena_wrapped_db_iter.cc Outdated Show resolved Hide resolved
db/arena_wrapped_db_iter.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost forgot: we also need to add this to db_stress. We should add a command line parameter controlling the feature (for differential diagnostic/debugging purposes), connect it up to as many iterator uses in db_stress as we reasonably can, and then set it randomly in db_crashtest.py. This could be another PR, but we can't forget about it as part of the same release cycle.

@mszeszko-meta
Copy link
Contributor Author

Almost forgot: we also need to add this to db_stress. We should add a command line parameter controlling the feature (for differential diagnostic/debugging purposes), connect it up to as many iterator uses in db_stress as we reasonably can, and then set it randomly in db_crashtest.py. This could be another PR, but we can't forget about it as part of the same release cycle.

Agreed. Separately, with even lesser urgency (could be different release cycle), we would probably want to add similar plumbing for db_bench.

@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

@mszeszko-meta mszeszko-meta changed the title Iterator auto refresh Auto refresh iterator with snapshot Feb 7, 2025
@mszeszko-meta mszeszko-meta force-pushed the auto_refresh_iterator_option branch from 02713d5 to 754758f Compare February 7, 2025 03:38
@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

tools/db_bench_tool.cc Outdated Show resolved Hide resolved
@mszeszko-meta mszeszko-meta force-pushed the auto_refresh_iterator_option branch from 754758f to 8b55a50 Compare February 8, 2025 05:32
@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

@mszeszko-meta mszeszko-meta force-pushed the auto_refresh_iterator_option branch from 8b55a50 to 230c0f6 Compare February 8, 2025 05:36
@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM implementation-wise! Let's maybe chat in person about the latest performance tests before landing

@@ -0,0 +1 @@
Introduced new `auto_refresh_iterator_with_snapshot` opt-in knob that (when enabled) will periodically release obsolete memory and storage resources for as long as the iterator is making progress and its supplied `read_options.snapshot` was initialized with well-defined (non-nullptr) value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: nullptr is well defined

@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants