Auto refresh iterator with snapshot #13354
Conversation
Force-pushed from 59fe197 to 777bad5
in the case of logical backups (not BackupEngine)
In addition to the performance test, I would like to see a unit test where we write some stuff, open an iterator, trigger and wait for a flush+full compaction, step the iterator, wait for background work, then verify some things like maybe size of all memtables is smaller, and all the old live sst files have been deleted. Something like that.
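The flush-and-step scenario above can be modeled with a tiny toy simulation (plain Python, no RocksDB dependency; all class and method names here are hypothetical, for illustration only) showing why a legacy iterator keeps obsolete memtables alive while an auto-refreshing one lets them go:

```python
# Toy model (NOT RocksDB code) of an iterator pinning a superversion at
# creation time, and of auto-refresh dropping that pin on the next step.

class SuperVersion:
    def __init__(self, version, memtables):
        self.version = version              # freshness counter
        self.memtables = list(memtables)    # memtables this version references

class ToyDB:
    def __init__(self):
        self.version = 0
        self.current = SuperVersion(self.version, ["mem-0"])

    def flush(self):
        # Flushing installs a fresh superversion that no longer references
        # the old memtable; whether the old memtable is reclaimable depends
        # on who still holds the previous superversion.
        self.version += 1
        self.current = SuperVersion(self.version, [f"mem-{self.version}"])

class ToyIterator:
    def __init__(self, db, auto_refresh=False):
        self.db = db
        self.auto_refresh = auto_refresh
        self.sv = db.current  # pinned at creation ("Init-time" superversion)

    def next(self):
        # With auto-refresh, each step cheaply compares version counters and
        # swaps to the latest superversion, releasing the stale pin.
        if self.auto_refresh and self.sv.version != self.db.current.version:
            self.sv = self.db.current

db = ToyDB()
legacy = ToyIterator(db, auto_refresh=False)
auto = ToyIterator(db, auto_refresh=True)
db.flush()   # the old memtable becomes obsolete
legacy.next()
auto.next()
print(legacy.sv.memtables)  # ['mem-0'] -> still pins the obsolete memtable
print(auto.sv.memtables)    # ['mem-1'] -> released the old superversion
```

A real unit test along the lines suggested above would additionally assert observable effects (e.g. `rocksdb.cur-size-all-mem-tables` shrinking, old live SST files deleted), which this toy model only mimics.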
Force-pushed from e03ba41 to e65144a
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.
Otherwise, test etc. looks good
Force-pushed from 4d744f6 to 57abeeb
Along with updating unit tests for reverse iterator stuff, then we should be good.
Almost forgot: we also need to add this to db_stress. We should add a command line parameter controlling the feature (for differential diagnostic/debugging purposes), connect it up to as many iterator uses in db_stress as we reasonably can, and then set it randomly in db_crashtest.py. This could be another PR, but we can't forget about it as part of the same release cycle.
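For illustration, the randomized gating described above might look roughly like this in `db_crashtest.py` style (a sketch only; the actual parameter name and wiring are whatever the follow-up change defines, and `finalize_and_sanitize` here is a simplified stand-in):

```python
import random

# Sketch of db_crashtest.py-style parameter randomization: each stress run
# samples the feature flag so crash tests cover both code paths. The
# parameter name mirrors the ReadOptions knob but is illustrative.
default_params = {
    "auto_refresh_iterator_with_snapshot": lambda: random.choice([0, 1]),
}

def finalize_and_sanitize(params):
    # Resolve callable entries into concrete values for one stress run.
    return {k: (v() if callable(v) else v) for k, v in params.items()}

random.seed(0)
cfg = finalize_and_sanitize(default_params)
print(cfg["auto_refresh_iterator_with_snapshot"] in (0, 1))  # True
```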
Agreed. Separately, with even lesser urgency (could be different release cycle), we would probably want to add similar plumbing for
Force-pushed from 02713d5 to 754758f
Force-pushed from 754758f to 8b55a50
Force-pushed from 8b55a50 to 230c0f6
LGTM implementation-wise! Let's maybe chat in person about the latest performance tests before landing
@@ -0,0 +1 @@
Introduced new `auto_refresh_iterator_with_snapshot` opt-in knob that (when enabled) will periodically release obsolete memory and storage resources for as long as the iterator is making progress and its supplied `read_options.snapshot` was initialized with well-defined (non-nullptr) value.
Nit: nullptr is well defined
Problem
Once opened, an iterator will preserve its respective RocksDB snapshot for read consistency. Unless explicitly `Refresh`'ed, the iterator will hold on to the `Init`-time assigned `SuperVersion` throughout its lifetime. As time goes by, this might result in an artificially long holdup of obsolete memtables (potentially referenced by that superversion alone), consequently limiting the supply of reclaimable memory on the DB instance. This behavior proved to be especially problematic in the case of logical backups (outside of RocksDB `BackupEngine`).

Solution
Building on top of the `Refresh(const Snapshot* snapshot)` API introduced in #10594, we're adding a new `ReadOptions` opt-in knob that (when enabled) instructs the iterator to automatically refresh itself to the latest superversion, all while retaining the originally assigned, explicit snapshot (supplied in `read_options.snapshot` at the time of iterator creation) for consistency. To ensure minimal performance overhead, we're leveraging a relaxed atomic for superversion freshness lookups.

Test plan
Correctness: new test demonstrating the auto refresh behavior in contrast to the legacy iterator: `./db_iterator_test --gtest_filter=*AutoRefreshIterator*`.

Stress testing: we're adding a command line parameter controlling the feature and hooking it up to as many iterator use cases in `db_stress` as we reasonably can, with random feature on/off configuration in `db_crashtest.py`.

Benchmarking
The goal of this benchmark is to validate that throughput did not regress substantially. The benchmark was run on an optimized build 3-5 times for each respective category, or until convergence. Experiments were run in parallel (at the same time) on separate DB instances within a single host to evenly spread the potential adverse impact of noisy-neighbor activities. Host specs [1].

TL;DR: Baseline and new solution are practically indistinguishable from a performance standpoint. The difference (positive or negative) relative to the baseline, if any, is no more than 1% with Direct IO and no more than 2% when using the OS page cache.
Approach:

This feature is only effective on iterators with a well-defined `snapshot` passed via the `ReadOptions` config. We modified the existing `db_bench` program to reflect that constraint. However, it quickly turned out that the actual `Snapshot*` initialization is quite expensive, especially in the case of 'tiny scans' (100 rows), contributing as much as 25-35 microseconds, which is ~20-30% of the average per-op latency, unintentionally masking the potentially adverse performance impact of this change. As a result, we ended up creating a single, explicit 'global' `Snapshot*` for all future scans before running multiple experiments en masse. This is also a valuable data point to keep in mind in case of any future discussions about taking implicit snapshots - now we know what the lower bound cost could be.

Mixed access benchmark
DB Setup:
Test:
Direct IO bypassing OS page cache
async io = true, direct io read = true
async io = false, direct io read = true
Using OS page cache
async io = true, direct io read = false
async io = false, direct io read = false
"DB in memory" benchmark
DB Setup
`000004.log` file is 541MB in size). Cache (`cache_size` parameter in the `db_bench` script) suitable for holding the entire DB, to test the "hot path" CPU overhead. Note that with such configured `write_buffer_size` and `max_write_buffer_number` we'll produce a single SST file.

[1] Specs