Skip to content

Commit

Permalink
Set Write rate limiter priority dynamically and pass it to FS (facebo…
Browse files Browse the repository at this point in the history
…ok#9988)

Summary:
### Context:
Background compactions and flush generate large reads and writes, and can be long running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.

From the RocksDB perspective, there can be two kinds of rate limiters, the internal (native) one and the external one.
- The internal (native) rate limiter is introduced in [the wiki](https://github.com/facebook/rocksdb/wiki/Rate-Limiter). Currently, only IO_LOW and IO_HIGH are used and they are set statically.
- For the external rate limiter, in FSWritableFile functions,  IOOptions is open for end users to set and get rate_limiter_priority for their own rate limiter. Currently, RocksDB doesn’t pass the rate_limiter_priority through IOOptions to the file system.

### Solution
During the User Read, Flush write, Compaction read/write, the WriteController is used to determine whether DB writes are stalled or slowed down. The rate limiter priority (Env::IOPriority) can be determined accordingly. We decided to always pass the priority in IOOptions. What the file system does with it should be a contract between the user and the file system. We would like to set the rate limiter priority at file level, since the Flush/Compaction job level may be too coarse with multiple files and block IO level is too granular.

**This PR is for the Write path.** The **Write:** dynamic priority for different state are listed as follows:

| State | Normal | Delayed | Stalled |
| ----- | ------ | ------- | ------- |
|  Flush | IO_HIGH | IO_USER | IO_USER |
|  Compaction | IO_LOW | IO_USER | IO_USER |

Flush and Compaction writes share the same call path through BlockBaseTableWriter, WritableFileWriter, and FSWritableFile. When a new FSWritableFile object is created, its io_priority_ can be set dynamically based on the state of the WriteController. In WritableFileWriter, before the call sites of FSWritableFile functions, WritableFileWriter::DecideRateLimiterPriority() determines the rate_limiter_priority. The options (IOOptions) argument of FSWritableFile functions will be updated with the rate_limiter_priority.

Pull Request resolved: facebook#9988

Test Plan: Add unit tests.

Reviewed By: anand1976

Differential Revision: D36395159

Pulled By: gitbw95

fbshipit-source-id: a7c82fc29759139a1a07ec46c37dbf7e753474cf
  • Loading branch information
gitbw95 authored and facebook-github-bot committed May 18, 2022
1 parent b84e336 commit 05c678e
Show file tree
Hide file tree
Showing 11 changed files with 367 additions and 51 deletions.
2 changes: 2 additions & 0 deletions HISTORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
* Fixed a bug where RocksDB could corrupt DBs with `avoid_flush_during_recovery == true` by removing valid WALs, leading to `Status::Corruption` with message like "SST file is ahead of WALs" when attempting to reopen.
* Fixed a bug in async_io path where incorrect length of data is read by FilePrefetchBuffer if data is consumed from two populated buffers and request for more data is sent.
* Fixed a CompactionFilter bug. Compaction filter used to use `Delete` to remove keys, even if the keys should be removed with `SingleDelete`. Mixing `Delete` and `SingleDelete` may cause undefined behavior.
* Fixed a bug in `WritableFileWriter::WriteDirect` and `WritableFileWriter::WriteDirectWithChecksum`. The rate_limiter_priority specified in ReadOptions was not passed to the RateLimiter when requesting a token.
* Fixed a bug which might cause process crash when I/O error happens when reading an index block in MultiGet().

### New Features
Expand All @@ -29,6 +30,7 @@
### Behavior changes
* Enforce the existing contract of SingleDelete so that SingleDelete cannot be mixed with Delete because it leads to undefined behavior. Fix a number of unit tests that violate the contract but happen to pass.
* ldb `--try_load_options` default to true if `--db` is specified and not creating a new DB, the user can still explicitly disable that by `--try_load_options=false` (or explicitly enable that by `--try_load_options`).
* During Flush write or Compaction write, the WriteController is used to determine whether DB writes are stalled or slowed down. The priority (Env::IOPriority) can then be determined accordingly and be passed in IOOptions to the file system.

## 7.2.0 (04/15/2022)
### Bug Fixes
Expand Down
15 changes: 14 additions & 1 deletion db/compaction/compaction_job.cc
Original file line number Diff line number Diff line change
Expand Up @@ -2285,7 +2285,7 @@ Status CompactionJob::OpenCompactionOutputFile(
/*enable_hash=*/paranoid_file_checks_);
}

writable_file->SetIOPriority(Env::IOPriority::IO_LOW);
writable_file->SetIOPriority(GetRateLimiterPriority());
writable_file->SetWriteLifeTimeHint(write_hint_);
FileTypeSet tmp_set = db_options_.checksum_handoff_file_types;
writable_file->SetPreallocationBlockSize(static_cast<size_t>(
Expand Down Expand Up @@ -2476,6 +2476,19 @@ std::string CompactionJob::GetTableFileName(uint64_t file_number) {
file_number, compact_->compaction->output_path_id());
}

Env::IOPriority CompactionJob::GetRateLimiterPriority() {
if (versions_ && versions_->GetColumnFamilySet() &&
versions_->GetColumnFamilySet()->write_controller()) {
WriteController* write_controller =
versions_->GetColumnFamilySet()->write_controller();
if (write_controller->NeedsDelay() || write_controller->IsStopped()) {
return Env::IO_USER;
}
}

return Env::IO_LOW;
}

#ifndef ROCKSDB_LITE
std::string CompactionServiceCompactionJob::GetTableFileName(
uint64_t file_number) {
Expand Down
6 changes: 6 additions & 0 deletions db/compaction/compaction_job.h
Original file line number Diff line number Diff line change
Expand Up @@ -137,6 +137,8 @@ class CompactionJob {
IOStatus io_status_;

private:
friend class CompactionJobTestBase;

// Generates a histogram representing potential divisions of key ranges from
// the input. It adds the starting and/or ending keys of certain input files
// to the working set and then finds the approximate size of data in between
Expand Down Expand Up @@ -234,6 +236,10 @@ class CompactionJob {
// Get table file name in where it's outputting to, which should also be in
// `output_directory_`.
virtual std::string GetTableFileName(uint64_t file_number);
// The rate limiter priority (io_priority) is determined dynamically here.
// The Compaction Read and Write priorities are the same for different
// scenarios, such as write stalled.
Env::IOPriority GetRateLimiterPriority();
};

// CompactionServiceInput is used the pass compaction information between two
Expand Down
40 changes: 39 additions & 1 deletion db/compaction/compaction_job_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -321,7 +321,8 @@ class CompactionJobTestBase : public testing::Test {
const std::vector<SequenceNumber>& snapshots = {},
SequenceNumber earliest_write_conflict_snapshot = kMaxSequenceNumber,
int output_level = 1, bool verify = true,
uint64_t expected_oldest_blob_file_number = kInvalidBlobFileNumber) {
uint64_t expected_oldest_blob_file_number = kInvalidBlobFileNumber,
bool check_get_priority = false) {
auto cfd = versions_->GetColumnFamilySet()->GetDefault();

size_t num_input_files = 0;
Expand Down Expand Up @@ -390,6 +391,32 @@ class CompactionJobTestBase : public testing::Test {
expected_oldest_blob_file_number);
}
}

if (check_get_priority) {
CheckGetRateLimiterPriority(compaction_job);
}
}

void CheckGetRateLimiterPriority(CompactionJob& compaction_job) {
// When the state from WriteController is normal.
ASSERT_EQ(compaction_job.GetRateLimiterPriority(), Env::IO_LOW);

WriteController* write_controller =
compaction_job.versions_->GetColumnFamilySet()->write_controller();

{
// When the state from WriteController is Delayed.
std::unique_ptr<WriteControllerToken> delay_token =
write_controller->GetDelayToken(1000000);
ASSERT_EQ(compaction_job.GetRateLimiterPriority(), Env::IO_USER);
}

{
// When the state from WriteController is Stopped.
std::unique_ptr<WriteControllerToken> stop_token =
write_controller->GetStopToken();
ASSERT_EQ(compaction_job.GetRateLimiterPriority(), Env::IO_USER);
}
}

std::shared_ptr<Env> env_guard_;
Expand Down Expand Up @@ -1303,6 +1330,17 @@ TEST_F(CompactionJobTest, ResultSerialization) {
}
}

TEST_F(CompactionJobTest, GetRateLimiterPriority) {
NewDB();

auto expected_results = CreateTwoFiles(false);
auto cfd = versions_->GetColumnFamilySet()->GetDefault();
auto files = cfd->current()->storage_info()->LevelFiles(0);
ASSERT_EQ(2U, files.size());
RunCompaction({files}, expected_results, {}, kMaxSequenceNumber, 1, true,
kInvalidBlobFileNumber, true);
}

class CompactionJobTimestampTest : public CompactionJobTestBase {
public:
CompactionJobTimestampTest()
Expand Down
17 changes: 15 additions & 2 deletions db/flush_job.cc
Original file line number Diff line number Diff line change
Expand Up @@ -810,6 +810,7 @@ Status FlushJob::WriteLevel0Table() {

{
auto write_hint = cfd_->CalculateSSTWriteHint(0);
Env::IOPriority io_priority = GetRateLimiterPriorityForWrite();
db_mutex_->Unlock();
if (log_buffer_) {
log_buffer_->FlushBufferToLog();
Expand Down Expand Up @@ -925,7 +926,7 @@ Status FlushJob::WriteLevel0Table() {
snapshot_checker_, mutable_cf_options_.paranoid_file_checks,
cfd_->internal_stats(), &io_s, io_tracer_,
BlobFileCreationReason::kFlush, event_logger_, job_context_->job_id,
Env::IO_HIGH, &table_properties_, write_hint, full_history_ts_low,
io_priority, &table_properties_, write_hint, full_history_ts_low,
blob_callback_, &num_input_entries, &memtable_payload_bytes,
&memtable_garbage_bytes);
// TODO: Cleanup io_status in BuildTable and table builders
Expand Down Expand Up @@ -1032,6 +1033,19 @@ Status FlushJob::WriteLevel0Table() {
return s;
}

Env::IOPriority FlushJob::GetRateLimiterPriorityForWrite() {
if (versions_ && versions_->GetColumnFamilySet() &&
versions_->GetColumnFamilySet()->write_controller()) {
WriteController* write_controller =
versions_->GetColumnFamilySet()->write_controller();
if (write_controller->IsStopped() || write_controller->NeedsDelay()) {
return Env::IO_USER;
}
}

return Env::IO_HIGH;
}

#ifndef ROCKSDB_LITE
std::unique_ptr<FlushJobInfo> FlushJob::GetFlushJobInfo() const {
db_mutex_->AssertHeld();
Expand Down Expand Up @@ -1064,7 +1078,6 @@ std::unique_ptr<FlushJobInfo> FlushJob::GetFlushJobInfo() const {
}
return info;
}

#endif // !ROCKSDB_LITE

} // namespace ROCKSDB_NAMESPACE
4 changes: 4 additions & 0 deletions db/flush_job.h
Original file line number Diff line number Diff line change
Expand Up @@ -94,6 +94,8 @@ class FlushJob {
#endif // !ROCKSDB_LITE

private:
friend class FlushJobTest_GetRateLimiterPriorityForWrite_Test;

void ReportStartedFlush();
void ReportFlushInputSize(const autovector<MemTable*>& mems);
void RecordFlushIOStats();
Expand Down Expand Up @@ -121,6 +123,8 @@ class FlushJob {
// process has not matured yet.
Status MemPurge();
bool MemPurgeDecider();
// The rate limiter priority (io_priority) is determined dynamically here.
Env::IOPriority GetRateLimiterPriorityForWrite();
#ifndef ROCKSDB_LITE
std::unique_ptr<FlushJobInfo> GetFlushJobInfo() const;
#endif // !ROCKSDB_LITE
Expand Down
66 changes: 66 additions & 0 deletions db/flush_job_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -528,6 +528,72 @@ TEST_F(FlushJobTest, Snapshots) {
job_context.Clean();
}

TEST_F(FlushJobTest, GetRateLimiterPriorityForWrite) {
// Prepare a FlushJob that flush MemTables of Single Column Family.
const size_t num_mems = 2;
const size_t num_mems_to_flush = 1;
const size_t num_keys_per_table = 100;
JobContext job_context(0);
ColumnFamilyData* cfd = versions_->GetColumnFamilySet()->GetDefault();
std::vector<uint64_t> memtable_ids;
std::vector<MemTable*> new_mems;
for (size_t i = 0; i != num_mems; ++i) {
MemTable* mem = cfd->ConstructNewMemtable(*cfd->GetLatestMutableCFOptions(),
kMaxSequenceNumber);
mem->SetID(i);
mem->Ref();
new_mems.emplace_back(mem);
memtable_ids.push_back(mem->GetID());

for (size_t j = 0; j < num_keys_per_table; ++j) {
std::string key(std::to_string(j + i * num_keys_per_table));
std::string value("value" + key);
ASSERT_OK(mem->Add(SequenceNumber(j + i * num_keys_per_table), kTypeValue,
key, value, nullptr /* kv_prot_info */));
}
}

autovector<MemTable*> to_delete;
for (auto mem : new_mems) {
cfd->imm()->Add(mem, &to_delete);
}

EventLogger event_logger(db_options_.info_log.get());
SnapshotChecker* snapshot_checker = nullptr; // not relavant

assert(memtable_ids.size() == num_mems);
uint64_t smallest_memtable_id = memtable_ids.front();
uint64_t flush_memtable_id = smallest_memtable_id + num_mems_to_flush - 1;
FlushJob flush_job(
dbname_, versions_->GetColumnFamilySet()->GetDefault(), db_options_,
*cfd->GetLatestMutableCFOptions(), flush_memtable_id, env_options_,
versions_.get(), &mutex_, &shutting_down_, {}, kMaxSequenceNumber,
snapshot_checker, &job_context, nullptr, nullptr, nullptr, kNoCompression,
db_options_.statistics.get(), &event_logger, true,
true /* sync_output_directory */, true /* write_manifest */,
Env::Priority::USER, nullptr /*IOTracer*/);

// When the state from WriteController is normal.
ASSERT_EQ(flush_job.GetRateLimiterPriorityForWrite(), Env::IO_HIGH);

WriteController* write_controller =
flush_job.versions_->GetColumnFamilySet()->write_controller();

{
// When the state from WriteController is Delayed.
std::unique_ptr<WriteControllerToken> delay_token =
write_controller->GetDelayToken(1000000);
ASSERT_EQ(flush_job.GetRateLimiterPriorityForWrite(), Env::IO_USER);
}

{
// When the state from WriteController is Stopped.
std::unique_ptr<WriteControllerToken> stop_token =
write_controller->GetStopToken();
ASSERT_EQ(flush_job.GetRateLimiterPriorityForWrite(), Env::IO_USER);
}
}

class FlushJobTimestampTest : public FlushJobTestBase {
public:
FlushJobTimestampTest()
Expand Down
2 changes: 1 addition & 1 deletion db/write_controller.h
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ class WriteController {
bool IsStopped() const;
bool NeedsDelay() const { return total_delayed_.load() > 0; }
bool NeedSpeedupCompaction() const {
return IsStopped() || NeedsDelay() || total_compaction_pressure_ > 0;
return IsStopped() || NeedsDelay() || total_compaction_pressure_.load() > 0;
}
// return how many microseconds the caller needs to sleep after the call
// num_bytes: how many number of bytes to put into the DB.
Expand Down
Loading

0 comments on commit 05c678e

Please sign in to comment.