Refactor trimming logic for immutable memtables (facebook#5022)
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain`, and it takes precedence over the old `max_write_buffer_number_to_maintain` if both are set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and the mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (also known as trimming history), starting from the oldest one.
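As a quick illustration, a minimal sketch of opting into the size-based history (the 128MB budget is an arbitrary example value, not a recommended default):

```cpp
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  // Budget for the mutable memtable plus flushed immutable memtables
  // kept around for conflict checking.
  options.max_write_buffer_size_to_maintain = 128 * 1024 * 1024;  // 128MB
  // With a non-zero size-based limit, the old count-based option is
  // superseded and can be left at its default.
  options.max_write_buffer_number_to_maintain = 0;
  return 0;
}
```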
The old option's semantics actually provide both an upper and a lower bound: history trimming starts once the number of immutable memtables exceeds the limit, but trimming never reduces the count below (limit - 1).
To mimic this behavior with the new option, history trimming stops if dropping the next immutable memtable would bring the total memory usage below the size limit. For example, suppose the size limit is 64MB and there are three immutable memtables of sizes 20MB, 30MB, and 30MB. Although the total usage is 80MB > 64MB, dropping the oldest memtable would reduce it to 60MB < 64MB, so in this case no memtable is dropped.
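A sketch of that stopping rule (a hypothetical helper, not the actual RocksDB code; the real logic lives in the memtable list's TrimHistory):

```cpp
#include <cstdint>
#include <deque>

// Illustrative only: drop the oldest immutable memtables while doing so
// still leaves total usage at or above the limit.
void TrimHistorySketch(std::deque<int64_t>* imm_sizes,  // oldest first
                       int64_t mutable_size, int64_t limit) {
  int64_t total = mutable_size;
  for (int64_t s : *imm_sizes) {
    total += s;
  }
  // Stop as soon as dropping the next memtable would go below the limit.
  // E.g. limit = 64MB, sizes {20, 30, 30}MB: 80 - 20 = 60 < 64, so
  // nothing is dropped.
  while (!imm_sizes->empty() && total - imm_sizes->front() >= limit) {
    total -= imm_sizes->front();
    imm_sizes->pop_front();
  }
}
```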
Pull Request resolved: facebook#5022

Differential Revision: D14394062

Pulled By: miasantreble

fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
miasantreble authored and facebook-github-bot committed Aug 23, 2019
1 parent 26293c8 commit 2f41ecf
Showing 53 changed files with 522 additions and 107 deletions.
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -530,6 +530,7 @@ set(SOURCES
db/table_cache.cc
db/table_properties_collector.cc
db/transaction_log_impl.cc
db/trim_history_scheduler.cc
db/version_builder.cc
db/version_edit.cc
db/version_set.cc
2 changes: 2 additions & 0 deletions HISTORY.md
@@ -6,6 +6,8 @@
* Fix bloom filter lookups by the MultiGet batching API when BlockBasedTableOptions::whole_key_filtering is false, by checking that a key is in the prefix_extractor domain and extracting the prefix before looking up.
### New Features
* VerifyChecksum() by default will issue readahead. Allow ReadOptions to be passed in to those functions to override the readahead size. For checksum verification before external SST file ingestion, a new option, IngestExternalFileOptions.verify_checksums_readahead_size, is added for this readahead setting.
### Public API Change
* Added max_write_buffer_size_to_maintain option to better control memory usage of immutable memtables.

## 6.4.0 (7/30/2019)
### Default Option Change
1 change: 1 addition & 0 deletions TARGETS
@@ -158,6 +158,7 @@ cpp_library(
"db/table_cache.cc",
"db/table_properties_collector.cc",
"db/transaction_log_impl.cc",
"db/trim_history_scheduler.cc",
"db/version_builder.cc",
"db/version_edit.cc",
"db/version_set.cc",
5 changes: 5 additions & 0 deletions db/c.cc
@@ -2514,6 +2514,11 @@ void rocksdb_options_set_max_write_buffer_number_to_maintain(
opt->rep.max_write_buffer_number_to_maintain = n;
}

void rocksdb_options_set_max_write_buffer_size_to_maintain(
rocksdb_options_t* opt, int64_t n) {
opt->rep.max_write_buffer_size_to_maintain = n;
}

void rocksdb_options_set_enable_pipelined_write(rocksdb_options_t* opt,
unsigned char v) {
opt->rep.enable_pipelined_write = v;
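For completeness, a small sketch of calling the new C API setter (the create/destroy calls are the standard rocksdb C API; the 64MB value is arbitrary):

```cpp
#include <rocksdb/c.h>

int main() {
  rocksdb_options_t* opts = rocksdb_options_create();
  // Maintain up to 64MB of mutable + immutable memtable history.
  rocksdb_options_set_max_write_buffer_size_to_maintain(
      opts, 64 * 1024 * 1024);
  // ... pass opts to rocksdb_open() as usual ...
  rocksdb_options_destroy(opts);
  return 0;
}
```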
12 changes: 10 additions & 2 deletions db/column_family.cc
@@ -227,7 +227,14 @@ ColumnFamilyOptions SanitizeOptions(const ImmutableDBOptions& db_options,
if (result.max_write_buffer_number < 2) {
result.max_write_buffer_number = 2;
}
if (result.max_write_buffer_number_to_maintain < 0) {
// Fall back to max_write_buffer_number_to_maintain if
// max_write_buffer_size_to_maintain is not set
if (result.max_write_buffer_size_to_maintain < 0) {
result.max_write_buffer_size_to_maintain =
result.max_write_buffer_number *
static_cast<int64_t>(result.write_buffer_size);
} else if (result.max_write_buffer_size_to_maintain == 0 &&
result.max_write_buffer_number_to_maintain < 0) {
result.max_write_buffer_number_to_maintain = result.max_write_buffer_number;
}
// bloom filter size shouldn't exceed 1/4 of memtable size.
@@ -423,7 +430,8 @@ ColumnFamilyData::ColumnFamilyData(
write_buffer_manager_(write_buffer_manager),
mem_(nullptr),
imm_(ioptions_.min_write_buffer_number_to_merge,
ioptions_.max_write_buffer_number_to_maintain),
ioptions_.max_write_buffer_number_to_maintain,
ioptions_.max_write_buffer_size_to_maintain),
super_version_(nullptr),
super_version_number_(0),
local_sv_(new ThreadLocalPtr(&SuperVersionUnrefHandle)),
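To make the sanitize fallback above concrete, a sketch with illustrative values (assuming no other sanitization interferes):

```cpp
#include <rocksdb/options.h>

int main() {
  rocksdb::ColumnFamilyOptions cf;
  cf.write_buffer_size = 64 << 20;            // 64MB
  cf.max_write_buffer_number = 2;
  cf.max_write_buffer_size_to_maintain = -1;  // negative: derive a default
  // After SanitizeOptions: max_write_buffer_size_to_maintain becomes
  // max_write_buffer_number * write_buffer_size = 128MB.
  // Had it been 0 with max_write_buffer_number_to_maintain < 0, the old
  // count-based option would instead fall back to max_write_buffer_number.
  return 0;
}
```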
11 changes: 7 additions & 4 deletions db/column_family_test.cc
@@ -1132,22 +1132,25 @@ TEST_P(ColumnFamilyTest, DifferentWriteBufferSizes) {
default_cf.arena_block_size = 4 * 4096;
default_cf.max_write_buffer_number = 10;
default_cf.min_write_buffer_number_to_merge = 1;
default_cf.max_write_buffer_number_to_maintain = 0;
default_cf.max_write_buffer_size_to_maintain = 0;
one.write_buffer_size = 200000;
one.arena_block_size = 4 * 4096;
one.max_write_buffer_number = 10;
one.min_write_buffer_number_to_merge = 2;
one.max_write_buffer_number_to_maintain = 1;
one.max_write_buffer_size_to_maintain =
static_cast<int>(one.write_buffer_size);
two.write_buffer_size = 1000000;
two.arena_block_size = 4 * 4096;
two.max_write_buffer_number = 10;
two.min_write_buffer_number_to_merge = 3;
two.max_write_buffer_number_to_maintain = 2;
two.max_write_buffer_size_to_maintain =
static_cast<int>(two.write_buffer_size);
three.write_buffer_size = 4096 * 22;
three.arena_block_size = 4096;
three.max_write_buffer_number = 10;
three.min_write_buffer_number_to_merge = 4;
three.max_write_buffer_number_to_maintain = -1;
three.max_write_buffer_size_to_maintain =
static_cast<int>(three.write_buffer_size);

Reopen({default_cf, one, two, three});

5 changes: 3 additions & 2 deletions db/db_basic_test.cc
@@ -297,7 +297,7 @@ TEST_F(DBBasicTest, FlushMultipleMemtable) {
writeOpt.disableWAL = true;
options.max_write_buffer_number = 4;
options.min_write_buffer_number_to_merge = 3;
options.max_write_buffer_number_to_maintain = -1;
options.max_write_buffer_size_to_maintain = -1;
CreateAndReopenWithCF({"pikachu"}, options);
ASSERT_OK(dbfull()->Put(writeOpt, handles_[1], "foo", "v1"));
ASSERT_OK(Flush(1));
@@ -327,7 +327,8 @@ TEST_F(DBBasicTest, FlushEmptyColumnFamily) {
writeOpt.disableWAL = true;
options.max_write_buffer_number = 2;
options.min_write_buffer_number_to_merge = 1;
options.max_write_buffer_number_to_maintain = 1;
options.max_write_buffer_size_to_maintain =
static_cast<int64_t>(options.write_buffer_size);
CreateAndReopenWithCF({"pikachu"}, options);

// Compaction can still go through even if no thread can flush the
2 changes: 1 addition & 1 deletion db/db_compaction_test.cc
@@ -141,7 +141,7 @@ Options DeletionTriggerOptions(Options options) {
options.compression = kNoCompression;
options.write_buffer_size = kCDTKeysPerBuffer * (kCDTValueSize + 24);
options.min_write_buffer_number_to_merge = 1;
options.max_write_buffer_number_to_maintain = 0;
options.max_write_buffer_size_to_maintain = 0;
options.num_levels = kCDTNumLevels;
options.level0_file_num_compaction_trigger = 1;
options.target_file_size_base = options.write_buffer_size * 2;
1 change: 1 addition & 0 deletions db/db_impl/db_impl.cc
@@ -472,6 +472,7 @@ Status DBImpl::CloseHelper() {
&files_grabbed_for_purge_);
EraseThreadStatusDbInfo();
flush_scheduler_.Clear();
trim_history_scheduler_.Clear();

while (!flush_queue_.empty()) {
const FlushRequest& flush_req = PopFirstFromFlushQueue();
5 changes: 5 additions & 0 deletions db/db_impl/db_impl.h
@@ -37,6 +37,7 @@
#include "db/read_callback.h"
#include "db/snapshot_checker.h"
#include "db/snapshot_impl.h"
#include "db/trim_history_scheduler.h"
#include "db/version_edit.h"
#include "db/wal_manager.h"
#include "db/write_controller.h"
@@ -1355,6 +1356,8 @@ class DBImpl : public DB {

void MaybeFlushStatsCF(autovector<ColumnFamilyData*>* cfds);

Status TrimMemtableHistory(WriteContext* context);

Status SwitchMemtable(ColumnFamilyData* cfd, WriteContext* context);

void SelectColumnFamiliesForAtomicFlush(autovector<ColumnFamilyData*>* cfds);
@@ -1733,6 +1736,8 @@

FlushScheduler flush_scheduler_;

TrimHistoryScheduler trim_history_scheduler_;

SnapshotList snapshots_;

// For each background job, pending_outputs_ keeps the current file number at
8 changes: 5 additions & 3 deletions db/db_impl/db_impl_open.cc
@@ -862,9 +862,10 @@ Status DBImpl::RecoverLogFiles(const std::vector<uint64_t>& log_numbers,
// That's why we set ignore missing column families to true
bool has_valid_writes = false;
status = WriteBatchInternal::InsertInto(
&batch, column_family_memtables_.get(), &flush_scheduler_, true,
log_number, this, false /* concurrent_memtable_writes */,
next_sequence, &has_valid_writes, seq_per_batch_, batch_per_txn_);
&batch, column_family_memtables_.get(), &flush_scheduler_,
&trim_history_scheduler_, true, log_number, this,
false /* concurrent_memtable_writes */, next_sequence,
&has_valid_writes, seq_per_batch_, batch_per_txn_);
MaybeIgnoreError(&status);
if (!status.ok()) {
// We are treating this as a failure while reading since we read valid
@@ -931,6 +932,7 @@ Status DBImpl::RecoverLogFiles(const std::vector<uint64_t>& log_numbers,
}

flush_scheduler_.Clear();
trim_history_scheduler_.Clear();
auto last_sequence = *next_sequence - 1;
if ((*next_sequence != kMaxSequenceNumber) &&
(versions_->LastSequence() <= last_sequence)) {
6 changes: 3 additions & 3 deletions db/db_impl/db_impl_secondary.cc
@@ -253,9 +253,9 @@ Status DBImplSecondary::RecoverLogFiles(
bool has_valid_writes = false;
status = WriteBatchInternal::InsertInto(
&batch, column_family_memtables_.get(),
nullptr /* flush_scheduler */, true, log_number, this,
false /* concurrent_memtable_writes */, next_sequence,
&has_valid_writes, seq_per_batch_, batch_per_txn_);
nullptr /* flush_scheduler */, nullptr /* trim_history_scheduler */,
true, log_number, this, false /* concurrent_memtable_writes */,
next_sequence, &has_valid_writes, seq_per_batch_, batch_per_txn_);
}
// If column family was not found, it might mean that the WAL write
// batch references to the column family that was dropped after the
55 changes: 44 additions & 11 deletions db/db_impl/db_impl_write.cc
@@ -171,6 +171,7 @@ Status DBImpl::WriteImpl(const WriteOptions& write_options,
versions_->GetColumnFamilySet());
w.status = WriteBatchInternal::InsertInto(
&w, w.sequence, &column_family_memtables, &flush_scheduler_,
&trim_history_scheduler_,
write_options.ignore_missing_column_families, 0 /*log_number*/, this,
true /*concurrent_memtable_writes*/, seq_per_batch_, w.batch_cnt,
batch_per_txn_);
@@ -375,7 +376,8 @@ Status DBImpl::WriteImpl(const WriteOptions& write_options,
// w.sequence will be set inside InsertInto
w.status = WriteBatchInternal::InsertInto(
write_group, current_sequence, column_family_memtables_.get(),
&flush_scheduler_, write_options.ignore_missing_column_families,
&flush_scheduler_, &trim_history_scheduler_,
write_options.ignore_missing_column_families,
0 /*recovery_log_number*/, this, parallel, seq_per_batch_,
batch_per_txn_);
} else {
@@ -391,6 +393,7 @@ Status DBImpl::WriteImpl(const WriteOptions& write_options,
assert(w.sequence == current_sequence);
w.status = WriteBatchInternal::InsertInto(
&w, w.sequence, &column_family_memtables, &flush_scheduler_,
&trim_history_scheduler_,
write_options.ignore_missing_column_families, 0 /*log_number*/,
this, true /*concurrent_memtable_writes*/, seq_per_batch_,
w.batch_cnt, batch_per_txn_);
@@ -545,9 +548,9 @@ Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options,
} else {
memtable_write_group.status = WriteBatchInternal::InsertInto(
memtable_write_group, w.sequence, column_family_memtables_.get(),
&flush_scheduler_, write_options.ignore_missing_column_families,
0 /*log_number*/, this, false /*concurrent_memtable_writes*/,
seq_per_batch_, batch_per_txn_);
&flush_scheduler_, &trim_history_scheduler_,
write_options.ignore_missing_column_families, 0 /*log_number*/, this,
false /*concurrent_memtable_writes*/, seq_per_batch_, batch_per_txn_);
versions_->SetLastSequence(memtable_write_group.last_sequence);
write_thread_.ExitAsMemTableWriter(&w, memtable_write_group);
}
@@ -559,8 +562,8 @@ Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options,
versions_->GetColumnFamilySet());
w.status = WriteBatchInternal::InsertInto(
&w, w.sequence, &column_family_memtables, &flush_scheduler_,
write_options.ignore_missing_column_families, 0 /*log_number*/, this,
true /*concurrent_memtable_writes*/);
&trim_history_scheduler_, write_options.ignore_missing_column_families,
0 /*log_number*/, this, true /*concurrent_memtable_writes*/);
if (write_thread_.CompleteParallelMemTableWriter(&w)) {
MemTableInsertStatusCheck(w.status);
versions_->SetLastSequence(w.write_group->last_sequence);
@@ -597,8 +600,9 @@ Status DBImpl::UnorderedWriteMemtable(const WriteOptions& write_options,
versions_->GetColumnFamilySet());
w.status = WriteBatchInternal::InsertInto(
&w, w.sequence, &column_family_memtables, &flush_scheduler_,
write_options.ignore_missing_column_families, 0 /*log_number*/, this,
true /*concurrent_memtable_writes*/, seq_per_batch_, sub_batch_cnt);
&trim_history_scheduler_, write_options.ignore_missing_column_families,
0 /*log_number*/, this, true /*concurrent_memtable_writes*/,
seq_per_batch_, sub_batch_cnt);

WriteStatusCheck(w.status);
if (write_options.disableWAL) {
@@ -856,6 +860,10 @@ Status DBImpl::PreprocessWrite(const WriteOptions& write_options,
status = HandleWriteBufferFull(write_context);
}

if (UNLIKELY(status.ok() && !trim_history_scheduler_.Empty())) {
status = TrimMemtableHistory(write_context);
}

if (UNLIKELY(status.ok() && !flush_scheduler_.Empty())) {
WaitForPendingWrites();
status = ScheduleFlushes(write_context);
@@ -1112,9 +1120,9 @@ Status DBImpl::WriteRecoverableState() {
WriteBatchInternal::SetSequence(&cached_recoverable_state_, seq + 1);
auto status = WriteBatchInternal::InsertInto(
&cached_recoverable_state_, column_family_memtables_.get(),
&flush_scheduler_, true, 0 /*recovery_log_number*/, this,
false /* concurrent_memtable_writes */, &next_seq, &dont_care_bool,
seq_per_batch_);
&flush_scheduler_, &trim_history_scheduler_, true,
0 /*recovery_log_number*/, this, false /* concurrent_memtable_writes */,
&next_seq, &dont_care_bool, seq_per_batch_);
auto last_seq = next_seq - 1;
if (two_write_queues_) {
versions_->FetchAddLastAllocatedSequence(last_seq - seq);
@@ -1474,6 +1482,31 @@ void DBImpl::MaybeFlushStatsCF(autovector<ColumnFamilyData*>* cfds) {
}
}

Status DBImpl::TrimMemtableHistory(WriteContext* context) {
autovector<ColumnFamilyData*> cfds;
ColumnFamilyData* tmp_cfd;
while ((tmp_cfd = trim_history_scheduler_.TakeNextColumnFamily()) !=
nullptr) {
cfds.push_back(tmp_cfd);
}
for (auto& cfd : cfds) {
autovector<MemTable*> to_delete;
cfd->imm()->TrimHistory(&to_delete, cfd->mem()->ApproximateMemoryUsage());
for (auto m : to_delete) {
delete m;
}
context->superversion_context.NewSuperVersion();
assert(context->superversion_context.new_superversion.get() != nullptr);
cfd->InstallSuperVersion(&context->superversion_context, &mutex_);

if (cfd->Unref()) {
delete cfd;
cfd = nullptr;
}
}
return Status::OK();
}

Status DBImpl::ScheduleFlushes(WriteContext* context) {
autovector<ColumnFamilyData*> cfds;
if (immutable_db_options_.atomic_flush) {
7 changes: 4 additions & 3 deletions db/db_properties_test.cc
@@ -615,8 +615,9 @@ TEST_F(DBPropertiesTest, NumImmutableMemTable) {
writeOpt.disableWAL = true;
options.max_write_buffer_number = 4;
options.min_write_buffer_number_to_merge = 3;
options.max_write_buffer_number_to_maintain = 4;
options.write_buffer_size = 1000000;
options.max_write_buffer_size_to_maintain =
5 * static_cast<int64_t>(options.write_buffer_size);
CreateAndReopenWithCF({"pikachu"}, options);

std::string big_value(1000000 * 2, 'x');
@@ -747,7 +748,7 @@ TEST_F(DBPropertiesTest, DISABLED_GetProperty) {
options.max_background_flushes = 1;
options.max_write_buffer_number = 10;
options.min_write_buffer_number_to_merge = 1;
options.max_write_buffer_number_to_maintain = 0;
options.max_write_buffer_size_to_maintain = 0;
options.write_buffer_size = 1000000;
Reopen(options);

@@ -997,7 +998,7 @@ TEST_F(DBPropertiesTest, EstimatePendingCompBytes) {
options.max_background_flushes = 1;
options.max_write_buffer_number = 10;
options.min_write_buffer_number_to_merge = 1;
options.max_write_buffer_number_to_maintain = 0;
options.max_write_buffer_size_to_maintain = 0;
options.write_buffer_size = 1000000;
Reopen(options);

5 changes: 3 additions & 2 deletions db/db_test.cc
@@ -883,7 +883,7 @@ TEST_F(DBTest, FlushMultipleMemtable) {
writeOpt.disableWAL = true;
options.max_write_buffer_number = 4;
options.min_write_buffer_number_to_merge = 3;
options.max_write_buffer_number_to_maintain = -1;
options.max_write_buffer_size_to_maintain = -1;
CreateAndReopenWithCF({"pikachu"}, options);
ASSERT_OK(dbfull()->Put(writeOpt, handles_[1], "foo", "v1"));
ASSERT_OK(Flush(1));
@@ -901,7 +901,8 @@ TEST_F(DBTest, FlushSchedule) {
options.level0_stop_writes_trigger = 1 << 10;
options.level0_slowdown_writes_trigger = 1 << 10;
options.min_write_buffer_number_to_merge = 1;
options.max_write_buffer_number_to_maintain = 1;
options.max_write_buffer_size_to_maintain =
static_cast<int64_t>(options.write_buffer_size);
options.max_write_buffer_number = 2;
options.write_buffer_size = 120 * 1024;
CreateAndReopenWithCF({"pikachu"}, options);
2 changes: 2 additions & 0 deletions db/deletefile_test.cc
@@ -284,6 +284,8 @@ TEST_F(DeleteFileTest, BackgroundPurgeIteratorTest) {
TEST_F(DeleteFileTest, BackgroundPurgeCFDropTest) {
auto do_test = [&](bool bg_purge) {
ColumnFamilyOptions co;
co.max_write_buffer_size_to_maintain =
static_cast<int64_t>(co.write_buffer_size);
WriteOptions wo;
FlushOptions fo;
ColumnFamilyHandle* cfh = nullptr;
2 changes: 1 addition & 1 deletion db/flush_scheduler.cc
@@ -11,7 +11,7 @@

namespace rocksdb {

void FlushScheduler::ScheduleFlush(ColumnFamilyData* cfd) {
void FlushScheduler::ScheduleWork(ColumnFamilyData* cfd) {
#ifndef NDEBUG
{
std::lock_guard<std::mutex> lock(checking_mutex_);