Add manifest fix-up utility for file temperatures (facebook#9683)
Summary:
The goal of this change is to allow changes to the "current" (in
FileSystem) file temperatures to feed back into DB metadata, so that
they can inform decisions and stats reporting. Partly because of
modular code factoring, it doesn't seem easy to do this automatically,
i.e. to have opening an SST file and observing a current Temperature
different from the expected one trigger a metadata change and a DB
manifest write (essentially giving the deep read path access to the
write path). It is also difficult to do this while the DB is open
because of the limitations of LogAndApply.

This change allows updating file temperature metadata on a closed DB
using an experimental utility function UpdateManifestForFilesState()
or `ldb update_manifest --update_temperatures`. This should suffice for
"migration" scenarios where outside tooling has placed or re-arranged DB
files into a (different) tiered configuration without going through
RocksDB itself (currently, only compaction can change temperature
metadata).

Some details:
* Refactored and added unit test for `ldb unsafe_remove_sst_file` because
of shared functionality
* Pulled in autovector.h changes from facebook#9546 to fix SuperVersionContext
move constructor (related to an older draft of this change)

Possible follow-up work:
* Support updating manifest with file checksums, such as when a
new checksum function is adopted and existing DB metadata should be
updated to match.
* For some repair scenarios that are lighter weight than a full repair,
we might want UpdateManifestForFilesState() to also modify critical file
details like size or checksum using the same algorithm. But let's make
sure these are differentiated from modifications to file details that
neither suspect corruption nor require extreme trust.

Pull Request resolved: facebook#9683

Test Plan: unit tests added

Reviewed By: jay-zhuang

Differential Revision: D34798828

Pulled By: pdillinger

fbshipit-source-id: cfd83e8fb10761d8c9e7f9c020d68c9106a95554
pdillinger authored and facebook-github-bot committed Mar 18, 2022
1 parent b2aacaf commit a8a422e
Showing 13 changed files with 610 additions and 34 deletions.
1 change: 1 addition & 0 deletions HISTORY.md
@@ -9,6 +9,7 @@
* Added BlobDB options to `ldb`
* `BlockBasedTableOptions::detect_filter_construct_corruption` can now be dynamically configured using `DB::SetOptions`.
* Automatically recover from retryable read IO errors during background flush/compaction.
* Experimental support for preserving file Temperatures through backup and restore, and for updating DB metadata for outside changes to file Temperature (`UpdateManifestForFilesState` or `ldb update_manifest --update_temperatures`).

### Bug Fixes
* Fixed a data race on `versions_` between `DBImpl::ResumeImpl()` and threads waiting for recovery to complete (#9496)
2 changes: 1 addition & 1 deletion Makefile
@@ -909,7 +909,7 @@ gen_parallel_tests:
# 107.816 PASS t/DBTest.EncodeDecompressedBlockSizeTest
#
slow_test_regexp = \
- ^.*SnapshotConcurrentAccessTest.*$$|^.*SeqAdvanceConcurrentTest.*$$|^t/run-table_test-HarnessTest.Randomized$$|^t/run-db_test-.*(?:FileCreationRandomFailure|EncodeDecompressedBlockSizeTest)$$|^.*RecoverFromCorruptedWALWithoutFlush$$
+ ^.*MySQLStyleTransactionTest.*$$|^.*SnapshotConcurrentAccessTest.*$$|^.*SeqAdvanceConcurrentTest.*$$|^t/run-table_test-HarnessTest.Randomized$$|^t/run-db_test-.*(?:FileCreationRandomFailure|EncodeDecompressedBlockSizeTest)$$|^.*RecoverFromCorruptedWALWithoutFlush$$
prioritize_long_running_tests = \
perl -pe 's,($(slow_test_regexp)),100 $$1,' \
| sort -k1,1gr \
100 changes: 100 additions & 0 deletions db/db_test2.cc
@@ -17,6 +17,7 @@
#include "options/options_helper.h"
#include "port/port.h"
#include "port/stack_trace.h"
#include "rocksdb/experimental.h"
#include "rocksdb/iostats_context.h"
#include "rocksdb/persistent_cache.h"
#include "rocksdb/trace_record.h"
@@ -6973,6 +6974,105 @@ TEST_F(DBTest2, CheckpointFileTemperature) {
delete checkpoint;
Close();
}

TEST_F(DBTest2, FileTemperatureManifestFixup) {
  auto test_fs = std::make_shared<FileTemperatureTestFS>(env_->GetFileSystem());
  std::unique_ptr<Env> env(new CompositeEnvWrapper(env_, test_fs));
  Options options = CurrentOptions();
  options.bottommost_temperature = Temperature::kWarm;
  options.level0_file_num_compaction_trigger = 2;
  options.env = env.get();
  std::vector<std::string> cfs = {/*"default",*/ "test1", "test2"};
  CreateAndReopenWithCF(cfs, options);
  // Needed for later re-opens (weird)
  cfs.insert(cfs.begin(), kDefaultColumnFamilyName);

  // Generate a bottommost file in all CFs
  for (int cf = 0; cf < 3; ++cf) {
    ASSERT_OK(Put(cf, "a", "val"));
    ASSERT_OK(Put(cf, "c", "val"));
    ASSERT_OK(Flush(cf));
    ASSERT_OK(Put(cf, "b", "val"));
    ASSERT_OK(Put(cf, "d", "val"));
    ASSERT_OK(Flush(cf));
  }
  ASSERT_OK(dbfull()->TEST_WaitForCompact());

  // verify
  ASSERT_GT(GetSstSizeHelper(Temperature::kWarm), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kUnknown), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kCold), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kHot), 0);

  // Generate a non-bottommost file in all CFs
  for (int cf = 0; cf < 3; ++cf) {
    ASSERT_OK(Put(cf, "e", "val"));
    ASSERT_OK(Flush(cf));
  }

  // re-verify
  ASSERT_GT(GetSstSizeHelper(Temperature::kWarm), 0);
  // Not supported: ASSERT_GT(GetSstSizeHelper(Temperature::kUnknown), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kCold), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kHot), 0);

  // Now change FS temperature on bottommost file(s) to kCold
  std::map<uint64_t, Temperature> current_temps;
  test_fs->CopyCurrentSstFileTemperatures(&current_temps);
  for (auto e : current_temps) {
    if (e.second == Temperature::kWarm) {
      test_fs->OverrideSstFileTemperature(e.first, Temperature::kCold);
    }
  }
  // Metadata not yet updated
  ASSERT_EQ(Get("a"), "val");
  ASSERT_EQ(GetSstSizeHelper(Temperature::kCold), 0);

  // Update with Close and UpdateManifestForFilesState, but first save cf
  // descriptors
  std::vector<ColumnFamilyDescriptor> column_families;
  for (size_t i = 0; i < handles_.size(); ++i) {
    ColumnFamilyDescriptor cfdescriptor;
    // GetDescriptor is not implemented for ROCKSDB_LITE
    handles_[i]->GetDescriptor(&cfdescriptor).PermitUncheckedError();
    column_families.push_back(cfdescriptor);
  }
  Close();
  experimental::UpdateManifestForFilesStateOptions update_opts;
  update_opts.update_temperatures = true;

  ASSERT_OK(experimental::UpdateManifestForFilesState(
      options, dbname_, column_families, update_opts));

  // Re-open and re-verify after update
  ReopenWithColumnFamilies(cfs, options);
  ASSERT_GT(GetSstSizeHelper(Temperature::kCold), 0);
  // Not supported: ASSERT_GT(GetSstSizeHelper(Temperature::kUnknown), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kWarm), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kHot), 0);

  // Change kUnknown to kHot
  test_fs->CopyCurrentSstFileTemperatures(&current_temps);
  for (auto e : current_temps) {
    if (e.second == Temperature::kUnknown) {
      test_fs->OverrideSstFileTemperature(e.first, Temperature::kHot);
    }
  }

  // Update with Close and UpdateManifestForFilesState
  Close();
  ASSERT_OK(experimental::UpdateManifestForFilesState(
      options, dbname_, column_families, update_opts));

  // Re-open and re-verify after update
  ReopenWithColumnFamilies(cfs, options);
  ASSERT_GT(GetSstSizeHelper(Temperature::kCold), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kUnknown), 0);
  ASSERT_EQ(GetSstSizeHelper(Temperature::kWarm), 0);
  ASSERT_GT(GetSstSizeHelper(Temperature::kHot), 0);

  Close();
}
#endif // ROCKSDB_LITE

// WAL recovery mode is WALRecoveryMode::kPointInTimeRecovery.
100 changes: 100 additions & 0 deletions db/experimental.cc
@@ -6,6 +6,8 @@
#include "rocksdb/experimental.h"

#include "db/db_impl/db_impl.h"
#include "db/version_util.h"
#include "logging/logging.h"

namespace ROCKSDB_NAMESPACE {
namespace experimental {
@@ -46,5 +48,103 @@ Status SuggestCompactRange(DB* db, const Slice* begin, const Slice* end) {
return SuggestCompactRange(db, db->DefaultColumnFamily(), begin, end);
}

Status UpdateManifestForFilesState(
    const DBOptions& db_opts, const std::string& db_name,
    const std::vector<ColumnFamilyDescriptor>& column_families,
    const UpdateManifestForFilesStateOptions& opts) {
  OfflineManifestWriter w(db_opts, db_name);
  Status s = w.Recover(column_families);

  size_t files_updated = 0;
  size_t cfs_updated = 0;
  auto fs = db_opts.env->GetFileSystem();

  for (auto cfd : *w.Versions().GetColumnFamilySet()) {
    if (!s.ok()) {
      break;
    }
    assert(cfd);

    if (cfd->IsDropped() || !cfd->initialized()) {
      continue;
    }

    const auto* current = cfd->current();
    assert(current);

    const auto* vstorage = current->storage_info();
    assert(vstorage);

    VersionEdit edit;
    edit.SetColumnFamily(cfd->GetID());

    /* SST files */
    for (int level = 0; level < cfd->NumberLevels(); level++) {
      if (!s.ok()) {
        break;
      }
      const auto& level_files = vstorage->LevelFiles(level);

      for (const auto& lf : level_files) {
        assert(lf);

        uint64_t number = lf->fd.GetNumber();
        std::string fname =
            TableFileName(w.IOptions().db_paths, number, lf->fd.GetPathId());

        std::unique_ptr<FSSequentialFile> f;
        FileOptions fopts;
        fopts.temperature = lf->temperature;

        IOStatus file_ios =
            fs->NewSequentialFile(fname, fopts, &f, /*dbg*/ nullptr);
        if (file_ios.ok()) {
          if (opts.update_temperatures) {
            Temperature temp = f->GetTemperature();
            if (temp != Temperature::kUnknown && temp != lf->temperature) {
              // Current state inconsistent with manifest
              ++files_updated;
              edit.DeleteFile(level, number);
              edit.AddFile(level, number, lf->fd.GetPathId(),
                           lf->fd.GetFileSize(), lf->smallest, lf->largest,
                           lf->fd.smallest_seqno, lf->fd.largest_seqno,
                           lf->marked_for_compaction, temp,
                           lf->oldest_blob_file_number,
                           lf->oldest_ancester_time, lf->file_creation_time,
                           lf->file_checksum, lf->file_checksum_func_name,
                           lf->min_timestamp, lf->max_timestamp);
            }
          }
        } else {
          s = file_ios;
          break;
        }
      }
    }

    if (s.ok() && edit.NumEntries() > 0) {
      s = w.LogAndApply(cfd, &edit);
      if (s.ok()) {
        ++cfs_updated;
      }
    }
  }

  if (cfs_updated > 0) {
    ROCKS_LOG_INFO(db_opts.info_log,
                   "UpdateManifestForFilesState: updated %zu files in %zu CFs",
                   files_updated, cfs_updated);
  } else if (s.ok()) {
    ROCKS_LOG_INFO(db_opts.info_log,
                   "UpdateManifestForFilesState: no updates needed");
  }
  if (!s.ok()) {
    ROCKS_LOG_ERROR(db_opts.info_log, "UpdateManifestForFilesState failed: %s",
                    s.ToString().c_str());
  }

  return s;
}

} // namespace experimental
} // namespace ROCKSDB_NAMESPACE
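
The per-file decision in UpdateManifestForFilesState() above can be isolated into a small, self-contained sketch. This is not RocksDB code: `Temp`, `ManifestFile`, `TempEdit`, and `PlanTemperatureEdits` are hypothetical names made up for illustration, and the filesystem is modeled as a plain map rather than by opening each file. It only tries to show the rule from the diff: a file is rewritten in the manifest when the filesystem reports a temperature that is not Unknown and differs from the recorded one.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-ins for illustration; these are not RocksDB types.
enum class Temp { kUnknown, kHot, kWarm, kCold };

struct ManifestFile {
  uint64_t number;   // file number
  int level;         // LSM level the file lives on
  Temp temperature;  // temperature currently recorded in the manifest
};

struct TempEdit {
  uint64_t number;
  int level;
  Temp new_temperature;
};

// Returns one edit per file whose filesystem-reported temperature is known
// and differs from the manifest. Simplification (assumption of this sketch):
// a file missing from `fs_temps` is skipped, whereas the real utility opens
// every file and stops with an error if that fails.
std::vector<TempEdit> PlanTemperatureEdits(
    const std::vector<ManifestFile>& files,
    const std::map<uint64_t, Temp>& fs_temps) {
  std::vector<TempEdit> edits;
  for (const ManifestFile& f : files) {
    assert(f.level >= 0);
    auto it = fs_temps.find(f.number);
    if (it == fs_temps.end()) {
      continue;
    }
    const Temp reported = it->second;
    if (reported != Temp::kUnknown && reported != f.temperature) {
      edits.push_back({f.number, f.level, reported});
    }
  }
  return edits;
}
```

In this model, a file recorded as kWarm and reported as kCold yields one edit, while a file reported as kUnknown is left alone regardless of what the manifest says, which mirrors why the test has to override kUnknown files to kHot before they show up in the second fix-up pass.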
2 changes: 1 addition & 1 deletion db/internal_stats.cc
@@ -519,7 +519,7 @@ const std::unordered_map<std::string, DBPropertyInfo>
{false, nullptr, &InternalStats::HandleLiveSstFilesSize, nullptr,
nullptr}},
{DB::Properties::kLiveSstFilesSizeAtTemperature,
- {true, &InternalStats::HandleLiveSstFilesSizeAtTemperature, nullptr,
+ {false, &InternalStats::HandleLiveSstFilesSizeAtTemperature, nullptr,
nullptr, nullptr}},
{DB::Properties::kEstimatePendingCompactionBytes,
{false, nullptr, &InternalStats::HandleEstimatePendingCompactionBytes,
5 changes: 4 additions & 1 deletion db/job_context.h
@@ -37,13 +37,16 @@ struct SuperVersionContext {
  explicit SuperVersionContext(bool create_superversion = false)
      : new_superversion(create_superversion ? new SuperVersion() : nullptr) {}

- explicit SuperVersionContext(SuperVersionContext&& other)
+ explicit SuperVersionContext(SuperVersionContext&& other) noexcept
      : superversions_to_free(std::move(other.superversions_to_free)),
#ifndef ROCKSDB_DISABLE_STALL_NOTIFICATION
        write_stall_notifications(std::move(other.write_stall_notifications)),
#endif
        new_superversion(std::move(other.new_superversion)) {
  }
  // No copies
  SuperVersionContext(const SuperVersionContext& other) = delete;
  void operator=(const SuperVersionContext& other) = delete;

  void NewSuperVersion() {
    new_superversion = std::unique_ptr<SuperVersion>(new SuperVersion());
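
The shape of the SuperVersionContext change (a `noexcept` move constructor plus deleted copy operations) can be tried out in isolation. The sketch below is an assumption-laden illustration, not the RocksDB struct: `MoveOnlyContext` and `RoundTripThroughVector` are hypothetical names, the members are simplified, and the move constructor is left non-`explicit` for brevity.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for a context object that owns heap state.
struct MoveOnlyContext {
  std::vector<int> to_free;
  std::unique_ptr<int> pending;

  explicit MoveOnlyContext(bool create = false)
      : pending(create ? new int(0) : nullptr) {}

  // Declared noexcept so the type is nothrow-move-constructible; std::vector
  // relies on std::move_if_noexcept when it relocates elements.
  MoveOnlyContext(MoveOnlyContext&& other) noexcept
      : to_free(std::move(other.to_free)),
        pending(std::move(other.pending)) {}

  // No copies
  MoveOnlyContext(const MoveOnlyContext&) = delete;
  void operator=(const MoveOnlyContext&) = delete;
};

static_assert(std::is_nothrow_move_constructible<MoveOnlyContext>::value,
              "move constructor should be noexcept");
static_assert(!std::is_copy_constructible<MoveOnlyContext>::value,
              "copying should be disabled");

// Puts two contexts into a vector; the second emplace_back may reallocate
// and move the first element. Returns whether ownership survived the move.
inline bool RoundTripThroughVector() {
  std::vector<MoveOnlyContext> v;
  v.emplace_back(true);
  v.emplace_back(false);
  assert(v.size() == 2);
  return v[0].pending != nullptr && v[1].pending == nullptr;
}
```

Deleting the copy operations turns an accidental copy into a compile error instead of a silent double-ownership bug, and the two `static_assert`s make both properties checkable at build time.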
68 changes: 68 additions & 0 deletions db/version_util.h
@@ -0,0 +1,68 @@
// Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).

#pragma once

#include "db/version_set.h"

namespace ROCKSDB_NAMESPACE {

// Instead of opening a `DB` to perform certain manifest updates, this
// uses the underlying `VersionSet` API to read and modify the MANIFEST. This
// allows us to use the user's real options, while not having to worry about
// the DB persisting new SST files via flush/compaction or attempting to read/
// compact files which may fail, particularly for the file we intend to remove
// (the user may want to remove an already deleted file from MANIFEST).
class OfflineManifestWriter {
 public:
  OfflineManifestWriter(const DBOptions& options, const std::string& db_path)
      : wc_(options.delayed_write_rate),
        wb_(options.db_write_buffer_size),
        immutable_db_options_(WithDbPath(options, db_path)),
        tc_(NewLRUCache(1 << 20 /* capacity */,
                        options.table_cache_numshardbits)),
        versions_(db_path, &immutable_db_options_, sopt_, tc_.get(), &wb_, &wc_,
                  /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr,
                  /*db_session_id*/ "") {}

  Status Recover(const std::vector<ColumnFamilyDescriptor>& column_families) {
    return versions_.Recover(column_families);
  }

  Status LogAndApply(ColumnFamilyData* cfd, VersionEdit* edit) {
    // Use `mutex` to imitate a locked DB mutex when calling `LogAndApply()`.
    InstrumentedMutex mutex;
    mutex.Lock();
    Status s = versions_.LogAndApply(cfd, *cfd->GetLatestMutableCFOptions(),
                                     edit, &mutex, nullptr /* db_directory */,
                                     false /* new_descriptor_log */);
    mutex.Unlock();
    return s;
  }

  VersionSet& Versions() { return versions_; }
  const ImmutableDBOptions& IOptions() { return immutable_db_options_; }

 private:
  WriteController wc_;
  WriteBufferManager wb_;
  ImmutableDBOptions immutable_db_options_;
  std::shared_ptr<Cache> tc_;
  EnvOptions sopt_;
  VersionSet versions_;

  static ImmutableDBOptions WithDbPath(const DBOptions& options,
                                       const std::string& db_path) {
    ImmutableDBOptions rv(options);
    if (rv.db_paths.empty()) {
      // `VersionSet` expects options that have been through
      // `SanitizeOptions()`, which would sanitize an empty `db_paths`.
      rv.db_paths.emplace_back(db_path, 0 /* target_size */);
    }
    return rv;
  }
};

} // namespace ROCKSDB_NAMESPACE
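
The `WithDbPath()` helper above fills in a default path only when the caller supplied none. A minimal sketch of that defaulting step, using simplified hypothetical types (`PathEntry`, `Opts`, and `WithDefaultPath` are not RocksDB names) and assuming nothing else about option sanitization:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified option types; not the RocksDB definitions.
struct PathEntry {
  std::string path;
  uint64_t target_size;
};

struct Opts {
  std::vector<PathEntry> db_paths;
};

// Returns a copy of `options` that is guaranteed to have at least one path.
// An empty list gets `db_path` with target size 0; a non-empty list is
// returned unchanged. The caller's object is never modified.
Opts WithDefaultPath(const Opts& options, const std::string& db_path) {
  Opts rv(options);
  if (rv.db_paths.empty()) {
    rv.db_paths.push_back(PathEntry{db_path, 0});
  }
  assert(!rv.db_paths.empty());
  return rv;
}
```

Working on a copy keeps the user's options untouched, which matches the stated intent of running against the user's real options without side effects on them.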
27 changes: 27 additions & 0 deletions include/rocksdb/experimental.h
@@ -25,5 +25,32 @@ Status SuggestCompactRange(DB* db, const Slice* begin, const Slice* end);
Status PromoteL0(DB* db, ColumnFamilyHandle* column_family,
int target_level = 1);

struct UpdateManifestForFilesStateOptions {
  // When true, read current file temperatures from FileSystem and update in
  // DB manifest when a temperature other than Unknown is reported and
  // inconsistent with manifest.
  bool update_temperatures = true;

  // TODO: new_checksums: to update files to latest file checksum algorithm
};

// Utility for updating manifest of DB directory (not open) for current state
// of files on filesystem. See UpdateManifestForFilesStateOptions.
//
// To minimize interference with ongoing DB operations, only the following
// guarantee is provided, assuming no IO error encountered:
// * Only files live in DB at start AND end of call to
// UpdateManifestForFilesState() are guaranteed to be updated (as needed) in
// manifest.
//   * For example, new files after start of call to
//   UpdateManifestForFilesState() might not be updated, but that is not
//   typically required to achieve goal of manifest consistency/completeness
//   (because current DB configuration would ensure new files get the desired
//   consistent metadata).
Status UpdateManifestForFilesState(
    const DBOptions& db_opts, const std::string& db_name,
    const std::vector<ColumnFamilyDescriptor>& column_families,
    const UpdateManifestForFilesStateOptions& opts = {});

} // namespace experimental
} // namespace ROCKSDB_NAMESPACE