Provide implementation to prefetch data asynchronously in FilePrefetchBuffer (facebook#9674)

Summary:
In FilePrefetchBuffer, if reads are sequential, after prefetching we call the ReadAsync API to prefetch data asynchronously so that the data is already available by the time the next prefetch is needed. The amount of data prefetched asynchronously is readahead_size/2. Two buffers are used: one for synchronous prefetching and one for asynchronous. If the requested data overlaps the two buffers, it is copied from both into a third buffer to make it contiguous.
This feature is gated by ReadOptions::async_io and is experimental.
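
For reference, a minimal usage sketch based on the `ReadOptions::async_io` flag this PR adds. The surrounding setup is illustrative, not code from this change; the new unit tests enable `adaptive_readahead` alongside it:

```
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/iterator.h"
#include "rocksdb/options.h"

// Illustrative only: opt a sequential scan into the experimental
// asynchronous prefetching. When reads stay sequential, FilePrefetchBuffer
// schedules readahead_size/2 bytes in the background while the current
// buffer is consumed.
void ScanWithAsyncPrefetch(rocksdb::DB* db) {
  rocksdb::ReadOptions ro;
  ro.adaptive_readahead = true;  // paired with async_io in the new tests
  ro.async_io = true;            // experimental flag added by this PR
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // Sequential iteration keeps the read pattern eligible for prefetch.
  }
}
```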

Pull Request resolved: facebook#9674

Test Plan:
1. Add new unit tests
2. Run **db_stress** to make sure nothing crashes.

    -   Normal prefetch without `async_io` ran successfully:
```
export CRASH_TEST_EXT_ARGS=" --async_io=0"
 make crash_test -j
 ```

3. **Run Regressions**.
   i) Main branch without any changes, normal prefetching with async_io disabled:

 ```
 ./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216
 ```

```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 7.0
Date:       Thu Mar 17 13:11:34 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
```

  ii) Normal prefetching after the changes, with async_io disabled:

```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_withchange -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 7.0
Date:       Thu Mar 17 14:11:31 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_withchange]
seekrandom   :  471347.227 micros/op 2 ops/sec;  348.1 MB/s (255 of 255 found)
```

Reviewed By: anand1976

Differential Revision: D34731543

Pulled By: akankshamahajan15

fbshipit-source-id: 8e23aa93453d5fe3c672b9231ad582f60207937f
akankshamahajan15 authored and facebook-github-bot committed Mar 21, 2022
1 parent a8a422e commit 49a10fe
Showing 17 changed files with 614 additions and 129 deletions.
1 change: 1 addition & 0 deletions HISTORY.md
@@ -10,6 +10,7 @@
* `BlockBasedTableOptions::detect_filter_construct_corruption` can now be dynamically configured using `DB::SetOptions`.
* Automatically recover from retryable read IO errors during background flush/compaction.
* Experimental support for preserving file Temperatures through backup and restore, and for updating DB metadata for outside changes to file Temperature (`UpdateManifestForFilesState` or `ldb update_manifest --update_temperatures`).
* Experimental support for async_io in ReadOptions, used by FilePrefetchBuffer to prefetch some of the data asynchronously when reads are sequential and automatic readahead has been enabled internally by RocksDB.

### Bug Fixes
* Fixed a data race on `versions_` between `DBImpl::ResumeImpl()` and threads waiting for recovery to complete (#9496)
487 changes: 412 additions & 75 deletions file/file_prefetch_buffer.cc

Large diffs are not rendered by default.
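
Since the large diff is collapsed, the following is only a conceptual model of the buffer management described in the summary: one buffer for synchronously prefetched data, one filled by the async read, and a third used to stitch an overlapping request into a contiguous region. All names here are hypothetical; this is not the actual RocksDB code, which additionally handles alignment, rate limiting, direct I/O, and error paths.

```
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical model of the two-buffer scheme. `a` holds the range being
// consumed and `b` holds the asynchronously prefetched range that starts
// exactly where `a` ends (the sequential-read case this PR targets).
struct Buf {
  uint64_t offset = 0;
  std::vector<char> data;
  bool Contains(uint64_t off, size_t len) const {
    return off >= offset && off + len <= offset + data.size();
  }
};

// Return `len` contiguous bytes starting at `off`, copying into `third`
// only when the range straddles the boundary between `a` and `b`.
const char* GetContiguous(const Buf& a, const Buf& b, Buf& third,
                          uint64_t off, size_t len) {
  if (a.Contains(off, len)) return a.data.data() + (off - a.offset);
  if (b.Contains(off, len)) return b.data.data() + (off - b.offset);
  // Overlap: copy the tail of `a`, then the head of `b`.
  third.offset = off;
  third.data.resize(len);
  const size_t from_a = static_cast<size_t>(a.offset + a.data.size() - off);
  std::memcpy(third.data.data(), a.data.data() + (off - a.offset), from_a);
  std::memcpy(third.data.data() + from_a, b.data.data(), len - from_a);
  return third.data.data();
}
```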

118 changes: 100 additions & 18 deletions file/file_prefetch_buffer.h
@@ -16,6 +16,7 @@
#include "file/readahead_file_info.h"
#include "port/port.h"
#include "rocksdb/env.h"
#include "rocksdb/file_system.h"
#include "rocksdb/options.h"
#include "util/aligned_buffer.h"

@@ -26,6 +27,11 @@ namespace ROCKSDB_NAMESPACE {
struct IOOptions;
class RandomAccessFileReader;

struct BufferInfo {
AlignedBuffer buffer_;
uint64_t offset_ = 0;
};

// FilePrefetchBuffer is a smart buffer to store and read data from a file.
class FilePrefetchBuffer {
public:
@@ -48,15 +54,19 @@ class FilePrefetchBuffer {
// it. Used for adaptable readahead of the file footer/metadata.
// implicit_auto_readahead : Readahead is enabled implicitly by rocksdb after
// doing sequential scans for two times.
// async_io : When async_io is enabled and implicit_auto_readahead is set,
// data is prefetched asynchronously into the second buffer while curr_ is
// being consumed.
//
// Automatic readahead is enabled for a file if readahead_size
// and max_readahead_size are passed in.
// A user can construct a FilePrefetchBuffer without any arguments, but use
// `Prefetch` to load data into the buffer.
FilePrefetchBuffer(size_t readahead_size = 0, size_t max_readahead_size = 0,
bool enable = true, bool track_min_offset = false,
bool implicit_auto_readahead = false)
: buffer_offset_(0),
bool implicit_auto_readahead = false,
bool async_io = false)
: curr_(0),
readahead_size_(readahead_size),
max_readahead_size_(max_readahead_size),
min_offset_read_(port::kMaxSizet),
@@ -65,18 +75,36 @@
implicit_auto_readahead_(implicit_auto_readahead),
prev_offset_(0),
prev_len_(0),
num_file_reads_(kMinNumFileReadsToStartAutoReadahead + 1) {}
num_file_reads_(kMinNumFileReadsToStartAutoReadahead + 1),
io_handle_(nullptr),
del_fn_(nullptr),
async_read_in_progress_(false),
async_io_(async_io) {
// If async_io_ is enabled, data is asynchronously filled into the second
// buffer while curr_ is being consumed. If data overlaps the two buffers,
// it is copied into a third buffer to return one contiguous buffer.
bufs_.resize(3);
}

// Load data into the buffer from a file.
// reader : the file reader.
// offset : the file offset to start reading from.
// n : the number of bytes to read.
// rate_limiter_priority : rate limiting priority, or `Env::IO_TOTAL` to
// bypass.
// is_async_read : if the data should be prefetched by calling read
// asynchronously. It should be set true when called
// from TryReadFromCache.
Status Prefetch(const IOOptions& opts, RandomAccessFileReader* reader,
uint64_t offset, size_t n,
Env::IOPriority rate_limiter_priority);

Status PrefetchAsync(const IOOptions& opts, RandomAccessFileReader* reader,
FileSystem* fs, uint64_t offset, size_t length,
size_t readahead_size,
Env::IOPriority rate_limiter_priority,
bool& copy_to_third_buffer);

// Tries returning the data for a file read from this buffer if that data is
// in the buffer.
// It handles tracking the minimum read offset if track_min_offset = true.
@@ -97,14 +125,20 @@
Env::IOPriority rate_limiter_priority,
bool for_compaction = false);

bool TryReadFromCacheAsync(const IOOptions& opts,
RandomAccessFileReader* reader, uint64_t offset,
size_t n, Slice* result, Status* status,
Env::IOPriority rate_limiter_priority,
bool for_compaction /* = false */, FileSystem* fs);

// The minimum `offset` ever passed to TryReadFromCache(). This will only be
// tracked if track_min_offset = true.
size_t min_offset_read() const { return min_offset_read_; }

// Called in case of implicit auto prefetching.
void UpdateReadPattern(const uint64_t& offset, const size_t& len,
bool is_adaptive_readahead = false) {
if (is_adaptive_readahead) {
bool decrease_readaheadsize) {
if (decrease_readaheadsize) {
// Since this block was eligible for prefetch but was found in the
// cache, check and decrease the readahead_size by 8KB (default)
// if eligible.
@@ -114,16 +148,6 @@
prev_len_ = len;
}

bool IsBlockSequential(const size_t& offset) {
return (prev_len_ == 0 || (prev_offset_ + prev_len_ == offset));
}

// Called in case of implicit auto prefetching.
void ResetValues() {
num_file_reads_ = 1;
readahead_size_ = kInitAutoReadaheadSize;
}

void GetReadaheadState(ReadaheadFileInfo::ReadaheadInfo* readahead_info) {
readahead_info->readahead_size = readahead_size_;
readahead_info->num_file_reads = num_file_reads_;
@@ -141,7 +165,8 @@
// - num_file_reads_ + 1 (including this read) >
// kMinNumFileReadsToStartAutoReadahead
if (implicit_auto_readahead_ && readahead_size_ > 0) {
if ((offset + size > buffer_offset_ + buffer_.CurrentSize()) &&
if ((offset + size >
bufs_[curr_].offset_ + bufs_[curr_].buffer_.CurrentSize()) &&
IsBlockSequential(offset) &&
(num_file_reads_ + 1 > kMinNumFileReadsToStartAutoReadahead)) {
size_t initial_auto_readahead_size = kInitAutoReadaheadSize;
@@ -152,9 +177,59 @@
}
}

bool IsEligibleForPrefetch(uint64_t offset, size_t n) {
// Prefetch only if this read is sequential; otherwise reset readahead_size_
// to its initial value.
if (!IsBlockSequential(offset)) {
UpdateReadPattern(offset, n, false /*decrease_readaheadsize*/);
ResetValues();
return false;
}
num_file_reads_++;
if (num_file_reads_ <= kMinNumFileReadsToStartAutoReadahead) {
UpdateReadPattern(offset, n, false /*decrease_readaheadsize*/);
return false;
}
return true;
}

// Callback function passed to underlying FS in case of asynchronous reads.
void PrefetchAsyncCallback(const FSReadRequest& req, void* cb_arg);

private:
AlignedBuffer buffer_;
uint64_t buffer_offset_;
// Calculates the rounded-off offset and length to be prefetched based on
// alignment and the data present in the buffer. It also allocates a new
// buffer or refits the tail if required.
void CalculateOffsetAndLen(size_t alignment, uint64_t offset,
size_t roundup_len, size_t index, bool refit_tail,
uint64_t& chunk_len);

Status Read(const IOOptions& opts, RandomAccessFileReader* reader,
Env::IOPriority rate_limiter_priority, uint64_t read_len,
uint64_t chunk_len, uint64_t rounddown_start, uint32_t index);

Status ReadAsync(const IOOptions& opts, RandomAccessFileReader* reader,
Env::IOPriority rate_limiter_priority, uint64_t read_len,
uint64_t chunk_len, uint64_t rounddown_start,
uint32_t index);

// Copy the data from src to the third buffer.
void CopyDataToBuffer(uint32_t src, uint64_t& offset, size_t& length);

bool IsBlockSequential(const size_t& offset) {
return (prev_len_ == 0 || (prev_offset_ + prev_len_ == offset));
}

// Called in case of implicit auto prefetching.
void ResetValues() {
num_file_reads_ = 1;
readahead_size_ = kInitAutoReadaheadSize;
}

std::vector<BufferInfo> bufs_;
// curr_ is the index into bufs_ of the buffer currently being
// consumed.
uint32_t curr_;
size_t readahead_size_;
// FilePrefetchBuffer object won't be created from Iterator flow if
// max_readahead_size_ = 0.
@@ -174,5 +249,12 @@
uint64_t prev_offset_;
size_t prev_len_;
int64_t num_file_reads_;

// io_handle_ is allocated and used by the underlying file system for
// asynchronous reads.
void* io_handle_;
IOHandleDeleter del_fn_;
bool async_read_in_progress_;
bool async_io_;
};
} // namespace ROCKSDB_NAMESPACE
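
As a hedged illustration of how a reader-side component might drive the new TryReadFromCacheAsync entry point declared above — mirroring, but not copied from, the table-reader call sites; error handling is elided and the wrapper name is hypothetical:

```
#include "file/file_prefetch_buffer.h"
#include "file/random_access_file_reader.h"

namespace ROCKSDB_NAMESPACE {
// Hypothetical caller: try to serve `n` bytes at `offset` from the
// prefetch buffer. On a sequential pattern the buffer may already have
// scheduled the next chunk asynchronously into its second buffer.
bool ReadThroughPrefetchBuffer(FilePrefetchBuffer* prefetch_buffer,
                               RandomAccessFileReader* reader, FileSystem* fs,
                               uint64_t offset, size_t n, Slice* result,
                               Status* s) {
  IOOptions opts;
  return prefetch_buffer->TryReadFromCacheAsync(
      opts, reader, offset, n, result, s, Env::IO_TOTAL,
      false /* for_compaction */, fs);
}
}  // namespace ROCKSDB_NAMESPACE
```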
18 changes: 18 additions & 0 deletions file/prefetch_test.cc
@@ -718,9 +718,11 @@ TEST_P(PrefetchTest1, DBIterLevelReadAhead) {

WriteBatch batch;
Random rnd(309);
int total_keys = 0;
for (int j = 0; j < 5; j++) {
for (int i = j * kNumKeys; i < (j + 1) * kNumKeys; i++) {
ASSERT_OK(batch.Put(BuildKey(i), rnd.RandomString(1000)));
total_keys++;
}
ASSERT_OK(db_->Write(WriteOptions(), &batch));
ASSERT_OK(Flush());
@@ -761,12 +763,16 @@ TEST_P(PrefetchTest1, DBIterLevelReadAhead) {
ReadOptions ro;
if (is_adaptive_readahead) {
ro.adaptive_readahead = true;
// TODO akanksha: Remove after adding new units.
ro.async_io = true;
}
auto iter = std::unique_ptr<Iterator>(db_->NewIterator(ro));
int num_keys = 0;
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
ASSERT_OK(iter->status());
num_keys++;
}
ASSERT_EQ(num_keys, total_keys);

ASSERT_GT(buff_prefetch_count, 0);
buff_prefetch_count = 0;
@@ -854,6 +860,8 @@ TEST_P(PrefetchTest2, NonSequentialReads) {
// Iterate until prefetch is done.
ReadOptions ro;
ro.adaptive_readahead = true;
// TODO akanksha: Remove after adding new units.
ro.async_io = true;
auto iter = std::unique_ptr<Iterator>(db_->NewIterator(ro));
iter->SeekToFirst();
while (iter->Valid() && buff_prefetch_count == 0) {
@@ -940,6 +948,8 @@ TEST_P(PrefetchTest2, DecreaseReadAheadIfInCache) {
SyncPoint::GetInstance()->EnableProcessing();
ReadOptions ro;
ro.adaptive_readahead = true;
// TODO akanksha: Remove after adding new units.
ro.async_io = true;
{
/*
* Reseek keys from sequential Data Blocks within same partitioned
@@ -958,28 +968,35 @@
// After caching, blocks will be read from cache (Sequential blocks)
auto iter = std::unique_ptr<Iterator>(db_->NewIterator(ro));
iter->Seek(BuildKey(0));
ASSERT_TRUE(iter->Valid());
iter->Seek(BuildKey(1000));
ASSERT_TRUE(iter->Valid());
iter->Seek(BuildKey(1004)); // Prefetch data (not in cache).
ASSERT_TRUE(iter->Valid());
ASSERT_EQ(current_readahead_size, expected_current_readahead_size);

// Missed one sequential block but 1011 is already in buffer so
// readahead will not be reset.
iter->Seek(BuildKey(1011));
ASSERT_TRUE(iter->Valid());
ASSERT_EQ(current_readahead_size, expected_current_readahead_size);

// Eligible to prefetch data (not in buffer), but the block is in cache so
// no prefetch will happen, which results in a decrease in readahead_size.
// readahead_size will be 8 * 1024
iter->Seek(BuildKey(1015));
ASSERT_TRUE(iter->Valid());
expected_current_readahead_size -= decrease_readahead_size;

// 1016 is the same block as 1015. So no change in readahead_size.
iter->Seek(BuildKey(1016));
ASSERT_TRUE(iter->Valid());

// Prefetch data (not in buffer) but found in cache. So decrease
// readahead_size. Since it will be 0 after decrementing, readahead_size
// will be set to its initial value.
iter->Seek(BuildKey(1019));
ASSERT_TRUE(iter->Valid());
expected_current_readahead_size = std::max(
decrease_readahead_size,
(expected_current_readahead_size >= decrease_readahead_size
@@ -988,6 +1005,7 @@ TEST_P(PrefetchTest2, DecreaseReadAheadIfInCache) {

// Prefetch next sequential data.
iter->Seek(BuildKey(1022));
ASSERT_TRUE(iter->Valid());
ASSERT_EQ(current_readahead_size, expected_current_readahead_size);
ASSERT_EQ(buff_prefetch_count, 2);
buff_prefetch_count = 0;
16 changes: 16 additions & 0 deletions file/random_access_file_reader.cc
@@ -424,4 +424,20 @@ IOStatus RandomAccessFileReader::PrepareIOOptions(const ReadOptions& ro,
return PrepareIOFromReadOptions(ro, SystemClock::Default().get(), opts);
}
}

// TODO akanksha: Add perf_times etc.
IOStatus RandomAccessFileReader::ReadAsync(
FSReadRequest& req, const IOOptions& opts,
std::function<void(const FSReadRequest&, void*)> cb, void* cb_arg,
void** io_handle, IOHandleDeleter* del_fn,
Env::IOPriority rate_limiter_priority) {
if (use_direct_io()) {
req.status = Read(opts, req.offset, req.len, &(req.result), req.scratch,
nullptr /*dbg*/, rate_limiter_priority);
cb(req, cb_arg);
return IOStatus::OK();
}
return file_->ReadAsync(req, opts, cb, cb_arg, io_handle, del_fn,
nullptr /*dbg*/);
}
} // namespace ROCKSDB_NAMESPACE
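
A hedged usage sketch for the new wrapper (not code from this PR); note that in the direct-I/O fallback above the read completes synchronously and the callback fires before ReadAsync returns:

```
#include "file/random_access_file_reader.h"

namespace ROCKSDB_NAMESPACE {
// Hypothetical caller of RandomAccessFileReader::ReadAsync.
void IssueAsyncRead(RandomAccessFileReader* reader, uint64_t offset,
                    size_t len, char* scratch) {
  FSReadRequest req;
  req.offset = offset;
  req.len = len;
  req.scratch = scratch;

  IOOptions opts;
  void* io_handle = nullptr;
  IOHandleDeleter del_fn = nullptr;

  auto cb = [](const FSReadRequest& r, void* /*cb_arg*/) {
    // r.status and r.result are valid here; e.g., mark the async
    // prefetch buffer as filled.
  };

  IOStatus s = reader->ReadAsync(req, opts, cb, /*cb_arg=*/nullptr,
                                 &io_handle, &del_fn, Env::IO_TOTAL);
  // If the FS handed back an io_handle, del_fn must eventually be
  // called to release it (FilePrefetchBuffer does this via del_fn_).
  (void)s;
}
}  // namespace ROCKSDB_NAMESPACE
```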
5 changes: 5 additions & 0 deletions file/random_access_file_reader.h
@@ -174,5 +174,10 @@ class RandomAccessFileReader {
bool use_direct_io() const { return file_->use_direct_io(); }

IOStatus PrepareIOOptions(const ReadOptions& ro, IOOptions& opts);

IOStatus ReadAsync(FSReadRequest& req, const IOOptions& opts,
std::function<void(const FSReadRequest&, void*)> cb,
void* cb_arg, void** io_handle, IOHandleDeleter* del_fn,
Env::IOPriority rate_limiter_priority);
};
} // namespace ROCKSDB_NAMESPACE
9 changes: 9 additions & 0 deletions include/rocksdb/options.h
@@ -1598,6 +1598,15 @@ struct ReadOptions {
// Default: `Env::IO_TOTAL`.
Env::IOPriority rate_limiter_priority = Env::IO_TOTAL;

// Experimental
//
// If async_io is enabled, RocksDB will prefetch some of the data
// asynchronously. RocksDB applies it if reads are sequential and its
// internal automatic prefetching is enabled.
//
// Default: false
bool async_io;

ReadOptions();
ReadOptions(bool cksum, bool cache);
};
6 changes: 4 additions & 2 deletions options/options.cc
@@ -665,7 +665,8 @@ ReadOptions::ReadOptions()
deadline(std::chrono::microseconds::zero()),
io_timeout(std::chrono::microseconds::zero()),
value_size_soft_limit(std::numeric_limits<uint64_t>::max()),
adaptive_readahead(false) {}
adaptive_readahead(false),
async_io(false) {}

ReadOptions::ReadOptions(bool cksum, bool cache)
: snapshot(nullptr),
@@ -689,6 +690,7 @@ ReadOptions::ReadOptions(bool cksum, bool cache)
deadline(std::chrono::microseconds::zero()),
io_timeout(std::chrono::microseconds::zero()),
value_size_soft_limit(std::numeric_limits<uint64_t>::max()),
adaptive_readahead(false) {}
adaptive_readahead(false),
async_io(false) {}

} // namespace ROCKSDB_NAMESPACE
6 changes: 3 additions & 3 deletions table/block_based/block_based_table_iterator.cc
@@ -232,9 +232,9 @@ void BlockBasedTableIterator::InitDataBlock() {
// Enabled after 2 sequential IOs when ReadOptions.readahead_size == 0.
// Explicit user requested readahead:
// Enabled from the very first IO when ReadOptions.readahead_size is set.
block_prefetcher_.PrefetchIfNeeded(rep, data_block_handle,
read_options_.readahead_size,
is_for_compaction);
block_prefetcher_.PrefetchIfNeeded(
rep, data_block_handle, read_options_.readahead_size, is_for_compaction,
read_options_.async_io);
Status s;
table_->NewDataBlockIterator<DataBlockIter>(
read_options_, data_block_handle, &block_iter_, BlockType::kData,
6 changes: 3 additions & 3 deletions table/block_based/block_based_table_reader.cc
@@ -1472,9 +1472,9 @@ Status BlockBasedTable::MaybeReadBlockAndLoadToCache(
// Update the block details so that PrefetchBuffer can use the read
// pattern to determine if reads are sequential or not for
// prefetching. It should also take into account blocks read from cache.
prefetch_buffer->UpdateReadPattern(handle.offset(),
BlockSizeWithTrailer(handle),
ro.adaptive_readahead);
prefetch_buffer->UpdateReadPattern(
handle.offset(), BlockSizeWithTrailer(handle),
ro.adaptive_readahead /*decrease_readahead_size*/);
}
}
}
Expand Down