Skip to content

Commit

Permalink
Pessimistic Transactions
Browse files Browse the repository at this point in the history
Summary:
Initial implementation of Pessimistic Transactions.  This diff contains the api changes discussed in D38913.  This diff is pretty large, so let me know if people would prefer to meet up to discuss it.

MyRocks folks:  please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.

Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint().  After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex.  We can then decide which route is preferable.

Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.

Test Plan: Unit tests, db_bench parallel testing.

Reviewers: igor, rven, sdong, yhchiang, yoshinorim

Reviewed By: sdong

Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40869
  • Loading branch information
agiardullo committed Aug 12, 2015
1 parent c2868cb commit c2f2cb0
Show file tree
Hide file tree
Showing 31 changed files with 4,875 additions and 492 deletions.
5 changes: 5 additions & 0 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -221,6 +221,10 @@ set(SOURCES
utilities/table_properties_collectors/compact_on_deletion_collector.cc
utilities/transactions/optimistic_transaction_impl.cc
utilities/transactions/optimistic_transaction_db_impl.cc
utilities/transactions/transaction_impl.cc
utilities/transactions/transaction_db_impl.cc
utilities/transactions/transaction_lock_mgr.cc
utilities/transactions/transaction_util.cc
utilities/ttl/db_ttl_impl.cc
utilities/write_batch_with_index/write_batch_with_index.cc
utilities/write_batch_with_index/write_batch_with_index_internal.cc
Expand Down Expand Up @@ -333,6 +337,7 @@ set(TESTS
utilities/spatialdb/spatial_db_test.cc
utilities/table_properties_collectors/compact_on_deletion_collector_test.cc
utilities/transactions/optimistic_transaction_test.cc
utilities/transactions/transaction_test.cc
utilities/ttl/ttl_test.cc
utilities/write_batch_with_index/write_batch_with_index_test.cc
)
Expand Down
1 change: 1 addition & 0 deletions HISTORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@

## 3.12.0 (7/2/2015)
### New Features
* Added experimental support for pessimistic transactions. See include/rocksdb/utilities/transaction.h for more info.
* Added experimental support for optimistic transactions. See include/rocksdb/utilities/optimistic_transaction.h for more info.
* Added a new way to report QPS from db_bench (check out --report_file and --report_interval_seconds)
* Added a cache for individual rows. See DBOptions::row_cache for more info.
Expand Down
6 changes: 5 additions & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -304,7 +304,8 @@ TESTS = \
write_callback_test \
heap_test \
compact_on_deletion_collector_test \
compaction_job_stats_test
compaction_job_stats_test \
transaction_test

SUBSET := $(shell echo $(TESTS) |sed s/^.*$(ROCKSDBTESTS_START)/$(ROCKSDBTESTS_START)/)

Expand Down Expand Up @@ -919,6 +920,9 @@ write_callback_test: db/write_callback_test.o $(LIBOBJECTS) $(TESTHARNESS)
heap_test: util/heap_test.o $(GTEST)
$(AM_LINK)

transaction_test: utilities/transactions/transaction_test.o $(LIBOBJECTS) $(TESTHARNESS)
$(AM_LINK)

sst_dump: tools/sst_dump.o $(LIBOBJECTS)
$(AM_LINK)

Expand Down
86 changes: 60 additions & 26 deletions db/db_bench.cc
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,8 @@ int main() {
#include "rocksdb/slice_transform.h"
#include "rocksdb/perf_context.h"
#include "rocksdb/utilities/flashcache.h"
#include "rocksdb/utilities/optimistic_transaction.h"
#include "rocksdb/utilities/transaction.h"
#include "rocksdb/utilities/transaction_db.h"
#include "rocksdb/utilities/optimistic_transaction_db.h"
#include "port/port.h"
#include "port/stack_trace.h"
Expand Down Expand Up @@ -448,10 +449,14 @@ DEFINE_int32(deletepercent, 2, "Percentage of deletes out of reads/writes/"
DEFINE_uint64(delete_obsolete_files_period_micros, 0,
"Ignored. Left here for backward compatibility");

DEFINE_bool(transaction_db, false,
DEFINE_bool(optimistic_transaction_db, false,
"Open a OptimisticTransactionDB instance. "
"Required for randomtransaction benchmark.");

DEFINE_bool(transaction_db, false,
"Open a TransactionDB instance. "
"Required for randomtransaction benchmark.");

DEFINE_uint64(transaction_sets, 2,
"Number of keys each transaction will "
"modify (use in RandomTransaction only). Max: 9999");
Expand Down Expand Up @@ -919,15 +924,15 @@ static void AppendWithSpace(std::string* str, Slice msg) {
struct DBWithColumnFamilies {
std::vector<ColumnFamilyHandle*> cfh;
DB* db;
OptimisticTransactionDB* txn_db;
OptimisticTransactionDB* opt_txn_db;
std::atomic<size_t> num_created; // Need to be updated after all the
// new entries in cfh are set.
size_t num_hot; // Number of column families to be queried at each moment.
// After each CreateNewCf(), another num_hot number of new
// Column families will be created and used to be queried.
port::Mutex create_cf_mutex; // Only one thread can execute CreateNewCf()

DBWithColumnFamilies() : db(nullptr), txn_db(nullptr) {
DBWithColumnFamilies() : db(nullptr), opt_txn_db(nullptr) {
cfh.clear();
num_created = 0;
num_hot = 0;
Expand All @@ -936,17 +941,17 @@ struct DBWithColumnFamilies {
DBWithColumnFamilies(const DBWithColumnFamilies& other)
: cfh(other.cfh),
db(other.db),
txn_db(other.txn_db),
opt_txn_db(other.opt_txn_db),
num_created(other.num_created.load()),
num_hot(other.num_hot) {}

void DeleteDBs() {
std::for_each(cfh.begin(), cfh.end(),
[](ColumnFamilyHandle* cfhi) { delete cfhi; });
cfh.clear();
if (txn_db) {
delete txn_db;
txn_db = nullptr;
if (opt_txn_db) {
delete opt_txn_db;
opt_txn_db = nullptr;
} else {
delete db;
}
Expand Down Expand Up @@ -2445,11 +2450,19 @@ class Benchmark {
if (FLAGS_readonly) {
s = DB::OpenForReadOnly(options, db_name, column_families,
&db->cfh, &db->db);
} else if (FLAGS_transaction_db) {
} else if (FLAGS_optimistic_transaction_db) {
s = OptimisticTransactionDB::Open(options, db_name, column_families,
&db->cfh, &db->txn_db);
&db->cfh, &db->opt_txn_db);
if (s.ok()) {
db->db = db->opt_txn_db->GetBaseDB();
}
} else if (FLAGS_transaction_db) {
TransactionDB* ptr;
TransactionDBOptions txn_db_options;
s = TransactionDB::Open(options, txn_db_options, db_name,
column_families, &db->cfh, &ptr);
if (s.ok()) {
db->db = db->txn_db->GetBaseDB();
db->db = ptr;
}
} else {
s = DB::Open(options, db_name, column_families, &db->cfh, &db->db);
Expand All @@ -2459,11 +2472,19 @@ class Benchmark {
db->num_hot = num_hot;
} else if (FLAGS_readonly) {
s = DB::OpenForReadOnly(options, db_name, &db->db);
} else if (FLAGS_optimistic_transaction_db) {
s = OptimisticTransactionDB::Open(options, db_name, &db->opt_txn_db);
if (s.ok()) {
db->db = db->opt_txn_db->GetBaseDB();
}
} else if (FLAGS_transaction_db) {
s = OptimisticTransactionDB::Open(options, db_name, &db->txn_db);
TransactionDB* ptr;
TransactionDBOptions txn_db_options;
s = TransactionDB::Open(options, txn_db_options, db_name, &ptr);
if (s.ok()) {
db->db = db->txn_db->GetBaseDB();
db->db = ptr;
}

} else {
s = DB::Open(options, db_name, &db->db);
}
Expand Down Expand Up @@ -3530,7 +3551,6 @@ class Benchmark {
uint64_t transactions_aborted = 0;
Status s;
uint64_t num_prefix_ranges = FLAGS_transaction_sets;
bool use_txn = FLAGS_transaction_db;

if (num_prefix_ranges == 0 || num_prefix_ranges > 9999) {
fprintf(stderr, "invalid value for transaction_sets\n");
Expand All @@ -3545,19 +3565,25 @@ class Benchmark {
}

while (!duration.Done(1)) {
OptimisticTransaction* txn = nullptr;
Transaction* txn = nullptr;
WriteBatch* batch = nullptr;

if (use_txn) {
txn = db_.txn_db->BeginTransaction(write_options_);
if (FLAGS_optimistic_transaction_db) {
txn = db_.opt_txn_db->BeginTransaction(write_options_);
assert(txn);
} else if (FLAGS_transaction_db) {
TransactionDB* txn_db = reinterpret_cast<TransactionDB*>(db_.db);
TransactionOptions txn_options;
txn_options.expiration = 10000000;
txn = txn_db->BeginTransaction(write_options_, txn_options);
} else {
batch = new WriteBatch();
}

// pick a random number to use to increment a key in each set
uint64_t incr = (thread->rand.Next() % 100) + 1;

bool failed = false;
// For each set, pick a key at random and increment it
for (uint8_t i = 0; i < num_prefix_ranges; i++) {
uint64_t int_value;
Expand All @@ -3572,8 +3598,8 @@ class Benchmark {
std::string full_key = std::string(prefix_buf) + base_key.ToString();
Slice key(full_key);

if (use_txn) {
s = txn->Get(read_options, key, &value);
if (txn) {
s = txn->GetForUpdate(read_options, key, &value);
} else {
s = db->Get(read_options, key, &value);
}
Expand All @@ -3599,15 +3625,23 @@ class Benchmark {
}

std::string sum = ToString(int_value + incr);
if (use_txn) {
txn->Put(key, sum);
if (txn) {
s = txn->Put(key, sum);
if (!s.ok()) {
failed = true;
break;
}
} else {
batch->Put(key, sum);
}
}

if (use_txn) {
s = txn->Commit();
if (txn) {
if (failed) {
txn->Rollback();
} else {
s = txn->Commit();
}
} else {
s = db->Write(write_options_, batch);
}
Expand All @@ -3616,7 +3650,7 @@ class Benchmark {
// Ideally, we'd want to run this stress test with enough concurrency
// on a small enough set of keys that we get some failed transactions
// due to conflicts.
if (use_txn && s.IsBusy()) {
if (txn && s.IsBusy()) {
transactions_aborted++;
} else {
fprintf(stderr, "Unexpected write error: %s\n", s.ToString().c_str());
Expand All @@ -3635,7 +3669,7 @@ class Benchmark {
}

char msg[100];
if (use_txn) {
if (FLAGS_optimistic_transaction_db || FLAGS_transaction_db) {
snprintf(msg, sizeof(msg),
"( transactions:%" PRIu64 " aborts:%" PRIu64 ")",
transactions_done, transactions_aborted);
Expand All @@ -3653,7 +3687,7 @@ class Benchmark {
// Since each iteration of RandomTransaction() incremented a key in each set
// by the same value, the sum of the keys in each set should be the same.
void RandomTransactionVerify() {
if (!FLAGS_transaction_db) {
if (!FLAGS_transaction_db && !FLAGS_optimistic_transaction_db) {
// transactions not used, nothing to verify.
return;
}
Expand Down
48 changes: 47 additions & 1 deletion db/db_impl.cc
Original file line number Diff line number Diff line change
Expand Up @@ -3686,7 +3686,7 @@ Status DBImpl::WriteImpl(const WriteOptions& write_options,
mutex_.Lock();
}

if (db_options_.paranoid_checks && !status.ok() &&
if (db_options_.paranoid_checks && !status.ok() && !status.IsTimedOut() &&
!status.IsBusy() && bg_error_.ok()) {
bg_error_ = status; // stop compaction & fail any further writes
}
Expand Down Expand Up @@ -3944,6 +3944,22 @@ SuperVersion* DBImpl::GetAndRefSuperVersion(uint32_t column_family_id) {
return GetAndRefSuperVersion(cfd);
}

// REQUIRED: mutex is NOT held
SuperVersion* DBImpl::GetAndRefSuperVersionUnlocked(uint32_t column_family_id) {
ColumnFamilyData* cfd;
{
InstrumentedMutexLock l(&mutex_);
auto column_family_set = versions_->GetColumnFamilySet();
cfd = column_family_set->GetColumnFamily(column_family_id);
}

if (!cfd) {
return nullptr;
}

return GetAndRefSuperVersion(cfd);
}

void DBImpl::ReturnAndCleanupSuperVersion(ColumnFamilyData* cfd,
SuperVersion* sv) {
bool unref_sv = !cfd->ReturnThreadLocalSuperVersion(sv);
Expand Down Expand Up @@ -3974,6 +3990,22 @@ void DBImpl::ReturnAndCleanupSuperVersion(uint32_t column_family_id,
ReturnAndCleanupSuperVersion(cfd, sv);
}

// REQUIRED: Mutex should NOT be held.
void DBImpl::ReturnAndCleanupSuperVersionUnlocked(uint32_t column_family_id,
SuperVersion* sv) {
ColumnFamilyData* cfd;
{
InstrumentedMutexLock l(&mutex_);
auto column_family_set = versions_->GetColumnFamilySet();
cfd = column_family_set->GetColumnFamily(column_family_id);
}

// If SuperVersion is held, and we successfully fetched a cfd using
// GetAndRefSuperVersion(), it must still exist.
assert(cfd != nullptr);
ReturnAndCleanupSuperVersion(cfd, sv);
}

// REQUIRED: this function should only be called on the write thread or if the
// mutex is held.
ColumnFamilyHandle* DBImpl::GetColumnFamilyHandle(uint32_t column_family_id) {
Expand All @@ -3986,6 +4018,20 @@ ColumnFamilyHandle* DBImpl::GetColumnFamilyHandle(uint32_t column_family_id) {
return cf_memtables->GetColumnFamilyHandle();
}

// REQUIRED: mutex is NOT held.
ColumnFamilyHandle* DBImpl::GetColumnFamilyHandleUnlocked(
uint32_t column_family_id) {
ColumnFamilyMemTables* cf_memtables = column_family_memtables_.get();

InstrumentedMutexLock l(&mutex_);

if (!cf_memtables->Seek(column_family_id)) {
return nullptr;
}

return cf_memtables->GetColumnFamilyHandle();
}

void DBImpl::GetApproximateSizes(ColumnFamilyHandle* column_family,
const Range* range, int n, uint64_t* sizes,
bool include_memtable) {
Expand Down
10 changes: 10 additions & 0 deletions db/db_impl.h
Original file line number Diff line number Diff line change
Expand Up @@ -326,6 +326,9 @@ class DBImpl : public DB {
// mutex is held.
SuperVersion* GetAndRefSuperVersion(uint32_t column_family_id);

// Same as above, should called without mutex held and not on write thread.
SuperVersion* GetAndRefSuperVersionUnlocked(uint32_t column_family_id);

// Un-reference the super version and return it to thread local cache if
// needed. If it is the last reference of the super version. Clean it up
// after un-referencing it.
Expand All @@ -336,11 +339,18 @@ class DBImpl : public DB {
// REQUIRED: this function should only be called on the write thread.
void ReturnAndCleanupSuperVersion(uint32_t colun_family_id, SuperVersion* sv);

// Same as above, should called without mutex held and not on write thread.
void ReturnAndCleanupSuperVersionUnlocked(uint32_t colun_family_id,
SuperVersion* sv);

// REQUIRED: this function should only be called on the write thread or if the
// mutex is held. Return value only valid until next call to this function or
// mutex is released.
ColumnFamilyHandle* GetColumnFamilyHandle(uint32_t column_family_id);

// Same as above, should called without mutex held and not on write thread.
ColumnFamilyHandle* GetColumnFamilyHandleUnlocked(uint32_t column_family_id);

protected:
Env* const env_;
const std::string dbname_;
Expand Down
7 changes: 5 additions & 2 deletions examples/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ include ../make_config.mk

.PHONY: clean

all: simple_example column_families_example compact_files_example c_simple_example transaction_example
all: simple_example column_families_example compact_files_example c_simple_example optimistic_transaction_example transaction_example

simple_example: simple_example.cc
$(CXX) $(CXXFLAGS) $@.cc -o$@ ../librocksdb.a -I../include -O2 -std=c++11 $(PLATFORM_LDFLAGS) $(PLATFORM_CXXFLAGS) $(EXEC_LDFLAGS)
Expand All @@ -19,8 +19,11 @@ compact_files_example: compact_files_example.cc
c_simple_example: c_simple_example.o
$(CXX) $@.o -o$@ ../librocksdb.a $(PLATFORM_LDFLAGS) $(EXEC_LDFLAGS)

optimistic_transaction_example: optimistic_transaction_example.cc
$(CXX) $(CXXFLAGS) $@.cc -o$@ ../librocksdb.a -I../include -O2 -std=c++11 $(PLATFORM_LDFLAGS) $(PLATFORM_CXXFLAGS) $(EXEC_LDFLAGS)

transaction_example: transaction_example.cc
$(CXX) $(CXXFLAGS) $@.cc -o$@ ../librocksdb.a -I../include -O2 -std=c++11 $(PLATFORM_LDFLAGS) $(PLATFORM_CXXFLAGS) $(EXEC_LDFLAGS)

clean:
rm -rf ./simple_example ./column_families_example ./compact_files_example ./c_simple_example c_simple_example.o ./transaction_example
rm -rf ./simple_example ./column_families_example ./compact_files_example ./c_simple_example c_simple_example.o ./optimistic_transaction_example ./transaction_example
Loading

0 comments on commit c2f2cb0

Please sign in to comment.