Truncate posts on the main /blog/ page
JoelMarcey committed Oct 4, 2016
1 parent 0d7acad commit b90e29c
Showing 31 changed files with 50 additions and 178 deletions.
2 changes: 1 addition & 1 deletion docs/_posts/2014-03-27-how-to-backup-rocksdb.markdown
@@ -22,7 +22,7 @@ In RocksDB, we have implemented an easy way to backup your DB. Here is a simple
backupable_db->CreateNewBackup();
delete backupable_db; // no need to also delete db


<!--truncate-->


This simple example will create a backup of your DB in "/tmp/rocksdb_backup". Creating a new BackupableDB consumes DB* and you should call all DB methods on the object `backupable_db` going forward.
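
For context, here is roughly what the full example from this post looks like as a self-contained program (a sketch assuming the `BackupableDB` API of this era; header locations may differ across versions):

    #include <cassert>

    #include "rocksdb/db.h"
    #include "utilities/backupable_db.h"  // header location in this era

    using namespace rocksdb;

    int main() {
      DB* db;
      Status s = DB::Open(Options(), "/tmp/rocksdb", &db);
      assert(s.ok());

      // Wrap the DB. BackupableDB consumes db; call all DB methods on
      // backupable_db from here on.
      BackupableDB* backupable_db =
          new BackupableDB(db, BackupableDBOptions("/tmp/rocksdb_backup"));

      backupable_db->Put(WriteOptions(), "key", "value");  // normal DB usage
      backupable_db->CreateNewBackup();
      delete backupable_db;  // no need to also delete db
      return 0;
    }
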
@@ -9,6 +9,8 @@ redirect_from:

In recent months, we have focused on optimizing RocksDB for in-memory workloads. With growing RAM sizes and strict low-latency requirements, lots of applications decide to keep their entire data in memory. Running an in-memory database with RocksDB is easy -- just mount your RocksDB directory on tmpfs or ramfs [1]. Even if the process crashes, RocksDB can recover all of your data from the in-memory filesystem. However, what happens if the machine reboots?

<!--truncate-->

In this article we will explain how you can recover your in-memory RocksDB database even after a machine reboot.

Every update to RocksDB is written to two places: one is an in-memory data structure called the memtable, and the second is the write-ahead log. The write-ahead log can be used to completely recover the data in the memtable. By default, when we flush the memtable to a table file, we also delete the current log, since we don't need it anymore for recovery (the data from the log is "persisted" in the table file -- we say that the log file is obsolete). However, if your table file is stored in an in-memory file system, you may need the obsolete write-ahead log to recover the data after the machine reboots. Here's how you can do that.
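
The recovery steps themselves are collapsed in this diff, but the gist can be sketched as a configuration (an illustrative sketch only: the paths are hypothetical, and it assumes `Options::wal_dir` and `Options::WAL_ttl_seconds` behave as documented):

    #include <cassert>
    #include <string>

    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    using namespace rocksdb;

    int main() {
      Options options;
      options.create_if_missing = true;

      // Data files live on an in-memory filesystem (hypothetical mount point)...
      const std::string db_path = "/mnt/ramfs/rocksdb";

      // ...while write-ahead logs go to persistent storage (hypothetical path),
      // so memtable contents can be replayed after a machine reboot.
      options.wal_dir = "/persistent/rocksdb_wal";

      // Archive obsolete logs for a while instead of deleting them right away,
      // so they remain available for recovery.
      options.WAL_ttl_seconds = 3600;

      DB* db;
      Status s = DB::Open(options, db_path, &db);
      assert(s.ok());
      delete db;
      return 0;
    }
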
@@ -9,6 +9,8 @@ redirect_from:

On Mar 27, 2014, the RocksDB team @ Facebook held the 1st RocksDB local meetup at FB HQ (Menlo Park, California). We invited around 80 guests from 20+ local companies, including LinkedIn, Twitter, Dropbox, Square, Pinterest, MapR, Microsoft and IBM. In the end, around 50 guests showed up, for a show-up rate of around 60%.

<!--truncate-->

[![Resize of 20140327_200754](/static/images/Resize-of-20140327_200754-300x225.jpg)](/static/images/Resize-of-20140327_200754-300x225.jpg)

The RocksDB team @ Facebook gave four talks about the latest progress and experience with RocksDB:
5 changes: 3 additions & 2 deletions docs/_posts/2014-04-07-rocksdb-2-8-release.markdown
@@ -9,10 +9,11 @@ redirect_from:

Check out the new RocksDB 2.8 release on [Github](https://github.com/facebook/rocksdb/releases/tag/2.8.fb).

RocksDB 2.8 is mostly focused on improving performance for in-memory workloads. We are seeing read QPS as high as 5M (we will write a separate blog post on this).

<!--truncate-->

Here is the summary of new features:

* Added a new table format called PlainTable, which is optimized for RAM storage (ramfs or tmpfs). You can read more details about it on [our wiki](https://github.com/facebook/rocksdb/wiki/PlainTable-Format).

@@ -11,6 +11,8 @@ For a `Get()` request, RocksDB goes through mutable memtable, list of immutable

On level 0, files are sorted based on the time they are flushed. Their key ranges (as defined by FileMetaData.smallest and FileMetaData.largest) mostly overlap with each other, so a lookup needs to check every L0 file.

<!--truncate-->

Compaction is scheduled periodically to pick up files from an upper level and merge them with files from a lower level. As a result, key/values are moved from L0 down the LSM tree gradually. Compaction sorts key/values and splits them into files. From level 1 down, SST files are sorted based on key, and their key ranges are mutually exclusive. Instead of scanning through each SST file and checking if a key falls into its range, RocksDB performs a binary search based on FileMetaData.largest to locate a candidate file that can potentially contain the target key. This reduces complexity from O(N) to O(log(N)). However, log(N) can still be large for bottom levels. For a fan-out ratio of 10, level 3 can have 1000 files, which requires 10 comparisons to locate a candidate file. This is a significant cost for an in-memory database when you can do [several million gets per second](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks).
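
As an illustration, here is a sketch of that per-level binary search (simplified stand-in types, not RocksDB's actual code):

    #include <string>
    #include <vector>

    // Simplified stand-in for RocksDB's per-file metadata.
    struct FileMetaData {
      std::string smallest;
      std::string largest;
    };

    // Binary search on FileMetaData.largest. On L1 and below, files are sorted
    // by key and their ranges are mutually exclusive, so the first file whose
    // largest key is >= target is the only candidate on that level. (A real
    // lookup would also verify target >= smallest before reading the file.)
    int FindCandidateFile(const std::vector<FileMetaData>& level_files,
                          const std::string& target) {
      int lo = 0;
      int hi = static_cast<int>(level_files.size());
      while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (level_files[mid].largest < target) {
          lo = mid + 1;  // the candidate must be to the right
        } else {
          hi = mid;      // mid could still be the candidate
        }
      }
      return lo;  // == level_files.size() means no candidate on this level
    }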

One observation about this problem is that after the LSM tree is built, an SST file's position in its level is fixed. Furthermore, its order relative to files from the next level is also fixed. Based on this idea, we can perform a [fractional cascading](http://en.wikipedia.org/wiki/Fractional_cascading)-style optimization to narrow down the binary search range. Here is an example:
3 changes: 1 addition & 2 deletions docs/_posts/2014-05-14-lock.markdown
@@ -11,8 +11,7 @@ In this post, we briefly introduce the recent improvements we did to RocksDB to

RocksDB has a simple thread synchronization mechanism (see the [RocksDB Architecture Guide](https://github.com/facebook/rocksdb/wiki/Rocksdb-Architecture-Guide) to understand terms used below, like SST tables or mem tables). SST tables are immutable after being written, and mem tables are lock-free data structures supporting a single writer and multiple readers. There is a single major lock, the DB mutex (DBImpl.mutex_), protecting all the meta operations, including:

<!--truncate-->

* Increase or decrease reference counters of mem tables and SST tables

5 changes: 1 addition & 4 deletions docs/_posts/2014-05-19-rocksdb-3-0-release.markdown
@@ -11,9 +11,6 @@ Check out new RocksDB release on [Github](https://github.com/facebook/rocksdb/re

New features in RocksDB 3.0:

* [Column Family support](https://github.com/facebook/rocksdb/wiki/Column-Families)


@@ -22,6 +19,6 @@ New features in RocksDB 3.0:

* Deprecated ReadOptions::prefix_seek and ReadOptions::prefix


<!--truncate-->

Check out the full [change log](https://github.com/facebook/rocksdb/blob/3.0.fb/HISTORY.md).
4 changes: 0 additions & 4 deletions docs/_posts/2014-05-22-rocksdb-3-1-release.markdown
@@ -11,14 +11,10 @@ Check out the new release on [Github](https://github.com/facebook/rocksdb/releas

New features in RocksDB 3.1:

* [Materialized hash index](https://github.com/facebook/rocksdb/commit/0b3d03d026a7248e438341264b4c6df339edc1d7)

* [FIFO compaction style](https://github.com/facebook/rocksdb/wiki/FIFO-compaction-style)

We released 3.1 so soon after 3.0 because one of our internal customers needed the materialized hash index.
2 changes: 2 additions & 0 deletions docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown
@@ -17,6 +17,8 @@ Design goals:
1. Minimize memory consumption.
1. Queries efficiently return empty results

<!--truncate-->

Notice that our priority was not to maximize query performance, but to strike a balance between query performance and memory consumption. PlainTable query performance is not as good as you would see with a nicely-designed hash table, but it is of the same order of magnitude, while keeping memory overhead to a minimum.

Since we are targeting microsecond latency, performance is dominated by the number of CPU cache misses (when they cannot be parallelized, which is usually the case for index look-ups). On our target hardware, Intel CPUs across multiple sockets with NUMA, we can only afford 4-5 CPU cache misses (including the cost of data TLB misses).
13 changes: 1 addition & 12 deletions docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown
@@ -9,32 +9,21 @@ redirect_from:

As promised in the previous [blog post](blog/2014/05/14/lock.html)!

RocksDB employs a multiversion concurrency control strategy. Before reading data, it needs to grab the current version, which is encapsulated in a data structure called [SuperVersion](https://reviews.facebook.net/rROCKSDB1fdb3f7dc60e96394e3e5b69a46ede5d67fb976c).

<!--truncate-->

At the beginning of `GetImpl()`, it used to do this:

    mutex_.Lock();
    auto* s = super_version_->Ref();
    mutex_.Unlock();

The lock is necessary because the pointer super_version_ may be updated and the corresponding SuperVersion deleted while `Ref()` is in progress.

`Ref()` simply increases the reference counter and returns the "this" pointer. However, this simple operation posed big challenges for in-memory workloads and stopped RocksDB from scaling read throughput beyond 8 cores. Running 32 read threads on a 32-core CPU leads to [70% system CPU usage](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf). This is outrageous!
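
The remedy the post goes on to describe avoids taking the mutex on the hot path. One way to sketch the idea (hypothetical code assuming a thread-local cache of SuperVersion references; not the actual implementation):

    #include <atomic>
    #include <cstdint>
    #include <mutex>

    // Each thread caches its own reference-counted pointer to the SuperVersion
    // it used last; the DB mutex is taken only when that cached copy is stale.
    struct SuperVersion {
      std::atomic<int> refs{0};
      uint64_t version_number = 0;

      SuperVersion* Ref() {
        refs.fetch_add(1, std::memory_order_relaxed);
        return this;
      }
      void Unref() {
        if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1) delete this;
      }
    };

    std::mutex db_mutex;              // stands in for DBImpl.mutex_
    SuperVersion* current = nullptr;  // stands in for super_version_;
                                      // assumed initialized at DB open
    std::atomic<uint64_t> current_version{0};

    // Per-thread cache. It holds one reference of its own (leaked at thread
    // exit in this sketch, for simplicity).
    thread_local SuperVersion* tls_cached = nullptr;

    // Returns a referenced SuperVersion; the caller must Unref() it when done.
    SuperVersion* AcquireSuperVersion() {
      if (tls_cached != nullptr &&
          tls_cached->version_number ==
              current_version.load(std::memory_order_acquire)) {
        return tls_cached->Ref();  // fast path: no mutex acquisition
      }
      std::lock_guard<std::mutex> guard(db_mutex);  // slow path: stale cache
      if (tls_cached != nullptr) tls_cached->Unref();
      tls_cached = current->Ref();  // the cache keeps its own reference
      return tls_cached->Ref();     // and the caller gets another
    }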


4 changes: 1 addition & 3 deletions docs/_posts/2014-06-27-rocksdb-3-2-release.markdown
@@ -11,14 +11,12 @@ Check out new RocksDB release on [GitHub](https://github.com/facebook/rocksdb/r

New Features in RocksDB 3.2:

* PlainTable now supports a new key encoding: for keys of the same prefix, the prefix is only written once. It can be enabled through the encoding_type parameter of NewPlainTableFactory()

* Added AdaptiveTableFactory, which is used to convert a DB from PlainTable to BlockBasedTable, or vice versa. It can be created using NewAdaptiveTableFactory()

<!--truncate-->

Public API changes:

4 changes: 1 addition & 3 deletions docs/_posts/2014-07-29-rocksdb-3-3-release.markdown
@@ -11,14 +11,12 @@ Check out new RocksDB release on [GitHub](https://github.com/facebook/rocksdb/r

New Features in RocksDB 3.3:

* **JSON API prototype**.

* **Performance improvement on HashLinkList**: We addressed a performance outlier in HashLinkList caused by skewed buckets by switching the data in a bucket from a linked list to a skip list. Added the parameter threshold_use_skiplist in NewHashLinkListRepFactory().

<!--truncate-->

* **More effective storage space reclaim**: RocksDB is now able to reclaim storage space more effectively during the compaction process. This is done by compensating the size of each deletion entry by 2X the average value size, which makes compaction be triggered by deletion entries more easily.

64 changes: 2 additions & 62 deletions docs/_posts/2014-09-12-cuckoo.markdown
@@ -9,22 +9,12 @@ redirect_from:

## Introduction

We recently introduced a new [Cuckoo Hashing](http://en.wikipedia.org/wiki/Cuckoo_hashing)-based SST file format which is optimized for fast point lookups. The new format was built for applications which require very high point-lookup rates (~4 Mqps) in read-only mode but do not use operations like range scan or merge operator. The existing RocksDB file formats were built to support range scans and other operations, and the current best point-lookup rate in RocksDB is 1.2 Mqps, achieved by the [PlainTable format](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). This prompted a hashing-based file format, which we present here. The new table format uses a cache-friendly version of the Cuckoo Hashing algorithm with only 1 or 2 memory accesses per lookup.

<!--truncate-->

Goals:

* Reduce memory accesses per lookup to 1 or 2


@@ -34,101 +24,51 @@ Goals:
* Minimize database size

Assumptions:

* Key length and value length are fixed

* The database is operated in read only mode

Non-goals:

* While optimizing the performance of the Get() operation was our primary goal, compaction and build times were secondary. We may work on improving them in the future.

Details for setting up the table format can be found on [GitHub](https://github.com/facebook/rocksdb/wiki/CuckooTable-Format).

## Cuckoo Hashing Algorithm

In order to achieve high lookup speeds, we made multiple optimizations, including a cache-friendly cuckoo hashing algorithm. Cuckoo Hashing uses multiple hash functions, _h1, ..., hn_.

### Original Cuckoo Hashing

To insert a new key _k_, we compute the hashes of the key, _h1(k), ..., hn(k)_. We insert the key in the first hash location that is free. If all the locations are occupied, we try to move one of the colliding keys to a different location by re-inserting it.

Finding the smallest set of keys to displace in order to accommodate the new key is naturally a shortest-path problem in a directed graph whose nodes are the buckets of the hash table, with an edge from bucket _A_ to bucket _B_ if the element stored in bucket _A_ can be moved to bucket _B_ using one of the hash functions. The source nodes are the possible hash locations for the given key _k_ and the destination is any empty bucket. We use this algorithm to handle collisions.

To retrieve a key _k_, we compute the hashes _h1(k), ..., hn(k)_; the key must be present in one of these locations.

Our goal is to minimize the average (and maximum) number of hash functions required, and hence the number of memory accesses. In our experiments, with a hash utilization of 90%, we found that the average number of lookups is 1.8 and the maximum is 3. Around 44% of keys are accommodated in the first hash location and 33% in the second.
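
To make the mechanics concrete, here is a toy sketch of cuckoo lookup and greedy insertion (hypothetical code storing keys only; not RocksDB's implementation, which uses the shortest-path displacement described above and a read-only SST layout):

    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <string>
    #include <utility>
    #include <vector>

    // A minimal cuckoo hash with n hash functions over a single bucket array.
    class CuckooSketch {
     public:
      CuckooSketch(size_t num_buckets, size_t num_hashes)
          : buckets_(num_buckets), num_hashes_(num_hashes) {}

      // A lookup probes at most n locations.
      bool Lookup(const std::string& key) const {
        for (size_t i = 0; i < num_hashes_; ++i) {
          const auto& slot = buckets_[Hash(key, i)];
          if (slot && *slot == key) return true;
        }
        return false;
      }

      // Greedy insertion with a bounded number of displacements. (The real
      // builder finds the shortest displacement path instead.)
      bool Insert(std::string key, int max_kicks = 128) {
        while (max_kicks-- > 0) {
          for (size_t i = 0; i < num_hashes_; ++i) {
            auto& slot = buckets_[Hash(key, i)];
            if (!slot) {
              slot = std::move(key);
              return true;
            }
          }
          // All n locations are full: evict one occupant and re-insert it.
          std::swap(key, *buckets_[Hash(key, max_kicks % num_hashes_)]);
        }
        return false;  // table too full; a real builder would grow or rehash
      }

     private:
      size_t Hash(const std::string& key, size_t i) const {
        // Derive the i-th hash from one base hash (double hashing).
        size_t h = std::hash<std::string>{}(key);
        return (h + i * (h * 0x9e3779b97f4a7c15ULL + 1)) % buckets_.size();
      }

      std::vector<std::optional<std::string>> buckets_;
      size_t num_hashes_;
    };
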
### Cache Friendly Cuckoo Hashing

We noticed the following two sub-optimal properties in the original Cuckoo implementation:

* If the key is not present in the first hash location, we jump to the second hash location, which may not be in cache. This results in many cache misses.

* Because only 44% of keys are located in the first cuckoo block, we couldn't find an optimal prefetching strategy: prefetching all hash locations for a key is wasteful, while prefetching only the first hash location helps in only 44% of cases.

The solution is to insert more keys near the first location. On a collision in the first hash location, _h1(k)_, we try to insert the key in the next few buckets, _h1(k)+1, h1(k)+2, ..., h1(k)+t-1_. If all of these _t_ locations are occupied, we skip over to the next hash function _h2_ and repeat the process. We call the set of _t_ buckets a _Cuckoo Block_. We choose _t_ such that the size of a block is not bigger than a cache line, and we prefetch the first cuckoo block.
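
The resulting probe sequence can be sketched as follows (hypothetical helper; _t_ is the cuckoo block size and hash_locations holds _h1(k), h2(k), ..._):

    #include <cstddef>
    #include <vector>

    // Cache-friendly probe order: scan the t consecutive buckets starting at
    // h1(k) (one cuckoo block, sized to fit in a cache line) before falling
    // back to the block at h2(k), and so on.
    std::vector<size_t> ProbeOrder(const std::vector<size_t>& hash_locations,
                                   size_t t, size_t num_buckets) {
      std::vector<size_t> order;
      for (size_t h : hash_locations) {  // h1(k), h2(k), ...
        for (size_t j = 0; j < t; ++j) {
          order.push_back((h + j) % num_buckets);
        }
      }
      return order;
    }

    // A reader prefetches the first cuckoo block (order[0..t-1]) up front;
    // at 90% utilization, ~85% of lookups finish inside that block.
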
With the new algorithm, at 90% hash utilization, we found that 85% of keys are accommodated in the first Cuckoo Block. Prefetching the first cuckoo block yields the best results. For a database of 100 million keys with key length 8 and value length 4, the hash algorithm alone can achieve 9.6 Mqps, and we are working on improving it further. End-to-end RocksDB performance results can be found [here](https://github.com/facebook/rocksdb/wiki/CuckooTable-Format).
57 changes: 1 addition & 56 deletions docs/_posts/2014-09-12-new-bloom-filter-format.markdown
@@ -9,99 +9,44 @@ redirect_from:

## Introduction

In this post, we are introducing "full filter block" --- a new bloom filter format for [block based table](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format). This could bring about a 40% improvement for key queries under an in-memory workload (all data stored in memory, files stored in tmpfs/ramfs; see an [example](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks) workload). The main idea behind it is to generate one big filter that covers all the keys in an SST file, to avoid lots of unnecessary memory look-ups.

<!--truncate-->

## What is a Bloom Filter

In brief, a [bloom filter](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter) is a bit array generated from a set of keys that can tell whether an arbitrary key may exist in that set.

In RocksDB, we generate such a bloom filter for each SST file. When we query for a key, we first go to the bloom filter block of the SST file. If the filter says the key may exist, we go into the data block of the SST file to search for it. If not, we return directly. So it speeds up point lookups a lot.
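
To make the idea concrete, here is a toy bloom filter (hypothetical code, not RocksDB's implementation):

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Toy bloom filter: k hash probes into a bit array. MayContain can return
    // a false positive, but never a false negative.
    class ToyBloomFilter {
     public:
      ToyBloomFilter(size_t num_bits, size_t num_probes)
          : bits_(num_bits, false), num_probes_(num_probes) {}

      void Add(const std::string& key) {
        for (size_t i = 0; i < num_probes_; ++i) bits_[Probe(key, i)] = true;
      }

      bool MayContain(const std::string& key) const {
        for (size_t i = 0; i < num_probes_; ++i) {
          if (!bits_[Probe(key, i)]) return false;  // definitely absent
        }
        return true;  // possibly present
      }

     private:
      size_t Probe(const std::string& key, size_t i) const {
        // Double hashing: derive probe i from one base hash.
        size_t h = std::hash<std::string>{}(key);
        return (h + i * (h * 0x9e3779b97f4a7c15ULL + 1)) % bits_.size();
      }

      std::vector<bool> bits_;
      size_t num_probes_;
    };
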
## Original Bloom Filter Format

The original bloom filter creates a filter for each individual data block in the SST file. It has a complex structure (see [here](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format#filter-meta-block)), which results in a lot of non-adjacent memory look-ups.

Here's the workflow for checking the original bloom filter in a block based table:

1. Given the target key, we go to the index block to get the "data block ID" where this key may reside.
1. Using the "data block ID", we go to the filter block and get the correct "offset of filter".
1. Using the "offset of filter", we go to the actual filter and do the checking.

## New Bloom Filter Format

The new bloom filter creates a filter for all keys in the SST file, and we name it "full filter". The data structure of the full filter is very simple; there is just one big filter:

    [ full filter ]

In this way, the workflow of bloom filter checking is much simplified:

1. Given the target key, we go directly to the filter block and conduct the filter checking.

To be specific, there is no check of the index block and no address jumping inside the filter block.

Though it is one big filter, the total filter size is about the same as with the original format.

One little drawback is that the new bloom filter introduces more memory consumption when building an SST file, because we need to buffer keys (or their hashes) before generating the filter. The original format creates a bunch of small filters, so it only buffers a small number of keys at a time. For the full filter, we buffer the hashes of all keys, which takes more memory as the SST file size increases.

## Usage & Customization

You can refer to the documentation for [usage](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#usage-of-new-bloom-filter) and [customization](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#customize-your-own-filterpolicy).
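
For reference, enabling the full filter looks roughly like this (a sketch assuming the NewBloomFilterPolicy signature of this era, where the second argument selects the full-filter builder):

    #include "rocksdb/db.h"
    #include "rocksdb/filter_policy.h"
    #include "rocksdb/options.h"
    #include "rocksdb/table.h"

    using namespace rocksdb;

    int main() {
      BlockBasedTableOptions table_options;
      // 10 bits per key; false selects the full-filter builder instead of the
      // original per-block filters.
      table_options.filter_policy.reset(NewBloomFilterPolicy(10, false));

      Options options;
      options.create_if_missing = true;
      options.table_factory.reset(NewBlockBasedTableFactory(table_options));

      DB* db;
      Status s = DB::Open(options, "/tmp/rocksdb_full_filter_demo", &db);
      if (s.ok()) delete db;
      return 0;
    }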




