Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fix major performance bug in AutoHCC growth phase (facebook#11871)
Summary: ## The Problem Mark Callaghan found a performance bug in yet-unreleased AutoHCC (which should have been found in my own testing). The observed behavior is very slow insertion performance as the table is growing into a very large structure. The root cause is the precarious combination of linear hashing (indexing into the table while allowing growth) and linear probing (for finding an empty slot to insert into). Naively combined, this is a disaster because in linear hashing, part of the table is twice as dense as first probing location as the rest. Thus, even a modest load factor like 0.6 could cause the dense part of the table to degrade to linear search. The code had a correction for this imbalance, which works in steady-state operation, but failed to account for the concentrating effect of table growth. Specifically, newly-added slots were underpopulated which allowed old slots to become over-populated and degrade to linear search, even in single-threaded operation. Here's an example: ``` ./cache_bench -cache_type=auto_hyper_clock_cache -threads=1 -populate_cache=0 -value_bytes=500 -cache_size=3000000000 -histograms=0 -report_problems -ops_per_thread=20000000 -resident_ratio=0.6 ``` AutoHCC: Complete in 774.213 s; Rough parallel ops/sec = 25832 FixedHCC: Complete in 19.630 s; Rough parallel ops/sec = 1018840 LRUCache: Complete in 25.842 s; Rough parallel ops/sec = 773947 ## The Fix One small change is apparently sufficient to fix the problem, but I wanted to re-optimize the whole "finding a good empty slot" algorithm to improve safety margins for good performance and to improve typical case performance. The small change is to track the newly-added slot from Grow in Insert, when applicable, and use that slot for insertion if (a) the home slot is already occupied, and (b) the newly-added slot is empty. This appears to sufficiently load new slots while avoiding over-population of either old or new slots. See `likely_empty_slot`. However I've also made the logic much more resilient to parts of the table becoming over-populated. I tested a variant that used double hashing instead of linear probing and found that hurt steady-state average-case performance, presumably due to loss of locality in the chains. And even conventional double hashing might not be ideally robust against density skew in the table (still present because of home location bias), because double hashing might choose a small increment that could take a long time to iterate to the under-populated part of the table. The compromise that seems to bring the best of each approach is this: do linear probing (+1 at a time) within a small bound (chosen bound of 4 based on performance testing) and then fall back on a double-hashing variant if no slot has been found. The double-hashing variant uses a probing increment that is always close to the golden ratio, relative to the table size, so that any under-populated regions of the table can be found relatively quickly, without introducing any additional skew. And the increment is varied slightly to avoid clustering effects that could happen with a fixed increment (regardless of how big it is). And that leaves us with one remaining problem: the double hashing increment might not be relatively prime to the table size, so the probing sequence might be a cycle that does not cover the full set of slots. To solve this we can use a technique I developed many years ago (probably also developed by others) that simply adds one (in modular arithmetic) whenever we finish a (potentially incomplete) cycle. This is a simple and reasonably efficient way to iterate over all the slots without repetition, regardless of whether the increment is not relatively prime to the table size, or even zero. Pull Request resolved: facebook#11871 Test Plan: existing correctness tests, especially ClockCacheTest.ClockTableFull Intended follow-up: make ClockTableFull test more complete for AutoHCC ## Performance Ignoring old AutoHCC performance, as we established above it could be terrible. FixedHCC and LRUCache are unaffected by this change. All tests below include this change. ### Getting up to size, single thread (same cache_bench command as above, all three run at same time) AutoHCC: Complete in 26.724 s; Rough parallel ops/sec = 748400 FixedHCC: Complete in 19.987 s; Rough parallel ops/sec = 1000631 LRUCache: Complete in 28.291 s; Rough parallel ops/sec = 706939 Single-threaded faster than LRUCache (often / sometimes) is good. FixedHCC has an obvious advantage because it starts at full size. ### Multiple threads, steady state, high hit rate ~95% Using `-threads=10 -populate_cache=1 -ops_per_thread=10000000` and still `-resident_ratio=0.6` AutoHCC: Complete in 48.778 s; Rough parallel ops/sec = 2050119 FixedHCC: Complete in 46.569 s; Rough parallel ops/sec = 2147329 LRUCache: Complete in 50.537 s; Rough parallel ops/sec = 1978735 ### Multiple threads, steady state, low hit rate ~50% Change to `-resident_ratio=0.2` AutoHCC: Complete in 49.264 s; Rough parallel ops/sec = 2029884 FixedHCC: Complete in 49.750 s; Rough parallel ops/sec = 2010041 LRUCache: Complete in 53.002 s; Rough parallel ops/sec = 1886713 Don't expect AutoHCC to be consistently faster than FixedHCC, but they are at least similar in these benchmarks. Reviewed By: jowlyzhang Differential Revision: D49548534 Pulled By: pdillinger fbshipit-source-id: 263e4f4d71d0e9a7d91db3795b48fad75408822b
- Loading branch information