Fix major performance bug in AutoHCC growth phase (facebook#11871)

Summary:

## The Problem

Mark Callaghan found a performance bug in yet-unreleased AutoHCC (which should have been found in my own testing). The observed behavior is very slow insertion performance as the table is growing into a very large structure.

The root cause is the precarious combination of linear hashing (indexing into the table while allowing growth) and linear probing (for finding an empty slot to insert into). Naively combined, this is a disaster because in linear hashing, part of the table is twice as dense as the rest when counted by home (first probing) location. Thus, even a modest load factor like 0.6 could cause the dense part of the table to degrade to linear search. The code had a correction for this imbalance, which works in steady-state operation, but failed to account for the concentrating effect of table growth. Specifically, newly-added slots were underpopulated, which allowed old slots to become over-populated and degrade to linear search, even in single-threaded operation.

Here's an example:

```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=1 -populate_cache=0 -value_bytes=500 -cache_size=3000000000 -histograms=0 -report_problems -ops_per_thread=20000000 -resident_ratio=0.6
```

AutoHCC: Complete in 774.213 s; Rough parallel ops/sec = 25832
FixedHCC: Complete in 19.630 s; Rough parallel ops/sec = 1018840
LRUCache: Complete in 25.842 s; Rough parallel ops/sec = 773947

## The Fix

One small change is apparently sufficient to fix the problem, but I wanted to re-optimize the whole "finding a good empty slot" algorithm to improve safety margins for good performance and to improve typical case performance.

The small change is to track the newly-added slot from Grow in Insert, when applicable, and use that slot for insertion if (a) the home slot is already occupied, and (b) the newly-added slot is empty. This appears to sufficiently load new slots while avoiding over-population of either old or new slots. See `likely_empty_slot`.
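
The density skew described under "The Problem" can be reproduced with a small enumeration. The sketch below is not RocksDB code: `LinearHashHome` and `HomeCounts` are hypothetical helpers that assume the textbook linear-hashing rule (use the larger address space when that slot exists, otherwise the smaller one). AutoHCC's actual home computation may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: textbook linear-hashing home slot for a table with
// `used_length` slots, where 2^min_shift <= used_length <= 2^(min_shift+1).
inline size_t LinearHashHome(uint64_t hash, size_t used_length,
                             int min_shift) {
  assert(used_length >= (size_t{1} << min_shift));
  assert(used_length <= (size_t{1} << (min_shift + 1)));
  // First try the larger address space (slots that have already been split).
  size_t home =
      static_cast<size_t>(hash & ((uint64_t{1} << (min_shift + 1)) - 1));
  if (home >= used_length) {
    // That slot does not exist yet, so fall back to the smaller address space.
    home &= (size_t{1} << min_shift) - 1;
  }
  return home;
}

// Counts, for each slot, how many of the 2^(min_shift+1) possible low-bit
// patterns choose it as home. Uniform hashes hit slots in this proportion.
inline std::vector<int> HomeCounts(size_t used_length, int min_shift) {
  std::vector<int> counts(used_length, 0);
  const uint64_t patterns = uint64_t{1} << (min_shift + 1);
  for (uint64_t h = 0; h < patterns; ++h) {
    ++counts[LinearHashHome(h, used_length, min_shift)];
  }
  return counts;
}
```

With six slots and `min_shift = 2`, `HomeCounts(6, 2)` gives `{1, 1, 2, 2, 1, 1}`: the two not-yet-split slots are home to twice as many hash patterns as the others, which is the imbalance that unbounded linear probing amplifies.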
However, I've also made the logic much more resilient to parts of the table becoming over-populated. I tested a variant that used double hashing instead of linear probing and found that it hurt steady-state average-case performance, presumably due to loss of locality in the chains. And even conventional double hashing might not be ideally robust against density skew in the table (still present because of home location bias), because double hashing might choose a small increment that could take a long time to iterate to the under-populated part of the table.

The compromise that seems to bring the best of each approach is this: do linear probing (+1 at a time) within a small bound (a bound of 4, chosen based on performance testing) and then fall back on a double-hashing variant if no slot has been found. The double-hashing variant uses a probing increment that is always close to the golden ratio, relative to the table size, so that any under-populated regions of the table can be found relatively quickly, without introducing any additional skew. And the increment is varied slightly to avoid clustering effects that could happen with a fixed increment (regardless of how big it is).

That leaves us with one remaining problem: the double hashing increment might not be relatively prime to the table size, so the probing sequence might be a cycle that does not cover the full set of slots. To solve this we can use a technique I developed many years ago (probably also developed by others) that simply adds one (in modular arithmetic) whenever we finish a (potentially incomplete) cycle. This is a simple and reasonably efficient way to iterate over all the slots without repetition, regardless of whether the increment is relatively prime to the table size, or even zero.
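
The "add one whenever a cycle finishes" technique described above can be sketched in isolation. This is an illustrative reduction, not the patch itself: `ProbeOrder` and `CoversAllSlotsOnce` are hypothetical names, and the sketch records the visit order instead of attempting insertions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the order in which n slots are visited when
// starting at `first` and stepping by `incr` (mod n), advancing by one extra
// slot each time the walk returns to the start of the current cycle.
inline std::vector<size_t> ProbeOrder(size_t n, size_t first, size_t incr) {
  assert(n > 0 && first < n && incr < n);
  std::vector<size_t> order;
  order.reserve(n);
  size_t idx = first;
  size_t start = first;
  for (size_t i = 0; i < n; ++i) {
    order.push_back(idx);
    idx += incr;
    if (idx >= n) {
      idx -= n;  // wrap around without using %
    }
    if (idx == start) {
      // Finished a cycle that may not have covered every slot (incr and n
      // can share a factor). Shift both cursors by one to begin the next one.
      if (++idx >= n) {
        idx -= n;
      }
      if (++start >= n) {
        start -= n;
      }
    }
  }
  return order;
}

// True if the walk visits each of the n slots exactly once.
inline bool CoversAllSlotsOnce(size_t n, size_t first, size_t incr) {
  std::vector<bool> seen(n, false);
  for (size_t idx : ProbeOrder(n, first, incr)) {
    if (seen[idx]) {
      return false;
    }
    seen[idx] = true;
  }
  return true;  // n visits with no repeats means all n slots were covered
}
```

For example, `ProbeOrder(6, 0, 4)` visits `0, 4, 2, 1, 5, 3` even though 4 and 6 share a factor of 2, and `ProbeOrder(4, 2, 0)` degrades gracefully to `2, 3, 0, 1` with a zero increment.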
Pull Request resolved: facebook#11871

Test Plan: existing correctness tests, especially ClockCacheTest.ClockTableFull

Intended follow-up: make ClockTableFull test more complete for AutoHCC

## Performance

Ignoring old AutoHCC performance, as we established above it could be terrible. FixedHCC and LRUCache are unaffected by this change. All tests below include this change.

### Getting up to size, single thread

(same cache_bench command as above, all three run at same time)

AutoHCC: Complete in 26.724 s; Rough parallel ops/sec = 748400
FixedHCC: Complete in 19.987 s; Rough parallel ops/sec = 1000631
LRUCache: Complete in 28.291 s; Rough parallel ops/sec = 706939

Single-threaded faster than LRUCache (often / sometimes) is good. FixedHCC has an obvious advantage because it starts at full size.

### Multiple threads, steady state, high hit rate ~95%

Using `-threads=10 -populate_cache=1 -ops_per_thread=10000000` and still `-resident_ratio=0.6`

AutoHCC: Complete in 48.778 s; Rough parallel ops/sec = 2050119
FixedHCC: Complete in 46.569 s; Rough parallel ops/sec = 2147329
LRUCache: Complete in 50.537 s; Rough parallel ops/sec = 1978735

### Multiple threads, steady state, low hit rate ~50%

Change to `-resident_ratio=0.2`

AutoHCC: Complete in 49.264 s; Rough parallel ops/sec = 2029884
FixedHCC: Complete in 49.750 s; Rough parallel ops/sec = 2010041
LRUCache: Complete in 53.002 s; Rough parallel ops/sec = 1886713

Don't expect AutoHCC to be consistently faster than FixedHCC, but they are at least similar in these benchmarks.

Reviewed By: jowlyzhang

Differential Revision: D49548534

Pulled By: pdillinger

fbshipit-source-id: 263e4f4d71d0e9a7d91db3795b48fad75408822b
laofan13 · Sep 22, 2023 · f6cb763
1 parent 269478e
commit f6cb763
Showing 2 changed files with 91 additions and 68 deletions.
diff --git a/cache/clock_cache.cc b/cache/clock_cache.cc
@@ -2218,6 +2218,9 @@ bool AutoHyperClockTable::Grow(InsertState& state) {
   // forward" due to length_info_ being out-of-date.
   CatchUpLengthInfoNoWait(grow_home);
 
+  // See usage in DoInsert()
+  state.likely_empty_slot = grow_home;
+
   // Success
   return true;
 }
@@ -2847,14 +2850,15 @@ AutoHyperClockTable::HandleImpl* AutoHyperClockTable::DoInsert(
   // We could go searching through the chain for any duplicate, but that's
   // not typically helpful, except for the REDUNDANT block cache stats.
   // (Inferior duplicates will age out with eviction.) However, we do skip
-  // insertion if the home slot already has a match (already_matches below),
-  // so that we keep better CPU cache locality when we can.
+  // insertion if the home slot (or some other we happen to probe) already
+  // has a match (already_matches below). This helps to keep better locality
+  // when we can.
   //
   // And we can do that as part of searching for an available slot to
   // insert the new entry, because our preferred location and first slot
   // checked will be the home slot.
   //
-  // As the table initially grows to size few entries will be in the same
+  // As the table initially grows to size, few entries will be in the same
   // cache line as the chain head. However, churn in the cache relatively
   // quickly improves the proportion of entries sharing that cache line with
   // the chain head. Data:
@@ -2877,12 +2881,19 @@ AutoHyperClockTable::HandleImpl* AutoHyperClockTable::DoInsert(
 
   size_t idx = home;
   bool already_matches = false;
-  if (!TryInsert(proto, arr[idx], initial_countdown, take_ref,
-                 &already_matches)) {
-    if (already_matches) {
-      return nullptr;
-    }
-
+  if (TryInsert(proto, arr[idx], initial_countdown, take_ref,
+                &already_matches)) {
+    assert(idx == home);
+  } else if (already_matches) {
+    return nullptr;
+    // Here we try to populate newly-opened slots in the table, but not
+    // when we can add something to its home slot. This makes the structure
+    // more performant more quickly on (initial) growth.
+  } else if (UNLIKELY(state.likely_empty_slot > 0) &&
+             TryInsert(proto, arr[state.likely_empty_slot], initial_countdown,
+                       take_ref, &already_matches)) {
+    idx = state.likely_empty_slot;
+  } else {
     // We need to search for an available slot outside of the home.
     // Linear hashing provides nice resizing but does typically mean
     // that some heads (home locations) have (in expectation) twice as
@@ -2892,81 +2903,88 @@ AutoHyperClockTable::HandleImpl* AutoHyperClockTable::DoInsert(
     //
     // This means that if we just use linear probing (by a small constant)
     // to find an available slot, part of the structure could easily fill up
-    // and resot to linear time operations even when the overall load factor
+    // and resort to linear time operations even when the overall load factor
     // is only modestly high, like 70%. Even though each slot has its own CPU
-    // cache line, there is likely a small locality benefit (e.g. TLB and
-    // paging) to iterating one by one, but obviously not with the linear
-    // hashing imbalance.
+    // cache line, there appears to be a small locality benefit (e.g. TLB and
+    // paging) to iterating one by one, as long as we don't run afoul of the
+    // linear hashing imbalance.
     //
     // In a traditional non-concurrent structure, we could keep a "free list"
     // to ensure immediate access to an available slot, but maintaining such
     // a structure could require more cross-thread coordination to ensure
     // all entries are eventually available to all threads.
     //
-    // The way we solve this problem is to use linear probing but try to
-    // correct for the linear hashing imbalance (when probing beyond the
-    // home slot). If the home is high load (minimum shift) we choose an
-    // alternate location, uniformly among all slots, to linear probe from.
-    //
-    // Supporting data: we can use FixedHyperClockCache to get a baseline
-    // of near-ideal distribution of occupied slots, with its uniform
-    // distribution and double hashing.
-    // $ ./cache_bench -cache_type=fixed_hyper_clock_cache -histograms=0
-    //     -cache_size=1300000000
-    // ...
-    // Slot occupancy stats: Overall 59% (156629/262144),
-    //   Min/Max/Window = 47%/70%/500, MaxRun{Pos/Neg} = 22/15
-    //
-    // Now we can try various sizes between powers of two with AutoHCC to see
-    // how bad the MaxRun can be.
-    // $ for I in `seq 8 15`; do
-    //     ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0
-    //       -cache_size=${I}00000000 2>&1 | grep clock_cache.cc; done
-    // where the worst case MaxRun was with I=11:
-    // Slot occupancy stats: Overall 59% (132528/221094),
-    //   Min/Max/Window = 44%/73%/500, MaxRun{Pos/Neg} = 64/19
-    //
-    // The large table size offers a large sample size to be confident that
-    // this is an acceptable level of clustering (max ~3x probe length)
-    // compared to no clustering. Increasing the max load factor to 0.7
-    // increases the MaxRun above 100, potentially much closer to a tipping
-    // point.
-
-    // TODO? remember a freed entry from eviction, possibly in thread local
-
-    size_t start = home;
-    if (orig_home_shift == LengthInfoToMinShift(state.saved_length_info)) {
-      start = FastRange64(proto.hashed_key[0], used_length);
-    }
-    idx = start;
-    for (int cycles = 0;;) {
+    // The way we solve this problem is to use unit-increment linear probing
+    // with a small bound, and then fall back on big jumps to have a good
+    // chance of finding a slot in an under-populated region quickly if that
+    // doesn't work.
+    size_t i = 0;
+    constexpr size_t kMaxLinearProbe = 4;
+    for (; i < kMaxLinearProbe; i++) {
+      idx++;
+      if (idx >= used_length) {
+        idx -= used_length;
+      }
       if (TryInsert(proto, arr[idx], initial_countdown, take_ref,
                     &already_matches)) {
         break;
       }
       if (already_matches) {
         return nullptr;
       }
-      ++idx;
-      if (idx >= used_length) {
-        // In case the structure has grown, double-check
-        StartInsert(state);
-        used_length = LengthInfoToUsedLength(state.saved_length_info);
+    }
+    if (i == kMaxLinearProbe) {
+      // Keep searching, but change to a search method that should quickly
+      // find any under-populated region. Switching to an increment based
+      // on the golden ratio helps with that, but we also inject some minor
+      // variation (less than 2%, 1 in 2^6) to avoid clustering effects on
+      // this larger increment (if it were a fixed value in steady state
+      // operation). Here we are primarily using upper bits of hashed_key[1]
+      // while home is based on lowest bits.
+      uint64_t incr_ratio = 0x9E3779B185EBCA87U + (proto.hashed_key[1] >> 6);
+      size_t incr = FastRange64(incr_ratio, used_length);
+      assert(incr > 0);
+      size_t start = idx;
+      for (;; i++) {
+        idx += incr;
         if (idx >= used_length) {
-          idx = 0;
+          // Wrap around (faster than %)
+          idx -= used_length;
         }
-      }
-      if (idx == start) {
-        // Cycling back should not happen unless there is enough random churn
-        // in parallel that we happen to hit each slot at a time that it's
-        // occupied, which is really only feasible for small structures, though
-        // with linear probing to find empty slots, "small" here might be
-        // larger than for double hashing.
-        assert(used_length <= 256);
-        ++cycles;
-        if (cycles > 2) {
-          // Fall back on standalone insert in case something goes awry to
-          // cause this
+        if (idx == start) {
+          // We have just completed a cycle that might not have covered all
+          // slots. (incr and used_length could have common factors.)
+          // Increment for the next cycle, which eventually ensures complete
+          // iteration over the set of slots before repeating.
+          idx++;
+          if (idx >= used_length) {
+            idx -= used_length;
+          }
+          start++;
+          if (start >= used_length) {
+            start -= used_length;
+          }
+          if (i >= used_length) {
+            used_length = LengthInfoToUsedLength(
+                length_info_.load(std::memory_order_acquire));
+            if (i >= used_length * 2) {
+              // Cycling back should not happen unless there is enough random
+              // churn in parallel that we happen to hit each slot at a time
+              // that it's occupied, which is really only feasible for small
+              // structures, though with linear probing to find empty slots,
+              // "small" here might be larger than for double hashing.
+              assert(used_length <= 256);
+              // Fall back on standalone insert in case something goes awry to
+              // cause this
+              return nullptr;
+            }
+          }
+        }
+        if (TryInsert(proto, arr[idx], initial_countdown, take_ref,
+                      &already_matches)) {
+          break;
+        }
+        if (already_matches) {
           return nullptr;
         }
       }
@@ -3481,6 +3499,10 @@ void AutoHyperClockTable::Evict(size_t requested_charge, InsertState& state,
 
     for (HandleImpl* h : to_finish_eviction) {
       TrackAndReleaseEvictedEntry(h, data);
+      // NOTE: setting likely_empty_slot here can cause us to reduce the
+      // portion of "at home" entries, probably because an evicted entry
+      // is more likely to come back than a random new entry and would be
+      // unable to go into its home slot.
     }
     to_finish_eviction.clear();
 

diff --git a/cache/clock_cache.h b/cache/clock_cache.h
@@ -822,6 +822,7 @@ class AutoHyperClockTable : public BaseClockTable {
   // For BaseClockTable::Insert
   struct InsertState {
     uint64_t saved_length_info = 0;
+    size_t likely_empty_slot = 0;
   };
 
   void StartInsert(InsertState& state);
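
For reference, the golden-ratio increment computed in the `DoInsert` hunk can be sketched standalone. The sketch assumes `FastRange64(x, n)` maps a 64-bit value onto `[0, n)` by taking the high 64 bits of the 128-bit product `x * n`. `MulHigh64` and `ProbeIncrement` are hypothetical stand-ins written for portability and are not the RocksDB functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical portable helper: high 64 bits of the 128-bit product a * b.
inline uint64_t MulHigh64(uint64_t a, uint64_t b) {
  const uint64_t mask = 0xFFFFFFFFULL;
  const uint64_t a_lo = a & mask, a_hi = a >> 32;
  const uint64_t b_lo = b & mask, b_hi = b >> 32;
  const uint64_t lo_lo = a_lo * b_lo;
  const uint64_t hi_lo = a_hi * b_lo;
  const uint64_t lo_hi = a_lo * b_hi;
  const uint64_t hi_hi = a_hi * b_hi;
  // Sum of the middle partial products plus carry from the low product;
  // this cannot overflow 64 bits.
  const uint64_t cross = (lo_lo >> 32) + (hi_lo & mask) + lo_hi;
  return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

// Hypothetical stand-in for the increment computation in the patch: a ratio
// near 0.618 (the golden-ratio constant from the diff) plus up to 1/64 of
// per-key variation, scaled to the current table length.
inline size_t ProbeIncrement(uint64_t hashed_key_1, size_t used_length) {
  const uint64_t incr_ratio = 0x9E3779B185EBCA87ULL + (hashed_key_1 >> 6);
  return static_cast<size_t>(MulHigh64(incr_ratio, used_length));
}
```

Under these assumptions, a table of 1000 slots gets an increment between 618 and 633 depending on the key, so successive probes land far apart while different keys follow slightly different strides.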