Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
row_cache: Always touch the partition on entry
This fixes a potential cause for reactor stalls during memory reclamation. Applies only to schemas without clustering columns. Every partition in cache has a dummy row at the end of the clustering range (last dummy). That dummy must be evicted last, because MVCC logic needs it to be there at all times. If LRU picks it for eviction and it's not the last row, eviction does nothing and moves on. Eventually, all other rows in this partition will be evicted too and then the partition will go away with it. Mutation reader updates the position of rows in the LRU (aka touching) as it walks over them. However, it was not always touching the last dummy row. If the partition was fully cached, and schema had no clustering key, it would exit early before reaching the last dummy row, here: inline void cache_flat_mutation_reader::move_to_next_entry() { clogger.trace("csm {}: move_to_next_entry(), curr={}", fmt::ptr(this), _next_row.position()); if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) { move_to_next_range(); That's because no_clustering_row_between() is always true for any key in tables with no clustering columns, and the reader advances to end-of-stream without advancing _next_row to the last dummy. This is expected and desired, it means that the query range ends at the current row and there is no need to move further. We would not take this exit for tables with a non-singular clustering key domain and open-ended query range, since there would be a key gap before the last dummy row. Refs scylladb#2972. The effect of leaving the last dummy row not touched will be that such scans will segregate rows in the LRU, bring all regular rows to the front, and dummy rows at the tail. When eviction reaches the band of dummy rows, it will have to walk over it, because evicting them releases no memory. This can cause a reactor stall. An easy fix for the scenario would be to always touch the dummy entry when entering the partition. It's unlikely that the read will not proceed to the regular rows. It would be best to avoid linking such dummies in the LRU, but that's a much more complex change. Discovered in perf_row_cache_update, test_small_partitions(). I saw 200ms stalls with -m8G. Refs scylladb#8541. Tests: - row_cache_test (release) - perf_simple_query [no change] Message-Id: <[email protected]>
- Loading branch information