Skip to content

Commit

Permalink
mm: workingset: eviction buckets for bigmem/lowbit machines
Browse files Browse the repository at this point in the history
For per-cgroup thrash detection, we need to store the memcg ID inside
the radix tree cookie as well.  However, on 32 bit that doesn't leave
enough bits for the eviction timestamp to cover the necessary range of
recently evicted pages.  The radix tree entry would look like this:

[ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ]

12 bits means 4096 pages, means 16M worth of recently evicted pages.
But refaults are actionable up to distances covering half of memory.  To
not miss refaults, we have to stretch out the range at the cost of how
precisely we can tell when a page was evicted.  This way we can shave
off lower bits from the eviction timestamp until the necessary range is
covered.  E.g.  grouping evictions into 1M buckets (256 pages) will
stretch the longest representable refault distance to 4G.

This patch implements eviction buckets that are automatically sized
according to the available bits and the necessary refault range, in
preparation for per-cgroup thrash detection.

The maximum actionable distance is currently half of memory, but to
support memory hotplug of up to 200% of boot-time memory, we size the
buckets to cover double the distance.  Beyond that, thrashing won't be
detectable anymore.

During boot, the kernel will print out the exact parameters, like so:

  [    0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6

In this example, there are 12 radix entry bits available for the
eviction timestamp, to cover a maximum distance of 2^18 pages (this is a
1G machine).  Consequently, evictions must be grouped into buckets of
2^6 pages, or 256K.

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Vladimir Davydov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
  • Loading branch information
hnaz authored and torvalds committed Mar 15, 2016
1 parent 162453b commit 612e449
Showing 1 changed file with 29 additions and 1 deletion.
30 changes: 29 additions & 1 deletion mm/workingset.c
Original file line number Diff line number Diff line change
Expand Up @@ -156,8 +156,19 @@
ZONES_SHIFT + NODES_SHIFT)
#define EVICTION_MASK (~0UL >> EVICTION_SHIFT)

/*
* Eviction timestamps need to be able to cover the full range of
* actionable refaults. However, bits are tight in the radix tree
* entry, and after storing the identifier for the lruvec there might
* not be enough left to represent every single actionable refault. In
* that case, we have to sacrifice granularity for distance, and group
* evictions into coarser buckets by shaving off lower timestamp bits.
*/
static unsigned int bucket_order __read_mostly;

static void *pack_shadow(unsigned long eviction, struct zone *zone)
{
eviction >>= bucket_order;
eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
Expand All @@ -178,7 +189,7 @@ static void unpack_shadow(void *shadow, struct zone **zonep,
entry >>= NODES_SHIFT;

*zonep = NODE_DATA(nid)->node_zones + zid;
*evictionp = entry;
*evictionp = entry << bucket_order;
}

/**
Expand Down Expand Up @@ -400,8 +411,25 @@ static struct lock_class_key shadow_nodes_key;

static int __init workingset_init(void)
{
unsigned int timestamp_bits;
unsigned int max_order;
int ret;

BUILD_BUG_ON(BITS_PER_LONG < EVICTION_SHIFT);
/*
* Calculate the eviction bucket size to cover the longest
* actionable refault distance, which is currently half of
* memory (totalram_pages/2). However, memory hotplug may add
* some more pages at runtime, so keep working with up to
* double the initial memory by using totalram_pages as-is.
*/
timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
max_order = fls_long(totalram_pages - 1);
if (max_order > timestamp_bits)
bucket_order = max_order - timestamp_bits;
printk("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
timestamp_bits, max_order, bucket_order);

ret = list_lru_init_key(&workingset_shadow_nodes, &shadow_nodes_key);
if (ret)
goto err;
Expand Down

0 comments on commit 612e449

Please sign in to comment.