mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage

Setting the original memory.limit_in_bytes hardlimit is subject to a
race condition when the desired value is below the current usage.  The
code tries a few times to first reclaim and then see if the usage has
dropped to where we would like it to be, but there is no locking, and
the workload is free to continue making new charges up to the old limit.
Thus, attempting to shrink a workload relies on pure luck and hope that
the workload happens to cooperate.
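
For illustration, this is roughly what the old behavior looks like from user space under cgroup v1: a write of a value below the current usage can simply fail once the kernel's bounded reclaim-and-recheck attempts lose the race (historically with EBUSY). A minimal sketch, assuming a v1 memory controller mounted at /sys/fs/cgroup/memory and an existing group "grp" with a running workload; the path and the 100M value are placeholders, not part of this commit.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Assumed path and value, for illustration only. */
	const char *path = "/sys/fs/cgroup/memory/grp/memory.limit_in_bytes";
	const char *new_limit = "104857600\n";	/* 100M, presumed below current usage */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, new_limit, strlen(new_limit)) < 0)
		/* A concurrently charging workload can defeat the kernel's
		 * few reclaim-then-recheck attempts, so the write may fail. */
		fprintf(stderr, "shrink failed: %s\n", strerror(errno));
	close(fd);
	return 0;
}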

To fix this in the cgroup2 memory.max knob, do it the other way round:
set the limit first, then try enforcement.  And if reclaim is not able
to succeed, trigger OOM kills in the group.  Keep going until the new
limit is met, we run out of OOM victims and there's only unreclaimable
memory left, or the task writing to memory.max is killed.  This allows
users to shrink groups reliably, and the behavior is consistent with
what happens when new charges are attempted in excess of memory.max.
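
From user space, the cgroup2 side of this is just a write to memory.max: after this change the new limit is installed first, so new charges are capped immediately, and the kernel reclaims and, if necessary, OOM kills on the writer's behalf. A minimal sketch, assuming a unified hierarchy with a group at /sys/fs/cgroup/grp; the path and the 100M value are placeholders, not part of this commit.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Assumed path and value, for illustration only. */
	const char *path = "/sys/fs/cgroup/grp/memory.max";
	const char *new_max = "104857600\n";	/* 100M */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* The limit takes effect before enforcement, so the group cannot
	 * grow past it; the write may take a while if the kernel has to
	 * reclaim or OOM kill to bring usage under the new limit. */
	if (write(fd, new_max, strlen(new_max)) < 0)
		perror("write memory.max");
	close(fd);
	return 0;
}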

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
hnaz authored and torvalds committed Mar 17, 2016
1 parent 588083b commit b6e6edc
Showing 2 changed files with 40 additions and 4 deletions.
Documentation/cgroup-v2.txt: 6 additions & 0 deletions
@@ -1387,6 +1387,12 @@ system than killing the group. Otherwise, memory.max is there to
 limit this type of spillover and ultimately contain buggy or even
 malicious applications.
 
+Setting the original memory.limit_in_bytes below the current usage was
+subject to a race condition, where concurrent charges could cause the
+limit setting to fail. memory.max on the other hand will first set the
+limit to prevent new charges, and then reclaim and OOM kill until the
+new limit is met - or the task writing to memory.max is killed.
+
 The combined memory+swap accounting and limiting is replaced by real
 control over swap space.

mm/memcontrol.c: 34 additions & 4 deletions
@@ -1236,7 +1236,7 @@ static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg)
 	return limit;
 }
 
-static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
+static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				     int order)
 {
 	struct oom_control oc = {
@@ -1314,6 +1314,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	}
 unlock:
 	mutex_unlock(&oom_lock);
+	return chosen;
 }
 
 #if MAX_NUMNODES > 1
@@ -5029,6 +5030,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned int nr_reclaims = MEM_CGROUP_RECLAIM_RETRIES;
+	bool drained = false;
 	unsigned long max;
 	int err;

@@ -5037,9 +5040,36 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	if (err)
 		return err;
 
-	err = mem_cgroup_resize_limit(memcg, max);
-	if (err)
-		return err;
+	xchg(&memcg->memory.limit, max);
+
+	for (;;) {
+		unsigned long nr_pages = page_counter_read(&memcg->memory);
+
+		if (nr_pages <= max)
+			break;
+
+		if (signal_pending(current)) {
+			err = -EINTR;
+			break;
+		}
+
+		if (!drained) {
+			drain_all_stock(memcg);
+			drained = true;
+			continue;
+		}
+
+		if (nr_reclaims) {
+			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
+							  GFP_KERNEL, true))
+				nr_reclaims--;
+			continue;
+		}
+
+		mem_cgroup_events(memcg, MEMCG_OOM, 1);
+		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
+			break;
+	}
 
 	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
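
To summarize the escalation order in the loop above, here is an annotated excerpt of that hunk (all symbols are from the diff itself; this is kernel code, not a standalone example): per-CPU charge caches are drained once, then reclaim runs until MEM_CGROUP_RECLAIM_RETRIES passes make no progress, and only then does each further iteration OOM kill inside the group, giving up when no victim is left.

	for (;;) {
		unsigned long nr_pages = page_counter_read(&memcg->memory);

		if (nr_pages <= max)		/* done: usage fits the new limit */
			break;
		if (signal_pending(current)) {	/* writer interrupted: stop enforcing;
						 * the lowered limit stays in place */
			err = -EINTR;
			break;
		}
		if (!drained) {			/* 1) flush per-CPU cached charges once */
			drain_all_stock(memcg);
			drained = true;
			continue;
		}
		if (nr_reclaims) {		/* 2) targeted reclaim; only a pass that
						 *    frees nothing consumes a retry */
			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
							  GFP_KERNEL, true))
				nr_reclaims--;
			continue;
		}
		mem_cgroup_events(memcg, MEMCG_OOM, 1);	/* 3) count an OOM event and kill */
		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
			break;			/* no killable victim left: give up */
	}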
