hugetlb: memcg: account hugetlb-backed memory in memory controller
Currently, hugetlb memory usage is not accounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory.  This has been observed in our production system.

For instance, here is one of our use cases: suppose there are two 32G
containers.  The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic pages, depending on the workload within
it.  The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness.  But it is very
difficult to configure memory.max to keep overall consumption (including
anon, cache, slab, etc.) fair.

What we have had to resort to is constantly polling hugetlb usage and
readjusting memory.max.  A similar procedure is applied to other memory
limits (memory.low, for example).  However, this is rather cumbersome and
error-prone.  Furthermore, when there is a delay in correcting the memory
limits (for example, when hugetlb usage changes between consecutive runs
of the userspace agent), the system could be in an over- or underprotected
state.
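
To make the workaround concrete, here is a minimal sketch of the arithmetic
such a polling agent might perform on a kernel without this patch.  The
helper name and the clamping policy are assumptions for illustration; the
commit message does not describe the agent's actual logic.

```c
#include <stdint.h>

#define GiB (1024ULL * 1024ULL * 1024ULL)

/*
 * Hypothetical helper, not part of the patch: on a kernel where hugetlb
 * usage is NOT charged to the memory controller, return the memory.max
 * value that keeps hugetlb plus everything else within the container's
 * overall budget.  Clamps to 0 if hugetlb alone exceeds the budget.
 */
static uint64_t adjusted_memory_max(uint64_t container_budget,
				    uint64_t hugetlb_usage)
{
	if (hugetlb_usage >= container_budget)
		return 0;
	return container_budget - hugetlb_usage;
}
```

For the 32G container above with 3G of hugetlb in use, the agent would
write 29G to memory.max, and would have to recompute it every time the
hugetlb usage changes.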

This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller).  Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.

Some caveats to consider:
  * This feature is only available on cgroup v2.
  * There is no hugetlb pool management involved in the memory
    controller. As stated above, hugetlb folios are only charged towards
    the memory controller when they are used. Host overcommit management
    has to consider this when configuring hard limits.
  * Failure to charge towards the memcg results in SIGBUS. This could
    happen even if the hugetlb pool still has pages (when the cgroup
    limit is hit and the reclaim attempt fails).
  * When this feature is enabled, hugetlb pages contribute to memory
    reclaim protection. Tuning of the low and min limits must take
    hugetlb memory into account.
  * Hugetlb pages utilized while this option is not selected will not
    be tracked by the memory controller (even if cgroup v2 is remounted
    later on).
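
The charge lifecycle described above (reserve before taking a folio from
the pool, commit once the folio is obtained, cancel on failure, uncharge
on free) can be modelled in userspace roughly as follows.  This is only a
toy sketch: every toy_* name is hypothetical, the counters stand in for
the real memcg page counters, and reclaim is not modelled at all.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel objects; all names are hypothetical. */
struct toy_memcg { long usage; long max; bool hugetlb_accounting; };
struct toy_pool  { long free_folios; };
struct toy_folio { struct toy_memcg *memcg; long nr_pages; };

/* Reserve room in the cgroup before a folio is taken from the pool. */
static int toy_try_charge(struct toy_memcg *cg, long nr_pages)
{
	if (!cg->hugetlb_accounting)
		return -EOPNOTSUPP;	/* do not fail, but nothing to commit */
	if (cg->usage + nr_pages > cg->max)
		return -ENOMEM;		/* limit hit (no reclaim in this toy) */
	cg->usage += nr_pages;
	return 0;
}

/* Undo a reservation that was never committed to a folio. */
static void toy_cancel_charge(struct toy_memcg *cg, long nr_pages)
{
	cg->usage -= nr_pages;
}

/* Fault-time allocation: charge first, then take a folio from the pool. */
static int toy_alloc_folio(struct toy_pool *pool, struct toy_memcg *cg,
			   long nr_pages, struct toy_folio *out)
{
	int charge_ret = toy_try_charge(cg, nr_pages);

	/* The pool may still have folios; the commit says this ends in SIGBUS. */
	if (charge_ret == -ENOMEM)
		return -ENOMEM;

	if (pool->free_folios == 0) {
		if (charge_ret == 0)
			toy_cancel_charge(cg, nr_pages);
		return -ENOSPC;
	}

	pool->free_folios--;
	out->nr_pages = nr_pages;
	/* Commit: remember the owner only if a charge was actually taken. */
	out->memcg = (charge_ret == 0) ? cg : NULL;
	return 0;
}

/* Freeing returns the folio to the pool and drops the charge, if any. */
static void toy_free_folio(struct toy_pool *pool, struct toy_folio *folio)
{
	if (folio->memcg)
		folio->memcg->usage -= folio->nr_pages;
	folio->memcg = NULL;
	pool->free_folios++;
}
```

Note that folios sitting in the pool never touch the usage counter, which
is why the pool itself has to be sized and limited by other means.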

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Frank van der Linden <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
nhatsmrt authored and akpm00 committed Oct 18, 2023
1 parent 85ce2c5 commit 8cba957
Showing 7 changed files with 127 additions and 11 deletions.
29 changes: 29 additions & 0 deletions Documentation/admin-guide/cgroup-v2.rst
@@ -210,6 +210,35 @@ cgroup v2 currently supports the following mount options.
relying on the original semantics (e.g. specifying bogusly
high 'bypass' protection values at higher tree levels).

memory_hugetlb_accounting
Count HugeTLB memory usage towards the cgroup's overall
memory usage for the memory controller (for the purpose of
statistics reporting and memory protection). This is a new
behavior that could regress existing setups, so it must be
explicitly opted in with this mount option.

A few caveats to keep in mind:

* There is no HugeTLB pool management involved in the memory
controller. The pre-allocated pool does not belong to anyone.
Specifically, when a new HugeTLB folio is allocated to
the pool, it is not accounted for from the perspective of the
memory controller. It is only charged to a cgroup when it is
actually used (e.g. at page fault time). Host memory
overcommit management has to consider this when configuring
hard limits. In general, HugeTLB pool management should be
done via other mechanisms (such as the HugeTLB controller).
* Failure to charge a HugeTLB folio to the memory controller
results in SIGBUS. This could happen even if the HugeTLB pool
still has pages available (but the cgroup limit is hit and
reclaim attempt fails).
* Charging HugeTLB memory towards the memory controller affects
memory protection and reclaim dynamics. Any userspace tuning
(of the low and min limits, for example) needs to take this into
account.
* HugeTLB pages utilized while this option is not selected
will not be tracked by the memory controller (even if cgroup
v2 is remounted later on).


Organizing Processes and Threads
--------------------------------
5 changes: 5 additions & 0 deletions include/linux/cgroup-defs.h
@@ -115,6 +115,11 @@ enum {
* Enable recursive subtree protection
*/
CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),

/*
* Enable hugetlb accounting for the memory controller.
*/
CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
};

/* cftype->flags */
9 changes: 9 additions & 0 deletions include/linux/memcontrol.h
@@ -678,6 +678,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
return __mem_cgroup_charge(folio, mm, gfp);
}

int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
long nr_pages);

int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
gfp_t gfp, swp_entry_t entry);
void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
@@ -1258,6 +1261,12 @@ static inline int mem_cgroup_charge(struct folio *folio,
return 0;
}

static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
gfp_t gfp, long nr_pages)
{
return 0;
}

static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
{
15 changes: 14 additions & 1 deletion kernel/cgroup/cgroup.c
@@ -1902,6 +1902,7 @@ enum cgroup2_param {
Opt_favordynmods,
Opt_memory_localevents,
Opt_memory_recursiveprot,
Opt_memory_hugetlb_accounting,
nr__cgroup2_params
};

@@ -1910,6 +1911,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
fsparam_flag("favordynmods", Opt_favordynmods),
fsparam_flag("memory_localevents", Opt_memory_localevents),
fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot),
fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
{}
};

@@ -1936,6 +1938,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
case Opt_memory_recursiveprot:
ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
return 0;
case Opt_memory_hugetlb_accounting:
ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
return 0;
}
return -EINVAL;
}
@@ -1960,6 +1965,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;

if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
}
}

@@ -1973,6 +1983,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
seq_puts(seq, ",memory_localevents");
if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
seq_puts(seq, ",memory_recursiveprot");
if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
seq_puts(seq, ",memory_hugetlb_accounting");
return 0;
}

@@ -7050,7 +7062,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
"nsdelegate\n"
"favordynmods\n"
"memory_localevents\n"
"memory_recursiveprot\n");
"memory_recursiveprot\n"
"memory_hugetlb_accounting\n");
}
static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);

35 changes: 28 additions & 7 deletions mm/hugetlb.c
@@ -1927,6 +1927,7 @@ void free_huge_folio(struct folio *folio)
pages_per_huge_page(h), folio);
hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
pages_per_huge_page(h), folio);
mem_cgroup_uncharge(folio);
if (restore_reserve)
h->resv_huge_pages++;

@@ -3026,11 +3027,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct folio *folio;
long map_chg, map_commit;
long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
long gbl_chg;
int ret, idx;
int memcg_charge_ret, ret, idx;
struct hugetlb_cgroup *h_cg = NULL;
struct mem_cgroup *memcg;
bool deferred_reserve;
gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;

memcg = get_mem_cgroup_from_current();
memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
if (memcg_charge_ret == -ENOMEM) {
mem_cgroup_put(memcg);
return ERR_PTR(-ENOMEM);
}

idx = hstate_index(h);
/*
@@ -3039,8 +3049,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
* code of zero indicates a reservation exists (no change).
*/
map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
if (map_chg < 0)
if (map_chg < 0) {
if (!memcg_charge_ret)
mem_cgroup_cancel_charge(memcg, nr_pages);
mem_cgroup_put(memcg);
return ERR_PTR(-ENOMEM);
}

/*
* Processes that did not create the mapping will have no
@@ -3051,10 +3065,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
*/
if (map_chg || avoid_reserve) {
gbl_chg = hugepage_subpool_get_pages(spool, 1);
if (gbl_chg < 0) {
vma_end_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
}
if (gbl_chg < 0)
goto out_end_reservation;

/*
* Even though there was no reservation in the region/reserve
@@ -3136,6 +3148,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
pages_per_huge_page(h), folio);
}

if (!memcg_charge_ret)
mem_cgroup_commit_charge(folio, memcg);
mem_cgroup_put(memcg);

return folio;

out_uncharge_cgroup:
@@ -3147,7 +3164,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
out_subpool_put:
if (map_chg || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
out_end_reservation:
vma_end_reservation(h, vma, addr);
if (!memcg_charge_ret)
mem_cgroup_cancel_charge(memcg, nr_pages);
mem_cgroup_put(memcg);
return ERR_PTR(-ENOSPC);
}

42 changes: 41 additions & 1 deletion mm/memcontrol.c
@@ -7096,6 +7096,41 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
return ret;
}

/**
* mem_cgroup_hugetlb_try_charge - try to charge the memcg for a hugetlb folio
* @memcg: memcg to charge.
* @gfp: reclaim mode.
* @nr_pages: number of pages to charge.
*
* This function is called when allocating a huge page folio to determine if
* the memcg has the capacity for it. It does not commit the charge yet,
* as the hugetlb folio itself has not been obtained from the hugetlb pool.
*
* Once we have obtained the hugetlb folio, we can call
* mem_cgroup_commit_charge() to commit the charge. If we fail to obtain the
* folio, we should instead call mem_cgroup_cancel_charge() to undo the effect
* of try_charge().
*
* Returns 0 on success. Otherwise, an error code is returned.
*/
int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
long nr_pages)
{
/*
* If hugetlb memcg charging is not enabled, do not fail hugetlb allocation,
* but do not attempt to commit charge later (or cancel on error) either.
*/
if (mem_cgroup_disabled() || !memcg ||
!cgroup_subsys_on_dfl(memory_cgrp_subsys) ||
!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
return -EOPNOTSUPP;

if (try_charge(memcg, gfp, nr_pages))
return -ENOMEM;

return 0;
}

/**
* mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
* @folio: folio to charge.
@@ -7365,7 +7400,12 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
return;

memcg = folio_memcg(old);
VM_WARN_ON_ONCE_FOLIO(!memcg, old);
/*
* Note that it is normal to see !memcg for a hugetlb folio.
* For example, it could have been allocated when memory_hugetlb_accounting
* was not selected.
*/
VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !memcg, old);
if (!memcg)
return;

3 changes: 1 addition & 2 deletions mm/migrate.c
@@ -633,8 +633,7 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)

folio_copy_owner(newfolio, folio);

if (!folio_test_hugetlb(folio))
mem_cgroup_migrate(folio, newfolio);
mem_cgroup_migrate(folio, newfolio);
}
EXPORT_SYMBOL(folio_migrate_flags);
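
The features_show() hunk in kernel/cgroup/cgroup.c adds
memory_hugetlb_accounting to the newline-separated list of supported
cgroup v2 features.  A minimal userspace sketch for checking whether a
running kernel advertises the option could look like the following.  The
helper names are hypothetical, and the sysfs path
/sys/kernel/cgroup/features is stated from general kernel knowledge, not
from this diff.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if `name` appears as a whole line in newline-separated `features`. */
static bool feature_listed(const char *features, const char *name)
{
	size_t len = strlen(name);
	const char *line = features;

	while (*line) {
		const char *end = strchr(line, '\n');
		size_t line_len = end ? (size_t)(end - line) : strlen(line);

		/* Require an exact match so that prefixes do not count. */
		if (line_len == len && strncmp(line, name, len) == 0)
			return true;
		if (!end)
			break;
		line = end + 1;
	}
	return false;
}

/* Read the feature list from sysfs; returns false if it cannot be read. */
static bool kernel_supports_hugetlb_accounting(void)
{
	char buf[1024];
	size_t n;
	FILE *f = fopen("/sys/kernel/cgroup/features", "r");

	if (!f)
		return false;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return feature_listed(buf, "memory_hugetlb_accounting");
}
```

Support in the kernel does not mean the option is active: per the
documentation hunk, it still has to be opted into as a cgroup v2 mount
option, and cgroup_show_options() then reports it among the mount options
of the cgroup2 filesystem.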
