mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
When memory.reclaim was introduced, it became the first case where
cgroup_reclaim() is true for the root cgroup.  Johannes concluded [1] that
for most cases this is okay, except for one case.  Historically, kswapd
would throttle reclaim on a node if a lot of pages marked for reclaim are
under writeback (aka the node is congested).  This was done by setting the
LRUVEC_CONGESTED bit in lruvec->flags.  The bit would be cleared when the
node is balanced.

Similarly, cgroup reclaim would set the same bit when an lruvec is
congested, and clear it on the way out of reclaim (to throttle local
reclaimers).

Before the introduction of memory.reclaim, the root memcg was the only
target of kswapd reclaim, and non-root memcgs were the only targets of
cgroup reclaim, so they would never interfere.  Using the same bit for
both was fine.  After memory.reclaim, it is possible for cgroup reclaim on
the root cgroup to clear the bit set by kswapd.  This would result in
reclaim on the node being unthrottled before the node is balanced.
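
To make the interference concrete, here is a minimal userspace sketch in C
of the old single-bit scheme.  It is a toy model, not kernel code: the
old_* names are hypothetical, and plain bit operations on an unsigned long
stand in for set_bit()/clear_bit() on lruvec->flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: one flag word standing in for lruvec->flags. */
enum old_flags {
	OLD_CONGESTED,	/* stands in for the single LRUVEC_CONGESTED bit */
};

struct old_lruvec {
	unsigned long flags;
};

/* kswapd finds the node congested and throttles reclaim on it. */
static void old_kswapd_mark_congested(struct old_lruvec *root)
{
	root->flags |= 1UL << OLD_CONGESTED;
}

/* cgroup reclaim clears the bit on its target lruvec on the way out. */
static void old_cgroup_reclaim_exit(struct old_lruvec *target)
{
	target->flags &= ~(1UL << OLD_CONGESTED);
}

static bool old_is_throttled(const struct old_lruvec *lruvec)
{
	return (lruvec->flags & (1UL << OLD_CONGESTED)) != 0;
}

/*
 * Replay the problematic sequence: kswapd throttles the root lruvec,
 * then cgroup reclaim targeting the root cgroup finishes.
 * Returns whether the root lruvec is still throttled afterwards.
 */
static bool old_replay_root_reclaim(void)
{
	struct old_lruvec root = { 0 };

	old_kswapd_mark_congested(&root);
	assert(old_is_throttled(&root));

	/* memory.reclaim on the root cgroup targets the same lruvec. */
	old_cgroup_reclaim_exit(&root);

	return old_is_throttled(&root);
}
```

Calling old_replay_root_reclaim() from a main() returns false in this
model: the throttle set by kswapd is gone even though nothing has balanced
the node.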

Fix this by introducing separate bits for cgroup-level and node-level
congestion.  kswapd can unthrottle an lruvec that is marked as congested
by cgroup reclaim (as the entire node should no longer be congested), but
not vice versa (to prevent premature unthrottling before the entire node
is balanced).
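
The asymmetry described above can be sketched in userspace C as follows.
This is a simplified model under stated assumptions, not the patch itself:
the new_* names are hypothetical, plain bit operations stand in for the
kernel's atomic set_bit()/clear_bit(), and conditions such as
writeback_throttling_sane() are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: two bits mirroring the split made by this commit. */
enum new_flags {
	NEW_CGROUP_CONGESTED,	/* set by cgroup-level reclaim */
	NEW_NODE_CONGESTED,	/* set by kswapd node-level reclaim */
};

struct new_lruvec {
	unsigned long flags;
};

static void new_cgroup_mark_congested(struct new_lruvec *lruvec)
{
	lruvec->flags |= 1UL << NEW_CGROUP_CONGESTED;
}

static void new_kswapd_mark_congested(struct new_lruvec *lruvec)
{
	lruvec->flags |= 1UL << NEW_NODE_CONGESTED;
}

/* cgroup reclaim only clears its own bit on the way out. */
static void new_cgroup_reclaim_exit(struct new_lruvec *lruvec)
{
	lruvec->flags &= ~(1UL << NEW_CGROUP_CONGESTED);
}

/* Once the node is balanced, kswapd clears both bits. */
static void new_kswapd_node_balanced(struct new_lruvec *lruvec)
{
	lruvec->flags &= ~(1UL << NEW_NODE_CONGESTED);
	lruvec->flags &= ~(1UL << NEW_CGROUP_CONGESTED);
}

/* Reclaimers are throttled if either bit is set. */
static bool new_is_throttled(const struct new_lruvec *lruvec)
{
	return (lruvec->flags & ((1UL << NEW_CGROUP_CONGESTED) |
				 (1UL << NEW_NODE_CONGESTED))) != 0;
}

/*
 * Replay the sequence that used to unthrottle the node: kswapd marks
 * the root lruvec congested, then root cgroup reclaim finishes.
 * Returns whether the root lruvec is still throttled afterwards.
 */
static bool new_replay_root_reclaim(void)
{
	struct new_lruvec root = { 0 };

	new_kswapd_mark_congested(&root);
	assert(new_is_throttled(&root));

	new_cgroup_reclaim_exit(&root);

	return new_is_throttled(&root);
}
```

In this model new_replay_root_reclaim() returns true, because the
node-level bit survives the cgroup reclaim exit, while
new_kswapd_node_balanced() drops a throttle set by either path.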

[1]https://lore.kernel.org/lkml/[email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Reported-by: Johannes Weiner <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
yosrym93 authored and akpm00 committed Jun 23, 2023
1 parent 7a70447 commit 1bc545b
Showing 2 changed files with 27 additions and 10 deletions.
18 changes: 15 additions & 3 deletions include/linux/mmzone.h
@@ -293,9 +293,21 @@ static inline bool is_active_lru(enum lru_list lru)
 #define ANON_AND_FILE 2

 enum lruvec_flags {
-	LRUVEC_CONGESTED,		/* lruvec has many dirty pages
-					 * backed by a congested BDI
-					 */
+	/*
+	 * An lruvec has many dirty pages backed by a congested BDI:
+	 * 1. LRUVEC_CGROUP_CONGESTED is set by cgroup-level reclaim.
+	 *    It can be cleared by cgroup reclaim or kswapd.
+	 * 2. LRUVEC_NODE_CONGESTED is set by kswapd node-level reclaim.
+	 *    It can only be cleared by kswapd.
+	 *
+	 * Essentially, kswapd can unthrottle an lruvec throttled by cgroup
+	 * reclaim, but not vice versa. This only applies to the root cgroup.
+	 * The goal is to prevent cgroup reclaim on the root cgroup (e.g.
+	 * memory.reclaim) to unthrottle an unbalanced node (that was throttled
+	 * by kswapd).
+	 */
+	LRUVEC_CGROUP_CONGESTED,
+	LRUVEC_NODE_CONGESTED,
 };

 #endif /* !__GENERATING_BOUNDS_H */
19 changes: 12 additions & 7 deletions mm/vmscan.c
@@ -6578,10 +6578,13 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * Legacy memcg will stall in page writeback so avoid forcibly
 	 * stalling in reclaim_throttle().
 	 */
-	if ((current_is_kswapd() ||
-	     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
-	    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
-		set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
+	if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested) {
+		if (cgroup_reclaim(sc) && writeback_throttling_sane(sc))
+			set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags);
+
+		if (current_is_kswapd())
+			set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags);
+	}

 	/*
 	 * Stall direct reclaim for IO completions if the lruvec is
@@ -6591,7 +6594,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 */
 	if (!current_is_kswapd() && current_may_throttle() &&
 	    !sc->hibernation_mode &&
-	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
+	    (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) ||
+	     test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags)))
 		reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);

 	if (should_continue_reclaim(pgdat, nr_node_reclaimed, sc))
@@ -6848,7 +6852,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,

 			lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
 						   zone->zone_pgdat);
-			clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
+			clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
 		}
 	}

@@ -7237,7 +7241,8 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
 {
 	struct lruvec *lruvec = mem_cgroup_lruvec(NULL, pgdat);

-	clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
+	clear_bit(LRUVEC_NODE_CONGESTED, &lruvec->flags);
+	clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
 	clear_bit(PGDAT_DIRTY, &pgdat->flags);
 	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
 }