cgroup: use a dedicated workqueue for cgroup destruction
commit e5fca24 upstream.

Since be44562 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), the cgroup destruction path makes use of a workqueue.
css freeing has been performed from a work item from that point on, and
a later commit, ea15f8c ("cgroup: split cgroup destruction into two
steps"), moved css offlining to a workqueue too.

As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on system_wq.  Unfortunately, some
controllers may block in the destruction path for a considerable
duration while holding cgroup_mutex.  As a large part of the
destruction path is synchronized through cgroup_mutex, when combined
with a high rate of cgroup removals, this has the potential to fill up
system_wq's max_active of 256.

Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu().  If
such an operation happens while system_wq is fully occupied by cgroup
destruction work items, the result is a deadlock: work_on_cpu() can't
make forward progress because system_wq is full, and the other
destruction work items on system_wq can't make forward progress
because the work item waiting for work_on_cpu() is holding
cgroup_mutex.

This can be fixed by queueing destruction work items on a separate
workqueue.  This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose.  As these work items shouldn't
have inter-dependencies and are mostly serialized by cgroup_mutex
anyway, giving them a high concurrency level doesn't buy anything, so
the workqueue's @max_active is set to 1 and destruction work items are
executed one by one on each CPU.

Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
separate core_initcall().  In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Hugh Dickins <[email protected]>
Reported-by: Shawn Bohrer <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/g/[email protected]
Cc: [email protected] # v3.9+
Signed-off-by: Greg Kroah-Hartman <[email protected]>
htejun authored and gregkh committed Dec 4, 2013
1 parent 52915b4 commit a6647e9
Showing 1 changed file with 26 additions and 2 deletions.
28 changes: 26 additions & 2 deletions kernel/cgroup.c
@@ -91,6 +91,14 @@ static DEFINE_MUTEX(cgroup_mutex);

 static DEFINE_MUTEX(cgroup_root_mutex);

+/*
+ * cgroup destruction makes heavy use of work items and there can be a lot
+ * of concurrent destructions. Use a separate workqueue so that cgroup
+ * destruction work items don't end up filling up max_active of system_wq
+ * which may lead to deadlock.
+ */
+static struct workqueue_struct *cgroup_destroy_wq;
+
 /*
  * Generate an array of cgroup subsystem pointers. At boot time, this is
  * populated with the built in subsystems, and modular subsystems are
@@ -873,7 +881,7 @@ static void cgroup_free_rcu(struct rcu_head *head)
 {
 	struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);

-	schedule_work(&cgrp->free_work);
+	queue_work(cgroup_destroy_wq, &cgrp->free_work);
 }

 static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -4686,6 +4694,22 @@ int __init cgroup_init(void)
 	return err;
 }

+static int __init cgroup_wq_init(void)
+{
+	/*
+	 * There isn't much point in executing destruction path in
+	 * parallel. Good chunk is serialized with cgroup_mutex anyway.
+	 * Use 1 for @max_active.
+	 *
+	 * We would prefer to do this in cgroup_init() above, but that
+	 * is called before init_workqueues(): so leave this until after.
+	 */
+	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
+	BUG_ON(!cgroup_destroy_wq);
+	return 0;
+}
+core_initcall(cgroup_wq_init);
+
 /*
  * proc_cgroup_show()
  *  - Print task's cgroup paths into seq_file, one line for each hierarchy
@@ -4996,7 +5020,7 @@ void __css_put(struct cgroup_subsys_state *css)

 	v = css_unbias_refcnt(atomic_dec_return(&css->refcnt));
 	if (v == 0)
-		schedule_work(&css->dput_work);
+		queue_work(cgroup_destroy_wq, &css->dput_work);
 }
 EXPORT_SYMBOL_GPL(__css_put);
