Merge tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - Yafang Shao added the task_get_cgroup1() helper to enable a similar
   BPF helper so that BPF progs can be more useful on cgroup1
   hierarchies. While cgroup1 is mostly in maintenance mode, the
   addition is small and has outsized usefulness for users who are
   still on cgroup1 (a usage sketch follows this list). Yafang also
   optimized root cgroup list access by making it RCU protected in the
   process.

 - Waiman Long optimized rstat operation leading to substantially lower
   and more consistent lock hold time while flushing the hierarchical
   statistics. As the lock can be acquired briefly in various hot paths,
   this reduction has cascading benefits.

 - Waiman also improved the quality of isolation for cpuset's isolated
   partitions. CPUs allocated to isolated partitions are now excluded
   from running unbound work items, and the cpu_is_isolated() test,
   which is used by vmstat and memcg to reduce interference, now
   includes cpuset-isolated CPUs. While it isn't there yet, the hope is
   to eventually reach parity with the isolation provided by the
   `isolcpus` boot param, but in a dynamic manner.
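
As an illustration (not part of this merge), a kernel-side caller of the
new helper might look like the sketch below. The wrapper
report_cgroup1_id() and its pr_info() output are hypothetical;
task_get_cgroup1(), cgroup_id() and cgroup_put() are existing kernel
interfaces.

#include <linux/cgroup.h>
#include <linux/err.h>
#include <linux/printk.h>
#include <linux/sched.h>

/* Hypothetical caller: look up and log a task's cgroup on the cgroup1
 * hierarchy identified by @hierarchy_id. */
static int report_cgroup1_id(struct task_struct *tsk, int hierarchy_id)
{
	struct cgroup *cgrp;

	cgrp = task_get_cgroup1(tsk, hierarchy_id);
	if (IS_ERR(cgrp))
		return PTR_ERR(cgrp);	/* -ENOENT: no such hierarchy/cgroup */

	pr_info("task %d: cgroup id %llu on hierarchy %d\n",
		task_pid_nr(tsk), cgroup_id(cgrp), hierarchy_id);

	cgroup_put(cgrp);	/* drop the reference taken by the helper */
	return 0;
}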

* tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Move rcu_head up near the top of cgroup_root
  cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check
  cgroup: Avoid false cacheline sharing of read mostly rstat_cpu
  cgroup/rstat: Optimize cgroup_rstat_updated_list()
  cgroup: Fix documentation for cpu.idle
  cgroup/cpuset: Expose cpuset.cpus.isolated
  workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS
  cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
  cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask
  cgroup/cpuset: Keep track of CPUs in isolated partitions
  selftests/cgroup: Minor code cleanup and reorganization of test_cpuset_prs.sh
  workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
  selftests: cgroup: Fixes a typo in a comment
  cgroup: Add a new helper for cgroup1 hierarchy
  cgroup: Add annotation for holding namespace_sem in current_cgns_cgroup_from_root()
  cgroup: Eliminate the need for cgroup_mutex in proc_cgroup_show()
  cgroup: Make operations on the cgroup root_list RCU safe
  cgroup: Remove unnecessary list_empty()
torvalds committed Jan 9, 2024
2 parents bfe8eb3 + a7fb042 commit 9f8413c
Showing 14 changed files with 708 additions and 283 deletions.
33 changes: 27 additions & 6 deletions Documentation/admin-guide/cgroup-v2.rst
@@ -1093,7 +1093,11 @@ All time durations are in microseconds.
A read-write single value file which exists on non-root
cgroups. The default is "100".

The weight in the range [1, 10000].
For non idle groups (cpu.idle = 0), the weight is in the
range [1, 10000].

If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1),
then the weight will show as a 0.

cpu.weight.nice
A read-write single value file which exists on non-root
@@ -1157,6 +1161,16 @@ All time durations are in microseconds.
values similar to the sched_setattr(2). This maximum utilization
value is used to clamp the task specific maximum utilization clamp.

cpu.idle
A read-write single value file which exists on non-root cgroups.
The default is 0.

This is the cgroup analog of the per-task SCHED_IDLE sched policy.
Setting this value to a 1 will make the scheduling policy of the
cgroup SCHED_IDLE. The threads inside the cgroup will retain their
own relative priorities, but the cgroup itself will be treated as
very low priority relative to its peers.



Memory
@@ -2316,6 +2330,13 @@ Cpuset Interface Files
treated to have an implicit value of "cpuset.cpus" in the
formation of local partition.

cpuset.cpus.isolated
A read-only and root cgroup only multiple values file.

This file shows the set of all isolated CPUs used in existing
isolated partitions. It will be empty if no isolated partition
is created.

cpuset.cpus.partition
A read-write single value file which exists on non-root
cpuset-enabled cgroups. This flag is owned by the parent cgroup
@@ -2358,11 +2379,11 @@ Cpuset Interface Files
partition or scheduling domain. The set of exclusive CPUs is
determined by the value of its "cpuset.cpus.exclusive.effective".

When set to "isolated", the CPUs in that partition will
be in an isolated state without any load balancing from the
scheduler. Tasks placed in such a partition with multiple
CPUs should be carefully distributed and bound to each of the
individual CPUs for optimal performance.
When set to "isolated", the CPUs in that partition will be in
an isolated state without any load balancing from the scheduler
and excluded from the unbound workqueues. Tasks placed in such
a partition with multiple CPUs should be carefully distributed
and bound to each of the individual CPUs for optimal performance.

A partition root ("root" or "isolated") can be in one of the
two possible states - valid or invalid. An invalid partition
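
As a hedged userspace sketch (not part of the commit), the interface
files documented above can be driven as follows. The cgroup path
/sys/fs/cgroup/rt and the CPU range are assumptions, and the cpuset
controller is assumed to be enabled for that child cgroup.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a string to a cgroup interface file, reporting errors by path. */
static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		if (fd >= 0)
			close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

int main(void)
{
	char buf[256];
	ssize_t n;
	int fd;

	/* Give the assumed child cgroup CPUs 2-3 and turn it into an
	 * isolated partition: no scheduler load balancing and, with this
	 * merge, no unbound workqueue items on those CPUs. */
	write_str("/sys/fs/cgroup/rt/cpuset.cpus", "2-3");
	write_str("/sys/fs/cgroup/rt/cpuset.cpus.partition", "isolated");

	/* The union of all isolated partitions is readable at the root. */
	fd = open("/sys/fs/cgroup/cpuset.cpus.isolated", O_RDONLY);
	if (fd >= 0 && (n = read(fd, buf, sizeof(buf) - 1)) > 0) {
		buf[n] = '\0';
		printf("isolated CPUs: %s", buf);
	}
	if (fd >= 0)
		close(fd);
	return 0;
}
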
21 changes: 18 additions & 3 deletions include/linux/cgroup-defs.h
@@ -496,6 +496,20 @@ struct cgroup {
struct cgroup_rstat_cpu __percpu *rstat_cpu;
struct list_head rstat_css_list;

/*
* Add padding to separate the read mostly rstat_cpu and
* rstat_css_list into a different cacheline from the following
* rstat_flush_next and *bstat fields which can have frequent updates.
*/
CACHELINE_PADDING(_pad_);

/*
* A singly-linked list of cgroup structures to be rstat flushed.
* This is a scratch field to be used exclusively by
* cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
*/
struct cgroup *rstat_flush_next;

/* cgroup basic resource statistics */
struct cgroup_base_stat last_bstat;
struct cgroup_base_stat bstat;
@@ -548,6 +562,10 @@ struct cgroup_root {
/* Unique id for this hierarchy. */
int hierarchy_id;

/* A list running through the active hierarchies */
struct list_head root_list;
struct rcu_head rcu; /* Must be near the top */

/*
* The root cgroup. The containing cgroup_root will be destroyed on its
* release. cgrp->ancestors[0] will be used overflowing into the
@@ -561,9 +579,6 @@
/* Number of cgroups in the hierarchy, used only for /proc/cgroups */
atomic_t nr_cgrps;

/* A list running through the active hierarchies */
struct list_head root_list;

/* Hierarchy-specific flags */
unsigned int flags;

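The CACHELINE_PADDING() hunk above separates read-mostly fields from
frequently written ones so that rstat flushing does not keep
invalidating the cache line holding the read-mostly pointers. Below is
a minimal sketch of the same layout idea, with made-up struct and field
names (CACHELINE_PADDING() itself comes from <linux/cache.h>):

#include <linux/cache.h>
#include <linux/list.h>
#include <linux/types.h>

struct stats_node {
	/* Read-mostly: set at init, dereferenced on every update path. */
	void __percpu		*pcpu_state;
	struct list_head	css_list;

	/*
	 * Empty, cacheline-aligned member: it pushes the hot fields below
	 * onto a different cache line, so frequent writes to them do not
	 * invalidate the line holding the read-mostly fields above.
	 */
	CACHELINE_PADDING(_pad_);

	/* Frequently written while flushing statistics. */
	struct stats_node	*flush_next;
	u64			aggregated;
};
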
4 changes: 3 additions & 1 deletion include/linux/cgroup.h
@@ -69,6 +69,7 @@ struct css_task_iter {
extern struct file_system_type cgroup_fs_type;
extern struct cgroup_root cgrp_dfl_root;
extern struct css_set init_css_set;
extern spinlock_t css_set_lock;

#define SUBSYS(_x) extern struct cgroup_subsys _x ## _cgrp_subsys;
#include <linux/cgroup_subsys.h>
@@ -386,7 +387,6 @@ static inline void cgroup_unlock(void)
* as locks used during the cgroup_subsys::attach() methods.
*/
#ifdef CONFIG_PROVE_RCU
extern spinlock_t css_set_lock;
#define task_css_set_check(task, __c) \
rcu_dereference_check((task)->cgroups, \
rcu_read_lock_sched_held() || \
@@ -853,4 +853,6 @@ static inline void cgroup_bpf_put(struct cgroup *cgrp) {}

#endif /* CONFIG_CGROUP_BPF */

struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id);

#endif /* _LINUX_CGROUP_H */
6 changes: 6 additions & 0 deletions include/linux/cpuset.h
@@ -77,6 +77,7 @@ extern void cpuset_lock(void);
extern void cpuset_unlock(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
extern bool cpuset_cpu_is_isolated(int cpu);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -207,6 +208,11 @@ static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
return false;
}

static inline bool cpuset_cpu_is_isolated(int cpu)
{
return false;
}

static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
{
return node_possible_map;
4 changes: 3 additions & 1 deletion include/linux/sched/isolation.h
@@ -2,6 +2,7 @@
#define _LINUX_SCHED_ISOLATION_H

#include <linux/cpumask.h>
#include <linux/cpuset.h>
#include <linux/init.h>
#include <linux/tick.h>

@@ -67,7 +68,8 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
static inline bool cpu_is_isolated(int cpu)
{
return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
!housekeeping_test_cpu(cpu, HK_TYPE_TICK);
!housekeeping_test_cpu(cpu, HK_TYPE_TICK) ||
cpuset_cpu_is_isolated(cpu);
}

#endif /* _LINUX_SCHED_ISOLATION_H */
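
With cpuset-isolated CPUs now covered by cpu_is_isolated(), periodic
per-CPU housekeeping can skip them. A hedged sketch of that pattern
follows; the flusher itself is hypothetical, while cpu_is_isolated()
and schedule_delayed_work_on() are existing interfaces.

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/sched/isolation.h>
#include <linux/workqueue.h>

/* Hypothetical periodic flusher: queue per-CPU work everywhere except on
 * CPUs isolated via boot parameters or a cpuset isolated partition. */
static void queue_periodic_flush(struct delayed_work __percpu *works,
				 unsigned long delay)
{
	int cpu;

	for_each_online_cpu(cpu) {
		if (cpu_is_isolated(cpu))
			continue;	/* leave isolated CPUs undisturbed */
		schedule_delayed_work_on(cpu, per_cpu_ptr(works, cpu), delay);
	}
}
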
2 changes: 1 addition & 1 deletion include/linux/workqueue.h
@@ -491,7 +491,7 @@ struct workqueue_attrs *alloc_workqueue_attrs(void);
void free_workqueue_attrs(struct workqueue_attrs *attrs);
int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs);
int workqueue_set_unbound_cpumask(cpumask_var_t cpumask);
extern int workqueue_unbound_exclude_cpumask(cpumask_var_t cpumask);

extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
struct work_struct *work);
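
Below is a hedged sketch of how a caller such as cpuset might use the
new workqueue_unbound_exclude_cpumask() API above to pull a freshly
isolated set of CPUs out of the unbound workqueue cpumask; the wrapper
function is hypothetical.

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/workqueue.h>

/* Hypothetical wrapper: hand the current set of isolated CPUs to the
 * workqueue code so unbound work items stop running on them. */
static int exclude_isolated_from_unbound_wq(const struct cpumask *isolated)
{
	cpumask_var_t excl;
	int ret;

	if (!alloc_cpumask_var(&excl, GFP_KERNEL))
		return -ENOMEM;

	cpumask_copy(excl, isolated);
	ret = workqueue_unbound_exclude_cpumask(excl);
	free_cpumask_var(excl);
	return ret;
}
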
4 changes: 2 additions & 2 deletions kernel/cgroup/cgroup-internal.h
@@ -164,13 +164,13 @@ struct cgroup_mgctx {
#define DEFINE_CGROUP_MGCTX(name) \
struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)

extern spinlock_t css_set_lock;
extern struct cgroup_subsys *cgroup_subsys[];
extern struct list_head cgroup_roots;

/* iterate across the hierarchies */
#define for_each_root(root) \
list_for_each_entry((root), &cgroup_roots, root_list)
list_for_each_entry_rcu((root), &cgroup_roots, root_list, \
lockdep_is_held(&cgroup_mutex))

/**
* for_each_subsys - iterate all enabled cgroup subsystems
34 changes: 34 additions & 0 deletions kernel/cgroup/cgroup-v1.c
@@ -1262,6 +1262,40 @@ int cgroup1_get_tree(struct fs_context *fc)
return ret;
}

/**
* task_get_cgroup1 - Acquires the associated cgroup of a task within a
* specific cgroup1 hierarchy. The cgroup1 hierarchy is identified by its
* hierarchy ID.
* @tsk: The target task
* @hierarchy_id: The ID of a cgroup1 hierarchy
*
* On success, the cgroup is returned. On failure, ERR_PTR is returned.
* We limit it to cgroup1 only.
*/
struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id)
{
struct cgroup *cgrp = ERR_PTR(-ENOENT);
struct cgroup_root *root;
unsigned long flags;

rcu_read_lock();
for_each_root(root) {
/* cgroup1 only*/
if (root == &cgrp_dfl_root)
continue;
if (root->hierarchy_id != hierarchy_id)
continue;
spin_lock_irqsave(&css_set_lock, flags);
cgrp = task_cgroup_from_root(tsk, root);
if (!cgrp || !cgroup_tryget(cgrp))
cgrp = ERR_PTR(-ENOENT);
spin_unlock_irqrestore(&css_set_lock, flags);
break;
}
rcu_read_unlock();
return cgrp;
}

static int __init cgroup1_wq_init(void)
{
/*
45 changes: 30 additions & 15 deletions kernel/cgroup/cgroup.c
@@ -1315,7 +1315,7 @@ static void cgroup_exit_root_id(struct cgroup_root *root)

void cgroup_free_root(struct cgroup_root *root)
{
kfree(root);
kfree_rcu(root, rcu);
}

static void cgroup_destroy_root(struct cgroup_root *root)
@@ -1347,10 +1347,9 @@ static void cgroup_destroy_root(struct cgroup_root *root)

spin_unlock_irq(&css_set_lock);

if (!list_empty(&root->root_list)) {
list_del(&root->root_list);
cgroup_root_count--;
}
WARN_ON_ONCE(list_empty(&root->root_list));
list_del_rcu(&root->root_list);
cgroup_root_count--;

if (!have_favordynmods)
cgroup_favor_dynmods(root, false);
@@ -1390,7 +1389,15 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
}
}

BUG_ON(!res_cgroup);
/*
* If cgroup_mutex is not held, the cgrp_cset_link will be freed
* before we remove the cgroup root from the root_list. Consequently,
* when accessing a cgroup root, the cset_link may have already been
* freed, resulting in a NULL res_cgroup. However, by holding the
* cgroup_mutex, we ensure that res_cgroup can't be NULL.
* If we don't hold cgroup_mutex in the caller, we must do the NULL
* check.
*/
return res_cgroup;
}

@@ -1413,6 +1420,11 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)

rcu_read_unlock();

/*
* The namespace_sem is held by current, so the root cgroup can't
* be umounted. Therefore, we can ensure that the res is non-NULL.
*/
WARN_ON_ONCE(!res);
return res;
}

@@ -1449,15 +1461,16 @@ static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
struct cgroup_root *root)
{
lockdep_assert_held(&cgroup_mutex);
lockdep_assert_held(&css_set_lock);

return __cset_cgroup_from_root(cset, root);
}

/*
* Return the cgroup for "task" from the given hierarchy. Must be
* called with cgroup_mutex and css_set_lock held.
* called with css_set_lock held to prevent task's groups from being modified.
* Must be called with either cgroup_mutex or rcu read lock to prevent the
* cgroup root from being destroyed.
*/
struct cgroup *task_cgroup_from_root(struct task_struct *task,
struct cgroup_root *root)
@@ -2032,7 +2045,7 @@ void init_cgroup_root(struct cgroup_fs_context *ctx)
struct cgroup_root *root = ctx->root;
struct cgroup *cgrp = &root->cgrp;

INIT_LIST_HEAD(&root->root_list);
INIT_LIST_HEAD_RCU(&root->root_list);
atomic_set(&root->nr_cgrps, 1);
cgrp->root = root;
init_cgroup_housekeeping(cgrp);
@@ -2115,7 +2128,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
* care of subsystems' refcounts, which are explicitly dropped in
* the failure exit path.
*/
list_add(&root->root_list, &cgroup_roots);
list_add_rcu(&root->root_list, &cgroup_roots);
cgroup_root_count++;

/*
@@ -6265,7 +6278,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
if (!buf)
goto out;

cgroup_lock();
rcu_read_lock();
spin_lock_irq(&css_set_lock);

for_each_root(root) {
@@ -6276,6 +6289,11 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
if (root == &cgrp_dfl_root && !READ_ONCE(cgrp_dfl_visible))
continue;

cgrp = task_cgroup_from_root(tsk, root);
/* The root has already been unmounted. */
if (!cgrp)
continue;

seq_printf(m, "%d:", root->hierarchy_id);
if (root != &cgrp_dfl_root)
for_each_subsys(ss, ssid)
@@ -6286,9 +6304,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
seq_printf(m, "%sname=%s", count ? "," : "",
root->name);
seq_putc(m, ':');

cgrp = task_cgroup_from_root(tsk, root);

/*
* On traditional hierarchies, all zombie tasks show up as
* belonging to the root cgroup. On the default hierarchy,
@@ -6320,7 +6335,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
retval = 0;
out_unlock:
spin_unlock_irq(&css_set_lock);
cgroup_unlock();
rcu_read_unlock();
kfree(buf);
out:
return retval;