Skip to content

Commit

Permalink
Merge tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/lin…
Browse files Browse the repository at this point in the history
…ux/kernel/git/tip/tip

Pull scheduler udpates from Ingo Molnar:

 - Changes to core scheduling facilities:

    - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
      coordinated scheduling across SMT siblings. This is a much
      requested feature for cloud computing platforms, to allow the
      flexible utilization of SMT siblings, without exposing untrusted
      domains to information leaks & side channels, plus to ensure more
      deterministic computing performance on SMT systems used by
      heterogenous workloads.

      There are new prctls to set core scheduling groups, which allows
      more flexible management of workloads that can share siblings.

    - Fix task->state access anti-patterns that may result in missed
      wakeups and rename it to ->__state in the process to catch new
      abuses.

 - Load-balancing changes:

    - Tweak newidle_balance for fair-sched, to improve 'memcache'-like
      workloads.

    - "Age" (decay) average idle time, to better track & improve
      workloads such as 'tbench'.

    - Fix & improve energy-aware (EAS) balancing logic & metrics.

    - Fix & improve the uclamp metrics.

    - Fix task migration (taskset) corner case on !CONFIG_CPUSET.

    - Fix RT and deadline utilization tracking across policy changes

    - Introduce a "burstable" CFS controller via cgroups, which allows
      bursty CPU-bound workloads to borrow a bit against their future
      quota to improve overall latencies & batching. Can be tweaked via
      /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.

    - Rework assymetric topology/capacity detection & handling.

 - Scheduler statistics & tooling:

    - Disable delayacct by default, but add a sysctl to enable it at
      runtime if tooling needs it. Use static keys and other
      optimizations to make it more palatable.

    - Use sched_clock() in delayacct, instead of ktime_get_ns().

 - Misc cleanups and fixes.

* tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  sched/doc: Update the CPU capacity asymmetry bits
  sched/topology: Rework CPU capacity asymmetry detection
  sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
  psi: Fix race between psi_trigger_create/destroy
  sched/fair: Introduce the burstable CFS controller
  sched/uclamp: Fix uclamp_tg_restrict()
  sched/rt: Fix Deadline utilization tracking during policy change
  sched/rt: Fix RT utilization tracking during policy change
  sched: Change task_struct::state
  sched,arch: Remove unused TASK_STATE offsets
  sched,timer: Use __set_current_state()
  sched: Add get_current_state()
  sched,perf,kvm: Fix preemption condition
  sched: Introduce task_is_running()
  sched: Unbreak wakeups
  sched/fair: Age the average idle time
  sched/cpufreq: Consider reduced CPU capacity in energy calculation
  sched/fair: Take thermal pressure into account while estimating energy
  thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
  sched/fair: Return early from update_tg_cfs_load() if delta == 0
  ...
  • Loading branch information
torvalds committed Jun 28, 2021
2 parents 28a27cb + adf3c31 commit 54a728d
Show file tree
Hide file tree
Showing 139 changed files with 3,122 additions and 781 deletions.
12 changes: 7 additions & 5 deletions Documentation/accounting/delay-accounting.rst
Original file line number Diff line number Diff line change
Expand Up @@ -69,13 +69,15 @@ Compile the kernel with::
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASKSTATS=y

Delay accounting is enabled by default at boot up.
To disable, add::
Delay accounting is disabled by default at boot up.
To enable, add::

nodelayacct
delayacct

to the kernel boot options. The rest of the instructions
below assume this has not been done.
to the kernel boot options. The rest of the instructions below assume this has
been done. Alternatively, use sysctl kernel.task_delayacct to switch the state
at runtime. Note however that only tasks started after enabling it will have
delayacct information.

After the system has booted up, use a utility
similar to getdelays.c to access the delays
Expand Down
223 changes: 223 additions & 0 deletions Documentation/admin-guide/hw-vuln/core-scheduling.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,223 @@
.. SPDX-License-Identifier: GPL-2.0
===============
Core Scheduling
===============
Core scheduling support allows userspace to define groups of tasks that can
share a core. These groups can be specified either for security usecases (one
group of tasks don't trust another), or for performance usecases (some
workloads may benefit from running on the same core as they don't need the same
hardware resources of the shared core, or may prefer different cores if they
do share hardware resource needs). This document only describes the security
usecase.

Security usecase
----------------
A cross-HT attack involves the attacker and victim running on different Hyper
Threads of the same core. MDS and L1TF are examples of such attacks. The only
full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core
scheduling is a scheduler feature that can mitigate some (not all) cross-HT
attacks. It allows HT to be turned on safely by ensuring that only tasks in a
user-designated trusted group can share a core. This increase in core sharing
can also improve performance, however it is not guaranteed that performance
will always improve, though that is seen to be the case with a number of real
world workloads. In theory, core scheduling aims to perform at least as good as
when Hyper Threading is disabled. In practice, this is mostly the case though
not always: as synchronizing scheduling decisions across 2 or more CPUs in a
core involves additional overhead - especially when the system is lightly
loaded. When ``total_threads <= N_CPUS/2``, the extra overhead may cause core
scheduling to perform more poorly compared to SMT-disabled, where N_CPUS is the
total number of CPUs. Please measure the performance of your workloads always.

Usage
-----
Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
Using this feature, userspace defines groups of tasks that can be co-scheduled
on the same core. The core scheduler uses this information to make sure that
tasks that are not in the same group never run simultaneously on a core, while
doing its best to satisfy the system's scheduling requirements.

Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface.
This interface provides support for the creation of core scheduling groups, as
well as admission and removal of tasks from created groups::

#include <sys/prctl.h>

int prctl(int option, unsigned long arg2, unsigned long arg3,
unsigned long arg4, unsigned long arg5);

option:
``PR_SCHED_CORE``

arg2:
Command for operation, must be one off:

- ``PR_SCHED_CORE_GET`` -- get core_sched cookie of ``pid``.
- ``PR_SCHED_CORE_CREATE`` -- create a new unique cookie for ``pid``.
- ``PR_SCHED_CORE_SHARE_TO`` -- push core_sched cookie to ``pid``.
- ``PR_SCHED_CORE_SHARE_FROM`` -- pull core_sched cookie from ``pid``.

arg3:
``pid`` of the task for which the operation applies.

arg4:
``pid_type`` for which the operation applies. It is of type ``enum pid_type``.
For example, if arg4 is ``PIDTYPE_TGID``, then the operation of this command
will be performed for all tasks in the task group of ``pid``.

arg5:
userspace pointer to an unsigned long for storing the cookie returned by
``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.

In order for a process to push a cookie to, or pull a cookie from a process, it
is required to have the ptrace access mode: `PTRACE_MODE_READ_REALCREDS` to the
process.

Building hierarchies of tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The simplest way to build hierarchies of threads/processes which share a
cookie and thus a core is to rely on the fact that the core-sched cookie is
inherited across forks/clones and execs, thus setting a cookie for the
'initial' script/executable/daemon will place every spawned child in the
same core-sched group.

Cookie Transferral
~~~~~~~~~~~~~~~~~~
Transferring a cookie between the current and other tasks is possible using
PR_SCHED_CORE_SHARE_FROM and PR_SCHED_CORE_SHARE_TO to inherit a cookie from a
specified task or a share a cookie with a task. In combination this allows a
simple helper program to pull a cookie from a task in an existing core
scheduling group and share it with already running tasks.

Design/Implementation
---------------------
Each task that is tagged is assigned a cookie internally in the kernel. As
mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
each other and share a core.

The basic idea is that, every schedule event tries to select tasks for all the
siblings of a core such that all the selected tasks running on a core are
trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
The idle task is considered special, as it trusts everything and everything
trusts it.

During a schedule() event on any sibling of a core, the highest priority task on
the sibling's core is picked and assigned to the sibling calling schedule(), if
the sibling has the task enqueued. For rest of the siblings in the core,
highest priority task with the same cookie is selected if there is one runnable
in their individual run queues. If a task with same cookie is not available,
the idle task is selected. Idle task is globally trusted.

Once a task has been selected for all the siblings in the core, an IPI is sent to
siblings for whom a new task was selected. Siblings on receiving the IPI will
switch to the new task immediately. If an idle task is selected for a sibling,
then the sibling is considered to be in a `forced idle` state. I.e., it may
have tasks on its on runqueue to run, however it will still have to run idle.
More on this in the next section.

Forced-idling of hyperthreads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The scheduler tries its best to find tasks that trust each other such that all
tasks selected to be scheduled are of the highest priority in a core. However,
it is possible that some runqueues had tasks that were incompatible with the
highest priority ones in the core. Favoring security over fairness, one or more
siblings could be forced to select a lower priority task if the highest
priority task is not trusted with respect to the core wide highest priority
task. If a sibling does not have a trusted task to run, it will be forced idle
by the scheduler (idle thread is scheduled to run).

When the highest priority task is selected to run, a reschedule-IPI is sent to
the sibling to force it into idle. This results in 4 cases which need to be
considered depending on whether a VM or a regular usermode process was running
on either HT::

HT1 (attack) HT2 (victim)
A idle -> user space user space -> idle
B idle -> user space guest -> idle
C idle -> guest user space -> idle
D idle -> guest guest -> idle

Note that for better performance, we do not wait for the destination CPU
(victim) to enter idle mode. This is because the sending of the IPI would bring
the destination CPU immediately into kernel mode from user space, or VMEXIT
in the case of guests. At best, this would only leak some scheduler metadata
which may not be worth protecting. It is also possible that the IPI is received
too late on some architectures, but this has not been observed in the case of
x86.

Trust model
~~~~~~~~~~~
Core scheduling maintains trust relationships amongst groups of tasks by
assigning them a tag that is the same cookie value.
When a system with core scheduling boots, all tasks are considered to trust
each other. This is because the core scheduler does not have information about
trust relationships until userspace uses the above mentioned interfaces, to
communicate them. In other words, all tasks have a default cookie value of 0.
and are considered system-wide trusted. The forced-idling of siblings running
cookie-0 tasks is also avoided.

Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
within such groups are considered to trust each other, but do not trust those
outside. Tasks outside the group also don't trust tasks within.

Limitations of core-scheduling
------------------------------
Core scheduling tries to guarantee that only trusted tasks run concurrently on a
core. But there could be small window of time during which untrusted tasks run
concurrently or kernel could be running concurrently with a task not trusted by
kernel.

IPI processing delays
~~~~~~~~~~~~~~~~~~~~~
Core scheduling selects only trusted tasks to run together. IPI is used to notify
the siblings to switch to the new task. But there could be hardware delays in
receiving of the IPI on some arch (on x86, this has not been observed). This may
cause an attacker task to start running on a CPU before its siblings receive the
IPI. Even though cache is flushed on entry to user mode, victim tasks on siblings
may populate data in the cache and micro architectural buffers after the attacker
starts to run and this is a possibility for data leak.

Open cross-HT issues that core scheduling does not solve
--------------------------------------------------------
1. For MDS
~~~~~~~~~~
Core scheduling cannot protect against MDS attacks between an HT running in
user mode and another running in kernel mode. Even though both HTs run tasks
which trust each other, kernel memory is still considered untrusted. Such
attacks are possible for any combination of sibling CPU modes (host or guest mode).

2. For L1TF
~~~~~~~~~~~
Core scheduling cannot protect against an L1TF guest attacker exploiting a
guest or host victim. This is because the guest attacker can craft invalid
PTEs which are not inverted due to a vulnerable guest kernel. The only
solution is to disable EPT (Extended Page Tables).

For both MDS and L1TF, if the guest vCPU is configured to not trust each
other (by tagging separately), then the guest to guest attacks would go away.
Or it could be a system admin policy which considers guest to guest attacks as
a guest problem.

Another approach to resolve these would be to make every untrusted task on the
system to not trust every other untrusted task. While this could reduce
parallelism of the untrusted tasks, it would still solve the above issues while
allowing system processes (trusted tasks) to share a core.

3. Protecting the kernel (IRQ, syscall, VMEXIT)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unfortunately, core scheduling does not protect kernel contexts running on
sibling hyperthreads from one another. Prototypes of mitigations have been posted
to LKML to solve this, but it is debatable whether such windows are practically
exploitable, and whether the performance overhead of the prototypes are worth
it (not to mention, the added code complexity).

Other Use cases
---------------
The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
with SMT enabled. There are other use cases where this feature could be used:

- Isolating tasks that needs a whole core: Examples include realtime tasks, tasks
that uses SIMD instructions etc.
- Gang scheduling: Requirements for a group of tasks that needs to be scheduled
together could also be realized using core scheduling. One example is vCPUs of
a VM.
1 change: 1 addition & 0 deletions Documentation/admin-guide/hw-vuln/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -15,3 +15,4 @@ are configurable at compile, boot or run time.
tsx_async_abort
multihit.rst
special-register-buffer-data-sampling.rst
core-scheduling.rst
2 changes: 1 addition & 1 deletion Documentation/admin-guide/kernel-parameters.txt
Original file line number Diff line number Diff line change
Expand Up @@ -3244,7 +3244,7 @@

noclflush [BUGS=X86] Don't use the CLFLUSH instruction

nodelayacct [KNL] Disable per-task delay accounting
delayacct [KNL] Enable per-task delay accounting

nodsp [SH] Disable hardware DSP at boot time.

Expand Down
7 changes: 7 additions & 0 deletions Documentation/admin-guide/sysctl/kernel.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1088,6 +1088,13 @@ Model available). If your platform happens to meet the
requirements for EAS but you do not want to use it, change
this value to 0.

task_delayacct
===============

Enables/disables task delay accounting (see
:doc:`accounting/delay-accounting.rst`). Enabling this feature incurs
a small amount of overhead in the scheduler but is useful for debugging
and performance tuning. It is required by some tools such as iotop.

sched_schedstats
================
Expand Down
6 changes: 4 additions & 2 deletions Documentation/scheduler/sched-capacity.rst
Original file line number Diff line number Diff line change
Expand Up @@ -284,8 +284,10 @@ whether the system exhibits asymmetric CPU capacities. Should that be the
case:

- The sched_asym_cpucapacity static key will be enabled.
- The SD_ASYM_CPUCAPACITY flag will be set at the lowest sched_domain level that
spans all unique CPU capacity values.
- The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
level that spans all unique CPU capacity values.
- The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
CPUs with any range of asymmetry.

The sched_asym_cpucapacity static key is intended to guard sections of code that
cater to asymmetric CPU capacity systems. Do note however that said key is
Expand Down
2 changes: 1 addition & 1 deletion Documentation/scheduler/sched-energy.rst
Original file line number Diff line number Diff line change
Expand Up @@ -328,7 +328,7 @@ section lists these dependencies and provides hints as to how they can be met.

As mentioned in the introduction, EAS is only supported on platforms with
asymmetric CPU topologies for now. This requirement is checked at run-time by
looking for the presence of the SD_ASYM_CPUCAPACITY flag when the scheduling
looking for the presence of the SD_ASYM_CPUCAPACITY_FULL flag when the scheduling
domains are built.

See Documentation/scheduler/sched-capacity.rst for requirements to be met for this
Expand Down
2 changes: 1 addition & 1 deletion arch/alpha/kernel/process.c
Original file line number Diff line number Diff line change
Expand Up @@ -380,7 +380,7 @@ get_wchan(struct task_struct *p)
{
unsigned long schedule_frame;
unsigned long pc;
if (!p || p == current || p->state == TASK_RUNNING)
if (!p || p == current || task_is_running(p))
return 0;
/*
* This one depends on the frame size of schedule(). Do a
Expand Down
1 change: 0 additions & 1 deletion arch/alpha/kernel/smp.c
Original file line number Diff line number Diff line change
Expand Up @@ -166,7 +166,6 @@ smp_callin(void)
DBGS(("smp_callin: commencing CPU %d current %p active_mm %p\n",
cpuid, current, current->active_mm));

preempt_disable();
cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
}

Expand Down
1 change: 0 additions & 1 deletion arch/arc/kernel/smp.c
Original file line number Diff line number Diff line change
Expand Up @@ -189,7 +189,6 @@ void start_kernel_secondary(void)
pr_info("## CPU%u LIVE ##: Executing Code...\n", cpu);

local_irq_enable();
preempt_disable();
cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
}

Expand Down
2 changes: 1 addition & 1 deletion arch/arc/kernel/stacktrace.c
Original file line number Diff line number Diff line change
Expand Up @@ -83,7 +83,7 @@ seed_unwind_frame_info(struct task_struct *tsk, struct pt_regs *regs,
* is safe-kept and BLINK at a well known location in there
*/

if (tsk->state == TASK_RUNNING)
if (task_is_running(tsk))
return -1;

frame_info->task = tsk;
Expand Down
2 changes: 1 addition & 1 deletion arch/arm/kernel/process.c
Original file line number Diff line number Diff line change
Expand Up @@ -288,7 +288,7 @@ unsigned long get_wchan(struct task_struct *p)
struct stackframe frame;
unsigned long stack_page;
int count = 0;
if (!p || p == current || p->state == TASK_RUNNING)
if (!p || p == current || task_is_running(p))
return 0;

frame.fp = thread_saved_fp(p);
Expand Down
1 change: 0 additions & 1 deletion arch/arm/kernel/smp.c
Original file line number Diff line number Diff line change
Expand Up @@ -432,7 +432,6 @@ asmlinkage void secondary_start_kernel(void)
#endif
pr_debug("CPU%u: Booted secondary processor\n", cpu);

preempt_disable();
trace_hardirqs_off();

/*
Expand Down
2 changes: 1 addition & 1 deletion arch/arm64/include/asm/preempt.h
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ static inline void preempt_count_set(u64 pc)
} while (0)

#define init_idle_preempt_count(p, cpu) do { \
task_thread_info(p)->preempt_count = PREEMPT_ENABLED; \
task_thread_info(p)->preempt_count = PREEMPT_DISABLED; \
} while (0)

static inline void set_preempt_need_resched(void)
Expand Down
2 changes: 1 addition & 1 deletion arch/arm64/kernel/process.c
Original file line number Diff line number Diff line change
Expand Up @@ -598,7 +598,7 @@ unsigned long get_wchan(struct task_struct *p)
struct stackframe frame;
unsigned long stack_page, ret = 0;
int count = 0;
if (!p || p == current || p->state == TASK_RUNNING)
if (!p || p == current || task_is_running(p))
return 0;

stack_page = (unsigned long)try_get_task_stack(p);
Expand Down
1 change: 0 additions & 1 deletion arch/arm64/kernel/smp.c
Original file line number Diff line number Diff line change
Expand Up @@ -224,7 +224,6 @@ asmlinkage notrace void secondary_start_kernel(void)
init_gic_priority_masking();

rcu_cpu_starting(cpu);
preempt_disable();
trace_hardirqs_off();

/*
Expand Down
5 changes: 1 addition & 4 deletions arch/arm64/kvm/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -20,8 +20,6 @@ if VIRTUALIZATION
menuconfig KVM
bool "Kernel-based Virtual Machine (KVM) support"
depends on OF
# for TASKSTATS/TASK_DELAY_ACCT:
depends on NET && MULTIUSER
select MMU_NOTIFIER
select PREEMPT_NOTIFIERS
select HAVE_KVM_CPU_RELAX_INTERCEPT
Expand All @@ -38,8 +36,7 @@ menuconfig KVM
select IRQ_BYPASS_MANAGER
select HAVE_KVM_IRQ_BYPASS
select HAVE_KVM_VCPU_RUN_PID_CHANGE
select TASKSTATS
select TASK_DELAY_ACCT
select SCHED_INFO
help
Support hosting virtualized guest machines.

Expand Down
1 change: 0 additions & 1 deletion arch/csky/kernel/asm-offsets.c
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,6 @@
int main(void)
{
/* offsets into the task struct */
DEFINE(TASK_STATE, offsetof(struct task_struct, state));
DEFINE(TASK_THREAD_INFO, offsetof(struct task_struct, stack));
DEFINE(TASK_FLAGS, offsetof(struct task_struct, flags));
DEFINE(TASK_PTRACE, offsetof(struct task_struct, ptrace));
Expand Down
Loading

0 comments on commit 54a728d

Please sign in to comment.