Merge tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Fair scheduler (SCHED_OTHER) improvements:
   - Remove the old and now unused SIS_PROP code & option
   - Scan cluster before LLC in the wake-up path
   - Use candidate prev/recent_used CPU if scanning failed for cluster
     wakeup

  NUMA scheduling improvements:
   - Improve the VMA access-PID code to better skip/scan VMAs
   - Extend tracing to cover VMA-skipping decisions
   - Improve/fix the recently introduced sched_numa_find_nth_cpu() code
   - Generalize numa_map_to_online_node()

  Energy scheduling improvements:
   - Remove the EM_MAX_COMPLEXITY limit
   - Add tracepoints to track energy computation
   - Make the behavior of the 'sched_energy_aware' sysctl more
     consistent
   - Consolidate and clean up access to a CPU's max compute capacity
   - Fix uclamp code corner cases

  RT scheduling improvements:
   - Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
   - Drive the ->rto_mask with rt_rq->pushable_tasks updates

  Scheduler scalability improvements:
   - Rate-limit updates to tg->load_avg
   - On x86 disable IBRS when CPU is offline to improve single-threaded
     performance
   - Micro-optimize in_task() and in_interrupt()
   - Micro-optimize the PSI code
   - Avoid updating PSI triggers and ->rtpoll_total when there are no
     state changes

  Core scheduler infrastructure improvements:
   - Use saved_state to reduce some spurious freezer wakeups
   - Bring in a handful of fast-headers improvements to scheduler
     headers
   - Make the scheduler UAPI headers more widely usable by user-space
   - Simplify the control flow of scheduler syscalls by using lock
     guards
   - Fix sched_setaffinity() vs. CPU hotplug race

  Scheduler debuggability improvements:
   - Disallow writing invalid values to sched_rt_period_us
   - Fix a race in the rq-clock debugging code triggering warnings
   - Fix a warning in the bandwidth distribution code
   - Micro-optimize in_atomic_preempt_off() checks
   - Enforce that the tasklist_lock is held in for_each_thread()
   - Print the TGID in sched_show_task()
   - Remove the /proc/sys/kernel/sched_child_runs_first sysctl

  ... and misc cleanups & fixes"

* tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
  sched/fair: Remove SIS_PROP
  sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
  sched/fair: Scan cluster before scanning LLC in wake-up path
  sched: Add cpus_share_resources API
  sched/core: Fix RQCF_ACT_SKIP leak
  sched/fair: Remove unused 'curr' argument from pick_next_entity()
  sched/nohz: Update comments about NEWILB_KICK
  sched/fair: Remove duplicate #include
  sched/psi: Update poll => rtpoll in relevant comments
  sched: Make PELT acronym definition searchable
  sched: Fix stop_one_cpu_nowait() vs hotplug
  sched/psi: Bail out early from irq time accounting
  sched/topology: Rename 'DIE' domain to 'PKG'
  sched/psi: Delete the 'update_total' function parameter from update_triggers()
  sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
  sched/headers: Remove comment referring to rq::cpu_load, since this has been removed
  sched/numa: Complete scanning of inactive VMAs when there is no alternative
  sched/numa: Complete scanning of partial VMAs regardless of PID activity
  sched/numa: Move up the access pid reset logic
  sched/numa: Trace decisions related to skipping VMAs
  ...
torvalds committed Oct 30, 2023
2 parents 3cf3fab + 984ffb6 commit 63ce50f
Showing 45 changed files with 1,013 additions and 965 deletions.
17 changes: 16 additions & 1 deletion Documentation/admin-guide/pm/intel_idle.rst
@@ -170,7 +170,7 @@ and ``idle=nomwait``. If any of them is present in the kernel command line, the
``MWAIT`` instruction is not allowed to be used, so the initialization of
``intel_idle`` will fail.

-Apart from that there are four module parameters recognized by ``intel_idle``
+Apart from that there are five module parameters recognized by ``intel_idle``
itself that can be set via the kernel command line (they cannot be updated via
sysfs, so that is the only way to change their values).

@@ -216,6 +216,21 @@ are ignored).
The idle states disabled this way can be enabled (on a per-CPU basis) from user
space via ``sysfs``.

+The ``ibrs_off`` module parameter is a boolean flag (defaults to
+false). If set, it is used to control if IBRS (Indirect Branch Restricted
+Speculation) should be turned off when the CPU enters an idle state.
+This flag does not affect CPUs that use Enhanced IBRS which can remain
+on with little performance impact.
+
+For some CPUs, IBRS will be selected as mitigation for Spectre v2 and Retbleed
+security vulnerabilities by default. Leaving the IBRS mode on while idling may
+have a performance impact on its sibling CPU. The IBRS mode will be turned off
+by default when the CPU enters into a deep idle state, but not in some
+shallower ones. Setting the ``ibrs_off`` module parameter will force the IBRS
+mode to off when the CPU is in any one of the available idle states. This may
+help performance of a sibling CPU at the expense of a slightly higher wakeup
+latency for the idle CPU.


.. _intel-idle-core-and-package-idle-states:

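Usage note (not part of the patch): because the parameter's mode is 0444 it cannot be flipped via sysfs at runtime, so the only way to enable it is on the kernel command line, e.g. booting with intel_idle.ibrs_off=1. That forces IBRS off in every idle state on systems where the kernel selected IBRS (X86_FEATURE_KERNEL_IBRS) as the Spectre v2/Retbleed mitigation.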
3 changes: 2 additions & 1 deletion Documentation/admin-guide/sysctl/kernel.rst
@@ -1182,7 +1182,8 @@ automatically on platforms where it can run (that is,
platforms with asymmetric CPU topologies and having an Energy
Model available). If your platform happens to meet the
requirements for EAS but you do not want to use it, change
-this value to 0.
+this value to 0. On Non-EAS platforms, write operation fails and
+read doesn't return anything.

task_delayacct
===============
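Usage note (not part of the patch): on an EAS-capable system ``/proc/sys/kernel/sched_energy_aware`` still reads back 0 or 1 and accepts writes; on a platform that cannot run EAS, a read now returns nothing and a write fails, rather than accepting a toggle that can never take effect.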
13 changes: 7 additions & 6 deletions Documentation/scheduler/sched-capacity.rst
@@ -39,14 +39,15 @@ per Hz, leading to::
-------------------

Two different capacity values are used within the scheduler. A CPU's
-``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
-attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
-which some loss of available performance (e.g. time spent handling IRQs) is
-subtracted.
+``original capacity`` is its maximum attainable capacity, i.e. its maximum
+attainable performance level. This original capacity is returned by
+the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
+capacity`` to which some loss of available performance (e.g. time spent
+handling IRQs) is subtracted.

Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
-while ``capacity_orig`` is class-agnostic. The rest of this document will use
-the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
+while ``original capacity`` is class-agnostic. The rest of this document will use
+the term ``capacity`` interchangeably with ``original capacity`` for the sake of
brevity.

1.3 Platform examples
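To make the capacity vs. original-capacity distinction concrete, here is a minimal userspace sketch (illustrative only: effective_capacity() is a made-up name, and the kernel derives the "lost" term from PELT-tracked IRQ/RT/DL time rather than a constant):

    #include <stdio.h>

    /* SCHED_CAPACITY_SCALE (1024) is the kernel's fixed-point unit for a
     * CPU running at its maximum attainable performance level. */
    #define SCHED_CAPACITY_SCALE 1024UL

    /* capacity = original capacity minus performance lost to work the
     * CFS class cannot use (e.g. time spent handling IRQs). */
    static unsigned long effective_capacity(unsigned long orig, unsigned long lost)
    {
        return lost < orig ? orig - lost : 0;
    }

    int main(void)
    {
        /* A CPU at full original capacity losing ~10% to IRQ handling. */
        printf("capacity = %lu\n",
               effective_capacity(SCHED_CAPACITY_SCALE, 102));
        return 0;	/* prints: capacity = 922 */
    }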
29 changes: 3 additions & 26 deletions Documentation/scheduler/sched-energy.rst
@@ -359,32 +359,9 @@ in milli-Watts or in an 'abstract scale'.
6.3 - Energy Model complexity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-The task wake-up path is very latency-sensitive. When the EM of a platform is
-too complex (too many CPUs, too many performance domains, too many performance
-states, ...), the cost of using it in the wake-up path can become prohibitive.
-The energy-aware wake-up algorithm has a complexity of:
-
-	C = Nd * (Nc + Ns)
-
-with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
-total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).
-
-A complexity check is performed at the root domain level, when scheduling
-domains are built. EAS will not start on a root domain if its C happens to be
-higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
-time of writing).
-
-If you really want to use EAS but the complexity of your platform's Energy
-Model is too high to be used with a single root domain, you're left with only
-two possible options:
-
-  1. split your system into separate, smaller, root domains using exclusive
-     cpusets and enable EAS locally on each of them. This option has the
-     benefit to work out of the box but the drawback of preventing load
-     balance between root domains, which can result in an unbalanced system
-     overall;
-  2. submit patches to reduce the complexity of the EAS wake-up algorithm,
-     hence enabling it to cope with larger EMs in reasonable time.
+EAS does not impose any complexity limit on the number of PDs/OPPs/CPUs but
+restricts the number of CPUs to EM_MAX_NUM_CPUS to prevent overflows during
+the energy estimation.


6.4 - Schedutil governor
40 changes: 21 additions & 19 deletions Documentation/scheduler/sched-rt-group.rst
@@ -39,25 +39,25 @@ Most notable:
1.1 The problem
---------------

-Realtime scheduling is all about determinism, a group has to be able to rely on
+Real-time scheduling is all about determinism, a group has to be able to rely on
the amount of bandwidth (eg. CPU time) being constant. In order to schedule
-multiple groups of realtime tasks, each group must be assigned a fixed portion
-of the CPU time available. Without a minimum guarantee a realtime group can
+multiple groups of real-time tasks, each group must be assigned a fixed portion
+of the CPU time available. Without a minimum guarantee a real-time group can
obviously fall short. A fuzzy upper limit is of no use since it cannot be
relied upon. Which leaves us with just the single fixed portion.

1.2 The solution
----------------

CPU time is divided by means of specifying how much time can be spent running
-in a given period. We allocate this "run time" for each realtime group which
-the other realtime groups will not be permitted to use.
+in a given period. We allocate this "run time" for each real-time group which
+the other real-time groups will not be permitted to use.

-Any time not allocated to a realtime group will be used to run normal priority
+Any time not allocated to a real-time group will be used to run normal priority
tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
SCHED_OTHER.

-Let's consider an example: a frame fixed realtime renderer must deliver 25
+Let's consider an example: a frame fixed real-time renderer must deliver 25
frames a second, which yields a period of 0.04s per frame. Now say it will also
have to play some music and respond to input, leaving it with around 80% CPU
time dedicated for the graphics. We can then give this group a run time of 0.8
@@ -70,7 +70,7 @@ needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s =
of 0.00015s.

The remaining CPU time will be used for user input and other tasks. Because
-realtime tasks have explicitly allocated the CPU time they need to perform
+real-time tasks have explicitly allocated the CPU time they need to perform
their tasks, buffer underruns in the graphics or audio can be eliminated.

NOTE: the above example is not fully implemented yet. We still
@@ -87,18 +87,20 @@ lack an EDF scheduler to make non-uniform periods usable.
The system wide settings are configured under the /proc virtual file system:

/proc/sys/kernel/sched_rt_period_us:
-  The scheduling period that is equivalent to 100% CPU bandwidth
+  The scheduling period that is equivalent to 100% CPU bandwidth.

/proc/sys/kernel/sched_rt_runtime_us:
-  A global limit on how much time realtime scheduling may use. Even without
-  CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
-  processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
-  available to all realtime groups.
+  A global limit on how much time real-time scheduling may use. This is always
+  less or equal to the period_us, as it denotes the time allocated from the
+  period_us for the real-time tasks. Even without CONFIG_RT_GROUP_SCHED enabled,
+  this will limit time reserved to real-time processes. With
+  CONFIG_RT_GROUP_SCHED=y it signifies the total bandwidth available to all
+  real-time groups.

* Time is specified in us because the interface is s32. This gives an
operating range from 1us to about 35 minutes.
* sched_rt_period_us takes values from 1 to INT_MAX.
-* sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
+* sched_rt_runtime_us takes values from -1 to sched_rt_period_us.
* A run time of -1 specifies runtime == period, ie. no limit.


@@ -108,18 +110,18 @@ The system wide settings are configured under the /proc virtual file system:
The default values for sched_rt_period_us (1000000 or 1s) and
sched_rt_runtime_us (950000 or 0.95s). This gives 0.05s to be used by
SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away
-realtime tasks will not lock up the machine but leave a little time to recover
+real-time tasks will not lock up the machine but leave a little time to recover
it. By setting runtime to -1 you'd get the old behaviour back.

By default all bandwidth is assigned to the root group and new groups get the
period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
want to assign bandwidth to another group, reduce the root group's bandwidth
and assign some or all of the difference to another group.

-Realtime group scheduling means you have to assign a portion of total CPU
-bandwidth to the group before it will accept realtime tasks. Therefore you will
-not be able to run realtime tasks as any user other than root until you have
-done that, even if the user has the rights to run processes with realtime
+Real-time group scheduling means you have to assign a portion of total CPU
+bandwidth to the group before it will accept real-time tasks. Therefore you will
+not be able to run real-time tasks as any user other than root until you have
+done that, even if the user has the rights to run processes with real-time
priority!


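Worked numbers for the renderer example above, in the microsecond units these knobs use: 25 frames/second gives a period of 0.04 s, and an 80% share gives a run time of 0.8 * 0.04 s = 0.032 s, i.e. sched_rt_period_us = 40000 and sched_rt_runtime_us = 32000 for that group (with CONFIG_RT_GROUP_SCHED=y, the per-group equivalents are the cgroup files cpu.rt_period_us and cpu.rt_runtime_us).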
4 changes: 2 additions & 2 deletions arch/powerpc/kernel/smp.c
@@ -1051,7 +1051,7 @@ static struct sched_domain_topology_level powerpc_topology[] = {
#endif
{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
{ cpu_mc_mask, SD_INIT_NAME(MC) },
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ cpu_cpu_mask, SD_INIT_NAME(PKG) },
{ NULL, },
};

@@ -1595,7 +1595,7 @@ static void add_cpu_to_masks(int cpu)
/* Skip all CPUs already part of current CPU core mask */
cpumask_andnot(mask, cpu_online_mask, cpu_core_mask(cpu));

-	/* If chip_id is -1; limit the cpu_core_mask to within DIE*/
+	/* If chip_id is -1; limit the cpu_core_mask to within PKG */
if (chip_id == -1)
cpumask_and(mask, mask, cpu_cpu_mask(cpu));

2 changes: 1 addition & 1 deletion arch/s390/kernel/topology.c
@@ -522,7 +522,7 @@ static struct sched_domain_topology_level s390_topology[] = {
{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
{ cpu_book_mask, SD_INIT_NAME(BOOK) },
{ cpu_drawer_mask, SD_INIT_NAME(DRAWER) },
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ cpu_cpu_mask, SD_INIT_NAME(PKG) },
{ NULL, },
};

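Context for the DIE -> PKG renames in the powerpc, s390 and x86 hunks (see "sched/topology: Rename 'DIE' domain to 'PKG'" in the commit list): the topology level built from cpu_cpu_mask() spans a whole package, so the old 'DIE' name was misleading on multi-die packages. The name mostly surfaces in scheduler debug output, so the rename is cosmetic rather than behavioural.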
11 changes: 11 additions & 0 deletions arch/x86/include/asm/spec-ctrl.h
@@ -4,6 +4,7 @@

#include <linux/thread_info.h>
#include <asm/nospec-branch.h>
+#include <asm/msr.h>

/*
* On VMENTER we must preserve whatever view of the SPEC_CTRL MSR
@@ -76,6 +77,16 @@ static inline u64 ssbd_tif_to_amd_ls_cfg(u64 tifn)
return (tifn & _TIF_SSBD) ? x86_amd_ls_cfg_ssbd_mask : 0ULL;
}

+/*
+ * This can be used in noinstr functions & should only be called in bare
+ * metal context.
+ */
+static __always_inline void __update_spec_ctrl(u64 val)
+{
+	__this_cpu_write(x86_spec_ctrl_current, val);
+	native_wrmsrl(MSR_IA32_SPEC_CTRL, val);
+}

#ifdef CONFIG_SMP
extern void speculative_store_bypass_ht_init(void);
#else
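Why the helper writes two things: x86_spec_ctrl_current is the kernel's per-CPU shadow of the last value written to MSR_IA32_SPEC_CTRL, so updating the cache together with the MSR keeps later readers (e.g. the context-switch path) coherent with what the hardware actually holds, while native_wrmsrl() keeps the write usable from the noinstr/bare-metal contexts the comment calls out.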
12 changes: 10 additions & 2 deletions arch/x86/kernel/smpboot.c
@@ -87,6 +87,7 @@
#include <asm/hw_irq.h>
#include <asm/stackprotector.h>
#include <asm/sev.h>
+#include <asm/spec-ctrl.h>

/* representing HT siblings of each logical CPU */
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
@@ -640,13 +641,13 @@ static void __init build_sched_topology(void)
};
#endif
/*
-	 * When there is NUMA topology inside the package skip the DIE domain
+	 * When there is NUMA topology inside the package skip the PKG domain
* since the NUMA domains will auto-magically create the right spanning
* domains based on the SLIT.
*/
if (!x86_has_numa_in_package) {
x86_topology[i++] = (struct sched_domain_topology_level){
-			cpu_cpu_mask, x86_die_flags, SD_INIT_NAME(DIE)
+			cpu_cpu_mask, x86_die_flags, SD_INIT_NAME(PKG)
};
}

@@ -1596,8 +1597,15 @@ void __noreturn hlt_play_dead(void)
native_halt();
}

+/*
+ * native_play_dead() is essentially a __noreturn function, but it can't
+ * be marked as such as the compiler may complain about it.
+ */
void native_play_dead(void)
{
+	if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS))
+		__update_spec_ctrl(0);

play_dead_common();
tboot_shutdown(TB_SHUTDOWN_WFS);

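This hunk implements the "disable IBRS when CPU is offline" item from the merge log: native_play_dead() is where an offlined CPU parks, and clearing SPEC_CTRL there turns IBRS off so the still-online SMT sibling stops paying the single-threaded performance penalty while its sibling is dead.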
18 changes: 13 additions & 5 deletions drivers/idle/intel_idle.c
@@ -53,9 +53,8 @@
#include <linux/moduleparam.h>
#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
-#include <asm/nospec-branch.h>
#include <asm/mwait.h>
-#include <asm/msr.h>
+#include <asm/spec-ctrl.h>
#include <asm/fpu/api.h>

#define INTEL_IDLE_VERSION "0.5.1"
@@ -69,6 +68,7 @@ static int max_cstate = CPUIDLE_STATE_MAX - 1;
static unsigned int disabled_states_mask __read_mostly;
static unsigned int preferred_states_mask __read_mostly;
static bool force_irq_on __read_mostly;
+static bool ibrs_off __read_mostly;

static struct cpuidle_device __percpu *intel_idle_cpuidle_devices;

@@ -182,12 +182,12 @@ static __cpuidle int intel_idle_ibrs(struct cpuidle_device *dev,
int ret;

if (smt_active)
-		native_wrmsrl(MSR_IA32_SPEC_CTRL, 0);
+		__update_spec_ctrl(0);

ret = __intel_idle(dev, drv, index);

if (smt_active)
-		native_wrmsrl(MSR_IA32_SPEC_CTRL, spec_ctrl);
+		__update_spec_ctrl(spec_ctrl);

return ret;
}
@@ -1853,11 +1853,13 @@ static void state_update_enter_method(struct cpuidle_state *state, int cstate)
}

if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS) &&
-	    state->flags & CPUIDLE_FLAG_IBRS) {
+	    ((state->flags & CPUIDLE_FLAG_IBRS) || ibrs_off)) {
/*
* IBRS mitigation requires that C-states are entered
* with interrupts disabled.
*/
+		if (ibrs_off && (state->flags & CPUIDLE_FLAG_IRQ_ENABLE))
+			state->flags &= ~CPUIDLE_FLAG_IRQ_ENABLE;
WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IRQ_ENABLE);
state->enter = intel_idle_ibrs;
return;
@@ -2176,3 +2178,9 @@ MODULE_PARM_DESC(preferred_cstates, "Mask of preferred idle states");
* 'CPUIDLE_FLAG_INIT_XSTATE' and 'CPUIDLE_FLAG_IBRS' flags.
*/
module_param(force_irq_on, bool, 0444);
+/*
+ * Force the disabling of IBRS when X86_FEATURE_KERNEL_IBRS is on and
+ * CPUIDLE_FLAG_IRQ_ENABLE isn't set.
+ */
+module_param(ibrs_off, bool, 0444);
+MODULE_PARM_DESC(ibrs_off, "Disable IBRS when idle");
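Note how ibrs_off composes with the existing machinery: writes to SPEC_CTRL must happen with interrupts disabled (the WARN_ON_ONCE in the hunk above enforces this), so when ibrs_off routes extra states through intel_idle_ibrs(), the new code first strips CPUIDLE_FLAG_IRQ_ENABLE from any state that would otherwise enter idle with interrupts on.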
2 changes: 2 additions & 0 deletions include/linux/cpu.h
@@ -155,6 +155,8 @@ static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
static inline void smp_shutdown_nonboot_cpus(unsigned int primary_cpu) { }
#endif /* !CONFIG_HOTPLUG_CPU */

+DEFINE_LOCK_GUARD_0(cpus_read_lock, cpus_read_lock(), cpus_read_unlock())

#ifdef CONFIG_PM_SLEEP_SMP
extern int freeze_secondary_cpus(int primary);
extern void thaw_secondary_cpus(void);
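This guard underpins the "lock guards" cleanups in the merge log. A sketch of how a caller uses it (illustrative: do_something_with_cpus() is a made-up function; guard() comes from include/linux/cleanup.h):

    static int do_something_with_cpus(void)
    {
        guard(cpus_read_lock)();    /* cpus_read_unlock() runs at scope exit */

        if (!cpu_online(0))
            return -ENODEV;         /* no explicit unlock on early return */

        return 0;
    }

Because the unlock is tied to scope exit, early-return paths need no goto-based cleanup, which is what makes races like the sched_setaffinity() vs. hotplug one in this series easier to close.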
8 changes: 8 additions & 0 deletions include/linux/list.h
@@ -686,6 +686,14 @@ static inline void list_splice_tail_init(struct list_head *list,
#define list_for_each(pos, head) \
for (pos = (head)->next; !list_is_head(pos, (head)); pos = pos->next)

+/**
+ * list_for_each_reverse - iterate backwards over a list
+ * @pos: the &struct list_head to use as a loop cursor.
+ * @head: the head for your list.
+ */
+#define list_for_each_reverse(pos, head) \
+	for (pos = (head)->prev; pos != (head); pos = pos->prev)

/**
* list_for_each_rcu - Iterate over a list in an RCU-safe fashion
* @pos: the &struct list_head to use as a loop cursor.
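A quick usage sketch for the new macro (illustrative: struct item and print_backwards() are made up):

    struct item {
        int val;
        struct list_head node;
    };

    static void print_backwards(struct list_head *head)
    {
        struct list_head *pos;

        /* Visits entries from the tail back towards the head. */
        list_for_each_reverse(pos, head) {
            struct item *it = list_entry(pos, struct item, node);
            pr_info("%d\n", it->val);
        }
    }

Like list_for_each(), this is the plain (non-entry, non-safe) form, so the list must not be modified while iterating.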
4 changes: 2 additions & 2 deletions include/linux/mm.h
@@ -1726,8 +1726,8 @@ static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
unsigned int pid_bit;

pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
-	if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->access_pids[1])) {
-		__set_bit(pid_bit, &vma->numab_state->access_pids[1]);
+	if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->pids_active[1])) {
+		__set_bit(pid_bit, &vma->numab_state->pids_active[1]);
}
}
#else /* !CONFIG_NUMA_BALANCING */
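For intuition about the filter above: hash_32() folds the current PID into ilog2(BITS_PER_LONG) bits, i.e. one of 64 bit positions on a 64-bit kernel, and the bit is recorded in the VMA's pids_active[] window. A userspace re-implementation for illustration (the constant matches include/linux/hash.h; the demo itself is not kernel code):

    #include <stdio.h>
    #include <stdint.h>

    #define GOLDEN_RATIO_32 0x61C88647u    /* as in include/linux/hash.h */

    /* hash_32(val, bits): multiplicative hash keeping the top 'bits' bits. */
    static uint32_t hash_32(uint32_t val, unsigned int bits)
    {
        return (val * GOLDEN_RATIO_32) >> (32 - bits);
    }

    int main(void)
    {
        /* ilog2(64) == 6: any PID maps to one of 64 filter bits. */
        printf("pid 1234 -> bit %u\n", hash_32(1234, 6));
        return 0;
    }

Different PIDs can collide on the same bit; that is acceptable here because the bitmap is only a cheap approximation of which tasks recently touched the VMA.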