docs: scheduler: Convert schedutil.txt to ReST
All other scheduler documents have been converted to *.rst. Let's do
the same for schedutil.txt.

Also fixed some typos.

Signed-off-by: Tang Yizhou <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
Tang Yizhou authored and Jonathan Corbet committed Mar 16, 2022
1 parent ff13687 commit b57b849
Showing 2 changed files with 18 additions and 13 deletions.
1 change: 1 addition & 0 deletions Documentation/scheduler/index.rst
@@ -14,6 +14,7 @@ Linux Scheduler
sched-domains
sched-capacity
sched-energy
schedutil
sched-nice-design
sched-rt-group
sched-stats
@@ -1,11 +1,15 @@
=========
Schedutil
=========

.. note::

NOTE; all this assumes a linear relation between frequency and work capacity,
we know this is flawed, but it is the best workable approximation.
All this assumes a linear relation between frequency and work capacity,
we know this is flawed, but it is the best workable approximation.


PELT (Per Entity Load Tracking)
-------------------------------
===============================

With PELT we track some metrics across the various scheduler entities, from
individual tasks to task-group slices to CPU runqueues. As the basis for this
@@ -38,24 +42,24 @@ while 'runnable' will increase to reflect the amount of contention.
For more detail see: kernel/sched/pelt.c


Frequency- / CPU Invariance
---------------------------
Frequency / CPU Invariance
==========================

Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
a big CPU, we allow architectures to scale the time delta with two ratios, one
Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.

For simple DVFS architectures (where software is in full control) we trivially
compute the ratio as:
compute the ratio as::

f_cur
r_dvfs := -----
f_max

For more dynamic systems where the hardware is in control of DVFS we use
hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
For Intel specifically, we use:
For Intel specifically, we use::

APERF
f_cur := ----- * P0
@@ -87,7 +91,7 @@ For more detail see:


UTIL_EST / UTIL_EST_FASTUP
--------------------------
==========================

Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
@@ -106,7 +110,7 @@ For more detail see: kernel/sched/fair.c:util_est_dequeue()


UCLAMP
------
======

It is possible to set effective u_min and u_max clamps on each CFS or RT task;
the runqueue keeps an max aggregate of these clamps for all running tasks.
@@ -115,15 +119,15 @@ For more detail see: include/uapi/linux/sched/types.h


Schedutil / DVFS
----------------
================

Every time the scheduler load tracking is updated (task wakeup, task
migration, time progression) we call out to schedutil to update the hardware
DVFS state.

The basis is the CPU runqueue's 'running' metric, which per the above it is
the frequency invariant utilization estimate of the CPU. From this we compute
a desired frequency like:
a desired frequency like::

max( running, util_est ); if UTIL_EST
u_cfs := { running; otherwise
@@ -135,7 +139,7 @@ a desired frequency like:

f_des := min( f_max, 1.25 u * f_max )

XXX IO-wait; when the update is due to a task wakeup from IO-completion we
XXX IO-wait: when the update is due to a task wakeup from IO-completion we
boost 'u' above.

This frequency is then used to select a P-state/OPP or directly munged into a
@@ -153,7 +157,7 @@ For more information see: kernel/sched/cpufreq_schedutil.c


NOTES
-----
=====

- On low-load scenarios, where DVFS is most relevant, the 'running' numbers
will closely reflect utilization.
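
As a rough illustration of the frequency selection in the Schedutil / DVFS hunk above, the sketch below evaluates ``u_cfs`` and ``f_des`` in Python. It is not the kernel implementation: the function names are hypothetical, utilization is assumed to be a fraction of CPU capacity in [0, 1], and the intermediate terms hidden in the collapsed part of the diff are not reconstructed, so ``u`` is simply taken as an input.

```python
def cfs_utilization(running, util_est, use_util_est=True):
    """u_cfs: max(running, util_est) if UTIL_EST is enabled, else running.

    Hypothetical helper; both inputs are assumed to be fractions of capacity.
    """
    return max(running, util_est) if use_util_est else running


def desired_frequency(u, f_max):
    """f_des := min(f_max, 1.25 * u * f_max).

    Hypothetical helper; u is assumed to be the aggregate utilization as a
    fraction of capacity, f_max is in any consistent frequency unit.
    """
    return min(f_max, 1.25 * u * f_max)


if __name__ == "__main__":
    f_max_khz = 2_000_000  # assumed example value: a 2 GHz CPU, in kHz
    u = cfs_utilization(running=0.3, util_est=0.5)
    print(desired_frequency(u, f_max_khz))
```

With the 1.25 factor, ``f_des`` reaches ``f_max`` once ``u`` is at or above 0.8; for instance, ``u = 0.5`` on a 2 GHz CPU asks for 1.25 GHz.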