Skip to content

Commit

Permalink
Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/…
Browse files Browse the repository at this point in the history
…linux/kernel/git/tip/tip

Pull 'full dynticks' support from Ingo Molnar:
 "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
  kernel feature to the timer and scheduler subsystems: 'full dynticks',
  or CONFIG_NO_HZ_FULL=y.

  This feature extends the nohz variable-size timer tick feature from
  idle to busy CPUs (running at most one task) as well, potentially
  reducing the number of timer interrupts significantly.

  This feature got motivated by real-time folks and the -rt tree, but
  the general utility and motivation of full-dynticks runs wider than
  that:

   - HPC workloads get faster: CPUs running a single task should be able
     to utilize a maximum amount of CPU power.  A periodic timer tick at
     HZ=1000 can cause a constant overhead of up to 1.0%.  This feature
     removes that overhead - and speeds up the system by 0.5%-1.0% on
     typical distro configs even on modern systems.

   - Real-time workload latency reduction: CPUs running critical tasks
     should experience as little jitter as possible.  The last remaining
     source of kernel-related jitter was the periodic timer tick.

   - A single task executing on a CPU is a pretty common situation,
     especially with an increasing number of cores/CPUs, so this feature
     helps desktop and mobile workloads as well.

  The cost of the feature is mainly related to increased timer
  reprogramming overhead when a CPU switches its tick period, and thus
  slightly longer to-idle and from-idle latency.

  Configuration-wise a third mode of operation is added to the existing
  two NOHZ kconfig modes:

   - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
     as a config option.  This is the traditional Linux periodic tick
     design: there's a HZ tick going on all the time, regardless of
     whether a CPU is idle or not.

   - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
     periodic tick when a CPU enters idle mode.

   - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
     tick when a CPU is idle, also slows the tick down to 1 Hz (one
     timer interrupt per second) when only a single task is running on a
     CPU.

  The .config behavior is compatible: existing !CONFIG_NO_HZ and
  CONFIG_NO_HZ=y settings get translated to the new values, without the
  user having to configure anything.  CONFIG_NO_HZ_FULL is turned off by
  default.

  This feature is based on a lot of infrastructure work that has been
  steadily going upstream in the last 2-3 cycles: related RCU support
  and non-periodic cputime support in particular is upstream already.

  This tree adds the final pieces and activates the feature.  The pull
  request is marked RFC because:

   - it's marked 64-bit only at the moment - the 32-bit support patch is
     small but did not get ready in time.

   - it has a number of fresh commits that came in after the merge
     window.  The overwhelming majority of commits are from before the
     merge window, but still some aspects of the tree are fresh and so I
     marked it RFC.

   - it's a pretty wide-reaching feature with lots of effects - and
     while the components have been in testing for some time, the full
     combination is still not very widely used.  That it's default-off
     should reduce its regression abilities and obviously there are no
     known regressions with CONFIG_NO_HZ_FULL=y enabled either.

   - the feature is not completely idempotent: there is no 100%
     equivalent replacement for a periodic scheduler/timer tick.  In
     particular there's ongoing work to map out and reduce its effects
     on scheduler load-balancing and statistics.  This should not impact
     correctness though, there are no known regressions related to this
     feature at this point.

   - it's a pretty ambitious feature that with time will likely be
     enabled by most Linux distros, and we'd like you to make input on
     its design/implementation, if you dislike some aspect we missed.
     Without flaming us to crisp! :-)

  Future plans:

   - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
     the periodic tick altogether when there's a single busy task on a
     CPU.  We'd first like 1 Hz to be exposed more widely before we go
     for the 0 Hz target though.

   - once we reach 0 Hz we can remove the periodic tick assumption from
     nr_running>=2 as well, by essentially interrupting busy tasks only
     as frequently as the sched_latency constraints require us to do -
     once every 4-40 msecs, depending on nr_running.

  I am personally leaning towards biting the bullet and doing this in
  v3.10, like the -rt tree this effort has been going on for too long -
  but the final word is up to you as usual.

  More technical details can be found in Documentation/timers/NO_HZ.txt"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  sched: Keep at least 1 tick per second for active dynticks tasks
  rcu: Fix full dynticks' dependency on wide RCU nocb mode
  nohz: Protect smp_processor_id() in tick_nohz_task_switch()
  nohz_full: Add documentation.
  cputime_nsecs: use math64.h for nsec resolution conversion helpers
  nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
  nohz: Reduce overhead under high-freq idling patterns
  nohz: Remove full dynticks' superfluous dependency on RCU tree
  nohz: Fix unavailable tick_stop tracepoint in dynticks idle
  nohz: Add basic tracing
  nohz: Select wide RCU nocb for full dynticks
  nohz: Disable the tick when irq resume in full dynticks CPU
  nohz: Re-evaluate the tick for the new task after a context switch
  nohz: Prepare to stop the tick on irq exit
  nohz: Implement full dynticks kick
  nohz: Re-evaluate the tick from the scheduler IPI
  sched: New helper to prevent from stopping the tick in full dynticks
  sched: Kick full dynticks CPU that have more than one task enqueued.
  perf: New helper to prevent full dynticks CPUs from stopping tick
  perf: Kick full dynticks CPU if events rotation is needed
  ...
  • Loading branch information
torvalds committed May 5, 2013
2 parents 64049d1 + 265f22a commit 534c97b
Show file tree
Hide file tree
Showing 31 changed files with 990 additions and 119 deletions.
2 changes: 1 addition & 1 deletion Documentation/RCU/stallwarn.txt
Original file line number Diff line number Diff line change
Expand Up @@ -191,7 +191,7 @@ o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
o A hardware or software issue shuts off the scheduler-clock
interrupt on a CPU that is not in dyntick-idle mode. This
problem really has happened, and seems to be most likely to
result in RCU CPU stall warnings for CONFIG_NO_HZ=n kernels.
result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.

o A bug in the RCU implementation.

Expand Down
4 changes: 2 additions & 2 deletions Documentation/cpu-freq/governors.txt
Original file line number Diff line number Diff line change
Expand Up @@ -131,8 +131,8 @@ sampling_rate_min:
The sampling rate is limited by the HW transition latency:
transition_latency * 100
Or by kernel restrictions:
If CONFIG_NO_HZ is set, the limit is 10ms fixed.
If CONFIG_NO_HZ is not set or nohz=off boot parameter is used, the
If CONFIG_NO_HZ_COMMON is set, the limit is 10ms fixed.
If CONFIG_NO_HZ_COMMON is not set or nohz=off boot parameter is used, the
limits depend on the CONFIG_HZ option:
HZ=1000: min=20000us (20ms)
HZ=250: min=80000us (80ms)
Expand Down
8 changes: 8 additions & 0 deletions Documentation/kernel-parameters.txt
Original file line number Diff line number Diff line change
Expand Up @@ -1964,6 +1964,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
Valid arguments: on, off
Default: on

nohz_full= [KNL,BOOT]
In kernels built with CONFIG_NO_HZ_FULL=y, set
the specified list of CPUs whose tick will be stopped
whenever possible. The boot CPU will be forced outside
the range to maintain the timekeeping.
The CPUs in this range must also be included in the
rcu_nocbs= set.

noiotrap [SH] Disables trapped I/O port accesses.

noirqdebug [X86-32] Disables the code which attempts to detect and
Expand Down
273 changes: 273 additions & 0 deletions Documentation/timers/NO_HZ.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,273 @@
NO_HZ: Reducing Scheduling-Clock Ticks


This document describes Kconfig options and boot parameters that can
reduce the number of scheduling-clock interrupts, thereby improving energy
efficiency and reducing OS jitter. Reducing OS jitter is important for
some types of computationally intensive high-performance computing (HPC)
applications and for real-time applications.

There are two main contexts in which the number of scheduling-clock
interrupts can be reduced compared to the old-school approach of sending
a scheduling-clock interrupt to all CPUs every jiffy whether they need
it or not (CONFIG_HZ_PERIODIC=y or CONFIG_NO_HZ=n for older kernels):

1. Idle CPUs (CONFIG_NO_HZ_IDLE=y or CONFIG_NO_HZ=y for older kernels).

2. CPUs having only one runnable task (CONFIG_NO_HZ_FULL=y).

These two cases are described in the following two sections, followed
by a third section on RCU-specific considerations and a fourth and final
section listing known issues.


IDLE CPUs

If a CPU is idle, there is little point in sending it a scheduling-clock
interrupt. After all, the primary purpose of a scheduling-clock interrupt
is to force a busy CPU to shift its attention among multiple duties,
and an idle CPU has no duties to shift its attention among.

The CONFIG_NO_HZ_IDLE=y Kconfig option causes the kernel to avoid sending
scheduling-clock interrupts to idle CPUs, which is critically important
both to battery-powered devices and to highly virtualized mainframes.
A battery-powered device running a CONFIG_HZ_PERIODIC=y kernel would
drain its battery very quickly, easily 2-3 times as fast as would the
same device running a CONFIG_NO_HZ_IDLE=y kernel. A mainframe running
1,500 OS instances might find that half of its CPU time was consumed by
unnecessary scheduling-clock interrupts. In these situations, there
is strong motivation to avoid sending scheduling-clock interrupts to
idle CPUs. That said, dyntick-idle mode is not free:

1. It increases the number of instructions executed on the path
to and from the idle loop.

2. On many architectures, dyntick-idle mode also increases the
number of expensive clock-reprogramming operations.

Therefore, systems with aggressive real-time response constraints often
run CONFIG_HZ_PERIODIC=y kernels (or CONFIG_NO_HZ=n for older kernels)
in order to avoid degrading from-idle transition latencies.

An idle CPU that is not receiving scheduling-clock interrupts is said to
be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running
tickless". The remainder of this document will use "dyntick-idle mode".

There is also a boot parameter "nohz=" that can be used to disable
dyntick-idle mode in CONFIG_NO_HZ_IDLE=y kernels by specifying "nohz=off".
By default, CONFIG_NO_HZ_IDLE=y kernels boot with "nohz=on", enabling
dyntick-idle mode.


CPUs WITH ONLY ONE RUNNABLE TASK

If a CPU has only one runnable task, there is little point in sending it
a scheduling-clock interrupt because there is no other task to switch to.

The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
sending scheduling-clock interrupts to CPUs with a single runnable task,
and such CPUs are said to be "adaptive-ticks CPUs". This is important
for applications with aggressive real-time response constraints because
it allows them to improve their worst-case response times by the maximum
duration of a scheduling-clock interrupt. It is also important for
computationally intensive short-iteration workloads: If any CPU is
delayed during a given iteration, all the other CPUs will be forced to
wait idle while the delayed CPU finishes. Thus, the delay is multiplied
by one less than the number of CPUs. In these situations, there is
again strong motivation to avoid sending scheduling-clock interrupts.

By default, no CPU will be an adaptive-ticks CPU. The "nohz_full="
boot parameter specifies the adaptive-ticks CPUs. For example,
"nohz_full=1,6-8" says that CPUs 1, 6, 7, and 8 are to be adaptive-ticks
CPUs. Note that you are prohibited from marking all of the CPUs as
adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain
online to handle timekeeping tasks in order to ensure that system calls
like gettimeofday() returns accurate values on adaptive-tick CPUs.
(This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no
running user processes to observe slight drifts in clock rate.)
Therefore, the boot CPU is prohibited from entering adaptive-ticks
mode. Specifying a "nohz_full=" mask that includes the boot CPU will
result in a boot-time error message, and the boot CPU will be removed
from the mask.

Alternatively, the CONFIG_NO_HZ_FULL_ALL=y Kconfig parameter specifies
that all CPUs other than the boot CPU are adaptive-ticks CPUs. This
Kconfig parameter will be overridden by the "nohz_full=" boot parameter,
so that if both the CONFIG_NO_HZ_FULL_ALL=y Kconfig parameter and
the "nohz_full=1" boot parameter is specified, the boot parameter will
prevail so that only CPU 1 will be an adaptive-ticks CPU.

Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
This is covered in the "RCU IMPLICATIONS" section below.

Normally, a CPU remains in adaptive-ticks mode as long as possible.
In particular, transitioning to kernel mode does not automatically change
the mode. Instead, the CPU will exit adaptive-ticks mode only if needed,
for example, if that CPU enqueues an RCU callback.

Just as with dyntick-idle mode, the benefits of adaptive-tick mode do
not come for free:

1. CONFIG_NO_HZ_FULL selects CONFIG_NO_HZ_COMMON, so you cannot run
adaptive ticks without also running dyntick idle. This dependency
extends down into the implementation, so that all of the costs
of CONFIG_NO_HZ_IDLE are also incurred by CONFIG_NO_HZ_FULL.

2. The user/kernel transitions are slightly more expensive due
to the need to inform kernel subsystems (such as RCU) about
the change in mode.

3. POSIX CPU timers on adaptive-tick CPUs may miss their deadlines
(perhaps indefinitely) because they currently rely on
scheduling-tick interrupts. This will likely be fixed in
one of two ways: (1) Prevent CPUs with POSIX CPU timers from
entering adaptive-tick mode, or (2) Use hrtimers or other
adaptive-ticks-immune mechanism to cause the POSIX CPU timer to
fire properly.

4. If there are more perf events pending than the hardware can
accommodate, they are normally round-robined so as to collect
all of them over time. Adaptive-tick mode may prevent this
round-robining from happening. This will likely be fixed by
preventing CPUs with large numbers of perf events pending from
entering adaptive-tick mode.

5. Scheduler statistics for adaptive-tick CPUs may be computed
slightly differently than those for non-adaptive-tick CPUs.
This might in turn perturb load-balancing of real-time tasks.

6. The LB_BIAS scheduler feature is disabled by adaptive ticks.

Although improvements are expected over time, adaptive ticks is quite
useful for many types of real-time and compute-intensive applications.
However, the drawbacks listed above mean that adaptive ticks should not
(yet) be enabled by default.


RCU IMPLICATIONS

There are situations in which idle CPUs cannot be permitted to
enter either dyntick-idle mode or adaptive-tick mode, the most
common being when that CPU has RCU callbacks pending.

The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such CPUs
to enter dyntick-idle mode or adaptive-tick mode anyway. In this case,
a timer will awaken these CPUs every four jiffies in order to ensure
that the RCU callbacks are processed in a timely fashion.

Another approach is to offload RCU callback processing to "rcuo" kthreads
using the CONFIG_RCU_NOCB_CPU=y Kconfig option. The specific CPUs to
offload may be selected via several methods:

1. One of three mutually exclusive Kconfig options specify a
build-time default for the CPUs to offload:

a. The CONFIG_RCU_NOCB_CPU_NONE=y Kconfig option results in
no CPUs being offloaded.

b. The CONFIG_RCU_NOCB_CPU_ZERO=y Kconfig option causes
CPU 0 to be offloaded.

c. The CONFIG_RCU_NOCB_CPU_ALL=y Kconfig option causes all
CPUs to be offloaded. Note that the callbacks will be
offloaded to "rcuo" kthreads, and that those kthreads
will in fact run on some CPU. However, this approach
gives fine-grained control on exactly which CPUs the
callbacks run on, along with their scheduling priority
(including the default of SCHED_OTHER), and it further
allows this control to be varied dynamically at runtime.

2. The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated
list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1,
3, 4, and 5. The specified CPUs will be offloaded in addition to
any CPUs specified as offloaded by CONFIG_RCU_NOCB_CPU_ZERO=y or
CONFIG_RCU_NOCB_CPU_ALL=y. This means that the "rcu_nocbs=" boot
parameter has no effect for kernels built with RCU_NOCB_CPU_ALL=y.

The offloaded CPUs will never queue RCU callbacks, and therefore RCU
never prevents offloaded CPUs from entering either dyntick-idle mode
or adaptive-tick mode. That said, note that it is up to userspace to
pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the
scheduler will decide where to run them, which might or might not be
where you want them to run.


KNOWN ISSUES

o Dyntick-idle slows transitions to and from idle slightly.
In practice, this has not been a problem except for the most
aggressive real-time workloads, which have the option of disabling
dyntick-idle mode, an option that most of them take. However,
some workloads will no doubt want to use adaptive ticks to
eliminate scheduling-clock interrupt latencies. Here are some
options for these workloads:

a. Use PMQOS from userspace to inform the kernel of your
latency requirements (preferred).

b. On x86 systems, use the "idle=mwait" boot parameter.

c. On x86 systems, use the "intel_idle.max_cstate=" to limit
` the maximum C-state depth.

d. On x86 systems, use the "idle=poll" boot parameter.
However, please note that use of this parameter can cause
your CPU to overheat, which may cause thermal throttling
to degrade your latencies -- and that this degradation can
be even worse than that of dyntick-idle. Furthermore,
this parameter effectively disables Turbo Mode on Intel
CPUs, which can significantly reduce maximum performance.

o Adaptive-ticks slows user/kernel transitions slightly.
This is not expected to be a problem for computationally intensive
workloads, which have few such transitions. Careful benchmarking
will be required to determine whether or not other workloads
are significantly affected by this effect.

o Adaptive-ticks does not do anything unless there is only one
runnable task for a given CPU, even though there are a number
of other situations where the scheduling-clock tick is not
needed. To give but one example, consider a CPU that has one
runnable high-priority SCHED_FIFO task and an arbitrary number
of low-priority SCHED_OTHER tasks. In this case, the CPU is
required to run the SCHED_FIFO task until it either blocks or
some other higher-priority task awakens on (or is assigned to)
this CPU, so there is no point in sending a scheduling-clock
interrupt to this CPU. However, the current implementation
nevertheless sends scheduling-clock interrupts to CPUs having a
single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER
tasks, even though these interrupts are unnecessary.

Better handling of these sorts of situations is future work.

o A reboot is required to reconfigure both adaptive idle and RCU
callback offloading. Runtime reconfiguration could be provided
if needed, however, due to the complexity of reconfiguring RCU at
runtime, there would need to be an earthshakingly good reason.
Especially given that you have the straightforward option of
simply offloading RCU callbacks from all CPUs and pinning them
where you want them whenever you want them pinned.

o Additional configuration is required to deal with other sources
of OS jitter, including interrupts and system-utility tasks
and processes. This configuration normally involves binding
interrupts and tasks to particular CPUs.

o Some sources of OS jitter can currently be eliminated only by
constraining the workload. For example, the only way to eliminate
OS jitter due to global TLB shootdowns is to avoid the unmapping
operations (such as kernel module unload operations) that
result in these shootdowns. For another example, page faults
and TLB misses can be reduced (and in some cases eliminated) by
using huge pages and by constraining the amount of memory used
by the application. Pre-faulting the working set can also be
helpful, especially when combined with the mlock() and mlockall()
system calls.

o Unless all CPUs are idle, at least one CPU must keep the
scheduling-clock interrupt going in order to support accurate
timekeeping.

o If there are adaptive-ticks CPUs, there will be at least one
CPU keeping the scheduling-clock interrupt going, even if all
CPUs are otherwise idle.
4 changes: 2 additions & 2 deletions arch/um/include/shared/common-offsets.h
Original file line number Diff line number Diff line change
Expand Up @@ -30,8 +30,8 @@ DEFINE(UM_NSEC_PER_USEC, NSEC_PER_USEC);
#ifdef CONFIG_PRINTK
DEFINE(UML_CONFIG_PRINTK, CONFIG_PRINTK);
#endif
#ifdef CONFIG_NO_HZ
DEFINE(UML_CONFIG_NO_HZ, CONFIG_NO_HZ);
#ifdef CONFIG_NO_HZ_COMMON
DEFINE(UML_CONFIG_NO_HZ_COMMON, CONFIG_NO_HZ_COMMON);
#endif
#ifdef CONFIG_UML_X86
DEFINE(UML_CONFIG_UML_X86, CONFIG_UML_X86);
Expand Down
2 changes: 1 addition & 1 deletion arch/um/os-Linux/time.c
Original file line number Diff line number Diff line change
Expand Up @@ -79,7 +79,7 @@ long long os_nsecs(void)
return timeval_to_ns(&tv);
}

#ifdef UML_CONFIG_NO_HZ
#ifdef UML_CONFIG_NO_HZ_COMMON
static int after_sleep_interval(struct timespec *ts)
{
return 0;
Expand Down
Loading

0 comments on commit 534c97b

Please sign in to comment.