Merge tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "The most significant change here is the extension of the Energy Model
  to cover non-CPU devices (as well as CPUs) from Lukasz Luba.

  There is also some new hardware support (Ice Lake server idle states
  table for intel_idle, Sapphire Rapids and Power Limit 4 support in the
  RAPL driver), some new functionality in the existing drivers (eg. a
  new switch to disable/enable CPU energy-efficiency optimizations in
  intel_pstate, delayed timers in devfreq), some assorted fixes (cpufreq
  core, intel_pstate, intel_idle) and cleanups (eg. cpuidle-psci,
  devfreq), including the elimination of W=1 build warnings from cpufreq
  done by Lee Jones.

  Specifics:

   - Make the Energy Model cover non-CPU devices (Lukasz Luba).

   - Add Ice Lake server idle states table to the intel_idle driver and
     eliminate a redundant static variable from it (Chen Yu, Rafael
     Wysocki).

   - Eliminate all W=1 build warnings from cpufreq (Lee Jones).

   - Add support for Sapphire Rapids and for Power Limit 4 to the Intel
     RAPL power capping driver (Sumeet Pawnikar, Zhang Rui).

   - Fix function name in kerneldoc comments in the idle_inject power
     capping driver (Yangtao Li).

   - Fix locking issues with cpufreq governors and drop a redundant
     "weak" function definition from cpufreq (Viresh Kumar).

   - Rearrange cpufreq to register non-modular governors at the
     core_initcall level and allow the default cpufreq governor to be
     specified in the kernel command line (Quentin Perret).

   - Extend, fix and clean up the intel_pstate driver (Srinivas
     Pandruvada, Rafael Wysocki):

       * Add a new sysfs attribute for disabling/enabling CPU
         energy-efficiency optimizations in the processor.

       * Make the driver avoid enabling HWP if EPP is not supported.

       * Allow the driver to handle numeric EPP values in the sysfs
         interface and fix the setting of EPP via sysfs in the active
         mode.

       * Eliminate a static checker warning and clean up a kerneldoc
         comment.

   - Clean up some variable declarations in the powernv cpufreq driver
     (Wei Yongjun).

   - Fix up the ->enter_s2idle callback definition to cover the case
     when it points to the same function as ->idle correctly (Neal Liu).

   - Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson).

   - Make the PM core emit "changed" uevent when adding/removing the
     "wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi).

   - Add a helper macro for declaring PM callbacks and use it in the MMC
     jz4740 driver (Paul Cercueil).

   - Fix white space in some places in the hibernate code and make the
     system-wide PM code use "const char *" where appropriate (Xiang
     Chen, Alexey Dobriyan).

   - Add one more "unsafe" helper macro to the freezer to cover the NFS
     use case (He Zhe).

   - Change the language in the generic PM domains framework to use
     parent/child terminology and clean up a typo and some comment
     formatting in that code (Kees Cook, Geert Uytterhoeven).

   - Update the operating performance points OPP framework (Lukasz Luba,
     Andrew-sh.Cheng, Valdis Kletnieks):

       * Refactor dev_pm_opp_of_register_em() and update related drivers.

       * Add a missing function export.

       * Allow disabled OPPs in dev_pm_opp_get_freq().

   - Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric
     Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier):

       * Add support for delayed timers to the devfreq core and make the
         Samsung exynos5422-dmc driver use it.

       * Unify sysfs interface to use "df-" as a prefix in instance
         names consistently.

       * Fix devfreq_summary debugfs node indentation.

       * Add the rockchip,pmu phandle to the rk3399_dmc driver DT
         bindings.

       * List Dmitry Osipenko as the Tegra devfreq driver maintainer.

       * Fix typos in the core devfreq code.

   - Update the pm-graph utility to version 5.7 including a number of
     fixes related to suspend-to-idle (Todd Brandt).

   - Fix coccicheck errors and warnings in the cpupower utility (Shuah
     Khan).

   - Replace HTTP links with HTTPs ones in multiple places (Alexander A.
     Klimov)"

* tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (71 commits)
  cpuidle: ACPI: fix 'return' with no value build warning
  cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
  cpufreq: intel_pstate: Rearrange the storing of new EPP values
  intel_idle: Customize IceLake server support
  PM / devfreq: Fix the wrong end with semicolon
  PM / devfreq: Fix indentaion of devfreq_summary debugfs node
  PM / devfreq: Clean up the devfreq instance name in sysfs attr
  memory: samsung: exynos5422-dmc: Add module param to control IRQ mode
  memory: samsung: exynos5422-dmc: Adjust polling interval and uptreshold
  memory: samsung: exynos5422-dmc: Use delayed timer as default
  PM / devfreq: Add support delayed timer for polling mode
  dt-bindings: devfreq: rk3399_dmc: Add rockchip,pmu phandle
  PM / devfreq: tegra: Add Dmitry as a maintainer
  PM / devfreq: event: Fix trivial spelling
  PM / devfreq: rk3399_dmc: Fix kernel oops when rockchip,pmu is absent
  cpuidle: change enter_s2idle() prototype
  cpuidle: psci: Prevent domain idlestates until consumers are ready
  cpuidle: psci: Convert PM domain to platform driver
  cpuidle: psci: Fix error path via converting to a platform driver
  cpuidle: psci: Fail cpuidle registration if set OSI mode failed
  ...
torvalds committed Aug 4, 2020
2 parents d516840 + 86ba54f commit 0408497
Showing 81 changed files with 1,596 additions and 962 deletions.
12 changes: 12 additions & 0 deletions Documentation/ABI/testing/sysfs-class-devfreq
@@ -108,3 +108,15 @@ Description:
frequency requested by governors and min_freq.
The max_freq overrides min_freq because max_freq may be
used to throttle devices to avoid overheating.

What: /sys/class/devfreq/.../timer
Date: July 2020
Contact: Chanwoo Choi <[email protected]>
Description:
This ABI shows and stores the kind of work timer selected by
the user. The devfreq workqueue uses this timer to monitor
device status such as utilization. The user can change the
work timer at runtime as needed, as follows:
echo deferrable > /sys/class/devfreq/.../timer
echo delayed > /sys/class/devfreq/.../timer
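
The two echo commands above can also be issued from a program. The following is a minimal userspace sketch, not kernel code: the helper names devfreq_timer_kind_valid() and devfreq_timer_store() are hypothetical, and the only accepted values are the two timer names listed in the ABI text. The caller supplies the attribute path, since the device part of the path is elided above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The two timer kinds named in the ABI description. */
static const char *const devfreq_timer_kinds[] = { "deferrable", "delayed" };

/* Return 1 if 'kind' is one of the documented timer names, 0 otherwise. */
int devfreq_timer_kind_valid(const char *kind)
{
	size_t i, n = sizeof(devfreq_timer_kinds) / sizeof(devfreq_timer_kinds[0]);

	assert(kind != NULL);
	for (i = 0; i < n; i++)
		if (strcmp(kind, devfreq_timer_kinds[i]) == 0)
			return 1;
	return 0;
}

/*
 * Hypothetical helper: write 'kind' to the attribute file at 'path'.
 * Returns 0 on success, -1 on a rejected name or an I/O error.
 */
int devfreq_timer_store(const char *path, const char *kind)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	if (!devfreq_timer_kind_valid(kind))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(kind, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Validating the name before opening the file keeps a typo from reaching the kernel at all; the kernel is still expected to reject anything it does not recognize.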
5 changes: 5 additions & 0 deletions Documentation/admin-guide/kernel-parameters.txt
@@ -703,6 +703,11 @@
cpufreq.off=1 [CPU_FREQ]
disable the cpufreq sub-system

cpufreq.default_governor=
[CPU_FREQ] Name of the default cpufreq governor or
policy to use. This governor must be registered in the
kernel before the cpufreq driver probes.

cpu_init_udelay=N
[X86] Delay for N microsec between assert and de-assert
of APIC INIT to start processors. This delay occurs
6 changes: 3 additions & 3 deletions Documentation/admin-guide/pm/cpufreq.rst
@@ -147,9 +147,9 @@ CPUs in it.

The next major initialization step for a new policy object is to attach a
scaling governor to it (to begin with, that is the default scaling governor
determined by the kernel configuration, but it may be changed later
via ``sysfs``). First, a pointer to the new policy object is passed to the
governor's ``->init()`` callback which is expected to initialize all of the
determined by the kernel command line or configuration, but it may be changed
later via ``sysfs``). First, a pointer to the new policy object is passed to
the governor's ``->init()`` callback which is expected to initialize all of the
data structures necessary to handle the given policy and, possibly, to add
a governor ``sysfs`` interface to it. Next, the governor is started by
invoking its ``->start()`` callback.
17 changes: 16 additions & 1 deletion Documentation/admin-guide/pm/intel_pstate.rst
@@ -431,6 +431,17 @@ argument is passed to the kernel in the command line.
supported in the current configuration, writes to this attribute will
fail with an appropriate error.

``energy_efficiency``
This attribute is only present on platforms with CPUs matching the
Kaby Lake or Coffee Lake desktop CPU models. By default, this driver
disables energy-efficiency optimizations on these CPU models in HWP
mode. Enabling energy efficiency may limit the maximum operating
frequency in both HWP and non-HWP mode. In non-HWP mode,
optimizations are done only in the turbo frequency range. In HWP mode,
optimizations are done in the entire frequency range. Setting this
attribute to "1" enables energy-efficiency optimizations and setting
it to "0" disables them.

Interpretation of Policy Attributes
-----------------------------------

@@ -554,7 +565,11 @@ somewhere between the two extremes:
Strings written to the ``energy_performance_preference`` attribute are
internally translated to integer values written to the processor's
Energy-Performance Preference (EPP) knob (if supported) or its
Energy-Performance Bias (EPB) knob.
Energy-Performance Bias (EPB) knob. It is also possible to write an
integer value between 0 and 255, if the EPP feature is present. If the EPP
feature is not present, writing an integer value to this attribute is not
supported. In that case, the user can use the
"/sys/devices/system/cpu/cpu*/power/energy_perf_bias" interface.
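
The name-or-number rule described above can be sketched in plain C. This is a userspace illustration, not the driver's code: epp_from_string() is a hypothetical helper, and the name-to-value table (0x00, 0x80, 0xC0, 0xFF) is an assumption made for the example rather than something stated in this diff. The "default" preference is deliberately left out, and the input is assumed to carry no trailing newline.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct epp_name {
	const char *name;
	int value;
};

/* Assumed name-to-EPP mapping, for illustration only. */
static const struct epp_name epp_names[] = {
	{ "performance",         0x00 },
	{ "balance_performance", 0x80 },
	{ "balance_power",       0xC0 },
	{ "power",               0xFF },
};

/*
 * Translate a user-supplied string into an EPP value in 0..255.
 * Returns -EINVAL if the string is neither a known name nor an
 * integer within range.
 */
int epp_from_string(const char *s)
{
	size_t i, n = sizeof(epp_names) / sizeof(epp_names[0]);
	char *end;
	long v;

	assert(s != NULL);
	for (i = 0; i < n; i++)
		if (strcmp(s, epp_names[i].name) == 0)
			return epp_names[i].value;

	errno = 0;
	v = strtol(s, &end, 0);	/* base 0: accepts decimal, hex, octal */
	if (errno || end == s || *end != '\0' || v < 0 || v > 255)
		return -EINVAL;
	return (int)v;
}
```

Trying the names first means a string such as "power" is never handed to the number parser, and the range check enforces the 0 to 255 limit quoted in the text.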

[Note that tasks may be migrated from one CPU to another by the scheduler's
load-balancing algorithm and if different energy vs performance hints are
2 changes: 2 additions & 0 deletions Documentation/devicetree/bindings/devfreq/rk3399_dmc.txt
@@ -18,6 +18,8 @@ Optional properties:
format depends on the interrupt controller.
It should be a DCF interrupt. When DDR DVFS finishes
a DCF interrupt is triggered.
- rockchip,pmu: Phandle to the syscon managing the "PMU general register
files".

Following properties relate to DDR timing:

135 changes: 75 additions & 60 deletions Documentation/power/energy-model.rst
@@ -1,15 +1,17 @@
====================
Energy Model of CPUs
====================
.. SPDX-License-Identifier: GPL-2.0
=======================
Energy Model of devices
=======================

1. Overview
-----------

The Energy Model (EM) framework serves as an interface between drivers knowing
the power consumed by CPUs at various performance levels, and the kernel
the power consumed by devices at various performance levels, and the kernel
subsystems willing to use that information to make energy-aware decisions.

The source of the information about the power consumed by CPUs can vary greatly
The source of the information about the power consumed by devices can vary greatly
from one platform to another. These power costs can be estimated using
devicetree data in some cases. In others, the firmware will know better.
Alternatively, userspace might be best positioned. And so on. In order to avoid
@@ -25,7 +27,7 @@ framework, and interested clients reading the data from it::
+---------------+ +-----------------+ +---------------+
| Thermal (IPA) | | Scheduler (EAS) | | Other |
+---------------+ +-----------------+ +---------------+
| | em_pd_energy() |
| | em_cpu_energy() |
| | em_cpu_get() |
+---------+ | +---------+
| | |
@@ -35,7 +37,7 @@ framework, and interested clients reading the data from it::
| Framework |
+---------------------+
^ ^ ^
| | | em_register_perf_domain()
| | | em_dev_register_perf_domain()
+----------+ | +---------+
| | |
+---------------+ +---------------+ +--------------+
@@ -47,12 +49,12 @@ framework, and interested clients reading the data from it::
| Device Tree | | Firmware | | ? |
+--------------+ +---------------+ +--------------+

The EM framework manages power cost tables per 'performance domain' in the
system. A performance domain is a group of CPUs whose performance is scaled
together. Performance domains generally have a 1-to-1 mapping with CPUFreq
policies. All CPUs in a performance domain are required to have the same
micro-architecture. CPUs in different performance domains can have different
micro-architectures.
In case of CPU devices the EM framework manages power cost tables per
'performance domain' in the system. A performance domain is a group of CPUs
whose performance is scaled together. Performance domains generally have a
1-to-1 mapping with CPUFreq policies. All CPUs in a performance domain are
required to have the same micro-architecture. CPUs in different performance
domains can have different micro-architectures.


2. Core APIs
@@ -70,28 +72,37 @@ CONFIG_ENERGY_MODEL must be enabled to use the EM framework.
Drivers are expected to register performance domains into the EM framework by
calling the following API::

int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
struct em_data_callback *cb);
int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
struct em_data_callback *cb, cpumask_t *cpus);

Drivers must specify the CPUs of the performance domains using the cpumask
argument, and provide a callback function returning <frequency, power> tuples
for each capacity state. The callback function provided by the driver is free
Drivers must provide a callback function returning <frequency, power> tuples
for each performance state. The callback function provided by the driver is free
to fetch data from any relevant location (DT, firmware, ...), and by any means
deemed necessary. See Section 3. for an example of driver implementing this
deemed necessary. For CPU devices only, drivers must specify the CPUs of the
performance domain using the cpumask argument. For devices other than CPUs,
the last argument must be set to NULL.
See Section 3. for an example of driver implementing this
callback, and kernel/power/energy_model.c for further documentation on this
API.


2.3 Accessing performance domains
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Two API functions provide access to the energy model: em_cpu_get(), which
takes a CPU id as an argument, and em_pd_get(), which takes a device
pointer. Which interface to use depends on the subsystem, but for CPU
devices both functions return the same performance domain.

Subsystems interested in the energy model of a CPU can retrieve it using the
em_cpu_get() API. The energy model tables are allocated once upon creation of
the performance domains, and kept in memory untouched.

The energy consumed by a performance domain can be estimated using the
em_pd_energy() API. The estimation is performed assuming that the schedutil
CPUfreq governor is in use.
em_cpu_energy() API. For CPU devices, the estimation is performed assuming
that the schedutil CPUfreq governor is in use. Currently this calculation is
not provided for other types of devices.

More details about the above APIs can be found in include/linux/energy_model.h.
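
To make the idea of estimating energy from <frequency, power> tuples concrete, here is a simplified userspace model. It is a sketch under stated assumptions, not the kernel's em_cpu_energy(): struct perf_state, pick_state() and estimate_power_mw() are hypothetical names, the table is assumed sorted by ascending non-zero frequency, and the frequency headroom a governor may add is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* One <frequency, power> tuple, as a driver callback would report it. */
struct perf_state {
	unsigned long freq_khz;
	unsigned long power_mw;
};

/*
 * Pick the lowest performance state whose frequency covers 'freq_khz'.
 * Requests above the top frequency fall back to the last state.
 */
const struct perf_state *pick_state(const struct perf_state *tbl, size_t n,
				    unsigned long freq_khz)
{
	size_t i;

	assert(tbl != NULL && n > 0);
	for (i = 0; i < n; i++)
		if (tbl[i].freq_khz >= freq_khz)
			return &tbl[i];
	return &tbl[n - 1];
}

/*
 * Estimate the power drawn at utilization 'util' (out of 'capacity'):
 * map utilization to a required frequency, pick the state that covers
 * it, then scale that state's power by the busy fraction at that state.
 */
unsigned long estimate_power_mw(const struct perf_state *tbl, size_t n,
				unsigned long util, unsigned long capacity)
{
	const struct perf_state *ps;
	unsigned long long need;

	assert(tbl != NULL && n > 0);
	assert(capacity > 0 && util <= capacity);

	/* Required frequency, proportional to utilization. */
	need = (unsigned long long)tbl[n - 1].freq_khz * util / capacity;
	ps = pick_state(tbl, n, (unsigned long)need);

	/* Busy fraction at the chosen state is need / ps->freq_khz. */
	return (unsigned long)(ps->power_mw * need / ps->freq_khz);
}
```

With a three-entry table of 500 MHz at 100 mW, 1 GHz at 300 mW and 2 GHz at 900 mW, a utilization of 384 out of 1024 needs 750 MHz, lands on the 1 GHz state, and is charged three quarters of that state's power.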

@@ -106,42 +117,46 @@ EM framework::

-> drivers/cpufreq/foo_cpufreq.c

static int est_power(unsigned long *mW, unsigned long *KHz, int cpu)
{
	long freq, power;

	/* Use the 'foo' protocol to ceil the frequency */
	freq = foo_get_freq_ceil(cpu, *KHz);
	if (freq < 0)
		return freq;

	/* Estimate the power cost for the CPU at the relevant freq. */
	power = foo_estimate_power(cpu, freq);
	if (power < 0)
		return power;

	/* Return the values to the EM framework */
	*mW = power;
	*KHz = freq;

	return 0;
}

static int foo_cpufreq_init(struct cpufreq_policy *policy)
{
	struct em_data_callback em_cb = EM_DATA_CB(est_power);
	int nr_opp, ret;

	/* Do the actual CPUFreq init work ... */
	ret = do_foo_cpufreq_init(policy);
	if (ret)
		return ret;

	/* Find the number of OPPs for this policy */
	nr_opp = foo_get_nr_opp(policy);

	/* And register the new performance domain */
	em_register_perf_domain(policy->cpus, nr_opp, &em_cb);

	return 0;
}
static int est_power(unsigned long *mW, unsigned long *KHz,
		struct device *dev)
{
	long freq, power;

	/* Use the 'foo' protocol to ceil the frequency */
	freq = foo_get_freq_ceil(dev, *KHz);
	if (freq < 0)
		return freq;

	/* Estimate the power cost for the dev at the relevant freq. */
	power = foo_estimate_power(dev, freq);
	if (power < 0)
		return power;

	/* Return the values to the EM framework */
	*mW = power;
	*KHz = freq;

	return 0;
}

static int foo_cpufreq_init(struct cpufreq_policy *policy)
{
	struct em_data_callback em_cb = EM_DATA_CB(est_power);
	struct device *cpu_dev;
	int nr_opp, ret;

	cpu_dev = get_cpu_device(cpumask_first(policy->cpus));

	/* Do the actual CPUFreq init work ... */
	ret = do_foo_cpufreq_init(policy);
	if (ret)
		return ret;

	/* Find the number of OPPs for this policy */
	nr_opp = foo_get_nr_opp(policy);

	/* And register the new performance domain */
	em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus);

	return 0;
}
15 changes: 10 additions & 5 deletions Documentation/power/powercap/powercap.rst
@@ -167,11 +167,13 @@ For example::
package-0
---------

The Intel RAPL technology allows two constraints, short term and long term,
with two different time windows to be applied to each power zone. Thus for
each zone there are 2 attributes representing the constraint names, 2 power
limits and 2 attributes representing the sizes of the time windows. Such that,
constraint_j_* attributes correspond to the jth constraint (j = 0,1).
Depending on different power zones, the Intel RAPL technology allows
one or multiple constraints like short term, long term and peak power,
with different time windows to be applied to each power zone.
All the zones contain attributes representing the constraint names,
power limits and the sizes of the time windows. Note that time window
is not applicable to peak power. Here, constraint_j_* attributes
correspond to the jth constraint (j = 0,1,2).

For example::

@@ -181,6 +183,9 @@ For example::
constraint_1_name
constraint_1_power_limit_uw
constraint_1_time_window_us
constraint_2_name
constraint_2_power_limit_uw
constraint_2_time_window_us
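
The constraint_j_* naming scheme can be generated rather than hard-coded. The sketch below is a userspace illustration only; rapl_constraint_attr() is a hypothetical helper, and the three field names are taken from the listing above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Per-constraint fields shown in the attribute listing. */
static const char *const rapl_fields[] = {
	"name", "power_limit_uw", "time_window_us",
};

/*
 * Format "constraint_<j>_<field>" into 'buf'.
 * Returns 0 on success, -1 for an unknown field or a too-small buffer.
 */
int rapl_constraint_attr(char *buf, size_t len, unsigned int j,
			 const char *field)
{
	size_t i, nfields = sizeof(rapl_fields) / sizeof(rapl_fields[0]);
	int n;

	assert(buf != NULL && field != NULL);
	for (i = 0; i < nfields; i++)
		if (strcmp(field, rapl_fields[i]) == 0)
			break;
	if (i == nfields)
		return -1;

	n = snprintf(buf, len, "constraint_%u_%s", j, field);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would append the result to the power zone's sysfs directory. Which constraints actually exist depends on the zone, so a missing file should be treated as "constraint not supported" rather than as an error.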

Power Zone Attributes
=====================
9 changes: 9 additions & 0 deletions MAINTAINERS
@@ -11153,6 +11153,15 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl.git
F: Documentation/devicetree/bindings/memory-controllers/
F: drivers/memory/

MEMORY FREQUENCY SCALING DRIVERS FOR NVIDIA TEGRA
M: Dmitry Osipenko <[email protected]>
L: [email protected]
L: [email protected]
T: git git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux.git
S: Maintained
F: drivers/devfreq/tegra20-devfreq.c
F: drivers/devfreq/tegra30-devfreq.c

MEMORY MANAGEMENT
M: Andrew Morton <[email protected]>
L: [email protected]
26 changes: 2 additions & 24 deletions arch/powerpc/platforms/cell/cpufreq_spudemand.c
@@ -126,30 +126,8 @@ static struct cpufreq_governor spu_governor = {
.stop = spu_gov_stop,
.owner = THIS_MODULE,
};

/*
* module init and destoy
*/

static int __init spu_gov_init(void)
{
int ret;

ret = cpufreq_register_governor(&spu_governor);
if (ret)
printk(KERN_ERR "registration of governor failed\n");
return ret;
}

static void __exit spu_gov_exit(void)
{
cpufreq_unregister_governor(&spu_governor);
}


module_init(spu_gov_init);
module_exit(spu_gov_exit);
cpufreq_governor_init(spu_governor);
cpufreq_governor_exit(spu_governor);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Christian Krafft <[email protected]>");

6 changes: 4 additions & 2 deletions arch/x86/include/asm/msr-index.h
@@ -149,6 +149,10 @@

#define MSR_LBR_SELECT 0x000001c8
#define MSR_LBR_TOS 0x000001c9

#define MSR_IA32_POWER_CTL 0x000001fc
#define MSR_IA32_POWER_CTL_BIT_EE 19

#define MSR_LBR_NHM_FROM 0x00000680
#define MSR_LBR_NHM_TO 0x000006c0
#define MSR_LBR_CORE_FROM 0x00000040
@@ -269,8 +273,6 @@

#define MSR_PEBS_FRONTEND 0x000003f7

#define MSR_IA32_POWER_CTL 0x000001fc

#define MSR_IA32_MC0_CTL 0x00000400
#define MSR_IA32_MC0_STATUS 0x00000401
#define MSR_IA32_MC0_ADDR 0x00000402
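
The new MSR_IA32_POWER_CTL_BIT_EE define names bit 19 of MSR_IA32_POWER_CTL. The sketch below shows how such a bit could be tested and flipped on a plain 64-bit value in userspace. It makes one assumption that this diff does not state: that a set bit means the energy-efficiency optimizations are disabled. The helper names ee_enabled() and ee_set_enabled() are hypothetical, and no MSR is actually read or written.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IA32_POWER_CTL_BIT_EE	19	/* value from the hunk above */

/* Assumption: bit set means energy-efficiency optimizations are disabled. */
int ee_enabled(uint64_t power_ctl)
{
	return !(power_ctl & (UINT64_C(1) << MSR_IA32_POWER_CTL_BIT_EE));
}

/* Return 'power_ctl' with the EE bit updated; all other bits are preserved. */
uint64_t ee_set_enabled(uint64_t power_ctl, int enable)
{
	uint64_t bit = UINT64_C(1) << MSR_IA32_POWER_CTL_BIT_EE;

	assert(enable == 0 || enable == 1);
	return enable ? (power_ctl & ~bit) : (power_ctl | bit);
}
```

Keeping the read-modify-write in one helper makes it harder to clobber unrelated bits of the register when only this one control is meant to change.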