Skip to content

Commit

Permalink
Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/li…
Browse files Browse the repository at this point in the history
…nux/kernel/git/tip/tip

Pull x86 cache allocation interface from Thomas Gleixner:
 "This provides support for Intel's Cache Allocation Technology, a cache
  partitioning mechanism.

  The interface is odd, but the hardware interface of that CAT stuff is
  odd as well.

  We tried hard to come up with an abstraction, but that only allows
  rather simple partitioning, but no way of sharing and dealing with the
  per package nature of this mechanism.

  In the end we decided to expose the allocation bitmaps directly so all
  combinations of the hardware can be utilized.

  There are two ways of associating a cache partition:

   - Task

     A task can be added to a resource group. It uses the cache
     partition associated to the group.

   - CPU

     All tasks which are not member of a resource group use the group to
     which the CPU they are running on is associated with.

     That allows for simple CPU based partitioning schemes.

  The main expected user sare:

   - Virtualization so a VM can only trash only the associated part of
     the cash w/o disturbing others

   - Real-Time systems to seperate RT and general workloads.

   - Latency sensitive enterprise workloads

   - In theory this also can be used to protect against cache side
     channel attacks"

[ Intel RDT is "Resource Director Technology". The interface really is
  rather odd and very specific, which delayed this pull request while I
  was thinking about it. The pull request itself came in early during
  the merge window, I just delayed it until things had calmed down and I
  had more time.

  But people tell me they'll use this, and the good news is that it is
  _so_ specific that it's rather independent of anything else, and no
  user is going to depend on the interface since it's pretty rare. So if
  push comes to shove, we can just remove the interface and nothing will
  break ]

* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  x86/intel_rdt: Implement show_options() for resctrlfs
  x86/intel_rdt: Call intel_rdt_sched_in() with preemption disabled
  x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount
  x86/intel_rdt: Fix setting of closid when adding CPUs to a group
  x86/intel_rdt: Update percpu closid immeditately on CPUs affected by changee
  x86/intel_rdt: Reset per cpu closids on unmount
  x86/intel_rdt: Select KERNFS when enabling INTEL_RDT_A
  x86/intel_rdt: Prevent deadlock against hotplug lock
  x86/intel_rdt: Protect info directory from removal
  x86/intel_rdt: Add info files to Documentation
  x86/intel_rdt: Export the minimum number of set mask bits in sysfs
  x86/intel_rdt: Propagate error in rdt_mount() properly
  x86/intel_rdt: Add a missing #include
  MAINTAINERS: Add maintainer for Intel RDT resource allocation
  x86/intel_rdt: Add scheduler hook
  x86/intel_rdt: Add schemata file
  x86/intel_rdt: Add tasks files
  x86/intel_rdt: Add cpus file
  x86/intel_rdt: Add mkdir to resctrl file system
  x86/intel_rdt: Add "info" files to resctrl file system
  ...
  • Loading branch information
torvalds committed Dec 22, 2016
2 parents f79f7b1 + 76ae054 commit eb254f3
Show file tree
Hide file tree
Showing 20 changed files with 2,320 additions and 25 deletions.
16 changes: 16 additions & 0 deletions Documentation/ABI/testing/sysfs-devices-system-cpu
Original file line number Diff line number Diff line change
Expand Up @@ -272,6 +272,22 @@ Description: Parameters for the CPU cache attributes
the modified cache line is written to main
memory only when it is replaced


What: /sys/devices/system/cpu/cpu*/cache/index*/id
Date: September 2016
Contact: Linux kernel mailing list <[email protected]>
Description: Cache id

The id provides a unique number for a specific instance of
a cache of a particular type. E.g. there may be a level
3 unified cache on each socket in a server and we may
assign them ids 0, 1, 2, ...

Note that id value can be non-contiguous. E.g. level 1
caches typically exist per core, but there may not be a
power of two cores on a socket, so these caches may be
numbered 0, 1, 2, 3, 4, 5, 8, 9, 10, ...

What: /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats
/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/turbo_stat
/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/sub_turbo_stat
Expand Down
214 changes: 214 additions & 0 deletions Documentation/x86/intel_rdt_ui.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,214 @@
User Interface for Resource Allocation in Intel Resource Director Technology

Copyright (C) 2016 Intel Corporation

Fenghua Yu <[email protected]>
Tony Luck <[email protected]>

This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the
X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".

To use the feature mount the file system:

# mount -t resctrl resctrl [-o cdp] /sys/fs/resctrl

mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.


Info directory
--------------

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names. Each subdirectory contains the
following files:

"num_closids": The number of CLOSIDs which are valid for this
resource. The kernel uses the smallest number of
CLOSIDs of all enabled resources as limit.

"cbm_mask": The bitmask which is valid for this resource. This
mask is equivalent to 100%.

"min_cbm_bits": The minimum number of consecutive bits which must be
set when writing a mask.


Resource groups
---------------
Resource groups are represented as directories in the resctrl file
system. The default group is the root directory. Other groups may be
created as desired by the system administrator using the "mkdir(1)"
command, and removed using "rmdir(1)".

There are three files associated with each group:

"tasks": A list of tasks that belongs to this group. Tasks can be
added to a group by writing the task ID to the "tasks" file
(which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2)
and clone(2) are added to the same group as their parent.
If a pid is not in any sub partition, it is in root partition
(i.e. default partition).

"cpus": A bitmask of logical CPUs assigned to this group. Writing
a new mask can add/remove CPUs from this group. Added CPUs
are removed from their previous group. Removed ones are
given to the default (root) group. You cannot remove CPUs
from the default group.

"schemata": A list of all the resources available to this group.
Each resource has its own line and format - see below for
details.

When a task is running the following rules define which resources
are available to it:

1) If the task is a member of a non-default group, then the schemata
for that group is used.

2) Else if the task belongs to the default group, but is running on a
CPU that is assigned to some specific group, then the schemata for
the CPU's group is used.

3) Otherwise the schemata for the default group is used.


Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps). To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id

Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". X86 hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.


L3 details (code and data prioritization disabled)
--------------------------------------------------
With CDP disabled the L3 schemata format is:

L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 details (CDP enabled via mount option to resctrl)
----------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:

L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 details
----------
L2 cache does not support code and data prioritization, so the
schemata format is always:

L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Example 2
---------
Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 cannot be used by ordinary tasks:

# echo "L3:0=3ff;1=fffff" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.

# mkdir p0
# echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.

# echo 1234 > p0/tasks
# taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache):

# mkdir p1
# echo "L3:0=7c00;1=fffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2 5678

Example 3
---------

A single socket system which has real-time tasks running on core 4-7 and
non real-time workload assigned to core 0-3. The real-time tasks share text
and data, so a per task association is not required and due to interaction
with the kernel it's desired that the kernel on these cores shares L3 with
the tasks.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 cannot be used by ordinary tasks:

# echo "L3:0=3ff" > schemata

Next we make a resource group for our real time cores and give
it access to the "top" 50% of the cache on socket 0.

# mkdir p0
# echo "L3:0=ffc00;" > p0/schemata

Finally we move core 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache.

# echo C0 > p0/cpus
8 changes: 8 additions & 0 deletions MAINTAINERS
Original file line number Diff line number Diff line change
Expand Up @@ -10327,6 +10327,14 @@ L: [email protected]
S: Supported
F: drivers/infiniband/sw/rdmavt

RDT - RESOURCE ALLOCATION
M: Fenghua Yu <[email protected]>
L: [email protected]
S: Supported
F: arch/x86/kernel/cpu/intel_rdt*
F: arch/x86/include/asm/intel_rdt*
F: Documentation/x86/intel_rdt*

READ-COPY UPDATE (RCU)
M: "Paul E. McKenney" <[email protected]>
M: Josh Triplett <[email protected]>
Expand Down
13 changes: 13 additions & 0 deletions arch/x86/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -412,6 +412,19 @@ config GOLDFISH
def_bool y
depends on X86_GOLDFISH

config INTEL_RDT_A
bool "Intel Resource Director Technology Allocation support"
default n
depends on X86 && CPU_SUP_INTEL
select KERNFS
help
Select to enable resource allocation which is a sub-feature of
Intel Resource Director Technology(RDT). More information about
RDT can be found in the Intel x86 Architecture Software
Developer Manual.

Say N if unsure.

if X86_32
config X86_EXTENDED_PLATFORM
bool "Support for extended (non-PC) x86 platforms"
Expand Down
23 changes: 2 additions & 21 deletions arch/x86/events/intel/cqm.c
Original file line number Diff line number Diff line change
Expand Up @@ -7,9 +7,9 @@
#include <linux/perf_event.h>
#include <linux/slab.h>
#include <asm/cpu_device_id.h>
#include <asm/intel_rdt_common.h>
#include "../perf_event.h"

#define MSR_IA32_PQR_ASSOC 0x0c8f
#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d

Expand All @@ -24,32 +24,13 @@ static unsigned int cqm_l3_scale; /* supposedly cacheline size */
static bool cqm_enabled, mbm_enabled;
unsigned int mbm_socket_max;

/**
* struct intel_pqr_state - State cache for the PQR MSR
* @rmid: The cached Resource Monitoring ID
* @closid: The cached Class Of Service ID
* @rmid_usecnt: The usage counter for rmid
*
* The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
* lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
* contains both parts, so we need to cache them.
*
* The cache also helps to avoid pointless updates if the value does
* not change.
*/
struct intel_pqr_state {
u32 rmid;
u32 closid;
int rmid_usecnt;
};

/*
* The cached intel_pqr_state is strictly per CPU and can never be
* updated from a remote CPU. Both functions which modify the state
* (intel_cqm_event_start and intel_cqm_event_stop) are called with
* interrupts disabled, which is sufficient for the protection.
*/
static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
static struct hrtimer *mbm_timers;
/**
* struct sample - mbm event's (local or total) data
Expand Down
4 changes: 4 additions & 0 deletions arch/x86/include/asm/cpufeatures.h
Original file line number Diff line number Diff line change
Expand Up @@ -189,6 +189,9 @@

#define X86_FEATURE_CPB ( 7*32+ 2) /* AMD Core Performance Boost */
#define X86_FEATURE_EPB ( 7*32+ 3) /* IA32_ENERGY_PERF_BIAS support */
#define X86_FEATURE_CAT_L3 ( 7*32+ 4) /* Cache Allocation Technology L3 */
#define X86_FEATURE_CAT_L2 ( 7*32+ 5) /* Cache Allocation Technology L2 */
#define X86_FEATURE_CDP_L3 ( 7*32+ 6) /* Code and Data Prioritization L3 */

#define X86_FEATURE_HW_PSTATE ( 7*32+ 8) /* AMD HW-PState */
#define X86_FEATURE_PROC_FEEDBACK ( 7*32+ 9) /* AMD ProcFeedbackInterface */
Expand Down Expand Up @@ -222,6 +225,7 @@
#define X86_FEATURE_RTM ( 9*32+11) /* Restricted Transactional Memory */
#define X86_FEATURE_CQM ( 9*32+12) /* Cache QoS Monitoring */
#define X86_FEATURE_MPX ( 9*32+14) /* Memory Protection Extension */
#define X86_FEATURE_RDT_A ( 9*32+15) /* Resource Director Technology Allocation */
#define X86_FEATURE_AVX512F ( 9*32+16) /* AVX-512 Foundation */
#define X86_FEATURE_AVX512DQ ( 9*32+17) /* AVX-512 DQ (Double/Quad granular) Instructions */
#define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */
Expand Down
Loading

0 comments on commit eb254f3

Please sign in to comment.