Skip to content

Commit

Permalink
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Browse files Browse the repository at this point in the history
Pull KVM updates from Paolo Bonzini:
 "For x86, there is a new alternative and (in the future) more scalable
  implementation of extended page tables that does not need a reverse
  map from guest physical addresses to host physical addresses.

  For now it is disabled by default because it is still lacking a few of
  the existing MMU's bells and whistles. However it is a very solid
  piece of work and it is already available for people to hammer on it.

  Other updates:

  ARM:
   - New page table code for both hypervisor and guest stage-2
   - Introduction of a new EL2-private host context
   - Allow EL2 to have its own private per-CPU variables
   - Support of PMU event filtering
   - Complete rework of the Spectre mitigation

  PPC:
   - Fix for running nested guests with in-kernel IRQ chip
   - Fix race condition causing occasional host hard lockup
   - Minor cleanups and bugfixes

  x86:
   - allow trapping unknown MSRs to userspace
   - allow userspace to force #GP on specific MSRs
   - INVPCID support on AMD
   - nested AMD cleanup, on demand allocation of nested SVM state
   - hide PV MSRs and hypercalls for features not enabled in CPUID
   - new test for MSR_IA32_TSC writes from host and guest
   - cleanups: MMU, CPUID, shared MSRs
   - LAPIC latency optimizations ad bugfixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (232 commits)
  kvm: x86/mmu: NX largepage recovery for TDP MMU
  kvm: x86/mmu: Don't clear write flooding count for direct roots
  kvm: x86/mmu: Support MMIO in the TDP MMU
  kvm: x86/mmu: Support write protection for nesting in tdp MMU
  kvm: x86/mmu: Support disabling dirty logging for the tdp MMU
  kvm: x86/mmu: Support dirty logging for the TDP MMU
  kvm: x86/mmu: Support changed pte notifier in tdp MMU
  kvm: x86/mmu: Add access tracking for tdp_mmu
  kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU
  kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
  kvm: x86/mmu: Add TDP MMU PF handler
  kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
  kvm: x86/mmu: Support zapping SPTEs in the TDP MMU
  KVM: Cache as_id in kvm_memory_slot
  kvm: x86/mmu: Add functions to handle changed TDP SPTEs
  kvm: x86/mmu: Allocate and free TDP MMU roots
  kvm: x86/mmu: Init / Uninit the TDP MMU
  kvm: x86/mmu: Introduce tdp_iter
  KVM: mmu: extract spte.h and spte.c
  KVM: mmu: Separate updating a PTE from kvm_set_pte_rmapp
  ...
  • Loading branch information
torvalds committed Oct 23, 2020
2 parents 9313f80 + 29cf0f5 commit f9a705a
Show file tree
Hide file tree
Showing 119 changed files with 8,444 additions and 4,976 deletions.
216 changes: 207 additions & 9 deletions Documentation/virt/kvm/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4498,11 +4498,14 @@ Currently, the following list of CPUID leaves are returned:
- HYPERV_CPUID_ENLIGHTMENT_INFO
- HYPERV_CPUID_IMPLEMENT_LIMITS
- HYPERV_CPUID_NESTED_FEATURES
- HYPERV_CPUID_SYNDBG_VENDOR_AND_MAX_FUNCTIONS
- HYPERV_CPUID_SYNDBG_INTERFACE
- HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES

HYPERV_CPUID_NESTED_FEATURES leaf is only exposed when Enlightened VMCS was
enabled on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS).

Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
Userspace invokes KVM_GET_SUPPORTED_HV_CPUID by passing a kvm_cpuid2 structure
with the 'nent' field indicating the number of entries in the variable-size
array 'entries'. If the number of entries is too low to describe all Hyper-V
feature leaves, an error (E2BIG) is returned. If the number is more or equal
Expand Down Expand Up @@ -4704,6 +4707,106 @@ KVM_PV_VM_VERIFY
Verify the integrity of the unpacked image. Only if this succeeds,
KVM is allowed to start protected VCPUs.

4.126 KVM_X86_SET_MSR_FILTER
----------------------------

:Capability: KVM_X86_SET_MSR_FILTER
:Architectures: x86
:Type: vm ioctl
:Parameters: struct kvm_msr_filter
:Returns: 0 on success, < 0 on error

::

struct kvm_msr_filter_range {
#define KVM_MSR_FILTER_READ (1 << 0)
#define KVM_MSR_FILTER_WRITE (1 << 1)
__u32 flags;
__u32 nmsrs; /* number of msrs in bitmap */
__u32 base; /* MSR index the bitmap starts at */
__u8 *bitmap; /* a 1 bit allows the operations in flags, 0 denies */
};

#define KVM_MSR_FILTER_MAX_RANGES 16
struct kvm_msr_filter {
#define KVM_MSR_FILTER_DEFAULT_ALLOW (0 << 0)
#define KVM_MSR_FILTER_DEFAULT_DENY (1 << 0)
__u32 flags;
struct kvm_msr_filter_range ranges[KVM_MSR_FILTER_MAX_RANGES];
};

flags values for ``struct kvm_msr_filter_range``:

``KVM_MSR_FILTER_READ``

Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
indicates that a read should immediately fail, while a 1 indicates that
a read for a particular MSR should be handled regardless of the default
filter action.

``KVM_MSR_FILTER_WRITE``

Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
indicates that a write should immediately fail, while a 1 indicates that
a write for a particular MSR should be handled regardless of the default
filter action.

``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE``

Filter both read and write accesses to MSRs using the given bitmap. A 0
in the bitmap indicates that both reads and writes should immediately fail,
while a 1 indicates that reads and writes for a particular MSR are not
filtered by this range.

flags values for ``struct kvm_msr_filter``:

``KVM_MSR_FILTER_DEFAULT_ALLOW``

If no filter range matches an MSR index that is getting accessed, KVM will
fall back to allowing access to the MSR.

``KVM_MSR_FILTER_DEFAULT_DENY``

If no filter range matches an MSR index that is getting accessed, KVM will
fall back to rejecting access to the MSR. In this mode, all MSRs that should
be processed by KVM need to explicitly be marked as allowed in the bitmaps.

This ioctl allows user space to define up to 16 bitmaps of MSR ranges to
specify whether a certain MSR access should be explicitly filtered for or not.

If this ioctl has never been invoked, MSR accesses are not guarded and the
default KVM in-kernel emulation behavior is fully preserved.

Calling this ioctl with an empty set of ranges (all nmsrs == 0) disables MSR
filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes
an error.

As soon as the filtering is in place, every MSR access is processed through
the filtering except for accesses to the x2APIC MSRs (from 0x800 to 0x8ff);
x2APIC MSRs are always allowed, independent of the ``default_allow`` setting,
and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base
register.

If a bit is within one of the defined ranges, read and write accesses are
guarded by the bitmap's value for the MSR index if the kind of access
is included in the ``struct kvm_msr_filter_range`` flags. If no range
cover this particular access, the behavior is determined by the flags
field in the kvm_msr_filter struct: ``KVM_MSR_FILTER_DEFAULT_ALLOW``
and ``KVM_MSR_FILTER_DEFAULT_DENY``.

Each bitmap range specifies a range of MSRs to potentially allow access on.
The range goes from MSR index [base .. base+nmsrs]. The flags field
indicates whether reads, writes or both reads and writes are filtered
by setting a 1 bit in the bitmap for the corresponding MSR index.

If an MSR access is not permitted through the filtering, it generates a
#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that
allows user space to deflect and potentially handle various MSR accesses
into user space.

If a vCPU is in running state while this ioctl is invoked, the vCPU may
experience inconsistent filtering behavior on MSR accesses.


5. The kvm_run structure
========================
Expand Down Expand Up @@ -4869,14 +4972,13 @@ to the byte array.

.. note::

For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
KVM_EXIT_EPR the corresponding

operations are complete (and guest state is consistent) only after userspace
has re-entered the kernel with KVM_RUN. The kernel side will first finish
incomplete operations and then check for pending signals. Userspace
can re-enter the guest with an unmasked signal pending to complete
pending operations.
For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR,
KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR the corresponding
operations are complete (and guest state is consistent) only after userspace
has re-entered the kernel with KVM_RUN. The kernel side will first finish
incomplete operations and then check for pending signals. Userspace
can re-enter the guest with an unmasked signal pending to complete
pending operations.

::

Expand Down Expand Up @@ -5163,6 +5265,44 @@ Note that KVM does not skip the faulting instruction as it does for
KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
if it decides to decode and emulate the instruction.

::

/* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */
struct {
__u8 error; /* user -> kernel */
__u8 pad[7];
__u32 reason; /* kernel -> user */
__u32 index; /* kernel -> user */
__u64 data; /* kernel <-> user */
} msr;

Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
exit for writes.

The "reason" field specifies why the MSR trap occurred. User space will only
receive MSR exit traps when a particular reason was requested during through
ENABLE_CAP. Currently valid exit reasons are:

KVM_MSR_EXIT_REASON_UNKNOWN - access to MSR that is unknown to KVM
KVM_MSR_EXIT_REASON_INVAL - access to invalid MSRs or reserved bits
KVM_MSR_EXIT_REASON_FILTER - access blocked by KVM_X86_SET_MSR_FILTER

For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest
wants to read. To respond to this request with a successful read, user space
writes the respective data into the "data" field and must continue guest
execution to ensure the read data is transferred into guest register state.

If the RDMSR request was unsuccessful, user space indicates that with a "1" in
the "error" field. This will inject a #GP into the guest when the VCPU is
executed again.

For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest
wants to write. Once finished processing the event, user space must continue
vCPU execution. If the MSR write was unsuccessful, user space also sets the
"error" field to "1".

::

/* Fix the size of the union. */
Expand Down Expand Up @@ -5852,6 +5992,28 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows
the maximum halt time to specified on a per-VM basis, effectively overriding
the module parameter for the target VM.

7.21 KVM_CAP_X86_USER_SPACE_MSR
-------------------------------

:Architectures: x86
:Target: VM
:Parameters: args[0] contains the mask of KVM_MSR_EXIT_REASON_* events to report
:Returns: 0 on success; -1 on error

This capability enables trapping of #GP invoking RDMSR and WRMSR instructions
into user space.

When a guest requests to read or write an MSR, KVM may not implement all MSRs
that are relevant to a respective system. It also does not differentiate by
CPU type.

To allow more fine grained control over MSR handling, user space may enable
this capability. With it enabled, MSR accesses that match the mask specified in
args[0] and trigger a #GP event inside the guest by KVM will instead trigger
KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications which user space
can then handle to implement model specific MSR handling and/or user notifications
to inform a user that an MSR was not handled.

8. Other capabilities.
======================

Expand Down Expand Up @@ -6193,3 +6355,39 @@ distribution...)

If this capability is available, then the CPNC and CPVC can be synchronized
between KVM and userspace via the sync regs mechanism (KVM_SYNC_DIAG318).

8.26 KVM_CAP_X86_USER_SPACE_MSR
-------------------------------

:Architectures: x86

This capability indicates that KVM supports deflection of MSR reads and
writes to user space. It can be enabled on a VM level. If enabled, MSR
accesses that would usually trigger a #GP by KVM into the guest will
instead get bounced to user space through the KVM_EXIT_X86_RDMSR and
KVM_EXIT_X86_WRMSR exit notifications.

8.25 KVM_X86_SET_MSR_FILTER
---------------------------

:Architectures: x86

This capability indicates that KVM supports that accesses to user defined MSRs
may be rejected. With this capability exposed, KVM exports new VM ioctl
KVM_X86_SET_MSR_FILTER which user space can call to specify bitmaps of MSR
ranges that KVM should reject access to.

In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
trap and emulate MSRs that are outside of the scope of KVM as well as
limit the attack surface on KVM's MSR emulation code.


8.26 KVM_CAP_ENFORCE_PV_CPUID
-----------------------------

Architectures: x86

When enabled, KVM will disable paravirtual features provided to the
guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
(0x40000001). Otherwise, a guest may use the paravirtual features
regardless of what has actually been exposed through the CPUID leaf.
88 changes: 44 additions & 44 deletions Documentation/virt/kvm/cpuid.rst
Original file line number Diff line number Diff line change
Expand Up @@ -38,64 +38,64 @@ returns::

where ``flag`` is defined as below:

================================= =========== ================================
flag value meaning
================================= =========== ================================
KVM_FEATURE_CLOCKSOURCE 0 kvmclock available at msrs
0x11 and 0x12
================================== =========== ================================
flag value meaning
================================== =========== ================================
KVM_FEATURE_CLOCKSOURCE 0 kvmclock available at msrs
0x11 and 0x12

KVM_FEATURE_NOP_IO_DELAY 1 not necessary to perform delays
on PIO operations
KVM_FEATURE_NOP_IO_DELAY 1 not necessary to perform delays
on PIO operations

KVM_FEATURE_MMU_OP 2 deprecated
KVM_FEATURE_MMU_OP 2 deprecated

KVM_FEATURE_CLOCKSOURCE2 3 kvmclock available at msrs
0x4b564d00 and 0x4b564d01
KVM_FEATURE_CLOCKSOURCE2 3 kvmclock available at msrs
0x4b564d00 and 0x4b564d01

KVM_FEATURE_ASYNC_PF 4 async pf can be enabled by
writing to msr 0x4b564d02
KVM_FEATURE_ASYNC_PF 4 async pf can be enabled by
writing to msr 0x4b564d02

KVM_FEATURE_STEAL_TIME 5 steal time can be enabled by
writing to msr 0x4b564d03
KVM_FEATURE_STEAL_TIME 5 steal time can be enabled by
writing to msr 0x4b564d03

KVM_FEATURE_PV_EOI 6 paravirtualized end of interrupt
handler can be enabled by
writing to msr 0x4b564d04
KVM_FEATURE_PV_EOI 6 paravirtualized end of interrupt
handler can be enabled by
writing to msr 0x4b564d04

KVM_FEATURE_PV_UNHAULT 7 guest checks this feature bit
before enabling paravirtualized
spinlock support
KVM_FEATURE_PV_UNHALT 7 guest checks this feature bit
before enabling paravirtualized
spinlock support

KVM_FEATURE_PV_TLB_FLUSH 9 guest checks this feature bit
before enabling paravirtualized
tlb flush
KVM_FEATURE_PV_TLB_FLUSH 9 guest checks this feature bit
before enabling paravirtualized
tlb flush

KVM_FEATURE_ASYNC_PF_VMEXIT 10 paravirtualized async PF VM EXIT
can be enabled by setting bit 2
when writing to msr 0x4b564d02
KVM_FEATURE_ASYNC_PF_VMEXIT 10 paravirtualized async PF VM EXIT
can be enabled by setting bit 2
when writing to msr 0x4b564d02

KVM_FEATURE_PV_SEND_IPI 11 guest checks this feature bit
before enabling paravirtualized
sebd IPIs
KVM_FEATURE_PV_SEND_IPI 11 guest checks this feature bit
before enabling paravirtualized
send IPIs

KVM_FEATURE_POLL_CONTROL 12 host-side polling on HLT can
be disabled by writing
to msr 0x4b564d05.
KVM_FEATURE_POLL_CONTROL 12 host-side polling on HLT can
be disabled by writing
to msr 0x4b564d05.

KVM_FEATURE_PV_SCHED_YIELD 13 guest checks this feature bit
before using paravirtualized
sched yield.
KVM_FEATURE_PV_SCHED_YIELD 13 guest checks this feature bit
before using paravirtualized
sched yield.

KVM_FEATURE_ASYNC_PF_INT 14 guest checks this feature bit
before using the second async
pf control msr 0x4b564d06 and
async pf acknowledgment msr
0x4b564d07.
KVM_FEATURE_ASYNC_PF_INT 14 guest checks this feature bit
before using the second async
pf control msr 0x4b564d06 and
async pf acknowledgment msr
0x4b564d07.

KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24 host will warn if no guest-side
per-cpu warps are expeced in
kvmclock
================================= =========== ================================
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side
per-cpu warps are expected in
kvmclock
================================== =========== ================================

::

Expand Down
Loading

0 comments on commit f9a705a

Please sign in to comment.