Skip to content

Commit

Permalink
Merge remote-tracking branch 'kvm/queue' into HEAD
Browse files Browse the repository at this point in the history
x86 Xen-for-KVM:

* Allow the Xen runstate information to cross a page boundary

* Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured

* add support for 32-bit guests in SCHEDOP_poll

x86 fixes:

* One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).

* Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
   years back when eliminating unnecessary barriers when switching between
   vmcs01 and vmcs02.

* Clean up the MSR filter docs.

* Clean up vmread_error_trampoline() to make it more obvious that params
  must be passed on the stack, even for x86-64.

* Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
  of the current guest CPUID.

* Fudge around a race with TSC refinement that results in KVM incorrectly
  thinking a guest needs TSC scaling when running on a CPU with a
  constant TSC, but no hardware-enumerated TSC frequency.

* Advertise (on AMD) that the SMM_CTL MSR is not supported

* Remove unnecessary exports

Selftests:

* Fix an inverted check in the access tracking perf test, and restore
  support for asserting that there aren't too many idle pages when
  running on bare metal.

* Fix an ordering issue in the AMX test introduced by recent conversions
  to use kvm_cpu_has(), and harden the code to guard against similar bugs
  in the future.  Anything that tiggers caching of KVM's supported CPUID,
  kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
  the caching occurs before the test opts in via prctl().

* Fix build errors that occur in certain setups (unsure exactly what is
  unique about the problematic setup) due to glibc overriding
  static_assert() to a variant that requires a custom message.

* Introduce actual atomics for clear/set_bit() in selftests

Documentation:

* Remove deleted ioctls from documentation

* Various fixes
  • Loading branch information
bonzini committed Dec 12, 2022
2 parents 2afc1fb + 5656374 commit 9352e74
Show file tree
Hide file tree
Showing 58 changed files with 1,077 additions and 674 deletions.
182 changes: 90 additions & 92 deletions Documentation/virt/kvm/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -272,18 +272,6 @@ the VCPU file descriptor can be mmap-ed, including:
KVM_CAP_DIRTY_LOG_RING, see section 8.3.


4.6 KVM_SET_MEMORY_REGION
-------------------------

:Capability: basic
:Architectures: all
:Type: vm ioctl
:Parameters: struct kvm_memory_region (in)
:Returns: 0 on success, -1 on error

This ioctl is obsolete and has been removed.


4.7 KVM_CREATE_VCPU
-------------------

Expand Down Expand Up @@ -368,17 +356,6 @@ see the description of the capability.
Note that the Xen shared info page, if configured, shall always be assumed
to be dirty. KVM will not explicitly mark it such.

4.9 KVM_SET_MEMORY_ALIAS
------------------------

:Capability: basic
:Architectures: x86
:Type: vm ioctl
:Parameters: struct kvm_memory_alias (in)
:Returns: 0 (success), -1 (error)

This ioctl is obsolete and has been removed.


4.10 KVM_RUN
------------
Expand Down Expand Up @@ -1332,7 +1309,7 @@ yet and must be cleared on entry.
__u64 userspace_addr; /* start of the userspace allocated memory */
};

/* for kvm_memory_region::flags */
/* for kvm_userspace_memory_region::flags */
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)

Expand Down Expand Up @@ -1377,10 +1354,6 @@ the memory region are automatically reflected into the guest. For example, an
mmap() that affects the region will be made visible immediately. Another
example is madvise(MADV_DROP).

It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
The KVM_SET_MEMORY_REGION does not allow fine grained control over memory
allocation and is deprecated.


4.36 KVM_SET_TSS_ADDR
---------------------
Expand Down Expand Up @@ -3293,6 +3266,7 @@ valid entries found.
----------------------

:Capability: KVM_CAP_DEVICE_CTRL
:Architectures: all
:Type: vm ioctl
:Parameters: struct kvm_create_device (in/out)
:Returns: 0 on success, -1 on error
Expand Down Expand Up @@ -3333,6 +3307,7 @@ number.
:Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
KVM_CAP_VCPU_ATTRIBUTES for vcpu device
KVM_CAP_SYS_ATTRIBUTES for system (/dev/kvm) device (no set)
:Architectures: x86, arm64, s390
:Type: device ioctl, vm ioctl, vcpu ioctl
:Parameters: struct kvm_device_attr
:Returns: 0 on success, -1 on error
Expand Down Expand Up @@ -4104,80 +4079,71 @@ flags values for ``struct kvm_msr_filter_range``:
``KVM_MSR_FILTER_READ``

Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
indicates that a read should immediately fail, while a 1 indicates that
a read for a particular MSR should be handled regardless of the default
indicates that read accesses should be denied, while a 1 indicates that
a read for a particular MSR should be allowed regardless of the default
filter action.

``KVM_MSR_FILTER_WRITE``

Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
indicates that a write should immediately fail, while a 1 indicates that
a write for a particular MSR should be handled regardless of the default
indicates that write accesses should be denied, while a 1 indicates that
a write for a particular MSR should be allowed regardless of the default
filter action.

``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE``

Filter both read and write accesses to MSRs using the given bitmap. A 0
in the bitmap indicates that both reads and writes should immediately fail,
while a 1 indicates that reads and writes for a particular MSR are not
filtered by this range.

flags values for ``struct kvm_msr_filter``:

``KVM_MSR_FILTER_DEFAULT_ALLOW``

If no filter range matches an MSR index that is getting accessed, KVM will
fall back to allowing access to the MSR.
allow accesses to all MSRs by default.

``KVM_MSR_FILTER_DEFAULT_DENY``

If no filter range matches an MSR index that is getting accessed, KVM will
fall back to rejecting access to the MSR. In this mode, all MSRs that should
be processed by KVM need to explicitly be marked as allowed in the bitmaps.
deny accesses to all MSRs by default.

This ioctl allows user space to define up to 16 bitmaps of MSR ranges to
specify whether a certain MSR access should be explicitly filtered for or not.
This ioctl allows userspace to define up to 16 bitmaps of MSR ranges to deny
guest MSR accesses that would normally be allowed by KVM. If an MSR is not
covered by a specific range, the "default" filtering behavior applies. Each
bitmap range covers MSRs from [base .. base+nmsrs).

If this ioctl has never been invoked, MSR accesses are not guarded and the
default KVM in-kernel emulation behavior is fully preserved.
If an MSR access is denied by userspace, the resulting KVM behavior depends on
whether or not KVM_CAP_X86_USER_SPACE_MSR's KVM_MSR_EXIT_REASON_FILTER is
enabled. If KVM_MSR_EXIT_REASON_FILTER is enabled, KVM will exit to userspace
on denied accesses, i.e. userspace effectively intercepts the MSR access. If
KVM_MSR_EXIT_REASON_FILTER is not enabled, KVM will inject a #GP into the guest
on denied accesses.

If an MSR access is allowed by userspace, KVM will emulate and/or virtualize
the access in accordance with the vCPU model. Note, KVM may still ultimately
inject a #GP if an access is allowed by userspace, e.g. if KVM doesn't support
the MSR, or to follow architectural behavior for the MSR.

By default, KVM operates in KVM_MSR_FILTER_DEFAULT_ALLOW mode with no MSR range
filters.

Calling this ioctl with an empty set of ranges (all nmsrs == 0) disables MSR
filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes
an error.

As soon as the filtering is in place, every MSR access is processed through
the filtering except for accesses to the x2APIC MSRs (from 0x800 to 0x8ff);
x2APIC MSRs are always allowed, independent of the ``default_allow`` setting,
and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base
register.

.. warning::
MSR accesses coming from nested vmentry/vmexit are not filtered.
MSR accesses as part of nested VM-Enter/VM-Exit are not filtered.
This includes both writes to individual VMCS fields and reads/writes
through the MSR lists pointed to by the VMCS.

If a bit is within one of the defined ranges, read and write accesses are
guarded by the bitmap's value for the MSR index if the kind of access
is included in the ``struct kvm_msr_filter_range`` flags. If no range
cover this particular access, the behavior is determined by the flags
field in the kvm_msr_filter struct: ``KVM_MSR_FILTER_DEFAULT_ALLOW``
and ``KVM_MSR_FILTER_DEFAULT_DENY``.

Each bitmap range specifies a range of MSRs to potentially allow access on.
The range goes from MSR index [base .. base+nmsrs]. The flags field
indicates whether reads, writes or both reads and writes are filtered
by setting a 1 bit in the bitmap for the corresponding MSR index.

If an MSR access is not permitted through the filtering, it generates a
#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that
allows user space to deflect and potentially handle various MSR accesses
into user space.
x2APIC MSR accesses cannot be filtered (KVM silently ignores filters that
cover any x2APIC MSRs).

Note, invoking this ioctl while a vCPU is running is inherently racy. However,
KVM does guarantee that vCPUs will see either the previous filter or the new
filter, e.g. MSRs with identical settings in both the old and new filter will
have deterministic behavior.

Similarly, if userspace wishes to intercept on denied accesses,
KVM_MSR_EXIT_REASON_FILTER must be enabled before activating any filters, and
left enabled until after all filters are deactivated. Failure to do so may
result in KVM injecting a #GP instead of exiting to userspace.

4.98 KVM_CREATE_SPAPR_TCE_64
----------------------------

Expand Down Expand Up @@ -5339,6 +5305,7 @@ KVM_PV_ASYNC_CLEANUP_PERFORM
union {
__u8 long_mode;
__u8 vector;
__u8 runstate_update_flag;
struct {
__u64 gfn;
} shared_info;
Expand Down Expand Up @@ -5416,6 +5383,14 @@ KVM_XEN_ATTR_TYPE_XEN_VERSION
event channel delivery, so responding within the kernel without
exiting to userspace is beneficial.

KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG
This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
support for KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG. It enables the
XEN_RUNSTATE_UPDATE flag which allows guest vCPUs to safely read
other vCPUs' vcpu_runstate_info. Xen guests enable this feature via
the VM_ASST_TYPE_runstate_update_flag of the HYPERVISOR_vm_assist
hypercall.

4.127 KVM_XEN_HVM_GET_ATTR
--------------------------

Expand Down Expand Up @@ -6473,31 +6448,33 @@ if it decides to decode and emulate the instruction.

Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
may instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
exit for writes.

The "reason" field specifies why the MSR trap occurred. User space will only
receive MSR exit traps when a particular reason was requested during through
The "reason" field specifies why the MSR interception occurred. Userspace will
only receive MSR exits when a particular reason was requested during through
ENABLE_CAP. Currently valid exit reasons are:

KVM_MSR_EXIT_REASON_UNKNOWN - access to MSR that is unknown to KVM
KVM_MSR_EXIT_REASON_INVAL - access to invalid MSRs or reserved bits
KVM_MSR_EXIT_REASON_FILTER - access blocked by KVM_X86_SET_MSR_FILTER

For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest
wants to read. To respond to this request with a successful read, user space
For KVM_EXIT_X86_RDMSR, the "index" field tells userspace which MSR the guest
wants to read. To respond to this request with a successful read, userspace
writes the respective data into the "data" field and must continue guest
execution to ensure the read data is transferred into guest register state.

If the RDMSR request was unsuccessful, user space indicates that with a "1" in
If the RDMSR request was unsuccessful, userspace indicates that with a "1" in
the "error" field. This will inject a #GP into the guest when the VCPU is
executed again.

For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest
wants to write. Once finished processing the event, user space must continue
vCPU execution. If the MSR write was unsuccessful, user space also sets the
For KVM_EXIT_X86_WRMSR, the "index" field tells userspace which MSR the guest
wants to write. Once finished processing the event, userspace must continue
vCPU execution. If the MSR write was unsuccessful, userspace also sets the
"error" field to "1".

See KVM_X86_SET_MSR_FILTER for details on the interaction with MSR filtering.

::


Expand Down Expand Up @@ -7263,19 +7240,27 @@ the module parameter for the target VM.
:Parameters: args[0] contains the mask of KVM_MSR_EXIT_REASON_* events to report
:Returns: 0 on success; -1 on error

This capability enables trapping of #GP invoking RDMSR and WRMSR instructions
into user space.
This capability allows userspace to intercept RDMSR and WRMSR instructions if
access to an MSR is denied. By default, KVM injects #GP on denied accesses.

When a guest requests to read or write an MSR, KVM may not implement all MSRs
that are relevant to a respective system. It also does not differentiate by
CPU type.

To allow more fine grained control over MSR handling, user space may enable
To allow more fine grained control over MSR handling, userspace may enable
this capability. With it enabled, MSR accesses that match the mask specified in
args[0] and trigger a #GP event inside the guest by KVM will instead trigger
KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications which user space
can then handle to implement model specific MSR handling and/or user notifications
to inform a user that an MSR was not handled.
args[0] and would trigger a #GP inside the guest will instead trigger
KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications. Userspace
can then implement model specific MSR handling and/or user notifications
to inform a user that an MSR was not emulated/virtualized by KVM.

The valid mask flags are:

KVM_MSR_EXIT_REASON_UNKNOWN - intercept accesses to unknown (to KVM) MSRs
KVM_MSR_EXIT_REASON_INVAL - intercept accesses that are architecturally
invalid according to the vCPU model and/or mode
KVM_MSR_EXIT_REASON_FILTER - intercept accesses that are denied by userspace
via KVM_X86_SET_MSR_FILTER

7.22 KVM_CAP_X86_BUS_LOCK_EXIT
-------------------------------
Expand Down Expand Up @@ -7936,7 +7921,7 @@ KVM_EXIT_X86_WRMSR exit notifications.
This capability indicates that KVM supports that accesses to user defined MSRs
may be rejected. With this capability exposed, KVM exports new VM ioctl
KVM_X86_SET_MSR_FILTER which user space can call to specify bitmaps of MSR
ranges that KVM should reject access to.
ranges that KVM should deny access to.

In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
trap and emulate MSRs that are outside of the scope of KVM as well as
Expand Down Expand Up @@ -8080,12 +8065,13 @@ KVM device "kvm-arm-vgic-its" when dirty ring is enabled.
This capability indicates the features that Xen supports for hosting Xen
PVHVM guests. Valid flags are::

#define KVM_XEN_HVM_CONFIG_HYPERCALL_MSR (1 << 0)
#define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1)
#define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2)
#define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 3)
#define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4)
#define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5)
#define KVM_XEN_HVM_CONFIG_HYPERCALL_MSR (1 << 0)
#define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1)
#define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2)
#define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 3)
#define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4)
#define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5)
#define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6)

The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG
ioctl is available, for the guest to set its hypercall page.
Expand Down Expand Up @@ -8117,6 +8103,18 @@ KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID/TIMER/UPCALL_VECTOR vCPU attributes.
related to event channel delivery, timers, and the XENVER_version
interception.

The KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG flag indicates that KVM supports
the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute in the KVM_XEN_SET_ATTR
and KVM_XEN_GET_ATTR ioctls. This controls whether KVM will set the
XEN_RUNSTATE_UPDATE flag in guest memory mapped vcpu_runstate_info during
updates of the runstate information. Note that versions of KVM which support
the RUNSTATE feature above, but not thie RUNSTATE_UPDATE_FLAG feature, will
always set the XEN_RUNSTATE_UPDATE flag when updating the guest structure,
which is perhaps counterintuitive. When this flag is advertised, KVM will
behave more correctly, not using the XEN_RUNSTATE_UPDATE flag until/unless
specifically enabled (by the guest making the hypercall, causing the VMM
to enable the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute).

8.31 KVM_CAP_PPC_MULTITCE
-------------------------

Expand Down
2 changes: 2 additions & 0 deletions arch/x86/include/asm/kvm_host.h
Original file line number Diff line number Diff line change
Expand Up @@ -686,6 +686,7 @@ struct kvm_vcpu_xen {
struct gfn_to_pfn_cache vcpu_info_cache;
struct gfn_to_pfn_cache vcpu_time_info_cache;
struct gfn_to_pfn_cache runstate_cache;
struct gfn_to_pfn_cache runstate2_cache;
u64 last_steal;
u64 runstate_entry_time;
u64 runstate_times[4];
Expand Down Expand Up @@ -1112,6 +1113,7 @@ struct msr_bitmap_range {
struct kvm_xen {
u32 xen_version;
bool long_mode;
bool runstate_update_flag;
u8 upcall_vector;
struct gfn_to_pfn_cache shinfo_cache;
struct idr evtchn_ports;
Expand Down
8 changes: 0 additions & 8 deletions arch/x86/include/uapi/asm/kvm.h
Original file line number Diff line number Diff line change
Expand Up @@ -53,14 +53,6 @@
/* Architectural interrupt line count. */
#define KVM_NR_INTERRUPTS 256

struct kvm_memory_alias {
__u32 slot; /* this has a different namespace than memory slots */
__u32 flags;
__u64 guest_phys_addr;
__u64 memory_size;
__u64 target_phys_addr;
};

/* for KVM_GET_IRQCHIP and KVM_SET_IRQCHIP */
struct kvm_pic_state {
__u8 last_irr; /* edge detection */
Expand Down
4 changes: 4 additions & 0 deletions arch/x86/kvm/cpuid.c
Original file line number Diff line number Diff line change
Expand Up @@ -1233,8 +1233,12 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
* Other defined bits are for MSRs that KVM does not expose:
* EAX 3 SPCL, SMM page configuration lock
* EAX 13 PCMSR, Prefetch control MSR
*
* KVM doesn't support SMM_CTL.
* EAX 9 SMM_CTL MSR is not supported
*/
entry->eax &= BIT(0) | BIT(2) | BIT(6);
entry->eax |= BIT(9);
if (static_cpu_has(X86_FEATURE_LFENCE_RDTSC))
entry->eax |= BIT(2);
if (!static_cpu_has_bug(X86_BUG_NULL_SEG))
Expand Down
2 changes: 0 additions & 2 deletions arch/x86/kvm/irq.c
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,6 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)

return r;
}
EXPORT_SYMBOL(kvm_cpu_has_pending_timer);

/*
* check if there is a pending userspace external interrupt
Expand Down Expand Up @@ -150,7 +149,6 @@ void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
if (kvm_xen_timer_enabled(vcpu))
kvm_xen_inject_timer_irqs(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_inject_pending_timer_irqs);

void __kvm_migrate_timers(struct kvm_vcpu *vcpu)
{
Expand Down
Loading

0 comments on commit 9352e74

Please sign in to comment.