Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "Generic:

   - Use memdup_array_user() to harden against overflow.

   - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
     architectures.

   - Clean up Kconfigs that all KVM architectures were selecting

   - New functionality around "guest_memfd", a new userspace API that
     creates an anonymous file and returns a file descriptor that refers
     to it. guest_memfd files are bound to their owning virtual machine,
     cannot be mapped, read, or written by userspace, and cannot be
     resized. guest_memfd files do however support PUNCH_HOLE, which can
     be used to switch a memory area between guest_memfd and regular
     anonymous memory.

   - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
     per-page attributes for a given page of guest memory; right now the
     only attribute is whether the guest expects to access memory via
     guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
     TDX or ARM64 pKVM is checked by firmware or hypervisor that
     guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
     the case of pKVM).

  x86:

   - Support for "software-protected VMs" that can use the new
     guest_memfd and page attributes infrastructure. This is mostly
     useful for testing, since there is no pKVM-like infrastructure to
     provide a meaningfully reduced TCB.

   - Fix a relatively benign off-by-one error when splitting huge pages
     during CLEAR_DIRTY_LOG.

   - Fix a bug where KVM could incorrectly test-and-clear dirty bits in
     non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
     a non-huge SPTE.

   - Use more generic lockdep assertions in paths that don't actually
     care about whether the caller is a reader or a writer.

   - Let Xen guests opt out of having PV clock reported as "based on a
     stable TSC", because some of them don't expect the "TSC stable" bit
     (added to the pvclock ABI by KVM, but never set by Xen) to be set.

   - Revert a bogus, made-up nested SVM consistency check for
     TLB_CONTROL.

   - Advertise flush-by-ASID support for nSVM unconditionally, as KVM
     always flushes on nested transitions, i.e. always satisfies flush
     requests. This allows running bleeding edge versions of VMware
     Workstation on top of KVM.

   - Sanity check that the CPU supports flush-by-ASID when enabling SEV
     support.

   - On AMD machines with vNMI, always rely on hardware instead of
     intercepting IRET in some cases to detect unmasking of NMIs

   - Support for virtualizing Linear Address Masking (LAM)

   - Fix a variety of vPMU bugs where KVM fails to stop/reset counters
     and other state prior to refreshing the vPMU model.

   - Fix a double-overflow PMU bug by tracking emulated counter events
     using a dedicated field instead of snapshotting the "previous"
     counter. If the hardware PMC count triggers overflow that is
     recognized in the same VM-Exit that KVM manually bumps an event
     count, KVM would pend PMIs for both the hardware-triggered overflow
     and for KVM-triggered overflow.

   - Turn off KVM_WERROR by default for all configs so that it's not
     inadvertently enabled by non-KVM developers, which can be
     problematic for subsystems that require no regressions for W=1
     builds.

   - Advertise all of the host-supported CPUID bits that enumerate
     IA32_SPEC_CTRL "features".

   - Don't force a masterclock update when a vCPU synchronizes to the
     current TSC generation, as updating the masterclock can cause
     kvmclock's time to "jump" unexpectedly, e.g. when userspace
     hotplugs a pre-created vCPU.

   - Use RIP-relative address to read kvm_rebooting in the VM-Enter
     fault paths, partly as a super minor optimization, but mostly to
     make KVM play nice with position independent executable builds.

   - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
     CONFIG_HYPERV as a minor optimization, and to self-document the
     code.

   - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
     "emulation" at build time.

  ARM64:

   - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
     granule sizes. Branch shared with the arm64 tree.

   - Large Fine-Grained Trap rework, bringing some sanity to the
     feature, although there is more to come. This comes with a prefix
     branch shared with the arm64 tree.

   - Some additional Nested Virtualization groundwork, mostly
     introducing the NV2 VNCR support and retargeting the NV support to
     that version of the architecture.

   - A small set of vgic fixes and associated cleanups.

  Loongarch:

   - Optimization for memslot hugepage checking

   - Cleanup and fix some HW/SW timer issues

   - Add LSX/LASX (128bit/256bit SIMD) support

  RISC-V:

   - KVM_GET_REG_LIST improvement for vector registers

   - Generate ISA extension reg_list using macros in get-reg-list
     selftest

   - Support for reporting steal time along with selftest

  s390:

   - Bugfixes

  Selftests:

   - Fix an annoying goof where the NX hugepage test prints out garbage
     instead of the magic token needed to run the test.

   - Fix build errors when a header is deleted/moved due to a missing
     flag in the Makefile.

   - Detect if KVM bugged/killed a selftest's VM and print out a helpful
     message instead of complaining that a random ioctl() failed.

   - Annotate the guest printf/assert helpers with __printf(), and fix
     the various bugs that were lurking due to lack of said annotation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
  x86/kvm: Do not try to disable kvmclock if it was not enabled
  KVM: x86: add missing "depends on KVM"
  KVM: fix direction of dependency on MMU notifiers
  KVM: introduce CONFIG_KVM_COMMON
  KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
  KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
  RISC-V: KVM: selftests: Add get-reg-list test for STA registers
  RISC-V: KVM: selftests: Add steal_time test support
  RISC-V: KVM: selftests: Add guest_sbi_probe_extension
  RISC-V: KVM: selftests: Move sbi_ecall to processor.c
  RISC-V: KVM: Implement SBI STA extension
  RISC-V: KVM: Add support for SBI STA registers
  RISC-V: KVM: Add support for SBI extension registers
  RISC-V: KVM: Add SBI STA info to vcpu_arch
  RISC-V: KVM: Add steal-update vcpu request
  RISC-V: KVM: Add SBI STA extension skeleton
  RISC-V: paravirt: Implement steal-time support
  RISC-V: Add SBI STA extension definitions
  RISC-V: paravirt: Add skeleton for pv-time support
  RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
  ...
torvalds committed Jan 17, 2024
2 parents 1b1934d + 1c6d984 commit 09d1c6a
Showing 187 changed files with 7,469 additions and 2,710 deletions.
6 changes: 3 additions & 3 deletions Documentation/admin-guide/kernel-parameters.txt
@@ -3996,9 +3996,9 @@
vulnerability. System may allow data leaks with this
option.

no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES] Disable paravirtualized
steal time accounting. steal time is computed, but
won't influence scheduler behaviour
no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES,RISCV] Disable
paravirtualized steal time accounting. steal time is
computed, but won't influence scheduler behaviour

nosync [HW,M68K] Disables sync negotiation for all devices.

219 changes: 207 additions & 12 deletions Documentation/virt/kvm/api.rst
@@ -147,10 +147,29 @@ described as 'basic' will be available.
The new VM has no virtual cpus and no memory.
You probably want to use 0 as machine type.

X86:
^^^^

Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.

S390:
^^^^^

In order to create user controlled virtual machines on S390, check
KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
privileged user (CAP_SYS_ADMIN).

MIPS:
^^^^^

To use hardware assisted virtualization on MIPS (VZ ASE) rather than
the default trap & emulate implementation (which changes the virtual
memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
flag KVM_VM_MIPS_VZ.

ARM64:
^^^^^^

On arm64, the physical address size for a VM (IPA Size limit) is limited
to 40bits by default. The limit can be configured if the host supports the
extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -608,18 +627,6 @@ interrupt number dequeues the interrupt.
This is an asynchronous vcpu ioctl and can be invoked from any thread.


4.17 KVM_DEBUG_GUEST
--------------------

:Capability: basic
:Architectures: none
:Type: vcpu ioctl
:Parameters: none)
:Returns: -1 on error

Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead.


4.18 KVM_GET_MSRS
-----------------

@@ -6192,6 +6199,130 @@ to know what fields can be changed for the system register described by
``op0, op1, crn, crm, op2``. KVM rejects ID register values that describe a
superset of the features supported by the system.

4.140 KVM_SET_USER_MEMORY_REGION2
---------------------------------

:Capability: KVM_CAP_USER_MEMORY2
:Architectures: all
:Type: vm ioctl
:Parameters: struct kvm_userspace_memory_region2 (in)
:Returns: 0 on success, -1 on error

KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
allows mapping guest_memfd memory into a guest. All fields shared with
KVM_SET_USER_MEMORY_REGION behave identically. Userspace can set KVM_MEM_GUEST_MEMFD
in flags to have KVM bind the memory region to a given guest_memfd range of
[guest_memfd_offset, guest_memfd_offset + memory_size]. The target guest_memfd
must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
the target range must not be bound to any other memory region. All standard
bounds checks apply (use common sense).

::

struct kvm_userspace_memory_region2 {
__u32 slot;
__u32 flags;
__u64 guest_phys_addr;
__u64 memory_size; /* bytes */
__u64 userspace_addr; /* start of the userspace allocated memory */
__u64 guest_memfd_offset;
__u32 guest_memfd;
__u32 pad1;
__u64 pad2[14];
};

A KVM_MEM_GUEST_MEMFD region _must_ have a valid guest_memfd (private memory) and
userspace_addr (shared memory). However, "valid" for userspace_addr simply
means that the address itself must be a legal userspace address. The backing
mapping for userspace_addr is not required to be valid/populated at the time of
KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated
on-demand.

When mapping a gfn into the guest, KVM selects shared vs. private, i.e. consumes
userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE
state. At VM creation time, all memory is shared, i.e. the PRIVATE attribute
is '0' for all gfns. Userspace can control whether memory is shared/private by
toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
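
As a rough illustration of the fields described in this section, the sketch below fills in a region descriptor for a slot that has both a shared backing (userspace_addr) and a private backing (guest_memfd). Everything prefixed with ``sketch_`` or ``SKETCH_`` is hypothetical: the struct is a local mirror of the layout shown above, and the flag value is an assumption that should be taken from ``<linux/kvm.h>`` in real code. The actual ioctl call is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed value for illustration; use KVM_MEM_GUEST_MEMFD from <linux/kvm.h>. */
#define SKETCH_KVM_MEM_GUEST_MEMFD (1u << 2)

/* Local mirror of struct kvm_userspace_memory_region2 as documented above. */
struct sketch_memory_region2 {
	uint32_t slot;
	uint32_t flags;
	uint64_t guest_phys_addr;
	uint64_t memory_size;        /* bytes */
	uint64_t userspace_addr;     /* shared backing (may be lazily populated) */
	uint64_t guest_memfd_offset; /* offset into the guest_memfd file */
	uint32_t guest_memfd;        /* private backing (file descriptor) */
	uint32_t pad1;
	uint64_t pad2[14];
};

/* Build a descriptor for a slot backed by both shared and private memory. */
struct sketch_memory_region2
sketch_gmem_region(uint32_t slot, uint64_t gpa, uint64_t size,
		   uint64_t shared_hva, int gmem_fd, uint64_t gmem_offset)
{
	struct sketch_memory_region2 r;

	assert(gmem_fd >= 0);          /* a guest_memfd region needs a valid fd */
	memset(&r, 0, sizeof(r));      /* zero the padding fields defensively */
	r.slot = slot;
	r.flags = SKETCH_KVM_MEM_GUEST_MEMFD;
	r.guest_phys_addr = gpa;
	r.memory_size = size;
	r.userspace_addr = shared_hva;
	r.guest_memfd = (uint32_t)gmem_fd;
	r.guest_memfd_offset = gmem_offset;
	return r;
}
```

In a real VMM the resulting descriptor would be passed to the vm file descriptor with KVM_SET_USER_MEMORY_REGION2, using the kernel's own struct definition rather than this mirror.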

4.141 KVM_SET_MEMORY_ATTRIBUTES
-------------------------------

:Capability: KVM_CAP_MEMORY_ATTRIBUTES
:Architectures: x86
:Type: vm ioctl
:Parameters: struct kvm_memory_attributes (in)
:Returns: 0 on success, <0 on error

KVM_SET_MEMORY_ATTRIBUTES allows userspace to set memory attributes for a range
of guest physical memory.

::

struct kvm_memory_attributes {
__u64 address;
__u64 size;
__u64 attributes;
__u64 flags;
};

#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)

The address and size must be page aligned. The supported attributes can be
retrieved via ioctl(KVM_CHECK_EXTENSION) on KVM_CAP_MEMORY_ATTRIBUTES. If
executed on a VM, KVM_CAP_MEMORY_ATTRIBUTES precisely returns the attributes
supported by that VM. If executed at system scope, KVM_CAP_MEMORY_ATTRIBUTES
returns all attributes supported by KVM. The only attribute defined at this
time is KVM_MEMORY_ATTRIBUTE_PRIVATE, which marks the associated gfn as being
guest private memory.

Note, there is no "get" API. Userspace is responsible for explicitly tracking
the state of a gfn/page as needed.

The "flags" field is reserved for future extensions and must be '0'.
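
A minimal sketch of preparing such a request follows. It only checks the page-alignment rule stated above and fills in the attribute bit; the names prefixed with ``sketch_`` or ``SKETCH_`` are hypothetical, the struct is a local mirror of the documented layout, and the page size is passed in by the caller as an assumption rather than queried from the system.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as KVM_MEMORY_ATTRIBUTE_PRIVATE in the text above. */
#define SKETCH_ATTR_PRIVATE (1ULL << 3)

/* Local mirror of struct kvm_memory_attributes as documented above. */
struct sketch_memory_attributes {
	uint64_t address;
	uint64_t size;
	uint64_t attributes;
	uint64_t flags;
};

/*
 * Fill *out for converting [gpa, gpa + size) to private or shared.
 * Returns 0 on success, -1 if gpa or size is not page aligned.
 */
int sketch_prepare_attrs(struct sketch_memory_attributes *out,
			 uint64_t gpa, uint64_t size, uint64_t page_size,
			 int make_private)
{
	assert(out != 0);
	/* page_size must be a non-zero power of two */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	if ((gpa | size) & (page_size - 1))
		return -1;

	out->address = gpa;
	out->size = size;
	out->attributes = make_private ? SKETCH_ATTR_PRIVATE : 0;
	out->flags = 0; /* reserved, must be '0' */
	return 0;
}
```

Because there is no "get" API, a VMM built along these lines would also record the new state of the range in its own bookkeeping after the ioctl succeeds.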

4.142 KVM_CREATE_GUEST_MEMFD
----------------------------

:Capability: KVM_CAP_GUEST_MEMFD
:Architectures: none
:Type: vm ioctl
:Parameters: struct kvm_create_guest_memfd (in)
:Returns: 0 on success, <0 on error

KVM_CREATE_GUEST_MEMFD creates an anonymous file and returns a file descriptor
that refers to it. guest_memfd files are roughly analogous to files created
via memfd_create(), e.g. guest_memfd files live in RAM, have volatile storage,
and are automatically released when the last reference is dropped. Unlike
"regular" memfd_create() files, guest_memfd files are bound to their owning
virtual machine (see below), cannot be mapped, read, or written by userspace,
and cannot be resized (guest_memfd files do however support PUNCH_HOLE).

::

struct kvm_create_guest_memfd {
__u64 size;
__u64 flags;
__u64 reserved[6];
};

Conceptually, the inode backing a guest_memfd file represents physical memory,
i.e. is coupled to the virtual machine as a thing, not to a "struct kvm". The
file itself, which is bound to a "struct kvm", is that instance's view of the
underlying memory, e.g. effectively provides the translation of guest addresses
to host memory. This allows for use cases where multiple KVM structures are
used to manage a single virtual machine, e.g. when performing intrahost
migration of a virtual machine.

KVM currently only supports mapping guest_memfd via KVM_SET_USER_MEMORY_REGION2,
and more specifically via the guest_memfd and guest_memfd_offset fields in
"struct kvm_userspace_memory_region2", where guest_memfd_offset is the offset
into the guest_memfd instance. For a given guest_memfd file, there can be at
most one mapping per page, i.e. binding multiple memory regions to a single
guest_memfd range is not allowed (any number of memory regions can be bound to
a single guest_memfd file, but the bound ranges must not overlap).

See KVM_SET_USER_MEMORY_REGION2 for additional details.
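
The non-overlap rule can be checked in userspace before binding a new memory region. The helper below is a hypothetical sketch (the name ``sketch_gmem_ranges_overlap`` is not a KVM API) and it assumes each bound range is treated as the half-open interval starting at guest_memfd_offset and spanning memory_size bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if two ranges within the same guest_memfd file overlap.
 * Each range is taken as half-open: [offset, offset + size).
 */
bool sketch_gmem_ranges_overlap(uint64_t off_a, uint64_t size_a,
				uint64_t off_b, uint64_t size_b)
{
	/* Reject ranges whose end would wrap around 64 bits. */
	assert(size_a <= UINT64_MAX - off_a);
	assert(size_b <= UINT64_MAX - off_b);

	return off_a < off_b + size_b && off_b < off_a + size_a;
}
```

Two regions bound back to back, for example offsets 0x0 and 0x1000 with a size of 0x1000 each, do not overlap under this check, whereas a second region starting in the middle of the first does.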

5. The kvm_run structure
========================

@@ -6824,6 +6955,30 @@ array field represents return values. The userspace should update the return
values of SBI call before resuming the VCPU. For more details on RISC-V SBI
spec refer, https://github.com/riscv/riscv-sbi-doc.

::

/* KVM_EXIT_MEMORY_FAULT */
struct {
#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
__u64 flags;
__u64 gpa;
__u64 size;
} memory_fault;

KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the
guest physical address range [gpa, gpa + size) of the fault. The 'flags' field
describes properties of the faulting access that are likely pertinent:

- KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
on a private memory access. When clear, indicates the fault occurred on a
shared access.

Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
accompanies a return code of '-1', not '0'! errno will always be set to EFAULT
or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT. Userspace should assume
kvm_run.exit_reason is stale/undefined for all other error numbers.
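
The validity rule above can be written down as a small predicate. This is a sketch under stated assumptions: the exit reason number is assumed for illustration and should come from ``<linux/kvm.h>``, the struct is a local mirror of the ``memory_fault`` layout shown above, and all ``sketch_`` names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#ifndef EHWPOISON
#define EHWPOISON 133 /* fallback so the sketch builds on non-Linux hosts */
#endif

/* Assumed value for illustration; use KVM_EXIT_MEMORY_FAULT from <linux/kvm.h>. */
#define SKETCH_EXIT_MEMORY_FAULT 39
/* Same value as KVM_MEMORY_EXIT_FLAG_PRIVATE in the text above. */
#define SKETCH_EXIT_FLAG_PRIVATE (1ULL << 3)

/* Local mirror of kvm_run.memory_fault as documented above. */
struct sketch_memory_fault {
	uint64_t flags;
	uint64_t gpa;
	uint64_t size;
};

/* True only if KVM_RUN failed with EFAULT/EHWPOISON and the exit reason matches. */
bool sketch_memory_fault_valid(int run_ret, int run_errno, uint32_t exit_reason)
{
	return run_ret < 0 &&
	       (run_errno == EFAULT || run_errno == EHWPOISON) &&
	       exit_reason == SKETCH_EXIT_MEMORY_FAULT;
}

/* True if the faulting access was to private memory. */
bool sketch_fault_is_private(const struct sketch_memory_fault *f)
{
	assert(f != 0);
	return (f->flags & SKETCH_EXIT_FLAG_PRIVATE) != 0;
}

/* Exclusive end of the faulting range [gpa, gpa + size). */
uint64_t sketch_fault_end(const struct sketch_memory_fault *f)
{
	assert(f != 0);
	return f->gpa + f->size;
}
```

The errno value has to be captured immediately after the failing KVM_RUN call, before any other library call can overwrite it.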

::

/* KVM_EXIT_NOTIFY */
@@ -7858,6 +8013,27 @@ This capability is aimed to mitigate the threat that malicious VMs can
cause the CPU to get stuck (because event windows don't open up) and make the
CPU unavailable to the host or other VMs.

7.34 KVM_CAP_MEMORY_FAULT_INFO
------------------------------

:Architectures: x86
:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.

The presence of this capability indicates that KVM_RUN will fill
kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
there is a valid memslot but no backing VMA for the corresponding host virtual
address.

The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
to KVM_EXIT_MEMORY_FAULT.

Note: Userspaces which attempt to resolve memory faults so that they can retry
KVM_RUN are encouraged to guard against repeatedly receiving the same
error/annotated fault.

See KVM_EXIT_MEMORY_FAULT for more information.
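
One possible way to guard against receiving the same annotated fault over and over is to count consecutive identical faults and give up past a threshold. The sketch below is only an illustration of that idea; the struct, the function name and the retry limit are hypothetical and not part of the KVM API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Remembers the most recent fault and how many times in a row it was seen. */
struct sketch_fault_guard {
	uint64_t gpa;
	uint64_t flags;
	unsigned int repeats;
};

/*
 * Record a fault and decide whether retrying KVM_RUN is still reasonable.
 * Returns false once the same (gpa, flags) pair has been seen more than
 * max_repeats times consecutively.
 */
bool sketch_should_retry(struct sketch_fault_guard *g, uint64_t gpa,
			 uint64_t flags, unsigned int max_repeats)
{
	assert(g != 0);

	if (g->repeats > 0 && g->gpa == gpa && g->flags == flags) {
		g->repeats++;
	} else {
		g->gpa = gpa;
		g->flags = flags;
		g->repeats = 1;
	}
	return g->repeats <= max_repeats;
}
```

A zero-initialized guard starts with no history, so the first fault at any address is always considered retryable as long as max_repeats is at least 1.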

8. Other capabilities.
======================

@@ -8374,6 +8550,7 @@ PVHVM guests. Valid flags are::
#define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4)
#define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5)
#define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6)
#define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE (1 << 7)

The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG
ioctl is available, for the guest to set its hypercall page.
@@ -8417,6 +8594,11 @@ behave more correctly, not using the XEN_RUNSTATE_UPDATE flag until/unless
specifically enabled (by the guest making the hypercall, causing the VMM
to enable the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute).

The KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag indicates that KVM supports
clearing the PVCLOCK_TSC_STABLE_BIT flag in Xen pvclock sources. This will be
done when the KVM_CAP_XEN_HVM ioctl sets the
KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag.

8.31 KVM_CAP_PPC_MULTITCE
-------------------------

@@ -8596,6 +8778,19 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
64-bit bitmap (each bit describing a block size). The default value is
0, to disable the eager page splitting.

8.41 KVM_CAP_VM_TYPES
---------------------

:Capability: KVM_CAP_MEMORY_ATTRIBUTES
:Architectures: x86
:Type: system ioctl

This capability returns a bitmap of supported VM types. The 1-setting of bit @n
means the VM type with value @n is supported. Possible values of @n are::

#define KVM_X86_DEFAULT_VM 0
#define KVM_X86_SW_PROTECTED_VM 1
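
Decoding the returned bitmap is a plain bit test. The sketch below assumes the bitmap has already been obtained (for example from KVM_CHECK_EXTENSION on KVM_CAP_VM_TYPES) and widened to 64 bits; the ``sketch_`` and ``SKETCH_`` names are hypothetical mirrors of the definitions above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors of the VM type values listed above. */
#define SKETCH_X86_DEFAULT_VM      0
#define SKETCH_X86_SW_PROTECTED_VM 1

/* True if bit 'type' is set in the bitmap, i.e. that VM type is supported. */
bool sketch_vm_type_supported(uint64_t bitmap, unsigned int type)
{
	assert(type < 64); /* shifting by 64 or more would be undefined */
	return ((bitmap >> type) & 1) != 0;
}
```

With a bitmap of 0x1 only the default VM type is reported as supported, while 0x3 also reports the software-protected type.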

9. Known KVM API problems
=========================

7 changes: 3 additions & 4 deletions Documentation/virt/kvm/locking.rst
@@ -43,10 +43,9 @@ On x86:

- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock and kvm->arch.xen.xen_lock

- kvm->arch.mmu_lock is an rwlock. kvm->arch.tdp_mmu_pages_lock and
kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and
cannot be taken without already holding kvm->arch.mmu_lock (typically with
``read_lock`` for the TDP MMU, thus the need for additional spinlocks).
- kvm->arch.mmu_lock is an rwlock; critical sections for
kvm->arch.tdp_mmu_pages_lock and kvm->arch.mmu_unsync_pages_lock must
also take kvm->arch.mmu_lock

Everything else is a leaf: no other lock is taken inside the critical
sections.
15 changes: 15 additions & 0 deletions arch/arm64/include/asm/esr.h
@@ -392,6 +392,21 @@ static inline bool esr_is_data_abort(unsigned long esr)
return ec == ESR_ELx_EC_DABT_LOW || ec == ESR_ELx_EC_DABT_CUR;
}

static inline bool esr_fsc_is_translation_fault(unsigned long esr)
{
return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_FAULT;
}

static inline bool esr_fsc_is_permission_fault(unsigned long esr)
{
return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_PERM;
}

static inline bool esr_fsc_is_access_flag_fault(unsigned long esr)
{
return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_ACCESS;
}

const char *esr_get_class_string(unsigned long esr);
#endif /* __ASSEMBLY */
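
For readers outside the kernel tree, the three helpers added in this hunk all follow one pattern: mask the fault status code with the type mask, then compare against one type value. The standalone sketch below restates that pattern as a single classifier. The numeric mask and type values are assumptions copied here for illustration (the authoritative definitions are the ESR_ELx_FSC_* macros in esr.h), and the enum and function names are hypothetical.

```c
#include <assert.h>

/* Assumed values of ESR_ELx_FSC_TYPE / _FAULT / _ACCESS / _PERM. */
#define SKETCH_FSC_TYPE   0x3CUL
#define SKETCH_FSC_FAULT  0x04UL
#define SKETCH_FSC_ACCESS 0x08UL
#define SKETCH_FSC_PERM   0x0CUL

/* Every type code must be fully covered by the type mask. */
static_assert((SKETCH_FSC_PERM & ~SKETCH_FSC_TYPE) == 0,
	      "type codes must fit in the type mask");

enum sketch_fault_kind {
	SKETCH_FAULT_OTHER,
	SKETCH_FAULT_TRANSLATION,
	SKETCH_FAULT_ACCESS_FLAG,
	SKETCH_FAULT_PERMISSION,
};

/* Classify a fault by its type bits, ignoring the low level bits. */
enum sketch_fault_kind sketch_classify_fsc(unsigned long esr)
{
	switch (esr & SKETCH_FSC_TYPE) {
	case SKETCH_FSC_FAULT:
		return SKETCH_FAULT_TRANSLATION;
	case SKETCH_FSC_ACCESS:
		return SKETCH_FAULT_ACCESS_FLAG;
	case SKETCH_FSC_PERM:
		return SKETCH_FAULT_PERMISSION;
	default:
		return SKETCH_FAULT_OTHER;
	}
}
```

Masking with the type bits means the same check matches a fault at any translation level, which is the reason the helpers compare against the type value rather than the full status code.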
