Merge branch kvm-arm64/eager-page-splitting into kvmarm/next
* kvm-arm64/eager-page-splitting:
  : Eager Page Splitting, courtesy of Ricardo Koller.
  :
  : Dirty logging performance is dominated by the cost of splitting
  : hugepages to PTE granularity. On systems that mere mortals can get their
  : hands on, each fault incurs the cost of a full break-before-make
  : pattern, wherein the broadcast invalidation and ensuing serialization
  : significantly increase fault latency.
  :
  : The goal of eager page splitting is to move the cost of hugepage
  : splitting out of the stage-2 fault path and instead into the ioctls
  : responsible for managing the dirty log:
  :
  :  - If manual protection is enabled for the VM, hugepage splitting
  :    happens in the KVM_CLEAR_DIRTY_LOG ioctl. This is desirable as it
  :    provides userspace granular control over hugepage splitting.
  :
  :  - Otherwise, if userspace relies on the legacy dirty log behavior
  :    (clear on collection), hugepage splitting is done at the moment dirty
  :    logging is enabled for a particular memslot.
  :
  : Support for eager page splitting requires explicit opt-in from
  : userspace, which is realized through the
  : KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE capability.
  arm64: kvm: avoid overflow in integer division
  KVM: arm64: Use local TLBI on permission relaxation
  KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG
  KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()
  KVM: arm64: Split huge pages when dirty logging is enabled
  KVM: arm64: Add kvm_uninit_stage2_mmu()
  KVM: arm64: Refactor kvm_arch_commit_memory_region()
  KVM: arm64: Add kvm_pgtable_stage2_split()
  KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
  KVM: arm64: Export kvm_are_all_memslots_empty()
  KVM: arm64: Add helper for creating unlinked stage2 subtrees
  KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs
  KVM: arm64: Rename free_removed to free_unlinked

Signed-off-by: Oliver Upton <[email protected]>
oupton committed Jun 15, 2023
2 parents 44c026a + 14c3555 commit 8351039
Showing 15 changed files with 612 additions and 57 deletions.
27 changes: 27 additions & 0 deletions Documentation/virt/kvm/api.rst
@@ -8445,6 +8445,33 @@ structure.
When getting the Modified Change Topology Report value, the attr->addr
must point to a byte where the value will be stored or retrieved from.

8.40 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
---------------------------------------

:Capability: KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
:Architectures: arm64
:Type: vm
:Parameters: arg[0] is the new split chunk size.
:Returns: 0 on success, -EINVAL if any memslot has already been created,
          or if the requested chunk size is not a supported block size.

This capability sets the chunk size used in Eager Page Splitting.

Eager Page Splitting improves the performance of dirty logging (used
in live migration) when guest memory is backed by huge pages. It
avoids splitting huge pages (into PAGE_SIZE pages) on fault by doing
so eagerly when dirty logging is enabled for a memory region (the
KVM_MEM_LOG_DIRTY_PAGES flag), or when using KVM_CLEAR_DIRTY_LOG.

The chunk size specifies how many pages to break at a time, using a
single allocation for each chunk. The bigger the chunk size, the more
pages need to be allocated ahead of time.

The chunk size needs to be a valid block size. The list of acceptable
block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
64-bit bitmap (each bit describing a block size). The default value is
0, which disables eager page splitting.
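A rough userspace sketch of the expected flow (illustrative only: it
assumes a 4KiB-granule host and the capability constants from a
<linux/kvm.h> that contains this series), run right after VM creation
and before any memslot is registered::

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int enable_eager_split(int vm_fd)
    {
            /* Bitmap of block sizes eligible for eager splitting. */
            long sizes = ioctl(vm_fd, KVM_CHECK_EXTENSION,
                               KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES);
            if (sizes <= 0)
                    return -1;

            /* Prefer a 2MiB chunk if the host offers it. */
            uint64_t chunk = (sizes & (1UL << 21)) ? (1UL << 21) : 0;
            if (!chunk)
                    return -1;

            struct kvm_enable_cap cap = {
                    .cap = KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE,
                    .args = { chunk },
            };

            /* Fails with -EINVAL once a memslot has been created. */
            if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
                    return -1;

            /* Reading the capability back reports the current chunk size. */
            printf("eager split chunk: %ld bytes\n",
                   (long)ioctl(vm_fd, KVM_CHECK_EXTENSION,
                               KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE));
            return 0;
    }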

9. Known KVM API problems
=========================

4 changes: 4 additions & 0 deletions arch/arm64/include/asm/kvm_asm.h
@@ -68,6 +68,7 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___kvm_vcpu_run,
__KVM_HOST_SMCCC_FUNC___kvm_flush_vm_context,
__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa,
__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa_nsh,
__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid,
__KVM_HOST_SMCCC_FUNC___kvm_flush_cpu_context,
__KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff,
@@ -225,6 +226,9 @@ extern void __kvm_flush_vm_context(void);
extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
int level);
extern void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
phys_addr_t ipa,
int level);
extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);

extern void __kvm_timer_set_cntvoff(u64 cntvoff);
15 changes: 15 additions & 0 deletions arch/arm64/include/asm/kvm_host.h
@@ -159,6 +159,21 @@ struct kvm_s2_mmu {
/* The last vcpu id that ran on each physical CPU */
int __percpu *last_vcpu_ran;

#define KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT 0
/*
* Memory cache used to split
* KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE worth of huge pages. It
* is used to allocate stage2 page tables while splitting huge
* pages. The choice of KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
* influences both the capacity of the split page cache, and
* how often KVM reschedules. Be wary of raising CHUNK_SIZE
* too high.
*
* Protected by kvm->slots_lock.
*/
struct kvm_mmu_memory_cache split_page_cache;
uint64_t split_page_chunk_size;

struct kvm_arch *arch;
};

1 change: 1 addition & 0 deletions arch/arm64/include/asm/kvm_mmu.h
@@ -172,6 +172,7 @@ void __init free_hyp_pgds(void);

void stage2_unmap_vm(struct kvm *kvm);
int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type);
void kvm_uninit_stage2_mmu(struct kvm *kvm);
void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
phys_addr_t pa, unsigned long size, bool writable);
79 changes: 75 additions & 4 deletions arch/arm64/include/asm/kvm_pgtable.h
@@ -92,6 +92,24 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
return level >= KVM_PGTABLE_MIN_BLOCK_LEVEL;
}

static inline u32 kvm_supported_block_sizes(void)
{
u32 level = KVM_PGTABLE_MIN_BLOCK_LEVEL;
u32 r = 0;

for (; level < KVM_PGTABLE_MAX_LEVELS; level++)
r |= BIT(kvm_granule_shift(level));

return r;
}

static inline bool kvm_is_block_size_supported(u64 size)
{
bool is_power_of_two = IS_ALIGNED(size, size);

return is_power_of_two && (size & kvm_supported_block_sizes());
}
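/*
 * For illustration (an example, not an exhaustive list): with a 4KiB
 * granule, a level-2 block covers 2MiB (granule shift 21), so
 * kvm_supported_block_sizes() includes BIT(21) and
 * kvm_is_block_size_supported(SZ_2M) returns true.
 */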

/**
* struct kvm_pgtable_mm_ops - Memory management callbacks.
* @zalloc_page: Allocate a single zeroed memory page.
@@ -104,7 +122,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
* allocation is physically contiguous.
* @free_pages_exact: Free an exact number of memory pages previously
* allocated by zalloc_pages_exact.
* @free_removed_table: Free a removed paging structure by unlinking and
* @free_unlinked_table: Free an unlinked paging structure by unlinking and
* dropping references.
* @get_page: Increment the refcount on a page.
* @put_page: Decrement the refcount on a page. When the
@@ -124,7 +142,7 @@ struct kvm_pgtable_mm_ops {
void* (*zalloc_page)(void *arg);
void* (*zalloc_pages_exact)(size_t size);
void (*free_pages_exact)(void *addr, size_t size);
void (*free_removed_table)(void *addr, u32 level);
void (*free_unlinked_table)(void *addr, u32 level);
void (*get_page)(void *addr);
void (*put_page)(void *addr);
int (*page_count)(void *addr);
@@ -195,13 +213,21 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
* with other software walkers.
* @KVM_PGTABLE_WALK_HANDLE_FAULT: Indicates the page-table walk was
* invoked from a fault handler.
* @KVM_PGTABLE_WALK_SKIP_BBM_TLBI: Visit and update table entries
* without Break-before-make's
* TLB invalidation.
* @KVM_PGTABLE_WALK_SKIP_CMO: Visit and update table entries
* without the required cache
* maintenance operations.
*/
enum kvm_pgtable_walk_flags {
KVM_PGTABLE_WALK_LEAF = BIT(0),
KVM_PGTABLE_WALK_TABLE_PRE = BIT(1),
KVM_PGTABLE_WALK_TABLE_POST = BIT(2),
KVM_PGTABLE_WALK_SHARED = BIT(3),
KVM_PGTABLE_WALK_HANDLE_FAULT = BIT(4),
KVM_PGTABLE_WALK_SKIP_BBM_TLBI = BIT(5),
KVM_PGTABLE_WALK_SKIP_CMO = BIT(6),
};

struct kvm_pgtable_visit_ctx {
@@ -441,15 +467,41 @@ int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);

/**
* kvm_pgtable_stage2_free_removed() - Free a removed stage-2 paging structure.
* kvm_pgtable_stage2_free_unlinked() - Free an unlinked stage-2 paging structure.
* @mm_ops: Memory management callbacks.
* @pgtable: Unlinked stage-2 paging structure to be freed.
* @level: Level of the stage-2 paging structure to be freed.
*
* The page-table is assumed to be unreachable by any hardware walkers prior to
* freeing and therefore no TLB invalidation is performed.
*/
void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);

/**
* kvm_pgtable_stage2_create_unlinked() - Create an unlinked stage-2 paging structure.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @phys: Physical address of the memory to map.
* @level: Starting level of the stage-2 paging structure to be created.
* @prot: Permissions and attributes for the mapping.
* @mc: Cache of pre-allocated and zeroed memory from which to allocate
* page-table pages.
* @force_pte: Force mappings to PAGE_SIZE granularity.
*
* Returns an unlinked page-table tree. This new page-table tree is
* not reachable (i.e., it is unlinked) from the root pgd and is
* therefore unreachable by the hardware page-table walker. No TLB
* invalidation or CMOs are performed.
*
* If device attributes are not explicitly requested in @prot, then the
* mapping will be normal, cacheable.
*
* Return: The fully populated (unlinked) stage-2 paging structure, or
* an ERR_PTR(error) on failure.
*/
kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
u64 phys, u32 level,
enum kvm_pgtable_prot prot,
void *mc, bool force_pte);

/**
* kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.
@@ -620,6 +672,25 @@ bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
*/
int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);

/**
* kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
* to PAGE_SIZE guest pages.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init().
* @addr: Intermediate physical address from which to split.
* @size: Size of the range.
* @mc: Cache of pre-allocated and zeroed memory from which to allocate
* page-table pages.
*
* The function tries to split any level 1 or 2 entry that overlaps
* with the input range (given by @addr and @size).
*
* Return: 0 on success, negative error code on failure. Note that
* kvm_pgtable_stage2_split() is best effort: it tries to break as many
* blocks in the input range as allowed by the capacity of @mc.
*/
int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
struct kvm_mmu_memory_cache *mc);

/**
* kvm_pgtable_walk() - Walk a page-table.
* @pgt: Page-table structure initialised by kvm_pgtable_*_init().
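For context, a caller of kvm_pgtable_stage2_split() is expected to work
through a range one chunk at a time, topping up the memory cache before
each call. The sketch below is illustrative only: split_range_sketch and
the per-chunk capacity estimate are invented for the example and are not
the actual arch/arm64/kvm/mmu.c code.

	static int split_range_sketch(struct kvm_pgtable *pgt,
				      struct kvm_mmu_memory_cache *cache,
				      u64 start, u64 end, u64 chunk_size)
	{
		/*
		 * Crude upper bound on the page-table pages needed per
		 * chunk; the real caller sizes the cache more tightly.
		 */
		int capacity = chunk_size / PAGE_SIZE;
		u64 addr;
		int ret;

		for (addr = start; addr < end; addr += chunk_size) {
			/* Pre-allocate so the walk never has to sleep. */
			ret = kvm_mmu_topup_memory_cache(cache, capacity);
			if (ret)
				return ret;

			ret = kvm_pgtable_stage2_split(pgt, addr,
						       min(chunk_size, end - addr),
						       cache);
			if (ret)
				return ret;
		}

		return 0;
	}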
28 changes: 28 additions & 0 deletions arch/arm64/kvm/arm.c
@@ -65,6 +65,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
struct kvm_enable_cap *cap)
{
int r;
u64 new_cap;

if (cap->flags)
return -EINVAL;
@@ -89,6 +90,24 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = 0;
set_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags);
break;
case KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE:
new_cap = cap->args[0];

mutex_lock(&kvm->slots_lock);
/*
* To keep things simple, allow changing the chunk
* size only when no memory slots have been created.
*/
if (!kvm_are_all_memslots_empty(kvm)) {
r = -EINVAL;
} else if (new_cap && !kvm_is_block_size_supported(new_cap)) {
r = -EINVAL;
} else {
r = 0;
kvm->arch.mmu.split_page_chunk_size = new_cap;
}
mutex_unlock(&kvm->slots_lock);
break;
default:
r = -EINVAL;
break;
@@ -302,6 +321,15 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ARM_PTRAUTH_GENERIC:
r = system_has_full_ptr_auth();
break;
case KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE:
if (kvm)
r = kvm->arch.mmu.split_page_chunk_size;
else
r = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
break;
case KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES:
r = kvm_supported_block_sizes();
break;
default:
r = 0;
}
10 changes: 10 additions & 0 deletions arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -125,6 +125,15 @@ static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
__kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level);
}

static void handle___kvm_tlb_flush_vmid_ipa_nsh(struct kvm_cpu_context *host_ctxt)
{
DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
DECLARE_REG(phys_addr_t, ipa, host_ctxt, 2);
DECLARE_REG(int, level, host_ctxt, 3);

__kvm_tlb_flush_vmid_ipa_nsh(kern_hyp_va(mmu), ipa, level);
}

static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
{
DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
@@ -315,6 +324,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__kvm_vcpu_run),
HANDLE_FUNC(__kvm_flush_vm_context),
HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa_nsh),
HANDLE_FUNC(__kvm_tlb_flush_vmid),
HANDLE_FUNC(__kvm_flush_cpu_context),
HANDLE_FUNC(__kvm_timer_set_cntvoff),
6 changes: 3 additions & 3 deletions arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -91,9 +91,9 @@ static void host_s2_put_page(void *addr)
hyp_put_page(&host_s2_pool, addr);
}

static void host_s2_free_removed_table(void *addr, u32 level)
static void host_s2_free_unlinked_table(void *addr, u32 level)
{
kvm_pgtable_stage2_free_removed(&host_mmu.mm_ops, addr, level);
kvm_pgtable_stage2_free_unlinked(&host_mmu.mm_ops, addr, level);
}

static int prepare_s2_pool(void *pgt_pool_base)
@@ -110,7 +110,7 @@ static int prepare_s2_pool(void *pgt_pool_base)
host_mmu.mm_ops = (struct kvm_pgtable_mm_ops) {
.zalloc_pages_exact = host_s2_zalloc_pages_exact,
.zalloc_page = host_s2_zalloc_page,
.free_removed_table = host_s2_free_removed_table,
.free_unlinked_table = host_s2_free_unlinked_table,
.phys_to_virt = hyp_phys_to_virt,
.virt_to_phys = hyp_virt_to_phys,
.page_count = hyp_page_count,
52 changes: 52 additions & 0 deletions arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -130,6 +130,58 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
__tlb_switch_to_host(&cxt);
}

void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
phys_addr_t ipa, int level)
{
struct tlb_inv_context cxt;

/* Switch to requested VMID */
__tlb_switch_to_guest(mmu, &cxt, true);

/*
* We could do so much better if we had the VA as well.
* Instead, we invalidate Stage-2 for this IPA, and the
* whole of Stage-1. Weep...
*/
ipa >>= 12;
__tlbi_level(ipas2e1, ipa, level);

/*
* We have to ensure completion of the invalidation at Stage-2,
* since a table walk on another CPU could refill a TLB with a
* complete (S1 + S2) walk based on the old Stage-2 mapping if
* the Stage-1 invalidation happened first.
*/
dsb(nsh);
__tlbi(vmalle1);
dsb(nsh);
isb();

/*
* If the host is running at EL1 and we have a VPIPT I-cache,
* then we must perform I-cache maintenance at EL2 in order for
* it to have an effect on the guest. Since the guest cannot hit
* I-cache lines allocated with a different VMID, we don't need
* to worry about junk out of guest reset (we nuke the I-cache on
* VMID rollover), but we do need to be careful when remapping
* executable pages for the same guest. This can happen when KSM
* takes a CoW fault on an executable page, copies the page into
* a page that was previously mapped in the guest and then needs
* to invalidate the guest view of the I-cache for that page
* from EL1. To solve this, we invalidate the entire I-cache when
* unmapping a page from a guest if we have a VPIPT I-cache but
* the host is running at EL1. As above, we could do better if
* we had the VA.
*
* The moral of this story is: if you have a VPIPT I-cache, then
* you should be running with VHE enabled.
*/
if (icache_is_vpipt())
icache_inval_all_pou();

__tlb_switch_to_host(&cxt);
}

void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
{
struct tlb_inv_context cxt;