Skip to content

Commit

Permalink
Merge branch 'kvm-updates/2.6.37' of git://git.kernel.org/pub/scm/vir…
Browse files Browse the repository at this point in the history
…t/kvm/kvm

* 'kvm-updates/2.6.37' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (321 commits)
  KVM: Drop CONFIG_DMAR dependency around kvm_iommu_map_pages
  KVM: Fix signature of kvm_iommu_map_pages stub
  KVM: MCE: Send SRAR SIGBUS directly
  KVM: MCE: Add MCG_SER_P into KVM_MCE_CAP_SUPPORTED
  KVM: fix typo in copyright notice
  KVM: Disable interrupts around get_kernel_ns()
  KVM: MMU: Avoid sign extension in mmu_alloc_direct_roots() pae root address
  KVM: MMU: move access code parsing to FNAME(walk_addr) function
  KVM: MMU: audit: check whether have unsync sps after root sync
  KVM: MMU: audit: introduce audit_printk to cleanup audit code
  KVM: MMU: audit: unregister audit tracepoints before module unloaded
  KVM: MMU: audit: fix vcpu's spte walking
  KVM: MMU: set access bit for direct mapping
  KVM: MMU: cleanup for error mask set while walk guest page table
  KVM: MMU: update 'root_hpa' out of loop in PAE shadow path
  KVM: x86 emulator: Eliminate compilation warning in x86_decode_insn()
  KVM: x86: Fix constant type in kvm_get_time_scale
  KVM: VMX: Add AX to list of registers clobbered by guest switch
  KVM guest: Move a printk that's using the clock before it's ready
  KVM: x86: TSC catchup mode
  ...
  • Loading branch information
torvalds committed Oct 24, 2010
2 parents bdaf12b + 2a31339 commit 1765a1f
Show file tree
Hide file tree
Showing 73 changed files with 6,701 additions and 2,389 deletions.
8 changes: 7 additions & 1 deletion Documentation/kernel-parameters.txt
Original file line number Diff line number Diff line change
Expand Up @@ -1131,9 +1131,13 @@ and is between 256 and 4096 characters. It is defined in the file
kvm.oos_shadow= [KVM] Disable out-of-sync shadow paging.
Default is 1 (enabled)

kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM.
kvm.mmu_audit= [KVM] This is a R/W parameter which allows audit
KVM MMU at runtime.
Default is 0 (off)

kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM.
Default is 1 (enabled)

kvm-amd.npt= [KVM,AMD] Disable nested paging (virtualized MMU)
for all guests.
Default is 1 (enabled) if in 64bit or 32bit-PAE mode
Expand Down Expand Up @@ -1698,6 +1702,8 @@ and is between 256 and 4096 characters. It is defined in the file

nojitter [IA64] Disables jitter checking for ITC timers.

no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver

nolapic [X86-32,APIC] Do not enable or use the local APIC.

nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
Expand Down
61 changes: 57 additions & 4 deletions Documentation/kvm/api.txt
Original file line number Diff line number Diff line change
Expand Up @@ -320,22 +320,51 @@ struct kvm_translation {
4.15 KVM_INTERRUPT

Capability: basic
Architectures: x86
Architectures: x86, ppc
Type: vcpu ioctl
Parameters: struct kvm_interrupt (in)
Returns: 0 on success, -1 on error

Queues a hardware interrupt vector to be injected. This is only
useful if in-kernel local APIC is not used.
useful if in-kernel local APIC or equivalent is not used.

/* for KVM_INTERRUPT */
struct kvm_interrupt {
/* in */
__u32 irq;
};

X86:

Note 'irq' is an interrupt vector, not an interrupt pin or line.

PPC:

Queues an external interrupt to be injected. This ioctl is overleaded
with 3 different irq values:

a) KVM_INTERRUPT_SET

This injects an edge type external interrupt into the guest once it's ready
to receive interrupts. When injected, the interrupt is done.

b) KVM_INTERRUPT_UNSET

This unsets any pending interrupt.

Only available with KVM_CAP_PPC_UNSET_IRQ.

c) KVM_INTERRUPT_SET_LEVEL

This injects a level type external interrupt into the guest context. The
interrupt stays pending until a specific ioctl with KVM_INTERRUPT_UNSET
is triggered.

Only available with KVM_CAP_PPC_IRQ_LEVEL.

Note that any value for 'irq' other than the ones stated above is invalid
and incurs unexpected behavior.

4.16 KVM_DEBUG_GUEST

Capability: basic
Expand Down Expand Up @@ -1013,8 +1042,9 @@ number is just right, the 'nent' field is adjusted to the number of valid
entries in the 'entries' array, which is then filled.

The entries returned are the host cpuid as returned by the cpuid instruction,
with unknown or unsupported features masked out. The fields in each entry
are defined as follows:
with unknown or unsupported features masked out. Some features (for example,
x2apic), may not be present in the host cpu, but are exposed by kvm if it can
emulate them efficiently. The fields in each entry are defined as follows:

function: the eax value used to obtain the entry
index: the ecx value used to obtain the entry (for entries that are
Expand All @@ -1032,6 +1062,29 @@ are defined as follows:
eax, ebx, ecx, edx: the values returned by the cpuid instruction for
this function/index combination

4.46 KVM_PPC_GET_PVINFO

Capability: KVM_CAP_PPC_GET_PVINFO
Architectures: ppc
Type: vm ioctl
Parameters: struct kvm_ppc_pvinfo (out)
Returns: 0 on success, !0 on error

struct kvm_ppc_pvinfo {
__u32 flags;
__u32 hcall[4];
__u8 pad[108];
};

This ioctl fetches PV specific information that need to be passed to the guest
using the device tree or other means from vm context.

For now the only implemented piece of information distributed here is an array
of 4 instructions that make up a hypercall.

If any additional field gets added to this structure later on, a bit for that
additional piece of information will be set in the flags bitmap.

5. The kvm_run structure

Application code obtains a pointer to the kvm_run structure by
Expand Down
196 changes: 196 additions & 0 deletions Documentation/kvm/ppc-pv.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,196 @@
The PPC KVM paravirtual interface
=================================

The basic execution principle by which KVM on PowerPC works is to run all kernel
space code in PR=1 which is user space. This way we trap all privileged
instructions and can emulate them accordingly.

Unfortunately that is also the downfall. There are quite some privileged
instructions that needlessly return us to the hypervisor even though they
could be handled differently.

This is what the PPC PV interface helps with. It takes privileged instructions
and transforms them into unprivileged ones with some help from the hypervisor.
This cuts down virtualization costs by about 50% on some of my benchmarks.

The code for that interface can be found in arch/powerpc/kernel/kvm*

Querying for existence
======================

To find out if we're running on KVM or not, we leverage the device tree. When
Linux is running on KVM, a node /hypervisor exists. That node contains a
compatible property with the value "linux,kvm".

Once you determined you're running under a PV capable KVM, you can now use
hypercalls as described below.

KVM hypercalls
==============

Inside the device tree's /hypervisor node there's a property called
'hypercall-instructions'. This property contains at most 4 opcodes that make
up the hypercall. To call a hypercall, just call these instructions.

The parameters are as follows:

Register IN OUT

r0 - volatile
r3 1st parameter Return code
r4 2nd parameter 1st output value
r5 3rd parameter 2nd output value
r6 4th parameter 3rd output value
r7 5th parameter 4th output value
r8 6th parameter 5th output value
r9 7th parameter 6th output value
r10 8th parameter 7th output value
r11 hypercall number 8th output value
r12 - volatile

Hypercall definitions are shared in generic code, so the same hypercall numbers
apply for x86 and powerpc alike with the exception that each KVM hypercall
also needs to be ORed with the KVM vendor code which is (42 << 16).

Return codes can be as follows:

Code Meaning

0 Success
12 Hypercall not implemented
<0 Error

The magic page
==============

To enable communication between the hypervisor and guest there is a new shared
page that contains parts of supervisor visible register state. The guest can
map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.

With this hypercall issued the guest always gets the magic page mapped at the
desired location in effective and physical address space. For now, we always
map the page to -4096. This way we can access it using absolute load and store
functions. The following instruction reads the first field of the magic page:

ld rX, -4096(0)

The interface is designed to be extensible should there be need later to add
additional registers to the magic page. If you add fields to the magic page,
also define a new hypercall feature to indicate that the host can give you more
registers. Only if the host supports the additional features, make use of them.

The magic page has the following layout as described in
arch/powerpc/include/asm/kvm_para.h:

struct kvm_vcpu_arch_shared {
__u64 scratch1;
__u64 scratch2;
__u64 scratch3;
__u64 critical; /* Guest may not get interrupts if == r1 */
__u64 sprg0;
__u64 sprg1;
__u64 sprg2;
__u64 sprg3;
__u64 srr0;
__u64 srr1;
__u64 dar;
__u64 msr;
__u32 dsisr;
__u32 int_pending; /* Tells the guest if we have an interrupt */
};

Additions to the page must only occur at the end. Struct fields are always 32
or 64 bit aligned, depending on them being 32 or 64 bit wide respectively.

Magic page features
===================

When mapping the magic page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE,
a second return value is passed to the guest. This second return value contains
a bitmap of available features inside the magic page.

The following enhancements to the magic page are currently available:

KVM_MAGIC_FEAT_SR Maps SR registers r/w in the magic page

For enhanced features in the magic page, please check for the existence of the
feature before using them!

MSR bits
========

The MSR contains bits that require hypervisor intervention and bits that do
not require direct hypervisor intervention because they only get interpreted
when entering the guest or don't have any impact on the hypervisor's behavior.

The following bits are safe to be set inside the guest:

MSR_EE
MSR_RI
MSR_CR
MSR_ME

If any other bit changes in the MSR, please still use mtmsr(d).

Patched instructions
====================

The "ld" and "std" instructions are transormed to "lwz" and "stw" instructions
respectively on 32 bit systems with an added offset of 4 to accomodate for big
endianness.

The following is a list of mapping the Linux kernel performs when running as
guest. Implementing any of those mappings is optional, as the instruction traps
also act on the shared page. So calling privileged instructions still works as
before.

From To
==== ==

mfmsr rX ld rX, magic_page->msr
mfsprg rX, 0 ld rX, magic_page->sprg0
mfsprg rX, 1 ld rX, magic_page->sprg1
mfsprg rX, 2 ld rX, magic_page->sprg2
mfsprg rX, 3 ld rX, magic_page->sprg3
mfsrr0 rX ld rX, magic_page->srr0
mfsrr1 rX ld rX, magic_page->srr1
mfdar rX ld rX, magic_page->dar
mfdsisr rX lwz rX, magic_page->dsisr

mtmsr rX std rX, magic_page->msr
mtsprg 0, rX std rX, magic_page->sprg0
mtsprg 1, rX std rX, magic_page->sprg1
mtsprg 2, rX std rX, magic_page->sprg2
mtsprg 3, rX std rX, magic_page->sprg3
mtsrr0 rX std rX, magic_page->srr0
mtsrr1 rX std rX, magic_page->srr1
mtdar rX std rX, magic_page->dar
mtdsisr rX stw rX, magic_page->dsisr

tlbsync nop

mtmsrd rX, 0 b <special mtmsr section>
mtmsr rX b <special mtmsr section>

mtmsrd rX, 1 b <special mtmsrd section>

[Book3S only]
mtsrin rX, rY b <special mtsrin section>

[BookE only]
wrteei [0|1] b <special wrteei section>


Some instructions require more logic to determine what's going on than a load
or store instruction can deliver. To enable patching of those, we keep some
RAM around where we can live translate instructions to. What happens is the
following:

1) copy emulation code to memory
2) patch that code to fit the emulated instruction
3) patch that code to return to the original pc + 4
4) patch the original instruction to branch to the new code

That way we can inject an arbitrary amount of code as replacement for a single
instruction. This allows us to check for pending interrupts when setting EE=1
for example.
Loading

0 comments on commit 1765a1f

Please sign in to comment.