lguest: update commentry
Every so often, after code shuffles, I need to go through and unbitrot
the Lguest Journey (see drivers/lguest/README).  Since we now use RCU in
a simple form in one place I took the opportunity to expand that explanation.

Signed-off-by: Rusty Russell <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Paul McKenney <[email protected]>
rustyrussell committed Jul 30, 2009
1 parent 2e04ef7 commit a91d74a
Showing 11 changed files with 398 additions and 111 deletions.
184 changes: 139 additions & 45 deletions Documentation/lguest/lguest.c

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions arch/x86/include/asm/lguest_hcall.h
@@ -35,10 +35,10 @@
* operations? There are two ways: the direct way is to make a "hypercall",
* to make requests of the Host Itself.
*
* We use the KVM hypercall mechanism. Seventeen hypercalls are
* available: the hypercall number is put in the %eax register, and the
* arguments (when required) are placed in %ebx, %ecx, %edx and %esi.
* If a return value makes sense, it's returned in %eax.
* We use the KVM hypercall mechanism, though completely different hypercall
* numbers. Seventeen hypercalls are available: the hypercall number is put in
* the %eax register, and the arguments (when required) are placed in %ebx,
* %ecx, %edx and %esi. If a return value makes sense, it's returned in %eax.
*
* Grossly invalid calls result in Sudden Death at the hands of the vengeful
* Host, rather than returning failure. This reflects Winston Churchill's
99 changes: 77 additions & 22 deletions arch/x86/lguest/boot.c
@@ -154,6 +154,7 @@ static void lazy_hcall1(unsigned long call,
async_hcall(call, arg1, 0, 0, 0);
}

/* You can imagine what lazy_hcall2, 3 and 4 look like. :*/
static void lazy_hcall2(unsigned long call,
unsigned long arg1,
unsigned long arg2)
@@ -189,8 +190,10 @@ static void lazy_hcall4(unsigned long call,
}
#endif

/* When lazy mode is turned off reset the per-cpu lazy mode variable and then
* issue the do-nothing hypercall to flush any stored calls. */
/*G:036
* When lazy mode is turned off reset the per-cpu lazy mode variable and then
* issue the do-nothing hypercall to flush any stored calls.
:*/
static void lguest_leave_lazy_mmu_mode(void)
{
kvm_hypercall0(LHCALL_FLUSH_ASYNC);
@@ -250,13 +253,11 @@ extern void lg_irq_enable(void);
extern void lg_restore_fl(unsigned long flags);

/*M:003
* Note that we don't check for outstanding interrupts when we re-enable them
* (or when we unmask an interrupt). This seems to work for the moment, since
* interrupts are rare and we'll just get the interrupt on the next timer tick,
* but now we can run with CONFIG_NO_HZ, we should revisit this. One way would
* be to put the "irq_enabled" field in a page by itself, and have the Host
* write-protect it when an interrupt comes in when irqs are disabled. There
* will then be a page fault as soon as interrupts are re-enabled.
* We could be more efficient in our checking of outstanding interrupts, rather
* than using a branch. One way would be to put the "irq_enabled" field in a
* page by itself, and have the Host write-protect it when an interrupt comes
* in when irqs are disabled. There will then be a page fault as soon as
* interrupts are re-enabled.
*
* A better method is to implement soft interrupt disable generally for x86:
* instead of disabling interrupts, we set a flag. If an interrupt does come
@@ -568,7 +569,7 @@ static void lguest_write_cr4(unsigned long val)
* cr3 ---> +---------+
* | --------->+---------+
* | | | PADDR1 |
* Top-level | | PADDR2 |
* Mid-level | | PADDR2 |
* (PMD) page | | |
* | | Lower-level |
* | | (PTE) page |
@@ -588,30 +589,70 @@ static void lguest_write_cr4(unsigned long val)
* Index into top Index into second Offset within page
* page directory page pagetable page
*
* The kernel spends a lot of time changing both the top-level page directory
* and lower-level pagetable pages. The Guest doesn't know physical addresses,
* so while it maintains these page tables exactly like normal, it also needs
* to keep the Host informed whenever it makes a change: the Host will create
* the real page tables based on the Guests'.
* Now, unfortunately, this isn't the whole story: Intel added Physical Address
* Extension (PAE) to allow 32 bit systems to use 64GB of memory (ie. 36 bits).
* These are held in 64-bit page table entries, so we can now only fit 512
* entries in a page, and the neat three-level tree breaks down.
*
* The result is a four level page table:
*
* cr3 --> [ 4 Upper ]
* [ Level ]
* [ Entries ]
* [(PUD Page)]---> +---------+
* | --------->+---------+
* | | | PADDR1 |
* Mid-level | | PADDR2 |
* (PMD) page | | |
* | | Lower-level |
* | | (PTE) page |
* | | | |
* .... ....
*
*
* And the virtual address is decoded as:
*
* 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
* |<-2->|<--- 9 bits ---->|<---- 9 bits --->|<------ 12 bits ------>|
* Index into Index into mid Index into lower Offset within page
* top entries directory page pagetable page
*
* It's too hard to switch between these two formats at runtime, so Linux only
* supports one or the other depending on whether CONFIG_X86_PAE is set. Many
* distributions turn it on, and not just for people with silly amounts of
* memory: the larger PTE entries allow room for the NX bit, which lets the
* kernel disable execution of pages and increase security.
*
* This was a problem for lguest, which couldn't run on these distributions;
* then Matias Zabaljauregui figured it all out and implemented it, and only a
* handful of puppies were crushed in the process!
*
* Back to our point: the kernel spends a lot of time changing both the
* top-level page directory and lower-level pagetable pages. The Guest doesn't
* know physical addresses, so while it maintains these page tables exactly
* like normal, it also needs to keep the Host informed whenever it makes a
* change: the Host will create the real page tables based on the Guests'.
*/
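
The PAE address decoding described in the comment above can be sketched in plain C. This is a standalone illustration, not code from the lguest tree: the `pae_indices` struct and the `pae_decode()` helper are hypothetical names chosen for the example, and the shifts assume the 2/9/9/12 bit split shown in the diagram.

```c
#include <stdint.h>

/* Hypothetical helper type: the four fields of a 32-bit PAE virtual address. */
struct pae_indices {
	unsigned int top;    /* 2 bits: which of the 4 top-level entries */
	unsigned int mid;    /* 9 bits: index into the mid-level (PMD) page */
	unsigned int pte;    /* 9 bits: index into the lower-level (PTE) page */
	unsigned int offset; /* 12 bits: byte offset within the 4096-byte page */
};

/* Split a virtual address into its PAE indices (assumed 2/9/9/12 layout). */
static struct pae_indices pae_decode(uint32_t vaddr)
{
	struct pae_indices idx;

	idx.top = vaddr >> 30;
	idx.mid = (vaddr >> 21) & 0x1ff;
	idx.pte = (vaddr >> 12) & 0x1ff;
	idx.offset = vaddr & 0xfff;
	return idx;
}
```

For the bit pattern drawn in the diagram (0xC0100000), this yields top index 3, mid index 0, PTE index 256 and offset 0.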

/*
* The Guest calls this to set a second-level entry (pte), ie. to map a page
* into a process' address space. We set the entry then tell the Host the
* toplevel and address this corresponds to. The Guest uses one pagetable per
* process, so we need to tell the Host which one we're changing (mm->pgd).
* The Guest calls this after it has set a second-level entry (pte), ie. to map
* a page into a process' address space. We tell the Host the toplevel and
* address this corresponds to. The Guest uses one pagetable per process, so
* we need to tell the Host which one we're changing (mm->pgd).
*/
static void lguest_pte_update(struct mm_struct *mm, unsigned long addr,
pte_t *ptep)
{
#ifdef CONFIG_X86_PAE
/* PAE needs to hand a 64 bit page table entry, so it uses two args. */
lazy_hcall4(LHCALL_SET_PTE, __pa(mm->pgd), addr,
ptep->pte_low, ptep->pte_high);
#else
lazy_hcall3(LHCALL_SET_PTE, __pa(mm->pgd), addr, ptep->pte_low);
#endif
}

/* This is the "set and update" combo-meal-deal version. */
static void lguest_set_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pteval)
{
@@ -672,20 +713,26 @@ static void lguest_set_pte(pte_t *ptep, pte_t pteval)
}

#ifdef CONFIG_X86_PAE
/*
* With 64-bit PTE values, we need to be careful setting them: if we set 32
* bits at a time, the hardware could see a weird half-set entry. These
* versions ensure we update all 64 bits at once.
*/
static void lguest_set_pte_atomic(pte_t *ptep, pte_t pte)
{
native_set_pte_atomic(ptep, pte);
if (cr3_changed)
lazy_hcall1(LHCALL_FLUSH_TLB, 1);
}
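
The half-set hazard mentioned in the comment above can be illustrated outside the kernel. The sketch below uses C11 `<stdatomic.h>` rather than the kernel's own primitives, so it is an analogy and makes no claim about what `native_set_pte_atomic()` does internally; `pte64_t`, `set_pte_split()`, `set_pte_whole()` and `read_pte()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit PAE page table entry. */
typedef _Atomic uint64_t pte64_t;

/*
 * Risky pattern: two separate 32-bit stores. Between them, a concurrent
 * reader could see the new low half paired with the old high half.
 */
static void set_pte_split(volatile uint32_t halves[2], uint64_t val)
{
	halves[0] = (uint32_t)val;         /* low 32 bits first */
	halves[1] = (uint32_t)(val >> 32); /* high 32 bits second */
}

/* Safer pattern: one atomic 64-bit store, so a reader sees old or new only. */
static void set_pte_whole(pte64_t *ptep, uint64_t val)
{
	atomic_store(ptep, val);
}

/* Atomic 64-bit read to pair with set_pte_whole(). */
static uint64_t read_pte(pte64_t *ptep)
{
	return atomic_load(ptep);
}
```

In a single-threaded run both variants end with the same value; the difference only matters when something else (here, the hardware page walker) can read the entry between the two 32-bit stores.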

void lguest_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
static void lguest_pte_clear(struct mm_struct *mm, unsigned long addr,
pte_t *ptep)
{
native_pte_clear(mm, addr, ptep);
lguest_pte_update(mm, addr, ptep);
}

void lguest_pmd_clear(pmd_t *pmdp)
static void lguest_pmd_clear(pmd_t *pmdp)
{
lguest_set_pmd(pmdp, __pmd(0));
}
@@ -784,6 +831,14 @@ static void __init lguest_init_IRQ(void)
irq_ctx_init(smp_processor_id());
}

/*
* With CONFIG_SPARSE_IRQ, interrupt descriptors are allocated as-needed, so
* rather than set them in lguest_init_IRQ we are called here every time an
* lguest device needs an interrupt.
*
* FIXME: irq_to_desc_alloc_node() can fail due to lack of memory, we should
* pass that up!
*/
void lguest_setup_irq(unsigned int irq)
{
irq_to_desc_alloc_node(irq, 0);
@@ -1298,7 +1353,7 @@ __init void lguest_init(void)
*/
switch_to_new_gdt(0);

/* As described in head_32.S, we map the first 128M of memory. */
/* We actually boot with all memory mapped, but let's say 128MB. */
max_pfn_mapped = (128*1024*1024) >> PAGE_SHIFT;

/*
2 changes: 2 additions & 0 deletions arch/x86/lguest/i386_head.S
@@ -102,6 +102,7 @@ send_interrupts:
* create one manually here.
*/
.byte 0x0f,0x01,0xc1 /* KVM_HYPERCALL */
/* Put eax back the way we found it. */
popl %eax
ret

@@ -125,6 +126,7 @@ ENTRY(lg_restore_fl)
jnz send_interrupts
/* Again, the normal path has used no extra registers. Clever, huh? */
ret
/*:*/

/* These demark the EIP range where host should never deliver interrupts. */
.global lguest_noirq_start
7 changes: 6 additions & 1 deletion drivers/lguest/core.c
@@ -217,10 +217,15 @@ int run_guest(struct lg_cpu *cpu, unsigned long __user *user)

/*
* It's possible the Guest did a NOTIFY hypercall to the
* Launcher, in which case we return from the read() now.
* Launcher.
*/
if (cpu->pending_notify) {
/*
* Does it just need to write to a registered
* eventfd (ie. the appropriate virtqueue thread)?
*/
if (!send_notify_to_eventfd(cpu)) {
/* OK, we tell the main Launcher. */
if (put_user(cpu->pending_notify, user))
return -EFAULT;
return sizeof(cpu->pending_notify);
6 changes: 5 additions & 1 deletion drivers/lguest/hypercalls.c
@@ -59,7 +59,7 @@ static void do_hcall(struct lg_cpu *cpu, struct hcall_args *args)
case LHCALL_SHUTDOWN: {
char msg[128];
/*
* Shutdown is such a trivial hypercall that we do it in four
* Shutdown is such a trivial hypercall that we do it in five
* lines right here.
*
* If the lgread fails, it will call kill_guest() itself; the
@@ -245,6 +245,10 @@ static void initialize(struct lg_cpu *cpu)
* device), the Guest will still see the old page. In practice, this never
* happens: why would the Guest read a page which it has never written to? But
* a similar scenario might one day bite us, so it's worth mentioning.
*
* Note that if we used a shared anonymous mapping in the Launcher instead of
* mapping /dev/zero private, we wouldn't worry about copy-on-write. And we
* need that to switch the Launcher to processes (away from threads) anyway.
:*/
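
The alternative mentioned at the end of this comment, a shared anonymous mapping, might look like the following. This is only a sketch of the `mmap()` call, assuming a Linux/glibc userspace; `map_guest_memory_shared()` is a hypothetical helper and not a function from the Launcher.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc in strict modes */
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: allocate zero-filled memory as a shared anonymous
 * mapping instead of a private mapping of /dev/zero. Returns NULL on failure.
 */
static void *map_guest_memory_shared(size_t len)
{
	void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	return mem == MAP_FAILED ? NULL : mem;
}
```

With `MAP_SHARED`, the pages stay shared across `fork()` rather than being copied on write, which is the property the comment says a process-based Launcher would need.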

/*H:100
11 changes: 6 additions & 5 deletions drivers/lguest/lguest_device.c
@@ -236,17 +236,14 @@ static void lg_notify(struct virtqueue *vq)
extern void lguest_setup_irq(unsigned int irq);

/*
* This routine finds the first virtqueue described in the configuration of
* This routine finds the Nth virtqueue described in the configuration of
* this device and sets it up.
*
* This is kind of an ugly duckling. It'd be nicer to have a standard
* representation of a virtqueue in the configuration space, but it seems that
* everyone wants to do it differently. The KVM coders want the Guest to
* allocate its own pages and tell the Host where they are, but for lguest it's
* simpler for the Host to simply tell us where the pages are.
*
* So we provide drivers with a "find the Nth virtqueue and set it up"
* function.
*/
static struct virtqueue *lg_find_vq(struct virtio_device *vdev,
unsigned index,
@@ -422,7 +419,11 @@ static void add_lguest_device(struct lguest_device_desc *d,

/* This device's parent is the lguest/ dir. */
ldev->vdev.dev.parent = lguest_root;
/* We have a unique device index thanks to the dev_index counter. */
/*
* The device type comes straight from the descriptor. There's also a
* device vendor field in the virtio_device struct, which we leave as
* 0.
*/
ldev->vdev.id.device = d->type;
/*
* We have a simple set of routines for querying the device's