Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull hmm updates from Jason Gunthorpe:
 "This is more cleanup and consolidation of the hmm APIs and the very
  strongly related mmu_notifier interfaces. Many places across the tree
  using these interfaces are touched in the process. Beyond that a
  cleanup to the page walker API and a few memremap related changes
  round out the series:

   - General improvement of hmm_range_fault() and related APIs, more
     documentation, bug fixes from testing, API simplification &
     consolidation, and unused API removal

   - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
     and make them internal kconfig selects

   - Hoist a lot of code related to mmu notifier attachment out of
     drivers by using a refcount get/put attachment idiom and remove the
     convoluted mmu_notifier_unregister_no_release() and related APIs.

   - General API improvement for the migrate_vma API and revision of its
     only user in nouveau

   - Annotate mmu_notifiers with lockdep and sleeping region debugging

  Two series unrelated to HMM or mmu_notifiers came along due to
  dependencies:

   - Allow pagemap's memremap_pages family of APIs to work without
     providing a struct device

   - Make walk_page_range() and related use a constant structure for
     function pointers"
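
As a quick illustration of the refcounted attachment idiom mentioned in the mmu_notifier bullet above: a driver embeds a struct mmu_notifier and lets the core share one registration per (ops, mm) pair. This is only a sketch against the interface added in this cycle; the wrapper struct and helper names are made up::

    #include <linux/mmu_notifier.h>
    #include <linux/slab.h>
    #include <linux/err.h>

    struct my_notifier {                      /* hypothetical driver wrapper */
            struct mmu_notifier mn;           /* embedded, refcounted by the core */
            /* driver-private state ... */
    };

    static struct mmu_notifier *my_alloc_notifier(struct mm_struct *mm)
    {
            struct my_notifier *p = kzalloc(sizeof(*p), GFP_KERNEL);

            return p ? &p->mn : ERR_PTR(-ENOMEM);
    }

    static void my_free_notifier(struct mmu_notifier *mn)
    {
            kfree(container_of(mn, struct my_notifier, mn));
    }

    static const struct mmu_notifier_ops my_notifier_ops = {
            .alloc_notifier = my_alloc_notifier,
            .free_notifier  = my_free_notifier,
            /* .invalidate_range_start/.invalidate_range_end as the driver needs */
    };

    /* Attach: returns the existing notifier for this mm or allocates a new one. */
    static struct my_notifier *my_notifier_get(struct mm_struct *mm)
    {
            struct mmu_notifier *mn = mmu_notifier_get(&my_notifier_ops, mm);

            if (IS_ERR(mn))
                    return ERR_CAST(mn);
            return container_of(mn, struct my_notifier, mn);
    }

    /* Detach: drops a reference; free_notifier() runs once the last ref is gone. */
    static void my_notifier_put(struct my_notifier *p)
    {
            mmu_notifier_put(&p->mn);
    }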

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
  libnvdimm: Enable unit test infrastructure compile checks
  mm, notifier: Catch sleeping/blocking for !blockable
  kernel.h: Add non_block_start/end()
  drm/radeon: guard against calling an unpaired radeon_mn_unregister()
  csky: add missing brackets in a macro for tlb.h
  pagewalk: use lockdep_assert_held for locking validation
  pagewalk: separate function pointers from iterator data
  mm: split out a new pagewalk.h header from mm.h
  mm/mmu_notifiers: annotate with might_sleep()
  mm/mmu_notifiers: prime lockdep
  mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
  mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
  mm/hmm: hmm_range_fault() infinite loop
  mm/hmm: hmm_range_fault() NULL pointer bug
  mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
  mm/mmu_notifiers: remove unregister_no_release
  RDMA/odp: remove ib_ucontext from ib_umem
  RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
  RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
  RDMA/mlx5: Use ib_umem_start instead of umem.address
  ...
torvalds committed Sep 21, 2019
2 parents 227c3e9 + 62974fc commit 84da111
Showing 65 changed files with 1,684 additions and 2,184 deletions.
73 changes: 11 additions & 62 deletions Documentation/vm/hmm.rst
@@ -192,15 +192,14 @@ read only, or fully unmap, etc.). The device must complete the update before
the driver callback returns.

When the device driver wants to populate a range of virtual addresses, it can
use either::
use::

long hmm_range_snapshot(struct hmm_range *range);
long hmm_range_fault(struct hmm_range *range, bool block);
long hmm_range_fault(struct hmm_range *range, unsigned int flags);

The first one (hmm_range_snapshot()) will only fetch present CPU page table
With the HMM_RANGE_SNAPSHOT flag, it will only fetch present CPU page table
entries and will not trigger a page fault on missing or non-present entries.
The second one does trigger a page fault on missing or read-only entries if
write access is requested (see below). Page faults use the generic mm page
Without that flag, it does trigger a page fault on missing or read-only entries
if write access is requested (see below). Page faults use the generic mm page
fault code path just like a CPU page fault.

Both functions copy CPU page table entries into their pfns array argument. Each
@@ -223,24 +222,24 @@ The usage pattern is::
range.flags = ...;
range.values = ...;
range.pfn_shift = ...;
hmm_range_register(&range);
hmm_range_register(&range, mirror);

/*
* Just wait for range to be valid, safe to ignore return value as we
* will use the return value of hmm_range_snapshot() below under the
* will use the return value of hmm_range_fault() below under the
* mmap_sem to ascertain the validity of the range.
*/
hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);

again:
down_read(&mm->mmap_sem);
ret = hmm_range_snapshot(&range);
ret = hmm_range_fault(&range, HMM_RANGE_SNAPSHOT);
if (ret) {
up_read(&mm->mmap_sem);
if (ret == -EBUSY) {
/*
* No need to check hmm_range_wait_until_valid() return value
* on retry we will get proper error with hmm_range_snapshot()
* on retry we will get proper error with hmm_range_fault()
*/
hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
goto again;
@@ -340,58 +339,8 @@ Migration to and from device memory
===================================

Because the CPU cannot access device memory, migration must use the device DMA
engine to perform copy from and to device memory. For this we need a new
migration helper::

int migrate_vma(const struct migrate_vma_ops *ops,
struct vm_area_struct *vma,
unsigned long mentries,
unsigned long start,
unsigned long end,
unsigned long *src,
unsigned long *dst,
void *private);

Unlike other migration functions it works on a range of virtual address, there
are two reasons for that. First, device DMA copy has a high setup overhead cost
and thus batching multiple pages is needed as otherwise the migration overhead
makes the whole exercise pointless. The second reason is because the
migration might be for a range of addresses the device is actively accessing.

The migrate_vma_ops struct defines two callbacks. First one (alloc_and_copy())
controls destination memory allocation and copy operation. Second one is there
to allow the device driver to perform cleanup operations after migration::

struct migrate_vma_ops {
void (*alloc_and_copy)(struct vm_area_struct *vma,
const unsigned long *src,
unsigned long *dst,
unsigned long start,
unsigned long end,
void *private);
void (*finalize_and_map)(struct vm_area_struct *vma,
const unsigned long *src,
const unsigned long *dst,
unsigned long start,
unsigned long end,
void *private);
};

It is important to stress that these migration helpers allow for holes in the
virtual address range. Some pages in the range might not be migrated for all
the usual reasons (page is pinned, page is locked, ...). This helper does not
fail but just skips over those pages.

The alloc_and_copy() might decide to not migrate all pages in the
range (for reasons under the callback control). For those, the callback just
has to leave the corresponding dst entry empty.

Finally, the migration of the struct page might fail (for file backed page) for
various reasons (failure to freeze reference, or update page cache, ...). If
that happens, then the finalize_and_map() can catch any pages that were not
migrated. Note those pages were still copied to a new page and thus we wasted
bandwidth but this is considered as a rare event and a price that we are
willing to pay to keep all the code simpler.
engine to perform copy from and to device memory. For this we need to use
migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize() helpers.


Memory cgroup (memcg) and rss accounting
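
For reference, the three helpers that the rewritten "Migration to and from device memory" text above now points to split the old monolithic migrate_vma() into explicit phases. The sketch below only illustrates the control flow; my_alloc_device_page() and my_copy_to_device() are made-up stand-ins for the driver's allocator and DMA engine, and real drivers (see the nouveau changes in this series) handle errors and larger ranges::

    /* Caller holds mmap_sem for read and has validated [start, end). */
    static int my_migrate_to_device(struct vm_area_struct *vma,
                                    unsigned long start, unsigned long end)
    {
            unsigned long src[64], dst[64];
            struct migrate_vma args = {
                    .vma   = vma,
                    .start = start,
                    .end   = end,           /* at most ARRAY_SIZE(src) pages */
                    .src   = src,
                    .dst   = dst,
            };
            unsigned long i;
            int ret;

            /* Phase 1: collect and unmap the source pages, filling args.src[]. */
            ret = migrate_vma_setup(&args);
            if (ret)
                    return ret;
            if (!args.cpages)
                    return 0;               /* nothing migratable in the range */

            for (i = 0; i < args.npages; i++) {
                    struct page *dpage;

                    if (!(args.src[i] & MIGRATE_PFN_MIGRATE))
                            continue;       /* pinned, locked, ... - skipped */

                    /* Phase 2: allocate destination memory and copy via DMA. */
                    dpage = my_alloc_device_page();
                    if (!dpage)
                            continue;
                    lock_page(dpage);
                    my_copy_to_device(dpage, migrate_pfn_to_page(args.src[i]));
                    args.dst[i] = migrate_pfn(page_to_pfn(dpage)) |
                                  MIGRATE_PFN_LOCKED;
            }

            /* Phase 3: install the copies, then release the original pages. */
            migrate_vma_pages(&args);
            migrate_vma_finalize(&args);
            return 0;
    }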
8 changes: 4 additions & 4 deletions arch/csky/include/asm/tlb.h
@@ -8,14 +8,14 @@

#define tlb_start_vma(tlb, vma) \
do { \
if (!tlb->fullmm) \
flush_cache_range(vma, vma->vm_start, vma->vm_end); \
if (!(tlb)->fullmm) \
flush_cache_range(vma, (vma)->vm_start, (vma)->vm_end); \
} while (0)

#define tlb_end_vma(tlb, vma) \
do { \
if (!tlb->fullmm) \
flush_tlb_range(vma, vma->vm_start, vma->vm_end); \
if (!(tlb)->fullmm) \
flush_tlb_range(vma, (vma)->vm_start, (vma)->vm_end); \
} while (0)

#define tlb_flush(tlb) flush_tlb_mm((tlb)->mm)
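
The csky hunk above is pure macro hygiene: tlb and vma are macro parameters, and without their own parentheses an expression argument associates with "->" in the wrong place. A stand-alone illustration (not kernel code)::

    struct range { unsigned long start, end; };

    #define BAD_END(r)      r->end          /* unparenthesized parameter */
    #define GOOD_END(r)     (r)->end        /* what the patch switches to */

    static unsigned long end_of(struct range *ranges, int i)
    {
            /*
             * BAD_END(ranges + i) expands to "ranges + i->end", applying
             * "->" to the integer i, which does not even compile.
             * GOOD_END(ranges + i) expands to "(ranges + i)->end", which
             * is what the caller meant.
             */
            return GOOD_END(ranges + i);
    }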
23 changes: 13 additions & 10 deletions arch/openrisc/kernel/dma.c
@@ -16,6 +16,7 @@
*/

#include <linux/dma-noncoherent.h>
#include <linux/pagewalk.h>

#include <asm/cpuinfo.h>
#include <asm/spr_defs.h>
@@ -43,6 +44,10 @@ page_set_nocache(pte_t *pte, unsigned long addr,
return 0;
}

static const struct mm_walk_ops set_nocache_walk_ops = {
.pte_entry = page_set_nocache,
};

static int
page_clear_nocache(pte_t *pte, unsigned long addr,
unsigned long next, struct mm_walk *walk)
@@ -58,6 +63,10 @@ page_clear_nocache(pte_t *pte, unsigned long addr,
return 0;
}

static const struct mm_walk_ops clear_nocache_walk_ops = {
.pte_entry = page_clear_nocache,
};

/*
* Alloc "coherent" memory, which for OpenRISC means simply uncached.
*
@@ -80,10 +89,6 @@ arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
{
unsigned long va;
void *page;
struct mm_walk walk = {
.pte_entry = page_set_nocache,
.mm = &init_mm
};

page = alloc_pages_exact(size, gfp | __GFP_ZERO);
if (!page)
@@ -98,7 +103,8 @@ arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
* We need to iterate through the pages, clearing the dcache for
* them and setting the cache-inhibit bit.
*/
if (walk_page_range(va, va + size, &walk)) {
if (walk_page_range(&init_mm, va, va + size, &set_nocache_walk_ops,
NULL)) {
free_pages_exact(page, size);
return NULL;
}
@@ -111,13 +117,10 @@ arch_dma_free(struct device *dev, size_t size, void *vaddr,
dma_addr_t dma_handle, unsigned long attrs)
{
unsigned long va = (unsigned long)vaddr;
struct mm_walk walk = {
.pte_entry = page_clear_nocache,
.mm = &init_mm
};

/* walk_page_range shouldn't be able to fail here */
WARN_ON(walk_page_range(va, va + size, &walk));
WARN_ON(walk_page_range(&init_mm, va, va + size,
&clear_nocache_walk_ops, NULL));

free_pages_exact(vaddr, size);
}
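
The OpenRISC conversion above shows the general shape of the new page walk API from this series: the callback table becomes a static const struct mm_walk_ops (so it can live in .rodata), walk_page_range() takes the mm_struct explicitly, and per-call state travels through the trailing private pointer instead of a stack-initialized struct mm_walk. A small sketch with an invented callback; holding mmap_sem is the caller's job (the walker now asserts it via lockdep)::

    #include <linux/pagewalk.h>

    struct count_ctx {
            unsigned long present;
    };

    static int count_present_pte(pte_t *pte, unsigned long addr,
                                 unsigned long next, struct mm_walk *walk)
    {
            struct count_ctx *ctx = walk->private;  /* last walk_page_range() arg */

            if (pte_present(*pte))
                    ctx->present++;
            return 0;
    }

    static const struct mm_walk_ops count_present_ops = {
            .pte_entry = count_present_pte,
    };

    static unsigned long count_present(struct mm_struct *mm,
                                       unsigned long start, unsigned long end)
    {
            struct count_ctx ctx = { 0 };

            down_read(&mm->mmap_sem);
            walk_page_range(mm, start, end, &count_present_ops, &ctx);
            up_read(&mm->mmap_sem);
            return ctx.present;
    }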
12 changes: 6 additions & 6 deletions arch/powerpc/mm/book3s64/subpage_prot.c
@@ -7,7 +7,7 @@
#include <linux/kernel.h>
#include <linux/gfp.h>
#include <linux/types.h>
#include <linux/mm.h>
#include <linux/pagewalk.h>
#include <linux/hugetlb.h>
#include <linux/syscalls.h>

@@ -139,14 +139,14 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
return 0;
}

static const struct mm_walk_ops subpage_walk_ops = {
.pmd_entry = subpage_walk_pmd_entry,
};

static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
unsigned long len)
{
struct vm_area_struct *vma;
struct mm_walk subpage_proto_walk = {
.mm = mm,
.pmd_entry = subpage_walk_pmd_entry,
};

/*
* We don't try too hard, we just mark all the vma in that range
@@ -163,7 +163,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
if (vma->vm_start >= (addr + len))
break;
vma->vm_flags |= VM_NOHUGEPAGE;
walk_page_vma(vma, &subpage_proto_walk);
walk_page_vma(vma, &subpage_walk_ops, NULL);
vma = vma->vm_next;
}
}
35 changes: 16 additions & 19 deletions arch/s390/mm/gmap.c
@@ -9,7 +9,7 @@
*/

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/pagewalk.h>
#include <linux/swap.h>
#include <linux/smp.h>
#include <linux/spinlock.h>
@@ -2521,13 +2521,9 @@ static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
return 0;
}

static inline void zap_zero_pages(struct mm_struct *mm)
{
struct mm_walk walk = { .pmd_entry = __zap_zero_pages };

walk.mm = mm;
walk_page_range(0, TASK_SIZE, &walk);
}
static const struct mm_walk_ops zap_zero_walk_ops = {
.pmd_entry = __zap_zero_pages,
};

/*
* switch on pgstes for its userspace process (for kvm)
@@ -2546,7 +2542,7 @@ int s390_enable_sie(void)
mm->context.has_pgste = 1;
/* split thp mappings and disable thp for future mappings */
thp_split_mm(mm);
zap_zero_pages(mm);
walk_page_range(mm, 0, TASK_SIZE, &zap_zero_walk_ops, NULL);
up_write(&mm->mmap_sem);
return 0;
}
@@ -2589,12 +2585,13 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
return 0;
}

static const struct mm_walk_ops enable_skey_walk_ops = {
.hugetlb_entry = __s390_enable_skey_hugetlb,
.pte_entry = __s390_enable_skey_pte,
};

int s390_enable_skey(void)
{
struct mm_walk walk = {
.hugetlb_entry = __s390_enable_skey_hugetlb,
.pte_entry = __s390_enable_skey_pte,
};
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
int rc = 0;
@@ -2614,8 +2611,7 @@ int s390_enable_skey(void)
}
mm->def_flags &= ~VM_MERGEABLE;

walk.mm = mm;
walk_page_range(0, TASK_SIZE, &walk);
walk_page_range(mm, 0, TASK_SIZE, &enable_skey_walk_ops, NULL);

out_up:
up_write(&mm->mmap_sem);
@@ -2633,13 +2629,14 @@ static int __s390_reset_cmma(pte_t *pte, unsigned long addr,
return 0;
}

static const struct mm_walk_ops reset_cmma_walk_ops = {
.pte_entry = __s390_reset_cmma,
};

void s390_reset_cmma(struct mm_struct *mm)
{
struct mm_walk walk = { .pte_entry = __s390_reset_cmma };

down_write(&mm->mmap_sem);
walk.mm = mm;
walk_page_range(0, TASK_SIZE, &walk);
walk_page_range(mm, 0, TASK_SIZE, &reset_cmma_walk_ops, NULL);
up_write(&mm->mmap_sem);
}
EXPORT_SYMBOL_GPL(s390_reset_cmma);
4 changes: 3 additions & 1 deletion drivers/gpu/drm/amd/amdgpu/Kconfig
@@ -27,7 +27,9 @@ config DRM_AMDGPU_CIK
config DRM_AMDGPU_USERPTR
bool "Always enable userptr write support"
depends on DRM_AMDGPU
depends on HMM_MIRROR
depends on MMU
select HMM_MIRROR
select MMU_NOTIFIER
help
This option selects CONFIG_HMM and CONFIG_HMM_MIRROR if it
isn't already selected to enabled full userptr support.
2 changes: 2 additions & 0 deletions drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -35,6 +35,7 @@
#include <linux/pm_runtime.h>
#include <linux/vga_switcheroo.h>
#include <drm/drm_probe_helper.h>
#include <linux/mmu_notifier.h>

#include "amdgpu.h"
#include "amdgpu_irq.h"
@@ -1469,6 +1470,7 @@ static void __exit amdgpu_exit(void)
amdgpu_unregister_atpx_handler();
amdgpu_sync_fini();
amdgpu_fence_slab_fini();
mmu_notifier_synchronize();
}

module_init(amdgpu_init);
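
The single functional line added to amdgpu_exit() above is the flip side of the mmu_notifier_put() based teardown: the final free runs from an SRCU callback, so a module has to drain those callbacks before its code can be unloaded. A minimal sketch of the same exit-path pattern for a hypothetical driver::

    #include <linux/module.h>
    #include <linux/mmu_notifier.h>

    static void __exit my_driver_exit(void)
    {
            my_driver_teardown();           /* made-up: drops notifier refs */

            /*
             * Wait for any mmu_notifier_put() triggered release to finish,
             * so no pending SRCU callback still points into this module.
             */
            mmu_notifier_synchronize();
    }
    module_exit(my_driver_exit);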
15 changes: 8 additions & 7 deletions drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -195,13 +195,14 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
* Block for operations on BOs to finish and mark pages as accessed and
* potentially dirty.
*/
static int amdgpu_mn_sync_pagetables_gfx(struct hmm_mirror *mirror,
const struct hmm_update *update)
static int
amdgpu_mn_sync_pagetables_gfx(struct hmm_mirror *mirror,
const struct mmu_notifier_range *update)
{
struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
unsigned long start = update->start;
unsigned long end = update->end;
bool blockable = update->blockable;
bool blockable = mmu_notifier_range_blockable(update);
struct interval_tree_node *it;

/* notification is exclusive, but interval is inclusive */
@@ -243,13 +244,14 @@ static int amdgpu_mn_sync_pagetables_gfx(struct hmm_mirror *mirror,
* necessitates evicting all user-mode queues of the process. The BOs
* are restorted in amdgpu_mn_invalidate_range_end_hsa.
*/
static int amdgpu_mn_sync_pagetables_hsa(struct hmm_mirror *mirror,
const struct hmm_update *update)
static int
amdgpu_mn_sync_pagetables_hsa(struct hmm_mirror *mirror,
const struct mmu_notifier_range *update)
{
struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
unsigned long start = update->start;
unsigned long end = update->end;
bool blockable = update->blockable;
bool blockable = mmu_notifier_range_blockable(update);
struct interval_tree_node *it;

/* notification is exclusive, but interval is inclusive */
@@ -482,6 +484,5 @@ void amdgpu_hmm_init_range(struct hmm_range *range)
range->flags = hmm_range_flags;
range->values = hmm_range_values;
range->pfn_shift = PAGE_SHIFT;
INIT_LIST_HEAD(&range->list);
}
}