Skip to content

Commit

Permalink
mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM
Browse files Browse the repository at this point in the history
Pach series "Add FOLL_LONGTERM to GUP fast and use it".

HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages.  These pages can be held for a significant time.  But
get_user_pages_fast() does not protect against mapping FS DAX pages.

Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks.  XDP has also
shown interest in using this functionality.[1]

In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.

[1] https://lkml.org/lkml/2019/3/19/939

"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move.  I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem.  Then I think we can change the flag to a
better name.

Secondly, it depends on how often you are registering memory.  I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance.  I don't have the numbers as the
tests for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an aside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

This patch (of 7):

This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast().  Some callers who would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.

Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.

This patch does not change any functionality.  In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked.  However, callers of get_user_pages_fast()
were not "protected".

FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.

NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages.  This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.

As a side affect some of the interfaces are cleaned up but this is not the
primary purpose of the series.

In review[1] it was asked:

<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?

A couple of points.

First "longterm" is a relative thing and at this point is probably a
misnomer.  This is really flagging a pin which is going to be given to
hardware and can't move.  I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem.  Then I think we can
change the flag to a better name.

Second, It depends on how often you are registering memory.  I have spoken
with some RDMA users who consider MR in the performance path...  For the
overall application performance.  I don't have the numbers as the tests
for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an asside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

</quote>

[1] https://lore.kernel.org/lkml/[email protected]/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

[[email protected]: v3]
  Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ira Weiny <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Mike Marshall <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
  • Loading branch information
weiny2 authored and torvalds committed May 14, 2019
1 parent a222f34 commit 932f4a6
Show file tree
Hide file tree
Showing 11 changed files with 173 additions and 108 deletions.
5 changes: 3 additions & 2 deletions arch/powerpc/mm/book3s64/iommu_api.c
Original file line number Diff line number Diff line change
Expand Up @@ -141,8 +141,9 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
for (entry = 0; entry < entries; entry += chunk) {
unsigned long n = min(entries - entry, chunk);

ret = get_user_pages_longterm(ua + (entry << PAGE_SHIFT), n,
FOLL_WRITE, mem->hpages + entry, NULL);
ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
FOLL_WRITE | FOLL_LONGTERM,
mem->hpages + entry, NULL);
if (ret == n) {
pinned += n;
continue;
Expand Down
5 changes: 3 additions & 2 deletions drivers/infiniband/core/umem.c
Original file line number Diff line number Diff line change
Expand Up @@ -295,10 +295,11 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,

while (npages) {
down_read(&mm->mmap_sem);
ret = get_user_pages_longterm(cur_base,
ret = get_user_pages(cur_base,
min_t(unsigned long, npages,
PAGE_SIZE / sizeof (struct page *)),
gup_flags, page_list, NULL);
gup_flags | FOLL_LONGTERM,
page_list, NULL);
if (ret < 0) {
up_read(&mm->mmap_sem);
goto umem_release;
Expand Down
8 changes: 4 additions & 4 deletions drivers/infiniband/hw/qib/qib_user_pages.c
Original file line number Diff line number Diff line change
Expand Up @@ -114,10 +114,10 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,

down_read(&current->mm->mmap_sem);
for (got = 0; got < num_pages; got += ret) {
ret = get_user_pages_longterm(start_page + got * PAGE_SIZE,
num_pages - got,
FOLL_WRITE | FOLL_FORCE,
p + got, NULL);
ret = get_user_pages(start_page + got * PAGE_SIZE,
num_pages - got,
FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE,
p + got, NULL);
if (ret < 0) {
up_read(&current->mm->mmap_sem);
goto bail_release;
Expand Down
9 changes: 5 additions & 4 deletions drivers/infiniband/hw/usnic/usnic_uiom.c
Original file line number Diff line number Diff line change
Expand Up @@ -143,10 +143,11 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
ret = 0;

while (npages) {
ret = get_user_pages_longterm(cur_base,
min_t(unsigned long, npages,
PAGE_SIZE / sizeof(struct page *)),
gup_flags, page_list, NULL);
ret = get_user_pages(cur_base,
min_t(unsigned long, npages,
PAGE_SIZE / sizeof(struct page *)),
gup_flags | FOLL_LONGTERM,
page_list, NULL);

if (ret < 0)
goto out;
Expand Down
6 changes: 3 additions & 3 deletions drivers/media/v4l2-core/videobuf-dma-sg.c
Original file line number Diff line number Diff line change
Expand Up @@ -186,12 +186,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);

err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages,
flags, dma->pages, NULL);
err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
flags | FOLL_LONGTERM, dma->pages, NULL);

if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err,
dprintk(1, "get_user_pages: err=%d [%d]\n", err,
dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
Expand Down
3 changes: 2 additions & 1 deletion drivers/vfio/vfio_iommu_type1.c
Original file line number Diff line number Diff line change
Expand Up @@ -358,7 +358,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,

down_read(&mm->mmap_sem);
if (mm == current->mm) {
ret = get_user_pages_longterm(vaddr, 1, flags, page, vmas);
ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
vmas);
} else {
ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
vmas, NULL);
Expand Down
5 changes: 3 additions & 2 deletions fs/io_uring.c
Original file line number Diff line number Diff line change
Expand Up @@ -2697,8 +2697,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,

ret = 0;
down_read(&current->mm->mmap_sem);
pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
pages, vmas);
pret = get_user_pages(ubuf, nr_pages,
FOLL_WRITE | FOLL_LONGTERM,
pages, vmas);
if (pret == nr_pages) {
/* don't support file backed memory */
for (j = 0; j < nr_pages; j++) {
Expand Down
41 changes: 28 additions & 13 deletions include/linux/mm.h
Original file line number Diff line number Diff line change
Expand Up @@ -1505,19 +1505,6 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
struct page **pages, unsigned int gup_flags);

#if defined(CONFIG_FS_DAX) || defined(CONFIG_CMA)
long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas);
#else
static inline long get_user_pages_longterm(unsigned long start,
unsigned long nr_pages, unsigned int gup_flags,
struct page **pages, struct vm_area_struct **vmas)
{
return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
}
#endif /* CONFIG_FS_DAX */

int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);

Expand Down Expand Up @@ -2583,6 +2570,34 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
#define FOLL_COW 0x4000 /* internal GUP flag */
#define FOLL_ANON 0x8000 /* don't do file mappings */
#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */

/*
* NOTE on FOLL_LONGTERM:
*
* FOLL_LONGTERM indicates that the page will be held for an indefinite time
* period _often_ under userspace control. This is contrasted with
* iov_iter_get_pages() where usages which are transient.
*
* FIXME: For pages which are part of a filesystem, mappings are subject to the
* lifetime enforced by the filesystem and we need guarantees that longterm
* users like RDMA and V4L2 only establish mappings which coordinate usage with
* the filesystem. Ideas for this coordination include revoking the longterm
* pin, delaying writeback, bounce buffer page writeback, etc. As FS DAX was
* added after the problem with filesystems was found FS DAX VMAs are
* specifically failed. Filesystem pages are still subject to bugs and use of
* FOLL_LONGTERM should be avoided on those pages.
*
* FIXME: Also NOTE that FOLL_LONGTERM is not supported in every GUP call.
* Currently only get_user_pages() and get_user_pages_fast() support this flag
* and calls to get_user_pages_[un]locked are specifically not allowed. This
* is due to an incompatibility with the FS DAX check and
* FAULT_FLAG_ALLOW_RETRY
*
* In the CMA case: longterm pins in a CMA region would unnecessarily fragment
* that region. And so CMA attempts to migrate the page before pinning when
* FOLL_LONGTERM is specified.
*/

static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
{
Expand Down
Loading

0 comments on commit 932f4a6

Please sign in to comment.