Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block
Pull io_uring IO interface from Jens Axboe:
 "Second attempt at adding the io_uring interface.

  Since the first one, we've added basic unit testing of the three
  system calls, which resides in liburing like the other unit tests that
  we have so far. It'll take a while to get full coverage of it, but
  we're working towards it. I've also added two basic test programs to
  tools/io_uring. One uses the raw interface and has support for all the
  various features that io_uring supports outside of standard IO, like
  fixed files, fixed IO buffers, and polled IO. The other uses the
  liburing API, and is a simplified version of cp(1).

  This adds support for a new IO interface, io_uring.

  io_uring allows an application to communicate with the kernel through
  two rings, the submission queue (SQ) and completion queue (CQ) ring.
  This allows for very efficient handling of IOs, see the v5 posting for
  some basic numbers:

    https://lore.kernel.org/linux-block/[email protected]/

  Outside of just efficiency, the interface is also flexible and
  extendable, and allows for future use cases like the upcoming NVMe
  key-value store API, networked IO, and so on. It also supports async
  buffered IO, something that we've always failed to support in the
  kernel.

  Outside of basic IO features, it supports async polled IO as well.
  This particular feature has already been tested at Facebook months ago
  for flash storage boxes, with 25-33% improvements. It makes polled IO
  actually useful for real world use cases, where even basic flash sees
  a nice win in terms of efficiency, latency, and performance. These
  boxes were IOPS bound before, now they are not.

  This series adds three new system calls. One for setting up an
  io_uring instance (io_uring_setup(2)), one for submitting/completing
  IO (io_uring_enter(2)), and one for aux functions like registering
  file sets, buffers, etc (io_uring_register(2)). Through the help of
  Arnd, I've coordinated the syscall numbers so merge on that front
  should be painless.

  Jon did a writeup of the interface a while back, which (except for
  minor details that have been tweaked) is still accurate. Find that
  here:

    https://lwn.net/Articles/776703/

  Huge thanks to Al Viro for helping get the reference cycle code
  correct, and to Jann Horn for his extensive reviews focused on both
  security and bugs in general.

  There's a userspace library that provides basic functionality for
  applications that don't need or want to care about how to fiddle with
  the rings directly. It has helpers to allow applications to easily set
  up an io_uring instance, and submit/complete IO through it without
  knowing about the intricacies of the rings. It also includes man pages
  (thanks to Jeff Moyer), and will continue to grow helper
  functions and features as time progresses. Find it here:

    git://git.kernel.dk/liburing

  Fio has full support for the raw interface, both in the form of an IO
  engine (io_uring) and as a small test application (t/io_uring) that
  can exercise and benchmark the interface."

* tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
  io_uring: add a few test tools
  io_uring: allow workqueue item to handle multiple buffered requests
  io_uring: add support for IORING_OP_POLL
  io_uring: add io_kiocb ref count
  io_uring: add submission polling
  io_uring: add file set registration
  net: split out functions related to registering inflight socket files
  io_uring: add support for pre-mapped user IO buffers
  block: implement bio helper to add iter bvec pages to bio
  io_uring: batch io_kiocb allocation
  io_uring: use fget/fput_many() for file references
  fs: add fget_many() and fput_many()
  io_uring: support for IO polling
  io_uring: add fsync support
  Add io_uring IO interface
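
The two-ring design described in the pull message (a submission queue the application fills and a completion queue the kernel fills) can be pictured with a toy model in plain C. This is only a userspace sketch under stated assumptions: `toy_ring`, `toy_push`, `toy_pop` and `toy_round_trip` are hypothetical names, entries are bare integers, and the model ignores the shared-memory mapping and memory barriers that the real SQ/CQ rings need.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_RING_ENTRIES 8u                    /* must be a power of two */
#define TOY_RING_MASK    (TOY_RING_ENTRIES - 1u)

/* Hypothetical toy ring: the producer advances tail, the consumer head. */
struct toy_ring {
	uint32_t head;                         /* next slot to consume */
	uint32_t tail;                         /* next slot to fill */
	uint64_t slots[TOY_RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
bool toy_push(struct toy_ring *r, uint64_t v)
{
	if (r->tail - r->head == TOY_RING_ENTRIES)
		return false;
	r->slots[r->tail & TOY_RING_MASK] = v;
	r->tail++;
	return true;
}

/* Consumer side: returns false when the ring is empty. */
bool toy_pop(struct toy_ring *r, uint64_t *out)
{
	if (r->head == r->tail)
		return false;
	*out = r->slots[r->head & TOY_RING_MASK];
	r->head++;
	return true;
}

/*
 * Queue up to n requests on the SQ, let a pretend "kernel" move each one
 * to the CQ, then reap the completions. Returns how many were reaped.
 */
unsigned toy_round_trip(unsigned n)
{
	struct toy_ring sq = {0}, cq = {0};
	uint64_t req, done;
	unsigned completed = 0;

	for (unsigned i = 0; i < n; i++)
		if (!toy_push(&sq, i))
			break;                 /* SQ full: stop submitting */
	while (toy_pop(&sq, &req))
		toy_push(&cq, req);            /* "kernel" completes the request */
	while (toy_pop(&cq, &done))
		completed++;
	return completed;
}
```

Keeping head and tail as free-running counters and masking only on access is one common way to tell a full ring from an empty one without a spare slot; whether the kernel rings do exactly this is not stated in the message above.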
torvalds committed Mar 8, 2019
2 parents 80201fe + 21b4aa5 commit 38e7571
Showing 32 changed files with 4,783 additions and 146 deletions.
3 changes: 3 additions & 0 deletions arch/x86/entry/syscalls/syscall_32.tbl
@@ -429,3 +429,6 @@
 421 i386 rt_sigtimedwait_time64 sys_rt_sigtimedwait __ia32_compat_sys_rt_sigtimedwait_time64
 422 i386 futex_time64 sys_futex __ia32_sys_futex
 423 i386 sched_rr_get_interval_time64 sys_sched_rr_get_interval __ia32_sys_sched_rr_get_interval
+425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup
+426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter
+427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register
3 changes: 3 additions & 0 deletions arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,9 @@
 334 common rseq __x64_sys_rseq
 # don't use numbers 387 through 423, add new calls after the last
 # 'common' entry
+425 common io_uring_setup __x64_sys_io_uring_setup
+426 common io_uring_enter __x64_sys_io_uring_enter
+427 common io_uring_register __x64_sys_io_uring_register
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
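
With the numbers from the two tables above, userspace can check whether a running kernel has the new call before any libc wrapper exists. The following is a minimal sketch, not a recommended detection method: `toy_probe_io_uring` is a hypothetical helper, the fallback number 425 is taken from the x86 tables and may differ on other architectures, and it assumes that passing a NULL parameter pointer fails with EFAULT on a kernel that implements io_uring_setup(2) and with ENOSYS on one that does not.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Prefer the libc header's number; fall back to the x86 value from the diff. */
#ifdef __NR_io_uring_setup
#define TOY_NR_IO_URING_SETUP __NR_io_uring_setup
#else
#define TOY_NR_IO_URING_SETUP 425
#endif

/*
 * Hypothetical probe: returns 1 if io_uring_setup appears to exist,
 * 0 if the kernel reports ENOSYS (or this is not Linux), and -1 if the
 * answer is unclear (for example EPERM from a sandbox).
 */
int toy_probe_io_uring(void)
{
#ifdef __linux__
	errno = 0;
	/* NULL params is deliberately invalid, so no ring is ever created. */
	long ret = syscall(TOY_NR_IO_URING_SETUP, 1u, (void *)0);

	if (ret >= 0) {                /* not expected; clean up if it happens */
		close((int)ret);
		return 1;
	}
	if (errno == EFAULT)           /* assumption: the call exists, bad pointer */
		return 1;
	if (errno == ENOSYS)           /* the call is not implemented */
		return 0;
	return -1;
#else
	return 0;
#endif
}
```

Real applications would normally go through liburing, mentioned in the pull message, rather than probing with raw syscall numbers.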
62 changes: 54 additions & 8 deletions block/bio.c
@@ -836,6 +836,40 @@ int bio_add_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL(bio_add_page);
 
+static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
+{
+	const struct bio_vec *bv = iter->bvec;
+	unsigned int len;
+	size_t size;
+
+	if (WARN_ON_ONCE(iter->iov_offset > bv->bv_len))
+		return -EINVAL;
+
+	len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
+	size = bio_add_page(bio, bv->bv_page, len,
+				bv->bv_offset + iter->iov_offset);
+	if (size == len) {
+		struct page *page;
+		int i;
+
+		/*
+		 * For the normal O_DIRECT case, we could skip grabbing this
+		 * reference and then not have to put them again when IO
+		 * completes. But this breaks some in-kernel users, like
+		 * splicing to/from a loop device, where we release the pipe
+		 * pages unconditionally. If we can fix that case, we can
+		 * get rid of the get here and the need to call
+		 * bio_release_pages() at IO completion time.
+		 */
+		mp_bvec_for_each_page(page, bv, i)
+			get_page(page);
+		iov_iter_advance(iter, size);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
 #define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *))
 
 /**
@@ -884,23 +918,35 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 }
 
 /**
- * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
- * @iter: iov iterator describing the region to be mapped
+ * @iter: iov iterator describing the region to be added
+ *
+ * This takes either an iterator pointing to user memory, or one pointing to
+ * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
+ * map them into the kernel. On IO completion, the caller should put those
+ * pages. For now, when adding kernel pages, we still grab a reference to the
+ * page. This isn't strictly needed for the common case, but some call paths
+ * end up releasing pages from eg a pipe and we can't easily control these.
+ * See comment in __bio_iov_bvec_add_pages().
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
  * The function tries, but does not guarantee, to pin as many pages as
- * fit into the bio, or are requested in *iter, whatever is smaller.
- * If MM encounters an error pinning the requested pages, it stops.
- * Error is returned only if 0 pages could be pinned.
+ * fit into the bio, or are requested in *iter, whatever is smaller. If
+ * MM encounters an error pinning the requested pages, it stops. Error
+ * is returned only if 0 pages could be pinned.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
+	const bool is_bvec = iov_iter_is_bvec(iter);
 	unsigned short orig_vcnt = bio->bi_vcnt;
 
 	do {
-		int ret = __bio_iov_iter_get_pages(bio, iter);
+		int ret;
+
+		if (is_bvec)
+			ret = __bio_iov_bvec_add_pages(bio, iter);
+		else
+			ret = __bio_iov_iter_get_pages(bio, iter);
 
 		if (unlikely(ret))
 			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
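
The return convention in this hunk ("Error is returned only if 0 pages could be pinned") can be illustrated with a small userspace model. It is a sketch under assumptions, not kernel code: `toy_bio`, `toy_add_segment` and `toy_get_pages` are hypothetical names, a "bio" is reduced to a segment counter with an artificial failure point, and the loop condition is a guess because the rest of the kernel loop is not visible in the hunk.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio: it only counts the segments it holds. */
struct toy_bio {
	unsigned short vcnt;     /* segments added so far */
	unsigned short fail_at;  /* adding fails once vcnt reaches this value */
};

/* Add one segment of at most seg_len bytes, or fail with -EINVAL. */
static int toy_add_segment(struct toy_bio *bio, size_t *remaining,
			   size_t seg_len)
{
	/* Clamp to what is left, in the spirit of the min_t() call above. */
	size_t len = *remaining < seg_len ? *remaining : seg_len;

	if (seg_len == 0 || bio->vcnt >= bio->fail_at)
		return -EINVAL;
	bio->vcnt++;
	*remaining -= len;
	return 0;
}

/* Returns 0 if at least one segment was added, otherwise the error. */
int toy_get_pages(struct toy_bio *bio, size_t *remaining, size_t seg_len)
{
	unsigned short orig_vcnt = bio->vcnt;

	do {
		int ret = toy_add_segment(bio, remaining, seg_len);

		/* Same test as the hunk: partial progress hides the error. */
		if (ret)
			return bio->vcnt > orig_vcnt ? 0 : ret;
	} while (*remaining);    /* assumption: real loop condition not shown */

	return 0;
}
```

A caller of such a function has to look at how much was actually consumed instead of relying on the return value alone, since a return of 0 can mean partial progress.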
1 change: 1 addition & 0 deletions fs/Makefile
@@ -31,6 +31,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o
 obj-$(CONFIG_EVENTFD) += eventfd.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_AIO) += aio.o
+obj-$(CONFIG_IO_URING) += io_uring.o
 obj-$(CONFIG_FS_DAX) += dax.o
 obj-$(CONFIG_FS_ENCRYPTION) += crypto/
 obj-$(CONFIG_FILE_LOCKING) += locks.o
15 changes: 10 additions & 5 deletions fs/file.c
@@ -706,7 +706,7 @@ void do_close_on_exec(struct files_struct *files)
 	spin_unlock(&files->file_lock);
 }
 
-static struct file *__fget(unsigned int fd, fmode_t mask)
+static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
 {
 	struct files_struct *files = current->files;
 	struct file *file;
@@ -721,23 +721,28 @@ static struct file *__fget(unsigned int fd, fmode_t mask)
 		 */
 		if (file->f_mode & mask)
 			file = NULL;
-		else if (!get_file_rcu(file))
+		else if (!get_file_rcu_many(file, refs))
 			goto loop;
 	}
 	rcu_read_unlock();
 
 	return file;
 }
 
+struct file *fget_many(unsigned int fd, unsigned int refs)
+{
+	return __fget(fd, FMODE_PATH, refs);
+}
+
 struct file *fget(unsigned int fd)
 {
-	return __fget(fd, FMODE_PATH);
+	return __fget(fd, FMODE_PATH, 1);
 }
 EXPORT_SYMBOL(fget);
 
 struct file *fget_raw(unsigned int fd)
 {
-	return __fget(fd, 0);
+	return __fget(fd, 0, 1);
 }
 EXPORT_SYMBOL(fget_raw);
 
@@ -768,7 +773,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask)
 			return 0;
 		return (unsigned long)file;
 	} else {
-		file = __fget(fd, mask);
+		file = __fget(fd, mask, 1);
 		if (!file)
 			return 0;
 		return FDPUT_FPUT | (unsigned long)file;
9 changes: 7 additions & 2 deletions fs/file_table.c
@@ -326,9 +326,9 @@ void flush_delayed_fput(void)
 
 static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
 
-void fput(struct file *file)
+void fput_many(struct file *file, unsigned int refs)
 {
-	if (atomic_long_dec_and_test(&file->f_count)) {
+	if (atomic_long_sub_and_test(refs, &file->f_count)) {
 		struct task_struct *task = current;
 
 		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
@@ -347,6 +347,11 @@ void fput(struct file *file)
 	}
 }
 
+void fput(struct file *file)
+{
+	fput_many(file, 1);
+}
+
 /*
  * synchronous analog of fput(); for kernel threads that might be needed
  * in some umount() (and thus can't use flush_delayed_fput() without
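
The idea behind fget_many() and fput_many() is to take or drop several references with one atomic operation instead of one per request. A userspace sketch with C11 atomics is shown below. The names `toy_file`, `toy_get_many` and `toy_put_many` are hypothetical, and the "never revive a zero count" rule in `toy_get_many` is an assumption about what get_file_rcu_many() does, since its definition is not part of the hunks shown.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct file: just the reference count. */
struct toy_file {
	atomic_long f_count;
};

/*
 * Take refs references in one atomic step. Fails if the count has
 * already dropped to zero, so a dying object is never revived.
 */
bool toy_get_many(struct toy_file *f, long refs)
{
	long old = atomic_load(&f->f_count);

	do {
		if (old == 0)
			return false;
		/* On failure, old is reloaded with the current count. */
	} while (!atomic_compare_exchange_weak(&f->f_count, &old, old + refs));

	return true;
}

/*
 * Drop refs references in one atomic step. Returns true when this call
 * released the last reference, like the sub_and_test in fput_many().
 */
bool toy_put_many(struct toy_file *f, long refs)
{
	/* fetch_sub returns the old value; old == refs means new == 0. */
	return atomic_fetch_sub(&f->f_count, refs) == refs;
}
```

For a batch of eight requests against the same file this replaces eight atomic increments and eight decrements with one of each, which is presumably the motivation for the "use fget/fput_many() for file references" patch in the shortlog.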