blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:

- The classic request_fn based approach, where drivers use struct
  request units for IO. The block layer provides various helper
  functionalities to let drivers share code, things like tag
  management, timeout handling, queueing, etc.

- The "stacked" approach, where a driver squeezes in between the
  block layer and IO submitter. Since this bypasses the IO stack,
  drivers generally have to manage everything themselves.

With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.

The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.

This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
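
To make the 1:1 versus N:M distinction concrete, here is a minimal
userspace sketch of how a per-cpu queue could be assigned to a hardware
submission queue. The helper name sketch_map_cpu_to_hwq and the
block-distribution policy are assumptions chosen for illustration; this
is not the mapping code added by this commit.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): pick a hardware queue
 * index for the software queue belonging to a given CPU.
 */
static unsigned int sketch_map_cpu_to_hwq(unsigned int cpu,
                                          unsigned int nr_cpus,
                                          unsigned int nr_hw_queues)
{
    assert(nr_cpus > 0 && nr_hw_queues > 0 && cpu < nr_cpus);

    /* At least one hardware queue per CPU: use a 1:1 mapping. */
    if (nr_hw_queues >= nr_cpus)
        return cpu;

    /*
     * Fewer hardware queues than CPUs (N:M): hand out contiguous
     * blocks of CPUs so that neighbouring CPUs share a queue.
     */
    return (unsigned int)(((unsigned long long)cpu * nr_hw_queues) / nr_cpus);
}
```

With 8 CPUs and 2 hardware queues, this sketch sends CPUs 0-3 to queue 0
and CPUs 4-7 to queue 1. A real mapping would presumably also take CPU
topology into account, which this example ignores.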

blk-mq provides various helper functions, which include:

- Scalable support for request tagging. Most devices need to
  be able to uniquely identify a request both in the driver and
  to the hardware. The tagging uses per-cpu caches for freed
  tags, to enable cache hot reuse.

- Timeout handling without tracking requests on a per-device
  basis. Basically, the driver should be able to get a notification
  if a request happens to fail.

- Optional support for non 1:1 mappings between issue and
  submission queues. blk-mq can redirect IO completions to the
  desired location.

- Support for per-request payloads. Drivers almost always need
  to associate a request structure with some driver private
  command structure. Drivers can tell blk-mq this at init time,
  and then any request handed to the driver will have the
  required size of memory associated with it.

- Support for merging of IO, and plugging. The stacked model
  gets neither of these. Even for high IOPS devices, merging
  sequential IO reduces per-command overhead and thus
  increases bandwidth.
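
The per-request payload item above can be illustrated with a small
userspace sketch: the request and the driver's private command area are
allocated as one block, and the payload is found at a fixed offset past
the request. All names here (struct sketch_request, sketch_alloc_request,
sketch_rq_to_pdu) are hypothetical stand-ins, and the layout is an
assumption made for illustration rather than a description of the code
in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct request; the real structure is far larger. */
struct sketch_request {
    unsigned int tag;
};

/* Request size rounded up so the payload that follows is well aligned. */
#define SKETCH_ALIGN   (_Alignof(max_align_t))
#define SKETCH_RQ_SIZE \
    ((sizeof(struct sketch_request) + SKETCH_ALIGN - 1) / SKETCH_ALIGN * SKETCH_ALIGN)

/*
 * Allocate a request together with cmd_size bytes of zeroed driver
 * payload. The caller releases both with a single free(rq).
 */
static struct sketch_request *sketch_alloc_request(size_t cmd_size,
                                                   unsigned int tag)
{
    struct sketch_request *rq = calloc(1, SKETCH_RQ_SIZE + cmd_size);

    if (!rq)
        return NULL;
    rq->tag = tag;
    return rq;
}

/* Return the driver-private payload that sits right after the request. */
static void *sketch_rq_to_pdu(struct sketch_request *rq)
{
    return (char *)rq + SKETCH_RQ_SIZE;
}
```

A driver in this model would state its command size once at init time
and afterwards call sketch_rq_to_pdu() on each request it receives,
with no separate allocation per command.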

For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).

Contributions in this patch from the following people:

Shaohua Li <[email protected]>
Alexander Gordeev <[email protected]>
Christoph Hellwig <[email protected]>
Mike Christie <[email protected]>
Matias Bjorling <[email protected]>
Jeff Moyer <[email protected]>

Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
axboe committed Oct 25, 2013
1 parent 1dddc01 commit 320ae51
Showing 18 changed files with 2,890 additions and 109 deletions.
5 changes: 3 additions & 2 deletions block/Makefile
@@ -5,8 +5,9 @@
obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-iopoll.o blk-lib.o ioctl.o genhd.o scsi_ioctl.o \
partition-generic.o partitions/
blk-iopoll.o blk-lib.o blk-mq.o blk-mq-tag.o \
blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o partitions/

obj-$(CONFIG_BLK_DEV_BSG) += bsg.o
obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o
142 changes: 84 additions & 58 deletions block/blk-core.c
@@ -16,6 +16,7 @@
#include <linux/backing-dev.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/kernel_stat.h>
@@ -48,7 +49,7 @@ DEFINE_IDA(blk_queue_ida);
/*
* For the allocated request tables
*/
static struct kmem_cache *request_cachep;
struct kmem_cache *request_cachep = NULL;

/*
* For queue allocation
@@ -60,42 +61,6 @@ struct kmem_cache *blk_requestq_cachep;
*/
static struct workqueue_struct *kblockd_workqueue;

static void drive_stat_acct(struct request *rq, int new_io)
{
struct hd_struct *part;
int rw = rq_data_dir(rq);
int cpu;

if (!blk_do_io_stat(rq))
return;

cpu = part_stat_lock();

if (!new_io) {
part = rq->part;
part_stat_inc(cpu, part, merges[rw]);
} else {
part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
if (!hd_struct_try_get(part)) {
/*
* The partition is already being removed,
* the request will be accounted on the disk only
*
* We take a reference on disk->part0 although that
* partition will never be deleted, so we can treat
* it as any other partition.
*/
part = &rq->rq_disk->part0;
hd_struct_get(part);
}
part_round_stats(cpu, part);
part_inc_in_flight(part, rw);
rq->part = part;
}

part_stat_unlock();
}

void blk_queue_congestion_threshold(struct request_queue *q)
{
int nr;
@@ -594,9 +559,12 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
if (!q)
return NULL;

if (percpu_counter_init(&q->mq_usage_counter, 0))
goto fail_q;

q->id = ida_simple_get(&blk_queue_ida, 0, 0, gfp_mask);
if (q->id < 0)
goto fail_q;
goto fail_c;

q->backing_dev_info.ra_pages =
(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
@@ -643,13 +611,17 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
q->bypass_depth = 1;
__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);

init_waitqueue_head(&q->mq_freeze_wq);

if (blkcg_init_queue(q))
goto fail_id;

return q;

fail_id:
ida_simple_remove(&blk_queue_ida, q->id);
fail_c:
percpu_counter_destroy(&q->mq_usage_counter);
fail_q:
kmem_cache_free(blk_requestq_cachep, q);
return NULL;
@@ -1108,7 +1080,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
goto retry;
}

struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
static struct request *blk_old_get_request(struct request_queue *q, int rw,
gfp_t gfp_mask)
{
struct request *rq;

@@ -1125,6 +1098,14 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)

return rq;
}

struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
{
if (q->mq_ops)
return blk_mq_alloc_request(q, rw, gfp_mask);
else
return blk_old_get_request(q, rw, gfp_mask);
}
EXPORT_SYMBOL(blk_get_request);

/**
@@ -1210,7 +1191,7 @@ EXPORT_SYMBOL(blk_requeue_request);
static void add_acct_request(struct request_queue *q, struct request *rq,
int where)
{
drive_stat_acct(rq, 1);
blk_account_io_start(rq, true);
__elv_add_request(q, rq, where);
}

@@ -1299,12 +1280,17 @@ EXPORT_SYMBOL_GPL(__blk_put_request);

void blk_put_request(struct request *req)
{
unsigned long flags;
struct request_queue *q = req->q;

spin_lock_irqsave(q->queue_lock, flags);
__blk_put_request(q, req);
spin_unlock_irqrestore(q->queue_lock, flags);
if (q->mq_ops)
blk_mq_free_request(req);
else {
unsigned long flags;

spin_lock_irqsave(q->queue_lock, flags);
__blk_put_request(q, req);
spin_unlock_irqrestore(q->queue_lock, flags);
}
}
EXPORT_SYMBOL(blk_put_request);

@@ -1340,8 +1326,8 @@ void blk_add_request_payload(struct request *rq, struct page *page,
}
EXPORT_SYMBOL_GPL(blk_add_request_payload);

static bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
struct bio *bio)
bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
struct bio *bio)
{
const int ff = bio->bi_rw & REQ_FAILFAST_MASK;

@@ -1358,12 +1344,12 @@ static bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
req->__data_len += bio->bi_size;
req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));

drive_stat_acct(req, 0);
blk_account_io_start(req, false);
return true;
}

static bool bio_attempt_front_merge(struct request_queue *q,
struct request *req, struct bio *bio)
bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
struct bio *bio)
{
const int ff = bio->bi_rw & REQ_FAILFAST_MASK;

@@ -1388,12 +1374,12 @@ static bool bio_attempt_front_merge(struct request_queue *q,
req->__data_len += bio->bi_size;
req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));

drive_stat_acct(req, 0);
blk_account_io_start(req, false);
return true;
}

/**
* attempt_plug_merge - try to merge with %current's plugged list
* blk_attempt_plug_merge - try to merge with %current's plugged list
* @q: request_queue new bio is being queued at
* @bio: new bio being queued
* @request_count: out parameter for number of traversed plugged requests
@@ -1409,8 +1395,8 @@ static bool bio_attempt_front_merge(struct request_queue *q,
* reliable access to the elevator outside queue lock. Only check basic
* merging parameters without querying the elevator.
*/
static bool attempt_plug_merge(struct request_queue *q, struct bio *bio,
unsigned int *request_count)
bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
unsigned int *request_count)
{
struct blk_plug *plug;
struct request *rq;
@@ -1489,7 +1475,7 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
* Check if we can merge with the plugged list before grabbing
* any locks.
*/
if (attempt_plug_merge(q, bio, &request_count))
if (blk_attempt_plug_merge(q, bio, &request_count))
return;

spin_lock_irq(q->queue_lock);
@@ -1557,7 +1543,7 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
}
}
list_add_tail(&req->queuelist, &plug->list);
drive_stat_acct(req, 1);
blk_account_io_start(req, true);
} else {
spin_lock_irq(q->queue_lock);
add_acct_request(q, req, where);
@@ -2011,7 +1997,7 @@ unsigned int blk_rq_err_bytes(const struct request *rq)
}
EXPORT_SYMBOL_GPL(blk_rq_err_bytes);

static void blk_account_io_completion(struct request *req, unsigned int bytes)
void blk_account_io_completion(struct request *req, unsigned int bytes)
{
if (blk_do_io_stat(req)) {
const int rw = rq_data_dir(req);
@@ -2025,7 +2011,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
}
}

static void blk_account_io_done(struct request *req)
void blk_account_io_done(struct request *req)
{
/*
* Account IO completion. flush_rq isn't accounted as a
@@ -2073,6 +2059,42 @@ static inline struct request *blk_pm_peek_request(struct request_queue *q,
}
#endif

void blk_account_io_start(struct request *rq, bool new_io)
{
struct hd_struct *part;
int rw = rq_data_dir(rq);
int cpu;

if (!blk_do_io_stat(rq))
return;

cpu = part_stat_lock();

if (!new_io) {
part = rq->part;
part_stat_inc(cpu, part, merges[rw]);
} else {
part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
if (!hd_struct_try_get(part)) {
/*
* The partition is already being removed,
* the request will be accounted on the disk only
*
* We take a reference on disk->part0 although that
* partition will never be deleted, so we can treat
* it as any other partition.
*/
part = &rq->rq_disk->part0;
hd_struct_get(part);
}
part_round_stats(cpu, part);
part_inc_in_flight(part, rw);
rq->part = part;
}

part_stat_unlock();
}

/**
* blk_peek_request - peek at the top of a request queue
* @q: request queue to peek at
@@ -2448,7 +2470,6 @@ static void blk_finish_request(struct request *req, int error)
if (req->cmd_flags & REQ_DONTPREP)
blk_unprep_request(req);


blk_account_io_done(req);

if (req->end_io)
@@ -2870,6 +2891,7 @@ void blk_start_plug(struct blk_plug *plug)

plug->magic = PLUG_MAGIC;
INIT_LIST_HEAD(&plug->list);
INIT_LIST_HEAD(&plug->mq_list);
INIT_LIST_HEAD(&plug->cb_list);

/*
@@ -2967,6 +2989,10 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
BUG_ON(plug->magic != PLUG_MAGIC);

flush_plug_callbacks(plug, from_schedule);

if (!list_empty(&plug->mq_list))
blk_mq_flush_plug_list(plug, from_schedule);

if (list_empty(&plug->list))
return;

7 changes: 7 additions & 0 deletions block/blk-exec.c
@@ -5,6 +5,7 @@
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/sched/sysctl.h>

#include "blk.h"
@@ -58,6 +59,12 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,

rq->rq_disk = bd_disk;
rq->end_io = done;

if (q->mq_ops) {
blk_mq_insert_request(q, rq, true);
return;
}

/*
* need to check this before __blk_run_queue(), because rq can
* be freed before that returns.