Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:

 - Two NVMe pull requests:
     - ana log parse fix from Anton
     - nvme quirks support for Apple devices from Ben
     - fix missing bio completion tracing for multipath stack devices
       from Hannes and Mikhail
     - IP TOS settings for nvme rdma and tcp transports from Israel
     - rq_dma_dir cleanups from Israel
     - tracing for Get LBA Status command from Minwoo
     - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
     - Some consolidation between the fabrics transports for handling
       the CAP register
     - reset race with ns scanning fix for fabrics (move fabrics
       commands to a dedicated request queue with a different lifetime
       from the admin request queue).
     - controller reset and namespace scan races fixes
     - nvme discovery log change uevent support
     - naming improvements from Keith
     - multiple discovery controllers reject fix from James
     - some regular cleanups from various people

 - Series fixing (and re-fixing) null_blk debug printing and nr_devices
   checks (André)

 - A few pull requests from Song, with fixes from Andy, Guoqing,
   Guilherme, Neil, Nigel, and Yufen.

 - REQ_OP_ZONE_RESET_ALL support (Chaitanya)

 - Bio merge handling unification (Christoph)

 - Pick default elevator correctly for devices with special needs
   (Damien)

 - Block stats fixes (Hou)

 - Timeout and support devices nbd fixes (Mike)

 - Series fixing races around elevator switching and device add/remove
   (Ming)

 - sed-opal cleanups (Revanth)

 - Per device weight support for BFQ (Fam)

 - Support for blk-iocost, a new model that can properly account cost of
   IO workloads. (Tejun)

 - blk-cgroup writeback fixes (Tejun)

 - paride queue init fixes (zhengbin)

 - blk_set_runtime_active() cleanup (Stanley)

 - Block segment mapping optimizations (Bart)

 - lightnvm fixes (Hans/Minwoo/YueHaibing)

 - Various little fixes and cleanups

* tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
  null_blk: format pr_* logs with pr_fmt
  null_blk: match the type of parameter nr_devices
  null_blk: do not fail the module load with zero devices
  block: also check RQF_STATS in blk_mq_need_time_stamp()
  block: make rq sector size accessible for block stats
  bfq: Fix bfq linkage error
  raid5: use bio_end_sector in r5_next_bio
  raid5: remove STRIPE_OPS_REQ_PENDING
  md: add feature flag MD_FEATURE_RAID0_LAYOUT
  md/raid0: avoid RAID0 data corruption due to layout confusion.
  raid5: don't set STRIPE_HANDLE to stripe which is in batch list
  raid5: don't increment read_errors on EILSEQ return
  nvmet: fix a wrong error status returned in error log page
  nvme: send discovery log page change events to userspace
  nvme: add uevent variables for controller devices
  nvme: enable aen regardless of the presence of I/O queues
  nvme-fabrics: allow discovery subsystems accept a kato
  nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
  nvme: Remove redundant assignment of cq vector
  nvme: Assign subsys instance from first ctrl
  ...
torvalds committed Sep 17, 2019
2 parents 5260c2b + 9c7eddf commit 7ad67ca
Showing 107 changed files with 5,895 additions and 1,283 deletions.
97 changes: 97 additions & 0 deletions Documentation/admin-guide/cgroup-v2.rst
@@ -1469,6 +1469,103 @@ IO Interface Files
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021

io.cost.qos
A read-write nested-keyed file which exists only on the root
cgroup.

This file configures the Quality of Service of the IO cost
model based controller (CONFIG_BLK_CGROUP_IOCOST) which
currently implements "io.weight" proportional control. Lines
are keyed by $MAJ:$MIN device numbers and not ordered. The
line for a given device is populated on the first write for
the device on "io.cost.qos" or "io.cost.model". The following
nested keys are defined.

====== =====================================
enable Weight-based control enable
ctrl "auto" or "user"
rpct Read latency percentile [0, 100]
rlat Read latency threshold
wpct Write latency percentile [0, 100]
wlat Write latency threshold
min Minimum scaling percentage [1, 10000]
max Maximum scaling percentage [1, 10000]
====== =====================================

The controller is disabled by default and can be enabled by
setting "enable" to 1. "rpct" and "wpct" parameters default
to zero and the controller uses internal device saturation
state to adjust the overall IO rate between "min" and "max".

When a better control quality is needed, latency QoS
parameters can be configured. For example::

8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0

shows that on sdb, the controller is enabled, will consider
the device saturated if the 95th percentile of read completion
latencies is above 75ms or write 150ms, and adjust the overall
IO issue rate between 50% and 150% accordingly.

The lower the saturation point, the better the latency QoS at
the cost of aggregate bandwidth. The narrower the allowed
adjustment range between "min" and "max", the more conformant
to the cost model the IO behavior. Note that the IO issue
base rate may be far off from 100% and setting "min" and "max"
blindly can lead to a significant loss of device capacity or
control quality. "min" and "max" are useful for regulating
devices which show wide temporary behavior changes - e.g. an
SSD which accepts writes at the line speed for a while and
then completely stalls for multiple seconds.

When "ctrl" is "auto", the parameters are controlled by the
kernel and may change automatically. Setting "ctrl" to "user"
or setting any of the percentile and latency parameters puts
it into "user" mode and disables the automatic changes. The
automatic mode can be restored by setting "ctrl" to "auto".
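
As a rough illustration of the nested-keyed format described above, the sketch below splits the example line into its device number and key/value pairs, then applies the stated saturation rule. The helper names (`parse_qos_line`, `is_saturated`) and the observed-latency inputs are hypothetical, and the comparison is a simplification for reading the example, not the kernel's actual logic. Thresholds are treated as microseconds, consistent with rlat=75000 being described as 75ms.

```python
def parse_qos_line(line):
    """Split one io.cost.qos line into (device, {key: value})."""
    device, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    return device, fields


def is_saturated(fields, read_lat_us, write_lat_us):
    """Return True if either observed latency exceeds its threshold.

    read_lat_us and write_lat_us stand for the latencies observed at the
    configured percentiles, in microseconds (hypothetical inputs).
    """
    return (read_lat_us > int(fields["rlat"])
            or write_lat_us > int(fields["wlat"]))


line = ("8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 "
        "wpct=95.00 wlat=150000 min=50.00 max=150.0")
device, qos = parse_qos_line(line)
```

Here `device` holds the $MAJ:$MIN key and `qos` maps each nested key to its string value, so a read latency of 80000 against rlat=75000 would count as saturated under this simplified check.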

io.cost.model
A read-write nested-keyed file which exists only on the root
cgroup.

This file configures the cost model of the IO cost model based
controller (CONFIG_BLK_CGROUP_IOCOST) which currently
implements "io.weight" proportional control. Lines are keyed
by $MAJ:$MIN device numbers and not ordered. The line for a
given device is populated on the first write for the device on
"io.cost.qos" or "io.cost.model". The following nested keys
are defined.

===== ================================
ctrl "auto" or "user"
model The cost model in use - "linear"
===== ================================

When "ctrl" is "auto", the kernel may change all parameters
dynamically. When "ctrl" is set to "user" or any other
parameters are written to, "ctrl" becomes "user" and the
automatic changes are disabled.

When "model" is "linear", the following model parameters are
defined.

============= ========================================
[r|w]bps The maximum sequential IO throughput
[r|w]seqiops The maximum 4k sequential IOs per second
[r|w]randiops The maximum 4k random IOs per second
============= ========================================

From the above, the builtin linear model determines the base
costs of a sequential and random IO and the cost coefficient
for the IO size. While simple, this model can cover most
common device classes acceptably.

The IO cost model isn't expected to be accurate in an absolute
sense and is scaled to the device behavior dynamically.

If needed, tools/cgroup/iocost_coef_gen.py can be used to
generate device-specific coefficients.
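
A line for this file could be assembled as in the following sketch, assuming the linear model keys expand to rbps, rseqiops, rrandiops and their w-prefixed counterparts as the table above suggests. `format_cost_model` is a hypothetical helper and the coefficient values are placeholders, not measurements.

```python
# Keys of the "linear" model, expanded from [r|w]bps, [r|w]seqiops, [r|w]randiops.
LINEAR_KEYS = ("rbps", "rseqiops", "rrandiops",
               "wbps", "wseqiops", "wrandiops")


def format_cost_model(major, minor, **coefs):
    """Build one io.cost.model line for device major:minor.

    Only keys listed in LINEAR_KEYS are accepted; anything else raises
    ValueError so a typo does not end up in the written line.
    """
    unknown = set(coefs) - set(LINEAR_KEYS)
    if unknown:
        raise ValueError(f"unknown linear model keys: {sorted(unknown)}")
    parts = [f"{major}:{minor}", "ctrl=user", "model=linear"]
    parts += [f"{key}={coefs[key]}" for key in LINEAR_KEYS if key in coefs]
    return " ".join(parts)


# Placeholder coefficients for illustration only.
model_line = format_cost_model(8, 16, rbps=1000, wbps=500)
```

The resulting string would then be written to io.cost.model on the root cgroup. Since writing any model parameter switches "ctrl" to "user" anyway, spelling out ctrl=user here is a readability choice.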

io.weight
A read-write flat-keyed file which exists on non-root cgroups.
The default is "default 100".
6 changes: 0 additions & 6 deletions Documentation/admin-guide/kernel-parameters.txt
@@ -1201,12 +1201,6 @@
See comment before function elanfreq_setup() in
arch/x86/kernel/cpu/cpufreq/elanfreq.c.

elevator= [IOSCHED]
Format: { "mq-deadline" | "kyber" | "bfq" }
See Documentation/block/deadline-iosched.rst,
Documentation/block/kyber-iosched.rst and
Documentation/block/bfq-iosched.rst for details.

elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390]
Specifies physical address of start of kernel core
image elf header and optionally the size. Generally
8 changes: 3 additions & 5 deletions Documentation/admin-guide/kernel-per-CPU-kthreads.rst
@@ -274,22 +274,20 @@ To reduce its OS jitter, do any of the following:
(based on an earlier one from Gilad Ben-Yossef) that
reduces or even eliminates vmstat overhead for some
workloads at https://lkml.org/lkml/2013/9/4/379.
e. Boot with "elevator=noop" to avoid workqueue use by
the block layer.
f. If running on high-end powerpc servers, build with
e. If running on high-end powerpc servers, build with
CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS
daemon from running on each CPU every second or so.
(This will require editing Kconfig files and will defeat
this platform's RAS functionality.) This avoids jitter
due to the rtas_event_scan() function.
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
g. If running on Cell Processor, build your kernel with
f. If running on Cell Processor, build your kernel with
CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
spu_gov_work().
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
h. If running on PowerMAC, build your kernel with
g. If running on PowerMAC, build your kernel with
CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
avoiding OS jitter from rackmeter_do_timer().

33 changes: 18 additions & 15 deletions Documentation/block/null_blk.rst
@@ -1,19 +1,16 @@
.. SPDX-License-Identifier: GPL-2.0
========================
Null block device driver
========================

1. Overview
===========
Overview
========

The null block device (/dev/nullb*) is used for benchmarking the various
The null block device (``/dev/nullb*``) is used for benchmarking the various
block-layer implementations. It emulates a block device of X gigabytes in size.
The following instances are possible:

Single-queue block-layer

- Request-based.
- Single submission queue per device.
- Implements IO scheduling algorithms (CFQ, Deadline, noop).
It does not execute any read/write operation; it just marks them as complete in
the request queue. The following instances are possible:

Multi-queue block-layer

@@ -27,15 +24,15 @@ The following instances are possible:

All of them have a completion queue for each core in the system.

2. Module parameters applicable for all instances
=================================================
Module parameters
=================

queue_mode=[0-2]: Default: 2-Multi-queue
Selects which block-layer the module should instantiate with.

= ============
0 Bio-based
1 Single-queue
1 Single-queue (deprecated)
2 Multi-queue
= ============

@@ -67,17 +64,19 @@ irqmode=[0-2]: Default: 1-Soft-irq
completion_nsec=[ns]: Default: 10,000ns
Combined with irqmode=2 (timer). The time each completion event must wait.

submit_queues=[1..nr_cpus]:
submit_queues=[1..nr_cpus]: Default: 1
The number of submission queues attached to the device driver. If unset, it
defaults to 1. For multi-queue, it is ignored when use_per_node_hctx module
parameter is 1.

hw_queue_depth=[0..qdepth]: Default: 64
The hardware queue depth of the device.
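
The parameters above are passed as name=value pairs when the module is loaded. The sketch below only builds the command line and does not run it; `null_blk_modprobe` is a hypothetical helper, and the range checks cover just the two parameters whose ranges are spelled out in this file (queue_mode and irqmode).

```python
import shlex

# Ranges documented for null_blk parameters; others are passed through unchecked.
DOCUMENTED_RANGES = {
    "queue_mode": range(0, 3),  # 0 bio-based, 1 single-queue (deprecated), 2 multi-queue
    "irqmode": range(0, 3),     # 2 is the timer mode used with completion_nsec
}


def null_blk_modprobe(**params):
    """Return a modprobe command line for null_blk (does not execute it)."""
    for name, value in params.items():
        allowed = DOCUMENTED_RANGES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value} is outside the documented range")
    args = ["modprobe", "null_blk"]
    # Sort for a stable, reproducible command line.
    args += [f"{name}={value}" for name, value in sorted(params.items())]
    return shlex.join(args)


command = null_blk_modprobe(queue_mode=2, irqmode=2, completion_nsec=50000)
```

With these arguments `command` requests a multi-queue device whose completions are delayed by a 50,000ns timer; loading the module itself would still need root privileges.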

III: Multi-queue specific parameters
Multi-queue specific parameters
-------------------------------

use_per_node_hctx=[0/1]: Default: 0
Number of hardware context queues.

= =====================================================================
0 The number of submit queues is set to the value of the submit_queues
@@ -87,13 +86,15 @@ use_per_node_hctx=[0/1]: Default: 0
= =====================================================================

no_sched=[0/1]: Default: 0
Enable/disable the io scheduler.

= ======================================
0 nullb* use default blk-mq io scheduler
1 nullb* doesn't use io scheduler
= ======================================

blocking=[0/1]: Default: 0
Blocking behavior of the request queue.

= ===============================================================
0 Register as a non-blocking blk-mq driver device.
@@ -103,6 +104,7 @@ blocking=[0/1]: Default: 0
= ===============================================================

shared_tags=[0/1]: Default: 0
Sharing tags between devices.

= ================================================================
0 Tag set is not shared.
@@ -111,6 +113,7 @@ shared_tags=[0/1]: Default: 0
= ================================================================

zoned=[0/1]: Default: 0
Device is a random-access or a zoned block device.

= ======================================================================
0 Block device is exposed as a random-access block device.
4 changes: 0 additions & 4 deletions Documentation/block/switching-sched.rst
@@ -2,10 +2,6 @@
Switching Scheduler
===================

To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
'noop' and 'cfq' (the default) are also available. IO schedulers are assigned
globally at boot time only presently.

Each io queue has a set of io scheduler tunables associated with it. These
tunables control how the io scheduler works. You can find these entries
in::
13 changes: 13 additions & 0 deletions block/Kconfig
@@ -26,6 +26,9 @@ menuconfig BLOCK

if BLOCK

config BLK_RQ_ALLOC_TIME
bool

config BLK_SCSI_REQUEST
bool

@@ -132,6 +135,16 @@ config BLK_CGROUP_IOLATENCY

Note, this is an experimental interface and could be changed someday.

config BLK_CGROUP_IOCOST
bool "Enable support for cost model based cgroup IO controller"
depends on BLK_CGROUP=y
select BLK_RQ_ALLOC_TIME
---help---
Enabling this option enables the .weight interface for cost
model based proportional IO control. The IO controller
distributes IO capacity between different groups based on
their share of the overall weight distribution.
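
Whether a given kernel was built with this option can be checked against its config text. The following is a minimal sketch: `config_enabled` is a hypothetical helper and `sample` is made-up config content, not taken from any real build. Because the entry above selects BLK_RQ_ALLOC_TIME, both symbols would be expected to appear as =y together.

```python
def config_enabled(config_text, option):
    """Return True if `option` is set to y in kernel .config style text."""
    wanted = f"{option}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


# Made-up sample for illustration; a real check would read the running
# kernel's config file instead.
sample = """\
CONFIG_BLK_CGROUP=y
CONFIG_BLK_RQ_ALLOC_TIME=y
CONFIG_BLK_CGROUP_IOCOST=y
# CONFIG_BLK_CGROUP_IOLATENCY is not set
"""

iocost_built_in = config_enabled(sample, "CONFIG_BLK_CGROUP_IOCOST")
```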

config BLK_WBT_MQ
bool "Multiqueue writeback throttling"
default y
1 change: 1 addition & 0 deletions block/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o
obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o