Skip to content

Commit

Permalink
Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block
Browse files Browse the repository at this point in the history
Pull block updates from Jens Axboe:
 "First pull request for this merge window, there will also be a
  followup request with some stragglers.

  This pull request contains:

   - Fix for a thundering heard issue in the wbt block code (Anchal
     Agarwal)

   - A few NVMe pull requests:
      * Improved tracepoints (Keith)
      * Larger inline data support for RDMA (Steve Wise)
      * RDMA setup/teardown fixes (Sagi)
      * Effects log suppor for NVMe target (Chaitanya Kulkarni)
      * Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
      * TP4004 (ANA) support (Christoph)
      * Various NVMe fixes

   - Block io-latency controller support. Much needed support for
     properly containing block devices. (Josef)

   - Series improving how we handle sense information on the stack
     (Kees)

   - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

   - Zoned device support for null_blk (Matias)

   - AIX partition fixes (Mauricio Faria de Oliveira)

   - DIF checksum code made generic (Max Gurtovoy)

   - Add support for discard in iostats (Michael Callahan / Tejun)

   - Set of updates for BFQ (Paolo)

   - Removal of async write support for bsg (Christoph)

   - Bio page dirtying and clone fixups (Christoph)

   - Set of bcache fix/changes (via Coly)

   - Series improving blk-mq queue setup/teardown speed (Ming)

   - Series improving merging performance on blk-mq (Ming)

   - Lots of other fixes and cleanups from a slew of folks"

* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
  blkcg: Make blkg_root_lookup() work for queues in bypass mode
  bcache: fix error setting writeback_rate through sysfs interface
  null_blk: add lock drop/acquire annotation
  Blk-throttle: reduce tail io latency when iops limit is enforced
  block: paride: pd: mark expected switch fall-throughs
  block: Ensure that a request queue is dissociated from the cgroup controller
  block: Introduce blk_exit_queue()
  blkcg: Introduce blkg_root_lookup()
  block: Remove two superfluous #include directives
  blk-mq: count the hctx as active before allocating tag
  block: bvec_nr_vecs() returns value for wrong slab
  bcache: trivial - remove tailing backslash in macro BTREE_FLAG
  bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
  bcache: set max writeback rate when I/O request is idle
  bcache: add code comments for bset.c
  bcache: fix mistaken comments in request.c
  bcache: fix mistaken code comments in bcache.h
  bcache: add a comment in super.c
  bcache: avoid unncessary cache prefetch bch_btree_node_get()
  bcache: display rate debug parameters to 0 when writeback is not running
  ...
  • Loading branch information
torvalds committed Aug 14, 2018
2 parents 958f338 + b86d865 commit 73ba2fb
Show file tree
Hide file tree
Showing 172 changed files with 6,029 additions and 2,659 deletions.
10 changes: 10 additions & 0 deletions Documentation/ABI/testing/procfs-diskstats
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ Description:
The /proc/diskstats file displays the I/O statistics
of block devices. Each line contains the following 14
fields:

1 - major number
2 - minor mumber
3 - device name
Expand All @@ -19,4 +20,13 @@ Description:
12 - I/Os currently in progress
13 - time spent doing I/Os (ms)
14 - weighted time spent doing I/Os (ms)

Kernel 4.18+ appends four more fields for discard
tracking putting the total at 18:

15 - discards completed successfully
16 - discards merged
17 - sectors discarded
18 - time spent discarding

For more details refer to Documentation/iostats.txt
92 changes: 88 additions & 4 deletions Documentation/admin-guide/cgroup-v2.rst
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,9 @@ v1 is available under Documentation/cgroup-v1/.
5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
5-3-3. IO Latency
5-3-3-1. How IO Latency Throttling Works
5-3-3-2. IO Latency Interface Files
5-4. PID
5-4-1. PID Interface Files
5-5. Device
Expand Down Expand Up @@ -1314,17 +1317,19 @@ IO Interface Files
Lines are keyed by $MAJ:$MIN device numbers and not ordered.
The following nested keys are defined.

====== ===================
====== =====================
rbytes Bytes read
wbytes Bytes written
rios Number of read IOs
wios Number of write IOs
====== ===================
dbytes Bytes discarded
dios Number of discard IOs
====== =====================

An example read output follows:

8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021

io.weight
A read-write flat-keyed file which exists on non-root cgroups.
Expand Down Expand Up @@ -1446,6 +1451,85 @@ writeback as follows.
vm.dirty[_background]_ratio.


IO Latency
~~~~~~~~~~

This is a cgroup v2 controller for IO workload protection. You provide a group
with a latency target, and if the average latency exceeds that target the
controller will throttle any peers that have a lower latency target than the
protected workload.

The limits are only applied at the peer level in the hierarchy. This means that
in the diagram below, only groups A, B, and C will influence each other, and
groups D and F will influence each other. Group G will influence nobody.

[root]
/ | \
A B C
/ \ |
D F G


So the ideal way to configure this is to set io.latency in groups A, B, and C.
Generally you do not want to set a value lower than the latency your device
supports. Experiment to find the value that works best for your workload.
Start at higher than the expected latency for your device and watch the
avg_lat value in io.stat for your workload group to get an idea of the
latency you see during normal operation. Use the avg_lat value as a basis for
your real setting, setting at 10-15% higher than the value in io.stat.

How IO Latency Throttling Works
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

io.latency is work conserving; so as long as everybody is meeting their latency
target the controller doesn't do anything. Once a group starts missing its
target it begins throttling any peer group that has a higher target than itself.
This throttling takes 2 forms:

- Queue depth throttling. This is the number of outstanding IO's a group is
allowed to have. We will clamp down relatively quickly, starting at no limit
and going all the way down to 1 IO at a time.

- Artificial delay induction. There are certain types of IO that cannot be
throttled without possibly adversely affecting higher priority groups. This
includes swapping and metadata IO. These types of IO are allowed to occur
normally, however they are "charged" to the originating group. If the
originating group is being throttled you will see the use_delay and delay
fields in io.stat increase. The delay value is how many microseconds that are
being added to any process that runs in this group. Because this number can
grow quite large if there is a lot of swapping or metadata IO occurring we
limit the individual delay events to 1 second at a time.

Once the victimized group starts meeting its latency target again it will start
unthrottling any peer groups that were throttled previously. If the victimized
group simply stops doing IO the global counter will unthrottle appropriately.

IO Latency Interface Files
~~~~~~~~~~~~~~~~~~~~~~~~~~

io.latency
This takes a similar format as the other controllers.

"MAJOR:MINOR target=<target time in microseconds"

io.stat
If the controller is enabled you will see extra stats in io.stat in
addition to the normal ones.

depth
This is the current queue depth for the group.

avg_lat
This is an exponential moving average with a decay rate of 1/exp
bound by the sampling interval. The decay rate interval can be
calculated by multiplying the win value in io.stat by the
corresponding number of samples based on the win value.

win
The sampling window size in milliseconds. This is the minimum
duration of time between evaluation events. Windows only elapse
with IO activity. Idle periods extend the most recent window.

PID
---

Expand Down
7 changes: 7 additions & 0 deletions Documentation/block/null_blk.txt
Original file line number Diff line number Diff line change
Expand Up @@ -85,3 +85,10 @@ shared_tags=[0/1]: Default: 0
0: Tag set is not shared.
1: Tag set shared between devices for blk-mq. Only makes sense with
nr_devices > 1, otherwise there's no tag set to share.

zoned=[0/1]: Default: 0
0: Block device is exposed as a random-access block device.
1: Block device is exposed as a host-managed zoned block device.

zone_size=[MB]: Default: 256
Per zone size when exposed as a zoned block device. Must be a power of two.
28 changes: 16 additions & 12 deletions Documentation/block/stat.txt
Original file line number Diff line number Diff line change
Expand Up @@ -31,28 +31,32 @@ write ticks milliseconds total wait time for write requests
in_flight requests number of I/Os currently in flight
io_ticks milliseconds total time this block device has been active
time_in_queue milliseconds total wait time for all requests
discard I/Os requests number of discard I/Os processed
discard merges requests number of discard I/Os merged with in-queue I/O
discard sectors sectors number of sectors discarded
discard ticks milliseconds total wait time for discard requests

read I/Os, write I/Os
=====================
read I/Os, write I/Os, discard I/0s
===================================

These values increment when an I/O request completes.

read merges, write merges
=========================
read merges, write merges, discard merges
=========================================

These values increment when an I/O request is merged with an
already-queued I/O request.

read sectors, write sectors
===========================
read sectors, write sectors, discard_sectors
============================================

These values count the number of sectors read from or written to this
block device. The "sectors" in question are the standard UNIX 512-byte
sectors, not any device- or filesystem-specific block size. The
counters are incremented when the I/O completes.
These values count the number of sectors read from, written to, or
discarded from this block device. The "sectors" in question are the
standard UNIX 512-byte sectors, not any device- or filesystem-specific
block size. The counters are incremented when the I/O completes.

read ticks, write ticks
=======================
read ticks, write ticks, discard ticks
======================================

These values count the number of milliseconds that I/O requests have
waited on this block device. If there are multiple I/O requests waiting,
Expand Down
15 changes: 15 additions & 0 deletions Documentation/iostats.txt
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,9 @@ Here are examples of these different formats::
3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
3 1 hda1 35486 38030 38030 38030

4.18+ diskstats:
3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0

On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have
a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``.

Expand Down Expand Up @@ -101,6 +104,18 @@ Field 11 -- weighted # of milliseconds spent doing I/Os
last update of this field. This can provide an easy measure of both
I/O completion time and the backlog that may be accumulating.

Field 12 -- # of discards completed
This is the total number of discards completed successfully.

Field 13 -- # of discards merged
See the description of field 2

Field 14 -- # of sectors discarded
This is the total number of sectors discarded successfully.

Field 15 -- # of milliseconds spent discarding
This is the total number of milliseconds spent by all discards (as
measured from __make_request() to end_that_request_last()).

To avoid introducing performance bottlenecks, no locks are held while
modifying these counters. This implies that minor inaccuracies may be
Expand Down
16 changes: 16 additions & 0 deletions block/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,18 @@ config BLK_WBT
dynamically on an algorithm loosely based on CoDel, factoring in
the realtime performance of the disk.

config BLK_CGROUP_IOLATENCY
bool "Enable support for latency based cgroup IO protection"
depends on BLK_CGROUP=y
default n
---help---
Enabling this option enables the .latency interface for IO throttling.
The IO controller will attempt to maintain average IO latencies below
the configured latency target, throttling anybody with a higher latency
target than the victimized group.

Note, this is an experimental interface and could be changed someday.

config BLK_WBT_SQ
bool "Single queue writeback throttling"
default n
Expand Down Expand Up @@ -177,6 +189,10 @@ config BLK_DEBUG_FS
Unless you are building a kernel for a tiny system, you should
say Y here.

config BLK_DEBUG_FS_ZONED
bool
default BLK_DEBUG_FS && BLK_DEV_ZONED

config BLK_SED_OPAL
bool "Logic for interfacing with Opal enabled SEDs"
---help---
Expand Down
4 changes: 3 additions & 1 deletion block/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -9,14 +9,15 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o partition-generic.o ioprio.o \
badblocks.o partitions/
badblocks.o partitions/ blk-rq-qos.o

obj-$(CONFIG_BOUNCE) += bounce.o
obj-$(CONFIG_BLK_SCSI_REQUEST) += scsi_ioctl.o
obj-$(CONFIG_BLK_DEV_BSG) += bsg.o
obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o
obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
Expand All @@ -34,4 +35,5 @@ obj-$(CONFIG_BLK_MQ_RDMA) += blk-mq-rdma.o
obj-$(CONFIG_BLK_DEV_ZONED) += blk-zoned.o
obj-$(CONFIG_BLK_WBT) += blk-wbt.o
obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o
Loading

0 comments on commit 73ba2fb

Please sign in to comment.