Merge tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
 "This is the main block updates for 5.3. Nothing earth shattering or
  major in here, just fixes, additions, and improvements all over the
  map. This contains:

   - Series of documentation fixes (Bart)

   - Optimization of the blk-mq ctx get/put (Bart)

   - null_blk removal race condition fix (Bob)

   - req/bio_op() cleanups (Chaitanya)

   - Series cleaning up the segment accounting, and request/bio mapping
     (Christoph)

   - Series cleaning up the page getting/putting for bios (Christoph)

   - block cgroup cleanups and moving it to where it is used (Christoph)

   - block cgroup fixes (Tejun)

   - Series of fixes and improvements to bcache, most notably a write
     deadlock fix (Coly)

   - blk-iolatency STS_AGAIN and accounting fixes (Dennis)

   - Series of improvements and fixes to BFQ (Douglas, Paolo)

   - debugfs_create() return value check removal for drbd (Greg)

   - Use struct_size(), where appropriate (Gustavo)

   - Two lightnvm fixes (Heiner, Geert)

   - MD fixes, including a read balance and corruption fix (Guoqing,
     Marcos, Xiao, Yufen)

   - block opal shadow mbr additions (Jonas, Revanth)

   - sbitmap compare-and-exchange improvements (Pavel)

   - Fix for potential bio->bi_size overflow (Ming)

   - NVMe pull requests:
       - improved PCIe suspend support (Keith Busch)
       - error injection support for the admin queue (Akinobu Mita)
       - Fibre Channel discovery improvements (James Smart)
       - tracing improvements including nvmet tracing support (Minwoo Im)
       - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
         Kulkarni)

   - Various little fixes and improvements to drivers and core"

* tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
  blk-iolatency: fix STS_AGAIN handling
  block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
  blk-mq: simplify blk_mq_make_request()
  blk-mq: remove blk_mq_put_ctx()
  sbitmap: Replace cmpxchg with xchg
  block: fix .bi_size overflow
  block: sed-opal: check size of shadow mbr
  block: sed-opal: ioctl for writing to shadow mbr
  block: sed-opal: add ioctl for done-mark of shadow mbr
  block: never take page references for ITER_BVEC
  direct-io: use bio_release_pages in dio_bio_complete
  block_dev: use bio_release_pages in bio_unmap_user
  block_dev: use bio_release_pages in blkdev_bio_end_io
  iomap: use bio_release_pages in iomap_dio_bio_end_io
  block: use bio_release_pages in bio_map_user_iov
  block: use bio_release_pages in bio_unmap_user
  block: optionally mark pages dirty in bio_release_pages
  block: move the BIO_NO_PAGE_REF check into bio_release_pages
  block: skd_main.c: Remove call to memset after dma_alloc_coherent
  block: mtip32xx: Remove call to memset after dma_alloc_coherent
  ...
torvalds committed Jul 9, 2019
2 parents 0415052 + c9b3007 commit 3b99107
Showing 104 changed files with 3,368 additions and 1,554 deletions.
12 changes: 6 additions & 6 deletions Documentation/block/bfq-iosched.txt
@@ -38,13 +38,13 @@ stack). To give an idea of the limits with BFQ, on slow or average
CPUs, here are, first, the limits of BFQ for three different CPUs, on,
respectively, an average laptop, an old desktop, and a cheap embedded
system, in case full hierarchical support is enabled (i.e.,
-CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_DEBUG_BLK_CGROUP is not
+CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_BFQ_CGROUP_DEBUG is not
set (Section 4-2):
- Intel i7-4850HQ: 400 KIOPS
- AMD A8-3850: 250 KIOPS
- ARM CortexTM-A53 Octa-core: 80 KIOPS

-If CONFIG_DEBUG_BLK_CGROUP is set (and of course full hierarchical
+If CONFIG_BFQ_CGROUP_DEBUG is set (and of course full hierarchical
support is enabled), then the sustainable throughput with BFQ
decreases, because all blkio.bfq* statistics are created and updated
(Section 4-2). For BFQ, this leads to the following maximum
@@ -537,19 +537,19 @@ or io.bfq.weight.

As for cgroups-v1 (blkio controller), the exact set of stat files
created, and kept up-to-date by bfq, depends on whether
-CONFIG_DEBUG_BLK_CGROUP is set. If it is set, then bfq creates all
+CONFIG_BFQ_CGROUP_DEBUG is set. If it is set, then bfq creates all
the stat files documented in
Documentation/cgroup-v1/blkio-controller.rst. If, instead,
-CONFIG_DEBUG_BLK_CGROUP is not set, then bfq creates only the files
+CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq creates only the files
blkio.bfq.io_service_bytes
blkio.bfq.io_service_bytes_recursive
blkio.bfq.io_serviced
blkio.bfq.io_serviced_recursive

-The value of CONFIG_DEBUG_BLK_CGROUP greatly influences the maximum
+The value of CONFIG_BFQ_CGROUP_DEBUG greatly influences the maximum
throughput sustainable with bfq, because updating the blkio.bfq.*
stats is rather costly, especially for some of the stats enabled by
-CONFIG_DEBUG_BLK_CGROUP.
+CONFIG_BFQ_CGROUP_DEBUG.

Parameters to set
-----------------
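A quick way to see which set of stat files a given kernel actually
creates is to list them from a mounted cgroup-v1 hierarchy. This is an
illustrative sketch, not part of the patch; it assumes the blkio
controller is mounted at the conventional /sys/fs/cgroup/blkio path and
that bfq is the active scheduler:

  # With CONFIG_BFQ_CGROUP_DEBUG unset, only the four
  # blkio.bfq.io_service*/io_serviced* files listed above should appear;
  # with it set, the full set from blkio-controller.rst shows up.
  ls /sys/fs/cgroup/blkio/blkio.bfq.*
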
1 change: 0 additions & 1 deletion Documentation/block/biodoc.txt
@@ -436,7 +436,6 @@ struct bio {
struct bvec_iter bi_iter; /* current index into bio_vec array */

unsigned int bi_size; /* total size in bytes */
-  unsigned short bi_phys_segments; /* segments after physaddr coalesce*/
unsigned short bi_hw_segments; /* segments after DMA remapping */
unsigned int bi_max; /* max bio_vecs we can hold
used as index into pool */
64 changes: 43 additions & 21 deletions Documentation/block/queue-sysfs.txt
@@ -14,6 +14,15 @@ add_random (RW)
This file allows to turn off the disk entropy contribution. Default
value of this file is '1'(on).

chunk_sectors (RO)
------------------
This has different meaning depending on the type of the block device.
For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors
of the RAID volume stripe segment. For a zoned block device, either host-aware
or host-managed, chunk_sectors indicates the size in 512B sectors of the zones
of the device, with the eventual exception of the last zone of the device which
may be smaller.

dax (RO)
--------
This file indicates whether the device supports Direct Access (DAX),
@@ -43,6 +52,16 @@ large discards are issued, setting this value lower will make Linux issue
smaller discards and potentially help reduce latencies induced by large
discard operations.

discard_zeroes_data (RO)
------------------------
Obsolete. Always zero.

fua (RO)
--------
Whether or not the block driver supports the FUA flag for write requests.
FUA stands for Force Unit Access. If the FUA flag is set that means that
write requests must bypass the volatile cache of the storage device.

hw_sector_size (RO)
-------------------
This is the hardware sector size of the device, in bytes.
@@ -83,14 +102,19 @@ logical_block_size (RO)
-----------------------
This is the logical block size of the device, in bytes.

max_discard_segments (RO)
-------------------------
The maximum number of DMA scatter/gather entries in a discard request.

max_hw_sectors_kb (RO)
----------------------
This is the maximum number of kilobytes supported in a single data transfer.

max_integrity_segments (RO)
---------------------------
-When read, this file shows the max limit of integrity segments as
-set by block layer which a hardware controller can handle.
+Maximum number of elements in a DMA scatter/gather list with integrity
+data that will be submitted by the block layer core to the associated
+block driver.

max_sectors_kb (RW)
-------------------
@@ -100,11 +124,12 @@ size allowed by the hardware.

max_segments (RO)
-----------------
-Maximum number of segments of the device.
+Maximum number of elements in a DMA scatter/gather list that is submitted
+to the associated block driver.

max_segment_size (RO)
---------------------
-Maximum segment size of the device.
+Maximum size in bytes of a single element in a DMA scatter/gather list.

minimum_io_size (RO)
--------------------
@@ -132,6 +157,12 @@ per-block-cgroup request pool. IOW, if there are N block cgroups,
each request queue may have up to N request pools, each independently
regulated by nr_requests.

nr_zones (RO)
-------------
For zoned block devices (zoned attribute indicating "host-managed" or
"host-aware"), this indicates the total number of zones of the device.
This is always 0 for regular block devices.

optimal_io_size (RO)
--------------------
This is the optimal IO size reported by the device.
@@ -185,8 +216,8 @@ This is the number of bytes the device can write in a single write-same
command. A value of '0' means write-same is not supported by this
device.

-wb_lat_usec (RW)
-----------------
+wbt_lat_usec (RW)
+-----------------
If the device is registered for writeback throttling, then this file shows
the target minimum read latency. If this latency is exceeded in a given
window of time (see wb_window_usec), then the writeback throttling will start
@@ -201,6 +232,12 @@ blk-throttle makes decision based on the samplings. Lower time means cgroups
have more smooth throughput, but higher CPU overhead. This exists only when
CONFIG_BLK_DEV_THROTTLING_LOW is enabled.

write_zeroes_max_bytes (RO)
---------------------------
For block drivers that support REQ_OP_WRITE_ZEROES, the maximum number of
bytes that can be zeroed at once. The value 0 means that REQ_OP_WRITE_ZEROES
is not supported.

zoned (RO)
----------
This indicates if the device is a zoned block device and the zone model of the
@@ -213,19 +250,4 @@ devices are described in the ZBC (Zoned Block Commands) and ZAC
do not support zone commands, they will be treated as regular block devices
and zoned will report "none".

-nr_zones (RO)
--------------
-For zoned block devices (zoned attribute indicating "host-managed" or
-"host-aware"), this indicates the total number of zones of the device.
-This is always 0 for regular block devices.
-
-chunk_sectors (RO)
-------------------
-This has different meaning depending on the type of the block device.
-For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors
-of the RAID volume stripe segment. For a zoned block device, either host-aware
-or host-managed, chunk_sectors indicates the size in 512B sectors of the zones
-of the device, with the eventual exception of the last zone of the device which
-may be smaller.

Jens Axboe <[email protected]>, February 2009
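All of the attributes above are ordinary sysfs files under
/sys/block/<dev>/queue/, so they can be inspected (and the RW ones
tuned) from a shell. A hedged sketch follows; sda is only a stand-in
device name:

  # Query a few of the documented attributes.
  cat /sys/block/sda/queue/zoned          # zone model: none/host-aware/host-managed
  cat /sys/block/sda/queue/nr_zones       # 0 for regular block devices
  cat /sys/block/sda/queue/chunk_sectors  # stripe/zone size in 512B sectors
  cat /sys/block/sda/queue/fua            # 1 if FUA writes are supported
  # wbt_lat_usec is RW: set a 2000 usec writeback-throttling read-latency
  # target (only takes effect if the device is registered for wbt).
  echo 2000 > /sys/block/sda/queue/wbt_lat_usec
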
12 changes: 6 additions & 6 deletions Documentation/cgroup-v1/blkio-controller.rst
@@ -82,7 +82,7 @@ Various user visible config options
CONFIG_BLK_CGROUP
- Block IO controller.

-CONFIG_DEBUG_BLK_CGROUP
+CONFIG_BFQ_CGROUP_DEBUG
- Debug help. Right now some additional stats file show up in cgroup
if this option is enabled.

@@ -202,13 +202,13 @@ Proportional weight policy files
write, sync or async.

- blkio.avg_queue_size
-  - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+  - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
The average queue size for this cgroup over the entire time of this
cgroup's existence. Queue size samples are taken each time one of the
queues of this cgroup gets a timeslice.

- blkio.group_wait_time
-  - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+  - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
This is the amount of time the cgroup had to wait since it became busy
(i.e., went from 0 to 1 request queued) to get a timeslice for one of
its queues. This is different from the io_wait_time which is the
@@ -219,7 +219,7 @@ Proportional weight policy files
got a timeslice and will not include the current delta.

- blkio.empty_time
-  - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+  - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
This is the amount of time a cgroup spends without any pending
requests when not being served, i.e., it does not include any time
spent idling for one of the queues of the cgroup. This is in
@@ -228,7 +228,7 @@
time it had a pending request and will not include the current delta.

- blkio.idle_time
-  - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+  - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
This is the amount of time spent by the IO scheduler idling for a
given cgroup in anticipation of a better request than the existing ones
from other queues/cgroups. This is in nanoseconds. If this is read
@@ -237,7 +237,7 @@
the current delta.

- blkio.dequeue
-  - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. This
+  - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
gives the statistics about how many a times a group was dequeued
from service tree of the device. First two fields specify the major
and minor number of the device and third field specifies the number
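To make the debug stats above concrete, here is a hedged sketch of
reading them for a child cgroup once a kernel built with
CONFIG_BFQ_CGROUP_DEBUG=y is running; the cgroup name test1 is
hypothetical, and with bfq the files may appear with a bfq. infix
(e.g. blkio.bfq.avg_queue_size):

  # Average queue size seen over this cgroup's lifetime.
  cat /sys/fs/cgroup/blkio/test1/blkio.avg_queue_size
  # Time the group waited to get a timeslice after becoming busy.
  cat /sys/fs/cgroup/blkio/test1/blkio.group_wait_time
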
56 changes: 56 additions & 0 deletions Documentation/fault-injection/nvme-fault-injection.txt
@@ -114,3 +114,59 @@ R13: ffff88011a3c9680 R14: 0000000000000000 R15: 0000000000000000
cpu_startup_entry+0x6f/0x80
start_secondary+0x187/0x1e0
secondary_startup_64+0xa5/0xb0

Example 3: Inject an error into the 10th admin command
------------------------------------------------------

echo 100 > /sys/kernel/debug/nvme0/fault_inject/probability
echo 10 > /sys/kernel/debug/nvme0/fault_inject/space
echo 1 > /sys/kernel/debug/nvme0/fault_inject/times
nvme reset /dev/nvme0

Expected Result:

After NVMe controller reset, the reinitialization may or may not succeed.
It depends on which admin command is actually forced to fail.

Message from dmesg:

nvme nvme0: resetting controller
FAULT_INJECTION: forcing a failure.
name fault_inject, interval 1, probability 100, space 1, times 1
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.2.0-rc2+ #2
Hardware name: MSI MS-7A45/B150M MORTAR ARCTIC (MS-7A45), BIOS 1.50 04/25/2017
Call Trace:
<IRQ>
dump_stack+0x63/0x85
should_fail+0x14a/0x170
nvme_should_fail+0x38/0x80 [nvme_core]
nvme_irq+0x129/0x280 [nvme]
? blk_mq_end_request+0xb3/0x120
__handle_irq_event_percpu+0x84/0x1a0
handle_irq_event_percpu+0x32/0x80
handle_irq_event+0x3b/0x60
handle_edge_irq+0x7f/0x1a0
handle_irq+0x20/0x30
do_IRQ+0x4e/0xe0
common_interrupt+0xf/0xf
</IRQ>
RIP: 0010:cpuidle_enter_state+0xc5/0x460
Code: ff e8 8f 5f 86 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 69 03 00 00 31 ff e8 62 aa 8c ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 37 03 00 00 4c 8b 45 d0 4c 2b 45 b8 48 ba cf f7 53
RSP: 0018:ffffffff88c03dd0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdc
RAX: ffff9dac25a2ac80 RBX: ffffffff88d53760 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 000000002d958403 RDI: 0000000000000000
RBP: ffffffff88c03e18 R08: fffffff75e35ffb7 R09: 00000a49a56c0b48
R10: ffffffff88c03da0 R11: 0000000000001b0c R12: ffff9dac25a34d00
R13: 0000000000000006 R14: 0000000000000006 R15: ffffffff88d53760
cpuidle_enter+0x2e/0x40
call_cpuidle+0x23/0x40
do_idle+0x201/0x280
cpu_startup_entry+0x1d/0x20
rest_init+0xaa/0xb0
arch_call_rest_init+0xe/0x1b
start_kernel+0x51c/0x53b
x86_64_start_reservations+0x24/0x26
x86_64_start_kernel+0x74/0x77
secondary_startup_64+0xa4/0xb0
nvme nvme0: Could not set queue count (16385)
nvme nvme0: IO queues not created
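
Once the experiment is finished, injection can be turned back off
through the same debugfs attributes used above (a follow-on sketch, not
part of the documented example):

  # A probability of 0 disables fault injection entirely; clearing
  # times additionally drops any remaining failure budget.
  echo 0 > /sys/kernel/debug/nvme0/fault_inject/probability
  echo 0 > /sys/kernel/debug/nvme0/fault_inject/times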
7 changes: 7 additions & 0 deletions block/Kconfig.iosched
@@ -36,6 +36,13 @@ config BFQ_GROUP_IOSCHED
Enable hierarchical scheduling in BFQ, using the blkio
(cgroups-v1) or io (cgroups-v2) controller.

config BFQ_CGROUP_DEBUG
bool "BFQ IO controller debugging"
depends on BFQ_GROUP_IOSCHED
---help---
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.

endmenu

endif
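
Whether a running kernel was built with the new option can be checked
from its shipped config; a hedged sketch, since the config location
varies by distribution:

  # Either location usually works, depending on the distro:
  zgrep BFQ_CGROUP_DEBUG /proc/config.gz 2>/dev/null
  grep BFQ_CGROUP_DEBUG /boot/config-$(uname -r)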