Merge tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper updates from Mike Snitzer:
 "Add the ability to collect I/O statistics on user-defined regions of a
  device-mapper device.  This dm-stats code required the reintroduction
  of a div64_u64_rem() helper, but as a separate method that doesn't
  slow down div64_u64() -- especially on 32-bit systems.

  Allow the error target to replace request-based DM devices (e.g.
  multipath) in addition to bio-based DM devices.

  Various other small code fixes and improvements to thin-provisioning,
  DM cache and the DM ioctl interface"

* tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm stripe: silence a couple sparse warnings
  dm: add statistics support
  dm thin: always return -ENOSPC if no_free_space is set
  dm ioctl: cleanup error handling in table_load
  dm ioctl: increase granularity of type_lock when loading table
  dm ioctl: prevent rename to empty name or uuid
  dm thin: set pool read-only if breaking_sharing fails block allocation
  dm thin: prefix pool error messages with pool device name
  dm: allow error target to replace bio-based and request-based targets
  math64: New separate div64_u64_rem helper
  dm space map: optimise sm_ll_dec and sm_ll_inc
  dm btree: prefetch child nodes when walking tree for a dm_btree_del
  dm btree: use pop_frame in dm_btree_del to cleanup code
  dm cache: eliminate holes in cache structure
  dm cache: fix stacking of geometry limits
  dm thin: fix stacking of geometry limits
  dm thin: add data block size limits to Documentation
  dm cache: add data block size limits to code and Documentation
  dm cache: document metadata device is exclussive to a cache
  dm: stop using WQ_NON_REENTRANT
torvalds committed Sep 10, 2013
2 parents 4d7696f + 7fff5e8 commit 7426d62
Showing 25 changed files with 1,621 additions and 162 deletions.
6 changes: 4 additions & 2 deletions Documentation/device-mapper/cache.txt
@@ -50,14 +50,16 @@ other parameters detailed later):
which are dirty, and extra hints for use by the policy object.
This information could be put on the cache device, but having it
separate allows the volume manager to configure it differently,
e.g. as a mirror for extra robustness.
e.g. as a mirror for extra robustness. This metadata device may only
be used by a single cache device.

Fixed block size
----------------

The origin is divided up into blocks of a fixed size. This block size
is configurable when you first create the cache. Typically we've been
using block sizes of 256k - 1024k.
using block sizes of 256KB - 1024KB. The block size must be between 64
(32KB) and 2097152 (1GB) and a multiple of 64 (32KB).

Having a fixed block size simplifies the target a lot. But it is
something of a compromise. For instance, a small part of a block may be
186 changes: 186 additions & 0 deletions Documentation/device-mapper/statistics.txt
@@ -0,0 +1,186 @@
DM statistics
=============

Device Mapper supports the collection of I/O statistics on user-defined
regions of a DM device. If no regions are defined no statistics are
collected so there isn't any performance impact. Only bio-based DM
devices are currently supported.

Each user-defined region specifies a starting sector, length and step.
Individual statistics will be collected for each step-sized area within
the range specified.

The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats (see:
Documentation/iostats.txt). But two extra counters (12 and 13) are
provided: total time spent reading and writing in milliseconds. All
these counters may be accessed by sending the @stats_print message to
the appropriate DM device via dmsetup.

Each region has a corresponding unique identifier, which we call a
region_id, that is assigned when the region is created. The region_id
must be supplied when querying statistics about the region, deleting the
region, etc. Unique region_ids enable multiple userspace programs to
request and process statistics for the same DM device without stepping
on each other's data.

The creation of DM statistics will allocate memory via kmalloc or
fall back to using vmalloc space. At most, 1/4 of the overall system
memory may be allocated by DM statistics. The admin can see how much
memory is used by reading
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
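
A monitoring tool could poll that parameter along the lines of the C
sketch below. The helper name read_stats_allocated_bytes is made up for
illustration and is not part of the kernel or dmsetup. The sketch assumes
the file holds a single unsigned decimal value, and takes the path as a
parameter so it can be pointed at any file.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read one unsigned decimal value from a
 * sysfs-style file such as
 * /sys/module/dm_mod/parameters/stats_current_allocated_bytes.
 * Returns 0 on success and stores the value in *out;
 * returns -1 if the file cannot be opened or parsed.
 */
int read_stats_allocated_bytes(const char *path, unsigned long long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```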

Messages
========

@stats_create <range> <step> [<program_id> [<aux_data>]]

Create a new region and return the region_id.

<range>
"-" - whole device
"<start_sector>+<length>" - a range of <length> 512-byte sectors
starting with <start_sector>.

<step>
"<area_size>" - the range is subdivided into areas each containing
<area_size> sectors.
"/<number_of_areas>" - the range is subdivided into the specified
number of areas.

<program_id>
An optional parameter. A name that uniquely identifies
the userspace owner of the range. This groups ranges together
so that userspace programs can identify the ranges they
created and ignore those created by others.
The kernel returns this string back in the output of
@stats_list message, but it doesn't use it for anything else.

<aux_data>
An optional parameter. A word that provides auxiliary data
that is useful to the client program that created the range.
The kernel returns this string back in the output of
@stats_list message, but it doesn't use this value for anything.
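
The relationship between <range> and <step> can be sketched in C as
follows. This is an illustration only: the function names are invented,
and the rounding for "/<number_of_areas>" (round the step up, never let
it reach zero) is an assumption, since the text above does not spell it
out.

```c
#include <stdint.h>

/*
 * Step size, in sectors, when the user asks for "/<number_of_areas>".
 * Assumption: round up so the whole range is covered, and never
 * return a zero step. Returns 0 if number_of_areas is 0 (invalid).
 */
uint64_t step_from_area_count(uint64_t length, uint64_t number_of_areas)
{
	uint64_t step;

	if (number_of_areas == 0)
		return 0;
	step = length / number_of_areas;
	if (length % number_of_areas)
		step++;
	return step ? step : 1;
}

/* Number of step-sized areas needed to cover length sectors (step > 0). */
uint64_t area_count(uint64_t length, uint64_t step)
{
	return length / step + (length % step ? 1 : 0);
}

/* Index of the area containing sector, or -1 if it lies outside the range. */
int64_t area_index(uint64_t start, uint64_t length, uint64_t step,
		   uint64_t sector)
{
	if (step == 0 || sector < start || sector - start >= length)
		return -1;
	return (int64_t)((sector - start) / step);
}
```

With these assumptions, a 1000-sector range split with "/100" gives a
step of 10 sectors, and sector 999 falls into area 99.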

@stats_delete <region_id>

Delete the region with the specified id.

<region_id>
region_id returned from @stats_create

@stats_clear <region_id>

Clear all the counters except the in-flight i/o counters.

<region_id>
region_id returned from @stats_create

@stats_list [<program_id>]

List all regions registered with @stats_create.

<program_id>
An optional parameter.
If this parameter is specified, only matching regions
are returned.
If it is not specified, all regions are returned.

Output format:
<region_id>: <start_sector>+<length> <step> <program_id> <aux_data>

@stats_print <region_id> [<starting_line> <number_of_lines>]

Print counters for each step-sized area of a region.

<region_id>
region_id returned from @stats_create

<starting_line>
The index of the starting line in the output.
If omitted, all lines are returned.

<number_of_lines>
The number of lines to include in the output.
If omitted, all lines are returned.

Output format for each step-sized area of a region:

<start_sector>+<length> counters

The first 11 counters have the same meaning as
/sys/block/*/stat or /proc/diskstats.

Please refer to Documentation/iostats.txt for details.

1. the number of reads completed
2. the number of reads merged
3. the number of sectors read
4. the number of milliseconds spent reading
5. the number of writes completed
6. the number of writes merged
7. the number of sectors written
8. the number of milliseconds spent writing
9. the number of I/Os currently in progress
10. the number of milliseconds spent doing I/Os
11. the weighted number of milliseconds spent doing I/Os

Additional counters:
12. the total time spent reading in milliseconds
13. the total time spent writing in milliseconds
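
A client that consumes this output might parse each line roughly as in
the C sketch below. The struct and function names are hypothetical, and
the sketch assumes the 13 counters are whitespace-separated unsigned
decimal numbers following the "<start_sector>+<length>" field.

```c
#include <stdio.h>

#define NR_AREA_COUNTERS 13	/* hypothetical name: counters 1..13 above */

struct area_stats {
	unsigned long long start_sector;
	unsigned long long length;
	unsigned long long counters[NR_AREA_COUNTERS];
};

/*
 * Parse one line of the form "<start_sector>+<length> c1 c2 ... c13".
 * Returns 0 on success, -1 if the line is malformed or too short.
 */
int parse_stats_line(const char *line, struct area_stats *out)
{
	int consumed = 0;
	int i;

	if (sscanf(line, "%llu+%llu%n", &out->start_sector, &out->length,
		   &consumed) != 2)
		return -1;
	line += consumed;

	for (i = 0; i < NR_AREA_COUNTERS; i++) {
		if (sscanf(line, "%llu%n", &out->counters[i], &consumed) != 1)
			return -1;
		line += consumed;
	}
	return 0;
}
```

After a successful parse, counters[0] holds counter 1 (reads completed)
and counters[12] holds counter 13 (total time spent writing).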

@stats_print_clear <region_id> [<starting_line> <number_of_lines>]

Atomically print and then clear all the counters except the
in-flight i/o counters. Useful when the client consuming the
statistics does not want to lose any statistics (those updated
between printing and clearing).

<region_id>
region_id returned from @stats_create

<starting_line>
The index of the starting line in the output.
If omitted, all lines are printed and then cleared.

<number_of_lines>
The number of lines to process.
If omitted, all lines are printed and then cleared.

@stats_set_aux <region_id> <aux_data>

Store auxiliary data aux_data for the specified region.

<region_id>
region_id returned from @stats_create

<aux_data>
The string that identifies data which is useful to the client
program that created the range. The kernel returns this
string back in the output of @stats_list message, but it
doesn't use this value for anything.

Examples
========

Subdivide the DM device 'vol' into 100 pieces and start collecting
statistics on them:

dmsetup message vol 0 @stats_create - /100

Set the auxiliary data string to "foo bar baz" (the escape for each
space must also be escaped, otherwise the shell will consume them):

dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz

List the statistics:

dmsetup message vol 0 @stats_list

Print the statistics:

dmsetup message vol 0 @stats_print 0

Delete the statistics:

dmsetup message vol 0 @stats_delete 0
15 changes: 8 additions & 7 deletions Documentation/device-mapper/thin-provisioning.txt
@@ -99,13 +99,14 @@ Using an existing pool device
$data_block_size $low_water_mark"

$data_block_size gives the smallest unit of disk space that can be
allocated at a time expressed in units of 512-byte sectors. People
primarily interested in thin provisioning may want to use a value such
as 1024 (512KB). People doing lots of snapshotting may want a smaller value
such as 128 (64KB). If you are not zeroing newly-allocated data,
a larger $data_block_size in the region of 256000 (128MB) is suggested.
$data_block_size must be the same for the lifetime of the
metadata device.
allocated at a time expressed in units of 512-byte sectors.
$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
multiple of 128 (64KB). $data_block_size cannot be changed after the
thin-pool is created. People primarily interested in thin provisioning
may want to use a value such as 1024 (512KB). People doing lots of
snapshotting may want a smaller value such as 128 (64KB). If you are
not zeroing newly-allocated data, a larger $data_block_size in the
region of 256000 (128MB) is suggested.
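
The limits stated above can be checked before building a table line. The
following C sketch is an illustration only; the macro and function names
are invented here and are not taken from the thin-pool source.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9				/* 512-byte sectors */
#define THIN_BLOCK_MIN_SECTORS 128u		/* 64KB, per the text above */
#define THIN_BLOCK_MAX_SECTORS 2097152u		/* 1GB, per the text above */

/* True if sectors is within the limits and a multiple of 128 sectors. */
bool thin_block_size_valid(uint64_t sectors)
{
	return sectors >= THIN_BLOCK_MIN_SECTORS &&
	       sectors <= THIN_BLOCK_MAX_SECTORS &&
	       (sectors % THIN_BLOCK_MIN_SECTORS) == 0;
}

/* Convert a size in 512-byte sectors to bytes. */
uint64_t sectors_to_bytes(uint64_t sectors)
{
	return sectors << SECTOR_SHIFT;
}
```

For example, 1024 sectors passes the check and corresponds to 524288
bytes (512KB), while 192 sectors is rejected because it is not a multiple
of 128.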

$low_water_mark is expressed in blocks of size $data_block_size. If
free space on the data device drops below this level then a dm event
2 changes: 1 addition & 1 deletion drivers/md/Makefile
@@ -3,7 +3,7 @@
#

dm-mod-y += dm.o dm-table.o dm-target.o dm-linear.o dm-stripe.o \
dm-ioctl.o dm-io.o dm-kcopyd.o dm-sysfs.o
dm-ioctl.o dm-io.o dm-kcopyd.o dm-sysfs.o dm-stats.o
dm-multipath-y += dm-path-selector.o dm-mpath.o
dm-snapshot-y += dm-snap.o dm-exception-store.o dm-snap-transient.o \
dm-snap-persistent.o
59 changes: 35 additions & 24 deletions drivers/md/dm-cache-target.c
@@ -67,9 +67,11 @@ static void free_bitset(unsigned long *bits)
#define MIGRATION_COUNT_WINDOW 10

/*
* The block size of the device holding cache data must be >= 32KB
* The block size of the device holding cache data must be
* between 32KB and 1GB.
*/
#define DATA_DEV_BLOCK_SIZE_MIN_SECTORS (32 * 1024 >> SECTOR_SHIFT)
#define DATA_DEV_BLOCK_SIZE_MAX_SECTORS (1024 * 1024 * 1024 >> SECTOR_SHIFT)

/*
* FIXME: the cache is read/write for the time being.
@@ -101,6 +103,8 @@ struct cache {
struct dm_target *ti;
struct dm_target_callbacks callbacks;

struct dm_cache_metadata *cmd;

/*
* Metadata is written to this device.
*/
@@ -116,11 +120,6 @@ struct cache {
*/
struct dm_dev *cache_dev;

/*
* Cache features such as write-through.
*/
struct cache_features features;

/*
* Size of the origin device in _complete_ blocks and native sectors.
*/
@@ -138,8 +137,6 @@ struct cache {
uint32_t sectors_per_block;
int sectors_per_block_shift;

struct dm_cache_metadata *cmd;

spinlock_t lock;
struct bio_list deferred_bios;
struct bio_list deferred_flush_bios;
@@ -148,8 +145,8 @@ struct cache {
struct list_head completed_migrations;
struct list_head need_commit_migrations;
sector_t migration_threshold;
atomic_t nr_migrations;
wait_queue_head_t migration_wait;
atomic_t nr_migrations;

/*
* cache_size entries, dirty if set
@@ -160,9 +157,16 @@ struct cache {
/*
* origin_blocks entries, discarded if set.
*/
uint32_t discard_block_size; /* a power of 2 times sectors per block */
dm_dblock_t discard_nr_blocks;
unsigned long *discard_bitset;
uint32_t discard_block_size; /* a power of 2 times sectors per block */

/*
* Rather than reconstructing the table line for the status we just
* save it and regurgitate.
*/
unsigned nr_ctr_args;
const char **ctr_args;

struct dm_kcopyd_client *copier;
struct workqueue_struct *wq;
@@ -187,14 +191,12 @@ struct cache {
bool loaded_mappings:1;
bool loaded_discards:1;

struct cache_stats stats;

/*
* Rather than reconstructing the table line for the status we just
* save it and regurgitate.
* Cache features such as write-through.
*/
unsigned nr_ctr_args;
const char **ctr_args;
struct cache_features features;

struct cache_stats stats;
};

struct per_bio_data {
@@ -1687,24 +1689,25 @@ static int parse_origin_dev(struct cache_args *ca, struct dm_arg_set *as,
static int parse_block_size(struct cache_args *ca, struct dm_arg_set *as,
char **error)
{
unsigned long tmp;
unsigned long block_size;

if (!at_least_one_arg(as, error))
return -EINVAL;

if (kstrtoul(dm_shift_arg(as), 10, &tmp) || !tmp ||
tmp < DATA_DEV_BLOCK_SIZE_MIN_SECTORS ||
tmp & (DATA_DEV_BLOCK_SIZE_MIN_SECTORS - 1)) {
if (kstrtoul(dm_shift_arg(as), 10, &block_size) || !block_size ||
block_size < DATA_DEV_BLOCK_SIZE_MIN_SECTORS ||
block_size > DATA_DEV_BLOCK_SIZE_MAX_SECTORS ||
block_size & (DATA_DEV_BLOCK_SIZE_MIN_SECTORS - 1)) {
*error = "Invalid data block size";
return -EINVAL;
}

if (tmp > ca->cache_sectors) {
if (block_size > ca->cache_sectors) {
*error = "Data block size is larger than the cache device";
return -EINVAL;
}

ca->block_size = tmp;
ca->block_size = block_size;

return 0;
}
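
The validation in parse_block_size() can be tried out in userspace with a
sketch like the one below. It is not the kernel code: strtoul() stands in
for kstrtoul(), SECTOR_SHIFT is assumed to be 9, and check_block_size()
is a hypothetical name. The two limit macros are copied from the hunk
above.

```c
#include <errno.h>
#include <stdlib.h>

#define SECTOR_SHIFT 9	/* assumed: 512-byte sectors */
#define DATA_DEV_BLOCK_SIZE_MIN_SECTORS (32 * 1024 >> SECTOR_SHIFT)
#define DATA_DEV_BLOCK_SIZE_MAX_SECTORS (1024 * 1024 * 1024 >> SECTOR_SHIFT)

/*
 * Hypothetical userspace counterpart of parse_block_size().
 * Returns 0 and stores the size in *block_size, or -EINVAL if arg is not
 * a decimal number, is outside [min, max], is not a multiple of the
 * minimum, or exceeds the cache device size in sectors.
 */
int check_block_size(const char *arg, unsigned long cache_sectors,
		     unsigned long *block_size)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(arg, &end, 10);
	if (errno || end == arg || *end != '\0')
		return -EINVAL;

	if (v < DATA_DEV_BLOCK_SIZE_MIN_SECTORS ||
	    v > DATA_DEV_BLOCK_SIZE_MAX_SECTORS ||
	    (v & (DATA_DEV_BLOCK_SIZE_MIN_SECTORS - 1)))
		return -EINVAL;

	if (v > cache_sectors)
		return -EINVAL;

	*block_size = v;
	return 0;
}
```

The mask test works because the minimum (64 sectors) is a power of two,
so any multiple of it has its low six bits clear.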
@@ -2609,9 +2612,17 @@ static void set_discard_limits(struct cache *cache, struct queue_limits *limits)
static void cache_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
struct cache *cache = ti->private;
uint64_t io_opt_sectors = limits->io_opt >> SECTOR_SHIFT;

blk_limits_io_min(limits, 0);
blk_limits_io_opt(limits, cache->sectors_per_block << SECTOR_SHIFT);
/*
* If the system-determined stacked limits are compatible with the
* cache's blocksize (io_opt is a factor) do not override them.
*/
if (io_opt_sectors < cache->sectors_per_block ||
do_div(io_opt_sectors, cache->sectors_per_block)) {
blk_limits_io_min(limits, 0);
blk_limits_io_opt(limits, cache->sectors_per_block << SECTOR_SHIFT);
}
set_discard_limits(cache, limits);
}
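
The decision made in cache_io_hints() can be isolated as a small
predicate. The sketch below is a userspace illustration under stated
assumptions: the function name is invented, SECTOR_SHIFT is assumed to
be 9, and the plain % operator replaces do_div(), which in the kernel
divides its first argument in place and returns the remainder.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* assumed: 512-byte sectors */

/*
 * Hypothetical predicate: true when the cache should override the
 * stacked io_min/io_opt limits, i.e. when io_opt (in bytes) is smaller
 * than the cache block size or is not a whole multiple of it.
 * A zero sectors_per_block is not meaningful and yields false.
 */
bool must_override_io_hints(uint32_t io_opt_bytes, uint32_t sectors_per_block)
{
	uint64_t io_opt_sectors = io_opt_bytes >> SECTOR_SHIFT;

	if (sectors_per_block == 0)
		return false;

	return io_opt_sectors < sectors_per_block ||
	       (io_opt_sectors % sectors_per_block) != 0;
}
```

With a 512-sector (256KB) cache block, a stacked io_opt of 524288 bytes
is left alone, while an io_opt of 393216 bytes (768 sectors) is
overridden because it is not a multiple of the block size.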
