Merge branch 'md-next' into md-linus
shligit committed May 1, 2017
2 parents 85724ed + b506335 commit e265eb3
Showing 26 changed files with 3,582 additions and 1,483 deletions.
32 changes: 29 additions & 3 deletions Documentation/admin-guide/md.rst
@@ -276,14 +276,14 @@ All md devices contain:
array creation it will default to 0, though starting the array as
``clean`` will set it much larger.

new_dev
This file can be written but not read. The value written should
be a block device number as major:minor. e.g. 8:0
This will cause that device to be attached to the array, if it is
available. It will then appear at md/dev-XXX (depending on the
name of the device) and further configuration is then possible.
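
A minimal user-space sketch of this interface, assuming a hypothetical
array named ``md0`` and the example device number 8:0 from the text above::

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            /* Illustrative only: array name and device number are examples. */
            const char *attr = "/sys/block/md0/md/new_dev";
            const char *dev = "8:0";   /* major:minor of the device to attach */
            int fd = open(attr, O_WRONLY);

            if (fd < 0) {
                    perror(attr);
                    return 1;
            }
            if (write(fd, dev, strlen(dev)) != (ssize_t)strlen(dev))
                    perror("write");
            close(fd);
            return 0;
    }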

safe_mode_delay
When an md array has seen no write requests for a certain period
of time, it will be marked as ``clean``. When another write
request arrives, the array is marked as ``dirty`` before the write
@@ -292,7 +292,7 @@ All md devices contain:
period as a number of seconds. The default is 200msec (0.200).
Writing a value of 0 disables safemode.
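
The value is written as decimal seconds. A hypothetical helper that takes
the delay in milliseconds and formats it accordingly (the ``md0`` path is
illustrative only)::

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Write a safe-mode delay of `ms` milliseconds as "seconds.milliseconds". */
    static int set_safe_mode_delay(const char *attr, unsigned int ms)
    {
            char buf[32];
            int fd = open(attr, O_WRONLY);

            if (fd < 0)
                    return -1;
            snprintf(buf, sizeof(buf), "%u.%03u", ms / 1000, ms % 1000);
            if (write(fd, buf, strlen(buf)) < 0) {
                    close(fd);
                    return -1;
            }
            return close(fd);
    }

    int main(void)
    {
            /* 5 seconds; writing "0" instead would disable safemode. */
            return set_safe_mode_delay("/sys/block/md0/md/safe_mode_delay", 5000);
    }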

array_state
This file contains a single word which describes the current
state of the array. In many cases, the state can be set by
writing the word for the desired state, however some states
@@ -401,7 +401,30 @@ All md devices contain:
once the array becomes non-degraded, and this fact has been
recorded in the metadata.

consistency_policy
This indicates how the array maintains consistency in case of unexpected
shutdown. It can be:

none
Array has no redundancy information, e.g. raid0, linear.

resync
Full resync is performed and all redundancy is regenerated when the
array is started after unclean shutdown.

bitmap
Resync assisted by a write-intent bitmap.

journal
For raid4/5/6, a journal device is used to log transactions and replay
after an unclean shutdown.

ppl
For raid5 only, Partial Parity Log is used to close the write hole and
eliminate resync.

The accepted values when writing to this file are ``ppl`` and ``resync``,
used to enable and disable PPL.
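
A sketch of reading the current policy and then enabling PPL from user
space; the ``md0`` path is illustrative and error handling is minimal::

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            const char *attr = "/sys/block/md0/md/consistency_policy";
            char cur[32];
            int fd;

            fd = open(attr, O_RDONLY);
            if (fd >= 0) {
                    ssize_t n = read(fd, cur, sizeof(cur) - 1);

                    if (n > 0) {
                            cur[n] = '\0';
                            printf("current policy: %s", cur);
                    }
                    close(fd);
            }

            fd = open(attr, O_WRONLY);      /* switch the array to PPL */
            if (fd < 0 || write(fd, "ppl", 3) != 3)
                    perror(attr);
            if (fd >= 0)
                    close(fd);
            return 0;
    }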


As component devices are added to an md array, they appear in the ``md``
@@ -563,6 +586,9 @@ Each directory contains:
adds bad blocks without acknowledging them. This is largely
for testing.

ppl_sector, ppl_size
Location and size (in sectors) of the space used for Partial Parity Log
on this device.
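
These attributes live in each component device's ``dev-XXX`` directory under
``md/``. A small illustrative sketch that dumps them for a hypothetical
``md0`` array::

    #include <glob.h>
    #include <stdio.h>

    static void print_attr(const char *path)
    {
            char buf[64];
            FILE *f = fopen(path, "r");

            if (f && fgets(buf, sizeof(buf), f))
                    printf("%s: %s", path, buf);
            if (f)
                    fclose(f);
    }

    int main(void)
    {
            glob_t g;
            size_t i;

            /* One dev-XXX directory exists per member device. */
            if (glob("/sys/block/md0/md/dev-*/ppl_*", 0, NULL, &g) == 0) {
                    for (i = 0; i < g.gl_pathc; i++)
                            print_attr(g.gl_pathv[i]);
                    globfree(&g);
            }
            return 0;
    }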


An active md device will also contain an entry for each active device
2 changes: 1 addition & 1 deletion Documentation/md/md-cluster.txt
@@ -321,4 +321,4 @@ The algorithm is:

There are some things which are not supported by cluster MD yet.

- update size and change array_sectors.
- change array_sectors.
44 changes: 44 additions & 0 deletions Documentation/md/raid5-ppl.txt
@@ -0,0 +1,44 @@
Partial Parity Log

Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
addressed by PPL is that after a dirty shutdown, parity of a particular stripe
may become inconsistent with data on other member disks. If the array is also
in degraded state, there is no way to recalculate parity, because one of the
disks is missing. This can lead to silent data corruption when rebuilding the
array or using it as degraded - data calculated from parity for array blocks
that have not been touched by a write request during the unclean shutdown can
be incorrect. Such a condition is known as the RAID5 Write Hole. Because of
this, md by default does not allow starting a dirty degraded array.

Partial parity for a write operation is the XOR of stripe data chunks not
modified by this write. It is just enough data to recover from the write
hole. XORing partial parity with the modified chunks produces parity for the
stripe, consistent with its state before the write operation, regardless of
which chunk writes have completed. If one of the unmodified data disks of
this stripe is missing, this updated parity can be used to recover its
contents. PPL recovery is also performed when starting an array after an
unclean shutdown and all disks are available, eliminating the need to resync
the array. Because of this, using a write-intent bitmap and PPL together is not
supported.
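
As a toy illustration of the relations described above (not the kernel
implementation): take a four-chunk stripe where only chunk 1 is being
rewritten. The partial parity is the XOR of chunks 0, 2 and 3, and combining
it with whatever version of chunk 1 is found on disk yields a parity from
which a missing unmodified chunk can be recovered:

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
            /* One byte per chunk for simplicity; chunk 1 is the only one written. */
            uint8_t d[4]   = { 0x11, 0x22, 0x44, 0x88 }; /* data before the write  */
            uint8_t d1_new = 0x33;                       /* new content of chunk 1 */

            /* Partial parity: XOR of the chunks NOT modified by this write. */
            uint8_t pp = d[0] ^ d[2] ^ d[3];

            /* Whether or not the chunk 1 write completed, pp ^ <chunk 1 as found
             * on disk> is a parity consistent with the rest of the stripe, so a
             * missing unmodified chunk - here chunk 2 - can be recovered from it
             * and the surviving data chunks. */
            assert(((pp ^ d[1]) ^ d[0] ^ d[1] ^ d[3]) == d[2]);
            assert(((pp ^ d1_new) ^ d[0] ^ d1_new ^ d[3]) == d[2]);
            return 0;
    }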

When handling a write request, PPL writes partial parity before new data and
parity are dispatched to disks. PPL is a distributed log - it is stored on
array member drives in the metadata area, on the parity drive of a particular
stripe. It does not require a dedicated journaling drive. Write performance is
reduced by up to 30%-40%, but it scales with the number of drives in the
array, and there is no dedicated journaling drive to become a bottleneck or a
single point of failure.

Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
not a true journal. It does not protect from losing in-flight data, only from
silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
performed for this stripe (parity is not updated). So it is possible to have
arbitrary data in the written part of a stripe if that disk is lost. In such
a case the behavior is the same as in plain raid5.

PPL is available for md version-1 metadata and external (specifically IMSM)
metadata arrays. It can be enabled using the mdadm option --consistency-policy=ppl.

Currently, the volatile write-back cache should be disabled on all member
drives when using PPL. Otherwise PPL cannot guarantee consistency in case of
power failure.
61 changes: 13 additions & 48 deletions block/bio.c
@@ -633,20 +633,21 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
}
EXPORT_SYMBOL(bio_clone_fast);

static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
struct bio_set *bs, int offset,
int size)
/**
* bio_clone_bioset - clone a bio
* @bio_src: bio to clone
* @gfp_mask: allocation priority
* @bs: bio_set to allocate from
*
* Clone bio. Caller will own the returned bio, but not the actual data it
* points to. Reference count of returned bio will be one.
*/
struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
struct bio_set *bs)
{
struct bvec_iter iter;
struct bio_vec bv;
struct bio *bio;
struct bvec_iter iter_src = bio_src->bi_iter;

/* for supporting partial clone */
if (offset || size != bio_src->bi_iter.bi_size) {
bio_advance_iter(bio_src, &iter_src, offset);
iter_src.bi_size = size;
}

/*
* Pre immutable biovecs, __bio_clone() used to just do a memcpy from
@@ -670,8 +671,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
* __bio_clone_fast() anyways.
*/

bio = bio_alloc_bioset(gfp_mask, __bio_segments(bio_src,
&iter_src), bs);
bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
if (!bio)
return NULL;
bio->bi_bdev = bio_src->bi_bdev;
@@ -688,7 +688,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
break;
default:
__bio_for_each_segment(bv, bio_src, iter, iter_src)
bio_for_each_segment(bv, bio_src, iter)
bio->bi_io_vec[bio->bi_vcnt++] = bv;
break;
}
@@ -707,43 +707,8 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,

return bio;
}

/**
* bio_clone_bioset - clone a bio
* @bio_src: bio to clone
* @gfp_mask: allocation priority
* @bs: bio_set to allocate from
*
* Clone bio. Caller will own the returned bio, but not the actual data it
* points to. Reference count of returned bio will be one.
*/
struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
struct bio_set *bs)
{
return __bio_clone_bioset(bio_src, gfp_mask, bs, 0,
bio_src->bi_iter.bi_size);
}
EXPORT_SYMBOL(bio_clone_bioset);

/**
* bio_clone_bioset_partial - clone a partial bio
* @bio_src: bio to clone
* @gfp_mask: allocation priority
* @bs: bio_set to allocate from
* @offset: cloned starting from the offset
* @size: size for the cloned bio
*
* Clone bio. Caller will own the returned bio, but not the actual data it
* points to. Reference count of returned bio will be one.
*/
struct bio *bio_clone_bioset_partial(struct bio *bio_src, gfp_t gfp_mask,
struct bio_set *bs, int offset,
int size)
{
return __bio_clone_bioset(bio_src, gfp_mask, bs, offset, size);
}
EXPORT_SYMBOL(bio_clone_bioset_partial);

/**
* bio_add_pc_page - attempt to add page to bio
* @q: the target queue
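
As an aside on the bio_clone_bioset() contract restored above (the caller owns
the clone, with a reference count of one, but not the data pages it points
to), a hypothetical in-kernel caller might look like this; my_bio_set,
my_forward_bio and the endio pairing are illustrative and not part of this
commit:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Completion handler for the clone: propagate the status (bi_error in this
 * kernel vintage), drop our reference and complete the original bio. */
static void my_clone_endio(struct bio *clone)
{
        struct bio *orig = clone->bi_private;

        orig->bi_error = clone->bi_error;
        bio_put(clone);
        bio_endio(orig);
}

/* Forward a bio by cloning it from a driver-private bio_set (assumed to have
 * been created earlier with bioset_create()). */
static int my_forward_bio(struct bio *orig, struct bio_set *my_bio_set)
{
        struct bio *clone = bio_clone_bioset(orig, GFP_NOIO, my_bio_set);

        if (!clone)
                return -ENOMEM;

        clone->bi_private = orig;
        clone->bi_end_io = my_clone_endio;
        generic_make_request(clone);
        return 0;
}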
2 changes: 1 addition & 1 deletion drivers/md/Makefile
@@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
dm-era-y += dm-era-target.o
dm-verity-y += dm-verity-target.o
md-mod-y += md.o bitmap.o
raid456-y += raid5.o raid5-cache.o
raid456-y += raid5.o raid5-cache.o raid5-ppl.o

# Note: link order is important. All raid personalities
# and must come before md.o, as they each initialise
59 changes: 48 additions & 11 deletions drivers/md/bitmap.c
@@ -471,6 +471,7 @@ void bitmap_update_sb(struct bitmap *bitmap)
kunmap_atomic(sb);
write_page(bitmap, bitmap->storage.sb_page, 1);
}
EXPORT_SYMBOL(bitmap_update_sb);

/* print out the bitmap file superblock */
void bitmap_print_sb(struct bitmap *bitmap)
@@ -696,7 +697,7 @@ static int bitmap_read_sb(struct bitmap *bitmap)

out:
kunmap_atomic(sb);
/* Assiging chunksize is required for "re_read" */
/* Assigning chunksize is required for "re_read" */
bitmap->mddev->bitmap_info.chunksize = chunksize;
if (err == 0 && nodes && (bitmap->cluster_slot < 0)) {
err = md_setup_cluster(bitmap->mddev, nodes);
@@ -1727,7 +1728,7 @@ void bitmap_flush(struct mddev *mddev)
/*
* free memory that was allocated
*/
static void bitmap_free(struct bitmap *bitmap)
void bitmap_free(struct bitmap *bitmap)
{
unsigned long k, pages;
struct bitmap_page *bp;
@@ -1761,6 +1762,21 @@ static void bitmap_free(struct bitmap *bitmap)
kfree(bp);
kfree(bitmap);
}
EXPORT_SYMBOL(bitmap_free);

void bitmap_wait_behind_writes(struct mddev *mddev)
{
struct bitmap *bitmap = mddev->bitmap;

/* wait for behind writes to complete */
if (bitmap && atomic_read(&bitmap->behind_writes) > 0) {
pr_debug("md:%s: behind writes in progress - waiting to stop.\n",
mdname(mddev));
/* need to kick something here to make sure I/O goes? */
wait_event(bitmap->behind_wait,
atomic_read(&bitmap->behind_writes) == 0);
}
}

void bitmap_destroy(struct mddev *mddev)
{
@@ -1769,6 +1785,8 @@ void bitmap_destroy(struct mddev *mddev)
if (!bitmap) /* there was no bitmap */
return;

bitmap_wait_behind_writes(mddev);

mutex_lock(&mddev->bitmap_info.mutex);
spin_lock(&mddev->lock);
mddev->bitmap = NULL; /* disconnect from the md device */
@@ -1920,6 +1938,27 @@ int bitmap_load(struct mddev *mddev)
}
EXPORT_SYMBOL_GPL(bitmap_load);

struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot)
{
int rv = 0;
struct bitmap *bitmap;

bitmap = bitmap_create(mddev, slot);
if (IS_ERR(bitmap)) {
rv = PTR_ERR(bitmap);
return ERR_PTR(rv);
}

rv = bitmap_init_from_disk(bitmap, 0);
if (rv) {
bitmap_free(bitmap);
return ERR_PTR(rv);
}

return bitmap;
}
EXPORT_SYMBOL(get_bitmap_from_slot);

/* Loads the bitmap associated with slot and copies the resync information
* to our bitmap
*/
@@ -1929,14 +1968,13 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
int rv = 0, i, j;
sector_t block, lo = 0, hi = 0;
struct bitmap_counts *counts;
struct bitmap *bitmap = bitmap_create(mddev, slot);

if (IS_ERR(bitmap))
return PTR_ERR(bitmap);
struct bitmap *bitmap;

rv = bitmap_init_from_disk(bitmap, 0);
if (rv)
goto err;
bitmap = get_bitmap_from_slot(mddev, slot);
if (IS_ERR(bitmap)) {
pr_err("%s can't get bitmap from slot %d\n", __func__, slot);
return -1;
}

counts = &bitmap->counts;
for (j = 0; j < counts->chunks; j++) {
@@ -1963,8 +2001,7 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
bitmap_unplug(mddev->bitmap);
*low = lo;
*high = hi;
err:
bitmap_free(bitmap);

return rv;
}
EXPORT_SYMBOL_GPL(bitmap_copy_from_slot);
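
A hypothetical caller of the newly exported get_bitmap_from_slot() and
bitmap_free() pair (the md-cluster code is the real user); this sketch only
illustrates that the caller owns the returned bitmap and must free it, and the
helper name and chunk-count access are illustrative:

#include "md.h"
#include "bitmap.h"

/* Load the bitmap stored in `slot`, look at it, and release it again. */
static int peek_slot_bitmap(struct mddev *mddev, int slot)
{
        struct bitmap *bitmap = get_bitmap_from_slot(mddev, slot);

        if (IS_ERR(bitmap)) {
                pr_err("%s: can't get bitmap from slot %d\n", __func__, slot);
                return PTR_ERR(bitmap);
        }

        pr_debug("slot %d bitmap has %lu chunks\n", slot, bitmap->counts.chunks);

        bitmap_free(bitmap);    /* the caller owns the bitmap and must free it */
        return 0;
}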
3 changes: 3 additions & 0 deletions drivers/md/bitmap.h
@@ -267,8 +267,11 @@ void bitmap_daemon_work(struct mddev *mddev);

int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
int chunksize, int init);
struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot);
int bitmap_copy_from_slot(struct mddev *mddev, int slot,
sector_t *lo, sector_t *hi, bool clear_bits);
void bitmap_free(struct bitmap *bitmap);
void bitmap_wait_behind_writes(struct mddev *mddev);
#endif

#endif