Merge tag 'md/4.1' of git://neil.brown.name/md
Pull md updates from Neil Brown:
 "More updates than usual this time.  A few have performance impacts
  which should mostly be positive, but RAID5 (in particular) can be very
  work-load sensitive...  We'll have to wait and see.

  Highlights:

   - "experimental" code for managing md/raid1 across a cluster using
     DLM.  Code is not ready for general use and triggers a WARNING if
     used.  However it is looking good and mostly done and having it in
     mainline will help co-ordinate development.

   - RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
     handle a full (chunk wide) stripe as a single unit.

   - RAID6 can now perform read-modify-write cycles which should help
     performance on larger arrays: 6 or more devices.

   - RAID5/6 stripe cache now grows and shrinks dynamically.  The value
     set is used as a minimum.

   - Resync is now allowed to go a little faster than the 'minimum' when
     there is competing IO.  How much faster depends on the speed of the
     devices, so the effective minimum should scale with device speed to
     some extent"

* tag 'md/4.1' of git://neil.brown.name/md: (58 commits)
  md/raid5: don't do chunk aligned read on degraded array.
  md/raid5: allow the stripe_cache to grow and shrink.
  md/raid5: change ->inactive_blocked to a bit-flag.
  md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe
  md/raid5: pass gfp_t arg to grow_one_stripe()
  md/raid5: introduce configuration option rmw_level
  md/raid5: activate raid6 rmw feature
  md/raid6 algorithms: xor_syndrome() for SSE2
  md/raid6 algorithms: xor_syndrome() for generic int
  md/raid6 algorithms: improve test program
  md/raid6 algorithms: delta syndrome functions
  raid5: handle expansion/resync case with stripe batching
  raid5: handle io error of batch list
  RAID5: batch adjacent full stripe write
  raid5: track overwrite disk count
  raid5: add a new flag to track if a stripe can be batched
  raid5: use flex_array for scribble data
  md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
  md: allow resync to go faster when there is competing IO.
  md: remove 'go_faster' option from ->sync_request()
  ...
torvalds committed Apr 24, 2015
2 parents d56a669 + 9ffc8f7 commit 474095e
Showing 29 changed files with 2,860 additions and 305 deletions.
176 changes: 176 additions & 0 deletions Documentation/md-cluster.txt
@@ -0,0 +1,176 @@
The cluster MD is a shared-device RAID for a cluster.


1. On-disk format

Separate write-intent bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is:

0                    4k                     8k                    12k
-------------------------------------------------------------------
| idle                | md super            | bm super [0] + bits |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
| bm bits [3, contd]  |                     |                     |
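
As a rough illustration of the layout drawn above, the byte offset of each
node's bitmap superblock can be computed from its zero-based bitmap slot. This
is only a sketch: bm_super_offset() and the blocks_per_bitmap parameter are
hypothetical names, and the value 2 simply mirrors the two 4k blocks per node
shown in the diagram. A real array sizes its bitmaps from the device size.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 4096u        /* the 4k unit used in the diagram */
#define FIRST_BITMAP_BLOCK 2u   /* blocks 0 and 1 hold "idle" and the md super */

/*
 * Byte offset of the bitmap superblock for a zero-based bitmap slot.
 * blocks_per_bitmap is an assumption of this sketch: the diagram shows
 * each node's bitmap occupying two 4k blocks.
 */
static uint64_t bm_super_offset(unsigned int slot, unsigned int blocks_per_bitmap)
{
	assert(blocks_per_bitmap > 0);
	return (uint64_t)BLOCK_SIZE *
	       (FIRST_BITMAP_BLOCK + (uint64_t)slot * blocks_per_bitmap);
}
```

With two blocks per bitmap, slot 0 lands at 8k and slot 1 at 16k, matching
the positions of "bm super [0]" and "bm super[1]" in the diagram.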

During "normal" functioning we assume the filesystem ensures that only one
node writes to any given block at a time, so a write request will:
 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.
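
The three steps of a write can be modelled in user space roughly as follows.
This is a simplified sketch, not the md implementation: struct wi_bitmap,
wi_start_write() and wi_expire() are hypothetical names, time is an abstract
counter, and the write to the mirrors itself is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NBITS 64

struct wi_bitmap {
	uint64_t dirty;            /* bit i set: chunk i may have a write in flight */
	uint64_t clear_at[NBITS];  /* time after which bit i may be cleared */
};

/*
 * Set the bit for a chunk and (re)schedule its clearing.
 * Returns true if the bit was newly set, i.e. the bitmap would have to
 * reach stable storage before the data write is issued to the mirrors.
 */
static bool wi_start_write(struct wi_bitmap *bm, unsigned int chunk,
			   uint64_t now, uint64_t timeout)
{
	assert(chunk < NBITS);
	bool newly_set = !(bm->dirty & (UINT64_C(1) << chunk));

	bm->dirty |= UINT64_C(1) << chunk;
	bm->clear_at[chunk] = now + timeout;
	return newly_set;
}

/* Periodic pass: clear every bit whose timeout has expired. */
static void wi_expire(struct wi_bitmap *bm, uint64_t now)
{
	for (unsigned int i = 0; i < NBITS; i++)
		if ((bm->dirty & (UINT64_C(1) << i)) && now >= bm->clear_at[i])
			bm->dirty &= ~(UINT64_C(1) << i);
}
```

A second write to the same chunk before the timeout only pushes the clear
time back, which is why the bit is set "if not already set".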

Reads are just handled normally. It is up to the filesystem to
ensure one node doesn't read from a location where another node (or the same
node) is writing.


2. DLM Locks for management

There are two locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)

The bm_lockres protects individual node bitmaps. They are named in the
form bitmap001 for node 1, bitmap002 for node 2 and so on. When a node
joins the cluster, it acquires the lock in PW mode and holds it for as
long as the node is part of the cluster. The lock resource
number is based on the slot number returned by the DLM subsystem. Since
DLM starts node count from one and bitmap slots start from zero, one is
subtracted from the DLM slot number to arrive at the bitmap slot number.
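
The naming and slot arithmetic described above can be sketched as below. The
function names are hypothetical, and the "bitmap%03d" format is inferred only
from the bitmap001/bitmap002 examples in the text; the in-kernel format string
may differ.

```c
#include <assert.h>
#include <stdio.h>

/* DLM slots count from one, bitmap slots from zero. */
static int bitmap_slot_from_dlm(int dlm_slot)
{
	assert(dlm_slot >= 1);
	return dlm_slot - 1;
}

/*
 * Format a bitmap lock resource name for a node number, following the
 * bitmap001, bitmap002, ... pattern. Returns what snprintf() returns.
 */
static int bitmap_lock_name(char *buf, size_t len, int node)
{
	return snprintf(buf, len, "bitmap%03d", node);
}
```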

3. Communication

Each node has to communicate with other nodes when starting or ending
resync, and when updating the metadata superblock.

3.1 Message Types

There are 3 types of messages which are passed:

3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
updated, and the node must re-read the md superblock. This is performed
synchronously.

3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
so that each node may suspend or resume the region.

3.2 Communication mechanism

The DLM LVB is used to communicate between nodes of the cluster. There
are three resources used for the purpose:

3.2.1 Token: The resource which protects the entire communication
system. The node having the token resource is allowed to
communicate.

3.2.2 Message: The lock resource which carries the data to
communicate.

3.2.3 Ack: The resource, acquiring which means the message has been
acknowledged by all nodes in the cluster. The BAST of the resource
is used to inform the receiving node that a node wants to communicate.

The algorithm is:

 1. receive status

    sender                         receiver                   receiver
    ACK:CR                         ACK:CR                     ACK:CR

 2. sender get EX of TOKEN
    sender get EX of MESSAGE

    sender                         receiver                   receiver
    TOKEN:EX                       ACK:CR                     ACK:CR
    MESSAGE:EX
    ACK:CR

Sender checks that it still needs to send a message. Messages received
or other events that happened while waiting for the TOKEN may have made
this message inappropriate or redundant.

 3. sender write LVB.
    sender down-convert MESSAGE from EX to CR
    sender try to get EX of ACK
    [ wait until all receivers have *processed* the MESSAGE ]

                                   [ triggered by bast of ACK ]
                                   receiver get CR of MESSAGE
                                   receiver read LVB
                                   receiver processes the message
                                   [ wait finish ]
                                   receiver release ACK

    sender                         receiver                   receiver
    TOKEN:EX                       MESSAGE:CR                 MESSAGE:CR
    MESSAGE:CR
    ACK:EX

 4. triggered by grant of EX on ACK (indicating all receivers have processed
    message)
    sender down-convert ACK from EX to CR
    sender release MESSAGE
    sender release TOKEN
                                   receiver upconvert to EX of MESSAGE
                                   receiver get CR of ACK
                                   receiver release MESSAGE

    sender                         receiver                   receiver
    ACK:CR                         ACK:CR                     ACK:CR
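
The four steps can be traced with a small single-threaded model of the three
lock resources. This is a sketch under stated assumptions, not the md-cluster
code: struct cluster, try_lock() and send_message() are hypothetical, only
the NL, CR and EX modes are modelled (EX conflicts with everything but NL),
and a refused request simply returns false where a real DLM would queue it
and deliver a BAST. The receivers' up-convert of MESSAGE to EX is a
wait-for-sender step and is left out of the model.

```c
#include <assert.h>
#include <stdbool.h>

enum mode { NL, CR, EX };                 /* NL: lock not held */
enum res  { TOKEN, MESSAGE, ACK, NRES };
#define NODES 3

struct cluster {
	enum mode held[NRES][NODES];
	int lvb;                          /* value block carried by MESSAGE */
};

/* Grant the requested mode unless another node holds a conflicting one. */
static bool try_lock(struct cluster *c, enum res r, int node, enum mode want)
{
	for (int n = 0; n < NODES; n++) {
		if (n == node || want == NL || c->held[r][n] == NL)
			continue;
		if (want == EX || c->held[r][n] == EX)
			return false;
	}
	c->held[r][node] = want;
	return true;
}

static bool send_message(struct cluster *c, int sender, int payload,
			 int received[NODES])
{
	/* Step 2: take TOKEN and MESSAGE in EX. */
	if (!try_lock(c, TOKEN, sender, EX) || !try_lock(c, MESSAGE, sender, EX))
		return false;

	/* Step 3: write the LVB, down-convert MESSAGE, ask for EX on ACK. */
	c->lvb = payload;
	try_lock(c, MESSAGE, sender, CR);
	bool blocked = !try_lock(c, ACK, sender, EX);
	assert(blocked);                  /* receivers still hold ACK in CR */

	for (int n = 0; n < NODES; n++) {
		if (n == sender)
			continue;
		try_lock(c, MESSAGE, n, CR);  /* receiver reads the LVB */
		received[n] = c->lvb;
		try_lock(c, ACK, n, NL);      /* releasing ACK acknowledges it */
	}
	if (!try_lock(c, ACK, sender, EX))    /* granted: all have processed */
		return false;

	/* Step 4: everyone returns to the idle state (ACK held in CR). */
	try_lock(c, ACK, sender, CR);
	try_lock(c, MESSAGE, sender, NL);
	try_lock(c, TOKEN, sender, NL);
	for (int n = 0; n < NODES; n++) {
		if (n == sender)
			continue;
		try_lock(c, ACK, n, CR);
		try_lock(c, MESSAGE, n, NL);
	}
	return true;
}
```

Starting from the state in step 1 (every node holding ACK in CR), one call to
send_message() delivers the payload to both receivers and leaves the locks in
the same state it started from.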


4. Handling Failures

4.1 Node Failure
When a node fails, the DLM informs the cluster of the failed node's slot.
The node starts a cluster recovery thread. The cluster recovery thread:
	- acquires the bitmap<number> lock of the failed node
	- opens the bitmap
	- reads the bitmap of the failed node
	- copies the set bitmap to local node
	- clears the bitmap of the failed node
	- releases bitmap<number> lock of the failed node
	- initiates resync of the bitmap on the current node
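
The copy-and-clear part of the recovery thread amounts to merging one bitmap
into another. The sketch below shows only that step on in-memory words;
recover_bitmap() is a hypothetical helper, and taking and releasing the failed
node's bitmap lock, as well as the on-disk I/O, are assumed to happen around
the call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Merge the failed node's set bits into the local bitmap and clear the
 * failed node's bitmap. Returns the number of bits that were set in the
 * failed node's bitmap, i.e. how many regions it may have left dirty.
 */
static unsigned int recover_bitmap(uint64_t *local, uint64_t *failed,
				   size_t words)
{
	unsigned int moved = 0;

	assert(local && failed);
	for (size_t i = 0; i < words; i++) {
		uint64_t w = failed[i];

		while (w) {               /* count set bits in this word */
			moved++;
			w &= w - 1;
		}
		local[i] |= failed[i];    /* copy set bits to the local node */
		failed[i] = 0;            /* clear the failed node's bitmap */
	}
	return moved;
}
```

After the merge, a resync driven by the local bitmap also covers every region
the failed node might have been writing.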

The resync process is the regular md resync. However, in a clustered
environment when a resync is performed, it needs to tell other nodes
of the areas which are suspended. Before a resync starts, the node
sends out RESYNC_START with the (lo,hi) range of the area which needs
to be suspended. Each node maintains a suspend_list, which contains
the list of ranges which are currently suspended. On receiving
RESYNC_START, the node adds the range to the suspend_list. Similarly,
when the node performing resync finishes, it sends RESYNC_FINISHED
to other nodes and other nodes remove the corresponding entry from
the suspend_list.

A helper function, should_suspend(), can be used to check if a particular
I/O range should be suspended or not.
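
A minimal user-space sketch of the suspend_list bookkeeping follows. The name
should_suspend() comes from the text, but its signature here is assumed, as
are the fixed-size array, the on_resync_start() and on_resync_finished()
handlers, and the treatment of (lo,hi) as a half-open range [lo, hi).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SUSPENDED 16

struct suspend_range {
	uint64_t lo, hi;   /* suspended area, treated as [lo, hi) */
	bool used;
};

struct suspend_list {
	struct suspend_range r[MAX_SUSPENDED];
};

/* RESYNC_START received: remember the range. Returns false if full. */
static bool on_resync_start(struct suspend_list *l, uint64_t lo, uint64_t hi)
{
	assert(lo <= hi);
	for (int i = 0; i < MAX_SUSPENDED; i++) {
		if (!l->r[i].used) {
			l->r[i] = (struct suspend_range){ lo, hi, true };
			return true;
		}
	}
	return false;
}

/* RESYNC_FINISHED received: drop the matching entry, if any. */
static void on_resync_finished(struct suspend_list *l, uint64_t lo, uint64_t hi)
{
	for (int i = 0; i < MAX_SUSPENDED; i++) {
		if (l->r[i].used && l->r[i].lo == lo && l->r[i].hi == hi) {
			l->r[i].used = false;
			return;
		}
	}
}

/* Does the I/O range [lo, hi) overlap any suspended range? */
static bool should_suspend(const struct suspend_list *l, uint64_t lo, uint64_t hi)
{
	for (int i = 0; i < MAX_SUSPENDED; i++)
		if (l->r[i].used && lo < l->r[i].hi && l->r[i].lo < hi)
			return true;
	return false;
}
```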

4.2 Device Failure
Device failures are handled and communicated with the metadata update
routine.

5. Adding a new Device
For adding a new device, it is necessary that all nodes "see" the new device
to be added. For this, the following algorithm is used:

    1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
    2. Node 1 sends NEWDISK with uuid and slot number
    3. Other nodes issue kobject_uevent_env with uuid and slot number
       (Steps 4,5 could be a udev rule)
    4. In userspace, the node searches for the disk, perhaps
       using blkid -t SUB_UUID=""
    5. Other nodes issue either of the following depending on whether the disk
       was found:
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
             disc.number set to slot number)
       ioctl(CLUSTERED_DISK_NACK)
    6. Other nodes drop lock on no-new-devs (CR) if device is found
    7. Node 1 attempts EX lock on no-new-devs
    8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
       as SpareLocal
    9. If node 1 does not get the no-new-devs lock, it fails the operation and
       sends METADATA_UPDATED
    10. Other nodes get the information whether a disk is added or not
        by the following METADATA_UPDATED.
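
Steps 6 to 9 reduce to a simple rule: node 1 can only obtain EX on no-new-devs
once every other node has dropped its CR lock, which each node does only if it
found the disk. The sketch below captures that decision; new_disk_accepted()
and its found[] argument are hypothetical and stand in for the real lock
traffic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * found[i] says whether other node i located the new disk (step 4).
 * A node that did not find it keeps its CR lock on no-new-devs, so the
 * EX request from node 1 (step 7) cannot be granted.
 */
static bool new_disk_accepted(const bool *found, size_t n_other_nodes)
{
	size_t cr_holders = 0;

	assert(found || n_other_nodes == 0);
	for (size_t i = 0; i < n_other_nodes; i++)
		if (!found[i])
			cr_holders++;     /* this node answered with a NACK */

	return cr_holders == 0;           /* EX granted only if no CR remains */
}
```

A single node that cannot see the device is therefore enough to make the add
operation fail on node 1.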
19 changes: 16 additions & 3 deletions crypto/async_tx/async_pq.c
@@ -124,6 +124,7 @@ do_sync_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
 {
 	void **srcs;
 	int i;
+	int start = -1, stop = disks - 3;
 
 	if (submit->scribble)
 		srcs = submit->scribble;
@@ -134,10 +135,21 @@ do_sync_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
 		if (blocks[i] == NULL) {
 			BUG_ON(i > disks - 3); /* P or Q can't be zero */
 			srcs[i] = (void*)raid6_empty_zero_page;
-		} else
+		} else {
 			srcs[i] = page_address(blocks[i]) + offset;
+			if (i < disks - 2) {
+				stop = i;
+				if (start == -1)
+					start = i;
+			}
+		}
 	}
-	raid6_call.gen_syndrome(disks, len, srcs);
+	if (submit->flags & ASYNC_TX_PQ_XOR_DST) {
+		BUG_ON(!raid6_call.xor_syndrome);
+		if (start >= 0)
+			raid6_call.xor_syndrome(disks, start, stop, len, srcs);
+	} else
+		raid6_call.gen_syndrome(disks, len, srcs);
 	async_tx_sync_epilog(submit);
 }
 
@@ -178,7 +190,8 @@ async_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
 	if (device)
 		unmap = dmaengine_get_unmap_data(device->dev, disks, GFP_NOIO);
 
-	if (unmap &&
+	/* XORing P/Q is only implemented in software */
+	if (unmap && !(submit->flags & ASYNC_TX_PQ_XOR_DST) &&
 	    (src_cnt <= dma_maxpq(device, 0) ||
 	     dma_maxpq(device, DMA_PREP_CONTINUE) > 0) &&
 	    is_dma_pq_aligned(device, offset, 0, len)) {
16 changes: 16 additions & 0 deletions drivers/md/Kconfig
@@ -175,6 +175,22 @@ config MD_FAULTY
 
 	  In unsure, say N.
 
+
+config MD_CLUSTER
+	tristate "Cluster Support for MD (EXPERIMENTAL)"
+	depends on BLK_DEV_MD
+	depends on DLM
+	default n
+	---help---
+	Clustering support for MD devices. This enables locking and
+	synchronization across multiple systems on the cluster, so all
+	nodes in the cluster can access the MD devices simultaneously.
+
+	This brings the redundancy (and uptime) of RAID levels across the
+	nodes of the cluster.
+
+	If unsure, say N.
+
 source "drivers/md/bcache/Kconfig"
 
 config BLK_DEV_DM_BUILTIN
1 change: 1 addition & 0 deletions drivers/md/Makefile
@@ -30,6 +30,7 @@ obj-$(CONFIG_MD_RAID10) += raid10.o
 obj-$(CONFIG_MD_RAID456) += raid456.o
 obj-$(CONFIG_MD_MULTIPATH) += multipath.o
 obj-$(CONFIG_MD_FAULTY) += faulty.o
+obj-$(CONFIG_MD_CLUSTER) += md-cluster.o
 obj-$(CONFIG_BCACHE) += bcache/
 obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
 obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o