=======================
= LVM RAID Design Doc =
=======================
#############################
# Chapter 1: User-Interface #
#############################
***************** CREATING A RAID DEVICE ******************
01: lvcreate --type <RAID type> \
02: [--regionsize <size>] \
03: [-i/--stripes <#>] [-I/--stripesize <size>] \
04: [-m/--mirrors <#>] \
05: [--[min|max]recoveryrate <kB/sec/disk>] \
06: [--stripecache <size>] \
07: [--writemostly <devices>] \
08: [--maxwritebehind <size>] \
09: [[no]sync] \
10: <Other normal args, like: -L 5G -n lv vg> \
11: [devices]
Line 01:
I don't intend for there to be shorthand options for specifying the
segment type. The available RAID types are:
"raid0" - Stripe [NOT IMPLEMENTED]
"raid1" - should replace DM Mirroring
"raid10" - striped mirrors, [NOT IMPLEMENTED]
"raid4" - RAID4
"raid5" - Same as "raid5_ls" (Same default as MD)
"raid5_la" - RAID5 Rotating parity 0 with data continuation
"raid5_ra" - RAID5 Rotating parity N with data continuation
"raid5_ls" - RAID5 Rotating parity 0 with data restart
"raid5_rs" - RAID5 Rotating parity N with data restart
"raid6" - Same as "raid6_zr"
"raid6_zr" - RAID6 Rotating parity 0 with data restart
"raid6_nr" - RAID6 Rotating parity N with data restart
"raid6_nc" - RAID6 Rotating parity N with data continuation
The exception to 'no shorthand options' will be where the RAID implementations
can displace traditional targets. This is the case with 'mirror' and 'raid1'.
In this case, "mirror_segtype_default" - found under the "global" section in
lvm.conf - can be set to "mirror" or "raid1". The segment type inferred when
the '-m' option is used will be taken from this setting. The default segment
types can be overridden on the command line by using the '--type' argument.
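For illustration only, a sketch of how that might look (the LV size and names
are placeholders):

  # lvm.conf
  global {
      mirror_segtype_default = "raid1"
  }

  # '-m 1' now creates a "raid1" LV by default...
  lvcreate -m 1 -L 5G -n lv vg
  # ...but the traditional implementation can still be requested explicitly:
  lvcreate --type mirror -m 1 -L 5G -n lv vg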
Line 02:
Region size is relevant for all RAID types. It defines the granularity at
which the bitmap tracks the active areas of the disk. The default is currently
4MiB. I see no reason to change this unless it is a problem for MD performance.
MD does impose a restriction of 2^21 regions for a given device, however. This
means two things: 1) we should never need a metadata area larger than
8kiB+sizeof(superblock)+bitmap_offset (IOW, pretty small) and 2) the region
size will have to be upwardly revised if the device is larger than 8TiB
(assuming defaults).
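For reference, the 8TiB figure follows directly from the defaults above:

  2^21 regions/device * 4MiB/region = 2^23 MiB = 8TiB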
Line 03/04:
The '-m/--mirrors' option is only relevant to RAID1 and will be used just like
it is today for DM mirroring. For all other RAID types, -i/--stripes and
-I/--stripesize are relevant. The former will specify the number of data
devices that will be used for striping. For example, if the user specifies
'--type raid0 -i 3', then 3 devices are needed. If the user specifies
'--type raid6 -i 3', then 5 devices are needed. The -I/--stripesize may be
confusing to MD users, as they use the term "chunksize". I think they will
adapt without issue and I don't wish to create a conflict with the term
"chunksize" that we use for snapshots.
Line 05/06/07:
I'm still not clear on how to specify these options. Some are easier than
others. '--writemostly' is particularly hard because it involves specifying
which devices shall be 'write-mostly' and thus also have 'max-write-behind'
applied to them. It has been suggested that a '--readmostly'/'--readfavored'
or similar option could be introduced as a way to specify a primary disk vs.
specifying all the non-primary disks via '--writemostly'. I like this idea,
but haven't come up with a good name yet. Thus, these will remain
unimplemented until future specification.
Line 09/10/11:
These are familiar.
Further creation related ideas:
Today, you can specify '--type mirror' without an '-m/--mirrors' argument.
The number of devices defaults to two (and the log defaults to 'disk'). A
similar thing should happen with the RAID types: all of them should default
to having two data devices unless otherwise specified. This would mean a
total of 2 devices for RAID 0/1, 3 devices for RAID 4/5, and 4 devices for
RAID 6/10.
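With those proposed defaults, omitting '-i/--stripes' and '-m/--mirrors'
entirely would behave something like this (sizes and names are placeholders):

  lvcreate --type raid1 -L 5G -n lv vg   # 2 devices (2 images)
  lvcreate --type raid5 -L 5G -n lv vg   # 3 devices (2 data + 1 parity)
  lvcreate --type raid6 -L 5G -n lv vg   # 4 devices (2 data + 2 parity)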
***************** CONVERTING A RAID DEVICE ******************
01: lvconvert [--type <RAID type>] \
02: [-R/--regionsize <size>] \
03: [-i/--stripes <#>] [-I/--stripesize <size>] \
04: [-m/--mirrors <#>] \
05: [--merge] \
06: [--splitmirrors <#> [--trackchanges]] \
07: [--replace <sub_lv|device>] \
08: [--[min|max]recoveryrate <kB/sec/disk>] \
09: [--stripecache <size>] \
10: [--writemostly <devices>] \
11: [--maxwritebehind <size>] \
12: vg/lv
13: [devices]
lvconvert should work exactly as it does now when dealing with mirrors -
even if (or when) we switch to MD RAID1. Of course, there are no plans to
allow the presence of the metadata area to be configurable (e.g. --corelog).
It will be simple enough to detect whether the LV being up/down-converted is
new or old-style mirroring.
If we choose to use MD RAID0 as well, it will be possible to change the
number of stripes and the stripesize. It is therefore conceivable to see
something like 'lvconvert -i +1 vg/lv'.
Line 01:
It is possible to change the RAID type of an LV - even if that LV is already
a RAID device of a different type. For example, you could change from
RAID4 to RAID5 or RAID5 to RAID6.
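A hypothetical invocation using the syntax proposed above:

  # Convert an existing RAID5 LV to RAID6; space for the additional
  # parity device would have to be allocated from the VG.
  lvconvert --type raid6 vg/lv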
Line 02/03/04:
These are familiar options - all of which would now be available for changing
an existing LV. (However, it'd be nice if we didn't have regionsize in there.
It's simple on the kernel side, but it is just an extra - often unnecessary -
parameter to many functions in the LVM codebase.)
Line 05:
This option is used to merge an LV back into a RAID1 array - provided it was
split for temporary read-only use by '--splitmirrors 1 --trackchanges'.
Line 06:
The '--splitmirrors <#>' argument should be familiar from the "mirror" segment
type. It allows RAID1 images to be split from the array to form a new LV.
Either the original LV or the split LV - or both - could become a linear LV as
a result. If the '--trackchanges' argument is specified in addition to
'--splitmirrors', an LV will be split from the array. It will be read-only.
This operation does not change the original array - except that it uses an empty
slot to hold the position of the split LV, which it expects to return in the
future (see the '--merge' argument). It tracks any changes that occur to the
array while the slot is kept in reserve. If the LV is merged back into the
array, only the changes are resync'ed to the returning image. Repeating the
'lvconvert' operation without the '--trackchanges' option will complete the
split of the LV permanently.
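A sketch of that round trip (the name used for the split image in the
'--merge' step is only a placeholder - this document does not fix how split
images are named):

  # Split one image off for temporary read-only use, reserving its slot:
  lvconvert --splitmirrors 1 --trackchanges vg/lv
  # ... read from the split, read-only LV ...
  # Merge it back; only the regions changed in the meantime are resync'ed:
  lvconvert --merge vg/lv_split_image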
Line 07:
This option allows the user to specify a sub_lv (e.g. a mirror image) or
a particular device for replacement. The device (or all the devices in
the sub_lv) will be removed and replaced with different devices from the
VG.
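For example (the device name is illustrative):

  # Remove /dev/sdb1 from the array and replace it with a different
  # device allocated from the VG:
  lvconvert --replace /dev/sdb1 vg/lv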
Line 08/09/10/11:
It should be possible to alter these parameters of a RAID device. As with
lvcreate, however, I'm not entirely certain how best to define some of these.
We don't need all the capabilities at once though, so it isn't a pressing
issue.
Line 12:
The LV to operate on.
Line 13:
Devices that are to be used to satisfy the conversion request. If the
operation removes devices or splits a mirror, then the devices specified
form the list of candidates for removal. If the operation adds or replaces
devices, then the devices specified form the list of candidates for allocation.
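For example (device names are illustrative):

  # When removing an image, the trailing devices are removal candidates:
  lvconvert --splitmirrors 1 vg/lv /dev/sdc1
  # When replacing a device, they are allocation candidates for the new image:
  lvconvert --replace /dev/sdb1 vg/lv /dev/sdd1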
###############################################
# Chapter 2: LVM RAID internal representation #
###############################################
The internal representation is somewhat like mirroring, but with alterations
for the different metadata components. LVM mirroring has a single log LV,
but RAID will have one for each data device. Because of this, I've added a
new 'areas' list to the 'struct lv_segment' - 'meta_areas'. There is exactly
a one-to-one relati