The Linux multipath implementation



Original author : Christophe Varoqui
Creation : Feb 2004
Last update : Dec 2010


Introduction


The most common multipathed environment today is a Fibre Channel (FC) Storage Area Network (SAN). These beasts can be found in most data centers. The Lego blocks forming a SAN are :


The term multipath simply means that a host can access an LU through multiple paths, a path being a route from one host HBA port to one storage controller port.

Examples :


The Linux kernel chooses not to mask the individual paths, which appear as normal SCSI disks (sd).

Multipath awareness and support for an operating system can be described as :


All these goals are met by leveraging a set of userspace tools and kernel subsystems :


The rest of this document describes these individual tools and subsystems and their interactions.

Device Mapper


Starting with Linux kernel 2.6, a new lightweight block subsystem named Device Mapper enables advanced storage management with style. This component features a pluggable design. At the time of this writing available plugins are :


This subsystem is the core component of the multipath tool chain. It is not included in the main kernel tree as of linux-2.6.10. It is part of a patchset created by Joe Thornber, and now maintained by Alasdair G Kergon (agk at redhat dot com) that can be downloaded at http://sources.redhat.com/dm/

This component fills the following requirements :


So, let's see how it works.

The Device Mapper is configured one map at a time. A device map, also referred to as a table, is a list of segments in the form of :


0 35258368 linear 8:48 65920
35258368 35258368 linear 8:32 65920
70516736 17694720 linear 8:16 17694976
88211456 17694720 linear 8:16 256


The first 2 parameters of each line are the segment starting block in the virtual device and the length of the segment. The next keyword is the target policy (linear). The rest of the line is the target parameters.
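To make the segment layout concrete, here is a minimal sketch that splits the example table into its fields and checks that the segments cover the virtual device without gaps. The names Segment, parse_table and is_contiguous are illustrative helpers, not part of libdevmapper or any other tool.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: int          # starting block of the segment in the virtual device
    length: int         # length of the segment
    target: str         # target policy, e.g. "linear"
    params: List[str]   # target parameters, left uninterpreted here


def parse_table(text: str) -> List[Segment]:
    """Split a device map table into segments, one per non-empty line."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        start, length, target, *params = fields
        segments.append(Segment(int(start), int(length), target, params))
    return segments


def is_contiguous(segments: List[Segment]) -> bool:
    """True if each segment starts exactly where the previous one ends."""
    expected = 0
    for seg in segments:
        if seg.start != expected:
            return False
        expected = seg.start + seg.length
    return True


# The example table from this document.
TABLE = """\
0 35258368 linear 8:48 65920
35258368 35258368 linear 8:32 65920
70516736 17694720 linear 8:16 17694976
88211456 17694720 linear 8:16 256
"""

segments = parse_table(TABLE)
```

For a linear segment, the two parameters are the underlying device (major:minor) and the offset on that device.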

The Device Mapper can be fed its tables through the use of a library : libdevmapper. EVMS2, dmsetup, LVM2, the multipath configuration tool and kpartx all link against this library. A table setup boils down to sprintf'ing the right segment definitions in a char *. Should the DM user-kernel interface change from being ioctl based to a pseudo filesystem, the libdevmapper API should remain stable.

Here is an example of a multipath target :
                             [----------- 1st path group -----------] [--------- 2nd path group -----------]
0 71014400 multipath 0 0 2 1 round-robin 0 2 1 66:128 1000 65:64 1000 round-robin 0 2 1 8:0 1000 67:192 1000
^     ^       ^      ^ ^ ^ ^      ^      ^ ^ ^   ^      ^
|     |       |      | | | |      |      | | |   |      nb of io to send to this path before switching
|     |       |      | | | |      |      | | |   path major:minor numbers 
|     |       |      | | | |      |      | | number of path arguments 
|     |       |      | | | |      |      | number of paths in this path group
|     |       |      | | | |      |      number of selector arguments
|     |       |      | | | |      path selector
|     |       |      | | | next path group to try
|     |       |      | | number of path groups
|     |       |      | number of hwhandler
|     |       |      number of features
|     |       target name
|     target length in 512-byte blocks
starting offset of the target

For completeness, here is an example of a pure failover target definition for the same LU :


0 71014400 multipath 0 0 4 1 round-robin 0 1 1 66:112 1000 round-robin 0 1 1 67:176 1000 round-robin 0 1 1 68:240 1000 round-robin 0 1 1 65:48 1000

And a full spread (multibus) target one :


0 71014400 multipath 0 0 1 1 round-robin 0 4 1 66:112 1000 67:176 1000 68:240 1000 65:48 1000
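The three example tables above differ only in how the paths are distributed over path groups. The sketch below assembles such a table line from a list of path groups, following the field order of the annotated example. The function name multipath_table and its parameters are hypothetical. The actual tools build this string in C before handing it to libdevmapper.

```python
def multipath_table(length, path_groups, selector="round-robin", nb_io=1000):
    """Return a one-line multipath table.

    length      -- target length in 512-byte blocks
    path_groups -- list of path groups, each a list of "major:minor" strings
    selector    -- path selector used for every group
    nb_io       -- number of I/Os sent to a path before switching
    """
    fields = [
        "0",                    # starting offset of the target
        str(length),            # target length
        "multipath",            # target name
        "0",                    # number of features
        "0",                    # number of hwhandler
        str(len(path_groups)),  # number of path groups
        "1",                    # next path group to try
    ]
    for group in path_groups:
        # selector, number of selector arguments, number of paths in the
        # group, number of path arguments
        fields += [selector, "0", str(len(group)), "1"]
        for path in group:
            fields += [path, str(nb_io)]
    return " ".join(fields)


PATHS = ["66:112", "67:176", "68:240", "65:48"]

# Pure failover: one path per path group.
failover = multipath_table(71014400, [[p] for p in PATHS])

# Multibus: all paths in a single path group.
multibus = multipath_table(71014400, [PATHS])
```

With these inputs, failover and multibus reproduce the two table lines shown above character for character.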

Upon device map creation, a new block kernel object named dm-[0-9]* is instantiated, and a hotplug call is triggered. Each device map can be assigned a symbolic name when created through libdevmapper, but this name won't be available anywhere but through a libdevmapper request.

hotplug subsystem and udev


Starting with Linux kernel 2.6, the hotplug callbacks are provided by the sysfs pseudo filesystem events. This filesystem presents to userspace kernel objects like bus, driver instances or block devices in a hierarchical and homogeneous manner. /sbin/hotplug is called upon file creation and deletion in the sysfs filesystem.
Udev acts as a proxy for sysfs events. The multipath tools collect these events from a Unix socket on which udev relays them.

For our needs this facility provides :


Since linux-2.6.4, and its integration of the transport class for sysfs, it can also provide callbacks upon FC transport events like a “Port Database Rescan”. These callbacks could now be used to trigger SCSI Bus Rescan to bring a fully dynamic storage layer. (Or am I wrong ?)

Here is how we use these callbacks for the multipath implementation :


Udev is a reimplementation in userspace of the devfs kernel facility. It provides a dynamic /dev space, with an agnostic naming policy. Greg Kroah-Hartman is the original developer of this package, and it is now maintained by Kay Sievers. It can be found at http://ftp.kernel.org/pub/linux/utils/kernel/hotplug/

To summarize, the implementation requirements these subsystems fulfill are :


multipath userspace config tool


This tool implements a stateless subset of the multipath daemon features. It can work without the daemon running. It handles path coalescing and device map creation.

Here is how it works :


There are currently 3 I/O routing policies implemented :


Policy assignment can be set manually on the command line. The following command sets the policy to multibus for the multipath containing the device with major 8 and minor 0 (/dev/sda) :


multipath -p multibus -D 8 0


These policies can optionally be stored in a config file (/etc/multipath.conf). If the file is present, its content overrides the in-code defaults. For the multipath tool to work, all multipath hardware you use must be described either in the config file, if you have one, or in the in-code defaults table if not.

The device maps naming policy is “name by LU WWID”, with a provision for defining per-LU aliases.
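The coalescing and naming steps can be pictured with a short sketch : paths reporting the same LU WWID are grouped into one multipath, and the resulting map is named after an alias when one is defined, or after the WWID otherwise. The function coalesce, the alias "data", the placeholder WWID "wwid-of-another-lu" and the device name sdx are assumptions made for illustration. The real tool obtains WWIDs from the hardware and aliases from its configuration.

```python
from collections import defaultdict


def coalesce(paths, aliases=None):
    """Group paths by LU WWID and name each resulting map.

    paths   -- iterable of (device name, WWID) pairs
    aliases -- optional dict mapping a WWID to a per-LU alias
    Returns a dict mapping each map name to its list of device names.
    """
    aliases = aliases or {}
    by_wwid = defaultdict(list)
    for dev, wwid in paths:
        by_wwid[wwid].append(dev)
    # Name by alias when one exists, by WWID otherwise.
    return {aliases.get(wwid, wwid): devs for wwid, devs in by_wwid.items()}


WWID = "3600a0b80000b5c9c0000044d3b667c19"
PATHS = [
    ("sdat", WWID),
    ("sdz", WWID),
    ("sdx", "wwid-of-another-lu"),  # hypothetical single-path LU
    ("sdbn", WWID),
    ("sdf", WWID),
]

maps = coalesce(PATHS, aliases={WWID: "data"})
```

Here maps holds one entry named "data" with four paths and one entry named after the placeholder WWID with a single path.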

To illustrate this synopsis, here is an example verbose output :

[root@cl039 multipath-tools-0.3.9]# multipath -v2 3600a0b80000b5c9c0000044d3b667c19
unchanged: 3600a0b80000b5c9c0000044d3b667c19
[size=34675 MB][features="0"][hwhandler="0"]
\_ round-robin 0 [first]
  \_ 3:0:0:5 sdat 66:208  [ready ][active]
  \_ 2:0:2:5 sdz  65:144  [ready ][active]
\_ round-robin 0
  \_ 3:0:2:5 sdbn 68:16   [ready ][active]
  \_ 2:0:0:5 sdf  8:80    [ready ][active]


The first section shows the list of all paths detected on the host. The second shows the multipath structs produced by the coalescing logic. The third shows the device maps submitted to the Device Mapper.
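The path lines of this output follow a regular layout : a SCSI address (host:channel:id:lun), the device name, the major:minor pair and two bracketed states. The sketch below extracts these fields with a regular expression. It is only an illustration of the layout shown in this document, not a supported interface, and the field names path_state and dm_state reflect an assumed reading of the two brackets.

```python
import re

# Output copied from the example in this document; the raw string keeps
# the backslashes intact.
OUTPUT = r"""unchanged: 3600a0b80000b5c9c0000044d3b667c19
[size=34675 MB][features="0"][hwhandler="0"]
\_ round-robin 0 [first]
  \_ 3:0:0:5 sdat 66:208  [ready ][active]
  \_ 2:0:2:5 sdz  65:144  [ready ][active]
\_ round-robin 0
  \_ 3:0:2:5 sdbn 68:16   [ready ][active]
  \_ 2:0:0:5 sdf  8:80    [ready ][active]
"""

# Matches path lines only; path group lines have no SCSI address.
PATH_RE = re.compile(
    r"\\_\s+(?P<hctl>\d+:\d+:\d+:\d+)\s+(?P<dev>\w+)\s+(?P<devt>\d+:\d+)\s+"
    r"\[(?P<path_state>\w+)\s*\]\[(?P<dm_state>\w+)\s*\]"
)


def parse_paths(text):
    """Return one dict per path line found in the given output."""
    paths = []
    for line in text.splitlines():
        match = PATH_RE.search(line)
        if match:
            paths.append(match.groupdict())
    return paths


paths = parse_paths(OUTPUT)
```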

Of interest is the creation of device maps for single-path LUs : this enables systems to operate normally when booted in a degraded SAN context. The missing paths will be added to the maps when they become available.

The implementation requirements filled by this tool are :


the multipathd daemon


This daemon can do everything the multipath command does and, additionally, is in charge of checking the paths in case they come up or go down. When this occurs, it reconfigures the multipath map the path belongs to, so that this map regains its maximum performance and redundancy.
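The reaction to a path state change can be sketched as a comparison between two successive checker results : any map owning a path whose state changed is a candidate for reconfiguration. This is a simplified illustration and does not describe the daemon's actual internals. The function name, the map names and the "faulty" state label are assumptions.

```python
def maps_to_reconfigure(previous, current, path_to_map):
    """Return the sorted names of maps owning a path whose state changed.

    previous, current -- dicts mapping a path name to its checked state
    path_to_map       -- dict mapping a path name to its multipath map
    """
    changed = [path for path, state in current.items()
               if previous.get(path) != state]
    return sorted({path_to_map[path] for path in changed})


# Hypothetical checker results for two consecutive passes.
previous = {"sdat": "ready", "sdz": "ready", "sdx": "ready"}
current = {"sdat": "ready", "sdz": "faulty", "sdx": "ready"}
path_to_map = {"sdat": "lun5", "sdz": "lun5", "sdx": "other"}

to_reload = maps_to_reconfigure(previous, current, path_to_map)
```

In this example only the map named lun5 is selected, since sdz is the only path whose state differs between the two passes.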

The implementation requirements filled by this daemon are :


kpartx userspace config tool


This tool, derived from util-linux' partx, reads the partition table of the specified device and creates device maps over the partition segments it detects. It is called from hotplug upon device map creation and deletion.

kpartx is part of the multipath-tools package.
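Each partition can be exposed as a one-segment linear map over the parent device, using the same table syntax as in the Device Mapper section. The sketch below derives such tables from a list of partition extents. The function name, the parent device numbers 253:0 and the partition extents are hypothetical values chosen for illustration, and reading the partition table itself is left out.

```python
def partition_tables(devt, partitions):
    """Return a dict mapping a partition number to its linear table line.

    devt       -- "major:minor" of the parent device
    partitions -- list of (start, length) pairs in 512-byte blocks
    """
    return {
        number: f"0 {length} linear {devt} {start}"
        for number, (start, length) in enumerate(partitions, start=1)
    }


# Hypothetical layout: two partitions on a device map with numbers 253:0.
tables = partition_tables("253:0", [(63, 208782), (208845, 70805555)])
```

Each resulting line maps block 0 of the partition device to the partition's starting offset on the parent device.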

Early userspace


Starting with Linux kernel 2.6, an early userspace execution environment is available in the form of initramfs. The grand plan is to package a set of tools in a cpio archive concatenated to the kernel. This archive is expanded into an in-memory filesystem early at boot, and the tools are called to take over logic that previously belonged in the kernel : dhcp requests and setups, nfsroot stuff ...

The multipath implementation toolbox fits in this early userspace definition. udev, multipath and kpartx are packaged with the initrd or initramfs to bring up the multipathed device early enough to boot on.

The last multipath implementation requirement is thus met.