The Linux multipath implementation

Original author : Christophe Varoqui
Creation : Feb 2004
Last update : Apr 2004

This document is shared under the OpenContent License (http://www.opencontent.org/opl.shtml)

Introduction

The most common multipathed environment today is a Fibre Channel (FC) Storage Area Network (SAN). These beasts can be found in most datacenters. The lego blocks forming a SAN are :

The term multipath simply means that a host can access a Logical Unit (LU) through multiple paths, a path being a route from one host HBA port to one storage controller port.

Examples :

  • A host with 2 HBAs attached to a single fabric is presented a LU by a 4-port storage controller. The host then sees 8 paths to the LU (2 HBA ports x 4 controller ports).

  • A host with 2 HBAs attached to a dual independent fabric (1 HBA on each fabric) is presented a LU by a 4-port storage controller (2 ports on each fabric). The host then sees 4 paths to the LU : 2 paths through fabric A, plus 2 through fabric B.
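The path counts in the examples above are simple arithmetic : inside one fabric each HBA port can reach each controller port, and independent fabrics do not see each other. Here is a minimal sketch of that computation, assuming no zoning restriction inside a fabric. The function name is illustrative and does not belong to any tool described here.

```python
def count_paths(fabrics):
    """Count the paths a host sees to one LU.

    `fabrics` is a list of (hba_ports, controller_ports) pairs, one pair
    per independent fabric. Assumption: inside a fabric, every HBA port
    can reach every controller port.
    """
    return sum(hba_ports * ctrl_ports for hba_ports, ctrl_ports in fabrics)


# Single fabric: 2 HBA ports and 4 controller ports.
single_fabric = count_paths([(2, 4)])

# Dual independent fabrics: 1 HBA port and 2 controller ports on each.
dual_fabric = count_paths([(1, 2), (1, 2)])

print(single_fabric, dual_fabric)
```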



The Linux kernel chooses not to mask the individual paths, which appear as normal SCSI disks (sd).

Multipath awareness and support for an operating system can be described as :

All these goals are met by leveraging a set of userspace tools and kernel subsystems :

The rest of this document describes these individual tools and subsystems and their interactions.

Device Mapper

Starting with Linux kernel 2.6, a new lightweight block subsystem named Device Mapper enables advanced storage management with style. This component features a pluggable design. At the time of this writing available plugins are :

This last policy is the core component of the multipath tool chain. It is not included in the main kernel tree as of linux-2.6.5. It is part of a patchset maintained by Joe Thornber (thornber at redhat dot com) that can be downloaded at http://people.sistina.com/~thornber/dm/

This component fulfills the following requirements :

So, let's see how it works.

The Device Mapper is configured one map at a time. A device map, also referred to as a table, is a list of segments in the form of :

0 35258368 linear 8:48 65920
35258368 35258368 linear 8:32 65920
70516736 17694720 linear 8:16 17694976
88211456 17694720 linear 8:16 256



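The first 2 parameters of each line are the starting block of the segment in the virtual device and the length of the segment, both counted in 512-byte sectors. The next keyword is the target policy (here, linear). The rest of the line holds the target parameters : for the linear target, the underlying device as major:minor and the starting offset on that device.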
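To make the segment layout concrete, here is a small sketch that parses the example table and checks that each segment starts where the previous one ends. The Segment type and the function names are illustrative only, they are not part of libdevmapper.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: int          # first block of the segment in the virtual device
    length: int         # length of the segment
    target: str         # target policy, e.g. "linear"
    params: List[str]   # target-specific parameters


def parse_table(text):
    """Parse a device map table into a list of Segment tuples."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) < 3:
            raise ValueError("malformed segment: %r" % line)
        segments.append(
            Segment(int(fields[0]), int(fields[1]), fields[2], fields[3:])
        )
    return segments


def is_contiguous(segments):
    """True if the segments start at 0 and leave no gap or overlap."""
    expected = 0
    for seg in segments:
        if seg.start != expected:
            return False
        expected = seg.start + seg.length
    return True


TABLE = """\
0 35258368 linear 8:48 65920
35258368 35258368 linear 8:32 65920
70516736 17694720 linear 8:16 17694976
88211456 17694720 linear 8:16 256
"""

segments = parse_table(TABLE)
total_length = sum(seg.length for seg in segments)
print(is_contiguous(segments), total_length)
```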

The Device Mapper is fed its tables through a library : libdevmapper. Dmsetup, LVM2, the multipath configuration tool and kpartx all link against this library. Setting up a table boils down to sprintf'ing the right segment definitions into a char *. Should the DM user-kernel interface change from being ioctl based to a pseudo filesystem, the libdevmapper API should remain stable.
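As an illustration of the "sprintf the segments" idea, the following sketch rebuilds the linear table shown earlier from a list of extents. It only formats text and does not call libdevmapper. The function name and the (device, offset, length) tuple layout are assumptions made for this example.

```python
def linear_table(extents):
    """Build a device map table made only of linear segments.

    `extents` is a list of (device, offset, length) tuples, where `device`
    is a "major:minor" string, `offset` the first block used on that
    device and `length` the segment length. Segments are laid out back to
    back in the virtual device, starting at block 0.
    """
    lines = []
    start = 0
    for device, offset, length in extents:
        # Same layout as the example: start length target device offset
        lines.append("%d %d linear %s %d" % (start, length, device, offset))
        start += length
    return "\n".join(lines)


table = linear_table([
    ("8:48", 65920, 35258368),
    ("8:32", 65920, 35258368),
    ("8:16", 17694976, 17694720),
    ("8:16", 256, 17694720),
])
print(table)
```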

Here is an example of a multipath target :

0 27262976 multipath 2 round-robin 2 0 /dev/sda /dev/sdk round-robin 2 0 /dev/sdc /dev/sdm



The multipath target parameters are :

For completeness, here is an example of a pure failover target definition for the same LU :

0 27262976 multipath 4 round-robin 1 0 /dev/sda round-robin 1 0 /dev/sdc round-robin 1 0 /dev/sdk round-robin 1 0 /dev/sdm

And a full spread (multibus) target one :

0 27262976 multipath 1 round-robin 4 0 /dev/sda /dev/sdc /dev/sdk /dev/sdm
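Reading the three examples above side by side, each definition appears to hold a group count followed, for each group, by a selector name, a path count, a third field that is always 0 here, and the paths themselves. The sketch below rebuilds the three lines under that reading. The field interpretation is inferred from the examples only, and the function names are illustrative.

```python
def multipath_table(length, groups, selector="round-robin"):
    """Format a one-segment multipath table line.

    `groups` is a list of path lists, one list per path group.
    Assumption: the third per-group field is a count of extra selector
    arguments, which is 0 in every example of this document.
    """
    fields = ["0", str(length), "multipath", str(len(groups))]
    for paths in groups:
        fields += [selector, str(len(paths)), "0"]
        fields += list(paths)
    return " ".join(fields)


def failover_groups(paths):
    """Pure failover: one group per path."""
    return [[path] for path in paths]


def multibus_groups(paths):
    """Full spread (multibus): all paths in a single group."""
    return [list(paths)]


PATHS = ["/dev/sda", "/dev/sdc", "/dev/sdk", "/dev/sdm"]

grouped = multipath_table(
    27262976, [["/dev/sda", "/dev/sdk"], ["/dev/sdc", "/dev/sdm"]]
)
failover = multipath_table(27262976, failover_groups(PATHS))
multibus = multipath_table(27262976, multibus_groups(PATHS))
print(grouped)
print(failover)
print(multibus)
```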



Upon device map creation, a new block kernel object named dm-[0-9]* is instantiated, and a hotplug call is triggered. Each device map can be assigned a symbolic name when created through libdevmapper, but this name won't be available anywhere but through a libdevmapper request.

hotplug subsystem and udev

Starting with Linux kernel 2.6, the hotplug callbacks are unified through a new pseudo filesystem : sysfs. This filesystem presents kernel objects such as buses, driver instances or block devices to userspace in a hierarchical and homogeneous manner. The hotplug subsystem is leveraged by triggering a /sbin/hotplug call upon file creation and deletion in the sysfs filesystem.
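A hotplug handler mostly has to decide what kind of object an event is about. The sketch below classifies a block event as concerning a path (an sd disk) or a device map (a dm-N object). It assumes the usual 2.6 hotplug conventions, namely the subsystem name passed as first argument and the ACTION and DEVPATH environment variables. The function and the returned labels are hypothetical and only serve the illustration.

```python
import re


def classify_hotplug_event(subsystem, env):
    """Return (kind, action, name) for block events of interest, else None.

    `subsystem` is the first argument given to the hotplug handler and
    `env` its environment. Assumption: ACTION is "add" or "remove" and
    DEVPATH is the sysfs path of the object, e.g. "/block/sda".
    """
    if subsystem != "block":
        return None
    action = env.get("ACTION")
    if action not in ("add", "remove"):
        return None
    name = env.get("DEVPATH", "").rsplit("/", 1)[-1]
    if re.fullmatch(r"dm-[0-9]+", name):
        kind = "devmap"   # a Device Mapper object
    elif re.fullmatch(r"sd[a-z]+", name):
        kind = "path"     # a whole SCSI disk, i.e. one path to a LU
    else:
        return None       # partitions and other block devices are ignored
    return (kind, action, name)


print(classify_hotplug_event("block", {"ACTION": "add", "DEVPATH": "/block/sda"}))
```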

For our needs this facility provides :

Since linux-2.6.4, and its integration of the transport class for sysfs, it can also provide callbacks upon FC transport events like a “Port Database Rescan”. These callbacks could now be used to trigger SCSI Bus Rescan to bring a fully dynamic storage layer. (Or am I wrong ?)

Here is how we use these callbacks for the multipath implementation :

Udev is a reimplementation in userspace of the devfs kernel facility. It provides a dynamic /dev space, with an agnostic naming policy. Greg Kroah-Hartman is the main developer and maintainer of this package. It can be found at http://ftp.kernel.org/pub/linux/utils/kernel/hotplug/

To summarize which implementation requirements these subsystems fulfill :

multipath userspace config tool

This tool is responsible for path coalescing and device map creation. As seen earlier, it is triggered by the hotplug calls on path additions and removals. It must deal with hardware specifics and abstract them for the other subsystems.

Here is how it works :

There are currently 4 spreading policies implemented :

The policy can be set manually at the command line. The following command sets the policy to multibus for the multipath containing the device with major 8 and minor 0 (/dev/sda) :

multipath -p multibus -D 8 0



Alternatively, default settings are defined in the multipath tool source code, as follows :

#define setup_default_hwtable struct hwentry defhwtable[] = { \
{"COMPAQ ", "HSV110 (C)COMPAQ", GROUP_BY_TUR, &get_evpd_wwid}, \
{"COMPAQ ", "MSA1000 ", GROUP_BY_TUR, &get_evpd_wwid}, \
{"COMPAQ ", "MSA1000 VOLUME ", GROUP_BY_TUR, &get_evpd_wwid}, \
{"DEC ", "HSG80 ", GROUP_BY_TUR, &get_evpd_wwid}, \
{"HP ", "HSV100 ", GROUP_BY_TUR, &get_evpd_wwid}, \
{"HP ", "A6189A ", MULTIBUS, &get_evpd_wwid}, \
{"HP ", "OPEN- ", MULTIBUS, &get_evpd_wwid}, \
{"DDN ", "SAN DataDirector", MULTIBUS, &get_evpd_wwid}, \
{"FSC ", "CentricStor ", MULTIBUS, &get_evpd_wwid}, \
{"HITACHI ", "DF400 ", MULTIBUS, &get_evpd_wwid}, \
{"HITACHI ", "DF500 ", MULTIBUS, &get_evpd_wwid}, \
{"HITACHI ", "DF600 ", MULTIBUS, &get_evpd_wwid}, \
{"IBM ", "ProFibre 4000R ", MULTIBUS, &get_evpd_wwid}, \
{"SGI ", "TP9100 ", MULTIBUS, &get_evpd_wwid}, \
{"SGI ", "TP9300 ", MULTIBUS, &get_evpd_wwid}, \
{"SGI ", "TP9400 ", MULTIBUS, &get_evpd_wwid}, \
{"SGI ", "TP9500 ", MULTIBUS, &get_evpd_wwid}, \
{"", "", 0, NULL}, \
};



These policies can optionally be stored in a config file (/etc/multipath.conf). If the file is present, its content overrides the in-code defaults. For the multipath tool to work, all multipath hardware you use must be described either in the config file if you have one, or in the code defaults table if not.

Default settings are applied per LU. The LU is characterized by its vendor_id / product_id tuple (columns 1 and 2, respectively). The third column sets the default policy, and the last one sets the default function used to fetch the unique identifier needed to coalesce the paths.
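The lookup described above can be sketched as follows. The table is a small subset of the in-code defaults, without the identifier function column. Exact matching on whitespace-stripped strings is an assumption of this sketch, and the real tool may compare vendor and product strings differently. A config table, when given, replaces the defaults entirely, as stated above.

```python
# Subset of the in-code defaults: (vendor_id, product_id, default policy).
DEFAULT_HWTABLE = [
    ("COMPAQ", "HSV110 (C)COMPAQ", "GROUP_BY_TUR"),
    ("DEC", "HSG80", "GROUP_BY_TUR"),
    ("HP", "A6189A", "MULTIBUS"),
    ("HITACHI", "DF600", "MULTIBUS"),
]


def lookup_policy(vendor, product, config_table=None):
    """Return the default policy for a LU, or None if it is not described.

    If `config_table` is given (even empty), it stands for the content of
    the config file and is used instead of the in-code defaults.
    """
    table = DEFAULT_HWTABLE if config_table is None else config_table
    vendor = vendor.strip()    # SCSI inquiry strings are space padded
    product = product.strip()
    for entry_vendor, entry_product, policy in table:
        if entry_vendor == vendor and entry_product == product:
            return policy
    return None


print(lookup_policy("COMPAQ  ", "HSV110 (C)COMPAQ"))
print(lookup_policy("DEC", "HSG80", config_table=[("DEC", "HSG80", "MULTIBUS")]))
```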

The device maps naming policy is “name by LU WWID”.

To illustrate this synopsis, here is an example verbose output :

xa-s03:~/udev-016/extras# multipath -v
600508b4000156d700012000000b0000 (0 0 1 1) /dev/sda [HSV110 (C)COMPAQ]
600508b4000156c30001200000210000 (0 0 1 2) /dev/sdb [HSV110 (C)COMPAQ]
600508b4000156d700012000000b0000 (0 0 2 1) /dev/sdc [HSV110 (C)COMPAQ]
600508b4000156c30001200000210000 (0 0 2 2) /dev/sdd [HSV110 (C)COMPAQ]
60001fe1000bdad0000903507109004b (0 0 3 1) /dev/sde [HSG80 ]
60001fe1000bdad000090371312100bf (0 0 3 2) /dev/sdf [HSG80 ]
60001fe1000bdad000090371312100c2 (0 0 3 3) /dev/sdg [HSG80 ]
60001fe1000bdad00009037131210067 (0 0 4 1) /dev/sdh [HSG80 ]
60001fe1000bdad000090371312100b3 (0 0 4 2) /dev/sdi [HSG80 ]
60001fe1000bdad00009035071090024 (0 0 4 3) /dev/sdj [HSG80 ]
600508b4000156d700012000000b0000 (1 0 1 1) /dev/sdk [HSV110 (C)COMPAQ]
600508b4000156c30001200000210000 (1 0 1 2) /dev/sdl [HSV110 (C)COMPAQ]
600508b4000156d700012000000b0000 (1 0 2 1) /dev/sdm [HSV110 (C)COMPAQ]
600508b4000156c30001200000210000 (1 0 2 2) /dev/sdn [HSV110 (C)COMPAQ]
600508b4000156d700012000000b0000
\_(0 0 1 1) /dev/sda [HSV110 (C)COMPAQ]
\_(0 0 2 1) /dev/sdc [HSV110 (C)COMPAQ]
\_(1 0 1 1) /dev/sdk [HSV110 (C)COMPAQ]
\_(1 0 2 1) /dev/sdm [HSV110 (C)COMPAQ]
600508b4000156c30001200000210000
\_(0 0 1 2) /dev/sdb [HSV110 (C)COMPAQ]
\_(0 0 2 2) /dev/sdd [HSV110 (C)COMPAQ]
\_(1 0 1 2) /dev/sdl [HSV110 (C)COMPAQ]
\_(1 0 2 2) /dev/sdn [HSV110 (C)COMPAQ]
60001fe1000bdad0000903507109004b
\_(0 0 3 1) /dev/sde [HSG80 ]
60001fe1000bdad000090371312100bf
\_(0 0 3 2) /dev/sdf [HSG80 ]
60001fe1000bdad000090371312100c2
\_(0 0 3 3) /dev/sdg [HSG80 ]
60001fe1000bdad00009037131210067
\_(0 0 4 1) /dev/sdh [HSG80 ]
60001fe1000bdad000090371312100b3
\_(0 0 4 2) /dev/sdi [HSG80 ]
60001fe1000bdad00009035071090024
\_(0 0 4 3) /dev/sdj [HSG80 ]
U:600508b4000156d700012000000b0000:0 27262976 multipath 1 round-robin 2 0 /dev/sda /dev/sdk 1 round-robin 2 0 /dev/sdc /dev/sdm
U:600508b4000156c30001200000210000:0 31457280 multipath 1 round-robin 2 0 /dev/sdb /dev/sdl 1 round-robin 2 0 /dev/sdd /dev/sdn
U:60001fe1000bdad0000903507109004b:0 106669167 multipath 1 round-robin 1 0 /dev/sde
U:60001fe1000bdad000090371312100bf:0 142229246 multipath 1 round-robin 1 0 /dev/sdf
U:60001fe1000bdad000090371312100c2:0 142229246 multipath 1 round-robin 1 0 /dev/sdg
U:60001fe1000bdad00009037131210067:0 213338334 multipath 1 round-robin 1 0 /dev/sdh
U:60001fe1000bdad000090371312100b3:0 213338334 multipath 1 round-robin 1 0 /dev/sdi
U:60001fe1000bdad00009035071090024:0 71114623 multipath 1 round-robin 1 0 /dev/sdj



The first section shows the list of all paths detected on the host. The second shows the multipath structs produced by the coalescing logic. The third shows the device maps submitted to the Device Mapper.
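The step from the first section to the second one is a grouping of paths by unique identifier. Here is a minimal sketch of that coalescing, fed with the path list of the output above. The function name and data layout are illustrative, not those of the multipath tool.

```python
HSV_LU1 = "600508b4000156d700012000000b0000"
HSV_LU2 = "600508b4000156c30001200000210000"

# (WWID, device node) pairs, in the order of the first output section.
PATHS = [
    (HSV_LU1, "/dev/sda"),
    (HSV_LU2, "/dev/sdb"),
    (HSV_LU1, "/dev/sdc"),
    (HSV_LU2, "/dev/sdd"),
    ("60001fe1000bdad0000903507109004b", "/dev/sde"),
    ("60001fe1000bdad000090371312100bf", "/dev/sdf"),
    ("60001fe1000bdad000090371312100c2", "/dev/sdg"),
    ("60001fe1000bdad00009037131210067", "/dev/sdh"),
    ("60001fe1000bdad000090371312100b3", "/dev/sdi"),
    ("60001fe1000bdad00009035071090024", "/dev/sdj"),
    (HSV_LU1, "/dev/sdk"),
    (HSV_LU2, "/dev/sdl"),
    (HSV_LU1, "/dev/sdm"),
    (HSV_LU2, "/dev/sdn"),
]


def coalesce(paths):
    """Group device nodes by WWID, keeping the discovery order."""
    multipaths = {}
    for wwid, device in paths:
        multipaths.setdefault(wwid, []).append(device)
    return multipaths


multipaths = coalesce(PATHS)
for wwid, devices in multipaths.items():
    print(wwid, " ".join(devices))
```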

Of interest is the creation of device maps for single-path LUs : this enables systems to operate normally when booted in a degraded SAN context. The missing paths will be added to the maps when they become available.

This tool is packaged with udev, in the extras/ section. The devmap_name tool and multipathd daemon are distributed in the same tree. Alternatively the multipath-tools package can be found at http://christophe.varoqui.free.fr/

The implementation requirements fulfilled by this tool are :

the multipathd daemon

This daemon is in charge of checking the failed paths in case they come back up. When this occurs, it reconfigures the multipath map the path belongs to, so that this map regains its maximum performance and redundancy.

This daemon executes the external multipath config tool when events occur. In turn, the multipath tool signals the multipathd daemon when it is done with the devmap reconfiguration, so that the daemon can refresh its failed path list.
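One checking pass of such a daemon can be pictured as below. This is only a sketch of the logic described above, not the multipathd code : the path test and the reconfiguration are passed in as callables, and both names are hypothetical stand-ins for the real path checker and for the execution of the multipath config tool.

```python
def recheck_failed_paths(failed_paths, path_is_up, reconfigure):
    """Run one checking pass over the failed paths.

    `path_is_up(path)` tells whether a failed path answers again and
    `reconfigure(path)` stands for running the multipath config tool.
    Returns the list of paths that are still failed.
    """
    still_failed = []
    for path in failed_paths:
        if path_is_up(path):
            reconfigure(path)          # the path came back: rebuild its map
        else:
            still_failed.append(path)  # keep it for the next pass
    return still_failed


# Example with fake callables: only /dev/sdk has come back.
reconfigured = []
remaining = recheck_failed_paths(
    ["/dev/sda", "/dev/sdk"],
    path_is_up=lambda path: path == "/dev/sdk",
    reconfigure=reconfigured.append,
)
print(remaining, reconfigured)
```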

A logical diagram can be found at http://christophe.varoqui.free.fr/

The implementation requirements fulfilled by this daemon are :

kpartx userspace config tool

This tool, derived from util-linux' partx, reads partition tables on the specified device and creates device maps over the partition segments it detects. It is called from hotplug upon device map creation and deletion.
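A device map over a partition is a single linear segment pointing into the parent device. The sketch below formats such tables from a list of partition extents. The extents and the 253:0 device number are made-up values for the illustration, and the function name is not part of kpartx.

```python
def partition_maps(device, partitions):
    """Return one linear table line per partition.

    `device` is the "major:minor" of the parent device and `partitions`
    a list of (start, length) extents read from its partition table.
    Each map starts at block 0 and points at the partition start.
    """
    return [
        "0 %d linear %s %d" % (length, device, start)
        for start, length in partitions
    ]


# Hypothetical partition table with two partitions.
tables = partition_maps("253:0", [(63, 1000), (1063, 2000)])
for line in tables:
    print(line)
```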

kpartx is part of the multipath-tools package distributed on http://christophe.varoqui.free.fr/

Early userspace

Starting with Linux kernel 2.6, an early userspace execution environment is available under the name initramfs. The grand plan is to package a set of tools in a cpio archive concatenated to the kernel. This archive is expanded into an in-memory filesystem early at boot, and the tools are called to take over logic that previously belonged in the kernel : dhcp requests and setups, nfsroot stuffing ...

Because the archive is concatenated to the kernel, its size matters a lot. A slim libc implementation is required, and is provided under the name klibc, maintained by Hans Peter Anvin.

The multipath implementation toolbox fits in this early userspace definition. Udev, multipath and kpartx are linked against klibc and can be packaged with the cpio archive to bring up the multipathed device early enough to boot on.

This meets the last multipath implementation requirement.

Quick installation guide

Tested environments

Last updated on Mon Feb 16 2004


Legend : Working / Not tested / Not working

  • HBA : 2 Qlogic 2200 (driver Qlogic 8beta10)
    Host HW : IA32
    Fabric topology : Simple fabric
    Fabric HW : Linked Brocade Silkworm 3200 & Brocade Silkworm 2200
    Storage controllers : HP StorageWorks HSG80 (multibus & failover configuration), HP StorageWorks HSV110 (multibus)
    Notes : Configure REPORT_LUN in the Linux kernel and Solaris as the personality in the HSV110 console so that the ghost paths are visible from the OS.

  • HBA : 2 Qlogic 2200 (driver Qlogic 8beta10)
    Host HW : IA32
    Fabric topology : Dual fabrics
    Fabric HW : Brocade Silkworm 3200
    Storage controllers : HP StorageWorks HSG80 (multibus & failover configuration), HP StorageWorks HSV110 (multibus)

  • HBA : 2 Qlogic 2300
    Host HW : Sparc64
    Fabric topology : Dual fabrics
    Fabric HW : Brocade Silkworm 3200
    Storage controllers : HP StorageWorks HSG80 (multibus & failover configuration), HP StorageWorks HSV110 (multibus)

  • HBA : 2 HP OEM FCA2354 (Emulex LP9002L)
    Host HW : Alpha
    Fabric topology : Dual fabrics
    Fabric HW : Brocade Silkworm 3200
    Storage controllers : HP StorageWorks HSG80 (multibus & failover configuration), HP StorageWorks HSV110 (multibus)
    Notes : The emulex driver in -mjb tree seems a bit rough at the edges. Will try in time.

  • HBA : 2 Qlogic 2300 (driver 8.00.00b11-k)
    Host HW : IA32
    Fabric topology : 2-tiers fabric
    Fabric HW : Brocade Silkworm 3800
    Storage controllers : 3PARdata inserv S800
    Notes : tester : Andy (genanr at emsphone dot com)