Linux MPIO, 2.6 Kernel
Michelle Butler, Technical Program Manager
Andy Loftus, System Engineer
Storage Enabling Technologies, NCSA
Who?
NCSA
A unit of the University of Illinois at Urbana-Champaign
A federal, state, university, and industry funded center
Academic Users
NSF peer review
Myrinet
Full bi-section
Power/Cooling
593 KW / 193 tons
LCI Conference 2007
Cisco InfiniBand
3 to 1 oversubscribed
OFED-1.1 w/ HPSM subnet manager
Lustre over IB
4 FasT controllers direct FC
1.2GB/s sustained
8 OSTs and 2 MDS w/complete auto failovers
Power/Cooling
148 KW / 42 tons
Cisco InfiniBand
2 to 1 oversubscribed
OFED-1.1 w/ HPSM subnet manager
Lustre over IB
22 OSTs
2 9500 DDN controllers direct FC
10 FasT controllers on SAN fabric
8.4GB/s sustained
22 OSTs and 2 MDS w/complete auto failovers
Power/Cooling
500 KW / 140 tons
Room 200:
7,000 sqft, no columns
70 raised floor
2.3 MW power capacity
750 tons cooling capacity
Databases:
8 processor 12GB memory SGI Altix
30TB of SAN storage
Oracle 10g, MySQL, Postgres
Visualization Resources
30M-pixel Tiled Display Wall
8192 x 3840 pixels composite display
40 NEC VT540 projectors, arranged in a 5H x 8W matrix
driven by 40-node Linux cluster
dual-processor 2.4GHz Intel Xeons with NVIDIA FX 5800 Ultra graphics accelerator cards
Myrinet interconnect
to be upgraded by early CY2007
funded by State of Illinois
SGI Prisms
8 x 8 processor (1.6 GHz Itanium2)
4 graphics pipes each; 1 GB RAM each
InfiniBand connection to Altix machines
National Center for Supercomputing Applications
SAN at NCSA
1.3PB spinning disk
895TB SAN attached
Persistent Binding
- Device naming problems
- Udev solution
- Examples
- Interactive Demo
Device node mapping can change with changes to:
- hardware
- software
- SAN

Devices are assigned arbitrary names (based on the next available major/minor pair for the device type).

CLUSTER
- Multiple hosts that see the same disk will assign the disk to different device nodes
- May be /dev/sda on system1 but /dev/sdc on system2
- Can change with hardware changes; what used to be /dev/sda is now /dev/sdc

Devfs helps only a little:
- Fixes device naming; on a single host, a disk will always have the same device node
- But different hosts may have different device names for the same physical disk
Devfs provides dynamic and persistent naming, but:
- kernel based
- entire device db stored in kernel memory, never swapped
- not possible to customize device names

UDEV CUSTOM
- custom names for devices
- custom scripts can be run when specific devices are attached/removed
scsi_id
Unique id
SCSI INQUIRY
Sample usage:
root# scsi_id -g -u -s /block/sda
SSEAGATE_ST318406LC_____3FE27FZP000073302G5W
root# scsi_id -g -u -s /block/sdb
3600a0b8000122c6d00000000453174fc
/sbin/scsi_id
- INPUT: existing local device name
- OUTPUT: string that uniquely identifies the specific device (guaranteed unique among all SCSI devices)

SAMPLE:
- sda: locally installed drive
- sdb: SAN attached disk
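To make the sample usage concrete, the loop below is a minimal sketch that prints the unique id of every sd* device on a host. The function name `list_scsi_ids` and the `SCSI_ID_CMD` override are assumptions made for illustration; the default invocation is the one from the sample usage above, and newer udev releases may expect different options.

```shell
# Print "<device> <unique id>" for every sd* block device.
# $1 = sysfs root (defaults to /sys).
# SCSI_ID_CMD (hypothetical override) replaces the default scsi_id call.
list_scsi_ids() {
    sysfs_root="${1:-/sys}"
    for dev in "$sysfs_root"/block/sd*; do
        [ -e "$dev" ] || continue          # no sd* devices present
        name=$(basename "$dev")
        # -g: treat the device as whitelisted, -u: replace whitespace in the id,
        # -s: device path relative to the sysfs root
        id=$(${SCSI_ID_CMD:-/sbin/scsi_id -g -u -s} "/block/$name" 2>/dev/null) \
            || id="(no id)"
        printf '%s %s\n' "$name" "$id"
    done
}
```

Running `list_scsi_ids` on each host that sees the same SAN lun should show the same id next to whatever local device name each host happened to assign.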
BUS=scsi: /sys/bus/scsi
SYSFS: <BUS>/devices/H:B:T:L/<filename>
NAME: Device name to create (relative to /dev)
Custom naming is controlled by rulesets stored in /etc/udev/rules.d. A rule is a list of keys to match against. When all keys match, the specified action is taken (create a device name or symlink).
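As an illustration of what such a rule can look like, the sketch below writes a one-rule file to a path of your choosing. It is not the rules file used in the talk: the function name `write_fc_rule`, the helper path /usr/local/sbin/mpio_scsi_id, and the `%k` argument are assumptions, and the key syntax shown (single `=` for matching) follows older udev releases, while newer releases use `==` for match keys.

```shell
# Write an example udev rules file to the path given as $1.
# Writing into /etc/udev/rules.d requires root; any path works for a dry run.
write_fc_rule() {
    rules_file="$1"
    cat > "$rules_file" <<'EOF'
# Match SCSI disks, run a helper that prints a persistent name, and
# create /dev/disk/fc/<name> as a symlink to the kernel device node.
# %k = kernel device name, %c = output of PROGRAM.
BUS="scsi", KERNEL="sd*", PROGRAM="/usr/local/sbin/mpio_scsi_id %k", SYMLINK="disk/fc/%c"
EOF
}
```

A symlink is used here so the kernel name under /dev stays available; replacing `SYMLINK` with `NAME` would rename the node itself.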
Disk Ctlr
udev
WWPN + scsi_id
mpio_scsi_id
Get disk controller WWPN:
- (Emulex) /sys/class/fc_transport/target<H>:<B>:<T>/port_name
- (QLA) grep + awk to pull value from /proc/scsi/ql2xxx/<host_id>
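The Emulex case can be sketched as follows. This is not the mpio_scsi_id script from the demo, only a minimal outline of the idea of combining the controller WWPN with the scsi_id; the function names, the `-` separator, and the optional sysfs-root argument are assumptions.

```shell
# Print the WWPN of an FC target as exposed in sysfs (Emulex layout).
# $1 = H:B:T (host:bus:target), $2 = sysfs root (defaults to /sys)
target_wwpn() {
    hbt="$1"
    sysfs_root="${2:-/sys}"
    cat "$sysfs_root/class/fc_transport/target$hbt/port_name"
}

# Combine the controller WWPN and the scsi_id into one name, so the
# same lun seen through two controllers gets two distinct names.
# $1 = H:B:T, $2 = unique id from scsi_id, $3 = sysfs root (optional)
mpio_name() {
    wwpn=$(target_wwpn "$1" "${3:-/sys}") || return 1
    printf '%s-%s\n' "$wwpn" "$2"
}
```

For example, `mpio_name 2:0:0 "$(scsi_id -g -u -s /block/sdb)"` would print the WWPN of target 2:0:0 followed by the id of sdb.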
Restart udev
udevstart
Scan fc luns
{sysfs}/hostX/scan /dev/disk/by-id
BEGIN - tail -f /var/log/messages
1. Enable udev logging
2. Enable scsi_id for all devices (option -g)
3. /proc/partitions
4. Scan fc luns (echo - - - > /sys/class/scsi_host/hostX/scan)
5. See udev log lines in messages file; see fc disks in /dev/disk/by-id
6. Enable 20-local rules file
7. udevstart
8. See udev log lines in messages file; see fc disks in /dev/disk/fc
DEFAULT CONFIGURATION
- Local rules file already exists. Disable it.
- Default behavior for scsi_id is to blacklist everything unknown (-b option). Enable whitelisting of everything (-g option) so scsi_ids will be returned.
- Even before custom rules are in place, see default udev rule selection activity in /var/log/messages
- After running delete_fc_luns, udev removes /dev/sdX device files (/var/log/messages)

CUSTOM CONFIGURATION
- Udev custom rules are selected (see /var/log/messages)
- Major/minor numbers line up for /dev/disk/fc/* and /proc/partitions
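The lun scan step of the demo writes "- - -" (all channels, all targets, all luns) to a host's scan file. A minimal sketch that repeats this for every SCSI host is shown below; it is not the scan_fc_luns tool from the demo, and the function name and the optional sysfs-root argument are assumptions.

```shell
# Ask every SCSI host to rescan all channels, targets, and luns.
# $1 = sysfs root (defaults to /sys); needs root on a live system.
scan_all_hosts() {
    sysfs_root="${1:-/sys}"
    for scan in "$sysfs_root"/class/scsi_host/host*/scan; do
        [ -e "$scan" ] || continue     # no SCSI hosts found
        echo "- - -" > "$scan"
    done
}
```

Watching /var/log/messages while this runs should show the udev activity described in the demo steps.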
Examples
udevinfo -a -p $(udevinfo -q path -n /dev/sdb)
udevtest /block/sdb
Example: multiple paths on Nadir
- If luns are removed (delete_fc_luns)
- Then added (scan_fc_luns)
- No matches are found in 20-local.rules
- Add syslog output to mpio_scsi_id
  + Shows params the script is called with
  + Shows what the script returns
  + target_wwpn is not getting set
- Run udevstart (luns already attached now), matches found in 20-local.rules and device files created

Probably either a driver or udev issue. Easiest solution is to run scan_luns and udevstart at system boot time (/etc/rc.d/rc.local).
http://www.reactivated.net/udevrules.php
How to write udev rules
http://www.us.kernel.org/pub/linux/utils/kernel/hotplug/udev.html
Information and links
http://dims.ncsa.uiuc.edu/set/san
FC tools: custom tools used in demo
STORAGE VENDOR
- End to end solution (they provide disk, HBA, driver, additional software, sometimes even FC switch)
- HBAs (and other parts) come at a markup
- One location for support tickets, but no alternate recourse if they can't fix the problem
- Proprietary requirements (typically require 2 HBAs, only works with their systems)

HBA VENDOR
- QLA
  > Linux support spotty
    + 2.4 kernel OK, but strict requirements (2 HBAs, exactly 2 paths per lun, active/active controllers)
    + 2.6 kernel inconsistent behavior
  > Solaris support spotty (2 months to get 1 machine working, next month stops working, machine was untouched)
  > Dropped Windows support prematurely (Windows MPIO layer not complete yet, only an API for vendors)
  > Proprietary solution, only works with their HBAs and configuration software
- Emulex (Unix philosophy: do one thing and do it well; MPIO doesn't belong in the driver)

FILESYSTEM
- 3rd party - Veritas, others??
- Parallel filesystems - Ibrix, Lustre, GPFS, CXFS (enable MPIO via failover hosts)

OS
- *NEW* Solaris 10 (XPATH, but requires Solaris branded QLA cards)
- *NEW* Linux (device mapper multipath) (RedHat4, Suse, others)
path_grouping_policy
- multibus
- failover
- group_by_prio
- group_by_serial
- group_by_node
Multipath control creates priority groups. Paths are grouped based on path_grouping_policy.

MULTIBUS
- All paths in one priority group
- (DDN) (no penalty to access luns via alternate controllers)

FAILOVER
- One path per priority group (use only 1 path at a time)
- (typically only 1 usable path, such as IBM fastt with AVT disabled)

GROUP_BY_PRIO
- Paths with same priority in same priority group, 1 group for each unique priority
- (Priorities assigned by external program)

GROUP_BY_SERIAL
- Paths grouped by scsi target serial (controller node WWN)

GROUP_BY_NODE
- (I have not tested or researched this, never had a need to)
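The group_by_prio behavior can be illustrated outside of multipath with a few lines of shell. This is only a sketch of the grouping rule described above (one group per unique priority, highest first), not how multipath implements it; the function name and the "device priority" input format are assumptions.

```shell
# Read "device priority" pairs on stdin (the paths to ONE lun) and print
# one line per priority group, highest priority first.
group_by_prio() {
    sort -k2,2nr -k1,1 | awk '
        NR == 1 || $2 != prev {          # a new priority starts a new group
            if (NR > 1) printf "\n"
            printf "%s:", $2
            prev = $2
        }
        { printf " %s", $1 }             # append the path to its group
        END { if (NR > 0) printf "\n" }'
}
```

With input `sdb 50` and `sdi 5`, the first line printed is the group holding sdb, which would be the active group; sdi sits alone in the lower-priority group and is used only when every path in the first group has failed.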
prio_callout
3rd party pgm to assign priority values to each path
multipath
Integer value Device name
prio_callout
Only matters if using group_by_prio grouping policy

DIRECTLY CONTROLS PRIORITY GROUP SELECTION
- Priority group with highest value is active group

PREVIOUS SLIDE
- When all paths in a group are failed, next group becomes active. That would be the priority group with the next highest priority value that has an active path.

PRIO_CALLOUT
- Provided by vendor or (more typically) custom script written by admin for specific setup
- If not using group_by_prio, then set this to /bin/true
no_path_retry
- queue
- (N > 0)
- fail
TUR
- SCSI Test Unit Ready
- Preferred if lun supports it (OK on DDN, IBM fastt)
- Does not cause AVT on IBM fastt
- Does not fill up /var/log/messages on failures

READSECTOR0
- physical lun access via /dev/sdX (IS THIS CORRECT???)

DIRECTIO
- physical lun access via /dev/sgY (IS THIS CORRECT???)

Both readsector0 and directio cause AVT on IBM fastt, resulting in lun thrashing.
Both readsector0 and directio log fail messages in /var/log/messages (could be useful if you want to monitor logs for these events).

NO_PATH_RETRY
- # of retries before failing path
- queue: queue I/O forever
- (N > 0): queue I/O for N retries, then fail
- fail: fail immediately
FAILBACK
- When a path recovers, wait # seconds before enabling the path
- Recovered path is added back into multipath enabled path list
- multipath re-evaluates priority groups, changes active priority group if needed

MANUAL RECOVERY
- User runs /sbin/multipath to update enabled paths and priority groups
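The settings discussed so far can be pulled together in one place. The sketch below writes an example defaults section to a file of your choosing; it is not the configuration used in the demos. The function name, the numeric values (5 retries, 10 second failback), and the `%n` argument passed to the callout are assumptions chosen for illustration.

```shell
# Write an example multipath defaults section to the file given as $1.
# Review before copying anything into /etc/multipath.conf.
write_multipath_defaults() {
    cat > "$1" <<'EOF'
defaults {
    # one priority group per unique priority value
    path_grouping_policy  group_by_prio
    # script that prints an integer priority for a path (%n = device name)
    prio_callout          "/usr/local/sbin/path_prio.sh %n"
    # SCSI Test Unit Ready path checker
    path_checker          tur
    # queue I/O for 5 retries, then fail
    no_path_retry         5
    # wait 10 seconds after a path recovers before failing back
    failback              10
}
EOF
}
```

For an array with no penalty on either controller, the grouping policy would instead be multibus and the callout /bin/true, as noted for the DDN.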
multipath
50
path_prio.sh
Primary-paths
/usr/local/etc/primary-paths
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:0 sdb 3600a0b8000122c6d00000000453174fc 50
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:1 sdc 3600a0b80000fd6320000000045317563 2
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:2 sdd 3600a0b8000122c6d0000000345317524 50
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:3 sde 3600a0b80000fd6320000000245317593 2
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:0 sdi 3600a0b8000122c6d00000000453174fc 5
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:1 sdj 3600a0b80000fd6320000000045317563 51
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:2 sdk 3600a0b8000122c6d0000000345317524 5
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:3 sdl 3600a0b80000fd6320000000245317593 51
PATH_PRIO.SH
- grep device from primary-paths file
- return value from last column
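The lookup described in these notes fits in a couple of lines. The function below is a minimal sketch of that idea, not the path_prio.sh from the demo; the function name and the choice to take the file path as a second argument are assumptions.

```shell
# Print the priority (last column) for a device listed in a
# primary-paths style file.
# $1 = device name (e.g. sdb), $2 = path to the primary-paths file
path_prio() {
    # -w matches the device name as a whole word, so "sdb" does not
    # also match a hypothetical "sdbc"; only the first match is used.
    grep -w -- "$1" "$2" | awk '{ print $NF; exit }'
}
```

Against the primary-paths rows shown above, `path_prio sdb /usr/local/etc/primary-paths` prints 50 and `path_prio sdi /usr/local/etc/primary-paths` prints 5, so the two paths to that lun land in different priority groups.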
Disk
IBM DS4500
Luns presented through both controllers
Luns accessible via 1 controller only at a time
AVT enabled
AVT
- Lun will migrate to alternate controller if requested there
- Tolerance of cable/switch failure
- AVT penalty - lun inaccessible for 5-10 secs while controller ownership changing

SCREENS: /var/log/messages, multi-port-mon, command, script-host

1. No luns (ls_fc_luns)
2. /etc/multipath.conf
   1. Multipaths (fastt)
   2. Devices (fastt)
   3. Identify controller A, controller B
3. /usr/local/sbin/path_prio.sh
4. /usr/local/etc/primary-paths
5. Add luns (scan_fc_luns)
   1. See multipath bindings & path_prio.sh output in /var/log/messages
6. View current multipath configuration
   1. multipath -v2 -l
7. Failover test
   1. Script-host: disable disk port A
   2. See multipathd reconfig in /var/log/messages
   3. See I/O path change in multi-port-mon
8. Recover test
   1. Script-host: enable disk port A
Disk
DDN 8500
Luns accessible via both controllers (no penalty)
SCREENS: multi-port-mon, /var/log/messages, command, script-host

1. /etc/multipath.conf
   1. Devices (DDN) (path_prio = /bin/true ; path_grouping_policy = multibus)
   2. Multipath (DDN)
2. Luns present? (ls_fc_luns)
3. Add luns if needed (scan_fc_luns)
   1. See multipath bindings in /var/log/messages
4. View multipath configuration
   1. multipath -v2 -l
5. Failover test
   1. Expected changes in multi-port-mon
   2. Disable switch port for disk ctlr 1
   3. See failover in /var/log/messages and multi-port-mon
   4. Expected changes in multi-port-mon
   5. Enable switch port for disk ctlr 1
   6. See failback in /var/log/messages and multi-port-mon
ACTIVE/ACTIVE 2 HBAs
- trivial, same as demo1
- Each HBA sees 1 ctlr
- Can let both HBAs see both ctlrs (4 paths to each lun)
  + Use path_prio if need to control path usage

ACTIVE/PASSIVE (AVT) 2 HBAs
- trivial, similar to demo2

ACTIVE/PASSIVE (no AVT) 1 HBA
- Tolerant of ctlr failure only
- If anything else fails, luns will not AVT to alternate ctlr, host will lose access

ACTIVE/PASSIVE (no AVT) 2 HBAs
- Non-preferred paths will be failed
- Each HBA must have full access to both controllers
CANNOT MULTIPATH ROOT OR BOOT DEVICE - per ap-rhcs-dm-multipath-usagetxt.html (see references section)
http://www.redhat.com/docs/manuals/csgfs/browse/rhcs-en/ap-rhcs-dm-multipath-usagetxt.html
Description of output (multipath -v2 -l)
http://kbase.redhat.com/faq/FAQ_85_7170.shtm
Setup device-mapper multipathing in Red Hat Enterprise Linux 4?
http://dims.ncsa.uiuc.edu/set/san
Multi-port-mon
Set switchport state: (en/dis)able switch port via SNMP