VSP Midrange Architecture and Concepts Guide
By James Byun (Performance Measurement Group – Solutions Engineering & Technical Operations)
Hitachi Data Systems Internal and Partner Use Only. NDA Required for Customers.
Notices and Disclaimer
Copyright © 2015 Hitachi Data Systems Corporation. All rights reserved.
The performance data contained herein was obtained in a controlled isolated environment. Actual results that may be
obtained in other operating environments may vary significantly. While Hitachi Data Systems Corporation has reviewed
each item for accuracy in a specific situation, there is no guarantee that the same results can be obtained elsewhere.
All designs, specifications, statements, information and recommendations (collectively, "designs") in this paper are
presented "AS IS," with all faults. Hitachi Data Systems Corporation and its suppliers disclaim all warranties, including
without limitation, the warranty of merchantability, fitness for a particular purpose and non-infringement or arising from a
course of dealing, usage or trade practice. In no event shall Hitachi Data Systems Corporation or its suppliers be liable for
any indirect, special, consequential or incidental damages, including without limitation, lost profit or loss or damage to data
arising out of the use or inability to use the designs, even if Hitachi Data Systems Corporation or its suppliers have been
advised of the possibility of such damages.
This document has been reviewed for accuracy as of the date of initial publication. Hitachi Data Systems Corporation may
make improvements and/or changes in product and/or programs at any time without notice.
No part of this document may be reproduced or transmitted without written approval from Hitachi Data Systems
Corporation.
Document Revision Level
Reference
Hitachi Manuals:
Various Hitachi Product Marketing and Product Management materials
Factory specifications documents
Factory training documents
Papers:
VSP G1000 Architecture and Concepts Guide
HUS VM Architecture and Concepts Guide
HUS 100 Family Architecture and Concepts Guide
Contributors
The information included in this document represents the expertise, feedback, and suggestions of several individuals.
The author would like to recognize the following reviewers of this document:
Alan Benway (Solutions Engineering and Technical Operations, retired)
Alan Davey (Product Management, VSP Midrange Platform Lead)
Charles Lofton (Solutions Engineering and Technical Operations)
Greg Loose (Solutions & Products – Hardware)
Bryan Ribaya (Solutions Engineering and Technical Operations)
Wendy Roberts (Sales, APAC Geo PM for VSP Midrange)
Ian Vogelesang (Solutions Engineering and Technical Operations)
Rob Whalley (Sales, EMEA Geo PM for VSP Midrange)
Introduction
This document covers the hardware architecture and concepts of operations for the Hitachi Virtual Storage Platform
Midrange (VSP Midrange) family. This document is not intended to cover any aspects of the storage software, customer
application software, customer specific environments, or features available in future releases.
This document will familiarize Hitachi Data Systems’ sales personnel, technical support staff, approved customers, and
value-added resellers with the features and concepts of the VSP Midrange family. Users who will benefit the most from
this document are those who already possess an in-depth knowledge of the Hitachi Unified Storage VM (HUS VM)
architecture.
This document will receive future updates that refine or expand the discussion as the internals of the design become better understood or as upgrades are released.
System Highlights
The VSP Midrange family is the successor to the Hitachi Unified Storage 100 (HUS 100) family of midrange storage
arrays and is now built on the same Storage Virtualization Operating System (SVOS) that runs on the Virtual Storage
Platform G1000 (VSP G1000).
The VSP G200, G400, and G600 models are positioned as direct replacements for the HUS 110, 130, and 150 models
respectively, but this comparison belies the immense increase in capabilities and performance of the new generation
architecture. The controller design is a “compacted logical” implementation of the VSP G1000, and is akin to the HUS VM
design albeit with newer generation hardware in a smaller footprint.
The FE and BE Modules are common across all models but unique to the VSP Midrange family. Note: The VSP
G200 Controller Blade features an integrated Back-End controller that is functionally identical to the BE Module but is
not removable.
Each blade also contains half of the system cache and connection to Cache Flash Memory (CFM) which is used for
cache backup in case of power failure.
The connectivity within each Controller Blade is provided by PCI Express 3.0 links, plus Intel QuickPath Interconnect (QPI) on the G400/G600 models, which have two CPU sockets per blade. QPI gives one CPU direct access to the PCI Express paths on the other CPU without requiring interrupts or communication overhead. The external connections between the two Controller Blades are provided by PCI Express 3.0 links.
Like the VSP G1000, the VSP Midrange family uses a cache-based Shared Memory (SM) system, often referred to as
Control Memory to indicate its system function. The master SM copy is mirrored between the two Controller Blades.
Additionally, each Controller Blade has a local, non-mirrored copy of SM that is used by the MPU(s) on that blade for
accessing metadata and control tables for those volumes (LDEVs) it manages. The majority (perhaps 80%) of all SM
accesses are simply reads to this local copy. Updates are written to both the local and master SM copies.
Each MPU controls all I/O operations for a discrete group of LDEVs (LUNs when they are mapped to a host port), in the same manner that LDEVs are managed by individual VSD processor boards in the VSP G1000 array. When first
created, each new LDEV is automatically assigned in a round-robin fashion to one of the four MPUs.
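A minimal sketch of this round-robin assignment is shown below (hypothetical Python; the MPU names and counter are purely illustrative, not the actual SVOS structures):

```python
from itertools import count

MPU_IDS = ["MPU-10", "MPU-11", "MPU-20", "MPU-21"]  # illustrative names only

_ldev_counter = count()          # monotonically increasing creation index
ldev_owner = {}                  # ldev_id -> owning MPU

def assign_new_ldev(ldev_id: str) -> str:
    """Assign a newly created LDEV to one of the four MPUs in round-robin order."""
    mpu = MPU_IDS[next(_ldev_counter) % len(MPU_IDS)]
    ldev_owner[ldev_id] = mpu
    return mpu

# Example: four new LDEVs land on four different MPUs, the fifth wraps around.
for ldev in ["00:00:01", "00:00:02", "00:00:03", "00:00:04", "00:00:05"]:
    print(ldev, "->", assign_new_ldev(ldev))
```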
Each MPU executes the Storage Virtualization Operating System (SVOS) for the following processes for those
volumes (LDEVs) that it manages:
Target mode (Open Systems hosts)
External mode (Virtualization of other storage)
Back End mode (Operate FMD/SSD/HDD drives in the subsystem)
Replication Initiator mode (TrueCopy Sync or HUR)
Replication Target mode
The VSP Midrange family uses drive boxes that are mechanically similar to, but not compatible with, those of the HUS 100 or HUS VM. These 12Gbps SAS drive boxes will not work with the 6Gbps SAS BE Modules of those systems, nor will their
6Gbps drive boxes work with the 12Gbps SAS BE Modules of the VSP Midrange systems. The drive box variations
are listed below. Note that although these are natively 12Gbps, 6Gbps SAS drives are supported in intermixed
configurations:
DBS: 2U 24-slot SFF, with one row of 24 x 2.5” vertical disk drive slots
DBL: 2U 12-slot LFF, with three rows of 4 x 3.5” horizontal disk drive slots
DB60: 4U 60-slot dense LFF, with five top-loaded rows of 12 x 3.5” disk drive slots
DBF: 2U 12-slot FMD, with four horizontal rows of 3 x FMD slots
Drive choices include:
SFF drives: 200GB MLC SSD, 400GB MLC SSD, 300GB 15K, 600GB 15K, 600GB 10K, and 1.2TB 10K SAS drives
LFF drives: 400GB MLC SSD (LFF canister), 1.2TB 10K (LFF canister), 4TB and 6TB 7.2K SAS drives
FMD drives: 1.6TB and 3.2TB drives
Hitachi Virtualization (UVM: Universal Volume Manager)
The VSP Midrange family provides the same Hitachi Virtualization mechanism as found on the VSP G1000. Other
customer storage systems (often being repurposed upon replacement by newer systems) may be attached to some of the
front-end FC ports. These paths would then operate in External Initiator mode rather than the default host Target mode.
The LUNs that are supplied by these external systems are accessed and managed by hosts that are attached to the same
or other front-end ports (utilizing Target mode). As far as any host is concerned, all of the visible virtualized LUNs passed
through the VSP Midrange external ports to the host target ports simply appear to be normal internal LUNs in the VSP
Midrange array. The VSP Midrange’s bidirectional FC ports allow simultaneous host and external storage connectivity
without the need for dedicated “ePorts” and host ports.
Virtualized (external) storage should only be deployed as Tier-2 and Tier-3. Tier-0 and Tier-1 use should be limited to the
internal VSP Midrange LDEVs which come from Parity Groups based on FMD, SSD, or SAS drives. One benefit of
virtualization is greatly simplified management of LUNs and hosts. Another major advantage of virtualization is the ability
to dynamically (and transparently) move LDEVs from Tier-0 or Tier-1 Parity Groups (known as RAID Groups on modular
systems) down to Tier-2 or Tier-3 external storage using a different RAID level or drive type. These LDEV migrations are
able to proceed while the original source LDEV remains online to the hosts, and the VSP Midrange will seamlessly switch
over the mapping from the original LDEV to the new lower tier LDEV when completed. No changes on the host mount
points are required.
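A conceptual sketch of this transparent migration idea follows (hypothetical Python with invented names; the copy routine is a placeholder, not the actual migration engine): the host-visible LDEV identity is kept while the backing storage is swapped from an internal Parity Group to an external (virtualized) tier.

```python
class Ldev:
    """Host-visible volume identity; the backing container can be swapped underneath it."""
    def __init__(self, ldev_id: str, backing: str):
        self.ldev_id = ldev_id
        self.backing = backing          # e.g. an internal VDEV or an external LUN (eLUN)

    def migrate(self, new_backing: str) -> None:
        # The data copy happens while the LDEV stays online; only at completion
        # does the mapping switch over to the lower-tier backing store.
        copy_all_data(self.backing, new_backing)      # placeholder for the online copy
        self.backing = new_backing                    # seamless switch, no host change

def copy_all_data(src: str, dst: str) -> None:
    print(f"copying {src} -> {dst} while the volume remains online")

vol = Ldev("00:10:05", backing="internal VDEV (FMD Parity Group)")
vol.migrate("eLUN on virtualized Tier-3 array")
print("host still mounts LDEV", vol.ldev_id, "now backed by:", vol.backing)
```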
Hitachi Dynamic Provisioning (HDP)
Dynamic Provisioning Volumes (DPVOLs, or virtual volumes) are created with a user-specified logical size (up to 60TB) and connected to a single HDP Pool. The host accesses the DPVOL (or many of them, even hundreds) as if it
were a normal volume (LUN) over one or more host ports. A major difference is that disk space is not physically allocated
to a DPVOL from the Pool until the host has written to different parts of that DPVOL’s Logical Block Address (LBA) space.
The entire logical size specified when creating that DPVOL could eventually become fully mapped to physical space using
42MB Pool pages from every LDEV in the Pool. If new LDEVs from new Parity Groups are added to the Pool later on, a
rebalance operation (restriping) of the currently allocated Pool pages onto these additional RAID Groups can be initiated.
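A rough sketch of this thin-provisioning behavior is given below (hypothetical Python; the class and method names are illustrative, not the actual SVOS structures): 42MB pages are only bound to a DPVOL's LBA range when the host first writes to that range, and the logical size is effectively a multiple of the page size.

```python
import math

PAGE_SIZE = 42 * 1024 * 1024      # 42MB HDP pool page
BLOCK = 512                        # 512-byte sector

class DPVol:
    """Toy model of a Dynamic Provisioning Volume backed by an HDP Pool."""
    def __init__(self, logical_bytes: int):
        self.logical_bytes = logical_bytes
        self.pages = {}            # page index -> pool page handle (allocated lazily)

    def write(self, lba: int, length: int, pool) -> None:
        """Allocate pool pages only for the page-aligned regions the host touches."""
        first = (lba * BLOCK) // PAGE_SIZE
        last = (lba * BLOCK + length - 1) // PAGE_SIZE
        for page_idx in range(first, last + 1):
            if page_idx not in self.pages:
                self.pages[page_idx] = pool.allocate_page()

    def allocated_bytes(self) -> int:
        return len(self.pages) * PAGE_SIZE

class Pool:
    def __init__(self):
        self.pages_used = 0
    def allocate_page(self):
        self.pages_used += 1
        return self.pages_used     # opaque handle

# A 1TB DPVOL consumes no pool pages up front; pages are bound as the host writes.
vol, pool = DPVol(1 * 1024**4), Pool()
print(math.ceil(vol.logical_bytes / PAGE_SIZE), "pages if fully allocated")
vol.write(lba=0, length=1_000_000, pool=pool)   # host writes ~1MB near LBA 0
print(vol.allocated_bytes() // (1024 * 1024), "MB physically allocated so far")
```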
The Hitachi Dynamic Tiering (HDT) feature allows a single Pool to contain multiple types of RAID groups (Pool Volumes, using any available
RAID level) and any type of drive, as well as external LUNs from virtualized storage. Up to three choices from these
possible combinations are allowed per Pool. An example would be FMDs (Tier 1), SAS 10K (Tier 2), and external storage
using SAS 7.2K (Tier 3). Only one RAID level is normally used per Tier.
However, when a Tier is to be changed from one drive type or RAID level to another, the makeup of the Tiers may change temporarily while the migration is in progress. For example, Tier 2 may have been established using SAS
15K drives and RAID-5 (7D+1P), but it is desired to change this to SAS 10K and RAID-6 (6D+2P). The SAS 10K drives
would temporarily become part of Tier 3 (which allows disparate drive types) until the migration is complete and the
original SAS 15K Pool Volumes removed, at which point the SAS 10K drives will be moved up to Tier 2.
The original pool volume may be deleted using Pool Shrink; in doing so, the HDT software relocates all allocated 42MB pages from that pool volume to the new ones. Once the copy is completed (which may take a long time), that pool volume (an LDEV) is removed from that Pool. That LDEV can now be reused for something else (or deleted, and the
drives for that Parity Group removed from the system).
HDT manages the mapping of 42MB Pool pages within these various tiers within a Pool automatically. Management
includes the dynamic relocation of a page based on frequency of back end disk I/O to that page. Therefore, the location
of a Pool page (42MB) is managed by HDT according to host usage of that part of an individual DPVOL’s LBA space.
This feature can eliminate most user management of storage tiers within a subsystem and can maintain peak
performance under dynamic conditions without user intervention. This mechanism functions effectively without visibility
into any file system residing on the volume. The top tier is kept full at all times, with pages being moved down a tier to
make room for new high-activity pages. This is a long term space management mechanism, not a real-time relocation
service.
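The long-term relocation policy described above can be illustrated with a small sketch (hypothetical Python, a conceptual model only, not the HDT algorithm itself): pages are ranked by recent back-end I/O count, the top tier is filled first, and cooler pages spill down to make room.

```python
def place_pages(page_iops: dict, tier_capacity_pages: list) -> list:
    """
    page_iops: page_id -> observed back-end I/O count for the monitoring cycle
    tier_capacity_pages: capacity of each tier in 42MB pages, Tier 1 first
    Returns a list of sets, one per tier, with the hottest pages in Tier 1.
    """
    ranked = sorted(page_iops, key=page_iops.get, reverse=True)
    tiers, cursor = [], 0
    for capacity in tier_capacity_pages:
        tiers.append(set(ranked[cursor:cursor + capacity]))
        cursor += capacity
    return tiers

# Example: six pages, a 2-page FMD tier, a 3-page SAS tier, the rest external.
iops = {"P1": 900, "P2": 15, "P3": 400, "P4": 2, "P5": 120, "P6": 60}
for n, tier in enumerate(place_pages(iops, [2, 3, 10]), start=1):
    print(f"Tier {n}: {sorted(tier)}")
```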
Glossary
At this point, some definitions of the terminology used are necessary to make the following discussions easier to follow. Throughout this paper, the terminology used by Hitachi Data Systems (not Hitachi Ltd. in Japan) will normally be used. Because much storage terminology is used differently in Hitachi documentation or by users in the field, here are the definitions as used in this paper:
Array Group (installable, drive feature): The term used to describe a set of at least four physical drives installed into
any disk tray(s) (in any “roaming” order on VSP Midrange). When an Array Group is formatted using a RAID level, the
resulting RAID formatted entity is called a Parity Group. Although technically the term Array Group refers to a group of
bare physical drives, and the term Parity Group refers to something that has been formatted as a RAID level and
therefore actually has initial parity data (here we consider a RAID-10 mirror copy as parity data), be aware that this
technical distinction is often lost. You will see the terms Parity Group and Array Group used interchangeably in the field.
Back-end Module (BE Module, installable, DKB feature): A SAS drive controller module that plugs into a socket in the
Controller Chassis and provides the eight back-end 12Gbps SAS links via two SAS 4-Wide ports per module. There are
two of these modules installed in a VSP G400 or G600 unless it is purchased as a diskless system, which can then
have extra FE Modules (more Fibre Channel ports) installed instead of the BE Modules. Strictly speaking, the VSP
G200 does not have BE Modules but integrated Back-End controllers, which are identical in function to the BE Modules
but are not removable.
Bidirectional Port: A port that can simultaneously operate in Target and Initiator modes. This means the port supports
all four traditional attributes without requiring the user to choose one at a time:
Open Target (TAR)
Replication Target (RCU)
Replication Initiator (MCU)
External Initiator (ELUN)
Cache Directory: The region reserved in cache for use by the MPUs in managing the User Data cache region. The
Cache Directory size varies according to the size of the User Data cache region, which is directly affected by the size of
Shared Memory.
CB (Controller Chassis): Hitachi’s name for the bare Controller Box, which can come in one of three types (CBSS,
CBSL, CBL) depending on the model of the array and the type of internal drive slots, if any.
CFM (Cache Flash Module): SATA SSD that serves as a cache backup device in case of power loss. There are
designated CFM slots that these are installed in.
CHB (Channel Blade): Hitachi’s name for the Front-end Module.
Cluster: One half or side of the array, consisting of a Controller Blade, its components, and the I/O Modules connected
to it. Cluster 1 refers to the side containing Controller Blade 1 and Cluster 2 refers to the side containing Controller
Blade 2.
Concatenated Parity Group: A configuration where the VDEVs corresponding to a pair of RAID-10 (2D+2D) or RAID-
5 (7D+1P) Parity Groups, or four RAID-5 (7D+1P) Parity Groups, are interleaved on a RAID stripe level on a round robin
basis. A logical RAID stripe row is created as a concatenation of the individual RAID stripe rows. This has the effect of
dispersing I/O activity over twice or four times the number of drives, but it does not change the number, names, or size
of VDEVs, and hence it doesn't make it possible to assign larger LDEVs to them. Note that we often refer to RAID-10
(4D+4D), but this is actually two RAID-10 (2D+2D) Parity Groups interleaved together. For a more comprehensive
explanation refer to Appendix 5 of the VSP G1000 Architecture and Concepts Guide.
CTL (Controller Blade): The shorthand name for the Controller Blade, not to be confused with an HUS 100 Controller
which was also abbreviated CTL. The Intel Xeon Ivy Bridge EP processors and Cache DIMMs are physically installed
in the CTL.
DARE (Data at Rest Encryption): Controller-based data encryption of all blocks in a Parity Group, enabled via
software license key.
DB (Disk Box): Hitachi’s name for the disk enclosures.
DBS: 2U 24-slot SFF SAS box
DBL: 2U 12-slot LFF SAS box
DB60: 4U 60-slot dense LFF SAS drawer (supports SFF intermix via special drive canisters)
DBF: 2U 12-slot FMD box
DIMM (Dual Inline Memory Module): A “stick” of RAM installed in the corresponding DIMM sockets on the Controller
Blades.
DKB (Disk Blade): Hitachi’s name for the Back-end Module.
DKC (Disk Controller): Hitachi’s name for the controller unit as a whole, comprised of the Controller Chassis (CB),
Controller Blades (CTL), FE and BE Modules, Power Supplies, etc. The Controller Chassis (CB) is often also referred
to as the DKC.
DPVOL (configurable, Dynamic Provisioning VOLume): The Virtual Volume connected to an HDP Pool. Some
documents also refer to this as a V-VOL, not to be confused with a VMware VVol. It is a member of a V-VOL Group,
which is a kind of VDEV. Each DPVOL has a user specified size between 8GB and 60TB in increments of one block
(512 byte sector) and is built upon a set of 42MB pages of physical storage (so each DPVOL should be specified as a
multiple of 42MB).
Drive (Disk): An FMD, SSD, or HDD. SATA disks are not supported in the VSP Midrange family.
DRR (DRR Emulator, Data Recovery and Reconstruction): Virtual processors that run on the VSP Midrange MPUs
in microcode (software) that manage RAID parity operations and drive formatting or rebuilds.
eLUN (configurable, External LUN): An External LUN is one which is located in another storage system and managed
as though it were just another internal LDEV. The external storage system is attached via two or more FC Ports and
accessed by the host through other front-end target ports. The eLUN is used within the VSP Midrange as a VDEV, a
logical container from which LDEVs can be carved. Individual external LDEVs may be mapped to a portion of or to the
entirety of the eLUN. Usually a single external LDEV is mapped to the exact LBA range of the eLUN, and thus the
eLUN can be “passed through” the VSP Midrange to the host.
FC Port: Any of the Fibre Channel ports on a Fibre Channel FE Module. Each VSP Midrange family FC Port is a
Bidirectional Port.
Feature (package): An installable hardware option (such as an FE Module, BE Module, or Cache DIMM) that is
orderable by Feature Code (P-Code). Each of the VSP Midrange features is a single board or module, and not a pair
like some of the VSP G1000 features.
FMD: The Flash Module Drive (1.6TB or 3.2TB) that installs in the DBF disk box.
Front-end Module (FE Module, installable, CHB feature): The host connectivity interface module that plugs into a
socket in the Controller Chassis. There are three types of FE Modules supported: 4 x 8Gbps FC, 2 x 16Gbps FC, and
2 x 10Gbps iSCSI. FC ports may also be used to attach to external storage or to remote systems when using the
TrueCopy or Hitachi Universal Replicator (HUR) program products.
GUM (Gateway for Unified Management): The embedded micro server (Linux) on each Controller Blade that provides
the system interface used by the Storage Navigator management software running on the SVP or the HiCommand Suite
(HCS).
LDEV (configurable, Logical DEVice): A logical volume internal to the system that can be used to contain customer
data. LDEVs are uniquely identified within the system using a six-digit identifier in the form LDKC:CU:LDEV. LDEVs
are carved from a VDEV (see VDEV), and there are three types of LDEVs: internal LDEVs, external LDEVs, and
DPVOLs. LDEVs are then mapped to a host as a LUN. Note: what is called an LDEV in all Hitachi enterprise systems
and the HUS VM is called an LU or LUN in HDS modular systems like the Hitachi Unified Storage 100 (HUS 100)
family.
LR (Local Router, Local Router Emulator, or Command Transfer Circuit): Virtual processors that run on the VSP
Midrange MPUs in microcode (software) that facilitate the transfer of commands between FE or BE Modules and the
MPUs.
LUN (configurable, Logical Unit Number): The host-visible identifier assigned by the administrator to an existing LDEV
to make it usable on a host port. An internal LUN has no actual queue depth limit (but 32 is a good rule of thumb) while
an external (virtualized) eLUN has a Queue Depth limit of 2-128 (adjustable) per external path to that eLUN. In Fibre
Channel, the host HBA Fibre Channel port is the initiator, and the system’s virtual Fibre Channel port (or Host Storage
Domain) is the target. Thus the Logical Unit Number is the number of the logical volume within a target.
MP (Microprocessor): An individual MPU core, which is a single core of the Intel Ivy Bridge EP Xeon CPU. Not to be
confused with a FED MP or BED MP from USP V and earlier enterprise arrays.
MPU (Microprocessor Unit): The multi-core logical processor that is superimposed on the physical processor cores (MPs). In the case of the VSP G200, the MPU is a 2-core logical unit comprising half of the 4 cores of the blade's single Xeon CPU. The MPU in a VSP G400 or G600 is a 4-core logical unit comprising all 4 cores of one Xeon CPU.
OPEN-V: The name of the RAID mechanism on VSP Midrange and VSP G1000 for Open (non-mainframe) hosts.
Refer to Appendix 1 of the VSP G1000 Architecture and Concepts Guide for more details.
Parity Group (configurable, a RAID Group): A set of drives formatted as a single RAID level, either as RAID-10
(sometimes referred to as RAID-1+ in HDS documentation), RAID-5, or RAID-6. The VSP Midrange’s supported Parity
Group types are RAID-10 (2D+2D), RAID-5 (3D+1P, 4D+1P, 6D+1P, or 7D+1P), and RAID-6 (6D+2P, 12D+2P, and
14D+2P). The OPEN-V RAID chunk (or stripe) size is fixed at 512KB. Internal LDEVs are carved from the VDEV(s)
corresponding to the formatted space in a Parity Group, and thus the maximum size of an internal LDEV is determined
by the size of the VDEV that it is carved from. The maximum size of an internal VDEV is approximately 2.99TiB (binary
TB). If the formatted space in a Parity Group is bigger than 2.99TiB, then multiple VDEVs must be created on that
Parity Group. Note that there actually is no discrete 4D+4D Parity Group type – see Concatenated Parity Group.
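As a worked illustration of the 2.99TiB VDEV limit, the sketch below (Python, under the simplifying assumption that a drive's usable formatted capacity equals its nominal capacity) estimates how many VDEVs a Parity Group requires:

```python
import math

VDEV_LIMIT_TIB = 2.99              # approximate maximum internal VDEV size

def vdevs_for_parity_group(data_drives: int, drive_tib: float) -> int:
    """How many VDEVs must be carved from the formatted space of a Parity Group."""
    formatted_tib = data_drives * drive_tib     # parity/mirror capacity excluded
    return math.ceil(formatted_tib / VDEV_LIMIT_TIB)

# RAID-5 (7D+1P) on 1.2TB drives: 7 data drives of roughly 1.09TiB each
# yields about 7.6TiB of formatted space, so three VDEVs are needed.
print(vdevs_for_parity_group(data_drives=7, drive_tib=1.2 * 10**12 / 2**40))
```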
PCIe (PCI Express): A multi-channel serial bus connection technology that supports x1, x4, and x8 lane configurations.
The resulting connection is called a “PCIe link”. The VSP Midrange uses the x8 type in most cases. The PCIe 3.0 x8
link is capable of 8GB/s send plus 8GB/s receive in full duplex mode (i.e. concurrently driven in each direction). Refer
to Appendix 8 of the VSP G1000 Architecture and Concepts Guide for more details.
PDEV (Physical DEVice): A physical internal drive.
RAID-1: Used in some documents to describe what is usually called “RAID-10”, a stripe of mirrored pairs. Thus when
we say “RAID-1” in the context of a Hitachi VSP-family system, we mean the same thing as when we say “RAID-10” in
the context of an HUS 100 modular system. Note that the alternative RAID-0+1 used by some vendors is quite
different: it is the far more vulnerable mirror of two RAID-0 stripes. A single drive failure does not in itself lose the data, because the configuration is still a mirror, but it does take the entire stripe containing that drive offline, leaving the array exposed to a second drive failure in the other stripe. In generic RAID-10, a stripe of mirrors, two drives can fail in different mirror pairs and each mirror pair remains "alive" within the stripe, so no data are lost. This is how the VSP Midrange family works.
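The difference in second-failure exposure can be made concrete with a small sketch (illustrative Python, not product code): in a stripe of mirrors (RAID-10), data survives as long as no single mirror pair loses both members, whereas in a mirror of stripes (RAID-0+1) any single failure takes down an entire stripe.

```python
def raid10_survives(failed: set, pairs: list) -> bool:
    """Stripe of mirrors: data is lost only if both drives of some pair fail."""
    return all(not set(pair).issubset(failed) for pair in pairs)

def raid01_survives(failed: set, stripes: list) -> bool:
    """Mirror of stripes: a stripe dies if any of its drives fails;
    data is lost once both stripes have at least one failed drive."""
    dead = [any(d in failed for d in stripe) for stripe in stripes]
    return not all(dead)

pairs = [("d0", "d1"), ("d2", "d3"), ("d4", "d5"), ("d6", "d7")]
stripes = [("d0", "d2", "d4", "d6"), ("d1", "d3", "d5", "d7")]

# Two failures in different mirror pairs: RAID-10 survives, RAID-0+1 does not.
failed = {"d0", "d3"}
print("RAID-10 survives:", raid10_survives(failed, pairs))
print("RAID-0+1 survives:", raid01_survives(failed, stripes))
```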
Shared Memory (SM): Otherwise referred to as Control Memory, it is the region in system cache that is used to
manage all volume metadata and system states. In general, Shared Memory contains all of the metadata in a storage
system that is used to describe the physical configuration, track the state of all LUN data, track the status of all
components, and manage all control tables that are used for I/O operations (including those of Copy Products). The
overall footprint of Shared Memory in cache can range from 8.5 - 24.5GB for VSP G200 and 11 - 51GB for G400 and
G600 models.
SVP (Service Processor): A 1U server running Windows that functions as a management interface to the VSP
Midrange array. It is installed separately from the DKC and runs the Mini-Management Appliance (Mini-MApp) software
which consists of:
Block Element Manager (BEM) – This is the Storage Navigator software component.
SVP Remote Method Invocation (RMI) code – This is how the SVP interfaces with the GUM on each Controller Blade.
Hi-Track Agent – System monitoring and reporting tool used by the Global Support Center.
VDEV (configurable): The logical storage container from which LDEVs are carved. There are two types of VDEVs on
the VSP Midrange:
Internal VDEV (2.99TiB max) – Maps to the formatted space within a parity group that is available to store user data.
LDEVs carved from a parity group VDEV are called internal LDEVs.
External storage VDEV (4TiB max) – Maps to a LUN on an external (virtualized) system. LDEVs carved from
external VDEVs are called external LDEVs.
Note: The two Controller Blades must be symmetrically configured with respect to the FE module types and slots they are
installed in.
* The VSP G200's maximum drive count is achieved with an intermix of 24 SFF disks in the CBSS controller box and 240 LFF disks in four DB60 dense disk boxes.
Table 2. Summary of Maximum Limits, VSP Midrange and HUS 100 Family
(Table columns: VSP G200, VSP G400, VSP G600, HUS 110, HUS 130, HUS 150.)
VSP G200
(Block diagram: VSP G200 controller architecture. Each of the two Controller Blades contains a 4-core Intel Ivy Bridge EP Xeon processor with 16GB or 32GB of DDR3-1600 cache (25.6GB/s), two FE Module slots (8Gb FC, 16Gb FC, or 10Gb iSCSI), an integrated BE controller, a SATA 6Gb controller with a CFM SSD, and the GUM, FPGA, LAN ports, PCH, and power supply; the two blades are cross connected by a 16GB/s I-Path (NTB) across the backplane. Each integrated SAS controller provides 4 x 12Gbps SAS links on one external SAS Wide cable port, connected via a SAS Wide cable to the enclosure stack of disk trays, shown with a 24 x 2.5" tray and a 12 x 3.5" tray.)
A fully configured VSP G200 system includes the following:
A Controller Chassis (DKC) providing:
24 x SFF drive slots (CBSS) or 12 x LFF drive slots (CBSL)
2 Controller Blades, each with:
1 x 4-core Intel Ivy Bridge EP Xeon 1.8GHz processor
16GB or 32GB of cache (2 x 8GB or 2 x 16GB DDR3-1600 DIMMs) for a total of 32GB or 64GB of cache in the subsystem
2 FE module slots supporting 1 or 2 FE modules (4 x 8Gbps FC, 2 x 16Gbps FC, or 2 x 10Gbps iSCSI each)
1 integrated BE controller (4 x 12Gbps SAS links), standard only. Encryption is not currently supported.
1 x 120GB CFM SSD for cache backup
Up to 7 DBL, DBS, or DBF disk boxes, supporting a maximum of:
96 LFF disks (including 12 LFF disks in CBSL)
192 SFF disks (including 24 SFF disks in CBSS)
84 FMDs
Or up to 4 DB60 dense disk boxes, supporting a maximum of:
252 LFF disks (including 12 LFF disks in CBSL)
Or an intermix of disk box types not to exceed a disk box count of 7, where:
Each DBL, DBS, and DBF is counted as 1 disk box
Each DB60 is counted as 2 disk boxes
1U Service Processor (SVP)
One or more 19” standard racks (HDS supplied or appropriate third party)
The internal drive slots are located at the front of the Controller Chassis. The two Controller Blades are installed in slots
at the rear of the Controller Chassis, with two power boundaries called Cluster 1 (left side) and Cluster 2 (right side).
Each Cluster has its own independent Power Supply Unit (PSU). The FE Modules are installed into slots on the
Controller Blades. The BE Modules are integrated into the Controller Blades themselves. Figure 2 shows a rear view of
the chassis.
(Figure 2: rear view of the VSP G200 Controller Chassis, showing Cluster 1 on the left with PSU1, Battery1, and FE port slots 1A/3A/5A/7A and 1B/3B/5B/7B, and Cluster 2 on the right with PSU2, Battery2, and FE port slots 2A/4A/6A/8A and 2B/4B/6B/8B.)
The Controller Blade slots for the FE Modules are labeled “A-B”, where the A slots represent the default features and the
B slots represent the optional features that can be installed. The FE Module port numbers are either “1, 3, 5, 7” for
Cluster 1 (odd) or “2, 4, 6, 8” for Cluster 2 (even). The name of a given FE Module comes from the Cluster it is installed in
and the slot within the Cluster. For example, the first FE Module (slot A) in Cluster 1 is FE-1A. Likewise, the name for an
individual port is the combination of the port number and the FE Module slot. For example, the last port on FE-2A is Port
8A. For FE Modules with only 2 ports, the port numbers are either “1, 3” for Cluster 1 or “2, 4” for Cluster 2.
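The naming convention can be captured in a short sketch (hypothetical Python helper, not an HDS tool): the module name is FE-<cluster><slot>, and port numbers are odd on Cluster 1 and even on Cluster 2.

```python
def fe_module_name(cluster: int, slot: str) -> str:
    """e.g. cluster 1, slot 'A' -> 'FE-1A'"""
    return f"FE-{cluster}{slot.upper()}"

def fe_port_names(cluster: int, slot: str, ports: int = 4) -> list:
    """Odd port numbers (1, 3, 5, 7) on Cluster 1, even (2, 4, 6, 8) on Cluster 2."""
    start = 1 if cluster == 1 else 2
    numbers = range(start, start + 2 * ports, 2)
    return [f"{n}{slot.upper()}" for n in numbers]

print(fe_module_name(1, "A"), fe_port_names(1, "A"))       # FE-1A ['1A','3A','5A','7A']
print(fe_module_name(2, "A"), fe_port_names(2, "A")[-1])   # FE-2A, last port 8A
print(fe_port_names(1, "B", ports=2))                      # 2-port module: ['1B','3B']
```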
The VSP G200 back end has a total of 8 x 12Gbps full duplex SAS links provided by the two integrated BE controllers
(one port each). The disk boxes are connected as a single enclosure “stack” with the Controller Box’s internal drives
serving as the first disk box (DB-00). Up to seven additional disk boxes can be attached, numbered DB-01 to DB-07.
Each half of the stack of disk boxes that share the same SAS port can be considered a “SAS boundary” akin to the power
boundaries that define the two DKC clusters. Each dual ported drive is accessible via four SAS links from the BE Module
in Cluster 1 and another four SAS links from the BE Module in Cluster 2.
VSP G400
(Block diagram: VSP G400 controller architecture. Each of the two Controller Blades contains two 4-core Intel Ivy Bridge EP Xeon processors linked by a 25.6GB/s QPI connection, with 16GB or 32GB of cache attached to each processor, 16GB/s PCIe 3.0 x8 paths to the FE Modules (8Gb FC, 16Gb FC, or 10Gb iSCSI) and to the BE Module, and the PCH, LAN controllers, and power supply. Two 16GB/s I-Paths (NTB) cross connect the blades through the backplane.)
The two Controller Blades are installed in slots at the front of the Controller Chassis, with two power boundaries called
Cluster 1 (bottom) and Cluster 2 (top). Each Cluster has its own independent PSU. Each blade has four fans and two
slots for cache flash memory SSDs on the front of the assembly, which are visible when the front DKC bezel is removed.
On the rear of the chassis are the slots for the FE and BE modules and the power supplies. Figure 4 shows a rear view of
the chassis.
Figure 4. VSP G400 Controller Chassis (DKC) Organization
(Rear view: Cluster 2 on top, with FE port slots 2A-2D carrying ports 2, 4, 6, and 8 each, BE-2H with SAS ports 2H-0 and 2H-1, and two LAN ports; Cluster 1 on the bottom, with FE port slots 1A-1D carrying ports 1, 3, 5, and 7 each, BE-1H with SAS ports 1H-0 and 1H-1, and two LAN ports; PSU1 and PSU2 between the clusters.)
The slots for the FE Modules are labeled “A-D”, where the A slots represent the default features and the B-D slots
represent the optional features that can be installed. The FE Module port numbers are either “1, 3, 5, 7” for Cluster 1
(odd) or “2, 4, 6, 8” for Cluster 2 (even). The name of a given FE Module comes from the Cluster it is installed in and the
slot within the Cluster. For example, the first FE Module (slot A) in Cluster 1 is FE-1A. Likewise, the name for an
individual port is the combination of the port number and the FE Module slot. For example, the last port on FE-2D is Port
8D. For FE Modules with only 2 ports, the port numbers are either “1, 3” for Cluster 1 or “2, 4” for Cluster 2.
The slots for the BE Modules are labeled “H” and the SAS port numbers are “0, 1”. For a diskless system dedicated to
virtualization use, the two BE Modules can be replaced by two more FE Modules to provide a total of 40 x 8Gbps or 20 x
16Gbps FC ports.
The VSP G400 back end has a total of 16 x 12Gbps full duplex SAS links provided by the two BE Modules (two ports
each). The disk boxes are connected as two enclosure “stacks” with SAS port 0 on each BE Module connected to the first
stack of even numbered disk boxes (DB-00, DB-02, DB-04, etc.). SAS port 1 on each BE module is connected to the
second stack of odd numbered disk boxes (DB-01, DB-03, DB-05, etc.). Up to 16 disk boxes can be attached, numbered
DB-00 to DB-15. The half of a stack of disk boxes that share the same SAS port can be considered a “SAS boundary”.
Each stack can address up to 240 drives. Each dual ported drive is accessible via four SAS links from a BE Module port
in Cluster 1 and another four SAS links from a BE Module port in Cluster 2.
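A small sketch of this disk box numbering rule follows (hypothetical Python): even-numbered DBs hang off SAS port 0 of each BE Module and odd-numbered DBs off SAS port 1, with each box cabled to the BE Modules in both clusters.

```python
def db_paths(db_number: int) -> dict:
    """
    Map a G400/G600 disk box number to the BE Module SAS ports that reach it.
    Even DBs sit on port 0 (first stack), odd DBs on port 1 (second stack);
    every box is reachable from both clusters' BE Modules.
    """
    port = 0 if db_number % 2 == 0 else 1
    return {
        "stack": "even" if port == 0 else "odd",
        "cluster1_port": f"BE-1H port {port}",
        "cluster2_port": f"BE-2H port {port}",
    }

for db in (0, 1, 4, 15):
    print(f"DB-{db:02d} ->", db_paths(db))
```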
The I/O Module slots in grey are unused at this time and are blocked with metal spacers. They are present in the chassis
to allow for future products or enhancements to the VSP Midrange family.
VSP G600
(Block diagram: VSP G600 controller architecture. The Controller Blades are the same as the VSP G400's, with FE Modules (8Gb FC, 16Gb FC, or 10Gb iSCSI) attached over 16GB/s PCIe 3.0 x8 paths, the PCH and LAN controllers on each blade, and two 16GB/s I-Paths (NTB) cross connecting the blades.)
The VSP G600 is a higher performance, higher capacity version of the G400. They share a common DKC and Controller
Blades and the number of installable FE and BE Modules is the same. A software license key is used to upgrade a G400
into a G600 model. This provides an MPU performance boost in microcode and doubles the internal cache bandwidth
and cache capacity via the activation of an additional two DIMM sockets per Xeon processor. In addition to these
performance enhancements, an additional 8 disk boxes (or 4 x DB60) are supported.
A fully configured VSP G600 system includes the following, with differences from the G400 highlighted in bold:
A 4U form factor Controller Chassis (DKC) providing:
2 Controller Blades, each with:
2 x 4-core Intel Ivy Bridge EP Xeon 2.5GHz processor
64GB or 128GB of cache (4 x 16GB or 8 x 16GB DDR3-1600 DIMMs) for a total of 128GB or 256GB of cache per
subsystem
5 I/O module slots
1 or 2 x 120GB CFM SSDs for cache backup
Choice of I/O module configurations
1 to 4 FE modules (4 x 8Gbps FC, 2 x 16Gbps FC, or 2 x 10Gbps iSCSI each)
1 BE module with 8 x 12Gbps SAS links (2 ports, 4 x 12Gbps SAS links per port), standard or encrypting
For virtualization configurations without internal disks, the BE module can be replaced with an additional FE module
Up to 24 DBL, DBS, or DBF disk boxes, supporting a maximum of:
288 LFF disks
576 SFF disks
288 FMDs
Or up to 12 DB60 dense disk boxes, supporting a maximum of:
720 LFF disks
Or an intermix of disk box types not to exceed a disk box count of 24, where:
Each DBL, DBS, and DBF is counted as 1 disk box
Each DB60 is counted as 2 disk boxes
1U Service Processor (SVP)
One or more 19” standard racks
The only difference in the VSP G600 back end is that each enclosure “stack” can support up to 360 drives due to the
increased number of installable disk boxes.
The G200 and G400/G600 Controller Blades are different and will be described in detail separately
Front-end Connectivity modules (FE) with 4 x 8Gbps FC, 2 x 16Gbps FC, or 2 x 10Gbps iSCSI ports
Back-end Drive Controller modules (BE) with 8 x 12Gbps SAS links in two SAS Wide ports
The G200 does not support pluggable BE modules and its integrated BE controller provides 4 x 12Gbps SAS links in
one SAS Wide port
The two Controller Blades are the core of the design, with the Intel Xeon processors organized into MPU logical units that
execute the system software, manage all I/O, and emulate all the specialized functions that were done previously by
custom ASICs (DCTL ASIC in HUS 100 family, HM ASIC in HUS VM, DA ASIC in VSP G1000). FE modules are plugged
into slots directly on a G200 Controller Blade, while FE and BE modules are plugged into slots at the rear of the
G400/G600 DKC that are connected to an individual Controller Blade. These modules are extensions to the Controller
Blade, not an independent unit like the autonomous FED and BED boards on the Grid on a VSP G1000.
The FE modules are based on either a Tachyon QE8 or Hilda chip, which are powerful processors with significant
independent functionality (described in detail later). Similarly, the BE modules are based on a powerful dual-core SAS
Protocol Controller (SPC) which also functions mostly on its own. The Tachyon and SAS processors communicate with
their Controller Blade’s Local Routers (LR). This is their command transfer circuit to the MPU logical processors and the
method by which host I/O requests get scheduled.
Each Tachyon or SAS processor also has several DMA channels built-in. These are used to directly access the Data
Transfer Buffer (DXBF) and User Data regions of cache (described later in this paper). Access to the User Data regions
first requires assignment of a cache address from the MPU that owns the LDEV in question.
Each Intel Ivy Bridge EP Xeon processor on a Controller Blade provides:
One PCIe 3.0 x8 link (I-Path) to cross connect to the CPU on the other Controller Blade. The I-Path is also referred to
as the Non-Transparent Bridge (NTB) path.
Dual-channel memory controller attached to two DDR3-1600 DIMM slots, with 12.8GB/s of bandwidth per channel and
25.6GB/s of total cache bandwidth.
DMA channels to allow the PCIe attached FE and BE Modules to access cache managed by the on-die memory
controller, or to allow the MPUs to access the Cache Directory or Shared Memory.
ASIC emulation is performed in microcode to provide a consistent interface for the SVOS system software. The ASIC
functions that are emulated include:
Local Router (LR): The sole function of the LR is to transfer commands between the FE and BE Modules and MPUs.
Data Recovery and Reconstruction (DRR): The primary function of the DRR is to perform RAID parity operations, but
it is also responsible for drive formatting and rebuilds (correction copy).
Direct Memory Access (DMA): This DMA function applies only when user data must be transferred between
Controller Blades across an I-Path. An MPU on the source cluster manages the first step of the data transfer and an
MPU on the destination cluster manages the second step of the data transfer.
The embedded Back-end SAS controller is similar to the BE Module used in the G400 and G600 models. Here, the SPCv
12G processor also provides 8 x 12Gbps SAS links, but four of these links are connected to an external SAS 4-Wide port
and the other four links are connected to the internal SAS expander that’s part of the internal drive box.
Cache memory is installed into two DDR3-1600 DIMM slots on each Controller Blade, which are attached to the on-die
memory controller in the Intel Xeon CPU and organized as two independent memory channels. Each channel has a peak
theoretical transfer rate of 12.8GB/s, so the entire cache system has a peak rating of 51.2GB/s. There is a choice of 8GB
or 16GB DIMMs and both Controller Blades must be configured symmetrically. The supported cache size combinations
for each blade are:
16GB (2 x 8GB DIMMs)
32GB (2 x 16GB DIMMs)
Note that the cache memory on the two Controller Blades is concatenated together into one larger global cache image. For "clean" data in cache (data that is a copy of what is already on disk, kept in cache to serve possible future read hits), only one copy is kept in the global space; clean data resides in one Controller Blade's cache memory and is not mirrored across both blades. Only "dirty" data (data recently written by the host that has not yet been written to disk) is duplexed, with a copy retained in each of the two Controller Blades.
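The clean/dirty distinction can be summarized in a short sketch (conceptual Python, not SVOS code): host writes are duplexed into both blades' cache until destaged, while clean data staged for reads occupies cache on only one blade.

```python
class GlobalCache:
    """Toy model of the two-blade global cache: clean data single-copy, dirty data mirrored."""
    def __init__(self):
        self.blade = {1: {}, 2: {}}       # slot_id -> ("clean" | "dirty", data)

    def host_write(self, slot: str, data: bytes) -> None:
        # Dirty data is kept on BOTH blades until it is destaged to disk.
        for b in (1, 2):
            self.blade[b][slot] = ("dirty", data)

    def destage(self, slot: str, keep_on_blade: int) -> None:
        # After the write reaches disk, one clean copy is enough.
        data = self.blade[keep_on_blade][slot][1]
        self.blade[keep_on_blade][slot] = ("clean", data)
        other = 2 if keep_on_blade == 1 else 1
        self.blade[other].pop(slot, None)

    def read_fill(self, slot: str, data: bytes, blade: int) -> None:
        # Data staged from disk to serve read hits is clean and lives on one blade only.
        self.blade[blade][slot] = ("clean", data)

cache = GlobalCache()
cache.host_write("LDEV42:page7", b"...")            # mirrored while dirty
cache.destage("LDEV42:page7", keep_on_blade=1)      # single clean copy remains
print({b: list(slots) for b, slots in cache.blade.items()})
```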
The rest of the auxiliary system management functions are provided via the Platform Controller Hub (PCH) chip. The
PCH is connected to the Intel Xeon CPU via a DMI 2.0 connection, which is electrically comparable to a PCI Express 2.0
x4 link. The PCH itself has a SATA 6Gbps controller built in that interfaces with a 120GB Cache Flash Module (CFM).
The CFM is a normal SATA SSD that is used for backing up the entire contents of cache in the event of a total loss of
power. If there is a partial loss of power to just one cluster, this is the backup target for that cluster’s cache space. In the
case of a planned power off, it is the backup target for just the Shared Memory region. During a power outage, the on-
blade battery power keeps the DIMMs, CFM, and Controller Blade functioning while destage occurs to the flash drive.
There is generally enough battery power to support a couple such outages back-to-back without recharging.
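A tiny decision sketch of the backup behavior described above (hypothetical Python; the event names are illustrative):

```python
def cfm_backup_scope(event: str) -> str:
    """What gets destaged to the Cache Flash Module (CFM) for a given power event."""
    if event == "total_power_loss":
        return "entire cache contents (battery-backed destage on each blade)"
    if event == "single_cluster_power_loss":
        return "that cluster's cache space only"
    if event == "planned_power_off":
        return "Shared Memory (Control Memory) region only"
    return "no backup required"

for e in ("total_power_loss", "single_cluster_power_loss", "planned_power_off"):
    print(e, "->", cfm_backup_scope(e))
```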
The PCH has a PCIe connection to an FPGA (Field Programmable Gate Array) processor that is responsible for
environmental monitoring and processing component failures. The FPGA relies on an environment microcontroller that
has monitoring connections to each of the components on the Controller Blade (FE and BE Modules, Power Supply Units,
Fans, Battery, etc.) as well as an interface to the other Controller Blade in the opposite cluster.
The PCH also has connections to the Gateway for Unified Management (GUM) and a pair of network interface controllers
(LAN Controllers). The GUM is an embedded micro server that provides the management interface that the storage
management software running on the SVP talks to. The LAN Controllers are what provide the public and management
network ports (gigabit Ethernet) on each Controller Blade.
For the VSP G400, the supported cache size combinations for each Controller Blade are:
32GB (4 x 8GB DIMMs)
64GB (4 x 16GB DIMMs)
For the VSP G600, cache memory can be installed into half (Basic features) or all (Basic + Optional features) of the
DDR3-1600 DIMM slots. The resulting combinations of cache sizes and bandwidth per blade are:
64GB (4 x 16GB DIMMs): 51.2GB/s cache bandwidth
128GB (8 x 16GB DIMMs): 102.4GB/s cache bandwidth
As described for the G200, the cache memory on each Controller Blade is concatenated together into a single larger
global cache space, and only one copy of “clean” data is retained in cache memory on one of the Controller Blades. Only
“dirty” data recently written by the host that has not already been destaged to disk is duplexed in cache, with one copy on
each of the Controller Blades for redundancy.
The entire cache system of the G600 with all DIMMs populated has a peak rating of 204.8GB/s (102.4GB/s per Controller
Blade).
The rest of the auxiliary system management functions are provided via the Platform Controller Hub (PCH) chip. The
PCH is connected to the Intel Xeon CPU via a DMI 2.0 connection, which is electrically comparable to a PCI Express 2.0
x4 link. The PCH itself has a SATA 6Gbps controller built in that interfaces with one or two 120GB Cache Flash Modules
(CFM). The CFM is a normal SATA SSD that is used for backing up the entire contents of cache in the event of a total
loss of power. If there is a partial loss of power to just one cluster, this is the backup target for that cluster’s cache space.
In the case of a planned power off, it is the backup target for just the Shared Memory region. During a power outage, the
on-blade battery power keeps the DIMMs, CFM, and Controller Blade functioning while destage occurs to the flash drive.
There is generally enough battery power to support a couple such outages back-to-back without recharging. The G400
will only require a single CFM per Controller Blade. The G600 may require one or two CFMs per blade, depending on
whether half or all the cache DIMM slots are populated.
The PCH has a PCIe connection to an FPGA (Field Programmable Gate Array) processor that is responsible for
environmental monitoring and processing component failures. The FPGA relies on an environment microcontroller that
has monitoring connections to each of the components on the Controller Blade (FE and BE Modules, Power Supply Units,
Fans, Battery, etc.) as well as an interface to the other Controller Blade in the opposite cluster.
The PCH also has connections to the Gateway for Unified Management (GUM) and a pair of network interface controllers
(LAN Controllers). The GUM is an embedded micro server that provides the management interface that the storage
management software running on the SVP talks to. The LAN Controllers are what provide the public and management
network ports (gigabit Ethernet) on each Controller Blade.
The individual LDEV associations to MPU can be looked up and manually changed either by Storage Navigator or by a
script that uses CLI commands from the raidcom utility. There is no automatic load balancing mechanism to move “hot”
LDEVs around among the MPUs in order to even out the processing loads. It is not necessary to keep every LDEV from
the same Parity Group assigned to the same MPU, just as for VSP G1000.
An MPU will accept all I/O requests for an LDEV it owns without regard for which FE port received that host request, or
which BE modules will perform the physical disk operations. Each Cluster contains a local copy of the LDEV-to-MPU
mapping tables so that the LRs can look up which MPU owns which LDEV. As a rule of thumb, the MPs in an MPU
should be kept below 80% busy to manage host latencies, and below 40% busy if processor headroom must be reserved to maintain host performance in case of a failure of another MPU. Note: higher peak utilization during batch operations is
fine as long as the average utilization during the batch window is below 40%.
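The rule of thumb above translates into a simple check (illustrative Python; the thresholds are the 80% and 40% figures from the text):

```python
def mpu_utilization_ok(avg_busy_pct: float, reserve_failover_headroom: bool = True) -> bool:
    """
    Keep MPs below 80% busy for acceptable host latency, and below 40% average
    if headroom must be reserved to absorb the load of a failed MPU.
    Short peaks above these values (e.g. during a batch window) are acceptable
    as long as the average over the window stays within the limit.
    """
    limit = 40.0 if reserve_failover_headroom else 80.0
    return avg_busy_pct < limit

print(mpu_utilization_ok(35.0))                                   # True
print(mpu_utilization_ok(55.0))                                   # False: no failover headroom
print(mpu_utilization_ok(55.0, reserve_failover_headroom=False))  # True: latency-only limit
```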
The FE module is fairly simple, primarily having a host interface processor on the board. The slot that the FE module
plugs into provides power and a PCI Express 3.0 x8 link to an Intel Ivy Bridge EP Xeon processor on the Controller Blade.
All interaction with an FE module is via the LR emulators running on the MPUs in the same Controller Blade. The host
interface processor has four DMA channels for moving data blocks into or out of cache.
Front-end Module: 4 x 8Gbps FC Ports
Figure 6. 4 x 8Gbps FE Module (based on the PMC Sierra Tachyon QE8 chip)
Tachyon Chip
The host interface processor for this module type is a Tachyon processor (single chip) and is used to bridge the Fibre
Channel host connection to a usable form for internal use by the storage controller. The FE modules use the PMC Sierra
PM8032 Tachyon QE8 processor. This high-powered Tachyon processor provides a variety of functions, including:
A conversion of the Fibre Channel transport protocol to the PCIe x8 link for use by one to four controller processors:
SCSI initiator and target mode support
Complete Fibre Channel protocol sequence segmentation or reassembly
Conversion to the PCIe link protocol
Provides simultaneous full duplex operations of each port
Provides four DMA channels for directly writing blocks to system cache
Error detection and reporting
Packet CRC encode/decode offload engine
Auto-sync to a 2Gbps, 4Gbps or 8Gbps port speed per path
The QE8 processors can provide very high levels of performance, as they are connected by a high-performance PCI Express 3.0 x8 link directly to the Intel Ivy Bridge EP Xeon processor. The QE8 processor can drive all four of its
8Gbps ports at full sequential speed. However, for random small block loads, the QE8 controller cannot drive all four
ports at full speed (limited by handshaking overhead).
The figure below provides a generalized internal view of the four-port QE8 processor from the PMC Sierra QE8 literature.
Front-end Module: 2 x 16Gbps FC Ports
This module type has two 16Gbps FC ports that can auto-negotiate down to 8Gbps or 4Gbps rates, depending on what the host port requires. Up to 8 FC ports at 16Gbps per G200 system and up to 16 FC ports at 16Gbps per G400/G600 system (20 ports at 16Gbps for a diskless system) are supported when this FE Module type is used.
Figure 8. 2 x 16Gbps FE Module (based on the QLogic EP8324 "Hilda" chip)
Hilda Chip
The host interface processor for this module type is a “Hilda” processor (single chip) and is used to bridge the Fibre
Channel host connection to a usable form for internal use by the storage controller. Hilda is an internal codename for the
QLogic 8300 series of processors. These FE modules use the QLogic EP8324 chip. This is considered a converged
network controller, as it can operate as a 16Gbps FC controller or as a 10Gbps Ethernet network interface controller,
depending on the firmware that is loaded and the SFPs that are installed. The high level functions provided by this chip
are the same as the Tachyon QE8, with link speeds down to 4Gbps supported per path.
The Hilda processors can provide very high levels of performance, as they are connected by a high-performance PCI Express 3.0 x8 link directly to the Intel Ivy Bridge EP Xeon processor. The EP8324 can drive both of its 16Gbps ports
at full sequential speed. However, for random small block loads it cannot drive both ports at full speed (limited by
handshaking overhead).
Front-end Module: 2 x 10Gbps iSCSI Ports
(Figure: 2 x 10Gbps iSCSI FE Module, based on the QLogic EP8324 "Hilda" chip with onboard DDR3 memory.)
This module type is very similar to the 16Gbps FC module, but provides two 10Gbps iSCSI ports instead. Up to 8 iSCSI ports at 10Gbps per G200 system and up to 16 iSCSI ports at 10Gbps per G400/G600 system are supported when this FE
Module type is used.
Like the 16Gbps FC module, the QLogic EP8324 processor is used here but running firmware to operate as a 10Gbps
network interface controller. In this mode, it provides a variety of functions including:
Full hardware offload engine for IP, TCP, and UDP checksums
Jumbo frame support (9600 bytes)
The PCB for this module includes additional ECC-protected DDR3 DRAM necessary to support the iSCSI connections, and pluggable 10Gbps SFP+ optical transceivers are used.
Back-end Module: 8 x 12Gbps SAS Links
Overview
The Back-end drive controller modules provide the SAS links used to attach to the disks in a set of disk boxes. These
modules are installed into specific slots at the rear of the G400/G600 DKC and are connected to a specific Controller
Blade. There are two types of modules, with a standard version and an encryption version. Each type includes two SAS
Wide ports on the rear panel by which a pair of SAS Wide cables to the first row of two disk boxes (DB-00 to DB-01) are
connected. There are four 12Gbps full duplex SAS links per port, with two ports per module. This module is unique to the
VSP Midrange family of arrays.
Except for a pure virtualization (diskless) configuration, there is one module installed per Controller Blade, with two
modules per VSP G400 or G600 system. This provides 16 x 12Gbps full duplex SAS links (over 4 SAS Wide ports) per
G600 system to support 720 drives. The BE module is fairly simple, primarily having a powerful SAS Protocol Controller
chip (SPCv 12G or SPCve 12G) on the board. The enhanced SPCve 12G processor is used on the encrypting version of
the BE module. All BE modules in a system must either be the standard type or encrypting type. The slot that the BE
module plugs into provides power and a PCI Express 3.0 x8 link directly to the Intel Ivy Bridge EP Xeon processor. All
interaction with a BE module is via the LR emulators running on the MPUs in the same Controller Blade.
The BE modules control all direct interaction to the drives (HDD, SSD, or FMD). Each BE module has a powerful dual-
core SPC that executes all I/O jobs received from the MPUs (via the LR emulators running on the MPUs in that Controller
Blade), managing all reading or writing to the drives. Each BE module provides two SAS Wide ports on the rear panel for
the cable attachment to two disk boxes. Each port is a bundle of four independent 12Gbps SAS links. Each port controls
up to 240 drives (VSP G200, G400) or up to 360 drives (VSP G600) in different “stacks” of disk boxes.
Should a link fail within a port, the SPC will no longer use it, failing over to the other 3 links in that port.
DARE Overview
The VSP Midrange family provides for controller-based (as opposed to drive-based) Data At Rest Encryption (DARE).
There are two types of optional BE modules: standard and encrypting (AES256). In order to enable DARE on one or
more Parity Groups, all BE modules in the system must be of the encrypting type. A license key enables the use of
DARE. Up to 1,024 Data Encryption Keys (DEK) are available, and these are kept in the BE Modules and backed up to
the system disk areas.
Data encryption is per drive (any type) per Parity Group, and must be enabled before any LDEVs are created. Each
drive has a data encryption key. All data blocks contained in a DARE enabled Parity Group are encrypted. DARE must
be disabled on a Parity Group before removing those drives from the system. Drive removal results in the data being
“crypto-shredded”.
Data blocks from a DARE-enabled Parity Group are encrypted and decrypted by the BE module's SAS SPCve controller chip as data is written to or read from the drives. User data is not encrypted or decrypted in cache.
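Conceptually, the BE module encrypts blocks with a per-drive Data Encryption Key on the way to the drive and decrypts them on the way back, so data in cache stays in the clear. The sketch below (hypothetical Python using the third-party cryptography package, with AES-256-GCM standing in for the controller's AES256 implementation) only illustrates the data path, not the actual on-drive format:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# One AES-256 Data Encryption Key (DEK) per drive in the DARE-enabled Parity Group.
deks = {drive: AESGCM.generate_key(bit_length=256) for drive in ("HDD00", "HDD01")}

def write_block(drive: str, block: bytes) -> bytes:
    """BE module path on write: plaintext from cache is encrypted per drive."""
    nonce = os.urandom(12)
    return nonce + AESGCM(deks[drive]).encrypt(nonce, block, None)

def read_block(drive: str, stored: bytes) -> bytes:
    """BE module path on read: ciphertext from the drive is decrypted into cache."""
    nonce, ciphertext = stored[:12], stored[12:]
    return AESGCM(deks[drive]).decrypt(nonce, ciphertext, None)

stored = write_block("HDD00", b"user data block")
print(read_block("HDD00", stored))        # b'user data block'
# Discarding deks["HDD00"] renders everything on that drive unreadable ("crypto-shredded").
```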
Non-disruptive field upgrades of the BE modules are possible. This should be scheduled during periods of low activity to
minimize any performance impact during the upgrade.
According to factory testing, the performance impact of the encryption (write) and decryption (read) process is minimal.
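The relationship between Parity Groups, per-drive Data Encryption Keys, and crypto-shredding can be illustrated with a small conceptual sketch (Python; the class and method names are hypothetical and this is not the actual key-management implementation). Discarding a drive's DEK is what renders its ciphertext unrecoverable.

import os

class DareKeyStore:
    """Conceptual model only: one Data Encryption Key (DEK) per drive,
    held by the encrypting BE modules. Hypothetical names; the real
    key-management implementation differs."""
    MAX_DEKS = 1024                       # system-wide DEK limit noted above

    def __init__(self):
        self._deks = {}                   # drive_id -> 256-bit key material

    def enable_dare(self, parity_group_drives):
        """Assign a DEK per drive; DARE must be enabled before LDEVs are created."""
        if len(self._deks) + len(parity_group_drives) > self.MAX_DEKS:
            raise RuntimeError("DEK limit exceeded")
        for drive in parity_group_drives:
            self._deks[drive] = os.urandom(32)   # stand-in for an AES-256 key

    def crypto_shred(self, parity_group_drives):
        """Disabling DARE before drive removal discards the keys, leaving
        any on-disk ciphertext unreadable (crypto-shredded)."""
        for drive in parity_group_drives:
            self._deks.pop(drive, None)

store = DareKeyStore()
pg1 = ["PG1-HDD00", "PG1-HDD01", "PG1-HDD02", "PG1-HDD03"]
store.enable_dare(pg1)
store.crypto_shred(pg1)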
The DBS box holds up to 24 of the 2.5” SAS drives. These include all the 10K HDD, 15K HDD, and SSD options currently
available. The DBL holds up to 12 of the 3.5” SAS drives. These are available as 4TB 7200 RPM drives. The DB60
drawer holds up to 60 of the 4TB drives or optionally the 400GB SSD or 1.2TB 10K HDD with a special LFF conversion
canister. The DBF box holds 12 of the proprietary Hitachi Flash Module Drives, which are available in 1.6TB and 3.2TB
capacities.
Figure 14. Map of the BE Module Ports to DBs (G400/G600)
[Figure 14: BE-1H (Cluster 1) and BE-2H (Cluster 2), each with ports 0 and 1, cabled to two stacks of disk boxes – DB-00/DB-01 (DBF), DB-02/DB-03 (DBS), DB-04/DB-05 (DBL), and DB-06/DB-07 (DB60) in this example.]
24-Disk SFF Tray (DBS)
Figure 14 depicts a simplified view of the 2U 24-disk 2.5” SAS disk box. Inside each 24-disk tray is a pair of SAS
“expander” switches. These may be viewed as two 32-port SAS switches attached to the two SAS wide cables coming
from the BE Module ports, or from the next drive box in the stack in the direction of the controller. The two expanders
cross connect the dual-ported disks to all eight of the active 12Gbps SAS links that pass through each box. Any of the
available 2.5” HDD or SSDs may be intermixed in these trays, including a mix of 6Gbps and 12Gbps SAS interface drives.
The drives are arranged as one row of 24 slots for drive canisters. All drives within each box are dual attached, with one
full duplex drive port going to each switch.
[Figure: DBS tray internals – the two SAS expanders (ENCs), one per Cluster, cross connect the BE-1H0 and BE-2H0 SAS links to SFF drive slots 0 through 23, with onward links to the next disk box in the stack.]
12-Disk LFF Tray (DBL)
Figure 15 depicts a simplified view of the 2U 12-disk 3.5” SAS disk box. Inside each 12-disk tray is a pair of SAS
expander switches. These may be viewed as two 20-port SAS switches attached to the two SAS wide cables coming
from the BE Module ports, or from the next drive box in the stack in the direction of the controller. The two expanders
cross connect the dual-ported disks to all eight of the active 12Gbps SAS links that pass through each box. Any of the
available 3.5” HDDs may be intermixed in these trays, including a mix of 6Gbps and 12Gbps SAS interface drives. The
drives are arranged as three rows of 4 slots each for drive canisters. All drives within each box are dual attached, with
one full duplex drive port going to each switch.
[Figure: DBL tray internals – the two SAS expanders (ENCs), one per Cluster, cross connect the BE-1H0 and BE-2H0 SAS links to LFF drive slots 0 through 11, with onward links to the next disk box in the stack.]
60-Disk Dense LFF Drawer (DB60)
The high density 4U LFF disk drawer shown in Figure 16 has 60 3.5” vertical disk slots. Inside each 60-disk drawer is a
pair of SAS expander switches. These may be viewed as two 68-port SAS switches attached to the two SAS wide cables
coming from the BE Module ports, or from the next drive box in the stack in the direction of the controller. The two
expanders cross connect the dual-ported disks to all eight of the active 12Gbps SAS links that pass through each box.
Any of the available 3.5” HDDs may be intermixed in these trays, including a mix of 6Gbps and 12Gbps SAS interface
drives. In addition, there are specially ordered SFF drives (1.2TB 10K RPM and 400GB SSD) that ship with special
canisters to fit these LFF slots. This drawer operates as a single disk box, with the drives arranged as 5 rows of 12 slots
each for drive canisters. All drives within each box are dual attached, with one full duplex drive port going to each switch.
[Figure: DB60 drawer internals – the two SAS expanders (ENCs), one per Cluster, cross connect the BE-1H0 and BE-2H0 SAS links to LFF drive slots 0 through 59, with onward links to the next disk box in the stack.]
12-Disk FMD Tray (DBF)
Figure 17 depicts a simplified view of the 2U 12-slot DBF box details. Inside each 12-FMD module box there are two SAS
expander switches. These may be viewed as two 20-port SAS switches attached to the two SAS wide cables (8 SAS
links) coming from the BE Module ports, or from the next drive box in the stack in the direction of the controller. The two
expanders cross connect the Flash Module Drives (FMD) to all eight of the active 12Gbps SAS links that pass through
each box. The FMDs are arranged as four rows of 3 slots each. All FMDs within each box are quad attached, with two
full duplex drive ports going to each switch.
[Figure: DBF box internals – the two SAS expanders (ENCs), one per Cluster, cross connect the BE-1H0 and BE-2H0 SAS links to the FMD slots, with onward links to the next disk box in the stack.]
Drive Details
Table 3 lists the maximum drive count for each VSP Midrange model and drive type.
Drive Type | RPM | Form Factor | Port Speed | Advertised Size (GB) | Raw Size (GB) | Nominal Random Read IOPS
The type and quantity of drives selected for use in the VSP Midrange and the RAID levels chosen for those drives will
vary according to an analysis of the customer workload mix, cost limits, application performance targets and the usable
protected capacity requirements. The use of 6Gbps SAS drives does not affect the overall speed of the 12Gbps SAS
backend, as each drive will operate at its native interface speed. In general, the SAS interface is not the bottleneck for
individual drive performance, so the interface speed should not be used to choose one drive type over another.
SAS HDD
Hard Disk Drives (HDD) are still the primary drives used in both enterprise and midrange storage systems from Hitachi.
These are the workhorse drives, having both relatively high performance and good capacity. They fall in between the
random performance levels of SSDs and SATA HDDs, but are much closer to the SATA end of the performance
spectrum. SAS HDDs come in three rotation speeds: 15K RPM, 10K RPM, and 7.2K RPM. All three have about the
same sequential read throughput when comparing the same RAID level. The SAS interface speed of most of the
supported HDDs is 6Gbps, with a couple that support 12Gbps.
SAS drives are designed for high performance, having dual host ports and large caches. The dual host ports allow four
concurrent interactions with the attached BE modules. Both ports can send and receive data in full duplex at the same
time, communicating with the DRAM buffer in the drive. However, only one data transfer at a time can take place
internally between the DRAM buffer in the drive and the serial read/write heads.
SAS drives may be low level formatted to several different native sector sizes, for instance 512, 520, 524, or 528 bytes.
Hitachi uses the 520 byte sector size internally, while all host I/Os to the storage system are done using the usual 512
byte host sector size.
SSD
Solid State Drives (SSD), as a storage device technology, have reached a level of maturity and market adoption that belies the small niche they once occupied. By replacing spinning magnetic platters and read/write heads with a non-rotating NAND
flash array managed by a flash translation layer, the SSD is able to achieve very high performance (IOPS) with extremely
low response times (often less than 1ms). It is these two characteristics that make them a viable choice for some
workloads and environments, especially as the top tier of an HDT configuration. Price/capacity ($ per GB) and
price/performance ($ per IOPS) are two areas where SSDs typically do not compare well to HDDs.
Small block random read workloads are where SSDs perform their best and often justify their high cost. They are far less
cost effective for large block sequential transfers compared to HDDs, since the ultimate bottleneck is host or internal path
bandwidth and not usually the performance of the individual storage device. The SAS interface speed on all supported
SSDs is 12Gbps.
FMD
Hitachi’s custom Flash Module Drives could be described as turbo-charged SSDs. Each FMD has much higher
performance than an SSD due to much higher internal processing power and degree of parallelism. The drive choices
include a 1.6TB and a 3.2TB module. Inside each FMD is a 4-core ARM processor and ASIC (the Advanced Switch Flash
Controller – ASFC) that controls the internal operations, the four 6Gbps SAS ports, and the 128 NAND flash memory
chips. Hitachi FMDs have more cores and more independent parallel access paths to NAND flash chips than standard
SSDs. There is a considerable amount of logic in the ASFC that manages the flash space and write handling, which substantially increases the write rate of an FMD over a standard SSD.
Cache Memory Structure
The VSP Midrange family has a single physical memory system comprised of all the memory DIMMs installed on both
Controller Blades. There are two regions of cache memory, also known as Cache Sides, corresponding to the set of
memory DIMMs located on a given Controller Blade. For instance, Cache Side A corresponds to the set of DIMMs
populated on Controller Blade 1 (Cluster 1). Likewise, Cache Side B corresponds to the set of DIMMs populated on
Controller Blade 2 (Cluster 2).
Cache space is managed in the same manner as on the VSP G1000, with 64KB cache segments mapped to an MPU for
host blocks for a specific LDEV. Each MPU maintains its own Write Pending queue which tracks its current set of dirty
cache segments.
The top level management of the cache system is the Global Free List that manages all of the 64KB cache segments
that make up the User Data Cache area. Each MPU also operates a Local Free List (of all of the 64KB segments
currently allocated to it from the master Global Free List) from which it privately draws and returns segments for individual
I/O operations on its LDEVs. If the system determines that an MPU (based on its recent workloads) holds excess 64KB
segments, some of these 64KB segments are pulled back to the system’s Global Free List for reallocation to another MPU
that needs them.
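The Global/Local Free List interaction described above can be sketched as follows (Python; the batch size, reclaim threshold, and reclaim policy are invented for illustration and are not the SVOS algorithms).

from collections import deque

SEGMENT_SIZE = 64 * 1024      # the User Data Cache is managed in 64KB segments

class CacheFreeLists:
    """Toy model of the Global Free List plus per-MPU Local Free Lists."""

    def __init__(self, total_segments, mpus):
        self.global_free = deque(range(total_segments))
        self.local_free = {mpu: deque() for mpu in mpus}

    def allocate(self, mpu):
        """An MPU draws privately from its Local Free List, refilling in
        batches from the Global Free List when it runs dry."""
        if not self.local_free[mpu]:
            for _ in range(min(128, len(self.global_free))):
                self.local_free[mpu].append(self.global_free.popleft())
        if not self.local_free[mpu]:
            raise MemoryError("no free 64KB cache segments available")
        return self.local_free[mpu].popleft()

    def free(self, mpu, segment):
        self.local_free[mpu].append(segment)

    def rebalance(self, mpu, keep=256):
        """If an MPU holds excess segments, pull some back to the Global Free List."""
        while len(self.local_free[mpu]) > keep:
            self.global_free.append(self.local_free[mpu].pop())

lists = CacheFreeLists(total_segments=4096,
                       mpus=["MPU-10", "MPU-11", "MPU-20", "MPU-21"])
seg = lists.allocate("MPU-10")
lists.free("MPU-10", seg)
lists.rebalance("MPU-10")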
User Data Cache (CM) – All space not allocated for the previous functions that is set aside for general use by the four
MPUs for user data blocks. CM is interleaved across both Cache Sides, and cache segment allocation is generally performed on a round-robin basis to balance data distribution. However, there are complex algorithms that factor in the
workload pattern and cache hit rate to place user data in the optimal location such that data transfer between the
Controller Blades (via I-Path) is minimized.
The combined size of LM, PM, and DXBF is fixed and does not change based on the amount of cache memory or
program products enabled. For the VSP G200 this fixed capacity is 12GB total (6GB per Cache Side), and for VSP
G400/G600 this fixed capacity is 22GB (11GB per Cache Side).
The size of Shared Memory depends on which program products are enabled and the desired maximum configurable
capacity of each program product. The Cache Directory size is dependent on the size of the User Data Cache area.
On a planned shutdown, Shared Memory is copied to the cache backup SSDs (CFM) on the Controller Blades. In the
event of power loss, the entire contents of cache (including all Write Pending segments) will be copied to the CFMs on the
Controller Blades.
[Figure: Cache memory structure – Controller Blade 1 (Cluster 1) and Controller Blade 2 (Cluster 2) are connected by the NTB I-Paths; MPU-10/MPU-11 and MPU-20/MPU-21 each have four cores, are joined by QPI within a blade, and attach to the cache DIMMs installed on their blade.]
All DIMMs within a CMG must be the same size. In addition, CMG0 must be installed before CMG1 and the DIMM
capacity must be the same across both CMGs. Recall that the two Controller Blades must be symmetrically configured,
so this means that a mix of 8GB and 16GB DIMMs is not permitted regardless of model.
[Figure: Controller Blade DIMM placement – DIMMs 00 and 01 form CMG0 attached to a CPU; on the two-CPU Controller Blades, each CPU has its own CMGs.]
Cache Logical Partitions (CLPR)
Each VSP Midrange system has a base or default CLPR 0 that includes all available User Data Cache space. Optional
CLPRs may be established (by reducing the size of CLPR 0) to manage workloads on certain Parity Groups. All of the
LDEVs from an individual Parity Group must be assigned to the same CLPR, but they can be managed by different
MPUs. All Parity Groups are assigned to CLPR 0 by default. CLPR 0 will use the entire User Data Cache area as a pool
of 64KB segments to be allocated to the four MPUs. The space within CLPR 0 (less a small reserve for borrowing) is
divided up among the four MPUs.
When creating additional CLPRs, the original User Data Cache area is essentially being split into multiple areas of sizes
specified by the user, where that space is only used for those Parity Groups mapped to that CLPR. So cache usage by
application or by storage tier (and external LDEVs) may be controlled in this fashion.
The space in each new CLPR created will always be shared by all four MPUs, with each starting off with a fixed share of
that space whether or not they have LDEVs assigned there. There is a minimum footprint per CLPR per MPU that will
always remain regardless of usage. For example, if CLPR 1 is 16GB in size but only one MPU has LDEVs assigned to
this CLPR, 12GB will initially go unused. Over time, the busy MPU will borrow cache within CLPR 1 from the idle MPUs,
but there will always be some minimum reserved capacity that will be untouched.
Each CLPR has its own space and queue management mechanisms. As such, each CLPR will have its own Free, Clean,
and Dirty queues and each MPU maintains a separate Write Pending counter per CLPR. It is important for the storage
administrator to drill down and monitor each individual MPU-CLPR Write Pending counter, as the global Write Pending
counter is an average across all MPU-CLPR pairings. This can hide the fact that some MPU-CLPR pairings have hit 70%
Write Pending (emergency destage and inflow control) while others are sitting at 0% Write Pending.
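The monitoring point above can be illustrated with a short sketch (Python; the counter values are illustrative, not taken from an actual system): a healthy-looking global average can coexist with an MPU-CLPR pairing already at the 70% emergency destage level.

# Per-MPU, per-CLPR Write Pending rates (percent); values are illustrative only.
write_pending = {
    ("MPU-10", "CLPR0"): 12.0, ("MPU-10", "CLPR1"): 71.5,
    ("MPU-11", "CLPR0"):  8.0, ("MPU-11", "CLPR1"):  0.0,
    ("MPU-20", "CLPR0"): 15.0, ("MPU-20", "CLPR1"):  2.0,
    ("MPU-21", "CLPR0"): 10.0, ("MPU-21", "CLPR1"):  0.5,
}

EMERGENCY_DESTAGE = 70.0   # level at which emergency destage / inflow control begins

global_avg = sum(write_pending.values()) / len(write_pending)
print(f"Global average Write Pending: {global_avg:.1f}%")   # looks healthy (~14.9%)

for (mpu, clpr), pct in sorted(write_pending.items()):
    if pct >= EMERGENCY_DESTAGE:
        print(f"ALERT: {mpu}/{clpr} at {pct:.1f}% Write Pending")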
Each Controller Blade hosts one or two Intel Ivy Bridge EP Xeon processors. These are organized into MPU logical units
and are where the vast majority of the activity takes place in this architecture. The MPUs are responsible for executing
the system software, otherwise known as the Storage Virtualization Operating System (SVOS). This is where the basic
I/O management and processing tasks are executed, along with any virtualization or replication software in use.
The Controller Blades cross-connect with a pair of external links (I-Paths) which form a pair of PCI Express Non-
Transparent Bridges (NTB). The I-Paths use PCI Express 3.0 lanes provided by the embedded PCI Express switch in the
Intel Xeon processors. A VSP Midrange I-Path is considered non-transparent because it provides a limited interface over
which command messaging and user data transfer can occur. This contrasts with the transparent bridging ability of the I-
Paths in the HUS VM design. There, the HM ASICs provide a direct data transfer path across the I-Paths between MAIN
blades, such that user data in cache on Cluster 2 can be DMA transferred and sent via I-Path out to a Front-end port on
Cluster 1. Here, user data transfers across I-Paths always occur in two steps. First, the data is DMA transferred from
cache and sent across an I-Path into a Data Transfer Buffer (DXBF) on the other Cluster. Then a DMA transfer is
performed on the other Cluster to move the data in the DXBF to its destination, either to an attached host via FE Module
port or to Back-end disk via BE Module. Due to the extra overhead involved in these two step I-Path transfers, SVOS
algorithms optimize the placement of data in cache to minimize I-Path utilization.
Within a Controller Blade, the two Intel Xeon processors are connected with a QuickPath Interconnect (QPI) which is a
high speed, low latency path that functions effectively as a transparent bridge. Either CPU can access the DDR3 memory
channels, PCI Express paths, and FE and BE Modules physically connected to its partner CPU with virtually no latency.
QPI is not used on the VSP G200 Controller Blade since there is only a single Intel Xeon CPU installed.
The FE module Tachyon chips communicate with the LRs on the same Controller Blade. Each MPU in the system runs
an ASIC emulator that executes traditional hardware ASIC functions in microcode on the MPU itself. One of these
emulated functions is the Local Router function. I/O requests are placed in the LR emulator queue on the same cluster as
the receiving FE module and the first available LR, running on either MPU, pulls the request off the queue then creates
and hands off a “job” to the owning MPU to execute. The LR emulators have tables that contain maps of LDEV-to-MPU
ownership. If the owning MPU is in the same Cluster, then this is considered a Straight I/O. If the owning MPU is in the
other Cluster, the job is forwarded to that MPU via one of the I-Paths, and this is considered a Cross I/O. The owning
MPU responds to the LR with a cache address (whether for a read hit, a read miss, or a write buffer). For a read miss or a write request, the MPU will spawn one or more additional jobs and dispatch these to the one or more BE Module SPC
Processors that will then perform the requested physical disk read or write. For writes, the BE Modules use the DRR
emulator function running on the owning MPU to generate the parity blocks required and then write these blocks to the
appropriate disks. There will be two disk updates required for RAID-10 or RAID-5, and three updates for RAID-6.
If the data is in cache on the cluster opposite of the FE Module that received the host request, the owning MPU will spawn
a data transfer job to read the data from User Data cache and copy it into the FE DXBF on the opposite Cluster via I-Path.
The owning MPU then passes the addresses in the FE DXBF to the FE Tachyon (via LR). The Tachyon then uses one of
its DMA channels to read the FE DXBF address and transmit the requested blocks to the host.
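A minimal sketch of the Straight versus Cross classification follows (Python; the ownership table and function are hypothetical simplifications of the LR emulator tables).

# Hypothetical LDEV-to-MPU ownership map as held by the LR emulator tables.
ldev_owner = {0x00: "MPU-10", 0x01: "MPU-11", 0x02: "MPU-20", 0x03: "MPU-21"}
cluster_of = {"MPU-10": 1, "MPU-11": 1, "MPU-20": 2, "MPU-21": 2}

def route(ldev, receiving_fe_cluster):
    """Classify a host I/O as Straight (owning MPU in the same cluster as the
    receiving FE module) or Cross (job forwarded to the owner over an I-Path)."""
    owner = ldev_owner[ldev]
    kind = "Straight" if cluster_of[owner] == receiving_fe_cluster else "Cross"
    return owner, kind

print(route(0x00, receiving_fe_cluster=1))   # ('MPU-10', 'Straight')
print(route(0x02, receiving_fe_cluster=1))   # ('MPU-20', 'Cross')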
Read Miss (from disk)
For a read miss (data not found in cache), the owning MPU creates a second job that schedules a read operation from disk by one of the BE modules into a target User Data cache address specified by the MPU. How the target cache location is selected is based on the I/O type and the LUN cache hit rate:
- For I/O that is judged to be Sequential in nature, the target cache location is on the same cluster as the FE Module that received the host request. Since I/O that is judged to be Sequential triggers the system to do Sequential prefetch, this reduces the amount of data that needs to be sent across the I-Paths.
- For I/O that is judged to be non-Sequential to a LUN with a low cache hit ratio, the target cache location is on the same cluster as the FE Module that received the host request. This is done to avoid incurring the overhead of transferring data across the I-Paths.
- For I/O that is judged to be non-Sequential to a LUN with a high cache hit ratio, the target cache location is determined on a round robin basis, alternating between Clusters 1 and 2. Since the likelihood of this data being read again by the host is high, this aims to balance cache allocation and read hits across the entire User Data cache space.
The appropriate BE SPC on the same cluster as the target cache location can directly push the requested data via DMA
into cache. After the back end read is completed, the same process as for Read Hits occurs, with the MPU passing the
requested cache address to the FE Tachyon, or first initiating a data transfer job via I-Path then passing the FE DXBF
address to the FE Tachyon. Ultimately, the FE Tachyon uses its DMA to transfer the requested blocks and complete the
host I/O.
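The three placement rules above can be condensed into a small decision function (Python; the hit-ratio threshold and this interface are assumptions, not SVOS internals).

import itertools

_round_robin = itertools.cycle([1, 2])   # alternate between Cluster 1 and Cluster 2

def target_cache_cluster(sequential, lun_hit_ratio, fe_cluster, high_hit=0.5):
    """Sketch of the read-miss cache placement rules described above."""
    if sequential:
        return fe_cluster            # keep prefetched data near the receiving FE module
    if lun_hit_ratio < high_hit:
        return fe_cluster            # avoid I-Path transfer overhead
    return next(_round_robin)        # likely to be re-read: balance across both Cache Sides

print(target_cache_cluster(sequential=True,  lun_hit_ratio=0.1, fe_cluster=2))  # 2
print(target_cache_cluster(sequential=False, lun_hit_ratio=0.9, fe_cluster=1))  # 1 or 2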
The Back-end Write process begins when the owning MPU creates a back-end job for that LDEV which it routes via LR to
the BE Modules that will perform the writes (on either Cluster, depending on where the necessary BE Modules are
located). The BE Module SAS SPCs will then receive the MPU commands to operate on this user data via LR, and the
SPCs can directly access the data blocks in cache via their internal DMA channels.
For RAID-10 LDEVs there will be two write operations, potentially on two BE Modules, read directly from the User Data
region of cache by the BE Module(s). For RAID-5 LDEVs, there will be two reads (old data, old parity) and two writes
(new data and new parity), probably on two BE Modules. For RAID-6 LDEVs there will be three reads (old data, old P, old
Q) and three writes (new data, new P, new Q) using two or more BE Modules. [Note: RAID-6 has one parity chunk (P)
and one redundancy chunk (Q) that is calculated with a more complicated algorithm.]
In the case of a RAID-5 or RAID-6 full-stripe write, where the new data completely covers the width of a RAID stripe, the
read old data and read old parity operations are skipped. For a RAID-5 full-stripe write, there will only be two operations
(write new data and write new parity). For a RAID-6 full-stripe write, there will only be three operations (write new data,
write new P, write new Q).
The parity generation operation must be done on the same cluster as the owning MPU. For the case of a full stripe write,
this could be as simple as the DRR emulator on the owning MPU just calculating new parity on the new data. Or it could
become quite complex if some of the old data or old parity reside in the opposite side of cache in the other cluster. In this
case, it could require one or more data transfer jobs utilizing the I-Paths to get everything into the owning MPU’s side of
cache before the DRR emulator could then calculate the new parity.
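The per-write back-end operation counts described above (and summarized later in Table 6) can be expressed as a short helper (Python; a counting sketch only, not the actual job scheduling, and it ignores coalescing and cache effects).

def backend_ops_for_write(raid_level, full_stripe=False):
    """Back-end drive operations per host write, per the description above.
    Returns (reads, writes)."""
    if raid_level == "RAID-10":
        return (0, 2)                               # write data to both mirror copies
    if raid_level == "RAID-5":
        return (0, 2) if full_stripe else (2, 2)    # full stripe skips old data/parity reads
    if raid_level == "RAID-6":
        return (0, 3) if full_stripe else (3, 3)    # old data, old P, old Q
    raise ValueError(raid_level)

for level in ("RAID-10", "RAID-5", "RAID-6"):
    r, w = backend_ops_for_write(level)
    print(f"{level}: {r} reads + {w} writes = {r + w} drive operations per host write")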
Failure Modes
This section explores the different component failure scenarios and describes the impact to system functionality in each
case. Note that a single component failure will never cause a VSP Midrange array to go offline entirely.
Controller Blade Failure
If an entire Controller Blade fails, LDEV ownerships will automatically be taken over by the MPUs on the other remaining Controller Blade. Each MPU is
connected to a “partner” MPU in the other Cluster via I-Path and LDEV ownership is transferred based on these
partnerships:
MPU-10 ↔ MPU-20
MPU-11 ↔ MPU-21
I/O Module Failure
FE Modules
If an FE Module port fails then the individual port is blocked and there is no impact to the other ports on the same module.
If the entire module fails then all the ports will become blocked. In either scenario, if multipath software is used then there
will be no impact to the attached hosts.
BE Modules
If a SAS port fails on a BE Module it will become blocked and the microcode will use the SAS links on the other Cluster to
access drives on the affected backend path. If the entire module fails then both SAS ports are blocked and the microcode
will use the SAS links on the other Cluster to access drives on the affected backend paths.
DIMM Failure
When the system is running, microcode is stored in the Local Memory (LM) area of cache on each Cache Side. Since
Local Memory is interleaved across all the installed DIMMs on the Controller Blade, a DIMM failure means the microcode
becomes inaccessible on that Controller Blade. So a DIMM failure leads to the entire Controller Blade becoming blocked
and the net effect is the same as a Controller Blade failure.
MP Failure
If an individual CPU core fails, then that MP becomes blocked and there is no other impact to the system, aside from the
loss of processing capability of that failed core. If all the cores comprising an MPU fail then this constitutes an MPU
failure.
MPU Failure
An MPU failure represents a loss of processing cores while PCI Express paths and memory buses are unaffected. But
since the failed MPU is blocked, there is no way for it to initiate or respond to requests from its partner MPU via I-Path.
Architecturally, it is impossible for the partner MPU to use the QPI on its side to gain access to the other remaining I-Path,
so this means both the failed MPU and its partner MPU are blocked.
LDEV ownerships will automatically be taken over by the other MPU on each Controller Blade:
MPU-10 ↔ MPU-11
MPU-20 ↔ MPU-21
CPU Failure
If an Intel Xeon processor fails, it causes an MPU failure as well as the loss of PCI Express paths and memory buses.
The loss of memory channels triggers one or more cache DIMM failures, which results in a Controller Blade failure as described earlier.
Drive Failure
If a drive fails, the Parity Group is placed in correction access mode and data on the failed drive is read from one or more
remaining drives in the Parity Group. All LDEVs within the Parity Group will remain online and accessible by hosts.
While this scenario is alarming, the use of spare drives and RAID-6 makes the chance of data loss in a production
environment highly unlikely.
If both ENCs or both Power Supply Units (PSU) fail, then the entire drive box becomes blocked. All drive boxes daisy
chained behind the failed one will also become blocked. Parity Groups with drives in the blocked trays will remain online if
RAID integrity can be maintained. Parity Groups that cannot maintain RAID integrity will be blocked, as well as their
LDEVs, but user data will remain intact.
Loss of Power
In the event of a planned shutdown, dirty data in cache will be written to disk and Shared Memory will be written to the
CFM SSDs. But in the case of a total loss of power the entire contents of cache, including dirty data that is in the Write
Pending queues, will be written to the CFM SSDs using battery power. When power is restored, the array will return to its
state at the point of power loss and the onboard batteries will recharge.
The actual physical port limit is affected by the current request stack for each MPU. Each Controller Blade has a tag limit
of 64k entries for all of the LDEVs managed by its two MPUs from all host ports in use. So it is difficult to state what the
actual individual per-port limit might be as this is a dynamic value across the four MPUs. We generally use 2048 tags per
port as a rule of thumb to work with.
LUN Queue Depth is the maximum number of outstanding I/O requests (tags) per LUN on the host port within the
server. This is distinctly separate from the system’s port Maximum I/O Request Limit. A LUN queue depth is normally
associated with random I/O workloads and high server thread counts since sequential workloads are usually operating
with a very low thread count and large block size.
On the VSP Midrange, the per-LDEV rule-of-thumb queue depth is 32 per port but can be much higher (no real limit) in
the absence of other workloads on that FC port in use for that LDEV. There isn’t a queue depth limit per LDEV as such,
but the usable limit will depend upon that LDEV’s Parity Group’s ability to keep up with demand.
In the case of external (virtualized) LUNs, the individual LDEV queue depth per external path can be set to a value from 2 to 128 in the
Universal Volume Manager GUI. Increasing the default queue depth value for external LUNs from 8 up to 32 can often
have a very positive effect (especially on response time) on OLTP-like workloads. The overall limits (maximum active
commands per port and maximum active commands per external storage array) are not hard set values, but depend on
several factors. Please refer to Appendix A of the VSP G1000 External Storage Performance paper for an in-depth
explanation of how these values are calculated.
The external mode is how other storage systems are attached to FE ports and virtualized. Front-end ports from the
external (secondary) system are attached to some number of VSP Midrange FE ports as if the VSP Midrange were a
Windows server. In essence, those FC ports are operated as though they were a type of back-end port. LUNs from
attached systems that are visible on these external ports are then remapped by the VSP Midrange out through other
specified FC ports that are attached to hosts. Each such external LUN will become an internal VSP Midrange LDEV that
is managed by one of the four MPUs.
As I/O requests arrive over host-attached ports on the VSP Midrange for an external LUN, the normal routing operations
within the FE module occur. The request is managed by the owner MPU in (mostly) the Local Memory working set of the
global Control Memory tables, and the data blocks go into that managing MPU’s data cache region. But the request is
then rerouted to one or more other FC ports (not to BEs) that control the external paths where that LUN is located. The
external system processes the request internally as though it were talking to a server instead of the VSP Midrange.
The two Replication MCU and RCU port modes are for use with the Hitachi Universal Replicator and TrueCopy Sync software that connects two Hitachi enterprise systems together for replication.
The external port represents itself as a Windows server to the external system. Therefore, when configuring the host type
on the ports on the virtualized storage system, it must be set to “Windows mode”.
RAID-5 is a group of drives (RAID Group or Parity Group) with the space of one drive used for the rotating parity chunk
per RAID stripe (row of chunks across the set of drives). If using a 7D+1P configuration (7 data drives, 1 parity drive),
then you get 87.5% capacity utilization for user data blocks out of that RAID Group.
RAID-6 is RAID-5 with a second parity drive for a second unique parity block. The second parity block includes all of the
data chunks plus the first parity chunk for that row. This would be indicated as a 6D+2P construction (75% capacity
utilization) if using 8 drives, or 14D+2P if using 16 drives (87.5% capacity utilization as with RAID-5 7D+1P). Note that in
this case we use the term “parity” for the Q block in RAID-6, even though the calculation of the Q block is much more
complicated mathematically.
RAID-10 is a mirroring and striping mechanism. First, individual pairs of drives are placed into a mirror state. Then two of
these pairs are used in a simple RAID-0 stripe. As there are four drives in the RAID Group, this would be represented as
RAID-10 (2D+2D) and have 50% capacity utilization.
The VSP Midrange, like the VSP G1000 and earlier Hitachi enterprise subsystems, has a RAID-10 Parity Group type
called 4D+4D. Although we use the term 4+4, in the VSP Midrange, a 4+4 is actually a set of two 2+2 parity groups that
are RAID stripe interleaved on the same 8 drives. Thus the maximum size LDEV that you can create on a 2+2 is the
same as the maximum size LDEV that you can create on a 4+4, although on the 4+4 you can create two of them, one on
each 2+2 interleave.
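The capacity utilization figures quoted above follow directly from the ratio of data drives to total drives, as the short sketch below shows (Python; illustrative only).

def raid_capacity_utilization(data_drives, redundancy_drives):
    """Fraction of raw Parity Group capacity usable for user data."""
    return data_drives / (data_drives + redundancy_drives)

examples = {
    "RAID-5 7D+1P":   (7, 1),    # 87.5%
    "RAID-6 6D+2P":   (6, 2),    # 75.0%
    "RAID-6 14D+2P":  (14, 2),   # 87.5%
    "RAID-10 2D+2D":  (2, 2),    # 50.0%
    "RAID-10 4D+4D":  (4, 4),    # 50.0% (two interleaved 2D+2D groups)
}

for name, (d, r) in examples.items():
    print(f"{name}: {raid_capacity_utilization(d, r):.1%}")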
RAID-10 is not the same as RAID-0+1, although usage by many would lead one to think this is the case. RAID-0+1 is a
RAID-0 stripe of N-disks mirrored to another RAID-0 stripe of N-disks. This would also be shown as 2D+2D for a 4-disk
construction. However, if one drive fails, that entire RAID-0 stripe fails and the mirror is broken, leaving the user with a single
unprotected RAID-0 group. In the case of real RAID-10, one drive of each mirror pair would have to fail before getting into
this same unprotected state.
The factors in determining which RAID level to use are cost, reliability, and performance. Table 5 shows the major
benefits and disadvantage of each RAID type. Each type provides its own unique set of benefits so a clear understanding
of your customer’s requirements is crucial in this decision.
Another characteristic of RAID is the idea of “write penalty”. Each type of RAID has a different back end physical drive I/O
cost, determined by the mechanism of that RAID level. The table below illustrates the trade-offs between the various
RAID levels for write operations. There are additional physical drive reads and writes for every application write due to
the use of mirrors or XOR parity.
Note that larger drives are usually deployed with RAID-6 to protect against a second drive failure within the Parity Group
during the lengthy drive rebuild of a failed drive.
Table 6. RAID Write Penalties
RAID Level | Application Writes | Back-end Drive I/Os
RAID-10 | 1 | 2
RAID-5 | 1 | 4
RAID-6 | 1 | 6
Note that some industry usage replaces chunk with “stripe size”, “stripe depth”, or “interleave factor”, and stripe size with
“stripe width”, “row width” or “row size”.
Note that on all current RAID systems, the chunk is a primary unit of protection and layout management: either the parity
or mirror mechanism. I/O is not performed on a chunk basis as is commonly thought. On Open Systems, the entire
space presented by a LUN is a contiguous span of 512 byte blocks, known as the Logical Block Address range (LBA).
The host application makes I/O requests using some native request size (such as a file system block size), and this is
passed down to the storage as a unique I/O request. The request has the starting address (of a 512 byte block) and a
length (such as the file system 8KB block size). The storage system will locate that address within that LUN to a
particular drive address, and read or write only that amount of data – not that entire chunk. Also note that this request
could require two drives to satisfy if 2KB of the block lies on one chunk and 6KB on the next one in the stripe.
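The chunk-straddling behavior described above can be demonstrated with a small sketch (Python; the chunk size, sector math, and round-robin drive rotation are simplifying assumptions, not the actual VSP layout).

BLOCK = 512                     # host sector size in bytes

def chunks_touched(start_lba, length_bytes, chunk_bytes, data_drives):
    """Which stripe chunks (and therefore which data drives) a request touches,
    assuming a simple rotating chunk layout."""
    start = start_lba * BLOCK
    end = start + length_bytes - 1
    first, last = start // chunk_bytes, end // chunk_bytes
    return [(chunk, chunk % data_drives) for chunk in range(first, last + 1)]

# An 8KB request that begins 2KB before a 512KB chunk boundary straddles two
# chunks (2KB on one, 6KB on the next) and so may need two drives to satisfy it:
print(chunks_touched(start_lba=1020, length_bytes=8192,
                     chunk_bytes=512 * 1024, data_drives=7))
# -> [(0, 0), (1, 1)]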
Because of the variations of file system formatting and such, there is no way to determine where a particular block may lie
on the raw space presented by a volume. A file system will create a variety of metadata in a quantity and distribution
pattern that is related to the size of that volume. Most file systems also typically scatter writes around within the LBA
range – an outdated hold-over from long ago when file systems wanted to avoid a common problem of the appearance of
bad sectors or tracks on drives. What this means is that attempting to align application block sizes with RAID chunk sizes is a pointless exercise.
The one alignment issue that should be noted is in the case of host-based Logical Volume Managers. These also have a
native “stripe size” that is selectable when creating a logical volume from several physical storage LUNs. In this case, the
LVM stripe size should be a multiple of the RAID chunk size due to various interactions between the LVM and the LUNs.
One such example is the case of large block sequential I/O. If the LVM stripe size is equal to the RAID chunk size, then a
series of requests will be issued to different LUNs for that same I/O, making the request appear to be several random I/O
operations to the storage system. This can defeat the system’s sequential detect mechanisms, and turn off sequential
prefetch, slowing down these types of operations.
Port I/O Request Limits, LUN Queue Depths, and Transfer sizes
There are three aspects of I/O request handling per storage path that need to be understood. At the port level, there are
the mechanisms of an I/O request limit and a maximum transfer size. At the LUN level, there is the mechanism of the
maximum queue depth.
Port I/O Request Limits
Each server and storage system has a particular port I/O request limit, this being the number of total outstanding I/O
requests that may be issued to a system port (not a LUN) at any point in time. This limit will vary according to the
server’s Fibre Channel HBA and its driver, as well as the limit supported by the target storage port. As switches serve to
aggregate many HBAs to the same storage system port, the operating limit will be the lower of these two points. In the
case of a SAN switch being used to attach multiple server ports to a single storage port (fan-in), the limit will most likely be
the limit supported by the storage port.
On the VSP Midrange, the rule of thumb average I/O request limit per FC port is 2048. This means that, at any one time,
there can be up to 2048 total active host I/O commands (tags) queued up for the various LUNs visible on that FC port.
The actual limit is determined by how many requests (tags) are currently in the queue on each MPU (a group of four cores).
Each MPU has a limit of 64k tags to use in managing all host I/O requests.
Most environments don’t require that all LUNs be active at the same time. As I/O requests are application driven, this
must be taken into consideration. Understanding the customer’s requirements for performance will dictate how many
LUNs should be assigned to a single port.
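A simple back-of-the-envelope check against the 2048-tags-per-port rule of thumb might look like the following (Python; worst-case arithmetic only, assuming every LUN is driven at its full queue depth at the same time, which is rarely true in practice).

def port_tag_check(luns_per_port, queue_depth_per_lun, port_limit=2048):
    """Worst-case concurrently queued tags on one storage port versus the
    2048-per-port rule of thumb."""
    worst_case = luns_per_port * queue_depth_per_lun
    return worst_case, worst_case <= port_limit

for luns, qd in [(32, 32), (64, 32), (128, 32)]:
    tags, ok = port_tag_check(luns, qd)
    status = "within" if ok else "exceeds"
    print(f"{luns} LUNs x QD {qd} = {tags} tags ({status} the rule of thumb)")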
Selecting the Proper Disks
In all cases, distributing a workload across a higher number of small-capacity, high-RPM drives will provide better
performance in terms of full random access. Even better results can be achieved by distributing the workload over a
higher number of small LUNs where the LUNs are the only active LUNs in the Parity Group. When cached data locality is
low, multiple small-capacity high-RPM drives should be used.
One must also consider that, for the same budget, a system populated with a larger number of lower cost 10K RPM drives can provide a much higher aggregate level of host IOPS than one populated with fewer 15K RPM drives. For instance, if a 15K drive costs 100% more than a 10K drive, then one could install twice as many drives
of the 10K variety. The individual I/O will see some increase in response time when using 10k drives, but the total IOPS
available will be much higher.
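The budget trade-off can be illustrated with some simple arithmetic (Python; the drive prices and per-drive IOPS figures are invented for illustration and should be replaced with current list prices and the nominal IOPS values from Table 3).

def aggregate_iops(drive_count, iops_per_drive):
    """Simple sum of per-drive random IOPS; ignores RAID write penalty,
    cache hits, and other back-end limits."""
    return drive_count * iops_per_drive

budget = 100_000                    # illustrative currency units
cost_10k, cost_15k = 500, 1_000     # assumed prices: the 15K drive costs 100% more
iops_10k, iops_15k = 150, 200       # assumed nominal random read IOPS per drive

n_10k = budget // cost_10k          # 200 drives
n_15k = budget // cost_15k          # 100 drives

print("10K RPM option:", aggregate_iops(n_10k, iops_10k), "IOPS")   # 30,000
print("15K RPM option:", aggregate_iops(n_15k, iops_15k), "IOPS")   # 20,000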
Fan-in
Host Fan-in refers to the consolidation of many host nodes into one or just a few storage ports (many-to-one). Fan-in has
the potential for performance issues by creating a bottleneck at the front end storage port. Having multiple hosts
connected to the same storage port does work for environments that have minimal performance requirements. In
designing this type of solution, it is important to understand the performance requirements of each host. If each host has
either a high IOPS or throughput requirement, it is highly probable that a single shared FC port will not satisfy their
aggregate performance requirements.
Fan-out
Fan out allows a host node to take advantage of several storage ports (and possibly additional port processors) from a
single host port (one-to-many). Fan-out has a potential performance benefit for small block random I/O workloads. This
allows multiple storage ports (and their queues) to service a smaller number of host ports. Fan-out typically does not
benefit environments with high throughput (MB/sec) requirements due to the transfer limits of the host bus adapters (HBAs).