Section 2 - Storage Systems Architecture
Section Objectives
Upon completion of this section, you will be able to:
y Describe the physical and logical components of a host
y Describe common connectivity components and protocols
y Describe features of intelligent disk storage systems
y Describe data flow between the host and the storage
array
The objectives for this section are shown here. Please take a moment to read them.
In This Section
This section contains the following modules:
1. Components of a Host
2. Connectivity
3. Physical Disks
4. RAID Arrays
5. Disk Storage Systems
Additional Information:
y Apply Your Knowledge
y Data Flow Exercise (Student Resource Guide ONLY)
y Case Studies (Student Resource Guide ONLY)
Components of a Host
Upon completion of this module, you will be able to:
y List the hardware and software components of a host
y Describe key protocols and concepts used by each
component
In this module, we look at the hardware and software components of a host, as well as the key
protocols and concepts that make these components work. This provides the context for how data
typically flows within the host, as well as between the hosts and storage systems.
The objectives for this module are shown here. Please take a moment to read them.
Examples of Hosts
Server
Laptop
Group of Servers
Mainframe
A host could be something small, like a laptop, or it could be larger, such as a server, a group or cluster
of servers, or a mainframe. The host has physical (hardware) and logical (software) components. Let’s
look at the physical components first.
(Diagram: CPU, storage, and I/O devices connected by a bus)
The most common physical components found in a host system include the Central Processing Unit
(CPU), Storage, and Input/Output Devices (I/O).
The CPU performs all the computational processing (number-crunching) for the host. This processing
involves running programs, which are a series of instructions that tell the CPU what to do.
Storage can be high-speed, temporary (volatile, meaning that the content is lost when power is
removed) storage, or permanent magnetic or optical storage media.
I/O devices allow the host to communicate with the outside world.
Let’s look at each of these elements, starting with the CPU.
CPU
(Diagram: CPU internals - the ALU, registers, and L1 cache, connected to the rest of the host by a bus)
The CPU consists of three major parts: the Arithmetic Logic Unit, the registers, and the L1 cache.
The Arithmetic Logic Unit (ALU) is the portion of the CPU that performs all the manipulation of data,
such as addition of numbers.
The Registers hold data that is being used by the CPU. Because of their proximity to the ALU,
registers are very fast. CPUs will typically have only a small number of registers – 4 to 20 is common.
L1 cache is additional memory which is associated with the CPU. It holds data and program
instructions that are likely to be needed by the CPU in the near future. The L1 cache will be slower
than registers, but there will be more storage space in the L1 cache than in the registers – 16 KB is
common. Although L1 cache is optional, it is found on most modern CPUs.
The CPU connects to other components in the host via a bus. Buses will be discussed in the
Connectivity module of this Section.
Storage
(Diagram: memory shown as a table of addresses and their data contents, alongside a disk)
(Diagram: the storage hierarchy - L1 cache, L2 cache, magnetic disk, optical disk, and tape - ranked from fast and expensive to slow and low cost)
In any host, there is a variety of storage types. Each type has different characteristics of speed, cost,
and capacity. As a general rule, faster technologies cost more and, as a result, are more scarce.
CPU registers are extremely fast but limited in number to a few tens of locations at most, and are
expensive in terms of both cost and power use. As we move down the list, speeds decrease along with
cost.
Magnetic disks are generally fixed, whereas optical disk and tape use removable media. The cost of
optical and tape media per MB stored is much lower than that of magnetic disk.
I/O Devices
y Human interface
– Keyboard
– Mouse
– Monitor
y Computer-computer interface
– Network Interface Card (NIC)
y Computer-peripheral interface
– USB (Universal Serial Bus) port
– Host Bus Adapter (HBA)
I/O devices allow a host to interact with the outside world by sending and receiving data. The basic I/O
devices, such as the keyboard, mouse and monitor, allow users to enter data and view the results of
operations. Other I/O devices allow hosts to communicate with each other or with peripheral devices,
such as printers and cameras.
HBAs
(Diagram: the host software stack - applications, operating system, volume management, multi-pathing software, device drivers - connecting to multiple HBAs)
The host connects to storage devices using special hardware called a Host Bus Adapter (HBA). HBAs
are generally implemented as either an add-on card or a chip on the motherboard of the host. The ports
on the HBA are used to connect the host to the storage subsystem. There may be multiple HBAs in a
host.
The HBA has the processing capability to handle some storage commands, thereby reducing the
burden on the host CPU.
File Systems
(Diagram: the host software stack - applications, operating system, volume management, multi-pathing software, device drivers - connecting to multiple HBAs)
The file system is the general name given to the host-based logical structures and software routines
used to control access to data storage.
The file system block is the smallest ‘container’ allocated to a file’s data. Each filesystem block is a
contiguous area of physical disk capacity.
y Blocks can range in size, depending on the type of files being stored and accessed.
y The block size is fixed (by the operating system) at the time of file system creation.
y Since most files are larger than the pre-defined filesystem block size, a file’s data spans multiple
filesystem blocks. However, the filesystem blocks containing all of the file’s data may not
necessarily be contiguous on a physical disk. Over time, as files grow larger, the file system
becomes increasingly fragmented.
In multi-user, multi-tasking environments, filesystems manage shared storage resources using:
y Directories, paths and structures to identify file locations
y Volume Managers to hide the complexity of physical disk structures
y File locking capabilities to control access to files. This is important when multiple users or
applications attempt to access the same file simultaneously
The number of files created and accessed by a host can be very large. Instead of using a linear or flat
structure (similar to having many objects in a single box), a filesystem is divided into directories
(smaller boxes), or folders.
Directories:
y Organize file systems into containers which may hold files as well as other (sub)directories
y Hold information about files they contain
A directory is a special type of file containing a list of filenames and associated metadata (information
or data about the file). When a user attempts to access a given file by name, the name is used to look
up the appropriate entry in the directory. That entry holds the corresponding metadata.
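To make the lookup concrete, here is a minimal sketch in Python; the directory contents and metadata fields (size, permissions, block list) are invented for illustration and are not tied to any particular file system.

# A directory modeled as a lookup table from filename to metadata.
directory = {
    "report.txt": {"size": 1432, "permissions": "rw-r--r--", "blocks": [208, 209]},
    "photo.jpg":  {"size": 52100, "permissions": "rw-------", "blocks": [310, 311, 312]},
}

def lookup(name):
    # Return the metadata entry for a filename, as a directory lookup would.
    entry = directory.get(name)
    if entry is None:
        raise FileNotFoundError(name)
    return entry

print(lookup("report.txt")["blocks"])   # -> [208, 209]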
Non-journaling file systems create a potential for lost files because they may use many separate writes
to update their data and metadata. If the system crashes during the write process, metadata or data may
be lost or corrupted. When the system reboots, the filesystem attempts to update the metadata
structures by examining and repairing them. This operation takes a long time on large file systems. If
there is insufficient information to recreate the desired or original structure, files may be misplaced or
lost and file systems corrupted.
A journaling file system uses a separate area called a log, or journal. This journal may contain all the
data to be written (physical journal), or may contain only the metadata to be updated (logical journal).
Before changes are made to the filesystem, they are written to this separate area. Once the journal has
been updated, the operation on the filesystem can be performed. If the system crashes during the
operation, there is enough information in the log to "replay" the log record and complete the operation.
Journaling results in a very quick filesystem check by only looking at the active, most recently
accessed parts of a large file system. In addition, because information about the pending operation is
saved, the risk of files being lost is lessened.
A disadvantage of journaling filesystems is that they are slower than other file systems. This slowdown is the result of the extra operations that have to be performed on the journal each time the filesystem is changed. However, the much shortened time for file system checks and the integrity provided by journaling far outweigh this disadvantage. Nearly all file system implementations use journaling.
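As a rough sketch of the write-ahead idea behind a logical journal (the structures and function names below are invented for illustration, not taken from any real file system):

# Record the intended metadata update in the journal first, then apply it.
# On restart, any entries not marked complete are replayed.
journal = []     # the log of pending operations
metadata = {}    # stand-in for on-disk file system metadata

def journaled_update(key, value):
    entry = {"key": key, "value": value, "committed": False}
    journal.append(entry)        # step 1: write the intent to the journal
    metadata[key] = value        # step 2: apply the change to the file system
    entry["committed"] = True    # step 3: mark the journal entry complete

def replay_after_crash():
    # Re-apply any updates that were logged but never marked complete.
    for entry in journal:
        if not entry["committed"]:
            metadata[entry["key"]] = entry["value"]
            entry["committed"] = True

journaled_update("fileA.owner", "alice")
replay_after_crash()   # nothing pending here, but this is the recovery path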
Volume Management
(Diagram: the host software stack, with the volume management layer highlighted)
The volume manager is an optional intermediate layer that sits between the file system and the physical disks. It 'aggregates' several hard disks to form a large, virtual disk and makes this virtual disk visible to higher-level programs and applications. It optimizes access to storage and simplifies the management of storage resources.
Module Summary
Key points covered in this module:
y Hosts typically have:
– Hardware: CPU, memory, buses, disks, ports, and interfaces
– Software: applications, operating systems, file systems, device
drivers, volume managers
y Journaling enables:
– very fast file system checks in the event of a system crash
– better integrity for the file system structure
These are the key points covered in this module. Please take a moment to review them.
Check your knowledge of this module by taking some time to answer the questions shown on the slide.
Connectivity
Upon completion of this module, you will be able to:
y Describe the physical components of a networked
storage environment
y Describe the logical components (communication
protocols) of a networked storage environment
In the previous module, we looked at the host environment. In this module, we discuss how the host is
connected to storage, and the protocols used for communication between them.
The objectives for this module are shown here. Please take a moment to read them.
(Diagram: a host's CPU and HBA connected over an internal bus, with a cable running from the HBA port to a port on a disk)
Bus Technology
Serial
Serial Bi-directional
Parallel
A bus is a collection of paths that facilitate data transmission from one part of the computer to another.
Physical components communicate across a bus by sending packets of data between the devices.
These packets can travel in a serial path or in parallel paths. In serial communication, the bits travel
one behind the other. In parallel communication, the bits can move along multiple paths
simultaneously.
A simple analogy to describe buses is a highway:
A Serial Bus is a one-way, single-lane highway where data packets travel in a line in one direction.
A Bi-directional Serial Bus is a two-lane road where data packets travel in a line in both directions simultaneously.
A Parallel Bus is a multi-lane highway. It can also be bi-directional, with packets travelling in different lanes in both directions simultaneously.
Note: The Parallel Bi-directional Bus is not shown in this slide.
Bus Technology
y System Bus – connects CPU to Memory
y Local (I/O) Bus – carries data to/from peripheral devices
y Bus width measured in bits
y Bus speed measured in MHz
y Throughput measured in MB/s
Connectivity Protocols
y Protocol = a defined format for communication – allows
the sending and receiving devices to agree on what is
being communicated.
A protocol is a defined format, in this case for communication between hardware or software
components. Communication protocols are defined for systems and components that are:
y Tightly connected entities – such as central processor to RAM, or storage buffers to controllers –
which use standard bus technology (e.g., the system bus or the local I/O bus)
y Directly attached entities or devices connected at moderate distances – such as host to printer or
host to storage
y Network connected entities – such as networked hosts, Network Attached Storage (NAS) or
Storage Area Networks (SAN)
We will discuss the communication protocols (logical components) found in each of these connectivity
models, starting with the tightly connected or bus protocols.
Communication Protocols
(Diagram: the host stack - applications and operating system - with PCI as the local bus connection)
The protocols for the local (I/O) bus and for connections to an internal disk system include:
y PCI
y IDE/ATA
y SCSI
The next few slides examine each of these.
The Peripheral Component Interconnect (PCI) is a specification defining the local bus system within a
computer. The specification standardizes how PCI expansion cards, such as network cards or modems,
install themselves and exchange information with the central processing unit (CPU).
In more detail, a Peripheral Component Interconnect (PCI) includes:
y an interconnection system between a microprocessor and attached devices, in which expansion
slots are spaced closely for high-speed operation
y plug and play functionality that makes it easy for a host to recognize a new card
y 32 or 64 bit data
y a throughput of 133 MB/sec
PCI Express is an enhanced PCI bus with increased bandwidth.
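The 133 MB/sec figure follows directly from the bus width and clock rate. A back-of-the-envelope calculation in Python, assuming the classic 32-bit PCI bus clocked at 33 MHz (the 64-bit / 66 MHz variant is shown for comparison):

# Peak bus throughput = (bus width in bytes) x (clock rate in MHz), giving MB/s.
def peak_throughput_mb_s(width_bits, clock_mhz):
    return (width_bits / 8) * clock_mhz

print(round(peak_throughput_mb_s(32, 33.33)))   # ~133 MB/s, standard PCI
print(round(peak_throughput_mb_s(64, 66.66)))   # ~533 MB/s, 64-bit / 66 MHz PCI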
IDE/ATA
y Integrated Device Electronics (IDE) / Advanced
Technology Attachment (ATA)
y Most popular interface used with modern hard disks
y Good performance at low cost
y Desktop and laptop systems
y Inexpensive storage interconnect
The most popular interface protocol used in modern hard disks is the one most commonly known as
IDE. This interface is also known as ATA.
IDE/ATA hard disks are used in most modern PCs, and offer excellent performance at relatively low
cost.
Small Computer Systems Interface, SCSI, has several advantages over IDE that make it preferable for
use in higher-end machines. It is far less commonly used than IDE/ATA in PCs due to its higher cost
and the fact that its advantages are not useful for the typical home or business desktop user.
SCSI began as a parallel interface, allowing the connection of devices to a PC, or other servers, with
data being transmitted across multiple data lines. SCSI itself, however, has been broadened greatly in
terms of its scope, and now includes a wide variety of related technologies and standards.
SCSI Model
(Diagram: an initiator issuing commands to a target)
As you can see from the diagram, a SCSI device that ‘starts’ a communication is an “initiator”, and a
SCSI device that services a request is a “target”.
You should not necessarily think of initiators as hosts, and targets as storage devices. Storage devices
may initiate a command to other storage devices or switches, and hosts may be targets and receive
commands from the storage devices.
After initiating a request to the target, the host can process other events without having to wait for a
response from the target. After it finishes processing, the target signals a command complete or a
status message back to the host.
SCSI Model
(Diagram: SCSI addressing - an initiator ID, a target ID, and the LUNs behind the target)
SCSI Addressing
The Initiator ID identifies the initiator; the storage device uses it to send responses back to the initiator. A SCSI host bus adapter (referred to as a controller) can be implemented in two ways:
y an onboard interface
y an ‘add in’ card plugged into the system I/O bus
Target ID is the value for a specific storage device. It is an address that is set on the interface of the
device such as a disk, tape or CDROM.
The LUN is the Logical Unit Number of the device. It reflects the actual address of the device, as seen by the target.
(Diagram: device naming example - controller c0 (the initiator/HBA), target t0 (the peripheral controller), and devices d0, d1, d2 as LUNs)
For example, a logical device name (used by a host) for a disk drive may be: cn|tn|dn, where
y cn is the controller
y tn is the target ID of the devices such as t0, t1, t2 and so on
y dn is the device number, which reflects the actual address of the device unit. This is usually d0 for
most SCSI disks because there is only one disk attached to the target controller.
In intelligent storage systems, discussed later, each target may address many LUNs.
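As a small illustration, the sketch below splits such a logical device name into its parts; the parsing code is purely illustrative and is not an operating system utility.

import re

def parse_device_name(name):
    # Split a cN tN dN style name such as 'c0t1d0' into controller, target, device.
    match = re.fullmatch(r"c(\d+)t(\d+)d(\d+)", name)
    if not match:
        raise ValueError("not a cNtNdN device name: " + name)
    controller, target, device = (int(part) for part in match.groups())
    return {"controller": controller, "target": target, "device": device}

print(parse_device_name("c0t1d0"))
# -> {'controller': 0, 'target': 1, 'device': 0}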
Expandability and number of devices - SCSI is superior to IDE/ATA. This advantage of SCSI only
matters if you actually need this much expansion capability as SCSI is more involved and expensive to
set up.
Device Type Support – SCSI holds a significant advantage over IDE/ATA in terms of the types of
devices each interface supports.
Cost – the IDE/ATA interface is superior to the SCSI interface.
Performance – These factors influence system performance for both interfaces:
y Maximum Interface Data Transfer Rate: Both interfaces presently offer very high maximum
interface rates, so this is not an issue for most PC users. However, if you are using many hard disks
at once, for example in a RAID array, SCSI offers better overall performance.
y Device-Mixing Issues: IDE/ATA channels that mix hard disks and CD-ROMs suffer significant performance hits because these devices operate at different speeds (hard disks read and write relatively quickly compared to CD-ROM drives). Also, because an IDE channel can only service a single device at a time, it must wait for the slower optical drive to complete a task. SCSI does not have this problem.
y Device Performance: When looking at particular devices, SCSI can support multiple devices
simultaneously while IDE/ATA can only support a single device at a time.
Configuration and set-up – IDE/ATA is easier to set up, especially if you are using a reasonably new
machine and only a few devices. SCSI has a significant advantage over IDE/ATA in terms of hard disk
addressing issues.
(Diagram: a host's HBA connected by a cable to a port on an external disk)
A host with external storage is usually a large enterprise server. Components are identical to those of a
host with internal storage. The key difference is in the external storage interfaces used.
Fibre Channel
(Diagram: the host software stack - applications, DBMS, management utilities, file system, LVM, multipathing software, device drivers - connecting through HBAs over Fibre Channel to storage arrays)
Fibre Channel is a high–speed interconnect used in networked storage to connect servers to shared
storage devices. Fibre Channel components include HBAs, hubs, switches, cabling, and disks.
The term Fibre Channel refers to both the hardware components and the protocol used for
communication between nodes.
y Fibre Channel
– Greater distance
– High device count in SANs
– Multiple initiators
– Dual-ported drives
The two most popular interfaces for external storage devices are SCSI and Fibre Channel (FC). SCSI is
also commonly used for internal storage in hosts; FC is almost never used internally.
(Diagram: hosts connected through switches to storage)
When computing environments require high speed connectivity, they use sophisticated equipment to
connect hosts to storage devices.
Physical connectivity components in networked storage environments include:
y HBA (Host-side interface) – Host Bus Adapters connect the host to the storage devices
y Optical cables – fiber optic cables to increase distance, and reduce cable bulk
y Switches – used to control access to multiple attached devices
y Directors – sophisticated switches with high availability components
y Bridges – connections to different parts of a network
Module Summary
Key points covered in this module:
y The physical components of a networked storage
environment
y The logical components (communication protocols) of a
networked storage environment
These are the key points covered in this module. Please take a moment to review them.
Check your knowledge of this module by taking some time to answer the questions shown on the slide.
Physical Disks
After completing this module, you will be able to:
y Describe the major physical components of a disk drive
and their function
y Define the logical constructs of a physical disk
y Describe the access characteristics for disk drives and
their performance implications
y Describe the logical partitioning of physical drives
There are several methods for storing data; however, in this module the focus is on disk drives. Disk drives use many types of technology to perform their job: mechanical, chemical, magnetic, and electrical.
Our intent is not to make you an expert on every detail about the drive - rather you should have a high
level understanding of how both the physical and logical parts of a drive work. This enables you to see
how these parts impact system capacity, reliability, and performance.
The objectives for this module are shown here. Please take a moment to read them.
The focus of this lesson is on the components of a disk drive and how they work. Additionally, it is
important to understand how the data is organized on the disk based on its disk geometry.
A hard drive contains a series of rotating platters within a sealed case. The sealed case is known as
Head Disk Assembly, or HDA.
A platter has the following attributes:
y It is a rigid, round disk which is coated with magnetically sensitive material.
y Data is stored in binary code (0s and 1s). It is encoded by polarizing magnetic areas, or domains,
on the disk surface.
y Data can be written to and read from both surfaces of a platter.
y A platter’s storage capacity varies across drives. There is an industry trend toward higher capacity
as technology improves.
− Note: The drive’s capacity is determined by the number of platters, the amount of data which
can be stored on each platter, and how efficiently data is written to the platter.
Note: These concepts apply to disk drives used in systems of all sizes.
(Diagram: platters mounted on a spindle)
Data is read and written by read/write heads, or R/W heads. Most drives have two R/W heads per
platter, one for each surface of the platter.
y When reading data, they detect magnetic polarization on the platter surface.
y When writing data, they change the magnetic polarization on the platter surface.
Since reading and writing data is a magnetic process, the R/W heads never actually touch the surface
of the platter. There is a microscopic air gap between the read/write heads and the platter. This is
known as the head flying height.
When the spindle rotation has stopped, the air gap is removed and the R/W heads rest on the surface of
the platter in a special area near the spindle called a landing zone. The landing zone is coated with a
lubricant to reduce head/platter friction. Logic on the disk drive ensures that the heads are moved to
the landing zone before they touch the surface.
If the drive malfunctions and a read/write head accidentally touches the surface of the platter outside of
the landing zone, it is called a head crash. When a head crash occurs, the magnetic coating on the
platter gets scratched and damage may also occur to the R/W head. A head crash generally results in
data loss.
(Diagram: the actuator arm assembly and the spindle)
Read/write heads are mounted on the actuator arm assembly, which positions the read/write head at
the location on the platter where data needs to be written or read.
(Diagram: read/write heads, two per platter, mounted on the actuator)
The read/write heads for all of the platters in a drive are attached to one actuator arm assembly and
move across the platter simultaneously. Notice there are two read/write heads per platter, one for each
surface.
Controller
(Diagram: the drive's controller board, its interface and power connector, and the HDA)
The controller is a printed circuit board, mounted at the bottom of the disk drive. It contains a
microprocessor (as well as some internal memory, circuitry, and firmware) that controls:
y power to the spindle motor and control of motor speed
y how the drive communicates with the host CPU
y reads/writes by moving the actuator arm, and switching between R/W heads
y optimization of data access
(Diagram: tracks and sectors on a platter)
Data is recorded in tracks. A track is a concentric ring around the spindle which contains data.
y A track can hold a large amount of data. Track density describes how tightly packed the tracks are
on a platter.
y Tracks are numbered from the outer edge of the platter, starting at track zero.
y A track is divided into sectors. A sector is the smallest individually-addressable unit of storage.
y The number of sectors per track is based upon the specific drive.
y Sectors typically hold 512 bytes of user data. Some disks can be formatted with larger sectors.
y A formatting operation performed by the manufacturer writes the track and sector structure on the
platter.
Each sector stores user data as well as other information, including its sector number, head number (or
platter number) and track number. This information aids the controller in locating data on the drive,
but it also takes up space on the disk. Thus there is a difference between the capacity of an unformatted
disk and a formatted one. Drive manufacturers generally advertise the formatted capacity.
The first PC hard disks typically held 17 sectors per track. Today's hard disks can have a much larger
number of sectors in a single track. There can be thousands of tracks on a platter, depending on the size
of the drive.
(Diagram: sectors and tracks on a platter, with and without zoned-bit recording)
Since a platter is made up of concentric tracks, the outer tracks can hold more data than the inner ones
because they are physically longer than the inner tracks. However, in older disk drives, the outer tracks
had the same number of sectors as the inner tracks, which means that the data density was very low on
the outer tracks. This was an inefficient use of the available space.
Zoned-bit recording uses the disk more efficiently. It groups tracks into zones that are based upon
their distance from the center of the disk. Each zone is assigned an appropriate number of sectors per
track. This means that a zone near the center of the platter has fewer sectors per track than a zone on
the outer edge.
In zoned-bit recording:
y outside tracks have more sectors than inside tracks
y zones are numbered, with the outermost zone being Zone 0
y tracks within a given zone have the same number of sectors
Note: The media transfer rate drops as the zones move closer to the center of the platter, meaning that
performance is better on the zones created on the outside of the drive. Media transfer rate is covered
later in the module.
Cylinder
Tracks and sectors organize data on a single platter. Cylinders help organize data across platters on a
drive.
A cylinder is the set of tracks with the same track number on both surfaces of each of the drive's platters. Often the drive head location is referred to by cylinder number rather than by track number.
Because all of the read-write heads move together, each head is always physically located at the same
track number. In other words, one head cannot be on track zero while another is on track 10.
(Diagram: cylinders, heads, and sectors on the left, with the corresponding logical block numbering - block 0, 8, 16, 32, 48, and so on - on the right)
At one time, drives used physical addresses made up of the Cylinder, Head, and Sector number (CHS)
to refer to specific locations on the disk. This meant that the host had to be aware of the geometry of
each disk that was used.
Logical Block Addressing (LBA) simplifies addressing by using a linear address for accessing
physical blocks of data. The disk controller performs the translation process from LBA to CHS
address. The host only needs to know the size of the disk drive (how many blocks).
y Logical blocks are mapped to physical sectors on a 1:1 basis
y Block numbers start at 0 and increment by one until the last block is reached (E.g., 0, 1, 2, 3 … (N-
1))
y Block numbering starts at the beginning of a cylinder and continues until the end of that cylinder
y This is the traditional method for accessing peripherals on SCSI, Fibre Channel, and newer ATA
disks
y As an example, we’ll look at a new 500 GB drive. The true capacity of the drive is 465.7 GB,
which is in excess of 976,000,000 blocks. Each block will have its own unique address
In the slide, the drive shows 8 sectors per track, 8 heads, and 4 cylinders. We have a total of 8 x 8 x 4 =
256 blocks. The illustration on the right shows the block numbering, which ranges from 0 to 255.
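A small sketch of the translation the disk controller performs, using the hypothetical geometry from the slide (4 cylinders, 8 heads, 8 sectors per track); real drives hide their geometry, so the numbers here are purely illustrative:

CYLINDERS, HEADS, SECTORS_PER_TRACK = 4, 8, 8

def lba_to_chs(lba):
    # Translate a logical block address into (cylinder, head, sector).
    cylinder, remainder = divmod(lba, HEADS * SECTORS_PER_TRACK)
    head, sector = divmod(remainder, SECTORS_PER_TRACK)
    return cylinder, head, sector      # sector numbered from 0 in this sketch

def chs_to_lba(cylinder, head, sector):
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + sector

print(CYLINDERS * HEADS * SECTORS_PER_TRACK)   # 256 blocks in total
print(lba_to_chs(255))                         # -> (3, 7, 7), the last block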
(Diagram: partitioning divides one physical drive into multiple logical volumes; concatenation combines several physical drives into one logical volume)
Partitioning divides the disk into logical containers (known as volumes), each of which can be used
for a particular purpose.
y Partitions are created from groups of contiguous cylinders
y A large physical drive could be partitioned into multiple Logical Volumes (LV) of smaller capacity
y Because partitions define the disk layout, they are generally created when the hard disk is initially
set up on the host
y Partition size impacts disk space utilization
y The host filesystem accesses partitions, with no knowledge of the physical structure.
Concatenation groups several smaller physical drives and presents them collectively as one large
logical drive to the host. This is typically done using the Logical Volume Manager on the host.
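A minimal sketch of the mapping a volume manager performs for a concatenated volume; the drive names and sizes below are made up for illustration.

# Concatenation: the logical volume is drive A, then drive B, then drive C.
drives = [("A", 1000), ("B", 2000), ("C", 1500)]   # (name, size in blocks)

def locate(logical_block):
    # Map a block of the concatenated volume to (drive, block within drive).
    offset = logical_block
    for name, size in drives:
        if offset < size:
            return name, offset
        offset -= size
    raise ValueError("block beyond the end of the logical volume")

print(locate(500))    # -> ('A', 500)
print(locate(2500))   # -> ('B', 1500)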
Lesson Summary
Key points covered in this lesson:
y Physical drives are made up of:
– HDA
¾ Platters connected via a spindle
¾ Read/write heads which are positioned by an actuator
– Controller
¾ Controls power, communication, positioning, and optimization
These are the key points covered in this lesson. Please take a moment to review them.
The focus of this lesson is on the factors that impact how well a drive works, in particular, the
performance and reliability of the drive.
Since a disk drive is a mechanical device, accessing it takes far longer than accessing electronic memory. The length of time to read or write data on the disk depends primarily upon three factors: seek time, rotational delay (also known as latency), and transfer rate.
The objectives for this lesson are shown here. Please take a moment to read them.
Seek times describe the time it takes to position the read/write heads radially across the platter. The
following specifications are often published:
y Full Stroke - the time it takes to move across the entire width of the disk, from the innermost track
to the outermost
y Average – the average time it takes to move from one random track to another (normally listed as
the time for one-third of a full stroke)
y Track-to-Track – the time it takes to move between adjacent tracks
Each of these specifications is measured in milliseconds (ms).
Notes:
Average seek times on modern disks typically are in the range of 3 to 15 ms.
Seek time has more impact on reads of random tracks on the disk rather than on adjacent tracks.
To improve seek time, data is often written only to a subset of the available cylinders (either on the
inner or outer tracks), and the drive is treated as though it has a lower capacity than it really has, e.g. a
500 GB drive is set up to use only the first 40 % of the cylinders, and is treated as a 200 GB drive.
This is known as short-stroking the drive.
The actuator moves the read/write head over the platter to a particular track, while the platter spins to position a particular sector under the read/write head.
Rotational latency is the time it takes the platter to rotate and position the data under the read/write
head.
y Rotational latency is dependent upon the rotation speed of the spindle and is measured in
milliseconds (ms)
y The average rotational latency is one-half of the time taken for a full rotation
y Like seek times, rotational latency has more of an impact on reads or writes of random sectors on
the disk than on the same operations on adjacent sectors
Since spindle speed contributes to latency, the faster the disk spins, the quicker the correct sector will
rotate under the heads—thus leading to a lower latency.
Rotational latency is around 5.5 ms for a 5,400 rpm drive, and around 2.0 ms for a 15,000 rpm drive.
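Those figures follow directly from the spindle speed: one full rotation takes 60/rpm seconds, and the average latency is half of that. A quick check in Python:

def average_rotational_latency_ms(rpm):
    # Average rotational latency = half of one full rotation, in milliseconds.
    return (60_000 / rpm) / 2

print(round(average_rotational_latency_ms(5_400), 1))    # ~5.6 ms
print(round(average_rotational_latency_ms(15_000), 1))   # 2.0 ms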
(Diagram: command queuing - four I/O requests serviced in arrival order versus reordered to match their locations on the platter)
If commands are processed as they are received, time is wasted if the read/write head passes over data
that is needed one or two requests later. To improve drive performance, some drive manufacturers
include logic that analyzes where data is stored on the platter relative to the data access requests.
Requests are then reordered to make best use of the data’s layout on the disk.
This technique is known as Command Queuing (also known as Multiple Command Reordering,
Multiple Command Optimization, Command Queuing and Reordering, Native Command Queuing or
Tagged Command Queuing).
In addition to being performed at the physical disk level, command queuing can also be performed by
the storage system that uses the disk.
Disk Drive
The following steps take place when data is read from/written to the drive:
y Read
1. Data moves from the disk platters to the heads
2. Data moves from the heads to the drive's internal buffer
3. Data moves from the buffer through the interface to the host HBA
y Write
1. Data moves from the HBA to the internal buffer through the drive’s interface
2. Data moves from the buffer to the read/write heads
3. Data moves from the disk heads to the platters
The Data Transfer Rate describes the rate, in MB per second, at which the drive can deliver data to the HBA. Because internal and external factors can impact performance, transfer rates are refined into:
y Internal transfer rate - the speed of moving data from the disk surface to the R/W heads on a
single track of one surface of the disk. This is also known as the burst transfer rate
− Sustained internal transfer rate takes other factors into account, such as seek times
y External transfer rate - the rate at which data can be moved through the interface to the HBA.
The burst transfer rate is generally the advertised speed of the interface (e.g., 133 MB/s for
ATA/133)
− Sustained external transfer rates are lower than the interface speed
Note: Internal transfer rates are almost always lower, sometimes appreciably lower, than the external
transfer rate.
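Pulling the pieces of this lesson together, the sketch below gives a rough estimate of the time to service one random I/O as seek time plus rotational latency plus transfer time; the drive figures plugged in are illustrative, not taken from any specific product.

def io_service_time_ms(seek_ms, rpm, io_size_kb, transfer_mb_s):
    # Rough service time for one random I/O.
    rotational_latency_ms = (60_000 / rpm) / 2
    transfer_ms = io_size_kb / 1024 / transfer_mb_s * 1000
    return seek_ms + rotational_latency_ms + transfer_ms

# Example: 5 ms average seek, 15,000 rpm spindle, 8 KB I/O, 60 MB/s sustained rate.
print(round(io_service_time_ms(5, 15_000, 8, 60), 2))   # ~7.13 ms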
Mean Time Between Failure (MTBF) is the average length of time a device can be expected to operate before an incapacitating malfunction occurs. It is based on averages and therefore is used merely to
provide estimates. MTBF is measured in hours (e.g., 750,000 hours).
MTBF is based on an aggregate analysis of a huge number of drives, so it does not help to determine
how long a given drive will actually last. MTBF is often used along with the service life of the drive,
which describes how long you can expect the drive’s components to work before they wear out (e.g., 2
years).
Note: MTBF is a statistical method developed by the U.S. military as a way of estimating maintenance
levels required by various devices. It is generally not practical to test a drive before it becomes
available for sale (750,000 hours is over 85 years!). Instead, MTBF is tested by artificially aging the
drives. This is accomplished by subjecting them to stressful environments such as high temperatures,
high humidity, fluctuating voltages, etc.
Lesson Summary
Key points covered in this lesson:
y Drive performance is impacted by a number of factors
including:
– Seek time
– Rotational latency
– Command queuing
– Data transfer rate
These are the key points covered in this lesson. Please take a moment to review them.
Module Summary
Key points covered in this module:
y Physical drives are made up of a number of components
– HDA – houses the platters, spindles, actuator assemblies (which
include the actuator and the read/write heads)
– Controller - Controls power, communication, positioning, and
optimization
These are the key points covered in this module. Please take a moment to review them.
Check your knowledge of this module by taking some time to answer the questions shown on the slide.
RAID Arrays
After completing this module, you will be able to:
y Describe what RAID is and the needs it addresses
y Describe the concepts upon which RAID is built
y Compare and contrast common RAID levels
y Recommend the use of the common RAID levels based
on performance and availability considerations
In the previous module, we looked at how a disk drive works. Disk drives can be combined into disk
arrays to increase capacity.
An individual drive has a certain life expectancy before it fails, as measured by MTBF. Since there are
many drives in a disk array, potentially hundreds or even thousands, the probability of a drive
failure increases significantly. As an example, if the MTBF of a drive is 750,000 hours, and there are
100 drives in the array, then the MTBF of the array becomes 750,000 / 100, or 7,500 hours. RAID
(Redundant Array of Independent Disks) was introduced to mitigate this problem.
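The approximation used in this example treats drive failures as independent and simply divides the single-drive MTBF by the drive count; a one-line sketch:

def array_mtbf_hours(drive_mtbf_hours, number_of_drives):
    # Rough MTBF of a set of drives, assuming independent failures.
    return drive_mtbf_hours / number_of_drives

print(array_mtbf_hours(750_000, 100))   # 7500.0 hours, as in the example above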
RAID arrays enable you to increase capacity, provide higher availability (in case of a drive failure),
and increase performance (through parallel access). In this module, we will look at the concepts that
provide a foundation for understanding disk arrays with built-in controllers for performing RAID
calculations. Such arrays are commonly referred to as RAID Arrays. We will also learn about a few
commonly implemented RAID levels and the type of protection they offer.
(Diagram: a host connected to a RAID array through the RAID controller)
RAID (Redundant Arrays of Independent Disks) combines two or more disk drives in an array into
a RAID set or a RAID group. The RAID set appears to the host as a single disk drive. Properly
implemented RAID sets provide:
y Higher data availability
y Improved I/O performance
y Streamlined management of storage devices
Historical Note: In 1987, Patterson, Gibson and Katz at the University of California, Berkeley,
published a paper entitled, "A Case for Redundant Arrays of Inexpensive Disks (RAID)." This paper
described various types of disk arrays, referred to by the acronym RAID. At the time, data was stored
largely on large, expensive disk drives (called SLED, or Single Large Expensive Disk). The term
inexpensive was used in contrast to the SLED implementation. The term RAID has been redefined to
refer to independent disks, to reflect the advances in the storage technology.
RAID storage has now grown from an academic concept to an industry standard.
RAID Components
(Diagram: a RAID array containing physical arrays of disks grouped into logical arrays, connected to the host through the RAID controller)
Physical disks inside a RAID array are usually contained in smaller sub-enclosures. These sub-
enclosures, or physical arrays, hold a fixed number of physical disks, and may also include other
supporting hardware, such as power supplies.
A subset of disks within a RAID array can be grouped to form logical associations called logical
arrays, also known as a RAID set or a RAID group. The operating system may see these disk groups
as if they were regular disk volumes. Logical arrays facilitate the management of a potentially huge
number of disks. Several physical disks can be combined to make large logical volumes.
Generally, the array management software implemented in RAID systems handles:
y Management and control of disk aggregations (e.g. volume management)
y Translation of I/O requests between the logical disks and the physical disks
y Data regeneration if disk failures occur
RAID Levels
y 0 Striped array with no fault tolerance
y 1 Disk mirroring
y 3 Parallel access array with dedicated parity disk
y 4 Striped array with independent disks and a dedicated
parity disk
y 5 Striped array with independent disks and distributed
parity
y 6 Striped array with independent disks and dual
distributed parity
y Combinations of levels (e.g., 1+0, 0+1)
There are some standard RAID configuration levels, each of which has benefits in terms of
performance, capacity, data protection, etc.
The discussion centers around the commonly used levels and commonly used combinations of levels.
(Diagram: strips on each disk in the RAID set aligned into stripes - Stripe 1, Stripe 2, Stripe 3)
RAID sets are made up of disks. Within each disk, there are groups of contiguously addressed blocks,
called strips. The set of aligned strips that spans across all the disks within the RAID set is called a
stripe.
y Strip size (also called stripe depth) describes the number of blocks in a strip, and is the maximum
amount of data that is written to or read from a single disk in the set before the next disk is
accessed (assuming that the accessed data starts at the beginning of the strip).
− All strips in a stripe have the same number of blocks.
− Decreasing strip size means that data is broken into smaller pieces when spread across the
disks.
y Stripe size describes the number of data blocks in a stripe.
− To calculate the stripe size, multiply the strip size by the number of data disks.
y Stripe width refers to the number of data strips in a stripe (or, put differently, the number of data
disks in a stripe).
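A small sketch that applies these definitions, assuming simple RAID 0 style striping; the strip size and disk count are illustrative.

STRIP_SIZE_BLOCKS = 128    # blocks per strip
DATA_DISKS = 4             # stripe width

def stripe_size_blocks():
    # Stripe size = strip size x number of data disks.
    return STRIP_SIZE_BLOCKS * DATA_DISKS

def locate_block(logical_block):
    # Map a logical block to (disk, strip number on that disk, offset in strip).
    strip_index, offset = divmod(logical_block, STRIP_SIZE_BLOCKS)
    disk = strip_index % DATA_DISKS
    strip_on_disk = strip_index // DATA_DISKS
    return disk, strip_on_disk, offset

print(stripe_size_blocks())   # 512 blocks per stripe
print(locate_block(300))      # -> (2, 0, 44)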
(Diagram: RAID 0 - the RAID controller stripes the host's blocks across the disks in the set, with no redundancy)
RAID 0 stripes the data across the drives in the array without generating redundant data.
y Performance - better than JBOD because it uses striping. The I/O rate, called throughput, can be very high
when I/O sizes are small. Large I/Os produce high bandwidth (data moved per second) with this RAID type.
Performance is further improved when data is striped across multiple controllers with only one drive per
controller.
y Data Protection – no parity or mirroring means that there is no fault tolerance. The failure of a single drive results in the loss of the data in the set.
y Applications – those that need high bandwidth or high throughput, but where the data is not critical, or can
easily be recreated.
Striping improves performance by distributing data across the disks in the array. This use of multiple
independent disks allows multiple reads and writes to take place concurrently.
y When a large amount of data is written, the first piece is sent to the first drive, the second piece to the second
drive, and so on.
y The pieces are put back together again when the data is read.
y Striping can occur at the block (or block multiple) level or the byte level. Stripe size can be specified at the Logical Volume Manager level on the host (software RAID) or, depending on the vendor, at the array level (hardware RAID).
Notes on striping:
y Increasing the number of drives in the array increases performance because more data can be read or written
simultaneously.
y A higher stripe width indicates a higher number of drives and therefore better performance.
y Striping is generally handled by the controller and is transparent to the host operating system.
(Diagram: RAID 1 - the RAID controller writes each block from the host to both disks in the mirrored pair)
RAID 1 uses mirroring to improve fault tolerance. A RAID 1 group consists of 2 (typically) or more
disk modules. Every write to a data disk is also a write to the mirror disk(s). This is transparent to the
host. If a disk fails, the disk array controller uses the mirror drive for data recovery and continuous
operation. Data on the replaced drive is rebuilt from the mirror drive.
y Benefits - high data availability and high I/O rate (small block size)
y Drawbacks - the total number of disks in the array equals twice the number of data (usable) disks. This means that the overhead cost equals 100%, while usable storage capacity is 50%
y Performance – improves read performance, but degrades write performance
y Data Protection - improved fault tolerance over RAID 0
y Disks – at least two disks
y Cost – expensive due to the extra capacity required to duplicate data
y Maintenance - low complexity
y Applications - applications requiring high availability and non-degraded performance in the event
of a drive failure
(Diagram: RAID 0+1 - the RAID controller mirrors two RAID 0 stripe sets)
RAID 0+1 is one way of combining the speed of RAID 0 with the redundancy of RAID 1. RAID 0+1
is implemented as a mirrored array whose basic elements are RAID 0 stripes.
y Benefits - medium data availability, high I/O rate (small block size), and the ability to withstand multiple drive failures as long as they all occur within the same striped set
y Drawbacks - the total number of disks equals twice the number of data disks, with overhead cost equaling 100%
y Performance - high I/O rates; writes are slower than reads because of mirroring
y Data Protection - medium reliability
y Disks - even number of disks (4 disk minimum to allow striping)
y Cost - very expensive because of the high overhead
y Applications – imaging and general file server
(Diagram: RAID 0+1 with a single failed drive - the entire stripe set containing the failed drive is faulted and I/O continues on the surviving mirror)
In the event of a single drive failure, the entire stripe set is faulted. Normal processing can continue with the surviving mirror. However, rebuilding involves copying the entire stripe set from the mirror, not just the failed drive. This results in longer rebuild times than in a RAID 1+0 solution and makes RAID 0+1 implementations less common than RAID 1+0.
(Diagram: RAID 1+0 - the RAID controller stripes data across mirrored pairs of disks)
RAID 1+0 (or RAID 10, RAID 1/0, or RAID A) also combines the speed of RAID 0 with the
redundancy of RAID 1, but it is implemented in a different manner than RAID 0+1. RAID 1+0 is a
striped array whose individual elements are RAID 1 arrays - mirrors.
y Benefits - high data availability, high I/O rate (small block size), and the ability to withstand
multiple drive failures as long as they occur on different mirrors
y Drawbacks - total number of disks equal two times the data disks, with overhead cost equaling
100%
y Data Protection - high reliability
y Disks - even number of disks (4 disk minimum, to allow striping)
y Cost - very expensive, because of the high overhead
y Performance: High I/O rates achieved using multiple stripe segments. Writes are slower than reads,
because they are mirrored
y Applications – databases requiring high I/O rates with random data, and applications requiring
maximum data availability
(Diagram: RAID 1+0 with a single failed drive - only the failed drive's mirror partner is needed for the rebuild)
In the event of a drive failure, normal processing can continue with the surviving mirror. Only the data
on the failed drive has to be copied over from the mirror for the rebuild, as opposed to rebuilding the
entire stripe set in RAID 0+1. This results in faster rebuild times for RAID 1+0 and makes it a more
common solution than RAID 0+1.
Note that under normal operating conditions both RAID 0+1 and RAID 1+0 provide the same benefits.
These solutions are still aimed at protecting against a single drive failure and not against multiple drive
failures.
(Diagram: parity - data blocks 0 through 11 striped across four data disks, with the parity for each stripe stored on a dedicated parity disk)
Parity is a redundancy check that ensures that the data is protected without using a full set of duplicate
drives.
y If a single disk in the array fails, the other disks have enough redundant data so that the data from
the failed disk can be recovered.
y Like striping, parity is generally a function of the RAID controller and is transparent to the host.
y Parity information can either be:
− Stored on a separate, dedicated drive (RAID-3)
− Distributed with the data across all the drives in the array (RAID-5)
Parity Calculation
(Diagram: four data disks holding the values 5, 3, 4, and 2, and a parity disk holding their sum, 14; if the disk holding 4 is lost, its value is recovered as 14 - 5 - 3 - 2 = 4)
This example uses arithmetic operations to demonstrate how parity works. It illustrates the concept,
but not the actual mechanism.
y Think of parity as the sum of the data on the other disks in the RAID set. Each time data is
updated, the parity is updated as well, so that it always reflects the current sum of the data on the
other disks.
Note: While parity is calculated on a per stripe basis, the diagram omits this detail for the sake of
simplification.
y If a disk fails, the value of its data is calculated by using the parity information and the data on the
surviving disks.
y If the parity disk fails, the value of its data is calculated by using the data disks. Parity will only
need to be recalculated, and saved, when the failed disk is replaced with a new disk.
In the event of a disk failure, each request for data from the failed disk requires that the data be
recalculated before it can be sent to the host. This recalculation is time-consuming, and decreases the
performance of the RAID set. Hot spare drives, introduced later, provide a way to minimize the
disruption caused by a disk failure.
The actual parity algorithm uses the Boolean exclusive-OR (XOR) operation.
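To make the XOR point concrete, here is a small sketch of parity generation and rebuild over byte strings; it is illustrative only, since real controllers do this per stripe in hardware or firmware.

def xor_parity(strips):
    # Byte-wise XOR of equal-length strips yields the parity strip.
    parity = bytearray(len(strips[0]))
    for strip in strips:
        for i, byte in enumerate(strip):
            parity[i] ^= byte
    return bytes(parity)

data = [b"\x05\x0A", b"\x03\x0C", b"\x04\x01", b"\x02\x07"]
parity = xor_parity(data)

# Lose the third strip, then rebuild it from the survivors plus parity.
survivors = [data[0], data[1], data[3], parity]
rebuilt = xor_parity(survivors)
print(rebuilt == data[2])   # True: the missing strip is recovered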
(Diagram: RAID 3 - blocks 0 through 3 striped across the data disks, with the parity P0123 generated by the controller and written to a dedicated parity disk)
RAID Level 3 stripes data for high performance and uses parity for improved fault tolerance. Data is striped across all but one of the disks in the array. Parity information is stored on a dedicated drive, so that data can be reconstructed if a drive fails.
RAID 3 always reads and writes complete stripes of data across all the disks. There are no partial
writes that update one out of many strips in a stripe.
y Benefits - the total number of disks is less than in a mirrored solution (e.g. 1.25 times the data disks
for group of 5), good bandwidth on large data transfers
y Drawbacks - poor efficiency in handling small data blocks. This makes it not well suited to
transaction processing applications. Data is lost if multiple drives fail within the same RAID 3
Group.
y Performance - high data read/write transfer rate. Disk failure has a significant impact on
throughput. Rebuilds are slow.
y Data Protection - uses parity for improved fault tolerance
y Striping – byte level to multiple block level, depending on vendor implementation
y Applications - applications where large sequential data accesses are used such as medical and
geographic imaging
(Diagram: RAID 4 - blocks striped across independently accessible data disks, with parity blocks P0123 and P4567 written to a dedicated parity disk)
RAID Level 4 stripes data for high performance and uses parity for improved fault tolerance. Data is striped across all but one of the disks in the array. Parity information is stored on a dedicated disk so that data can be reconstructed if a drive fails.
The data disks are independently accessible, and multiple reads and writes can occur simultaneously.
y Benefits - the total number of disks is less than in a mirrored solution (e.g., 1.25 times the data
disks for a group of 5), good read throughput, and reasonable write throughput.
y Drawbacks – the dedicated parity drive can be a bottleneck when handling small data writes. This
RAID level is not well suited to transaction processing applications. Data is lost if multiple drives
fail within the same RAID 4 group.
y Performance - high data read transfer rate. Poor to medium write transfer rate. Disk failure has a
significant impact on throughput
y Data Protection - uses parity for improved fault tolerance.
y Striping – usually at the block (or block multiple) level
y Applications – general purpose file storage
RAID 4 is much less commonly used than RAID 5, discussed next. The dedicated parity drive is a
bottleneck, especially when a disk failure has occurred.
RAID Arrays - 15
Copyright © 2007 EMC Corporation. Do not Copy - All Rights Reserved.
(Diagram: the host sends Blocks 0–7 to the RAID controller; the controller distributes the data blocks
and the generated parity blocks, P0123 and P4567, across all drives in the group – there is no dedicated
parity drive.)
RAID 5 does not read and write data to all disks in parallel like RAID 3. Instead, it performs independent read
and write operations. There is no dedicated parity drive; data and parity information is distributed across all
drives in the group.
y Benefits - the most versatile RAID level. A transfer rate greater than that of a single drive but with a high
overall I/O rate. Good for parallel processing (multi-tasking) applications/environments. Cost savings due to
the use of parity over mirroring.
y Drawbacks - slower transfer rate than RAID 3. Small writes are slow because they require a read-modify-
write (RMW) operation: a write to a single block involves two reads (old block and old parity) and two writes
(new block and new parity). Performance degrades in recovery and reconstruction modes, and data is lost if
multiple drives within the same group fail.
y Performance - high read data transaction rate, medium write data transaction rate. Low ratio of parity disks to
data disks. Good aggregate transfer rate
y Data Protection - single disk failure puts volume in degraded mode. Difficult to rebuild (as compared to
RAID level 1).
y Disks - 5-disk and 9-disk groups are popular. Most implementations allow other RAID set sizes.
y Striping – block level, or multiple block level
y Applications - file and application servers, database servers, WWW, email, and News servers
Read operations do not involve parity calculations. In the case of a 5-disk RAID 5 group, a maximum of 5
independent reads can be performed. Because a write operation involves two disks (the parity disk and the data
disk), a maximum of two independent writes can be performed in this configuration. So a maximum of 5
independent reads or two independent writes can be performed on a 5-disk RAID 5 group.
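To make the read-modify-write penalty concrete, here is a hedged sketch (illustrative Python, per strip;
real controllers perform this in hardware and per stripe) of the small-write parity update, where the new
parity is the old parity XOR the old data XOR the new data:

    def raid5_small_write(old_data: bytes, old_parity: bytes, new_data: bytes):
        # Two reads (old data, old parity) have already happened; compute the new parity,
        # then two writes (new data, new parity) complete the RMW cycle.
        new_parity = bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))
        return new_data, new_parity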
RAID Arrays - 16
Copyright © 2007 EMC Corporation. Do not Copy - All Rights Reserved.
The details of diagonal parity generation and rebuilds are beyond the scope of this foundations course.
RAID Arrays - 17
Copyright © 2007 EMC Corporation. Do not Copy - All Rights Reserved.
RAID Implementations
y Hardware (usually a specialized disk controller card)
– Controls all drives attached to it
– Performs all RAID-related functions, including volume management
– Array(s) appear to the host operating system as a regular disk drive
– Dedicated cache to improve performance
– Generally provides some type of administrative software
y Software
– Generally runs as part of the operating system
– Volume management performed by the server
– Provides more flexibility for hardware, which can reduce the cost
– Performance is dependent on CPU load
– Has limited functionality
As a broad distinction, hardware RAID is implemented by intelligent storage systems external to the
host, or, at minimum, intelligent controllers in the host that offload the RAID management functions
from the host.
Software RAID usually describes RAID that is managed by the host. Typically it is implemented via a
Logical Volume Manager on the host. The disadvantage of software RAID is that it uses host CPU
cycles that would be better utilized to run applications. Software RAID often looks attractive initially
because it does not require the purchase of additional hardware. However, the initial cost savings are
soon offset by the expense of using a costly server to perform I/O processing that it handles
inefficiently at best.
RAID Arrays - 18
Copyright © 2007 EMC Corporation. Do not Copy - All Rights Reserved.
Hot Spares
(Diagram: a RAID controller with its drives and an idle hot spare.)
A hot spare is an idle component (often a drive) in a RAID array that becomes a temporary replacement for a
failed component. For example:
The hot spare drive takes the failed drive’s identity in the array.
Data recovery takes place. How this happens is based on the RAID implementation:
y If parity was used, data is rebuilt onto the hot spare from the parity and data on the surviving drives.
y If mirroring was used, data is rebuilt using the data from the surviving mirror drive.
The failed drive is replaced with a new drive at some time later.
One of the following occurs:
y The hot spare permanently replaces the failed drive—meaning that it is no longer a hot spare, and a new hot
spare must be configured on the system.
y When the new drive is added to the system, data from the hot spare is copied to the new drive. The hot spare
returns to its idle state, ready to replace the next failed drive.
Note: The hot spare drive needs to be large enough to accommodate the data from the failed drive.
Hot spare replacement can be:
y Automatic - when a disk’s recoverable error rates exceed a predetermined threshold, the disk subsystem tries
to copy data from the failing disk to a spare one. If this task completes before the damaged disk fails, the
subsystem switches to the spare and marks the failing disk unusable. (If not, it uses parity or the mirrored
disk to recover the data, as appropriate).
y User initiated - the administrator tells the system when to do the rebuild. This gives the administrator control
(e.g., rebuild overnight so as not to degrade system performance); however, the system is vulnerable to
another failure because the hot spare is now unavailable. Some systems implement multiple hot spares to
improve availability.
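As a hedged sketch of the parity-based rebuild path only (the mirrored case simply copies from the
surviving mirror drive), each stripe's lost strip is the XOR of the surviving data strips and the parity
strip; the function and structures below are illustrative, not an actual controller interface:

    def rebuild_stripe(surviving_strips, parity_strip):
        # XOR of everything that survived equals the strip that was lost.
        lost = bytes(len(parity_strip))
        for strip in list(surviving_strips) + [parity_strip]:
            lost = bytes(a ^ b for a, b in zip(lost, strip))
        return lost

    def rebuild_onto_hot_spare(stripes, hot_spare):
        # stripes: iterable of (surviving_strips, parity_strip) pairs; hot_spare: list standing in for the spare drive.
        for surviving_strips, parity_strip in stripes:
            hot_spare.append(rebuild_stripe(surviving_strips, parity_strip))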
RAID Arrays - 19
Copyright © 2007 EMC Corporation. Do not Copy - All Rights Reserved.
Hot Swap
(Diagram: redundant RAID controllers; a failed controller is replaced without shutting down the system.)
Like hot spares, hot swaps enable a system to recover quickly in the event of a failure. With a hot
swap, the user can replace the failed hardware (such as a controller) without having to shut down the
system.
RAID Arrays - 20
Copyright © 2007 EMC Corporation. Do not Copy - All Rights Reserved.
Module Summary
Key points covered in this module:
y What RAID is and the needs it addresses
y The concepts upon which RAID is built
y Some commonly implemented RAID levels
These are the key points covered in this module. Please take a moment to review them.
RAID Arrays - 21
Copyright © 2007 EMC Corporation. Do not Copy - All Rights Reserved.
Check your knowledge of this module by taking some time to answer the questions shown on the slide.
RAID Arrays - 22
Copyright © 2007 EMC Corporation. Do not Copy - All Rights Reserved.
At this point, you have learned how disks work and how they can be combined to form RAID arrays.
Now we are going to build on those concepts and add intelligence to those arrays, making them even
more powerful. Throughout this module we refer to this as an intelligent storage system.
The objectives for this module are shown here. Please take a moment to read them.
This module contains two lessons. In this lesson, we take a high level look at the components of a disk
storage system as well as two approaches to implementing them: integrated and modular.
The objectives for this lesson are shown here. Please take a moment to read them.
Let’s start by asking the question, “What is an intelligent storage system?” It is a disk storage system
which distributes data over several devices, and manages access to that data.
Intelligent storage systems have an operating environment. The operating environment can be viewed
as an “operating system” for the array. They also have large amounts of cache. Sophisticated
algorithms manage cache to optimize the read/write requests from the hosts. Large capacity drives can
be partitioned or “sliced” into smaller units. These smaller units, in turn, can be presented to hosts as
individual disk drives. Array management software can also enable multiple hosts to access the array
via the same I/O channel. The operating environment ensures that each host can only access the disk
resources allocated to it.
A simple collection of disks in an array, a RAID array, and an intelligent storage system all provide
increased data storage capacity. However, intelligent storage systems provide additional benefits, as
listed in the slide.
Monolithic
(Diagram: a monolithic storage system – FC ports, port processors, cache, and RAID controllers.)
Intelligent storage systems generally fall into one of two categories, monolithic and modular.
Monolithic storage systems are generally aimed at the enterprise level, centralizing data in a powerful
system with hundreds of drives. They have the following characteristics:
y Large storage capacity
y Large amounts of cache to service host I/Os efficiently and optimally
y Redundant components for improved data protection and availability
y Many built-in features to make them more robust and fault tolerant
y Usually connect to mainframe computers or very powerful open systems hosts
y Multiple front end ports to provide connectivity to multiple servers
y Multiple back end Fibre Channel or SCSI RAID controllers to manage disk processing.
This system is contained within a single frame or interconnected frames (for expansion) and can scale
to support increases in connectivity, performance, and capacity as required. Monolithic storage
systems can handle large amounts of concurrent I/Os from numerous servers and applications. They
are quite expensive compared to modular storage systems (discussed in the next slide). Many of their
features and functionality might be required only for mission critical applications in large enterprises.
Note: Monolithic arrays are sometimes called integrated arrays, enterprise arrays, or cache centric
arrays.
Modular
(Diagram: a modular storage system in a rack – servers connect to host interfaces on Controller A and
Controller B, each with its own cache; a control module with disks and additional disk modules provide
the capacity.)
Modular storage systems provide storage to a smaller number of (typically) Windows or Unix servers
than larger integrated storage systems. Modular storage systems are typically designed with two
controllers, each of which contains host interfaces, cache, RAID processors, and disk drive interfaces.
They have the following characteristics:
y Smaller total storage capacity and less global cache than monolithic arrays
y Fewer front end ports for connection to servers
y Performance can degrade as the number of connected servers increases
y Limited redundancy
y Fewer options for array based local and remote replication
Note: Modular storage systems are sometimes called midrange or departmental storage systems.
It should also be noted that the distinction between monolithic and modular arrays is becoming
increasingly blurred. Traditionally, monolithic arrays have been associated with large enterprises and
modular arrays with small/medium businesses. With proper classification of application requirements
(such as performance, availability, scalability), modular arrays can now be found in several enterprises,
providing optimal storage solutions at a lower cost (than monolithic arrays).
(Diagram: the front end – ports and controllers between the hosts and cache.)
The front end controller receives and processes I/O requests from the host. Hosts connect to the
storage system via ports on the front end controller.
y Ports are the external interfaces for connectivity to the host. Each storage port has processing logic
responsible for executing the appropriate transport protocol for storage connections. For example,
it could use SCSI, Fibre Channel, or iSCSI.
y Behind the storage ports are controllers which communicate with the cache and back end to
provide data access.
The number of front-end ports on a modular storage system generally ranges from 1-8; 4 is typical. On
a large monolithic array, port counts as high as 64 or 128 are common.
(Diagram: four host requests arrive at the front end; in the first panel they are serviced in arrival order,
and in the second panel command queuing reorders them for more efficient execution.)
As seen earlier, command queuing processes multiple concurrent commands based on the organization
of the data on disk, regardless of the order in which the commands were received.
The command queuing software reorders commands so as to make the execution more efficient, and
assigns each command a tag. This tag identifies when the command will be executed, just as the
number you take at the deli determines when you will be served.
Some disk drives, particularly SCSI and Fibre Channel disks, are intelligent enough to manage their
own command queuing. Intelligent storage systems can make use of this native disk intelligence, and
may supplement it with queuing performed by the controller.
There are several command queuing algorithms that can be used. Here are some of the common ones.
y First In, First Out – commands are executed in the order in which they arrive. This is identical to
having no queuing, and is therefore inefficient in terms of performance.
y Seek Time Optimization - faster than First In, First Out. However, two requests could be on
cylinders that are very close to each other, but in very different places within the track. Meanwhile,
there might be a third sector that is a few cylinders further away but much closer overall to the
location of the first request. Optimizing seek times only, without regard for rotational latency, will
not normally produce the best results.
y Access Time Optimization - combines seek time optimization with an analysis of rotational
latency for optimal performance.
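As an illustration only (real drives and controllers use far more sophisticated heuristics, and the cylinder
numbers below are made up), the difference between first-in-first-out ordering and a simple
seek-optimized ordering can be sketched in Python like this:

    def fifo_order(requests):
        # Execute commands in arrival order: no queuing benefit.
        return list(requests)

    def seek_optimized_order(requests, current_cylinder=0):
        # Greedy shortest-seek-first ordering; ignores rotational latency.
        pending, ordered, pos = list(requests), [], current_cylinder
        while pending:
            nxt = min(pending, key=lambda cyl: abs(cyl - pos))
            pending.remove(nxt)
            ordered.append(nxt)
            pos = nxt
        return ordered

    queue = [120, 3, 95, 7]                # requested cylinder numbers
    print(fifo_order(queue))               # [120, 3, 95, 7]
    print(seek_optimized_order(queue))     # [3, 7, 95, 120]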
Cache improves system performance by isolating the hosts from the mechanical delays associated with
physical disks. You have already seen that accessing data from a physical disk usually takes several
milliseconds, because of seek times and rotational latency; accessing data from high speed memory
typically takes less than a millisecond. The performance of reads as well as writes may be improved by
the use of cache. Cache is discussed in more detail in the next lesson.
(Diagram: the back end – controllers and ports between cache and the physical disks.)
The back end controls the data transfers between cache and the physical disks. Physical disks are
connected to ports on the back end.
The back end provides the communication with the disks for read and write operations. The controllers
on the back end:
y Manage the transfer of data between the I/O bus and the disks in the storage system
y Handle addressing for the device - translating logical blocks into physical locations on the disk
y Provide additional, but limited, temporary storage for data
y Provide error detection and correction – often in conjunction with similar features on the disks
To provide maximum data protection and availability, dual controllers provide an alternative path to
physical disks, in case of a controller or a port failure. This reliability is enhanced if the disks used are
dual-ported; each disk port can connect to a separate controller. Having multiple controllers also
facilitates load balancing. Having more than one port on each controller provides additional protection
in the event of port failure. Typically, disks can be accessed via ports on controllers of two different
back ends.
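A minimal sketch of the idea (the structures below are hypothetical; real arrays track path health and
load far more carefully): with dual controllers and dual-ported disks, the array can select any healthy
path to a disk, which also enables load balancing.

    # Hypothetical path table for one disk: each entry names a controller, a port, and its health.
    paths = [
        {"controller": "A", "port": 0, "healthy": False},   # failed controller or port
        {"controller": "B", "port": 1, "healthy": True},
    ]

    def choose_path(paths):
        # Return a healthy path to the disk; raise if every path has failed.
        healthy = [p for p in paths if p["healthy"]]
        if not healthy:
            raise RuntimeError("no path to device")
        # Choosing the least-busy healthy path would add load balancing; here we simply take the first.
        return healthy[0]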
(Diagram: a host accessing LUNs 0, 1, and 2 presented by the storage system.)
Since intelligent storage systems have multiple disk drives, they use the disks in various ways to
provide optimal performance and capacity. For example:
y A large physical drive could be subdivided into multiple virtual disks of smaller capacity. This is
similar to drive partitioning discussed in Section 2.
y Several physical drives can be combined together and presented as one large virtual drive. This is
similar to drive concatenation discussed in Section 2.
y Typically physical drives are grouped into RAID sets or RAID groups. LUNs with the desired
level of RAID protection are then created from these RAID sets and presented to the hosts.
The mapping of the LUNs to their physical location on the drives is managed by the controller.
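A hedged sketch of that mapping (the table layout and names are purely illustrative, not any vendor's
metadata format): the controller records, for each LUN, the RAID set it was carved from and the extent
it occupies, and translates host block addresses accordingly.

    # Hypothetical LUN table: LUN id -> RAID set, starting block within the set, and size in blocks.
    lun_map = {
        0: {"raid_set": "RG1", "start_block": 0,          "blocks": 20_000_000},
        1: {"raid_set": "RG1", "start_block": 20_000_000, "blocks": 20_000_000},
    }

    def resolve(lun_id, logical_block):
        # Translate a host-visible (LUN, block) address into a block within the RAID set.
        entry = lun_map[lun_id]
        if logical_block >= entry["blocks"]:
            raise ValueError("block address is outside the LUN")
        return entry["raid_set"], entry["start_block"] + logical_block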
(Diagram: a 5-disk RAID set sliced into LUNs; LUNs 0 and 1 are presented to the host.)
In this example, a RAID set consisting of 5 disks has been sliced, or partitioned, into several LUNs.
LUNs 0 and 1 are shown. Note how a portion of each LUN resides on each disk in the RAID set.
(Diagram: a single physical drive divided into LUNs 0, 1, and 2 and presented to a host running a
volume manager; the Windows device name \\.\PhysicalDrive0 is shown as an example.)
This example shows a single physical disk divided into 3 LUNs: LUN 0, 1 and 2. The LUNs are
presented separately to the host or hosts. A host will see a LUN as if it were a single disk device. The
host is not aware that this LUN is only a part of a larger physical drive.
The host assigns logical device names to the LUNs; the naming conventions vary by platform.
Examples are shown for both Unix and Windows addressing.
Lesson Summary
Key points covered in this lesson:
y An intelligent disk storage system:
– Is highly optimized for I/O processing
– Has an operating environment which, among other things, manages
cache, controls resource allocation, and provides advanced local and
remote replication capabilities
– Has a front end, cache, a back end, and physical disks
– The physical disks can be partitioned into LUNs or can be grouped
into RAID sets, and presented to the hosts
Please take a few moments to review the key points covered in this lesson.
We already mentioned that cache plays a key role in an intelligent storage system. At this point, let’s
take a closer look at what cache is and how it works.
(Diagram: read and write requests from the host are serviced through cache, and acknowledgements are
returned to the host.)
Physical disks are the slowest components of an intelligent storage system. If the disk has to be
accessed for every I/O operation from the host, response times are very high. Cache helps in reducing
the I/O response times. Cache can improve I/O response times in the following two ways:
y Read cache holds data that is staged into it from the physical disks. As discussed later, data can be
staged into cache ahead of time when read access patterns from hosts are detected.
y Write cache holds data written by a host to the array until it can be committed to disk. Holding
writes in cache and acknowledging them immediately to host, prior to committing to disk, isolates
the host from inherent mechanical delays of the disk (such as rotational and seek latencies). Other
benefits of write caching are discussed later in this lesson.
Cache is volatile – loss of power leads to loss of any data resident in cache that has not yet been
committed to disk. Storage system vendors solve this problem in various ways. The memory may be
powered by a battery until AC power is restored, or battery power may be used to write the content of
cache to disk; in the event of an extended power failure, the latter is the better option. Intelligent storage
systems can have upwards of 256 GB of cache and hundreds of physical disks. Potentially, there could
be a large amount of data to be committed to numerous disks. In this case, the batteries may not
provide power for a sufficient amount of time to write each piece of data to the appropriate disk. Some
vendors use a dedicated set of physical disks to “dump” the content of cache during a power failure.
This is usually referred to as vaulting, and the dedicated disks are called vault drives. When power is
restored, data from these disks are read and then written to the correct disks.
(Diagram: cache divided into two areas – the Data Store and the Tag RAM.)
The amount of user data that the cache can hold is based on the cache size and design. Cache normally
consists of two areas:
y Data store - the part of the cache that holds the data
y Tag RAM – the part of the cache that tracks the location of the data in the data store. Entries in
this area indicate where the data is found in memory, and also where the data belongs on disk.
Additional information found here will include a ‘dirty bit’ – a flag that indicates that data in cache
has not yet been committed to disk. There may also be time-based information such as the time of
last access. This information will be used to determine which cached information has not been
accessed for a long period of time, and may be discarded.
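A minimal Python sketch of the tag lookup idea follows (the field names are assumptions for
illustration, not an actual cache layout): the Tag RAM maps a disk address to a data-store slot, plus a
dirty flag and a last-access time.

    import time

    tag_ram = {}   # disk address -> tag entry describing where the data sits in the data store

    def record_write(disk_address, data_store_slot):
        # A host write has been placed in the data store but not yet committed to disk (dirty).
        tag_ram[disk_address] = {"slot": data_store_slot, "dirty": True, "last_access": time.time()}

    def lookup(disk_address):
        # Return the data-store slot on a hit, or None on a miss; refresh the access time on a hit.
        entry = tag_ram.get(disk_address)
        if entry is None:
            return None
        entry["last_access"] = time.time()
        return entry["slot"]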
Configuration and implementation of cache varies between vendors. In general, these are the options:
y A reserved set of memory addresses for reads and another reserved set of memory addresses for
writes. This implementation is known as dedicated cache. Cache management, such as tracking
the addresses currently in use, those that are available, and the addresses whose content has to be
committed to disk, can become quite complex in this implementation.
y In a global cache implementation, both reads and writes can use any of the available memory
addresses. Cache management is more efficient in this implementation, as only one global set of
addresses has to be managed.
− Some global cache implementations allow the users to specify the percentage of cache that has
to be available for reads and the percentage of cache that has to be available for writes. This
implementation is common in modular storage arrays.
− In other global cache implementations, the ratio of cache available for reads vs. writes might be
fixed, or the array operating environment can dynamically adjust this ratio based on the current
workload. These implementations are typically found in integrated storage arrays.
In integrated arrays, all the front end and back end directors have access to all regions of the cache. In
modular arrays, each controller (typically two) has access to its own cache on-board. A fault in
memory, for example failure of a memory chip, would lead to loss of any uncommitted data held in it.
Vendors use different approaches to mitigate this risk:
y Pro-actively “scrub” all regions of memory. Faults can be detected ahead of time, and the faulty
region can be isolated or fenced, and taken out of use. This is similar to bad block relocation on
physical disks.
y Mirror all writes within cache. Similar to RAID 1 mirroring of disks, each write can be held in two
different memory addresses, well separated from each other. Each write would be placed on two
independent memory boards, for example. In the event of a fault, the write data will still be safe in
the mirrored location and can be committed (de-staged) to disk. Since reads are staged from the
disk to cache, if there is a fault, an I/O error could be returned to the host, and the data can be
staged back into a different location in cache to complete the read request. The read service time
would be longer; however, there is no risk of data loss. As only writes are mirrored, this method leads
to better utilization of the available cache for the data store.
y A third approach would be to mirror all reads and all writes in cache. In this implementation, when
data is read from the disk to be staged into cache, it is written to two different locations. Likewise
writes from hosts will be held in two different locations. This effectively reduces the amount of
usable cache by half. As reads and writes are treated on equal footing, the management overhead
would be less than that of mirroring writes alone.
Either mirroring approach introduces the problem of cache coherency. Cache coherency means that the
data in the two different cache addresses is identical at all times. It is the responsibility of the array
operating environment to ensure coherency.
When a host issues a read request, the front end controller accesses the Tag RAM to determine whether
the required data is already available in cache.
If the requested data is found in the cache, it is known as a cache hit.
y The data is sent directly to the host, with no disk operation required.
y This provides fast response times.
If the data is not found in cache, the operation is known as a cache miss.
y When there is a cache miss, the data must be read from disk. The back end controller accesses the
appropriate disk and retrieves the requested data.
y Data is typically placed in cache and then sent to the host.
The read cache hit ratio (or hit rate), usually expressed as a percentage, describes how well the read
cache is performing. To determine the hit ratio, divide the number of read cache hits by the total
number of read requests.
Cache misses lengthen I/O response times. The response time depends on factors such as rotational
latency and seek time, as discussed earlier.
A read cache hit can take about a millisecond, while a read cache miss can take many times longer.
Remember that average disk access times for reads are often in the 10 ms range.
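As a hedged back-of-the-envelope example (reusing the approximate 1 ms and 10 ms figures above; the
counters are made up for illustration):

    cache_hit_ms, cache_miss_ms = 1.0, 10.0     # approximate service times from the text
    read_hits, total_reads = 8_500, 10_000      # hypothetical counters
    hit_ratio = read_hits / total_reads         # 0.85 -> an 85% read cache hit ratio
    avg_read_ms = hit_ratio * cache_hit_ms + (1 - hit_ratio) * cache_miss_ms
    print(f"hit ratio {hit_ratio:.0%}, average read response {avg_read_ms:.2f} ms")   # 85%, 2.35 ms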
Cache is a finite resource. Even though the intelligent storage systems can have hundreds of GB of cache, when
all cache addresses are used up for data, some addresses have to be freed up to accommodate new data. Waiting
until a cache full condition occurs to free up addresses is inefficient and leads to performance degradation. The
array operating environment should proactively maintain a set of free addresses and/or a list of addresses that can
be potentially freed up when required. Algorithms used for cache management are:
y Least Recently Used (LRU) – access to data in cache is monitored continuously, and the addresses that have
not been accessed in a “long time” can be freed up immediately, or can be marked as being candidates for re-
use. This algorithm assumes that data not accessed in a while will not be requested by the host. The length
of time that an address should be inactive prior to being freed up is dependent on the implementation. Clearly,
if an address contains write data not yet committed to disk, that data will be written to disk before the address
is re-used.
y Most Recently Used (MRU) – is the converse of LRU. Addresses that have been accessed most recently
will be freed up or marked as potential candidates for re-use. This algorithm assumes that data that has been
accessed in the immediate past may not be required for a while.
y Read Ahead – if the read requests are sequential, i.e. a contiguous set of disk blocks, several more blocks not
yet requested by the host can be read from disk and placed in cache. When the host subsequently requests
these blocks, these read operations will be read hits. In general, there is an upper limit to the amount of data
that is pre-fetched. The percentage of pre-fetched data that is actually used is also monitored. A high
percentage would imply that the algorithm is correctly predicting the sequential access pattern. A low
percentage would indicate that effort is being wasted in performing pre-fetch, and that the access pattern
from the host is not truly sequential.
Some implementations allow for data to be “pinned” in cache permanently. The pinned addresses will not
participate in the LRU or the MRU considerations. Note that the slide shows a depiction of the LRU.
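A minimal sketch of the LRU idea, using Python's OrderedDict (illustrative only; a real array cache
manager also tracks dirty data, pinned addresses, and pre-fetch statistics):

    from collections import OrderedDict

    class LRUCache:
        # Tiny least-recently-used cache: the oldest untouched address is freed first.
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()            # disk address -> cached data

        def get(self, address):
            if address not in self.entries:
                return None                         # cache miss
            self.entries.move_to_end(address)       # mark as most recently used
            return self.entries[address]

        def put(self, address, data):
            if address in self.entries:
                self.entries.move_to_end(address)
            self.entries[address] = data
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)    # free the least recently used address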
Write Algorithms
(Diagram: write-through cache – the write request is acknowledged to the host only after the data has
been written to disk; write-back cache – the write request is acknowledged as soon as the data is placed
in cache, and committed to disk later.)
Write-through cache – data is placed in cache, immediately written to disk, and an acknowledgement is
sent to the host. Because data is committed to disk as it arrives, the risk of data loss is low. Write
response times will be longer because of the mechanical delays of the disk.
Write-back cache – data is placed in cache and immediately acknowledged to the host. At a later time,
data from several writes are committed (de-staged) to disk. Uncommitted data is exposed to risk of
loss in the event of failures. Write response times are much faster as the write operations are isolated
from the mechanical delays of the disk.
Cache could also be bypassed under certain conditions, such as very large write I/O sizes. In this
implementation, writes are sent directly to disk.
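The difference between the two policies can be sketched as follows (the cache and disk objects are
hypothetical stand-ins for illustration only; "disk.write" represents the back-end commit):

    def write_through(cache, disk, block, data):
        # Acknowledge only after the data is safely on disk: slower, but low risk of loss.
        cache[block] = data
        disk.write(block, data)
        return "ack"

    def write_back(cache, dirty_blocks, block, data):
        # Acknowledge immediately; the data sits in cache until it is de-staged later.
        cache[block] = data
        dirty_blocks.add(block)
        return "ack"

    def destage(cache, dirty_blocks, disk):
        # Background task: commit dirty blocks to disk and clear their dirty state.
        for block in list(dirty_blocks):
            disk.write(block, cache[block])
            dirty_blocks.discard(block)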
Lesson Summary
Key points covered in this lesson:
y Cache is a memory space used by an intelligent storage
system to reduce the time required to service I/O
requests from the host
y Cache can speed up both read and write operations
y Algorithms to manage cache include:
– Least Recently Used (LRU)
– Most Recently Used (MRU)
– Read Ahead (pre-fetch)
These are the key points covered in this lesson. Please take a moment to review them.
Module Summary
Key points covered in this module:
y Intelligent Storage Systems are RAID Arrays that are
highly optimized for I/O processing
y Monolithic storage systems are generally aimed at the
enterprise level, centralizing data in a powerful system
with hundreds of drives
y Modular storage systems provide storage to a smaller
number of (typically) Windows or Unix servers than larger
integrated storage systems
y Cache in intelligent storage systems accelerates
response times for host I/O requests
© 2007 EMC Corporation. All rights reserved. Disk Storage Systems - 26
These are the key points covered in this module. Please take a moment to review them.
Check your knowledge of this module by taking some time to answer the questions shown on the slide.
At this point, we will apply what you learned in this module to some real world examples. In this case,
we look at the architecture of the EMC CLARiiON and EMC Symmetrix storage arrays.
(Diagram: CLARiiON architecture with redundant 2/4 Gb/s Fibre Channel back ends and 4 Gb/s Link
Control Cards.)
The CLARiiON architecture includes fully redundant, hot swappable components—meaning the
system can survive the loss of a fan or a power supply, and the failed component can be replaced
without powering down the system.
y The Standby Power Supplies (SPSs) maintain power to the cache for long enough to allow its
content to be copied to a dedicated disk area (called the vault) if a power failure should occur.
y Storage Processors communicate with each other over the CLARiiON Messaging Interface (CMI)
channels. They transport commands, status information, and data for write cache mirroring
between the Storage Processors. CMI is used for peer-to-peer communications in the SAN space
and may be used for I/O expansion in the NAS space.
y The CX3-80 uses PCI Express as the high-speed CMI path. The PCI Express architecture delivers
advanced I/O technology with high bandwidth per pin, superior routing characteristics, and
improved reliability.
y When more capacity is required, additional disk array enclosures containing disk modules can be
easily added. Link Control Cards (LCC) connect shelves of disks.
The Symmetrix DMX series arrays deliver the highest levels of performance and throughput for high-end storage. They incorporate the
following features:
y Direct Matrix Interconnect
− Up to 128 direct paths from directors and memory
− Up to 128 GB/s data bandwidth; up to 6.4 GB/s message bandwidth
y Dynamic Global Memory
− Up to 512 GB Global Memory
− Intelligent Adaptive Pre-fetch
− Tag-based cache algorithms
y Enginuity Operating Environment
− Foundation for powerful storage-based functionality
− Continuous availability and advanced data protection
− Performance optimization and self-tuning
− Advanced management
− Integrated SMI-S compliance
y Advanced processing power
− Up to 130 PowerPC Processors
− Four or eight processors per director
y High-performance back end
− Up to 64 2 Gb/s Fibre Channel paths (12.8 GB/s maximum bandwidth)
− RAID 0, 1, 1 + 0, 5
− 73, 146, and 300 GB 10,000 rpm disks; 73 and 146 GB 15,000 rpm disks; 500 GB 7,200 rpm disks
y A fully fault-tolerant design
− Nondisruptive upgrades and operations
− Full component-level redundancy with hot-swappable replacements
− Support: Dual-ported disks and global-disk hot spares
− Redundant power supplies and integrated battery backups
− Remote support and proactive call-home capabilities
This shows the logical representation of the Symmetrix DMX architecture. The Front-end (host connectivity directors and
ports), Cache (Memory) and the Back-end (directors/ports which connect to the physical disks) are shown.
Front-end:
y Hosts connect to the DMX via front-end ports (shown as "Host Attach") on front-end directors. The DMX supports
ESCON, FICON, Fibre Channel and iSCSI front-end connectivity.
Back-end:
y The disk director ports (back-end) are connected to Disk Array Enclosures. The DMX back-end employs an arbitrated
loop design and dual-ported disk drives. I/Os to the physical disks are handled by the back-end.
Cache:
y All front-end I/Os (reads and writes) to the Symmetrix have to pass through the cache; this is unlike some arrays,
which allow I/Os to bypass cache altogether. Let us take a look at how the Symmetrix handles front-end read and write
operations:
y Read: A read is issued by a server. The Symmetrix looks for the data in the cache; if the data is in cache, it is
read from cache and sent to the server via the front-end port – this is a read hit. If the data is not in cache, the
Symmetrix goes to the physical disks on the back-end, fetches the data into cache, and then sends the data from the
cache to the requesting server – this is a read miss.
y Write: A write is issued by a server. The write is received in cache and a write complete is immediately
issued to the server. Data is de-staged from the cache to the back-end at a later time.
y Enhanced global memory technology supports multiple regions and sixteen connections on each global memory
director, one to each director. Each director slot port is hard-wired point-to-point to one port on each global memory
director board. If a director is removed from a system, the usable bandwidth is not reduced. If a memory board is
removed, the usable bandwidth is reduced.
(Diagram: back-end Fibre Channel loops showing each drive's primary (P) and secondary (S)
connections. P = primary connection to the drive; S = secondary connection for redundancy.)
© 2007 EMC Corporation. All rights reserved. Disk Storage Systems - 33
Symmetrix DMX back-end employs an arbitrated loop design and dual-ported disk drives. Each drive
connects to two paired Disk Directors through separate Fibre Channel loops. Port Bypass Cards
prevent a Director failure or replacement from affecting the other drives on the loop. Directors have
four primary loops for normal drive communication and four secondary loops to provide an alternate
path if the other director fails.
All Symmetrix arrays have a Service Processor running the SymmWin application. Initial
configuration of Symmetrix arrays has to be performed by EMC personnel via the Symmetrix Service
Processor.
Physical disks (in the disk array enclosures) are sliced into hypers, or disk slices, and protection
schemes (RAID1, RAID5, etc.) are then incorporated, creating the Symmetrix logical volumes
(discussed in the next slide). A Symmetrix logical volume is the entity that is presented to a host via a
Symmetrix front-end port. The host views the Symmetrix logical volume as a physical drive. Do not
confuse Symmetrix logical volumes with host-based logical volumes. Symmetrix logical volumes are
defined by the Symmetrix configuration, while host-based logical volumes are configured by Logical
Volume Manager software.
EMC ControlCenter and Solutions Enabler are software packages which are used to monitor and
manage the Symmetrix. Solutions Enabler has a command line interface, while ControlCenter provides
a Graphical User Interface (GUI). ControlCenter is a very powerful storage management tool;
managing the Symmetrix is one of the many things it can do.
(Diagram: two hyper volumes, LV 04B M1 and LV 04B M2, on separate physical drives form the
mirrored Symmetrix logical volume 04B, presented to the host at host address Target = 1, LUN = 0.)
© 2007 EMC Corporation. All rights reserved. Disk Storage Systems - 35
Mirroring provides the highest level of performance and availability for all applications. Mirroring
maintains a duplicate copy of a logical volume on two physical drives. The Symmetrix maintains
these copies internally by writing all modified data to both physical locations. The mirroring function
is transparent to attached hosts, as the hosts view the mirrored pair of hypers as a single Symmetrix
logical volume.
A RAID 1 SLV: Two hyper volumes from two different disks on two different disk directors are
logically presented as a RAID 1 SLV. The hyper volumes are chosen from different disks on different
disk directors to provide maximum redundancy. The SLV is given a hexadecimal address. In the
example, SLV 04B is a RAID 1 SLV whose hyper volumes exist on the physical disks in the back-end
of the array.
The SLV is then mapped to one or more Symmetrix front-end ports (a target and LUN ID is assigned
at this time). The SLV can now be assigned to a server. The server views the SLV as a physical drive.
On a fully configured Symmetrix DMX3 array, one can have up to 64,000 Symmetrix logical volumes.
The maximum number of SLVs on a DMX is a function of the number of disks, disk directors, and the
protection scheme used.
Data Protection
y Mirroring (RAID 1)
– Highest performance, availability and functionality
– Two hyper mirrors form one Symmetrix Logical Volume located on separate
physical drives
Data protection options are configured at the volume level and the same Symmetrix can employ a
variety of protection schemes.
Dynamic Sparing: Disks in the back-end of the Array which are reserved for use when a physical disk
fails. When a physical disk fails, the dynamic spare is used as a replacement.
SRDF is a remote replication solution and is discussed later on in the Business Continuity section of
this course.
Q: Architecture Exercise
Identify the components of a data storage environment:
(Diagrams: a data storage environment with components labeled A–F, and a write data flow with
operations labeled A–F.)
1. Fill in the letter in the diagram that corresponds to the appropriate operation. Hint: Not all of the
operations are used.
___ Host sends data to storage system
___ Data is written to physical disk some time later
___ Data is written to cache
___ Data is written to physical disk immediately
___ An acknowledgement is sent to the host
___ Data is returned to the host
___ Data is sent to back end
___ Back end receives status of write operation
2. List the operations in the correct order.
(Diagram: a read data flow – cache hit – with operations labeled A–D.)
Fill in the letter in the diagram that corresponds to the appropriate operation. Hint: Not all of the
operations are used.
___ Host sends read request to storage system
___ Data is read from physical disk when requested by the LRU algorithm
___ Data is written to cache
___ Data is read from physical disk immediately
___ Read data is sent to the host
___ Status is returned to the host
___ Data is sent to back end
___ Back end receives status of read operation
___ Cache is searched, and data is found there
___ Cache is searched, and data is not found there
___ Data placed in cache by a previous read or write operation
(Diagram: a read data flow – cache miss – with operations labeled A–F.)
1. Fill in the letter in the diagram that corresponds to the appropriate operation. Hint: Not all of the
operations are used.
___ Host sends read request to storage system
___ Data is read from physical disk when requested by the LRU algorithm
___ Data is written to cache
___ Data is read from physical disk immediately
___ Read data is sent to the host
___ Status is returned to the host
___ Data is sent to back end
___ Back end receives status of read operation
___ Cache is searched, and data is found there
___ Cache is searched, and data is not found there
2. List the operations in the correct order.
Business Profile:
Acme Telecom is involved in mobile wireless services across the United States and has about 5000
employees worldwide. This company is Chicago based and has 7 regional offices across the country.
Although Acme is doing well financially, they continue to feel competitive pressure. As a result, the
company needs to ensure that the IT infrastructure takes advantage of fault tolerant features.
Current Situation/Issues:
• The company uses a number of different applications for communication, accounting, and
management. All the applications are hosted on individual servers with disks configured as RAID 0.
• All financial activity is managed and tracked by a single accounting application. It is very important
for the accounting data to be highly available.
• The application performs around 15% write operations, and the remaining 85% are reads.
• The accounting data is currently stored on a 5-disk RAID 0 set. Each disk has an advertised formatted
capacity of 200 GB, and the total size of their files is 730 GB.
• The company performs nightly backups and removes old information—so the amount of data is
unlikely to change much over the next 6 months.
The company is approaching the end of the financial year and the IT budget is depleted. Buying even one
new disk drive will not be possible.
How would you suggest that the company restructure their environment?
You will need to justify your choice based on cost, performance, and availability of the new solution.
Business Profile:
Acme Telecom is involved in mobile wireless services across the United States and has about 5000
employees worldwide. This company is Chicago based and has 7 regional offices across the country.
Although Acme is doing well financially, they continue to feel competitive pressure. As a result, the
company needs to ensure that the IT infrastructure takes advantage of fault tolerant features.
Current Situation/Issues:
• The company uses a number of different applications for communication, accounting, and
management. All the applications were hosted on individual servers with disks configured as RAID 0.
• The company changed the RAID level of their accounting application based on your
recommendations 6 months ago.
• It is now the beginning of a new financial year and the IT department has an increased budget. You
are called in to recommend changes to their database environment.
• You investigate their database environment closely, and observe that the data is stored on a 6-disk
RAID 0 set. Each disk has an advertised formatted capacity of 200 GB and the total size of their files
is 900 GB. The amount of data is likely to change by 30% over the next 6 months, and your solution
must accommodate this growth.
• The application performs around 40% write operations, and the remaining 60% are reads. The
average size of a read or write is small, at around 2 KB.
Section Summary
Key Points covered in this Section:
y Physical and logical components of a host
y Common connectivity components and protocols
y Features of intelligent disk storage systems
y Data flow between the host and the storage array
y Apply Your Knowledge
y Data Flow Exercise
y Case Studies
These are the key points covered in this section. Please take a moment to review them.
This concludes the training. Please proceed to the Course Completion slide to take the Assessment.