EMC ScaleIO Basic Architecture (h14344)
June 2015
EMC WHITE PAPER
CONTENTS
EXECUTIVE SUMMARY
AUDIENCE
TERMINOLOGY TABLE
SCALEIO SYSTEM ARCHITECTURE BASICS
  ScaleIO Data Client – SDC
  ScaleIO Data Server – SDS
IO FLOW
CACHE
  XtremCache
  Write buffering
IO TYPES
  Read Hits (RH)
  Read Misses (RM)
  Writes
PROTECTION DOMAINS
STORAGE POOLS
FAULT SET
SNAPSHOTS
QUALITY OF SERVICE (QOS)
THROTTLING
SCALEIO MANAGEMENT
CONCLUSIONS
EXECUTIVE SUMMARY
This document is designed to help users understand the basic concepts and the architecture of ScaleIO.
EMC ScaleIO® is software that creates a server-based SAN from local application server storage (local or network
storage devices). ScaleIO delivers flexible, scalable performance and capacity on demand. ScaleIO integrates
storage and compute resources, scaling to thousands of servers (also called nodes). As an alternative to traditional
SAN infrastructures, ScaleIO combines hard disk drives (HDD), solid state disk (SSD), and Peripheral Component
Interconnect Express (PCIe) flash cards to create a virtual pool of block storage with varying performance tiers.
As opposed to traditional Fibre Channel SANs, ScaleIO has no requirement for a Fibre Channel fabric between the
servers and the storage. This further reduces the cost and complexity of the solution. In addition, ScaleIO is
hardware-agnostic and supports either physical or virtual application servers. It creates a software-defined storage
environment that allows users to exploit the unused local storage capacity in any server. ScaleIO provides a
scalable, high performance, fault tolerant distributed shared storage system.
AUDIENCE
This white paper is intended for customers, partners and employees interested in understanding the concepts,
basic architecture and components of a ScaleIO system.
TERMINOLOGY TABLE
HDD: Hard disk drive. A traditional magnetic device that stores digitally encoded data.
SSD: Solid state disk. Has no moving parts and uses flash memory to store data persistently.
PCIe: Peripheral Component Interconnect Express, a high-speed serial computer expansion bus.
SDS: ScaleIO Data Server. Contributes local storage space to an aggregated pool of storage within the ScaleIO virtual SAN.
SDC: ScaleIO Data Client. A lightweight device driver that exposes ScaleIO shared block volumes to applications.
MDM: ScaleIO Meta Data Manager. Manages, configures, and monitors the ScaleIO system.
Hyperconverged: A converged configuration or converged infrastructure (CI) where the application runs in the same layer as the storage and compute.
LAN: Local Area Network providing interconnectivity within a limited (local) area. ScaleIO supports all network speeds, including 100 Mb, 1 Gb, 10 Gb, 40 Gb, and InfiniBand (IB).
OpenStack: A free, open-source cloud computing software platform.
CTQ: Command Tag Queueing. Reorders IOs to optimize drive seeks and improve the IOPS of the drive.
RTO: Recovery time objective. The time for a system/application to be restored, e.g. from backup, after a failure or disruption.
DRAM: Dynamic RAM. A type of random access memory used in servers for caching, etc.
Protection Domain: A logical container for SDSs. Each SDS can belong to only one Protection Domain.
Storage Pool: A logical cross-SDS group of drives within a single Protection Domain. Usually used to group drives that share common characteristics, e.g. a SAS pool, an SSD pool, etc.
IOC: Basic IO controller. Passes IO through without any additional features or functions.
ROC: RAID-on-Chip. Adds features such as write buffering, RAID calculations, etc.
Figure 1 - ScaleIO SDC
Users may modify the default ScaleIO configuration parameters to allow two SDCs to access the same data. This
feature enables support for clustered applications such as Oracle RAC.
SCALEIO CONFIGURATIONS
There are three standard configurations for ScaleIO implementations, all providing flexibility and scalability. They
are as follows and are discussed in the following sections:
Converged configuration
Two-layer configuration
Mixed configuration
Converged Configuration
In a converged configuration, the SDCs and SDSs reside on the same servers. The applications perform IO operations via the local SDC. All servers contribute some or all of their local storage to
the ScaleIO system via their local SDS. Components communicate over the Local Area Network (LAN).
Two-layer Configuration
There is no ScaleIO requirement to implement a Converged configuration, as shown above, where the SDCs and
SDSs reside on the same servers.
In certain situations, customers prefer to have the SDS separated from an SDC and installed on a different server.
This type of configuration is called a two-layer configuration where the SDCs are configured on one group of servers,
and the SDSs are configured on another distinct group of servers, shown in Figure 4.
The applications that run on the first group of servers issue IO requests to their local SDC. The second group,
running SDSs, contributes the servers’ local storage to the virtual SAN. Both groups communicate over a local area
network. Applications run in one layer, while storage resides in another.
This deployment is similar to a traditional external storage system such as VNX and VMAX, but without the Fibre
Channel layer.
Figure 4 - Two-layer Configuration
Mixed Configuration
ScaleIO is very flexible and allows any combination of the two configurations. When two-layer and converged
configurations coexist, this is called a mixed configuration. ScaleIO has no restriction on when configuration changes
can be made. A mixed configuration is common as a transient state when moving from a two-layer configuration to a
converged configuration.
When a new group of servers is added as SDS servers, ScaleIO automatically rearranges, optimizes and
rebalances the data in the background without any downtime. ScaleIO deployments can be changed or grown
quickly and easily, supporting hundreds of nodes.
META DATA MANAGER – MDM
The Meta Data Manager manages the ScaleIO system. The MDM contains all the metadata required for system
operation, such as configuration changes. The MDM also provides monitoring capabilities to assist users with most
system management tasks.
The MDM manages the metadata, SDCs, SDSs, device mappings, volumes, snapshots, system capacity (including
device allocation and/or release of capacity), RAID protection, errors and failures, and system rebuild tasks
including rebalancing. In addition, the MDM responds to all user commands and queries. In a normal IO flow, the
MDM is not part of the data path and user data does not pass through the MDM. Therefore, the MDM is never a
performance bottleneck for IO operations.
The MDM uses an Active/Passive methodology with a tiebreaker component, where the primary node is active and
the secondary is passive. The data repository is stored on both the Active and the Passive nodes.
Currently, an MDM can manage up to 1024 servers. When several MDMs are present, an SDC may be managed by
several MDMs, whereas an SDS can belong to only one MDM.
The MDM is extremely lightweight and has an asynchronous (or lazy) interaction with the SDCs and SDSs. The MDM
daemon produces a heartbeat, with updates performed every few seconds. If the MDM does not detect the
heartbeat from an SDS, it initiates a forward rebuild.
All ScaleIO commands are asynchronous, with one exception: for consistency reasons, the unmap command is
synchronous, and the user must wait for its completion before continuing.
Each SDC holds mapping information that is lightweight and efficient, so it can be stored in memory. For every 8
PB of storage, the SDC requires roughly 2 MB of RAM. Mapping information may change without the client being
notified; this is the nature of a lazy, loosely coupled approach.
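As a rough illustration of this footprint, the rule of thumb above (about 2 MB of RAM per 8 PB of mapped capacity) can be applied to any system size; the sketch below simply assumes linear scaling:

    # Rough SDC mapping-memory estimate, assuming the ~2 MB per 8 PB rule of
    # thumb quoted above scales linearly with mapped capacity.
    MB_PER_8_PB = 2.0

    def sdc_mapping_ram_mb(capacity_pb: float) -> float:
        """Approximate RAM (MB) an SDC needs to map capacity_pb petabytes."""
        return MB_PER_8_PB * (capacity_pb / 8.0)

    if __name__ == "__main__":
        for pb in (1, 8, 16):
            print(f"{pb} PB -> ~{sdc_mapping_ram_mb(pb):.2f} MB of SDC mapping RAM")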
Figure 6 - ScaleIO volume layout
Rebuilds
ScaleIO systems automatically rebuild a failed drive or failed server. For example, if SDS1 crashes, ScaleIO
rebuilds its 1 MB chunks by copying them from their mirrors. This process is called a forward rebuild. It is a many-to-many
copy operation, which is what makes the rebuild such a quick operation.
Upon completion of the forward rebuild operation, the system is fully protected and optimized. Better still, while
this operation is in progress, all of the data is accessible to applications so that users experience no outage or
disruption in service.
The backward rebuild option is used when a node goes down for only a short period of time. This option is managed
by the MDM, which determines whether updating the mirrored volumes will be faster than rebuilding the data on the
downed server. The MDM collects all tracked changes from all SDSs and is therefore equipped to make the best
decision between the forward and backward rebuild methods.
ScaleIO always reserves space on servers for the case of an unplanned outage, when rebuilds require
unused disk space. To ensure data protection during server failures, ScaleIO reserves 10% of the capacity by
default and does not allow it to be used for volume allocation.
To ensure full system protection in the event of a node failure, users must ensure that the spare capacity is at
least equal to the capacity of the largest node, or the largest Fault Set. If all nodes contain equal capacity, it is
recommended to set the spare capacity to at least 1/N of the total capacity (where N is the number of SDS nodes).
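The sizing rule above can be expressed as a short calculation; the node capacities below are invented for the example:

    # Spare-capacity sizing sketch based on the guidance above: reserve at
    # least the capacity of the largest node (with equal nodes this equals
    # 1/N of the total). The example node sizes are made up.
    def recommended_spare_tb(node_capacities_tb):
        """Return the minimum spare capacity (TB) to survive one node failure."""
        return max(node_capacities_tb)

    if __name__ == "__main__":
        nodes_tb = [24, 24, 24, 48]          # hypothetical SDS node capacities
        spare = recommended_spare_tb(nodes_tb)
        total = sum(nodes_tb)
        print(f"Reserve at least {spare} TB "
              f"({spare / total:.0%} of the {total} TB total) as spare capacity")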
Rebalance
One of ScaleIO’s greatest benefits is its elasticity. Adding or removing devices and/or servers in a ScaleIO
configuration triggers an automatic migration and rebalance of data across the available devices.
During this process, the data is migrated and then rebalanced across the servers. The rebalance process simplifies
ScaleIO management, eliminating long refresh cycles, and making the environment more dynamic and flexible. If
more compute power or storage is required, simply add more devices or more servers.
IO FLOW
IOs from the application are serviced by the SDC that runs on the same server as the application. The SDC fulfills the
IO request regardless of where any particular block physically resides.
When the IO is a Write, the SDC sends the IO to the SDS where the primary copy is located. The primary SDS writes
the IO to its local drive and, in parallel, sends another IO to the secondary mirror. Only after an
acknowledgment is received from the secondary SDS does the primary SDS acknowledge the IO to the SDC.
A Read IO from the application will trigger the SDC to issue the IO to the SDS with the Primary chunk.
In terms of resources consumed, one host Write IO will generate two IOs over both the network and back-end drives.
A read will generate one network IO and one back-end IO to the drives. For example, if the application is issuing an
8 KB Write, the network and drives will get 2x8 KB IOs. For an 8 KB Read, there will be only one 8 KB IO on the
network and drives.
Note: The IO flow does not require any MDM or any other central management point. For this reason, ScaleIO is able
to scale linearly in terms of performance.
Every SDC knows how to direct an IO operation to the destination SDS. Because ScaleIO volume chunks are evenly
distributed across drives and nodes, the workload is always well balanced. There is no flooding or
broadcasting. This is extremely efficient parallelism that eliminates single points of failure. Since there is no central
point of routing, all of this happens in a distributed manner. The SDC has all the intelligence needed to route every
request, preventing unnecessary network traffic and redundant SDS resource usage.
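The mirrored write path described above can be summarized in a conceptual sketch. This is illustrative logic only, not ScaleIO code, and the class and function names are invented for the example:

    # Conceptual sketch of the mirrored write path: the SDC sends the write to
    # the primary SDS, which writes locally and forwards to the secondary in
    # parallel; the SDC is acknowledged only after both copies are persisted.
    from concurrent.futures import ThreadPoolExecutor

    class Sds:
        def __init__(self, name):
            self.name = name
            self.blocks = {}                      # stand-in for the local drive

        def write_local(self, offset, data):
            self.blocks[offset] = data            # "persist" the chunk
            return True                           # local write acknowledged

    def sdc_write(primary: Sds, secondary: Sds, offset: int, data: bytes) -> bool:
        """Return True once both copies are acknowledged (2 network + 2 drive IOs)."""
        with ThreadPoolExecutor(max_workers=2) as pool:
            local_ack = pool.submit(primary.write_local, offset, data)
            mirror_ack = pool.submit(secondary.write_local, offset, data)
            # Primary acknowledges the SDC only after the secondary's ack arrives.
            return local_ack.result() and mirror_ack.result()

    if __name__ == "__main__":
        ok = sdc_write(Sds("sds-1"), Sds("sds-2"), offset=0, data=b"\x00" * 8192)
        print("write acknowledged to application:", ok)

Because the local write and the mirror write proceed in parallel, the host write latency is governed by the slower of the two copies rather than by their sum.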
CACHE
Cache is a critical aspect of storage performance. At present, ScaleIO uses a RAM cache for Read Hits. The ScaleIO cache
keeps recently-accessed data readily available. IOs read from cache have a lower response time than IOs serviced
by the drives, including Flash drives.
Another benefit of caching is that it reduces the drive workload, which in many cases is the performance bottleneck
in the system.
Cache in ScaleIO is managed by the SDS. This is a simple and clean implementation that does not require cache
coherency management, which would have been required had the cache been managed by the SDC.
Figure 10 - Cache warming up
Note: Both the MD and UD use Least Recently Used (LRU) algorithms to make sure “old” data is
evicted from cache first.
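The LRU behavior referenced in the note can be illustrated with a minimal, generic sketch; this is not ScaleIO's cache code, and the page-based structure is assumed for the example:

    # Minimal LRU read-cache sketch illustrating the eviction policy noted
    # above; generic example code, not ScaleIO's implementation.
    from collections import OrderedDict

    class LruReadCache:
        def __init__(self, capacity_pages: int):
            self.capacity = capacity_pages
            self.pages = OrderedDict()            # page_id -> data, oldest first

        def get(self, page_id):
            if page_id not in self.pages:
                return None                       # read miss: caller goes to the drive
            self.pages.move_to_end(page_id)       # mark as most recently used
            return self.pages[page_id]            # read hit served from RAM

        def put(self, page_id, data):
            self.pages[page_id] = data
            self.pages.move_to_end(page_id)
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)    # evict the least recently used page

    if __name__ == "__main__":
        cache = LruReadCache(capacity_pages=2)
        cache.put(1, b"a"); cache.put(2, b"b")
        cache.get(1)                              # page 1 becomes most recently used
        cache.put(3, b"c")                        # evicts page 2 ("old" data goes first)
        print(sorted(cache.pages))                # [1, 3]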
XtremCache
ScaleIO is equipped to use another caching option: XtremCache (formerly named XtremSW). This is a software layer
located under the SDS. XtremCache allows a Flash drive of any type and size to be used as additional system Read
cache (writes are buffered only for Reads after Writes). Compared to RAID controller caching solutions,
XtremCache also allows the use of host PCIe Flash cards/drives, which can deliver an order of magnitude better
performance than regular enterprise SAS solid state disks (SSDs).
Write caching is achieved by the RAID controller, as explained in the next section.
Write buffering
Writes are only buffered in the host memory for Read-after-Write caching. One way to achieve Write buffering is to
use RAID controllers (e.g. LSI, PMC, etc.) that have battery backup for write buffering. It is important that the DRAM
buffer be protected against sudden power outages to avoid any data loss.
RAID controllers also have an option to extend their cache onto Flash drives configured in the system (e.g.
LSI CacheCade and PMC MaxCache). This allows increasing the cache from the 1-4 GB of controller DRAM to 512 GB; up to 2 TB of
Flash cache can be managed by the RAID controller. This cache is used for both Reads and Writes.
The main effect of write buffering is on Write response time, which is much lower when the IO is
acknowledged from DRAM/Flash rather than from an HDD.
An added benefit of buffering writes in a RAID controller is that it enables elevator reordering, sometimes
referred to as Command Tag Queuing (CTQ). Elevator reordering increases the maximum IOPS of an HDD,
and can even reduce the drive’s load because rewrites to the same address locations are absorbed in the buffer.
Note: Apart from the rewrite effect and CTQ, write buffering does not affect the maximum sustained random
write throughput.
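Elevator reordering itself is straightforward to illustrate. The sketch below simply services a queue of buffered IOs in address order from the current head position; it is a generic example, not a description of any particular RAID controller:

    # Generic elevator (CTQ-style) reordering sketch: buffered IOs are serviced
    # in LBA order from the current head position, reducing seek distance.
    def elevator_order(queued_lbas, head_position):
        """Return the queued LBAs in a single ascending sweep from the head."""
        ahead = sorted(lba for lba in queued_lbas if lba >= head_position)
        behind = sorted(lba for lba in queued_lbas if lba < head_position)
        return ahead + behind                     # sweep up, then wrap to the start

    if __name__ == "__main__":
        queue = [8200, 120, 5000, 9100, 64]
        print(elevator_order(queue, head_position=4000))  # [5000, 8200, 9100, 64, 120]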
For Flash-only configurations, it is usually recommended to use a pass-through IOC instead of a ROC (RAID-on-Chip)
controller, since writes are acknowledged from the Flash drives regardless.
IO TYPES
There are three types of IO operations in a ScaleIO system:
Read Hit
Read Miss
Write
Each IO type and size behaves differently since they exercise different components inside the ScaleIO system.
It is important to consider that sequential reads are not counted separately: any IO serviced from the host read
cache is counted as a Read Hit, and any other read is counted as a Read Miss.
Note: There is minimal pre-fetch as part of the ScaleIO cache code. For example, a sequential read
of 512B will bring 4 KB into cache. A best practice recommendation is to use the Read-ahead
feature in the Cache controller only for HDD drives. This allows pre-fetching IOs to increase the
performance when using HDD drives. With Flash drives, this feature is not necessary and not
recommended.
Writes
A Write is a Write IO operation to the ScaleIO system. Apart from the write buffering cases described in the above
section, there is little difference between the various write types, e.g. Sequential and Random.
PROTECTION DOMAINS
A Protection Domain is a set of SDSs. Each SDS belongs to one (and only one) Protection Domain. Thus, by
definition, each Protection Domain is a unique set of SDSs.
The ScaleIO Data Client (SDC) is not part of the Protection Domain. An SDC residing on the same server as an SDS
that belongs to Protection Domain X can also access data in Protection Domain Y.
The recommended number of nodes in a Protection Domain is 100. Users can add Protection Domains during
installation. In addition, Protection Domains can be modified post-installation with all the management clients
(except for OpenStack).
STORAGE POOLS
Storage Pools allow the generation of different performance tiers in the ScaleIO system. A Storage Pool is a set of
physical storage devices in a Protection Domain. Each storage device belongs to one (and only one) Storage Pool.
When a Protection Domain is generated, it has one Storage Pool by default.
Storage Pools are mostly used to group drives based on drive types and drive speeds, e.g. SSD and HDD.
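The membership rules above (each SDS belongs to exactly one Protection Domain, each device to exactly one Storage Pool) can be captured in a small data-model sketch; the class, pool, and device names are illustrative only:

    # Illustrative data model for the membership rules described above: an SDS
    # belongs to exactly one Protection Domain, and each device is registered
    # in a single Storage Pool within that domain (one default pool at creation).
    class ProtectionDomain:
        def __init__(self, name):
            self.name = name
            self.sdss = []
            self.storage_pools = {"default": []}

        def add_sds(self, sds_name):
            self.sdss.append(sds_name)

        def add_device(self, device, pool="default"):
            self.storage_pools.setdefault(pool, []).append(device)

    if __name__ == "__main__":
        pd = ProtectionDomain("pd-1")
        pd.add_sds("sds-1"); pd.add_sds("sds-2")
        pd.add_device("sds-1:/dev/sdb", pool="ssd_pool")   # tier by drive type
        pd.add_device("sds-2:/dev/sdc", pool="hdd_pool")
        print(pd.storage_pools)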
FAULT SET
In many cases, data centers are designed such that a unit of failure may consist of more than a single node. An
example use case is a rack that contains several SDSs, where the customer wants to protect the environment from a
situation in which the whole rack fails or is lost to a power outage or some other disaster.
A Fault Set prevents mirrored chunks from residing in the same Fault Set. A minimum of 3 Fault Sets is required per
Protection Domain. Deploying Fault Sets prevents both copies of data from being written to SDSs in the same
Fault Set, which ensures that one copy of the data remains available in the event that an entire Fault Set fails.
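A simplified placement check conveys the idea: the two copies of a chunk must land on SDSs in different Fault Sets. The following sketch is illustrative only and does not reflect ScaleIO's actual placement algorithm:

    # Simplified mirror-placement sketch: never place both copies of a chunk on
    # SDSs in the same Fault Set. The SDS-to-rack mapping is hypothetical.
    import itertools

    SDS_FAULT_SET = {
        "sds-1": "rack-A", "sds-2": "rack-A",
        "sds-3": "rack-B", "sds-4": "rack-C",
    }

    def valid_mirror_pairs(sds_fault_set):
        """Yield (primary, secondary) pairs whose Fault Sets differ."""
        for primary, secondary in itertools.permutations(sds_fault_set, 2):
            if sds_fault_set[primary] != sds_fault_set[secondary]:
                yield primary, secondary

    if __name__ == "__main__":
        pairs = list(valid_mirror_pairs(SDS_FAULT_SET))
        assert ("sds-1", "sds-2") not in pairs     # same rack, so not allowed
        print(f"{len(pairs)} placements keep the two copies in different Fault Sets")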
Figure 14 - Fault set data distribution
Figure 15 - Example Configuration; Fault Sets, Storage Pools and Protection Domain
SNAPSHOTS
The ScaleIO storage system enables users to take snapshots of existing volumes, up to 31 per volume. The
snapshots are thinly provisioned and are extremely quick. Once a snapshot is generated, it becomes a new
unmapped volume in the system. Users manipulate snapshots in the same manner as any other volume exposed to
the ScaleIO storage system.
Figure 16 - Snapshot operations
The structure in Figure 16 relates to all the snapshots resulting from one volume and is referred to as a VTree (or
Volume Tree). It is a tree with the source volume as its root, whose descendants are either snapshots of the
volume itself or snapshots of those snapshots.
Each volume therefore has a construct called a VTree, which holds the volume and all snapshots associated with it.
The limit on a VTree is 32 members: 1 is taken by the original volume and the remaining 31 are available for
snapshots.
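A minimal sketch of the VTree limit follows, using the 32-member cap described above (one source volume plus up to 31 snapshots); the structure and names are illustrative:

    # Illustrative VTree sketch: a source volume plus its snapshots form one
    # tree, capped at 32 members (so up to 31 snapshots per source volume).
    class VTree:
        MAX_MEMBERS = 32

        def __init__(self, source_volume):
            self.members = {source_volume: None}   # volume -> parent (root has none)

        def snapshot(self, parent, name):
            if parent not in self.members:
                raise ValueError(f"{parent} is not in this VTree")
            if len(self.members) >= self.MAX_MEMBERS:
                raise RuntimeError("VTree limit of 32 members reached")
            self.members[name] = parent            # snapshot of a volume or a snapshot
            return name

    if __name__ == "__main__":
        tree = VTree("vol1")
        snap1 = tree.snapshot("vol1", "vol1.snap1")
        tree.snapshot(snap1, "vol1.snap1.snap1")   # snapshots of snapshots stay in the tree
        print(len(tree.members) - 1, "snapshots used of 31 available")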
THROTTLING
ScaleIO allows users to change (or throttle) certain parameters in order to set higher priorities for some operations
over others. The most common use case is to slow down a rebuild/rebalance operation which can help reduce the
impact on host IOs.
Rebalance and rebuild operations can be throttled separately.
Rebalance/Rebuild throttling parameters: Setting these parameters allows users to set the rebalance/rebuild I/O
priority policy for a Storage Pool. It determines the priority policy that will be imposed to favor application I/O over
rebalance/rebuild I/O.
No Limit: No limit on rebalance/rebuild I/Os. This option helps complete the rebuild/rebalance as soon as possible, but
may have an impact on the host applications.
Limit Concurrent I/O: Limits the number of concurrent rebalance/rebuild I/Os per SDS device (a conceptual sketch of this policy follows the list).
Favor Application I/O: Limits rebalance/rebuild in both bandwidth and concurrent I/Os.
Dynamic Bandwidth Throttling: Limits rebalance/rebuild bandwidth and concurrent I/Os according to device
I/O thresholds. This option helps increase the rebalance/rebuild rate when the host application workload
is low.
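As an illustration, the Limit Concurrent I/O policy can be thought of as a per-device counter of in-flight background IOs; the sketch below is conceptual only and does not represent the product's scheduler:

    # Conceptual sketch of the "Limit Concurrent I/O" policy described above:
    # rebuild/rebalance IOs are admitted only while the per-device count of
    # in-flight background IOs stays under the configured limit.
    class ConcurrentIoLimiter:
        def __init__(self, max_concurrent_per_device: int):
            self.limit = max_concurrent_per_device
            self.in_flight = {}                    # device -> active background IOs

        def try_start(self, device: str) -> bool:
            if self.in_flight.get(device, 0) >= self.limit:
                return False                       # defer: favor application I/O
            self.in_flight[device] = self.in_flight.get(device, 0) + 1
            return True

        def finish(self, device: str) -> None:
            self.in_flight[device] -= 1

    if __name__ == "__main__":
        limiter = ConcurrentIoLimiter(max_concurrent_per_device=1)
        print(limiter.try_start("sds-1:/dev/sdb"))  # True: first background IO admitted
        print(limiter.try_start("sds-1:/dev/sdb"))  # False: limit reached, IO deferred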
SCALEIO MANAGEMENT
Users manage ScaleIO in various ways including the CLI, the REST API, the vSphere plug-in for ESX, and the ScaleIO
GUI. Other tools including ViPR-C and ViPR SRM are integrated and capable of managing a ScaleIO system.
The ScaleIO command line interface or, scli, allows users to log into a ScaleIO system to create, manage and
monitor various system components including protection domains, the MDM, SDS, SDC, storage pools, volumes
and more.
The scli "--help" command provides information on syntax and usage for all ScaleIO commands.
The REST API for ScaleIO is serviced from the ScaleIO Gateway (which includes the REST gateway).
The ScaleIO Gateway connects to a single MDM and services requests by querying the MDM, and reformatting the
answers it receives from the MDM in a RESTful manner, back to a REST client. Every ScaleIO scli command is also
available in the ScaleIO REST API. Responses returned by the Gateway are formatted in JSON format. The API is
available as part of the ScaleIO Gateway package. If the ScaleIO Installation Manager was used to install ScaleIO,
the Gateway has already been installed and configured with the MDM details.
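As an illustration of how a REST client might interact with the Gateway, the sketch below authenticates and issues a query. The Gateway address, credentials, endpoint paths and response handling are assumptions made for the example and should be verified against the ScaleIO REST API documentation for your release:

    # Illustrative REST client sketch. The Gateway address, credentials, and
    # endpoint paths are assumptions for the example; consult the ScaleIO REST
    # API documentation for the exact URIs and authentication flow in your release.
    import requests

    GATEWAY = "https://scaleio-gateway.example.com"          # hypothetical address

    def query_system(username: str, password: str):
        # Authenticate against the Gateway (login path assumed for illustration).
        token = requests.get(f"{GATEWAY}/api/login",
                             auth=(username, password), verify=False).json()
        # Reuse the returned token as the password for subsequent queries.
        resp = requests.get(f"{GATEWAY}/api/types/System/instances",
                            auth=(username, token), verify=False)
        return resp.json()                                    # JSON-formatted answer

    if __name__ == "__main__":
        print(query_system("admin", "password"))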
VMware provides a plug-in that allows users to view and provision ScaleIO components. The plug-in communicates
with the MDM and the vSphere server enabling users to view components and perform many
configuration/provisioning tasks right from within the VMware environment.
To use the plug-in, it must be registered in your vCenter. For more information, refer to the ScaleIO Installation
Guide at https://support.emc.com/docu59356_ScaleIO-Installation-Guide-1.32.pdf.
An EMC ScaleIO GUI is available for Windows, Linux and vSphere. The GUI allows installation, monitoring and
management of a ScaleIO system. Figure 17 displays the GUI dashboard providing a complete overview of the
current system state.
For more detailed information on the management tools available for ScaleIO, refer to the ScaleIO User Guide on the
ScaleIO Product Page at https://community.emc.com/docs/DOC-45035.
CONCLUSIONS
ScaleIO is software-defined storage that delivers a full suite of storage services and uses commodity hardware built
on off-the-shelf components and products. ScaleIO has no vendor-specific hardware dependencies, is able to run
on any commodity server, is supported on nearly any operating system and/or hypervisor, and can leverage existing
and future datacenter infrastructures.
Leading use case scenarios include cloud-based platforms built on ScaleIO to provide consumer and enterprise
applications to support banking, billing and much more. Managed Service Providers have implemented ScaleIO in
order to eliminate vendor lock-in, and grow with a solution that allows the use of any flash or HDD for storage, and
with any kind of server.
In short, ScaleIO simplifies data center operations, making them flexible and efficient.