Storage Virtualization
Agenda
• Overview
Introduction
What to virtualize
Where to virtualize
How to virtualize
• Case study
On Linux systems
• RAID
• LVM
• NFS
In distributed systems
• VastSky
• Lustre
• Ceph
• HDFS
Overview
• Introduction
• What to virtualize ?
Block device, file system
• Where to virtualize ?
Host-based, network-based, storage-based
• How to virtualize ?
In-band, out-of-band
• Introduction
• What to virtualize
• Where to virtualize
• How to virtualize
• Case study
STORAGE VIRTUALIZATION
Introduction
• Common storage architecture :
DAS - Direct Attached Storage
• Storage devices are directly attached to a server or workstation, without a storage network in between.
NAS - Network Attached Storage
• File-level computer data storage
connected to a computer network
providing data access to heterogeneous
clients.
SAN - Storage Area Network
• Attach remote storage devices to servers
in such a way that the devices appear as
locally attached to the operating system.
Introduction
• Desirable properties of storage virtualization:
Manageability
• Storage resources should be easy to configure and deploy.
Availability
• Storage hardware failures should not affect applications.
Scalability
• Storage resources can easily scale up and down.
Security
• Storage resources should be securely isolated.
Introduction
• Storage concepts and techniques
Storage resource mapping table
Redundant data
Multi-path
Data sharing
Tiering
Concept and Technique
• Storage resource mapping table
Maintain tables that map storage resources to targets.
Dynamically modify table entries for thin provisioning.
Use tables to isolate different storage address spaces (a minimal sketch follows).
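To make the mapping-table idea concrete, here is a minimal Python sketch of a thin-provisioned volume: each volume keeps its own table from virtual to physical blocks, physical space is allocated only on first write, and separate tables isolate the address spaces of different volumes. The names (ThinVolume, pool, disk) are invented for illustration and do not come from any particular product.

```python
# Minimal sketch of a thin-provisioned volume built on a mapping table.
# Physical blocks are allocated lazily, on the first write to a virtual block.

class ThinVolume:
    def __init__(self, virtual_blocks, pool):
        self.virtual_blocks = virtual_blocks
        self.pool = pool                  # shared list of free physical block numbers
        self.table = {}                   # virtual block -> physical block

    def write(self, vblock, data, disk):
        if not (0 <= vblock < self.virtual_blocks):
            raise IndexError("virtual block out of range")
        if vblock not in self.table:      # allocate on first write (thin provisioning)
            self.table[vblock] = self.pool.pop()
        disk[self.table[vblock]] = data

    def read(self, vblock, disk):
        pblock = self.table.get(vblock)   # unmapped blocks read back as zeroes
        return disk.get(pblock, b"\0")

# Two volumes share one physical pool; isolation comes from the separate tables.
disk = {}
pool = list(range(1000))                  # 1000 physical blocks back 2 x 10000 virtual blocks
vol_a, vol_b = ThinVolume(10_000, pool), ThinVolume(10_000, pool)
vol_a.write(42, b"hello", disk)
print(vol_a.read(42, disk), vol_b.read(42, disk))   # b'hello' b'\x00'
```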
Concept and Technique
• Redundant data
Maintain replicas to provide high availability.
Use RAID techniques to improve performance and availability.
Concept and Technique
• Multi-path
A fault-tolerance and performance
enhancement technique.
There is more than one physical path
between the host and storage devices
through the buses, controllers,
switches, and bridge devices
connecting them.
Concept and Technique
• Data sharing
Use data de-duplication techniques to eliminate duplicated data.
Saves storage space and improves its utilization (see the sketch below).
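A minimal sketch of content-based de-duplication, assuming fixed-size 4 KB chunks and SHA-256 fingerprints (both choices are illustrative, not taken from the slides):

```python
import hashlib

# Store each unique chunk once; a file becomes a list of chunk fingerprints.
chunk_store = {}          # fingerprint -> chunk bytes

def store(data, chunk_size=4096):
    recipe = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(fp, chunk)   # duplicate chunks are stored only once
        recipe.append(fp)
    return recipe

def load(recipe):
    return b"".join(chunk_store[fp] for fp in recipe)

a = store(b"A" * 8192)                    # two identical chunks -> one stored chunk
b = store(b"A" * 4096 + b"B" * 4096)      # reuses the "A" chunk, adds one new chunk
print(len(chunk_store))                   # 2 unique chunks in total
assert load(a) == b"A" * 8192
```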
Concept and Technique
• Tiering
Automatically migrate data across storage resources with different
properties according to the significance or access frequency of the data.
Example: Apple's iMac Fusion Drive (a toy sketch follows).
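The toy sketch below only illustrates the tiering policy: access counts are tracked per block, and the hottest blocks are promoted to a small fast tier while the rest stay on the slow tier. The tier names and the fast-tier capacity are arbitrary assumptions.

```python
from collections import Counter

FAST_CAPACITY = 2                  # illustrative: the fast tier (e.g. SSD) is small
access_count = Counter()
tier = {}                          # block -> "ssd" or "hdd"

def access(block):
    access_count[block] += 1
    rebalance()

def rebalance():
    # Promote the most frequently accessed blocks to the fast tier.
    hot = {b for b, _ in access_count.most_common(FAST_CAPACITY)}
    for block in access_count:
        tier[block] = "ssd" if block in hot else "hdd"

for block in [1, 1, 1, 2, 2, 3]:   # block 3 is accessed least often
    access(block)
print(tier)                         # {1: 'ssd', 2: 'ssd', 3: 'hdd'}
```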
What To Virtualize
• File system
Provide a compatible system call interface to user-space applications.
• Block device
Provide a compatible block device interface to the file system.
Block interface
• Through interfaces such as SCSI, SAS, ATA, SATA, etc.
[Figure: the I/O stack — application and system call interface in user space; file system, block interface, and device driver in kernel space; the storage device below.]
File System Level
• Data and Files
What is data?
• Data is information that has been converted to a machine-readable,
digital binary format.
• Control information indicates how data should be processed.
• Applications may embed control information in user data for formatting or
presentation.
• Data and its associated control information is organized into discrete units
as files or records.
What is a file?
• Files are the common containers for user data, application code, and
operating system executables and parameters.
File System Level
• About the files
Metadata
• The control information for file management is known as metadata.
• File metadata includes file attributes and pointers to the location of file
data content.
• File metadata may be segregated from a file's data content.
• Metadata on file ownership and permissions is used in file access.
• File timestamp metadata facilitates automated processes such as backup
and life cycle management.
Different file systems
• In Unix systems, file metadata is contained in the i-node structure.
• In Windows systems, file metadata is contained in records of file attributes.
File System Level
• File system
What is a file system?
• A file system is a software layer responsible for organizing and policing the
creation, modification, and deletion of files.
• File systems provide a hierarchical organization of files into directories and
subdirectories.
• The B-tree algorithm facilitates more rapid search and retrieval of files by
name.
• File system integrity is maintained through duplication of master tables,
change logs, and immediate writes of file changes.
Different file systems
• In Unix, the super block contains information on the current state of the
file system and its resources.
• In Windows NTFS, the master file table contains information on all file
entries and status.
File System Level
• File system level virtualization
The file system maintains metadata (e.g., i-nodes) for each file.
Translates file access requests to the underlying file system.
Sometimes divides a large file into small sub-files (chunks) for parallel
access, which improves performance (see the chunk-layout sketch below).
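As a sketch of the chunking idea, the snippet below splits a file into fixed-size sub-files and lays them out round-robin across several storage targets so they could be fetched in parallel. The chunk size and target names are assumptions made for the example.

```python
# Split a byte stream into fixed-size chunks and place them round-robin
# on several storage targets, so the chunks can be fetched in parallel.

CHUNK_SIZE = 64 * 1024
TARGETS = ["store-0", "store-1", "store-2"]   # illustrative target names

def chunk_layout(file_size):
    """Return (chunk_index, target, offset_in_file, length) for each chunk."""
    layout = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(CHUNK_SIZE, file_size - offset)
        layout.append((index, TARGETS[index % len(TARGETS)], offset, length))
        offset += length
        index += 1
    return layout

for entry in chunk_layout(200 * 1024):
    print(entry)
# Chunks 0..3 land on store-0, store-1, store-2, store-0 respectively.
```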
Block Device Level
• Block level data
The file system block
• The atomic unit of file system management is the file system block.
• A file's data may span multiple file system blocks.
• A file system block is composed of a consecutive range of disk block
addresses.
Data in disk
• Disk drives read and write data to media through cylinder, head, and
sector geometry.
• Microcode on a disk translates between disk block numbers and
cylinder/head/sector locations.
• This translation is an elementary form of virtualization.
Block Device Level
• Block device interface
SCSI (Small Computer System Interface)
• The exchange of data blocks between the host system and storage is
governed by the SCSI protocol.
• The SCSI protocol is implemented in a client/server model.
• The SCSI protocol is responsible for block exchange but does not define
how data blocks will be placed on disk.
• Multiple instances of SCSI client/server sessions may run concurrently
between a server and storage.
Block Device Level
• Logical unit and Logical volume
Logical unit
• The SCSI command processing entity within the storage target represents a
logical unit (LU) and is assigned a logical unit number (LUN) for identification
by the host platform.
• LUN assignment can be manipulated through LUN mapping, which
substitutes virtual LUN numbers for actual ones.
Logical volume
• A volume represents the storage capacity of one or more disk drives.
• Logical volume management may sit between the file system and the device
drivers that control system I/O.
• Volume management is responsible for creating and maintaining metadata
about storage capacity.
• Volumes are an archetypal form of storage virtualization.
Block Device Level
• Data block level virtualization
LUN & LBA
• A single block of information is addressed using a logical unit
identifier (LUN) and an offset within that LUN, which is known as a
Logical Block Address (LBA).
Apply address space remapping
• The address space mapping is between a logical disk and a logical
unit presented by one or more storage controllers (a remapping sketch follows).
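A minimal sketch of block-level address remapping: a virtual LUN's address space is stitched together from extents of physical (LUN, LBA) ranges, and each virtual LBA is translated through the table. The LUN names and extent sizes are invented for illustration.

```python
# Map a virtual LBA on a virtual LUN to a physical (LUN, LBA) pair through a
# remapping table of extents: (virtual start LBA, length, physical LUN, physical start LBA).

VIRTUAL_LUN_MAP = [
    (0,    4096, "lun-20", 8192),    # virtual LBAs 0..4095    -> lun-20 LBAs 8192..
    (4096, 4096, "lun-21", 0),       # virtual LBAs 4096..8191 -> lun-21 LBAs 0..
]

def remap(virtual_lba):
    for vstart, length, plun, pstart in VIRTUAL_LUN_MAP:
        if vstart <= virtual_lba < vstart + length:
            return plun, pstart + (virtual_lba - vstart)
    raise ValueError("virtual LBA not mapped")

print(remap(10))      # ('lun-20', 8202)
print(remap(5000))    # ('lun-21', 904)
```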
• Introduction
• What to virtualize
• Where to virtualize
• How to virtualize
• Case study
STORAGE VIRTUALIZATION
Where To Virtualize
• Storage interconnection
The path to storage
• The storage interconnection provides the data path between
servers and storage.
• The storage interconnection is composed of both hardware and
software components.
• Operating systems provide drivers for I/O to storage assets.
• Storage connectivity for hosts is provided by host bus adapters
(HBAs) or network interface cards (NICs).
Where To Virtualize
• Storage interconnection protocol
Fibre Channel
• Usually for high performance requirements.
• Supports point-to-point, arbitrated loop, and fabric interconnects.
• Device discovery is provided by the simple name server (SNS).
• Fibre Channel fabrics are self-configuring via fabric protocols.
iSCSI (Internet SCSI)
• For moderate performance requirements.
• Encapsulates SCSI commands, status and data in TCP/IP.
• Device discovery by the Internet Storage Name Service (iSNS).
• iSCSI servers can be integrated into Fibre Channel SANs through IP storage
routers.
Where To Virtualize
• Abstraction of physical storage
Physical to virtual
• The cylinder, head and sector geometry of individual disks is virtualized
into logical block addresses (LBAs).
• For storage networks, the physical storage system is identified by a
network address / LUN pair.
• Combining RAID and JBOD assets to create a virtualized mirror must
accommodate performance differences.
Metadata integrity
• Storage metadata integrity requires redundancy for failover or load
balancing.
• Virtualization intelligence may need to interface with upper layer
applications to ensure data consistency.
Where To Virtualize
• Different approaches :
Host-based approach
• Implemented as software running on host systems.
Network-based approach
• Implemented on network devices.
Storage-based approach
• Implemented on storage target
subsystem.
Host-based Virtualization
• Host-based approach
File level
• Run a virtualized file system on the host to map files into data blocks,
which are distributed among several storage devices.
Block level
• Run logical volume management
software on the host to intercept I/O
requests and redirect them to
storage devices.
Provide services
• Software RAID
Host-based Virtualization
• Important issues
Storage metadata servers
• Storage metadata may be shared by multiple servers.
• Shared metadata enables a SAN file system view for multiple servers.
• Provides virtual to real logical block address mapping for client.
• A distributed SAN file system requires file locking mechanisms to preserve
data integrity.
Host-based storage APIs
• May be implemented by the operating system to provide a common
interface to disparate virtualized resources.
• Microsoft's virtual disk service (VDS) provides a management interface for
dynamic generation of virtualized storage.
Host-based Virtualization
• A typical example:
LVM
• A software layer between the file system and the disk driver.
• Executed by the host CPU.
• Lacks hardware assist for functions such as software RAID.
• Independent of vendor-specific storage architectures.
• Dynamic capacity allocation to expand or shrink volumes.
• Supports alternate pathing for high availability.
Host-based Virtualization
• Host-based implementation
Pros
• No additional hardware or infrastructure requirements
• Simple to design and implement
• Improves storage utilization
Cons
• Storage utilization is optimized only on a per-host basis
• The software implementation depends on each operating system
• Consumes host CPU cycles for virtualization
Examples
• LVM, NFS
Network-based Virtualization
• Network-based approach
File level
• File-level virtualization is seldom implemented in the network.
Pros
• Easy to implement
Cons
• Poor scalability and a potential bottleneck
STORAGE VIRTUALIZATION
ON LINUX SYSTEMS
RAID
• RAID (redundant array of independent disks)
• Originally: redundant array of inexpensive disks
RAID schemes provide different balances among the key goals:
• Reliability
• Availability
• Performance
• Capacity
RAID Levels
• The most commonly used levels:
RAID 0
• block-level striping without parity or mirroring
RAID 1
• mirroring without parity or striping
RAID 1+0 (RAID 10)
• mirroring and striping
RAID 2
RAID 3
RAID 4
RAID 5
• block-level striping with distributed parity (see the XOR parity sketch after this list)
RAID 5+0 (RAID 50)
• distributed parity and striping
RAID 6
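For the parity-based levels, the key operation is XOR parity: the parity block is the XOR of the data blocks in a stripe, so any single missing block can be recomputed from the remaining blocks plus parity. A small worked example (the block values are chosen arbitrarily):

```python
# XOR parity as used conceptually in RAID 5: parity = d0 ^ d1 ^ ... ^ dn,
# and any single lost block equals the XOR of all remaining blocks.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

stripe = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]   # three data blocks
parity = xor_blocks(stripe)

# Simulate losing the second data block and rebuilding it from the rest + parity.
rebuilt = xor_blocks([stripe[0], stripe[2], parity])
assert rebuilt == stripe[1]
print(parity.hex(), rebuilt.hex())   # 7700 3344
```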
RAID 0
• RAID 0: Block-level striping
without parity or mirroring
It has no (or "zero") redundancy.
It provides improved performance and additional storage capacity
(see the address-arithmetic sketch after this slide).
It has no fault tolerance: any drive failure destroys the array, and the
likelihood of failure increases with the number of drives in the array.
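A minimal sketch of the RAID 0 address arithmetic, assuming four member disks and a fixed stripe-unit size (both parameters are arbitrary here):

```python
# RAID 0 address arithmetic: a logical block is located by which stripe unit
# it falls in, which member disk that unit lives on, and the offset on that disk.

BLOCK_SIZE = 512                 # bytes per block (illustrative)
STRIPE_UNIT_BLOCKS = 128         # blocks per stripe unit (illustrative)
NUM_DISKS = 4

def raid0_locate(logical_block):
    stripe_unit = logical_block // STRIPE_UNIT_BLOCKS
    disk = stripe_unit % NUM_DISKS
    offset_in_unit = logical_block % STRIPE_UNIT_BLOCKS
    disk_block = (stripe_unit // NUM_DISKS) * STRIPE_UNIT_BLOCKS + offset_in_unit
    return disk, disk_block

for lb in (0, 127, 128, 600):
    print(lb, "->", raid0_locate(lb))
# 0 and 127 stay on disk 0; 128 moves to disk 1; 600 lands on disk 0, block 216.
```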
Logical Volume Management
• dmsetup
Low-level logical volume management.
• Operations: create, delete, suspend, resume, etc.
• Works with a mapping table file (a table-generation sketch follows).
Limitations
• Still cannot provide cross-machine mappings.
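To illustrate "works with a mapping table", the sketch below generates a device-mapper table that concatenates two block devices with the linear target. The device paths and sizes are made up, and the note about handing the table to dmsetup is only indicative; see dmsetup(8) for the exact invocation.

```python
# Build a device-mapper "linear" table that concatenates several devices into
# one logical device.  Each line: <start sector> <num sectors> linear <dev> <dev offset>.

def linear_table(devices):
    """devices: list of (path, size_in_512_byte_sectors) tuples."""
    lines = []
    start = 0
    for path, sectors in devices:
        lines.append(f"{start} {sectors} linear {path} 0")
        start += sectors
    return "\n".join(lines)

table = linear_table([("/dev/sdb1", 2097152),    # illustrative devices and sizes
                      ("/dev/sdc1", 4194304)])
print(table)
# A table like this can then be given to device-mapper when creating the device,
# e.g. read by dmsetup from a file or standard input.
```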
Logical Volume Management
• The file system is built on top of the device-mapper framework and
accesses it by means of system calls.
Logical Volume Management
• The file system in the operating system invokes a set of block
device system calls.
STORAGE VIRTUALIZATION
IN DISTRIBUTED SYSTEMS
VastSky
• Overview
VastSky is a Linux-based cluster storage system that provides logical
volumes to users by aggregating disks over a network.
• Three kinds of servers
storage manager
• Maintains a database that describes the physical and logical resources in the system.
• e.g., creates and attaches logical volumes.
head servers
• Run user applications or virtual machines that actually use VastSky logical volumes.
storage servers
• Storage servers have physical disks that are used to store user data.
• The disks are exported over the network (via iSCSI) and used to provide logical volumes on head servers.
VastSky
• VastSky Architecture
[Figure: VastSky architecture — the storage manager is controlled via XML-RPC, while data access goes to the storage pool of storage servers as iSCSI requests.]
VastSky
• Logical Volume
A set of several mirrored disks.
Several physical disk chunks on different servers.
[Figure: a logical volume in the storage pool; there are 3 mirrored disks, and all of them are distributed across 3 different storage servers (out of Storage Servers 1-4).]
VastSky
• Redundancy
VastSky mirrors user data to three storage servers by default and
all of them are updated synchronously.
VastSky can be configured to use two networks (e.g. two
independent ethernet segments) for redundancy.
• Fault detection
The storage manager periodically checks whether each head server and
storage server is responsive.
• Recovery
On a failure, the storage manager attempts to reconfigure mirrors
by allocating new extents from other disks automatically.
VastSky
• Recovery mechanism
[Figure: when a storage server holding one copy of the data crashes, the affected mirror is rebuilt on spare space on another storage server in the storage pool.]
VastSky
• Scalability
Most cluster file systems and storage systems that have a
metadata control node have a scalability problem.
VastSky doesn't have this problem: once a logical volume is set
up, all I/O operations are done only through Linux drivers,
without any storage manager interaction.
VastSky
• Load Balance
With VastSky's approach, the load is equalized across the physical
disks, which makes good use of their aggregate I/O bandwidth.
[Figure: the chunks (D1, D2, D3) of a logical volume are spread over the disks of Storage Servers 1-4 in the storage pool.]
Lustre File System
• What is Lustre ?
Lustre is a POSIX-compliant global, distributed, parallel filesystem.
Lustre is licensed under GPL.
• Some features :
Parallel shared POSIX file system
Scalable
• High performance
• Petabytes of storage
Coherent
• Single namespace
• Strict concurrency control
Heterogeneous networking
High availability
Lustre File System
• Lustre components :
Metadata Server (MDS)
• The MDS makes the metadata stored in one or more MDTs available to clients.
Metadata Target (MDT)
• The MDT stores metadata (such as filenames, permissions) on an MDS.
Object Storage Servers (OSS)
• The OSS provides file I/O service and network request handling for one or
more local OSTs.
Object Storage Target (OST)
• The OST stores file data as data objects on one or more OSSs.
• Lustre network :
Supports several network types
• Infiniband, TCP/IP on Ethernet, Myrinet, Quadrics, …etc.
Take advantage of remote direct memory access (RDMA)
• Improve throughput and reduce CPU usage
Lustre File System
• Lustre in HPC
Lustre is the leading HPC file system
• 15 of Top 30
• Demonstrated scalability
Performance
• Systems with over 1,000 nodes
• 190 GB/sec IO
• 26,000 clients
Examples
• Titan supercomputer at Oak Ridge National Laboratory
– TOP500: #1, November 2012
• System at Lawrence Livermore National Laboratory (LLNL)
• Texas Advanced Computing Center (TACC)
Ceph
• Overview
Ceph is a free software distributed file system.
Ceph's main goals are to be POSIX-compatible, and completely
distributed without a single point of failure.
The data is seamlessly replicated, making it fault tolerant.
• Release
On July 3, 2012, the Ceph development team released Argonaut, the
first release of Ceph with long-term support.
Ceph
• Introduction
Ceph is a distributed file system that provides excellent
performance, reliability, and scalability.
Object-based storage.
Ceph separates data and metadata operations by eliminating file
allocation tables and replacing them with generating functions.
Ceph utilizes a highly adaptive distributed metadata cluster,
improving scalability.
Uses object-based storage devices (OSDs) to access data directly,
giving high performance.
Ceph
• Object-Based Storage
Ceph
• Goal
Scalability
• Storage capacity, throughput, client performance. Emphasis on HPC.
Reliability
• Failures are the norm rather than the exception, so the system must have
fault detection and recovery mechanisms.
Performance
• Load balancing under dynamic workloads.
Ceph
• Three main components
Clients: near-POSIX file system interface.
Cluster of OSDs: stores all data and metadata.
Metadata server cluster: manages the namespace (file names).
Three Fundamental Design Features
1. Separating Data and Metadata
Separates file metadata management from the storage of file data.
Metadata operations are collectively managed by a metadata server
cluster.
Users can access OSDs directly to get data, using the metadata.
Ceph removes data allocation lists entirely.
Instead, CRUSH assigns objects to storage devices.
Separating Data and Metadata
• Ceph separates data and metadata operations
Separating Data and Metadata
• CRUSH (Controlled Replication Under Scalable Hashing)
CRUSH is a scalable pseudo-random data distribution function
designed for distributed object-based storage systems.
Defines some simple hash functions.
Uses hash functions to efficiently map data objects to storage
devices without relying on a central directory.
Advantages
• Because hash functions are used, clients can calculate object locations directly.
Separating Data and Metadata
• CRUSH(x) → (osdn1, osdn2, osdn3)
Inputs
• x is the placement group
• Hierarchical cluster map
• Placement rules
Outputs a list of OSDs
• Advantages
Anyone can calculate object location
Cluster map infrequently updated
Separating Data and Metadata
• Data Distribution with CRUSH
To avoid imbalance (e.g., idle or empty OSDs) and load asymmetries
(e.g., hot data on a new device), new data is distributed randomly.
Using a simple hash function, Ceph maps objects to placement
groups (PGs); PGs are assigned to OSDs by CRUSH (a simplified placement sketch follows).
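The sketch below mimics the two-step placement only in spirit: a simple hash maps an object name to a placement group, and a deterministic pseudo-random choice maps each PG to a set of OSDs, so any client holding the same map computes the same result. It is not the real CRUSH algorithm, and the PG count, OSD names, and replica count are arbitrary.

```python
import hashlib
import random

NUM_PGS = 128          # number of placement groups (illustrative)
OSDS = [f"osd.{i}" for i in range(12)]
REPLICAS = 3

def object_to_pg(name):
    # Step 1: a simple hash maps the object name to a placement group.
    h = int.from_bytes(hashlib.sha1(name.encode()).digest()[:4], "big")
    return h % NUM_PGS

def pg_to_osds(pg):
    # Step 2: a deterministic pseudo-random choice maps the PG to REPLICAS OSDs.
    # (The real system uses CRUSH with a hierarchical cluster map and placement rules.)
    rng = random.Random(pg)            # seeded by the PG id -> same answer on every client
    return rng.sample(OSDS, REPLICAS)

for obj in ("foo", "bar"):
    pg = object_to_pg(obj)
    print(obj, "-> pg", pg, "->", pg_to_osds(pg))
# Any client with the same map computes the same placement, with no directory lookup.
```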
Dynamic Distributed Metadata Management
2. Dynamic Distributed Metadata Management
Ceph utilizes a metadata cluster architecture based on dynamic
subtree partitioning (for workload balance).
Dynamic Subtree Partitioning
• Most file systems use static subtree partitioning, which leads to imbalanced
workloads; alternatively, a simple hash function can locate a directory.
• Ceph's MDS cluster is based on dynamic subtree partitioning, which balances
the workload.
Reliable Distributed Object Storage