Cloud Storage Infrastructures
Cloud Storage Infrastructure: A cloud storage infrastructure is the hardware and software
framework that supports the computing requirements of a private or public cloud storage service.
Both public and private cloud storage infrastructures are known for their elasticity, scalability and
flexibility.
Cloud storage architectures are primarily about delivering storage on demand in a highly scalable and
multi-tenant way. Cloud storage architectures consist of a front end that exports an API to access the
storage.
Cloud Storage Architecture
Characteristic: Description
• Manageability: The ability to manage a system with minimal resources
• Access method: Protocol through which cloud storage is exposed
• Performance: Performance as measured by bandwidth and latency
• Multi-tenancy: Support for multiple users (or tenants)
• Scalability: Ability to scale to meet higher demands or load in a graceful manner
• Data availability: Measure of a system's uptime
• Control: Ability to control a system, in particular to configure for cost, performance, or other characteristics
• Storage efficiency: Measure of how efficiently the raw storage is used
• Cost: Measure of the cost of the storage (commonly in dollars per gigabyte)
Fig: General Cloud Architecture
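The Performance and Cost rows can be made concrete with a back-of-the-envelope sketch. The numbers below (link speed, latency, $/GB rate) are illustrative assumptions, not figures from the source:

```python
def transfer_time(size_bytes, bandwidth_bps, latency_s):
    """Time to move a payload: one-way latency plus serialization time."""
    return latency_s + size_bytes * 8 / bandwidth_bps

# Illustrative: fetching 1 GiB over a 1 Gbps link with 50 ms of latency.
t = transfer_time(2**30, 1e9, 0.05)   # roughly 8.6 seconds

# Cost row: dollars per gigabyte times gigabytes stored (hypothetical rate).
monthly_cost = 0.023 * 500            # $0.023/GB-month for 500 GB
```

Latency dominates for many small objects, while bandwidth dominates for large ones, which is why both appear in the Performance row.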
Cloud Storage Types
• DAS – Direct Attached Storage
• DAS stands for Direct Attached Storage and, as the name suggests,
it is an architecture where storage connects directly to hosts.
Based on the location of the storage devices with respect to the host, DAS can be classified as external or
internal.
Internal DAS: The storage device is internally connected to the host by serial or parallel buses.
Most internal buses have distance limitations, can be used only for short-distance
connectivity, and can connect only a limited number of devices. They also hamper
maintenance because they occupy a large amount of space inside the server.
External DAS: The server connects directly to external storage devices. SCSI or FC protocols
are used to communicate between the host and the storage devices.
External DAS overcomes the distance and device-count limitations of internal DAS
and also provides central administration of the storage devices.
Cloud Storage Infrastructure – Direct Attached Storage(DAS)
Why and why not to go for DAS?
Why to go for DAS:
• Less hardware and software are needed to set up and operate DAS.
• Managing DAS is easy, as host-based tools such as the host OS are used.
Why not to go for DAS:
• A major limitation of DAS is that it doesn't scale well; it restricts the number of hosts that can be directly
connected to the storage.
• Limited bandwidth in DAS hampers the available I/O processing capability, and when that capability is reached, service
availability may be compromised.
• It doesn't make optimal use of resources because it cannot share front-end ports.
Cloud Storage Infrastructure – Network Attached Storage (NAS)
NAS is a file-level computer data storage server connected to a network, providing data access to a
diverse group of clients.
NAS is specialized for its task by its hardware, its software, or both, and provides the
advantage of server consolidation by removing the need for multiple file servers.
NAS also uses its own OS, which works with its own peripheral devices.
A NAS operating system is optimized for file I/O and therefore performs file I/O better than a general-purpose server.
It uses protocols such as TCP/IP, CIFS and NFS, which are used for data transfer and for
accessing the remote file service.
Components of NAS
Fig: Network Attached Storage
• A centralized storage device for storing data on a network.
• Has multiple hard drives in a RAID configuration.
• Attaches directly to a switch or router on the network.
• Commonly used in small businesses.
Drawbacks
• Single point of Failure.
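The RAID configuration mentioned above typically protects the drives with XOR parity (the RAID 5 idea): losing any single drive does not lose data. A minimal sketch of that mechanism, with toy block values:

```python
# XOR parity: the parity block is the XOR of all data blocks, so any one
# missing block can be rebuilt by XOR-ing the survivors with the parity.
def parity(blocks):
    p = 0
    for b in blocks:
        p ^= b
    return p

data_drives = [0b1010, 0b0110, 0b1100]   # toy data blocks, one per drive
p = parity(data_drives)

# Suppose drive 1 fails: XOR of the surviving blocks and parity rebuilds it.
recovered = parity([data_drives[0], data_drives[2], p])
assert recovered == data_drives[1]
```

Note that parity only addresses drive failure; the NAS box itself remains a single point of failure, as the drawback above states.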
Cloud Storage Infrastructure – Storage Area Network (SAN)
• A storage area network (SAN) provides access to consolidated, block-level data storage that is accessible by
applications running on any of the networked servers.
• It carries data between servers (hosts) and storage devices through Fibre Channel switches.
• A SAN helps organizations connect geographically isolated hosts and provides robust
communication between hosts and storage devices.
• In a SAN, each server and storage device is linked through a switch, which supports SAN features such as
storage virtualization, quality of service, security and remote replication.
• Cabling is the physical medium used to establish a link between SAN devices.
• An HBA, or Host Bus Adapter, is an expansion card that fits into an expansion slot in a server.
• A switch handles and directs traffic between different network devices: it accepts traffic and then
transmits it to the desired endpoint device.
• A special high-speed network that stores and
provides access to large amounts of data.
• SANs are fault tolerant.
• Data is shared among several disk arrays.
• Servers access data as if it were on a
local drive.
• iSCSI (cheaper) and FC (more expensive) protocols are
used.
• SANs are not affected by general network traffic.
• Highly scalable, highly redundant, and high performance.
NAS
• Best for: SMBs and organizations that need a minimal-maintenance, reliable and flexible storage
system that can quickly scale up as needed to accommodate new users or growing data.
• Drawbacks: Server-class devices at enterprise organizations that need to transfer block-level data
supported by a Fibre Channel connection may find that NAS can't deliver everything that's needed.
Maximum data transfer issues could be a problem with NAS.
SAN
• Best for: Block-level data sharing of mission-critical files or applications at data centers or
large-scale enterprise organizations.
• Drawbacks: SAN can be a significant investment and is a sophisticated solution that's typically
reserved for serious large-scale computing needs. A small-to-midsize organization with a limited
budget and few IT staff or resources likely wouldn't need SAN.
Storage Networking (FC, iSCSI, FCoE)
Fibre Channel (FC) is a technology for transmitting data between computer devices at data rates of up to 20 Gbps at present.
• Fibre Channel began in the late 1980s as part of the IPI (Intelligent Peripheral Interface) Enhanced Physical Project to
increase the capabilities of the IPI protocol. That effort widened to investigate other interface protocols as candidates for
augmentation. In 1998, Fibre Channel was approved as a project and has since become an industry standard.
iSCSI, the Internet Small Computer System Interface, is a storage networking standard used to link data storage
facilities.
• iSCSI is used to transmit data over local area networks, wide area networks or the Internet; it enables location-
independent data storage and retrieval and is one of the two main approaches to storage data transmission over IP networks.
FCIP (Fibre Channel over IP) translates Fibre Channel control codes and data into IP packets for transmission between
geographically distant Fibre Channel networks.
iSCSI Benefits
• SCSI transport protocol that operates over TCP
• Encapsulation of SCSI command descriptor blocks and data in TCP/IP byte streams
• Broad industry support: OS vendors support their iSCSI drivers, gateways (routers, bridges), and native iSCSI storage
arrays
• Wire the server only once
• Fewer cables and adaptors
• New operational model
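The encapsulation idea in the second bullet can be sketched in a few lines: wrap a SCSI command descriptor block (CDB) in a length-prefixed record for a TCP byte stream. The framing here is deliberately simplified and is not the real iSCSI PDU layout (which uses a 48-byte basic header segment):

```python
import struct

# A real SCSI READ(10) CDB: opcode 0x28, LBA 16, transfer length 8 blocks.
READ_10 = bytes([0x28, 0, 0, 0, 0, 0x10, 0, 0, 8, 0])

def encapsulate(cdb: bytes) -> bytes:
    """Prefix the CDB with a 4-byte big-endian length for the byte stream."""
    return struct.pack(">I", len(cdb)) + cdb

def decapsulate(stream: bytes) -> bytes:
    """Read the length prefix, then slice out the CDB."""
    (length,) = struct.unpack(">I", stream[:4])
    return stream[4:4 + length]

wire = encapsulate(READ_10)          # what would travel over TCP
assert decapsulate(wire) == READ_10  # the target recovers the command
```

The point of the sketch is that the SCSI command itself is untouched; iSCSI's job is framing it reliably over an ordinary TCP connection.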
Difference between FCIP and FCoE
• FCIP uses a tunnel to transfer data between networks. It relies on TCP/IP to carry Fibre Channel traffic.
• FCoE was developed to simplify switches and consolidate I/O in comparison with FCIP. It replaces
FC links with high-speed Ethernet links between the devices that support the network.
• iFCP is a newer standard that broadens the way data can be transferred over the Internet. It combines
elements of the FCIP and iSCSI protocols.
• Some customers have limited I/O requirements in the 100-Mbps range, and iSCSI is just the right solution for them. This is
why iSCSI has taken off and is so successful in the SMB market: it is cheap, and it gets the job done.
• Large enterprises are adopting virtualization, have much higher I/O requirements, and want to preserve their investments and
training in Fibre Channel. For them, FCoE is probably a better solution.
• FCoE will take a large share of the SAN market. It will not make iSCSI obsolete, but it will reduce its potential market.
Cloud File System
A cloud file system is a distributed file system that allows many clients to access data and supports
operations on that data.
A file system also ensures security in terms of confidentiality, availability and integrity.
Examples include:
• BigTable
• HBase
• Dynamo
Cloud File System: Google File System
• GFS is a proprietary distributed file
system developed by Google for its own
use.
These systems differ from one another in a few respects.
Goals
• Want asynchronous processes to be continuously updating different pieces of data
• Want access to most current data at any time
• Need to support:
• Very high read/write rates (millions of ops per second)
• Efficient scans over all or interesting subsets of data
• Efficient joins of large one-to-one and one-to-many datasets
• Often want to examine data changes over time
• E.g. Contents of a web page over multiple crawls
Building Blocks
• Google File System (GFS): Raw storage
• Scheduler: schedules jobs onto machines
• Lock service: distributed lock manager
• MapReduce: simplified large-scale data processing
• Want to keep copy of a large collection of web pages and related information
• Use URLs as row keys
• Various aspects of web page as column names
• Store contents of web pages in the contents: column under the timestamps when they were fetched.
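The webtable example above can be sketched with plain nested dicts: rows keyed by URL, columns by name, and cells versioned by fetch timestamp. This is an illustrative model of the data layout only, not Bigtable's actual storage implementation:

```python
# row -> column -> timestamp -> value
table = {}

def put(row, column, timestamp, value):
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get_latest(row, column):
    """Return the cell with the highest timestamp, i.e. the newest crawl."""
    cells = table[row][column]
    return cells[max(cells)]

# Two crawls of the same page stored under the contents: column.
put("com.example/index.html", "contents:", 1, "<html>v1</html>")
put("com.example/index.html", "contents:", 2, "<html>v2</html>")
assert get_latest("com.example/index.html", "contents:") == "<html>v2</html>"
```

Keeping every timestamped version side by side is what makes "examine data changes over time" (the crawl-over-crawl goal above) a cheap lookup rather than a join.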
Applications of HBase
• It is used for write-heavy applications.
• HBase is used whenever we need fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Architecture of HBase
• HBase has three major components: the client library, a master server, and region servers.
• ZooKeeper has ephemeral nodes representing the different region servers. Master servers use these nodes to
discover available servers.
• In addition to availability, the nodes are also used to track server failures and network partitions.
• In pseudo-distributed and standalone modes, HBase itself takes care of ZooKeeper.
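The ephemeral-node mechanism can be illustrated with a toy registry: a node exists only while its owner's session is alive, so a vanished node tells the master a region server has died or been partitioned away. This is a simulation of the behaviour only; real coordination goes through the ZooKeeper client API:

```python
# Toy stand-in for ZooKeeper's ephemeral znodes under /hbase/rs.
class EphemeralRegistry:
    def __init__(self):
        self.nodes = {}                  # server name -> session alive

    def register(self, server):
        self.nodes[server] = True        # znode created with the session

    def session_expired(self, server):
        self.nodes.pop(server, None)     # ephemeral znode disappears

    def live_servers(self):
        """What the master sees when it lists the region-server znodes."""
        return sorted(self.nodes)

zk = EphemeralRegistry()
zk.register("regionserver-1")
zk.register("regionserver-2")
zk.session_expired("regionserver-1")     # crash or network partition
assert zk.live_servers() == ["regionserver-2"]
```

The master never polls servers directly; watching this node list is how it learns about both new arrivals and failures, as described above.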
Dynamo
• Amazon DynamoDB is a fully managed
NoSQL database service that allows you to create
database tables that can store and retrieve any
amount of data.
• Scalable − Amazon DynamoDB is designed to scale. There is no need to worry about predefined
limits on the amount of data each table can store; any amount of data can be stored and retrieved.
DynamoDB automatically spreads the data as the table grows.
• Fast − Amazon DynamoDB provides high throughput at very low latency. As datasets grow,
latencies remain stable due to the distributed nature of DynamoDB's data placement and request
routing algorithms.
• Durable and highly available − Amazon DynamoDB replicates data across at least three
data centers. The system operates and serves data even under
various failure conditions.
• Flexible − Amazon DynamoDB allows the creation of dynamic tables, i.e. a table can
have any number of attributes, including multi-valued attributes.
• Cost-effective − Payment is for what we use, without any minimum charges. Its
pricing structure is simple and easy to calculate.
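The "spreads automatically as the table grows" behaviour is the hallmark of Dynamo-style partitioning. A minimal consistent-hashing sketch shows the idea of mapping keys onto a ring of nodes; DynamoDB's actual partitioning is internal to the service, so the node names and ring layout below are purely illustrative:

```python
import hashlib
from bisect import bisect

def h(s):
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node owns the arc of the ring ending at its hash point.
        self.points = sorted((h(n), n) for n in nodes)

    def owner(self, key):
        """First node clockwise from the key's hash owns the key."""
        keys = [p for p, _ in self.points]
        i = bisect(keys, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
# Every item deterministically lands on exactly one of the three nodes.
owners = {ring.owner(f"item{i}") for i in range(100)}
```

Because only the keys adjacent to a new node's hash point move when the ring grows, adding capacity reshuffles a small fraction of the data rather than all of it.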