STORAGE IN CLOUD
EVOLUTION OF STORAGE
Direct Attached Storage (DAS)
• The storage device is directly attached to the computer
• A simple example of DAS is an external hard drive connected through a Universal Serial Bus
(USB) cable
• A DAS is designed to be used only by a single computer/server
DAS advantages
• High availability.
• High access rate due to the absence of a storage network (SAN) in between.
• Elimination of network setup complications.
• Storage capacity expansion.
• Data security and fault tolerance.
Disadvantages of DAS
• Dedicated resources are needed for a single computer
• No economies of scale in sharing the storage
• We cannot manage the DAS via a network
• DAS needs a special hardware connection
Network Attached Storage (NAS)
• It is a special purpose device and
comprises hard disks, as well as
management software
• NAS is dedicated 100% to serve files
over a network. In simple terms, NAS
shares storage over a network
• Once connected, you will come across
special folders named ‘Shares’ that can
be accessed over the network
• Multiple user logins can also be
created to provide various levels of
access
• NAS is commonly used in organizations
of all sizes for centralized file storage
and sharing
Advantages of NAS:
• Relatively inexpensive
• A self-contained solution
• Ease of administration
• A wide array of systems and sizes to choose from
• Drive-failure-tolerant storage volumes
• Automatic backup to other devices and the cloud
• Easy to install and configure
• 24/7 and remote data availability
• Universal client access
Disadvantages of NAS:
• Performance depends on the protocol
• Slows down for video applications or multiple large files
• It is file oriented
• Increased LAN traffic
• File transfer speed is not as fast as DAS
• Limited scalability
• Additional input-output processing
• Available features depend on the NAS chip and firmware
• Requires some basic knowledge of computer networking to use efficiently
Storage Area Networks (SAN)
• It is a network of storage devices that can be accessed by multiple servers or computers, providing a shared pool
of storage space
• Each computer on the network can access storage on the SAN as though they were local disks connected directly
to the computer
• Its high I/O processing speeds and low latency make it a good option for high-speed transactional environments
like enterprise databases, high-traffic e-commerce websites, video streaming
Advantages of SAN
• It has economies of scale similar to that of
NAS
• It has higher hardware utilization, similar to
that of NAS
• It has speed similar or comparable to DAS
• It allows virtual environments, cloud
computing, etc.
Disadvantages of SAN
• Its performance is affected by other SAN users
• Its performance is limited by the network if configured incorrectly
• Better raw performance can still be achieved with DAS hardware
• It requires multiple static IP addresses
EVOLUTION OF STORAGE ARCHITECTURE
The storage architecture of a system is a critical component of data transfer and access to vital information
It provides the foundation for data access across an enterprise
Depending on operations and the needs of the business, specific storage architectures might be necessary to enable employees to work to their fullest potential
The setup of a storage architecture can dictate which aspects get prioritized, such as performance, scalability, or cost
STYLES OF STORAGE ARCHITECTURE
1. Multi-Tiered Model
With a multi-tiered data center, HTTP-based applications make good use of separate tiers for
web application and database servers. It allows for distinct separation between the tiers, which
improves security and redundancy. Security-wise, if one tier is compromised, the others are
generally safe with the help of firewalls between them. As for redundancy, if one server goes
down or needs maintenance, other servers in the same tier can keep things moving.
2. Clustered Architecture
In a clustered system, data stays behind a single compute node, and the nodes do not share memory
between them. The input-output (I/O) path is short and direct, and the system's interconnect has
exceedingly low latency. One approach to the clustered architecture model is to layer "federation
models" on top of the cluster to scale it out somewhat. This bounces the I/O around until it reaches
the node that contains the data. These federated layers require additional code to redirect data,
which adds latency to the entire process.
3. Tightly-Coupled Architectures
These architectures distribute data between multiple nodes, running in parallel, and use a grid of
multiple high-availability controllers. They have a significant amount of inter-node communication
and work with several types of operations, but the master node organizes input processing. With
this more complex design, a tightly-coupled architecture requires much more code. However, the
more tightly coupled a storage architecture is, the more predictably it can provide low latency.
While tight coupling improves performance, it can be difficult to add nodes and scale up, which
inevitably adds complexity to the entire system and opens you up to bugs.
4. Loosely Coupled Architectures
• This type of system does not share memory between nodes
• The data is distributed among them with a significant amount of inter-node communication on
writes, which can make it expensive to run. There is more data movement in a loosely coupled
architecture, creating extra I/Os
• This architecture is relatively simple to use and good for distributed reads
• If the data is heavily distributed on writes, you'll see higher latency and less efficiency in I/O
operations
• If you have less distribution, you might get lower latency, but it will reduce the parallelism on
reading
5. Distributed Architectures
• While it may look similar to a loosely coupled architecture, this approach works with
non-transactional data
• It does not share memory between the nodes, and data is distributed across them
• This type of architecture is less common than many others but is used by extremely large
enterprises with very large amounts of storage (petabytes)
• Its parallel processing model and speed make it a great fit for search engines
Elements of Storage Architecture
1. Data Pattern
Depending on the type of work you do, you may have a random or sequential pattern of I/O
requests.
• Random: In a random pattern, the data is written and read at various locations on the disk
platter, which can influence the effectiveness of a RAID system. The controller cache uses
patterns to predict the data blocks it will need to access next for reading or writing. If the
data is random, there is no pattern for it to work from. Another issue with a random pattern
is the increase in seek time. With data spread out across data blocks, the disk head needs to
move each time a piece of information is requested. The arm and disk head physically have
to move there, which can add to the seek time and impact performance.
• Sequential: The sequential pattern works, as you would imagine, in an ordered fashion. It is
more structured and provides predictable data access. With this kind of layout, the RAID
controller can more accurately guess which data blocks will need to be accessed next and
cache that information. It boosts performance and keeps the arm from moving around as
much. These sequential applications are usually built with throughput in mind. You’ll see
sequential patterns with large filetypes, like video and backups, where they are written to
the drive in continuous blocks.
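As a rough illustration of the two access patterns described above, the following Java sketch reads 4 KB blocks from a large pre-existing test file, first sequentially and then at random offsets. It is an assumption-laden micro-benchmark, not a rigorous test: the file name testdata.bin, the block size, and the request count are all hypothetical, and the file is assumed to be large enough for the reads.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.Random;

    public class IoPatternSketch {
        private static final int BLOCK = 4096;      // 4 KB per request
        private static final int REQUESTS = 10_000; // hypothetical workload size

        public static void main(String[] args) throws IOException {
            byte[] buf = new byte[BLOCK];
            Random rnd = new Random(42);
            // Assumes a large pre-existing test file; the name is hypothetical.
            try (RandomAccessFile file = new RandomAccessFile("testdata.bin", "r")) {
                long blocks = file.length() / BLOCK;

                // Sequential pattern: consecutive blocks, so the disk head and
                // read-ahead cache stay in step with the requests.
                long t0 = System.nanoTime();
                for (int i = 0; i < REQUESTS; i++) {
                    file.seek((i % blocks) * BLOCK);
                    file.readFully(buf);
                }
                long seqNs = System.nanoTime() - t0;

                // Random pattern: unpredictable offsets force seeks and defeat
                // read-ahead, which is what hurts spinning disks the most.
                t0 = System.nanoTime();
                for (int i = 0; i < REQUESTS; i++) {
                    file.seek((long) (rnd.nextDouble() * blocks) * BLOCK);
                    file.readFully(buf);
                }
                long randNs = System.nanoTime() - t0;

                System.out.printf("sequential: %d ms, random: %d ms%n",
                        seqNs / 1_000_000, randNs / 1_000_000);
            }
        }
    }

On a mechanical drive the random pass is typically far slower; on an SSD the gap largely disappears, which is the point of the drive comparison that follows.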
Here's how the drive types compare:
SATA (Serial Advanced Technology Attachment): SATA drives have relatively large disk platters
that can struggle with random workloads due to their slow rotational speed. The large platter size
can cause longer seek times.
SAS (Serial Attached SCSI): These drives have smaller platters with faster speeds. They can cut
the seek time down significantly.
SSD (Solid State Drive): SSDs are excellent for extremely high-performance workloads. They have
no moving parts, so seek times are almost nonexistent.
2. Layers
In data center storage architecture, you’ll typically see several layers of hardware that serve
separate functions. These layers typically include the:
Core layer: This first layer creates the high-speed packet switching necessary for data transfer. It
connects to many aggregation modules and uses a redundant design.
Aggregation layer: The aggregation layer is the place where traffic flows through and encounters
services like a firewall, network analysis, intrusion detection and more.
Access layer: This layer is where the servers and network physically link up. It involves switches,
cabling and adapters to get everything connected and allow users to access the data.
3. Performance vs. Capacity
Disk drive capabilities are always changing. Disk capacity used to be so low that SAN systems
didn't have to worry about whether the number of disks created enough IOPS per gigabyte (GB);
there were plenty. Nowadays, a SATA-based array can offer the same total capacity as a SAS-based
array while using significantly fewer disks, and fewer disks mean fewer IOPS generated per GB.
If your work involves a lot of random I/O interactions or extreme demand, using SATA disks can
quickly cause your IOPS to bottleneck before you reach capacity. One option here is to front the
disks with a solid-state cache, which can greatly improve random I/O performance.
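A quick back-of-the-envelope calculation makes the IOPS-per-GB trade-off concrete. The drive counts, capacities, and per-drive IOPS figures below are hypothetical round numbers chosen only for illustration; real values vary by model.

    public class IopsPerGb {
        public static void main(String[] args) {
            // Two hypothetical arrays with roughly the same usable capacity:
            // many small, fast SAS spindles vs. few large, slow SATA spindles.
            double sasDisks = 48, sasIopsPerDisk = 180, sasGbPerDisk = 1200;   // 48 x 1.2 TB
            double sataDisks = 12, sataIopsPerDisk = 80, sataGbPerDisk = 4800; // 12 x 4.8 TB

            double sasIopsPerGb = (sasDisks * sasIopsPerDisk) / (sasDisks * sasGbPerDisk);
            double sataIopsPerGb = (sataDisks * sataIopsPerDisk) / (sataDisks * sataGbPerDisk);

            // Same capacity, far fewer spindles: IOPS per GB drops sharply,
            // so a random workload bottlenecks long before capacity runs out.
            System.out.printf("SAS : %.3f IOPS/GB%n", sasIopsPerGb);
            System.out.printf("SATA: %.3f IOPS/GB%n", sataIopsPerGb);
        }
    }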
• Evaluate cost from the outset: Keeping cost in mind as you design from the ground up allows you to make
realistic decisions that will work in the long term. You wouldn’t want to end up with an architecture that needs to
be reorganized right away because upkeep is too expensive or it doesn’t meet the company’s needs. Be realistic
about the costs of a storage architecture so it fits within the business budget.
• Find areas where you can compromise: You won’t be able to prioritize everything. In many instances, focusing on
one aspect will hurt the quality of another. A high-performance system will be costly and could be less scalable. A
scalable system might require more skilled administration and could lose speed. Talk with stakeholders about
what aspects are necessary for the system and why so you can evaluate possible trade-offs with business needs in
mind.
• Work in phases: Your first draft is not going to be the same as the final. As you work through the project, you will
encounter specific challenges and learn more about the technical details of your system. Try not to lock yourself
into a plan and allow the architecture to change organically as you uncover more information.
• Identify your needs first: While it may be tempting to dive right into the specific components that you want to
use, identifying more abstract requirements is an excellent way to start. Think about the state of your data, what
formats you’ll be working with and how you want it to communicate with the server. Try to develop as much
information about the required tasks as you can. This approach allows you to work your way down the chain and
find solutions that match the needs of more than one operation.
CLOUD STORAGE ARCHITECTURE
Cloud storage is a digital storage solution that utilizes multiple servers to store data in logical
pools. Organizations buy storage capacity from providers to store user, organization, or
application data
TYPES OF CLOUD STORAGE
Personal Cloud Storage: A subset of public cloud storage that stores an individual's data in the
cloud and provides the individual with access to that data from anywhere. It also provides data
syncing and sharing across multiple devices
Public Cloud Storage: The enterprise and the storage service provider are separate, and there
aren't any cloud resources stored in the enterprise's data center. The cloud storage provider
fully manages the enterprise's public cloud storage
Private Cloud Storage: The enterprise and the cloud storage provider are integrated in the
enterprise's data center. Private cloud storage helps address potential security and performance
concerns while still offering the advantages of cloud storage
Hybrid Cloud Storage: A combination of public and private cloud storage where critical data is
stored in the enterprise's private cloud while other data is stored in the public cloud
CLOUD STORAGE PROVIDERS
• SpiderOak:
Founded in 2007, SpiderOak is a collaboration tool, file hosting and online backup service. It
allows users to access, synchronize and share data using a cloud-based server
The main focus of SpiderOak is on privacy and security, with private encryption and a clear
privacy policy
• Tresorit:
Founded in 2011, Tresorit is a cloud storage provider based in Hungary and Switzerland. It
emphasizes enhanced security and data encryption for businesses and personal users
It lets you keep control of your files through 'zero-knowledge' end-to-end encryption, which means
only you and the chosen few you decide to share with can see your data
• Egnyte:
Founded in 2007, Egnyte provides software for enterprise file synchronization and
sharing. It allows businesses to store their data locally and online
ADVANTAGES OF CLOUD STORAGE
File Accessibility – The files can be accessed at any time from any place as long as you have
Internet access
Offsite Backup – Cloud storage provides organizations with offsite (remote) backups of data,
which in turn reduces costs
Effective Use of Bandwidth – Cloud storage uses bandwidth effectively, i.e. instead of sending
files to recipients, a web link can be sent through email
Security of Data – Helps protect the data against ransomware or malware, as it is secured and
needs proper authentication to access
DISADVANTAGES OF CLOUD STORAGE
• Dependency on Internet Speed – If the Internet connection is slow or unstable, we might have
problems accessing or sharing the files
• Dependency on a Third Party – A third-party service provider (company) is responsible for the
data stored, so it is important to examine a vendor's security standards before investing
• High Cost for Huge Data – Organizations that require a large amount of storage may find costs
increase significantly after the first few gigabytes of data stored
• No/Minimal Control over Data Storage Framework – Since the cloud storage framework is entirely
managed and monitored by the service provider, the customer has minimal control over it
BIG DATA
A collection of data sets that are very complex and large, and very difficult to process and store
using traditional data processing applications or database management tools, is called Big Data
How does big data work?
Integrate
• Big data brings together data from many disparate sources and applications
• During integration, you need to bring in the data, process it, and make sure it's formatted and
available in a form that your business analysts can get started with
Manage
• Big data requires storage, which can be in the cloud, on premises, or both
• You can store your data in any form you want and bring your desired processing requirements and
necessary process engines to those data sets on an on-demand basis
Analyze
• Investment in big data pays off when you analyze and act on your data
• Get new clarity with visual analysis, and explore the data further to make new discoveries
• Build data models with machine learning and artificial intelligence
CHARACTERISTICS OF BIG DATA
1. Volume: the size and amounts of big data that companies manage and analyze
2. Value: the value of big data usually comes from insight discovery and pattern recognition
that lead to more effective operations, stronger customer relationships and other clear
and quantifiable business benefits
3. Variety: the diversity and range of different data types, including unstructured data, semi-
structured data and raw data
4. Velocity: the speed at which companies receive, store and manage data – e.g., the
specific number of social media posts or search queries received within a day, hour or
other unit of time
5. Veracity: the “truth” or accuracy of data and information assets, which often determines
executive-level confidence
6. Variability: the changing nature of the data companies seek to capture, manage and
analyze – e.g., in sentiment or text analytics, changes in the meaning of key words or
phrases
DATA FORMATS
Unstructured
• Data that has no predefined structure and is not easy to analyze
• This type of data has an unknown schema, such as video files or audio files
Semi-Structured
• Data in which some parts are structured and some are not. It does not have a fixed format;
examples include JSON and XML
Structured
• The best type of data in terms of structure. The data is wholly organized with a fixed schema,
as in an RDBMS, which makes it easier to process and analyze
BIG DATA USE CASES
1. Product development
2. Predictive maintenance
3. Customer experience
4. Fraud and compliance
5. Machine learning
6. Drive innovation
7. Operational efficiency
HADOOP
Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to
maintain and store big-size data. Hadoop works on the MapReduce programming algorithm that was
introduced by Google
Core components:
MapReduce
HDFS (Hadoop Distributed File System)
YARN (Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MAPREDUCE
• MapReduce is the data processing layer of Hadoop
• MapReduce runs applications in parallel on a cluster of low-end machines in a reliable and
fault-tolerant manner
• A MapReduce job comprises a number of map tasks and reduce tasks
• Each task works on a part of the data. This distributes the load across the cluster
• The function of map tasks is to load, parse, transform and filter data
• Each reduce task works on a subset of the output from the map tasks. The reduce task applies
grouping and aggregation to this intermediate data from the map tasks
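A minimal word-count sketch using the standard Hadoop MapReduce Java API illustrates the map and reduce roles described above. The class and field names are illustrative, and the job driver/configuration code is omitted.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map task: parse each input line and emit (word, 1) pairs.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce task: receive all counts for one word (a subset of the map
        // output) and aggregate them into a single total.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }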
2. HDFS
• HDFS stands for Hadoop Distributed File System
• It is the data storage layer of Hadoop
• HDFS splits the data unit into smaller units called blocks and stores them in a distributed
manner
• HDFS has a master-slave architecture
• A daemon called the NameNode runs on the master server
• It is responsible for namespace management and regulates file access by clients
• The DataNode daemon runs on the slave nodes
• It is responsible for storing the actual business data. Internally, a file gets split into a
number of data blocks and stored on a group of slave machines
FAULT TOLERANCE AND RACK AWARENESS
• The default block size is 128 MB, and it can be configured (e.g. to 256 MB)
• To provide fault tolerance, HDFS uses a replication technique
• It makes copies of the blocks and stores them on different DataNodes
• The replication factor is 3 by default, but we can configure it to any value
• A rack contains many DataNode machines, and there are several such racks in a production cluster
• HDFS follows a rack awareness algorithm to place the replicas of the blocks in a distributed fashion
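The block size and replication factor mentioned above correspond to the standard HDFS configuration properties dfs.blocksize and dfs.replication. A minimal sketch of setting them programmatically through the Hadoop Java API follows; it assumes fs.defaultFS already points at a cluster, and the file path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSettings {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Standard HDFS property names; the values here are the usual defaults.
            conf.setInt("dfs.replication", 3);                 // copies of each block
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks

            // The replication factor of an existing file can also be changed per file.
            FileSystem fs = FileSystem.get(conf);
            fs.setReplication(new Path("/data/example.txt"), (short) 2); // hypothetical path
            fs.close();
        }
    }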
3. YARN
• YARN, or Yet Another Resource Negotiator, is the resource management layer of Hadoop
• The basic principle behind YARN is to separate the resource management and job
scheduling/monitoring functions into separate daemons
• In YARN there is one global ResourceManager and a per-application ApplicationMaster
• The ResourceManager arbitrates resources among all the competing applications in the system
• The job of the NodeManager is to monitor the resource usage of the containers and report the
same to the ResourceManager
VIRTUAL DATA CENTRE
A virtual data center is a pool or collection of
cloud infrastructure resources specifically
designed for enterprise business needs
The basic resources are the processor (CPU),
memory (RAM), storage (disk space) and
networking (bandwidth)
It is a virtual representation of a physical data
center, complete with servers, storage clusters and
lots of networking components, all of which reside
in virtual space being hosted by one or more actual
data centers
https://www.vmware.com/in/products/datacenter-
virtualization.html
VIRTUALIZATION TECHNIQUES
• Binary translation and native execution: This technique uses a combination of binary translation
for handling privileged and sensitive instructions and direct execution for user-level
instructions. It is very efficient both in terms of performance and in terms of compatibility with
the guest OS, which does not need to know that it is virtualized.
• OS-assisted virtualization (paravirtualization): In this technique, the guest OS is modified to
be virtualization-aware. Paravirtualization can significantly reduce virtualization overhead, but
the overhead introduced by the hypercalls can affect performance under heavy workloads.
• Hardware-assisted virtualization: As an alternative to binary translation and an attempt to
enhance performance and compatibility, privileged and sensitive calls are handled by a type 1
hypervisor. This eliminates the need for binary translation, since privileged instructions trap to
the hypervisor at the hardware level, which significantly improves performance.
Cloud providers utilize different virtualization platforms to build their data centers. They use
platforms that combine paravirtualization and hardware-assisted virtualization to benefit from the
advantages of both.
VIRTUAL MACHINE COMPONENTS
• Virtual hardware: When a virtual machine is created, a default set of virtual hardware is
assigned to it. Every virtual machine has CPU, memory, and disk resources
• Operating system: A guest operating system is installed on a virtual machine just as you would
install an operating system on a physical computer. After installation, it needs to be secured and
patched periodically
• Applications and services: The VM provider supplies different utility tools and applications
that enhance the performance of the VM and improve management
PHYSICAL TO VM CONVERSION
Steps for Oracle VM VirtualBox:
Start the Oracle VM VirtualBox Manager wizard for a conversion
Click on the New button
Enter the name of the VM and the OS type and version
Select the RAM size
Create a hard disk
Select the hard disk file type as VDI
Set storage on the physical hard disk as dynamically allocated
Choose the file location and size
Click Start and select the installable media
STORAGE VIRTUALIZATION
Storage virtualization requires a multi-level approach
STORAGE VIRTUALIZATION CHALLENGES
Scalability
• Ensure storage devices can meet performance requirements as capacity scales
Functionality
• The virtualized environment must provide the same or better functionality
• Must continue to leverage existing functionality on arrays
Manageability
• The virtualization device breaks the end-to-end view of the storage infrastructure
• Must integrate with existing management tools
Support
• Interoperability in a multi-vendor environment
BLOCK LEVEL STORAGE VIRTUALIZATION
THIN AND THICK PROVISIONING
• Thin provisioning allocates storage on demand
• Thick provisioning pre-allocates storage space in anticipation of future capacity requirements
• Thick provisioning of storage allocation leads to idle storage space, inefficient storage
utilization, and unnecessary costs of infrastructure and maintenance
• Thin and thick provisioning are methods of storage allocation; the difference lies in how the
Logical Unit Number (LUN) is configured
• Thick provisioning creates LUNs and assigns them to specific hosts or applications. The LUN
reserves a specific portion of the storage space for the assigned host or application
• A drawback of thick-provisioned LUNs is that once provisioned they cannot be scaled up; to add
more storage to a thick-provisioned LUN, it has to be recreated
• Thin-provisioned LUNs can be configured to larger sizes than the actual available physical
storage capacity; thin LUNs continue to operate and scatter data all over the available storage
space, while the array keeps track of it
• LUNs can easily scale up whenever necessary in thin provisioning. Thin provisioning speeds up
storage provisioning, simplifies management and reduces costs by enhancing storage utilization
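As a loose analogy only (not how a storage array actually implements LUNs), a sparse file behaves like a thin-provisioned volume while a fully written file behaves like a thick-provisioned one. The sketch below assumes a file system that supports sparse files; the file names and sizes are arbitrary.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class ProvisioningSketch {
        public static void main(String[] args) throws IOException {
            // "Thin" style: declare a large logical size without writing data.
            // On sparse-file-capable file systems, little physical space is used
            // until blocks are actually written.
            try (RandomAccessFile thin = new RandomAccessFile("thin.img", "rw")) {
                thin.setLength(10L * 1024 * 1024 * 1024); // 10 GB logical size
            }

            // "Thick" style: pre-allocate by writing every block up front,
            // consuming the full physical capacity immediately.
            byte[] zeros = new byte[1024 * 1024];
            try (RandomAccessFile thick = new RandomAccessFile("thick.img", "rw")) {
                for (int i = 0; i < 1024; i++) { // 1 GB written in 1 MB chunks
                    thick.write(zeros);
                }
            }
        }
    }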
VIRTUAL STORAGE AREA NETWORK - VSAN
• A virtual storage area network (VSAN) is a logical partitioning created within a physical storage
area network
• This implementation model of a storage virtualization technique divides and allocates some or an
entire storage area network into one or more logical SANs to be used by internal or external IT
services and solutions
Key features: 1. It is software defined 2. Simple configuration 3. Flexibility
4. Encryption features 5. Compression and deduplication 6. Storage policies
VIRTUAL LOCAL AREA NETWORK - VLAN
GFS WRITE I/O
• The master sends to the client the location of the chunk server replicas and identifies the
primary replica
• The client sends the write data to the buffers of all the replica chunk servers, starting with
the closest. The data is sent through a pipeline
• Once the replicas receive the data, the client tells the primary replica to begin the write
function
• The primary replica writes the data to the appropriate chunk, and then the same is done on the
secondary replicas
• The secondary replicas complete the write function and report back to the primary replica
• Finally, the primary replica sends the confirmation to the client
GFS READ I/O
• Read requests are sent by an external application to the client
• Read requests are sent by the client to the master in order to find out where a particular file
on the system is stored
• The master server replies back with the location of the chunk server acting as the primary
replica holding the chunk
HDFS WRITE I/O
• The client sends a block write request to the NameNode
• The NameNode responds back by telling the client on which DataNodes the file's blocks should be
written
• The HDFS client then contacts the first DataNode directly over TCP and sends a "Ready" command.
The first DataNode in turn sends it to the second DataNode, and the same process continues for the
third DataNode
HDFS READ I/O
• The client seeks and retrieves the blocks directly from the DataNodes where the blocks are placed
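From the application's point of view, this whole pipeline is hidden behind the HDFS client API. Below is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem Java API; it assumes fs.defaultFS already points at the cluster, and the path is hypothetical.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumes fs.defaultFS points at the cluster, e.g. hdfs://namenode:8020
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/tmp/hello.txt"); // hypothetical path
            // Write: the client library handles the NameNode request and the
            // DataNode replication pipeline described above.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: blocks are fetched directly from the DataNodes holding them.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }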
GFS AND HDFS COMPARISON
Design Goal
• The main goal of GFS and HDFS is to support large files
• Both are used for data-intensive computing
• Both store data reliably, even when failures occur within chunk servers/DataNodes
• GFS and HDFS are designed more for batch processing rather than interactive use by users
Implementation
• GFS is a proprietary file system exclusive to Google; it cannot be used by any other company
• HDFS, based on the Apache Hadoop open-source project, can be deployed and used by any company
• GFS uses a master and chunk servers
• HDFS uses a NameNode and DataNodes
Scalability
• Both HDFS and GFS are considered cluster-based architectures
• Each file system runs over machines built from commodity hardware
• Each cluster may consist of thousands of nodes with huge data storage
• Yahoo! has more than 100,000 CPUs in 40,000 computers running Hadoop. Their biggest cluster
contains around 4,500 nodes
File serving
• In GFS, files are divided into fixed-size units called chunks. The chunk size is 64 MB
• In Hadoop, the HDFS file system divides files into units called blocks, 128 MB in size
GFS AND HDFS COMPARISON
Communication
• Transmission Control Protocol (TCP) connections are used for communication in GFS. Pipelining is
used for data transfer over TCP connections
• A Remote Procedure Call (RPC) based protocol on top of TCP/IP is used in HDFS
Cache management
• In GFS, cache metadata are saved in client memory
• HDFS has a DistributedCache of type private or public
Replication strategy
• GFS has two kinds of replicas: primary replicas and secondary replicas
• A primary replica is the data chunk that a chunk server sends to a client
• Secondary replicas serve as backups on two other chunk servers
• HDFS has a rack-based system. By default, two copies of each block are stored by different
DataNodes in the same rack and a third copy is stored on a DataNode in a different rack
Database
• Bigtable is the database used by GFS; Bigtable is a proprietary database developed by Google
using C++
• HBase is the database used by Hadoop; it is developed by Apache as open source using Java