STORAGE IN CLOUD
EVOLUTION OF STORAGE
Direct Attached Storage (DAS)
• The storage device is directly attached to the computer
• A simple example of DAS is an external hard drive connected through a Universal Serial Bus
(USB) cable
• A DAS is designed to be used only by a single computer/server
DAS advantages
• High availability.
• High access rate due to the absence of a storage network (SAN) in between.
• Elimination of network setup complications.
• Storage capacity expansion.
• Data security and fault tolerance.
Disadvantages of DAS
• Dedicated resources are needed for a single computer
• No economies of scale in sharing the storage
• We cannot manage the DAS via a network
• DAS needs a special hardware connection
Network Attached Storage (NAS)
• It is a special purpose device and
comprises hard disks, as well as
management software
• NAS is dedicated 100% to serve files
over a network. In simple terms, NAS
shares storage over a network
• Once connected, you will come across
special folders named ‘Shares’ that can
be accessed over the network
• Multiple user logins can also be
created to provide various levels of
access
• NAS is commonly used in organizations
of all sizes for centralized file storage
and sharing
Advantages of NAS:
• Relatively inexpensive
• A self-contained solution
• Ease of administration
• A wide array of systems and sizes to choose from
• Drive-failure-tolerant storage volumes
• Automatic backup to other devices and the cloud
• Easy to install and configure
• 24/7 and remote data availability
• Universal client access
Disadvantages of NAS:
• Performance depends on the protocol
• Slows down for video applications or multiple large files
• It is file oriented
• Increased LAN traffic
• File transfer speed is not as fast as DAS
• Limited scalability
• Additional input-output processing
• Available features depend on the NAS chip and firmware
• Requires some basic knowledge of computer networking to use efficiently
Storage Area Networks (SAN)
• It is a network of storage devices that can be accessed by multiple servers or computers, providing a shared pool
of storage space
• Each computer on the network can access storage on the SAN as though they were local disks connected directly
to the computer
• Its high I/O processing speeds and low latency make it a good option for high-speed transactional environments
like enterprise databases, high-traffic e-commerce websites, video streaming
Advantages of SAN
• It has economies of scale similar to that of
NAS
• It has higher hardware utilization, similar to
that of NAS
• It has speed similar or comparable to DAS
• It allows virtual environments, cloud
computing, etc.
Disadvantages of SAN
• Its performance is affected by other SAN users
• Its performance is limited by the network if configured incorrectly
• Better raw performance can still be achieved with DAS hardware
• It requires multiple static IP addresses
EVOLUTION OF STORAGE ARCHITECTURE
The storage architecture of a system is a critical component of data transfer and access to vital information
It provides the foundation for data access across an enterprise
Depending on operations and the needs of the business, specific storage architectures might be necessary to enable employees to work to their fullest potential
The setup of a storage architecture can dictate which aspects get prioritized, such as performance, scalability, or cost
STYLES OF STORAGE ARCHITECTURE
1. Multi-Tiered Model
With a multi-tiered data center, HTTP-based applications make good use of separate tiers for
web application and database servers. It allows for distinct separation between the tiers, which
improves security and redundancy. Security-wise, if one tier is compromised, the others are
generally safe with the help of firewalls between them. As for redundancy, if one server goes
down or needs maintenance, other servers in the same tier can keep things moving.
2. Clustered Architecture
In a clustered system, data stays behind a single compute node, and the nodes do not share memory
between them. The input-output (I/O) path is short and direct, and the system's interconnect has
exceedingly low latency. One approach to the clustered architecture model is to layer "federation
models" on top of the cluster to scale it out somewhat. This bounces the I/O around until it reaches
the node that contains the data. These federated layers require additional code to redirect data,
which adds latency to the entire process.
3. Tightly-Coupled Architectures
These architectures distribute data between multiple nodes, running in parallel, and use a grid of
multiple high-availability controllers. They have a significant amount of inter-node communication
and work with several types of operations, but the master node organizes input processing. With
this more complex design, a tightly-coupled architecture requires much more code. However, the
more tightly coupled a storage architecture is, the more predictably it can provide low latency.
While tight coupling improves performance, it can be difficult to add nodes and scale up, which
inevitably adds complexity to the entire system and opens you up to bugs.
4. Loosely Coupled Architectures
• This type of system does not share memory between nodes
• The data is distributed among them with a significant amount of inter-node communication on
writes, which can make it expensive to run. There is more data movement in a loosely coupled
architecture, creating extra I/Os
• This architecture is relatively simple to use and good for distributed reads
• If the data is heavily distributed on writes, you'll see higher latency and less efficiency in I/O
operations
• If you have less distribution, you might get lower latency, but it will reduce the parallelism on
reading
5. Distributed Architectures
• While it may look similar to a loosely coupled architecture, this approach works with
non-transactional data
• It does not share memory between the nodes, and data is distributed across them
• This type of architecture is less common than many others but is used by extremely large
enterprises with very large amounts of storage (petabytes)
• Its parallel processing model and speed make it a great fit for search engines
Elements of Storage Architecture
1. Data Pattern
Depending on the type of work you do, you may have a random or sequential pattern of I/O
requests.
• Random: In a random pattern, the data is written and read at various locations on the disk
platter, which can influence the effectiveness of a RAID system. The controller cache uses
patterns to predict the data blocks it will need to access next for reading or writing. If the
data is random, there is no pattern for it to work from. Another issue with a random pattern
is the increase in seek time. With data spread out across data blocks, the disk head needs to
move each time a piece of information is requested. The arm and disk head physically have
to move there, which can add to the seek time and impact performance.
• Sequential: The sequential pattern works, as you would imagine, in an ordered fashion. It is
more structured and provides predictable data access. With this kind of layout, the RAID
controller can more accurately guess which data blocks will need to be accessed next and
cache that information. It boosts performance and keeps the arm from moving around as
much. These sequential applications are usually built with throughput in mind. You’ll see
sequential patterns with large filetypes, like video and backups, where they are written to
the drive in continuous blocks.
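As a rough illustration of the two access patterns described above, the following Java sketch reads 4 KB blocks from a large pre-existing test file, first sequentially and then at random offsets. It is an assumption-laden micro-benchmark, not a rigorous test: the file name testdata.bin, the block size, and the request count are all hypothetical, and the file is assumed to be large enough for the reads.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.Random;

    public class IoPatternSketch {
        private static final int BLOCK = 4096;      // 4 KB per request
        private static final int REQUESTS = 10_000; // hypothetical workload size

        public static void main(String[] args) throws IOException {
            byte[] buf = new byte[BLOCK];
            Random rnd = new Random(42);
            // Assumes a large pre-existing test file; the name is hypothetical.
            try (RandomAccessFile file = new RandomAccessFile("testdata.bin", "r")) {
                long blocks = file.length() / BLOCK;

                // Sequential pattern: consecutive blocks, so the disk head and
                // read-ahead cache stay in step with the requests.
                long t0 = System.nanoTime();
                for (int i = 0; i < REQUESTS; i++) {
                    file.seek((i % blocks) * BLOCK);
                    file.readFully(buf);
                }
                long seqNs = System.nanoTime() - t0;

                // Random pattern: unpredictable offsets force seeks and defeat
                // read-ahead, which is what hurts spinning disks the most.
                t0 = System.nanoTime();
                for (int i = 0; i < REQUESTS; i++) {
                    file.seek((long) (rnd.nextDouble() * blocks) * BLOCK);
                    file.readFully(buf);
                }
                long randNs = System.nanoTime() - t0;

                System.out.printf("sequential: %d ms, random: %d ms%n",
                        seqNs / 1_000_000, randNs / 1_000_000);
            }
        }
    }

On a mechanical drive the random pass is typically far slower; on an SSD the gap largely disappears, which is the point of the drive comparison that follows.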
Here's how the drive types compare:
SATA (Serial Advanced Technology Attachment): SATA drives have relatively large disk platters
that can struggle with random workloads due to their slow rotational speed. The large platter size
can cause longer seek times.
SAS (Serial Attached SCSI): These drives have smaller platters with faster speeds. They can cut
the seek time down significantly.
SSD (Solid State Drive): SSDs are excellent for extremely high-performance workloads. They have
no moving parts, so seek times are almost nonexistent.
2. Layers
In data center storage architecture, you’ll typically see several layers of hardware that serve
separate functions. These layers typically include the:
Core layer: This first layer creates the high-speed packet switching necessary for data transfer. It
connects to many aggregation modules and uses a redundant design.
Aggregation layer: The aggregation layer is the place where traffic flows through and encounters
services like a firewall, network analysis, intrusion detection and more.
Access layer: This layer is where the servers and network physically link up. It involves switches,
cabling and adapters to get everything connected and allow users to access the data.
3. Performance vs. Capacity
Disk drive capabilities are always changing. Disk capacity used to be so low that SAN systems
didn't have to worry about whether the number of disks created enough IOPS per gigabyte (GB);
there were plenty. Nowadays, a SATA-based array can offer the same total capacity as a SAS-based
array while using significantly fewer disks, and fewer disks mean fewer IOPS generated per GB.
If your work involves a lot of random I/O interactions or extreme demand, using SATA disks can
quickly cause your IOPS to bottleneck before you reach capacity. One option here is to front the
disks with a solid-state cache, which can greatly improve random I/O performance.
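A quick back-of-the-envelope calculation makes the IOPS-per-GB trade-off concrete. The drive counts, capacities, and per-drive IOPS figures below are hypothetical round numbers chosen only for illustration; real values vary by model.

    public class IopsPerGb {
        public static void main(String[] args) {
            // Two hypothetical arrays with roughly the same usable capacity:
            // many small, fast SAS spindles vs. few large, slow SATA spindles.
            double sasDisks = 48, sasIopsPerDisk = 180, sasGbPerDisk = 1200;   // 48 x 1.2 TB
            double sataDisks = 12, sataIopsPerDisk = 80, sataGbPerDisk = 4800; // 12 x 4.8 TB

            double sasIopsPerGb = (sasDisks * sasIopsPerDisk) / (sasDisks * sasGbPerDisk);
            double sataIopsPerGb = (sataDisks * sataIopsPerDisk) / (sataDisks * sataGbPerDisk);

            // Same capacity, far fewer spindles: IOPS per GB drops sharply,
            // so a random workload bottlenecks long before capacity runs out.
            System.out.printf("SAS : %.3f IOPS/GB%n", sasIopsPerGb);
            System.out.printf("SATA: %.3f IOPS/GB%n", sataIopsPerGb);
        }
    }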
• Evaluate cost from the outset: Keeping cost in mind as you design from the ground up allows you to make
realistic decisions that will work in the long term. You wouldn’t want to end up with an architecture that needs to
be reorganized right away because upkeep is too expensive or it doesn’t meet the company’s needs. Be realistic
about the costs of a storage architecture so it fits within the business budget.
• Find areas where you can compromise: You won’t be able to prioritize everything. In many instances, focusing on
one aspect will hurt the quality of another. A high-performance system will be costly and could be less scalable. A
scalable system might require more skilled administration and could lose speed. Talk with stakeholders about
what aspects are necessary for the system and why so you can evaluate possible trade-offs with business needs in
mind.
• Work in phases: Your first draft is not going to be the same as the final. As you work through the project, you will
encounter specific challenges and learn more about the technical details of your system. Try not to lock yourself
into a plan and allow the architecture to change organically as you uncover more information.
• Identify your needs first: While it may be tempting to dive right into the specific components that you want to
use, identifying more abstract requirements is an excellent way to start. Think about the state of your data, what
formats you’ll be working with and how you want it to communicate with the server. Try to develop as much
information about the required tasks as you can. This approach allows you to work your way down the chain and
find solutions that match the needs of more than one operation.
CLOUD STORAGE ARCHITECTURE
Cloud storage is a digital storage solution that utilizes multiple servers to store data in logical
pools. Organizations buy storage capacity from providers to store user, organization, or
application data
TYPES OF CLOUD STORAGE
Personal Cloud Storage: A subset of public cloud storage that stores an individual's data in the
cloud and provides the individual with access to that data from anywhere. It also provides data
syncing and sharing across multiple devices
Public Cloud Storage: The enterprise and the storage service provider are separate, and there
aren't any cloud resources stored in the enterprise's data center. The cloud storage provider
fully manages the enterprise's public cloud storage
Private Cloud Storage: The enterprise and the cloud storage provider are integrated in the
enterprise's data center. Private cloud storage helps address potential security and performance
concerns while still offering the advantages of cloud storage
Hybrid Cloud Storage: A combination of public and private cloud storage where critical data is
stored in the enterprise's private cloud while other data is stored in the public cloud
CLOUD STORAGE PROVIDERS
• SpiderOak:
Founded in 2007, SpiderOak is a collaboration tool, file hosting and online backup service. It
allows users to access, synchronize and share data using a cloud-based server
The main focus of SpiderOak is on privacy and security, with private encryption and a clear
privacy policy
• Tresorit:
Founded in 2011, Tresorit is a cloud storage provider based in Hungary and Switzerland. It
emphasizes enhanced security and data encryption for businesses and personal users
It lets you keep control of your files through 'zero-knowledge' end-to-end encryption, which means
only you and the chosen few you decide to share with can see your data
• Egnyte:
Founded in 2007, Egnyte provides software for enterprise file synchronization and
sharing. It allows businesses to store their data locally and online
ADVANTAGES OF CLOUD STORAGE
File Accessibility – The files can be accessed at any time from any place as long as you have
Internet access
Offsite Backup – Cloud storage provides organizations with offsite (remote) backups of data,
which in turn reduces costs
Effective Use of Bandwidth – Cloud storage uses bandwidth effectively, i.e. instead of sending
files to recipients, a web link can be sent through email
Security of Data – Helps protect the data against ransomware or malware, as it is secured and
needs proper authentication to access
DISADVANTAGES OF CLOUD STORAGE
• Dependency on Internet Speed – If the Internet connection is slow or unstable, we might have
problems accessing or sharing the files
• Dependency on a Third Party – A third-party service provider (company) is responsible for the
data stored, so it is important to examine a vendor's security standards before investing
• High Cost for Huge Data – Organizations that require a large amount of storage may find costs
increase significantly after the first few gigabytes of data stored
• No/Minimal Control over Data Storage Framework – Since the cloud storage framework is entirely
managed and monitored by the service provider, the customer has minimal control over it
BIG DATA
A collection of data sets that are very complex and large, and very difficult to process and store
using traditional data processing applications or database management tools, is called Big Data
How does big data work?
Integrate
• Big data brings together data from many disparate sources and applications
• During integration, you need to bring in the data, process it, and make sure it's formatted and
available in a form that your business analysts can get started with
Manage
• Big data requires storage, which can be in the cloud, on premises, or both
• You can store your data in any form you want and bring your desired processing requirements and
necessary process engines to those data sets on an on-demand basis
Analyze
• Investment in big data pays off when you analyze and act on your data
• Get new clarity with visual analysis, and explore the data further to make new discoveries
• Build data models with machine learning and artificial intelligence
CHARACTERISTICS OF BIG DATA
1. Volume: the size and amounts of big data that companies manage and analyze
2. Value: the value of big data usually comes from insight discovery and pattern recognition
that lead to more effective operations, stronger customer relationships and other clear
and quantifiable business benefits
3. Variety: the diversity and range of different data types, including unstructured data, semi-
structured data and raw data
4. Velocity: the speed at which companies receive, store and manage data – e.g., the
specific number of social media posts or search queries received within a day, hour or
other unit of time
5. Veracity: the “truth” or accuracy of data and information assets, which often determines
executive-level confidence
6. Variability: the changing nature of the data companies seek to capture, manage and
analyze – e.g., in sentiment or text analytics, changes in the meaning of key words or
phrases
DATA FORMATS
Unstructured
• Data that has no predefined structure and is not easy to analyze
• This type of data has an unknown schema, such as video files or audio files
Semi-Structured
• Data in which some parts are structured and some are not. It does not have a fixed format;
examples include JSON and XML
Structured
• The best type of data in terms of structure. The data is wholly organized with a fixed schema,
as in an RDBMS, which makes it easier to process and analyze
BIG DATA USE CASES
1. Product development
2. Predictive maintenance
3. Customer experience
4. Fraud and compliance
5. Machine learning
6. Drive innovation
7. Operational efficiency
HADOOP
Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to
maintain and store big-size data. Hadoop works on the MapReduce programming algorithm that was
introduced by Google
Core components:
MapReduce
HDFS (Hadoop Distributed File System)
YARN (Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MAPREDUCE
• MapReduce is the data processing layer of Hadoop
• MapReduce runs applications in parallel on a cluster of low-end machines in a reliable and
fault-tolerant manner
• A MapReduce job comprises a number of map tasks and reduce tasks
• Each task works on a part of the data. This distributes the load across the cluster
• The function of map tasks is to load, parse, transform and filter data
• Each reduce task works on a subset of the output from the map tasks. The reduce task applies
grouping and aggregation to this intermediate data from the map tasks
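A minimal word-count sketch using the standard Hadoop MapReduce Java API illustrates the map and reduce roles described above. The class and field names are illustrative, and the job driver/configuration code is omitted.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map task: parse each input line and emit (word, 1) pairs.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce task: receive all counts for one word (a subset of the map
        // output) and aggregate them into a single total.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }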
2. HDFS
• HDFS stands for Hadoop Distributed File System
• It is the data storage layer of Hadoop
• HDFS splits the data unit into smaller units called blocks and stores them in a distributed
manner
• HDFS has a master-slave architecture
• A daemon called the NameNode runs on the master server
• It is responsible for namespace management and regulates file access by clients
• The DataNode daemon runs on the slave nodes
• It is responsible for storing the actual business data. Internally, a file gets split into a
number of data blocks and stored on a group of slave machines
FAULT TOLERANCE AND RACK AWARENESS
• The default block size is 128 MB, and it can be configured (e.g. to 256 MB)
• To provide fault tolerance, HDFS uses a replication technique
• It makes copies of the blocks and stores them on different DataNodes
• The replication factor is 3 by default, but we can configure it to any value
• A rack contains many DataNode machines, and there are several such racks in a production cluster
• HDFS follows a rack awareness algorithm to place the replicas of the blocks in a distributed fashion
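The block size and replication factor mentioned above correspond to the standard HDFS configuration properties dfs.blocksize and dfs.replication. A minimal sketch of setting them programmatically through the Hadoop Java API follows; it assumes fs.defaultFS already points at a cluster, and the file path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSettings {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Standard HDFS property names; the values here are the usual defaults.
            conf.setInt("dfs.replication", 3);                 // copies of each block
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks

            // The replication factor of an existing file can also be changed per file.
            FileSystem fs = FileSystem.get(conf);
            fs.setReplication(new Path("/data/example.txt"), (short) 2); // hypothetical path
            fs.close();
        }
    }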
3. YARN
• YARN, or Yet Another Resource Negotiator, is the resource management layer of Hadoop
• The basic principle behind YARN is to separate the resource management and job
scheduling/monitoring functions into separate daemons
• In YARN there is one global ResourceManager and a per-application ApplicationMaster
• The ResourceManager arbitrates resources among all the competing applications in the system
• The job of the NodeManager is to monitor the resource usage of the containers and report the
same to the ResourceManager
VIRTUAL DATA CENTRE
A virtual data center is a pool or collection of
cloud infrastructure resources specifically
designed for enterprise business needs
The basic resources are the processor (CPU),
memory (RAM), storage (disk space) and
networking (bandwidth)
It is a virtual representation of a physical data
center, complete with servers, storage clusters and
lots of networking components, all of which reside
in virtual space being hosted by one or more actual
data centers
https://www.vmware.com/in/products/datacenter-
virtualization.html
VIRTUALIZATION TECHNIQUES
• Binary translation and native execution: This technique uses a combination of binary translation
for handling privileged and sensitive instructions and direct execution for user-level
instructions. It is very efficient both in terms of performance and in terms of compatibility with
the guest OS, which does not need to know that it is virtualized.
• OS-assisted virtualization (paravirtualization): In this technique, the guest OS is modified to
be virtualization-aware. Paravirtualization can significantly reduce virtualization overhead, but
the overhead introduced by the hypercalls can affect performance under heavy workloads.
• Hardware-assisted virtualization: As an alternative to binary translation and an attempt to
enhance performance and compatibility, privileged and sensitive calls are handled by a type 1
hypervisor. This eliminates the need for binary translation, since privileged instructions trap to
the hypervisor at the hardware level, which significantly improves performance.
Cloud providers utilize different virtualization platforms to build their data centers. They use
platforms that combine paravirtualization and hardware-assisted virtualization to benefit from the
advantages of both.
VIRTUAL MACHINE COMPONENTS
• Virtual hardware: When a virtual machine is created, a default set of virtual hardware is
assigned to it. Every virtual machine has CPU, memory, and disk resources
• Operating system: A guest operating system is installed on a virtual machine just as you would
install an operating system on a physical computer. After installation, it needs to be secured and
patched periodically
• Applications and services: The VM provider supplies different utility tools and applications
that enhance the performance of the VM and improve management
PHYSICAL TO VM CONVERSION
Steps for Oracle VM VirtualBox:
Start the Oracle VM VirtualBox Manager wizard for a conversion
Click on the New button
Enter the name of the VM and the OS type and version
Select the RAM size
Create a hard disk
Select the hard disk file type as VDI
Set storage on the physical hard disk as dynamically allocated
Choose the file location and size
Click Start and select the installable media
STORAGE VIRTUALIZATION
Storage virtualization requires a multi-level approach
STORAGE VIRTUALIZATION CHALLENGES
Scalability
• Ensure storage devices can meet performance requirements as capacity scales
Functionality
• The virtualized environment must provide the same or better functionality
• Must continue to leverage existing functionality on arrays
Manageability
• The virtualization device breaks the end-to-end view of the storage infrastructure
• Must integrate with existing management tools
Support
• Interoperability in a multi-vendor environment
BLOCK LEVEL STORAGE VIRTUALIZATION
THIN AND THICK PROVISIONING
• Thin provisioning allocates storage on demand
• Thick provisioning pre-allocates storage space in anticipation of future capacity requirements
• Thick provisioning of storage allocation leads to idle storage space, inefficient storage
utilization, and unnecessary costs of infrastructure and maintenance
• Thin and thick provisioning are methods of storage allocation; the difference lies in how the
Logical Unit Number (LUN) is configured
• Thick provisioning creates LUNs and assigns them to specific hosts or applications. The LUN
reserves a specific portion of the storage space for the assigned host or application
• A drawback of thick-provisioned LUNs is that once provisioned they cannot be scaled up; to add
more storage to a thick-provisioned LUN, it has to be recreated
• Thin-provisioned LUNs can be configured to larger sizes than the actual available physical
storage capacity; thin LUNs continue to operate and scatter data all over the available storage
space, while the array keeps track of it
• LUNs can easily scale up whenever necessary in thin provisioning. Thin provisioning speeds up
storage provisioning, simplifies management and reduces costs by enhancing storage utilization
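As a loose analogy only (not how a storage array actually implements LUNs), a sparse file behaves like a thin-provisioned volume while a fully written file behaves like a thick-provisioned one. The sketch below assumes a file system that supports sparse files; the file names and sizes are arbitrary.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class ProvisioningSketch {
        public static void main(String[] args) throws IOException {
            // "Thin" style: declare a large logical size without writing data.
            // On sparse-file-capable file systems, little physical space is used
            // until blocks are actually written.
            try (RandomAccessFile thin = new RandomAccessFile("thin.img", "rw")) {
                thin.setLength(10L * 1024 * 1024 * 1024); // 10 GB logical size
            }

            // "Thick" style: pre-allocate by writing every block up front,
            // consuming the full physical capacity immediately.
            byte[] zeros = new byte[1024 * 1024];
            try (RandomAccessFile thick = new RandomAccessFile("thick.img", "rw")) {
                for (int i = 0; i < 1024; i++) { // 1 GB written in 1 MB chunks
                    thick.write(zeros);
                }
            }
        }
    }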
VIRTUAL STORAGE AREA NETWORK - VSAN
• A virtual storage area network (VSAN) is a logical partitioning created within a physical storage
area network
• This implementation model of a storage virtualization technique divides and allocates some or an
entire storage area network into one or more logical SANs to be used by internal or external IT
services and solutions
Key features: 1. It is software defined 2. Simple configuration 3. Flexibility
4. Encryption features 5. Compression and deduplication 6. Storage policies
VIRTUAL LOCAL AREA NETWORK - VLAN
GFS WRITE I/O
• The master sends to the client the location of the chunk server replicas and identifies the
primary replica
• The client sends the write data to the buffers of all the replica chunk servers, starting with
the closest. The data is sent through a pipeline
• Once the replicas receive the data, the client tells the primary replica to begin the write
function
• The primary replica writes the data to the appropriate chunk, and then the same is done on the
secondary replicas
• The secondary replicas complete the write function and report back to the primary replica
• Finally, the primary replica sends the confirmation to the client
GFS READ I/O
• Read requests are sent by an external application to the client
• Read requests are sent by the client to the master in order to find out where a particular file
on the system is stored
• The master server replies back with the location of the chunk server acting as the primary
replica holding the chunk
HDFS WRITE I/O
• The client sends a block write request to the NameNode
• The NameNode responds back by telling the client on which DataNodes the file's blocks should be
written
• The HDFS client then contacts the first DataNode directly over TCP and sends a "Ready" command.
The first DataNode in turn sends it to the second DataNode, and the same process continues for the
third DataNode
HDFS READ I/O
• The client seeks and retrieves the blocks directly from the DataNodes where the blocks are placed
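From the application's point of view, this whole pipeline is hidden behind the HDFS client API. Below is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem Java API; it assumes fs.defaultFS already points at the cluster, and the path is hypothetical.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumes fs.defaultFS points at the cluster, e.g. hdfs://namenode:8020
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/tmp/hello.txt"); // hypothetical path
            // Write: the client library handles the NameNode request and the
            // DataNode replication pipeline described above.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: blocks are fetched directly from the DataNodes holding them.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }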
GFS AND HDFS COMPARISON
Design Goal
• The main goal of GFS and HDFS is to support large files
• Both are used for data-intensive computing
• Both store data reliably, even when failures occur within chunk servers/DataNodes
• GFS and HDFS are designed more for batch processing rather than interactive use by users
Implementation
• GFS is a proprietary file system exclusive to Google; it cannot be used by any other company
• HDFS, based on the Apache Hadoop open-source project, can be deployed and used by any company
• GFS uses a master and chunk servers
• HDFS uses a NameNode and DataNodes
Scalability
• Both HDFS and GFS are considered cluster-based architectures
• Each file system runs over machines built from commodity hardware
• Each cluster may consist of thousands of nodes with huge data storage
• Yahoo! has more than 100,000 CPUs in 40,000 computers running Hadoop. Their biggest cluster
contains around 4,500 nodes
File serving
• In GFS, files are divided into fixed-size units called chunks. The chunk size is 64 MB
• In Hadoop, the HDFS file system divides files into units called blocks, 128 MB in size
GFS AND HDFS COMPARISON
Communication
• Transmission Control Protocol (TCP) connections are used for communication in GFS. Pipelining is
used for data transfer over TCP connections
• A Remote Procedure Call (RPC) based protocol on top of TCP/IP is used in HDFS
Cache management
• In GFS, cache metadata are saved in client memory
• HDFS has a DistributedCache of type private or public
Replication strategy
• GFS has two kinds of replicas: primary replicas and secondary replicas
• A primary replica is the data chunk that a chunk server sends to a client
• Secondary replicas serve as backups on two other chunk servers
• HDFS has a rack-based system. By default, two copies of each block are stored by different
DataNodes in the same rack and a third copy is stored on a DataNode in a different rack
Database
• Bigtable is the database used by GFS; Bigtable is a proprietary database developed by Google
using C++
• HBase is the database used by Hadoop; it is developed by Apache as open source using Java