
Chapter – 4

Cloud Storage
Outlines

 Introduction  Applications of Hadoop


 Data intensive Computing MapReduce
 Challenges  High-performance
 Big data distributed file systems and
cloud storage
 Technologies for data-
intensive computing  AWS ecosystem
 Map-Reduce framework  AWS Storage services
 Phases of MapReduce  Microsoft Storage services
 Benefits of Hadoop
MapReduce
Introduction
 Cloud storage is a service model in which data is maintained,
managed and backed up remotely and made available to users over
a network (typically the Internet).
 Traditionally, distributed databases have been considered the
natural evolution of database management systems as the scale of
the datasets becomes unmanageable with a single system.
 Distributed databases are a collection of data stored at different
sites of a computer network.
 Each site might expose a degree of autonomy, providing services
for the execution of local applications, but also participating in the
execution of a global application.
Contd…
 A distributed database can be created by splitting and
scattering the data of an existing database over different
sites or by federating together multiple existing databases.
 Due to the explosion of unstructured data in the form of
blogs, Web pages, software logs, and sensor readings, the
relational model in its original formulation does not seem
to be the preferred solution for supporting data analytics on
a large scale.
Data intensive computing
 Data-intensive computing is concerned with production,
manipulation, and analysis of large-scale data in the range of
hundreds of megabytes (MB) to petabytes (PB) and beyond.
 Data-intensive computations occur in many application
domains.
 People conducting scientific simulations and experiments are
often keen to produce, analyze, and process huge volumes of
data. Hundreds of gigabytes of data are produced every
second by telescopes mapping the sky; the collection of
images of the sky easily reaches the scale of petabytes over a
year.
Contd…
 Bioinformatics applications mine databases that may end
up containing terabytes of data.
 Earthquake simulators process a massive amount of data,
which is produced as a result of recording the vibrations of
the Earth across the entire globe.
 Customer data for any telecom company would easily be in
the range of 10-100 terabytes. This volume of information
is not only processed to generate billing statements, but it
is also mined to identify scenarios, trends, and patterns that
help these companies provide better service.
Challenges
 The huge amount of data produced, analyzed, or stored imposes
requirements on the supporting infrastructures and middleware that
are hardly found in the traditional solutions for distributed
computing.
 Open challenges for traditional computing solutions include:
  The location of data is crucial, as the need to move terabytes of data becomes an obstacle to high-performing computations.
  Scalable algorithms that can search and process massive datasets.
  New metadata management technologies that can scale to handle complex, heterogeneous, and distributed data sources.
  Flexible and high-performance software integration techniques that facilitate the combination of software modules running on different platforms.
Contd…
 Large datasets have mostly been the domain of scientific
computing. This scenario has recently started to change as
massive amounts of data are being produced, mined, and
crunched by companies that provide Internet services such
as searching, online advertising, and social media.
 It is critical for such companies to efficiently analyze these
huge datasets because they constitute a precious source of
information about their customers.
Big Data
 Together with the dissemination of cloud computing
technologies that support data-intensive computations, the
term Big Data has become popular.
 The term Big Data applies to datasets of which the size is
beyond the ability of commonly used software tools to
capture, manage, and process within a tolerable elapsed
time.
 Relational databases become ineffective for that amount of
information, instead requiring “massively parallel software
running on tens, hundreds, or even thousands of servers".
Contd…
Cloud technologies support data-intensive computing in several
ways:
 By providing a large amount of compute instances on
demand, which can be used to process and analyze large
datasets in parallel.
 By providing a storage system optimized for keeping large
blocks of data and other distributed data store architectures.
 By providing frameworks and programming APIs optimized
for the processing and management of large amounts of data.
These APIs are mostly coupled with a specific storage
infrastructure to optimize the overall performance of the
system.
Technologies for data-intensive computing
 Data-intensive computing concerns the development of
applications that are mainly focused on processing large
quantities of data. Therefore, storage systems and programming
models constitute a natural classification of the technologies
supporting data-intensive computing
 Research on databases and the data management industry are
definitely at a turning point, and new opportunities arise. Some
factors contributing to this change are:
 Growing popularity of Big Data.
 Growing importance of data analytics in the business chain.
 Presence of data in several forms, not only structured.
 New approaches and technologies for computing.


Contd…

 Cloud technologies for Big Data
 MapReduce framework in Hadoop
Map-Reduce framework
 It is a software framework that enables you to write applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable and fault-tolerant manner.
 It was originally introduced by Google.
 A MapReduce job usually splits the input data set into
independent chunks, which are processed by the map
tasks in a completely parallel manner.
 The framework takes care of scheduling tasks,
monitors them, and re-executes the failed tasks.
Contd…
 There are two primary tasks in MapReduce: map and reduce.
 In the map job, the input dataset is split into chunks.
 Map tasks process these chunks in parallel.
 The outputs of the map tasks are used as inputs for the reduce tasks.
 Reducers process the intermediate data from the maps into a smaller set of tuples, which forms the final output of the framework.
Phases of MapReduce
The MapReduce program is executed in three main phases:
mapping, shuffling, and reducing.
Mapping Phase
 This is the first phase of the program. There are two steps in
this phase: splitting and mapping.
 A dataset is split into equal units called chunks (input splits)
in the splitting step.
 Hadoop provides a RecordReader that uses TextInputFormat to transform input splits into key-value pairs.
 The key-value pairs are then used as inputs in the mapping step. The mapping step applies user-defined coding logic to these data blocks.
Contd…
Shuffling phase
 This is the second phase that takes place after the
completion of the Mapping phase. It consists of two main
steps: sorting and merging.
 In the sorting step, the key-value pairs are sorted using the
keys.
 Merging ensures that key-value pairs are combined.
 The shuffling phase facilitates the removal of duplicate values and the grouping of values.
 Different values associated with the same key are grouped together.
 The output of this phase is again a set of keys and values, just like in the Mapping phase.
Contd…
Reducer phase
 In the reducer phase, the output of the shuffling
phase is used as the input.
 The reducer processes this input further to reduce
the intermediate values into smaller values.
 It provides a summary of the entire dataset.
 The output from this phase is stored in HDFS (Hadoop Distributed File System).
Example on word count…
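The classic word-count example illustrates the three phases. What follows is a minimal, single-process Python sketch that only mimics mapping, shuffling, and reducing on two hypothetical input splits; a real Hadoop job would instead implement the Mapper and Reducer classes of the MapReduce API (typically in Java) or use Hadoop Streaming.

from itertools import groupby

# Mapping phase: each input split is turned into (word, 1) key-value pairs
def mapper(split):
    for word in split.strip().lower().split():
        yield (word, 1)

# Shuffling phase: sort the intermediate pairs and group the values by key
def shuffle(pairs):
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [count for _, count in group]

# Reducing phase: collapse the grouped values into a single count per word
def reducer(key, counts):
    return key, sum(counts)

# Two hypothetical input splits standing in for blocks of a large text file
splits = ["cloud storage in the cloud", "big data in the cloud"]

intermediate = [pair for split in splits for pair in mapper(split)]
word_counts = dict(reducer(key, counts) for key, counts in shuffle(intermediate))
print(word_counts)  # {'big': 1, 'cloud': 3, 'data': 1, 'in': 2, 'storage': 1, 'the': 2}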
Benefits of Hadoop MapReduce
 Speed: MapReduce can process huge amounts of unstructured data in a short time.
 Fault-tolerance: The MapReduce framework can
handle failures.
 Parallel Processing: In MapReduce, multiple job parts of the same dataset can be processed in parallel. This reduces the time taken to complete a task.
Applications of Hadoop MapReduce
The MapReduce framework can be programmed in popular languages such as Java, Scala, and Python. Practical applications of the MapReduce program include:
 E-commerce
 E-commerce companies such as Walmart, eBay, and Amazon use MapReduce to analyze buying behavior.
 MapReduce provides meaningful information that is used as the basis for developing product recommendations. Some of the information used includes site records, e-commerce catalogs, purchase history, and interaction logs.
Contd…
 Social networks
 The MapReduce programming tool can evaluate certain
information on social media platforms such as
Facebook, Twitter, and LinkedIn.
 Entertainment
 Netflix uses MapReduce to analyze the clicks and logs of online customers. This information helps the company suggest movies based on customers’ interests and behavior.
High-performance distributed file
systems and cloud storage
 Distributed file systems constitute the primary support for data management. They provide an interface through which to store information in the form of files and later access it for read and write operations.
 Examples:
 Lustre
 The Lustre file system is a massively parallel
distributed file system that covers the needs of a small
workgroup of clusters to a large-scale computing
cluster.
 It is designed to provide access to petabytes (PBs) of
storage to serve thousands of clients.
Cont’d
 IBM General Parallel File System (GPFS)
 GPFS is the high-performance distributed file system developed by IBM that provides support for supercomputers and Linux computing clusters.
 Google File System
 GFS is the storage infrastructure that supports the
execution of distributed applications in Google’s
computing cloud. The system has been designed to be a
fault tolerant, highly available, distributed file system built
on commodity hardware and standard Linux operating
systems.
 Amazon Simple Storage Service (S3)
AWS ecosystem
AWS Storage services
 AWS provides a collection of services for data storage and
information management.
 The core service in this area is represented by Amazon Simple
Storage Service (S3). This is a distributed object
store that allows users to store information in different formats.
 S3 has two core components: buckets and objects.
 Buckets represent virtual containers in which to store objects;
 Objects represent the content that is actually stored.
 Objects can also be enriched with metadata that can be
used to tag the stored content with additional information.
Contd….
Buckets
 A bucket is a container of objects.
 It can be thought of as a virtual drive hosted on the S3
distributed storage, which provides users with a flat store
to which they can add objects.
 Buckets are top level elements of the S3 storage
architecture and do not support nesting. That is, it is not
possible to create “sub buckets” or other kinds of physical
divisions.
Contd……
Objects and metadata
 Objects constitute the content elements stored in S3.
 An object is identified by a name that needs to be unique
within the bucket in which the content is stored.
 Metadata are both system and user defined: the first ones are
used by S3 to control the interaction with the object, whereas the
second ones are meaningful to the user, who can store up to 2
KB per metadata property represented by a key-value pair of
strings.
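As a concrete illustration, the sketch below shows how buckets, objects, and user-defined metadata might be used from Python with the boto3 SDK. The bucket name, object key, and metadata values are hypothetical, and AWS credentials are assumed to be configured in the environment.

import boto3  # AWS SDK for Python; credentials assumed to be configured

s3 = boto3.client("s3")

# Create a bucket: a flat, top-level container (no nesting of "sub-buckets");
# outside us-east-1 a CreateBucketConfiguration would also be required
s3.create_bucket(Bucket="example-chapter4-bucket")  # hypothetical, globally unique name

# Store an object and tag it with user-defined metadata (key-value strings)
s3.put_object(
    Bucket="example-chapter4-bucket",
    Key="notes/chapter4.txt",
    Body=b"Cloud storage lecture notes",
    Metadata={"author": "student", "topic": "cloud-storage"},
)

# Retrieve the object; the user metadata comes back with the response
response = s3.get_object(Bucket="example-chapter4-bucket", Key="notes/chapter4.txt")
print(response["Metadata"], response["Body"].read())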
Amazon elastic block store
 The Amazon Elastic Block Store (EBS) allows AWS
users to provide EC2 instances with persistent storage in
the form of volumes that can be mounted at instance
startup.
 The content of an EBS volume survives the instance life
cycle and is persisted into S3.
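A minimal sketch of how an EBS volume might be created, attached, and snapshotted with boto3; the Availability Zone, instance ID, and sizes are hypothetical placeholders.

import boto3  # credentials assumed to be configured

ec2 = boto3.client("ec2")

# Create a 10 GiB volume in the same Availability Zone as the target instance
volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=10, VolumeType="gp2")

# Attach it to a (hypothetical) EC2 instance as an additional block device
ec2.attach_volume(VolumeId=volume["VolumeId"],
                  InstanceId="i-0123456789abcdef0",  # hypothetical instance ID
                  Device="/dev/sdf")

# Snapshots make the volume's content durable; snapshot data is persisted into S3
ec2.create_snapshot(VolumeId=volume["VolumeId"], Description="chapter-4 demo snapshot")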
Amazon ElastiCache
 ElastiCache is an implementation of an elastic in-memory cache based on a cluster of EC2 instances.
 It provides fast data access from other EC2 instances
through a Memcached-compatible protocol so that
existing applications based on such technology do not
need to be modified and can transparently migrate to
ElastiCache.
 ElastiCache is based on a cluster of EC2 instances
running the caching software, which is made available
through Web services.
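Because ElastiCache speaks the Memcached protocol, any existing Memcached client can simply be pointed at the cluster endpoint. The sketch below uses the pymemcache Python library; the endpoint name and cached key are hypothetical.

from pymemcache.client.base import Client  # any Memcached-compatible client works

# Hypothetical ElastiCache cluster endpoint; 11211 is the default Memcached port
cache = Client(("my-cache.abc123.use1.cache.amazonaws.com", 11211))

# Store a value with a 5-minute expiry and read it back from another EC2 instance
cache.set("user:42:profile", b'{"name": "Alice"}', expire=300)
print(cache.get("user:42:profile"))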
Structured storage solutions
 Enterprise applications quite often rely on databases to
store data in a structured form, index, and perform
analytics against it.
 Traditionally, RDBMS have been the common data
back-end for a wide range of applications, even though
recently more scalable and lightweight solutions have
been proposed.
 Amazon provides applications with structured storage services in three different forms: preconfigured EC2 AMIs, Amazon Relational Database Service (RDS), and Amazon SimpleDB.
Contd….
Preconfigured EC2 AMIs
 Preconfigured EC2 AMIs are predefined templates featuring
an installation of a given database management system.
 EC2 instances created from these AMIs can be completed with
an EBS volume for storage persistence.
 Available AMIs include installations of IBM DB2, Microsoft
SQL Server, MySQL, Oracle, PostgreSQL, Sybase, and
Vertica.
Contd….
Amazon RDS
 RDS is a relational database service that relies on the EC2 infrastructure and is managed by Amazon.
 Developers do not have to worry about configuring
the storage for high availability, designing failover
strategies, or keeping the servers up-to-date with
patches. Moreover, the service provides users with
automatic backups, snapshots, point-in-time
recoveries, and facilities for implementing
replications.
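A hedged sketch of provisioning a managed MySQL instance with boto3; all identifiers, credentials, and sizes are hypothetical placeholders.

import boto3  # credentials assumed to be configured

rds = boto3.client("rds")

# Launch a managed MySQL instance; Amazon handles patching, failover, and backups
rds.create_db_instance(
    DBInstanceIdentifier="chapter4-demo-db",  # hypothetical identifier
    DBInstanceClass="db.t3.micro",
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me-please",    # placeholder credential
    AllocatedStorage=20,                      # storage size in GiB
    BackupRetentionPeriod=7,                  # keep automatic backups for 7 days
)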
Contd….
Amazon SimpleDB
 Amazon SimpleDB is a lightweight, highly scalable,
and flexible data storage solution for applications that
do not require a fully relational model for their data.
 SimpleDB provides support for semi-structured data,
the model for which is based on the concept of
domains, items, and attributes.
Amazon CloudFront
 CloudFront is an implementation of a content
delivery network on top of the Amazon distributed
storage infrastructure.
 It leverages a collection of edge servers strategically
located around the globe to better serve requests for
static and streaming Web content so that the transfer
time is reduced as much as possible.
 The content that can be delivered through CloudFront is static (HTTP and HTTPS) or streaming (Real-Time Messaging Protocol, or RTMP).
Microsoft Azure Storage services
 Windows Azure provides different types of storage
solutions that complement compute services with a more
durable and redundant option compared to local storage.
 Compared to local storage, these services can be
accessed by multiple clients at the same time and
from everywhere, thus becoming a general solution
for storage.
 This service is used to store data over the cloud that
can be scaled according to the requirements.
Contd….
 It includes Microsoft Azure Storage (Blob, Queue, Table, and Azure File
services).
Blobs
 Azure allows storing large amounts of data in the form of binary large objects (BLOBs) by means of the Blob service. This service is optimal for storing large text or binary files.
 It is Microsoft's object storage solution for the cloud. Blob storage is
optimized for storing a massive amount of unstructured data, such as text or
binary data.
Two types of blobs are available:
Block blobs: Block blobs are composed of blocks and are optimized for sequential access; therefore they are appropriate for media streaming.
Contd….
Page blobs: Page blobs are made of pages that are identified by an offset from the beginning of the blob.
 A page blob can be split into multiple pages or consist of a single page.
 This type of blob is optimized for random access and can be used to host data other than streaming content.
 They allow random read and write operations.
 Page blobs store the VHD files that back VMs.
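As an illustration, the sketch below uploads and downloads a block blob with the azure-storage-blob Python package; the connection string, container, and blob names are hypothetical.

from azure.storage.blob import BlobServiceClient  # azure-storage-blob package

# Hypothetical connection string copied from the storage account's access keys
service = BlobServiceClient.from_connection_string("<storage-connection-string>")

# Containers hold blobs, much like S3 buckets hold objects
service.create_container("lecture-notes")

# Upload a block blob (block blobs suit sequential access such as files and media)
blob = service.get_blob_client(container="lecture-notes", blob="chapter4.txt")
blob.upload_blob(b"Cloud storage lecture notes", overwrite=True)

# Download the content again
print(blob.download_blob().readall())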
Contd….
Azure drive
 Page blobs can be used to store an entire file system in the
form of a single Virtual Hard Drive (VHD) file.
The VHD can then be mounted as part of the NTFS file system by Azure compute resources, thus providing persistent and durable storage.
 A page blob mounted as part of an NTFS tree is called an
Azure Drive.
Tables
 Tables constitute a semi-structured storage solution, allowing
users to store information in the form of entities with a
collection of properties.
Contd….
 It is used for storing datasets that don't require complex joins,
foreign keys, or stored procedures and can be de-normalized
for fast access.
 It is used for quickly querying data using a clustered index.
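A minimal sketch with the azure-data-tables Python package, assuming a hypothetical connection string; entities are flat property bags addressed by the clustered PartitionKey/RowKey index.

from azure.data.tables import TableClient  # azure-data-tables package

# Hypothetical connection string for the storage account
table = TableClient.from_connection_string("<storage-connection-string>",
                                            table_name="Students")
table.create_table()

# An entity is a flat bag of properties identified by PartitionKey + RowKey
table.create_entity({"PartitionKey": "cs-dept", "RowKey": "stud-001",
                     "Name": "Alice", "Course": "Cloud Computing"})

# Query quickly using the clustered (PartitionKey, RowKey) index
for entity in table.query_entities("PartitionKey eq 'cs-dept'"):
    print(entity["RowKey"], entity["Name"])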
Queues
 Queue storage allows applications to communicate by
exchanging messages through durable queues, thus
avoiding lost or unprocessed messages.
 Applications enter messages into a queue, and
other applications can read them in a first-in, first-out
(FIFO) style.
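A minimal producer/consumer sketch with the azure-storage-queue Python package; the connection string, queue name, and message content are hypothetical.

from azure.storage.queue import QueueClient  # azure-storage-queue package

# Hypothetical connection string for the storage account
queue = QueueClient.from_connection_string("<storage-connection-string>",
                                           queue_name="orders")
queue.create_queue()

# One application enqueues a durable message ...
queue.send_message("process order 1001")

# ... and another reads it back in roughly FIFO order
for message in queue.receive_messages():
    print(message.content)
    queue.delete_message(message)  # remove the message once it has been handled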
Thanks
Q&A