
Chapter – 4

Cloud Storage
Outlines

 Introduction  Applications of Hadoop


 Data intensive Computing MapReduce
 Challenges  High-performance
 Big data distributed file systems and
cloud storage
 Technologies for data-
intensive computing  AWS ecosystem
 Map-Reduce framework  AWS Storage services
 Phases of MapReduce  Microsoft Storage services
 Benefits of Hadoop
MapReduce
Introduction
 Cloud storage is a service model in which data is maintained,
managed and backed up remotely and made available to users over
a network (typically the Internet).
 Traditionally, distributed databases have been considered the
natural evolution of database management systems as the scale of
the datasets becomes unmanageable with a single system.
 Distributed databases are a collection of data stored at different
sites of a computer network.
 Each site might expose a degree of autonomy, providing services
for the execution of local applications, but also participating in the
execution of a global application.
Contd…
 A distributed database can be created by splitting and
scattering the data of an existing database over different
sites or by federating together multiple existing databases.
 Due to the explosion of unstructured data in the form of
blogs, Web pages, software logs, and sensor readings, the
relational model in its original formulation does not seem
to be the preferred solution for supporting data analytics on
a large scale.
Data intensive computing
 Data-intensive computing is concerned with production,
manipulation, and analysis of large-scale data in the range of
hundreds of megabytes (MB) to petabytes (PB) and beyond.
 Data-intensive computations occur in many application
domains.
 People conducting scientific simulations and experiments are
often keen to produce, analyze, and process huge volumes of
data. Hundreds of gigabytes of data are produced every
second by telescopes mapping the sky; the collection of
images of the sky easily reaches the scale of petabytes over a
year.
Contd…
 Bioinformatics applications mine databases that may end
up containing terabytes of data.
 Earthquake simulators process a massive amount of data,
which is produced as a result of recording the vibrations of
the Earth across the entire globe.
 Customer data for any telecom company would easily be in
the range of 10-100 terabytes. This volume of information
is not only processed to generate billing statements, but it
is also mined to identify scenarios, trends, and patterns that
help these companies provide better service.
Challenges
 The huge amount of data produced, analyzed, or stored imposes
requirements on the supporting infrastructures and middleware that
are hardly found in the traditional solutions for distributed
computing.
 Open challenges for traditional computing solutions include:
  The location of data is crucial, as the need to move terabytes of data becomes an obstacle to high-performing computations.
  Scalable algorithms that can search and process massive datasets.
  New metadata management technologies that can scale to handle complex, heterogeneous, and distributed data sources.
  Flexible and high-performance software integration techniques that facilitate the combination of software modules running on different platforms.
Contd…
 Large datasets have mostly been the domain of scientific
computing. This scenario has recently started to change as
massive amounts of data are being produced, mined, and
crunched by companies that provide Internet services such
as searching, online advertising, and social media.
 It is critical for such companies to efficiently analyze these
huge datasets because they constitute a precious source of
information about their customers.
Big Data
 Together with the dissemination of cloud computing
technologies that support data-intensive computations, the
term Big Data has become popular.
 The term Big Data applies to datasets of which the size is
beyond the ability of commonly used software tools to
capture, manage, and process within a tolerable elapsed
time.
 Relational databases become ineffective for that amount of
information, instead requiring “massively parallel software
running on tens, hundreds, or even thousands of servers".
Contd…
Cloud technologies support data-intensive computing in several
ways:
 By providing a large amount of compute instances on
demand, which can be used to process and analyze large
datasets in parallel.
 By providing a storage system optimized for keeping large
blocks of data and other distributed data store architectures.
 By providing frameworks and programming APIs optimized
for the processing and management of large amounts of data.
These APIs are mostly coupled with a specific storage
infrastructure to optimize the overall performance of the
system.
Technologies for data-intensive computing
 Data-intensive computing concerns the development of
applications that are mainly focused on processing large
quantities of data. Therefore, storage systems and programming
models constitute a natural classification of the technologies
supporting data-intensive computing
 Research on databases and the data management industry are
definitely at a turning point, and new opportunities arise. Some
factors contributing to this change are:
 Growing popularity of Big Data.
 Growing importance of data analytics in the business chain.
 Presence of data in several forms, not only structured.
 New approaches and technologies for computing.


Contd…

 Cloud technologies for Big Data
 MapReduce framework in Hadoop
Map-Reduce framework
 It is a software framework that enables you to write applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable and fault-tolerant manner.
 It was originally introduced by Google.
 A MapReduce job usually splits the input data set into
independent chunks, which are processed by the map
tasks in a completely parallel manner.
 The framework takes care of scheduling tasks,
monitors them, and re-executes the failed tasks.
Contd…
 There are two primary tasks in MapReduce: map and reduce.
 In the map job, the input dataset is split into chunks.
 Map tasks process these chunks in parallel.
 The outputs of the map tasks are used as inputs for the reduce tasks.
 Reducers process the intermediate data from the maps into a smaller set of tuples, which forms the final output of the framework.
Phases of MapReduce
The MapReduce program is executed in three main phases:
mapping, shuffling, and reducing.
Mapping Phase
 This is the first phase of the program. There are two steps in
this phase: splitting and mapping.
 A dataset is split into equal units called chunks (input splits)
in the splitting step.
 Hadoop provides a RecordReader that uses TextInputFormat to transform input splits into key-value pairs.
 The key-value pairs are then used as inputs in the mapping step. The mapping step applies user-defined coding logic to these data blocks.
Contd…
Shuffling phase
 This is the second phase that takes place after the
completion of the Mapping phase. It consists of two main
steps: sorting and merging.
 In the sorting step, the key-value pairs are sorted using the
keys.
 Merging ensures that key-value pairs are combined.
 The shuffling phase facilitates the removal of duplicate values and the grouping of values.
 Different values associated with the same key are grouped together.
 The output of this phase is again a set of keys and values, just like in the Mapping phase.
Contd…
Reducer phase
 In the reducer phase, the output of the shuffling
phase is used as the input.
 The reducer processes this input further to reduce
the intermediate values into smaller values.
 It provides a summary of the entire dataset.
 The output from this phase is stored in HDFS (Hadoop Distributed File System).
Example on word count…
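The classic word-count example illustrates the three phases. What follows is a minimal, single-process Python sketch that only mimics mapping, shuffling, and reducing on two hypothetical input splits; a real Hadoop job would instead implement the Mapper and Reducer classes of the MapReduce API (typically in Java) or use Hadoop Streaming.

from itertools import groupby

# Mapping phase: each input split is turned into (word, 1) key-value pairs
def mapper(split):
    for word in split.strip().lower().split():
        yield (word, 1)

# Shuffling phase: sort the intermediate pairs and group the values by key
def shuffle(pairs):
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [count for _, count in group]

# Reducing phase: collapse the grouped values into a single count per word
def reducer(key, counts):
    return key, sum(counts)

# Two hypothetical input splits standing in for blocks of a large text file
splits = ["cloud storage in the cloud", "big data in the cloud"]

intermediate = [pair for split in splits for pair in mapper(split)]
word_counts = dict(reducer(key, counts) for key, counts in shuffle(intermediate))
print(word_counts)  # {'big': 1, 'cloud': 3, 'data': 1, 'in': 2, 'storage': 1, 'the': 2}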
Benefits of Hadoop MapReduce
 Speed: MapReduce can process huge amounts of unstructured data in a short time.
 Fault-tolerance: The MapReduce framework can
handle failures.
 Parallel Processing: In MapReduce, multiple job parts of the same dataset can be processed in parallel. This reduces the time taken to complete a task.
Applications of Hadoop MapReduce
The MapReduce framework can be programmed in popular languages such as Java, Scala, and Python. Practical applications of the MapReduce program include:
 E-commerce
 E-commerce companies such as Walmart, eBay, and Amazon use MapReduce to analyze buying behavior.
 MapReduce provides meaningful information that is used as the basis for developing product recommendations. Some of the information used includes site records, e-commerce catalogs, purchase history, and interaction logs.
Contd…
 Social networks
 The MapReduce programming tool can evaluate certain
information on social media platforms such as
Facebook, Twitter, and LinkedIn.
 Entertainment
 Netflix uses MapReduce to analyze the clicks and logs of online customers. This information helps the company suggest movies based on customers’ interests and behavior.
High-performance distributed file
systems and cloud storage
 Distributed file systems constitute the primary support for data management. They provide an interface through which to store information in the form of files and later access it for read and write operations.
 Examples:
 Lustre
 The Lustre file system is a massively parallel
distributed file system that covers the needs of a small
workgroup of clusters to a large-scale computing
cluster.
 It is designed to provide access to petabytes (PBs) of
storage to serve thousands of clients.
Cont’d
 IBM General Parallel File System (GPFS)
 GPFS is the high-performance distributed file system developed by IBM that provides support for supercomputers and Linux computing clusters.
 Google File System
 GFS is the storage infrastructure that supports the
execution of distributed applications in Google’s
computing cloud. The system has been designed to be a
fault tolerant, highly available, distributed file system built
on commodity hardware and standard Linux operating
systems.
 Amazon Simple Storage Service (S3)
AWS ecosystem
AWS Storage services
 AWS provides a collection of services for data storage and
information management.
 The core service in this area is represented by Amazon Simple
Storage Service (S3). This is a distributed object
store that allows users to store information in different formats.
 S3 has two core components: buckets and objects.
 Buckets represent virtual containers in which to store objects;
 Objects represent the content that is actually stored.
 Objects can also be enriched with metadata that can be
used to tag the stored content with additional information.
Contd….
Buckets
 A bucket is a container of objects.
 It can be thought of as a virtual drive hosted on the S3
distributed storage, which provides users with a flat store
to which they can add objects.
 Buckets are top level elements of the S3 storage
architecture and do not support nesting. That is, it is not
possible to create “sub buckets” or other kinds of physical
divisions.
Contd……
Objects and metadata
 Objects constitute the content elements stored in S3.
 An object is identified by a name that needs to be unique
within the bucket in which the content is stored.
 Metadata are both system and user defined: the first ones are
used by S3 to control the interaction with the object, whereas the
second ones are meaningful to the user, who can store up to 2
KB per metadata property represented by a key-value pair of
strings.
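As a concrete illustration, the sketch below shows how buckets, objects, and user-defined metadata might be used from Python with the boto3 SDK. The bucket name, object key, and metadata values are hypothetical, and AWS credentials are assumed to be configured in the environment.

import boto3  # AWS SDK for Python; credentials assumed to be configured

s3 = boto3.client("s3")

# Create a bucket: a flat, top-level container (no nesting of "sub-buckets");
# outside us-east-1 a CreateBucketConfiguration would also be required
s3.create_bucket(Bucket="example-chapter4-bucket")  # hypothetical, globally unique name

# Store an object and tag it with user-defined metadata (key-value strings)
s3.put_object(
    Bucket="example-chapter4-bucket",
    Key="notes/chapter4.txt",
    Body=b"Cloud storage lecture notes",
    Metadata={"author": "student", "topic": "cloud-storage"},
)

# Retrieve the object; the user metadata comes back with the response
response = s3.get_object(Bucket="example-chapter4-bucket", Key="notes/chapter4.txt")
print(response["Metadata"], response["Body"].read())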
Amazon elastic block store
 The Amazon Elastic Block Store (EBS) allows AWS
users to provide EC2 instances with persistent storage in
the form of volumes that can be mounted at instance
startup.
 The content of an EBS volume survives the instance life
cycle and is persisted into S3.
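A minimal sketch of how an EBS volume might be created, attached, and snapshotted with boto3; the Availability Zone, instance ID, and sizes are hypothetical placeholders.

import boto3  # credentials assumed to be configured

ec2 = boto3.client("ec2")

# Create a 10 GiB volume in the same Availability Zone as the target instance
volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=10, VolumeType="gp2")

# Attach it to a (hypothetical) EC2 instance as an additional block device
ec2.attach_volume(VolumeId=volume["VolumeId"],
                  InstanceId="i-0123456789abcdef0",  # hypothetical instance ID
                  Device="/dev/sdf")

# Snapshots make the volume's content durable; snapshot data is persisted into S3
ec2.create_snapshot(VolumeId=volume["VolumeId"], Description="chapter-4 demo snapshot")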
Amazon ElastiCache
 ElastiCache is an implementation of an elastic in-memory cache based on a cluster of EC2 instances.
 It provides fast data access from other EC2 instances
through a Memcached-compatible protocol so that
existing applications based on such technology do not
need to be modified and can transparently migrate to
ElastiCache.
 ElastiCache is based on a cluster of EC2 instances
running the caching software, which is made available
through Web services.
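Because ElastiCache speaks the Memcached protocol, any existing Memcached client can simply be pointed at the cluster endpoint. The sketch below uses the pymemcache Python library; the endpoint name and cached key are hypothetical.

from pymemcache.client.base import Client  # any Memcached-compatible client works

# Hypothetical ElastiCache cluster endpoint; 11211 is the default Memcached port
cache = Client(("my-cache.abc123.use1.cache.amazonaws.com", 11211))

# Store a value with a 5-minute expiry and read it back from another EC2 instance
cache.set("user:42:profile", b'{"name": "Alice"}', expire=300)
print(cache.get("user:42:profile"))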
Structured storage solutions
 Enterprise applications quite often rely on databases to
store data in a structured form, index, and perform
analytics against it.
 Traditionally, RDBMS have been the common data
back-end for a wide range of applications, even though
recently more scalable and lightweight solutions have
been proposed.
 Amazon provides applications with structured storage services in three different forms: preconfigured EC2 AMIs, Amazon Relational Database Service (RDS), and Amazon SimpleDB.
Contd….
Preconfigured EC2 AMIs
 Preconfigured EC2 AMIs are predefined templates featuring
an installation of a given database management system.
 EC2 instances created from these AMIs can be completed with
an EBS volume for storage persistence.
 Available AMIs include installations of IBM DB2, Microsoft
SQL Server, MySQL, Oracle, PostgreSQL, Sybase, and
Vertica.
Contd….
Amazon RDS
 RDS is a relational database service that relies on the EC2 infrastructure and is managed by Amazon.
 Developers do not have to worry about configuring
the storage for high availability, designing failover
strategies, or keeping the servers up-to-date with
patches. Moreover, the service provides users with
automatic backups, snapshots, point-in-time
recoveries, and facilities for implementing
replications.
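A hedged sketch of provisioning a managed MySQL instance with boto3; all identifiers, credentials, and sizes are hypothetical placeholders.

import boto3  # credentials assumed to be configured

rds = boto3.client("rds")

# Launch a managed MySQL instance; Amazon handles patching, failover, and backups
rds.create_db_instance(
    DBInstanceIdentifier="chapter4-demo-db",  # hypothetical identifier
    DBInstanceClass="db.t3.micro",
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me-please",    # placeholder credential
    AllocatedStorage=20,                      # storage size in GiB
    BackupRetentionPeriod=7,                  # keep automatic backups for 7 days
)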
Contd….
Amazon SimpleDB
 Amazon SimpleDB is a lightweight, highly scalable,
and flexible data storage solution for applications that
do not require a fully relational model for their data.
 SimpleDB provides support for semi-structured data,
the model for which is based on the concept of
domains, items, and attributes.
Amazon CloudFront
 CloudFront is an implementation of a content
delivery network on top of the Amazon distributed
storage infrastructure.
 It leverages a collection of edge servers strategically
located around the globe to better serve requests for
static and streaming Web content so that the transfer
time is reduced as much as possible.
 The content that can be delivered through CloudFront is static (HTTP and HTTPS) or streaming (Real-Time Messaging Protocol, or RTMP).
Microsoft Azure Storage services
 Windows Azure provides different types of storage
solutions that complement compute services with a more
durable and redundant option compared to local storage.
 Compared to local storage, these services can be
accessed by multiple clients at the same time and
from everywhere, thus becoming a general solution
for storage.
 This service is used to store data over the cloud that
can be scaled according to the requirements.
Contd….
 It includes Microsoft Azure Storage (Blob, Queue, Table, and Azure File
services).
Blobs
 Azure allows storing large amounts of data in the form of binary large objects (BLOBs) by means of the Blob service. This service is optimal for storing large text or binary files.
 It is Microsoft's object storage solution for the cloud. Blob storage is
optimized for storing a massive amount of unstructured data, such as text or
binary data.
Two types of blobs are available:
Block blobs: Block blobs are composed of blocks and are optimized for sequential access; therefore they are appropriate for media streaming.
Contd….
Page blobs: Page blobs are made of pages that are identified by an offset from the beginning of the blob.
 A page blob can be split into multiple pages or consist of a single page.
 This type of blob is optimized for random access and can be used to host data other than streaming content.
 They allow random read and write operations.
 Page blobs store the VHD files that back VMs.
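As an illustration, the sketch below uploads and downloads a block blob with the azure-storage-blob Python package; the connection string, container, and blob names are hypothetical.

from azure.storage.blob import BlobServiceClient  # azure-storage-blob package

# Hypothetical connection string copied from the storage account's access keys
service = BlobServiceClient.from_connection_string("<storage-connection-string>")

# Containers hold blobs, much like S3 buckets hold objects
service.create_container("lecture-notes")

# Upload a block blob (block blobs suit sequential access such as files and media)
blob = service.get_blob_client(container="lecture-notes", blob="chapter4.txt")
blob.upload_blob(b"Cloud storage lecture notes", overwrite=True)

# Download the content again
print(blob.download_blob().readall())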
Contd….
Azure drive
 Page blobs can be used to store an entire file system in the
form of a single Virtual Hard Drive (VHD) file.
The VHD can then be mounted as part of the NTFS file system by Azure compute resources, thus providing persistent and durable storage.
 A page blob mounted as part of an NTFS tree is called an
Azure Drive.
Tables
 Tables constitute a semi-structured storage solution, allowing
users to store information in the form of entities with a
collection of properties.
Contd….
 It is used for storing datasets that don't require complex joins,
foreign keys, or stored procedures and can be de-normalized
for fast access.
 It is used for quickly querying data using a clustered index.
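A minimal sketch with the azure-data-tables Python package, assuming a hypothetical connection string; entities are flat property bags addressed by the clustered PartitionKey/RowKey index.

from azure.data.tables import TableClient  # azure-data-tables package

# Hypothetical connection string for the storage account
table = TableClient.from_connection_string("<storage-connection-string>",
                                            table_name="Students")
table.create_table()

# An entity is a flat bag of properties identified by PartitionKey + RowKey
table.create_entity({"PartitionKey": "cs-dept", "RowKey": "stud-001",
                     "Name": "Alice", "Course": "Cloud Computing"})

# Query quickly using the clustered (PartitionKey, RowKey) index
for entity in table.query_entities("PartitionKey eq 'cs-dept'"):
    print(entity["RowKey"], entity["Name"])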
Queues
 Queue storage allows applications to communicate by
exchanging messages through durable queues, thus
avoiding lost or unprocessed messages.
 Applications enter messages into a queue, and
other applications can read them in a first-in, first-out
(FIFO) style.
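A minimal producer/consumer sketch with the azure-storage-queue Python package; the connection string, queue name, and message content are hypothetical.

from azure.storage.queue import QueueClient  # azure-storage-queue package

# Hypothetical connection string for the storage account
queue = QueueClient.from_connection_string("<storage-connection-string>",
                                           queue_name="orders")
queue.create_queue()

# One application enqueues a durable message ...
queue.send_message("process order 1001")

# ... and another reads it back in roughly FIFO order
for message in queue.receive_messages():
    print(message.content)
    queue.delete_message(message)  # remove the message once it has been handled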
Thanks
Q&A