
UCS18E08 - CLOUD COMPUTING Unit 3

Mr. D.SIVA
ASSISTANT PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE
SRM IST, RAMAPURAM

Syllabus
UNIT – III (12 Hours)

Data in the cloud: Relational databases, Cloud file systems: GFS and HDFS, BigTable, HBase and
Dynamo. Map-Reduce and extensions: Parallel computing, The Map-Reduce model, Parallel
efficiency of Map-Reduce, Relational operations using Map-Reduce, Enterprise batch processing
using Map-Reduce, Introduction to cloud development, Example/Application of MapReduce,
Features and comparisons among GFS, HDFS, etc., Map-Reduce model



Data in the cloud
 Since the 80s, relational database technology has been the ‘default’ data storage and retrieval
mechanism used in the vast majority of enterprise applications.
Relational databases originated with System R and Ingres in the 70s; these systems introduced the
new paradigm as a general-purpose replacement for hierarchical and network databases for the most
common business computing task of the time, viz. transaction processing.
 Google in particular has developed a massively parallel and fault tolerant distributed file system
(GFS) along with a data organization (BigTable) and programming paradigm (MapReduce) that is
markedly different from the traditional relational model.
Such ‘cloud data strategies’ are particularly well suited for large-volume massively parallel text
processing, as well as possibly other tasks, such as enterprise analytics.
 The public cloud computing offerings from Google (i.e. App Engine) as well as those from other
vendors have made similar data models (Google’s Datastore, Amazon’s SimpleDB) and
programming paradigms (Hadoop on Amazon’s EC2) available to users as part of their cloud
platforms.
RELATIONAL DATABASES

Before we delve into cloud data structures we first review traditional relational database
systems and how they store data.
Users (including application programs) interact with an RDBMS via SQL; the database
‘front-end’ or parser transforms queries into memory and disk level operations so as to optimize
execution time.
 Data records are stored on pages of contiguous disk blocks, which are managed by the
disk-space-management layer.
Finally, the operating system files used by databases need to span multiple disks so as to
handle the large storage requirements of a database, by efficiently exploiting parallel I/O
systems such as RAID disk arrays or multi-processor clusters.
RELATIONAL DATABASES
The storage indexing layer of the
database system is responsible for
locating records and their organization
on disk pages. Relational records
(tabular rows) are stored on disk pages
and accessed through indexes on
specified columns, which can be B+-tree
indexes, hash indexes, or bitmap
indexes.
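As a rough illustration of how an index maps column values to record locations, the Python sketch below models an ordered (B+-tree-like) index as a sorted list of keys pointing to (page, slot) positions. The keys, pages and slots are invented, and real indexes are paged on-disk structures rather than in-memory lists.

from bisect import bisect_left

# Toy stand-in for the storage indexing layer: an ordered index over one column
# mapping key values to (disk page, slot) record locations, queried the way a
# B+-tree index would be. Keys, pages and slots are invented for illustration.

index_keys = [101, 203, 305, 410, 512]                    # indexed column values, kept sorted
index_locs = [(7, 0), (7, 3), (9, 1), (12, 2), (12, 5)]   # (page number, slot on page)

def lookup(key):
    # Point lookup: find the (page, slot) holding the record with this key.
    i = bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return index_locs[i]
    return None

def range_scan(lo, hi):
    # Range scan over [lo, hi], the access pattern B+-tree indexes support well.
    i = bisect_left(index_keys, lo)
    while i < len(index_keys) and index_keys[i] <= hi:
        yield index_keys[i], index_locs[i]
        i += 1

print(lookup(305))                  # (9, 1)
print(list(range_scan(200, 450)))   # [(203, (7, 3)), (305, (9, 1)), (410, (12, 2))]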
RELATIONAL DATABASES
Over the years database systems have evolved towards
exploiting the parallel computing capabilities of multi-
processor servers as well as harnessing the aggregate
computing power of clusters of servers connected by a
high-speed network.
 Figure 10.2 illustrates three parallel/distributed
database architectures: The shared memory
architecture is for machines with many CPUs (and with
each having possibly many processing ‘cores’) while the
memory address space is shared and managed by a
symmetric multi-processing operating system that
schedules processes in parallel exploiting all the
processors.
The shared-nothing architecture assumes a cluster of
independent servers each with its own disk, connected
by a network. A shared-disk architecture is somewhere in
between with the cluster of servers sharing storage
through high-speed network storage, such as a NAS
(network attached storage) or a SAN (storage area
network) interconnected via standard Ethernet, or faster
Fibre Channel or InfiniBand connections.
 Each of the traditional transaction-processing databases (Oracle, DB2
and SQL Server) supports parallelism in various ways, as do specialized
systems designed for data warehousing such as Vertica, Netezza and
Teradata.
CLOUD FILE SYSTEMS GFS AND HDFS
The architecture of cloud file systems is illustrated in Figure 10.3.
Large files are broken up into ‘chunks’ (GFS) or ‘blocks’ (HDFS),
which are themselves large (64MB being typical).
These chunks are stored on commodity (Linux) servers called
Chunk Servers (GFS) or Data Nodes (HDFS).
 Further, each chunk is replicated at least three times, both on a
different physical rack as well as a different network segment in
anticipation of possible failures of these components apart from
server failures.
When a client program (‘cloud application’) needs to read/write a
file, it sends the full path and offset to the Master (GFS) or NameNode (HDFS), which
sends back meta-data for one (in the case of read) or all (in the
case of write) of the replicas of the chunk where this data is to be
found.
Thereafter the client directly reads data from the designated
chunk server; this data is not cached since most reads are large
and caching would only complicate writes.
In case of a write, in particular an append, the client sends only
the data to be appended to all the chunk servers; when they all
acknowledge receiving this data it informs a designated ‘primary’
chunk server, whose identity it receives (and also caches) from the
Master.
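The read path just described can be sketched in Python. The MasterStub and ChunkServerStub classes below are hypothetical in-memory stand-ins for the master/NameNode and chunk server/data node interfaces; they are not a real GFS or HDFS client API.

CHUNK_SIZE = 64 * 1024 * 1024    # chunks (GFS) / blocks (HDFS) are large, 64 MB being typical

class ChunkServerStub:
    # Hypothetical stand-in for a chunk server (GFS) or data node (HDFS).
    def __init__(self):
        self.chunks = {}                            # chunk handle -> bytes held locally

    def read_chunk(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

class MasterStub:
    # Hypothetical stand-in for the master/NameNode: holds only metadata,
    # i.e. which chunk handle and replicas serve each (path, chunk index).
    def __init__(self):
        self.metadata = {}                          # (path, chunk_index) -> (handle, replicas)

    def find_chunk(self, path, chunk_index):
        return self.metadata[(path, chunk_index)]

def read(master, path, offset, length):
    chunk_index = offset // CHUNK_SIZE              # which chunk contains the requested offset
    handle, replicas = master.find_chunk(path, chunk_index)   # metadata only
    # The data itself is read directly from a chunk server; the master is not on
    # the data path, and the client does not cache the data (reads are large and
    # caching would only complicate writes).
    return replicas[0].read_chunk(handle, offset % CHUNK_SIZE, length)

# Tiny demo with a single chunk replicated on one stub server.
server = ChunkServerStub()
server.chunks["chunk-0001"] = b"hello cloud file systems"
master = MasterStub()
master.metadata[("/logs/day1", 0)] = ("chunk-0001", [server, server, server])
print(read(master, "/logs/day1", 6, 5))             # b'cloud'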
CLOUD FILE SYSTEMS GFS AND HDFS
Advantages of Cloud File System:
 This architecture efficiently supports multiple parallel readers and writers.
It also supports writing (appending) and reading the same file by parallel sets of writers and readers
while maintaining a consistent view, i.e. each reader always sees the same data regardless of the
replica it happens to read from.
GFS :
The Google File System (GFS) is designed to manage relatively large files using a very large
distributed cluster of commodity servers connected by a high-speed network.
 It is therefore designed to
(a) expect and tolerate hardware failures, even during the reading or writing of an individual file (since files
are expected to be very large) and
 (b) support parallel reads, writes and appends by multiple client programs.
HDFS
The Hadoop Distributed File System (HDFS) is an open source implementation of the GFS
architecture that is also available on the Amazon EC2 cloud platform.
We refer to both GFS and HDFS as ‘cloud file systems.’
BIGTABLE, HBASE AND DYNAMO
BigTable :
 BigTable [9] is a distributed structured storage system built on GFS
 A BigTable is essentially a sparse, distributed, persistent, multidimensional sorted ‘map.’
Data in a BigTable is accessed by a row key, column key and a timestamp.
 Each column can store arbitrary name–value pairs, where the name has the form column-family:label and the value is a string.
 The set of possible column-families for a table is fixed when it is created whereas columns, i.e.
labels within the column family, can be created dynamically at any time.
Column families are stored close together in the distributed file system; thus the BigTable model
shares elements of column-oriented databases.
Further, each Bigtable cell (row, column) can contain multiple versions of the data that are stored
in decreasing timestamp order.
For example, consider a table of sales data: each row stores information about a specific sale transaction and the row key is a transaction
identifier.
The ‘location’ column family stores columns relating to where the sale occurred, whereas the
‘product’ column family stores the actual products sold and their classification.
Note that there are two values for region having different timestamps, possibly because of a
reorganization of sales regions.
Notice also that in this example the data happens to be stored in a de-normalized fashion, as
compared to how one would possibly store it in a relational structure; for example the fact that
XYZ Soap is a Cleaner is not maintained.
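The sparse, multidimensional sorted map described above can be mimicked with nested Python dictionaries, as in the sketch below; the row key, column names and values loosely follow the sales example in the text and are purely illustrative.

# Illustrative model of a BigTable as a sparse map from
# (row key, column-family:label, timestamp) to a value; the sample sale
# transaction loosely mirrors the example discussed above.

bigtable = {
    "txn-0001": {                                   # row key: a transaction identifier
        "location:city":    {20210310: "Mumbai"},
        "location:region":  {20210310: "West",      # two versions of 'region', kept in
                             20200115: "South"},    # decreasing timestamp order
        "product:item":     {20210310: "XYZ Soap"},
        "product:category": {20210310: "Cleaner"},
        "sale:value":       {20210310: "120.00"},
    },
}

def lookup(table, row, column, timestamp=None):
    # Return the latest version at or before `timestamp` (latest overall if None).
    versions = table[row][column]
    for t in sorted(versions, reverse=True):        # scan in decreasing timestamp order
        if timestamp is None or t <= timestamp:
            return versions[t]
    return None

print(lookup(bigtable, "txn-0001", "location:region"))             # 'West'
print(lookup(bigtable, "txn-0001", "location:region", 20201231))   # 'South'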
HBase:
Hadoop’s HBase is a similar open source system that uses HDFS.
 Figure 10.5 illustrates how BigTable tables are stored on a
distributed file system such as GFS or HDFS.
Each table is split into different row ranges, called tablets. Each
tablet is managed by a tablet server that stores each column
family for the given row range in a separate distributed file,
called an SSTable.
 Additionally, a single Metadata table is managed by a meta-data
server that is used to locate the tablets of any user table in
response to a read or write request.
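The tablet-location step can be pictured with a small sketch: a metadata table maps the last row key of each tablet's row range to the tablet server managing it, and a read or write request is routed by searching this table. The row ranges and server names below are invented.

from bisect import bisect_left

# Sketch of tablet location: each tablet covers a row-key range and is managed by
# a tablet server; the metadata table maps the last row key of each range to its
# server so a request for any row key can be routed. Names/ranges are invented.

metadata = [
    ("txn-02999", "tabletserver-1"),    # (last row key in the tablet, tablet server)
    ("txn-05999", "tabletserver-2"),
    ("txn-09999", "tabletserver-3"),
]
end_keys = [end for end, _ in metadata]

def locate_tablet(row_key):
    # Find the first tablet whose end key is >= row_key; that tablet's server owns the key.
    i = bisect_left(end_keys, row_key)
    if i == len(metadata):
        raise KeyError(f"{row_key} is beyond the last tablet")
    return metadata[i][1]

print(locate_tablet("txn-00042"))       # tabletserver-1
print(locate_tablet("txn-07314"))       # tabletserver-3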
Dynamo:
 Dynamo is another distributed data system, developed at Amazon, which underlies its
SimpleDB key-value pair database.
 Unlike BigTable, Dynamo was designed specifically for supporting a large volume of
concurrent updates, each of which could be small in size, rather than bulk reads and appends
as in the case of BigTable and GFS.
 Dynamo’s data model is that of simple key-value pairs, and it is expected that applications
read and write such data objects fairly randomly. This model is well suited for many web-based
e-commerce applications that all need to support constructs such as a ‘shopping cart.’
 Dynamo does not rely on any underlying distributed file system and instead directly manages
data storage across distributed nodes.
The architecture of Dynamo is illustrated in Figure
10.6. Objects are key-value pairs with arbitrary
arrays of bytes.
 An MD5 hash of the key is used to generate a 128-
bit hash value. The range of this hash function is
mapped to a set of virtual nodes arranged in a ring,
so each key gets mapped to one virtual node.
Notice that the Dynamo architecture is completely
symmetric with each node being equal, unlike the
BigTable/GFS architecture that has special master
nodes at both the BigTable as well as GFS layer.
Dynamo is able to handle transient failures by
passing writes intended for a failed node to another
node temporarily. Such replicas are kept separately
and scanned periodically with replicas being sent
back to their intended node as soon as it is found to
have revived.
Finally, Dynamo can be implemented using different
storage engines at the node level, such as Berkeley
DB or even MySQL.
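A minimal sketch of the key-to-node mapping described above, assuming MD5 hashing of keys onto a ring of virtual nodes; the node names and token counts are invented, and replication to further nodes on the ring, object versioning and hinted handoff are omitted.

import hashlib
from bisect import bisect_right

# Sketch of Dynamo-style consistent hashing: an MD5 hash of the key gives a
# 128-bit value whose range is mapped onto virtual nodes arranged in a ring.
# Node names and token counts are invented; replication, versioning and
# hinted handoff for transient failures are omitted.

def md5_128(data):
    return int(hashlib.md5(data.encode()).hexdigest(), 16)    # 128-bit integer

class Ring:
    def __init__(self, nodes, tokens_per_node=4):
        # Each physical node owns several virtual nodes (positions) on the ring.
        self.ring = sorted(
            (md5_128(f"{node}#{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self.positions = [pos for pos, _ in self.ring]

    def node_for(self, key):
        # Walk clockwise from hash(key) to the first virtual node on the ring.
        h = md5_128(key)
        idx = bisect_right(self.positions, h) % len(self.ring)
        return self.ring[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
for key in ["cart:alice", "cart:bob", "cart:carol"]:
    print(key, "->", ring.node_for(key))            # each key maps to one virtual node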
MapReduce and extensions
The MapReduce programming model was developed at Google in the process of
implementing large-scale search and text processing tasks on massive collections of web data
stored using BigTable and the GFS distributed file system.
 The MapReduce programming model is designed for processing and generating large
volumes of data via massively parallel computations utilizing tens of thousands of
processors at a time.
 Hadoop is an open source implementation of the MapReduce model developed at Yahoo.
 Hadoop is also available on pre-packaged AMIs in the Amazon EC2 cloud platform, which
has sparked interest in applying the MapReduce model for large-scale, fault-tolerant
computations in other domains, including such applications in the enterprise context.
PARALLEL COMPUTING
 THE MAPREDUCE MODEL
PARALLEL EFFICIENCY OF MAPREDUCE
RELATIONAL OPERATIONS USING MAPREDUCE
ENTERPRISE BATCH PROCESSING USING MAPREDUCE
PARALLEL COMPUTING

 Parallel computing has a long history with its origins in scientific computing in the late 60s
and early 70s. Different models of parallel computing have been used based on the nature
and evolution of multiprocessor computer architectures.
 The shared-memory model assumes that any processor can access any memory location.
 In the distributed-memory model each processor can address only its own memory and
communicates with other processors using message passing over the network.
 In scientific computing applications for which these models were developed, it was assumed
that data would be loaded from disk at the start of a parallel job and then written back once
the computations had been completed, as scientific tasks were largely compute bound.
 Over time, parallel computing also began to be applied in the database arena; as discussed
earlier, database systems supporting shared-memory,
shared-disk and shared-nothing models became available.
PARALLEL COMPUTING
The premise of parallel computing is that a task that takes time T should take time T/p if executed on p
processors.
 In practice, inefficiencies are introduced by distributing the computations such as
◦ (a) the need for synchronization among processors,
◦ (b) overheads of communication between processors through messages or disk, and
◦ (c) any imbalance in the distribution of work to processors.
Thus in practice the time Tp to execute on p processors exceeds T/p, and the parallel efficiency of an algorithm is
defined as:

efficiency = T / (p × Tp), which is always at most 1.
A scalable parallel implementation is one where:


(a) the parallel efficiency remains constant as the size of data is increased along with a corresponding increase in processors
and
(b) the parallel efficiency increases with the size of data for a fixed number of processors.
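A small worked example of the efficiency definition above; the timings are made-up numbers chosen only to show how synchronization and communication overheads keep the efficiency below 1.

# Worked example of the definition above: efficiency = T / (p * Tp).
# The timings are invented purely to illustrate the formula.

def parallel_efficiency(T, p, Tp):
    return T / (p * Tp)

T = 1000.0                                   # time on a single processor (seconds)
for p, Tp in [(2, 520.0), (4, 280.0), (8, 160.0)]:
    print(f"p={p}: Tp={Tp:.0f}s, efficiency={parallel_efficiency(T, p, Tp):.2f}")
# Prints efficiencies of 0.96, 0.89 and 0.78: synchronization and communication
# overheads keep Tp above T/p, so the efficiency stays below 1.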
THE MAPREDUCE MODEL
MapReduce is a programming framework that allows us to
perform distributed and parallel processing on large data
sets in a distributed environment.
◦ MapReduce consists of two distinct tasks — Map and Reduce.
◦ As the name MapReduce suggests, reducer phase takes place
after the mapper phase has been completed.
◦ So, the first is the map job, where a block of data is read and
processed to produce key-value pairs as intermediate outputs.
◦ The output of a Mapper or map job (key-value pairs) is input
to the Reducer.
◦ The reducer receives the key-value pair from multiple map
jobs.
◦ Then, the reducer aggregates those intermediate data tuples
(intermediate key-value pair) into a smaller set of tuples or
key-value pairs which is the final output.
A Word Count Example of MapReduce
Let us look at how MapReduce works by taking an example where we have a text file called example.txt
whose contents are as follows:
◦ Dear, Bear, River, Car, Car, River, Deer, Car and Bear
Now, suppose, we have to perform a word count on example.txt using MapReduce. So,
we will be finding unique words and the number of occurrences of those unique words.
First, we divide the input into three splits as shown in the figure. This will distribute the work
among all the map nodes.
Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each
of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every
word, in itself, will occur once.
Now, a list of key-value pairs will be created where the key is nothing but the individual words
and value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs — Dear, 1;
Bear, 1; River, 1. The mapping process remains the same on all the nodes.
After the mapper phase, a partition process takes place where sorting and shuffling happen
so that all the tuples with the same key are sent to the corresponding reducer.
So, after the sorting and shuffling phase, each reducer will have a unique key and a list of
values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1].., etc.
Now, each Reducer counts the values which are present in that list of values. As shown in the
figure, reducer gets a list of values which is [1,1] for the key Bear. Then, it counts the number
of ones in the very list and gives the final output as — Bear, 2.
Finally, all the output key/value pairs are then collected and written in the output file.
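The word count walk-through above can be captured in a few lines of Python that mimic the map, shuffle/sort and reduce phases on a single machine; this is only a sketch of the model, not actual Hadoop code.

from collections import defaultdict

# Single-machine sketch of the word count above: a map phase emitting (word, 1)
# pairs, a shuffle/sort phase grouping values by key, and a reduce phase summing
# them. A real MapReduce job runs these phases in parallel on many nodes.

def map_phase(line):
    # Tokenize the split and emit a hardcoded value of 1 for every word.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)               # key -> list of values, e.g. Bear -> [1, 1]
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)                  # count the ones in the list for this key

splits = ["Dear Bear River", "Car Car River", "Deer Car Bear"]   # the three input splits
intermediate = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)    # {'Dear': 1, 'Bear': 2, 'River': 2, 'Car': 3, 'Deer': 1}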
Advantages of MapReduce
1. Parallel Processing:
In MapReduce, we are dividing the job among
multiple nodes and each node works with a part
of the job simultaneously. So, MapReduce is
based on the Divide and Conquer paradigm, which
helps us to process the data using different
machines. As the data is processed by multiple
machines instead of a single machine in parallel,
the time taken to process the data gets reduced
by a tremendous amount as shown in the figure
below
Advantages of MapReduce
2. Data Locality:
Instead of moving data to the processing unit, we are moving the processing unit to the data
in the MapReduce Framework. In the traditional system, we used to bring data to the
processing unit and process it. But, as the data grew and became very huge, bringing this huge
amount of data to the processing unit posed the following issues:
◦ Moving huge data to the processing unit is costly and degrades network performance.
◦ Processing takes time as the data is processed by a single unit which becomes the bottleneck.
◦ Master node can get over-burdened and may fail.
Now, MapReduce allows us to overcome the above issues by bringing the processing unit to
the data. The data is distributed among multiple nodes and each node processes the part of
the data residing on it. This allows us to have
the following advantages:
◦ It is very cost effective to move the processing unit to the data.
◦ The processing time is reduced as all the nodes are working with their part of the data in parallel.
◦ Every node gets a part of the data to process and therefore, there is no chance of a node getting
overburdened.
RELATIONAL OPERATIONS USING MAPREDUCE
We now explain how relational operations are implemented using MapReduce, visualizing each operation
with an example.
The goal is to execute SQL statements on large data sets using the MapReduce model.
Consider how a relational join could be executed in parallel using MapReduce. Figure 11.2 illustrates such an example:
 In order to compute the gross sales by city these two tables need to be joined using SQL as shown in the
figure.
The MapReduce implementation works as follows:
 In the map step, each mapper reads a (random) subset of records from each input table Sales and Cities, and
segregates each of these by address, i.e. the reduce key k2 is ‘address.’
Next each reducer fetches Sales and Cities data for its assigned range of address values from each mapper, and
then performs a local join operation including the aggregation of sale value and grouping by city.
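A single-machine Python sketch of this reduce-side join; the Sales and Cities records below are invented to mirror the example, and a real job would distribute the map and reduce steps across many nodes.

from collections import defaultdict

# Sketch of the join described above: mappers tag each record with its source
# table and emit the join key ('address'); each reducer then joins the records
# for its addresses locally and aggregates sale value grouped by city.
# The Sales and Cities rows are invented for illustration.

sales  = [("addr1", 100), ("addr2", 250), ("addr1", 50)]     # (address, sale value)
cities = [("addr1", "Mumbai"), ("addr2", "Pune")]            # (address, city)

def map_phase():
    for address, amount in sales:
        yield address, ("Sales", amount)      # reduce key k2 is 'address'
    for address, city in cities:
        yield address, ("Cities", city)

def reduce_phase(groups):
    gross_by_city = defaultdict(int)
    for address, records in groups.items():   # all records sharing an address
        city = next(value for tag, value in records if tag == "Cities")
        for tag, value in records:
            if tag == "Sales":
                gross_by_city[city] += value  # local join plus aggregation by city
    return dict(gross_by_city)

groups = defaultdict(list)                    # shuffle: group tagged records by address
for key, value in map_phase():
    groups[key].append(value)
print(reduce_phase(groups))                   # {'Mumbai': 150, 'Pune': 250}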





Pig Latin and HiveQL
 Several tools translate SQL-like statements to the map-reduce framework.
Two notable examples are Pig Latin developed at Yahoo!, and
Hive developed and used at Facebook.
Both of these are open source tools available as part of the
Hadoop project, and both leverage the Hadoop distributed file
system HDFS.
 Figure 11.3 illustrates how the above SQL query can be
represented using the Pig Latin language as well as the HiveQL
dialect of SQL.
 Pig Latin is ideal for executing sequences of large-scale data
transformations using MapReduce. In the enterprise context it
is well suited for the tasks involved in loading information into
a data warehouse.
HiveQL, being more declarative and closer to SQL, is a good
candidate for formulating analytical queries on a large
distributed data warehouse.
HadoopDB is an attempt at combining the advantages of
MapReduce and relational databases by using databases
locally within nodes while using MapReduce to coordinate
parallel execution.
 Another example is SQL/MR from Aster Data that enhances
a set of distributed SQL-compliant databases with MapReduce
programming constructs.



ENTERPRISE BATCH PROCESSING USING MAPREDUCE

In the enterprise context there is considerable interest in leveraging the MapReduce model for
high-throughput batch processing, analysis on data warehouses as well as predictive analytics.
MapReduce is a programming model that allows the user to write batch processing jobs with a
small amount of code.
 High-throughput batch processing operations on transactional data, usually performed as ‘end-
of-day’ processing, often need to access and compute using large data sets. These operations are
also naturally time bound, having to complete before transactional operations can resume fully.



ENTERPRISE BATCH PROCESSING USING MAPREDUCE
As an example, illustrated in Figure 11.4, consider
an investment bank that needs to revalue the
portfolios of all its customers with the latest prices
as received from the stock exchange at the end of a
trading day.
 Each customer’s Holdings (the quantity of each
stock held by the customer) needs to be joined with
prices from the latest Stock feed, the quantity of
stock held by each customer must be multiplied by
stock price, and the result grouped by customer
name and appended to a set of Portfolio valuation
files/tables time-stamped by the current date.
 Figure 11.4 depicts a Pig Latin program for such a
batch process. This eventually translates to two
MapReduce phases as shown.
It is important to note that we append to the
Portfolio file rather than update an existing table;
this is because MapReduce leverages the
distributed file system where storage is cheap and
bulk record appends are far more efficient than
updates of existing records.
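The end-of-day revaluation can be sketched as two chained passes in Python: a join of Holdings with the latest Stock prices followed by a group-by on customer, with results appended to a date-stamped output. The record layouts are invented and this is not the Pig Latin program of Figure 11.4, only an illustration of the same two phases.

from collections import defaultdict
from datetime import date

# Sketch of the end-of-day batch: phase 1 joins Holdings with the latest Stock
# prices on the stock symbol and multiplies quantity by price; phase 2 groups
# the valued positions by customer. Results are appended (not updated) to a
# date-stamped file, as the text recommends. Record layouts are invented.

holdings = [("alice", "INFY", 10), ("alice", "TCS", 5), ("bob", "INFY", 20)]
prices   = [("INFY", 1500.0), ("TCS", 3200.0)]           # latest stock feed

# Phase 1: join on stock symbol, then value each position.
positions_by_symbol = defaultdict(list)
price_by_symbol = {}
for customer, symbol, qty in holdings:
    positions_by_symbol[symbol].append((customer, qty))
for symbol, price in prices:
    price_by_symbol[symbol] = price

valued = [(customer, qty * price_by_symbol[symbol])
          for symbol, positions in positions_by_symbol.items()
          for customer, qty in positions]

# Phase 2: group by customer name and sum the position values.
portfolio = defaultdict(float)
for customer, value in valued:
    portfolio[customer] += value

# Append to a date-stamped Portfolio valuation file rather than updating in place.
with open(f"portfolio-{date.today().isoformat()}.txt", "a") as out:
    for customer, total in sorted(portfolio.items()):
        out.write(f"{customer}\t{total}\n")
print(dict(portfolio))      # {'alice': 31000.0, 'bob': 30000.0}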

