Cloud Unit3
Mr. D.SIVA
ASSISTANT PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE
SRM IST, RAMAPURAM
Before we delve into cloud data structures, we first review traditional relational database systems and how they store data.
Users (including application programs) interact with an RDBMS via SQL; the database ‘front-end’ or parser transforms queries into memory- and disk-level operations so as to optimize execution time.
Data records are stored on pages of contiguous disk blocks, which are managed by the disk-space-management layer.
Finally, the operating system files used by a database need to span multiple disks in order to handle its large storage requirements, efficiently exploiting parallel I/O systems such as RAID disk arrays or multi-processor clusters.
RELATIONAL DATABASES
The storage indexing layer of the database system is responsible for locating records and their organization on disk pages. Relational records (tabular rows) are stored on disk pages and accessed through indexes on specified columns, which can be B+-tree indexes, hash indexes, or bitmap indexes.
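As an illustration of how an application exercises this indexing layer, the following is a minimal JDBC sketch (the sales table, its columns, and the connection URL are hypothetical, not taken from the text): it creates an index, typically a B+-tree by default, and then issues a query that the optimizer can answer through that index rather than by scanning every page.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class IndexExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC URL and credentials; any RDBMS with a JDBC driver will do.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/salesdb", "user", "password")) {

            // Most RDBMSs build a B+-tree index by default for CREATE INDEX.
            try (Statement stmt = conn.createStatement()) {
                stmt.executeUpdate(
                    "CREATE INDEX idx_sales_customer ON sales (customer_id)");
            }

            // The storage/indexing layer can now locate matching rows via the
            // index instead of scanning every disk page of the table.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT txn_id, amount FROM sales WHERE customer_id = ?")) {
                ps.setInt(1, 42);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("txn_id") + " " + rs.getDouble("amount"));
                    }
                }
            }
        }
    }
}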
RELATIONAL DATABASES
Over the years database systems have evolved towards exploiting the parallel computing capabilities of multi-processor servers, as well as harnessing the aggregate computing power of clusters of servers connected by a high-speed network.
Figure 10.2 illustrates three parallel/distributed database architectures. The shared-memory architecture is for machines with many CPUs (each possibly having many processing ‘cores’), where the memory address space is shared and managed by a symmetric multi-processing operating system that schedules processes in parallel, exploiting all the processors.
The shared-nothing architecture assumes a cluster of
independent servers each with its own disk, connected
by a network. A shared-disk architecture is somewhere in between, with the cluster of servers sharing storage through high-speed network storage, such as a NAS (network attached storage) or a SAN (storage area network), interconnected via standard Ethernet, or faster Fibre Channel or InfiniBand connections.
Each of the traditional transaction-processing databases, Oracle, DB2 and SQL Server, supports parallelism in various ways, as do specialized systems designed for data warehousing such as Vertica, Netezza and Teradata.
CLOUD FILE SYSTEMS GFS AND HDFS
The architecture of cloud file systems is illustrated in Figure 10.3.
Large files are broken up into ‘chunks’ (GFS) or ‘blocks’ (HDFS),
which are themselves large (64MB being typical).
These chunks are stored on commodity (Linux) servers called
Chunk Servers (GFS) or Data Nodes (HDFS).
Further, each chunk is replicated at least three times, on a different physical rack as well as a different network segment, in anticipation of possible failures of these components apart from server failures.
When a client program (‘cloud application’) needs to read/write a
file, it sends the full path and offset to the Master (GFS) which
sends back meta-data for one (in the case of read) or all (in the
case of write) of the replicas of the chunk where this data is to be
found.
Thereafter the client directly reads data from the designated
chunk server; this data is not cached since most reads are large
and caching would only complicate writes.
In case of a write, in particular an append, the client sends only
the data to be appended to all the chunk servers; when they all
acknowledge receiving this data it informs a designated ‘primary’
chunk server, whose identity it receives (and also caches) from the
Master.
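GFS’s client library is internal to Google, but HDFS exposes the same read path through its Java FileSystem API. The sketch below is a hedged illustration of the protocol just described: opening the file consults the Name Node (HDFS’s counterpart of the GFS Master) for block locations, after which the bytes are streamed directly from a Data Node without client-side caching. The cluster address, file path and offset are assumptions.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical Name Node address.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/logs/sales-2011.log"))) {
            // Seek to an arbitrary offset; the client then reads directly from
            // the Data Node holding the corresponding block, not via the Name Node.
            in.seek(1024L);
            byte[] buffer = new byte[4096];
            int n = in.read(buffer);
            System.out.println(new String(buffer, 0, n, StandardCharsets.UTF_8));
        }
    }
}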
CLOUD FILE SYSTEMS GFS AND HDFS
Advantages of Cloud File Systems:
This architecture efficiently supports multiple parallel readers and writers.
It also supports writing (appending) and reading the same file by parallel sets of writers and readers
while maintaining a consistent view, i.e. each reader always sees the same data regardless of the
replica it happens to read from.
GFS:
The Google File System (GFS) is designed to manage relatively large files using a very large
distributed cluster of commodity servers connected by a high-speed network.
It is therefore designed to
(a) expect and tolerate hardware failures, even during the reading or writing of an individual file (since files
are expected to be very large) and
(b) support parallel reads, writes and appends by multiple client programs.
HDFS
The Hadoop Distributed File System (HDFS) is an open source implementation of the GFS
architecture that is also available on the Amazon EC2 cloud platform.
We refer to both GFS and HDFS as ‘cloud file systems.’
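For the append path described earlier, the hedged sketch below uses the HDFS append call; whether append is enabled depends on the HDFS version and configuration, and the file path and record format are assumptions.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical Name Node

        try (FileSystem fs = FileSystem.get(conf)) {
            Path log = new Path("/logs/sales-2011.log");
            // The bytes are replicated across Data Nodes by the write pipeline;
            // in GFS a designated primary chunk server decides the mutation order.
            try (FSDataOutputStream out = fs.append(log)) {
                out.write("txn-1001,XYZ Soap,12.50\n".getBytes(StandardCharsets.UTF_8));
                out.hsync(); // flush the appended record to the Data Nodes
            }
        }
    }
}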
BIGTABLE, HBASE AND DYNAMO
BigTable:
BigTable [9] is a distributed structured storage system built on GFS.
A BigTable is essentially a sparse, distributed, persistent, multidimensional sorted ‘map.’
Data in a BigTable is accessed by a row key, column key and a timestamp.
Each column can store arbitrary name–value pairs of the form (column-family:label, string).
The set of possible column-families for a table is fixed when it is created whereas columns, i.e.
labels within the column family, can be created dynamically at any time.
Column families are stored close together in the distributed file system; thus the BigTable model shares elements of column-oriented databases.
Further, each Bigtable cell (row, column) can contain multiple versions of the data that are stored
in decreasing timestamp order.
Each row in the example stores information about a specific sale transaction, and the row key is a transaction identifier.
The ‘location’ column family stores columns relating to where the sale occurred, whereas the
‘product’ column family stores the actual products sold and their classification.
Note that there are two values for region having different timestamps, possibly because of a
reorganization of sales regions.
Notice also that in this example the data happens to be stored in a de-normalized fashion, as
compared to how one would possibly store it in a relational structure; for example the fact that
XYZ Soap is a Cleaner is not maintained.
HBase:
Hadoop’s HBase is a similar open source system that uses HDFS.
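BigTable’s own API is not publicly available, so the hedged sketch below uses the HBase client API to show the same data model in action: a row keyed by a transaction identifier, columns addressed as column-family:label, and multiple timestamped versions of a single cell. The table name, families, labels and values follow the sales example above and are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSalesExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table sales = conn.getTable(TableName.valueOf("sales"))) {

            // Row key = transaction identifier; columns are family:label pairs.
            Put put = new Put(Bytes.toBytes("txn-1001"));
            put.addColumn(Bytes.toBytes("location"), Bytes.toBytes("city"),
                          Bytes.toBytes("Mumbai"));
            // Two versions of the same cell, distinguished only by timestamp.
            put.addColumn(Bytes.toBytes("location"), Bytes.toBytes("region"),
                          1000L, Bytes.toBytes("West"));
            put.addColumn(Bytes.toBytes("location"), Bytes.toBytes("region"),
                          2000L, Bytes.toBytes("West-2"));
            put.addColumn(Bytes.toBytes("product"), Bytes.toBytes("XYZ Soap"),
                          Bytes.toBytes("Cleaner"));
            sales.put(put);

            // Read back up to two versions of location:region, newest first.
            Get get = new Get(Bytes.toBytes("txn-1001"));
            get.setMaxVersions(2);
            Result result = sales.get(get);
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("location"), Bytes.toBytes("region"))));
        }
    }
}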
Figure 10.5 illustrates how BigTable tables are stored on a
distributed file system such as GFS or HDFS.
Each table is split into different row ranges, called tablets. Each
tablet is managed by a tablet server that stores each column
family for the given row range in a separate distributed file,
called an SSTable.
Additionally, a single Metadata table is managed by a metadata server that is used to locate the tablets of any user table in response to a read or write request.
Dynamo:
Dynamo is another distributed data system; it was developed at Amazon and underlies its SimpleDB key-value pair database.
Unlike BigTable, Dynamo was designed specifically for supporting a large volume of
concurrent updates, each of which could be small in size, rather than bulk reads and appends
as in the case of BigTable and GFS.
Dynamo’s data model is that of simple key-value pairs, and it is expected that applications
read and write such data objects fairly randomly. This model is well suited for many web-based
e-commerce applications that all need to support constructs such as a ‘shopping cart.’
Dynamo does not rely on any underlying distributed file system and instead directly manages
data storage across distributed nodes.
The architecture of Dynamo is illustrated in Figure 10.6. Objects are key-value pairs whose values are arbitrary arrays of bytes.
An MD5 hash of the key is used to generate a 128-bit hash value. The range of this hash function is mapped to a set of virtual nodes arranged in a ring, so each key gets mapped to one virtual node.
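Dynamo itself is not available as a public library, but the key-to-virtual-node mapping it describes is ordinary consistent hashing. The self-contained sketch below illustrates it under assumed node and key names: an MD5 hash places each key on a ring of virtual nodes held in a sorted map, and the key is served by the first virtual node clockwise from that position.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    // Ring positions (128-bit MD5 values) mapped to virtual node names.
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    public void addVirtualNode(String nodeName) throws Exception {
        ring.put(hash(nodeName), nodeName);
    }

    // A key is served by the first virtual node clockwise from its hash.
    public String nodeFor(String key) throws Exception {
        SortedMap<BigInteger, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static BigInteger hash(String s) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(s.getBytes(StandardCharsets.UTF_8)); // 128-bit value
        return new BigInteger(1, digest);
    }

    public static void main(String[] args) throws Exception {
        ConsistentHashRing ring = new ConsistentHashRing();
        // Each physical server would normally own several virtual nodes.
        for (String vnode : new String[] {"serverA-1", "serverA-2", "serverB-1", "serverB-2"}) {
            ring.addVirtualNode(vnode);
        }
        System.out.println(ring.nodeFor("cart:user-42")); // e.g. a shopping-cart key
    }
}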
Notice that the Dynamo architecture is completely symmetric, with each node being equal, unlike the BigTable/GFS architecture, which has special master nodes at both the BigTable and GFS layers.
Dynamo is able to handle transient failures by
passing writes intended for a failed node to another
node temporarily. Such replicas are kept separately
and scanned periodically with replicas being sent
back to their intended node as soon as it is found to
have revived.
Finally, Dynamo can be implemented using different
storage engines at the node level, such as Berkeley
DB or even MySQL.
MapReduce and extensions
The MapReduce programming model was developed at Google in the process of implementing large-scale search and text processing tasks on massive collections of web data stored using BigTable and the GFS distributed file system.
The MapReduce programming model is designed for processing and generating large volumes of data via massively parallel computations utilizing tens of thousands of processors at a time.
Hadoop is an open source implementation of the MapReduce model developed at Yahoo.
Hadoop is also available on pre-packaged AMIs in the Amazon EC2 cloud platform, which
has sparked interest in applying the MapReduce model for large-scale, fault-tolerant
computations in other domains, including such applications in the enterprise context.
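As a concrete, hedged illustration of the model using the Hadoop Java API, the classic word-count job below maps each input line to (word, 1) pairs and reduces them to per-word totals; the input and output paths come from the command line and are assumptions.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}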
PARALLEL COMPUTING
THE MAPREDUCE MODEL
PARALLEL EFFICIENCY OF MAPREDUCE
RELATIONAL OPERATIONS USING MAPREDUCE
ENTERPRISE BATCH PROCESSING USING MAPREDUCE
PARALLEL COMPUTING
Parallel computing has a long history with its origins in scientific computing in the late 60s
and early 70s. Different models of parallel computing have been used based on the nature
and evolution of multiprocessor computer architectures.
The shared-memory model assumes that any processor can access any memory location.
In the distributed-memory model each processor can address only its own memory and
communicates with other processors using message passing over the network.
In scientific computing applications for which these models were developed, it was assumed
that data would be loaded from disk at the start of a parallel job and then written back once
the computations had been completed, as scientific tasks were largely compute bound.
Over time, parallel computing also began to be applied in the database arena; as we have already discussed in the previous chapter, database systems supporting shared-memory, shared-disk and shared-nothing models became available.
PARALLEL COMPUTING
The premise of parallel computing is that a task that takes time T should take time T/p if executed on p
processors.
In practice, inefficiencies are introduced by distributing the computations such as
◦ (a) the need for synchronization among processors,
◦ (b) overheads of communication between processors through messages or disk, and
◦ (c) any imbalance in the distribution of work to processors.
Thus in practice the time Tp to execute on p processors is greater than T/p, and the parallel efficiency of an algorithm is defined as ε = T/(pTp). For example, if a task that takes T = 100 seconds on one processor takes Tp = 12.5 seconds on p = 10 processors, the parallel efficiency is 100/(10 × 12.5) = 0.8.
In the enterprise context there is considerable interest in leveraging the MapReduce model for
high-throughput batch processing, analysis on data warehouses as well as predictive analytics.
MapReduce is a programming model that allows the user to write batch processing jobs with a
small amount of code.
High-throughput batch processing operations on transactional data, usually performed as ‘end-of-day’ processing, often need to access and compute using large data sets. These operations are also naturally time bound, having to complete before transactional operations can resume fully.