
Cloud Data Storage

Introduction
• Relational databases
• The default data storage and retrieval mechanism since the 80s
• Efficient in transaction processing
• Examples: System R, Ingres, etc.
• Replaced hierarchical and network databases
• For scalable web search services:
• Google File System (GFS): a massively parallel and fault-
tolerant distributed file system
• BigTable
• Organizes data
• Similar to column-oriented databases (e.g., Vertica)
• MapReduce
• Parallel programming paradigm
Introduction Contd…
• Suitable for:
• Large-volume, massively parallel text processing
• Enterprise analytics
• Data models similar to BigTable:
• Google App Engine's Datastore
• Amazon's SimpleDB
Relational Databases
• Users/application programs interact with an
RDBMS through SQL
• The RDBMS parser:
• Transforms queries into memory- and disk-level
operations
• Optimizes execution time
• Disk-space management layer:
• Stores data records on pages of contiguous memory
blocks
• Pages are fetched from disk into memory as requested,
using pre-fetching and page replacement policies
Relational Database Contd…
• Database file system layer:
• Independent of the OS file system
• Reason:
• To have full control over retaining or releasing
a page in memory
• Files used by the DB may span multiple disks
to handle large storage
• Uses parallel I/O systems, viz. RAID disk arrays
or multi-processor clusters
Data Storage Techniques
• Row-oriented storage
• Optimal for write-oriented operations, viz. transaction
processing applications
• Relational records: stored on contiguous disk pages
• Accessed through indexes (primary index) on specified
columns
• Example: B+ tree-like storage
• Column-oriented storage
• Efficient for data-warehouse workloads
• Aggregations of measure columns need to be performed
based on values from dimension columns
• Projections of a table are stored sorted by dimension
values
• Requires multiple "join indexes"
• If different projections are to be indexed in sorted order
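To make the row- versus column-oriented distinction concrete, here is a minimal, hypothetical Python sketch (the table, column names, and data are invented for illustration). It stores the same records both ways and shows why an aggregate over one measure column touches far less data in the columnar layout.

```python
# Hypothetical sketch: the same tiny "sales" table in row and column layouts.

# Row-oriented: each record stored contiguously -- good for inserting/updating
# whole records, as in transaction processing.
rows = [
    {"region": "EU", "product": "A", "amount": 120},
    {"region": "US", "product": "B", "amount": 300},
    {"region": "EU", "product": "B", "amount": 180},
]

# Column-oriented: each column stored contiguously -- good for scanning a few
# columns over many records, as in data-warehouse aggregation queries.
columns = {
    "region": ["EU", "US", "EU"],
    "product": ["A", "B", "B"],
    "amount": [120, 300, 180],
}

# Aggregating a measure column ("amount") grouped by a dimension ("region"):
# the row store must touch every field of every record...
totals_row_store = {}
for r in rows:
    totals_row_store[r["region"]] = totals_row_store.get(r["region"], 0) + r["amount"]

# ...while the column store only reads the two columns involved.
totals_col_store = {}
for region, amount in zip(columns["region"], columns["amount"]):
    totals_col_store[region] = totals_col_store.get(region, 0) + amount

assert totals_row_store == totals_col_store == {"EU": 300, "US": 300}
```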
Data Storage Techniques Contd…
Parallel Database Architectures
• Shared memory
• Suitable for servers with multiple CPUs
• Memory address space is shared and
managed by a symmetric multi-processing
(SMP) operating system
• SMP: schedules processes in parallel,
exploiting all the processors
• Shared nothing
• Cluster of independent servers, each with
its own disk space
• Connected by a network
• Shared disk
• Hybrid architecture
• Independent server clusters share
storage through high-speed network
storage, viz. NAS (network attached
storage) or SAN (storage area network)
• Clusters are connected to the storage via
standard Ethernet, or faster Fibre
Channel or InfiniBand connections
Parallel Database Architectures Contd…
Advantages of Parallel DB over
Relational DB
• Efficient execution of SQL queries by
exploiting multiple processors
• For shared-nothing architecture:
• Tables are partitioned and distributed across
multiple processing nodes
• SQL optimizer handles distributed joins
• Distributed two-phase commit and locking for transaction
isolation between processors
• Fault tolerant
• System failures are handled by transferring control to a
"stand-by" system [for transaction processing]
• Restoring computations [for data warehousing
applications]
Advantages of Parallel DB over Relational DB Contd…
• Examples of databases capable of handling
parallel processing:
• Traditional transaction processing databases:
Oracle, DB2, SQL Server
• Data warehousing databases: Netezza,
Vertica, Teradata
Cloud File Systems
• Google File System (GFS)
• Designed to manage relatively large files using a
very large distributed cluster of commodity servers
connected by a high-speed network
• Handles:
• Failures even during reading or writing of an individual file
• Fault tolerance: a necessity, since
P(system failure) = 1 - (1 - p(component failure))^N → 1
for large N (see the numeric sketch after this list)
• Supports parallel reads, writes and appends by multiple
simultaneous client programs
• Hadoop Distributed File System (HDFS)
• Open-source implementation of the GFS architecture
• Available on the Amazon EC2 cloud platform
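A quick numeric sketch of the failure formula above (the per-component failure probability p is an assumed illustrative value): even a very reliable component makes failure of some component near-certain once N is large, which is why GFS treats fault tolerance as a necessity.

```python
# P(some component fails) = 1 - (1 - p)**N for N independent components.
p = 0.001  # assumed per-component failure probability over some period

for n in (10, 1000, 10_000, 100_000):
    p_system = 1 - (1 - p) ** n
    print(f"N = {n:>7}: P(system failure) = {p_system:.4f}")

# With N = 10_000 commodity servers/disks, P(system failure) is already ~1,
# so the file system must expect and mask failures rather than avoid them.
```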
GFS Architecture
(figure: GFS/HDFS architecture, showing the Master (GFS) / Name Node (HDFS))
GFS Architecture
• A single Master controls the file namespace
• Large files are broken up into chunks (GFS) or
blocks (HDFS)
• Typical size of each chunk: 64 MB
• Stored on commodity (Linux) servers called chunk
servers (GFS) or data nodes (HDFS)
• Replicated three times, on different:
• Physical racks
• Network segments
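The sketch below illustrates the chunking and replication idea (chunk size as on the slide; server and rack names are invented): a file is split into 64 MB chunks and each chunk is assigned three replicas on servers in different racks.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks (GFS) / blocks (HDFS)
REPLICATION = 3

# Hypothetical chunk servers grouped by rack.
racks = {
    "rack1": ["cs01", "cs02"],
    "rack2": ["cs03", "cs04"],
    "rack3": ["cs05", "cs06"],
}

def split_into_chunks(file_size: int) -> int:
    """Number of chunks needed for a file of the given size in bytes."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas(chunk_id: int) -> list[str]:
    """Pick one server from each of three different racks (round-robin sketch)."""
    chosen_racks = list(racks)[:REPLICATION]
    return [racks[rack][chunk_id % len(racks[rack])] for rack in chosen_racks]

file_size = 10 * 1024 * 1024 * 1024          # a 10 GB file
print(split_into_chunks(file_size), "chunks")  # 160 chunks of 64 MB
print(place_replicas(0))                       # e.g. ['cs01', 'cs03', 'cs05']
```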
Read Operation in GFS
• The client program sends the full path and offset
of a file to the Master (GFS) or Name Node (HDFS)
• The Master replies with the meta-data for one of the
replicas of the chunk where this data is found
• The client caches the meta-data for faster access
• It then reads the data from the designated
chunk server
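A minimal sketch of the read path described above, assuming an in-memory "master" object; all class and method names here are invented for illustration and are not the real GFS/HDFS API.

```python
# Hypothetical in-memory model of the GFS read path.
CHUNK_SIZE = 64 * 1024 * 1024

class Master:
    """Maps (path, chunk index) -> locations of the replicas of that chunk."""
    def __init__(self, chunk_locations):
        self.chunk_locations = chunk_locations  # {(path, idx): [servers]}

    def lookup(self, path, offset):
        idx = offset // CHUNK_SIZE
        return idx, self.chunk_locations[(path, idx)]

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # client-side meta-data cache for faster access

    def read(self, path, offset):
        key = (path, offset // CHUNK_SIZE)
        if key not in self.cache:                 # contact the master only on a miss
            _, replicas = self.master.lookup(path, offset)
            self.cache[key] = replicas
        chunk_server = self.cache[key][0]         # read from one designated replica
        return f"read({path}, offset={offset}) from {chunk_server}"

master = Master({("/logs/web.log", 0): ["cs01", "cs03", "cs05"]})
client = Client(master)
print(client.read("/logs/web.log", 1024))   # contacts master, caches, reads cs01
print(client.read("/logs/web.log", 2048))   # served from cached meta-data
```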
Write/Append Operation in GFS
• The client program sends the full path of a file to the
Master (GFS) or Name Node (HDFS)
• The Master replies with meta-data for all of the replicas of
the chunk where this data is found
• The client sends the data to be appended to all chunk servers
• The chunk servers acknowledge the receipt of this data
• The Master designates one of these chunk servers as primary
• The primary chunk server appends its copy of the data into the chunk
by choosing an offset
• Appending can also be done beyond EOF to account for multiple
simultaneous writers
• It sends the offset to each replica
• If all replicas do not succeed in writing at the
designated offset, the primary retries
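The append path can be sketched the same way; this is a simplified, hypothetical model (invented names, no real RPCs) of "the primary picks the offset, all replicas apply it, retry on failure".

```python
# Hypothetical sketch of GFS record append: the primary chunk server chooses
# the offset and all replicas must write the record at that same offset.

class ChunkReplica:
    def __init__(self, name):
        self.name = name
        self.data = bytearray()

    def write_at(self, offset, record: bytes) -> bool:
        # Pad (append beyond EOF) if needed, then write; always succeeds here.
        if len(self.data) < offset:
            self.data.extend(b"\x00" * (offset - len(self.data)))
        self.data[offset:offset + len(record)] = record
        return True

def record_append(replicas, record: bytes, primary_index=0):
    primary = replicas[primary_index]
    offset = len(primary.data)             # the primary chooses the offset
    for _attempt in range(3):              # retry if any replica fails
        if all(r.write_at(offset, record) for r in replicas):
            return offset
        offset = len(primary.data)         # pick a fresh offset and retry
    raise RuntimeError("append failed on some replica")

replicas = [ChunkReplica("cs01"), ChunkReplica("cs03"), ChunkReplica("cs05")]
print(record_append(replicas, b"event-1"))   # 0
print(record_append(replicas, b"event-2"))   # 7
```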
Fault Tolerance in GFS
• The Master maintains regular communication with the
chunk servers
• Heartbeat messages
• In case of failures:
• The chunk server's meta-data is updated to reflect the
failure
• For failure of a primary chunk server, the Master
assigns a new primary
• Clients may occasionally still try to access the failed chunk
server
• They then update their meta-data from the Master and retry
BigTable
• Distributed structured storage system built on
GFS
• Sparse, persistent, multi-dimensional sorted
map (key-value pairs)
• Data is accessed by:
• Row key
• Column key
• Time stamp
BigTable Contd…
• Each column can store an arbitrary name-value pair
of the form column-family:label
• The set of possible column families for a table is
fixed when it is created
• Labels within a column family can be created
dynamically and at any time
• Each BigTable cell (row, column) can store
multiple versions of data, in decreasing order
of timestamp
• Since data in each column is stored together, it can
be accessed efficiently
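A tiny sketch of the data model described above, assuming a plain Python dict as the "table" (the row keys, column families, and labels are invented): cells are addressed by (row key, column-family:label) and hold multiple timestamped versions, newest first.

```python
from collections import defaultdict

# table[row_key]["family:label"] -> list of (timestamp, value), newest first
table = defaultdict(lambda: defaultdict(list))

def put(row, column, value, timestamp):
    versions = table[row][column]
    versions.append((timestamp, value))
    versions.sort(key=lambda tv: tv[0], reverse=True)   # decreasing timestamp order

def get(row, column, n_versions=1):
    return table[row][column][:n_versions]

# Column family "anchor" is fixed at table creation; labels are created on the fly.
put("com.example.www", "anchor:homepage", "Example", timestamp=100)
put("com.example.www", "anchor:homepage", "Example site", timestamp=200)
put("com.example.www", "contents:html", "<html>...</html>", timestamp=150)

print(get("com.example.www", "anchor:homepage"))        # latest version only
print(get("com.example.www", "anchor:homepage", 2))     # two most recent versions
```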
BigTable Storage
BigTable Storage Contd…
• Each table is split into different row ranges, called
tablets
• Each tablet is managed by a tablet server:
• Stores each column family for a given row range in a separate
distributed file, called an SSTable
• A single meta-data table is managed by a meta-data
server
• Locates the tablets of any user table in response to a
read/write request
• The meta-data itself can be very large:
• The meta-data table can be similarly split into multiple tablets
• A root tablet points to the other meta-data tablets
• Supports large numbers of parallel reads and inserts, even
simultaneously on the same table
• Insertions are done in sorted fashion, and require more
work than appends
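To illustrate how a table split into row ranges (tablets) is looked up, here is a small sketch using sorted range boundaries and binary search; the tablet names and boundaries are invented for illustration.

```python
import bisect

# Tablets cover contiguous, sorted row-key ranges; each entry is
# (last row key covered by the tablet, tablet/tablet-server name).
tablets = [
    ("g", "tablet-1@server-A"),
    ("p", "tablet-2@server-B"),
    ("~", "tablet-3@server-C"),   # "~" sorts after typical keys: the last tablet
]
boundaries = [end for end, _ in tablets]

def locate_tablet(row_key: str) -> str:
    """Meta-data lookup: find the tablet whose row range contains row_key."""
    i = bisect.bisect_left(boundaries, row_key)
    return tablets[i][1]

print(locate_tablet("apple"))     # tablet-1@server-A
print(locate_tablet("kiwi"))      # tablet-2@server-B
print(locate_tablet("zebra"))     # tablet-3@server-C
```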
Dynamo
• Developed by Amazon
• Supports large volumes of concurrent updates, each
of which could be small in size
• Different from BigTable, which is geared towards bulk reads and writes
• Data model for Dynamo:
• Simple <key, value> pairs
• Well-suited for Web-based e-commerce applications
• Not dependent on any underlying distributed file system
(e.g., GFS/HDFS) for:
• Failure handling
• Data replication
• Forwarding write requests to other replicas if the intended one is
down
• Conflict resolution
Dynamo Architecture
Dynamo Architecture Contd…
• Objects: <key, value> pairs, where values are arbitrary
arrays of bytes
• MD5 generates a 128-bit hash of the key
• The range of this hash function is mapped to a set
of virtual nodes arranged in a ring
• Each key gets mapped to one virtual node
• The object is replicated at its primary virtual
node as well as at (N-1) additional virtual nodes
• N: number of physical nodes
• Each physical node (server) manages a number of
virtual nodes at distributed positions on the
ring
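The ring construction can be sketched as follows, assuming MD5 as on the slide; the node names and the choice of hashing each key to the next virtual node clockwise are illustrative, not Amazon's exact implementation (which, for example, also avoids placing replicas on virtual nodes of the same physical server).

```python
import bisect
import hashlib

def md5_position(data: str) -> int:
    """128-bit MD5 hash interpreted as a position on the ring."""
    return int(hashlib.md5(data.encode()).hexdigest(), 16)

# Virtual nodes placed on the ring; several map to one physical server.
virtual_nodes = {f"server{i}-vnode{j}": md5_position(f"server{i}-vnode{j}")
                 for i in range(3) for j in range(4)}
ring = sorted(virtual_nodes.items(), key=lambda kv: kv[1])
positions = [pos for _, pos in ring]

def preference_list(key: str, n_replicas: int = 3):
    """Primary virtual node = first node clockwise from hash(key);
    the object is also replicated on the next (n_replicas - 1) nodes."""
    start = bisect.bisect(positions, md5_position(key)) % len(ring)
    return [ring[(start + k) % len(ring)][0] for k in range(n_replicas)]

print(preference_list("cart:user-42"))   # three virtual nodes holding replicas
```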
Dynamo Architecture Contd…
• Load balancing for:
• Transient failures
• Network partitions
• A write request on an object is:
• Executed at one of its virtual nodes
• Which forwards the request to all nodes that hold
replicas of the object
• Quorum protocol: maintains eventual consistency of
the replicas when a large number of concurrent reads
and writes take place
Dynamo Architecture Contd…
• Distributed object versioning
• A write creates a new version of an object with its
local timestamp incremented
• Timestamps (vector timestamps):
• Capture the history of updates
• Versions that are superseded by a later version (having a larger
vector timestamp) are discarded
• If multiple write operations on the same object occur at the
same time, all versions are maintained and returned to
read requests
• If a conflict occurs:
• Resolution is done by application-dependent logic
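A small sketch of vector-timestamp comparison (node names invented): one version supersedes another only if its vector dominates in every component; otherwise the versions are concurrent and both are kept for the application to reconcile.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if vector timestamp a supersedes b (a >= b everywhere, > somewhere)."""
    nodes = set(a) | set(b)
    at_least = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    strictly = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    return at_least and strictly

v1 = {"nodeA": 2, "nodeB": 1}
v2 = {"nodeA": 1, "nodeB": 1}
v3 = {"nodeA": 1, "nodeB": 2}

print(dominates(v1, v2))   # True  -> v2 is discarded
print(dominates(v1, v3))   # False -> concurrent versions: both returned to reads,
print(dominates(v3, v1))   # False    and the conflict is resolved by the application
```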
Dynamo Architecture Contd…
• Quorum consistency:
• A read operation accesses R replicas
• A write operation accesses W replicas
• If (R + W) > N, the system is said to be quorum consistent
• Overheads:
• For efficient writes, a larger number of replicas has to be read
• For efficient reads, a larger number of replicas has to be written
to
• Dynamo:
• Implemented with different storage engines at the node
level: Berkeley DB (used by Amazon), MySQL, etc.
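The quorum condition can be checked directly; the tiny sketch below also verifies by brute force why R + W > N guarantees that any read quorum overlaps any write quorum (the values of N, R, W are illustrative).

```python
from itertools import combinations

def quorum_consistent(n: int, r: int, w: int) -> bool:
    return r + w > n

def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Brute-force check: every R-subset of replicas intersects every W-subset."""
    replicas = range(n)
    return all(set(rs) & set(ws)
               for rs in combinations(replicas, r)
               for ws in combinations(replicas, w))

for n, r, w in [(3, 2, 2), (3, 1, 2)]:
    print(n, r, w, quorum_consistent(n, r, w), quorums_always_overlap(n, r, w))

# (3, 2, 2): quorum consistent, every read is guaranteed to see the latest write
# (3, 1, 2): R + W = N, so a read can miss the most recent write
```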
Datastore
• Google and Amazon offer simple transactional
<key, value> pair database stores
• Google App Engine's Datastore
• Amazon SimpleDB
• All entities (objects) in Datastore reside in
one BigTable table
• Does not exploit column-oriented storage
• Entities table: stores all data in one column family
Datastore Contd…
• Multiple index tables are used to support
efficient queries
• BigTable:
• Horizontally partitioned (also called sharded) across
disks
• Sorted lexicographically by key values
• Besides lexicographic sorting, Datastore enables:
• Efficient execution of prefix and range queries on key
values
• Entities are 'grouped' for transaction purposes
• Keys are formed lexicographically from group ancestry
• Entities in the same group: stored close together on disk
• Index tables: support a variety of queries
• Use values of entity attributes as keys
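Because entity keys are stored lexicographically sorted, prefix and range queries reduce to binary search over the key order; a minimal sketch (the keys and their ancestry encoding are invented) is shown below.

```python
import bisect

# Keys sorted lexicographically, as in the underlying BigTable table; here a
# hypothetical key encodes group ancestry: "<root entity>/<child kind>/<id>".
keys = sorted([
    "user:alice/order/001",
    "user:alice/order/002",
    "user:bob/order/001",
    "user:carol/order/007",
])

def prefix_query(prefix: str):
    """All keys starting with `prefix`: one contiguous slice of the sorted list."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\xff")   # just past the last match
    return keys[lo:hi]

def range_query(low: str, high: str):
    return keys[bisect.bisect_left(keys, low):bisect.bisect_right(keys, high)]

print(prefix_query("user:alice/"))             # both of alice's orders
print(range_query("user:bob", "user:carol~"))  # bob's and carol's entities
```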
Datastore Contd….
• Automatically created indexes:
• Single-property indexes
• Support efficient lookup of records with a WHERE clause
• Kind indexes
• Support efficient lookup of queries of the form SELECT ALL
• Configurable indexes
• Composite indexes:
• Serve more complex queries
• Query execution
• The index with the highest selectivity is chosen
MapReduce
• MapReduce: a programming model developed at Google
• Objective:
• Implement large-scale search
• Text processing on massively scalable web data stored
using BigTable and the GFS distributed file system
• Designed for processing and generating large
volumes of data via massively parallel
computations, utilizing tens of thousands of
processors at a time
• Fault tolerant: ensures progress of the computation
even if processors and networks fail
• Example:
• Hadoop: open-source implementation of MapReduce
(developed at Yahoo!)
• Available on pre-packaged AMIs on Amazon EC2 cloud
platform
Parallel Computing
• Different models of parallel computing, depending on the:
• Nature and evolution of multiprocessor computer architectures
• Shared-memory model
• Assumes that any processor can access any memory location
• Unequal latency
• Distributed-memory model
• Each processor can access only its own memory and communicates
with other processors using message passing
• Parallel computing:
• Developed for compute-intensive scientific tasks
• Later found application in the database area:
• Shared memory
• Shared disk
• Shared nothing
Parallel Efficiency
• If a task takes time T on a uniprocessor system, it
should take T/p if executed on p processors
• Inefficiencies are introduced in distributed computation
due to:
• Need for synchronization among processors
• Overhead of message communication between processors
• Imbalance in the distribution of work to processors
• Parallel efficiency of an algorithm is defined as:
∈ = T / (p × T_p)
• Scalable parallel implementation:
• Parallel efficiency remains constant as the size of the data is
increased along with a corresponding increase in processors
• Parallel efficiency increases with the size of the data for a fixed
number of processors
Illustration
• Problem: Consider a very large collection of
documents, say web pages crawled from the entire
internet. The problem is to determine the
frequency (i.e., the total number of occurrences) of
each word in this collection. Thus, if there are n
documents and m distinct words, we wish to
determine m frequencies, one for each word.
• Two approaches:
• Let each processor compute the frequencies for m/p words
• Let each processor compute the frequencies of m words
across n/p documents, followed by all the processors
summing their results
• Parallel computing is implemented as a
distributed-memory model with a shared disk, so
that each processor is able to access any document
Illustration Contd…
• Time to read each word from a document = time to
send a word to another processor via inter-process
communication = c
• Time to add to a running total of frequencies:
negligible
• Each word occurs f times in a document (on
average)
• Time for computing all m frequencies with a single
processor = n × m × f × c
• First approach:
• Each processor reads at most n × m × f words
• Parallel efficiency is calculated as:
∈_a = nmfc / (p × nmfc) = 1/p
• Efficiency falls with increasing p
Illustration Contd…
• Second approach:
• Number of reads performed by each processor = (n/p) × m × f
• Time taken to read = (n/p) × m × f × c
• Time taken to write partial frequencies of the m words
in parallel to disk = c × m
• Time taken to communicate partial frequencies to
(p-1) processors and then locally add the p sub-
vectors to generate 1/p of the final m-vector of
frequencies = p × (m/p) × c
• Parallel efficiency is computed as:
∈_b = nmfc / [p × ((n/p)mfc + cm + p(m/p)c)]
    = nf / (nf + 2p) = 1 / (1 + 2p/nf)
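Plugging illustrative numbers into the two efficiency formulas makes the difference tangible (n, f, and p below are assumed values, not from the slides):

```python
def efficiency_first(p: int) -> float:
    """First approach: each processor scans the whole collection, so E_a = 1/p."""
    return 1 / p

def efficiency_second(n: int, f: int, p: int) -> float:
    """Second approach: E_b = 1 / (1 + 2p / (n*f))."""
    return 1 / (1 + 2 * p / (n * f))

n, f = 1_000_000, 10          # documents and average occurrences per word
for p in (10, 100, 1000):
    print(p, round(efficiency_first(p), 4), round(efficiency_second(n, f, p), 6))

# The first approach degrades as 1/p, while the second stays close to 1
# as long as p << n*f.
```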
Illustration Contd…
• Since p << nf, the efficiency of the second approach is
higher than that of the first
• In the first approach, each processor reads
many words that it need not read, resulting in
wasted work
• In the second approach every read is useful, in
that it results in a computation that
contributes to the final answer
• Scalable:
• Efficiency remains constant as both n and p
increase proportionally
• Efficiency tends to 1 for fixed p and gradually
increasing n
MapReduce Model
• Parallel programming abstraction
• Used by many different parallel applications
which carry out large-scale computation
involving thousands of processors
• Leverages a common underlying fault-tolerant
implementation
• Two phases of MapReduce:
• Map operation
• Reduce operation
• A configurable number of M ‘mapper’ processors
and R ‘reducer’ processors are assigned to work
on the problem
MapReduce Model Contd…
• Map phase:
• Each mapper reads approximately 1/M of the input from
the global file system, using locations given by the
master
• The map operation consists of transforming one set of
key-value pairs to another:
• map: (k_1, v_1) → [(k_2, v_2)]
• Each mapper writes its computation results into one file
per reducer
• Files are sorted by key and stored on the local
file system
• The master keeps track of the locations of these
files
MapReduce Model Contd…
• Reduce phase:
• The master informs the reducers where the partial
computations have been stored in local files on the
respective mappers
• The reducers make remote procedure call requests to the
mappers to fetch the files
• Each reducer groups the results of the map step by
key and performs a function f on the list of values
that correspond to each key:
• reduce: (k_2, [v_2]) → (k_2, f([v_2]))
• Final results are written back to the GFS file
system
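A self-contained word-count sketch of the two phases just described: a plain-Python simulation of the model (not the Hadoop API), using the map and reduce signatures from the slides; the documents are invented examples.

```python
from collections import defaultdict

# map: (k1, v1) -> [(k2, v2)]   here: (doc_id, text) -> [(word, 1), ...]
def map_fn(doc_id, text):
    return [(word, 1) for word in text.split()]

# reduce: (k2, [v2]) -> (k2, f([v2]))   here: (word, [1, 1, ...]) -> (word, count)
def reduce_fn(word, counts):
    return word, sum(counts)

documents = {"d1": "cloud data storage", "d2": "cloud file system for cloud data"}

# Map phase: every mapper emits intermediate key-value pairs.
intermediate = []
for doc_id, text in documents.items():
    intermediate.extend(map_fn(doc_id, text))

# Shuffle: group intermediate pairs by key (done by sorting/partitioning
# into one file per reducer in the real system).
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase: apply the reduce function to each group.
result = dict(reduce_fn(w, counts) for w, counts in groups.items())
print(result)   # {'cloud': 3, 'data': 2, 'storage': 1, 'file': 1, 'system': 1, 'for': 1}
```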
MapReduce: Fault Tolerance
• Heartbeat communication
• Updates are exchanged regarding the status of tasks
assigned to workers
• If communication exists but no progress is made, the master
duplicates those tasks and assigns them to processors that
have already completed their work
• If a mapper fails, the master reassigns the
key range designated to it to another working
node for re-execution
• Re-execution is required because the partial computations
were written into local files, rather than the GFS file
system
• If a reducer fails, only its remaining tasks
are reassigned to another node, since the
completed tasks have already written their results to the GFS file system
MapReduce: Efficiency
• Consider a general computation task on a volume of data D
• It takes wD time on a uniprocessor (time to read data
from disk + time to perform the computation + time to write
back to disk)
• Time to read/write one word from/to disk = c
• Now, the computational task is decomposed into map
and reduce stages as follows:
• Map stage:
• Mapping time = c_m D
• Data produced as output = σD
• Reduce stage:
• Reducing time = c_r σD
• Data produced as output = σμD
MapReduce: Efficiency
• Assuming no overheads in decomposing a task
into a map and a reduce stage, we have the
relation:
• wD = cD + c_m D + c_r σD + c σμD
• Now, we use P processors that serve as both mappers
and reducers in the respective phases to solve the
problem
• Additional overhead:
• Each mapper writes to its local disk, followed by each
reducer remotely reading from the local disk of each
mapper
• For analysis purposes, the time to read a word locally
or remotely is taken to be the same
• Time to read data from disk by each mapper = wD/P
MapReduce: Efficiency Contd….
• Time required by each mapper to write its intermediate
output to local disk = cσD/P
• Data read by each reducer from its partition in
each of the P mappers = σD/P²
• The entire exchange can be executed in P steps,
with each reducer r reading from mapper (r + i) mod P
in step i
• Transfer time from the mappers' local disks to each
reducer = cσD/P
• Total time per processor, including the overhead of
intermediate disk reads and writes = wD/P + 2cσD/P
• Parallel efficiency of the MapReduce implementation:
∈_MR = wD / [P × (wD/P + 2cσD/P)] = 1 / (1 + 2(c/w)σ)
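Plugging assumed values of c/w and σ into the efficiency expression above shows how the volume of intermediate data governs the overhead (the numbers are illustrative only):

```python
def mapreduce_efficiency(c_over_w: float, sigma: float) -> float:
    """E_MR = 1 / (1 + 2 * (c/w) * sigma)."""
    return 1 / (1 + 2 * c_over_w * sigma)

# c/w: fraction of the per-word work that is pure disk I/O (assumed value),
# sigma: ratio of map output volume to input volume.
for sigma in (0.1, 1.0, 10.0):
    print(sigma, round(mapreduce_efficiency(c_over_w=0.2, sigma=sigma), 3))

# Efficiency stays high when the map stage compresses the data (sigma << 1)
# and drops sharply when it expands it (sigma >> 1).
```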
MapReduce: Applications
• Indexing a large collection of documents
• An important aspect of web search, as well as of handling
structured data
• The map task emits a word / document-id pair for each
word in a document: (d_k, [w_1 ... w_n]) → [(w_i, d_k)]
• The reduce step groups the pairs by word and creates
an index entry for each word: (w_i, [d_k]) → (w_i, [d_i1 ... d_im])
• Relational operations using MapReduce
• Execute SQL statements (relational joins / group by)
on large data sets
• Advantages over parallel databases
• Large scale
