
Lecture 8

DATA-INTENSIVE TECHNOLOGIES FOR CLOUD COMPUTING
TRENDS
 Massive data
 Thousands to millions of cores
  Consolidated data centers
  Shift from the clock-rate battle to multicore, then many-core
 Cheap hardware
  Failures are the norm
 VM-based systems
 Making large-scale data processing accessible (easy to use)
 More people requiring large-scale data processing
  Shift from academia to industry
BIG DATA EVERYWHERE!
 Lots of data is being collected and warehoused
  Web data, e-commerce
  Purchases at department/grocery stores
  Bank/credit card transactions
  Social networks
HOW MUCH DATA?
 Google processes 20 PB a day (2008)
 Wayback Machine has 3 PB + 100 TB/month (3/2009)
 Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
 eBay has 6.5 PB of user data + 50 TB/day (5/2009)
 CERN's Large Hadron Collider (LHC) generates 15 PB a year

"640K ought to be enough for anybody."
(Photo: Maximilien Brice, © CERN)
THE EARTHSCOPE
• The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
TYPE OF DATA
 Relational Data (Tables/Transaction/Legacy Data)
 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data
  Social Network, Semantic Web (RDF), …
 Streaming Data
  You can only scan the data once
WHAT’S BIG DATA?
No single definition; here is from Wikipedia:

 Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
 The challenges include capture, curation, storage, search, sharing,
transfer, analysis, and visualization.
 The trend to larger data sets is due to the additional information
derivable from analysis of a single large set of related data, as
compared to separate smaller sets with the same total amount of
data, allowing correlations to be found to "spot business trends,
determine quality of research, prevent diseases, link legal citations,
combat crime, and determine real-time roadway traffic conditions.”
BIG DATA: 3V'S
(Figure: the three V's of big data: Volume, Velocity, Variety)
MOVING TOWARDS…
 Distributed file systems
  HDFS, etc.
 Distributed key-value stores
 Data-intensive parallel application frameworks
  MapReduce
  High-level languages
 Science in the clouds
DISTRIBUTED DATA STORAGE
CLOUD DATA STORES (NO-SQL)
 Schema-less
 Shared nothing architecture
 Elasticity
 Sharding
 Asynchronous replication
 BASE instead of ACID

http://nosqlpedia.com/wiki/Survey_distributed_databases
CLOUD DATA STORES (NO-SQL)
 Schema-less
  "Tables" don't have a pre-defined schema. Records have a variable number of fields that can vary from record to record. Record contents and semantics are enforced by applications (see the sketch below).
 Shared nothing architecture
  Instead of using a common storage pool (e.g., SAN), each server uses only its own local storage. This allows storage to be accessed at local disk speeds instead of network speeds, and it allows capacity to be increased by adding more nodes. Cost is also reduced since commodity hardware can be used.
 Elasticity
  Both storage and server capacity can be added on the fly by merely adding more servers. No downtime is required. When a new node is added, the database begins giving it something to do and requests to fulfill.

http://nosqlpedia.com/wiki/Survey_distributed_databases
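
To make "schema-less" concrete, here is a minimal sketch (hypothetical code, not tied to any particular store) in which each record is just a map of named fields, so two records in the same logical table can carry entirely different fields:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of schema-less records: each record is a map of
// named fields, and different records may carry different fields entirely.
// The application, not the store, decides what the fields mean.
public class SchemaLessExample {
    public static void main(String[] args) {
        Map<String, Map<String, Object>> users = new HashMap<>();

        Map<String, Object> alice = new HashMap<>();
        alice.put("name", "Alice");
        alice.put("email", "alice@example.com");

        Map<String, Object> bob = new HashMap<>();
        bob.put("name", "Bob");
        bob.put("lastLogin", 1315353600L);               // a field Alice's record lacks
        bob.put("tags", new String[] {"beta", "mobile"});

        users.put("user:1", alice);                      // records addressed by key
        users.put("user:2", bob);

        System.out.println(users.get("user:2").get("lastLogin"));
    }
}
```
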
CLOUD DATA STORES (NO-SQL)
 Sharding
  Instead of viewing the storage as a monolithic space, records are partitioned into shards. Usually, a shard is small enough to be managed by a single server, though shards are usually replicated. Sharding can be automatic (e.g., an existing shard splits when it gets too big), or applications can assist in data sharding by assigning each record a partition ID (see the sketch after this list).
 Asynchronous replication
  Compared to RAID storage (mirroring and/or striping) or synchronous replication, NoSQL databases employ asynchronous replication. This allows writes to complete more quickly since they don't depend on extra network traffic. One side effect of this strategy is that data is not immediately replicated and could be lost in certain windows. Also, locking is usually not available to protect all copies of a specific unit of data.
 BASE instead of ACID
  NoSQL databases emphasize performance and availability. This requires prioritizing the components of the CAP theorem (described elsewhere) in a way that tends to make true ACID transactions implausible.

http://nosqlpedia.com/wiki/Survey_distributed_databases
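
A rough sketch of the application-assisted sharding idea above: derive a shard number from each record's partition ID. This is a deliberately naive modulo scheme for illustration only; production stores typically use consistent hashing (see the Dynamo slide later) so that adding shards does not remap most keys.

```java
// Hypothetical sketch: route each record to one of N shards from its partition ID.
public class SimpleSharder {
    private final int shardCount;

    public SimpleSharder(int shardCount) {
        this.shardCount = shardCount;
    }

    /** Maps a partition ID chosen by the application (e.g. a customer ID) to a shard. */
    public int shardFor(String partitionId) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(partitionId.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        SimpleSharder sharder = new SimpleSharder(8);
        System.out.println("customer:42 -> shard " + sharder.shardFor("customer:42"));
    }
}
```
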
ACID VS BASE

ACID (Atomicity, Consistency, Isolation, Durability)
 Strong consistency
 Isolation
 Focus on "commit"
 Nested transactions
 Availability?
 Conservative (pessimistic)
 Difficult evolution (e.g. schema)

BASE (Basically Available, Soft state, Eventual consistency)
 Weak consistency, stale data OK
 Availability first
 Best effort
 Approximate answers OK
 Aggressive (optimistic)
 Simpler!
 Faster
 Easier evolution
GOOGLE BIGTABLE
 Data Model
  A sparse, distributed, persistent, multidimensional sorted map
  Indexed by a row key, column key, and a timestamp
  A table contains column families; column keys are grouped into column families
  Row ranges are stored as tablets (sharding)
 Supports single-row transactions
 Uses the Chubby distributed lock service to manage masters and tablet locks
 Based on GFS
 Supports running scripts and MapReduce

Fay Chang, et al. "Bigtable: A Distributed Storage System for Structured Data".
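
The Bigtable paper defines the data model as a map from (row:string, column:string, time:int64) to string. The following is only an in-memory sketch of that indexing scheme, not of Bigtable itself; keeping rows in sorted order is what makes it cheap to carve contiguous row ranges off as tablets.

```java
import java.util.TreeMap;

// Sketch of Bigtable's data model: (row, column, timestamp) -> value,
// with rows kept sorted so contiguous row ranges (tablets) are easy to split.
public class TinyBigtableModel {
    // row -> column ("family:qualifier") -> timestamp -> value
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> table = new TreeMap<>();

    public void put(String row, String column, long timestamp, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>())
             .put(timestamp, value);
    }

    /** Returns the most recent version of a cell, or null if the cell is empty. */
    public String getLatest(String row, String column) {
        TreeMap<String, TreeMap<Long, String>> columns = table.get(row);
        if (columns == null || !columns.containsKey(column)) return null;
        return columns.get(column).lastEntry().getValue();
    }

    public static void main(String[] args) {
        TinyBigtableModel t = new TinyBigtableModel();
        t.put("com.cnn.www", "contents:", 3L, "<html>...v3</html>");
        t.put("com.cnn.www", "anchor:cnnsi.com", 9L, "CNN");
        System.out.println(t.getLatest("com.cnn.www", "contents:"));
    }
}
```
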
AMAZON DYNAMO

Problem: Partitioning
  Technique: Consistent hashing
  Advantage: Incremental scalability

Problem: High availability for writes
  Technique: Vector clocks with reconciliation during reads
  Advantage: The number of versions is decoupled from update rates

Problem: Handling temporary failures
  Technique: Sloppy quorum and hinted handoff
  Advantage: Provides high availability and a durability guarantee when some of the replicas are not available

Problem: Recovering from permanent failures
  Technique: Using Merkle trees
  Advantage: Synchronizes divergent replicas in the background

Problem: Membership and failure detection
  Technique: Gossip-based membership protocol and failure detection
  Advantage: Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information

DeCandia, G., et al. 2007. Dynamo: Amazon's highly available key-value store. In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (Stevenson, Washington, USA, October 14-17, 2007). SOSP '07. ACM, 205-220.
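
A minimal sketch of the consistent-hashing idea Dynamo uses for partitioning. This is simplified for illustration (no virtual nodes, no preference lists, MD5 truncated to 64 bits); the point is that adding or removing a node only moves the keys in one arc of the ring.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified consistent-hashing ring: nodes and keys are hashed onto the same
// circular space; a key lives on the first node at or after its hash position.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node)    { ring.put(hash(node), node); }
    public void removeNode(String node) { ring.remove(hash(node)); }

    /** The node responsible for the given key. */
    public String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        // Wrap around to the first node if the hash falls past the last position.
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("node-A");
        ring.addNode("node-B");
        ring.addNode("node-C");
        System.out.println(ring.nodeFor("shopping-cart:12345"));
    }
}
```
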
NO-SQL DATA STORES

http://nosqlpedia.com/wiki/Survey_distributed_databases
GFS
SECTOR
File system comparison: GFS/HDFS vs. Lustre vs. Sector

Architecture
  GFS/HDFS: Cluster-based, asymmetric, parallel
  Lustre:   Cluster-based, asymmetric, parallel
  Sector:   Cluster-based, asymmetric, parallel
Communication
  GFS/HDFS: RPC/TCP
  Lustre:   Network independence
  Sector:   UDT
Naming
  GFS/HDFS: Central metadata server
  Lustre:   Central metadata server
  Sector:   Multiple metadata masters
Synchronization
  GFS/HDFS: Write-once-read-many, locks on object leases
  Lustre:   Hybrid locking mechanism using leases, distributed lock manager
  Sector:   General purpose I/O
Consistency and replication
  GFS/HDFS: Server-side replication, async replication, checksum
  Lustre:   Server-side metadata replication, client-side caching, checksum
  Sector:   Server-side replication
Fault tolerance
  GFS/HDFS: Failure as norm
  Lustre:   Failure as exception
  Sector:   Failure as norm
Security
  GFS/HDFS: N/A
  Lustre:   Authentication, authorization
  Sector:   Security server-based authentication, authorization
DATA-INTENSIVE PARALLEL PROCESSING FRAMEWORKS
MAPREDUCE
 General purpose massive data analysis in brittle environments
  Commodity clusters
  Clouds
 Efficiency, scalability, redundancy, load balance, fault tolerance
 Apache Hadoop
  HDFS
 Microsoft DryadLINQ
EXECUTION OVERVIEW
(Figure: MapReduce execution overview)
Source: http://code.google.com/edu/parallel/mapreduce-tutorial.html
MAPREDUCE
1. The MapReduce library in the user program first shards the input files
into M pieces of typically 16 megabytes to 64 megabytes (MB) per
piece. It then starts up many copies of the program on a cluster of
machines.
2. One of the copies of the program is special: the master. The rest are
workers that are assigned work by the master. There are M map tasks
and R reduce tasks to assign. The master picks idle workers and assigns
each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the
corresponding input shard. It parses key/value pairs out of the input data
and passes each pair to the user-defined Map function. The intermediate
key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into
R regions by the partitioning function. The locations of these buffered
pairs on the local disk are passed back to the master, who is responsible
for forwarding these locations to the reduce workers.
MAPREDUCE
5. When a reduce worker is notified by the master about these
locations, it uses remote procedure calls to read the buffered data
from the local disks of the map workers. When a reduce worker has
read all intermediate data, it sorts it by the intermediate keys so that
all occurrences of the same key are grouped together. If the amount
of intermediate data is too large to fit in memory, an external sort is
used.
6. The reduce worker iterates over the sorted intermediate data and for
each unique intermediate key encountered, it passes the key and the
corresponding set of intermediate values to the user's Reduce
function. The output of the Reduce function is appended to a final
output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the
master wakes up the user program. At this point, the MapReduce
call in the user program returns back to the user code.
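
Step 4 above relies on a partitioning function to split the intermediate key space into R regions; the default described in the MapReduce paper is simply a hash of the key modulo R. A small sketch of that default (using Java's hashCode as the hash):

```java
// Default-style partitioning: hash(key) mod R picks which of the R reduce
// tasks receives an intermediate key. Every pair with the same key lands in
// the same region, so a single reducer sees all values for that key.
public final class HashPartitioner {
    private HashPartitioner() {}

    public static int partition(String intermediateKey, int numReduceTasks) {
        // Mask the sign bit so the result is always in [0, numReduceTasks).
        return (intermediateKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int r = 4;
        for (String key : new String[] {"foo", "bar", "car"}) {
            System.out.println(key + " -> reduce task " + partition(key, r));
        }
    }
}
```
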
WORD COUNT
Input:     foo car bar | foo bar foo | car car car
Mapping:   (foo,1) (car,1) (bar,1) (foo,1) (bar,1) (foo,1) (car,1) (car,1) (car,1)
Shuffling: foo -> 1,1,1   bar -> 1,1   car -> 1,1,1,1
Reducing:  (foo,3) (bar,2) (car,4)
WORD COUNT
Input:              foo car bar | foo bar foo | car car car
Mapping:            (foo,1) (car,1) (bar,1) (foo,1) (bar,1) (foo,1) (car,1) (car,1) (car,1)
Shuffling/Sorting:  bar -> <1,1>   car -> <1,1,1,1>   foo -> <1,1,1>
Reducing:           (bar,2) (car,4) (foo,3)
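
The same word-count pipeline written against the Hadoop MapReduce API, following the stock Hadoop WordCount example (map emits (word, 1), reduce sums the counts; the job-submission driver is omitted):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: split each input line into words and emit (word, 1) for every word.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: for each word, sum the 1s produced by the mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```
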
HADOOP & DRYADLINQ

Apache Hadoop
 Apache implementation of Google's MapReduce
 A master node (Job Tracker) assigns Map (M) and Reduce (R) tasks to the data/compute nodes
 The Hadoop Distributed File System (HDFS) manages the data (Name Node, replicated data blocks)
 Map/Reduce tasks are scheduled based on data locality in HDFS

Microsoft DryadLINQ
 Standard LINQ and DryadLINQ operations are compiled by the DryadLINQ compiler into Directed Acyclic Graph (DAG) based execution flows (vertex: execution task, edge: communication path)
 The Dryad execution engine processes the DAG, executing vertices on compute clusters
 LINQ provides a query interface for structured data
 Provides Hash, Range, and Round-Robin partition patterns

Both provide job creation, resource management, and fault tolerance with re-execution of failed tasks/vertices.

Judy Qiu, "Cloud Technologies and Their Applications", Indiana University Bloomington, March 26, 2010
Feature comparison: programming model, data storage, communication, scheduling & load balancing

Hadoop
  Programming model: MapReduce
  Data storage: HDFS
  Communication: TCP
  Scheduling & load balancing: Data locality; rack-aware dynamic task scheduling through a global queue; natural load balancing

Dryad
  Programming model: DAG-based execution flows
  Data storage: Windows shared directories (Cosmos)
  Communication: Shared files / TCP pipes / shared-memory FIFO
  Scheduling & load balancing: Data locality / network topology based run-time graph optimizations; static scheduling

Twister
  Programming model: Iterative MapReduce
  Data storage: Shared file system / local disks
  Communication: Content Distribution Network / direct TCP
  Scheduling & load balancing: Data locality based static scheduling

MapReduceRoles4Azure
  Programming model: MapReduce
  Data storage: Azure Blob Storage
  Communication: TCP through Azure Blob Storage (direct TCP)
  Scheduling & load balancing: Dynamic scheduling through a global queue; good natural load balancing

MPI
  Programming model: Variety of topologies
  Data storage: Shared file systems
  Communication: Low-latency communication channels
  Scheduling & load balancing: Available processing capabilities; user controlled
Feature comparison: failure handling, monitoring, language support, execution environment

Hadoop
  Failure handling: Re-execution of map and reduce tasks
  Monitoring: Web-based monitoring UI, API
  Language support: Java; executables are supported via Hadoop Streaming; PigLatin
  Execution environment: Linux cluster, Amazon Elastic MapReduce, FutureGrid

Dryad
  Failure handling: Re-execution of vertices
  Monitoring: Monitoring support for execution graphs
  Language support: C# + LINQ (through DryadLINQ)
  Execution environment: Windows HPCS cluster

Twister
  Failure handling: Re-execution of iterations
  Monitoring: API to monitor the progress of jobs
  Language support: Java; executables via Java wrappers
  Execution environment: Linux cluster, FutureGrid

MapReduceRoles4Azure
  Failure handling: Re-execution of map and reduce tasks
  Monitoring: API, web-based monitoring UI
  Language support: C#
  Execution environment: Windows Azure Compute, Windows Azure Local Development Fabric

MPI
  Failure handling: Program-level checkpointing
  Monitoring: Minimal support for task-level monitoring
  Language support: C, C++, Fortran, Java, C#
  Execution environment: Linux/Windows cluster
INHOMOGENEOUS DATA PERFORMANCE
(Plot: Randomly distributed inhomogeneous data, mean 400, dataset size 10000; time (s) vs. standard deviation of sequence lengths, for DryadLINQ SWG, Hadoop SWG, and Hadoop SWG on VM)

Inhomogeneity of data does not have a significant effect when the sequence lengths are randomly distributed.
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataPlex (32 nodes).
INHOMOGENEOUS DATA PERFORMANCE
(Plot: Skewed distributed inhomogeneous data, mean 400, dataset size 10000; total time (s) vs. standard deviation, for DryadLINQ SWG, Hadoop SWG, and Hadoop SWG on VM)

This shows the natural load balancing of Hadoop MapReduce's dynamic task assignment using a global pipeline, in contrast to DryadLINQ's static assignment.
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataPlex (32 nodes).
MAPREDUCEROLES4AZURE
SEQUENCE ASSEMBLY PERFORMANCE
OTHER ABSTRACTIONS
 Other abstractions
  All-pairs
  DAG
  Wavefront
APPLICATIONS
APPLICATION CATEGORIES
1. Synchronous
  Easiest to parallelize, e.g., SIMD
2. Asynchronous
  Evolve dynamically in time, with different evolution algorithms
3. Loosely Synchronous
  Middle ground: dynamically evolving members, synchronized now and then, e.g., iterative MapReduce
4. Pleasingly Parallel
5. Meta problems

G. C. Fox, et al. Parallel Computing Works. http://www.netlib.org/utk/lsi/pcwLSI/text/node25.html#props


APPLICATIONS
 BioInformatics
  Sequence alignment: SmithWaterman-GOTOH all-pairs alignment
  Sequence assembly: Cap3, CloudBurst
 Data mining
  MDS, GTM & interpolations

(Diagram: the all-pairs alignment is decomposed into row blocks of 100 sequences (1-100, 101-200, 201-300, 301-400, ..., N); map tasks M1 ... M(N*(N+1)/2) each compute one block, and Reduce k gathers the blocks belonging to row block k into hdfs://.../rowblock_k.out)
WORKFLOWS
 Represent and manage complex distributed scientific computations
  Composition and representation
  Mapping to resources (data as well as compute)
  Execution and provenance capturing
 Types of workflows
  Sequences of tasks, DAGs (see the sketch below), cyclic graphs, hierarchical workflows (workflows of workflows)
  Data flows vs. control flows
  Interactive workflows
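
Most DAG-based workflow engines (DAGMan among them) come down to running tasks in an order that respects the dependency edges. A hypothetical sketch of that core idea using topological ordering; real engines add resource mapping, retries, and provenance capture on top of it.

```java
import java.util.*;

// Toy workflow DAG: tasks plus "A must finish before B" edges.
// Kahn's algorithm yields an order in which a scheduler could submit
// the tasks while maintaining their dependencies.
public class WorkflowDag {
    private final Map<String, List<String>> edges = new HashMap<>();
    private final Map<String, Integer> inDegree = new HashMap<>();

    public void addTask(String task) {
        edges.putIfAbsent(task, new ArrayList<>());
        inDegree.putIfAbsent(task, 0);
    }

    public void addDependency(String before, String after) {
        addTask(before);
        addTask(after);
        edges.get(before).add(after);
        inDegree.merge(after, 1, Integer::sum);
    }

    /** Tasks in an order that respects every dependency. */
    public List<String> executionOrder() {
        Map<String, Integer> remaining = new HashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        remaining.forEach((task, degree) -> { if (degree == 0) ready.add(task); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String task = ready.poll();
            order.add(task);
            for (String next : edges.get(task)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        if (order.size() != edges.size()) throw new IllegalStateException("cycle detected");
        return order;
    }

    public static void main(String[] args) {
        WorkflowDag dag = new WorkflowDag();
        dag.addDependency("stage-data", "align");
        dag.addDependency("align", "assemble");
        dag.addDependency("stage-data", "assemble");
        System.out.println(dag.executionOrder());   // e.g. [stage-data, align, assemble]
    }
}
```
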
LEAD – LINKED ENVIRONMENTS FOR DYNAMIC DISCOVERY
 Based on WS-BPEL and SOA infrastructure
PEGASUS AND DAGMAN
 Pegasus
  Resource and data discovery
  Mapping computations to resources
  Orchestrating data transfers
  Publishing results
  Graph optimizations
 DAGMan
  Submits tasks to execution resources
  Monitors the execution
  Retries in case of failure
  Maintains dependencies
CONCLUSION
 Scientific analysis is moving more and more towards clouds and related technologies
 There are lots of cutting-edge technologies in industry that we can use to facilitate data-intensive computing
 Motivation
  Developing easy-to-use, efficient software frameworks to facilitate data-intensive computing
 Thank You !!!
