CC - Lecture 6-Data
DATA-INTENSIVE
TECHNOLOGIES FOR CLOUD
COMPUTING
TRENDS
Massive data
Thousands to millions of cores
Consolidated data centers
Shift from the clock-rate battle to multicore to many-core…
Cheap hardware
Failures are the norm
VM based systems
“640K ought to be enough for anybody.”
THE EARTHSCOPE
• The EarthScope is the world's largest
science project. Designed to track North
America's geological evolution, this
observatory records data over 3.8 million
square miles, amassing 67 terabytes of
data. It analyzes seismic slips in the San
Andreas fault, sure, but also the plume of
magma underneath Yellowstone and
much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
TYPE OF DATA
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can only scan the data once (a single-pass sketch follows below)
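Because a stream can be scanned only once, any statistic has to be maintained incrementally as elements arrive. A minimal single-pass sketch in Python (the running_stats helper is hypothetical, for illustration only):

    def running_stats(stream):
        # Maintain the statistic incrementally; the stream cannot be replayed.
        count, total = 0, 0.0
        for value in stream:
            count += 1
            total += value
        return count, (total / count if count else 0.0)

    # Works on any one-shot iterable, e.g. a generator of sensor readings.
    print(running_stats(iter([3.0, 1.0, 2.0])))   # -> (3, 2.0)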
WHAT’S BIG DATA?
No single definition; here is from Wikipedia:
Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing,
transfer, analysis, and visualization.
The trend to larger data sets is due to the additional information
derivable from analysis of a single large set of related data, as
compared to separate smaller sets with the same total amount of
data, allowing correlations to be found to "spot business trends,
determine quality of research, prevent diseases, link legal citations,
combat crime, and determine real-time roadway traffic conditions.”
BIG DATA: 3V’S
Volume, Velocity, Variety
MOVING TOWARDS…
Distributed File Systems
HDFS, etc..
Distributed Key-Value stores
Data intensive parallel application frameworks
MapReduce
High level languages
Science in the clouds
DISTRIBUTED DATA STORAGE
CLOUD DATA STORES (NO-SQL)
Schema-less
Shared nothing architecture
Elasticity
Sharding
Asynchronous replication
http://nosqlpedia.com/wiki/Survey_distributed_databases
CLOUD DATA STORES (NO-SQL)
Schema-less:
“Tables” don’t have a pre-defined schema. Records have a variable
set of fields that can differ from record to record. Record contents
and semantics are enforced by applications (a minimal sketch follows below).
Shared nothing architecture
Instead of using a common storage pool (e.g., SAN), each server uses only
its own local storage. This allows storage to be accessed at local disk speeds
instead of network speeds, and it allows capacity to be increased by adding
more nodes. Cost is also reduced since commodity hardware can be used.
Elasticity
Both storage and server capacity can be added on the fly by merely adding
more servers. No downtime is required. When a new node is added, the
database automatically begins assigning it data to store and requests to serve.
http://nosqlpedia.com/wiki/Survey_distributed_databases
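To make the schema-less point concrete, here is a minimal sketch (the records and the display_name helper are hypothetical): two records in the same “table” carry different fields, and the application decides how to interpret them.

    # Hypothetical records in one schema-less "table": fields vary per record.
    users = [
        {"id": 1, "name": "Alice", "email": "alice@example.com"},
        {"id": 2, "name": "Bob", "last_login": "2011-09-01", "tags": ["admin"]},
    ]

    def display_name(record):
        # The application, not the store, enforces record semantics.
        return record.get("name", "<unknown>")

    for record in users:
        print(display_name(record))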
CLOUD DATA STORES (NO-SQL)
Sharding
Instead of viewing the storage as a monolithic space, records are partitioned into
shards. Usually, a shard is small enough to be managed by a single server, though
shards are usually replicated. Sharding can be automatic (e.g., an existing shard
splits when it gets too big), or applications can assist by assigning
each record a partition ID (a minimal sketch follows after this slide).
Asynchronous replication
Compared to RAID storage (mirroring and/or striping) or synchronous
replication, NoSQL databases employ asynchronous replication. This allows
writes to complete more quickly since they don’t depend on extra network traffic.
One side effect of this strategy is that data is not immediately replicated and
could be lost in certain windows. Also, locking is usually not available to protect
all copies of a specific unit of data.
BASE instead of ACID
NoSQL databases emphasize performance and availability. This requires
prioritizing among the guarantees of the CAP theorem in a way that makes
true ACID transactions impractical.
http://nosqlpedia.com/wiki/Survey_distributed_databases
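A minimal sketch of application-assisted sharding (the partition_id helper and the fixed shard count are assumptions for illustration): each record's key is hashed to a partition ID that selects one shard, and replication of the write would happen asynchronously afterwards.

    import hashlib

    NUM_SHARDS = 4   # assumed fixed shard count, for illustration only

    def partition_id(key):
        # Hash the record key to a partition ID that selects one shard.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

    shards = {i: {} for i in range(NUM_SHARDS)}

    def put(key, value):
        shards[partition_id(key)][key] = value
        # A real store would now queue the write for asynchronous replication;
        # until the replicas catch up, a crash in this window can lose the update.

    put("user:42", {"name": "Alice"})
    print(partition_id("user:42"))

Note that hashing modulo a fixed shard count forces a large reshuffle whenever the shard count changes; consistent hashing, used by Dynamo below, avoids this.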
ACID VS BASE
ACID
Strong consistency
Isolation
Focus on “commit”
Nested transactions
Availability?
Conservative (pessimistic)
Difficult evolution (e.g. schema)
BASE
Weak consistency – stale data OK
Availability first
Best effort
Approximate answers OK
Aggressive (optimistic)
Simpler!
Faster
Easier evolution
Fay Chang, et al. “Bigtable: A Distributed Storage System for Structured Data”.
AMAZON DYNAMO
Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
DeCandia, G., et al. 2007. “Dynamo: Amazon's Highly Available Key-Value Store”. In Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (Stevenson, Washington, USA, October 14-17, 2007). SOSP '07. ACM, 205-220.
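A minimal sketch of consistent hashing (the class and node names are hypothetical; Dynamo additionally uses virtual nodes and replication, omitted here): nodes and keys are hashed onto one ring, and a key is stored on the first node at or after its hash, so adding a node moves only the keys in one arc of the ring, which is what gives incremental scalability.

    import bisect, hashlib

    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        # Nodes and keys share one hash space; a key belongs to the first
        # node clockwise from it, so adding a node only takes over the keys
        # between it and its predecessor.
        def __init__(self, nodes=()):
            self._ring = sorted((_hash(n), n) for n in nodes)

        def add_node(self, node):
            bisect.insort(self._ring, (_hash(node), node))

        def node_for(self, key):
            points = [h for h, _ in self._ring]
            i = bisect.bisect(points, _hash(key)) % len(self._ring)  # wrap around
            return self._ring[i][1]

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:42"))
    ring.add_node("node-d")          # only the keys in one arc of the ring move
    print(ring.node_for("user:42"))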
NO-SQL DATA STORES
http://nosqlpedia.com/wiki/Survey_distributed_databases
GFS
SECTOR
File System | GFS/HDFS | Lustre | Sector
Architecture | Cluster-based, asymmetric, parallel | Cluster-based, asymmetric, parallel | Cluster-based, asymmetric, parallel
Microsoft DryadLINQ
EXECUTION OVERVIEW
Source: http://code.google.com/edu/parallel/mapreduce-tutorial.html
MAPREDUCE
1. The MapReduce library in the user program first shards the input files
into M pieces of typically 16 megabytes to 64 megabytes (MB) per
piece. It then starts up many copies of the program on a cluster of
machines.
2. One of the copies of the program is special: the master. The rest are
workers that are assigned work by the master. There are M map tasks
and R reduce tasks to assign. The master picks idle workers and assigns
each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the
corresponding input shard. It parses key/value pairs out of the input data
and passes each pair to the user-defined Map function. The intermediate
key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into
R regions by the partitioning function. The locations of these buffered
pairs on the local disk are passed back to the master, who is responsible
for forwarding these locations to the reduce workers.
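The partitioning function in step 4 is, by default, a simple hash of the intermediate key modulo the number of reduce tasks. A one-line sketch (the value of R is chosen arbitrarily here):

    R = 4   # number of reduce tasks

    def partition(key):
        # Decide which of the R regions an intermediate (key, value) pair
        # is written to; the default is hash(key) mod R.
        return hash(key) % R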
MAPREDUCE
5. When a reduce worker is notified by the master about these
locations, it uses remote procedure calls to read the buffered data
from the local disks of the map workers. When a reduce worker has
read all intermediate data, it sorts it by the intermediate keys so that
all occurrences of the same key are grouped together. If the amount
of intermediate data is too large to fit in memory, an external sort is
used.
6. The reduce worker iterates over the sorted intermediate data and for
each unique intermediate key encountered, it passes the key and the
corresponding set of intermediate values to the user's Reduce
function. The output of the Reduce function is appended to a final
output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the
master wakes up the user program. At this point, the MapReduce
call in the user program returns to the user code.
WORD COUNT
Input:              foo car bar | foo bar foo | car car car
Mapping:            (foo,1) (car,1) (bar,1)   (foo,1) (bar,1) (foo,1)   (car,1) (car,1) (car,1)
Shuffling/Sorting:  bar → <1,1>    car → <1,1,1,1>    foo → <1,1,1>
Reducing:           (bar,2) (car,4) (foo,3)
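A minimal Python sketch of this word-count flow (map_fn, reduce_fn, and word_count are illustrative stand-ins for the user-defined Map and Reduce functions and the framework around them); it reproduces the result above:

    from collections import defaultdict

    def map_fn(line):
        # User-defined Map: emit an intermediate (word, 1) pair per word.
        for word in line.split():
            yield word, 1

    def reduce_fn(key, values):
        # User-defined Reduce: sum all counts collected for one key.
        return key, sum(values)

    def word_count(lines):
        groups = defaultdict(list)
        for line in lines:                 # map phase
            for key, value in map_fn(line):
                groups[key].append(value)  # stand-in for partition + shuffle
        # sorting groups identical keys; the reduce phase folds each group
        return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

    print(word_count(["foo car bar", "foo bar foo", "car car car"]))
    # -> [('bar', 2), ('car', 4), ('foo', 3)]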
HADOOP & DRYADLINQ
[Figure: Apache Hadoop — a master node (Name Node, Job Tracker) schedules map (M) and reduce (R) tasks on data/compute nodes over HDFS data blocks. Microsoft DryadLINQ — standard LINQ and DryadLINQ operations are translated by the DryadLINQ compiler into Directed Acyclic Graph (DAG) based execution flows run by the Dryad execution engine; a vertex is an execution task and an edge is a communication path.]
Source: Judy Qiu, “Cloud Technologies and Their Applications”, Indiana University Bloomington, March 26, 2010.
Feature | Programming Model | Data Storage | Communication | Scheduling & Load Balancing
Hadoop | MapReduce | HDFS | TCP | Data locality; rack-aware dynamic task scheduling through a global queue; natural load balancing
Dryad | DAG-based execution flows | Windows shared directories (Cosmos) | Shared files / TCP pipes / shared-memory FIFO | Data locality / network-topology-based run-time graph optimizations; static scheduling
Twister | Iterative MapReduce | Shared file system / local disks | Content Distribution Network / direct TCP | Data-locality-based static scheduling
MapReduceRole4Azure | MapReduce | Azure Blob Storage | TCP through Azure Blob Storage (direct TCP) | Dynamic scheduling through a global queue; good natural load balancing
[Figure: total running time vs. standard deviation of sequence lengths for DryadLINQ SWG, Hadoop SWG, and Hadoop SWG on VM]
Inhomogeneity of data does not have a significant effect when the sequence
lengths are randomly distributed.
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataPlex (32 nodes).
INHOMOGENEOUS DATA PERFORMANCE
[Figure: total time (s) vs. standard deviation for a skewed distribution of inhomogeneous data (mean: 400, dataset size: 10,000) — DryadLINQ SWG, Hadoop SWG, and Hadoop SWG on VM]
BIOINFORMATICS: SEQUENCE ALIGNMENT
SmithWaterman-GOTOH all-pairs alignment
[Figure: the all-pairs alignment split into row blocks (1-100), (101-200), (201-300); each row block is computed by map tasks (M1, M2, …, M#) and collected by Reduce 1-3 into hdfs://.../rowblock_1.out, rowblock_2.out, and rowblock_3.out]
TYPES OF WORKFLOWS
Sequence of tasks, DAGs, cyclic graphs, hierarchical
workflows (workflows of workflows)
Data flows vs. control flows
Interactive workflows
LEAD – LINKED ENVIRONMENTS FOR
DYNAMIC DISCOVERY
Based on WS-BPEL and
SOA infrastructure
PEGASUS AND DAGMAN
Pegasus
Resource and data discovery
Mapping computations to resources
Orchestrating data transfers
Publishing results
Graph optimizations
DAGMan
Submits tasks to execution resources
Monitors the execution
Retries in case of failure
Maintains dependencies (a minimal sketch of this bookkeeping follows below)
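A minimal sketch of the bookkeeping a DAGMan-style meta-scheduler performs (run_workflow and the task names are hypothetical): run each task only after its prerequisites complete, retrying failed tasks a bounded number of times.

    def run_workflow(tasks, deps, retries=3):
        # tasks: name -> callable; deps: name -> set of prerequisite task names.
        done = set()
        while len(done) < len(tasks):
            ready = [t for t in tasks
                     if t not in done and deps.get(t, set()) <= done]
            if not ready:
                raise RuntimeError("cyclic or unsatisfiable dependencies")
            for task in ready:
                for attempt in range(retries):       # retry in case of failure
                    try:
                        tasks[task]()
                        break
                    except Exception:
                        if attempt == retries - 1:
                            raise
                done.add(task)

    # Usage: align depends on fetch; publish depends on align.
    run_workflow(
        {"fetch": lambda: print("fetch"),
         "align": lambda: print("align"),
         "publish": lambda: print("publish")},
        {"align": {"fetch"}, "publish": {"align"}},
    )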
CONCLUSION
Scientific analysis is moving more and more towards clouds and
related technologies.
There are many cutting-edge technologies in industry that we can
use to facilitate data-intensive computing.
Motivation
Developing easy-to-use, efficient software frameworks to
facilitate data-intensive computing.
Thank You !!!