Chapter 4: MapReduce and the New Software Stack
Introduction
• Businesses and governments need to analyze and process tremendous amounts of data in a very short period of time.
• On a single machine, such processing takes a prohibitively long time.
Figure: Hadoop master/slave architecture. The client interacts with two layers: MapReduce (master: JobTracker; slaves: TaskTrackers) and HDFS (master: NameNode; slaves: DataNodes).
Distributed File Systems
• Files are very large and rarely updated in place; they are mostly read and appended to.
• Examples:
1. Google File System (GFS)
2. Hadoop Distributed File System (HDFS)
3. CloudStore
4. Kosmix (KFS)
Google File System (GFS) and HDFS
• HDFS is similar to GFS, with corresponding components:
– Master node = NameNode
– Shadow master = Secondary NameNode
– Chunks = Blocks
– Chunkserver = DataNode
• DataNodes store and retrieve blocks, and report the list of blocks they are storing to the NameNode.
• Unlike GFS, only a single writer per file is allowed, and no record-append operation is possible.
Word Count Problem
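The canonical MapReduce example: count how many times each word occurs across a collection of documents. Below is a minimal in-memory sketch of the map and reduce functions; the function names and the driver loop are illustrative, not Hadoop's actual API.

from collections import defaultdict

def map_fn(line):
    # Map: emit (word, 1) for every word occurrence in the line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum the 1s to get the total count for the word.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
groups = defaultdict(list)
for line in lines:                       # map phase
    for word, one in map_fn(line):
        groups[word].append(one)         # shuffle: group values by key
print([reduce_fn(w, c) for w, c in sorted(groups.items())])
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]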
Details of MapReduce Execution
• Speculative tasks are launched for tasks that have been running for some time (at least one minute) and have not, on average, made much progress compared with the other tasks in the job.
• A speculative task is killed if the original task completes first; conversely, the original task is killed if the speculative task finishes before it.
Responsibilities of MapReduce Framework
• Input formats: define how to read the input and how to split it; the number of splits determines the number of map tasks in the mapping phase.
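As a rough illustration of how split size determines the number of map tasks, here is a minimal sketch (not Hadoop's actual InputFormat API; the 128 MB split size mirrors the usual HDFS block size):

def compute_splits(file_size, split_size=128 * 1024 * 1024):
    # Return (offset, length) byte ranges covering the file;
    # the framework starts one map task per split.
    return [(offset, min(split_size, file_size - offset))
            for offset in range(0, file_size, split_size)]

# A 300 MB file with 128 MB splits yields 3 splits, hence 3 map tasks.
print(len(compute_splits(300 * 1024 * 1024)))   # 3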
Algorithms Using MapReduce
1. Matrix multiplication
2. Relational operators
3. Computing selection
4. Computing projection
Matrix-Vector Multiplication using MapReduce
Map(key, value):
  # value is a set of (i, j, a_ij) entries of the matrix;
  # the vector v is available to every Map task
  for (i, j, a_ij) in value:
    emit(i, a_ij * v[j])
Reduce(key, values):
  # sum the partial products for row i = key
  result = 0
  for value in values:
    result += value
  emit(key, result)
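A minimal in-memory sketch of the pseudocode above, with A stored as sparse (i, j, a_ij) triples and the vector v assumed to fit in memory on every mapper, as the pseudocode implies:

from collections import defaultdict

A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]   # (i, j, a_ij)
v = [10.0, 20.0]

groups = defaultdict(list)
for i, j, a_ij in A:
    groups[i].append(a_ij * v[j])        # Map: emit (i, a_ij * v_j)

result = {i: sum(partials) for i, partials in groups.items()}   # Reduce: sum per row
print(result)   # {0: 50.0, 1: 110.0}, i.e. A.v for A = [[1, 2], [3, 4]], v = [10, 20]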
Matrix Multiplication using MapReduce
Figure: a two-phase MapReduce job computing A*B, where A is M x N and B is N x L (the result is M x L).
• Map over A: for each element a_ij, emit it keyed by the shared index j.
• Map over B: for each element b_jk, emit it keyed by the shared index j.
• First Reduce: for each key j, form every product a_ij * b_jk and emit it keyed by (i, k).
• Second Reduce: sum the products received for each key (i, k) to produce one cell of the result.
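A minimal in-memory sketch of the two-pass scheme above, with both matrices given as sparse triples; the pass structure follows the figure, but the variable names are illustrative:

from collections import defaultdict

A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]   # (i, j, a_ij)
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]   # (j, k, b_jk)

# Pass 1, map: key both matrices by the shared index j.
by_j = defaultdict(lambda: {"A": [], "B": []})
for i, j, a in A:
    by_j[j]["A"].append((i, a))
for j, k, b in B:
    by_j[j]["B"].append((k, b))

# Pass 1, reduce: emit ((i, k), a_ij * b_jk) for every A/B pair sharing j.
products = defaultdict(list)
for j, sides in by_j.items():
    for i, a in sides["A"]:
        for k, b in sides["B"]:
            products[(i, k)].append(a * b)

# Pass 2, reduce: sum the partial products for each output cell (i, k).
C = {ik: sum(vals) for ik, vals in products.items()}
print(C)   # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}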
Review of Some Terminology
• Steps of MR (example: average time per URL):
1. Map over the input tuples, emitting the time keyed by URL.
2. The framework automatically groups values by key.
3. Compute the average in the reducer.
4. Optimize with a combiner.
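A minimal sketch of these steps, assuming log records of the form (url, time); the combiner pre-aggregates (sum, count) pairs so the average stays exact after partial aggregation:

from collections import defaultdict

logs = [("a.com", 120), ("b.com", 300), ("a.com", 80), ("a.com", 100)]

def map_fn(record):
    url, time = record
    yield (url, (time, 1))               # emit the time keyed by URL

def combine(pairs):
    # Combiner: collapse many (time, 1) pairs into one (sum, count) pair.
    # Emitting (sum, count) rather than an average keeps the result exact.
    total = sum(t for t, _ in pairs)
    count = sum(c for _, c in pairs)
    return [(total, count)]

groups = defaultdict(list)
for record in logs:                      # map phase
    for url, pair in map_fn(record):
        groups[url].append(pair)
for url in groups:                       # combiner (would run mapper-side)
    groups[url] = combine(groups[url])

averages = {}
for url, pairs in groups.items():        # reduce: total time / total count
    total, count = pairs[0]
    averages[url] = total / count
print(averages)                          # {'a.com': 100.0, 'b.com': 300.0}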
MR Algorithm for Processing Relational Data
• Selection: apply a condition C to each tuple of the relation.
Map(key, value):
  for tuple in value:
    if tuple satisfies C:
      emit(tuple, tuple)
Reduce(key, values):
  # identity: the key is the qualifying tuple itself
  emit(key, key)
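A minimal in-memory sketch of selection (and, since it appears in the list earlier, projection) on a relation R(a, b, c); the condition C and the projected attributes are illustrative:

R = [(1, "x", 10), (2, "y", 20), (3, "x", 10)]

# Selection sigma_C(R): the map emits a tuple iff it satisfies C;
# the reduce is the identity, so a plain filter captures it.
selected = [t for t in R if t[2] > 10]       # C: c > 10
print(selected)                              # [(2, 'y', 20)]

# Projection pi_{b,c}(R): the map emits the projected tuple as the key;
# grouping by key makes the reduce eliminate duplicates, like a set.
projected = {(b, c) for _, b, c in R}
print(projected)                             # {('x', 10), ('y', 20)}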
Hadoop Common Package
1. NameNode
2. DataNode
NameNode
• Master node.
• Contains the metadata.
• Maintains the directories and files.
• Manages the blocks that are present on the DataNodes.
• Handles authorization and authentication.
DataNode
• Slave node.
• Responsible for serving read and write requests from clients.
• Handles block storage.
• Periodically sends heartbeats and block reports to the NameNode.
Hadoop MapReduce
• An algorithm and programming model.
• Enables parallel processing.
• Two phases:
1. Map Phase:
– The input is turned into a set of key-value pairs.
– The desired function is executed over each key-value pair to generate a set of intermediate key-value pairs.
2. Reduce Phase:
– The intermediate key-value pairs are grouped by key, and the values are combined according to the reduce algorithm provided by the user.
• HDFS is the storage system for both the input and the output of MapReduce jobs.
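To make the two phases concrete, here is a minimal in-memory driver sketch; it simulates the framework's group-by-key step and is not Hadoop's API:

from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # Map phase: each record yields intermediate (key, value) pairs.
    intermediate = [kv for record in records for kv in mapper(record)]
    # Shuffle: sort and group the intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: combine the grouped values for each key.
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

# Example: word count with the driver above.
counts = run_mapreduce(["a b a", "b a"],
                       lambda line: [(w, 1) for w in line.split()],
                       lambda word, ones: (word, sum(ones)))
print(counts)   # [('a', 3), ('b', 2)]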
Components of MapReduce
1. JobTracker:
– The master, which manages the jobs and resources in the cluster.
– It schedules each map task on a TaskTracker.
– There is one JobTracker per cluster.
2. TaskTracker:
– Slaves that run on every machine in the cluster.
– Responsible for running map and reduce tasks as instructed by the JobTracker.
3. JobHistoryServer:
– A daemon that saves historical information about tasks.
MapReduce Is Bad For
• Resource management.
• Job scheduling.
(In Hadoop 2, these responsibilities moved out of MapReduce into YARN.)
Figure: the Hadoop ecosystem stack. HDFS sits at the bottom, YARN above it, and MapReduce on top of YARN; higher-level tools (Hive, Pig, Giraph, Storm, Spark, Flink) run above MapReduce/YARN, while Zookeeper, HBase, Cassandra, and MongoDB sit alongside the stack.
• HDFS
1. It is the foundation for many other big data frameworks.
2. It provides scalable and reliable storage.
3. As the size of the data increases, commodity hardware can be added to increase storage capacity.
• YARN
1. Provides flexible scheduling and resource management over the HDFS storage.
2. Used at Yahoo to schedule jobs across 40,000 servers.
• MapReduce
1. A programming model.
2. Simplifies parallel computing.
3. Instead of dealing with the complexities of synchronization and scheduling, the programmer deals with only two functions: Map() and Reduce().
4. Used by Google for indexing websites.
Hadoop Ecosystem
• Hive
1. A programming model.
2. Created at Facebook to issue SQL-like queries, via MapReduce, on their data in HDFS.
3. It is basically a data warehouse that provides ad-hoc queries, data summarization, and analysis of huge data sets.
• Pig
1. A high-level programming model.
2. Processes and analyzes big data using user-defined functions, with less programming effort.
3. Created at Yahoo to model data-flow-based programs using MapReduce.
4. Provides a bridge to query data on Hadoop, but unlike Hive it uses a data-flow language (Pig Latin) rather than SQL.
• Giraph
1. A specialized model for graph processing.
2. Used by Facebook to analyze the social graph.
Hadoop Ecosystem
• HBase
1. A NoSQL / non-relational distributed database.
2. It serves as a backing store for MapReduce jobs.
3. HBase is column-oriented rather than row-oriented, for fast processing.
• Cassandra
1. A NoSQL / non-relational distributed database.
2. MapReduce can retrieve data from Cassandra.
Hadoop Ecosystem
• MongoDB
1. A NoSQL database.
2. A document-oriented database system.
3. It stores structured data as JSON-like documents.
• Zookeeper
1. Manages the cluster.
2. Running all these tools requires a centralized management system for synchronization, configuration, and high availability.
• Combining a cloud environment with big data processing tools such as Hadoop provides the high-performance computing power needed to analyze vast amounts of data efficiently and cost-effectively.
Figure: a cloud integration environment. A switch connects storage nodes with virtual machines hosting HBase, Zookeeper, an LDAP server, a database server, and a web console.
• Every Hadoop-compatible file system should provide location awareness for effective scheduling of work.
• Hadoop applications use this information to find the DataNode holding the data and run the task there.
• HDFS replicates data, keeping copies on different racks to reduce the impact of a rack power or switch failure.
Hadoop Cluster
1. Security concerns
• Security is disabled by default.
• Does not provide encryption at the storage and network levels.
2. Vulnerable by nature
• Written entirely in Java.
• Java is one of the languages most widely exploited by cybercriminals.
Thank You