Chapter 4 MapReduce and New Software Stack

The document discusses the MapReduce framework and its integration with Hadoop for processing large datasets efficiently. It outlines the architecture of Hadoop, including components like HDFS and the roles of JobTracker and TaskTracker in managing tasks. Additionally, it covers the execution pipeline of MapReduce, handling node failures, and various algorithms that can be implemented using MapReduce.


MapReduce and New Software Stack
Introduction
• Businesses and governments need to analyze and process a
tremendous amount of data in a very short period of time.
• On a single machine this takes a huge amount of time.

• Solution: a software stack

• A software stack runs on a collection of several commodity machines
connected by Ethernet or switches.
• It provides parallelism.
• The Hadoop framework provides the new software stack, i.e.,
MapReduce.

• HDFS + MapReduce = Solution


Hadoop High-level architecture

[Figure] A client interacts with the master layer, which consists of the
MapReduce JobTracker and the HDFS NameNode; the slave layer consists of
TaskTrackers and DataNodes running on the worker machines.
Distributed File Systems

• Scientific applications in the past required parallel processing
for fast computation and used special-purpose computers.

• Web services enabled the use of commodity nodes to reduce cost
and still allow parallel processing.

Parallel computing architecture = Cluster computing


Distributed File Systems

• A new form of file system.

• Files are enormous, possibly terabytes in size.

• Files are rarely updated.
Distributed File Systems

• Compute nodes, typically 8-64 per rack, are stored in racks and
connected with each other by a switch or Ethernet.

• Failure at the node level (disk failure) and at the rack level
(network failure) is handled by replicating data on secondary nodes.

• All tasks are completed independently, so if any task fails it can
be restarted without affecting the other tasks.
Distributed File Systems

1. Supports access to files that are stored on remote servers.

2. Supports replication and local caching.

3. Concurrent read/write access to files has to be handled, typically
using locking.
Types of Distributed File Systems

1. Google File System

2. Hadoop Distributed File System

3. CloudStore

4. Kosmix
Google File Systems

• Google stores massive amounts of data.
• It needs a good DFS built from cheap commodity computers to
reduce cost.
• Commodity hardware is unreliable, so redundant storage is
required to cope with failures.
• Most GFS files are written only once and sometimes appended to.
• GFS needs to allow large streaming reads, so high sustained
throughput is favored over low latency.
Google File Systems

• Files are gigabytes in size and stored as chunks of 64 MB each.
• Each chunk is replicated three times to avoid information loss
due to the failure of commodity hardware.
• Chunks are centrally managed through a single Master.
• The Master stores the metadata about the chunks.
• Metadata = file and chunk namespaces + mapping of files to
chunks + locations of the replicas of each chunk.
• The Master is replicated in a Shadow Master.
Hadoop Distributed File Systems

• Similar to GFS.
• Master node = NameNode
• Shadow Master = Secondary NameNode
• Chunks = Blocks
• Chunk server = DataNode
• DataNodes store and retrieve blocks, and report the list of blocks
they are storing to the NameNode.
• Unlike GFS, only a single writer per file is allowed and no
record-append operation is possible.
Word Count Problem
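The slide's word-count figure is not reproduced here. Below is a minimal
Python sketch of the same idea, simulating the map, shuffle/sort, and
reduce steps in memory; the function and variable names are illustrative
and not part of the Hadoop API.

from collections import defaultdict

def map_phase(document):
    # Mapper: emit (word, 1) for every word in the input split
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework would
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Reducer: sum the counts for one word
    return word, sum(counts)

docs = ["the cat sat on the mat", "the dog sat"]
mapped = [pair for doc in docs for pair in map_phase(doc)]
results = [reduce_phase(w, c) for w, c in shuffle(mapped).items()]
print(sorted(results))
# [('cat', 1), ('dog', 1), ('mat', 1), ('on', 1), ('sat', 2), ('the', 3)]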
Details of MapReduce Execution

1. Run-Time coordination in MapReduce


2. Responsibilities of MapReduce Framework
3. MapReduce Execution Pipeline
4. Process Pipeline
5. Coping with Node Failure
Run-Time coordination in MapReduce

• MR handles distributed code execution on the cluster transparently
once the user submits a ".jar" file.

• MR takes care of both scheduling and synchronization.

• MR ensures that all jobs get a fairly equal share of resources.

• MR also implements a scheduling optimization called speculative
execution.
Speculative Execution

• Speculative tasks are launched for tasks that have been running for
some time (at least one minute) and have not made as much progress,
on average, as the other tasks from the job.
• The speculative task is killed if the original task completes before
it; conversely, the original task is killed if the speculative task
finishes first.
Responsibilities of MapReduce Framework

1. Provides overall coordination of execution.

2. Selects nodes for running mappers.

3. Starts and monitors the mappers' execution.

4. Sorts and shuffles the output of the mappers.

5. Chooses locations for the reducers' execution.

6. Delivers the output of the mappers to the reducer nodes.

7. Starts and monitors the reducers' execution.


MapReduce Execution Pipeline
MapReduce Execution Pipeline
• Driver: Defines the configuration and the specification of all the
job's components.

• Input data: Resides in HDFS or HBase.

• InputFormat: Defines how to read the input and how to split it; the
number of splits determines the number of Map tasks in the mapping
phase.

• RecordReader: Reads the data inside a map task's split, converts it
into key-value pairs, and delivers them to the Map method.

• Mapper: The partition of the key space produced by the mapper is
given as input to the reducers.
MapReduce Execution Pipeline
• Shuffle and sort: The process of moving map output to the reducers.
Pairs with the same key are grouped together and passed to a single
machine that will run the reduce code over them.

• Reducer: Executes user-defined code and produces output key-value
pairs.

• RecordWriter: Used for storing data in the location specified by
the OutputFormat.

• Distributed cache: A resource for sharing data globally among all
nodes in the cluster, for example a shared library that each task
can access.
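To make the RecordReader's role concrete, here is a hedged Python sketch
of what a text-input record reader does: it turns a split of a plain-text
file into (key, value) pairs, where the key is the byte offset of each
line and the value is the line itself. The file name and function name
are illustrative assumptions, not Hadoop API names.

def line_record_reader(path):
    # Illustrative stand-in for a text RecordReader: for each line it
    # delivers a (key, value) pair where the key is the byte offset of
    # the line and the value is the line's text, which is what the Map
    # method receives for plain-text input.
    offset = 0
    with open(path, "rb") as split_file:
        for raw in split_file:
            yield offset, raw.decode("utf-8").rstrip("\n")
            offset += len(raw)

# Usage (hypothetical file name):
# for key, value in line_record_reader("visits.txt"):
#     mapper(key, value)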
Process Pipeline

1. The job driver uses the InputFormat to partition a map's execution and
initiates a JobClient.
2. The JobClient communicates with the JobTracker and submits the job for
execution.
3. The JobTracker creates one Map task for each split as well as a set of
reducer tasks.
4. The TaskTracker, present on every node of the cluster, controls the
actual execution of the Map tasks.
5. It sends a periodic heartbeat message to the JobTracker.
6. The JobTracker then uses a scheduler to allocate tasks to the
TaskTracker.
7. Once a task is assigned to the TaskTracker, it copies the job's .jar
file to the TaskTracker's local file system and creates child processes.
8. Each child process informs the parent (TaskTracker) about the task's
progress every few seconds until it completes the task.
9. When the last task of the job is complete, the JobTracker receives a
notification and changes the status of the job to "Completed".
10. By periodically polling the JobTracker, the JobClient learns that the
job has completed.
Coping with Node Failures

• The JobTracker tracks all the MR jobs.

• There is only one JobTracker per Hadoop cluster.
• If the JobTracker fails, all the jobs running on its slaves are
halted and the whole MR job is re-started.
• If a TaskTracker fails, all the tasks running on it are re-started.
• Even completed map tasks have to be re-done, because their output,
destined for the reduce tasks, still resides on that node and has
now become unavailable.
• If there is a failure at a reduce node, the JobTracker sets its
reduce tasks back to idle and reschedules them on another node.
Algorithms using MapReduce

1. Matrix multiplication
2. Relational operators
3. Computing selection
4. Computing projection
Matrix Multiplication using MapReduce

# This pseudocode computes the matrix-vector product M.v, where M is
# given as sparse (i, j, m_ij) entries and v is the vector.
Map(key, value):
    for (i, j, m_ij) in value:
        emit(i, m_ij * v[j])

Reduce(key, values):
    result = 0
    for value in values:
        result += value
    emit(key, result)
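A small, runnable Python simulation of the pseudocode above, assuming M is
stored as sparse (i, j, m_ij) entries; the variable names are illustrative.

from collections import defaultdict

M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]   # the matrix [[1, 2], [3, 4]]
v = [10, 20]

groups = defaultdict(list)
for i, j, m_ij in M:                  # Map: emit (i, m_ij * v[j])
    groups[i].append(m_ij * v[j])

result = {i: sum(vals) for i, vals in groups.items()}   # Reduce: sum per row
print(result)   # {0: 50, 1: 110}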
Matrix Multiplication using MapReduce

[Figure] Dataflow for multiplying matrix A (M x N) by matrix B (N x L)
with MapReduce: one Map task iterates over the entries of A and another
over the entries of B, emitting each entry under keys that bring a(i,j)
and b(j,k) together; the reduce stage sorts the values by key, multiplies
the matching pairs a(i,j) * b(j,k), and sums the products for each output
key (i, k), producing A*B of size M x L.
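The figure's one-pass approach can be sketched in Python as follows, with
the shuffle simulated by a dictionary. A and B are assumed to be stored as
sparse (row, column, value) entries; all names are illustrative, a sketch
rather than a definitive implementation.

from collections import defaultdict

def matmul_mapreduce(A, B, M, L):
    # Map: send each a(i,j) to every output cell (i, k) it contributes to,
    # and each b(j,k) to every output cell (i, k) it contributes to.
    groups = defaultdict(list)
    for i, j, a in A:
        for k in range(L):
            groups[(i, k)].append(("A", j, a))
    for j, k, b in B:
        for i in range(M):
            groups[(i, k)].append(("B", j, b))

    # Reduce: for each output cell (i, k), pair the A and B entries that
    # share the same j and sum the products a(i,j) * b(j,k).
    C = {}
    for (i, k), entries in groups.items():
        a_by_j = {j: val for tag, j, val in entries if tag == "A"}
        b_by_j = {j: val for tag, j, val in entries if tag == "B"}
        C[(i, k)] = sum(a_by_j[j] * b_by_j[j] for j in a_by_j if j in b_by_j)
    return C

A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]   # [[1, 2], [3, 4]]
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]   # [[5, 6], [7, 8]]
print(matmul_mapreduce(A, B, M=2, L=2))
# {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}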
Review of some terminology

• In a traditional RDBMS, queries involve the retrieval of small
amounts of data.

• In MapReduce, queries involve full scans of large amounts of data:
queries are not selective, they process all the data.
• A relation is a table.
• Attributes are the column headers of the table.
• The set of attributes of a relation is called its schema.
MapReduce and Relational Operators

• Example: What is the average time spent per URL?

Select url, AVG(time) from Visits Group By url;

• Steps of MR (see the sketch after this list):
1. Map over tuples, emitting time keyed by url.
2. The framework automatically groups values by key.
3. Compute the average in the reducer.
4. Optimize with a combiner.
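A minimal Python sketch of these steps follows, with the shuffle simulated
by a dictionary. The combiner emits partial (sum, count) pairs rather than
averages, since averaging averages would give the wrong result. The table
contents and field names are illustrative assumptions.

from collections import defaultdict

def mapper(visit):
    # Step 1: emit the visit time keyed by URL
    url, time_spent = visit
    yield url, (time_spent, 1)

def combiner(url, pairs):
    # Step 4: pre-aggregate (sum, count) on the map side so the reducer
    # receives fewer records per URL
    yield url, (sum(t for t, _ in pairs), sum(c for _, c in pairs))

def reducer(url, pairs):
    # Step 3: compute the average from the partial sums and counts
    total = sum(t for t, _ in pairs)
    count = sum(c for _, c in pairs)
    yield url, total / count

visits = [("a.com", 10), ("b.com", 30), ("a.com", 20)]
groups = defaultdict(list)                      # Step 2: group by key
for visit in visits:
    for url, pair in mapper(visit):
        groups[url].append(pair)
for url, pairs in groups.items():
    partial = [p for _, p in combiner(url, pairs)]
    for u, avg in reducer(url, partial):
        print(u, avg)                           # a.com 15.0, b.com 30.0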
MR Algorithm for processing relational Data

1. The shuffle/sort automatically handles the GROUP BY sorting and
partitioning in MR.
2. The following operations are performed either in the Mapper or in
the Reducer:
– Selection
– Projection
– Union, intersection, difference
– Natural join
– Grouping and aggregation
3. Multiple strategies such as reduce-side join, map-side join and
in-memory join are used for relational joins (a reduce-side join
sketch follows).
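A hedged Python sketch of a reduce-side join: the mappers tag each tuple
with its relation name and emit it under the join key, and the reducer
pairs tuples from the two relations. The relations Customers and Orders,
and all field names, are illustrative examples, not taken from the slides.

from collections import defaultdict

def map_customers(customer):
    cust_id, name = customer
    yield cust_id, ("Customers", name)      # tag with the relation name

def map_orders(order):
    cust_id, amount = order
    yield cust_id, ("Orders", amount)

def reduce_join(cust_id, tagged_values):
    # Natural join on cust_id: pair every Customers tuple with every
    # Orders tuple that arrived under the same key
    names = [v for tag, v in tagged_values if tag == "Customers"]
    amounts = [v for tag, v in tagged_values if tag == "Orders"]
    for name in names:
        for amount in amounts:
            yield cust_id, name, amount

customers = [(1, "Asha"), (2, "Ravi")]
orders = [(1, 250), (1, 400), (2, 90)]
groups = defaultdict(list)                  # simulated shuffle
for record in customers:
    for k, v in map_customers(record):
        groups[k].append(v)
for record in orders:
    for k, v in map_orders(record):
        groups[k].append(v)
for k in sorted(groups):
    print(list(reduce_join(k, groups[k])))
# [(1, 'Asha', 250), (1, 'Asha', 400)]
# [(2, 'Ravi', 90)]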
Computing Selection by MapReduce

1. Map(): For each tuple t in R, test whether it satisfies condition C.
If so, produce the key-value pair (t, t); that is, both the key and
the value are t.
2. Reduce(): Simply passes each key-value pair to the output.
Computing Selection by MapReduce: Pseudo code

Map(key, value):
    for tuple in value:
        if tuple satisfies C:
            emit(tuple, tuple)

Reduce(key, values):
    # identity reduce: pass the selected tuple through to the output
    emit(key, key)
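A runnable Python version of this pseudocode, with the condition C passed
in as a function; the relation R and the condition used here are
illustrative.

def select_map(R, condition):
    # Map: emit (t, t) for every tuple t of R that satisfies condition C
    for t in R:
        if condition(t):
            yield t, t

def select_reduce(key, values):
    # Identity reduce: pass each selected tuple to the output
    for value in values:
        yield key, value

R = [("a", 50), ("b", 150), ("c", 300)]
for key, value in select_map(R, lambda t: t[1] > 100):
    for out in select_reduce(key, [value]):
        print(out)
# (('b', 150), ('b', 150))
# (('c', 300), ('c', 300))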
Hadoop Common Package

• Contains the libraries and utilities required by the other Hadoop
modules.

• Provides file-system and operating-system level abstractions.
Hadoop Distributed File System

• Manages storage and retrieval of the data and metadata required
for computation.

• HDFS creates multiple replicas of each data block and distributes
them across computers to enable reliable and rapid access.

• When a file is loaded into HDFS, it is fragmented into blocks of
data which are replicated and stored across the cluster nodes
(DataNodes).
Main components of HDFS

1. Name Node

2. Data Node
Name Node

• Master node.
• Contains the metadata.
• Maintains directories and files.
• Manages the blocks that are present on the DataNodes.
• Handles authorization and authentication.
Data Node

• Slave node.
• Responsible for processing read and write requests from clients.
• Handles the block storage.
• Periodically sends heartbeats and block reports to the NameNode.
Hadoop Map-Reduce

• A programming model and algorithm that helps in parallel processing.
• Two phases:
1. Map phase:
– The input is formed into a set of key-value pairs.
– Over each key-value pair, the desired function is executed so as to
generate a set of intermediate key-value pairs.
2. Reduce phase:
– The intermediate key-value pairs are grouped by key and the values
are combined together according to the reduce algorithm provided by
the user.
• HDFS is the storage system for both the input and the output of
MapReduce jobs.
Components of MapReduce

1. JobTracker:
– The master which manages the jobs and resources in the cluster.
– It schedules each map on a TaskTracker.
– There is one JobTracker per cluster.
2. TaskTracker:
– Slaves which run on every machine in the cluster.
– Responsible for running Map and Reduce tasks as instructed by the JobTracker.
3. JobHistoryServer:
– Daemon that saves historical information about completed tasks.
MapReduce is Bad For

1. Frequently changing data: MapReduce reads the entire data set only once.

2. Dependent tasks: tasks must not have dependencies on each other.

3. Interactive analysis: it doesn't return any result until the job ends.

MapReduce hides the complexity of parallel
programming and simplifies building
applications.
Yet Another Resource Negotiator

• YARN is the processing framework in Hadoop.

• Resource management.

• Job scheduling.

• Job monitoring.


Hadoop Ecosystem
Higher levels: interactivity

[Figure] Hadoop ecosystem stack: HDFS at the bottom, YARN above it for
scheduling and resource management, MapReduce on top of YARN, and
higher-level tools (Hive, Pig, Giraph, Spark, Storm, Flink, Zookeeper,
HBase, Cassandra, MongoDB) above MapReduce.

Lower levels: storage and scheduling

Hadoop Ecosystem

• HDFS
1. It is the foundation for many more big data frameworks.
2. It provides scalable and reliable storage.
3. As the size of the data increases, we can add commodity hardware to
increase storage capacity.

• YARN
1. Provides flexible scheduling and resource management over the HDFS storage.
2. Used at Yahoo to schedule jobs across 40,000 servers.

• MapReduce
1. A programming model.
2. Simplifies parallel computing.
3. Instead of dealing with the complexities of synchronization and scheduling,
MapReduce deals with only two functions:
– Map()
– Reduce()
4. Used by Google for indexing websites.
Hadoop Ecosystem

• Hive
1. A programming model.
2. Created at Facebook to issue SQL-like queries using MapReduce on their data in
HDFS.
3. It is basically a data warehouse that provides ad-hoc queries, data
summarization and analysis of huge data sets.

• Pig
1. A high-level programming model.
2. Processes and analyses big data using user-defined functions.
3. Created at Yahoo to model data-flow-based programs using MapReduce.
4. Like Hive, it provides a bridge to query data on Hadoop, but unlike Hive it
uses a data-flow language rather than SQL.

• Giraph
1. A specialized model for graph processing.
2. Used by Facebook to analyze the social graph.
Hadoop Ecosystem

• Spark / Storm / Flink
1. Real-time, in-memory data processing.
2. In-memory processing is up to 100x faster for some tasks.

• HBase
1. A NoSQL / non-relational distributed database.
2. It is a backing store for MapReduce jobs.
3. HBase is column-oriented rather than row-oriented, for fast processing.

• Cassandra
1. A NoSQL / non-relational distributed database.
2. MapReduce can retrieve data from Cassandra.
Hadoop Ecosystem

• MongoDB
1. A NoSQL database.
2. A document-oriented database system.
3. It stores structured data as JSON-like documents.

• ZooKeeper
1. Manages the cluster.
2. Running all these tools requires a centralized management system for
synchronization, configuration and for ensuring high availability.

• Mahout, Spark MLlib -> machine learning
• Apache Drill -> SQL on Hadoop
• Oozie -> job scheduling
• Flume, Sqoop -> data-ingestion services
• Solr & Lucene -> searching and indexing
• Ambari -> provision, monitor and maintain the cluster
Physical Architecture

• The combination of a cloud environment with big data processing tools such as
Hadoop provides the high-performance computing power needed to analyze vast
amounts of data efficiently and cost-effectively.

• A typical machine configuration for storage and compute servers:
1. 32 GB memory
2. 4-core processors
3. 200-320 GB hard disk

• Running Hadoop in virtualized environments continues to develop and mature
with initiatives from open-source software projects.
Cloud Computing infrastructure to support Big Data Analytics

[Figure] Cloud computing infrastructure to support big data analytics:
storage nodes and compute nodes connected by a switch on a cloud-management
VLAN, alongside virtual machines for HBase, Zookeeper, LDAP, the web console
and a database server, with a web server fronting the cloud integration
environment. Machine configuration: 32 GB memory, 4 cores, 200-320 GB disk.
Physical Architecture

• Every Hadoop-compatible file system should provide location awareness for
effective scheduling of work.
• A Hadoop application uses this information to find the data node and run the
task there.
• HDFS replicates data to keep different copies of the data on different racks
to reduce the impact of a rack power or switch failure.

[Figure] Hadoop cluster layout: a master node running the JobTracker, the
NameNode, and its own TaskTracker and DataNode, plus worker nodes each
running a TaskTracker and a DataNode.
Hadoop limitations

1. Security concerns
• Security is disabled by default.
• Hadoop does not provide encryption at the storage and network levels.

2. Vulnerable by nature
• Written entirely in Java.
• Java is one of the languages most widely exploited by cyber criminals.

3. Not fit for small data
• Not all big data platforms are suitable for handling small files.
• Due to its high-capacity design, HDFS supports small files inefficiently.
• Not recommended for small-scale industries.
Hadoop

Thank You
