Chapter 4 MapReduce and New Software Stack

The document discusses the MapReduce framework and its integration with Hadoop for processing large datasets efficiently. It outlines the architecture of Hadoop, including components like HDFS and the roles of JobTracker and TaskTracker in managing tasks. Additionally, it covers the execution pipeline of MapReduce, handling node failures, and various algorithms that can be implemented using MapReduce.


MapReduce and New Software Stack
Introduction
• Businesses and governments need to analyze and process a
tremendous amount of data in a very short period of time.
• On a single machine this takes a huge amount of time.

• Solution: a software stack

• A software stack runs on a collection of several commodity machines
connected by Ethernet or switches.
• It provides parallelism.
• The Hadoop framework provides the new software stack, i.e.,
MapReduce.

• HDFS + MapReduce = Solution


Hadoop High-level architecture

[Figure] A client interacts with the master layer, which consists of the
MapReduce JobTracker and the HDFS NameNode; the slave layer consists of
TaskTrackers and DataNodes running on the worker machines.
Distributed File Systems

• Scientific applications in the past required parallel processing
for fast computation and used special-purpose computers.

• Web services enabled the use of commodity nodes to reduce cost
and still allow parallel processing.

Parallel computing architecture = Cluster computing


Distributed File Systems

• A new form of file system.

• Files are enormous, possibly terabytes in size.

• Files are rarely updated.
Distributed File Systems

• Compute nodes, typically 8-64 per rack, are stored in racks and
connected with each other by a switch or Ethernet.

• Failure at the node level (disk failure) and at the rack level
(network failure) is handled by replicating data on secondary nodes.

• All tasks are completed independently, so if any task fails it can
be restarted without affecting the other tasks.
Distributed File Systems

1. Supports access to files that are stored on remote servers.

2. Supports replication and local caching.

3. Concurrent read/write access to files has to be handled, typically
using locking.
Types of Distributed File Systems

1. Google File System

2. Hadoop Distributed File System

3. CloudStore

4. Kosmix
Google File Systems

• Google stores massive amounts of data.
• It needs a good DFS built from cheap commodity computers to
reduce cost.
• Commodity hardware is unreliable, so redundant storage is
required to cope with failures.
• Most GFS files are written only once and sometimes appended to.
• GFS needs to allow large streaming reads, so high sustained
throughput is favored over low latency.
Google File Systems

• Files are gigabytes in size and stored as chunks of 64 MB each.
• Each chunk is replicated three times to avoid information loss
due to the failure of commodity hardware.
• Chunks are centrally managed through a single Master.
• The Master stores the metadata about the chunks.
• Metadata = file and chunk namespaces + mapping of files to
chunks + locations of the replicas of each chunk.
• The Master is replicated in a Shadow Master.
Hadoop Distributed File Systems

• Similar to GFS.
• Master node = NameNode
• Shadow Master = Secondary NameNode
• Chunks = Blocks
• Chunk server = DataNode
• DataNodes store and retrieve blocks, and report the list of blocks
they are storing to the NameNode.
• Unlike GFS, only a single writer per file is allowed and no
record-append operation is possible.
Word Count Problem
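The slide's word-count figure is not reproduced here. Below is a minimal
Python sketch of the same idea, simulating the map, shuffle/sort, and
reduce steps in memory; the function and variable names are illustrative
and not part of the Hadoop API.

from collections import defaultdict

def map_phase(document):
    # Mapper: emit (word, 1) for every word in the input split
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework would
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Reducer: sum the counts for one word
    return word, sum(counts)

docs = ["the cat sat on the mat", "the dog sat"]
mapped = [pair for doc in docs for pair in map_phase(doc)]
results = [reduce_phase(w, c) for w, c in shuffle(mapped).items()]
print(sorted(results))
# [('cat', 1), ('dog', 1), ('mat', 1), ('on', 1), ('sat', 2), ('the', 3)]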
Details of MapReduce Execution

1. Run-Time coordination in MapReduce


2. Responsibilities of MapReduce Framework
3. MapReduce Execution Pipeline
4. Process Pipeline
5. Coping with Node Failure
Run-Time coordination in MapReduce

• MR handles distributed code execution on the cluster transparently
once the user submits a ".jar" file.

• MR takes care of both scheduling and synchronization.

• MR ensures that all jobs get a fairly equal share of resources.

• MR also implements a scheduling optimization called speculative
execution.
Speculative Execution

• Speculative tasks are launched for tasks that have been running for
some time (at least one minute) and have not made as much progress,
on average, as the other tasks from the job.
• The speculative task is killed if the original task completes before
it; conversely, the original task is killed if the speculative task
finishes first.
Responsibilities of MapReduce Framework

1. Provides overall coordination of execution.

2. Selects nodes for running mappers.

3. Starts and monitors the mappers' execution.

4. Sorts and shuffles the output of the mappers.

5. Chooses locations for the reducers' execution.

6. Delivers the output of the mappers to the reducer nodes.

7. Starts and monitors the reducers' execution.


MapReduce Execution Pipeline
MapReduce Execution Pipeline
• Driver: Defines the configuration and the specification of all the
job's components.

• Input data: Resides in HDFS or HBase.

• InputFormat: Defines how to read the input and how to split it; the
number of splits determines the number of Map tasks in the mapping
phase.

• RecordReader: Reads the data inside a map task's split, converts it
into key-value pairs, and delivers them to the Map method.

• Mapper: The partition of the key space produced by the mapper is
given as input to the reducers.
MapReduce Execution Pipeline
• Shuffle and sort: The process of moving map output to the reducers.
Pairs with the same key are grouped together and passed to a single
machine that will run the reduce code over them.

• Reducer: Executes user-defined code and produces output key-value
pairs.

• RecordWriter: Used for storing data in the location specified by
the OutputFormat.

• Distributed cache: A resource for sharing data globally among all
nodes in the cluster, for example a shared library that each task
can access.
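To make the RecordReader's role concrete, here is a hedged Python sketch
of what a text-input record reader does: it turns a split of a plain-text
file into (key, value) pairs, where the key is the byte offset of each
line and the value is the line itself. The file name and function name
are illustrative assumptions, not Hadoop API names.

def line_record_reader(path):
    # Illustrative stand-in for a text RecordReader: for each line it
    # delivers a (key, value) pair where the key is the byte offset of
    # the line and the value is the line's text, which is what the Map
    # method receives for plain-text input.
    offset = 0
    with open(path, "rb") as split_file:
        for raw in split_file:
            yield offset, raw.decode("utf-8").rstrip("\n")
            offset += len(raw)

# Usage (hypothetical file name):
# for key, value in line_record_reader("visits.txt"):
#     mapper(key, value)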
Process Pipeline

1. The job driver uses the InputFormat to partition a map's execution and
initiates a JobClient.
2. The JobClient communicates with the JobTracker and submits the job for
execution.
3. The JobTracker creates one Map task for each split as well as a set of
reducer tasks.
4. The TaskTracker, present on every node of the cluster, controls the
actual execution of the Map tasks.
5. It sends a periodic heartbeat message to the JobTracker.
6. The JobTracker then uses a scheduler to allocate tasks to the
TaskTracker.
7. Once a task is assigned to the TaskTracker, it copies the job's .jar
file to the TaskTracker's local file system and creates child processes.
8. Each child process informs the parent (TaskTracker) about the task's
progress every few seconds until it completes the task.
9. When the last task of the job is complete, the JobTracker receives a
notification and changes the status of the job to "Completed".
10. By periodically polling the JobTracker, the JobClient learns that the
job has completed.
Coping with Node Failures

• The JobTracker tracks all the MR jobs.

• There is only one JobTracker per Hadoop cluster.
• If the JobTracker fails, all the jobs running on its slaves are
halted and the whole MR job is re-started.
• If a TaskTracker fails, all the tasks running on it are re-started.
• Even completed map tasks have to be re-done, because their output,
destined for the reduce tasks, still resides on that node and has
now become unavailable.
• If there is a failure at a reduce node, the JobTracker sets its
reduce tasks back to idle and reschedules them on another node.
Algorithms using MapReduce

1. Matrix multiplication
2. Relational operators
3. Computing selection
4. Computing projection
Matrix Multiplication using MapReduce

# This pseudocode computes the matrix-vector product M.v, where M is
# given as sparse (i, j, m_ij) entries and v is the vector.
Map(key, value):
    for (i, j, m_ij) in value:
        emit(i, m_ij * v[j])

Reduce(key, values):
    result = 0
    for value in values:
        result += value
    emit(key, result)
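A small, runnable Python simulation of the pseudocode above, assuming M is
stored as sparse (i, j, m_ij) entries; the variable names are illustrative.

from collections import defaultdict

M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]   # the matrix [[1, 2], [3, 4]]
v = [10, 20]

groups = defaultdict(list)
for i, j, m_ij in M:                  # Map: emit (i, m_ij * v[j])
    groups[i].append(m_ij * v[j])

result = {i: sum(vals) for i, vals in groups.items()}   # Reduce: sum per row
print(result)   # {0: 50, 1: 110}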
Matrix Multiplication using MapReduce

[Figure] Dataflow for multiplying matrix A (M x N) by matrix B (N x L)
with MapReduce: one Map task iterates over the entries of A and another
over the entries of B, emitting each entry under keys that bring a(i,j)
and b(j,k) together; the reduce stage sorts the values by key, multiplies
the matching pairs a(i,j) * b(j,k), and sums the products for each output
key (i, k), producing A*B of size M x L.
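The figure's one-pass approach can be sketched in Python as follows, with
the shuffle simulated by a dictionary. A and B are assumed to be stored as
sparse (row, column, value) entries; all names are illustrative, a sketch
rather than a definitive implementation.

from collections import defaultdict

def matmul_mapreduce(A, B, M, L):
    # Map: send each a(i,j) to every output cell (i, k) it contributes to,
    # and each b(j,k) to every output cell (i, k) it contributes to.
    groups = defaultdict(list)
    for i, j, a in A:
        for k in range(L):
            groups[(i, k)].append(("A", j, a))
    for j, k, b in B:
        for i in range(M):
            groups[(i, k)].append(("B", j, b))

    # Reduce: for each output cell (i, k), pair the A and B entries that
    # share the same j and sum the products a(i,j) * b(j,k).
    C = {}
    for (i, k), entries in groups.items():
        a_by_j = {j: val for tag, j, val in entries if tag == "A"}
        b_by_j = {j: val for tag, j, val in entries if tag == "B"}
        C[(i, k)] = sum(a_by_j[j] * b_by_j[j] for j in a_by_j if j in b_by_j)
    return C

A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]   # [[1, 2], [3, 4]]
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]   # [[5, 6], [7, 8]]
print(matmul_mapreduce(A, B, M=2, L=2))
# {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}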
Review of some terminology

• In a traditional RDBMS, queries involve the retrieval of small
amounts of data.

• In MapReduce, queries involve full scans of large amounts of data:
queries are not selective, they process all the data.
• A relation is a table.
• Attributes are the column headers of the table.
• The set of attributes of a relation is called its schema.
MapReduce and Relational Operators

• Example: What is the average time spent per URL?

Select url, AVG(time) from Visits Group By url;

• Steps of MR (see the sketch after this list):
1. Map over tuples, emitting time keyed by url.
2. The framework automatically groups values by key.
3. Compute the average in the reducer.
4. Optimize with a combiner.
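A minimal Python sketch of these steps follows, with the shuffle simulated
by a dictionary. The combiner emits partial (sum, count) pairs rather than
averages, since averaging averages would give the wrong result. The table
contents and field names are illustrative assumptions.

from collections import defaultdict

def mapper(visit):
    # Step 1: emit the visit time keyed by URL
    url, time_spent = visit
    yield url, (time_spent, 1)

def combiner(url, pairs):
    # Step 4: pre-aggregate (sum, count) on the map side so the reducer
    # receives fewer records per URL
    yield url, (sum(t for t, _ in pairs), sum(c for _, c in pairs))

def reducer(url, pairs):
    # Step 3: compute the average from the partial sums and counts
    total = sum(t for t, _ in pairs)
    count = sum(c for _, c in pairs)
    yield url, total / count

visits = [("a.com", 10), ("b.com", 30), ("a.com", 20)]
groups = defaultdict(list)                      # Step 2: group by key
for visit in visits:
    for url, pair in mapper(visit):
        groups[url].append(pair)
for url, pairs in groups.items():
    partial = [p for _, p in combiner(url, pairs)]
    for u, avg in reducer(url, partial):
        print(u, avg)                           # a.com 15.0, b.com 30.0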
MR Algorithm for processing relational Data

1. The shuffle/sort automatically handles the GROUP BY sorting and
partitioning in MR.
2. The following operations are performed either in the Mapper or in
the Reducer:
– Selection
– Projection
– Union, intersection, difference
– Natural join
– Grouping and aggregation
3. Multiple strategies such as reduce-side join, map-side join and
in-memory join are used for relational joins (a reduce-side join
sketch follows).
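A hedged Python sketch of a reduce-side join: the mappers tag each tuple
with its relation name and emit it under the join key, and the reducer
pairs tuples from the two relations. The relations Customers and Orders,
and all field names, are illustrative examples, not taken from the slides.

from collections import defaultdict

def map_customers(customer):
    cust_id, name = customer
    yield cust_id, ("Customers", name)      # tag with the relation name

def map_orders(order):
    cust_id, amount = order
    yield cust_id, ("Orders", amount)

def reduce_join(cust_id, tagged_values):
    # Natural join on cust_id: pair every Customers tuple with every
    # Orders tuple that arrived under the same key
    names = [v for tag, v in tagged_values if tag == "Customers"]
    amounts = [v for tag, v in tagged_values if tag == "Orders"]
    for name in names:
        for amount in amounts:
            yield cust_id, name, amount

customers = [(1, "Asha"), (2, "Ravi")]
orders = [(1, 250), (1, 400), (2, 90)]
groups = defaultdict(list)                  # simulated shuffle
for record in customers:
    for k, v in map_customers(record):
        groups[k].append(v)
for record in orders:
    for k, v in map_orders(record):
        groups[k].append(v)
for k in sorted(groups):
    print(list(reduce_join(k, groups[k])))
# [(1, 'Asha', 250), (1, 'Asha', 400)]
# [(2, 'Ravi', 90)]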
Computing Selection by MapReduce

1. Map(): For each tuple t in R, test whether it satisfies condition C.
If so, produce the key-value pair (t, t); that is, both the key and
the value are t.
2. Reduce(): Simply passes each key-value pair to the output.
Computing Selection by MapReduce: Pseudo code

Map(key, value):
    for tuple in value:
        if tuple satisfies C:
            emit(tuple, tuple)

Reduce(key, values):
    # identity reduce: pass the selected tuple through to the output
    emit(key, key)
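A runnable Python version of this pseudocode, with the condition C passed
in as a function; the relation R and the condition used here are
illustrative.

def select_map(R, condition):
    # Map: emit (t, t) for every tuple t of R that satisfies condition C
    for t in R:
        if condition(t):
            yield t, t

def select_reduce(key, values):
    # Identity reduce: pass each selected tuple to the output
    for value in values:
        yield key, value

R = [("a", 50), ("b", 150), ("c", 300)]
for key, value in select_map(R, lambda t: t[1] > 100):
    for out in select_reduce(key, [value]):
        print(out)
# (('b', 150), ('b', 150))
# (('c', 300), ('c', 300))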
Hadoop Common Package

• Contains the libraries and utilities required by the other Hadoop
modules.

• Provides file-system and operating-system level abstractions.
Hadoop Distributed File System

• Manages storage and retrieval of the data and metadata required
for computation.

• HDFS creates multiple replicas of each data block and distributes
them across computers to enable reliable and rapid access.

• When a file is loaded into HDFS, it is fragmented into blocks of
data which are replicated and stored across the cluster nodes
(DataNodes).
Main components of HDFS

1. Name Node

2. Data Node
Name Node

• Master node.
• Contains the metadata.
• Maintains directories and files.
• Manages the blocks that are present on the DataNodes.
• Handles authorization and authentication.
Data Node

• Slave node.
• Responsible for processing read and write requests from clients.
• Handles the block storage.
• Periodically sends heartbeats and block reports to the NameNode.
Hadoop Map-Reduce

• A programming model and algorithm that helps in parallel processing.
• Two phases:
1. Map phase:
– The input is formed into a set of key-value pairs.
– Over each key-value pair, the desired function is executed so as to
generate a set of intermediate key-value pairs.
2. Reduce phase:
– The intermediate key-value pairs are grouped by key and the values
are combined together according to the reduce algorithm provided by
the user.
• HDFS is the storage system for both the input and the output of
MapReduce jobs.
Components of MapReduce

1. JobTracker:
– The master which manages the jobs and resources in the cluster.
– It schedules each map on a TaskTracker.
– There is one JobTracker per cluster.
2. TaskTracker:
– Slaves which run on every machine in the cluster.
– Responsible for running Map and Reduce tasks as instructed by the JobTracker.
3. JobHistoryServer:
– Daemon that saves historical information about completed tasks.
MapReduce is Bad For

1. Frequently changing data: MapReduce reads the entire data set only once.

2. Dependent tasks: tasks must not have dependencies on each other.

3. Interactive analysis: it doesn't return any result until the job ends.

MapReduce hides the complexity of parallel
programming and simplifies building
applications.
Yet Another Resource Negotiator

• YARN is the processing framework in Hadoop.

• Resource management.

• Job scheduling.

• Job monitoring.


Hadoop Ecosystem
Higher levels: interactivity

[Figure] Hadoop ecosystem stack: HDFS at the bottom, YARN above it for
scheduling and resource management, MapReduce on top of YARN, and
higher-level tools (Hive, Pig, Giraph, Spark, Storm, Flink, Zookeeper,
HBase, Cassandra, MongoDB) above MapReduce.

Lower levels: storage and scheduling

Hadoop Ecosystem

• HDFS
1. It is the foundation for many more big data frameworks.
2. It provides scalable and reliable storage.
3. As the size of the data increases, we can add commodity hardware to
increase storage capacity.

• YARN
1. Provides flexible scheduling and resource management over the HDFS storage.
2. Used at Yahoo to schedule jobs across 40,000 servers.

• MapReduce
1. A programming model.
2. Simplifies parallel computing.
3. Instead of dealing with the complexities of synchronization and scheduling,
MapReduce deals with only two functions:
– Map()
– Reduce()
4. Used by Google for indexing websites.
Hadoop Ecosystem

• Hive
1. A programming model.
2. Created at Facebook to issue SQL-like queries using MapReduce on their data in
HDFS.
3. It is basically a data warehouse that provides ad-hoc queries, data
summarization and analysis of huge data sets.

• Pig
1. A high-level programming model.
2. Processes and analyses big data using user-defined functions.
3. Created at Yahoo to model data-flow-based programs using MapReduce.
4. Like Hive, it provides a bridge to query data on Hadoop, but unlike Hive it
uses a data-flow language rather than SQL.

• Giraph
1. A specialized model for graph processing.
2. Used by Facebook to analyze the social graph.
Hadoop Ecosystem

• Spark / Storm / Flink
1. Real-time, in-memory data processing.
2. In-memory processing is up to 100x faster for some tasks.

• HBase
1. A NoSQL / non-relational distributed database.
2. It is a backing store for MapReduce jobs.
3. HBase is column-oriented rather than row-oriented, for fast processing.

• Cassandra
1. A NoSQL / non-relational distributed database.
2. MapReduce can retrieve data from Cassandra.
Hadoop Ecosystem

• MongoDB
1. A NoSQL database.
2. A document-oriented database system.
3. It stores structured data as JSON-like documents.

• ZooKeeper
1. Manages the cluster.
2. Running all these tools requires a centralized management system for
synchronization, configuration and for ensuring high availability.

• Mahout, Spark MLlib -> machine learning
• Apache Drill -> SQL on Hadoop
• Oozie -> job scheduling
• Flume, Sqoop -> data-ingestion services
• Solr & Lucene -> searching and indexing
• Ambari -> provision, monitor and maintain the cluster
Physical Architecture

• The combination of a cloud environment with big data processing tools such as
Hadoop provides the high-performance computing power needed to analyze vast
amounts of data efficiently and cost-effectively.

• A typical machine configuration for storage and compute servers:
1. 32 GB memory
2. 4-core processors
3. 200-320 GB hard disk

• Running Hadoop in virtualized environments continues to develop and mature
with initiatives from open-source software projects.
Cloud Computing infrastructure to support Big Data Analytics

[Figure] Cloud computing infrastructure to support big data analytics:
storage nodes and compute nodes connected by a switch on a cloud-management
VLAN, alongside virtual machines for HBase, Zookeeper, LDAP, the web console
and a database server, with a web server fronting the cloud integration
environment. Machine configuration: 32 GB memory, 4 cores, 200-320 GB disk.
Physical Architecture

• Every Hadoop-compatible file system should provide location awareness for
effective scheduling of work.
• A Hadoop application uses this information to find the data node and run the
task there.
• HDFS replicates data to keep different copies of the data on different racks
to reduce the impact of a rack power or switch failure.

[Figure] Hadoop cluster layout: a master node running the JobTracker, the
NameNode, and its own TaskTracker and DataNode, plus worker nodes each
running a TaskTracker and a DataNode.
Hadoop limitations

1. Security concerns
• Security is disabled by default.
• Hadoop does not provide encryption at the storage and network levels.

2. Vulnerable by nature
• Written entirely in Java.
• Java is one of the languages most widely exploited by cyber criminals.

3. Not fit for small data
• Not all big data platforms are suitable for handling small files.
• Due to its high-capacity design, HDFS supports small files inefficiently.
• Not recommended for small-scale industries.
Hadoop

Thank You
