Chapter Five: Hadoop MapReduce & HDFS

Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It allows for the parallel processing of large datasets across nodes in a cluster using a simple programming model. Hadoop features distributed storage, parallel processing, fault tolerance, and scalability.

Hadoop, a Distributed Framework for Big Data

What is Hadoop?
• An open-source software framework that supports data-
intensive distributed applications, licensed under the
Apache v2 license.
• Abstract and facilitate the storage and processing of large
and/or rapidly growing data sets.
• Structured and non-structured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault-tolerance
• Move computation rather than data
Hadoop Framework Tools
Major Hadoop components

• Hadoop has two major components: MapReduce and HDFS
  – MapReduce: a programming model (framework) that helps programs
perform parallel computation on data
  – Hadoop Distributed File System (HDFS): a distributed file
system that runs on standard or low-end hardware
Hadoop MapReduce?
MapReduce is a programming model that Google has used
successfully in processing its “big data” sets (~20 petabytes
per day)
• A map function extracts some intelligence from raw data.
• A reduce function aggregates the data output by the map,
according to some guides.
• Users specify the computation in terms of a map and a
reduce function.
• The underlying runtime system automatically parallelizes the
computation across large-scale clusters of machines.
• The underlying system also handles machine failures, efficient
communications, and performance issues.
MapReduce: “A new abstraction that allows us to express the
simple computations we were trying to perform but hides the
messy details of parallelization, fault-tolerance, data distribution
and load balancing in a library.”

• Programming model:
– Provides abstraction to express computation
• Library:
– Takes care of the runtime parallelisation of the computation.
• Distributed, with some centralization
• Main nodes of cluster are where most of the computational
power and storage of the system lies
MapReduce Programming Model
• Inspired from map and reduce operations commonly used in
functional programming languages like Lisp.
• Input: a set of key/value pairs
• User supplies two functions:
– map(k,v) → list(k1,v1)
– reduce(k1, list(v1)) → v2
• (k1,v1) is an intermediate key/value pair
• Output is the set of (k1,v2) pairs
• For our example, assume that system
– Breaks up files into lines, and
– Calls map function with value of each line
• Key is the line number
Programming Model

 Input: a set of key/value pairs


 Output: a set of key/value pairs
 Computation is expressed using the two functions:
1. Map task: a single pair  a list of intermediate pairs
 map(input-key, input-value)  list(out-key, intermediate-value)
 <ki, vi>  { < kint, vint > }

2. Reduce task: all intermediate pairs with the same kint  a list of values
 reduce(out-key, list(intermediate-value))  list(out-values)
 < kint, {vint} >  < ko, vo >
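To make these signatures concrete, the sketch below is a minimal, hypothetical Python rendering of the two function shapes (the names map_fn and reduce_fn and the word-count bodies are illustrative assumptions, not Hadoop's API):

from typing import Iterable, Iterator, Tuple

# map: (input-key, input-value) -> list of (out-key, intermediate-value)
def map_fn(input_key: str, input_value: str) -> Iterator[Tuple[str, int]]:
    # Illustrative body: emit one intermediate pair per word in the value.
    for word in input_value.split():
        yield (word, 1)

# reduce: (out-key, list of intermediate-values) -> list of out-values
def reduce_fn(out_key: str, intermediate_values: Iterable[int]) -> Iterator[int]:
    # Illustrative body: collapse all values for one key into a single value.
    yield sum(intermediate_values)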

Example: Counting the number of occurrences of each
word in a collection of documents

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
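For readers who want to run the pseudocode above, here is a minimal sketch of the same job written as Hadoop Streaming scripts in Python (the file names mapper.py and reducer.py are assumptions; Streaming feeds input lines on stdin and expects tab-separated key/value pairs on stdout):

# mapper.py -- emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- stdin arrives sorted by key, so all counts for a word are adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

A job like this is typically submitted through the Hadoop Streaming jar, passing the two scripts as the -mapper and -reducer options; the exact jar location depends on the installation.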
Parallel Processing of MapReduce Job
[Figure: the user program forks copies of itself and a master process; the master assigns map tasks and reduce tasks to workers. Map workers read partitions of the input file and write intermediate files to local disk; reduce workers remote-read and sort those intermediate partitions and write the final output files.]
MapReduce: Word Count Example
• Consider the problem of counting the number of occurrences of
each word in a large collection of documents
• How would you do it in parallel?
• Solution:
– Divide documents among workers
– Each worker parses document to find all words, map
function outputs (word, count) pairs
– Partition (word, count) pairs across workers based on word
– For each word at a worker, the reduce function locally adds up
the counts
• Given input: “One a penny, two a penny, hot cross buns.”
– Records output by the map() function would be
• (“One”, 1), (“a”, 1), (“penny”, 1),(“two”, 1), (“a”, 1),
(“penny”, 1), (“hot”, 1), (“cross”, 1), (“buns”, 1).
– Records output by reduce function would be
• (“One”, 1), (“a”, 2), (“penny”, 2), (“two”, 1), (“hot”, 1),
(“cross”, 1), (“buns”, 1)
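A quick way to check these records by hand is to simulate the map, shuffle, and reduce steps locally; the short Python sketch below (no Hadoop involved) reproduces the output listed above:

from collections import defaultdict

line = "One a penny, two a penny, hot cross buns."

# map: emit (word, 1) per word, stripping trailing punctuation
map_output = [(w.strip(".,"), 1) for w in line.split()]

# shuffle: group intermediate pairs by key
groups = defaultdict(list)
for word, one in map_output:
    groups[word].append(one)

# reduce: sum the counts for each word
print({word: sum(counts) for word, counts in groups.items()})
# {'One': 1, 'a': 2, 'penny': 2, 'two': 1, 'hot': 1, 'cross': 1, 'buns': 1}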
Schematic Flow of Keys and Values
• Flow of keys and values in a map reduce task

[Figure: map inputs (mk1, mv1) … (mkn, mvn) produce map outputs such as (rk1, rv1), (rk7, rv2), (rk2, rv8); outputs that share a key are grouped, so each reduce input is a key with the list of all values emitted for it, e.g. (rk1, [rv1, rv7, …]), (rk7, [rv2, …]), (rk2, [rv8, rvi, …]).]
Hadoop’s Architecture: MapReduce Engine
Hadoop MapReduce Engine
A MapReduce Process
• JobClient
• Submit job
• JobTracker
• Manage and schedule job, split job into tasks;
• Splits up data into smaller tasks(“Map”) and
sends it to the TaskTracker process in each node
• TaskTracker
• Start and monitor the task execution;
• Reports back to the JobTracker node on job progress, sends
data (“Reduce”), or requests new jobs
• Child
• The process that really executes the task
MapReduce Engine
• Main nodes run a TaskTracker to accept and reply to MapReduce
tasks; main nodes also run a DataNode to store the needed blocks
as closely as possible
• Central control node runs NameNode to keep track of HDFS
directories & files, and JobTracker to dispatch compute tasks to
TaskTracker
• MapReduce requires a distributed file system and an engine
that can distribute, coordinate, monitor and gather the results.
• Hadoop provides that engine through HDFS (the file system we
discussed earlier) and the JobTracker + TaskTracker system.
• JobTracker is simply a scheduler.
• A TaskTracker is assigned a Map or Reduce task (or other
operations); the Map or Reduce runs on the same node as its
TaskTracker, and each task runs in its own JVM on that node.
• Written in Java, also supports Python and Ruby
MapReduce, Batch Processing
• Batch processing: processing large amounts of data at once, in
one go, to deliver a result according to a query on the data.
• Need for many computations over large/huge sets of data:
  – Input data: crawled documents, web request logs
  – Output data: inverted indices, summary of pages crawled per
host, the set of the most frequent queries in a given day, …
• Most of these computations are relatively straightforward
• To speed up computation and shorten processing time, we can
distribute data across 100s of machines and process them in
parallel
• But parallel computations are difficult and complex to manage:
  – Race conditions, debugging, data distribution, fault-tolerance,
load balancing, etc.
• Ideally, we would like to process data in parallel but not deal
with the complexity of parallelisation and data distribution
MapReduce Example Applications
• The MapReduce model can be applied to many applications:
  – Distributed grep (sketched below):
    map: emits a line if the line matches the pattern
    reduce: identity function
  – Count of URL access frequency
  – Reverse web-link graph
  – Inverted index
  – Distributed sort
  – …

MapReduce Implementation

• The MapReduce implementation presented in the paper
matched Google's infrastructure at the time:
  1. Large cluster of commodity PCs connected via switched Ethernet
  2. Machines are typically dual-processor x86, running Linux, with 2-4 GB of
memory (slow machines by today's standards)
  3. A cluster of machines, so failures are anticipated
  4. Storage with the Google File System (GFS, 2003) on IDE disks
attached to the PCs. GFS is a distributed file system that uses replication
for availability and reliability.
• Scheduling system:
  1. Users submit jobs
  2. Each job consists of tasks; the scheduler assigns tasks to machines
Parallel Execution
• User specifies:
  – M: number of map tasks
  – R: number of reduce tasks
• Map:
  – The MapReduce library splits the input file into M pieces
  – Typically 16-64 MB per piece (see the sketch after this list)
  – Map tasks are distributed across the machines
• Reduce:
  – Partition the intermediate key space into R pieces
  – hash(intermediate_key) mod R
• Typical setting:
  – 2,000 machines
  – M = 200,000
  – R = 5,000
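To connect M with the split size, a small back-of-the-envelope sketch in Python (the 1 TB input size is an illustrative assumption):

# Number of map tasks implied by splitting the input into 64 MB pieces.
input_bytes = 1 * 1024**4            # assume 1 TB of input
split_bytes = 64 * 1024**2           # 64 MB per piece (upper end of the 16-64 MB range)

M = -(-input_bytes // split_bytes)   # ceiling division
print(M)                             # 16384 map tasks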
Execution Flow

Master Data Structures

• For each map/reduce task, the master stores:
  – State/status: {idle, in-progress, completed}
  – Identity of the worker machine (for non-idle tasks)
• The location of intermediate file regions is passed from map
tasks to reduce tasks through the master.
• This information is pushed incrementally (as map tasks finish) to
workers that have in-progress reduce tasks.

Fault-Tolerance
Two types of failures:
1. Worker failures:
  – Identified by heartbeat messages sent by the master. If there is no
response within a certain amount of time, the worker is considered dead.
  – In-progress and completed map tasks are re-scheduled → idle
  – In-progress reduce tasks are re-scheduled → idle
  – Workers executing reduce tasks affected by failed map workers are
notified of the re-scheduling
  – Question: Why do completed map tasks have to be re-scheduled?
  – Answer: Map output is stored on the local file system, while reduce output
is stored on GFS
2. Master failure:
  – Rare
  – Can be recovered from checkpoints
  – Solution: abort the MapReduce computation and start again
Disk Locality

• Network bandwidth is a relatively scarce resource and also
increases latency
• The goal is to save network bandwidth
• GFS typically stores three copies of each data block
on different machines
• Map tasks are scheduled “close” to the data:
  – On nodes that have the input data (local disk)
  – If not possible, on nodes that are nearer to the input data (e.g., same switch)

Task Granularity
• Number of map tasks > number of worker nodes
  – Better load balancing
  – Better recovery
• But this increases the load on the master:
  – More scheduling
  – More state to be saved
• M could be chosen with respect to the block size of the file
system
  – For locality properties
• R is usually specified by users
  – Each reduce task produces one output file

Stragglers
• Slow workers delay overall completion time → stragglers
  – Bad disks with soft errors
  – Other tasks using up resources
  – Machine configuration problems, etc.
• Very close to the end of a MapReduce operation, the master schedules backup
executions of the remaining in-progress tasks.
• A task is marked as complete whenever either the primary or the
backup execution completes.
• Example: a sort operation takes 44% longer to complete when the
backup task mechanism is disabled.

Refinements: Partitioning Function

• The partitioning function identifies the reduce task
  – Users specify the number of output files they want, R
  – But there may be many more keys than R
  – Uses the intermediate key and R
  – Default: hash(key) mod R
• It is important to choose a well-balanced partitioning function:
  – hash(hostname(urlkey)) mod R
  – For output keys that are URLs, this places all URLs from the same
host in the same output file (see the sketch below)
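A minimal sketch of such a host-based partitioning function (plain Python to illustrate the idea; the function name is an assumption and this is not Hadoop's Partitioner API):

from urllib.parse import urlparse
from zlib import crc32

def url_partition(url_key: str, R: int) -> int:
    # All URLs from the same host hash to the same reduce task / output file.
    host = urlparse(url_key).netloc
    # crc32 is stable across runs (Python's built-in hash() for str is salted).
    return crc32(host.encode("utf-8")) % R

# Both keys land in the same partition out of R = 5000.
print(url_partition("http://example.com/a.html", 5000))
print(url_partition("http://example.com/b/c.html", 5000))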

Refinements: Combiner Function

• Introduces a mini-reduce phase before intermediate data is
sent to the reduce tasks
  – Used when there is significant repetition of intermediate keys
  – Merges values of intermediate keys before sending them to the reduce tasks
  – Example: in word count there are many records of the form <word, 1>;
merge records with the same word (see the sketch below)
  – Similar to the reduce function
• Saves network bandwidth
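The sketch referenced above shows, in plain Python, what a combiner does to one map task's local output for word count (for this job the combiner logic is the same as the reducer's):

from collections import Counter

# Intermediate pairs produced by one map task, before any network transfer.
map_output = [("hot", 1), ("cross", 1), ("buns", 1), ("hot", 1), ("hot", 1)]

# Combiner: locally merge values that share a key.
combined = Counter()
for word, count in map_output:
    combined[word] += count

print(list(combined.items()))
# [('hot', 3), ('cross', 1), ('buns', 1)] -- 3 records sent instead of 5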

Hadoop MapReduce Summary
• Hadoop is a software framework for distributed processing of
large datasets across large clusters of computers
  – Large datasets → terabytes or petabytes of data
  – Large clusters → hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google
MapReduce
• Hadoop is based on a simple programming model called
MapReduce
• Hadoop is based on a simple data model; any data will fit

Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware
  – A large number of low-end, cheap machines working in
parallel to solve a computing problem
  – This is in contrast to parallel DBs: a small number of
high-end, expensive machines
• Automatic parallelization & distribution
  – Hidden from the end user
• Fault tolerance and automatic recovery
  – Nodes/tasks will fail and will recover automatically
• Clean and simple programming abstraction
  – Users only provide two functions, “map” and “reduce”
Divide and Conquer

[Figure: a “Work” item is partitioned into w1, w2, w3 and handed to three workers; each worker produces a partial result r1, r2, r3, and these are combined into the final “Result”.]
Distributed File System
• Two major products are used in big data computing: HDFS
(Hadoop Distributed File System) and Google's GFS
(Google File System, now Colossus).
• HDFS uses a master/slave architecture, in which a
master node (NameNode) and a group of slave nodes
(DataNodes) are used to create a data storage system.
• A data file saved to HDFS is first partitioned into multiple
data blocks of a fixed size, and duplicate copies of
each data block are distributed to various DataNodes.
• Don't move data to workers… move workers to the data!
  – Store data on the local disks of nodes in the cluster
  – Start up the workers on the node that has the data local
• Why?
  – Not enough RAM to hold all the data in memory
  – Disk access is slow, but disk throughput is reasonable
• A distributed file system is the answer:
  – GFS (Google File System)
  – HDFS for Hadoop (a GFS clone)
Hadoop Distributed File System (HDFS)
• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Can be built out of commodity hardware
• Master/slave architecture:
  – An HDFS cluster consists of a single NameNode, a master server that
manages the file system namespace and regulates access to files by
clients
  – There are a number of DataNodes, usually one per node in the cluster
  – HDFS exposes a file system namespace and allows user data to be
stored in files
  – A file is split into one or more blocks, and the set of blocks is stored in
DataNodes
NameNode
• Maps a filename to list of Block IDs.
• Maps each Block ID to DataNodes containing a replica of the block.
• Stores metadata for the files, like the directory structure of a typical
FS.
• The server holding the NameNode instance is quite crucial, as there is
only one.
• Transaction log for file deletes/adds, etc. Does not use transactions for
whole blocks or file-streams, only metadata.
• Handles creation of more replica blocks when necessary after a
DataNode failure
• Single namespace for the entire cluster. Files are broken up into blocks
  – Typically 64 MB block size
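To make the block and replica bookkeeping concrete, a small back-of-the-envelope sketch in Python (the 200 MB file size is an illustrative assumption; 64 MB blocks and a replication factor of 3 are the defaults quoted in these slides):

import math

file_mb     = 200    # assumed file size
block_mb    = 64     # typical HDFS block size
replication = 3      # default replication factor

blocks   = math.ceil(file_mb / block_mb)   # 4 blocks (the last one only partially full)
replicas = blocks * replication            # 12 block replicas spread over DataNodes
raw_mb   = file_mb * replication           # ~600 MB of raw disk consumed

print(blocks, replicas, raw_mb)            # 4 12 600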
DataNode
• The DataNodes manage storage attached to the nodes that they run on
• DataNodes serve read and write requests, and perform block creation,
deletion, and replication upon instruction from the NameNode
• Maps a block ID to a physical location on disk
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc.)
• Notifies the NameNode of what blocks it has
• Each block is replicated on multiple DataNodes
Client
• Finds the location of blocks from the NameNode
• Accesses data directly from the DataNodes
• A client can only append to existing files
Data Coherency
• Write-once-read-many access model
• Distributed file systems are good for millions of large files,
but have very high overheads and poor performance with billions of
smaller tuples
Main Properties of HDFS

• Large: an HDFS instance may consist of thousands of
server machines, each storing part of the file system's
data
• Replication: each data block is replicated many
times (default is 3)
• Failure: failure is the norm rather than the exception
• Fault tolerance: detection of faults and quick,
automatic recovery from them is a core architectural
goal of HDFS
  – The NameNode is constantly checking the DataNodes

NameNode and DataNode Comparison

NameNode:
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is quite crucial, as there is only one.
• Keeps a transaction log for file deletes/adds, etc. Does not use transactions for
whole blocks or file streams, only metadata.
• Handles creation of more replica blocks when necessary after a DataNode failure.

DataNode:
• Stores the actual data in HDFS.
• Can run on any underlying filesystem (ext3/4, NTFS, etc.)
• Notifies the NameNode of what blocks it has.
• The NameNode replicates blocks 2x in the local rack, 1x elsewhere.
HDFS Architecture
[Figure: a client issues metadata operations to the NameNode, which holds metadata such as file name and number of replicas (e.g. /home/foo/data, 6), and performs block read/write operations directly against DataNodes spread over racks; blocks are replicated across racks.]
• Structure: master/slave
• Organization:
  – One master node
  – A group of slave nodes
• A data file is partitioned into blocks of a fixed size
  – 3 copies of each data block are generated and stored on different DataNodes
Software Components on Hadoop

Master/Slave mode:
• On the master node, a NameNode thread runs to do
the work of DataNode registration, file partitioning, data block
replication and distribution, job scheduling, and cluster
resource management
• On each slave node, a DataNode thread runs to do
the tasks of data block storage, node state reporting, data
processing, etc.
The Google File System
GFS architecture and components: GFS is composed of clusters.
• A cluster is a set of networked computers. GFS clusters contain
three types of interdependent entities: clients, a master,
and chunk servers.
• Clients can be computers or applications manipulating
existing files or creating new files on the system.
• The master server is the orchestrator or manager of the cluster;
it maintains the operation log. The operation log keeps track
of the activities made by the master itself, which helps reduce
service interruptions to a minimum level.
• At startup, the master server retrieves information about contents
and inventories from the chunk servers. After that, the master server
keeps track of the location of the chunks within the cluster.
• The GFS architecture keeps the messages that the master server
sends and receives very small. The master server itself
doesn't handle file data at all; this is done by the chunk servers.
• Chunk servers are the core engine of the GFS. They store file
chunks of 64 MB in size. Chunk servers coordinate with the master
server and send requested chunks to clients directly.
• GFS replicas: GFS has two kinds of replicas, primary and secondary.
  – A primary replica is the data chunk that a chunk server sends to a
client.
  – Secondary replicas serve as backups on other chunk servers. The
master server decides which chunks act as primary or secondary.
If the client makes changes to the data in the chunk, the
master server lets the chunk servers holding secondary replicas
know that they have to copy the new chunk off the primary chunk
server to stay current.
Google File System (GFS) Motivation
• GFS was developed in the late 1990s; it uses thousands of storage
systems built from inexpensive commodity components to
provide petabytes of storage to a large user community with
diverse needs
• Motivation:
  1. Component failures are the norm
(application/OS bugs, human errors, failures of disks, power supplies, …)
  2. Files are huge (multi-GB to multi-TB files)
  3. The most common operation is to append to an existing
file; random write operations to a file are extremely
infrequent. Sequential read operations are the norm.
  4. The consistency model should be relaxed to simplify the
system implementation, but without placing an additional
burden on the application developers
GFS Assumptions
• The system is built from inexpensive commodity components
that often fail.
• The system stores a modest number of large files.
• The workload consists mostly of two kinds of reads: large
streaming reads and small random reads.
• The workloads also have many large sequential writes that
append data to files.
• The system must implement well-defined semantics for many
clients simultaneously appending to the same file.

Three main applications of Hadoop

• Advertisement (Mining user behavior to generate


recommendations)
• Searches (group related documents)
• Security (search for uncommon patterns)
