
Introduction to

MapReduce/Hadoop
Limitations of Existing Data Analytics Architecture

[Figure: in the existing architecture, instrumentation and collection feed raw data into a storage-only grid (original raw data); an ETL compute grid loads aggregated data into an RDBMS, with BI reports and interactive apps on top. The pain points: users can’t explore the original high-fidelity raw data, moving data to compute doesn’t scale, and archiving the mostly-append data amounts to premature data death.]
©2011 Cloudera, Inc. All Rights Reserved.

Slide from Dr. Amr Awadallah (CTO & VPE at Cloudera), from his Hadoop talk at Stanford
Typical Large-Data Problem
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output

• The problem:
– Diverse input formats (data diversity & heterogeneity)
– Large Scale: Terabytes, Petabytes
– Parallelization

(Dean and Ghemawat, OSDI 2004)


How to leverage a number of
cheap off-the-shelf computers?

Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf


Divide and Conquer
[Figure: the “work” is partitioned into w1, w2, w3; each partition is handled by a “worker”, producing partial results r1, r2, r3; the partial results are combined into the final “result”.]
Parallelization Challenges
• How do we assign work units to workers?
• What if we have more work units than
workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have
finished?
• What if workers die?
What is the common theme of all of these problems?
Common Theme?
• Parallelization problems arise from:
– Communication between workers (e.g., to
exchange state)
– Access to shared resources (e.g., data)
• Thus, we need a synchronization mechanism
Source: Ricardo Guimarães Herrmann
Managing Multiple Workers
• Difficult because
– We don’t know the order in which workers run
– We don’t know when workers interrupt each other
– We don’t know the order in which workers access shared data
• Thus, we need:
– Semaphores (lock, unlock)
– Condition variables (wait, notify, broadcast)
– Barriers
• Still, lots of problems:
– Deadlock, livelock, race conditions...
– Dining philosophers, sleeping barbers, cigarette smokers...
• Moral of the story: be careful!
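
For a feel of what this coordination burden looks like in code, here is a minimal plain-Java sketch (not Hadoop; the worker count and the fake per-worker “work” are invented for illustration) in which three workers meet at a barrier before their partial results are aggregated:

import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
  public static void main(String[] args) {
    final int WORKERS = 3;                       // hypothetical worker count
    final int[] partial = new int[WORKERS];      // per-worker partial results
    // The barrier action runs once all workers have arrived: aggregate the partials.
    CyclicBarrier barrier = new CyclicBarrier(WORKERS, () -> {
      int total = 0;
      for (int p : partial) total += p;
      System.out.println("aggregated result = " + total);
    });
    for (int i = 0; i < WORKERS; i++) {
      final int id = i;
      new Thread(() -> {
        partial[id] = (id + 1) * 10;             // stand-in for real work
        try {
          barrier.await();                       // wait for the other workers
        } catch (Exception e) {                  // interrupted or broken barrier
          Thread.currentThread().interrupt();
        }
      }).start();
    }
  }
}

Even in this toy example the programmer owns scheduling, failure handling, and aggregation by hand; the rest of the lecture is about frameworks that take that burden away.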
Current Tools
• Programming models
– Shared memory (pthreads)
– Message passing (MPI)
• Design Patterns
– Master-slaves
– Producer-consumer flows
– Shared work queues
[Figure: shared memory (processes P1–P5 sharing one memory) vs. message passing (processes P1–P5 exchanging messages); a master coordinating a pool of slaves; producers and consumers around a shared work queue.]
Concurrency Challenge!
• Concurrency is difficult to reason about
• Concurrency is even more difficult to reason about
– At the scale of datacenters (even across datacenters)
– In the presence of failures
– In terms of multiple interacting services
• Not to mention debugging…
• The reality:
– Lots of one-off solutions, custom code
– Write your own dedicated library, then program with it
– Burden on the programmer to explicitly manage
everything
What’s the point?
• It’s all about the right level of abstraction
– The von Neumann architecture has served us well,
but is no longer appropriate for the multi-
core/cluster environment
• Hide system-level details from the developers
– No more race conditions, lock contention, etc.
• Separating the what from how
– Developer specifies the computation that needs to be
performed
– Execution framework (“runtime”) handles actual
execution
Key Ideas
• Scale “out”, not “up”
– Limits of SMP and large shared-memory machines
• Move processing to the data
– Clusters have limited bandwidth
• Process data sequentially, avoid random access
– Seeks are expensive, disk throughput is reasonable
• Seamless scalability
– From the mythical man-month to the tradable
machine-hour
The datacenter is the computer!

Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf


Apache Hadoop
• The name Hadoop is a made-up name. The
project’s creator, Doug Cutting, explains how
the name came about:
– The name my kid gave a stuffed yellow elephant.
Short, relatively easy to spell and pronounce,
meaningless, and not used elsewhere: those are
my naming criteria. Kids are good at generating
such. Googol is a kid’s term.
Apache Hadoop
• Scalable fault-tolerant distributed system for Big Data:
– Data Storage
– Data Processing
– A virtual Big Data machine
– Borrowed concepts/Ideas from Google; Open source
under the Apache license
• Core Hadoop has two main systems:
– Hadoop/MapReduce: distributed big data processing infrastructure (abstraction/paradigm, fault tolerance, scheduling, execution)
– HDFS (Hadoop Distributed File System): fault-tolerant,
high-bandwidth, high availability distributed storage
Apache YARN
• Apache YARN (Yet Another Resource Negotiator) is
Hadoop’s cluster resource management system.
MapReduce: Big Data Processing Abstraction
RDBMS compared to MapReduce
Terminology
• A MapReduce job is a unit of work that the client
wants to be performed:
– input data,
– the MapReduce program
– Configuration information.
• Hadoop runs the job by dividing it into tasks, of
which there are two types:
– map tasks and reduce tasks.
– The tasks are scheduled using YARN and run on nodes
in the cluster.
Terminology
• Hadoop divides the input to a MapReduce job into
fixed-size pieces called input splits, or just splits.
Hadoop creates one map task for each split, which
runs the user-defined map function for each record
in the split.
• Having many splits means the time taken to
process each split is small compared to the time to
process the whole input.
• So if we are processing the splits in parallel, the
processing is better load balanced when the splits
are small, since a faster machine will be able to
process proportionally more splits over the course
of the job than a slower machine.
Split size
• If splits are too small, the overhead of
managing the splits and map task creation
begins to dominate the total job execution
time.
• For most jobs, a good split size tends to be the
size of an HDFS block, which is 128 MB by
default, although this can be changed for the
cluster or specified when each file is created.
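
As a hedged sketch of where these knobs live in practice (the dfs.blocksize property and the FileInputFormat helper are taken from the Hadoop 2+ API; the values simply restate the defaults mentioned above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Block size used for files this job writes (128 MB, the usual default).
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
    Job job = Job.getInstance(conf, "split-size demo");
    // Cap each input split at one HDFS block so map tasks can stay node-local.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
  }
}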
Split size
• It should now be clear why the optimal split size
is the same as the block size:
– it is the largest size of input that can be guaranteed
to be stored on a single node.
– If the split spanned two blocks, it would be unlikely
that any HDFS node stored both blocks, so some of
the split would have to be transferred across the
network to the node running the map task, which is
clearly less efficient than running the whole map
task using local data.
Data locality optimization
• Hadoop does its best to run each map task on a node where the input data resides in HDFS, because doing so avoids using valuable cluster bandwidth.
Map output
• Map tasks write their output to the local disk,
not to HDFS.
• Map output is intermediate output: it’s
processed by reduce tasks to produce the final
output, and once the job is complete, the map
output can be thrown away. So, storing it in
HDFS with replication would be overkill.
[Figure: MapReduce data flow with a single reduce task.]
[Figure: MapReduce data flow with multiple reduce tasks.]
Typical Large-Data Problem
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
Key idea: provide a functional abstraction for these two
operations

(Dean and Ghemawat, OSDI 2004)


Roots in Functional Programming

[Figure: Map applies a function f to each element of a list independently; Fold sweeps an accumulating function g across the list to produce a single result.]
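
The same two ideas in ordinary Java (a sketch, not Hadoop code): map applies a function f to every element independently, and fold/reduce sweeps an accumulating function g over the results:

import java.util.List;

public class MapFoldDemo {
  public static void main(String[] args) {
    List<Integer> records = List.of(1, 2, 3, 4, 5);
    int result = records.stream()
        .map(x -> x * x)               // “map”: apply f to each element independently
        .reduce(0, Integer::sum);      // “fold”: combine results with g, starting from 0
    System.out.println(result);        // prints 55 (sum of squares)
  }
}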
MapReduce
• Programmers specify two functions:
map (k, v) → [(k’, v’)]
reduce (k’, [v’]) → [(k’, v’)]
– All values with the same key are sent to the same
reducer
• The execution framework handles everything
else…
Key Observation from Data Mining Algorithms (Jin & Agrawal, SDM’01)
• Popular algorithms have a common canonical loop:

  While( ) {
    forall( data instances d ) {
      I = process(d)
      R(I) = R(I) op d
    }
    …….
  }

• Can be used as the basis for supporting a common middleware (FREERide, Framework for Rapid Implementation of Data mining Engines)
• Target distributed memory parallelism, shared memory parallelism, and combination
• Ability to process large and disk-resident datasets
[Figure: input pairs (k1,v1) through (k6,v6) flow through four map tasks, which emit intermediate pairs such as (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8); shuffle and sort aggregates values by key (a → [1,5], b → [2,7], c → [2,3,6,8]); three reduce tasks then produce the outputs (r1,s1), (r2,s2), (r3,s3).]
MapReduce
• Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, [v’]) → <k’, v’>*
– All values with the same key are sent to the same
reducer
• The execution framework handles everything
else…
What’s “everything else”?
MapReduce “Runtime”
• Handles scheduling
– Assigns workers to map and reduce tasks
• Handles “data distribution”
– Moves processes to data
• Handles synchronization
– Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
– Detects worker failures and restarts
• Everything happens on top of a distributed FS
(later)
MapReduce
• Programmers specify two functions:
map (k, v) → [(k’, v’)]
reduce (k’, [v’]) → [(k’, v’)]
– All values with the same key are reduced together
• The execution framework handles everything else…
• Not quite…usually, programmers also specify:
partition (k’, number of partitions) → partition for k’
– Often a simple hash of the key, e.g., hash(k’) mod n
– Divides up key space for parallel reduce operations
combine (k’, [v’]) → [(k’, v’’)]
– Mini-reducers that run in memory after the map phase
– Used as an optimization to reduce network traffic
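
For illustration, the hash-of-the-key rule above is essentially what Hadoop’s default HashPartitioner does; a custom partitioner (a sketch, assuming the word-count key/value types used later) only needs to override getPartition and be registered with job.setPartitionerClass(...):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of a custom partitioner: hash(k') mod (number of reduce tasks).
public class HashModPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is always non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}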
[Figure: the same dataflow with combiners and partitioners. Each map task’s output is first combined locally (for example, (c,3) and (c,6) from one mapper become (c,9)) and then partitioned by key; shuffle and sort aggregates values by key (a → [1,5], b → [2,7], c → [2,9,8]); three reduce tasks produce (r1,s1), (r2,s2), (r3,s3).]
Two more details…
• Barrier between map and reduce phases
– But we can begin copying intermediate data
earlier
• Keys arrive at each reducer in sorted order
– No enforced ordering across reducers
MapReduce can refer to…
• The programming model
• The execution framework (aka “runtime”)
• The specific implementation

Usage is usually clear from context!


“Hello World”: Word Count

Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);
MapReduce Implementations
• Google has a proprietary implementation in C++
– Bindings in Java, Python
• Hadoop is an open-source implementation in Java
– Development led by Yahoo, used in production
– Now an Apache project
– Rapidly expanding software ecosystem
• Lots of custom research implementations
– For GPUs, cell processors, etc.
Hadoop History
• Dec 2004 – Google MapReduce (OSDI) paper published
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Becomes Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Jul 2008 – A 4000 node test cluster
• Sept 2008 – Hive becomes a Hadoop subproject
• Feb 2009 – The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is used in every Yahoo! Web search query.
• June 2009 – On June 10, 2009, Yahoo! made available the source
code to the version of Hadoop it runs in production.
• In 2010, Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage. On July 27, 2011, they announced that the data had grown to 30 PB.
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
Example Word Count (Map)
public static class TokenizerMapper
     extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    // Tokenize the input line and emit (word, 1) for every token.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Example Word Count (Reduce)
public static class IntSumReducer
     extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context
                     ) throws IOException, InterruptedException {
    // Sum all counts emitted for this word.
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Example Word Count (Driver)
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);   // the reducer doubles as the combiner
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
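
Packaged into a jar, the job would typically be launched with something like the following (the jar name and HDFS paths are illustrative):

  hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output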
Word Count Execution
[Figure: three input lines (“the quick brown fox”, “the fox ate the mouse”, “how now brown cow”) each go to a map task that emits (word, 1) pairs; shuffle and sort routes all pairs for a given word to the same reduce task; the reducers emit the final counts, e.g. brown 2, fox 2, how 1, now 1, the 3, ate 1, cow 1, mouse 1, quick 1.]
An Optimization: The Combiner
• A combiner is a local aggregation function
for repeated keys produced by same map
• For associative operations like sum, count, max
• Decreases size of intermediate data

• Example: local counting for Word Count:

  def combiner(key, values):
    output(key, sum(values))
Word Count with Combiner
[Figure: the same pipeline with the combiner applied in each map task before the shuffle. The second mapper now emits (the, 2) instead of two separate (the, 1) pairs, so less intermediate data crosses the network, while the final reduce output is unchanged.]
[Figure: MapReduce execution overview. (1) The user program submits the job to the master; (2) the master schedules map tasks and reduce tasks onto workers; (3) map workers read their input splits; (4) map output is written to local disk; (5) reduce workers remotely read the intermediate files; (6) reduce workers write the output files.]

Adapted from (Dean and Ghemawat, OSDI 2004)


How do we get data to the workers?
[Figure: compute nodes pulling data over the network from NAS/SAN storage.]

What’s the problem here?


Distributed File System
• Don’t move data to workers… move workers to
the data!
– Store data on the local disks of nodes in the cluster
– Start up the workers on the node that has the data
local
• Why?
– Not enough RAM to hold all the data in memory
– Disk access is slow, but disk throughput is reasonable
• A distributed file system is the answer
– GFS (Google File System) for Google’s MapReduce
– HDFS (Hadoop Distributed File System) for Hadoop
Distributed file system
• When a dataset outgrows the storage capacity of
a single physical machine, it becomes necessary
to partition it across a number of separate
machines.
• Filesystems that manage the storage across a
network of machines are called distributed
filesystems.
• One of the biggest challenges is making the
filesystem tolerate node failure without suffering
data loss.
HDFS
• Hadoop comes with a distributed filesystem
called HDFS, which stands for Hadoop
Distributed Filesystem.
• HDFS is Hadoop’s flagship filesystem, but Hadoop actually has a general-purpose filesystem abstraction, so we’ll see along the way how Hadoop integrates with other storage systems.
Design of HDFS
• HDFS is a filesystem designed for storing very large files
with streaming data access patterns, running on
clusters of commodity hardware.
– “Very large” means files that are hundreds of megabytes,
gigabytes, or terabytes in size
– Streaming data access
– HDFS is built around the idea that the most efficient data
processing pattern is a write once, read-many-times
pattern
– Commodity hardware. Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors).
HDFS Concepts
• A disk has a block size, which is the minimum amount
of data that it can read or write. Filesystems for a single
disk build on this by dealing with data in blocks, which
are an integral multiple of the disk block size.
Filesystem blocks are typically a few kilobytes in size,
whereas disk blocks are normally 512 bytes.
• HDFS, too, has the concept of a block, but it is a much
larger unit — 128 MB by default. Unlike a filesystem for
a single disk, a file in HDFS that is smaller than a single
block does not occupy a full block’s worth of
underlying storage. (For example, a 1 MB file stored
with a block size of 128 MB uses 1 MB of disk space,
not 128 MB.)
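
A small client-side sketch of this (the path is hypothetical; FileSystem and FileStatus are the standard Hadoop filesystem API) reads back a file’s block size and actual length:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);   // default filesystem from the cluster config
    FileStatus status = fs.getFileStatus(new Path("/user/demo/smallfile.txt")); // hypothetical path
    // Block size is per-file metadata: a 1 MB file still reports 128 MB here,
    // but only consumes about 1 MB of underlying datanode storage.
    System.out.println("block size:  " + status.getBlockSize());
    System.out.println("file length: " + status.getLen());
  }
}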
Namenodes and Datanodes
• An HDFS cluster has two types of nodes operating in
a master−worker pattern: a namenode (the master)
and a number of datanodes (workers).
– The namenode manages the filesystem namespace. It
maintains the filesystem tree and the metadata for all the
files and directories in the tree. The namenode also
knows the datanodes on which all the blocks for a given
file are located;
– Datanodes are the workhorses of the filesystem. They
store and retrieve blocks when they are told to (by clients
or the namenode), and they report back to the
namenode periodically with lists of blocks that they are
storing.
Client
• A client accesses the filesystem on behalf of
the user by communicating with the
namenode and datanodes.
• The client presents a filesystem interface
similar to a Portable Operating System
Interface (POSIX), so the user code does not
need to know about the namenode and
datanodes to function.
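
As a rough sketch of that client-side view (the namenode URI and file path are placeholders), reading an HDFS file looks like reading any other stream:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Behind this call the client talks to the namenode for metadata and to
    // datanodes for block data, but the code only sees an ordinary stream.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf); // placeholder URI
    try (InputStream in = fs.open(new Path("/foo/bar"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // copy file contents to stdout
    }
  }
}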
GFS: Assumptions
• Commodity hardware over “exotic” hardware
– Scale “out”, not “up”
• High component failure rates
– Inexpensive commodity components fail all the time
• “Modest” number of huge files
– Multi-gigabyte files are common, if not encouraged
• Files are write-once, mostly appended to
– Perhaps concurrently
• Large streaming reads over random access
– High sustained throughput over low latency

GFS slides adapted from material by (Ghemawat et al., SOSP 2003)


GFS: Design Decisions
• Files stored as chunks
– Fixed size (64MB)
• Reliability through replication
– Each chunk replicated across 3+ chunkservers
• Single master to coordinate access, keep metadata
– Simple centralized management
• No data caching
– Little benefit due to large datasets, streaming reads
• Simplify the API
– Push some of the issues onto the client (e.g., data layout)
HDFS = GFS clone (same basic ideas)
From GFS to HDFS
• Terminology differences:
– GFS master = Hadoop namenode
– GFS chunkservers = Hadoop datanodes
• Functional differences:
– HDFS performance is (likely) slower

For the most part, we’ll use the Hadoop terminology…


HDFS Working Flow
[Figure: HDFS working flow. The application uses the HDFS client, which sends (file name, block id) to the namenode and receives (block id, block location) back from the file namespace (e.g. /foo/bar → block 3df2). The client then requests (block id, byte range) directly from an HDFS datanode and receives the block data. The namenode sends instructions to the datanodes, and the datanodes report their state back; each datanode stores blocks in its local Linux file system.]

Adapted from (Ghemawat et al., SOSP 2003)


HDFS Architecture
[Figure: HDFS architecture. A client interacts with the NameNode (assisted by a Secondary NameNode) and with the DataNodes, which report cluster membership.]
NameNode: maps a file to a file-id and a list of blocks, and knows which DataNodes store each block
DataNode: maps a block-id to a physical location on disk
SecondaryNameNode: periodic merge of the transaction log
Distributed File System
• Single Namespace for entire cluster
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
NameNode Metadata
• Meta-data in Memory
– The entire metadata is in main memory
– No demand paging of meta-data
• Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g., creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
Namenode Responsibilities
• Managing the file system namespace:
– Holds file/directory structure, metadata, file-to-
block mapping, access permissions, etc.
• Coordinating file operations:
– Directs clients to datanodes for reads and writes
– No data is moved through the namenode
• Maintaining overall health:
– Periodic communication with the datanodes
– Block re-replication and rebalancing
– Garbage collection
DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores meta-data of a block (e.g. CRC)
– Serves data and meta-data to Clients
• Block Report
– Periodically sends a report of all existing blocks to
the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Block Placement
• Current Strategy
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are randomly placed
• Clients read from nearest replica
• Would like to make this policy pluggable
Data Correctness
• Use Checksums to validate data
– Use CRC32
• File Creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksum
• File access
– Client retrieves the data and checksum from
DataNode
– If Validation fails, Client tries other replicas
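
A minimal plain-Java illustration of the idea (not the actual HDFS checksum code): compute a CRC32 per 512-byte chunk at write time, recompute on read, and fall back to another replica on mismatch:

import java.util.Arrays;
import java.util.zip.CRC32;

public class ChunkChecksumDemo {
  static long[] checksumChunks(byte[] data, int chunkSize) {
    int chunks = (data.length + chunkSize - 1) / chunkSize;
    long[] sums = new long[chunks];
    for (int i = 0; i < chunks; i++) {
      CRC32 crc = new CRC32();
      int off = i * chunkSize;
      int len = Math.min(chunkSize, data.length - off);
      crc.update(data, off, len);              // checksum one 512-byte chunk
      sums[i] = crc.getValue();
    }
    return sums;
  }

  public static void main(String[] args) {
    byte[] block = new byte[2000];                       // stand-in for block data
    long[] stored = checksumChunks(block, 512);          // kept alongside the block at write time
    long[] recomputed = checksumChunks(block, 512);      // recomputed by the client on read
    System.out.println(Arrays.equals(stored, recomputed)
        ? "checksums match" : "corruption detected: try another replica");
  }
}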
NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
Putting everything together…

[Figure: the namenode runs the namenode daemon, and the job submission node runs the jobtracker; each slave node runs a tasktracker and a datanode daemon on top of its local Linux file system.]
MapReduce Data Flow

Figures are from Hadoop: The Definitive Guide, 2nd Edition, Tom White, O’Reilly
• References:
• Hadoop: The Definitive Guide, Tom White, O’Reilly
• Hadoop in Action, Chuck Lam, Manning
