Week-2 Lecture Notes
Hadoop Distributed File System (HDFS)
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss the design of HDFS and the tuning parameters to control HDFS performance and robustness.
An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data.
A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers.
Hadoop clusters at Yahoo! span 25,000 servers, and store 25
petabytes of application data, with the largest cluster being
3500 servers. One hundred other organizations worldwide
report using Hadoop.
Introduction
Hadoop is an Apache project; all components are available via the Apache open source license.
Yahoo! has developed and contributed to most of the core of Hadoop (HDFS and MapReduce). HBase was originally developed at Powerset, now a department at Microsoft.
The other Hadoop project components:
HBase       Column-oriented table service
Pig         Dataflow language and parallel execution framework
Hive        Data warehouse infrastructure
ZooKeeper   Distributed coordination service
Chukwa      System for collecting management data
Avro        Data serialization system
HDFS Design Goals
Distributed data on local disks on several nodes.
Low-cost commodity hardware: you get a lot of performance out of it because you are aggregating performance across many machines.

[Figure: blocks B1, B2, …, Bn distributed across Node 1, Node 2, …, Node n]
Portability across heterogeneous hardware/software: implementation across lots of different kinds of hardware and software.
Handle large data sets: need to handle terabytes to petabytes.
Enable processing with high throughput.
Data replication: helps to handle hardware failures. Try to spread the data, with the same piece of data on different nodes.
Move computation close to the data: so you are not moving data around. That improves your performance and throughput.
NameNode and DataNodes
NameNode: keeps the metadata for the file system; it knows the DataNodes and where the blocks are distributed, essentially.
Multiple DataNodes: typically one per node in a cluster, so you are basically using storage which is local.
Basic functions of a DataNode:
Manage the storage on the DataNode.
Serve read and write requests from clients.
Block creation, deletion, and replication, all based on instructions from the NameNode.
HDFS Federation
In the original design, a single NameNode manages the entire namespace and all of its responsibilities. You can imagine that as you start having thousands of nodes it will not scale, and if you have billions of files, you will have scalability issues. To address that, the federation feature was brought in. That also brings performance improvements.
Benefits:
Increase namespace scalability
Performance
Isolation
Data is now stored in block pools.
So there is a pool associated with each NameNode or namespace.
And these pools are essentially spread out over all the DataNodes.
Heterogeneous Storage and Archival Storage: storage types ARCHIVE, DISK, SSD, RAM_DISK.
So, if you remember, in the original design you have one namespace and a bunch of DataNodes; the federated structure looks similar.
You have a bunch of NameNodes instead of one NameNode, and each of those NameNodes essentially writes into its own pool, but the pools are spread out over the DataNodes just like before. This is where the data is spread out across the different DataNodes. So, the block pool is essentially the main thing that's different.
HDFS Performance Measures
Key HDFS and system components that are affected by the block size.
The impact of using a lot of small files on HDFS and the system.
Recall: HDFS distributes data as blocks on local disks on several nodes.
Block size: the default block size is 64 megabytes. So a 10 GB file will be broken into: 10 x 1024 / 64 = 160 blocks.
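To make the arithmetic concrete, here is a small plain-Java sketch (not from the original slides) that computes how many blocks a file occupies and how many block replicas the NameNode ends up tracking:

    public class BlockMath {
        public static void main(String[] args) {
            long fileSizeMb = 10L * 1024;   // a 10 GB file
            long blockSizeMb = 64;          // HDFS default block size
            int replication = 3;            // HDFS default replication factor
            // Ceiling division: partially filled last blocks still count.
            long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
            System.out.println(blocks);                // 160 blocks
            System.out.println(blocks * replication);  // 480 replicas to track
        }
    }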
NameNode memory usage: every block is represented as an object in the memory of the NameNode, so that is a direct effect of the number of blocks. And if you have replication, then you have 3 times the number of blocks to keep track of.
Number of map tasks: the number of maps typically depends on the number of blocks being processed.
Network load: the number of checks with DataNodes is proportional to the number of blocks.
Impact of small files: a huge list of map tasks gets queued.
The other impact of this is on the map tasks: each time they spin up and spin down, there's a latency involved with that, because you are starting up Java processes and stopping them.
Solution:
Merge/concatenate files
Sequence files
HBase, Hive configuration
CombineFileInputFormat (see the sketch after this list)
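As one illustration of the last item, here is a minimal sketch assuming the Hadoop 2.x Java MapReduce API, where CombineTextInputFormat (a concrete subclass of CombineFileInputFormat) packs many small files into fewer, larger splits so one map task processes several files:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class SmallFilesJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "small-files");
            // Combine many small input files into larger splits.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at 128 MB (value is in bytes).
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            // ... set mapper/reducer and input/output paths, then
            // job.waitForCompletion(true)
        }
    }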
Tuning parameters
NameNode and DataNode system/dfs parameters.
This is more for system administrators of Hadoop clusters, but it's good to know which changes affect performance, especially if you're trying things out on your own; there are some important parameters to keep in mind.
Block size: the default is 64 megabytes. It is typically bumped up to 128 megabytes and can be changed based on workloads. The parameter that controls this is dfs.blocksize (formerly dfs.block.size).
Replication: the default replication is 3.
Parameter: dfs.replication
Tradeoffs: for example, higher replication improves robustness and read locality at the cost of storage space and write bandwidth.
Examples:
dfs.datanode.handler.count (default 10): sets the number of server threads on each DataNode.
dfs.namenode.fs-limits.max-blocks-per-file: maximum number of blocks per file.
Full list:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
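As a sketch of how such parameters can be overridden programmatically, assuming Hadoop's Java Configuration API (cluster-wide defaults normally live in hdfs-site.xml):

    import org.apache.hadoop.conf.Configuration;

    public class HdfsTuning {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
            conf.setInt("dfs.replication", 3);                 // replication factor
            conf.setInt("dfs.datanode.handler.count", 10);     // threads per DataNode
            // Pass conf when creating a FileSystem or Job to apply these settings.
        }
    }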
Failures
DataNode failures: sometimes there's data corruption because of network issues or disk issues; all of that could lead to a failure in the DataNode aspect of HDFS.
Network failures: you could have the network go down between a particular rack and the NameNode, and that can affect a lot of DataNodes at the same time.
NameNode failures: you could have NameNode failures: a disk failure on the NameNode itself, or the NameNode process itself could become corrupted.
Handling DataNode failures: the NameNode marks the DataNode as failed, and any new I/O that comes up is not going to be sent to that DataNode. Also remember that the NameNode has all the replication information for the files on the file system, so if it knows that a DataNode failed, it knows which blocks will fall below the replication factor.
Now, this replication factor is set for the entire system, and you could also set it for a particular file when you're writing that file. Either way, the NameNode knows which blocks fall below the replication factor, and it will restart the process to re-replicate them.
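A minimal sketch of setting replication for a particular file, assuming Hadoop's FileSystem Java API (the path here is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Keep 5 replicas of this hot file, overriding dfs.replication.
            fs.setReplication(new Path("/data/hot/lookup.dat"), (short) 5);
            fs.close();
        }
    }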
Data corruption: checksums are used to check retrieved data; on a mismatch, re-read from an alternate replica.
Example: distributed copy.
Hadoop distcp allows parallel transfer of files.
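For example, a typical invocation (the cluster addresses and paths here are hypothetical) copies a directory between clusters using many map tasks in parallel:

    hadoop distcp hdfs://nn1:8020/src hdfs://nn2:8020/dst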
What you trade off is robustness. In this case, say we use no replicas: you might lose a node or a local disk and can't recover, because there is no replication.
Conclusion
In this lecture, we have discussed the design of HDFS and the tuning parameters to control HDFS performance and robustness.
Hadoop MapReduce 1.0
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
What is MapReduce?
MapReduce is the execution engine of Hadoop.
Job Tracker: the master node of MapReduce 1.0. Its main duties are to:
break down the received job, that is, a big computation, into small parts;
allocate the partial computations, that is, tasks, to the slave nodes;
monitor the progress and reports of task execution from the slaves.
The unit of execution is a job.
EL
a cluster its duty is to perform
computation given by job tracker
machine.
PT
on the data available on the slave
Step-1 The client submits the job to the Job Tracker.
Step-2 The Job Tracker asks the Name Node where the data for the job is located.
Step-3 Knowing the data locations from the Name Node, the Job Tracker asks the respective Task Trackers to execute the task on their data.
Step-4 All the results are stored on some Data Node and the Name Node is informed about the same.
Step-5 The Task Trackers inform the Job Tracker of the completion and progress of the tasks.
Step-6 The Job Tracker informs the client of the completion.
Step-7 The client contacts the Name Node and retrieves the results.
Hadoop MapReduce 2.0
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss the 'MapReduce paradigm' and its internal working and implementation overview.
We will also see many examples and different applications of MapReduce being used, and look into how the 'scheduling and fault tolerance' works inside MapReduce.
The MapReduce Paradigm
MapReduce is a programming model and an associated implementation for processing and generating large data sets.
Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
Many real-world tasks are expressible in this model.
The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
A typical MapReduce computation processes many terabytes of data on thousands of machines. Hundreds of MapReduce programs have been implemented, and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
Try to keep replicas in different racks.
Master node: also known as the Name Node in HDFS. Stores metadata. Might be replicated.
But we don't want the hassle of managing things ourselves.
The MapReduce architecture provides:
Automatic parallelization & distribution
Fault tolerance
I/O scheduling
Monitoring & status updates
The terminology comes from functional programming (e.g., Lisp):
map: (map square '(1 2 3 4)) → (1 4 9 16)
[processes each record sequentially and independently]
reduce: (reduce + '(1 4 9 16)) → (+ 16 (+ 9 (+ 4 1))) → Output: 30
[processes set of all records in batches]
Let’s consider a sample application: Wordcount
You are given a huge dataset (e.g., Wikipedia dump or all of
Shakespeare’s works) and asked to list the count for each of the
words in each of the documents therein
Map

Input <filename, file text>:
    Welcome Everyone
    Hello Everyone

Map output <key, value> pairs:
    Welcome   1
    Everyone  1
    Hello     1
    Everyone  1

With multiple map tasks, each task processes one split of the input in parallel. MAP TASK 1 handles the two lines above; MAP TASK 2 handles the rest of a larger input, e.g.:

Input <filename, file text>:
    Welcome Everyone
    Hello Everyone
    Why are you here
    I am also here
    They are also here
    Yes, it's THEM!
    The same people we were thinking of
    .......

Map output <key, value> pairs:
    Welcome   1
    Everyone  1
    Hello     1
    Everyone  1
    Why       1
    Are       1
    You       1
    Here      1
    .......
Reduce

Reduce input <key, value> pairs:
    Welcome   1
    Everyone  1
    Hello     1
    Everyone  1

Reduce output <key, value> pairs:
    Everyone  2
    Hello     1
    Welcome   1
With multiple reduce tasks, the intermediate pairs (Welcome 1, Everyone 1, Hello 1, Everyone 1) are partitioned across the reducers:
    REDUCE TASK 1 receives the 'Everyone' pairs and emits: Everyone 2
    REDUCE TASK 2 receives the 'Welcome' and 'Hello' pairs and emits: Hello 1, Welcome 1
• Popular: hash partitioning, i.e., a key is assigned to
– reduce # = hash(key) % number of reduce tasks (see the sketch below)
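A minimal plain-Java sketch of this rule; Hadoop's default HashPartitioner computes essentially this, masking the sign bit so the partition number is non-negative:

    public class HashPartitionDemo {
        static int partition(String key, int numReduceTasks) {
            // Mask the sign bit so the result is non-negative, as Hadoop does.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
        public static void main(String[] args) {
            for (String key : new String[] {"Welcome", "Everyone", "Hello"}) {
                System.out.println(key + " -> reduce #" + partition(key, 2));
            }
        }
    }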
Programming Model
The user of the MapReduce library expresses the computation as two functions:
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs.
Reduce, also written by the user, accepts an intermediate key and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator.
map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
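For comparison, a minimal sketch of the same wordcount in Hadoop's Java MapReduce API (job driver setup omitted):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String w : value.toString().split("\\s+")) {
                    if (w.isEmpty()) continue;
                    word.set(w);
                    context.write(word, ONE);   // emit(w, 1)
                }
            }
        }
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get(); // result += v
                context.write(key, new IntWritable(sum));    // emit(key, result)
            }
        }
    }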
In general:
map(k, v) → list(k1, v1)
reduce(k1, list(v1)) → v2
(k1, v1) is an intermediate key/value pair
Output is the set of (k1, v2) pairs
Examples
Distributed Grep: the map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
Count of URL Access Frequency: the map function processes logs of web page requests and outputs (URL; 1). The reduce function adds together all values for the same URL and emits a (URL; total count) pair.
Reverse Web-Link Graph: the map function outputs (target; source) pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: (target; list(source))
Contd…
Term-Vector per Host: a term vector summarizes the most important words that occur in a document or a set of documents as a list of (word; frequency) pairs.
The map function emits a (hostname; term vector) pair for each input document (where the hostname is extracted from the URL of the document).
The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final (hostname; term vector) pair.
Inverted Index: the map function parses each document and emits a sequence of (word; document ID) pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a (word; list(document ID)) pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
Distributed Sort: the map function extracts the key from each record, and emits a (key; record) pair. The reduce function emits all pairs unchanged.
Grep: Input: a large set of files. Output: lines that match a given pattern.
Reverse Web-Link Graph
Input: the Web graph, tuples (a, b) where (page a → page b)
Output: for each page, the list of pages that link to it
Map – process the web log and, for each input <source, target>, output <target, source>
Reduce – emits <target, list(source)>
Count of URL access frequency (percentage)
Map – process the web log and output <URL, 1>
Multiple Reducers – emit <URL, URL_count>
(So far, like Wordcount. But we still need the %.)
Chain another MapReduce job after the above one:
Map – processes <URL, URL_count> and outputs <1, (<URL, URL_count>)>
1 Reducer – does two passes: in the first pass, sums up all URL_counts to calculate overall_count; in the second pass, calculates the %'s
Emits multiple <URL, URL_count/overall_count>
Applications of MapReduce
Sort
Input: a series of (key, value) pairs
Output: sorted <value>s
Map – <key, value> → <value, _> (identity)
Reducer – <key, value> → <key, value> (identity)
Each map task's output is sorted (e.g., quicksort), and each reduce task's input is sorted (e.g., mergesort).
Partitioning function – partition keys across reducers based on ranges (can't use hashing!)
• Take the data distribution into account to balance the reducer tasks (see the sketch below)
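A minimal plain-Java sketch of range partitioning with hypothetical cut points; in Hadoop, TotalOrderPartitioner plays this role, typically with split points chosen by sampling the input:

    public class RangePartitionDemo {
        // Hypothetical split points, ideally chosen from a sample of the data.
        static final String[] CUTS = {"g", "n", "t"}; // defines 4 partitions
        static int partition(String key) {
            int p = 0;
            // Keys below "g" go to 0, ["g".."n") to 1, ["n".."t") to 2, rest to 3.
            while (p < CUTS.length && key.compareTo(CUTS[p]) >= 0) p++;
            return p;
        }
        public static void main(String[] args) {
            for (String k : new String[] {"apple", "kiwi", "pear", "zebra"}) {
                System.out.println(k + " -> partition " + partition(k));
            }
        }
    }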
Scheduling in Hadoop 2.0 (YARN)
• Global Resource Manager (RM) – performs the scheduling
• Per-node Node Manager (NM) – manages containers on its node
• Per-application Application Master (AM) – requests containers from the RM
EL
1. Need
container
Node A Node Manager A
PT
3. Container on Node B
Node B
2. Container Completed
Node Manager B
N
Application Application Task
4. Start task, please!
Master 1 Master 2 (App2)
The same wordcount, more concretely:

map(key=url, val=contents):
    For each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts):
    Sum all "1"s in values list
    Emit result "(word, sum)"

Input:
    see bob run
    see spot throw

Map output:
    see 1
    bob 1
    run 1
    see 1
    spot 1
    throw 1

Reduce output:
    bob 1
    run 1
    see 2
    spot 1
    throw 1
Example 2: Counting words of different lengths
Suppose the map function takes a word (a string) and outputs the length of the word as the key and the word itself as the value. Then map(steve) would return 5:steve and map(savannah) would return 8:savannah.

The map outputs:
    3 : the
    3 : and
    3 : you
    4 : then
    4 : what
    4 : when
    5 : steve
    5 : where
    8 : savannah
    8 : research

They get grouped as:
    3 : [the, and, you]
    4 : [then, what, when]
    5 : [steve, where]
    8 : [savannah, research]
Example 2: Counting words of different lengths
Each of these lines would then be passed as an argument
to the reduce function, which accepts a key and a list of
values.
In this instance, we might be trying to figure out how many words of certain lengths exist, so our reduce function will just count the number of items in the list and output the key with the size of the list, like:
    3 : 3
    4 : 3
    5 : 2
    8 : 2
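The whole example also fits in a few lines of plain Java (a sketch, not Hadoop code), making the map, group, and reduce steps explicit:

    import java.util.*;

    public class WordLengthCount {
        public static void main(String[] args) {
            String[] words = {"the", "and", "you", "then", "what", "when",
                              "steve", "where", "savannah", "research"};
            // Map: emit (length, word); group values by key.
            Map<Integer, List<String>> grouped = new TreeMap<>();
            for (String w : words) {
                grouped.computeIfAbsent(w.length(), k -> new ArrayList<>()).add(w);
            }
            // Reduce: output each key with the size of its value list.
            for (Map.Entry<Integer, List<String>> e : grouped.entrySet()) {
                System.out.println(e.getKey() + " : " + e.getValue().size());
            }
            // Prints: 3 : 3, 4 : 3, 5 : 2, 8 : 2
        }
    }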
The most common example of MapReduce is counting the number of times words occur in a corpus. Consider the following sample text:
declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are
created equal and independent; that from that equal creation they derive rights inherent and inalienable, among
which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments
are instituted among men, deriving their just power from the consent of the governed; that whenever any form of
government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to
institute new government, laying it’s foundation on such principles and organizing it's power in such form, as to
them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long
established should not be changed for light and transient causes: and accordingly all experience hath shewn that
mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to
which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and
pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their
duty, to throw off such government and to provide new guards for future security. Such has been the patient
sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no
one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the
establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for
the truth of which we pledge a faith yet unsullied by falsehood.
Words are categorized by size:
    Medium = Red = 5..9 letters
    Small = Blue = 2..4 letters
    Tiny = Pink = 1 letter
Another example: building an inverted index over tweets.

Input:
    tweet1, ("I love pancakes for breakfast")
    tweet2, ("I dislike pancakes")
    ...

Output:
    "pancakes", (tweet1, tweet2)
    "breakfast", (tweet1, tweet3)
    ...
Example: common friends
At a social network such as Facebook, a common processing request is the "You and Joe have 230 friends in common" feature.
When you visit someone's profile, you see a list of friends that you have in common. This list doesn't change frequently, so it'd be wasteful to recalculate it every time you visited the profile (sure, you could use a decent caching strategy, but then we wouldn't be able to continue writing about mapreduce for this problem).
We're going to use mapreduce so that we can calculate everyone's common friends once a day and store those results. Later on it's just a quick lookup. We've got lots of disk; it's cheap.
Assume the friends are stored as Person -> [List of Friends]:
    A -> B C D
    B -> A C D E
    C -> A B D E
    D -> A B C E
    E -> B C D
For each line, map(person -> friend list) is run; for every friend in the list, it emits the (sorted) pair of the person and that friend as the key, with the person's friend list as the value. For map(A -> B C D):
    (A B) -> B C D
    (A C) -> B C D
    (A D) -> B C D
For map(B -> A C D E):
    (A B) -> A C D E
    (B C) -> A C D E
    (B D) -> A C D E
    (B E) -> A C D E
For map(C -> A B D E):
    (A C) -> A B D E
    (B C) -> A B D E
    (C D) -> A B D E
    (C E) -> A B D E
For map(D -> A B C E):
    (A D) -> A B C E
    (B D) -> A B C E
    (C D) -> A B C E
    (D E) -> A B C E
And finally for map(E -> B C D):
    (B E) -> B C D
    (C E) -> B C D
    (D E) -> B C D
Before we send these key-value pairs to the reducers, we group them by their keys and get:
    (A B) -> (A C D E) (B C D)
    (A C) -> (A B D E) (B C D)
    (A D) -> (A B C E) (B C D)
    (B C) -> (A B D E) (A C D E)
    (B D) -> (A B C E) (A C D E)
    (B E) -> (A C D E) (B C D)
    (C D) -> (A B C E) (A B D E)
    (C E) -> (A B D E) (B C D)
    (D E) -> (A B C E) (B C D)
The reduce function will simply intersect the lists of values and output the same key with the result of the intersection.
For example, reduce((A B) -> (A C D E) (B C D)) will output (A B) : (C D), which means that friends A and B have C and D as common friends.
After reduction, the result is:
    (A B) -> (C D)
    (A C) -> (B D)
    (A D) -> (B C)
    (B C) -> (A D E)
    (B D) -> (A C E)
    (B E) -> (C D)
    (C D) -> (A B E)
    (C E) -> (B D)
    (D E) -> (B C)
Now when D visits B's profile, we can quickly look up (B D) and see that they have three friends in common: (A C E).
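A minimal plain-Java sketch (not Hadoop code) of the whole computation, with the map and reduce stages written as loops over in-memory collections:

    import java.util.*;

    public class CommonFriends {
        public static void main(String[] args) {
            Map<String, List<String>> friends = new LinkedHashMap<>();
            friends.put("A", List.of("B", "C", "D"));
            friends.put("B", List.of("A", "C", "D", "E"));
            friends.put("C", List.of("A", "B", "D", "E"));
            friends.put("D", List.of("A", "B", "C", "E"));
            friends.put("E", List.of("B", "C", "D"));

            // Map: for each person P and friend F, emit sorted pair (P F) -> P's list.
            Map<String, List<List<String>>> grouped = new TreeMap<>();
            for (Map.Entry<String, List<String>> e : friends.entrySet()) {
                for (String f : e.getValue()) {
                    String pair = e.getKey().compareTo(f) < 0
                            ? "(" + e.getKey() + " " + f + ")"
                            : "(" + f + " " + e.getKey() + ")";
                    grouped.computeIfAbsent(pair, k -> new ArrayList<>())
                           .add(e.getValue());
                }
            }
            // Reduce: intersect the two lists emitted for each pair.
            for (Map.Entry<String, List<List<String>>> e : grouped.entrySet()) {
                List<String> common = new ArrayList<>(e.getValue().get(0));
                common.retainAll(e.getValue().get(1));
                System.out.println(e.getKey() + " -> " + common); // (B D) -> [A, C, E]
            }
        }
    }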
Reference
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters."
http://labs.google.com/papers/mapreduce.html