Week-2 Lecture Notes
Hadoop Distributed File System (HDFS)
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss the design of HDFS and the tuning parameters to control HDFS performance and robustness.
An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data.
A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers.
Hadoop clusters at Yahoo! span 25,000 servers, and store 25
petabytes of application data, with the largest cluster being
3500 servers. One hundred other organizations worldwide
report using Hadoop.
Introduction
Hadoop is an Apache project; all components are available via the Apache open source license.
Yahoo! has developed and contributed to most of the core of Hadoop (HDFS and MapReduce). HBase was originally developed at Powerset, now a department at Microsoft.
The other Hadoop project components:
HBase       Column-oriented table service
Pig         Dataflow language and parallel execution framework
Hive        Data warehouse infrastructure
ZooKeeper   Distributed coordination service
Chukwa      System for collecting management data
Avro        Data serialization system
HDFS Design Goals
Distributed data on local disks on several nodes.
Low-cost commodity hardware: you get a lot of performance out of it because you are aggregating performance across many machines.

[Figure: blocks B1, B2, …, Bn distributed across Node 1, Node 2, …, Node n]
Portability across heterogeneous hardware/software: implementation across lots of different kinds of hardware and software.
Handle large data sets: need to handle terabytes to petabytes.
Enable processing with high throughput.
Data replication: helps to handle hardware failures. Try to spread the data, with the same piece of data on different nodes.
Move computation close to the data: so you are not moving data around. That improves your performance and throughput.
NameNode and DataNodes
NameNode: keeps the metadata for the file system; it knows the DataNodes and where the blocks are distributed, essentially.
Multiple DataNodes: typically one per node in a cluster, so you are basically using storage which is local.
Basic functions of a DataNode:
Manage the storage on the DataNode.
Serve read and write requests from clients.
Block creation, deletion, and replication, all based on instructions from the NameNode.
HDFS Federation
In the original design, a single NameNode manages the entire namespace and all of its responsibilities. You can imagine that as you start having thousands of nodes it will not scale, and if you have billions of files, you will have scalability issues. To address that, the federation feature was brought in. That also brings performance improvements.
Benefits:
Increase namespace scalability
Performance
Isolation
Data is now stored in block pools.
So there is a pool associated with each NameNode or namespace.
And these pools are essentially spread out over all the DataNodes.
Heterogeneous Storage and Archival Storage: storage types ARCHIVE, DISK, SSD, RAM_DISK.
So, if you remember, in the original design you have one namespace and a bunch of DataNodes; the federated structure looks similar.
You have a bunch of NameNodes instead of one NameNode, and each of those NameNodes essentially writes into its own pool, but the pools are spread out over the DataNodes just like before. This is where the data is spread out across the different DataNodes. So, the block pool is essentially the main thing that's different.
HDFS Performance Measures
Key HDFS and system components that are affected by the block size.
The impact of using a lot of small files on HDFS and the system.
Recall: HDFS distributes data as blocks on local disks on several nodes.
Block size: the default block size is 64 megabytes. So a 10 GB file will be broken into: 10 x 1024 / 64 = 160 blocks.
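To make the arithmetic concrete, here is a small plain-Java sketch (not from the original slides) that computes how many blocks a file occupies and how many block replicas the NameNode ends up tracking:

    public class BlockMath {
        public static void main(String[] args) {
            long fileSizeMb = 10L * 1024;   // a 10 GB file
            long blockSizeMb = 64;          // HDFS default block size
            int replication = 3;            // HDFS default replication factor
            // Ceiling division: partially filled last blocks still count.
            long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
            System.out.println(blocks);                // 160 blocks
            System.out.println(blocks * replication);  // 480 replicas to track
        }
    }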
NameNode memory usage: every block is represented as an object in the memory of the NameNode, so that is a direct effect of the number of blocks. And if you have replication, then you have 3 times the number of blocks to keep track of.
Number of map tasks: the number of maps typically depends on the number of blocks being processed.
Network load: the number of checks with DataNodes is proportional to the number of blocks.
Impact of small files: a huge list of map tasks gets queued.
The other impact of this is on the map tasks: each time they spin up and spin down, there's a latency involved with that, because you are starting up Java processes and stopping them.
Solution:
Merge/concatenate files
Sequence files
HBase, Hive configuration
CombineFileInputFormat (see the sketch after this list)
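As one illustration of the last item, here is a minimal sketch assuming the Hadoop 2.x Java MapReduce API, where CombineTextInputFormat (a concrete subclass of CombineFileInputFormat) packs many small files into fewer, larger splits so one map task processes several files:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class SmallFilesJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "small-files");
            // Combine many small input files into larger splits.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at 128 MB (value is in bytes).
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            // ... set mapper/reducer and input/output paths, then
            // job.waitForCompletion(true)
        }
    }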
Tuning parameters
NameNode and DataNode system/dfs parameters.
This is more for system administrators of Hadoop clusters, but it's good to know which changes affect performance, especially if you're trying things out on your own; there are some important parameters to keep in mind.
Block size: the default is 64 megabytes. It is typically bumped up to 128 megabytes and can be changed based on workloads. The parameter that controls this is dfs.blocksize (formerly dfs.block.size).
Replication: the default replication is 3.
Parameter: dfs.replication
Tradeoffs: for example, higher replication improves robustness and read locality at the cost of storage space and write bandwidth.
Examples:
dfs.datanode.handler.count (default 10): sets the number of server threads on each DataNode.
dfs.namenode.fs-limits.max-blocks-per-file: maximum number of blocks per file.
Full list:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
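As a sketch of how such parameters can be overridden programmatically, assuming Hadoop's Java Configuration API (cluster-wide defaults normally live in hdfs-site.xml):

    import org.apache.hadoop.conf.Configuration;

    public class HdfsTuning {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
            conf.setInt("dfs.replication", 3);                 // replication factor
            conf.setInt("dfs.datanode.handler.count", 10);     // threads per DataNode
            // Pass conf when creating a FileSystem or Job to apply these settings.
        }
    }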
Failures
DataNode failures: sometimes there's data corruption because of network issues or disk issues; all of that could lead to a failure in the DataNode aspect of HDFS.
Network failures: you could have the network go down between a particular rack and the NameNode, and that can affect a lot of DataNodes at the same time.
NameNode failures: you could have NameNode failures: a disk failure on the NameNode itself, or the NameNode process itself could become corrupted.
Handling DataNode failures: the NameNode marks the DataNode as failed, and any new I/O that comes up is not going to be sent to that DataNode. Also remember that the NameNode has all the replication information for the files on the file system, so if it knows that a DataNode failed, it knows which blocks will fall below the replication factor.
Now, this replication factor is set for the entire system, and you could also set it for a particular file when you're writing that file. Either way, the NameNode knows which blocks fall below the replication factor, and it will restart the process to re-replicate them.
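A minimal sketch of setting replication for a particular file, assuming Hadoop's FileSystem Java API (the path here is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Keep 5 replicas of this hot file, overriding dfs.replication.
            fs.setReplication(new Path("/data/hot/lookup.dat"), (short) 5);
            fs.close();
        }
    }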
Data corruption: checksums are used to check retrieved data; on a mismatch, re-read from an alternate replica.
Example: distributed copy.
Hadoop distcp allows parallel transfer of files.
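For example, a typical invocation (the cluster addresses and paths here are hypothetical) copies a directory between clusters using many map tasks in parallel:

    hadoop distcp hdfs://nn1:8020/src hdfs://nn2:8020/dst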
What you trade off is robustness. In this case, say we use no replicas: you might lose a node or a local disk and can't recover, because there is no replication.
Conclusion
In this lecture, we have discussed the design of HDFS and the tuning parameters to control HDFS performance and robustness.
Hadoop MapReduce 1.0
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
What is MapReduce?
MapReduce is the execution engine of Hadoop.
Job Tracker: the master node of MapReduce 1.0. Its main duties are to:
break down the received job, that is, a big computation, into small parts;
allocate the partial computations, that is, tasks, to the slave nodes;
monitor the progress and reports of task execution from the slaves.
The unit of execution is a job.
EL
a cluster its duty is to perform
computation given by job tracker
machine.
PT
on the data available on the slave
Step-1 The client submits the job to the Job Tracker.
Step-2 The Job Tracker asks the Name Node where the data for the job is located.
Step-3 Knowing the data locations from the Name Node, the Job Tracker asks the respective Task Trackers to execute the task on their data.
Step-4 All the results are stored on some Data Node and the Name Node is informed about the same.
Step-5 The Task Trackers inform the Job Tracker of the completion and progress of the tasks.
Step-6 The Job Tracker informs the client of the completion.
Step-7 The client contacts the Name Node and retrieves the results.
Hadoop MapReduce 2.0
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss the 'MapReduce paradigm' and its internal working and implementation overview.
We will also see many examples and different applications of MapReduce being used, and look into how the 'scheduling and fault tolerance' works inside MapReduce.
The MapReduce Paradigm
MapReduce is a programming model and an associated implementation for processing and generating large data sets.
Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
Many real-world tasks are expressible in this model.
The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
A typical MapReduce computation processes many terabytes of data on thousands of machines. Hundreds of MapReduce programs have been implemented, and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
Try to keep replicas in different racks.
Master node: also known as the Name Node in HDFS. Stores metadata. Might be replicated.
But we don't want the hassle of managing things ourselves.
The MapReduce architecture provides:
Automatic parallelization & distribution
Fault tolerance
I/O scheduling
Monitoring & status updates
The terminology comes from functional programming (e.g., Lisp):
map: (map square '(1 2 3 4)) → (1 4 9 16)
[processes each record sequentially and independently]
reduce: (reduce + '(1 4 9 16)) → (+ 16 (+ 9 (+ 4 1))) → Output: 30
[processes set of all records in batches]
Let’s consider a sample application: Wordcount
You are given a huge dataset (e.g., Wikipedia dump or all of
Shakespeare’s works) and asked to list the count for each of the
words in each of the documents therein
Map

Input <filename, file text>:
    Welcome Everyone
    Hello Everyone

Map output <key, value> pairs:
    Welcome   1
    Everyone  1
    Hello     1
    Everyone  1

With multiple map tasks, each task processes one split of the input in parallel. MAP TASK 1 handles the two lines above; MAP TASK 2 handles the rest of a larger input, e.g.:

Input <filename, file text>:
    Welcome Everyone
    Hello Everyone
    Why are you here
    I am also here
    They are also here
    Yes, it's THEM!
    The same people we were thinking of
    .......

Map output <key, value> pairs:
    Welcome   1
    Everyone  1
    Hello     1
    Everyone  1
    Why       1
    Are       1
    You       1
    Here      1
    .......
Reduce

Reduce input <key, value> pairs:
    Welcome   1
    Everyone  1
    Hello     1
    Everyone  1

Reduce output <key, value> pairs:
    Everyone  2
    Hello     1
    Welcome   1
With multiple reduce tasks, the intermediate pairs (Welcome 1, Everyone 1, Hello 1, Everyone 1) are partitioned across the reducers:
    REDUCE TASK 1 receives the 'Everyone' pairs and emits: Everyone 2
    REDUCE TASK 2 receives the 'Welcome' and 'Hello' pairs and emits: Hello 1, Welcome 1
• Popular: hash partitioning, i.e., a key is assigned to
– reduce # = hash(key) % number of reduce tasks (see the sketch below)
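A minimal plain-Java sketch of this rule; Hadoop's default HashPartitioner computes essentially this, masking the sign bit so the partition number is non-negative:

    public class HashPartitionDemo {
        static int partition(String key, int numReduceTasks) {
            // Mask the sign bit so the result is non-negative, as Hadoop does.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
        public static void main(String[] args) {
            for (String key : new String[] {"Welcome", "Everyone", "Hello"}) {
                System.out.println(key + " -> reduce #" + partition(key, 2));
            }
        }
    }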
Programming Model
The user of the MapReduce library expresses the computation as two functions:
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs.
Reduce, also written by the user, accepts an intermediate key and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator.
map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
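For comparison, a minimal sketch of the same wordcount in Hadoop's Java MapReduce API (job driver setup omitted):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String w : value.toString().split("\\s+")) {
                    if (w.isEmpty()) continue;
                    word.set(w);
                    context.write(word, ONE);   // emit(w, 1)
                }
            }
        }
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get(); // result += v
                context.write(key, new IntWritable(sum));    // emit(key, result)
            }
        }
    }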
In general:
map(k, v) → list(k1, v1)
reduce(k1, list(v1)) → v2
(k1, v1) is an intermediate key/value pair
Output is the set of (k1, v2) pairs
Examples
Distributed Grep: the map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
Count of URL Access Frequency: the map function processes logs of web page requests and outputs (URL; 1). The reduce function adds together all values for the same URL and emits a (URL; total count) pair.
Reverse Web-Link Graph: the map function outputs (target; source) pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: (target; list(source))
Contd…
Term-Vector per Host: a term vector summarizes the most important words that occur in a document or a set of documents as a list of (word; frequency) pairs.
The map function emits a (hostname; term vector) pair for each input document (where the hostname is extracted from the URL of the document).
The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final (hostname; term vector) pair.
Inverted Index: the map function parses each document and emits a sequence of (word; document ID) pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a (word; list(document ID)) pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
Distributed Sort: the map function extracts the key from each record, and emits a (key; record) pair. The reduce function emits all pairs unchanged.
Grep: Input: a large set of files. Output: lines that match a given pattern.
Reverse Web-Link Graph
Input: the Web graph, tuples (a, b) where (page a → page b)
Output: for each page, the list of pages that link to it
Map – process the web log and, for each input <source, target>, output <target, source>
Reduce – emits <target, list(source)>
Count of URL access frequency (percentage)
Map – process the web log and output <URL, 1>
Multiple Reducers – emit <URL, URL_count>
(So far, like Wordcount. But we still need the %.)
Chain another MapReduce job after the above one:
Map – processes <URL, URL_count> and outputs <1, (<URL, URL_count>)>
1 Reducer – does two passes: in the first pass, sums up all URL_counts to calculate overall_count; in the second pass, calculates the %'s
Emits multiple <URL, URL_count/overall_count>
Applications of MapReduce
Sort
Input: a series of (key, value) pairs
Output: sorted <value>s
Map – <key, value> → <value, _> (identity)
Reducer – <key, value> → <key, value> (identity)
Each map task's output is sorted (e.g., quicksort), and each reduce task's input is sorted (e.g., mergesort).
Partitioning function – partition keys across reducers based on ranges (can't use hashing!)
• Take the data distribution into account to balance the reducer tasks (see the sketch below)
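A minimal plain-Java sketch of range partitioning with hypothetical cut points; in Hadoop, TotalOrderPartitioner plays this role, typically with split points chosen by sampling the input:

    public class RangePartitionDemo {
        // Hypothetical split points, ideally chosen from a sample of the data.
        static final String[] CUTS = {"g", "n", "t"}; // defines 4 partitions
        static int partition(String key) {
            int p = 0;
            // Keys below "g" go to 0, ["g".."n") to 1, ["n".."t") to 2, rest to 3.
            while (p < CUTS.length && key.compareTo(CUTS[p]) >= 0) p++;
            return p;
        }
        public static void main(String[] args) {
            for (String k : new String[] {"apple", "kiwi", "pear", "zebra"}) {
                System.out.println(k + " -> partition " + partition(k));
            }
        }
    }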
Scheduling in Hadoop 2.0 (YARN)
• Global Resource Manager (RM) – performs the scheduling
• Per-node Node Manager (NM) – manages containers on its node
• Per-application Application Master (AM) – requests containers from the RM
EL
1. Need
container
Node A Node Manager A
PT
3. Container on Node B
Node B
2. Container Completed
Node Manager B
N
Application Application Task
4. Start task, please!
Master 1 Master 2 (App2)
The same wordcount, more concretely:

map(key=url, val=contents):
    For each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts):
    Sum all "1"s in values list
    Emit result "(word, sum)"

Input:
    see bob run
    see spot throw

Map output:
    see 1
    bob 1
    run 1
    see 1
    spot 1
    throw 1

Reduce output:
    bob 1
    run 1
    see 2
    spot 1
    throw 1
Example 2: Counting words of different lengths
Suppose the map function takes a word (a string) and outputs the length of the word as the key and the word itself as the value. Then map(steve) would return 5:steve and map(savannah) would return 8:savannah.

The map outputs:
    3 : the
    3 : and
    3 : you
    4 : then
    4 : what
    4 : when
    5 : steve
    5 : where
    8 : savannah
    8 : research

They get grouped as:
    3 : [the, and, you]
    4 : [then, what, when]
    5 : [steve, where]
    8 : [savannah, research]
Example 2: Counting words of different lengths
Each of these lines would then be passed as an argument
to the reduce function, which accepts a key and a list of
values.
In this instance, we might be trying to figure out how many words of certain lengths exist, so our reduce function will just count the number of items in the list and output the key with the size of the list, like:
    3 : 3
    4 : 3
    5 : 2
    8 : 2
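The whole example also fits in a few lines of plain Java (a sketch, not Hadoop code), making the map, group, and reduce steps explicit:

    import java.util.*;

    public class WordLengthCount {
        public static void main(String[] args) {
            String[] words = {"the", "and", "you", "then", "what", "when",
                              "steve", "where", "savannah", "research"};
            // Map: emit (length, word); group values by key.
            Map<Integer, List<String>> grouped = new TreeMap<>();
            for (String w : words) {
                grouped.computeIfAbsent(w.length(), k -> new ArrayList<>()).add(w);
            }
            // Reduce: output each key with the size of its value list.
            for (Map.Entry<Integer, List<String>> e : grouped.entrySet()) {
                System.out.println(e.getKey() + " : " + e.getValue().size());
            }
            // Prints: 3 : 3, 4 : 3, 5 : 2, 8 : 2
        }
    }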
The most common example of MapReduce is counting the number of times words occur in a corpus. Consider the following sample text:
declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are
created equal and independent; that from that equal creation they derive rights inherent and inalienable, among
which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments
are instituted among men, deriving their just power from the consent of the governed; that whenever any form of
government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to
institute new government, laying it’s foundation on such principles and organizing it's power in such form, as to
them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long
established should not be changed for light and transient causes: and accordingly all experience hath shewn that
mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to
which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and
pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their
duty, to throw off such government and to provide new guards for future security. Such has been the patient
sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no
one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the
establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for
the truth of which we pledge a faith yet unsullied by falsehood.
Words are categorized by size:
    Medium = Red = 5..9 letters
    Small = Blue = 2..4 letters
    Tiny = Pink = 1 letter
Another example: building an inverted index over tweets.

Input:
    tweet1, ("I love pancakes for breakfast")
    tweet2, ("I dislike pancakes")
    ...

Output:
    "pancakes", (tweet1, tweet2)
    "breakfast", (tweet1, tweet3)
    ...
Example: common friends
At a social network such as Facebook, a common processing request is the "You and Joe have 230 friends in common" feature.
When you visit someone's profile, you see a list of friends that you have in common. This list doesn't change frequently, so it'd be wasteful to recalculate it every time you visited the profile (sure, you could use a decent caching strategy, but then we wouldn't be able to continue writing about mapreduce for this problem).
We're going to use mapreduce so that we can calculate everyone's common friends once a day and store those results. Later on it's just a quick lookup. We've got lots of disk; it's cheap.
Assume the friends are stored as Person -> [List of Friends]:
    A -> B C D
    B -> A C D E
    C -> A B D E
    D -> A B C E
    E -> B C D
For each line, map(person -> friend list) is run; for every friend in the list, it emits the (sorted) pair of the person and that friend as the key, with the person's friend list as the value. For map(A -> B C D):
    (A B) -> B C D
    (A C) -> B C D
    (A D) -> B C D
For map(B -> A C D E):
    (A B) -> A C D E
    (B C) -> A C D E
    (B D) -> A C D E
    (B E) -> A C D E
For map(C -> A B D E):
    (A C) -> A B D E
    (B C) -> A B D E
    (C D) -> A B D E
    (C E) -> A B D E
For map(D -> A B C E):
    (A D) -> A B C E
    (B D) -> A B C E
    (C D) -> A B C E
    (D E) -> A B C E
And finally for map(E -> B C D):
    (B E) -> B C D
    (C E) -> B C D
    (D E) -> B C D
Before we send these key-value pairs to the reducers, we group them by their keys and get:
    (A B) -> (A C D E) (B C D)
    (A C) -> (A B D E) (B C D)
    (A D) -> (A B C E) (B C D)
    (B C) -> (A B D E) (A C D E)
    (B D) -> (A B C E) (A C D E)
    (B E) -> (A C D E) (B C D)
    (C D) -> (A B C E) (A B D E)
    (C E) -> (A B D E) (B C D)
    (D E) -> (A B C E) (B C D)
The reduce function will simply intersect the lists of values and output the same key with the result of the intersection.
For example, reduce((A B) -> (A C D E) (B C D)) will output (A B) : (C D), which means that friends A and B have C and D as common friends.
After reduction, the result is:
    (A B) -> (C D)
    (A C) -> (B D)
    (A D) -> (B C)
    (B C) -> (A D E)
    (B D) -> (A C E)
    (B E) -> (C D)
    (C D) -> (A B E)
    (C E) -> (B D)
    (D E) -> (B C)
Now when D visits B's profile, we can quickly look up (B D) and see that they have three friends in common: (A C E).
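A minimal plain-Java sketch (not Hadoop code) of the whole computation, with the map and reduce stages written as loops over in-memory collections:

    import java.util.*;

    public class CommonFriends {
        public static void main(String[] args) {
            Map<String, List<String>> friends = new LinkedHashMap<>();
            friends.put("A", List.of("B", "C", "D"));
            friends.put("B", List.of("A", "C", "D", "E"));
            friends.put("C", List.of("A", "B", "D", "E"));
            friends.put("D", List.of("A", "B", "C", "E"));
            friends.put("E", List.of("B", "C", "D"));

            // Map: for each person P and friend F, emit sorted pair (P F) -> P's list.
            Map<String, List<List<String>>> grouped = new TreeMap<>();
            for (Map.Entry<String, List<String>> e : friends.entrySet()) {
                for (String f : e.getValue()) {
                    String pair = e.getKey().compareTo(f) < 0
                            ? "(" + e.getKey() + " " + f + ")"
                            : "(" + f + " " + e.getKey() + ")";
                    grouped.computeIfAbsent(pair, k -> new ArrayList<>())
                           .add(e.getValue());
                }
            }
            // Reduce: intersect the two lists emitted for each pair.
            for (Map.Entry<String, List<List<String>>> e : grouped.entrySet()) {
                List<String> common = new ArrayList<>(e.getValue().get(0));
                common.retainAll(e.getValue().get(1));
                System.out.println(e.getKey() + " -> " + common); // (B D) -> [A, C, E]
            }
        }
    }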
Reference
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters."
http://labs.google.com/papers/mapreduce.html