
Introduction to

MapReduce/Hadoop
Limitations of Existing Data Analytics Architecture

[Figure: in the existing architecture, instrumentation and collection feed raw data into a storage-only grid (original raw data); an ETL compute grid loads aggregated data into an RDBMS, with BI reports and interactive apps on top. The pain points: users can’t explore the original high-fidelity raw data, moving data to compute doesn’t scale, and archiving the mostly-append data amounts to premature data death.]
©2011 Cloudera, Inc. All Rights Reserved.

Slide from Dr. Amr Awadallah (CTO & VPE at Cloudera), from his Hadoop talk at Stanford
Typical Large-Data Problem
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output

• The problem:
– Diverse input formats (data diversity & heterogeneity)
– Large Scale: Terabytes, Petabytes
– Parallelization

(Dean and Ghemawat, OSDI 2004)


How to leverage a number of
cheap off-the-shelf computers?

Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf


Divide and Conquer
[Figure: the “work” is partitioned into w1, w2, w3; each partition is handled by a “worker”, producing partial results r1, r2, r3; the partial results are combined into the final “result”.]
Parallelization Challenges
• How do we assign work units to workers?
• What if we have more work units than
workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have
finished?
• What if workers die?
What is the common theme of all of these problems?
Common Theme?
• Parallelization problems arise from:
– Communication between workers (e.g., to
exchange state)
– Access to shared resources (e.g., data)
• Thus, we need a synchronization mechanism
Source: Ricardo Guimarães Herrmann
Managing Multiple Workers
• Difficult because
– We don’t know the order in which workers run
– We don’t know when workers interrupt each other
– We don’t know the order in which workers access shared data
• Thus, we need:
– Semaphores (lock, unlock)
– Condition variables (wait, notify, broadcast)
– Barriers
• Still, lots of problems:
– Deadlock, livelock, race conditions...
– Dining philosophers, sleeping barbers, cigarette smokers...
• Moral of the story: be careful!
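
For a feel of what this coordination burden looks like in code, here is a minimal plain-Java sketch (not Hadoop; the worker count and the fake per-worker “work” are invented for illustration) in which three workers meet at a barrier before their partial results are aggregated:

import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
  public static void main(String[] args) {
    final int WORKERS = 3;                       // hypothetical worker count
    final int[] partial = new int[WORKERS];      // per-worker partial results
    // The barrier action runs once all workers have arrived: aggregate the partials.
    CyclicBarrier barrier = new CyclicBarrier(WORKERS, () -> {
      int total = 0;
      for (int p : partial) total += p;
      System.out.println("aggregated result = " + total);
    });
    for (int i = 0; i < WORKERS; i++) {
      final int id = i;
      new Thread(() -> {
        partial[id] = (id + 1) * 10;             // stand-in for real work
        try {
          barrier.await();                       // wait for the other workers
        } catch (Exception e) {                  // interrupted or broken barrier
          Thread.currentThread().interrupt();
        }
      }).start();
    }
  }
}

Even in this toy example the programmer owns scheduling, failure handling, and aggregation by hand; the rest of the lecture is about frameworks that take that burden away.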
Current Tools
• Programming models
– Shared memory (pthreads)
– Message passing (MPI)
• Design Patterns
– Master-slaves
– Producer-consumer flows
– Shared work queues
[Figure: shared memory (processes P1–P5 sharing one memory) vs. message passing (processes P1–P5 exchanging messages); a master coordinating a pool of slaves; producers and consumers around a shared work queue.]
Concurrency Challenge!
• Concurrency is difficult to reason about
• Concurrency is even more difficult to reason about
– At the scale of datacenters (even across datacenters)
– In the presence of failures
– In terms of multiple interacting services
• Not to mention debugging…
• The reality:
– Lots of one-off solutions, custom code
– Write your own dedicated library, then program with it
– Burden on the programmer to explicitly manage
everything
What’s the point?
• It’s all about the right level of abstraction
– The von Neumann architecture has served us well,
but is no longer appropriate for the multi-
core/cluster environment
• Hide system-level details from the developers
– No more race conditions, lock contention, etc.
• Separating the what from how
– Developer specifies the computation that needs to be
performed
– Execution framework (“runtime”) handles actual
execution
Key Ideas
• Scale “out”, not “up”
– Limits of SMP and large shared-memory machines
• Move processing to the data
– Clusters have limited bandwidth
• Process data sequentially, avoid random access
– Seeks are expensive, disk throughput is reasonable
• Seamless scalability
– From the mythical man-month to the tradable
machine-hour
The datacenter is the computer!

Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf


Apache Hadoop
• The name Hadoop is a made-up name. The
project’s creator, Doug Cutting, explains how
the name came about:
– The name my kid gave a stuffed yellow elephant.
Short, relatively easy to spell and pronounce,
meaningless, and not used elsewhere: those are
my naming criteria. Kids are good at generating
such. Googol is a kid’s term.
Apache Hadoop
• Scalable fault-tolerant distributed system for Big Data:
– Data Storage
– Data Processing
– A virtual Big Data machine
– Borrowed concepts/Ideas from Google; Open source
under the Apache license
• Core Hadoop has two main systems:
– Hadoop/MapReduce: distributed big data processing infrastructure (abstraction/paradigm, fault tolerance, scheduling, execution)
– HDFS (Hadoop Distributed File System): fault-tolerant,
high-bandwidth, high availability distributed storage
Apache YARN
• Apache YARN (Yet Another Resource Negotiator) is
Hadoop’s cluster resource management system.
MapReduce: Big Data Processing Abstraction
RDBMS compared to MapReduce
Terminology
• A MapReduce job is a unit of work that the client
wants to be performed:
– input data,
– the MapReduce program
– Configuration information.
• Hadoop runs the job by dividing it into tasks, of
which there are two types:
– map tasks and reduce tasks.
– The tasks are scheduled using YARN and run on nodes
in the cluster.
Terminology
• Hadoop divides the input to a MapReduce job into
fixed-size pieces called input splits, or just splits.
Hadoop creates one map task for each split, which
runs the user-defined map function for each record
in the split.
• Having many splits means the time taken to
process each split is small compared to the time to
process the whole input.
• So if we are processing the splits in parallel, the
processing is better load balanced when the splits
are small, since a faster machine will be able to
process proportionally more splits over the course
of the job than a slower machine.
Split size
• If splits are too small, the overhead of
managing the splits and map task creation
begins to dominate the total job execution
time.
• For most jobs, a good split size tends to be the
size of an HDFS block, which is 128 MB by
default, although this can be changed for the
cluster or specified when each file is created.
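
As a hedged sketch of where these knobs live in practice (the dfs.blocksize property and the FileInputFormat helper are taken from the Hadoop 2+ API; the values simply restate the defaults mentioned above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Block size used for files this job writes (128 MB, the usual default).
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
    Job job = Job.getInstance(conf, "split-size demo");
    // Cap each input split at one HDFS block so map tasks can stay node-local.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
  }
}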
Split size
• It should now be clear why the optimal split size
is the same as the block size:
– it is the largest size of input that can be guaranteed
to be stored on a single node.
– If the split spanned two blocks, it would be unlikely
that any HDFS node stored both blocks, so some of
the split would have to be transferred across the
network to the node running the map task, which is
clearly less efficient than running the whole map
task using local data.
Data locality optimization
• Hadoop does its best to run each map task on a node where the input data resides in HDFS, because doing so avoids using valuable cluster bandwidth.
Map output
• Map tasks write their output to the local disk,
not to HDFS.
• Map output is intermediate output: it’s
processed by reduce tasks to produce the final
output, and once the job is complete, the map
output can be thrown away. So, storing it in
HDFS with replication would be overkill.
[Figure: MapReduce data flow with a single reduce task.]
[Figure: MapReduce data flow with multiple reduce tasks.]
Typical Large-Data Problem
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
Key idea: provide a functional abstraction for these two
operations

(Dean and Ghemawat, OSDI 2004)


Roots in Functional Programming

[Figure: Map applies a function f to each element of a list independently; Fold sweeps an accumulating function g across the list to produce a single result.]
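
The same two ideas in ordinary Java (a sketch, not Hadoop code): map applies a function f to every element independently, and fold/reduce sweeps an accumulating function g over the results:

import java.util.List;

public class MapFoldDemo {
  public static void main(String[] args) {
    List<Integer> records = List.of(1, 2, 3, 4, 5);
    int result = records.stream()
        .map(x -> x * x)               // “map”: apply f to each element independently
        .reduce(0, Integer::sum);      // “fold”: combine results with g, starting from 0
    System.out.println(result);        // prints 55 (sum of squares)
  }
}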
MapReduce
• Programmers specify two functions:
map (k, v) → [(k’, v’)]
reduce (k’, [v’]) → [(k’, v’)]
– All values with the same key are sent to the same
reducer
• The execution framework handles everything
else…
Key Observation from Data Mining Algorithms (Jin & Agrawal, SDM’01)
• Popular algorithms have a common canonical loop:

  While( ) {
    forall( data instances d ) {
      I = process(d)
      R(I) = R(I) op d
    }
    …….
  }

• Can be used as the basis for supporting a common middleware (FREERide, Framework for Rapid Implementation of Data mining Engines)
• Target distributed memory parallelism, shared memory parallelism, and combination
• Ability to process large and disk-resident datasets
[Figure: input pairs (k1,v1) through (k6,v6) flow through four map tasks, which emit intermediate pairs such as (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8); shuffle and sort aggregates values by key (a → [1,5], b → [2,7], c → [2,3,6,8]); three reduce tasks then produce the outputs (r1,s1), (r2,s2), (r3,s3).]
MapReduce
• Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, [v’]) → <k’, v’>*
– All values with the same key are sent to the same
reducer
• The execution framework handles everything
else…
What’s “everything else”?
MapReduce “Runtime”
• Handles scheduling
– Assigns workers to map and reduce tasks
• Handles “data distribution”
– Moves processes to data
• Handles synchronization
– Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
– Detects worker failures and restarts
• Everything happens on top of a distributed FS
(later)
MapReduce
• Programmers specify two functions:
map (k, v) → [(k’, v’)]
reduce (k’, [v’]) → [(k’, v’)]
– All values with the same key are reduced together
• The execution framework handles everything else…
• Not quite…usually, programmers also specify:
partition (k’, number of partitions) → partition for k’
– Often a simple hash of the key, e.g., hash(k’) mod n
– Divides up key space for parallel reduce operations
combine (k’, [v’]) → [(k’, v’’)]
– Mini-reducers that run in memory after the map phase
– Used as an optimization to reduce network traffic
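
For illustration, the hash-of-the-key rule above is essentially what Hadoop’s default HashPartitioner does; a custom partitioner (a sketch, assuming the word-count key/value types used later) only needs to override getPartition and be registered with job.setPartitionerClass(...):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of a custom partitioner: hash(k') mod (number of reduce tasks).
public class HashModPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is always non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}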
[Figure: the same dataflow with combiners and partitioners. Each map task’s output is first combined locally (for example, (c,3) and (c,6) from one mapper become (c,9)) and then partitioned by key; shuffle and sort aggregates values by key (a → [1,5], b → [2,7], c → [2,9,8]); three reduce tasks produce (r1,s1), (r2,s2), (r3,s3).]
Two more details…
• Barrier between map and reduce phases
– But we can begin copying intermediate data
earlier
• Keys arrive at each reducer in sorted order
– No enforced ordering across reducers
MapReduce can refer to…
• The programming model
• The execution framework (aka “runtime”)
• The specific implementation

Usage is usually clear from context!


“Hello World”: Word Count

Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);
MapReduce Implementations
• Google has a proprietary implementation in C++
– Bindings in Java, Python
• Hadoop is an open-source implementation in Java
– Development led by Yahoo, used in production
– Now an Apache project
– Rapidly expanding software ecosystem
• Lots of custom research implementations
– For GPUs, cell processors, etc.
Hadoop History
• Dec 2004 – Google MapReduce (OSDI) paper published
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Becomes Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Jul 2008 – A 4000 node test cluster
• Sept 2008 – Hive becomes a Hadoop subproject
• Feb 2009 – The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is used in every Yahoo! Web search query.
• June 2009 – On June 10, 2009, Yahoo! made available the source
code to the version of Hadoop it runs in production.
• In 2010, Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage. On July 27, 2011, they announced that the data had grown to 30 PB.
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
Example Word Count (Map)
public static class TokenizerMapper
     extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    // Tokenize the input line and emit (word, 1) for every token.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Example Word Count (Reduce)
public static class IntSumReducer
     extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context
                     ) throws IOException, InterruptedException {
    // Sum all counts emitted for this word.
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Example Word Count (Driver)
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);   // the reducer doubles as the combiner
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
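
Packaged into a jar, the job would typically be launched with something like the following (the jar name and HDFS paths are illustrative):

  hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output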
Word Count Execution
[Figure: three input lines (“the quick brown fox”, “the fox ate the mouse”, “how now brown cow”) each go to a map task that emits (word, 1) pairs; shuffle and sort routes all pairs for a given word to the same reduce task; the reducers emit the final counts, e.g. brown 2, fox 2, how 1, now 1, the 3, ate 1, cow 1, mouse 1, quick 1.]
An Optimization: The Combiner
• A combiner is a local aggregation function
for repeated keys produced by same map
• For associative operations like sum, count, max
• Decreases size of intermediate data

• Example: local counting for Word Count:

  def combiner(key, values):
    output(key, sum(values))
Word Count with Combiner
[Figure: the same pipeline with the combiner applied in each map task before the shuffle. The second mapper now emits (the, 2) instead of two separate (the, 1) pairs, so less intermediate data crosses the network, while the final reduce output is unchanged.]
[Figure: MapReduce execution overview. (1) The user program submits the job to the master; (2) the master schedules map tasks and reduce tasks onto workers; (3) map workers read their input splits; (4) map output is written to local disk; (5) reduce workers remotely read the intermediate files; (6) reduce workers write the output files.]

Adapted from (Dean and Ghemawat, OSDI 2004)


How do we get data to the workers?
[Figure: compute nodes pulling data over the network from NAS/SAN storage.]

What’s the problem here?


Distributed File System
• Don’t move data to workers… move workers to
the data!
– Store data on the local disks of nodes in the cluster
– Start up the workers on the node that has the data
local
• Why?
– Not enough RAM to hold all the data in memory
– Disk access is slow, but disk throughput is reasonable
• A distributed file system is the answer
– GFS (Google File System) for Google’s MapReduce
– HDFS (Hadoop Distributed File System) for Hadoop
Distributed file system
• When a dataset outgrows the storage capacity of
a single physical machine, it becomes necessary
to partition it across a number of separate
machines.
• Filesystems that manage the storage across a
network of machines are called distributed
filesystems.
• One of the biggest challenges is making the
filesystem tolerate node failure without suffering
data loss.
HDFS
• Hadoop comes with a distributed filesystem
called HDFS, which stands for Hadoop
Distributed Filesystem.
• HDFS is Hadoop’s flagship filesystem, but Hadoop actually has a general-purpose filesystem abstraction, so we’ll see along the way how Hadoop integrates with other storage systems.
Design of HDFS
• HDFS is a filesystem designed for storing very large files
with streaming data access patterns, running on
clusters of commodity hardware.
– “Very large” means files that are hundreds of megabytes,
gigabytes, or terabytes in size
– Streaming data access
– HDFS is built around the idea that the most efficient data
processing pattern is a write once, read-many-times
pattern
– Commodity hardware. Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors).
HDFS Concepts
• A disk has a block size, which is the minimum amount
of data that it can read or write. Filesystems for a single
disk build on this by dealing with data in blocks, which
are an integral multiple of the disk block size.
Filesystem blocks are typically a few kilobytes in size,
whereas disk blocks are normally 512 bytes.
• HDFS, too, has the concept of a block, but it is a much
larger unit — 128 MB by default. Unlike a filesystem for
a single disk, a file in HDFS that is smaller than a single
block does not occupy a full block’s worth of
underlying storage. (For example, a 1 MB file stored
with a block size of 128 MB uses 1 MB of disk space,
not 128 MB.)
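
A small client-side sketch of this (the path is hypothetical; FileSystem and FileStatus are the standard Hadoop filesystem API) reads back a file’s block size and actual length:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);   // default filesystem from the cluster config
    FileStatus status = fs.getFileStatus(new Path("/user/demo/smallfile.txt")); // hypothetical path
    // Block size is per-file metadata: a 1 MB file still reports 128 MB here,
    // but only consumes about 1 MB of underlying datanode storage.
    System.out.println("block size:  " + status.getBlockSize());
    System.out.println("file length: " + status.getLen());
  }
}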
Namenodes and Datanodes
• An HDFS cluster has two types of nodes operating in
a master−worker pattern: a namenode (the master)
and a number of datanodes (workers).
– The namenode manages the filesystem namespace. It
maintains the filesystem tree and the metadata for all the
files and directories in the tree. The namenode also
knows the datanodes on which all the blocks for a given
file are located;
– Datanodes are the workhorses of the filesystem. They
store and retrieve blocks when they are told to (by clients
or the namenode), and they report back to the
namenode periodically with lists of blocks that they are
storing.
Client
• A client accesses the filesystem on behalf of
the user by communicating with the
namenode and datanodes.
• The client presents a filesystem interface
similar to a Portable Operating System
Interface (POSIX), so the user code does not
need to know about the namenode and
datanodes to function.
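
As a rough sketch of that client-side view (the namenode URI and file path are placeholders), reading an HDFS file looks like reading any other stream:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Behind this call the client talks to the namenode for metadata and to
    // datanodes for block data, but the code only sees an ordinary stream.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf); // placeholder URI
    try (InputStream in = fs.open(new Path("/foo/bar"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // copy file contents to stdout
    }
  }
}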
GFS: Assumptions
• Commodity hardware over “exotic” hardware
– Scale “out”, not “up”
• High component failure rates
– Inexpensive commodity components fail all the time
• “Modest” number of huge files
– Multi-gigabyte files are common, if not encouraged
• Files are write-once, mostly appended to
– Perhaps concurrently
• Large streaming reads over random access
– High sustained throughput over low latency

GFS slides adapted from material by (Ghemawat et al., SOSP 2003)


GFS: Design Decisions
• Files stored as chunks
– Fixed size (64MB)
• Reliability through replication
– Each chunk replicated across 3+ chunkservers
• Single master to coordinate access, keep metadata
– Simple centralized management
• No data caching
– Little benefit due to large datasets, streaming reads
• Simplify the API
– Push some of the issues onto the client (e.g., data layout)
HDFS = GFS clone (same basic ideas)
From GFS to HDFS
• Terminology differences:
– GFS master = Hadoop namenode
– GFS chunkservers = Hadoop datanodes
• Functional differences:
– HDFS performance is (likely) slower

For the most part, we’ll use the Hadoop terminology…


HDFS Working Flow
[Figure: HDFS working flow. The application uses the HDFS client, which sends (file name, block id) to the namenode and receives (block id, block location) back from the file namespace (e.g. /foo/bar → block 3df2). The client then requests (block id, byte range) directly from an HDFS datanode and receives the block data. The namenode sends instructions to the datanodes, and the datanodes report their state back; each datanode stores blocks in its local Linux file system.]

Adapted from (Ghemawat et al., SOSP 2003)


HDFS Architecture
[Figure: HDFS architecture. A client interacts with the NameNode (assisted by a Secondary NameNode) and with the DataNodes, which report cluster membership.]
NameNode: maps a file to a file-id and a list of blocks, and knows which DataNodes store each block
DataNode: maps a block-id to a physical location on disk
SecondaryNameNode: periodic merge of the transaction log
Distributed File System
• Single Namespace for entire cluster
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
NameNode Metadata
• Meta-data in Memory
– The entire metadata is in main memory
– No demand paging of meta-data
• Types of Metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g., creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
Namenode Responsibilities
• Managing the file system namespace:
– Holds file/directory structure, metadata, file-to-
block mapping, access permissions, etc.
• Coordinating file operations:
– Directs clients to datanodes for reads and writes
– No data is moved through the namenode
• Maintaining overall health:
– Periodic communication with the datanodes
– Block re-replication and rebalancing
– Garbage collection
DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores meta-data of a block (e.g. CRC)
– Serves data and meta-data to Clients
• Block Report
– Periodically sends a report of all existing blocks to
the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Block Placement
• Current Strategy
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are randomly placed
• Clients read from nearest replica
• Would like to make this policy pluggable
Data Correctness
• Use Checksums to validate data
– Use CRC32
• File Creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksum
• File access
– Client retrieves the data and checksum from
DataNode
– If Validation fails, Client tries other replicas
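
A minimal plain-Java illustration of the idea (not the actual HDFS checksum code): compute a CRC32 per 512-byte chunk at write time, recompute on read, and fall back to another replica on mismatch:

import java.util.Arrays;
import java.util.zip.CRC32;

public class ChunkChecksumDemo {
  static long[] checksumChunks(byte[] data, int chunkSize) {
    int chunks = (data.length + chunkSize - 1) / chunkSize;
    long[] sums = new long[chunks];
    for (int i = 0; i < chunks; i++) {
      CRC32 crc = new CRC32();
      int off = i * chunkSize;
      int len = Math.min(chunkSize, data.length - off);
      crc.update(data, off, len);              // checksum one 512-byte chunk
      sums[i] = crc.getValue();
    }
    return sums;
  }

  public static void main(String[] args) {
    byte[] block = new byte[2000];                       // stand-in for block data
    long[] stored = checksumChunks(block, 512);          // kept alongside the block at write time
    long[] recomputed = checksumChunks(block, 512);      // recomputed by the client on read
    System.out.println(Arrays.equals(stored, recomputed)
        ? "checksums match" : "corruption detected: try another replica");
  }
}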
NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
Putting everything together…

[Figure: the namenode runs the namenode daemon, and the job submission node runs the jobtracker; each slave node runs a tasktracker and a datanode daemon on top of its local Linux file system.]
MapReduce Data Flow

Figures are from Hadoop: The Definitive Guide, 2nd Edition, Tom White, O’Reilly
• References:
• Hadoop: The Definitive Guide, Tom White, O’Reilly
• Hadoop in Action, Chuck Lam, Manning
