Hadoop MapReduce
Pietro Michiardi
Eurecom
Introduction
What is MapReduce
A programming model:
I Inspired by functional programming
I Allows expressing distributed computations on massive amounts of
data
An execution framework:
I Designed for large-scale data processing
I Designed to run on clusters of commodity hardware
Motivations
Big Data
An enormous store of data
Sharing is difficult:
I Synchronization, deadlocks
I Finite bandwidth to access data from SAN
I Temporal dependencies are complicated (restarts)
Implications of Failures
Sources of Failures
I Hardware / Software
I Electrical, Cooling, ...
I Unavailability of a resource due to overload
Failure Types
I Permanent
I Transient
(Footnote: from a post by Ted Dunning on the Hadoop mailing list.)
Big Ideas
Data-intensive applications
I Read and process the whole Internet dataset from a crawler
I Read and process the whole Social Graph
Auxiliary components
I Hadoop Pig
I Hadoop Hive
I Cascading/Scalding
I ... and many many more!
Seamless Scalability
Part One
Preliminaries
Handle failures
[Illustration: the map phase applies f to every element of the input list; the fold phase combines the results with g.]
map phase:
I Given a list, map takes as an argument a function f (that takes a
single argument) and applies it to all elements in the list
fold phase:
I Given a list, fold takes as arguments a function g (that takes two
arguments) and an initial value
I g is first applied to the initial value and the first item in the list
I The result is stored in an intermediate variable, which is used as an
input together with the next item to a second application of g
I The process is repeated until all items in the list have been
consumed
In practice:
I User-specified computation is applied (in parallel) to all input
records of a dataset
I Intermediate results are aggregated by another user-specified
computation
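As an illustration (not part of the original slides), here is a minimal Java sketch of the two phases on a plain list, using the standard streams API; the input values and the choice of f (squaring) and g (addition) are arbitrary examples.

  import java.util.Arrays;
  import java.util.List;
  import java.util.stream.Collectors;

  public class MapFoldSketch {
      public static void main(String[] args) {
          List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);

          // "map" phase: apply f (here, squaring) to every element of the list
          List<Integer> squared = input.stream()
                  .map(x -> x * x)
                  .collect(Collectors.toList());

          // "fold" phase: combine elements with g (here, addition), starting from 0
          int sum = squared.stream().reduce(0, Integer::sum);

          System.out.println(squared + " -> " + sum); // [1, 4, 9, 16, 25] -> 55
      }
  }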
Data Structures
A MapReduce job
(Footnote: we use the convention [· · · ] to denote a list.)
The MapReduce Framework
Input:
I Key-value pairs: (docid, doc) stored on the distributed filesystem
I docid: unique identifier of a document
I doc: the text of the document itself
Mapper:
I Takes an input key-value pair and tokenizes the document
I Emits intermediate key-value pairs: the word is the key and the
integer 1 is the value
The framework:
I Guarantees all values associated with the same key (the word) are
brought to the same reducer
The reducer:
I Receives all values associated with the same key
I Sums the values and writes output key-value pairs: the key is the
word and the value is the number of occurrences
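A minimal sketch of such a word count Mapper and Reducer, written against the Hadoop Java API (org.apache.hadoop.mapreduce); this is not taken from the original slides, and the class and variable names are illustrative.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Mapper: for each input record (byte offset, line of text), tokenize the
  // line and emit an intermediate pair (word, 1) for every token.
  public class WordCountMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          StringTokenizer tok = new StringTokenizer(value.toString());
          while (tok.hasMoreTokens()) {
              word.set(tok.nextToken());
              context.write(word, ONE);
          }
      }
  }

  // Reducer: receives (word, [1, 1, ...]) with all counts for the same word,
  // sums them, and emits (word, total). In practice this class goes in its own file.
  class WordCountReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
              sum += v.get();
          }
          context.write(key, new IntWritable(sum));
      }
  }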
Side effects
I Not allowed in functional programming
I E.g.: preserving state across multiple inputs
I State is kept internal
Scheduling
Each Job is broken into tasks
I Map tasks work on fractions of the input dataset, as defined by the
underlying distributed filesystem
I Reduce tasks work on intermediate inputs and write back to the
distributed filesystem
Scheduling
Data/code co-location
Synchronization
Hardware failures
I Individual machines: disks, RAM
I Networking equipment
I Power / cooling
Software failures
I Exceptions, bugs
Partitioners
Hash-based partitioner
I Computes the hash of the key modulo the number of reducers r
I This ensures a roughly even partitioning of the key space
F However, it ignores values: this can cause imbalance in the data
processed by each reducer
I When dealing with complex keys, even the base partitioner may
need customization
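A sketch of this hash-based rule, written as a Hadoop Partitioner; the class name and the key/value types are illustrative (they match the word count example above).

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  // Assigns each intermediate key to a reducer as hash(key) mod r.
  // Values are ignored, which is why heavily skewed keys can overload one reducer.
  public class WordPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
          // mask the sign bit so the modulo result is never negative
          return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
  }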
Combiners
Distributed filesystems
HDFS
Master-slave architecture
I NameNode: master maintains the namespace (metadata, file to
block mapping, location of blocks) and maintains overall health of
the file system
I DataNode: slaves manage the data blocks
HDFS, an Illustration
HDFS I/O
A typical read from a client involves:
1 Contact the NameNode to determine where the actual data is stored
2 NameNode replies with block identifiers and locations (i.e., which
DataNode)
3 Contact the DataNode to fetch data
That is, each write creates new data rather than overwriting existing data in place.
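A minimal client-side sketch of this read path, using the Hadoop FileSystem API (not from the original slides); the input path is a placeholder and the configuration is assumed to point at the cluster's NameNode.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsReadExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // fs.defaultFS is assumed to point at the NameNode, e.g. hdfs://namenode:8020
          FileSystem fs = FileSystem.get(conf);

          // open() contacts the NameNode for block identifiers and locations;
          // the returned stream then reads the blocks directly from DataNodes
          try (FSDataInputStream in = fs.open(new Path("/user/data/input.txt"));
               BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
              String line;
              while ((line = reader.readLine()) != null) {
                  System.out.println(line);
              }
          }
      }
  }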
Replication policy
I Spread replicas across different racks
I Robust against cluster node failures
I Robust against rack failures
"Batch-oriented" có nghĩa là công việc được thực hiện theo từng đợt xử lý (batch) thay vì xử lý tất cả cùng một lúc. Các tác vụ hoặc dữ
liệu được nhóm lại và xử lý cùng một lúc trong mỗi đợt.
Workloads are batch oriented
A cooperative scenario is a multithreading model in which threads voluntarily share
resources and interact with one another to complete a common task, without relying on
the operating system.
Cooperative scenario
Part Two
Preliminaries
Hadoop Deployments
I The BigFoot platform (if time allows)
Terminology
Job: represents the entire execution of a problem or task over a data set. It comprises
Mapper and Reducer processing phases; several jobs may run in parallel or one after
another on the MapReduce system. Each job can have its own settings and options, and
its own input and output data sets.
MapReduce:
I Job: an execution of a Mapper and Reducer across a data set
I Task: an execution of a Mapper or a Reducer on a slice of data
I Task Attempt: instance of an attempt to execute a task
When a task is executed, it may have several attempts; each attempt represents one
specific execution instance of the task. If an attempt fails, the system can try to run the
task again by creating a new attempt.
I Example:
F Running “Word Count” across 20 files is one job
F 20 files to be mapped = 20 map tasks + some number of reduce tasks
F At least 20 attempts will be performed... more if a machine crashes
A task represents one specific processing step in the execution of a job. There are two
main kinds of tasks, Mapper and Reducer. A job typically has many Mapper and Reducer
tasks, matching the number of input partitions and the number of key groups. Each task
runs independently on a portion of the input data set and produces intermediate output
(mapper or reducer output) to be used in the following processing steps.
Task Attempts
I Task attempted at least once, possibly more
I Multiple crashes on input imply discarding it
I Multiple attempts may occur in parallel (speculative execution)
I Task ID from TaskInProgress is not a unique identifier
The discarded input data is not processed and does not contribute to the final
MapReduce result. Discarding input can happen when repeated failures occur or when
the data is unreliable or invalid.
HDFS in details
HDFS Blocks
If a file is smaller than one block, the system uses only part of that block to store it,
making good use of the available storage space.
(Big) files are broken into block-sized chunks
I NOTE: A file that is smaller than a single block does not occupy a
full block’s worth of underlying storage
NameNode
I Keeps metadata in RAM
I The metadata for each block occupies roughly 150 bytes of memory
I Without NameNode, the filesystem cannot be used
F Persistence of metadata: synchronous and atomic writes to NFS
Secondary NameNode
I Merges the namespace with the edit log
I A useful trick to recover from a failure of the NameNode is to use the
NFS copy of metadata and switch the secondary to primary
DataNode
I They store data and talk to clients
I They report periodically to the NameNode the list of blocks they hold
“External” clients
I For each block, the NameNode returns a set of DataNodes holding
a copy thereof
I DataNodes are sorted according to their proximity to the client
“MapReduce” clients
I TaskTracker and DataNodes are colocated
I For each block, the NameNode usually returns the local DataNode
(exceptions exist due to stragglers)
Details on replication
I Clients ask NameNode for a list of suitable DataNodes
I This list forms a pipeline: first DataNode stores a copy of a
block, then forwards it to the second, and so on
Replica Placement
I Tradeoff between reliability and bandwidth
I Default placement:
F First copy on the “same” node of the client, second replica is off-rack,
third replica is on the same rack as the second but on a different node
F Since Hadoop 0.21, replica placement can be customized
Hadoop I/O
What’s next
I Overview of what Hadoop offers
I For in-depth knowledge, refer to [11]
Data Integrity
Compression
Splittable files, Example 2 (gzip)
I Consider a compressed file of 1GB
I HDFS will split it into 16 blocks of 64 MB each
I Creating an InputSplit for each block will not work, since it is not
possible to read at an arbitrary point
Serialization
Sequence Files
Specialized data structure to hold custom input data
I Using blobs of binaries is not efficient
SequenceFiles
I Provide a persistent data structure for binary key-value pairs
I Also work well as containers for smaller files, so that the framework
is happier (remember: a few large files are better than lots of small files)
I They come with a sync() method to introduce sync points, which
help in managing InputSplits for MapReduce
(This ensures the file can be updated quickly at the exact point where data must be
written.)
(Java virtual machine)
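A sketch of writing a SequenceFile of binary key-value pairs with periodic sync points (not from the original slides); the path and record contents are illustrative, and the classic createWriter(FileSystem, Configuration, ...) signature is assumed to be available.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class SequenceFileWriteExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          Path path = new Path("/user/data/records.seq"); // illustrative path

          SequenceFile.Writer writer = SequenceFile.createWriter(
                  fs, conf, path, IntWritable.class, Text.class);
          try {
              for (int i = 0; i < 1000; i++) {
                  writer.append(new IntWritable(i), new Text("record-" + i));
                  if (i % 100 == 0) {
                      writer.sync(); // write a sync marker to ease later splitting
                  }
              }
          } finally {
              writer.close();
          }
      }
  }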
Job Submission
JobClient class
I The runJob() method creates a new instance of a JobClient
I Then it calls the submitJob() on this class
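A minimal old-API (org.apache.hadoop.mapred) driver illustrating this flow: it builds a JobConf and submits it with JobClient.runJob(), which internally creates a JobClient, calls submitJob(), and polls for progress. The identity Mapper/Reducer are used only to keep the sketch self-contained; the input and output paths come from the command line.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class PassThroughDriver {
      public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(PassThroughDriver.class);
          conf.setJobName("pass-through");

          // identity mapper/reducer: the job simply copies its input records
          conf.setMapperClass(IdentityMapper.class);
          conf.setReducerClass(IdentityReducer.class);
          conf.setOutputKeyClass(LongWritable.class);
          conf.setOutputValueClass(Text.class);

          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));

          // runJob() creates a JobClient, calls submitJob(), and waits for completion
          JobClient.runJob(conf);
      }
  }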
Job Initialization
Task Assignment
Heartbeat-based mechanism
I TaskTrackers periodically send heartbeats to the JobTracker
I The heartbeat signals that the TaskTracker is alive
I The heartbeat also contains information on the availability of the
TaskTracker to execute a task
I The JobTracker piggybacks a task on the heartbeat response if the
TaskTracker is available
Selecting a task
I JobTracker first needs to select a job (i.e. scheduling)
I TaskTrackers have a fixed number of slots for map and reduce
tasks
I JobTracker gives priority to map tasks (WHY?)
Data locality
I JobTracker is topology aware
F Useful for map tasks
F Unused for reduce tasks
Task Execution
Handling Failures
In the real world, code is buggy, processes crash, and machines fail
(Footnotes: with streaming, you need to take care of the orphaned process. An exception is made for speculative execution.)
Handling Failures
TaskTracker Failure
I Types: crash, running very slowly
I Heartbeats will not be sent to JobTracker
I JobTracker waits for a timeout (10 minutes), then it removes the
TaskTracker from its scheduling pool
I JobTracker needs to reschedule even completed tasks (WHY?)
I JobTracker needs to reschedule tasks in progress
I JobTracker may even blacklist a TaskTracker if too many tasks
failed
JobTracker Failure
I Currently, Hadoop has no mechanism for this kind of failure
I In future releases:
F Multiple JobTrackers
F Use ZooKeeper as a coordination mechanisms
Scheduling
FIFO Scheduler (default behavior)
I Each job uses the whole cluster
I Not suitable for shared production-level cluster
F Long jobs monopolize the cluster
F Short jobs are held back and have no guarantees on execution time
Fair Scheduler
I Every user gets a fair share of the cluster capacity over time
I Jobs are placed into pools, one for each user
F Users that submit more jobs get no more resources than others
F Can guarantee minimum capacity per pool
I Supports preemption
I “Contrib” module, requires manual installation
Capacity Scheduler
I Hierarchical queues (mimic an organization)
I FIFO scheduling in each queue
I Supports priority
(Notes: intermediate data in memory is partitioned by key and sorted, then written to
disk. The map side does not push data to the reducers; instead, each reducer requests
and fetches the files it needs, over the network. While a spill is being written to disk,
the map task may have to pause.)
Disk spills
I Written in round-robin fashion to a local directory
I Output data is partitioned according to the reducer it will be
sent to
I Within each partition, data is sorted (in-memory)
I Optionally, if there is a combiner, it is executed just after the sort
phase
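A sketch of enabling a combiner from the driver, assuming Hadoop 2.x or later and the WordCountMapper/WordCountReducer classes sketched earlier (each compiled as a public top-level class). Because the word count reducer is associative and commutative, it can also run as a combiner on each map-side spill, right after the in-memory sort.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCountWithCombiner {
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count with combiner");
          job.setJarByClass(WordCountWithCombiner.class);

          job.setMapperClass(WordCountMapper.class);
          // the reducer logic doubles as a combiner, applied to each spill
          job.setCombinerClass(WordCountReducer.class);
          job.setReducerClass(WordCountReducer.class);

          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }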
Input consolidation
I A background thread merges all partial inputs into larger, sorted
files
I Note that if compression was used (for map outputs to save
bandwidth), decompression will take place in memory
MapReduce Types
Types:
I K types implement WritableComparable
I V types implement Writable
What is a Writable
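As an illustration (not from the original slides), a minimal custom key type implementing WritableComparable; the composite (word, year) key and the class name are arbitrary examples.

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;

  import org.apache.hadoop.io.WritableComparable;

  // A composite key usable as a MapReduce key: it can be serialized,
  // deserialized, and compared, so the framework can sort and group by it.
  public class WordYearKey implements WritableComparable<WordYearKey> {
      private String word = "";
      private int year;

      public void set(String word, int year) {
          this.word = word;
          this.year = year;
      }

      @Override
      public void write(DataOutput out) throws IOException {
          out.writeUTF(word);
          out.writeInt(year);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
          word = in.readUTF();
          year = in.readInt();
      }

      @Override
      public int compareTo(WordYearKey other) {
          int cmp = word.compareTo(other.word);
          return cmp != 0 ? cmp : Integer.compare(year, other.year);
      }

      @Override
      public int hashCode() { // used by the default hash-based partitioner
          return word.hashCode() * 31 + year;
      }

      @Override
      public boolean equals(Object o) {
          if (!(o instanceof WordYearKey)) return false;
          WordYearKey k = (WordYearKey) o;
          return word.equals(k.word) && year == k.year;
      }
  }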
Reading Data
TextInputFormat
I Treats each newline-terminated line of a file as a value
KeyValueTextInputFormat
I Maps newline-terminated text lines of “key” SEPARATOR “value”
SequenceFileInputFormat
I Binary file of key-value pairs with some additional metadata
SequenceFileAsTextInputFormat
I Same as before, but maps (k.toString(), v.toString())
Record Readers
LineRecordReader
I Reads a line from a text file
KeyValueRecordReader
I Used by KeyValueTextInputFormat
WritableComparator
Configured through:
JobConf.setOutputValueGroupingComparator()
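A sketch of a custom grouping comparator (illustrative class name and grouping rule): it groups Text keys by their prefix up to the first '#', so that all corresponding values reach a single reduce() call. It would be registered with JobConf.setOutputValueGroupingComparator().

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;

  public class PrefixGroupingComparator extends WritableComparator {
      protected PrefixGroupingComparator() {
          super(Text.class, true); // instantiate Text keys for comparison
      }

      @Override
      @SuppressWarnings("rawtypes")
      public int compare(WritableComparable a, WritableComparable b) {
          // compare only the part of the key before the first '#'
          String ka = a.toString().split("#")[0];
          String kb = b.toString().split("#")[0];
          return ka.compareTo(kb);
      }
  }

  // registration on a JobConf (old API):
  // conf.setOutputValueGroupingComparator(PrefixGroupingComparator.class);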
Partitioner
The Reducer
Analogous to InputFormat
Preliminaries
Configuration
Before writing a MapReduce program, we need to set up and
configure the development environment
I Components in Hadoop are configured with an ad hoc API
I Configuration class is a collection of properties and their values
I Resources can be combined into a configuration
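A minimal sketch of using the Configuration class (the resource file names and property keys are illustrative, not from the original slides).

  import org.apache.hadoop.conf.Configuration;

  public class ConfigurationExample {
      public static void main(String[] args) {
          Configuration conf = new Configuration();

          // combine resources: properties in later files override earlier ones
          conf.addResource("configuration-default.xml");
          conf.addResource("configuration-site.xml");

          // read typed properties, supplying a default value as a fallback
          int width = conf.getInt("size.width", 0);
          String color = conf.get("color", "undefined");

          System.out.println("width = " + width + ", color = " + color);
      }
  }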
Alternatives
I Switch configurations (local, cluster)
I Alternatives (see the Cloudera documentation for Ubuntu) are very
effective
Local Execution
Cluster Execution
Packaging
Launching a Job
The WebUI
Hadoop Logs
Running Dependent Jobs, and Oozie
Hadoop Deployments
Cluster deployment
I Private cluster
I Cloud-based cluster
I AWS Elastic MapReduce
Outlook:
I Cluster specification
F Hardware
F Network Topology
I Hadoop Configuration
F Memory considerations
Cluster Specification
Commodity Hardware
I Commodity ≠ Low-end
F False economy due to failure rate and maintenance costs
I Commodity ≠ High-end
F High-end machines perform better, which would imply a smaller
cluster
F A single machine failure would compromise a large fraction of the
cluster
A 2010 specification:
I 2 quad-cores
I 16-24 GB ECC RAM
I 4 × 1 TB SATA disks (why not use RAID instead of JBOD?)
I Gigabit Ethernet
Cluster Specification
Example:
I Assume your data grows by 1 TB per week
I Assume you have three-way replication in HDFS
→ You need an additional 3 TB of raw storage per week
I Allow for some overhead (temporary files, logs)
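As a rough, illustrative calculation: 1 TB of new data per week with three-way replication means 3 TB of raw storage per week; allowing, say, 30% extra for temporary files and logs gives roughly 4 TB per week, i.e. on the order of 200 TB of additional raw capacity per year.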
Cluster Specification
Typical configuration
I 30-40 servers per rack
I 1 Gbit/s switch per rack
I Core switch or router with 1 Gbit/s or better
Features
I Aggregate bandwidth between nodes on the same rack is much
larger than for nodes on different racks
I Rack awareness
F Hadoop should know the cluster topology
F Benefits both HDFS (data placement) and MapReduce (locality)
Hadoop Configuration
There are a handful of files for controlling the operation of a
Hadoop cluster
I See next slide for a summary table
Hadoop Configuration
Hadoop on EC2
Example
I Launch a cluster test-hadoop-cluster, with one master node
(JobTracker and NameNode) and 5 worker nodes (DataNodes
and TaskTrackers)
→ hadoop-ec2 launch-cluster test-hadoop-cluster 5
I See project webpage and Chapter 9, page 290 [11]
Hadoop as a service
I Amazon handles everything, which becomes transparent
I How this is done remains a mystery
References I
References II
References III
References IV