Big Data Processing, MapReduce
Outline
• Batch and Transactional Processing
• Hadoop
• MapReduce
Reference:
• Chapter 6, “Big Data Fundamentals: Concepts, Drivers & Techniques”, by Thomas Erl, Wajid Khattak, Paul Buhler. 1st Ed. ISBN-10: 0134291077.
Big Data Management Software Stack
Distributed Data Processing
• Achieved through physically separate machines that are networked together as a cluster
Processing Workloads
• Batch: data is processed in batches, which usually imposes delays and in turn results in high-latency responses
o Also known as offline processing
Batch Processing
● A batch workload can include grouped reads and writes that INSERT, SELECT, UPDATE and DELETE data
● Response times can vary from minutes to hours
● Generally involves processing a range of large datasets
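To make the idea of grouped reads and writes concrete, here is a minimal sketch that uses Python's built-in sqlite3 module. The `orders` table, its columns, and the `run_batch` helper are hypothetical names chosen for illustration; a real batch job would run against far larger datasets and take much longer.

```python
import sqlite3

# Hypothetical in-memory table standing in for a large dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, status TEXT)"
)

def run_batch(conn, new_orders):
    """Apply a group of writes and reads together as one batch."""
    # `with conn` commits once at the end, or rolls back if a statement fails.
    with conn:
        conn.executemany(
            "INSERT INTO orders (id, amount, status) VALUES (?, ?, ?)",
            new_orders,
        )
        conn.execute("UPDATE orders SET status = 'billed' WHERE amount >= 100")
        conn.execute("DELETE FROM orders WHERE amount <= 0")
        summary = conn.execute(
            "SELECT COUNT(*), SUM(amount) FROM orders"
        ).fetchone()
    return summary

# Illustrative data only: four orders submitted as a single batch.
batch = [(1, 250.0, "new"), (2, 40.0, "new"), (3, 0.0, "new"), (4, 120.0, "new")]
row_count, total_amount = run_batch(conn, batch)
```

The caller only gets an answer once the whole group of statements has finished, which is the source of the high latency described above.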
Transactional Processing
● Transactional workloads have fewer joins and lower-latency responses than batch workloads
● Generally more write-intensive than read-intensive
● Smaller data footprint
Hadoop
● Hadoop is a versatile framework that provides both processing and storage capabilities
Batch Processing with MapReduce
● MapReduce is a programming model that allows parallel and distributed
processing of data across a Hadoop cluster
● Does not require that the input data conform to any particular data model
● Data processing algorithm is moved to the nodes that store the data
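The last point, moving the algorithm to the data, can be illustrated with a small simulation. This is only a sketch and does not use Hadoop: the `NODE_STORAGE` dictionary, the node names, and the helper functions are hypothetical, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each "node" already stores its own block of records.
NODE_STORAGE = {
    "node-a": [3, 8, 15, 4],
    "node-b": [23, 42, 16],
    "node-c": [7, 1],
}

def run_on_node(node, task):
    """Simulate shipping `task` to `node`, where it reads only local data."""
    local_block = NODE_STORAGE[node]
    return task(local_block)

def local_sum_and_count(block):
    """The algorithm that travels to the data; returns a small partial result."""
    return sum(block), len(block)

# Threads stand in for cluster nodes working in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(
        pool.map(lambda node: run_on_node(node, local_sum_and_count), NODE_STORAGE)
    )

# Only the small partial results cross the "network"; the raw blocks never move.
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
mean = total / count
```

Each node returns two numbers regardless of how large its block is, so the data that has to travel stays small even when the stored data is large.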
Map and Reduce Tasks
Example #1 of MapReduce
Goal: Count the number of times a word appeared in a document
1. Map: divide the documents and assign them to the servers (e.g., 20 each)
• (Key, Value) pair → (Word, Count) → (“Taco”, 7)
2. Combine and Partition if necessary
3. Shuffle and Sort → take the output from the previous stage and combine it into a sorted list
4. Reduce → sum or merge to arrive at the final result
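The four stages above can be sketched in plain Python on a single machine. This is an illustration of the data flow, not the Hadoop API: all function names, the toy `partition` hash, and the sample documents are assumptions made for this example.

```python
from collections import Counter, defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    words = (token.strip(".,!?").lower() for token in document.split())
    return [(word, 1) for word in words if word]

def combine(pairs):
    """Combine: pre-sum counts on the mapper to shrink its output."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def partition(word, num_reducers):
    """Partition: pick the reducer responsible for a key (toy hash)."""
    return sum(word.encode("utf-8")) % num_reducers

def shuffle_and_sort(mapper_outputs, num_reducers):
    """Shuffle and sort: group counts by word per reducer, words sorted."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for word, count in pairs:
            buckets[partition(word, num_reducers)][word].append(count)
    return [sorted(bucket.items()) for bucket in buckets]

def reduce_phase(word, counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(counts)

def word_count(documents, num_reducers=2):
    # One mapper per document here; a real job assigns many documents per server.
    mapper_outputs = [combine(map_phase(doc)) for doc in documents]
    result = {}
    for bucket in shuffle_and_sort(mapper_outputs, num_reducers):
        for word, counts in bucket:
            key, total = reduce_phase(word, counts)
            result[key] = total
    return result

# Illustrative input only.
documents = ["Taco taco burrito", "Burrito taco salsa", "salsa taco"]
counts = word_count(documents)
```

Because every occurrence of a word is routed to the same reducer, each reducer can compute the final count for its words without consulting the others.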
Example #2 of MapReduce
Goal: Count and catalog all the coins in a
pile (different currency types and
denominations)
“Classical” approach to parallel computing
Ref: https://freecontent.manning.com/explaining-mapreduce-with-ducks/
MapReduce
Ref: https://freecontent.manning.com/explaining-mapreduce-with-ducks/
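A minimal sketch of how the coin-counting goal could be expressed in MapReduce terms follows. It is not taken from the referenced article: the sample pile, the representation of a coin as a (currency, value in minor units) pair, and all function names are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical pile: each coin is (currency, value in minor units, e.g. cents).
pile = [
    ("USD", 25), ("EUR", 100), ("USD", 25), ("GBP", 50),
    ("EUR", 100), ("USD", 10), ("EUR", 200), ("USD", 25),
]

def split_pile(pile, num_mappers):
    """Hand each mapper its own share of the pile."""
    return [pile[i::num_mappers] for i in range(num_mappers)]

def map_handful(handful):
    """Map: label every coin with its type as the key and 1 as the value."""
    return [(coin, 1) for coin in handful]

def shuffle_and_sort(mapped_handfuls):
    """Shuffle and sort: bring all pairs with the same coin type together."""
    all_pairs = sorted(pair for handful in mapped_handfuls for pair in handful)
    return [
        (coin, [n for _, n in group])
        for coin, group in groupby(all_pairs, key=itemgetter(0))
    ]

def reduce_coin(coin, ones):
    """Reduce: count the coins of one currency and denomination."""
    return coin, sum(ones)

mapped = [map_handful(h) for h in split_pile(pile, num_mappers=3)]
catalog = dict(reduce_coin(coin, ones) for coin, ones in shuffle_and_sort(mapped))
```

In this sketch the mappers never need to agree on a running total; they only label coins, and each reducer counts one coin type in isolation.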