
Big Data Processing

Jiaul Paik
Lecture 8
Programming on Hadoop

Map-reduce Model
Example Problem: Counting Words

We have a huge collection of text documents and want to count the number of times each distinct word appears in the collection.

• Sample applications

• Result ranking, language modelling

How can you solve this using a single machine?


Word Count

• Case 1:

• Files too large for memory, but all <word, count> pairs fit in memory

• You can create a big array of <word, count> entries OR a hash table mapping each word to its count
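A minimal sketch of the hash-table approach in plain Java (the file path and tokenization rule are illustrative):

// Case 1 sketch: stream the file, keep all <word, count> pairs in a hash table.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class InMemoryWordCount {
    public static Map<String, Long> count(String path) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(path))) {
            lines.flatMap(line -> Stream.of(line.toLowerCase().split("\\s+")))
                 .filter(w -> !w.isEmpty())
                 .forEach(w -> counts.merge(w, 1L, Long::sum));  // increment this word's count
        }
        return counts;
    }
}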
Word Count
• Case 2: All <word, count> pairs do not fit in memory, but fit on disk

• A possible approach (write a program/function for each step; a sketch follows below)

1. Break the text document into a sequence of words: getWords(textFile)

2. Sort the words (this brings identical words together): sortWords(list)

3. Count the frequencies in a single pass: countWords(sorted list)

This captures the essence of the Map-Reduce model.
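A sketch of the three steps in plain Java, using the function names from the slide. It is shown in memory for clarity; when the data does not fit in memory, sortWords would be replaced by an external (disk-based) merge sort over the word list:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortBasedWordCount {

    // 1. Break the text file into a sequence of words
    static List<String> getWords(String textFile) throws IOException {
        List<String> words = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(textFile))) {
            for (String w : line.toLowerCase().split("\\s+")) {
                if (!w.isEmpty()) words.add(w);
            }
        }
        return words;
    }

    // 2. Sort the words: identical words become adjacent
    static List<String> sortWords(List<String> list) {
        Collections.sort(list);
        return list;
    }

    // 3. Count frequencies in a single pass over the sorted list
    static void countWords(List<String> sortedList) {
        int i = 0;
        while (i < sortedList.size()) {
            String word = sortedList.get(i);
            int count = 0;
            while (i < sortedList.size() && sortedList.get(i).equals(word)) {
                count++;
                i++;
            }
            System.out.println(word + "\t" + count);
        }
    }
}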


Map-Reduce: In a Nutshell

getWords(dataFile) → sort → count

Map: extract something you care about (here, the word and a count)
Group by key: sort and shuffle
Reduce: aggregate, summarize

Summary

1. The overall structure remains the same

2. Map and Reduce are defined by the user to fit the problem (see the signature sketch below)
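A minimal sketch of the user-supplied pieces (not the exact Hadoop API): the framework fixes the overall structure, and the user plugs in map() and reduce():

import java.util.List;
import java.util.Map;

interface MapReduceJob<K1, V1, K2, V2, V3> {
    // map: (k1, v1) -> list of intermediate (k2, v2) pairs
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);

    // reduce: (k2, [v2, v2, ...]) -> aggregated output value
    V3 reduce(K2 key, List<V2> values);
}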


MapReduce: Map Step

Input key-value pairs (a file name f and its content c) are fed to map calls, each of which emits intermediate key-value pairs (word and count):

(f1, c1), (f2, c2), (f3, c3), ... → map → (k1, v1), (k2, v2), (k3, v3), (k4, v4), ...
MapReduce: Reduce Step

The key-value pairs produced by the Map step are grouped by key into key-value groups, and reduce is applied to each group to produce the output key-value pairs:

k1: [v1, v3, v6] → reduce → (k1, v′)
k2: [v2, v5] → reduce → (k2, v′′)
k3: [v4] → reduce → (k3, v′′′)
Map-reduce: Word Count
Input (big document):
"deep learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural nets have been applied to fields including computer vision ..."

MAP (provided by the programmer): read the input and produce a set of (key, value) pairs, e.g. (deep, 1), (learning, 1), (architectures, 1), (such, 1), (as, 1), ..., (networks, 1), ...

Group by key (handled by the MR system): collect all pairs with the same key, e.g. (deep, 1) (deep, 1) ..., (networks, 1) (networks, 1) (networks, 1), (the, 1) (the, 1) (the, 1), (reinforcement, 1), (vision, 1), ...

REDUCE (provided by the programmer): collect all values belonging to each key and output the totals, e.g. (deep, 2), (networks, 3), (the, 3), (reinforcement, 1), (vision, 1), ...
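The same word-count logic expressed with the Hadoop MapReduce Java API; this is a sketch in the style of the standard Hadoop WordCount example (the job driver and input/output paths are omitted):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: for each input line, emit (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);           // intermediate pair (word, 1)
            }
        }
    }

    // Reduce: the framework groups pairs by word; sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);              // output pair (word, total count)
        }
    }
}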
Map-Reduce Execution

Detailed Look
Map-reduce System: Inside Look
Map tasks (map task 1, map task 2, map task 3, ...) run in parallel, each applying the map function M to its data chunks and emitting intermediate pairs such as (k1, v1), (k2, v2), (k3, v4), ... Each map task's output passes through a partition function that decides which reduce task receives each key. On the reduce side, SORT and GROUP collect the pairs by key, e.g. k1: [v1, v4, v5], k4: [v4], k2: [v2, v1, v2, v6], k3: [v4, v5, v7], and the reduce tasks (reduce task 1, reduce task 2) apply R to each group and write the final output.

All phases are distributed, with many tasks running in parallel.

Partitioning
Output from mappers (on multiple machines) is partitioned by key into output files: a partition function assigns each key to a bucket, H(key) = b with 0 ≤ b < m, so all pairs with the same key end up in the same one of the m partitions, numbered 0 to m-1.
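A sketch of hash partitioning in the style of Hadoop's default HashPartitioner: the bucket is H(key) mod numReduceTasks, so all pairs with the same key go to the same reduce task (the class name here is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // mask off the sign bit so the result is non-negative, then take mod m
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}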
Shuffle and Sort

IMPORTANT and EXPENSIVE OPERATION

• MapReduce guarantees that the input to every reducer is sorted by key

• The heart of MapReduce is the shuffle operation

• Shuffle does the following

1. performs the sort
2. transfers the map outputs to the reducers as inputs
Shuffle and Sort: the map side

Each map task reads its input data (chunk) and buffers its output in memory. When the buffer fills, the contents are partitioned, sorted, and spilled to disk; the spill files are then merged on local disk into sorted partitions, one per reducer.
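The size of the in-memory buffer and the spill threshold are configurable; a sketch using the Hadoop Configuration API (property names as in Hadoop 2.x/3.x; the values are examples, not recommendations):

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // size (MB) of the in-memory buffer that collects map output before spilling
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        // fraction of the buffer that may fill before a background spill to disk starts
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        return conf;
    }
}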
