
Big Data Processing

Jiaul Paik
Lecture 8
Programming on Hadoop

Map-reduce Model
Example Problem: Counting Words

We have a huge collection of text documents and want to count the number of times each distinct word appears in the collection.

• Sample applications

• Result ranking, language modelling

How can you solve this using a single machine?


Word Count

• Case 1:

• Files too large for memory, but all <word, count> pairs fit in memory

• You can create a big array of <word, count> entries OR a hash table mapping each word to its count
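A minimal sketch of the hash-table approach in plain Java (the file path and tokenization rule are illustrative):

// Case 1 sketch: stream the file, keep all <word, count> pairs in a hash table.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class InMemoryWordCount {
    public static Map<String, Long> count(String path) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(path))) {
            lines.flatMap(line -> Stream.of(line.toLowerCase().split("\\s+")))
                 .filter(w -> !w.isEmpty())
                 .forEach(w -> counts.merge(w, 1L, Long::sum));  // increment this word's count
        }
        return counts;
    }
}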
Word Count
• Case 2: All <word, count> pairs do not fit in memory, but fit on disk

• A possible approach (write a program/function for each step; a sketch follows below)

1. Break the text document into a sequence of words: getWords(textFile)

2. Sort the words (this brings identical words together): sortWords(list)

3. Count the frequencies in a single pass: countWords(sorted list)

This captures the essence of the Map-Reduce model.
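A sketch of the three steps in plain Java, using the function names from the slide. It is shown in memory for clarity; when the data does not fit in memory, sortWords would be replaced by an external (disk-based) merge sort over the word list:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortBasedWordCount {

    // 1. Break the text file into a sequence of words
    static List<String> getWords(String textFile) throws IOException {
        List<String> words = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(textFile))) {
            for (String w : line.toLowerCase().split("\\s+")) {
                if (!w.isEmpty()) words.add(w);
            }
        }
        return words;
    }

    // 2. Sort the words: identical words become adjacent
    static List<String> sortWords(List<String> list) {
        Collections.sort(list);
        return list;
    }

    // 3. Count frequencies in a single pass over the sorted list
    static void countWords(List<String> sortedList) {
        int i = 0;
        while (i < sortedList.size()) {
            String word = sortedList.get(i);
            int count = 0;
            while (i < sortedList.size() && sortedList.get(i).equals(word)) {
                count++;
                i++;
            }
            System.out.println(word + "\t" + count);
        }
    }
}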


Map-Reduce: In a Nutshell

getWords(dataFile) → sort → count

Map: extract something you care about (here, the word and a count)
Group by key: sort and shuffle
Reduce: aggregate, summarize

Summary

1. The overall structure remains the same

2. Map and Reduce are defined by the user to fit the problem (see the signature sketch below)
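A minimal sketch of the user-supplied pieces (not the exact Hadoop API): the framework fixes the overall structure, and the user plugs in map() and reduce():

import java.util.List;
import java.util.Map;

interface MapReduceJob<K1, V1, K2, V2, V3> {
    // map: (k1, v1) -> list of intermediate (k2, v2) pairs
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);

    // reduce: (k2, [v2, v2, ...]) -> aggregated output value
    V3 reduce(K2 key, List<V2> values);
}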


MapReduce: Map Step

Input key-value pairs (a file name f and its content c) are fed to map calls, each of which emits intermediate key-value pairs (word and count):

(f1, c1), (f2, c2), (f3, c3), ... → map → (k1, v1), (k2, v2), (k3, v3), (k4, v4), ...
MapReduce: Reduce Step

The key-value pairs produced by the Map step are grouped by key into key-value groups, and reduce is applied to each group to produce the output key-value pairs:

k1: [v1, v3, v6] → reduce → (k1, v′)
k2: [v2, v5] → reduce → (k2, v′′)
k3: [v4] → reduce → (k3, v′′′)
Map-reduce: Word Count
Input (big document):
"deep learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural nets have been applied to fields including computer vision ..."

MAP (provided by the programmer): read the input and produce a set of (key, value) pairs, e.g. (deep, 1), (learning, 1), (architectures, 1), (such, 1), (as, 1), ..., (networks, 1), ...

Group by key (handled by the MR system): collect all pairs with the same key, e.g. (deep, 1) (deep, 1) ..., (networks, 1) (networks, 1) (networks, 1), (the, 1) (the, 1) (the, 1), (reinforcement, 1), (vision, 1), ...

REDUCE (provided by the programmer): collect all values belonging to each key and output the totals, e.g. (deep, 2), (networks, 3), (the, 3), (reinforcement, 1), (vision, 1), ...
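The same word-count logic expressed with the Hadoop MapReduce Java API; this is a sketch in the style of the standard Hadoop WordCount example (the job driver and input/output paths are omitted):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: for each input line, emit (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);           // intermediate pair (word, 1)
            }
        }
    }

    // Reduce: the framework groups pairs by word; sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);              // output pair (word, total count)
        }
    }
}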
Map-Reduce Execution

Detailed Look
Map-reduce System: Inside Look
Map tasks (map task 1, map task 2, map task 3, ...) run in parallel, each applying the map function M to its data chunks and emitting intermediate pairs such as (k1, v1), (k2, v2), (k3, v4), ... Each map task's output passes through a partition function that decides which reduce task receives each key. On the reduce side, SORT and GROUP collect the pairs by key, e.g. k1: [v1, v4, v5], k4: [v4], k2: [v2, v1, v2, v6], k3: [v4, v5, v7], and the reduce tasks (reduce task 1, reduce task 2) apply R to each group and write the final output.

All phases are distributed, with many tasks running in parallel.

Partitioning
Output from mappers (on multiple machines) is partitioned by key into output files: a partition function assigns each key to a bucket, H(key) = b with 0 ≤ b < m, so all pairs with the same key end up in the same one of the m partitions, numbered 0 to m-1.
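A sketch of hash partitioning in the style of Hadoop's default HashPartitioner: the bucket is H(key) mod numReduceTasks, so all pairs with the same key go to the same reduce task (the class name here is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // mask off the sign bit so the result is non-negative, then take mod m
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}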
Shuffle and Sort

IMPORTANT and EXPENSIVE OPERATION

• MapReduce guarantees that the input to every reducer is sorted by key

• The heart of MapReduce is the shuffle operation

• Shuffle does the following

1. performs the sort
2. transfers the map outputs to the reducers as inputs
Shuffle and Sort: the map side

Each map task reads its input data (chunk) and buffers its output in memory. When the buffer fills, the contents are partitioned, sorted, and spilled to disk; the spill files are then merged on local disk into sorted partitions, one per reducer.
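The size of the in-memory buffer and the spill threshold are configurable; a sketch using the Hadoop Configuration API (property names as in Hadoop 2.x/3.x; the values are examples, not recommendations):

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // size (MB) of the in-memory buffer that collects map output before spilling
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        // fraction of the buffer that may fill before a background spill to disk starts
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        return conf;
    }
}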
