BD 08 Map Reduce
Lars Schmidt-Thieme
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
Big Data Analytics
Syllabus
Tue. 9.4. (1) 0. Introduction
A. Parallel Computing
Tue. 16.4. (2) A.1 Threads
Tue. 23.4. (3) A.2 Message Passing Interface (MPI)
Tue. 30.4. (4) A.3 Graphics Processing Units (GPUs)
B. Distributed Storage
Tue. 7.5. (5) B.1 Distributed File Systems
Tue. 14.5. (6) B.2 Partitioning of Relational Databases
Tue. 21.5. (7) B.3 NoSQL Databases
C. Distributed Computing Environments
Tue. 28.5. (8) C.1 Map-Reduce
Tue. 4.6. (Pentecost break)
Tue. 11.6. (9) C.2 Resilient Distributed Datasets (Spark)
Tue. 18.6. (10) C.3 Computational Graphs (TensorFlow)
D. Distributed Machine Learning Algorithms
Tue. 25.6. (11) D.1 Distributed Stochastic Gradient Descent
Tue. 2.7. (12) D.2 Distributed Matrix Factorization
Tue. 9.7. (13) Questions and Answers
Outline
1. Introduction
2. Parallel Computing Speedup
3. Example: Counting Words
4. Map-Reduce
5. Map-Reduce Tutorial
1. Introduction
Technology Stack
Part D: Distributed Machine Learning Algorithms
Part C: Distributed Computing Environments
Part B: Distributed Storage
Part A: Parallel/Distributed Computing
[Figure: infrastructure diagram; only the label "Memory" survives]
Distributed Infrastructure
[Figure: infrastructure diagram; only the label "Network" survives]
2. Parallel Computing Speedup
$$s(T, p) = \frac{t(T, 1)}{t(T, p)}$$
$$e(T, p) = \frac{t(T, 1)}{p \cdot t(T, p)}$$
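As a quick numeric illustration of the speedup s(T, p) and the efficiency e(T, p), the following sketch evaluates both from run times. It reads t(T, p) as the run time of task T on p processors, which is an assumption because the slide's definition did not survive extraction. The timing values are made up for illustration.

```python
def speedup(t1: float, tp: float) -> float:
    """s(T, p) = t(T, 1) / t(T, p)."""
    return t1 / tp


def efficiency(t1: float, tp: float, p: int) -> float:
    """e(T, p) = t(T, 1) / (p * t(T, p)), i.e. the speedup per processor."""
    return t1 / (p * tp)


# Hypothetical timings: 120 s on one processor, 40 s on four processors.
t1, t4, p = 120.0, 40.0, 4
print(speedup(t1, t4))        # 3.0
print(efficiency(t1, t4, p))  # 0.75
```

An efficiency of 1 would mean that the p processors are used perfectly; here a quarter of the available processor time goes to overhead.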
Considerations
3. Example: Counting Words
- Given a corpus of documents $D := \{d_1, \ldots, d_n\}$,
- where each document is a sequence of words $(w_1, \ldots, w_m)$,
- the task is to generate word counts for each word in the corpus.
Each processor:
1. access a document $d \in D$
2. for each word $w$ in document $d$:
   2.1 lock($c_w$)
   2.2 $c_w \leftarrow c_w + 1$
   2.3 unlock($c_w$)
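The lock-based scheme above can be sketched with Python threads standing in for processors. This is a minimal illustration, not the lecture's implementation: the function name, the toy corpus, and the choice of one thread per document are assumptions.

```python
import threading


def count_words_with_locks(documents):
    """Shared-memory word count: one thread per document, one lock per counter."""
    vocabulary = {w for d in documents for w in d.split()}
    counts = {w: 0 for w in vocabulary}                # shared counters c_w
    locks = {w: threading.Lock() for w in vocabulary}  # one lock per c_w

    def process(doc):
        for w in doc.split():
            with locks[w]:        # lock(c_w) ... unlock(c_w)
                counts[w] += 1    # c_w <- c_w + 1

    threads = [threading.Thread(target=process, args=(d,)) for d in documents]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts


print(count_words_with_locks(["a rose is a rose", "is this love", "love is love"]))
```

Every single increment acquires a lock, so threads that hit the same frequent word wait for each other.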
Slave:
Local memory:
- subset of documents: $\pi(D, p) := \{d_{p \frac{n}{P}}, \ldots, d_{(p+1) \frac{n}{P} - 1}\}$
- address of the master: addr_master
- local word counts: $c \in \mathbb{R}^{|W|}$
1. $c \leftarrow \{0\}^{|W|}$
2. for each document $d \in \pi(D, p)$:
     for each word $w$ in document $d$:
       $c_w \leftarrow c_w + 1$
3. send message send(addr_master, c)
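Master:
Local memory:
1. global word counts: $c^{\text{global}} \in \mathbb{R}^{|W|}$
2. list of slaves: $S$
$c^{\text{global}} \leftarrow \{0\}^{|W|}$
$s \leftarrow \{0\}^{|S|}$
For each received message $(p, c^p)$:
1. $c^{\text{global}} \leftarrow c^{\text{global}} + c^p$
2. $s_p \leftarrow 1$
3. if $\|s\|_1 = |S|$: return $c^{\text{global}}$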
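The message-passing variant can be simulated in a single process. In this sketch, threads play the role of the worker processors (the slaves above) and a queue stands in for the master's address. All names are assumptions chosen for illustration, and a real deployment would use a message-passing library rather than threads.

```python
import queue
import threading
from collections import Counter


def partition(docs, p, P):
    """pi(D, p): the p-th of P contiguous blocks of the corpus."""
    n = len(docs)
    return docs[p * n // P:(p + 1) * n // P]


def worker(p, docs, inbox):
    c = Counter()                 # local word counts, initially all zero
    for d in docs:
        c.update(d.split())       # c_w <- c_w + 1 for each word w
    inbox.put((p, c))             # send(addr_master, c), tagged with the worker id


def master(docs, P):
    inbox = queue.Queue()         # stands in for addr_master
    threads = [
        threading.Thread(target=worker, args=(p, partition(docs, p, P), inbox))
        for p in range(P)
    ]
    for t in threads:
        t.start()
    c_global, reported = Counter(), set()
    while len(reported) < P:      # stop once every worker has reported
        p, c_p = inbox.get()
        c_global += c_p           # c_global <- c_global + c_p
        reported.add(p)
    for t in threads:
        t.join()
    return c_global


print(master(["a b a", "b c", "c c a"], P=2))
```

Compared with the lock-based version, each worker updates only its private counts and communicates once, at the end.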
- We need to manually assign the master and slave roles to each processor.
4. Map-Reduce
- Examples:
  Key        Value
  document   array of words
  document   word
  user       movies
  user       friends
  user       tweet
Map-Reduce / Idea
$$(w, (c_1, \ldots, c_K)) \mapsto \left(w, \sum_{k=1}^{K} c_k\right)$$
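To make the reduce step above concrete, here is a minimal single-process sketch of word counting in the map-reduce style. The driver function `map_reduce` and the helper names are assumptions for illustration and are not part of any framework. The three documents are the ones that appear in the lecture's word-count figure.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: (d, text) -> list of (w, 1), one pair per word occurrence."""
    return [(w, 1) for w in text.split()]


def reduce_fn(word, counts):
    """Reduce: (w, (c_1, ..., c_K)) -> (w, c_1 + ... + c_K)."""
    return (word, sum(counts))


def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k, v in records:
        for key, value in map_fn(k, v):
            groups[key].append(value)   # shuffle: group all values by key
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


corpus = [("d4", "I'm crying"), ("d5", "the deeper the love"), ("d6", "is this love")]
print(map_reduce(corpus, map_fn, reduce_fn))
```

The user supplies only `map_fn` and `reduce_fn`; grouping by key is done by the driver, which is the part a framework would distribute.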
[Figure: word-count data flow from mappers to reducers. Inputs include (d4, "I'm crying"), (d5, "the deeper the love") and (d6, "is this love"); the mappers emit pairs such as (crying, 1), (deeper, 1) and (love, 2); the reducers aggregate the pairs per word, e.g. (ain't, 2).]
Execution
- High-level abstraction:
  - No need to worry about how many processors are available.
  - No need to specify which ones will be mappers and which ones will be reducers.
Fault Tolerance
- map failure:
  - re-execute the map task,
  - preferably on another node.
- speculative execution:
  - execute two mappers on each data segment in parallel,
  - keep the results from the first one to finish,
  - kill the slower one once the other has completed.
- node failure:
  - re-execute completed and in-progress map tasks of that node (completed map output is stored on the node's local disk and is lost with it),
  - re-execute in-progress reduce tasks.
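The re-execution rule for failed map tasks can be illustrated with a toy model. This is only a sketch under assumptions: the node names, the failure model in `flaky_map`, and the function `run_map_with_retry` are hypothetical and do not reflect how a real scheduler is implemented.

```python
def run_map_with_retry(map_task, segment, nodes):
    """Run map_task on a data segment, moving to the next node after a failure."""
    for node in nodes:
        try:
            return node, map_task(node, segment)
        except RuntimeError:
            continue              # map failure: re-execute on another node
    raise RuntimeError("map task failed on every node")


def flaky_map(node, segment):
    # Hypothetical failure model: node "n0" is down, all other nodes work.
    if node == "n0":
        raise RuntimeError("node down")
    return [(w, 1) for w in segment.split()]


print(run_map_with_retry(flaky_map, "is this love", ["n0", "n1"]))
```

Re-execution is safe here because the map function is deterministic and has no side effects, so running it twice yields the same pairs.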
$$e_{MR}(T, p) = \frac{t(T, 1)}{p \cdot t(T, p)} = \frac{wD}{p \left( \frac{wD}{p} + 2K\sigma \frac{D}{p} \right)} = \frac{1}{1 + 2K \frac{\sigma}{w}}$$
$$e_{MR}(T, p) = \frac{1}{1 + 2K \frac{\sigma}{w}}$$
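The efficiency formula can be checked numerically. The sketch below takes the cost model from the derivation, t(T, 1) = wD and t(T, p) = wD/p + 2KσD/p. Reading w as a per-item computation cost, σ as a per-item communication cost, D as the data size and K as a count of communication rounds is an assumption, since the slide text defining these symbols did not survive. The numbers are made up.

```python
def mr_efficiency(w, sigma, K):
    """Closed form: e_MR = 1 / (1 + 2 K sigma / w)."""
    return 1.0 / (1.0 + 2.0 * K * sigma / w)


def mr_efficiency_from_times(w, sigma, K, D, p):
    """Same quantity computed from the run times t(T, 1) and t(T, p)."""
    t1 = w * D
    tp = w * D / p + 2.0 * K * sigma * D / p
    return t1 / (p * tp)


# Hypothetical costs: w = 4, sigma = 1, K = 1.
print(mr_efficiency(4.0, 1.0, 1))
for p in (2, 10, 100):
    print(p, mr_efficiency_from_times(4.0, 1.0, 1, D=1000, p=p))
```

Under this model the efficiency does not depend on p or D, only on the ratio σ/w and on K: the more computation per transferred item, the closer the efficiency gets to 1.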
Summary
- Map-Reduce is a distributed computing framework.
Further Readings
5. Map-Reduce Tutorial
Mappers
- extend base class Mapper<KI,VI,KO,VO> (org.apache.hadoop.mapreduce)
  - types: KI = input key, VI = input value, KO = output key, VO = output value.
- optionally, setup(Context ctxt): set up the mapper
  - called once before the first call to map
Serialization
To store the output of mappers, combiners and reducers in files, keys and values have to be serialized:
- Hadoop requires all keys of a step to be of one class and all values of a step to be of one class, so the class of an object to deserialize is known in advance.
Serialization (2/2)
- Writable wrappers for elementary data types (all in org.apache.hadoop.io):
  Writable class     Java type
  BooleanWritable    boolean
  ByteWritable       byte
  DoubleWritable     double
  FloatWritable      float
  IntWritable        int
  LongWritable       long
  ShortWritable      short
  Text               String
- These are not subclasses of the default wrappers Integer, Double etc. (as the latter are final).
Example 1 / Mapper
Job Configuration
Configuration (org.apache.hadoop.conf):
- default constructor: read the default configuration from the files
  - core-default.xml and
  - core-site.xml
  (to be found in the classpath).
- addResource(Path file): update the configuration from another configuration file.
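Job
Job (org.apache.hadoop.mapreduce):
- constructor Job(Configuration conf, String name): create a new job (deprecated in current Hadoop releases in favor of the factory method Job.getInstance(Configuration conf, String jobName)).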
FileInputFormat (org.apache.hadoop.mapreduce.lib.input):
- addInputPath(Job job, Path path): add an input path.
FileOutputFormat (org.apache.hadoop.mapreduce.lib.output):
- setOutputPath(Job job, Path path): set the output path.
1. Set paths:
     export PATH=$PATH:/home/lst/system/hadoop/bin
     export HADOOP_CLASSPATH=${JAVA_HOME}/lib/to…
2. Compile sources:
     hadoop com.sun.tools.javac.Main -sourcepath . MRJobStarter1.java
- input 2:
    Hello Hadoop Goodbye Hadoop
- output:
    lst@lst-uni:~> hdfs dfs -ls /ex1/output.ex2
    Found 2 items
    -rw-r--r-- 2 lst supergroup 0 2016-05-24 18:59 /ex1/output.ex2/_SUCCESS
    -rw-r--r-- 2 lst supergroup 66 2016-05-24 18:59 /ex1/output.ex2/part-r-00000
    lst@lst-uni:~> hdfs dfs -cat /ex1/output.ex2/part-r-00000
    Bye 1
    Goodbye 1
    Hadoop 1
    Hadoop 1
    Hello 1
    Hello 1
    World 1
    World 1
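The listing shows every word occurrence as a separate line with count 1, in key order. A plain Python simulation reproduces this under two assumptions that the surviving slides do not confirm: that the lost input 1 consists of the words Hello, World, Bye, World (inferred from the output), and that the job uses a mapper emitting (word, 1) with no summing reducer. The function names are hypothetical.

```python
def mapper(line):
    """Emit (word, 1) for every word of an input line."""
    return [(w, 1) for w in line.split()]


def run_job(lines):
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=lambda kv: kv[0])         # the shuffle delivers pairs sorted by key
    return [f"{k}\t{v}" for k, v in pairs]   # identity reduce: pairs pass through unchanged


# Input 1 is an assumption inferred from the output; input 2 is taken from the slide.
lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
for out_line in run_job(lines):
    print(out_line)
```

With tab-separated pairs and one newline per line, the simulated output is 66 bytes long, which matches the size of part-r-00000 in the listing.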
Reducers
- extend base class Reducer<KI,VI,KO,VO> (org.apache.hadoop.mapreduce)
  - types: KI = input key, VI = input value, KO = output key, VO = output value.
- optionally, setup(Context ctxt): set up the reducer
  - called once before the first call to reduce
Example 2 / Reducer
Example 2 / Output
Example 3 / Output
Predefined Mappers
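- Mapper (org.apache.hadoop.mapreduce): the identity mapper,
  $m(k, v) := ((k, v))$
- InverseMapper (org.apache.hadoop.mapreduce.lib.map): swaps keys and values,
  $m(k, v) := ((v, k))$
- ChainMapper (org.apache.hadoop.mapreduce.lib.chain): runs several mappers in sequence within a single map task.
- TokenCounterMapper (org.apache.hadoop.mapreduce.lib.map): tokenizes the input value and emits (token, 1) for each token.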
- RegexMapper (org.apache.hadoop.mapreduce.lib.map): extracts text matching a regular expression.
Predefined Reducers
- Reducer (org.apache.hadoop.mapreduce): the identity reducer,
  $r(k, V) := (k, V)$
- IntSumReducer, LongSumReducer (org.apache.hadoop.mapreduce.lib.reduce):
  $r(k, V) := \left(k, \sum_{v \in V} v\right)$
- ChainReducer (org.apache.hadoop.mapreduce.lib.chain)
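The difference between the identity reducer and the sum reducers can be illustrated with plain Python stand-ins. These functions only mimic the behavior described by the formulas; they are not the Hadoop classes, and their names are chosen for illustration.

```python
def identity_reducer(k, values):
    """Stand-in for the identity reducer: pass every (k, v) through unchanged."""
    return [(k, v) for v in values]


def sum_reducer(k, values):
    """Stand-in for IntSumReducer / LongSumReducer: emit (k, sum of all v in V)."""
    return [(k, sum(values))]


print(identity_reducer("Hello", [1, 1]))  # [('Hello', 1), ('Hello', 1)]
print(sum_reducer("Hello", [1, 1]))       # [('Hello', 2)]
```

For word counting, the sum reducer is the one that collapses the per-occurrence pairs into one total per word.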
Further Topics
- Streaming