
Big Data Analytics


C. Distributed Computing Environments / C.1 Map-Reduce

Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL)


Institute for Computer Science
University of Hildesheim, Germany


Syllabus
Tue. 9.4. (1) 0. Introduction
A. Parallel Computing
Tue. 16.4. (2) A.1 Threads
Tue. 23.4. (3) A.2 Message Passing Interface (MPI)
Tue. 30.4. (4) A.3 Graphical Processing Units (GPUs)
B. Distributed Storage
Tue. 7.5. (5) B.1 Distributed File Systems
Tue. 14.5. (6) B.2 Partitioning of Relational Databases
Tue. 21.5. (7) B.3 NoSQL Databases
C. Distributed Computing Environments
Tue. 28.5. (8) C.1 Map-Reduce
Tue. 4.6. — — Pentecost Break —
Tue. 11.6. (9) C.2 Resilient Distributed Datasets (Spark)
Tue. 18.6. (10) C.3 Computational Graphs (TensorFlow)
D. Distributed Machine Learning Algorithms
Tue. 25.6. (11) D.1 Distributed Stochastic Gradient Descent
Tue. 2.7. (12) D.2 Distributed Matrix Factorization
Tue. 9.7. (13) Questions and Answers

Outline

1. Introduction

2. Parallel Computing Speedup

3. Example: Counting Words

4. Map-Reduce

1. Introduction

Technology Stack
Part D: Distributed Machine Learning Algorithms
Part C: Distributed Execution Environments
Part B: Distributed Storage
Part A: Parallel/Distributed Computing

(each layer of the stack builds on the one below it)


Why do we need a Computational Model?

- Our data is nicely stored in a distributed infrastructure.
- We have a number of computers at our disposal.
- We want our analytics software to take advantage of all this computing power.
- When programming, we want to focus on understanding our data, not our infrastructure.


Shared Memory Infrastructure

(Figure: several processors attached to one shared memory.)


Distributed Infrastructure

(Figure: four nodes, each with its own processor and local memory, connected by a network.)

2. Parallel Computing Speedup

Parallel Computing / Speedup


- We have p processors available to execute a task T.
- Ideally: the more processors, the faster the task is executed.
- Reality: synchronisation and communication costs limit the gains.
- Speedup s(T, p) of a task T by using p processors:
  Let t(T, p) be the time needed to execute T using p processors.
  Then the speedup is given by:

      s(T, p) = t(T, 1) / t(T, p)
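
For illustration (numbers assumed, not from the lecture): if a task takes t(T, 1) = 120 s on one processor and t(T, 8) = 20 s on eight, the speedup is s(T, 8) = 120 / 20 = 6, short of the ideal speedup of 8.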


Parallel Computing / Efficiency

- We have p processors available to execute a task T.
- Efficiency e(T, p) of a task T by using p processors:

      e(T, p) = t(T, 1) / (p · t(T, p))
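
Continuing the illustrative numbers from above: e(T, 8) = 120 / (8 · 20) = 0.75, i.e., each processor does useful work only 75% of the time; the rest is lost to synchronisation and communication.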


Considerations

- It is not worth using many processors for solving small problems.
- Algorithms should increase their efficiency with growing problem size.

3. Example: Counting Words

Word Count Example

- Given a corpus of text documents

      D := {d_1, …, d_n},

  each containing a sequence of words

      (w_1, …, w_m)

  from a set W of possible words.

- The task is to generate word counts for each word in the corpus.


Paradigms — Shared Memory


- All processors have access to all counters.
- Counters can be overwritten.
- Processors need to lock counters before using them.

Shared vector of word counts: c ∈ N^|W|

c ← {0}^|W|

Each processor:
1. access a document d ∈ D
2. for each word w in document d:
   2.1 lock(c_w)
   2.2 c_w ← c_w + 1
   2.3 unlock(c_w)
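
A minimal Java sketch of this scheme (illustrative; class name and documents are not from the lecture). The per-word lock/unlock is realized here by ConcurrentHashMap.merge, which internally locks only the bin holding the affected counter:

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: each thread plays the role of one processor and updates
// the shared counter vector under a per-counter lock.
public class SharedMemoryWordCount {
  public static void main(String[] args) throws InterruptedException {
    List<String> documents = List.of(
        "love aint no stranger", "crying in the rain", "looking for love");
    ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();

    Thread[] workers = new Thread[documents.size()];
    for (int p = 0; p < workers.length; p++) {
      final String d = documents.get(p);          // document assigned to processor p
      workers[p] = new Thread(() -> {
        for (String w : d.split("\\s+"))          // for each word w in d
          counts.merge(w, 1, Integer::sum);       // lock(c_w); c_w <- c_w + 1; unlock(c_w)
      });
      workers[p].start();
    }
    for (Thread t : workers) t.join();
    System.out.println(counts);
  }
}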

Paradigms — Shared Memory

- inefficient due to waiting times for the locks
- the more processors, the less efficient
- in a distributed scenario even worse, due to the communication overhead for acquiring/releasing the locks


Paradigms — Message passing

- Each processor sees only one part of the data:

      π(D, p) := { d_{p·n/P}, …, d_{(p+1)·n/P − 1} }

  (P denotes the number of processors)

- Each processor works on its partition.
- Results are exchanged between processors (message passing).

For each processor p:
1. for each d ∈ π(D, p):
   1.1 process(d)
2. communicate results


Word Count — Message passing

We need to define two types of processes:

1. slave
   - counts the words on a subset of documents and informs the master
2. master
   - gathers counts from the slaves and sums them up


Word Count — Message passing

Slave:
Local memory:
- subset of documents: π(D, p) := { d_{p·n/P}, …, d_{(p+1)·n/P − 1} }
- address of the master: addr_master
- local word counts: c ∈ N^|W|

1. c ← {0}^|W|
2. for each document d ∈ π(D, p):
     for each word w in document d:
       c_w ← c_w + 1
3. send message send(addr_master, c)


Word Count — Message passing

Master:
Local memory:
- global word counts: c_global ∈ N^|W|
- list of slaves: S

c_global ← {0}^|W|
s ← {0}^|S|
For each received message (p, c_p):
1. c_global ← c_global + c_p
2. s_p ← 1
3. if ||s||_1 = |S|: return c_global
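
The same scheme sketched in Java, with threads as slave processes and a blocking queue standing in for the network (illustrative only; a real implementation would use MPI or sockets):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: slaves count words on their own partition and send their local
// counts to the master, which gathers and sums them.
public class MessagePassingWordCount {
  public static void main(String[] args) throws InterruptedException {
    List<List<String>> partitions = List.of(
        List.of("love aint no stranger", "crying in the rain"),
        List.of("looking for love", "is this love"));
    BlockingQueue<Map<String, Integer>> toMaster = new LinkedBlockingQueue<>();

    for (List<String> partition : partitions) {     // one slave per partition
      new Thread(() -> {
        Map<String, Integer> c = new HashMap<>();   // local word counts
        for (String d : partition)
          for (String w : d.split("\\s+"))
            c.merge(w, 1, Integer::sum);
        toMaster.add(c);                            // send(addr_master, c)
      }).start();
    }

    Map<String, Integer> global = new HashMap<>();  // master
    for (int received = 0; received < partitions.size(); received++) {
      Map<String, Integer> c = toMaster.take();     // gather one slave's counts
      c.forEach((w, n) -> global.merge(w, n, Integer::sum));
    }
    System.out.println(global);
  }
}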


Paradigms — Message passing

- We need to manually assign master and slave roles to the processors.
- The partitioning of the data needs to be done manually.
- Implementations like OpenMPI only provide services to exchange messages.

4. Map-Reduce

Map-Reduce

- distributed computing environment
  - introduced in 2004 by Google
  - open-source reference implementation: Hadoop (since 2006)
  - meanwhile supported by many distributed programming environments, e.g., document databases such as MongoDB
- builds on a job scheduler
  - for Hadoop: YARN
- assumes the data is partitioned over the nodes
  - for Hadoop: blocks of a file in a distributed filesystem
- high-level abstraction
  - the programmer only specifies a map and a reduce function


Key-Value input data

- Map-Reduce requires the data to be stored in a key-value format.
- Examples:

      Key        Value
      document   array of words
      document   word
      user       movies
      user       friends
      user       tweet


Map-Reduce / Idea

- Represent input and output data as key/value pairs.
- Break down computation into three phases:
  1. map phase
     - apply a function map to each input key/value pair
     - the result is also represented as key/value pairs
     - execute map on each data node for its data partition (data locality)
  2. shuffle phase
     - group all intermediate key/value pairs by key into key/valueset pairs
     - repartition the intermediate data by key
  3. reduce phase
     - apply a function reduce to each intermediate key/valueset pair
     - execute reduce on each node for its intermediate data partition

Note: The shuffle phase is also called sort or merge phase.
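
To make the dataflow concrete, here is a minimal single-process Java sketch of the three phases (illustrative; a real Map-Reduce system runs map and reduce distributed over many nodes):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the three phases on a toy word-count input.
public class MiniMapReduce {
  public static void main(String[] args) {
    Map<String, String> input = Map.of(
        "d1", "love aint no stranger",
        "d2", "crying in the rain");

    // 1. map phase: apply map to each input key/value pair
    List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
    input.forEach((doc, text) -> {
      for (String w : text.split("\\s+"))
        intermediate.add(Map.entry(w, 1));
    });

    // 2. shuffle phase: group intermediate pairs by key into key/valueset pairs
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> kv : intermediate)
      grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

    // 3. reduce phase: apply reduce to each key/valueset pair
    grouped.forEach((w, counts) ->
        System.out.println(w + "\t" + counts.stream().mapToInt(Integer::intValue).sum()));
  }
}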


Map-Reduce

(Figure: schematic of the map, shuffle, and reduce phases.)

The Paradigm - Formally

Let I be a set called input keys,
    O be a set called output keys,
    X be a space called input values,
    V be a space called intermediate values,
    Y be a space called output values.

A function

    m : I × X → (O × V)*

is called a map, and a function

    r : O × V* → O × Y

a reducer.

Note: As always, * denotes sequences.



Map-Reduce Driver Algorithm


map-reduce(m, r, p, (D_w)_{w∈W}):
  in parallel on workers w ∈ W:
    E := ∅
    for (i, x) ∈ D_w:
      E := E ∪ m(i, x)
    E′ := dict(default = ∅)
    for (o, v) ∈ E:
      E′[o] := E′[o] ∪ {v}
    synchronize all workers
    for (o, vs) ∈ E′:
      send (o, vs) to worker p(o)
    F := dict(default = ∅)
    for all (o, vs) received:
      F[o] := F[o] ∪ vs
    synchronize all workers
    G_w := ∅
    for all (o, vs) ∈ F:
      G_w := G_w ∪ r(o, vs)

where
- m is a mapper,
- r is a reducer,
- p : O → W is a partition function for the output keys,
- (D_w)_{w∈W} is a dataset partition, D_w stored on worker w.

Result: dataset G, stored distributed over the workers W.

Word Count Example


Map:
- Input: document-word list pairs
- Output: word-count pairs

      (d_n, (w_1, …, w_M)) ↦ ((w, c_w))_{w ∈ W : c_w > 0}

Reduce:
- Input: word-(count list) pairs
- Output: word-count pairs

      (w, (c_1, …, c_K)) ↦ (w, Σ_{k=1}^K c_k)


Word Count Example


Input documents:

    (d1, "love ain't no stranger")    (d2, "crying in the rain")
    (d3, "looking for love")          (d4, "I'm crying")
    (d5, "the deeper the love")       (d6, "is this love")
    (d7, "Ain't no love")

The mappers emit pairs such as (love, 1), (stranger, 1), (crying, 1), …;
the shuffle groups the pairs by word, and the reducers sum the counts:

    (love, 5)   (crying, 2)   (ain't, 2)   (stranger, 1)
    (rain, 1)   (looking, 1)  (deeper, 1)  (this, 1)

Hadoop Example / Map


public static class Map
    extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter)
      throws IOException {

    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);

    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}


Hadoop Example / Reduce

public static class Reduce
    extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter)
      throws IOException {

    int sum = 0;
    while (values.hasNext())
      sum += values.next().get();

    output.collect(key, new IntWritable(sum));
  }
}


Hadoop Example / Main


public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}


Execution

- mappers are executed in parallel
- reducers are executed in parallel
  - started after all mappers have completed
- high-level abstraction:
  - no need to worry about how many processors are available
  - no need to specify which ones will be mappers and which ones reducers


Fault Tolerance
- map failure
  - re-execute map, preferably on another node
  - speculative execution:
    - execute two mappers on each data segment in parallel
    - keep the results of the first one to finish
    - kill the slower one once the other has completed
- node failure
  - re-execute completed and in-progress map tasks
  - re-execute in-progress reduce tasks
- particular key-value pairs that cause mappers to crash
  - skip just the problematic pairs


Parallel Efficiency of Map-Reduce


- We have p processors for performing map and reduce operations.
- Time to perform a task T on data D on a single processor: t(T, 1) = wD
- Time for producing the intermediate data after the map phase: t(T_inter, 1) = σD
- Overheads:
  - intermediate data per mapper: σD/p
  - each of the p reducers needs to read one p-th of the data written by each of the p mappers:

        (σD/p) · (1/p) · p = σD/p

- Time for performing the task with Map-Reduce:

      t_MR(T, p) = wD/p + 2K · σD/p

Note: K represents the overhead of IO operations (reading and writing data to disk).

Parallel Efficiency of Map-Reduce


- Time for performing the task on one processor: wD
- Time for performing the task with p processors on Map-Reduce:

      t_MR(T, p) = wD/p + 2K · σD/p

- Efficiency of Map-Reduce:

      e_MR(T, p) = t(T, 1) / (p · t_MR(T, p))
                 = wD / (p · (wD/p + 2K · σD/p))
                 = 1 / (1 + 2K · σ/w)
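
For illustration (values assumed): with IO overhead K = 1 and σ/w = 1/4, one gets e_MR = 1 / (1 + 2 · 1/4) = 2/3, independent of the number of processors p.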


Parallel Efficiency of Map-Reduce

      e_MR(T, p) = 1 / (1 + 2K · σ/w)

- Apparently the efficiency is independent of p.
- High speedups can be achieved with a large number of processors.
- If σ is high (too much intermediate data), the efficiency deteriorates.
- In many cases, σ depends on p.


Summary
- Map-Reduce is a distributed computing framework.
- Map-Reduce represents input and output data as key/value pairs.
- Map-Reduce decomposes computation into three phases:
  1. map: applying a function to each input key/value pair.
  2. shuffle:
     - grouping intermediate data into key/valueset pairs
     - repartitioning intermediate data by key over nodes
  3. reduce: applying a function to each intermediate key/valueset pair.
- Mappers are executed data-local.
- For a program, only the map and reduce functions have to be specified; the shuffle step is fixed.
- The size of the intermediate data is crucial for efficiency, as it has to be repartitioned.

Further Readings

- original Map-Reduce framework by Google:
  - Dean and Ghemawat [2004, 2008, 2010]
- MapReduce reference implementation in Hadoop:
  - White [2015, ch. 2, 7]; also ch. 6, 8, and 9
- MapReduce in a document database, MongoDB:
  - Chodorow [2013, ch. 7]

5. Map-Reduce Tutorial

Mappers
- extend the base class Mapper<KI,VI,KO,VO> (org.apache.hadoop.mapreduce)
  - types: KI = input key, VI = input value, KO = output key, VO = output value
- overwrite map(KI key, VI value, Context ctxt)
  - Context is an inner class of Mapper<KI,VI,KO,VO>
  - write(KO,VO): write the next output pair
- optionally, overwrite
  - setup(Context ctxt): set up the mapper; called once before the first call to map
  - cleanup(Context ctxt): clean up the mapper; called once after the last call to map
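
A minimal sketch of a mapper using these hooks (hypothetical example, not from the lecture): setup builds a stop-word set once per mapper, map filters against it:

import java.io.IOException;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper illustrating setup/map/cleanup.
public class StopWordFilterMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private Text word = new Text();
  private Set<String> stopWords;

  @Override
  public void setup(Context ctxt) {            // called once, before the first map()
    stopWords = Set.of("the", "in", "for", "no");
  }

  @Override
  public void map(Object key, Text value, Context ctxt)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+"))
      if (!stopWords.contains(token)) {
        word.set(token);
        ctxt.write(word, ONE);                 // write the next output pair
      }
  }

  @Override
  public void cleanup(Context ctxt) {          // called once, after the last map()
    stopWords = null;                          // nothing real to release in this sketch
  }
}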

Key and Value Types

Requirements for key and value types:
- serializable — Writable (org.apache.hadoop.io)
  - to store the output of mappers, combiners and reducers in files

Additional requirements for key types:
- comparable — Comparable (java.lang)
  - to sort the output of mappers and combiners by key
- stable hashcode across different JVM instances
  - to partition the output of combiners by key
  - the default implementation in Object is not stable!

Both requirements are pooled in the interface WritableComparable (org.apache.hadoop.io).


Serialization
To store the output of mappers, combiners and reducers in files, keys and values have to be serialized:
- Hadoop requires all keys and values of a step to be of the same class, so the class of an object to deserialize is known in advance.
- Standard Java serialization (java.lang.Serializable) serializes class information; it is thus more verbose and complex and therefore not used in Hadoop.
- Instead, Hadoop uses the interface Writable (org.apache.hadoop.io):
  - void write(DataOutput out) throws IOException: write the object to a data output.
  - void readFields(DataInput in) throws IOException: set the member variables (fields) of an object to values read from a data input.
  - with the standard DataInput and DataOutput (java.io)
Serialization (2/2)
- Writable wrappers for the elementary data types (all in org.apache.hadoop.io):

      BooleanWritable   boolean
      ByteWritable      byte
      DoubleWritable    double
      FloatWritable     float
      IntWritable       int
      LongWritable      long
      ShortWritable     short
      Text              String

- These are not subclasses of the default wrappers Integer, Double, etc. (as the latter are final).
- If one needs to pass more complex objects between steps, implement a custom Writable (in terms of these elementary Writables), as sketched below.
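
A minimal sketch of such a custom Writable (hypothetical type, not from the lecture), composed of two elementary Writables:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical composite value: a word together with a count.
public class WordCountWritable implements Writable {
  private final Text word = new Text();
  private final IntWritable count = new IntWritable();

  public void set(String w, int c) { word.set(w); count.set(c); }

  @Override
  public void write(DataOutput out) throws IOException {
    word.write(out);       // delegate to the elementary Writables ...
    count.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    word.readFields(in);   // ... and read the fields back in the same order
    count.readFields(in);
  }
}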

Example 1 / Mapper (in principle)

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapperS
    extends Mapper<Object, Text, Text, IntWritable> {

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens())
      context.write(new Text(itr.nextToken()), new IntWritable(1));
  }
}


Example 1 / Mapper

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}


Job Configuration
Configuration (org.apache.hadoop.conf):
- default constructor: reads the default configuration from the files
  - core-default.xml and
  - core-site.xml
  (to be found on the classpath)
- addResource(Path file): update the configuration from another configuration file
- String get(String name): get the value of a configuration option
- set(String name, String value): set the value of a configuration option
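
A short usage sketch (the extra file name is illustrative; mapreduce.job.reduces is a standard Hadoop option):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Sketch: read the defaults, overlay an extra file, set and get an option.
public class ConfigurationDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();           // core-default.xml + core-site.xml
    conf.addResource(new Path("my-cluster-site.xml"));  // hypothetical additional file
    conf.set("mapreduce.job.reduces", "4");             // override programmatically
    System.out.println(conf.get("mapreduce.job.reduces"));
  }
}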


Job

Job (org.apache.hadoop.mapreduce):
- Job.getInstance(Configuration conf, String name): create a new job (the factory method used in the examples below)
- setMapperClass(Class cls), setCombinerClass(Class cls), setReducerClass(Class cls): set the classes for mappers, combiners and reducers
- setOutputKeyClass(Class cls), setOutputValueClass(Class cls): set the classes for output keys and values
- boolean waitForCompletion(boolean verbose): submit the job and wait until it completes


Input and Output Paths

FileInputFormat (org.apache.hadoop.mapreduce.lib.input):
- addInputPath(Job job, Path path): add an input path

FileOutputFormat (org.apache.hadoop.mapreduce.lib.output):
- setOutputPath(Job job, Path path): set the output path


Example 1 / Job Runner (Mapper Only)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MRJobStarter1 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setMapperClass(TokenizerMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compiling and Running Map Reduce Classes

1. Set the paths:

       export PATH=$PATH:/home/lst/system/hadoop/bin
       export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

2. Compile the sources:

       hadoop com.sun.tools.javac.Main -sourcepath . MRJobStarter1.java

3. Package all class files of the job into a jar:

       jar cf job.jar MRJobStarter1.class TokenizerMapper.class

4. Run the jar:

       hadoop jar job.jar MRJobStarter1 /ex1/input /ex1/output

- the output directory must not yet exist.


Example 1 / Inputs and Output


- input 1:

      Hello World Bye World

- input 2:

      Hello Hadoop Goodbye Hadoop

- output:

      lst@lst-uni:~> hdfs dfs -ls /ex1/output.ex2
      Found 2 items
      -rw-r--r--   2 lst supergroup   0 2016-05-24 18:59 /ex1/output.ex2/_SUCCESS
      -rw-r--r--   2 lst supergroup  66 2016-05-24 18:59 /ex1/output.ex2/part-r-00000
      lst@lst-uni:~> hdfs dfs -cat /ex1/output.ex2/part-r-00000
      Bye 1
      Goodbye 1
      Hadoop 1
      Hadoop 1
      Hello 1
      Hello 1
      World 1
      World 1

Reducers
- extend the base class Reducer<KI,VI,KO,VO> (org.apache.hadoop.mapreduce)
  - types: KI = input key, VI = input value, KO = output key, VO = output value
- overwrite reduce(KI key, Iterable<VI> values, Context ctxt)
  - Context is an inner class of Reducer<KI,VI,KO,VO>
  - write(KO,VO): write the next output pair
- optionally, overwrite
  - setup(Context ctxt): set up the reducer; called once before the first call to reduce
  - cleanup(Context ctxt): clean up the reducer; called once after the last call to reduce

Example 2 / Reducer

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values)
      sum += val.get();
    result.set(sum);
    context.write(key, result);
  }
}


Example 2 / Job Runner


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MRJobStarter2 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}


Example 2 / Output

lst@lst-uni:~> hdfs dfs -cat /ex1/output.2/part*
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2


Example 3 / Job Runner: Only Mapper and Combiner


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MRJobStarter3 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}


Example 3 / Output

lst@lst-uni:~> hdfs dfs -cat /ex1/output.3/part*
Bye 1
Goodbye 1
Hadoop 2
Hello 1
Hello 1
World 2


Example 4 / Job Runner: Combiner and Reducer


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MRJobStarter4 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The output is the same as with only a reducer, but less intermediate data is moved.

Predefined Mappers
- Mapper (org.apache.hadoop.mapreduce):

      m(k, v) := ((k, v))

- InverseMapper (org.apache.hadoop.mapreduce.lib.map):

      m(k, v) := ((v, k))

- ChainMapper (org.apache.hadoop.mapreduce.lib.chain):

      chain_{m,ℓ}(k, v) := (ℓ(k′, v′) | (k′, v′) ∈ m(k, v))

- TokenCounterMapper (org.apache.hadoop.mapreduce.lib.map):

      m(k, v) := ((k′, 1) | k′ ∈ tokenize(v))
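
With these, the hand-written TokenizerMapper is not needed for word count; a sketch of the changed job setup (cf. MRJobStarter2):

// TokenCounterMapper already emits (token, 1) pairs:
job.setMapperClass(org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper.class);
job.setReducerClass(IntSumReducer.class);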


Predefined Mappers (2/2)

- RegexMapper (org.apache.hadoop.mapreduce.lib.map):

      m(k, v) := ((k′, 1) | k′ ∈ find-regex(v))

  The regex pattern to search for and the groups to report can be set via the configuration options:

      PATTERN   mapreduce.mapper.regex
      GROUP     mapreduce.mapper.regexmapper..group
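
A configuration sketch (the pattern is illustrative): count all four-digit numbers, e.g., years. Note that RegexMapper emits LongWritable values, so a LongSumReducer and a matching output value class are assumed here:

// Set the pattern on the Configuration before creating the job:
conf.set("mapreduce.mapper.regex", "[0-9]{4}");
job.setMapperClass(org.apache.hadoop.mapreduce.lib.map.RegexMapper.class);
job.setReducerClass(org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer.class);
job.setOutputValueClass(org.apache.hadoop.io.LongWritable.class);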


Predefined Reducers

- Reducer (org.apache.hadoop.mapreduce):

      r(k, V) := (k, V)

- IntSumReducer, LongSumReducer (org.apache.hadoop.mapreduce.lib.reduce):

      r(k, V) := (k, Σ_{v ∈ V} v)

- ChainReducer (org.apache.hadoop.mapreduce.lib.chain):

      chain_{r,s}(k, V) := s(r(k, V))


Job Default Values

mapper class:        org.apache.hadoop.mapreduce.Mapper
combiner class:      null
reducer class:       org.apache.hadoop.mapreduce.Reducer
output key class:    org.apache.hadoop.io.LongWritable
output value class:  org.apache.hadoop.io.Text


Further Topics

- Controlling map-reduce jobs
  - YARN
  - chaining map-reduce jobs (see the sketch below)
  - iterative algorithms
- Controlling the number of mappers and reducers
- Managing resources required by all mappers or reducers
- Input and output using relational databases
- Streaming
- Examples, examples, examples
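
For chaining, a common pattern is to run one job after another in the same driver, with the second job reading the first job's output directory (a sketch; paths and job contents are illustrative):

// Sketch: job2 consumes what job1 wrote.
Path tmp = new Path("/ex1/tmp");
Job job1 = Job.getInstance(conf, "step 1");
// ... configure mapper/reducer and input path of job1 ...
FileOutputFormat.setOutputPath(job1, tmp);
if (!job1.waitForCompletion(true)) System.exit(1);

Job job2 = Job.getInstance(conf, "step 2");
// ... configure mapper/reducer of job2 ...
FileInputFormat.addInputPath(job2, tmp);            // output of job1
FileOutputFormat.setOutputPath(job2, new Path("/ex1/output"));
System.exit(job2.waitForCompletion(true) ? 0 : 1);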


References
Kristina Chodorow. MongoDB: The Definitive Guide. O'Reilly, Beijing, 2nd edition, May 2013. ISBN 978-1-4493-4468-9.
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th Symposium on Operating Systems Design & Implementation, volume 6, 2004.
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, January 2008. ISSN 0001-0782. doi: 10.1145/1327452.1327492.
Jeffrey Dean and Sanjay Ghemawat. MapReduce: A flexible data processing tool. Communications of the ACM, 53(1):72-77, 2010.
Tom White. Hadoop: The Definitive Guide. O'Reilly, 4th edition, 2015.
