Hadoop Map Reduce Concepts - Teaching - 1
Hadoop MapReduce
• Hadoop MapReduce is a
– Software framework
– For easily writing applications
– Which process vast amounts of data (multi-terabyte data-sets)
– In parallel on large clusters (thousands of nodes) of commodity hardware
– In a reliable, fault-tolerant manner.
• A MapReduce job usually splits the input data-set into independent chunks
which are processed by the map tasks in a completely parallel manner.
• The framework sorts the outputs of the maps, which are then input to
the reduce tasks.
• Typically both the input and the output of the job are stored in a file-system.
• The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
JobTracker and TaskTracker
• The MapReduce framework consists of a
– single master JobTracker and
– one slave TaskTracker per cluster-node.
• The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks.
• The slaves execute the tasks as directed by the master.
Job Specification
• Minimally, applications specify the input/output locations and
supply map and reduce functions via implementations of appropriate
interfaces and/or abstract-classes.
• These, and other job parameters, comprise the job configuration.
• The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client (a minimal driver sketch follows this list).
• Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.
– Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
– Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non-JNI™ based).
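• A minimal Java driver sketch of such a job configuration (the class names WordCountDriver, TokenizerMapper and SumReducer, and the argument paths, are illustrative, not taken from the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Supplies the items the slide lists: input/output locations plus map and reduce
// implementations; the job client then submits this configuration to the cluster,
// which distributes the software, schedules the tasks and monitors them.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);   // hypothetical Mapper implementation
    job.setCombinerClass(SumReducer.class);      // the combiner reuses the reducer logic
    job.setReducerClass(SumReducer.class);       // hypothetical Reducer implementation
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input location
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location
    System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait
  }
}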
Input Output
• The MapReduce framework operates exclusively on <key,
value> pairs.
• That is, the framework views the input to the job as a set
of <key, value> pairs and produces a set of <key, value> pairs as
the output of the job, conceivably of different types.
• The key and value classes have to be serializable by the
framework and hence need to implement
the Writable interface.
• Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework (a minimal custom key sketch follows this list).
• Input and Output types of a MapReduce job:
– (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
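• As an illustration of the Writable / WritableComparable requirements above, a sketch of a custom composite key (the name TermDocKey is illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// write()/readFields() satisfy the serialization (Writable) requirement;
// compareTo() is what lets the framework sort keys between map and reduce.
public class TermDocKey implements WritableComparable<TermDocKey> {
  private String term = "";
  private String docId = "";

  public TermDocKey() {}   // Hadoop needs a no-arg constructor for deserialization

  public TermDocKey(String term, String docId) {
    this.term = term;
    this.docId = docId;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(term);
    out.writeUTF(docId);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    term = in.readUTF();
    docId = in.readUTF();
  }

  @Override
  public int compareTo(TermDocKey other) {
    int c = term.compareTo(other.term);
    return c != 0 ? c : docId.compareTo(other.docId);
  }
}

• Note: a real key class should also override hashCode() and equals() so that partitioning and grouping stay consistent with compareTo().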
Hadoop Map/Reduce Architecture
• Client: submits MR jobs (UI for submitting jobs); polls the JobTracker for status information
• JobTracker (Master): accepts MR jobs; assigns tasks to the slaves; monitors tasks; handles failures
• TaskTracker (Slaves): run Map and Reduce tasks as directed; manage intermediate output
• Task: runs the Map and Reduce functions; reports progress
Hadoop HDFS + MR cluster - putting them together
• Locality optimizations
– With large data, bandwidth to data is a problem
– Map tasks are scheduled close to the inputs when possible
• Automatic re-execution on failure
• HTTP monitoring UI
• [Diagram: the client submits a job to the JobTracker, which assigns tasks to cluster nodes that each run a TaskTracker (T) alongside an HDFS DataNode (D)]
• Streaming
– http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
• Pipes
– http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html
• Pig
– http://hadoop.apache.org/pig/
• Hive
– http://hive.apache.org
Example: Compute TF-IDF using Map/Reduce
• TF-IDF (Term Frequency, Inverse Document Frequency) is a basic technique for weighting how important a term is to a document in a collection.
class Combiner
  method Combine(term t, [c1, c2,...])
    sum = 0
    for all count c in [c1, c2,...] do
      sum = sum + c
    Emit(term t, count sum)
class Reducer
  method Reduce(term t, counts [c1, c2,...])
    sum = 0
    for all count c in [c1, c2,...] do
      sum = sum + c
    Emit(term t, count sum)
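• The slide shows only the Combiner and Reducer of the counting step (the term-frequency part of TF-IDF); the Mapper that feeds them tokenizes each document and emits (term, 1). A minimal Java sketch (the class name TermCountMapper is illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (term, 1) for every token of the input; the Combiner and Reducer
// shown above then sum these counts per term.
public class TermCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text term = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      term.set(tokens.nextToken().toLowerCase());
      ctx.write(term, ONE);
    }
  }
}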
Collating
• Problem Statement:
– There is a set of items and some function f of one item.
– It is required to save all items that have the same value of f into one file, or to perform some other computation that requires all such items to be processed as a group.
– The most typical example is the building of inverted indexes.
class Mapper
  method Map(docid id, doc d)
    for all term t in doc d do
      Emit(term f(t), t)
class Reducer
  method Reduce(term f(t), [t1, t2,...])
    // process or save the whole group of items that share f(t)
    Emit(term f(t), items [t1, t2,...])
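• For the inverted-index example, a minimal Java sketch might look as follows (it assumes an input format such as KeyValueTextInputFormat that delivers the document id as the map input key; class names are illustrative):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {
  // Here f(t) is simply the term itself: emit (term, docId) for every term of the document.
  public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text docId, Text doc, Context ctx) throws IOException, InterruptedException {
      for (String term : doc.toString().split("\\s+")) {
        ctx.write(new Text(term), docId);
      }
    }
  }

  // Collect all document ids that share the same term into one posting list.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docIds, Context ctx) throws IOException, InterruptedException {
      Set<String> postings = new HashSet<>();
      for (Text id : docIds) {
        postings.add(id.toString());
      }
      ctx.write(term, new Text(String.join(",", postings)));
    }
  }
}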
Filtering (“Grepping”), Parsing, and Validation
• Problem Statement:
– There is a set of records and it is required to collect all
records that meet some condition or transform each record
(independently from other records) into another
representation.
– The latter case includes such tasks as text parsing and value extraction, or conversion from one format to another.
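• Filtering of this kind is often run as a map-only job (job.setNumReduceTasks(0)); a minimal Java sketch of the grepping case, with an illustrative configuration parameter for the pattern:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each record is checked independently of all others; records that satisfy
// the predicate are emitted unchanged.
public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private String pattern;

  @Override
  protected void setup(Context ctx) {
    pattern = ctx.getConfiguration().get("grep.pattern", "ERROR");  // "grep.pattern" is an assumed parameter name
  }

  @Override
  protected void map(LongWritable offset, Text record, Context ctx)
      throws IOException, InterruptedException {
    if (record.toString().contains(pattern)) {
      ctx.write(record, NullWritable.get());
    }
  }
}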
• Example: a simulation workload split across Mappers.
– There is a software simulator of a digital communication system (e.g. WiMAX) that passes some volume of random data through the system model and computes the error probability of the throughput.
– Each Mapper runs the simulation for a specified amount of data (1/Nth of the required sampling) and emits its error rate; a Reducer can then aggregate (e.g. average) the per-Mapper error rates.
Iterative Message Passing (Graph Processing)
• Each node of a network sends messages to its neighbors, and each neighbor updates its state on the basis of the received messages.
• From the technical point of view, the Mapper emits messages for each node, using the ID of the adjacent node as a key.
• As a result, all messages are grouped by the receiving node, and the Reducer is able to re-compute the state and rewrite the node with its new state.
Solution:
class Mapper
  method Map(id n, object N)
    Emit(id n, object N)
    for all id m in N.OutgoingRelations do
      Emit(id m, message getMessage(N))
class Reducer
  method Reduce(id m, [s1, s2,...])
    object M = null
    messages = []
    for all s in [s1, s2,...] do
      if IsObject(s) then
        M = s
      else // s is a message
        messages.add(s)
    M.State = calculateState(messages)
    Emit(id m, object M)
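• One MapReduce job corresponds to one round of message passing, so the pattern is iterated (the output of round i becomes the input of round i + 1) until the states stop changing or a round limit is reached. A minimal driver-loop sketch in Java (paths, job names, and the MessagePassing* classes are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Runs one MapReduce job per message-passing round.
public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    String base = args[0];
    int maxRounds = Integer.parseInt(args[1]);
    for (int round = 0; round < maxRounds; round++) {
      Job job = Job.getInstance(new Configuration(), "message-passing-" + round);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(MessagePassingMapper.class);    // hypothetical: emits each node plus its outgoing messages
      job.setReducerClass(MessagePassingReducer.class);  // hypothetical: recomputes each node's state
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, new Path(base + "/round-" + round));
      FileOutputFormat.setOutputPath(job, new Path(base + "/round-" + (round + 1)));
      if (!job.waitForCompletion(true)) {
        System.exit(1);   // a convergence check could also break out of the loop early
      }
    }
  }
}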
Case Study: Availability Propagation Through The Tree of Categories
• Problem Statement:
– This problem is inspired by a real-life eCommerce task.
– There is a tree of categories that branches out from large categories (like Men, Women, Kids) to smaller ones (like Men Jeans or Women Dresses), and eventually to small end-of-line categories (like Men Blue Jeans).
– An end-of-line category is either available (contains products) or not.
– A higher-level category is available if there is at least one available end-of-line category in its subtree.
– The goal is to calculate availabilities for all categories when the availabilities of the end-of-line categories are known.
Case Study: Availability Propagation Through The Tree of Categories
class N
  State in {True = 2, False = 1, null = 0} // initialized to 1 or 2 for end-of-line categories, 0 otherwise
method getMessage(object N)
  return N.State
method calculateState(messages [m1, m2,...])
  return max([m1, m2,...]) // a category is available (2) if at least one child subtree reports available
// Variant of getMessage for path-length style problems (e.g. breadth-first search), where each hop adds 1:
method getMessage(N)
  return N.State + 1
Distinct Values (Unique Items Counting)
• Problem Statement:
– There is a set of records; each record contains a value f and a list of categories [g1, g2,...].
– Count the number of unique values of f that occur in the records of each category g.
– Sample result (number of distinct F values per category):
a -> 3 // F=1, F=2, F=3
b -> 2 // F=1, F=3
d -> 1 // F=2
e -> 1 // F=2
Solution:
PHASE-1
class Mapper
  method Map(null, record [value f, categories [g1, g2,...]])
    for all category g in [g1, g2,...]
      Emit(record [g, f], count 1)
class Reducer
  method Reduce(record [g, f], counts [n1, n2, ...])
    Emit(record [g, f], null) // one record per distinct [g, f] pair
PHASE-2
class Mapper
  method Map(record [g, f], null) // consumes the Phase-1 output
    Emit(value g, count 1)
class Reducer
  method Reduce(value g, counts [n1, n2,...])
    Emit(value g, sum( [n1, n2,...] ) )
Cross-Correlation
• Problem Statement:
– There is a set of tuples of items.
– For each possible pair of items, calculate the number of tuples in which these items co-occur.
– If the total number of items is N then N*N values should be
reported.
– This problem appears in:
• text analysis (say, items are words and tuples are sentences),
• market analysis (customers who buy this tend to also buy that).
– If N*N is small, the matrix can fit in the memory of a single machine.
Market Basket Analysis
• Suppose a store sells N different products and has data on B customer purchases (each purchase is called a “market basket”).
• The “pairs” approach: the Mapper emits each co-occurring pair of items with count 1, and the Reducer sums the counts for every pair:
class Mapper
  method Map(null, basket = [i1, i2,...] )
    for all pair [i, j] of items in basket
      Emit(pair [i j], count 1)
class Reducer
  method Reduce(pair [i j], counts [c1, c2,...])
    s = sum([c1, c2,...])
    Emit(pair [i j], count s)
Cross-Correlation
Solution: the “stripes” approach. For every item, the Mapper builds an associative array (a “stripe”) of co-occurrence counts, and the Reducer merge-sums the stripes:
class Mapper
  method Map(null, basket = [i1, i2,...] )
    for all item i in basket
      H = new AssociativeArray : item -> counter
      for all item j in basket
        H{j} = H{j} + 1
      Emit(item i, stripe H)
class Reducer
  method Reduce(item i, stripes [H1, H2,...])
    H = new AssociativeArray : item -> counter
    H = merge-sum( [H1, H2,...] )
    for all item j in H.keys()
      Emit(pair [i j], H{j})
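• A minimal Java sketch of the stripes solution above, using Hadoop's MapWritable as the associative array (one whitespace-separated basket per input line is assumed; class names are illustrative):

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CrossCorrelationStripes {
  // For each item i of a basket, emit a stripe: co-occurring item j -> count within this basket.
  public static class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    protected void map(LongWritable key, Text basket, Context ctx) throws IOException, InterruptedException {
      String[] items = basket.toString().split("\\s+");
      for (String i : items) {
        MapWritable stripe = new MapWritable();
        for (String j : items) {
          if (j.equals(i)) continue;   // skip the item itself (drop this line to match the pseudocode exactly)
          Text tj = new Text(j);
          IntWritable c = (IntWritable) stripe.get(tj);
          stripe.put(tj, new IntWritable(c == null ? 1 : c.get() + 1));
        }
        ctx.write(new Text(i), stripe);
      }
    }
  }

  // Merge-sum all stripes of an item and emit one (pair, count) record per co-occurring item.
  public static class StripesReducer extends Reducer<Text, MapWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text item, Iterable<MapWritable> stripes, Context ctx) throws IOException, InterruptedException {
      MapWritable merged = new MapWritable();
      for (MapWritable stripe : stripes) {
        for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
          IntWritable sum = (IntWritable) merged.get(e.getKey());
          int add = ((IntWritable) e.getValue()).get();
          merged.put(new Text(e.getKey().toString()), new IntWritable(sum == null ? add : sum.get() + add));
        }
      }
      for (Map.Entry<Writable, Writable> e : merged.entrySet()) {
        ctx.write(new Text(item.toString() + " " + e.getKey()), (IntWritable) e.getValue());
      }
    }
  }
}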
Selection
Solution:
class Mapper
method Map(rowkey key, tuple t)
if t satisfies the predicate
Emit(tuple t, null)
Projection
Solution:
class Mapper
method Map(rowkey key, tuple t)
tuple g = project(t) // extract required fields to tuple g
Emit(tuple g, null)
class Reducer
method Reduce(tuple t, array n) // n is an array of nulls
Emit(tuple t, null)
Union
Solution:
class Mapper
method Map(rowkey key, tuple t)
Emit(tuple t, null)
class Reducer
method Reduce(tuple t, array n) // n is an array of one or two nulls
Emit(tuple t, null)
Intersection
Solution:
class Mapper
method Map(rowkey key, tuple t)
Emit(tuple t, null)
class Reducer
method Reduce(tuple t, array n) // n is an array of one or more nulls
if n.size() >= 2
Emit(tuple t, null)
Difference
Solution:
// We want to compute the difference R – S.
// The Mapper emits each tuple together with a tag: the name of the set the tuple came from.
// The Reducer emits only the tuples that came from R but not from S.
class Mapper
method Map(rowkey key, tuple t)
Emit(tuple t, string t.SetName) // t.SetName is either 'R' or 'S'
class Reducer
method Reduce(tuple t, array n) // array n can be ['R'], ['S'], ['R' 'S'], or ['S', 'R']
if n.size() = 1 and n[1] = 'R'
Emit(tuple t, null)
Join
• A JOIN is a means for combining fields from two tables by using values
common to each.
• ANSI-standard SQL specifies five types of JOIN: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, and CROSS.
• As a special case, a table (base table, view, or joined table) can JOIN to itself
in a self-join.
• Join algorithms
– Nested loop
– Sort-Merge
– Hash
SELECT *
FROM employee
INNER JOIN department ON employee.DepartmentID = department.DepartmentID;
SELECT *
FROM employee
JOIN department ON employee.DepartmentID = department.DepartmentID;
SELECT *
FROM employee
LEFT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID;
Nested loop
For each tuple r in R do
For each tuple s in S do
If r and s satisfy the join condition
Then output the tuple <r,s>
Sort-Merge
p in P; q in Q; g in Q // p and q are cursors over the two sorted inputs; g marks the start of the current group in Q
while more tuples in inputs do
while p.a < g.b do
advance p
end while
while p.a > g.b do
advance g //a group might begin here
end while
while p.a == g.b do
q = g //mark group beginning
while p.a == q.b do
Add <p,q> to the result
Advance q
end while
Advance p //move forward
end while
g = q //candidate to begin next group
end while
Sort-Merge : MapReduce
• This algorithm joins two sets R and L on some key k.
• The Mapper goes through all tuples from R and L, extracts the key k from each tuple, marks the tuple with a tag that indicates the set this tuple came from (‘R’ or ‘L’), and emits the tagged tuple using k as a key.
• The Reducer receives all tuples for a particular key k and puts them into two buckets, one for R and one for L.
• When the two buckets are filled, the Reducer runs a nested loop over them and emits a cross-join of the buckets.
• Each emitted tuple is a concatenation of an R-tuple, an L-tuple, and the key k.
class Reducer
method Reduce(join_key k, tagged_tuples [t1, t2,...])
H = new AssociativeArray : set_name -> values
for all tagged_tuple t in [t1, t2,...] // separate values into 2 arrays
H{t.tag}.add(t.values)
for all values r in H{'R'} // produce a cross-join of the two arrays
for all values l in H{'L'}
Emit(null, [k r l] )
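• A minimal Java sketch of this repartition (reduce-side) join, assuming both inputs are text files whose first tab-separated field is the join key (class names are illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RepartitionJoin {
  // Tag every R record with "R" and key it by the join key (first field).
  public static class RMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable k, Text v, Context ctx) throws IOException, InterruptedException {
      String[] f = v.toString().split("\t", 2);
      ctx.write(new Text(f[0]), new Text("R\t" + f[1]));
    }
  }

  // Same for L records.
  public static class LMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable k, Text v, Context ctx) throws IOException, InterruptedException {
      String[] f = v.toString().split("\t", 2);
      ctx.write(new Text(f[0]), new Text("L\t" + f[1]));
    }
  }

  // Put tagged tuples into an R bucket and an L bucket, then emit their cross-join.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
      List<String> rTuples = new ArrayList<>();
      List<String> lTuples = new ArrayList<>();
      for (Text v : values) {
        String[] f = v.toString().split("\t", 2);
        if (f[0].equals("R")) rTuples.add(f[1]); else lTuples.add(f[1]);
      }
      for (String r : rTuples)
        for (String l : lTuples)
          ctx.write(key, new Text(r + "\t" + l));   // corresponds to [k r l]
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(RepartitionJoin.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, RMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, LMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}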
Hash Join
• The Hash Join algorithm consists of a ‘build’ phase and a ‘probe’ phase.
• In its simplest variant, the smaller dataset is loaded into an in-memory hash
table in the build phase.
• In the ‘probe’ phase, the larger dataset is scanned and joined with the
relevant tuple(s) by looking into the hash table.
for all p in P do
Load p into the in-memory hash table H
end for
for all q in Q do
if H contains p matching with q then
add <p,q> to the result
end if
end for
Hash Join : MapReduce Pseudocode
• Let's assume that we join two sets, R and L, and that R is relatively small. If so, R can be distributed to all Mappers, and each Mapper can load it and index it by the join key.
class Mapper
  method Initialize
    H = new AssociativeArray : join_key -> tuples from R
    R = loadR()
    for all [ join_key k, tuple [r1, r2,...] ] in R
      H{k} = H{k}.append( [r1, r2,...] )
  method Map(rowkey key, tuple l) // probe phase: stream over the larger set L
    k = join key of l
    for all tuple r in H{k}
      Emit(null, tuple [k r l] )
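• A minimal Java sketch of this map-side (replicated) hash join; it assumes the driver shipped the small relation with job.addCacheFile(new URI("/data/R.tsv#smallR")) so it is visible to each task as a local file named smallR, that both relations are tab-separated with the join key in the first field, and that the join key is unique in R:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Build phase happens once per mapper in setup(); the map() calls then stream
// over the larger relation L and probe the in-memory table. No reducer is needed.
public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> rByKey = new HashMap<>();

  @Override
  protected void setup(Context ctx) throws IOException {
    try (BufferedReader in = new BufferedReader(new FileReader("smallR"))) {  // local copy from the distributed cache
      String line;
      while ((line = in.readLine()) != null) {
        String[] f = line.split("\t", 2);
        rByKey.put(f[0], f[1]);        // build: join_key -> R payload
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
    String[] f = value.toString().split("\t", 2);   // probe: one L tuple per call
    String r = rByKey.get(f[0]);
    if (r != null) {
      ctx.write(new Text(f[0]), new Text(r + "\t" + f[1]));   // inner-join output [k r l]
    }
  }
}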
• R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In SIGMOD, pages 495-506, 2010.
• A. Okcan and M. Riedewald. Processing theta-joins using MapReduce. In SIGMOD, 2011.