
MAPREDUCE

Traditional Way
• Let us take an example where I have a weather log containing the daily average temperature for the years 2000 to 2015. Here, I want to find the day with the highest temperature in each year.
• So, in the traditional way, I will split the data into smaller parts or blocks and store them on different machines. Then, I will find the highest temperature in each part stored on the corresponding machine. At last, I will combine the results received from each of the machines to produce the final output.
Challenges in the traditional approach
• Critical path problem: delays
• Reliability problem
• Equal split issue: overloaded or underutilized machines
• A single split may fail: no combined result
• Aggregation of the results
• To overcome these issues, we have the MapReduce framework, which allows us to perform such parallel computations without bothering about issues like reliability and fault tolerance. Therefore, MapReduce gives you the flexibility to write code logic without caring about the design issues of the system.
Challenges in the traditional approach
1. Critical path problem: It is the amount of time taken to finish the job without delaying the next milestone or the actual completion date. So, if any of the machines delays its part of the job, the whole work gets delayed.
2. Reliability problem: What if any of the machines working with a part of the data fails? The management of this failover becomes a challenge.
3. Equal split issue: How will I divide the data into smaller chunks so that each machine gets an even part of the data to work with? In other words, how do I divide the data equally such that no individual machine is overloaded or underutilized?
4. A single split may fail: If any of the machines fails to provide its output, I will not be able to calculate the result. So, there should be a mechanism to ensure the fault tolerance of the system.
5. Aggregation of the results: There should be a mechanism to aggregate the results generated by each of the machines to produce the final output.
MapReduce
• MapReduce is a data processing tool used to process data in parallel in a distributed environment.
• It was introduced in 2004 in the paper "MapReduce: Simplified Data Processing on Large Clusters," published by Google.
• MapReduce is a programming model and an associated implementation for processing and generating large data sets.
• Many real-world tasks are expressible in this model.
MapReduce
• Programs written in this functional style are automatically
parallelized and executed on a large cluster of commodity
machines.
• MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines.
• Programmers find the system easy to use: hundreds of
MapReduce programs have been implemented and upwards of
one thousand MapReduce jobs are executed on Google's clusters
every day.
How MapReduce Works?
• The MapReduce algorithm contains two important tasks, namely Map
and Reduce.
• The Map task takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key-
value pairs).
• The Reduce task takes the output from the Map as an input and
combines those data tuples (key-value pairs) into a smaller set of
tuples.
• The reduce task is always performed after the map job.
• The reducer receives the key-value pairs from multiple map tasks.
• Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output; the type sketch below summarizes this flow.
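Conceptually, the two functions have the type signatures given in Google's original paper:

  map    (k1, v1)       -> list(k2, v2)
  reduce (k2, list(v2)) -> list(v2)

The framework groups all intermediate values that share the same intermediate key k2 and passes them to a single reduce call.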
How MapReduce Works?
• Input Phase − The input reader reads the incoming data and splits it into data blocks of an appropriate size (typically 64 MB to 128 MB). A Record Reader then translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.
• Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
• Combiner − A combiner is a type of local Reducer that groups similar data from the map
phase into identifiable sets. It takes the intermediate keys from the mapper as input and
applies a user-defined code to aggregate the values in a small scope of one mapper. It is
not a part of the main MapReduce algorithm; it is optional.
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads
the grouped key-value pairs onto the local machine, where the Reducer is running. The
individual key-value pairs are sorted by key into a larger data list. The data list groups the
equivalent keys together so that their values can be iterated easily in the Reducer task.
• Reducer − The Reducer takes the grouped key-value data as input and runs a Reducer function on each group. Here, the data can be aggregated, filtered, and combined in a number of ways, so it may require a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.
• Output Phase − In the output phase, we have an output formatter that translates the
final key-value pairs from the Reducer function and writes them onto a file using a record
writer.
• Suppose we have a text file called example.txt whose contents are as follows:
• Dear Bear River Car Car River Deer Car Bear
Example
[Figure: word-count data flow on example.txt – input split line by line, map, shuffle and sort, reduce, final output]
• First, divide the input into three splits as shown in the figure. This will distribute the
work among all the map nodes.
• Then, tokenize the words in each of the mappers and give a hardcoded value (1)
to each of the tokens or words. The rationale behind giving a hardcoded value
equal to 1 is that every word, in itself, will occur once.
• Now, a list of key-value pairs will be created where the key is nothing but the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs – Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
• After the mapper phase, a partition process takes place where sorting and shuffling
happen so that all the tuples with the same key are sent to the corresponding
reducer.
• So, after the sorting and shuffling phase, each reducer will have a unique key and the list of values corresponding to that very key, for example Bear, [1,1]; Car, [1,1,1]; etc.
• Now, each Reducer counts the values present in its list of values. As shown in the figure, the reducer gets the list of values [1,1] for the key Bear. Then, it counts the number of ones in that list and gives the final output as – Bear, 2.
• Finally, all the output key-value pairs are collected and written to the output file; a minimal Java implementation of this mapper and reducer is sketched below.
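As a minimal sketch, the word-count mapper and reducer can be written with the standard Hadoop Java API as follows. The class names TokenizerMapper and IntSumReducer are illustrative, and each public class would normally live in its own .java file:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (word, 1) for every token in a line of input, e.g. (Bear, 1).
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Sums the list of 1s received for each word, e.g. (Bear, [1,1]) -> (Bear, 2).
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}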
MapReduce algorithm
• The MapReduce algorithm contains two important tasks, namely
Map and Reduce.
• The map task is done by means of the Mapper class.
• The reduce task is done by means of the Reducer class.
• The Mapper class takes the input, tokenizes it, and maps and sorts it.
• The output of the Mapper class is used as input by the Reducer class, which in turn searches for matching pairs and reduces them.
MapReduce algorithm
• Mapper Class: The first stage in data processing using MapReduce is the Mapper class. Here, the RecordReader processes each input record and generates the respective key-value pair. Hadoop stores this intermediate mapper output on the local disk.
• Input Split: It is the logical representation of data. It represents a block of work that is processed by a single map task in the MapReduce program.
• RecordReader: It interacts with the input split and converts the obtained data into key-value pairs.
• Reducer Class: The intermediate output generated by the mapper is fed to the reducer, which processes it and generates the final output, which is then saved in HDFS.
• Driver Class: The major component in a MapReduce job is the Driver class. It is responsible for setting up a MapReduce job to run in Hadoop; a sketch follows below.
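As a sketch, a driver for the word-count example above might look like the following, again assuming the illustrative TokenizerMapper and IntSumReducer classes from before:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);     // locate the job jar
        job.setMapperClass(TokenizerMapper.class);    // map phase
        job.setCombinerClass(IntSumReducer.class);    // optional local reducer
        job.setReducerClass(IntSumReducer.class);     // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. example.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Here the reducer doubles as the combiner, which is safe because summing counts is associative and commutative.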
https://youtu.be/mafw2-CVYnA
MapReduce algorithm
• MapReduce implements various mathematical algorithms to
divide a task into small parts and assign them to multiple
systems.
• These mathematical algorithms may include the following −
• Sorting
• Searching
• Indexing
• TF-IDF (Term Frequency–Inverse Document Frequency), defined below
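As an example of the last item, TF-IDF is commonly defined as

  tfidf(t, d) = tf(t, d) × log( N / df(t) )

where tf(t, d) is the number of times term t occurs in document d, N is the total number of documents, and df(t) is the number of documents containing t. MapReduce can compute the tf and df counts in parallel across the document collection.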
Advantages of MapReduce
• Parallel Processing: In MapReduce, we divide the job among multiple nodes, and each node works on a part of the job simultaneously. So, MapReduce is based on the divide-and-conquer paradigm, which helps us process the data using different machines. As the data is processed by multiple machines in parallel instead of a single machine, the time taken to process the data is reduced tremendously.
• Data Locality:
• Instead of moving data to the processing unit, we are moving the
processing unit to the data in the MapReduce Framework. In the
traditional system, we used to bring data to the processing unit and
process it. But, as the data grew and became very huge, bringing this
huge amount of data to the processing unit posed the following issues:
• Moving huge amounts of data to the processing unit is costly and deteriorates network performance.
• Processing takes time, as the data is processed by a single unit, which becomes the bottleneck.
• The master node can get over-burdened and may fail.
• Now, MapReduce allows us to overcome the above issues by bringing the
processing unit to the data.
• The data is distributed among multiple nodes where each node processes
the part of the data residing on it. This allows us to have the following
advantages:
• It is very cost-effective to move the processing unit to the data.
• The processing time is reduced as all the nodes are working with their part of the
data in parallel.
• Every node gets a part of the data to process, so there is no chance of a node getting overburdened.
Usage of MapReduce
• It can be used in various applications like document clustering, distributed sorting, and web link-graph reversal.
• It can be used for distributed pattern-based searching.
• It can be used in machine learning.
• It was used by Google to regenerate Google's index of the World Wide Web.
• It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environments.
1. Entertainment
Hadoop MapReduce assists end users in finding the most popular movies based on their preferences and
previous viewing history. It primarily concentrates on their clicks and logs.
Various OTT services, including Netflix, regularly release many web series and movies. It may have happened that you couldn't pick which movie to watch, so you looked at Netflix's recommendations and decided to watch one of the suggested series or films. Netflix uses Hadoop and MapReduce to suggest well-known movies to the user based on what they have watched and which movies they enjoy. MapReduce can examine user clicks and logs to learn how they watch movies.
2. E-commerce
Several e-commerce companies, including Flipkart, Amazon, and eBay, employ MapReduce to evaluate
consumer buying patterns based on customers’ interests or historical purchasing patterns. For various e-
commerce businesses, it provides product suggestion methods by analyzing data, purchase history, and user
interaction logs.
Many e-commerce vendors use the MapReduce programming model to identify popular products based on customer preferences or purchasing behavior. This includes making item recommendations for e-commerce inventory by analyzing website records, purchase histories, user interaction logs, and so on.
3. Social media
Nearly 500 million tweets, or about 6,000 per second, are sent daily on the microblogging platform Twitter. MapReduce processes Twitter data, performing operations such as tokenization, filtering, counting, and aggregating counters (a plain-Java sketch of these steps follows below).
Tokenization: Tokenizes the tweets into words and builds key-value pairs out of the tokens.
Filtering: Removes unwanted terms from the token maps.
Counting: Emits a counter for each token.
Aggregate counters: Groups comparable counter values into small, manageable units.
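The following is an illustrative, non-MapReduce sketch in plain Java of those four steps on a small in-memory batch of tweets; in production each step would run as a distributed MapReduce job:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TweetCountSketch {
    // Hypothetical stop-word list used for the filtering step.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "to", "and");

    public static void main(String[] args) {
        List<String> tweets = List.of("the bear saw a river", "a car and the bear");

        Map<String, Long> counts = tweets.stream()
                // Tokenization: split each tweet into lowercase word tokens.
                .flatMap(t -> Arrays.stream(t.toLowerCase().split("\\s+")))
                // Filtering: drop unwanted terms.
                .filter(w -> !STOP_WORDS.contains(w))
                // Counting + aggregation: one counter per token, summed per word.
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts); // e.g. {bear=2, saw=1, river=1, car=1}
    }
}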
4. Data warehouse
Systems that handle enormous volumes of information are known as data warehouse systems. The star schema, which
consists of a fact table and several dimension tables, is the most popular data warehouse model. In a shared-nothing
architecture, storing all the necessary data on a single node is impossible, so retrieving data from other nodes is essential.
This results in network congestion and slow query execution speeds. If the dimensions are not too big, users can replicate
them over nodes to get around this issue and maximize parallelism. Using MapReduce, we may build specialized business
logic for data insights while analyzing enormous data volumes in data warehouses.
5. Fraud detection
Conventional methods of preventing fraud are not always very effective. For instance, data analysts typically manage
inaccurate payments by auditing a tiny sample of claims and requesting medical records from specific submitters. Hadoop is
a system well suited to handling the large volumes of data needed to build fraud detection algorithms. Financial businesses, including banks, insurance companies, and payment platforms, use Hadoop and MapReduce for fraud detection, pattern recognition, and business analytics through transaction analysis.
• https://www.spiceworks.com/tech/big-data/articles/what-is-map-reduce/
• https://www.edureka.co/blog/mapreduce-tutorial/
