UNIT-III
Parallel Processing with MapReduce
• MapReduce is an attractive model for parallel data
processing in high-performance cluster computing
environments.
• The scalability of MapReduce is high because a job in
the MapReduce model is partitioned into numerous small
tasks running on multiple machines in a large-scale
cluster.
• All Map tasks are executed at the same time, forming
parallel processing of data.
• After that, the intermediate output of the Map tasks is
sorted, and the system sends the intermediate data to the
Reduce tasks for further processing (a small sketch of this
flow follows below).
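As a minimal illustration of this flow, the Python sketch below uses the
standard multiprocessing module to stand in for the machines of a cluster;
the input values, the worker count, and the squaring function are made up
for illustration only.

from multiprocessing import Pool
from functools import reduce

def map_task(record):
    # Each map task works on its own piece of the input, independently.
    return record * record

if __name__ == "__main__":
    input_records = [3, 1, 4, 1, 5, 9, 2, 6]

    # Map phase: all map tasks run at the same time, one worker per task.
    with Pool(processes=4) as pool:
        intermediate = pool.map(map_task, input_records)

    # The intermediate output of the map tasks is sorted...
    intermediate.sort()

    # ...and handed to the reduce task, which combines it into one result.
    total = reduce(lambda x, y: x + y, intermediate)
    print(total)  # 173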
How Google Search Works
Video:
https://www.youtube.com/watch?v=0eKVizvYSUQ
https://www.youtube.com/watch?v=BNHR6IQJGZs&t=102s
Indexing:
• Google visits the pages that it has learned about by
crawling, and tries to analyze what each page is about.
• Google analyzes the content, images, and video files in the
page, trying to understand what the page is about.
• This information is stored in the Google index, a huge
database that is stored on many, many (many!) computers.
Serving search results:
• When a user performs a Google search, Google
tries to determine the highest quality results.
• What counts as the "best" result depends on many factors,
including the user's location, language, device (desktop
or phone), and previous queries.
• For example, searching for "bicycle repair shops" would
show different answers depending on the user's location.
• Google doesn't accept payment to rank pages
higher, and ranking is done algorithmically.
• Search results will be displayed according to the
page ranking algorithm.
MapReduce Overview:
• Traditional Enterprise Systems normally have a
centralized server to store and process data.
• This traditional model is not suitable for processing
huge volumes of data, and such workloads cannot be
accommodated by standard database servers.
• Google solved this bottleneck issue using an
algorithm called MapReduce.
• MapReduce divides a task into small parts and
assigns them to many computers.
• Later, the results are collected at one place and
integrated to form the result dataset.
How Does MapReduce Work?
• The MapReduce algorithm contains two important
tasks, namely Map and Reduce.
• The Map task takes a set of data and converts it
into another set of data, where individual
elements are broken down into tuples (key-value
pairs).
• The Reduce task takes the output from the Map as
an input and combines those data tuples (key-
value pairs) into a smaller set of tuples.
Let us now take a close look at each of the phases and
try to understand their significance:
• Input Phase − Here we have a Record Reader that translates each
record in an input file and sends the parsed data to the mapper in
the form of key-value pairs.
1. First, in the map stage, the input data (six documents in this
example) is split and distributed across the cluster (three servers).
In this case, each map task works on a split containing two documents.
During mapping, there is no communication between the nodes. They
perform independently.
2. Then, the map tasks create a <key, value> pair for every
word. These pairs record how many times a word occurs in that
split: the word is the key, and its count is the value (see the
sketch below).
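A small Python sketch of this step (the three short documents below are
made up; the slide's original example uses six documents on three servers):

from collections import Counter, defaultdict

def map_task(document):
    # Emit a <key, value> pair for every word in this split:
    # the word is the key, its count within the split is the value.
    return list(Counter(document.split()).items())

documents = ["deer bear river", "car car river", "deer car bear"]
intermediate = [map_task(doc) for doc in documents]
print(intermediate[1])  # [('car', 2), ('river', 1)]

# The shuffle step groups the pairs by key so the reduce step can total them.
totals = defaultdict(int)
for pairs in intermediate:
    for word, count in pairs:
        totals[word] += count
print(dict(totals))  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}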
map() function:
• The map() function applies a given function to every item of
an iterable and returns an iterator of the results.
Syntax
map(function, iterable)
where parameters are
• function − The function that is applied to each item of the
iterable.
• iterable − The sequence (list, tuple, etc.) whose items are
passed to the function.
reduce() function:
• The reduce() function applies a function of two arguments
cumulatively to the items of a list or other iterable,
reducing them to a single value. It lives in the functools
module.
• This is often more concise than writing an explicit loop.
Syntax
reduce(function, iterable)
where parameters are
• function − A function of two arguments that is applied
cumulatively to the items of the iterable.
• iterable − The sequence whose items are reduced to a
single value.
Example for map():
def multiplyNumbers(number):
    # Called once for every element of the iterable.
    return number * 3

result = map(multiplyNumbers, [1, 3, 5, 2, 6])
print("Multiplying list elements with 3:")
for element in result:
    print(element)
o/p:
Multiplying list elements with 3:
3
9
15
6
18
Example for reduce():
from functools import reduce

def addNumbers(x, y):
    return x + y

inputList = [12, 4, 10, 15, 6, 5]
print("The sum of all list items:")
print(reduce(addNumbers, inputList))
o/p:
The sum of all list items:
52
MapReduce Job Execution:
Steps of the MapReduce job execution flow:
MapReduce processes the data in several phases with the help of
different components:
1. Input Files
• The data for a MapReduce job is stored in input files, which
typically reside in HDFS. The input file format is arbitrary;
line-based log files and binary formats can also be used.
2. Input Format
• The Input Format then defines how to split and read these input
files. It selects the files or other objects used as input and
creates the Input Splits.
3. Input Splits
• An Input Split represents the data that will be processed by an
individual Mapper. One map task is created for each split, so the
number of map tasks equals the number of Input Splits (see the
sketch below).
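A rough sketch of the split/map-task relationship; the 128 MB split size
below matches the usual HDFS block-size default but is only an assumption
here, and the 1 GB file is hypothetical.

import math

def number_of_map_tasks(file_size_bytes, split_size_bytes=128 * 1024 * 1024):
    # One map task is created per input split, so the number of map
    # tasks equals the number of splits.
    return math.ceil(file_size_bytes / split_size_bytes)

print(number_of_map_tasks(1024 * 1024 * 1024))  # a 1 GB file -> 8 splits / 8 map tasks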
4. Record Reader
• The Record Reader communicates with the Input Split and converts
the data into key-value pairs suitable for reading by the Mapper.
By default, Hadoop uses TextInputFormat, whose record reader turns
each line of the input into a key-value pair (a sketch follows
below).
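A minimal Python sketch of what such a line-oriented record reader
produces: one key-value pair per line, with the byte offset as the key
and the line contents as the value. The file name is hypothetical.

def record_reader(path):
    # Yield (byte offset, line) key-value pairs, one per input line,
    # mimicking the default line-oriented record reader.
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            yield offset, line.decode("utf-8").rstrip("\n")
            offset += len(line)

# Hypothetical usage:
# for key, value in record_reader("input_split.txt"):
#     print(key, value)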
5. Mapper
• The Mapper processes each input record produced by the Record
Reader and generates intermediate key-value pairs. The intermediate
output can be completely different from the input pair; the output
of the mapper is the full collection of these key-value pairs (see
the sketch below).
• The Hadoop framework does not store the mapper's output on HDFS;
it is written to the local disk of the node that ran the map task.
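Continuing the record-reader sketch above, a word-count mapper might look
like this; the intermediate (word, 1) pairs have nothing in common with
the (offset, line) input pairs.

def mapper(key, value):
    # key: byte offset (ignored here); value: one line of text.
    # Emit an intermediate (word, 1) pair for every word occurrence.
    for word in value.split():
        yield word, 1

# Hypothetical usage, feeding the record reader's output into the mapper:
# for key, value in record_reader("input_split.txt"):
#     for out_key, out_value in mapper(key, value):
#         print(out_key, out_value)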
6. Combiner
• The Combiner is a mini-reducer that performs local aggregation on
the mapper's output. It minimizes the data transferred between the
mapper and the reducer (see the sketch below).
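A sketch of the local aggregation a combiner performs, assuming word-count
pairs like those emitted by the mapper sketch above:

from collections import defaultdict

def combiner(mapper_output):
    # Sum the values for each key locally, on the map side, so fewer
    # pairs have to travel across the network to the reducers.
    local_counts = defaultdict(int)
    for key, value in mapper_output:
        local_counts[key] += value
    return list(local_counts.items())

# One map task's output of five pairs shrinks to three pairs after combining.
print(combiner([("car", 1), ("car", 1), ("river", 1), ("car", 1), ("deer", 1)]))
# [('car', 3), ('river', 1), ('deer', 1)]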
7. Partitioner
• The Partitioner comes into play when we are working with more
than one reducer. It takes the output of the combiner (or of the
mapper, if no combiner is defined) and decides which reducer each
key-value pair is sent to (see the sketch below).
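A sketch of hash-based partitioning in Python; Hadoop's default
HashPartitioner works the same way, using the key's hashCode modulo the
number of reducers. The two-reducer setup is an assumption for illustration.

def partition(key, num_reducers):
    # All pairs that share a key end up at the same reducer.
    # (Python's hash() for strings varies between runs; Hadoop's
    # HashPartitioner is deterministic.)
    return hash(key) % num_reducers

num_reducers = 2  # assumed reducer count
for key in ["car", "deer", "river", "bear"]:
    print(key, "-> reducer", partition(key, num_reducers))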
Hive:
• Hive is built on top of Hadoop and is used to
process structured data in Hadoop.
• Hive was developed by Facebook.
• It provides a SQL-like query language known as Hive
Query Language (HiveQL).
• A user writes queries in HiveQL, which are then
converted into MapReduce tasks.
• Next, the data is processed and analyzed.
• HiveQL works on structured data, such as numbers,
addresses, dates, names, and so on.
• HiveQL allows multiple users to query data
simultaneously.
Hive Thrift server:
• Hive Server is an optional service that allows a remote client to submit
requests to Hive, using a variety of programming languages, and
retrieve results (a Python sketch of such a client follows below).
CLI (Command Line Interface):
• The Hive command-line shell, in which users can type and run HiveQL
statements directly.
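A hedged sketch of a remote client talking to a Thrift-based Hive server
(HiveServer2) from Python, assuming the third-party PyHive package; the
host, port, username, table, and query below are placeholders.

from pyhive import hive  # third-party package: pip install pyhive

# Connect to a HiveServer2 / Thrift endpoint (connection details are placeholders).
connection = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = connection.cursor()

# Submit a HiveQL query; Hive converts it into MapReduce tasks and runs them.
cursor.execute("SELECT name, COUNT(*) FROM employees GROUP BY name")
for row in cursor.fetchall():
    print(row)

cursor.close()
connection.close()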
Pig:
• Pig is used for the analysis of a large amount of
data.
• Pig is used to perform all kinds of data manipulation
operations in Hadoop.
• The two parts of Apache Pig are Pig Latin and the Pig Engine.
• Pig Latin is the language used to write the code; it provides
many built-in operations such as join, filter, etc.
• The Pig Engine converts these scripts into MapReduce tasks.
• Pig stands out for its ability to operate on various types of
data: structured, semi-structured, and unstructured. Whatever
kind of data we are working with, Pig takes care of it.