
Parallel Programming with MapReduce

UNIT-III
Parallel Processing with MapReduce
• MapReduce is an attractive model for parallel data
processing in high-performance cluster computing
environments.
• The scalability of MapReduce has proven to be high,
because a job in the MapReduce model is partitioned
into numerous small tasks running on multiple
machines in a large-scale cluster.
• All Map tasks are executed at the same time, forming
parallel processing of data.
• After that, the intermediate output of the Map tasks is
sorted, and the system sends the intermediate data to
the Reduce tasks for further processing.
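The idea can be sketched in plain Python (a simplified single-machine simulation, not the Hadoop implementation; the sample documents below are made up for illustration): several map tasks run at the same time, their intermediate output is grouped and sorted by key, and a reduce step combines the values for each key.

from multiprocessing import Pool
from collections import defaultdict

def map_task(document):
    # Emit an intermediate <word, 1> pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_task(key, values):
    # Combine all values seen for one key into a single result.
    return (key, sum(values))

if __name__ == "__main__":
    documents = ["apache hadoop class", "hadoop track class", "apache track"]
    with Pool() as pool:                 # Map tasks execute in parallel
        intermediate = pool.map(map_task, documents)
    grouped = defaultdict(list)          # group/sort the intermediate data by key
    for pairs in intermediate:
        for key, value in pairs:
            grouped[key].append(value)
    for key in sorted(grouped):          # Reduce step consumes the grouped data
        print(reduce_task(key, grouped[key]))

o/p:
('apache', 2)
('class', 2)
('hadoop', 2)
('track', 2)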
How Google Search Works

Video:
https://www.youtube.com/watch?v=0eKVizvYSUQ
https://www.youtube.com/watch?v=BNHR6IQJGZs&t=102s

Google is a fully automated search engine that uses
software known as "web crawlers" to explore the
web on a regular basis and find pages to add to its
index.
Google Search works in three stages, and not all
pages make it through each stage:
• Crawling: Google downloads text, images, and
videos from pages it found on the internet with
automated programs called crawlers.
• Indexing: Google analyzes the text, images, and
video files on the page, and stores the information
in the Google index, which is a large database.
• Serving search results: When a user searches on
Google, Google returns information that's relevant
to the user's query.
Crawling:
• Google searches the web with automated programs
called crawlers, looking for pages that are new or updated.
Google stores those page addresses (or page URLs) in a
big list to look at later.
• Google finds pages by many different methods, but the main
method is following links from pages it already knows about.

Indexing:
• Google visits the pages that it has learned about by
crawling and tries to analyze what each page is about.
• Google analyzes the text, images, and video files on
the page to understand its content.
• This information is stored in the Google index, a huge
database that is stored on many, many (many!) computers.
Serving search results:
• When a user performs a Google search, Google
tries to determine the highest quality results.
• The "best" results have many factors, including
things such as the user's location, language, device
(desktop or phone), and previous queries.
• For example, searching for "bicycle repair shops"
would show different answers depending on the
user's location.
• Google doesn't accept payment to rank pages
higher, and ranking is done algorithmically.
• Search results will be displayed according to the
page ranking algorithm.
MapReduce Overview:
• Traditional Enterprise Systems normally have a
centralized server to store and process data.
• This traditional model is not suitable for processing
huge volumes of data, which cannot be accommodated
by standard database servers.
• Google solved this bottleneck issue using an
algorithm called MapReduce.
• MapReduce divides a task into small parts and
assigns them to many computers.
• Later, the results are collected at one place and
integrated to form the result dataset.
How MapReduce Works?
• The MapReduce algorithm contains two important
tasks, namely Map and Reduce.
• The Map task takes a set of data and converts it
into another set of data, where individual
elements are broken down into tuples (key-value
pairs).
• The Reduce task takes the output from the Map as
an input and combines those data tuples (key-
value pairs) into a smaller set of tuples.
Let us now take a close look at each of the phases and
try to understand their significance:
• Input Phase − Here we have a Record Reader that translates each
record in an input file and sends the parsed data to the mapper in
the form of key-value pairs.

• Map − Map is a user-defined function, which takes a series of
key-value pairs and processes each one of them to generate zero or
more key-value pairs.

• Intermediate Keys − The key-value pairs generated by the mapper
are known as intermediate keys.

• Combiner − A combiner is a type of local Reducer that groups similar
data from the map phase into identifiable sets.
– It takes the intermediate keys from the mapper as input and applies
user-defined code to aggregate the values within the scope of one
mapper (see the sketch after this list).
– It is not a part of the main MapReduce algorithm; it is optional.
• Shuffle and Sort − The Reducer task starts with the
Shuffle and Sort step.
It downloads the grouped key-value pairs onto
the local machine, where the Reducer is
running.

• Reducer − The Reducer takes the grouped key-value
paired data as input and runs a Reducer function on
each group.
Here, the data can be aggregated, filtered, and
combined in a number of ways, which can require a
wide range of processing.
• Output Phase − In the output phase, we have an
output formatter that translates the final key-value
pairs from the Reducer function and writes them
onto a file using a record writer.
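A short Python sketch of the combiner described above (a simplified illustration; the mapper output shown is made up): the combiner applies local aggregation within one mapper's output, so fewer records have to be shuffled to the reducers.

from collections import Counter

# Hypothetical intermediate output of a single map task, before combining.
map_output = [("apache", 1), ("class", 1), ("apache", 1),
              ("track", 1), ("apache", 1), ("class", 1)]

def combine(pairs):
    # Local aggregation within one mapper: sum the values per key.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

combined = combine(map_output)
print("Records before combining:", len(map_output))
print("Records after combining:", len(combined))
print(combined)

o/p:
Records before combining: 6
Records after combining: 3
[('apache', 3), ('class', 2), ('track', 1)]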
Sample Map Reduce Application: Word Count

1. First, in the map stage, the input data (the six documents) is split and
distributed across the cluster (the three servers).
In this case, each map task works on a split containing two documents.
During mapping, there is no communication between the nodes. They
perform independently.
2. Then, map tasks create a <key, value> pair for every
word. These pairs show how many times a word occurs. A
word is a key, and a value is its count.

For example, one document contains three of the four words
we are looking for: Apache 7 times, Class 8
times, and Track 6 times.
The key-value pairs in one map task's output look like this:
• <apache, 7>
• <class, 8>
• <track, 6>

This process is done in parallel tasks on all nodes for all
documents.
3. After input splitting and mapping complete, the
outputs of every map task are shuffled. This is the
first step of the Reduce stage.
Since we are looking for the frequency of occurrence
of four words, there are four parallel Reduce tasks.
The reduce tasks can run on the same nodes as the
map tasks, or they can run on any other node.
• The shuffle step ensures the keys Apache, Hadoop,
Class, and Track are sorted for the reduce step.
This process groups the values by keys in the form
of <key, value-list> pairs.
4. In the reduce step of the Reduce stage, each of
the four tasks processes a <key, value-list> to produce a
final key-value pair. The reduce tasks also run at
the same time and work independently.
In our example from the diagram, the reduce tasks
produce the following individual results:
• <apache, 22>
• <hadoop, 20>
• <class, 18>
• <track, 22>
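The whole word-count flow can be sketched in plain Python (a single-machine simulation; the sample documents are invented, so the counts differ from those in the diagram):

from collections import defaultdict

# Made-up input documents standing in for the documents in the diagram.
documents = ["apache hadoop class track", "hadoop class apache",
             "track apache hadoop", "class track hadoop apache"]

# Map stage: each map task emits a <word, count> pair per word in its split.
map_outputs = []
for doc in documents:
    counts = defaultdict(int)
    for word in doc.split():
        counts[word] += 1
    map_outputs.append(list(counts.items()))

# Shuffle: group the values by key into <key, value-list> pairs.
shuffled = defaultdict(list)
for output in map_outputs:
    for word, count in output:
        shuffled[word].append(count)

# Reduce stage: each reduce task sums one key's value-list into a final pair.
for word in sorted(shuffled):
    print(f"<{word}, {sum(shuffled[word])}>")

o/p:
<apache, 4>
<class, 3>
<hadoop, 4>
<track, 3>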
MapReduce Programming:
map() function:
• The map() function applies a given function to each
item in an iterable.
• Unlike reduce(), map() operates independently
on each item rather than producing a single result.

Syntax
map(function, iterable)
where the parameters are
• function − the function applied to each item.
• iterable − the iterable whose items are passed to the function.
reduce() function:
• The reduce() function applies a function cumulatively to the
items of a list or other iterable, returning a single value. It is
found in the functools module.
• This is often more concise than writing an explicit loop.

Syntax
reduce(function, iterable)
where the parameters are
• function − the function applied cumulatively to pairs of items.
• iterable − the iterable whose items are reduced to a single value.
Example for map():

def multiplyNumbers(givenNumbers):
    return givenNumbers * 3

givenNumbers = map(multiplyNumbers, [1, 3, 5, 2, 6])
print("Multiplying list elements with 3:")
for element in givenNumbers:
    print(element)

o/p:
Multiplying list elements with 3:
3
9
15
6
18
Example for reduce():
from functools import reduce

def addNumbers(x, y):
    return x + y

inputList = [12, 4, 10, 15, 6, 5]
print("The sum of all list items:")
print(reduce(addNumbers, inputList))

o/p:
The sum of all list items:
52
Map Reduce Jobs Execution:
Steps of MapReduce Job Execution flow:
MapReduce processes the data in various phases with the help of
different components:
1. Input Files
• The data for a MapReduce job is stored in input files, which reside
in HDFS. The input file format is arbitrary; line-based log files and
binary formats can also be used.

2. Input Format
• The Input Format defines how to split and read these input files. It
selects the files or other objects used for input, and it creates the
Input Splits.

3. Input Splits
• Each split represents the data that will be processed by an
individual Mapper. One map task is created for each split, so the
number of map tasks is equal to the number of Input Splits.
4. Record Reader
• The Record Reader communicates with the Input Split and converts
the data into key-value pairs suitable for reading by the Mapper. By
default, the Record Reader uses TextInputFormat to convert data into
key-value pairs.

5. Mapper
• The Mapper processes the input records produced by the Record Reader
and generates intermediate key-value pairs. The intermediate output
can be completely different from the input pairs. The output of the
mapper is the full collection of key-value pairs.
• The Hadoop framework doesn't store the output of the mapper on HDFS.

6. Combiner
• The Combiner is a mini-reducer which performs local aggregation on the
mapper's output. It minimizes the data transfer between the mapper
and the reducer.
7. Partitioner
• The Partitioner comes into play if we are working
with more than one reducer. It takes the output of the
combiner and partitions it so that each partition is
sent to one reducer.

8. Shuffling and Sorting

• After partitioning, the output is shuffled to the
reducer nodes. The shuffling is the physical
movement of the data, which is done over the
network. Once all the mappers finish, their output is
shuffled to the reducer nodes and sorted by key.
9. Reducer
• The Reducer then takes the set of intermediate key-value
pairs produced by the mappers as input and runs a
reducer function on each of them to generate the
output.

10. Output Format

• The Output Format defines the way the Record Writer
writes the output key-value pairs to output files.
The instances provided by Hadoop write the files
to HDFS.
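This execution flow can be driven from Python with Hadoop Streaming, where the mapper and reducer are small scripts that read from standard input and write tab-separated key-value pairs. The sketch below is a minimal word-count illustration; the script names, input/output paths, and the streaming jar location are assumptions that depend on the installation.

mapper.py:
import sys

# Streaming delivers one input line at a time on stdin;
# the mapper emits an intermediate <word, 1> pair per word.
for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

reducer.py:
import sys

# Hadoop Streaming hands the reducer the mapper output already sorted by key,
# so consecutive lines with the same word can be summed into one final pair.
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

A typical (installation-dependent) way to submit the job:
hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py \
    -mapper "python3 mapper.py" -reducer "python3 reducer.py" \
    -input /user/hadoop/input -output /user/hadoop/output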
Hive and Pig Language capabilities:

Hive:
• Hive is built on top of Hadoop and is used to
process structured data in Hadoop.
• Hive was developed by Facebook.
• It provides a querying language frequently known
as Hive Query Language (HiveQL).
• A user writes queries in the HiveQL language, which
are then converted into MapReduce tasks.
• Next, the data is processed and analyzed.
• HiveQL works on structured data, such as numbers,
addresses, dates, names, and so on.
• HiveQL allows multiple users to query data
simultaneously.
Hive Thrift server:
Hive Server is an optional service that allows a remote client to submit
requests to Hive, using a variety of programming languages, and
retrieve results.
CLI (Command Line Interface): the default shell in which users can run Hive queries directly.
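As an example of a remote client talking to the Hive Thrift server, the sketch below uses the third-party PyHive library from Python; the host, port, username, and the employees table are hypothetical.

from pyhive import hive   # third-party client for HiveServer (Thrift)

# Hypothetical connection details; adjust to the actual HiveServer deployment.
conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()
cursor.execute("SELECT department, COUNT(*) FROM employees GROUP BY department")
for department, count in cursor.fetchall():
    print(department, count)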
Pig:
• Pig is used for the analysis of a large amount of
data.
• Pig is used to perform all kinds of data manipulation
operations in Hadoop.
• The two parts of the Apache Pig are Pig-Latin and
Pig-Engine.
• It provides the Pig-Latin language to write the code
that contains many inbuilt functions like join, filter,
etc.
• Pig Engine is used to convert all these scripts into
specific map and reduce tasks.
• Pig stands out for its ability to operate on various
types of data; whether we are working with structured,
semi-structured, or unstructured data, Pig takes care
of it all.
