UNIT-III
Parallel Processing with MapReduce
• MapReduce is an attractive model for parallel data
processing in high-performance cluster computing
environments.
• The scalability of MapReduce is high because a job in
the MapReduce model is partitioned into numerous small
tasks running on multiple machines in a large-scale
cluster.
• All Map tasks are executed at the same time, forming
parallel processing of data.
• After that, the intermediate output of the Map tasks is
sorted, and the system sends the intermediate data to the
Reduce tasks for further processing (a small sketch of this
flow follows below).
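As a minimal illustration of this flow, the Python sketch below uses the
standard multiprocessing module to stand in for the machines of a cluster;
the input values, the worker count, and the squaring function are made up
for illustration only.

from multiprocessing import Pool
from functools import reduce

def map_task(record):
    # Each map task works on its own piece of the input, independently.
    return record * record

if __name__ == "__main__":
    input_records = [3, 1, 4, 1, 5, 9, 2, 6]

    # Map phase: all map tasks run at the same time, one worker per task.
    with Pool(processes=4) as pool:
        intermediate = pool.map(map_task, input_records)

    # The intermediate output of the map tasks is sorted...
    intermediate.sort()

    # ...and handed to the reduce task, which combines it into one result.
    total = reduce(lambda x, y: x + y, intermediate)
    print(total)  # 173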
How Google Search Works
Video:
https://www.youtube.com/watch?v=0eKVizvYSUQ
https://www.youtube.com/watch?v=BNHR6IQJGZs&t=102s
Indexing:
• Google visits the pages that it has learned about by
crawling, and tries to analyze what each page is about.
• Google analyzes the content, images, and video files in the
page, trying to understand what the page is about.
• This information is stored in the Google index, a huge
database that is stored on many, many (many!) computers.
Serving search results:
• When a user performs a Google search, Google
tries to determine the highest quality results.
• What counts as the "best" result depends on many factors,
including the user's location, language, device (desktop
or phone), and previous queries.
• For example, searching for "bicycle repair shops" would
show different answers depending on the user's location.
• Google doesn't accept payment to rank pages
higher, and ranking is done algorithmically.
• Search results will be displayed according to the
page ranking algorithm.
MapReduce Overview:
• Traditional Enterprise Systems normally have a
centralized server to store and process data.
• This traditional model is not suitable for processing
huge volumes of data, and such workloads cannot be
accommodated by standard database servers.
• Google solved this bottleneck issue using an
algorithm called MapReduce.
• MapReduce divides a task into small parts and
assigns them to many computers.
• Later, the results are collected at one place and
integrated to form the result dataset.
How Does MapReduce Work?
• The MapReduce algorithm contains two important
tasks, namely Map and Reduce.
• The Map task takes a set of data and converts it
into another set of data, where individual
elements are broken down into tuples (key-value
pairs).
• The Reduce task takes the output from the Map as
an input and combines those data tuples (key-
value pairs) into a smaller set of tuples.
Let us now take a close look at each of the phases and
try to understand their significance:
• Input Phase − Here we have a Record Reader that translates each
record in an input file and sends the parsed data to the mapper in
the form of key-value pairs.
1. First, in the map stage, the input data (six documents in this
example) is split and distributed across the cluster (three servers).
In this case, each map task works on a split containing two documents.
During mapping, there is no communication between the nodes. They
perform independently.
2. Then, the map tasks create a <key, value> pair for every
word. These pairs record how many times a word occurs in that
split: the word is the key, and its count is the value (see the
sketch below).
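A small Python sketch of this step (the three short documents below are
made up; the slide's original example uses six documents on three servers):

from collections import Counter, defaultdict

def map_task(document):
    # Emit a <key, value> pair for every word in this split:
    # the word is the key, its count within the split is the value.
    return list(Counter(document.split()).items())

documents = ["deer bear river", "car car river", "deer car bear"]
intermediate = [map_task(doc) for doc in documents]
print(intermediate[1])  # [('car', 2), ('river', 1)]

# The shuffle step groups the pairs by key so the reduce step can total them.
totals = defaultdict(int)
for pairs in intermediate:
    for word, count in pairs:
        totals[word] += count
print(dict(totals))  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}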
map() function:
• The map() function applies a given function to every item of
an iterable and returns an iterator of the results.
Syntax
map(function, iterable)
where parameters are
• function − The function that is applied to each item of the
iterable.
• iterable − The sequence (list, tuple, etc.) whose items are
passed to the function.
reduce() function:
• The reduce() function applies a function of two arguments
cumulatively to the items of a list or other iterable,
reducing them to a single value. It lives in the functools
module.
• This is often more concise than writing an explicit loop.
Syntax
reduce(function, iterable)
where parameters are
• function − A function of two arguments that is applied
cumulatively to the items of the iterable.
• iterable − The sequence whose items are reduced to a
single value.
Example for map():
def multiplyNumbers(number):
    # Called once for every element of the iterable.
    return number * 3

result = map(multiplyNumbers, [1, 3, 5, 2, 6])
print("Multiplying list elements with 3:")
for element in result:
    print(element)
o/p:
Multiplying list elements with 3:
3
9
15
6
18
Example for reduce():
from functools import reduce

def addNumbers(x, y):
    return x + y

inputList = [12, 4, 10, 15, 6, 5]
print("The sum of all list items:")
print(reduce(addNumbers, inputList))
o/p:
The sum of all list items:
52
MapReduce Job Execution:
Steps of the MapReduce job execution flow:
MapReduce processes the data in several phases with the help of
different components:
1. Input Files
• The data for a MapReduce job is stored in input files, which
typically reside in HDFS. The input file format is arbitrary;
line-based log files and binary formats can also be used.
2. Input Format
• The Input Format then defines how to split and read these input
files. It selects the files or other objects used as input and
creates the Input Splits.
3. Input Splits
• An Input Split represents the data that will be processed by an
individual Mapper. One map task is created for each split, so the
number of map tasks equals the number of Input Splits (see the
sketch below).
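A rough sketch of the split/map-task relationship; the 128 MB split size
below matches the usual HDFS block-size default but is only an assumption
here, and the 1 GB file is hypothetical.

import math

def number_of_map_tasks(file_size_bytes, split_size_bytes=128 * 1024 * 1024):
    # One map task is created per input split, so the number of map
    # tasks equals the number of splits.
    return math.ceil(file_size_bytes / split_size_bytes)

print(number_of_map_tasks(1024 * 1024 * 1024))  # a 1 GB file -> 8 splits / 8 map tasks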
4. Record Reader
• The Record Reader communicates with the Input Split and converts
the data into key-value pairs suitable for reading by the Mapper.
By default, Hadoop uses TextInputFormat, whose record reader turns
each line of the input into a key-value pair (a sketch follows
below).
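A minimal Python sketch of what such a line-oriented record reader
produces: one key-value pair per line, with the byte offset as the key
and the line contents as the value. The file name is hypothetical.

def record_reader(path):
    # Yield (byte offset, line) key-value pairs, one per input line,
    # mimicking the default line-oriented record reader.
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            yield offset, line.decode("utf-8").rstrip("\n")
            offset += len(line)

# Hypothetical usage:
# for key, value in record_reader("input_split.txt"):
#     print(key, value)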
5. Mapper
• The Mapper processes each input record produced by the Record
Reader and generates intermediate key-value pairs. The intermediate
output can be completely different from the input pair; the output
of the mapper is the full collection of these key-value pairs (see
the sketch below).
• The Hadoop framework does not store the mapper's output on HDFS;
it is written to the local disk of the node that ran the map task.
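Continuing the record-reader sketch above, a word-count mapper might look
like this; the intermediate (word, 1) pairs have nothing in common with
the (offset, line) input pairs.

def mapper(key, value):
    # key: byte offset (ignored here); value: one line of text.
    # Emit an intermediate (word, 1) pair for every word occurrence.
    for word in value.split():
        yield word, 1

# Hypothetical usage, feeding the record reader's output into the mapper:
# for key, value in record_reader("input_split.txt"):
#     for out_key, out_value in mapper(key, value):
#         print(out_key, out_value)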
6. Combiner
• The Combiner is a mini-reducer that performs local aggregation on
the mapper's output. It minimizes the data transferred between the
mapper and the reducer (see the sketch below).
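A sketch of the local aggregation a combiner performs, assuming word-count
pairs like those emitted by the mapper sketch above:

from collections import defaultdict

def combiner(mapper_output):
    # Sum the values for each key locally, on the map side, so fewer
    # pairs have to travel across the network to the reducers.
    local_counts = defaultdict(int)
    for key, value in mapper_output:
        local_counts[key] += value
    return list(local_counts.items())

# One map task's output of five pairs shrinks to three pairs after combining.
print(combiner([("car", 1), ("car", 1), ("river", 1), ("car", 1), ("deer", 1)]))
# [('car', 3), ('river', 1), ('deer', 1)]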
7. Partitioner
• The Partitioner comes into play when we are working with more
than one reducer. It takes the output of the combiner (or of the
mapper, if no combiner is defined) and decides which reducer each
key-value pair is sent to (see the sketch below).
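A sketch of hash-based partitioning in Python; Hadoop's default
HashPartitioner works the same way, using the key's hashCode modulo the
number of reducers. The two-reducer setup is an assumption for illustration.

def partition(key, num_reducers):
    # All pairs that share a key end up at the same reducer.
    # (Python's hash() for strings varies between runs; Hadoop's
    # HashPartitioner is deterministic.)
    return hash(key) % num_reducers

num_reducers = 2  # assumed reducer count
for key in ["car", "deer", "river", "bear"]:
    print(key, "-> reducer", partition(key, num_reducers))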
Hive:
• Hive is built on top of Hadoop and is used to
process structured data in Hadoop.
• Hive was developed by Facebook.
• It provides a SQL-like query language known as Hive
Query Language (HiveQL).
• A user writes queries in HiveQL, which are then
converted into MapReduce tasks.
• Next, the data is processed and analyzed.
• HiveQL works on structured data, such as numbers,
addresses, dates, names, and so on.
• HiveQL allows multiple users to query data
simultaneously.
Hive Thrift server:
• Hive Server is an optional service that allows a remote client to submit
requests to Hive, using a variety of programming languages, and
retrieve results (a Python sketch of such a client follows below).
CLI (Command Line Interface):
• The Hive command-line shell, in which users can type and run HiveQL
statements directly.
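A hedged sketch of a remote client talking to a Thrift-based Hive server
(HiveServer2) from Python, assuming the third-party PyHive package; the
host, port, username, table, and query below are placeholders.

from pyhive import hive  # third-party package: pip install pyhive

# Connect to a HiveServer2 / Thrift endpoint (connection details are placeholders).
connection = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = connection.cursor()

# Submit a HiveQL query; Hive converts it into MapReduce tasks and runs them.
cursor.execute("SELECT name, COUNT(*) FROM employees GROUP BY name")
for row in cursor.fetchall():
    print(row)

cursor.close()
connection.close()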
Pig:
• Pig is used for the analysis of a large amount of
data.
• Pig is used to perform all kinds of data manipulation
operations in Hadoop.
• The two parts of Apache Pig are Pig Latin and the Pig Engine.
• Pig Latin is the language used to write the code; it provides
many built-in operations such as join, filter, etc.
• The Pig Engine converts these scripts into MapReduce tasks.
• Pig stands out for its ability to operate on various types of
data: structured, semi-structured, and unstructured. Whatever
kind of data we are working with, Pig takes care of it.