Unit 3
MapReduce
Developing a MapReduce Application
MapReduce is a programming model for building applications that process big data in
parallel across multiple nodes.
It provides the analytical capability to analyze huge volumes of complex data.
The traditional model is not suitable for processing such large volumes of data, which cannot
be accommodated by standard database servers.
Google solved this problem with the MapReduce algorithm.
MapReduce is a distributed data processing algorithm, introduced by Google.
It is influenced by the functional programming model. In a cluster environment, the MapReduce
algorithm is used to process large volumes of data efficiently, reliably and in parallel.
It uses a divide-and-conquer approach to process large volumes of data.
It divides the input task into manageable sub-tasks that are executed in parallel.
MapReduce Architecture
Example:
Input:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
Output of MapReduce task:
bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1
How MapReduce Works?
MapReduce divides a job into small tasks and assigns them to many computers.
The results are collected in one place and integrated to form the final result dataset.
The MapReduce algorithm contains two important tasks:
1. Map - Splits & Mapping
2. Reduce - Shuffling, Reducing
The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data tuples
(key-value pairs) into a smaller set of tuples.
The Reduce task is always performed after the Map task.
Map(Splits & Mapping) & Reduce(Shuffling, Reducing)
How MapReduce Works? – Cont.
The complete execution process (the execution of both the Map and Reduce tasks) is controlled
by two types of entities:
1. Job Tracker: acts like a master, responsible for the complete execution of the submitted job.
2. Multiple Task Trackers: act like slaves, each of them performing a part of the job.
For every job submitted for execution in the system, there is one JobTracker that resides on the
NameNode, and there are multiple TaskTrackers which reside on the DataNodes.
How MapReduce Works? – Cont.
A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
It is the responsibility of the job tracker to coordinate this activity by scheduling tasks to run on
different data nodes.
Execution of each individual task is then looked after by a task tracker, which resides on every data
node executing part of the job.
The task tracker's responsibility is to send progress reports to the job tracker.
In addition, the task tracker periodically sends a 'heartbeat' signal to the job tracker so as to notify
it of the current state of the system.
Thus the job tracker keeps track of the overall progress of each job. In the event of a task failure,
the job tracker can reschedule it on a different task tracker.
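The scheduling and failure-handling behaviour described above can be pictured with a small toy model. The sketch below is purely illustrative: it does not use Hadoop's real JobTracker/TaskTracker interfaces, and the class name, the 10-second timeout, and the round-robin assignment are all assumptions made for the example.

import time

HEARTBEAT_TIMEOUT = 10.0   # seconds without a heartbeat before a tracker is presumed failed (illustrative value)

class ToyJobTracker:
    # Toy master: assigns tasks to trackers and reschedules them when a tracker stops reporting.

    def __init__(self, tasks, trackers):
        self.pending = list(tasks)                      # tasks waiting to be assigned
        self.assigned = {t: [] for t in trackers}       # tracker -> tasks it is running
        self.last_heartbeat = {t: time.time() for t in trackers}
        self.schedule()

    def schedule(self):
        # Hand out pending tasks round-robin to the currently known trackers.
        trackers = list(self.assigned)
        if not trackers:
            return
        i = 0
        while self.pending:
            self.assigned[trackers[i % len(trackers)]].append(self.pending.pop(0))
            i += 1

    def heartbeat(self, tracker):
        # Called periodically by each task tracker to report that it is alive.
        self.last_heartbeat[tracker] = time.time()

    def check_failures(self):
        # Any tracker whose heartbeat has expired loses its tasks; they go back
        # into the pending list and are rescheduled on the surviving trackers.
        now = time.time()
        for tracker, last in list(self.last_heartbeat.items()):
            if now - last > HEARTBEAT_TIMEOUT:
                self.pending.extend(self.assigned.pop(tracker))
                del self.last_heartbeat[tracker]
        self.schedule()

A driver could create ToyJobTracker(["map-0", "map-1", "reduce-0"], ["tt1", "tt2"]) and call check_failures() on a timer; the real Hadoop 1.x daemons follow the same pattern at a much larger scale.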
MapReduce Algorithm
Input Phase
We have a Record Reader that translates each record in an input file and sends the parsed data to the
mapper in the form of key-value pairs.
Map
It is a user-defined function, which takes a series of key-value pairs and processes each one of them to
generate zero or more key-value pairs.
Intermediate Keys
The key-value pairs generated by the mapper are known as intermediate keys.
Combiner
A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets.
It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the
values within the small scope of a single mapper.
It is not a part of the main MapReduce algorithm; it is optional.
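For the word-count example, a combiner would pre-aggregate each mapper's own output before it is shuffled across the network. The snippet below is a plain-Python illustration of that idea; combine_local is an invented name, not a Hadoop API, and the sample pairs are assumed to come from one mapper that processed the lines "Hadoop is good" and "Hadoop is bad".

from collections import Counter

def combine_local(mapper_output):
    # Combiner: sum the (word, 1) pairs produced by a single mapper
    # so that fewer records have to be shuffled to the reducers.
    combined = Counter()
    for word, one in mapper_output:
        combined[word] += one
    return list(combined.items())

raw = [("Hadoop", 1), ("is", 1), ("good", 1), ("Hadoop", 1), ("is", 1), ("bad", 1)]
print(combine_local(raw))   # [('Hadoop', 2), ('is', 2), ('good', 1), ('bad', 1)]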
MapReduce Algorithm – Cont.
Shuffle and Sort
The Reducer task starts with the Shuffle and Sort step.
It downloads the grouped key-value pairs onto the local machine, where the Reducer is running.
The individual key-value pairs are sorted by key into a larger data list.
The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer
task.
Reducer
The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of
them.
Here, the data can be aggregated, filtered, and combined in a number of ways, and it may require a wide
range of processing.
Once the execution is over, it gives zero or more key-value pairs to the final step.
Output Phase
In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer
function and writes them onto a file using a record writer.
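Taken together, the shuffle-and-sort, reduce, and output steps described above can be mimicked in a few lines. This is a conceptual sketch only, using intermediate (word, 1) pairs from the earlier example; in Hadoop the framework performs these steps itself, and the output file name and tab-separated format are merely modelled on Hadoop's usual part-file output.

from itertools import groupby
from operator import itemgetter

# Intermediate key-value pairs as they might arrive from several mappers.
intermediate = [("Welcome", 1), ("to", 1), ("Hadoop", 1), ("Class", 1),
                ("Hadoop", 1), ("is", 1), ("good", 1),
                ("Hadoop", 1), ("is", 1), ("bad", 1)]

# Shuffle and sort: order the pairs by key so that equal keys sit next to each other.
intermediate.sort(key=itemgetter(0))

# Reducer: iterate over each group of equal keys and aggregate its values.
results = [(word, sum(count for _, count in group))
           for word, group in groupby(intermediate, key=itemgetter(0))]

# Output phase: the record writer emits one "key<TAB>value" line per pair.
with open("part-r-00000", "w") as out:
    for word, total in results:
        out.write(f"{word}\t{total}\n")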
MapReduce Features
Scalability
Flexibility
Security & Authentication
Cost Effective Solution
Fast
Data profiling
Data profiling is the process of examining, analyzing, and creating useful summaries of data.
The process yields a high-level overview which aids in the discovery of data quality issues,
risks, and overall trends. Data profiling produces critical insights into data that companies can
then leverage to their advantage.
Data profiling refers to the process of examining, analyzing, reviewing and summarizing data
sets to gain insight into the quality of data. Data quality is a measure of the condition of data
based on factors such as its accuracy, completeness, consistency, timeliness and accessibility.
Data profiling
More specifically, data profiling sifts through data to determine its legitimacy and quality.
Analytical algorithms detect dataset characteristics such as mean, minimum, maximum,
percentile, and frequency to examine data in minute detail. It then performs analyses to
uncover metadata, including frequency distributions, key relationships, foreign key
candidates, and functional dependencies. Finally, it uses all of this information to expose how
those factors align with your business’s standards and goals.
Data profiling can eliminate costly errors that are common in customer databases. These
errors include null values (unknown or missing values), values that shouldn’t be included,
values with unusually high or low frequency, values that don’t follow expected patterns, and
values outside the normal range.
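As a concrete illustration of the checks listed above (null values, out-of-range values, frequency counts and basic summary statistics), the short sketch below profiles a single toy column in plain Python. The column name, the expected age range, and the sample values are all invented for the example.

from collections import Counter
from statistics import mean

# Toy "customer_age" column containing a null, a negative value and an outlier.
customer_age = [34, 41, None, 29, -3, 41, 250, 38, 41]

EXPECTED_RANGE = (0, 120)   # assumed business rule for a valid age

non_null = [v for v in customer_age if v is not None]

profile = {
    "count":        len(customer_age),
    "nulls":        customer_age.count(None),
    "min":          min(non_null),
    "max":          max(non_null),
    "mean":         round(mean(non_null), 2),
    "frequencies":  Counter(non_null),
    "out_of_range": [v for v in non_null
                     if not EXPECTED_RANGE[0] <= v <= EXPECTED_RANGE[1]],
}

for metric, value in profile.items():
    print(f"{metric}: {value}")

The null entry, the unusually frequent value 41, and the out-of-range values -3 and 250 are exactly the kinds of data-quality issues that the text above says profiling is meant to surface.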