Module II – Hadoop and Map Reduce
Contents
• Hadoop
• Components of Hadoop
• Analyzing Big data with Hadoop
• Design of HDFS
• Developing a Map reduce Application
• Map Reduce
• Distributed File System (DFS)
• Algorithms using Map Reduce
• Communication cost Model
• Graph Model for Map Reduce Problem
Hadoop
• Hadoop is an open-source framework from Apache that is used to store, process, and analyze data that is very huge in volume.
• Hadoop is written in Java and is not an OLAP (online analytical processing) system.
• It is used for batch/offline processing.
• It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many others.
• Moreover, it can be scaled up simply by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files are broken into blocks and stored on nodes over the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and cluster management.
3. Map Reduce: This is a framework that helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer then gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by the other Hadoop modules.
Hadoop Architecture
• The Hadoop architecture is a package of the file system,
MapReduce engine and the HDFS (Hadoop Distributed File
System).
• The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
• A Hadoop cluster consists of a single master and multiple slave
nodes.
• The master node includes Job Tracker, Task Tracker, NameNode,
and DataNode whereas the slave node includes DataNode and
TaskTracker.
• MapReduce layer
• MapReduce comes into play when the client application submits a MapReduce job to the Job Tracker.
• In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times out.
• In such a case, that part of the job is rescheduled.
• HDFS layer
• The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop.
• It has a master/slave architecture.
• This architecture consists of a single NameNode, which performs the role of master, and multiple DataNodes, which perform the role of slaves.
• NameNode
• It is the single master server in the HDFS cluster.
• As it is a single node, it can become a single point of failure.
• It manages the file system namespace by executing operations such as opening, renaming, and closing files.
• It simplifies the architecture of the system.
• DataNode
• The HDFS cluster contains multiple DataNodes.
• Each DataNode contains multiple data blocks.
• These data blocks are used to store data.
• It is the responsibility of the DataNode to serve read and write requests from the file system's clients (a client-side sketch follows this list).
• It performs block creation, deletion, and replication upon instruction from the NameNode.
• Job Tracker
• The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
• In response, the NameNode provides metadata to the Job Tracker.
• Task Tracker
• It works as a slave node for the Job Tracker.
• It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
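To make these roles concrete, here is a minimal client-side sketch using Hadoop's Java FileSystem API; the directory and file names are hypothetical, and the cluster address is taken from the standard configuration files. Namespace operations (creating a directory, listing) are answered by the NameNode, while file contents are written to and read from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);       // client handle; metadata calls go to the NameNode

    Path dir = new Path("/user/demo");          // hypothetical directory
    fs.mkdirs(dir);                             // namespace change recorded by the NameNode

    // Write: the client streams file blocks to DataNodes chosen by the NameNode
    fs.copyFromLocalFile(new Path("input.txt"), new Path(dir, "input.txt"));

    // List: answered entirely from NameNode metadata
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath() + "  replication=" + status.getReplication());
    }

    // Read: block contents are streamed back from the DataNodes that hold them
    try (FSDataInputStream in = fs.open(new Path(dir, "input.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
    fs.close();
  }
}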
Advantages of Hadoop
• Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
• Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
• Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
• Resilient to failure: HDFS has the property that it can replicate data over the network, so if one node is down or some other network failure happens, Hadoop uses another copy of the data. Normally, data is replicated three times, but the replication factor is configurable.
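As a small illustration of the configurable replication factor, the sketch below (hypothetical path) changes the number of copies kept for one file through the same FileSystem API; the cluster-wide default for new files comes from the dfs.replication property in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/input.txt");   // hypothetical file

    // Keep 2 copies of this particular file instead of the cluster default (normally 3)
    fs.setReplication(file, (short) 2);

    System.out.println("replication = " + fs.getFileStatus(file).getReplication());
    fs.close();
  }
}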
Disadvantages of Hadoop
• Security Concern
• Not fit for small data
• Vulnerable by nature
Hadoop Technology In Monitoring Patient
Vitals
HDFS
• Huge datasets:
• HDFS should have hundreds of nodes per cluster to manage the applications having huge
datasets.
• Hardware at data:
• A requested task can be done efficiently when the computation takes place near the data.
• Especially where huge datasets are involved, this reduces the network traffic and increases the throughput.
HDFS – Data Organization
• ACLs are used for implementing permissions that differ from the natural hierarchy of users and groups.
• They are enabled by setting dfs.namenode.acls.enabled=true in hdfs-site.xml.
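As a rough illustration (the directory, user, and group names below are made up), once that property is set, ACLs can be managed from the command line with the hdfs dfs -setfacl and -getfacl commands:

# Assumes dfs.namenode.acls.enabled=true in hdfs-site.xml; names and paths are examples only
hdfs dfs -setfacl -m user:alice:rw- /projects/shared      # grant one extra user read/write
hdfs dfs -setfacl -m group:analysts:r-x /projects/shared  # grant a group read/execute
hdfs dfs -getfacl /projects/shared                        # inspect the resulting ACL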
MapReduce
• MapReduce is a programming model and framework within the Hadoop ecosystem
that enables efficient processing of big data by automatically distributing and
parallelizing the computation.
• It consists of two fundamental tasks: Map and Reduce.
• In the Map phase, the input data is divided into smaller chunks that are processed independently, in parallel, across multiple nodes in a distributed computing environment.
• Each chunk is transformed, or "mapped", into key-value pairs by applying a user-defined function. The output of the Map phase is a set of intermediate key-value pairs.
• The Reduce phase follows the Map phase. It gathers the intermediate key-value pairs
generated by the Map tasks, performs data shuffling to group together pairs with the
same key, and then applies a user-defined reduction function to aggregate and
process the data.
• The output of the Reduce phase is the final result of the computation.
• MapReduce allows for efficient processing of large-scale datasets by leveraging parallelism and distributing the workload across a cluster of machines.
• It simplifies the development of distributed data processing applications by abstracting away the complexities of parallelization, data distribution, and fault tolerance, making it an essential tool for big data processing in the Hadoop ecosystem (see the word-count sketch below).
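Below is a minimal word-count sketch written against Hadoop's Java MapReduce API. This is the standard textbook example rather than a production job; the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts gathered for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each map() call emits (word, 1) pairs; the framework shuffles and sorts them by key, and each reduce() call sums the counts for one word before the result is written back to HDFS.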
Why do we need MapReduce?
• Processing Web Data on a Single Machine
• 20+ billion web pages x 20KB = 400+ terabytes
• One computer can read 30‐35 MB/sec from disk
• ~ four months to read the web
• ~1,000 hard drives just to store the web
• Even more to do something with the data
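A quick back-of-the-envelope check of these figures: 400 TB ÷ ~33 MB/s ≈ 1.2 × 10^7 seconds ≈ 140 days, i.e. roughly four to five months on a single machine. Spread over 1,000 disks reading in parallel, the same scan takes about 1.2 × 10^4 seconds, or roughly 3.5 hours, which is the kind of speedup MapReduce is designed to deliver.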
Phases of MapReduce
1. Input Splits
2. Mapping
3. Shuffling
4. Sorting
5. Reducing
Input Splits
• MapReduce splits the input into smaller chunks called input splits, each representing a block of work handled by a single mapper task.
Mapping
• The input data is processed and divided into smaller segments in the mapper phase, where the number of mappers is equal to the number of input splits.
• A RecordReader converts each input split into key-value pairs (for example, via TextInputFormat), which the mapper consumes as input.
• The mapper then processes these key-value pairs using user-defined logic to produce output in the same key-value form.
Shuffling
• In the shuffling phase, the output of the mapper phase is transferred to the reducer phase, with all values that share the same key grouped together.
• The output remains in the form of keys and values, as in the mapper phase.
• Since shuffling can begin even before the mapper phase is complete, it saves time.
Sorting
• Sorting is performed simultaneously with shuffling.
• The Sorting phase involves merging and sorting the output generated by the mapper.
• The intermediate key-value pairs are sorted by key before the reducer phase starts, and the values within a key can appear in any order; sorting by value is done with a secondary sort.
Reducing
• In the reducer phase, the system reduces the intermediate values from the shuffling phase, aggregating the values for each key to produce the output that summarizes the dataset.
• The final output is then stored in HDFS.
Parallelism
• Map functions run in parallel, creating intermediate values from each input data set.
• The programmer must specify a proper input split (chunk) size so that the work can be divided among mappers to enable parallelism.
• Reduce functions also run in parallel; each works on a different set of output keys.
• The number of reducers is a key parameter that determines MapReduce performance (a configuration sketch follows this list).
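A hedged sketch of where these two knobs live in the Java job-configuration API (the 64 MB split cap and the reducer count of 8 are arbitrary example values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ParallelismTuning {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "tuned job");

    // Cap the input split size so that a large input is divided among more parallel mappers
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);  // 64 MB per split (example value)

    // The number of reducers directly controls reduce-side parallelism
    job.setNumReduceTasks(8);                                      // example value

    // ... set mapper/reducer classes and input/output paths as in the word-count sketch ...
  }
}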
MapReduce also faces some limitations, which are as follows:
• MapReduce is a low-level programming model that involves writing a lot of code.
• The batch-based processing nature of MapReduce makes it unsuitable for real-
time processing.
• It does not support data pipelining or overlapping of Map and Reduce phases.
• Task initialization, coordination, monitoring, and scheduling take up a large chunk
of MapReduce’s execution time and reduce its performance.
• MapReduce cannot cache the intermediate data in memory, thereby diminishing
Hadoop’s performance
MapReduce Programming Model
Graphs