
MCA2004 – Big Data Analytics
Module II – Hadoop and Map Reduce
Contents
• Hadoop
• Components of Hadoop
• Analyzing Big Data with Hadoop
• Design of HDFS
• Developing a Map Reduce Application
• Map Reduce
• Distributed File System (DFS)
• Algorithms using Map Reduce
• Communication Cost Model
• Graph Model for Map Reduce Problem
Hadoop
• Hadoop is an open-source framework from Apache used to store, process, and analyze data that are very huge in volume.
• Hadoop is written in Java and is not OLAP (online analytical processing).
• It is used for batch/offline processing.
• It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more.
• Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files will be broken into blocks and stored on nodes over the distributed architecture.
2. Yarn: Yet Another Resource Negotiator, used for job scheduling and cluster management.
3. Map Reduce: a framework that helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed over as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result (see the sketch after this list).
4. Hadoop Common: Java libraries used to start Hadoop and used by other Hadoop modules.
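
To make the key-value flow concrete, here is a minimal, hypothetical sketch of the Map/Reduce contract in Java generics. The interface name and shape are invented for exposition only; Hadoop's real API appears later in this module.

import java.util.List;
import java.util.Map;

// Hypothetical sketch of the MapReduce contract (not Hadoop's real API):
// map turns one input record into intermediate (K2, V2) pairs; the
// framework groups the pairs by K2; reduce folds each group into a result.
interface MapReduceContract<K1, V1, K2, V2, R> {
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    R reduce(K2 key, List<V2> values);
}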
Hadoop Architecture
• The Hadoop architecture is a package of the file system, the MapReduce engine, and HDFS (Hadoop Distributed File System).
• The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
• A Hadoop cluster consists of a single master and multiple slave nodes.
• The master node includes the Job Tracker, Task Tracker, NameNode, and DataNode, whereas a slave node includes a DataNode and TaskTracker.
• MapReduce layer
• MapReduce processing begins when the client application submits a MapReduce job to the Job Tracker.
• In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times out.
• In such a case, that part of the job is rescheduled.

• HDFS layer
• The Hadoop Distributed File System (HDFS) is the distributed file system for Hadoop.
• It follows a master/slave architecture.
• This architecture consists of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.
• NameNode
• It is the single master server in the HDFS cluster.
• As it is a single node, it can become a single point of failure.
• It manages the file system namespace by executing operations such as opening, renaming, and closing files.
• It simplifies the architecture of the system.

• DataNode
• The HDFS cluster contains multiple DataNodes.
• Each DataNode contains multiple data blocks.
• These data blocks are used to store data.
• It is the responsibility of a DataNode to serve read and write requests from the file system's clients.
• It performs block creation, deletion, and replication upon instruction from the NameNode.

• Job Tracker
• The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
• In response, the NameNode provides metadata to the Job Tracker.

• Task Tracker
• It works as a slave node for the Job Tracker.
• It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
Advantages of Hadoop
• Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
• Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
• Cost effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost effective compared to a traditional relational database management system.
• Resilient to failure: HDFS can replicate data over the network, so if one node is down or some other network failure happens, Hadoop uses another copy of the data. Normally, data is replicated thrice, but the replication factor is configurable.
Disadvantages of Hadoop

• Security Concern
• Not fit for small data
• Vulnerable by nature
Hadoop Technology in Monitoring Patient Vitals
HDFS

• HDFS is a distributed file system that is fault tolerant, scalable, and extremely easy to expand.
• HDFS is the primary distributed storage for Hadoop applications.
• HDFS provides interfaces for applications to move themselves closer to data.
• HDFS is designed to ‘just work’; however, a working knowledge helps in diagnostics and improvements.
Components of HDFS
• There are two (and a half) types of machines in an HDFS cluster:
• NameNode: the heart of an HDFS filesystem; it maintains and manages the file system metadata, e.g., what blocks make up a file and on which DataNodes those blocks are stored.
• DataNode: where HDFS stores the actual data; there are usually quite a few of these.
HDFS Architecture
Unique features of HDFS
• HDFS also has a bunch of unique features that make it ideal for distributed systems:
• Failure tolerant: data is duplicated across multiple DataNodes to protect against machine failures. The default is a replication factor of 3 (every block is stored on three machines).
• Scalability: data transfers happen directly with the DataNodes, so your read/write capacity scales fairly well with the number of DataNodes.
• Space: need more disk space? Just add more DataNodes and re-balance.
• Industry standard: other distributed applications are built on top of HDFS (HBase, MapReduce).

• HDFS is designed to process large data sets with write-once-read-many semantics; it is not for low-latency access.
Goals of HDFS
• Fault detection and recovery:
• Since HDFS includes a large amount of commodity hardware, failure of components is frequent.
• Therefore, HDFS should have mechanisms for quick and automatic fault detection and
recovery.

• Huge datasets:
• HDFS should have hundreds of nodes per cluster to manage the applications having huge
datasets.

• Hardware at data:
• A requested task can be done efficiently when the computation takes place near the data.
• Especially where huge datasets are involved, this reduces network traffic and increases throughput.
HDFS – Data Organization

• Each file written into HDFS is split into data blocks
• Each block is stored on one or more nodes
• Each copy of a block is called a replica
• Block placement policy:
• The first replica is placed on the local node
• The second replica is placed in a different rack
• The third replica is placed in the same rack as the second replica
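
As a small sketch of how this is exposed to clients (the file path is a placeholder), the replication factor of an existing file can be changed through Hadoop's Java API; the NameNode then applies the placement policy above when assigning the replicas.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask for three replicas of this file's blocks (the default);
        // the NameNode decides which nodes and racks actually hold them.
        fs.setReplication(new Path("/data/input.txt"), (short) 3);
    }
}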
Read Operation in HDFS
• The user sends an “open” request to the NameNode to get the locations of the file blocks.
• For each file block, the NameNode returns the addresses of the set of DataNodes holding replicas of that block.
• The number of addresses depends on the number of block replicas.
• The user calls the “read” function to connect to the closest DataNode containing the first block of the file.
• After the first block is streamed from the respective DataNode to the user, the established connection is terminated, and the same process is repeated for all blocks of the requested file until the whole file is streamed to the user.
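
The same flow is visible in Hadoop's Java client API. A minimal sketch, assuming a reachable cluster configured via fs.defaultFS; the path /data/input.txt is a placeholder. The open() call consults the NameNode for block locations, and the returned stream reads each block from a nearby DataNode.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Connects to the NameNode named by fs.defaultFS in the loaded config
        FileSystem fs = FileSystem.get(new Configuration());
        // "open" asks the NameNode for block locations; the stream then
        // pulls each block from the closest DataNode holding a replica
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}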
Write Operation in HDFS
• The user sends a “create” request to the NameNode to create a new file in the file system namespace.
• If the file does not exist, the NameNode notifies the user and allows them to start writing data to the file by calling the “write” function.
• The first block of the file is written to an internal queue termed the data queue, while a data streamer monitors its writing into a DataNode.
• Since each file block needs to be replicated by a predefined factor, the data streamer first sends a request to the NameNode to get a list of suitable DataNodes to store replicas of the first block.
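
A minimal write sketch using the same client API (the path and content are placeholders). The create() call registers the file with the NameNode, and the output stream's data queue and streamer pipeline each block to the DataNodes chosen for its replicas.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // "create" registers the new file in the NameNode's namespace;
        // the returned stream queues data while the data streamer pipelines
        // each block to the DataNodes the NameNode picked for its replicas
        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            out.writeBytes("hello HDFS\n");
        }
    }
}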
HDFS Security
• Authentication to Hadoop
• Simple – an insecure way of using the OS username to determine Hadoop identity
• Kerberos – authentication using a Kerberos ticket
• Set by hadoop.security.authentication=simple|kerberos

• File and directory permissions are the same as in POSIX
• read (r), write (w), and execute (x) permissions
• each file and directory also has an owner, a group, and a mode
• enabled by default (dfs.permissions.enabled=true)

• ACLs are used for implementing permissions that differ from the natural hierarchy of users and groups
• enabled by dfs.namenode.acls.enabled=true
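
These properties normally live in core-site.xml and hdfs-site.xml on the cluster; as an illustration only, the same keys can also be set programmatically on a client Configuration.

import org.apache.hadoop.conf.Configuration;

public class SecurityConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos"); // or "simple"
        conf.set("dfs.permissions.enabled", "true");            // POSIX-style permission checks
        conf.set("dfs.namenode.acls.enabled", "true");          // allow ACLs beyond owner/group
    }
}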
MapReduce
• MapReduce is a programming model and framework within the Hadoop ecosystem that enables efficient processing of big data by automatically distributing and parallelizing the computation.
• It consists of two fundamental tasks: Map and Reduce.
• In the Map phase, the input data is divided into smaller chunks and processed independently in parallel across multiple nodes in a distributed computing environment.
• Each chunk is transformed, or “mapped”, into key-value pairs by applying a user-defined function. The output of the Map phase is a set of intermediate key-value pairs.
• The Reduce phase follows the Map phase. It gathers the intermediate key-value pairs generated by the Map tasks, performs data shuffling to group together pairs with the same key, and then applies a user-defined reduction function to aggregate and process the data.
• The output of the Reduce phase is the final result of the computation.
• MapReduce allows for efficient processing of large-scale datasets by leveraging parallelism and distributing the workload across a cluster of machines.
• It simplifies the development of distributed data processing applications by abstracting away the complexities of parallelization, data distribution, and fault tolerance, making it an essential tool for big data processing in the Hadoop ecosystem.
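
The canonical example is word count. Below is a minimal sketch of a mapper and reducer using Hadoop's MapReduce API; the class names are illustrative, and both classes are assumed to sit in one source file.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input record arrives as (byte offset, line text);
// the mapper emits one intermediate (word, 1) pair per word.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: the framework groups pairs by word, so each call
// receives one word with all of its counts; the reducer sums them.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}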
Why do we need MapReduce?
• Processing web data on a single machine:
• 20+ billion web pages × 20 KB = 400+ terabytes
• One computer can read 30–35 MB/sec from disk
• ~ four months to read the web
• ~1,000 hard drives just to store the web
• Even more to do something with the data

• Takes too long on a single machine, but with 1,000 machines?
• < 3 hours to perform on 1,000 machines
• But how long to program? What about the overheads?
• Communication, coordination, recovery from machine failure
• Status reporting, debugging, optimization, locality
• Reinventing the wheel: this has to be done for every program!
Advantages
The advantages of using MapReduce are as follows:
• MapReduce can define mapper and reducer in several different languages using Hadoop streaming.
• MapReduce facilitates automatic parallelization and distribution, reducing the time required to run
programs.
• MapReduce provides fault tolerance by re-executing, writing map output to a distributed file
system, and restarting failed map or reducer tasks.
• Processing of data using MapReduce is a cost-effective solution.
• MapReduce processes large volumes of unstructured data very quickly.
• Using HDFS and HBase security, Map Reduce ensures data security by allowing only approved
users to access data stored in the system.
• MapReduce programming utilizes a simple programming model to handle tasks more efficiently
and quickly and is easy to learn.
• MapReduce is flexible and works with several Hadoop languages to handle and store data.
The MapReduce process has the following phases:

1. Input Splits
2. Mapping
3. Shuffling
4. Sorting
5. Reducing
Input Splits
• MapReduce splits the input into smaller chunks called input splits, each representing a block of work handled by a single mapper task.
Mapping
• The input data is processed and divided into smaller segments in the mapper phase, where the number of mappers is equal to the number of input splits.
• A RecordReader produces key-value pairs from each input split (using TextInputFormat by default), which the mapper then uses as input.
• The mapper processes these key-value pairs using coding logic to produce an output of the same form.
Shuffling
• In the shuffling phase, the output of the mapper phase is passed to the reducer phase, grouping together the values that share the same key.
• The output remains in the form of key-value pairs from the mapper phase.
• Since shuffling can begin even before the mapper phase is complete, it saves time.
Sorting
• Sorting is performed simultaneously with shuffling.
• The sorting phase involves merging and sorting the output generated by the mapper.
• The intermediate key-value pairs are sorted by key before the reducer phase starts, and the values can take any order. Sorting by value is done by secondary sorting.
Reducing
• In the reducer phase, the system reduces the intermediate values from the shuffling phase to produce a single output value per key that summarizes the dataset.
• The process then uses HDFS to store the final output.
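
Tying the phases together, here is a minimal driver sketch for the word-count example shown earlier (class names and argument handling are illustrative): the input path yields the input splits, the mapper and reducer classes implement the mapping and reducing phases, and the final output is written to HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);   // mapping phase
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);    // reducing phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits come from here
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // final output lands on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}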
Parallelism

• Map functions run in parallel, creating intermediate values from each input data set
• The programmer must specify a proper input split (chunk) between mappers to enable parallelism

• Reduce functions also run in parallel; each works on a different set of output keys
• The number of reducers is a key parameter that determines MapReduce performance (see the snippet after this list)

• All values are processed independently

• The Reduce phase cannot start until the Map phase is completely finished
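
For instance, the reducer count is set on the Job; the value 4 below is an arbitrary illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tuning example");
        // More reducers increase parallelism in the reduce phase,
        // but each reducer produces its own output partition.
        job.setNumReduceTasks(4);
    }
}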
Limitations of MapReduce

MapReduce also has some limitations, and they are as follows:
• MapReduce is a low-level programming model that involves writing a lot of code.
• The batch-based processing nature of MapReduce makes it unsuitable for real-
time processing.
• It does not support data pipelining or overlapping of Map and Reduce phases.
• Task initialization, coordination, monitoring, and scheduling take up a large chunk
of MapReduce’s execution time and reduce its performance.
• MapReduce cannot cache the intermediate data in memory, thereby diminishing
Hadoop’s performance
MapReduce Programming Model
Graphs

• Graphs are everywhere: social graphs representing connections, or citation graphs representing hierarchy in scientific research
• Due to massive scale, it is impractical to use conventional techniques for graph storage and in-memory analysis
• These constraints have driven the development of scalable systems such as distributed file systems like the Google File System and the Hadoop Distributed File System
• MapReduce provides a good way to partition and analyze graphs (see the sketch below)
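
As one concrete illustration (a hypothetical sketch, assuming an edge-list input with one "nodeA nodeB" pair per line), computing every node's degree maps naturally onto the model: the mapper credits one count to each endpoint of an edge, and a summing reducer like IntSumReducer above totals the counts per node.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (node, 1) for each endpoint of every edge; summing these
// counts per node in the reducer yields each node's degree.
class DegreeCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] endpoints = line.toString().trim().split("\\s+");
        if (endpoints.length == 2) {
            context.write(new Text(endpoints[0]), ONE);
            context.write(new Text(endpoints[1]), ONE);
        }
    }
}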
