Module II – Hadoop and Map Reduce
Contents
• Hadoop
• Components of Hadoop
• Analyzing Big data with Hadoop
• Design of HDFS
• Developing a Map reduce Application
• Map Reduce
• Distributed File System (DFS)
• Algorithms using Map Reduce
• Communication cost Model
• Graph Model for Map Reduce Problem
Hadoop
• Hadoop is an open-source framework from Apache that is used to store, process, and analyze data that is very huge in volume.
• Hadoop is written in Java and is not an OLAP (online analytical processing) system.
• It is used for batch/offline processing.
• It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many others.
• Moreover, it can be scaled up simply by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files are broken into blocks and stored on nodes over the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and cluster management.
3. Map Reduce: This is a framework that helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer then gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by the other Hadoop modules.
Hadoop Architecture
• The Hadoop architecture is a package of the file system,
MapReduce engine and the HDFS (Hadoop Distributed File
System).
• The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
• A Hadoop cluster consists of a single master and multiple slave
nodes.
• The master node includes Job Tracker, Task Tracker, NameNode,
and DataNode whereas the slave node includes DataNode and
TaskTracker.
• MapReduce layer
• MapReduce comes into play when the client application submits a MapReduce job to the Job Tracker.
• In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times out.
• In such a case, that part of the job is rescheduled.
• HDFS layer
• The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop.
• It has a master/slave architecture.
• This architecture consists of a single NameNode, which performs the role of master, and multiple DataNodes, which perform the role of slaves.
• NameNode
• It is the single master server in the HDFS cluster.
• As it is a single node, it can become a single point of failure.
• It manages the file system namespace by executing operations such as opening, renaming, and closing files.
• It simplifies the architecture of the system.
• DataNode
• The HDFS cluster contains multiple DataNodes.
• Each DataNode contains multiple data blocks.
• These data blocks are used to store data.
• It is the responsibility of the DataNode to serve read and write requests from the file system's clients (a client-side sketch follows this list).
• It performs block creation, deletion, and replication upon instruction from the NameNode.
• Job Tracker
• The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
• In response, the NameNode provides metadata to the Job Tracker.
• Task Tracker
• It works as a slave node for the Job Tracker.
• It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
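To make these roles concrete, here is a minimal client-side sketch using Hadoop's Java FileSystem API; the directory and file names are hypothetical, and the cluster address is taken from the standard configuration files. Namespace operations (creating a directory, listing) are answered by the NameNode, while file contents are written to and read from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);       // client handle; metadata calls go to the NameNode

    Path dir = new Path("/user/demo");          // hypothetical directory
    fs.mkdirs(dir);                             // namespace change recorded by the NameNode

    // Write: the client streams file blocks to DataNodes chosen by the NameNode
    fs.copyFromLocalFile(new Path("input.txt"), new Path(dir, "input.txt"));

    // List: answered entirely from NameNode metadata
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath() + "  replication=" + status.getReplication());
    }

    // Read: block contents are streamed back from the DataNodes that hold them
    try (FSDataInputStream in = fs.open(new Path(dir, "input.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
    fs.close();
  }
}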
Advantages of Hadoop
• Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
• Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
• Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
• Resilient to failure: HDFS has the property that it can replicate data over the network, so if one node is down or some other network failure happens, Hadoop uses another copy of the data. Normally, data is replicated three times, but the replication factor is configurable.
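As a small illustration of the configurable replication factor, the sketch below (hypothetical path) changes the number of copies kept for one file through the same FileSystem API; the cluster-wide default for new files comes from the dfs.replication property in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/input.txt");   // hypothetical file

    // Keep 2 copies of this particular file instead of the cluster default (normally 3)
    fs.setReplication(file, (short) 2);

    System.out.println("replication = " + fs.getFileStatus(file).getReplication());
    fs.close();
  }
}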
Disadvantages of Hadoop
• Security Concern
• Not fit for small data
• Vulnerable by nature
Hadoop Technology In Monitoring Patient
Vitals
HDFS
• Huge datasets:
• HDFS should have hundreds of nodes per cluster to manage the applications having huge
datasets.
• Hardware at data:
• A requested task can be done efficiently when the computation takes place near the data.
• Especially where huge datasets are involved, this reduces the network traffic and increases the throughput.
HDFS – Data Organization
• ACLs are used for implementing permissions that differ from the natural hierarchy of users and groups.
• They are enabled by setting dfs.namenode.acls.enabled=true in hdfs-site.xml.
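As a rough illustration (the directory, user, and group names below are made up), once that property is set, ACLs can be managed from the command line with the hdfs dfs -setfacl and -getfacl commands:

# Assumes dfs.namenode.acls.enabled=true in hdfs-site.xml; names and paths are examples only
hdfs dfs -setfacl -m user:alice:rw- /projects/shared      # grant one extra user read/write
hdfs dfs -setfacl -m group:analysts:r-x /projects/shared  # grant a group read/execute
hdfs dfs -getfacl /projects/shared                        # inspect the resulting ACL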
MapReduce
• MapReduce is a programming model and framework within the Hadoop ecosystem
that enables efficient processing of big data by automatically distributing and
parallelizing the computation.
• It consists of two fundamental tasks: Map and Reduce.
• In the Map phase, the input data is divided into smaller chunks that are processed independently, in parallel, across multiple nodes in a distributed computing environment.
• Each chunk is transformed, or "mapped", into key-value pairs by applying a user-defined function. The output of the Map phase is a set of intermediate key-value pairs.
• The Reduce phase follows the Map phase. It gathers the intermediate key-value pairs
generated by the Map tasks, performs data shuffling to group together pairs with the
same key, and then applies a user-defined reduction function to aggregate and
process the data.
• The output of the Reduce phase is the final result of the computation.
• MapReduce allows for efficient processing of large-scale datasets by leveraging parallelism and distributing the workload across a cluster of machines.
• It simplifies the development of distributed data processing applications by abstracting away the complexities of parallelization, data distribution, and fault tolerance, making it an essential tool for big data processing in the Hadoop ecosystem (see the word-count sketch below).
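Below is a minimal word-count sketch written against Hadoop's Java MapReduce API. This is the standard textbook example rather than a production job; the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts gathered for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each map() call emits (word, 1) pairs; the framework shuffles and sorts them by key, and each reduce() call sums the counts for one word before the result is written back to HDFS.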
Why do we need MapReduce?
• Processing Web Data on a Single Machine
• 20+ billion web pages x 20KB = 400+ terabytes
• One computer can read 30‐35 MB/sec from disk
• ~ four months to read the web
• ~1,000 hard drives just to store the web
• Even more to do something with the data
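A quick back-of-the-envelope check of these figures: 400 TB ÷ ~33 MB/s ≈ 1.2 × 10^7 seconds ≈ 140 days, i.e. roughly four to five months on a single machine. Spread over 1,000 disks reading in parallel, the same scan takes about 1.2 × 10^4 seconds, or roughly 3.5 hours, which is the kind of speedup MapReduce is designed to deliver.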
Phases of MapReduce
1. Input Splits
2. Mapping
3. Shuffling
4. Sorting
5. Reducing
Input Splits
• MapReduce splits the input into smaller chunks called input splits, each representing a block of work handled by a single mapper task.
Mapping
• The input data is processed and divided into smaller segments in the mapper phase, where the number of mappers is equal to the number of input splits.
• A RecordReader converts each input split into key-value pairs (for example, via TextInputFormat), which the mapper consumes as input.
• The mapper then processes these key-value pairs using user-defined logic to produce output in the same key-value form.
Shuffling
• In the shuffling phase, the output of the mapper phase is transferred to the reducer phase, with all values that share the same key grouped together.
• The output remains in the form of keys and values, as in the mapper phase.
• Since shuffling can begin even before the mapper phase is complete, it saves time.
Sorting
• Sorting is performed simultaneously with shuffling.
• The Sorting phase involves merging and sorting the output generated by the mapper.
• The intermediate key-value pairs are sorted by key before the reducer phase starts, and the values within a key can appear in any order; sorting by value is done with a secondary sort.
Reducing
• In the reducer phase, the system reduces the intermediate values from the shuffling phase, aggregating the values for each key to produce the output that summarizes the dataset.
• The final output is then stored in HDFS.
Parallelism
• Map functions run in parallel, creating intermediate values from each input data set.
• The programmer must specify a proper input split (chunk) size so that the work can be divided among mappers to enable parallelism.
• Reduce functions also run in parallel; each works on a different set of output keys.
• The number of reducers is a key parameter that determines MapReduce performance (a configuration sketch follows this list).
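A hedged sketch of where these two knobs live in the Java job-configuration API (the 64 MB split cap and the reducer count of 8 are arbitrary example values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ParallelismTuning {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "tuned job");

    // Cap the input split size so that a large input is divided among more parallel mappers
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);  // 64 MB per split (example value)

    // The number of reducers directly controls reduce-side parallelism
    job.setNumReduceTasks(8);                                      // example value

    // ... set mapper/reducer classes and input/output paths as in the word-count sketch ...
  }
}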
MapReduce also faces some limitations, which are as follows:
• MapReduce is a low-level programming model that involves writing a lot of code.
• The batch-based processing nature of MapReduce makes it unsuitable for real-
time processing.
• It does not support data pipelining or overlapping of Map and Reduce phases.
• Task initialization, coordination, monitoring, and scheduling take up a large chunk
of MapReduce’s execution time and reduce its performance.
• MapReduce cannot cache the intermediate data in memory, thereby diminishing
Hadoop’s performance
MapReduce Programming Model
Graphs