

Big Data Management & Analytics


PGDM Trimester III

Lecture by
Dr. Ruchi Garg
BIMTECH
Greater Noida
Layout

 Big Data Analytics
 World of HADOOP
Big Data Analytics

 Descriptive: What happened? History, e.g. the footfall in a mall. Hindsight.
 Diagnostic: Why did it happen? Identifies the drivers of change, e.g. why footfall in the mall dropped. Insight.
 Predictive: What might happen? Uses AI tools, e.g. by how much sales in the mall will decrease. Foresight.
 Prescriptive: What needs to be done? E.g. which offers to make.
 Cognitive: AI and analytical tools that derive the solution themselves. The most critical level. Example?
World of HADOOP

 Hadoop is an open-source software platform for distributed storage and distributed processing of very large data sets on computer clusters.
HADOOP

 Hadoop HDFS to store data across slave machines
 Hadoop YARN for resource management in the Hadoop cluster
 Hadoop MapReduce to process data in a distributed fashion
HADOOP

 The Hadoop Distributed File System (HDFS) is Hadoop's storage layer. Data, housed on multiple servers, is divided into blocks based on file size; these blocks are then distributed and stored across slave machines.
 HDFS in the Hadoop architecture divides large data into different blocks. Each block holds 128 MB of data by default and is replicated three times.
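As a concrete illustration, the sketch below uses Hadoop's Java FileSystem API to list a file's blocks and the DataNodes holding each replica. The NameNode address and file path are assumptions made up for the example:

```java
// Minimal sketch: inspecting the blocks of an HDFS file.
// The cluster address and file path are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sales.csv"); // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation is one block (128 MB by default); getHosts()
        // lists the DataNodes holding its replicas (three by default).
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d replicas=%s%n",
                    b.getOffset(), b.getLength(),
                    String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```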
HDFS

[Figure: HDFS block replication — blocks A, B, C, and D distributed across DataNodes on three racks]

 In this example, blocks A, B, C, and D are replicated three times and placed on different racks. If DataNode 7 crashes, we still have two copies of block C: on DataNode 4 of Rack 1 and on DataNode 9 of Rack 3.
YARN

 YARN (Yet Another Resource Negotiator) is the resource manager that arbitrates all available cluster resources. It also follows the master/slave approach: YARN has one ResourceManager (master) per cluster and one NodeManager (slave) per node.
YARN

 The ResourceManager keeps metadata about which jobs are running on which node and manages how much memory and CPU each consumes; it therefore has a holistic view of the total CPU and RAM consumption of the whole cluster.

 The NodeManager is the per-machine agent that is responsible for monitoring its node's resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager.
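To make this reporting concrete, here is a minimal sketch that asks the ResourceManager for each NodeManager's reported usage through the YarnClient API; the ResourceManager address is an assumption:

```java
// Minimal sketch: per-node resource usage as seen by the ResourceManager,
// mirroring the heartbeat data the NodeManagers report.
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterView {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        conf.set(YarnConfiguration.RM_ADDRESS, "resourcemanager:8032"); // assumed

        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();

        // One NodeReport per NodeManager: used vs. total resources
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.printf("%s used=%s capability=%s%n",
                    node.getNodeId(),
                    node.getUsed(),        // memory/vcores currently consumed
                    node.getCapability()); // total memory/vcores on the node
        }
        yarn.stop();
    }
}
```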
YARN

 The elements of YARN include:
 ResourceManager (one per cluster)
 ApplicationMaster (one per application)
 NodeManager (one per node)
YARN

 ResourceManager: manages the resource allocation in the cluster and is responsible for tracking how many resources are available in the cluster, including each NodeManager's contribution.
 ApplicationMaster: manages the resource needs of an individual application. It connects with the NodeManager to execute and monitor tasks.
 NodeManager: tracks running jobs and sends signals (heartbeats) to the ResourceManager to relay the status of a node. It also monitors each container's resource utilization.
 Container: houses a collection of resources such as RAM, CPU, and network bandwidth.
Steps to Running an Application in YARN

 The client submits an application to the ResourceManager.
 The ResourceManager allocates a container.
 The ApplicationMaster contacts the related NodeManager because it needs to use the containers.
 The NodeManager launches the container.
 The container executes the ApplicationMaster.
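A minimal sketch of the first steps from the client's side, using the YarnClient API; the application name, launch command, and container sizes are placeholder assumptions:

```java
// Minimal sketch: a client submitting an application to the
// ResourceManager, which then allocates a container for the
// ApplicationMaster. Command and resource sizes are hypothetical.
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitApp {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        YarnClientApplication app = yarn.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app"); // placeholder name

        // What the ApplicationMaster container should run (placeholder command)
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                null, null,
                Collections.singletonList("echo hello-yarn"),
                null, null, null);
        ctx.setAMContainerSpec(amContainer);

        // RAM (MB) and vcores the RM should reserve for the AM container
        ctx.setResource(Resource.newInstance(1024, 1));

        yarn.submitApplication(ctx); // step 1: client -> ResourceManager
        yarn.stop();
    }
}
```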
Map Phase

 In the map phase, data stored in blocks is read, processed, and given a key-value pair. This phase is responsible for running a particular task on one or multiple splits of the input.
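A minimal sketch of a map task, using the canonical word-count example (the tokenization logic is illustrative):

```java
// Word-count mapper: each line of an input split becomes (word, 1) pairs.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token on the line
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }
}
```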
Reduce Phase

 The reduce phase receives the key-value pairs from the map phase. The pairs are aggregated into smaller sets, and an output is produced. Processes such as shuffling and sorting occur in the reduce phase.
 The mapper function handles the input data and runs a function on every input split (these runs are known as map tasks). There can be one or multiple map tasks, based on the size of the file and the configuration setup. The data is then sorted, shuffled, and moved to the reduce phase, where a reduce function aggregates the data and provides the output.
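The matching word-count reducer sketch: after shuffle and sort, each word arrives with its list of counts, which the reducer aggregates into a single total:

```java
// Word-count reducer: the shuffled/sorted (word, [1,1,...]) groups
// arrive here and are aggregated into (word, count).
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get(); // add up all the 1s emitted for this word
        }
        ctx.write(word, new IntWritable(sum));
    }
}
```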
References

 https://www.simplilearn.com/tutorials/hadoop-tutorial/hadoop-architecture
 https://towardsdatascience.com/the-world-of-hadoop-d1e5f5eb98d
Thank You
