Lecture 8 - Batch Analysis Part 1
Batch Analytics
Lecture Outline
• Batch Analysis frameworks
• Hadoop and MapReduce
• Pig
• Apache Oozie
• Apache Spark
• Apache Solr
Review
• Data Acquisition
Keywords
Review
NoSQL
Batch Analysis frameworks
• Batch analytics is a type of analytics that involves processing large amounts of
data in batches, typically over a period of time.
• Batch analytics typically involve the use of ETL (Extract-Transform-Load)
processes to
• extract data from various sources,
• transform the data into a structure suitable for analytics,
• and load it into a data warehouse or other data repository.
• This is then followed by the use of analytical tools to
• explore the data,
• discover patterns, and
• generate insights.
• Batch analytics can be used to conduct predictive and descriptive analytics, as
well as more complex analytics such as machine learning and deep learning.
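The extract-transform-load flow above can be illustrated with a minimal sketch in Python. The source rows, field names, and the list standing in for a data warehouse are all hypothetical:

```python
# A minimal ETL sketch: extract records from a source, transform them into
# a structure suitable for analytics, and load them into a repository
# (here a plain list standing in for a data warehouse).
raw_source = ["2024-01-03,beta,17", "2024-01-04,alpha,25"]  # hypothetical CSV rows

def extract(source):
    # Extract: pull raw rows out of the source and split them into fields.
    return [row.split(",") for row in source]

def transform(rows):
    # Transform: normalize types and attach field names for analytics.
    return [{"date": d, "product": p, "units": int(u)} for d, p, u in rows]

warehouse = []  # stands in for the data warehouse / data repository

def load(records, repo):
    # Load: append the transformed records to the repository.
    repo.extend(records)

load(transform(extract(raw_source)), warehouse)
print(warehouse[0]["units"])  # 17
```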
Batch Analysis frameworks
• Hadoop-MapReduce
• Pig
• Spark
• Solr.
Batch Analysis frameworks
Hadoop and MapReduce
• Apache Hadoop is an open source framework for distributed batch
processing of big data.
• Similarly, MapReduce is a parallel programming model suitable for the
analysis of big data.
• MapReduce algorithms allow large-scale computations to be
automatically parallelized across a large cluster of servers.
Batch Analysis frameworks
MapReduce
Programming Model
• MapReduce is a parallel data processing model for the processing and analysis
of massive-scale data.
• MapReduce model has two phases: Map and Reduce.
• The input data to the map and reduce phases is in the form of key-value
pairs.
• Run-time systems for MapReduce are typically large clusters built of
commodity hardware.
• The MapReduce run-time systems take care of tasks such as partitioning the
data, scheduling of jobs and communication between nodes in the cluster.
• This makes it easier for programmers to analyze massive scale data
without worrying about tasks such as data partitioning and scheduling.
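The two phases can be sketched in plain Python using word count, the canonical MapReduce example. The input records are illustrative; a real run-time system would distribute this work across a cluster:

```python
from collections import defaultdict

# Hypothetical input: each record is a (key, value) pair; here the key is a
# line number and the value is a line of text (a word-count job).
records = [(0, "big data batch analytics"), (1, "big data processing")]

# Map phase: each record is processed independently of the others,
# emitting intermediate (word, 1) key-value pairs.
def map_fn(key, value):
    for word in value.split():
        yield (word, 1)

intermediate = defaultdict(list)
for key, value in records:
    for out_key, out_val in map_fn(key, value):
        intermediate[out_key].append(out_val)  # shuffle: group values by key

# Reduce phase: all intermediate values with the same key are aggregated.
def reduce_fn(key, values):
    return (key, sum(values))

counts = dict(reduce_fn(k, v) for k, v in intermediate.items())
print(counts)  # {'big': 2, 'data': 2, 'batch': 1, 'analytics': 1, 'processing': 1}
```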
Batch Analysis frameworks
MapReduce
Programming Model
• In the Map phase, data is read from a distributed file system, partitioned
among a set of computing nodes in the cluster, and sent to the nodes as a
set of key-value pairs.
• The Map tasks process the input records independently of each other
and produce intermediate results as key-value pairs.
• When all the Map tasks are completed, the Reduce phase begins in
which the intermediate data with the same key is aggregated.
• An optional Combine task can be used to perform data aggregation on
the intermediate data of the same key for the output of the mapper
before transferring the output to the Reduce task.
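The effect of the optional Combine task can be shown with a small sketch: the same aggregation as the reducer is applied locally to one mapper's output, shrinking the data that must be transferred to the Reduce phase. The mapper output below is illustrative:

```python
from collections import defaultdict

# Toy output of a single Map task: many (word, 1) pairs.
mapper_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]

# Combine task: aggregate values of the same key locally, before the
# intermediate data is transferred to the Reduce task.
def combine(pairs):
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

combined = combine(mapper_output)
print(combined)                                 # [('big', 3), ('data', 1)]
print(len(mapper_output), "->", len(combined))  # 4 -> 2
```

Four intermediate pairs are reduced to two before leaving the node, which is why combiners cut shuffle traffic.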
Batch Analysis frameworks
Hadoop YARN
Batch Analysis frameworks
Hadoop YARN
Figure 8.1
Batch Analysis frameworks
Hadoop YARN
The key components of YARN are described as follows:
1. Resource Manager (RM):
• The RM is the global resource scheduler that arbitrates the cluster’s resources
among all the running applications.
2. Application Master (AM):
• An Application Master (AM) manages the application’s life cycle.
• AM is responsible for negotiating resources from the RM and working with the
NMs to execute and monitor the tasks.
3. Node Manager (NM):
• An NM runs on each machine in the cluster and manages the user processes on
that machine.
4. Containers:
• A container is a bundle of resources (memory, CPU and network) allocated by the RM.
• A container is a conceptual entity that grants an application the privilege to use a
certain amount of resources on a given machine to run a task.
• Each node has multiple containers based on the resource allocations made by
the RM.
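A toy Python model (not the Hadoop API; the node names and capacities are made up) can illustrate what a container grant means: the RM tracks free resources per Node Manager and hands out bundles of memory and CPU on a specific node:

```python
from dataclasses import dataclass

# Toy model of a YARN container: the privilege to use a fixed amount of
# resources on a given machine to run one task.
@dataclass
class Container:
    node: str       # NM node that will run the task
    memory_mb: int  # memory granted by the RM
    vcores: int     # CPU cores granted by the RM

class ResourceManager:
    def __init__(self, node_capacity):
        # free (memory_mb, vcores) tracked per Node Manager
        self.free = dict(node_capacity)

    def allocate(self, memory_mb, vcores):
        """Grant a container on any node with enough free resources."""
        for node, (mem, cpu) in self.free.items():
            if mem >= memory_mb and cpu >= vcores:
                self.free[node] = (mem - memory_mb, cpu - vcores)
                return Container(node, memory_mb, vcores)
        return None  # request waits until resources free up

rm = ResourceManager({"nm1": (4096, 4), "nm2": (2048, 2)})
c = rm.allocate(memory_mb=2048, vcores=2)
print(c)  # Container(node='nm1', memory_mb=2048, vcores=2)
```

Each node can host multiple such containers, up to the allocations the RM has made for it.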
Batch Analysis frameworks
Hadoop YARN
Figure 8.2
Batch Analysis frameworks
Hadoop YARN
• Figure 8.2 shows a YARN cluster with a Resource Manager node and
three Node Manager nodes.
• There are as many Application Masters running as there are applications
(jobs).
• Each application’s AM manages the application tasks such as
• starting, monitoring and restarting tasks in case of failures.
• Each application has multiple tasks.
• Each task runs in a separate container.
• Each container in YARN can be used for both map and reduce tasks.
• The resource allocation model of YARN is more flexible with the
introduction of resource containers which improve cluster utilization.
Batch Analysis frameworks
Hadoop YARN
Figure 8.3
Batch Analysis frameworks
Hadoop YARN
• To better understand the YARN job execution workflow, let us analyze the interactions
between the main components of YARN.
• Figure 8.3 shows the interactions between a Client and Resource Manager.
• Job execution begins with the submission of a new application request by the client to
the RM.
• The RM then responds with a unique application ID and information about cluster
resource capabilities that the client will need in requesting resources for running the
application’s AM.
• Using the information received from the RM, the client constructs and submits an
Application Submission Context which contains information such as scheduler queue,
priority and user information.
• The Application Submission Context also contains a Container Launch Context which
contains the application’s jar, job files, security tokens and any resource requirements.
• The client can query the RM for application reports.
• The client can also "force kill" an application by sending a request to the RM.
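The client–RM handshake above can be sketched as a toy simulation (this is not the YARN client API; the class, field names and values are made up for illustration):

```python
import itertools

class ToyResourceManager:
    _ids = itertools.count(1)

    def new_application(self):
        # RM responds with a unique application ID and information about
        # cluster resource capabilities.
        return {"app_id": next(self._ids),
                "max_container": {"memory_mb": 8192, "vcores": 8}}

    def submit_application(self, submission_context):
        # RM accepts the context; it will later launch the AM in a container.
        return "ACCEPTED"

rm = ToyResourceManager()
resp = rm.new_application()

# Application Submission Context: scheduler queue, priority, user info,
# wrapping a Container Launch Context for the application's AM.
submission_context = {
    "app_id": resp["app_id"],
    "queue": "default", "priority": 1, "user": "alice",
    "am_container_launch_context": {
        "jar": "app.jar", "files": ["job.xml"], "tokens": ["<token>"],
        "resources": {"memory_mb": 1024, "vcores": 1},
    },
}
print(rm.submit_application(submission_context))  # ACCEPTED
```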
Batch Analysis frameworks
Hadoop YARN
Figure 8.4
Batch Analysis frameworks
Hadoop YARN
• Figure 8.4 shows the interactions between the Resource Manager and the
Application Master.
• Upon receiving an application submission context from a client, the RM finds
an available container meeting the resource requirements for running the AM
for the application.
• On finding a suitable container, the RM contacts the NM for the container to
start the AM process on its node.
• When the AM is launched it registers itself with the RM.
• The registration process consists of handshaking that conveys information such
as the port that the AM will be listening on, the tracking URL for monitoring the
application’s status and progress, etc.
• The registration response from the RM contains information for the AM that is
used in calculating and requesting any resource requests for the application’s
individual tasks (such as minimum and maximum resource capabilities for the
cluster).
Batch Analysis frameworks
Hadoop YARN
Figure 8.5
Batch Analysis frameworks
Hadoop YARN
Figure 8.6
Hadoop YARN
Hadoop Schedulers
Fair Scheduler
• The Fair Scheduler was originally developed by Facebook.
• Facebook uses Hadoop to manage the massive content and log data it
accumulates every day.
• The need for the Fair Scheduler arose when Facebook wanted to share its
data warehousing infrastructure between multiple users.
• The Fair Scheduler allocates resources evenly between multiple jobs and also
provides capacity guarantees.
• Fair Scheduler assigns resources to jobs such that each job gets an equal share
of the available resources on average over time.
• The Fair Scheduler lets short jobs finish in reasonable time while not starving
long jobs.
Hadoop YARN
Hadoop Schedulers
Fair Scheduler
• Task slots that are free are assigned to new jobs, so that each job
gets roughly the same amount of CPU time.
• The Fair Scheduler maintains a set of pools into which jobs are placed.
Each pool has a guaranteed capacity.
• When there is a single job running, all the resources are assigned to that
job. When there are multiple jobs in the pools, each pool gets at least as
many task slots as guaranteed.
• This lets the scheduler guarantee capacity for pools while utilizing
resources efficiently when these pools don’t contain jobs.
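The pool-based allocation described above can be sketched as a simplified Python function (the pool names and slot counts are illustrative, and the sketch assumes the active pools' guarantees fit within the total):

```python
# Simplified fair-share allocation: each pool with jobs first receives its
# guaranteed slots, then the remaining slots are split evenly among those
# pools; idle pools' capacity is reclaimed.
def fair_shares(total_slots, pools):
    """pools: {name: {"guaranteed": int, "has_jobs": bool}} -> {name: slots}"""
    active = [p for p, cfg in pools.items() if cfg["has_jobs"]]
    shares = {p: 0 for p in pools}
    if not active:
        return shares
    # Guaranteed capacity for pools that currently contain jobs.
    for p in active:
        shares[p] = min(pools[p]["guaranteed"], total_slots)
    leftover = total_slots - sum(shares.values())
    # Split the leftover evenly among the active pools.
    for i, p in enumerate(active):
        shares[p] += leftover // len(active) + (1 if i < leftover % len(active) else 0)
    return shares

# Two active pools share 10 slots; the idle pool's capacity is reclaimed.
print(fair_shares(10, {
    "etl":     {"guaranteed": 3, "has_jobs": True},
    "adhoc":   {"guaranteed": 2, "has_jobs": True},
    "reports": {"guaranteed": 3, "has_jobs": False},
}))  # {'etl': 6, 'adhoc': 4, 'reports': 0}
```

With a single active pool, the same function hands it all the resources, matching the behavior described above.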
Hadoop YARN
Hadoop Schedulers
Fair Scheduler
• The Fair Scheduler keeps track of the compute time received by each
job.
• The Fair Scheduler is useful when a Hadoop cluster, small or large, is shared
between multiple groups of users.
• Though the fair scheduler ensures fairness by maintaining a set of pools
and providing guaranteed capacity to each pool, it does not provide any
timing guarantees and hence it is ill-equipped for real-time jobs.
Hadoop YARN
Hadoop Schedulers
Capacity Scheduler
• The Capacity Scheduler provides similar functionality to the Fair Scheduler but
adopts a different scheduling philosophy.
• In Capacity Scheduler, multiple named queues are defined, each with a
configurable number of map and reduce slots.
• Each queue is also assigned a guaranteed capacity.
• The Capacity Scheduler gives each queue its capacity when it contains
jobs, and shares any unused capacity between the queues.
Hadoop YARN
Hadoop Schedulers
Capacity Scheduler
• When a TaskTracker has free slots, the Capacity Scheduler picks the queue
with the lowest ratio of running slots to capacity.
• The Capacity Scheduler is useful when a large Hadoop cluster is shared
among multiple clients running jobs of different types and priorities.
• Though the capacity scheduler ensures fairness by maintaining a set of
queues and providing guaranteed capacity to each queue, it does not
provide any timing guarantees and, therefore, it may be ill-equipped for
real-time jobs.
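The queue-selection rule above (pick the queue whose ratio of running slots to capacity is lowest) is simple enough to sketch directly; the queue names and numbers are illustrative:

```python
# Capacity Scheduler queue selection: when slots free up, serve the queue
# that is furthest below its configured capacity, i.e. the one with the
# lowest running-slots-to-capacity ratio.
def pick_queue(queues):
    """queues: {name: {"running": int, "capacity": int}} -> queue name"""
    return min(queues, key=lambda q: queues[q]["running"] / queues[q]["capacity"])

queues = {
    "prod": {"running": 8, "capacity": 10},  # ratio 0.8
    "dev":  {"running": 1, "capacity": 4},   # ratio 0.25
    "test": {"running": 2, "capacity": 4},   # ratio 0.5
}
print(pick_queue(queues))  # dev
```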
Next lecture
Assignment
Deadline