Lecture 8 - Batch Analysis Part 1
Batch Analytics
Lecture Outline
• Batch Analysis frameworks
• Hadoop and MapReduce
• Pig
• Apache Oozie
• Apache Spark
• Apache Solr
Review
• Data Acquisition
Keywords
Review
NoSQL
Batch Analysis frameworks
• Batch analytics is a type of analytics that involves processing large amounts of
data in batches, typically over a period of time.
• Batch analytics typically involve the use of ETL (Extract-Transform-Load)
processes to
• extract data from various sources,
• transform the data into a structure suitable for analytics,
• and load it into a data warehouse or other data repository.
• This is then followed by the use of analytical tools to
• explore the data,
• discover patterns, and
• generate insights.
• Batch analytics can be used to conduct predictive and descriptive analytics, as
well as more complex analytics such as machine learning and deep learning.
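The extract-transform-load flow above can be illustrated with a minimal sketch in Python. The source rows, field names, and the list standing in for a data warehouse are all hypothetical:

```python
# A minimal ETL sketch: extract records from a source, transform them into
# a structure suitable for analytics, and load them into a repository
# (here a plain list standing in for a data warehouse).
raw_source = ["2024-01-03,beta,17", "2024-01-04,alpha,25"]  # hypothetical CSV rows

def extract(source):
    # Extract: pull raw rows out of the source and split them into fields.
    return [row.split(",") for row in source]

def transform(rows):
    # Transform: normalize types and attach field names for analytics.
    return [{"date": d, "product": p, "units": int(u)} for d, p, u in rows]

warehouse = []  # stands in for the data warehouse / data repository

def load(records, repo):
    # Load: append the transformed records to the repository.
    repo.extend(records)

load(transform(extract(raw_source)), warehouse)
print(warehouse[0]["units"])  # 17
```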
Batch Analysis frameworks
• Hadoop-MapReduce
• Pig
• Spark
• Solr.
Batch Analysis frameworks
Hadoop and MapReduce
• Apache Hadoop is an open source framework for distributed batch
processing of big data.
• Similarly, MapReduce is a parallel programming model suitable for the
analysis of big data.
• MapReduce algorithms allow large-scale computations to be
automatically parallelized across a large cluster of servers.
Batch Analysis frameworks
MapReduce
Programming Model
• MapReduce is a parallel data processing model for the processing and analysis
of massive-scale data.
• MapReduce model has two phases: Map and Reduce.
• The input data to the map and reduce phases is in the form of key-value
pairs.
• Run-time systems for MapReduce are typically large clusters built of
commodity hardware.
• The MapReduce run-time systems take care of tasks such as partitioning the
data, scheduling of jobs and communication between nodes in the cluster.
• This makes it easier for programmers to analyze massive scale data
without worrying about tasks such as data partitioning and scheduling.
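The two phases can be sketched in plain Python using word count, the canonical MapReduce example. The input records are illustrative; a real run-time system would distribute this work across a cluster:

```python
from collections import defaultdict

# Hypothetical input: each record is a (key, value) pair; here the key is a
# line number and the value is a line of text (a word-count job).
records = [(0, "big data batch analytics"), (1, "big data processing")]

# Map phase: each record is processed independently of the others,
# emitting intermediate (word, 1) key-value pairs.
def map_fn(key, value):
    for word in value.split():
        yield (word, 1)

intermediate = defaultdict(list)
for key, value in records:
    for out_key, out_val in map_fn(key, value):
        intermediate[out_key].append(out_val)  # shuffle: group values by key

# Reduce phase: all intermediate values with the same key are aggregated.
def reduce_fn(key, values):
    return (key, sum(values))

counts = dict(reduce_fn(k, v) for k, v in intermediate.items())
print(counts)  # {'big': 2, 'data': 2, 'batch': 1, 'analytics': 1, 'processing': 1}
```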
Batch Analysis frameworks
MapReduce
Programming Model
• In the Map phase, data is read from a distributed file system, partitioned
among a set of computing nodes in the cluster, and sent to the nodes as a
set of key-value pairs.
• The Map tasks process the input records independently of each other
and produce intermediate results as key-value pairs.
• When all the Map tasks are completed, the Reduce phase begins in
which the intermediate data with the same key is aggregated.
• An optional Combine task can be used to perform data aggregation on
the intermediate data of the same key for the output of the mapper
before transferring the output to the Reduce task.
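The effect of the optional Combine task can be shown with a small sketch: the same aggregation as the reducer is applied locally to one mapper's output, shrinking the data that must be transferred to the Reduce phase. The mapper output below is illustrative:

```python
from collections import defaultdict

# Toy output of a single Map task: many (word, 1) pairs.
mapper_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]

# Combine task: aggregate values of the same key locally, before the
# intermediate data is transferred to the Reduce task.
def combine(pairs):
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

combined = combine(mapper_output)
print(combined)                                 # [('big', 3), ('data', 1)]
print(len(mapper_output), "->", len(combined))  # 4 -> 2
```

Four intermediate pairs are reduced to two before leaving the node, which is why combiners cut shuffle traffic.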
Batch Analysis frameworks
Hadoop YARN
Batch Analysis frameworks
Hadoop YARN
Figure 8.1
Batch Analysis frameworks
Hadoop YARN
The key components of YARN are described as follows:
1. Resource Manager (RM):
• The RM is the global resource scheduler that arbitrates the cluster’s resources
among all the running applications.
2. Application Master (AM):
• An Application Master (AM) manages the application’s life cycle.
• AM is responsible for negotiating resources from the RM and working with the
NMs to execute and monitor the tasks.
3. Node Manager (NM):
• An NM runs on each machine in the cluster and manages the user processes on
that machine.
4. Containers:
• A container is a bundle of resources (memory, CPU and network) allocated by the RM.
• A container is a conceptual entity that grants an application the privilege to use a
certain amount of resources on a given machine to run a task.
• Each node has multiple containers based on the resource allocations made by
the RM.
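A toy Python model (not the Hadoop API; the node names and capacities are made up) can illustrate what a container grant means: the RM tracks free resources per Node Manager and hands out bundles of memory and CPU on a specific node:

```python
from dataclasses import dataclass

# Toy model of a YARN container: the privilege to use a fixed amount of
# resources on a given machine to run one task.
@dataclass
class Container:
    node: str       # NM node that will run the task
    memory_mb: int  # memory granted by the RM
    vcores: int     # CPU cores granted by the RM

class ResourceManager:
    def __init__(self, node_capacity):
        # free (memory_mb, vcores) tracked per Node Manager
        self.free = dict(node_capacity)

    def allocate(self, memory_mb, vcores):
        """Grant a container on any node with enough free resources."""
        for node, (mem, cpu) in self.free.items():
            if mem >= memory_mb and cpu >= vcores:
                self.free[node] = (mem - memory_mb, cpu - vcores)
                return Container(node, memory_mb, vcores)
        return None  # request waits until resources free up

rm = ResourceManager({"nm1": (4096, 4), "nm2": (2048, 2)})
c = rm.allocate(memory_mb=2048, vcores=2)
print(c)  # Container(node='nm1', memory_mb=2048, vcores=2)
```

Each node can host multiple such containers, up to the allocations the RM has made for it.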
Batch Analysis frameworks
Hadoop YARN
Figure 8.2
Batch Analysis frameworks
Hadoop YARN
• Figure 8.2 shows a YARN cluster with a Resource Manager node and
three Node Manager nodes.
• There are as many Application Masters running as there are applications
(jobs).
• Each application’s AM manages the application tasks such as
• starting, monitoring and restarting tasks in case of failures.
• Each application has multiple tasks.
• Each task runs in a separate container.
• Each container in YARN can be used for both map and reduce tasks.
• The resource allocation model of YARN is more flexible with the
introduction of resource containers which improve cluster utilization.
Batch Analysis frameworks
Hadoop YARN
Figure 8.3
Batch Analysis frameworks
Hadoop YARN
• To better understand the YARN job execution workflow, let us analyze the interactions
between the main components of YARN.
• Figure 8.3 shows the interactions between a Client and Resource Manager.
• Job execution begins with the submission of a new application request by the client to
the RM.
• The RM then responds with a unique application ID and information about cluster
resource capabilities that the client will need in requesting resources for running the
application’s AM.
• Using the information received from the RM, the client constructs and submits an
Application Submission Context which contains information such as scheduler queue,
priority and user information.
• The Application Submission Context also contains a Container Launch Context which
contains the application’s jar, job files, security tokens and any resource requirements.
• The client can query the RM for application reports.
• The client can also "force kill" an application by sending a request to the RM.
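The client–RM handshake above can be sketched as a toy simulation (this is not the YARN client API; the class, field names and values are made up for illustration):

```python
import itertools

class ToyResourceManager:
    _ids = itertools.count(1)

    def new_application(self):
        # RM responds with a unique application ID and information about
        # cluster resource capabilities.
        return {"app_id": next(self._ids),
                "max_container": {"memory_mb": 8192, "vcores": 8}}

    def submit_application(self, submission_context):
        # RM accepts the context; it will later launch the AM in a container.
        return "ACCEPTED"

rm = ToyResourceManager()
resp = rm.new_application()

# Application Submission Context: scheduler queue, priority, user info,
# wrapping a Container Launch Context for the application's AM.
submission_context = {
    "app_id": resp["app_id"],
    "queue": "default", "priority": 1, "user": "alice",
    "am_container_launch_context": {
        "jar": "app.jar", "files": ["job.xml"], "tokens": ["<token>"],
        "resources": {"memory_mb": 1024, "vcores": 1},
    },
}
print(rm.submit_application(submission_context))  # ACCEPTED
```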
Batch Analysis frameworks
Hadoop YARN
Figure 8.4
Batch Analysis frameworks
Hadoop YARN
• Figure 8.4 shows the interactions between the Resource Manager and the
Application Master.
• Upon receiving an application submission context from a client, the RM finds
an available container meeting the resource requirements for running the AM
for the application.
• On finding a suitable container, the RM contacts the NM for the container to
start the AM process on its node.
• When the AM is launched it registers itself with the RM.
• The registration process consists of handshaking that conveys information such
as the port that the AM will be listening on, the tracking URL for monitoring the
application’s status and progress, etc.
• The registration response from the RM contains information for the AM that is
used in calculating and requesting any resource requests for the application’s
individual tasks (such as minimum and maximum resource capabilities for the
cluster).
Batch Analysis frameworks
Hadoop YARN
Figure 8.5
Batch Analysis frameworks
Hadoop YARN
Figure 8.6
Hadoop YARN
Hadoop Schedulers
Fair Scheduler
• The Fair Scheduler was originally developed by Facebook.
• Facebook uses Hadoop to manage the massive content and log data it
accumulates every day.
• The need for the Fair Scheduler arose when Facebook wanted to share its
data warehousing infrastructure between multiple users.
• The Fair Scheduler allocates resources evenly between multiple jobs and also
provides capacity guarantees.
• Fair Scheduler assigns resources to jobs such that each job gets an equal share
of the available resources on average over time.
• The Fair Scheduler lets short jobs finish in reasonable time while not starving
long jobs.
Hadoop YARN
Hadoop Schedulers
Fair Scheduler
• Task slots that are free are assigned to new jobs, so that each job
gets roughly the same amount of CPU time.
• The Fair Scheduler maintains a set of pools into which jobs are placed.
Each pool has a guaranteed capacity.
• When there is a single job running, all the resources are assigned to that
job. When there are multiple jobs in the pools, each pool gets at least as
many task slots as guaranteed.
• This lets the scheduler guarantee capacity for pools while utilizing
resources efficiently when these pools don’t contain jobs.
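The pool-based allocation described above can be sketched as a simplified Python function (the pool names and slot counts are illustrative, and the sketch assumes the active pools' guarantees fit within the total):

```python
# Simplified fair-share allocation: each pool with jobs first receives its
# guaranteed slots, then the remaining slots are split evenly among those
# pools; idle pools' capacity is reclaimed.
def fair_shares(total_slots, pools):
    """pools: {name: {"guaranteed": int, "has_jobs": bool}} -> {name: slots}"""
    active = [p for p, cfg in pools.items() if cfg["has_jobs"]]
    shares = {p: 0 for p in pools}
    if not active:
        return shares
    # Guaranteed capacity for pools that currently contain jobs.
    for p in active:
        shares[p] = min(pools[p]["guaranteed"], total_slots)
    leftover = total_slots - sum(shares.values())
    # Split the leftover evenly among the active pools.
    for i, p in enumerate(active):
        shares[p] += leftover // len(active) + (1 if i < leftover % len(active) else 0)
    return shares

# Two active pools share 10 slots; the idle pool's capacity is reclaimed.
print(fair_shares(10, {
    "etl":     {"guaranteed": 3, "has_jobs": True},
    "adhoc":   {"guaranteed": 2, "has_jobs": True},
    "reports": {"guaranteed": 3, "has_jobs": False},
}))  # {'etl': 6, 'adhoc': 4, 'reports': 0}
```

With a single active pool, the same function hands it all the resources, matching the behavior described above.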
Hadoop YARN
Hadoop Schedulers
Fair Scheduler
• The Fair Scheduler keeps track of the compute time received by each
job.
• The Fair Scheduler is useful when a Hadoop cluster, small or large, is shared
between multiple groups of users.
• Though the fair scheduler ensures fairness by maintaining a set of pools
and providing guaranteed capacity to each pool, it does not provide any
timing guarantees and hence it is ill-equipped for real-time jobs.
Hadoop YARN
Hadoop Schedulers
Capacity Scheduler
• The Capacity Scheduler provides similar functionality to the Fair Scheduler but
adopts a different scheduling philosophy.
• In Capacity Scheduler, multiple named queues are defined, each with a
configurable number of map and reduce slots.
• Each queue is also assigned a guaranteed capacity.
• The Capacity Scheduler gives each queue its capacity when it contains
jobs, and shares any unused capacity between the queues.
Hadoop YARN
Hadoop Schedulers
Capacity Scheduler
• When a TaskTracker has free slots, the Capacity Scheduler picks the queue
with the lowest ratio of running slots to capacity.
• The Capacity Scheduler is useful when a large Hadoop cluster is shared
among multiple clients running jobs of different types and priorities.
• Though the capacity scheduler ensures fairness by maintaining a set of
queues and providing guaranteed capacity to each queue, it does not
provide any timing guarantees and, therefore, it may be ill-equipped for
real-time jobs.
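The queue-selection rule above (pick the queue whose ratio of running slots to capacity is lowest) is simple enough to sketch directly; the queue names and numbers are illustrative:

```python
# Capacity Scheduler queue selection: when slots free up, serve the queue
# that is furthest below its configured capacity, i.e. the one with the
# lowest running-slots-to-capacity ratio.
def pick_queue(queues):
    """queues: {name: {"running": int, "capacity": int}} -> queue name"""
    return min(queues, key=lambda q: queues[q]["running"] / queues[q]["capacity"])

queues = {
    "prod": {"running": 8, "capacity": 10},  # ratio 0.8
    "dev":  {"running": 1, "capacity": 4},   # ratio 0.25
    "test": {"running": 2, "capacity": 4},   # ratio 0.5
}
print(pick_queue(queues))  # dev
```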
Next lecture
Assignment
Deadline