Discuss Mesos and Yarn and The Relative Placement of The Two Respectively

This document discusses Mesos, YARN, Apache Tez, and their relationships. It provides the following key points: 1) Mesos is an OS-level cluster manager that isolates processes and shares resources, while YARN is designed specifically for Hadoop workloads as an application-level scheduler. 2) Apache Tez builds on YARN to allow complex dataflow graphs beyond MapReduce. It uses input-processor-output modules to improve performance over MapReduce. 3) In the Hadoop stack, Tez sits above YARN and allows frameworks like Hive and Pig to express computations as dataflow graphs for better performance than MapReduce.

ASSIGNMENT 2 (BDAM)

Group Members
Dikshika Arya 19PT1-07
Jigyasa Monga 19PT1-12
Pankhuri Bhatnagar 19PT1-18

Question 1:

Discuss Mesos and Yarn and the relative placement of the two respectively.
MESOS
● open source cluster manager that handles workloads in a distributed environment
through dynamic resource sharing and isolation
● suited for the deployment and management of applications in large-scale clustered
environments
● Isolates the processes running in a cluster along dimensions such as memory, CPU, file
system, rack locality and I/O, to keep them from interfering with each other. Such isolation
allows Mesos to create a single, large pool of resources to offer workloads
● brings together the existing resources of the machines/nodes in a cluster into a single
pool on which a variety of workloads may draw
● Also known as node abstraction, this removes the need to allocate specific machines for
different workloads
● Mesos also utilizes Apache ZooKeeper, which began as a Hadoop subproject, to synchronize
distributed processes so that all clients receive consistent data and the cluster stays fault tolerant
● Each framework consists of at least two crucial components: a scheduler and executor.
Schedulers register with the Mesos master to get resources, and executors launch the
command or program that runs tasks on the slaves
● The master offers resources to each framework, but it is the framework’s scheduler that
chooses which of those available resources to use. After a framework accepts the
resources offered by the master, it sends a description of the tasks back to the master.
The master then sends these tasks to the slave, and the executor on the slave launches
the tasks
● Mesos sits between the operating system and the application layer and basically acts as a
data center kernel
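
The offer cycle described above can be sketched as a small simulation. This is only a toy model under stated assumptions, not the Mesos API: the class names `Offer`, `Task`, `GreedyScheduler` and `Master`, and the greedy "accept whatever fits" policy, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    agent: str      # node ("slave") that owns the free resources
    cpus: float
    mem_mb: int


@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int


class GreedyScheduler:
    """Hypothetical framework scheduler: takes every pending task that fits an offer."""

    def __init__(self, pending):
        self.pending = list(pending)

    def resource_offer(self, offer):
        # The framework, not the master, decides which offered resources to use.
        accepted, cpus, mem = [], offer.cpus, offer.mem_mb
        for task in list(self.pending):
            if task.cpus <= cpus and task.mem_mb <= mem:
                accepted.append(task)
                self.pending.remove(task)
                cpus -= task.cpus
                mem -= task.mem_mb
        return accepted


class Master:
    """Toy master: offers each agent's free resources and forwards accepted tasks."""

    def __init__(self, agents):
        self.agents = agents      # agent name -> Offer describing its free resources
        self.launched = []        # (agent, task name) pairs in launch order

    def run_offer_cycle(self, scheduler):
        for offer in self.agents.values():
            for task in scheduler.resource_offer(offer):
                # In a real cluster the executor on the agent would start the task here.
                self.launched.append((offer.agent, task.name))


master = Master({
    "agent-1": Offer("agent-1", cpus=2.0, mem_mb=4096),
    "agent-2": Offer("agent-2", cpus=4.0, mem_mb=8192),
})
scheduler = GreedyScheduler([
    Task("etl", cpus=3.0, mem_mb=2048),
    Task("web", cpus=1.0, mem_mb=1024),
    Task("batch", cpus=1.0, mem_mb=1024),
])
master.run_offer_cycle(scheduler)
print(master.launched)
```

In this run the 3-CPU task does not fit the first offer, so the scheduler waits and places it on the second agent, which illustrates that placement decisions stay with the framework.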
YARN (Yet Another Resource Negotiator)
● In 2012, the Hadoop architecture was upgraded to YARN, which provided a general purpose data
processing framework
● This framework supports not just the MapReduce model but also newer data processing
frameworks
● In YARN, data processing is separated from the resource management and scheduling
components of MapReduce
● Helps in efficiently running interactive queries and streaming applications, and supports a
broader range of applications

Source: https://www.oreilly.com/content/a-tale-of-two-clusters-mesos-and-yarn/

Comparing the two, YARN is designed specifically for Hadoop workloads, whereas Mesos
is designed for all kinds of workloads. YARN is an application-level scheduler and Mesos is an
OS-level scheduler. It is better to use YARN if we already run a Hadoop cluster.

Question 2:

Discuss Apache Tez and its utility. How does it fit into the Hadoop logical stack?

● The Apache Tez project is aimed at building an application framework which allows for a
complex directed acyclic graph of tasks for processing data. It is currently built on top of
Apache Hadoop YARN, the resource management framework.
● It is a distributed parallel execution framework which is targeted towards data
processing applications.
● It is based on expressing a computation as a data flow graph.
● It negotiates resources from the Hadoop framework.
● It supports fault tolerance and recovery.
● It also supports horizontal scalability and resource elasticity.
● It has a shared library of ready-to-use components.
● It is highly customizable to meet a broad range of use cases.

The two main design themes for Tez are:

● Empowering end users by:
○ Expressive dataflow definition APIs
○ Flexible Input-Processor-Output runtime model
○ Data type agnostic
○ Simplifying deployment
● Execution Performance
○ Performance gains over MapReduce
○ Optimal resource management
○ Plan reconfiguration at runtime
○ Dynamic physical data flow decisions

Tez takes care of the hard problems of running in a distributed Hadoop environment, so
applications can focus on solving their domain-specific problems.

Apache Tez in the Hadoop Logical Stack:

● Tez is built on top of YARN, which is the new resource-management framework for
Hadoop.
● Tez generalizes the MapReduce paradigm to a more powerful framework based on
expressing computations as a dataflow graph.
● Tez is not meant directly for end users; rather, it enables developers to build end-user
applications with much better performance and flexibility.
● Tez is highly customizable to meet a broad spectrum of use cases, and projects such as
Hive and Pig see a significant improvement in response time when they use Tez instead of
MapReduce for data processing.
● Simple deployment:
Tez is entirely a client-side application that leverages YARN local resources and the
distributed cache. Usually nothing needs to be deployed on the cluster for Tez; it is enough
to upload the relevant Tez libraries to HDFS and then use the Tez client to submit jobs
with those libraries.

Working of Tez:

● Express, model and execute processing logic:

Tez models data processing as a dataflow graph, with the graph vertices representing
application logic and its edges representing movement of data. A rich data flow
definition API allows users to intuitively express complex query logic. The API fits well
with query plans produced by higher-level declarative applications like Apache Hive and
Apache Pig.
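
As a rough illustration of the dataflow-graph idea, the sketch below runs a three-step word count in which each vertex holds logic and each edge names the producer a vertex reads from. It is plain Python built on the standard-library `graphlib` module (Python 3.9+), not the Tez API; the vertex names, the `run_dag` helper and the sample data are assumptions made up for this example.

```python
from graphlib import TopologicalSorter

# Vertex name -> function holding that step's application logic (hypothetical example).
vertices = {
    "tokenize": lambda inputs: [w for line in inputs["source"] for w in line.split()],
    "count": lambda inputs: {w: inputs["tokenize"].count(w) for w in set(inputs["tokenize"])},
    "top": lambda inputs: max(inputs["count"].items(), key=lambda kv: kv[1]),
}

# Edges as consumer -> set of producers: data moves from producer to consumer.
edges = {"tokenize": {"source"}, "count": {"tokenize"}, "top": {"count"}}


def run_dag(vertices, edges, source_data):
    """Run every vertex after all of its producers have finished."""
    results = {"source": source_data}
    for name in TopologicalSorter(edges).static_order():
        if name in results:
            continue  # the source data is supplied up front, not computed
        inputs = {producer: results[producer] for producer in edges[name]}
        results[name] = vertices[name](inputs)
    return results


results = run_dag(vertices, edges, ["a b a", "c a"])
print(results["top"])
```

With MapReduce the same pipeline would typically be split across more than one job; here it is a single graph whose steps run in dependency order.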

● Model interaction between Input, Processor and Output modules:

Tez models the user logic running in each vertex of the dataflow graph as a composition
of Input, Processor and Output modules. Input & Output determine the data format and
how and where it is read or written. The Processor holds the data transformation logic.
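
The split of responsibilities can be sketched in a few lines. The classes below (`LineInput`, `UppercaseProcessor`, `ListOutput`) and the `run_vertex` helper are hypothetical stand-ins written for this illustration; they are not Tez classes, and only show how a vertex could be composed from three independent parts.

```python
import io


class LineInput:
    """Input: decides where and in what format data is read (here, text lines)."""

    def __init__(self, stream):
        self.stream = stream

    def read(self):
        for line in self.stream:
            yield line.rstrip("\n")


class UppercaseProcessor:
    """Processor: holds only the data transformation logic."""

    def process(self, records):
        for record in records:
            yield record.upper()


class ListOutput:
    """Output: decides where and how results are written (here, an in-memory list)."""

    def __init__(self):
        self.written = []

    def write(self, records):
        self.written.extend(records)


def run_vertex(inp, processor, out):
    # A vertex is the composition Input -> Processor -> Output.
    out.write(processor.process(inp.read()))


out = ListOutput()
run_vertex(LineInput(io.StringIO("hive\npig\n")), UppercaseProcessor(), out)
print(out.written)
```

Because the processor never touches the storage format, the same transformation could be reused with a different input or output module.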

● Dynamically reconfigure graphs:

Distributed data processing is dynamic, and information that becomes available only at
runtime can help to optimize the execution plan further. Consequently, Tez
includes support for pluggable vertex management modules to collect runtime
information and change the dataflow graph dynamically to optimize performance and
resource utilization.
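
One way to picture a vertex management module is a component that watches upstream output sizes and then shrinks the number of downstream tasks. The sketch below is a simplified assumption of that idea: `ToyVertexManager`, its method names and the bytes-per-task rule are invented for illustration and do not reproduce any real Tez vertex manager.

```python
import math


class ToyVertexManager:
    """Hypothetical manager that picks downstream parallelism from runtime statistics."""

    def __init__(self, planned_tasks, bytes_per_task):
        self.planned_tasks = planned_tasks    # parallelism chosen at planning time
        self.bytes_per_task = bytes_per_task  # target amount of data per downstream task
        self.observed_bytes = 0

    def on_upstream_task_done(self, output_bytes):
        # Runtime information: actual output size of each finished upstream task.
        self.observed_bytes += output_bytes

    def decide_parallelism(self):
        # Never go above the plan, and always keep at least one task.
        needed = max(1, math.ceil(self.observed_bytes / self.bytes_per_task))
        return min(self.planned_tasks, needed)


manager = ToyVertexManager(planned_tasks=10, bytes_per_task=100)
for size in (120, 80, 50):
    manager.on_upstream_task_done(size)
print(manager.decide_parallelism())
```

The plan asked for 10 tasks, but the 250 bytes actually produced need only 3 tasks at 100 bytes each, so the graph is adjusted to use fewer resources.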

● Optimize performance and resource management:

YARN manages resources in a Hadoop cluster. The Tez execution engine framework
efficiently acquires resources from YARN and reuses every component in the pipeline so
that no operation is duplicated unnecessarily.

Tez API

The Tez API has the following components:

● DAG (Directed Acyclic Graph):
It defines the overall job. One DAG object corresponds to one job. The user creates a
DAG object for each data processing job.
● Vertex:
It defines the user logic along with the resources and the environment needed to
execute the user logic. One Vertex corresponds to one step in the job. The user creates
a Vertex object for each step in the job and adds it to the DAG.
● Edge:
It defines the connection between producer and consumer vertices. The user creates an
Edge object and connects the producer and consumer vertices using it.
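
The relationship between the three components can be modelled with a few small classes. This is a minimal sketch in plain Python, not the actual Tez Java API: the `DAG`, `Vertex` and `Edge` classes below only borrow the names from the list above, and their fields and methods (`parallelism`, `add_vertex`, `add_edge`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Vertex:
    name: str
    parallelism: int = 1   # stand-in for the resources a step needs


@dataclass(frozen=True)
class Edge:
    producer: Vertex
    consumer: Vertex


@dataclass
class DAG:
    name: str
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_vertex(self, vertex):
        if vertex.name in self.vertices:
            raise ValueError(f"duplicate vertex: {vertex.name}")
        self.vertices[vertex.name] = vertex
        return self

    def add_edge(self, edge):
        for v in (edge.producer, edge.consumer):
            if v.name not in self.vertices:
                raise ValueError(f"unknown vertex: {v.name}")
        # Keep the graph acyclic: reject an edge if the consumer already reaches the producer.
        if self._reaches(edge.consumer.name, edge.producer.name):
            raise ValueError("edge would create a cycle")
        self.edges.append(edge)
        return self

    def _reaches(self, start, target):
        # Depth-first search along existing producer -> consumer edges.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(e.consumer.name for e in self.edges if e.producer.name == node)
        return False


scan = Vertex("scan", parallelism=4)
join = Vertex("join", parallelism=2)
dag = DAG("query").add_vertex(scan).add_vertex(join).add_edge(Edge(scan, join))
print([(e.producer.name, e.consumer.name) for e in dag.edges])
```

One `DAG` object stands for one job, each `Vertex` for one step, and each `Edge` for one producer-to-consumer connection; the cycle check is what keeps the graph "acyclic".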
By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can
process in a single Tez job data that earlier took multiple MR jobs, as shown below.
