CH 4 BDA
PECAIML601A
CHAPTER-4
1 MARK QUESTIONS
1. A ________ node acts as the Slave and is responsible for executing a Task assigned to it by the
JobTracker.
• TaskTracker. A TaskTracker receives the information necessary for executing a task from the JobTracker, executes the task, and sends the results back to the JobTracker.
2. ___________ part of the MapReduce is responsible for processing one or more chunks of data and
producing the output results.
• The Map task part of MapReduce is responsible for processing one or more chunks of data and producing the output results.
3. _________ function is responsible for consolidating the results produced by each of the Map ()
functions/tasks.
• The Reduce() function is responsible for consolidating the results produced by each of the Map() functions/tasks.
5. The CapacityScheduler supports _____________ queues to allow for more predictable sharing of
cluster resources.
• Hierarchical
6. Users can bundle their Yarn code in a _________ file and execute it using jar command.
• Users can bundle their Yarn code in a JAR file and execute it using the jar command.
5 MARKS QUESTIONS
1. Draw Hadoop Yarn Architecture and also explain the components of Hadoop Yarn Architecture
A) SCHEDULER
• The scheduler is responsible for allocating resources to the running applications. It is a pure scheduler, meaning that it performs no monitoring or tracking of application status, and it offers no guarantee about restarting failed tasks, whether they fail because of an application error or a hardware failure.
B) APPLICATION MANAGER
• It manages running Application Masters in the cluster, i.e., it is responsible for starting application masters and for
monitoring and restarting them on different nodes in case of failures.
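• For illustration, here is a minimal sketch, in Java, of how a client hands an application over to the ResourceManager through the public YarnClient API; the Application Manager then starts and monitors the ApplicationMaster on the cluster. The application name, queue, resources, and launch command below are illustrative assumptions, not prescribed values.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager configured in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for a new application id
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");                // placeholder name
        ctx.setQueue("default");                           // scheduler queue
        ctx.setResource(Resource.newInstance(1024, 1));    // 1 GB, 1 vcore for the AM

        // Command that launches the ApplicationMaster (placeholder command)
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.<String, LocalResource>emptyMap(),
                Collections.<String, String>emptyMap(),
                Collections.singletonList("java com.example.MyAppMaster"),
                null, null, null);
        ctx.setAMContainerSpec(amContainer);

        // From here on the scheduler allocates resources and the Application
        // Manager starts and monitors the ApplicationMaster.
        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted application " + appId);
        yarnClient.stop();
    }
}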
• The classic MapReduce model is a programming model and framework introduced by Google, which forms the
foundation for processing large-scale data in a distributed and parallel manner. It consists of two primary
components: the Map function and the Reduce function.
• The Map function takes an input dataset and applies a user-defined transformation to each element independently.
It generates a set of intermediate key-value pairs as output, where the key represents a category or identifier, and
the value is the result of the transformation. The Map function is designed to be parallelizable, allowing multiple
instances of the function to be executed in parallel on different parts of the dataset.
• After the Map function is applied to the entire dataset, the intermediate key-value pairs are shuffled and sorted
based on their keys. This process groups together all the values associated with the same key, preparing them for
the Reduce function.
• The Reduce function takes the intermediate key-value pairs as input and performs an aggregation or summarization
operation on each group of values associated with a specific key. The Reduce function produces a set of final output
key-value pairs, where the key typically represents a unique category or result, and the value represents the
aggregated or summarized result.
• The classic MapReduce model provides fault tolerance by automatically handling failures and rerunning failed tasks
on other nodes in the distributed system. It also optimizes data movement by minimizing network communication,
as the intermediate key-value pairs are shuffled and sorted locally before being passed to the Reduce function.
• The classic MapReduce model has been widely used in various big data processing frameworks, including Apache
Hadoop. It provides a scalable and efficient approach for processing large volumes of data by leveraging the parallel
processing capabilities of distributed systems. However, it is worth noting that newer frameworks and models have
emerged that build upon or enhance the classic MapReduce model, offering additional functionalities and
optimizations.
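• To make the model concrete, the following is a minimal single-machine sketch of a word-count job in Java: a map function that emits (word, 1) pairs, an in-memory shuffle that groups values by key, and a reduce function that sums each group. Real frameworks distribute these steps across a cluster; the class and method names here are illustrative only.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Map: for each input line, emit an intermediate (word, 1) pair per word
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: sum all values that were grouped under the same key
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data needs big clusters", "data data data");

        // Shuffle and sort: group intermediate values by key (TreeMap keeps keys sorted)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }

        // Reduce each group and print the final (word, total) pairs
        grouped.forEach((word, counts) ->
                System.out.println(word + "\t" + reduce(word, counts)));
    }
}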
15 MARKS QUESTIONS
1. Question
Answer b)
A) ClientService
o The client interface to the Resource Manager. This component handles all the RPC interfaces to the RM
from the clients including operations like application submission, application termination, obtaining queue
information, cluster statistics etc.
B) AdminService
o To make sure that admin requests don’t get starved due to the normal users’ requests and to give the
operators’ commands the higher priority, all the admin operations like refreshing node-list, the queues’
configuration etc. are served via this separate interface.
Components connecting RM to the nodes
A) ResourceTrackerService
o This is the component that obtains heartbeats from nodes in the cluster and forwards them to the YarnScheduler. It responds to RPCs from all the nodes, registers new nodes, and rejects requests from invalid or decommissioned nodes. It works closely with the NMLivelinessMonitor and the NodesListManager.
B) NMLivelinessMonitor
o Keeps track of live and dead nodes by tracking each node's last heartbeat time. Any node that doesn't send a heartbeat within a configured interval of time, by default 10 minutes, is deemed dead and is expired by the RM. All the containers currently running on an expired node are marked as dead, and no new containers are scheduled on such a node.
C) NodesListManager
o Manages valid and excluded nodes. Responsible for reading the host configuration files and seeding the
initial list of nodes based on those files. Keeps track of nodes that are decommissioned as time progresses.
a) ApplicationsManager
o Responsible for maintaining a collection of submitted applications. It also keeps a cache of completed applications so as to serve users' requests via the web UI or the command line long after the applications in question have finished.
b) ApplicationACLsManager
o The RM needs to gate user-facing APIs such as the client and admin requests so that they are accessible only to authorized users. This component maintains the ACLs per application and enforces them whenever a request such as killing an application or viewing an application's status is received.
c) ApplicationMasterLauncher
o Maintains a thread pool to launch AMs of newly submitted applications, as well as of applications whose previous AM attempts exited for some reason. It is also responsible for cleaning up the AM when an application has finished normally or has been forcefully terminated.
d) YarnScheduler
o The Yarn Scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, etc. It performs its scheduling function based on the resource requirements of the applications, for example memory, CPU, disk, and network. Currently, only memory is supported, and support for CPU is close to completion.
e) ContainerAllocationExpirer
o This component is in charge of ensuring that all allocated containers are actually used by AMs and subsequently launched on the corresponding NMs.
• AMs run as untrusted user code and can potentially hold on to allocations without using them, and as such can cause
cluster under-utilization. To address this, ContainerAllocationExpirer maintains the list of allocated containers that
are still not used on the corresponding NMs.
• For any container, if the corresponding NM doesn’t report to the RM that the container has started running within
a configured interval of time, by default 10 minutes, then the container is deemed as dead and is expired by the RM.
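• The two expiry intervals above are ordinary YARN configuration properties. The sketch below reads them with the Java Configuration API; the property names and 10-minute defaults shown are assumptions and should be verified against the yarn-default.xml of the Hadoop version in use.

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ExpirySettings {
    public static void main(String[] args) {
        YarnConfiguration conf = new YarnConfiguration();

        // NMLivelinessMonitor: a NodeManager that sends no heartbeat for this long
        // is declared dead by the RM (assumed default: 600000 ms = 10 minutes).
        long nmExpiryMs = conf.getLong(
                "yarn.nm.liveness-monitor.expiry-interval-ms", 600000L);

        // ContainerAllocationExpirer: an allocated container not launched on its
        // NodeManager within this interval is expired (assumed default: 600000 ms).
        long containerExpiryMs = conf.getLong(
                "yarn.resourcemanager.rm.container-allocation.expiry-interval-ms", 600000L);

        System.out.println("NM liveness expiry:          " + nmExpiryMs + " ms");
        System.out.println("Container allocation expiry: " + containerExpiryMs + " ms");
    }
}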
2. Question
a) What is MapReduce?
b) How do Map and Reduce work together?
c) What is a key-value pair in Hadoop? How is a key-value pair generated in MapReduce?
Answer a) MapReduce
• MapReduce is a programming model and framework for processing and analyzing large volumes of data in a
distributed and parallel manner. It was introduced by Google in 2004 and has since become a widely adopted
approach for big data processing.
• The MapReduce framework simplifies the task of writing distributed data processing applications by abstracting
away the complexity of parallelization, fault tolerance, and data distribution. It provides a high-level programming
model that allows developers to focus on the logic of their data transformations rather than the low-level details of
distributed computing.
• In the MapReduce paradigm, data processing is divided into two main stages: the Map stage and the Reduce stage.
o Map Stage: In this stage, a function called the "mapper" is applied to each input element in parallel. The
mapper takes an input key-value pair and produces intermediate key-value pairs as output. The
intermediate key-value pairs are not stored permanently but are passed on to the next stage.
o Shuffle and Sort: After the Map stage, the intermediate key-value pairs are sorted and grouped based on
their keys. This process is called shuffle and sort. It ensures that all intermediate values with the same key
are grouped together, allowing for efficient processing in the next stage.
o Reduce Stage: In this stage, a function called the "reducer" is applied to each group of intermediate key-
value pairs. The reducer takes a key and the corresponding set of values and produces a set of final output
key-value pairs. The reducer performs aggregation, summarization, or any other operation that requires
combining the values associated with a particular key.
• The MapReduce framework handles the parallel execution, fault tolerance, and data distribution automatically. It
divides the input data into smaller chunks and assigns them to different machines or processors in a cluster. The
mappers and reducers can run in parallel on different portions of the data, enabling efficient processing of large
datasets.
• MapReduce is designed to handle large-scale data processing tasks by leveraging the parallel processing capabilities
of a distributed system. It has been widely used in various big data processing frameworks, such as Apache Hadoop,
to perform tasks like data transformation, filtering, sorting, indexing, and more.
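• As a brief illustration of how little the application code has to do, a typical Hadoop MapReduce job is wired together by a small driver like the hedged sketch below; the framework then handles splitting, scheduling, shuffling, and fault tolerance. WordMapper and SumReducer stand in for user-defined classes (one possible pair is sketched under part (c) below), and the input/output paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // User-defined map and reduce logic (assumed classes)
        job.setMapperClass(WordMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local aggregation
        job.setReducerClass(SumReducer.class);

        // Types of the final output key-value pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS (placeholders)
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        // Submit the job and wait; map and reduce tasks run in parallel on the cluster
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}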
• Map and Reduce are two fundamental operations in distributed computing and parallel processing frameworks,
such as MapReduce. They work together to enable efficient processing of large volumes of data across multiple
machines or processors.
• The Map operation applies a transformation function to each element in a dataset independently, producing a set
of key-value pairs as output. This transformation function can be any operation or computation that can be applied
to individual elements of the dataset. The key-value pairs generated by the Map operation are often referred to as
intermediate key-value pairs.
• Once the Map operation has been performed on the entire dataset, the intermediate key-value pairs are grouped
based on their keys, and these groups are sent to the Reduce operation. The Reduce operation applies a specific
aggregation or summarization function to each group of intermediate key-value pairs, producing a final output for
each key. The aggregation function can be any operation that takes a set of values associated with a key and
produces a single value.
• The key idea behind the MapReduce paradigm is that the Map operation can be performed in parallel on different
portions of the dataset, with each machine or processor handling a subset of the data. This parallelization allows for
efficient processing of large datasets by distributing the workload across multiple computing resources. Once the
Map operation is completed, the intermediate key-value pairs can be shuffled and distributed to the Reduce
operations based on their keys, again allowing for parallel processing of different groups of intermediate data.
• The combination of Map and Reduce operations enables scalable and fault-tolerant processing of large-scale data.
By dividing the computation into independent map tasks and aggregating the results through the reduce tasks,
MapReduce frameworks can efficiently process data in parallel across a cluster of machines, minimizing data
movement and maximizing resource utilization. This approach has been widely adopted in big data processing
systems and has greatly contributed to the ability to handle massive datasets efficiently.
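• One concrete detail behind "distributed to the Reduce operations based on their keys": Hadoop routes each intermediate key to a reduce task by hashing it, so all pairs sharing a key land on the same reducer. The sketch below mirrors the rule used by the default HashPartitioner; the class name WordPartitioner is illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task receives each intermediate (key, value) pair.
// Masking the sign bit keeps the partition index non-negative.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}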
• In MapReduce, the Map function processes a key-value pair and emits a number of intermediate key-value pairs, and the Reduce function processes the values grouped under the same key and emits another set of key-value pairs as output. The output types of the Map must match the input types of the Reduce, as shown below:
• Map: (K1, V1) -> list(K2, V2)
• Reduce: (K2, list(V2)) -> list(K3, V3)
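• To make these signatures and the generation of key-value pairs concrete, here is a hedged word-count sketch using the standard Hadoop Mapper and Reducer base classes. The framework's InputFormat/RecordReader first turns each input split into (K1, V1) = (byte offset, line) pairs; the mapper emits (K2, V2) = (word, 1) by calling context.write(); and the reducer emits (K3, V3) = (word, total). The class names are illustrative.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (K1, V1) = (line offset, line text) -> list(K2, V2) = list((word, 1))
class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit an intermediate key-value pair
            }
        }
    }
}

// Reduce: (K2, list(V2)) = (word, [1, 1, ...]) -> list(K3, V3) = list((word, total))
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // emit the final key-value pair
    }
}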