Hadoop Interview Questions
Section 1:
Hadoop Basics
Q 1. What is Hadoop, and why is it
important for Big Data?
Ans. Hadoop is an open-source framework that stores and
processes large amounts of data across clusters of commodity
hardware.
Hadoop has four core components:
HDFS: distributed storage.
MapReduce: distributed processing.
YARN (Yet Another Resource Negotiator): job scheduling and resource management.
Hadoop Common: shared Java libraries and utilities.
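A quick way to confirm that a Hadoop installation is available is the command line. A minimal sketch, assuming Hadoop's binaries are on the PATH:
Code Example: Checking a Hadoop Installation
bash
hadoop version   # print the installed Hadoop version
hdfs dfs -ls /   # list the root directory of HDFS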
Section 2:
Hadoop Architecture
Q 3. What is HDFS, and what is its purpose?
Ans. HDFS (Hadoop Distributed File System) is designed for large-
scale storage and high fault tolerance.
HDFS splits large files into smaller, fixed-size blocks (default: 128
MB) and stores them across multiple nodes, ensuring data
redundancy by replicating blocks across the cluster.
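To see how a particular file is split into blocks and where the replicas are stored, you can use the fsck utility (the file path below is illustrative):
Code Example: Inspecting a File's Blocks
bash
hdfs fsck /user/hadoop/file.txt -files -blocks -locations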
Q 4. What is the NameNode, and what is its
role in HDFS?
Ans. The NameNode is the master node of HDFS that manages
the metadata for files stored in the cluster, including information
like file locations, replication details, and directory structures.
The NameNode does not store actual data, but it coordinates data
storage across DataNodes.
Code Example: Checking NameNode Status
bash
hdfs dfsadmin -report
This command reports the overall health of HDFS, including its capacity and the DataNodes available in the cluster.
Q 7. Explain data replication in HDFS.
Ans. HDFS replicates each data block across multiple DataNodes
to ensure fault tolerance. The default replication factor is three,
meaning each block is stored on three different nodes. This
redundancy allows data to be available even if some nodes fail.
Code Example: View or Set Replication Factor
bash
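# The file path below is illustrative.
hdfs dfs -stat %r /user/hadoop/file.txt      # view a file's current replication factor
hdfs dfs -setrep -w 2 /user/hadoop/file.txt  # set it to 2 and wait for re-replication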
Q 8. What is the Secondary NameNode, and why is it used?
Ans. The Secondary NameNode assists the NameNode by
periodically merging the FSImage (file system image) and edit
logs to create a new, updated FSImage. It helps reduce the load
on the NameNode, but it is not a backup node.
(Diagram: the Secondary NameNode periodically asks the NameNode for its file system metadata, such as the block mapping File.txt = A, C.)
Section 4:
MapReduce Framework
Q 9. What is MapReduce, and what are its main phases?
Ans. MapReduce is a programming model for parallel processing
of large data sets. It has two main phases:
Map: Processes input data into key-value pairs.
Reduce: Aggregates and summarizes the data to
produce final output.
Code Example: Word Count Mapper (map phase)
java
// Inside the Mapper's map() method of a word count job:
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one); // emit (word, 1) for each token
}
Q 10. What is the purpose of a combiner in MapReduce?
Ans. A combiner is a mini-reducer that performs local aggregation
of output data from the map function before sending it to the
reducer. It helps minimize the amount of data transferred between
the map and reduce phases, thus improving performance.
Code Example: Using a Combiner
java
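// A word count job typically reuses its reducer as the combiner.
// IntSumReducer is illustrative; any Reducer whose input and output
// types match the map output can serve as a combiner.
job.setCombinerClass(IntSumReducer.class);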
This line sets the combiner class for a MapReduce job, reducing
intermediate data.
Q 11. What is speculative execution in Hadoop MapReduce?
Ans. Speculative execution runs duplicate tasks on different
nodes if one task appears to be running slower than expected.
This helps ensure that straggling tasks do not delay the entire job.
Code Configuration: Enable Speculative Execution
xml
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
Section 5:
Data Processing Frameworks Other Than MapReduce
Q 13. What are the main components of YARN?
Ans. The key components of YARN are:
ResourceManager: Allocates resources to different applications.
NodeManager: Monitors resources on individual nodes.
ApplicationMaster: Manages the lifecycle of applications.
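These components can be observed from the command line, for example:
Command Example: Inspecting YARN
bash
yarn node -list          # NodeManagers registered with the ResourceManager
yarn application -list   # applications currently tracked by the ResourceManager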
Q 14. Explain how ResourceManager works in YARN.
Ans. The ResourceManager is the master authority for resource
management in YARN. It allocates resources to various
applications running on the cluster based on availability and
priority.
Section 6:
Hadoop Ecosystem and Tools
Q 15. What is Apache Hive, and what is its use in Hadoop?
Ans. Hive is a data warehousing tool built on top of Hadoop that
allows users to run SQL-like queries (using HiveQL) on data
stored in HDFS.
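A HiveQL query can be run directly from the shell; the table and columns below are illustrative:
Code Example: Running a HiveQL Query
bash
hive -e "SELECT name, salary FROM employees WHERE salary > 50000;"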
Q 16. What is Apache Pig, and what is its role in Hadoop?
Ans. Apache Pig is a high-level platform for creating MapReduce
programs using a scripting language called Pig Latin.
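Code Example: Filtering Data in Pig Latin (the input path, schema, and salary threshold are illustrative)
pig
-- Load employee records from HDFS and keep only the higher salaries:
employees = LOAD '/user/hadoop/employees.csv' USING PigStorage(',')
    AS (name:chararray, salary:int);
filtered_data = FILTER employees BY salary > 50000;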
DUMP filtered_data;
This script loads data from HDFS, filters it based on salary, and
prints the results.
Q 17. What is Apache HBase, and why would you use it?
Ans. Apache HBase is a NoSQL database that runs on top of HDFS,
allowing for random, real-time read and write access to data.
Code Example: HBase Table Creation using HBase Shell
shell
create 'employees', 'personal_info', 'professional_info'
This creates a table with two column families; the table name and first column family here are illustrative.
Q 18. How does Hadoop ensure fault tolerance?
Ans. Hadoop ensures fault tolerance by replicating data blocks
across multiple DataNodes.
If a DataNode fails, HDFS can still retrieve the data from replicated
nodes. MapReduce also achieves fault tolerance by re-running
failed tasks on other available nodes.
Code Example: Setting Replication Factor
xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
Q 19. What are Counters in Hadoop MapReduce?
Ans. Counters are a mechanism for counting events, such as the number of processed records or errors, during the execution of a MapReduce job. They help track the job's progress and monitor its health.
Code Example: Using Counters in Java MapReduce
java
// Define custom counters as an enum:
enum MyCounters { RECORD_COUNT, ERROR_COUNT }

// Inside map() or reduce(), increment a counter:
context.getCounter(MyCounters.RECORD_COUNT).increment(1);
// Logic here...
Q 20. What is the Hadoop Distributed Cache, and why is it used?
Ans. The Distributed Cache is a mechanism in Hadoop that allows
files needed by jobs (e.g., JAR files, text files) to be cached and
made available across all nodes running a MapReduce job. This
reduces the need to repeatedly access HDFS for small files.
Code Example: Using Distributed Cache in Java
java
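// The file path is illustrative; addCacheFile distributes the file
// to the local disk of every node running the job's tasks.
job.addCacheFile(new URI("/user/hadoop/lookup.txt"));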
This line adds a file to the distributed cache so that each mapper
or reducer can access it locally.
Section 7:
Advanced Hadoop Concepts
Q 21. What is Rack Awareness in Hadoop, and why is it important?
Ans. Rack Awareness is a concept in Hadoop that allows the cluster to understand the physical topology of the nodes, specifically which rack each node belongs to. HDFS uses this information to place block replicas on different racks, so data remains available even if an entire rack fails and cross-rack network traffic is reduced.
Configuration Example: Setting a Rack Topology Script
xml
<property>
  <name>net.topology.script.file.name</name>
  <value>/path/to/rack-awareness-script.sh</value>
</property>
Q 22. What is speculative execution in Hadoop MapReduce?
Ans. Speculative execution in Hadoop runs multiple instances of the
same task on different nodes if a particular task is taking too long to
complete.
The result from the first instance to complete is taken, and the
others are killed, thereby ensuring faster job completion.
Configuration Example: Enabling Speculative Execution
xml
<property>
<name>mapreduce.map.speculative</name>
<value>true</value>
</property>
<property>
<name>mapreduce.reduce.speculative</name>
<value>true</value>
</property>
This enables speculative execution for both map and reduce tasks.
Q 23. What is Hadoop Streaming, and why is it useful?
Ans. Hadoop Streaming is a utility that allows users to create and run
MapReduce jobs with any executable or script (such as Python, Perl,
etc.) as the Mapper or Reducer.
It makes Hadoop accessible to programmers who prefer scripting
languages over Java.
Command Example: Running a Streaming Job
bash
hadoop jar /path/to/hadoop-streaming.jar \
  -mapper /path/to/mapper.py \
  -reducer /path/to/reducer.py \
  -input /user/hadoop/input \
  -output /user/hadoop/output
Q 24. What is the role of the Reducer in MapReduce?
Ans. The Reducer processes the intermediate key-value pairs produced by the map phase. Its main functions are shuffle and sort (grouping similar keys together) and reduce (processing these keys to produce a final summary).
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum)); // emit (word, total count)
}
This example sums the values associated with each key, typical in
a word count program.
Q 25. How can you handle small files in HDFS efficiently?
Ans. Handling small files efficiently in HDFS can be achieved by using:
HAR (Hadoop Archive) to combine multiple small files into a single archive file.
SequenceFiles, which store key-value pairs in a compressed format.
HBase, which stores data in a more structured format, reducing the load on HDFS.
Command Example: Creating HAR
bash
hadoop archive -archiveName myArchive.har -p /user/hadoop/input /user/hadoop/output
(Diagram: a user issues mkdir "/foo"; the NameNode records the operation in its edit log.)
Q 27. What is Checkpointing in Hadoop?
Ans. Checkpointing is the process of merging the edit logs with
the FSImage to produce an updated FSImage.
This helps the NameNode start faster and prevents the edit log
from growing too large.
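An administrator can also force a checkpoint manually; a minimal sketch:
Command Example: Forcing a Checkpoint
bash
hdfs dfsadmin -safemode enter   # saveNamespace requires safe mode
hdfs dfsadmin -saveNamespace    # merge the edit log into a new FSImage
hdfs dfsadmin -safemode leave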
Q 28. What is ZooKeeper, and what role does it play in Hadoop?
Ans. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and group services. In Hadoop, it is used for coordination tasks such as automatic NameNode failover in HDFS High Availability.
Q 29. Describe Hadoop's High Availability (HA) feature.
Ans. Hadoop’s High Availability feature ensures that there is no
single point of failure for the NameNode. It achieves this by using
two NameNodes—an active NameNode and a standby
NameNode—that work in tandem to ensure availability.
Configuration Example: Enabling HA
This usually involves configuring JournalNodes to help
synchronize the state between the active and standby
NameNodes.
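A minimal sketch of the relevant hdfs-site.xml entries (the nameservice and NameNode IDs are illustrative):
xml
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>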
(Diagram: the active and standby NameNodes share all namespace edits through a shared directory, while DataNodes report to both.)
Q 30. What is DistCp in Hadoop, and how is it used?
Ans. DistCp (Distributed Copy) is a tool used to copy large
datasets between different clusters or within a cluster, leveraging
the MapReduce framework for efficient parallel copying.
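Command Example: Copying Data Between Clusters (the NameNode addresses are illustrative)
bash
hadoop distcp hdfs://nn1:8020/user/hadoop/source hdfs://nn2:8020/user/hadoop/destination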
Why Bosscoder?
1000+ Alumni placed at Top Product-based companies.