Interview Questions - Introduction To Hadoop and MapReduce Programming
MapReduce Programming
1. Can you modify files in Hadoop Distributed File System (HDFS)?
No. Files in HDFS are immutable, which means that existing content cannot be modified in place. HDFS only allows data to be appended to the end of a file.
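For illustration, a minimal sketch of appending to an existing HDFS file with the Java FileSystem API (the path is hypothetical, and append support must be available on the cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/events.log");   // hypothetical existing HDFS file

        // Existing bytes cannot be rewritten in place; append() only adds to the end.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("new record appended at the end\n");
        }
    }
}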
2. What is the role of the Secondary NameNode?
The Secondary NameNode is not a hot standby for the NameNode; its main role is checkpointing. It maintains a copy of the namespace image by periodically merging it with the edit log. If the primary NameNode fails and its metadata files are lost, you can copy the checkpointed metadata files from the Secondary NameNode to a new node and run it as the new primary NameNode.
3. What is a Combiner and when should you use one?
A Combiner is an optional class that acts as a Reducer for the output key-value pairs emitted by a particular Mapper. A Combiner helps in minimising the data transferred between the Map and Reduce tasks. Thus, it helps in reducing the size of the intermediate data, thereby saving disk and network I/O. The Reducer class itself can be used as the Combiner if the Reduce function is commutative and associative, like integer addition or multiplication. However, if the Reduce function is not commutative and associative, such as an average, then you need to write a separate class for the Combiner.
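For illustration, a minimal word-count driver sketch: summing is commutative and associative, so the same class is registered as both Combiner and Reducer with job.setCombinerClass(). The class names are illustrative, not part of the answer above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Emits (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Sums the counts for a key; safe to use as both Combiner and Reducer
    // because addition is commutative and associative.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner pre-aggregates map output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}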
4. Explain the Shuffle and Sort phases in MapReduce.
MapReduce makes sure that the input provided to every Reducer is sorted by key. Shuffle is the phase in which the system performs the sort and then transfers the Map outputs to the Reducers as input. In a MapReduce job, Shuffle and Sort happen as follows:
Map Side
● The output of a Map task is not simply written to the disk; rather, it is first written to a memory buffer, where it is pre-sorted.
● Before the Map output is written to the disk, the data is partitioned according to the Reducers to which it will ultimately be sent.
Reduce Side
● The Reducer starts by copying the corresponding partitions into a buffer memory (copy
phase).
● Then, these sorted partitions are merged into a single sorted file before the Reduce
phase starts.
● Each Reduce task has to group/Reduce values by key. This step becomes easy if the
input data is already sorted.
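For illustration, the map-side partitioning mentioned above is controlled by a Partitioner (the default is hash partitioning). A minimal custom Partitioner sketch, with an illustrative class name and partitioning rule, registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which Reducer partition a map output record goes to; all records with the
// same key end up in the same partition and reach the Reducer sorted by key.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // Route by the first character of the key.
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}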
5. Where is the output of a Mapper written?
The output of a Mapper is written to the local disk. MapReduce writes its final output to HDFS, but the intermediate output is written to the local disk, because writing an intermediate output, which is temporary, to HDFS would be inefficient.
6. What is data locality in Hadoop?
By providing data locality, Hadoop tries to run each compute task as close as possible to the node that stores its data, so that data access is fast. Data locality helps in achieving good performance by reducing the overhead of network I/O.
7. Can two clients write to the same HDFS file simultaneously?
No. When two clients try to write to a file simultaneously, the second client has to wait until the first client has completed its job. This does not apply to reading a file; multiple clients can read a file simultaneously. Hadoop is therefore built around the write once read many (WORM) model.
8. How does the NameNode handle DataNode failures?
In a Hadoop cluster, the NameNode periodically receives a heartbeat and a block report from each DataNode; the block report lists the data blocks stored on that DataNode. If the NameNode does not receive heartbeat messages from a particular DataNode within a certain period of time, it marks the DataNode as dead.
To recover the data lost on that DataNode, the NameNode begins replicating the blocks that were stored on the dead DataNode to live DataNodes, according to the dead DataNode's block report. This restores the configured replication factor for the affected blocks.
9. Why does HDFS handle small files poorly?
HDFS is designed for storing and processing big data, so it is not prepared to efficiently store or process numerous small files. Such files generate a lot of overhead for the NameNode and the DataNodes. Reading through small files normally causes a lot of seeks and hopping from one DataNode to another to retrieve each small file. All of this adds up to inefficient data read/write operations.
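One common mitigation, mentioned here only as an illustration with hypothetical paths, is to pack many small files into a single, larger SequenceFile so that the NameNode tracks one file instead of thousands:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path("/data/small-files");   // hypothetical directory of small files
        Path packed = new Path("/data/packed.seq");      // single large output file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) {
                    continue;
                }
                byte[] contents = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, contents, 0, contents.length);
                }
                // key = original file name, value = raw bytes of the file
                writer.append(new Text(status.getPath().getName()), new BytesWritable(contents));
            }
        }
    }
}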
10. What is the difference between Data Block and Input Split?
Data Block: HDFS stores a large file by first splitting it into smaller chunks known as blocks; thus, it stores each file as a set of data blocks. These data blocks are replicated and distributed across multiple DataNodes.
Input Split: An input split represents the amount of data that is processed by an individual Mapper at a time. In MapReduce, the number of Map tasks is equal to the number of input splits, so controlling the split size controls how many Map tasks are launched.
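For illustration, split boundaries can be tuned from the job driver, which in turn changes how many Map tasks are launched (the 64 MB and 256 MB values below are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split size demo");
        // Each split normally corresponds to one HDFS block, but the bounds can be tuned;
        // one Map task is launched per input split.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);    // at least 64 MB per split
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);   // at most 256 MB per split
        // ... remaining job configuration (mapper, reducer, input/output paths) goes here
    }
}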
11. How is High Availability achieved for the NameNode in Hadoop 2.x?
The introduction of the Standby NameNode in Hadoop 2.x ensures High Availability in Hadoop clusters, which was not present in Hadoop 1.x. In a Hadoop 1.x cluster (one NameNode, multiple DataNodes), the NameNode was a single point of failure: if the NameNode went down, there was no backup and the entire cluster became unavailable. Hadoop 2.x solved this problem by adding a Standby NameNode to the cluster, so that a pair of NameNodes runs in an active-standby configuration. The Standby NameNode acts as a backup for the NameNode metadata; it also receives block reports from the DataNodes and maintains a synced copy of the edit logs with the active NameNode. If the active NameNode goes down, the Standby NameNode takes charge and ensures cluster availability.
12. What is Hadoop Streaming?
Hadoop Streaming is an API that allows writing Mappers and Reducers in any language.
It uses Unix standard streams as the interface between Hadoop and the user application.
Streaming is naturally suited for text processing. The data view is line-oriented, and each line is processed as a key-value pair separated by a tab character. The Reduce function reads lines from the standard input, sorted by key, and writes its results to the standard output.
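For illustration, a minimal mapper following the Streaming contract: it reads raw lines from standard input and writes tab-separated key-value lines to standard output. It is written in Java here only to keep the examples in one language; any executable that follows the same contract (for example a Python or shell script) can be passed to the streaming jar via -mapper and -reducer. The class name is illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // key and value separated by a tab, one pair per line
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}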
13. How does the YARN ResourceManager achieve High Availability?
The YARN ResourceManager is responsible for managing the resources in a cluster and scheduling applications. Prior to Hadoop 2.4, the ResourceManager was a single point of failure in a YARN cluster.
The ResourceManager provides High Availability (HA) by implementing an active-standby ResourceManager pair to remove this single point of failure. When the active ResourceManager node fails, control switches to the standby ResourceManager, and all halted applications resume from the last state saved in the state store. This allows handling failover without any performance degradation in the following situations:
● Unplanned events such as machine crashes
● Planned maintenance events, such as software or hardware upgrades on the machine running the ResourceManager
Note that ResourceManager HA requires the ZooKeeper and HDFS services to be running.
Given below is a link to the official Hadoop documentation on the ResourceManager HA concept:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html