
Big Data Hadoop Questions

1. What are the different configuration files in Hadoop?


Answer: The main configuration files in Hadoop are:
• core-site.xml – Contains Hadoop core configuration settings, for example I/O settings common to MapReduce and HDFS. It specifies the hostname and port of the default file system.
• mapred-site.xml – Specifies the framework name for MapReduce by setting mapreduce.framework.name.
• hdfs-site.xml – Contains configuration settings for the HDFS daemons. It also specifies the default block replication and permission checking on HDFS.
• yarn-site.xml – Specifies configuration settings for the ResourceManager and NodeManager.
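As a hedged illustration (this snippet is not part of the original answer), a minimal mapred-site.xml that selects YARN as the MapReduce framework could look like this:

<configuration>
  <property>
    <!-- Run MapReduce jobs on the YARN framework instead of the local runner -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>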

2. How does Hadoop MapReduce work?
There are two phases in a MapReduce operation.
• Map phase – In this phase, the input data is divided into splits and the map tasks process the splits in parallel, producing intermediate key-value pairs for analysis.
• Reduce phase – In this phase, the intermediate data with the same key is aggregated from the entire collection and the result is produced.
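As a rough sketch only (class names such as TokenMapper and SumReducer are illustrative assumptions, following the standard word-count pattern rather than anything in this document), the two phases map onto the following Java classes:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each map task reads one input split and emits (word, 1) pairs.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: values for the same key are aggregated across all map outputs.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}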

3. What is MapReduce? What is the syntax you use to run a MapReduce program?

MapReduce is a programming model in Hadoop for processing large data sets over a cluster of computers, with the data typically stored in HDFS. It is a parallel programming model.
The syntax to run a MapReduce program is: hadoop jar <jar_file> [main_class] <input_path> <output_path>
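For example, an actual run might look like this (the jar name, driver class and paths are illustrative assumptions):

hadoop jar wordcount.jar WordCountDriver /user/hadoop/input /user/hadoop/output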
4. How do you restart all the daemons in Hadoop?
Answer: To restart all the daemons, it is required to stop all the daemons first. The Hadoop directory contains an sbin directory that stores the scripts to stop and start the daemons in Hadoop.
Use the sbin/stop-all.sh command to stop all the daemons, and then use the sbin/start-all.sh command to start them all again.

5. Why is HDFS only suitable for large data sets and not the correct tool to use for many small files?
This is due to a performance limitation of the NameNode. The NameNode holds the metadata for every file and block in memory, so its capacity is bounded by the number of metadata objects rather than by the total volume of data. Large files use this metadata efficiently, because a single entry describes a large amount of data. A large number of small files, however, creates a huge number of metadata objects for very little data, which exhausts the NameNode's memory and degrades performance.
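As a rough, commonly quoted rule of thumb (an assumption for illustration, not a figure taken from this document): every file, directory and block object consumes on the order of 150 bytes of NameNode memory. Ten million small files, each occupying its own block, would therefore need roughly 20 million objects × 150 bytes ≈ 3 GB of NameNode heap, no matter how little data those files actually contain.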
6. Name the core methods of a reducer
The three core methods of a reducer are:
1. setup()

2. reduce()

3. cleanup()
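A minimal sketch of where these three methods sit in a Reducer subclass (the class name and key/value types are illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SkeletonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) {
        // Runs once per task before any key is processed,
        // e.g. to read configuration values or open side resources.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Runs once per key with all of that key's values.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // Runs once per task after the last key, e.g. to release resources.
    }
}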

7. Hadoop vs. Traditional RDBMS


Hadoop was designed for large-scale, distributed data processing that touches every file in the data store, which is a type of processing that takes time. For tasks where performance isn't critical, such as running end-of-day reports to review daily transactions, scanning historical data, and performing analytics where a slower time-to-insight is acceptable, Hadoop is ideal.
On the other hand, in cases where organizations rely on time-sensitive data analysis, a traditional database is the better fit. That's because a shorter time-to-insight isn't about analyzing large unstructured datasets, which Hadoop does so well. It's about analyzing smaller data sets in real or near-real time, which is what traditional databases are well equipped to do.
An RDBMS works best when the entity-relationship (ER) model is defined precisely, following Codd's rules, so that the database schema or structure can evolve in a controlled way. The emphasis is on strong consistency, referential integrity, abstraction from the physical layer, and complex queries through SQL. The Hadoop framework, by contrast, works well with both structured and unstructured data and supports a variety of data formats, such as XML, JSON, and text-based flat files.

Here are the key differences between Hadoop and a relational database:
RDBMS vs. Hadoop
• Data Types – RDBMS relies on structured data, and the schema of the data is always known. Hadoop can store any kind of data, be it structured, unstructured or semi-structured.
• Processing – RDBMS provides limited or no processing capabilities. Hadoop allows us to process data that is distributed across the cluster in a parallel fashion.
• Schema on Read vs. Write – RDBMS is based on 'schema on write', where schema validation is done before loading the data. Hadoop, on the contrary, follows a schema-on-read policy.
• Read/Write Speed – In RDBMS, reads are fast because the schema of the data is already known. In HDFS, writes are fast because no schema validation happens during an HDFS write.
• Cost – RDBMS is licensed software, so you have to pay for it. Hadoop is an open-source framework, so there is no need to pay for the software.
• Best Fit Use Case – RDBMS is used for OLTP (Online Transaction Processing) systems. Hadoop is used for data discovery, data analytics or OLAP systems.
8. What happens when two clients try to access the same file in HDFS?
HDFS supports exclusive writes only.
When the first client contacts the “NameNode” to open the file for
writing, the “NameNode” grants a lease to the client to create this
file. When the second client tries to open the same file for writing,
the “NameNode” will notice that the lease for the file is already
granted to another client, and will reject the open request for the
second client.
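As an illustrative sketch only (the path is made up, and the exact exception type can vary; HDFS reports the rejection to the second writer as an IOException):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExclusiveWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("/tmp/shared.txt");   // illustrative path

        // First client: the NameNode grants a lease and the create succeeds.
        FileSystem clientOne = FileSystem.newInstance(conf);
        FSDataOutputStream firstWriter = clientOne.create(file, false);

        // Second client: the lease on the file is already held,
        // so the NameNode rejects this create request.
        FileSystem clientTwo = FileSystem.newInstance(conf);
        try {
            clientTwo.create(file, false).close();
        } catch (IOException rejected) {
            System.out.println("Second writer rejected: " + rejected.getMessage());
        }

        firstWriter.close();
        clientOne.close();
        clientTwo.close();
    }
}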
9. What does the 'jps' command do?
It gives the status of the daemons that run the Hadoop cluster. Its output lists the status of the NameNode, DataNode, Secondary NameNode, JobTracker and TaskTracker.
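Illustrative output only (the process IDs are made up, and the exact daemon list depends on the Hadoop version and the node's role):

4821 NameNode
4998 DataNode
5210 SecondaryNameNode
5377 JobTracker
5542 TaskTracker
6030 Jps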
10. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
It will restart the task on some other TaskTracker, and only if the task fails more than four times (the default setting, which can be changed) will it kill the job.
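The retry limit is driven by configuration. As an illustrative snippet using the mapreduce.* property names (shown with their usual default of 4):

<property>
  <!-- Maximum attempts per map task before the job is marked failed -->
  <name>mapreduce.map.maxattempts</name>
  <value>4</value>
</property>
<property>
  <!-- Maximum attempts per reduce task before the job is marked failed -->
  <name>mapreduce.reduce.maxattempts</name>
  <value>4</value>
</property>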
11. Consider this case scenario in a MapReduce system:
- HDFS block size is 64 MB
- Input format is FileInputFormat
- We have 3 files of size 64 KB, 65 MB and 127 MB
How many input splits will be made by the Hadoop framework?
Hadoop will make 5 splits, because a file smaller than a block becomes one split and a larger file gets one split per 64 MB block it spans:
• 1 split for the 64 KB file
• 2 splits for the 65 MB file
• 2 splits for the 127 MB file

12. What is speculative execution?
A job running on a Hadoop cluster is divided into many tasks. In a big cluster, some of these tasks can run slowly for various reasons, such as hardware degradation or software misconfiguration. Hadoop launches a replica of a task when it sees a task that has been running for some time without making as much progress, on average, as the other tasks of the job. This replicated or duplicate execution of a task is referred to as speculative execution.
When a task completes successfully, all the duplicate tasks that are still running are killed. So if the original task completes before the speculative task, the speculative task is killed; on the other hand, if the speculative task finishes first, the original is killed.
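Speculative execution can be enabled or disabled per task type. As an illustrative snippet using the mapreduce.* property names (both normally default to true):

<property>
  <!-- Launch speculative copies of slow map tasks -->
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <!-- Launch speculative copies of slow reduce tasks -->
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>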
13. What is the functionality of the JobTracker in Hadoop? How many instances of a JobTracker run on a Hadoop cluster?
The JobTracker is the central service used to submit and track MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs within its own JVM process.
Functionalities of the JobTracker in Hadoop:
◦ When a client application submits a job to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.
◦ It locates TaskTracker nodes with available slots for the data.
◦ It assigns the work to the chosen TaskTracker nodes.
◦ The TaskTracker nodes are responsible for notifying the JobTracker when a task fails, and the JobTracker then decides what to do: it may resubmit the task on another node, or it may mark that task to be avoided.
14. Explain Hadoop Archives.
Apache Hadoop HDFS stores and processes large (terabyte-scale) data sets. However, storing a large number of small files in HDFS is inefficient, since each file is stored in a block and the block metadata is held in memory by the NameNode. Reading through small files normally causes lots of seeks and lots of hopping from DataNode to DataNode to retrieve each small file, all of which is an inefficient data access pattern.
Hadoop Archives (HAR) address the small files issue. A HAR packs a number of small files into a larger file, so one can still access the original files in parallel, transparently (without expanding the archive) and efficiently.
Hadoop Archives are special-format archives. An archive maps to a file system directory and always has a *.har extension. In particular, Hadoop MapReduce can use a Hadoop Archive as input.
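A typical invocation looks like the following (the archive name and paths are illustrative; -p gives the parent path that the source directories are relative to):

hadoop archive -archiveName files.har -p /user/hadoop dir1 dir2 /user/hadoop/archives

The archived files can then be listed through the har:// file system, for example:

hdfs dfs -ls har:///user/hadoop/archives/files.har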

15. What is the difference between Reducer and Combiner in Hadoop MapReduce?
The Combiner is a mini-reducer that performs a local reduce task. It runs on the map output, and its output becomes the input of the reducer. A combiner is usually used for network optimization. The Reducer takes the set of intermediate key-value pairs produced by the mappers as its input, then runs a reduce function on each of them to generate the output. The output of the reducer is the final output.
• Unlike a reducer, the combiner has a limitation: its input and output key and value types must match the output types of the mapper.
• Combiners can operate only on a subset of keys and values, i.e. combiners can only be applied to functions that are commutative and associative.
• Combiner functions take input from a single mapper, while reducers can take data from multiple mappers as a result of partitioning.
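As a sketch of how a combiner is wired into a job (the driver, mapper and reducer class names are illustrative assumptions following the standard word-count pattern, reusing the classes sketched under question 2):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenMapper.class);
        // The same summing logic runs locally on each mapper's output first...
        job.setCombinerClass(SumReducer.class);
        // ...and then globally on the partitioned, merged data.
        job.setReducerClass(SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Using the reducer class as the combiner works here because summing counts is commutative and associative, so applying it locally before the shuffle does not change the final result.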
