
Big Data Hadoop Questions

1. What are the different configuration files in Hadoop?


Answer: The main configuration files in Hadoop are:
• core-site.xml – Contains Hadoop core configuration settings, for example I/O settings common to MapReduce and HDFS. It specifies the hostname and port of the default file system.
• mapred-site.xml – Specifies the framework name for MapReduce by setting mapreduce.framework.name.
• hdfs-site.xml – Contains configuration settings for the HDFS daemons. It also specifies the default block replication and permission checking on HDFS.
• yarn-site.xml – Specifies configuration settings for the ResourceManager and NodeManager.
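As a hedged illustration (this snippet is not part of the original answer), a minimal mapred-site.xml that selects YARN as the MapReduce framework could look like this:

<configuration>
  <property>
    <!-- Run MapReduce jobs on the YARN framework instead of the local runner -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>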

2. How does Hadoop MapReduce work?
There are two phases in a MapReduce operation.
• Map phase – In this phase, the input data is divided into splits and the map tasks process the splits in parallel, producing intermediate key-value pairs for analysis.
• Reduce phase – In this phase, the intermediate data with the same key is aggregated from the entire collection and the result is produced.
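As a rough sketch only (class names such as TokenMapper and SumReducer are illustrative assumptions, following the standard word-count pattern rather than anything in this document), the two phases map onto the following Java classes:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each map task reads one input split and emits (word, 1) pairs.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: values for the same key are aggregated across all map outputs.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}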

3. What is MapReduce? What is the syntax you use to run a MapReduce program?

MapReduce is a programming model in Hadoop for processing large data sets over a cluster of computers, with the data typically stored in HDFS. It is a parallel programming model.
The syntax to run a MapReduce program is: hadoop jar <jar_file> [main_class] <input_path> <output_path>
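For example, an actual run might look like this (the jar name, driver class and paths are illustrative assumptions):

hadoop jar wordcount.jar WordCountDriver /user/hadoop/input /user/hadoop/output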
4. How do you restart all the daemons in Hadoop?
Answer: To restart all the daemons, it is required to stop all the daemons first. The Hadoop directory contains an sbin directory that stores the scripts to stop and start the daemons in Hadoop.
Use the sbin/stop-all.sh command to stop all the daemons, and then use the sbin/start-all.sh command to start them all again.

5. Why is HDFS only suitable for large data sets and not the correct tool to use for many small files?
This is due to a performance limitation of the NameNode. The NameNode holds the metadata for every file and block in memory, so its capacity is bounded by the number of metadata objects rather than by the total volume of data. Large files use this metadata efficiently, because a single entry describes a large amount of data. A large number of small files, however, creates a huge number of metadata objects for very little data, which exhausts the NameNode's memory and degrades performance.
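As a rough, commonly quoted rule of thumb (an assumption for illustration, not a figure taken from this document): every file, directory and block object consumes on the order of 150 bytes of NameNode memory. Ten million small files, each occupying its own block, would therefore need roughly 20 million objects × 150 bytes ≈ 3 GB of NameNode heap, no matter how little data those files actually contain.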
6. Name the core methods of a reducer
The three core methods of a reducer are:
1. setup()

2. reduce()

3. cleanup()
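A minimal sketch of where these three methods sit in a Reducer subclass (the class name and key/value types are illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SkeletonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) {
        // Runs once per task before any key is processed,
        // e.g. to read configuration values or open side resources.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Runs once per key with all of that key's values.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // Runs once per task after the last key, e.g. to release resources.
    }
}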

7. Hadoop vs. Traditional RDBMS


Hadoop was designed for large-scale, distributed data processing that touches every file in the data store, which is a type of processing that takes time. For tasks where performance isn't critical, such as running end-of-day reports to review daily transactions, scanning historical data, and performing analytics where a slower time-to-insight is acceptable, Hadoop is ideal.
On the other hand, in cases where organizations rely on time-sensitive data analysis, a traditional database is the better fit. That's because a shorter time-to-insight isn't about analyzing large unstructured datasets, which Hadoop does so well. It's about analyzing smaller data sets in real or near-real time, which is what traditional databases are well equipped to do.
An RDBMS works best when the entity-relationship (ER) model is defined precisely, following Codd's rules, so that the database schema or structure can evolve in a controlled way. The emphasis is on strong consistency, referential integrity, abstraction from the physical layer, and complex queries through SQL. The Hadoop framework, by contrast, works well with both structured and unstructured data and supports a variety of data formats, such as XML, JSON, and text-based flat files.

Here are the key differences between Hadoop and a relational database:
RDBMS vs. Hadoop
• Data Types – RDBMS relies on structured data, and the schema of the data is always known. Hadoop can store any kind of data, be it structured, unstructured or semi-structured.
• Processing – RDBMS provides limited or no processing capabilities. Hadoop allows us to process data that is distributed across the cluster in a parallel fashion.
• Schema on Read vs. Write – RDBMS is based on 'schema on write', where schema validation is done before loading the data. Hadoop, on the contrary, follows a schema-on-read policy.
• Read/Write Speed – In RDBMS, reads are fast because the schema of the data is already known. In HDFS, writes are fast because no schema validation happens during an HDFS write.
• Cost – RDBMS is licensed software, so you have to pay for it. Hadoop is an open-source framework, so there is no need to pay for the software.
• Best Fit Use Case – RDBMS is used for OLTP (Online Transaction Processing) systems. Hadoop is used for data discovery, data analytics or OLAP systems.
8. What happens when two clients try to access the same file in HDFS?
HDFS supports exclusive writes only.
When the first client contacts the “NameNode” to open the file for
writing, the “NameNode” grants a lease to the client to create this
file. When the second client tries to open the same file for writing,
the “NameNode” will notice that the lease for the file is already
granted to another client, and will reject the open request for the
second client.
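As an illustrative sketch only (the path is made up, and the exact exception type can vary; HDFS reports the rejection to the second writer as an IOException):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExclusiveWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("/tmp/shared.txt");   // illustrative path

        // First client: the NameNode grants a lease and the create succeeds.
        FileSystem clientOne = FileSystem.newInstance(conf);
        FSDataOutputStream firstWriter = clientOne.create(file, false);

        // Second client: the lease on the file is already held,
        // so the NameNode rejects this create request.
        FileSystem clientTwo = FileSystem.newInstance(conf);
        try {
            clientTwo.create(file, false).close();
        } catch (IOException rejected) {
            System.out.println("Second writer rejected: " + rejected.getMessage());
        }

        firstWriter.close();
        clientOne.close();
        clientTwo.close();
    }
}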
9. What does the 'jps' command do?
It gives the status of the daemons that run the Hadoop cluster. Its output lists the status of the NameNode, DataNode, Secondary NameNode, JobTracker and TaskTracker.
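Illustrative output only (the process IDs are made up, and the exact daemon list depends on the Hadoop version and the node's role):

4821 NameNode
4998 DataNode
5210 SecondaryNameNode
5377 JobTracker
5542 TaskTracker
6030 Jps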
10. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
It will restart the task on some other TaskTracker, and only if the task fails more than four times (the default setting, which can be changed) will it kill the job.
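The retry limit is driven by configuration. As an illustrative snippet using the mapreduce.* property names (shown with their usual default of 4):

<property>
  <!-- Maximum attempts per map task before the job is marked failed -->
  <name>mapreduce.map.maxattempts</name>
  <value>4</value>
</property>
<property>
  <!-- Maximum attempts per reduce task before the job is marked failed -->
  <name>mapreduce.reduce.maxattempts</name>
  <value>4</value>
</property>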
11. Consider this case scenario in a MapReduce system:
- HDFS block size is 64 MB
- Input format is FileInputFormat
- We have 3 files of size 64 KB, 65 MB and 127 MB
How many input splits will be made by the Hadoop framework?
Hadoop will make 5 splits, because a file smaller than a block becomes one split and a larger file gets one split per 64 MB block it spans:
• 1 split for the 64 KB file
• 2 splits for the 65 MB file
• 2 splits for the 127 MB file

12. What is speculative execution?
A job running on a Hadoop cluster is divided into many tasks. In a big cluster, some of these tasks can run slowly for various reasons, such as hardware degradation or software misconfiguration. Hadoop launches a replica of a task when it sees a task that has been running for some time without making as much progress, on average, as the other tasks of the job. This replicated or duplicate execution of a task is referred to as speculative execution.
When a task completes successfully, all the duplicate tasks that are still running are killed. So if the original task completes before the speculative task, the speculative task is killed; on the other hand, if the speculative task finishes first, the original is killed.
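Speculative execution can be enabled or disabled per task type. As an illustrative snippet using the mapreduce.* property names (both normally default to true):

<property>
  <!-- Launch speculative copies of slow map tasks -->
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <!-- Launch speculative copies of slow reduce tasks -->
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>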
13. What is the functionality of the JobTracker in Hadoop? How many instances of a JobTracker run on a Hadoop cluster?
The JobTracker is the central service used to submit and track MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs within its own JVM process.
Functionalities of the JobTracker in Hadoop:
◦ When a client application submits a job to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.
◦ It locates TaskTracker nodes with available slots for the data.
◦ It assigns the work to the chosen TaskTracker nodes.
◦ The TaskTracker nodes are responsible for notifying the JobTracker when a task fails, and the JobTracker then decides what to do: it may resubmit the task on another node, or it may mark that task to be avoided.
14. Explain Hadoop Archives.
Apache Hadoop HDFS stores and processes large (terabyte-scale) data sets. However, storing a large number of small files in HDFS is inefficient, since each file is stored in a block and the block metadata is held in memory by the NameNode. Reading through small files normally causes lots of seeks and lots of hopping from DataNode to DataNode to retrieve each small file, all of which is an inefficient data access pattern.
Hadoop Archives (HAR) address the small files issue. A HAR packs a number of small files into a larger file, so one can still access the original files in parallel, transparently (without expanding the archive) and efficiently.
Hadoop Archives are special-format archives. An archive maps to a file system directory and always has a *.har extension. In particular, Hadoop MapReduce can use a Hadoop Archive as input.
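A typical invocation looks like the following (the archive name and paths are illustrative; -p gives the parent path that the source directories are relative to):

hadoop archive -archiveName files.har -p /user/hadoop dir1 dir2 /user/hadoop/archives

The archived files can then be listed through the har:// file system, for example:

hdfs dfs -ls har:///user/hadoop/archives/files.har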

15. What is the difference between Reducer and Combiner in Hadoop MapReduce?
The Combiner is a mini-reducer that performs a local reduce task. It runs on the map output, and its output becomes the input of the reducer. A combiner is usually used for network optimization. The Reducer takes the set of intermediate key-value pairs produced by the mappers as its input, then runs a reduce function on each of them to generate the output. The output of the reducer is the final output.
• Unlike a reducer, the combiner has a limitation: its input and output key and value types must match the output types of the mapper.
• Combiners can operate only on a subset of keys and values, i.e. combiners can only be applied to functions that are commutative and associative.
• Combiner functions take input from a single mapper, while reducers can take data from multiple mappers as a result of partitioning.
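As a sketch of how a combiner is wired into a job (the driver, mapper and reducer class names are illustrative assumptions following the standard word-count pattern, reusing the classes sketched under question 2):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenMapper.class);
        // The same summing logic runs locally on each mapper's output first...
        job.setCombinerClass(SumReducer.class);
        // ...and then globally on the partitioned, merged data.
        job.setReducerClass(SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Using the reducer class as the combiner works here because summing counts is commutative and associative, so applying it locally before the shuffle does not change the final result.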
