Big Data Hadoop Questions
2.
How does Hadoop MapReduce work?
There are two phases in a MapReduce operation (a brief code sketch follows the list).
• Map phase – In this phase, the input data is divided into splits, and each split is processed by a map task. The map tasks run in parallel and emit intermediate key-value pairs from their splits.
• Reduce phase – In this phase, the intermediate key-value pairs are shuffled and sorted by key, and the reduce tasks aggregate the values for each key across the entire collection to produce the final result.
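The following word-count sketch, written against Hadoop's Java MapReduce API, illustrates the two phases; the class names are illustrative, and the mapper and reducer would normally live in their own source files.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: each map task reads one input split and emits (word, 1) pairs.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all values for the same word are grouped together and summed.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }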
5.
Why is HDFS only suitable for large data sets and not the correct tool to
use for many small files?
This is due to the memory limits of the NameNode. The NameNode keeps the metadata for every file and block in HDFS in its main memory, and each metadata object consumes a roughly fixed amount of heap regardless of how large the file is. A few very large files therefore need very little metadata, while millions of small files consume NameNode memory out of all proportion to the data they actually store, which limits scalability and degrades performance.
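A rough, back-of-the-envelope sketch in Java makes the imbalance concrete; it assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per file or block object and the default 128 MB block size, both of which vary by Hadoop version and configuration.

    // Rough comparison of NameNode heap usage for 1 GB of data stored
    // as a single file versus as 10,000 small files (~100 KB each).
    public class NameNodeHeapEstimate {
        public static void main(String[] args) {
            long bytesPerObject = 150L;           // rule-of-thumb heap cost per metadata object
            long blockSize = 128L * 1024 * 1024;  // default HDFS block size (128 MB)

            // One 1 GB file: 1 file object + 8 block objects.
            long largeFileObjects = 1 + (1024L * 1024 * 1024) / blockSize;

            // 10,000 small files: one file object and one block object each.
            long smallFileObjects = 10_000L * 2;

            System.out.println("1 GB as one file:     ~" + largeFileObjects * bytesPerObject + " bytes of NameNode heap");
            System.out.println("1 GB as 10,000 files: ~" + smallFileObjects * bytesPerObject + " bytes of NameNode heap");
        }
    }

Under these assumptions, the same gigabyte of data costs the NameNode orders of magnitude more memory when it is scattered across small files.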
6. Name the core methods of a reducer
The three core methods of a reducer are:
1. setup()
2. reduce()
3. cleanup()
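A minimal sketch of where each of these methods runs in a Java Reducer; the key/value types and the logic are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    class LifecycleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void setup(Context context) {
            // Runs once per reduce task, before any keys are processed
            // (e.g. read configuration, open connections).
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Runs once per key, with all the values grouped under that key.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }

        @Override
        protected void cleanup(Context context) {
            // Runs once per reduce task, after the last key has been processed
            // (e.g. close resources, flush buffers).
        }
    }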
12.
What is Speculative execution?
A job running on a Hadoop cluster is divided into many tasks. In a big cluster some of these tasks may run slowly for various reasons, such as hardware degradation or software misconfiguration. Hadoop launches a replica of a task when it sees a task that has been running for some time and has failed to make as much progress, on average, as the other tasks from the job. This replica, or duplicate execution of a task, is referred to as Speculative Execution.
When a task completes successfully, all the duplicate tasks that are still running are killed. So if the original task completes before the speculative task, the speculative task is killed; on the other hand, if the speculative task finishes first, the original is killed.
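Speculative execution can be toggled per job through standard configuration properties; a minimal sketch using the property names from YARN-era (Hadoop 2.x+) MapReduce follows, with the job name being illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculativeExecutionConfig {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Enable speculative launches for map tasks, disable them for reduce tasks.
            conf.setBoolean("mapreduce.map.speculative", true);
            conf.setBoolean("mapreduce.reduce.speculative", false);

            Job job = Job.getInstance(conf, "speculative-execution-demo");
            // ... set mapper, reducer, input and output paths as usual ...
        }
    }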
13.
What is the functionality of JobTracker in Hadoop? How many instances of
a JobTracker run on a Hadoop cluster?
JobTracker is the daemon service used to submit and track MapReduce jobs in Hadoop's classic MapReduce framework (MRv1). Only one JobTracker process runs on any Hadoop cluster, and it runs within its own JVM process.
Functionalities of JobTracker in Hadoop (a client-side job submission sketch follows this list):
◦ When a client application submits a job to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.
◦ It locates TaskTracker nodes with available slots at or near the data.
◦ It assigns the work to the chosen TaskTracker nodes.
◦ The TaskTracker nodes are responsible for notifying the JobTracker when a task fails, and the JobTracker then decides what to do: it may resubmit the task on another node, or it may mark that task as one to avoid.
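A minimal sketch of the client side of this workflow, i.e. a driver class that submits a job to the cluster's scheduler (the JobTracker in MRv1, the ResourceManager under YARN); it reuses the mapper and reducer sketched under question 2, and the class name and paths are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);    // mapper sketched under question 2
            job.setReducerClass(WordCountReducer.class);  // reducer sketched under question 2
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submitting the job hands it to the cluster's scheduler, which locates
            // the data and assigns tasks to worker nodes with free slots/containers.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }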
14.
Explain Hadoop Archives?
Apache Hadoop HDFS stores and processes large (terabyte-scale) data sets. However, storing a large number of small files in HDFS is inefficient, since each file occupies at least one block, and block metadata is held in memory by the NameNode. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which makes for an inefficient data access pattern.
Hadoop Archive (HAR) deals with this small files issue. A HAR packs a number of small files into a single large archive, yet the original files can still be accessed in parallel, transparently (without expanding the archive) and efficiently.
Hadoop Archives are special-format archives. A Hadoop Archive maps to a file system directory and always has a *.har extension. In particular, Hadoop MapReduce can use Hadoop Archives as input.
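Once an archive has been created with the hadoop archive command-line tool, its contents are addressed through the har:// file system scheme. A sketch in Java of reading from an archive follows; the archive path and file name are illustrative.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadFromHar {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // A HAR is addressed through the har:// scheme; this path is illustrative.
            Path harPath = new Path("har:///user/hadoop/archives/logs.har");
            FileSystem fs = harPath.getFileSystem(conf);

            // The archived files appear as an ordinary directory tree.
            for (FileStatus status : fs.listStatus(harPath)) {
                System.out.println(status.getPath());
            }

            // Read one archived file directly, without unpacking the archive.
            Path file = new Path(harPath, "sample.log");
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file)))) {
                System.out.println(reader.readLine());
            }
        }
    }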