Hadoop Interview1
Storage layer – HDFS
Batch processing engine – MapReduce
Resource management layer – YARN
Storage – Since the data is very large, storing such a huge amount of data is very difficult.
Security – Since the data is huge in size, keeping it secure is another challenge.
Analytics – In Big Data, most of the time we are unaware of the kind of data we are dealing with, so analyzing that data is even more difficult.
Data Quality – In the case of Big Data, data is very messy, inconsistent and incomplete.
Discovery – Using a powerful algorithm to find patterns and insights is very difficult.
Apache Hadoop stores huge files as they are (raw) without specifying any
schema.
High scalability – We can add any number of nodes, thereby enhancing performance dramatically.
Reliable – It stores data reliably on the cluster despite machine failure.
High availability – In Hadoop, data is highly available despite hardware failure. If a machine or its hardware crashes, we can access the data from another path.
Economic – Hadoop runs on a cluster of commodity hardware, which is not very expensive.
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
core-site.xml – This file contains the configuration settings for Hadoop core, such as the default file system URI (fs.defaultFS).
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml – This file contains the configuration settings for the HDFS daemons. hdfs-site.xml also specifies the default block replication and permission checking on HDFS.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml – In this file, we specify the framework to be used for MapReduce by setting the mapreduce.framework.name property.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml – This file provides the configuration settings for the NodeManager and the ResourceManager.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
11) What are the limitations of Hadoop?
Various limitations of Hadoop are:
Issue with small files – Hadoop is not suited for small files. Small files are a major problem in HDFS. A small file is significantly smaller than the HDFS block size (default 128 MB). HDFS is designed to work with a small number of large files for storing data sets rather than a large number of small files. Storing a huge number of small files overloads the NameNode, since the NameNode stores the entire namespace of HDFS in memory.
HAR files, sequence files, and HBase help overcome the small-files issue.
Processing Speed – MapReduce processes large data sets with a parallel and distributed algorithm in two phases: Map and Reduce. MapReduce requires a lot of time to perform these tasks, thereby increasing latency. Since the data is distributed and processed over the cluster, the time taken increases and the processing speed goes down.
Support only Batch Processing – Hadoop supports only batch processing. It does not process streamed data, and hence overall performance is slower. The MapReduce framework does not leverage the memory of the cluster to the maximum.
Iterative Processing – Hadoop is not efficient for iterative processing, as it does not support cyclic data flow, i.e. a chain of stages in which the input to the next stage is the output of the previous stage.
Vulnerable by nature – Hadoop is entirely written in Java, one of the most widely used languages. Java has been heavily exploited by cyber-criminals, which has implicated Hadoop in numerous security breaches.
Security – Managing a complex application such as Hadoop can be challenging. Hadoop is missing encryption at the storage and network levels, which is a major point of concern. Hadoop supports Kerberos authentication, which is hard to manage.
Data local – In this category, the data is on the same node as the mapper working on it. In this case the data is as close as possible to the computation, and this is the most preferred scenario.
Intra-rack – In this scenario, the mapper runs on a different node but on the same rack, since it is not always possible to execute the mapper on the same DataNode due to resource constraints.
Inter-rack – In this scenario, the mapper runs on a different rack, since it is not possible to execute the mapper on a node in the same rack due to resource constraints.
It loads the file system namespace from the last saved FsImage into its main memory, together with the edits log file.
It merges the edits log file into the FsImage, producing a new file system namespace.
Then it receives block reports containing information about block locations from all the DataNodes.
18) Why does one remove or add nodes in a Hadoop cluster frequently?
One of the most important features of Hadoop is its use of commodity hardware. However, this leads to frequent DataNode crashes in a Hadoop cluster.
Another striking feature of Hadoop is the ease of scaling in response to rapid growth in data volume.
Hence, for the above reasons, administrators add and remove DataNodes in a Hadoop cluster frequently.
19) What is throughput in Hadoop?
The amount of work done in a unit of time is throughput. HDFS provides good throughput for the reasons below:
HDFS follows the write-once, read-many model. This simplifies data coherency issues, as data written once cannot be modified, and thus provides high-throughput data access.
Hadoop works on the data locality principle, which states that computation is moved to the data instead of data to the computation. This reduces network congestion and therefore enhances the overall system throughput.
First of all, run "ps -ef | grep -i ResourceManager" and then look for the log directory in the displayed result. Find the job-id from the displayed list and check whether there is an error message associated with that job.
Now, on the basis of the RM logs, identify the worker node that was involved in the execution of the task.
Now, log in to that node and run "ps -ef | grep -i NodeManager".
Examine the NodeManager log.
The majority of errors come from the user-level logs for each map-reduce job.
MultipleInputs.addInputPath(job, ncdcInputPath, TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metofficeInputPath, TextInputFormat.class, MetofficeMaxTemperatureMapper.class);
The above code replaces the usual calls to FileInputFormat.addInputPath() and job.setMapperClass(). Both the Met Office and NCDC data are text based, so we use TextInputFormat for each, and we use two different mappers because the two data sources have different line formats. The MaxTemperatureMapper reads the NCDC input data and extracts the year and temperature fields, while the MetofficeMaxTemperatureMapper reads the Met Office input data and likewise extracts the year and temperature fields.
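To place these calls in context, a minimal driver sketch (new MapReduce API) could look as follows; the MaxTemperatureReducer class and the command-line path arguments are assumptions used only for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max temperature");
    job.setJarByClass(MaxTemperatureDriver.class);

    // One mapper per data source; this replaces FileInputFormat.addInputPath()
    // and job.setMapperClass().
    Path ncdcInputPath = new Path(args[0]);
    Path metofficeInputPath = new Path(args[1]);
    MultipleInputs.addInputPath(job, ncdcInputPath, TextInputFormat.class, MaxTemperatureMapper.class);
    MultipleInputs.addInputPath(job, metofficeInputPath, TextInputFormat.class, MetofficeMaxTemperatureMapper.class);

    // A single reducer (assumed here) sees the records from both mappers.
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}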
32) Is it possible to have Hadoop job output in multiple directories? If yes, how?
Yes, it is possible by using the following approaches:
a. Using the MultipleOutputs class –
This class simplifies writing output data to multiple outputs. In the driver, a named output is registered:
MultipleOutputs.addNamedOutput(job, "OutputFileName", OutputFormatClass, keyClass, valueClass);
The API provides overloaded write methods on the MultipleOutputs instance used inside the mapper or reducer:
multipleOutputs.write("OutputFileName", new Text(key), new Text(value));
Then, we use the overloaded write method with an extra parameter for the base output path, which allows the output files to be written to separate output directories:
multipleOutputs.write("OutputFileName", new Text(key), new Text(value), baseOutputPath);
Then, we change the baseOutputPath in each of our implementations (a complete reducer sketch is given after approach b below).
b. Rename/Move the file in the driver class –
This is the easiest hack to write output to multiple directories. We can use MultipleOutputs and write all the output files to a single directory, but the file names then need to be different for each category.
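For illustration, here is a minimal, hedged sketch of approach (a) using the new MapReduce API; the class name CategoryReducer and the named output categoryOut are assumptions made for this example.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CategoryReducer extends Reducer<Text, Text, Text, Text> {
  private MultipleOutputs<Text, Text> multipleOutputs;

  @Override
  protected void setup(Context context) {
    multipleOutputs = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // Using the key as the base output path sends each category to its own directory.
      multipleOutputs.write("categoryOut", key, value, key.toString() + "/part");
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    multipleOutputs.close();
  }
}

In the driver, the named output would be registered once, for example with MultipleOutputs.addNamedOutput(job, "categoryOut", TextOutputFormat.class, Text.class, Text.class).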
1) What is HDFS- Hadoop Distributed File System?
The Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. HDFS stores very large files on a cluster of commodity hardware. It works on the principle of storing a small number of large files rather than a huge number of small files. HDFS stores data reliably even in the case of hardware failure, and it provides high-throughput access to applications by serving data in parallel.
Components of HDFS:
I. NameNode – It is also known as the Master node. The NameNode stores metadata, i.e. the number of blocks, their locations, replicas and other details. This metadata is kept in memory on the master for faster retrieval of data. The NameNode maintains and manages the slave nodes and assigns tasks to them. It should be deployed on reliable hardware, as it is the centerpiece of HDFS.
Tasks of NameNode
II. DataNode – It is also known as a Slave. In Hadoop HDFS, the DataNode is responsible for storing the actual data. The DataNode performs read and write operations as per the clients' requests. One can deploy the DataNode on commodity hardware.
Tasks of DataNode
To reduce disk seeks (IO): the larger the block size, the fewer blocks a file has, so fewer disk seeks are needed, and the transfer of a block can be done within respectable limits, and in parallel.
HDFS holds huge data sets, i.e. terabytes and petabytes of data. If we used a 4 KB block size for HDFS, just like the Linux file system which has a 4 KB block size, we would have too many blocks and therefore too much metadata. Managing this huge number of blocks and metadata would create huge overhead and traffic, which is something we don't want. So the block size is set to 128 MB.
On the other hand, the block size can't be so large that the system ends up waiting a very long time for the last unit of data processing to finish its work.
4) How data or file is written into HDFS?
When a client wants to write a file to HDFS, it communicates with the NameNode for metadata. The NameNode responds with the number of blocks, the replication factor, and other details. Then, on the basis of the information from the NameNode, the client splits the file into multiple blocks. After that, the client starts sending them to the first DataNode. The client sends block A to DataNode 1 along with the details of the other two DataNodes.
When DataNode 1 receives block A from the client, DataNode 1 copies the same block to DataNode 2 on the same rack. As both DataNodes are in the same rack, the block is transferred via the rack switch. Now DataNode 2 copies the same block to DataNode 3. As these two DataNodes are in different racks, the block is transferred via an out-of-rack switch.
After the DataNodes receive the blocks from the client, they send write confirmations to the NameNode, and a write confirmation is then sent to the client. The same process repeats for each block of the file. Data transfer happens in parallel for faster writes of blocks.
5) Can multiple clients write into an HDFS file concurrently?
Multiple clients cannot write into an HDFS file at the same time. Apache Hadoop HDFS follows a single-writer, multiple-reader model. The NameNode grants a lease to the client which opens a file for writing. Now suppose some other client wants to write into that file: it asks the NameNode for the write operation. The NameNode first checks whether it has already granted the lease for writing into that file to someone else. If someone has already acquired the lease, it will reject the write request of the other client.
6) How data or file is read in HDFS?
To read from HDFS, the client first communicates with the NameNode for metadata. The NameNode responds with the names of the files, their locations, the number of blocks, and the replication factor. Now the client communicates with the DataNodes where the blocks are present. The client starts reading data in parallel from the DataNodes, on the basis of the information received from the NameNode.
Once the client or application receives all the blocks of the file, it combines these blocks to form the file. To improve read performance, the locations of the blocks are ordered by their distance from the client, and HDFS selects the replica which is closest to the client. This reduces read latency and bandwidth consumption. It first reads the block on the same node, then another node in the same rack, and finally another DataNode in another rack.
7) Why HDFS stores data using commodity hardware despite the higher chance of failures?
HDFS stores data using commodity hardware because HDFS is highly fault-tolerant. HDFS provides fault tolerance by replicating the data blocks and distributing them among different DataNodes across the cluster. By default, the replication factor is 3, and it is configurable. Replication of data solves the problem of data loss in unfavorable conditions such as the crashing of a node, hardware failure, and so on. So, when any machine in the cluster goes down, the client can easily access its data from another machine that contains the same copy of the data blocks.
8) How is indexing done in HDFS?
Hadoop has a unique way of indexing. Once the Hadoop framework stores the data as per the block size, HDFS keeps on storing the last part of the data, which points to where the next part of the data will be. In fact, this is the basis of HDFS.
9) What is a Heartbeat in HDFS?
A heartbeat is the signal that the NameNode receives from the DataNodes to show that they are functioning (alive). The NameNode and DataNodes communicate using heartbeats. If, after a certain time, the NameNode does not receive any heartbeat from a DataNode, that node is considered dead. The NameNode then schedules the creation of new replicas of that node's blocks on other DataNodes.
Heartbeats from a DataNode also carry information about its total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress.
The default heartbeat interval is 3 seconds. One can change it by
using dfs.heartbeat.interval in hdfs-site.xml.
10) How to copy a file into HDFS with a different block size to that of existing block size configuration?
One can copy a file into HDFS with a different block size by using:
-Ddfs.blocksize=block_size, where block_size is in bytes.
So, let us explain it with an example:
Suppose you want to copy a file called test.txt of size, say, 128 MB into HDFS, and for this file you want the block size to be 32 MB (33554432 bytes) instead of the default (128 MB). You would issue the following command:
hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/dataflair/test.txt /sample_hdfs
Now, you can check the HDFS block size associated with this file, for example with hadoop fs -stat %o on the file path (which prints the block size) or with hdfs fsck using the -files -blocks options.
You can also change the replication factor on a per-file basis using the command: hadoop fs -setrep -w 3 /file_location
You can also change the replication factor for all the files in a directory by using: hadoop fs -setrep -w 3 -R /directory_location
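The same adjustments can also be made programmatically. Below is a hedged sketch using the HDFS Java API; the path, buffer size and block size shown are illustrative values only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dst = new Path("/sample_hdfs/test.txt");

    // create(path, overwrite, bufferSize, replication, blockSize):
    // write this file with a 32 MB block size instead of the 128 MB default.
    short replication = 3;
    long blockSize = 32L * 1024 * 1024;
    FSDataOutputStream out = fs.create(dst, true, 4096, replication, blockSize);
    out.close();

    // Change the replication factor of an existing file (equivalent to hadoop fs -setrep).
    fs.setReplication(dst, (short) 3);
  }
}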
XOR Algorithm
Reed-Solomon Algorithm
Round-robin: It distributes the new blocks evenly across the available disks.
Available space: It writes data to the disk that has maximum free space (by
percentage).
20) How would you check whether your NameNode is working or not?
There are several ways to check the status of the NameNode. Mostly, one uses the
jps command to check the status of all daemons running in the HDFS.
21) Is Namenode machine same as DataNode machine as in terms of hardware?
Unlike the DataNodes, a NameNode is a highly available server that manages the file system namespace and maintains the metadata information. The metadata information includes the number of blocks, their locations, replicas and other details. It also executes file system operations such as naming, closing, and opening files and directories.
Therefore, the NameNode requires a large amount of RAM for storing the metadata of millions of files, whereas the DataNode is responsible for storing the actual data in HDFS and performs read and write operations as per the clients' requests. Therefore, a DataNode needs a higher disk capacity for storing huge data sets.
22) What are file permissions in HDFS and how HDFS check permissions for files or directory?
For files and directories, the Hadoop Distributed File System (HDFS) implements a permissions model. For each file or directory, we can manage permissions for a set of 3 distinct user classes: the owner, the group, and others.
There are 3 different permissions for each user class: read (r), write (w), and execute (x).
If the username matches the owner of the file or directory, Hadoop tests the owner's permissions.
If the group matches the directory's group, then Hadoop tests the user's group permissions.
Hadoop tests the "other" permission when the owner and the group names don't match.
If none of the permission checks succeed, the client's request is denied.
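For illustration only, here is a small sketch of inspecting and setting these permission bits through the HDFS Java API; the path used is an assumed example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/user/dataflair/sample.txt");

    // Print the owner, the group, and the rwx bits for the three user classes.
    FileStatus status = fs.getFileStatus(p);
    System.out.println(status.getOwner() + " " + status.getGroup() + " " + status.getPermission());

    // Grant rwx to the owner and r-x to the group and others (octal 755).
    fs.setPermission(p, new FsPermission((short) 0755));
  }
}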
Map – It is the first phase of processing, in which we specify all the complex logic, business rules and costly code. Map takes a set of data and converts it into another set of data, breaking individual elements into tuples (key-value pairs).
Reduce – It is the second phase of processing, in which we specify light-weight processing like aggregation or summation. Reduce takes the output from the map as input and combines the tuples (key-value pairs) based on the key, modifying the value of the key accordingly.
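As a concrete illustration of the two phases, here is a minimal word-count sketch (a standard example, not taken from the text above): the map phase emits (word, 1) tuples and the reduce phase sums them per key.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: break each input line into (word, 1) key-value pairs.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reduce phase: light-weight aggregation, summing the counts for each word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}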
Costly – Keeping all the data (terabytes) on one server or as a database cluster is very expensive, and also hard to manage.
Time-consuming – Using a single machine, we cannot analyze the data (terabytes), as it would take a lot of time.
Cost-efficient – It distributes the data over multiple low-configuration machines.
Time-efficient – If we want to analyze the data, we can write the analysis code in the Map function and the integration code in the Reduce function and execute it. This MapReduce code will go to every machine which has a part of our data and execute on that specific part. Hence, instead of moving terabytes of data, we just move kilobytes of code. This type of movement is time-efficient.
The amount of data we want to process, along with the block size: this determines the number of InputSplits. If we have a block size of 128 MB and we expect 10 TB of input data, we will have about 82,000 maps. Ultimately, the InputFormat determines the number of maps.
The configuration of the slave, i.e. the number of cores and the amount of RAM available on the slave. The right number of maps per node can be between 10 and 100. The Hadoop framework should give 1 to 1.5 cores of the processor to each mapper. Thus, for a 15-core processor, 10 mappers can run.
In a MapReduce job, one can control the number of mappers by changing the block size: changing the block size increases or decreases the number of InputSplits.
By using JobConf's conf.setNumMapTasks(int num), one can increase the number of map tasks manually (see the sketch after the formula below).
Number of mappers = (total data size) / (input split size)
If the data size = 1 TB and the input split size = 100 MB, then
Number of mappers = (1000 * 1000) / 100 = 10,000
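A hedged sketch of the two knobs mentioned above, the old-API JobConf hint and the new-API split-size settings; the numeric values are illustrative only.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MapperCountHints {
  public static void configure() throws Exception {
    // Old API: a hint (not a guarantee) for the number of map tasks.
    JobConf oldApiConf = new JobConf();
    oldApiConf.setNumMapTasks(10000);

    // New API: influence the number of mappers indirectly through the split size.
    Job job = Job.getInstance();
    FileInputFormat.setMinInputSplitSize(job, 100L * 1024 * 1024); // 100 MB
    FileInputFormat.setMaxInputSplitSize(job, 100L * 1024 * 1024);
  }
}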
Unlike a reducer, the combiner has a limitation: its input and output key and value types must match the output types of the mapper.
Combiners can operate only on a subset of keys and values, i.e. combiners can execute only on functions that are commutative and associative.
Combiner functions take input from a single mapper, while reducers can take data from multiple mappers as a result of partitioning.
Block – A block is a contiguous location on the hard drive where HDFS stores data. In general, a file system stores data as a collection of blocks. In a similar way, HDFS stores each file as blocks and distributes them across the Hadoop cluster.
InputSplit – An InputSplit represents the data which an individual Mapper will process. The split is further divided into records, and each record (a key-value pair) is processed by the map function.
Data representation
Block – It is the physical representation of the data.
InputSplit – It is the logical representation of the data present in the blocks.
Size
Block – The default size of an HDFS block is 128 MB, which can be configured as per our requirements. All blocks of a file are of the same size except the last block, which can be the same size or smaller. In Hadoop, files are split into 128 MB blocks and then stored in the Hadoop file system.
InputSplit – The split size is approximately equal to the block size, by default.
Example
Consider an example where we need to store a file in HDFS. HDFS stores files as blocks. A block is the smallest unit of data that can be stored or retrieved from the disk, and its default size is 128 MB. HDFS breaks files into blocks and stores these blocks on different nodes in the cluster. If we have a file of 130 MB, HDFS will break this file into 2 blocks.
Now, if we want to perform a MapReduce operation directly on the blocks, it will not process correctly, as the 2nd block is incomplete. InputSplit solves this problem: an InputSplit forms a logical grouping of blocks as a single unit, because the InputSplit includes the location of the next block and the byte offset of the data needed to complete the block.
From this, we can conclude that an InputSplit is only a logical chunk of data, i.e. it just has information about the blocks' addresses or locations. Thus, during MapReduce execution, Hadoop scans through the blocks and creates InputSplits. The split acts as a broker between the block and the mapper.
11) What is a Speculative Execution in Hadoop MapReduce?
MapReduce breaks jobs into tasks, and these tasks run in parallel rather than sequentially, which reduces overall execution time. This model of execution is sensitive to slow tasks, as they slow down the overall execution of a job. There are various reasons for the slowdown of tasks, such as hardware degradation. But it may be difficult to detect the cause, since the tasks still complete successfully, although they take more time than expected.
Apache Hadoop doesn't try to diagnose and fix slow-running tasks. Instead, it tries to detect them and run backup tasks for them. This is called speculative execution in Hadoop, and the backup tasks are called speculative tasks. First, the Hadoop framework launches all the tasks for the job. Then it launches speculative tasks for those tasks that have been running for some time (about a minute) and have not made much progress, on average, compared with the other tasks of the job. If the original task completes before the speculative task, the speculative task is killed. On the other hand, the original task is killed if the speculative task finishes first.
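Speculative execution can be enabled or disabled per job through the standard mapreduce.*.speculative properties; a minimal sketch follows (the job name and the chosen values are illustrative only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
  public static Job newJob() throws Exception {
    Configuration conf = new Configuration();
    // Turn off speculative execution for the map and reduce tasks of this job.
    conf.setBoolean("mapreduce.map.speculative", false);
    conf.setBoolean("mapreduce.reduce.speculative", false);
    return Job.getInstance(conf, "no-speculation-job");
  }
}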
12) How to submit extra files(jars, static files) for MapReduce job during runtime?
The MapReduce framework provides a Distributed Cache to cache files needed by the applications. It can cache read-only text files, archives, jar files, etc.
First of all, an application which needs to use the distributed cache to distribute a file should make sure that the file is available at a URL, which can be either hdfs:// or http://. Now, if the file is present at that hdfs:// or http:// URL, the user marks it as a cache file to distribute. The framework copies the cache file to all the nodes before the tasks start on those nodes. The files are copied only once per job, and applications should not modify them.
By default, the size of the distributed cache is 10 GB. We can adjust the size of the distributed cache using local.cache.size.
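A minimal sketch of distributing files through the cache with the new MapReduce API; the HDFS URIs and file names are assumptions used for illustration.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileExample {
  public static Job newJob() throws Exception {
    Job job = Job.getInstance(new Configuration(), "cache-example");
    // Ship a read-only lookup file and an archive to every node before the tasks start.
    job.addCacheFile(new URI("hdfs://localhost:9000/cache/lookup.txt"));
    job.addCacheArchive(new URI("hdfs://localhost:9000/cache/extra-libs.jar"));
    return job;
  }
}

Inside a task, the cached files can then be located through the job context (for example via context.getCacheFiles()) and read from the local working directory.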