
1. Compare Hadoop & Spark

Criteria            | Hadoop                    | Spark
Dedicated storage   | HDFS                      | None
Speed of processing | Average                   | Excellent
Libraries           | Separate tools available  | Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX

2. What are real-time industry applications of Hadoop?


Hadoop, well known as Apache Hadoop, is an open-source software platform for scalable and
distributed computing of large volumes of data. It provides rapid, high-performance and cost-
effective analysis of structured and unstructured data generated on digital platforms and within the
enterprise. It is used in almost all departments and sectors today. Some of the instances where
Hadoop is used:

 Managing traffic on streets.


 Stream processing.
 Content Management and Archiving Emails.
 Processing Rat Brain Neuronal Signals using a Hadoop Computing Cluster.
 Fraud detection and Prevention.
 Advertisements Targeting Platforms are using Hadoop to capture and analyze click stream,
transaction, video and social media data.
 Managing content, posts, images and videos on social media platforms.
 Analyzing customer data in real-time for improving business performance.
 Public sector fields such as intelligence, defense, cyber security and scientific research.
 Financial agencies are using Big Data Hadoop to reduce risk, analyze fraud patterns,
identify rogue traders, more precisely target their marketing campaigns based on customer
segmentation, and improve customer satisfaction.
 Getting access to unstructured data like output from medical devices, doctor’s notes, lab
results, imaging reports, medical correspondence, clinical data, and financial data.

3. How is Hadoop different from other parallel computing systems?

 Hadoop provides a distributed file system (HDFS), which lets you store and handle massive
amounts of data on a cluster of machines while handling data redundancy. The primary benefit is
that since data is stored in several nodes, it is better to process it in a distributed manner. Each node
can process the data stored on it instead of spending time moving it over the network.

 In contrast, a relational database computing system lets you query data in real time, but it is
not efficient to store data in tables, records and columns when the data is huge.

4. What all modes Hadoop can be run in?


Hadoop can run in three modes:

 Standalone Mode: The default mode of Hadoop, it uses the local file system for input and output
operations. This mode is mainly used for debugging purposes, and it does not support the
use of HDFS. Further, in this mode, no custom configuration is required for the mapred-
site.xml, core-site.xml and hdfs-site.xml files. It is much faster than the other modes.

 Pseudo-Distributed Mode (Single Node Cluster): In this case, you need configuration
for all the three files mentioned above. All daemons run on one node
and thus the Master and Slave nodes are the same.

 Fully Distributed Mode (Multi-Node Cluster): This is the production mode of
Hadoop (what Hadoop is known for), where data is distributed across several
nodes of a Hadoop cluster. Separate nodes are allotted as Master and Slaves.

5. Explain the major difference between HDFS block and InputSplit.


In simple terms, a block is the physical representation of data, while a split is the logical
representation of the data present in the block. Split acts as an intermediary between block and mapper.
Suppose we have two blocks:
Block 1: Intell
Block 2: ipaat
Now, considering the map, it will read Block 1 from 'I' till 'll', but it does not know how to process
Block 2 at the same time. Here Split comes into play, forming a logical group of
Block 1 and Block 2 as a single block.
It then forms a key-value pair using InputFormat and RecordReader and sends it to the map for
further processing. With InputSplit, if you have limited resources, you can increase the split size to
limit the number of maps. For instance, if a 640MB file is stored as 10 blocks of 64MB each and
resources are limited, you can set the split size to 128MB. This forms logical groups of 128MB,
with only 5 maps executing at a time.
However, if splitting is disabled, the whole file forms one InputSplit and is
processed by a single map, which consumes more time when the file is big.
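
For illustration, here is a rough sketch of influencing the split size per job, assuming the newer org.apache.hadoop.mapreduce API (the input path is hypothetical):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");
        FileInputFormat.addInputPath(job, new Path("/data/input"));   // hypothetical path
        // Ask for ~128MB logical splits instead of one split per 64MB block,
        // so fewer map tasks are launched.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
    }
}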
6. What is distributed cache and what are its benefits?
Distributed Cache, in Hadoop, is a service provided by the MapReduce framework to cache files when needed.
Once a file is cached for a specific job, Hadoop makes it available on each DataNode where map and
reduce tasks are executing, both on the file system and in memory. Later, you can easily access and
read the cache file and populate any collection (like an array or hashmap) in your code.
Benefits of using distributed cache are:

 It distributes simple, read-only text/data files and/or complex types like jars, archives and
others. These archives are then un-archived at the slave nodes.
 Distributed cache tracks the modification timestamps of the cache files, which ensures that the
files are not modified while a job is executing.
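
For illustration, here is a minimal sketch assuming the MRv2 API; the file path, class name and tab-separated format are hypothetical. The file is registered on the driver with job.addCacheFile(new URI("/user/demo/lookup.txt")) and is then localized into each task's working directory under its own name, so the mapper can read it like a local file in setup():

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The cached file was registered on the driver and is localized
        // into the task's working directory under its file name.
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);   // populate the in-memory collection
                }
            }
        }
    }
}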

7. Explain the difference between NameNode, Checkpoint NameNode and BackupNode.

 NameNode is the core of HDFS that manages the metadata – the information of what file
maps to what block locations and what blocks are stored on what datanode. In simple
terms, it’s the data about the data being stored. NameNode supports a directory tree-like
structure consisting of all the files present in HDFS on a Hadoop cluster. It uses the following
files for the namespace:
fsimage file – It keeps track of the latest checkpoint of the namespace.
edits file – It is a log of changes that have been made to the namespace since the last checkpoint.

 Checkpoint NameNode has the same directory structure as the NameNode, and creates
checkpoints for the namespace at regular intervals by downloading the fsimage and edits files
and merging them in its local directory. The merged image is then
uploaded back to the NameNode.
There is a similar node, commonly known as the Secondary NameNode, but it does
not support the 'upload to NameNode' functionality.

 Backup Node provides similar functionality as the Checkpoint node, while staying in sync
with the NameNode. It maintains an up-to-date in-memory copy of the file system namespace and
does not need to download the fsimage and edits files at regular intervals; to create a new
checkpoint, it only has to save its current in-memory state to an image file.

8. What are the most common Input Formats in Hadoop?


There are three most common input formats in Hadoop:

 Text Input Format: Default input format in Hadoop.


 Key Value Input Format: used for plain text files where the files are broken into lines
 Sequence File Input Format: used for reading files in sequence

9. Define DataNode and how does NameNode tackle DataNode failures?


A DataNode stores data in HDFS; it is the node where the actual data resides in the file system. Each
DataNode sends a heartbeat message to notify that it is alive. If the NameNode does not receive a
heartbeat from a DataNode for 10 minutes, it considers that DataNode to be dead or out of service, and
starts replicating the blocks that were hosted on it onto other DataNodes. A BlockReport contains the
list of all blocks on a DataNode; using it, the system replicates the blocks that were stored on the
dead DataNode.
The NameNode manages the replication of data blocks from one DataNode to another. In this process,
the replication data is transferred directly between DataNodes, so the data never passes through the
NameNode.

10. What are the core methods of a Reducer?


The three core methods of a Reducer are:

1. setup(): this method is used for configuring various parameters like input data size and the
distributed cache.
protected void setup(Context context)
2. reduce(): the heart of the reducer, called once per key with the associated list of values.
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
3. cleanup(): this method is called only once at the end of the task, to clean up temporary files.
protected void cleanup(Context context)
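
For illustration, a minimal sketch of a Reducer that overrides all three core methods (the word-count-style summing logic is just an example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        // One-time initialization, e.g. reading parameters or cached files.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();          // called once per key with all its values
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // One-time cleanup, e.g. deleting temporary files.
    }
}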

11. What is SequenceFile in Hadoop?


Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key/value
pairs. The map outputs are stored as SequenceFile internally. It provides Reader, Writer and Sorter
classes. The three SequenceFile formats are:

1. Uncompressed key/value records.


2. Record compressed key/value records – only ‘values’ are compressed here.
3. Block compressed key/value records – both keys and values are collected in ‘blocks’
separately and compressed. The size of the ‘block’ is configurable.
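
As a brief sketch (assuming a Job object has already been created with Job.getInstance(conf)), a job can be configured to write block-compressed SequenceFile output:

import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Write the job output as a SequenceFile.
job.setOutputFormatClass(SequenceFileOutputFormat.class);
// Compress both keys and values in blocks (the third SequenceFile format).
FileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);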

12. What is Job Tracker role in Hadoop?


Job Tracker’s primary functions are resource management (managing the Task Trackers), tracking
resource availability, and task life cycle management (tracking the tasks’ progress and fault
tolerance).

 It is a process that runs on a separate node, often not on a DataNode.


 Job Tracker communicates with the NameNode to identify data location.
 Finds the best Task Tracker Nodes to execute tasks on given nodes.
 Monitors individual Task Trackers and submits the overall job back to the client.
 It tracks the execution of MapReduce workloads local to the slave node.

13. What is the use of RecordReader in Hadoop?


Since Hadoop splits data into various blocks, RecordReader is used to read the split data into a
single record. For instance, if the input data is split like:
Row1: Welcome to
Row2: Intellipaat
It will be read as “Welcome to Intellipaat” using RecordReader.
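
As a rough illustration (the class name is hypothetical), a custom RecordReader can delegate to Hadoop's LineRecordReader, the reader behind TextInputFormat that turns each line of a split into one record:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class LineDelegatingRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate = new LineRecordReader();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);   // position the reader at the split boundary
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        return delegate.nextKeyValue();        // advance to the next line (record)
    }

    @Override
    public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

    @Override
    public Text getCurrentValue() { return delegate.getCurrentValue(); }

    @Override
    public float getProgress() throws IOException { return delegate.getProgress(); }

    @Override
    public void close() throws IOException { delegate.close(); }
}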

14. What is Speculative Execution in Hadoop?


One limitation of Hadoop is that, by distributing tasks over several nodes, a few slow nodes may
limit the rest of the program. There are various reasons for tasks to be slow,
which are sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks,
Hadoop tries to detect when a task runs slower than expected and then launches an equivalent
task as a backup. This backup mechanism in Hadoop is Speculative Execution.
It creates a duplicate task on another node, so the same input can be processed multiple times in
parallel. When most tasks in a job come to completion, the speculative execution mechanism
schedules duplicate copies of the remaining (slower) tasks across the nodes that are currently
free. When these tasks finish, the JobTracker is notified. If other copies are still executing
speculatively, Hadoop notifies the TaskTrackers to quit those tasks and reject their output.
Speculative execution is enabled by default in Hadoop. To disable it, set the
mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution
JobConf options to false.
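
For example, assuming a Hadoop 2.x (MRv2) setup where the equivalent property names are mapreduce.map.speculative and mapreduce.reduce.speculative, this can be done programmatically:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Turn speculative execution off for both map and reduce tasks.
// (These are the MRv2 property names; older releases use the
// mapred.*.tasks.speculative.execution options mentioned above.)
conf.setBoolean("mapreduce.map.speculative", false);
conf.setBoolean("mapreduce.reduce.speculative", false);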
15. What happens if you try to run a Hadoop job with an output directory that is already
present?
It will throw an exception saying that the output directory already exists.
To run a MapReduce job, you need to ensure that the output directory does not already exist in HDFS.
To delete the directory before running the job, you can use the shell:
hadoop fs -rmr /path/to/your/output
Or via the Java API:
FileSystem.get(conf).delete(outputDir, true);
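A small sketch of doing this from the driver before submitting the job (the output path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path outputDir = new Path("/path/to/your/output");
// Remove a stale output directory so the job does not fail on submission.
if (fs.exists(outputDir)) {
    fs.delete(outputDir, true);   // true = delete recursively
}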

16. How can you debug Hadoop code?


First, check the list of MapReduce jobs currently running. Next, we need to see that there are no
orphaned jobs running; if yes, you need to determine the location of RM logs.

1. Run: “ps -ef | grep -i ResourceManager”


and look for log directory in the displayed result. Find out the job-id from the displayed list
and check if there is any error message associated with that job.
2. On the basis of RM logs, identify the worker node that was involved in execution of the
task.
3. Now, login to that node and run – “ps -ef | grep -i NodeManager”
4. Examine the Node Manager log. The majority of errors come from user level logs for each
map-reduce job.

17. How to configure Replication Factor in HDFS?


hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml
will change the default replication for all files placed in HDFS.
You can also modify the replication factor on a per-file basis using the Hadoop FS shell:

[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file

Conversely, you can also change the replication factor of all the files under a directory:

[training@localhost ~]$ hadoop fs -setrep -w 3 -R /my/dir

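As an additional sketch (the file path is hypothetical), the same can be done through the Java FileSystem API, or a client-side default can be set on the Configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// Client-side default replication for files written with this configuration.
conf.set("dfs.replication", "3");
FileSystem fs = FileSystem.get(conf);
// Change the replication factor of an existing file.
fs.setReplication(new Path("/my/file"), (short) 3);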

18. How to compress mapper output but not the reducer output?
To achieve this compression, you should set:

conf.set("mapreduce.map.output.compress", true)
conf.set("mapreduce.output.fileoutputformat.compress", false)

19. What is the difference between Map Side join and Reduce Side Join?
A map-side join is performed when the data reaches the map, before the shuffle; it requires a strict
structure for the input datasets (for example, they must be partitioned and sorted on the join key in
the same way). On the other hand, a reduce-side join (repartitioned join) is simpler than a map-side
join since the input datasets need not be structured. However, it is less efficient, as it has to go
through the sort and shuffle phases, which come with network overheads.

20. How can you transfer data from Hive to HDFS?


By writing the query:

hive> insert overwrite directory '/' select * from emp;


You can write your query for the data you want to export from Hive to HDFS. The output you
receive will be stored in part files in the specified HDFS path.

21. What companies use Hadoop, any idea?


Yahoo! (one of the biggest contributors to the creation of Hadoop) uses Hadoop in its search engine;
Facebook developed Hive for analysis; Amazon, Netflix, Adobe, eBay, Spotify and Twitter also use Hadoop.

Whenever you go for a Big Data interview, the interviewer may ask some basic level questions.
Whether you are a fresher or experienced in the big data field, the basic knowledge is required.
So, let’s cover some frequently asked basic big data interview questions and answers to help you crack
the big data interview.

1. What do you know about the term “Big Data”?

Answer: Big Data is a term associated with complex and large datasets. A relational database
cannot handle big data, and that’s why special tools and methods are used to perform operations
on a vast collection of data. Big data enables companies to understand their business better and
helps them derive meaningful information from the unstructured and raw data collected on a
regular basis. Big data also allows companies to make better business decisions backed by
data.

2. What are the five V’s of Big Data?

Answer: The five V’s of Big Data are as follows:

● Volume – The amount of data, which is growing at a high rate, e.g., data volumes in petabytes.
● Velocity – The rate at which data grows. Social media plays a major role in the velocity of growing data.
● Variety – The different data types, i.e., various data formats like text, audio, video, etc.
● Veracity – The uncertainty of available data. Veracity arises due to the high volume of data, which brings incompleteness and inconsistency.
● Value – Turning data into value. By turning accessed big data into value, businesses may generate revenue.

5 V’s of Big Data

Note: This is one of the basic and significant questions asked in the big data interview. You can
choose to explain the five V’s in detail if you see the interviewer is interested to know more.
However, the names can even be mentioned if you are asked about the term “Big Data”.

3. Tell us how big data and Hadoop are related to each other.

Answer: Big data and Hadoop are almost synonymous terms. With the rise of big data, Hadoop, a
framework that specializes in big data operations, also became popular. The framework can be
used by professionals to analyze big data and help businesses make decisions.
Note: This question is commonly asked in a big data interview. You can go further to answer this
question and try to explain the main components of Hadoop.
4. How is big data analysis helpful in increasing business revenue?

Answer: Big data analysis has become very important for businesses. It helps businesses
differentiate themselves from others and increase their revenue. Through predictive analytics, big
data analytics provides businesses with customized recommendations and suggestions. Big data
analytics also enables businesses to launch new products depending on customer needs and
preferences. These factors help businesses earn more revenue, and thus companies are using big
data analytics. Companies may see a significant increase of 5-20% in revenue by
implementing big data analytics. Some popular companies that are using big data analytics to
increase their revenue are Walmart, LinkedIn, Facebook, Twitter, Bank of America, etc.

5. Explain the steps to be followed to deploy a Big Data solution.

Answer: Following are the three steps that are followed to deploy a Big Data solution –

1. Data Ingestion

The first step for deploying a big data solution is the data ingestion i.e. extraction of data from
various sources. The data source may be a CRM like Salesforce, Enterprise Resource Planning
System like SAP, RDBMS like MySQL or any other log files, documents, social media feeds
etc. The data can be ingested either through batch jobs or real-time streaming. The extracted data
is then stored in HDFS.

Steps of Deploying a Big Data Solution

2. Data Storage

After data ingestion, the next step is to store the extracted data. The data can be stored either in
HDFS or in a NoSQL database (e.g., HBase). HDFS storage works well for sequential access,
whereas HBase works well for random read/write access.
3. Data Processing

The final step in deploying a big data solution is the data processing. The data is processed
through one of the processing frameworks like Spark, MapReduce, Pig, etc.
6. Do you have any Big Data experience? If so, please share it with us.

How to Approach: There is no specific answer to the question as it is a subjective question and
the answer depends on your previous experience. Asking this question during a big data
interview, the interviewer wants to understand your previous experience and is also trying to
evaluate if you are fit for the project requirement.
So, how will you approach the question? If you have previous experience, start with your duties
in your past position and slowly add details to the conversation. Tell them about your
contributions that made the project successful. This question is generally, the 2nd or 3rd question
asked in an interview. The later questions are based on this question, so answer it carefully. You
should also take care not to go overboard with a single aspect of your previous job. Keep it
simple and to the point.

7. Do you prefer good data or good models? Why?

How to Approach: This is a tricky question, but it is generally asked in the big data interview. It asks
you to choose between good data and good models. As a candidate, you should try to answer it
from your experience. Many companies want to follow a strict process of evaluating data, which means
they have already selected data models. In this case, having good data can be game-changing.
The other way around also works, as a model is chosen based on good data.
As we already mentioned, answer it from your experience. However, don’t say that having both
good data and good models is important, as it is hard to have both in real-life projects.

8. Will you optimize algorithms or code to make them run faster?

How to Approach: The answer to this question should always be “Yes.” Real world
performance matters and it doesn’t depend on the data or model you are using in your project.
The interviewer might also be interested to know if you have had any previous experience in
code or algorithm optimization. For a beginner, it obviously depends on which projects he
worked on in the past. Experienced candidates can share their experience accordingly as well.
However, be honest about your work, and it is fine if you haven’t optimized code in the past. Just
let the interviewer know your real experience and you will be able to crack the big data
interview.

9. How do you approach data preparation?

How to Approach: Data preparation is one of the crucial steps in big data projects. A big data
interview may involve at least one question based on data preparation. When the interviewer
asks you this question, he wants to know what steps or precautions you take during data
preparation.
As you already know, data preparation is required to get necessary data which can then further
be used for modeling purposes. You should convey this message to the interviewer. You should
also emphasize the type of model you are going to use and reasons behind choosing that
particular model. Last, but not the least, you should also discuss important data preparation terms
such as transforming variables, outlier values, unstructured data, identifying gaps, and others.

10. How would you transform unstructured data into structured data?

How to Approach: Unstructured data is very common in big data. The unstructured data should
be transformed into structured data to ensure proper data analysis. You can start answering the
question by briefly differentiating between the two. Once done, you can now discuss the methods
you use to transform one form to another. You might also share the real-world situation where
you did it. If you have recently graduated, then you can share information related to your
academic projects.
By answering this question correctly, you are signaling that you understand the types of data,
both structured and unstructured, and also have the practical experience to work with these. If
you give an answer to this question specifically, you will definitely be able to crack the big data
interview.
Basic Big Data Hadoop Interview Questions

Hadoop is one of the most popular Big Data frameworks, and if you are going for a Hadoop
interview prepare yourself with these basic level interview questions for Big Data Hadoop.
These questions will be helpful for you whether you are going for a Hadoop developer or
Hadoop Admin interview.

11. Explain the difference between Hadoop and RDBMS.

Answer: The main differences between Hadoop and an RDBMS are as follows –

● Data types – An RDBMS works best with structured data that fits a predefined schema, whereas Hadoop can store and process structured, semi-structured and unstructured data.
● Schema – An RDBMS enforces schema on write, while Hadoop follows schema on read.
● Scaling – An RDBMS typically scales vertically (bigger servers), while Hadoop scales horizontally by adding commodity nodes.
● Processing – An RDBMS suits interactive, transactional (OLTP) workloads, while Hadoop is designed for batch processing and analytics over very large datasets.

12. What are the common input formats in Hadoop?

Answer: Below are the common input formats in Hadoop –

● Text Input Format – The default input format defined in Hadoop is the Text Input
Format.
● Sequence File Input Format – To read files in a sequence, Sequence File Input
Format is used.
● Key Value Input Format – The input format used for plain text files (files broken
into lines) is the Key Value Input Format.

13. Explain some important features of Hadoop.

Answer: Hadoop supports the storage and processing of big data. It is the best solution for
handling big data challenges. Some important features of Hadoop are –
● Open Source – Hadoop is an open source framework which means it is available free
of cost. Also, the users are allowed to change the source code as per their
requirements.
● Distributed Processing – Hadoop supports distributed processing of data i.e. faster
processing. The data in Hadoop HDFS is stored in a distributed manner and
MapReduce is responsible for the parallel processing of data.
● Fault Tolerance – Hadoop is highly fault-tolerant. It creates three replicas for each
block at different nodes, by default. This number can be changed according to the
requirement. So, we can recover the data from another node if one node fails. The
detection of node failure and recovery of data is done automatically.
● Reliability – Hadoop stores data on the cluster in a reliable manner that is
independent of machine. So, the data stored in Hadoop environment is not affected by
the failure of the machine.
● Scalability – Another important feature of Hadoop is its scalability. It is compatible
with commodity hardware, and we can easily add new hardware to the nodes.
● High Availability – The data stored in Hadoop is available to access even after the
hardware failure. In case of hardware failure, the data can be accessed from another
path.

14. Explain the different modes in which Hadoop run.

Answer: Apache Hadoop runs in the following three modes –

● Standalone (Local) Mode – By default, Hadoop runs in a local mode i.e. on a non-
distributed, single node. This mode uses the local file system to perform input and
output operation. This mode does not support the use of HDFS, so it is used for
debugging. No custom configuration is needed for configuration files in this mode.
● Pseudo-Distributed Mode – In the pseudo-distributed mode, Hadoop runs on a
single node just like the Standalone mode. In this mode, each daemon runs in a
separate Java process. As all the daemons run on a single node, there is the same
node for both the Master and Slave nodes.
● Fully-Distributed Mode – In the fully-distributed mode, all the daemons run on
separate individual nodes and thus form a multi-node cluster. There are different
nodes for the Master and Slave roles.

15. Explain the core components of Hadoop.


Answer: Hadoop is an open source framework that is meant for storage and processing of big
data in a distributed manner. The core components of Hadoop are –
● HDFS (Hadoop Distributed File System) – HDFS is the basic storage system of
Hadoop. The large data files running on a cluster of commodity hardware are stored
in HDFS. It can store data in a reliable manner even when hardware fails.


● Hadoop MapReduce – MapReduce is the Hadoop layer that is responsible for data
processing. Applications written with MapReduce process the unstructured and structured data
stored in HDFS. It is responsible for the parallel processing of high volumes of data by
dividing the work into independent tasks. The processing is done in two phases, Map and
Reduce. Map is the first phase of processing, which specifies the complex logic code,
and Reduce is the second phase of processing, which specifies light-weight
operations.
● YARN – The processing framework in Hadoop is YARN. It is used for resource
management and provides multiple data processing engines i.e. data science, real-
time streaming, and batch processing.

Hadoop Developer Interview Questions for Fresher

It is not easy to crack a Hadoop developer interview, but preparation can make all the difference. If you
are a fresher, learn the Hadoop concepts and prepare properly. Have a good knowledge of the
different file systems, Hadoop versions, commands, system security, etc. Here are a few questions
that will help you pass the Hadoop developer interview.

16. What are the different configuration files in Hadoop?

Answer: The different configuration files in Hadoop are –

core-site.xml – This configuration file contains Hadoop core configuration settings, for example,
I/O settings common to MapReduce and HDFS. It specifies the hostname and port of the default file system (fs.defaultFS).
mapred-site.xml – This configuration file specifies a framework name for MapReduce by
setting mapreduce.framework.name
hdfs-site.xml – This configuration file contains HDFS daemons configuration settings. It also
specifies default block permission and replication checking on HDFS.
yarn-site.xml – This configuration file specifies configuration settings for ResourceManager
and NodeManager.
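
As an illustrative sketch (the file locations are assumptions), these XML files are normally picked up from the classpath by the Configuration object, but they can also be added explicitly and their values read programmatically:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// Explicitly load config files (paths are hypothetical; normally they are
// found automatically on the classpath, e.g. under /etc/hadoop/conf).
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
// Read back a couple of the settings described above.
String defaultFs = conf.get("fs.defaultFS");
String framework = conf.get("mapreduce.framework.name", "local"); // default if unset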

17. What are the differences between Hadoop 2 and Hadoop 3?

Answer: Following are the main differences between Hadoop 2 and Hadoop 3 –

● Minimum Java version – Hadoop 2 requires Java 7, whereas Hadoop 3 requires Java 8.
● Fault tolerance – Hadoop 2 handles fault tolerance only through replication, while Hadoop 3 also supports erasure coding, which lowers the storage overhead.
● NameNode high availability – Hadoop 2 supports a single standby NameNode, while Hadoop 3 supports multiple standby NameNodes.
● Default ports – Several default service ports were changed in Hadoop 3 to move them out of the Linux ephemeral port range.

18. How can you achieve security in Hadoop?

Answer: Kerberos is used to achieve security in Hadoop. At a high level, there are 3 steps to access a service
while using Kerberos. Each step involves a message exchange with a server.
1. Authentication – The first step involves authentication of the client to the
authentication server, which then provides a time-stamped TGT (Ticket-Granting
Ticket) to the client.
2. Authorization – In this step, the client uses the received TGT to request a service ticket
from the TGS (Ticket-Granting Server).
3. Service Request – This is the final step to achieve security in Hadoop. The client
uses the service ticket to authenticate itself to the server.

19. What is commodity hardware?

Answer: Commodity hardware is low-cost, low-end hardware with no special availability or quality
guarantees. Commodity hardware comprises RAM, as it performs a number of services that
require RAM for execution. One doesn’t require a high-end hardware configuration or
supercomputers to run Hadoop; it can be run on any commodity hardware.

20. How is NFS different from HDFS?

Answer: There are a number of distributed file systems that work in their own way. NFS
(Network File System) is one of the oldest and most popular distributed file storage systems, whereas
HDFS (Hadoop Distributed File System) is a more recent one, popular for handling big data.
The main differences between NFS and HDFS are as follows –

● Data volume – NFS can store and access moderate volumes of data on a single dedicated machine, whereas HDFS is designed to store very large volumes of data distributed across a cluster of machines.
● Fault tolerance – NFS does not replicate data, so there is no fault tolerance if the machine fails, whereas HDFS replicates each block (three times by default), making it fault-tolerant.
● Hardware – NFS data is stored on single dedicated hardware, whereas HDFS blocks are distributed across commodity hardware.

Hadoop Developer Interview Questions for Experienced

The interviewer has more expectations from an experienced Hadoop developer, and thus the
questions are one level up. So, if you have gained some experience, don’t forget to cover
command-based, scenario-based and real-experience-based questions. Here we bring some sample
interview questions for experienced Hadoop developers.
21. How to restart all the daemons in Hadoop?

Answer: To restart all the daemons, it is required to stop all the daemons first. The Hadoop
directory contains an sbin directory that stores the script files to stop and start daemons in Hadoop.
Use the command /sbin/stop-all.sh to stop all the daemons and then use the /sbin/start-all.sh
command to start all the daemons again.

22. What is the use of jps command in Hadoop?


Answer: The jps command is used to check if the Hadoop daemons are running properly or not.
This command shows all the daemons running on a machine i.e. Datanode, Namenode,
NodeManager, ResourceManager etc.

23. Explain the process that overwrites the replication factors in HDFS.

Answer: There are two methods to overwrite the replication factors in HDFS –

Method 1: On File Basis

In this method, the replication factor is changed on the basis of file using Hadoop FS shell. The
command used for this is:
$ hadoop fs -setrep -w 2 /my/test_file

Here, test_file is the file whose replication factor will be set to 2.

Method 2: On Directory Basis

In this method, the replication factor is changed on directory basis i.e. the replication factor for all
the files under a given directory is modified.
$ hadoop fs -setrep -w 5 /my/test_dir

Here, test_dir is the name of the directory; the replication factor for the directory and all the files
in it will be set to 5.

24. What will happen with a NameNode that doesn’t have any data?

Answer: A NameNode without any data doesn’t exist in Hadoop. If a NameNode exists, it will
contain some data; otherwise, it won’t exist.

25. Explain NameNode recovery process.

Answer: The NameNode recovery process involves the below-mentioned steps to make the Hadoop
cluster run again:
● In the first step of the recovery process, a new NameNode is started using the file
system metadata replica (FsImage).
● The next step is to configure the DataNodes and Clients so that they acknowledge
the new NameNode.
● In the final step, the new NameNode starts serving clients once it has finished
loading the last checkpoint FsImage and has received enough block reports from the DataNodes.
On large Hadoop clusters, this recovery process can take a long time, which makes routine
maintenance difficult. For this reason, HDFS high availability architecture is recommended.
