Compare Hadoop & Spark
Criteria  | Hadoop                   | Spark
Libraries | Separate tools available | Spark Core, SQL, Streaming, MLlib, GraphX
Hadoop provides a distributed file system (HDFS) that lets you store and handle massive amounts of data
across a cluster of machines while handling data redundancy. The primary benefit is that, since data is stored on
several nodes, it is better to process it in a distributed manner: each node can process the data
stored on it instead of spending time moving it over the network.
By contrast, in a relational database computing system you can query data in real time,
but it is not efficient to store data in tables, records, and columns when the data volume is huge.
Standalone Mode: The default mode of Hadoop, it uses the local file system for input and output
operations. This mode is mainly used for debugging purposes, and it does not support the
use of HDFS. Further, in this mode, no custom configuration is required for the mapred-site.xml,
core-site.xml, and hdfs-site.xml files. It is much faster when compared to the other modes.
Pseudo-Distributed Mode (Single-Node Cluster): In this case, you need configuration
for all three files mentioned above. Here, all daemons run on one node,
and thus the Master and Slave nodes are the same.
Fully Distributed Mode (Multi-Node Cluster): This is the production phase of
Hadoop (what Hadoop is known for), where data is stored and processed across several
nodes of a Hadoop cluster. Separate nodes are allotted as Master and Slaves.
The Distributed Cache distributes simple, read-only text/data files and/or complex types like jars, archives, and
others. These archives are then unarchived on the slave nodes.
The Distributed Cache tracks the modification timestamps of cache files, which indicates that the
files should not be modified while a job is executing.
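As a hedged illustration (not part of the original answer), distributing a read-only file through the cache with the newer MapReduce API might look like the sketch below; the job name and file path are placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "distributed-cache-demo"); // placeholder job name

        // Ship a read-only lookup file to every task node; the path is a placeholder.
        job.addCacheFile(new URI("/user/demo/lookup.txt"));

        // Inside a Mapper or Reducer, the cached files can be listed via
        // context.getCacheFiles() and read from the local file system of the node.
    }
}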
NameNode is the core of HDFS that manages the metadata – the information about which file
maps to which block locations and which blocks are stored on which DataNode. In simple
terms, it's the data about the data being stored. NameNode maintains a directory tree-like
structure consisting of all the files present in HDFS on a Hadoop cluster. It uses the following
files for the namespace:
fsimage file – it keeps track of the latest checkpoint of the namespace.
edits file – it is a log of the changes that have been made to the namespace since the last checkpoint.
Checkpoint NameNode has the same directory structure as the NameNode and creates
checkpoints for the namespace at regular intervals by downloading the fsimage and edits files
and merging them in its local directory. The new image after merging is then
uploaded back to the NameNode.
There is a similar node, commonly known as the Secondary NameNode, but it does
not support the 'upload to NameNode' functionality.
1. setup(): this method is used for configuring various parameters like the input data size and the
distributed cache.
protected void setup(Context context)
2. reduce(): the heart of the reducer, it is called once per key with the associated list of values.
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
3. cleanup(): this method is called to clean up temporary files, only once at the end of the task
(a complete reducer sketch follows below).
protected void cleanup(Context context)
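As a hedged, illustrative sketch (not taken from the original answer), a word-count style reducer overriding all three methods could look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        // Read job parameters or files from the distributed cache before any reduce() call.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;                        // aggregate all counts for this key
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // Release resources or remove temporary files, once per task.
    }
}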
You can also change the replication factor of all the files under a directory.
18. How to compress mapper output but not the reducer output?
To achieve this compression, you should set:
conf.set("mapreduce.map.output.compress", true)
conf.set("mapreduce.output.fileoutputformat.compress", false)
19. What is the difference between Map Side join and Reduce Side Join?
A map-side join is performed in the map phase itself, before any data is shuffled to the reducers, and it
needs a strict structure for the input datasets (they must be sorted and identically partitioned on the join key).
A reduce-side join (repartitioned join), on the other hand, is simpler than a map-side join since the input
datasets need not be structured. However, it is less efficient, as it has to go through the sort and shuffle
phases and comes with network overhead.
Whenever you go for a Big Data interview, the interviewer may ask some basic-level questions.
Whether you are a fresher or experienced in the big data field, basic knowledge is required.
So, let's cover some frequently asked basic big data interview questions and answers to help you crack
the big data interview.
Answer: Big Data is a term associated with complex and large datasets. A relational database
cannot handle big data, and that’s why special tools and methods are used to perform operations
on a vast collection of data. Big data enables companies to understand their business better and
helps them derive meaningful information from the unstructured and raw data collected on a
regular basis. Big data also allows companies to make better business decisions backed by data.
● Volume – Volume represents the amount of data, which is growing at a high rate, i.e.,
data volumes measured in petabytes.
● Velocity – Velocity is the rate at which data grows. Social media plays a major role
in the velocity of growing data.
● Variety – Variety refers to the different data types, i.e., various data formats like text,
audio, video, etc.
● Veracity – Veracity refers to the uncertainty of available data. Veracity arises because
the high volume of data brings incompleteness and inconsistency.
● Value – Value refers to turning data into value. By turning the big data they access into
value, businesses may generate revenue.
Note: This is one of the basic and significant questions asked in the big data interview. You can
choose to explain the five V's in detail if you see that the interviewer is interested in knowing more.
However, simply naming them can be enough if you are only asked about the term "Big Data".
3. Tell us how big data and Hadoop are related to each other.
Answer: Big data and Hadoop are almost synonymous terms. With the rise of big data, Hadoop, a
framework that specializes in big data operations, also became popular. Professionals can use the
framework to analyze big data and help businesses make decisions.
Note: This question is commonly asked in a big data interview. You can go further to answer this
question and try to explain the main components of Hadoop.
4. How is big data analysis helpful in increasing business revenue?
Answer: Big data analysis has become very important for businesses. It helps them
differentiate themselves from others and increase their revenue. Through predictive analytics, big
data analytics provides businesses with customized recommendations and suggestions. Big data
analytics also enables businesses to launch new products based on customer needs and
preferences. These factors help businesses earn more revenue, and thus companies are using big
data analytics. Companies may see a significant increase of 5-20% in revenue by
implementing big data analytics. Some popular companies that use big data analytics to
increase their revenue are Walmart, LinkedIn, Facebook, Twitter, Bank of America, etc.
Answer: Following are the three steps that are followed to deploy a Big Data Solution –
1. Data Ingestion
The first step in deploying a big data solution is data ingestion, i.e., the extraction of data from
various sources. The data source may be a CRM like Salesforce, an Enterprise Resource Planning
system like SAP, an RDBMS like MySQL, or any other log files, documents, social media feeds,
etc. The data can be ingested either through batch jobs or real-time streaming. The extracted data
is then stored in HDFS.
2. Data Storage
After data ingestion, the next step is to store the extracted data. The data can be stored either in
HDFS or in a NoSQL database (e.g., HBase). HDFS storage works well for sequential access,
whereas HBase works well for random read/write access.
3. Data Processing
The final step in deploying a big data solution is data processing. The data is processed
through one of the processing frameworks like Spark, MapReduce, Pig, etc.
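As a hedged, minimal sketch of the storage step (not part of the original answer), ingested records could be written into HDFS with the FileSystem API; the NameNode address and target path below are placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IngestToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");   // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/ingest/sample.txt"))) {
            // Write a toy record; in practice this would come from a batch job or stream.
            out.write("id,value\n1,hello\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}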
6. Do you have any Big Data experience? If so, please share it with us.
How to Approach: There is no specific answer to this question, as it is subjective and
the answer depends on your previous experience. By asking this question during a big data
interview, the interviewer wants to understand your previous experience and is also trying to
evaluate whether you are a fit for the project requirements.
So, how will you approach the question? If you have previous experience, start with your duties
in your past position and slowly add details to the conversation. Tell them about your
contributions that made the project successful. This question is generally the 2nd or 3rd question
asked in an interview. The later questions are based on this one, so answer it carefully. You
should also take care not to go overboard with a single aspect of your previous job. Keep it
simple and to the point.
How to Approach: This is a tricky question, but it is generally asked in the big data interview. It asks
you to choose between good data and good models. As a candidate, you should try to answer it
from your experience. Many companies want to follow a strict process of evaluating data, which means
they have already selected their data models. In this case, having good data can be game-changing.
The other way around also works, as a model is chosen based on good data.
As we already mentioned, answer it from your experience. However, don't say that having both
good data and good models is important, as it is hard to have both in real-life projects.
How to Approach: The answer to this question should always be "Yes." Real-world
performance matters, and it doesn't depend on the data or model you are using in your project.
The interviewer might also be interested to know if you have had any previous experience in
code or algorithm optimization. For a beginner, it obviously depends on which projects he
worked on in the past. Experienced candidates can share their experience accordingly as well.
However, be honest about your work, and it is fine if you haven’t optimized code in the past. Just
let the interviewer know your real experience and you will be able to crack the big data
interview.
How to Approach: Data preparation is one of the crucial steps in big data projects. A big data
interview may involve at least one question based on data preparation. When the interviewer
asks you this question, he wants to know what steps or precautions you take during data
preparation.
As you already know, data preparation is required to get necessary data which can then further
be used for modeling purposes. You should convey this message to the interviewer. You should
also emphasize the type of model you are going to use and reasons behind choosing that
particular model. Last but not least, you should also discuss important data preparation terms
such as transforming variables, outlier values, unstructured data, identifying gaps, and others.
10. How would you transform unstructured data into structured data?
How to Approach: Unstructured data is very common in big data. The unstructured data should
be transformed into structured data to ensure proper data analysis. You can start answering the
question by briefly differentiating between the two. Once done, you can now discuss the methods
you use to transform one form into the other. You might also share a real-world situation where
you did it. If you have recently graduated, you can share information related to your
academic projects.
By answering this question correctly, you are signaling that you understand the types of data,
both structured and unstructured, and also have the practical experience to work with these. If
you give an answer to this question specifically, you will definitely be able to crack the big data
interview.
Basic Big Data Hadoop Interview Questions
Hadoop is one of the most popular Big Data frameworks, and if you are going for a Hadoop
interview prepare yourself with these basic level interview questions for Big Data Hadoop.
These questions will be helpful for you whether you are going for a Hadoop developer or
Hadoop Admin interview.
● Text Input Format – The default input format defined in Hadoop is the Text Input
Format.
● Sequence File Input Format – To read files in a sequence, Sequence File Input
Format is used.
● Key Value Input Format – The Key Value Input Format is used for plain text files in which
each line is split into a key and a value by a separator (a tab by default), as sketched below.
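Purely as an illustration (not part of the original list), selecting an input format on a job might look like the following; the job name and the separator override are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Split each line into key and value at the first ':' instead of the default tab.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ":");

        Job job = Job.getInstance(conf, "input-format-demo");   // placeholder job name
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // Text Input Format is the default, so it needs no explicit call.
    }
}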
Answer: Hadoop supports the storage and processing of big data. It is the best solution for
handling big data challenges. Some important features of Hadoop are –
● Open Source – Hadoop is an open source framework which means it is available free
of cost. Also, the users are allowed to change the source code as per their
requirements.
● Distributed Processing – Hadoop supports distributed processing of data i.e. faster
processing. The data in Hadoop HDFS is stored in a distributed manner and
MapReduce is responsible for the parallel processing of data.
● Fault Tolerance – Hadoop is highly fault-tolerant. It creates three replicas for each
block at different nodes, by default. This number can be changed according to the
requirement. So, we can recover the data from another node if one node fails. The
detection of node failure and recovery of data is done automatically.
● Reliability – Hadoop stores data on the cluster in a reliable manner that is
independent of machine. So, the data stored in Hadoop environment is not affected by
the failure of the machine.
● Scalability – Another important feature of Hadoop is scalability. It is compatible
with commodity hardware, and we can easily add new hardware to the nodes.
● High Availability – The data stored in Hadoop is available to access even after the
hardware failure. In case of hardware failure, the data can be accessed from another
path.
● Standalone (Local) Mode – By default, Hadoop runs in a local mode i.e. on a non-
distributed, single node. This mode uses the local file system to perform input and
output operation. This mode does not support the use of HDFS, so it is used for
debugging. No custom configuration is needed for configuration files in this mode.
● Pseudo-Distributed Mode – In the pseudo-distributed mode, Hadoop runs on a
single node just like the Standalone mode. In this mode, each daemon runs in a
separate Java process. As all the daemons run on a single node, the same node
acts as both the Master and the Slave.
● Fully-Distributed Mode – In the fully-distributed mode, all the daemons run on
separate individual nodes and thus form a multi-node cluster. There are different
nodes for the Master and Slave roles.
● Hadoop MapReduce – MapReduce is the Hadoop layer responsible for data
processing. Applications are written in MapReduce to process the unstructured and
structured data stored in HDFS. It is responsible for the parallel processing of a high
volume of data by dividing the work into independent tasks. The processing is done
in two phases, Map and Reduce: Map is the first phase of processing, where the
complex logic is specified, and Reduce is the second phase, where light-weight
aggregation operations are specified (a minimal driver sketch follows below).
● YARN – YARN is the processing framework in Hadoop. It is used for resource
management and supports multiple data processing engines, e.g., for data science,
real-time streaming, and batch processing.
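A hedged, minimal driver sketch tying the Map and Reduce phases together is shown here; it uses the Hadoop library classes TokenCounterMapper and IntSumReducer, and the job name and paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word-count");   // placeholder job name

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenCounterMapper.class);    // Map phase: emits (word, 1) per token
        job.setReducerClass(IntSumReducer.class);        // Reduce phase: sums the counts per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/input"));     // placeholder input path
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));  // placeholder output path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}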
It is not easy to crack a Hadoop developer interview, but preparation can do everything. If you
are a fresher, learn the Hadoop concepts and prepare properly. Have a good knowledge of the
different file systems, Hadoop versions, commands, system security, etc. Here are a few questions
that will help you pass the Hadoop developer interview.
core-site.xml – This configuration file contains Hadoop core configuration settings, for example,
I/O settings common to MapReduce and HDFS. It uses a hostname and port (a programmatic
sketch of the keys these files typically set follows this list).
mapred-site.xml – This configuration file specifies a framework name for MapReduce by
setting mapreduce.framework.name
hdfs-site.xml – This configuration file contains HDFS daemons configuration settings. It also
specifies default block permission and replication checking on HDFS.
yarn-site.xml – This configuration file specifies configuration settings for ResourceManager
and NodeManager.
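As a hedged sketch of how these four files map onto configuration keys (the host names, port, and values below are placeholders; in a real cluster they come from the XML files themselves):

import org.apache.hadoop.conf.Configuration;

public class ClusterConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000"); // core-site.xml: file system hostname and port
        conf.set("mapreduce.framework.name", "yarn");                 // mapred-site.xml: framework name for MapReduce
        conf.set("dfs.replication", "3");                             // hdfs-site.xml: default block replication
        conf.set("yarn.resourcemanager.hostname", "rm.example.com");  // yarn-site.xml: ResourceManager settings
        return conf;
    }
}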
Answer: Kerberos is used to achieve security in Hadoop. At a high level, there are 3 steps to access
a service while using Kerberos, and each step involves a message exchange with a server.
1. Authentication – The first step involves authentication of the client against the
authentication server, which then provides a time-stamped TGT (Ticket-Granting
Ticket) to the client.
2. Authorization – In this step, the client uses the received TGT to request a service ticket
from the TGS (Ticket-Granting Server).
3. Service Request – This is the final step to achieve security in Hadoop. The client then
uses the service ticket to authenticate itself to the server (a client-side login sketch follows below).
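Purely as an illustration of what this means for a Hadoop client (an assumption, not part of the original answer), a keytab-based Kerberos login might look like this; the principal and keytab path are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos"); // enable Kerberos on the client side

        UserGroupInformation.setConfiguration(conf);
        // Obtain a TGT using the given principal and keytab (both placeholders).
        UserGroupInformation.loginUserFromKeytab(
                "hdfs-user@EXAMPLE.COM", "/etc/security/keytabs/hdfs-user.keytab");

        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
    }
}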
Answer: There are a number of distributed file systems that work in their own way. NFS
(Network File System) is one of the oldest and most popular distributed file storage systems, whereas
HDFS (Hadoop Distributed File System) is the more recent one, popular for handling big data.
The main differences between NFS and HDFS are as follows –
Hadoop and Spark are the two most popular big data frameworks, and a commonly asked
follow-up question is whether we need Hadoop to run Spark.
The interviewer has more expectations from an experienced Hadoop developer, and thus the
questions are a level up. So, if you have gained some experience, don't forget to cover
command-based, scenario-based, and real-experience-based questions. Here we bring some sample
interview questions for experienced Hadoop developers.
21. How to restart all the daemons in Hadoop?
Answer: To restart all the daemons, it is required to stop all the daemons first. The Hadoop
directory contains an sbin directory that stores the script files to stop and start daemons in Hadoop.
Use the command sbin/stop-all.sh to stop all the daemons, and then use sbin/start-all.sh
to start all the daemons again.
23. Explain the process that overwrites the replication factors in HDFS.
Answer: There are two methods to overwrite the replication factor in HDFS –
On a file basis: In this method, the replication factor is changed for a single file using the Hadoop
FS shell. The command used for this is:
$ hadoop fs -setrep -w 2 /my/test_file
Here, the replication factor of test_file will be set to 2.
On a directory basis: In this method, the replication factor is changed on a directory basis, i.e.,
the replication factor for all the files under a given directory is modified.
$ hadoop fs -setrep -w 5 /my/test_dir
Here, test_dir is the name of the directory; the replication factor for the directory and all the files
in it will be set to 5.
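As a hedged alternative (an illustration, not part of the original answer), the same change can also be made programmatically through the FileSystem API; the path below is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Equivalent to: hadoop fs -setrep 2 /my/test_file  (placeholder path)
            boolean changed = fs.setReplication(new Path("/my/test_file"), (short) 2);
            System.out.println("Replication change requested: " + changed);
        }
    }
}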
24. What will happen with a NameNode that doesn’t have any data?
Answer: A NameNode without any data doesn't exist in Hadoop. If there is a NameNode, it will
either contain some data or it won't exist at all.
Answer: The NameNode recovery process involves the below-mentioned steps to get the Hadoop
cluster running again:
● In the first step of the recovery process, a new NameNode is started using the file system
metadata replica (FsImage).
● The next step is to configure the DataNodes and clients so that they acknowledge the new
NameNode.
● In the final step, the new NameNode starts serving clients once it has finished loading the
last checkpoint FsImage and has received enough block reports from the DataNodes.
However, this recovery process can consume a lot of time on large Hadoop clusters, which makes
routine maintenance difficult. For this reason, the HDFS high-availability architecture is recommended.