BDAV Question Bank Solution


1. Explain 5 V’s of Big Data.


Ans. Big data is a combination of unstructured, semi-structured or structured data collected by
organizations. These data sets can be mined to gain insights and used in machine learning
projects, predictive modeling and other advanced analytics applications.

The 5 V's are defined as follows:


1. Velocity is the speed at which the data is created and how fast it moves.
2. Volume is the amount of data qualifying as big data.
3. Value is the value the data provides.
4. Variety is the diversity that exists in the types of data.
5. Veracity is the data's quality and accuracy.
Velocity
● Velocity refers to how quickly data is generated and how fast it moves. This is an important
aspect for organizations that need their data to flow quickly, so it's available at the right times to
make the best business decisions possible.
● An organization that uses big data will have a large and continuous flow of data that's being
created and sent to its end destination. Data could flow from sources such as machines,
networks, smartphones or social media. Velocity applies to the speed at which this information
arrives -- for example, how many social media posts per day are ingested -- as well as the speed
at which it needs to be digested and analyzed -- often quickly and sometimes in near real time.
● As an example, in healthcare, many medical devices today are designed to monitor patients and
collect data. From in-hospital medical equipment to wearable devices, collected data needs to
be sent to its destination and analyzed quickly.
● In some cases, however, it might be better to have a limited set of collected data than to collect
more data than an organization can handle -- because this can lead to slower data velocities.

Volume
● Volume refers to the amount of data that exists. Volume is like the base of big data, as it's the
initial size and amount of data that's collected. If the volume of data is large enough, it can be
considered big data. However, what's considered to be big data is relative and will change
depending on the available computing power that's on the market.
● For example, a company that operates hundreds of stores across several states generates
millions of transactions per day. This qualifies as big data, and the average number of total
transactions per day across stores represents its volume.

Value
● Value refers to the benefits that big data can provide, and it relates directly to what
organizations can do with that collected data. Being able to pull value from big data is a
requirement, as the value of big data increases significantly depending on the insights that can
be gained from it.
● Organizations can use big data tools to gather and analyze the data, but how they derive value
from that data should be unique to them. Tools like Apache Hadoop can help organizations
store, clean and rapidly process this massive amount of data.
● A great example of big data value can be found in the gathering of individual customer data.
When a company can profile its customers, it can personalize their experience in marketing and
sales, improving the efficiency of contacts and garnering greater customer satisfaction.

Variety
● Variety refers to the diversity of data types. An organization might obtain data from several data
sources, which might vary in value. Data can come from sources in and outside an enterprise as
well. The challenge in variety concerns the standardization and distribution of all data being
collected.
● As noted above, collected data can be unstructured, semi-structured or structured.
Unstructured data is data that's unorganized and comes in different files or formats. Typically,
unstructured data isn't a good fit for a mainstream relational database because it doesn't fit into
conventional data models. Semi-structured data is data that hasn't been organized into a
specialized repository but has associated information, such as metadata. This makes it easier to
process than unstructured data. Structured data, meanwhile, is data that has been organized
into a formatted repository. This means the data is made more addressable for effective data
processing and analysis.
● A more specific example could be found in a company that gathers a variety of data about its
customers. This can include structured data culled from transactions or unstructured social
media posts and call center text. Much of this might arrive in the form of raw data, requiring
cleaning before processing.

Veracity
● Veracity refers to the quality, accuracy, integrity and credibility of data. Gathered data could
have missing pieces, might be inaccurate or might not be able to provide real, valuable insight.
Veracity, overall, refers to the level of trust there is in the collected data.
● Data can sometimes become messy and difficult to use. A large amount of data can cause more
confusion than insights if it's incomplete. For example, in the medical field, if data about what
drugs a patient is taking is incomplete, the patient's life could be endangered.
● Both value and veracity help define the quality of the data and the insights gathered from it. Thresholds for data veracity often exist -- and should exist -- at the executive level of an organization, to determine whether the data is suitable for high-level decision-making.

2. Applications of Big Data


Ans. Applications of Big Data are:
1. Recommendation: By tracking customers' spending habits and shopping behavior, big retail stores provide recommendations to customers. E-commerce sites like Amazon, Walmart, and Flipkart do product recommendation: they track which products a customer searches for and, based on that data, recommend similar products to that customer.
As an example, suppose a customer searches for a bed cover on Amazon. Amazon now has data indicating that this customer may be interested in buying a bed cover. The next time that customer visits a Google page, advertisements for various bed covers will be shown. Thus, advertisements for the right product can be served to the right customer.

2. Smart Traffic System: Data about traffic conditions on different roads is collected through cameras placed beside the roads and at the entry and exit points of the city, and through GPS devices placed in vehicles (Ola and Uber cabs, etc.). All this data is analyzed, and jam-free or less congested, faster routes are recommended. Such a smart traffic system can be built in a city through big data analysis. An added benefit is that fuel consumption can be reduced.

3. Auto-Driving Cars: Big data analysis helps drive a car without human intervention. Cameras and sensors placed at various spots on the car gather data such as the size of surrounding vehicles, obstacles, and the distance from them. This data is analyzed, and calculations such as what angle to turn, what the speed should be, and when to stop are carried out. These calculations help the car take action automatically.

4. IoT:
Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing such data, it can be predicted how long a machine will run without problems and when it will require repair, so the company can act before the machine develops serious issues or breaks down completely. Thus, the cost of replacing the whole machine can be saved.
In the healthcare field, big data makes a significant contribution. Using big data tools, data regarding patient experience is collected and used by doctors to give better treatment. IoT devices can sense symptoms of a probable upcoming disease in the human body and help prevent it through early treatment. IoT sensors placed near patients and newborn babies constantly track health parameters such as heart rate and blood pressure. Whenever any parameter crosses the safe limit, an alarm is sent to a doctor so that steps can be taken remotely very quickly.

5. Education Sector: Organizations that conduct online educational courses use big data to find candidates interested in those courses. If someone searches for a YouTube tutorial video on a subject, an online or offline course provider for that subject sends that person an online ad about its course.

6. Energy Sector: Smart electric meters read the power consumed every 15 minutes and send the readings to a server, where the data is analyzed to estimate the times of day when the power load across the city is low. Based on this, manufacturing units or households are advised to run their heavy machines at night, when the load is low, to enjoy a lower electricity bill.
3. Differentiate between Traditional Vs Big Data.

4. Explain Types of Big Data and give examples.


Ans. Big Data can be defined as a high amount of data that cannot be processed or stored with
the help of standard processing equipment and data storage. A massive amount of data is
produced daily, and interpreting and manually processing complex and expansive datasets is next to impossible. It requires modern tools and expert skills to interpret large volumes of data and provide organizations with valuable insights that help businesses grow. Let's discuss the various types of big data in detail.
various types of big data in detail.
Different Types of Big Data
Big data types in Big Data are used to categorize the numerous kinds of data generated daily.
Primarily there are 3 types of big data in analytics. The following types of Big Data with
examples are explained below:-
A. Structured Data
Any data that can be processed, is easily accessible, and can be stored in a fixed format is called structured data. In Big Data, structured data is the easiest to work with because it is highly organized, with measurements defined by set parameters. Structured Big Data is summarized below:
Overview:
● Highly organized and easily searchable in databases.
● Follows a predefined schema (e.g., rows and columns in a table).
● Typically stored in relational databases (SQL).
Examples:
● Customer information databases (names, addresses, phone numbers).
● Financial data (transactions, account balances).
● Inventory management systems.
● Metadata (data about data).
Merits:
● Easy to analyze and query.
● High consistency and accuracy.
● Efficient storage and retrieval.
● Strong data integrity and validation.
Limitations:
● Limited flexibility (must adhere to a strict schema).
● Scalability issues with very large datasets.
● Less suitable for complex big data types.
B. Semi-structured Data
In Big Data, semi-structured data is a combination of both unstructured and structured types of big data. This form of data has some of the features of structured data but also contains information that does not adhere to the formal structure of data models or relational databases. Some semi-structured data examples include XML and JSON.
Overview:
● Contains both structured and unstructured elements.
● Lacks a fixed schema but includes tags and markers to separate data elements.
● Often stored in formats like XML, JSON, or NoSQL databases.
Examples:
● JSON files for web APIs.
● XML documents for data interchange.
● Email messages (headers are structured, body can be unstructured).
● HTML pages.
Merits:
● More flexible than structured data.
● Easier to parse and analyze than unstructured data.
● Can handle a wide variety of data types.
● Better suited for hierarchical data.
Limitations:
● More complex to manage than structured data.
● Parsing can be resource-intensive.
● Inconsistent data quality.
C. Unstructured Data
Unstructured data in Big Data consists of multitudes of unstructured files (images, audio, logs, and video). This form of data is classified as complex data because of its lack of a familiar structure and its relatively huge size. A stark example of unstructured data is the output returned by ‘Google Search’ or ‘Yahoo Search.’
Overview:
● Data that does not conform to a predefined schema.
● Includes text, multimedia, and other non-tabular data types.
● Stored in data lakes, NoSQL databases, and other flexible storage solutions.
Examples:
● Text documents (Word files, PDFs).
● Multimedia files (images, videos, audio).
● Social media posts.
● Web pages.
Merits:
● Capable of storing vast amounts of diverse data.
● High flexibility in data storage.
● Suitable for complex data types like multimedia.
● Facilitates advanced analytics and machine learning applications.
Limitations:
● Difficult to search and analyze without preprocessing.
● Requires large storage capacities.
● Inconsistent data quality and reliability.
5. What are the differences between NameNode and Standby NameNode?

Ans. The differences between the NameNode (Active) and the Standby NameNode are:

1. NameNode: The primary component responsible for managing the filesystem namespace.
   Standby NameNode: A secondary node that acts as a backup to the Active NameNode.

2. NameNode: It maintains the metadata of all files and directories in HDFS (e.g., file permissions, locations of blocks on DataNodes).
   Standby NameNode: It stays in sync with the Active NameNode but does not serve client requests or manage DataNodes under normal conditions.

3. NameNode: It is part of the High Availability (HA) setup to ensure that HDFS continues running in case the Active NameNode fails.
   Standby NameNode: It is likewise part of the High Availability (HA) setup, for the same purpose.

4. NameNode: In a non-HA setup, there is only one Active NameNode, meaning that if it fails, the entire HDFS becomes unavailable.
   Standby NameNode: In an HA setup, the Standby NameNode constantly monitors and syncs with the Active NameNode by receiving edits log updates and block reports from DataNodes.

5. NameNode: This single point of failure is mitigated by introducing an HA configuration, where the Standby NameNode is used.
   Standby NameNode: If the Active NameNode fails, the Standby NameNode can take over and become the new Active NameNode with little to no downtime.

6. NameNode: Applies changes to the filesystem metadata and writes these changes to the edits log.
   Standby NameNode: Synchronizes its state with the Active NameNode through shared storage to keep the metadata and edits log up to date.

7. NameNode: The Active NameNode directly handles all client requests for metadata (file system operations like read, write, delete).
   Standby NameNode: The Standby NameNode does not handle client requests unless it becomes the Active NameNode due to a failover.

8. NameNode: If the Active NameNode fails, a failover procedure is triggered.
   Standby NameNode: It takes over the role of the Active NameNode during a failover, resuming client requests and managing the file system as the new Active NameNode.

6. Draw and Explain Secondary Name Node and Checkpointing Mechanism


Ans. The Secondary NameNode in Hadoop HDFS plays an important role in the checkpointing
mechanism, but it is often misunderstood. Contrary to its name, the Secondary NameNode is
not a backup of the NameNode, and it does not provide high availability or failover capabilities.
Its main function is to assist the NameNode in managing metadata by handling the
checkpointing process.
Role of the Secondary NameNode
● The Secondary NameNode's primary job is to periodically merge the fsimage (file system
image) and the edits log to create a new checkpoint of the metadata.
● Over time, the edits log grows larger as it records all file system changes (such as file
creations, deletions, or modifications) since the last checkpoint.
● If the edits log becomes too large, it could potentially lead to performance degradation
or even failure of the NameNode. To prevent this, the Secondary NameNode periodically
merges the edits log with the fsimage.
Checkpointing Mechanism
The checkpointing process involves creating a new, consolidated fsimage file from the edits log
and the old fsimage. This process helps the NameNode avoid performance issues caused by the
ever-growing edits log. Here's how it works:
1. Fsimage and Edits Log:
○ The fsimage is a persistent, compact snapshot of the entire HDFS file system's
metadata (file structure, block locations, permissions, etc.) at a specific point in
time.
○ The edits log records every change to the file system that occurs after the last
checkpoint. This log grows over time as more changes are made.
2. Checkpointing Process:
○ Step 1: Transfer of fsimage and edits log:
■ The Secondary NameNode fetches a copy of the fsimage and the edits log
from the Active NameNode periodically.
○ Step 2: Merging:
■ The Secondary NameNode merges the edits log with the fsimage. This
process consolidates the edits (changes) into the fsimage, effectively
bringing the fsimage up to date with all changes made since the last
checkpoint.
○ Step 3: Creation of a new fsimage:
■ Once the merge is complete, the Secondary NameNode creates a new,
updated fsimage.
○ Step 4: Transfer back to the NameNode:
■ The Secondary NameNode sends the new fsimage back to the Active
NameNode. The NameNode can now discard the old edits log and start
recording new changes in a fresh edits log.
3. Benefits of Checkpointing:
○ Memory Efficiency: By merging the fsimage and edits log periodically, the
NameNode can operate more efficiently, as it no longer has to maintain or replay
a huge edits log.
○ Faster Startup: When the NameNode restarts, it loads the fsimage and replays
the edits log to rebuild the file system state. A smaller edits log results in faster
recovery times after restarts.
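As a hedged illustration, the checkpoint frequency is controlled by standard HDFS properties set in hdfs-site.xml; the values below are the commonly cited defaults and may differ between Hadoop versions:

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>        <!-- trigger a checkpoint at least every hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>     <!-- or earlier, once this many uncheckpointed edits accumulate -->
</property>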
7. What is Rack Awareness in HADOOP? Define Rack Awareness Policies.
Ans. In a large Hadoop cluster, there are multiple racks. Each rack consists of DataNodes.
Communication between the DataNodes on the same rack is more efficient as compared to the
communication between DataNodes residing on different racks.
To reduce the network traffic during file read/write, NameNode chooses the closest DataNode
for serving the client read/write request. NameNode maintains the rack ids of each DataNode to
achieve this rack information. This concept of choosing the closest DataNode based on the rack
information is known as Rack Awareness.
A default Hadoop installation assumes that all the DataNodes reside on the same rack.
Why Rack Awareness?
The reasons for the Rack Awareness in Hadoop are:
1. To reduce the network traffic while file read/write, which improves the cluster
performance.
2. To achieve fault tolerance, even when the rack goes down (discussed later in this
article).
3. Achieve high availability of data so that data is available even in unfavorable
conditions.
4. To reduce the latency, that is, to make the file read/write operations done with
lower delay.
NameNode uses a rack awareness algorithm while placing the replicas in HDFS.
Rack Awareness Policies:
Here are some of the key rack awareness policies used by Hadoop when determining how to
distribute data across racks and nodes:
1. Replica Placement Policy:
○ The default replication strategy in Hadoop, based on rack awareness, places the
data replicas as follows:
■ First Replica: Placed on the node where the client is writing the data
(typically the same rack).
■ Second Replica: Placed on a node in a different rack, ensuring that the
data survives if the first rack fails.
■ Third Replica: Placed on another node within the same rack as the
second replica (but on a different node).
○ This policy ensures that data is distributed across racks to tolerate rack failures while keeping network traffic in check.
2. Data Locality Policy:
○ Whenever possible, Hadoop tries to schedule tasks on nodes where the required
data blocks are located (node-level data locality).
○ If node-level locality is not possible, Hadoop schedules the task on nodes in the
same rack (rack-level locality) to reduce network traffic.
○ If neither node-level nor rack-level locality is available, Hadoop falls back on using
nodes from different racks (remote data locality), which is the least preferred due
to the higher network cost.
3. Failure Handling Policy:
○ In case of a node failure, Hadoop uses rack awareness to ensure that data blocks
are replicated to healthy nodes. It avoids placing all replicas on the same rack or
node to mitigate the risk of data loss.
○ If an entire rack fails, Hadoop still has copies of the data in other racks, ensuring
the cluster’s resiliency and fault tolerance.
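As a sketch, rack information is supplied through Hadoop's standard topology-script mechanism: a property (typically in core-site.xml) points to an administrator-written script that maps each DataNode host to a rack id. The script path below is a placeholder:

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/rack-topology.sh</value>
  <!-- placeholder path; the script prints a rack id such as /dc1/rack1 for each host address it is given -->
</property>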
8. Explain Speculative Execution. How can MapReduce jobs be optimized using Speculative Execution?
Ans. In MapReduce, jobs are broken into tasks and the tasks are run in parallel to make the overall job execution time smaller than it would otherwise be if the tasks ran sequentially. Among the divided tasks, if one task takes more time than desired, the overall execution time of the job increases.
Tasks may be slow for various reasons:
● Causes include hardware degradation or software misconfiguration, but they may be hard to detect, since the tasks may still complete successfully, just after a longer time than expected.
● Apache Hadoop does not try to fix or diagnose slow-running tasks.
● Instead, it tries to detect when a task is running slower than expected and launches another, equivalent task as a backup (the backup task is called a speculative task). This process is called speculative execution in MapReduce.
● Speculative execution in Hadoop does not mean launching duplicate tasks at the same time so they can race, as this would waste cluster resources. Rather, a speculative task is launched only after a task has run for a significant amount of time and the framework detects that it is running slowly compared to the other tasks of the same job.
● When a task completes successfully, any duplicate tasks that are still running are killed since they are no longer needed.
● If the original task finishes before the speculative task, the speculative task is killed; on the other hand, if the speculative task finishes first, the original one is killed. Speculative execution in Hadoop is just an optimization; it is not a feature to make jobs run more reliably.
● The speed of a MapReduce job is dominated by its slowest task. MapReduce first detects slow tasks and then runs redundant (speculative) copies, which optimistically commit before the corresponding stragglers. Only one copy of a straggler is allowed to be speculated. Whichever of the two copies of a task commits first becomes the definitive copy, and the other copy is killed by the framework.
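A minimal driver sketch showing how speculative execution can be toggled per job through the standard mapreduce.map.speculative and mapreduce.reduce.speculative properties (class and job names are placeholders; the rest of the job setup is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow backup (speculative) copies of slow map tasks, but not of reduce tasks.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        Job job = Job.getInstance(conf, "speculative-execution-demo");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}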
9. Describe the map reduce algorithm for matrix and vector multiplication.
Ans. MapReduce is a technique in which a huge program is subdivided into small tasks and run
parallelly to make computation faster, save time, and mostly used in distributed systems. It has
2 important parts:
● Mapper: It takes raw data input and organizes it into key, value pairs. For example, In
a dictionary, you search for the word “Data” and its associated meaning is “facts and
statistics collected together for reference or analysis”. Here the Key is Data and the
Value associated with is facts and statistics collected together for reference or
analysis.
● Reducer: It is responsible for processing data in parallel and producing final output.
Consider the following 2×2 matrices A and B:

A = [[1, 2], [3, 4]]    B = [[5, 6], [7, 8]]


Here matrix A is a 2×2 matrix which means the number of rows(i)=2 and the number of
columns(j)=2. Matrix B is also a 2×2 matrix where number of rows(j)=2 and number of
columns(k)=2. Each cell of the matrix is labelled as Aij and Bij. Ex. element 3 in matrix A is called
A21 i.e. 2nd-row 1st column. Now One step matrix multiplication has 1 mapper and 1 reducer.
The Formula is:
Mapper for Matrix A (k, v)=((i, k), (A, j, Aij)) for all k
Mapper for Matrix B (k, v)=((i, k), (B, j, Bjk)) for all i
Therefore computing the mapper for Matrix A:
# k, i, j computes the number of times it occurs.
# Here all are 2, therefore when k=1, i can have
# 2 values 1 & 2, each case can have 2 further
# values of j=1 and j=2. Substituting all values
# in formula

k=1 i=1 j=1 ((1, 1), (A, 1, 1))


j=2 ((1, 1), (A, 2, 2))
i=2 j=1 ((2, 1), (A, 1, 3))
j=2 ((2, 1), (A, 2, 4))

k=2 i=1 j=1 ((1, 2), (A, 1, 1))


j=2 ((1, 2), (A, 2, 2))
i=2 j=1 ((2, 2), (A, 1, 3))
j=2 ((2, 2), (A, 2, 4))
Computing the mapper for Matrix B
i=1 j=1 k=1 ((1, 1), (B, 1, 5))
k=2 ((1, 2), (B, 1, 6))
j=2 k=1 ((1, 1), (B, 2, 7))
k=2 ((1, 2), (B, 2, 8))

i=2 j=1 k=1 ((2, 1), (B, 1, 5))


k=2 ((2, 2), (B, 1, 6))
j=2 k=1 ((2, 1), (B, 2, 7))
k=2 ((2, 2), (B, 2, 8))
The formula for Reducer is:
Reducer(k, v): for each key (i, k), make a sorted A list and B list
(i, k) => Summation of (Aij * Bjk) over j
Output => ((i, k), sum)
Therefore computing the reducer:
# We can observe from Mapper computation
# that 4 pairs are common (1, 1), (1, 2),
# (2, 1) and (2, 2)
# Make a list separate for Matrix A &
# B with adjoining values taken from
# Mapper step above:

(1, 1) =>Alist ={(A, 1, 1), (A, 2, 2)}


Blist ={(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(1*5) + (2*7)] =19 -------(i)

(1, 2) =>Alist ={(A, 1, 1), (A, 2, 2)}


Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(1*6) + (2*8)] =22 -------(ii)

(2, 1) =>Alist ={(A, 1, 3), (A, 2, 4)}


Blist ={(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(3*5) + (4*7)] =43 -------(iii)

(2, 2) =>Alist ={(A, 1, 3), (A, 2, 4)}


Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(3*6) + (4*8)] =50 -------(iv)

From (i), (ii), (iii) and (iv) we conclude that


((1, 1), 19)
((1, 2), 22)
((2, 1), 43)
((2, 2), 50)
Therefore the final matrix A × B is:
[[19, 22], [43, 50]]
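A compact Java sketch of the same one-step algorithm, assuming input lines of the form "A,i,j,value" or "B,j,k,value" and a known (here hard-coded 2×2) matrix size; it mirrors the mapper and reducer formulas above rather than being a production implementation:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixMultiply {
    public static final int N = 2;  // assumed square matrix size

    // Emits ((i,k), "A,j,Aij") for all k, and ((i,k), "B,j,Bjk") for all i.
    public static class MatrixMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");   // e.g. "A,2,1,3" means A21 = 3
            if (f[0].trim().equals("A")) {
                int i = Integer.parseInt(f[1].trim());
                int j = Integer.parseInt(f[2].trim());
                for (int k = 1; k <= N; k++) {
                    context.write(new Text(i + "," + k), new Text("A," + j + "," + f[3].trim()));
                }
            } else {
                int j = Integer.parseInt(f[1].trim());
                int k = Integer.parseInt(f[2].trim());
                for (int i = 1; i <= N; i++) {
                    context.write(new Text(i + "," + k), new Text("B," + j + "," + f[3].trim()));
                }
            }
        }
    }

    // For each key (i,k), multiplies matching A and B entries by j and sums them.
    public static class MatrixReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Map<Integer, Double> a = new HashMap<>();
            Map<Integer, Double> b = new HashMap<>();
            for (Text v : values) {
                String[] f = v.toString().split(",");
                int j = Integer.parseInt(f[1]);
                double val = Double.parseDouble(f[2]);
                if (f[0].equals("A")) a.put(j, val); else b.put(j, val);
            }
            double sum = 0;
            for (int j = 1; j <= N; j++) {
                sum += a.getOrDefault(j, 0.0) * b.getOrDefault(j, 0.0);
            }
            context.write(key, new Text(String.valueOf(sum)));
        }
    }
}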

10. What is shuffling and sorting in Map Reduce?


Ans. Shuffling in MapReduce
● The process of transferring data from the mappers to reducers is shuffling. It is also the
process by which the system performs the sort. Then it transfers the map output to the
reducer as input. This is the reason the shuffle phase is necessary for the reducers.
● Otherwise, they would not have any input (or input from every mapper). Shuffling can start even before the map phase has finished, which saves some time and completes the job in less time.
Sorting in MapReduce
● MapReduce Framework automatically sorts the keys generated by the mapper. Thus,
before starting off, all intermediate key-value pairs get sorted by key and not by value. It
does not sort values passed to each reducer. They can be in any order.
● Sorting in a MapReduce job helps reducers to easily distinguish when a new reduce task
should start.
● This saves time for the reducer. Reducer in MapReduce starts a new reduce task when
the next key in the sorted input data is different from the previous. Each reduce task
takes key value pairs as input and generates key-value pairs as output.
● The important thing to note is that shuffling and sorting in Hadoop MapReduce do not take place at all if you specify zero reducers (setNumReduceTasks(0)).
● If the number of reducers is zero, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so the map phase is faster).
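A brief driver sketch of a map-only job: with job.setNumReduceTasks(0) the mapper's output is written straight to HDFS and no shuffle or sort occurs (input and output paths come from the command line; the identity Mapper base class is used for simplicity):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-job");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(Mapper.class);            // identity mapper: passes (offset, line) through
        job.setOutputKeyClass(LongWritable.class);   // matches the identity mapper's output over text input
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);                    // zero reducers: skip shuffle and sort entirely
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}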
11. What is InputFormat? Write the difference between HDFS Block and InputSplit.
Ans. Block in HDFS
● Hadoop HDFS splits large files into small chunks known as blocks. A block is the minimum amount of data that HDFS can read or write; HDFS stores each file as blocks.
● The Hadoop framework distributes the data blocks across multiple nodes. The HDFS client doesn't have any control over the blocks (such as block location); the NameNode decides all such things.
InputSplit in Hadoop
● It represents the data that an individual mapper processes. Thus the number of map tasks is equal to the number of InputSplits. The framework divides each split into records, which the mapper processes.
● Initially, input files store the data for a MapReduce job. An input file typically resides in HDFS. InputFormat describes how to split up and read input files, and it is responsible for creating the InputSplits.

Comparison Between InputSplit vs Blocks in Hadoop


1. Data Representation
● Block – HDFS Block is the physical representation of data in Hadoop.
● InputSplit – MapReduce InputSplit is the logical representation of data present in
the block in Hadoop. It is basically used during data processing in the MapReduce
program or other processing techniques. The main thing to focus is that InputSplit
doesn’t contain actual data; it is just a reference to the data.
2. Size
● Block – By default, the HDFS block size is 128MB which you can change as per your
requirement. All HDFS blocks are the same size except the last block, which can be
either the same size or smaller. Hadoop framework breaks files into 128 MB blocks
and then stores them into the Hadoop file system.
● InputSplit – InputSplit size by default is approximately equal to block size. It is user
defined. In the MapReduce program the user can control split size based on the size
of data.
3. Example of Block and InputSplit in Hadoop
Suppose we need to store the file in HDFS. Hadoop HDFS stores files as blocks. Block is the
smallest unit of data that can be stored or retrieved from the disk.
The default size of the block is 128MB. Hadoop HDFS breaks files into blocks. Then it stores
these blocks on different nodes in the cluster.

For example, suppose we have a file of 130 MB. HDFS will break this file into 2 blocks (128 MB and 2 MB).
Now, if we want to perform a MapReduce operation directly on the blocks, processing can go wrong, because a record may be cut at the block boundary, leaving the second block incomplete on its own. InputSplit solves this problem.
MapReduce InputSplit forms a logical grouping of blocks as a single split, since the InputSplit includes the location of the next block and the byte offset of the data needed to complete the record.
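A small driver sketch showing how split size can be influenced from the driver; setMinInputSplitSize and setMaxInputSplitSize are standard FileInputFormat helpers, and the sizes shown are only examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        // Ask the framework for splits of at least 128 MB and at most 256 MB.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        // ... configure mapper, reducer, and input/output paths as usual ...
    }
}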

12. Illustrate the main components of the Hadoop system.


Ans. Hadoop is a framework that uses distributed storage and parallel processing to store and
manage Big Data. It is the most commonly used software to handle Big Data. There are three
components of Hadoop.
1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
2. Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.
3. Hadoop YARN - Hadoop YARN is a resource management unit of Hadoop.
Hadoop HDFS
Data is stored in a distributed manner in HDFS. There are two components of HDFS - name node
and data node. While there is only one name node, there can be multiple data nodes.
HDFS is specially designed for storing huge datasets in commodity hardware. An enterprise
version of a server costs roughly $10,000 per terabyte for the full processor. In case you need to
buy 100 of these enterprise version servers, it will go up to a million dollars.
Hadoop enables you to use commodity machines as your data nodes. This way, you don’t have
to spend millions of dollars just on your data nodes. However, the name node is always an
enterprise server.
Features of HDFS
● Provides distributed storage
● Can be implemented on commodity hardware
● Provides data security
● Highly fault-tolerant - If one machine goes down, the data from that machine goes to
the next machine
Master and Slave Nodes
Master and slave nodes form the HDFS cluster. The name node is called the master, and the
data nodes are called the slaves.

The name node is responsible for the workings of the data nodes. It also stores the metadata.
The data nodes read, write, process, and replicate the data. They also send signals, known as
heartbeats, to the name node. These heartbeats show the status of the data node.
Consider that 30 TB of data is loaded into the name node. The name node distributes it across the data nodes, and this data is replicated among the data nodes.
Replication of the data is performed three times by default. It is done this way so that if a commodity machine fails, you can replace it with a new machine that has the same data.

Hadoop MapReduce
Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the
processing is done at the slave nodes, and the final result is sent to the master node.
In MapReduce, code is sent to the data rather than moving the data to the code. This code is usually very small in comparison to the data itself; you only need to send a few kilobytes' worth of code to perform a heavy-duty process on the computers storing the data.

The input dataset is first split into chunks of data. In this example, the input has three lines of text: “bus car train,” “ship ship train,” and “bus ship car.” The dataset is split into three chunks, one per line, and the chunks are processed in parallel.
In the map phase, each word in the data is emitted as a key with a value of 1. For the first line, for example, we get one bus, one car, and one train.
These key-value pairs are then shuffled and sorted together based on their keys. In the reduce phase, the aggregation takes place (the counts per word are summed), and the final output is obtained.
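The “bus car train” walkthrough above is essentially word count; a conventional Java sketch of that job's mapper and reducer is shown below (driver setup omitted):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffle/sort, sum the 1s for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}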

Hadoop YARN
Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit
of Hadoop and is available as a component of Hadoop version 2.
● Hadoop YARN acts like an OS to Hadoop. It is a resource management layer that works on top of HDFS.
● It is responsible for managing cluster resources to make sure you don't overload one
machine.
● It performs job scheduling to make sure that the jobs are scheduled in the right place

Suppose a client machine wants to do a query or fetch some code for data analysis. This job
request goes to the resource manager (Hadoop Yarn), which is responsible for resource
allocation and management.
On the node side, each node has its own node manager. These node managers manage the nodes and monitor resource usage on them. Containers hold a collection of physical resources, such as RAM, CPU, or disk. Whenever a job request comes in, the application master negotiates containers with the resource manager, and the node managers launch those containers and report resource usage back to the resource manager.

13. What is a MapReduce Partitioner? What is the need for a Partitioner? How many partitioners are there in Hadoop?
Ans. In Hadoop's MapReduce framework, the Partitioner plays a crucial role in determining
how the intermediate key-value pairs (produced by the Mapper phase) are distributed to the
Reducer tasks. It ensures that data is partitioned appropriately and that all key-value pairs
belonging to the same key are sent to the same reducer.
What is a MapReduce Partitioner?
A Partitioner in MapReduce is responsible for controlling the partitioning of the intermediate
map output (key-value pairs) across reducers. It decides which reducer will process which
subset of the intermediate keys produced by the mappers.
In other words, the partitioner determines the logical division of data so that each reducer
receives a distinct and non-overlapping portion of the data, based on the key. The goal is to
group all the values associated with the same key into the same reducer.
Need for a Partitioner in MapReduce:
The Partitioner serves several important purposes in the MapReduce process:
1. Efficient Data Distribution:
○ The partitioner helps distribute the mapper's output evenly among reducers to
avoid data skew. If all data is sent to a single reducer, it will become a bottleneck,
slowing down the entire job. A well-designed partitioner ensures load balancing.
2. Grouping Similar Data:
○ Keys that need to be processed together must go to the same reducer. The
partitioner ensures that all records with the same key are directed to the same
reducer, making sure that reducers can correctly group and process these
records.
3. Custom Partitioning Logic:
○ In many cases, the default HashPartitioner may not suffice, especially when you
want specific data distribution patterns. A custom partitioner allows for
implementing application-specific logic to determine how the data should be
distributed across reducers.
How the Partitioner Works:
1. Mapper Phase: Each mapper processes a chunk of input data and produces key-value
pairs as intermediate output.
2. Partitioner: Before sending the mapper's output to the reducers, the partitioner
determines which reducer each key-value pair should be sent to, based on the key.
3. Reducer Phase: All the key-value pairs for a particular key are grouped together and sent
to the appropriate reducer for further processing.
For example, if there are three reducers, and the hash of a particular key resolves to partition 2,
then all key-value pairs with that key will be sent to the reducer responsible for partition 2.

How many Partitioners are there in Hadoop?


The total number of partitions in Hadoop is equal to the number of reducers; that is, the Partitioner divides the data according to the number of reducers, which is set by the JobConf.setNumReduceTasks() method.
Thus, the data from a single partition is processed by a single reducer, and a partitioner is needed only when there is more than one reducer.
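A minimal custom Partitioner sketch; the hash expression below reproduces the behaviour of the default HashPartitioner, and a real custom partitioner would replace it with application-specific logic:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Map the key's hash onto one of the reducers (partitions 0 .. numPartitions-1).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
// In the driver: job.setPartitionerClass(KeyHashPartitioner.class);
// job.setNumReduceTasks(3);   // getPartition() then returns a value in [0, 3)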

14. Explain in detail Shared nothing architecture.


Ans. Shared Nothing Architecture (SNA) is a distributed computing architecture where each
node (or server) in the system is independent and self-sufficient. This means that nodes do not
share memory or storage; they only communicate with each other through a network. Here are
the key characteristics and benefits of Shared Nothing Architecture:
Characteristics of Shared Nothing Architecture
● Independence: Each node operates independently and does not share disk storage
or RAM with other nodes. Nodes communicate via network protocols, such as
TCP/IP.
● Scalability: The architecture can easily scale horizontally by adding more nodes
without significant changes to the existing setup. Each additional node brings its own
memory and storage, enhancing the system’s overall capacity and performance.
● Fault Isolation: Since nodes are independent, the failure of one node does not
directly affect others. Faults are isolated to individual nodes, making the system
more resilient.
● Data Distribution: Data is partitioned across the nodes, often using techniques like
sharding. Each node is responsible for a specific subset of the data.
● Parallel Processing: Multiple nodes can perform operations concurrently on different
partitions of the data, leading to significant performance improvements for
large-scale tasks.

Importance of Shared Nothing Architecture in System Design


Shared Nothing Architecture (SNA) plays a significant role in system design, especially for
distributed systems, due to its various advantages and impact on performance, scalability, and
reliability.
● Scalability
Systems can be scaled out by adding more nodes without major modifications. This is crucial for
handling increasing loads and growing datasets.
Each new node brings its own storage and computing resources, directly enhancing the system’s
overall capacity.
● Performance
Tasks can be distributed across multiple nodes, enabling parallel processing and significantly
improving performance for large-scale computations.
Reduced contention for resources since nodes operate independently.
● Reliability and Fault Tolerance
Failure of one node does not affect the others, ensuring that the system remains operational
even if parts of it fail.
This leads to higher availability and reliability, which is critical for systems requiring continuous
uptime.
● Maintenance and Manageability
Nodes can be maintained, upgraded, or replaced independently, simplifying system
management.
Reduced downtime for maintenance activities as other nodes can continue to operate normally.
● Cost-Effectiveness
Organizations can start with a smaller number of nodes and scale out as needed, aligning
infrastructure costs with business growth.
Avoids the high initial investment in large, monolithic systems.
● Flexibility
The architecture supports a modular approach, where different components of the system can
be developed, tested, and deployed independently.
This flexibility facilitates rapid development and deployment cycles.

Key Components of Shared Nothing Architecture


Shared Nothing Architecture (SNA) is structured to maximize independence and parallelism
among nodes in a distributed system. Here are the key components that typically make up an
SNA system:
1. Nodes
● Each node is a self-contained server with its own CPU, memory, and storage.
● Nodes do not share these resources with one another, ensuring no single point of
contention.
2. Data Partitioning
● Data is partitioned across nodes using a method called sharding. Each shard contains
a subset of the data, and each node manages one or more shards.
● Sharding can be based on various criteria like ranges of values, hash functions, or
geographic distribution.
3. Network Communication
● Nodes communicate with each other via network protocols (e.g., TCP/IP).
● This communication is essential for coordinating operations, replicating data, and
ensuring consistency.
4. Replication and Redundancy
● To ensure high availability and fault tolerance, data is often replicated across
multiple nodes.
● Replication strategies can be synchronous or asynchronous, depending on the
consistency requirements.
5. Load Balancing
● Load balancers distribute incoming requests evenly across nodes to ensure no single
node becomes a bottleneck.
● This helps in optimizing resource utilization and maintaining high performance.
6. Distributed Query Processing
● For databases, query processing is distributed among nodes. A central coordinator
may break down queries into sub-queries that are processed in parallel by different
nodes.
● The results are then aggregated and returned to the client.

Benefits of Shared Nothing Architecture


Shared Nothing Architecture (SNA) offers several significant benefits that make it an attractive
choice for designing scalable, high-performance, and reliable systems. Here are the key benefits:
1. Scalability
● Horizontal Scalability:
New nodes can be added to the system without disrupting existing operations. This is known as
horizontal scaling, where additional nodes provide more computational power, storage, and
bandwidth.
● Elasticity:
Resources can be adjusted dynamically based on demand, allowing systems to handle varying
workloads efficiently. This elasticity is particularly beneficial in cloud environments.
2. Performance
● Parallel Processing:
Multiple nodes can perform operations in parallel, significantly improving the performance of
data-intensive tasks such as big data processing, analytics, and real-time applications.
● Resource Isolation:
Since each node has its own dedicated resources (CPU, memory, storage), there is no
competition for shared resources, leading to predictable and optimal performance.
3. Reliability and Availability
● Fault Tolerance:
The failure of one node does not affect the operation of other nodes. Faults are isolated,
enhancing the overall reliability and robustness of the system.
● High Availability:
Redundant data and processes across multiple nodes ensure that the system remains available
even if some nodes fail. This redundancy is crucial for mission-critical applications.
4. Maintenance and Management
● Simplified Maintenance:
Maintenance, upgrades, and repairs can be performed on individual nodes without affecting the
entire system. This simplifies system management and reduces downtime.
● Independent Development and Deployment:
Teams can develop, deploy, and upgrade different parts of the system independently, speeding
up development cycles and enhancing flexibility.
5. Cost Efficiency
● Pay-as-You-Grow Model:
Organizations can start with a minimal setup and add resources incrementally as needed,
optimizing costs and avoiding large upfront investments.
● Resource Optimization:
By distributing workloads and storage, SNA ensures better utilization of resources, reducing
waste and operational costs.

Challenges of Shared Nothing Architecture


While Shared Nothing Architecture (SNA) offers numerous benefits, it also presents several
challenges that need to be addressed to ensure the effective functioning of a distributed
system. Here are some key challenges associated with SNA:
● Architectural Complexity:
Designing an SNA system involves complex architectural decisions, such as data partitioning,
network communication, and fault tolerance mechanisms.
● Implementation Overhead:
Implementing these designs requires sophisticated algorithms and robust infrastructure, which
can be time-consuming and resource-intensive.
● Consistency Models:
Maintaining data consistency across multiple nodes is challenging. Techniques like eventual
consistency, strong consistency, and distributed transactions need to be carefully implemented
to meet application requirements.
● Performance Impact:
The performance of an SNA system heavily relies on network communication. High network
latency or insufficient bandwidth can degrade the overall performance.

15. Explain Computing Selection and Projection by Map Reduce


Ans. In the context of MapReduce, two fundamental operations from relational algebra,
Selection and Projection, can be performed effectively for distributed data processing.
1. Selection (σ)
Selection is an operation used to filter rows based on a certain condition. In SQL terms, it
corresponds to the WHERE clause.
For example, in relational algebra:
σ_{age > 30}(Employees)
This selects all records from the Employees relation where the age is greater than 30.
Selection in MapReduce:
● Map Phase:
○ The mapper reads each row (or record) from the input dataset.
○ For each record, the mapper applies the selection condition (e.g., age > 30).
○ If the condition is satisfied, the mapper emits the record (key-value pair);
otherwise, it discards it.
● Reduce Phase:
○ Typically, for selection, the reducer does not need to do any additional
processing since the filtering is already done in the map phase. Hence, the
reduce phase may simply collect and output the filtered records.
Example:
Consider an input of employee records where we want to filter only employees older than 30.
Input:
1, Alice, 28
2, Bob, 34
3, Carol, 31
Map Output:
2, Bob, 34
3, Carol, 31
● Reduce Output: (in this case, no real reducing is necessary)

2. Projection (π)
Projection selects specific columns from the dataset, eliminating the others. In SQL terms, this
corresponds to the SELECT column1, column2, ... part of a query.
For example:
π_{name, age}(Employees)
This would return only the name and age columns from the Employees relation.
Projection in MapReduce:
● Map Phase:
○ The mapper reads each record.
○ For each record, the mapper extracts only the fields (columns) that are part of
the projection.
○ The mapper emits the key-value pair, where the value contains only the
projected columns.
● Reduce Phase:
○ The reducer does not have to do any further work unless there is a need for
additional aggregation or deduplication (depending on the use case).
Example:
Consider the same employee dataset where we want to project only the name and age
columns.
Input:
1, Alice, 28
2, Bob, 34
3, Carol, 31
Map Output:
Alice, 28
Bob, 34
Carol, 31
● Reduce Output: (again, no real reducing may be required unless aggregation or
deduplication is needed)

Combining Selection and Projection:


Often, selection and projection are combined to first filter rows (selection) and then only
output specific columns (projection). In a MapReduce job, both operations can be performed
during the map phase:
● Apply selection logic (filter rows based on conditions).
● Apply projection logic (retain only specific columns).
Example:
To select employees older than 30 and project only their name and age, you can combine both
operations in the Map phase:
Input:
1, Alice, 28
2, Bob, 34
3, Carol, 31
Map Output (after selection and projection):
Bob, 34
Carol, 31
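A hedged mapper-only sketch of the combined selection and projection above, assuming comma-separated input lines of the form "id, name, age":

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Selection (age > 30) and projection (name, age) in a single map phase; no reducer is needed.
public class SelectProjectMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");        // expected: id, name, age
        if (f.length < 3) return;                        // skip malformed lines
        int age = Integer.parseInt(f[2].trim());
        if (age > 30) {                                  // selection: sigma_{age > 30}
            String projected = f[1].trim() + ", " + age; // projection: pi_{name, age}
            context.write(new Text(projected), NullWritable.get());
        }
    }
}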
16. Explain Computing Grouping and Aggregation by Map Reduce
Ans. In MapReduce, two other essential operations often needed in data processing are
Grouping and Aggregation. These operations are common in relational databases and are
analogous to the GROUP BY and aggregate functions like SUM(), COUNT(), AVG(), etc., in SQL.
1. Grouping (GROUP BY)
Grouping is the operation where data is grouped based on the value of one or more columns. In
SQL, this corresponds to the GROUP BY clause.
For example:
SELECT department, COUNT(*) FROM Employees GROUP BY department;

This query groups the employees by their department and then counts how many employees
are in each department.
Grouping in MapReduce:
● Map Phase:
○ The mapper reads each record from the input dataset.
○ For each record, the mapper emits a key-value pair, where the key is the column
or columns by which we want to group (e.g., department), and the value is the
data relevant for aggregation (e.g., 1 for a count, or some other numeric field for
sum).
● Shuffle and Sort Phase:
○ The MapReduce framework automatically groups the records by the key (e.g., all
records with the same department are grouped together). This is the "grouping"
step.
● Reduce Phase:
○ The reducer receives the grouped key-value pairs.
○ For each group, the reducer performs the necessary computation (e.g., summing,
counting, etc.) and emits the final aggregated result for that group.
Example:
Consider the following employee data:
1, Alice, HR
2, Bob, Engineering
3, Carol, HR
4, David, Engineering
5, Eve, Marketing

We want to group by department and count the number of employees in each department.
Map Output:
HR 1
Engineering 1
HR 1
Engineering 1
Marketing 1
Reduce Output:
HR 2
Engineering 2
Marketing 1

2. Aggregation
Aggregation refers to applying functions such as SUM, COUNT, AVG, MAX, or MIN to grouped
data. For example, computing the total sales in each region or counting the number of
employees in each department are aggregation operations.
Aggregation in MapReduce:
● Map Phase:
○ The mapper emits key-value pairs, where the key is the group (e.g., department)
and the value is the data being aggregated (e.g., salary for summing salaries, or 1
for counting records).
● Reduce Phase:
○ The reducer performs the aggregation function. For example, it sums up all
values for a given key (e.g., sum of salaries for each department).
Example:
Consider we want to compute the sum of salaries for each department. Here's the employee
dataset with salary:
1, Alice, HR, 50000
2, Bob, Engineering, 60000
3, Carol, HR, 55000
4, David, Engineering, 70000
5, Eve, Marketing, 45000

We want to group by department and sum the salary for each department.
Map Phase: Each mapper reads a record and emits the department as the key and the salary as
the value.
HR 50000
Engineering 60000
HR 55000
Engineering 70000
Marketing 45000
● Shuffle and Sort: The records are grouped by the key (department) and passed to the
reducer.
● Reduce Phase: The reducer sums the salaries for each department.
Reduce Output:
HR 105000
Engineering 130000
Marketing 45000

Combining Grouping and Aggregation


In MapReduce, grouping and aggregation typically go hand-in-hand. First, the data is grouped
by the key in the shuffle/sort phase, and then an aggregation function (such as sum, count, etc.)
is applied to each group in the reduce phase.
Example: Average Salary by Department
To calculate the average salary by department, both grouping and aggregation are needed. The
Map phase will emit (department, salary) pairs, and the Reduce phase will calculate the sum
and count of salaries, then compute the average.
Steps:
Map Output:
HR 50000
HR 55000
Engineering 60000
Engineering 70000
Marketing 45000
Shuffle and Sort (group by department):
HR: [50000, 55000]
Engineering: [60000, 70000]
Marketing: [45000]
● Reduce Phase (aggregation to compute average):
○ For each department, calculate the total salary and the number of employees,
then compute the average.
Reduce Output:
HR 52500
Engineering 65000
Marketing 45000
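A sketch of the average-salary job, assuming input lines of the form "id, name, department, salary"; the mapper keys each record by department and the reducer aggregates:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageSalary {
    // Map: emit (department, salary) for each record.
    public static class DeptSalaryMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");   // expected: id, name, department, salary
            if (f.length < 4) return;
            context.write(new Text(f[2].trim()),
                          new DoubleWritable(Double.parseDouble(f[3].trim())));
        }
    }

    // Reduce: for each department, sum the salaries, count them, and emit the average.
    public static class AvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable v : values) { sum += v.get(); count++; }
            context.write(key, new DoubleWritable(sum / count));
        }
    }
}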

17. Short note on sorting and natural joins


Ans. Sorting
Sorting is one of the essential data operations, where the goal is to arrange data in a specific
order (e.g., ascending or descending) based on one or more attributes (columns).
Sorting in MapReduce:
● Map Phase:
○ The mapper reads each record and emits a key-value pair, where the key is the
field (or fields) by which we want to sort (for example, age or salary), and the
value is the entire record.
● Shuffle and Sort Phase:
○ MapReduce automatically sorts the records based on the key emitted by the
mapper. This sorting happens during the shuffle and sort phase, so no extra
programming is required to handle the sorting.
● Reduce Phase:
○ The reducer receives the sorted records. Depending on the use case, the reducer
might either pass the sorted records along as they are or do some further
computation on them.
Example:
Consider a dataset of employees:
1, Alice, 28
2, Bob, 34
3, Carol, 31
4, David, 25
5, Eve, 29
To sort this data by age:
Map Output:
25 (4, David, 25)
28 (1, Alice, 28)
29 (5, Eve, 29)
31 (3, Carol, 31)
34 (2, Bob, 34)
Shuffle and Sort (automatic by MapReduce): The framework sorts these by the key (age) in ascending order:
25: (4, David, 25)
28: (1, Alice, 28)
29: (5, Eve, 29)
31: (3, Carol, 31)
34: (2, Bob, 34)
● Reduce Output: The sorted data can be output directly or further processed as needed.
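A hedged sketch of sorting by age that relies on the shuffle phase itself: the mapper emits the age as the key, and the framework's key sort delivers records in ascending age order (a single reducer is assumed if a total order is required):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emit (age, full record); MapReduce sorts by the IntWritable key during shuffle/sort.
public class SortByAgeMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");   // expected: id, name, age
        if (f.length < 3) return;
        context.write(new IntWritable(Integer.parseInt(f[2].trim())), value);
    }
}
// An identity reducer (the default Reducer.class) then writes the records in sorted key order.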

Natural Joins
A natural join is a type of relational join that automatically joins two relations (tables) based on
columns with the same name and values. For example, given two tables:
● Employees(emp_id, name, dept_id)
● Departments(dept_id, dept_name)
A natural join would combine records from both tables where the dept_id matches in both.
Natural Join in MapReduce:
To implement a natural join in MapReduce, the idea is to use the common join key (e.g.,
dept_id) to group records from both tables that need to be joined.
● Map Phase:
○ Each mapper reads records from both datasets (e.g., Employees and
Departments).
○ For each record, the mapper emits a key-value pair where the key is the join
attribute (e.g., dept_id), and the value is the record (either from the Employees
table or the Departments table), prefixed with a tag indicating which dataset it
belongs to.
● Shuffle and Sort:
○ During the shuffle phase, records from both datasets are grouped by the join key
(e.g., dept_id), so that all records with the same key are sent to the same
reducer.
● Reduce Phase:
○ The reducer receives all records with the same join key (e.g., all records with
dept_id = 1).
○ It joins records from the two datasets based on the common join key.
Example:
Let’s say we have two datasets:
Employees:
emp_id, name, dept_id
1, Alice, 101
2, Bob, 102
3, Carol, 101

Departments:
dept_id, dept_name
101, HR
102, Engineering
We want to perform a natural join on dept_id.
Map Output: The mapper reads records from both datasets and emits key-value pairs based on
dept_id:
101 ("Employees", 1, Alice, 101)
102 ("Employees", 2, Bob, 102)
101 ("Employees", 3, Carol, 101)
101 ("Departments", 101, HR)
102 ("Departments", 102, Engineering)
Shuffle and Sort: The shuffle phase groups records by dept_id:
101: [("Employees", 1, Alice, 101), ("Employees", 3, Carol, 101), ("Departments", 101, HR)]
102: [("Employees", 2, Bob, 102), ("Departments", 102, Engineering)]
● Reduce Phase: The reducer joins the records:
○ For dept_id = 101, it joins Alice and Carol with the HR department.
○ For dept_id = 102, it joins Bob with the Engineering department.
Reduce Output:
(1, Alice, HR)
(3, Carol, HR)
(2, Bob, Engineering)
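A reduce-side join sketch for the example above, assuming the mapper can tell which table a record came from (here crudely, by its number of fields) and tags it accordingly:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NaturalJoin {
    // Tag each record with its source and key it by dept_id.
    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            if (f.length == 3) {          // Employees: emp_id, name, dept_id
                context.write(new Text(f[2].trim()),
                              new Text("E," + f[0].trim() + ", " + f[1].trim()));
            } else if (f.length == 2) {   // Departments: dept_id, dept_name
                context.write(new Text(f[0].trim()), new Text("D," + f[1].trim()));
            }
        }
    }

    // For each dept_id, pair every employee with the department name.
    public static class JoinReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> employees = new ArrayList<>();
            String deptName = null;       // stays null if no matching department record exists
            for (Text v : values) {
                String[] f = v.toString().split(",", 2);
                if (f[0].equals("E")) employees.add(f[1].trim()); else deptName = f[1].trim();
            }
            for (String emp : employees) {
                context.write(new Text("(" + emp + ", " + deptName + ")"), NullWritable.get());
            }
        }
    }
}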

18. Explain the Hadoop ecosystem with core components. Explain its architecture.
Ans. Hadoop Ecosystem is a platform or a suite which provides various services to solve the big
data problems. It includes Apache projects and various commercial tools and solutions. There
are four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and Hadoop Common
Utilities. Most of the tools or solutions are used to supplement or support these major
elements. All these tools work collectively to provide services such as ingestion (absorption), analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query based processing of data services
● HBase: NoSQL Database
● Mahout, Spark MLLib: Machine Learning algorithm libraries
● Solr, Lucene: Searching and Indexing
● Zookeeper: Managing cluster
● Oozie: Job Scheduling

The Hadoop ecosystem architecture refers to the organization of the various tools and
components within the Hadoop ecosystem that enable the efficient processing, storage, and
management of large datasets in a distributed computing environment. The architecture is
designed to handle Big Data workloads across clusters of commodity hardware and is scalable,
fault-tolerant, and highly distributed.
Layers of the Hadoop Ecosystem Architecture
1. Storage Layer: Hadoop Distributed File System (HDFS)
HDFS is the foundation of the Hadoop ecosystem for distributed storage, designed to store large
amounts of data reliably across multiple nodes in a cluster. It splits large files into smaller blocks
and stores them across different nodes.
Components of HDFS:
● NameNode:
○ The master node responsible for managing the file system metadata (e.g., the
directory structure, file-to-block mappings).
○ Tracks which DataNode stores which blocks of data.
○ Handles replication of data blocks to ensure fault tolerance (default replication
factor is 3).
● DataNodes:
○ The worker nodes that store the actual data in blocks.
○ DataNodes regularly communicate with the NameNode to send block reports
and health checks.
○ Each block is replicated across multiple DataNodes to ensure high availability and
fault tolerance.
● Secondary/Standby NameNode:
○ The Secondary NameNode periodically merges the NameNode’s edit log with the
file-system image (checkpointing) so that metadata can be recovered quickly; in
high-availability setups, a Standby NameNode keeps an up-to-date copy of the
metadata and can take over if the active NameNode fails.

2. Resource Management Layer: YARN (Yet Another Resource Negotiator)
YARN is Hadoop's resource manager, responsible for managing and scheduling resources across
the cluster. It enables multiple data processing frameworks to share the same cluster resources
and provides a flexible architecture for running various applications.
Components of YARN:
● ResourceManager (RM):
○ The master daemon responsible for allocating system resources (e.g., CPU,
memory) to various applications running in the cluster.
○ It schedules resources across all nodes and works with NodeManagers to execute
jobs.
● NodeManager (NM):
○ A worker daemon running on each node responsible for managing containers
(units of resources such as CPU and memory).
○ It monitors the health of the node and ensures that containers are launched as
per the ResourceManager's instructions.
● ApplicationMaster (AM):
○ The per-application manager responsible for negotiating resources with the
ResourceManager and working with NodeManagers to execute tasks.
○ Each job (e.g., MapReduce job, Spark job) has its own ApplicationMaster instance
that handles the lifecycle of that job.
● Containers:
○ Resource units (CPU and memory) allocated by YARN to run tasks of applications.
Each task (e.g., a map task, reduce task, or Spark executor) runs in a container.
3. Data Processing Layer
The data processing layer consists of various frameworks and tools that run on top of YARN to
process the data stored in HDFS or other storage systems. This layer includes the processing
engines that execute jobs, analytics, and queries.
Frameworks for Data Processing:
● MapReduce:
○ A distributed data processing framework that works in two stages:
■ Map phase: Transforms input data into key-value pairs.
■ Reduce phase: Aggregates and processes the key-value pairs generated
by the map phase.
○ MapReduce jobs are scheduled and executed using YARN.
● Apache Spark:
○ An in-memory processing framework that offers high-speed data processing
capabilities. Spark can run batch processing, real-time data streaming, machine
learning, and graph processing.
○ Spark jobs are executed in a distributed manner using YARN for resource
management.
● Apache Tez:
○ A DAG (Directed Acyclic Graph)-based execution framework designed to improve
the performance of MapReduce for complex data processing pipelines.
○ Tez is used by tools like Hive and Pig to optimize query execution.
● Apache Flink:
○ A real-time data stream processing engine that can also perform batch
processing.
○ Flink can integrate with YARN for scalable processing of continuous data streams.
4. Data Access and Query Layer
This layer includes tools that allow users to query and access data stored in HDFS or other
databases within the Hadoop ecosystem.
Key Tools for Data Access:
● Apache Hive:
○ A data warehouse infrastructure built on top of Hadoop, allowing SQL-like
queries (HiveQL) on large datasets.
○ Hive converts HiveQL queries into MapReduce, Tez, or Spark jobs to run them on
Hadoop.
● Apache Pig:
○ A high-level platform for writing data transformation scripts. Pig uses a simple
scripting language (Pig Latin) to process and analyze data in Hadoop.
○ Pig scripts are converted into MapReduce or Tez jobs.
● Apache HBase:
○ A distributed, NoSQL database that runs on top of HDFS, allowing for real-time
read/write access to large datasets.
○ HBase is suitable for storing sparse datasets with high-volume, low-latency
access requirements.
● Presto:
○ A distributed SQL query engine that allows querying data from a variety of
sources (HDFS, NoSQL databases, RDBMS, etc.) using standard SQL.
5. Data Ingestion Layer
This layer deals with how data is brought into the Hadoop ecosystem from external systems like
relational databases, logs, sensors, and real-time streams.
Data Ingestion Tools:
● Apache Flume:
○ A distributed service for ingesting large amounts of log data into Hadoop. It can
collect data from various sources (e.g., web servers, application logs) and push it
into HDFS or HBase.
● Apache Sqoop:
○ A tool for efficiently importing and exporting bulk data between Hadoop (HDFS,
Hive, HBase) and relational databases (e.g., MySQL, PostgreSQL).
● Kafka:
○ A distributed messaging system that is widely used for collecting and streaming
real-time data. Kafka integrates with Hadoop through tools like Spark Streaming
or Flink for real-time analytics.
6. Workflow Management Layer
This layer provides tools to manage, schedule, and coordinate workflows and data pipelines
within the Hadoop ecosystem.
Key Tools for Workflow Management:
● Apache Oozie:
○ A workflow scheduler that helps in managing Hadoop jobs (e.g., MapReduce,
Spark, Hive). It allows users to define complex workflows and trigger jobs based
on time or data availability.
● Apache Airflow:
○ An open-source tool for creating, scheduling, and monitoring complex
workflows. Airflow is used for data engineering and orchestration of ETL
processes in Hadoop and other big data environments.
7. Monitoring and Security Layer
This layer ensures that Hadoop clusters are monitored for performance, availability, and
security.
Monitoring Tools:
● Ambari: Provides a web interface to manage, configure, and monitor Hadoop clusters. It
tracks the health of the cluster, provides metrics, and helps with provisioning.
● Ganglia and Nagios: Widely used for monitoring the performance and health of nodes in
a Hadoop cluster.
Security:
● Kerberos: A protocol used for securing authentication between nodes and users in a
Hadoop environment.
● Ranger: Provides a centralized security framework to define, enforce, and audit access
policies across Hadoop components like HDFS, Hive, and HBase.
● Knox: Secures the perimeter of Hadoop clusters, providing a gateway that protects
Hadoop services and ensures secure access.
19. What are the different frameworks that run under YARN? Discuss the various YARN
daemons.
Ans. YARN (Yet Another Resource Negotiator) is a key component of the Hadoop ecosystem,
responsible for managing resources in a Hadoop cluster and scheduling jobs. It provides a
platform to run various distributed processing frameworks, allowing for better resource
management and job scheduling in a more flexible and scalable manner.
Frameworks That Run Under YARN
YARN is designed to run many types of distributed frameworks and applications, not just
Hadoop’s native MapReduce.
a. MapReduce
● Description: Hadoop's original processing model, which divides tasks into mapping and
reducing steps.
● Use Case: Data processing tasks such as filtering, aggregation, and data transformation.
b. Apache Spark
● Description: A fast, in-memory data processing engine that supports batch, real-time
streaming, and iterative algorithms.
● Use Case: Machine learning, real-time analytics, and large-scale data transformations.
● Integration with YARN: Spark can run on YARN by using Spark's built-in YARN support for
resource allocation and task scheduling.
c. Apache Tez
● Description: A more flexible and efficient framework than MapReduce, designed to
handle complex data processing pipelines.
● Use Case: Interactive query processing and graph-based data processing workflows.
● Integration with YARN: Tez is integrated with YARN for scheduling and resource
management and is often used by higher-level frameworks like Apache Hive and Pig.
d. Apache Flink
● Description: A real-time stream processing engine and also supports batch processing.
● Use Case: Continuous data processing with event-time semantics, complex event
processing, and real-time analytics.
● Integration with YARN: Flink can run on YARN, allowing it to scale easily across a Hadoop
cluster.
e. Apache HBase
● Description: A distributed NoSQL database that supports real-time read/write access to
large datasets.
● Use Case: Low-latency access to large-scale data for applications like social media,
finance, and telecommunications.
● Integration with YARN: HBase tasks (region servers) can be managed using YARN for
resource allocation.
YARN Daemons
YARN has several daemons that manage resource allocation, job scheduling, and task execution
across the cluster. These daemons are responsible for ensuring that jobs run efficiently and that
resources are used optimally.
The Apache YARN framework consists of a master daemon known as the ResourceManager, a
worker (slave) daemon called the NodeManager (one per worker node), and an
ApplicationMaster (one per application).
a. ResourceManager
● Role: The ResourceManager (RM) is the master daemon that manages all resources in
the YARN cluster.
● Key Responsibilities:
○ Accepts job submissions from clients.
○ Allocates resources to jobs across the cluster.
○ Manages the lifecycle of applications running in the cluster.
○ Works with the NodeManagers to track the health of worker nodes.
○ Schedules resources using various scheduling policies (e.g., FIFO, Fair Scheduler,
Capacity Scheduler).
b. NodeManager
● Role: The NodeManager (NM) is responsible for managing resources and tasks on a
single worker node in the cluster.
● Key Responsibilities:
○ Monitors resource usage (CPU, memory) on each node.
○ Launches and manages containers (the units of computation) where application
tasks run.
○ Reports resource availability and status to the ResourceManager.
○ Oversees the health of the node, including handling resource isolation and
cleanup.
c. ApplicationMaster
● Role: The ApplicationMaster (AM) is responsible for managing the lifecycle of an
individual YARN application (e.g., a MapReduce job or a Spark job).
● Key Responsibilities:
○ Negotiates resources with the ResourceManager for the application.
○ Tracks the execution of tasks (containers) and manages their failure or retry.
○ Acts as the orchestrator for the application, determining how tasks should be
scheduled and run.
○ Typically, each application has its own dedicated ApplicationMaster.
20. What is a Map Reduce Combiner? Write advantages and disadvantages of Map
Reduce Combiner?
Ans. A MapReduce Combiner, also called a semi-reducer, is an optional class that takes its input
from the Mapper (Map class) and passes its key-value output on to the Reducer (Reduce class).
The main function of a combiner is to aggregate (for example, sum up) the map output records
that share the same key. The combined key-value output is then dispatched over the network to
the Reducer as its input. The combiner class is placed between the map class and the reduce
class to decrease the volume of data transferred between them, because the map output is
usually large and transferring all of it to the reduce tasks is expensive.
How does MapReduce Combiner work?
A brief summary of how the MapReduce Combiner works:
The combiner does not have a predefined interface of its own; it must implement the Reducer
interface's reduce() method. The combiner operates on each key of the map output, and it must
process values with the same key in the same way the Reducer class would. Because it replaces
the raw map output with locally aggregated records, the combiner can produce compact summary
information even for a huge dataset. Without it, a MapReduce job on a large dataset generates a
huge amount of intermediate data in the map class, and sending all of it to the reducers for later
processing leads to heavy network congestion.
MapReduce program outline without the combiner:
No combiner is used in this case. The input is split between two mappers, and the mappers
generate 9 keys in total. These 9 intermediate key-value pairs are sent directly from the mappers
to the reduce class. Dispatching the data to the reducer consumes network bandwidth (the time
taken to transfer data from one machine to another), and the transfer time increases significantly
when the data is large.
When a combiner is placed between the mapper and the reducer, the intermediate data is
aggregated locally before being shuffled to the reducer, so only 4 key-value pairs are dispatched
instead of 9.
MapReduce program outline with the combiner:
The reducer now processes only 4 key-value pairs, which it receives as input from the 2
combiners. The reduce function is therefore executed only 4 times to produce the final output,
which boosts overall performance.
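As a sketch of where the combiner is plugged in, the driver below configures a word-count job so that the reducer class also runs as the combiner (summing counts is associative and commutative, so this is safe). WordCountMapper and IntSumReducer are assumed class names; they match the mapper and reducer sketches given under the mapper-and-reducer question later in this question bank.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // emits (word, 1) pairs
        job.setCombinerClass(IntSumReducer.class);     // local, per-mapper aggregation
        job.setReducerClass(IntSumReducer.class);      // global aggregation

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}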
Advantages of MapReduce Combiner
1. Reduces Data Transfer: Minimizes the amount of data sent from Mapper to Reducer,
saving network bandwidth.
2. Improves Performance: Speeds up job execution by reducing the volume of
intermediate data.
3. Optimizes Resources: Lowers I/O and network overhead.
4. Enhances Scalability: Allows jobs to handle larger datasets more efficiently.
Disadvantages of MapReduce Combiner
1. Not Always Usable: Only works with associative and commutative operations.
2. No Execution Guarantee: Hadoop may skip the Combiner even if specified.
3. Adds Complexity: Requires careful implementation to ensure correct results.
4. Local Scope: Only reduces data locally, still requires a global reduction at the Reducer.
21. What is the role of a record reader in HADOOP?
Ans. In Hadoop, the RecordReader plays a crucial role in converting raw input data into
key-value pairs that can be processed by the Mapper. It operates during the data input phase of
a MapReduce job, working alongside the InputFormat to handle input data.
Role of a RecordReader in Hadoop:
1. Converts Input Splits to Key-Value Pairs:
○ The RecordReader is responsible for reading the data from the input split (a
chunk of the input data) and converting it into a format that the Mapper can
process, typically key-value pairs.
2. Works with InputFormat:
○ The InputFormat divides the input data into logical InputSplits, but it is the
RecordReader that reads each split and generates records (key-value pairs).
○ For example, in the case of a text file, the key might be the byte offset of the line,
and the value could be the contents of the line.
3. Reads Data Efficiently:
○ The RecordReader reads data from the underlying storage (e.g., HDFS) efficiently
and ensures that the Mapper gets one record at a time.
○ It abstracts the complexity of reading from different formats (e.g., text, sequence
files) and provides a seamless interface for the Mapper.
4. Handles Different Data Formats:
○ Depending on the InputFormat being used, different types of RecordReaders are
employed. For instance:
■ TextInputFormat: Uses a LineRecordReader to read each line of a text file
as a record.
■ KeyValueTextInputFormat: Splits each line into a key-value pair based on
a delimiter.
Key Workflow of RecordReader:
1. InputFormat divides input data into splits.
2. RecordReader processes each split and converts it into key-value pairs.
3. The Mapper processes these key-value pairs.
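As a small illustration (an assumed helper, not from the original text), the driver snippet below shows how the InputFormat chosen for a job determines which RecordReader runs and therefore what key-value types the Mapper receives.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatChoice {
    // Choosing the InputFormat also chooses the RecordReader and the Mapper's input types.
    static void configure(Job job, boolean keyValueInput) {
        if (keyValueInput) {
            // KeyValueTextInputFormat splits each line on a separator (tab by default),
            // so the Mapper receives <Text key, Text value>.
            job.setInputFormatClass(KeyValueTextInputFormat.class);
        } else {
            // TextInputFormat (the default) uses LineRecordReader, which emits
            // <LongWritable byte offset, Text line contents>.
            job.setInputFormatClass(TextInputFormat.class);
        }
    }
}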
22. Write a short note on master & slave V/s peer to peer
Ans. 1. Master-Slave Architecture:
In a master-slave architecture, there is a distinct hierarchy where one node acts as the master
and the other nodes function as slaves.
● Master:
○ Controls and coordinates the system.
○ Assigns tasks to slave nodes, manages resources, and maintains metadata.
○ Example: In Hadoop, the NameNode is the master that manages metadata, and
DataNodes are the slaves that store actual data.
● Slave:
○ Performs tasks assigned by the master.
○ Does not have autonomy to make decisions on its own.
○ Example: In Hadoop, the TaskTracker and DataNodes are slaves.
Advantages:
● Centralized control simplifies management and coordination.
● Efficient for distributed task assignment and resource management.
Disadvantages:
● Single Point of Failure: If the master fails, the entire system can be compromised.
● Scalability can be limited by the master’s capacity.
2. Peer-to-Peer (P2P) Architecture:
In a peer-to-peer (P2P) architecture, all nodes are equal peers with no designated master or
slave. Each node can function both as a client and a server.
● Peers:
○ Share resources and workloads equally.
○ Communicate and cooperate directly without centralized control.
○ Example: BitTorrent file-sharing network is a common P2P system where each
peer shares and downloads files from others.
Advantages:
● No Single Point of Failure: Since there is no master, the system is more resilient to node
failures.
● Highly scalable and decentralized.
Disadvantages:
● Coordination is more complex due to the lack of a central authority.
● Performance may degrade if peers are unreliable or underperforming.
Comparison:
Feature         | Master-Slave                      | Peer-to-Peer (P2P)
Control         | Centralized (master controls)     | Decentralized (all nodes equal)
Scalability     | Limited by master’s capacity      | Highly scalable
Fault Tolerance | Single point of failure (master)  | No single point of failure
Coordination    | Easier, master handles tasks      | Complex, no central control
Examples        | Hadoop (NameNode/DataNode)        | BitTorrent, Blockchain
23. Write a short note on mapper task and reducer task
Ans. Mapper Task in Hadoop MapReduce
The Mapper is the first phase of a MapReduce job. It processes input data and produces
intermediate key-value pairs, which are then passed to the Reducer for further processing. The
Mapper works on individual splits of data from the input source (e.g., HDFS).
Steps in the Mapper Task:
1. Input Splitting:
○ The input data is split into chunks called InputSplits, and each split is processed
by a separate Mapper task.
2. RecordReader:
○ The RecordReader reads each record from the split and converts it into key-value
pairs that the Mapper can process.
3. Map Function:
○ The core logic is written in the map() function, which processes each input
key-value pair and produces zero or more output key-value pairs.
○ Example: In a word count program, the input might be lines of text, and the
Mapper emits (word, 1) pairs for each word.
4. Intermediate Data:
○ The Mapper outputs intermediate key-value pairs that are stored locally before
being shuffled to the Reducer.
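A minimal sketch of such a Mapper, using the usual word-count example (the class and variable names are illustrative): the input key is the byte offset supplied by the RecordReader, the value is one line of text, and the mapper emits a (word, 1) pair for every token.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Tokenize the line and emit an intermediate (word, 1) pair per token.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}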
Reducer Task in Hadoop MapReduce
The Reducer is the second phase of a MapReduce job. After the intermediate data is shuffled
and sorted, the Reducer processes the grouped key-value pairs to produce the final output.
Steps in the Reducer Task:
1. Shuffle and Sort:
○ The intermediate data from all Mapper tasks is shuffled across the network to
the appropriate Reducer nodes.
○ The data is also sorted by key, so that all values for a given key are grouped
together.
2. Reduce Function:
○ The core logic of the Reducer is in the reduce() function, which processes each
key along with its associated values. It aggregates or summarizes the values to
produce the final output.
○ Example: In a word count program, the Reducer sums the values for each word
to get the total count.
3. Final Output:
○ The Reducer writes the final results to an output file in HDFS or another storage
system.
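The matching Reducer sketch for the word-count example: after the shuffle and sort, all counts for one word arrive together, and reduce() sums them into the final total.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // All counts for this word arrive together after the shuffle and sort.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);   // final (word, total) pair written to the output
    }
}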
24. List and explain types of NO SQL Databases with examples?
Ans. A database is a collection of structured data or information which is stored in a computer
system and can be accessed easily. A database is usually managed by a Database Management
System (DBMS).
NoSQL is a non-relational database used to store data in non-tabular form. NoSQL stands for
"Not only SQL". The main types are document, key-value, wide-column, and graph databases.
Types of NoSQL Database:
● Document-based databases
● Key-value stores
● Column-oriented databases
● Graph-based databases
Document-Based Database:
● The document-based database is a nonrelational database. Instead of storing the data in rows
and columns (tables), it uses the documents to store the data in the database. A document
database stores data in JSON, BSON, or XML documents.
● Documents can be stored and retrieved in a form that is much closer to the data objects used in
applications which means less translation is required to use these data in the applications. In
the Document database, the particular elements can be accessed by using the index value that
is assigned for faster querying.
● Collections are groups of documents with similar contents. Documents in a collection do not all
have to follow the same schema, because document databases have a flexible schema.
Key features of document databases:
● Flexible schema: Documents in the database have a flexible schema, which means documents
in the same database need not share the same schema.
● Faster creation and maintenance: The creation of documents is easy, and minimal
maintenance is required once a document is created.
● No foreign keys: There is no dynamic relationship between two documents so documents
can be independent of one another. So, there is no requirement for a foreign key in a
document database.
● Open formats: To build a document we use XML, JSON, and others.
Key-Value Stores:
A key-value store is a nonrelational database. The simplest form of a NoSQL database is a
key-value store. Every data element in the database is stored in key-value pairs. The data can be
retrieved by using a unique key allotted to each element in the database. The values can be
simple data types like strings and numbers or complex objects.
A key-value store is like a relational database with only two columns: the key and the value.
Key features of the key-value store:
● Simplicity.
● Scalability.
● Speed.
Column-Oriented Databases:
A column-oriented database is a non-relational database that stores data in columns instead of
rows. This means that when analytics needs only a small number of columns, those columns can
be read directly without loading unwanted data into memory.
Columnar databases are designed to read and retrieve data more efficiently and with greater
speed, and they are used to store large amounts of data.
Key features of a column-oriented database:
● Scalability.
● Compression.
● Very responsive.
Graph-Based databases:
Graph-based databases focus on the relationship between the elements. It stores the data in
the form of nodes in the database. The connections between the nodes are called links or
relationships.
Key features of graph database:
● In a graph-based database, it is easy to identify the relationship between the data by
using the links.
● Query results are available in real time.
● The speed depends upon the number of relationships among the database
elements.
● Updating data is also easy, as adding a new node or edge to a graph database is a
straightforward task that does not require significant schema changes.
25. What is the CAP Theorem?
Ans. The CAP theorem is a fundamental concept in distributed systems theory. It was first
proposed by Eric Brewer in 2000 and subsequently proved by Seth Gilbert and Nancy Lynch in
2002. It asserts that no distributed data system can concurrently guarantee all three of the
following qualities:
1. Consistency
Consistency means that all the nodes (databases) inside a network will have the same copies of
a replicated data item visible for various transactions. It guarantees that every node in a
distributed cluster returns the same, most recent, and successful write. It refers to every client
having the same view of the data. There are various types of consistency models. Consistency in
CAP refers to sequential consistency, a very strong form of consistency.
Note that the concept of Consistency in ACID and CAP are slightly different since in CAP, it refers
to the consistency of the values in different copies of the same data item in a replicated
distributed system. In ACID, it refers to the fact that a transaction will not violate the integrity
constraints specified on the database schema.
For example, a user checks his account balance and knows that he has 500 rupees. He spends
200 rupees on some products. Hence the amount of 200 must be deducted changing his
account balance to 300 rupees. This change must be committed and communicated with all
other databases that hold this user’s details. Otherwise, there will be inconsistency, and the
other database might show his account balance as 500 rupees which is not true.
Figure – Consistency problem
2. Availability
Availability means that each read or write request for a data item will either be processed
successfully or will receive a message that the operation cannot be completed. Every non-failing
node returns a response for all the read and write requests in a reasonable amount of time. The
key word here is “every”. In simple terms, every node (on either side of a network partition)
must be able to respond in a reasonable amount of time.
For example, user A is a content creator with 1000 other users subscribed to his channel.
Another user B, who is far away from user A, tries to subscribe to user A’s channel. Since the
distance between the two users is large, they are connected to different database nodes of the
social media network. If the distributed system follows the principle of availability, user B must
be able to subscribe to user A’s channel.
Figure – Availability problem
3. Partition Tolerance
Partition tolerance means that the system can continue operating even if the network
connecting the nodes has a fault that results in two or more partitions, where the nodes in each
partition can only communicate among each other. That means, the system continues to
function and upholds its consistency guarantees in spite of network partitions. Network
partitions are a fact of life. Distributed systems guaranteeing partition tolerance can gracefully
recover from partitions once the partition heals.
For example, take the same social media network, where two users are trying to find the
subscriber count of a particular channel. Due to a technical fault there is a network outage, and
the second database node used by user B loses its connection to the first database. The
subscriber count is still shown to user B using a replica of the data that was copied from
database 1 before the outage. Hence the distributed system is partition tolerant.
Figure – Partition Tolerance
The CAP theorem states that distributed databases can have at most two of the three
properties: consistency, availability, and partition tolerance. As a result, database systems
prioritize only two properties at a time.
Figure – Venn diagram of CAP theorem
26. Explain BASE Properties of NoSQL Database.
Ans. BASE stands for Basically Available, Soft state, and Eventually consistent. The acronym
highlights that BASE is the opposite of ACID, like their chemical equivalents.
Basically available
Basically available is the database’s concurrent accessibility by users at all times. One user doesn’t
need to wait for others to finish the transaction before updating the record. For example, during a
sudden surge in traffic on an ecommerce platform, the system may prioritize serving product
listings and accepting orders. Even if there is a slight delay in updating inventory quantities, users
continue to check out items.
Soft state
Soft state refers to the notion that data can have transient or temporary states that may change
over time, even without external triggers or inputs. It describes the record’s transitional state
when several applications update it simultaneously. The record’s value is eventually finalized only
after all transactions complete. For example, if a user edits a social media post, the change may not
be visible to other users immediately. However, later on, the post updates by itself to reflect the
edit, even though no user triggered the update.
Eventually consistent
Eventually consistent means the record will achieve consistency when all the concurrent updates
have been completed. At this point, applications querying the record will see the same value. For
example, consider a distributed document editing system where multiple users can simultaneously
edit a document. If User A and User B both edit the same section of the document simultaneously,
their local copies may temporarily differ until the changes are propagated and synchronized.
However, over time, the system ensures eventual consistency by propagating and merging the
changes made by different users.
27. Discuss the different architecture patterns of NoSQL.
Ans. Architecture Pattern is a logical way of categorizing data that will be stored on the
Database. NoSQL is a type of database which helps to perform operations on big data and store
it in a valid format. It is widely used because of its flexibility and a wide variety of services.
Architecture Patterns of NoSQL:
The data is stored in NoSQL in any of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
These are explained below.
1. Key-Value Store Database:
This model is one of the most basic models of NoSQL databases. As the name suggests, the data
is stored in the form of Key-Value Pairs. The key is usually a sequence of strings, integers or
characters but can also be a more advanced data type. The value is typically linked or correlated
to the key. The key-value pair storage databases generally store data as a hash table where each
key is unique. The value can be of any type (JSON, BLOB(Binary Large Object), strings, etc). This
type of pattern is usually used in shopping websites or e-commerce applications.
Advantages:
● Can handle large amounts of data and heavy load,
● Easy retrieval of data by keys.
Limitations:
● Complex queries may need to access many key-value pairs, which can slow performance.
● Data involving many-to-many relationships is difficult to represent and may lead to conflicts.
Examples:
● DynamoDB
● Berkeley DB
2. Column Store Database:
Rather than storing data in relational tuples, the data is stored in individual cells that are further
grouped into columns. Column-oriented databases operate on columns: they store large amounts
of data together, column by column. The format and names of the columns can differ from one
row to another, and every column is treated separately. Still, an individual column family may
contain multiple related columns, much as tables do in traditional databases.
Basically, columns are the unit of storage in this type.
Advantages:
● Data is readily available
● Queries like SUM, AVERAGE, COUNT can be easily performed on columns.
Examples:
● HBase
● Bigtable by Google
● Cassandra
3. Document Database:
The document database fetches and stores data in the form of key-value pairs, but here the
values are called documents. A document is a complex data structure and can take the form of
text, arrays, strings, JSON, XML, or any similar format. The use of nested documents is also very
common. This pattern is very effective because most of the data created today is unstructured
and usually in the form of JSON.
Advantages:
● This type of format is very useful and apt for semi-structured data.
● Storage retrieval and managing of documents is easy.
Limitations:
● Handling relationships across multiple documents is challenging.
● Aggregation operations across documents can be complex and may not work as expected.
Examples:
● MongoDB
● CouchDB
Figure – Document Store Model in form of JSON documents
4. Graph Databases:
Clearly, this architecture pattern deals with the storage and management of data in graphs.
Graphs are basically structures that depict connections between two or more objects in some
data. The objects or entities are called nodes and are joined together by relationships called
Edges. Each edge has a unique identifier. Each node serves as a point of contact for the graph.
This pattern is very commonly used in social networks where there are a large number of
entities and each entity has one or many characteristics which are connected by edges. The
relational database pattern has tables that are loosely connected, whereas graphs are often
very strong and rigid in nature.
Advantages:
● Fastest traversal because of connections.
● Spatial data can be easily handled.
Limitations:
● Wrong connections may lead to infinite loops.
Examples:
● Neo4J
● FlockDB (used by Twitter)
Figure – Graph model format of NoSQL Databases

28. Explain distribution models Master Slave and Peer to peer with the help of
diagram.
Ans. 1. Master-Slave Architecture
In the master-slave architecture, there is a distinct hierarchy with a single master node and
multiple slave nodes. The master node is responsible for controlling the system and distributing
tasks, while the slave nodes perform the tasks assigned by the master.
Key Characteristics:
● Master:
○ Central controller that assigns tasks to slave nodes and manages resources.
○ Handles coordination and decision-making.
○ Example: In Hadoop, the NameNode is the master, and DataNodes are slaves.
● Slaves:
○ Execute tasks assigned by the master.
○ Perform data storage or processing but rely on the master for instructions.
○ Example: In Hadoop, DataNodes store data, and TaskTrackers run jobs.
Advantages:
● Centralized control simplifies management and coordination.
● Easier to monitor and manage task distribution.
Disadvantages:
● Single point of failure: If the master fails, the whole system may fail.
● Scalability is limited by the master node’s capacity.
2. Peer-to-Peer (P2P) Architecture
In the peer-to-peer (P2P) architecture, all nodes are equal peers, meaning that each node can
act as both a client and a server. There is no centralized control, and each node can initiate
communication, share resources, or participate in the task execution.
Key Characteristics:
● Peers:
○ Each peer is both a client and a server, participating in both sharing and receiving
resources.
○ Communication and tasks are distributed equally among peers.
○ Example: BitTorrent or Blockchain, where all nodes share and request data from
each other without a central coordinator.
Advantages:
● No single point of failure: The system remains functional even if some peers fail.
● Highly scalable, as new peers can join the network easily without a central coordinator.
Disadvantages:
● Coordination complexity: Lack of a central authority can make coordination and task
distribution more difficult.
● Performance depends on the reliability and capacity of individual peers.
29. What are the various applications of NoSQL in industry?
Ans. 1. Session Store
● Managing session data using relational databases is very difficult, especially when
applications grow very large.
● In such cases the right approach is to use a global session store, which manages
session information for every user who visits the site.
● NoSQL is suitable for storing such web application session information, which can
be very large in size.
● Since session data is unstructured in form, it is easy to store it in schema-less
documents rather than in relational database records.
2. User Profile Store
● To enable online transactions, user preferences, authentication of users and more,
it is required to store the user profile by web and mobile application.
● In recent times, the number of users of web and mobile applications has grown very
rapidly. A relational database cannot handle such a large and rapidly growing volume
of user profile data, as it is limited to a single server.
● Using NoSQL, capacity can easily be increased by adding servers, which makes
scaling cost-effective.
3. Content and Metadata Store
● Many companies, such as publishing houses, require a place where they can store
large amounts of data, including articles, digital content, and e-books, in order to
merge various learning tools into a single platform.
● In content-based applications, metadata is accessed very frequently and needs low
response times.
● For building content-based applications, NoSQL provides flexibility, faster access to
data, and the ability to store different types of content.
4. Mobile Applications
● Since smartphone users are increasing very rapidly, mobile applications face
problems related to growth and volume.
● Using a NoSQL database, mobile application development can start small and be
easily expanded as the number of users increases, which is very difficult with
relational databases.
● Since NoSQL databases are schema-less, application developers can update their
apps without making major modifications to the database.
● Mobile app companies like Kobo and Playtika use NoSQL to serve millions of
users across the world.
5. Third-Party Data Aggregation
● Frequently a business requires access to data produced by a third party. For
instance, a consumer packaged goods company may need to obtain sales data from
stores as well as shoppers’ purchase histories.
● In such scenarios, NoSQL databases are suitable, since NoSQL databases can
manage huge amounts of data which is generated at high speed from various data
sources.
6. Internet of Things
● Today, billions of devices are connected to the internet, such as smartphones,
tablets, home appliances, systems installed in hospitals, cars, and warehouses.
These devices generate a large and continuously growing volume and variety of
data.
● Relational databases are unable to store such data efficiently. NoSQL allows
organizations to scale concurrent access to data from billions of connected devices
and systems, store huge amounts of data, and meet the required performance.
7. E-Commerce
● E-commerce companies use NoSQL to store huge volumes of data and to handle a
large number of requests from users.
30. What are the benefits of HBase over other NoSQL databases?
Ans. HBase, a distributed, scalable NoSQL database that runs on top of Hadoop, offers several
advantages over other NoSQL databases. Its tight integration with the Hadoop ecosystem and
certain unique features make it a popular choice for certain big data applications.
Benefits of HBase over Other NoSQL Databases:
1. Tight Integration with Hadoop
● HBase is built to work seamlessly with Hadoop's HDFS (Hadoop Distributed File System).
This makes it ideal for handling large datasets that are stored on HDFS, allowing
real-time read/write access to massive amounts of data stored in Hadoop.
● It also integrates with other Hadoop ecosystem components like MapReduce, Hive, Pig,
and Spark for batch processing and querying large datasets.
2. Efficient for Random Read/Write Operations
● Unlike Hadoop's batch-processing nature, HBase is designed for real-time read/write
operations on large datasets.
● It excels at handling random, sparse, and non-sequential data access, which many other
NoSQL databases might struggle with at such large scale.
3. Strong Consistency
● HBase provides strong consistency for reads and writes, meaning once data is written to
HBase, it is immediately visible to subsequent read operations.
● This is in contrast to some NoSQL databases (like Cassandra), which prioritize availability
over consistency in distributed environments and may allow stale reads.
4. Column-Oriented Storage
● HBase is a column-family-oriented database, where data is stored in columns rather
than rows. This design allows for very flexible data storage, especially for wide and
sparse datasets.
● Column families allow more efficient reads and writes by focusing only on specific data
sets (columns) rather than scanning entire rows, which can be an advantage over
traditional row-based NoSQL systems like MongoDB.
5. Scalability
● HBase is highly scalable both horizontally (by adding more nodes) and vertically (by
adding more memory or CPU to the nodes).
● Its underlying architecture, based on HDFS and distributed across a cluster, ensures that
HBase can handle large-scale datasets in a fault-tolerant manner.
6. Automatic Sharding (Region Splitting)
● HBase automatically shards tables into smaller units called regions. These regions are
distributed across the nodes of a cluster, which helps in balancing the load and
enhancing system performance.
● This automatic region splitting allows HBase to handle large datasets efficiently, without
requiring manual intervention to split and manage partitions.
7. Support for Big Data Analytics
● HBase, when combined with Hadoop’s batch processing capabilities, is excellent for big
data analytics.
● You can store massive amounts of data in HBase and process it using MapReduce or
Spark for large-scale computations.
HBase vs. Other NoSQL Databases:
Feature                 | HBase                                       | Other NoSQL Databases (e.g., MongoDB, Cassandra)
Integration with Hadoop | Deeply integrated (HDFS, MapReduce, etc.)   | Not as tightly integrated
Consistency             | Strong consistency (reads after writes)     | Cassandra: eventual consistency; MongoDB: immediate consistency but lacks HDFS integration
Read/Write Operations   | Efficient for random, real-time operations  | Varies (MongoDB: fast reads; Cassandra: write-heavy workloads)
Column-Oriented Storage | Column-family-based model                   | MongoDB: document-based; Cassandra: columnar (but more key-value based)
Scalability             | Horizontally scalable                       | Also scalable, but with different architectures
Automatic Sharding      | Automatic region splitting                  | Sharding is possible, but may require manual configuration
Big Data Analytics      | Suited for big data and analytics           | May need third-party integration for big data analytics
Data Versioning         | Supports data versioning                    | Limited or no versioning support in many other NoSQL systems
Transaction Support     | Row-level atomicity                         | Transaction support varies (MongoDB supports multi-document ACID transactions)
31. Draw and explain Hbase Architecture-Read, Write Mechanism.
Ans. HBase Write Mechanism
The write mechanism goes through the following process sequentially:
Step 1: Whenever the client has a write request, the client writes the data to the WAL (Write
Ahead Log).
● The edits are then appended at the end of the WAL file.
● This WAL file is maintained in every Region Server and Region Server uses it to recover
data which is not committed to the disk.
Step 2: Once data is written to the WAL, then it is copied to the MemStore.
Step 3: Once the data is placed in MemStore, then the client receives the acknowledgment.
Step 4: When the MemStore reaches the threshold, it dumps or commits the data into a HFile.
HBase Write Mechanism- MemStore
● The MemStore always updates the data stored in it, in a lexicographical order
(sequentially in a dictionary manner) as sorted KeyValues. There is one MemStore for
each column family, and thus the updates are stored in a sorted manner for each column
family.
● When the MemStore reaches the threshold, it dumps all the data into a new HFile in a
sorted manner. This HFile is stored in HDFS. HBase contains multiple HFiles for each
Column Family.
● Over time, the number of HFile grows as MemStore dumps the data.
● MemStore also saves the last written sequence number, so Master Server and
MemStore both know what is committed so far and where to start from. When the
region starts up, the last sequence number is read, and from that number, new edits
start.
HBase Architecture: HBase Write Mechanism- HFile
● The writes are placed sequentially on the disk. Therefore, the movement of the disk’s
read-write head is very less. This makes the write and search mechanism very fast.
● The HFile indexes are loaded in memory whenever a HFile is opened. This helps in
finding a record in a single seek.
● The trailer is a pointer which points to the HFile’s meta block . It is written at the end of
the committed file. It contains information about timestamp and bloom filters.
● Bloom Filter helps in searching key value pairs, it skips the file which does not contain
the required row key. Timestamp also helps in searching a version of the file, it helps in
skipping the data.
HBase Architecture: Read Mechanism
● For reading data, the scanner first looks for the row cell in the Block Cache, where all the
recently read key-value pairs are stored.
● If the scanner fails to find the required result there, it moves to the MemStore, which is the
write cache. There it searches for the most recently written data that has not yet been
flushed to an HFile.
● Finally, it uses bloom filters and the block cache to load the data from the HFiles.
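From the client's point of view, the write and read paths above are exercised with a simple Put and Get. The sketch below uses the standard HBase Java client API; the table and column names (Customer, personal_info, name) are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("Customer"))) {

            // Write path: the Put travels to the owning Region Server, where it is
            // appended to the WAL, buffered in the MemStore, and later flushed to an HFile.
            Put put = new Put(Bytes.toBytes("customer_123"));
            put.addColumn(Bytes.toBytes("personal_info"), Bytes.toBytes("name"),
                          Bytes.toBytes("John"));
            table.put(put);

            // Read path: the Region Server checks the Block Cache, then the MemStore,
            // then the HFiles (helped by bloom filters) to assemble the result.
            Get get = new Get(Bytes.toBytes("customer_123"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("personal_info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}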
32. Draw and Explain Hbase Architecture-Compaction and Region Split
Ans. HBase Architecture: Compaction
HBase combines HFiles to reduce the storage and reduce the number of disk seeks needed for a
read. This process is called compaction. Compaction chooses some HFiles from a region and
combines them. There are two types of compaction:
1. Minor Compaction: HBase automatically picks smaller HFiles and recommits them into
bigger HFiles. It performs a merge sort while committing the smaller HFiles into bigger
ones, which helps optimize storage space.
2. Major Compaction: In major compaction, HBase merges and recommits all the smaller
HFiles of a region into a single new HFile. In this process, the same column families are
placed together in the new HFile, and deleted and expired cells are dropped. This
increases read performance.
But during this process, input-output disks and network traffic might get congested. This is
known as write amplification. So, it is generally scheduled during low peak load timings.
HBase Architecture: Region Split
Whenever a region becomes too large, it is divided into two child regions. Each child region
represents exactly half of the parent region. The split is then reported to the HMaster. The child
regions are handled by the same Region Server until the HMaster allocates them to
new Region Server for load balancing.
33. Write a Short Note on Region Server
Ans. HBase Architecture: Region Server
● Region Servers are the worker nodes that handle read, write, update, and delete requests
from clients.
● A Region Server maintains the various regions running on top of HDFS.
● The Region Server process runs on every node in the Hadoop cluster, on the HDFS DataNodes.
● A region contains all the rows between the start key and the end key assigned to it.
● HBase tables can be divided into a number of regions in such a way that all the columns
of a column family are stored in one region.
● Each region contains the rows in a sorted order.
● Many regions are assigned to a Region Server, which is responsible for handling,
managing, executing reads and writes operations on that set of regions.
● So, concluding in a simpler way:
● A table can be divided into a number of regions.
● It is a sorted range of rows storing data between a start key and an end key.
● It has a default size of 256MB which can be configured according to the need.
● A Group of regions is served to the clients by a Region Server.
● A Region Server can serve approximately 1000 regions to the client.
A Region Server maintains various regions running on the top of HDFS. Components of a Region
Server are:
● WAL: The Write Ahead Log (WAL) is a file attached to every Region Server inside the
distributed environment. The WAL stores new data that hasn’t yet been persisted or
committed to permanent storage. It is used to recover the data sets in case of failure.
● Block Cache: The Block Cache resides at the top of the Region Server. It stores the
frequently read data in memory. If the data in the Block Cache is least recently used,
it is removed from the Block Cache.
● MemStore: The MemStore is the write cache. It stores all incoming data before committing it
to the disk or permanent memory. There is one MemStore for each column family in a
region, so a region has multiple MemStores when it contains multiple column families.
The data is sorted in lexicographical order before being committed to the disk.
● HFile: HFiles are stored on HDFS and hold the actual cells on the disk. The MemStore
commits its data to an HFile when the MemStore’s size exceeds its configured threshold.
34. What are the major components of HBase Data model? Explain each one in brief
Ans. HBase is a distributed, scalable, column-family-oriented NoSQL database built on top of
Hadoop's HDFS. It’s designed to manage vast amounts of data with minimal structure and
provides random, real-time read/write access to it. The HBase data model is highly flexible and
is inspired by Google’s BigTable, offering a few core components that make it powerful in
managing structured, semi-structured, and sparse data.
Major components of HBase Data model are:
1. Table
2. Row
3. Column Family
4. Column
5. Cell
6. Timestamp(Versioning)
7. Region
8. Regionserver
1. Table
● HBase Tables are similar to tables in relational databases, but there are significant
differences in how they are structured and managed.
● An HBase table is essentially a collection of rows and column families. Unlike relational
databases, where columns are predefined and static, HBase only requires column
families to be predefined, and individual columns can be added dynamically.
● Tables in HBase are designed to be sparse: rows can have different sets of columns
without overhead for empty cells.
Example: A table called Customer can contain two column families: personal_info and
order_info.
2. Row
● A Row in HBase represents an individual data entry. Each row is uniquely identified by a
row key.
● The row key is essentially the primary identifier of a record and is indexed for fast access.
Row keys are sorted lexicographically, meaning the order of the keys determines how
rows are distributed across the HBase cluster. This sort order is crucial for efficient data
access patterns.
● A row contains one or more columns, which are grouped under column families. HBase
allows rows to have different sets of columns, making it well-suited for sparse datasets.
Example: For a row in a Customer table with the row key customer_123, the row might store
the customer’s personal and order information.
Row Key: customer_123
personal_info: { name: "John", email: "[email protected]" }
order_info: { last_order: "2024-09-15", total_spent: 500 }
3. Column Family
● Column Families are the key building blocks of an HBase table and must be defined
when creating the table.
● A column family is a collection of columns that are logically related and are stored
together physically on disk for performance optimization.
● All the columns within a column family are stored together on disk, and data within the
same column family is compressed together. This improves I/O efficiency because
queries that target a specific column family do not need to read the entire row.
● Column families can hold any number of columns, but the number of column families is
typically kept small to ensure efficient data storage and retrieval.
Example:
● A column family could be personal_info to store customer names, emails, and phone
numbers.
● Another column family, order_info, could store customer order details, such as the last
order date and total amount spent.
4. Column
● A Column in HBase is identified by two parts: the column family and the column
qualifier. The column family groups related data together, while the qualifier identifies
individual pieces of data within the family.
● Columns in HBase are dynamic: new columns can be added to any row at any time
without schema changes.
● Columns do not have to be present in every row; a column can exist for one row and not
for others, making it highly flexible and efficient for sparse data.
Example:
● In the column family personal_info, there could be columns like name, email, and
phone.
● In the column family order_info, columns might include last_order, total_spent, and
order_count.
In HBase, column families and column qualifiers are represented as
column_family:column_qualifier.
Row Key: customer_123
personal_info:name = "John"
personal_info:email = "[email protected]"
order_info:last_order = "2024-09-15"
order_info:total_spent = 500
5. Cell
● A Cell is the smallest unit of storage in HBase. It is the intersection of a row, a column
family, and a column qualifier. Each cell contains a value and multiple versions of that
value based on timestamps.
● The cell stores the actual data and can hold multiple versions of the same data, each
tagged with a unique timestamp. This versioning allows for time-based retrieval, where
you can retrieve historical data by querying past versions of a cell.
Example:
● The cell at personal_info:name for customer_123 may contain multiple versions of the
name data, each with a different timestamp.
personal_info:name = "John" (Timestamp: 2024-09-15)
6. Timestamp (Versioning)
● Every cell in HBase can store multiple versions of a value, with each version being
associated with a timestamp.
● By default, HBase uses the system time as the timestamp, but custom timestamps can
be provided during data insertion.
● This feature allows HBase to store historical data and retrieve specific versions of data
based on time. This is useful for scenarios such as auditing, data history tracking, or
time-series data.
● Version retention can be configured per column family, and older versions are deleted
based on these settings.
Example:
● The name column for customer_123 might store the following versions:
personal_info:name = "John" (Timestamp: 2024-09-15)
personal_info:name = "Johnny" (Timestamp: 2024-06-01)
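A small sketch of reading several timestamped versions of one cell with the HBase 2.x client API; the table, row key, and column names follow the customer_123 example above and are illustrative.

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadVersionsExample {
    static void printNameHistory(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("Customer"))) {
            Get get = new Get(Bytes.toBytes("customer_123"));
            get.readVersions(3);   // ask for up to 3 stored versions of each cell
            Result result = table.get(get);
            // Each returned Cell carries its own timestamp, newest first.
            for (Cell cell : result.getColumnCells(Bytes.toBytes("personal_info"),
                                                   Bytes.toBytes("name"))) {
                System.out.println(Bytes.toString(CellUtil.cloneValue(cell))
                        + " @ " + cell.getTimestamp());
            }
        }
    }
}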
7. Region
● Regions are horizontal partitions of tables in HBase. A region contains a subset of the
table's rows, usually defined by a contiguous range of row keys.
● HBase automatically splits a table into regions as the data grows. Each region contains all
the rows between a start row key and an end row key.
● Regions are distributed across the cluster and managed by RegionServers. As the table
grows, regions are split into smaller regions to balance the load across the cluster.
● Each region can span multiple column families but will only handle rows within the row
key range assigned to it.
Example:
● A table might be split into two regions based on row keys:
○ Region 1: Handles row keys from customer_000 to customer_499
○ Region 2: Handles row keys from customer_500 to customer_999
8. RegionServer
● RegionServers are the worker nodes in an HBase cluster that handle data storage and
management.
● A RegionServer manages multiple regions and handles read/write requests from clients.
It is responsible for storing data in HFiles (HBase’s internal file format) and for managing
MemStore (in-memory data).
● As regions grow in size, they are automatically split and redistributed among the
RegionServers to balance the workload and ensure efficient data access.
Example:
● A single RegionServer might handle several regions, each containing a different range of
row keys for multiple tables.
HBase Data Model Example
Let’s take an example of a table named Users with two column families: personal_info and
login_info.
Table Layout:
Row Key  | Column Family | Column     | Value             | Timestamp
user_001 | personal_info | name       | John Doe          | 2024-09-15 12:00
user_001 | personal_info | email      | [email protected]  | 2024-09-15 12:00
user_001 | login_info    | last_login | 2024-09-15 10:30  | 2024-09-15 12:00
user_002 | personal_info | name       | Jane Smith        | 2024-09-14 11:00
user_002 | login_info    | last_login | 2024-09-14 09:00  | 2024-09-14 11:00
Key Components in the Table:
● Row Keys: user_001, user_002
● Column Families: personal_info, login_info
● Columns: name, email, last_login
● Cells: Each row key and column combination is a cell, which stores data values with
multiple versions.
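As a sketch of how such a data model is declared, the snippet below uses the HBase 2.x Admin API; the table and column-family names follow the Users example above, and the version setting is illustrative. Only the table and its column families are defined up front; individual column qualifiers are created implicitly when data is written.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateUsersTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            TableDescriptorBuilder users =
                    TableDescriptorBuilder.newBuilder(TableName.valueOf("Users"));

            // Column family 'personal_info', keeping up to 3 timestamped versions per cell.
            users.setColumnFamily(ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("personal_info"))
                    .setMaxVersions(3)
                    .build());

            // Column family 'login_info' with default settings.
            users.setColumnFamily(ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("login_info"))
                    .build());

            // Column qualifiers such as 'name' or 'last_login' are never declared here;
            // they come into existence the first time a Put writes them.
            admin.createTable(users.build());
        }
    }
}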
35. What important role do Region Server and Zookeeper play in HBase architecture?
Ans. RegionServer
The HBase RegionServer is responsible for managing and serving the data stored in HBase. It
handles read/write requests from clients, manages data storage and retrieval, and ensures load
distribution across the HBase cluster.
Key Roles of RegionServer in HBase:
1. Managing Regions:
○ A RegionServer manages multiple regions, which are horizontal partitions of an
HBase table. Each region contains a subset of rows from the table, based on a range
of row keys.
○ RegionServers handle the storage and retrieval of data for the regions assigned to
them.
○ As the table grows in size, regions are split automatically, and these splits are assigned
to different RegionServers to distribute the load evenly across the cluster.
2. Handling Read/Write Requests:
○ RegionServers handle all client requests for reading and writing data within the
regions they manage.
○ For read requests, RegionServers retrieve the required data from HFiles (HBase’s
storage files) on HDFS.
○ For write requests, RegionServers first store data in the MemStore (an in-memory
buffer) and then periodically flush the data to disk as HFiles for persistent storage.
3. MemStore and HFile Management:
○ Data written to HBase is first stored in MemStore (an in-memory buffer). Once the
MemStore reaches a certain size, it is flushed to HDFS as HFiles.
○ Each column family in a region has its own MemStore and set of HFiles. HBase stores
data in HFiles in a sorted manner, making read operations efficient.
○ RegionServers also manage compaction: periodically merging smaller HFiles into
larger ones to reduce file fragmentation and improve read performance.
4. Region Splitting and Reassignment:
○ When a region grows too large (due to an increase in data), the RegionServer
automatically splits the region into two smaller regions. The split regions are then
assigned to different RegionServers to distribute the load more effectively.
○ This dynamic splitting and reassignment of regions allow HBase to scale horizontally
across many RegionServers.
5. Region Recovery and Availability:
○ RegionServers are responsible for ensuring data availability. If a RegionServer fails, the
regions it was managing are re-assigned to other RegionServers by the HBase Master
to ensure continuous service.
6. Data Consistency and Durability:
○ RegionServers work with HDFS to provide durability. Data is written to WAL
(Write-Ahead Log) before being written to MemStore. This ensures that in case of
failure, uncommitted data can be recovered by replaying the WAL.
○ HBase provides strong consistency for read/write operations at the row level.
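The write path described above (WAL first, then MemStore, then a flush to an HFile) is internal to the RegionServer, but the client can make the WAL durability explicit, and an operator can force a flush. A minimal sketch, assuming a Users table already exists and using a hypothetical row:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathExample {
    public static void main(String[] args) throws Exception {
        TableName users = TableName.valueOf("Users");
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(users);
             Admin admin = conn.getAdmin()) {

            Put put = new Put(Bytes.toBytes("user_003")) // hypothetical row
                    .addColumn(Bytes.toBytes("personal_info"), Bytes.toBytes("name"),
                               Bytes.toBytes("Sam Lee"));
            // SYNC_WAL asks the RegionServer to sync the Write-Ahead Log before
            // acknowledging the write; the value then sits in the MemStore.
            put.setDurability(Durability.SYNC_WAL);
            table.put(put);

            // Force the MemStores of the table's regions to be flushed to HFiles
            // (normally this happens automatically when a MemStore fills up).
            admin.flush(users);
        }
    }
}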

ZooKeeper
ZooKeeper is a distributed coordination service used by HBase to maintain the state and
configuration of the HBase cluster. It plays a crucial role in ensuring that the cluster operates
correctly and efficiently by managing server coordination and failure recovery.
Key Roles of ZooKeeper in HBase:
1. Cluster Coordination and Management:
○ ZooKeeper is responsible for managing the overall health and status of the HBase
cluster.
○ It maintains knowledge about the state of RegionServers, HBase Master, and other
components in the cluster, ensuring smooth coordination between them.
○ ZooKeeper stores metadata about which RegionServer is responsible for each region,
allowing clients to locate the correct RegionServer to access specific data.
2. Tracking HBase Master and RegionServer States:
○ HBase Master: ZooKeeper helps clients discover the active HBase Master by tracking the
Master node’s availability. When the active Master fails or goes down, ZooKeeper coordinates
the election of a new Master from the available standby (backup) Masters.
○ RegionServers: ZooKeeper tracks which RegionServers are online and which regions
they are managing. When a RegionServer joins or leaves the cluster, ZooKeeper
updates the cluster state, allowing the Master to assign or reassign regions as needed.
3. Ensuring High Availability and Failover:
○ Master Failover: If the HBase Master node fails, ZooKeeper helps in electing a new
Master from available standby nodes, ensuring the continued functioning of the
cluster.
○ RegionServer Failover: If a RegionServer fails, ZooKeeper notifies the HBase Master,
which then reassigns the regions managed by the failed RegionServer to other
RegionServers, ensuring no data is lost and service continues uninterrupted.
○ This failover mechanism ensures high availability of data and services in HBase.
4. Client and RegionServer Discovery:
○ When a client wants to access data in HBase, it first consults ZooKeeper to determine
which RegionServer is responsible for the row key being requested.
○ ZooKeeper maintains information about the region-to-RegionServer mapping
(metadata), so clients can directly communicate with the appropriate RegionServer for
read/write operations without needing to query the Master each time (a minimal connection sketch is given at the end of this answer).
5. Configuration Management:
○ ZooKeeper holds shared cluster state and coordination data for HBase. Changes in this
state (such as region assignments in transition) are propagated through ZooKeeper.
○ This centralized coordination ensures that all nodes in the cluster stay synchronized and
aware of such changes.
6. Metadata Storage:
○ ZooKeeper stores the location of the META table (a system table that tracks all regions
and their assigned RegionServers). This allows HBase to quickly route requests to the
correct RegionServer for any given region.
○ ZooKeeper also holds important metadata related to the HBase system, such as the list
of active nodes and the structure of the cluster.
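Because the client bootstraps through ZooKeeper (points 1 and 4 above), the only addresses it really needs are those of the ZooKeeper quorum. A minimal connection sketch with hypothetical host names, assuming the Users table from the earlier example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ZooKeeperBootstrapExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Only the ZooKeeper quorum is configured here; the client asks ZooKeeper
        // for the location of the hbase:meta table and, from there, finds the
        // RegionServer that holds the requested row. (Host names are hypothetical.)
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("Users"))) {

            Get get = new Get(Bytes.toBytes("user_001"));
            Result result = table.get(get);
            System.out.println("Row exists: " + !result.isEmpty());
        }
    }
}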
