BDAV Question Bank Solution
Volume
● Volume refers to the amount of data that exists. Volume is like the base of big data, as it's the
initial size and amount of data that's collected. If the volume of data is large enough, it can be
considered big data. However, what's considered to be big data is relative and will change
depending on the available computing power that's on the market.
● For example, a company that operates hundreds of stores across several states generates
millions of transactions per day. This qualifies as big data, and the average number of total
transactions per day across stores represents its volume.
Value
● Value refers to the benefits that big data can provide, and it relates directly to what
organizations can do with that collected data. Being able to pull value from big data is a
requirement, as the value of big data increases significantly depending on the insights that can
be gained from it.
● Organizations can use big data tools to gather and analyze the data, but how they derive value
from that data should be unique to them. Tools like Apache Hadoop can help organizations
store, clean and rapidly process this massive amount of data.
● A great example of big data value can be found in the gathering of individual customer data.
When a company can profile its customers, it can personalize their experience in marketing and
sales, improving the efficiency of contacts and garnering greater customer satisfaction.
Variety
● Variety refers to the diversity of data types. An organization might obtain data from several data
sources, which might vary in value. Data can come from sources in and outside an enterprise as
well. The challenge in variety concerns the standardization and distribution of all data being
collected.
● As noted above, collected data can be unstructured, semi-structured or structured.
Unstructured data is data that's unorganized and comes in different files or formats. Typically,
unstructured data isn't a good fit for a mainstream relational database because it doesn't fit into
conventional data models. Semi-structured data is data that hasn't been organized into a
specialized repository but has associated information, such as metadata. This makes it easier to
process than unstructured data. Structured data, meanwhile, is data that has been organized
into a formatted repository. This means the data is made more addressable for effective data
processing and analysis.
● A more specific example could be found in a company that gathers a variety of data about its
customers. This can include structured data culled from transactions or unstructured social
media posts and call center text. Much of this might arrive in the form of raw data, requiring
cleaning before processing.
Veracity
● Veracity refers to the quality, accuracy, integrity and credibility of data. Gathered data could
have missing pieces, might be inaccurate or might not be able to provide real, valuable insight.
Veracity, overall, refers to the level of trust there is in the collected data.
● Data can sometimes become messy and difficult to use. A large amount of data can cause more
confusion than insights if it's incomplete. For example, in the medical field, if data about what
drugs a patient is taking is incomplete, the patient's life could be endangered.
● Both value and veracity help define the quality of the insights gathered from data. Thresholds for the trustworthiness of data often exist -- and should exist -- at the executive level of an organization, to determine whether the data is suitable for high-level decision-making.
2. Smart Traffic System: Data about traffic conditions on different roads is collected through cameras placed beside the roads and at the entry and exit points of the city, and through GPS devices placed in vehicles (Ola, Uber cabs, etc.). All such data is analyzed, and jam-free or less congested, less time-consuming routes are recommended. Such a smart traffic system can be built in a city through big data analysis. An added benefit is that fuel consumption can be reduced.
3. Auto-Driving Car: Big data analysis helps a car drive without human intervention. Cameras and sensors placed at various spots on the car gather data such as the size of surrounding vehicles, obstacles, and the distance from them. This data is analyzed, and calculations such as how far to steer, what the speed should be, and when to stop are carried out. These calculations allow the car to take action automatically.
4. IoT:
Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing such data, it can be predicted how long a machine will run without any problem and when it will require repair, so that the company can take action before the machine starts facing serious issues or breaks down completely. Thus, the cost of replacing the whole machine can be saved.
In the healthcare field, big data makes a significant contribution. Using big data tools, data regarding patient experience is collected and used by doctors to give better treatment. IoT devices can sense the symptoms of a probable oncoming disease in the human body and help prevent it through advance treatment. IoT sensors placed near patients and new-born babies constantly keep track of various health parameters such as heart rate and blood pressure. Whenever any parameter crosses the safe limit, an alarm is sent to a doctor, so that they can take steps remotely very quickly.
5. Education Sector: Organizations that conduct online educational courses use big data to find candidates interested in those courses. If someone searches for a YouTube tutorial video on a subject, an online or offline course provider for that subject shows that person an online ad for its course.
6. Energy Sector: Smart electric meters read the consumed power every 15 minutes and send this data to a server, where it is analyzed to estimate the times of day when the power load across the city is low. Based on this, manufacturing units and households are advised to run their heavy machines at night, when the power load is lower, to benefit from lower electricity bills.
3. Differentiate between Traditional Vs Big Data.
Merits:
● Easy to analyze and query.
● High consistency and accuracy.
● Efficient storage and retrieval.
● Strong data integrity and validation.
Limitations:
● Limited flexibility (must adhere to a strict schema).
● Scalability issues with very large datasets.
● Less suitable for complex big data types.
B. Semi-structured Data
In Big Data, semi-structured data is a combination of both the unstructured and structured types of big data. This form of data contains some features of structured data but also unstructured information that does not adhere to any formal data model or relational database schema. Some semi-structured data examples include XML and JSON.
Overview:
● Contains both structured and unstructured elements.
● Lacks a fixed schema but includes tags and markers to separate data elements.
● Often stored in formats like XML, JSON, or NoSQL databases.
Examples:
● JSON files for web APIs.
● XML documents for data interchange.
● Email messages (headers are structured, body can be unstructured).
● HTML pages.
Merits:
● More flexible than structured data.
● Easier to parse and analyze than unstructured data.
● Can handle a wide variety of data types.
● Better suited for hierarchical data.
Limitations:
● More complex to manage than structured data.
● Parsing can be resource-intensive.
● Inconsistent data quality.
C. Unstructured Data
Unstructured data in Big Data consists of a multitude of unstructured files (images, audio, logs, and video). This form of data is classified as intricate data because of its unfamiliar structure and relatively huge size. A stark example of unstructured data is the output returned by ‘Google Search’ or ‘Yahoo Search.’
Overview:
● Data that does not conform to a predefined schema.
● Includes text, multimedia, and other non-tabular data types.
● Stored in data lakes, NoSQL databases, and other flexible storage solutions.
Examples:
● Text documents (Word files, PDFs).
● Multimedia files (images, videos, audio).
● Social media posts.
● Web pages.
Merits:
● Capable of storing vast amounts of diverse data.
● High flexibility in data storage.
● Suitable for complex data types like multimedia.
● Facilitates advanced analytics and machine learning applications.
Limitations:
● Difficult to search and analyze without preprocessing.
● Requires large storage capacities.
● Inconsistent data quality and reliability.
5. What are the differences between the NameNode and the Standby NameNode?
1. Active NameNode: The primary component responsible for managing the filesystem namespace. Standby NameNode: A secondary node that acts as a backup to the Active NameNode.
2. Active NameNode: It maintains the metadata of all files and directories in HDFS (e.g., file permissions, locations of blocks on DataNodes). Standby NameNode: It stays in sync with the Active NameNode but does not serve client requests or manage DataNodes under normal conditions.
3. Active NameNode: It is part of the High Availability (HA) setup to ensure HDFS continues running in case the Active NameNode fails. Standby NameNode: It is likewise part of the HA setup for the same purpose.
4. Active NameNode: In a non-HA setup, there is only one Active NameNode, meaning if it fails, the entire HDFS becomes unavailable. Standby NameNode: In an HA setup, the Standby NameNode constantly monitors and syncs with the Active NameNode by receiving edits log updates and block reports from DataNodes.
5. Active NameNode: This single point of failure is mitigated with the introduction of an HA configuration, where the Standby NameNode is used. Standby NameNode: In case the Active NameNode fails, the Standby NameNode can take over and become the new Active NameNode with little to no downtime.
6. Active NameNode: Applies changes to the filesystem metadata and writes these changes to the edits log. Standby NameNode: Synchronizes its state with the Active NameNode through shared storage to keep the metadata and edits log up to date.
7. Active NameNode: Directly handles all client requests for metadata (file system operations like read, write, delete). Standby NameNode: Does not handle client requests unless it becomes the Active NameNode due to a failover.
8. Active NameNode: If the Active NameNode fails, a failover procedure is triggered. Standby NameNode: It takes over the role of the Active NameNode during a failover, resuming client requests and managing the file system as the new Active NameNode.
For example, suppose we have a file of 130 MB. With the default 128 MB block size, HDFS will break this file into 2 blocks. Now, if MapReduce operated directly on the physical blocks, a record that begins in the first block and ends in the second could not be processed correctly, because the second block on its own is incomplete. InputSplit solves this problem.
A MapReduce InputSplit forms a logical grouping of the blocks to be processed as a single unit: the InputSplit includes the location of the next block and the byte offset of the data needed to complete the record, so each mapper works on whole records.
The name node is responsible for managing the data nodes, and it also stores the metadata. The data nodes read, write, process, and replicate the data. They also send signals, known as heartbeats, to the name node; these heartbeats report the status of the data node.
Consider that 30 TB of data is loaded into the name node. The name node distributes it across the data nodes, and this data is replicated among the data nodes, so each block exists on three different data nodes.
Replication of the data is performed three times by default. It is done this way so that if a commodity machine fails, you can replace it with a new machine that has the same data.
Hadoop MapReduce
Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the
processing is done at the slave nodes, and the final result is sent to the master node.
Instead of moving the data to the code, the code that processes the data is sent to the data. This code is usually very small in comparison to the data itself: you only need to send a few kilobytes’ worth of code to perform a heavy-duty process on the computers that hold the data.
The input dataset is first split into chunks of data. In this example, the input has three lines of text - “bus car train,” “ship ship train,” “bus ship car.” The dataset is then split into three chunks, one per line, and processed in parallel.
In the map phase, the data is assigned a key and a value of 1. In this case, we have one bus, one
car, one ship, and one train.
These key-value pairs are then shuffled and sorted together based on their keys. At the reduce
phase, the aggregation takes place, and the final output is obtained.
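To make this flow concrete, here is a minimal word-count sketch in Java using the standard Hadoop MapReduce API (the class names TokenMapper and SumReducer are illustrative, not from the text above): the mapper emits each word with a value of 1, and after the shuffle and sort the reducer sums the counts for each key.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each word ("bus", "car", ...) is emitted with the value 1.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // e.g. ("bus", 1)
        }
    }
}

// Reduce phase: after shuffle and sort, all the 1s for a key arrive together and are summed.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));  // e.g. ("ship", 3)
    }
}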
Hadoop YARN
Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit
of Hadoop and is available as a component of Hadoop version 2.
● Hadoop YARN acts like an OS to Hadoop. It is the resource management layer that runs on top of HDFS.
● It is responsible for managing cluster resources to make sure you don't overload one
machine.
● It performs job scheduling to make sure that the jobs are scheduled in the right place.
Suppose a client machine wants to do a query or fetch some code for data analysis. This job
request goes to the resource manager (Hadoop Yarn), which is responsible for resource
allocation and management.
In the node section, each of the nodes has its node managers. These node managers manage
the nodes and monitor the resource usage in the node. The containers contain a collection of
physical resources, which could be RAM, CPU, or hard drives. Whenever a job request comes in, the application master requests a container from the node manager. Once the node manager has allocated the resource, it reports back to the Resource Manager.
13. What is a MapReduce Partitioner? What is the need for a Partitioner? How many partitioners are there in Hadoop?
Ans. In Hadoop's MapReduce framework, the Partitioner plays a crucial role in determining
how the intermediate key-value pairs (produced by the Mapper phase) are distributed to the
Reducer tasks. It ensures that data is partitioned appropriately and that all key-value pairs
belonging to the same key are sent to the same reducer.
What is a MapReduce Partitioner?
A Partitioner in MapReduce is responsible for controlling the partitioning of the intermediate
map output (key-value pairs) across reducers. It decides which reducer will process which
subset of the intermediate keys produced by the mappers.
In other words, the partitioner determines the logical division of data so that each reducer
receives a distinct and non-overlapping portion of the data, based on the key. The goal is to
group all the values associated with the same key into the same reducer.
Need for a Partitioner in MapReduce:
The Partitioner serves several important purposes in the MapReduce process:
1. Efficient Data Distribution:
○ The partitioner helps distribute the mapper's output evenly among reducers to
avoid data skew. If all data is sent to a single reducer, it will become a bottleneck,
slowing down the entire job. A well-designed partitioner ensures load balancing.
2. Grouping Similar Data:
○ Keys that need to be processed together must go to the same reducer. The
partitioner ensures that all records with the same key are directed to the same
reducer, making sure that reducers can correctly group and process these
records.
3. Custom Partitioning Logic:
○ In many cases, the default HashPartitioner may not suffice, especially when you
want specific data distribution patterns. A custom partitioner allows for
implementing application-specific logic to determine how the data should be
distributed across reducers.
How the Partitioner Works:
1. Mapper Phase: Each mapper processes a chunk of input data and produces key-value
pairs as intermediate output.
2. Partitioner: Before sending the mapper's output to the reducers, the partitioner
determines which reducer each key-value pair should be sent to, based on the key.
3. Reducer Phase: All the key-value pairs for a particular key are grouped together and sent
to the appropriate reducer for further processing.
The total number of partitions is equal to the number of reducer tasks configured for the job, and Hadoop's default partitioner is the HashPartitioner, which assigns a key to a partition as hash(key) mod numReducers. For example, if there are three reducers and the hash of a particular key resolves to partition 2, then all key-value pairs with that key will be sent to the reducer responsible for partition 2.
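As a sketch of custom partitioning logic (the class name DeptPartitioner and the choice of routing one heavy department key to its own reducer are hypothetical examples, not part of the original answer), a custom Partitioner in Hadoop extends org.apache.hadoop.mapreduce.Partitioner and overrides getPartition():

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical custom partitioner: route "Engineering" keys to partition 0,
// and spread every other department over the remaining reducers by hash.
public class DeptPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;                       // only one reducer: everything goes there
        }
        if ("Engineering".equals(key.toString())) {
            return 0;                       // dedicated reducer for a heavy key
        }
        // Same idea as the default HashPartitioner: hash(key) mod remaining partitions.
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
// Registered on the job with: job.setPartitionerClass(DeptPartitioner.class);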
2. Projection (π)
Projection selects specific columns from the dataset, eliminating the others. In SQL terms, this
corresponds to the SELECT column1, column2, ... part of a query.
For example:
π_{name, age}(Employees)
This would return only the name and age columns from the Employees relation.
Projection in MapReduce:
● Map Phase:
○ The mapper reads each record.
○ For each record, the mapper extracts only the fields (columns) that are part of
the projection.
○ The mapper emits the key-value pair, where the value contains only the
projected columns.
● Reduce Phase:
○ The reducer does not have to do any further work unless there is a need for
additional aggregation or deduplication (depending on the use case).
Example:
Consider the same employee dataset where we want to project only the name and age
columns.
Input:
1, Alice, 28
2, Bob, 34
3, Carol, 31
Map Output:
Alice, 28
Bob, 34
Carol, 31
● Reduce Output: (again, no real reducing may be required unless aggregation or
deduplication is needed)
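A minimal Java sketch of this projection, assuming the comma-separated "id, name, age" records shown above (the class name and the choice of a map-only job are illustrative):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Projection: keep only the name and age columns of "id, name, age" records.
public class ProjectionMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Text projected = new Text();

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] fields = record.toString().split(",");
        if (fields.length >= 3) {
            // fields[1] = name, fields[2] = age; the id column is dropped.
            projected.set(fields[1].trim() + ", " + fields[2].trim());
            context.write(projected, NullWritable.get());
        }
    }
}
// A map-only job is enough here: job.setNumReduceTasks(0);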
1. Grouping
Grouping corresponds to the SQL GROUP BY clause: for example, grouping the employees by their department and then counting how many employees are in each department.
Grouping in MapReduce:
● Map Phase:
○ The mapper reads each record from the input dataset.
○ For each record, the mapper emits a key-value pair, where the key is the column
or columns by which we want to group (e.g., department), and the value is the
data relevant for aggregation (e.g., 1 for a count, or some other numeric field for
sum).
● Shuffle and Sort Phase:
○ The MapReduce framework automatically groups the records by the key (e.g., all
records with the same department are grouped together). This is the "grouping"
step.
● Reduce Phase:
○ The reducer receives the grouped key-value pairs.
○ For each group, the reducer performs the necessary computation (e.g., summing,
counting, etc.) and emits the final aggregated result for that group.
Example:
Consider the following employee data:
1, Alice, HR
2, Bob, Engineering
3, Carol, HR
4, David, Engineering
5, Eve, Marketing
We want to group by department and count the number of employees in each department.
Map Output:
HR 1
Engineering 1
HR 1
Engineering 1
Marketing 1
Reduce Output:
HR 2
Engineering 2
Marketing 1
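A minimal Java sketch of this group-by-and-count, assuming comma-separated "id, name, department" records (class names are illustrative): the mapper emits (department, 1) and the reducer adds up the ones that the framework has grouped under each department key.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (department, 1) for each "id, name, department" record.
public class DeptCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text dept = new Text();

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] fields = record.toString().split(",");
        if (fields.length >= 3) {
            dept.set(fields[2].trim());     // grouping key: department
            context.write(dept, ONE);
        }
    }
}

// Reduce: the framework has already grouped by department; just count the ones.
class DeptCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text dept, Iterable<IntWritable> ones, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable one : ones) {
            count += one.get();
        }
        context.write(dept, new IntWritable(count));   // e.g. ("HR", 2)
    }
}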
2. Aggregation
Aggregation refers to applying functions such as SUM, COUNT, AVG, MAX, or MIN to grouped
data. For example, computing the total sales in each region or counting the number of
employees in each department are aggregation operations.
Aggregation in MapReduce:
● Map Phase:
○ The mapper emits key-value pairs, where the key is the group (e.g., department)
and the value is the data being aggregated (e.g., salary for summing salaries, or 1
for counting records).
● Reduce Phase:
○ The reducer performs the aggregation function. For example, it sums up all
values for a given key (e.g., sum of salaries for each department).
Example:
Consider we want to compute the sum of salaries for each department. Here's the employee
dataset with salary:
1, Alice, HR, 50000
2, Bob, Engineering, 60000
3, Carol, HR, 55000
4, David, Engineering, 70000
5, Eve, Marketing, 45000
We want to group by department and sum the salary for each department.
Map Phase: Each mapper reads a record and emits the department as the key and the salary as
the value.
HR 50000
Engineering 60000
HR 55000
Engineering 70000
Marketing 45000
● Shuffle and Sort: The records are grouped by the key (department) and passed to the
reducer.
● Reduce Phase: The reducer sums the salaries for each department.
Reduce Output:
HR 105000
Engineering 130000
Marketing 45000
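Relative to the count example above, only the values change for this aggregation: the mapper now emits (department, salary) instead of (department, 1). A minimal sketch of the summing reducer (class name illustrative):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Aggregation: sum the salary values that arrive grouped under each department key.
public class SalarySumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text dept, Iterable<LongWritable> salaries, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable salary : salaries) {
            total += salary.get();
        }
        context.write(dept, new LongWritable(total));  // e.g. ("HR", 105000)
    }
}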
Natural Joins
A natural join is a type of relational join that automatically joins two relations (tables) based on
columns with the same name and values. For example, given two tables:
● Employees(emp_id, name, dept_id)
● Departments(dept_id, dept_name)
A natural join would combine records from both tables where the dept_id matches in both.
Natural Join in MapReduce:
To implement a natural join in MapReduce, the idea is to use the common join key (e.g.,
dept_id) to group records from both tables that need to be joined.
● Map Phase:
○ Each mapper reads records from both datasets (e.g., Employees and
Departments).
○ For each record, the mapper emits a key-value pair where the key is the join
attribute (e.g., dept_id), and the value is the record (either from the Employees
table or the Departments table), prefixed with a tag indicating which dataset it
belongs to.
● Shuffle and Sort:
○ During the shuffle phase, records from both datasets are grouped by the join key
(e.g., dept_id), so that all records with the same key are sent to the same
reducer.
● Reduce Phase:
○ The reducer receives all records with the same join key (e.g., all records with
dept_id = 1).
○ It joins records from the two datasets based on the common join key.
Example:
Let’s say we have two datasets:
Employees:
emp_id, name, dept_id
1, Alice, 101
2, Bob, 102
3, Carol, 101
Departments:
dept_id, dept_name
101, HR
102, Engineering
We want to perform a natural join on dept_id.
Map Output: The mapper reads records from both datasets and emits key-value pairs based on
dept_id:
101 ("Employees", 1, Alice, 101)
102 ("Employees", 2, Bob, 102)
101 ("Employees", 3, Carol, 101)
101 ("Departments", 101, HR)
102 ("Departments", 102, Engineering)
Shuffle and Sort: The shuffle phase groups records by dept_id:
101: [("Employees", 1, Alice, 101), ("Employees", 3, Carol, 101), ("Departments", 101, HR)]
102: [("Employees", 2, Bob, 102), ("Departments", 102, Engineering)]
● Reduce Phase: The reducer joins the records:
○ For dept_id = 101, it joins Alice and Carol with the HR department.
○ For dept_id = 102, it joins Bob with the Engineering department.
Reduce Output:
(1, Alice, HR)
(3, Carol, HR)
(2, Bob, Engineering)
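A minimal Java sketch of this reduce-side natural join (the class names, the "E"/"D" tags, and the assumption that the two datasets arrive through two separate mappers wired with MultipleInputs are illustrative choices, not from the original answer):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Tags each Employees record ("emp_id, name, dept_id") with "E" and keys it by dept_id.
public class EmployeeMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] f = record.toString().split(",");
        if (f.length == 3) {
            context.write(new Text(f[2].trim()), new Text("E\t" + f[0].trim() + "," + f[1].trim()));
        }
    }
}

// Tags each Departments record ("dept_id, dept_name") with "D" and keys it by dept_id.
class DepartmentMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] f = record.toString().split(",");
        if (f.length == 2) {
            context.write(new Text(f[0].trim()), new Text("D\t" + f[1].trim()));
        }
    }
}

// For each dept_id, pairs every employee with the department name.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text deptId, Iterable<Text> tagged, Context context)
            throws IOException, InterruptedException {
        List<String> employees = new ArrayList<>();
        String deptName = null;
        for (Text value : tagged) {
            String[] parts = value.toString().split("\t", 2);
            if ("E".equals(parts[0])) {
                employees.add(parts[1]);        // "emp_id,name"
            } else {
                deptName = parts[1];            // "dept_name"
            }
        }
        if (deptName != null) {
            for (String emp : employees) {
                context.write(new Text(emp), new Text(deptName));  // e.g. (1,Alice  HR)
            }
        }
    }
}
// The two inputs would be wired with MultipleInputs.addInputPath(...), one mapper per dataset.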
18. Explain the Hadoop ecosystem with core components. Explain its architecture.
Ans. Hadoop Ecosystem is a platform or a suite which provides various services to solve the big
data problems. It includes Apache projects and various commercial tools and solutions. There
are four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and Hadoop Common
Utilities. Most of the tools or solutions are used to supplement or support these major
elements. All these tools work collectively to provide services such as absorption, analysis,
storage and maintenance of data etc.
Following are the components that collectively form a Hadoop ecosystem:
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query based processing of data services
● HBase: NoSQL Database
● Mahout, Spark MLLib: Machine Learning algorithm libraries
● Solr, Lucene: Searching and Indexing
● Zookeeper: Managing cluster
● Oozie: Job Scheduling
The Hadoop ecosystem architecture refers to the organization of the various tools and
components within the Hadoop ecosystem that enable the efficient processing, storage, and
management of large datasets in a distributed computing environment. The architecture is
designed to handle Big Data workloads across clusters of commodity hardware and is scalable,
fault-tolerant, and highly distributed.
Layers of the Hadoop Ecosystem Architecture
1. Storage Layer: Hadoop Distributed File System (HDFS)
HDFS is the foundation of the Hadoop ecosystem for distributed storage, designed to store large
amounts of data reliably across multiple nodes in a cluster. It splits large files into smaller blocks
and stores them across different nodes.
Components of HDFS:
● NameNode:
○ The master node responsible for managing the file system metadata (e.g., the
directory structure, file-to-block mappings).
○ Tracks which DataNode stores which blocks of data.
○ Handles replication of data blocks to ensure fault tolerance (default replication
factor is 3).
● DataNodes:
○ The worker nodes that store the actual data in blocks.
○ DataNodes regularly communicate with the NameNode to send block reports
and health checks.
○ Each block is replicated across multiple DataNodes to ensure high availability and
fault tolerance.
● Secondary/Standby NameNode:
○ The Secondary NameNode periodically merges the NameNode's edit log into a snapshot of the metadata (checkpointing); in an HA setup, a Standby NameNode keeps a synchronized copy of the metadata and can take over if the Active NameNode fails, preventing data loss and downtime.
19. What are the different frameworks that run under YARN? Discuss the various YARN daemons.
Ans. YARN (Yet Another Resource Negotiator) is a key component of the Hadoop ecosystem,
responsible for managing resources in a Hadoop cluster and scheduling jobs. It provides a
platform to run various distributed processing frameworks, allowing for better resource
management and job scheduling in a more flexible and scalable manner.
Frameworks That Run Under YARN
YARN is designed to run many types of distributed frameworks and applications, not just
Hadoop’s native MapReduce.
a. MapReduce
● Description: Hadoop's original processing model, which divides tasks into mapping and
reducing steps.
● Use Case: Data processing tasks such as filtering, aggregation, and data transformation.
b. Apache Spark
● Description: A fast, in-memory data processing engine that supports batch, real-time
streaming, and iterative algorithms.
● Use Case: Machine learning, real-time analytics, and large-scale data transformations.
● Integration with YARN: Spark can run on YARN by using Spark's built-in YARN support for
resource allocation and task scheduling.
c. Apache Tez
● Description: A more flexible and efficient framework than MapReduce, designed to
handle complex data processing pipelines.
● Use Case: Interactive query processing and graph-based data processing workflows.
● Integration with YARN: Tez is integrated with YARN for scheduling and resource
management and is often used by higher-level frameworks like Apache Hive and Pig.
d. Apache Flink
● Description: A real-time stream processing engine and also supports batch processing.
● Use Case: Continuous data processing with event-time semantics, complex event
processing, and real-time analytics.
● Integration with YARN: Flink can run on YARN, allowing it to scale easily across a Hadoop
cluster.
e. Apache HBase
● Description: A distributed NoSQL database that supports real-time read/write access to
large datasets.
● Use Case: Low-latency access to large-scale data for applications like social media,
finance, and telecommunications.
● Integration with YARN: HBase tasks (region servers) can be managed using YARN for
resource allocation.
YARN Daemons
YARN has several daemons that manage resource allocation, job scheduling, and task execution
across the cluster. These daemons are responsible for ensuring that jobs run efficiently and that
resources are used optimally.
The Apache YARN framework consists of a master daemon known as the ResourceManager, a slave daemon called the NodeManager (one per slave node), and an ApplicationMaster (one per application).
a. ResourceManager
● Role: The ResourceManager (RM) is the master daemon that manages all resources in
the YARN cluster.
● Key Responsibilities:
○ Accepts job submissions from clients.
○ Allocates resources to jobs across the cluster.
○ Manages the lifecycle of applications running in the cluster.
○ Works with the NodeManagers to track the health of worker nodes.
○ Schedules resources using various scheduling policies (e.g., FIFO, Fair Scheduler,
Capacity Scheduler).
b. NodeManager
● Role: The NodeManager (NM) is responsible for managing resources and tasks on a
single worker node in the cluster.
● Key Responsibilities:
○ Monitors resource usage (CPU, memory) on each node.
○ Launches and manages containers (the units of computation) where application
tasks run.
○ Reports resource availability and status to the ResourceManager.
○ Oversees the health of the node, including handling resource isolation and
cleanup.
c. ApplicationMaster
● Role: The ApplicationMaster (AM) is responsible for managing the lifecycle of an
individual YARN application (e.g., a MapReduce job or a Spark job).
● Key Responsibilities:
○ Negotiates resources with the ResourceManager for the application.
○ Tracks the execution of tasks (containers) and manages their failure or retry.
○ Acts as the orchestrator for the application, determining how tasks should be
scheduled and run.
○ Typically, each application has its own dedicated ApplicationMaster.
20. What is a Map Reduce Combiner? Write advantages and disadvantages of Map
Reduce Combiner?
Ans. A MapReduce Combiner, also called a semi-reducer, is an optional class that takes the output of the Mapper (Map class) as its input and then passes its own key-value output to the Reducer (Reduce class). The predominant function of a combiner is to summarize the map output records that share the same key. The combiner's key-value output is dispatched over the network to the Reducer as input. The combiner class is placed between the map class and the reduce class to decrease the volume of data transferred between map and reduce, because the map output is usually large and the amount of data transferred to the reduce task is high.
How does the MapReduce Combiner work?
Here is a brief summary of the working of the MapReduce Combiner:
The MapReduce Combiner does not have a predefined interface of its own; it must implement the Reducer interface's reduce() method. The combiner operates on each key of the map output, and it must process its key-value pairs in the same way as the Reducer class, because the combiner's output replaces the original map output. This is why the combiner can produce summarized information even for a huge dataset. Without a combiner, when a MapReduce job is run on a large dataset, the map class creates a huge chunk of intermediate data, and this intermediate data is handed to the reducer for later processing, which leads to heavy network congestion.
Without the combiner, the MapReduce program outline is as follows: the input is split between two map classes (mappers), and the mappers generate 9 keys in total. These 9 intermediate key-value pairs are sent by the mappers directly to the reduce class. Dispatching the data to the reducer consumes network bandwidth (bandwidth here meaning the time taken to transfer data from one machine to another), and the transfer time increases significantly if the size of the data is too big.
When a combiner is placed between the mapper and the reducer, the intermediate data is combined before being dispatched to the reducer, so instead of 9 key-value pairs, only 4 key-value pairs (2 from each of the 2 combiners) travel to the reducer.
With the combiner, the MapReduce program outline is as follows:
The reducer now processes only 4 key-value pairs, which it receives as input from the 2 combiners. The reducer is executed only 4 times to produce the final output, which boosts the overall performance.
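As a minimal sketch of how the combiner is enabled (assuming a word-count style job with mapper and reducer classes like the TokenMapper and SumReducer sketched earlier in this document), the driver only needs one extra line, job.setCombinerClass(...); here the reduce class doubles as the combiner because summing counts is associative and commutative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver for a word-count style job. TokenMapper and SumReducer are the illustrative
// classes sketched earlier in this document, not a fixed part of the Hadoop API.
public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);

        job.setMapperClass(TokenMapper.class);
        // The combiner runs on each mapper's local output before the shuffle,
        // collapsing repeated (word, 1) pairs into partial sums.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}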
22. Write a short note on master & slave V/s peer to peer
Ans. 1. Master-Slave Architecture:
In a master-slave architecture, there is a distinct hierarchy where one node acts as the master
and the other nodes function as slaves.
● Master:
○ Controls and coordinates the system.
○ Assigns tasks to slave nodes, manages resources, and maintains metadata.
○ Example: In Hadoop, the NameNode is the master that manages metadata, and
DataNodes are the slaves that store actual data.
● Slave:
○ Performs tasks assigned by the master.
○ Does not have autonomy to make decisions on its own.
○ Example: In Hadoop, the TaskTracker and DataNodes are slaves.
Advantages:
● Centralized control simplifies management and coordination.
● Efficient for distributed task assignment and resource management.
Disadvantages:
● Single Point of Failure: If the master fails, the entire system can be compromised.
● Scalability can be limited by the master’s capacity.
2. Availability
Availability means that each read or write request for a data item will either be processed
successfully or will receive a message that the operation cannot be completed. Every non-failing
node returns a response for all the read and write requests in a reasonable amount of time. The
key word here is “every”. In simple terms, every node (on either side of a network partition)
must be able to respond in a reasonable amount of time.
For example, user A is a content creator with 1000 other users subscribed to his channel. Another user B, who is far away from user A, tries to subscribe to user A’s channel. Since the distance between the two users is large, they are connected to different database nodes of the social media network. If the distributed system follows the principle of availability, user B must be able to subscribe to user A’s channel.
3. Partition Tolerance
Partition tolerance means that the system can continue operating even if the network
connecting the nodes has a fault that results in two or more partitions, where the nodes in each
partition can only communicate among each other. That means, the system continues to
function and upholds its consistency guarantees in spite of network partitions. Network
partitions are a fact of life. Distributed systems guaranteeing partition tolerance can gracefully
recover from partitions once the partition heals.
For example, take the same social media network, where two users are trying to find the subscriber count of a particular channel. Due to some technical fault, a network outage occurs and the second database, used by user B, loses its connection with the first database. The subscriber count is still shown to user B with the help of a replica of the data that was stored in database 1 and backed up prior to the network outage. Hence the distributed system is partition tolerant.
The CAP theorem states that distributed databases can have at most two of the three
properties: consistency, availability, and partition tolerance. As a result, database systems
prioritize only two properties at a time.
Eventually consistent
Eventually consistent means the record will achieve consistency when all the concurrent updates
have been completed. At this point, applications querying the record will see the same value. For
example, consider a distributed document editing system where multiple users can simultaneously
edit a document. If User A and User B both edit the same section of the document simultaneously,
their local copies may temporarily differ until the changes are propagated and synchronized.
However, over time, the system ensures eventual consistency by propagating and merging the
changes made by different users.
3. Document Database:
The document database fetches and accumulates data in the form of key-value pairs, but here the values are called Documents. A document is a complex data structure and can take the form of text, arrays, strings, JSON, XML, or any such format. The use of nested documents is also very common. This model is very effective because most of the data created today is unstructured and usually in the form of JSON.
Advantages:
● This type of format is very useful and apt for semi-structured data.
● Storage retrieval and managing of documents is easy.
Limitations:
● Handling multiple documents is challenging
● Aggregation operations may not work accurately.
Examples:
● MongoDB
● CouchDB
30. What are the benefits of HBase over other NoSQL databases?
Ans. HBase, a distributed, scalable NoSQL database that runs on top of Hadoop, offers several
advantages over other NoSQL databases. Its tight integration with the Hadoop ecosystem and
certain unique features make it a popular choice for certain big data applications.
Benefits of HBase over Other NoSQL Databases:
1. Tight Integration with Hadoop
● HBase is built to work seamlessly with Hadoop's HDFS (Hadoop Distributed File System).
This makes it ideal for handling large datasets that are stored on HDFS, allowing
real-time read/write access to massive amounts of data stored in Hadoop.
● It also integrates with other Hadoop ecosystem components like MapReduce, Hive, Pig,
and Spark for batch processing and querying large datasets.
2. Efficient for Random Read/Write Operations
● Unlike Hadoop's batch-processing nature, HBase is designed for real-time read/write
operations on large datasets.
● It excels at handling random, sparse, and non-sequential data access, which many other
NoSQL databases might struggle with at such large scale.
3. Strong Consistency
● HBase provides strong consistency for reads and writes, meaning once data is written to
HBase, it is immediately visible to subsequent read operations.
● This is in contrast to some NoSQL databases (like Cassandra), which prioritize availability
over consistency in distributed environments and may allow stale reads.
4. Column-Oriented Storage
● HBase is a column-family-oriented database, where data is stored in columns rather
than rows. This design allows for very flexible data storage, especially for wide and
sparse datasets.
● Column families allow more efficient reads and writes by focusing only on specific data
sets (columns) rather than scanning entire rows, which can be an advantage over
traditional row-based NoSQL systems like MongoDB.
5. Scalability
● HBase is highly scalable both horizontally (by adding more nodes) and vertically (by
adding more memory or CPU to the nodes).
● Its underlying architecture, based on HDFS and distributed across a cluster, ensures that
HBase can handle large-scale datasets in a fault-tolerant manner.
6. Automatic Sharding (Region Splitting)
● HBase automatically shards tables into smaller units called regions. These regions are
distributed across the nodes of a cluster, which helps in balancing the load and
enhancing system performance.
● This automatic region splitting allows HBase to handle large datasets efficiently, without
requiring manual intervention to split and manage partitions.
7. Support for Big Data Analytics
● HBase, when combined with Hadoop’s batch processing capabilities, is excellent for big
data analytics.
● You can store massive amounts of data in HBase and process it using MapReduce or
Spark for large-scale computations.
HBase vs. Other NoSQL Databases:
● Read/Write Operations: HBase is efficient for random, real-time operations. Other NoSQL databases vary (MongoDB: fast reads; Cassandra: write-heavy workloads).
● Scalability: HBase is horizontally scalable. Other NoSQL databases are also scalable but have different architectures.
● Automatic Sharding: HBase performs automatic region splitting. In other NoSQL databases, sharding is possible but may require manual configuration.
● Big Data Analytics: HBase is suited for big data and analytics. Other NoSQL databases may need third-party integration for big data analytics.
The write mechanism goes through the following process sequentially:
Step 1: Whenever the client has a write request, the client writes the data to the WAL (Write
Ahead Log).
● The edits are then appended at the end of the WAL file.
● This WAL file is maintained in every Region Server and Region Server uses it to recover
data which is not committed to the disk.
Step 2: Once data is written to the WAL, then it is copied to the MemStore.
Step 3: Once the data is placed in MemStore, then the client receives the acknowledgment.
Step 4: When the MemStore reaches the threshold, it dumps or commits the data into a HFile.
HBase Write Mechanism- MemStore
● The MemStore always updates the data stored in it, in a lexicographical order
(sequentially in a dictionary manner) as sorted KeyValues. There is one MemStore for
each column family, and thus the updates are stored in a sorted manner for each column
family.
● When the MemStore reaches the threshold, it dumps all the data into a new HFile in a
sorted manner. This HFile is stored in HDFS. HBase contains multiple HFiles for each
Column Family.
● Over time, the number of HFiles grows as the MemStore dumps the data.
● The MemStore also saves the last written sequence number, so the Master Server and the MemStore both know what has been committed so far and where to start from. When the region starts up, the last sequence number is read, and new edits start from that number.
HBase Architecture: HBase Write Mechanism- HFile
● The writes are placed sequentially on the disk, so the movement of the disk’s read-write head is minimal. This makes the write and search mechanism very fast.
● The HFile indexes are loaded into memory whenever an HFile is opened. This helps in finding a record in a single seek.
● The trailer is a pointer which points to the HFile’s meta block. It is written at the end of the committed file and contains information such as timestamps and the bloom filters.
● The Bloom Filter helps in searching for key-value pairs: it skips files that do not contain the required row key. The timestamp also helps in searching for a version of the file, allowing data to be skipped.
HBase Architecture: Read Mechanism
● For reading the data, the scanner first looks for the row cell in the Block Cache, where all the recently read key-value pairs are stored.
● If the scanner fails to find the required result, it moves to the MemStore, which, as we know, is the write cache. There, it searches for the most recently written data that has not yet been dumped to an HFile.
● At last, it uses the bloom filters and the block cache to load the data from the HFiles.
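As a client-side sketch of what triggers this write and read path (the table name "patients", the column family "vitals", and the row key are placeholders; the calls are from the standard org.apache.hadoop.hbase.client API): a Put is appended to the WAL and MemStore as described above, and a Get is answered from the block cache, MemStore, or HFiles.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("patients"))) {

            // Write path: the RegionServer appends this edit to the WAL, places it in
            // the MemStore, and later flushes it to an HFile on HDFS.
            Put put = new Put(Bytes.toBytes("patient-001"));
            put.addColumn(Bytes.toBytes("vitals"), Bytes.toBytes("heart_rate"), Bytes.toBytes("72"));
            table.put(put);

            // Read path: the scanner checks the block cache, then the MemStore,
            // then the HFiles (using bloom filters to skip files without the row key).
            Get get = new Get(Bytes.toBytes("patient-001"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("vitals"), Bytes.toBytes("heart_rate"));
            System.out.println("heart_rate = " + Bytes.toString(value));
        }
    }
}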
HBase combines HFiles to reduce the storage and reduce the number of disk seeks needed for a
read. This process is called compaction. Compaction chooses some HFiles from a region and
combines them. There are two types of compaction
1. Minor Compaction: HBase automatically picks some smaller HFiles and recommits them to bigger HFiles. This is called Minor Compaction. It performs a merge sort while committing the smaller HFiles to bigger HFiles, which helps in storage space optimization.
2. Major Compaction: In Major Compaction, HBase merges and recommits the smaller HFiles of a region into a new HFile. In this process, the same column families are placed together in the new HFile, and deleted and expired cells are dropped. It increases read performance.
However, during this process, disk input-output and network traffic might get congested. This is known as write amplification, so major compaction is generally scheduled during low peak load timings.
HBase Architecture: Region Split
Whenever a region becomes too large, it is divided into two child regions, each representing exactly half of the parent region. The split is then reported to the HMaster. The child regions are handled by the same Region Server until the HMaster allocates them to a new Region Server for load balancing.
A Region Server maintains various regions running on the top of HDFS. Components of a Region
Server are:
● WAL: The Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t yet been persisted or committed to permanent storage. It is used to recover the data sets in case of failure.
● Block Cache: The Block Cache resides at the top of the Region Server. It stores the frequently read data in memory. If the data in the BlockCache is least recently used, it is removed from the BlockCache.
● MemStore: This is the write cache. It stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family in a region, so a region has multiple MemStores because it contains multiple column families. The data is sorted in lexicographical order before being committed to the disk.
● HFile: HFiles are stored on HDFS and hold the actual cells on the disk. The MemStore commits its data to an HFile when the size of the MemStore exceeds its threshold.
34. What are the major components of HBase Data model? Explain each one in brief
Ans. HBase is a distributed, scalable, column-family-oriented NoSQL database built on top of Hadoop's HDFS. It is designed to manage vast amounts of data with minimal structure and provides random, real-time read/write access to it. The HBase data model is highly flexible and is inspired by Google’s BigTable, offering a few core components that make it powerful for managing structured, semi-structured, and sparse data.
The major components of the HBase data model are:
● Table: A collection of rows; unlike a relational table, it is column-family oriented and can be very sparse.
● Row: Each row is identified by a unique row key, and rows are stored sorted lexicographically by row key.
● Column Family: A logical and physical grouping of columns that are stored together; column families must be defined when the table is created.
● Column Qualifier: The actual column name within a column family, addressed as family:qualifier; qualifiers can be added on the fly.
● Cell: The intersection of a row key, column family, and column qualifier, which holds the stored value.
● Timestamp/Version: Every cell value carries a timestamp, and HBase can retain multiple versions of the same cell.
35. What important role do Region Server and Zookeeper play in HBase architecture?
Ans. RegionServer
The HBase RegionServer is responsible for managing and serving the data stored in HBase. It
handles read/write requests from clients, manages data storage and retrieval, and ensures load
distribution across the HBase cluster.
Key Roles of RegionServer in HBase:
1. Managing Regions:
○ A RegionServer manages multiple regions, which are horizontal partitions of an
HBase table. Each region contains a subset of rows from the table, based on a range
of row keys.
○ RegionServers handle the storage and retrieval of data for the regions assigned to
them.
○ As the table grows in size, regions are split automatically, and these splits are assigned
to different RegionServers to distribute the load evenly across the cluster.
2. Handling Read/Write Requests:
○ RegionServers handle all client requests for reading and writing data within the
regions they manage.
○ For read requests, RegionServers retrieve the required data from HFiles (HBase’s
storage files) on HDFS.
○ For write requests, RegionServers first store data in the MemStore (an in-memory
buffer) and then periodically flush the data to disk as HFiles for persistent storage.
3. MemStore and HFile Management:
○ Data written to HBase is first stored in MemStore (an in-memory buffer). Once the
MemStore reaches a certain size, it is flushed to HDFS as HFiles.
○ Each column family in a region has its own MemStore and set of HFiles. HBase stores
data in HFiles in a sorted manner, making read operations efficient.
○ RegionServers also manage compaction: periodically merging smaller HFiles into
larger ones to reduce file fragmentation and improve read performance.
4. Region Splitting and Reassignment:
○ When a region grows too large (due to an increase in data), the RegionServer
automatically splits the region into two smaller regions. The split regions are then
assigned to different RegionServers to distribute the load more effectively.
○ This dynamic splitting and reassignment of regions allow HBase to scale horizontally
across many RegionServers.
5. Region Recovery and Availability:
○ RegionServers are responsible for ensuring data availability. If a RegionServer fails, the
regions it was managing are re-assigned to other RegionServers by the HBase Master
to ensure continuous service.
6. Data Consistency and Durability:
○ RegionServers work with HDFS to provide durability. Data is written to WAL
(Write-Ahead Log) before being written to MemStore. This ensures that in case of
failure, uncommitted data can be recovered by replaying the WAL.
○ HBase provides strong consistency for read/write operations at the row level.
ZooKeeper
ZooKeeper is a distributed coordination service used by HBase to maintain the state and
configuration of the HBase cluster. It plays a crucial role in ensuring that the cluster operates
correctly and efficiently by managing server coordination and failure recovery.
Key Roles of ZooKeeper in HBase:
1. Cluster Coordination and Management:
○ ZooKeeper is responsible for managing the overall health and status of the HBase
cluster.
○ It maintains knowledge about the state of RegionServers, HBase Master, and other
components in the cluster, ensuring smooth coordination between them.
○ ZooKeeper stores metadata about which RegionServer is responsible for each region,
allowing clients to locate the correct RegionServer to access specific data.
2. Tracking HBase Master and RegionServer States:
○ HBase Master: ZooKeeper helps clients discover the HBase Master by tracking the
Master node’s availability. When a Master node fails or goes down, ZooKeeper elects a
new Master from available standby nodes.
○ RegionServers: ZooKeeper tracks which RegionServers are online and which regions
they are managing. When a RegionServer joins or leaves the cluster, ZooKeeper
updates the cluster state, allowing the Master to assign or reassign regions as needed.
3. Ensuring High Availability and Failover:
○ Master Failover: If the HBase Master node fails, ZooKeeper helps in electing a new
Master from available standby nodes, ensuring the continued functioning of the
cluster.
○ RegionServer Failover: If a RegionServer fails, ZooKeeper notifies the HBase Master,
which then reassigns the regions managed by the failed RegionServer to other
RegionServers, ensuring no data is lost and service continues uninterrupted.
○ This failover mechanism ensures high availability of data and services in HBase.
4. Client and RegionServer Discovery:
○ When a client wants to access data in HBase, it first consults ZooKeeper to determine
which RegionServer is responsible for the row key being requested.
○ ZooKeeper maintains information about the region-to-RegionServer mapping
(metadata), so clients can directly communicate with the appropriate RegionServer for
read/write operations without needing to query the Master each time.
5. Configuration Management:
○ ZooKeeper is used to manage the configuration files of HBase. Any changes in
configuration (such as region assignments) are propagated and managed by
ZooKeeper.
○ This centralized configuration management ensures that all nodes in the cluster are
synchronized and aware of changes.
6. Metadata Storage:
○ ZooKeeper stores the location of the META table (a system table that tracks all regions
and their assigned RegionServers). This allows HBase to quickly route requests to the
correct RegionServer for any given region.
○ ZooKeeper also holds important metadata related to the HBase system, such as the list
of active nodes and the structure of the cluster.
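As a small sketch of this ZooKeeper-based discovery from the client side (the host names are placeholders): an HBase client is configured only with the ZooKeeper quorum, and it learns the active Master and the RegionServer for each region through ZooKeeper and the META table.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseZooKeeperDiscovery {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client is pointed only at the ZooKeeper ensemble (placeholder hosts);
        // region locations and the active Master are looked up through ZooKeeper.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected via ZooKeeper-based discovery: " + !connection.isClosed());
        }
    }
}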