Bda Mid Ans
(U-I)
BIG DATA:
“Big data refers to huge volumes of data sets whose size is beyond the ability of typical
traditional database software tools to capture, store, manage and analyze.”
Big data refers to large, complex, and diverse sets of information that grow at ever-increasing
rates. It includes data that is too vast, varied, and fast-changing for traditional data-processing
methods to handle efficiently. The three key characteristics of big data are often described as:
• Volume: The sheer amount of data generated from multiple sources, such as social media, IoT devices, and sensors.
• Velocity: The speed at which this data is generated, collected, and processed in real time or near real time.
• Variety: The diverse types of data, including structured, semi-structured, and unstructured data, such as text, images, videos, and log files.
1. INGESTION:
Ingestion is the process of bringing the data into the data system we are building.
• The ingestion layer is the very first step of pulling in raw data. Data can be ingested in two modes:
• Batch, in which large groups of data are gathered and delivered together.
• A batch layer (cold path) stores all of the incoming data in its raw form and performs
batch processing on the data. The result of this processing is stored as a batch view.
• Streaming, which is a continuous flow of data. This is necessary for real-time data analytics.
• A speed layer (hot path) analyzes data in real time. This layer is designed for low
latency (minimum delay), at the expense of accuracy.
• Data for batch processing operations is typically stored in a distributed file store that
can hold high volumes of large files in various formats.
• This kind of store is often called a data lake. Options for implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.
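The batch (cold path) and speed (hot path) layers above can be pictured with a minimal Python sketch. This is only an illustration under simplified assumptions: events are plain dictionaries, the "data lake" is an in-memory list, and names such as raw_store, build_batch_view, and speed_view are made up for this example, not part of any framework.

```python
from collections import Counter

# Cold path: all raw events are appended to a raw store (a stand-in for a data lake).
raw_store = []

def ingest_batch(events):
    """Batch ingestion: large groups of events are gathered and stored together."""
    raw_store.extend(events)

def build_batch_view():
    """Batch processing over ALL raw data; accurate but slow, run periodically."""
    return Counter(e["user"] for e in raw_store)

# Hot path: events are processed one at a time as they stream in; low latency,
# but this view only reflects events seen since the last batch run.
speed_view = Counter()

def ingest_stream(event):
    speed_view[event["user"]] += 1

# Example usage
ingest_batch([{"user": "a"}, {"user": "b"}, {"user": "a"}])
batch_view = build_batch_view()
ingest_stream({"user": "b"})          # a new real-time event
combined = batch_view + speed_view    # a serving step merges both views
print(combined)                       # Counter({'a': 2, 'b': 2})
```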
2. STORAGE:
• Storage is where the converted data is stored in a data lake or warehouse and eventually processed.
• The data lake/warehouse is the most essential component of a big data ecosystem.
• The data lake/warehouse should contain only thorough, relevant data, to make insights as valuable as possible.
3. ANALYSIS:
• In the analysis layer, data gets passed through several tools, shaping it into actionable insights.
• The most important thing in this layer is making sure the intent and meaning of the output are understandable.
• Data preparation involves the integration of the required data into a dataset that will be used for data mining.
• Data mining involves examining large databases in order to generate new information and patterns.
• Result validation involves verifying the accuracy of the patterns produced by the data mining algorithms.
The four types of data analytics are:
1. descriptive,
2. prescriptive,
3. diagnostic, and
4. predictive.
• Some industries with the greatest need for data intelligence include 1) cyber security, 2) finance, 3) health, 4) insurance, and 5) law enforcement.
3.Explain the Stream Data Model with a neat sketch and explain how stream processing
is done in data streams.(U-II)
A streaming data architecture consists of:
1) An aggregator that gathers event streams and batch files from a variety of data sources,
2) A broker that makes data available for consumption, and
3) An analytics engine that analyzes the data, correlates values and blends streams together.
Message broker:
• Message Brokers are used to send a stream of events from the producer to consumers
through a push-based mechanism. Message broker runs as a server, with producer and
consumer connecting to it as clients. Producers can write events to the broker and
consumers receive them from the broker.
1. This is the element that takes data from a source, called a producer, translates it into a standard message format, and streams it on an ongoing basis. Other components can then listen in and consume the messages passed on by the broker.
2. Data events from one or more message brokers must be aggregated, transformed, and
structured before data can be analysed with SQL-based analytics tools. This would be done
by an ETL tool or platform that receives queries from users, fetches events from message
queues, then applies the query to generate a result.
The result may be an API call, an action, a visualization, an alert, or in some cases a new data
stream. Examples of open-source ETL tools for streaming data are Apache Storm, Spark
Streaming, and WSO2 Stream Processor.
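As a rough sketch of the producer → broker → consumer flow described above, the example below uses only Python's standard library instead of a real broker such as Kafka; the MessageBroker class and its publish/consume methods are invented for this illustration.

```python
import queue
import threading

class MessageBroker:
    """Toy broker: producers write events, consumers receive them via a shared queue."""
    def __init__(self):
        self.topic = queue.Queue()

    def publish(self, event):      # producer side
        self.topic.put(event)

    def consume(self):             # consumer side (blocking read)
        return self.topic.get()

broker = MessageBroker()

def producer():
    for i in range(5):
        # translate raw data into a standard message format and stream it
        broker.publish({"event_id": i, "value": i * 10})

def consumer():
    for _ in range(5):
        event = broker.consume()
        # a tiny streaming "ETL" step: apply a simple query to each event
        if event["value"] >= 20:
            print("alert:", event)

threading.Thread(target=producer).start()
consumer()
```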
4. Estimating moments and sliding windows in a stream (DGIM algorithm) (U-II)
Sliding Windows
Sliding window is a useful model of stream processing in which the queries
are about a window of length N – the N most recent elements received.
In certain cases N is so large that the data cannot be stored in memory, or
even on disk. Sliding window is also known as windowing.
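A naive exact approach to the sliding-window model keeps all of the last N bits in memory, which is exactly what becomes infeasible when N is very large. The short sketch below (the window length and names are chosen only for illustration) shows this baseline before the DGIM approximation is introduced.

```python
from collections import deque

N = 8                       # window length (tiny here, only for illustration)
window = deque(maxlen=N)    # stores the N most recent bits -> O(N) memory

def arrive(bit):
    window.append(bit)      # the oldest bit is dropped automatically

def count_ones():
    return sum(window)      # exact count of 1's among the last N bits

for b in [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]:
    arrive(b)
print(count_ones())         # counts the 1's among the 8 most recent bits
```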
In the DGIM algorithm, each bit that arrives has a timestamp for the position at which it arrives: if the first bit has timestamp 1, the second bit has timestamp 2, and so on. Positions are interpreted relative to the window size N (timestamps can be stored modulo N), and bucket sizes (the number of 1's in a bucket) are always powers of 2. The window is divided into buckets consisting of 1's and 0's.
1. The right end of a bucket is always a 1 (trailing 0's to the right of the most recent 1 do not belong to any bucket). E.g. 1001011 → a bucket of size 4, having four 1's and with a 1 at its right end.
(Figure: forming the buckets based on the number of 1's by following the rules above.)
In the given data stream, let us assume the new bit arrives from the right.
When the new bit = 0: after the new bit (0) arrives with a timestamp 101, there is no change in the buckets.
But if the new bit that arrives is a 1, then we need to make changes: a new bucket of size 1 is created for it, and if this leaves three buckets of the same size, the two oldest of them are combined. To combine any two adjacent buckets of the same size, replace them by one bucket of twice the size; the timestamp of the new bucket is the timestamp of the rightmost of the two buckets.
Counting the sizes of the buckets in the last 20 bits, we estimate that there are 11 ones.
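A minimal Python sketch of DGIM bucket maintenance under the rules above: each bucket is kept as a (timestamp of its rightmost 1, size) pair, sizes are powers of 2, at most two buckets of each size are retained, and the count of 1's in the last N bits is estimated as the sum of all bucket sizes except half of the oldest bucket. The class and method names are illustrative, not from a library.

```python
class DGIM:
    """Approximate count of 1's in the last N bits of a 0/1 stream."""
    def __init__(self, N):
        self.N = N
        self.t = 0            # current timestamp
        self.buckets = []     # (timestamp of rightmost 1, size), newest first

    def add(self, bit):
        self.t += 1
        # drop buckets that have fallen entirely out of the window
        self.buckets = [(ts, sz) for (ts, sz) in self.buckets
                        if ts > self.t - self.N]
        if bit == 1:
            # a new bucket of size 1 for the arriving 1
            self.buckets.insert(0, (self.t, 1))
            # whenever three buckets have the same size, combine the two
            # OLDEST of them into one bucket of twice the size; the merged
            # bucket keeps the timestamp of the more recent of the two
            i = 0
            while i + 2 < len(self.buckets):
                if (self.buckets[i][1] == self.buckets[i + 1][1]
                        == self.buckets[i + 2][1]):
                    ts_newer, sz = self.buckets[i + 1]
                    self.buckets[i + 1] = (ts_newer, sz * 2)
                    del self.buckets[i + 2]
                else:
                    i += 1

    def estimate(self):
        if not self.buckets:
            return 0
        sizes = [sz for (_, sz) in self.buckets]
        # all buckets except the oldest, plus half of the oldest bucket's size
        return sum(sizes[:-1]) + sizes[-1] // 2

# Example usage on a small stream with window size N = 20
d = DGIM(N=20)
for bit in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1]:
    d.add(bit)
print(d.estimate())   # approximate number of 1's in the last 20 bits
```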
5. Explain sampling and Filtering methods in Data Stream processing? (U-II)
6. Draw a neat sketch of HDFS Architecture and explain the building blocks (Name Node and Data Node) of Hadoop. (U-III)
In the Hadoop Distributed File System (HDFS) architecture, **NameNode** and
**DataNodes** play crucial roles in managing and storing data.
### 1. **NameNode**:
- **Role**: The NameNode is the master node in HDFS that manages the filesystem
namespace and controls access to files. It does not store the actual data but keeps metadata
about the file structure, such as:
- File names
- Directories
- Permissions
- Block locations
- **Responsibilities**:
- **Metadata management**: It tracks which DataNodes hold the data blocks for a file
and maintains the directory tree of the file system.
- **Block mapping**: It maps file names to the corresponding blocks stored in
DataNodes.
- **Replication control**: It monitors block replication to ensure fault tolerance.
- **Fault tolerance**: The NameNode is critical to HDFS operation, so it is supported by a Secondary NameNode (which periodically checkpoints the metadata rather than acting as a hot backup) or, in high-availability configurations, by a Standby NameNode for failover.
### 2. **DataNodes**:
- **Role**: DataNodes are the worker nodes that actually store the data blocks. Each file is divided into blocks (128 MB by default in current Hadoop versions; 64 MB in older ones), and these blocks are distributed across multiple DataNodes.
- **Responsibilities**:
- **Block storage**: DataNodes store the data blocks and serve read/write requests from
clients.
- **Heartbeat and block report**: DataNodes regularly send heartbeats and block reports
to the NameNode to report their status and the blocks they are storing. This helps the
NameNode track the health of the system and determine if data needs to be re-replicated due
to failure.
- **Replication**: Blocks are replicated across multiple DataNodes (default replication
factor is 3) to provide fault tolerance. If one DataNode fails, the data is still available from the
replicas.
### **Interaction**:
- When a client wants to read or write a file, it first communicates with the NameNode to
get metadata and block locations.
- The client then interacts directly with the DataNodes to read or write the actual data
blocks.
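The interaction above can be mimicked with a toy in-memory model. This is not a real HDFS client: BLOCK_SIZE, REPLICATION, and the NameNode/DataNode classes below only illustrate the metadata lookup, block splitting, and 3-way replication described in this answer, and for brevity the NameNode forwards the writes that a real client would send to the DataNodes directly.

```python
import itertools

BLOCK_SIZE = 4       # bytes here; 128 MB in real HDFS
REPLICATION = 3      # default replication factor

class DataNode:
    """Worker node: stores blocks and serves read requests."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}                # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]

class NameNode:
    """Master node: keeps only metadata (file -> block ids -> DataNode locations)."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}              # filename -> [(block_id, [DataNode, ...])]
        self._ids = itertools.count()

    def write(self, filename, data):
        entries = []
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = next(self._ids)
            # pick REPLICATION DataNodes (round-robin here) to hold the replicas
            targets = [self.datanodes[(block_id + k) % len(self.datanodes)]
                       for k in range(REPLICATION)]
            for dn in targets:          # a real client writes to DataNodes directly
                dn.store(block_id, data[i:i + BLOCK_SIZE])
            entries.append((block_id, targets))
        self.metadata[filename] = entries

    def read(self, filename):
        # client asks the NameNode for block locations, then reads from DataNodes
        return b"".join(locs[0].read(bid) for bid, locs in self.metadata[filename])

# Example usage
cluster = NameNode([DataNode(f"dn{i}") for i in range(4)])
cluster.write("/logs/app.log", b"hello hdfs world")
print(cluster.read("/logs/app.log"))    # b'hello hdfs world'
```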