Bda Mid Ans

1. Define Big Data and explain the components of the Big Data ecosystem. (U-I)

BIG DATA:

“Big data refers to huge volumes of data sets whose size is beyond the ability of typical traditional database software tools to capture, store, manage and analyze.”

Big data refers to large, complex, and diverse sets of information that grow at ever-increasing rates. It includes data that is too vast, varied, and fast-changing for traditional data-processing methods to handle efficiently. The three key characteristics of big data are often described as:

Volume: the sheer amount of data generated from multiple sources, such as social media, IoT devices, and sensors.

Velocity: the speed at which this data is generated, collected, and processed in real time or near real time.

Variety: the diverse types of data, including structured, semi-structured, and unstructured data, such as text, images, videos, and log files.

COMPONENTS OF BIG DATA ECOSYSTEM:

1. INGESTION:

Ingestion is the process of bringing data into the data system we are building.

• The ingestion layer is the very first step of pulling in raw data.

• Data comes from a variety of sources: internal sources, relational and non-relational databases, social media, and emails.

Types of ingestion:

• There are two kinds of ingestion (a minimal sketch contrasting the two paths follows this list):

• Batch, in which large groups of data are gathered and delivered together.

• A batch layer (cold path) stores all of the incoming data in its raw form and performs
batch processing on the data. The result of this processing is stored as a batch view.

• Streaming, which is a continuous flow of data. This is necessary for real-time data analytics.

• A speed layer (hot path) analyzes data in real time. This layer is designed for low
latency (minimum delay), at the expense of accuracy.
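
Below is a minimal Python sketch contrasting the two ingestion paths described above. The event list, function names, and the simple counting logic are illustrative assumptions, not part of the notes: the batch (cold) path collects a group of records and computes one result over the whole set, while the streaming (hot) path updates a low-latency result for every event as it arrives.

```python
# Toy contrast of batch (cold path) vs streaming (hot path) ingestion.
# All names here are illustrative; real systems would use tools such as
# a distributed file store for the cold path and a message broker for the hot path.

def batch_ingest(records):
    """Cold path: gather a large group of records and process them together."""
    raw_store = list(records)                 # store everything in raw form
    return {"total_events": len(raw_store)}   # batch view over the whole set

def stream_ingest(event_source):
    """Hot path: process each event as it arrives, for low-latency results."""
    running_count = 0
    for event in event_source:
        running_count += 1                    # update the result per event
        yield running_count                   # partial result available immediately

events = [{"id": i} for i in range(5)]
print(batch_ingest(events))                   # one result after the whole batch
for partial in stream_ingest(iter(events)):
    print("running count:", partial)          # a result after every event
```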

2. DATA STORAGE (DATA WAREHOUSE VS DATA LAKE):

• Data for batch processing operations is typically stored in a distributed file store that
can hold high volumes of large files in various formats.

• This kind of store is often called a data lake. Options for implementing this storage
include Azure Data Lake Store in Azure Storage.

• Storage is where the converted data is stored in a data lake or warehouse and
eventually processed.

• The data lake/warehouse is the most essential component of a big data ecosystem.

• Data in the data lake/warehouse should contain only thorough, relevant data to make insights as valuable as possible.

• It must be efficient, with as little redundancy as possible, to allow for quicker processing.

3. BIG DATA ANALYTICS:

• In the analysis layer, data gets passed through several tools, shaping it into actionable
insights.

• There are four types of analytics on big data:

• Diagnostic: Explains why a problem is happening.

• Descriptive: Describes the current state of a business through historical data.

• Predictive: Projects future results based on historical data.

• Prescriptive: Takes predictive analytics a step further by projecting the best future actions to take.

4. CONSUMPTION (END USER):


• The final big data component is presenting the information in a format useful to the
end-user.

• This can be in the form of:

• tables,

• advanced visualizations, or even single numbers if requested.

• The most important thing in this layer is making sure the intent and meaning of the
output is understandable.

2. Write briefly about Intelligent Data Analysis (U-I)

• Intelligent data analysis refers to the use of analysis, classification, conversion, extraction, organization, and reasoning methods to extract useful knowledge from data.

• This intelligent data analysis process generally consists of:

1. the data preparation stage,

2. the data mining stage,

3. the result validation stage, and

4. the result explanation stage.

• Data preparation involves the integration of the required data into a dataset that will be used for data mining.

• Data mining involves examining large databases in order to generate new information and patterns.

• Result validation involves the verification of the accuracy and patterns produced by data mining algorithms.

• Result explanation involves the intuitive communication of results using visualization methods (a sketch illustrating these four stages follows this list).
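
As a concrete illustration of the four stages, here is a minimal sketch using scikit-learn on a toy dataset. The dataset, the choice of k-means clustering, and the silhouette-score check are assumptions made for this example, not something specified in the notes.

```python
# Illustrative walk through the four IDA stages (toy data, assumed tooling).
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 1. Data preparation: integrate and scale the dataset used for mining.
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# 2. Data mining: examine the data to generate new patterns (here, clusters).
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# 3. Result validation: verify the quality of the patterns produced.
score = silhouette_score(X_scaled, model.labels_)

# 4. Result explanation: communicate the result in an understandable form.
print(f"Found 3 clusters, silhouette score = {score:.2f}")
```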

Five major components of IDA:

1. descriptive data,

2. prescriptive data,

3. diagnostic data,

4. decisive data, and

5. predictive data.
• Some industries with the greatest need for data intelligence include:

1) cyber security,

2) finance,

3) health,

4) insurance, and

5) law enforcement.

Intelligent data capture technology is a valuable application in these industries for transforming print documents or images into meaningful data.

3. Explain the Stream Data Model with a neat sketch and explain how stream processing is done in data streams. (U-II)

STREAM DATA MODEL:

STREAMING DATA ARCHITECTURE:

• A streaming data architecture is an information technology framework that puts the focus on processing data in motion and treats extract-transform-load (ETL) batch processing as just one more event in a continuous stream of events.

• This type of architecture has three basic components –

1) An aggregator that gathers event streams and batch files from a variety of data sources,

2) A broker that makes data available for consumption and

3) An analytics engine that analyzes the data, correlates values and blends streams together.

Message broker:

• Message brokers are used to send a stream of events from the producer to consumers through a push-based mechanism. The message broker runs as a server, with producers and consumers connecting to it as clients. Producers write events to the broker and consumers receive them from the broker.

1. This is the element that takes data from a source, called a producer, translates it into a standard message format, and streams it on an ongoing basis. Other components can then listen in and consume the messages passed on by the broker.

2. Data events from one or more message brokers must be aggregated, transformed, and structured before the data can be analysed with SQL-based analytics tools. This is done by an ETL tool or platform that receives queries from users, fetches events from message queues, and then applies the query to generate a result.

The result may be an API call, an action, a visualization, an alert, or in some cases a new data
stream. Examples of open-source ETL tools for streaming data are Apache Storm, Spark
Streaming, and WSO2 Stream Processor.
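
A minimal sketch of the producer → broker → consumer flow described above, using Python's standard queue and threads as a stand-in for a real broker such as Kafka. The event fields, the end-of-stream marker, and the sum aggregation are assumptions for illustration only.

```python
# Toy event pipeline: a producer writes events to a "broker", and a consumer
# (playing the role of the ETL/stream processor) aggregates them for analytics.
import queue
import threading

broker = queue.Queue()                       # stand-in for a real message broker

def producer():
    for i in range(5):
        broker.put({"event_id": i, "value": i * 10})  # write events to the broker
    broker.put(None)                                  # assumed end-of-stream marker

def consumer():
    total = 0
    while True:
        event = broker.get()                 # events are delivered via the broker
        if event is None:
            break
        total += event["value"]              # aggregate/transform before analytics
    print("aggregated value:", total)        # result of the "query"

producer_thread = threading.Thread(target=producer)
consumer_thread = threading.Thread(target=consumer)
producer_thread.start(); consumer_thread.start()
producer_thread.join(); consumer_thread.join()
```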

3. Data Analytics / Serverless Query Engine:

• Once streaming data has been prepared for consumption by the stream processor, it must be analyzed to provide value. There are many different approaches to streaming data analytics, and a variety of tools are commonly used for this purpose.


4. Estimating moments and sliding windows in a stream (DGIM algorithm) (U-II)

Sliding Windows

A sliding window is a useful model of stream processing in which the queries are about a window of length N – the N most recent elements received. In certain cases N is so large that the data cannot be stored in memory, or even on disk. The sliding window model is also known as windowing.

Consider a sliding window of length N=6 on a single stream, as shown in Figure 1. As the stream content varies over time, the sliding window highlights the new stream elements.

Figure 1. Sliding window on a stream


Example 1: Consider Amazon online transactions. For every product X we keep a 0/1 stream recording whether that product was sold in the n-th transaction. A query such as "how many times have we sold X in the last k sales?" can be answered using the sliding window concept.
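
A minimal sketch of this example, assuming the 0/1 sales stream for product X fits in memory for a small N: a fixed-length deque keeps only the N most recent elements, so old transactions fall out of the window automatically. The stream values and window length used here are made up for illustration.

```python
# Sliding window over a 0/1 sales stream for product X (toy values).
from collections import deque

N = 6                                  # window length, as in the example above
window = deque(maxlen=N)               # elements older than N fall off automatically

def new_transaction(sold_x):
    """Record 1 if product X was sold in this transaction, else 0."""
    window.append(1 if sold_x else 0)

def sold_in_last_k(k):
    """How many times have we sold X in the last k sales? (k <= N)"""
    return sum(list(window)[-k:])      # count the 1's among the k most recent bits

for bit in [1, 0, 1, 1, 0, 1, 0, 1]:   # toy stream of transactions
    new_transaction(bit)
print(sold_in_last_k(4))               # -> 2 (sold twice in the last 4 sales)
```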

DGIM algorithm (Datar-Gionis-Indyk-Motwani Algorithm)

The DGIM algorithm is designed to count the number of 1's in a data stream. It uses O(log² N) bits to represent a window of N bits and allows the number of 1's in the window to be estimated with an error of no more than 50%; in other words, the estimate is within 50% of the true count.

In the DGIM algorithm, each bit that arrives has a timestamp for the position at which it arrives: the first bit has timestamp 1, the second bit has timestamp 2, and so on. The positions are interpreted relative to the window size N (the window size is usually taken as a multiple of 2). The window is divided into buckets consisting of 1's and 0's.

RULES FOR FORMING THE BUCKETS:

1. The right end of the bucket should always be a 1 (if it ends with a 0, that 0 is not counted in any bucket). E.g. 1001011 → a bucket of size 4, having four 1's and ending with a 1 at its right end.

2. Every bucket should have at least one 1, else no bucket can be formed.

3. All bucket sizes should be powers of 2.

4. The buckets cannot decrease in size as we move to the left (they appear in increasing order of size towards the left).

Let us take an example to understand the algorithm: estimating the number of 1's and counting the buckets in a given data stream.

The picture shows how we can form the buckets, based on the number of ones, by following the rules above.

In the given data stream, let us assume each new bit arrives from the right.

When the new bit = 0: after the new bit (0) arrives with timestamp 101, there is no change in the buckets.

But if the new bit that arrives is 1, then we need to make changes:

· Create a new bucket with the current timestamp and size 1.

· If there was only one bucket of size 1, then nothing more needs to be done. However, if there are now three buckets of size 1 (the buckets with timestamps 100, 102, 103 in the second step of the picture), we fix the problem by combining the leftmost (earliest) two buckets of size 1 (purple box).

· To combine any two adjacent buckets of the same size, replace them by one bucket of twice the size. The timestamp of the new bucket is the timestamp of the rightmost of the two buckets.

Now, sometimes combining two buckets of size 1 may create a third bucket of size 2. If so, we combine the leftmost two buckets of size 2 into a bucket of size 4. This process may ripple through the bucket sizes.

How long can you continue doing this?

A bucket is kept as long as (current timestamp − leftmost bucket timestamp in the window) < N (= 24 here). E.g. 103 − 87 = 16 < 24, so the bucket is kept; once the difference is greater than or equal to N, the bucket is dropped.

Finally, the answer to the query: how many 1's are there in the last 20 bits?

Counting the sizes of the buckets that fall within the last 20 bits, we estimate that there are 11 ones.
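
To tie the rules and the merging procedure together, here is a minimal Python sketch of DGIM bucket maintenance and of answering the "how many 1's in the last k bits?" query. The class name, the representation of a bucket as a (timestamp-of-most-recent-1, size) pair, and the toy stream are assumptions for illustration; a real implementation would also store timestamps modulo N to bound memory.

```python
# Minimal DGIM sketch: buckets are (timestamp of most recent 1, size) pairs,
# kept newest-first, with at most two buckets of any one size.
class DGIM:
    def __init__(self, window_size):
        self.N = window_size
        self.time = 0
        self.buckets = []                    # newest bucket first

    def add_bit(self, bit):
        self.time += 1
        # Drop the oldest bucket once its most recent 1 leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.N:
            self.buckets.pop()
        if bit == 0:
            return                           # a 0 never creates or merges buckets
        # A 1 creates a new bucket of size 1 with the current timestamp.
        self.buckets.insert(0, (self.time, 1))
        # Whenever three buckets share a size, merge the two oldest of them
        # into one bucket of twice the size; this may ripple to larger sizes.
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 1][1] == self.buckets[i + 2][1]:
                ts = self.buckets[i + 1][0]          # timestamp of the more recent of the two
                merged = (ts, self.buckets[i + 1][1] * 2)
                self.buckets[i + 1:i + 3] = [merged]
            else:
                i += 1

    def count_ones(self, k):
        """Estimate the number of 1's among the last k bits."""
        total, oldest = 0, 0
        for ts, size in self.buckets:
            if ts > self.time - k:           # bucket is (at least partly) inside the last k bits
                total += size
                oldest = size
        return total - oldest // 2           # count only half of the oldest such bucket

# Toy usage: feed a short 0/1 stream and query the window.
dgim = DGIM(window_size=24)
for b in [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1]:
    dgim.add_bit(b)
print(dgim.count_ones(8))                    # estimate of 1's in the last 8 bits
```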
5. Explain sampling and Filtering methods in Data Stream processing? (U-II)

6. Draw a neat sketch of HDFS Architecture and explain the building blocks (NameNode and DataNode) of Hadoop. (U-III)
In the Hadoop Distributed File System (HDFS) architecture, **NameNode** and
**DataNodes** play crucial roles in managing and storing data.

### 1. **NameNode**:
- **Role**: The NameNode is the master node in HDFS that manages the filesystem
namespace and controls access to files. It does not store the actual data but keeps metadata
about the file structure, such as:
- File names
- Directories
- Permissions
- Block locations
- **Responsibilities**:
- **Metadata management**: It tracks which DataNodes hold the data blocks for a file
and maintains the directory tree of the file system.
- **Block mapping**: It maps file names to the corresponding blocks stored in
DataNodes.
- **Replication control**: It monitors block replication to ensure fault tolerance.
- **Fault tolerance**: The NameNode is critical to HDFS operation, so it is usually paired with a Secondary NameNode (which periodically checkpoints the filesystem metadata but is not a hot backup) or, in high-availability configurations, a Standby NameNode that takes over on failover.

### 2. **DataNodes**:
- **Role**: DataNodes are the worker nodes that actually store the data blocks. Each file is
divided into blocks (usually 128 MB or 64 MB), and these blocks are distributed across
multiple DataNodes.
- **Responsibilities**:
- **Block storage**: DataNodes store the data blocks and serve read/write requests from
clients.
- **Heartbeat and block report**: DataNodes regularly send heartbeats and block reports
to the NameNode to report their status and the blocks they are storing. This helps the
NameNode track the health of the system and determine if data needs to be re-replicated due
to failure.
- **Replication**: Blocks are replicated across multiple DataNodes (default replication
factor is 3) to provide fault tolerance. If one DataNode fails, the data is still available from the
replicas.

### **Interaction**:
- When a client wants to read or write a file, it first communicates with the NameNode to
get metadata and block locations.
- The client then interacts directly with the DataNodes to read or write the actual data blocks (a toy sketch of this flow is given below).
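
The interaction above can be illustrated with a toy Python sketch. This is not the real Hadoop API; the class, file name, and round-robin replica placement are assumptions for illustration. The point is that the NameNode hands out only metadata and block locations, and the client then reads the blocks from the DataNodes that hold the replicas.

```python
# Toy model of the NameNode's metadata role and the client read path.
BLOCK_SIZE = 128 * 1024 * 1024          # default HDFS block size (128 MB)
REPLICATION = 3                         # default replication factor

class NameNode:
    def __init__(self):
        self.block_map = {}             # filename -> list of (block_id, replica DataNodes)

    def add_file(self, name, size, datanodes):
        """Split a file into blocks and record replica placements (metadata only)."""
        blocks = []
        for block_no in range((size + BLOCK_SIZE - 1) // BLOCK_SIZE):
            block_id = f"{name}#blk_{block_no}"
            # Place each replica on a different DataNode (round-robin for illustration).
            replicas = [datanodes[(block_no + r) % len(datanodes)] for r in range(REPLICATION)]
            blocks.append((block_id, replicas))
        self.block_map[name] = blocks

    def get_block_locations(self, name):
        return self.block_map[name]     # the NameNode never serves the file data itself

# Client read path: ask the NameNode for block locations, then read from DataNodes.
namenode = NameNode()
namenode.add_file("/logs/app.log", size=300 * 1024 * 1024,
                  datanodes=["dn1", "dn2", "dn3", "dn4"])
for block_id, replicas in namenode.get_block_locations("/logs/app.log"):
    print(block_id, "-> read from any of", replicas)
```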
