Bda Mid Ans

1. Define Big Data and explain the components of the Big Data ecosystem. (U-I)

BIG DATA:

“Big data refers to huge volumes of data sets whose size is beyond the ability of typical traditional database software tools to capture, store, manage and analyze.”

Big data refers to large, complex, and diverse sets of information that grow at ever-increasing rates. It includes data that is too vast, varied, and fast-changing for traditional data-processing methods to handle efficiently. The three key characteristics of big data are often described as:

Volume: the sheer amount of data generated from multiple sources, such as social media, IoT devices, and sensors.

Velocity: the speed at which this data is generated, collected, and processed in real time or near real time.

Variety: the diverse types of data, including structured, semi-structured, and unstructured data, such as text, images, videos, and log files.

COMPONENTS OF BIG DATA ECOSYSTEM:

1. INGESTION:

Ingestion is the process of bringing data into the data system we are building.

• The ingestion layer is the very first step of pulling in raw data.

• Data comes from a variety of sources: internal sources, relational and non-relational databases, social media, and emails.

Types of ingestion:

• There are two kinds of ingestion (a minimal sketch contrasting the two paths follows this list):

• Batch, in which large groups of data are gathered and delivered together.

• A batch layer (cold path) stores all of the incoming data in its raw form and performs
batch processing on the data. The result of this processing is stored as a batch view.

• Streaming, which is a continuous flow of data. This is necessary for real-time data analytics.

• A speed layer (hot path) analyzes data in real time. This layer is designed for low
latency (minimum delay), at the expense of accuracy.
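
Below is a minimal Python sketch contrasting the two ingestion paths described above. The event list, function names, and the simple counting logic are illustrative assumptions, not part of the notes: the batch (cold) path collects a group of records and computes one result over the whole set, while the streaming (hot) path updates a low-latency result for every event as it arrives.

```python
# Toy contrast of batch (cold path) vs streaming (hot path) ingestion.
# All names here are illustrative; real systems would use tools such as
# a distributed file store for the cold path and a message broker for the hot path.

def batch_ingest(records):
    """Cold path: gather a large group of records and process them together."""
    raw_store = list(records)                 # store everything in raw form
    return {"total_events": len(raw_store)}   # batch view over the whole set

def stream_ingest(event_source):
    """Hot path: process each event as it arrives, for low-latency results."""
    running_count = 0
    for event in event_source:
        running_count += 1                    # update the result per event
        yield running_count                   # partial result available immediately

events = [{"id": i} for i in range(5)]
print(batch_ingest(events))                   # one result after the whole batch
for partial in stream_ingest(iter(events)):
    print("running count:", partial)          # a result after every event
```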

2. DATA STORAGE (DATA WAREHOUSE VS DATA LAKE):

• Data for batch processing operations is typically stored in a distributed file store that
can hold high volumes of large files in various formats.

• This kind of store is often called a data lake. Options for implementing this storage
include Azure Data Lake Store in Azure Storage.

• Storage is where the converted data is stored in a data lake or warehouse and
eventually processed.

• The data lake/warehouse is the most essential component of a big data ecosystem.

• Data in the data lake/warehouse should contain only thorough, relevant data to make insights as valuable as possible.

• It must be efficient, with as little redundancy as possible, to allow for quicker processing.

3. BIG DATA ANALYTICS:

• In the analysis layer, data gets passed through several tools, shaping it into actionable
insights.

• There are four types of analytics on big data:

• Diagnostic: Explains why a problem is happening.

• Descriptive: Describes the current state of a business through historical data.

• Predictive: Projects future results based on historical data.

• Prescriptive: Takes predictive analytics a step further by projecting the best future actions to take.

4. CONSUMPTION (END USER):


• The final big data component is presenting the information in a format useful to the
end-user.

• This can be in the form of:

• tables,

• advanced visualizations, or even single numbers if requested.

• The most important thing in this layer is making sure the intent and meaning of the
output is understandable.

2. Write briefly about Intelligent Data Analysis (U-I)

• Intelligent data analysis refers to the use of analysis, classification, conversion, extraction, organization, and reasoning methods to extract useful knowledge from data.

• This intelligent data analysis process generally consists of:

1. the data preparation stage,

2. the data mining stage,

3. the result validation stage, and

4. the result explanation stage.

• Data preparation involves the integration of the required data into a dataset that will be used for data mining.

• Data mining involves examining large databases in order to generate new information and patterns.

• Result validation involves the verification of the accuracy and patterns produced by data mining algorithms.

• Result explanation involves the intuitive communication of results using visualization methods (a sketch illustrating these four stages follows this list).
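
As a concrete illustration of the four stages, here is a minimal sketch using scikit-learn on a toy dataset. The dataset, the choice of k-means clustering, and the silhouette-score check are assumptions made for this example, not something specified in the notes.

```python
# Illustrative walk through the four IDA stages (toy data, assumed tooling).
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 1. Data preparation: integrate and scale the dataset used for mining.
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# 2. Data mining: examine the data to generate new patterns (here, clusters).
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# 3. Result validation: verify the quality of the patterns produced.
score = silhouette_score(X_scaled, model.labels_)

# 4. Result explanation: communicate the result in an understandable form.
print(f"Found 3 clusters, silhouette score = {score:.2f}")
```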

Five major components of IDA:

1. descriptive data,

2. prescriptive data,

3. diagnostic data,

4. decisive data, and

5. predictive data.
• Some industries with the greatest need for data intelligence include:

1) cyber security,

2) finance,

3) health,

4) insurance, and

5) law enforcement.

Intelligent data capture technology is a valuable application in these industries for transforming print documents or images into meaningful data.

3. Explain the Stream Data Model with a neat sketch and explain how stream processing is done in data streams. (U-II)

STREAM DATA MODEL:

STREAMING DATA ARCHITECTURE:

• A streaming data architecture is an information technology framework that puts the focus on processing data in motion and treats extract-transform-load (ETL) batch processing as just one more event in a continuous stream of events.

• This type of architecture has three basic components –

1) An aggregator that gathers event streams and batch files from a variety of data sources,

2) A broker that makes data available for consumption and

3) An analytics engine that analyzes the data, correlates values and blends streams together.

Message broker:

• Message brokers are used to send a stream of events from the producer to consumers through a push-based mechanism. The message broker runs as a server, with producers and consumers connecting to it as clients. Producers write events to the broker and consumers receive them from the broker.

1. This is the element that takes data from a source, called a producer, translates it into a standard message format, and streams it on an ongoing basis. Other components can then listen in and consume the messages passed on by the broker.

2. Data events from one or more message brokers must be aggregated, transformed, and structured before the data can be analysed with SQL-based analytics tools. This is done by an ETL tool or platform that receives queries from users, fetches events from message queues, and then applies the query to generate a result.

The result may be an API call, an action, a visualization, an alert, or in some cases a new data
stream. Examples of open-source ETL tools for streaming data are Apache Storm, Spark
Streaming, and WSO2 Stream Processor.
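
A minimal sketch of the producer → broker → consumer flow described above, using Python's standard queue and threads as a stand-in for a real broker such as Kafka. The event fields, the end-of-stream marker, and the sum aggregation are assumptions for illustration only.

```python
# Toy event pipeline: a producer writes events to a "broker", and a consumer
# (playing the role of the ETL/stream processor) aggregates them for analytics.
import queue
import threading

broker = queue.Queue()                       # stand-in for a real message broker

def producer():
    for i in range(5):
        broker.put({"event_id": i, "value": i * 10})  # write events to the broker
    broker.put(None)                                  # assumed end-of-stream marker

def consumer():
    total = 0
    while True:
        event = broker.get()                 # events are delivered via the broker
        if event is None:
            break
        total += event["value"]              # aggregate/transform before analytics
    print("aggregated value:", total)        # result of the "query"

producer_thread = threading.Thread(target=producer)
consumer_thread = threading.Thread(target=consumer)
producer_thread.start(); consumer_thread.start()
producer_thread.join(); consumer_thread.join()
```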

3. Data Analytics / Serverless Query Engine:

• Once streaming data has been prepared for consumption by the stream processor, it must be analyzed to provide value. There are many different approaches to streaming data analytics, and a variety of tools are commonly used for this purpose.


4. Estimating moments and sliding windows in a stream (DGIM algorithm) (U-II)

Sliding Windows

A sliding window is a useful model of stream processing in which the queries are about a window of length N – the N most recent elements received. In certain cases N is so large that the data cannot be stored in memory, or even on disk. The sliding window model is also known as windowing.

Consider a sliding window of length N=6 on a single stream, as shown in Figure 1. As the stream content varies over time, the sliding window highlights the new stream elements.

Figure 1. Sliding window on a stream


Example 1: Consider Amazon online transactions. For every product X we keep a 0/1 stream recording whether that product was sold in the n-th transaction. A query such as "how many times have we sold X in the last k sales?" can be answered using the sliding window concept.
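
A minimal sketch of this example, assuming the 0/1 sales stream for product X fits in memory for a small N: a fixed-length deque keeps only the N most recent elements, so old transactions fall out of the window automatically. The stream values and window length used here are made up for illustration.

```python
# Sliding window over a 0/1 sales stream for product X (toy values).
from collections import deque

N = 6                                  # window length, as in the example above
window = deque(maxlen=N)               # elements older than N fall off automatically

def new_transaction(sold_x):
    """Record 1 if product X was sold in this transaction, else 0."""
    window.append(1 if sold_x else 0)

def sold_in_last_k(k):
    """How many times have we sold X in the last k sales? (k <= N)"""
    return sum(list(window)[-k:])      # count the 1's among the k most recent bits

for bit in [1, 0, 1, 1, 0, 1, 0, 1]:   # toy stream of transactions
    new_transaction(bit)
print(sold_in_last_k(4))               # -> 2 (sold twice in the last 4 sales)
```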

DGIM algorithm (Datar-Gionis-Indyk-Motwani Algorithm)

The DGIM algorithm is designed to count the number of 1's in a data stream. It uses O(log² N) bits to represent a window of N bits and allows the number of 1's in the window to be estimated with an error of no more than 50%; in other words, the estimate is within 50% of the true count.

In the DGIM algorithm, each bit that arrives has a timestamp for the position at which it arrives: the first bit has timestamp 1, the second bit has timestamp 2, and so on. The positions are interpreted relative to the window size N (the window size is usually taken as a multiple of 2). The window is divided into buckets consisting of 1's and 0's.

RULES FOR FORMING THE BUCKETS:

1. The right end of the bucket should always be a 1 (if it ends with a 0, that 0 is not counted in any bucket). E.g. 1001011 → a bucket of size 4, having four 1's and ending with a 1 at its right end.

2. Every bucket should have at least one 1, else no bucket can be formed.

3. All bucket sizes should be powers of 2.

4. The buckets cannot decrease in size as we move to the left (they appear in increasing order of size towards the left).

Let us take an example to understand the algorithm: estimating the number of 1's and counting the buckets in a given data stream.

The picture shows how we can form the buckets, based on the number of ones, by following the rules above.

In the given data stream, let us assume each new bit arrives from the right.

When the new bit = 0: after the new bit (0) arrives with timestamp 101, there is no change in the buckets.

But if the new bit that arrives is 1, then we need to make changes:

· Create a new bucket with the current timestamp and size 1.

· If there was only one bucket of size 1, then nothing more needs to be done. However, if there are now three buckets of size 1 (the buckets with timestamps 100, 102, 103 in the second step of the picture), we fix the problem by combining the leftmost (earliest) two buckets of size 1 (purple box).

· To combine any two adjacent buckets of the same size, replace them by one bucket of twice the size. The timestamp of the new bucket is the timestamp of the rightmost of the two buckets.

Now, sometimes combining two buckets of size 1 may create a third bucket of size 2. If so, we combine the leftmost two buckets of size 2 into a bucket of size 4. This process may ripple through the bucket sizes.

How long can you continue doing this?

A bucket is kept as long as (current timestamp − leftmost bucket timestamp in the window) < N (= 24 here). E.g. 103 − 87 = 16 < 24, so the bucket is kept; once the difference is greater than or equal to N, the bucket is dropped.

Finally, the answer to the query: how many 1's are there in the last 20 bits?

Counting the sizes of the buckets that fall within the last 20 bits, we estimate that there are 11 ones.
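
To tie the rules and the merging procedure together, here is a minimal Python sketch of DGIM bucket maintenance and of answering the "how many 1's in the last k bits?" query. The class name, the representation of a bucket as a (timestamp-of-most-recent-1, size) pair, and the toy stream are assumptions for illustration; a real implementation would also store timestamps modulo N to bound memory.

```python
# Minimal DGIM sketch: buckets are (timestamp of most recent 1, size) pairs,
# kept newest-first, with at most two buckets of any one size.
class DGIM:
    def __init__(self, window_size):
        self.N = window_size
        self.time = 0
        self.buckets = []                    # newest bucket first

    def add_bit(self, bit):
        self.time += 1
        # Drop the oldest bucket once its most recent 1 leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.N:
            self.buckets.pop()
        if bit == 0:
            return                           # a 0 never creates or merges buckets
        # A 1 creates a new bucket of size 1 with the current timestamp.
        self.buckets.insert(0, (self.time, 1))
        # Whenever three buckets share a size, merge the two oldest of them
        # into one bucket of twice the size; this may ripple to larger sizes.
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 1][1] == self.buckets[i + 2][1]:
                ts = self.buckets[i + 1][0]          # timestamp of the more recent of the two
                merged = (ts, self.buckets[i + 1][1] * 2)
                self.buckets[i + 1:i + 3] = [merged]
            else:
                i += 1

    def count_ones(self, k):
        """Estimate the number of 1's among the last k bits."""
        total, oldest = 0, 0
        for ts, size in self.buckets:
            if ts > self.time - k:           # bucket is (at least partly) inside the last k bits
                total += size
                oldest = size
        return total - oldest // 2           # count only half of the oldest such bucket

# Toy usage: feed a short 0/1 stream and query the window.
dgim = DGIM(window_size=24)
for b in [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1]:
    dgim.add_bit(b)
print(dgim.count_ones(8))                    # estimate of 1's in the last 8 bits
```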
5. Explain sampling and Filtering methods in Data Stream processing? (U-II)

6. Draw a neat sketch of HDFS Architecture and explain the building blocks (NameNode and DataNode) of Hadoop. (U-III)
In the Hadoop Distributed File System (HDFS) architecture, **NameNode** and
**DataNodes** play crucial roles in managing and storing data.

### 1. **NameNode**:
- **Role**: The NameNode is the master node in HDFS that manages the filesystem
namespace and controls access to files. It does not store the actual data but keeps metadata
about the file structure, such as:
- File names
- Directories
- Permissions
- Block locations
- **Responsibilities**:
- **Metadata management**: It tracks which DataNodes hold the data blocks for a file
and maintains the directory tree of the file system.
- **Block mapping**: It maps file names to the corresponding blocks stored in
DataNodes.
- **Replication control**: It monitors block replication to ensure fault tolerance.
- **Fault tolerance**: The NameNode is critical to HDFS operation, so it is usually paired with a Secondary NameNode (which periodically checkpoints the filesystem metadata but is not a hot backup) or, in high-availability configurations, a Standby NameNode that takes over on failover.

### 2. **DataNodes**:
- **Role**: DataNodes are the worker nodes that actually store the data blocks. Each file is
divided into blocks (usually 128 MB or 64 MB), and these blocks are distributed across
multiple DataNodes.
- **Responsibilities**:
- **Block storage**: DataNodes store the data blocks and serve read/write requests from
clients.
- **Heartbeat and block report**: DataNodes regularly send heartbeats and block reports
to the NameNode to report their status and the blocks they are storing. This helps the
NameNode track the health of the system and determine if data needs to be re-replicated due
to failure.
- **Replication**: Blocks are replicated across multiple DataNodes (default replication
factor is 3) to provide fault tolerance. If one DataNode fails, the data is still available from the
replicas.

### **Interaction**:
- When a client wants to read or write a file, it first communicates with the NameNode to
get metadata and block locations.
- The client then interacts directly with the DataNodes to read or write the actual data blocks (a toy sketch of this flow is given below).
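
The interaction above can be illustrated with a toy Python sketch. This is not the real Hadoop API; the class, file name, and round-robin replica placement are assumptions for illustration. The point is that the NameNode hands out only metadata and block locations, and the client then reads the blocks from the DataNodes that hold the replicas.

```python
# Toy model of the NameNode's metadata role and the client read path.
BLOCK_SIZE = 128 * 1024 * 1024          # default HDFS block size (128 MB)
REPLICATION = 3                         # default replication factor

class NameNode:
    def __init__(self):
        self.block_map = {}             # filename -> list of (block_id, replica DataNodes)

    def add_file(self, name, size, datanodes):
        """Split a file into blocks and record replica placements (metadata only)."""
        blocks = []
        for block_no in range((size + BLOCK_SIZE - 1) // BLOCK_SIZE):
            block_id = f"{name}#blk_{block_no}"
            # Place each replica on a different DataNode (round-robin for illustration).
            replicas = [datanodes[(block_no + r) % len(datanodes)] for r in range(REPLICATION)]
            blocks.append((block_id, replicas))
        self.block_map[name] = blocks

    def get_block_locations(self, name):
        return self.block_map[name]     # the NameNode never serves the file data itself

# Client read path: ask the NameNode for block locations, then read from DataNodes.
namenode = NameNode()
namenode.add_file("/logs/app.log", size=300 * 1024 * 1024,
                  datanodes=["dn1", "dn2", "dn3", "dn4"])
for block_id, replicas in namenode.get_block_locations("/logs/app.log"):
    print(block_id, "-> read from any of", replicas)
```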
