
Module 4

This presentation from St. Francis Institute of Technology covers the fundamentals of Big Data Analytics, specifically focusing on Mining Big Data Streams. It discusses the Stream Data Model, the need for DataStream-Management Systems (DSMS), types of queries, and various applications in fields like finance and network analysis. Additionally, it addresses challenges in data stream query processing, including memory requirements and approximate query answering techniques.

Uploaded by

Biya Rahul
Copyright: © All Rights Reserved

The material in this presentation belongs to St. Francis Institute of Technology and is solely for educational purposes.

Distribution and modifications of the content is prohibited.

BIG DATA ANALYTICS (BDA)


ITDO8011

Subject In-charge
Sonali Suryawanshi
Assistant Professor, Department of Information Technology, SFIT
Room No. 328
email: [email protected]

St. Francis Institute of Technology


Department of Information Technology

Module 4: Mining Big Data Streams (CO5)

• The Stream Data Model: A DataStream-Management System, Examples of Stream Sources, Stream Queries, Issues in Stream Processing.
• Sampling Data in a Stream: Sampling Techniques.
• Filtering Streams: The Bloom Filter.
• Counting Distinct Elements in a Stream: The Count-Distinct Problem, The Flajolet-Martin Algorithm, Combining Estimates, Space Requirements.
• Counting Ones in a Window: The Cost of Exact Counts, The Datar-Gionis-Indyk-Motwani (DGIM) Algorithm, Query Answering in the DGIM Algorithm.

Self-learning Topics: Streaming services like Apache Kafka/Amazon Kinesis/Google Cloud Dataflow. Standard Spark Streaming library. Integration with IoT devices to capture real-time stream data.

Mining Big Data Streams

1. What is a data stream?
2. What is the stream model and the block architecture of a DSMS?
3. What are the sources of data streams?
4. What are the challenges in querying/processing large data streams?
5. What are the types of data stream queries, and what are some data stream examples?


The Stream Data Model: A DataStream-Management System

What is a data stream?

A continuous flow of data arriving in sequence.

Reasons:
• Arises from data-intensive applications
• A high-rate, unbounded sequence of data items and records
• Items may or may not be related to or correlated with each other


The Stream Data Model

What is a data stream?

• Transient rather than persistent
• Not known in advance, as it arrives in real time
• Data input is controlled externally, e.g., Google queries, Twitter/Facebook status updates
• Data is infinite and non-stationary (its distribution changes over time)
• Arrives continuously and in large numbers; it is unbounded, rapid, time-variant and unpredictable

The Stream Data Model:

Need for a DSMS

• A traditional RDBMS stores and retrieves data that is static in nature
• It has no notion of time unless time is added as a timestamp
• It is adequate for legacy applications
• It offers no support for online analytics
• Stream data needs immediate processing or storage; otherwise it is lost forever
• Stream data arrives so fast that it is not feasible to store all of it in active storage and then interact with it in the conventional way



The Stream Data Model: A DataStream-Management System (DSMS)

• A new model for managing stream data
• Emphasis on a continuous query language and query evaluation
• Data streams are rapid, continuous and ordered (implicitly by arrival time, or explicitly by sequence of items or timestamps)
• Data arrival is rapid: data that is not processed or stored immediately is lost forever
• Algorithms must work under strict constraints of space and time
• The system should work for long-running, continuous, standing and persistent queries


The Stream Data Model: A DataStream-Management System (DSMS)

Thus the characteristics of a model that attempts to store and retrieve data streams are:
i) Order-based and time-based operations
ii) Approximate summaries of data (may not return exact results)
iii) No operators that must work on the entire data
iv) Backtracking over data is infeasible
v) Must react quickly to any unusual data values
vi) Scalable architecture


The Stream Data Model: A DataStream-Management System (DSMS)

1) Stream input regulator: regulates the input rates, perhaps by dropping packets.
2) Query repository: registers long-running queries, which are placed into groups for shared processing.
3) Query processor: re-optimizes the query plans in response to changing input rates (like for one-time queries).
4) Temporary working storage: used for window queries.
5) Summary storage: holds approximate summary structures.
6) Static storage: holds metadata (e.g., the physical location of each source).

The Stream Data Model: A DataStream-Management System (DSMS)

Types of queries
1) One-time queries: evaluated once over a point-in-time snapshot of the data set, with the answer returned to the user.
• E.g., a stock price checker that alerts when the stock price crosses a particular price point
2) Continuous queries: evaluated continuously as the data streams arrive. Answers are produced over time and always reflect the data stream seen so far (max, avg, count).
• E.g., the maximum price of a particular stock, reported every hour
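
The two query types above can be contrasted with a minimal Python sketch. It is an illustration under stated assumptions, not part of the original slides: the tick layout `(timestamp_seconds, symbol, price)`, the function names `one_time_query` and `hourly_max`, and the assumption that ticks arrive in timestamp order are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical tick layout: (timestamp in seconds, symbol, price).
Tick = Tuple[int, str, float]


def one_time_query(snapshot: Dict[str, float], symbol: str, threshold: float) -> bool:
    """One-time query: evaluated once against a point-in-time snapshot."""
    return snapshot[symbol] > threshold


def hourly_max(ticks: Iterable[Tick], symbol: str) -> Iterator[Tuple[int, float]]:
    """Continuous query: yield (hour_index, max_price) whenever an hour closes.

    Assumes ticks arrive in non-decreasing timestamp order.
    """
    current_hour = None
    current_max = float("-inf")
    for ts, sym, price in ticks:
        if sym != symbol:
            continue
        hour = ts // 3600
        if current_hour is not None and hour != current_hour:
            # The previous hour is complete, so its answer can be emitted.
            yield current_hour, current_max
            current_max = float("-inf")
        current_hour = hour
        current_max = max(current_max, price)
    # Only reached for finite inputs (e.g. tests): flush the last partial hour.
    if current_hour is not None:
        yield current_hour, current_max
```

The one-time query returns a single answer and is done, while the continuous query keeps only two values of state per stock and produces a new answer every hour for as long as the stream runs.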

The Stream Data Model: A DataStream-Management System (DSMS)

Other categories:
• 1) Predefined queries: supplied to the DSMS before any relevant data has arrived; most commonly continuous queries (max, min, avg, sum)
• 2) Ad hoc queries (not known in advance): issued online after the data streams have begun. They can be one-time or continuous. A one-time ad hoc query is asked once about the current state of the stream (for alerts, warnings)


The Stream Data applications :

• Mining query streams
• Mining click streams
• Mining social networks
• Image data
• Telephone call records
• Financial applications
• Network monitoring
• Telecommunication data management
• Manufacturing
• Sensor networks
• Email


The Stream Data applications :

1) Sensor networks:
• Many sensors: a huge source of data, in the form of streams feeding into a central controller
• Situations: constant monitoring of many parameters is required to make important decisions
• Response: alerts and alarms

Need for analysis, aggregation and joins over multiple streams corresponding to various sensors:
- Joins on multiple streams, such as temperature streams and ocean current streams from weather stations, to give alerts or warnings of disasters, since the information changes rapidly with the vagaries of nature
- Monitoring the stream of current power usage statistics reported to a power station, grouping them by location, user type, etc., to manage power distribution efficiently

The Stream Data applications :

2) Network traffic analysis:

A network service provider gathers information about Internet traffic and heavily used routes, to find optimal routes and to identify and predict potential congestion.

The data is also analyzed for fraudulent activities.

I) Intrusion detection system: when a particular server becomes the victim of a denial-of-service attack, the route to it becomes heavily congested
• To identify a denial-of-service attack:
• Check whether the current stream of actions over a time window is similar to a previously identified intrusion on the network
• Check whether several routes over which traffic is moving share several common intermediate nodes, which may indicate congestion on the route

The Stream Data applications :


3) Financial applications:
• Online analysis of stock prices
• Making hold or sell decisions
• Requires quickly identifying correlations and fast-changing trends
• Forecasting future valuations, as data is constantly arriving from several sources such as news and current stock movements
Typical queries:
• 1) Find all stocks priced between $50 and $200 that are showing very large buying in the last one hour, based on some federal bank news about tax cuts for a particular industry
• 2) Find all stocks trading above their 100-day moving average by more than 10% and with volume exceeding a million shares
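
Typical query 2 can be written as a short filter. The sketch below is illustrative only: the per-stock record layout `(daily_closes, latest_price, volume)` and the names `trades_above_average` and `screen` are assumptions introduced here, and a real DSMS would maintain the moving average incrementally rather than recompute it.

```python
from statistics import fmean
from typing import Dict, List, Sequence, Tuple

# Hypothetical record per stock: (daily closing prices, latest price, volume).
StockRecord = Tuple[Sequence[float], float, int]


def trades_above_average(closes: Sequence[float], price: float, volume: int,
                         window: int = 100, margin: float = 0.10,
                         min_volume: int = 1_000_000) -> bool:
    """True if price exceeds the `window`-day moving average by more than
    `margin` and the volume exceeds `min_volume` shares."""
    if len(closes) < window:
        return False  # not enough history to form the moving average
    moving_average = fmean(closes[-window:])
    return price > moving_average * (1 + margin) and volume > min_volume


def screen(stocks: Dict[str, StockRecord]) -> List[str]:
    """Return the symbols that satisfy the query, in sorted order."""
    return sorted(sym for sym, (closes, price, volume) in stocks.items()
                  if trades_above_average(closes, price, volume))
```

For example, a stock with 100 closes at 100.0, a latest price of 111.0 and a volume of 1,500,000 shares passes, while the same stock at 105.0 does not.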

The Stream Data applications :

4) Transaction log analysis:
• Online mining of web usage logs
• Online mining of telephone call records
• Online mining of automated bank machine transactions
• The goal is to find interesting customer behaviour patterns
• Suspicious spending behaviour may indicate fraudulent transactions
• E.g., analyzing the buying pattern of a user at a website to plan advertising campaigns and recommendations
• E.g., monitoring locations continuously
• E.g., monitoring average spends for credit cards to identify potential fraud


Issues in data stream query processing

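1. Unbounded Memory Requirements:

● Data streams are potentially unbounded, so the memory needed to compute an exact answer can also grow without bound
● External memory is not suitable for real-time analysis because of its high latency
● External memory is likewise not suitable for continuous queries
● Thus there is a need for algorithms that confine themselves to main memory without accessing disk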


Issues in data stream query processing

2. Approximate Query Answering:

● Techniques for data reduction and synopsis construction, including sketches, random sampling, histograms and wavelets
● Correlated aggregate queries over data streams
● Small-space summaries over data streams that provide approximate answers


Issues in data stream query processing

3) Sliding Windows:

● Insights based on the recent past are often more informative and useful than insights based on stale data
● The window must be kept fresh, deleting the oldest elements as new ones come in
● A window can be the most recent n elements of a stream, for some n, or all the elements that arrived within the last t time units, for example, 1 month
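
Both window definitions can be sketched with Python's `collections.deque`. This is a minimal illustration, assuming in-order arrival; the class names `CountWindow` and `TimeWindow` are hypothetical and not taken from the slides.

```python
from collections import deque


class CountWindow:
    """Holds the most recent n elements of the stream."""

    def __init__(self, n: int) -> None:
        # A bounded deque discards the oldest element automatically.
        self.items = deque(maxlen=n)

    def add(self, value) -> None:
        self.items.append(value)


class TimeWindow:
    """Holds the elements that arrived within the last t time units."""

    def __init__(self, t: float) -> None:
        self.t = t
        self.items = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp: float, value) -> None:
        self.items.append((timestamp, value))
        # Keep the window fresh: evict everything older than t time units.
        while self.items and self.items[0][0] <= timestamp - self.t:
            self.items.popleft()

    def values(self) -> list:
        return [value for _, value in self.items]
```

The count-based window uses fixed memory, whereas the time-based window grows and shrinks with the arrival rate, which is one reason exact answers over long time windows can be costly.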


Issues in data stream query processing

4) Batch Processing, Sampling and Synopses

● One technique for producing approximate answers is to avoid looking at every data element
● Batch processing: data elements are buffered as they arrive and the answer to the query is computed periodically; it represents the exact answer at a point in the recent past rather than the exact answer at the present moment
● Sampling: some data points must be skipped altogether
● Synopsis: design an approximate data structure that maintains a small synopsis or sketch of the data rather than an exact representation
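
The batch-processing idea can be illustrated with a small Python sketch. It covers batch processing only, not sampling or synopses, and the class name `BatchedAverage` and the `batch_size` parameter are assumptions made for this example.

```python
class BatchedAverage:
    """Average of a stream, recomputed only when a full batch has arrived."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.buffer = []   # elements that arrived since the last computation
        self.total = 0.0   # sum of all elements already folded in
        self.count = 0     # number of elements already folded in

    def add(self, value: float) -> None:
        self.buffer.append(value)
        if len(self.buffer) >= self.batch_size:
            self._process_batch()

    def _process_batch(self) -> None:
        self.total += sum(self.buffer)
        self.count += len(self.buffer)
        self.buffer.clear()

    def answer(self):
        """Exact average as of the last completed batch (None before that)."""
        return self.total / self.count if self.count else None
```

With `batch_size=3`, after the values 1, 2, 3, 10 the answer is 2.0: it is exact for the first three elements, and the 10 still sits in the buffer, so the answer describes the recent past rather than the present moment.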


Issues in data stream query processing

5) Blocking Operators

● A blocking operator is a query operator that is unable to produce an answer until it has seen its entire input
● Examples are aggregation operators such as SUM, COUNT, MIN, MAX and AVG
● Incorporating blocking operators into the query tree poses problems
● Because data streams are infinite, a blocking operator would never be able to produce any output
● Dealing with them effectively is one of the challenges of data stream computation
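
The difference between a blocking aggregate and one that emits partial results can be sketched in Python. This is one possible way of working around blocking, shown under the assumption that partial answers are acceptable; the function names `blocking_avg` and `running_avg` are hypothetical.

```python
from typing import Iterable, Iterator


def blocking_avg(values: Iterable[float]) -> float:
    """Blocking AVG: needs the entire input before it can answer.

    On an infinite stream the call to list() never returns.
    """
    data = list(values)
    return sum(data) / len(data)


def running_avg(values: Iterable[float]) -> Iterator[float]:
    """Non-blocking variant: emits the average of the stream seen so far
    after every element, using constant memory."""
    total = 0.0
    for n, value in enumerate(values, start=1):
        total += value
        yield total / n
```

Applied to the infinite stream 1, 2, 3, ... (for instance `itertools.count(1)`), `running_avg` yields 1.0, 1.5, 2.0, 2.5 and so on, while `blocking_avg` would never produce any output.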


Sampling Data in a Stream : Sampling Techniques

01 Reservoir Sampling
02 Biased Reservoir Sampling
03 Concise Sampling
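
As an illustration of the first technique only, here is a minimal Python sketch of classic reservoir sampling (often called Algorithm R), which keeps a uniform random sample of k items from a stream of unknown length. The slides do not give an implementation, so the function name `reservoir_sample` and the optional `rng` parameter are assumptions; biased reservoir sampling and concise sampling are not shown.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform random sample of up to k items from the stream."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

The memory used is fixed at k items regardless of how long the stream runs, which matches the main-memory constraint discussed earlier in the deck.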
