BDA Mod 3


SWE2011 - BIG DATA ANALYTICS

FALL 2024-25
Dr.J.Jagannathan
Assistant Professor Sr.Grade 1
School of Computer Science Engineering and Information Systems
Vellore Institute of Technology - Vellore
07-10-2024
MODULE:III

Stream Data Mining


Introduction to Streams Concepts – Stream data model and architecture -
Stream Computing, Sampling data in a stream – Filtering streams – Counting
distinct elements in a stream – Estimating moments – Counting oneness in a
window – Decaying window – Real time Analytics Platform(RTAP) applications.

4VS OF BIG DATA

INFINITE DATA

• High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
• Graph data: PageRank, SimRank, Community Detection, Spam Detection
• Infinite data: Filtering data streams, Queries on streams, Web advertising
• Machine learning: SVM, Decision Trees, Perceptron, kNN
• Apps: Recommender systems, Association Rules, Duplicate document detection
STREAM CONCEPTS

 Recently, there’s been a rise in applications that handle large amounts of data.
Instead of storing this data permanently, these applications work with data
that comes in continuously and changes rapidly.
 This type of data is called “transient data streams.”
 Examples include:
• Financial transactions
• Network monitoring
• Sensor data from devices
• Web applications
• Manufacturing
STREAM CONCEPTS

 In the data stream model, individual data items may be relational
tuples, e.g., network measurements, call records, web page visits, sensor
readings, and so on.
 However, their continuous arrival in multiple, rapid, time-varying,
possibly unpredictable and unbounded streams appears to yield
some fundamentally new research problems.
DATA STREAM MODEL

 A data stream is a real-time, continuous and ordered sequence of items.
 It is not possible to control the order in which the items arrive, nor is it
feasible to locally store a stream in its entirety in any memory device.
 Further, a query over streams runs continuously over a
period of time and returns new results as new data arrives.
 Therefore these are known as long-running, continuous, standing
and persistent queries.
CHARACTERISTICS

1. The data model and query processor must allow both order based and time based
operations.
2. The inability to store a complete stream indicates that some approximate summary
structures must be used.
3. Streaming query plans must not use any operators that require the entire input before
any results are produced. Such operators will block the query processor indefinitely.
4. Any query that requires backtracking over a data stream is infeasible. This is due to
storage and performance constraints imposed by a data stream.
5. Applications that monitor streams in real time must react quickly to unusual data values.
6. Scalability requirements dictate that parallel and shared execution of many continuous
queries must be possible.
ARCHITECTURE

 An input monitor may regulate the input streams, perhaps by dropping packets.
 Data are typically stored in three partitions:
1. Temporary working storage (for window queries)
2. Summary storage (for stream synopses)
3. Static storage for meta-data (e.g., physical location of each source)
ARCHITECTURE

 Long running queries are registered in the query repository and placed into groups
for shared processing.
 The query processor communicates with the input monitor and may reoptimize the
query plans in response to changing input rates. Results are streamed to the user or
temporarily buffered.
 A Data-Stream-Management System: in analogy to a database-management system,
we can view a stream processor as a kind of data-management system.
ARCHITECTURE

 Any number of streams can enter the system.


 Each stream can provide elements at its own schedule; they need not
have the same data rates or data types, and the time between elements
of one stream need not be uniform.
 The fact that the rate of arrival of stream elements is not under the
control of the system distinguishes stream processing from the
processing of data that goes on within a database-management system.

ARCHITECTURE

• Streams may be archived in a large archival store, but we assume it is not possible to
answer queries from the archival store. It could be examined only under special
circumstances using time-consuming retrieval processes.
• There is also a working store, into which summaries or parts of streams may be
placed, and which can be used for answering queries.
• The working store might be disk, or it might be main memory, depending on how fast
we need to process queries.
EXAMPLES OF DATA STREAM APPLICATIONS

 Sensor Networks – Alerts and alarms generated as a response to information received from sensors.
Example – Perform joins of several streams, such as temperature and ocean current streams at weather
stations, to give alerts or warnings such as cyclone or tsunami.
 Financial Applications – Online analysis of stock prices and making hold or sell decisions requires
quickly identifying correlations and fast-changing trends.
 Log Analysis – Online mining of web usage logs, telephone call records and ATM transactions are
examples of data streams. The goal is to find customer behavior patterns. Example – Identify the current
buying patterns of users on a website and plan advertising campaigns and recommendations.
 Image Data – Satellites often send down to earth streams consisting of many terabytes of images per
day. Surveillance cameras produce images with lower resolution than satellites, but there can be many
of them, each producing a stream of images at intervals like one second.
KINDS OF STREAM PROCESSING TECHNIQUES
• Sampling data in a Stream
– To create a sample of a stream that is usable for a class of queries
• Filtering Data Stream
– To allow particular set of elements by filtering the stream arrival
• Counting distinct elements in a Stream
– To estimate the number of different elements appearing in a stream
• Estimating moments
– Involves the distribution of frequencies of different elements in a stream
• Counting Ones in a Window
– Counting the number of 1’s in the binary stream

STREAM COMPUTING

 Stream Queries
 There are two ways that queries get asked about streams.
 One-time queries (a class that includes traditional DBMS queries) are evaluated once over a
point-in-time snapshot of the data.
Example – Alert when a stock crosses over a price point.
 Continuous queries, on the other hand, are evaluated continuously as data streams continue to
arrive.
Example – Aggregation queries like maximum, average, count etc. Maximum price of a stock every hour,
or the number of times a stock gains over a particular point.
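
A minimal sketch of how a continuous aggregation query can be answered without storing the stream: each arriving element updates a few running values, and a fresh answer is available after every arrival. The class name `RunningStats` and the price values are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Aggregates updated once per arriving element; the stream itself is not stored."""
    count: int = 0
    total: float = 0.0
    maximum: float = float("-inf")

    def update(self, price: float) -> None:
        self.count += 1
        self.total += price
        self.maximum = max(self.maximum, price)

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


stats = RunningStats()
for price in [101.5, 99.0, 104.25, 103.0]:  # hypothetical stock ticks
    stats.update(price)
    # A continuous query reports an up-to-date answer after every arrival.
    print(stats.count, stats.maximum, stats.average)
```

A per-hour maximum could be obtained the same way by resetting the object at each hour boundary.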

ISSUES IN STREAM PROCESSING

 Streams often deliver elements very rapidly, so elements must be processed in real time.
 Processing is therefore executed in main memory, without access to secondary storage
(or with only rare access to it).
SAMPLING DATA IN A STREAM

 Process of collecting a representative collection of samples from the entire stream.
 The sample is usually much smaller than the entire stream.
 Aims to retain the significant characteristics and behaviors of the stream.
 Used to estimate / predict many crucial aggregates on the stream.
SAMPLING TECHNIQUES

1. Fixed Proportion Sampling


 Samples fixed proportion of data
 Used when you are aware of the length of data
 Ensures representative sample
 Useful for large volumes
 Less biased than fixed-size sampling
 May lead to under / over representation
SAMPLING TECHNIQUES

1. Fixed Proportion Sampling


 A social media platform wants to analyze the sentiments of its users towards a topic. They
receive millions of tweets per day and use fixed proportion sampling to select a
representative sample. They randomly select 1% of the tweets received each hour,
giving a sample for statistical analysis of user sentiments towards the
topic.
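
One way to sketch fixed proportion sampling is to keep each arriving element independently with probability equal to the target proportion. The function name `proportion_sample`, the synthetic tweet IDs and the seed are assumptions made for this illustration.

```python
import random


def proportion_sample(stream, proportion, seed=None):
    """Yield each element of the stream with probability `proportion`."""
    rng = random.Random(seed)
    for item in stream:
        if rng.random() < proportion:
            yield item


# Hypothetical stream of 100,000 tweet IDs, sampled at 1%.
tweets = (f"tweet-{i}" for i in range(100_000))
sample = list(proportion_sample(tweets, 0.01, seed=42))
print(len(sample))  # close to 1,000, but not exactly 1,000
```

The sample size grows with the stream, which is the main difference from fixed size sampling.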

SAMPLING TECHNIQUES

2. Fixed Size Sampling


 Samples fixed number of data points.
 Does not guarantee representative sample.
 Useful for reducing data volume.
 Can be biased if data is not randomly distributed
 Less effective when data size increases

SAMPLING TECHNIQUES

2. Fixed Size Sampling


Suppose we have a data stream of customer orders for an online store, with 10,000
orders coming in every hour. Using fixed size sampling, we randomly select 1,000 orders
from each hour's data stream for analysis, thus reducing the total number of data points
to process from 10,000 to 1,000 per hour.
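
The slides do not name an algorithm for fixed size sampling. A common choice when the stream length is not known in advance is reservoir sampling (Algorithm R), sketched below under that assumption. The function name and the synthetic order IDs are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k elements from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform integer in [0, n)
            if j < k:                    # happens with probability k / n
                reservoir[j] = item      # replace a random earlier element
    return reservoir


orders = range(10_000)                   # hypothetical hour of order IDs
sample = reservoir_sample(orders, 1_000, seed=7)
print(len(sample))  # 1000
```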

SAMPLING TECHNIQUES

3. Biased Reservoir Sampling


 Used in streams to select a subset of the data in a way that is not uniformly random.
 Can lead to a biased sample that may not be representative of the full dataset.
 The selection of elements is based on a predetermined probability distribution that may be
weighted towards certain elements or groups of elements.
 The probability distribution used for biased reservoir sampling may be based on various factors,
such as the frequency of occurrence of certain types of data or the importance of certain data
points.
 Used when there are constraints on the resources available for sampling, such as limited memory
or computational power.
 It is important to carefully consider the potential biases introduced by this sampling technique and
adjust the analysis accordingly.
SAMPLING TECHNIQUES

3. Biased Reservoir Sampling


 Suppose we have a data stream of product ratings, and we want to select a sample of
ratings to estimate the average rating of a product. However, we know that some users
tend to give higher ratings than others. Using biased reservoir sampling, we can assign a
higher probability of selection to ratings from users who tend to give more accurate
ratings. This way, our sample is more likely to represent the true average rating of the
product.
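
The slides do not specify how the selection probabilities are applied. As one possible illustration, the sketch below uses the Efraimidis-Spirakis weighted reservoir scheme (A-Res), a swapped-in technique in which each item gets the key u^(1/weight) for a uniform random u, and the k items with the largest keys are kept. The weights standing in for "user reliability" are hypothetical.

```python
import heapq
import random


def weighted_reservoir_sample(stream, k, seed=None):
    """Sample k items from (item, weight) pairs; larger weights are more likely kept."""
    rng = random.Random(seed)
    heap = []  # min-heap of (key, arrival index, item); the smallest key is evicted first
    for index, (item, weight) in enumerate(stream):
        key = rng.random() ** (1.0 / weight)
        entry = (key, index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]


# Hypothetical (rating, reliability weight) pairs.
ratings = [(4.0, 5.0), (5.0, 0.5), (3.5, 4.0), (1.0, 0.2), (4.5, 3.0), (2.0, 1.0)]
sample = weighted_reservoir_sample(ratings, 3, seed=11)
print(sample)
```

Because the sample is deliberately non-uniform, any average computed from it reflects the chosen weights.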

SAMPLING TECHNIQUES

4. Concise Sampling
 Goal is to maintain a small reservoir of a fixed size while still achieving representative
sampling of the data stream
 Number of samples that can be stored in memory at a given time is limited, which
can be a challenge when dealing with large data streams.
 Size of the sample may need to be adjusted based on the amount of memory
available to store the data.
 Instead of selecting samples randomly, the sampling algorithm may prioritize choosing
samples with unique or representative values of a particular attribute in the data
stream
SAMPLING TECHNIQUES

4. Concise Sampling
 A bank wants to analyze customer spending habits from a stream of transactions.
 They use concise sampling to choose distinct customer IDs as their attribute.
 The size of the reservoir is limited to 1000 customers.
 They adjust the sample size based on available memory.
 This allows for efficient analysis while maintaining accuracy.

FILTERING STREAMS

 Bloom Filtering
 Space-efficient data structure
 used to check whether an element belongs to a set
 May say that an element belongs to the set when it does not (false
positives are possible)
 Never says that an element is absent when it is in the set (no false
negatives)
 Hence, the recall rate is 100 %
BLOOM FILTERING

 Example
 Insert elements 10 and 7 in the bloom filter of size 5. Consider
these two hash functions:
 h1(x) = x mod 5
 h2(x) = (2x + 6) mod 5
 Comment on the presence of elements 14 and 15.
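
The exercise above can be worked through with a small sketch. The class `BloomFilter` and its method names are illustrative assumptions; the filter size, hash functions and elements are the ones given in the example.

```python
class BloomFilter:
    """Bit array plus a list of hash functions, as in the example above."""

    def __init__(self, size, hash_functions):
        self.size = size
        self.hash_functions = hash_functions
        self.bits = [0] * size

    def add(self, x):
        for h in self.hash_functions:
            self.bits[h(x) % self.size] = 1

    def might_contain(self, x):
        # True only if every hashed position is set; may be a false positive.
        return all(self.bits[h(x) % self.size] for h in self.hash_functions)


def h1(x):
    return x % 5


def h2(x):
    return (2 * x + 6) % 5


bf = BloomFilter(5, [h1, h2])
for element in (10, 7):
    bf.add(element)

print(bf.bits)               # [1, 1, 1, 0, 0]
print(bf.might_contain(14))  # False: h1(14) = 4 is unset, so 14 is definitely absent
print(bf.might_contain(15))  # True: bits 0 and 1 are set, a false positive
```

Element 14 is correctly rejected, while 15 is reported as possibly present even though it was never inserted.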

BLOOM FILTERING
• The underlying concept is to use main memory as a bit array.
• With 1 GB of main memory, we have room for 8 billion
bits.
• Devise a hash function ‘h’, hash each member of the set ‘S’ to a bit, and
set that bit to ‘1’. All the other bits of the array remain ‘0’.
• Suppose ‘S’ has 1 billion members; then approximately
1/8th of the bits will be ‘1’.
• The exact fraction of bits set to ‘1’ will be slightly less than 1/8th,
because two members of ‘S’ may hash to the
same bit.
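
The "slightly less than 1/8th" figure can be checked with a short calculation, assuming the hash function spreads members uniformly over the bits. Here $m$ (number of members) and $n$ (number of bits) are symbols introduced for this sketch.

```latex
% Probability that a given bit is still 0 after hashing m members into n bits:
\[
  \Pr[\text{bit} = 0] = \left(1 - \frac{1}{n}\right)^{m} \approx e^{-m/n}
\]
% With m = 10^9 members and n = 8 \times 10^9 bits:
\[
  \Pr[\text{bit} = 1] \approx 1 - e^{-1/8} \approx 0.1175 < 0.125 = \tfrac{1}{8}
\]
```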

COUNTING DISTINCT PROBLEM
• The data stream consists of elements chosen from a universe
of size N
– Maintain a count of the number of distinct elements seen so
far
• Obvious approach: maintain the set of elements seen so far
– That is, keep a hash table of all the distinct elements seen so
far
– When that set is too large for memory, hashing-based estimation algorithms are used instead
APPLICATIONS
• A Web site gathering statistics on how many unique
users it has seen in each given month.
– The universal set is the set of logins for that site, and a stream
element is generated each time someone logs in.
– This measure is appropriate for a site like Amazon,
where the typical user logs in with their unique login name.

• A Web site like Google, which does not require a login to issue a
search query,
– may be able to identify users only by the IP address from
which they send the query.
– There are about 4 billion IP addresses; sequences of
four 8-bit bytes serve as the universal set in this case.
SOLUTION

• The obvious way to solve the problem is to keep in main memory a list of all the
elements seen so far in the stream.
• Adopt an efficient search structure such as a hash table or search tree, so one
can quickly add new elements and check whether or not the element that just
arrived on the stream was already seen.
• As long as the number of distinct elements is not too great, this structure can fit
in main memory and there is little problem obtaining an exact answer to the
question how many distinct elements appear in the stream.
• When the number of distinct elements is too great for main memory, the count is
estimated instead. Approach: Flajolet-Martin Algorithm
FLAJOLET-MARTIN-ALGORITHM

 Problem of finding distinct elements in a stream of data with repetitions
Applications:
• IP addresses of packets passing through a router
• Motifs in a DNA sequence
• Unique visitors to apps/websites
FLAJOLET-MARTIN-ALGORITHM

 S is the stream of elements with repetitions, and N is the set of distinct elements
 S : {x1, x2, x3, x4 .... xn} , then N : {a1, a2, a3, a4 .... ak} (given k <= n)
 For instance, S : {1,2,4,3,4,3,2,4,1,2} then N : {1,2,3,4} and F : 4
 where F is the total number of distinct elements
COUNTING DISTINCT ELEMENTS IN A STREAM – NAÏVE SOLUTION.

SET COUNTER = 0
SET UNIQUE SET = [ ]
WHILE COUNTER NOT EQUALS STREAM LENGTH:
    IF CURRENT ELEMENT NOT PRESENT IN UNIQUE SET:
        ADD CURRENT ELEMENT IN UNIQUE SET
    INCREMENT THE COUNTER
DISPLAY COUNT OF DISTINCT ELEMENTS : LENGTH (UNIQUE SET)
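
The naïve approach above can be sketched in a few lines. It is exact, but memory grows with the number of distinct elements. The function name `count_distinct_exact` is an assumption for illustration.

```python
def count_distinct_exact(stream):
    """Exact distinct count: remembers every distinct element seen so far."""
    seen = set()
    for element in stream:
        seen.add(element)  # adding an element already in the set has no effect
    return len(seen)


print(count_distinct_exact([1, 2, 4, 3, 4, 3, 2, 4, 1, 2]))  # 4
```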

FLAJOLET-MARTIN-ALGORITHM

 To find the approximate number of distinct elements in a stream
 In a single pass
 Uses very little memory while executing
 Hence, efficient and robust
Note: This algorithm is meant to be used when the stream of
elements as well as the expected distinct element count is very
large
PSEUDOCODE/ALGORITHM

1. Select a hash function h(x) so that each element in the set is mapped to a value
of at least $\log_2 n$ bits
2. Convert this h(x) output to binary_value
3. For each binary_value, find r(binary_value) : the number of trailing zeroes in
binary_value
4. Find R : max(r(binary_value))
5. Finally, the approximate count of distinct elements will be $2^R$
FLAJOLET-MARTIN-ALGORITHM

SET COUNTER = 0
SET MAX_R = 0
WHILE COUNTER NOT EQUALS STREAM LENGTH:
    VAL = BINARY OF HASH OUTPUT OF CURRENT ELEMENT
    COUNT NO. OF TRAILING ZEROES IN VAL
    IF COUNT > MAX_R:
        MAX_R = COUNT
    INCREMENT THE COUNTER
DISPLAY APPX. COUNT OF DISTINCT ELEMENTS : 2 ** (MAX_R)

EXAMPLE

 S=1,3,2,1,2,3,4,3,1,2,3,1
 h(x)=(6x+1) mod 5
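
A minimal sketch of the Flajolet-Martin steps applied to this example stream and hash function. The helper names are illustrative, and treating a hash value of 0 as having zero trailing zeroes is a convention assumed here, since the slides do not say how to handle it.

```python
def trailing_zeros(value):
    """Number of trailing zero bits; a hash value of 0 counts as 0 (assumed convention)."""
    if value == 0:
        return 0
    count = 0
    while (value & 1) == 0:
        value >>= 1
        count += 1
    return count


def flajolet_martin(stream, hash_function):
    """Single pass: track the longest run of trailing zeroes R and return 2**R."""
    max_r = 0
    for element in stream:
        max_r = max(max_r, trailing_zeros(hash_function(element)))
    return 2 ** max_r


stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
estimate = flajolet_martin(stream, lambda x: (6 * x + 1) % 5)
print(estimate)  # 4
```

Here h(1) = 2 (binary 010, one trailing zero), h(3) = 4 (binary 100, two), h(2) = 3 (binary 011, none) and h(4) = 0, so R = 2 and the estimate 2^2 = 4 happens to match the true count of four distinct elements.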

COUNTING ONES IN A WINDOW

• A window of length N on a binary stream.
• We focus on the situation where we cannot afford to store the entire
window.
• We want at all times to be able to answer queries of the form
“how many 1’s are there in the last k bits?” for any k ≤ N.
• Solution proposed through the Datar-Gionis-Indyk-Motwani Algorithm
– DGIM algorithm
DATAR-GIONIS-INDYK-MOTWANI ALGORITHM (DGIM)

• Commonly called the Motwani Algorithm or DGIM
• Used to estimate the number of 1's in a window of a binary stream
• Uses $O(\log^2 N)$ bits to represent a window of N bits
• Error is no more than 50 %
ELEMENTS
 Timestamp
 Each element entering the stream is allotted a timestamp based on its position
 Example: If the first bit has timestamp 1, then the second bit will have timestamp 2, the third bit 3 and
so on....
 Buckets
 Used to represent time intervals in a data stream
 The algorithm divides the stream into buckets, each with a size that is a power of 2
 A bucket contains the bits 0 and 1
RULES FOR FORMING A BUCKET

1. Every bucket should contain at least one 1
2. The right end of a bucket should always be a 1
3. The size of a bucket is the number of 1’s in it
4. Every bucket size should be a power of 2
5. As we move to the left, the bucket size should not decrease
6. No more than two buckets can have the same size
EXAMPLE

 Consider the following stream:


 Stream = ... 10101111001001100101…
 N = 20
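
A minimal sketch of DGIM bucket maintenance on the 20 visible bits of this stream, assuming they are read left to right with timestamps 1 to 20. The class `DGIM` and its method names are illustrative. Only the full-window estimate is shown: all buckets are counted in full except the oldest, which contributes half its size.

```python
class DGIM:
    """Approximate count of 1's in the last `window_size` bits of a binary stream."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.time = 0
        self.buckets = []  # newest first; each entry is (timestamp of newest 1, size)

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its newest 1 has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window_size:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            timestamp, size = self.buckets[i + 1]
            self.buckets[i + 1] = (timestamp, 2 * size)
            del self.buckets[i + 2]
            i += 1

    def estimate(self):
        if not self.buckets:
            return 0.0
        sizes = [size for _, size in self.buckets]
        return sum(sizes[:-1]) + sizes[-1] / 2


bits = "10101111001001100101"
dgim = DGIM(window_size=20)
for b in bits:
    dgim.add(int(b))

print([size for _, size in dgim.buckets])  # [1, 2, 4, 4]
print(dgim.estimate())                     # 9.0, against a true count of 11
```

The bucket sizes obey the rules above (powers of 2, non-decreasing towards the past, at most two of each size), and the estimate of 9 is within the 50 % error bound of the true count.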

ESTIMATING MOMENTS

• A generalization of the problem of counting distinct elements in a stream.
– The problem is called computing “moments”
• Involves the distribution of frequencies of different
elements in the stream.
• We shall define moments of all orders and concentrate
on computing second moments, from which the general
algorithm for all moments is a simple extension.
DEFINITION OF MOMENTS

• Suppose a stream consists of elements chosen from a universal set.
• Assume the universal set is ordered so we can speak of the ith element for any i.
• Let $m_i$ be the number of occurrences of the ith element for any i. Then the kth-order
moment (or just kth moment) of the stream is the sum over all i of $(m_i)^k$.
• Kth Moment: $\sum_i (m_i)^k$
COMPUTING DIFFERENT MOMENTS

• 0th moment = number of distinct elements in the stream.
• 1st moment = sum of the counts $m_i$, i.e. the length of the stream
• 2nd moment = surprise number (a measure of how uneven
the distribution is)
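
The three moments can be computed exactly for a small stream by counting occurrences. This is a sketch for checking the definitions, not a streaming algorithm; the function name `moment` and the sample values are chosen for illustration.

```python
from collections import Counter


def moment(stream, k):
    """Exact kth moment: sum over distinct elements of (occurrence count) ** k."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())


stream = [7, 9, 8, 7, 8, 9, 9, 8, 9, 7]
print(moment(stream, 0))  # 3  distinct elements
print(moment(stream, 1))  # 10 length of the stream
print(moment(stream, 2))  # 34 surprise number: 3**2 + 4**2 + 3**2
```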

ALON MATIAS SZEGEDY METHOD
• AMS method works for all moments
• Gives an unbiased estimate
• We will just concentrate on the 2nd moment S
• We pick and keep track of many variables X:
– For each variable X we store X.el and X.val
• X.el corresponds to the item i
• X.val corresponds to the count of item i
– Note this requires a count in main memory, so the number of
Xs is limited
• Our goal is to compute $S = \sum_i (m_i)^2$
FIND THE SURPRISE NUMBER

• Given Stream : 7, 9, 8, 7,8,9,9,8,9,7
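
A sketch of the AMS estimate for this stream, using the standard per-variable estimate n × (2 × X.val − 1), where n is the stream length and X.val counts occurrences of X.el from the chosen position onward. For simplicity the code works on a stored list rather than a live stream, and the function name and the chosen positions are assumptions made for illustration.

```python
def ams_estimates(stream, positions):
    """One second-moment estimate per chosen start position (0-based)."""
    n = len(stream)
    estimates = []
    for p in positions:
        element = stream[p]                 # X.el
        value = stream[p:].count(element)   # X.val: occurrences from position p onward
        estimates.append(n * (2 * value - 1))
    return estimates


stream = [7, 9, 8, 7, 8, 9, 9, 8, 9, 7]

# Three variables started at hypothetical positions 0, 4 and 8.
three = ams_estimates(stream, [0, 4, 8])
print(three, sum(three) / len(three))       # [50, 30, 10] 30.0

# Averaging over every possible position recovers the exact surprise number.
every = ams_estimates(stream, range(len(stream)))
print(sum(every) / len(every))              # 34.0
```

The exact surprise number is 3² + 4² + 3² = 34. With only three variables the estimate is 30, and it improves on average as more variables are kept.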

REAL-TIME ANALYTICS

 Refers to finding meaningful patterns in data at the actual time of receiving it
 Real-Time Analytics Platform (RTAP) analyses the data, correlates,
and predicts the outcomes in the real time.
 Manages and processes data and helps timely decision-making
 Helps to develop dynamic analysis applications
 Leads to evolution of business intelligence

WIDELY USED RTAPS

 Apache Spark Streaming
 Cisco Connected Streaming Analytics (CSA)
 Oracle Stream Analytics (OSA)
 SAP HANA
 SQLstream Blaze
 TIBCO StreamBase
