Module 3 Mining Data Streams
ANISHA JOSE PH
Mining Data Streams
•In the data-stream model, data arrives in a stream or streams, and if it is not
processed immediately or stored, it is lost forever.
•Assume that the data arrives so rapidly that it is not feasible to store it
all in active storage (i.e., in a conventional database), and then interact
with it at the time of our choosing.
•The algorithms for processing streams each involve summarization of
the stream in some way.
•We consider how to make a useful sample of a stream and how to filter
a stream to eliminate most of the “undesirable” elements.
Mining Data Streams
•Another approach to summarizing a stream is to look at only a fixed-
length “window” consisting of the last n elements for some (typically
large) n.
•If there are many streams and/or n is large, we may not be able to
store the entire window for every stream, so we need to summarize
even the windows.
The Stream Data Model
[Figure: a data-stream-management system]
Examples of Stream Sources
•Sensor Data
•Image Data: Satellites, Surveillance cameras
•Internet and Web Traffic: Google, Yahoo, etc.
Stream Queries
There are two ways that queries get asked about streams.
1. Standing queries, which are stored in a place within the processor.
These queries are, in a sense, permanently executing, and
produce outputs at appropriate times.
Examples:
•A standing query to output an alert whenever the temperature
exceeds 25 degrees centigrade.
•A standing query that, each time a new reading arrives, produces the
average of the 24 most recent readings.
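A minimal sketch of both standing queries in Python (the 25-degree threshold and 24-reading window come from the examples above; the class and method names are illustrative):

from collections import deque

class StandingQueries:
    def __init__(self, threshold=25.0, window_size=24):
        self.threshold = threshold
        self.window = deque(maxlen=window_size)  # keeps only the last 24 readings

    def on_reading(self, temp):
        # Standing query 1: alert whenever the temperature exceeds the threshold.
        if temp > self.threshold:
            print(f"ALERT: temperature {temp} exceeds {self.threshold}")
        # Standing query 2: average of the 24 most recent readings.
        self.window.append(temp)
        return sum(self.window) / len(self.window)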
Stream Queries
Another query we might ask is the maximum temperature ever
recorded by that sensor. It is not necessary to record the entire
stream; we need only keep the stored maximum so far and update it
when a larger reading arrives.
For the average temperature over all time, we have only to record
two values: the number of readings ever sent in the stream and the
sum of those readings.
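Both queries need only a constant amount of state per stream; a sketch (names are illustrative):

class RunningStats:
    # Maximum and average over the entire stream with O(1) memory.
    def __init__(self):
        self.maximum = float("-inf")  # stored maximum so far
        self.count = 0                # number of readings ever sent
        self.total = 0.0              # sum of those readings

    def on_reading(self, temp):
        self.maximum = max(self.maximum, temp)
        self.count += 1
        self.total += temp

    def average(self):
        return self.total / self.count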
Stream Queries
2. The other form of query is ad-hoc.
If we want the facility to ask a wide variety of ad-hoc queries, a
common approach is to store a sliding window of each stream in the working
store.
A sliding window can be the most recent n elements of a stream, for some n,
or it can be all the elements that arrived within the last t time units, e.g.,
one day.
Example: report the number of unique users over the past month.
If we think of the window as a relation Logins(name, time), the answer
is obtained by counting the distinct names whose login time falls
within the past month.
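A sketch of answering this ad-hoc query against a sliding window held in the working store (the one-month width comes from the example; the data structures are illustrative):

from collections import deque

ONE_MONTH = 30 * 24 * 3600      # window width t in seconds (assumed)
window = deque()                # (name, time) tuples of Logins, oldest first

def on_login(name, time):
    window.append((name, time))

def unique_users(now):
    # Expire tuples older than one month, then count distinct names --
    # the effect of SELECT COUNT(DISTINCT name) over the Logins window.
    while window and window[0][1] < now - ONE_MONTH:
        window.popleft()
    return len({name for name, _ in window})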
Issues in Stream Processing
•Streams often deliver elements very rapidly; we must process elements
in real time, or we lose the opportunity to process them at all, without
accessing the archival storage.
•Thus, it often is important that the stream-processing algorithm is
executed in main memory, without access to secondary storage or with
only rare accesses to it.
•Many problems about streaming data would be easy to solve if we had
enough memory; they become hard when we must use less memory than
the data itself occupies.
Sampling Data in a Stream
As our first example of managing streaming data, we shall look at
extracting reliable samples from a stream.
Our running example:
We assume the stream consists of tuples (user, query, time). Suppose
that we want to answer queries such as “What fraction of the typical
user’s queries were repeated over the past month?” Assume also that
we wish to store only 1/10th of the stream elements.
Sampling Data in a Stream
•The obvious approach would be to generate a random number, say
an integer from 0 to 9, in response to each search query.
•Store the tuple if and only if the random number is 0. If we do so,
each user has, on average, 1/10th of their queries stored.
•Suppose a user has issued s search queries one time in the past
month, d search queries twice, and no search queries more than
twice.
Sampling Data in a Stream
•If we have a 1/10th sample of queries, we shall see in the sample
for that user an expected s/10 of the search queries issued once.
•Of the d search queries issued twice, only d/100 will appear twice
in the sample. That fraction is d times (1/10 × 1/10), the
probability that both occurrences of the query will be in the 1/10th
sample.
•Of the queries that appear twice in the full stream, 18d/100 will
appear exactly once, since (1/10 × 9/10) + (9/10 × 1/10) = 18/100.
Sampling Data in a Stream
•To see why, note that 18/100 is the probability that one of the two
occurrences will be in the 1/10th of the stream that is selected, while the
other is in the 9/10th that is not selected.
•The correct answer to the query about the fraction of repeated searches
is d/(s+d).
•However, the answer we shall obtain from the sample is d/(10s+19d).
Sampling Data in a Stream
Derivation:
•d/100 of the queries appear twice in the sample,
•while s/10 + 18d/100 appear exactly once.
•Thus, the fraction appearing twice in the sample is d/100 divided by
d/100 + s/10 + 18d/100.
•Multiplying numerator and denominator by 100, this ratio is d/(10s + 19d).
For no positive values of s and d is d/(s + d) = d/(10s + 19d).
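A small simulation (with arbitrary choices of s and d) confirms that query-level sampling reports roughly d/(10s + 19d) instead of the true d/(s + d):

import random
from collections import Counter

s, d = 100_000, 50_000
queries = [f"q{i}" for i in range(s)] + [f"r{i}" for i in range(d)] * 2

sample = Counter(q for q in queries if random.randrange(10) == 0)
twice = sum(1 for c in sample.values() if c == 2)
print(twice / len(sample))   # about d/(10s + 19d) = 0.0256...
print(d / (s + d))           # the correct answer, 0.333...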
Obtaining a Representative Sample
•This query, like many queries about the statistics of typical users, cannot be
answered by taking a sample of each user’s search queries.
•Thus, we must strive to pick 1/10th of the users, and take all their
searches for the sample, while taking none of the searches from other
users.
•Each time a search query arrives in the stream, we look up the user to
see whether or not they are in the sample. If so, we add this search query
to the sample, and if not, then not.
Obtaining a Representative Sample
However, if we have no record of ever having seen this user before, then
we generate a random integer between 0 and 9.
If the number is 0, we add this user to our list with value “in,” and if the
number is other than 0, we add the user with the value “out.”
That method works as long as we can afford to keep the list of all users
and their in/out decision in main memory.
By using a hash function, one can avoid keeping the list of users. That is,
we hash each user name to one of ten buckets, 0 through 9.
If the user hashes to bucket 0, then accept this search query for the
sample, and if not, then not.
Obtaining a Representative Sample
•Note we do not actually store the user in the bucket; in fact, there is
no data in the buckets at all.
•Effectively, we use the hash function as a random number generator,
with the important property that, when applied to the same user
several times, we always get the same “random” number.
•More generally, we can obtain a sample consisting of any rational
fraction a/b of the users by hashing user names to b buckets, 0 through
b − 1. Add the search query to the sample if the hash value is less than
a.
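A sketch of the hash-based approach, with md5 standing in for the bucket hash function (any well-mixing hash of the user name would do):

import hashlib

def in_sample(user, a=1, b=10):
    # Keep a/b of the users: hash the user name to one of b buckets
    # and accept the query iff the bucket number is less than a.
    bucket = int(hashlib.md5(user.encode()).hexdigest(), 16) % b
    return bucket < a

stream = [("alice", "cats", 1), ("bob", "dogs", 2), ("alice", "dogs", 3)]
sample = [t for t in stream if in_sample(t[0])]  # all or none of each user's queries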
The General Sampling Problem
•Our stream consists of tuples with n components.
•A subset of the components are the key components, on which the
selection of the sample will be based.
•In our running example, there are three components – user, query, and
time – of which only user is in the key.
•However, we could also take a sample of queries by making query be the
key, or even take a sample of user-query pairs by making both those
components form the key.
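The same sketch generalizes by hashing only the key components of each tuple; which components form the key is now a parameter (illustrative):

import hashlib

def in_sample(tup, key_fields, a, b):
    # Hash the key components to one of b buckets; keep the tuple iff bucket < a.
    key = "|".join(str(tup[f]) for f in key_fields)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % b < a

tup = {"user": "alice", "query": "cats", "time": 17}
by_user = in_sample(tup, ["user"], 1, 10)            # sample of users
by_pair = in_sample(tup, ["user", "query"], 1, 10)   # sample of user-query pairs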
Filtering Streams
•Another common process on streams is selection, or filtering.
•We want to accept those tuples in the stream that meet a criterion.
•Accepted tuples are passed to another process as a stream, while other
tuples are dropped.
•The problem becomes harder when the criterion involves lookup
for membership in a set.
•It is especially hard when that set is too large to store in main memory.
•The technique known as “Bloom filtering” provides a way to eliminate most of
the tuples that do not meet the criterion.
A Motivating Example
•Suppose we have a set S of one billion allowed email addresses – those
that we will allow through because we believe them not to be spam.
•The stream consists of pairs: an email address and the email itself.
•Since the typical email address is 20 bytes or more, it is not reasonable
to store S in main memory.
A Motivating Example
•Suppose for argument’s sake that we have one gigabyte of available
main memory.
•In the technique known as Bloom filtering, we use that main memory
as a bit array.
•In this case, we have room for eight billion bits, since one byte equals
eight bits.
•Devise a hash function h from email addresses to eight billion buckets.
Hash each member of S to a bit, and set that bit to 1.
•All other bits of the array remain 0.
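A scaled-down sketch of this construction (sha256 stands in for the hash function h, and the array is far smaller than eight billion bits):

import hashlib

N = 1000                  # number of bits (eight billion in the example above)
bits = bytearray(N // 8)  # the bit array, initially all 0

def h(address):
    return int(hashlib.sha256(address.encode()).hexdigest(), 16) % N

def add(address):
    # Hash a member of S to a bit and set that bit to 1.
    i = h(address)
    bits[i // 8] |= 1 << (i % 8)

def might_contain(address):
    # A 0 bit means the address is definitely not in S; a 1 bit means "probably".
    i = h(address)
    return bool(bits[i // 8] & (1 << (i % 8)))

add("good@example.com")
assert might_contain("good@example.com")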
Bloom Filter
In general, a Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, . . . , hk.
3. A set S of m key values.
To initialize the bit array, set to 1 each bit that is hi(K) for some hash
function hi and some key K in S. To test a stream key K, let it through if
all of h1(K), h2(K), . . . , hk(K) are 1; if one or more of these bits are 0,
then K cannot be in S, so reject it.
Analysis of Bloom Filtering
•We need to understand how to calculate the probability of a false
positive, as a function of n, the bit-array length, m, the number of
members of S, and k, the number of hash functions.
•The model to use is throwing darts at targets: each bit of the array is a
target, and each application of a hash function to a member of S is a
dart. Suppose we have x targets and y darts.
•The probability that a given dart will not hit a given target is (x − 1)/x.
Analysis of Bloom Filtering
The probability that a given dart will hit a given target is 1/x.
The probability that a given dart will not hit a given target is 1 − 1/x.
The probability that none of the y darts will hit a given target is
(1 − 1/x)^y, which we can rewrite as ((1 − 1/x)^x)^(y/x).
Using the approximation (1 − 1/t)^t ≈ 1/e for large t, this probability is
approximately e^(−y/x).
Here there are y = km darts (k hash functions applied to each of m
members of S) and x = n targets (bits). Thus the probability that a given
bit remains 0 is approximately e^(−km/n), and the probability of a false
positive is approximately (1 − e^(−km/n))^k.
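A sketch that evaluates this approximation; for the running example (n = 8 billion bits, m = 1 billion addresses) it reproduces the figures above:

from math import exp

def false_positive_rate(n, m, k):
    # A bit stays 0 with probability about e^(-km/n), so a nonmember
    # passes all k hash checks with probability about (1 - e^(-km/n))^k.
    return (1 - exp(-k * m / n)) ** k

print(false_positive_rate(8e9, 1e9, 1))  # about 0.1175
print(false_positive_rate(8e9, 1e9, 2))  # about 0.0489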
The Flajolet-Martin Algorithm - Analysis
•The probability that a given stream element a has h(a) ending in
at least r 0’s is 2^−r, i.e., 1/2^r.
•Suppose there are m distinct elements in the stream.
•Then the probability that none of them has tail length at least r is
(1 − 2^−r)^m.
•We can rewrite it as ((1 − 2^−r)^(2^r))^(m·2^−r).
•Assuming r is reasonably large, the inner expression is of the
form (1 − ε)^(1/ε), which is approximately 1/e.
•Thus, the probability of not finding a stream element with as
many as r 0’s at the end of its hash value is e^(−m·2^−r).
The Flajolet-Martin Algorithm - Analysis
We can conclude:
•If m is much larger than 2^r, then the probability of finding a tail of
length at least r approaches 1.
•If m is much less than 2^r, then the probability of finding a tail of
length at least r approaches 0.
•Thus, the estimate 2^R, where R is the largest tail length seen, is
unlikely to be either much too high or much too low.
Combining Estimates
•Unfortunately, there is a trap regarding the strategy for combining
the estimates of m, the number of distinct elements, that we obtain
by using many different hash functions.
•Our first instinct would be to take the average of the values 2^R
that we get from each hash function, hoping to get a value that
approaches the true m as we use more hash functions.
• However, that is not the case, and the reason has to do with the
influence an overestimate has on the average: each time the largest
tail length R grows by 1, the estimate 2^R doubles, so an occasional
outsized value can dominate the average.
Combining Estimates
• Another way to combine estimates is to take the median of all
estimates.
• The median is not affected by the occasional outsized value of
2^R, so the worry described above for the average should not
carry over to the median.
• Unfortunately, the median suffers from another defect: it is always
a power of 2.
Combining Estimates
There is a solution to the problem, however. We can combine the
two methods. First, group the hash functions into small groups, and
take their average. Then, take the median of the averages.
Space Requirements
Observe that as we read the stream it is not necessary to store the
elements seen. The only thing we need to keep in main memory is
one integer per hash function; this integer records the largest tail
length seen so far for that hash function and any stream element.
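A sketch of the whole Flajolet-Martin pipeline; salted sha256 hashes stand in for the family of hash functions, and the number of hash functions and the group size are arbitrary choices:

import hashlib, statistics

def tail_length(x):
    # Number of 0's at the end of the binary representation of x.
    return (x & -x).bit_length() - 1 if x else 64

def h(element, seed):
    digest = hashlib.sha256(f"{seed}:{element}".encode()).hexdigest()
    return int(digest, 16) % 2**64

def fm_estimate(stream, n_hashes=40, group_size=10):
    R = [0] * n_hashes  # one integer per hash function: largest tail length so far
    for elem in stream:
        for seed in range(n_hashes):
            R[seed] = max(R[seed], tail_length(h(elem, seed)))
    estimates = [2 ** r for r in R]
    # Group the estimates, average within groups, take the median of the averages.
    groups = [estimates[i:i + group_size] for i in range(0, n_hashes, group_size)]
    return statistics.median(statistics.mean(g) for g in groups)

print(fm_estimate(f"user{i % 500}" for i in range(10_000)))  # roughly 500 distinct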
Estimating Moments
Assume the universal set is ordered so we can speak of the ith
element for any i.
Let mi be the number of occurrences of the ith element for any i.
Then the kth-order moment (or just kth moment) of the stream is
the sum over all i of (mi)^k.
kth moment = Σi (mi)^k
Estimating Moments
•The 0th moment is the sum of 1 for each mi that is greater than 0.
That is, the 0th moment is a count of the number of distinct
elements in the stream.
•The 1st moment is the sum of the mi’s, which must be the length of
the stream. Thus, first moments are especially easy to compute; just
count the length of the stream seen so far.
•The second moment is the sum of the squares of the mi’s. It is
sometimes called the surprise number, since it measures how
uneven the distribution of elements in the stream is.
Estimating Moments
•Suppose we have a stream of length 100, in which eleven different
elements appear.
•The most even distribution of these eleven elements would have
one appearing 10 times and the other ten appearing 9 times each. In
this case, the surprise number is 1 × 10^2 + 10 × 9^2 = 910.
•At the other extreme, one of the eleven elements could appear 90
times and the other ten appear 1 time each. Then, the
surprise number would be 90^2 + 10 × 1^2 = 8110.
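A quick check of both surprise numbers (streams constructed to match the distributions described above):

from collections import Counter

def kth_moment(stream, k):
    return sum(c ** k for c in Counter(stream).values())

even = ["x0"] * 10 + [f"x{i}" for i in range(1, 11)] * 9  # one element 10 times, ten elements 9 times
skew = ["y0"] * 90 + [f"y{i}" for i in range(1, 11)]      # one element 90 times, ten elements once

print(kth_moment(even, 2))  # 910
print(kth_moment(skew, 2))  # 8110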
The Alon-Matias-Szegedy Algorithm for Second Moments
Suppose we have space for only a limited number of variables. Each
variable X has two components: X.element, an element of the universal
set, and X.value, an integer.
To form a variable X, choose a position in the stream uniformly at
random, set X.element to the element found there, and set X.value to 1;
thereafter, add 1 to X.value every time another occurrence of X.element
appears in the stream.
From a variable X we derive the estimate n(2 × X.value − 1) of the
second moment, where n is the stream length; the final estimate is the
average of the estimates from all the variables.
Example
Consider the stream a, b, c, b, d, a, c, d, a, b, d, c, a, a, b of length
n = 15 (Example 4.7). Its second moment is 5^2 + 4^2 + 3^2 + 3^2 = 59,
since a appears 5 times, b 4 times, and c and d 3 times each.
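A sketch of the estimator on this stream; the positions 3, 8, and 13 are fixed here so the arithmetic can be checked by hand, though in practice they are chosen at random:

def ams_estimate(stream, positions):
    # For each chosen position: X.element is the element there, X.value is the
    # number of its occurrences from that position onward, and the estimate
    # from that variable is n * (2 * X.value - 1). Average over the variables.
    n = len(stream)
    estimates = []
    for p in positions:                      # p is 0-indexed
        element = stream[p]                  # X.element
        value = stream[p:].count(element)    # X.value
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

stream = list("abcbdacdabdcaab")             # Example 4.7, n = 15
print(ams_estimate(stream, [2, 7, 12]))      # (75 + 45 + 45) / 3 = 55
# The true second moment is 5^2 + 4^2 + 3^2 + 3^2 = 59.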
Why the Alon-Matias-Szegedy Algorithm Works
Let e(i) be the stream element that appears at position i in the stream,
and let c(i) be the number of times element e(i) appears in the stream
among positions i, i + 1, . . . , n.
Example:
Consider the stream of Example 4.7. e(6) = a, since the 6th position
holds a. Also, c(6) = 4, since a appears at positions 9, 13, and 14, as well
as at position 6. Note that a also appears at position 1, but that fact
does not contribute to c(6).
Why the Alon-Matias-Szegedy Algorithm Works
Since the position of X is chosen uniformly at random, the expected
value of n(2X.value − 1) is the average over all positions i:
E[n(2X.value − 1)] = (1/n) Σi n(2c(i) − 1) = Σi (2c(i) − 1).
We can simplify this sum by grouping the positions by the element
that appears there.
Why the Alon-Matias-Szegedy Algorithm Works
•For instance, concentrate on some element a that appears ma times in
the stream.
•The term for the last position in which a appears must be 2 × 1 − 1 = 1.
•The term for the next-to-last position in which a appears is 2 × 2 − 1 =
3.
•The positions with a before that yield terms 5, 7, and so on, up to 2ma −
1, which is the term for the first position in which a appears.
•That is, the expected value of n(2X.value − 1) can be written as the sum
over all elements a of 1 + 3 + 5 + · · · + (2ma − 1), and the sum of the
first ma odd integers is (ma)^2.
•Thus, E[n(2X.value − 1)] = Σa (ma)^2, which is exactly the second moment.
Why the Alon-Matias-Szegedy Algorithm Works
•The estimate n(2v − 1) for the second moment, where v = X.value, uses
the fact that 2v − 1 = v^2 − (v − 1)^2, so we can use as our estimate of
the third moment the formula n(v^3 − (v − 1)^3) = n(3v^2 − 3v + 1). In
general, the kth moment is estimated by n(v^k − (v − 1)^k).
Decaying Window
•An exponentially decaying window with a small decay constant c weights
the element i positions before the most recent by (1 − c)^i; the windowed
sum of a stream a_1, a_2, . . . , a_t is Σi a_(t−i) (1 − c)^i for i = 0 to t − 1.
•When a new element a_(t+1) arrives at the stream input, all we need
to do is:
1. Multiply the current sum by 1 − c.
2. Add a_(t+1).
•The reason this method works is that each of the previous elements
has now moved one position further from the current element, so its
weight is multiplied by 1 − c.
•Further, the weight on the current element is (1 − c)^0 = 1, so
adding a_(t+1) is the correct way to include the new element’s
contribution.
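A sketch of the update rule; the decay constant c is an arbitrary small value here (in practice something like 10^−6 or 10^−9):

c = 1e-6  # decay constant (assumed)

def update(current_sum, new_element):
    # Old elements' weights each pick up another factor of (1 - c);
    # the new element enters with weight (1 - c)^0 = 1.
    return current_sum * (1 - c) + new_element

total = 0.0
for a in [3.0, 1.0, 4.0, 1.0, 5.0]:
    total = update(total, a)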