Module 3 Mining Data Streams
ANISHA JOSE PH
Mining Data Streams
•In the data-stream model, data arrives in a stream or streams, and if it is not
processed immediately or stored, it is lost forever.
•Assume that the data arrives so rapidly that it is not feasible to store it
all in active storage (i.e., in a conventional database), and then interact
with it at the time of our choosing.
•The algorithms for processing streams each involve summarization of
the stream in some way.
•We consider how to make a useful sample of a stream and how to filter
a stream to eliminate most of the “undesirable” elements.
Mining Data Streams
•Another approach to summarizing a stream is to look at only a fixed-
length “window” consisting of the last n elements for some (typically
large) n.
•If there are many streams and/or n is large, we may not be able to
store the entire window for every stream, so we need to summarize
even the windows.
The Stream Data Model
[Figure: a data-stream-management system]
Examples of Stream Sources
•Sensor Data
•Image Data: Satellites, Surveillance cameras
•Internet and Web Traffic: Google, Yahoo, etc.
Stream Queries
There are two ways that queries get asked about streams.
1. Standing queries, which are stored in a place within the processor.
These queries are, in a sense, permanently executing, and
produce outputs at appropriate times.
Examples:
•A standing query to output an alert whenever the temperature
exceeds 25 degrees centigrade.
•A standing query that, each time a new reading arrives, produces the
average of the 24 most recent readings.
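A minimal sketch of both standing queries in Python (the 25-degree threshold and 24-reading window come from the examples above; the class and method names are illustrative):

from collections import deque

class StandingQueries:
    def __init__(self, threshold=25.0, window_size=24):
        self.threshold = threshold
        self.window = deque(maxlen=window_size)  # keeps only the last 24 readings

    def on_reading(self, temp):
        # Standing query 1: alert whenever the temperature exceeds the threshold.
        if temp > self.threshold:
            print(f"ALERT: temperature {temp} exceeds {self.threshold}")
        # Standing query 2: average of the 24 most recent readings.
        self.window.append(temp)
        return sum(self.window) / len(self.window)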
Stream Queries
Another query we might ask is the maximum temperature ever
recorded by that sensor. It is not necessary to record the entire
stream; we need only keep the stored maximum so far and update it
when a larger reading arrives.
For the average temperature over all time, we have only to record
two values: the number of readings ever sent in the stream and the
sum of those readings.
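Both queries need only a constant amount of state per stream; a sketch (names are illustrative):

class RunningStats:
    # Maximum and average over the entire stream with O(1) memory.
    def __init__(self):
        self.maximum = float("-inf")  # stored maximum so far
        self.count = 0                # number of readings ever sent
        self.total = 0.0              # sum of those readings

    def on_reading(self, temp):
        self.maximum = max(self.maximum, temp)
        self.count += 1
        self.total += temp

    def average(self):
        return self.total / self.count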
Stream Queries
2. The other form of query is ad-hoc.
If we want the facility to ask a wide variety of ad-hoc queries, a
common approach is to store a sliding window of each stream in the working
store.
A sliding window can be the most recent n elements of a stream, for some n,
or it can be all the elements that arrived within the last t time units, e.g.,
one day.
Example: report the number of unique users over the past month.
If we think of the window as a relation Logins(name, time), the answer
is obtained by counting the distinct names whose login time falls
within the past month.
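A sketch of answering this ad-hoc query against a sliding window held in the working store (the one-month width comes from the example; the data structures are illustrative):

from collections import deque

ONE_MONTH = 30 * 24 * 3600      # window width t in seconds (assumed)
window = deque()                # (name, time) tuples of Logins, oldest first

def on_login(name, time):
    window.append((name, time))

def unique_users(now):
    # Expire tuples older than one month, then count distinct names --
    # the effect of SELECT COUNT(DISTINCT name) over the Logins window.
    while window and window[0][1] < now - ONE_MONTH:
        window.popleft()
    return len({name for name, _ in window})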
Issues in Stream Processing
•Streams often deliver elements very rapidly; we must process elements
in real time, or we lose the opportunity to process them at all, without
accessing the archival storage.
•Thus, it often is important that the stream-processing algorithm is
executed in main memory, without access to secondary storage or with
only rare accesses to it.
•Many problems about streaming data would be easy to solve if we had
enough memory; they become hard when we must use less memory than
the data itself occupies.
Sampling Data in a Stream
As our first example of managing streaming data, we shall look at
extracting reliable samples from a stream.
Our running example:
We assume the stream consists of tuples (user, query, time). Suppose
that we want to answer queries such as “What fraction of the typical
user’s queries were repeated over the past month?” Assume also that
we wish to store only 1/10th of the stream elements.
Sampling Data in a Stream
•The obvious approach would be to generate a random number, say
an integer from 0 to 9, in response to each search query.
•Store the tuple if and only if the random number is 0. If we do so,
each user has, on average, 1/10th of their queries stored.
•Suppose a user has issued s search queries one time in the past
month, d search queries twice, and no search queries more than
twice.
Sampling Data in a Stream
•If we have a 1/10th sample of queries, we shall see in the sample
for that user an expected s/10 of the search queries issued once.
•Of the d search queries issued twice, only d/100 will appear twice
in the sample. That fraction is d times (1/10 × 1/10), the
probability that both occurrences of the query will be in the 1/10th
sample.
•Of the queries that appear twice in the full stream, 18d/100 will
appear exactly once, since (1/10 × 9/10) + (9/10 × 1/10) = 18/100.
Sampling Data in a Stream
•To see why, note that 18/100 is the probability that one of the two
occurrences will be in the 1/10th of the stream that is selected, while the
other is in the 9/10th that is not selected.
•The correct answer to the query about the fraction of repeated searches
is d/(s+d).
•However, the answer we shall obtain from the sample is d/(10s+19d).
Sampling Data in a Stream
Derivation:
•d/100 of the queries appear twice in the sample,
•while s/10 + 18d/100 appear exactly once.
•Thus, the fraction appearing twice in the sample is d/100 divided by
d/100 + s/10 + 18d/100.
•Multiplying numerator and denominator by 100, this ratio is d/(10s + 19d).
For no positive values of s and d is d/(s + d) = d/(10s + 19d).
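A small simulation (with arbitrary choices of s and d) confirms that query-level sampling reports roughly d/(10s + 19d) instead of the true d/(s + d):

import random
from collections import Counter

s, d = 100_000, 50_000
queries = [f"q{i}" for i in range(s)] + [f"r{i}" for i in range(d)] * 2

sample = Counter(q for q in queries if random.randrange(10) == 0)
twice = sum(1 for c in sample.values() if c == 2)
print(twice / len(sample))   # about d/(10s + 19d) = 0.0256...
print(d / (s + d))           # the correct answer, 0.333...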
Obtaining a Representative Sample
•This query, like many queries about the statistics of typical users, cannot be
answered by taking a sample of each user’s search queries.
•Thus, we must strive to pick 1/10th of the users, and take all their
searches for the sample, while taking none of the searches from other
users.
•Each time a search query arrives in the stream, we look up the user to
see whether or not they are in the sample. If so, we add this search query
to the sample, and if not, then not.
Obtaining a Representative Sample
However, if we have no record of ever having seen this user before, then
we generate a random integer between 0 and 9.
If the number is 0, we add this user to our list with value “in,” and if the
number is other than 0, we add the user with the value “out.”
That method works as long as we can afford to keep the list of all users
and their in/out decision in main memory.
By using a hash function, one can avoid keeping the list of users. That is,
we hash each user name to one of ten buckets, 0 through 9.
If the user hashes to bucket 0, then accept this search query for the
sample, and if not, then not.
Obtaining a Representative Sample
•Note we do not actually store the user in the bucket; in fact, there is
no data in the buckets at all.
•Effectively, we use the hash function as a random number generator,
with the important property that, when applied to the same user
several times, we always get the same “random” number.
•More generally, we can obtain a sample consisting of any rational
fraction a/b of the users by hashing user names to b buckets, 0 through
b − 1. Add the search query to the sample if the hash value is less than
a.
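A sketch of the hash-based approach, with md5 standing in for the bucket hash function (any well-mixing hash of the user name would do):

import hashlib

def in_sample(user, a=1, b=10):
    # Keep a/b of the users: hash the user name to one of b buckets
    # and accept the query iff the bucket number is less than a.
    bucket = int(hashlib.md5(user.encode()).hexdigest(), 16) % b
    return bucket < a

stream = [("alice", "cats", 1), ("bob", "dogs", 2), ("alice", "dogs", 3)]
sample = [t for t in stream if in_sample(t[0])]  # all or none of each user's queries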
The General Sampling Problem
•Our stream consists of tuples with n components.
•A subset of the components are the key components, on which the
selection of the sample will be based.
•In our running example, there are three components – user, query, and
time – of which only user is in the key.
•However, we could also take a sample of queries by making query be the
key, or even take a sample of user-query pairs by making both those
components form the key.
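The same sketch generalizes by hashing only the key components of each tuple; which components form the key is now a parameter (illustrative):

import hashlib

def in_sample(tup, key_fields, a, b):
    # Hash the key components to one of b buckets; keep the tuple iff bucket < a.
    key = "|".join(str(tup[f]) for f in key_fields)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % b < a

tup = {"user": "alice", "query": "cats", "time": 17}
by_user = in_sample(tup, ["user"], 1, 10)            # sample of users
by_pair = in_sample(tup, ["user", "query"], 1, 10)   # sample of user-query pairs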
Filtering Streams
•Another common process on streams is selection, or filtering.
•We want to accept those tuples in the stream that meet a criterion.
•Accepted tuples are passed to another process as a stream, while other
tuples are dropped.
•The problem becomes harder when the criterion involves lookup
for membership in a set.
•It is especially hard when that set is too large to store in main memory.
•The technique known as “Bloom filtering” provides a way to eliminate most of
the tuples that do not meet the criterion.
A Motivating Example
•Suppose we have a set S of one billion allowed email addresses – those
that we will allow through because we believe them not to be spam.
•The stream consists of pairs: an email address and the email itself.
•Since the typical email address is 20 bytes or more, it is not reasonable
to store S in main memory.
A Motivating Example
•Suppose for argument’s sake that we have one gigabyte of available
main memory.
•In the technique known as Bloom filtering, we use that main memory
as a bit array.
•In this case, we have room for eight billion bits, since one byte equals
eight bits.
•Devise a hash function h from email addresses to eight billion buckets.
Hash each member of S to a bit, and set that bit to 1.
•All other bits of the array remain 0.
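A scaled-down sketch of this construction (sha256 stands in for the hash function h, and the array is far smaller than eight billion bits):

import hashlib

N = 1000                  # number of bits (eight billion in the example above)
bits = bytearray(N // 8)  # the bit array, initially all 0

def h(address):
    return int(hashlib.sha256(address.encode()).hexdigest(), 16) % N

def add(address):
    # Hash a member of S to a bit and set that bit to 1.
    i = h(address)
    bits[i // 8] |= 1 << (i % 8)

def might_contain(address):
    # A 0 bit means the address is definitely not in S; a 1 bit means "probably".
    i = h(address)
    return bool(bits[i // 8] & (1 << (i % 8)))

add("good@example.com")
assert might_contain("good@example.com")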
Bloom Filter
In general, a Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, . . . , hk.
3. A set S of m key values.
To initialize the bit array, set to 1 each bit that is hi(K) for some hash
function hi and some key K in S. To test a stream key K, let it through if
all of h1(K), h2(K), . . . , hk(K) are 1; if one or more of these bits are 0,
then K cannot be in S, so reject it.
Analysis of Bloom Filtering
•We need to understand how to calculate the probability of a false
positive, as a function of n, the bit-array length, m, the number of
members of S, and k, the number of hash functions.
•The model to use is throwing darts at targets: each bit of the array is a
target, and each application of a hash function to a member of S is a
dart. Suppose we have x targets and y darts.
•The probability that a given dart will not hit a given target is (x − 1)/x.
Analysis of Bloom Filtering
The probability that a given dart will hit a given target is 1/x.
The probability that a given dart will not hit a given target is 1 − 1/x.
The probability that none of the y darts will hit a given target is
(1 − 1/x)^y, which we can rewrite as ((1 − 1/x)^x)^(y/x).
Using the approximation (1 − 1/t)^t ≈ 1/e for large t, this probability is
approximately e^(−y/x).
Here there are y = km darts (k hash functions applied to each of m
members of S) and x = n targets (bits). Thus the probability that a given
bit remains 0 is approximately e^(−km/n), and the probability of a false
positive is approximately (1 − e^(−km/n))^k.
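A sketch that evaluates this approximation; for the running example (n = 8 billion bits, m = 1 billion addresses) it reproduces the figures above:

from math import exp

def false_positive_rate(n, m, k):
    # A bit stays 0 with probability about e^(-km/n), so a nonmember
    # passes all k hash checks with probability about (1 - e^(-km/n))^k.
    return (1 - exp(-k * m / n)) ** k

print(false_positive_rate(8e9, 1e9, 1))  # about 0.1175
print(false_positive_rate(8e9, 1e9, 2))  # about 0.0489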
The Flajolet-Martin Algorithm - Analysis
•The probability that a given stream element a has h(a) ending in
at least r 0’s is 2^−r, i.e., 1/2^r.
•Suppose there are m distinct elements in the stream.
•Then the probability that none of them has tail length at least r is
(1 − 2^−r)^m.
•We can rewrite it as ((1 − 2^−r)^(2^r))^(m·2^−r).
•Assuming r is reasonably large, the inner expression is of the
form (1 − ε)^(1/ε), which is approximately 1/e.
•Thus, the probability of not finding a stream element with as
many as r 0’s at the end of its hash value is e^(−m·2^−r).
The Flajolet-Martin Algorithm - Analysis
We can conclude:
•If m is much larger than 2^r, then the probability of finding a tail of
length at least r approaches 1.
•If m is much less than 2^r, then the probability of finding a tail of
length at least r approaches 0.
•Thus, the estimate 2^R, where R is the largest tail length seen, is
unlikely to be either much too high or much too low.
Combining Estimates
•Unfortunately, there is a trap regarding the strategy for combining
the estimates of m, the number of distinct elements, that we obtain
by using many different hash functions.
•Our first instinct would be to take the average of the values 2^R
that we get from each hash function, hoping to get a value that
approaches the true m as we use more hash functions.
• However, that is not the case, and the reason has to do with the
influence an overestimate has on the average: each time the largest
tail length R grows by 1, the estimate 2^R doubles, so an occasional
outsized value can dominate the average.
Combining Estimates
• Another way to combine estimates is to take the median of all
estimates.
• The median is not affected by the occasional outsized value of
2^R, so the worry described above for the average should not
carry over to the median.
• Unfortunately, the median suffers from another defect: it is always
a power of 2.
Combining Estimates
There is a solution to the problem, however. We can combine the
two methods. First, group the hash functions into small groups, and
take their average. Then, take the median of the averages.
Space Requirements
Observe that as we read the stream it is not necessary to store the
elements seen. The only thing we need to keep in main memory is
one integer per hash function; this integer records the largest tail
length seen so far for that hash function and any stream element.
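A sketch of the whole Flajolet-Martin pipeline; salted sha256 hashes stand in for the family of hash functions, and the number of hash functions and the group size are arbitrary choices:

import hashlib, statistics

def tail_length(x):
    # Number of 0's at the end of the binary representation of x.
    return (x & -x).bit_length() - 1 if x else 64

def h(element, seed):
    digest = hashlib.sha256(f"{seed}:{element}".encode()).hexdigest()
    return int(digest, 16) % 2**64

def fm_estimate(stream, n_hashes=40, group_size=10):
    R = [0] * n_hashes  # one integer per hash function: largest tail length so far
    for elem in stream:
        for seed in range(n_hashes):
            R[seed] = max(R[seed], tail_length(h(elem, seed)))
    estimates = [2 ** r for r in R]
    # Group the estimates, average within groups, take the median of the averages.
    groups = [estimates[i:i + group_size] for i in range(0, n_hashes, group_size)]
    return statistics.median(statistics.mean(g) for g in groups)

print(fm_estimate(f"user{i % 500}" for i in range(10_000)))  # roughly 500 distinct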
Estimating Moments
Assume the universal set is ordered so we can speak of the ith
element for any i.
Let mi be the number of occurrences of the ith element for any i.
Then the kth-order moment (or just kth moment) of the stream is
the sum over all i of (mi)^k.
kth moment = Σi (mi)^k
Estimating Moments
•The 0th moment is the sum of 1 for each mi that is greater than 0.
That is, the 0th moment is a count of the number of distinct
elements in the stream.
•The 1st moment is the sum of the mi’s, which must be the length of
the stream. Thus, first moments are especially easy to compute; just
count the length of the stream seen so far.
•The second moment is the sum of the squares of the mi’s. It is
sometimes called the surprise number, since it measures how
uneven the distribution of elements in the stream is.
Estimating Moments
•Suppose we have a stream of length 100, in which eleven different
elements appear.
•The most even distribution of these eleven elements would have
one appearing 10 times and the other ten appearing 9 times each. In
this case, the surprise number is 1 × 10^2 + 10 × 9^2 = 910.
•At the other extreme, one of the eleven elements could appear 90
times and the other ten appear 1 time each. Then, the
surprise number would be 90^2 + 10 × 1^2 = 8110.
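A quick check of both surprise numbers (streams constructed to match the distributions described above):

from collections import Counter

def kth_moment(stream, k):
    return sum(c ** k for c in Counter(stream).values())

even = ["x0"] * 10 + [f"x{i}" for i in range(1, 11)] * 9  # one element 10 times, ten elements 9 times
skew = ["y0"] * 90 + [f"y{i}" for i in range(1, 11)]      # one element 90 times, ten elements once

print(kth_moment(even, 2))  # 910
print(kth_moment(skew, 2))  # 8110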
The Alon-Matias-Szegedy Algorithm for Second Moments
Suppose we have space for only a limited number of variables. Each
variable X has two components: X.element, an element of the universal
set, and X.value, an integer.
To form a variable X, choose a position in the stream uniformly at
random, set X.element to the element found there, and set X.value to 1;
thereafter, add 1 to X.value every time another occurrence of X.element
appears in the stream.
From a variable X we derive the estimate n(2 × X.value − 1) of the
second moment, where n is the stream length; the final estimate is the
average of the estimates from all the variables.
Example
Consider the stream a, b, c, b, d, a, c, d, a, b, d, c, a, a, b of length
n = 15 (Example 4.7). Its second moment is 5^2 + 4^2 + 3^2 + 3^2 = 59,
since a appears 5 times, b 4 times, and c and d 3 times each.
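A sketch of the estimator on this stream; the positions 3, 8, and 13 are fixed here so the arithmetic can be checked by hand, though in practice they are chosen at random:

def ams_estimate(stream, positions):
    # For each chosen position: X.element is the element there, X.value is the
    # number of its occurrences from that position onward, and the estimate
    # from that variable is n * (2 * X.value - 1). Average over the variables.
    n = len(stream)
    estimates = []
    for p in positions:                      # p is 0-indexed
        element = stream[p]                  # X.element
        value = stream[p:].count(element)    # X.value
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

stream = list("abcbdacdabdcaab")             # Example 4.7, n = 15
print(ams_estimate(stream, [2, 7, 12]))      # (75 + 45 + 45) / 3 = 55
# The true second moment is 5^2 + 4^2 + 3^2 + 3^2 = 59.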
Why the Alon-Matias-Szegedy Algorithm Works
Let e(i) be the stream element that appears at position i in the stream,
and let c(i) be the number of times element e(i) appears in the stream
among positions i, i + 1, . . . , n.
Example:
Consider the stream of Example 4.7. e(6) = a, since the 6th position
holds a. Also, c(6) = 4, since a appears at positions 9, 13, and 14, as well
as at position 6. Note that a also appears at position 1, but that fact
does not contribute to c(6).
Why the Alon-Matias-Szegedy Algorithm Works
Since the position of X is chosen uniformly at random, the expected
value of n(2X.value − 1) is the average over all positions i:
E[n(2X.value − 1)] = (1/n) Σi n(2c(i) − 1) = Σi (2c(i) − 1).
We can simplify this sum by grouping the positions by the element
that appears there.
Why the Alon-Matias-Szegedy Algorithm Works
•For instance, concentrate on some element a that appears ma times in
the stream.
•The term for the last position in which a appears must be 2 × 1 − 1 = 1.
•The term for the next-to-last position in which a appears is 2 × 2 − 1 =
3.
•The positions with a before that yield terms 5, 7, and so on, up to 2ma −
1, which is the term for the first position in which a appears.
•That is, the expected value of n(2X.value − 1) can be written as the sum
over all elements a of 1 + 3 + 5 + · · · + (2ma − 1), and the sum of the
first ma odd integers is (ma)^2.
•Thus, E[n(2X.value − 1)] = Σa (ma)^2, which is exactly the second moment.
Why the Alon-Matias-Szegedy Algorithm Works
•The estimate n(2v − 1) for the second moment, where v = X.value, uses
the fact that 2v − 1 = v^2 − (v − 1)^2, so we can use as our estimate of
the third moment the formula n(v^3 − (v − 1)^3) = n(3v^2 − 3v + 1). In
general, the kth moment is estimated by n(v^k − (v − 1)^k).
Decaying Window
•An exponentially decaying window with a small decay constant c weights
the element i positions before the most recent by (1 − c)^i; the windowed
sum of a stream a_1, a_2, . . . , a_t is Σi a_(t−i) (1 − c)^i for i = 0 to t − 1.
•When a new element a_(t+1) arrives at the stream input, all we need
to do is:
1. Multiply the current sum by 1 − c.
2. Add a_(t+1).
•The reason this method works is that each of the previous elements
has now moved one position further from the current element, so its
weight is multiplied by 1 − c.
•Further, the weight on the current element is (1 − c)^0 = 1, so
adding a_(t+1) is the correct way to include the new element’s
contribution.
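A sketch of the update rule; the decay constant c is an arbitrary small value here (in practice something like 10^−6 or 10^−9):

c = 1e-6  # decay constant (assumed)

def update(current_sum, new_element):
    # Old elements' weights each pick up another factor of (1 - c);
    # the new element enters with weight (1 - c)^0 = 1.
    return current_sum * (1 - c) + new_element

total = 0.0
for a in [3.0, 1.0, 4.0, 1.0, 5.0]:
    total = update(total, a)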