Mining Data Stream

Contents are from the book “Mining of Massive Datasets”


Leskovec et al.
Data Stream
● Literally, a stream is a body of water that flows on the Earth's surface; a data stream borrows this image of a continuous flow.
● Data streaming is the process of transmitting a continuous flow of data (also
known as streams) typically fed into stream processing software to derive
valuable insights. A data stream consists of a series of data elements
ordered in time.
● A stream is an infinite sequence of values
● Some examples of data streams include sensor data, activity logs from web
browsers, and financial transaction logs.
● A data stream can be visualized as an endless conveyor belt, carrying data
elements and continuously feeding them into a data processor.

Ref: https://www.tibco.com/reference-center/what-is-data-streaming
Assumptions
● Data arrives in a stream or streams, and if it is not processed immediately or stored,
then it is lost forever.
● Moreover, we shall assume that the data arrives so rapidly that it is not feasible to store
it all in active storage (i.e., in a conventional database), and then interact with it at the
time of our choosing.
What are we going to learn?
● The algorithms for processing streams each involve summarization of the stream
in some way.
○ Sampling a stream
○ Filtering a stream
○ Estimating the number of different elements in a stream using much less
storage than storing them all would require
Stream Model
● In many data mining situations, we do not know the entire dataset in advance
● Stream Management is important when the input rate is controlled externally:
○ Google queries
○ Twitter or Facebook status updates
● We can think of the data as infinite and non-stationary (the distribution
changes over time)
● Input elements enter at a rapid rate, at one or more input ports (i.e., streams)
○ We call elements of the stream tuples
● The system cannot store the entire stream accessibly
Data Stream Management System
Stream Sources
● Sensor Data
○ Surface temperature or surface height of the sea
○ Each sensor sends a 4-byte real number every tenth of a second, about 3.5 MB per day
○ A million sensors would generate about 3.5 TB of data every day
● Satellite Data
○ Satellite images
○ Surveillance systems (low resolution)
■ Six million cameras in London, each producing its own stream
● Internet and Web traffic
○ Billions of search queries on Google
■ e.g., an increase in queries for “sore throat” can signal spreading illness
○ Billions of clicks on Yahoo
■ A sudden increase in the click rate for a link can indicate breaking news, or a broken link
Stream Queries
● Ad-hoc queries:
○ A question asked once about the current state of a stream or streams
○ A common approach is to store a sliding window of each stream in the working store
○ A sliding window can be the most recent n elements of a stream, for some n, or all
the elements that arrived within the last t time units
● Standing queries:
○ Permanently executing; produce outputs at appropriate times
○ e.g., alert when the temperature exceeds X degrees Centigrade
○ Report the average of the first N readings
○ Report the maximum temperature ever seen
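The sliding window mentioned under ad-hoc queries can be sketched with a bounded deque (a minimal illustration; the class and method names here are my own):

```python
from collections import deque

class SlidingWindow:
    """Keep only the most recent n elements of a stream."""
    def __init__(self, n):
        self.window = deque(maxlen=n)  # oldest elements fall off automatically

    def add(self, element):
        self.window.append(element)

    def elements(self):
        return list(self.window)

w = SlidingWindow(3)
for x in [1, 2, 3, 4, 5]:
    w.add(x)
# the window now holds only the 3 most recent elements: [3, 4, 5]
```

A time-based window (the last t time units) would instead store (timestamp, element) pairs and evict pairs older than t on each arrival.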
Sampling Stream

● Since we cannot store the entire stream, one obvious approach is to store a sample
● Two different problems:
○ Sample a fixed proportion of the elements in the stream (say, 1 in 10)
○ Maintain a random sample of fixed size over a potentially infinite stream
■ At any “time” k we would like a uniform random sample of s elements from the stream seen so far
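The standard technique for maintaining a fixed-size uniform sample over an unbounded stream is reservoir sampling; the sketch below is my own illustration, not code from the book:

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of s elements from a stream
    of unknown (possibly unbounded) length, in one pass."""
    reservoir = []
    for k, element in enumerate(stream, start=1):
        if k <= s:
            reservoir.append(element)   # fill the reservoir with the first s elements
        else:
            j = random.randrange(k)     # random index in [0, k)
            if j < s:                   # replace a slot with probability s/k
                reservoir[j] = element
    return reservoir
```

After the k-th element arrives, every element seen so far is in the reservoir with probability exactly s/k, which is the property the slide asks for.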
Sampling Stream

● Scenario: search engine query stream
○ Stream of tuples: (user, query, time)
○ Answer questions such as: “How often did a user run the same query
in a single day?” or “What fraction of the typical user’s queries were
repeated over the past month?”
○ We have space to store 1/10th of the query stream
● Naïve solution:
○ Generate a random integer in [0..9] for each query
○ Store the query if the integer is 0, otherwise discard
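The naive solution can be sketched as follows (a minimal illustration; the stream elements stand in for (user, query, time) tuples):

```python
import random

def sample_queries(stream):
    """Keep roughly 1/10th of the stream: draw a random digit in [0..9]
    per element and store the element only when the digit is 0."""
    sample = []
    for tup in stream:
        if random.randrange(10) == 0:
            sample.append(tup)
    return sample
```

Note that this samples query *occurrences*, not users or queries, which is exactly what makes it the wrong tool for the duplicate-query question on the next slides.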
Sampling Stream
● Suppose a user issued x search queries once in the past month, d search queries twice, and no queries more than twice
● The true fraction of the user’s queries that are duplicates is d/(x+d)
● Now take a 1/10th sample of the queries:
○ Of the x singleton queries, x/10 appear in the sample
○ A doubled query has both copies sampled with probability (1/10)(1/10), so d/100 queries appear twice
○ The expected number of sampled copies of doubled queries is 2d/10; subtracting the 2(d/100) copies that arrive in pairs leaves 2d/10 - 2d/100 = 18d/100 doubled queries appearing exactly once
● The sample thus suggests a duplicate fraction of (d/100)/(x/10 + 18d/100 + d/100) = d/(10x + 19d), not the correct d/(x+d)
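The bias can be checked with a quick simulation (my own illustration): with x queries issued once and d issued twice, naive 1-in-10 sampling reports a duplicate fraction near d/(10x + 19d) rather than the true d/(x+d):

```python
import random

def estimate_duplicate_fraction(x, d, seed=1):
    """Simulate 1/10 occurrence sampling over a stream in which
    x queries occur once and d queries occur twice."""
    random.seed(seed)
    stream = [f"s{i}" for i in range(x)] + [f"d{i}" for i in range(d)] * 2
    counts = {}
    for q in stream:
        if random.randrange(10) == 0:          # keep ~1/10 of occurrences
            counts[q] = counts.get(q, 0) + 1
    twice = sum(1 for c in counts.values() if c == 2)
    return twice / len(counts) if counts else 0.0

x, d = 100_000, 50_000
est = estimate_duplicate_fraction(x, d)
true_frac = d / (x + d)            # = 1/3, the real duplicate fraction
biased = d / (10 * x + 19 * d)     # what occurrence sampling predicts
# est lands near `biased`, far below `true_frac`
```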
Hashing
Sampling User
● Pick 1/10th of users and take all their searches in the sample
● Use a hash function that hashes the user name or user id uniformly into 10
buckets
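This can be sketched as follows (a minimal illustration; the bucket count and choice of hash function are mine):

```python
import hashlib

def keep_user(user_id, buckets=10, accept_bucket=0):
    """Deterministically keep ~1/10th of users: hash the user id into
    10 buckets and keep every query from users in the accept bucket."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return h % buckets == accept_bucket

# A given user is always kept or always dropped, so per-user statistics
# (like the duplicate-query fraction) stay unbiased.
sampled = [u for u in (f"user{i}" for i in range(1000)) if keep_user(u)]
```

Because the decision depends only on the user id, all of a sampled user's queries land in the sample together, which is what the duplicate-query estimate requires.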
Sliding window on Single stream
Filtering stream
● Another common process on streams is selection, or filtering.
● Suppose we have a set S of one billion allowed email addresses – those that
we will allow through because we believe them not to be spam.
● The stream consists of pairs: an email address and the email itself. Since the
typical email address is 20 bytes or more, it is not reasonable to store S in
main memory.
Case for Bloom Filter
● Suppose for argument’s sake that we have one gigabyte of available main
memory.
● In the technique known as Bloom filtering, we use that main memory as a bit
array. In this case, we have room for eight billion bits, since one byte equals
eight bits.
● Devise a hash function h from email addresses to eight billion buckets. Hash
each member of S to a bit, and set that bit to 1. All other bits of the array
remain 0.
● Since there are one billion members of S, approximately 1/8th of the bits will
be 1. The exact fraction of bits set to 1 will be slightly less than 1/8th, because
it is possible that two members of S hash to the same bit.
Bloom Filter
● A Bloom filter is a data structure designed to tell you, rapidly and
memory-efficiently, whether an element is present in a set.
● Bloom filter is a probabilistic data structure: it tells us that the element either
definitely is not in the set or may be in the set.
● The base data structure of a Bloom filter is a bit vector, i.e., an array of bits
Bloom Filter
● A Bloom filter consists of:
○ An array of n bits, initially all 0’s.
○ A collection of hash functions h1, h2,...,hk. Each hash function maps “key” values to n buckets,
corresponding to the n bits of the bit-array.
○ A set S of m key values
● The purpose of the Bloom filter is to allow through all stream elements whose
keys are in S, while rejecting most of the stream elements whose keys are not
in S.
● To test a key K that arrives in the stream, check that all of h1(K), h2(K),...,hk(K)
are 1’s in the bit-array. If all are 1’s, then let the stream element through. If
one or more of these bits are 0, then K could not be in S, so reject the stream
element
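The add/test procedure above can be sketched as follows (an illustration with parameters of my choosing, not the book's code; salted SHA-256 stands in for the k hash functions h1,...,hk):

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)   # n-bit array, initially all 0's

    def _positions(self, key):
        # derive k bucket indices by hashing the key with k different salts
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, key):
        # set the bit h_i(key) for every hash function
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False => definitely not in S; True => possibly in S
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter(n_bits=8000, k=3)
for addr in ["a@example.com", "b@example.com"]:
    bf.add(addr)
```

Both `add` and `might_contain` run the key through the same k hash functions, which is why insertion and membership testing are both O(k).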
Bloom Filter: Add

Image source: https://freecontent.manning.com/all-about-bloom-filters/


Bloom Filter : False positive

Image source: https://freecontent.manning.com/all-about-bloom-filters/


Bloom Filter
● Given a Bloom filter with m bits and k hashing functions, both insertion and
membership testing are O(k).
● That is, each time you want to add an element to the set or check set
membership, you just need to run the element through the k hash functions
and add it to the set or check those bits.
Bloom Filter
Probability of False positive
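The standard approximation: for a Bloom filter with n bits, m inserted keys, and k hash functions, the false-positive probability is about (1 - e^(-km/n))^k. A quick numeric check (my own sketch, not code from the book):

```python
import math

def false_positive_prob(n_bits, m_keys, k_hashes):
    """Standard Bloom-filter approximation: the fraction of bits still 0
    after m insertions is about e^(-km/n); a false positive requires all
    k probed bits to be 1."""
    frac_ones = 1 - math.exp(-k_hashes * m_keys / n_bits)
    return frac_ones ** k_hashes

# The running example: 8 billion bits, 1 billion keys, one hash function
p1 = false_positive_prob(8e9, 1e9, 1)   # close to the "slightly less than 1/8" noted earlier
# More hash functions lower the rate, up to an optimum near k = (n/m) ln 2
p2 = false_positive_prob(8e9, 1e9, 5)
```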
Bloom Filter

● Approximate set membership problem
● Trade-off between the space used and the false positive probability
Counting Distinct Elements in a Stream
As a useful example of this problem, consider a Web site gathering statistics on
how many unique users it has seen in each given month. The universal set is the
set of logins for that site, and a stream element is generated each time someone
logs in. This measure is appropriate for a site like Amazon, where the typical user
logs in with their unique login name. A similar problem arises for a site like Google,
which does not require a login to issue a search query and may be able to identify
users only by the IP address from which the query is sent. There are about 4
billion IP addresses, so sequences of four 8-bit bytes will serve as the universal set
in this case.
Counting Distinct Elements in a Stream
● Problem:
○ Data stream consists of a universe of elements chosen from a set of size N
○ Maintain a count of the number of distinct elements seen so far
● Obvious approach:
○ Maintain the set of elements seen so far; that is, keep a hash table of all the distinct
elements seen so far
Counting Distinct Elements in a Stream
● How many different words are found among the Web pages being crawled at
a site?
○ Unusually low or high numbers could indicate artificial pages (spam?)
● How many different Web pages does each customer request in a week?
● How many distinct products have we sold in the last week?
Flajolet-Martin Algorithm

The Flajolet-Martin (FM) algorithm approximates the number of distinct elements in a data stream or database in a single pass. Its highlight is that it uses very little memory while executing.

Pseudocode, stepwise:

● Select a hash function h that maps each element of the universal set to a bit string of at least log2 N bits (where N is the size of the universal set)

● For each element x, let r(x) = the number of trailing zeros in h(x)

● Let R = max over all x of r(x); then the estimated number of distinct elements is 2^R
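The steps above can be sketched in Python (a minimal single-hash illustration; the hash choice is mine, and real deployments combine many hash functions to reduce the estimate's high variance):

```python
import hashlib

def trailing_zeros(x):
    """r(x): number of trailing zero bits in the integer x."""
    if x == 0:
        return 0          # convention for the all-zero hash value
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream):
    """Flajolet-Martin: estimate the distinct count as 2^R,
    where R = max trailing zeros over all hashed elements."""
    R = 0
    for element in stream:
        h = int(hashlib.md5(str(element).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

stream = [i % 100 for i in range(10000)]   # 100 distinct elements, many repeats
est = fm_estimate(stream)                  # always a power of 2; single-hash estimates are coarse
```

Because the estimate depends only on the set of distinct hash values, repeats in the stream never change the answer, which is what makes the algorithm a one-pass distinct counter.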
