Mining Data Stream
Ref: https://www.tibco.com/reference-center/what-is-data-streaming
Assumptions
● Data arrives in a stream or streams, and if it is not processed immediately or stored,
then it is lost forever.
● Moreover, we shall assume that the data arrives so rapidly that it is not feasible to store
it all in active storage (i.e., in a conventional database), and then interact with it at the
time of our choosing.
What are we going to learn?
● The algorithms for processing streams each involve summarization of the stream
in some way.
○ Sample a stream
○ Filter a stream
○ Estimate the number of distinct elements in a stream using much less
storage than would be required to list every element seen
Stream Model
● In many data mining situations, we do not know the entire dataset in advance
● Stream Management is important when the input rate is controlled externally:
○ Google queries
○ Twitter or Facebook status updates
● We can think of the data as infinite and non-stationary (the distribution
changes over time)
● Input elements enter at a rapid rate, at one or more input ports (i.e., streams)
○ We call elements of the stream tuples
● The system cannot store the entire stream accessibly
Data Stream Management System
Stream Sources
● Sensor Data
○ Surface temperature or surface height of the sea
○ One sensor sending a 4-byte real number every tenth of a second produces about 3.5 MB per day
○ A million such sensors produce about 3.5 TB of data every day
● Satellite Data
○ Satellite images
○ Surveillance system (low resolution)
■ About 6 million cameras in London, and each camera produces a stream
● Internet and Web traffic
○ Billions of search queries on Google
■ An increase in queries such as “sore throat” can signal the spread of an illness
○ Billions of clicks on Yahoo
■ A sudden increase in the click rate for a link can indicate breaking news or a broken link
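The sensor figures in the list above can be checked with a little arithmetic. The sketch below assumes exactly one million sensors and decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); the constant names are illustrative only.

```python
# Back-of-the-envelope check of the sensor data rates.
BYTES_PER_READING = 4            # one 4-byte real number per reading
READINGS_PER_SECOND = 10         # one reading every tenth of a second
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

# Data produced by a single sensor in one day.
bytes_per_sensor_per_day = BYTES_PER_READING * READINGS_PER_SECOND * SECONDS_PER_DAY

# Assumption: a deployment of exactly one million sensors.
NUM_SENSORS = 1_000_000
total_bytes_per_day = bytes_per_sensor_per_day * NUM_SENSORS

print(f"{bytes_per_sensor_per_day / 1e6:.2f} MB per sensor per day")
print(f"{total_bytes_per_day / 1e12:.2f} TB per day for all sensors")
```

A single sensor is easy to store; it is the multiplication by the number of sensors that makes the stream too large for conventional storage.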
Stream Queries
● Ad-hoc Queries:
○ A question asked once about the current state of a stream or streams.
○ A common approach is to store a sliding window of each stream in the working store.
○ A sliding window can be the most recent n elements of a stream, for some n, or it can be all
the elements that arrived within the last t time units.
● Standing Queries
○ Execute permanently and produce outputs at appropriate times
○ Alert when the temperature exceeds X degrees Centigrade
○ Average of the most recent N readings
○ Maximum temperature recorded so far
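The two query types above can be sketched together for a temperature stream. This is a minimal illustration, not a real stream management system: the class name `TemperatureMonitor` and its methods are hypothetical, and the average is assumed to be taken over a sliding window of the most recent n readings.

```python
from collections import deque


class TemperatureMonitor:
    """Keeps a sliding window of recent readings and answers standing queries."""

    def __init__(self, window_size, threshold):
        # deque with maxlen drops the oldest reading automatically
        self.window = deque(maxlen=window_size)
        self.threshold = threshold
        self.max_seen = float("-inf")  # standing query: maximum so far

    def on_reading(self, value):
        """Process one stream element; return an alert string or None."""
        self.window.append(value)
        self.max_seen = max(self.max_seen, value)
        # standing query: report whenever the threshold is exceeded
        if value > self.threshold:
            return f"alert: {value} exceeds {self.threshold}"
        return None

    def window_average(self):
        """Average over the sliding window; None if no reading has arrived."""
        if not self.window:
            return None
        return sum(self.window) / len(self.window)


monitor = TemperatureMonitor(window_size=3, threshold=30.0)
for reading in [21.0, 25.0, 31.5, 24.0]:
    alert = monitor.on_reading(reading)
    if alert is not None:
        print(alert)
print("window:", list(monitor.window))
print("max so far:", monitor.max_seen)
```

The maximum needs only one stored value, while the windowed average needs the last n readings. An ad-hoc query arriving later can only be answered from whatever the window still holds.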
Sampling Stream
The Flajolet-Martin algorithm, also known as the FM algorithm, approximates the number of unique elements
in a data stream or database in a single pass. Its main advantage is that it needs far less memory than
storing every distinct element seen so far.
● Select a hash function h that maps each element of the set to a bit string of at least log2(n) bits, where n is an upper bound on the number of distinct elements.
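A minimal sketch of the single-hash form of the estimate may help here. It uses the hashing step above, then tracks R, the largest number of trailing zero bits seen in any hash value, and returns 2^R. The function names, the use of SHA-256 as the hash, and the 32-bit width are assumptions made for illustration, not part of the original description.

```python
import hashlib


def trailing_zeros(x, width):
    """Number of trailing zero bits in x; by convention `width` when x == 0."""
    if x == 0:
        return width
    # x & -x isolates the lowest set bit
    return (x & -x).bit_length() - 1


def fm_estimate(stream, num_bits=32):
    """Single-hash Flajolet-Martin estimate of the number of distinct elements."""
    max_r = 0
    seen_any = False
    for element in stream:
        seen_any = True
        # Assumption: SHA-256 truncated to num_bits stands in for the hash h.
        digest = hashlib.sha256(str(element).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big") & ((1 << num_bits) - 1)
        max_r = max(max_r, trailing_zeros(h, num_bits))
    # An empty stream has no distinct elements.
    return 2 ** max_r if seen_any else 0


stream = ["a", "b", "a", "c", "b", "a", "d"]
print("estimated distinct elements:", fm_estimate(stream))
```

Only the single integer R is kept in memory, however long the stream is. A single hash function gives a rough estimate that is always a power of two, so practical versions combine the results of many hash functions.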