Stream Computing Methods
“Big Data Analytics”, Ch.07 L04: Data Stream Mining ....Spark Streaming, 2019
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
Big Data Stream Computing
• Stream computing is a way to analyze
and process Big Data in real time, to
gain current insights for taking appropriate
decisions or predicting new trends in the
immediate future
• It is implemented in a distributed, clustered
environment
• Data arrives in the stream at a high rate
Stream Computing Applications
• Financial sectors,
• Business intelligence,
• Risk management,
• Marketing management,
• Search engines, and
• Social network analysis
Data Stream Algorithm Efficiency
Measures
1. Number of passes (scans) the algorithm
must make over the stream
2. Memory available to the algorithm
3. Running time of the algorithm
Sampling Methods: Obtaining a
Representative Sample
• First category: probabilistic sampling,
a statistical technique based on
randomized selection
• Second category: non-probabilistic
sampling, which uses arbitrary or purposive
(biased) sample selection instead of
randomized selection
Reservoir Sampling Method
• A random sampling method that chooses a
fixed-size sample of data items at random from a list
containing a very large number of items
• The list is too large to fit
in main memory
• Example 7.4
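The reservoir sampling idea above can be sketched in a few lines of Python. This is a minimal illustration of the classic single-pass scheme (often called Algorithm R), not a reproduction of Example 7.4. The function name reservoir_sample and the parameters k and rng are assumptions made for this sketch.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Only k items are held in memory, so the stream may be far larger
    than main memory and its length need not be known in advance.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1) by
            # overwriting a uniformly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical usage: sample 5 items from a stream of one million integers.
sample = reservoir_sample(range(1_000_000), k=5, rng=random.Random(42))
print(sample)
```

Each arriving item replaces a stored item with a probability that shrinks as the stream grows, which is what keeps every item equally likely to end up in the final sample.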
Filtering of Stream
Filtering of Stream: The Bloom
Filter Analysis
• A simple space-efficient data structure
introduced by Burton Howard Bloom in
1970.
• The filter tests the membership of
an element in a dataset.
Bloom Filter
• The filter is basically a bit vector of
length m that represents a set S = {x1, x2,
. . . , xn} of n elements
• Initially all bits are 0. Then define k
independent hash functions h1, h2, . . . ,
hk, each of which maps (hashes)
an element x in set S to one of the m
array positions with a uniform random
distribution.
• The number k is a constant, much smaller
than m. For each element x ∈ S,
the bits at positions hi(x) are set to 1 for 1 ≤ i ≤ k.
[∈ is the set-theory symbol for
‘is an element of’.]
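The insert and lookup steps of a Bloom filter can be sketched as below. This is a minimal illustration under stated assumptions: the class name BloomFilter is hypothetical, and the k hash functions are simulated by salting SHA-256 with the index i, which is only one convenient choice. A lookup can report a false positive, but never a false negative.

```python
import hashlib


class BloomFilter:
    """Bit vector of length m with k hash functions (illustrative sketch)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # initially all bits are 0

    def _positions(self, x):
        # Assumption: derive h_1(x)..h_k(x) by salting one hash with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, x):
        # Set the bits at positions h_i(x) to 1 for 1 <= i <= k.
        for p in self._positions(x):
            self.bits[p] = 1

    def __contains__(self, x):
        # x is "possibly present" only if all k bits are 1.
        return all(self.bits[p] for p in self._positions(x))


# Hypothetical usage.
bf = BloomFilter(m=1000, k=3)
bf.add("apple")
bf.add("banana")
print("apple" in bf)   # True: inserted elements are always found
print("cherry" in bf)  # very likely False; a false positive is possible
```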
Figure 7.4 (a) Inserting an element x in bit vector
(filter) of length m = 10, (b) finding an element y in
an example of Bloom Filter
Counting Bloom Filter: A Variant
of Bloom Filter
• Maintains a counter for each bit in the
Bloom filter
• The counters corresponding to the k
hash values are incremented or decremented
whenever an element is
added to or deleted from the filter, respectively
• As soon as a counter changes from 0 to
1, the corresponding bit in the bit vector
is set to 1. When a counter changes
from 1 to 0, the corresponding bit in the
bit vector is set to 0. Each counter
thus maintains the number of
elements that hashed to its position
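The counter and bit updates described above can be sketched as follows. The class name CountingBloomFilter and the salted SHA-256 hashing are assumptions made for illustration. Deletion is only safe for elements that were actually added; removing an element that merely appears present (a false positive) would corrupt the counters.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with one counter per bit, so deletion is possible."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.counters = [0] * m
        self.bits = [0] * m

    def _positions(self, x):
        # Assumption: simulate k hash functions by salting SHA-256.
        return [
            int.from_bytes(
                hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()[:8], "big"
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, x):
        for p in self._positions(x):
            self.counters[p] += 1
            if self.counters[p] == 1:  # counter went 0 -> 1
                self.bits[p] = 1

    def remove(self, x):
        if x not in self:
            raise KeyError(x)
        for p in self._positions(x):
            self.counters[p] -= 1
            if self.counters[p] == 0:  # counter went 1 -> 0
                self.bits[p] = 0

    def __contains__(self, x):
        return all(self.bits[p] for p in self._positions(x))


# Hypothetical usage.
cbf = CountingBloomFilter(m=100, k=3)
cbf.add("apple")
cbf.remove("apple")
print("apple" in cbf)  # False: all of its counters are back to 0
```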
Figure 7.5 An example of Counting Bloom Filter
Counting Distinct Elements in a
Stream
• Relates to finding the number of
distinct elements in a data stream
• The stream of data contains repeated
elements
• This is a well-known problem in
networking and databases
Counting Distinct Elements in a
Stream and Count Distinct Problem
• If there are n possible elements a1, a2, …, an,
then an exact result requires n
units of space, because in the worst case
all n elements can be present. Let m be
the number of distinct elements. The
objective is to find an estimate of m
using only s storage units, where s ≪
m.
Bitmap algorithm to compute
distinct elements
• Example 7.6
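Example 7.6 is not reproduced here. As a rough illustration of one simple bitmap approach, the sketch below counts distinct elements exactly by keeping one bit per possible element. It assumes the elements are integers in the range 0 to n - 1; the function name count_distinct_bitmap is hypothetical.

```python
def count_distinct_bitmap(stream, n):
    """Count distinct elements exactly using one bit per possible element.

    Assumes every element is an integer in range(n).
    """
    bitmap = bytearray((n + 7) // 8)  # n bits, all initially 0
    count = 0
    for a in stream:
        byte_index, bit_index = divmod(a, 8)
        mask = 1 << bit_index
        if not bitmap[byte_index] & mask:
            # First time this element is seen: set its bit and count it.
            bitmap[byte_index] |= mask
            count += 1
    return count


# Hypothetical usage: the distinct elements are 0, 1, 3 and 7.
print(count_distinct_bitmap([3, 1, 3, 7, 1, 0], n=8))  # 4
```

This needs n bits regardless of how many elements actually occur, which is the cost that approximate methods try to avoid.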
The Flajolet–Martin (FM)
Algorithm
• The FM method approximates m, the
number of distinct (unique) elements in
a stream or a database, in one pass
• For a stream of n elements
with m unique elements, it runs in O(n)
time and needs O(log(m)) memory
Features of the FM algorithm
(i) Hash-based algorithm.
(ii) Needs several repetitions to get a good
estimate.
(iii) The more different elements in the
data, the more different hash values are
obtained
(iv) The more different hash values there are, the
greater the chance that one of them is
unusual [the unusual property can be
that the value ends in many 0s
(alternatives also exist)].
• Example 7.5.
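The features above can be sketched in Python as follows. This is not Example 7.5. It is a minimal illustration that uses the "ends in many 0s" property: for each hash function it tracks the largest number of trailing zero bits R seen and estimates the count as 2^R. The salted SHA-256 hashes, the function names, and the way the repetitions are combined (median of group averages) are assumptions of this sketch.

```python
import hashlib
import statistics


def _hash(x, seed):
    # Assumption: simulate a family of hash functions by salting SHA-256.
    digest = hashlib.sha256(f"{seed}:{x}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # 32-bit hash value


def trailing_zeros(v, bits=32):
    """Number of 0 bits at the low end of v (bits if v is 0)."""
    return bits if v == 0 else (v & -v).bit_length() - 1


def fm_estimate(stream, num_groups=8, group_size=8):
    """Estimate the number of distinct elements in one pass."""
    num_hashes = num_groups * group_size
    max_tz = [0] * num_hashes  # one small integer per hash function
    for x in stream:
        for s in range(num_hashes):
            tz = trailing_zeros(_hash(x, s))
            if tz > max_tz[s]:
                max_tz[s] = tz
    estimates = [2 ** r for r in max_tz]
    # Several repetitions are needed for a good estimate: average
    # within groups, then take the median of the group averages.
    group_means = [
        sum(estimates[g * group_size:(g + 1) * group_size]) / group_size
        for g in range(num_groups)
    ]
    return statistics.median(group_means)


# Hypothetical usage: 1000 distinct values, each repeated 5 times.
stream = [i % 1000 for i in range(5000)]
print(fm_estimate(stream))
```

Repeated elements hash to the same values, so they never change the stored maxima; only new distinct elements can raise the estimate.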
Counting of 1’s in a Window
Sliding Time Window Method for
Data Stream Processing
• A popular model for infinite data
stream processing. (The window is the
time interval over which queries on the
stream are raised and processed.)
• Data elements arrive and are
received one by one.
• Statistical computations are performed over a
sliding time window of size N time units
(not over the whole stream)
• The window covers the most recently
arrived data items.
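A sliding window can be sketched as below for the simple case of counting 1's exactly. The sketch assumes one element arrives per time unit, so a window of N time units holds the N most recent elements; the class name SlidingWindowCounter is hypothetical. Note that this exact approach stores all N elements.

```python
from collections import deque


class SlidingWindowCounter:
    """Exact count of 1's among the N most recent bits of a stream."""

    def __init__(self, n):
        self.window = deque(maxlen=n)
        self.ones = 0

    def push(self, bit):
        if len(self.window) == self.window.maxlen:
            # The oldest bit is about to slide out of the window.
            self.ones -= self.window[0]
        self.window.append(bit)  # deque drops the oldest item itself
        self.ones += bit


# Hypothetical usage with a window of the 3 most recent bits.
counter = SlidingWindowCounter(n=3)
for b in [1, 0, 1, 1, 0]:
    counter.push(b)
print(counter.ones)  # 2, since the window now holds 1, 1, 0
```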
Datar-Gionis-Indyk-Motwani
(DGIM) algorithm
• Gives a solution when N is very large
(assume N is 1 billion)
• The DGIM algorithm needs to store only
O((log2 N)²) bits per stream, instead of all N bits
• Uses the concept of time buckets
• The time window is divided into a number of
buckets; these differ from hash buckets
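The time-bucket idea can be sketched as below. This is a minimal illustration of the commonly described DGIM scheme, not the book's own listing: each bucket records the timestamp of its most recent 1 and a size that is a power of 2, at most two buckets of each size are kept, and the oldest bucket contributes only half its size to the estimate. The class and method names are assumptions.

```python
class DGIM:
    """Approximate count of 1's in the last n bits using time buckets."""

    def __init__(self, n):
        self.n = n
        self.t = 0
        self.buckets = []  # (end_time, size) pairs, newest first

    def push(self, bit):
        self.t += 1
        # Drop the oldest bucket once its most recent 1 leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.n:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        # Restore the invariant: at most two buckets of each size.
        i = 0
        while (i + 2 < len(self.buckets)
               and self.buckets[i][1] == self.buckets[i + 2][1]):
            # Merge the two oldest of the three equal-size buckets and
            # keep the timestamp of the more recent one.
            newer = self.buckets[i + 1]
            older = self.buckets[i + 2]
            self.buckets[i + 1] = (newer[0], newer[1] + older[1])
            del self.buckets[i + 2]
            i += 1

    def count(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only part of the oldest bucket may lie inside the window,
        # so count half of it.
        return total - self.buckets[-1][1] // 2


# Hypothetical usage.
dgim = DGIM(n=100)
for b in [1, 0, 1, 1, 1]:
    dgim.push(b)
print(dgim.buckets)  # [(5, 1), (4, 1), (3, 2)]
print(dgim.count())  # 3 (the true count is 4)
```

Because only the oldest bucket is uncertain, the estimate is off by at most half of the true count, while the number of buckets grows only logarithmically with the window size.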
Decaying Windows
• Useful in applications that need to
identify the most common
elements
• Decaying window concept assigns more
weight to recent elements
• The technique computes a smooth
aggregation of all the 1’s ever seen in
the stream, with decaying weights.
• The further back in the stream an element
appeared, the less weight it is given
• The effect of exponentially decaying
weights is to spread the weights of
the stream elements as far back in time
as the stream goes
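An exponentially decaying window can be sketched as below for finding the most common elements. On each arrival every stored score is multiplied by (1 - c), and the arriving element's score is increased by 1. The decay constant c, the drop threshold of 0.5, and the function name decaying_counts are assumptions made for this sketch.

```python
def decaying_counts(stream, c=1e-3, threshold=0.5):
    """Return exponentially decayed counts of the elements in a stream."""
    scores = {}
    for item in stream:
        # Decay every existing score; recent elements keep more weight.
        for key in list(scores):
            scores[key] *= (1.0 - c)
            if scores[key] < threshold:
                # Drop elements whose weight has decayed too far.
                del scores[key]
        # The newly arrived element gets full weight 1.
        scores[item] = scores.get(item, 0.0) + 1.0
    return scores


# Hypothetical usage with a deliberately large decay constant.
print(decaying_counts(["a", "b", "b"], c=0.5))  # {'b': 1.5}
```

Dropping scores below the threshold bounds the number of elements tracked at any time, which is what makes the method practical on an unbounded stream.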
Summary
We learnt:
• Big Data Stream Computing
• Sampling the stream
• Filtering
• Bloom Filter, Counting Bloom Filter
• Counting distinct elements
• Flajolet–Martin Algorithm with one
pass
• Counting of 1’s in a Window
• Sliding time window
• DGIM algorithm
• Decaying Windows
End of Lesson 4 on
Stream Computing Methods