Lesson 4

Stream Computing Methods

2019, “Big Data Analytics”, Ch.07 L04: Data Stream Mining ....Spark Streaming
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
Big Data Stream Computing
• Stream computing is a way to analyze and process Big Data in real time, to gain current insights, take appropriate decisions or predict new trends in the immediate future
• It is implemented in a distributed, clustered environment
• Data arrives in the stream at a high rate
Stream Computing Applications
• Financial sectors,
• Business intelligence,
• Risk management,
• Marketing management,
• Search engines, and
• Social network analysis

Data Stream Algorithms: Efficiency Measurements
1. Number of passes (scans) the algorithm must make over the stream
2. Memory available to the algorithm
3. Running time of the algorithm

Sampling Methods: Obtaining a Representative Sample
• The first category, probabilistic sampling, is a statistical technique based on randomized selection
• The second category, non-probabilistic sampling, uses arbitrary or purposive (biased) sample selection instead of randomized selection

Reservoir Sampling Method
• A random sampling method that chooses a limited-size sample of data items at random from a list containing a very large number of items
• The list is too large to fit in main memory
• Example 7.4
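The reservoir idea above can be sketched in a few lines. This is a minimal illustration of the classic single-pass scheme, not the book's Example 7.4; the function name `reservoir_sample` and the parameter `k` (sample size) are assumptions made for the sketch.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable
    whose length is unknown and may not fit in memory."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)         # uniform index in [0, i]
            if j < k:
                reservoir[j] = item       # happens with probability k / (i + 1)
    return reservoir

# Draw 5 items from a stream of 100,000 without storing the stream.
sample = reservoir_sample(range(100_000), k=5, rng=random.Random(42))
```

Only the k reservoir slots are held in memory, whatever the length of the stream.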

Concise Sampling
• Concise sampling is like the reservoir sampling method, with the difference that a value that appears once is stored as a singleton, whereas a value that appears more than once is stored as a (value, count) pair
• A new data item is inserted in the sample with a probability of 1/n
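A rough sketch of concise sampling follows. It assumes the usual formulation in which the insertion probability is 1/tau for a threshold tau that is raised whenever the sample outgrows its memory budget; mapping the slide's 1/n onto `tau`, and the names `max_footprint` and `growth`, are assumptions of this sketch rather than details from the slides.

```python
import random

def concise_sample(stream, max_footprint, rng=None, growth=1.5):
    """Keep a concise sample: value -> count, where a count of 1 is a
    singleton (1 storage unit) and a larger count is a (value, count)
    pair (2 storage units)."""
    rng = rng or random.Random()
    tau = 1.0                 # current threshold; insert with probability 1/tau
    counts = {}

    def footprint():
        return sum(1 if c == 1 else 2 for c in counts.values())

    for value in stream:
        if rng.random() < 1.0 / tau:
            counts[value] = counts.get(value, 0) + 1
        # Over budget: raise the threshold and thin out the sample so that
        # every sample point survives with probability tau / new_tau.
        while footprint() > max_footprint:
            new_tau = tau * growth
            keep = tau / new_tau
            for v in list(counts):
                kept = sum(1 for _ in range(counts[v]) if rng.random() < keep)
                if kept:
                    counts[v] = kept
                else:
                    del counts[v]
            tau = new_tau
    return counts, tau

counts, tau = concise_sample((i % 50 for i in range(5000)), 20, random.Random(7))
```

Because repeated values share one entry, the same memory budget can represent more sample points than a plain reservoir.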
Counting Sampling
• A refinement of concise sampling in terms of accuracy
• The method also maintains the sample when data items are deleted
• It decrements the count of a value when that value is deleted
• [Deletion here means moving on to the next item after reading.]
Procedures for calculating sample sizes
(i) Estimation, called the confidence interval approach, and
(ii) Hypothesis testing. Statistics prescribes the chi-squared test, t-test, z-test, F-test and p-value for testing the significance of a statistical inference
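For the confidence interval approach, one standard textbook formula for estimating a proportion is n = z² p (1 − p) / e². The formula is not taken from the slides; the sketch below only illustrates the idea, and the names `margin_of_error`, `confidence` and `p` are assumptions.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin_of_error, confidence=0.95, p=0.5):
    """Smallest n for which the confidence interval of an estimated
    proportion has the requested half-width (large-population formula)."""
    # Two-sided critical value z for the chosen confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z * z * p * (1 - p) / margin_of_error ** 2)

n = sample_size_for_proportion(0.05)   # 95% confidence, +/- 5 percentage points
```

Using p = 0.5 is the conservative choice, because it maximizes p (1 − p) and therefore the required sample size.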

Filtering of Stream

• Identifies the sequence patterns in a stream
• Stream filtering is the process of selecting or matching instances of a desired pattern in a continuous stream of data

Example

• Assume that a data stream consists of tuples
• Filtering steps: (i) accept the tuples in the stream that meet a criterion, (ii) pass the accepted tuples to another process as a stream, and (iii) discard the remaining tuples
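The three filtering steps can be sketched with a generator, which hands accepted tuples on as a new stream without storing the input. The sensor readings and the threshold are made-up illustration data.

```python
def filter_stream(tuples, criterion):
    """Yield only the tuples that satisfy the criterion."""
    for t in tuples:
        if criterion(t):
            yield t          # (i) accept and (ii) pass on as a stream
        # (iii) tuples that fail the criterion are simply discarded

# Hypothetical (sensor_id, temperature) tuples.
readings = [("s1", 21.5), ("s2", 48.0), ("s1", 52.3), ("s3", 19.9)]
hot = list(filter_stream(readings, lambda t: t[1] > 45.0))
```

Here `hot` holds the two readings above 45.0, in arrival order.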

Filtering of Stream: The Bloom Filter Analysis
• A simple, space-efficient data structure introduced by Burton Howard Bloom in 1970
• The filter tests the membership of an element in a dataset

Bloom Filter
• The filter is basically a bit vector of length m that represents a set S = {x1, x2, . . . , xn} of n elements
• Initially, all bits are 0. Then, k independent hash functions h1, h2, . . . , hk are defined, each of which maps (hashes) an element x in set S to one of the m array positions with a uniform random distribution
Bloom Filter
• The number k is a constant, much smaller than m. For each element x ∈ S, the bits at positions hi(x) are set to 1 for 1 ≤ i ≤ k. [∈ is the set-theory symbol for ‘contained in’.]
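A minimal Bloom filter along these lines might look as follows. Deriving the k hash functions by salting SHA-256 with the index i is an assumption made for the sketch; any k roughly independent, uniform hash functions would do.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                  # length of the bit vector
        self.k = k                  # number of hash functions
        self.bits = [0] * m         # initially all bits are 0

    def _positions(self, item):
        # h_1(x) .. h_k(x): salt the item with i to get k different hashes.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Any 0 bit means "definitely absent"; all 1s means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1000, k=3)
for word in ["stream", "window", "bucket"]:
    bf.add(word)
```

A lookup can return a false positive when other elements happen to have set all k bits, but it never returns a false negative.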

Figure 7.4 (a) Inserting an element x in bit vector
(filter) of length m = 10, (b) finding an element y in
an example of Bloom Filter

Counting Bloom Filter: A Variant of Bloom Filter
• Maintains a counter for each bit in the Bloom filter
• The counters corresponding to the k hash values are incremented or decremented whenever an element is added to or deleted from the filter, respectively

Counting Bloom Filter: A Variant of Bloom Filter
• As soon as a counter changes from 0 to 1, the corresponding bit in the bit vector is set to 1. When a counter changes from 1 to 0, the corresponding bit in the bit vector is set to 0. The counter basically maintains the number of elements that hashed to that position
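The counting variant can be sketched by replacing each bit with a small counter; a position counts as "set" while its counter is above 0. As before, the salted SHA-256 hashing and the method names are assumptions of this sketch.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.counters = [0] * m     # one counter per bit position

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Only remove items that look present, so no counter drops below 0.
        if item not in self:
            return False
        for pos in self._positions(item):
            self.counters[pos] -= 1
        return True

    def __contains__(self, item):
        # The implied bit is 1 exactly when the counter is positive.
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = CountingBloomFilter(m=1000, k=3)
cbf.add("a")
cbf.add("b")
cbf.remove("a")
```

Deletion is what the plain Bloom filter cannot support: clearing a bit there could also erase evidence of other elements that hashed to the same position.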

Figure 7.5 An example of Counting Bloom Filter

Counting Distinct Elements in a
Stream
• The problem is to find the number of distinct (dissimilar) elements in a data stream
• The stream of data contains repeated
elements
• This is a well-known problem in
networking and databases

Counting Distinct Elements in a Stream: The Count-Distinct Problem
• If there are n possible elements a1, a2, …, an, then an exact result requires n units of storage, because in the worst case all n elements can be present. Let m be the number of distinct elements. The objective is to find an estimate of m using only s storage units, where s ≪ m.
Bitmap algorithm to compute
distinct elements
• Example 7.6

The Flajolet–Martin (FM) Algorithm
• The FM method approximates m, the number of distinct (unique) elements in a stream or a database, in one pass
• For a stream of n elements with m unique elements, the algorithm runs in O(n) time and needs O(log(m)) memory

The Flajolet–Martin (FM) Algorithm
• Thus, the space consumption is logarithmic in the maximum number of possible distinct elements in the stream, which makes the algorithm innovative

Features of the FM algorithm
(i) Hash-based algorithm.
(ii) Needs several repetitions to get a good
estimate.
(iii) The more different elements in the
data, the more different hash values are
obtained

Features of the FM algorithm
(iv) The more different hash values there are, the greater the chance that one of them will be unusual [the unusual property can be that the value ends in many 0s (alternatives also exist)].
• Example 7.5.
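The features listed above can be sketched as follows, using the "ends in many 0s" property: for each hash function, track the largest number of trailing zeros R seen and take 2^R as an estimate. This is not the book's Example 7.5. Combining the repetitions as a median of group averages, the salted SHA-256 hashes, and the parameters `num_hashes` and `group_size` are assumptions of this sketch.

```python
import hashlib
from statistics import mean, median

def trailing_zeros(x, width=32):
    """Number of 0 bits at the low end of x (width if x is 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

def fm_estimate(stream, num_hashes=32, group_size=4):
    """One-pass estimate of the number of distinct elements."""
    max_zeros = [0] * num_hashes          # one small integer per hash function
    for item in stream:
        for i in range(num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            h = int.from_bytes(digest[:4], "big")      # 32-bit hash value
            max_zeros[i] = max(max_zeros[i], trailing_zeros(h))
    estimates = [2 ** r for r in max_zeros]
    # Several repetitions are needed: average within groups, then take
    # the median of the group averages to damp outliers.
    groups = [estimates[i:i + group_size]
              for i in range(0, num_hashes, group_size)]
    return median(mean(g) for g in groups)

approx = fm_estimate(f"user{i % 200}" for i in range(2000))   # 200 distinct values
```

Repeated elements hash to the same values, so they never change `max_zeros`; only new distinct elements can push the estimate up.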

Counting of 1’s in a Window

• Infinite stream processing
• Figure 7.3: The volume of data is so large that it cannot be stored, and there is hardly a chance to look at all of it.
• Important queries are likely to ask only about the most recent data or about summaries of the data.

Sliding Time Window Method for
Data Stream Processing
• A popular model for infinite data stream processing. (Window refers to the time interval over which the stream is considered when queries are raised and processed.)
• Data elements are received one by one as they arrive.

Sliding Time Window Method for
Data Stream Processing
• Statistical computations are performed over a sliding time window of size N time units (not over the whole stream)
• The window covers the most recently arrived data items.
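An exact count of 1's over a sliding window can be sketched with a bounded queue. The sketch assumes one element arrives per time unit, so a window of N time units holds the last N bits; the class name is hypothetical.

```python
from collections import deque

class SlidingWindowCounter:
    """Exact count of 1's among the most recent n bits of a stream."""

    def __init__(self, n):
        self.window = deque(maxlen=n)   # holds only the last n bits
        self.ones = 0

    def add(self, bit):
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]     # the oldest bit is about to drop out
        self.window.append(bit)
        self.ones += bit
        return self.ones

counter = SlidingWindowCounter(n=3)
for bit in [1, 1, 0, 1, 0]:
    counter.add(bit)
```

This exact approach stores all N bits of the window, which is what becomes impractical when N is very large.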

Datar-Gionis-Indyk-Motwani (DGIM) algorithm
• Gives a solution when N is very large (assume N is 1 billion)
• The DGIM algorithm stores just O((log2 N)^2) bits per stream

Datar-Gionis-Indyk-Motwani
(DGIM) algorithm
• Uses the concept of time buckets
• A time window is divided into a number of buckets, which are different from hash buckets
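The time-bucket idea can be sketched as below. The sketch follows the commonly described form of DGIM: each bucket records the timestamp of its most recent 1 and a size that is a power of 2, at most two buckets of any size are kept, and the estimate adds half of the oldest bucket. These details and all names are assumptions, since the slides only name the concept.

```python
class DGIM:
    """Approximate count of 1's in the last window_size bits."""

    def __init__(self, window_size):
        self.n = window_size
        self.time = 0
        self.buckets = []       # (timestamp of newest 1, size), newest first

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its newest 1 has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.n:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets):
            a, b, c = self.buckets[i:i + 3]
            if a[1] == b[1] == c[1]:
                self.buckets[i + 1:i + 3] = [(b[0], b[1] * 2)]
            else:
                i += 1

    def count(self):
        if not self.buckets:
            return 0
        newer = sum(size for _, size in self.buckets[:-1])
        return newer + self.buckets[-1][1] / 2   # oldest bucket counts half

dgim = DGIM(window_size=100)
for _ in range(10):
    dgim.add(1)
estimate = dgim.count()
```

Only the oldest bucket can straddle the window boundary, so the estimate is off by at most half of that bucket's size.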

Decaying Windows
• Useful in applications that need to identify the most common elements
• Decaying window concept assigns more
weight to recent elements
• The technique computes a smooth
aggregation of all the 1’s ever seen in
the stream, with decaying weights.
Decaying Windows
• The further back in the stream an element appeared, the less weight it is given.
• The effect of exponentially decaying weights is to spread out the weights of the stream elements as far back in time as the stream goes
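An exponentially decaying count can be sketched with a single running total: on each arrival the total is multiplied by (1 − c) and the new bit is added. The decay constant `c` and the class name are assumptions; the slides do not give a value.

```python
class DecayingCounter:
    """Smooth aggregate of all the 1's ever seen, with older bits
    weighted by successive powers of (1 - c)."""

    def __init__(self, c):
        self.c = c              # small constant, e.g. 1e-6 for a long memory
        self.total = 0.0

    def add(self, bit):
        # Decay everything seen so far, then add the newest element at weight 1.
        self.total = self.total * (1 - self.c) + bit
        return self.total

dc = DecayingCounter(c=0.5)     # large c chosen only to make the numbers visible
history = [dc.add(bit) for bit in [1, 1, 1, 0]]
```

With c = 0.5 the totals are 1.0, 1.5, 1.75 and 0.875. Unlike a sliding window, nothing has to be stored or expired: one number per tracked item is enough.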

Summary
We learnt:
• Big Data Stream Computing
• Sampling the stream
• Filtering
• Bloom Filter, Counting Bloom Filter
• Counting distinct elements
• Flajolet–Martin Algorithm with one
pass

Summary
We learnt:
• Counting of 1’s in a Window
• Sliding time window
• DGIM algorithm
• Decaying Windows

End of Lesson 4 on
Stream Computing Methods
