BigData Mod2
NOTES
STREAM DATA MODEL
▪ A stream is data in motion.
▪ Data generated continuously at high velocity and in large volumes is known as
streaming data.
▪ A stream data source is characterized by continuous time-stamped logs in real time.
▪ Example: a user clicking a link on a web page. Stream data sources include:
✓ Server and security logs
✓ Clickstream data from websites and apps
✓ IoT sensors
✓ Real-time advertising platforms
▪ The different ways of modelling, querying, processing, and managing a data stream are:
1. Graph model
2. Relation-oriented stream-tuples model
3. Object-based data stream model
STREAM COMPUTING
▪ Stream computing is a way to analyse and process Big Data in real time to gain
current insights.
▪ Stream computing has many uses, for example in the financial sector for business intelligence, risk management, marketing management, etc.
▪ Stream computing is also used in search engines and social network analysis.
FILTERING STREAMS
▪ Due to the nature of data streams, stream filtering is one of the most useful and
practical approaches to efficient stream evaluation.
▪ Stream filtering is the process of selecting or matching instances of a desired pattern in a continuous stream of data.
▪ The filtering steps for a stream (see the sketch after this list) are:
(i) Accept the tuples in the stream that meet the criterion.
(ii) Pass the accepted tuples to another process as a stream.
(iii) Discard the remaining tuples.
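To make these steps concrete, here is a minimal sketch in Python; the stream is modelled as an iterator, and the sensor-reading tuples and threshold criterion are hypothetical illustrations, not from the notes.

```python
def filter_stream(stream, criterion):
    """Filter a stream of tuples against a criterion."""
    for item in stream:
        if criterion(item):      # (i) accept tuples that meet the criterion
            yield item           # (ii) pass accepted tuples on as a stream
        # (iii) remaining tuples are discarded implicitly

# Hypothetical usage: keep only sensor readings above a threshold.
readings = iter([("s1", 17.2), ("s2", 41.8), ("s3", 9.5), ("s4", 55.0)])
for tup in filter_stream(readings, lambda t: t[1] > 20.0):
    print(tup)   # ('s2', 41.8) then ('s4', 55.0)
```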
▪ Several filtering techniques are:
✓ Bloom filter and its variants
✓ Stream Quotient Filter (SQF)
✓ Particle filter
✓ XML filters, etc.
BLOOM FILTER ANALYSIS
• The Bloom filter is a simple, space-efficient data structure introduced by Burton
Howard Bloom in 1970.
• The filter tests the membership of an element in a dataset.
• The filter has been widely used in applications such as:
✓ Database applications
• An element is inserted by hashing it with k hash functions and setting the
corresponding bits of a bit array to 1; a lookup reports membership only if all k of
its bits are 1.
• Some of those bits may by chance have been set to 1 during the insertion of other
elements, so a lookup can wrongly report membership. This is a false positive.
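A minimal sketch of this mechanism in Python. The bit-array size m, the number of hash functions k, and the use of salted SHA-256 digests are illustrative choices, not part of the notes.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k salted hash functions."""

    def __init__(self, m=64, k=3):
        self.m = m              # number of bits
        self.k = k              # number of hash functions
        self.bits = [0] * m     # the bit array, all 0 initially

    def _positions(self, item):
        # Derive k bit positions by salting one base hash with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for element in [2, 3, 1, 4]:
    bf.add(element)
print(bf.might_contain(3))   # True
print(bf.might_contain(99))  # False (or, rarely, a false positive)
```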
Problem
Estimate the number of distinct elements in the following stream using the FM
(Flajolet-Martin) algorithm:
s = 2, 3, 1, 2, 3, 4, 3, 1, 2, 3, 1, 4
Solution
Apply the given hash function to each element of the input stream, take the binary
equivalent of each hash value, and count its trailing zeros to get:
(i) The maximum number of trailing zeros across the binary equivalents, r = 2.
(ii) The estimate of the distinct count, R = 2^r = 2² = 4.
(iii) Therefore R = 4, meaning there are four (4) distinct values: 2, 3, 1, 4.
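A minimal sketch of the FM estimate in Python. The notes do not state which hash function the problem uses, so h(x) = (3x + 1) mod 32 below is an assumed stand-in; this particular choice happens to reproduce r = 2 and R = 4 from the solution above.

```python
def trailing_zeros(n):
    """Number of trailing zero bits in n's binary representation."""
    if n == 0:
        return 0
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def fm_estimate(stream, h):
    """Flajolet-Martin: estimate distinct count as 2**max(trailing zeros)."""
    r = max(trailing_zeros(h(x)) for x in stream)
    return r, 2 ** r

s = [2, 3, 1, 2, 3, 4, 3, 1, 2, 3, 1, 4]
h = lambda x: (3 * x + 1) % 32   # assumed hash function, for illustration
r, R = fm_estimate(s, h)
print(r, R)   # 2 4  -> estimate of 4 distinct elements
```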
ESTIMATING MOMENTS
▪ Assume a random variable X, where X refers to a variable, such as number of distinct
elements x in a data stream.
▪ Assume that variable x has probabilistic distribution in values around the mean value
𝑥.
▪ Probabilistic distribution means probability of variable having value found = x varying
with variable X.
▪ Expected value among the distributed Xi values where i varies from 0 to n will
depend upon the expected distinct element count.
▪ Expected value will be m for expected number of distinct elements in the data
stream, and much less than m for wide variance.
▪ The variance is the square of the standard deviation in m from the expected value,
the second central moment of a distribution, and the 𝜎 2 or var(x) represents
covariance of the random variable with itself.
▪ Moments (0,1,2 …) refer to expected values to the powers of (0,1,2, ...) random-
variable variance.
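The notes above describe moments of a probability distribution; in stream mining these are usually made concrete as frequency moments, where the k-th moment is the sum of the k-th powers of each distinct element's count. A minimal sketch, computed exactly on the example stream from the FM problem (exact computation is feasible only because this tiny stream fits in memory):

```python
from collections import Counter

def frequency_moment(stream, k):
    """k-th frequency moment: sum of (count of each distinct element)**k.
    k=0 gives the number of distinct elements, k=1 the stream length,
    k=2 the 'surprise number' measuring how uneven the counts are."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

s = [2, 3, 1, 2, 3, 4, 3, 1, 2, 3, 1, 4]
print(frequency_moment(s, 0))   # 4  distinct elements
print(frequency_moment(s, 1))   # 12 total elements
print(frequency_moment(s, 2))   # 38 = 3**2 + 4**2 + 3**2 + 2**2
```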
▪ For the problem of counting the 1s among the last N bits of a bit stream: when a new
bit comes in, discard the oldest bit of the window. Keeping the whole window in
memory gives the exact answer.
▪ If there is not enough memory to store the N bits (assume N is 1 billion), an
approximate answer can be obtained using the Datar-Gionis-Indyk-Motwani (DGIM)
algorithm.
▪ An alternative technique computes a smooth aggregation of all the 1s ever seen in the
stream, with decaying weights (see the sketch below).
▪ The further back in the stream a 1 appears, the less weight it is given.
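A minimal sketch of this decaying-window idea in Python: on each arriving bit, the running sum is multiplied by (1 - c) and the new bit is added, so older 1s contribute ever smaller weights. In practice the decay constant c is tiny (e.g. on the order of 10⁻⁶); the large value below is an assumption chosen so the decay is visible in a short example.

```python
def decaying_window_count(bits, c=0.1):
    """Smooth aggregate of the 1s seen so far, with exponential decay.
    Each step multiplies the old sum by (1 - c), so a 1 that arrived
    t steps ago contributes (1 - c)**t to the current count."""
    total = 0.0
    for bit in bits:
        total = total * (1 - c) + bit
    return total

stream = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
print(round(decaying_window_count(stream), 3))   # 4.026
```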
RTAP APPLICATIONS
▪ Fraud detection systems for online transactions
▪ Log analysis for understanding usage patterns
▪ Click analysis for online recommendations
▪ Push notifications to the customers for location-based advertisements for retail
▪ Action for emergency services, such as responding to fires and accidents in an industrial plant
▪ Social media, e.g., real-time sentiment analysis of tweets. A typical workflow (the sketch after this list illustrates the cleaning step):
2. Pull the specific tweets in real time using the Twitter API, then process and load this
data into persistent storage.
3. Clean the data by deleting punctuation, stop words, URLs, common emoticons,
hashtags, and references. Reduce multiple consecutive letters in a word to two
(‘toooooooo much’ is replaced with ‘too much’). Spell-check words that have been
identified as misspelled in order to infer the correct word.
4. Classify the various types of tweets and segment them after estimating the influence
of each tweet using a predictive-analysis library. Perform sentiment analysis by
identifying whether people are tweeting positive or negative statements about
some action.
5. Use linguistic concepts to perform opinion mining, analyse the data, and draw out
powerful insights. Base the important features of the analysis on machine learning,
so that the application learns by analysing ever-increasing amounts of data.
6. The goal is to build the model for predicting the sentiments from tweets.
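As referenced above, here is a minimal Python sketch of the cleaning rules in step 3. The regular expressions and the tiny stop-word list are illustrative assumptions, and the spell-checking step is omitted since the notes do not name a tool.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "to", "and"}   # illustrative subset

def clean_tweet(text):
    """Apply the cleaning rules from step 3 above."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)      # delete URLs
    text = re.sub(r"[@#]\w+", "", text)           # delete references/hashtags
    text = re.sub(r"[:;=]-?[()DPp]", "", text)    # delete common emoticons
    text = re.sub(r"[^\w\s]", "", text)           # delete punctuation
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)   # 'toooooooo' -> 'too'
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

print(clean_tweet("This is toooooooo much!!! :) #overload @user http://t.co/x"))
# -> 'this too much'
```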