Big Data Unit II Notes
In the data stream model, individual data items may be relational tuples, e.g., network
measurements, call records, web page visits, sensor readings, and so on.
Further, a query over streams typically runs continuously over a period of time and
returns new results as new data arrives.
Such queries are therefore known as long-running, continuous, standing, or persistent
queries.
Characteristics
1. The data model and query processor must allow both order-based and time-based
operations.
2. The inability to store a complete stream indicates that some approximate summary
structures must be used.
3. Streaming query plans must not use any operators that require the entire input before any
results are produced. Such operators will block the query processor indefinitely.
4. Any query that requires backtracking over a data stream is infeasible. This is due to
storage and performance constraints imposed by a data stream.
5. Applications that monitor streams in real time must react quickly to unusual data values.
6. Scalability requirements dictate that parallel and shared execution of many continuous
queries must be possible.
Architecture
An input monitor may regulate the input streams, perhaps by dropping packets. Data are
typically stored in three partitions:
1. Temporary working storage (for example, for window queries)
2. Summary storage (for stream synopses)
3. Static storage for meta-data (such as the physical location of each source)
Long running queries are registered in the query repository and placed into groups for
shared processing.
The query processor communicates with the input monitor and may reoptimize the
query plans in response to changing input rates. Results are streamed to the user or
temporarily buffered.
Any number of streams can enter the system. Each stream can provide elements at its own
schedule; they need not have the same data rates or data types, and the time between
elements of one stream need not be uniform.
The fact that the rate of arrival of stream elements is not under the control of the system
distinguishes stream processing from the processing of data that goes on within a database-
management system.
Streams may be archived in a large archival store, but we assume it is not possible to
answer queries from the archival store. It could be examined only under special
circumstances using time-consuming retrieval processes.
There is also a working store, into which summaries or parts of streams may be
placed, and which can be used for answering queries.
The working store might be disk, or it might be main memory, depending on how fast
we need to process queries.
Sensor Networks – Sensor networks are a large source of data occurring in streams. They are
used in numerous situations that require constant monitoring of several variables, based on
which important decisions are made; alerts and alarms are generated in response to information
received from the sensors. Example – join several streams, such as temperature and ocean-current
streams at weather stations, to give alerts or warnings such as cyclone or tsunami warnings.
Network Traffic Analysis – ISPs get information about Internet traffic, heavily used routes,
etc., to identify and predict congestion. Streams are also used to identify fraudulent activity.
Example query – check whether the current stream of actions over time is similar to a previous
intrusion on the network.
Financial Applications – Online analysis of stock prices and making hold-or-sell decisions
requires quickly identifying correlations and fast-changing trends.
Transaction Log Analysis – Online mining of web usage logs, telephone call records and
ATM transactions are examples of data streams. The goal is to find customer behavior patterns.
Example – identify the current buying patterns of users on a website and plan advertising
campaigns and recommendations.
Image Data
Satellites often send down to earth streams consisting of many terabytes of images per day.
Surveillance cameras produce images with lower resolution than satellites, but there can be
many of them, each producing a stream of images at intervals like one second.
Stream Computing
Stream Queries
There are two ways that queries get asked about streams.
One-time queries (a class that includes traditional DBMS queries) are queries that are
evaluated once over a point-in-time snapshot of the data set, with the answer returned to the
user. Example – report the current or maximum price of a stock at the moment the query is asked.
Continuous queries, on the other hand, are evaluated continuously as data streams continue to
arrive. Continuous query answers may be stored and updated as new data arrives, or they may
be produced as data streams themselves. Example – aggregation queries such as maximum,
average, and count: the maximum price of a stock every hour, or the number of times a stock
gains over a particular point.
Streams often deliver elements very rapidly, so we must process elements in real time, or we
lose the opportunity to process them at all without accessing the archival storage.
Thus, it is often important that the stream-processing algorithm is executed in main memory,
without access to secondary storage or with only rare accesses to secondary storage.
Thus, many problems about streaming data would be easy to solve if we had enough
memory, but become rather hard and require the invention of new techniques in order to
execute them at a realistic rate on a machine of realistic size.
As our first example of managing streaming data, we shall look at extracting reliable samples
from a stream.
The general problem we shall address is selecting a subset of a stream so that we can ask
queries about the selected subset and have the answers be statistically representative of the
stream as a whole. If we know what queries are to be asked, then there are a number of
methods that might work, but we are looking for a technique that will allow ad-hoc queries
on the sample.
The General Sampling Problem
As a running example, suppose a search engine receives a stream of tuples of the form
(user, query, time) and we wish to store the queries of, say, 1/10th of the users so that we
can answer questions about typical user behavior. This running example is typical of the
following general problem. Our stream consists of tuples with n components. A subset of the
components are the key components, on which the selection of the sample will be based. In
the running example there are three components – user, query, and time – of which only user
is in the key.
However, we could also take a sample of queries by making query be the key, or even take a
sample of user-query pairs by making both those components form the key.
To take a sample of a fraction a/b of the key values, we hash the key value of each tuple to
one of b buckets and accept the tuple for the sample if the hash value is less than a. If the
key consists of more than one component, the hash function must combine the values of those
components into a single hash value.
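As a minimal sketch of this hashing scheme (the hash function, the values a = 1 and b = 10, and the (user, query, time) tuples are illustrative assumptions):

import hashlib

def accept_for_sample(key_components, a=1, b=10):
    # Combine the key components into one string so a single hash value
    # covers multi-component keys, then map it to one of b buckets.
    key = "|".join(str(c) for c in key_components)
    bucket = int(hashlib.sha1(key.encode()).hexdigest(), 16) % b
    return bucket < a          # accept the tuple if the bucket number is below a

# Example: keep roughly 1/10th of the users from (user, query, time) tuples.
tuples = [("alice", "hadoop", 1), ("bob", "storm", 2), ("alice", "spark", 3)]
sample = [t for t in tuples if accept_for_sample([t[0]])]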
Often, the sample will grow as more of the stream enters the system. In our running example,
we retain all the search queries of the selected 1/10th of the users, forever. As time goes on,
more searches for the same users will be accumulated, and new users that are selected for the
sample will appear in the stream.
If we have a budget for how many tuples from the stream can be stored as the sample, then
the fraction of key values selected must shrink as time goes on.
In order to assure that, at all times, the sample consists of all tuples from a subset of the key
values, we choose a hash function h from key values to a very large range of values 0, 1, . . .
, B−1. We maintain a threshold t, which initially can be the largest hash value, B−1; a tuple
is retained in the sample only if its key hashes to a value at most t.
If the number of stored tuples of the sample exceeds the allotted space, we lower t to t−1 and
remove from the sample all those tuples whose key K hashes to t.
For efficiency, we can lower t by more than 1, and remove the tuples with several of the
highest hash values, whenever we need to throw some key values out of the sample.
Further efficiency is obtained by maintaining an index on the hash value, so we can find all
those tuples whose keys hash to a particular value quickly.
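A rough sketch of this fixed-budget variant (the budget, the hash range B, and the dictionary used as the index on hash values are assumptions made for illustration):

import hashlib
from collections import defaultdict

B = 1_000_000        # size of the hash range (assumed)
BUDGET = 1_000       # maximum number of tuples we are willing to store (assumed)

def h(key):
    # Hash a key value into the range 0 .. B-1.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % B

threshold = B - 1            # initially every key value is in the sample
by_hash = defaultdict(list)  # index: hash value -> stored tuples
stored = 0

def process(tup, key):
    # Keep the tuple only if its key hashes at or below the threshold; when the
    # budget is exceeded, evict the highest remaining hash values and lower t.
    global threshold, stored
    hv = h(key)
    if hv > threshold:
        return
    by_hash[hv].append(tup)
    stored += 1
    while stored > BUDGET:
        evicted = by_hash.pop(threshold, [])
        stored -= len(evicted)
        threshold -= 1

# Usage: stream in tuples keyed by user; the sample shrinks to fit the budget.
for i in range(5000):
    process(("user%d" % i, "query", i), key="user%d" % i)
print(stored, threshold)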
Filtering Streams
A common process on streams is selection, or filtering. We want to accept those tuples in the
stream that meet a criterion. Accepted tuples are passed to another process as a stream, while
other tuples are dropped. If the selection criterion is a property of the tuple that can be
calculated (e.g., the first component is less than 10), then the selection is easy to do. The
problem becomes harder when the criterion involves lookup for membership in a set, and it is
especially hard when that set is too large to store in main memory. In this section, we shall
discuss the technique known as “Bloom filtering” as a way to eliminate most of the tuples
that do not meet the criterion.
Applications
Email spam filtering is one good example: we know a set of one billion “good” email addresses,
and if an email comes from one of these, it is NOT spam.
Publish-subscribe systems are another: lots of messages (news articles) are collected, people
express interest in certain sets of keywords, and the system must determine whether each
message matches a user’s interests.
Solution:
1) Given a set of key values S that we want to filter.
2) Create a bit array B of n bits, initially all 0s.
3) Choose a hash function h that maps key values to the n bit positions.
4) Hash each member s of S to one of the n buckets, and set that bit to 1, i.e., B[h(s)] = 1.
5) Hash each element a of the stream and output only those that hash to a bit that was set to 1,
i.e., output a if B[h(a)] == 1.
A Bloom filter creates false positives but no false negatives: if the item is in S we surely
output it; if it is not, we may still output it.
Suppose |S| = 1 billion email addresses and |B| = 1 GB = 8 billion bits. If an email address is
in S, then it surely hashes to a bucket whose bit is set to 1, so it always gets through (no
false negatives). Approximately 1/8 of the bits are set to 1, so about 1/8th of the addresses
not in S get through to the output (false positives). Actually, fewer than 1/8th, because more
than one address might hash to the same bit.
False positives and false negatives are concepts analogous to type I and type II errors in
statistical hypothesis testing, where a positive result corresponds to rejecting the null
hypothesis, and a negative result corresponds to not rejecting the null hypothesis. The terms
are often used interchangeably, but there are differences in detail and interpretation.
A Bloom filter consists of:
1. An array of n bits, initially all 0s.
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key” values to n
buckets, corresponding to the n bits of the bit-array.
3. A set S of m key values.
If a key value is in S, then the element will surely pass through the Bloom filter. However, if
the key value is not in S, it might still pass. We need to understand how to calculate the
probability of a false positive as a function of n (the bit-array length), m (the number of
members of S), and k (the number of hash functions). The model to use is throwing darts at
targets. Suppose we have x targets and y darts, and any dart is equally likely to hit any target.
After throwing the darts, how many targets can we expect to be hit at least once? The
analysis goes as follows:
The probability that a given dart will not hit a given target is (x − 1)/x. The probability that
none of the y darts will hit a given target is ((x − 1)/x)^y. We can write this expression as
(1 − 1/x)^(x(y/x)). Using the approximation (1 − ε)^(1/ε) ≈ 1/e for small ε, we conclude that the
probability that none of the y darts hits a given target is e^(−y/x).
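Plugging the numbers of the email example into this approximation (a quick check, assuming a single hash function):

import math

n = 8_000_000_000    # x targets = bits in the array
m = 1_000_000_000    # y darts   = members of S

p_zero = math.exp(-m / n)        # fraction of bits expected to stay 0, about 0.8825
p_false_positive = 1 - p_zero    # fraction of bits set to 1, about 0.1175 (< 1/8)
print(p_zero, p_false_positive)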
In general, Bloom filters guarantee no false negatives and use limited memory, which makes
them great for pre-processing before more expensive checks. They are also suitable for hardware
implementation, since the hash-function computations can in fact be parallelized. Is it better to
have one big B or k small Bs? The false-positive rate is the same, since
(1 − e^(−km/n))^k = (1 − e^(−m/(n/k)))^k, but keeping one big B is simpler.
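As an illustration of these components, here is a minimal sketch of a Bloom filter with k hash functions (the array size, k = 4, and the salted SHA-256 hashing scheme are illustrative assumptions, not part of the notes):

import hashlib

class BloomFilter:
    # n bits and k hash functions: no false negatives, and a false-positive
    # probability of roughly (1 - e^(-km/n))^k after m keys are inserted.
    def __init__(self, n_bits=8_000_000, k=4):
        self.n = n_bits
        self.k = k
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, key):
        # Derive k bucket numbers by salting one digest with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(("%d:%s" % (i, key)).encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Usage: pre-filter a stream of email addresses against the "good" set.
bf = BloomFilter()
bf.add("alice@example.com")
assert bf.might_contain("alice@example.com")   # never a false negative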
Counting Distinct Elements in a Stream
An obvious approach, if the number of elements is not very large, would be to maintain a set:
when a new element arrives, we check whether the set already contains it and, if not, add it.
The size of the set gives the number of distinct elements. However, if the number of elements
is vast, or we are maintaining counts for multiple streams, it is infeasible to keep the set in
memory. Storing the data on disk would be an option only if we are interested in offline
computation using batch-processing frameworks like MapReduce.
Like the previous algorithms we looked at, the algorithms for counting distinct elements are
also approximate, with an error threshold that can be tweaked by changing the algorithm's
parameters.
The first algorithm for counting distinct elements is the Flajolet-Martin algorithm, named after
its creators. The Flajolet-Martin algorithm is a single-pass algorithm. If there
are m distinct elements in a universe comprising n elements, the algorithm runs
in O(n) time and O(log m) space. The following steps define the algorithm.
First, we pick a hash function h that takes stream elements as input and outputs a bit
string. The length of the bit strings is large enough that the range of the hash
function is much larger than the size of the universe; we require at least log n bits if
there are n elements in the universe.
r(a) is used to denote the number of trailing zeros in the binary representation
of h(a) for an element a in the stream.
The probability that h(a) ends in at least i zeros is exactly 2^(−i). For example, for i = 0, there
is probability 1 that the tail has at least 0 zeros. For i = 1, there is a probability of 1/2 that the
last bit is zero; for i = 2, the probability is 1/4 that the last two bits are zeros, and so on. The
probability that the rightmost set bit falls at a given position drops by a factor of 1/2 with every
position from the least significant bit toward the most significant bit.
This probability should become 0 when bit position R >> log m while it should be non-zero
when R <= log m. Hence, if we find the rightmost unset bit position R such that the
probability is 0, we can say that the number of unique elements will approximately be 2 ^ R.
The Flajolet-Martin algorithm uses a multiplicative hash function to transform the non-uniform
input space into a uniform distribution. The general form of the hash function is
(a·x + b) mod c, where a and b are odd numbers and c is the size of the hash range.
The Flajolet-Martin algorithm is sensitive to the hash function used, and results vary widely
based on the data set and the hash function. Hence there are better algorithms that utilize more
than one hash function. These algorithms use the average and median values to reduce skew
and increase the predictability of the result.
For each element x of the stream:
1. y = hash(x)
2. r = get_rightmost_set_bit(y)
3. set_bit(B, r)
Once the stream has been processed:
4. R = get_rightmost_unset_bit(B)
5. return 2 ^ R
We define a hash range, big enough to hold the maximum number of possible unique values,
something as big as 2 ^ 64. Every stream element is passed through a hash function that
transforms the elements into a uniform distribution.
For each hash value, we find the position of the rightmost set bit and mark the corresponding
position in the bit set to 1. Once all elements are processed, the bit vector will have 1s at all
the positions corresponding to the position of every rightmost set bit for all elements in the
stream.
Now we find R , the rightmost 0 in this bit vector. This position R corresponds to the
rightmost set bit that we have not seen while processing the elements. This corresponds to the
probability 0 and will help in approximating the cardinality of unique elements as 2 ^ R.
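Putting the steps above together, a runnable sketch of Flajolet-Martin (the 64-bit hash range and the random odd constants a and b are illustrative assumptions):

import random

HASH_BITS = 64                     # assumed size of the hash range (2^64 values)

def make_hash(bits=HASH_BITS):
    # Multiplicative hash of the form (a*x + b) mod c with odd a and b.
    c = 1 << bits
    a = random.randrange(1, c, 2)
    b = random.randrange(1, c, 2)
    return lambda x: (a * hash(x) + b) % c

def trailing_zeros(y, bits=HASH_BITS):
    # r(a): the number of trailing zeros in the binary representation of y.
    if y == 0:
        return bits
    return (y & -y).bit_length() - 1

def flajolet_martin(stream, h):
    # Single pass: mark the rightmost-set-bit position of each hash value,
    # then estimate the distinct count as 2^R, where R is the rightmost 0.
    bitmap = 0
    for x in stream:
        bitmap |= 1 << trailing_zeros(h(x))
    R = 0
    while bitmap & (1 << R):
        R += 1
    return 2 ** R

stream = ["a", "b", "a", "c", "b", "d", "a"]
print(flajolet_martin(stream, make_hash()))    # rough estimate of the 4 distinct items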
Estimating Moments
The problem, called computing “moments,” involves the distribution of frequencies of
different elements in the stream.
Suppose a stream consists of elements chosen from a universal set. Assume the universal set
is ordered, so we can speak of the ith element for any i. Let m_i be the number of occurrences
of the ith element. Then the kth-order moment (or just kth moment) of the stream is
the sum over all i of (m_i)^k.
For now, let us assume that a stream has a particular length n. Suppose we do not have
enough space to count all the m_i’s for all the elements of the stream.
We can still estimate the second moment of the stream using a limited amount of space; the
more space we use, the more accurate the estimate will be. We compute some number of
variables. For each variable X, we store:
1. A particular element of the universal set, which we refer to as X.element, and
2. An integer X.value, which is the value of the variable.
To determine the value of a variable X, we choose a position in the stream between 1 and n,
uniformly and at random. Set X.element to be the element found there, and initialize X.value
to 1. As we read the stream, add 1 to X.value each time we encounter another occurrence of
X.element.
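Per the standard Alon-Matias-Szegedy construction, each variable X yields the estimate n(2·X.value − 1), and averaging the estimates from several variables improves accuracy. A minimal sketch, assuming for illustration that the stream is short enough to hold in a list (a true streaming implementation would pick the random positions on the fly):

import random

def estimate_second_moment(stream, num_variables=20):
    # Estimate the second moment, sum over i of (m_i)^2, from random variables.
    n = len(stream)
    positions = random.sample(range(n), k=min(num_variables, n))
    estimates = []
    for p in positions:
        element = stream[p]                                   # X.element
        value = sum(1 for x in stream[p:] if x == element)    # X.value
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

stream = list("abcbdacfdeadab")
print(estimate_second_moment(stream))   # compare with the exact second moment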
Counting Ones in a Window
As in previous sections, we focus on the situation where we cannot afford to store the entire
window. After showing an approximate algorithm for the binary case, we discuss how this
idea can be extended to summing numbers.
The DGIM (Datar-Gionis-Indyk-Motwani) algorithm is designed to count the number of 1’s in a
window of a bit stream. It uses O(log² N) bits to represent a window of N bits, and allows the
number of 1’s in the window to be estimated with an error of no more than 50%. The window is
divided into buckets, each recorded by the timestamp of its most recent 1 and its size (the
number of 1’s it contains), and the buckets must obey the following rules:
1. The right side of a bucket should always start with a 1 (if it starts with a 0, it is
neglected). E.g., 1001011 → a bucket of size 4, having four 1’s and starting with a 1 at its
right end.
2. Every bucket should have at least one 1, else no bucket can be formed.
3. All bucket sizes should be a power of 2.
4. Buckets cannot decrease in size as we move to the left (they are in increasing order of size
toward the left).
Estimating the number of 1’s and counting the buckets in the given data stream.
This picture shows how we can form the buckets based on the number of ones by following
the rules.
In the given data stream, let us assume new bits arrive from the right. When the new bit is 0:
after the new bit (0) arrives with a timestamp of 101, there is no change in the buckets.
But if the new bit that arrives is a 1, then we need to make changes:
Create a new bucket with the current timestamp and size 1.
If there were only one or two buckets of size 1, then nothing more needs to be done. However,
if there are now three buckets of size 1 (the buckets with timestamps 100, 102, and 103 in the
second step of the picture), we fix the problem by combining the leftmost (earliest) two
buckets of size 1 (the purple box).
To combine any two adjacent buckets of the same size, replace them by one bucket of twice
the size. The timestamp of the new bucket is the timestamp of the rightmost of the two
buckets.
Now, sometimes combining two buckets of size 1 may create a third bucket of size 2. If so,
we combine the leftmost two buckets of size 2 into a bucket of size 4. This process may ripple
through the bucket sizes.
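A compact sketch of this bucket maintenance (the (timestamp, size) representation, the newest-first ordering, and the small example stream are assumptions made for illustration):

def dgim_update(buckets, timestamp, bit, window_size):
    # Maintain buckets newest-first, allowing at most two buckets of each size.
    buckets = [(t, s) for (t, s) in buckets if t > timestamp - window_size]
    if bit == 1:
        buckets.insert(0, (timestamp, 1))      # new bucket of size 1
        size = 1
        while sum(1 for _, s in buckets if s == size) > 2:
            # Combine the two oldest buckets of this size into one of twice
            # the size, keeping the newer of their two timestamps.
            idx = [i for i, (_, s) in enumerate(buckets) if s == size][-2:]
            merged = (buckets[idx[0]][0], 2 * size)
            buckets = [b for i, b in enumerate(buckets) if i not in idx]
            buckets.insert(idx[0], merged)
            size *= 2
    return buckets

def dgim_estimate(buckets):
    # Count every bucket fully except the oldest, which contributes half its size.
    if not buckets:
        return 0
    sizes = [s for _, s in buckets]
    return sum(sizes[:-1]) + sizes[-1] // 2

buckets, stream = [], [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
for ts, bit in enumerate(stream, start=1):
    buckets = dgim_update(buckets, ts, bit, window_size=10)
print(buckets, dgim_estimate(buckets))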
Decaying Windows
We have assumed that a sliding window held a certain tail of the stream, either the most
recent N elements for fixed N, or all the elements that arrived after some time in the past.
Sometimes we do not want to make a sharp distinction between recent elements and those in
the distant past, but want to weight the recent elements more heavily. In this section, we
consider “exponentially decaying windows,” and an application where they are quite useful:
finding the most common “recent” elements.
Suppose we have a stream whose elements are the movie tickets purchased all over the
world, with the name of the movie as part of the element. We want to keep a summary of the
stream that is the most popular movies “currently.”
Simply counting the total number of tickets a movie has ever sold is not a good measure of
current popularity, since a movie that sold well last year but sells nothing now is not currently
popular. On the other hand, a movie that sold n tickets in each of the last 10 weeks is probably
more popular than a movie that sold 2n tickets last week but nothing in previous weeks.
One solution would be to imagine a bit stream for each movie. The ith bit has value 1 if the
ith ticket is for that movie, and 0 otherwise. Pick a window size N, which is the number of
most recent tickets that would be considered in evaluating popularity.
An alternative approach is to redefine the question so that we are not asking for a count of 1’s
in a window. Rather, let us compute a smooth aggregation of all the 1’s ever seen in the
stream, with decaying weights, so the further back in the stream, the less weight is given.
Formally, let a stream currently consist of the elements a1, a2, . . . , at, where a1 is the first
element to arrive and at is the current element. Let c be a small constant, such as 10^−6 or
10^−9. Define the exponentially decaying window for this stream to be the sum over
i = 0 to t−1 of a_(t−i)·(1 − c)^i.
Let us return to the problem of finding the most popular movies in a stream of ticket sales.
We shall use an exponentially decaying window with a constant c, which you might think of
as 10−9. That is, we approximate a sliding window holding the last one billion ticket sales.
We imagine that the number of possible movies in the stream is huge, so we do not want to
record values for the unpopular movies. Therefore, we establish a threshold, say 1/2, so that
if the popularity score for a movie goes below this number, its score is dropped from the
counting.
For reasons that will become obvious, the threshold must be less than 1, although it can be
any number less than 1. When a new ticket arrives on the stream, do the following:
1. For each movie whose score we are currently maintaining, multiply its score by (1 − c).
2. Suppose the new ticket is for movie M. If there is currently a score for M, add 1 to that
score. If there is no score for M, create one and initialize it to 1.
3. If the score for any movie has fallen below the threshold 1/2, drop that score.
It may not be obvious that the number of movies whose scores are maintained at any time is
limited. However, note that the sum of all scores is 1/c.
There cannot be more than 2/c movies with a score of 1/2 or more, or else the sum of the
scores would exceed 1/c.
Thus, 2/c is a limit on the number of movies being counted at any time.
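A small sketch of this update rule (the dictionary of scores, the demonstration value of c, and the movie names are illustrative assumptions):

def process_ticket(scores, movie, c=1e-6, threshold=0.5):
    # 1. Decay every currently maintained score by a factor of (1 - c),
    #    dropping any score that falls below the threshold.
    for m in list(scores):
        scores[m] *= (1 - c)
        if scores[m] < threshold:
            del scores[m]
    # 2. Add 1 to the score of the movie on the new ticket, creating it if needed.
    scores[movie] = scores.get(movie, 0.0) + 1.0
    return scores

scores = {}
for movie in ["Dune", "Up", "Dune", "Up", "Dune", "Oppenheimer"]:
    scores = process_ticket(scores, movie, c=0.1)   # a large c so the decay is visible
print(scores)   # the more frequently purchased movie ends up with the higher score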
Real Time Analytics
Real-time analytics makes use of all available data and resources when they are needed. It
consists of dynamic analysis and reporting based on data entered into a system less than one
minute before the actual time of use.
Real time denotes the ability to process data as it arrives, rather than storing the data and
retrieving it at some point in the future.
Real-time analytics is thus about delivering meaningful patterns in the data when they are
urgently needed.
Types of real time analytics
On-Demand Real Time Analytics – It is reactive because it waits for users to request a
query and then delivers the analytics. This is used when someone within a company needs to
take the pulse of what is happening right this minute.
Continuous Real Time Analytics – It is more proactive and alerts users with continuous
updates in real time. Example – monitoring stock market trends provides analytics that help
users make buy or sell decisions in real time.
Financial Services – Analyze market ticks, tweets, satellite imagery, weather trends, and any
other type of data to inform trading algorithms in real time.
Government – Identify social program fraud within seconds based on program history,
citizen profile, and geographical data.
E-Commerce Sites – Real-time analytics helps to tap into user preferences while people are
on the site or using the product. Knowing what a user likes at run time helps the site decide
which relevant content to make available to that user. This can result in a better overall
customer experience, leading to an increase in sales.
Companies like Facebook and Twitter generate petabytes of real-time data. This data must be
harnessed to provide real-time analytics for better business decisions. Today, billions of
devices are already connected to the internet, with more connecting every day.
Real time analytics will leverage information from all these devices to apply analytics
algorithms and generate automated actions within milliseconds of a trigger.
Input – An event happens (a new sale, a new customer, someone entering a high-security zone, etc.).
Process and Store Input – Capture the data of the event and analyze it without leveraging
resources that are dedicated to operations, using approaches such as:
• Processing in memory
• In-database analytics
• In-memory analytics
Real Time Sentiment Analysis
Sentiment analysis is widely applied to reviews and social media for a variety of applications
ranging from marketing to customer service.
A basic task in sentiment analysis is classifying the polarity of a given text at the document,
sentence, or feature/aspect level – whether the expressed opinion in a document, a sentence, or
an entity feature/aspect is positive, negative, or neutral.
Applications
A news media website interested in gaining an edge over its competitors can feature site
content that is immediately relevant to its readers. It can use social media analysis to find
topics relevant to its readers by doing real-time sentiment analysis on Twitter data, specifically
to identify what topics are trending on Twitter in real time.
Twitter has become a central site where people express their opinions and views on political
parties and candidates. Emerging events or news are often followed almost instantly by a burst
in Twitter volume, which, if analyzed in real time, can help explore how these events affect
public opinion. While traditional content analytics takes days or weeks to complete, real-time
sentiment analysis (RSTA) can look at the entire content about an election and deliver results
instantly and continuously.
Ad agencies can track the crowd sentiment during commercial viewing on TV and decide
which commercials are resulting in positive sentiment and which are not.
Analyzing the sentiment of messages posted to social media or online forums can generate
considerable business value for organizations that aim to extract timely business intelligence
about how their products or services are perceived by their customers. As a result, proactive
marketing or product-design strategies can be developed to efficiently increase the customer
base.
Tools
Apache Storm is a distributed real-time computation system for processing large volumes of
data. It integrates with the Hadoop ecosystem. Storm is extremely fast, with the ability to
process over a million records per second per node on a cluster of modest size.
Apache Solr is another tool, often used with Hadoop, that provides a highly reliable, scalable
search-engine facility in real time.
RADAR is a software solution for retailers built using a natural-language-processing-based
sentiment analysis engine and Hadoop technologies including HDFS, YARN, Apache Storm,
Apache Solr, Oozie and ZooKeeper, to help them maximize sales through data-based
continuous repricing.
Online retailers can track the following for any number of products in their portfolio:
b. Competitive pricing / promotions being offered in social media and on the web.
Stock Market Prediction
A general real-time stock prediction and machine learning architecture comprises three basic
components:
a. Incoming real-time trading data must be captured and stored, becoming historical data.
b. The system must be able to learn from historical trends in the data and recognize patterns
and probabilities to inform decisions.
c. The system needs to do a real-time comparison of new incoming trading data with the
learned patterns and probabilities based on historical data. It then predicts an outcome and
determines an action to take.
Using the live, hot data from Apache Geode, a Spark MLlib application creates and trains a
model, comparing new data to historical patterns. The models could also be built with other
toolsets such as Apache MADlib or R.
Results of the machine-learning model are pushed to other interested applications and also
updated within Apache Geode for real-time prediction and decisioning.
As data ages and starts to become cool, it is moved from Apache Geode to Apache HAWQ
and eventually lands in Apache Hadoop. Apache HAWQ allows SQL-based analysis on
petabyte-scale data sets and lets data scientists iterate on and improve models.
Another process is triggered periodically to retrain and update the machine-learning model
based on the whole historical data set. This closes the loop and creates ongoing updates and
improvements as historical patterns change or new models emerge.