Big Data Unit II Notes
In the data stream model, individual data items may be relational tuples, e.g., network
measurements, call records, web page visits, sensor readings, and so on.
Further, a query over streams typically runs continuously over a period of time and
returns new results as new data arrives.
Such queries are therefore known as long-running, continuous, standing, or persistent
queries.
Characteristics
1. The data model and query processor must allow both order-based and time-based
operations.
2. The inability to store a complete stream indicates that some approximate summary
structures must be used.
3. Streaming query plans must not use any operators that require the entire input before any
results are produced. Such operators will block the query processor indefinitely.
4. Any query that requires backtracking over a data stream is infeasible. This is due to
storage and performance constraints imposed by a data stream.
5. Applications that monitor streams in real time must react quickly to unusual data values.
6. Scalability requirements dictate that parallel and shared execution of many continuous
queries must be possible.
Architecture
An input monitor may regulate the input streams, perhaps by dropping packets. Data are
typically stored in three partitions:
1. Temporary working storage (for example, for window queries)
2. Summary storage (for stream synopses)
3. Static storage for meta-data (such as the physical location of each source)
Long running queries are registered in the query repository and placed into groups for
shared processing.
The query processor communicates with the input monitor and may reoptimize the
query plans in response to changing input rates. Results are streamed to the user or
temporarily buffered.
Any number of streams can enter the system. Each stream can provide elements at its own
schedule; they need not have the same data rates or data types, and the time between
elements of one stream need not be uniform.
The fact that the rate of arrival of stream elements is not under the control of the system
distinguishes stream processing from the processing of data that goes on within a database-
management system.
Streams may be archived in a large archival store, but we assume it is not possible to
answer queries from the archival store. It could be examined only under special
circumstances using time-consuming retrieval processes.
There is also a working store, into which summaries or parts of streams may be
placed, and which can be used for answering queries.
The working store might be disk, or it might be main memory, depending on how fast
we need to process queries.
Sensor Networks – Sensor networks are a large source of data occurring in streams. They are
used in numerous situations that require constant monitoring of several variables, based on
which important decisions are made; alerts and alarms are generated in response to information
received from the sensors. Example – join several streams, such as temperature and ocean-current
streams at weather stations, to give alerts or warnings such as cyclone or tsunami warnings.
Network Traffic Analysis – ISPs get information about Internet traffic, heavily used routes,
etc., to identify and predict congestion. Streams are also used to identify fraudulent activity.
Example query – check whether the current stream of actions over time is similar to a previous
intrusion on the network.
Financial Applications – Online analysis of stock prices and making hold-or-sell decisions
requires quickly identifying correlations and fast-changing trends.
Transaction Log Analysis – Online mining of web usage logs, telephone call records and
ATM transactions are examples of data streams. The goal is to find customer behavior patterns.
Example – identify the current buying patterns of users on a website and plan advertising
campaigns and recommendations.
Image Data
Satellites often send down to earth streams consisting of many terabytes of images per day.
Surveillance cameras produce images with lower resolution than satellites, but there can be
many of them, each producing a stream of images at intervals like one second.
Stream Computing
Stream Queries
There are two ways that queries get asked about streams.
One-time queries (a class that includes traditional DBMS queries) are queries that are
evaluated once over a point-in-time snapshot of the data set, with the answer returned to the
user. Example – report the current or maximum price of a stock at the moment the query is asked.
Continuous queries, on the other hand, are evaluated continuously as data streams continue to
arrive. Continuous query answers may be stored and updated as new data arrives, or they may
be produced as data streams themselves. Example – aggregation queries such as maximum,
average, and count: the maximum price of a stock every hour, or the number of times a stock
gains over a particular point.
Streams often deliver elements very rapidly, so we must process elements in real time, or we
lose the opportunity to process them at all without accessing the archival storage.
Thus, it is often important that the stream-processing algorithm is executed in main memory,
without access to secondary storage or with only rare accesses to secondary storage.
Thus, many problems about streaming data would be easy to solve if we had enough
memory, but become rather hard and require the invention of new techniques in order to
execute them at a realistic rate on a machine of realistic size.
As our first example of managing streaming data, we shall look at extracting reliable samples
from a stream.
The general problem we shall address is selecting a subset of a stream so that we can ask
queries about the selected subset and have the answers be statistically representative of the
stream as a whole. If we know what queries are to be asked, then there are a number of
methods that might work, but we are looking for a technique that will allow ad-hoc queries
on the sample.
The General Sampling Problem
As a running example, suppose a search engine receives a stream of tuples of the form
(user, query, time) and we wish to store the queries of, say, 1/10th of the users so that we
can answer questions about typical user behavior. This running example is typical of the
following general problem. Our stream consists of tuples with n components. A subset of the
components are the key components, on which the selection of the sample will be based. In
the running example there are three components – user, query, and time – of which only user
is in the key.
However, we could also take a sample of queries by making query be the key, or even take a
sample of user-query pairs by making both those components form the key.
To take a sample of a fraction a/b of the key values, we hash the key value of each tuple to
one of b buckets and accept the tuple for the sample if the hash value is less than a. If the
key consists of more than one component, the hash function must combine the values of those
components into a single hash value.
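As a minimal sketch of this hashing scheme (the hash function, the values a = 1 and b = 10, and the (user, query, time) tuples are illustrative assumptions):

import hashlib

def accept_for_sample(key_components, a=1, b=10):
    # Combine the key components into one string so a single hash value
    # covers multi-component keys, then map it to one of b buckets.
    key = "|".join(str(c) for c in key_components)
    bucket = int(hashlib.sha1(key.encode()).hexdigest(), 16) % b
    return bucket < a          # accept the tuple if the bucket number is below a

# Example: keep roughly 1/10th of the users from (user, query, time) tuples.
tuples = [("alice", "hadoop", 1), ("bob", "storm", 2), ("alice", "spark", 3)]
sample = [t for t in tuples if accept_for_sample([t[0]])]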
Often, the sample will grow as more of the stream enters the system. In our running example,
we retain all the search queries of the selected 1/10th of the users, forever. As time goes on,
more searches for the same users will be accumulated, and new users that are selected for the
sample will appear in the stream.
If we have a budget for how many tuples from the stream can be stored as the sample, then
the fraction of key values selected must shrink as time goes on.
In order to assure that, at all times, the sample consists of all tuples from a subset of the key
values, we choose a hash function h from key values to a very large range of values 0, 1, . . .
, B−1. We maintain a threshold t, which initially can be the largest hash value, B−1; a tuple
is retained in the sample only if its key hashes to a value at most t.
If the number of stored tuples of the sample exceeds the allotted space, we lower t to t−1 and
remove from the sample all those tuples whose key K hashes to t.
For efficiency, we can lower t by more than 1, and remove the tuples with several of the
highest hash values, whenever we need to throw some key values out of the sample.
Further efficiency is obtained by maintaining an index on the hash value, so we can find all
those tuples whose keys hash to a particular value quickly.
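A rough sketch of this fixed-budget variant (the budget, the hash range B, and the dictionary used as the index on hash values are assumptions made for illustration):

import hashlib
from collections import defaultdict

B = 1_000_000        # size of the hash range (assumed)
BUDGET = 1_000       # maximum number of tuples we are willing to store (assumed)

def h(key):
    # Hash a key value into the range 0 .. B-1.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % B

threshold = B - 1            # initially every key value is in the sample
by_hash = defaultdict(list)  # index: hash value -> stored tuples
stored = 0

def process(tup, key):
    # Keep the tuple only if its key hashes at or below the threshold; when the
    # budget is exceeded, evict the highest remaining hash values and lower t.
    global threshold, stored
    hv = h(key)
    if hv > threshold:
        return
    by_hash[hv].append(tup)
    stored += 1
    while stored > BUDGET:
        evicted = by_hash.pop(threshold, [])
        stored -= len(evicted)
        threshold -= 1

# Usage: stream in tuples keyed by user; the sample shrinks to fit the budget.
for i in range(5000):
    process(("user%d" % i, "query", i), key="user%d" % i)
print(stored, threshold)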
Filtering Streams
A common process on streams is selection, or filtering. We want to accept those tuples in the
stream that meet a criterion. Accepted tuples are passed to another process as a stream, while
other tuples are dropped. If the selection criterion is a property of the tuple that can be
calculated (e.g., the first component is less than 10), then the selection is easy to do. The
problem becomes harder when the criterion involves lookup for membership in a set, and it is
especially hard when that set is too large to store in main memory. In this section, we shall
discuss the technique known as “Bloom filtering” as a way to eliminate most of the tuples
that do not meet the criterion.
Applications
Email spam filtering is one good example: we know a set of one billion “good” email addresses,
and if an email comes from one of these, it is NOT spam.
Publish-subscribe systems are another: lots of messages (news articles) are collected, people
express interest in certain sets of keywords, and the system must determine whether each
message matches a user’s interests.
Solution:
1) Given a set of key values S that we want to filter.
2) Create a bit array B of n bits, initially all 0s.
3) Choose a hash function h that maps key values to the n bit positions.
4) Hash each member s of S to one of the n buckets, and set that bit to 1, i.e., B[h(s)] = 1.
5) Hash each element a of the stream and output only those that hash to a bit that was set to 1,
i.e., output a if B[h(a)] == 1.
A Bloom filter creates false positives but no false negatives: if the item is in S we surely
output it; if it is not, we may still output it.
Suppose |S| = 1 billion email addresses and |B| = 1 GB = 8 billion bits. If an email address is
in S, then it surely hashes to a bucket whose bit is set to 1, so it always gets through (no
false negatives). Approximately 1/8 of the bits are set to 1, so about 1/8th of the addresses
not in S get through to the output (false positives). Actually, fewer than 1/8th, because more
than one address might hash to the same bit.
False positives and false negatives are concepts analogous to type I and type II errors in
statistical hypothesis testing, where a positive result corresponds to rejecting the null
hypothesis, and a negative result corresponds to not rejecting the null hypothesis. The terms
are often used interchangeably, but there are differences in detail and interpretation.
A Bloom filter consists of:
1. An array of n bits, initially all 0s.
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key” values to n
buckets, corresponding to the n bits of the bit-array.
3. A set S of m key values.
If a key value is in S, then the element will surely pass through the Bloom filter. However, if
the key value is not in S, it might still pass. We need to understand how to calculate the
probability of a false positive as a function of n (the bit-array length), m (the number of
members of S), and k (the number of hash functions). The model to use is throwing darts at
targets. Suppose we have x targets and y darts, and any dart is equally likely to hit any target.
After throwing the darts, how many targets can we expect to be hit at least once? The
analysis goes as follows:
The probability that a given dart will not hit a given target is (x − 1)/x. The probability that
none of the y darts will hit a given target is ((x − 1)/x)^y. We can write this expression as
(1 − 1/x)^(x(y/x)). Using the approximation (1 − ε)^(1/ε) ≈ 1/e for small ε, we conclude that the
probability that none of the y darts hits a given target is e^(−y/x).
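Plugging the numbers of the email example into this approximation (a quick check, assuming a single hash function):

import math

n = 8_000_000_000    # x targets = bits in the array
m = 1_000_000_000    # y darts   = members of S

p_zero = math.exp(-m / n)        # fraction of bits expected to stay 0, about 0.8825
p_false_positive = 1 - p_zero    # fraction of bits set to 1, about 0.1175 (< 1/8)
print(p_zero, p_false_positive)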
In general, Bloom filters guarantee no false negatives and use limited memory, which makes
them great for pre-processing before more expensive checks. They are also suitable for hardware
implementation, since the hash-function computations can in fact be parallelized. Is it better to
have one big B or k small Bs? The false-positive rate is the same, since
(1 − e^(−km/n))^k = (1 − e^(−m/(n/k)))^k, but keeping one big B is simpler.
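As an illustration of these components, here is a minimal sketch of a Bloom filter with k hash functions (the array size, k = 4, and the salted SHA-256 hashing scheme are illustrative assumptions, not part of the notes):

import hashlib

class BloomFilter:
    # n bits and k hash functions: no false negatives, and a false-positive
    # probability of roughly (1 - e^(-km/n))^k after m keys are inserted.
    def __init__(self, n_bits=8_000_000, k=4):
        self.n = n_bits
        self.k = k
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, key):
        # Derive k bucket numbers by salting one digest with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(("%d:%s" % (i, key)).encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Usage: pre-filter a stream of email addresses against the "good" set.
bf = BloomFilter()
bf.add("alice@example.com")
assert bf.might_contain("alice@example.com")   # never a false negative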
Counting Distinct Elements in a Stream
An obvious approach, if the number of elements is not very large, would be to maintain a set:
when a new element arrives, we check whether the set already contains it and, if not, add it.
The size of the set gives the number of distinct elements. However, if the number of elements
is vast, or we are maintaining counts for multiple streams, it is infeasible to keep the set in
memory. Storing the data on disk would be an option only if we are interested in offline
computation using batch-processing frameworks like MapReduce.
Like the previous algorithms we looked at, the algorithms for counting distinct elements are
also approximate, with an error threshold that can be tweaked by changing the algorithm's
parameters.
The first algorithm for counting distinct elements is the Flajolet-Martin algorithm, named after
its creators. The Flajolet-Martin algorithm is a single-pass algorithm. If there
are m distinct elements in a universe comprising n elements, the algorithm runs
in O(n) time and O(log m) space. The following steps define the algorithm.
First, we pick a hash function h that takes stream elements as input and outputs a bit
string. The length of the bit strings is large enough that the range of the hash
function is much larger than the size of the universe; we require at least log n bits if
there are n elements in the universe.
r(a) is used to denote the number of trailing zeros in the binary representation
of h(a) for an element a in the stream.
The probability that h(a) ends in at least i zeros is exactly 2^(−i). For example, for i = 0, there
is probability 1 that the tail has at least 0 zeros. For i = 1, there is a probability of 1/2 that the
last bit is zero; for i = 2, the probability is 1/4 that the last two bits are zeros, and so on. The
probability that the rightmost set bit falls at a given position drops by a factor of 1/2 with every
position from the least significant bit toward the most significant bit.
This probability should become 0 when bit position R >> log m while it should be non-zero
when R <= log m. Hence, if we find the rightmost unset bit position R such that the
probability is 0, we can say that the number of unique elements will approximately be 2 ^ R.
The Flajolet-Martin algorithm uses a multiplicative hash function to transform the non-uniform
input space into a uniform distribution. The general form of the hash function is
(a·x + b) mod c, where a and b are odd numbers and c is the size of the hash range.
The Flajolet-Martin algorithm is sensitive to the hash function used, and results vary widely
based on the data set and the hash function. Hence there are better algorithms that utilize more
than one hash function. These algorithms use the average and median values to reduce skew
and increase the predictability of the result.
For each element x of the stream:
1. y = hash(x)
2. r = get_rightmost_set_bit(y)
3. set_bit(B, r)
Once the stream has been processed:
4. R = get_rightmost_unset_bit(B)
5. return 2 ^ R
We define a hash range, big enough to hold the maximum number of possible unique values,
something as big as 2 ^ 64. Every stream element is passed through a hash function that
transforms the elements into a uniform distribution.
For each hash value, we find the position of the rightmost set bit and mark the corresponding
position in the bit set to 1. Once all elements are processed, the bit vector will have 1s at all
the positions corresponding to the position of every rightmost set bit for all elements in the
stream.
Now we find R , the rightmost 0 in this bit vector. This position R corresponds to the
rightmost set bit that we have not seen while processing the elements. This corresponds to the
probability 0 and will help in approximating the cardinality of unique elements as 2 ^ R.
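Putting the steps above together, a runnable sketch of Flajolet-Martin (the 64-bit hash range and the random odd constants a and b are illustrative assumptions):

import random

HASH_BITS = 64                     # assumed size of the hash range (2^64 values)

def make_hash(bits=HASH_BITS):
    # Multiplicative hash of the form (a*x + b) mod c with odd a and b.
    c = 1 << bits
    a = random.randrange(1, c, 2)
    b = random.randrange(1, c, 2)
    return lambda x: (a * hash(x) + b) % c

def trailing_zeros(y, bits=HASH_BITS):
    # r(a): the number of trailing zeros in the binary representation of y.
    if y == 0:
        return bits
    return (y & -y).bit_length() - 1

def flajolet_martin(stream, h):
    # Single pass: mark the rightmost-set-bit position of each hash value,
    # then estimate the distinct count as 2^R, where R is the rightmost 0.
    bitmap = 0
    for x in stream:
        bitmap |= 1 << trailing_zeros(h(x))
    R = 0
    while bitmap & (1 << R):
        R += 1
    return 2 ** R

stream = ["a", "b", "a", "c", "b", "d", "a"]
print(flajolet_martin(stream, make_hash()))    # rough estimate of the 4 distinct items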
Estimating Moments
The problem, called computing “moments,” involves the distribution of frequencies of
different elements in the stream.
Suppose a stream consists of elements chosen from a universal set. Assume the universal set
is ordered, so we can speak of the ith element for any i. Let m_i be the number of occurrences
of the ith element. Then the kth-order moment (or just kth moment) of the stream is
the sum over all i of (m_i)^k.
For now, let us assume that a stream has a particular length n. Suppose we do not have
enough space to count all the m_i’s for all the elements of the stream.
We can still estimate the second moment of the stream using a limited amount of space; the
more space we use, the more accurate the estimate will be. We compute some number of
variables. For each variable X, we store:
1. A particular element of the universal set, which we refer to as X.element, and
2. An integer X.value, which is the value of the variable.
To determine the value of a variable X, we choose a position in the stream between 1 and n,
uniformly and at random. Set X.element to be the element found there, and initialize X.value
to 1. As we read the stream, add 1 to X.value each time we encounter another occurrence of
X.element.
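Per the standard Alon-Matias-Szegedy construction, each variable X yields the estimate n(2·X.value − 1), and averaging the estimates from several variables improves accuracy. A minimal sketch, assuming for illustration that the stream is short enough to hold in a list (a true streaming implementation would pick the random positions on the fly):

import random

def estimate_second_moment(stream, num_variables=20):
    # Estimate the second moment, sum over i of (m_i)^2, from random variables.
    n = len(stream)
    positions = random.sample(range(n), k=min(num_variables, n))
    estimates = []
    for p in positions:
        element = stream[p]                                   # X.element
        value = sum(1 for x in stream[p:] if x == element)    # X.value
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

stream = list("abcbdacfdeadab")
print(estimate_second_moment(stream))   # compare with the exact second moment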
Counting Ones in a Window
As in previous sections, we focus on the situation where we cannot afford to store the entire
window. After showing an approximate algorithm for the binary case, we discuss how this
idea can be extended to summing numbers.
The DGIM (Datar-Gionis-Indyk-Motwani) algorithm is designed to count the number of 1’s in a
window of a bit stream. It uses O(log² N) bits to represent a window of N bits, and allows the
number of 1’s in the window to be estimated with an error of no more than 50%. The window is
divided into buckets, each recorded by the timestamp of its most recent 1 and its size (the
number of 1’s it contains), and the buckets must obey the following rules:
1. The right side of a bucket should always start with a 1 (if it starts with a 0, it is
neglected). E.g., 1001011 → a bucket of size 4, having four 1’s and starting with a 1 at its
right end.
2. Every bucket should have at least one 1, else no bucket can be formed.
3. All bucket sizes should be a power of 2.
4. Buckets cannot decrease in size as we move to the left (they are in increasing order of size
toward the left).
Estimating the number of 1’s and counting the buckets in the given data stream.
This picture shows how we can form the buckets based on the number of ones by following
the rules.
In the given data stream, let us assume new bits arrive from the right. When the new bit is 0:
after the new bit (0) arrives with a timestamp of 101, there is no change in the buckets.
But if the new bit that arrives is a 1, then we need to make changes:
Create a new bucket with the current timestamp and size 1.
If there were only one or two buckets of size 1, then nothing more needs to be done. However,
if there are now three buckets of size 1 (the buckets with timestamps 100, 102, and 103 in the
second step of the picture), we fix the problem by combining the leftmost (earliest) two
buckets of size 1 (the purple box).
To combine any two adjacent buckets of the same size, replace them by one bucket of twice
the size. The timestamp of the new bucket is the timestamp of the rightmost of the two
buckets.
Now, sometimes combining two buckets of size 1 may create a third bucket of size 2. If so,
we combine the leftmost two buckets of size 2 into a bucket of size 4. This process may ripple
through the bucket sizes.
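A compact sketch of this bucket maintenance (the (timestamp, size) representation, the newest-first ordering, and the small example stream are assumptions made for illustration):

def dgim_update(buckets, timestamp, bit, window_size):
    # Maintain buckets newest-first, allowing at most two buckets of each size.
    buckets = [(t, s) for (t, s) in buckets if t > timestamp - window_size]
    if bit == 1:
        buckets.insert(0, (timestamp, 1))      # new bucket of size 1
        size = 1
        while sum(1 for _, s in buckets if s == size) > 2:
            # Combine the two oldest buckets of this size into one of twice
            # the size, keeping the newer of their two timestamps.
            idx = [i for i, (_, s) in enumerate(buckets) if s == size][-2:]
            merged = (buckets[idx[0]][0], 2 * size)
            buckets = [b for i, b in enumerate(buckets) if i not in idx]
            buckets.insert(idx[0], merged)
            size *= 2
    return buckets

def dgim_estimate(buckets):
    # Count every bucket fully except the oldest, which contributes half its size.
    if not buckets:
        return 0
    sizes = [s for _, s in buckets]
    return sum(sizes[:-1]) + sizes[-1] // 2

buckets, stream = [], [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
for ts, bit in enumerate(stream, start=1):
    buckets = dgim_update(buckets, ts, bit, window_size=10)
print(buckets, dgim_estimate(buckets))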
Decaying Windows
We have assumed that a sliding window held a certain tail of the stream, either the most
recent N elements for fixed N, or all the elements that arrived after some time in the past.
Sometimes we do not want to make a sharp distinction between recent elements and those in
the distant past, but want to weight the recent elements more heavily. In this section, we
consider “exponentially decaying windows,” and an application where they are quite useful:
finding the most common “recent” elements.
Suppose we have a stream whose elements are the movie tickets purchased all over the
world, with the name of the movie as part of the element. We want to keep a summary of the
stream that is the most popular movies “currently.”
Simply counting the total number of tickets a movie has ever sold is not a good measure of
current popularity, since a movie that sold well last year but sells nothing now is not currently
popular. On the other hand, a movie that sold n tickets in each of the last 10 weeks is probably
more popular than a movie that sold 2n tickets last week but nothing in previous weeks.
One solution would be to imagine a bit stream for each movie. The ith bit has value 1 if the
ith ticket is for that movie, and 0 otherwise. Pick a window size N, which is the number of
most recent tickets that would be considered in evaluating popularity.
An alternative approach is to redefine the question so that we are not asking for a count of 1’s
in a window. Rather, let us compute a smooth aggregation of all the 1’s ever seen in the
stream, with decaying weights, so the further back in the stream, the less weight is given.
Formally, let a stream currently consist of the elements a1, a2, . . . , at, where a1 is the first
element to arrive and at is the current element. Let c be a small constant, such as 10^−6 or
10^−9. Define the exponentially decaying window for this stream to be the sum over
i = 0 to t−1 of a_(t−i)·(1 − c)^i.
Let us return to the problem of finding the most popular movies in a stream of ticket sales.
We shall use an exponentially decaying window with a constant c, which you might think of
as 10−9. That is, we approximate a sliding window holding the last one billion ticket sales.
We imagine that the number of possible movies in the stream is huge, so we do not want to
record values for the unpopular movies. Therefore, we establish a threshold, say 1/2, so that
if the popularity score for a movie goes below this number, its score is dropped from the
counting.
For reasons that will become obvious, the threshold must be less than 1, although it can be
any number less than 1. When a new ticket arrives on the stream, do the following:
1. For each movie whose score we are currently maintaining, multiply its score by (1 − c).
2. Suppose the new ticket is for movie M. If there is currently a score for M, add 1 to that
score. If there is no score for M, create one and initialize it to 1.
3. If the score for any movie has fallen below the threshold 1/2, drop that score.
It may not be obvious that the number of movies whose scores are maintained at any time is
limited. However, note that the sum of all scores is 1/c.
There cannot be more than 2/c movies with a score of 1/2 or more, or else the sum of the
scores would exceed 1/c.
Thus, 2/c is a limit on the number of movies being counted at any time.
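A small sketch of this update rule (the dictionary of scores, the demonstration value of c, and the movie names are illustrative assumptions):

def process_ticket(scores, movie, c=1e-6, threshold=0.5):
    # 1. Decay every currently maintained score by a factor of (1 - c),
    #    dropping any score that falls below the threshold.
    for m in list(scores):
        scores[m] *= (1 - c)
        if scores[m] < threshold:
            del scores[m]
    # 2. Add 1 to the score of the movie on the new ticket, creating it if needed.
    scores[movie] = scores.get(movie, 0.0) + 1.0
    return scores

scores = {}
for movie in ["Dune", "Up", "Dune", "Up", "Dune", "Oppenheimer"]:
    scores = process_ticket(scores, movie, c=0.1)   # a large c so the decay is visible
print(scores)   # the more frequently purchased movie ends up with the higher score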
Real Time Analytics
Real-time analytics makes use of all available data and resources when they are needed. It
consists of dynamic analysis and reporting based on data entered into a system less than one
minute before the actual time of use.
Real time denotes the ability to process data as it arrives, rather than storing the data and
retrieving it at some point in the future.
Real-time analytics is thus about delivering meaningful patterns in the data when they are
urgently needed.
Types of real time analytics
On-Demand Real Time Analytics – It is reactive because it waits for users to request a
query and then delivers the analytics. This is used when someone within a company needs to
take the pulse of what is happening right this minute.
Continuous Real Time Analytics – It is more proactive and alerts users with continuous
updates in real time. Example – monitoring stock market trends provides analytics that help
users make buy or sell decisions in real time.
Financial Services – Analyze market ticks, tweets, satellite imagery, weather trends, and any
other type of data to inform trading algorithms in real time.
Government – Identify social program fraud within seconds based on program history,
citizen profile, and geographical data.
E-Commerce Sites – Real-time analytics helps to tap into user preferences while people are
on the site or using the product. Knowing what a user likes at run time helps the site decide
which relevant content to make available to that user. This can result in a better overall
customer experience, leading to an increase in sales.
Companies like Facebook and Twitter generate petabytes of real-time data. This data must be
harnessed to provide real-time analytics for better business decisions. Today, billions of
devices are already connected to the internet, with more connecting every day.
Real time analytics will leverage information from all these devices to apply analytics
algorithms and generate automated actions within milliseconds of a trigger.
Input – An event happens (a new sale, a new customer, someone entering a high-security zone, etc.).
Process and Store Input – Capture the data of the event and analyze it without leveraging
resources that are dedicated to operations, using approaches such as:
• Processing in memory
• In-database analytics
• In-memory analytics
Real Time Sentiment Analysis
Sentiment analysis is widely applied to reviews and social media for a variety of applications
ranging from marketing to customer service.
A basic task in sentiment analysis is classifying the polarity of a given text at the document,
sentence, or feature/aspect level – whether the expressed opinion in a document, a sentence, or
an entity feature/aspect is positive, negative, or neutral.
Applications
A news media website interested in gaining an edge over its competitors can feature site
content that is immediately relevant to its readers. It can use social media analysis to find
topics relevant to its readers by doing real-time sentiment analysis on Twitter data, specifically
to identify what topics are trending on Twitter in real time.
Twitter has become a central site where people express their opinions and views on political
parties and candidates. Emerging events or news are often followed almost instantly by a burst
in Twitter volume, which, if analyzed in real time, can help explore how these events affect
public opinion. While traditional content analytics takes days or weeks to complete, real-time
sentiment analysis (RSTA) can look at the entire content about an election and deliver results
instantly and continuously.
Ad agencies can track the crowd sentiment during commercial viewing on TV and decide
which commercials are resulting in positive sentiment and which are not.
Analyzing the sentiment of messages posted to social media or online forums can generate
considerable business value for organizations that aim to extract timely business intelligence
about how their products or services are perceived by their customers. As a result, proactive
marketing or product-design strategies can be developed to efficiently increase the customer
base.
Tools
Apache Storm is a distributed real-time computation system for processing large volumes of
data. It integrates with the Hadoop ecosystem. Storm is extremely fast, with the ability to
process over a million records per second per node on a cluster of modest size.
Apache Solr is another tool, often used with Hadoop, that provides a highly reliable, scalable
search-engine facility in real time.
RADAR is a software solution for retailers built using a natural-language-processing-based
sentiment analysis engine and Hadoop technologies including HDFS, YARN, Apache Storm,
Apache Solr, Oozie and ZooKeeper, to help them maximize sales through data-based
continuous repricing.
Online retailers can track the following for any number of products in their portfolio:
b. Competitive pricing / promotions being offered in social media and on the web.
Stock Market Prediction
A general real-time stock prediction and machine learning architecture comprises three basic
components:
a. Incoming real-time trading data must be captured and stored, becoming historical data.
b. The system must be able to learn from historical trends in the data and recognize patterns
and probabilities to inform decisions.
c. The system needs to do a real-time comparison of new incoming trading data with the
learned patterns and probabilities based on historical data. It then predicts an outcome and
determines an action to take.
Using the live, hot data from Apache Geode, a Spark MLlib application creates and trains a
model, comparing new data to historical patterns. The models could also be built with other
toolsets such as Apache MADlib or R.
Results of the machine-learning model are pushed to other interested applications and also
updated within Apache Geode for real-time prediction and decisioning.
As data ages and starts to become cool, it is moved from Apache Geode to Apache HAWQ
and eventually lands in Apache Hadoop. Apache HAWQ allows SQL-based analysis on
petabyte-scale data sets and lets data scientists iterate on and improve models.
Another process is triggered periodically to retrain and update the machine-learning model
based on the whole historical data set. This closes the loop and creates ongoing updates and
improvements as historical patterns change or new models emerge.