Big Data Unit II Notes

The document discusses the concept of data streams, emphasizing their importance in various applications such as financial analysis, network monitoring, and sensor networks. It outlines the characteristics of data streams, the architecture of data stream management systems, and techniques for processing and filtering data streams, including Bloom filters and the Flajolet-Martin algorithm for counting distinct elements. The document highlights the challenges posed by the continuous and rapid arrival of data, necessitating efficient real-time processing methods.


Mining Data Streams

Introduction to Stream Concepts


 Recently, a new class of data-intensive applications has become widely recognized:
applications in which the data is best modeled not as persistent relations but as
transient data streams.

 Examples of such applications include financial applications, network monitoring,
security, telecommunications data management, web applications, manufacturing,
sensor networks, and others.

 In the data stream model, individual data items may be relational tuples, e.g., network
measurements, call records, web page visits, sensor readings, and so on.

 However, their continuous arrival in multiple, rapid, time-varying, possibly
unpredictable and unbounded streams appears to yield some fundamentally new
research problems.

Data Stream Model


 A data stream is a real-time, continuous and ordered sequence of items. It is not
possible to control the order in which the items arrive, nor is it feasible to store a
stream locally in its entirety in any memory device.

 Further, a query over streams will actually run continuously over a period of time and
return new results as new data arrives.

 Therefore, such queries are known as long-running, continuous, standing or persistent
queries.

Characteristics

1. The data model and query processor must allow both order-based and time-based
operations.

2. The inability to store a complete stream indicates that some approximate summary
structures must be used.

3. Streaming query plans must not use any operators that require the entire input before any
results are produced. Such operators will block the query processor indefinitely.
4. Any query that requires backtracking over a data stream is infeasible. This is due to the
storage and performance constraints imposed by a data stream.

5. Applications that monitor streams in real time must react quickly to unusual data values.

6. Scalability requirements dictate that parallel and shared execution of many continuous
queries must be possible.

Architecture
An input monitor may regulate the input streams, perhaps by dropping packets. Data are
typically stored in three partitions:

1. Temporary working storage (for window queries)

2. Summary storage

3. Static storage for metadata (e.g., the physical location of each source)

 Long running queries are registered in the query repository and placed into groups for
shared processing.

 The query processor communicates with the input monitor and may reoptimize the
query plans in response to changing input rates. Results are streamed to the user or
temporarily buffered.

 A Data-Stream-Management System: in analogy to a database-management system,
we can view a stream processor as a kind of data-management system; Fig. 1 shows its
high-level organization.

Any number of streams can enter the system. Each stream can provide elements at its own
schedule; they need not have the same data rates or data types, and the time between
elements of one stream need not be uniform.

The fact that the rate of arrival of stream elements is not under the control of the system
distinguishes stream processing from the processing of data that goes on within a database-
management system.

 Streams may be archived in a large archival store, but we assume it is not possible to
answer queries from the archival store. It could be examined only under special
circumstances using time-consuming retrieval processes.

 There is also a working store, into which summaries or parts of streams may be
placed, and which can be used for answering queries.
 The working store might be disk, or it might be main memory, depending on how fast
we need to process queries.

Fig.1 Data Stream Architecture

Examples of Data Stream Applications

Sensor Networks – Sensor networks are a large source of data occurring in streams. They are
used in numerous situations that require constant monitoring of several variables, based on
which important decisions are made. Alerts and alarms are generated in response to
information received from sensors. Example – perform joins of several streams, such as
temperature and ocean-current streams at weather stations, to give alerts or warnings of a
cyclone or tsunami.

Network Traffic Analysis – ISPs get information about Internet traffic, heavily used routes,
etc., to identify and predict congestion. Streams are also used to identify fraudulent activities.
Example query – check whether the current stream of actions over time is similar to a
previous intrusion on the network.

Financial Applications – Online analysis of stock prices and making hold or sell decisions
requires quickly identifying correlations and fast changing trends.

Transaction Log Analysis – Online mining of web usage logs, telephone call records and
ATM transactions are examples of data streams. The goal is to find customer behaviour
patterns. Example – identify the current buying pattern of users on a website and plan
advertising campaigns and recommendations.

Image Data

Satellites often send down to earth streams consisting of many terabytes of images per day.
Surveillance cameras produce images with lower resolution than satellites, but there can be
many of them, each producing a stream of images at intervals like one second.

Stream Computing
Stream Queries

There are two ways that queries get asked about streams.

One-time queries - (a class that includes traditional DBMS queries) are queries that are
evaluated once over a point-in-time snapshot of the data set, with the answer returned to the
user. Example - Alert when stock crosses over a price point.

Continuous queries, on the other hand, are evaluated continuously as data streams continue to
arrive. Continuous query answers may be stored and updated as new data arrives, or they may
be produced as data streams themselves. Example – aggregation queries such as maximum,
average and count, e.g., the maximum price of a stock every hour, or the number of times the
stock gains over a particular price point.

Issues in Stream Processing

First, streams often deliver elements very rapidly. We must process elements in real time, or
we lose the opportunity to process them at all, without accessing the archival storage.

Thus, it often is important that the stream-processing algorithm is executed in main memory,
without access to secondary storage or with only rare accesses to secondary storage.

Thus, many problems about streaming data would be easy to solve if we had enough
memory, but become rather hard and require the invention of new techniques in order to
execute them at a realistic rate on a machine of realistic size.

Sampling Data in a Stream

As our first example of managing streaming data, we shall look at extracting reliable samples
from a stream.

The general problem we shall address is selecting a subset of a stream so that we can ask
queries about the selected subset and have the answers be statistically representative of the
stream as a whole. If we know what queries are to be asked, then there are a number of
methods that might work, but we are looking for a technique that will allow ad-hoc queries
on the sample.
The General Sampling Problem

The running example is typical of the following general problem. (In the running example, a
search engine receives a stream of tuples (user, query, time) and wishes to store a sample of
about 1/10 of the users' queries.) Our stream consists of tuples with n components. A subset of
the components are the key components, on which the selection of the sample will be based.
In our running example, there are three components – user, query, and time – of which only
user is in the key.

However, we could also take a sample of queries by making query be the key, or even take a
sample of user-query pairs by making both those components form the key.

To take a sample of a fraction a/b of the key values, we hash the key value of each tuple to b
buckets, and accept the tuple for the sample if the hash value is less than a. If the key consists
of more than one component, the hash function needs to combine the values for those
components to make a single hash value.
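As a rough illustration, this selection rule can be sketched in Python as follows; the MD5-based hash, the tuple layout, and the example stream are illustrative assumptions, not part of the original notes.

import hashlib

def in_sample(key: str, a: int, b: int) -> bool:
    # Accept a tuple for the sample iff its key hashes to one of the first a of b buckets.
    bucket = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % b
    return bucket < a

# Hypothetical (user, query, time) tuples; only 'user' is the key component.
stream = [("alice", "big data", 1), ("bob", "bloom filter", 2), ("alice", "dgim", 3)]
sample = [t for t in stream if in_sample(t[0], a=1, b=10)]   # keeps roughly 1/10 of the users

For a multi-component key, the components would first be combined (for example by concatenating them) before hashing, as noted above.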

Varying the Sample Size

Often, the sample will grow as more of the stream enters the system. In our running example,
we retain all the search queries of the selected 1/10th of the users, forever. As time goes on,
more searches for the same users will be accumulated, and new users that are selected for the
sample will appear in the stream.

If we have a budget for how many tuples from the stream can be stored as the sample, then
the fraction of key values must vary, lowering as time goes on.

In order to assure that at all times, the sample consists of all tuples from a subset of the key
values, we choose a hash function h from key values to a very large number of values 0, 1, . .
. ,B−1. We maintain a threshold t, which initially can be the largest bucket number, B −1.

If the number of stored tuples of the sample exceeds the allotted space, we lower t to t−1 and
remove from the sample all those tuples whose key K hashes to t.

For efficiency, we can lower t by more than 1, and remove the tuples with several of the
highest hash values, whenever we need to throw some key values out of the sample.

Further efficiency is obtained by maintaining an index on the hash value, so we can find all
those tuples whose keys hash to a particular value quickly.
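A minimal sketch of this budgeted scheme follows, assuming a hash range of B values and an in-memory dictionary keyed by hash value as the index; the sizes and the MD5 hash are illustrative choices, not prescribed by the notes.

import hashlib

B = 1_000_000                     # number of possible hash values (illustrative)
BUDGET = 10_000                   # maximum number of tuples we may store
t = B - 1                         # threshold: a tuple is kept iff h(key) <= t
stored = 0
index = {}                        # hash value -> list of stored tuples (the index on h)

def h(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % B

def add_tuple(key: str, tup) -> None:
    global t, stored
    hv = h(key)
    if hv > t:
        return                    # key value is outside the sampled subset
    index.setdefault(hv, []).append(tup)
    stored += 1
    while stored > BUDGET:        # shrink the sampled subset of key values
        stored -= len(index.pop(t, []))
        t -= 1                    # may be lowered several times in a row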

Filtering Data Streams


Due to the nature of data streams, stream filtering is one of the most useful and practical
approaches to efficient stream evaluation, whether it is done implicitly by the system to
guarantee the stability of the stream processing under overload conditions, or explicitly by
the evaluating procedure. In this section we will review some of the filtering techniques
commonly used in data stream processing.

A common process on streams is selection or filtering. We want to accept those tuples in the
stream that meet a criterion. Accepted tuples are passed to another process as a stream, while
other tuples are dropped. If the selection criterion is a property of the tuple that can be
calculated (e.g., the first component is less than 10), then the selection is easy to do. The
problem becomes harder when the criterion involves lookup for membership in a set. It is
especially hard, when that set is too large to store in main memory. In this section, we shall
discuss the technique known as “Bloom filtering” as a way to eliminate most of the tuples
that do not meet the criterion.

Applications

Email spam filtering is a good example: suppose we know 1 billion “good” email addresses; if
an email comes from one of these, it is NOT spam.

Publish-subscribe systems are another: lots of messages (news articles) are collected, people
express interest in certain sets of keywords, and we must determine whether each message
matches a user's interest.

Solution (a code sketch follows the steps below):

1) Given a set of keys S that we want to filter

2) Create a bit array B of n bits, initially all 0s

3) Choose a hash function h with range [0,n)

4) Hash each member s of S to one of the n buckets, and set that bit to 1, i.e., B[h(s)] = 1

5) Hash each element a of the stream and output only those that hash to a bit that was set to 1

– Output a if B[h(a)] == 1

– Output the item since it may be in S.
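The five steps above can be sketched directly in Python; the SHA-1 hash, the bit-array size n, and the example addresses are assumptions made for illustration.

import hashlib

n = 1 << 20                          # number of bits in B (illustrative size)
B = bytearray(n // 8)                # bit array, initially all 0s

def h(x: str) -> int:
    # Single hash function with range [0, n).
    return int(hashlib.sha1(x.encode("utf-8")).hexdigest(), 16) % n

S = {"good1@example.com", "good2@example.com"}     # hypothetical set of keys to keep
for s in S:
    B[h(s) // 8] |= 1 << (h(s) % 8)                # step 4: set bit B[h(s)] = 1

def passes_filter(a: str) -> bool:
    return bool(B[h(a) // 8] & (1 << (h(a) % 8)))  # step 5: output a only if B[h(a)] == 1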


Fig.2. Filtering Stream

A Bloom filter creates false positives but no false negatives, i.e., if the item is in S we surely
output it; if it is not, we may still output it.

Suppose |S| = 1 billion email addresses and |B| = 1 GB = 8 billion bits. If the email address is
in S, then it surely hashes to a bucket that has its bit set to 1, so it always gets through (no
false negatives). Approximately 1/8 of the bits are set to 1, so about 1/8 of the addresses
not in S get through to the output (false positives). Actually, fewer than 1/8, because more
than one address might hash to the same bit.

False positives and false negatives are concepts analogous to type I and type II errors in
statistical hypothesis testing, where a positive result corresponds to rejecting the null
hypothesis, and a negative result corresponds to not rejecting the null hypothesis. The terms
are often used interchangeably, but there are differences in detail and interpretation.

The Bloom Filter

A Bloom filter consists of:

1. An array of n bits, initially all 0’s.

2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key” values to n
buckets, corresponding to the n bits of the bit-array.

3. A set S of m key values.


The purpose of the Bloom filter is to allow through all stream elements whose keys are in S,
while rejecting most of the stream elements whose keys are not in S. To initialize the bit
array, begin with all bits 0. Take each key value in S and hash it using each of the k hash
functions. Set to 1 each bit that is hi(K) for some hash function hi and some key value K in S.
To test a key K that arrives in the stream, check that all of h1(K), h2(K), . . . , hk(K) are 1’s
in the bit-array. If all are 1’s, then let the stream element through. If one or more of these bits
are 0, then K could not be in S, so reject the stream element.
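A small sketch of such a filter with k hash functions follows; deriving the k functions by salting a single SHA-256 hash is an assumption made here for simplicity (real deployments typically use an independent family of hash functions).

import hashlib

class BloomFilter:
    # n-bit array with k hash functions; a sketch, not production code.
    def __init__(self, n: int, k: int):
        self.n, self.k = n, k
        self.bits = bytearray((n + 7) // 8)

    def _positions(self, key: str):
        for i in range(self.k):
            yield int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % self.n

    def add(self, key: str) -> None:                  # initialization: hash each key in S
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:        # test: all of h1(K), ..., hk(K) must be 1
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter(n=8_000, k=3)
for key in ["alice@example.com", "bob@example.com"]:  # the set S
    bf.add(key)
print(bf.might_contain("alice@example.com"))          # True: no false negatives
print(bf.might_contain("eve@example.com"))            # usually False; True would be a false positive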

Analysis of Bloom Filtering

If a key value is in S, then the element will surely pass through the Bloom filter. However, if
the key value is not in S, it might still pass. We need to understand how to calculate the
probability of a false positive, as a function of n, the bit-array length, m the number of
members of S, and k, the number of hash functions. The model to use is throwing darts at
targets. Suppose we have x targets and y darts. Any dart is equally likely to hit any target.
After throwing the darts, how many targets can we expect to be hit at least once? The
analysis goes as follows:

The probability that a given dart will not hit a given target is (x − 1)/x. The probability that
none of the y darts will hit a given target is ((x − 1)/x)^y. We can write this expression as
(1 − 1/x)^(x(y/x)). Using the approximation (1 − ε)^(1/ε) ≈ 1/e for small ε, we conclude that
the probability that none of the y darts hit a given target is e^(−y/x).

In general, Bloom filters guarantee no false negatives and use limited memory, so they are
great for pre-processing before more expensive checks. They are suitable for hardware
implementation, since the hash computations can in fact be parallelized. Is it better to
have 1 big B or k small Bs? It is the same: (1 − e^(−km/n))^k vs. (1 − e^(−m/(n/k)))^k. But
keeping 1 big B is simpler.
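These formulas can be checked numerically; the sketch below plugs the running example (m = 10^9 members, n = 8 x 10^9 bits) into the darts model, with k = 6 added purely as an illustrative comparison.

import math

def fraction_bits_set(m: int, n: int, k: int = 1) -> float:
    # Expected fraction of bits set after throwing k*m darts at n targets.
    return 1 - math.exp(-k * m / n)

def false_positive_rate(m: int, n: int, k: int) -> float:
    # Probability that all k bits probed for a non-member are already set.
    return (1 - math.exp(-k * m / n)) ** k

m, n = 10**9, 8 * 10**9
print(fraction_bits_set(m, n))          # ~0.1175, slightly less than 1/8
print(false_positive_rate(m, n, k=1))   # same value: the false-positive rate with one hash function
print(false_positive_rate(m, n, k=6))   # ~0.0216: several hash functions reduce false positives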

Counting distinct elements


Counting distinct elements is a problem that frequently arises in distributed systems. In
general, the size of the set under consideration (which we will henceforth call the universe) is
enormous. For example, if we build a system to identify denial-of-service attacks, the set could
consist of all IPv4 and IPv6 addresses. Another common use case is to count the number of
unique visitors on popular websites like Twitter or Facebook.

An obvious approach if the number of elements is not very large would be to maintain a Set.
We can check if the set contains the element when a new element arrives. If not, we add the
element to the set. The size of the set would give the number of distinct elements. However, if
the number of elements is vast or we are maintaining counts for multiple streams, it would be
infeasible to maintain the set in memory. Storing the data on disk would be an option if we are
only interested in offline computation using batch processing frameworks like Map Reduce.
Like the previous algorithms we looked at, the algorithms for counting distinct elements are
also approximate, with an error threshold that can be tweaked by changing the algorithm's
parameters.

Flajolet-Martin Algorithm

The first algorithm for counting distinct elements is the Flajolet-Martin algorithm, named after
the algorithm's creators. The Flajolet-Martin algorithm is a single-pass algorithm. If there
are m distinct elements in a universe comprising n elements, the algorithm runs
in O(n) time and O(log m) space. The following steps define the algorithm.

 First, we pick a hash function h that takes stream elements as input and outputs a bit
string. The bit strings are long enough that there are far more possible hash values than
elements in the universe; we require at least log n bits if there are n elements in the
universe.

 r(a) is used to denote the number of trailing zeros in the binary representation
of h(a) for an element a in the stream.

 R denotes the maximum value of r seen in the stream so far.

 The estimate of the number of distinct elements in the stream is (2^R).

To intuitively understand why the algorithm works consider the following.

The probability that h(a) ends in at least i zeros is exactly 2^(-i). For example, for i = 0, there
is probability 1 that the tail has at least 0 zeros. For i = 1, there is a probability of 1/2 that the
last bit is zero; for i = 2, the probability is 1/4 that the last two bits are zeros, and so on. The
probability of seeing the rightmost set bit at a given position drops by a factor of 1/2 with
every position from the least significant bit to the most significant bit.

This probability should become 0 when the bit position R >> log m, while it should be
non-zero when R <= log m. Hence, if we find the rightmost unset bit position R (the position
at which this probability drops to 0), we can say that the number of unique elements will be
approximately 2^R.

The Flajolet-Martin algorithm uses a multiplicative hash function to transform the non-uniform
input space into a uniform distribution. The general form of the hash function is
(ax + b) mod c, where a and b are odd numbers and c is the length of the hash range.

The Flajolet-Martin algorithm is sensitive to the hash function used, and results vary widely
based on the data set and the hash function. Hence there are better algorithms that utilize more
than one hash function. These algorithms use the average and median values to reduce skew
and increase the predictability of the result.

Flajolet-Martin Pseudocode and Explanation

1. L = 64 (size of the bit set), B = bit set of size L

2. hash_func = (ax + b) mod 2^L

3. for each item x in stream

 y = hash_func(x)

 r = get_rightmost_set_bit(y)

 set_bit(B, r)

4. R = get_rightmost_unset_bit(B)

5. return 2 ^ R

We define a hash range, big enough to hold the maximum number of possible unique values,
something as big as 2 ^ 64. Every stream element is passed through a hash function that
transforms the elements into a uniform distribution.

For each hash value, we find the position of the rightmost set bit and mark the corresponding
position in the bit set to 1. Once all elements are processed, the bit vector will have 1s at all
the positions corresponding to the position of every rightmost set bit for all elements in the
stream.

Now we find R , the rightmost 0 in this bit vector. This position R corresponds to the
rightmost set bit that we have not seen while processing the elements. This corresponds to the
probability 0 and will help in approximating the cardinality of unique elements as 2 ^ R.
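A runnable sketch of the pseudocode above is given here, assuming Python's built-in hash as the raw key and randomly chosen odd values for a and b; these choices are illustrative, and real systems would pick the hash family more carefully.

import random

L = 64                                   # size of the bit set / length of hash values
MASK = (1 << L) - 1
a = random.randrange(1, 1 << L, 2)       # odd multiplier
b = random.randrange(1, 1 << L, 2)       # odd offset

def h(x) -> int:
    # Multiplicative hash (a*x + b) mod 2^L.
    return (a * hash(x) + b) & MASK

def rightmost_set_bit(y: int) -> int:
    # Position of the rightmost set bit = number of trailing zeros.
    return L if y == 0 else (y & -y).bit_length() - 1

def flajolet_martin(stream) -> int:
    bitset = 0
    for x in stream:
        bitset |= 1 << rightmost_set_bit(h(x))
    R = 0                                # R = position of the rightmost unset bit
    while bitset & (1 << R):
        R += 1
    return 2 ** R

stream = [random.randrange(10_000) for _ in range(50_000)]
print(flajolet_martin(stream))           # rough power-of-two estimate of ~10,000 distinct values

Because a single hash function gives a noisy, power-of-two estimate, practical implementations run many such sketches and combine them with averages and medians, as noted above.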
Estimating Moments
The problem, called computing “moments,” involves the distribution of frequencies of
different elements in the stream.

Suppose a stream consists of elements chosen from a universal set. Assume the universal set
is ordered so we can speak of the ith element for any i. Let mi be the number of occurrences
of the ith element for any i. Then the kth-order moment (or just kth moment) of the stream is
the sum over all i of (mi)^k.

The Alon-Matias-Szegedy Algorithm for Second Moments

For now, let us assume that a stream has a particular length n. Suppose we do not have
enough space to count all the mi’s for all the elements of the stream.

We can still estimate the second moment of the stream using a limited amount of space; the
more space we use, the more accurate the estimate will be. We compute some number of
variables. For each variable X, we store:

1. A particular element of the universal set, which we refer to as X.element ,and

2. An integer X.value, which is the value of the variable. To determine the value of a variable
X, we choose a position in the stream between 1 and n, uniformly and at random. Set
X.element to be the element found there, and initialize X.value to 1. As we read the stream,
add 1 to X.value each time we encounter another occurrence of X.element.
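The text stops just short of the estimate itself; in the standard AMS construction, the second moment is estimated from a single variable X as n * (2 * X.value − 1), and several such variables are averaged. The sketch below follows the construction described above for a stream of known length n; the toy stream and the number of variables are illustrative.

import random

def ams_second_moment(stream, num_vars: int) -> float:
    # Single-pass sketch of the AMS second-moment estimator for a stream of known length n.
    n = len(stream)
    positions = set(random.sample(range(n), num_vars))   # positions chosen uniformly at random
    variables = []                                       # each variable is [element, value]
    for i, item in enumerate(stream):
        for v in variables:
            if v[0] == item:
                v[1] += 1                                # another occurrence of X.element
        if i in positions:
            variables.append([item, 1])                  # X.element = item, X.value = 1
    estimates = [n * (2 * value - 1) for _, value in variables]
    return sum(estimates) / len(estimates)

stream = list("abcbdacdabdcaab")
true_m2 = sum(stream.count(e) ** 2 for e in set(stream))
print(true_m2, ams_second_moment(stream, num_vars=5))    # exact value vs. AMS estimate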

Counting Ones in a Window


Suppose we have a window of length N on a binary stream. We want at all times to be able to
answer queries of the form “how many 1’s are there in the last k bits?” for any k ≤ N .

As in previous sections, we focus on the situation where we cannot afford to store the entire
window. After showing an approximate algorithm for the binary case, we discuss how this
idea can be extended to summing numbers.

DGIM algorithm (Datar-Gionis-Indyk-Motwani Algorithm)

Designed to find the number of 1’s in a data stream window. This algorithm uses O(log² N)
bits to represent a window of N bits, and allows us to estimate the number of 1’s in the
window with an error of no more than 50%.

In other words, the algorithm's answer is within 50% of the true count.


In the DGIM algorithm, each bit that arrives has a timestamp for the position at which it
arrives: the first bit has timestamp 1, the second bit has timestamp 2, and so on. Positions are
recognized relative to the window size N (window sizes are usually taken as a multiple of 2).
The window is divided into buckets consisting of 1’s and 0's.

RULES FOR FORMING THE BUCKETS:

1. The right side of the bucket should always start with 1 (if it starts with a 0, it is to be
neglected). E.g., 1001011 → a bucket of size 4, having four 1’s and starting with 1 at its
right end.

2. Every bucket should have at least one 1, else no bucket can be formed.

3. All bucket sizes should be powers of 2.

4. The buckets cannot decrease in size as we move to the left (sizes increase or stay the same
as we move towards the left).

Let us take an example to understand the algorithm.

Estimating the number of 1’s and counting the buckets in the given data stream.

This picture shows how we can form the buckets based on the number of ones by following
the rules.
In the given data stream, let us assume the new bit arrives from the right.

When the new bit = 0: after the new bit (0) arrives with timestamp 101, there is no change in
the buckets.

But if the new bit that arrives is 1, then we need to make changes:
 Create a new bucket with the current timestamp and size 1.
 If there was only one bucket of size 1, then nothing more needs to be done. However, if
there are now three buckets of size 1 (the buckets with timestamps 100, 102, 103 in the
second step in the picture), we fix the problem by combining the leftmost (earliest) two
buckets of size 1 (purple box).

To combine any two adjacent buckets of the same size, replace them by one bucket of twice
the size. The timestamp of the new bucket is the timestamp of the rightmost of the two
buckets.

Now, sometimes combining two buckets of size 1 may create a third bucket of size 2. If so,
we combine the leftmost two buckets of size 2 into a bucket of size 4. This process may ripple
through the bucket sizes.
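The bucket bookkeeping described above can be sketched as follows; representing each bucket as a (timestamp, size) pair, keeping the list oldest-first, and counting half of the oldest overlapping bucket when answering a query are choices spelled out here as assumptions.

class DGIM:
    # Sketch of DGIM bucket maintenance for a window of the last N bits.
    def __init__(self, N: int):
        self.N = N
        self.t = 0                          # current timestamp
        self.buckets = []                   # (timestamp of its rightmost 1, size); oldest first

    def add(self, bit: int) -> None:
        self.t += 1
        if self.buckets and self.buckets[0][0] <= self.t - self.N:
            self.buckets.pop(0)             # the oldest bucket has left the window
        if bit == 0:
            return                          # a 0 causes no change to the buckets
        self.buckets.append((self.t, 1))    # new bucket of size 1, current timestamp
        size = 1
        # If three buckets of some size exist, merge the earliest two into one of twice
        # the size, keeping the timestamp of the more recent; this may ripple upwards.
        while [s for _, s in self.buckets].count(size) > 2:
            i = next(j for j, (_, s) in enumerate(self.buckets) if s == size)
            ts_recent = self.buckets[i + 1][0]
            self.buckets[i:i + 2] = [(ts_recent, size * 2)]
            size *= 2

    def estimate(self, k: int) -> int:
        # Estimate the 1s among the last k <= N bits: all buckets inside, plus half the oldest.
        inside = [(ts, s) for ts, s in self.buckets if ts > self.t - k]
        return 0 if not inside else sum(s for _, s in inside[1:]) + inside[0][1] // 2

dgim = DGIM(N=16)
for b in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]:
    dgim.add(b)
print(dgim.estimate(8))                     # approximate number of 1s in the last 8 bits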

Decaying Windows
We have assumed that a sliding window held a certain tail of the stream, either the most
recent N elements for fixed N, or all the elements that arrived after some time in the past.

Sometimes we do not want to make a sharp distinction between recent elements and those in
the distant past, but want to weight the recent elements more heavily. In this section, we
consider “exponentially decaying windows,” and an application where they are quite useful:
finding the most common “recent” elements.

The Problem of Most-Common Elements

Suppose we have a stream whose elements are the movie tickets purchased all over the
world, with the name of the movie as part of the element. We want to keep a summary of the
stream that is the most popular movies “currently.”

On the other hand, a movie that sold n tickets in each of the last 10 weeks is probably more
popular than a movie that sold 2n tickets last week but nothing in previous weeks.

One solution would be to imagine a bit stream for each movie. The ith bit has value 1 if the
ith ticket is for that movie, and 0 otherwise. Pick a window size N, which is the number of
most recent tickets that would be considered in evaluating popularity.

Definition of the Decaying Window

An alternative approach is to redefine the question so that we are not asking for a count of 1’s
in a window. Rather, let us compute a smooth aggregation of all the 1’s ever seen in the
stream, with decaying weights, so the further back in the stream, the less weight is given.
Formally, let a stream currently consist of the elements a1, a2, . . . , at, where a1 is the first
element to arrive and at is the current element. Let c be a small constant, such as 10^-6 or
10^-9. Define the exponentially decaying window for this stream to be the sum over
i = 0, 1, . . . , t-1 of a(t-i) · (1 - c)^i.

Finding the Most Popular Elements

Let us return to the problem of finding the most popular movies in a stream of ticket sales.
We shall use an exponentially decaying window with a constant c, which you might think of
as 10^-9. That is, we approximate a sliding window holding the last one billion ticket sales.

We imagine that the number of possible movies in the stream is huge, so we do not want to
record values for the unpopular movies. Therefore, we establish a threshold, say 1/2, so that
if the popularity score for a movie goes below this number, its score is dropped from the
counting.

For reasons that will become obvious, the threshold must be less than 1, although it can be
any number less than 1. When a new ticket arrives on the stream, do the following:

1. For each movie whose score we are currently maintaining, multiply its score by (1 − c).

2. Suppose the new ticket is for movie M. If there is currently a score for M, add 1 to that
score. If there is no score for M, create one and initialize it to 1.

3. If any score is below the threshold 1/2, drop that score.

It may not be obvious that the number of movies whose scores are maintained at any time is
limited. However, note that the sum of all scores is 1/c.

There cannot be more than 2/c movies with a score of 1/2 or more, or else the sum of the
scores would exceed 1/c.

Thus, 2/c is a limit on the number of movies being counted at any time.
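A minimal sketch of this update rule follows; the decay constant, threshold, and ticket stream are illustrative values.

def update_scores(scores: dict, movie: str, c: float = 1e-6, threshold: float = 0.5) -> None:
    # One ticket arrives for `movie`: update the decaying-window scores in place.
    for m in scores:
        scores[m] *= (1 - c)                             # 1. decay every maintained score
    scores[movie] = scores.get(movie, 0.0) + 1.0         # 2. credit the movie on the new ticket
    for m in [m for m, s in scores.items() if s < threshold]:
        del scores[m]                                    # 3. drop scores below the threshold

scores = {}
for ticket in ["Movie A", "Movie B", "Movie A", "Movie A", "Movie C"]:
    update_scores(scores, ticket)
print(scores)                                            # current popularity estimates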

Real Time Analytical Platform

Real time analytics makes use of all available data and resources when they are needed. It
consists of dynamic analysis and reporting, based on data entered into a system less than one
minute before the actual time of use.

Real time denotes the ability to process data as it arrives, rather than storing the data and
retrieving it at some point in the future.

Real time analytics is thus delivering meaningful patterns in the data for something urgent.
Types of real time analytics

On Demand Real Time Analytics – It is reactive because it waits for users to request a
query and then delivers the analytics. This is used when someone within a company needs to
take a pulse on what is happening right this minute.
Continuous Real Time Analytics – It is more proactive and alerts users with continuous
updates in real time. Example – monitoring stock market trends provides analytics that help
users make a decision to buy or sell, all in real time.

Real Time Analytics Applications

Financial Services – Analyze ticks, tweets, satellite imagery, weather trends, and any other
type of data to inform trading algorithms in real time.

Government – Identify social program fraud within seconds based on program history,
citizen profile, and geographical data.

E-Commerce sites – Real time analytics helps to tap into user preferences as people are
on the site or using the product. Knowing what a user likes at run time can help the site
decide what relevant content to make available to that user. This can result in a better
customer experience overall, leading to an increase in sales.

Insurance Industry – Digital channels of customer interaction, as well as conversations online,
have created new streams of real time event data.

Generic Design of an RTAP

Companies like Facebook and Twitter generate petabytes of real time data. This data must be
harnessed to provide real time analytics to make better business decisions. Today billions of
devices are already connected to the internet, with more connecting every day.

Real time analytics will leverage information from all these devices to apply analytics
algorithms and generate automated actions within milliseconds of a trigger.

Real time analytics needs the following aspects of data flow:

Input - An event happens (New sale, new customer, someone enters a high security zone etc.)

Process and Store Input – Capture the data of the event, and analyze the data without
leveraging resources that are dedicated to operations.

Output – Consume this data without disrupting operations

The following key capabilities must be provided by any analytical platform

• Delivering in Memory Transaction Speed

• Quickly Moving Unneeded data to disk for long term storage

• Distributing data and processing for speed

• Supporting continuous queries for real time events

• Embedding data into apps or apps into database


• Additional requirements

Many technologies support real time analytics; they include:

• Processing in memory

• In database analytics

• Data warehouse applications

• In memory analytics

• Massive parallel programming

REAL TIME SENTIMENT ANALYSIS


Sentiment analysis, also known as opinion mining, refers to the use of natural language
processing, text analysis and computational linguistics to identify and extract subjective
information from source materials.

Sentiment analysis is widely applied to reviews and social media for a variety of applications
ranging from marketing to customer service.

A basic task in sentiment analysis is classifying the polarity of a given text at the document,
sentence, or feature/aspect level, i.e., whether the opinion expressed in a document, a sentence
or an entity feature/aspect is positive, negative or neutral.

Applications

A news media website is interested in getting an edge over its competitors by featuring site
content that is immediately relevant to its readers. It uses social media analysis to find topics
relevant to its readers by doing real time sentiment analysis on Twitter data, specifically to
identify what topics are trending in real time on Twitter.

Twitter has become a central site where people express their opinions and views on political
parties and candidates. Emerging events or news are often followed almost instantly by a burst
in Twitter volume which, if analyzed in real time, can help explore how these events affect
public opinion. While traditional content analytics takes days or weeks to complete, real time
sentiment analysis can look at the entire content about an election and deliver results instantly
and continuously.

Ad agencies can track the crowd sentiment during commercial viewing on TV and decide
which commercials are resulting in positive sentiment and which are not.

Analyzing the sentiments of messages posted to social media or online forums can generate
countless business value for organizations that aim to extract timely business intelligence
about how their products or services are perceived by their customers. As a result, proactive
marketing or product design strategies can be developed to efficiently increase the customer
base.

Tools

Apache Storm is a distributed real time computation system for processing large volumes of
data. It is part of the Hadoop ecosystem. Storm is extremely fast, with the ability to process
over a million records per second per node on a cluster of modest size.

Apache Solr is another tool in the Hadoop ecosystem which provides a highly reliable,
scalable search engine facility in real time.

RADAR is a software solution for retailers built using a natural language processing based
sentiment analysis engine and utilizing Hadoop technologies, including HDFS, YARN,
Apache Storm, Apache Solr, Oozie and ZooKeeper, to help them maximize sales through
data-based continuous repricing.

Online retailers can track the following for any number of products in their portfolio:

a. Social sentiment for each product

b. Competitive pricing / promotions being offered in social media and in the web.

REAL TIME STOCK PREDICTION


Traditional stock market prediction algorithms check historical stock prices and try to predict
the future using different models. But in a real time scenario stock market trends change
continually: economic forces, new products, competition, world events, regulations and even
tweets are all factors that affect stock prices. Thus real time analytics to predict stock prices is
the need of the hour.

A general real time stock prediction and machine learning architecture comprises three
basic components:

a. Incoming real time trading data must be captured and stored becoming historical data.

b. The system must be able to learn from historical trends in the data and recognize patterns
and probabilities to inform decisions.

c. The system needs to do a real time comparison of new incoming trading data with the
learned patterns and probabilities based on historical data. Then it predicts an outcome and
determines an action to take.

For example, consider the following:


Live data from Yahoo Finance or any other finance news RSS feed is read and processed. The
data is then stored in memory in a fast, consistent, resilient and linearly scalable system.

Using the live hot data from Apache Geode, a Spark MLlib application creates and trains a
model, comparing new data to historical patterns. The models could also be supported by
other toolsets such as Apache MADlib or R.

Results of the machine-learning model are pushed to other interested applications and also
updated within Apache Geode for real time prediction and decisioning.

As data ages and starts to become cool it is moved from Apache Geode to Apache HAWQ
and eventually lands in Apache Hadoop. Apache HAWQ allows for SQL based analysis on
petabyte scale data sets and allows data scientists to iterate on and improve models.

Another process is triggered to periodically retrain and update the machine learning model
based on the whole historical data set. This closes the loop and creates ongoing updates and
improvements when historical patterns change or as new models emerge.
