Data Analytics Unit 3

UNIT-3 Mining Data Streams

Introduction to stream concepts :


A data stream is a real-time, continuous, ordered (implicitly by arrival time or
explicitly by timestamp) sequence of items. It is not feasible to control the order in
which items arrive, nor is it feasible to store a stream locally in its entirety.
Data streams involve enormous volumes of data, with items arriving at a high rate.
Types of Data Streams :
Data stream – A data stream is a (possibly unbounded) sequence of tuples. Each tuple
comprises a set of attributes, similar to a row in a database table.

Transactional data streams – logs of interactions between entities, for example:


 Credit card – purchases by consumers from merchants
 Telecommunications – phone calls by callers to the dialed parties
 Web – accesses by clients of information at servers

Measurement data streams –


 Sensor Networks – physical phenomena, road traffic
 IP Network – traffic at router interfaces
 Earth climate – temperature, humidity level at weather stations

Examples of Stream Sources-


Sensor Data – Sensor data is used in navigation systems. Imagine a temperature
sensor floating in the ocean, sending back to the base station a reading of the
surface temperature each hour. The data generated by this sensor is a stream of real
numbers. With many such sensors we might have 3.5 terabytes arriving every day, and we
certainly need to think about what can be kept available for processing and what can
only be archived.

Image Data – Satellites frequently send down to Earth streams containing many
terabytes of images per day. Surveillance cameras produce images with lower
resolution than satellites, but there can be very many of them, each producing a
stream of images at intervals of, say, one second.

Internet and Web Traffic – A switching node in the middle of the Internet receives
streams of IP packets from many inputs and routes them to its outputs. Websites
receive streams of heterogeneous types. For example, Google receives a hundred
million search queries per day.

Characteristics of Data Streams :


 Large volumes of continuous data, possibly infinite.
 Continuously changing, and requires a fast, real-time response.
 The data stream model captures nicely the data processing needs of today.
 Random access is expensive, so single-scan algorithms are required.
 Store only a summary of the data seen so far.
 Most stream data are at a fairly low level of abstraction or multidimensional in
nature, and need multilevel and multidimensional treatment.
Applications of Data Streams :
 Fraud detection
 Real-time trading
 Customer and business analytics
 Monitoring and reporting on internal IT systems

Advantages of Data Streams :


 This data is helpful in improving sales
 Helps in recognizing errors and fraud
 Helps in minimizing costs
 It provides the details needed to react swiftly to risk

Disadvantages of Data Streams :


 Lack of security of data in the cloud
 Dependence on the cloud vendor
 Off-premises storage of data introduces the potential for disconnection

Stream data model and architecture:- A streaming data architecture is a framework
of software components built to ingest and process large volumes of streaming data
from multiple sources. While traditional data solutions focused on writing and reading
data in batches, a streaming data architecture consumes data immediately as it is
generated, persists it to storage, and may include various additional components per
use case – such as tools for real-time processing, data manipulation, and analytics.
Streaming architectures must account for the unique characteristics of data streams,
which tend to generate massive amounts of data (terabytes to petabytes) that is at
best semi-structured and requires significant pre-processing and ETL to become
useful.

The Components of a Streaming Architecture


1. The Message Broker / Stream Processor
This is the element that takes data from a source, called a producer, translates it into a
standard message format, and streams it on an ongoing basis. Other components can
then listen in and consume the messages passed on by the broker.

The first generation of message brokers, such as RabbitMQ and Apache ActiveMQ,
relied on the Message Oriented Middleware (MOM) paradigm. Later, hyper-
performant messaging platforms (often called stream processors) emerged that are
more suitable for a streaming paradigm.
Popular stream processing tools:

 Apache Kafka
 RabbitMQ
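
A minimal sketch of the producer/consumer pattern described above, using the kafka-python
client; the broker address localhost:9092 and the topic name "sensor-readings" are
assumptions chosen for illustration, not part of any particular deployment:

# pip install kafka-python; assumes a Kafka broker running at localhost:9092
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# The producer (the "source") pushes events to the broker as they are generated.
producer.send("sensor-readings", {"sensor_id": 7, "temp_c": 21.4})
producer.flush()

# Any number of downstream components can listen in and consume the same stream.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:          # blocks, reading new messages as they arrive
    print(message.value)          # e.g. {'sensor_id': 7, 'temp_c': 21.4}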

2. Batch processing and real-time ETL tools

In data-intensive organizations, processing streaming data is an essential component of the
big data architecture. There are many fully managed frameworks to choose from that all
set up an end-to-end streaming data pipeline in the cloud to enable real-time analytics.

Example managed tools:

 Amazon Kinesis Data Streams


 Azure Event Hub
 Google Cloud PubSub

3. Streaming Data Storage

Organizations typically store their streaming event data in cloud object stores, which serve
as an operational data lake, due to the sheer volume and multi-structured nature of event
streams. They offer a low-cost and long-term solution for storing large amounts of
event data. They're also a flexible integration point, allowing tools from outside your
streaming ecosystem to access data.


Examples:

 Amazon Redshift
 Microsoft Azure Data Lake Storage
 Google Cloud Storage

4. Data Analytics / Serverless Query Engine

With data processed and stored in a data warehouse/data lake, you will now need
data analytics tools.

Examples (not exhaustive):

 Query engines – Athena, Presto, Hive, Redshift Spectrum, Pig


 Text search engines – Elasticsearch, OpenSearch, Solr, Kusto
 Streaming data analytics – Amazon Kinesis, Google Cloud DataFlow, Azure
Stream Analytics
Stream Computing
The stream processing computational paradigm consists of assimilating data
readings from collections of software or hardware sensors in stream form (i.e., as an
infinite series of tuples), analyzing the data, and producing actionable results, possibly
in stream format as well.
In a stream processing system, applications typically act as continuous queries,
ingesting data continuously, analyzing and correlating the data, and generating a
stream of results. Applications are represented as data-flow graphs composed of
operators interconnected by streams. The individual operators implement algorithms for
data analysis, such as parsing, filtering, feature extraction, and classification. Such
algorithms are typically single-pass because of the high data rates of external feeds
(e.g., market information from stock exchanges, environmental sensor readings from
sites in a forest, etc.).
Stream processing applications are usually constructed to identify new information by
incrementally building models and assessing whether new data deviates from model
predictions and, thus, is interesting in some way. For example, in a financial
engineering application, one might construct pricing models for options on securities
while at the same time detecting mispriced quotes from a live stock market feed. In
such an application, the predictive model itself might be refined as more market data
and other data sources become available (e.g., a feed with weather predictions,
estimates of fuel prices, or headline news).
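
A toy illustration of such a data-flow graph as a chain of Python generators, where each
stage is a single-pass operator (the stages and field names below are invented purely for
illustration):

# A toy data-flow pipeline: parse -> filter -> feature extraction (running mean).
def parse(lines):
    for line in lines:
        sensor_id, value = line.split(",")
        yield {"sensor": sensor_id, "value": float(value)}

def keep_above(readings, threshold):
    for r in readings:
        if r["value"] > threshold:
            yield r

def running_mean(readings):
    total, count = 0.0, 0
    for r in readings:
        total += r["value"]
        count += 1
        yield {"sensor": r["sensor"], "mean_so_far": total / count}

raw = iter(["s1,20.5", "s1,22.1", "s2,19.0", "s1,23.4"])   # stand-in for an endless feed
for out in running_mean(keep_above(parse(raw), 20.0)):
    print(out)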

Any number of streams can enter the system. Each stream can provide elements at its own
schedule; they need not have the same data rates or data types, and the time between
elements of one stream need not be uniform. The fact that the rate of arrival of stream
elements is not under the control of the system distinguishes stream processing from the
processing of data that goes on within a database-management system. The latter system
controls the rate at which data is read from the disk, and therefore never has to worry
about data getting lost as it attempts to execute queries. Streams may be archived in a
large archival store, but we assume it is not possible to answer queries from the archival
store. It could be examined only under special circumstances using time-consuming
retrieval processes. There is also a working store, into which summaries or parts of
streams may be placed, and which can be used for answering queries. The working store
might be disk, or it might be main memory, depending on how fast we need to process
queries. But either way, it is of sufficiently limited capacity that it cannot store all
the data from all the streams.

Sampling Data In a Stream:-


IoT devices produce constant data streams of measurements and logs, which are
difficult to analyze in real time. Especially in embedded systems or edge devices, memory
and CPU power are too limited for just-in-time analysis of such streams. Even powerful
systems will (sooner or later) have problems retaining the full history of the arrived
data.

4 Sampling Techniques for Efficient Stream Processing

1. Sliding Window:- This is the simplest and most straightforward method. A first-in,
first-out (FIFO) queue with size n and a skip/sub-sampling factor k ≥1 is maintained.
In addition to that, a stride factor s ≥ 1 describes by how many time-steps the window
is shifted before analyzing it.

Advantage
 simple to implement
 deterministic — reservoir can be
filled very fast from the beginning

Drawbacks
 the time history represented by the reservoir R is short; long-term concept drifts
cannot be detected easily — outliers can create noisy analyses
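
A minimal sketch of such a sliding-window sampler; the parameter names n, k and s mirror
the description above, but the function itself is illustrative rather than taken from a library:

from collections import deque

def sliding_window_samples(stream, n, k=1, s=1):
    """Yield FIFO windows of size n, keeping only every k-th element and
    emitting a snapshot every s accepted elements."""
    window = deque(maxlen=n)        # oldest elements fall out automatically
    for i, x in enumerate(stream):
        if i % k:                   # skip/sub-sampling factor k
            continue
        window.append(x)
        accepted = i // k + 1
        if len(window) == n and accepted % s == 0:
            yield list(window)      # snapshot handed to the analysis step

for w in sliding_window_samples(range(20), n=5, k=2, s=1):
    print(w)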

2. Unbiased Reservoir Sampling:- A reservoir R is maintained such that at time t > n
the probability of accepting point x(t) into the reservoir is equal to n/t.
The algorithm [1] is as follows:
 Fill the reservoir R with the first n points of the stream.
 At time t > n replace a randomly chosen (equal probability) entry in the
reservoir R with acceptance probability n/t.
This leads to a reservoir R(t) such that each point x(1)…x(t) is contained in R(t) with
equal probability n/t.

Advantages
 The reservoir contains data points from all history of the stream with equal
probability.
 Very simple implementation; adding a point requires only O(1)

Drawbacks
 A concept drift cannot be compensated; the oldest data point x(1) is equally
important in this sampling technique as the latest data point x(t).
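
A compact sketch of the algorithm described above (the reservoir size n is the only parameter):

import random

def reservoir_sample(stream, n):
    """Unbiased reservoir sampling: every element x(1)..x(t) seen so far
    is kept in the reservoir with probability n/t."""
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= n:
            reservoir.append(x)          # fill phase
        else:
            j = random.randrange(t)      # uniform in [0, t)
            if j < n:                    # accept with probability n/t
                reservoir[j] = x         # replace a randomly chosen slot
    return reservoir

print(reservoir_sample(range(10_000), n=5))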

3. Biased Reservoir Sampling:- In biased reservoir sampling [2], the probability of a
data point remaining in the reservoir is a decreasing function of how long it has been
in R. So the probability of finding points from the recent history in R is high, while
very old data points will be in R with very low probability.
The probability that point x(r) is contained in R(t) decreases with the elapsed time t − r;
with the exponential bias function commonly used for this scheme it is proportional to
e^(−λ(t − r)), where λ is the adjustable forgetting factor mentioned below.

Advantages:

 O(1) algorithm for adding a new data point.

 Slowly moving concept drifts can be compensated.

 An adjustable forgetting factor can be tuned for the application of interest.

Drawbacks:

It is a randomized technique, so the algorithm is non-deterministic. However, the
variance can be estimated by running an ensemble of independent reservoirs R1…RB [3].
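
A rough sketch of the biased variant under the exponential-bias assumption above, following
the general recipe of [2]; the exact bookkeeping here is illustrative, with λ as the
forgetting factor and a maximal reservoir size of about 1/λ:

import random

def biased_reservoir_sample(stream, lam):
    """Biased reservoir sampling sketch: recent points are much more likely
    to survive in the reservoir than old ones."""
    n = max(1, int(round(1.0 / lam)))    # maximal reservoir requirement ~ 1/lambda
    reservoir = []
    for x in stream:
        fraction_full = len(reservoir) / n
        if random.random() < fraction_full:
            # replace a random victim, so old points decay away over time
            reservoir[random.randrange(len(reservoir))] = x
        else:
            # deterministic insertion while the reservoir is not yet "full enough"
            reservoir.append(x)
    return reservoir

sample = biased_reservoir_sample(range(100_000), lam=0.01)
print(len(sample), sample[:10])          # mostly recent elements survive
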
4. Histograms

A histogram is maintained while observing the data stream. To this end, data points are
sorted into intervals/buckets [li, ui].

If the useful range of the observed values is known in advance, a simple vector with
counts and breakpoints could do the job.

V-optimal histograms try to minimize the variance within each histogram bucket. [4]
proposes an algorithm for efficiently maintaining an approximate V-optimal histogram
over a data stream. This is of relevance for interval data, such as a time-series of
temperature values, i.e., where absolute values and distances between values have a
meaning.
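
A minimal sketch of the simple fixed-breakpoint case mentioned above (the value range and
bucket count are assumptions chosen for illustration):

def make_histogram(lo, hi, n_buckets):
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    def add(x):
        # clamp values outside the assumed range into the edge buckets
        i = min(max(int((x - lo) / width), 0), n_buckets - 1)
        counts[i] += 1
    return add, counts

add, counts = make_histogram(lo=0.0, hi=40.0, n_buckets=8)
for reading in [12.3, 15.8, 21.4, 21.9, 35.0, -2.0]:
    add(reading)
print(counts)                            # per-bucket counts over the stream so far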

Counting Distinct Elements

Counting distinct elements is a problem that frequently arises in distributed systems. In
general, the size of the set under consideration (which we will henceforth call
the universe) is enormous. For example, if we build a system to identify denial-of-service
attacks, the set could consist of all IPv4 and IPv6 addresses. Another common
use case is to count the number of unique visitors on popular websites like Twitter or
Facebook.

An obvious approach if the number of elements is not very large would be to maintain
a Set. We can check if the set contains the element when a new element arrives. If not,
we add the element to the set. The size of the set would give the number of distinct
elements. However, if the number of elements is vast or we are maintaining counts for
multiple streams, it would be infeasible to maintain the set in memory. The algorithms
for counting distinct elements are therefore approximate, with an error threshold that can
be tuned by changing the algorithm's parameters.
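
The exact set-based approach described above, as a tiny baseline (its memory grows with the
number of distinct elements, which is exactly what the approximate algorithms below avoid):

seen = set()
for element in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:   # stand-in for a stream
    seen.add(element)          # duplicates are ignored automatically
print(len(seen))               # exact number of distinct elements: 2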

Flajolet — Martin Algorithm

The first algorithm for counting distinct elements is the Flajolet-Martin algorithm,
named after the algorithm's creators. The Flajolet-Martin algorithm is a single-pass
algorithm. If there are m distinct elements in a universe comprising n elements, the
algorithm runs in O(n) time and O(log(m)) space. The following steps
define the algorithm.

 First, we pick a hash function h that takes stream elements as input and outputs
a bit string. The length of the bit strings is large enough such that the result of the
hash function is much larger than the size of the universe. We require at least log
n bits if there are n elements in the universe.
 r(a) is used to denote the number of trailing zeros in the binary representation
of h(a) for an element a in the stream.

 R denotes the maximum value of r seen in the stream so far.

 The estimate of the number of distinct elements in the stream is (2^R).

Flajolet-Martin Pseudocode and Explanation

1. L = 64 (size of the bitset), B= bitset of size L

2. hash_func = (ax + b) mod 2^L

3. for each item x in stream

 y = hash(x)

 r = get_rightmost_set_bit(y)

 set_bit(B, r)

4. R = get_rightmost_unset_bit(B)

5. return 2 ^ R
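
A runnable sketch of the pseudocode above; hashing the item with Python's built-in hash()
before applying (ax + b) mod 2^L is an implementation convenience, and production versions
average many independent hash functions to reduce the variance of the estimate:

import random

L = 64                                   # size of the bitset
a, b = random.randrange(1, 2 ** L), random.randrange(2 ** L)

def trailing_zeros(y):
    # position of the rightmost set bit = number of trailing zeros
    return (y & -y).bit_length() - 1 if y else L

def flajolet_martin(stream):
    bitset = 0
    for x in stream:
        y = (a * hash(x) + b) % (2 ** L)
        bitset |= 1 << trailing_zeros(y) # set bit r for this element
    R = 0
    while bitset & (1 << R):             # rightmost unset bit of the bitset
        R += 1
    return 2 ** R                        # estimate of the distinct count

# ~1000 distinct values; prints a power of two near 1000 (the estimate is coarse)
print(flajolet_martin(str(i % 1000) for i in range(100_000)))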

Estimating Moments

Estimating moments is a generalization of the problem of counting distinct elements
in a stream. The problem, called computing "moments," involves the distribution of
frequencies of different elements in the stream.
Suppose a stream consists of elements chosen from a universal set. Assume the
universal set is ordered so we can speak of the ith element for any i.
Let mi be the number of occurrences of the ith element for any i. Then the kth-order
moment of the stream is the sum over all i of (mi)^k.
For example :
1. The 0th moment is the sum of 1 for each mi that is greater than 0, i.e., the 0th moment
is a count of the number of distinct elements in the stream.
2. The 1st moment is the sum of the mi’s, which must be the length of the stream.
Thus, first moments are especially easy to compute: just count the length of
the stream seen so far.
3. The second moment is the sum of the squares of the mi’s. It is sometimes called
the surprise number, since it measures how uneven the distribution of elements in
the stream is.
4. To see the distinction, suppose we have a stream of length 100, in which eleven
different elements appear. The most even distribution of these eleven elements
would have one appearing 10 times and the other ten appearing 9 times each.
5. In this case, the surprise number is 10^2 + 10 × 9^2 = 910. At the other extreme,
one of the eleven elements could appear 90 times and the other ten appear 1 time
each. Then, the surprise number would be 90^2 + 10 × 1^2 = 8110.
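
A short sketch verifying the surprise-number arithmetic above for the two example
distributions (the element names are arbitrary):

from collections import Counter

def kth_moment(stream, k):
    counts = Counter(stream)                     # m_i for every distinct element i
    return sum(m ** k for m in counts.values())

even   = ["x"] * 10 + [c for c in "abcdefghij" for _ in range(9)]   # 10 + 10*9 = 100 items
skewed = ["x"] * 90 + list("abcdefghij")                            # 90 + 10*1 = 100 items

print(kth_moment(even, 2), kth_moment(skewed, 2))   # 910 8110
print(kth_moment(even, 0), kth_moment(even, 1))     # 11 distinct elements, length 100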

Algorithm For Counting Ones In a Window:- A data stream is an ordered sequence of
instances that in many applications of data stream mining can be read only once or a
small number of times using limited computing and storage capabilities.

TYPES OF QUERIES

Ad-hoc query - You ask a query and get an immediate response. E.g.: What is the
maximum value seen so far in the stream S?

Standing query - You register a query with the system, saying “Any time you have an
answer to this query, send me the response”; here you don’t get the answer
immediately.

Now let us suppose we have a window of length N (say N = 24) on a binary stream. We
want at all times to be able to answer a query of the form “How many 1’s are there in
the last k bits?” for any k <= N.
Here the DGIM algorithm comes into the picture:

COUNTING THE NUMBER OF 1’s IN THE DATA STREAM: the DGIM algorithm
(Datar-Gionis-Indyk-Motwani Algorithm)

Designed to estimate the number of 1’s in a data stream window. This algorithm uses
O(log² N) bits to represent a window of N bits, and allows us to estimate the number of
1’s in the window with an error of no more than 50%.

In the DGIM algorithm, each bit that arrives has a timestamp for the position at which it
arrives: if the first bit has timestamp 1, the second bit has timestamp 2, and so on.
Positions are interpreted relative to the window size N (a timestamp can be stored modulo
N, i.e., in about log2 N bits). The window is divided into buckets consisting of 1’s and
0’s.

RULES FOR FORMING THE BUCKETS:

1. The right end of a bucket must always be a position with a 1 (a 0 at the right end is
not included in any bucket). E.g. 1001011 → a bucket of size 4, having four 1’s and
a 1 at its right end.

2. Every bucket should have at least one 1, else no bucket can be formed.

3. The size of every bucket (its number of 1’s) must be a power of 2.


4. Bucket sizes cannot decrease as we move to the left (they are non-decreasing in the
direction of earlier timestamps).

Let us take an example to understand the algorithm. Estimating the number of 1’s and
counting the buckets in the given data stream.

This picture shows how we can form the buckets based on the number of ones by
following the rules.

In the given data stream let us assume the new bit arrives from the right. When the new
bit = 0

After the new bit ( 0 ) arrives with a time stamp 101, there is no change in the buckets.

But if the new bit that arrives is 1, then we need to make changes:
Create a new bucket with the current timestamp and size 1.

If there was only one bucket of size 1, then nothing more needs to be done. However,
if there are now three buckets of size 1 (the buckets with timestamps 100, 102, 103 in
the second step), we fix the problem by combining the leftmost (earliest) two
buckets of size 1.

To combine any two adjacent buckets of the same size, replace them by one bucket of
twice the size. The timestamp of the new bucket is the timestamp of the rightmost of
the two buckets.

Now, sometimes combining two buckets of size 1 may create a third bucket of size 2.
If so, we combine the leftmost two buckets of size 2 into a bucket of size 4. This
process may ripple through the bucket sizes.

How long can you continue doing this…

You can continue as long as (current timestamp − leftmost bucket timestamp) < N
(= 24 here). E.g. 103 − 87 = 16 < 24, so we continue; once the difference becomes greater
than or equal to N, the leftmost (oldest) bucket is dropped.
Finally, the answer to the query “How many 1’s are there in the last 20 bits?”: counting
the sizes of the buckets in the last 20 bits, we say there are 11 ones.
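
A compact sketch of the DGIM bookkeeping described above. The class and method names are
illustrative; timestamps are kept as absolute positions (rather than modulo N) to keep the
sketch short, and the query counts only half of the oldest overlapping bucket:

class DGIM:
    def __init__(self, N):
        self.N = N
        self.t = 0
        self.buckets = []            # list of (timestamp, size), newest first

    def add(self, bit):
        self.t += 1
        # drop buckets whose most recent 1 has fallen out of the window
        self.buckets = [(ts, sz) for ts, sz in self.buckets if self.t - ts < self.N]
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            i = 0
            # merge while more than two buckets share the same size
            while i + 2 < len(self.buckets):
                if self.buckets[i][1] == self.buckets[i + 2][1]:
                    ts_new = self.buckets[i + 1][0]   # rightmost (later) of the two oldest
                    sz_new = self.buckets[i + 1][1] * 2
                    self.buckets[i + 1:i + 3] = [(ts_new, sz_new)]
                else:
                    i += 1

    def count_ones(self, k):
        """Estimate the number of 1's among the last k bits (k <= N)."""
        total, oldest = 0, 0
        for ts, sz in self.buckets:
            if self.t - ts < k:
                total += sz
                oldest = sz              # ends up holding the oldest included bucket
        return total - oldest // 2       # count only half of the oldest bucket

dgim = DGIM(N=24)
for bit in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]:
    dgim.add(bit)
print(dgim.count_ones(8))                # approximate count of 1's in the last 8 bits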

Sentiment analysis:- Sentiment analysis is a type of natural language processing for
tracking the mood of the public about a particular product. Sentiment analysis is also
known as opinion mining.
It collects and examines opinions about a particular product made in blog posts,
comments, or tweets.
Sentiment analysis can track a particular topic; many companies use it to track or
observe their products and reputation.
A basic task in sentiment analysis is classifying the polarity of a given text at the
document, sentence, or feature level, i.e., whether the expressed opinion in a document,
a sentence, or an entity feature is positive, negative, or neutral.
Sentiment classification also looks at emotional states such as “angry,” “sad,” and
“happy”.
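
A tiny polarity-classification sketch using NLTK's VADER analyzer, one of many possible
tools (requires the nltk package and a one-time download of the vader_lexicon resource;
the example sentences are made up):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")                       # one-time lexicon download
sia = SentimentIntensityAnalyzer()

for text in ["I love this phone, the camera is amazing!",
             "Worst customer service ever.",
             "The package arrived on Tuesday."]:
    score = sia.polarity_scores(text)["compound"]    # compound score in [-1, 1]
    label = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
    print(label, round(score, 2), text)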

Following are the major applications of sentiment analysis in real-world scenarios:

1. Reputation monitoring : Twitter and Facebook are a central point of many sentiment
analysis applications. The most common application is to monitor the reputation of a
particular brand on Twitter and/or Facebook.

2. Result prediction : By analyzing sentiments from related sources, one can predict
the probable outcome of a particular event.

3. Decision making : Sentiment analysis can be used as an important input supporting
decision-making systems. For instance, in financial market investment, there are
numerous news items, articles, blogs, and tweets about every public company.

4. Product and service review : The most common application of sentiment analysis
is in the area of reviews of customer products and services.

Real-Time Analytics Platform (RTAP) :-


 A real-time analytics platform enables organizations to make the most out of real-
time data by helping them to extract the valuable information and trends from it.
 Such platforms help in measuring data from the business point of view in real
time, further making the best use of data.
 An ideal real-time analytics platform would help in analyzing the data,
correlating it and predicting the outcomes on a real-time basis.
 The real-time analytics platform helps organizations in tracking things in real
time, thus helping them in the decision-making process.
 The platforms connect the data sources for better analytics and visualization.
 Real-time analytics is the analysis of data as soon as that data becomes available.
In other words, users get insights or can draw conclusions as soon as the data
enters their system.
Examples of real-time analytics include :-
 Real time credit scoring, helping financial institutions to decide immediately
whether to extend credit.
 Customer relationship management (CRM), maximizing satisfaction and business
results during each interaction with the customer.
 Fraud detection at points of sale.
 Targeting individual customers in retail outlets with promotions and incentives,
while the customers are in the store and next to the merchandise.

Real-Time Analytics Platform (RTAP) Application


Healthcare

The level of data generated within healthcare systems is not trivial. Traditionally, the
healthcare industry lagged in using Big Data because of a limited ability to standardize
and consolidate data. Researchers are mining the data to see which treatments are more
effective for particular conditions, identify patterns related to drug side effects, and
gain other important information that can help patients and reduce costs.

With the added adoption of mHealth, eHealth and wearable technologies, the volume
of data is increasing at an exponential rate. This includes electronic health record data,
imaging data, patient-generated data, sensor data, and other forms of data. By mapping
healthcare data with geographical data sets, it is possible to predict diseases that will
escalate in specific areas. Based on these predictions, it is easier to strategize
diagnostics and plan for stocking serums and vaccines.

Manufacturing

Predictive manufacturing provides near-zero downtime and transparency. It requires
an enormous amount of data and advanced prediction tools for a systematic process of
turning data into useful information.

Major benefits of using Big Data applications in the manufacturing industry are:

 Product quality and defects tracking


 Supply planning
 Manufacturing process defect tracking
 Output forecasting
 Increasing energy efficiency
 Testing and simulation of new manufacturing processes
 Support for mass-customization of manufacturing

Media & Entertainment

Various companies in the media and entertainment industry are facing new business
models for the way they create, market, and distribute their content. This is
happening because of current consumers' demand for accessing content anywhere,
any time, on any device.

Now, publishing environments are tailoring advertisements and content to appeal to
consumers. These insights are gathered through various data-mining activities. Big
Data applications benefit the media and entertainment industry by:

 Predicting what the audience wants


 Scheduling optimization
 Increasing acquisition and retention
 Ad targeting
 Content monetization and new product development

Internet of Things (IoT)

Data extracted from IoT devices provides a mapping of device inter-connectivity.


Such mappings have been used by various companies and governments to increase
efficiency. IoT is also increasingly adopted as a means of gathering sensory data, and
this sensory data is used in medical and manufacturing contexts.

Government

The use and adoption of Big Data within governmental processes allows efficiencies
in terms of cost, productivity, and innovation. In government use cases, the same data
sets are often applied across multiple applications, which requires multiple departments
to work in collaboration.
