
UNIT 3

MINING DATA STREAM

Introduction to Streams Concept:


* Refers to a sequence of data elements or symbols made available over time
* A data stream is transmitted from a source and received at the processing end in a network
* A continuous stream of data flows between the source and receiver ends, and is processed in real time
* Also refers to the communication of bytes or characters over sockets in a computer network
* A program uses a stream as an underlying data type in inter-process communication channels.
Examples of Data Stream Applications
1. Making data-driven marketing decisions in real time. This requires trend analysis of real-time sales, social media analysis, and the sales distribution.
2. Monitoring and detecting potential system failures using network management tools
3. Monitoring industrial or manufacturing machinery in real time
4. A sensor network or IoT system controlled by another entity, or a set of entities
5. Watching online video lectures, and rewinding or forwarding them.
Application processing of a data stream
• Processing is done in micro-batches instead of large batches
• Stream processing can be understood as filling milk bottles on a conveyor belt and capping them one at a time, rather than in one large batch at the same time
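The micro-batch idea above can be sketched in Python; the generator, record shape, and batch size here are illustrative, not tied to any particular streaming framework:

```python
from itertools import islice

def data_stream():
    """Simulate an unbounded source by yielding records one at a time."""
    for i in range(10):              # a real stream would never terminate
        yield {"id": i, "value": i * 10}

def micro_batches(stream, batch_size):
    """Group a stream into small micro-batches (lists) of fixed size."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Process the stream three records at a time instead of all at once
batches = list(micro_batches(data_stream(), 3))
```

Each batch is processed as soon as it fills, like capping one bottle at a time on the conveyor belt.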
Types of Data Streams :
Data stream
A data stream is a (possibly unbounded) sequence of tuples. Each tuple comprises a set of attributes, similar to a row in a database table.
Transactional data stream
It is a log of interactions between entities:
Credit card – purchases by consumers from merchants
Telecommunications – phone calls by callers to the dialed parties
Web – accesses by clients of information at servers
Measurement data streams
Sensor Networks – a physical natural phenomenon, road traffic
IP Network – traffic at router interfaces
Earth climate – temperature, humidity level at weather stations
Stream Data Model and Architecture:
Data Stream Model
• Stream is data in motion
• Three approaches for updating the endpoints (sinks) are (i) non-overlapping, (ii)
slow (batch processing) and (iii) fast (near real-time)
• Different ways of modeling data stream, querying, processing and management
1. Object-based data stream model
• Data-flows modeled as objects
• Examples: Cougar and Tribeca object based data stream
• Cougar models sensors’ data as a stream of objects
• Tribeca models the network monitoring data as a stream of objects.
2. XML-based data stream model
• Example: NiagaraCQ, an XML-based data stream model
• Scalable continuous query processing over XML documents
• Performs operations over millions of simultaneous queries by dynamically
grouping them according to their structural similarities.
Window-based data stream model
• Stream data can be directed towards a fixed window, a sliding window or a landmark window sink (end-point). [A window is a time interval during which the data stream is examined at an instant.]
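As an illustrative sketch of the window idea, a fixed-size window over the most recent elements can be kept with a bounded deque (a simplified count-based window, not a time-based one):

```python
from collections import deque

class SlidingWindow:
    """Keep only the most recent `size` elements of a stream."""
    def __init__(self, size):
        self.window = deque(maxlen=size)  # old elements fall off automatically

    def add(self, item):
        self.window.append(item)

    def contents(self):
        return list(self.window)

w = SlidingWindow(size=3)
for x in [1, 2, 3, 4, 5]:
    w.add(x)
# Only the three most recent elements remain in the window
```

A time-based window would evict by timestamp instead of by count, but the bounded-buffer structure is the same.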
Stream Architecture:
Streaming Data Architecture consists of software components built and connected together to ingest and process streaming data from various sources. A streaming data architecture processes the data right after it is collected.
The processing includes allocating the data to its designated storage and may include triggering further processing steps, such as analytics, further data manipulation or other real-time processing.
1 Message Brokers
The component group that takes data from a source, transforms it into a standard message format, and streams it on an ongoing basis to make it available for use; other tools can then listen in and consume the messages passed on by the brokers.
Popular message brokers include the open source Apache Kafka and PaaS (Platform as a Service) components such as Azure Event Hub, Azure IoT Hub, GCP Cloud Pub/Sub and Confluent Cloud.
2 Processing Tools
The output data streams from the above-described message broker or stream processor need to be transformed and structured to be analyzed further using analytics tools.
Among the open source frameworks that focus on processing streamed data, the most popular and broadly known are Apache Storm, Apache Spark Streaming and Apache Flink.
Flink is designed to analyze and process high volumes of fast streaming data from multiple sources simultaneously, so it falls a little under data analytics as well as real-time processing.
3 Data Analytics Tools
The stream processor and processing tool, it needs to be analyzed to
provide value. There are many different approaches to streaming data
analytics, but let’s focus on the mostly known ones.
Apache Cassandra is an open source NoSQL distributed database and it
provides low latency serving of streaming events to applications.
4 Streaming Data Storage
A data lake is the most flexible and cheap option for storing event data,
but it is quite challenging to properly set it up and maintain.
The other option can be storing the data in a data warehouse or in
persistent storage of selected tools, like Kafka, Databricks/Spark,
BigQuery.
Batch processing
When we talk about traditional analytics, we mean business intelligence
(BI) methods and technical infrastructure.
Stream Computing:
* Stream computing is a way to analyze and process Big Data in real time to gain current insights, take appropriate decisions or predict new trends in the immediate future
* Implemented in a distributed, clustered environment
• Handles a high rate of data arrival in the stream
Stream computing Applications
• Financial sectors,
• Business intelligence,
• Risk management,
• Marketing management,
• Search engines, and
• Social network analysis
Data Stream Algorithms Efficiency Measurements
1. Number of passes (scans) the algorithm must make over the stream
2. Available memory
3. Running time of the algorithm
Sampling Data in a Stream:
1 The first category, probabilistic sampling, is a statistical technique based on randomized selection
2 The second category, non-probabilistic sampling, uses arbitrary or purposive (biased) sample selection instead of sampling based on a randomized selection
* Reservoir Sampling Method
* Concise Sampling
* Counting Sampling
Reservoir Sampling Method
1 A random sampling method that randomly chooses a sample of a limited number of data items from a list containing a very large number of items
2 The list is larger than what can be held in main memory
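A minimal sketch of reservoir sampling (the classic Algorithm R): the first k items fill the reservoir, and the i-th item thereafter replaces a random slot with probability k/i, so every stream item ends up in the sample with equal probability:

```python
import random

def reservoir_sample(stream, k):
    """Return a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randint(1, i)     # keep the new item with probability k/i
            if j <= k:
                reservoir[j - 1] = item  # evict a random resident
    return reservoir

sample = reservoir_sample(range(1000), 10)
```

The stream is scanned only once and only k items are ever held in memory, which is exactly why the method suits lists too large for main memory.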
Concise Sampling
1 Concise sampling is like the reservoir sampling method, with the difference that a value that appears once is stored as a singleton, whereas a value that appears more than once is stored as a (value, count) pair
2 Inserts a new data item in the sample with a probability of 1/n
Counting Sampling
• A refinement of concise sampling in terms of accuracy
• The method also maintains the sample in the case of deletion of data items
• The count value is decremented upon deleting a value
• [Here deletion means that after a value is read, the stream moves on to the next item.]
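The (value, count) bookkeeping described above can be sketched as follows; this is a simplified illustration of the concise/counting idea only, omitting the probabilistic insertion step of the full algorithms:

```python
def concise_add(sample, value):
    """Store a value seen once as a singleton; promote to (value, count) on repeats."""
    if value in sample:
        sample[value] += 1      # counting: increment the pair's count
    else:
        sample[value] = 1       # first sighting: singleton entry

def counting_delete(sample, value):
    """Counting sampling also maintains the sample under deletions."""
    if value in sample:
        sample[value] -= 1      # decrement the count on deletion
        if sample[value] == 0:
            del sample[value]   # drop the entry once its count reaches zero

s = {}
for v in ["a", "b", "a", "a", "c"]:
    concise_add(s, v)
counting_delete(s, "c")
```

Storing counts instead of repeated copies is what makes the sample "concise": heavy repeaters cost one entry regardless of frequency.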
Procedures for calculating sample sizes
(i) Estimation, called the confidence interval approach, and
(ii) hypothesis testing. Statistics prescribes the Chi-squared test, t-test, z-test, F-test and p-value for testing the significance of a statistical inference
Filtering of Stream
• Identifies the sequence patterns in a stream
• Stream filtering is the process of selection or matching instances of a
desired pattern in a continuous stream of data
Example
• Assume that a data stream consists of tuples
• Filtering steps:
(i) Accept the tuples that meet a criterion in the stream,
(ii) Pass the accepted tuples to another process as a stream and
(iii) discard remaining tuples
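The three steps above can be sketched with a generator that passes matching tuples downstream and drops the rest (the tuples and the threshold criterion here are illustrative):

```python
def filter_stream(stream, criterion):
    """Yield only the tuples that meet the criterion; the rest are discarded."""
    for tup in stream:
        if criterion(tup):
            yield tup          # pass accepted tuples downstream as a stream

readings = [("s1", 20), ("s2", 75), ("s3", 90), ("s4", 40)]
# Criterion: keep readings whose value exceeds a threshold of 50
accepted = list(filter_stream(readings, lambda t: t[1] > 50))
```

Because the filter is itself a generator, the accepted tuples form a new stream that another process can consume without the whole input ever being materialized.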
Filtering of Stream: The Bloom Filter Analysis
• A simple space-efficient data structure introduced by Burton Howard Bloom in
1970.
• The filter tests the membership of an element in a dataset
Bloom Filter
• The filter is basically a bit vector of length m that represents a set S = {x1, x2, . . . , xn} of n elements
• Initially all bits are 0. Then define k independent hash functions {h1, h2, . . . , hk}, each of which maps (hashes) some element x in set S to one of the m array positions with a uniform random distribution.
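A minimal Bloom filter sketch with m bits and k hash functions; the seeded use of Python's built-in hash is purely illustrative (real implementations use dedicated hash families such as MurmurHash):

```python
class BloomFilter:
    def __init__(self, m, k):
        self.m = m                    # number of bits in the vector
        self.k = k                    # number of hash functions
        self.bits = [0] * m           # initially all bits are 0

    def _positions(self, x):
        # k seeded hashes mapping x to array positions in [0, m)
        return [hash((seed, x)) % self.m for seed in range(self.k)]

    def add(self, x):
        for p in self._positions(x):
            self.bits[p] = 1          # set all k positions for x

    def might_contain(self, x):
        # False means definitely absent; True may be a false positive
        return all(self.bits[p] for p in self._positions(x))

bf = BloomFilter(m=64, k=3)
bf.add("apple")
bf.add("banana")
```

Membership tests can return false positives (unset bits guarantee absence, set bits do not guarantee presence), which is the price paid for the constant-size bit vector.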
Counting Distinct Elements in a Stream
1 Relates to finding the number of distinct elements in a data stream
2 The stream of data contains repeated elements
3 This is a well-known problem in networking and databases
Counting Distinct Elements in a Stream and Count Distinct Problem
If n possible elements a1, a2, …, an are present, then an exact result requires n units of space, since in the worst case all n elements can be present. Let m be the number of distinct elements. The objective is to find an estimate of m using only s storage units, where s ≪ m.
The Flajolet–Martin (FM) Algorithm
• The FM method approximates m, the number of distinct (unique) elements, in a stream or a database in one pass
• For a stream of n elements with m unique elements, it runs in O(n) time and needs O(log m) memory
FM Algorithm
Thus, the space consumption scales with the logarithm of the maximum number of possible distinct elements in the stream, which makes it attractive
Features of the FM algorithm
(i) Hash-based algorithm.
(ii) Needs several repetitions to get a good estimate.
(iii) The more different elements in the data, the more different hash values are
obtained
(iv) Different hash values suggest that one of these values is likely to be unusual [the unusual property can be that the value ends in many 0s (alternatives also exist)].
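A minimal single-hash sketch of the FM idea: hash each element, track the maximum number of trailing zero bits R among the hash values, and estimate the distinct count as 2^R. Real implementations repeat this with many hash functions and average, as feature (ii) above notes; the 32-bit masking of Python's hash is a stand-in for a proper hash function:

```python
def trailing_zeros(n):
    """Number of trailing 0 bits in n (0 for odd n; 0 by convention for n == 0)."""
    if n == 0:
        return 0
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def fm_estimate(stream):
    """Estimate the distinct-element count as 2**R, R = max trailing zeros seen."""
    r = 0
    for x in stream:
        h = hash(x) & 0xFFFFFFFF      # stand-in 32-bit hash of the element
        r = max(r, trailing_zeros(h))
    return 2 ** r

estimate = fm_estimate(["a", "b", "a", "c", "b", "a"])
```

Repeated elements hash to the same value, so duplicates cannot inflate R; only genuinely distinct elements can produce new, more "unusual" hash values.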
Estimating moments
Moments of order k
* If a stream has A distinct elements, and element i has frequency mi
* The kth order moment of the stream is the sum of the kth powers of the frequencies:

  Fk = (m1)^k + (m2)^k + … + (mA)^k

* The 0th order moment is the number of distinct elements in the stream
* The 1st order moment is the length of the stream
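With the frequencies mi in hand, the moments can be computed directly; the sketch below also checks the 0th- and 1st-moment identities stated above on a small example stream:

```python
from collections import Counter

def kth_moment(stream, k):
    """Compute Fk = sum over distinct elements of (frequency ** k)."""
    freq = Counter(stream)            # frequency mi of each distinct element
    return sum(m ** k for m in freq.values())

stream = ["a", "b", "a", "c", "b", "a"]   # frequencies: a=3, b=2, c=1
f0 = kth_moment(stream, 0)   # number of distinct elements
f1 = kth_moment(stream, 1)   # length of the stream
f2 = kth_moment(stream, 2)   # 2nd moment: 3**2 + 2**2 + 1**2
```

This exact computation needs the full frequency table; the point of streaming algorithms such as AMS is to approximate Fk without storing it.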
Counting of 1's in a Window
• Infinite Stream Processing
• The volume of data is too large to be stored; there is hardly a chance to look at all of it.
• Important queries are likely to ask only about the most recent data or summaries of the data.
Decaying Windows
• Useful in applications which need identification of the most common elements
• The decaying window concept assigns more weight to recent elements
• The technique computes a smooth aggregation of all the 1's ever seen in the stream, with decaying weights
• The further back an element appears in the stream, the less weight it is given
• The effect of exponentially decaying weights is to spread out the weights of the stream elements as far back in time as the stream flows
For each new element, with a small decay constant c:
1. Multiply the current sum/score by the value (1 − c).
2. Add the weight corresponding to the new element.
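The two update steps above can be sketched directly; with decay constant c, each arriving element's contribution shrinks by a factor (1 − c) at every later step, so recent occurrences dominate the score:

```python
def decayed_score(stream, target, c=0.1):
    """Exponentially decaying count of occurrences of `target` in the stream."""
    score = 0.0
    for element in stream:
        score *= (1 - c)              # step 1: decay the current score
        if element == target:
            score += 1.0              # step 2: add weight for the new element
    return score

s = decayed_score(["x", "y", "x", "x"], "x", c=0.1)
```

Keeping one running score per tracked element replaces storing the whole window, which is the appeal of this scheme for infinite streams.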
Real-Time Analytics Platform (RTAP) Applications:
Real-time analytics
• Refers to finding meaningful patterns in data at the actual time of receiving it
• A Real-Time Analytics Platform (RTAP) analyses the data, correlates it, and predicts the outcomes in real time
• Manages and processes data and helps timely decision-making
• Helps to develop dynamic analysis applications
• Leads to the evolution of business intelligence
Widely used RTAPs
• Apache Spark Streaming—a Big Data platform for data stream analytics in real time.
• Cisco Connected Streaming Analytics (CSA)—a platform that delivers
insights from high-velocity streams of live data from multiple sources and
enables immediate action.
• Oracle Stream Analytics (OSA)—a platform that provides a graphical
interface to “Fast Data”.
• SAP HANA— a streaming analytics tool which also does real-time analytics
• SQLstream Blaze—an analytics platform offering a real-time, easy-to-use and powerful visual development environment for developers and analysts.
• TIBCO StreamBase—streaming analytics, which accelerates action in
order to quickly build applications.
• Informatica — a real-time data streaming tool which transforms a
torrent of small messages and events into unprecedented business agility
• IBM Stream Computing—a data streaming tool that analyzes a broad range of streaming data—unstructured text, video, audio, geospatial, sensor—helping organizations spot opportunities and risks and make decisions in real time
RTAP Applications
1. Fraud detection systems for online transactions
2. Log analysis for understanding usage pattern
3. Click analysis for online recommendations
4. Social Media Analytics
5. Push notifications to the customers for location-based
advertisements for retail
6. Action for emergency services such as fires and accidents in an industry
7. Healthcare monitoring, where any abnormal measurement requires an immediate reaction
Case Studies
Case Study 1: Streaming to a Cloud-Based Lambda Architecture
Qlik is working with a Fortune 500 healthcare solution provider to hospitals,
pharmacies, clinical laboratories, and doctors that is investing in cloud analytics
to identify opportunities for improving quality of care.
Qlik Replicate streams data from sources such as SQL Server and Oracle to a Kafka message queue that in turn feeds a Lambda architecture on Amazon Web Services (AWS) Simple Storage Service (S3).
Case Study 2: Streaming to the Data Lake
Decision makers at an international food industry leader, which we'll call "Suppertime," needed a current view and continuous integration of production capacity data, customer orders, and purchase orders to efficiently process and distribute orders.
Qlik Replicate injects this data stream, along with any changes to the source
metadata and data definition language (DDL) changes, to a Kafka message
queue that feeds HDFS and HBase consumers that subscribe to the relevant
message topics
Case Study 3: Streaming, Data Lake, and Cloud Architecture
Qlik Replicate CDC is remotely capturing updates and DDL changes from
source databases (Oracle, SQL Server, MySQL, and DB2) at four locations in
the United States. Qlik Replicate then sends that data through an encrypted
File Channel connection over a wide area network (WAN) to a virtual
machine–based instance of Qlik Replicate in the Azure cloud .
Case Study 4: Supporting Microservices on the AWS Cloud Architecture
Qlik Replicate captures updates from the DB2 transaction log and sends them via encrypted multipathing to the AWS cloud. There, certain transaction records are copied straight to an RDS
database, using the same schemas as the source, for analytics by a single line
of business.
Case Study 5: Real-Time Operational Data Store/Data Warehouse
The company sought to improve the efficiency of its data replication process. This required continuous copies of transactional data from the company's production Oracle database to an operational data store (ODS) based on SQL Server. Although
the target is an ODS rather than a full-fledged data warehouse, this case study
serves our purpose of illustrating the advantages of CDC for high-scale
structured analysis and reporting.
Real Time Sentiment Analysis:
Real-time Sentiment Analysis is a machine learning (ML) technique that
automatically recognizes and extracts the sentiment in a text whenever it
occurs. It is most commonly used to analyze brand and product mentions in
live social comments and posts.
Why Do We Need Real-Time Sentiment Analysis?
Real-time sentiment analysis has several applications for brand and customer
analysis.
1. Live social feeds from video platforms like Instagram or Facebook
2. Real-time sentiment analysis of text feeds from platforms such as Twitter.
This is immensely helpful in prompt addressing of negative or wrongful social
mentions as well as threat detection in cyberbullying.
3. Live monitoring of Influencer live streams.
4. Live video streams of interviews, news broadcasts, seminars, panel
discussions, speaker events, and lectures.
5. Live audio streams such as in virtual meetings on Zoom or Skype, or at
product support call centers for customer feedback analysis.
6. Live monitoring of product review platforms for brand mentions.
7. Up-to-date scanning of news websites for relevant news through keywords
and hashtags along with the sentiment in the news
How Is Real-Time Sentiment Analysis Done?
Live sentiment analysis is done through machine learning algorithms that are trained to recognize and analyze all data types from multiple data sources, across different languages, for sentiment.
Step 1 - Data collection
To extract sentiment from live feeds on social media or other online sources, we first need to connect to the live APIs of those specific platforms, such as Instagram or Facebook
Step 2 - Data processing
All the data thus gathered from the various platforms is now processed. All text data in comments is cleaned up and prepared for the next stage. All non-text data from live video or audio feeds is transcribed and also added to the text pipeline
Step 3 - Data analysis
All the data is now analyzed using native natural language processing (NLP),
semantic clustering, and aspect-based sentiment analysis.
Step 4 - Data visualization
All the intelligence derived from the real-time sentiment analysis is now showcased on a reporting dashboard in the form of statistics, graphs, and other visual elements. It is from this sentiment analysis dashboard that you can also set alerts for brand mentions and keywords in live feeds.
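A toy lexicon-based scorer illustrates the analysis step in miniature; production systems use trained ML/NLP models rather than a hand-made word list, and the lexicon below is purely illustrative:

```python
# Hypothetical miniature sentiment lexicon (real systems learn this from data)
LEXICON = {"great": 1, "love": 1, "good": 1, "bad": -1, "terrible": -1, "hate": -1}

def sentiment(text):
    """Return 'positive', 'negative', or 'neutral' from a naive lexicon score."""
    # Strip punctuation so words match the lexicon keys
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    score = sum(LEXICON.get(w, 0) for w in cleaned.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

label = sentiment("I love this product, it is great!")
```

In a real-time pipeline this function would sit downstream of the collection and transcription stages, scoring each cleaned comment as it arrives.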
The Most Important Features Of A Real-Time Sentiment Analysis Platform
* Multiplatform
* Multimedia
* Multilingual
* Web scraping
* Alerts
* Reporting
Stock Market Prediction
1. Stock market prediction and analysis are some of the most difficult jobs to
complete.
2. There are numerous causes for this, including market volatility and a variety
of other dependent and independent variables that influence the value of a
certain stock in the market
3. These variables make it extremely difficult for any stock market expert to
anticipate the rise and fall of the market with great precision.
4. With the introduction of Machine Learning and its powerful algorithms, the most recent advancements in market research and stock market prediction have begun to include such approaches in analyzing stock market data.
5. Machine Learning Algorithms are widely utilized by many organizations
in Stock market prediction.
6. This article walks through a simple implementation of analyzing and forecasting the stock prices of a popular worldwide online retail store in Python using various Machine Learning Algorithms.
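As a minimal, illustrative baseline (not the method referred to above), a simple moving-average forecast over synthetic prices shows the shape of such an analysis; real work would use richer features and trained models:

```python
def moving_average_forecast(prices, window=3):
    """Predict the next price as the mean of the last `window` prices."""
    if len(prices) < window:
        raise ValueError("not enough history for the chosen window")
    return sum(prices[-window:]) / window

# Synthetic closing prices, purely for illustration
history = [100.0, 102.0, 101.0, 105.0, 107.0]
prediction = moving_average_forecast(history, window=3)
```

Even ML models are typically benchmarked against baselines like this one, since market volatility makes naive forecasts surprisingly hard to beat.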
