
MODULE IV MINING DATA STREAMS 9

Streams: Concepts – Stream Data Model and Architecture - Sampling data in a stream -
Mining Data Streams and Mining Time-series data - Real Time Analytics Platform
(RTAP) Applications - Case Studies - Real Time Sentiment Analysis, Stock Market
Predictions.

Streams:

Stream Concepts:

A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is infeasible to control the order in which items arrive, nor is it feasible to store the stream locally in its entirety.
Streams carry enormous volumes of data, and items arrive at a high rate.
Types of Data Streams:
• Data stream –
A data stream is a (possibly unbounded) sequence of tuples. Each tuple comprises a set of attributes, similar to a row in a database table.
• Transactional data stream –
A log of interactions between entities:
1. Credit card – purchases by customers from merchants
2. Telecommunications – phone calls by callers to the dialed parties
3. Web – accesses by clients of information at servers
• Measurement data stream –
1. Sensor networks – physical phenomena such as road traffic
2. IP network – traffic at router interfaces
3. Earth climate – temperature and humidity readings at weather stations
Examples of Stream Sources-
1. Sensor Data –
In navigation systems, sensor data is used. Imagine a temperature sensor floating about in
the ocean, sending back to the base station a reading of the surface temperature each hour.
The data generated by this sensor is a stream of real numbers. We have 3.5 terabytes
arriving every day and we for sure need to think about what we can be kept continuing
and what can only be archived.

2. Image Data –
Satellites frequently send down-to-earth streams containing many terabytes of images per
day. Surveillance cameras generate images with lower resolution than satellites, but there
can be numerous of them, each producing a stream of images at a break of 1 second each.

3. Internet and Web Traffic –


A bobbing node in the center of the internet receives streams of IP packets from many
inputs and paths them to its outputs. Websites receive streams of heterogeneous types.
For example, Google receives a hundred million search queries per day.
Characteristics of Data Streams:
1. Large volumes of continuous data, possibly infinite.
2. Continuously changing, requiring fast, real-time responses.
3. The data stream model captures the data processing needs of many of today's applications.
4. Random access is expensive, so single-scan (single-pass) algorithms are required (see the sketch after this list).
5. Only a summary of the data seen so far is stored.
6. Most stream data are at a low level of abstraction or multidimensional in nature, and so need multilevel, multidimensional processing.
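
To make the single-scan and summary ideas concrete (characteristics 4 and 5), here is a minimal Python sketch of a synopsis that is updated once per arriving item in constant space; all names are illustrative:

    # A minimal sketch of a single-pass stream summary: the stream itself
    # is never stored, only a small synopsis updated once per item.

    class StreamSummary:
        """Maintains count, mean, min, and max of a numeric stream in O(1) space."""

        def __init__(self):
            self.count = 0
            self.mean = 0.0
            self.minimum = float("inf")
            self.maximum = float("-inf")

        def update(self, x: float) -> None:
            # Each item is seen exactly once (single scan) and then discarded.
            self.count += 1
            self.mean += (x - self.mean) / self.count   # incremental mean
            self.minimum = min(self.minimum, x)
            self.maximum = max(self.maximum, x)

    summary = StreamSummary()
    for reading in [21.4, 22.1, 20.9, 23.5]:   # stand-in for an unbounded sensor feed
        summary.update(reading)
    print(summary.count, summary.mean, summary.minimum, summary.maximum)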
Applications of Data Streams:
1. Fraud detection
2. Real-time trading
3. Consumer intelligence
4. Monitoring and reporting on internal IT systems
Advantages of Data Streams:
• Helpful in improving sales
• Helps in recognizing errors and anomalies
• Helps in minimizing costs
• Provides the detail needed to react swiftly to risk

Disadvantages of Data Streams:

• Lack of security of data in the cloud
• Dependence on the cloud vendor (vendor lock-in)
• Off-premises storage of data introduces the potential for disconnection

Architecture

A streaming data architecture is an information technology framework that puts the focus on
processing data in motion and treats extract-transform-load (ETL) batch processing as just
one more event in a continuous stream of events.

In modern streaming data deployments, many organizations are adopting a full stack approach
rather than relying on patching together open-source technologies. The modern data platform
is built on business-centric value chains rather than IT-centric coding processes, wherein the
complexity of traditional architecture is abstracted into a single self-service platform that turns
event streams into analytics-ready data.

The idea behind Upsolver SQLake is to automate the labor-intensive parts of working with
streaming data: message ingestion, batch and streaming ETL, storage management, and
preparing data for analytics.
Benefits of a modern streaming architecture:

• Can eliminate the need for large data engineering projects

• Performance, high availability, and fault tolerance built in

• Newer platforms are cloud-based and can be deployed very quickly with no upfront investment

• Flexibility and support for multiple use cases

The modern data streaming architecture includes the following key components:

• Source - Your sources of streaming data include sensors, social media, IoT devices, log files generated by your web and mobile applications, and mobile devices that generate semi-structured and unstructured data as continuous streams at high velocity.
• Stream ingestion - The stream ingestion layer is responsible for ingesting data into the stream storage layer. It provides the ability to collect data from tens of thousands of data sources and ingest it in near real time.
• Stream storage - The stream storage layer is responsible for providing scalable and cost-effective components to store streaming data. The streaming data can be stored in the order it was received for a set duration of time, and can be replayed indefinitely during that time.
• Stream processing - The stream processing layer is responsible for transforming data into a consumable state through data validation, cleanup, normalization, transformation, and enrichment. The streaming records are read in the order they are produced, allowing for real-time analytics, building event-driven applications, or streaming ETL.
• Destination - The destination layer is a purpose-built destination that depends on your use case. Your destination can be an event-driven application, a data lake, a data warehouse, a database, or OpenSearch.
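
To make these layers concrete, the following is a minimal, self-contained Python sketch of the source-to-destination flow; it illustrates the layering only and is not the API of any particular platform:

    # An illustrative sketch (not a real product's API) of the
    # source -> ingest -> store/process -> destination flow, modeled
    # with Python generators. All names and data are made up.

    import json, time

    def source():
        """Stand-in for sensors/logs: yields raw semi-structured events."""
        for i in range(5):
            yield json.dumps({"device": i % 2, "temp_c": 20 + i})

    def ingest(raw_events):
        """Ingestion layer: parses and timestamps events in arrival order."""
        for raw in raw_events:
            event = json.loads(raw)
            event["ingested_at"] = time.time()
            yield event

    def process(events):
        """Processing layer: validation, cleanup, and enrichment."""
        for e in events:
            if "temp_c" not in e:                     # validation
                continue
            e["temp_f"] = e["temp_c"] * 9 / 5 + 32    # enrichment
            yield e

    def destination(events):
        """Destination layer: here simply print; could be a lake or warehouse."""
        for e in events:
            print(e)

    destination(process(ingest(source())))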

Sampling data in a stream

Stream sampling is the process of collecting a representative sample of the elements of a data
stream. The sample is usually much smaller than the entire stream, but can be designed to retain
many important characteristics of the stream, and can be used to estimate many important
aggregates on the stream.
Every sampling type comes under two broad categories:

• Probability sampling - Random selection techniques are used to select the sample.

• Non-probability sampling - Non-random selection techniques based on certain criteria are used to select the sample.
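
Because a stream's length is unknown in advance and each item can be read only once, probability sampling on streams is commonly implemented with reservoir sampling. The following is a minimal sketch that keeps a uniform random sample of fixed size k:

    # A minimal sketch of reservoir sampling: after processing item i,
    # every item seen so far is in the reservoir with probability k/i.

    import random

    def reservoir_sample(stream, k):
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)        # fill the reservoir first
            else:
                j = random.randint(0, i)      # keep item with probability k/(i+1)
                if j < k:
                    reservoir[j] = item
        return reservoir

    print(reservoir_sample(range(1_000_000), k=10))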

Probability Sampling Techniques

Probability sampling gives every member of the population a chance of being selected. It is mainly used in quantitative research, when you want to produce results representative of the whole population.

1. Simple Random Sampling

In simple random sampling, the researcher selects the participants randomly, using tools such as random number generators or random number tables, so that selection is based entirely on chance.

Example: The researcher assigns every member in a company database a number from 1 to
1000 (depending on the size of the company) and then uses a random number generator to
select 100 members.
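
A minimal sketch of this example, using Python's standard random module (the numbering is illustrative):

    # Simple random sampling: assign numbers 1..1000 and draw 100
    # of them purely at random.

    import random

    members = range(1, 1001)               # every member numbered 1..1000
    sample = random.sample(members, 100)   # 100 members, chosen by chance alone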
2. Systematic Sampling

In systematic sampling, every member of the population is given a number, as in simple random sampling. However, instead of randomly generating numbers, the samples are chosen at regular intervals.

Example: The researcher assigns every member in the company database a number. Instead of
randomly generating numbers, a random starting point (say 5) is selected. From that number
onwards, the researcher selects every, say, 10th person on the list (5, 15, 25, and so on) until
the sample is obtained.
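
A minimal sketch of this example (the list and interval are illustrative):

    # Systematic sampling: random starting point, then every 10th member.

    import random

    members = list(range(1, 1001))
    interval = 10
    start = random.randint(0, interval - 1)   # random starting point
    sample = members[start::interval]         # start, start+10, start+20, ...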

3. Stratified Sampling

In stratified sampling, the population is subdivided into subgroups, called strata, based on some characteristic (age, gender, income, etc.). After forming the subgroups, you can then use random or systematic sampling to select a sample from each subgroup. This method allows you to draw more precise conclusions because it ensures that every subgroup is properly represented.

Example: If a company has 500 male employees and 100 female employees, the researcher
wants to ensure that the sample reflects the gender as well. So the population is divided into
two subgroups based on gender.
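
A minimal sketch of this example, sampling each stratum separately so both are represented (the 10% sampling fraction is an illustrative assumption):

    # Stratified sampling: draw from each gender stratum in proportion.

    import random

    males = [f"M{i}" for i in range(500)]     # 500 male employees
    females = [f"F{i}" for i in range(100)]   # 100 female employees

    # Sample 10% of each stratum, preserving the population's proportions.
    sample = random.sample(males, 50) + random.sample(females, 10)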

4. Cluster Sampling

In cluster sampling, the population is divided into subgroups, but each subgroup has characteristics similar to those of the whole population. Instead of selecting a sample from each subgroup, you randomly select entire subgroups. This method is helpful when dealing with large and dispersed populations.

Example: A company has over a hundred offices in ten cities across the world, each with roughly the same number of employees in similar job roles. The researcher randomly selects 2 to 3 offices and uses their employees as the sample.
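
A minimal sketch of this example (office and employee names are illustrative):

    # Cluster sampling: pick whole offices at random and take everyone
    # in the chosen offices.

    import random

    offices = {f"office_{i}": [f"emp_{i}_{j}" for j in range(20)]
               for i in range(100)}               # 100 similar offices

    chosen = random.sample(list(offices), 3)      # randomly select 3 offices
    sample = [emp for office in chosen for emp in offices[office]]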

Non-Probability Sampling Techniques

Non-probability sampling techniques are the other important family of sampling techniques. In non-probability sampling, not every individual has a chance of being included in the sample. This sampling method is easier and cheaper but carries a high risk of sampling bias. It is often used in exploratory and qualitative research, with the aim of developing an initial understanding of the population.
1. Convenience Sampling

In this sampling method, the researcher simply selects the individuals who are most easily accessible. This is an easy way to gather data, but there is no way to tell whether the sample is representative of the entire population. The only criterion is that people are available and willing to participate.

Example: The researcher stands outside a company and asks the employees coming in to
answer questions or complete a survey.

2. Voluntary Response Sampling

Voluntary response sampling is similar to convenience sampling, in the sense that the only criterion is that people are willing to participate. However, instead of the researcher choosing the participants, the participants volunteer themselves.

Example: The researcher sends out a survey to every employee in a company and gives them
the option to take part in it.

3. Purposive Sampling

In purposive sampling, the researcher uses their expertise and judgment to select a sample that
they think is the best fit. It is often used when the population is very small and the researcher
only wants to gain knowledge about a specific phenomenon rather than make statistical
inferences.

Example: The researcher wants to know about the experiences of disabled employees at a
company. So the sample is purposefully selected from this population.

4. Snowball Sampling

In snowball sampling, the research participants recruit other participants for the study. It is used
when participants required for the research are hard to find. It is called snowball sampling
because like a snowball, it picks up more participants along the way and gets larger and larger.

Example: The researcher wants to know about the experiences of homeless people in a city.
Since there is no detailed list of homeless people, a probability sample is not possible. The only
way to get the sample is to get in touch with one homeless person who will then put you in
touch with other homeless people in a particular area.

Mining Data Streams and Mining Time-series data

Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns. Theoreticians and practitioners continually seek improved techniques to make the process more efficient, cost-effective, and accurate.
This section discusses sequence data. Data continue to grow in volume and variety, and to organize their analysis we classify them into sequence data, graphs and networks, and other kinds of data.

A sequence is an ordered list of events. Sequence data are classified by their characteristics as:
• Time-series data (numeric data recorded with respect to time)
• Symbolic sequence data (event data recorded over intervals of time, without a regular sampling rate)
• Biological sequence data (data related to DNA and proteins)

Time-Series Data:

In this type of sequence, the data are numeric values recorded at regular intervals. They are generated by processes such as stock market activity and medical observation, and they are useful for studying natural phenomena.
Nowadays, time series are often reduced to piecewise approximations for further analysis, and similarity search finds the subsequences that match a given query.
• Time-series forecasting: Forecasting makes predictions based on past and present data about what will happen in the future. Trend analysis is one method of forecasting a time series: it extracts historic patterns from the series that can be used for short- and long-term prediction. Time series exhibit various patterns, such as trend movements, cyclic movements, and seasonal movements, all defined with respect to time or season. ARIMA, SARIMA, and long-memory time-series modeling are some of the popular methods for such analysis (a minimal forecasting sketch follows this list).
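
As a worked illustration of the forecasting methods named above, the following minimal sketch fits an ARIMA model; it assumes the statsmodels package is installed, and the series and model order are illustrative toy choices:

    # A minimal ARIMA forecasting sketch (assumes statsmodels is installed).

    from statsmodels.tsa.arima.model import ARIMA

    series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]  # toy data
    model = ARIMA(series, order=(1, 1, 1))  # AR(1), first differencing, MA(1)
    fitted = model.fit()
    print(fitted.forecast(steps=3))         # predict the next three values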

Real Time Analytics Platform (RTAP) Applications


• A real-time analytics platform enables organizations to make the most of real-time data by helping them extract valuable information and trends from it.
• Such platforms help measure data from a business point of view in real time, making the best use of the data.
• An ideal real-time analytics platform helps analyze the data, correlate it, and predict outcomes on a real-time basis.
• The real-time analytics platform helps organizations track things in real time, supporting the decision-making process.
• The platform connects the data sources for better analytics and visualization.
• Real-time analytics is the analysis of data as soon as that data becomes available: users get insights or can draw conclusions as soon as the data enters their system.
Real-time Sentiment Analysis
Real-time sentiment analysis is a machine learning (ML) technique that automatically recognizes and extracts the sentiment in a text as it occurs. It is most commonly used to analyze brand and product mentions in live social comments and posts.

The real-time sentiment analysis process uses several ML tasks, such as natural language processing, text analysis, and semantic clustering, to identify opinions expressed about brand experiences in live feeds and to extract business intelligence from them.

Why Do We Need Real-Time Sentiment Analysis?

Real-time sentiment analysis has several applications for brand and customer analysis, including the following.

1. Live social feeds from video platforms like Instagram or Facebook.
2. Real-time sentiment analysis of text feeds from platforms such as Twitter. This is immensely helpful for promptly addressing negative or wrongful social mentions, as well as for threat detection in cyberbullying.
3. Live monitoring of influencer live streams.
4. Live video streams of interviews, news broadcasts, seminars, panel discussions, speaker events, and lectures.
5. Live audio streams, such as virtual meetings on Zoom or Skype, or product support call centers, for customer feedback analysis.
6. Live monitoring of product review platforms for brand mentions.
7. Up-to-date scanning of news websites for relevant news through keywords and hashtags, along with the sentiment in the news.

A real-time sentiment analysis platform first needs to be trained on a data set based on your industry and needs. Once this is done, the platform performs live sentiment analysis of real-time feeds effortlessly.

Below are the steps involved in the process.

Step 1 - Data collection

To extract sentiment from live feeds on social media or other online sources, we first need to add the live APIs of those specific platforms, such as Instagram or Facebook. For a platform or online scenario that does not have a live API, as can be the case with Skype or Zoom, repeated, time-bound data pull requests are carried out. This gives the solution the ability to constantly track relevant data based on your set criteria.

Step 2 - Data processing

All the data gathered from the various platforms is now processed. Text data in comments is cleaned up and prepared for the next stage. Non-text data from live video or audio feeds is transcribed and added to the same text pipeline: the platform extracts semantic insights by first converting the audio (and the audio track of video data) to text through speech-to-text software.

This transcript has timestamps for each word and is indexed section by
section based on pauses or changes in the speaker. A granular analysis
of the audio content like this gives the solution enough context to
correctly identify entities, themes, and topics based on your
requirements. This time-bound mapping of the text also helps
with semantic search.
Step 3 - Data analysis

All the data is now analyzed using native natural language processing
(NLP), semantic clustering, and aspect-based sentiment analysis. The
platform derives sentiment from aspects and themes it discovers from
the live feed, giving you the sentiment score for each of them.

It can also give you an overall sentiment score as a percentage and tell you sentiment by language and data source, thus giving you a break-up of audience opinions across various demographics.
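
As an illustration of this step, the following minimal sketch scores live comments with a lexicon-based analyzer (VADER), a simpler stand-in for the aspect-based analysis described above. It assumes NLTK is installed and the vader_lexicon resource has been downloaded via nltk.download("vader_lexicon"); the feed contents are illustrative:

    # A minimal sentiment-scoring sketch using NLTK's VADER analyzer.

    from nltk.sentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    live_feed = ["Love this product!", "Worst support experience ever."]

    for comment in live_feed:
        scores = analyzer.polarity_scores(comment)   # neg/neu/pos/compound
        label = ("positive" if scores["compound"] > 0.05
                 else "negative" if scores["compound"] < -0.05
                 else "neutral")
        print(label, scores["compound"], comment)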

Step 4 - Data visualization

All the intelligence derived from the real-time sentiment analysis in step 3 is now showcased on a reporting dashboard in the form of statistics, graphs, and other visual elements. It is from this sentiment analysis dashboard that you can set alerts for brand mentions and keywords in live feeds as well.

Stock Market Prediction:

Stock market prediction (SMP) is an example of time-series forecasting: it examines previous data and estimates future data values. Financial market prediction has long been of interest to analysts in different disciplines, including economics, mathematics, material science, and computer science, and the prospect of deriving profits from the trading of stocks is an important motivation for predicting the stock market.
The process starts with the collection of data, followed by pre-processing of that data so that it can be fed to a machine learning model. Prediction models generally use two types of data: market data and textual data. Previous studies can be classified by the type of data used, by the data pre-processing approaches applied, and by the machine learning algorithms employed.
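
To make the collect, pre-process, and model flow concrete, here is a minimal sketch that predicts the next price from lagged prices with a linear model. It assumes scikit-learn is installed; the prices are toy data, and this is a sketch of the pipeline rather than a real prediction system:

    # A minimal collect -> preprocess -> model sketch using lagged
    # closing prices as features (assumes scikit-learn is installed).

    from sklearn.linear_model import LinearRegression

    prices = [101, 103, 102, 106, 108, 107, 111, 114, 113, 117]  # toy market data
    window = 3

    # Preprocessing: turn the series into (previous `window` prices -> next price).
    X = [prices[i:i + window] for i in range(len(prices) - window)]
    y = [prices[i + window] for i in range(len(prices) - window)]

    model = LinearRegression().fit(X, y)
    next_price = model.predict([prices[-window:]])[0]   # one-step-ahead forecast
    print(round(next_price, 2))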
