Module IV
Streams: Concepts – Stream Data Model and Architecture - Sampling data in a stream -
Mining Data Streams and Mining Time-series data - Real Time Analytics Platform
(RTAP) Applications - Case Studies - Real Time Sentiment Analysis, Stock Market
Predictions.
Streams:
Stream Concepts:
2. Image Data –
Satellites frequently send down to Earth streams containing many terabytes of images per
day. Surveillance cameras generate images with lower resolution than satellites, but there
can be many of them, each producing a stream of images at intervals of about one second.
Architecture
A streaming data architecture is an information technology framework that puts the focus on
processing data in motion and treats extract-transform-load (ETL) batch processing as just
one more event in a continuous stream of events.
In modern streaming data deployments, many organizations are adopting a full stack approach
rather than relying on patching together open-source technologies. The modern data platform
is built on business-centric value chains rather than IT-centric coding processes, wherein the
complexity of traditional architecture is abstracted into a single self-service platform that turns
event streams into analytics-ready data.
The idea behind Upsolver SQLake is to automate the labor-intensive parts of working with
streaming data: message ingestion, batch and streaming ETL, storage management, and
preparing data for analytics.
Benefits of a modern streaming architecture:
Newer platforms are cloud-based and can be deployed very quickly with no upfront
investment.
The modern data streaming architecture includes the following key components:
Source - Your sources of streaming data include sensors, social media, IoT devices, log files
generated by your web and mobile applications, and mobile devices that generate
semi-structured and unstructured data as continuous streams at high velocity.
Stream ingestion - The stream ingestion layer is responsible for ingesting data into the stream
storage layer. It provides the ability to collect data from tens of thousands of data sources and
ingest it in near real time.
Stream storage - The stream storage layer is responsible for providing scalable and cost-
effective components to store streaming data. The streaming data can be stored in the order it
was received for a set duration of time, and can be replayed indefinitely during that time.
Stream processing - The stream processing layer is responsible for transforming data into a
consumable state through data validation, cleanup, normalization, transformation, and
enrichment. The streaming records are read in the order they are produced, allowing for real-
time analytics, building event driven applications, or streaming ETL.
Destination - The destination layer is a purpose-built destination that depends on your
use case. Your destination can be an event-driven application, a data lake, a data warehouse,
a database, or an OpenSearch cluster.
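To make these layers concrete, the following is a minimal, hypothetical Python sketch of the flow from source to destination; the event fields and function names are illustrative assumptions, not part of any particular platform.

```python
import json
import random
import time
from datetime import datetime, timezone

def sensor_source(num_events=5):
    """Source layer: emit a stream of semi-structured sensor readings."""
    for i in range(num_events):
        yield {"sensor_id": i % 3, "temperature": round(random.uniform(15, 35), 2)}
        time.sleep(0.1)  # simulate events arriving over time

def process(event):
    """Stream processing layer: validate, enrich, and normalize each record."""
    if not isinstance(event.get("temperature"), (int, float)):
        return None  # validation: drop malformed records
    event["temperature_f"] = round(event["temperature"] * 9 / 5 + 32, 2)  # enrichment
    event["processed_at"] = datetime.now(timezone.utc).isoformat()
    return event

def destination(event):
    """Destination layer: print here; in practice a data lake, warehouse, or database."""
    print(json.dumps(event))

# Ingestion: read records in the order they are produced and pass them downstream
for raw_event in sensor_source():
    enriched = process(raw_event)
    if enriched is not None:
        destination(enriched)
```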
Sampling Data in a Stream
Stream sampling is the process of collecting a representative sample of the elements of a data
stream. The sample is usually much smaller than the entire stream, but can be designed to retain
many important characteristics of the stream, and can be used to estimate many important
aggregates on the stream.
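One widely used way to maintain such a fixed-size, uniform sample over a stream of unknown length is reservoir sampling. The sketch below is a minimal illustration of that idea; the integer stream simply stands in for a real feed.

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k elements from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k elements
        else:
            j = random.randint(0, i)     # choose a slot in [0, i]
            if j < k:
                reservoir[j] = item      # keep the new element with probability k/(i+1)
    return reservoir

# Example: keep a 10-element sample while scanning a million-element stream once
print(reservoir_sample(range(1_000_000), 10))
```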
Every sampling technique comes under one of two broad categories:
Probability sampling - Random selection techniques are used to select the sample.
Non-probability sampling - Non-random selection techniques are used to select the sample.
Probability Sampling Techniques
Probability sampling techniques are one of the important types of sampling techniques.
Probability sampling allows every member of the population a chance to get selected. It is
mainly used in quantitative research when you want to produce results representative of the
whole population.
1. Simple Random Sampling
In simple random sampling, the researcher selects the participants randomly. Tools such as
random number generators and random number tables are used, so the selection is based
entirely on chance.
Example: The researcher assigns every member in a company database a number from 1 to
1000 (depending on the size of the company) and then uses a random number generator to
select 100 members.
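A minimal sketch of this example, assuming the company database is simply a list of member numbers from 1 to 1000:

```python
import random

members = list(range(1, 1001))        # every member assigned a number from 1 to 1000
sample = random.sample(members, 100)  # 100 members chosen entirely by chance
print(sorted(sample)[:10])
```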
2. Systematic Sampling
In systematic sampling, every member of the population is given a number, as in simple
random sampling. However, instead of randomly generating numbers, the samples are chosen
at regular intervals.
Example: The researcher assigns every member in the company database a number. Instead of
randomly generating numbers, a random starting point (say 5) is selected. From that number
onwards, the researcher selects every, say, 10th person on the list (5, 15, 25, and so on) until
the sample is obtained.
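A minimal sketch of systematic selection, assuming the same numbered database, a sampling interval of 10, and a random starting point:

```python
import random

members = list(range(1, 1001))            # numbered company database
interval = 10                             # pick every 10th person
start = random.randint(0, interval - 1)   # random starting point within the first interval
sample = members[start::interval]         # e.g. 5, 15, 25, ... when the start lands on member 5
print(sample[:10])
```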
3. Stratified Sampling
In stratified sampling, the population is subdivided into subgroups, called strata, based on some
characteristics (age, gender, income, etc.). After forming a subgroup, you can then use random
or systematic sampling to select a sample for each subgroup. This method allows you to draw
more precise conclusions because it ensures that every subgroup is properly represented.
Example: If a company has 500 male employees and 100 female employees, the researcher
wants to ensure that the sample reflects this gender ratio as well. So the population is divided
into two subgroups based on gender.
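A minimal sketch of proportional stratified sampling for this example; the total sample size of 60 is an assumption for illustration:

```python
import random

male_employees = [f"M{i}" for i in range(1, 501)]     # 500 male employees
female_employees = [f"F{i}" for i in range(1, 101)]   # 100 female employees

sample_size = 60
total = len(male_employees) + len(female_employees)

# Sample each stratum in proportion to its share of the population
male_n = round(sample_size * len(male_employees) / total)   # 50
female_n = sample_size - male_n                             # 10
sample = random.sample(male_employees, male_n) + random.sample(female_employees, female_n)
print(len(sample), sample[:5])
```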
4. Cluster Sampling
In cluster sampling, the population is divided into subgroups, but each subgroup has
characteristics similar to those of the population as a whole. Instead of selecting individuals
from each subgroup, you randomly select entire subgroups. This method is helpful when
dealing with large and diverse populations.
Example: A company has over a hundred offices in ten cities across the world, each with
roughly the same number of employees in similar job roles. The researcher randomly selects
two or three offices and uses them as the sample.
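A minimal sketch of this example, assuming 100 offices of roughly equal size represented as clusters of employee IDs:

```python
import random

# Assumed structure: each office is a cluster of employees in similar job roles
offices = {f"office_{i}": [f"office_{i}_emp_{j}" for j in range(50)] for i in range(100)}

chosen = random.sample(list(offices), 3)   # randomly select entire clusters (offices)
sample = [emp for office in chosen for emp in offices[office]]
print(chosen, len(sample))
```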
Non-Probability Sampling Techniques
1. Convenience Sampling
In this sampling method, the researcher simply selects the individuals who are most easily
accessible to them. This is an easy way to gather data, but there is no way to tell whether the
sample is representative of the entire population. The only criterion involved is that people
are available and willing to participate.
Example: The researcher stands outside a company and asks the employees coming in to
answer questions or complete a survey.
2. Voluntary Response Sampling
Voluntary response sampling is similar to convenience sampling, in the sense that the only
criterion is that people are willing to participate. However, instead of the researcher choosing
the participants, the participants volunteer themselves.
Example: The researcher sends out a survey to every employee in a company and gives them
the option to take part in it.
3. Purposive Sampling
In purposive sampling, the researcher uses their expertise and judgment to select a sample that
they think is the best fit. It is often used when the population is very small and the researcher
only wants to gain knowledge about a specific phenomenon rather than make statistical
inferences.
Example: The researcher wants to know about the experiences of disabled employees at a
company. So the sample is purposefully selected from this population.
4. Snowball Sampling
In snowball sampling, the research participants recruit other participants for the study. It is used
when participants required for the research are hard to find. It is called snowball sampling
because like a snowball, it picks up more participants along the way and gets larger and larger.
Example: The researcher wants to know about the experiences of homeless people in a city.
Since there is no detailed list of homeless people, a probability sample is not possible. The only
way to get the sample is to get in touch with one homeless person who will then put you in
touch with other homeless people in a particular area.
Mining Time-Series Data
Data mining refers to extracting or mining knowledge from large amounts of data. In other
words, data mining is the science, art, and technology of exploring large and complex
bodies of data in order to discover useful patterns. Theoreticians and practitioners are
continually seeking improved techniques to make the process more efficient, cost-effective,
and accurate.
This section discusses sequence data. The volume and variety of data continue to grow and
will keep growing in the future. To generalize, such data can be classified as sequence data,
graphs and networks, and other kinds of data.
Time-Series Data:
In this type of sequence, the data are numeric values recorded at regular intervals. They are
generated by processes such as stock market activity or medical observations, and they are
useful for studying natural phenomena. Nowadays such time series are often reduced to
piecewise approximations for further analysis. In time-series similarity search, we look for a
subsequence that matches a given query sequence.
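As a rough illustration of subsequence matching, the sketch below slides the query over a numeric series and returns the offset with the smallest Euclidean distance; practical systems typically use piecewise approximations and indexing rather than this brute-force scan.

```python
import math

def best_match(series, query):
    """Return (start index, distance) of the subsequence of series closest to query."""
    m = len(query)
    best_i, best_d = -1, math.inf
    for i in range(len(series) - m + 1):
        d = math.sqrt(sum((series[i + j] - query[j]) ** 2 for j in range(m)))
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

# Example: find where the pattern [3, 4, 5] best appears in a toy series
print(best_match([1, 2, 3, 4, 5, 4, 3, 2], [3, 4, 5]))
```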
Time Series Forecasting: Forecasting is a method of making predictions based on past
and present data to estimate what will happen in the future. Trend analysis is one method of
forecasting a time series: it finds historic patterns in the series that can be used for short- and
long-term predictions. Time series can exhibit various patterns, such as cyclic movements,
trend movements, and seasonal movements, all defined with respect to time or season.
ARIMA, SARIMA, and long-memory time-series modeling are some of the popular methods
for such analysis.
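A minimal forecasting sketch using the ARIMA implementation from statsmodels on a synthetic trending series; the order (1, 1, 1) and the series itself are assumptions for illustration, not a recommendation for real data.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic series with an upward trend plus noise
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=120))

model = ARIMA(series, order=(1, 1, 1))   # AR(1) term, first differencing, MA(1) term
fitted = model.fit()
print(fitted.forecast(steps=12))         # predict the next 12 periods
```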
Real-Time Sentiment Analysis
Real-time sentiment analysis has several applications for brand and customer analysis. A
typical solution works through the following steps.
Step 1 - Data collection
To extract sentiment from live feeds from social media or other online
sources, we first need to add live APIs of those specific platforms, such
as Instagram or Facebook. In the case of a platform or online scenario that
does not have a live API, such as Skype or Zoom, repeated,
time-bound data pull requests are carried out. This gives the
solution the ability to constantly track relevant data based on your set
criteria.
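For the no-live-API case, repeated time-bound pulls can be approximated with a simple polling loop. The sketch below is purely illustrative: the endpoint URL, the `since` parameter, and the `fetch_comments` helper are assumptions, not a real platform API.

```python
import time
import requests

def fetch_comments(since_ts):
    """Hypothetical time-bound pull: ask an assumed REST endpoint for comments newer than since_ts."""
    resp = requests.get(
        "https://api.example.com/comments",  # placeholder endpoint for illustration only
        params={"since": since_ts},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

last_pull = time.time() - 60
while True:
    for comment in fetch_comments(last_pull):
        print(comment)          # hand each record to the processing stage
    last_pull = time.time()
    time.sleep(60)              # repeat the pull every minute
```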
Step 2 - Data processing
All the data gathered from the various platforms is now processed. All text data in comments
is cleaned up and prepared for the next stage. All non-text data from live video or audio feeds
is transcribed and also added to the text pipeline. In this case, the platform extracts semantic
insights by first converting the audio, and the audio in the video data, to text through
speech-to-text software.
This transcript has timestamps for each word and is indexed section by
section based on pauses or changes in the speaker. A granular analysis
of the audio content like this gives the solution enough context to
correctly identify entities, themes, and topics based on your
requirements. This time-bound mapping of the text also helps
with semantic search.
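A rough sketch of how a word-level, timestamped transcript might be split into sections on pauses and speaker changes before semantic analysis; the data layout and the 1.5-second pause threshold are assumptions for illustration.

```python
# Each word carries a start time (seconds) and a speaker label from the speech-to-text step
transcript = [
    {"word": "pricing", "start": 0.0, "speaker": "A"},
    {"word": "seems", "start": 0.4, "speaker": "A"},
    {"word": "high", "start": 0.8, "speaker": "A"},
    {"word": "agreed", "start": 3.1, "speaker": "B"},
]

PAUSE_THRESHOLD = 1.5  # assumed gap (seconds) that starts a new section

sections, current = [], []
for prev, word in zip([None] + transcript[:-1], transcript):
    starts_new_section = prev is not None and (
        word["speaker"] != prev["speaker"] or word["start"] - prev["start"] > PAUSE_THRESHOLD
    )
    if starts_new_section and current:
        sections.append(current)
        current = []
    current.append(word)
if current:
    sections.append(current)

for i, section in enumerate(sections):
    text = " ".join(w["word"] for w in section)
    print(f"section {i} (t={section[0]['start']}s, speaker {section[0]['speaker']}): {text}")
```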
Step 3 - Data analysis
All the data is now analyzed using native natural language processing
(NLP), semantic clustering, and aspect-based sentiment analysis. The
platform derives sentiment from aspects and themes it discovers from
the live feed, giving you the sentiment score for each of them.
It can also give you an overall sentiment score in percentile form and
tell you sentiment based on language and data sources, thus giving you
a break-up of audience opinions based on various demographics.
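As a stand-in for the platform's own NLP, the sketch below scores sentiment per aspect with the open-source NLTK VADER analyzer; the aspect labels and comments are made up for illustration, and VADER is not the model such platforms actually use.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time download of the VADER lexicon
analyzer = SentimentIntensityAnalyzer()

# Hypothetical live comments already tagged with the aspect they mention
comments = [
    ("pricing", "The new pricing is way too high for what you get."),
    ("support", "Support answered within minutes, really impressed."),
    ("pricing", "Fair price compared to competitors."),
]

scores = {}
for aspect, text in comments:
    scores.setdefault(aspect, []).append(analyzer.polarity_scores(text)["compound"])

for aspect, vals in scores.items():
    print(f"{aspect}: average sentiment {sum(vals) / len(vals):+.2f} over {len(vals)} comments")
```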