Data Analytics Unit 3

UNIT-3 Mining Data Streams

Introduction to stream concepts :


A data stream is a real-time, continuous, ordered (implicitly by arrival time or
explicitly by timestamp) sequence of items. It is not feasible to control the order in
which items arrive, nor is it feasible to store a stream locally in its entirety.
Data streams involve enormous volumes of data, with items arriving at a high rate.
Types of Data Streams :
Data stream – A data stream is a (possibly unbounded) sequence of tuples. Each tuple
comprises a set of attributes, similar to a row in a database table.

Transactional data streams – logs of interactions between entities, for example:


 Credit card – purchases by consumers from merchants
 Telecommunications – phone calls by callers to the dialed parties
 Web – accesses by clients of information at servers

Measurement data streams –


 Sensor Networks – physical phenomena, road traffic
 IP Network – traffic at router interfaces
 Earth climate – temperature, humidity level at weather stations

Examples of Stream Sources-


Sensor Data – Sensor data is used in navigation systems. Imagine a temperature
sensor floating in the ocean, sending back to the base station a reading of the
surface temperature each hour. The data generated by this sensor is a stream of real
numbers. With many such sensors we might have 3.5 terabytes arriving every day, and we
certainly need to think about what can be kept available for processing and what can
only be archived.

Image Data – Satellites frequently send down to Earth streams containing many
terabytes of images per day. Surveillance cameras produce images with lower
resolution than satellites, but there can be very many of them, each producing a
stream of images at intervals of, say, one second.

Internet and Web Traffic – A switching node in the middle of the Internet receives
streams of IP packets from many inputs and routes them to its outputs. Websites
receive streams of heterogeneous types. For example, Google receives a hundred
million search queries per day.

Characteristics of Data Streams :


 Large volumes of continuous data, possibly infinite.
 Continuously changing, and requires a fast, real-time response.
 The data stream model captures nicely the data processing needs of today.
 Random access is expensive, so single-scan algorithms are required.
 Store only a summary of the data seen so far.
 Most stream data are at a fairly low level of abstraction or multidimensional in
nature, and need multilevel and multidimensional treatment.
Applications of Data Streams :
 Fraud detection
 Real-time trading
 Customer and business analytics
 Monitoring and reporting on internal IT systems

Advantages of Data Streams :


 This data is helpful in improving sales
 Helps in recognizing errors and fraud
 Helps in minimizing costs
 It provides the details needed to react swiftly to risk

Disadvantages of Data Streams :


 Lack of security of data in the cloud
 Dependence on the cloud vendor
 Off-premises storage of data introduces the potential for disconnection

Stream data model and architecture:- A streaming data architecture is a framework
of software components built to ingest and process large volumes of streaming data
from multiple sources. While traditional data solutions focused on writing and reading
data in batches, a streaming data architecture consumes data immediately as it is
generated, persists it to storage, and may include various additional components per
use case – such as tools for real-time processing, data manipulation, and analytics.
Streaming architectures must account for the unique characteristics of data streams,
which tend to generate massive amounts of data (terabytes to petabytes) that is at
best semi-structured and requires significant pre-processing and ETL to become
useful.

The Components of a Streaming Architecture


1. The Message Broker / Stream Processor
This is the element that takes data from a source, called a producer, translates it into a
standard message format, and streams it on an ongoing basis. Other components can
then listen in and consume the messages passed on by the broker.

The first generation of message brokers, such as RabbitMQ and Apache ActiveMQ,
relied on the Message Oriented Middleware (MOM) paradigm. Later, hyper-
performant messaging platforms (often called stream processors) emerged that are
more suitable for a streaming paradigm.
Popular stream processing tools:

 Apache Kafka
 RabbitMQ
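
A minimal sketch of the producer/consumer pattern described above, using the kafka-python
client; the broker address localhost:9092 and the topic name "sensor-readings" are
assumptions chosen for illustration, not part of any particular deployment:

# pip install kafka-python; assumes a Kafka broker running at localhost:9092
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# The producer (the "source") pushes events to the broker as they are generated.
producer.send("sensor-readings", {"sensor_id": 7, "temp_c": 21.4})
producer.flush()

# Any number of downstream components can listen in and consume the same stream.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:          # blocks, reading new messages as they arrive
    print(message.value)          # e.g. {'sensor_id': 7, 'temp_c': 21.4}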

2. Batch processing and real-time ETL tools

In data-intensive organizations, processing streaming data is an essential component of the
big data architecture. There are many fully managed frameworks to choose from that all
set up an end-to-end streaming data pipeline in the cloud to enable real-time analytics.

Example managed tools:

 Amazon Kinesis Data Streams


 Azure Event Hub
 Google Cloud PubSub

3. Streaming Data Storage

Organizations typically store their streaming event data in cloud object stores, which serve
as an operational data lake, due to the sheer volume and multi-structured nature of event
streams. They offer a low-cost and long-term solution for storing large amounts of
event data. They're also a flexible integration point, allowing tools from outside your
streaming ecosystem to access data.


Examples:

 Amazon Redshift
 Microsoft Azure Data Lake Storage
 Google Cloud Storage

4. Data Analytics / Serverless Query Engine

With data processed and stored in a data warehouse/data lake, you will now need
data analytics tools.

Examples (not exhaustive):

 Query engines – Athena, Presto, Hive, Redshift Spectrum, Pig


 Text search engines – Elasticsearch, OpenSearch, Solr, Kusto
 Streaming data analytics – Amazon Kinesis, Google Cloud DataFlow, Azure
Stream Analytics
Stream Computing
The stream processing computational paradigm consists of assimilating data
readings from collections of software or hardware sensors in stream form (i.e., as an
infinite series of tuples), analyzing the data, and producing actionable results, possibly
in stream format as well.
In a stream processing system, applications typically act as continuous queries,
ingesting data continuously, analyzing and correlating the data, and generating a
stream of results. Applications are represented as data-flow graphs composed of
operators interconnected by streams. The individual operators implement algorithms for
data analysis, such as parsing, filtering, feature extraction, and classification. Such
algorithms are typically single-pass because of the high data rates of external feeds
(e.g., market information from stock exchanges, environmental sensor readings from
sites in a forest, etc.).
Stream processing applications are usually constructed to identify new information by
incrementally building models and assessing whether new data deviates from model
predictions and, thus, is interesting in some way. For example, in a financial
engineering application, one might construct pricing models for options on securities
while at the same time detecting mispriced quotes from a live stock market feed. In
such an application, the predictive model itself might be refined as more market data
and other data sources become available (e.g., a feed with weather predictions,
estimates of fuel prices, or headline news).
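
A toy illustration of such a data-flow graph as a chain of Python generators, where each
stage is a single-pass operator (the stages and field names below are invented purely for
illustration):

# A toy data-flow pipeline: parse -> filter -> feature extraction (running mean).
def parse(lines):
    for line in lines:
        sensor_id, value = line.split(",")
        yield {"sensor": sensor_id, "value": float(value)}

def keep_above(readings, threshold):
    for r in readings:
        if r["value"] > threshold:
            yield r

def running_mean(readings):
    total, count = 0.0, 0
    for r in readings:
        total += r["value"]
        count += 1
        yield {"sensor": r["sensor"], "mean_so_far": total / count}

raw = iter(["s1,20.5", "s1,22.1", "s2,19.0", "s1,23.4"])   # stand-in for an endless feed
for out in running_mean(keep_above(parse(raw), 20.0)):
    print(out)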

Any number of streams can enter the system. Each stream can provide elements at its own
schedule; they need not have the same data rates or data types, and the time between
elements of one stream need not be uniform. The fact that the rate of arrival of stream
elements is not under the control of the system distinguishes stream processing from the
processing of data that goes on within a database-management system. The latter system
controls the rate at which data is read from the disk, and therefore never has to worry
about data getting lost as it attempts to execute queries. Streams may be archived in a
large archival store, but we assume it is not possible to answer queries from the archival
store. It could be examined only under special circumstances using time-consuming
retrieval processes. There is also a working store, into which summaries or parts of
streams may be placed, and which can be used for answering queries. The working store
might be disk, or it might be main memory, depending on how fast we need to process
queries. But either way, it is of sufficiently limited capacity that it cannot store all
the data from all the streams.

Sampling Data In a Stream:-


IoT devices produce constant data streams of measurements and logs, which are
difficult to analyze in real time. Especially in embedded systems or edge devices, memory
and CPU power are too limited for just-in-time analysis of such streams. Even powerful
systems will (sooner or later) have problems retaining the full history of the arrived
data.

4 Sampling Techniques for Efficient Stream Processing

1. Sliding Window:- This is the simplest and most straightforward method. A first-in,
first-out (FIFO) queue with size n and a skip/sub-sampling factor k ≥1 is maintained.
In addition to that, a stride factor s ≥ 1 describes by how many time-steps the window
is shifted before analyzing it.

Advantage
 simple to implement
 deterministic — reservoir can be
filled very fast from the beginning

Drawbacks
 the time history represented by the reservoir R is short; long-term concept drifts
cannot be detected easily — outliers can create noisy analyses
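
A minimal sketch of such a sliding-window sampler; the parameter names n, k and s mirror
the description above, but the function itself is illustrative rather than taken from a library:

from collections import deque

def sliding_window_samples(stream, n, k=1, s=1):
    """Yield FIFO windows of size n, keeping only every k-th element and
    emitting a snapshot every s accepted elements."""
    window = deque(maxlen=n)        # oldest elements fall out automatically
    for i, x in enumerate(stream):
        if i % k:                   # skip/sub-sampling factor k
            continue
        window.append(x)
        accepted = i // k + 1
        if len(window) == n and accepted % s == 0:
            yield list(window)      # snapshot handed to the analysis step

for w in sliding_window_samples(range(20), n=5, k=2, s=1):
    print(w)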

2. Unbiased Reservoir Sampling:- A reservoir R is maintained such that at time t > n
the probability of accepting point x(t) into the reservoir is equal to n/t.
The algorithm [1] is as follows:
 Fill the reservoir R with the first n points of the stream.
 At time t > n replace a randomly chosen (equal probability) entry in the
reservoir R with acceptance probability n/t.
This leads to a reservoir R(t) such that each point x(1)…x(t) is contained in R(t) with
equal probability n/t.

Advantages
 The reservoir contains data points from all history of the stream with equal
probability.
 Very simple implementation; adding a point requires only O(1)

Drawbacks
 A concept drift cannot be compensated; the oldest data point x(1) is equally
important in this sampling technique as the latest data point x(t).
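
A compact sketch of the algorithm described above (the reservoir size n is the only parameter):

import random

def reservoir_sample(stream, n):
    """Unbiased reservoir sampling: every element x(1)..x(t) seen so far
    is kept in the reservoir with probability n/t."""
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= n:
            reservoir.append(x)          # fill phase
        else:
            j = random.randrange(t)      # uniform in [0, t)
            if j < n:                    # accept with probability n/t
                reservoir[j] = x         # replace a randomly chosen slot
    return reservoir

print(reservoir_sample(range(10_000), n=5))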

3. Biased Reservoir Sampling:- In biased reservoir sampling [2], the probability of a
data point remaining in the reservoir is a decreasing function of how long it has been
in R. So the probability of finding points from the recent history in R is high, while
very old data points will be in R with very low probability.
The probability that point x(r) is contained in R(t) decreases with the elapsed time t − r;
with the exponential bias function commonly used for this scheme it is proportional to
e^(−λ(t − r)), where λ is the adjustable forgetting factor mentioned below.

Advantages:

 O(1) algorithm for adding a new data point.

 Slowly moving concept drifts can be compensated.

 An adjustable forgetting factor can be tuned for the application of interest.

Drawbacks:

It is a randomized technique, so the algorithm is non-deterministic. However, the
variance can be estimated by running an ensemble of independent reservoirs R1…RB [3].
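
A rough sketch of the biased variant under the exponential-bias assumption above, following
the general recipe of [2]; the exact bookkeeping here is illustrative, with λ as the
forgetting factor and a maximal reservoir size of about 1/λ:

import random

def biased_reservoir_sample(stream, lam):
    """Biased reservoir sampling sketch: recent points are much more likely
    to survive in the reservoir than old ones."""
    n = max(1, int(round(1.0 / lam)))    # maximal reservoir requirement ~ 1/lambda
    reservoir = []
    for x in stream:
        fraction_full = len(reservoir) / n
        if random.random() < fraction_full:
            # replace a random victim, so old points decay away over time
            reservoir[random.randrange(len(reservoir))] = x
        else:
            # deterministic insertion while the reservoir is not yet "full enough"
            reservoir.append(x)
    return reservoir

sample = biased_reservoir_sample(range(100_000), lam=0.01)
print(len(sample), sample[:10])          # mostly recent elements survive
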
4. Histograms

A histogram is maintained while observing the data stream. To this end, data points are
sorted into intervals/buckets [li, ui].

If the useful range of the observed values is known in advance, a simple vector with
counts and breakpoints could do the job.

V-optimal histograms try to minimize the variance within each histogram bucket. [4]
proposes an algorithm for efficiently maintaining an approximate V-optimal histogram
over a data stream. This is of relevance for interval data, such as a time-series of
temperature values, i.e., where absolute values and distances between values have a
meaning.
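
A minimal sketch of the simple fixed-breakpoint case mentioned above (the value range and
bucket count are assumptions chosen for illustration):

def make_histogram(lo, hi, n_buckets):
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    def add(x):
        # clamp values outside the assumed range into the edge buckets
        i = min(max(int((x - lo) / width), 0), n_buckets - 1)
        counts[i] += 1
    return add, counts

add, counts = make_histogram(lo=0.0, hi=40.0, n_buckets=8)
for reading in [12.3, 15.8, 21.4, 21.9, 35.0, -2.0]:
    add(reading)
print(counts)                            # per-bucket counts over the stream so far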

Counting Distinct Elements

Counting distinct elements is a problem that frequently arises in distributed systems. In
general, the size of the set under consideration (which we will henceforth call
the universe) is enormous. For example, if we build a system to identify denial-of-service
attacks, the set could consist of all IPv4 and IPv6 addresses. Another common
use case is to count the number of unique visitors on popular websites like Twitter or
Facebook.

An obvious approach if the number of elements is not very large would be to maintain
a Set. We can check if the set contains the element when a new element arrives. If not,
we add the element to the set. The size of the set would give the number of distinct
elements. However, if the number of elements is vast or we are maintaining counts for
multiple streams, it would be infeasible to maintain the set in memory. The algorithms
for counting distinct elements are therefore approximate, with an error threshold that can
be tuned by changing the algorithm's parameters.
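
The exact set-based approach described above, as a tiny baseline (its memory grows with the
number of distinct elements, which is exactly what the approximate algorithms below avoid):

seen = set()
for element in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:   # stand-in for a stream
    seen.add(element)          # duplicates are ignored automatically
print(len(seen))               # exact number of distinct elements: 2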

Flajolet — Martin Algorithm

The first algorithm for counting distinct elements is the Flajolet-Martin algorithm,
named after the algorithm's creators. The Flajolet-Martin algorithm is a single-pass
algorithm. If there are m distinct elements in a universe comprising n elements, the
algorithm runs in O(n) time and O(log(m)) space. The following steps
define the algorithm.

 First, we pick a hash function h that takes stream elements as input and outputs
a bit string. The length of the bit strings is large enough such that the result of the
hash function is much larger than the size of the universe. We require at least log
n bits if there are n elements in the universe.
 r(a) is used to denote the number of trailing zeros in the binary representation
of h(a) for an element a in the stream.

 R denotes the maximum value of r seen in the stream so far.

 The estimate of the number of distinct elements in the stream is (2^R).

Flajolet-Martin Pseudocode and Explanation

1. L = 64 (size of the bitset), B= bitset of size L

2. hash_func = (ax + b) mod 2^L

3. for each item x in stream

 y = hash(x)

 r = get_rightmost_set_bit(y)

 set_bit(B, r)

4. R = get_rightmost_unset_bit(B)

5. return 2 ^ R
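
A runnable sketch of the pseudocode above; hashing the item with Python's built-in hash()
before applying (ax + b) mod 2^L is an implementation convenience, and production versions
average many independent hash functions to reduce the variance of the estimate:

import random

L = 64                                   # size of the bitset
a, b = random.randrange(1, 2 ** L), random.randrange(2 ** L)

def trailing_zeros(y):
    # position of the rightmost set bit = number of trailing zeros
    return (y & -y).bit_length() - 1 if y else L

def flajolet_martin(stream):
    bitset = 0
    for x in stream:
        y = (a * hash(x) + b) % (2 ** L)
        bitset |= 1 << trailing_zeros(y) # set bit r for this element
    R = 0
    while bitset & (1 << R):             # rightmost unset bit of the bitset
        R += 1
    return 2 ** R                        # estimate of the distinct count

# ~1000 distinct values; prints a power of two near 1000 (the estimate is coarse)
print(flajolet_martin(str(i % 1000) for i in range(100_000)))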

Estimating Moments

Estimating moments is a generalization of the problem of counting distinct elements
in a stream. The problem, called computing "moments," involves the distribution of
frequencies of different elements in the stream.
Suppose a stream consists of elements chosen from a universal set. Assume the
universal set is ordered so we can speak of the ith element for any i.
Let mi be the number of occurrences of the ith element for any i. Then the kth-order
moment of the stream is the sum over all i of (mi)^k.
For example :
1. The 0th moment is the sum of 1 for each mi that is greater than 0, i.e., the 0th moment
is a count of the number of distinct elements in the stream.
2. The 1st moment is the sum of the mi’s, which must be the length of the stream.
Thus, first moments are especially easy to compute: just count the length of
the stream seen so far.
3. The second moment is the sum of the squares of the mi’s. It is sometimes called
the surprise number, since it measures how uneven the distribution of elements in
the stream is.
4. To see the distinction, suppose we have a stream of length 100, in which eleven
different elements appear. The most even distribution of these eleven elements
would have one appearing 10 times and the other ten appearing 9 times each.
5. In this case, the surprise number is 10^2 + 10 × 9^2 = 910. At the other extreme,
one of the eleven elements could appear 90 times and the other ten appear 1 time
each. Then, the surprise number would be 90^2 + 10 × 1^2 = 8110.
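
A short sketch verifying the surprise-number arithmetic above for the two example
distributions (the element names are arbitrary):

from collections import Counter

def kth_moment(stream, k):
    counts = Counter(stream)                     # m_i for every distinct element i
    return sum(m ** k for m in counts.values())

even   = ["x"] * 10 + [c for c in "abcdefghij" for _ in range(9)]   # 10 + 10*9 = 100 items
skewed = ["x"] * 90 + list("abcdefghij")                            # 90 + 10*1 = 100 items

print(kth_moment(even, 2), kth_moment(skewed, 2))   # 910 8110
print(kth_moment(even, 0), kth_moment(even, 1))     # 11 distinct elements, length 100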

Algorithm For Counting Ones In a Window:- A data stream is an ordered sequence of
instances that in many applications of data stream mining can be read only once or a
small number of times using limited computing and storage capabilities.

TYPES OF QUERIES

Ad-hoc query - You ask a query and get an immediate response. E.g.: What is the
maximum value seen so far in the stream S?

Standing query - You register a query with the system, saying “Any time you have an
answer to this query, send me the response”; here you don’t get the answer
immediately.

Now let us suppose we have a window of length N (say N = 24) on a binary stream. We
want at all times to be able to answer a query of the form “How many 1’s are there in
the last k bits?” for any k <= N.
Here the DGIM algorithm comes into the picture:

COUNTING THE NUMBER OF 1’s IN THE DATA STREAM: the DGIM algorithm
(Datar-Gionis-Indyk-Motwani Algorithm)

Designed to estimate the number of 1’s in a data stream window. This algorithm uses
O(log² N) bits to represent a window of N bits, and allows us to estimate the number of
1’s in the window with an error of no more than 50%.

In the DGIM algorithm, each bit that arrives has a timestamp for the position at which it
arrives: if the first bit has timestamp 1, the second bit has timestamp 2, and so on.
Positions are interpreted relative to the window size N (a timestamp can be stored modulo
N, i.e., in about log2 N bits). The window is divided into buckets consisting of 1’s and
0’s.

RULES FOR FORMING THE BUCKETS:

1. The right end of a bucket must always be a position with a 1 (a 0 at the right end is
not included in any bucket). E.g. 1001011 → a bucket of size 4, having four 1’s and
a 1 at its right end.

2. Every bucket should have at least one 1, else no bucket can be formed.

3. The size of every bucket (its number of 1’s) must be a power of 2.


4. Bucket sizes cannot decrease as we move to the left (they are non-decreasing in the
direction of earlier timestamps).

Let us take an example to understand the algorithm. Estimating the number of 1’s and
counting the buckets in the given data stream.

This picture shows how we can form the buckets based on the number of ones by
following the rules.

In the given data stream let us assume the new bit arrives from the right. When the new
bit = 0

After the new bit ( 0 ) arrives with a time stamp 101, there is no change in the buckets.

But if the new bit that arrives is 1, then we need to make changes:
Create a new bucket with the current timestamp and size 1.

If there was only one bucket of size 1, then nothing more needs to be done. However,
if there are now three buckets of size 1 (the buckets with timestamps 100, 102, 103 in
the second step), we fix the problem by combining the leftmost (earliest) two
buckets of size 1.

To combine any two adjacent buckets of the same size, replace them by one bucket of
twice the size. The timestamp of the new bucket is the timestamp of the rightmost of
the two buckets.

Now, sometimes combining two buckets of size 1 may create a third bucket of size 2.
If so, we combine the leftmost two buckets of size 2 into a bucket of size 4. This
process may ripple through the bucket sizes.

How long can you continue doing this…

You can continue as long as (current timestamp − leftmost bucket timestamp) < N
(= 24 here). E.g. 103 − 87 = 16 < 24, so we continue; once the difference becomes greater
than or equal to N, the leftmost (oldest) bucket is dropped.
Finally, the answer to the query “How many 1’s are there in the last 20 bits?”: counting
the sizes of the buckets in the last 20 bits, we say there are 11 ones.
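
A compact sketch of the DGIM bookkeeping described above. The class and method names are
illustrative; timestamps are kept as absolute positions (rather than modulo N) to keep the
sketch short, and the query counts only half of the oldest overlapping bucket:

class DGIM:
    def __init__(self, N):
        self.N = N
        self.t = 0
        self.buckets = []            # list of (timestamp, size), newest first

    def add(self, bit):
        self.t += 1
        # drop buckets whose most recent 1 has fallen out of the window
        self.buckets = [(ts, sz) for ts, sz in self.buckets if self.t - ts < self.N]
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            i = 0
            # merge while more than two buckets share the same size
            while i + 2 < len(self.buckets):
                if self.buckets[i][1] == self.buckets[i + 2][1]:
                    ts_new = self.buckets[i + 1][0]   # rightmost (later) of the two oldest
                    sz_new = self.buckets[i + 1][1] * 2
                    self.buckets[i + 1:i + 3] = [(ts_new, sz_new)]
                else:
                    i += 1

    def count_ones(self, k):
        """Estimate the number of 1's among the last k bits (k <= N)."""
        total, oldest = 0, 0
        for ts, sz in self.buckets:
            if self.t - ts < k:
                total += sz
                oldest = sz              # ends up holding the oldest included bucket
        return total - oldest // 2       # count only half of the oldest bucket

dgim = DGIM(N=24)
for bit in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]:
    dgim.add(bit)
print(dgim.count_ones(8))                # approximate count of 1's in the last 8 bits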

Sentiment analysis:- Sentiment analysis is a type of natural language processing for
tracking the mood of the public about a particular product. Sentiment analysis is also
known as opinion mining.
It collects and examines opinions about a particular product made in blog posts,
comments, or tweets.
Sentiment analysis can track a particular topic; many companies use it to track or
observe their products and reputation.
A basic task in sentiment analysis is classifying the polarity of a given text at the
document, sentence, or feature level, i.e., whether the expressed opinion in a document,
a sentence, or an entity feature is positive, negative, or neutral.
Sentiment classification also looks at emotional states such as “angry,” “sad,” and
“happy”.
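
A tiny polarity-classification sketch using NLTK's VADER analyzer, one of many possible
tools (requires the nltk package and a one-time download of the vader_lexicon resource;
the example sentences are made up):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")                       # one-time lexicon download
sia = SentimentIntensityAnalyzer()

for text in ["I love this phone, the camera is amazing!",
             "Worst customer service ever.",
             "The package arrived on Tuesday."]:
    score = sia.polarity_scores(text)["compound"]    # compound score in [-1, 1]
    label = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
    print(label, round(score, 2), text)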

Following are the major applications of sentiment analysis in real-world scenarios:

1. Reputation monitoring : Twitter and Facebook are a central point of many sentiment
analysis applications. The most common application is to monitor the reputation of a
particular brand on Twitter and/or Facebook.

2. Result prediction : By analyzing sentiments from related sources, one can predict
the probable outcome of a particular event.

3. Decision making : Sentiment analysis can be used as an important input supporting
decision-making systems. For instance, in financial market investment, there are
numerous news items, articles, blogs, and tweets about every public company.

4. Product and service review : The most common application of sentiment analysis
is in the area of reviews of customer products and services.

Real-Time Analytics Platform (RTAP) :-


 A real-time analytics platform enables organizations to make the most out of real-
time data by helping them to extract the valuable information and trends from it.
 Such platforms help in measuring data from the business point of view in real
time, further making the best use of data.
 An ideal real-time analytics platform would help in analyzing the data,
correlating it and predicting the outcomes on a real-time basis.
 The real-time analytics platform helps organizations in tracking things in real
time, thus helping them in the decision-making process.
 The platforms connect the data sources for better analytics and visualization.
 Real-time analytics is the analysis of data as soon as that data becomes available.
In other words, users get insights or can draw conclusions as soon as the data
enters their system.
Examples of real-time analytics include :-
 Real time credit scoring, helping financial institutions to decide immediately
whether to extend credit.
 Customer relationship management (CRM), maximizing satisfaction and business
results during each interaction with the customer.
 Fraud detection at points of sale.
 Targeting individual customers in retail outlets with promotions and incentives,
while the customers are in the store and next to the merchandise.

Real-Time Analytics Platform (RTAP) Application


Healthcare

The level of data generated within healthcare systems is not trivial. Traditionally, the
healthcare industry lagged in using Big Data because of a limited ability to standardize
and consolidate data. Researchers are mining the data to see which treatments are more
effective for particular conditions, identify patterns related to drug side effects, and
gain other important information that can help patients and reduce costs.

With the added adoption of mHealth, eHealth and wearable technologies, the volume
of data is increasing at an exponential rate. This includes electronic health record data,
imaging data, patient-generated data, sensor data, and other forms of data. By mapping
healthcare data with geographical data sets, it is possible to predict diseases that will
escalate in specific areas. Based on these predictions, it is easier to strategize
diagnostics and plan for stocking serums and vaccines.

Manufacturing

Predictive manufacturing provides near-zero downtime and transparency. It requires
an enormous amount of data and advanced prediction tools for a systematic process of
turning data into useful information.

Major benefits of using Big Data applications in the manufacturing industry are:

 Product quality and defects tracking


 Supply planning
 Manufacturing process defect tracking
 Output forecasting
 Increasing energy efficiency
 Testing and simulation of new manufacturing processes
 Support for mass-customization of manufacturing

Media & Entertainment

Various companies in the media and entertainment industry are facing new business
models for the way they create, market, and distribute their content. This is
happening because of current consumers' demand for accessing content anywhere,
any time, on any device.

Now, publishing environments are tailoring advertisements and content to appeal to
consumers. These insights are gathered through various data-mining activities. Big
Data applications benefit the media and entertainment industry by:

 Predicting what the audience wants


 Scheduling optimization
 Increasing acquisition and retention
 Ad targeting
 Content monetization and new product development

Internet of Things (IoT)

Data extracted from IoT devices provides a mapping of device inter-connectivity.


Such mappings have been used by various companies and governments to increase
efficiency. IoT is also increasingly adopted as a means of gathering sensory data, and
this sensory data is used in medical and manufacturing contexts.

Government

The use and adoption of Big Data within governmental processes allows efficiencies
in terms of cost, productivity, and innovation. In government use cases, the same data
sets are often applied across multiple applications, which requires multiple departments
to work in collaboration.
