BDA Mod 3
FALL 2024-25
Dr.J.Jagannathan
Assistant Professor Sr.Grade 1
School of Computer Science Engineering and Information Systems
Vellore Institute of Technology - Vellore, 07-10-2024
MODULE:III
4VS OF BIG DATA
INFINITE DATA
[Figure: topic map. Recoverable labels: dimensionality reduction, duplicate document detection, spam detection, web advertising, Perceptron, kNN]
STREAM CONCEPTS
Recently, there’s been a rise in applications that handle large amounts of data.
Instead of storing this data permanently, these applications work with data
that comes in continuously and changes rapidly.
This type of data is called “transient data streams.”
Examples include:
• Financial transactions
• Network monitoring
• Sensor data from devices
• Web applications
• Manufacturing
DATA STREAM MODEL
CHARACTERISTICS
1. The data model and query processor must allow both order-based and time-based
operations.
2. Because a complete stream cannot be stored, approximate summary structures
must be used.
3. Streaming query plans must not use any operators that require the entire input before
any results are produced. Such operators will block the query processor indefinitely.
4. Any query that requires backtracking over a data stream is infeasible, because of the
storage and performance constraints imposed by a data stream.
5. Applications that monitor streams in real time must react quickly to unusual data values.
6. Scalability requirements dictate that parallel and shared execution of many continuous
queries must be possible.
ARCHITECTURE
Long running queries are registered in the query repository and placed into groups
for shared processing.
The query processor communicates with the input monitor and may reoptimize the
query plans in response to changing input rates. Results are streamed to the user or
temporarily buffered.
A Data-Stream-Management System: In analogy to a database-management system,
we can view a stream processor as a kind of data-management system with its own
high-level organization.
ARCHITECTURE
• Streams may be archived in a large archival store, but we assume it is not possible to
answer queries from the archival store. It could be examined only under special
circumstances using time-consuming retrieval processes.
• There is also a working store, into which summaries or parts of streams may be
placed, and which can be used for answering queries.
• The working store might be disk, or it might be main memory, depending on how fast
we need to process queries.
EXAMPLES OF DATA STREAM APPLICATIONS
Sensor Networks – Alerts and alarms generated in response to information received from sensors.
Example – Join several streams, such as temperature and ocean-current streams at weather
stations, to give alerts or warnings about events like cyclones or tsunamis.
Financial Applications – Online analysis of stock prices and making hold or sell decisions requires
quickly identifying correlations and fast-changing trends.
Transaction Log Analysis – Online mining of web usage logs, telephone call records and ATM
transactions are examples of data streams. The goal is to find customer behavior patterns.
Example – Identify the current buying pattern of users on a website and plan advertising
campaigns and recommendations.
Image Data – Satellites often send down to earth streams consisting of many terabytes of images per
day. Surveillance cameras produce images with lower resolution than satellites, but there can be many
of them, each producing a stream of images at intervals like one second.
KINDS OF STREAM PROCESSING TECHNIQUES
STREAM COMPUTING
Stream Queries
There are two ways that queries get asked about streams.
One-time queries (a class that includes traditional DBMS queries) are evaluated once over a
point-in-time snapshot of the data, and the answer is returned to the user.
Example – Alert when a stock crosses over a price point.
Continuous queries, on the other hand, are evaluated continuously as data streams continue to
arrive.
Example – Aggregation queries such as maximum, average or count: the maximum price of a stock
every hour, or the number of times a stock gains over a particular point.
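The continuous-query example above (maximum price of a stock every hour) can be sketched as a small generator. This is only an illustration: the name `hourly_max` and the `(timestamp_seconds, price)` tuple layout are assumptions, not something the slides define, and the input is assumed to arrive in time order.

```python
def hourly_max(ticks):
    """Continuous query sketch: emit (hour_index, max_price) per hour.

    `ticks` is an iterable of (timestamp_seconds, price) pairs in time
    order. Results are produced incrementally, without storing the stream.
    """
    current_hour = None
    current_max = None
    for timestamp, price in ticks:
        hour = timestamp // 3600
        if current_hour is not None and hour != current_hour:
            # The previous hour is complete: emit its result.
            yield current_hour, current_max
            current_max = None
        current_hour = hour
        current_max = price if current_max is None else max(current_max, price)
    if current_hour is not None:
        # Only reached when a finite test stream ends.
        yield current_hour, current_max


ticks = [(0, 10.0), (1800, 12.5), (3600, 11.0), (7100, 9.0), (7200, 15.0)]
print(list(hourly_max(ticks)))  # [(0, 12.5), (1, 11.0), (2, 15.0)]
```

Only one running maximum is kept per window, which is consistent with the requirement that streaming operators must not wait for the entire input.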
ISSUES IN STREAM PROCESSING
SAMPLING DATA IN A STREAM
SAMPLING TECHNIQUES
4. Concise Sampling
The goal is to maintain a small reservoir of bounded size while still achieving
representative sampling of the data stream.
The number of samples that can be stored in memory at a given time is limited, which
can be a challenge when dealing with large data streams.
The size of the sample may need to be adjusted based on the amount of memory
available to store the data.
Instead of selecting samples purely at random, the sampling algorithm may prioritize
samples with unique or representative values of a particular attribute in the data
stream.
Example: A bank wants to analyze customer spending habits from a stream of transactions.
They apply concise sampling with customer ID as the sampling attribute, so the sample
holds distinct customer IDs.
The size of the reservoir is limited to 1000 customers.
They adjust the sample size based on available memory.
This allows efficient analysis while keeping the sample representative.
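The slides do not spell out the procedure, so the following is a sketch of one common formulation of concise sampling, stated here as an assumption: values are stored as (value, count) pairs, each arriving element is admitted with probability 1/tau, and when the memory budget is exceeded tau is raised and existing sample points are thinned accordingly. The names `concise_sample`, `footprint`, `max_footprint` and `growth` are hypothetical.

```python
import random
from collections import Counter


def footprint(sample):
    """Memory units used: a singleton costs 1, a (value, count) pair costs 2."""
    return sum(1 if count == 1 else 2 for count in sample.values())


def concise_sample(stream, max_footprint, seed=0, growth=1.1):
    """Sketch of concise sampling over one attribute (e.g. customer ID)."""
    rng = random.Random(seed)
    tau = 1.0            # current admission threshold (admit with prob. 1/tau)
    sample = Counter()   # value -> number of sampled occurrences
    size = 0
    for value in stream:
        if rng.random() < 1.0 / tau:
            sample[value] += 1
            if sample[value] <= 2:
                size += 1  # new singleton, or singleton became a pair
        while size > max_footprint:
            # Raise the threshold and keep each sample point with
            # probability tau / new_tau.
            new_tau = tau * growth
            keep_probability = tau / new_tau
            for v in list(sample):
                kept = sum(1 for _ in range(sample[v])
                           if rng.random() < keep_probability)
                if kept:
                    sample[v] = kept
                else:
                    del sample[v]
            tau = new_tau
            size = footprint(sample)
    return sample, tau


rng = random.Random(1)
transactions = [f"C{rng.randrange(3000)}" for _ in range(10000)]
sample, tau = concise_sample(transactions, max_footprint=1000)
print(len(sample), footprint(sample), round(tau, 3))
```

Because repeated values share one (value, count) entry, the same memory budget can represent more sample points than a plain list of sampled items.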
FILTERING STREAMS
Bloom Filtering
A space-efficient data structure used to check whether an element belongs to a set.
A positive answer means the element is probably in the set (false positives are
possible).
A negative answer means the element is definitely not in the set (no false
negatives).
Hence, the recall rate is 100%.
BLOOM FILTERING
Example
Insert elements 10 and 7 in the bloom filter of size 5. Consider
these two hash functions:
h1(x) : x mod 5
h2(x) : (2x + 6) mod 5
Comment on the presence of elements 14 and 15.
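The exercise above can be checked with a minimal sketch. The class name `BloomFilter` and the method `might_contain` are hypothetical; the two hash functions and the inserted and queried values are the ones given in the example.

```python
class BloomFilter:
    """Minimal Bloom filter over a fixed-size bit array."""

    def __init__(self, size, hash_functions):
        self.bits = [0] * size
        self.hash_functions = hash_functions

    def add(self, x):
        # Set the bit selected by every hash function.
        for h in self.hash_functions:
            self.bits[h(x)] = 1

    def might_contain(self, x):
        # True only if every selected bit is set; may be a false positive.
        return all(self.bits[h(x)] == 1 for h in self.hash_functions)


def h1(x):
    return x % 5


def h2(x):
    return (2 * x + 6) % 5


bloom = BloomFilter(5, [h1, h2])
for element in (10, 7):
    bloom.add(element)

print(bloom.bits)               # [1, 1, 1, 0, 0]
print(bloom.might_contain(14))  # False: bit 4 is 0, so 14 was never inserted
print(bloom.might_contain(15))  # True: bits 0 and 1 are set (a false positive)
```

Element 14 hashes to bit 4 under both functions, and that bit is still 0, so it is definitely absent. Element 15 hashes to bits 0 and 1, which were set by 10 and 7, so the filter reports it as present even though it was never inserted.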
BLOOM FILTERING
• The underlying concept is to use main memory as a bit array.
• With 1 GB of main memory, we have room for 8 billion
bits.
• Devise a hash function ‘h’, hash each member of the key set ‘S’ to a bit and
set that bit to ‘1’. All the other bits of the array remain ‘0’.
• Since there are 1 billion members of ‘S’, approximately
1/8th of the bits will be ‘1’.
• The exact fraction of bits set to ‘1’ will be slightly less than 1/8th
(because it is possible that two members of ‘S’ hash to the
same bit).
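The "slightly less than 1/8th" claim can be checked numerically. The sketch below assumes the hash function spreads keys uniformly and independently over the bits, so a given bit stays 0 with probability (1 - 1/n)^m; the variable names are illustrative.

```python
import math

n_bits = 8_000_000_000   # 1 GB of main memory = 8 billion bits
m_keys = 1_000_000_000   # 1 billion members of S

# Probability that a given bit is still 0 after hashing all m keys:
# (1 - 1/n)^m, computed via log1p for numerical stability.
p_zero = math.exp(m_keys * math.log1p(-1.0 / n_bits))
fraction_set = 1.0 - p_zero

print(round(fraction_set, 4))  # 0.1175
print(fraction_set < 1 / 8)    # True
```

Under the same assumption, this fraction is also the chance that a key outside S hashes to a set bit, that is, the false-positive rate with a single hash function.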
COUNTING DISTINCT PROBLEM
• The data stream consists of elements chosen from a universal
set of size N
– Goal: maintain a count of the number of distinct elements seen so
far
• Obvious approach: maintain the set of elements seen so far
– That is, keep a hash table of all the distinct elements seen so
far
– Hashing and a variety of related algorithms are used for this
APPLICATIONS
• A Web site gathering statistics on how many unique
users it has seen in each given month.
– The universal set is the set of logins for that site, and a stream
element is generated each time someone logs in.
– This measure is appropriate for a site like Amazon,
where the typical user logs in with their unique login name.
• A Web site like Google, which does not require login to issue a
search query, may be able to identify users only by the IP address from
which they send the query.
SOLUTION
• The obvious way to solve the problem is to keep in main memory a list of all the
elements seen so far in the stream.
• Adopt an efficient search structure such as a hash table or search tree, so one
can quickly add new elements and check whether or not the element that just
arrived on the stream was already seen.
• As long as the number of distinct elements is not too great, this structure can fit
in main memory and there is little problem obtaining an exact answer to the
question how many distinct elements appear in the stream.
• When the distinct elements are too many to fit in main memory, estimate the count
instead. Approach: Flajolet-Martin algorithm
FLAJOLET-MARTIN-ALGORITHM
COUNTING DISTINCT ELEMENTS IN A STREAM – NAÏVE SOLUTION.
SET COUNTER = 0
SET UNIQUE_SET = { }
WHILE COUNTER <= LAST ELEMENT INDEX:
    IF CURRENT ELEMENT NOT PRESENT IN UNIQUE_SET:
        ADD CURRENT ELEMENT TO UNIQUE_SET
    INCREMENT THE COUNTER
DISPLAY COUNT OF DISTINCT ELEMENTS: LENGTH(UNIQUE_SET)
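The naive pseudocode above corresponds to the following sketch. The function name `count_distinct_exact` is hypothetical. Memory grows with the number of distinct elements, which is what motivates the approximate approach.

```python
def count_distinct_exact(stream):
    """Exact distinct count: remember every distinct element seen so far."""
    unique_set = set()
    for element in stream:
        # Adding an element that is already present leaves the set unchanged.
        unique_set.add(element)
    return len(unique_set)


print(count_distinct_exact([1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]))  # 4
```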
PSEUDOCODE/ALGORITHM
1. Select a hash function h(x) that maps each element in the set to a value
of at least $\log_2 n$ bits
2. Convert each h(x) output to binary_value
3. For each binary_value, find r(binary_value): the number of trailing zeroes in
binary_value
4. Find R = max(r(binary_value))
5. Finally, the approximate count of distinct elements is $2^R$
FLAJOLET-MARTIN-ALGORITHM
SET COUNTER = 0
SET MAX_R = 0
WHILE COUNTER <= LAST ELEMENT INDEX:
    VAL = BINARY OF HASH OUTPUT OF CURRENT ELEMENT
    COUNT = NO. OF TRAILING ZEROES IN VAL
    IF COUNT > MAX_R:
        MAX_R = COUNT
    INCREMENT THE COUNTER
DISPLAY APPX. COUNT OF DISTINCT ELEMENTS: 2 ** MAX_R
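A minimal runnable sketch of the pseudocode above follows. The helper `hash32`, built on `hashlib.sha256`, is a hypothetical stand-in for "a hash function h(x)", and treating a hash value of 0 as having zero trailing zeroes is an assumed convention. A single hash function gives a coarse estimate that is always a power of two.

```python
import hashlib


def trailing_zeros(value):
    """Number of trailing zero bits; 0 is treated as 0 (assumed convention)."""
    if value == 0:
        return 0
    return (value & -value).bit_length() - 1


def hash32(element):
    """Hypothetical 32-bit hash derived from SHA-256 of the element's text."""
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def fm_estimate(stream, hash_fn=hash32):
    """Flajolet-Martin estimate with one hash function: 2 ** max_r."""
    max_r = 0
    for element in stream:
        max_r = max(max_r, trailing_zeros(hash_fn(element)))
    return 2 ** max_r


users = [f"user{i % 500}" for i in range(5000)]  # 500 distinct, many repeats
print(fm_estimate(users))
```

Only `max_r` is stored, so memory use does not grow with the stream, and repeated elements never change the estimate because they always hash to the same value.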
EXAMPLE
S=1,3,2,1,2,3,4,3,1,2,3,1
h(x)=(6x+1) mod 5
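The example can be worked through with a short sketch. It assumes a hash value of 0 counts as zero trailing zeroes and prints hash values as 3-bit binary strings. Both are conventions chosen here, not stated on the slide.

```python
def h(x):
    return (6 * x + 1) % 5


def trailing_zeros(value):
    # Assumed convention: a hash value of 0 contributes 0 trailing zeroes.
    if value == 0:
        return 0
    return (value & -value).bit_length() - 1


S = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]

max_r = 0
for x in sorted(set(S)):
    value = h(x)
    r = trailing_zeros(value)
    max_r = max(max_r, r)
    print(x, value, format(value, "03b"), r)
# 1 2 010 1
# 2 3 011 0
# 3 4 100 2
# 4 0 000 0

print(2 ** max_r)  # 4
```

Here R = 2, so the estimate is 4, which happens to match the true number of distinct elements in S.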
COUNTING ONENESS
DATAR-GIONIS- INDYK- MOTWANI ALGORITHM (DGIM)
ELEMENTS
Timestamp
Each element entering the stream is assigned a timestamp based on its position in the
stream.
Example: If the first bit has timestamp 1, then the second bit has timestamp 2, the third bit 3,
and so on.
Buckets
Used to represent time intervals in a data stream.
The algorithm divides the stream into buckets; the size of each bucket (its number of 1s) is a power of 2.
A bucket covers a run of consecutive bits, both 0s and 1s.
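The timestamp and bucket ideas can be combined into a small sketch of DGIM. It follows the usual textbook formulation, which is an assumption here: each bucket records the timestamp of its most recent 1 and its size, at most two buckets of any size are kept, and the estimate counts only half of the oldest bucket. The class and method names are hypothetical.

```python
class DGIM:
    """Sketch of DGIM: approximate count of 1s in the last `window_size` bits."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.time = 0
        # Each bucket is [end_timestamp, size]; newest bucket first.
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its end timestamp leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window_size:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.time, 1])
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while (i + 2 < len(self.buckets)
               and self.buckets[i][1] == self.buckets[i + 2][1]):
            self.buckets[i + 1][1] *= 2   # keeps the newer end timestamp
            del self.buckets[i + 2]
            i += 1

    def estimate_ones(self):
        """All bucket sizes, with only half of the oldest bucket counted."""
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] // 2


bits = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 30
dgim = DGIM(window_size=64)
for b in bits:
    dgim.add(b)
print(dgim.estimate_ones(), sum(bits[-64:]))  # estimate vs. exact count
```

Only a logarithmic number of buckets is stored for the window, and the estimate stays within 50% of the true count because only the oldest bucket can extend past the window boundary.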
RULES FOR FORMING A BUCKET
EXAMPLE
ESTIMATING MOMENTS
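• Kth Moment
Let $m_i$ be the number of occurrences of the ith element of the universal set. The kth-order
moment (or just kth moment) of the stream is the sum over all i of $(m_i)^k$, that is, $\sum_i (m_i)^k$.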
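As a sketch of the definition, the exact kth moment can be computed directly when the per-element counts fit in memory. The function name `kth_moment` is hypothetical. The 0th moment is the number of distinct elements, the 1st moment is the stream length, and the 2nd moment is the surprise number.

```python
from collections import Counter


def kth_moment(stream, k):
    """Exact kth moment: sum over distinct elements i of (m_i) ** k."""
    counts = Counter(stream)  # m_i for every element that occurs
    return sum(m ** k for m in counts.values())


stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
print(kth_moment(stream, 0))  # 4   (number of distinct elements)
print(kth_moment(stream, 1))  # 12  (length of the stream)
print(kth_moment(stream, 2))  # 42  (surprise number: 16 + 16 + 9 + 1)
```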
COMPUTING DIFFERENT MOMENTS
ALON MATIAS SZEGEDY METHOD
• The AMS method works for all moments
• It gives an unbiased estimate
• We will just concentrate on the 2nd moment S
• We pick and keep track of many variables X:
– For each variable X we store X.el and X.val
• X.el corresponds to the item i
• X.val corresponds to the count of item i
– Note this requires a count in main memory, so the number of
Xs is limited
• Our goal is to compute $S = \sum_i m_i^2$
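A sketch of the second-moment estimator follows, assuming the usual formulation: each variable X is started at a chosen stream position, X.val counts occurrences of X.el from that position onward, and each variable yields the estimate n(2·X.val − 1), where n is the stream length. The function name, the sample stream and the chosen positions are illustrative.

```python
def ams_second_moment(stream, positions):
    """Average of n * (2 * X.val - 1) over variables started at `positions`."""
    n = len(stream)
    start_positions = set(positions)
    variables = []  # each variable X is a list [el, val]
    for t, item in enumerate(stream):
        # Update every existing variable that tracks this item.
        for variable in variables:
            if variable[0] == item:
                variable[1] += 1
        # Start a new variable at each chosen position (0-indexed).
        if t in start_positions:
            variables.append([item, 1])
    estimates = [n * (2 * val - 1) for _, val in variables]
    return sum(estimates) / len(estimates)


stream = list("abcbdacdabdcaab")                 # a:5, b:4, c:3, d:3
print(ams_second_moment(stream, [2, 7, 12]))     # 55.0
print(ams_second_moment(stream, range(len(stream))))  # 59.0
```

With three variables the estimate is 55, close to the true second moment 59 (25 + 16 + 9 + 9). Starting a variable at every position recovers the exact value, which illustrates why the estimate is unbiased when positions are chosen uniformly at random.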
FIND THE SURPRISE NUMBER
REAL-TIME ANALYTICS
WIDELY USED RTAPS
Apache Spark Streaming
Cisco Connected Streaming Analytics (CSA)
Oracle Stream Analytics (OSA)
SAP HANA
SQLstream Blaze
TIBCO StreamBase