Finding Similar Items
Introduction
● In today’s world, data comes in many forms, e.g. video, audio, text, etc.
● The quantity of data is huge and largely unstructured. Such datasets can contain similar
and repeated items.
● A fundamental Data Mining problem is to examine data for similar items.
● For example, looking for near-duplicate pages in a collection of web pages; these pages
could be mirrors that have almost the same content but differ in information about the host
and about other mirrors.
Why Finding Similar Items?
Consider the example of a collection of web pages. Similar pages might differ only in the
name of the course, the year, and other small changes made from year to year. It is important
to be able to detect similar pages of these kinds, because search engines produce
better results if they avoid showing two pages that are nearly identical within the
first page of results.
Applications of Finding Similar Items
Plagiarism:
Finding plagiarized documents tests our ability to find textual similarity. The
plagiarizer may extract only some parts of a document for his own use. He may alter a
few words and may alter the order in which sentences of the original appear. Yet the
resulting document may still contain 50% or more of the original. No simple process
of comparing documents character by character will detect sophisticated plagiarism.
Online Purchases
Amazon.com has millions of customers and sells millions of items. Its database
records which items have been bought by which customers. We can say two
customers are similar if their sets of purchased items have a high Jaccard similarity.
Likewise, two items that have sets of purchasers with high Jaccard similarity will be
deemed similar. Note that, while we might expect mirror sites to have Jaccard
similarity above 90%, it is unlikely that any two customers have Jaccard similarity
that high (unless they have purchased only one item). Even a Jaccard similarity like
20% might be unusual enough to identify customers with similar tastes. The same
observation holds for items; Jaccard similarities need not be very high to be
significant.
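To make the Jaccard measure concrete, here is a minimal Python sketch; the two customer purchase sets are invented purely for illustration:

def jaccard_similarity(a, b):
    # Jaccard similarity: |intersection| / |union|
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical purchase histories of two customers
customer1 = {"book", "laptop", "headphones", "mouse"}
customer2 = {"book", "laptop", "monitor", "keyboard", "mouse"}
print(jaccard_similarity(customer1, customer2))  # 3/6 = 0.5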
Movie Ratings
Netflix records which movies each of its customers rented, and also the ratings
assigned to those movies by the customers. We can see movies as similar if they
were rented or rated highly by many of the same customers, and see customers as
similar if they rented or rated highly many of the same movies. The same
observations that we made for Amazon above apply in this situation: similarities
need not be high to be significant, and clustering movies by genre will make things
easier.
Challenges Faced
● Many small pieces of one document can appear out of order in another.
● Too many documents to compare all pairs.
● Documents are so large or so many that they cannot fit in main memory.
Techniques Used
Several techniques are used for finding similar items in a dataset:
● LSH
● Shingling
● Min Hashing
● Distance Measures
LSH
Motivation
The task of finding nearest neighbours is very common. Think of applications such as
finding duplicate or similar documents, or audio/video search.
Using brute force to check all possible pairs gives the exact nearest neighbour, but it
is not scalable at all. Approximate algorithms to accomplish this task have been an area
of active research. Although these algorithms don’t guarantee the exact answer, more
often than not they provide a good approximation, and they are faster and scalable.
Finding similar items/Documents
Locality sensitive hashing (LSH)
LSH refers to a family of functions (known as LSH families) to hash data points
into buckets so that data points near each other are located in the same buckets
with high probability, while data points far from each other are likely to be in
different buckets. This makes it easier to identify observations with various
degrees of similarity.
LSH has many applications, including:
● Genome-wide association study: Biologists often use LSH to identify similar gene expressions
● Large-scale image search: Google used LSH along with PageRank to build their image search technology
Goal: Given a large number (in the millions or billions) of documents, find “near-duplicate” pairs
We can break down the LSH algorithm into 3 broad steps: shingling, min-hashing, and
locality-sensitive hashing.
Shingling
In this step, we convert each document into a set of substrings of length k (also
known as k-shingles or k-grams). The key idea is to represent each document in
our collection as a set of k-shingles.
For example, take the document (D): “Nadal”. If we’re interested in 2-shingles,
then our set is {Na, ad, da, al}; similarly, the set of 3-shingles is {Nad, ada, dal}.
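As an illustration, here is a minimal Python sketch of k-shingling that reproduces the “Nadal” example above (the function name is only illustrative):

def shingles(text, k):
    # Return the set of k-shingles (length-k substrings) of a document
    return {text[i:i + k] for i in range(len(text) - k + 1)}

doc = "Nadal"
print(shingles(doc, 2))  # {'Na', 'ad', 'da', 'al'}
print(shingles(doc, 3))  # {'Nad', 'ada', 'dal'}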
Time complexity:
Now you may be thinking that we can stop here. But if you think about scalability,
doing just this won’t work. For a collection of n documents, you need to do n*(n-1)/2
comparisons, which is O(n²). If you have 1 million documents, the number
of comparisons will be about 5*10¹¹ (not scalable at all!).
Space complexity:
The document matrix is a sparse matrix and storing it as it is will be a big memory
overhead. One way to solve this is hashing.
Hashing
The idea of hashing is to convert each document to a small signature using a hashing
function H. Suppose a document in our corpus is denoted by d. Then:
● H(d) is the signature and it’s small enough to fit in memory
● If similarity(d1,d2) is high then Probability(H(d1)==H(d2)) is high
● If similarity(d1,d2) is low then Probability(H(d1)==H(d2)) is low
Choice of hashing function is tightly linked to the similarity metric we’re using. For
Jaccard similarity the appropriate hashing function is min-hashing.
Minhashing
Goal: Convert large sets to short signatures, while preserving similarity.
MinHash property
The key property is that, for a single min-hash function, the probability that two sets
receive the same min-hash value equals their Jaccard similarity. The similarity of two
signatures is therefore estimated as the fraction of the min-hash functions (rows) in
which they agree. For example, if the signatures of columns C1 and C3 agree in 2 of 3
rows (the 1st and 3rd), their estimated similarity is 2/3.
So using min-hashing we have solved the problem of space complexity by
eliminating the sparseness and at the same time preserving the similarity.
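A minimal Python sketch of min-hashing, assuming each document is already a set of shingles and building the hash family from Python’s built-in hash with random salts (illustrative only, not a production choice):

import random

def minhash_signature(shingle_set, hash_funcs):
    # Signature = the minimum hash value of the set under each hash function
    return [min(h(s) for s in shingle_set) for h in hash_funcs]

def estimate_similarity(sig1, sig2):
    # Fraction of min-hash functions (rows) on which the two signatures agree
    agree = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return agree / len(sig1)

def make_hash():
    salt = random.randint(0, 2**32)
    return lambda x: hash((salt, x))

random.seed(42)
hash_funcs = [make_hash() for _ in range(100)]

s1 = {"Na", "ad", "da", "al"}   # shingles of one document
s2 = {"Na", "ad", "da", "ar"}   # shingles of a similar document
sig1 = minhash_signature(s1, hash_funcs)
sig2 = minhash_signature(s2, hash_funcs)
print(estimate_similarity(sig1, sig2))  # should be close to the true Jaccard similarity, 3/5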
Locality-Sensitive Hashing
Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar
documents.
The general idea of LSH is to find an algorithm such that, given the signatures of two
documents, it tells us whether those two documents form a candidate pair, i.e. whether
their similarity is greater than a threshold t. Remember that we are using the similarity
of signatures as a proxy for the Jaccard similarity between the original documents.
Specifically, for the min-hash signature matrix:
● Hash the columns of the signature matrix M using several hash functions
● If two documents hash into the same bucket for at least one of the hash functions, we
take the two documents as a candidate pair
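A sketch of this banding idea in Python, assuming the min-hash signatures have already been computed; function and variable names are illustrative:

from collections import defaultdict

def lsh_candidate_pairs(signatures, b, r):
    # signatures: dict mapping doc_id -> min-hash signature (a list of length b*r)
    # b: number of bands, r: rows per band
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # Hash the band's slice of the signature into a bucket
            band_slice = tuple(sig[band * r:(band + 1) * r])
            buckets[band_slice].append(doc_id)
        # Any two documents sharing a bucket in at least one band become candidates
        for docs in buckets.values():
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    candidates.add(tuple(sorted((docs[i], docs[j]))))
    return candidates

# Made-up 4-row signatures, split into b=2 bands of r=2 rows
sigs = {"d1": [1, 5, 3, 7], "d2": [1, 5, 9, 2], "d3": [4, 8, 3, 7]}
print(lsh_candidate_pairs(sigs, b=2, r=2))  # {('d1', 'd2'), ('d1', 'd3')}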
LSH summary
● Tune M, b, r to get almost all pairs with similar signatures, but eliminate most
pairs that do not have similar signatures
● Check in main memory that candidate pairs really do have similar signatures
● Optional: In another pass through data, check that the remaining candidate pairs
really represent similar documents
Extra Materials
https://santhoshhari.github.io/Locality-Sensitive-Hashing/
Mining Data Streams
Agenda
● The rate at which data arrives is so rapid that it is infeasible to store it all in
active storage such as a conventional database
The Stream Data Model
● Each stream can provide elements at its own schedule with different data rates or data
types
● Depending on how fast the queries must be processed, the working store can be disk or
main memory
● Neither storage method has enough capacity to store all the data from all the streams
● There is a place within the processor where standing queries are stored
● They are permanently executing and produce results at appropriate times
● One approach is to store a sliding window of each stream in the working store
Sliding Window
Example:
Web sites often like to report the number of unique users over the past month. If we think
of each login as a stream element, we can maintain a window that is all logins in the
most recent month. We must associate the arrival time with each login, so we know
when it no longer belongs to the window. If we think of the window as a relation
Logins(name, time), then it is simple to get the number of unique users over the past
month.
The SQL query is: SELECT COUNT(DISTINCT(name)) FROM Logins WHERE time >= t;
Here, t is a constant that represents the time one month before the current time.
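A minimal in-memory sketch of this sliding window in Python, assuming login times arrive in order and approximating a month as 30 days:

from collections import deque

ONE_MONTH = 30 * 24 * 3600   # window length in seconds (approximation)
window = deque()             # (name, login_time) pairs in arrival order

def add_login(name, login_time):
    # Append the new stream element and expire elements older than the window
    window.append((name, login_time))
    while window and window[0][1] < login_time - ONE_MONTH:
        window.popleft()

def unique_users():
    # Equivalent of SELECT COUNT(DISTINCT(name)) over the current window
    return len({name for name, _ in window})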
Applications of Stream Model
● Sensor Data
● Image Data
● Streams deliver elements very rapidly, so the elements must be processed in real
time or we lose the opportunity to process them at all
● All the streams together can easily exceed the amount of available main memory
● It is therefore often much more efficient to get an approximate answer to our problem
than an exact solution, and to use hashing techniques to introduce randomness into the
algorithm’s behavior
Sampling Data in a Stream
● If we can store the list of all users and whether or not they are in the sample, then we could
do the following:
● Each time a search query arrives in the stream, we look up the user to see whether or not
they are in the sample.
● If so, we add this search query to the sample, and if not, then not.
● However, if we have no record of ever having seen this user before, then we generate a
random integer between 0 and 9.
● If the number is 0, we add this user to our list with value “in,” and if the number is other than
0, we add the user with the value “out.”
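A small Python sketch of this procedure, assuming the per-user in/out decisions fit in memory; the names are illustrative:

import random

user_status = {}   # user -> True if this user's queries are kept in the sample
sample = []        # sampled (user, query) pairs

def process_query(user, query):
    # First time we see this user: roll a digit 0-9 and remember the decision
    if user not in user_status:
        user_status[user] = (random.randint(0, 9) == 0)
    # Keep the query iff its user belongs to the 1/10 user sample
    if user_status[user]:
        sample.append((user, query))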
Representative Sampling
● By using a hash function we can avoid maintaining the list of users altogether
● That is, we hash each user name to one of ten buckets, 0 through 9
● If the user hashes to bucket 0, then accept this search query for the sample,
and if not, then not
● Effectively, we use the hash function as a random number generator
because of its important property that, when applied to the same user
several times, it always gives the same “random” number
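The same sampling decision expressed with a hash function instead of a stored list; Python’s built-in hash is used only for illustration (a real system would use a stable hash such as MurmurHash or MD5):

def in_sample(user, buckets=10, accepted_bucket=0):
    # Hash the user name to one of `buckets` buckets and accept bucket 0
    return hash(user) % buckets == accepted_bucket

def process_query(user, query, sample):
    # Keep the query iff the user hashes to the accepted bucket
    if in_sample(user):
        sample.append((user, query))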
Filtering Streams