
Finding Similar Items

Introduction

● In today’s world, data comes in many forms: video, audio, text, etc.
● The quantity of data is huge and largely unstructured, and a dataset can contain similar and repeated items.
● A fundamental data mining problem is to examine data for similar items.
● For example, we may look for near-duplicate pages in a collection of web pages; such pages could be mirrors that have almost the same content but differ in information about the host and about other mirrors.
Why Finding Similar Items?

Consider a collection of web pages, for example course pages that are reused over time: they might change the name of the course or the year and make other small changes from year to year. It is important to be able to detect near-duplicate pages of this kind, because search engines produce better results if they avoid showing two nearly identical pages within the first page of results.
Applications of Finding Similar Items

Plagiarism:

Finding plagiarized documents tests our ability to find textual similarity. The plagiarizer may extract only some parts of a document, may alter a few words, and may alter the order in which sentences of the original appear. Yet the resulting document may still contain 50% or more of the original. No simple process of comparing documents character by character will detect sophisticated plagiarism.
Online Purchases

Amazon.com has millions of customers and sells millions of items. Its database
records which items have been bought by which customers. We can say two
customers are similar if their sets of purchased items have a high Jaccard similarity.
Likewise, two items that have sets of purchasers with high Jaccard similarity will be
deemed similar. Note that, while we might expect mirror sites to have Jaccard
similarity above 90%, it is unlikely that any two customers have Jaccard similarity
that high (unless they have purchased only one item). Even a Jaccard similarity like
20% might be unusual enough to identify customers with similar tastes. The same
observation holds for items; Jaccard similarities need not be very high to be
significant.
Movie Ratings

Netflix records which movies each of its customers rented, and also the ratings
assigned to those movies by the customers. We can see movies as similar if they
were rented or rated highly by many of the same customers, and see customers as
similar if they rented or rated highly many of the same movies. The same
observations that we made for Amazon above apply in this situation: similarities
need not be high to be significant, and clustering movies by genre will make things
easier.
Challenges Faced

● Many small pieces of one document can appear out of order in another.
● Too many documents to compare all pairs.
● Documents are so large or so many that they cannot fit in main memory.
Techniques Used

Several techniques are used for finding similar items in a dataset:
● LSH
● Shingling
● Min Hashing
● Distance Measures
LSH
Motivation

The task of finding nearest neighbours is very common. Think of applications like finding duplicate or similar documents, or audio/video search. Using brute force to check all possible pairs gives the exact nearest neighbour, but it is not scalable at all. Approximate algorithms for this task have been an area of active research. Although these algorithms don’t guarantee the exact answer, more often than not they provide a good approximation, and they are faster and scalable.
Finding similar items/Documents
Locality sensitive hashing (LSH)
LSH refers to a family of functions (known as LSH families) to hash data points
into buckets so that data points near each other are located in the same buckets
with high probability, while data points far from each other are likely to be in
different buckets. This makes it easier to identify observations with various
degrees of similarity.
LSH has many applications, including:

● Near-duplicate detection: LSH is commonly used to deduplicate large quantities of documents, webpages, and other files.
● Genome-wide association study: Biologists often use LSH to identify similar gene expressions in genome databases.

LSH has many applications, including:

● Large-scale image search: Google used LSH along with PageRank to build their image search technology VisualRank.
● Audio/video fingerprinting: In multimedia technologies, LSH is widely used as a fingerprinting technique for A/V data.

Goal of the task

Goal: Given a large number (in the millions or billions) of documents, find “near-duplicate” pairs.

We can break down the LSH algorithm into 3 broad steps:

1. Shingling: Convert documents to sets

2. Min-Hashing: Convert large sets to short signatures, while preserving similarity

3. Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents.
Shingling

In this step, we convert each document into the set of substrings of length k that appear in it (also known as k-shingles or k-grams). The key idea is to represent each document in our collection as a set of k-shingles.

For example, suppose one of our documents (D) is “Nadal”. If we’re interested in 2-shingles, then our set is {Na, ad, da, al}. Similarly, the set of 3-shingles is {Nad, ada, dal}.
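
As an illustration (a minimal sketch, not from the slides; the function name is an assumption), k-shingling can be written in a few lines of Python:

def shingles(text, k):
    """Return the set of all k-character substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# Example from the slide:
print(shingles("Nadal", 2))  # {'Na', 'ad', 'da', 'al'}
print(shingles("Nadal", 3))  # {'Nad', 'ada', 'dal'}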
Shingling

● Similar documents are more likely to share more shingles.

● Reordering paragraphs in a document, or changing a few words, doesn’t have much effect on its shingles.
● A k value of 8–10 is generally used in practice. A small value of k results in shingles that are present in most documents (bad for differentiating documents).
Define Shingles
Metric to measure similarity between documents - the Jaccard Index can be useful

Jaccard Index (Jaccard similarity and Jaccard Distance)

For two sets A and B, the Jaccard similarity is |A ∩ B| / |A ∪ B|, the fraction of elements the sets share; the Jaccard distance is 1 minus the Jaccard similarity.
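
A minimal sketch (not part of the slides; names are illustrative) of both measures over shingle sets:

def jaccard_similarity(a, b):
    """|A intersection B| / |A union B| for two sets a and b."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)

# Example: 2-shingle sets of "Nadal" and "Nada"
print(jaccard_similarity({"Na", "ad", "da", "al"}, {"Na", "ad", "da"}))  # 0.75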
Let’s discuss 2 big issues that we need to tackle:

Time complexity:

Now you may be thinking that we can stop here. But if you think about scalability, doing just this won’t work. For a collection of n documents, you need to do n*(n-1)/2 comparisons, i.e. O(n²). Imagine you have 1 million documents: the number of comparisons is then about 5*10¹¹ (not scalable at all!).

Space complexity:

The shingle-document matrix is a sparse matrix, and storing it as-is is a big memory overhead. One way to solve this is hashing.
Hashing

The idea of hashing is to convert each document to a small signature using a hashing
function H. Suppose a document in our corpus is denoted by d. Then:
● H(d) is the signature and it’s small enough to fit in memory
● If similarity(d1,d2) is high then Probability(H(d1)==H(d2)) is high
● If similarity(d1,d2) is low then Probability(H(d1)==H(d2)) is low

Choice of hashing function is tightly linked to the similarity metric we’re using. For
Jaccard similarity the appropriate hashing function is min-hashing.
Minhashing

Minhashing Goal: Convert large sets to short signatures, while preserving similarity
In brief
MinHash property

The similarity of two signatures is the fraction of the min-hash functions (rows) in which they agree. For example, if signature columns C1 and C3 agree in 2 of their 3 rows (the 1st and 3rd), their signature similarity is 2/3.
So using min-hashing we have solved the problem of space complexity by eliminating the sparseness, while at the same time preserving the similarity.
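
Here is a minimal min-hash sketch in Python, assuming shingle sets as input; it is an illustrative implementation (random linear hash functions, names chosen here), not the slides’ own code. Each document’s signature keeps, for every hash function, the minimum hash value over its shingles, and signature similarity is the fraction of positions that agree:

import random

PRIME = 2_147_483_647  # a large prime for the hash functions

def make_hash_funcs(num_hashes, seed=42):
    """Create num_hashes random linear hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]

def minhash_signature(shingle_set, hash_funcs):
    """For each hash function, keep the minimum hash value over all shingles."""
    return [min((a * hash(s) + b) % PRIME for s in shingle_set) for a, b in hash_funcs]

def signature_similarity(sig1, sig2):
    """Fraction of signature positions in which the two documents agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

hash_funcs = make_hash_funcs(100)
s1 = minhash_signature({"Na", "ad", "da", "al"}, hash_funcs)
s2 = minhash_signature({"Na", "ad", "da"}, hash_funcs)
print(signature_similarity(s1, s2))  # approximates the Jaccard similarity (0.75)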
Locality-Sensitive Hashing
Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar
documents.
The general idea of LSH is to find an algorithm such that, given the signatures of 2 documents, it tells us whether those 2 documents form a candidate pair, i.e. whether their similarity is greater than a threshold t. Remember that we are taking the similarity of signatures as a proxy for the Jaccard similarity between the original documents.
Specifically, for a min-hash signature matrix:
● Hash columns of the signature matrix M using several hash functions
● If 2 documents hash into the same bucket for at least one of the hash functions, we take the 2 documents as a candidate pair
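
A hedged sketch of the banding version of this idea (the standard technique; function and parameter names here are assumptions): the signature is cut into b bands of r rows each, each band is hashed to a bucket, and two documents become a candidate pair if any band puts them in the same bucket:

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands, rows_per_band):
    """signatures: dict of doc_id -> min-hash signature (length bands * rows_per_band)."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        start = band * rows_per_band
        for doc_id, sig in signatures.items():
            # The band's slice of the signature acts as the bucket key.
            buckets[tuple(sig[start:start + rows_per_band])].append(doc_id)
        # Documents sharing a bucket in at least one band become candidate pairs.
        for doc_ids in buckets.values():
            for pair in combinations(sorted(doc_ids), 2):
                candidates.add(pair)
    return candidates

# Example: 100 min-hash functions split into 20 bands of 5 rows each
# pairs = lsh_candidate_pairs({"d1": s1, "d2": s2}, bands=20, rows_per_band=5)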
LSH summary

● Tune M, b, r (the signature length, the number of bands, and the number of rows per band) to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
● Check in main memory that candidate pairs really do have similar signatures
● Optional: In another pass through the data, check that the remaining candidate pairs really represent similar documents
Extra Materials

To read more on the code implementation of LSH, check out this article:

https://santhoshhari.github.io/Locality-Sensitive-Hashing/
Mining Data Streams
Agenda

● Elements of stream and stream processing


● Applications of stream model
● Problems that arise while dealing with streams
● Sampling Data in a Stream
● Filtering Streams
Assumptions about data

● Data arrives in a stream or streams

● If data is not processed or stored immediately, then it is lost forever

● The rate at which data arrives is so rapid that it is infeasible to store it all in
active storage such as a conventional database
The Stream Data Model
The Stream Data Model

● Any number of streams can enter the system

● Each stream can provide elements at its own schedule with different data rates or data
types

● The time between elements within a stream need not be uniform

● Streams are archived in the archival store


The Stream Data Model

● The summaries or parts of streams are placed in a working store

● The streams in working store can be used for answering queries

● Depending on how fast the queries should be processed, working store can be a disk or a
main memory

● Neither form of storage has the capacity to hold all the data from all the streams
The Stream Data Model

● There is a place within the processor where standing queries are stored
● They are permanently executing and produce results at appropriate times

Examples:

● A standing query to output an alert whenever the temperature exceeds 25 degrees, based on sensor data
● The maximum temperature ever recorded by the sensor, the average temperature recorded, etc.
Ad-hoc Queries

● The other form of queries is ad-hoc

● Since we cannot know in advance what questions will be asked through the ad-hoc interface, we prepare by storing appropriate parts or summaries of each stream

● One approach is to store a sliding window of each stream in the working store
Sliding Window

● A sliding window is the most recent n elements of a stream, for some n

● It can also be all the elements that arrived within the last t time units, say one day
● If each stream element is considered to be a tuple, then each window can be a relation that can be queried using an SQL query
● The stream management system should keep the window fresh, deleting the oldest elements as new ones come in
Sliding Window

Example:

Web sites often like to report the number of unique users over the past month. If we think
of each login as a stream element, we can maintain a window that is all logins in the
most recent month. We must associate the arrival time with each login, so we know
when it no longer belongs to the window. If we think of the window as a relation
Logins(name, time), then it is simple to get the number of unique users over the past
month.

The SQL query is: SELECT COUNT(DISTINCT(name)) FROM Logins WHERE time >= t;
Here, t is a constant that represents the time one month before the current time
Applications of Stream Model

● Sensor Data

● Image Data

● Internet and Web Traffic


Issues in Stream Processing

● Streams deliver elements very rapidly. So the elements should be processed in real
time or we lose the opportunity to process them at all

● All the streams together can easily exceed the amount of available main memory

● So, it is often much more efficient to get an approximate answer to our problem than an exact solution, and hashing techniques can be used to introduce useful randomness into the algorithm’s behavior
Sampling Data in a Stream

● The problem we shall deal with is selecting a subset of a stream so that we can ask queries about the selected subset and have the answers be statistically representative of the stream as a whole
● Let’s understand the concept through an example
● Say a search engine receives a stream of queries and would like to know what fraction of the typical user’s queries were repeated over the past month
● The stream consists of tuples (user, query, time)
● Also assume that we can store only 1/10th of the stream elements
Sampling Data in a Stream

● The first approach would be to generate a random number, say an integer between 0 and 9, in response to each query
● Store the tuple only if the random number is 0
● So, on average, 1/10th of each user’s queries will be stored
● However, this method does not give the correct answer to our question
● Say a user issued s queries once and d queries twice in the past month
● If we sample 1/10th of the queries, s/10 of the singleton queries will appear once in the sample, d/100 of the duplicate queries will appear twice, and 18d/100 of them will appear once
Sampling Data in a Stream

● The correct answer to the query would be d/(s+d)

● But the answer we obtain from the sample is (d/100) / (s/10 + d/100 + 18d/100) = d/(10s+19d), as the arithmetic sketch after this list illustrates
● So, the alternative is to pick 1/10th of the users and take all their searches for the sample
● The sample then covers only a fraction of the users, but for those users the statistic is computed correctly
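
A quick back-of-the-envelope check of the two answers, with hypothetical values for s and d (this sketch is not part of the slides):

s, d = 1_000_000, 100_000  # hypothetical: s queries issued once, d queries issued twice

true_fraction = d / (s + d)
# In a 1/10th sample: d/100 queries appear twice; s/10 + 18d/100 appear only once.
sampled_fraction = (d / 100) / (s / 10 + d / 100 + 18 * d / 100)  # = d / (10s + 19d)

print(true_fraction)     # about 0.091
print(sampled_fraction)  # about 0.0084 -- a badly biased estimate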
Representative Sampling

● If we can store the list of all users and whether or not they are in the sample, then we could
do the following

Each time a search query arrives in the stream, we look up the user to see whether or not
they are in the sample.

If so, we add this search query to the sample, and if not, then not.

However, if we have no record of ever having seen this user before, then we generate a
random integer between 0 and 9.

If the number is 0, we add this user to our list with value “in,” and if the number is other than
0, we add the user with the value “out.”
Representative Sampling

● By using a hash function we can avoid maintaining the list of users altogether
● That is, we hash each user name to one of ten buckets, 0 through 9
● If the user hashes to bucket 0, then we accept this search query for the sample, and if not, then not
● Effectively, we use the hash function as a random number generator, with the important property that when applied to the same user several times it always produces the same “random” number
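
One way this could look in Python (a sketch under the assumption of (user, query, time) tuples; the md5-based bucketing stands in for the slides’ generic hash function, and 10 buckets matches the 1/10th sampling rate):

import hashlib

NUM_BUCKETS = 10  # users hashing to bucket 0 form a 1/10th sample of users

def user_bucket(user):
    """Hash the user name to one of 10 buckets; the same user always gets the same bucket."""
    digest = hashlib.md5(user.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

sample = []

def process(element):
    user, query, time = element
    # Accept the query for the sample only if the user falls in the sampled bucket.
    if user_bucket(user) == 0:
        sample.append(element)

A deterministic hash (rather than Python’s built-in, per-run-salted hash) is used so the same user maps to the same bucket across runs.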
Filtering Streams

● Another common process on streams is selection, or filtering

● We will discuss a technique known as Bloom filtering to eliminate tuples that do not meet the selection criterion
The Bloom Filter

A Bloom filter consists of

● An array of n bits, initially all 0’s


● A collection of hash functions h1, h2, . . . , hk. Each hash function maps
“key” values to n buckets, corresponding to the n bits of the bit-array
● A set S of m key values
The Bloom Filter

● To initialize the bit array, begin with all bits 0

● Take each key value in S and hash it using each of the k hash functions
● Set to 1 each bit that is hi(K) for some hash function hi and some key value K in S
● To test a key K that arrives in the stream, check that all of h1(K), h2(K), . . . , hk(K) are 1’s in the bit-array
● If all are 1’s, then let the stream element through, else reject the stream element
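
A small illustrative Bloom filter in Python (a sketch, not the slides’ code; deriving the k hash functions by salting a standard hash is one common choice):

import hashlib

class BloomFilter:
    def __init__(self, n_bits, k_hashes):
        self.n = n_bits
        self.k = k_hashes
        self.bits = [0] * n_bits  # the n-bit array, initially all 0's

    def _positions(self, key):
        """Derive k bit positions for a key by hashing it with k different salts."""
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        """Set to 1 every bit position produced for a key in S."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """True if all k bits are 1: may give false positives, never false negatives."""
        return all(self.bits[pos] for pos in self._positions(key))

# Initialize with the set S of allowed keys, then test arriving stream elements.
bf = BloomFilter(n_bits=1000, k_hashes=3)
for key in ["alice@example.com", "bob@example.com"]:  # the set S (hypothetical keys)
    bf.add(key)
print(bf.might_contain("alice@example.com"))    # True: let it through
print(bf.might_contain("mallory@example.com"))  # almost certainly False: reject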
