Finding Similar Items
Introduction
● In today’s world, data comes in many forms, e.g. video, audio, text, etc.
● The quantity of data is huge and largely unstructured. Such datasets can contain similar
and repeated items.
● A fundamental Data Mining problem is to examine data for similar items.
● For example, looking for near-duplicate pages in a collection of web pages; these pages
could be mirrors that have almost the same content but differ in information about the host
and about other mirrors.
Why Finding Similar Items?
Consider the example of a collection of web pages. Similar pages might differ only in the
name of the course, the year, and other small changes made from year to year. It is important
to be able to detect similar pages of these kinds, because search engines produce
better results if they avoid showing two pages that are nearly identical within the
first page of results.
Applications of Finding Similar Items
Plagiarism:
Finding plagiarized documents tests our ability to find textual similarity. The
plagiarizer may extract only some parts of a document for his own use. He may alter a
few words and may alter the order in which sentences of the original appear. Yet the
resulting document may still contain 50% or more of the original. No simple process
of comparing documents character by character will detect sophisticated plagiarism.
Online Purchases
Amazon.com has millions of customers and sells millions of items. Its database
records which items have been bought by which customers. We can say two
customers are similar if their sets of purchased items have a high Jaccard similarity.
Likewise, two items that have sets of purchasers with high Jaccard similarity will be
deemed similar. Note that, while we might expect mirror sites to have Jaccard
similarity above 90%, it is unlikely that any two customers have Jaccard similarity
that high (unless they have purchased only one item). Even a Jaccard similarity like
20% might be unusual enough to identify customers with similar tastes. The same
observation holds for items; Jaccard similarities need not be very high to be
significant.
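To make the Jaccard measure concrete, here is a minimal Python sketch; the two customer purchase sets are invented purely for illustration:

def jaccard_similarity(a, b):
    # Jaccard similarity: |intersection| / |union|
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical purchase histories of two customers
customer1 = {"book", "laptop", "headphones", "mouse"}
customer2 = {"book", "laptop", "monitor", "keyboard", "mouse"}
print(jaccard_similarity(customer1, customer2))  # 3/6 = 0.5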
Movie Ratings
Netflix records which movies each of its customers rented, and also the ratings
assigned to those movies by the customers. We can see movies as similar if they
were rented or rated highly by many of the same customers, and see customers as
similar if they rented or rated highly many of the same movies. The same
observations that we made for Amazon above apply in this situation: similarities
need not be high to be significant, and clustering movies by genre will make things
easier.
Challenges Faced
● Many small pieces of one document can appear out of order in another.
● Too many documents to compare all pairs.
● Documents are so large or so many that they cannot fit in main memory.
Techniques Used
Several techniques are used for finding similar items in a dataset:
● LSH
● Shingling
● Min Hashing
● Distance Measures
LSH
Motivation
The task of finding nearest neighbours is very common. Think of applications such as
finding duplicate or similar documents, or audio/video search.
Using brute force to check all possible pairs gives the exact nearest neighbour, but it
is not scalable at all. Approximate algorithms to accomplish this task have been an area
of active research. Although these algorithms don’t guarantee the exact answer, more
often than not they provide a good approximation, and they are faster and scalable.
Finding similar items/Documents
Locality sensitive hashing (LSH)
LSH refers to a family of functions (known as LSH families) to hash data points
into buckets so that data points near each other are located in the same buckets
with high probability, while data points far from each other are likely to be in
different buckets. This makes it easier to identify observations with various
degrees of similarity.
LSH has many applications, including:
● Genome-wide association study: Biologists often use LSH to identify similar gene expressions
● Large-scale image search: Google used LSH along with PageRank to build their image search technology
Goal: Given a large number (in the millions or billions) of documents, find “near-duplicate” pairs
We can break down the LSH algorithm into 3 broad steps: shingling, min-hashing, and
locality-sensitive hashing.
Shingling
In this step, we convert each document into a set of substrings of length k (also
known as k-shingles or k-grams). The key idea is to represent each document in
our collection as a set of k-shingles.
For example, take the document (D): “Nadal”. If we’re interested in 2-shingles,
then our set is {Na, ad, da, al}; similarly, the set of 3-shingles is {Nad, ada, dal}.
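As an illustration, here is a minimal Python sketch of k-shingling that reproduces the “Nadal” example above (the function name is only illustrative):

def shingles(text, k):
    # Return the set of k-shingles (length-k substrings) of a document
    return {text[i:i + k] for i in range(len(text) - k + 1)}

doc = "Nadal"
print(shingles(doc, 2))  # {'Na', 'ad', 'da', 'al'}
print(shingles(doc, 3))  # {'Nad', 'ada', 'dal'}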
Time complexity:
Now you may be thinking that we can stop here. But if you think about scalability,
doing just this won’t work. For a collection of n documents, you need to do n*(n-1)/2
comparisons, which is O(n²). If you have 1 million documents, the number
of comparisons will be about 5*10¹¹ (not scalable at all!).
Space complexity:
The document matrix is a sparse matrix and storing it as it is will be a big memory
overhead. One way to solve this is hashing.
Hashing
The idea of hashing is to convert each document to a small signature using a hashing
function H. Suppose a document in our corpus is denoted by d. Then:
● H(d) is the signature and it’s small enough to fit in memory
● If similarity(d1,d2) is high then Probability(H(d1)==H(d2)) is high
● If similarity(d1,d2) is low then Probability(H(d1)==H(d2)) is low
Choice of hashing function is tightly linked to the similarity metric we’re using. For
Jaccard similarity the appropriate hashing function is min-hashing.
Minhashing
Goal: Convert large sets to short signatures, while preserving similarity.
MinHash property
The key property is that, for a single min-hash function, the probability that two sets
receive the same min-hash value equals their Jaccard similarity. The similarity of two
signatures is therefore estimated as the fraction of the min-hash functions (rows) in
which they agree. For example, if the signatures of columns C1 and C3 agree in 2 of 3
rows (the 1st and 3rd), their estimated similarity is 2/3.
So using min-hashing we have solved the problem of space complexity by
eliminating the sparseness and at the same time preserving the similarity.
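A minimal Python sketch of min-hashing, assuming each document is already a set of shingles and building the hash family from Python’s built-in hash with random salts (illustrative only, not a production choice):

import random

def minhash_signature(shingle_set, hash_funcs):
    # Signature = the minimum hash value of the set under each hash function
    return [min(h(s) for s in shingle_set) for h in hash_funcs]

def estimate_similarity(sig1, sig2):
    # Fraction of min-hash functions (rows) on which the two signatures agree
    agree = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return agree / len(sig1)

def make_hash():
    salt = random.randint(0, 2**32)
    return lambda x: hash((salt, x))

random.seed(42)
hash_funcs = [make_hash() for _ in range(100)]

s1 = {"Na", "ad", "da", "al"}   # shingles of one document
s2 = {"Na", "ad", "da", "ar"}   # shingles of a similar document
sig1 = minhash_signature(s1, hash_funcs)
sig2 = minhash_signature(s2, hash_funcs)
print(estimate_similarity(sig1, sig2))  # should be close to the true Jaccard similarity, 3/5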
Locality-Sensitive Hashing
Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar
documents.
The general idea of LSH is to find an algorithm such that, given the signatures of two
documents, it tells us whether those two documents form a candidate pair, i.e. whether
their similarity is greater than a threshold t. Remember that we are using the similarity
of signatures as a proxy for the Jaccard similarity between the original documents.
Specifically, for the min-hash signature matrix:
● Hash the columns of the signature matrix M using several hash functions
● If two documents hash into the same bucket for at least one of the hash functions, we
take the two documents as a candidate pair
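A sketch of this banding idea in Python, assuming the min-hash signatures have already been computed; function and variable names are illustrative:

from collections import defaultdict

def lsh_candidate_pairs(signatures, b, r):
    # signatures: dict mapping doc_id -> min-hash signature (a list of length b*r)
    # b: number of bands, r: rows per band
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # Hash the band's slice of the signature into a bucket
            band_slice = tuple(sig[band * r:(band + 1) * r])
            buckets[band_slice].append(doc_id)
        # Any two documents sharing a bucket in at least one band become candidates
        for docs in buckets.values():
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    candidates.add(tuple(sorted((docs[i], docs[j]))))
    return candidates

# Made-up 4-row signatures, split into b=2 bands of r=2 rows
sigs = {"d1": [1, 5, 3, 7], "d2": [1, 5, 9, 2], "d3": [4, 8, 3, 7]}
print(lsh_candidate_pairs(sigs, b=2, r=2))  # {('d1', 'd2'), ('d1', 'd3')}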
LSH summary
● Tune M, b, r to get almost all pairs with similar signatures, but eliminate most
pairs that do not have similar signatures
● Check in main memory that candidate pairs really do have similar signatures
● Optional: In another pass through data, check that the remaining candidate pairs
really represent similar documents
Extra Materials
https://santhoshhari.github.io/Locality-Sensitive-Hashing/
Mining Data Streams
Agenda
● The rate at which data arrives is so rapid that it is infeasible to store it all in
active storage such as a conventional database
The Stream Data Model
● Each stream can provide elements at its own schedule with different data rates or data
types
● Depending on how fast the queries must be processed, the working store can be disk or
main memory
● Neither storage method has enough capacity to store all the data from all the streams
● There is a place within the processor where standing queries are stored
● They are permanently executing and produce results at appropriate times
● One approach is to store a sliding window of each stream in the working store
Sliding Window
Example:
Web sites often like to report the number of unique users over the past month. If we think
of each login as a stream element, we can maintain a window that is all logins in the
most recent month. We must associate the arrival time with each login, so we know
when it no longer belongs to the window. If we think of the window as a relation
Logins(name, time), then it is simple to get the number of unique users over the past
month.
The SQL query is: SELECT COUNT(DISTINCT(name)) FROM Logins WHERE time >= t;
Here, t is a constant that represents the time one month before the current time.
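A minimal in-memory sketch of this sliding window in Python, assuming login times arrive in order and approximating a month as 30 days:

from collections import deque

ONE_MONTH = 30 * 24 * 3600   # window length in seconds (approximation)
window = deque()             # (name, login_time) pairs in arrival order

def add_login(name, login_time):
    # Append the new stream element and expire elements older than the window
    window.append((name, login_time))
    while window and window[0][1] < login_time - ONE_MONTH:
        window.popleft()

def unique_users():
    # Equivalent of SELECT COUNT(DISTINCT(name)) over the current window
    return len({name for name, _ in window})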
Applications of Stream Model
● Sensor Data
● Image Data
● Streams deliver elements very rapidly, so the elements must be processed in real
time or we lose the opportunity to process them at all
● All the streams together can easily exceed the amount of available main memory
● It is therefore often much more efficient to get an approximate answer to our problem
than an exact solution, and to use hashing techniques to introduce randomness into the
algorithm’s behavior
Sampling Data in a Stream
● If we can store the list of all users and whether or not they are in the sample, then we could
do the following:
● Each time a search query arrives in the stream, we look up the user to see whether or not
they are in the sample.
● If so, we add this search query to the sample, and if not, then not.
● However, if we have no record of ever having seen this user before, then we generate a
random integer between 0 and 9.
● If the number is 0, we add this user to our list with value “in,” and if the number is other than
0, we add the user with the value “out.”
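A small Python sketch of this procedure, assuming the per-user in/out decisions fit in memory; the names are illustrative:

import random

user_status = {}   # user -> True if this user's queries are kept in the sample
sample = []        # sampled (user, query) pairs

def process_query(user, query):
    # First time we see this user: roll a digit 0-9 and remember the decision
    if user not in user_status:
        user_status[user] = (random.randint(0, 9) == 0)
    # Keep the query iff its user belongs to the 1/10 user sample
    if user_status[user]:
        sample.append((user, query))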
Representative Sampling
● By using a hash function we can avoid maintaining the list of users altogether
● That is, we hash each user name to one of ten buckets, 0 through 9
● If the user hashes to bucket 0, then accept this search query for the sample,
and if not, then not
● Effectively, we use the hash function as a random number generator
because of its important property that, when applied to the same user
several times, it always gives the same “random” number
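The same sampling decision expressed with a hash function instead of a stored list; Python’s built-in hash is used only for illustration (a real system would use a stable hash such as MurmurHash or MD5):

def in_sample(user, buckets=10, accepted_bucket=0):
    # Hash the user name to one of `buckets` buckets and accept bucket 0
    return hash(user) % buckets == accepted_bucket

def process_query(user, query, sample):
    # Keep the query iff the user hashes to the accepted bucket
    if in_sample(user):
        sample.append((user, query))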
Filtering Streams