
12-Finding Similar Sets

This document discusses techniques for finding similar documents, including shingling, minhashing, and locality sensitive hashing. It introduces these techniques and explains how they work together: shingling converts documents to sets of n-gram strings, minhashing converts these sets to short signatures that preserve similarity, and locality sensitive hashing focuses on signature pairs likely to be similar to identify candidate document pairs for direct comparison. The document provides examples of how changing a word or reordering paragraphs only affects a small number of n-grams, and how n-grams can be hashed to compress their representation while still capturing document similarity.

Uploaded by

praneetm

Applications

Shingling
Minhashing
Locality Sensitive Hashing

Mining of Massive Datasets


Leskovec, Rajaraman, and Ullman
Stanford University

Many data mining problems can be expressed as finding similar sets:
1. Pages with similar words, e.g., for classification by topic.
2. Netflix users with similar tastes in movies, for recommendation systems.
3. Dual: movies with similar sets of fans.
4. Entity resolution.

Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, such as:
Mirror sites, or approximate mirrors.
Application: Don't want to show both in a search.
Plagiarism, including large quotations.
Similar news articles at many news sites.
Application: Cluster articles by same story.

1. Shingling: convert documents, emails, etc., to sets.
2. Minhashing: convert large sets to short signatures, while preserving similarity.
3. Locality sensitive hashing: focus on pairs of signatures likely to be similar.

[Figure: Document → (Shingling) → the set of strings of length k that appear in the document → (Minhashing) → Signatures: short integer vectors that represent the sets, and reflect their similarity → (Locality sensitive hashing) → Candidate pairs: those pairs of signatures that we need to test for similarity.]

A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document.
Example: k=2; doc = abcab. Set of 2-shingles = {ab, bc, ca}.
Represent a doc by its set of k-shingles.
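
The definition above can be sketched in a few lines of Python. This is only an illustration: the function name k_shingles is hypothetical and not from the slides, and it treats the document as a plain character string.

```python
def k_shingles(doc: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-shingles) of doc.

    Hypothetical helper for illustration; the slides do not name a function.
    """
    # Slide a window of width k over the string; the set drops duplicates.
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


# The slide's example: k=2, doc = abcab. "ab" occurs twice but is kept once.
print(sorted(k_shingles("abcab", 2)))  # ['ab', 'bc', 'ca']
```

A document shorter than k characters yields the empty set under this sketch.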

Documents that are intuitively similar will have many shingles in common.
Changing a word only affects k-shingles within distance k from the word.
Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries.
Example: k=3, "The dog which chased the cat" versus "The dog that chased the cat".
The only 3-shingles replaced are g_w, _wh, whi, hic, ich, ch_, and h_c (writing _ for a space).
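
The example can be checked with a short sketch. It assumes spaces are written as underscores, as in the slide's shingle list, and uses a hypothetical helper k_shingles that is not part of the slides.

```python
def k_shingles(doc: str, k: int) -> set[str]:
    """Hypothetical helper: the set of all length-k substrings of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


# Write spaces as underscores to match the slide's notation.
doc_which = "The dog which chased the cat".replace(" ", "_")
doc_that = "The dog that chased the cat".replace(" ", "_")

a = k_shingles(doc_which, 3)
b = k_shingles(doc_that, 3)

removed = a - b  # 3-shingles that occur only in the "which" version
added = b - a    # 3-shingles that occur only in the "that" version

print(sorted(removed))
print(sorted(added))
```

Every shingle in removed touches the word "which"; all other shingles are shared by the two sentences.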
9/29/2014


To compress long shingles, we can hash them to (say) 4 bytes.
Called tokens.
Represent a doc by its tokens, that is, the set of hash values of its k-shingles.
Two documents could (rarely) appear to have shingles in common, when in fact only the hash values were shared.
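
A minimal sketch of tokens follows. The slides do not name a hash function, so the use of zlib.crc32 (which returns a 32-bit, i.e. 4-byte, unsigned integer) is an assumption made here for illustration, as is the function name tokens.

```python
import zlib


def tokens(doc: str, k: int) -> set[int]:
    """Return the set of 4-byte hash values of the k-shingles of doc.

    Assumption: CRC-32 stands in for the unspecified hash on the slide.
    """
    shingles = {doc[i:i + k] for i in range(len(doc) - k + 1)}
    # Each shingle, however long, is reduced to one 32-bit integer.
    return {zlib.crc32(s.encode("utf-8")) for s in shingles}


t = tokens("abcab", 2)
print(len(t))
```

Because distinct shingles can hash to the same value, two token sets may overlap slightly more than the underlying shingle sets do, which is the rare false sharing mentioned above.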
