
MINING MASSIVE DATASETS
(20DS7T07)
R20 :: CSE (DATA SCIENCE) :: IV–I Semester
LECTURE NOTES

UNIT II

Applications of Set Similarity: Jaccard Similarity of Sets, Similarity of Documents, Collaborative Filtering as a Similar-Sets Problem.
Shingling of Documents: k-Shingles, Choosing the Shingle Size, Hashing Shingles, Shingles Built from Words.
Locality-Sensitive Hashing for Documents: LSH for Minhash Signatures, Analysis of the Banding Technique, Combining the Techniques.

2.1 Applications of Set Similarity


We can measure the similarity of sets by the relative size of their intersection. This notion of similarity
is called Jaccard similarity.
2.1.1 Jaccard Similarity of Sets:
The Jaccard similarity of sets S and T is |S ∩ T |/|S ∪ T |, that is, the ratio of the size of the
intersection of S and T to the size of their union. We shall denote the Jaccard similarity of S and T
by SIM(S, T).

For example, suppose S = {a, b, c, d, e} and T = {c, d, e, f, g, h}. There are three elements in their
intersection and a total of eight elements that appear in S or T or both. Thus, SIM(S, T) = 3/8.
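As a minimal sketch, this definition translates directly into Python (the function name jaccard_similarity and the example sets are illustrative, not part of the notes):

def jaccard_similarity(s: set, t: set) -> float:
    """Return |S ∩ T| / |S ∪ T| for two sets."""
    if not s and not t:
        return 0.0  # convention for two empty sets
    return len(s & t) / len(s | t)

# The example above: 3 shared elements out of 8 in the union.
S = {"a", "b", "c", "d", "e"}
T = {"c", "d", "e", "f", "g", "h"}
print(jaccard_similarity(S, T))  # 0.375, i.e. 3/8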



2.1.2 Similarity of Documents:
An important class of problems that Jaccard similarity addresses well is that of finding
textually similar documents in a large corpus such as the Web or a collection of news articles.
Many important uses of textual similarity involve finding duplicates or near
duplicates. First, let us observe that testing whether two documents are exact duplicates is easy; just
compare the two documents character-by-character, and if they ever differ then they are not the
same. However, in many applications, the documents are not identical, yet they share large portions
of their text. Here are some examples:
Plagiarism
Finding plagiarized documents tests our ability to find textual similarity. A plagiarizer may
extract only some parts of a document for his own use. He may alter a few words and may alter the order
in which sentences of the original appear. Yet the resulting document may still contain much of the
original. No simple process of comparing documents character by character will detect a
sophisticated plagiarism.
Mirror Pages
It is common for important or popular Web sites to be duplicated at a number of hosts, in
order to share the load. The pages of these mirror sites will be quite similar, but are rarely identical.
A related phenomenon is the reuse of Web pages from one academic class to another. These pages
might include class notes, assignments, and lecture slides. Successive versions might change the name of
the course and the year, and make small changes from year to year. It is important to be able to detect
similar pages of these kinds, because search engines produce better results if they avoid showing two
pages that are nearly identical within the first page of results.
Articles from the Same Source
It is common for one reporter to write a news article that gets distributed, say through the
Associated Press, to many newspapers, which then publish the article on their Web sites. Each
newspaper changes the article somewhat.
They may cut out paragraphs, or even add material of their own. They most likely will surround the
article by their own logo, ads, and links to other articles at their site. However, the core of each
newspaper’s page will be the original article.
News aggregators, such as Google News, try to find all versions of such an article, in order to show
only one, and that task requires finding when two Web pages are textually similar, although not identical.
2.1.3 Collaborative Filtering as a Similar-Sets Problem:
Another class of applications where similarity of sets is very important is called collaborative
filtering, a process whereby we recommend to users items that were liked by other users who have
exhibited similar tastes.
Here are some common examples:
On-Line Purchases
Amazon.com has millions of customers and sells millions of items. Its database records which
items have been bought by which customers. We can say two customers are similar if their sets of
purchased items have a high Jaccard similarity. Likewise, two items that have sets of purchasers
with high Jaccard similarity will be deemed similar.
Even a Jaccard similarity like 20% might be unusual enough to identify customers with
similar tastes. The same observation holds for items; Jaccard similarities need not be very high to be
significant.
Movie Ratings
Netflix records which movies each of its customers rented, and also the ratings assigned to
those movies by the customers. We can regard movies as similar if they were rented or rated highly
by many of the same customers, and see customers as similar if they rented or rated highly many of
the same movies.
Similarities need not be high to be significant, and clustering movies by genre will make
things easier.

2.2 Shingling of Documents


The most effective way to represent documents as sets, for the purpose of identifying lexically
similar documents, is to construct from the document the set of short strings that appear within it.
The simplest and most common approach is shingling.
2.2.1 k-Shingles
A document is a string of characters. Define a k-shingle for a document to be any substring of
length k found within the document. Then, we may associate with each document the set of k-
shingles that appear one or more times within that document.
Example 3.3 : Suppose our document D is the string abcdabd, and we pick k = 2. Then the set of 2-
shingles for D is {ab, bc, cd, da, bd}.



Note that the substring ab appears twice within D, but appears only once as a shingle. A
variation of shingling produces a bag, rather than a set, so each shingle would appear in the result as
many times as it appears in the document.
There are several options regarding how white space (blank, tab, newline, etc.) is treated. It
probably makes sense to replace any sequence of one or more white-space characters by a single
blank. That way, we distinguish shingles that cover two or more words from those that do not.
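A small Python sketch of k-shingling with this white-space normalization (the helper name k_shingles is illustrative):

import re

def k_shingles(document: str, k: int) -> set:
    """Return the set of k-shingles of a document, collapsing any run of
    white-space characters (blank, tab, newline, ...) into a single blank."""
    text = re.sub(r"\s+", " ", document)
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# Example 3.3: the 2-shingles of "abcdabd" are {ab, bc, cd, da, bd}.
print(k_shingles("abcdabd", 2))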
2.2.2 Choosing the Shingle Size
We can pick k to be any constant we like. However, if we pick k too small, then we would
expect most sequences of k characters to appear in most documents.
If so, then we could have documents whose shingle-sets had high Jaccard similarity, yet the
documents had none of the same sentences or even phrases. As an extreme example, if we use k = 1,
most Web pages will have most of the common characters and few other characters, so almost all
Web pages will have high similarity.
How large k should be depends on how long typical documents are and how large the set of
typical characters is. The important thing to remember is:
• k should be picked large enough that the probability of any given shingle appearing in any given
document is low.
Thus, if our corpus of documents is emails, picking k = 5 should be fine. To see why, suppose
that only letters and a general white-space character appear in emails (although in practice, most of
the printable ASCII characters can be expected to appear occasionally). If so, then there would be
27^5 = 14,348,907 possible shingles. Since the typical email is much shorter than 14 million
characters, we would expect k = 5 to work well, and indeed it does.


2.2.3 Hashing Shingles
Instead of using substrings directly as shingles, we can pick a hash function that maps strings
of length k to some number of buckets and treat the resulting bucket number as the shingle. The set
representing a document is then the set of integers that are bucket numbers of one or more k-shingles
that appear in the document.
For instance, we could construct the set of 9-shingles for a document and then map each of
those 9-shingles to a bucket number in the range 0 to 2^32 − 1. Thus, each shingle is represented by
four bytes instead of nine. Not only has the data been compacted, but we can now manipulate
(hashed) shingles by single-word machine operations.
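As a rough sketch (assuming Python's built-in hash as the hash function; any function mapping strings to integers would do), hashing 9-shingles down to 4-byte bucket numbers might look like this:

def hashed_shingles(document: str, k: int = 9) -> set:
    """Map each k-shingle to a bucket number in the range 0 to 2^32 - 1,
    and represent the document by the set of bucket numbers."""
    shingles = {document[i:i + k] for i in range(len(document) - k + 1)}
    return {hash(s) & 0xFFFFFFFF for s in shingles}  # keep only the low 32 bits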



Notice that we can differentiate documents better if we use 9-shingles and hash them down to
four bytes than to use 4-shingles, even though the space used to represent a shingle is the same.
2.2.4 Shingles Built from Words
An alternative form of shingle has proved effective for the problem of identifying similar
news articles. The exploitable distinction for this problem is that the news articles are written in a
rather different style than are other elements that typically appear on the page with the article. News
articles, and most prose, have a lot of stop words, the most common words such as “and,” “you,”
“to,” and so on. In many applications, we want to ignore stop words, since they don’t tell us
anything useful about the article, such as its topic.
However, for the problem of finding similar news articles, it was found that defining a shingle
to be a stop word followed by the next two words, regardless of whether or not they were stop
words, formed a useful set of shingles. The advantage of this approach is that the news article would
then contribute more shingles to the set representing the Web page than would the surrounding
elements.
Recall that the goal of the exercise is to find pages that had the same articles, regardless of the
surrounding elements. By biasing the set of shingles in favor of the article, pages with the same
article and different surrounding material have higher Jaccard similarity than pages with the same
surrounding material but with a different article.
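A sketch of this word-based shingling in Python (the stop-word list here is a tiny illustrative sample, not a standard list):

STOP_WORDS = {"a", "and", "for", "the", "to", "you", "that", "it", "is"}  # illustrative sample

def stop_word_shingles(text: str) -> set:
    """Each shingle is a stop word followed by the next two words,
    whether or not those two words are themselves stop words."""
    words = text.lower().split()
    return {" ".join(words[i:i + 3])
            for i in range(len(words) - 2)
            if words[i] in STOP_WORDS}

print(stop_word_shingles("A spokesperson for the Sudzo Corporation revealed today"))
# {'a spokesperson for', 'for the sudzo', 'the sudzo corporation'}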

2.3 Locality-Sensitive Hashing for Documents


Even though we can use minhashing to compress large documents into small signatures and
preserve the expected similarity of any pair of documents, it still may be impossible to find the pairs
with greatest similarity efficiently. The reason is that the number of pairs of documents may be too
large, even if there are not too many documents.
If our goal is to compute the similarity of every pair, there is nothing we can do to reduce the
work, although parallelism can reduce the elapsed time. However, often we want only the most
similar pairs or all pairs that are above some lower bound in similarity. If so, then we need to focus
our attention only on pairs that are likely to be similar, without investigating every pair. There is a
general theory of how to provide such focus, called locality-sensitive hashing (LSH) or near-neighbor
search.
2.3.1 LSH for Minhash Signatures
One general approach to LSH is to “hash” items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are. We then consider any
pair that hashed to the same bucket for any of the hashings to be a candidate pair.
We check only the candidate pairs for similarity. The hope is that most of the dissimilar pairs
will never hash to the same bucket, and therefore will never be checked. Those dissimilar pairs that
do hash to the same bucket are false positives; we hope these will be only a small fraction of all pairs.
We also hope that most of the truly similar pairs will hash to the same bucket under at least one of the
hash functions. Those that do not are false negatives; we hope these will be only a small fraction of
the truly similar pairs.
If we have minhash signatures for the items, an effective way to choose the hashings is to
divide the signature matrix into b bands consisting of r rows each. For each band, there is a hash
function that takes vectors of r integers (the portion of one column within that band) and hashes them
to some large number of buckets. We can use the same hash function for all the bands, but we use a
separate bucket array for each band, so columns with the same vector in different bands will not hash
to the same bucket.
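A minimal sketch of this banding step on a minhash signature matrix (the function name lsh_candidate_pairs is illustrative, and hashing the Python tuple of r values stands in for the per-band hash function described above):

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """signatures: dict mapping document id -> list of b*r minhash values.
    Returns the candidate pairs: pairs whose signatures land in the same
    bucket in at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)            # separate bucket array for each band
        for doc, sig in signatures.items():
            chunk = tuple(sig[band * r:(band + 1) * r])
            buckets[chunk].append(doc)         # hash the r-row slice of this column
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates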
2.3.2 Analysis of the Banding Technique
Suppose we use b bands of r rows each, and suppose that a particular pair of documents has
Jaccard similarity s. The probability the minhash signatures for these documents agree in any one
particular row of the signature matrix is s. We can calculate the probability that these documents (or
rather their signatures) become a candidate pair as follows:
• The probability that the signatures agree in all r rows of one particular band is s^r.
• The probability that the signatures disagree in at least one row of a particular band is 1 − s^r.
• The probability that the signatures disagree in at least one row of every one of the b bands is (1 − s^r)^b.
• The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is 1 − (1 − s^r)^b.



It may not be obvious, but regardless of the chosen constants b and r, this function has the
form of an S-curve, as suggested in Fig. 3.8. The threshold, that is, the value of similarity s at which
the probability of becoming a candidate is 1/2, is a function of b and r. The threshold is roughly
where the rise is the steepest, and for large b and r we find that pairs with similarity above the
threshold are very likely to become candidates, while those below the threshold are unlikely to
become candidates – exactly the situation we want. An approximation to the threshold is (1/b)^(1/r). For
example, if b = 16 and r = 4, then the threshold is approximately at s = 1/2, since the 4th root of 1/16
is 1/2.
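A short check of these formulas in Python (the similarity values chosen are purely illustrative):

def candidate_probability(s: float, b: int, r: int) -> float:
    """Probability that a pair with Jaccard similarity s becomes a candidate
    when the signature matrix is split into b bands of r rows each."""
    return 1 - (1 - s ** r) ** b

b, r = 16, 4
print((1 / b) ** (1 / r))                      # approximate threshold: 0.5
for s in (0.2, 0.4, 0.5, 0.6, 0.8):
    print(s, round(candidate_probability(s, b, r), 4))
# The values trace the S-curve: small well below the threshold, close to 1 well above it.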
2.3.3 Combining the Techniques
We can now give an approach to finding the set of candidate pairs for similar documents and then
discovering the truly similar documents among them. It must be emphasized that this approach can
produce false negatives – pairs of similar documents that are not identified as such because they never
become a candidate pair. There will also be false positives – candidate pairs that are evaluated, but
are found not to be sufficiently similar.
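To make the combination concrete, here is a compact end-to-end sketch. It reuses the illustrative helpers defined earlier (k_shingles, lsh_candidate_pairs, jaccard_similarity); the minhash step uses random hash functions of the form (a*x + b) mod p applied to Python's built-in hash of each shingle, which is one common way to simulate row permutations and is an assumption of this sketch, not a prescription of the notes:

import random

def make_hash_funcs(n, p=4294967311):
    """n random hash functions (a*x + b) mod p; p is a prime just above 2^32."""
    rng = random.Random(42)
    return [(lambda x, a=rng.randrange(1, p), b=rng.randrange(p): (a * hash(x) + b) % p)
            for _ in range(n)]

def minhash_signature(shingle_set, hash_funcs):
    """One minhash value per hash function: the minimum hash over the whole set."""
    return [min(h(s) for s in shingle_set) for h in hash_funcs]

def find_similar(docs, k=5, b=20, r=5, threshold=0.8):
    """docs: dict mapping name -> text. Returns candidate pairs whose shingle
    sets actually meet the similarity threshold (false positives filtered out)."""
    shingles = {name: k_shingles(text, k) for name, text in docs.items()}
    funcs = make_hash_funcs(b * r)                # one signature row per hash function
    sigs = {name: minhash_signature(s, funcs) for name, s in shingles.items()}
    result = []
    for x, y in lsh_candidate_pairs(sigs, b, r):  # candidate pairs from banding
        sim = jaccard_similarity(shingles[x], shingles[y])
        if sim >= threshold:
            result.append((x, y, sim))
    return result

Pairs of similar documents that never become candidates under the banding step are the false negatives mentioned above; they are never seen by the final filtering loop.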
