
MINING MASSIVE DATASETS
(20DS7T07)
R20 :: CSE (DATA SCIENCE) :: IV–I Semester
LECTURE NOTES

UNIT II

Applications of Set Similarity: Jaccard Similarity of Sets, Similarity of Documents, Collaborative Filtering as a Similar-Sets Problem.
Shingling of Documents: k-Shingles, Choosing the Shingle Size, Hashing Shingles, Shingles Built from Words.
Locality-Sensitive Hashing for Documents: LSH for Minhash Signatures, Analysis of the Banding Technique, Combining the Techniques.

2.1 Applications of Set Similarity


We can measure the similarity of sets by the relative size of their intersection. This notion of similarity
is called Jaccard similarity.
2.1.1 Jaccard Similarity of Sets:
The Jaccard similarity of sets S and T is |S ∩ T |/|S ∪ T |, that is, the ratio of the size of the
intersection of S and T to the size of their union. We shall denote the Jaccard similarity of S and T
by SIM(S, T).

For example, suppose S = {a, b, c, d, e} and T = {c, d, e, f, g, h}. There are three elements in their
intersection and a total of eight elements that appear in S or T or both. Thus, SIM(S, T) = 3/8.
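As a minimal sketch, this definition translates directly into Python (the function name jaccard_similarity and the example sets are illustrative, not part of the notes):

def jaccard_similarity(s: set, t: set) -> float:
    """Return |S ∩ T| / |S ∪ T| for two sets."""
    if not s and not t:
        return 0.0  # convention for two empty sets
    return len(s & t) / len(s | t)

# The example above: 3 shared elements out of 8 in the union.
S = {"a", "b", "c", "d", "e"}
T = {"c", "d", "e", "f", "g", "h"}
print(jaccard_similarity(S, T))  # 0.375, i.e. 3/8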



2.1.2 Similarity of Documents:
An important class of problems that Jaccard similarity addresses well is that of finding
textually similar documents in a large corpus such as the Web or a collection of news articles.
Many important uses of textual similarity involve finding duplicates or near
duplicates. First, let us observe that testing whether two documents are exact duplicates is easy; just
compare the two documents character-by-character, and if they ever differ then they are not the
same. However, in many applications, the documents are not identical, yet they share large portions
of their text. Here are some examples:
Plagiarism
Finding plagiarized documents tests our ability to find textual similarity. A plagiarizer may
extract only some parts of a document for his own use. He may alter a few words and may alter the order
in which sentences of the original appear. Yet the resulting document may still contain much of the
original. No simple process of comparing documents character by character will detect a
sophisticated plagiarism.
Mirror Pages
It is common for important or popular Web sites to be duplicated at a number of hosts, in
order to share the load. The pages of these mirror sites will be quite similar, but are rarely identical.
A related phenomenon is the reuse of Web pages from one academic class to another. These pages
might include class notes, assignments, and lecture slides. Successive versions might change the name of
the course and the year, and make small changes from year to year. It is important to be able to detect
similar pages of these kinds, because search engines produce better results if they avoid showing two
pages that are nearly identical within the first page of results.
Articles from the Same Source
It is common for one reporter to write a news article that gets distributed, say through the
Associated Press, to many newspapers, which then publish the article on their Web sites. Each
newspaper changes the article somewhat.
They may cut out paragraphs, or even add material of their own. They most likely will surround the
article by their own logo, ads, and links to other articles at their site. However, the core of each
newspaper’s page will be the original article.
News aggregators, such as Google News, try to find all versions of such an article, in order to show
only one, and that task requires finding when two Web pages are textually similar, although not identical.
2.1.3 Collaborative Filtering as a Similar-Sets Problem:
Another class of applications where similarity of sets is very important is called collaborative
filtering, a process whereby we recommend to users items that were liked by other users who have
exhibited similar tastes.
Here are some common examples:
On-Line Purchases
Amazon.com has millions of customers and sells millions of items. Its database records which
items have been bought by which customers. We can say two customers are similar if their sets of
purchased items have a high Jaccard similarity. Likewise, two items that have sets of purchasers
with high Jaccard similarity will be deemed similar.
Even a Jaccard similarity like 20% might be unusual enough to identify customers with
similar tastes. The same observation holds for items; Jaccard similarities need not be very high to be
significant.
Movie Ratings
Netflix records which movies each of its customers rented, and also the ratings assigned to
those movies by the customers. We can regard movies as similar if they were rented or rated highly
by many of the same customers, and see customers as similar if they rented or rated highly many of
the same movies.
Similarities need not be high to be significant, and clustering movies by genre will make
things easier.

2.2 Shingling of Documents


The most effective way to represent documents as sets, for the purpose of identifying lexically
similar documents, is to construct from the document the set of short strings that appear within it.
The simplest and most common approach is shingling.
2.2.1 k-Shingles
A document is a string of characters. Define a k-shingle for a document to be any substring of
length k found within the document. Then, we may associate with each document the set of k-
shingles that appear one or more times within that document.
Example 3.3 : Suppose our document D is the string abcdabd, and we pick k = 2. Then the set of 2-
shingles for D is {ab, bc, cd, da, bd}.



Note that the substring ab appears twice within D, but appears only once as a shingle. A
variation of shingling produces a bag, rather than a set, so each shingle would appear in the result as
many times as it appears in the document.
There are several options regarding how white space (blank, tab, newline, etc.) is treated. It
probably makes sense to replace any sequence of one or more white-space characters by a single
blank. That way, we distinguish shingles that cover two or more words from those that do not.
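A small Python sketch of k-shingling with this white-space normalization (the helper name k_shingles is illustrative):

import re

def k_shingles(document: str, k: int) -> set:
    """Return the set of k-shingles of a document, collapsing any run of
    white-space characters (blank, tab, newline, ...) into a single blank."""
    text = re.sub(r"\s+", " ", document)
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# Example 3.3: the 2-shingles of "abcdabd" are {ab, bc, cd, da, bd}.
print(k_shingles("abcdabd", 2))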
2.2.2 Choosing the Shingle Size
We can pick k to be any constant we like. However, if we pick k too small, then we would
expect most sequences of k characters to appear in most documents.
If so, then we could have documents whose shingle-sets had high Jaccard similarity, yet the
documents had none of the same sentences or even phrases. As an extreme example, if we use k = 1,
most Web pages will have most of the common characters and few other characters, so almost all
Web pages will have high similarity.
How large k should be depends on how long typical documents are and how large the set of
typical characters is. The important thing to remember is:
• k should be picked large enough that the probability of any given shingle appearing in any given
document is low.
Thus, if our corpus of documents is emails, picking k = 5 should be fine. To see why, suppose
that only letters and a general white-space character appear in emails (although in practice, most of
the printable ASCII characters can be expected to appear occasionally). If so, then there would be
27^5 = 14,348,907 possible shingles. Since the typical email is much shorter than 14 million
characters, we would expect k = 5 to work well, and indeed it does.


2.2.3 Hashing Shingles
Instead of using substrings directly as shingles, we can pick a hash function that maps strings
of length k to some number of buckets and treat the resulting bucket number as the shingle. The set
representing a document is then the set of integers that are bucket numbers of one or more k-shingles
that appear in the document.
For instance, we could construct the set of 9-shingles for a document and then map each of
those 9-shingles to a bucket number in the range 0 to 2^32 − 1. Thus, each shingle is represented by
four bytes instead of nine. Not only has the data been compacted, but we can now manipulate
(hashed) shingles by single-word machine operations.
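As a rough sketch (assuming Python's built-in hash as the hash function; any function mapping strings to integers would do), hashing 9-shingles down to 4-byte bucket numbers might look like this:

def hashed_shingles(document: str, k: int = 9) -> set:
    """Map each k-shingle to a bucket number in the range 0 to 2^32 - 1,
    and represent the document by the set of bucket numbers."""
    shingles = {document[i:i + k] for i in range(len(document) - k + 1)}
    return {hash(s) & 0xFFFFFFFF for s in shingles}  # keep only the low 32 bits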



Notice that we can differentiate documents better if we use 9-shingles and hash them down to
four bytes than to use 4-shingles, even though the space used to represent a shingle is the same.
2.2.4 Shingles Built from Words
An alternative form of shingle has proved effective for the problem of identifying similar
news articles. The exploitable distinction for this problem is that the news articles are written in a
rather different style than are other elements that typically appear on the page with the article. News
articles, and most prose, have a lot of stop words, the most common words such as “and,” “you,”
“to,” and so on. In many applications, we want to ignore stop words, since they don’t tell us
anything useful about the article, such as its topic.
However, for the problem of finding similar news articles, it was found that defining a shingle
to be a stop word followed by the next two words, regardless of whether or not they were stop
words, formed a useful set of shingles. The advantage of this approach is that the news article would
then contribute more shingles to the set representing the Web page than would the surrounding
elements.
Recall that the goal of the exercise is to find pages that had the same articles, regardless of the
surrounding elements. By biasing the set of shingles in favor of the article, pages with the same
article and different surrounding material have higher Jaccard similarity than pages with the same
surrounding material but with a different article.
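A sketch of this word-based shingling in Python (the stop-word list here is a tiny illustrative sample, not a standard list):

STOP_WORDS = {"a", "and", "for", "the", "to", "you", "that", "it", "is"}  # illustrative sample

def stop_word_shingles(text: str) -> set:
    """Each shingle is a stop word followed by the next two words,
    whether or not those two words are themselves stop words."""
    words = text.lower().split()
    return {" ".join(words[i:i + 3])
            for i in range(len(words) - 2)
            if words[i] in STOP_WORDS}

print(stop_word_shingles("A spokesperson for the Sudzo Corporation revealed today"))
# {'a spokesperson for', 'for the sudzo', 'the sudzo corporation'}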

2.3 Locality-Sensitive Hashing for Documents


Even though we can use minhashing to compress large documents into small signatures and
preserve the expected similarity of any pair of documents, it still may be impossible to find the pairs
with greatest similarity efficiently. The reason is that the number of pairs of documents may be too
large, even if there are not too many documents.
If our goal is to compute the similarity of every pair, there is nothing we can do to reduce the
work, although parallelism can reduce the elapsed time. However, often we want only the most
similar pairs or all pairs that are above some lower bound in similarity. If so, then we need to focus
our attention only on pairs that are likely to be similar, without investigating every pair. There is a
general theory of how to provide such focus, called locality-sensitive hashing (LSH) or near-neighbor
search.
2.3.1 LSH for Minhash Signatures
One general approach to LSH is to “hash” items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are. We then consider any
pair that hashed to the same bucket for any of the hashings to be a candidate pair.
We check only the candidate pairs for similarity. The hope is that most of the dissimilar pairs
will never hash to the same bucket, and therefore will never be checked. Those dissimilar pairs that
do hash to the same bucket are false positives; we hope these will be only a small fraction of all pairs.
We also hope that most of the truly similar pairs will hash to the same bucket under at least one of the
hash functions. Those that do not are false negatives; we hope these will be only a small fraction of
the truly similar pairs.
If we have minhash signatures for the items, an effective way to choose the hashings is to
divide the signature matrix into b bands consisting of r rows each. For each band, there is a hash
function that takes vectors of r integers (the portion of one column within that band) and hashes them
to some large number of buckets. We can use the same hash function for all the bands, but we use a
separate bucket array for each band, so columns with the same vector in different bands will not hash
to the same bucket.
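A minimal sketch of this banding step on a minhash signature matrix (the function name lsh_candidate_pairs is illustrative, and hashing the Python tuple of r values stands in for the per-band hash function described above):

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """signatures: dict mapping document id -> list of b*r minhash values.
    Returns the candidate pairs: pairs whose signatures land in the same
    bucket in at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)            # separate bucket array for each band
        for doc, sig in signatures.items():
            chunk = tuple(sig[band * r:(band + 1) * r])
            buckets[chunk].append(doc)         # hash the r-row slice of this column
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates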
2.3.2 Analysis of the Banding Technique
Suppose we use b bands of r rows each, and suppose that a particular pair of documents has
Jaccard similarity s. The probability the minhash signatures for these documents agree in any one
particular row of the signature matrix is s. We can calculate the probability that these documents (or
rather their signatures) become a candidate pair as follows:
• The probability that the signatures agree in all r rows of one particular band is s^r.
• The probability that the signatures disagree in at least one row of a particular band is 1 − s^r.
• The probability that the signatures disagree in at least one row of every one of the b bands is (1 − s^r)^b.
• The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is 1 − (1 − s^r)^b.



It may not be obvious, but regardless of the chosen constants b and r, this function has the
form of an S-curve, as suggested in Fig. 3.8. The threshold, that is, the value of similarity s at which
the probability of becoming a candidate is 1/2, is a function of b and r. The threshold is roughly
where the rise is the steepest, and for large b and r we find that pairs with similarity above the
threshold are very likely to become candidates, while those below the threshold are unlikely to
become candidates – exactly the situation we want. An approximation to the threshold is (1/b)^(1/r). For
example, if b = 16 and r = 4, then the threshold is approximately at s = 1/2, since the 4th root of 1/16
is 1/2.
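A short check of these formulas in Python (the similarity values chosen are purely illustrative):

def candidate_probability(s: float, b: int, r: int) -> float:
    """Probability that a pair with Jaccard similarity s becomes a candidate
    when the signature matrix is split into b bands of r rows each."""
    return 1 - (1 - s ** r) ** b

b, r = 16, 4
print((1 / b) ** (1 / r))                      # approximate threshold: 0.5
for s in (0.2, 0.4, 0.5, 0.6, 0.8):
    print(s, round(candidate_probability(s, b, r), 4))
# The values trace the S-curve: small well below the threshold, close to 1 well above it.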
2.3.3 Combining the Techniques
We can now give an approach to finding the set of candidate pairs for similar documents and then
discovering the truly similar documents among them. It must be emphasized that this approach can
produce false negatives – pairs of similar documents that are not identified as such because they never
become a candidate pair. There will also be false positives – candidate pairs that are evaluated, but
are found not to be sufficiently similar.
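To make the combination concrete, here is a compact end-to-end sketch. It reuses the illustrative helpers defined earlier (k_shingles, lsh_candidate_pairs, jaccard_similarity); the minhash step uses random hash functions of the form (a*x + b) mod p applied to Python's built-in hash of each shingle, which is one common way to simulate row permutations and is an assumption of this sketch, not a prescription of the notes:

import random

def make_hash_funcs(n, p=4294967311):
    """n random hash functions (a*x + b) mod p; p is a prime just above 2^32."""
    rng = random.Random(42)
    return [(lambda x, a=rng.randrange(1, p), b=rng.randrange(p): (a * hash(x) + b) % p)
            for _ in range(n)]

def minhash_signature(shingle_set, hash_funcs):
    """One minhash value per hash function: the minimum hash over the whole set."""
    return [min(h(s) for s in shingle_set) for h in hash_funcs]

def find_similar(docs, k=5, b=20, r=5, threshold=0.8):
    """docs: dict mapping name -> text. Returns candidate pairs whose shingle
    sets actually meet the similarity threshold (false positives filtered out)."""
    shingles = {name: k_shingles(text, k) for name, text in docs.items()}
    funcs = make_hash_funcs(b * r)                # one signature row per hash function
    sigs = {name: minhash_signature(s, funcs) for name, s in shingles.items()}
    result = []
    for x, y in lsh_candidate_pairs(sigs, b, r):  # candidate pairs from banding
        sim = jaccard_similarity(shingles[x], shingles[y])
        if sim >= threshold:
            result.append((x, y, sim))
    return result

Pairs of similar documents that never become candidates under the banding step are the false negatives mentioned above; they are never seen by the final filtering loop.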
