MMD2
Elaborate on all the questions in detail for 7 marks (including main side headings and
bullet points)
UNIT-2
Assignment Questions
Set-A
1) Describe Jaccard Similarity of Sets?
2) Define k-Shingles and how to choose the Shingle size
3) Write about LSH for Minhash Signatures?
Jaccard Similarity is a measure used to compare the similarity between two sets, focusing on the
overlap in their elements. It's often used in applications where set similarity is meaningful, like in
document comparison, image analysis, or collaborative filtering.
$$\mathrm{SIM}(S, T) = \frac{|S \cap T|}{|S \cup T|}$$
where $|S \cap T|$ is the number of elements common to both sets, and $|S \cup T|$ is the total number
of unique elements across both sets.
Explanation:
This ratio indicates how much overlap there is between two sets relative to the total
number of unique items.
A similarity of 1 means the sets are identical, while a similarity of 0 means they have no
elements in common.
Applications:
Plagiarism Detection: Finding similarities between texts to detect copied content.
Mirror Pages: Identifying duplicate web pages across multiple sites.
Collaborative Filtering: For recommending items to users based on shared interests.
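The formula can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation; the function name `jaccard_similarity` and the choice to return 1.0 for two empty sets are assumptions made for illustration.

```python
def jaccard_similarity(s: set, t: set) -> float:
    """Return |S ∩ T| / |S ∪ T| for two sets."""
    union = s | t
    if not union:
        # Both sets are empty; treating them as identical is a convention assumed here.
        return 1.0
    return len(s & t) / len(union)


s = {"a", "b", "c", "d"}
t = {"c", "d", "e"}
print(jaccard_similarity(s, t))  # 2 common elements out of 5 unique ones -> 0.4
```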
Printed using Save ChatGPT as PDF, powered by PDFCrowd HTML to PDF API. 1/13
Shingling is a process for converting a document into a set of substrings, known as k-shingles, which
is particularly useful for comparing documents in a structured way.
Definition of k-Shingles:
A k-shingle is a substring of length k within a document.
Each unique k-length sequence in the document represents one shingle.
Example: For a document with the string "abcdabd" and k = 2, the 2-shingles are: { "ab", "bc",
"cd", "da", "bd" }
Special Considerations:
Word-Based Shingles: In some applications like news articles, shingles can be word-based
rather than character-based. One such approach defines a shingle as a stop word (a common
word such as "and" or "to") followed by the next two words, which highlights similarities in
the article text itself rather than in surrounding page elements.
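Character-based shingling as defined above can be sketched as follows. The helper name `k_shingles` is hypothetical, and the sketch does no text normalization (case folding, whitespace collapsing), which a real pipeline would probably want.

```python
def k_shingles(text: str, k: int) -> set:
    """Return the set of all length-k substrings of text."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A set keeps each shingle once, even if it occurs several times in the text.
    return {text[i:i + k] for i in range(len(text) - k + 1)}


print(sorted(k_shingles("abcdabd", 2)))  # ['ab', 'bc', 'bd', 'cd', 'da']
```

Note that "ab" occurs twice in the string but appears only once in the result, because shingles form a set.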
Locality-Sensitive Hashing (LSH) for Minhash Signatures is a technique to efficiently find similar items
by grouping similar items into "buckets," making large-scale similarity comparisons feasible.
Concept:
LSH is used to avoid comparing all pairs in a large dataset by hashing items into buckets in
a way that similar items are more likely to end up in the same bucket.
It reduces computational complexity by narrowing down the list of candidate pairs for
similarity checks.
Steps:
Minhashing: Each document is represented by a minhash signature, a small, fixed-length
vector from which the Jaccard similarity of the original sets can be estimated.
Banding Technique: The minhash signature matrix is divided into b bands, each
containing r rows.
For each band, a hash function maps each document's r-row segment of its signature to a bucket.
Documents that hash to the same bucket in at least one band are considered candidate pairs.
Advantages of Banding:
This technique introduces an S-curve effect where:
Pairs with high similarity are very likely to be considered candidates.
Pairs with low similarity are unlikely to be considered, which optimizes performance.
False Positives and Negatives:
Some dissimilar items might occasionally hash to the same bucket (false positives).
Some truly similar pairs may not hash to the same bucket in any band (false
negatives).
Application:
LSH is particularly useful in document comparison and duplicate detection where
exhaustive pairwise comparisons would be computationally prohibitive.
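The banding step can be illustrated with a small sketch. It assumes the minhash signatures are already computed; the function name `lsh_candidate_pairs` and the example signatures are made up for illustration. The band segment itself is used as the bucket key (instead of a hash of it), so this version has no false positives from hash collisions.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures: dict, bands: int, rows: int) -> set:
    """Return pairs of document ids whose signatures agree on every row of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, signature in signatures.items():
            # The tuple of values in this band is the bucket key,
            # so only identical band segments share a bucket.
            key = tuple(signature[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for doc_ids in buckets.values():
            for pair in combinations(sorted(doc_ids), 2):
                candidates.add(pair)
    return candidates


# Hypothetical signatures of length 6, split into b = 3 bands of r = 2 rows.
signatures = {
    "doc1": [1, 5, 3, 8, 2, 7],
    "doc2": [1, 5, 4, 9, 2, 6],
    "doc3": [6, 0, 4, 1, 3, 3],
}
print(lsh_candidate_pairs(signatures, bands=3, rows=2))  # {('doc1', 'doc2')}
```

Only doc1 and doc2 agree on a whole band (the first one), so they form the single candidate pair; doc2 and doc3 share one value in the second band but not the full segment.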
These detailed explanations cover key aspects and applications, offering a comprehensive
understanding suitable for a 7-mark answer.
Set-B
1) Write about Similarity of Documents?
2) Explain how Shingles are built from Words?
3) Report the Analysis of Banding Technique?
Set-C
1) Explain Collaborative Filtering as a Similar-Sets Problem
2) Describe the matrix representation of sets?
3) Explain Hashing Shingles?
Set-B
Document Similarity is the measure of how alike two documents are, generally based on their
textual content. This is critical in tasks like identifying duplicate or near-duplicate documents,
clustering documents by topic, and recommendation systems.
Applications:
Plagiarism Detection: Identifying sections of text that have been copied from other
sources.
Search Engines: Grouping similar pages to avoid showing nearly identical results.
Content Aggregation: Clustering articles on the same topic to provide a single version of
repeated information.
Word-Based Shingling involves creating shingles based on sequences of words rather than individual
characters, which is particularly effective for documents like articles or news reports.
Example: For the sentence "Data science is amazing", if we choose 2-word shingles:
Possible shingles: {"Data science", "science is", "is amazing"}.
Advantages:
Preserves semantic meaning better than character-based shingles.
More effective for larger texts like news articles, where phrases are more meaningful than
individual characters.
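Word-based shingling can be sketched in the same way as character shingling, using a sliding window over words. The function name `word_shingles` is hypothetical, and splitting on whitespace is a simplifying assumption (punctuation is not stripped).

```python
def word_shingles(text: str, k: int) -> set:
    """Return the set of all runs of k consecutive words in text."""
    words = text.split()  # naive whitespace tokenization, assumed for illustration
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


print(sorted(word_shingles("Data science is amazing", 2)))
# ['Data science', 'is amazing', 'science is']
```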
The Banding Technique is part of the Locality-Sensitive Hashing (LSH) process, used to efficiently find
pairs of similar items by reducing the number of comparisons.
Overview:
Minhash signatures of documents are divided into bands, each containing a set of rows.
Each band has its own hash function, which groups similar items into the same bucket.
Threshold Calculation:
The similarity at which a pair becomes likely to share a bucket in at least one band is approximately $(1/b)^{1/r}$,
where b is the number of bands and r is the number of rows in each band.
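The threshold approximation is easy to evaluate numerically. The sketch below is only an illustration; the function name `approximate_threshold` and the choice of b = 20, r = 5 are assumptions, not values taken from the questions above.

```python
def approximate_threshold(bands: int, rows: int) -> float:
    """Approximate similarity threshold (1/b)^(1/r) for b bands of r rows each."""
    return (1 / bands) ** (1 / rows)


print(round(approximate_threshold(20, 5), 3))  # 0.549
```

With 20 bands of 5 rows (signatures of length 100), pairs with Jaccard similarity well above roughly 0.55 are likely to become candidates, and pairs well below it are not.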
Set-C
Concept:
Collaborative filtering identifies users with similar interests by calculating the similarity of
sets (e.g., products purchased or movies watched) between users.
Applications:
E-commerce: Amazon recommends items by comparing the purchase history of similar
customers.
Movie Platforms: Netflix recommends movies based on similar users’ ratings and viewing
history.
In Matrix Representation of sets, a collection of sets is represented as a binary matrix (the characteristic matrix),
allowing similarities and transformations to be computed over its rows and columns.
Structure:
Each column represents a set (e.g., a document or customer).
Each row represents an element of the universal set (e.g., a shingle or purchased product).
Binary Encoding: An entry is 1 if the row's element is a member of the column's set; otherwise, it’s 0.
Advantages:
Compact representation for large datasets.
Allows vectorized operations for efficient similarity computation across large-scale datasets.
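A small sketch of the binary encoding: each set becomes a 0/1 vector with one position per element of the universe, and Jaccard similarity is read off by comparing positions. The function names and the four example sets are assumptions for illustration; plain lists are used instead of a numeric library to keep the sketch self-contained.

```python
def characteristic_vectors(sets: dict) -> tuple:
    """Return (sorted universe, {name: 0/1 vector over that universe})."""
    universe = sorted(set().union(*sets.values()))
    vectors = {
        name: [1 if element in members else 0 for element in universe]
        for name, members in sets.items()
    }
    return universe, vectors


def jaccard_from_vectors(u: list, v: list) -> float:
    """Jaccard similarity = positions where both are 1 / positions where either is 1."""
    both = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    return both / either if either else 1.0


sets = {"S1": {"a", "d"}, "S2": {"c"}, "S3": {"b", "d", "e"}, "S4": {"a", "c", "d"}}
universe, vectors = characteristic_vectors(sets)
print(universe)       # ['a', 'b', 'c', 'd', 'e']
print(vectors["S1"])  # [1, 0, 0, 1, 0]
print(jaccard_from_vectors(vectors["S1"], vectors["S4"]))  # 2 shared positions out of 3
```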
Hashing Shingles is a process of converting shingles into hash values, which simplifies comparison
and storage for large-scale document similarity calculations.
Hashing Process:
Each shingle (substring or word sequence) is hashed to a bucket number, which then stands in for the shingle.
Bucketed Shingles: For example, if using 9-character shingles, each can be hashed into a
bucket number in the range 0 to $2^{32} - 1$, so it fits in 4 bytes instead of 9.
Purpose of Hashing:
Efficiency: Hashing reduces storage by representing shingles as compact numbers rather
than text strings.
Uniformity: A good hash function spreads shingles evenly across buckets, so two different
shingles rarely receive the same bucket number, which is especially useful when comparing large datasets.
Applications:
Similarity Comparisons: Hashed shingles allow faster comparison of document similarity
in large datasets.
Memory Optimization: Reduces memory usage since hashed shingles require less storage
than raw shingles.
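Hashing shingles to 32-bit bucket numbers can be sketched with the standard library. Using `zlib.crc32` as the hash function is an assumption made here for convenience (it returns an unsigned 32-bit integer in Python 3); any well-distributed 32-bit hash would serve, and the function name `hashed_shingles` is hypothetical.

```python
import zlib


def hashed_shingles(text: str, k: int = 9) -> set:
    """Return the set of 32-bit hash values of all k-character shingles of text."""
    return {
        zlib.crc32(text[i:i + k].encode("utf-8"))  # value in the range 0 .. 2**32 - 1
        for i in range(len(text) - k + 1)
    }


text = "the quick brown fox jumps"
ids = hashed_shingles(text)
print(len(ids), all(0 <= h < 2**32 for h in ids))
```

Two documents are then compared through their sets of integers rather than their sets of strings.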
Regular
1. What is Jaccard similarity in set theory, and how is it applied to assess the similarity between
sets?
2. Define banding, and how does banding reduce false positives or negatives in detecting
document similarities?
3. How does set similarity, particularly Jaccard similarity, apply to collaborative filtering problems?
4. Explain the process of combining various techniques, such as shingling, hashing, and LSH, in
document similarity analysis
Regular
1. What is Jaccard Similarity in Set Theory, and How Is It Applied to Assess the Similarity
Between Sets?
Jaccard Similarity is a metric in set theory that quantifies the similarity between two sets by
measuring the overlap of elements in proportion to the total unique elements.
Definition:
$$\text{Jaccard Similarity} = \frac{|S \cap T|}{|S \cup T|}$$
where S and T are two sets, $|S \cap T|$ is the count of shared elements, and $|S \cup T|$ is the total
count of unique elements.
Interpretation:
A Jaccard similarity of 1 indicates that the sets are identical.
A similarity of 0 means no shared elements.
Applications:
Document Similarity: Used to identify duplicate or near-duplicate documents based on
shared words or phrases.
Image Analysis: Compares visual features in computer vision.
Collaborative Filtering: Finds users with similar tastes by comparing item sets they have
interacted with.
2. Define Banding, and How Does Banding Reduce False Positives or Negatives in Detecting
Document Similarities?
Banding is a technique used in Locality-Sensitive Hashing (LSH) to efficiently identify similar items
without exhaustively comparing all pairs, particularly for large datasets.
Definition of Banding:
The minhash signature of each document is divided into several bands, each containing a
subset of rows from the signature matrix.
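Each band is hashed separately, so documents whose signatures are identical within a band land in the same
bucket.
Effectiveness:
The banding technique produces an S-curve: pairs with similarity above a threshold of about
$(1/b)^{1/r}$ (b bands, r rows per band) are very likely to become candidates, while pairs
below it rarely do, reducing unnecessary comparisons.
Using more bands lowers the threshold and reduces false negatives; using more rows per band
raises it and reduces false positives.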
3. How Does Set Similarity, Particularly Jaccard Similarity, Apply to Collaborative Filtering
Problems?
Collaborative Filtering recommends items to users based on similar behaviors or preferences, often
modeled as set similarity problems.
Applications:
E-commerce: Amazon uses collaborative filtering to recommend products by identifying
similar users based on purchase history.
Streaming Services: Netflix identifies similar viewing patterns among users for movie
recommendations.
4. Explain the Process of Combining Various Techniques, Such as Shingling, Hashing, and LSH, in
Document Similarity Analysis
The combination of shingling, hashing, and Locality-Sensitive Hashing (LSH) techniques allows for
efficient and scalable document similarity analysis.
1. Shingling: Documents are converted into sets of short sequences (shingles), which serve as
the basis for similarity comparisons.
Each document is broken into k-shingles, representing unique k-length sequences of
characters or words.
2. Hashing: Each shingle is hashed to a compact integer (bucket number), which reduces
storage and speeds up comparison.
3. Minhashing: Each document's shingle set is converted into a short minhash signature that
approximately preserves the Jaccard similarity between the original shingle sets.
This step reduces each document’s representation while retaining its similarity
properties.
4. Locality-Sensitive Hashing (LSH): The minhash signatures are divided into bands to
quickly identify candidate pairs.
Similar documents are grouped by hashing each band, where only candidate pairs
(those in the same bucket) are compared.
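The shingling, hashing, and minhashing stages can be sketched together as below. This is an illustrative sketch under several assumptions: `zlib.crc32` stands in for the shingle hash, the minhash functions are random linear functions (a*x + b) mod p with p = 2^61 - 1 instead of true row permutations, and all function names are hypothetical. Texts shorter than k produce an empty shingle set, which this sketch does not handle.

```python
import random
import zlib

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the illustrative hash family


def shingle_ids(text: str, k: int) -> set:
    """Shingle the text into k-character substrings and hash each to a 32-bit integer."""
    return {zlib.crc32(text[i:i + k].encode("utf-8")) for i in range(len(text) - k + 1)}


def make_hash_params(n: int, seed: int = 0) -> list:
    """Draw n (a, b) pairs defining hash functions h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
            for _ in range(n)]


def minhash_signature(ids: set, params: list) -> list:
    """For each hash function, keep the smallest hash value over the set."""
    return [min((a * x + b) % MERSENNE_PRIME for x in ids) for a, b in params]


def estimated_similarity(sig1: list, sig2: list) -> float:
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)


params = make_hash_params(100)
doc1 = shingle_ids("the quick brown fox jumps over the lazy dog", 5)
doc2 = shingle_ids("the quick brown fox jumped over the lazy dog", 5)
true_sim = len(doc1 & doc2) / len(doc1 | doc2)
estimate = estimated_similarity(minhash_signature(doc1, params),
                                minhash_signature(doc2, params))
print(round(true_sim, 2), round(estimate, 2))
```

The estimate fluctuates around the true Jaccard similarity and tightens as the number of hash functions grows. The resulting signatures are what the banding step would then split into bands.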
Supply
1. What does Jaccard Similarity measure in sets, and how is it calculated?
2. How does the choice of shingle size impact the accuracy of similarity comparisons between
documents?
3. Define k-shingles and their role in document analysis.
4. Define LSH for Minhash Signatures and its role in document similarity analysis.
Supply
1. What Does Jaccard Similarity Measure in Sets, and How Is It Calculated?
Jaccard Similarity measures the degree of overlap between two sets, capturing how similar two sets
are based on their common elements relative to their total unique elements.
Purpose:
Jaccard Similarity is commonly used in set theory to measure the similarity between two
sets by examining their shared elements.
Calculation:
$$\text{Jaccard Similarity} = \frac{|S \cap T|}{|S \cup T|}$$
where:
S and T are the two sets being compared,
$|S \cap T|$ is the count of elements common to both sets,
$|S \cup T|$ is the total count of unique elements across both sets.
Range:
Values range from 0 (no similarity) to 1 (identical sets), making it easy to interpret similarity
levels between sets.
2. How Does the Choice of Shingle Size Impact the Accuracy of Similarity Comparisons Between
Documents?
The choice of shingle size (k) is crucial in determining the effectiveness of document similarity
analysis, as it impacts the granularity of the comparison.
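If k is too small, most shingles appear in most documents, so even unrelated documents look
similar; if k is too large, small edits change many shingles and near-duplicates can be missed.
For example, in typical text documents, a shingle size of 5–10 characters (or 2–3 words)
often captures meaningful similarities without excessive false positives.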
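The effect of k can be seen on two unrelated sentences. This is a small sketch with made-up example strings and a hypothetical helper `jaccard_for_k`; it only illustrates the tendency, not a rule for picking k.

```python
def jaccard_for_k(a: str, b: str, k: int) -> float:
    """Jaccard similarity of the k-character shingle sets of two strings."""
    sa = {a[i:i + k] for i in range(len(a) - k + 1)}
    sb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


a = "the quick brown fox"
b = "a lazy dog sleeps"
for k in (1, 5):
    print(k, jaccard_for_k(a, b, k))
# k = 1 gives a nonzero score (shared characters such as the space, "e" and "o"),
# while k = 5 gives 0.0 because the sentences share no 5-character sequence.
```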
3. Define k-Shingles and Their Role in Document Analysis
k-Shingles are substrings of length k extracted from a document, which serve as units for measuring
similarity between documents.
Definition:
A k-shingle is any sequence of k consecutive characters (or words) within a document.
Examples:
For a document with the string "data science", if k = 3, the shingles would be:
{"dat", "ata", "ta ", "a s", " sc", "sci", "cie", "ien", "enc", "nce"}.
4. Define LSH for Minhash Signatures and Its Role in Document Similarity Analysis
Locality-Sensitive Hashing (LSH) for Minhash Signatures is a technique for efficiently identifying
similar documents in large datasets by creating “candidate pairs” for comparison.
Definition of LSH:
LSH is a method in which minhash signatures are divided into several “bands” of rows,
and each band is hashed to buckets separately.
Documents with similar minhash signatures are more likely to hash to the same bucket
within a band, forming candidate pairs for similarity checks.
LSH enables scalable similarity detection by focusing resources only on probable matches, which is
essential in mining massive datasets.
Elaborate on all the questions in detail for 10 marks (including main side headings and
bullet points; answers will be given from the PDF uploaded above, and also provide relevant examples)
1. Explain Collaborative Filtering as a Similar-Sets Problem
2. Explain how Shingles are built from Words?
3. Write about LSH for Minhash Signatures?
4. Report the Analysis of Banding Technique?
Here's a detailed response based on the provided PDF and expanded explanations:
1. Explain Collaborative Filtering as a Similar-Sets Problem
Overview:
Collaborative filtering focuses on finding users or items with high overlap in preferences or
behaviors.
Users with similar interaction patterns are grouped together, and recommendations are
made by suggesting items that similar users have interacted with.
$$\text{Jaccard Similarity} = \frac{|U \cap V|}{|U \cup V|}$$
where U and V are the sets of items that two users have interacted with.
Applications:
E-commerce: Recommending products based on similar users’ purchase history.
Streaming Services: Suggesting shows or movies based on shared viewing patterns.
Example:
Consider two users:
User A's purchases: {"Laptop", "Mouse", "Keyboard"}
User B's purchases: {"Laptop", "Keyboard", "Headphones"}
Jaccard Similarity:
Common items: {"Laptop", "Keyboard"}
Total unique items: {"Laptop", "Mouse", "Keyboard", "Headphones"}
Similarity score = 2/4 = 0.5
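The worked example can be reproduced in a few lines. The `recommend` helper is a hypothetical illustration of how a similar user's items could be turned into suggestions; the source does not prescribe a particular recommendation rule.

```python
def jaccard(u: set, v: set) -> float:
    """Jaccard similarity of two users' item sets."""
    union = u | v
    return len(u & v) / len(union) if union else 1.0


def recommend(target: set, neighbor: set) -> set:
    """Items the similar user has interacted with that the target user has not."""
    return neighbor - target


user_a = {"Laptop", "Mouse", "Keyboard"}
user_b = {"Laptop", "Keyboard", "Headphones"}
print(jaccard(user_a, user_b))    # 0.5
print(recommend(user_a, user_b))  # {'Headphones'}
```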
2. Explain How Shingles Are Built from Words
Example:
Document: "Data science is fascinating."
With a window size of 2, the shingles are:
{"Data science", "science is", "is fascinating"}
3. Write about LSH for Minhash Signatures
Benefits of LSH:
Efficiently narrows down the number of potential matches, making it suitable for large
datasets.
Reduces computation time by focusing only on probable candidates.
Example:
Two documents with similar text will likely have similar minhash signatures. If their
signatures match in one or more bands, they are flagged as a candidate pair for further
similarity checking.
4. Report the Analysis of Banding Technique
Explanation of Banding:
The minhash signature is divided into multiple bands, each containing several rows.
Each band is hashed independently, creating buckets where similar documents are
grouped.
Threshold Effect:
Banding creates a similarity threshold, allowing efficient differentiation between highly
similar and less similar pairs.
The threshold, $(1/b)^{1/r}$, determines which pairs are likely to be candidates based on the
chosen number of bands (b) and rows per band (r).
Example:
For a dataset where minhash signatures have high overlap, the banding technique allows
documents with similar structure to end up in the same bucket with higher probability,
streamlining the search for similar document pairs.
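The threshold effect follows from the standard analysis of banding: if two documents have Jaccard similarity s, their signatures agree on all r rows of a given band with probability s^r, so they become a candidate pair in at least one of b bands with probability 1 - (1 - s^r)^b. The sketch below tabulates this; the function name and the choice b = 20, r = 5 are assumptions for illustration.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that a pair with Jaccard similarity s shares a bucket in at least one band."""
    return 1 - (1 - s ** rows) ** bands


for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, bands=20, rows=5), 3))
```

The printed probabilities stay close to 0 for low similarities and climb steeply toward 1 around the threshold, which is the S-curve shape.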
The Banding Technique is an essential aspect of LSH as it optimizes the matching process for large
datasets, focusing computational resources on likely matches.