
MMD2

Uploaded by

kjsravani2003
Copyright
© © All Rights Reserved

Jaccard Similarity k-Shingles LSH

Unit II - MMD - Lecture NotesStu.pdf


PDF

Elaborate all the questions in detailed way for 7 marks(including with Main SIDE HEADINGS and
bullet points)UNIT-2
Assignment Questions
Set-A
1) Describe Jaccard Similarity of Sets?
2) Define k-Shingles and how to choose the Shingle size
3) Write about LSH for Minhash Signatures?

1) Describe Jaccard Similarity of Sets

Jaccard Similarity is a measure used to compare the similarity between two sets, focusing on the
overlap in their elements. It's often used in applications where set similarity is meaningful, like in
document comparison, image analysis, or collaborative filtering.

Definition: The Jaccard Similarity of sets S and T is defined as:

$$\mathrm{SIM}(S, T) = \frac{|S \cap T|}{|S \cup T|}$$

where ∣S ∩ T ∣ is the number of elements common to both sets, and ∣S ∪ T ∣ is the total number
of unique elements across both sets.

Explanation:
This ratio indicates how much overlap there is between two sets relative to the total
number of unique items.
A similarity of 1 means the sets are identical, while a similarity of 0 means they have no
elements in common.
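
The definition above can be sketched in a few lines of Python. The function name `jaccard` and the sample sets are illustrative assumptions, and treating two empty sets as identical is a convention chosen here, not something stated in the notes.

```python
def jaccard(s: set, t: set) -> float:
    """Return |S ∩ T| / |S ∪ T| for two sets."""
    union = s | t
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(s & t) / len(union)


S = {"a", "b", "c", "d"}
T = {"c", "d", "e"}
print(jaccard(S, T))  # 2 shared elements out of 5 distinct ones: 0.4
```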

Applications:
Plagiarism Detection: Finding similarities between texts to detect copied content.
Mirror Pages: Identifying duplicate web pages across multiple sites.
Collaborative Filtering: For recommending items to users based on shared interests.

2) Define k-Shingles and How to Choose the Shingle Size

Printed using Save ChatGPT as PDF, powered by PDFCrowd HTML to PDF API. 1/13
Shingling is a process for converting a document into a set of substrings, known as k-shingles, which
is particularly useful for comparing documents in a structured way.

Definition of k-Shingles:
A k-shingle is a substring of length k within a document.
Each unique k-length sequence in the document represents one shingle.

Example: For a document with the string "abcdabd" and k = 2, the 2-shingles are: { "ab", "bc",
"cd", "da", "bd" }

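A minimal sketch of character shingling, assuming the document is a plain Python string. The helper name `char_shingles` is hypothetical. Because the result is a set, the repeated shingle "ab" is stored only once.

```python
def char_shingles(text: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of text."""
    # A text shorter than k yields an empty set.
    return {text[i:i + k] for i in range(len(text) - k + 1)}


print(sorted(char_shingles("abcdabd", 2)))  # ['ab', 'bc', 'bd', 'cd', 'da']
```
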
Choosing the Shingle Size (k):


The value of k depends on the length and variety of the documents:
If k is too small, common character combinations may lead to false similarities
between unrelated documents.
If k is too large, minor differences may cause shingles to appear unique even in
similar documents.
Guideline: k should be large enough to minimize the probability of random overlaps
between different documents but small enough to detect actual shared phrases or
sections.

Special Considerations:
Word-Based Shingles: In some applications, such as news articles, shingles can be word-based
rather than character-based. Building each shingle around a common word (a stop word) helps the
shingles reflect the article's prose, which highlights similarities in content across articles.

3) Write about LSH for Minhash Signatures

Locality-Sensitive Hashing (LSH) for Minhash Signatures is a technique to efficiently find similar items
by grouping similar items into "buckets," making large-scale similarity comparisons feasible.

Concept:
LSH is used to avoid comparing all pairs in a large dataset by hashing items into buckets in
a way that similar items are more likely to end up in the same bucket.
It reduces computational complexity by narrowing down the list of candidate pairs for
similarity checks.

Steps:
Minhashing: Each document is represented by a minhash signature, a short, fixed-length
vector of hash values. The fraction of positions in which two signatures agree estimates the
Jaccard similarity of the original sets.
Banding Technique: The minhash signature matrix is divided into b bands, each
containing r rows.
For each band, a hash function maps each document's r values in that band to a bucket.
Documents that land in the same bucket in at least one band are considered candidate pairs.
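
The banding step can be sketched as follows, assuming signatures are already computed and stored as equal-length lists of integers. The function name `lsh_candidates`, the toy signatures, and the choice to use the band's tuple of values directly as the bucket key (instead of a separate hash function) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures: dict[str, list[int]],
                   bands: int, rows: int) -> set[tuple[str, str]]:
    """Return pairs of document ids that share a bucket in at least one band."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # The r values of this band act as the bucket key.
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[band].append(doc_id)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates


signatures = {
    "doc1": [1, 3, 0, 2, 5, 4],
    "doc2": [1, 3, 7, 2, 5, 9],
    "doc3": [8, 6, 0, 1, 2, 3],
}
# doc1 and doc2 agree on both rows of the first band only.
print(lsh_candidates(signatures, bands=3, rows=2))  # {('doc1', 'doc2')}
```

Only the pairs returned here would then be checked for actual similarity.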

Advantages of Banding:
This technique introduces an S-curve effect where:
Pairs with high similarity are very likely to be considered candidates.
Pairs with low similarity are unlikely to be considered, which optimizes performance.
False Positives and Negatives:
Some dissimilar items might occasionally hash to the same bucket (false positives).
Some truly similar pairs may not hash to the same bucket in any band (false
negatives).

Application:
LSH is particularly useful in document comparison and duplicate detection where
exhaustive pairwise comparisons would be computationally prohibitive.

These detailed explanations cover key aspects and applications, offering a comprehensive
understanding suitable for a 7-mark answer.

Set-B
1) Write about Similarity of Documents?
2) Explain how Shingles are built from Words?
3) Report the Analysis of Banding Technique?

Set-C
1) Explain Collaborative Filtering as a Similar-Sets Problem
2) Describe the matrix representation of sets?
3) Explain Hashing Shingles?

Set-B

1) Write about Similarity of Documents

Document Similarity is the measure of how alike two documents are, generally based on their
textual content. This is critical in tasks like identifying duplicate or near-duplicate documents,
clustering documents by topic, and recommendation systems.

Techniques for Measuring Similarity:


Jaccard Similarity: Measures similarity by comparing sets of unique elements (e.g., words
or shingles) from each document.
Cosine Similarity: Compares documents by treating them as vectors in a multi-dimensional
space, particularly useful when using term frequency-inverse document frequency (TF-IDF).
Minhashing and Locality-Sensitive Hashing (LSH): Reduces the dimensionality of
documents by creating signatures, which make large-scale comparisons more efficient.

Applications:
Plagiarism Detection: Identifying sections of text that have been copied from other
sources.
Search Engines: Grouping similar pages to avoid showing nearly identical results.

Content Aggregation: Clustering articles on the same topic to provide a single version of
repeated information.

2) Explain How Shingles are Built from Words

Word-Based Shingling involves creating shingles based on sequences of words rather than individual
characters, which is particularly effective for documents like articles or news reports.

Process of Building Word-Based Shingles:


Select a window size (number of words per shingle), commonly 2 or 3.
Extract sequences of words within the window and treat each sequence as a unique
shingle.

Example: For the sentence "Data science is amazing", if we choose a 2-word shingle:
Possible shingles: {"Data science", "science is", "is amazing"}.
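
The sliding-window process can be sketched in Python as below. The helper name `word_shingles` is hypothetical, and lowercasing and splitting on whitespace are simplifying assumptions; a real tokenizer would also handle punctuation.

```python
def word_shingles(text: str, k: int) -> set[tuple[str, ...]]:
    """Return the set of all runs of k consecutive words in text."""
    words = text.lower().split()  # assumption: whitespace tokenization
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


for shingle in sorted(word_shingles("Data science is amazing", 2)):
    print(" ".join(shingle))
```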

Advantages:
Preserves semantic meaning better than character-based shingles.
More effective for larger texts like news articles, where phrases are more meaningful than
individual characters.

Use of Stop Words:


Sometimes shingles are built to start with a stop word followed by content words. This
approach helps to retain the structure of phrases found in articles or formal texts.

3) Report the Analysis of Banding Technique

The Banding Technique is part of the Locality-Sensitive Hashing (LSH) process, used to efficiently find
pairs of similar items by reducing the number of comparisons.

Overview:
Minhash signatures of documents are divided into bands, each containing a fixed number of rows.
Each band is hashed separately, so documents whose values agree in every row of that band fall into the same bucket.

Function of the Technique:


Threshold Creation: Bands help to distinguish between high-similarity and low-similarity
pairs, creating an “S-curve” effect.
Candidate Generation: Only documents hashed to the same bucket in any band are
considered candidate pairs, significantly reducing comparison needs.

False Positives and Negatives:


False Positives: Dissimilar documents hashed into the same bucket by chance.
False Negatives: Similar documents not hashed to the same bucket in any band.

Threshold Calculation:
The threshold where items are likely to be in the same bucket is approximately $(1/b)^{1/r}$,
where b is the number of bands and r is the number of rows in each band.
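
To see the S-curve numerically, the sketch below evaluates the standard candidate probability 1 − (1 − s^r)^b for a pair with Jaccard similarity s, together with the approximate threshold. The function names and the choice b = 20, r = 5 are illustrative assumptions, not values taken from the notes.

```python
def candidate_probability(s: float, b: int, r: int) -> float:
    """Probability that a pair with similarity s shares a bucket in some band."""
    return 1.0 - (1.0 - s ** r) ** b


def approx_threshold(b: int, r: int) -> float:
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1.0 / b) ** (1.0 / r)


b, r = 20, 5  # assumed example values
print("threshold:", round(approx_threshold(b, r), 3))
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, b, r), 4))
```

With these values, pairs well below the threshold are rarely candidates, while pairs at similarity 0.8 almost always are.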

Set-C

1) Explain Collaborative Filtering as a Similar-Sets Problem

Collaborative Filtering is a technique used in recommendation systems where users are
recommended items based on preferences of similar users.

Concept:
Collaborative filtering identifies users with similar interests by calculating the similarity of
sets (e.g., products purchased or movies watched) between users.

Jaccard Similarity for Collaborative Filtering:


Customer-Item Relationships: Two customers are similar if the sets of items they have
engaged with (purchased, liked, etc.) have a high Jaccard similarity.
Example: If two customers both buy books from the same genres frequently, they are likely
to receive similar recommendations.

Applications:
E-commerce: Amazon recommends items by comparing the purchase history of similar
customers.
Movie Platforms: Netflix recommends movies based on similar users’ ratings and viewing
history.

2) Describe the Matrix Representation of Sets

In the Matrix Representation of sets, a collection of sets is written as a binary matrix (often
called the characteristic matrix), which makes similarity computations and transformations easy to describe.

Structure:
Each column represents one set (e.g., a document or customer).
Each row represents one element of the universal set (e.g., a shingle or purchased product).
Binary Encoding: An entry is 1 if the row's element belongs to the column's set; otherwise, it is 0.
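
A small sketch of building such a binary matrix, using the orientation with one row per element and one column per set (the orientation that matches cutting signatures into bands of rows). The function name `characteristic_matrix` and the four sample sets are assumptions for illustration.

```python
def characteristic_matrix(sets: dict[str, set[str]]):
    """Return (row labels, column labels, 0/1 matrix) for a collection of sets."""
    elements = sorted(set().union(*sets.values()))  # rows: universal set
    names = list(sets)                              # columns: one per set
    matrix = [[1 if e in sets[n] else 0 for n in names] for e in elements]
    return elements, names, matrix


sets = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}
elements, names, matrix = characteristic_matrix(sets)
print("   " + " ".join(names))
for element, row in zip(elements, matrix):
    print(element, " ", "  ".join(str(v) for v in row))
```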

Uses in Set Similarity:


Minhashing: The matrix is often transformed using minhashing to create compressed
signatures, preserving similarity for comparison.
LSH: Signatures are hashed into bands, leveraging the matrix structure for efficient
similarity checks.

Advantages:
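Gives a simple, uniform picture of many sets at once; because most entries are 0, large collections are usually stored in a sparse form rather than as a full matrix.
Allows vectorized operations for similarity computation across large-scale datasets.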

3) Explain Hashing Shingles

Hashing Shingles is a process of converting shingles into hash values, which simplifies comparison
and storage for large-scale document similarity calculations.

Hashing Process:
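Each shingle (substring or word sequence) is hashed to a bucket number that serves as its identifier.
Distinct shingles can occasionally collide, so the identifiers are not guaranteed to be unique.
Bucketed Shingles: For example, if using 9-character shingles, each can be hashed into a
bucket number in the range 0 to $2^{32} - 1$, so it fits in 4 bytes instead of 9.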
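
One way to sketch this in Python is with `zlib.crc32` from the standard library, which returns an unsigned 32-bit integer. Using CRC32 as the hash is an assumption made for illustration (the notes do not name a hash function), and the helper names are hypothetical.

```python
import zlib


def hash_shingle(shingle: str) -> int:
    """Map a shingle to a bucket number in the range 0 .. 2**32 - 1."""
    return zlib.crc32(shingle.encode("utf-8"))


def hashed_shingles(text: str, k: int = 9) -> set[int]:
    """Return the set of bucket numbers for all k-character shingles of text."""
    return {hash_shingle(text[i:i + k]) for i in range(len(text) - k + 1)}


buckets = hashed_shingles("mining of massive datasets")
print(len(buckets), "distinct bucket numbers")
```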

Purpose of Hashing:
Efficiency: Hashing reduces storage by representing shingles as compact numbers rather
than text strings.
Uniformity: A good hash function spreads shingles evenly across buckets, which keeps accidental
collisions between different shingles rare, especially useful when comparing large datasets.

Applications:
Similarity Comparisons: Hashed shingles allow faster comparison of document similarity
in large datasets.
Memory Optimization: Reduces memory usage since hashed shingles require less storage
than raw shingles.

Regular
1. What is Jaccard similarity in set theory, and how is it applied to assess the similarity between
sets.
2. Define banding, and how does banding reduce false positives or negatives in detecting
document similarities?
3. How does set similarity, particularly Jaccard similarity, apply to collaborative filtering problems?
4. Explain the process of combining various techniques, such as shingling, hashing, and LSH, in
document similarity analysis

Regular

1. What is Jaccard Similarity in Set Theory, and How Is It Applied to Assess the Similarity
Between Sets?

Jaccard Similarity is a metric in set theory that quantifies the similarity between two sets by
measuring the overlap of elements in proportion to the total unique elements.

Definition:

$$\text{Jaccard Similarity} = \frac{|S \cap T|}{|S \cup T|}$$

where S and T are two sets, ∣S ∩ T ∣ is the count of shared elements, and ∣S ∪ T ∣ is the total
count of unique elements.

Interpretation:

A Jaccard similarity of 1 indicates that the sets are identical.
A similarity of 0 means no shared elements.

Applications:
Document Similarity: Used to identify duplicate or near-duplicate documents based on
shared words or phrases.
Image Analysis: Compares visual features in computer vision.
Collaborative Filtering: Finds users with similar tastes by comparing item sets they have
interacted with.

2. Define Banding, and How Does Banding Reduce False Positives or Negatives in Detecting
Document Similarities?

Banding is a technique used in Locality-Sensitive Hashing (LSH) to efficiently identify similar items
without exhaustively comparing all pairs, particularly for large datasets.

Definition of Banding:
The minhash signature of each document is divided into several bands, each containing a
subset of rows from the signature matrix.
Each band is hashed separately, placing documents whose values in that band are identical into the same
bucket.

Reducing False Positives and False Negatives:


False Positives: Banding reduces false positives by requiring all r rows within a band to
match, not just one row.
False Negatives: Reduces false negatives by creating multiple opportunities for truly
similar documents to be hashed into the same bucket across different bands.

Effectiveness:
The banding technique produces an S-curve: pairs with similarity above the threshold are very
likely to become candidates, while pairs below it rarely do, reducing unnecessary
comparisons.

3. How Does Set Similarity, Particularly Jaccard Similarity, Apply to Collaborative Filtering
Problems?

Collaborative Filtering recommends items to users based on similar behaviors or preferences, often
modeled as set similarity problems.

Jaccard Similarity in Collaborative Filtering:


User-Item Interactions: For instance, in an e-commerce setting, each user’s interaction
history can be treated as a set (e.g., items purchased or liked).
User Similarity: Users with a high Jaccard similarity in their item sets are likely to have
similar preferences.
Item Similarity: Items purchased by overlapping groups of users also exhibit high Jaccard
similarity, indicating shared appeal.
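
Item similarity can be sketched by inverting the user-to-items sets into item-to-buyers sets and comparing those. The user names, items, and helper names below are made up for illustration.

```python
from collections import defaultdict


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def buyers_by_item(purchases: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert user -> items into item -> users who bought it."""
    inverted = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            inverted[item].add(user)
    return dict(inverted)


purchases = {  # hypothetical purchase histories
    "alice": {"laptop", "mouse"},
    "bob": {"laptop", "mouse", "desk"},
    "carol": {"desk"},
}
buyers = buyers_by_item(purchases)
print(jaccard(buyers["laptop"], buyers["mouse"]))  # same two buyers: 1.0
print(jaccard(buyers["laptop"], buyers["desk"]))   # one shared buyer of three
```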

Applications:

E-commerce: Amazon uses collaborative filtering to recommend products by identifying
similar users based on purchase history.
Streaming Services: Netflix identifies similar viewing patterns among users for movie
recommendations.

4. Explain the Process of Combining Various Techniques, Such as Shingling, Hashing, and LSH, in
Document Similarity Analysis

The combination of shingling, hashing, and Locality-Sensitive Hashing (LSH) techniques allows for
efficient and scalable document similarity analysis.

Process of Combining Techniques:

1. Shingling: Documents are converted into sets of short sequences (shingles), which serve as
the basis for similarity comparisons.
Each document is broken into k-shingles, representing unique k-length sequences of
characters or words.

2. Hashing Shingles: Shingles are then hashed to create compact representations.


Each shingle is mapped to an integer (for example, a 4-byte bucket number), reducing memory use and allowing
faster comparison.

3. Minhashing: Documents are converted into minhash signatures that approximately preserve the Jaccard
similarity between the original shingle sets.
This step shrinks each document’s representation while retaining its similarity
properties.

4. Locality-Sensitive Hashing (LSH): The minhash signatures are divided into bands to
quickly identify candidate pairs.
Similar documents are grouped by hashing each band, where only candidate pairs
(those in the same bucket) are compared.
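
Steps 1 to 3 of the pipeline can be sketched end to end as below; step 4 would then cut the resulting signatures into bands. This is a sketch under stated assumptions: CRC32 as the shingle hash, random linear functions modulo a Mersenne prime as the minhash family, 100 hash functions, and k = 5 are all illustrative choices, and every helper name is hypothetical. Documents are assumed to be at least k characters long.

```python
import random
import zlib

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the illustrative hash family


def hashed_shingles(text: str, k: int = 5) -> set[int]:
    """Steps 1-2: character k-shingles, each hashed to a 32-bit integer."""
    return {zlib.crc32(text[i:i + k].encode("utf-8"))
            for i in range(len(text) - k + 1)}


def make_hash_params(n: int, seed: int = 0) -> list[tuple[int, int]]:
    """Draw n (a, c) pairs defining h(x) = (a * x + c) mod MERSENNE_PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME))
            for _ in range(n)]


def minhash_signature(shingles: set[int], params) -> list[int]:
    """Step 3: for each hash function keep the minimum value over the set."""
    return [min((a * x + c) % MERSENNE_PRIME for x in shingles)
            for a, c in params]


def estimate_jaccard(sig1: list[int], sig2: list[int]) -> float:
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)


params = make_hash_params(100)
doc_a = hashed_shingles("locality sensitive hashing finds similar documents")
doc_b = hashed_shingles("locality sensitive hashing finds similar items")
sig_a = minhash_signature(doc_a, params)
sig_b = minhash_signature(doc_b, params)
print("estimated similarity:", estimate_jaccard(sig_a, sig_b))
```

The estimate is approximate; using more hash functions tightens it at the cost of longer signatures.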

Benefits of Combining Techniques:


Efficiency: The pipeline minimizes computational complexity, reducing memory usage and
processing time.
Scalability: Useful for analyzing document similarity in large datasets by narrowing down
to candidate pairs only.
Accuracy: Candidate pairs can be verified against their signatures or original shingle sets, which
allows accurate identification of similar documents, even in vast corpora.

Supply
1. What does Jaccard Similarity measure in sets, and how is it calculated?
2. How does the choice of shingle size impact the accuracy of similarity comparisons between
documents?
3. Define k-shingles and their role in document analysis.
4. Define LSH for Minhash Signatures and its role in document similarity analysis.

Supply

1. What Does Jaccard Similarity Measure in Sets, and How Is It Calculated?

Jaccard Similarity measures the degree of overlap between two sets, capturing how similar two sets
are based on their common elements relative to their total unique elements.

Purpose:
Jaccard Similarity is commonly used in set theory to measure the similarity between two
sets by examining their shared elements.

Calculation:

$$\text{Jaccard Similarity} = \frac{|S \cap T|}{|S \cup T|}$$

where:
S and T are the two sets being compared,
∣S ∩ T ∣ is the count of elements common to both sets,
∣S ∪ T ∣ is the total count of unique elements across both sets.

Range:
Values range from 0 (no similarity) to 1 (identical sets), making it easy to interpret similarity
levels between sets.

2. How Does the Choice of Shingle Size Impact the Accuracy of Similarity Comparisons Between
Documents?

The choice of shingle size (k) is crucial in determining the effectiveness of document similarity
analysis, as it impacts the granularity of the comparison.

Impact on Similarity Comparisons:


Small Shingle Size:
Produces a high number of common shingles between unrelated documents,
increasing false positives.
For example, with k = 1, most documents will share common single characters,
resulting in high similarity scores even for different content.
Large Shingle Size:
Reduces false positives but may overlook relevant matches, especially when
comparing documents with minor rephrasing.
Very large k values could miss partial matches and reduce the sensitivity to shared
phrases.
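
The effect of k can be demonstrated with two unrelated sentences. The sketch below uses two well-known pangrams as assumed sample documents: at k = 1 their shingle sets are identical (every letter plus the space), while at k = 5 they share nothing. The helper names are hypothetical.

```python
def char_shingles(text: str, k: int) -> set[str]:
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(s: set, t: set) -> float:
    union = s | t
    return len(s & t) / len(union) if union else 1.0


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "pack my box with five dozen liquor jugs"

for k in (1, 5):
    similarity = jaccard(char_shingles(doc_a, k), char_shingles(doc_b, k))
    print(f"k = {k}: similarity = {similarity}")
# k = 1 gives 1.0 although the sentences are unrelated; k = 5 gives 0.0.
```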

Optimal Shingle Size:


Depends on document length and type, but is typically chosen to balance sensitivity and
specificity.

For example, in typical text documents, a shingle size of 5–10 characters (or 2–3 words)
often captures meaningful similarities without excessive false positives.

3. Define k-Shingles and Their Role in Document Analysis

k-Shingles are substrings of length k extracted from a document, which serve as units for measuring
similarity between documents.

Definition:
A k-shingle is any sequence of k consecutive characters (or words) within a document.

Role in Document Analysis:


By converting documents into sets of k-shingles, each document is represented as a set
that can be compared to others using set similarity measures like Jaccard similarity.
Shingles help in identifying overlapping content, even if the documents contain small
variations in wording or formatting.

Examples:
For a document with the string "data science" and k = 3, the shingles are:
{"dat", "ata", "ta ", "a s", " sc", "sci", "cie", "ien", "enc", "nce"}.

4. Define LSH for Minhash Signatures and Its Role in Document Similarity Analysis

Locality-Sensitive Hashing (LSH) for Minhash Signatures is a technique for efficiently identifying
similar documents in large datasets by creating “candidate pairs” for comparison.

Definition of LSH:
LSH divides each minhash signature into several “bands” of rows and hashes each band
separately to a bucket.
Documents with similar minhash signatures are more likely to hash to the same bucket
within a band, forming candidate pairs for similarity checks.

Role in Document Similarity Analysis:


Efficient Candidate Generation: LSH significantly reduces the number of comparisons by
narrowing down to pairs that are likely to be similar.
False Positive and Negative Management: By adjusting the number of bands and rows,
LSH balances false positives and false negatives, increasing the efficiency of similarity
detection.
Scalability: Ideal for applications like plagiarism detection, duplicate content detection, and
large-scale document clustering where exact pairwise comparisons would be
computationally expensive.

LSH enables scalable similarity detection by focusing resources only on probable matches, which is
essential in mining massive datasets.

Elaborate all the questions in detailed way for 10 marks(including with Main SIDE HEADINGS and
bullet points, answers will be given from above uploaded pdf and also provide relevant Examples)
1. Explain Collaborative Filtering as a Similar-Sets Problem
2. Explain how Shingles are built from Words?
3. Write about LSH for Minhash Signatures?
4. Report the Analysis of Banding Technique?

Here's a detailed response based on the provided PDF and expanded explanations:

1. Explain Collaborative Filtering as a Similar-Sets Problem

Collaborative Filtering is a method used in recommendation systems to suggest items to users
based on the preferences of other users with similar tastes. This approach treats user-item
interactions as sets, allowing the use of similarity metrics like Jaccard similarity to find common
interests.

Overview:
Collaborative filtering focuses on finding users or items with high overlap in preferences or
behaviors.
Users with similar interaction patterns are grouped together, and recommendations are
made by suggesting items that similar users have interacted with.

Jaccard Similarity for Set-Based Collaborative Filtering:


Jaccard similarity measures the overlap between two sets. Here, it helps in identifying users
with shared preferences or items with shared audiences.
For two sets U and V (e.g., items purchased by two users), Jaccard similarity is calculated
as:

$$\text{Jaccard Similarity} = \frac{|U \cap V|}{|U \cup V|}$$

Applications:
E-commerce: Recommending products based on similar users’ purchase history.
Streaming Services: Suggesting shows or movies based on shared viewing patterns.

Example:
Consider two users:
User A's purchases: {"Laptop", "Mouse", "Keyboard"}
User B's purchases: {"Laptop", "Keyboard", "Headphones"}
Jaccard Similarity:
Common items: {"Laptop", "Keyboard"}
Total unique items: {"Laptop", "Mouse", "Keyboard", "Headphones"}
Similarity score = 2/4 = 0.5

Based on this similarity, items purchased by User B, like "Headphones," might be
recommended to User A.
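
The worked example can be reproduced with a short sketch. The `recommend` helper and its 0.5 similarity cutoff are assumptions for illustration; the notes do not specify how similar two users must be before recommendations are made.

```python
def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def recommend(target: set[str], other: set[str],
              min_similarity: float = 0.5) -> set[str]:
    """Suggest items the other user has that the target lacks, if similar enough."""
    if jaccard(target, other) < min_similarity:
        return set()
    return other - target


user_a = {"Laptop", "Mouse", "Keyboard"}
user_b = {"Laptop", "Keyboard", "Headphones"}
print(jaccard(user_a, user_b))    # 2 / 4 = 0.5
print(recommend(user_a, user_b))  # {'Headphones'}
```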

2. Explain How Shingles Are Built from Words

Word-Based Shingling is a method to create shingles, or fixed-size sequences, from documents by
grouping words rather than individual characters. This technique captures the contextual meaning
within documents and is especially effective for analyzing larger text structures.

Definition of Word-Based Shingling:


A word-based shingle is a sequence of consecutive words (rather than characters) from a
document, chosen by selecting a specific window size.
Word shingles help identify similar text structures while being less sensitive to small
variations like typos.

Process of Building Word Shingles:


Choose Window Size: Select a number of words (usually 2–3) to form a shingle.
Extract Word Sequences: Create shingles by sliding the window across the document.

Example:
Document: "Data science is fascinating."
With a window size of 2, the shingles are:
{"Data science", "science is", "is fascinating"}

Special Consideration with Stop Words:


Sometimes, shingles are built to start with common stop words followed by key terms. This
approach enhances the identification of content-specific structures, as stop words are
typically present across related text documents.
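
A sketch of stop-word-anchored shingling follows. The notes do not say how many words follow the stop word, so taking the next two words is an assumption here, as are the tiny stop-word list, the sample sentence, and the helper name.

```python
# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"a", "an", "the", "and", "to", "of", "is", "for", "that", "you"}


def stop_word_shingles(text: str, following: int = 2) -> set[tuple[str, ...]]:
    """Return shingles made of a stop word plus the next `following` words."""
    words = text.lower().split()
    return {
        tuple(words[i:i + following + 1])
        for i, word in enumerate(words)
        if word in STOP_WORDS and i + following < len(words)
    }


for shingle in sorted(stop_word_shingles("the cat sat on a mat and the dog slept")):
    print(" ".join(shingle))
```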

3. Write About LSH for Minhash Signatures

Locality-Sensitive Hashing (LSH) is an efficient method for identifying similar documents by
organizing items into buckets, where similar items are more likely to appear in the same bucket.

Concept of LSH for Minhash:


Minhashing: A technique to reduce each document to a fixed-length signature that approximately
preserves the Jaccard similarity between the original sets.
LSH Application: Minhash signatures are organized in a way that clusters similar
signatures together, reducing the need for exhaustive comparisons.

Steps of LSH for Minhash Signatures:


Divide Signatures into Bands: Minhash signatures are divided into bands with each band
containing a subset of rows.
Hash Each Band: Each band is hashed separately, placing documents with the same band
values into the same bucket.
Candidate Pair Generation: Only documents in the same bucket are compared, reducing
the total number of comparisons.

Benefits of LSH:
Efficiently narrows down the number of potential matches, making it suitable for large
datasets.
Reduces computation time by focusing only on probable candidates.

Example:
Two documents with similar text will likely have similar minhash signatures. If their
signatures match in one or more bands, they are flagged as a candidate pair for further
similarity checking.

4. Report the Analysis of Banding Technique

The Banding Technique in Locality-Sensitive Hashing is a method to improve efficiency in similarity
detection by creating an S-curve effect, where pairs with higher similarity have an increased chance of
being identified.

Explanation of Banding:
The minhash signature is divided into multiple bands, each containing several rows.
Each band is hashed independently, creating buckets where similar documents are
grouped.

Role in Reducing False Positives and False Negatives:


False Positives: Reduced because only documents with high similarity across a band end
up in the same bucket.
False Negatives: Also minimized, as similar documents will have multiple chances to match
in at least one band.

Threshold Effect:
Banding creates a similarity threshold, allowing efficient differentiation between highly
similar and less similar pairs.
The threshold, $(1/b)^{1/r}$, determines which pairs are likely to be candidates based on the
chosen number of bands (b) and rows per band (r).
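
The threshold comes from the standard probability argument, sketched below. Here s denotes the Jaccard similarity of a pair (a symbol introduced for this derivation), and b = 20, r = 5 are assumed example values rather than figures from the notes.

```latex
\begin{align*}
P(\text{all } r \text{ rows of a given band agree}) &= s^{r} \\
P(\text{no band agrees in all its rows}) &= (1 - s^{r})^{b} \\
P(\text{pair becomes a candidate}) &= 1 - (1 - s^{r})^{b} \\
t \approx (1/b)^{1/r}, \qquad b = 20,\ r = 5 \;&\Rightarrow\; t \approx 0.55
\end{align*}
```

For these values, a pair with s = 0.8 becomes a candidate with probability of about 0.9996, while a pair with s = 0.2 does so with probability of about 0.006.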

Example:
For a dataset where minhash signatures have high overlap, the banding technique allows
documents with similar structure to end up in the same bucket with higher probability,
streamlining the search for similar document pairs.

The Banding Technique is an essential aspect of LSH as it optimizes the matching process for large
datasets, focusing computational resources on likely matches.