
Semantic Search and Ranking

10.12.2024

Drashti Bhavsar
LOTI AI, Inc.

Overview
Semantic search and ranking are based on measuring similarity between queries and documents.
Asymmetric search applies here: the query and the documents take different forms. For example:
Query: "Best laptops for gaming under $1000"
Documents: Product descriptions like:

● "Dell Inspiron 15, Intel i7, $899"


● "HP Pavilion Gaming, AMD Ryzen 5, $999"

Here, the query is a natural language question, while the documents are structured
product entries.

Different Representations: The query may include context or intent, while documents are
factual.
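
As a sketch of how this asymmetric setup could be wired together, the snippet below encodes the example query and product entries with the sentence-transformers library and ranks them by cosine similarity. The model checkpoint is an illustrative choice, not necessarily the project's actual model; any checkpoint trained for asymmetric (short query vs. document) search, such as the MS MARCO family, would work.

```python
# Minimal asymmetric-search sketch with the sentence-transformers library.
# The checkpoint is illustrative: any MS MARCO-style model trained for
# short-query vs. document (asymmetric) search would fit here.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("msmarco-MiniLM-L-6-v3")

query = "Best laptops for gaming under $1000"
documents = [
    "Dell Inspiron 15, Intel i7, $899",
    "HP Pavilion Gaming, AMD Ryzen 5, $999",
]

# Encode the natural-language query and the structured product entries into
# the same embedding space, then rank documents by cosine similarity.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(documents, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_embs)[0]  # one score per document

for doc, score in sorted(zip(documents, scores.tolist()),
                         key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```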

Goals
Accurate Retrieval:

● Ensure that relevant documents are ranked higher in the results using a model like
SBERT.
● Improve retrieval precision by evaluating and refining the similarity scoring method.

Ranking Optimization:

● Rank documents based on relevance using metrics like cosine similarity.


● Validate ranking performance using metrics such as MRR (Mean Reciprocal Rank) and accuracy to measure effectiveness (see the MRR sketch below).
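
A minimal sketch of how MRR could be computed over ranked results. The data shapes are illustrative assumptions rather than the project's actual format: ranked results arrive as per-query lists of document IDs sorted by descending similarity, and relevance judgments as per-query ID sets.

```python
# Minimal MRR (Mean Reciprocal Rank) sketch. `ranked_results` maps each
# query to document IDs sorted by descending similarity; `relevant` maps
# each query to its set of manually labeled relevant document IDs.
def mean_reciprocal_rank(ranked_results, relevant):
    reciprocal_ranks = []
    for query, docs in ranked_results.items():
        rr = 0.0
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in relevant[query]:
                rr = 1.0 / rank  # reciprocal rank of the first relevant hit
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# First relevant hit at rank 1 for q1 and rank 2 for q2:
ranked = {"q1": ["d3", "d1"], "q2": ["d7", "d2"]}
labels = {"q1": {"d3"}, "q2": {"d2"}}
print(mean_reciprocal_rank(ranked, labels))  # (1/1 + 1/2) / 2 = 0.75
```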

System Validation:

● Use a manually labeled "relevance" column in the dataset to calculate the system’s
accuracy and verify the correctness of retrieval results.

Efficient Results Presentation:

● Return the top N results (e.g., Top 20) for each query to focus on the most relevant
documents.
● Integrate the similarity, rank, and query context into the dataset for seamless evaluation (see the pandas sketch below).
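
One plausible shape for the top-N selection and accuracy check using pandas. The column names ("query", "similarity", "relevance") are assumptions standing in for the dataset's actual schema, with "relevance" being the manual 0/1 label mentioned under System Validation.

```python
# Sketch of top-N presentation plus accuracy over the manual labels.
# Column names are assumptions, not the project's confirmed schema.
import pandas as pd

def top_n_results(df: pd.DataFrame, n: int = 20):
    # Rank documents within each query by descending similarity.
    df = df.sort_values(["query", "similarity"], ascending=[True, False])
    df["rank"] = df.groupby("query").cumcount() + 1
    top = df[df["rank"] <= n].copy()
    # Fraction of returned results whose manual label marks them relevant.
    accuracy = top["relevance"].mean()
    return top, accuracy
```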

Research Papers

1. Semantic Models for the First-Stage Retrieval: A Comprehensive Review (ACM Transactions on Information Systems)

Overview:
This review likely discusses the evolution of semantic retrieval models used in the first
stage of information retrieval pipelines. The first stage is critical for narrowing down large
document collections to a manageable subset.

● How It Works:
Semantic models use embeddings and advanced machine learning to improve
retrieval by understanding the query context better than traditional term-based
methods like BM25.
● Architecture:
Likely compares classical term-matching architectures to neural network
architectures, such as BERT-based retrieval models (e.g., Dense Passage Retrieval or
DPR).

● Models:
Includes neural models for dense vector retrieval, embedding techniques, and
transformer-based approaches.
● Accuracy:
Reports improvements in recall and precision, particularly when neural methods replace traditional term-matching systems on datasets like MS MARCO.

2. Semantic Scholar (IEEE Xplore)

Overview:
Semantic Scholar is an academic search engine that uses AI to enhance search results
relevance by focusing on semantic understanding.

● How It Works:
Combines traditional information retrieval methods with machine learning and
deep learning to understand user intent and rank documents effectively. Uses
techniques like entity linking, citation analysis, and contextual embeddings.
● Architecture:
A hybrid system involving traditional index-based search augmented by neural
networks like transformers.
● Models:
Likely includes BERT for semantic understanding, graph-based models for citation
networks, and clustering techniques for topic discovery.
● Accuracy:
Shows high precision in ranking academic search results, but numeric accuracy
metrics depend on the specific datasets tested.

3. Advancements in Semantic Information Retrieval (ScienceDirect)

Overview:
This paper explores advanced retrieval methods, particularly for specific domains such as
healthcare or education.

● How It Works:
Uses semantic analysis to go beyond keyword matching, leveraging ontologies or
knowledge graphs to understand domain-specific relationships.
● Architecture:
Likely discusses multi-layered neural networks combined with domain-specific
knowledge graphs.

● Models:
Techniques like Word2Vec for word embeddings or hierarchical attention models
for document comprehension.
● Accuracy:
Improvements reported in specialized domains, with metrics tied to specific
datasets like PubMed or educational repositories.

Google Search for the Best Model:

Sentence Transformers:

A family of pre-trained models specifically designed for sentence embeddings, allowing for efficient semantic comparisons between queries and documents.

DistilBERT:

A smaller, faster version of BERT, which is often preferred for large-scale applications due
to its computational efficiency.

XLNet:

Another transformer-based model that can capture long-range dependencies in text, potentially improving semantic understanding for complex queries.
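
For reference, all three candidates ship as pre-trained checkpoints; the snippet below shows one plausible way to load them. The model IDs are the common Hugging Face names and are assumptions, not necessarily the checkpoints the project used.

```python
# Illustrative loading of the three candidate models; the model IDs are
# common Hugging Face names, assumed here rather than confirmed choices.
from sentence_transformers import SentenceTransformer
from transformers import AutoModel

sbert = SentenceTransformer("all-MiniLM-L6-v2")                    # Sentence Transformers
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")  # DistilBERT
xlnet = AutoModel.from_pretrained("xlnet-base-cased")              # XLNet
```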

Similarities and Models

1. Difference Between Cosine Similarity and Dot-Product Similarity


| Metric | Cosine Similarity | Dot-Product Similarity |
| --- | --- | --- |
| Formula | $\text{Cosine}(A, B) = \dfrac{\mathbf{A} \cdot \mathbf{B}}{\lVert\mathbf{A}\rVert\,\lVert\mathbf{B}\rVert}$ | $\text{DotProduct}(A, B) = \mathbf{A} \cdot \mathbf{B}$ |
| Normalization | Normalizes vectors to unit length before comparing. | Directly compares vector magnitudes. |
| Range | Outputs values between -1 and 1. | Output can be any real number. |
| Interpretation | Measures the angle between vectors (how similar they are in direction): 1 = perfectly similar, 0 = orthogonal (no similarity), -1 = perfectly dissimilar (opposite direction). | Measures the magnitude of alignment between vectors; larger values = more similarity. |
| Use Cases | Semantic similarity tasks (e.g., SBERT-based search); handles vectors of different magnitudes well. | Often used in neural network attention mechanisms; more efficient in systems where vectors are normalized beforehand. |
| Sensitivity to Magnitude | Independent of vector magnitudes, focusing purely on direction. | Sensitive to vector magnitudes; larger vectors produce larger dot products. |
| Efficiency | Slightly more computationally expensive due to normalization. | More computationally efficient without normalization. |

When to Use Which? (Similarity)

● Cosine Similarity: Use when vector magnitudes vary significantly and you are
interested in directional similarity.
● Dot-Product Similarity: Use when vectors are already normalized (e.g., in many neural models like attention mechanisms) or when magnitude information is meaningful (see the numeric illustration below).
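
A small numeric illustration of the distinction: two vectors pointing the same way but with different magnitudes agree perfectly under cosine similarity, while their raw dot products diverge.

```python
# Numeric illustration: same-direction vectors of different magnitude are
# identical under cosine similarity but not under the raw dot product.
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction, 10x the magnitude

print(cosine(a, b))                # 1.0 -- direction only
print(np.dot(a, a), np.dot(a, b))  # 14.0 vs. 140.0 -- magnitude matters
```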

When to Use Which? (Model)

● Use ColBERT if accuracy and fine-grained relevance are more important than
speed, such as in search engines or Q&A systems.
● Use SBERT if speed and scalability are critical, such as in low-latency applications
or resource-constrained environments.

2. Difference Between ColBERT and SBERT


| Feature | ColBERT | SBERT |
| --- | --- | --- |
| Full Name | Contextualized Late Interaction over BERT | Sentence-BERT |
| Purpose | Focuses on efficient and scalable passage/document search. | Optimized for sentence similarity and semantic retrieval. |
| Architecture | Uses a late-interaction mechanism for scalability; each token from a query or document is represented independently, allowing token-level interactions. Key components: token embeddings, MaxSim for interaction. | Fine-tunes BERT for semantic similarity tasks; encodes entire sentences into a single fixed-length embedding. Key components: BERT backbone with a pooling layer (mean/max pooling or the [CLS] token). |
| Encoding | Token-level embeddings (each token has its own vector). | Sentence-level embedding (entire sentence as a single vector). |
| Interaction Strategy | Late interaction: matches query tokens against document tokens during inference. | Pre-computed embeddings: sentences compared using similarity measures (cosine/dot product). |
| Strengths | More granular token-level interaction improves retrieval accuracy; ideal for tasks like document ranking or large-scale passage retrieval. | Fast retrieval due to precomputed embeddings; better for low-resource or interactive tasks like semantic search or clustering. |
| Weaknesses | Higher computational cost during inference; token-level matching can be slower for large corpora. | Slightly less precise for complex document retrieval; cannot perform token-level matching. |

Which Is More Accurate? (Models)


● ColBERT is generally more accurate in asymmetric search tasks, especially when
fine-grained interactions are critical.
● SBERT trades off some accuracy for speed and scalability, making it suitable for
scenarios where efficiency is a priority.

Comparison Summary

| Aspect | ColBERT | SBERT |
| --- | --- | --- |
| Accuracy | Higher (e.g., MRR@10 ~0.36–0.40) | Lower (e.g., MRR@10 ~0.30–0.33) |
| Interaction | Token-level late interaction | Sentence-level fixed embeddings |
| Latency | Higher (performs token-level matching during inference) | Lower (precomputed embeddings enable fast retrieval) |
| Use Case | Ideal for high-accuracy document retrieval | Best for fast similarity matching with large corpora |
