Project Report
10.12.2024
─
Drashti Bhavsar
LOTI AI, Inc.
Overview
Semantic search and ranking are based on the similarity between queries and documents. Asymmetric search applies here, because the two sides take different forms:
Query: "Best laptops for gaming under $1000"
Documents: structured product descriptions.
Here, the query is a natural-language question, while the documents are structured
product entries.
Different representations: the query may carry context or intent, while the documents
are factual.
Goals
Accurate Retrieval:
● Ensure that relevant documents are ranked higher in the results using a model like
SBERT.
● Improve retrieval precision by evaluating and refining the similarity scoring method.
Ranking Optimization:
● Return the top N results (e.g., top 20) for each query to focus on the most relevant
documents.
● Integrate the similarity score, rank, and query context into the dataset for seamless
evaluation.
System Validation:
● Use a manually labeled "relevance" column in the dataset to calculate the system’s
accuracy and verify the correctness of retrieval results.
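The goals above can be sketched end-to-end: score documents against a query, keep the top N, and check the ranking against the manually labeled "relevance" column. A minimal sketch, with toy vectors standing in for real SBERT embeddings (in practice these would come from a sentence-transformer's encode method); all vectors and labels below are illustrative assumptions:

```python
import numpy as np

def cosine_sim(query_vec, doc_vecs):
    # Cosine similarity between one query vector and a matrix of document vectors.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q

def top_n(scores, n):
    # Indices of the n highest-scoring documents, best first.
    return np.argsort(scores)[::-1][:n]

def precision_at_n(ranked_ids, relevance, n):
    # Fraction of the top-n results labeled relevant (1) in the dataset.
    return sum(relevance[i] for i in ranked_ids[:n]) / n

# Toy stand-ins for SBERT embeddings of the query
# "Best laptops for gaming under $1000" and three product entries.
query_vec = np.array([1.0, 0.2, 0.0])
doc_vecs = np.array([[0.9, 0.1, 0.0],    # gaming laptop (relevant)
                     [0.0, 1.0, 0.5],    # office chair (not relevant)
                     [1.0, 0.3, 0.1]])   # budget gaming laptop (relevant)
relevance = [1, 0, 1]  # manually labeled "relevance" column

scores = cosine_sim(query_vec, doc_vecs)
ranked = top_n(scores, n=2)
print(ranked, precision_at_n(ranked, relevance, n=2))  # → [0 2] 1.0
```

With real embeddings, `doc_vecs` would be precomputed once and the same functions reused per query; precision at N serves as the accuracy check against the labeled column.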
Research Papers
Overview:
This review likely discusses the evolution of semantic retrieval models used in the first
stage of information retrieval pipelines. The first stage is critical for narrowing down large
document collections to a manageable subset.
● How It Works:
Semantic models use embeddings and advanced machine learning to improve
retrieval by understanding the query context better than traditional term-based
methods like BM25.
● Architecture:
Likely compares classical term-matching architectures to neural network
architectures, such as BERT-based retrieval models (e.g., Dense Passage Retrieval or
DPR).
● Models:
Includes neural models for dense vector retrieval, embedding techniques, and
transformer-based approaches.
● Accuracy:
Reports improvements in recall and precision, particularly when neural methods
replace traditional term-matching systems in datasets like MS MARCO.
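For contrast with the neural methods above, the term-matching baseline (BM25) that the review compares against can be sketched in pure Python. A minimal sketch of the standard Okapi BM25 formula; the tokenized toy documents and the common defaults k1=1.5, b=0.75 are illustrative assumptions:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    # docs: list of tokenized documents. Okapi BM25 with a smoothed IDF term.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation (k1) and length normalization (b).
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["gaming", "laptop", "rtx", "gpu"],
        ["office", "laptop", "lightweight"],
        ["gaming", "desktop", "rtx"]]
scores = bm25_scores(["gaming", "laptop"], docs)
print(scores)  # the first document matches both terms and scores highest
```

Unlike the embedding-based scoring above, BM25 gives zero credit to semantically related but lexically different terms, which is exactly the gap neural first-stage retrieval closes.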
Overview:
Semantic Scholar is an academic search engine that uses AI to enhance search results
relevance by focusing on semantic understanding.
● How It Works:
Combines traditional information retrieval methods with machine learning and
deep learning to understand user intent and rank documents effectively. Uses
techniques like entity linking, citation analysis, and contextual embeddings.
● Architecture:
A hybrid system involving traditional index-based search augmented by neural
networks like transformers.
● Models:
Likely includes BERT for semantic understanding, graph-based models for citation
networks, and clustering techniques for topic discovery.
● Accuracy:
Shows high precision in ranking academic search results, but numeric accuracy
metrics depend on the specific datasets tested.
Overview:
This paper explores advanced retrieval methods, particularly for specific domains such as
healthcare or education.
● How It Works:
Uses semantic analysis to go beyond keyword matching, leveraging ontologies or
knowledge graphs to understand domain-specific relationships.
● Architecture:
Likely discusses multi-layered neural networks combined with domain-specific
knowledge graphs.
● Models:
Techniques like Word2Vec for word embeddings or hierarchical attention models
for document comprehension.
● Accuracy:
Improvements reported in specialized domains, with metrics tied to specific
datasets like PubMed or educational repositories.
Sentence Transformers:
A pre-trained model specifically designed for sentence embedding, allowing for efficient
semantic comparisons between queries and documents.
DistilBERT:
A smaller, faster version of BERT, which is often preferred for large-scale applications due
to its computational efficiency.
XLNet:
An autoregressive transformer pre-trained with a permutation-based language-modeling
objective, capturing bidirectional context without masked inputs.
Cosine Similarity vs. Dot-Product Similarity:
● Range: Cosine similarity outputs values between -1 and 1 (1 = perfectly similar;
-1 = perfectly dissimilar, i.e., pointing in the opposite direction). Dot-product
output can be any real number.
● Use Cases: Cosine similarity is used for semantic similarity tasks (e.g.,
SBERT-based search). Dot product is often used in neural network attention
mechanisms.
● Cosine Similarity: Use when vector magnitudes vary significantly and you are
interested in directional similarity.
● Dot-Product Similarity: Use when vectors are already normalized (e.g., in many
neural models like attention mechanisms) or when magnitude information is
meaningful.
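The two bullets above can be checked numerically: scaling a vector changes its dot product but not its cosine similarity, and after L2-normalization the two metrics coincide. A small illustration with toy vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude

dot = float(a @ b)
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(dot)  # 28.0 — grows with the vectors' magnitudes
print(cos)  # ≈ 1.0 — depends on direction only

# After L2-normalization, dot product and cosine similarity agree,
# which is why many neural models use the cheaper dot product directly.
an = a / np.linalg.norm(a)
bn = b / np.linalg.norm(b)
print(float(an @ bn))  # ≈ 1.0
```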
● Use ColBERT if accuracy and fine-grained relevance are more important than
speed, such as in search engines or Q&A systems.
● Use SBERT if speed and scalability are critical, such as in low-latency applications
or resource-constrained environments.
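The trade-off can be sketched directly: SBERT compares one pooled vector per text (a single dot product), while ColBERT's late interaction (MaxSim) compares every query token embedding against every document token embedding. The tiny hand-made token embeddings below are illustrative stand-ins for real model outputs:

```python
import numpy as np

def sbert_score(query_vec, doc_vec):
    # Single-vector similarity: one dot product per (query, document) pair — fast.
    return float(query_vec @ doc_vec)

def colbert_score(query_toks, doc_toks):
    # Late interaction (MaxSim): for each query token embedding, take the max
    # similarity over all document token embeddings, then sum — finer-grained.
    sims = query_toks @ doc_toks.T          # (num_q_tokens, num_d_tokens)
    return float(sims.max(axis=1).sum())

# Toy L2-normalized token embeddings (2 query tokens, 2 document tokens).
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.6, 0.8]])

print(colbert_score(q, d))                          # per-token matching
print(sbert_score(q.mean(axis=0), d.mean(axis=0)))  # mean pooling as a stand-in
```

MaxSim performs num_query_tokens × num_doc_tokens comparisons per document, which is why ColBERT trades speed for fine-grained relevance while SBERT's single-vector comparison scales to low-latency settings.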
Comparison Summary