Project Report
10.12.2024
─
Drashti Bhavsar
LOTI AI, Inc.
Overview
Semantic search and ranking are based on the similarity between queries and documents. Asymmetric search applies here, because the two sides take different forms:
Query: "Best laptops for gaming under $1000"
Documents: structured product descriptions.
Here, the query is a natural-language question, while the documents are structured
product entries.
Different representations: the query may carry context or intent, while the documents
are factual.
Goals
Accurate Retrieval:
● Ensure that relevant documents are ranked higher in the results using a model like
SBERT.
● Improve retrieval precision by evaluating and refining the similarity scoring method.
Ranking Optimization:
● Return the top N results (e.g., top 20) for each query to focus on the most relevant
documents.
● Integrate the similarity score, rank, and query context into the dataset for seamless
evaluation.
System Validation:
● Use a manually labeled "relevance" column in the dataset to calculate the system’s
accuracy and verify the correctness of retrieval results.
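The goals above can be sketched end-to-end: score documents against a query, keep the top N, and check the ranking against the manually labeled "relevance" column. A minimal sketch, with toy vectors standing in for real SBERT embeddings (in practice these would come from a sentence-transformer's encode method); all vectors and labels below are illustrative assumptions:

```python
import numpy as np

def cosine_sim(query_vec, doc_vecs):
    # Cosine similarity between one query vector and a matrix of document vectors.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q

def top_n(scores, n):
    # Indices of the n highest-scoring documents, best first.
    return np.argsort(scores)[::-1][:n]

def precision_at_n(ranked_ids, relevance, n):
    # Fraction of the top-n results labeled relevant (1) in the dataset.
    return sum(relevance[i] for i in ranked_ids[:n]) / n

# Toy stand-ins for SBERT embeddings of the query
# "Best laptops for gaming under $1000" and three product entries.
query_vec = np.array([1.0, 0.2, 0.0])
doc_vecs = np.array([[0.9, 0.1, 0.0],    # gaming laptop (relevant)
                     [0.0, 1.0, 0.5],    # office chair (not relevant)
                     [1.0, 0.3, 0.1]])   # budget gaming laptop (relevant)
relevance = [1, 0, 1]  # manually labeled "relevance" column

scores = cosine_sim(query_vec, doc_vecs)
ranked = top_n(scores, n=2)
print(ranked, precision_at_n(ranked, relevance, n=2))  # → [0 2] 1.0
```

With real embeddings, `doc_vecs` would be precomputed once and the same functions reused per query; precision at N serves as the accuracy check against the labeled column.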
Research Papers
Overview:
This review likely discusses the evolution of semantic retrieval models used in the first
stage of information retrieval pipelines. The first stage is critical for narrowing down large
document collections to a manageable subset.
● How It Works:
Semantic models use embeddings and advanced machine learning to improve
retrieval by understanding the query context better than traditional term-based
methods like BM25.
● Architecture:
Likely compares classical term-matching architectures to neural network
architectures, such as BERT-based retrieval models (e.g., Dense Passage Retrieval or
DPR).
● Models:
Includes neural models for dense vector retrieval, embedding techniques, and
transformer-based approaches.
● Accuracy:
Reports improvements in recall and precision, particularly when neural methods
replace traditional term-matching systems in datasets like MS MARCO.
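For contrast with the neural methods above, the term-matching baseline (BM25) that the review compares against can be sketched in pure Python. A minimal sketch of the standard Okapi BM25 formula; the tokenized toy documents and the common defaults k1=1.5, b=0.75 are illustrative assumptions:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    # docs: list of tokenized documents. Okapi BM25 with a smoothed IDF term.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation (k1) and length normalization (b).
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["gaming", "laptop", "rtx", "gpu"],
        ["office", "laptop", "lightweight"],
        ["gaming", "desktop", "rtx"]]
scores = bm25_scores(["gaming", "laptop"], docs)
print(scores)  # the first document matches both terms and scores highest
```

Unlike the embedding-based scoring above, BM25 gives zero credit to semantically related but lexically different terms, which is exactly the gap neural first-stage retrieval closes.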
Overview:
Semantic Scholar is an academic search engine that uses AI to enhance search results
relevance by focusing on semantic understanding.
● How It Works:
Combines traditional information retrieval methods with machine learning and
deep learning to understand user intent and rank documents effectively. Uses
techniques like entity linking, citation analysis, and contextual embeddings.
● Architecture:
A hybrid system involving traditional index-based search augmented by neural
networks like transformers.
● Models:
Likely includes BERT for semantic understanding, graph-based models for citation
networks, and clustering techniques for topic discovery.
● Accuracy:
Shows high precision in ranking academic search results, but numeric accuracy
metrics depend on the specific datasets tested.
Overview:
This paper explores advanced retrieval methods, particularly for specific domains such as
healthcare or education.
● How It Works:
Uses semantic analysis to go beyond keyword matching, leveraging ontologies or
knowledge graphs to understand domain-specific relationships.
● Architecture:
Likely discusses multi-layered neural networks combined with domain-specific
knowledge graphs.
● Models:
Techniques like Word2Vec for word embeddings or hierarchical attention models
for document comprehension.
● Accuracy:
Improvements reported in specialized domains, with metrics tied to specific
datasets like PubMed or educational repositories.
Sentence Transformers:
A pre-trained model specifically designed for sentence embedding, allowing for efficient
semantic comparisons between queries and documents.
DistilBERT:
A smaller, faster version of BERT, which is often preferred for large-scale applications due
to its computational efficiency.
XLNet:
An autoregressive transformer pre-trained with a permutation-based language-modeling
objective, capturing bidirectional context without masked inputs.
Cosine Similarity vs. Dot-Product Similarity:
● Range: Cosine similarity outputs values between -1 and 1 (1 = perfectly similar;
-1 = perfectly dissimilar, i.e., pointing in the opposite direction). Dot-product
output can be any real number.
● Use Cases: Cosine similarity is used for semantic similarity tasks (e.g.,
SBERT-based search). Dot product is often used in neural network attention
mechanisms.
● Cosine Similarity: Use when vector magnitudes vary significantly and you are
interested in directional similarity.
● Dot-Product Similarity: Use when vectors are already normalized (e.g., in many
neural models like attention mechanisms) or when magnitude information is
meaningful.
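The two bullets above can be checked numerically: scaling a vector changes its dot product but not its cosine similarity, and after L2-normalization the two metrics coincide. A small illustration with toy vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude

dot = float(a @ b)
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(dot)  # 28.0 — grows with the vectors' magnitudes
print(cos)  # ≈ 1.0 — depends on direction only

# After L2-normalization, dot product and cosine similarity agree,
# which is why many neural models use the cheaper dot product directly.
an = a / np.linalg.norm(a)
bn = b / np.linalg.norm(b)
print(float(an @ bn))  # ≈ 1.0
```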
● Use ColBERT if accuracy and fine-grained relevance are more important than
speed, such as in search engines or Q&A systems.
● Use SBERT if speed and scalability are critical, such as in low-latency applications
or resource-constrained environments.
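The trade-off can be sketched directly: SBERT compares one pooled vector per text (a single dot product), while ColBERT's late interaction (MaxSim) compares every query token embedding against every document token embedding. The tiny hand-made token embeddings below are illustrative stand-ins for real model outputs:

```python
import numpy as np

def sbert_score(query_vec, doc_vec):
    # Single-vector similarity: one dot product per (query, document) pair — fast.
    return float(query_vec @ doc_vec)

def colbert_score(query_toks, doc_toks):
    # Late interaction (MaxSim): for each query token embedding, take the max
    # similarity over all document token embeddings, then sum — finer-grained.
    sims = query_toks @ doc_toks.T          # (num_q_tokens, num_d_tokens)
    return float(sims.max(axis=1).sum())

# Toy L2-normalized token embeddings (2 query tokens, 2 document tokens).
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.6, 0.8]])

print(colbert_score(q, d))                          # per-token matching
print(sbert_score(q.mean(axis=0), d.mean(axis=0)))  # mean pooling as a stand-in
```

MaxSim performs num_query_tokens × num_doc_tokens comparisons per document, which is why ColBERT trades speed for fine-grained relevance while SBERT's single-vector comparison scales to low-latency settings.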
Comparison Summary