
Minor Project:

Submitted to Department of Information Technology


By

Roll Number: 233721


Under Supervision of: Dr. Abhishek Verma

School of Information Science and Technology


Project Report:

Problem Statement:
An online system to automatically verify new title submissions by checking for similarities with
existing titles.

Problem Overview:
The system should automatically verify new title submissions by:
 Checking for similarity to existing titles.
 Enforcing guidelines that prevent the use of disallowed words, combinations of existing
titles, similar meanings in other languages, and variations in periodicity.
 Providing a probability score that reflects the likelihood of a title being accepted.

Requirements:
1. Similarity Check:
o Phonetic Similarity: Use algorithms like Soundex or Metaphone to check for similar-
sounding names.
o Prefix/Suffix Similarity: Identify and reject titles that share common prefixes or suffixes
like "The", "India", "Samachar", "News".
o Spelling Variations: Handle minor spelling changes (e.g., "Namaskar" vs. "Namascar")
to prevent bypassing similarity checks.
o Similarity Percentage: Calculate and compare similarity between the new title and
existing titles, giving a percentage value.
2. Prefix/Suffix Handling:
o Maintain a list of disallowed prefixes and suffixes.
o Reject titles that contain these disallowed elements if they cause similarity to an existing
title.
3. Guideline Enforcement:
o Maintain a list of disallowed words (e.g., "Police", "Crime", "Army") that cannot
appear in new titles.
o Prevent the creation of new titles by combining two existing ones (e.g., "Hindu" +
"Indian Express" should be rejected).
o Reject titles with similar meanings in other languages (e.g., "Daily Evening" vs.
"Pratidin Sandhya").
o Disallow the addition of periodicity terms (e.g., "daily", "weekly") to existing titles to
form new ones.
4. Verification Probability:
o The system should provide a probability score that indicates the likelihood of a title
being verified. For example, if a title has a similarity score of 80%, the probability of it
being verified would be 20%.
5. Database Interaction:
o Efficient Search: The system should be capable of quickly searching and comparing
new titles against the database of 160,000 titles using optimized techniques.
o Tracking Submissions: The system should track submitted titles to prevent approving
similar titles in the future.
o Optimized Search: Use indexing and optimized search techniques to ensure fast
processing times.
6. User Feedback:
o Provide users with clear feedback if their title is rejected due to similarity, disallowed
words, prefixes/suffixes, combinations, or violations of other rules.
o Display the verification probability score to the user.
o Allow users to modify and resubmit their titles after receiving feedback.
7. Scalability:
o The system should be designed to handle an increasing volume of title submissions
without performance degradation.
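The similarity check and guideline rules above can be sketched in Python. This is a minimal illustration, not the project's actual implementation: the standard library's difflib stands in for Levenshtein/phonetic matching, and the word lists contain only the sample words mentioned above.

```python
import difflib

DISALLOWED_AFFIXES = {"the", "india", "samachar", "news"}  # sample list, not exhaustive
DISALLOWED_WORDS = {"police", "crime", "army"}             # sample list, not exhaustive

def similarity_percentage(new_title: str, existing_title: str) -> float:
    """String similarity as a percentage, via difflib's matching-block ratio."""
    ratio = difflib.SequenceMatcher(
        None, new_title.lower(), existing_title.lower()
    ).ratio()
    return round(ratio * 100, 2)

def violates_guidelines(new_title: str) -> bool:
    """Reject titles containing a disallowed word, prefix, or suffix."""
    words = new_title.lower().split()
    if any(w in DISALLOWED_WORDS for w in words):
        return True
    return words[0] in DISALLOWED_AFFIXES or words[-1] in DISALLOWED_AFFIXES
```

For example, the spelling variation "Namaskar" vs. "Namascar" scores 87.5% here, so it would not slip past the similarity check.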
Expected Solution:
1. Similarity Scoring:
o The system should calculate the similarity percentage between a new title and existing
titles using algorithms like Levenshtein distance, Jaccard similarity, or phonetic
matching algorithms.
2. Verification Probability:
o A similarity score can be used to derive the verification probability. For example, a title
with 80% similarity will have a 20% chance of being verified.
3. Guideline Enforcement:
o Use predefined lists for disallowed words, prefixes, and suffixes to ensure that the
system automatically rejects titles that violate these rules.
o Detect combinations of existing titles using pattern-matching techniques to prevent
accidental new titles from merging old ones.
4. Efficient Database Search:
o The system should use advanced indexing techniques like Elasticsearch or Apache Solr
to ensure that searching through the 160,000 existing titles is fast and efficient.
5. User Feedback:
o The feedback to users should be clear and actionable, explaining why a title was
rejected and offering suggestions for modification.
o The verification probability score should be presented clearly, showing users how close
their title was to being verified.
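The similarity-to-probability rule described above is a simple complement, which can be written directly:

```python
def verification_probability(max_similarity: float) -> float:
    """Probability of acceptance is the complement of the highest
    similarity found against any existing title (both in percent)."""
    return round(100.0 - max_similarity, 2)
```

So a new title whose closest existing title is 80% similar gets a 20% verification probability, matching the worked example in the requirements.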

Technologies and Tools for Implementation:


Frontend:

 ReactJS:
o React, also known as ReactJS, is a popular and powerful JavaScript library used
for building dynamic and interactive user interfaces, primarily for single-page
applications (SPAs). It was developed and maintained by Facebook and has
gained significant popularity due to its efficient rendering techniques, reusable
components, and active community support.


Backend:
 Python:
o Python is a high-level, general-purpose programming language. Its design
philosophy emphasizes code readability with the use of significant indentation.

 Flask server:
o Flask is a Python module that lets you develop web applications easily. It has a small, easy-to-extend core: it is a microframework that does not include an ORM (Object Relational Mapper) or similar features.
o I created the backend API using a Flask server; it lets the user generate the report in one click.
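A minimal sketch of what such a Flask endpoint could look like. The route name, request body, and the fixed similarity value are all illustrative assumptions; in the real backend the similarity would come from the embedding comparison described later.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/verify", methods=["POST"])
def verify():
    # Hypothetical request body: {"title": "..."}
    title = request.get_json()["title"]
    # Placeholder: a fixed similarity stands in for the real embedding comparison.
    similarity = 80.0
    return jsonify({"title": title, "probability": 100.0 - similarity})

if __name__ == "__main__":
    app.run(debug=True)
```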

 Sentence Transformers from Hugging Face:


o Sentence Transformers are a type of neural network architecture that creates vector representations of entire sentences or paragraphs. These embeddings capture the semantic meaning of the text, which allows systems to understand the context, relationships, and intent behind the words.
o They encode a given string into dense vectors so that cosine similarity can be calculated between them.

 BERT:
o BERT (Bidirectional Encoder Representations from Transformers) is an open-source machine learning framework designed for natural language processing (NLP). Originating in 2018, the framework was developed by researchers at Google AI Language.

 Pretrained models (BERT):


o all-mpnet-base-v2:
 This is a sentence-transformers model: It maps sentences & paragraphs
to a 768 dimensional dense vector space and can be used for tasks like
clustering or semantic search.
 This model is used for generating the embeddings of each word in the string; the angle between each pair of vectors is then calculated using the cosine similarity function.
o mixedbread-ai/mxbai-embed-large-v1:
 Another sentence-transformers embedding model, used in the same way to generate dense vectors for similarity comparison.

 Numpy:
o NumPy is a library for the Python programming language, adding support for
large, multi-dimensional arrays and matrices, along with a large collection of
high-level mathematical functions to operate on these arrays.
o I used it for calculating the mean of the collected data.



 PyTorch:
o PyTorch is an open-source deep learning framework with Python and C++ interfaces. PyTorch's functionality resides in the torch module; the data to be processed is supplied in the form of tensors.
o A tensor is what the module's similarity function returns.


 Approach to solving the problem:


o Cosine similarity:
 Cosine similarity is a metric that measures how similar two vectors are in a multi-dimensional space by calculating the cosine of the angle between them. It is commonly used in data analysis, text analysis, image recognition, and recommendation systems.
o I created embeddings of each unique title present in the database.
o Each incoming title is then compared against these embeddings.
o I used Sentence Transformers from Hugging Face with BERT models to generate the embeddings of each title.
o The 768-dimensional dense embedding vectors are used to calculate the angle between them.
o This shows how similar one string is to another.

 Creating binary files to save time:


o Generating the embedding for each string with the encoder takes time, so it is better to save the embeddings in a binary file to make them reusable.
o Pickle is used to save the embeddings into a binary file.
o Every unique title has its own embedding, stored inside an array of dicts, which is then saved to the binary file using pickle.
o This file is read back at similarity-check time.

 A matrix approach to solving the problem:


o In this approach every string is split on spaces, and an embedding is created for each word.
o Each word embedding is compared with the word embeddings of the incoming title.
o This produces a similarity matrix between each stored title and the incoming title.

o Here we can easily compare the similarity of each string to another.

o Each row of the matrix is given a weight, which is multiplied by the highest value present in that row.

o This yields the actual similarity between the strings.


o It is also useful for detecting an exact match by just looking at the diagonal of the matrix.

o Prefix and suffix matches can also be detected by looking at the first and last row elements.
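A sketch of the matrix idea, with two simplifying assumptions: plain string similarity (difflib) stands in for the per-word embedding cosine similarity, and an unweighted mean of row maxima stands in for the weighted aggregation described above.

```python
import difflib
import numpy as np

def word_matrix(existing_title: str, new_title: str) -> np.ndarray:
    """Similarity matrix: rows = words of the stored title,
    columns = words of the incoming title."""
    rows = existing_title.lower().split()
    cols = new_title.lower().split()
    return np.array(
        [[difflib.SequenceMatcher(None, r, c).ratio() for c in cols] for r in rows]
    )

m = word_matrix("Daily Samachar News", "Daily Samachar News")
exact_match = bool(np.all(np.diag(m) == 1.0))  # all-ones diagonal -> exact match
row_scores = m.max(axis=1)                     # best match for each stored word
overall = float(row_scores.mean())             # simple unweighted aggregate
```

Comparing identical titles gives an all-ones diagonal, which is the exact-match signal mentioned above; the first and last rows correspond to the prefix and suffix checks.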

Some Screenshots:
