
Minor Project:

Submitted to Department of Information Technology


By

Roll Number: 233721


Under Supervision of: Dr. Abhishek Verma

School of Information Science and Technology


Project Report:

Problem Statement:
An online system to automatically verify new title submissions by checking for similarities with
existing titles.

Problem Overview:
The system should automatically verify new title submissions by:
 Checking for similarity to existing titles.
 Enforcing guidelines that prevent the use of disallowed words, combinations of existing
titles, similar meanings in other languages, and variations in periodicity.
 Providing a probability score that reflects the likelihood of a title being accepted.

Requirements:
1. Similarity Check:
o Phonetic Similarity: Use algorithms like Soundex or Metaphone to check for similar-
sounding names.
o Prefix/Suffix Similarity: Identify and reject titles that share common prefixes or suffixes
like "The", "India", "Samachar", "News".
o Spelling Variations: Handle minor spelling changes (e.g., "Namaskar" vs. "Namascar")
to prevent bypassing similarity checks.
o Similarity Percentage: Calculate and compare similarity between the new title and
existing titles, giving a percentage value.
2. Prefix/Suffix Handling:
o Maintain a list of disallowed prefixes and suffixes.
o Reject titles that contain these disallowed elements if they cause similarity to an existing
title.
3. Guideline Enforcement:
o Maintain a list of disallowed words (e.g., "Police", "Crime", "Army") that cannot
appear in new titles.
o Prevent the creation of new titles by combining two existing ones (e.g., "Hindu" +
"Indian Express" should be rejected).
o Reject titles with similar meanings in other languages (e.g., "Daily Evening" vs.
"Pratidin Sandhya").
o Disallow the addition of periodicity terms (e.g., "daily", "weekly") to existing titles to
form new ones.
4. Verification Probability:
o The system should provide a probability score that indicates the likelihood of a title
being verified. For example, if a title has a similarity score of 80%, the probability of it
being verified would be 20%.
5. Database Interaction:
o Efficient Search: The system should be capable of quickly searching and comparing
new titles against the database of 160,000 titles using optimized techniques.
o Tracking Submissions: The system should track submitted titles to prevent approving
similar titles in the future.
o Optimized Search: Use indexing and optimized search techniques to ensure fast
processing times.
6. User Feedback:
o Provide users with clear feedback if their title is rejected due to similarity, disallowed
words, prefixes/suffixes, combinations, or violations of other rules.
o Display the verification probability score to the user.
o Allow users to modify and resubmit their titles after receiving feedback.
7. Scalability:
o The system should be designed to handle an increasing volume of title submissions
without performance degradation.
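The similarity check and guideline rules above can be sketched in Python. This is a minimal illustration, not the project's actual implementation: the standard library's difflib stands in for Levenshtein/phonetic matching, and the word lists contain only the sample words mentioned above.

```python
import difflib

DISALLOWED_AFFIXES = {"the", "india", "samachar", "news"}  # sample list, not exhaustive
DISALLOWED_WORDS = {"police", "crime", "army"}             # sample list, not exhaustive

def similarity_percentage(new_title: str, existing_title: str) -> float:
    """String similarity as a percentage, via difflib's matching-block ratio."""
    ratio = difflib.SequenceMatcher(
        None, new_title.lower(), existing_title.lower()
    ).ratio()
    return round(ratio * 100, 2)

def violates_guidelines(new_title: str) -> bool:
    """Reject titles containing a disallowed word, prefix, or suffix."""
    words = new_title.lower().split()
    if any(w in DISALLOWED_WORDS for w in words):
        return True
    return words[0] in DISALLOWED_AFFIXES or words[-1] in DISALLOWED_AFFIXES
```

For example, the spelling variation "Namaskar" vs. "Namascar" scores 87.5% here, so it would not slip past the similarity check.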
Expected Solution:
1. Similarity Scoring:
o The system should calculate the similarity percentage between a new title and existing
titles using algorithms like Levenshtein distance, Jaccard similarity, or phonetic
matching algorithms.
2. Verification Probability:
o A similarity score can be used to derive the verification probability. For example, a title
with 80% similarity will have a 20% chance of being verified.
3. Guideline Enforcement:
o Use predefined lists for disallowed words, prefixes, and suffixes to ensure that the
system automatically rejects titles that violate these rules.
o Detect combinations of existing titles using pattern-matching techniques to prevent
accidental new titles from merging old ones.
4. Efficient Database Search:
o The system should use advanced indexing techniques like Elasticsearch or Apache Solr
to ensure that searching through the 160,000 existing titles is fast and efficient.
5. User Feedback:
o The feedback to users should be clear and actionable, explaining why a title was
rejected and offering suggestions for modification.
o The verification probability score should be presented clearly, showing users how close
their title was to being verified.
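The similarity-to-probability rule described above is a simple complement, which can be written directly:

```python
def verification_probability(max_similarity: float) -> float:
    """Probability of acceptance is the complement of the highest
    similarity found against any existing title (both in percent)."""
    return round(100.0 - max_similarity, 2)
```

So a new title whose closest existing title is 80% similar gets a 20% verification probability, matching the worked example in the requirements.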

Technologies and Tools for Implementation:


Frontend:

 ReactJS:
o React, also known as ReactJS, is a popular and powerful JavaScript library used
for building dynamic and interactive user interfaces, primarily for single-page
applications (SPAs). It was developed and maintained by Facebook and has
gained significant popularity due to its efficient rendering techniques, reusable
components, and active community support.


Backend:
 Python:
o Python is a high-level, general-purpose programming language. Its design
philosophy emphasizes code readability with the use of significant indentation.

 Flask server:
o Flask is a Python module that lets you develop web applications easily. It has a small, easy-to-extend core: it is a microframework that does not include an ORM (Object Relational Mapper) or similar features.
o I created the backend API using a Flask server; it lets the user generate the report in one click.
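A minimal sketch of what such a Flask endpoint could look like. The route name, request body, and the fixed similarity value are all illustrative assumptions; in the real backend the similarity would come from the embedding comparison described later.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/verify", methods=["POST"])
def verify():
    # Hypothetical request body: {"title": "..."}
    title = request.get_json()["title"]
    # Placeholder: a fixed similarity stands in for the real embedding comparison.
    similarity = 80.0
    return jsonify({"title": title, "probability": 100.0 - similarity})

if __name__ == "__main__":
    app.run(debug=True)
```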

 Sentence Transformers from Hugging Face:


o Sentence Transformers are a type of neural network architecture that creates vector representations of entire sentences or paragraphs. These embeddings capture the semantic meaning of the text, which allows systems to understand the context, relationships, and intent behind the words.
o They encode a given string into dense vectors so that cosine similarity can be calculated between them.

 BERT:
o BERT (Bidirectional Encoder Representations from Transformers) is an open-source machine learning framework designed for natural language processing (NLP). Originating in 2018, the framework was developed by researchers at Google AI Language.

 Pretrained models (BERT):


o all-mpnet-base-v2:
 This is a sentence-transformers model: It maps sentences & paragraphs
to a 768 dimensional dense vector space and can be used for tasks like
clustering or semantic search.
 This model is used for generating the embeddings of each word in the string; the angle between each pair of vectors is then calculated using the cosine similarity function.
o mixedbread-ai/mxbai-embed-large-v1:
 Another sentence-transformers embedding model, used in the same way to generate dense vectors for similarity comparison.

 Numpy:
o NumPy is a library for the Python programming language, adding support for
large, multi-dimensional arrays and matrices, along with a large collection of
high-level mathematical functions to operate on these arrays.
o I used it for calculating the mean of the collected data.



 PyTorch:
o PyTorch is an open-source deep learning framework with Python and C++ interfaces. PyTorch's functionality resides in the torch module; the data to be processed is supplied in the form of tensors.
o A tensor is what the module's similarity function returns.


 Approach to solving the problem:


o Cosine similarity:
 Cosine similarity is a metric that measures how similar two vectors are in a multi-dimensional space by calculating the cosine of the angle between them. It is commonly used in data analysis, text analysis, image recognition, and recommendation systems.
o I created embeddings of each unique title present in the database.
o Each incoming title is then compared against these embeddings.
o I used Sentence Transformers from Hugging Face with BERT models to generate the embeddings of each title.
o The 768-dimensional dense embedding vectors are used to calculate the angle between them.
o This shows how similar one string is to another.

 Creating binary files to save time:


o Generating the embedding for each string with the encoder takes time, so it is better to save the embeddings in a binary file to make them reusable.
o Pickle is used to save the embeddings into a binary file.
o Every unique title has its own embedding, stored inside an array of dicts, which is then saved to the binary file using pickle.
o This file is read back at similarity-check time.

 A matrix approach to solving the problem:


o In this approach every string is split on spaces, and an embedding is created for each word.
o Each word embedding is compared with the word embeddings of the incoming title.
o This produces a similarity matrix between each stored title and the incoming title.

o Here we can easily compare the similarity of each string to another.

o Each row of the matrix is given a weight, which is multiplied by the highest value present in that row.

o This yields the actual similarity between the strings.


o It is also useful for detecting an exact match by just looking at the diagonal of the matrix.

o Prefix and suffix matches can also be detected by looking at the first and last row elements.
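A sketch of the matrix idea, with two simplifying assumptions: plain string similarity (difflib) stands in for the per-word embedding cosine similarity, and an unweighted mean of row maxima stands in for the weighted aggregation described above.

```python
import difflib
import numpy as np

def word_matrix(existing_title: str, new_title: str) -> np.ndarray:
    """Similarity matrix: rows = words of the stored title,
    columns = words of the incoming title."""
    rows = existing_title.lower().split()
    cols = new_title.lower().split()
    return np.array(
        [[difflib.SequenceMatcher(None, r, c).ratio() for c in cols] for r in rows]
    )

m = word_matrix("Daily Samachar News", "Daily Samachar News")
exact_match = bool(np.all(np.diag(m) == 1.0))  # all-ones diagonal -> exact match
row_scores = m.max(axis=1)                     # best match for each stored word
overall = float(row_scores.mean())             # simple unweighted aggregate
```

Comparing identical titles gives an all-ones diagonal, which is the exact-match signal mentioned above; the first and last rows correspond to the prefix and suffix checks.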

Some Screenshots:
