Project Report
Problem Statement:
An online system to automatically verify new title submissions by checking for similarities with
existing titles.
Problem Overview:
The system should automatically verify new title submissions by:
Checking for similarity to existing titles.
Enforcing guidelines that prevent the use of disallowed words, combinations of existing
titles, similar meanings in other languages, and variations in periodicity.
Providing a probability score that reflects the likelihood of a title being accepted.
Requirements:
1. Similarity Check:
o Phonetic Similarity: Use algorithms like Soundex or Metaphone to check for similar-
sounding names.
o Prefix/Suffix Similarity: Identify and reject titles that share common prefixes or suffixes
like "The", "India", "Samachar", "News".
o Spelling Variations: Handle minor spelling changes (e.g., "Namaskar" vs. "Namascar")
to prevent bypassing similarity checks.
o Similarity Percentage: Calculate and compare similarity between the new title and
existing titles, giving a percentage value.
2. Prefix/Suffix Handling:
o Maintain a list of disallowed prefixes and suffixes.
o Reject titles that contain these disallowed elements if they cause similarity to an existing
title.
3. Guideline Enforcement:
o Maintain a list of disallowed words (e.g., "Police", "Crime", "Army") that cannot
appear in new titles.
o Prevent the creation of new titles by combining two existing ones (e.g., "Hindu" +
"Indian Express" should be rejected).
o Reject titles with similar meanings in other languages (e.g., "Daily Evening" vs.
"Pratidin Sandhya").
o Disallow the addition of periodicity terms (e.g., "daily", "weekly") to existing titles to
form new ones.
4. Verification Probability:
o The system should provide a probability score that indicates the likelihood of a title
being verified. For example, if a title has a similarity score of 80%, the probability of it
being verified would be 20%.
5. Database Interaction:
o Efficient Search: The system should be capable of quickly searching and comparing
new titles against the database of 160,000 titles using optimized techniques.
o Tracking Submissions: The system should track submitted titles to prevent approving
similar titles in the future.
o Optimized Search: Use indexing and optimized search techniques to ensure fast
processing times.
6. User Feedback:
o Provide users with clear feedback if their title is rejected due to similarity, disallowed
words, prefixes/suffixes, combinations, or violations of other rules.
o Display the verification probability score to the user.
o Allow users to modify and resubmit their titles after receiving feedback.
7. Scalability:
o The system should be designed to handle an increasing volume of title submissions
without performance degradation.
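The rule-based requirements above (disallowed words, periodicity terms, combinations of existing titles) could be checked along the following lines. This is a minimal sketch, not the project's actual implementation: the word lists, the sample existing titles, and the names tokenize and check_title are all hypothetical, and "monthly" is added to the periodicity terms only as an illustration. The function returns human-readable reasons so that they can be shown to the user as feedback.

```python
import re

# Hypothetical sample data; a real system would load these from the title database.
DISALLOWED_WORDS = {"police", "crime", "army"}
PERIODICITY_TERMS = {"daily", "weekly", "monthly"}
EXISTING_TITLES = {"hindu", "indian express"}


def tokenize(title):
    """Lowercase a title and split it into alphabetic words."""
    return re.findall(r"[a-z]+", title.lower())


def check_title(title, existing=EXISTING_TITLES):
    """Return a list of rejection reasons; an empty list means no rule was violated."""
    words = tokenize(title)
    reasons = []

    # Disallowed-word rule.
    banned = sorted(set(words) & DISALLOWED_WORDS)
    if banned:
        reasons.append("contains disallowed word(s): " + ", ".join(banned))

    # Periodicity rule: removing terms like "daily" must not yield an existing title.
    normalized = " ".join(words)
    stripped = " ".join(w for w in words if w not in PERIODICITY_TERMS)
    if stripped != normalized and stripped in existing:
        reasons.append(f"adds a periodicity term to existing title '{stripped}'")

    # Combination rule: the title must not split into two existing titles.
    for i in range(1, len(words)):
        left, right = " ".join(words[:i]), " ".join(words[i:])
        if left in existing and right in existing:
            reasons.append(f"combines existing titles '{left}' and '{right}'")
            break

    return reasons
```

With this sample data, check_title("Hindu Indian Express") reports a combination of two existing titles, while a title that breaks no rule returns an empty list. Exact set lookups keep each check cheap, but they only catch exact matches; fuzzy matches would still need the similarity checks.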
Expected Solution:
1. Similarity Scoring:
o The system should calculate the similarity percentage between a new title and existing
titles using algorithms like Levenshtein distance, Jaccard similarity, or phonetic
matching algorithms.
2. Verification Probability:
o A similarity score can be used to derive the verification probability. For example, a title
with 80% similarity will have a 20% chance of being verified.
3. Guideline Enforcement:
o Use predefined lists for disallowed words, prefixes, and suffixes to ensure that the
system automatically rejects titles that violate these rules.
o Detect combinations of existing titles using pattern-matching techniques so that new
titles formed by merging existing ones are rejected.
4. Efficient Database Search:
o The system should use an indexed search engine such as Elasticsearch or Apache Solr
to ensure that searching through the 160,000 existing titles is fast and efficient.
5. User Feedback:
o The feedback to users should be clear and actionable, explaining why a title was
rejected and offering suggestions for modification.
o The verification probability score should be presented clearly, showing users how close
their title was to being verified.
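The similarity scoring and verification probability described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the function names are hypothetical, the similarity percentage is taken as the Levenshtein distance normalised by the longer title's length, and the probability is 100 minus the highest similarity found, following the 80%/20% example. The soundex function follows the standard American Soundex rules for ASCII letters.

```python
# Soundex digit for each consonant; vowels and h, w, y have no digit.
_SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                           "4": "l", "5": "mn", "6": "r"}.items()
    for letter in letters
}


def soundex(word):
    """Four-character Soundex key, e.g. for matching similar-sounding names."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate letters with the same digit
            prev = digit
    return (code + "000")[:4]


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_percent(a, b):
    """Similarity in percent, normalised by the longer title (an assumption)."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def verification_probability(new_title, existing_titles):
    """100 minus the highest similarity to any existing title."""
    best = max((similarity_percent(new_title, t) for t in existing_titles),
               default=0.0)
    return 100.0 - best
```

For the spelling variation mentioned in the requirements, "Namaskar" and "Namascar" differ by one character out of eight, giving a similarity of 87.5% and a verification probability of 12.5%. Both also map to the same Soundex key, N526. Comparing against all 160,000 titles this way is a linear scan, so in practice the candidates would first be narrowed down by an index.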
ReactJS:
o React, also known as ReactJS, is a popular JavaScript library used
for building dynamic and interactive user interfaces, primarily for single-page
applications (SPAs). It was developed by Facebook (now Meta), which continues
to maintain it, and has gained significant popularity due to its efficient
rendering techniques, reusable components, and active community support.
Backend:
Python:
o Python is a high-level, general-purpose programming language. Its design
philosophy emphasizes code readability with the use of significant indentation.
Flask server:
o Flask is a Python module that lets you develop web applications easily. It has a
small and easy-to-extend core: it is a microframework that does not include an
ORM (Object-Relational Mapper) or similar features.
o I created the backend API with Flask; it lets you generate the report in
one click.
BERT:
o BERT (Bidirectional Encoder Representations from Transformers) is an
open-source language model for natural language processing (NLP).
It was introduced in 2018 by researchers at Google AI Language.
NumPy:
o NumPy is a library for the Python programming language, adding support for
large, multi-dimensional arrays and matrices, along with a large collection of
high-level mathematical functions to operate on these arrays.
o I used it to calculate the mean of the collected data.
o Each row of the matrix is assigned a weight, which is then multiplied by the
highest value in that row.
o A prefix match can also be calculated by looking only at the first and last
elements of a row.
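The row weighting described above might look like the sketch below. The layout is an assumption, since the report does not specify it: each row is taken to hold the scores from one similarity measure and each column one existing title. The sample values, the weights, and the name weighted_score are hypothetical. The final step combines the weighted row maxima into a weighted mean, which is one possible reading of "mean of the collected data".

```python
import numpy as np

# Hypothetical scores: one row per similarity measure, one column per existing title.
scores = np.array([
    [0.2, 0.9, 0.4],  # e.g. phonetic similarity
    [0.5, 0.3, 0.6],  # e.g. spelling similarity
    [0.1, 0.8, 0.7],  # e.g. prefix/suffix similarity
])
weights = np.array([0.5, 0.3, 0.2])  # assumed weights, one per row


def weighted_score(scores, weights):
    """Multiply each row's highest value by that row's weight, then average."""
    row_max = scores.max(axis=1)   # highest value in each row
    weighted = weights * row_max   # element-wise: one product per row
    return float(weighted.sum() / weights.sum())


overall = weighted_score(scores, weights)
```

Here the row maxima are 0.9, 0.6 and 0.8, so the weighted sum is 0.45 + 0.18 + 0.16 = 0.79, and because the weights sum to 1 the overall score is also about 0.79.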