
Generative AI Report

The Plagiarism Detector Project Report outlines the development of a machine learning-based plagiarism detection system that protects content confidentiality while maintaining detection effectiveness. The report discusses various methodologies, objectives, and the importance of plagiarism detection in academic and professional settings. It emphasizes the need for continuous improvement in detection algorithms to address subtle forms of plagiarism and enhance overall accuracy.

Uploaded by

nishamurugan273

PLAGIARISM DETECTOR PROJECT REPORT

Submitted by,

NAME          ROLL NUMBER
A NAVEENA     20211ISE0034
NISHA M       20211ISE0021
BHAVANA       20221LIE0002

Under the Supervision of

Ms. POORNIMA

PRESIDENCY UNIVERSITY,

BENGALURU

DECEMBER 2024

TABLE OF CONTENTS:

ABSTRACT
INTRODUCTION
LITERATURE REVIEW
RESEARCH GAPS OF EXISTING METHODOLOGY
PROPOSED METHODOLOGY
OBJECTIVES
SYSTEM DESIGN AND IMPLEMENTATION
OUTCOMES
CODE
RESULTS
CONCLUSION
APPENDICES
REFERENCES

ABSTRACT:
Plagiarism is an unethical act of using someone else's work or
ideas without giving them credit, which is a growing problem in
various fields. However, the current systems for plagiarism
detection require revealing the full content of input documents
and document collections, which can raise procedural and legal
concerns regarding data confidentiality, limiting or prohibiting
the use of plagiarism detection services. To address these issues,
we aim to create a plagiarism detection approach that doesn't
need a centralized provider or expose any content as cleartext.
Our research has produced initial results showing that our
content-protecting method achieves the same detection
effectiveness as the original method while making it practically
impossible to reveal the protected content through common
attacks. Various techniques, such as manual detection, text
similarity analysis, and automated plagiarism detection using
machine learning, have been developed to detect plagiarism.
This paper focuses on machine learning techniques for
plagiarism detection and discusses different approaches,
algorithms, and datasets used in detecting plagiarism, along
with their advantages and limitations. The paper also presents
some future research directions in this area.

INTRODUCTION:
Plagiarism has become a major issue in academic and other
fields, as it can harm the author's reputation and the credibility
of their research work. Plagiarism is the act of using someone
else's work, ideas, or words without proper credit, and it can
occur intentionally or unintentionally through various forms,
such as copying and pasting, paraphrasing, or using synonyms.
Plagiarism detection systems (PDS) typically require users to
submit input documents, which the systems compare to a large
proprietary database of documents to retrieve similar content
and highlight it for user inspection. There are two types of
plagiarism:

a. Unintentional Plagiarism
• Paraphrasing poorly: changing a few words without changing the
sentence structure of the original, or changing the sentence
structure but not the words.
• Quoting poorly: putting quotation marks around part of a
quotation but not around all of it, or putting quotation marks
around a passage that is partly paraphrased and partly quoted.
• Citing poorly: omitting an occasional citation or citing
inaccurately.

b. Intentional Plagiarism
• Presenting pre-existing papers found on the Internet or
elsewhere as one's own work.
• Reproducing an essay or article from the Internet, an online
resource, or an electronic database without proper citation or
acknowledgment.
• Creating a paper by merging material from various sources
without attribution or citation.
• Taking language or concepts from other sources or classmates
without properly acknowledging the origin of the information.

LITERATURE REVIEW:
2.1 Paper 1 - Plagiarism Detection in Programming Assignments using
Machine Learning
Nishesh Awale, Mitesh Pandey, Anish Dulal, Department of Electronics and
Computer Engineering, Pulchowk Campus, Lalitpur, Nepal.
Plagiarism in programming assignments has been on the rise, which has a
negative impact on how students are evaluated. This article suggests using a
machine learning technique to detect plagiarism in programming
assignments.
• Methodology: The work aims to eliminate copied reports and to highlight
the importance of students writing assignments on their own.
• Findings: Various characteristics associated with a programming
assignment pair were calculated, and an XGBoost model was employed to
classify them. The accuracy score achieved was 92%.

2.2 Paper 2 - Plagiarism Detector Using Machine Learning Algorithms
The easy accessibility of vast information resources has led to an increase in
plagiarism in free text. To address this issue, automated plagiarism detection
systems are used to identify plagiarized content in large databases. However,
this task is complicated by advanced plagiarism methods like paraphrasing
and summarizing that conceal the occurrence of plagiarism.
• Methodology: Paraphrase recognition is treated as an NLP task, and the
objective of the study is to propose a unified technique to detect plagiarism.
The approach is compared with that of a sim plagiarism detector.
• Findings: Operating the system does not require any complex directions or
training. It is a time-efficient plagiarism detection system.

2.3 Paper 3 - Complex Dynamic Event Participant in an Event-Based Social
Network: A Three-Dimensional Matching
Current methods primarily concentrate on organizing techniques that involve
users and events on an EBSN (Event-Based Social Network) platform in an
offline situation, where all data is known in advance.
• Methodology: Detection uses feature extraction from the Ultra-Fined
Trained repositories, extracted using data mining techniques and NLP.
• Findings: A fully connected layer implementation using PyTorch reports
100 percent accuracy, which assures the user that someone else actually
wrote the text, and provides immediate feedback. This tool lessens the
dependency on human

Drawbacks:
PROPOSED METHODOLOGY:
1. Preprocessing the Input Data
• Text Extraction:
  o Extract plain text from the input document(s). This step handles
    various file formats like .txt, .docx, .pdf, etc., using libraries like
    docx for Word documents or PyPDF2 for PDF files.
• Normalization:
  o Convert the text to lowercase.
  o Remove special characters, numbers, and extra spaces.
  o Tokenize the text into smaller units (e.g., sentences or words).
  o Lemmatize or stem words to reduce them to their base forms (e.g.,
    "running" → "run").

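The normalization steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the report's implementation: the function names `naive_stem` and `normalize` are hypothetical, and the suffix-stripping stemmer is a crude stand-in for the lemmatizers or stemmers that NLTK or spaCy provide.

```python
import re

# Crude suffix list standing in for a real stemmer or lemmatizer (assumption).
_SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip one common suffix; a rough stand-in for NLTK or spaCy."""
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


def normalize(text: str) -> list[str]:
    """Lowercase, drop non-letters, collapse spaces, tokenize into words, and stem."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove special characters and numbers
    tokens = text.split()                   # split() also discards extra spaces
    return [naive_stem(token) for token in tokens]


print(normalize("Running 3 tests, QUICKLY!!"))
```

A production system would likely replace `naive_stem` with a proper lemmatizer, since simple suffix stripping mishandles many English words.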
2. Database or Corpus Comparison
• The detector compares the processed text against:
  o A local database of previously submitted documents.
  o Online content through web scraping or APIs.
  o Published academic works in databases like PubMed or arXiv.
• Some tools use large-scale web crawlers to index online content for
  comparison.
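
As a rough sketch of the local-database case only, the snippet below stores previously submitted documents in SQLite and returns the ones that share enough words with a new submission to deserve a closer comparison. The table layout, the function names, and the `min_shared` cut-off are all assumptions made for illustration; web scraping and academic database APIs are not covered.

```python
import sqlite3


def open_corpus(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small document store; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, name TEXT UNIQUE, body TEXT NOT NULL)"
    )
    return conn


def add_document(conn: sqlite3.Connection, name: str, body: str) -> None:
    """Store one previously submitted document under a unique name."""
    conn.execute(
        "INSERT OR REPLACE INTO documents (name, body) VALUES (?, ?)", (name, body)
    )
    conn.commit()


def candidate_sources(conn: sqlite3.Connection, tokens: list[str], min_shared: int = 2) -> list[str]:
    """Return names of stored documents sharing at least min_shared distinct tokens."""
    wanted = set(tokens)
    hits = []
    for name, body in conn.execute("SELECT name, body FROM documents ORDER BY name"):
        shared = wanted & set(body.lower().split())
        if len(shared) >= min_shared:
            hits.append(name)
    return hits


conn = open_corpus()
add_document(conn, "a.txt", "machine learning detects plagiarism")
add_document(conn, "b.txt", "cooking pasta at home")
print(candidate_sources(conn, ["plagiarism", "machine", "pasta"]))
```

Scanning every stored document is fine for a small class-sized corpus; a larger collection would need an index over terms or shingles instead.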

3. Similarity Analysis
• Exact Matching:
  o Direct word-for-word matching between the input text and the
    database.
• Shingling (N-Gram Matching):
  o Break the text into overlapping sequences of N words (e.g., "I love
    programming" → ["I love", "love programming"]).
  o Compare these N-grams to detect similarities.
• Semantic Similarity:
  o Use Natural Language Processing (NLP) models to detect
    paraphrased or semantically similar sentences.
  o Tools like Word2Vec, BERT, or Sentence Transformers are often
    employed.
• Citation Checking:
  o Determine whether properly cited content appears as a match or
    whether citations are missing.
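
The shingling step can be illustrated with a few lines of Python. The sketch below builds word N-grams and compares two texts with Jaccard similarity (shared shingles divided by all distinct shingles). Jaccard is one common way to compare shingle sets and is an assumption here, since the report does not name a specific measure; the function names are hypothetical.

```python
def shingles(text: str, n: int = 2) -> list[str]:
    """Return the overlapping word n-grams of a text, in order."""
    words = text.split()
    return [" ".join(words[i : i + n]) for i in range(len(words) - n + 1)]


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity of the two texts' shingle sets, ignoring case."""
    set_a = set(shingles(a.lower(), n))
    set_b = set(shingles(b.lower(), n))
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


print(shingles("I love programming"))
print(jaccard("I love programming in Python", "I love coding in Python"))
```

Exact matching is the special case where the score reaches 1.0. Paraphrased text tends to score low on this measure, which is why the semantic similarity step relies on embedding models rather than surface N-grams.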

4. Plagiarism Scoring
• Percentage Similarity:
  o The tool calculates the percentage of the document that matches
    content from other sources.
• Type of Match:
  o Identifies whether the match is:
    ▪ Direct (verbatim copying).
    ▪ Near-verbatim (minor changes in wording).
    ▪ Paraphrased (content rephrased but retains the same meaning).
• Threshold:
  o Apply a threshold (e.g., 15%) to distinguish between acceptable and
    plagiarized content.
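
A minimal scoring sketch, using `difflib.SequenceMatcher` from the standard library to count how many of the document's words fall inside blocks shared with a source. The 15% threshold comes from the text above; the 80% cut-off for "near-verbatim" and the function names are illustrative assumptions. Word overlap alone cannot confirm paraphrasing, so the last category is deliberately labeled as uncertain.

```python
from difflib import SequenceMatcher


def percent_matched(document: str, source: str) -> float:
    """Percentage of the document's words that lie in blocks shared with the source."""
    doc_words = document.lower().split()
    src_words = source.lower().split()
    if not doc_words:
        return 0.0
    matcher = SequenceMatcher(None, doc_words, src_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(doc_words)


def match_type(percent: float) -> str:
    """Map a score to a match type; the 80% cut-off is an assumed value."""
    if percent >= 100.0:
        return "direct"
    if percent >= 80.0:
        return "near-verbatim"
    return "partial or possibly paraphrased"


def is_flagged(percent: float, threshold: float = 15.0) -> bool:
    """Apply the acceptance threshold (15% is the example given in the text)."""
    return percent > threshold


score = percent_matched("the cat sat on a mat", "the cat sat on the mat")
print(round(score, 1), match_type(score), is_flagged(score))
```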

5. Report Generation
• Highlight Matches:
  o Mark plagiarized portions in the text with links to the matched sources.
• Detailed Report:
  o Provide a breakdown of:
    ▪ Matched content.
    ▪ Matched sources (e.g., URLs, document names).
    ▪ Overall similarity score (e.g., 25% plagiarized).
• Categorization:
  o Separate the matches into properly cited and uncited categories.
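
One possible shape for such a report is sketched below. It marks matched passages inline with `[[ ]]`, lists each matched passage with its source name, and computes an overall score. The marker style, the dictionary keys, the `min_words` cut-off, and the function name `build_report` are all assumptions; separating cited from uncited matches is left out because it needs citation parsing.

```python
from difflib import SequenceMatcher


def build_report(document: str, sources: dict[str, str], min_words: int = 3) -> dict:
    """Highlight passages of at least min_words words that also occur in a source."""
    doc_words = document.split()
    doc_lower = [word.lower() for word in doc_words]
    marked = [False] * len(doc_words)
    matches = []
    for name, text in sources.items():
        matcher = SequenceMatcher(None, doc_lower, text.lower().split(), autojunk=False)
        for block in matcher.get_matching_blocks():
            if block.size >= min_words:
                passage = " ".join(doc_words[block.a : block.a + block.size])
                matches.append({"source": name, "text": passage})
                for i in range(block.a, block.a + block.size):
                    marked[i] = True
    # Wrap each run of marked words in [[ ]] so it stands out in plain text.
    pieces, i = [], 0
    while i < len(doc_words):
        if marked[i]:
            j = i
            while j < len(doc_words) and marked[j]:
                j += 1
            pieces.append("[[" + " ".join(doc_words[i:j]) + "]]")
            i = j
        else:
            pieces.append(doc_words[i])
            i += 1
    score = 100.0 * sum(marked) / len(doc_words) if doc_words else 0.0
    return {
        "highlighted": " ".join(pieces),
        "matches": matches,
        "similarity_percent": round(score, 1),
    }


report = build_report(
    "Our study shows that machine learning detects copied text reliably",
    {"paper_a.txt": "Earlier work found machine learning detects copied text in essays"},
)
print(report["highlighted"])
print(report["similarity_percent"])
```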

6. Additional Features
• Exclusion Filters:
  o Exclude common phrases, citations, or bibliography sections from
    plagiarism detection.
• Customization:
  o Allow users to define thresholds and match types (e.g., exclude
    matches below a certain percentage).
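
These options could be grouped into a small settings object, as in the hedged sketch below. The class `DetectorSettings`, its field names and defaults (other than the 15% example threshold), and the two helper functions are hypothetical. The filters are intentionally simple: quoted passages are detected by straight double quotes on a single line, and the bibliography is assumed to start at a line reading "References" or "Bibliography".

```python
import re
from dataclasses import dataclass


@dataclass
class DetectorSettings:
    """User-adjustable options; all names and defaults are illustrative."""
    threshold_percent: float = 15.0
    min_match_percent: float = 1.0
    exclude_quotes: bool = True
    exclude_bibliography: bool = True


def apply_exclusions(text: str, settings: DetectorSettings) -> str:
    """Remove quoted passages and the bibliography before similarity analysis."""
    if settings.exclude_bibliography:
        # Keep only the text before a "References" or "Bibliography" heading line.
        text = re.split(
            r"(?im)^\s*(references|bibliography)\s*:?\s*$", text, maxsplit=1
        )[0]
    if settings.exclude_quotes:
        text = re.sub(r'"[^"\n]*"', " ", text)
    return re.sub(r"[ \t]+", " ", text).strip()


def keep_matches(matches: list[dict], settings: DetectorSettings) -> list[dict]:
    """Drop matches whose share of the document is below the user's minimum."""
    return [m for m in matches if m["percent"] >= settings.min_match_percent]


sample = 'Smith said "copying is wrong" in class.\nReferences:\n1. Smith 2020'
print(apply_exclusions(sample, DetectorSettings()))
```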

Tools and Libraries Used
• Python Libraries:
  o NLTK or spaCy for text preprocessing and tokenization.
  o difflib for sequence matching.
  o FuzzyWuzzy for fuzzy string matching.
• Similarity Models:
  o TF-IDF Vectorization.
  o Cosine Similarity.
  o Pre-trained NLP models (e.g., BERT, Sentence-BERT).
• Database Tools:
  o SQL databases for storing local content.
  o Web scraping libraries like BeautifulSoup or APIs for accessing
    online content.
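
To show how TF-IDF vectorization and cosine similarity fit together, here is a small from-scratch sketch that uses only the standard library. It uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, which is one common variant and an assumption here; the function names are hypothetical, and a real system would more likely use a library vectorizer.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "stock prices fell sharply today",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Identical documents score close to 1 and documents with no shared terms score 0, so the result can feed directly into the percentage threshold used for scoring.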

The methodology for AI-Based Interview Preparation Tool design and
implementation is structured, integrating state-of-the-art AI technologies with
user-friendly interfaces and detailed feedback systems in creating a mock
interview preparation environment. Major stages in this process will involve:
1. Requirement Analysis
• Outline the key features required for an effective interview preparation
tool, such as question creation, response assessment, and tailored
feedback.
• Identify the drawbacks present in the current tools and approaches,
such as limited adaptability and feedback in real-time.
• Clearly define the main audience, focusing on technical job seekers
such as those looking for employment in software development,
programming, and engineering.

6. Iterative Improvement

• Collect feedback from the users during the testing phase to identify
aspects that need improvement.
• Enhance the prompt templates for a better variety in questions and
clearer feedback.
• Optimize system performance for quicker responses and smooth user
interactions.

OBJECTIVES:
A plagiarism checker is a vital tool designed to ensure originality
and uphold ethical standards in academic, professional, and creative
domains. By detecting instances of unoriginal or copied content, it
promotes academic integrity and fosters a culture of honesty. These
tools ensure that submitted work genuinely reflects the creator’s
effort and knowledge, discouraging unethical practices like copying
or paraphrasing without proper citation. In educational settings,
they encourage students to produce independent and innovative
work while guiding researchers to maintain high standards in their
publications.

One of the core objectives of a plagiarism checker is to support
proper attribution. It identifies improperly or inadequately cited
content and helps users correct these errors by pointing to the
original sources. Proper citation is crucial not only for giving due
credit to original authors but also for building trust in the quality
and reliability of a piece of work. Additionally, plagiarism checkers
play a critical role in preventing copyright infringement by
identifying unauthorized use of intellectual property, protecting the
rights of content creators, and maintaining legal compliance.

Plagiarism checkers also contribute significantly to quality control
in publishing and professional environments. Publishers and peer
reviewers use these tools to identify and address any unoriginal
content before it reaches the public. This ensures that journals,
books, and articles meet the standards of originality and
authenticity expected in the industry. Similarly, in professional
settings, they ensure that deliverables are unique and reflective of
an organization’s values, safeguarding its credibility and reputation.

Beyond detection, plagiarism checkers act as educational tools for
improving writing and research skills. They provide constructive
feedback by highlighting plagiarized sections and offering insights
on proper citation and paraphrasing techniques. This fosters a better
understanding of ethical writing practices and encourages
individuals to adopt creative and independent approaches in their
work. For students and professionals alike, these tools become
invaluable resources for refining their skills and achieving
originality.

In conclusion, plagiarism checkers serve as indispensable tools for
maintaining integrity, protecting intellectual property, and
enhancing the quality of work. They enable fair assessments in
academic and professional settings by ensuring originality and
discouraging dishonest practices. By promoting ethical standards,
supporting proper attribution, and encouraging the development of
independent thinking, plagiarism checkers contribute to a culture of
trust, creativity, and credibility in all areas of content creation and
research.

CONCLUSION:

The detection of plagiarism is a crucial task in various fields, including
academia. The use of machine learning has significantly transformed the field
of plagiarism detection. The utilization of machine learning algorithms has
been established as an effective and efficient method for detecting plagiarism.
These algorithms can analyse vast amounts of text and identify patterns that
may indicate plagiarism. Several methods, including rule-based, text-based, and
hybrid techniques, have been utilized for plagiarism detection using machine
learning. However, the accuracy of these techniques depends on several factors,
such as text size, language complexity, and dataset quality. By incorporating
techniques such as natural language processing and text similarity analysis,
machine learning algorithms can accurately detect instances of plagiarism in
large datasets, thereby saving time and effort for educators and researchers.
Despite their efficacy in detecting direct plagiarism, these algorithms may not
always be able to identify more subtle forms of plagiarism, such as
patchwriting or paraphrasing. Hence, it is imperative to refine these
algorithms to improve their accuracy and effectiveness in detecting all
forms of plagiarism. Overall, machine learning for plagiarism detection is a
promising area of research that can significantly enhance the quality and
integrity of academic work. Educators and researchers must continue to explore
and utilize these tools to promote academic honesty and research credibility.
The combination of natural language processing, text similarity analysis, and
machine learning algorithms such as k-NN, SVM, and neural networks has
shown potential in improving plagiarism detection accuracy. Future research
should focus on developing more precise and efficient techniques for
plagiarism detection.
