MD Sohail Me102 Project Report II
On
By
MD Sohail
ME102
CERTIFICATE
This is to certify that the Dissertation II report entitled
Submitted by
of M.E. Computer Engineering (Sem IV) has satisfactorily delivered his seminar and it is
submitted towards the partial fulfilment of the requirement of Savitribai Phule Pune University,
under the Department of Computer Engineering, MMCOE, Pune for the award of the degree
of Master of Engineering (Computer Engineering)
Prof. Dr. Kalpana Sunil Thakre, Internal Guide, Department of Computer Engineering
Prof. Dr. Kalpana Sunil Thakre, Head, Department of Computer Engineering
Acknowledgement
I take this opportunity to express my deep sense of gratitude towards my esteemed guide Prof
Dr. Kalpana Sunil Thakre for giving me this splendid opportunity to select and present this
seminar and providing facilities for successful completion.
I thank Dr. Kalpana Sunil Thakre, Head, Department of Computer Engineering, for opening
the doors of the department towards the realization of the seminar, all the staff members, for
their indispensable support, priceless suggestions and for most valuable time lent as and when
required. With all the respect and gratitude, I would like to thank all the people, who have
helped me directly or indirectly.
Name: MD SOHAIL
Roll No: ME102
List of Publications
Paper Title:
“Managing Token Limitations with RoBERTa-Large for Enhanced Plagiarism Detection”
Status: Submitted
Paper Title:
“Plagiarism Detection solution using LLM-Longformer Model”
Status:
In-Progress
SYNOPSIS
Abstract:
Plagiarism occurs when someone uses words, ideas, or work products, attributable to another
identifiable person or source, without attributing the work to the source from which it was
obtained, in a situation in which there is a legitimate expectation of original authorship, to
obtain some benefit, credit, or gain which need not be monetary. Plagiarism constitutes a severe
form of academic misconduct. In research, plagiarism is included in the three “cardinal sins”,
FFP—Fabrication, falsification, and plagiarism.
Plagiarism constitutes a threat to the educational process because students may receive credit
for someone else’s work or complete courses without achieving the desired learning outcomes.
General perception is that software must be able to easily do things that humans find difficult.
Software cannot determine plagiarism, but it can work as a support tool for identifying some
text similarity that may constitute plagiarism.
This paper reports on a survey of 15 web-based text-matching systems that can be used when
plagiarism is suspected. It draws on several research papers (listed in the reference section)
to support the assessment and analysis of the open-source and commercial plagiarism tools
available on the market. A usability examination was also performed. The sobering results show that
although some systems can indeed help identify some plagiarized content, they clearly do not
find all plagiarism and at times also identify non-plagiarized material as problematic.
In this dissertation report, we will discuss more about the proposed Plagiarism solution based
on Natural Language Processing.
Keywords:
Plagiarism Detection, NLP (Natural Language Processing), LLM (Large Language Model),
RoBERTa, Machine learning algorithm
Objective:
The objective strives to protect and promote original research, strengthen the educational
process, and advance the education system through the utilization of an AI/ML-based solution.
By leveraging distributed processing and implementing a semantic-based search engine, the
objective aims to bring uniqueness and innovation to the research domain while improving the
overall quality of education.
Dashboards are used to identify risks and perform cohort analysis, while reports present results
in the context of students' assignments. Clear and actionable data points are provided for each
submission, including checking for similarity against a leading content database. The objective
aims to uncover text manipulations aimed at bypassing integrity checks and verify the
originality of student work in potential contract cheating cases. It also aims to guide students
towards producing higher-quality academic writing by enabling them to check text similarity
and grammar before submitting. Comparing an assignment to prior student work, analysing
document metadata, and applying a score to assess the probability of contract cheating are also
part of the objective.
Motivation:
The motivation behind this objective encompasses protecting original research, preserving the
educational process, harnessing AI/ML-based solutions, leveraging distributed processing,
developing semantic-based search engines, fostering uniqueness in research, and enhancing the
education system. Through these endeavors, this objective aims to ensure academic integrity,
drive innovation, and create a more effective and inclusive research and education landscape.
One of the primary motivations is to safeguard the originality of research. With the exponential
growth of digital content and the ease of information dissemination, there is an increased risk
of plagiarism and the unauthorized use of others' work. By deploying an AI/ML-based solution,
it becomes possible to detect and prevent instances of research misconduct, ensuring that the
contributions of researchers are duly recognized and protected.
Problem Statement:
Plagiarism involves the use of words, ideas, or work products that can be traced back to another
person or source without proper attribution. It is a serious form of academic misconduct that
undermines the integrity of the educational process. When individuals engage in plagiarism,
they not only fail to protect the original research content of others but also deny recognition,
credit, and benefits to the original researchers. This unethical behaviour poses a threat to the
advancement of knowledge and the fair distribution of rewards within academia. Furthermore,
the absence of accessible open source and commercial software capable of performing
semantic checks on document content exacerbates the challenge of detecting and preventing
plagiarism effectively.
Algorithm Strategy:
The proposed plagiarism detection strategy involves three key components:
Neural Network for Improved Plagiarism Detection: Deep learning, utilizing artificial neural
networks, is proposed for more sophisticated computations on extensive datasets.
The neural network structure mimics the human brain, with layers including an input layer,
hidden layer(s), and an output layer.
This approach aims to enhance the accuracy and efficiency of plagiarism detection using neural
network technology.
Outcome:
This section outlines the probability scores assigned by each plagiarism detection model to
individual suspicious documents within the dataset. The probability values, ranging from 0 to
1, reflect the confidence of each model in identifying potential instances of plagiarism for each
document.
Roberta-base and Bert-base: Exhibit competitive performance but with variability across
datasets.
Roberta-large: Shows improvement over the base model but is surpassed by the modified
version in terms of accuracy and stability.
In summary, Roberta-large-ms stands out as the modified model, consistently delivering
superior accuracy and demonstrating effectiveness in handling large document segments. This
highlights the significance of the modifications made to Roberta-large in improving overall
model performance.
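This synopsis does not describe how Roberta-large-ms handles long inputs. Purely as an illustrative assumption, one common way to work around a fixed token limit is to split a document into overlapping windows, score each window separately, and combine the per-window probabilities. The function names, the window and stride sizes, and the choice of the maximum as the aggregate are all hypothetical and are not taken from the report.

```python
from typing import List


def split_into_windows(tokens: List[str], window: int = 512, stride: int = 256) -> List[List[str]]:
    """Split a token list into overlapping windows of at most `window` tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [tokens]
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the document
        start += stride
    return windows


def aggregate_scores(window_scores: List[float]) -> float:
    """Combine per-window probabilities (assumption: take the maximum)."""
    return max(window_scores) if window_scores else 0.0
```

Taking the maximum flags a document as soon as any one window looks suspicious; averaging would instead dilute a short copied passage inside a long original text.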
References:
https://ieeexplore.ieee.org/document/9548089
https://www.javatpoint.com/apache-spark-architecture
https://educationaltechnologyjournal.springeropen.com/articles/10.1186/s41239-020-00192-4
https://www.plagiarismtools.com/#
https://roboticsbiz.com/top-22-natural-language-processing-nlp-frameworks/
https://analyticsindiamag.com/7-most-popular-nlp-frameworks-in-machine-learning/
https://ieeexplore.ieee.org/document/9667257/figures#figures
https://www.geeksforgeeks.org/sentiment-classification-using-bert/?ref=lbp
http://jalammar.github.io/illustrated-bert/
https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/
https://www.analyticsvidhya.com/blog/2022/09/sentiment-analysis-with-nlp/
https://www.lexalytics.com/blog/machine-learning-natural-language-processing/
https://medium.com/@ritidass29/the-essential-guide-to-how-nlp-works-4d3bb23faf76
Papers Published:
Submitted
SIGN:
SIGN:
SIGN:
Table of Contents
Chapter 1
1 Introduction To ML Based Plagiarism Solution
1.1 Domain Description
1.2 ML Based Plagiarism Solution
1.3 Motivation
Chapter 2
2 Literature Survey
2.1 Literature Summary
Chapter 3
3 Proposed Work
3.1 Objectives and Challenges
3.2 Project Scope
3.3 Problem Statement
3.4 Proposed Algorithm
Chapter 4
4 Software Requirement Specification
Chapter 5
5 Software Design Specification
5.1 System Architecture Design
5.2 Hardware and Software Requirement
5.2.1 Hardware Requirements
5.2.2 Software Requirement
5.3 UML Diagram
5.3.1 Use-Case-Diagram
5.3.2 Class Diagram
5.3.3 Sequence Diagram
5.3.4 Deployment Diagram
Chapter 6
6 SCHEDULE OF WORK
Chapter 7
7 Dataset
Chapter 8
8 Results and Discussion
8.1 Model Training Summary Report
Model Comparison Report Analysis
Top Performers Report
8.2 Plagiarism Result Summary Report
Chapter 9
9 Conclusion
Chapter 10
10 References
List of Tables
List of Figures
Domain Name
ML Based Plagiarism Solution
Technical Keywords
1. Plagiarism
a. Plagiarism Software,
b. Plagiarism Tool
c. Plagiarism Checker
2. Distributed Processing
a. Hadoop big data processing
b. Spark in-memory data processing
3. Machine Learning
a. Machine learning algorithm
b. ML Model
c. Supervised and Unsupervised learning
d. Reinforcement Learning
e. Deep Learning
Chapter 1
The machine learning process begins with observations or data, such as examples, direct
experience or instruction. It looks for patterns in data so it can later make inferences based on
the examples provided. The primary aim of ML is to allow computers to learn autonomously
without human intervention or assistance and adjust actions accordingly.
Machine learning uses two types of techniques: supervised learning, which trains a model on
known input and output data so that it can predict future outputs, and unsupervised learning,
which finds hidden patterns or intrinsic structures in input data.
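The difference between the two techniques can be illustrated with a minimal sketch. The data here are hypothetical one-dimensional similarity scores. The supervised part learns one centroid per known label and uses it to label a new score, while the unsupervised part splits unlabeled scores into two groups without any labels. The function names and the nearest-centroid and two-means methods are assumptions chosen for brevity, not the models used in this report.

```python
from statistics import mean
from typing import Dict, List, Tuple


def fit_centroids(samples: List[Tuple[float, str]]) -> Dict[str, float]:
    """Supervised step: learn one centroid per known label."""
    by_label: Dict[str, List[float]] = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(centroids: Dict[str, float], value: float) -> str:
    """Predict the label whose centroid is closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def two_means(values: List[float], iterations: int = 10) -> Tuple[List[float], List[float]]:
    """Unsupervised step: split unlabeled values into two groups (1-D k-means, k=2)."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = mean(group_low)
        if group_high:
            high = mean(group_high)
    return group_low, group_high
```

In the supervised case the labels ("original", "plagiarised") are supplied with the training data; in the unsupervised case the algorithm only discovers that the scores fall into two groups and leaves their interpretation to the user.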
Application Area of Machine Learning
Machine learning is not science fiction. It is already widely used by businesses across all
sectors to advance innovation and increase process efficiency. In 2021, 41% of companies
accelerated their rollout of AI as a result of the pandemic. These newcomers are joining the
31% of companies that already have AI in production or are actively piloting AI technologies.
• Data security: Machine learning models can identify data security vulnerabilities before
they can turn into breaches. By looking at past experiences, machine learning models
can predict future high-risk activities so risk can be proactively mitigated.
• Finance: Banks, trading brokerages and fintech firms use machine learning algorithms
to automate trading and to provide financial advisory services to investors. Bank of
America is using a chatbot, Erica, to automate customer support.
• Healthcare: ML is used to analyze massive healthcare data sets to accelerate discovery
of treatments and cures, improve patient outcomes, and automate routine processes to
prevent human error. For example, IBM’s Watson uses data mining to provide
physicians data they can use to personalize patient treatment.
• Fraud detection: AI is being used in the financial and banking sector to autonomously
analyze large numbers of transactions to uncover fraudulent activity in real time.
Technology services firm Capgemini claims that fraud detection systems using
machine learning and analytics minimize fraud investigation time by 70% and improve
detection accuracy by 90%.
• Retail: AI researchers and developers are using ML algorithms to develop AI
recommendation engines that offer relevant product suggestions based on buyers’ past
choices, as well as historical, geographic, and demographic data.
1.3 Motivation
• Semantic based search engine
• Bring uniqueness in research area
• Improve education system
Chapter 2
2 Literature Survey
The reviewed literature encompasses a diverse array of contributions to the realm of natural
language processing (NLP) and machine translation. Notably, the introduction of layer
normalization by Ba, Kiros, and Hinton [1] has significantly improved the training and
performance of deep neural networks. Bahdanau, Cho, and Bengio's work on neural machine
translation [2] presents an approach that concurrently learns to align and translate, offering
valuable insights into more effective translation models. Britz, Goldie, Luong, and Le [3]
conduct a thorough exploration of various neural machine translation architectures, shedding
light on design choices and their impact.
The study by Cho, van Merrienboer, Gulcehre, Bougares, Schwenk, and Bengio [7] delves into
learning phrase representations using RNN encoder-decoder, contributing to statistical
machine translation. Lin et al. propose a structured self-attentive sentence embedding method
[8], enriching sentence representation through self-attention. Press and Wolf [9] explore the
utilization of output embedding to improve language models, providing valuable insights into
enhancing language processing models.
Szegedy et al.'s work on rethinking the inception architecture [10] offers innovative approaches
to designing deep neural networks. "Attention Is All You Need" [11], by Vaswani et al.,
introduces the transformer architecture, revolutionizing NLP and machine translation through
the attention mechanism. Jozefowicz et al. [12] explore the limits of language modeling,
addressing challenges and limitations in model development.
Luong, Pham, and Manning's investigation into effective approaches to attention-based neural
machine translation [13] adds depth to the understanding of attention mechanisms in NLP. The
work by Wu et al. [14] on Google's neural machine translation system bridges the gap between
human and machine translation, contributing to advancements in translation quality.
Additionally, the Hugging Face Transformers library and documentation [15] serve as a
valuable resource for practitioners working with transformer-based models, providing a
comprehensive toolkit for NLP tasks.
In recent contributions to natural language processing and machine learning, Sun et al. (2021)
introduce an innovative application of BERT for authorship attribution [7]. The study explores
the use of BERT, a powerful pre-trained language representation model, in the context of
determining authorship in textual content. This work provides insights into the versatility of
BERT beyond traditional language understanding tasks.
capturing nuanced sentiment patterns over extended textual content. This research expands the
applicability of transformer models in sentiment analysis, showcasing advancements in
understanding context-rich information.
These recent studies add depth to the evolving landscape of transformer-based models,
showcasing their adaptability to diverse tasks within natural language processing. The use of
BERT for authorship attribution and Longformer for sentiment analysis highlights the
continued exploration and expansion of transformer architectures in addressing real-world
challenges across different domains.
Collectively, these seminal contributions form the foundation for understanding the evolution
of deep learning models in NLP, machine translation, and related fields, offering a rich source
of insights for researchers and practitioners in the domain.
Chapter 3
3 Proposed Work
“ML Based Plagiarism Solution” (MBPS) will leverage an ML model to build a robust
solution that scans a body of text for instances of plagiarism in the written
content. It will guide students towards higher-quality academic writing by letting them check
text similarity and grammar before submitting a paper. It will be responsible for the following:
• Alert the writer to duplicate content within a body of text
• Cross-reference public websites and web pages to help identify instances of plagiarism
• Leverage a large online database of published work for cross-referencing
• Alert the writer when a section of their writing requires a citation
• Score the document based on its original content
• Use AI/ML technology to identify identical and paraphrased content
It is intended for use at all levels of educational institutions and focuses on English-medium
education. It will leverage the power of the ML model and the Spark distributed processing
engine to find content similarity in submitted papers.
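To make "content similarity" concrete, the sketch below scores two texts by the overlap of their word n-grams (shingles). This is a swapped-in lexical baseline shown for illustration only: MBPS is intended to use an ML model, and a shingle overlap cannot detect paraphrasing. The function names and the default n-gram size of 3 are assumptions.

```python
import re
from typing import Set, Tuple


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of the two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A score near 1.0 indicates near-verbatim copying. A paraphrased passage would typically score low under this measure, which is the gap a semantic model is meant to close.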
Objectives:
The objective strives to protect and promote original research, strengthen the educational
process, and advance the education system through the utilization of an AI/ML-based solution.
By leveraging distributed processing and implementing a semantic-based search engine, the
objective aims to bring uniqueness and innovation to the research domain while improving the
overall quality of education.
Dashboards are used to identify risks and perform cohort analysis, while reports present results
in the context of students' assignments. Clear and actionable data points are provided for each
submission, including checking for similarity against a leading content database. The objective
aims to uncover text manipulations aimed at bypassing integrity checks and verify the
originality of student work in potential contract cheating cases. It also aims to guide students
towards producing higher-quality academic writing by enabling them to check text similarity
and grammar before submitting. Comparing an assignment to prior student work, analysing
document metadata, and applying a score to assess the probability of contract cheating are also
part of the objective.
Challenges
• Contract cheating – students engaging an external party to complete their coursework,
which is then submitted as their own.
• Text manipulation – swapping characters or letters, replacing spaces with invisible
white text, inserting images of text and more, designed to deceive plagiarism detection
tools.
• Source code plagiarism – copying another person’s source code without attributing it
to the owner and claiming it as one’s own.
• Self-plagiarism – submission of a student’s previously published work in its entirety or
reusing parts of it in a new written assignment.
3.2 Project Scope
• For academic use, so that students can detect plagiarism in their research papers
• Focus on documents written in English
• Support for a large volume and variety of documents
• Check for similarity against our industry-leading content database
• Paraphrased content can also be identified
• Protects privacy and security of content
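The text-manipulation challenge listed earlier in this chapter (swapped characters and invisible characters) can be partly addressed by normalising text before it is compared. The sketch below is a minimal illustration under stated assumptions: the set of invisible characters and the five-entry look-alike map are hypothetical and far from complete, and white-coloured text or images of text cannot be detected at the string level at all.

```python
import unicodedata

# Zero-width and other invisible characters sometimes inserted to break up matching strings.
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff", "\u00ad"}

# A tiny, hypothetical look-alike map (Cyrillic -> Latin); a real table would be much larger.
HOMOGLYPHS = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c"}


def normalise(text: str) -> str:
    """Fold compatibility forms, drop invisible characters, map look-alikes, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    cleaned = []
    for ch in text:
        if ch in INVISIBLE:
            continue  # remove characters that render as nothing
        cleaned.append(HOMOGLYPHS.get(ch, ch))
    return " ".join("".join(cleaned).split())
```

Running the similarity check on normalised text means that a word typed with a Cyrillic "о" or split by a zero-width space is compared as the plain Latin word.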
3.3 Problem Statement
Plagiarism involves the use of words, ideas, or work products that can be traced back to another
person or source without proper attribution. It is a serious form of academic misconduct that
undermines the integrity of the educational process. When individuals engage in plagiarism,
they not only fail to protect the original research content of others but also deny recognition,
credit, and benefits to the original researchers. This unethical behaviour poses a threat to the
advancement of knowledge and the fair distribution of rewards within academia. Furthermore,
the absence of accessible open source and commercial software capable of performing
semantic checks on document content exacerbates the challenge of detecting and preventing
plagiarism effectively.
Figure 3. BERT building block
FAST-BERT library:
Fast-Bert is a deep learning library that allows developers and data scientists to train
and deploy BERT and XLNet based models for natural language processing tasks,
beginning with text classification.
Using Fast-Bert, users will be able to:
- Train (more precisely fine-tune) BERT, RoBERTa and XLNet text classification
models on your custom dataset.
- Tune model hyper-parameters such as epochs, learning rate, batch size, optimiser
schedule and more.
- Save and deploy trained model for inference (including on AWS Sagemaker).
Fast-Bert supports both multi-class and multi-label text classification for the
following models:
- BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton
Lee and Kristina Toutanova.
- XLNet (from Google/CMU) released with the paper XLNet: Generalized
Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang
Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
- RoBERTa (from Facebook), a Robustly Optimized BERT Pretraining Approach by
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.
- DistilBERT (from HuggingFace), released together with the blogpost Smaller, faster,
cheaper, lighter: Introducing DistilBERT, a distilled version of BERT by Victor Sanh,
Lysandre Debut and Thomas Wolf.
Data provides each node with information in the form of inputs. The node multiplies the
inputs by weights (initialised randomly), sums the products, and adds a bias. Finally, a nonlinear
function, also known as an activation function, is applied to determine whether the neuron fires.
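The computation performed by a single node can be written out as a short sketch: a weighted sum of the inputs plus a bias, passed through an activation function. The sigmoid is used here as one common choice of activation; the report does not state which activation the proposed network uses, and the function names are illustrative.

```python
import math
import random
from typing import List


def sigmoid(z: float) -> float:
    """Logistic activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs: List[float], weights: List[float], bias: float) -> float:
    """Weighted sum of the inputs plus a bias, passed through the activation function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


# Weights typically start as small random values and are adjusted during training.
rng = random.Random(0)
initial_weights = [rng.uniform(-0.5, 0.5) for _ in range(3)]
```

An output close to 1 corresponds to the neuron "firing" strongly, and an output close to 0 to it staying inactive. Training adjusts the weights and bias so that these outputs match the desired targets.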
Chapter 4
Req_mbps_01
The system should be able to process documents in a variety of formats, including Microsoft
Word, PDF, and plain text.
Req_mbps_02
The system should be able to compare the content of a submitted document against a database
of known sources, such as published articles and academic papers, to identify instances of
plagiarism.
Req_mbps_03
The system should be able to accurately identify copied or paraphrased content, even if the text
has been modified or altered in some way.
Req_mbps_04
The system should provide a report detailing the instances of plagiarism found in the submitted
document, including the sources of the copied content and the percentage of the document that
is plagiarized.
Req_mbps_05
The system should be able to handle large volumes of documents and should have a fast
processing time.
Req_mbps_06
The system should be user-friendly, with a simple interface that allows users to easily submit
documents and view results.
Req_mbps_07
The system should be secure, with measures in place to protect the privacy of submitted
documents and the confidentiality of the results.
Req_mbps_08
The system should have the ability to exclude certain sources or sections of text from the
comparison process, such as common phrases or cited sources.
Req_mbps_09
The system should provide a way for users to review and challenge the results of the plagiarism
check, in the event that a false positive is detected.
Req_mbps_10
The system should handle English-language documents and accurately detect plagiarism
in them.
Req_mbps_11
The system should have a robust database of sources to compare against, with the ability to
continuously update and expand the database.
Req_mbps_12
The system should be compatible with a range of devices and operating systems, including
desktop computers, laptops, and mobile devices.
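Requirements Req_mbps_04 (a report with sources and a plagiarised percentage) and Req_mbps_08 (excluding cited or common passages) can be illustrated together with a small sketch. The `Segment` structure, the field names, and the word-count weighting are assumptions made for illustration; the report does not prescribe how the percentage is computed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One sentence or passage of the submitted document (hypothetical structure)."""
    text: str
    matched_source: Optional[str] = None  # source of the copied content, if any
    excluded: bool = False                # e.g. a quoted and cited passage (Req_mbps_08)


def plagiarism_report(segments: List[Segment]) -> dict:
    """Summarise matched sources and the percentage of checked words that were flagged."""
    checked = [s for s in segments if not s.excluded]
    total_words = sum(len(s.text.split()) for s in checked)
    flagged = [s for s in checked if s.matched_source is not None]
    flagged_words = sum(len(s.text.split()) for s in flagged)
    percent = 100.0 * flagged_words / total_words if total_words else 0.0
    return {
        "percent_plagiarised": round(percent, 1),
        "sources": sorted({s.matched_source for s in flagged}),
    }
```

Excluded segments are removed from both the numerator and the denominator, so a properly cited quotation neither raises nor dilutes the reported percentage.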
Chapter 5
Design Pattern:
✓ Aggregator
✓ API Gateway
✓ Asynchronous Messaging
✓ Database & Shared Data
✓ Event Sourcing
✓ Branch
✓ Command Query Responsibility Segregation (CQRS)
✓ Chain of Responsibility
✓ Circuit Breaker
✓ Security
✓ Decomposition
MBPS is a microservice-based, multi-layer, containerized architecture that can be
deployed on-premises as well as in a cloud environment. Descriptions of the architectural
elements are provided in the table below.
Table 2. Architectural elements specification
Database Aspects | Considerations / Alternatives | Details and (optional) Rationale
Organization | Volume, Table Space, Table, View, Users | Table
Database Side Processing | Use of SPs, Functions, Triggers, Cursors | NA
Connection | Connection Pooling, Driver Type | Postgres and Mongo DB Database
Performance Tuning | Query Optimization, Indexing, De-Normalization | Indexing
Architecture Type:
• MBPS uses multi-layer microservice & containerized architecture
Basis of Layering:
• Controller Layer – This layer will redirect the calls to proper service and business logic
• Service Layer – This layer contains all the business logic
• Repository Layer- This layer contains all the database-related operations
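The three layers can be sketched as follows. The report states that the microservices are written in Java; Python is used here only to keep the illustration short, and all class and method names are hypothetical. An in-memory dictionary stands in for the database.

```python
from typing import Dict, Optional


class DocumentRepository:
    """Repository layer: the only place that touches storage."""

    def __init__(self) -> None:
        self._rows: Dict[int, str] = {}

    def save(self, doc_id: int, text: str) -> None:
        self._rows[doc_id] = text

    def find(self, doc_id: int) -> Optional[str]:
        return self._rows.get(doc_id)


class DocumentService:
    """Service layer: business logic, independent of transport and storage details."""

    def __init__(self, repository: DocumentRepository) -> None:
        self._repository = repository

    def submit(self, doc_id: int, text: str) -> int:
        if not text.strip():
            raise ValueError("document is empty")
        self._repository.save(doc_id, text)
        return len(text.split())


class DocumentController:
    """Controller layer: routes a request to the right service call and shapes the response."""

    def __init__(self, service: DocumentService) -> None:
        self._service = service

    def post_document(self, request: dict) -> dict:
        try:
            words = self._service.submit(request["id"], request["text"])
        except ValueError as error:
            return {"status": 400, "error": str(error)}
        return {"status": 201, "words": words}
```

Because each layer depends only on the one below it, the storage backend or the business rules can be changed without touching the request-handling code.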
Reusability:
• Java – One separate jar is created with all common utilities which are used as a
reference in other microservices.
• Common Jar file consists of classes and methods related to Common Exceptions,
Request and Response classes, Model classes, Classes required for Security and
validations, Enums, licenses, Annotations, and Config Files which are required to set
up the application.
Batch Operation:
• Scheduler: Responsible for running batch operations that extract data from
multiple sources over the internet.
Integration with External System(s): Integration with other microservices and external
services will take place through REST services.
Fail over:
• If a microservice fails, an appropriate exception message is returned.
Deployment Scenario:
• Jenkins with Master-Slave architecture for distributed builds
• Post build reports and notifications
Security:
• Application Gateway (WAF) firewall configuration
• NSG rules for inbound and outbound traffic
• Communication between layers will use the HTTPS protocol, which encrypts
messages sent over the network
Troubleshooting:
• Java - Log4j libraries for logger purpose.
Data Processing Layer: Responsible for extracting data from multiple sources and making it
available for training the ML model.
Model Pretraining Engine: This service will be responsible for training the model on a variety
of data sources.
Plagiarism Model Service: Once the model has been trained and evaluated, it will be exposed as a
service for external use.
Plagiarism Report Service: This service is responsible for presenting data for the reporting and
dashboard interface.
5.2.2 Software Requirement
5.3.1 Use-Case-Diagram
The use case diagram for the MBPS checker includes the following elements:
Actors
Student: the primary user of the tool, who submits a document for plagiarism checking
Teacher: a secondary user of the tool, who may use the tool to check students' work for
plagiarism.
Use Cases:
1. Check document for plagiarism: the main functionality of the tool, in which a student
or teacher submits a document and the tool checks it for plagiarized content
2. View report: the user can view a report on the results of the plagiarism check, including
any instances of plagiarism found and the sources they were copied from
3. Login: the user must authenticate their identity before using the tool
4. Register: new users can create an account to use the tool
5. Manage account: this use case would allow the user to update their account information,
such as their password or email address.
6. Upgrade account: this use case would allow the user to upgrade their account to a
premium version of the tool, which may offer additional features or a higher limit on
the number of documents they can check for plagiarism.
7. Check document for originality: this use case would allow the user to check a document
for originality, rather than just checking for plagiarism. This could be useful for writers
or content creators who want to ensure that their work is completely original.
8. Compare documents: this use case would allow the user to compare two or more
documents to see how similar they are. This could be useful for educators who want to
check for plagiarism between students' papers.
9. View history: this use case would allow the user to view a history of their previous
plagiarism checks, including the results and any reports generated.
Relationships:
1. Extend: the "view report" use case extends the "check document for
plagiarism" use case, meaning that it can only be performed after a plagiarism check
has been run.
2. Include: the "login" and "register" use cases may be included as part of the process for
using the tool, but are not essential to its primary functionality.
Following the analysis, students can "View Report" to understand potential plagiarism
instances. The "Check Document for Originality" use case allows students to assess the
uniqueness of their work beyond plagiarism, receiving feedback for improvement.
Additionally, the system supports "Compare Document," enabling students to benchmark
their work against others. These use cases collectively empower students to ensure the
integrity and originality of their documents through comprehensive checks and
comparisons.
Figure 7b. Use case diagram on governance by actor
The PlagiarismCheckerTool class represents the main tool that coordinates the process of
detecting plagiarism.
The Document class represents a document that is being checked for plagiarism. It has a title and
the actual text of the document.
The Corpus class represents a collection of documents. It holds a list of Document objects.
The PlagiarismDetectionAlgorithm class represents an algorithm that is responsible for detecting
plagiarism in a given document or corpus.
The SimilarityMeasurement class represents a method for measuring the similarity between two
documents. A PlagiarismDetectionAlgorithm can use it to determine the likelihood
that one document is plagiarized from another.
In this design, the BERTModel class represents the BERT model that is used to detect
plagiarism.
The BERTPlagiarismDetectionAlgorithm class is a subclass of PlagiarismDetectionAlgorithm
that uses the BERT model to detect plagiarism. The PlagiarismCheckerTool coordinates the
process of comparing documents and applying plagiarism detection algorithms, and the
PlagiarismResult class represents the result of applying a plagiarism detection algorithm
to a comparison.
The InputProcessor and OutputProcessor classes handle the input and output of the plagiarism
checker tool. The InputProcessor class is responsible for reading in the documents and corpus
to be checked for plagiarism, and the OutputProcessor class is responsible for generating the
report or other output of the tool.
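The class relationships described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the BERT-specific classes are omitted, and WordOverlapSimilarity and ThresholdDetectionAlgorithm are hypothetical stand-ins (a simple word-set overlap and a fixed threshold) introduced only to show how the tool, algorithm, and similarity measurement fit together. Method names such as score, detect, and check are likewise assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    title: str
    text: str


@dataclass
class Corpus:
    documents: List[Document] = field(default_factory=list)


@dataclass
class PlagiarismResult:
    suspicious_title: str
    source_title: str
    score: float


class SimilarityMeasurement(ABC):
    @abstractmethod
    def score(self, a: Document, b: Document) -> float:
        """Return a similarity value between 0.0 and 1.0."""


class WordOverlapSimilarity(SimilarityMeasurement):
    """Hypothetical stand-in: overlap of lower-cased word sets."""

    def score(self, a: Document, b: Document) -> float:
        words_a = set(a.text.lower().split())
        words_b = set(b.text.lower().split())
        union = words_a | words_b
        return len(words_a & words_b) / len(union) if union else 0.0


class PlagiarismDetectionAlgorithm(ABC):
    @abstractmethod
    def detect(self, document: Document, corpus: Corpus) -> List[PlagiarismResult]:
        """Compare one document against every document in the corpus."""


class ThresholdDetectionAlgorithm(PlagiarismDetectionAlgorithm):
    """Hypothetical algorithm: report sources scoring at or above a threshold."""

    def __init__(self, measurement: SimilarityMeasurement, threshold: float = 0.5):
        self.measurement = measurement
        self.threshold = threshold

    def detect(self, document: Document, corpus: Corpus) -> List[PlagiarismResult]:
        results = []
        for source in corpus.documents:
            value = self.measurement.score(document, source)
            if value >= self.threshold:
                results.append(PlagiarismResult(document.title, source.title, value))
        return results


class PlagiarismCheckerTool:
    """Coordinates the check and returns results ordered by score."""

    def __init__(self, algorithm: PlagiarismDetectionAlgorithm):
        self.algorithm = algorithm

    def check(self, document: Document, corpus: Corpus) -> List[PlagiarismResult]:
        results = self.algorithm.detect(document, corpus)
        return sorted(results, key=lambda r: r.score, reverse=True)
```

Because the tool depends only on the abstract PlagiarismDetectionAlgorithm, a BERT-based subclass could be substituted for the threshold stand-in without changing PlagiarismCheckerTool.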
Req_mbps_01
Sequence diagram for handling a variety of document formats such as PDF, DOC, and plain text.
Figure 9. Sequence diagram for handling a variety of document formats
In this diagram, the input handling subsystem is responsible for determining the format of the
submitted document and routing it to the appropriate format conversion subsystem. The format
conversion subsystem is responsible for converting the document to an internal format that can
be easily parsed and analyzed by the rest of the system. Once the document has been converted
and parsed, it can be compared to the database of previously submitted documents and analyzed
for similarity by the comparison and analysis subsystems. Finally, the output subsystem
generates a report that includes any identified matches and the corresponding similarity scores,
and sends it to the user.
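The routing step performed by the input handling subsystem can be sketched as a lookup from file extension to converter. This is an illustrative sketch only: the function names and the CONVERTERS table are hypothetical, format detection by extension is an assumption, and the PDF and DOC converters are left as placeholders because real parsing would need a third-party library that the report does not name.

```python
from pathlib import Path
from typing import Callable, Dict


def convert_plain_text(raw: bytes) -> str:
    """Decode plain text into the internal format (a Python string)."""
    return raw.decode("utf-8", errors="replace")


def convert_placeholder(raw: bytes) -> str:
    """Placeholder for formats that need an external parsing library."""
    raise NotImplementedError("converter for this format is not part of the sketch")


# Hypothetical routing table: extension -> format conversion function.
CONVERTERS: Dict[str, Callable[[bytes], str]] = {
    ".txt": convert_plain_text,
    ".pdf": convert_placeholder,
    ".doc": convert_placeholder,
}


def detect_format(filename: str) -> str:
    """Determine the document format from the file extension."""
    return Path(filename).suffix.lower()


def to_internal_text(filename: str, raw: bytes) -> str:
    """Route the submitted document to the matching converter."""
    suffix = detect_format(filename)
    converter = CONVERTERS.get(suffix)
    if converter is None:
        raise ValueError(f"Unsupported document format: {suffix or 'none'}")
    return converter(raw)
```

Keeping the converters in a table means that supporting a new format only requires registering one more function, and the comparison and analysis subsystems always receive the same internal representation.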
Figure 10. Sequence diagram: Content comparison of document
- The user submits the document to the system through a user interface.
- The system receives the document and begins the plagiarism detection process.
- The system pre-processes the submitted document by performing tasks such as
tokenization and stemming.
- The system retrieves a list of known sources from the database.
- The system pre-processes each known source in the same way as the submitted
document.
- The system compares the submitted document to each known source to identify
instances of plagiarism. This can be done using various techniques such as cosine
similarity or Longest Common Subsequence (LCS).
- The system generates a report that indicates the level of plagiarism for each instance of
plagiarism identified.
- The system displays the report to the user through the user interface.
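The steps above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the system's implementation: crude_stem is a deliberately simple suffix stripper standing in for a real stemmer, documents are represented as term-frequency counts, and build_report is a hypothetical helper that ranks the known sources by cosine similarity.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token: str) -> str:
    """Strip a few common suffixes (stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> Counter:
    """Tokenize, stem, and count terms."""
    return Counter(crude_stem(t) for t in tokenize(text))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_report(submitted: str, sources: Dict[str, str]) -> List[Tuple[str, float]]:
    """Score the submitted document against every source, highest first."""
    vector = preprocess(submitted)
    scores = [(name, cosine_similarity(vector, preprocess(text)))
              for name, text in sources.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Each known source is pre-processed with the same preprocess function as the submitted document, so the two vectors are directly comparable.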
The document pre-processing subsystem could be responsible for tasks such as tokenization
and stemming, while the comparison subsystem could be responsible for techniques such as
cosine similarity or LCS. The report generation subsystem could be responsible for formatting
and presenting the results in a user-friendly way.
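For the LCS technique mentioned for the comparison subsystem, a word-level sketch is given below. It assumes whitespace tokenization and normalizes the LCS length by the length of the submitted document; both choices, and the function names, are illustrative assumptions rather than details taken from the system.

```python
from typing import List, Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    previous: List[int] = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_ratio(submitted: str, source: str) -> float:
    """Share of the submitted document's words that appear, in order, in the source."""
    submitted_words = submitted.lower().split()
    source_words = source.lower().split()
    if not submitted_words:
        return 0.0
    return lcs_length(submitted_words, source_words) / len(submitted_words)
```

Unlike cosine similarity, LCS is sensitive to word order, so the two measures can complement each other in the comparison subsystem.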
5.3.4 Deployment Diagram
Deployment principle:
- The application will be deployed on the Azure cloud environment
- API Gateway with Web Application Firewall
- The API Gateway IP address will be exposed to the external world
- All microservices will be deployed on 2 servers with system configuration: 8-core CPU,
16 GB RAM, 500 GB hard disk
- Secure communication using HTTPS
Chapter 6
6 SCHEDULE OF WORK
Chapter 7
7 Dataset
The dataset used in this study is a limited subset extracted from the International
Competition on Plagiarism Detection PAN 2010. The dataset comprises two distinct types
of files: suspicious documents and source documents. Additionally, the dataset includes
accompanying information indicating whether a suspicious document exhibits plagiarism
from the source documents or not.
Below are the outcomes from the various stages of the solution:
Data Processing
- Clean the text files and combine them into data frames, one each for the source and the
suspicious files
- Build a Pandas data frame for the suspicious files and one for the source files. Storing the
files in data frames makes it easier to apply the same operations to each file later
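A minimal sketch of the cleaning and combining step follows. The cleaning rules (lower-casing, replacing punctuation with spaces, collapsing whitespace), the function names, and the "file" and "text" column names are assumptions made for illustration; the report does not specify them. The sketch uses only the standard library and produces one record per file.

```python
import re
from pathlib import Path
from typing import Dict, List


def clean_text(raw: str) -> str:
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    text = raw.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def load_records(folder: str) -> List[Dict[str, str]]:
    """Read every .txt file in a folder into one cleaned record per file."""
    records = []
    for path in sorted(Path(folder).glob("*.txt")):
        raw = path.read_text(encoding="utf-8", errors="replace")
        records.append({"file": path.name, "text": clean_text(raw)})
    return records
```

Calling load_records once for the source folder and once for the suspicious folder gives two lists of records, and passing each list to pandas.DataFrame yields the two data frames described above.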
Table 7. Trigrams for the suspicious files
We compute two measures for comparing the similarity between the trigrams of the suspicious and
source files:
Similarity measure using Jaccard similarity coefficient
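The Jaccard similarity coefficient is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to trigram sets; it assumes word-level trigrams and whitespace tokenization, which the report does not state explicitly, and the function names are illustrative.

```python
from typing import Set, Tuple

Trigram = Tuple[str, ...]


def word_trigrams(text: str) -> Set[Trigram]:
    """Set of consecutive three-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def jaccard_similarity(a: Set[Trigram], b: Set[Trigram]) -> float:
    """|A intersect B| / |A union B|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

The coefficient is symmetric: swapping the suspicious and source trigram sets gives the same value.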
Similarity measure using Containment measure
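The containment measure is the share of the suspicious document's trigrams that also occur in the source document, that is, the size of the intersection divided by the size of the suspicious trigram set. The sketch below assumes word-level trigrams and whitespace tokenization, neither of which is stated in the report; the function names are illustrative.

```python
from typing import Set, Tuple

Trigram = Tuple[str, ...]


def word_trigrams(text: str) -> Set[Trigram]:
    """Set of consecutive three-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def containment(suspicious: Set[Trigram], source: Set[Trigram]) -> float:
    """|suspicious intersect source| / |suspicious|, or 0.0 for an empty suspicious set."""
    if not suspicious:
        return 0.0
    return len(suspicious & source) / len(suspicious)
```

Containment is asymmetric: a short suspicious passage copied in full from a long source scores 1.0, even though the source contains many trigrams that the passage lacks.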
Chapter 8
Analysis:
Roberta-large-ms (Modified Model): Consistently outperforms other models, demonstrating
superior accuracy (74% to 80%) and effective handling of large document segments.
Roberta-base and Bert-base: Exhibit competitive performance but with variability across
datasets.
Roberta-large: Shows improvement over the base model but is surpassed by the modified
version in terms of accuracy and stability.
In summary, the modified model, Roberta-large-ms, stands out by consistently delivering
superior accuracy and handling large document segments effectively. This
highlights the significance of the modifications made to Roberta-large in improving overall
model performance.
Table 10. Top performer model based on model training results
Analysis:
Roberta-large-ms (Modified Model): Emerges as the top performer, consistently achieving the
highest accuracy (80%) across different datasets. Notably, this model exhibits robust
performance in handling large document segments, showcasing its adaptability to diverse
plagiarism detection tasks.
Roberta-base: Demonstrates competitive accuracy (72%) and stability, particularly excelling
in pan-pc-2011 with the best accuracy achieved in 3 epochs.
Roberta-large: Maintains strong performance with an accuracy range of 72% to 76%,
performing well in pan-pc-2009 with the best accuracy achieved in 5 epochs.
Loss Analysis Report:
This report provides insights into the mean, minimum, and maximum loss values for each
model, offering a comprehensive view of their convergence and stability.
Analysis:
Bert-base: Exhibits the highest mean loss (69.77%), indicating a relatively higher degree of
model uncertainty during training. The range from 68.1% to 71.3% suggests variability in
convergence across different datasets.
Roberta-base: Shows a lower mean loss (61.97%) compared to Bert-base. However, its range from
56.4% to 69.2% is the widest of the four models, indicating that convergence varies across
datasets.
Roberta-large: Presents a moderate mean loss (64.93%), indicating a balanced convergence.
The range from 59.1% to 68.1% suggests moderate variability during training.
Roberta-large-ms (Modified Model): Displays the lowest mean loss (61.03%) and a range from
56.4% to 67.6%, narrower than that of Roberta-base, suggesting comparatively stable
convergence. This aligns with its superior performance in the accuracy metrics.
This section combines information from the previous reports, highlighting the models'
performance, the best-performing model for each document, and summary statistics. The aim
is to offer a consolidated overview of the plagiarism detection results.
Table 12. Model probability score against the suspicious document
Analysis:
The probability scores reveal varying degrees of confidence among the models, with each
assigning distinct values to the same suspicious document. For instance, for
"suspicious-document00003.txt", Bert-base assigns a probability of 0.320, while
Roberta-large-ms assigns a higher probability of 0.470. These variations underscore the
inherent differences in the models' learning mechanisms.
Analysis:
Analysing the summary statistics allows us to grasp the overall behaviour of each model. For
instance, Bert-base exhibits a mean probability of 0.410, indicating a moderate level of
confidence, while Roberta-large-ms demonstrates a higher mean probability of 0.596,
suggesting greater overall confidence in its predictions.
Chapter 9
9 Conclusion
The ML Based Plagiarism Solution can help to identify potential instances of plagiarism in research
paper writing. Such tools work by comparing the text an author has written to a database of
other sources, such as published articles and websites, to see if there are any significant
matches.
There are several benefits to using this solution. For one, it can help you avoid accidental
plagiarism, which occurs when you unintentionally use someone else's work without proper
attribution. This can be a serious issue, as it can result in accusations of academic dishonesty
or even legal consequences. A plagiarism checker tool can help you ensure that you are
properly citing your sources and giving credit where it is due.
Additionally, a plagiarism checker tool can also help you identify instances of deliberate
plagiarism, where someone is trying to pass off someone else's work as their own. This can be
a problem in a variety of contexts, including academic research, business writing, and online
content creation. A plagiarism checker tool can help you identify these instances and take
appropriate action to address them.
In conclusion, this solution is a useful tool for anyone who wants to ensure that their writing is
original and properly cited. It can help you avoid accidental plagiarism and identify instances
of deliberate plagiarism, allowing you to maintain the integrity of your work and avoid any
potential legal or ethical issues.
Chapter 10
10 References
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR,
abs/1409.0473, 2014.
[3] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural machine translation architectures. CoRR,
abs/1703.03906, 2017.
[4] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805
[5] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). Roberta: A robustly optimized BERT approach. arXiv
preprint arXiv:1907.11692.
[6] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv
preprint arXiv:1705.03122v2, 2017
[7] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase
representations using rnn encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
[8] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-
attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017
[9] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016
[10] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for
computer vision. CoRR, abs/1512.00567, 2015
[11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser. "Attention Is All You
Need" Proceedings of the 31st International Conference on Neural Information Processing SystemsDecember 2017, pp 6000–6010
[12] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv
preprint arXiv:1602.02410, 2016
[13] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attentionbased neural machine translation. arXiv
preprint arXiv:1508.04025, 2015
[14] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin
Gao, Klaus Macherey, et al. Google’s neural machine
translation system: Bridging the gap between human and machine translation. arXiv preprint
arXiv:1609.08144, 2016
[15] Hugginffase Transformer library and documentation https://huggingface.co/docs/transformers/index
47
Marathwada Mitra Mandal’s
College of Engineering, Pune
Accredited with ‘A++’ grade by NAAC
PROGRESSIVE RECORD
7  05-Jan-23  Dissertation I Presentation  Preparation of Dissertation I presentation
8  Jul-23  ML Model Creation