
A Dissertation II Report

On

ML Based Plagiarism Solution


SUBMITTED TO THE SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE AWARD OF THE DEGREE

MASTER OF ENGINEERING (COMPUTER ENGINEERING)

By

MD Sohail

ME102

Under the guidance of

Dr. Kalpana Sunil Thakre


Professor and HOD
[email protected]

DEPARTMENT OF COMPUTER ENGINEERING

Marathwada Mitra Mandal’s College of Engineering,


Karvenagar,
Savitribai Phule Pune University
2023-24
Marathwada Mitra Mandal’s
College of Engineering Karvenagar, Pune
Accredited with ‘A++’ grade by NAAC

CERTIFICATE
This is to certify that the Dissertation II report entitled

ML Based Plagiarism Solution

Submitted by

MD SOHAIL Exam No: ME102

of M.E. Computer Engineering (Sem IV) has satisfactorily delivered his seminar and it is
submitted towards the partial fulfilment of the requirements of Savitribai Phule Pune University,
under the Department of Computer Engineering, MMCOE, Pune, for the award of the degree
of Master of Engineering (Computer Engineering).

Date: 12 December 2023


Place: Pune.

Prof. Dr. Kalpana Sunil Thakre Prof Dr. Kalpana Sunil Thakre
Internal Guide Head
Department of Computer Engineering Department of Computer Engineering

Internal Examiner: (Name) ___________________ (Sign) _______________

External Examiner: (Name) ___________________ (Sign) _______________

Acknowledgement

I take this opportunity to express my deep sense of gratitude towards my esteemed guide, Prof.
Dr. Kalpana Sunil Thakre, for giving me this splendid opportunity to select and present this
seminar and for providing the facilities needed for its successful completion.
I thank Dr. Kalpana Sunil Thakre, Head, Department of Computer Engineering, for opening
the doors of the department towards the realization of this seminar, and all the staff members
for their indispensable support, priceless suggestions, and the valuable time lent as and when
required. With all respect and gratitude, I would like to thank all the people who have
helped me directly or indirectly.

Name: MD SOHAIL
Roll No: ME102

List of Publications
Paper Title:
“Managing Token Limitations with RoBERTa-Large for Enhanced Plagiarism Detection”

Track Name: Computational Technologies

Paper ID: 148

Status: Submitted

Paper Title:
“Plagiarism Detection solution using LLM-Longformer Model”

Status:
In-Progress

SYNOPSIS

Title: ML Based Plagiarism Solution

Name of the Candidate: MD. SOHAIL Exam Number: ME102

Guide Name: Dr. Kalpana Sunil Thakre

Branch: Computer Engineering

College Name: MMCOE Pune College Code: 4045

Domain Name: Machine Learning

Abstract:
Plagiarism occurs when someone uses words, ideas, or work products, attributable to another
identifiable person or source, without attributing the work to the source from which it was
obtained, in a situation in which there is a legitimate expectation of original authorship, to
obtain some benefit, credit, or gain which need not be monetary. Plagiarism constitutes a severe
form of academic misconduct. In research, plagiarism is counted among the three "cardinal sins"
of misconduct, FFP: fabrication, falsification, and plagiarism.
Plagiarism constitutes a threat to the educational process because students may receive credit
for someone else's work or complete courses without achieving the desired learning outcomes.
The general perception is that software should easily be able to do things that humans find
difficult. Software cannot determine plagiarism on its own, but it can serve as a support tool for
identifying text similarity that may constitute plagiarism.
This report surveys 15 web-based text-matching systems that can be used when plagiarism is
suspected. Several research papers (listed in the reference section) were consulted to support
the assessment and analysis of the open-source and commercial plagiarism tools available in
the market. A usability examination was also performed. The sobering results show that
although some systems can indeed help identify some plagiarized content, they clearly do not
find all plagiarism and at times also flag non-plagiarized material as problematic.

In this dissertation report, we discuss the proposed plagiarism solution based on Natural
Language Processing in more detail.

Keywords:
Plagiarism Detection, NLP (Natural Language Processing), LLM (Large Language Model),
RoBERTa, Machine Learning Algorithm

Objective:
The objective strives to protect and promote original research, strengthen the educational
process, and advance the education system through the utilization of an AI/ML-based solution.
By leveraging distributed processing and implementing a semantic-based search engine, the

objective aims to bring uniqueness and innovation to the research domain while improving the
overall quality of education.
Dashboards are used to identify risks and perform cohort analysis, while reports present results
in the context of students' assignments. Clear and actionable data points are provided for each
submission, including checking for similarity against a leading content database. The objective
aims to uncover text manipulations aimed at bypassing integrity checks and verify the
originality of student work in potential contract cheating cases. It also aims to guide students
towards producing higher-quality academic writing by enabling them to check text similarity
and grammar before submitting. Comparing an assignment to prior student work, analysing
document metadata, and applying a score to assess the probability of contract cheating are also
part of the objective.

Motivation:
The motivation behind this objective encompasses protecting original research, preserving the
educational process, harnessing AI/ML-based solutions, leveraging distributed processing,
developing semantic-based search engines, fostering uniqueness in research, and enhancing the
education system. Through these endeavors, this objective aims to ensure academic integrity,
drive innovation, and create a more effective and inclusive research and education landscape.
One of the primary motivations is to safeguard the originality of research. With the exponential
growth of digital content and the ease of information dissemination, there is an increased risk
of plagiarism and the unauthorized use of others' work. By deploying an AI/ML-based solution,
it becomes possible to detect and prevent instances of research misconduct, ensuring that the
contributions of researchers are duly recognized and protected.

Problem Statement:
Plagiarism involves the use of words, ideas, or work products that can be traced back to another
person or source without proper attribution. It is a serious form of academic misconduct that
undermines the integrity of the educational process. When individuals engage in plagiarism,
they not only fail to protect the original research content of others but also deny recognition,
credit, and benefits to the original researchers. This unethical behaviour poses a threat to the
advancement of knowledge and the fair distribution of rewards within academia. Furthermore,
the absence of accessible open source and commercial software capable of performing
semantic checks on document content exacerbates the challenge of detecting and preventing
plagiarism effectively.

Algorithm Strategy:
The proposed plagiarism detection strategy involves three key components:

BERT Algorithm for Document Similarity: BERT (Bidirectional Encoder Representations
from Transformers) is employed for plagiarism detection.
BERT is based on transformer architecture and leverages ideas from various NLP innovations,
including semi-supervised sequence learning, ELMo, ULMFiT, and the OpenAI transformer.
Pre-trained on a large corpus of unlabelled text, BERT achieves a deep and comprehensive
understanding of language. Fine-tuning the pre-trained model enables state-of-the-art
performance in a variety of NLP tasks.
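
As an illustration of this strategy (not the exact project pipeline), the following sketch computes
a similarity score between two short texts using BERT embeddings from the Hugging Face
Transformers library; the model name, the mean-pooling choice, and the sample texts are
assumptions made for the example.

# Illustrative sketch only: document similarity via BERT embeddings.
# Model name, pooling strategy, and sample texts are assumptions, not project code.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    # Tokenize, truncate to BERT's 512-token limit, and mean-pool the last hidden states.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

suspicious = "The quick brown fox jumps over the lazy dog."
source = "A fast brown fox leaped over a sleepy dog."

similarity = torch.cosine_similarity(embed(suspicious), embed(source), dim=0)
print(f"Cosine similarity: {similarity.item():.3f}")  # values near 1.0 suggest overlap

In practice, long documents must be split into segments because BERT accepts at most 512
tokens, which is the limitation the modified RoBERTa-large model in this work is meant to
address.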

Neural Network for Improved Plagiarism Detection: Deep learning, utilizing artificial neural
networks, is proposed for more sophisticated computations on extensive datasets.
The neural network structure mimics the human brain, with layers including an input layer,
hidden layer(s), and an output layer.
This approach aims to enhance the accuracy and efficiency of plagiarism detection using neural
network technology.

Data Extraction and Analytic Library (Python): Python is chosen as the programming language
for data science tasks, offering ease of learning, debugging, and a wide range of libraries.
Python libraries such as TensorFlow, Numpy, SciPy, Pandas, Matplotlib, Keras, SciKit-Learn,
PyTorch, and Scrapy are highlighted for their significance in data science.
These libraries contribute to the implementation of the proposed algorithms and the overall
data extraction and analysis process.

Outcome:
This section outlines the probability scores assigned by each plagiarism detection model to
individual suspicious documents within the dataset. The probability values, ranging from 0 to
1, reflect the confidence of each model in identifying potential instances of plagiarism for each
document.

Table 1. Model probability score against the suspicious document

Roberta-large-ms (Modified Model): Consistently outperforms other models, demonstrating
superior accuracy (74% to 80%) and effective handling of large document segments.

Roberta-base and Bert-base: Exhibit competitive performance but with variability across
datasets.
Roberta-large: Shows improvement over the base model but is surpassed by the modified
version in terms of accuracy and stability.
In summary, Roberta-large-ms stands out as the modified model, consistently delivering
superior accuracy and demonstrating effectiveness in handling large document segments. This
highlights the significance of the modifications made to Roberta-large in improving overall
model performance.

Conclusion and Discussion:


ML Based Plagiarism Solution can help to identify potential instances of plagiarism in research
paper writing. These tools work by comparing the text authors have written to a database of
other sources, such as published articles and websites, to see if there are any significant
matches.
There are several benefits to using this solution. For one, it can help you avoid accidental
plagiarism, which occurs when you unintentionally use someone else's work without proper
attribution. This can be a serious issue, as it can result in accusations of academic dishonesty
or even legal consequences. A plagiarism checker tool can help you ensure that you are
properly citing your sources and giving credit where it is due.
Additionally, a plagiarism checker tool can also help you identify instances of deliberate
plagiarism, where someone is trying to pass off someone else's work as their own. This can be
a problem in a variety of contexts, including academic research, business writing, and online
content creation. A plagiarism checker tool can help you identify these instances and take
appropriate action to address them.
In conclusion, this solution is a useful tool for anyone who wants to ensure that their writing is
original and properly cited. It can help you avoid accidental plagiarism and identify instances
of deliberate plagiarism, allowing you to maintain the integrity of your work and avoid any
potential legal or ethical issues.

References:
https://ieeexplore.ieee.org/document/9548089
https://www.javatpoint.com/apache-spark-architecture
https://educationaltechnologyjournal.springeropen.com/articles/10.1186/s41239-020-00192-4
https://www.plagiarismtools.com/#
https://roboticsbiz.com/top-22-natural-language-processing-nlp-frameworks/
https://analyticsindiamag.com/7-most-popular-nlp-frameworks-in-machine-learning/
https://ieeexplore.ieee.org/document/9667257/figures#figures
https://www.geeksforgeeks.org/sentiment-classification-using-bert/?ref=lbp
http://jalammar.github.io/illustrated-bert/
https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/
https://www.analyticsvidhya.com/blog/2022/09/sentiment-analysis-with-nlp/
https://www.lexalytics.com/blog/machine-learning-natural-language-processing/
https://medium.com/@ritidass29/the-essential-guide-to-how-nlp-works-4d3bb23faf76

Papers Published:
Submitted

SIGN:

Student Name: MD Sohail

SIGN:

GUIDE NAME: Prof Dr. Kalpana S. Thakre

SIGN:

Dr. Swati N. Shekapure Prof Dr. Kalpana S. Thakre

M.E. Co-ordinator HOD, Computer Dept.

Table of Contents
Chapter 1 .................................................................................................................................. 15
1 Introduction To ML Based Plagiarism Solution .............................................................. 15
1.1 Domain Description .................................................................................................. 15
1.2 ML Based Plagiarism Solution ................................................................................. 16
1.3 Motivation ................................................................................................................. 16
Chapter 2 .................................................................................................................................. 18
2 Literature Survey ............................................................................................................. 18
2.1 Literature Summary................................................................................................... 19
Chapter 3 .................................................................................................................................. 21
3 Proposed Work................................................................................................................. 21
3.1 Objectives and Challenges ........................................................................................ 21
3.2 Project Scope ............................................................................................................. 22
3.3 Problem Statement .................................................................................................... 22
3.4 Proposed Algorithm .................................................................................................. 22
Chapter 4 .................................................................................................................................. 25
4 Software Requirement Specification ............................................................................... 25
Chapter 5 .................................................................................................................................. 27
5 Software Design Specification ......................................................................................... 27
5.1 System Architecture Design ...................................................................................... 27
5.2 Hardware and Software Requirement ....................................................................... 30
5.2.1 Hardware Requirements..................................................................................... 30
5.2.2 Software Requirement ....................................................................................... 31
5.3 UML Diagram ........................................................................................................... 31
5.3.1 Use-Case-Diagram ............................................................................................. 31
5.3.2 Class Diagram .................................................................................................... 33
5.3.3 Sequence Diagram ............................................................................................. 34
5.3.4 Deployment Diagram ......................................................................................... 37
Chapter 6 .................................................................................................................................. 38
6 SCHEDULE OF WORK ................................................................................................. 38
Chapter 7 .................................................................................................................................. 39
7 Dataset.............................................................................................................................. 39
Chapter 8 .................................................................................................................................. 42
8 Results and Discussion .................................................................................................... 42
8.1 Model Training Summary Report ............................................................................. 42
Model Comparison Report Analysis .................................................................................... 42
Top Performers Report ........................................................................................................ 42

8.2 Plagiarism Result Summary Report .......................................................................... 44
Chapter 9 .................................................................................................................................. 46
9 Conclusion ....................................................................................................................... 46
Chapter 10 ................................................................................................................................ 47
10 References ........................................................................................................................ 47

List of Tables

Section   Table                                                                   Page Number

3.1       Table 1. Literature Summary                                             19
6.1       Table 2. Architectural elements specification                           29
6.2       Table 3. Database Design Specification                                  30
8         Table 4. Model Dataset                                                  40
8         Table 5. Suspicious Data Source                                         40
8         Table 6. Source Data Source                                             41
8         Table 7. Trigrams for the suspicious files                              41
8         Table 8. Trigrams for the source files                                  41
9.1       Table 9. Model comparison during model training                         43
9.1       Table 10. Top performer model based on model training results           44
9.1       Table 11. Loss analysis during model training                           45
9.2       Table 12. Model probability score against the suspicious document       46
9.2       Table 13. Probability Statistic for each model                          46

List of Figures

Section   Figure                                                                  Page Number

2.1       Figure 1: Machine Learning Process flow                                 15
2.1       Figure 2: Machine Learning working block                                15
4.4       Figure 3: BERT building block                                           23
4.4       Figure 4: BERT process flow                                             23
4.5       Figure 5: Artificial Neural Network layers architecture                 24
6.1       Figure 6: System architecture of MBPS project                           29
6.3       Figure 7a: Use case diagram on activities by actor                      33
6.3       Figure 7b: Use case diagram on governance by actor                      34
6.3.2     Figure 8: Class Diagram                                                 35
6.3.3     Figure 9: Sequence diagram to handle a variety of document formats      35
6.3.3     Figure 10: Sequence diagram: Content comparison of document             37
6.3.4     Figure 11: Deployment diagram                                           38
7         Figure 12: Project schedule diagram                                     39
8         Figure 13: Similarity measure using Jaccard similarity coefficient      41
8         Figure 14: Similarity measure using Containment similarity coefficient  42
8         Figure 15: Similarity score                                             42

Domain Name
ML Based Plagiarism Solution

Technical Keywords
1. Plagiarism
a. Plagiarism Software,
b. Plagiarism Tool
c. Plagiarism Checker
2. Distributed Processing
a. Hadoop big data processing
b. Spark in-memory data processing
3. Machine Learning
a. Machine learning algorithm
b. ML Model
c. Supervised and Unsupervised Learning
d. Reinforcement Learning
e. Deep Learning

Chapter 1

1 Introduction To ML Based Plagiarism Solution


1.1 Domain Description
Machine learning is an application of AI that enables systems to learn and improve from
experience without being explicitly programmed. Machine learning focuses on developing
computer programs that can access data and use it to learn for themselves. Machine learning
algorithms use computational methods to “learn” information directly from data without
relying on a predetermined equation as a model. The algorithms adaptively improve their
performance as the number of samples available for learning increases. Deep learning is a
specialized form of machine learning.

How does Machine Learning Work


Similar to how the human brain gains knowledge and understanding, machine learning relies
on input, such as training data or knowledge graphs, to understand entities, domains and the
connections between them. With entities defined, deep learning can begin.

Figure1. Machine Learning Process flow

The machine learning process begins with observations or data, such as examples, direct
experience or instruction. It looks for patterns in data so it can later make inferences based on
the examples provided. The primary aim of ML is to allow computers to learn autonomously
without human intervention or assistance and adjust actions accordingly.
Machine learning uses two types of techniques: supervised learning, which trains a model on
known input and output data so that it can predict future outputs, and unsupervised learning,
which finds hidden patterns or intrinsic structures in input data.
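
As a purely illustrative example of these two families (with made-up toy data, not project data),
the following snippet contrasts a supervised classifier with an unsupervised clustering step
using scikit-learn.

# Toy illustration of the two technique families described above (not project code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised learning: train on known input/output pairs, then predict future outputs.
X_train = np.array([[0.1], [0.2], [0.8], [0.9]])   # e.g. similarity scores
y_train = np.array([0, 0, 1, 1])                   # 0 = original, 1 = plagiarised
clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[0.75]]))                       # -> [1]

# Unsupervised learning: find intrinsic structure in unlabelled data.
X = np.array([[0.1], [0.15], [0.85], [0.9]])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)                                    # two groups discovered without labels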

Figure 2. Machine Learning working block

Application Area of Machine Learning
Machine learning is not science fiction. It is already widely used by businesses across all
sectors to advance innovation and increase process efficiency. In 2021, 41% of companies
accelerated their rollout of AI as a result of the pandemic. These newcomers are joining the
31% of companies that already have AI in production or are actively piloting AI technologies.
• Data security: Machine learning models can identify data security vulnerabilities before
they can turn into breaches. By looking at past experiences, machine learning models
can predict future high-risk activities so risk can be proactively mitigated.
• Finance: Banks, trading brokerages and fintech firms use machine learning algorithms
to automate trading and to provide financial advisory services to investors. Bank of
America is using a chatbot, Erica, to automate customer support.
• Healthcare: ML is used to analyze massive healthcare data sets to accelerate discovery
of treatments and cures, improve patient outcomes, and automate routine processes to
prevent human error. For example, IBM’s Watson uses data mining to provide
physicians data they can use to personalize patient treatment.
• Fraud detection: AI is being used in the financial and banking sector to autonomously
analyze large numbers of transactions to uncover fraudulent activity in real time.
Technology services firm Capgemini claims that fraud detection systems using
machine learning and analytics minimize fraud investigation time by 70% and improve
detection accuracy by 90%.
• Retail: AI researchers and developers are using ML algorithms to develop AI
recommendation engines that offer relevant product suggestions based on buyers’ past
choices, as well as historical, geographic, and demographic data.

1.2 ML Based Plagiarism Solution


At a basic level, plagiarism software scans through a body of text to see if there are instances
of plagiarism within the written content. Free and paid plagiarism software alert the writer to
whether their text has duplicate content within a body of text. Most tools cross-reference public
websites and web pages to help identify instances of plagiarism. Some plagiarism software also
leverages a large online database of published work for cross-referencing; this is especially
true of plagiarism detection software designed for academic use cases.
More advanced paid plagiarism software offers other features like alerting the writer to whether
or not a section of their writing requires a citation, providing citation formatting guidelines or
tools, and giving a document a score based on how original the content is. Certain advanced
tools offer plagiarism checking in multiple languages and use AI technology to help identify
identical and paraphrased content.
Some vendors offer plagiarism checking software as part of a larger suite of writing and
proofreading tools. For example, Grammarly provides users with a plagiarism checker along
with grammar, spelling, syntax, writing style, and text sentiment features.
In existing open-source and commercial plagiarism software, the semantic similarity of
document content is either not mature or ignored entirely.

1.3 Motivation

• Protect the original research


• Save the educational process
• AI/ML based solution
• Distributed processing

• Semantic based search engine
• Bring uniqueness in research area
• Improve education system

Chapter 2

2 Literature Survey
The reviewed literature encompasses a diverse array of contributions to the realm of natural
language processing (NLP) and machine translation. Notably, the introduction of layer
normalization by Ba, Kiros, and Hinton [1] has significantly improved the training and
performance of deep neural networks. Bahdanau, Cho, and Bengio's work on neural machine
translation [2] presents an approach that concurrently learns to align and translate, offering
valuable insights into more effective translation models. Britz, Goldie, Luong, and Le [3]
conduct a thorough exploration of various neural machine translation architectures, shedding
light on design choices and their impact.

The advent of BERT (Bidirectional Encoder Representations from Transformers) [4], as
proposed by Devlin et al., marks a pivotal milestone in natural language understanding.
Building upon BERT, Liu et al. introduce RoBERTa [5], a robustly optimized BERT approach
that addresses limitations and enhances the model's robustness. Gehring et al.'s exploration of
convolutional sequence-to-sequence learning [6] offers an alternative perspective to traditional
RNN encoder-decoder architectures.

The study by Cho, van Merrienboer, Gulcehre, Bougares, Schwenk, and Bengio [7] delves into
learning phrase representations using RNN encoder-decoder, contributing to statistical
machine translation. Lin et al. propose a structured self-attentive sentence embedding method
[8], enriching sentence representation through self-attention. Press and Wolf [9] explore the
utilization of output embedding to improve language models, providing valuable insights into
enhancing language processing models.

Szegedy et al.'s work on rethinking the inception architecture [10] offers innovative approaches
to designing deep neural networks. "Attention Is All You Need" [11], by Vaswani et al.,
introduces the transformer architecture, revolutionizing NLP and machine translation through
the attention mechanism. Jozefowicz et al. [12] explore the limits of language modeling,
addressing challenges and limitations in model development.

Luong, Pham, and Manning's investigation into effective approaches to attention-based neural
machine translation [13] adds depth to the understanding of attention mechanisms in NLP. The
work by Wu et al. [14] on Google's neural machine translation system bridges the gap between
human and machine translation, contributing to advancements in translation quality.
Additionally, the Hugging Face Transformers library and documentation [15] serve as a
valuable resource for practitioners working with transformer-based models, providing a
comprehensive toolkit for NLP tasks.

In recent contributions to natural language processing and machine learning, Sun et al. (2021)
introduce an innovative application of BERT for authorship attribution [7]. The study explores
the use of BERT, a powerful pre-trained language representation model, in the context of
determining authorship in textual content. This work provides insights into the versatility of
BERT beyond traditional language understanding tasks.

Furthermore, Xu et al. (2022) present a novel application of Longformer in sentiment analysis
[8]. Longformer, known for its ability to handle long-range dependencies in documents, is
leveraged for sentiment analysis tasks. The study explores the effectiveness of Longformer in

capturing nuanced sentiment patterns over extended textual content. This research expands the
applicability of transformer models in sentiment analysis, showcasing advancements in
understanding context-rich information.
These recent studies add depth to the evolving landscape of transformer-based models,
showcasing their adaptability to diverse tasks within natural language processing. The use of
BERT for authorship attribution and Longformer for sentiment analysis highlights the
continued exploration and expansion of transformer architectures in addressing real-world
challenges across different domains.
Collectively, these seminal contributions form the foundation for understanding the evolution
of deep learning models in NLP, machine translation, and related fields, offering a rich source
of insights for researchers and practitioners in the domain.

2.1 Literature Summary


Table1. Literature Summary

Chapter 3

3 Proposed Work
“ML Based Plagiarism Solution” (MBPS) will leverage an ML model to build a robust
solution that scans through a body of text to check whether there are instances of plagiarism
within the written content. It will guide students towards higher-quality academic writing by
letting them check text similarity and grammar before submitting a paper. It will be responsible
for the following:
• Alert the writer to whether or not their text has duplicate content within a body of text
• Cross-reference public websites and web pages to help identify instances of plagiarism
• Leverage a large online database of published work for cross-referencing
• Alert the writer to whether or not a section of their writing requires a citation
• Score the document based on how original its content is
• Use AI/ML technology to identify identical and paraphrased content

It is intended for use at all levels of educational institutions and focuses on the English medium
of the education system. It will leverage the power of an ML model and the Spark distributed
processing engine to find content similarity in submitted papers, as sketched below.
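
The sketch below shows one way the Spark engine could parallelize similarity scoring over
submitted documents; it is illustrative only, and the HDFS paths and the simple word-level
Jaccard scorer are placeholders rather than the project's actual jobs.

# Illustrative sketch: distributing pairwise similarity scoring with Spark.
# Paths and score_pair() are placeholders, not the project's actual jobs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mbps-similarity").getOrCreate()

# wholeTextFiles yields (path, content) pairs; keep only the document contents.
submitted = spark.sparkContext.wholeTextFiles("hdfs:///mbps/submitted/*.txt").values()
corpus = spark.sparkContext.wholeTextFiles("hdfs:///mbps/corpus/*.txt").values()

def score_pair(pair):
    """Jaccard similarity over word sets; a stand-in for the ML model's scorer."""
    sub, src = pair
    a, b = set(sub.split()), set(src.split())
    return len(a & b) / max(len(a | b), 1)

# Cartesian pairing is only feasible for small corpora; shown here for clarity.
scores = submitted.cartesian(corpus).map(score_pair)
print(scores.top(5))  # highest similarity scores across all pairs

spark.stop()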

3.1 Objectives and Challenges

Objectives:

The objective strives to protect and promote original research, strengthen the educational
process, and advance the education system through the utilization of an AI/ML-based solution.
By leveraging distributed processing and implementing a semantic-based search engine, the
objective aims to bring uniqueness and innovation to the research domain while improving the
overall quality of education.
Dashboards are used to identify risks and perform cohort analysis, while reports present results
in the context of students' assignments. Clear and actionable data points are provided for each
submission, including checking for similarity against a leading content database. The objective
aims to uncover text manipulations aimed at bypassing integrity checks and verify the
originality of student work in potential contract cheating cases. It also aims to guide students
towards producing higher-quality academic writing by enabling them to check text similarity
and grammar before submitting. Comparing an assignment to prior student work, analysing
document metadata, and applying a score to assess the probability of contract cheating are also
part of the objective.

Challenges

• Changes in the cultural and social environment in the past decade


• Academic integrity comprises honesty, trust, respect, fairness and responsibility
• Technological advancements, such as the internet and email, have substantially
increased the sources available for plagiarizing
• Unauthorized re-use of material found on the Internet
• Different kinds of academic fraud increase as technology advances

• Contract cheating – students engaging an external party to complete their coursework,
which is then submitted as his or her own.
• Text manipulation – swapping characters/alphabets, replacing spaces with invisible
white text, inserting images of text and more, designed to deceive plagiarism detection
tools.
• Source code plagiarism – copying another person’s source code without attributing it
to the owner and claiming it as one’s own.
• Self-plagiarism – submission of a student’s previously published work in its entirety or
reusing parts of it in a new written assignment.

3.2 Project Scope

• For academic use for the students to detect plagiarism on their research paper
• Focus on documents written in English languages
• It will support a large volume and variety of documents
• Check for similarity against our industry-leading content database
• Paraphrased content can also be identified
• Protects privacy and security of content

3.3 Problem Statement

Plagiarism involves the use of words, ideas, or work products that can be traced back to another
person or source without proper attribution. It is a serious form of academic misconduct that
undermines the integrity of the educational process. When individuals engage in plagiarism,
they not only fail to protect the original research content of others but also deny recognition,
credit, and benefits to the original researchers. This unethical behaviour poses a threat to the
advancement of knowledge and the fair distribution of rewards within academia. Furthermore,
the absence of accessible open source and commercial software capable of performing
semantic checks on document content exacerbates the challenge of detecting and preventing
plagiarism effectively.

3.4 Proposed Algorithm

Below are the proposed algorithms for plagiarism detection:


A. Using BERT Algorithm to find document similarity
BERT builds on top of a number of clever ideas that have been bubbling up in the NLP
community recently including but not limited to:
- Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le),
- ELMo (by Matthew Peters and researchers from AI2 and UW CSE),
- ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder),
- the OpenAI transformer (by OpenAI researchers Radford, Narasimhan, Salimans,
and Sutskever), and
- the Transformer (Vaswani et al).

Figure3. BERT building block

BERT is based on the transformer architecture and is pre-trained on a large corpus of unlabeled
text, including Wikipedia (2,500 million words) and a book corpus (800 million words). Using
a deeply bidirectional model, it builds a deep understanding of how language works. A
pre-trained BERT model can be fine-tuned with just one additional output layer to create
state-of-the-art models for a wide range of NLP tasks.

Figure4. BERT process flow

FAST-BERT library:
Fast-Bert is a deep learning library that allows developers and data scientists to train
and deploy BERT and XLNet based models for natural language processing tasks,
beginning with text classification.
Using Fast-Bert, we will be able to do the following (a usage sketch is given after the list of
supported models below):
- Train (more precisely fine-tune) BERT, RoBERTa and XLNet text classification
models on your custom dataset.
- Tune model hyper-parameters such as epochs, learning rate, batch size, optimiser
schedule and more.
- Save and deploy trained model for inference (including on AWS Sagemaker).
Fast-Bert will support both multi-class and multi-label text classification for the
following:
- BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton
Lee and Kristina Toutanova.

- XLNet (from Google/CMU) released with the paper XLNet: Generalized
Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang
Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
- RoBERTa (from Facebook), a Robustly Optimized BERT Pretraining Approach by
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.
- DistilBERT (from HuggingFace), released together with the blogpost Smaller, faster,
cheaper, lighter: Introducing DistilBERT, a distilled version of BERT by Victor Sanh,
Lysandre Debut and Thomas Wolf.
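
The following sketch follows the Fast-Bert project's published usage pattern for fine-tuning a
RoBERTa classifier; the file names, paths, and hyper-parameters are placeholders, and
argument names may differ slightly between library releases.

# Sketch based on the Fast-Bert README; paths, files, and hyper-parameters are
# placeholders, and argument names may vary between library versions.
import logging
import torch
from fast_bert.data_cls import BertDataBunch
from fast_bert.learner_cls import BertLearner
from fast_bert.metrics import accuracy

databunch = BertDataBunch('./data/', './labels/',
                          tokenizer='roberta-base',
                          train_file='train.csv', val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text', label_col='label',
                          batch_size_per_gpu=16, max_seq_length=256,
                          multi_gpu=False, multi_label=False,
                          model_type='roberta')

learner = BertLearner.from_pretrained_model(
    databunch,
    pretrained_path='roberta-base',
    metrics=[{'name': 'accuracy', 'function': accuracy}],
    device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
    logger=logging.getLogger(),
    output_dir='./output/',
    is_fp16=False, multi_gpu=False, multi_label=False)

# Tune hyper-parameters such as epochs, learning rate, and schedule, then save for inference.
learner.fit(epochs=3, lr=2e-5, validate=True, schedule_type='warmup_cosine')
learner.save_model()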

B. Neural Network to improve plagiarism detection


Deep learning uses artificial neural networks to perform sophisticated computations on
large amounts of data. It is a type of machine learning that works based on the structure
and function of the human brain. A neural network is structured like the human brain and
consists of artificial neurons, also known as nodes. These nodes are stacked next to each
other in three layers:
- Input layer
- Hidden Layer(s)
- Output Layer

Figure5. Artificial Neural Network layers architecture

Data provides each node with information in the form of inputs. The node multiplies the
inputs by its weights (initialized randomly), sums the results, and adds a bias. Finally, nonlinear
functions, also known as activation functions, are applied to determine which neurons fire, as
illustrated below.
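
The snippet below gives a tiny numerical illustration of this computation for a single hidden
layer; the input, weight, and bias values are arbitrary.

# Minimal forward pass for one hidden layer, mirroring the description above.
# Weights, bias, and inputs are arbitrary illustrative values.
import numpy as np

def relu(z):
    return np.maximum(0, z)          # activation function deciding which neurons "fire"

x = np.array([0.4, 0.9])             # inputs to the layer (e.g. similarity features)
W = np.array([[0.2, -0.5],
              [0.7,  0.1],
              [0.3,  0.8]])          # one weight row per hidden node (random at init)
b = np.array([0.1, -0.2, 0.05])      # bias per hidden node

hidden = relu(W @ x + b)             # multiply inputs by weights, add bias, apply activation
print(hidden)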

C. Data Extraction and Analytic Library


When it comes to solving data science tasks and challenges, Python is the choice for most
data scientists. It is an easy-to-learn, easy-to-debug, widely used, object-oriented, open-
source, high-performance language, and there are many more benefits to Python
programming. Python comes with an extraordinary set of libraries for data science:
- TensorFlow, Numpy, SciPy, Pandas, Matplotlib, Keras, SciKit-Learn, PyTorch,
Scrapy

Chapter 4

4 Software Requirement Specification


ML Based Plagiarism Solution (MBPS) is a document duplicate checker that helps students,
teachers, academic institutions, research firms, web writers, publishers, and many other
segments of society in their domains.

Below is the list of requirements for this project:

Req_mbps_01
The system should be able to process documents in a variety of formats, including Microsoft
Word, PDF, and plain text.

Req_mbps_02
The system should be able to compare the content of a submitted document against a database
of known sources, such as published articles and academic papers, to identify instances of
plagiarism.

Req_mbps_03
The system should be able to accurately identify copied or paraphrased content, even if the text
has been modified or altered in some way.

Req_mbps_04
The system should provide a report detailing the instances of plagiarism found in the submitted
document, including the sources of the copied content and the percentage of the document that
is plagiarized.

Req_mbps_05
The system should be able to handle large volumes of documents and should have a fast
processing time.

Req_mbps_06
The system should be user-friendly, with a simple interface that allows users to easily submit
documents and view results.

Req_mbps_07
The system should be secure, with measures in place to protect the privacy of submitted
documents and the confidentiality of the results.

Req_mbps_08
The system should have the ability to exclude certain sources or sections of text from the
comparison process, such as common phrases or cited sources.

Req_mbps_09
The system should provide a way for users to review and challenge the results of the plagiarism
check, in the event that a false positive is detected.

Req_mbps_10
The system should be able to handle the English language and accurately detect plagiarism in
documents written in English.

Req_mbps_11
The system should have a robust database of sources to compare against, with the ability to
continuously update and expand the database.

Req_mbps_12
The system should be compatible with a range of devices and operating systems, including
desktop computers, laptops, and mobile devices.

Chapter 5

5 Software Design Specification


The system architecture for this project will follow the below design principles:

✓ Single Responsibility Principle while building services (Independent & Autonomous Services)
✓ Scalability
✓ Decentralization
✓ Resilience Services
✓ Real-Time Load Balancing
✓ Availability
✓ Continuous Delivery Through DevOps Integration
✓ Seamless API Integration and Continuous Monitoring
✓ Isolation From Failure
✓ Auto-Provisioning

Design Pattern:

✓ Aggregator
✓ API Gateway
✓ Asynchronous Messaging
✓ Database & Shared Data
✓ Event Sourcing
✓ Branch
✓ Command Query Responsibility Segregator
✓ Chain of Responsibility
✓ Circuit Breaker
✓ Security
✓ Decomposition

5.1 System Architecture Design

MBPS is a microservice-based, multi-layer, containerized architecture that can be deployed
on-premise as well as in a cloud environment. Architectural element descriptions are provided
in the table below.

Table2. Architectural elements specification

Architectural Element | Considerations / Alternatives | Design Decision and (optional) Rationale
Architecture Type | Client Server, MQ Based, Workflow, etc. | Multi-layer microservice architecture
Basis of Layering | No. of tiers and purpose of each tier | 1. Controller Layer – redirects calls to the proper service and business logic; 2. Service Layer – contains all the business logic; 3. Repository Layer – contains all the database-related operations
Reusability | Use of existing components, identify new components | Java – one separate jar is created with all common utilities, which is used as a reference in other microservices
Batch Operation | Manual, Scheduler based, Database related | Orchestration Scheduler; Scheduler for Predictive Analysis
Integration with External System(s) | For Input / Output, Process-based, Message based | Interaction with other microservices through REST services
Load Balancing Mechanism (if applicable) | Clustering, Parallel servers | 1. Non-Docker environment – Eureka, Feign Client; 2. Docker environment – Kubernetes clustering with an in-built load balancer and naming server
Failover | Redundancy, Session Backup | If a microservice fails, then a proper exception message is returned
Deployment Scenario | Location, Administration, Connectivity | 1. Jenkins with Master-Slave architecture for distributed builds; 2. Post-build reports and notifications; 3. Integrated reports such as Sonar, CAST AIP, Java unit testing, Angular unit testing, and the Test Automation Framework
Security | Firewall, Server organization | 1. Application Gateway (WAF) firewall configuration; 2. NSG rules for inbound and outbound traffic
Troubleshooting | Event / Error logs, Application instrumentation | Java – Log4j libraries for logging purposes
External Interface | SOA / Offline file-based / EAI based | REST-based communication
Table3. Database Design Specification

Database Aspect | Considerations / Alternatives | Details and (optional) Rationale
Organization | Volume, Table Space, Table, View, Users | Table
Database Side Processing | Use of SPs, Functions, Triggers, Cursors | NA
Connection | Connection Pooling, Driver Type | Postgres and MongoDB databases
Performance Tuning | Query Optimization, Indexing, De-Normalization | Indexing

Figure6. System architecture of MBPS project

Architecture Type:
• MBPS uses multi-layer microservice & containerized architecture

Basis of Layering:
• Controller Layer – This layer will redirect the calls to proper service and business logic
• Service Layer – This layer contains all the business logic
• Repository Layer- This layer contains all the database-related operations

Reusability:
• Java – One separate jar is created with all common utilities which are used as a
reference in other microservices.
• Common Jar file consists of classes and methods related to Common Exceptions,
Request and Response classes, Model classes, Classes required for Security and
validations, Enums, licenses, Annotations, and Config Files which are required to set
up the application.

Batch Operation:

• Scheduler: Responsible for running batch operations while extracting data from
multiple sources over the internet.
Integration with External System(s): Integration with other microservices and external services
will happen through REST services.

Load Balancing Mechanism (if applicable):


• Non-Docker Environment – Eureka, Feign Client
• Docker Environment- Kubernetes Clustering uses in-built load balancer and naming
server.

Failover:
• If a microservice fails, then a proper exception message is returned.

Deployment Scenario:
• Jenkins with Master-Slave architecture for distributed builds
• Post build reports and notifications

Security:
• Application Gateway (WAF Firewall) Firewall configuration
• NSG rules for inbound and outbound traffic
• Communication between layers will be done through the HTTPS protocol, which will
encrypt messages over the network

Troubleshooting:
• Java – Log4j libraries for logging purposes.

Data Processing Layer: Responsible for data extraction from multiple sources and for making
it available to the ML model for training.

Model Pretraining Engine: This service will be responsible for training the model on a variety
of data sources.

Plagiarism Model Service: Once the model is trained and evaluated, it will be exposed as a
service for the outside world, as sketched below.
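
A minimal sketch of such a service wrapper is shown below; FastAPI, the endpoint path, and
the predict_probability placeholder are assumptions for illustration, not the project's actual
implementation.

# Minimal sketch of exposing the trained model as a REST service.
# FastAPI, the endpoint path, and predict_probability() are illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Plagiarism Model Service")

class CheckRequest(BaseModel):
    document_text: str

def predict_probability(text: str) -> float:
    # Placeholder: the real service would call the trained RoBERTa model here.
    return 0.0

@app.post("/api/v1/plagiarism/check")
def check_document(request: CheckRequest):
    score = predict_probability(request.document_text)
    return {"plagiarism_probability": score}

# Run with: gunicorn -k uvicorn.workers.UvicornWorker main:app
# (assuming this file is main.py; Gunicorn appears in the deployment tech stack)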

Plagiarism Report Service: This service is responsible for data presentation on the reporting
and dashboard interfaces.

5.2 Hardware and Software Requirement


5.2.1 Hardware Requirements

• Disk drive – 500 GB or more


• RAM – Minimum 16 GB for good performance
• CPU Cores – 8 Core
• Processor – i5 or above
• Number of Servers – 2

5.2.2 Software Requirement

• OS version supported: Centos 8, Ubuntu 18.04 or higher


• Database: Postgres 13.4, MongoDB 4.4.10 (Community version)
• Development Tech Stack: Open JDK 16, Python 3.6.8, Tomcat, Angular 12, Nodejs 12
• Deployment Tech Stack: Open JRE 16, Python 3.6.8, Gunicorn, Tomcat 10.0
• Browser: Chrome

5.3 UML Diagram

5.3.1 Use-Case-Diagram

A use case diagram for the MBPS checker includes the following elements:

Actors
Student: the primary user of the tool, who submits a document for plagiarism checking
Teacher: a secondary user of the tool, who may use the tool to check students' work for
plagiarism.

Use Cases:
1. Check document for plagiarism: the main functionality of the tool, in which a student
or teacher submits a document and the tool checks it for plagiarized content
2. View report: the user can view a report on the results of the plagiarism check, including
any instances of plagiarism found and the sources they were copied from
3. Login: the user must authenticate their identity before using the tool
4. Register: new users can create an account to use the tool
5. Manage account: this use case would allow the user to update their account information,
such as their password or email address.
6. Upgrade account: this use case would allow the user to upgrade their account to a
premium version of the tool, which may offer additional features or a higher limit on
the number of documents they can check for plagiarism.
7. Check document for originality: this use case would allow the user to check a document
for originality, rather than just checking for plagiarism. This could be useful for writers
or content creators who want to ensure that their work is completely original.
8. Compare documents: this use case would allow the user to compare two or more
documents to see how similar they are. This could be useful for educators who want to
check for plagiarism between students' papers.
9. View history: this use case would allow the user to view a history of their previous
plagiarism checks, including the results and any reports generated.
Relationships:
1. Extend: the "view report" use case can be extended by the "check document for
plagiarism" use case, meaning that it can only be performed after a plagiarism check
has been run
2. Include: the "login" and "register" use cases may be included as part of the process for
using the tool, but are not essential to its primary functionality.

Activities by actor student:


The "Activities by Student" use case diagram encompasses key functionalities. Students
can initiate a "Check Document for Plagiarism" process, submitting their work for analysis.

Following the analysis, students can "View Report" to understand potential plagiarism
instances. The "Check Document for Originality" use case allows students to assess the
uniqueness of their work beyond plagiarism, receiving feedback for improvement.
Additionally, the system supports "Compare Document," enabling students to benchmark
their work against others. These use cases collectively empower students to ensure the
integrity and originality of their documents through comprehensive checks and
comparisons.

Figure 7a. Use case diagram on activities by actor

Governance by actor student


The "Student Login to Plagiarism System" use case, involves students accessing the system
by logging in, providing a secure entry point. Once logged in, students can "Manage
Account" by updating personal information or preferences. The "Upgrade Account" feature
allows users to enhance their system access, possibly unlocking premium features. Students
can also "Register" new documents for plagiarism checks. Lastly, the "View History" use
case enables students to review their past interactions or submissions, fostering
transparency and tracking within the plagiarism system. This set of use cases provides a
comprehensive overview of students' interactions with the system, ensuring efficient
account management and access to historical data.

Figure 7b. Use case diagram on governance by actor

5.3.2 Class Diagram

Figure8. Class Diagram

PlagiarismCheckerTool class represents the main tool that coordinates the process of
detecting plagiarism.
Document class represents a document that is being checked for plagiarism. It has a title and
the actual text of the document.
Corpus class represents a collection of documents. It has a list of Document objects.
PlagiarismDetectionAlgorithm class represents an algorithm that is responsible for detecting
plagiarism in a given document or corpus.
SimilarityMeasurement class represents a method for measuring the similarity between two
documents. This can be used by a PlagiarismDetectionAlgorithm to determine the likelihood
that one document is plagiarized from another.

Comparison class represents a single comparison between two documents


PlagiarismResult class represents the result of a plagiarism detection algorithm being applied
to a comparison. The PlagiarismResult class could include information such as the degree of
similarity between the two documents, whether the algorithm determined that one of the
documents was likely plagiarized from the other, and any other relevant details

In this design, the BERTModel class represents the BERT model that is being used to detect
plagiarism.
BERTPlagiarismDetectionAlgorithm class is a subclass of PlagiarismDetectionAlgorithm
that uses the BERT model to detect plagiarism. The PlagiarismCheckerTool coordinates the
process of comparing documents and applying plagiarism detection algorithms, and the
PlagiarismResult class represents the result of a plagiarism detection algorithm being applied
to a comparison.

InputProcessor and OutputProcessor classes handle the input and output of the plagiarism
checker tool. The InputProcessor class is responsible for reading in the documents and corpus
to be checked for plagiarism, and the OutputProcessor class is responsible for generating the
report or other output of the tool.
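
A skeletal Python rendering of these classes is shown below; the constructor arguments, the
default threshold, and the method bodies are illustrative assumptions rather than the finalized
design.

# Skeleton of the classes described in the class diagram; details are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Document:
    title: str
    text: str

@dataclass
class Corpus:
    documents: List[Document] = field(default_factory=list)

@dataclass
class Comparison:
    suspicious: Document
    source: Document

@dataclass
class PlagiarismResult:
    comparison: Comparison
    similarity: float
    is_plagiarized: bool

class SimilarityMeasurement:
    def measure(self, a: Document, b: Document) -> float:
        raise NotImplementedError

class PlagiarismDetectionAlgorithm:
    def __init__(self, measurement: SimilarityMeasurement, threshold: float = 0.8):
        self.measurement = measurement
        self.threshold = threshold

    def detect(self, comparison: Comparison) -> PlagiarismResult:
        score = self.measurement.measure(comparison.suspicious, comparison.source)
        return PlagiarismResult(comparison, score, score >= self.threshold)

class PlagiarismCheckerTool:
    """Coordinates comparisons between a document and every document in a corpus."""
    def __init__(self, algorithm: PlagiarismDetectionAlgorithm):
        self.algorithm = algorithm

    def check(self, document: Document, corpus: Corpus) -> List[PlagiarismResult]:
        return [self.algorithm.detect(Comparison(document, source))
                for source in corpus.documents]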

5.3.3 Sequence Diagram

Req_mbps_01
Sequence diagram to handle a variety of document formats, such as PDF, DOC, and plain text.

Figure9. Sequence diagram to handle a variety of document formats

In this diagram, the input handling subsystem is responsible for determining the format of the
submitted document and routing it to the appropriate format conversion subsystem. The format
conversion subsystem is responsible for converting the document to an internal format that can
be easily parsed and analyzed by the rest of the system. Once the document has been converted
and parsed, it can be compared to the database of previously submitted documents and analyzed
for similarity by the comparison and analysis subsystems. Finally, the output subsystem
generates a report that includes any identified matches and the corresponding similarity scores,
and sends it to the user.

Req_mbps_02: compare the content of a submitted document against a database of known
sources, such as published articles and academic papers, to identify instances of plagiarism.

Figure10. Sequence diagram: Content comparison of document

- The user submits the document to the system through a user interface.
- The system receives the document and begins the plagiarism detection process.
- The system pre-processes the submitted document by performing tasks such as
tokenization and stemming.
- The system retrieves a list of known sources from the database.
- The system pre-processes each known source in the same way as the submitted
document.
- The system compares the submitted document to each known source to identify
instances of plagiarism. This can be done using various techniques such as cosine
similarity or Longest Common Subsequence (LCS).
- The system generates a report that indicates the level of plagiarism for each instance of
plagiarism identified.
- The system displays the report to the user through the user interface.
The document pre-processing subsystem could be responsible for tasks such as tokenization
and stemming, while the comparison subsystem could be responsible for techniques such as
cosine similarity or LCS. The report generation subsystem could be responsible for formatting
and presenting the results in a user-friendly way
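
The snippet below illustrates the comparison step with TF-IDF vectors and cosine similarity
from scikit-learn; the sample documents are placeholders, and the real system would add
stemming and the known-source retrieval around this step.

# Illustration of the cosine-similarity comparison step; the documents are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

submitted = "Plagiarism detection compares a document against known sources."
known_sources = [
    "Plagiarism detection compares submitted documents with a database of known sources.",
    "Weather forecasting relies on numerical models of the atmosphere.",
]

# Pre-processing (tokenization, lowercasing) is handled by the vectorizer here;
# the real system would add stemming before this step.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([submitted] + known_sources)

scores = cosine_similarity(matrix[0:1], matrix[1:]).flatten()
for source, score in zip(known_sources, scores):
    print(f"{score:.2f}  {source[:50]}")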

5.3.4 Deployment Diagram

Figure11. Deployment diagram

Deployment principle:
- Application will be deployed on Azure cloud environment
- API Gateway with Web Application Firewall
- API Gateway IP address will be exposed to external world
- All microservices will be deployed on 2 servers with system configuration: 8-core CPU,
16 GB RAM, 500 GB hard disk
- Secure communication using https

Chapter 6

6 SCHEDULE OF WORK

Figure12. Project schedule diagram

Chapter 7

7 Dataset
The dataset utilized in this study is a limited subset extracted from the International
Competition on Plagiarism Detection (PAN 2010). The dataset comprises two distinct types
of files: suspicious documents and source documents. Additionally, the dataset includes
accompanying information indicating whether a suspicious document exhibits plagiarism
from the source documents or not.
Below is the outcome from various stages of solution:

Data Processing
- Clean the text files and combine them into data frames, one each for the source and the
suspicious files
- Build a Pandas data frame for the suspicious files and one for the source files. The reason we
put the files into data frames is that it is easier to apply the same operations to each file later
(a loading sketch is given below)
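
The loading sketch below is illustrative only; the directory layout and column names are
assumptions made for the example.

# Sketch of combining the cleaned text files into data frames; paths and columns are assumed.
import os
import pandas as pd

def load_corpus(directory: str) -> pd.DataFrame:
    rows = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(".txt"):
            with open(os.path.join(directory, name), encoding="utf-8", errors="ignore") as f:
                text = f.read().lower().strip()   # basic cleaning
            rows.append({"file": name, "text": text})
    return pd.DataFrame(rows)

suspicious_df = load_corpus("data/suspicious-documents")
source_df = load_corpus("data/source-documents")
print(suspicious_df.head())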

Outcome of suspicious file to data frame:

Table4. Model Dataset

Table5. Suspicious Data Source

Table6. Source Data Source

Table7. Trigrams for the suspicious files

Table8. Trigrams for the source files

Similarity Measures on trigrams


Next, we compare the suspicious files with the source files using two similarity measures:
- Jaccard similarity coefficient
- Containment measure

We get the two measures for comparing the similarity between trigrams of suspicious and
source files
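
The sketch below shows how the two measures can be computed over word trigrams; the two
sample strings are placeholders, not dataset files.

# Word-trigram Jaccard and containment measures; sample texts are placeholders.
def trigrams(text: str) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def jaccard(suspicious: str, source: str) -> float:
    a, b = trigrams(suspicious), trigrams(source)
    return len(a & b) / max(len(a | b), 1)

def containment(suspicious: str, source: str) -> float:
    # Fraction of the suspicious document's trigrams that also appear in the source.
    a, b = trigrams(suspicious), trigrams(source)
    return len(a & b) / max(len(a), 1)

sus = "the quick brown fox jumps over the lazy dog"
src = "a quick brown fox jumps over a very lazy dog"
print(jaccard(sus, src), containment(sus, src))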
Similarity measure using Jaccard similarity coefficient

Figure: 13. Similarity measure using Jaccard similarity coefficient

Similarity measure using Containment measure

Figure: 14. Similarity measure using Containment similarity coefficient

Figure: 15. Similarity score

Chapter 8

8 Results and Discussion


8.1 Model Training Summary Report
This section provides a comprehensive overview of the model training process, offering
insights from key reports that play a pivotal role in evaluating the performance of plagiarism
detection models. The reports included are the "Model Comparison Report Analysis," "Top
Performers Report," and "Loss Analysis Report." Each report contributes unique perspectives
on the models' capabilities, shedding light on their accuracy, efficiency, and overall
effectiveness in handling plagiarism detection tasks.

Model Comparison Report Analysis


This study compares the performance of various models, including Roberta-base, Roberta-
large, Bert-base, and a modified version, Roberta-large-ms, designed to handle large document
segments efficiently. The models were evaluated on three datasets: pan-pc-2009, pan-pc-2010,
and pan-pc-2011.

Table 9. Model comparison during model training

Analysis:
Roberta-large-ms (Modified Model): Consistently outperforms other models, demonstrating
superior accuracy (74% to 80%) and effective handling of large document segments.
Roberta-base and Bert-base: Exhibit competitive performance but with variability across
datasets.
Roberta-large: Shows improvement over the base model but is surpassed by the modified
version in terms of accuracy and stability.
In summary, Roberta-large-ms stands out as the modified model, consistently delivering
superior accuracy and demonstrating effectiveness in handling large document segments. This
highlights the significance of the modifications made to Roberta-large in improving overall
model performance.

Top Performers Report


This report focuses on the top-performing models across various datasets, emphasizing their
loss, accuracy, and overall test results.

Table 10. Top-performing models based on model training results

Analysis:
Roberta-large-ms (Modified Model): Emerges as the top performer, consistently achieving the
highest accuracy (80%) across different datasets. Notably, this model exhibits robust
performance in handling large document segments, showcasing its adaptability to diverse
plagiarism detection tasks.
Roberta-base: Demonstrates competitive accuracy (72%) and stability, particularly excelling
in pan-pc-2011 with the best accuracy achieved in 3 epochs.
Roberta-large: Maintains strong performance with an accuracy range of 72% to 76%,
performing well in pan-pc-2009 with the best accuracy achieved in 5 epochs.

Loss Analysis Report:
This report provides insights into the mean, minimum, and maximum loss values for each
model, offering a comprehensive view of their convergence and stability.

Table 11. Loss analysis during model training

Analysis:
Bert-base: Exhibits the highest mean loss (69.77%), indicating a relatively higher degree of
model uncertainty during training. The range from 68.1% to 71.3% suggests variability in
convergence across different datasets.
Roberta-base: Shows a lower mean loss (61.97%) compared to Bert-base, reflecting a more
stable training process. The range from 56.4% to 69.2% suggests consistent but variable
convergence.
Roberta-large: Presents a moderate mean loss (64.93%), indicating a balanced convergence.
The range from 59.1% to 68.1% suggests moderate variability during training.
Roberta-large-ms (Modified Model): Displays the lowest mean loss (61.03%) and a narrower
range (56.4% to 67.6%), suggesting enhanced stability and consistent convergence. This
aligns with its superior performance in the accuracy metrics.

8.2 Plagiarism Result Summary Report

This section combines information from the previous reports, highlighting the models'
performance, the best-performing model for each document, and summary statistics. The aim
is to offer a consolidated overview of the plagiarism detection results.

Models with probability for each Suspicious document


This section outlines the probability scores assigned by each plagiarism detection model to
individual suspicious documents within the dataset. The probability values, ranging from 0 to
1, reflect the confidence of each model in identifying potential instances of plagiarism for each
document.
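For illustration, the probability for a suspicious/source segment pair could be obtained from a fine-tuned sequence-classification checkpoint roughly as sketched below; the checkpoint path "roberta-large-ms-finetuned" and the assumption that label index 1 denotes "plagiarised" are hypothetical, not the project's exact configuration.

```python
# Sketch: probability that a suspicious/source pair is plagiarised,
# using a fine-tuned sequence-classification checkpoint (Hugging Face Transformers).
# The checkpoint path and label ordering are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATH = "roberta-large-ms-finetuned"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=2)
model.eval()

def plagiarism_probability(suspicious_segment, source_segment):
    """Return the model's probability that the suspicious segment is plagiarised."""
    inputs = tokenizer(suspicious_segment, source_segment,
                       truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax over the two classes; index 1 is assumed to be the "plagiarised" class.
    return torch.softmax(logits, dim=-1)[0, 1].item()
```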

Table 12. Model probability scores for each suspicious document

Analysis:
The probability scores reveal varying degrees of confidence among the models, with each
assigning distinct values to the same suspicious document. For instance, for
"suspicious-document00003.txt", Bert-base assigns a probability of 0.320, while Roberta-large-ms
assigns a higher probability of 0.470. These variations underscore the nuanced approaches and
inherent differences in the models' learning mechanisms.

Summary Statistics for Each Model


This section provides summary statistics for each model, including mean, median, minimum,
and maximum probability scores. These statistics offer a holistic view of the models' overall
performance across all suspicious documents.
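A minimal sketch of how such statistics can be derived, assuming the probability scores are held in a Pandas data frame with one column per model (the values shown are illustrative only):

```python
# Sketch: per-model summary statistics (mean, median, min, max) from a data
# frame whose columns are model names and rows are suspicious documents.
import pandas as pd

scores = pd.DataFrame({
    "Bert-base": [0.320, 0.455, 0.410],          # illustrative values only
    "Roberta-large-ms": [0.470, 0.610, 0.596],   # illustrative values only
})

summary = scores.agg(["mean", "median", "min", "max"]).round(3)
print(summary)
```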

Table 13. Probability statistics for each model

Analysis:
Analysing the summary statistics allows us to grasp the overall behaviour of each model. For
instance, Bert-base exhibits a mean probability of 0.410, indicating a moderate level of
confidence, while Roberta-large-ms demonstrates a higher mean probability of 0.596,
suggesting greater overall confidence in its predictions.

Chapter 9

9 Conclusion
The ML Based Plagiarism Solution can help identify potential instances of plagiarism in research
paper writing. It works by comparing the text an author has written against a database of
other sources, such as published articles and websites, to find significant matches.

There are several benefits to using this solution. For one, it can help you avoid accidental
plagiarism, which occurs when you unintentionally use someone else's work without proper
attribution. This can be a serious issue, as it can result in accusations of academic dishonesty
or even legal consequences. A plagiarism checker tool can help you ensure that you are
properly citing your sources and giving credit where it is due.

Additionally, a plagiarism checker tool can also help you identify instances of deliberate
plagiarism, where someone is trying to pass off someone else's work as their own. This can be
a problem in a variety of contexts, including academic research, business writing, and online
content creation. A plagiarism checker tool can help you identify these instances and take
appropriate action to address them.

In conclusion, this solution is a useful tool for anyone who wants to ensure that their writing is
original and properly cited. It can help you avoid accidental plagiarism and identify instances
of deliberate plagiarism, allowing you to maintain the integrity of your work and avoid any
potential legal or ethical issues.


Marathwada Mitra Mandal’s
College of Engineering, Pune
Accredited with ‘A++’ grade by NAAC

Department of Computer Engineering

ME (Computer Engineering 2017 course)


Dissertation Stage II

PROGRESSIVE RECORD

Name of the candidate: MD SOHAIL


Roll Number: ME102

Name of the Guide: Dr. Kalpana Sunil Thakre


Domain: Machine Learning
Title: ML Based Plagiarism Solution

Sr. No | Date | Discussion Agenda | Tasks given | Remarks | Sign
1 | 15-Sep-22 | Title for Dissertation Stage I | Search the topic related to Natural Language Processing | |
2 | 10-Oct-22 | Dissertation Title Review | Dissertation Title has been approved; search multiple journal papers on BERT improvement | Title of the Dissertation approved as "ML Based Plagiarism Solution" |
3 | 01-Nov-22 | Discussion on Dissertation report content | Dissertation report format has been provided | |
4 | 29-Nov-22 | Review of draft version of Dissertation Report | Review comments have been given | |
5 | 9-Dec-22 | Discussion on Dissertation presentation | Dissertation presentation agenda has been provided | |
6 | 29-Dec-22 | Review of Dissertation presentation | Review comments have been given | |
7 | 05-Jan-23 | Dissertation I Presentation | Preparation of Dissertation I presentation | |
8 | Jul-23 | ML Model Creation | | |
9 | Aug-23 | Dataset preparation | Prepare source and suspicious dataset from PAN-PC-2011 | |
10 | Sep-23 | Data Processor Module | Development of Data Processor Module | |
11 | Oct-23 | Data Reader Module | Development of Data Reader Module | |
12 | Oct-23 | Pretrained Engine | Development of Pretrained Engine | |
13 | Nov-23 | Report Module | Development of Report Module | |
14 | Dec-23 | Testing of Model | Testing and verification of Model | |

Student Sign: MD. Sohail

Guide Sign: Prof. Dr. Kalpana Sunil Thakre

