
Semester Final Project Report

The document is a project report for plagiarism detection. It contains: 1) An outline of the project including the background, objectives, and implementation structure. 2) Descriptions of three string matching algorithms - Longest Common Subsequence (LCS), Rabin Karp, and Knuth-Morris-Pratt (KMP) - that could be used for plagiarism detection. 3) Pseudocode and implementations for the LCS and KMP algorithms. The report proposes that KMP may be the best algorithm for plagiarism detection due to its linear time complexity.

Uploaded by

Engineer Zain
Copyright
© © All Rights Reserved

Project Report

CS-854 Advanced Algorithm Analysis

Project Title: Plagiarism Detection

Group Members:
Hafiz Muhammad Anas
Zain Ul Abideen

Submitted To: Prof. Dr Zuhair Zafar


1. Outline of Project:
1.1 Background of the Project:
Plagiarism is not a new problem in academia, but it has grown in importance with the spread of the
Internet and the ease of access to a global source of content, making human-only intervention
inadequate. Despite this, plagiarism remains a significant issue, and computer-assisted
plagiarism detection is currently an active area of study in the fields of Information Retrieval (IR)
and Natural Language Processing (NLP). Detection has been made easier by the development
of numerous software tools, and we provide an overview of plagiarism detection algorithms for use
in academic and educational contexts.
Although the fight against plagiarism and its ethical implications is not a new problem for academia
or anywhere else, it is still evident that a small number of people are always eager to take the
dishonest shortcut to academic success. Despite the popularity of automatic anti-cheating support
tools, plagiarism is still being committed by unique and creative means. Reviewers are forced to rely
on these programs because it is nearly impossible to detect plagiarism in any form of work using only
human labor. Teachers and reviewers alike are entitled to question the strength and utility of current
plagiarism detection technologies when confronted with plagiarism incidents that go beyond obvious

1.2 Objectives of the Project:


Since the early 2000s, numerous comparative studies with various scopes and standards have been
carried out on plagiarism detection technologies. Their shared purpose was to assess the fitness of
these technologies for academic use, either broadly or in a particular context. The main objective of
this project is to propose a better algorithm that is useful in detecting plagiarism while taking less time.

1.3. Project Period:


The Project started in November 2022 and will last for almost 2 months.

1.4 Project Implementation Structure:


We will discuss three string-matching algorithms and then explain why we consider Knuth-Morris-Pratt
better than the others. Every text-processing application must tackle string and pattern
matching. It is a fairly straightforward yet crucial problem, variations of which
appear when looking for plagiarism or for similar DNA or protein sequences.
Here is a brief overview of the string-matching algorithms.

2. Longest common subsequence (LCS):


The longest common subsequence (LCS) is defined as the longest subsequence that is
common to all the given sequences, where the elements of the subsequence are not
required to occupy consecutive positions within the original sequences. Here, "longest"
means that the subsequence contains as many elements as possible. The term "common" denotes
characters that are shared by both strings. The term "subsequence" refers to a sequence
obtained by deleting zero or more characters from a string while keeping the remaining
characters in their original order.
2.1 LCS Problem Statement:
Given two sequences, find the length of the longest subsequence present in both of
them. A subsequence is a sequence that appears in the same relative order, but not
necessarily contiguously. For example, "abc", "abg", "bdf", "aeg", "acefg", etc. are
subsequences of "abcdefg".
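To make the definition concrete, the following is a minimal sketch of a subsequence test. The helper name `is_subsequence` is our own illustration and not part of the project code; it checks the examples listed above.

```python
def is_subsequence(candidate, source):
    """Return True if candidate can be obtained by deleting characters from source."""
    it = iter(source)
    # "ch in it" advances the iterator until ch is found, so the
    # relative order of the characters in source is respected.
    return all(ch in it for ch in candidate)


for s in ["abc", "abg", "bdf", "aeg", "acefg"]:
    print(s, is_subsequence(s, "abcdefg"))

# Same letters as a valid subsequence, but in the wrong order
print("gfe", is_subsequence("gfe", "abcdefg"))
```

Every string in the loop is reported as a subsequence, while "gfe" is rejected because its characters do not appear in that order in "abcdefg".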
2.2 Pseudo-code:
LCS-LENGTH (X, Y)
m ← length [X]
n ← length [Y]
for i ← 1 to m
    do c [i,0] ← 0
for j ← 0 to n
    do c [0,j] ← 0
for i ← 1 to m
    do for j ← 1 to n
        do if xi = yj
            then c [i,j] ← c [i-1,j-1] + 1
                 b [i,j] ← "↖"
            else if c [i-1,j] ≥ c [i,j-1]
                then c [i,j] ← c [i-1,j]
                     b [i,j] ← "↑"
                else c [i,j] ← c [i,j-1]
                     b [i,j] ← "←"
return c and b
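The pseudo-code above can be sketched in Python as follows. This is an illustrative translation rather than the project's implementation: the function names `lcs_table` and `lcs_string` are our own, and instead of storing the arrow table b we re-derive each direction from the table c while walking back from the bottom-right cell.

```python
def lcs_table(x, y):
    """Fill c so that c[i][j] is the LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay 0: an empty prefix has an empty LCS
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j] = c[i - 1][j]
            else:
                c[i][j] = c[i][j - 1]
    return c


def lcs_string(x, y):
    """Recover one longest common subsequence by walking back through c."""
    c = lcs_table(x, y)
    i, j = len(x), len(y)
    out = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:          # diagonal arrow
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:  # up arrow
            i -= 1
        else:                             # left arrow
            j -= 1
    return "".join(reversed(out))


print(lcs_string("AGGTAB", "GXTXAYB"))  # prints GTAB
```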

2.3 Naïve Approach:


The obvious solution is to generate all possible subsequences of the two given
sequences and then identify the longest one that appears in both. The time complexity of this
solution is exponential. The naive recursive approach takes O(2^n) time in the
worst case, which happens when all characters of X and Y mismatch, i.e., the length of
the LCS is 0.
A naive recursive implementation of the LCS problem:
#include <bits/stdc++.h>
using namespace std;
 
/* Returns length of LCS for X[0..m-1], Y[0..n-1] */
int lcs( char *X, char *Y, int m, int n )
{
    if (m == 0 || n == 0)
        return 0;
    if (X[m-1] == Y[n-1])
        return 1 + lcs(X, Y, m-1, n-1);
    else
        return max(lcs(X, Y, m, n-1), lcs(X, Y, m-1, n));
}
  
/* Driver code */
int main()
{
    char X[] = "AGGTAB";
    char Y[] = "GXTXAYB";
     
    int m = strlen(X);
    int n = strlen(Y);
     
    cout<<"Length of LCS is "<< lcs( X, Y, m, n ) ;
     
    return 0;
}

2.4 Memoization Approach:


The partial recursion tree of the naïve approach solves lcs ("AXY", "AYZ") twice. If the entire
recursion tree is drawn, it shows several subproblems that are solved repeatedly.
Thus, this problem has the Overlapping Subproblems property, and recomputation of the same
subproblems can be avoided by using either Memoization or Tabulation.
A memoization implementation of the LCS problem:
#include <bits/stdc++.h>
using namespace std;
 
/* Returns length of LCS for X[0..m-1], Y[0..n-1] */
int lcs(char* X, char* Y, int m, int n,
        vector<vector<int> >& dp)
{
    if (m == 0 || n == 0)
        return 0;
    // Reuse a stored answer before doing any further work
    if (dp[m][n] != -1)
        return dp[m][n];
    if (X[m - 1] == Y[n - 1])
        return dp[m][n] = 1 + lcs(X, Y, m - 1, n - 1, dp);

    return dp[m][n] = max(lcs(X, Y, m, n - 1, dp),
                          lcs(X, Y, m - 1, n, dp));
}
 
/* Driver code */
int main()
{
    char X[] = "AGGTAB";
    char Y[] = "GXTXAYB";
 
    int m = strlen(X);
    int n = strlen(Y);
    vector<vector<int> > dp(m + 1, vector<int>(n + 1, -1));
    cout << "Length of LCS is " << lcs(X, Y, m, n, dp);
 
    return 0;
}

2.5. Time Complexity:


Memoization reduces the time complexity from exponential to O(m*n) compared to the naïve
recursive approach. However, as we will see, for exact pattern matching there exists an
algorithm (KMP) that runs in linear time, O(n).
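The difference between the two approaches can be observed by counting how many times the recursive body runs. The sketch below is only an illustration: the function `count_calls` and the test strings are our own choices, and `functools.lru_cache` stands in for the explicit dp table used in the C++ code.

```python
from functools import lru_cache


def count_calls(x, y):
    """Return (lcs_length, naive_calls, memo_calls) for strings x and y."""
    naive_calls = 0
    memo_calls = 0

    def naive(m, n):
        nonlocal naive_calls
        naive_calls += 1
        if m == 0 or n == 0:
            return 0
        if x[m - 1] == y[n - 1]:
            return 1 + naive(m - 1, n - 1)
        return max(naive(m, n - 1), naive(m - 1, n))

    @lru_cache(maxsize=None)
    def memo(m, n):
        nonlocal memo_calls
        memo_calls += 1  # the body runs only on a cache miss
        if m == 0 or n == 0:
            return 0
        if x[m - 1] == y[n - 1]:
            return 1 + memo(m - 1, n - 1)
        return max(memo(m, n - 1), memo(m - 1, n))

    length = naive(len(x), len(y))
    assert length == memo(len(x), len(y))
    return length, naive_calls, memo_calls


# Worst case for the naive version: no character of x occurs in y
length, naive_calls, memo_calls = count_calls("ABCDEFGH", "STUVWXYZ")
print(length, naive_calls, memo_calls)
```

The memoized version can never execute its body more than (m+1)*(n+1) times, once per distinct subproblem, whereas the naive count grows exponentially with the input length.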

3. Rabin Karp Algorithm:


This algorithm is based on the concept of hashing. It compares the hash value of the pattern
with the hash value of the current substring of the text, and only if the hash values match does it
start comparing individual characters.

3.1 Pseudo-code:

Here T is the text, P is the pattern, d is the radix (alphabet size) and q is a prime modulus.

n = T.length
m = P.length
h = d^(m-1) mod q
p = 0
t_0 = 0
for i = 1 to m
    p = (d*p + P[i]) mod q
    t_0 = (d*t_0 + T[i]) mod q
for s = 0 to n - m
    if p = t_s
        if P[1..m] = T[s+1..s+m]
            print "pattern found at position" s
    if s < n - m
        t_(s+1) = (d*(t_s - T[s+1]*h) + T[s+m+1]) mod q
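The last line of the pseudo-code is the rolling update: it derives the hash of the next window from the hash of the current one in constant time. The following sketch checks that the rolled values agree with hashes computed from scratch. The helper names `window_hash` and `rolling_hashes` are our own, and d = 10 and q = 13 are simply the values used in the implementation in this report.

```python
def window_hash(s, d, q):
    """Hash of s computed from scratch with Horner's rule."""
    value = 0
    for ch in s:
        value = (d * value + ord(ch)) % q
    return value


def rolling_hashes(text, m, d=10, q=13):
    """Hashes of every length-m window of text, using the rolling update.

    Assumes 1 <= m <= len(text).
    """
    h = pow(d, m - 1, q)  # weight of the window's leading character
    t = window_hash(text[:m], d, q)
    hashes = [t]
    for s in range(len(text) - m):
        # Drop text[s] from the front, append text[s + m] at the back
        t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
        hashes.append(t)
    return hashes


text = "ABCCDDAEFG"
rolled = rolling_hashes(text, 3)
direct = [window_hash(text[s:s + 3], 10, 13) for s in range(len(text) - 2)]
print(rolled == direct)  # prints True
```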

Implementation
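# Rabin-Karp algorithm
d = 10  # radix used by the hash function


def search(pattern, text, q):
    m = len(pattern)
    n = len(text)
    if m == 0 or m > n:
        return  # nothing to search for
    p = 0  # hash value of the pattern
    t = 0  # hash value of the current text window
    h = 1

    # h = d^(m-1) mod q, the weight of the window's leading character
    for i in range(m - 1):
        h = (h * d) % q

    # Calculate hash value for pattern and first window of text
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q

    # Find the match
    for i in range(n - m + 1):
        # Compare characters only when the hash values agree
        if p == t and text[i:i + m] == pattern:
            print("Pattern is found at position: " + str(i + 1))

        # Roll the hash forward; Python's % never returns a negative value
        if i < n - m:
            t = (d * (t - ord(text[i]) * h) + ord(text[i + m])) % q


text = "ABCCDDAEFG"
pattern = "CDD"
q = 13
search(pattern, text, q)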

3.2 Time Complexity:


The average and best-case running time of the Rabin-Karp algorithm is O(n+m), but its worst-
case time is O(nm). The worst case occurs, for example, when all characters of the
pattern and the text are the same, because then the hash value of every substring of the text
matches the hash value of the pattern and every window must be verified character by character.
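The worst case can be made visible by counting how many windows pass the hash test, since each such window triggers a character-by-character check of up to m characters. The sketch below is an illustration only; `count_hash_hits` is a hypothetical helper, and d = 10 and q = 13 are the values used in the implementation above.

```python
def count_hash_hits(pattern, text, d=10, q=13):
    """Count the text windows whose hash equals the pattern's hash."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return 0
    h = pow(d, m - 1, q)
    p = t = 0
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    hits = 0
    for s in range(n - m + 1):
        if p == t:
            hits += 1  # this window would need an O(m) verification
        if s < n - m:
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return hits


# All characters equal: each of the 20 - 3 + 1 = 18 windows is a hash hit
print(count_hash_hits("AAA", "A" * 20))  # prints 18
# A more typical input produces far fewer hits
print(count_hash_hits("CDD", "ABCCDDAEFG"))
```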

4. Knuth–Morris–Pratt (KMP) Algorithm:


The idea is that whenever a mismatch is detected, we already know some of the characters in the text
of the next window. We take advantage of this information to avoid re-matching the characters
that we know will match anyway. One of the best applications of KMP is plagiarism checking.
First we construct the Pi table (prefix table) of the pattern. Once the Pi table is computed, the
next step is to search for the pattern in the given string.
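As a small illustration of the Pi table, the sketch below computes it for the pattern used later in this report. The function name `prefix_table` is our own, and the code uses 0-based indexing: entry pi[q] is the length of the longest proper prefix of pattern[:q+1] that is also a suffix of it.

```python
def prefix_table(pattern):
    """Return the Pi (prefix) table of pattern, 0-based."""
    pi = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for q in range(1, len(pattern)):
        # Fall back to shorter borders until the next character fits
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(prefix_table("ABABCABAB"))  # prints [0, 0, 1, 2, 0, 1, 2, 3, 4]
```

For example, the final entry 4 says that "ABAB" is both a prefix and a suffix of the pattern, so after a full match the search can continue with four characters already matched.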

4.1 Pseudo-code:
(a) COMPUTE-PREFIX-FUNCTION (P)
1. m ← length [P]
2. Π [1] ← 0
3. k ← 0
4. for q ← 2 to m
5.     do while k > 0 and P [k + 1] ≠ P [q]
6.         do k ← Π [k]
7.     if P [k + 1] = P [q]
8.         then k ← k + 1
9.     Π [q] ← k
10. return Π
(b) KMP-MATCHER (T, P)
1. n ← length [T]
2. m ← length [P]
3. Π ← COMPUTE-PREFIX-FUNCTION (P)
4. q ← 0
5. for i ← 1 to n
6.     do while q > 0 and P [q + 1] ≠ T [i]
7.         do q ← Π [q]
8.     if P [q + 1] = T [i]
9.         then q ← q + 1
10.    if q = m
11.        then print "Pattern occurs with shift" i - m
12.             q ← Π [q]

Implementation

// C++ program for implementation of the KMP pattern searching algorithm
#include <bits/stdc++.h>
void computeLPSArray(char* pat, int M, int* lps);
// Prints occurrences of pat[] in txt[]
void KMPSearch(char* pat, char* txt)
{
    int M = strlen(pat);
    int N = strlen(txt);
 
    // create lps[] that will hold the longest prefix suffix
    // values for pattern (std::vector is used because
    // variable-length arrays are not standard C++)
    std::vector<int> lps(M);
    // Preprocess the pattern (calculate lps[] array)
    computeLPSArray(pat, M, lps.data());
 
    int i = 0; // index for txt[]
    int j = 0; // index for pat[]
    while ((N - i) >= (M - j)) {
        if (pat[j] == txt[i]) {
            j++;
            i++;
        }
 
        if (j == M) {
            printf("Found pattern at index %d ", i - j);
            j = lps[j - 1];
        }
        // mismatch after j matches
        else if (i < N && pat[j] != txt[i]) {
            // Do not match lps[0..lps[j-1]] characters,
            // they will match anyway
            if (j != 0)
                j = lps[j - 1];
            else
                i = i + 1;
        }
    }
}
 
// Fills lps[] for given pattern pat[0..M-1]
void computeLPSArray(char* pat, int M, int* lps)
{
    // length of the previous longest prefix suffix
    int len = 0;
 
    lps[0] = 0; // lps[0] is always 0
 
    // the loop calculates lps[i] for i = 1 to M-1
    int i = 1;
    while (i < M) {
        if (pat[i] == pat[len]) {
            len++;
            lps[i] = len;
            i++;
        }
        else // (pat[i] != pat[len])
        {
            // This is tricky. Consider the example.
            // AAACAAAA and i = 7. The idea is similar
            // to search step.
            if (len != 0) {
                len = lps[len - 1];
                // Also, note that we do not increment
                // i here
            }
            else // if (len == 0)
            {
                lps[i] = 0;
                i++;
            }
        }
    }
}
// Driver code
int main()
{
    char txt[] = "ABABDABACDABABCABAB";
    char pat[] = "ABABCABAB";
    KMPSearch(pat, txt);
    return 0;
}

4.2 Time Complexity:


Prefix function: In the above pseudo-code for calculating the prefix function, the for loop in
steps 4 to 9 runs m - 1 times, and steps 1 to 3 take constant time. The inner while loop only
decreases k, and k is increased at most once per iteration of the for loop, so the while loop runs
at most m times in total. Hence the running time of computing the prefix function is O(m).
Matching function: The for loop beginning in step 5 runs n times, i.e., once for each character of
the text T. Since steps 1, 2 and 4 take constant time and the same argument bounds the inner while
loop, the running time is dominated by this for loop. Thus, the running time of the matching
function is O(n).
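The linear bound can be checked empirically by counting character comparisons during the search. The following is a sketch under our own assumptions: `kmp_search_counting` is a hypothetical instrumented variant, not the project's C++ code, and it counts only comparisons between a pattern character and a text character.

```python
def kmp_search_counting(text, pattern):
    """Return (match positions, number of pattern/text comparisons)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return [], 0
    # Build the prefix table (0-based)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    matches, comparisons, q = [], 0, 0
    for i in range(n):
        while True:
            comparisons += 1
            if pattern[q] == text[i]:
                q += 1          # extend the current partial match
                break
            if q == 0:
                break           # no partial match left to fall back to
            q = pi[q - 1]       # fall back without moving i
        if q == m:
            matches.append(i - m + 1)
            q = pi[q - 1]
    return matches, comparisons


text = "ABABDABACDABABCABAB"
matches, comparisons = kmp_search_counting(text, "ABABCABAB")
print(matches, comparisons <= 2 * len(text))  # prints [10] True
```

Each comparison either advances to the next text character or shortens the partial match, and the partial match can only grow once per text character, so the count never exceeds 2n.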

5. Time complexity Comparison:


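The time complexities of the three algorithms, as stated in the sections above, are summarized in
the table below. For Rabin-Karp and KMP, n is the text length and m is the pattern length; for
LCS, m and n are the lengths of the two sequences.

Algorithm                    Average case    Worst case
LCS (memoization)            O(m*n)          O(m*n)
Rabin-Karp                   O(n+m)          O(nm)
Knuth-Morris-Pratt (KMP)     O(n+m)          O(n+m)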

6. Conclusion:
A key advantage of KMP is that its worst-case efficiency is guaranteed. Preprocessing
always takes O(m) time, where m is the pattern length, and searching always takes O(n) time,
where n is the text length. There are no unlucky worst-case inputs. For the string-matching
problem, Knuth, Morris, and Pratt offer a linear-time solution. By never re-comparing text
characters that have already been matched against the pattern 'p', a matching time of O(n)
is attained.
7. Future Work:
For the most part, accurate plagiarism detection is currently limited to text content. Three
sectors that could benefit from a similar plagiarism checker product are art, music, and video. All
of these industries currently have a number of outstanding legal cases and disputes over
plagiarized content. With the right technology, plagiarism in all of these fields could be
minimized through detection and checking prior to distribution.

