Semester Final Project Report

The document is a project report for plagiarism detection. It contains: 1) An outline of the project including the background, objectives, and implementation structure. 2) Descriptions of three string matching algorithms - Longest Common Subsequence (LCS), Rabin Karp, and Knuth-Morris-Pratt (KMP) - that could be used for plagiarism detection. 3) Pseudocode and implementations for the LCS and KMP algorithms. The report proposes that KMP may be the best algorithm for plagiarism detection due to its linear time complexity.

Uploaded by

Engineer Zain

Project Report

CS-854 Advanced Algorithm Analysis

Project Title: Plagiarism Detection

Group Members:
Hafiz Muhammad Anas
Zain Ul Abideen

Submitted To: Prof. Dr Zuhair Zafar

1. Outline of Project:
1.1 Background of the Project:
Plagiarism is not a new problem in academia, but it has grown in importance with the spread of the
Internet and the ease of access to a global source of content, making human-only intervention
inadequate. Plagiarism nevertheless remains a significant issue, and computer-assisted plagiarism
detection is currently an active area of study in the fields of Information Retrieval (IR) and
Natural Language Processing (NLP). The development of numerous software tools has made this process
easier, and we provide an overview of plagiarism detection algorithms for use in academic and
educational contexts.
Although the fight against plagiarism and its ethical implications is not a new problem for academia
or anywhere else, it is still evident that a small number of people are always eager to take the
dishonest shortcut to academic success. Despite the popularity of automatic anti-cheating support
tools, plagiarism is still being committed by unique and creative means. Reviewers are forced to rely
on these programs because it is nearly impossible to detect plagiarism in any form of work using only
human labor. Teachers and reviewers alike are entitled to question the strength and utility of current
plagiarism detection technologies when confronted with plagiarism incidents that go beyond obvious

1.2 Objectives of the Project:

Since the early 2000s, numerous comparative studies with various scopes and standards have been
carried out on plagiarism detection technologies. Their shared purpose was to assess the fitness of
these technologies for academic use, either broadly or in a particular context. The main objective
of this project is to propose a better algorithm that can detect plagiarism while taking less time.

1.3. Project Period:

The Project started in November 2022 and will last for almost 2 months.

1.4 Project Implementation Structure:

We will discuss three string-matching algorithms and then explain why Knuth-Morris-Pratt
is better than the others. Every text-processing application must tackle string and pattern
matching. Exact pattern matching is a fairly straightforward yet crucial problem, variations of
which appear when searching for plagiarism or for similar DNA or protein sequences.
Here is a brief overview of string-matching algorithms.
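
As a baseline for the algorithms that follow, the exact pattern matching problem can be sketched with a brute-force matcher. This is only an illustrative sketch; the function name naive_search is our own and does not come from the report's implementations. It tries every alignment of the pattern against the text, which costs up to O(n*m) character comparisons.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for s in range(n - m + 1):          # try every possible alignment
        if text[s:s + m] == pattern:    # up to m character comparisons
            matches.append(s)
    return matches


print(naive_search("ABABDABACDABABCABAB", "ABABCABAB"))  # [10]
```

The algorithms discussed below aim to avoid the repeated comparisons that this baseline performs.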

2. Longest common subsequence (LCS):

The longest common subsequence (LCS) is defined as the longest subsequence that is
common to all the given sequences, provided that the elements of the subsequence are not
required to occupy consecutive positions within the original sequences. Here, "longest"
means that the subsequence should contain as many elements as possible. The term "common"
denotes characters that are shared by both strings. The term "subsequence" refers to a
selection of characters from a string, taken in their original relative order.
2.1 LCS Problem Statement:
Given two sequences, find the length of the longest subsequence present in both of
them. A subsequence is a sequence that appears in the same relative order, but is not
necessarily contiguous. For example, "abc", "abg", "bdf", "aeg", "acefg", etc. are
subsequences of "abcdefg".
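
The definition of a subsequence can be checked mechanically. The sketch below is illustrative only; the helper name is_subsequence is our own. It scans the source string once and advances through the candidate whenever the next needed character appears.

```python
def is_subsequence(candidate: str, source: str) -> bool:
    """True if candidate can be obtained from source by deleting characters."""
    i = 0  # position of the next character of candidate still to be found
    for ch in source:
        if i < len(candidate) and candidate[i] == ch:
            i += 1
    return i == len(candidate)


for s in ["abc", "abg", "bdf", "aeg", "acefg"]:
    print(s, is_subsequence(s, "abcdefg"))   # all True
print("gfa", is_subsequence("gfa", "abcdefg"))  # False: order is not preserved
```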
2.2 Pseudo-code:
LCS-LENGTH (X, Y)
    m ← length [X]
    n ← length [Y]
    for i ← 1 to m
        do c [i,0] ← 0
    for j ← 0 to n
        do c [0,j] ← 0
    for i ← 1 to m
        do for j ← 1 to n
            do if xi = yj
                then c [i,j] ← c [i-1,j-1] + 1
                     b [i,j] ← "↖"
                else if c [i-1,j] ≥ c [i,j-1]
                    then c [i,j] ← c [i-1,j]
                         b [i,j] ← "↑"
                    else c [i,j] ← c [i,j-1]
                         b [i,j] ← "←"
    return c and b
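
The pseudo-code above can be sketched in Python as follows. This is a minimal sketch under a few assumptions of our own: indices are 0-based, the function names lcs_table and reconstruct are hypothetical, and instead of storing the arrow table b we re-derive each arrow from the c table while walking back from the bottom-right cell.

```python
def lcs_table(x: str, y: str) -> list[list[int]]:
    """Fill c so that c[i][j] is the LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


def reconstruct(x: str, y: str, c: list[list[int]]) -> str:
    """Walk back through c, following the same arrows the b table would hold."""
    i, j = len(x), len(y)
    out = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:          # diagonal arrow
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:  # up arrow
            i -= 1
        else:                             # left arrow
            j -= 1
    return "".join(reversed(out))


table = lcs_table("AGGTAB", "GXTXAYB")
print(table[-1][-1])                              # 4
print(reconstruct("AGGTAB", "GXTXAYB", table))    # GTAB
```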

2.3 Naïve Approach:

The obvious approach is to generate all possible subsequences of the two given sequences
and then pick the longest one that they share. The time complexity of this solution is
exponential. The naive recursive approach takes O(2^n) time in the worst case, which
occurs when all characters of X and Y mismatch, i.e., when the length of the LCS is 0.
A naive recursive implementation of the LCS problem:
#include <bits/stdc++.h>
using namespace std;

/* Returns length of LCS for X[0..m-1], Y[0..n-1] */
int lcs( char *X, char *Y, int m, int n )
{
    if (m == 0 || n == 0)
        return 0;
    if (X[m-1] == Y[n-1])
        return 1 + lcs(X, Y, m-1, n-1);
    else
        return max(lcs(X, Y, m, n-1), lcs(X, Y, m-1, n));
}

/* Driver code */
int main()
{
    char X[] = "AGGTAB";
    char Y[] = "GXTXAYB";

    int m = strlen(X);
    int n = strlen(Y);

    cout<<"Length of LCS is "<< lcs( X, Y, m, n ) ;

    return 0;
}

2.4 Memoization Approach:

The recursion tree of the naive approach solves the same subproblem, such as
lcs("AXY", "AYZ"), more than once. If the entire recursion tree is drawn, many subproblems
turn out to be solved repeatedly. Thus, this problem has the overlapping subproblems
property, and recomputation of the same subproblems can be avoided by using either
memoization or tabulation.
A memoized implementation of the LCS problem:
#include <bits/stdc++.h>
using namespace std;

/* Returns length of LCS for X[0..m-1], Y[0..n-1] */
int lcs(char* X, char* Y, int m, int n,
        vector<vector<int> >& dp)
{
    if (m == 0 || n == 0)
        return 0;

    // Reuse the stored answer before doing any further recursion
    if (dp[m][n] != -1) {
        return dp[m][n];
    }
    if (X[m - 1] == Y[n - 1])
        return dp[m][n] = 1 + lcs(X, Y, m - 1, n - 1, dp);

    return dp[m][n] = max(lcs(X, Y, m, n - 1, dp),
                          lcs(X, Y, m - 1, n, dp));
}

/* Driver code */
int main()
{
    char X[] = "AGGTAB";
    char Y[] = "GXTXAYB";

    int m = strlen(X);
    int n = strlen(Y);
    vector<vector<int> > dp(m + 1, vector<int>(n + 1, -1));
    cout << "Length of LCS is " << lcs(X, Y, m, n, dp);

    return 0;
}

2.5. Time Complexity:

Compared to the naïve recursive approach, the time complexity is reduced from exponential
to O(m*n). As we will see, however, for exact string matching there exists an algorithm
with linear O(n) time complexity.
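
The drop from exponential to O(m*n) can be observed by counting recursive calls. The sketch below is illustrative and uses names of our own (naive_calls, memo_calls); it relies on Python's functools.lru_cache for memoization rather than the explicit dp table used above, and it feeds both versions the worst case described earlier, two strings with no characters in common.

```python
from functools import lru_cache


def naive_calls(x: str, y: str) -> int:
    """Count the recursive calls made by the naive LCS recursion."""
    calls = 0

    def lcs(m: int, n: int) -> int:
        nonlocal calls
        calls += 1
        if m == 0 or n == 0:
            return 0
        if x[m - 1] == y[n - 1]:
            return 1 + lcs(m - 1, n - 1)
        return max(lcs(m, n - 1), lcs(m - 1, n))

    lcs(len(x), len(y))
    return calls


def memo_calls(x: str, y: str) -> int:
    """Count how many distinct subproblems the memoized recursion evaluates."""
    calls = 0

    @lru_cache(maxsize=None)
    def lcs(m: int, n: int) -> int:
        nonlocal calls
        calls += 1  # the body runs only on a cache miss
        if m == 0 or n == 0:
            return 0
        if x[m - 1] == y[n - 1]:
            return 1 + lcs(m - 1, n - 1)
        return max(lcs(m, n - 1), lcs(m - 1, n))

    lcs(len(x), len(y))
    return calls


x, y = "A" * 10, "B" * 10   # no common characters: worst case for the naive version
print("naive calls:   ", naive_calls(x, y))
print("memoized calls:", memo_calls(x, y))   # at most (m + 1) * (n + 1)
```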

3. Rabin-Karp Algorithm:

This algorithm is based on the concept of hashing. It compares the hash value of the pattern
with the hash value of the current substring of the text, and only if the hash values match
does it start comparing individual characters.
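
What makes this hashing cheap is that the hash of the next window can be derived from the hash of the current one in constant time. The sketch below illustrates that rolling update; the constants D and Q and the helper names window_hash and roll are assumptions chosen for illustration, not values taken from the report.

```python
D = 256   # assumed alphabet size (radix of the polynomial hash)
Q = 101   # assumed small prime modulus


def window_hash(s: str) -> int:
    """Hash a window from scratch: O(len(s))."""
    h = 0
    for ch in s:
        h = (D * h + ord(ch)) % Q
    return h


def roll(prev_hash: int, old_char: str, new_char: str, high: int) -> int:
    """Slide the window by one character in O(1).

    high is D^(m-1) mod Q, the weight of the character leaving the window.
    """
    return (D * (prev_hash - ord(old_char) * high) + ord(new_char)) % Q


text, m = "ABCCDDAEFG", 3
high = pow(D, m - 1, Q)
h = window_hash(text[:m])
for s in range(1, len(text) - m + 1):
    h = roll(h, text[s - 1], text[s + m - 1], high)
    # The rolled hash agrees with hashing the window from scratch.
    assert h == window_hash(text[s:s + m])
print("rolling hash matches direct hashing for every window")
```

Python's % operator already returns a non-negative result for a positive modulus, so no extra correction for negative values is needed in this sketch.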

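3.1 Pseudo-code:

Here T is the text, P is the pattern, d is the alphabet size (radix), and q is a prime modulus.
p denotes the hash of the pattern and t_s the hash of the text window starting at shift s.

n = T.length
m = P.length
h = d^(m-1) mod q
p = 0
t_0 = 0
for i = 1 to m
    p = (d·p + P[i]) mod q
    t_0 = (d·t_0 + T[i]) mod q
for s = 0 to n - m
    if p = t_s
        if P[1..m] = T[s + 1..s + m]
            print "pattern found at position" s
    if s < n - m
        t_(s+1) = (d·(t_s - T[s + 1]·h) + T[s + m + 1]) mod q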

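Implementation
# Rabin-Karp algorithm
d = 10  # radix used for the rolling hash


def search(pattern, text, q):
    m = len(pattern)
    n = len(text)
    if m > n:
        return  # the pattern cannot occur in a shorter text
    p = 0  # hash value of the pattern
    t = 0  # hash value of the current text window
    h = 1

    # h = d^(m-1) % q, the weight of the window's leading character
    for i in range(m - 1):
        h = (h * d) % q

    # Calculate hash value for pattern and first window of text
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q

    # Find the match
    for i in range(n - m + 1):
        # Compare characters only when the hash values agree
        if p == t and text[i:i + m] == pattern:
            print("Pattern is found at position: " + str(i + 1))

        # Roll the hash forward to the next window
        if i < n - m:
            t = (d * (t - ord(text[i]) * h) + ord(text[i + m])) % q
            if t < 0:
                t = t + q


text = "ABCCDDAEFG"
pattern = "CDD"
q = 13
search(pattern, text, q)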

3.2 Time Complexity:

The average and best-case running time of the Rabin-Karp algorithm is O(n+m), but its
worst-case time is O(nm). The worst case occurs when all characters of the pattern and the
text are the same, because then the hash value of every substring of the text matches the hash
value of the pattern and every window must be verified character by character.
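
This worst case can be made visible by counting the character comparisons performed during verification. The sketch below is our own instrumented variant (the function name rabin_karp_comparisons and the default values of d and q are assumptions for illustration); on a text and pattern made of a single repeated character, every one of the n - m + 1 windows is verified in full.

```python
def rabin_karp_comparisons(text: str, pattern: str, d: int = 256, q: int = 13):
    """Return (match positions, number of character comparisons made)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], 0
    high = pow(d, m - 1, q)
    p = t = 0
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q

    matches, comparisons = [], 0
    for s in range(n - m + 1):
        if p == t:                       # hash hit: verify character by character
            k = 0
            while k < m:
                comparisons += 1
                if text[s + k] != pattern[k]:
                    break
                k += 1
            if k == m:
                matches.append(s)
        if s < n - m:                    # roll the hash to the next window
            t = (d * (t - ord(text[s]) * high) + ord(text[s + m])) % q
    return matches, comparisons


matches, comparisons = rabin_karp_comparisons("a" * 20, "a" * 5)
print(len(matches), comparisons)   # 16 windows, 16 * 5 = 80 comparisons
```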

4. Knuth–Morris–Pratt (KMP) Algorithm

The idea is that whenever a mismatch is detected, we already know some of the characters in the
text of the next window. We take advantage of this information to avoid re-comparing characters
that we know will match anyway. One of the best applications of KMP is plagiarism checking.
First we construct the Pi table (the prefix function of the pattern); once the Pi table is
computed, the next step is to search for the pattern in the given text.
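
To make the Pi table concrete, the sketch below computes it in Python. It is a minimal sketch with 0-based indexing, so pi[q] holds the length of the longest proper prefix of pattern[:q + 1] that is also a suffix of it; the function name prefix_function is our own.

```python
def prefix_function(pattern: str) -> list[int]:
    """Compute the Pi table (prefix function) of pattern."""
    pi = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for q in range(1, len(pattern)):
        # Fall back to shorter borders until the next character fits
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(prefix_function("ABABCABAB"))  # [0, 0, 1, 2, 0, 1, 2, 3, 4]
print(prefix_function("AAACAAAA"))   # [0, 1, 2, 0, 1, 2, 3, 3]
```

For example, the final entry 4 for "ABABCABAB" says that the prefix "ABAB" is also a suffix, so after a full match the search can resume with four characters already matched.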

4.1 Pseudo-code:
(a) COMPUTE-PREFIX-FUNCTION (P)
1. m ← length [P]
2. Π [1] ← 0
3. k ← 0
4. for q ← 2 to m
5.     do while k > 0 and P [k + 1] ≠ P [q]
6.         do k ← Π [k]
7.     if P [k + 1] = P [q]
8.         then k ← k + 1
9.     Π [q] ← k
10. return Π
(b) KMP-MATCHER (T, P)
1. n ← length [T]
2. m ← length [P]
3. Π ← COMPUTE-PREFIX-FUNCTION (P)
4. q ← 0
5. for i ← 1 to n
6.     do while q > 0 and P [q + 1] ≠ T [i]
7.         do q ← Π [q]
8.     if P [q + 1] = T [i]
9.         then q ← q + 1
10.    if q = m
11.        then print "Pattern occurs with shift" i - m
12.             q ← Π [q]

Implementation

// C++ program for implementation of the KMP pattern searching
// algorithm
#include <bits/stdc++.h>
void computeLPSArray(char* pat, int M, int* lps);
// Prints occurrences of pat[] in txt[]
void KMPSearch(char* pat, char* txt)
{
    int M = strlen(pat);
    int N = strlen(txt);

    // create lps[] that will hold the longest prefix suffix
    // values for pattern (std::vector instead of a non-standard
    // variable-length array)
    std::vector<int> lps(M);
    // Preprocess the pattern (calculate lps[] array)
    computeLPSArray(pat, M, lps.data());

    int i = 0; // index for txt[]
    int j = 0; // index for pat[]
    while ((N - i) >= (M - j)) {
        if (pat[j] == txt[i]) {
            j++;
            i++;
        }

        if (j == M) {
            printf("Found pattern at index %d ", i - j);
            j = lps[j - 1];
        }
        // mismatch after j matches
        else if (i < N && pat[j] != txt[i]) {
            // Do not match lps[0..lps[j-1]] characters,
            // they will match anyway
            if (j != 0)
                j = lps[j - 1];
            else
                i = i + 1;
        }
    }
}

// Fills lps[] for given pattern pat[0..M-1]
void computeLPSArray(char* pat, int M, int* lps)
{
    // length of the previous longest prefix suffix
    int len = 0;

    lps[0] = 0; // lps[0] is always 0

    // the loop calculates lps[i] for i = 1 to M-1
    int i = 1;
    while (i < M) {
        if (pat[i] == pat[len]) {
            len++;
            lps[i] = len;
            i++;
        }
        else // (pat[i] != pat[len])
        {
            // This is tricky. Consider the example.
            // AAACAAAA and i = 7. The idea is similar
            // to search step.
            if (len != 0) {
                len = lps[len - 1];
                // Also, note that we do not increment
                // i here
            }
            else // if (len == 0)
            {
                lps[i] = 0;
                i++;
            }
        }
    }
}
// Driver code
int main()
{
    char txt[] = "ABABDABACDABABCABAB";
    char pat[] = "ABABCABAB";
    KMPSearch(pat, txt);
    return 0;
}

4.2 Time Complexity:

Prefix function: In the above pseudo-code for calculating the prefix function, the for loop in
steps 4 to 9 runs m - 1 times, and steps 1 to 3 take constant time. The inner while loop only
decreases k, and k increases by at most one per iteration of the for loop, so the while loop
runs at most m times in total. Hence the running time of computing the prefix function is O(m).
Matching function: The for loop beginning in step 5 runs n times, i.e., once per character of
the text T. Apart from the O(m) call in step 3, steps 1 to 4 take constant time, and by the
same argument the while loop in steps 6 and 7 runs at most n times in total, so the running
time is dominated by this for loop. Thus, the running time of the matching function is O(n).
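
The linear bound can be checked empirically by counting character comparisons. The sketch below is our own instrumented 0-based variant of the matcher (the names prefix_function and kmp_search are hypothetical, not the report's C++ code). Each text character causes one comparison that ends its step plus one comparison per fallback, and the number of fallbacks is bounded by the number of successful matches, so the total never exceeds 2n.

```python
def prefix_function(pattern: str) -> list[int]:
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(text: str, pattern: str):
    """Return (0-based match positions, text/pattern comparisons made)."""
    m = len(pattern)
    if m == 0:
        return [], 0
    pi = prefix_function(pattern)
    matches, comparisons = [], 0
    q = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while True:
            comparisons += 1
            if pattern[q] == ch:
                q += 1
                break
            if q == 0:
                break
            q = pi[q - 1]      # fall back without moving backwards in the text
        if q == m:
            matches.append(i - m + 1)
            q = pi[q - 1]
    return matches, comparisons


text = "ABABDABACDABABCABAB"
matches, comparisons = kmp_search(text, "ABABCABAB")
print(matches)                          # [10]
print(comparisons <= 2 * len(text))     # True
```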

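5. Time complexity Comparison:

The time complexity comparison between all of these three algorithms, as stated in the
sections above, is given below in the form of a table.

Algorithm                  Average case    Worst case
LCS (memoization)          O(m*n)          O(m*n)
Rabin-Karp                 O(n+m)          O(n*m)
Knuth-Morris-Pratt         O(n+m)          O(n+m)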

6. Conclusion:
The main advantage of KMP is that its worst-case efficiency is guaranteed. Preprocessing
always takes O(m) time, while searching always takes O(n) time, where m is the length of the
pattern and n is the length of the text. There is no possibility of being unlucky and no
pathological worst-case input. For the string-matching problem, Knuth, Morris, and Pratt
offer a linear-time solution. By avoiding repeated comparisons with text characters that have
already been compared against characters of the pattern 'p', a matching time of O(n) is
attained.
7. Future Work:
For the most part, accurate plagiarism detection is currently limited to text content. Three
sectors that could benefit from a similar plagiarism checker are art, music, and video. All
of these industries currently have a number of outstanding legal cases and disputes over
plagiarized content. With the right technology, plagiarism in all of these fields could be
minimized through detection and checking prior to distribution.
