Project Documentation
MASTER MCQ'S USING TEXTUAL DATA
A project report submitted in partial fulfillment of the
Requirements for the award of degree of
Bachelor of Technology
in
Computer Science and Engineering
by
M.Deevena (S170747)
V.Anjani Devi (S171018)
O.Devika Tejaswi(S170402)
RAJIV GANDHI UNIVERSITY OF KNOWLEDGE TECHNOLOGIES
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING
(A.P. Government Act 18 of 2008)
RGUKT-Srikakulam, Srikakulam Dist – 532410
Tele Fax: 08656 – 235557/235150
CERTIFICATE OF COMPLETION
This is to certify that the work entitled “Master MCQ's using Textual Data” is the
bona fide work of M. Deevena (S170747), V. Anjani Devi (S171018), and O. Devika
Tejaswi (S170402), carried out under the guidance and supervision of Ms. J. Vishnu
Priyanka for the final year project of Bachelor of Technology in the Department of
Computer Science and Engineering at Rajiv Gandhi University of Knowledge
Technologies (RGUKT), Srikakulam. This work was completed during the academic
session of December 2022 – April 2023 under my guidance.
------------------------- ----------------------------
Ms.J.Vishnu Priyanka Ms. M. Roopa
Assistant Professor Head of the Department
Department of CSE Department of CSE
RGUKT - Srikakulam RGUKT - Srikakulam
CERTIFICATE OF EXAMINATION
This is to certify that the work entitled “Master MCQ's using Textual Data” is the bona
fide work of M. Deevena (S170747), V. Anjani Devi (S171018), and O. Devika Tejaswi
(S170402), and we hereby accord our approval of it as a study carried out and presented
in the manner required for its acceptance in partial fulfilment of the requirements for the
award of the degree of Bachelor of Technology for which it has been submitted. This
approval does not necessarily endorse or accept every statement made, opinion expressed,
or conclusion drawn, as recorded in this thesis. It only signifies the acceptance of the
thesis for the purpose for which it has been submitted.
------------------------- --------------------------
Ms.J.Vishnu Priyanka Examiner
Assistant Professor Assistant Professor
Department of CSE Department of CSE
RGUKT - Srikakulam RGUKT - Srikakulam
DECLARATION
We declare that this project is the result of our own effort and has not been
copied or imitated from any source. Citations from any websites are mentioned in the
references. Furthermore, the results presented in this project report have not been
submitted to any other university or institute for the award of any degree or diploma.
Date: ____________
Place: ____________
M.Deevena(S170747)
V.Anjani Devi(S171018)
O.Devika Tejaswi(S170402)
ACKNOWLEDGEMENT
We express our sincere appreciation and utmost respect to our team guide, Ms.
J. Vishnu Priyanka, for her exceptional guidance, monitoring, and constant motivation
throughout this semester. The time spent with her during the course of this project has
been invaluable, and we will always treasure the knowledge gained in the field of Web
Development and Natural Language Processing (NLP).
We are grateful for the confidence bestowed upon us and for entrusting our
project entitled "Master MCQ's using Textual data" to our team.
We extend our gratitude to Ms. M. Roopa (HOD of CSE) and other faculty
members for serving as a source of inspiration and constant encouragement that assisted
us in successfully completing the project.
Finally, we express our heartfelt appreciation to our parents for their blessings
and to our friends for their assistance and good wishes towards the successful
completion of this project.
M.Deevena(S170747)
V.Anjani Devi(S171018)
O.Devika Tejaswi(S170402)
ABSTRACT
The main aim of our project, 'Master MCQ’s using textual data', is to develop a software
tool that utilizes NLP techniques to automatically generate multiple-choice questions
from textual data in various formats, including PDFs, DOCs, TXTs, images, PPTs, and
URLs. The tool extracts important keywords and phrases from the input data and uses
them to generate questions with relevant distractors. In addition, this project includes a
voice assistance feature to aid blind individuals in writing exams more easily. The
system reads the generated questions aloud and accepts user input in the form of options,
which are marked accordingly. This project features a web-based interface that is easy
to use and accessible to individuals with different levels of technical skills. It is a
valuable resource for educators, professionals, and individuals with visual impairments
who need to create or take multiple-choice exams. It may have applications in various
fields, including education, training, and accessibility.
TABLE OF CONTENTS
1.INTRODUCTION ............................................................................................................ 1
1.1 Problem Definition ....................................................................................... 1
1.2 Motivation of the Project .................................................................................. 1
1.3 Limitations of the Project ................................................................................. 2
1.4 Existing System ................................................................................................ 2
1.5 Proposed System .............................................................................................. 3
1.6 Scope of the Project .......................................................................................... 3
1.6.1 User-Friendly Interface ......................................................................................... 4
1.6.2 Efficient Time Management .................................................................................. 4
2. LITERATURE SURVEY ................................................................................................ 5
2.1 Automatic Generation of Multiple Choice Questions Using Wikipedia: ............ 5
2.2 An automated multiple choice question generation using Natural Language
Processing Techniques:........................................................................................... 6
2.3 Automated MCQ Generator using Natural Language Processing: ..................... 6
3. REQUIREMENT SPECIFICATION ............................................................................. 8
3.1 Functional Requirements .................................................................................. 8
3.2 Non-Functional Requirements .......................................................................... 9
3.3 System Requirements ..................................................................................... 11
3.3.1 Hardware Requirements ...................................................................................... 11
3.3.2 Software Requirements ....................................................................................... 12
4. SYSTEM DESIGN......................................................................................................... 13
4.1 Introduction .................................................................................................... 13
4.2 Class Diagram ................................................................................................ 14
4.3 Use case Diagram ........................................................................................... 16
4.4 Sequence Diagram .......................................................................................... 17
4.5 Activity Diagram ............................................................................................ 18
5. WORKING PROCESS.................................................................................................. 19
5.1 Components of Quiz Craft .............................................................................. 19
5.2 Features .......................................................................................................... 19
5.3 Working procedure ......................................................................................... 20
6. RESULTS AND OUTPUT SCREENS .......................................................................... 21
6.1 Registration Page ............................................................................................ 21
6.2 Login Page ..................................................................................................... 21
6.3 Home Page ..................................................................................................... 22
6.4 Insert Data Page ............................................................................................. 22
6.5 Generated Questions Page .............................................................................. 24
6.6 Admin Dashboard Page .................................................................................. 24
6.7 Manual Questions Page .................................................................................. 25
7. TESTING AND VALIDATION .................................................................................... 27
7.1 Introduction .................................................................................................... 27
7.2 Types of Testing ............................................................................................. 27
7.2.1 Unit Testing ........................................................................................................ 27
7.2.2 Black Box Testing ............................................................................................... 28
7.2.3 White Box Testing .............................................................................................. 28
7.2.4 Integration Testing .............................................................................................. 28
7.3 Validation ....................................................................................................... 28
8. CONCLUSION .............................................................................................................. 33
9. REFERENCES .............................................................................................................. 34
9.1 Automatic Generation of Multiple Choice Questions Using Wikipedia ........... 34
9.2 An automated multiple choice question generation using Natural Language
Processing Techniques .......................................................................................... 34
9.3 Automated MCQ Generator using Natural Language Processing .................... 34
APPENDIX ........................................................................................................................ 35
1.INTRODUCTION
The idea to implement the "Master MCQ's using Textual Data" web-based
software tool was based on our personal experiences with the time-consuming process
of manually generating multiple-choice questions (MCQs) and the lack of accessibility
options for individuals with visual impairments. We observed that existing methods for
generating MCQs require significant manual effort, which may not be feasible for
creating a large number of questions. Additionally, individuals with visual impairments
face challenges in accessing and answering MCQs. As technical students, we aimed to
address these issues through a technical solution. After successful analysis and
brainstorming discussions, we proposed our idea, which we believe can provide a partial
solution to these major issues. Our system utilizes NLP techniques to automatically
generate MCQs from various formats of textual data and includes a voice assistance
feature for blind individuals to easily access and answer the generated questions. By
developing this system, we aim to contribute towards a more efficient and accessible
learning environment, and we hope that our software tool can help to improve the
educational experience for all.
1.3 Limitations of the Project
1.5 Proposed System
Our proposed system, "Master MCQ's using textual data," leverages NLP-based
techniques to generate correct and relevant questions from textual data with greater
efficiency and accuracy. We use the Text-to-Text Transfer Transformer (T5) algorithm
to summarize the text and generate MCQs through sentence mapping, which greatly
reduces the time and effort required to generate questions manually. Additionally, we
use WordNet or Sense2Vec to generate distractors for each question. By using T5 and a
lexical database, our proposed system can generate questions with high accuracy and
relevance to the given lesson material. We developed a web-based application that
provides a user-friendly interface for effortless accessibility. This application will be
beneficial for educators and visually impaired individuals who require assistance in
generating or taking multiple-choice tests. Overall, our proposed system aims to
enhance the speed, accuracy, and accessibility of generating and taking multiple-choice
exams from textual data.
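The summarization step relies on the T5 model from the transformers library, but the sentence-mapping idea, turning a summary sentence and an extracted keyword into a fill-in-the-blank question with shuffled options, can be sketched in plain Python. The make_mcq helper and the sample sentence below are our own illustration, not the project's actual code:

```python
import random

def make_mcq(sentence, keyword, distractors, seed=0):
    """Map a sentence containing `keyword` to a fill-in-the-blank MCQ.

    The correct answer is mixed with the distractors to form the options.
    """
    stem = sentence.replace(keyword, "_____")
    options = [keyword] + list(distractors)
    random.Random(seed).shuffle(options)
    return {"question": stem, "options": options, "answer": keyword}

mcq = make_mcq(
    "Photosynthesis occurs in the chloroplast of a plant cell.",
    "chloroplast",
    ["mitochondrion", "nucleus", "ribosome"],
)
print(mcq["question"])  # Photosynthesis occurs in the _____ of a plant cell.
```

In the full pipeline, the sentence would come from the T5 summary and the distractors from WordNet or Sense2Vec, as described above.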
1.6 Scope of the Project
Our project has a broad scope and can serve a diverse audience facing challenges
such as limited resources and tight deadlines. Our software tool is designed to
automatically generate multiple-choice questions from various forms of textual data,
including PDFs, DOCs, TXTs, images, PPTs, and URLs. Additionally, our system
includes a voice assistance feature that aids visually impaired individuals in taking and
writing exams. The tool is web-based and provides a user-friendly interface that can be
accessed by individuals with varying levels of technical expertise. With potential
applications in education and training, our tool can benefit educators, trainers, and
anyone in need of generating or taking multiple-choice exams from textual data.
Overall, our project aims to provide an efficient and accessible solution to the challenges
of generating and taking multiple-choice exams from textual data, with the goal of
improving accessibility and enhancing the learning experience.
The proposed project focuses on the following aspects:
A user-friendly interface is available for users to interact with the system and
create their accounts. Input provided by users through the graphical interface is
stored in the database for future use.
2. LITERATURE SURVEY
The following reference papers have been examined to identify the strengths and
weaknesses of existing systems for automatic MCQ generation. Through our analysis,
we have identified several limitations and potential areas for improvement in these
systems, which have informed the development of our own project. The following are
the list of the papers we reviewed, along with a brief summary of their findings:
2.1 Automatic Generation of Multiple Choice Questions Using Wikipedia:
In the domain of sports, the authors have developed a three-part format for
multiple-choice questions (MCQs), which includes the stem, serving as the
foundation for the question, the target word, indicating the correct answer, and
the distractors, which comprise the incorrect answers. By utilizing existing
questions in this field, the authors aimed to identify sentences suitable for
MCQs.
The authors employed a combination of parsing techniques and named entity
recognition (NER) systems to identify the correct answer for the MCQ. In
addition, they extracted extra attribute values of the correct answer from the
internet and explored Wikipedia for entities sharing similar attribute values.
To generate distractors, the authors retrieved relevant data from structured
sources, such as information boxes or the opening sentence of the content
featured on the right-hand side of a Wikipedia page. Subsequently, they searched
Wikipedia for related candidates from the same category, or selected distractors
at random from the pool of candidates.
Overall, the approach demonstrates the potential for automated generation of MCQs
using Wikipedia. However, randomly choosing distractors may lead to lower quality
MCQs, and not every topic has a corresponding Wikipedia page, so educators may not
be able to extract MCQs for their specific needs using this method.
2.2 An automated multiple choice question generation using Natural
Language Processing Techniques:
The authors developed a system that automatically generates multiple choice questions
(MCQs) from lesson materials using Term Frequency-Inverse Document Frequency
(TF-IDF) and N-grams. The system's efficiency is
assessed by comparing manually extracted keywords from five lesson materials by a
teacher with those auto-generated by the system. The number of MCQs generated for
each document was found to be proportional to the number of extracted keywords. The
system picked three other keywords at random from the extracted pool of keywords to
serve as distractors for users to select the correct option. However, a drawback of this
method is that the distractors, which are extracted keywords, may not match the options,
and the same options may repeat in every question.
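The TF-IDF scoring this paper relies on can be sketched in a few lines of plain Python; the tfidf_keywords helper below is our own illustration of the idea, not the authors' code:

```python
import math
from collections import Counter

def tfidf_keywords(documents, top_n=3):
    """Score each word by TF-IDF and return the top-scoring words per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in tokenized for word in set(doc))
    results = []
    for doc in tokenized:
        tf = Counter(doc)
        scores = {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}
        results.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return results

docs = [
    "neural networks learn representations from data",
    "decision trees split data on feature thresholds",
    "neural networks use layers of weights",
]
print(tfidf_keywords(docs, top_n=2))
```

Words that appear in every document score zero, so only terms distinctive to a lesson surface as keyword candidates.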
2.3 Automated MCQ Generator using Natural Language Processing:
The authors of this paper proposed an automated system that utilizes natural language
processing techniques for generating multiple choice questions (MCQs). The system
comprises three primary components, including text summarization, keyword
extraction, and distractor generation.
For text summarization, the authors use the BERT algorithm, which is a
state-of-the-art method for natural language processing tasks. The system first
identifies the most important sentences in a given text and then generates a
summary of the key points.
For keyword extraction, the authors use two different methods: the Python
Keyphrase Extraction (PKE) library and the Rapid Automatic Keyword Extraction
(RAKE) library. These methods identify the most relevant words and phrases in
the text, which are then used to generate the MCQs.
For distractor generation, the WordNet algorithm, a lexical database for English,
is employed to identify related words and concepts that are similar to the correct
answer but not quite right.
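The WordNet lookup itself needs the NLTK data files, but the selection logic can be illustrated with a small stand-in lexicon. The LEXICON mapping below is a toy substitute for WordNet's co-hyponyms (siblings under a shared hypernym), which is where such distractors come from:

```python
# Toy stand-in for WordNet: each concept maps to related terms that share a
# hypernym with it, which make natural "close but wrong" distractors.
LEXICON = {
    "chloroplast": ["mitochondrion", "nucleus", "ribosome", "vacuole"],
    "photosynthesis": ["respiration", "fermentation", "transpiration"],
}

def distractors_for(answer, n=3):
    """Return up to n distractors related to, but distinct from, the answer."""
    candidates = [w for w in LEXICON.get(answer, []) if w != answer]
    return candidates[:n]

print(distractors_for("chloroplast"))  # ['mitochondrion', 'nucleus', 'ribosome']
```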
Some limitations were identified during the research. Specifically, the fixed-length input
and output of the BERT algorithm can restrict its effectiveness for tasks such as
summarization, where input and output lengths can vary significantly. Additionally,
WordNet, which was last updated in 2012, has a limited number of words, resulting in
poor generation of distractors or even the inability to find suitable distractors. However,
in our project, we have considered using PKE for keyword extraction, which is capable
of identifying important keywords.
3. REQUIREMENT SPECIFICATION
3.1 Functional Requirements
Users must provide personal details such as name, ID number, address, phone
number, and email ID. To log in, the user must enter their username and password,
and they will be able to access only the software features that they are authorized to use.
Register: To begin using the system, the user is required to complete the
registration/sign-up process, with two distinct user types available:
Admin: The admin has to provide details like name, address, phone
number, email id.
Student: Students have to provide details such as name, ID
number, address, phone number, and email ID.
Login:
Input: Enter the username and password provided.
Output: Users will be able to use the features of software that they are
authorized to access.
3.2 Non-Functional Requirements
• Usability Requirement
The system must be accessible through a website on a phone, PC, or tablet, and users
should be able to navigate the system easily without any special training. The system
must be user-friendly.
• Availability Requirement
The website must be available to users 24 hours a day, 365 days a year with no
downtime.
• Efficiency Requirement
The Mean Time to Repair (MTTR) should be no more than an hour.
• Accuracy Requirement
The system must provide accurate information, taking concurrency issues into account.
• Performance Requirement
The system must refresh information based on updates and respond to user requests in
no more than two seconds. Large processing jobs may take longer.
• Security Requirement
The system must provide security by limiting access for different users.
• Reliability Requirement
The system must be highly reliable, and in case of server crashes, data must not be lost.
3.3 System Requirements
Hardware Requirements:
Laptop/PC
Processor: Intel Core i5 or higher
RAM: 8 GB or higher
Hard Disk Space: 50 GB or more
Display: 14-inch or larger display with a resolution of 1920 x 1080
pixels or higher
Graphics Card: Any modern graphics card with at least 2GB of
dedicated memory
Input devices: Keyboard or mouse
Operating System: Windows 10 or later
Microphone
Speakers or Headphones
Sound Card
Software Requirements:
Operating System:
Windows 8 or above.
Frontend:
Html 5
CSS3
Bootstrap 4
JavaScript
Web Framework:
Flask
Programming Language:
Python
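Since Flask is the chosen web framework, the request handling can be sketched as a minimal application. The /generate route and its JSON shape below are our own illustration of the pattern, not the project's actual endpoints:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/generate", methods=["POST"])
def generate():
    # In the real system this is where the NLP pipeline would run;
    # here we return a stub so the route structure is visible.
    text = request.get_json(force=True).get("text", "")
    if not text.strip():
        return jsonify({"error": "no text provided"}), 400
    return jsonify({"questions": [f"MCQ generated from {len(text)} characters"]})

# app.run(debug=True)  # uncomment to serve locally
```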
4. SYSTEM DESIGN
4.1 Introduction
In January 1997, the first version of the Unified Modelling Language (UML) was
released. It was the result of a collaboration between Grady Booch, James Rumbaugh,
and Ivar Jacobson, who combined the most effective aspects of their respective
object-oriented analysis and design techniques. UML's fundamental components are drawn
from the methods of Booch, OMT, and OOSE.
Use case Diagram: It displays the available use cases and their application by actors.
Class Diagram: It describes the system's structure, which comprises classes,
associations, and other relationships.
Sequence Diagram: It visualizes object interaction through message exchange.
Activity Diagram: It represents the program's flow from a defined starting point to a
finishing point.
State chart Diagram: It illustrates state machines, including states, transitions, events,
and activities.
Object Diagram: It depicts a snapshot of class object instances and their relationships.
Collaboration Diagram: It emphasizes the order in which objects send and receive
messages.
Component Diagram: It depicts the system's static implementation view by showing
the organization and dependencies among components.
Deployment Diagram: It shows the configuration of runtime processing nodes and
components that reside on them.
4.2 Class Diagram
A Class diagram is an integral part of system design and represents both the core objects
and interactions within the application, as well as the classes that will be programmed.
The diagram consists of boxes that represent each class, containing three distinct parts:
1. The top portion displays the class name, which is centrally aligned, written in
bold, and begins with a capitalized letter.
2. The middle portion lists the class attributes, aligned to the left and written in
lowercase.
3. The bottom part describes the methods that the class can execute, aligned to the
left and written in lowercase.
By identifying and grouping classes in a class diagram, the static relationships between
the objects can be established to facilitate system design. To further refine the
conceptual design, classes may be divided into multiple subclasses through detailed
modeling.
Class Diagram
4.3 Use case Diagram
Use case diagrams are commonly utilized to analyse the high-level requirements
of a system. These requirements are expressed as organized system functionalities
known as use cases. Actors are also important elements in use case diagrams, as they
are the ones who interact with the system. Actors can include human users, internal or
external applications. To create a concise use case diagram, it is necessary to identify
the functionalities to be represented, actors, and relationships among the actors and use
cases. An effective use case diagram must have a well-chosen name for each use case
that accurately reflects its functionality. It is also important to name actors appropriately
and to clearly show relationships and dependencies within the diagram. Not all types of
relationships need to be included in the diagram since its primary purpose is to identify
system requirements. Additionally, notes can be added to clarify essential points.
Use case Diagram
4.4 Sequence Diagram
A Sequence diagram is a type of interaction diagram that visualizes the order of
processes and their interactions. It presents the objects and classes involved in a scenario
and illustrates the messages exchanged between objects to execute the functionality.
The lifelines on a sequence diagram represent different objects or processes, while
horizontal arrows show the messages exchanged between them in chronological order.
Sequence Diagram
4.5 Activity Diagram
Activity diagrams are a useful tool for representing the workflow of business
processes or even class operations. These diagrams are similar to flowcharts in that they
model the flow of activities from one to another. The activity diagram toolbox provides
a range of tools that can be used to create such diagrams, including activities, decisions,
end states, objects, object flow, start states, states, swim lanes, synchronizations, and
transmissions.
Activity Diagram
5. WORKING PROCESS
5.1 Components of Quiz Craft
Our web project contains several components for user convenience and
satisfaction. The components are:
Login & Logout pages: Users can log in using their credentials and log out
when finished.
Registration page: New users must register before using the website.
User Dashboard: Each user has a personalized dashboard displaying generated
MCQs and other relevant information.
Admin Dashboard: Administrators have a personalized dashboard with
additional features such as the ability to add or delete questions and options,
view statistical data, and manage users.
Profile pages: Users and admins have separate profile pages to manage their
accounts and view their details.
5.2 Features
Availability: The website is accessible round the clock and can be reached from
any location worldwide, provided that the user has internet connectivity.
Responsive Design: The website incorporates a responsive design that enables
easy accessibility across various devices, such as desktop computers, laptops,
tablets, and mobile phones.
High Reliability: The website has a robust and reliable architecture that ensures
user data is secure and safe, even in case of power loss or internet disconnection.
Ease of Access: The website can be accessed in a simple and easy manner
through the help of the internet, with no additional software required.
Browser Compatibility: Any contemporary web browser, such as Google
Chrome, Mozilla Firefox, or Microsoft Edge, can be used to access the website.
Authentication and Authorization: The website provides robust
authentication and authorization features to ensure user data is secure.
User-friendly Interface: The website features a user-friendly interface that
simplifies the process of generating MCQs and navigating for users.
Multiple Input Options: Users can input text, files, and URLs as sources for
MCQ generation, providing flexibility and convenience.
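The multiple input options above imply a dispatch step that picks an extraction routine per source type. A plausible sketch follows; the routine names are just labels for the libraries listed in the appendix, not real function calls:

```python
from pathlib import Path

# Map file extensions to the extraction approach used for them.
EXTRACTORS = {
    ".pdf": "textract",
    ".docx": "textract",
    ".pptx": "textract",
    ".txt": "plain read",
    ".png": "pytesseract OCR",
    ".jpg": "pytesseract OCR",
}

def extraction_method(source):
    """Decide how to pull text out of a file path or URL."""
    if source.startswith(("http://", "https://")):
        return "fetch URL and strip HTML"
    return EXTRACTORS.get(Path(source).suffix.lower(), "unsupported format")

print(extraction_method("notes.pdf"))                     # textract
print(extraction_method("https://example.com/lesson"))   # fetch URL and strip HTML
```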
5.3 Working procedure
The working process refers to the procedure followed by our system, including
how the software functions and the steps involved when a user browses the URL.
Initially, all system users, such as students and examiners, are taught how to access and
use the software. During registration, users must provide their ID number, name, email,
phone number, and password, with all fields being validated to prevent entry of illegal
values. Once registered, user details are saved in the database, and login credentials are
sent to the registered email and phone number.
Once a student receives their credentials, they can log in to the system, which
provides authentication and authorization. Without registration, a student cannot access
any features of the system. Upon successful login, the student can see the dashboard,
which displays generated questions and previous session information. The dashboard
content is sourced from the database, and students can view and modify generated
questions according to their needs.
The profile page contains all the details of each student, and they can edit their
account details. A limited view of their details is also available on the top right of the
dashboard. Students can customize the theme to white, black, or transparent based on
their preference, and they can log out of the software using the logout button provided
in the bottom left of the website.
6. RESULTS AND OUTPUT SCREENS
6.3 Home Page
Fig 6.4.2: Insert URL page
6.5 Generated Questions Page
Fig 6.6.1: Add or Remove Users page
Fig 6.7.2: Add/Remove Questions and options
7. TESTING AND VALIDATION
7.1 Introduction
7.2.1 Unit Testing
Unit Testing is performed on individual modules once they are finished and able
to run. The testing is limited to the requirements of the designer.
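For example, a unit test for a small helper of the kind used in MCQ generation might look like the following; the blank_out function is a hypothetical unit, shown only to illustrate the testing style:

```python
import unittest

def blank_out(sentence, keyword):
    """Hypothetical unit under test: replace the keyword with a blank."""
    if keyword not in sentence:
        raise ValueError("keyword not found in sentence")
    return sentence.replace(keyword, "_____")

class TestBlankOut(unittest.TestCase):
    def test_replaces_keyword(self):
        self.assertEqual(blank_out("Flask is a web framework", "Flask"),
                         "_____ is a web framework")

    def test_missing_keyword_raises(self):
        with self.assertRaises(ValueError):
            blank_out("Flask is a web framework", "Django")

if __name__ == "__main__":
    unittest.main(argv=["blank_out_tests"], exit=False)
```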
7.2.2 Black Box Testing
The black box testing approach generates test cases that execute all the
functional requirements of the program as input conditions. This type of testing is useful
for detecting errors in missing or incorrect functions, interface errors, data structure
errors, external database access errors, performance errors, and
initialization/termination errors. However, in this type of testing, only the output is
checked for accuracy, and the logical flow of the data is not examined.
7.2.3 White Box Testing
White Box Testing involves generating test cases based on the internal logic of each
module, typically through the creation of flow diagrams. The purpose is to test all
logical decisions on both their true and false sides, guarantee that all independent paths
have been executed, execute all loops at their boundaries and within their operational
bounds, and ensure the validity of internal data structures.
7.3 Validation
The successful implementation of the system verifies that all the requirements
stated in the software requirements specification have been met. If incorrect input is
entered, the system displays the appropriate error message.
Test Scenario-1: Login
Test Scenario-3: Dashboard
S.No | Test Case | Expected Result | Actual Result | Status
1 | Enter valid text in the text area and click on Generate MCQ | MCQs are generated and displayed on the screen | MCQs are generated and displayed on the screen | Passed
2 | Enter invalid text in the text area and click on Generate MCQ | No MCQs are generated | No MCQs are generated | Passed
3 | Update details with a username that does not already exist | Updated successfully | Updated successfully | Passed
8. CONCLUSION
9. REFERENCES
APPENDIX
The project entitled “Master MCQ's using Textual Data” solves some of the issues
faced by examiners in conducting examinations. Thus, we look forward to
implementing this project.
PRE-REQUISITES:
pip install scipy: This installs the scipy library that can be used for a wide range
of scientific and technical computing tasks.
pip install pytesseract: This installs the pytesseract library, which provides an
interface for using the Tesseract OCR engine to recognize text from images.
pip install textract: This installs the textract library, which is a Python wrapper
for extracting text from various file formats, such as PDF, DOCX, and images.
tesseract-ocr: Note that the Tesseract OCR engine is a system package rather than
a Python library; it is installed with the operating system's package manager (for
example, sudo apt install tesseract-ocr on Ubuntu) and is required by pytesseract
to recognize text from images.
pip install gTTS: This installs the gTTS library, which provides an interface for
using Google Text-to-Speech API to convert text to speech.
pip install playsound: This installs the playsound library, which allows you to
play sound files from Python.
pip install pygobject: This installs the pygobject library, which provides a
Python wrapper for the GObject library. This is used to create graphical user
interfaces in Python.
pip install flashtext: This installs the flashtext library, which provides a fast and
flexible way to replace words or phrases in text.
pip install git+https://github.com/boudinfl/pke.git: This installs the pke
(Python Keyphrase Extraction) library from its GitHub repository. pke is a
Python library for extracting keywords and keyphrases from text.
pip install transformers: This installs the transformers library, which is a
Python library for natural language processing (NLP). transformers provides a
wide range of pre-trained models for tasks such as text classification, question
answering, and language generation.
pip install sentencepiece: This installs the sentencepiece library, which is a
Python library for subword text tokenization. sentencepiece can be used to split
text into smaller units for NLP tasks such as machine translation.
pip install textwrap3: This installs the textwrap3 library, which provides
improved text wrapping functionality compared to the built-in textwrap module.
pip install strsim: This installs the strsim library, which provides functions for
computing string similarity metrics such as the Levenshtein distance.
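As an illustration of what this metric measures (a plain-Python sketch, not the strsim implementation itself), the normalized Levenshtein similarity is one minus the edit distance divided by the longer string's length:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[len(b)]

def normalized_similarity(a, b):
    # 1.0 for identical strings, approaching 0.0 for unrelated ones
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))                       # 3
print(round(normalized_similarity("kitten", "sitting"), 3))   # 0.571
```

The project uses this similarity to reject distractor candidates that are nearly identical to the answer word.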
pip install sense2vec: This installs the sense2vec library, which provides pre-
trained word embeddings based on the spaCy library.
pip install sentence-transformers: This installs the sentence-transformers
library, which provides pre-trained models for generating vector embeddings of
sentences or paragraphs. These embeddings can be used for various NLP tasks
such as sentence similarity and clustering.
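The embeddings produced by sentence-transformers are compared later in the source code with cosine similarity (via sklearn's cosine_similarity). The metric itself reduces to a dot product over vector norms; a minimal standard-library sketch:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two vectors: 1.0 means same direction,
    # 0.0 means orthogonal (unrelated)
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

print(cosine([1.0, 0.0], [0.0, 1.0]))            # 0.0
print(round(cosine([1.0, 2.0], [2.0, 4.0]), 6))  # 1.0
```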
nltk.download('punkt'): This downloads the punkt package from the Natural
Language Toolkit (NLTK), which provides functions for tokenizing text into
sentences or words. This package is often used in NLP tasks for text
preprocessing.
nltk.download('brown'): This downloads the Brown Corpus, which is a
collection of text samples used for linguistic research. The Brown Corpus is often
used in NLP tasks for training language models and evaluating text processing
algorithms.
nltk.download('wordnet'): This downloads the WordNet database, which is a
large lexical database of English words organized by semantic relationships.
WordNet is often used in NLP tasks for tasks such as word sense disambiguation
and semantic similarity computation.
nltk.download('stopwords'): This downloads a list of stop words from NLTK,
which are commonly used words that are often removed from text during
preprocessing. Stop words include words like "a", "an", "the", "and", "but", etc.
and are often removed because they do not carry significant meaning for NLP
tasks.
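In outline, stop-word filtering works as below. A tiny hardcoded list stands in for NLTK's much larger English list, so this is an illustration rather than the project's actual preprocessing:

```python
# Small stand-in for nltk.corpus.stopwords.words('english')
STOPWORDS = {"a", "an", "the", "and", "but", "or", "is", "are", "of", "to"}

def remove_stopwords(text):
    # Lower-case, split on whitespace, and drop the stop words
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

print(remove_stopwords("The quick brown fox and the lazy dog"))
# ['quick', 'brown', 'fox', 'lazy', 'dog']
```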
The sense2vec vectors are downloaded and unpacked with the wget and tar command-line tools (these are system utilities, not pip packages).
Navigate to the directory where you want to download the file and run the following
command to download it:
wget https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz
Once the download is complete, run the following command to extract the contents of
the archive:
tar -xvf s2v_reddit_2015_md.tar.gz
This extracts the pretrained vectors from the archive; in this release the archive
unpacks to a directory named s2v_old, which is the path the source code loads.
wget is a command-line tool for downloading files from the internet. We use it to
download the sense2vec model archive from the official GitHub releases page.
Once the file is downloaded, we extract its contents using tar, a command-line tool
for working with tar archives; the tar -xvf command unpacks the archive into the
current directory.
Overall, this process is necessary to obtain the sense2vec model, a set of pre-trained
sense-tagged word vectors used to generate distractor options for the multiple-choice
questions.
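Each sense2vec entry is keyed by a token paired with its part-of-speech or entity tag, e.g. "natural_language|NOUN"; the source code later splits such keys to keep only distractors with the same sense as the answer. The key format itself is simple string manipulation (a sketch only; real keys come from Sense2Vec's get_best_sense and most_similar):

```python
def parse_sense_key(key):
    # Split a sense2vec key like 'natural_language|NOUN' into a
    # human-readable term and its sense tag
    term, sense = key.split('|')
    return term.replace('_', ' ').title().strip(), sense

print(parse_sense_key("natural_language|NOUN"))  # ('Natural Language', 'NOUN')
```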
Source Code:
mcq_generation_flask.py
from flask import Flask, render_template, request
import textract
import os
from code_5 import generate_question, convert_to_text

app = Flask(__name__)

@app.route('/')
def home():
    return render_template('new.html')

@app.route('/submit', methods=['POST'])
def submit():
    text = request.form['text']
    output = generate_question(text, "Sense2vec")
    return 'Text submitted: {}'.format(output)

@app.route('/submit2', methods=['POST'])
def submit2():
    file = request.files["file"]
    file.save(os.path.join(file.filename))
    return 'Text submitted: {}'.format("Success")

@app.route('/submit3', methods=['POST'])
def submit3():
    url = request.form['url']
    text = convert_to_text(url)
    output = generate_question(text, "Sense2vec")
    return 'Text submitted: {}'.format(output)

if __name__ == '__main__':
    app.run()
Maincode.py
import string
import random
import traceback
import numpy as np
import torch
import nltk
import pke
import requests
import docx2txt
import pptx
import pytesseract
from PIL import Image
from io import StringIO
from collections import OrderedDict
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from bs4 import BeautifulSoup
from textwrap3 import wrap
from transformers import T5ForConditionalGeneration, T5Tokenizer
from nltk.tokenize import sent_tokenize
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from flashtext import KeywordProcessor
from sense2vec import Sense2Vec
from sentence_transformers import SentenceTransformer
from similarity.normalized_levenshtein import NormalizedLevenshtein
from sklearn.metrics.pairwise import cosine_similarity
from gtts import gTTS

# Load the pretrained sense2vec vectors extracted earlier
s2v = Sense2Vec().from_disk('s2v_old')
normalized_levenshtein = NormalizedLevenshtein()
def convert_to_text(file_path):
    # (The leading branches of this function, which dispatch image, PDF
    # and Word inputs to the converters below, are cut off by a page
    # break in the report; only the trailing branches survive.)
    if file_path.endswith(('.ppt', '.pptx')):
        return convert_ppt_to_text(file_path)
    elif file_path.startswith('https:'):
        return convert_url_to_text(file_path)
    elif file_path.startswith('http:'):
        return convert_url_to_text(file_path)
    else:
        return file_path  # assume the input is already plain text

# (The body of generate_question(), which builds the MCQ string in
# `output` and returns it, is likewise truncated in the report.)
def convert_url_to_text(file_path):
    try:
        page = requests.get(file_path)
        if page.status_code == 200:
            soup = BeautifulSoup(page.content, "html.parser")
            # Drop script/style nodes before extracting visible text
            for script in soup(["script", "style"]):
                script.decompose()
            text = soup.get_text()
            return text.strip()
        else:
            return None
    except Exception:
        return None
def convert_image_to_text(file_path):
    try:
        # Open the image file and extract the text using OCR
        image = Image.open(file_path)
        text = pytesseract.image_to_string(image)
    except Exception as e:
        print(f"Error processing image file: {e}")
        text = None
    return text
def convert_pdf_to_text(file_path):
    try:
        # Extract the text from the PDF file
        resource_manager = PDFResourceManager()
        file_stream = StringIO()
        converter = TextConverter(resource_manager, file_stream)
        interpreter = PDFPageInterpreter(resource_manager, converter)
        with open(file_path, 'rb') as file:
            for page in PDFPage.get_pages(file, caching=True,
                                          check_extractable=True):
                interpreter.process_page(page)
        text = file_stream.getvalue()
    except Exception as e:
        print(f"Error processing PDF file: {e}")
        text = None
    return text
def convert_doc_to_text(file_path):
    try:
        # Extract the text from the Word document
        text = docx2txt.process(file_path)
    except Exception as e:
        print(f"Error processing Word file: {e}")
        text = None
    return text
def convert_ppt_to_text(file_path):
    try:
        # Extract the text from the PowerPoint presentation
        presentation = pptx.Presentation(file_path)
        text = ''
        for slide in presentation.slides:
            for shape in slide.shapes:
                if hasattr(shape, 'text'):
                    text += shape.text
    except Exception as e:
        print(f"Error processing PowerPoint file: {e}")
        text = None
    return text
def get_nouns_multipartite(content):
    out = []
    try:
        extractor = pke.unsupervised.MultipartiteRank()
        extractor.load_document(input=content, language='en')
        # Keep only nouns and proper nouns as keyphrase candidates
        pos = {'PROPN', 'NOUN'}
        stoplist = list(string.punctuation)
        stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
        stoplist += stopwords.words('english')
        extractor.candidate_selection(pos=pos)
        extractor.candidate_weighting(alpha=1.1,
                                      threshold=0.75,
                                      method='average')
        keyphrases = extractor.get_n_best(n=15)
        # (The loop collecting keyphrases into `out` and the except clause
        # are missing in the report as printed.)
        for val in keyphrases:
            out.append(val[0])
    except Exception:
        traceback.print_exc()
    return out
def get_keywords(originaltext, summarytext):
    keywords = get_nouns_multipartite(originaltext)
    keyword_processor = KeywordProcessor()
    for keyword in keywords:
        keyword_processor.add_keyword(keyword)
    keywords_found = keyword_processor.extract_keywords(summarytext)
    keywords_found = list(set(keywords_found))
    important_keywords = []
    for keyword in keywords:
        if keyword in keywords_found:
            important_keywords.append(keyword)
    # Keep at most four keywords as MCQ answer candidates
    return important_keywords[:4]

# (Fragments of the truncated main flow:)
# imp_keywords = get_keywords(text, summarized_text)
# question_model = question_model.to(device)
def get_question(context, answer, model, tokenizer):
    text = "context: {} answer: {}".format(context, answer)
    encoding = tokenizer.encode_plus(text, max_length=384,
                                     pad_to_max_length=False,
                                     truncation=True,
                                     return_tensors="pt").to(device)
    input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]
    outs = model.generate(input_ids=input_ids,
                          attention_mask=attention_mask,
                          early_stopping=True,
                          num_beams=5,
                          num_return_sequences=1,
                          no_repeat_ngram_size=2,
                          max_length=72)
    # Decode the generated ids back to text (this step is missing in the
    # report as printed)
    dec = [tokenizer.decode(ids, skip_special_tokens=True) for ids in outs]
    Question = dec[0].replace("question:", "")
    Question = Question.strip()
    return Question
# (From the truncated main flow:)
# ques = get_question(summarized_text, answer, question_model, question_tokenizer)

sentence_transformer_model = SentenceTransformer('msmarco-distilbert-base-v3')
def filter_same_sense_words(original, wordlist):
    filtered_words = []
    base_sense = original.split('|')[1]
    print(base_sense)
    for eachword in wordlist:
        if eachword[0].split('|')[1] == base_sense:
            filtered_words.append(
                eachword[0].split('|')[0].replace("_", " ").title().strip())
    return filtered_words
def get_highest_similarity_score(wordlist, wrd):
    score = []
    for each in wordlist:
        score.append(normalized_levenshtein.similarity(each.lower(), wrd.lower()))
    return max(score)
def sense2vec_get_words(word, s2v, topn, question):
    output = []
    print("word ", word)
    try:
        sense = s2v.get_best_sense(word, senses=["NOUN", "PERSON", "PRODUCT",
                                                 "LOC", "ORG", "EVENT", "NORP",
                                                 "WORK OF ART", "FAC", "GPE",
                                                 "NUM", "FACILITY"])
        most_similar = s2v.most_similar(sense, n=topn)
        output = filter_same_sense_words(sense, most_similar)
    except Exception:
        output = []
    threshold = 0.6
    final = [word]
    if question is None:
        return []
    checklist = question.split()
    # Keep only candidates that are sufficiently different from the answer
    # and do not already appear in the question
    for x in output:
        if get_highest_similarity_score(final, x) < threshold and x not in final and x not in checklist:
            final.append(x)
    return final[1:]
# (The definition line and selection loop of this maximal-marginal-
# relevance helper are truncated in the report; the missing pieces are
# reconstructed here from the standard MMR routine the surviving
# fragments belong to.)
def mmr(doc_embedding, word_embeddings, words, top_n, lambda_param):
    word_doc_similarity = cosine_similarity(word_embeddings, doc_embedding)
    word_similarity = cosine_similarity(word_embeddings)
    keywords_idx = [np.argmax(word_doc_similarity)]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]
    for _ in range(top_n - 1):
        candidate_similarities = word_doc_similarity[candidates_idx, :]
        target_similarities = np.max(word_similarity[candidates_idx][:, keywords_idx], axis=1)
        # Trade off relevance to the document against redundancy with the
        # keywords already selected
        mmr_scores = (lambda_param * candidate_similarities -
                      (1 - lambda_param) * target_similarities.reshape(-1, 1))
        mmr_idx = candidates_idx[np.argmax(mmr_scores)]
        keywords_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)
    return [words[idx] for idx in keywords_idx]
def get_distractors_wordnet(word):
    distractors = []
    try:
        syn = wn.synsets(word, 'n')[0]
        word = word.lower()
        orig_word = word
        if len(word.split()) > 0:
            word = word.replace(" ", "_")
        hypernym = syn.hypernyms()
        if len(hypernym) == 0:
            return distractors
        # Sibling terms (co-hyponyms) of the answer serve as distractors
        for item in hypernym[0].hyponyms():
            name = item.lemmas()[0].name()
            if name == orig_word:
                continue
            name = name.replace("_", " ")
            name = " ".join(w.capitalize() for w in name.split())
            if name is not None and name not in distractors:
                distractors.append(name)
    except Exception:
        print("Wordnet distractors not found")
    return distractors
def get_distractors(word, origsentence, sense2vecmodel, sentencemodel, top_n, lambdaval):
    distractors = sense2vec_get_words(word, sense2vecmodel, top_n, origsentence)
    print("distractors ", distractors)
    if len(distractors) == 0:
        return distractors
    distractors_new = [word.capitalize()]
    distractors_new.extend(distractors)
    # Rank the candidate distractors with MMR against the question sentence
    embedding_sentence = origsentence + " " + word.capitalize()
    keyword_embedding = sentencemodel.encode([embedding_sentence])
    distractor_embeddings = sentencemodel.encode(distractors_new)
    max_keywords = min(len(distractors_new), 5)
    filtered_keywords = mmr(keyword_embedding, distractor_embeddings,
                            distractors_new, max_keywords, lambdaval)
    final = [word.capitalize()]
    for wrd in filtered_keywords:
        if wrd.lower() != word.lower():
            final.append(wrd.capitalize())
    final = final[1:]
    return final