BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE & ENGINEERING (AI & ML)
Submitted By
Mr. Chundi Kousik (20KT1A4215)
Ms. Chipilla Kavya Sravani (20KT1A4214)
Mr. Pagadala Tharaka Subbareddy (20KT1A4237)
2020-2024
POTTI SRIRAMULU CHALAVADI MALLIKHARJUNARAO
COLLEGE OF ENGINEERING & TECHNOLOGY KOTHAPET,
VIJAYAWADA-520001.
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING (AI & ML)
CERTIFICATE
This is to certify that the project work entitled “Student Answer Evaluation Using LLMs”
is a bonafide work carried out by Chundi Kousik (20KT1A4215), Chipilla Kavya
Sravani (20KT1A4214), and Pagadala Tharaka Subbareddy (20KT1A4237) in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology in COMPUTER SCIENCE &
ENGINEERING (AI & ML) of Jawaharlal Nehru Technological University, Kakinada, during
the year 2023-2024. It is certified that all corrections/suggestions indicated for internal
assessment have been incorporated in the report. The project report has been approved as it
satisfies the academic requirements in respect of project work prescribed for the above degree.
External Examiner
ACKNOWLEDGEMENT
We owe our thanks to the many people who helped, supported, and guided us at every step.
We are grateful for the support of our principal, Dr. J. Lakshmi Narayana, who inspired us
with his words of dedication and discipline towards work. We express our gratitude to
Mrs. N. V. Maha Lakshmi, HOD of AIML, for extending her support through technical and
motivational classes, which were a major source of help in carrying out our project. We are
very thankful to Mrs. N. V. Maha Lakshmi, Associate Professor and guide of our project, for
guiding us and correcting our documents with attention and care. She took the pains to go
through the project and make the necessary corrections as and when needed. Finally, we thank
one and all who directly or indirectly helped us to complete our project successfully.
Project Associates
This is to declare that the project entitled “Student Answer Evaluation Using LLMs”
submitted by us in partial fulfillment of the requirements for the award of the degree of Bachelor
of Technology in COMPUTER SCIENCE & ENGINEERING (AI & ML) at Potti
Sriramulu Chalavadi Mallikharjuna Rao College of Engineering and Technology, is a
bonafide record of the project work carried out by us under the guidance of Mrs. N. V. Maha
Lakshmi, Associate Professor. To the best of our knowledge, this work has not been submitted to any
other institute or university for any other degree.
Project Associates
In educational assessment, the need for accurate and insightful evaluation of student responses
is paramount. This project introduces a novel approach by leveraging Large Language Models
(LLMs) to enhance the assessment process. Unlike conventional methods that rely on
predefined criteria or human judgment, this system harnesses the power of LLMs to compare
student answers with model-generated ideal responses. At the heart of this methodology lies the
ability of LLMs to understand language semantics deeply, enabling them to generate coherent
and contextually appropriate responses. By employing this capability, the system facilitates a
dynamic evaluation framework that transcends the limitations of traditional grading
approaches.
Central to this paradigm shift is the emphasis on semantic similarity between student responses
and ideal answers. Through sophisticated computational analysis, the system provides objective
and adaptive assessment, accommodating diverse responses and educational contexts.
Moreover, it offers granular feedback to students, pinpointing specific areas for improvement
in their responses. This innovative approach not only promises to elevate the standard of
educational assessment but also fosters a more equitable and insightful evaluation of student
learning, paving the way for enhanced pedagogical practices.
1. Introduction
1.1.1 Scope
1.1.2 Purpose
2. System Analysis
2.4 Methodologies
3. System Design
3.4 Dataset
4. System Implementation
4.2 Code
4.3 Results
5. Testing
6. Conclusion
7. Future Work
8. References
9. Bibliography
10. Appendix
10.4 NLP
10.5 AI & ML
10.6 LLMs
LIST OF FIGURES
Figure 3.7: Level-0 Data Flow Diagram
Figure 3.8: Level-1 Data Flow Diagram
1. INTRODUCTION
1.1.1 Scope
1.1.2 Purpose
The purpose of this project is to enhance the precision, speed, and scalability of student answer
evaluation processes in educational settings. By leveraging advancements in natural language
processing and information retrieval, the methodology aims to provide more efficient and
reliable means of assessing student performance. The project seeks to offer educators actionable
insights into student performance while also contributing to the broader transformation of
automated assessment practices within educational contexts.
The primary objective of this study is to develop and implement an innovative methodology for
evaluating student responses using advanced natural language processing techniques and
information retrieval systems. The study aims to analyze the effectiveness of this approach in
enhancing the precision, speed, and scalability of assessment processes compared to traditional
methods. Additionally, the study explores the potential implications of integrating technology,
particularly LLMs, in educational assessment practices, with a focus on its applicability across
various subjects and question formats.
In [2], Hossam Magdy Balaha, a researcher at Mansoura University in Egypt, led a study that
introduced an Automatic Exam Correction Framework (AECF) designed for a variety of
question forms, such as equations, essays, and multiple-choice questions (MCQs). The system
responds to the growing need for automated grading, which is particularly relevant in the
context of online learning. At the heart of their project is a five-layered technique aimed at
simplifying the grading process. 'HMB-MMS-EMA', a novel equation similarity checker
algorithm, is a key component of their methodology. A specific dataset called 'HMB-EMD-v1'
was created to support this method and make expression matching tasks easier. The paper uses
Python tools such as Gensim, SpaCy, and NLTK to analyze various approaches for translating
textual input into numerical representations.
In [3], Vedant Bahel describes an automated evaluation system that is intended to evaluate
descriptive responses on test questions. Their suggested method relies on automating the
assessment process through the use of Natural Language Processing (NLP) and Data Mining
techniques, providing a solution to the time-consuming and labor-intensive operation of grading
such answers. The core of their research involves the application of Siamese Manhattan LSTM
(MaLSTM) for text similarity analysis, taking into account variables such as response length,
syntax, language proficiency, and correctness of answers. The study highlights the effectiveness
and institutional sustainability of their method by drawing contrasts with assignments that are
manually assessed. It does, however, recognize its limitations, especially when assessing
responses that include figures, diagrams, equations, or numerical data, indicating potential
directions for future work.
In [4], Neslihan Suzen, Alexander N. Gorban, Jeremy Levesley, and Evgeny M. Mirkes, writing
for the University of Leicester in the United Kingdom and Lobachevsky University in Russia,
explore automatic grading and feedback mechanisms for short answer questions, with a focus
on the UK GCSE system. Using data from a University of North Texas introductory computer
science course, the study applies conventional data mining techniques to compare student
responses with model answers, concentrating on commonly used words. Furthermore, the study
investigates clustering techniques for assigning grades and providing feedback to students
effectively. Most importantly, the research promotes computational methods to improve scoring
reliability rather than to replace human scoring, placing the work at the cutting edge of
instructional technology.
In [5], Dr. A. Mercy Rani, an assistant professor at Sri S. Ramasamy Naidu Memorial College
in Sattur, India, presents "Automated Explanatory Answer Evaluation Using Machine Learning
Approach," which describes a novel method for assessing explanatory answers using a machine
learning paradigm. The report, published in July 2021 in the Design Engineering journal,
addresses the urgent need for effective online assessment techniques, made more apparent by
the pandemic-related shift to digital schooling. The suggested approach extracts keywords from
student responses using Natural Language Processing (NLP) techniques, compares them with
an answer key, and uses Cosine Similarity as a grading metric. It highlights the benefits of
online assessments and the necessity of automated assessment systems in the digital
environment.
In [6], Steven Burrows, in "The Eras and Trends of Automatic Short Answer Grading," provides
a detailed analysis of Automatic Short Answer Grading (ASAG). The research examines the
complex procedure of assessing succinct natural language responses using computer-based
methodologies, and it was published in the International Journal of Artificial Intelligence in
Education in 2015. The authors find five temporal patterns that signify important
methodological advances through a historical investigation of 35 ASAG systems. They also
examine six common dimensions, providing an extensive synopsis of the ASAG environment.
In its conclusion, the study marks a moment of consolidation in the field by identifying an era
of evaluation as the most recent trend in ASAG research.
In [7], Jinzhu Luo examines Automatic Short Answer Grading (ASAG) using deep learning
methods, with a particular emphasis on the Sentence-BERT model. Against the backdrop of
online learning, the work tackles the ongoing difficulties in grading short-answer questions
and proposes a model that outperforms conventional techniques in terms of accuracy and
efficiency. The thesis, submitted in fulfillment of the requirements of a Master of Science
degree, analyzes and contrasts the Sentence-BERT model's performance with that of the
original BERT model, examining different task functions and evaluating the impact of answer
length on grading efficacy. Notable advances in accuracy measures such as the Macro F1 score
and the Weighted F1 score are explained, with a focus on the benefits of shorter replies.
In [8], Md. Motiur Rahman explores an NLP-based Automatic Answer Script Evaluation
system. This approach uses a multidimensional technique with the goal of accelerating the
evaluation process while minimizing problems such as evaluator bias and the time-consuming
nature of manual grading. Text is extracted from answer scripts and summarized, a variety of
similarity metrics are used to determine how closely student responses match the correct
answers, and finally points are assigned. The study investigates the effectiveness of the
suggested evaluation framework by utilizing four distinct similarity metrics—Cosine, Jaccard,
Bigram, and Synonym—as well as keyword-based summarization. Promising experimental
results demonstrate the effectiveness of the automated evaluation system.
In [9], Rick Somers, Samuel Cunningham-Nelson, and Wageeh Boles apply natural language
processing to automatically assess student conceptual understanding from textual responses,
supporting educators in evaluating open-ended student answers.
In [10], an open-access article titled "Applying large language models and chain-of-thought
for automatic scoring" by Gyeong-Geon Lee and colleagues explores the use of GPT-3.5 and
GPT-4 in conjunction with Chain-of-Thought (CoT) prompting to automatically score
student responses in science assessments. The study, which was published in Computers and
Education: Artificial Intelligence, aims to address issues with accessibility, technical
complexity, and the lack of explainability that arises with AI-based scoring systems. Using six
prompt engineering strategies to experiment on a dataset of 1,650 student responses, the study
highlights the advantages of few-shot learning over zero-shot learning and the significant
improvement in scoring accuracy when CoT is combined with item stems and scoring rubrics.
Additionally, the study explores how Large Language Models (LLMs) can provide explainable scoring decisions.
The evaluation of student answers in educational settings presents a significant challenge, often
requiring manual assessment by instructors. This process can be time-consuming and
subjective, leading to inconsistencies in grading. Automated methods for evaluating student
answers are desirable to streamline the assessment process and provide more objective
feedback. However, existing automated systems often lack the ability to accurately assess
student responses in diverse contexts and subject areas. This project aims to address these
limitations by developing a system that leverages question-answer retrieval and chunking
techniques to generate actual answers for comparison with student responses, enabling more
efficient and consistent evaluation of student performance.
• Original Answer Generation:
The system retrieves relevant content chunks and uses LLMs to ensure accuracy
and relevance in delivering original answers based on the content extracted from the chunks,
enhancing the learning and retrieval process in ML and AI studies.
• Student Answer Input:
To facilitate comprehensive evaluation, the system incorporates a module for receiving student
answers as input. This functionality allows seamless interaction with learners, enabling them to
provide their responses to questions related to machine learning (ML) topics. Upon submission
of a student's answer, the system initiates a robust evaluation process to assess its accuracy and
relevance.
• Comparison and Evaluation:
During the comparison and evaluation process, the system meticulously assesses the student's
answer against the actual response, prioritizing semantic coherence over mere word matching.
It employs objective criteria, penalizing factual inaccuracies with a score of 0% while
maintaining fairness and consistency. Furthermore, the system is adept at handling edge cases
like students repeating questions as answers. Ultimately, a percentage score is calculated,
reflecting the degree of alignment between the student's response and the expected answer, thus
providing insightful feedback on the student's comprehension and performance.
• Percentage Calculation:
Calculate a percentage score representing the similarity or correctness of the student answer
compared to the actual answer. The percentage score can be based on various factors, such as
the number of matching words, semantic similarity, or syntactic structure similarity.
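As an illustrative sketch of the semantic-similarity factor (not the exact implementation used in this project; the embedding model name is an assumption), a percentage can be derived from sentence embeddings and cosine similarity as follows:

from sentence_transformers import SentenceTransformer, util

def similarity_percentage(student_answer, actual_answer):
    # Encode both answers into dense sentence embeddings
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    emb = model.encode([student_answer, actual_answer])
    # Cosine similarity lies in [-1, 1]; clip negatives and scale to 0-100%
    score = float(util.cos_sim(emb[0], emb[1]))
    return round(max(score, 0.0) * 100, 2)

# A paraphrased answer scores high; an unrelated answer scores low
print(similarity_percentage("Overfitting means the model memorises the training data",
                            "Overfitting occurs when a model fits the training data too closely"))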
• Output:
The system analyzes the student's answer alongside the original answer using Large Language
Models (LLMs). These LLMs go beyond simple keyword matching and consider the meaning
behind the words. The result is a percentage score (0% to 100%) reflecting how closely the
student's answer aligns with the expected response. This score takes into account both semantic similarity
and factual accuracy.
2.SYSTEM ANALYSIS
The system analysis for the Answer Evaluating project involves a deep dive into the manual
evaluation processes currently in place, aiming to understand instructor workflows and identify
inefficiencies. This analysis assesses the feasibility of integrating Large Language Models
(LLMs) like Llama and Mistral into the evaluation process, considering operational, technical, and
behavioral factors. Additionally, stakeholder analysis gauges the needs of instructors, students,
and administrators. Ultimately, this phase aims to inform the development of an automated
evaluation system that utilizes LLMs to enhance accuracy, efficiency, and effectiveness in
assessing student answers.
The feasibility study investigates the viability of integrating Large Language Models (LLMs)
like Llama and Mistral into the student answer evaluation process. It assesses scalability, resource
requirements, integration complexity, and ethical considerations to determine practicality and
effectiveness. Through rigorous analysis, the study aims to provide insights into the feasibility
of leveraging LLMs for enhanced student assessment methods. Its findings will inform
decision-making regarding the implementation and integration of LLMs in educational settings.
Operational feasibility assesses the practicality of integrating Large Language Models (LLMs)
like Llama and Mistral into existing educational workflows. It examines whether the system can
seamlessly fit into teachers' and administrators' daily tasks without significant disruption. This
evaluation considers factors such as user training requirements, workflow adjustments, and the
overall impact on productivity and efficiency. Ultimately, operational feasibility aims to
determine whether implementing LLMs for student answer evaluation is operationally viable
within the educational context.
Technical feasibility scrutinizes the integration of Large Language Models (LLMs) such as
Llama and Mistral within the technological infrastructure of educational institutions. It evaluates
the compatibility of LLMs with existing systems and software, ensuring seamless integration
without compromising performance. Assessing hardware and software requirements, this
analysis delves into server capabilities, computational resources, and any necessary software
updates or modifications. Moreover, it investigates the scalability of the system to
accommodate varying workloads and user demands. By addressing these technical aspects
comprehensively, the feasibility study aims to determine the readiness and viability of
incorporating LLMs into the student answer evaluation process.
Behavioral feasibility examines the user acceptance and interaction dynamics of the Smart
Evaluator project. It focuses on creating a user-friendly interface that facilitates intuitive
interaction for both teachers and students. Through features like presenting original answers
alongside student responses and providing percentage scores, the system aims to enhance
engagement and comprehension. By prioritizing clarity and ease of use, the project seeks to
promote user satisfaction and adoption. This evaluation ensures that the system aligns with user
expectations and effectively supports the student evaluation process.
Utilizing entirely open-source resources such as Google Colab and LLMs like Llama and
Mistral ensures a cost-effective approach to implementing the project. Leveraging these free
and accessible tools minimizes initial investment and ongoing maintenance costs, enhancing
long-term sustainability. Integrating Streamlit for the web interface further contributes to
affordability, as it offers a user-friendly platform for development without additional licensing
fees. By embracing open-source solutions, the project not only maximizes cost-efficiency but
also promotes transparency, collaboration, and community-driven innovation.
The System Requirements Specification (SRS) outlines the essential criteria for developing the
automated student answer evaluation system. It delineates the functionalities necessary to
retrieve original answers, compare them with student responses, and report similarity scores,
aiming to provide valuable insights into student performance. Approval of this document
signifies acknowledgment and agreement that the resultant system, fulfilling these stipulated
requirements, will be deemed acceptable for implementation.
This document serves as a guiding framework to ensure the development of a robust and
effective solution aligned with the project's objectives and user needs.
The functional requirements of the system delineate its core capabilities, focused on
streamlining the evaluation process within educational settings. Firstly, it encompasses
retrieving answers from educational materials based on provided questions, ensuring relevance
and accuracy. Following this, the system segments responses into meaningful units, facilitating
comprehensive analysis. Utilizing Large Language Models (LLMs) like Llama and Mistral,
it generates actual answers, providing a benchmark for comparison. Through robust algorithms,
student responses are evaluated for correctness, and similarity percentages are calculated,
offering quantitative feedback. Furthermore, the system automates report generation for both
instructors and students, expediting the feedback loop and enhancing educational outcomes.
These features collectively ensure efficient assessment and feedback delivery, enriching the
teaching and learning experience.
• Supportability: By prioritizing supportability, the system aims to minimize compatibility
issues and facilitate smooth deployment and maintenance processes.
• Flexibility: Providing the flexibility to adapt and extend functionality is crucial for
accommodating evolving requirements and user needs over time. This includes designing
the system with modular architecture and well-defined APIs, enabling the integration of
new features or modules without disrupting existing components. By prioritizing flexibility,
the system aims to future-proof itself and remain adaptable to changing educational trends
and technologies.
The System Requirements Specification (SRS) document for the Answer Evaluation project
serves as a comprehensive blueprint outlining the essential features and characteristics of the
automated student answer evaluation system. It meticulously defines the functionality, usability,
reliability, performance, supportability, and flexibility required to achieve project objectives
effectively. Approval of the SRS signifies a consensus that the developed system must strictly
adhere to these specified requirements to attain acceptance. As a guiding framework for
development, this document ensures alignment with project goals and stakeholder expectations,
fostering clarity and accountability throughout the development lifecycle. By delineating clear
parameters and standards, the SRS lays the foundation for the creation of a robust and efficient
solution tailored to the needs of educators and students.
2.4 METHODOLOGIES
Employing advanced large language models like BERT, GPT, LLaMA, and Mistral, we seek to
revolutionize student answer evaluation by focusing on semantic comprehension over simple
word or sentence matching. We begin with data collection and preprocessing of student answers
across various subjects. Contextual embeddings are then extracted using pre-trained language
models to capture nuanced semantic meaning. Semantic similarity is calculated through
methods like cosine similarity or Euclidean distance for more nuanced evaluation. A scoring
mechanism and grading rubric are devised based on similarity scores to categorize answers by
proficiency. Through iterative refinement and optimization, including model training and fine-
tuning, we aim to create a robust evaluation system. Integration into an automated platform
with an educator-friendly interface ensures practicality. Our methodology will undergo rigorous
evaluation and comparison against traditional approaches, demonstrating its efficacy in
assessing student answers based on meaning.
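As a minimal sketch of the scoring mechanism described above (the band boundaries are assumptions for illustration, not the project's actual rubric), similarity scores can be mapped to proficiency categories as follows:

def grade_from_similarity(similarity_pct):
    # Map a similarity percentage to an assumed proficiency band
    bands = [(85, "Excellent"), (70, "Good"), (50, "Satisfactory"), (0, "Needs improvement")]
    for threshold, label in bands:
        if similarity_pct >= threshold:
            return label

print(grade_from_similarity(78))   # -> "Good"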
• Rubrics: Rubrics serve as invaluable scoring guides, delineating specific criteria for
evaluating student responses across various assignments. By offering a structured
framework, they facilitate consistent and fair assessment practices, aligning with
educational objectives. Traditionally paper-based, these rubrics require instructors to
manually apply scores to each criterion, which can be time-consuming and prone to
subjectivity. However, they remain essential tools for providing transparent feedback and
guiding students towards achieving learning outcomes. As technology evolves, digital
rubric platforms emerge, offering efficiencies in grading and enhancing collaboration
among educators. Transitioning to digital formats promises to streamline assessment
processes while maintaining the integrity and effectiveness of rubric-based evaluation.
3.SYSTEM DESIGN
Elements of a System:
• Data - An ML textbook forms the core training material, enriching the model's
understanding of machine learning concepts. Supplementing this are question and answer
datasets specific to machine learning, further enhancing the system's knowledge base. By
combining these datasets, the system can generate accurate responses and evaluate student
answers effectively. This diversity ensures the system's proficiency in handling a broad
spectrum of queries and tasks related to student answer evaluation.
In initializing the design definition, the plan outlines the development of an automated
educational assessment system leveraging AI technology. This system aims to streamline the
evaluation of student responses, offering immediate and insightful feedback. The process
encompasses thorough analysis of requirements, meticulous system design, integration of AI
algorithms, rigorous reliability testing, and deployment on scalable infrastructure. Key
technologies include the Transformers library for implementing Llama and Mistral models,
Streamlit for web application development, and the Hugging Face Hub for model versioning
and sharing. By leveraging these cutting-edge tools and methodologies, the project aims to
revolutionize the student assessment process, enhancing efficiency and effectiveness in
educational contexts.
Establishing design characteristics for the automated educational assessment system involves
defining clear attributes for architecture, interfaces, and system elements. The focus lies on
achieving real-time performance, scalability, and accuracy, particularly in the implementation
of the LLM-based answer evaluation pipeline. Interfaces are refined to optimize user interaction and
accommodate external service integration, ensuring a seamless and intuitive experience for both
educators and students. By prioritizing these design characteristics, the system aims to deliver
efficient assessment processes, reliable performance, and enhanced usability within educational
environments.
The system architecture of the project revolves around utilizing PDF documents, chunking them
into sections, and employing Large Language Models (LLMs) such as Llama and Mistral to aid
in question generation and answer evaluation. By segmenting the PDFs, the system extracts
relevant portions and generates questions using LLMs, considering each chunk as an original
answer. When students submit their answers, the system prompts LLMs with specific rules to
compare these responses against the original chunks, calculating the similarity percentage as an
output. This process ensures a streamlined approach to assessing student answers, leveraging
AI capabilities to enhance accuracy and efficiency.
Through the systematic integration of LLMs and PDF chunking, the system orchestrates a
seamless evaluation process, enabling educators to efficiently analyze student responses and
provide timely feedback. By harnessing the power of AI algorithms and prompt-guided
interactions, the system facilitates precise comparisons between student answers and original
sources. Ultimately, educators receive comprehensive insights into the similarity percentage,
along with the source of the original answer, empowering them to make informed decisions and
support student learning effectively.
In the Data Flow Diagram (DFD) depicting the process of PDF chunking, database
creation, loading LLMs, question generation, and similarity assessment, the flow of data
begins with the input of PDF documents. These documents are then segmented into
manageable chunks, which are stored in the database for later retrieval. Upon receiving a
request for question generation, the system extracts chunks from the database and prompts
the LLMs to generate questions based on this content. The questions are then presented to
the users, initiating the process of answer submission by students.
Following student submissions, the system retrieves the corresponding original chunks
from the database and passes both the student's answer and the original chunk to the LLMs
for comparison. Utilizing predefined rules and algorithms, the LLMs analyze the similarity
between the two responses, generating a similarity percentage as an output. This output,
along with the source of the original answer, is then provided to educators for assessment
and feedback. Throughout this process, the DFD illustrates the flow of data, ensuring
transparency and understanding of each step involved in the evaluation process.
Components Of DFDs:
The data flow diagram has four components. They are:
• External Entity
• Process
• Data Flow
• Warehouse
External Entity:
An outside process or system that sends data to, or receives data from, the diagrammed
system. External entities are also known as sources, terminators, sinks, or actors, and are
represented by squares.
Process:
Input-to-output transformation in a system takes place because of a process function. A
process is drawn as a rectangle with rounded corners, an oval, or a circle. The process is
named with a short phrase, a single word, or a brief sentence that expresses its essence.
Data Flow:
Data flow describes the information transferred between different parts of the system. The
arrow is the symbol of data flow. A relatable name should be given to the flow to indicate
the information being moved. Data flow can also represent material, along with information,
that is being moved; material shifts are modeled in systems that are not merely informative.
A given flow should only transfer a single type of information. The direction of flow is
represented by the arrow, which can also be bi-directional.
Warehouse:
The data is stored in the warehouse for later use. Two horizontal lines represent the symbol
of the store. The warehouse is not restricted to being a data file; it can be anything, such as
a folder of documents, an optical disc, or a filing cabinet. The data warehouse can be viewed
independently of its implementation. When data flows from the warehouse it is considered a
data read, and when data flows to the warehouse it is called a data entry or data update.
In software engineering, DFDs can be drawn to represent a system at different levels of
abstraction. Higher-level DFDs are partitioned into lower levels that expose more information
and functional detail. Levels in a DFD are numbered 0, 1, 2, and beyond; in this project we
mainly use two levels of the data flow diagram: the 0-level DFD and the 1-level DFD.
Level-0:
The Level 0 Data Flow Diagram (DFD) outlines the core process of the automated student answer
evaluation system. It begins with the input of student responses, followed by the selection of the
appropriate Large Language Model (LLM) to process the query. The system then passes the query to the
selected LLM, which analyzes the student's response and the original answer, ultimately generating the
similarity percentage as an output. This simplified depiction provides a high-level overview of the
fundamental steps involved in the evaluation process, highlighting the flow of data from input to output
through the interaction with the LLM.
Level-1:
In the Level 1 Data Flow Diagram (DFD), the system's steps are expanded to provide a more
detailed understanding of the student answer evaluation process. It begins with the input of
student responses, which are then passed to the LLM selection module. Here, the system
identifies the most suitable LLM based on predefined criteria and passes the selected model to
the query processing module. The query processing module extracts the relevant information
from both the student's answer and the original source, preparing the data for comparison.
Subsequently, the system passes the processed data to the similarity assessment module, where
the LLM evaluates the similarity between the student's response and the original answer. After
analysis, the module generates the similarity percentage, which is then presented as the output
of the system. This detailed breakdown in the Level 1 DFD allows for a clear visualization of
the sequential steps involved in the student answer evaluation process, from input to output,
and highlights the role of each module in facilitating efficient assessment and feedback.
3.4 DATASETS:
•Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow – Aurélien Géron
http://14.139.161.31/OddSem-0822-1122/Hands-On_Machine_Learning_with_Scikit-Learn-
Keras-and-TensorFlow-2nd-Edition-Aurelien-Geron.pdf
• We have compiled a comprehensive PDF of AI and ML questions and answers, suited to
database storage and retrieval. It is designed to facilitate easy chunking for efficient data
management and future reference. This resource streamlines knowledge acquisition and
enhances the retrieval process, ensuring seamless access to key insights across AI and ML
topics.
4.SYSTEM IMPLEMENTATION
Streamlit is a Python library for building interactive web applications. It offers a simple syntax and a
variety of widgets for data visualization and user interaction. Streamlit apps update in real-time as users
interact with them, providing a seamless experience. It integrates well with popular Python libraries like
Pandas and Matplotlib. Deployment to platforms like Streamlit Sharing and Heroku is straightforward.
Installation:
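Streamlit is distributed on PyPI; assuming a standard Python environment, it can be installed with:

pip install streamlit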
Google Colab:
Colab refers to Google Colab, short for Google Colaboratory, which is a free cloud-
based platform provided by Google that allows users to write and execute Python code in a
Jupyter Notebook environment. It provides access to GPU and TPU resources, making it
particularly useful for machine learning tasks. Users can also collaborate in real-time on Colab
notebooks, making it a popular choice for collaborative coding and sharing code with others.
# Imports assume the classic LangChain package layout used by this project
from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

DATA_PATH = 'sourcedocs/'
DB_FAISS_PATH = 'vectorstore/db_faiss'
Creating Database:
def create_vector_db():
    # Load every PDF from the source directory
    loader = DirectoryLoader(DATA_PATH,
                             glob='*.pdf',
                             loader_cls=PyPDFLoader)
    documents = loader.load()
    # Split documents into overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                                   chunk_overlap=50)
    texts = text_splitter.split_documents(documents)
    # Embed the chunks and store them in a local FAISS index
    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2',
                                       model_kwargs={'device': 'cpu'})
    db = FAISS.from_documents(texts, embeddings)
    db.save_local(DB_FAISS_PATH)

if __name__ == "__main__":
    create_vector_db()
from google.colab import drive
from huggingface_hub import login
drive.mount("/content/drive")
login("hf_xxxxxxxxxxxxxxxxxxxx")  # Hugging Face access token (placeholder)
%%writefile qa_interface.py
import streamlit as st
import google.generativeai as genai
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import CTransformers
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

DB_FAISS_PATH = '/content/drive/MyDrive/vectorstore/db_faiss/'

def qa_bot():
    # Load the FAISS index built earlier with the same embedding model
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2",
                                       model_kwargs={'device': 'cpu'})
    db = FAISS.load_local(DB_FAISS_PATH, embeddings,
                          allow_dangerous_deserialization=True)
    # Quantised Mistral model served locally through CTransformers
    llm = CTransformers(
        model="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
        model_type="llama",
        max_new_tokens=1000,
        temperature=0.4)
    # Prompt object reconstructed from the template fragment shown in the report
    qa_template = """Context: {context}
Question: {question}
Helpful answer:"""
    prompt = PromptTemplate(template=qa_template,
                            input_variables=['context', 'question'])
    qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                           chain_type='stuff',
                                           retriever=db.as_retriever(search_kwargs={'k': 2}),
                                           return_source_documents=True,
                                           chain_type_kwargs={'prompt': prompt})
    return qa_chain
def generate_response_from_gemini(input_text):
    genai.configure(api_key="…..")  # Gemini API key omitted in the report
    generation_config = {
        "temperature": 0.5,
        "top_p": 1,
        "top_k": 32,
        "max_output_tokens": 4096,
    }
    safety_settings = []  # safety settings omitted in the report
    llm = genai.GenerativeModel(
        model_name="gemini-pro",
        generation_config=generation_config,
        safety_settings=safety_settings,
    )
    output = llm.generate_content(input_text)
    return output.text
# Streamlit app
def main():
    st.title("Student Answer Evaluation")
    # UI inputs reconstructed from the report's description of the interface
    question = st.text_input("Enter the question")
    student_answer = st.text_area("Enter the student's answer")

    if question and student_answer:
        # Call QA function to fetch the original (actual) answer from the vector store
        qa_result = qa_bot()({'query': question})
        qa_answer = qa_result['result']

        input_prompt_template = """
You are an interviewer who interviews the students. Your job is to give the marks in the
form of percentages for the answers given by the students to the actual answer. Before you give
the marks to the students take the time to give marks, and there are some most important rules
to be followed and edge cases to be handled. The input is given in the form of
studentAnswer: {studentAnswer}, actualAnswer: {actualAnswer}
Set of rules:-
1. Make sure that you don't give marks based on the words matched for the studentAnswer
and the actualAnswer. Give marks based on comparing the whole meaning of the
studentAnswer and the actualAnswer.
2. Make sure that you don't give any explanation; your work is only to give marks for the
studentAnswer in the form of a percentage.
Edge cases:-
1. If the student repeats the question as the answer give him 0 marks.
2. Even if the words in the studentAnswer match with the words in the actualAnswer don't
give marks by just considering word matches; give marks based on comparing the meaning of
the studentAnswer with the actualAnswer."""

        response_text = generate_response_from_gemini(
            input_prompt_template.format(actualAnswer=qa_answer, studentAnswer=student_answer))

        # Display results
        st.subheader("Results:")
        st.write("Original answer:", qa_answer)
        st.write("Similarity percentage:", response_text)

if __name__ == "__main__":
    main()
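After qa_interface.py is written, the app is launched with the Streamlit CLI. In a hosted notebook such as Colab this typically requires exposing the port through a tunnel; the localtunnel step below is one assumed option rather than the project's documented setup:

!streamlit run qa_interface.py --server.port 8501 &>/dev/null &
!npx localtunnel --port 8501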
4.3 RESULTS
The above Figure 4.4 shows the percentage score for a student answer; because the answer is
wrong, the percentage is zero.
Figure 4.6 shows the evaluation of the student answer against the original answer. The
percentage represents the similarity between the student answer and the original answer.
5.TESTING
Perplexity: Perplexity is a metric in language modeling that measures how well a model
predicts a sequence of words. Lower perplexity values indicate better predictive performance,
suggesting that the model is less surprised by the actual sequence of words. It is calculated as
the exponentiation of the average negative log likelihood of the test data, normalized by the
number of words.
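As a minimal sketch of this definition (the token log-probabilities below are made-up numbers, not outputs of the project's models), perplexity can be computed from per-token log-likelihoods as follows:

import math

def perplexity(log_probs):
    # log_probs: natural-log probabilities the model assigns to each observed token
    avg_neg_log_likelihood = -sum(log_probs) / len(log_probs)
    return math.exp(avg_neg_log_likelihood)

# Illustrative values only: higher probabilities (less negative logs) give lower perplexity
print(perplexity([-1.2, -0.7, -2.3, -0.9]))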
6.CONCLUSION
The utilization of Large Language Models (LLMs) holds immense promise for revolutionizing
education through automated student response evaluation. With a focus on continuous study,
collaborative efforts, and innovative approaches, LLM-based evaluation models can undergo
further refinement to enhance accuracy and efficiency. Future endeavors should prioritize
iterative improvements in algorithms and methodologies, aiming to optimize the performance
of these models and meet the evolving needs of educational settings. Additionally, exploring a
diverse range of free-source LLMs and assessing their effectiveness presents valuable
opportunities for advancing the evaluation process, paving the way for innovative practices and
improved learning outcomes.
7.FUTURE WORK
In future endeavors, advancing the automated student answer evaluation system entails a
multifaceted exploration of various avenues for enhancement. Experimentation with a diverse
array of large language models (LLMs), ranging from GPT-3 to BERT, RoBERTa, and XLNet,
offers a rich opportunity to evaluate their efficacy in generating accurate answers and assessing
student responses. By fine-tuning selected LLMs on domain-specific datasets, we can bolster
their understanding and adaptability to the intricacies of educational contexts, fostering more
precise evaluations.
Furthermore, adopting ensemble methods that amalgamate predictions from multiple LLMs
holds promise for augmenting overall performance by capitalizing on the strengths of individual
models. Supplementing the dataset through techniques like paraphrasing and synonym
substitution can enhance model generalization and fortify robustness. Integrating advanced
evaluation metrics, such as semantic similarity and coherence, empowers deeper insights into
the quality of student responses, facilitating more nuanced assessments. Moreover, exploring
the development of domain-specific LLMs tailored explicitly to educational domains and
integrating user feedback mechanisms for continuous refinement represent pivotal steps
towards achieving heightened efficacy and relevance in educational assessment. Concurrently,
optimizing scalability and efficiency ensures the system's adeptness in managing larger datasets
and heightened demand, paving the way for broader adoption and impact across educational
landscapes.
8.REFERENCES
[2] H. M. Balaha and M. M. Saafan, "Automatic Exam Correction Framework (AECF) for the
MCQs, Essays, and Equations Matching," in IEEE Access, vol. 9, pp. 32368-32389, 2021, doi:
10.1109/ACCESS.2021.3060940.
[3] Vedant Bahel, Achamma Thomas, "Text similarity analysis for evaluation of descriptive
answers," https://doi.org/10.48550/arXiv.2105.02935
[4] N. Suzen, A. N. Gorban, J. Levesley, E. M. Mirkes, "Automatic Short Answer Grading and
Feedback Using Text Mining Methods," Procedia Computer Science, ISSN 1877-0509,
https://doi.org/10.1016/j.procs.2020.02.171.
[5] Rani, A.M. Automated Explanatory Answer Evaluation Using Machine Learning Approach.
Design Engineering, pp.1181-1190, 2021.
[6] Burrows, S., Gurevych, I. & Stein, B. The Eras and Trends of Automatic Short Answer
Grading. Int J Artif Intell Educ 25, 60–117 (2015). https://doi.org/10.1007/s40593-014-0026-8
[7] Bonthu, S., Rama Sree, S., Krishna Prasad, M.H.M. (2021). Automated Short Answer
Grading Using Deep Learning: A Survey. In: Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl,
E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2021. Lecture Notes in
Computer Science(), vol 12844. Springer, Cham. https://doi.org/10.1007/978-3-030-84060-0_5
[8] S. K. Sinha, S. Yadav and B. Verma, "NLP-based Automatic Answer Evaluation," 2022 6th
International Conference on Computing Methodologies and Communication (ICCMC), Erode,
India, 2022, pp. 807-811, doi: 10.1109/ICCMC53470.2022.9754052.
[9] Rick Somers, Samuel Cunningham-Nelson, Wageeh Boles, Applying natural language
processing to automatically assess student conceptual understanding from textual responses,
https://doi.org/10.14742/ajet.7121
[10] Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, Xiaoming Zhai, "Applying
large language models and chain-of-thought for automatic scoring," Computers and Education:
Artificial Intelligence, Volume 6, 2024, 100213, ISSN 2666-920X,
https://doi.org/10.1016/j.caeai.2024.100213
9.BIBLIOGRAPHY
1) https://huggingface.co/
2) https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
3) https://huggingface.co/search/full-text?q=TheBloke%2FLlama-2-7b-Chat-GGUF
4) https://www.langchain.com
5) https://research.ibm.com/blog/retrieval-augmented-generation-RAG
10.APPENDIX
10.1 Python Introduction
Python is a high-level, interpreted programming language known for its simplicity, readability,
and versatility. Developed by Guido van Rossum and first released in 1991, Python has gained
widespread popularity and has become one of the most widely used programming languages in
the world. Its syntax emphasizes code readability and simplicity, making it an ideal language
for both beginners and experienced developers alike.
Python's versatility is evident in its broad range of applications across various domains. In web
development, frameworks like Django and Flask are popular choices for building robust and
scalable web applications. In data science, Python's rich ecosystem of libraries such as NumPy,
Pandas, and Matplotlib makes it a preferred language for data analysis, visualization, and
machine learning. Moreover, Python is extensively used in artificial intelligence and scientific
computing, with libraries like TensorFlow, PyTorch, and SciPy powering advanced research
and applications in these fields.
Due to its open-source nature and active community support, Python continues to evolve
rapidly, with frequent updates and new features being added to the language. Its ease of learning
and powerful capabilities have contributed to its widespread adoption across industries and
domains, solidifying its position as one of the most popular programming languages in the
world.
Python's origins can be traced back to the late 1980s when Guido van Rossum, a Dutch
programmer, began working on the language as a side project. His goal was to create a language
that prioritized simplicity and readability while still being powerful and versatile. In February
1991, Python's first version, Python 0.9.0, was released.
Over the years, Python has undergone several major releases, each introducing new features,
improvements, and optimizations. Python 2.x series, released in 2000, became widely popular
and remained in use for many years. However, with the introduction of Python 3.x series in
2008, the language underwent significant changes and improvements, leading to better
performance, enhanced features, and improved syntax.
Python's community-driven development model has played a crucial role in its success. The
Python Software Foundation (PSF), established in 2001, oversees the development and
maintenance of the language, ensuring its continued growth and evolution. Today, Python
enjoys widespread adoption and usage across industries, with millions of developers worldwide
contributing to its ecosystem through libraries, frameworks, and open-source projects.
Python is renowned for its rich set of features and characteristics that make it an attractive
choice for developers. Some of the key features of Python include:
Simplicity: Python's syntax is designed to be clear and concise, making it easy to read and write
code. This simplicity allows developers to focus on solving problems rather than worrying
about complex syntax.
Readability: Python emphasizes readability, with code that closely resembles English-like
syntax. This readability reduces the time and effort required to understand and maintain code,
especially in collaborative projects.
Interpreted: Python is an interpreted language, meaning that code is executed line by line by
an interpreter at runtime. This allows for rapid development and testing, as changes to code can
be immediately evaluated without the need for compilation.
Dynamic Typing: Python uses dynamic typing, allowing variables to be assigned without
specifying their data types explicitly. This flexibility simplifies code development and enhances
code readability.
Rich Standard Library: Python comes with a comprehensive standard library that provides
built-in support for a wide range of tasks and functionalities, including file I/O, networking,
data manipulation, and more. This extensive library reduces the need for external dependencies
and simplifies development.
Community Support: Python boasts a large and active community of developers who
contribute to its ecosystem by creating libraries, frameworks, and tools. This vibrant community
ensures continuous improvement and innovation within the Python ecosystem.
10.4 NLP
Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the
interaction between computers and human languages. NLP enables computers to understand,
interpret, and generate human language in a way that is both meaningful and useful.
Python plays a significant role in NLP due to its simplicity, readability, and rich ecosystem of
libraries and tools. Some of the key aspects of Python's involvement in NLP include:
NLTK (Natural Language Toolkit): NLTK is a popular Python library for NLP tasks such as
tokenization, stemming, tagging, parsing, and semantic reasoning. It provides a wide range of
tools and resources for building NLP applications and conducting research in the field.
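As a brief illustration of these NLTK tasks (a generic sketch, not code from this project), tokenization, stemming, and part-of-speech tagging can be performed as follows; the resource downloads are assumed to be available:

import nltk
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "Students submitted their answers for evaluation."
tokens = nltk.word_tokenize(text)                    # tokenization
stems = [PorterStemmer().stem(t) for t in tokens]    # stemming
tags = nltk.pos_tag(tokens)                          # part-of-speech tagging
print(tokens, stems, tags, sep="\n")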
TensorFlow and PyTorch: TensorFlow and PyTorch are popular deep learning frameworks
that can be used for NLP tasks such as text classification, sentiment analysis, machine
translation, and text generation. These frameworks provide tools and APIs for building and
training deep learning models for NLP applications.
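For instance (a generic sketch rather than code from this project), a pretrained PyTorch-backed model from the Hugging Face Transformers library can perform sentiment analysis in a few lines; the default model downloaded by the pipeline is an assumption of this example:

from transformers import pipeline

# Downloads a default pretrained sentiment model (PyTorch backend) on first use
classifier = pipeline("sentiment-analysis")
print(classifier("The automated evaluation gave clear and helpful feedback."))
# -> [{'label': 'POSITIVE', 'score': ...}]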
Word Embeddings: Word embeddings such as Word2Vec, GloVe, and FastText are
techniques used to represent words as dense vectors in a high-dimensional space. Python
libraries like Gensim and TensorFlow provide implementations of these techniques, making it
straightforward to use word embeddings in Python-based NLP applications.
10.5 AI & ML
Example: Virtual assistants like Siri, Alexa, and Google Assistant use AI algorithms to
understand and respond to user commands. They can perform tasks such as setting reminders,
answering questions, and playing music based on user preferences. Another example of AI is
autonomous vehicles, which use sensors, cameras, and AI algorithms to perceive their
environment, navigate roads, and make real-time driving decisions.
Large Language Models (LLMs) represent a diverse array of advanced systems designed to
understand and generate human language. Built upon complex neural network architectures like
Transformer models, such as GPT-3, BERT, and XLNet, LLMs are capable of comprehending
context and producing coherent text-based outputs. They undergo extensive training on massive
datasets, allowing them to learn general language patterns and semantics.
Moreover, LLMs can be fine-tuned on task-specific datasets, enhancing their effectiveness for
specialized tasks like sentiment analysis or question answering. This versatility and adaptability
make LLMs invaluable tools for a wide range of natural language processing (NLP)
applications, driving advancements in language understanding and generation technology.
The key distinction between pretrained and fine-tuned Large Language Models (LLMs) lies in
their training stages and objectives. Pretrained LLMs undergo initial training on vast amounts
of unlabeled text data without specific task supervision. This phase enables them to learn
general language patterns and semantics, forming a foundational understanding of human
language. In contrast, fine-tuned LLMs undergo additional training on task-specific datasets
after the initial pretrained phase.
This fine-tuning process tailors the model's knowledge and performance to specific tasks or
domains, such as sentiment analysis or question answering, enhancing its effectiveness and
accuracy for targeted applications. Overall, pretrained LLMs establish a broad linguistic
understanding, while fine-tuned LLMs refine their capabilities for specialized tasks through
additional training.
• XLNet:
XLNet is a transformer-based language model that captures bidirectional context from
surrounding words. It is used as a base model for various natural language processing
applications.
• Llama:
Llama is one of the prominent Large Language Models (LLMs) and has been compared
to other models like ChatGPT-4 and Mistral in terms of performance and capabilities. It
can perform tasks such as question answering, text generation and summarization, and
translation, and is suitable for research in AI ethics, educational platforms, and language
analysis tools.
• Mistral Models:
Mistral AI’s models, including Mistral 7B and Mixtral 8x7B, outperform competitors
like Llama 2 and GPT-3.5. These versatile models excel in question answering, text
generation, summarization, and translation, making them ideal for industrial
automation, energy-efficient AI deployments, and mobile applications.