ML Miniproject Final Report
ABSTRACT
This FastAPI application allows users to upload PDF files, process the content to generate
coding practice questions and answers, and then produce a PDF document with these Q&A
pairs. The app integrates OpenAI's GPT-3.5-turbo for generating questions and answers,
using LangChain for text processing and vector storage. It includes functionality for
uploading files, analyzing content, and returning the output in a downloadable format. The
app also serves static files and uses Jinja2 templates for rendering HTML pages. It is
designed to help coders and programmers prepare for exams and coding tests by providing
relevant practice material.
The core of the application is a Natural Language Processing (NLP) model, powered by a
state-of-the-art transformer (here, OpenAI's GPT-3.5-turbo), which processes the input questions and
generates accurate and relevant answers. The system is designed to handle various types of
questions, providing detailed responses that are contextually appropriate. FastAPI's
asynchronous capabilities ensure that the application can handle multiple requests
simultaneously, making it scalable and efficient. This setup is ideal for applications in
education, customer support, and knowledge management, where quick and reliable
information retrieval is crucial. By abstracting the complexities of NLP and API
development, the FastAPI-based Question and Answer Generator offers a streamlined
solution for implementing advanced AI-driven Q&A functionalities.
CHAPTER 1
INTRODUCTION
The ‘count_pdf_pages’ function is a utility that determines the number of pages in a PDF file
using the PyPDF2 library. The file_processing function leverages the PyPDFLoader from
LangChain to load and split the PDF content into chunks suitable for generating questions
and answers. This function prepares the document for further processing by splitting it into
larger chunks for question generation and smaller chunks for answer generation, utilizing the
TokenTextSplitter from LangChain.
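For reference, a minimal sketch of such a page-count utility, assuming PyPDF2's PdfReader API (the report's exact implementation may differ):

from PyPDF2 import PdfReader

def count_pdf_pages(pdf_path: str) -> int:
    # Open the PDF and return its page count; 0 signals an unreadable file.
    try:
        reader = PdfReader(pdf_path)
        return len(reader.pages)
    except Exception:
        return 0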
The ‘llm_pipeline’ function orchestrates the core logic of generating questions and answers.
It first processes the PDF content into appropriate chunks and then employs the ChatOpenAI
model to generate questions based on a prompt template. The questions are refined through a
refine prompt template if necessary. After generating the questions, the document content is
converted into embeddings using OpenAIEmbeddings and stored in a FAISS vector store.
This enables the creation of a retrieval-based question-answering chain that can generate
answers for the previously generated questions.
The ‘get_pdf’ function combines the generated questions and answers into a new PDF
document using the ReportLab library. It ensures the output is neatly formatted with
questions and corresponding answers presented sequentially. This function saves the resulting
PDF to a predefined directory and returns the file path. The web routes defined in the
FastAPI app handle file uploads and initiate the analysis process. The /upload endpoint saves
the uploaded PDF to a directory, while the /analyze endpoint triggers the question-and-
answer generation process and returns the path to the newly created PDF document
containing the generated content.
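A minimal sketch of how such a function might assemble the document with ReportLab's Platypus layer follows; the function name, styles, and output path are illustrative assumptions rather than the project's exact code.

from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer

def build_qa_pdf(qa_pairs, output_path="static/output/QA.pdf"):  # hypothetical helper
    # Flow each question/answer pair into the document, separated by spacers.
    styles = getSampleStyleSheet()
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    story = []
    for question, answer in qa_pairs:
        story.append(Paragraph(f"Q: {question}", styles["Heading3"]))
        story.append(Paragraph(f"A: {answer}", styles["BodyText"]))
        story.append(Spacer(1, 12))  # vertical gap between pairs
    doc.build(story)
    return output_path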
1.1 Background
FastAPI is a modern, fast (high-performance) web framework for building APIs with Python
3.7+ based on standard Python type hints. It is designed to create APIs quickly with a focus
on performance and ease of use. FastAPI is built on top of Starlette for the web parts and
Pydantic for the data parts. This combination allows for automatic generation of interactive
API documentation using Swagger UI and ReDoc, which makes it easier for developers to
understand and use the APIs. The framework's design is inspired by tools like Flask, offering
a similar simplicity while incorporating advanced features and optimizations.
The key feature of FastAPI is its ability to leverage Python type hints to perform data
validation, serialization, and documentation. When you define a request body, query
parameters, or path parameters using Python types, FastAPI automatically generates a JSON
schema for them. This schema is then used to validate incoming requests and generate
interactive documentation. This approach not only reduces the amount of boilerplate code but
also ensures that the API is well-documented and easy to understand for both developers and
users.
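For example, a hypothetical route of this kind illustrates the mechanism (the model and endpoint names are not from the project code):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AnalyzeRequest(BaseModel):
    # FastAPI derives a JSON schema from these type hints and validates
    # incoming request bodies against it automatically.
    pdf_filename: str
    max_questions: int = 10

@app.post("/analyze-demo")
async def analyze_demo(req: AnalyzeRequest):
    # Invalid payloads are rejected with a 422 response before this body runs.
    return {"file": req.pdf_filename, "max_questions": req.max_questions}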
FastAPI is also known for its high performance. It is built to be asynchronous from the
ground up, using Python's async and await keywords. This makes it capable of handling a
large number of concurrent connections, which is crucial for real-time applications, such as
chat applications, gaming backends, or IoT applications. The performance of FastAPI is
comparable to frameworks like Node.js and Go, making it a strong contender in the world of
high-performance web frameworks. Benchmarking tests have shown that FastAPI can handle
a high throughput of requests per second with low latency.
1.2 Why FastAPI
High Performance:
FastAPI is built on top of Starlette for the web parts and Pydantic for the data parts. This
combination ensures high performance and efficiency. The asynchronous capabilities of
FastAPI are particularly beneficial for handling I/O-bound operations, such as reading and
writing files, making API requests, and interacting with databases, which are crucial in this
application that deals with file uploads and processing.
Ease of Development:
FastAPI simplifies the development process with its intuitive design and automatic
interactive API documentation generation. This allows developers to quickly set up routes
and endpoints, as seen in the /upload and /analyze endpoints. The automatic validation and
serialization of request and response data make the development process smoother and less
error-prone.
Asynchronous Support:
The code benefits from FastAPI's built-in support for asynchronous functions, which is
critical for handling potentially slow operations without blocking the main execution thread.
For example, the use of aiofiles for asynchronous file operations ensures that the application
can handle multiple file uploads concurrently without significant performance degradation.
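A sketch of an upload handler in this style, with the directory and field names assumed for illustration:

import os
import aiofiles
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/upload")
async def upload_pdf(pdf_file: UploadFile = File(...)):
    # Stream the upload to disk without blocking the event loop, so other
    # requests can be served while the file is written.
    os.makedirs("static/docs", exist_ok=True)  # assumed upload directory
    save_path = os.path.join("static/docs", pdf_file.filename)
    async with aiofiles.open(save_path, "wb") as out_file:
        await out_file.write(await pdf_file.read())
    return {"msg": "success", "pdf_filename": save_path}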
Flexibility and Extensibility:
FastAPI is highly flexible and can be easily extended with additional functionality. The
provided code demonstrates how FastAPI can be combined with other libraries, such as
langchain for language model operations, PyPDF2 for PDF handling, and reportlab for PDF
generation. This modular approach allows developers to extend the application’s capabilities
without major refactoring.
1.3 Problem Statement
The aim of this project is to develop a web application using FastAPI that allows users to upload PDF documents
containing coding materials, automatically generates practice questions and answers from the
content, and provides the resulting Q&A as a downloadable PDF.
1.4 Objectives
The primary objectives of the "Question and Answer Generator" project are as follows:
Set up a FastAPI application that includes handling static files and HTML templates for a
user interface.
Create endpoints that allow users to upload PDF documents containing coding materials.
Ensure the uploaded files are saved to a designated directory on the server.
Implement functions to load and process the content of the uploaded PDF files. Use the
PyPDFLoader to extract text and split it into manageable chunks for question and answer
generation.
Utilize OpenAI's GPT-3.5 language model to automatically generate practice questions from
the extracted text. Implement a prompt template to guide the question generation process.
Use a refining template to enhance the quality of the generated questions based on additional
context.
Create a retrieval-based QA chain using LangChain and FAISS to generate answers for the
previously generated questions.
Format the generated questions and answers into a new PDF document using ReportLab.
Ensure the resulting PDF is neatly formatted and easy to read.
Make the newly created Q&A PDF available for download to the user. Implement an endpoint
that triggers the analysis and returns the path to the output PDF.
Design a user-friendly HTML interface for uploading PDF files and initiating the analysis
process.
Configure the application to run on a specified host and port, ensuring it is accessible for
users to interact with.
CHAPTER 2
LITERATURE SURVEY
The paper "Language Models are Few-Shot Learners" by Brown, T., Mann, B.,
Ryder, N., et al. (2020) presents GPT-3, a state-of-the-art language model developed by
OpenAI that contains 175 billion parameters, making it the largest language model at the time
of its release. GPT-3 is based on the Transformer architecture and demonstrates
unprecedented performance in a variety of natural language processing tasks. One of the key
innovations highlighted in the paper is GPT-3's ability to perform "few-shot learning," where
the model can effectively handle tasks with little to no task-specific training data. The authors
illustrate this capability by showing GPT-3's proficiency in generating coherent text,
translating languages, answering questions, and even performing arithmetic and
commonsense reasoning tasks. These results are achieved through simple prompts provided
at inference time, without the need for fine-tuning on specific tasks. This breakthrough
suggests that the sheer scale of the model, combined with its training on diverse internet text,
allows it to generalize across different tasks and domains, pushing the boundaries of what
language models can achieve. The paper also discusses the broader implications of such
powerful models, including potential applications and ethical considerations.
The paper "Learning to Ask: Neural Question Generation for Reading Comprehension" by
Du, X., Shao, J., and Cardie, C. (2017) introduces a neural network-based approach for
generating questions from text passages. The authors present a sequence-to-sequence
(Seq2Seq) model with attention mechanisms, trained on the Stanford Question Answering
Dataset (SQuAD). Their model is designed to convert sentences into questions, and it is
evaluated on its ability to generate fluent and relevant questions. The results show that the
model effectively produces high-quality questions, which can enhance automated tutoring
systems and improve reading comprehension assessments.
The provided code is a FastAPI application designed to generate questions and answers from
PDF documents, particularly coding materials and documentation. The process involves
several key steps:
1. PDF Upload and Storage: Users upload a PDF file, which is stored in a specified directory.
The file is read asynchronously to ensure non-blocking operations.
2. PDF Processing: The uploaded PDF is processed to extract its content. This is done using
the `PyPDFLoader` from LangChain, which loads the PDF and extracts the text from each
page.
3. Text Splitting: The extracted text is split into manageable chunks using the
`TokenTextSplitter`. Two types of splitting are done: one for question generation with larger
chunks and overlap to ensure context, and another for answer generation with smaller chunks.
4. Question Generation: The larger chunks are passed to the GPT-3.5-turbo model with a
prompt template, and refined with a refine template, to generate practice questions.
5. Embedding and Vector Store Creation: The text chunks intended for answer generation are
embedded using ‘OpenAIEmbeddings’, and a vector store is created using FAISS (Facebook
AI Similarity Search). This allows for efficient retrieval of relevant text chunks for answering
questions.
6. Answer Generation: A retrieval-based QA chain generates an answer for each question
using the vector store.
7. PDF Generation: The generated questions and their corresponding answers are compiled
into a PDF using the ReportLab library. Each question and answer pair is formatted and
added to the PDF, with spaces added between pairs for readability.
8. Serving the Results: The final PDF containing the Q&A pairs is saved in a designated
output directory and made available for download. The API provides endpoints for uploading
the PDF, analyzing it, and retrieving the generated Q&A PDF.
The code uses pretrained models from the OpenAI API (such as gpt-3.5-turbo) for question and
answer generation. These models are invoked rather than trained within the provided code.
# NOTE: Reconstructed from the fragments in this report; import paths assume
# the classic (pre-0.1) langchain package layout used at the time of writing.
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import TokenTextSplitter
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

def file_processing(file_path):
    # Load the PDF and concatenate the text of every page.
    loader = PyPDFLoader(file_path)
    data = loader.load()
    question_gen = ''
    for page in data:
        question_gen += page.page_content

    # Large, overlapping chunks preserve enough context for question generation.
    splitter_ques_gen = TokenTextSplitter(
        model_name='gpt-3.5-turbo',
        chunk_size=10000,
        chunk_overlap=200
    )
    chunks_ques_gen = splitter_ques_gen.split_text(question_gen)
    document_ques_gen = [Document(page_content=t) for t in chunks_ques_gen]

    # Smaller chunks are better suited to retrieval during answer generation.
    splitter_ans_gen = TokenTextSplitter(
        model_name='gpt-3.5-turbo',
        chunk_size=1000,
        chunk_overlap=100
    )
    document_answer_gen = splitter_ans_gen.split_documents(document_ques_gen)
    return document_ques_gen, document_answer_gen

def llm_pipeline(file_path):
    document_ques_gen, document_answer_gen = file_processing(file_path)

    llm_ques_gen_pipeline = ChatOpenAI(
        temperature=0.3,
        model="gpt-3.5-turbo"
    )

    prompt_template = """
    You are an expert at creating questions based on coding materials and documentation.
    Your goal is to prepare a coder or programmer for their exam and coding tests.

    {text}

    Create questions that will prepare the coders or programmers for their tests.

    QUESTIONS:
    """
    PROMPT_QUESTIONS = PromptTemplate(template=prompt_template,
                                      input_variables=["text"])

    # The refine chain feeds the questions generated so far back in as
    # {existing_answer}, so the refine prompt needs both input variables.
    refine_template = """
    You are an expert at creating practice questions based on coding material and
    documentation. We have received some practice questions so far: {existing_answer}.
    We have the option to refine the existing questions or add new ones
    (only if necessary) with some more context below.

    {text}

    QUESTIONS:
    """
    REFINE_PROMPT_QUESTIONS = PromptTemplate(
        template=refine_template,
        input_variables=["existing_answer", "text"]
    )

    ques_gen_chain = load_summarize_chain(llm=llm_ques_gen_pipeline,
                                          chain_type="refine",
                                          verbose=True,
                                          question_prompt=PROMPT_QUESTIONS,
                                          refine_prompt=REFINE_PROMPT_QUESTIONS)
    ques = ques_gen_chain.run(document_ques_gen)

    # Embed the answer-generation chunks and index them with FAISS.
    embeddings = OpenAIEmbeddings()
    vector_store = FAISS.from_documents(document_answer_gen, embeddings)

    llm_answer_gen = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")

    # One question per non-empty line of the model's output.
    ques_list = [q for q in ques.split("\n") if q.strip()]

    answer_generation_chain = RetrievalQA.from_chain_type(llm=llm_answer_gen,
                                                          chain_type="stuff",
                                                          retriever=vector_store.as_retriever())
    return answer_generation_chain, ques_list
File Processing:
Load PDF content and split it into chunks suitable for question and answer generation using
the TokenTextSplitter.
Question Generation:
Use the ChatOpenAI model with a specified prompt to generate questions based on the text
chunks.
Answer Generation:
Embed the document chunks using OpenAIEmbeddings and store them in a FAISS vector store.
This setup leverages pretrained models for question and answer generation without explicitly
training new models within the provided code.
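A minimal sketch of how the pipeline's outputs might be consumed, assuming the llm_pipeline shown earlier (the sample path is a placeholder):

# llm_pipeline returns a RetrievalQA chain plus the generated questions.
answer_chain, questions = llm_pipeline("static/docs/sample.pdf")

for question in questions:
    # The chain retrieves the most relevant chunks from the FAISS store
    # and has GPT-3.5-turbo answer from that retrieved context.
    answer = answer_chain.run(question)
    print(question, answer, sep="\n")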
Figure: System workflow. The user interface uploads a PDF; the LLM pipeline processes it and generates the Q&A content; the output PDF is stored and made available for download.
Frontend Interface
Components: HTML pages, including the file upload form and result display.
FastAPI Application
Components:
Endpoints:
/upload: Saves the uploaded PDF file to a directory on the server.
/analyze: Processes the uploaded PDF to generate a question and answer PDF.
Libraries:
StaticFiles: Serves static files like CSS, JavaScript, and uploaded PDFs.
Function: Manages file uploads, triggers PDF processing, and serves the results.
LLM Pipeline
Components:
PDF Loader: Uses PyPDFLoader to read and extract text from the uploaded PDF.
Text Splitter: Splits the extracted text into manageable chunks for further processing.
Question Generation: Generates practice questions with the ChatOpenAI model and prompt templates.
Answer Generation: Answers the generated questions with a retrieval-based QA chain over a FAISS vector store.
Report Generation: Creates a PDF with questions and answers using ReportLab.
External Services
OpenAI API:
Function: Provides natural language processing capabilities for question generation and
answer retrieval.
Storage
Static Files: Directory for storing uploaded PDFs and generated output PDFs.
Execution Environment
Operating System: Manages file system operations and execution of the FastAPI app.
This section provides an overview of the tools and technologies used in a FastAPI application
designed for generating questions and answers from PDF documents. The application
leverages modern web frameworks, natural language processing libraries, and PDF
processing tools to deliver a comprehensive solution.
Web Framework
FastAPI: A modern, high-performance web framework for building APIs with Python.
FastAPI is known for its speed and ease of use, thanks to its support for Python type hints and
automatic
generation of API documentation. It is used in this application to handle HTTP requests and
manage endpoints for uploading and processing PDF files.
Jinja2: A template engine for Python, integrated with FastAPI to render HTML templates. It
allows for dynamic content generation in web pages.
StaticFiles: A FastAPI component used to serve static assets such as CSS, JavaScript, and
images. This enables the application to deliver a complete web experience.
LangChain: A framework designed for applications with large language models (LLMs). The
following components of LangChain are used:
ChatOpenAI: Provides access to OpenAI’s GPT models for generating natural language text,
including questions and answers.
RetrievalQA: Builds a retrieval-based question-answering chain that answers the generated
questions from the content of the PDF.
TokenTextSplitter: Splits text into chunks suitable for processing by language models,
ensuring that large texts are handled efficiently.
PromptTemplate: Defines the prompts for guiding the LLM to generate or refine questions.
load_summarize_chain: Constructs a chain to generate and refine questions based on the input
text.
OpenAIEmbeddings: Converts text into vector embeddings for similarity search and document
retrieval.
FAISS: An efficient library for similarity search and clustering of dense vectors, used here to
manage and search through text embeddings.
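A small self-contained sketch of this usage pattern (the sample texts and query are placeholders; an OpenAI API key must be configured):

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

texts = ["FastAPI is an asynchronous web framework.",
         "FAISS performs fast similarity search over dense vectors."]

# Embed the texts and build an in-memory FAISS index over them.
vector_store = FAISS.from_texts(texts, OpenAIEmbeddings())

# Retrieve the stored text most similar to a query.
docs = vector_store.similarity_search("What does FAISS do?", k=1)
print(docs[0].page_content)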
PDF Processing
PyPDF2: Reads uploaded PDFs and determines their page counts.
ReportLab: Compiles the generated questions and answers into the output PDF.
Aiofiles: An asynchronous I/O library for handling file operations in a non-blocking manner.
It ensures efficient file handling during PDF uploads and processing.
Server
Uvicorn: An ASGI server used to run the FastAPI application. Uvicorn provides high
performance and supports asynchronous capabilities, making it suitable for handling a large
number of concurrent requests.
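The server is typically started programmatically along these lines, assuming app is the FastAPI instance defined in the application module (host and port are the conventional defaults, not values stated in the report):

import uvicorn

if __name__ == "__main__":
    # Run the FastAPI app with Uvicorn on the configured host and port.
    uvicorn.run(app, host="0.0.0.0", port=8000)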
Utilities
OS: A Python module for interacting with the operating system. It is used for file system
operations such as checking the existence of directories and creating new ones.
JSON: A Python module for encoding and decoding JSON data. It is used for handling data
interchange between the API and frontend.
File System
Static File Storage: Manages the storage of uploaded and generated files. This includes
directories for saving PDFs and the resulting question-answer PDFs.
Workflow
File Upload: Users upload PDF files through the web interface.
Processing: The uploaded file is processed to extract text and generate questions and answers
using the NLP pipeline.
PDF Generation: A new PDF is created containing the generated questions and answers.
Result Delivery: The generated PDF is saved and made available for download.
Hardware Requirements:
CPU:
RAM:
Minimum 8 GB of RAM.
Recommended 16 GB or higher for handling large PDF files and extensive processing.
Storage:
SSD storage for faster read/write operations, especially for handling large PDF files.
Internet Connection:
A stable internet connection is required to interact with the OpenAI API for generating
questions and answers.
Software Requirements:
Operating System:
Python:
Python 3.7 or higher, as required by FastAPI.
Python Packages:
fastapi
uvicorn
PyPDF2
langchain
openai (for OpenAI's GPT-3.5-turbo model)
faiss-cpu (FAISS vector store)
reportlab (output PDF generation)
aiofiles (asynchronous file I/O)
jinja2 (HTML templating)
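The OpenAI-backed components also expect an API key in the environment; OPENAI_API_KEY is the standard variable read by the openai and langchain packages, and a minimal startup check might look like this:

import os

# Fail early with a clear message if the key is missing.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable "
                       "before starting the application.")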
1. Upload PDF
Action: The user accesses the main page served by the FastAPI application.
Frontend Interaction: The user uploads a PDF file via a form on the HTML page.
Endpoint: /upload
Process:
The uploaded file is saved to a designated directory on the server.
2. Trigger Analysis
Frontend Interaction: The user submits the filename of the uploaded PDF via a form.
Endpoint: /analyze
Process:
This triggers the get_pdf() function to process the PDF and generate the question-and-answer
PDF.
3. Question and Answer Generation
Function: llm_pipeline(file_path)
Process:
Load PDF: file_processing(file_path) extracts text from the PDF using PyPDFLoader.
Text Splitting: The extracted text is split into chunks using TokenTextSplitter.
4. Question Generation
Text chunks are fed into the ChatOpenAI model with a PromptTemplate to generate questions.
5. Answer Generation
A RetrievalQA chain is used to generate answers based on the questions and the vector store.
6. PDF Generation
Function: get_pdf(file_path)
Process:
7. Create Document
Use ReportLab to create a PDF document with questions and corresponding answers.
For each question, the corresponding answer is retrieved and added to the PDF.
The PDF is structured with questions and answers formatted neatly, separated by spacers.
8. Deliver Results
Process:
The PDF file can be accessed from the static/output/ directory where it was saved.
Summary of Steps
1. User Uploads PDF: The user uploads a PDF file via the /upload endpoint.
2. Trigger Analysis: The user requests analysis via the /analyze endpoint.
3. Process PDF: Questions and answers are generated from the extracted text.
4. Return Result: Provide the path to the generated PDF for user download.
The FastAPI application processes PDF documents, generates questions and answers based on
their content, and produces a downloadable PDF containing the generated Q&A pairs. This
system leverages several libraries and models, including OpenAI's GPT-3.5-turbo, PyPDF2
for PDF handling, FAISS for vector storage, and ReportLab for PDF generation.
PDF Handling
Library: PyPDF2 and aiofiles are used for reading and writing PDF files.
Functionality: Extracts text from the uploaded PDF and handles file operations
asynchronously to improve performance.
Text Processing
Text Splitter: The TokenTextSplitter from LangChain is used to break down the extracted
text into chunks suitable for processing by the GPT-3.5-turbo model. This ensures that the
text is manageable and conforms to token limits.
Document Creation: The text chunks are converted into document objects, which are then
used for generating questions and answers.
Question Generation: Utilizes ChatOpenAI with a prompt template specifically designed for
generating coding-related questions. The prompt ensures that the generated questions are
relevant and comprehensive.
Answer Generation: Employs a retrieval-based approach using FAISS for vector storage
and retrieval. The OpenAI embeddings and GPT-3.5-turbo model are used to generate precise
answers based on the context of the questions.
PDF Generation
Library: ReportLab is used to compile the generated questions and answers into a new PDF
document.
Formatting: Ensures that the PDF is well-formatted with appropriate spacing and styles for
readability.
Question and Answer Generation:
Questions are generated using the GPT-3.5-turbo model with a specific prompt template.
Generated questions are refined and split to filter out irrelevant content.
Answers are generated by embedding the text chunks, storing them in a vector store, and
using a retrieval-based QA system.
Result Compilation:
Questions and answers are compiled into a new PDF document using ReportLab.
The resulting PDF is saved and made available for user download.
The FastAPI application successfully implements a robust pipeline for generating questions
and answers from PDF documents, specifically tailored for coding materials and
documentation. By leveraging state-of-the-art natural language processing capabilities
through OpenAI's GPT-3.5-turbo model, the application effectively extracts, processes, and
generates educational content. The integration of various components, including text
extraction, question generation, and answer retrieval, demonstrates the feasibility and
efficiency of using advanced AI models for educational purposes. The generated question-
and-answer PDFs provide valuable resources for learners, aiding in their preparation for
coding exams and tests.
A key strength of this system lies in its modular design, which allows for easy customization
and scalability. The use of well-defined pipelines for question and answer generation means
that individual components can be updated or replaced as new technologies emerge, without
requiring a complete overhaul of the system.
Customization Options: Allow users to specify the type and difficulty level of questions they
want generated.
Additional Features:
Multi-Format Support: Extend support to other document formats such as Word documents,
text files, and HTML content.
Collaborative Learning: Implement features that allow collaborative question generation and
discussion among multiple users or within study groups.
Bias Analysis: Continuously monitor and analyze the generated content for biases and
implement strategies to mitigate any identified biases.
REFERENCES
[1] Jurafsky, D., & Martin, J. H. (2019). Speech and Language Processing (3rd ed. draft). The
authors are renowned in the field of natural language processing and provide a
comprehensive guide to language understanding and generation algorithms. Retrieved from
https://web.stanford.edu/~jurafsky/slp3/
[2] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information
Retrieval. This book covers the fundamentals of information retrieval, which is crucial for
developing a question and answer system. Cambridge University Press.
[3] Vasile, F., & Ligozat, A. L. (2016). Question Answering over Linked Data (QALD-5).
This research paper explores question answering techniques over linked data, which can be
valuable for incorporating semantic knowledge into the app. Retrieved from
https://www.researchgate.net/publication/303721162_Question_Answering_over_Linked_Da
ta_QALD-5
[4] Tiangolo, S. (2020). FastAPI: The modern, fast (high-performance), web framework for
building APIs with Python 3.7+ based on standard Python type hints. Retrieved from
https://fastapi.tiangolo.com/
[10] Johnson, J., Douze, M., & Jégou, H. (2021). FAISS: A library for efficient similarity
search and clustering of dense vectors. Retrieved from https://faiss.ai/