Developing Retrieval Augmented Generation (RAG) Based LLM Systems From Pdfs - An Expert Report
1 Introduction
Large language models (LLMs) excel at generating human-like responses, but
base AI models can't keep up with the constantly evolving information within
dynamic sectors. They rely on static training data, leading to outdated or
incomplete answers, and they often lack transparency and accuracy in
high-stakes decision making. Retrieval Augmented Generation (RAG) presents a
powerful solution to this problem. RAG systems pull in information from
external data sources, like PDFs, databases, or websites, grounding the
generated content in accurate and current data, which makes them ideal for
knowledge-intensive tasks.
In this report, we document our experience as a step-by-step guide to building
RAG systems that integrate PDF documents as the primary knowledge base.
We discuss the design choices, the development of the system, and the evaluation
of the guide, providing insights into the technical challenges encountered and
the practical solutions applied. We detail our experience using both proprietary
tools (OpenAI) and open-source alternatives (Llama), including their implications
for data security, offering guidance on choosing the right strategy. Our insights
are designed to help practitioners and researchers optimize RAG models for the
precision, accuracy, and transparency that best suit their use case.
2 Background
This section presents the theoretical background of this study. Traditional
generative models, such as GPT, BERT, or T5, are trained on massive datasets but
have a fixed internal knowledge cutoff based on their training data. They can
only generate black-box answers based on what they know, and this limitation
is notable in fields where information changes rapidly and better explainability
and traceability of responses are required, such as healthcare, legal analysis,
customer service, or technical support.
enhancing the model’s ability to respond.
2. Data Preprocessing:
The collected data is then preprocessed to create manageable and meaning-
ful chunks. Preprocessing involves cleaning the text (e.g., removing noise,
formatting), normalizing it, and segmenting it into smaller units, such as to-
kens (e.g., words or groups of words) that can be easily indexed and retrieved
later. This segmentation is necessary to ensure that the retrieval process is
accurate and efficient.
date information to respond to the query.
5. Augmentation of Context:
Merging two knowledge streams, the fixed, general knowledge embedded in the
LLM and the flexible, domain-specific information augmented on demand as an
additional layer of context, aligns the Large Language Model (LLM) with both
established and emerging information.
7. Final Output:
By moving beyond the opaque outputs of traditional models, the final outputs
of RAG systems offer several advantages: they minimize the risk of generating
hallucinations or outdated information, enhance interpretability by clearly
linking outputs to real-world sources, and enrich responses with relevant and
accurate information.
effective when dealing with stable data or when the model needs to adhere to a
specific tone and style.
Advantages: No additional training is required, making it easy to deploy
and maintain. It is best for general purpose tasks or when exploring potential
applications without high upfront costs.
PDFs are paramount for RAG applications because they are widely used for
distributing high-value content like research papers, legal documents, technical
manuals, and financial reports, all of which contain dense, detailed information
essential for grounding RAG models. PDFs come in various forms, allowing access
to a wide range of data types—from scientific data and technical diagrams to le-
gal terms and financial figures. This diversity makes PDFs an invaluable resource
for extracting rich, contextually relevant information. Additionally, the consis-
tent formatting of PDFs ensures accurate text extraction and context preserva-
tion, which is fundamental for generating precise responses. PDFs also include
metadata (like author, keywords, and creation date) and annotations (such as
highlights and comments) that provide extra context, helping RAG models pri-
oritize sections and better understand document structure, ultimately enhancing
retrieval and generation accuracy.
– Appropriate Tool for Extraction: There are tools and libraries for ex-
tracting text from PDFs in most popular programming languages (e.g.,
pdfplumber or PyMuPDF (fitz) for Python). These libraries handle most
common PDF structures and formats, preserving the text's layout and struc-
ture as much as possible (see the sketch after this list).
– Verify and Clean Extracted Text: After extracting text, always verify
it for completeness and correctness. This step is essential for catching any
extraction errors or artifacts from formatting.
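As an illustration, here is a minimal extraction sketch using pdfplumber, one of
the libraries named above; the folder and file names are placeholders, not paths
used in our project.

import pdfplumber

def extract_text(pdf_path: str) -> str:
    # Extract text page by page, preserving reading order where possible
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            pages.append(page.extract_text() or "")
    return "\n".join(pages)

text = extract_text("Upload/sample.pdf")
print(text[:500])  # quick sanity check of the extraction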
2. Effective Chunking for Retrieval:
PDF documents often contain large blocks of text, which can be challenging for
retrieval models to handle effectively. Chunking the text into smaller, contextu-
ally coherent pieces can improve retrieval performance.
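As a minimal sketch of one simple strategy, the snippet below uses fixed-size
chunking with character overlap; the chunk size and overlap are illustrative
values, not tuned recommendations.

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200):
    # Split text into overlapping character windows to keep local context
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap keeps sentences from being cut off
    return chunks

chunks = chunk_text("some long extracted PDF text " * 200)
print(len(chunks), "chunks")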
– Use Logging for Monitoring: Implement logging to capture detailed in-
formation about the PDF processing steps, including successes, failures, and
any anomalies. This is important for debugging and optimizing the applica-
tion over time.
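A minimal logging sketch is shown below; the log file name and message format
are assumptions, not project conventions.

import logging

logging.basicConfig(
    filename="pdf_processing.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

try:
    logging.info("Extracting text from %s", "Upload/sample.pdf")
    # ... extraction and chunking steps would run here ...
except Exception as exc:
    logging.error("Failed to process %s: %s", "Upload/sample.pdf", exc)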
By following these key considerations and best practices, we can effectively
process PDFs for RAG applications, ensuring high-quality text extraction, re-
trieval, and generation. This approach ensures that your RAG models are strong,
efficient, and capable of delivering meaningful insights from complex PDF doc-
uments.
3 Study Design
This section presents the methodology for building a Retrieval Augmented Gen-
eration (RAG) system that integrates PDF documents as a primary knowledge
source. This system combines the retrieval capabilities of information retrieval
(IR) techniques with the generative strengths of Large Language Models (LLMs)
to produce factually accurate and contextually relevant responses, grounded in
domain-specific documents.
The goal is to design and implement a RAG system that addresses the limita-
tions of traditional LLMs, which rely solely on static, pre-trained knowledge. By
incorporating real-time retrieval from domain-specific PDFs, the system aims to
deliver responses that are not only contextually appropriate but also up-to-date
and factually reliable.
The system begins with the collection of relevant PDFs, including research
papers, legal documents, and technical manuals, forming a specialized knowledge
base. Using tools and libraries, the text is extracted, cleaned, and preprocessed
to remove irrelevant elements such as headers and footers. The cleaned text is
then segmented into manageable chunks, ensuring efficient retrieval. These text
segments are converted into vector embeddings using transformer-based models
like BERT or Sentence Transformers, which capture the semantic meaning of
the text. The embeddings are stored in a vector database optimized for fast
similarity-based retrieval.
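As a concrete illustration of this step, here is a minimal sketch using
sentence-transformers and FAISS; the model name, example chunks, and the value
of k are illustrative assumptions rather than choices made in our system.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunks = ["chunk one of a PDF ...", "chunk two of a PDF ..."]

embeddings = model.encode(chunks)                 # shape: (n_chunks, dim)
index = faiss.IndexFlatL2(embeddings.shape[1])    # exact L2 similarity search
index.add(np.asarray(embeddings, dtype="float32"))

query = model.encode(["What does chunk one say?"])
distances, ids = index.search(np.asarray(query, dtype="float32"), k=2)
print(ids[0])  # indices of the most similar chunks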
The RAG system architecture consists of two key components: a retriever,
which converts user queries into vector embeddings to search the vector database,
and a generator, which synthesizes the retrieved content into a coherent, factual
response. Two types of models are considered: OpenAI’s GPT models, accessed
through the Assistant API for ease of integration, and the open-source Llama
model, which offers greater customization for domain-specific tasks.
In developing the system, several challenges are addressed, such as managing
complex PDF layouts (e.g., multi-column formats, embedded images) and main-
taining retrieval efficiency as the knowledge base grows. These challenges were
highlighted during a preliminary evaluation process, where participants pointed
out the difficulty of handling documents with irregular structures. Feedback from
the evaluation also emphasized the need for improvements in text extraction and
chunking to ensure coherent retrieval.
The design also incorporates the feedback from a diverse group of partici-
pants during a workshop session, which focused on the practical aspects of imple-
menting RAG systems. Their input highlighted the effectiveness of the system’s
real-time retrieval capabilities, particularly in knowledge-intensive domains, and
underscored the importance of refining the integration between retrieval and
generation to enhance the transparency and reliability of the system’s outputs.
This design sets the foundation for a RAG system capable of addressing the
needs of domains requiring precise, up-to-date information.
4.1.2 Setting Up an IDE After installing Python, the next step is to set
up an Integrated Development Environment (IDE) to write and execute your
Python code. We recommend Visual Studio Code (VSCode); however, you are
free to choose an editor of your own. Below are the setup instructions for
VSCode.
1. Download and Install VSCode
– Visit the official VSCode website: https://code.visualstudio.com/.
– Select your operating system (Windows, macOS, or Linux) and follow
the instructions for installation.
With your virtual environment now configured, you are ready to install
project specific dependencies and manage Python packages independently for
each approach. This setup allows you to create separate virtual environments
for the two approaches outlined in Sections 4.2.1 and 4.2.2. By isolating your
dependencies, you can ensure that the OpenAI Assistant API-based (Section 4.2.1)
and Llama-based (Section 4.2.2) Retrieval Augmented Generation (RAG) systems
are developed and managed in their respective environments without conflicts
or dependency issues.
development workflows for both models, ensuring that each approach functions
optimally with its specific requirements.
Table 2: Comparison of RAG Approaches: OpenAI vs. Llama

Feature                   | OpenAI's Assistant API (GPT Series)                      | Llama (Open-Source LLM Model)
Ease of Use               | High. Simple API calls with no model management          | Moderate. Requires setup and model management
Customization             | Limited to prompt engineering and few-shot learning      | High. Full access to model fine-tuning and adaptation
Cost                      | Pay-per-use pricing model                                 | Upfront infrastructure costs; no API fees
Deployment Flexibility    | Cloud-based; depends on OpenAI's infrastructure          | Highly flexible; can be deployed locally or in any cloud environment
Performance               | Excellent for a wide range of general NLP tasks          | Excellent, particularly when fine-tuned for specific domains
Security and Data Privacy | Data is processed on OpenAI servers; privacy concerns may arise | Full control over data and model; suitable for sensitive applications
Support and Maintenance   | Strong support, documentation, and updates from OpenAI   | Community-driven; updates and support depend on community efforts
Scalability               | Scalable through OpenAI's cloud infrastructure           | Scalable depending on infrastructure setup
Control Over Updates      | Limited; depends on OpenAI's release cycle               | Full control; users can decide when and how to update or modify the model
4.2.1 Using OpenAI’s Assistant API : GPT Series While the OpenAI
Completion API is effective for simple text generation tasks, the Assistant API
is a superior choice for developing RAG systems. The Assistant API supports
multi-modal operations (such as text, images, audio, and video inputs) by
combining text generation with file searches, code execution, and API calls.
For a RAG system, this means an assistant can retrieve documents, generate
vector embeddings, search for relevant content, augment user queries with addi-
tional context, and generate responses—all in a seamless, integrated workflow.
It includes memory management across sessions, so the assistant remembers
past queries, retrieved documents, or instructions. Assistants can be configured
with specialized instructions, behaviors, and parameters, in addition to custom
tools, which makes this API far more powerful for developing RAG systems.
This subsection provides a step-by-step guide and code snippets to utilize
OpenAI's File Search tool within the Assistant API, as illustrated in Fig. 2,
to implement RAG. The diagram shows how, after domain-specific data ingestion
of supported files (such as PDFs, DOCX, JSON, etc.), data preprocessing and
vectorization are handled by the Assistant API. These vectors are stored in the
OpenAI Vector Store, which the File Search tool can query to retrieve relevant
content. The assistant then augments the context and generates accurate
responses based on specialized instructions and the retrieved information.
This integrated process is covered in detailed steps below:

Fig. 2: OpenAI's Assistant API Workflow
Store your API key securely. A .env file is used to securely store environ-
ment variables, such as your OpenAI API key.
(b) Create a .env File: Inside your new folder, make a file called .env. This
file will store your OpenAI API key.
(c) Add Your API Key: Open the .env file and paste your OpenAI API
key in this format:
OPENAI_API_KEY=your_openai_api_key_here
If you need the specific versions of these tools used for the code in the
GitHub repository, you can install them like this:
pip install python-dotenv==1.0.1 openai==1.37.2
(f) Create the Main Python File: In the same folder, create a new
file called main.py. All the code snippets in this entire section (4.2.1)
should be implemented within this file.
To interact with the OpenAI API and load environment variables, you need
to import the necessary libraries. The dotenv library will be used to load
environment variables from the .env file.
Next, you need to load the environment variables from the .env file and set
up the OpenAI API client. This is important for authenticating your re-
quests to the OpenAI service and setting up the connection to interact with
OpenAI’s Assistant API.
import os
import openai
from dotenv import load_dotenv

# Load environment variables, then set the OpenAI key and model
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
client = openai.OpenAI(api_key=openai_api_key)
model_name = "gpt-4o"  # Any model from the GPT series
Organize the Knowledge Base Files: After selecting the PDF(s) for
your external knowledge base, create a folder named Upload in the project
directory, and place all the selected PDFs inside this folder.
The following Python code defines a function to upload multiple PDF files
from a specified directory to OpenAI vector store, which is a common data
structure used for storing and querying high-dimensional vectors, often for
machine learning and AI applications. It ensures the directory and files are
valid before proceeding with the upload and collects and returns the up-
loaded files’ IDs.
NOTE: The function is called to run only when a new vector store is created,
meaning it won’t upload additional files to an existing vector store. You can
modify the logic as needed to suit your requirements.
Also, please be aware that the files are stored on external servers, such as
OpenAI’s infrastructure. OpenAI has specific policies regarding data access
and usage to protect user privacy and data security. They state that they
do not use customer data to train their models unless explicitly permitted by
the user. For more details, refer to https://openai.com/policies/privacy-policy/.
Additionally, the stored data can easily be deleted when necessary, either via
the API (https://platform.openai.com/docs/api-reference/files/delete) or from the
user interface by clicking the delete button at
https://platform.openai.com/storage/files/.
        file_ids = {}
        # Get all PDF file paths from the directory
        file_paths = [os.path.join(directory_path, file)
                      for file in os.listdir(directory_path)
                      if file.endswith(".pdf")]
        # ... (the upload calls themselves are omitted in this excerpt;
        #      see the sketch below)
        return file_ids
    except Exception as e:
        print(f"Error uploading files to vector store: {e}")
        return None
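The upload step itself is not shown in the excerpt above. The following is a
hypothetical sketch, not the original code: the function name is an assumption,
and the combination of client.files.create with client.beta.vector_stores.files.create
is one possible way to implement the behavior described in the text.

import os

def upload_pdfs_to_vector_store(client, vector_store_id, directory_path):
    # Upload every PDF in directory_path and attach it to the vector store
    if not os.path.isdir(directory_path):
        print(f"Directory not found: {directory_path}")
        return None
    file_ids = {}
    file_paths = [os.path.join(directory_path, f)
                  for f in os.listdir(directory_path) if f.endswith(".pdf")]
    try:
        for path in file_paths:
            with open(path, "rb") as fh:
                uploaded = client.files.create(file=fh, purpose="assistants")
            client.beta.vector_stores.files.create(
                vector_store_id=vector_store_id, file_id=uploaded.id)
            file_ids[os.path.basename(path)] = uploaded.id
        return file_ids
    except Exception as e:
        print(f"Error uploading files to vector store: {e}")
        return None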
OpenAI Vector stores are used to store files for use by the file search tool in
Assistant API. This step involves initializing a vector store for storing vector
embeddings of documents and retrieving them when needed.
    try:
        # List all existing vector stores
        vector_stores = client.beta.vector_stores.list()
        # ... (check for an existing store or create a new one;
        #      see the sketch below)
    except Exception as e:
        print(f"Error creating or retrieving vector store: {e}")
        return None
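A hypothetical sketch of the create-or-retrieve logic described here; the
function name is an assumption, and the calls are limited to
client.beta.vector_stores.list() and client.beta.vector_stores.create().

def get_or_create_vector_store(client, store_name):
    # Reuse an existing vector store with this name, or create a new one
    try:
        for store in client.beta.vector_stores.list().data:
            if store.name == store_name:
                print("Vector store already exists with ID: " + store.id)
                return store
        store = client.beta.vector_stores.create(name=store_name)
        print("Created new vector store with ID: " + store.id)
        return store
    except Exception as e:
        print(f"Error creating or retrieving vector store: {e}")
        return None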
Once the functions to upload PDF file(s) and to create a vector store are
defined, you can call them to create the knowledge base for your project by
providing a vector store name and storing the result in a vector store object,
as shown below:
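A minimal usage sketch, assuming the hypothetical helper names from the sketches
above and the Upload folder described earlier; the store name is a placeholder.

vector_store = get_or_create_vector_store(client, "rag-knowledge-base")
if vector_store is not None:
    uploaded_ids = upload_pdfs_to_vector_store(client, vector_store.id, "Upload")
    print("Uploaded files:", uploaded_ids)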
After setting up the vector store, the next step is to create an AI assistant
using the OpenAI API. This assistant will be configured with specialized
instructions and tools to perform RAG tasks effectively. Set the assistant
name, description, and instructions properties accordingly. Refer to the
best practices; if needed, you can also adjust the temperature and top_p
values, as the project requires, for more random or more deterministic
responses.
Code Example: Create and Configure Assistant
    try:
        assistants = client.beta.assistants.list()
        for assistant in assistants.data:
            if assistant.name == assistant_name:
                print("AI Assistant already exists with ID: " + assistant.id)
                return assistant
        # ... (otherwise create a new assistant; see the sketch below)
    except Exception as e:
        print(f"Error creating or retrieving assistant: {e}")
        return None
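The creation branch is not shown in the excerpt above. Here is a hypothetical
sketch of how it might look, reusing assistant_name, model_name, and the vector
store object from earlier steps; the description, instructions, and temperature
value are illustrative assumptions.

assistant = client.beta.assistants.create(
    name=assistant_name,
    description="Answers questions grounded in the uploaded PDFs.",
    instructions=("You are a helpful assistant. Answer using only the "
                  "retrieved document context and cite your sources."),
    model=model_name,
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
    temperature=0.2,
)
print("Created AI Assistant with ID: " + assistant.id)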
Common Mistakes and Best Practices
This capability is especially important when there is a need to use the same
AI assistant with different tools for different threads. Tools can be managed
dynamically to suit the requirements of topic-specific threads, reusing the
same assistant across different contexts or overriding the assistant's tools
for a specific thread, as sketched below.
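A minimal sketch of such an override, assuming an assistant and a thread already
exist from earlier steps; the tools list and the additional instructions are
illustrative.

run = client.beta.threads.runs.create(
    thread_id=message_thread.id,
    assistant_id=assistant.id,
    tools=[{"type": "file_search"}],   # override: only file_search for this run
    additional_instructions="Answer strictly from the retrieved documents.",
)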
6. Initiating a Run
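The run-creation code itself is not reproduced here. A minimal sketch of how the
thread, user message, and run might be created with the beta Threads endpoints
follows; the question text is a placeholder.

message_thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=message_thread.id,
    role="user",
    content="Summarize the key points of the uploaded PDFs.",
)
run = client.beta.threads.runs.create(
    thread_id=message_thread.id,
    assistant_id=assistant.id,
)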
while True:
    run_status = client.beta.threads.runs.retrieve(
        run.id, thread_id=message_thread.id)
    if run_status.status == 'completed':
        break
    elif run_status.status == 'failed':
        # last_error carries the failure details reported by the API
        raise Exception(f"Run failed: {run_status.last_error}")
    time.sleep(1)
while True:
    response_messages = client.beta.threads.messages.list(
        thread_id=message_thread.id)
    # Stop polling once the assistant has posted a reply
    if any(msg.role == "assistant" and msg.content
           for msg in response_messages.data):
        break
if citations:
    print("\nSources:", ", ".join(citations))
print("\n")
Code Example:
import os
import fitz # PyMuPDF for reading PDFs
        text_file_name = os.path.splitext(file_name)[0] + ".txt"
        text_file_path = os.path.join(text_folder, text_file_name)
Functionality: Converts all PDFs in the Data folder to text files and saves
them in the DataTxt folder.
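A hypothetical reconstruction of the conversion loop described above; the
function name is an assumption, while the Data and DataTxt folder names come
from the text.

import os
import fitz  # PyMuPDF for reading PDFs

def convert_pdfs_to_text(pdf_folder="Data", text_folder="DataTxt"):
    # Convert every PDF in pdf_folder into a plain-text file in text_folder
    os.makedirs(text_folder, exist_ok=True)
    for file_name in os.listdir(pdf_folder):
        if not file_name.endswith(".pdf"):
            continue
        doc = fitz.open(os.path.join(pdf_folder, file_name))
        text = "".join(page.get_text() for page in doc)
        doc.close()
        text_file_name = os.path.splitext(file_name)[0] + ".txt"
        with open(os.path.join(text_folder, text_file_name), "w",
                  encoding="utf-8") as out:
            out.write(text)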
Code Example:
import os
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
vector_store = FAISS.from_texts(texts, embeddings)
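A hypothetical sketch of how the indexing step might be wrapped into a reusable
function; the folder name, index path, and function name are assumptions, while
the embedding model matches the snippet above.

import os
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

def build_faiss_index(text_folder="DataTxt", index_path="faiss_index",
                      embedding_model="sentence-transformers/all-MiniLM-L6-v2"):
    # Read every converted text file and index it for similarity search
    texts = []
    for file_name in os.listdir(text_folder):
        if file_name.endswith(".txt"):
            with open(os.path.join(text_folder, file_name),
                      encoding="utf-8") as f:
                texts.append(f.read())
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
    vector_store = FAISS.from_texts(texts, embeddings)
    vector_store.save_local(index_path)  # persist the index for later retrieval
    return vector_store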
Running the model with Ollama (for example, with the command ollama run
llama3.1) lets you ask questions directly in the terminal and interact with
the model in real time.
(d) Step 4: Test Ollama in VS Code
Create a .bat file to avoid typing the full path each time. Open Notepad and
add this:
@echo off
"C:\path\to\Ollama\ollama.exe" %*
Replace the path with your own. Save the file as ollama.bat directly in
your project folder.
In the VS Code terminal, run the .bat file with the command:
.\ollama.bat run Llama3.1
Code Example:
import os
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama

# Create the RAG system using FAISS and Ollama (Llama 3.1)
def create_rag_system(index_path,
                      embedding_model='sentence-transformers/all-MiniLM-L6-v2',
                      model_name="Llama3.1"):
    # Load the FAISS index
    vector_store = load_faiss_index(index_path, embedding_model)
    You are an expert assistant with access to the following context
    extracted from documents. Your job is to answer the user's question
    as accurately as possible, using the context below.

    Context:
    {context}

    # ... (prompt template, LLM, and chain construction continue;
    #      see the sketch below)
    return qa_chain
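The pieces of create_rag_system are split across the fragments above. The
following is a hypothetical end-to-end sketch, not the original code: the
load_faiss_index helper is assumed to wrap FAISS.load_local, and the chain_type
and prompt wiring are illustrative choices.

from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama

def load_faiss_index(index_path, embedding_model):
    # Reload the FAISS index saved earlier, using the same embedding model
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
    return FAISS.load_local(index_path, embeddings,
                            allow_dangerous_deserialization=True)

def create_rag_system(index_path,
                      embedding_model="sentence-transformers/all-MiniLM-L6-v2",
                      model_name="Llama3.1"):
    vector_store = load_faiss_index(index_path, embedding_model)
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template=("You are an expert assistant with access to the following "
                  "context extracted from documents. Your job is to answer "
                  "the user's question as accurately as possible, using the "
                  "context below.\n\nContext:\n{context}\n\n"
                  "Question: {question}\n"),
    )
    llm = Ollama(model=model_name)  # local Llama 3.1 served by Ollama
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vector_store.as_retriever(),
        chain_type="stuff",
        chain_type_kwargs={"prompt": prompt},
    )
    return qa_chain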
while True:
    user_question = input("Ask your question (or type 'exit' to quit): ")
    if user_question.lower() == "exit":
        print("Exiting the RAG system.")
        break
    answer = get_answer(user_question, rag_system)
    print(f"Answer: {answer}")
Functionality: The script takes a user query from the terminal, retrieves
relevant documents using FAISS, and then generates an answer from the retrieved
context with Ollama and Llama 3.1.
Ollama runs Llama 3.1 as a local language model entirely on the user's machine,
ensuring data privacy and response times that depend largely on the system's
hardware. The accuracy of its outputs depends on the quality of its training
data, and it can be further improved by fine-tuning with
domain-specific knowledge. Fine-tuning involves retraining the model with
specialized datasets, allowing it to internalize specific organizational knowl-
edge for more precise and relevant responses. This process keeps the model
updated and tailored to the user’s needs while maintaining privacy.
5.2 Participants
in machine learning, natural language processing (NLP), and using tools for Re-
trieval Augmented Generation (RAG). The participants did, however, have
familiarity with the Python language and OpenAI models.
Fig. 6: Most Valuable Aspects of the Workshop.
Fig. 8: Comments and suggestions for improving the guide
6 Discussion
Practitioners in fields like healthcare, legal analysis, and customer support
often struggle with static models that rely on outdated or limited knowledge.
RAG models provide a practical solution by pulling in real-time data from
provided sources. The ability to explain and trace how RAG models reach their
answers also builds trust where accountability and decision making based on
real evidence are important.
In this paper, we developed a RAG guide that we tested in a workshop setting,
where participants set up and deployed RAG systems following the approaches
mentioned. This contribution is practical, as it helps practitioners implement
RAG models to address real world challenges with dynamic data and improved
accuracy. The guide provides users clear, actionable steps to integrate RAG into
their workflows, contributing to the growing toolkit of AI driven solutions.
With that, RAG also opens new research avenues that can shape the future
of AI and NLP technologies. As these models and tools improve, there are many
potential areas for growth, such as finding better ways to search for information,
adapting to new data automatically, and handling more than just text (like
images or audio). Recent advancements in tools and technologies have further
accelerated the development and deployment of RAG models. As RAG models
continue to evolve, several emerging trends are shaping the future of this field.
1. Haystack: An open-source framework that integrates dense and sparse re-
trieval methods with large-scale language models. Haystack supports real-
time search applications and can be used to develop RAG models that per-
form tasks such as document retrieval, question answering, and summariza-
tion [4].
2. Elasticsearch with Vector Search: Enhanced support for dense vector
search capabilities, allowing RAG models to perform more sophisticated re-
trieval tasks. Elasticsearch’s integration with frameworks like Faiss enables
hybrid retrieval systems that combine the strengths of both dense and sparse
search methods, optimizing retrieval speed and accuracy for large datasets[3].
3. Integration with Knowledge Graphs: Researchers are exploring ways to
integrate RAG models with structured knowledge bases such as knowledge
graphs. This integration aims to improve the factual accuracy and reasoning
capabilities of the models, making them more reliable for knowledge-intensive
tasks[8].
4. Adaptive Learning and Continual Fine-Tuning: There is a growing
interest in adaptive learning techniques that allow RAG models to contin-
uously fine-tune themselves based on new data and user feedback. This ap-
proach aims to keep models up-to-date and relevant in rapidly changing
information environments[7].
5. Cross-Lingual and Multimodal Capabilities: Future RAG models are
expected to expand their capabilities across different languages and modal-
ities. Incorporating cross-lingual retrieval and multimodal data processing
can make RAG models more versatile and applicable to a wider range of
global and multimedia tasks[2].
7 Conclusions
The development of Retrieval Augmented Generation (RAG) systems offers a
new way to improve large language models by grounding their outputs in real-
time, relevant information. This paper covers the main steps for building RAG
systems that use PDF documents as the data source. With clear examples and
code snippets, it connects theory with practice and highlights challenges like
handling complex PDFs and extracting useful text. It also looks at the options
available, with examples of using proprietary APIs like OpenAI’s GPT and, as
an alternative, open-source models like Llama 3.1, helping developers choose the
best tools for their needs.
By following the recommendations in this guide, developers can avoid com-
mon mistakes and ensure their RAG systems retrieve relevant information and
generate accurate, fact-based responses. As technology advances in adaptive
learning, multi-modal capabilities, and retrieval methods, RAG systems will play
a key role in industries like healthcare, legal research, and technical documen-
tation. This guide offers a solid foundation for optimizing RAG systems and
extending the potential of generative AI in practical applications.
References
1. Avi Arampatzis, Georgios Peikos, and Symeon Symeonidis. Pseudo relevance feed-
back optimization. Information Retrieval Journal, 24(4–5):269–297, May 2021.
2. Md Chowdhury, John Smith, Rajesh Kumar, and Sang-Woo Lee. Cross-lingual and
multimodal retrieval-augmented generation models. IEEE Transactions on Multi-
media, 27(2):789–802, 2024.
3. Elasticsearch. Integrating dense vector search in elasticsearch. Elastic Technical
Blog, 2023.
4. Haystack. The haystack framework for neural search. Haystack Project Documen-
tation, 2023.
5. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin,
Naman Goyal, and Sebastian Riedel. Retrieval-augmented generation for knowledge-
intensive nlp tasks. In Advances in Neural Information Processing Systems
(NeurIPS 2020), 2020.
6. Hang Li, Ahmed Mourad, Shengyao Zhuang, Bevan Koopman, and Guido Zuccon.
Pseudo relevance feedback with deep language models and dense retrievers: Suc-
cesses and pitfalls. ACM Transactions on Information Systems, 41(3):1–40, April
2023.
7. Percy Liang, Wen-tau Wu, Douwe Kiela, and Sebastian Riedel. Best practices for
training large language models: Lessons from the field. IEEE Transactions on Neural
Networks and Learning Systems, 34(9):2115–2130, 2023.
8. Chenyan Xiong, Zhuyun Dai, Jamie Callan, and Jie Liu. Knowledge-enhanced lan-
guage models for information retrieval and beyond. IEEE Transactions on Knowl-
edge and Data Engineering, 36(5):1234–1247, 2024.
Appendix
1. Tampere University, “Cost Estimation for RAG Application Using GPT-4o”,
Zenodo, Sep. 2024. doi: 10.5281/zenodo.13740032.