
Introduction

So much valuable information is trapped in PDF and image files. Luckily, we have powerful brains capable of processing those files to find specific information, which is great.

But how many of us, deep down, wouldn't like to have a tool that can answer any question about a given document?

That is the whole purpose of this article. I will explain, step by step, how to build a system that can chat with any PDF and image files.

General Workflow of the project

It’s always good to have a clear understanding of the main components of the system being built. So let’s get started.

- First, the user submits the document to be processed, which can be in PDF or image format.
- A second module detects the format of the file so that the relevant content extraction function is applied.
- The content of the document is then split into multiple chunks using the Data Splitter module.
- Those chunks are finally transformed into embeddings by the Chunk Transformer before being stored in the vector store.
- At the end of the process, the user’s query is used to find the relevant chunks containing the answer to that query, and the result is returned as JSON to the user.
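Before diving in, here is a rough sketch of that flow in Python. It is only a roadmap: every helper it calls is implemented step by step in the sections below.

# Roadmap only: each helper is built in the sections that follow
def answer_question(file_path, query):
    content = extract_file_content(file_path)        # detect the file type and extract the text
    chunks = text_splitter.split_text(content)       # split the content into overlapping chunks
    doc_search = get_doc_search(chunks)              # embed the chunks and index them in the vector store
    documents = doc_search.similarity_search(query)  # retrieve the chunks relevant to the query
    return chain({"input_documents": documents,      # answer the query from those chunks
                  "question": query}, return_only_outputs=True)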

1. Detect document type

For each input document, specific processing is applied depending on its type, whether it is a PDF or an image.


This can be achieved with the helper function detect_document_type combined with the guess function from the filetype Python module.

from filetype import guess

def detect_document_type(document_path):
    # Guess the file type from the file's magic bytes
    guess_file = guess(document_path)
    file_type = ""
    image_types = ['jpg', 'jpeg', 'png', 'gif']

    if(guess_file.extension.lower() == "pdf"):
        file_type = "pdf"

    elif(guess_file.extension.lower() in image_types):
        file_type = "image"

    else:
        file_type = "unknown"

    return file_type

Now we can test the function on two types of documents:

- transformer_paper.pdf is the Transformers research paper from arXiv.
- zoumana_article_information.png is the image document containing information about the main topics I have covered on Medium.
research_paper_path = "./data/transformer_paper.pdf"
article_information_path = "./data/zoumana_article_information.png"

print(f"Research Paper Type: {detect_document_type(research_paper_path)}")


print(f"Article Information Document Type:
{detect_document_type(article_information_path)}")

Output:

File types successfully detected (Image by Author)

Both file types are successfully detected by the detect_document_type function.

2. Extract content based on document type

The langchain library provides different modules to extract the content of a given type of document:

- UnstructuredImageLoader extracts image content.
- UnstructuredFileLoader extracts the content of PDF and text files.

We can combine these modules with the above detect_document_type function to implement the end-to-end text extraction logic within the extract_file_content function.

Let’s see them in action! 🔥


from langchain.document_loaders.image import UnstructuredImageLoader
from langchain.document_loaders import UnstructuredFileLoader

def extract_file_content(file_path):
    # Pick the right loader based on the detected file type
    file_type = detect_document_type(file_path)

    if(file_type == "pdf"):
        loader = UnstructuredFileLoader(file_path)

    elif(file_type == "image"):
        loader = UnstructuredImageLoader(file_path)

    documents = loader.load()
    documents_content = '\n'.join(doc.page_content for doc in documents)

    return documents_content

Now, let’s print the first 400 characters of each file content.
research_paper_content = extract_file_content(research_paper_path)
article_information_content = extract_file_content(article_information_path)

nb_characters = 400

print(f"First {nb_characters} Characters of the Paper: \n{research_paper_content[:nb_characters]} ...")
print("---"*5)
print(f"First {nb_characters} Characters of Article Information Document :\n{article_information_content[:nb_characters]} ...")
Output:

The first 400 characters of each of the above documents are shown below:

- The research paper content starts with "Provided proper attribution is provided" and ends with "Jakob Uszkoreit* Google Research [email protected]".
- The image document's content starts with "This document provides a quick summary" and ends with "Data Science section covers basic to advance concepts".


3. Chat Implementation

The input document is broken into chunks; then an embedding is created for each chunk before implementing the question-answering logic.

a. Document chunking

The chunks represent smaller segments of a larger piece of text. This process is essential to ensure that a piece of content is represented with as little noise as possible, making it semantically relevant.

Multiple chunking strategies can be applied. For instance, we have NLTKTextSplitter, SpacyTextSplitter, RecursiveCharacterTextSplitter, CharacterTextSplitter, and more.

Each one of these strategies has its own pros and cons.

The main focus of this article is the CharacterTextSplitter, which creates chunks from the input documents based on the \n\n separator and measures each chunk's length (length_function) by its number of characters.

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator = "\n\n",
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len,
)
The chunk_size means that we want a maximum of 1,000 characters in each chunk: a smaller value results in more chunks, while a larger one generates fewer chunks.

It is important to note that the choice of chunk_size can affect the overall result. So, a good approach is to try different values and choose the one that best fits one's use case.

Also, the chunk_overlap means that we want a maximum of 200 overlapping characters between consecutive chunks.

For instance, imagine that we have a document containing the text Chat with your documents using LLMs and want to apply chunking with Chunk Size = 10 and Chunk Overlap = 5. The process is explained in the image below:

We can see that we end up with a total of 7 chunks for an input document of 35 characters (spaces included).
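For intuition, here is a minimal sliding-window sketch that reproduces this arithmetic by hand. Note that it only illustrates the size/overlap logic; CharacterTextSplitter itself first splits on the separator before merging pieces:

text = "Chat with your documents using LLMs"  # 35 characters, spaces included
chunk_size = 10
chunk_overlap = 5
step = chunk_size - chunk_overlap  # each new chunk starts 5 characters after the previous one

chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
print(len(chunks))  # 7
print(chunks)       # consecutive chunks share 5 characters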

But why do we use these overlaps in the first place?

By including these overlaps, the CharacterTextSplitter ensures that the underlying context is maintained between consecutive chunks, which is especially useful when working with long documents.

Similarly to chunk_size, there is no fixed value for chunk_overlap. Different values need to be tested to choose the one that gives the best results.
Now, let’s see their application in our scenario:

research_paper_chunks = text_splitter.split_text(research_paper_content)
article_information_chunks = text_splitter.split_text(article_information_content)

print(f"# Chunks in Research Paper: {len(research_paper_chunks)}")
print(f"# Chunks in Article Document: {len(article_information_chunks)}")

Output:

For a larger document like the research paper, we get far more chunks (51) than for the one-page article document, which yields only 2.

b. Create embeddings of the chunks

We can use the OpenAIEmbeddings module, which uses the text-embedding-ada-002 model by default, to create the embeddings of the chunks.

Instead of text-embedding-ada-002, we can use a different model (e.g. gpt-3.5-turbo-0301) by changing the following parameters (see the sketch after this list):

- model = "gpt-3.5-turbo-0301"
- deployment = "<DEPLOYMENT-NAME>", which corresponds to the name given during the deployment of the model. Its default value is also text-embedding-ada-002.
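As an illustration only (the values below are placeholders showing the defaults, and the exact parameters depend on your langchain version and OpenAI setup), overriding these would look something like this:

from langchain.embeddings.openai import OpenAIEmbeddings

# Hypothetical override; both values are placeholders, not recommendations
embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002",  # swap in another embedding model name here
    deployment="<DEPLOYMENT-NAME>",  # name given when the model was deployed
)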

For simplicity’s sake, we will stick to the default parameter values in this tutorial. But before that, we need to acquire the OpenAI credentials; all the steps are provided in the following article.

from langchain.embeddings.openai import OpenAIEmbeddings
import os

os.environ["OPENAI_API_KEY"] = "<YOUR_KEY>"
embeddings = OpenAIEmbeddings()
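As a quick sanity check (the sample string is arbitrary), we can embed a single piece of text; embed_query returns the embedding as a plain list of floats, 1,536 dimensions for text-embedding-ada-002:

# Quick sanity check: embed one string and inspect the vector size
sample_vector = embeddings.embed_query("Chat with your documents using LLMs")
print(len(sample_vector))  # 1536 for text-embedding-ada-002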

c. Create document search

To get the answer to a given query, we need to create a vector store that finds the closest matching chunk to that query.

Such a vector store can be created using the from_texts function from the FAISS module. The function takes two main parameters: the list of chunks (passed here as text_splitter) and the embeddings defined previously.

from langchain.vectorstores import FAISS

def get_doc_search(text_splitter):
    # Embed each chunk and index the vectors in a FAISS store
    return FAISS.from_texts(text_splitter, embeddings)


By running get_doc_search on the research paper chunks, we can see that the result is a FAISS vector store object. The result would have been of the same type if we had used article_information_chunks.

doc_search_paper = get_doc_search(research_paper_chunks)
print(doc_search_paper)

Output:

Vector store of the research paper (Image by Author)
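With the vector store in place, we can already query it directly. As a quick illustration (the question is an arbitrary example), similarity_search returns the chunks closest to the query, four by default:

# Illustrative direct query against the index (returns 4 chunks by default)
similar_chunks = doc_search_paper.similarity_search("What is self-attention?")
print(len(similar_chunks))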

d. Start chatting with your documents

Congrats on making it that far! 🎉

The chat_with_file function implements the end-to-end logic of the chat by combining all the above functions along with the similarity_search function.

The final function takes two parameters:

- the file we want to chat with, and
- the query provided by the user.

from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

chain = load_qa_chain(OpenAI(), chain_type = "map_rerank",
                      return_intermediate_steps=True)

def chat_with_file(file_path, query):
    # Extract the content and split it into chunks
    file_content = extract_file_content(file_path)
    file_chunks = text_splitter.split_text(file_content)

    # Index the chunks, then retrieve the ones most relevant to the query
    document_search = get_doc_search(file_chunks)
    documents = document_search.similarity_search(query)

    results = chain({
                        "input_documents": documents,
                        "question": query
                    },
                    return_only_outputs=True)
    answers = results['intermediate_steps'][0]

    return answers

Let’s take a step back to properly understand what is happening in the above code block.

- load_qa_chain provides an interface for performing question answering over a set of documents. In this specific case, we are using the default OpenAI GPT-3 large language model.
- The chain_type is map_rerank: the load_qa_chain function returns the answers based on a confidence score given by the chain. Other chain_type values can be used, such as map_reduce, stuff, refine, and more. Each one has its own pros and cons.
- By setting return_intermediate_steps=True, we can access metadata such as the above confidence score.

Its output is a dictionary with two keys: the answer to the query and the confidence score.
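For reference, with map_rerank and return_intermediate_steps=True, the raw chain result looks roughly like the sketch below; the values are invented for illustration:

# Illustrative shape of the chain's raw output (values are made up):
# results = {
#     "intermediate_steps": [{"answer": "...", "score": "100"}, ...],  # one per chunk
#     "output_text": "..."
# }
answers = results["intermediate_steps"][0]  # the {"answer": ..., "score": ...} pair returned to the user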
We can finally chat with our files, starting with the image document:

Chat with the image document

To chat with the image document, we provide the path to the document and the question we want the model to answer.

query = "What is the document about"

results = chat_with_file(article_information_path, query)

answer = results["answer"]
confidence_score = results["score"]

print(f"Answer: {answer}\n\nConfidence Score: {confidence_score}")

Output:

The model is 100% confident in its response. By looking at the first paragraph of the original document below, we can see that the model's response is indeed correct.

First two paragraphs of the original article image document (Image by Author)
One of the most interesting parts is that it provided a brief summary of the main topics covered in the document (statistics, model evaluation metrics, SQL queries, etc.).

Chat with the PDF file

The process with the PDF file is similar to the one in the above
section.

query = "Why is the self-attention approach used in this document?"

results = chat_with_file(research_paper_path, query)

answer = results["answer"]
confidence_score = results["score"]

print(f"Answer: {answer}\n\nConfidence Score: {confidence_score}")

Output:

Once again, we get a 100% confidence score from the model, and the answer to the question looks correct!

Result of a query on the PDF document (Image by Author)

In both cases, the model was able to provide a human-like response in a few seconds. Making a human go through the same process would take minutes, even hours, depending on the length of the document.

Conclusion
Congratulations!!!🎉

I hope this article provided enough tools to help you take your
knowledge to the next level. The code is available on my GitHub.

In my next article, I will explain how to integrate this system into a nice user interface. Stay tuned!

Also, if you enjoy reading my stories and wish to support my writing, consider becoming a Medium member. It’s $5 a month, giving you unlimited access to thousands of Python guides and data science articles.

By signing up using my link, I will earn a small commission at no extra cost to you.

Join Medium with my referral link - Zoumana Keita
As a Medium member, a portion of your membership fee goes to writers you read, and you get full access to every story…
zoumanakeita.medium.com

Feel free to follow me on Twitter and YouTube, or say hi on LinkedIn.

Let’s connect here for a 1–1 discussion

Before you leave, there are more great resources below you might be
interested in reading!
Introduction to Text Embeddings with the OpenAI API

How to Extract Text from Any PDF and Image for Large Language
Model
