Introduction
from filetype import guess

def detect_document_type(document_path):
    """Return "pdf", "image", or "unknown" based on the file's magic bytes."""
    guess_file = guess(document_path)
    image_types = ["jpg", "jpeg", "png", "gif"]
    if guess_file is None:
        file_type = "unknown"
    elif guess_file.extension.lower() == "pdf":
        file_type = "pdf"
    elif guess_file.extension.lower() in image_types:
        file_type = "image"
    else:
        file_type = "unknown"
    return file_type
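The filetype package inspects the file's magic bytes rather than its name. If you just want a quick check and are willing to trust the filename, a stdlib-only variant (a sketch, not part of the original code) could look like this:

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}

def detect_by_extension(document_path):
    """Classify a file by its extension alone.

    Unlike magic-byte detection, this trusts the filename,
    so a mislabeled file will be misclassified.
    """
    suffix = Path(document_path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return "unknown"

print(detect_by_extension("report.PDF"))   # pdf
print(detect_by_extension("figure.png"))   # image
```

The trade-off: this version needs no third-party dependency, but a renamed file (say, a PNG saved as .pdf) slips through, which is exactly what the magic-byte approach above catches.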
from langchain.document_loaders import UnstructuredFileLoader
from langchain.document_loaders.image import UnstructuredImageLoader

def extract_file_content(file_path):
    file_type = detect_document_type(file_path)
    if file_type == "pdf":
        loader = UnstructuredFileLoader(file_path)
    elif file_type == "image":
        loader = UnstructuredImageLoader(file_path)
    else:
        raise ValueError(f"Unsupported file type: {file_type}")
    documents = loader.load()
    documents_content = '\n'.join(doc.page_content for doc in documents)
    return documents_content
Now, let’s print the first 400 characters of each file content.
research_paper_content = extract_file_content(research_paper_path)
article_information_content = extract_file_content(article_information_path)
nb_characters = 400
print(research_paper_content[:nb_characters])
print(article_information_content[:nb_characters])
The first 400 characters of each of the above documents are shown
below:
a. Document chunking
Each one of these strategies has its own pros and cons.
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
The chunk_size parameter sets a maximum of 1,000 characters per chunk: a smaller value results in more chunks, while a larger one generates fewer.
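To see the effect of chunk_size without any dependencies, here is a deliberately simplified re-implementation of overlapping character chunking (a sketch only; the real CharacterTextSplitter also splits on the separator first):

```python
def simple_chunk(text, chunk_size, chunk_overlap):
    """Naive overlapping character chunker (illustration only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk starts chunk_size - chunk_overlap characters
    # after the previous one, so consecutive chunks share context.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "x" * 5000  # stand-in for a real document
print(len(simple_chunk(text, 1000, 200)))  # 7 chunks
print(len(simple_chunk(text, 500, 200)))   # 17 chunks
```

As the last two lines show, halving chunk_size on the same text roughly doubles the chunk count, which is the behavior described above.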
research_paper_chunks = text_splitter.split_text(research_paper_content)
article_information_chunks = text_splitter.split_text(article_information_content)
print(len(research_paper_chunks))
print(len(article_information_chunks))
Output:
For a larger document like the research paper, we get many more chunks (51) than for the one-page article document, which produces only 2.
Two values are worth noting here: model = "gpt-3.5-turbo-0301" is the chat model used to answer the questions, and deployment = "<DEPLOYMENT-NAME>" corresponds to the name given during the deployment of the embedding model; its default value is also text-embedding-ada-002.
import os
from langchain.embeddings.openai import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = "<YOUR_KEY>"
embeddings = OpenAIEmbeddings()
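Under the hood, similarity search compares the embedding of the query against the embedding of each chunk, typically with cosine similarity. A stdlib-only sketch of that ranking step, using made-up 3-dimensional vectors instead of real 1536-dimensional ada-002 embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for real chunk embeddings (invented values).
chunk_vectors = {
    "chunk about SQL": [0.9, 0.1, 0.0],
    "chunk about statistics": [0.1, 0.9, 0.2],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of the user's query

# Pick the chunk whose embedding points in the most similar direction.
best = max(chunk_vectors,
           key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]))
print(best)  # chunk about SQL
```

This is the intuition behind the similarity_search call used later: the vector store returns the chunks whose embeddings score highest against the query embedding.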
from langchain.vectorstores import FAISS

def get_doc_search(text_splitter):
    # Reconstructed body (the original was lost): build a vector store from
    # the text chunks and their embeddings. FAISS is an assumption here; any
    # LangChain vector store with similarity_search would work.
    return FAISS.from_texts(text_splitter, embeddings)

doc_search_paper = get_doc_search(research_paper_chunks)
print(doc_search_paper)
def chat_with_file(chain, text_splitter, query):
    # "chain" is a question-answering chain (e.g. one built with
    # load_qa_chain); the wrapper's name and signature are reconstructed.
    document_search = get_doc_search(text_splitter)
    documents = document_search.similarity_search(query)
    results = chain({
        "input_documents": documents,
        "question": query,
    },
        return_only_outputs=True)
    answers = results['intermediate_steps'][0]
    return answers
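To make the shape of that returned dictionary concrete, here is a mocked example; the answer text and score below are invented, and in practice they are filled in by the question-answering chain:

```python
# Mocked chain output illustrating the structure parsed above
# (results['intermediate_steps'][0]). Values are invented.
mock_results = {
    "intermediate_steps": [
        {"answer": "The document covers statistics and SQL queries.",
         "score": "100"}
    ],
    "output_text": "The document covers statistics and SQL queries.",
}

answers = mock_results["intermediate_steps"][0]
answer = answers["answer"]
confidence_score = answers["score"]
print(answer, confidence_score)
```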
Its output is a dictionary with two keys: the answer to the query and the confidence score.
We can finally chat with our files, starting with the image document:
answer = results["answer"]
confidence_score = results["score"]
Output:
(Screenshot: first two paragraphs of the original article image document. Image by Author.)
One of the most interesting parts is that it provided a brief summary of the main topics covered in the document (statistics, model evaluation metrics, SQL queries, etc.).
The process with the PDF file is similar to the one in the above
section.
answer = results["answer"]
confidence_score = results["score"]
Output:
Once again, we get a 100% confidence score from the model, and the answer to the question looks correct!
Conclusion
Congratulations!!!🎉
I hope this article provided enough tools to help you take your
knowledge to the next level. The code is available on my GitHub.
Before you leave, there are more great resources below you might be
interested in reading!
Introduction to Text Embeddings with the OpenAI API
How to Extract Text from Any PDF and Image for Large Language
Model