
ZHYPER AI: ADVANCING AUTONOMOUS COMPUTER

OPERATIONS WITH LARGE LANGUAGE MODELS

A PROJECT REPORT

Submitted by
MUKESH ANAND G 211521243104
JERO FRANCIS S 211521243077
CHANDRU T 211521243033

in partial fulfillment for the award of the degree

of

BACHELOR OF TECHNOLOGY
IN

ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

PANIMALAR INSTITUTE OF TECHNOLOGY


ANNA UNIVERSITY, CHENNAI 600 025
JUNE 2025

PANIMALAR INSTITUTE OF TECHNOLOGY
ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “ZHYPER AI: ADVANCING AUTONOMOUS


COMPUTER OPERATIONS WITH LARGE LANGUAGE MODELS” is the
Bonafide work of “MUKESH ANAND G, JERO FRANCIS S, CHANDRU T” who
carried out the project work under my supervision.

SIGNATURE SIGNATURE

Dr. T. KALAI CHELVI, M.E, Ph.D., Mrs. K. SARANYA, M.E.,


SUPERVISOR
HEAD OF THE DEPARTMENT ASSISTANT PROFESSOR
Department of Artificial Department of Artificial
Intelligence and Data Science, Intelligence and Data Science,

Panimalar Institute of Technology Panimalar Institute of Technology

Poonamallee, Chennai 600 123

Certified that the candidates were examined in the university project


viva-voce held on ________________at Panimalar Institute of Technology,
Chennai 600 123.

INTERNAL EXAMINER EXTERNAL EXAMINER

ACKNOWLEDGEMENT
A project of this magnitude and nature requires kind co-operation and support from
many, for successful completion. We wish to express our sincere thanks to all those who
were involved in the completion of this project.
We seek the blessings from the Founder of our institution Dr. JEPPIAAR, M.A.,
Ph.D., for having been a role model who has been our source of inspiration behind our
success in education in his premier institution.
We would like to express our deep gratitude to our beloved Secretary and
Correspondent Dr. P. CHINNADURAI, M.A., Ph.D., for his kind words and enthusiastic
motivation which inspired us a lot in completing this project.
We also express our sincere thanks and gratitude to our dynamic Directors
Mrs. C. VIJAYA RAJESHWARI, Dr. C. SAKTHI KUMAR, M.E., Ph.D., and
Dr. SARANYA SREE SAKTHI KUMAR, B.E, M.B.A., Ph.D., for providing us with
necessary facilities for completion of this project.
We also express our appreciation and gratefulness to our respected Principal
Dr. T. JAYANTHY, M.E., Ph.D., who helped us in the completion of the project.
We wish to convey our thanks and gratitude to our Head of the Department,
Dr. T. KALAI CHELVI, M.E, Ph.D., for her full support by providing ample time to
complete our project.
We express our indebtedness and special thanks to our supervisor,
Mrs. K. SARANYA, M.E., for her expert advice and valuable information and guidance
throughout the completion of the project.
Last, we thank our parents and friends for providing their extensive moral support and
encouragement during the project.

TABLE OF CONTENTS

CHAPTER TITLE PAGE NO

LIST OF TABLES VIII
LIST OF FIGURES IX
LIST OF SYMBOLS XI
LIST OF ABBREVIATIONS XII
ABSTRACT XIII

1 INTRODUCTION 2

2 LITERATURE SURVEY 5

3 EXISTING SYSTEM 11
  3.1 EXISTING SYSTEM 12
  3.2 PROBLEM DEFINITION 12
  3.3 PROPOSED SYSTEM 14
  3.4 ADVANTAGES

4 REQUIREMENTS SPECIFICATIONS 18
  4.1 INTRODUCTION 18
  4.2 HARDWARE AND SOFTWARE REQUIREMENTS
    4.2.1 HARDWARE REQUIREMENTS
    4.2.2 SOFTWARE REQUIREMENTS 20
      4.2.2.1 PYTHON
      4.2.2.2 GOOGLE COLAB
      4.2.2.3 INTERFACE
      4.2.2.4 CORE TECHNIQUES 21

5 SYSTEM DESIGN 26
  5.1 INTRODUCTION 26
  5.2 SYSTEM ARCHITECTURE
  5.3 UML DIAGRAM
    5.3.1 USE CASE DIAGRAM 28
    5.3.2 CLASS DIAGRAM 30
    5.3.3 SEQUENCE DIAGRAM 33
    5.3.4 ACTIVITY DIAGRAM 34
    5.3.5 COMPONENT DIAGRAM 37
    5.3.6 DEPLOYMENT DIAGRAM
  5.4 DESIGN CONSIDERATIONS 38
  5.5 TOOLS AND TECHNOLOGIES USED 38
  5.6 SUMMARY

6 IMPLEMENTATION 40
  6.1 INTRODUCTION 40
  6.2 SYSTEM ENVIRONMENT
  6.3 MODULE WISE IMPLEMENTATION
    6.3.1 PDF UPLOAD AND PREPROCESSING MODULE 41
    6.3.2 EMBEDDING GENERATION MODULE 42
    6.3.3 VECTOR DATABASE MANAGEMENT 42
    6.3.4 USER QUERY HANDLING MODULE 42
    6.3.5 USER INTERFACE DESIGN
  6.4 DATASET PREPARATION AND TESTING 43
  6.5 SUMMARY 43

7 TECHNIQUES USED
  7.1 OCR
    7.1.1 INTRODUCTION 45
    7.1.2 PYTESSERACT OCR 45
    7.1.3 EASYOCR 47
    7.1.4 KERAS OCR
    7.1.5 COMPARISON OF OCR 51
    7.1.6 OVERALL EFFICIENCY
  7.2 VECTOR DB
    7.2.1 INTRODUCTION 51
    7.2.2 CHROMA DB 52
    7.2.3 WEAVIATE DB
    7.2.4 FAISS DB 56
    7.2.5 COMPARISON OF VECTOR DB 57
    7.2.6 OVERALL EFFICIENCY
  7.3 SENTENCE TRANSFORMER 57
    7.3.1 INTRODUCTION 58
    7.3.2 DIFFERENCE BETWEEN PARAGRAPH AND SENTENCE TRANSFORMER
    7.3.3 all-MiniLMv2 59
  7.4 RAG (RETRIEVAL-AUGMENTED GENERATION)
    7.4.1 INTRODUCTION 60
    7.4.2 OLLAMA NOMIC EMBEDDINGS 61
    7.4.3 GEMMA2 MODEL
  7.5 INTERFACE 61
    7.5.1 STREAMLIT

8 PERFORMANCE ANALYSIS 64
  8.1 EXISTING MODEL 66
  8.2 FINE-TUNED MODEL
  8.3 DIFFERENCE BETWEEN TWO MODELS

9 CONCLUSION AND FUTURE SCOPE 68
  9.1 CONCLUSION
  9.2 FUTURE SCOPE

10 APPENDICES 72-76

11 REFERENCES 78-79

LIST OF TABLES

S.NO NAME OF THE TABLES PAGE.NO
1 HARDWARE COMPONENTS AND SPECIFICATIONS
2 OCR TYPES AND ITS PROS & CONS
3 VECTOR DB TYPES AND ITS PROS & CONS
4 SENTENCE TRANSFORMER PROS & CONS
5 RAG COMPONENTS
6 COMPARISON OF VECTOR DB
7 DIFFERENCE BETWEEN PARAGRAPH AND SENTENCE TRANSFORMERS
8 DIFFERENCE BETWEEN EXISTING AND FINE-TUNED MODEL

LIST OF FIGURES

S.NO NAME OF THE FIGURES PAGE.NO

1
ARCHITECTURE DIAGRAM

2
USE CASE DIAGRAM

3
CLASS DIAGRAM

4 SEQUENCE DIAGRAM

5 ACTIVITY DIAGRAM

6 COMPONENT DIAGRAM

7 DEPLOYMENT DIAGRAM

8 COMPARISON OF OCR

9 EXISTING MODEL OUTPUT

10 FINE TUNED MODEL OUTPUT

LIST OF SYMBOLS

S.NO NAME NOTATION DESCRIPTION
1 Initial Node: Start of the workflow.
2 Activity / Action: Represents a single step or task in the process.
3 Decision Node: Represents a decision point where the flow branches based on conditions.
4 Control Flow (→): Indicates the direction of flow from one node to another.
5 Final Node (◎): Represents the end of the activity flow.
6 Guard Condition ([condition], in brackets): A Boolean expression that must be true for the transition to be taken.
7 Attribute (+name: type, e.g., +id: int): Declares a property of a class; "+" denotes public visibility.
8 Method / Operation (+methodName(): returnType): Defines an operation of the class.
9 Association (───────────): A line connecting classes; shows a relationship between them.
10 Multiplicity (1, *, 0..1, 1..*, etc.): Indicates how many instances of one class relate to another.
11 Generalization: Represents inheritance.
12 Aggregation (◇): Represents a "whole-part" relationship where parts can exist independently.
13 Composition (◆, black diamond at one end): Represents a "strong whole-part" relationship; parts cannot exist separately.
14 Use Case: Represents a functionality or service the system provides.
15 Actor: Represents a user or external system interacting with the system.
16 Data: Represents a data type.
17 Cylinder: Represents the vector DB used to store data.
18 Double arrow: Represents a bidirectional relationship or communication between two entities.
19 Rectangle with three parts: Represents a class in UML diagrams.
20 Composition: Represents a strong "has-a" relationship between two classes.

ABSTRACT

Zhyper AI is an autonomous system designed to revolutionize human-computer interaction by leveraging multiple Large Language Models (LLMs) to execute tasks with minimal user intervention. By interpreting natural language commands, the system translates high-level user instructions into precise automated actions, such as typing, clicking, and navigating complex software environments. The architecture of Zhyper AI comprises a graphical user interface (GUI) for intuitive command input, a processing core that collaborates with LLMs to determine task execution steps, an interpreter that converts instructions into executable commands, and an executor that simulates user interactions. A key innovation of Zhyper AI is its real-time feedback loop, which adapts dynamically through screenshot-based monitoring to ensure accurate execution. This report explores the system's architecture, capabilities, and potential applications across automation, software development, and creative workflows. Additionally, it discusses future advancements, including multimodal AI integration for visual task comprehension. By enabling fully autonomous computing, Zhyper AI redefines efficiency, accessibility, and scalability in digital task automation.

CHAPTER 1
INTRODUCTION

1.1 AN OVERVIEW OF PROJECT

The project introduces ScreenAgent, a cutting-edge intelligent agent powered by large language models (LLMs) and designed for autonomous interaction with digital screens. This agent addresses a crucial need in the field of artificial intelligence: developing systems that can observe, understand, and operate graphical user interfaces (GUIs) on real computers. ScreenAgent operates through a three-phase architecture: planning, execution, and reflection. Drawing inspiration from Kolb's Experiential Learning Cycle, the reflection module enables the agent to learn from its actions and improve over time, closely mimicking human learning patterns. Using visual input from screenshots and basic control commands like mouse clicks and keyboard strokes, the agent interacts with environments via the VNC protocol. It supports task execution on Windows and Linux desktops and is evaluated using the newly developed ScreenAgent dataset.

This dataset includes diverse screen task scenarios along with a fine-grained evaluation metric. Comparative testing with GPT-4V and other state-of-the-art Vision-Language Models (VLMs) reveals that while existing models perform well, they lack precision in interactions.

1.2 SCOPE OF THE PROJECT

The scope of this project involves the development and deployment of a fully autonomous screen-interaction agent, ScreenAgent, that leverages the capabilities of Vision-Language Models (VLMs). The agent is built to perceive screen content visually and perform intelligent operations like clicking, typing, navigating, and tool invocation without manual input. A major focus is on creating a reinforcement learning environment using the VNC protocol to allow real-time interaction with operating systems such as Windows and Linux. The project also involves designing a modular pipeline that includes planning, action execution, and reflection, enhancing the system's learning and adaptability.

Another significant aspect is the creation of the ScreenAgent dataset, comprising task sequences and benchmarks for evaluating performance.

CHAPTER 2
RELATED WORK
2.1 MULTIMODAL LARGE LANGUAGE MODELS

Multimodal LLMs like LLAMA, Vicuna, and GPT-4 show strong contextual
understanding and text generation. GPT-4V extends GPT-4 with vision, enabling
image-based interactions. LLAVA and LLAVA-1.5 connect CLIP with Vicuna for
multimodal tasks. Fuyu-8B uses a pure decoder transformer without an image
encoder. CogVLM supports multi-turn visual dialogue at high resolution. Monkey
enhances input resolution through efficient training.

2.2 COMPUTER CONTROL ENVIRONMENT & DATASET

Simulated environments train agents for GUI tasks like clicking and typing.
WebNav and MiniWoB++ test decision-making via browser-based tasks.
WebShop enables shopping-based automation, while SWDE and WebSRC support
info extraction and QA from webpages. Mind2Web aims at generalist agents, and
datasets like Seq2Act, Screen2Words, and META-GUI train agents on Android
UIs. These datasets create complex environments for LLM agents to learn screen
control.

2.3 LARGE LANGUAGE MODEL-DRIVEN AGENTS

LLMs have enhanced agent capabilities, with WebGPT enabling web-based question answering. ToolFormer integrates external tools like calculators and search engines. Voyager pioneers lifelong learning in Minecraft using LLMs. RecAgent adds memory reflection, and ProAgent automates tasks with LLMs. CogAgent focuses on GUI comprehension; AppAgent learns mobile app usage. The environment defines actions (JSON), states (screenshots), and flexible rewards for screen control tasks.
CHAPTER 3

FRAMEWORK

3.1 COMPUTER CONTROL ENVIRONMENT

We construct a computer control environment to assess the capabilities of VLM agents. This environment connects to a desktop operating system through the remote desktop (VNC) protocol and allows mouse and keyboard events to be sent to the controlled desktop. Formally, the environment is defined by its actions (JSON-style function calls), its states (screenshots of the controlled desktop), and its rewards for screen control tasks.

3.2 CONTROL PIPELINE
To guide the agent to continually interact with the environment and complete multi-step complex tasks, we designed a control pipeline comprising the Planning, Acting, and Reflecting phases. The whole pipeline is depicted in Fig. 2. The pipeline asks the agent to decompose the complex task, execute subtasks, and evaluate execution results. The agent has the opportunity to retry some subtasks or adjust previously established plans to accommodate the current occurrences.
Planning Phase. In the planning phase, based on the current screenshot, the agent needs to decompose the complex task, relying on its own common-sense knowledge and computer knowledge.
Acting Phase. In the acting phase, based on the current screenshot, the agent generates low-level mouse or keyboard actions as JSON-style function calls. The environment attempts to parse the function calls from the agent's response and convert them to device actions defined in the VNC protocol. Our environment then sends the actions to the controlled computer and captures the after-action screen as input for the next execution phase.
Reflecting Phase. The reflecting stage requires the agent to assess the current situation based on the after-action screen. The agent determines whether it needs to retry the current sub-task, go on to the next sub-task, or make some adjustments to the plan list. This phase is crucial within the control pipeline, providing flexibility to handle a variety of unpredictable circumstances.
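The exact JSON schema used by ScreenAgent is not reproduced in this report, so the short Python sketch below is only illustrative of how JSON-style mouse and keyboard function calls from the acting phase might be parsed and dispatched to the environment; the field names (action_type, x, y, text) and the helper send_vnc_event are hypothetical.

import json

# Hypothetical JSON-style function calls an agent might emit in the acting phase.
agent_response = """
[
  {"action_type": "mouse_click", "button": "left", "x": 412, "y": 230},
  {"action_type": "keyboard_type", "text": "hello world"}
]
"""

def send_vnc_event(action: dict) -> None:
    # Placeholder for the environment's VNC dispatch step (not part of this report's code).
    print(f"dispatching {action['action_type']} -> {action}")

# The environment parses the agent's response and converts each call into a device action.
for action in json.loads(agent_response):
    send_vnc_event(action)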

CHAPTER 4

SCREEN AGENT DATASET & CC-SCORE

4.1 INTRODUCTION

The ScreenAgent Dataset addresses the limitations of existing datasets focused primarily on web or Android environments by introducing a computer control dataset using mouse and keyboard interfaces. Collected from Linux and Windows systems, it covers a broad range of daily computing tasks, such as office work, entertainment, programming, and system operations, across 39 sub-task categories grouped into 6 themes. It includes 273 task sessions (203 for training, 70 for testing), using real screenshots to simulate practical scenarios.

4.2 HARDWARE REQUIREMENTS

* Standard desktop or laptop computer


* Mouse and keyboard interface
* Linux or Windows operating system
* Sufficient storage for image-based datasets
* GPU recommended for training Vision-Language Models

4.3 SOFTWARE REQUIREMENTS


* VNC protocol support for screen control simulation

* Python-based annotation tools

* Reinforcement Learning environment setup

* Libraries for computer vision, NLP, and BLEU scoring

* Frameworks for LLM/VLM integration (e.g., PyTorch, Transformers)

Table 1
Hardware components and specifications

Component | Minimum Specification | Recommended Specification
CPU | Quad-core Intel i7 @ 2.5 GHz | Octa-core Intel i7/i9 or AMD Ryzen 7 @ 3.0 GHz
RAM | 16 GB DDR4 | 32 GB DDR4
GPU | NVIDIA RTX 3060 (4 GB) | NVIDIA RTX 4090 (12 GB)
Storage | 512 GB SSD | 1 TB NVMe SSD
Network | 100 Mbps Ethernet/Wi-Fi | 1 Gbps Ethernet
Display | 1366×768 resolution | 1920×1080 or higher
Backup & Redundancy (opt.) | External HDD or NAS | RAID-configured NAS

Rationale:

1. CPU & RAM: Multi-core CPU and ample RAM are critical to parallelize
OCR tasks and handle large embedding matrices.
2. GPU: While OCR runs on CPU, embedding models (all-MiniLM-L6-v2)
leverage GPU for faster vector computation (<50 ms per chunk).
3. Storage: SSD or NVMe ensures rapid loading of PDF files, storage of
vector index, and database operations.
4. Network: High throughput and low latency enable seamless interactions
between Streamlit frontend and backend services.
5. Backup: Ensures business continuity and data integrity in case of hardware
failure.

CHAPTER 5
EXPERIMENT
5.1 INTRODUCTION:
In the experimental phase, we assessed OpenAI GPT-4V performance on the ScreenAgent test set, along with evaluations of three open-source VLMs. Furthermore, one of these models underwent fine-tuning to potentially augment its proficiency in screen control tasks. Subsequently, we conducted a thorough analysis of the outcomes and identified several typical cases to elucidate the inherent challenges of our task.

5.2 EVALUATION RESULTS ON SCREENAGENT TEST-SET

Apart from GPT-4V, we selected several recently released SOTA VLMs for testing, including LLaVA-1.5 [Liu et al., 2023a] and CogAgent [Hong et al., 2023]. LLaVA-1.5 is a 13B-parameter multimodal model; unfortunately, it only supports image inputs up to 336 × 336 px. CogAgent is an 18B-parameter visual language model designed for GUI comprehension and navigation. Leveraging dual image encoders for both low-resolution and high-resolution inputs, CogAgent demonstrates proficiency at a resolution of 1120 × 1120 px, allowing it to discern minute elements and text.

We test the models' capabilities from two aspects: the ability to follow instructions and output the correct function-call format, shown in Table 2, and the ability to complete specific tasks assigned by the user.

(Table caption: Training data proportions and division of the four training phases; percentages indicate the proportion of samples from each dataset at each phase.)

The assessment of these function calls for each attribute key focuses on whether the model can accurately execute the various functions encompassing the attribute items expected by the manual action annotations. Note that this evaluation does not consider the consistency of the attribute values with the golden labeling; it solely examines whether the model's output includes the necessary attribute keys. From the table, GPT-4V and LLaVA-1.5 achieved higher scores, while CogAgent and its upstream model CogAgent-VQA underperformed. CogAgent-VQA and CogAgent-Chat almost entirely disregarded the JSON-format action definitions in our prompts, resulting in a very low score on successful function calls and rendering them completely incapable of interacting with our environment. To ensure fairness in comparison, we utilize OpenAI GPT-3.5 to extract actions into JSON-style function calls from the original CogAgent-Chat responses, indicated as "CogAgent-Chat (helped by GPT-3.5)". Even so, its scores are significantly lower than those of LLaVA-1.5 and GPT-4V, although CogAgent has been trained on the Mind2Web web browsing simulation dataset.

Table 3 displays the fine-grained scores of predicted attribute values for each action within the successfully parsed function calls. As can be seen, GPT-4V remains the best performer, with an action-type prediction F1 score of 0.98. This implies that it can accurately select appropriate mouse or keyboard actions. Additionally, it can precisely choose the mouse action type, typing text, or pressing keys consistent with the golden label actions.

The ability for precise positioning is crucial in computer-controlling tasks. As indicated by the "Mouse Position" column in Table 3, current VLMs have not yet achieved the capability for precise positioning required for computer manipulation. GPT-4V refuses to give precise coordinate results in its answers, and the two open-source models also fail to output the correct coordinates with our pipeline prompt template.

Another significant challenge for all models is the reflection phase. In this phase, the agent is required to determine whether the subtask has been completed in the current state, and decide whether to proceed further or make some adjustments. This is crucial for constructing a continuous interactive process. Regrettably, all models show insufficient accuracy in this determination, with GPT-4V achieving only a 0.60 F1 score. This implies that human intervention is still necessary during task execution.

5.3 FINE-TUNING TRAINING

After vision fine-tuning, ScreenAgent achieved the same level of following instructions and making function calls as GPT-4V on our dataset, as shown in Table 2. In Table 3, ScreenAgent also reached a comparable level to GPT-4V. Notably, our ScreenAgent far surpasses existing models in the precision of mouse clicking. This indicates that vision fine-tuning effectively enhances the model's precise positioning capabilities. Additionally, we observed that ScreenAgent has a significant gap compared to GPT-4V in terms of task planning, highlighting GPT-4V's common-sense knowledge and task-planning abilities.

5.4 CASE STUDY

To evaluate our ScreenAgent model on computer control tasks, we provide two cases. In Fig. 7, we present a case illustrating the workflow of ScreenAgent executing a chain of actions. In Fig. 8, we compare different agents in executing the details of the three phases in the pipeline. Fig. 8 (a) shows the planning process of all the agents, where we find that our ScreenAgent produces the most concise and effective plan. Fig. 8 (b) presents four different click action tasks, each representing a step in a specific task. Results show that LLaVA clicks on the bottom-left corner on all screens, CogAgent may fail to generate click positions, and in the fourth task, only our agent can correctly click on the position. Fig. 8 (c) shows that our agent can recognize whether an action needs to be retried after reflection and successfully execute the action following a failure.

CHAPTER 6
IMPLEMENTATION

6.1 INTRODUCTION
The implementation phase is the pivotal step in transforming the design blueprints
of the INFOQUERY Interactive AI Model into a functioning reality. This phase is
responsible for translating high-level system architecture, component designs, and
theoretical strategies into actual code and working modules that meet the outlined
specifications and use cases. Implementation is not just coding—it is a disciplined
engineering process involving testing, optimization, debugging, and integration. It
ensures that the system performs its intended function reliably, efficiently, and
securely.
The INFOQUERY system is designed to serve as an AI-driven tool capable of
interpreting, extracting, and responding to user queries on PDF documents. This
makes it especially useful for individuals working in academia, research, and
enterprise document management. The following sections describe the detailed
implementation process, including system requirements, tools used, and in-depth
explanations of each module involved.
6.2 SYSTEM ENVIRONMENT
Implementing an AI-based PDF question-answering tool requires a mix of
programming languages, libraries, hardware, and platforms. Below is an overview
of the environment and tools used:
- Programming Language: Python 3.10
- Frontend Framework: Streamlit (used for building the interactive user interface)
- Libraries Used:
- PyMuPDF: For extracting textual content from PDF files.
- pytesseract: For applying Optical Character Recognition on image-based PDFs
- SentenceTransformers: To convert text content into semantic vectors using
transformer-based models.
- ChromaDB: To manage and query high-dimensional embeddings for
semantic search.
- LangChain: To connect queries and retrieved content with a large language
model for generating answers.
- HuggingFace Transformers: For language model support.
- Operating System: Windows 11
- Hardware Requirements:
Minimum: 8 GB RAM, 2.4 GHz quad-core processor.
Recommended: GPU-enabled systems for accelerated embeddings.

6.3 MODULE-WISE IMPLEMENTATION


INFOQUERY is implemented as a modular system, where each component
performs a defined role in the pipeline. Below are detailed explanations for each
module:

6.3.1 PDF UPLOAD AND PREPROCESSING MODULE

This module facilitates PDF file upload using Streamlit’s `file_uploader`


component. Upon file upload, the following sub-tasks are executed:
- File Type Identification: Text-based or image-based PDF.
- Text Extraction: PyMuPDF is used for reading text layers in PDFs.
- OCR Handling: If no text is detected, Tesseract OCR is triggered to extract text
from images.
- Preprocessing: Includes text cleaning, removal of special characters, and
segmentation into manageable chunks (e.g., 100-word segments).
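As a concrete illustration of this module, the sketch below shows one way the extraction-with-OCR-fallback and chunking steps could look, using PyMuPDF (fitz), pytesseract, and a simple word-based chunker. The function names and the 100-word chunk size follow the description above, but the code is an illustrative sketch rather than the project's actual implementation.

import io
import re
import fitz                      # PyMuPDF, reads the text layer of each PDF page
import pytesseract
from PIL import Image

def extract_text(pdf_path: str) -> str:
    """Read text from each page; fall back to Tesseract OCR for image-only pages."""
    text_parts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text().strip()
            if not text:                                  # image-based page: render it and run OCR
                pix = page.get_pixmap(dpi=300)
                img = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(img)
            text_parts.append(text)
    return "\n".join(text_parts)

def preprocess(text: str, words_per_chunk: int = 100) -> list[str]:
    """Clean the text and split it into roughly 100-word chunks."""
    text = re.sub(r"[^\w\s.,;:?!-]", " ", text)           # drop special characters
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

chunks = preprocess(extract_text("sample.pdf"))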

6.3.2 EMBEDDING GENERATION MODULE


To enable semantic search, the preprocessed text must be transformed into dense
vector representations. This is achieved using the SentenceTransformers library.
- Model Used: `all-MiniLM-L6-v2`
- Chunk Embedding: Each text chunk is converted into a 384-dimensional
embedding vector.
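A minimal sketch of this step with the sentence-transformers library is shown below; it assumes the chunk list produced by the preprocessing module and simply encodes each chunk into a 384-dimensional vector.

from sentence_transformers import SentenceTransformer

# Load the pre-trained all-MiniLM-L6-v2 model (384-dimensional output vectors).
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["first 100-word chunk ...", "second 100-word chunk ..."]   # from the preprocessing module
embeddings = model.encode(chunks, show_progress_bar=False)

print(embeddings.shape)   # (number_of_chunks, 384)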

6.3.3 VECTOR DATABASE MANAGEMENT


INFOQUERY supports ChromaDB for storing and querying embeddings. These
vector databases allow similarity search using cosine distance.
- Indexing: Each embedding is indexed with metadata (chunk number, page
number).
- Search: During a query, user input is embedded and compared against stored
vectors.
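The sketch below shows one plausible way of indexing the chunk embeddings with simple metadata in ChromaDB and running a similarity query; the collection name, metadata fields, and sample texts are illustrative only.

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["Zhyper AI executes tasks autonomously.", "InfoQuery answers questions over PDFs."]

client = chromadb.Client()   # in-memory; chromadb.PersistentClient(path="db/") persists to disk
collection = client.get_or_create_collection("infoquery_chunks")

# Index each chunk embedding with metadata (here only the chunk number; a page number could be added).
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=model.encode(chunks).tolist(),
    metadatas=[{"chunk": i} for i in range(len(chunks))],
)

# Similarity search: embed the user query and fetch the closest stored chunks.
results = collection.query(
    query_embeddings=model.encode(["How does InfoQuery handle PDFs?"]).tolist(),
    n_results=2,
)
print(results["documents"][0])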
6.3.4 USER QUERY HANDLING MODULE
Users interact with the system via a Streamlit-based frontend. A typical query
undergoes the following stages:
- Embedding via SentenceTransformer.
- Similarity search using vector database.
- Selection of top-k relevant chunks.
- Prompt engineering to concatenate chunks with user query.
- Language model response generation
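Putting the stages above together, a simplified query-handling function might look like the sketch below; model and collection are assumed to come from the previous modules, and generate_answer is only a placeholder for whichever LLM call (for example via LangChain, HuggingFace, or Ollama) a deployment actually uses.

def generate_answer(prompt: str) -> str:
    # Placeholder for the actual LLM call used by the deployment.
    return "LLM response for: " + prompt[:60] + "..."

def answer_query(query: str, model, collection, top_k: int = 3) -> str:
    # 1. Embed the user query with the same SentenceTransformer used for the chunks.
    query_embedding = model.encode([query]).tolist()

    # 2-3. Similarity search in the vector database and selection of the top-k chunks.
    hits = collection.query(query_embeddings=query_embedding, n_results=top_k)
    context = "\n\n".join(hits["documents"][0])

    # 4. Prompt engineering: concatenate the retrieved chunks with the user query.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

    # 5. Response generation by the language model.
    return generate_answer(prompt)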

6.3.5 USER INTERFACE DESIGN


Using Streamlit, the interface allows:
- File uploads.
- Display of extracted chunks.
- Input field for questions.
- Real-time display of generated answers.
- Additional features like re-ask, download, and debug.
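A stripped-down version of such an interface is sketched below; the widget layout is only indicative, and the commented-out call refers to the hypothetical answer_query helper from the earlier module sketch.

import streamlit as st

st.title("InfoQuery - Ask questions about your PDF")

uploaded = st.file_uploader("Upload a PDF", type=["pdf"])
question = st.text_input("Ask a question about the document")

if uploaded is not None:
    with open("uploaded.pdf", "wb") as f:
        f.write(uploaded.getvalue())            # persist the upload for the extraction step
    st.success("PDF uploaded and preprocessed.")

    if question and st.button("Get answer"):
        # In the full pipeline this would run extraction, retrieval and generation, e.g.:
        # answer = answer_query(question, model, collection)
        answer = f"(answer to: {question})"     # placeholder so the sketch runs standalone
        st.write(answer)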

6.4 DATASET PREPARATION AND TESTING
INFOQUERY uses dynamically uploaded PDFs, but internal testing was done
using sample datasets such as:
- IEEE research papers
- Wikipedia offline dumps (converted to PDFs)
- University curriculum and question banks
Testing was done to validate the performance in terms of:
- OCR accuracy (Tesseract)
- Semantic search precision
- Query relevancy and LLM coherence

6.5 SUMMARY
This chapter comprehensively described the practical implementation of
INFOQUERY—from backend embedding to frontend interactivity. Each
component has been designed to function both independently and cohesively as
part of a semantic Q&A pipeline over PDFs. By utilizing state-of-the-art
transformer models and OCR, INFOQUERY bridges the gap between document
access and intelligent interaction.

CHAPTER 7
TECHNIQUES USED

7.1 OCR (OPTICAL CHARACTER RECOGNITION):


7.1.1 INTRODUCTION:
Optical Character Recognition (OCR) is a technology that converts scanned
documents, PDFs, and images into editable, searchable, and machine-readable text.
It is widely used to digitize printed or handwritten documents, enabling faster
access, processing, and analysis of information. OCR allows systems to recognize
and extract text automatically from static formats, making information retrieval
much more efficient.
The OCR process typically involves:
 Image Preprocessing: Enhancing the quality of input images by removing
noise, adjusting contrast, and correcting distortions.

 Text Detection: Identifying and isolating text regions from non-text areas
like backgrounds and graphics.

 Character Recognition: Recognizing individual characters and words using


pattern matching, machine learning, or deep learning techniques.

 Post-Processing: Refining recognized text using dictionary matching and


language models to correct errors and improve output accuracy.

OCR serves as the foundational step in systems like InfoQuery by transforming


unstructured PDF content into searchable data, enabling intelligent querying and
efficient information extraction.
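As a small illustration of the preprocessing step described above, the snippet below shows a typical cleanup pass (grayscale conversion, light denoising, and Otsu binarization) with OpenCV before handing an image to an OCR engine; the specific operations are a common recipe, not the project's fixed pipeline, and the file names are illustrative.

import cv2

def preprocess_for_ocr(image_path: str):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)    # remove colour information
    gray = cv2.medianBlur(gray, 3)                  # light noise removal
    # Otsu's thresholding picks a global threshold automatically (contrast correction).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

clean = preprocess_for_ocr("scanned_page.png")
cv2.imwrite("scanned_page_clean.png", clean)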
7.1.2 PYTESSERACT OCR
PyTesseract is a Python wrapper for Google's open-source Tesseract OCR engine.
It provides a simple way to integrate Tesseract’s text recognition capabilities into
Python applications. PyTesseract is widely used for extracting text from images
and scanned PDF documents because of its reliability and broad language support.
The working of PyTesseract generally includes:
 Converting PDFs to images: Since Tesseract works on images, PDFs are
first converted page-by-page into images.

 Applying OCR on images: Tesseract analyzes the image and identifies the
textual content.

 Text extraction and formatting: Recognized text is extracted and can be


output in formats like plain text, TSV, PDF, or HTML.
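A minimal sketch of that flow is given below, assuming the pdf2image package (which wraps the Poppler utilities mentioned later as a dependency) for the PDF-to-image step; the file path and language code are illustrative.

import pytesseract
from pdf2image import convert_from_path    # requires the Poppler utilities

# 1. Convert the PDF page-by-page into images.
pages = convert_from_path("scanned_report.pdf", dpi=300)

# 2-3. Run Tesseract on each page image and collect the recognized text.
full_text = "\n".join(
    pytesseract.image_to_string(page, lang="eng") for page in pages
)
print(full_text[:500])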

Advantages of PyTesseract:
 Free and Open Source: Completely free to use with active community
support.

 Multilingual Support: Recognizes over 100 languages and allows


combining multiple languages during OCR.

 High Accuracy: Good for clean, well-formatted scanned documents,


especially when preprocessing techniques (like noise removal) are applied.

 Flexible Output: Can export results into different formats including plain
text, searchable PDFs, and TSV.

Disadvantages of PyTesseract:
 Low Performance on Noisy Images: Struggles with handwritten, low-
resolution, or complex layout documents unless heavy preprocessing is
done.

 Speed Limitations: Slower compared to newer deep-learning-based OCR


models, especially on large datasets.

 Dependency on External Software: Requires installation of both the


Tesseract engine and additional libraries like Poppler for PDF processing.

 Limited Deep Learning Features: Does not use neural networks for
recognition like newer OCR systems (such as EasyOCR or Keras OCR).

7.1.3 EASYOCR
EasyOCR is a deep learning-based Optical Character Recognition (OCR) library
developed by the Jaided AI team. It leverages neural networks to perform OCR
tasks, making it particularly effective for handling complex images and a wide
variety of languages. EasyOCR supports over 80 languages and provides state-of-
the-art accuracy for both printed and handwritten text extraction.
EasyOCR uses Convolutional Neural Networks (CNNs) and Recurrent Neural
Networks (RNNs) for text detection and recognition. The library is designed to be
simple to use and is built on top of PyTorch, allowing easy integration into deep
learning pipelines.
The working of EasyOCR generally includes:
 Preprocessing the image: EasyOCR automatically processes images to
detect text regions.

 Text detection and recognition: The OCR engine first detects the text areas
in the image, followed by recognition of the actual characters using deep
learning models.

 Text output: The recognized text can be returned in formats like JSON,
plain text, or a list of bounding box coordinates.
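The sketch below shows the typical EasyOCR calling pattern for a single image; the file name is illustrative.

import easyocr

# The Reader downloads the detection and recognition models on first use.
reader = easyocr.Reader(["en"])            # add more language codes as needed

# Each result is a (bounding_box, text, confidence) tuple.
for bbox, text, confidence in reader.readtext("invoice_scan.png"):
    print(f"{confidence:.2f}  {text}")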

Advantages of EasyOCR:
1. Deep Learning-based Technology: Utilizes CNNs and RNNs, which leads
to improved accuracy in recognizing text, especially in noisy or complex
images.

2. Wide Language Support: Supports over 80 languages, including non-Latin
scripts like Chinese, Arabic, and Hindi, making it versatile for international
applications.

3. Handles Complex Layouts and Curved Text: More effective than


traditional OCR tools for images with intricate layouts, rotated text, or
skewed text.

Disadvantages of EasyOCR:
1. Slower Performance on Simple Tasks: It can be slower than traditional
OCR engines like PyTesseract when dealing with clean, straightforward
documents or small text extractions.

2. Higher Computational Requirements: As a deep learning model,


EasyOCR requires more computational resources, such as a decent CPU or
GPU, to run effectively.

3. Less Customizable for Fine-Tuning: Unlike Tesseract, EasyOCR has


limited customization options, making it harder to fine-tune for specific
tasks or use cases like custom-trained models.

4. Challenges with Highly Noisy or Degraded Images: While it performs


well in many situations, EasyOCR can still struggle with heavily distorted,
noisy, or low-quality images unless well-preprocessed.

7.1.4 KERAS OCR


Keras OCR is an open-source Python package built on top of Keras and
TensorFlow, specifically designed for Optical Character Recognition (OCR) tasks.
It leverages deep learning models to perform both text detection and recognition,
offering a robust solution for extracting text from images. Keras OCR can handle
text in diverse and complex layouts, making it effective for various types of OCR
tasks.
Keras OCR uses advanced techniques, including convolutional neural networks
(CNNs) for detecting text regions and recurrent neural networks (RNNs) for
recognizing characters. The package is efficient for use in production pipelines and
integrates well with other machine learning and computer vision systems.
The working of Keras OCR generally includes:
 Text Detection: Identifies text regions within an image using CNN-based
models.

 Text Recognition: Recognizes and extracts the text in those regions using
sequence-to-sequence models like RNNs.

 Output: Provides recognized text in formats such as plain text, along with
bounding box coordinates for each detected text region.
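A short keras-ocr sketch matching that description follows; it assumes the keras-ocr package's standard pipeline API and an illustrative image path.

import keras_ocr

# The pipeline bundles a CNN-based text detector with a recurrent recognizer.
pipeline = keras_ocr.pipeline.Pipeline()

images = [keras_ocr.tools.read("street_sign.jpg")]
prediction_groups = pipeline.recognize(images)

# Each prediction is a (word, bounding_box) pair for one detected text region.
for word, box in prediction_groups[0]:
    print(word)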

Advantages of Keras OCR:


1. High Accuracy for Complex Texts: Keras OCR performs well in
recognizing text in diverse and complex images, including those with curved
or rotated text and complex backgrounds.

2. Deep Learning-based Models: The use of deep learning models for both
text detection and recognition ensures state-of-the-art accuracy, particularly
on images with varying layouts.

3. Flexibility in Handling Different Fonts and Handwriting: The deep


learning models used in Keras OCR are capable of handling a wide variety
of fonts and handwriting styles, making it adaptable to different types of
documents.

4. Open Source and Well-Documented: Keras OCR is open-source, and it


comes with good documentation and an active community, ensuring ongoing
improvements and support.

Disadvantages of Keras OCR:
1. Computationally Intensive: The deep learning models used in Keras OCR
require significant computational resources, such as a good CPU or GPU, for
optimal performance, which may not be feasible for low-resource
environments.

2. Slow Performance on Large Datasets: While Keras OCR is accurate, its


performance can degrade when processing large datasets, particularly on
machines without dedicated GPU support.

3. Training Data Sensitivity: The performance of Keras OCR is highly


dependent on the quality and diversity of training data. If the model hasn't
been trained on specific fonts or handwriting, it may perform poorly.

7.1.5 COMPARISON OF OCR:

Fig 8 Comparison of OCR

7.1.6 OVERALL EFFICIENCY:
When evaluating OCR solutions for PDFs, efficiency is determined by factors such
as accuracy, ease of integration, support for complex layouts, and processing speed.
PyTesseract is a reliable choice for straightforward text extraction due to its
balance of accuracy, ease of integration, and extensive language support. It
performs well with simple, well-formatted PDFs but may require additional
preprocessing for documents with complex layouts. On the other hand, EasyOCR
utilizes advanced deep learning models, offering high accuracy and effective
handling of noisy or complex text. It supports over 80 languages, making it
versatile for international PDFs, but can be more resource-intensive. Keras-OCR
stands out for its state-of-the-art deep learning models and end-to-end pipeline,
which excels in recognizing text from complex layouts and customized tasks.
However, it may require more setup and familiarity with Keras and TensorFlow.
Overall, Keras-OCR is highly efficient for handling intricate document layouts and
specialized OCR needs, while PyTesseract and EasyOCR offer excellent
performance with their own strengths, with EasyOCR being particularly adept at
dealing with complex text in challenging conditions.
7.2 VECTOR DATABASE:
7.2.1. INTRODUCTION:
A Vector Database is a specialized type of database designed to store, manage, and
query high-dimensional vectors, which are typically the output of machine learning
models, such as embeddings generated by neural networks. These vectors represent
data points in a continuous vector space and are often used in applications such as
natural language processing (NLP), image recognition, recommendation systems,
and more. Vector databases are optimized for handling vector operations like
similarity search, nearest neighbor search, and clustering, which are essential for
many machine learning and AI tasks.
Key Features of Vector Databases:
1. High-Dimensional Data Storage: Vectors are usually represented in high-
dimensional spaces. A vector database efficiently stores and indexes these
vectors for quick retrieval.

2. Similarity Search: One of the primary use cases for vector databases is finding
vectors that are similar to a given query vector. This is achieved through
techniques like nearest neighbor search, where the database searches for vectors
closest to a given query in terms of distance (e.g., Euclidean distance, cosine
similarity).

3. Support for Complex Queries: Vector databases allow for more advanced
queries like range queries and multi-query searches.

4. Scalability: These databases are designed to handle millions or even billions of


high-dimensional vectors, making them scalable to large datasets often used in
AI, machine learning, and big data applications.
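To make the nearest-neighbour idea in point 2 concrete, here is a tiny NumPy sketch of a brute-force cosine-similarity search over a handful of synthetic vectors; real vector databases replace this linear scan with optimized (often approximate) indexes.

import numpy as np

# A toy "database" of 4-dimensional embedding vectors.
db = np.array([
    [0.1, 0.9, 0.0, 0.2],
    [0.8, 0.1, 0.3, 0.0],
    [0.2, 0.8, 0.1, 0.1],
])
query = np.array([0.15, 0.85, 0.05, 0.15])

# Cosine similarity between the query and every stored vector.
sims = db @ query / (np.linalg.norm(db, axis=1) * np.linalg.norm(query))

nearest = int(np.argmax(sims))
print(f"nearest vector index: {nearest}, similarity: {sims[nearest]:.3f}")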

7.2.2 CHROMA DB:


ChromaDB is a modern, open-source vector database designed specifically to
handle, store, and query embeddings or vectors at scale. It focuses on providing
fast and efficient similarity search capabilities for high-dimensional data, making it
an excellent choice for machine learning and AI applications where large datasets
of vector embeddings are used. ChromaDB is optimized for both structured and
unstructured data, allowing it to seamlessly integrate with machine learning
workflows and perform semantic searches.
ChromaDB can be used for applications in recommendation systems, semantic
search, anomaly detection, and more, where you need to search for similar data
points (e.g., similar documents, images, or other entities) based on their vector
representations.

Advantages of ChromaDB:
1. High Performance: Optimized for fast similarity search, providing quick
results even with large vector datasets, ideal for real-time applications.

2. Scalability: Designed to scale horizontally, efficiently handling billions of


high-dimensional vectors without compromising performance.

3. Ease of Use: Offers a simple API for easy integration, reducing the complexity
of using vector search in applications.

Disadvantages of ChromaDB:
1. Memory Consumption: Storing large vector datasets can consume significant
memory, especially with high-dimensional vectors, requiring substantial
resources.

2. Complexity in Setup for Large Scale: Scaling ChromaDB for very large
datasets can be complex and require detailed optimization for performance.

3. Limited Query Types: Primarily focuses on similarity search, and may lack
advanced query capabilities for more complex database operations.

7.2.3 WEAVIATE DB:


Weaviate DB is an open-source vector search engine optimized for high-
dimensional vector data. It supports semantic search, enabling queries based on
meaning rather than exact matches, and integrates seamlessly with machine
learning models like BERT and OpenAI embeddings. Weaviate is designed for
scalability, supporting horizontal scaling to handle large datasets. It also allows
real-time updates and uses a graph-based data model, which helps represent
relationships between entities in a vector space, enhancing its capability for
complex queries and AI-driven applications.
Advantages of Weaviate DB:
1. Scalable: Handles large datasets and scales horizontally with ease.

2. Semantic Search: Offers meaning-based queries, improving search relevance.

3. ML Integration: Works well with machine learning models, enhancing AI-


driven applications.

Disadvantages of Weaviate DB:


1. Complex Setup: Initial configuration can be challenging, especially for
distributed environments.

2. Resource Intensive: Requires significant memory and storage, particularly for


large datasets.

3. Limited Queries: Primarily focuses on vector search and may lack support for
other complex queries.

7.2.4 FAISS DB:


FAISS (Facebook AI Similarity Search) is an open-source library developed by
Facebook for efficient similarity search and clustering of high-dimensional vectors.
It supports both exact and approximate nearest neighbor searches using various
indexing methods like Hierarchical Navigable Small World (HNSW) and Inverted
File Index (IVF). FAISS is optimized for fast vector searches in large datasets and
scales well to handle billions of vectors. It also supports GPU acceleration for
faster performance in large-scale applications. FAISS is widely used in AI and
machine learning applications like recommendation systems, image search, and
natural language processing due to its efficiency and scalability.
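A minimal FAISS sketch using the exact (flat) L2 index follows; the dimensionality and data are synthetic, and in practice an approximate index such as IVF or HNSW would be chosen for very large collections.

import numpy as np
import faiss

d = 64                                                  # embedding dimensionality
xb = np.random.random((1000, d)).astype("float32")      # database vectors
xq = np.random.random((5, d)).astype("float32")         # query vectors

index = faiss.IndexFlatL2(d)     # exact nearest-neighbour search with L2 distance
index.add(xb)                    # index the database vectors

distances, ids = index.search(xq, 4)   # 4 nearest neighbours per query
print(ids[0], distances[0])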

Advantages of FAISS DB:


1. High Performance: Extremely fast search capabilities, even with large
datasets, due to optimized algorithms and indexing methods.

2. Scalability: Can efficiently handle billions of high-dimensional vectors, making


it suitable for large-scale applications.

3. GPU Acceleration: Supports GPU-based processing for faster indexing and
search operations, enhancing performance.

4. Open Source: Free to use with a large community and extensive


documentation, making it accessible for developers and researchers.

Disadvantages of FAISS DB:


1. Memory Usage: Can be memory-intensive, especially when dealing with large
datasets, requiring significant hardware resources.

2. Complex Setup: The configuration of FAISS, especially for large-scale


deployments with GPU acceleration, can be complex and requires technical
expertise.

3. Limited Query Types: Primarily focused on vector search and lacks support
for more complex, traditional database queries.

7.2.5 COMPARISON OF VECTORDB:

Table 6
Comparison of Vector DB

FEATURE | CHROMA DB | WEAVIATE | FAISS
TYPE | Open-source | Open-source | Open-source library
GENERAL EFFICIENCY | A user-friendly interface, simplifying vector storage and retrieval. | A user-friendly interface with built-in support for a variety of data types. | While extremely fast, it requires careful tuning and may not support complex querying mechanisms.
FLEXIBILITY | Impressive ease of use. | Offers a GraphQL-based API providing efficient interactions. | It can handle vast amounts of data with efficiency.
SCHEMA ARCHITECTURE | Does not require a schema to be defined. | Automates the process of defining data structures (schema inference). | Does not require a schema structure.
LIMITATIONS | While it is easy to use, it may not offer the same level of customization and control as other options. | Its setup and initial configuration might be more complex compared to Chroma DB, especially for beginners. | FAISS is a low-level library and requires more manual setup, making it less user-friendly for those unfamiliar with its API.

7.2.6 OVERALL EFFICIENCY:
ChromaDB offers a user-friendly and cloud-native solution, optimized for quick
deployment and moderate scalability. It excels in ease of use, especially for small
to medium-sized applications that require seamless integration with common ML
models. While it may not match the advanced scalability and search speed of Faiss
or the complex querying capabilities of Weaviate, ChromaDB strikes a balance
between simplicity and performance. Its strength lies in providing efficient vector
storage and retrieval for applications where ease of setup and integration outweigh
the need for extreme-scale processing or custom search configurations.

7.3 SENTENCE TRANSFORMER:


7.3.1 INTRODUCTION:
A Sentence Transformer is a neural network model specifically designed to encode
entire sentences into fixed-size vector representations (embeddings) that capture
the semantic meaning of the sentence. Unlike traditional models that focus on
word-level embeddings, sentence transformers are optimized to handle the
meaning of sentences or even larger text spans, like paragraphs. Sentence
Transformers use pre-trained models, often built on transformer architectures like
BERT to map textual input into dense vectors that can be used for tasks like
semantic textual similarity, information retrieval, and clustering.
Features of Sentence Transformers:
 Semantic Representation: Transforms text into vectors that capture the
semantic meaning.

 Pre-trained Models: Utilizes powerful transformer models pre-trained on


large corpora for better accuracy and performance.

 Textual Similarity: Effective for tasks like finding sentence or document


similarity, clustering, and classification.

 Fine-tuning: Can be fine-tuned on specific tasks or datasets to improve
performance.

 Fast Inference: Optimized for speed, making them suitable for real-time
applications.

7.3.2 Difference Between Paragraph and Sentence Transformers:


Table 7
Difference Between Paragraph and Sentence Transformers
S.No | Aspect | Sentence Transformers | Paragraph Transformers
1 | Input Length | Processes individual sentences. | Processes longer text spans, such as paragraphs or documents.
2 | Contextual Understanding | Focuses on understanding the meaning of a single sentence. | Captures broader context by understanding relationships between sentences.
3 | Use Cases | Best for tasks like sentence similarity, sentence classification, and matching. | Ideal for document classification, multi-sentence retrieval, and context-based tasks.
4 | Model Complexity | Generally less complex as it processes shorter text. | More complex due to the need to handle larger text spans and their relationships.
5 | Performance | Performs well on sentence-level tasks with high efficiency. | Performs better on document-level tasks, leveraging multi-sentence context.

7.3.3 all-MiniLMv2
The all-MiniLMv2 model is a highly efficient, lightweight transformer-based model designed for generating dense vector embeddings from text. It is a version of MiniLM optimized for performance and accuracy. MiniLM is a distilled model that achieves high performance with far fewer parameters, making it faster and more memory-efficient than larger transformer models like BERT or RoBERTa.
Features of all-MiniLMv2:
 Lightweight and Efficient: It is designed to be faster and more resource-
efficient, making it ideal for applications requiring low latency and high
throughput.

 High-Quality Embeddings: Despite being smaller in size, all-MiniLMv2


produces high-quality embeddings that capture the semantic meaning of
sentences effectively.

 Reduced Memory Usage: It requires significantly less memory compared to


larger models, which makes it well-suited for environments with limited
resources.

 Performance on NLP Tasks: It performs exceptionally well on tasks like


semantic textual similarity, information retrieval, and clustering while
maintaining efficiency.

 Pre-trained Model: It is pre-trained on large text corpora and can be fine-


tuned on specific tasks to improve performance for domain-specific
applications.

Advantages:
 Efficiency: The model is faster, requiring less computational power while
maintaining good performance.

 Versatility: Can be used for various NLP tasks, including classification,
information retrieval, and clustering.

 Pre-trained: Comes pre-trained, so users can fine-tune it for specific tasks,


saving time in model development.
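As a quick illustration of semantic textual similarity with this family of models, the sketch below scores two sentence pairs with sentence-transformers' cosine-similarity utility, using the all-MiniLM-L6-v2 checkpoint adopted elsewhere in this project; the sentences are made up.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The agent clicks the button on the screen.",
    "A button on the display is pressed by the agent.",
    "The weather in Chennai is hot in June.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Paraphrases score high; the unrelated sentence scores low.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
print(util.cos_sim(embeddings[0], embeddings[2]).item())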

7.4 RAG (Retrieval-Augmented Generation):


7.4.1 INTRODUCTION:
RAG (Retrieval-Augmented Generation) is a technique in natural language
processing (NLP) that combines traditional retrieval-based methods with
generative models. It enhances a model's ability to generate more accurate and
contextually relevant text by first retrieving relevant information from a large
corpus or database and then using that information to generate coherent responses.
RAG is particularly effective in tasks like question answering and dialogue
generation, where access to external knowledge is critical.
FEATURES:
 Retrieval + Generation: Combines retrieval-based methods and generative
models for more accurate responses.
 External Knowledge: Accesses a broader knowledge base by retrieving relevant
documents.
 Flexible Integration: Can work with various retrievers and generative models.
 Open-Domain Tasks: Performs well in open-domain question answering and
conversations.
 Dynamic Responses: Generates responses based on retrieved content for
relevance.

7.4.2 Ollama Nomic Embeddings:


Ollama Nomic Embeddings refer to a specific type of vector representation model
designed for various NLP tasks, including document retrieval and semantic search.
These embeddings aim to provide high-quality, dense vector representations of text
that capture semantic meanings in a compact format. The key focus of Ollama
Nomic embeddings is to improve retrieval-based tasks by generating embeddings
that enhance the matching of semantically similar documents or queries, improving
the accuracy of search and recommendation systems.
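Assuming the Ollama Python client and a locally pulled nomic-embed-text model, an embedding call might look like the short sketch below; the exact client API and model tag should be verified against the installed Ollama version, so this is an assumption rather than a confirmed interface.

import ollama   # assumes a local Ollama server with the nomic-embed-text model pulled

response = ollama.embeddings(model="nomic-embed-text",
                             prompt="InfoQuery answers questions over PDFs.")
vector = response["embedding"]    # dense vector representation of the sentence
print(len(vector))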

7.4.3 GEMMA2 Model


GEMMA2 is a sophisticated generative model designed to enhance text generation
tasks, such as creating relevant and coherent responses based on input prompts. It
is part of a family of models optimized for multi-turn dialogues and knowledge-
intensive tasks. GEMMA2 has been trained on large-scale datasets, enabling it to
generate text that is both contextually appropriate and aligned with external
knowledge sources, making it suitable for use cases in areas like customer service,
educational tools, and automated content creation.
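Combining the pieces above, a retrieval-augmented answer could be produced roughly as outlined below. The sketch assumes the Ollama Python client with gemma2 and nomic-embed-text available locally, plus an already populated ChromaDB collection from the earlier modules; it is an outline under those assumptions, not the report's exact pipeline.

import ollama
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("infoquery_chunks")   # assumed to be already populated

def rag_answer(question: str, top_k: int = 3) -> str:
    # 1. Retrieval: embed the question and fetch the most similar chunks.
    q_emb = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
    hits = collection.query(query_embeddings=[q_emb], n_results=top_k)
    context = "\n".join(hits["documents"][0])

    # 2. Generation: ask the Gemma2 model to answer using only the retrieved context.
    prompt = f"Use only this context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
    reply = ollama.chat(model="gemma2", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

print(rag_answer("What does Zhyper AI automate?"))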

7.5 INTERFACE:
In software development, an interface refers to a point of interaction between
different systems or components. It defines the methods, parameters, and protocols
that allow different software applications or components to communicate with each
other. An interface can exist between a user and a system, between different
software modules (API interface), or between hardware and software. The primary
goal of an interface is to allow seamless communication and ensure that
components work together efficiently.

7.5.1 STREAMLIT:
Streamlit is an open-source Python library that allows developers to easily build
interactive web applications for machine learning and data science projects.
Streamlit is designed to be simple and fast, with minimal code required to create
web interfaces. It automatically updates the UI when changes are made to the
Python code, making it highly interactive. Streamlit is widely used for creating
dashboards, visualizations, and prototypes, allowing data scientists and machine
learning engineers to quickly showcase their models and results to non-technical
users.

Key Features of Streamlit:


 Very easy to use with minimal code, ideal for rapid prototyping.

 Automatically updates the UI when the code changes, making it interactive.

 Directly integrates with popular libraries like Pandas, Matplotlib, and Plotly
for data visualization.

 Allows users to create custom UI elements (widgets, charts, etc.).

 Supports easy deployment of applications to the web.
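For example, a few lines of Streamlit are enough to obtain an interactive widget and chart; this standalone sketch (run with "streamlit run app.py") merely demonstrates the library's reactive style and its direct Pandas integration, and is not part of the InfoQuery interface itself.

import streamlit as st
import pandas as pd
import numpy as np

st.title("Streamlit demo")

points = st.slider("Number of points", min_value=10, max_value=200, value=50)

# The script reruns on every widget change, so the chart updates automatically.
df = pd.DataFrame({"value": np.random.randn(points).cumsum()})
st.line_chart(df)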

CHAPTER 8
PERFORMANCE ANALYSIS
Performance analysis compares the effectiveness of two models (the existing
model and the fine-tuned model) to understand improvements or regressions in
their outputs. The comparison often includes accuracy, speed, resource usage, and
other performance metrics that reveal the strengths and weaknesses of each model.
8.1 Existing Model:
In this model, the output is based on generative techniques and OCR-based
information retrieval. The model retrieves relevant documents or passages from the
input data using pre-processing techniques, applies OCR (likely PyTesseract or
other similar libraries), and then generates a response or summary based on the
retrieved information.
Key aspects of the existing model output:
 Document Retrieval: Retrieves relevant passages from the input corpus
based on similarity to the query.

 Text Generation: Generates a coherent response or summary using the


retrieved text.

 Accuracy: The accuracy of the model depends on how well it retrieves


relevant documents and how the generative model processes them. The
output may not always be precise if the relevant information isn't retrieved
correctly.

 Context Handling: The model may face difficulty in handling long or


complex contexts because it generates text based on isolated document
retrieval, potentially missing broader context.

 Performance: The response time and processing power may be higher due
to document retrieval and text generation steps.

8.2 Fine-Tuned Model:
The InfoQuery model, the fine-tuned version, improves upon existing model by
integrating a more refined approach to document retrieval and text generation. This
version uses advanced techniques, including fine-tuning on a specific dataset,
better contextual handling, and possibly more advanced OCR methods. It uses
retrieval-augmented generation (RAG) for better semantic understanding and more
accurate answers.
Key aspects of the InfoQuery output:
 Improved Document Retrieval: Enhanced retrieval techniques that ensure
more relevant and contextually appropriate documents are fetched.

 Better Contextual Understanding: By fine-tuning the model with specific


data, InfoQuery is able to better understand the nuances of long or complex
queries.

 Accuracy and Relevance: The fine-tuned model improves the precision of


the generated response by learning from a broader dataset and optimizing the
text generation process.

 Faster and More Efficient: With optimizations in both the retrieval and
generation processes, InfoQuery delivers faster results with fewer
computational resources.

 Enhanced User Experience: The output is more coherent, informative, and


tailored to the user’s query, with a better flow of information.

InfoQuery’s output should be more reliable, precise, and contextually appropriate


than the existing model.

8.3 Difference Between Two Models:

Table 9
Difference between Existing and Fine-tuned model
S.No | Aspect | Existing Model | Fine-Tuned Model
1 | Accuracy | May have less accuracy due to suboptimal document retrieval and generation. | Improved accuracy due to fine-tuning and better retrieval methods.
2 | Context Handling | Struggles with complex or long contexts; may miss important information. | Better handling of context, capable of understanding complex queries.
3 | Response Quality | Responses may be more generic or less specific to the query. | Responses are more tailored and specific to the user's needs.
4 | Speed and Efficiency | Slower due to unoptimized retrieval and generation processes. | Faster and more efficient due to optimizations in both retrieval and generation.
5 | Resource Usage | High computational load and memory usage due to retrieval and generation. | More optimized resource usage, requiring less computational power.

CHAPTER 9
CONCLUSION
9.1 CONCLUSION

In this study, we explored various embedding models and their integration with vector databases to optimize both retrieval efficiency and accuracy. With the exponential surge in unstructured data generation, the need for robust storage and retrieval solutions has become more critical than ever. Vector databases, empowered by AI-driven embeddings and advanced indexing techniques, have emerged as a promising solution for managing high-dimensional data effectively.
Our proposed system seamlessly combines LangChain and ChromaDB, adopting a
structured methodology for text extraction, preprocessing, embedding generation,
and vector storage. Through a straightforward yet efficient pipeline — involving
data extraction, text segmentation, embedding generation, storage, and query
handling — we demonstrated a scalable and effective approach to leveraging
vector space embeddings. Experimental results highlighted that optimized
embeddings significantly enhance both search speed and retrieval accuracy,
making the system well-suited for real-time applications such as recommendation
systems, natural language processing (NLP) tasks, and computer vision projects.
Looking ahead, future research should focus on further improving scalability
through distributed computing frameworks, refining indexing algorithms, and
developing multimodal embedding techniques that integrate textual, visual, and
audio data. Additionally, the adoption of hybrid models that combine symbolic AI
with deep learning-based embeddings could substantially improve the semantic
understanding and processing of complex, unstructured data.
As vector database technologies continue to evolve, they are poised to play a
pivotal role in advancing AI-driven applications. By incorporating highly
optimized embedding models, more sophisticated similarity search methods, and
intelligent data management practices, businesses and researchers can build the
next generation of scalable, efficient, and intelligent AI systems, capable of
delivering faster, more accurate insights across a wide range of domains.

9.2 FUTURE SCOPE

InfoQuery aims to continually enhance its query processing capabilities and expand its document management features, ensuring it remains a cutting-edge tool for users. Key future developments include:

1. Improved Query Processing:
With advancements in Natural Language Understanding (NLU), InfoQuery will refine its ability to process and understand user queries. By incorporating sophisticated NLP models, the system will better handle synonyms, paraphrases, and contextual nuances, leading to more accurate query interpretations and responses. This will allow for more intuitive and flexible user interactions.

2. Enhanced Semantic Search:
The semantic search functionality will evolve to prioritize the meaning of user queries over traditional keyword matching. By focusing on the intent and context behind queries, InfoQuery will significantly improve the relevance of search results, delivering more accurate and user-friendly responses.

3. Advanced Document Management:
The future version of InfoQuery will enhance document management by extracting metadata such as author, document title, and creation date, enabling users to refine their search results (a minimal sketch of such metadata extraction appears after this list). Furthermore, automatic classification will allow for more effective organization, ensuring documents are systematically categorized and easily retrievable.

4. Multi-Format Support:
Expanding support to additional file formats will empower users to query a
wider range of documents, including Word files, Excel spreadsheets, and
images. This expansion will make InfoQuery a versatile solution that can
handle a broad spectrum of document types, improving its utility across
different industries.

5. External System Integration and API Development:
Building an API will enable third-party applications to integrate seamlessly with InfoQuery. This will facilitate easy interaction between InfoQuery and other systems, making it possible for external platforms to leverage the system's capabilities, thereby increasing its scope and potential use cases.

6. CMS Integration:
InfoQuery plans to integrate with content management systems (CMS),
enabling features like automatic publishing and content updates, which will
streamline document organization and management. Integration with external
systems like SGW will further enhance the automation and workflow efficiency
of the platform.
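As a first illustration of the metadata extraction planned in point 3 above, the sketch below reads basic PDF metadata that could later be indexed alongside the document text. It is only a sketch of a planned feature: the pypdf library, the field names, and the example path are assumptions, not part of the current system.

from pypdf import PdfReader

def extract_metadata(pdf_path):
    """Read basic document metadata that could be stored next to the embeddings."""
    reader = PdfReader(pdf_path)
    info = reader.metadata  # may be None if the PDF carries no metadata
    return {
        "title": info.title if info else None,
        "author": info.author if info else None,
        "created": info.creation_date if info else None,
        "pages": len(reader.pages),
    }

# print(extract_metadata("report.pdf"))  # hypothetical file path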

These advancements not only promise to improve the operational efficiency and
accuracy of InfoQuery but also pave the way for its application in diverse real-time
environments. As the system evolves, it will likely serve as a foundational platform
for intelligent document retrieval and management, fostering innovation in
domains such as recommendation systems, content curation, and enterprise data
management.

CHAPTER 10
APPENDICES
10.1 APPENDIX A – SOURCE CODE

# Install Python dependencies
!pip install langchain chromadb langchain_community pytesseract pdf2image

# System packages needed for OCR and PDF rendering (Colab)
!sudo apt install tesseract-ocr
!sudo apt install libtesseract-dev
!apt-get install poppler-utils  # Poppler is required by pdf2image

import textwrap

import pytesseract
from pdf2image import convert_from_path
from IPython.display import display, Markdown

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

def to_markdown(text):
    """Render model output as block-quoted Markdown in the notebook."""
    text = text.replace('•', ' *')
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

def extract_text_from_pdfs(pdf_path):
    """Extract text from a single PDF file by OCR-ing each page image."""
    images = convert_from_path(pdf_path)
    extracted_text = "".join(pytesseract.image_to_string(img) for img in images)
    return [extracted_text]  # return a list to keep the downstream loop uniform

pdf_path = "/content/sample_data/data/ITA Unit 2 Final (1).pdf"
doc_texts = extract_text_from_pdfs(pdf_path)

# Split the OCR output into overlapping chunks suitable for embedding
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = []
for text in doc_texts:
    chunks.extend(text_splitter.split_text(text))
len(chunks)
chunks[0]

# Open a terminal inside Colab to run the Ollama server, then check the available models
!pip install colab-xterm
%load_ext colabxterm
%xterm
!ollama list

# Embed every chunk and store the vectors in ChromaDB
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
    [Document(page_content=chunk) for chunk in chunks],
    embeddings
)
vectorstore

# Quick similarity-search check against the vector store
query = "Implementation of Artificial Light Systems in Agriculture"
search = vectorstore.similarity_search(query)
to_markdown(search[0].page_content)

# Expose the vector store as a retriever that returns the top 5 chunks
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
retriever.get_relevant_documents(query)

# Local LLM used for answer generation
llm = Ollama(model="gemma2")

# Prompt template; the retrieved chunks are injected through {context}
template = """
<|context|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers.
Use the following context to answer the question:
{context}
</s>
<|user|>
{query}
</s>
<|assistant|>
"""
prompt = ChatPromptTemplate.from_template(template)

# Retrieval-augmented generation chain: retrieve -> build prompt -> generate -> parse
rag_chain = (
    {"context": retriever, "query": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

response = rag_chain.invoke("Implementation of Artificial Light Systems in Agriculture")
to_markdown(response)

10.2 APPENDIX B

CHAPTER 11
REFERENCES

[1] Zhou, X., "Advanced Search Mechanisms in Vector Databases," ACM Transactions, 2023.

[2] Li, B., "Embedding Indexing Techniques for AI Applications," Springer AI Review, 2022.

[3] Kumar, S., "Efficient Nearest Neighbor Search in High-Dimensional Spaces," IEEE Transactions, 2021.

[4] Wang, Y., "Scalable Machine Learning Models with Vector Databases," ACM Computing Surveys, 2023.

[5] Chen, R., "Automating AI Workflows with Vector Search," Elsevier Journal of AI Research, 2022.

[6] Patel, J., "Semantic Similarity Matching Using Vector Representations," IEEE Big Data Conference, 2023.

[7] Rao, P., "Hybrid ANN Algorithms for Fast Vector Retrieval," Springer AI & Data Science, 2023.

[8] Zhang, L., "Multi-Modal Embeddings for Vector Search Systems," ACM Transactions, 2023.

[9] Gupta, M., "Evaluating ANN Methods in Large-Scale Databases," IEEE Access, 2023.

[10] Tan, H., "Exploring ChromaDB for AI-Based Vector Retrieval," IEEE ICML, 2023.

[11] Martin, D., "Optimizing Retrieval-Augmented Generation Pipelines," Springer AI Review, 2023.

[12] Xu, T., "Distributed Vector Databases for Enterprise Applications," Elsevier Data Science, 2023.

[13] Shen, Q., "Deep Learning-Based Vector Search Optimization," IEEE Transactions, 2023.

[14] Venkatesh, R., "Hybrid Embedding Models for Efficient Search," ACM Transactions, 2023.

[15] Nair, P., "Cloud-Based Vector Storage Systems: A Review," IEEE Access, 2023.

[16] Yu, J., "Quantum-Inspired Methods for Vector Similarity Search," Springer AI Research, 2023.

[17] Das, K., "Graph-Based Approaches for Nearest Neighbor Search," ACM Computing Surveys, 2023.

[18] Singh, H., "Advancements in High-Dimensional Indexing Techniques," IEEE Big Data, 2023.

[19] Singla, S., "Big Data Processing in Vector Databases," IEEE, 2021.

[20] Pavon, J., "Vector Indexed Sparse Computation," IEEE, 2023.

[21] Guo, R., "Cloud-Native Vector Databases," VLDB, 2022.

[22] Microsoft AI Research, "Vector Databases and AI Applications," 2023.
