A PROJECT REPORT
Submitted by
MUKESH ANAND G 211521243104
JERO FRANCIS S 211521243077
CHANDRU T 211521243033
of
BACHELOR OF TECHNOLOGY
IN
PANIMALAR INSTITUTE OF TECHNOLOGY
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
ACKNOWLEDGEMENT
A project of this magnitude and nature requires kind co-operation and support from
many for successful completion. We wish to express our sincere thanks to all those who
were involved in the completion of this project.
We seek the blessings of the Founder of our institution, Dr. JEPPIAAR, M.A.,
Ph.D., a role model and the source of inspiration behind our success in education
in his premier institution.
We would like to express our deep gratitude to our beloved Secretary and
Correspondent Dr. P. CHINNADURAI, M.A., Ph.D., for his kind words and enthusiastic
motivation which inspired us a lot in completing this project.
We also express our sincere thanks and gratitude to our dynamic Directors
Mrs. C. VIJAYA RAJESHWARI, Dr. C. SAKTHI KUMAR, M.E., Ph.D., and
Dr. SARANYA SREE SAKTHI KUMAR, B.E., M.B.A., Ph.D., for providing us with
necessary facilities for completion of this project.
We also express our appreciation and gratefulness to our respected Principal
Dr. T. JAYANTHY, M.E., Ph.D., who helped us in the completion of the project.
We wish to convey our thanks and gratitude to our Head of the Department,
Dr. T. KALAI CHELVI, M.E., Ph.D., for her full support by providing ample time to
complete our project.
We express our indebtedness and special thanks to our supervisor,
Mrs. K. SARANYA, M.E., for her expert advice and valuable information and guidance
throughout the completion of the project.
Lastly, we thank our parents and friends for their extensive moral support and
encouragement throughout the project.
TABLE OF CONTENTS
1 INTRODUCTION
2 LITERATURE SURVEY
3 EXISTING SYSTEM
  3.1 EXISTING SYSTEM
  3.2 PROBLEM DEFINITION
  3.3 PROPOSED SYSTEM
  3.4 ADVANTAGES
4 REQUIREMENTS SPECIFICATIONS
  4.1 INTRODUCTION
  4.2 HARDWARE AND SOFTWARE REQUIREMENTS
    4.2.1 HARDWARE REQUIREMENTS
    4.2.2 SOFTWARE REQUIREMENTS
      4.2.2.1 PYTHON
      4.2.2.2 GOOGLE COLAB
      4.2.2.3 INTERFACE
      4.2.2.4 CORE TECHNIQUES
5 SYSTEM DESIGN
  5.1 INTRODUCTION
  5.2 SYSTEM ARCHITECTURE
  5.3 UML DIAGRAMS
    5.3.1 USE CASE DIAGRAM
    5.3.2 CLASS DIAGRAM
    5.3.3 SEQUENCE DIAGRAM
    5.3.4 ACTIVITY DIAGRAM
    5.3.5 COMPONENT DIAGRAM
    5.3.6 DEPLOYMENT DIAGRAM
  5.4 DESIGN CONSIDERATIONS
  5.5 TOOLS AND TECHNOLOGIES USED
  5.6 SUMMARY
6 IMPLEMENTATION
  6.1 INTRODUCTION
  6.2 SYSTEM ENVIRONMENT
  6.3 MODULE-WISE IMPLEMENTATION
    6.3.1 PDF UPLOAD AND PREPROCESSING MODULE
    6.3.2 EMBEDDING GENERATION MODULE
    6.3.3 VECTOR DATABASE MANAGEMENT
    6.3.4 USER QUERY HANDLING MODULE
    6.3.5 USER INTERFACE DESIGN
  6.4 DATASET PREPARATION AND TESTING
  6.5 SUMMARY
7 TECHNIQUES USED
  7.1 OCR
    7.1.1 INTRODUCTION
    7.1.2 PYTESSERACT OCR
    7.1.3 EASYOCR
    7.1.4 KERAS OCR
    7.1.5 COMPARISON OF OCR
    7.1.6 OVERALL EFFICIENCY
  7.2 VECTOR DB
    7.2.1 INTRODUCTION
    7.2.2 CHROMADB
    7.2.3 WEAVIATE DB
    7.2.4 FAISS DB
    7.2.5 COMPARISON OF VECTOR DBs
    7.2.6 OVERALL EFFICIENCY
  7.3 SENTENCE TRANSFORMER
    7.3.1 INTRODUCTION
    7.3.2 DIFFERENCE BETWEEN PARAGRAPH AND SENTENCE TRANSFORMER
    7.3.3 all-MiniLMv2
  7.5 INTERFACE
    7.5.1 STREAMLIT
8 PERFORMANCE ANALYSIS
  8.1 EXISTING MODEL
  8.2 FINE-TUNED MODEL
  8.3 DIFFERENCE BETWEEN TWO MODELS
9 CONCLUSION AND FUTURE SCOPE
  9.1 CONCLUSION
  9.2 FUTURE SCOPE
10 APPENDICES
11 REFERENCES
LIST OF TABLES

1 HARDWARE COMPONENTS AND SPECIFICATIONS
2 OCR TYPES AND THEIR PROS & CONS
3 VECTOR DB TYPES AND THEIR PROS & CONS
5 RAG COMPONENTS
6 COMPARISON OF VECTOR DBs
LIST OF FIGURES

1 ARCHITECTURE DIAGRAM
2 USE CASE DIAGRAM
3 CLASS DIAGRAM
4 SEQUENCE DIAGRAM
5 ACTIVITY DIAGRAM
6 COMPONENT DIAGRAM
7 DEPLOYMENT DIAGRAM
8 COMPARISON OF OCR
LIST OF SYMBOLS

10 Multiplicity (1, *, 0..1, 1..*): Indicates how many instances of one class relate to another.
11 Generalization (▲): Represents inheritance.
12 Aggregation (◇): Represents a "whole-part" relationship where parts can exist independently.
13 Composition (◆, black diamond at one end): Represents a strong "whole-part" relationship; parts cannot exist separately.
14 Use Case: Represents a functionality or service the system provides.
15 Actor: Represents a user or external system interacting with the system.
16 Data: Represents a data type.
17 Cylinder: Represents the vector DB used to store data.
18 Double arrow: Represents a bidirectional relationship or communication between two entities.
19 Rectangle with three parts: Represents a class in UML diagrams.
20 Composition (⊂): Represents a strong "has-a" relationship between two classes.
ABSTRACT

Zhyper AI is an autonomous agent that streamlines human-computer
interaction by leveraging multiple Large Language Models (LLMs) to execute tasks with
minimal human intervention. The system translates high-level user instructions into
precise automated actions, such as typing, clicking, and navigating. It
comprises a graphical user interface (GUI) for intuitive command input, a processing
core that collaborates with LLMs to determine task execution steps, an interpreter that
converts instructions into executable commands, and an executor that simulates user
interactions. A key innovation of Zhyper AI is its real-time feedback loop, which adapts
task execution based on observed outcomes. This paper explores the system's
architecture, capabilities, and potential applications across domains.
By enabling fully autonomous computing, Zhyper AI redefines efficiency, accessibility,
and the future of human-computer interaction.
CHAPTER 1
INTRODUCTION
This project presents an intelligent agent powered by large language models
(LLMs) and designed for autonomous interaction with digital screens. This agent
addresses a crucial need in the field of artificial intelligence: reliable,
general-purpose screen control. Drawing from Kolb's Experiential Learning Cycle,
the reflection module enables the agent to learn from its actions and improve
over time, closely mimicking human learning patterns. Using visual input from
screenshots and basic control commands like mouse clicks and keyboard strokes,
the agent interacts with environments via the VNC protocol. It supports task
execution on Windows and Linux desktops and is evaluated on a dedicated test set.
This dataset includes diverse screen task scenarios along with a fine-grained
evaluation metric. Testing of several Vision-Language Models (VLMs) reveals that
while existing models perform well, they still fall short of fully autonomous,
precise screen control.
1.2 SCOPE OF THE PROJECT
The scope of this project involves the development and deployment of a fully
autonomous screen-control agent that can perceive digital screens
visually and perform intelligent operations like clicking, typing, navigating, and
executing multi-step tasks. The agent operates in a learning environment using the
VNC protocol to allow real-time interaction with operating systems such as Windows
and Linux. The project also involves fine-tuning and evaluating Vision-Language
Models on a dedicated screen-control dataset.
CHAPTER 2
RELATED WORK
2.1 MULTIMODAL LARGE LANGUAGE MODELS
Multimodal LLMs like LLaMA, Vicuna, and GPT-4 show strong contextual
understanding and text generation. GPT-4V extends GPT-4 with vision, enabling
image-based interactions. LLaVA and LLaVA-1.5 connect CLIP with Vicuna for
multimodal tasks. Fuyu-8B uses a pure decoder transformer without an image
encoder. CogVLM supports multi-turn visual dialogue at high resolution. Monkey
enhances input resolution through efficient training.
2.2 SIMULATED ENVIRONMENTS AND DATASETS
Simulated environments train agents for GUI tasks like clicking and typing.
WebNav and MiniWoB++ test decision-making via browser-based tasks.
WebShop enables shopping-based automation, while SWDE and WebSRC support
info extraction and QA from webpages. Mind2Web aims at generalist agents, and
datasets like Seq2Act, Screen2Words, and META-GUI train agents on Android
UIs. These datasets create complex environments for LLM agents to learn screen
control.
2.3 LLM-BASED AGENTS
LLMs have enhanced agent capabilities, with WebGPT enabling web-based question
answering. ToolFormer integrates external tools like calculators and search engines.
Voyager pioneers lifelong learning in Minecraft using LLMs. RecAgent adds memory
reflection, and ProAgent automates tasks with LLMs. CogAgent focuses on GUI
comprehension; AppAgent learns mobile app usage. The environment defines actions
(JSON), states (screenshots), and flexible rewards for screen control tasks.
CHAPTER 3
FRAMEWORK
3.2 CONTROL PIPELINE
To guide the agent to continually interact with the environment and complete
multi-step complex tasks, we designed a control pipeline comprising the Planning,
Acting, and Reflecting phases. The whole pipeline is depicted in Fig. 2. The
pipeline asks the agent to disassemble the complex task, execute subtasks, and
evaluate execution results. The agent will have the opportunity to retry some
subtasks or adjust previously established plans to accommodate the current
occurrences.
Planning Phase. In the planning phase, based on the current screenshot, the agent
needs to decompose the complex task relying on its own common-sense knowledge
and computer knowledge.
Acting Phase. In the acting phase, based on the current screenshot, the agent
generates low-level mouse or keyboard actions in JSON-style function calls. The
environment will attempt to parse the function calls from the agent's response
and convert them to device actions defined in the VNC protocol. Then our
environment will send actions to the controlled computer. The environment will
capture the after-action screen as input for the next execution phase.
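To make the acting phase concrete, here is a minimal sketch of how a JSON-style function call from the agent could be parsed and mapped to a device action. The action schema (mouse_click, type_text) and the send_vnc_event helper are illustrative assumptions, not the system's exact format:

import json

# Hypothetical JSON-style function call emitted by the agent in the acting phase.
agent_response = '{"name": "mouse_click", "args": {"x": 412, "y": 230, "button": "left"}}'

def send_vnc_event(kind, **kwargs):
    # Placeholder for the VNC-protocol layer that drives the controlled computer.
    print(f"VNC event: {kind} {kwargs}")

def dispatch_action(response_text):
    """Parse the agent's JSON function call and map it to a device action.

    The action names and send_vnc_event are assumptions for illustration;
    the real environment converts parsed calls into VNC device actions."""
    call = json.loads(response_text)
    name, args = call["name"], call["args"]
    if name == "mouse_click":
        send_vnc_event("click", x=args["x"], y=args["y"], button=args["button"])
    elif name == "type_text":
        send_vnc_event("type", text=args["text"])
    else:
        raise ValueError(f"Unknown action: {name}")

dispatch_action(agent_response)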
Reflecting Phase. The reflecting stage requires the agent to assess the current
situation based on the after-action screen. The agent determines whether it needs
to retry the current sub-task, go on to the next sub-task, or make some adjustments
to the plan list. This phase is crucial within the control pipeline, providing some
flexibility to handle a variety of unpredictable circumstances.
CHAPTER 4
REQUIREMENTS SPECIFICATIONS
4.1 INTRODUCTION
Table 1: Hardware components and specifications
Rationale:
1. CPU & RAM: Multi-core CPU and ample RAM are critical to parallelize
OCR tasks and handle large embedding matrices.
2. GPU: While OCR runs on CPU, embedding models (all-MiniLM-L6-v2)
leverage GPU for faster vector computation (<50 ms per chunk); a timing
sketch follows this list.
3. Storage: SSD or NVMe ensures rapid loading of PDF files, storage of
vector index, and database operations.
4. Network: High throughput and low latency enable seamless interactions
between Streamlit frontend and backend services.
5. Backup: Ensures business continuity and data integrity in case of hardware
failure.
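As a rough check on the per-chunk latency figure in item 2, a timing sketch with the sentence-transformers library; the sample chunks are placeholders and actual numbers depend on hardware:

import time
from sentence_transformers import SentenceTransformer

# Load the embedding model named in the rationale (downloads on first use).
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["Sample chunk of PDF text."] * 32  # placeholder chunks

start = time.perf_counter()
embeddings = model.encode(chunks, batch_size=32)  # shape: (32, 384)
elapsed_ms = (time.perf_counter() - start) * 1000 / len(chunks)
print(f"~{elapsed_ms:.1f} ms per chunk; vector dim = {embeddings.shape[1]}")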
CHAPTER 5
EXPERIMENT
5.1 INTRODUCTION:
In the experimental phase, we assessed OpenAI GPT-4V performance on the
ScreenAgent test set, along with evaluations of three open-source VLMs.
Furthermore, one of these models underwent fine-tuning to potentially augment its
proficiency in screen control tasks. Subsequently, we conducted a thorough
analysis of the outcomes and identified several typical cases to elucidate the
inherent challenges of our task.
Apart from GPT-4V, we selected several recently released SOTA VLMs for testing,
including LLaVA-1.5 [Liu et al., 2023a] and CogAgent [Hong et al., 2023]. LLaVA-1.5
is a 13B-parameter multimodal model; unfortunately, it only supports image inputs
up to 336 x 336 px. CogAgent is an 18B-parameter visual language model designed for
GUI comprehension and navigation. Leveraging dual image encoders for both
low-resolution and high-resolution inputs, CogAgent demonstrates proficiency at a
resolution of 1120 x 1120 px, allowing it to discern minute elements and text.
We test the models' capabilities from two aspects: the ability to follow
instructions to output the correct function call format, shown in Table 2, and the
ability to complete specific tasks assigned by the user.
[Figure: training data proportions and division of the four training phases;
percentages indicate the proportion of samples from each dataset at each phase.]

Table 2 reports the success rate of these function calls for each attribute key.
This assessment focuses on whether the model can accurately execute various
functions encompassing the attribute items expected by manual action annotations.
Note that this evaluation does not consider the consistency of the attribute
values with the golden labeling; it solely examines if the model's output includes
the necessary attribute keys. From the table, GPT-4V and LLaVA-1.5 achieved higher
scores, while CogAgent and its upstream model CogAgent-VQA underperformed.
CogAgent-VQA and CogAgent-Chat almost entirely disregarded the JSON-format action
definitions in our prompts, resulting in a very low score on successful function
calls and rendering them completely incapable of interacting with our environment.
To ensure fairness in comparison, we utilize OpenAI GPT-3.5 to extract actions
into JSON-style function calls from the original CogAgent-Chat responses,
indicated as "CogAgent-Chat (helped by GPT-3.5)". Even so, its scores are
significantly lower than those of LLaVA-1.5 and GPT-4V, although CogAgent has been
trained on the Mind2Web web browsing simulation dataset.
Table 3 displays the fine-grained scores of predicted attribute values for each
action within the successfully parsed function calls. As can be seen, GPT-4V
remains the best performer, with an action type prediction F1 score of 0.98. This
implies that it can accurately select appropriate mouse or keyboard actions.
Additionally, it can precisely choose the mouse action type, typing text, or
pressing keys consistent with the golden label actions.
Another significant challenge for all models is the reflection phase. In this
phase, the agent is required to determine whether the subtask has been completed
in the current state, and decide whether to proceed further or make some
adjustments. This is crucial for constructing a continuous interactive process.
Regrettably, all models show insufficient accuracy in this determination, with
GPT-4V achieving only a 0.60 F1 score. This implies that human intervention is
still necessary during task execution.
5.2 FINE-TUNING TRAINING
After vision fine-tuning, ScreenAgent achieved the same level of following
instructions and making function calls as GPT-4V on our dataset, as shown in
Table 2. In Table 3, ScreenAgent also reached a comparable level to GPT-4V.
Notably, our ScreenAgent far surpasses existing models in the precision of mouse
clicking. This indicates that vision fine-tuning effectively enhances the model's
precise positioning capabilities. Additionally, we observed that ScreenAgent has
a significant gap compared to GPT-4V in terms of task planning, highlighting
GPT-4V's common-sense knowledge and task-planning abilities.
To evaluate our ScreenAgent model on computer control tasks, we provide two cases.
In Fig. 7, we present a case illustrating the workflow of ScreenAgent executing a
chain of actions. In Fig. 8, we compare different agents in executing the details
of the three phases in the pipeline. Fig. 8 (a) shows the planning process of all
the agents, where we find that our ScreenAgent produces the most concise and
effective plan. Fig. 8 (b) presents four different click action tasks, each
representing a step in a specific task. Results show that LLaVA clicks on the
bottom-left corner on all screens, CogAgent may fail to generate click positions,
and in the fourth task, only our agent can correctly click on the position.
Fig. 8 (c) shows that our agent can recognize whether an action needs to be
retried after reflection and successfully execute the action following a failure.
CHAPTER 6
IMPLEMENTATION
6.1 INTRODUCTION
The implementation phase is the pivotal step in transforming the design blueprints
of the INFOQUERY Interactive AI Model into a functioning reality. This phase is
responsible for translating high-level system architecture, component designs, and
theoretical strategies into actual code and working modules that meet the outlined
specifications and use cases. Implementation is not just coding—it is a disciplined
engineering process involving testing, optimization, debugging, and integration. It
ensures that the system performs its intended function reliably, efficiently, and
securely.
The INFOQUERY system is designed to serve as an AI-driven tool capable of
interpreting, extracting, and responding to user queries on PDF documents. This
makes it especially useful for individuals working in academia, research, and
enterprise document management. The following sections describe the detailed
implementation process, including system requirements, tools used, and in-depth
explanations of each module involved.
6.2 SYSTEM ENVIRONMENT
Implementing an AI-based PDF question-answering tool requires a mix of
programming languages, libraries, hardware, and platforms. Below is an overview
of the environment and tools used:
- Programming Language: Python 3.10
- Frontend Framework: Streamlit (used for building the interactive user interface)
- Libraries Used:
- PyMuPDF: For extracting textual content from PDF files.
- pytesseract: For applying Optical Character Recognition on image-based PDFs.
- SentenceTransformers: To convert text content into semantic vectors using
transformer-based models.
- ChromaDB: To manage and query high-dimensional embeddings for
semantic search.
- LangChain: To connect queries and retrieved content with a large language
model for generating answers.
- HuggingFace Transformers: For language model support.
- Operating System: Windows 11
- Hardware Requirements:
Minimum: 8 GB RAM, 2.4 GHz quad-core processor.
Recommended: GPU-enabled systems for accelerated embeddings.
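To show how these pieces can fit together, here is a bridging sketch that reads the text layer with PyMuPDF first and falls back to pytesseract OCR for scanned pages. The file path and the page-level fallback heuristic are assumptions, not the exact INFOQUERY code:

import fitz  # PyMuPDF
import pytesseract
from pdf2image import convert_from_path

def extract_pages(pdf_path):
    """Return per-page text, falling back to OCR when a page has no text layer."""
    doc = fitz.open(pdf_path)
    pages = []
    for i, page in enumerate(doc):
        text = page.get_text().strip()
        if not text:  # likely a scanned page: rasterize it and apply OCR
            image = convert_from_path(pdf_path, first_page=i + 1, last_page=i + 1)[0]
            text = pytesseract.image_to_string(image)
        pages.append(text)
    return pages

print(extract_pages("sample.pdf")[0])  # "sample.pdf" is a placeholder path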
6.4 DATASET PREPARATION AND TESTING
INFOQUERY uses dynamically uploaded PDFs, but internal testing was done
using sample datasets such as:
- IEEE research papers
- Wikipedia offline dumps (converted to PDFs)
- University curriculum and question banks
Testing was done to validate the performance in terms of:
- OCR accuracy (Tesseract)
- Semantic search precision
- Query relevancy and LLM coherence
6.5 SUMMARY
This chapter comprehensively described the practical implementation of
INFOQUERY—from backend embedding to frontend interactivity. Each
component has been designed to function both independently and cohesively as
part of a semantic Q&A pipeline over PDFs. By utilizing state-of-the-art
transformer models and OCR, INFOQUERY bridges the gap between document
access and intelligent interaction.
CHAPTER 7
TECHNIQUES USED
7.1 OCR
7.1.2 PYTESSERACT OCR
PyTesseract is a Python wrapper for the Tesseract OCR engine. Its working
generally includes:
Text Detection: Identifying and isolating text regions from non-text areas
like backgrounds and graphics.
Applying OCR on images: Tesseract analyzes the image and identifies the
textual content.
Advantages of PyTesseract:
Free and Open Source: Completely free to use with active community
support.
Flexible Output: Can export results into different formats including plain
text, searchable PDFs, and TSV.
Disadvantages of PyTesseract:
Low Performance on Noisy Images: Struggles with handwritten, low-
resolution, or complex layout documents unless heavy preprocessing is
done.
29
Limited Deep Learning Features: Does not use neural networks for
recognition like newer OCR systems (such as EasyOCR or Keras OCR).
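As a quick illustration, a minimal PyTesseract sketch; the image path is a placeholder, and image_to_string is the same call used in Appendix A:

import pytesseract
from PIL import Image

# OCR a single scanned page; the language can be selected with lang="eng".
page = Image.open("scanned_page.png")  # placeholder path
text = pytesseract.image_to_string(page, lang="eng")
print(text[:200])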
7.1.3 EASYOCR
EasyOCR is a deep learning-based Optical Character Recognition (OCR) library
developed by the Jaided AI team. It leverages neural networks to perform OCR
tasks, making it particularly effective for handling complex images and a wide
variety of languages. EasyOCR supports over 80 languages and provides state-of-
the-art accuracy for both printed and handwritten text extraction.
EasyOCR uses Convolutional Neural Networks (CNNs) and Recurrent Neural
Networks (RNNs) for text detection and recognition. The library is designed to be
simple to use and is built on top of PyTorch, allowing easy integration into deep
learning pipelines.
The working of EasyOCR generally includes:
Preprocessing the image: EasyOCR automatically processes images to
detect text regions.
Text detection and recognition: The OCR engine first detects the text areas
in the image, followed by recognition of the actual characters using deep
learning models.
Text output: The recognized text can be returned in formats like JSON,
plain text, or a list of bounding box coordinates.
Advantages of EasyOCR:
1. Deep Learning-based Technology: Utilizes CNNs and RNNs, which leads
to improved accuracy in recognizing text, especially in noisy or complex
images.
2. Wide Language Support: Supports over 80 languages, including non-Latin
scripts like Chinese, Arabic, and Hindi, making it versatile for international
applications.
Disadvantages of EasyOCR:
1. Slower Performance on Simple Tasks: It can be slower than traditional
OCR engines like PyTesseract when dealing with clean, straightforward
documents or small text extractions.
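For comparison, a minimal EasyOCR sketch (the image path is a placeholder); readtext returns bounding box, text, and confidence triples:

import easyocr

# Build a reader for English; add language codes like 'hi' or 'ar' as needed.
reader = easyocr.Reader(["en"])
results = reader.readtext("scanned_page.png")  # placeholder path
for bbox, text, confidence in results:
    print(f"{confidence:.2f}  {text}")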
7.1.4 KERAS OCR
Keras-OCR is an end-to-end, deep learning-based OCR pipeline built on Keras and
TensorFlow. Its working generally includes:
Text Detection: Detects text regions in the image using a detection model.
Text Recognition: Recognizes and extracts the text in those regions using
sequence-to-sequence models like RNNs.
Output: Provides recognized text in formats such as plain text, along with
bounding box coordinates for each detected text region.
Advantages of Keras OCR:
2. Deep Learning-based Models: The use of deep learning models for both
text detection and recognition ensures state-of-the-art accuracy, particularly
on images with varying layouts.
Disadvantages of Keras OCR:
1. Computationally Intensive: The deep learning models used in Keras OCR
require significant computational resources, such as a good CPU or GPU, for
optimal performance, which may not be feasible for low-resource
environments.
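A minimal Keras-OCR sketch under the same assumptions (placeholder image path); the default pipeline bundles pretrained detector and recognizer models:

import keras_ocr

# The default pipeline downloads pretrained detector and recognizer weights.
pipeline = keras_ocr.pipeline.Pipeline()
images = [keras_ocr.tools.read("scanned_page.png")]  # placeholder path
predictions = pipeline.recognize(images)  # one list of (word, box) pairs per image
for word, box in predictions[0]:
    print(word)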
7.1.6 OVERALL EFFICIENCY:
When evaluating OCR solutions for PDFs, efficiency is determined by factors such
as accuracy, ease of integration, support for complex layouts, and processing speed.
PyTesseract is a reliable choice for straightforward text extraction due to its
balance of accuracy, ease of integration, and extensive language support. It
performs well with simple, well-formatted PDFs but may require additional
preprocessing for documents with complex layouts. On the other hand, EasyOCR
utilizes advanced deep learning models, offering high accuracy and effective
handling of noisy or complex text. It supports over 80 languages, making it
versatile for international PDFs, but can be more resource-intensive. Keras-OCR
stands out for its state-of-the-art deep learning models and end-to-end pipeline,
which excels in recognizing text from complex layouts and customized tasks.
However, it may require more setup and familiarity with Keras and TensorFlow.
Overall, Keras-OCR is highly efficient for handling intricate document layouts and
specialized OCR needs, while PyTesseract and EasyOCR offer excellent
performance with their own strengths, with EasyOCR being particularly adept at
dealing with complex text in challenging conditions.
7.2 VECTOR DATABASE:
7.2.1 INTRODUCTION:
A Vector Database is a specialized type of database designed to store, manage, and
query high-dimensional vectors, which are typically the output of machine learning
models, such as embeddings generated by neural networks. These vectors represent
data points in a continuous vector space and are often used in applications such as
natural language processing (NLP), image recognition, recommendation systems,
and more. Vector databases are optimized for handling vector operations like
similarity search, nearest neighbor search, and clustering, which are essential for
many machine learning and AI tasks.
Key Features of Vector Databases:
1. High-Dimensional Data Storage: Vectors are usually represented in
high-dimensional spaces. A vector database efficiently stores and indexes these
vectors for quick retrieval.
2. Similarity Search: One of the primary use cases for vector databases is finding
vectors that are similar to a given query vector. This is achieved through
techniques like nearest neighbor search, where the database searches for vectors
closest to a given query in terms of distance (e.g., Euclidean distance, cosine
similarity); a small sketch follows this list.
3. Support for Complex Queries: Vector databases allow for more advanced
queries like range queries and multi-query searches.
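As a small illustration of the similarity search described in item 2, the following sketch runs a cosine-similarity nearest-neighbor search over made-up vectors:

import numpy as np

# Toy 4-dimensional "embeddings"; real embeddings have hundreds of dimensions.
vectors = np.array([[0.1, 0.9, 0.0, 0.2],
                    [0.8, 0.1, 0.3, 0.0],
                    [0.2, 0.8, 0.1, 0.1]])
query = np.array([0.15, 0.85, 0.05, 0.15])

# Cosine similarity = dot product of L2-normalized vectors.
norm_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
norm_query = query / np.linalg.norm(query)
scores = norm_vectors @ norm_query

print("nearest neighbor index:", int(np.argmax(scores)))  # -> 0 (most similar)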
7.2.2 CHROMADB
ChromaDB is an open-source vector database built for storing embeddings and
serving fast similarity search.
Advantages of ChromaDB:
1. High Performance: Optimized for fast similarity search, providing quick
results even with large vector datasets, ideal for real-time applications.
3. Ease of Use: Offers a simple API for easy integration, reducing the complexity
of using vector search in applications.
Disadvantages of ChromaDB:
1. Memory Consumption: Storing large vector datasets can consume significant
memory, especially with high-dimensional vectors, requiring substantial
resources.
2. Complexity in Setup for Large Scale: Scaling ChromaDB for very large
datasets can be complex and require detailed optimization for performance.
3. Limited Query Types: Primarily focuses on similarity search, and may lack
advanced query capabilities for more complex database operations.
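A minimal ChromaDB sketch of the simple API noted above; the document texts and IDs are placeholders, and the in-memory client is used for brevity:

import chromadb

client = chromadb.Client()  # in-memory client; persistent clients are also available
collection = client.create_collection(name="docs")

# Add documents; Chroma embeds them with its default embedding function.
collection.add(
    documents=["PDF chunk about OCR.", "PDF chunk about vector search."],
    ids=["chunk-1", "chunk-2"],
)

results = collection.query(query_texts=["how does vector search work?"], n_results=1)
print(results["documents"][0])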
7.2.3 WEAVIATE DB
Advantages of Weaviate:
2. Semantic Search: Offers meaning-based queries, improving search relevance.
Disadvantages of Weaviate:
3. Limited Queries: Primarily focuses on vector search and may lack support for
other complex queries.
7.2.4 FAISS DB
Advantages of FAISS:
3. GPU Acceleration: Supports GPU-based processing for faster indexing and
search operations, enhancing performance.
Disadvantages of FAISS:
3. Limited Query Types: Primarily focused on vector search and lacks support
for more complex, traditional database queries.
7.2.5 COMPARISON OF VECTOR DBs

FEATURE | CHROMADB    | WEAVIATE    | FAISS
TYPE    | Open-source | Open-source | Open-source library
7.2.6 OVERALL EFFICIENCY:
ChromaDB offers a user-friendly and cloud-native solution, optimized for quick
deployment and moderate scalability. It excels in ease of use, especially for small
to medium-sized applications that require seamless integration with common ML
models. While it may not match the advanced scalability and search speed of Faiss
or the complex querying capabilities of Weaviate, ChromaDB strikes a balance
between simplicity and performance. Its strength lies in providing efficient vector
storage and retrieval for applications where ease of setup and integration outweigh
the need for extreme-scale processing or custom search configurations.
7.3 SENTENCE TRANSFORMER
7.3.1 INTRODUCTION
Key features of sentence transformer models include:
Fine-tuning: Can be fine-tuned on specific tasks or datasets to improve
performance.
Fast Inference: Optimized for speed, making them suitable for real-time
applications.
7.3.2 DIFFERENCE BETWEEN PARAGRAPH AND SENTENCE TRANSFORMER

Aspect | Sentence Transformer | Paragraph Transformer
1. Input Length | Processes individual sentences. | Processes longer text spans, such as paragraphs or documents.
3. Use Cases | Best for tasks like sentence similarity, sentence classification, and matching. | Ideal for document classification, multi-sentence retrieval, and context-based tasks.
7.3.3 all-MiniLMv2
The all-MiniLMv2 model is a highly efficient, lightweight transformer-based
model designed for generating dense vector embeddings from text. It is a version
of MiniLM optimized for performance and accuracy. MiniLM is a distilled model
that achieves high performance with fewer parameters, making it faster and more
memory-efficient than traditional transformer models like BERT or RoBERTa.
Features of all-MiniLMv2:
Lightweight and Efficient: It is designed to be faster and more resource-
efficient, making it ideal for applications requiring low latency and high
throughput.
Advantages:
Efficiency: The model is faster, requiring less computational power while
maintaining good performance.
Versatility: Can be used for various NLP tasks, including classification,
information retrieval, and clustering.
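A short sketch of sentence similarity with the sentence-transformers library; the model ID all-MiniLM-L6-v2 is the variant named in Chapter 4, and the sentences are placeholders:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["What is OCR?", "Optical character recognition extracts text from images."]

# Encode to dense vectors and compare with cosine similarity.
embeddings = model.encode(sentences, convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"cosine similarity: {score.item():.3f}")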
7.5 INTERFACE:
In software development, an interface refers to a point of interaction between
different systems or components. It defines the methods, parameters, and protocols
that allow different software applications or components to communicate with each
other. An interface can exist between a user and a system, between different
software modules (API interface), or between hardware and software. The primary
goal of an interface is to allow seamless communication and ensure that
components work together efficiently.
7.5.1 STREAMLIT:
Streamlit is an open-source Python library that allows developers to easily build
interactive web applications for machine learning and data science projects.
Streamlit is designed to be simple and fast, with minimal code required to create
web interfaces. It automatically updates the UI when changes are made to the
Python code, making it highly interactive. Streamlit is widely used for creating
dashboards, visualizations, and prototypes, allowing data scientists and machine
learning engineers to quickly showcase their models and results to non-technical
users.
Directly integrates with popular libraries like Pandas, Matplotlib, and Plotly
for data visualization.
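To ground this, a minimal sketch of a Streamlit front end for a PDF Q&A tool; the answer_query helper and the title text are illustrative placeholders for the real retrieval-and-generation backend (run with: streamlit run app.py):

import streamlit as st

def answer_query(pdf_bytes, query):
    # Placeholder for the real pipeline: extraction/OCR -> embeddings -> retrieval -> LLM.
    return f"(demo) You asked: {query!r} about a {len(pdf_bytes)}-byte PDF."

st.title("INFOQUERY: Chat with your PDF")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
query = st.text_input("Ask a question about the document")

if uploaded and query:
    st.write(answer_query(uploaded.getvalue(), query))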
CHAPTER 8
PERFORMANCE ANALYSIS
Performance analysis compares the effectiveness of two models (the existing
model and the fine-tuned model) to understand improvements or regressions in
their outputs. The comparison often includes accuracy, speed, resource usage, and
other performance metrics that reveal the strengths and weaknesses of each model.
8.1 EXISTING MODEL:
In this model, the output is based on generative techniques and OCR-based
information retrieval. The model retrieves relevant documents or passages from the
input data using pre-processing techniques, applies OCR (likely PyTesseract or
other similar libraries), and then generates a response or summary based on the
retrieved information.
Key aspects of the existing model output:
Document Retrieval: Retrieves relevant passages from the input corpus
based on similarity to the query.
Performance: The response time and processing power may be higher due
to document retrieval and text generation steps.
8.2 FINE-TUNED MODEL:
The InfoQuery model, the fine-tuned version, improves upon the existing model by
integrating a more refined approach to document retrieval and text generation. This
version uses advanced techniques, including fine-tuning on a specific dataset,
better contextual handling, and possibly more advanced OCR methods. It uses
retrieval-augmented generation (RAG) for better semantic understanding and more
accurate answers.
Key aspects of the InfoQuery output:
Improved Document Retrieval: Enhanced retrieval techniques that ensure
more relevant and contextually appropriate documents are fetched.
Faster and More Efficient: With optimizations in both the retrieval and
generation processes, InfoQuery delivers faster results with fewer
computational resources.
8.3 DIFFERENCE BETWEEN TWO MODELS

Table 9: Difference between Existing and Fine-tuned model

S.No | Aspect | Existing Model | Fine-Tuned Model
1 | Accuracy | May have less accuracy due to suboptimal document retrieval and generation. | Improved accuracy due to fine-tuning and better retrieval methods.
3 | Response Quality | Responses may be more generic or less specific to the query. | Responses are more tailored and specific to the user's needs.
4 | Speed and Efficiency | Slower due to unoptimized retrieval and generation processes. | Faster and more efficient due to optimizations in both retrieval and generation.
CHAPTER 9
CONCLUSION AND FUTURE SCOPE
9.1 CONCLUSION
9.2 FUTURE SCOPE
4. Multi-Format Support:
Expanding support to additional file formats will empower users to query a
wider range of documents, including Word files, Excel spreadsheets, and
images. This expansion will make InfoQuery a versatile solution that can
handle a broad spectrum of document types, improving its utility across
different industries.
6. CMS Integration:
InfoQuery plans to integrate with content management systems (CMS),
enabling features like automatic publishing and content updates, which will
streamline document organization and management. Integration with external
systems like SGW will further enhance the automation and workflow efficiency
of the platform.
These advancements not only promise to improve the operational efficiency and
accuracy of InfoQuery but also pave the way for its application in diverse real-time
environments. As the system evolves, it will likely serve as a foundational platform
for intelligent document retrieval and management, fostering innovation in
domains such as recommendation systems, content curation, and enterprise data
management.
CHAPTER 10
APPENDICES
10.1 APPENDIX A – SOURCE CODE
import textwrap
import pytesseract
from IPython.display import Markdown
from pdf2image import convert_from_path
from langchain.text_splitter import RecursiveCharacterTextSplitter

def to_markdown(text):
    """Render text as block-quoted Markdown in a notebook."""
    text = text.replace('•', ' *')
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

def extract_text_from_pdfs(pdf_path):
    """Extracts text from a single PDF file by OCR-ing each rendered page."""
    images = convert_from_path(pdf_path)
    extracted_text = "".join(pytesseract.image_to_string(img) for img in images)
    return [extracted_text]  # return a list to maintain consistency

doc_texts = extract_text_from_pdfs("input.pdf")  # example path (assumed)

# Split the extracted text into overlapping chunks for embedding.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = []
for text in doc_texts:
    chunks.extend(text_splitter.split_text(text))

len(chunks)
chunks[0]
!pip install colab-xterm
%load_ext colabxterm
%xterm
!ollama list
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.docstore.document import Document

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Wrap chunks as Documents and index them in a Chroma vector store.
vectorstore = Chroma.from_documents(
    [Document(page_content=chunk) for chunk in chunks],
    embeddings
)
vectorstore
query = "Implementation of Artificial Light Systems in Agriculture"
search = vectorstore.similarity_search(query)
to_markdown(search[0].page_content)
retriever = vectorstore.as_retriever(
search_kwargs={'k': 5}
)
retriever.get_relevant_documents(query)
from langchain_community.llms import Ollama
llm = Ollama(model="gemma2")
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain.prompts import ChatPromptTemplate
template = """
<|context|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers.
{context}
</s>
<|user|>
{query}
</s>
<|assistant|>
"""
prompt = ChatPromptTemplate.from_template(template)
rag_chain = (
{"context": retriever, "query": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
10.2 APPENDIX B
CHAPTER 11
REFERENCES
[4] Wang, Y., "Scalable Machine Learning Models with Vector Databases," ACM
Computing Surveys, 2023.
[5] Chen, R., "Automating AI Workflows with Vector Search," Elsevier Journal of
AI Research, 2022.
[6] Patel, J., "Semantic Similarity Matching Using Vector Representations," IEEE
Big Data Conference, 2023.
[7] Rao, P., "Hybrid ANN Algorithms for Fast Vector Retrieval," Springer AI &
Data Science, 2023.
[8] Zhang, L., "Multi-Modal Embeddings for Vector Search Systems," ACM
Transactions, 2023.
[10] Tan, H., "Exploring ChromaDB for AI-Based Vector Retrieval," IEEE ICML,
2023.
[11] Martin, D., "Optimizing Retrieval-Augmented Generation Pipelines,"
Springer AI Review, 2023.
[12] Xu, T., "Distributed Vector Databases for Enterprise Applications," Elsevier
Data Science, 2023.
[14] Venkatesh, R., "Hybrid Embedding Models for Efficient Search," ACM
Transactions, 2023.
[15] Nair, P., "Cloud-Based Vector Storage Systems: A Review," IEEE Access,
2023.
[16] Yu, J., "Quantum-Inspired Methods for Vector Similarity Search," Springer
AI Research, 2023.
[17] Das, K., "Graph-Based Approaches for Nearest Neighbor Search," ACM
Computing Surveys, 2023.
[19] Singla, S., "Big Data Processing in Vector Databases," IEEE, 2021.