
LOCAL MULTIMODAL CHATBOT

A Project Report

Submitted by

SHAIKH AREEBA RAIS AHMED

SHAIKH YUNUS MOHD ABDULLAH

AYAZ ARIF TAMBE

SYED ADEEN AHMED

Under the guidance of

Mr. Pritam Pawar


(Lecturer, Artificial Intelligence and Machine Learning Dept.)

Department of Artificial Intelligence and Machine Learning Engineering,


Anjuman-I-Islam's Abdul Razzak Kalsekar Polytechnic,
Sector- 16, Khandagaon, New Panvel- 410206
Maharashtra State Board of Technical Education
(2023 - 2024)
CERTIFICATE

This is to certify that the project entitled “Local Multimodal Chatbot” being submitted by
Shaikh Areeba is worthy of consideration for the award of the degree of “Diploma in
Artificial Intelligence and Machine Learning Engg” and is a record of original bona fide work
carried out under our guidance and supervision. The results contained herein have not
been submitted in part or full to any other university or institute for the award of any degree,
diploma, or certificate.

Shaikh Areeba Rais Ahmed.


Shaikh Mohd Yunus Abdullah.
Ayaz Arif Tambe
Syed Adeen Ahmed

Ms. Nousheen Shaikh

(Internal Examiner) (External Examiner)

Mr. Pritam Pawar

(Project Guide)

Prof. Ali Karim Sayed Prof. Arif Shaikh

(HOD, AIML Dept.) (Principal, AIARKP)


Declaration

I declare that this project report entitled “Local Multimodal Chatbot” represents my
ideas in my own words and where others' ideas or words have been included, I have adequately
cited and referenced the original sources. I also declare that I have adhered to all principles of
academic honesty and integrity and have not misrepresented or fabricated or falsified any
data/fact in my submission. I understand that any violation of the above will be cause for
disciplinary action by the Institute and can also evoke penal action from the sources which have
thus not been properly cited or from whom proper permission has not been taken when needed.

Name1: Shaikh Areeba Rais Ahmed

Name 2: Shaikh Mohd Yunus

Name 3: Ayaz Arif Tambe

Name 4: Syed Adeen Ahmed

Date: 18th April, 2024

Place: New Panvel

II
Acknowledgement

I consider myself lucky to have worked under the guidance of such talented and experienced
people who guided me all through the completion of my dissertation.

I express my deep sense of gratitude to my guide Mr. Pritam Pawar, Lecturer of the
Artificial Intelligence and Machine Learning Engineering Department, and Principal Prof.
Arif Shaikh, for their generous assistance, vast knowledge, experience, views and suggestions,
and for giving me their gracious support. I owe a lot to them for their invaluable guidance in
spite of their busy schedules.

I am grateful to Prof. Arif Shaikh, Principal for his support and co-operation and for
allowing me to pursue my Diploma Programme besides permitting me to use the laboratory
infrastructure of the Institute.

I am thankful to my H.O.D. Mr. Ali Karim Sayed and Ms. Nousheen Shaikh for their
support at various stages.

Last but not least, my thanks also go to the other staff members of the Artificial
Intelligence and Machine Learning Engineering Department, Anjuman-I-Islam’s Kalsekar
Polytechnic, Panvel, and the library staff for their assistance, useful views, and tips.

I also take this opportunity to thank my Friends for their support and encouragement at
every stage of my life.

Date: 18th April, 2024

III
ABSTRACT

Traditional chat applications primarily rely on text-based communication, potentially limiting
the user’s ability to express themselves comprehensively. Local Multimodal AI Chat
dismantles this barrier by incorporating multimodal functionalities. By integrating audio
processing, image understanding, and PDF interaction, the project expands the communication
channels available to users.

The project's core philosophy revolves around "learning by doing." Users delve into the process
of integrating different AI functionalities to create a versatile chat application capable of
handling diverse information formats. This includes audio transcription through Whisper AI,
image processing using LLaVA, and PDF interaction facilitated by Chroma DB. By exploring
these integrations, users gain a deeper understanding of how these AI models operate and how
they can synergistically contribute to a user-friendly chat experience.

The project serves as a launchpad for exploring the capabilities of various AI models. Users
gain insights into how Whisper AI transcribes audio input, transforming spoken words into text
for seamless integration within the chat interface. Similarly, they delve into how LLaVA
processes images, enabling the application to understand and potentially respond to visual
content shared in the chat. Additionally, the project sheds light on how Chroma DB facilitates
interaction with PDFs, allowing users to share and discuss document content directly within
the chat environment.


IV
List of Figures

Figure no. Title Page no.


4.1.1 Mistral7B Architecture diagram 12
4.1.2 Encoder and Decoder 12
4.1.3 Embedding in Mistral7B 12
4.1.4 LLaVA Architecture 14
4.1.5 Network Architecture 15

4.1.6 LLaVA-1.5 vs. GPT-4V comparison 16

V
List of Tables

Table no. Title Page no.


Table 1 Research Paper 8
Table 2 Cost Estimation 41

VI
Table of Contents

1 Introduction………………………………………………………………..01

1.1 Background and Objective……………………………………………01

2 Project Overview……………………………………………….………….03

3 Literature Review……………………………………………….……...….05

4 Methodology……………………………………………………….…...…..09

4.1 Algorithm Used…………………………………………………………09

4.2 Tools and technologies..………………………………………………...17

4.3 Development Framework.……………………………………………...18

5 Developed System………..…………………………………………………20

5.1 Project Requirement Details..………………………………………….20

5.2 High Level Design (HLD)..……………………………………………..21

5.3 Wireframe of FE & BE…..……………………………………………..23

5.4 Source Code of FE & BE..……...………………………………………25

5.5 Results & Reports……..………………………………………………..39

5.6 Cost Estimation…………..……………………………………………..43

6 Future Work……………..………………………………………………….44

7 Conclusion…………………..……………………………………………….46

8 Technical Event Participation Certificate..………………………………..47

9 International Journal Certificate……………..……………………………51

10 References & Bibliography ……………..…………………………………54


INTRODUCTION

Introduction

In today's world, communication is constantly evolving. We rely on chat applications to stay
connected with friends, family, and colleagues. However, traditional chat applications
primarily focus on text-based communication, potentially limiting the richness of interaction.
This final year project, titled "Local Multimodal AI Chat," tackles this challenge by delving
into the exciting realm of multimodal communication.

This project aims to bridge the gap between theoretical knowledge of Artificial Intelligence
(AI) and its practical application in building a user-friendly chat interface. It offers a hands-on
learning experience, empowering you to explore how various AI models can be integrated to
create a versatile chat application capable of handling diverse information formats.

The core philosophy behind this project is "learning by doing." You'll have the opportunity to
delve into the fascinating world of AI and software development by actively participating in
the creation of a multimodal chat application. Through this process, you'll gain a deeper
understanding of how AI models like Whisper, LLaVA, and Chroma DB operate and how they
can synergistically contribute to a more engaging and interactive chat experience.

Background and objectives

• Enable Seamless Multimodal Communication: Develop a chat application that
integrates audio, images, and PDFs within a unified interface using AI technologies like
Whisper AI, LLaVA, and Chroma DB.
• Enhance Accessibility and Usability: Ensure the application is inclusive and user-
friendly by optimizing the user interface and incorporating accessibility features.
• Optimize Performance and Efficiency: Minimize resource requirements and maximize
performance to deliver a smooth user experience across devices.
• Foster Collaboration and Innovation: Create an open platform for collaboration,
feedback, and contributions to accelerate the development and adoption of advanced
communication technologies.

• Empower Users with AI Capabilities: Provide users with a powerful yet accessible tool for
communication and collaboration, democratizing access to AI technologies for personal and
professional use.

1
Motivation

The idea and motivation to make this project came from increasing demand for advanced
communication tools that can seamlessly handle various types of data like texts, images, PDFs.
Imagine being able to create a chat interface that effortlessly transcribes audio lectures,
recognizes important information in images, and parses text from PDFs for quick reference
during study sessions. With the skills learned in this project, we have the potential to
revolutionize the way we interact with course materials, collaborate with peers, and even
communicate with professors.

2
Project Overview

Project Goal:

This project aims to develop a "Local Multimodal AI Chat" application, offering a hands-on
learning experience in building chat interfaces that integrate various AI models. The
application will enable users to communicate through diverse information formats beyond just
text, fostering a more interactive and engaging chat experience.

Key Functionalities:

o Audio Processing: The application will leverage Whisper AI to transcribe spoken
words into text, allowing users to send and receive voice messages.
o Image Understanding: By integrating LLaVA, the application will be able to
understand and potentially respond to images shared within the chat, enabling richer
communication.
o PDF Interaction: Utilizing Chroma DB, users can share and discuss PDF documents
directly in the chat environment, streamlining collaboration and information sharing.

Learning Approach:

The project prioritizes a "learn-by-doing" approach. Users will actively participate in building
the application, gaining practical experience in:

Integrating different AI models into a single chat interface.

Understanding the functionalities and capabilities of each AI model (Whisper, LLaVA,
Chroma DB).

Applying software development skills to create a user-friendly chat application.

Target Audience:

This project caters to individuals interested in:

o Artificial Intelligence (AI) and its applications in communication technology.


o Software development and building interactive applications.
o Exploring the potential of multimodal communication for enhanced chat experiences.

Project Scope:

3
This project focuses on developing a local, standalone chat application. It serves as a
foundation for further development, welcoming contributions from the user community for:

o Implementing new features to expand the functionalities of the chat application.


o Optimizing the code for improved performance and efficiency.
o Identifying and addressing any bugs encountered during usage.

Expected Outcome:

The successful completion of this project will result in:

o A functional Local Multimodal AI Chat application demonstrating the integration of
various AI models for multimodal communication.
o A valuable learning resource for individuals interested in AI and software development.
o A collaborative platform for further development and innovation within the domain of
multimodal chat applications.

4
Literature Review

Research Papers

Paper 1: Development of AI-based Voice Assistance using Large Language Model

Author: Varun Chennuri, Vamshi Prashnath Rodda

Published in: 2017 Published by Elsevier

ABSTRACT

Voice assistants have become an integral part of our daily lives, enabling natural and seamless
interactions with technology. Recent advancements in natural language processing (NLP) have
been fueled by Large Language Models (LLMs), such as GPT-3 and its successors. This
research explores the application of LLMs in voice assistants to enhance their language
understanding and response generation capabilities. The study presents a comprehensive
literature review, analyzing existing research on LLMs in the context of voice assistants. Our
research objectives aim to investigate the effectiveness of LLMs in understanding complex
user queries and generating contextually relevant responses. The methodology involves
training LLMs on extensive datasets, fine-tuning them for voice assistant tasks, and evaluating
their performance using standardized metrics. The experiments compare our LLM-based
approach with traditional voice assistant architectures, assessing the quality and efficiency of
responses. Results indicate a substantial improvement in language comprehension and
conversational quality when LLMs are integrated into the voice assistant framework. The
discussion elaborates on the strengths and limitations of the proposed LLM-based approach.
While LLMs show promising potential, challenges such as computational costs and ethical
considerations arise. Moreover, future research directions are proposed, including methods for
reducing model sizes and optimizing runtime performance. In conclusion, this research
establishes the viability of leveraging LLMs in voice assistants to advance their conversational
capabilities. The integration of LLMs opens new avenues for creating more intelligent and
context-aware voice assistants, revolutionizing the way users interact with voice-based
technologies. By sharing the codebase on a public repository, we aim to foster collaboration
and encourage further exploration in this rapidly evolving domain.

5
Paper2: Chat Bot for College Management System

Author: Delphin Lydia B

Published in: 2015 Published by Google Scholar

ABSTRACT

Nowadays, Artificial Intelligence is being used extensively in a wide range of sectors, from
product production to customer service in public relations. Artificial Intelligence (AI) chat bots
play a vital role in helping solve their problems in any aspects. So, we implemented a virtual
assistant based on AI that can deal with any query related to College Management System. A
chatbot uses information stored in its database to recognize phrases and make decisions on its
own in response to a query. The college inquiry chat-bot is built using the Rasa NLU
framework that analyzes user's queries by understanding user’s text message. The response
principle is matching the input sentence from a user. The college management system involves
public user portal and student/staff portal. It keeps track records of all the information regarding
students and the college. In the public portal, the user may use the chat-bot to ask any college-
related questions without having to physically visit the campus. The Bot analyses the query
and responds with a graphical user interface that makes it appear as though a real person is
conversing with the user. The system's accuracy is estimated to be 95% and the time it takes to
create responses corresponds to the number of lines of response.

6
Paper3: College Enquiry Chat Bot System with Text to Speech

Author: Pratiksha Chaundry, Sonal Lokhande, Vaishnavi Ghule

Published in: 2016 Published by Research Gate

ABSTRACT

Chat bots are intelligent systems that interpret and react to users' questions in their native
language. In a conversation, the chat bot reacts in the same way as a human would. It functions
as a virtual assistant, and its accuracy is assessed by determining a correlation between user
questions and chat bot responses. For a better user experience, the implemented Chat bot has
two modes: text mode and audio mode. It provides an interactive approach of answering
through voice messages when in audio mode. There is a long line at the inquiry window during
the Institute's Academic Admission procedure. Even more challenging is the situation for
parents who live in various cities, states, and nations. The purpose of this system is to give
students and parents a place to ask questions and get answers via easy English language text
messages or audio commands. Instead of queuing at an information desk to ask questions about
the admissions process, students and parents will collaborate with a bot. Artificial intelligence
(AI) and natural language processing (NLP) algorithms are used to create chatbots, which are
intelligent systems. It effectively interacts with users and responds to their questions.
Organizations, government groups, and non-profit associations are the most common users of
dialogue/conversation operators. These conversational experts work for a wide range of
businesses, from small start-ups to large corporations. There are a variety of code-based and
interface-based chatbot development platforms available on the market.

3.1. Research Papers

Sr. no   Paper Name                                              Author                                   Year

1        Development of AI-based Voice Assistance                Varun Chennuri,                          2017
         using Large Language Model                              Vamshi Prashnath Rodda

2        Chat Bot for College Management System                  Delphin Lydia B                          2015

3        College Enquiry Chat Bot System with                    Pratiksha Chaundry, Sonal Lokhande,      2016
         Text to Speech                                          Vaishnavi Ghule

Table 1. Research Papers

8
Methodology

4.1 Algorithm used

Mistral 7B: Power and Efficiency Under the Hood

Mistral 7B has taken the world of large language models (LLMs) by storm with its impressive
performance and efficiency. But how exactly does it achieve this? While the full inner workings
might be closely guarded secrets, we can explore the core algorithmic techniques that set
Mistral 7B apart.

The Transformer Foundation:

At its heart, Mistral 7B relies on the Transformer architecture, a powerful tool for processing
sequential data like text. Imagine a transformer as a sophisticated machine that can understand
the relationships between words in a sentence. This allows Mistral 7B to grasp the nuances of
language and perform tasks like question answering, code generation, and text summarization.
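
To make this concrete, the following short NumPy sketch shows the scaled dot-product attention operation at the heart of every Transformer layer. It is purely illustrative: Mistral 7B's real implementation uses learned projection weights, multiple heads, and optimized GPU kernels.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: arrays of shape (sequence_length, head_dim).
    # Each output position becomes a weighted mix of all value vectors,
    # where the weights measure how strongly its query matches every key.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the sequence
    return weights @ V                                   # blend the value vectors

tokens = np.random.randn(5, 8)                           # toy example: 5 tokens, 8-dim heads
print(scaled_dot_product_attention(tokens, tokens, tokens).shape)   # (5, 8)

This weighting over the whole sequence is what lets the model relate each word to every other word in its context.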

Key features of Mistral7B:

Mistral 7B has shown remarkable performance in different tests, surpassing models with more
parameters. It is particularly proficient in fields such as mathematics, code generation, and
reasoning. When Mistral 7B was launched, it surpassed the top open source 13B model (Llama
2) in all the tests conducted. The model is created to be easily adjusted for different tasks. As
an example, the Mistral 7B Instruct model is a specific version of the model that has been
modified to excel in conversational and question-answering tasks. The Mistral 7B has safety
measures to limit the output and filters out inappropriate content. It can be employed as a
content moderator to categorize user inputs into various groups such as unlawful actions,
offensive material, or unsolicited guidance. The model is available for use without any
limitations due to its release under the Apache 2.0 license. The model can be accessed on
various platforms such as HuggingFace, Vertex AI, Replicate, Sagemaker Jumpstart, and
Baseten. Nevertheless, as a large language model, Mistral 7B can create false scenarios and
has problems like other models, such as inserting prompts. The small number of parameters in
this model restricts the amount of knowledge it can store, making it less capable than larger
models.

9
Use cases of Mistral7B:

Mistral-7B-Instruct is a language model that has been specifically created to perform
exceptionally well in two main areas: English language tasks and coding tasks. The software's
open-source structure enables developers and organizations to adapt it to their specific
requirements and create personalized AI applications without encountering any limitations.
This adaptability allows for a broad range of uses, from complex customer service chatbots to
advanced code creation tools.

Some specific use cases of Mistral-7B include:

1. Automated Code Generation — Developers can automate code generation tasks using
Mistral-7B-Instruct. It understands and generates code snippets, offering immense
assistance in software development. This reduces manual coding effort and accelerates the
development cycle.
2. Debugging Assistance — Mistral-7B-Instruct assists in debugging by understanding code
logic, identifying errors, and recommending solutions, streamlining the debugging
process.
3. Algorithm Optimization — Mistral-7B-Instruct can suggest algorithm optimizations,
contributing to more efficient and faster software.
4. Text Summarization and Classification — Mistral-7B supports a variety of use cases, such
as text summarization, classification, text completion, and code completion.
5. Chat Use Cases — Mistral AI has released a Mistral 7B Instruct model for chat use cases,
fine-tuned using a variety of publicly available conversation datasets.
6. Knowledge Retrieval — Mistral 7B can be used for knowledge retrieval tasks, providing
accurate and detailed responses to queries.
7. Mathematics Accuracy — Mistral 7B reports strengths in mathematics accuracy,
providing comprehension for math logic.
8. Roleplay and Text Generation — Users have reported using Mistral 7B for roleplaying
RPG settings and generating blocks of text.
9. Natural Language Processing (NLP) — Some users have used Mistral 7B for NLP tasks
on documents to return JSON, finding it reliable enough for personal use.
The Mistral 7B Advantage:

Here's where things get interesting. Mistral 7B builds upon the Transformer foundation with
several key innovations that boost its efficiency and effectiveness:

1. Sliding Window Attention (SWA): Traditional transformers struggle with long sequences
of text. It's like trying to read a whole page at once – difficult to focus on
everything! SWA tackles this by introducing a "window." The model can only attend to
information within that window in a specific layer. But here's the clever part: as the model
progresses through layers, the window size increases cumulatively. This allows it to access
distant parts of the sequence, similar to how our eyes scan a long sentence, taking in
information across stretches. This is a critical innovation for handling complex inputs
without sacrificing speed.
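
The sliding window can be pictured as a banded attention mask. The sketch below is an illustration, not Mistral's actual code: within one layer each position may only look at itself and the previous few positions, while stacking layers lets information travel much further.

import numpy as np

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: position i may attend to positions j
    # with i - window < j <= i (causal and limited to the window).
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, max(0, i - window + 1):i + 1] = True
    return mask

print(sliding_window_mask(6, 3).astype(int))
# Each row has ones only in the 3 most recent columns; after L layers the
# effective reach grows to roughly L * window tokens.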

2. Grouped Query Attention (GQA): Standard attention mechanisms can be computationally
expensive, especially for large amounts of text. Imagine having to
compare every single word in a sentence to every other word – that's a lot of calculations!
GQA tackles this by grouping similar words together. This is like efficiently searching a
library by categorizing books. Instead of comparing every physics book to every history
book, you'd look within categories. GQA achieves similar efficiency gains in processing
text by reducing the number of comparisons.
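
A toy sketch of the grouping idea follows (illustrative shapes only; a real model uses learned projection matrices). Eight query heads share just two key/value heads, so the number of key/value projections, and the size of the cache that stores them, shrinks fourfold.

import numpy as np

def grouped_query_attention(seq_len=4, n_q_heads=8, n_kv_heads=2, head_dim=16):
    rng = np.random.default_rng(0)
    q = rng.standard_normal((n_q_heads, seq_len, head_dim))   # one set per query head
    k = rng.standard_normal((n_kv_heads, seq_len, head_dim))  # shared key heads
    v = rng.standard_normal((n_kv_heads, seq_len, head_dim))  # shared value heads
    group = n_q_heads // n_kv_heads
    outputs = []
    for h in range(n_q_heads):
        kv = h // group                                       # which shared KV head to reuse
        scores = q[h] @ k[kv].T / np.sqrt(head_dim)
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        outputs.append(weights @ v[kv])
    return np.stack(outputs)                                  # (n_q_heads, seq_len, head_dim)

print(grouped_query_attention().shape)                        # (8, 4, 16)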

3. Rolling Buffer Cache: Attention mechanisms often require storing intermediate calculations
for later use. But this can strain memory resources, especially for long
sequences. The Rolling Buffer Cache solves this by utilizing a fixed-size cache. It stores
key information from previous processing steps, like keys and values. The model checks
the cache first for needed information. If not found, it calculates it and stores it in the cache,
replacing the least recently used entry. This ensures efficient use of memory without
sacrificing essential data.
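
The idea can be sketched in a few lines of Python (a toy illustration, not the model's actual cache): slots are indexed modulo the window size, so memory stays constant and the newest entry silently replaces the oldest one.

class RollingKVCache:
    def __init__(self, window):
        self.window = window
        self.slots = [None] * window          # each slot holds (position, key_value)

    def store(self, position, key_value):
        self.slots[position % self.window] = (position, key_value)

    def lookup(self, position):
        entry = self.slots[position % self.window]
        if entry is not None and entry[0] == position:
            return entry[1]                   # still cached
        return None                           # evicted or never stored: recompute it

cache = RollingKVCache(window=4)
for pos in range(6):                          # six positions, only four slots
    cache.store(pos, f"kv_{pos}")
print(cache.lookup(5))                        # 'kv_5'  -> recent entry still available
print(cache.lookup(0))                        # None    -> overwritten by position 4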

4. Pre-filling and Chunking: Text generation tasks involve predicting the next word based
on the current context. Standard models do this one word at a time. Pre-filling and
chunking improve this process. The model predicts several words (a chunk) at once,
considering the preceding context. This is like a writer outlining key points before writing
a paragraph. It reduces the number of prediction steps and speeds up generation.
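
A minimal sketch of the chunked pre-fill loop follows; the model and cache here are stand-ins, not a real API. The prompt is pushed through the model a chunk at a time to fill the cache before word-by-word generation begins.

def prefill_in_chunks(prompt_tokens, chunk_size, cache, model_forward):
    # model_forward processes one chunk while attending to everything already cached.
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        cache = model_forward(chunk, cache)
    return cache                              # ready for token-by-token generation

toy_forward = lambda chunk, cache: cache + chunk    # stand-in "model": just records tokens
print(prefill_in_chunks(list(range(10)), chunk_size=4, cache=[], model_forward=toy_forward))
# Processed in chunks of 4, 4 and 2 instead of ten single-token steps.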

By combining these techniques, Mistral 7B achieves impressive results. It can handle complex tasks
with high accuracy while maintaining faster processing speeds compared to some other large
language models. This makes it a valuable tool for various applications.

11
Figure No. 4.1.1 Mistral 7B Architecture diagram

Figure No.4.1.2 Encoder and Decoder

Figure No. 4.1.3 Embedding in Mistral7B

12
LLaVA Model for image handling

While the exact workings of LLaVA might be under development or private, here's a possible
breakdown of how it could handle image I/O:

1. Image Input Channels:

LLaVA might be able to accept images through various channels depending on its
implementation:

⚫ User Upload: Users could directly upload images through a user interface (UI) element
like a file selector.

⚫ External URLs: LLaVA might have the capability to fetch images from external sources
if provided with a URL. This could be useful for tasks like analyzing product images on
e-commerce websites.

⚫ API integration: It’s also possible for LLaVa to integrate with external APIs that provide
image data. This could allow the bot to access images from other applications.

2. Preprocessing for Efficiency:

Once LLaVA receives an image, it might perform some preprocessing steps to optimize it for
Mistral 7B; a rough code sketch follows the list below:

⚫ Resizing: LLaVA could resize the image to a standard size suitable for Mistral 7B's
processing. This reduces computational cost without sacrificing essential details.

⚫ Color Balancing: Depending on the model's capabilities, LLaVA might adjust the image's
color balance for better feature extraction.

⚫ Format Conversion: LLaVA might convert the image to a specific format that Mistral 7B
expects for efficient processing.
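
The handlers in Chapter 5 simply base64-encode the raw image bytes, but a preprocessing step of the kind described above could look like the following sketch. Pillow is assumed to be available, and the 336x336 target size and JPEG re-encoding are illustrative choices, not requirements of LLaVA.

import base64
from io import BytesIO
from PIL import Image   # assumed dependency for this sketch (pip install pillow)

def preprocess_image(image_bytes, target_size=(336, 336)):
    image = Image.open(BytesIO(image_bytes))
    image = image.convert("RGB")              # normalize the color mode
    image = image.resize(target_size)         # shrink to a standard size for cheaper processing
    buffer = BytesIO()
    image.save(buffer, format="JPEG")         # convert to one consistent format
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return "data:image/jpeg;base64," + encoded   # data URL, like the one built in image_handler.py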

3. Feature Extraction - Understanding the Image:

This is where LLaVA's core functionality comes in: analyzing the image and extracting key
features. The specific techniques might involve:

⚫ Object Detection: LLaVA could identify objects present in the image. This could involve
using pre-trained object detection models like YOLO or SSD.

13
⚫ Scene Understanding: It might be able to classify the overall scene depicted in the image
(e.g., a beach, a living room).

⚫ Landmark Recognition: In some cases, LLaVA could even recognize specific landmarks
or locations within the image.

4. Output for Mistral 7B:

After preprocessing and feature extraction, LLaVA likely prepares the information for Mistral
7B. This could involve:

⚫ Feature Tensors: LLaVA might represent the extracted features as tensors, a
multidimensional data structure that Mistral 7B can understand and process effectively.

⚫ Descriptive Text: Additionally, LLaVA could generate a textual description of the image
content. This can provide context for Mistral 7B, especially if some features are complex
or nuanced.

Overall, LLaVA acts as a bridge between the visual world and the language processing
capabilities of Mistral 7B. By handling image I/O, preprocessing, and feature extraction,
LLaVA empowers Mistral 7B to perform insightful tasks on image data.

It's important to remember that this is a simplified explanation based on publicly available
information about LLaVA. The actual implementation details might differ based on the specific
model version and its development progress.

Figure No. 4.1.4 LLAVA Architecture

14
Figure No. 4.1.5 Network Architecture

LLaVA and Mistral 7B Working Together: A Powerful Image-Text Combo

Here's a breakdown of how LLaVA and Mistral 7B can work together in a bot, along with their
individual functionalities:

LLaVA: The Image I/O Specialist

LLaVA (likely referring to the Multimodal Open-Source LLM) excels at image recognition
and understanding. It acts as the "eyes" of your bot, performing these key tasks:

⚫ Image Input/Output (I/O): LLaVA can handle receiving images from users, potentially
through uploads or other channels. It might also be able to retrieve images from external
sources based on instructions.

⚫ Image Preprocessing: LLaVA might preprocess the image by resizing, adjusting color
balance, or performing other optimizations for efficient processing by Mistral 7B.

⚫ Feature Extraction: LLaVA extracts key features from the image. This could involve
identifying objects, scenes, or other relevant information within the image.

15
Figure No. 4.1.6 LLaVA-1.5 vs. GPT-4V comparison

Mistral 7B: The Language Powerhouse

Mistral 7B, the powerful LLM, takes center stage for language processing and task execution:

• Understanding User Queries: Mistral 7B analyzes the user's text input alongside the
information extracted from the image by LLaVA. It leverages its natural language
processing (NLP) capabilities to understand the user's intent and the context
surrounding the image.
• Information Retrieval or Action: Based on the combined understanding of text and
image, Mistral 7B can perform various actions:
• Information Retrieval: If the user is asking a question about the image (e.g., "What kind
of animal is this?"), Mistral 7B can access a knowledge base or search engine to answer
the question based on LLaVA's image analysis.
• Action Execution: It could also trigger actions based on the image and text. For
instance, if the image shows a product and the user asks to buy it, Mistral 7B could
initiate a purchase process.

The Teamwork Flow

1. User Input: The user provides an image and a text query (question or instruction) to the
bot.

16
2. LLaVA Takes Charge: LLaVA handles image I/O, preprocessing, and feature extraction.

3. Information to Mistral 7B: LLaVA provides Mistral 7B with the processed image data and
any relevant features.

4. Mistral 7B in Action: Mistral 7B analyzes the user's text query alongside the image
information from LLaVA.

5. Output or Action: Based on its understanding, Mistral 7B generates a response (answering
a question), retrieves information, or initiates an action related to the image and text.

Overall, this combination creates a powerful image-text understanding system. LLaVA bridges
the gap between visual data and Mistral 7B's language processing capabilities.
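
In the implementation described in Chapter 5, both roles are actually served by a single llama-cpp-python model with a LLaVA chat handler, but the division of labour can be illustrated with simple stand-in objects. Everything below is a mock sketch of the data flow, not the project's real API.

class StubVisionModel:
    # Stand-in for LLaVA: pretends to turn pixels into a textual description.
    def describe(self, image_bytes):
        return "a photo of a dog standing on a beach"

class StubLanguageModel:
    # Stand-in for Mistral 7B: would generate an answer from the fused prompt.
    def generate(self, prompt):
        return "(Mistral 7B would answer here, given:)\n" + prompt

def answer_about_image(image_bytes, user_question, vision_model, language_model):
    # Steps 2-5 of the teamwork flow: image evidence -> fused prompt -> answer.
    image_caption = vision_model.describe(image_bytes)              # LLaVA's contribution
    prompt = ("Image description: " + image_caption + "\n"
              "User question: " + user_question + "\nAnswer:")
    return language_model.generate(prompt)                          # Mistral 7B's contribution

print(answer_about_image(b"...", "What animal is this?",
                         StubVisionModel(), StubLanguageModel()))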

4.2 Tools and Technologies

Tools:

• Large Language Model (LLM):


o Tool: Mistral 7B
o Terminology: Transformer architecture, attention mechanism, pre-training,
text generation, question answering, code generation, natural language
processing (NLP)
• Multimodal Open-Source LLM (likely):
o Tool: LLaVA
o Terminology: Image I/O (input/output), image preprocessing, feature
extraction, object detection, scene understanding, landmark recognition
• Speech Recognition Model:
o Tool: Whisper
o Terminology: Automatic Speech Recognition (ASR), beam search, language
modeling, word error rate (WER)

17
4.3 Development Framework

Framework Considerations:

Several factors influence the choice of a development framework:

• Functionality: The desired features of the chatbot will guide the framework selection.
Frameworks like TensorFlow or PyTorch offer flexibility for complex models and
research purposes. For more streamlined development, consider user-friendly options
like Rasa or Dialogflow.
• Ease of Use: If you prioritize ease of use and rapid development, consider Rasa or
Dialogflow. These frameworks offer pre-built components for chatbot development,
including dialogue management and natural language understanding (NLU)
capabilities.
• Integration: The chosen framework should integrate well with the selected tools
(Mistral 7B, LLaVA, Whisper). Libraries like Transformers (for LLM integration) and
torchaudio (for audio processing with PyTorch) can bridge the gap.
• Scalability: If you anticipate future growth and expansion of the chatbot, consider
frameworks like TensorFlow or PyTorch that offer scalability for handling larger
datasets and more complex models.

Additional Considerations
• User Interface (UI): Develop a user-friendly interface for interacting with the chatbot.
Frameworks like Qt or Flutter can be helpful for building cross-platform UIs.
• Deployment: Choose a deployment strategy based on your needs. Cloud platforms like
Google Cloud or Amazon Web Services (AWS) offer deployment options for chatbots
built with various frameworks.

18
Developed System

5.1 Project Requirement Details

Project Requirements for Conversational AI System

This section outlines the functionalities and features envisioned for this conversational AI
project.

Overall Objective:
• Develop a conversational AI system that allows users to interact naturally through text,
voice commands, and potentially images.
Functional Requirements:
• Conversational AI (Mistral 7b LLM):
o Understand and respond to user queries and prompts in a natural language.
o Generate different creative text formats, if applicable (e.g., poems, code,
scripts).
o Access and process information from the real world through Google Search or
other APIs (if desired).
• Image Handling (Llava):
o Recognize and process images uploaded by the user.
o Generate captions or descriptions for images.
o Potentially perform additional image manipulation tasks depending on project
goals (e.g., object detection, style transfer).
• PDF Processing:
o Upload and parse PDF documents.
o Extract text and relevant information from the PDF content.
o Answer user questions about the information contained within the PDF.
• Speech Interaction:
o Enable users to interact with the AI through voice commands or speech
recognition.
o Convert spoken language to text for processing by the conversational AI.
• User Interface (Streamlit):
o Design a user-friendly interface for text input, microphone interaction, and
potentially image upload .

19
o Display conversation history and AI responses in a clear and organized manner.
o Integrate functionalities for uploading and interacting with PDFs.

5.2 High Level Design (HLD)

High-Level Design (HLD) for Conversational AI System

System Architecture:
1. User Interface (Streamlit):
o This is the entry point for users to interact with the system.
o It will provide functionalities for:
a. Text input for user queries and prompts.
b. Microphone access for speech interaction.
c. Uploading images.
d. Uploading and interacting with PDFs.
e. Displaying conversation history and AI responses.
2. Natural Language Processing (NLP) Module:
o This module will handle processing user input (text or speech).
o It will likely include functionalities for:
a. Text pre-processing (cleaning, tokenization).
b. Intent recognition (understanding user's purpose).
c. Entity recognition (identifying relevant information in user input).
3. Conversational AI Core (Mistral 7b LLM):
o This is the core of the system, responsible for generating responses.
o It will receive processed user input from the NLP module.
o It will leverage its capabilities to:
a. Understand the user's intent and context.
b. Access and process relevant information (potentially through Google
Search APIs).
c. Generate natural language responses to user queries.
4. Image Processing Module (Llava) (Optional):
o This module will handle image data uploaded by the user.
o It will utilize Llava's functionalities for:
a. Image recognition (identifying objects or scenes).

20
b. Caption generation (describing the content of the image).
c. Potentially performing other image manipulation tasks as needed.
5. PDF Processing Module (Optional):
o This module will handle uploaded PDF documents.
o It will utilize functionalities like:
a. Parsing the PDF structure and extracting text.
b. Identifying relevant information based on user queries.
c. Providing answers or summaries of information within the PDF.
6. Output Interface:
o This module will handle presenting the AI's response to the user.
o It will integrate with the Streamlit UI to display:
a. Textual responses from the conversational AI core.
b. Generated captions or descriptions for images.
c. Extracted information or summaries from PDFs .
Component Interaction:
1. User interacts with the Streamlit UI through text input, voice commands, image
uploads, or PDF uploads.
2. Streamlit pre-processes user input and forwards it to the NLP module.
3. NLP module processes the user input (text or speech) and extracts key information
(intent and entities).
4. NLP module sends the processed information to the Conversational AI Core (Mistral
7b LLM).
5. Conversational AI Core generates a response based on the received information and
potentially accesses additional information through APIs.
6. The response is sent back to the Streamlit UI.
7. Streamlit displays the text response and integrates any additional outputs from Image
Processing (captions) or PDF Processing (information summaries) for a comprehensive
user experience.
Benefits of this HLD:
• Provides a clear understanding of how different modules work together.
• Highlights the role of each component in achieving the overall objective.
• Allows for modular development and potential future expansion of functionalities.

21
5.3 Wireframe of FE & BE

Wireframe of Front-End (FE) and Back-End (BE)

Front-End (FE):

The FE focuses on the user interface elements and how users interact with the system. Here's
a possible wireframe:

• Main Screen:
o Text input field for user queries and prompts.
o (Optional) Microphone button for speech interaction.
o (Optional) Image upload button.
o (Optional) PDF upload button.
o Display area for conversation history, including:
▪ User prompts and questions.
▪ AI responses and generated text.
▪ (Optional) Image captions or descriptions (if applicable).
▪ (Optional) Extracted information summaries from PDFs (if applicable).
Back-End (BE):
• User Input Processing:
o Receives user input from the FE (text, speech, image, or PDF).
o Pre-processes text or speech input (cleaning, tokenization).
o (Optional) Sends image data to the Image Processing Module (if applicable).
o (Optional) Sends PDF data to the PDF Processing Module (if applicable).
o Sends processed text data (intent and entities) to the NLP Module.

• Natural Language Processing (NLP) Module:


o Analyzes the user's intent and identifies relevant entities (BE component).
o Communicates with the Conversational AI Core.
• Conversational AI Core (Mistral 7b LLM):
o Receives processed user input from the NLP module (BE component).
o Generates natural language responses based on its training data.
o (Optional) Accesses and processes external information through APIs (BE
component).

22
o Sends the generated response back to the BE.
• Output Processing:
o Receives the AI's response from the BE.
o Sends the response and any additional outputs (captions, summaries) to the FE
for display.
Interaction Flow:
1. User interacts with the FE by entering text, using voice commands (if applicable),
uploading images (if applicable), or uploading PDFs (if applicable).
2. FE pre-processes user input (text or speech) and sends it to the BE.
3. BE further processes the user input and potentially sends it to additional modules
(Image Processing or PDF Processing) depending on the user's action.
4. BE interacts with the NLP Module and the Conversational AI Core to generate a
response.
5. BE sends the response and any additional outputs back to the FE.
6. FE displays the response and integrates any additional outputs from Image Processing or PDF
Processing for the user.

5.4 Source Code of FE & BE


Main.py

import streamlit as st

from llm_chains import load_normal_chain, load_pdf_chat_chain


from streamlit_mic_recorder import mic_recorder
from utils import get_timestamp, load_config, get_avatar
from image_handler import handle_image
from audio_handler import transcribe_audio
from pdf_handler import add_documents_to_db
from html_templates import css
from database_operations import (load_last_k_text_messages, save_text_message,
                                 save_image_message, save_audio_message, load_messages,
                                 get_all_chat_history_ids, delete_chat_history)
import sqlite3
config = load_config()

@st.cache_resource
def load_chain():
if st.session_state.pdf_chat:
print("loading pdf chat chain")
return load_pdf_chat_chain()
return load_normal_chain()

def toggle_pdf_chat():
st.session_state.pdf_chat = True
clear_cache()

def get_session_key():
if st.session_state.session_key == "new_session":
st.session_state.new_session_key = get_timestamp()
return st.session_state.new_session_key
return st.session_state.session_key

def delete_chat_session_history():
delete_chat_history(st.session_state.session_key)
st.session_state.session_index_tracker = "new_session"

def clear_cache():
st.cache_resource.clear()

def main():
st.title(" Cleora")
st.write(css, unsafe_allow_html=True)

if "db_conn" not in st.session_state:


st.session_state.session_key = "new_session"
st.session_state.new_session_key = None
st.session_state.session_index_tracker = "new_session"
        st.session_state.db_conn = sqlite3.connect(config["chat_sessions_database_path"],
                                                   check_same_thread=False)
st.session_state.audio_uploader_key = 0
st.session_state.pdf_uploader_key = 1
    if st.session_state.session_key == "new_session" and st.session_state.new_session_key != None:
        st.session_state.session_index_tracker = st.session_state.new_session_key
        st.session_state.new_session_key = None

st.sidebar.title("Chat Sessions")
chat_sessions = ["new_session"] + get_all_chat_history_ids()

index = chat_sessions.index(st.session_state.session_index_tracker)
st.sidebar.selectbox("Select a chat session", chat_sessions,
key="session_key", index=index)
pdf_toggle_col, voice_rec_col = st.sidebar.columns(2)
pdf_toggle_col.toggle("PDF Chat", key="pdf_chat", value=False)
with voice_rec_col:
        voice_recording = mic_recorder(start_prompt="Record Audio",
                                       stop_prompt="Stop recording", just_once=True)

delete_chat_col, clear_cache_col = st.sidebar.columns(2)
delete_chat_col.button("Delete Chat Session",
on_click=delete_chat_session_history)
clear_cache_col.button("Clear Cache", on_click=clear_cache)

chat_container = st.container()
user_input = st.chat_input("Type your message here", key="user_input")

    uploaded_audio = st.sidebar.file_uploader("Upload an audio file",
                                              type=["wav", "mp3", "ogg"],
                                              key=st.session_state.audio_uploader_key)
    uploaded_image = st.sidebar.file_uploader("Upload an image file",
                                              type=["jpg", "jpeg", "png"])
    uploaded_pdf = st.sidebar.file_uploader("Upload a pdf file",
                                            accept_multiple_files=True,
                                            key=st.session_state.pdf_uploader_key,
                                            type=["pdf"], on_change=toggle_pdf_chat)

if uploaded_pdf:
with st.spinner("Processing pdf..."):
add_documents_to_db(uploaded_pdf)
st.session_state.pdf_uploader_key += 2

if uploaded_audio:
transcribed_audio = transcribe_audio(uploaded_audio.getvalue())
print(transcribed_audio)
llm_chain = load_chain()
llm_answer = llm_chain.run(user_input = "Summarize this text: " +
transcribed_audio, chat_history=[])
save_audio_message(get_session_key(), "human",
uploaded_audio.getvalue())
save_text_message(get_session_key(), "ai", llm_answer)
st.session_state.audio_uploader_key += 2

if voice_recording:
transcribed_audio = transcribe_audio(voice_recording["bytes"])
print(transcribed_audio)
llm_chain = load_chain()
        llm_answer = llm_chain.run(user_input=transcribed_audio,
                                   chat_history=load_last_k_text_messages(get_session_key(),
                                       config["chat_config"]["chat_memory_length"]))
save_audio_message(get_session_key(), "human",
voice_recording["bytes"])
save_text_message(get_session_key(), "ai", llm_answer)

if user_input:
if uploaded_image:

with st.spinner("Processing image..."):
llm_answer = handle_image(uploaded_image.getvalue(), user_input)
save_text_message(get_session_key(), "human", user_input)
save_image_message(get_session_key(), "human",
uploaded_image.getvalue())
save_text_message(get_session_key(), "ai", llm_answer)
user_input = None

if user_input:
llm_chain = load_chain()
llm_answer = llm_chain.run(user_input = user_input,
chat_history=load_last_k_text_messages(
get_session_key(), config["chat_config"]["chat_memory_length"]))
save_text_message(get_session_key(), "human", user_input)
save_text_message(get_session_key(), "ai", llm_answer)
user_input = None

    if (st.session_state.session_key != "new_session") != (st.session_state.new_session_key != None):
with chat_container:
chat_history_messages = load_messages(get_session_key())

for message in chat_history_messages:


with st.chat_message(name=message["sender_type"],
avatar=get_avatar(message["sender_type"])):
if message["message_type"] == "text":
st.write(message["content"])
if message["message_type"] == "image":
st.image(message["content"])
if message["message_type"] == "audio":
st.audio(message["content"], format="audio/wav")

    if (st.session_state.session_key == "new_session") and (st.session_state.new_session_key != None):
        st.rerun()

if __name__ == "__main__":
main()

audio_handler.py

import torch
from transformers import pipeline
import librosa
import io
from utils import load_config

config = load_config()

def convert_bytes_to_array(audio_bytes):
audio_bytes = io.BytesIO(audio_bytes)
audio, sample_rate = librosa.load(audio_bytes)
print(sample_rate)
return audio

def transcribe_audio(audio_bytes):
#device = "cuda:0" if torch.cuda.is_available() else "cpu"
device = "cpu"
pipe = pipeline(
task="automatic-speech-recognition",
model=config["whisper_model"],
chunk_length_s=30,
device=device,
)

audio_array = convert_bytes_to_array(audio_bytes)
prediction = pipe(audio_array, batch_size=1)["text"]

return prediction

image_handler.py

from llama_cpp import Llama


from llama_cpp.llama_chat_format import Llava15ChatHandler
import base64
from utils import load_config
#import streamlit as st
config = load_config()

def convert_bytes_to_base64(image_bytes):
encoded_string= base64.b64encode(image_bytes).decode("utf-8")
return "data:image/jpeg;base64," + encoded_string

#@st.cache_resource # can be cached if you use it often


def load_llava():
    chat_handler = Llava15ChatHandler(clip_model_path=config["llava_model"]["clip_model_path"])
llm = Llama(
model_path=config["llava_model"]["llava_model_path"],
chat_handler=chat_handler,
logits_all=True,
n_ctx=1024 # n_ctx should be increased to accomodate the image embedding
)
return llm

def handle_image(image_bytes, user_message):

llava = load_llava()
image_base64 = convert_bytes_to_base64(image_bytes)

output = llava.create_chat_completion(
messages = [
{"role": "system", "content": "You are an assistant who perfectly
describes images."},
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": image_base64}},
{"type" : "text", "text": user_message}
]
}
]
)
print(output)
return output["choices"][0]["message"]["content"]

pdf_handler.py

from langchain.text_splitter import RecursiveCharacterTextSplitter


from langchain.schema.document import Document
from llm_chains import load_vectordb, create_embeddings
from utils import load_config
import pypdfium2
config = load_config()

def get_pdf_texts(pdfs_bytes_list):
return [extract_text_from_pdf(pdf_bytes.getvalue()) for pdf_bytes in
pdfs_bytes_list]

def extract_text_from_pdf(pdf_bytes):
pdf_file = pypdfium2.PdfDocument(pdf_bytes)
    return "\n".join(pdf_file.get_page(page_number).get_textpage().get_text_range()
                     for page_number in range(len(pdf_file)))

def get_text_chunks(text):
    splitter = RecursiveCharacterTextSplitter(chunk_size=config["pdf_text_splitter"]["chunk_size"],
                                              chunk_overlap=config["pdf_text_splitter"]["overlap"],
                                              separators=config["pdf_text_splitter"]["separators"])
return splitter.split_text(text)

def get_document_chunks(text_list):
documents = []
for text in text_list:
for chunk in get_text_chunks(text):
documents.append(Document(page_content = chunk))
return documents

def add_documents_to_db(pdfs_bytes):
texts = get_pdf_texts(pdfs_bytes)
documents = get_document_chunks(texts)
vector_db = load_vectordb(create_embeddings())
vector_db.add_documents(documents)
print("Documents added to db.")

llm_chains.py

from prompt_templates import memory_prompt_template, pdf_chat_prompt


from langchain.chains import LLMChain
from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate
from langchain_community.llms import CTransformers
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from operator import itemgetter
from utils import load_config
import chromadb

config = load_config()

def load_ollama_model():
llm = Ollama(model=config["ollama_model"])
return llm

def create_llm(model_path = config["ctransformers"]["model_path"]["large"],


model_type = config["ctransformers"]["model_type"], model_config =
config["ctransformers"]["model_config"]):
llm = CTransformers(model=model_path, model_type=model_type,
config=model_config)
return llm

def create_embeddings(embeddings_path = config["embeddings_path"]):


return HuggingFaceInstructEmbeddings(model_name=embeddings_path)

def create_chat_memory(chat_history):
return ConversationBufferWindowMemory(memory_key="history",
chat_memory=chat_history, k=3)

def create_prompt_from_template(template):
return PromptTemplate.from_template(template)

def create_llm_chain(llm, chat_prompt):


return LLMChain(llm=llm, prompt=chat_prompt)

def load_normal_chain():
return chatChain()

def load_vectordb(embeddings):
    persistent_client = chromadb.PersistentClient(config["chromadb"]["chromadb_path"])

langchain_chroma = Chroma(
client=persistent_client,
collection_name=config["chromadb"]["collection_name"],
embedding_function=embeddings,
)

return langchain_chroma

def load_pdf_chat_chain():
return pdfChatChain()

def load_retrieval_chain(llm, vector_db):


return RetrievalQA.from_llm(llm=llm,
retriever=vector_db.as_retriever(search_kwargs={"k":
config["chat_config"]["number_of_retrieved_documents"]}), verbose=True)

def create_pdf_chat_runnable(llm, vector_db, prompt):


runnable = (
{
"context": itemgetter("human_input") |
vector_db.as_retriever(search_kwargs={"k":
config["chat_config"]["number_of_retrieved_documents"]}),
"human_input": itemgetter("human_input"),
"history" : itemgetter("history"),
}
| prompt | llm.bind(stop=["Human:"])
)
return runnable

class pdfChatChain:

def __init__(self):
vector_db = load_vectordb(create_embeddings())
llm = create_llm()
#llm = load_ollama_model()
prompt = create_prompt_from_template(pdf_chat_prompt)
self.llm_chain = create_pdf_chat_runnable(llm, vector_db, prompt)

def run(self, user_input, chat_history):


print("Pdf chat chain is running...")
return self.llm_chain.invoke(input={"human_input" : user_input,
"history" : chat_history})

class chatChain:

def __init__(self):
llm = create_llm()
#llm = load_ollama_model()
chat_prompt = create_prompt_from_template(memory_prompt_template)
self.llm_chain = create_llm_chain(llm, chat_prompt)

def run(self, user_input, chat_history):


return self.llm_chain.invoke(input={"human_input" : user_input,
"history" : chat_history} ,stop=["Human:"])["text"]

database_operations.py

from utils import load_config


import streamlit as st
import sqlite3
config = load_config()

def get_db_connection():
return st.session_state.db_conn

def get_db_cursor(db_connection):
return db_connection.cursor()

def get_db_connection_and_cursor():
conn = get_db_connection()
return conn, conn.cursor()

def close_db_connection():
if 'db_conn' in st.session_state and st.session_state.db_conn is not None:
st.session_state.db_conn.close()
st.session_state.db_conn = None

def save_text_message(chat_history_id, sender_type, text):
conn, cursor = get_db_connection_and_cursor()

    cursor.execute('INSERT INTO messages (chat_history_id, sender_type, message_type, text_content) VALUES (?, ?, ?, ?)',
                   (chat_history_id, sender_type, 'text', text))

conn.commit()

def save_image_message(chat_history_id, sender_type, image_bytes):


conn, cursor = get_db_connection_and_cursor()

    cursor.execute('INSERT INTO messages (chat_history_id, sender_type, message_type, blob_content) VALUES (?, ?, ?, ?)',
                   (chat_history_id, sender_type, 'image', sqlite3.Binary(image_bytes)))

conn.commit()

def save_audio_message(chat_history_id, sender_type, audio_bytes):


conn, cursor = get_db_connection_and_cursor()

    cursor.execute('INSERT INTO messages (chat_history_id, sender_type, message_type, blob_content) VALUES (?, ?, ?, ?)',
                   (chat_history_id, sender_type, 'audio', sqlite3.Binary(audio_bytes)))

conn.commit()

def load_messages(chat_history_id):
conn, cursor = get_db_connection_and_cursor()

query = "SELECT message_id, sender_type, message_type, text_content,


blob_content FROM messages WHERE chat_history_id = ?"
cursor.execute(query, (chat_history_id,))

messages = cursor.fetchall()
chat_history = []
for message in messages:
        message_id, sender_type, message_type, text_content, blob_content = message

if message_type == 'text':
chat_history.append({'message_id': message_id, 'sender_type':
sender_type, 'message_type': message_type, 'content': text_content})
else:
chat_history.append({'message_id': message_id, 'sender_type':
sender_type, 'message_type': message_type, 'content': blob_content})

return chat_history

def load_last_k_text_messages(chat_history_id, k):


conn, cursor = get_db_connection_and_cursor()

query = """
SELECT message_id, sender_type, message_type, text_content
FROM messages
WHERE chat_history_id = ? AND message_type = 'text'
ORDER BY message_id DESC
LIMIT ?
"""
cursor.execute(query, (chat_history_id, k))

messages = cursor.fetchall()
chat_history = []
for message in reversed(messages):
message_id, sender_type, message_type, text_content = message

chat_history.append({
'message_id': message_id,
'sender_type': sender_type,
'message_type': message_type,
'content': text_content
})

return chat_history

def get_all_chat_history_ids():
conn, cursor = get_db_connection_and_cursor()

query = "SELECT DISTINCT chat_history_id FROM messages ORDER BY


chat_history_id ASC"
cursor.execute(query)

chat_history_ids = cursor.fetchall()
chat_history_id_list = [item[0] for item in chat_history_ids]

return chat_history_id_list

def delete_chat_history(chat_history_id):
conn, cursor = get_db_connection_and_cursor()

query = "DELETE FROM messages WHERE chat_history_id = ?"


cursor.execute(query, (chat_history_id,))
conn.commit()

    print(f"All entries with chat_history_id {chat_history_id} have been deleted.")

def init_db():
db_path = config["chat_sessions_database_path"]

conn = sqlite3.connect(db_path)
cursor = conn.cursor()

create_messages_table = """
CREATE TABLE IF NOT EXISTS messages (
message_id INTEGER PRIMARY KEY AUTOINCREMENT,
chat_history_id TEXT NOT NULL,
sender_type TEXT NOT NULL,
message_type TEXT NOT NULL,
text_content TEXT,
blob_content BLOB
);
"""

cursor.execute(create_messages_table)
conn.commit()
conn.close()

if __name__ == "__main__":
init_db()

html_templates.py

css ="""
<style>
/* User Chat Message */

.st-emotion-cache-janbn0 {
background-color: #967bb6;
}

/* AI Chat Message */

.st-emotion-cache-4oy321 {
background-color: #475063;
}

section[data-testid="stSidebar"] {
width: 380px !important;
}

</style>
"""

prompt_templates.py

memory_prompt_template = """<s>[INST] You are an AI chatbot having a


conversation with a human, You can use swear words to answer his
questions.[/INST]
Previous conversation: {history}
Human: {human_input}
AI:"""

pdf_chat_prompt = """<s>[INST] Answer the user question based on the given


context. Also consider the chat history.[/INST]
Chat history: {history}
Context: {context}
Question: {human_input}
Answer:"""

Utils.py

import json
from langchain.schema.messages import HumanMessage, AIMessage
from datetime import datetime
import yaml

def load_config():
with open("config.yaml", "r") as f:
return yaml.safe_load(f)

def save_chat_history_json(chat_history, file_path):


with open(file_path, "w") as f:
json_data = [message.dict() for message in chat_history]
json.dump(json_data, f)

def load_chat_history_json(file_path):
with open(file_path, "r") as f:
json_data = json.load(f)
messages = [HumanMessage(**message) if message["type"] == "human" else
AIMessage(**message) for message in json_data]
return messages

def get_timestamp():
return datetime.now().strftime("%Y-%m-%d %H:%M:%S")

def get_avatar(sender_type):
if sender_type == "human":

return "chat_icons/user_image.png"
else:
return "chat_icons/bot_image.png"

config.yaml

ctransformers:
model_path:
small: "./models/mistral-7b-instruct-v0.1.Q3_K_M.gguf"
large: "./models/mistral-7b-instruct-v0.1.Q5_K_M.gguf"

model_type: "mistral"
model_config:
'max_new_tokens': 256
'temperature' : 0.2
'context_length': 2048
'gpu_layers' : 0
'threads' : -1

chat_config:
chat_memory_length: 2
number_of_retrieved_documents: 3

pdf_text_splitter:
chunk_size: 1024
overlap: 50
separators: ["\n", "\n\n"]

llava_model:
llava_model_path: "./models/llava/ggml-model-q5_k.gguf"
clip_model_path: "./models/llava/mmproj-model-f16.gguf"

whisper_model: "openai/whisper-small"

embeddings_path: "BAAI/bge-large-en-v1.5"

chromadb:
chromadb_path: "chroma_db"
collection_name: "pdfs"

chat_sessions_database_path: "./chat_sessions/chat_sessions.db"
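
Running the application (assumptions of this report, not verified instructions): the listings above import each other by the module names shown, so they are assumed to be saved as Main.py (entry point), audio_handler.py, image_handler.py, pdf_handler.py, llm_chains.py, database_operations.py, html_templates.py, prompt_templates.py, utils.py, and config.yaml. With the imported packages installed (streamlit, streamlit-mic-recorder, transformers, torch, librosa, llama-cpp-python, ctransformers, langchain, langchain-community, chromadb, pypdfium2, PyYAML) and the GGUF model files referenced in config.yaml placed in the models folder, the message table can be created by running database_operations.py once (its __main__ block calls init_db(); the chat_sessions folder it writes to must already exist), after which the interface starts with "streamlit run Main.py".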

36
5.5 Results & Reports
User Interface:

Image input:

Image Output:

37
Audio Input:

38
Audio Output:

39
PDF input & output:

40
5.6 Cost Estimation

Item                        Description                            Cost (INR)

Documentation Printing      Additional printing cost               100/-

Synopsis                    Total cost of synopsis                 300/-

Blackbook                   Total cost of blackbook                3000/-

Paper Publishing            Total cost for publishing the paper    1000/-

Miscellaneous Expenses      Diaries                                400/-

Total                                                              4800/-

Table 2. Cost Estimation

41
Future Work

⚫ Intelligent Document Summarization: Implement advanced summarization algorithms
to automatically generate concise summaries of lengthy documents, helping users quickly
extract key insights and information.

⚫ Integration with Cloud Services: Allow users to sync their chatbot interactions and
documents across devices using cloud storage services like Google Drive or Dropbox,
providing seamless access and continuity.

⚫ Integrate Ollama, OpenAI, Gemini, or Other Model Providers: Explore the integration
of Ollama, OpenAI, Gemini, or other model providers to expand the capabilities and
diversity of the chatbot's AI models.

⚫ Enhanced Security: Implement end-to-end encryption for sensitive conversations and
data, ensuring user privacy and security, especially when dealing with confidential
documents or information.

⚫ Add an Image Generator Model: Incorporate an image generator model to enable the
chatbot to generate images based on user input, enhancing the visual aspect of the chat
experience.

⚫ Authentication Mechanism: Implement an authentication mechanism to ensure user
privacy and security, allowing for personalized interactions and user-specific features.

⚫ Integration with Productivity Tools: Integrate with popular productivity tools such as
Slack, Microsoft Teams, or Trello, enabling users to seamlessly share documents and
collaborate within their existing workflows.

⚫ Analytics Dashboard: Provide users with insights into their interaction patterns,
document usage, and chatbot performance metrics through an analytics dashboard,
empowering them to optimize their workflow and usage.

⚫ Gamification Elements: Introduce gamification elements such as achievements,
leaderboards, or rewards to incentivize user engagement and learning within the chatbot
platform.

⚫ Community Forum: Create a community forum or platform where users can share tips,
best practices, and customizations for the chatbot, fostering a collaborative learning
environment and building a supportive community around the project.

⚫ Accessibility Features: Ensure compatibility with screen readers, keyboard navigation,
and other accessibility tools to accommodate users with disabilities and provide an
inclusive user experience.

43
Conclusion

In conclusion, the "Local Multimodal AI Chat" project represents a dynamic exploration into
the realms of artificial intelligence, natural language processing, and multimodal interaction.
By seamlessly integrating cutting-edge AI models tailored for audio, image, and PDF
processing, this project not only provides a captivating chat experience but also serves as a
comprehensive learning platform for enthusiasts and developers alike.

Through the collaborative efforts of contributors and the continuous pursuit of innovation, this
project stands as a testament to the power of community-driven development. As we embrace
the journey of improvement and refinement, there lies endless potential for expansion and
enhancement, whether through the integration of new model providers, the addition of
innovative features, or the optimization of existing functionalities.

With its commitment to accessibility, versatility, and user empowerment, "Local Multimodal
AI Chat" embodies the spirit of exploration and collaboration in the ever-evolving landscape
of AI-driven chat applications. As we invite individuals from diverse backgrounds to join us
on this journey, we remain dedicated to fostering a community where knowledge is shared,
ideas flourish, and the boundaries of possibility are continually pushed.

Together, let us embark on a voyage of discovery, where curiosity fuels innovation, and every
line of code contributes to the evolution of AI-powered communication. Join us as we redefine
the future of chat technology, one conversation at a time.

44
Participation Certificate in paper presentation/project
exhibition/hackathon

45
46
47
48
International Journal Certificate

49
50
51
References & Bibliography
References

[1] TheBloke. (2023). "Quantized Models from TheBloke." Retrieved from https://huggingface.co/TheBloke/Project-Baize-v2-13B-GPTQ

[2] Whisper AI. (September 2022). "Whisper AI Documentation." Retrieved from https://openai.com/blog/introducing-chatgpt-and-whisper-apis

[3] LLaVA. (October 2023). "LLaVA Documentation." Retrieved from https://llava-vl.github.io/

[4] Chroma DB. (2022). "Chroma DB Documentation." Retrieved from https://www.trychroma.com/

[5] llama-cpp-python repo. (2024). "LLaVA Loading Documentation." Retrieved from https://pypi.org/project/llama-cpp-python/

Bibliography

[1] "Deep Learning for Chatbots" by Sumit Pandey, AikAion

[2] "Building Chatbots with Python: Using Natural Language Processing and Machine Learning" by Sumit Raj

[3] "Designing Bots: Creating Conversational Experiences" by Amir Shevat

[4] "Chatbot Development: A Comprehensive Guide" by Sivakumar Harinath

[5] "Practical Artificial Intelligence for Dummies" by John Paul Mueller, Luca Massaron

52
