Chatbot On Videos
Author
Adarsha Mondal
[email protected]
Supervisors
Mr. Sudhir Kumar & Mr. Rohan Nandode
Abstract
This project focuses on the use of generative models to extract information and insights from
sequential data like videos and audios. The main objective is to develop a pipeline that can
process large amounts of sequential data and generate meaningful responses to user queries.
The pipeline includes steps for data processing, knowledge base creation, and query response
generation. The project primarily uses proprietary models, but with adequate resources it can be
modified to use open-source models.
Contents
Acknowledgement
Types of Models used
    1. Automatic Speech Recognition Model (M1)
    2. Text Embedding Model (M2)
    3. Large Language Models (M3)
Building a Chatbot with Generative AI Models
    1. Problem Statement
    2. Data
    3. Workflow Pipeline
        3.1. Data Preprocessing
        3.2. Transcription Method
        3.3. Chunking of data
        3.4. Embedding generation
        3.5. Vector Store and Retriever methods
        3.6. Output generation using LLM
        3.7. Setting up of chain
Application of the Tool
    3.1 Proprietary services
    3.2 Open-source models
Some Examples of Question-Answering with the Chatbot
    Query which has context in the Knowledge Base
        EXAMPLE 1
        EXAMPLE 2
        EXAMPLE 3
        EXAMPLE 4
    Query which does not have context in the Knowledge Base
        EXAMPLE 1
        EXAMPLE 2
Conclusion
Appendix
    1. Tools used
    2. Issues Faced
    3. References
Acknowledgement
I wish to express my sincere appreciation to Coriolis Management
and its CEO, Mr. Basant Rajan, for affording me the invaluable opportunity
to engage as an intern within their esteemed organization. I am profoundly
grateful to my mentors, Mr. Sudhir Kumar and Mr. Rohan Nandode, whose
guidance and commitment were instrumental throughout the course of my
internship. Their continuous enthusiasm and unwavering dedication have
consistently motivated me to embark on thorough and exhaustive
explorations. This internship has proven to be an exceptional source of
knowledge and practical experience, significantly enriching my skill set. I am
humbled to acknowledge that the enriching experience I gained here has
exceeded all my expectations.
Types of Models used
1. Automatic Speech Recognition Model (M1)
ASR is a technology that converts spoken language into written text. This enables a
wide range of applications that enhance human-machine interaction and automate text
generation from audio sources.
The underlying architecture can be of various types, but given the recent developments in attention-based Transformer architectures, I used Whisper-1 (OpenAI) in my pipeline.
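To make this step concrete, below is a minimal sketch of transcribing an audio file with Whisper-1 through the OpenAI API. It assumes the openai Python client (v1+) with an API key set in the environment; the file name is only a placeholder.

```python
# Minimal sketch: transcribe one audio file with OpenAI's Whisper-1.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("cmi_interview.mp3", "rb") as audio_file:  # placeholder file name
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)  # plain-text transcription of the spoken audio
```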
2. Text Embedding Model (M2)
A text embedding model converts text into dense numerical vectors that capture its semantic meaning. OpenAI's text-embedding-ada-002 offers a sophisticated tool for extracting and utilizing the rich information embedded within text data.
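As a rough illustration, the sketch below requests an embedding for a single sentence from text-embedding-ada-002 using the same openai client as above; the sample sentence is chosen only for the example.

```python
# Minimal sketch: embed one sentence with text-embedding-ada-002.
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="CMI offers undergraduate and graduate programmes in mathematics.",
)

vector = response.data[0].embedding  # 1536-dimensional semantic vector
print(len(vector))
```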
3. Large Language Models (M3)
Large Language Models (LLMs), such as OpenAI's GPT-3.5-turbo, are advanced AI models designed to understand and generate human-like language. These models are known for their language generation prowess, making them valuable tools for natural language processing tasks. Their capacity to understand and generate contextually rich text has garnered attention across industries, from content creation to the automation of various textual tasks. However, it is important to use such models responsibly, considering their strengths and limitations.
Building a Chatbot with Generative AI Models
1. Problem Statement
This project focuses on addressing the challenge of extracting information from
videos without the need to watch them in their entirety. The objective is to create a
pipeline capable of efficiently processing data and furnishing relevant answers to user
inquiries derived from the data. The pipeline's capability extends to managing situations
involving both sensitive and non-sensitive data.
2. Data
Public Domain: YouTube Videos (URLs referred at the end), which contain information
about Chennai Mathematical Institute (CMI). These videos were mostly recorded in an interview setting with various people. The primary topics covered are the academics, environment, and culture of CMI. The cumulative length of these videos is more than 1.5 hours.
3. Workflow Pipeline
The pipeline for building the AI chatbot with large language models consists of
several stages, including data preprocessing, embeddings generation, retriever
methods, semantic search index, and the integration of a large language model as the
chatbot.
3.1. Data Preprocessing
YouTube Videos:
If a video has manually generated subtitles, I extract them directly into a JSON file using the YouTubeTranscriptApi library. This avoids the hassle of extracting audio in an intermediate stage.
For videos that do not have subtitles, I use the yt_dlp library to extract the audio instead.
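The sketch below outlines both YouTube paths; the helper names and output locations are my own illustrative choices rather than the project's exact code.

```python
# Minimal sketch of the two YouTube paths: subtitles via YouTubeTranscriptApi,
# otherwise audio extraction via yt_dlp for later ASR.
import json

from youtube_transcript_api import YouTubeTranscriptApi
import yt_dlp


def save_subtitles(video_id: str, out_path: str) -> None:
    # Videos with manually generated subtitles: fetch the transcript directly.
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    with open(out_path, "w") as f:
        json.dump(transcript, f)


def download_audio(url: str, out_dir: str = "audio") -> None:
    # Videos without subtitles: download only the audio track as mp3.
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
        "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])
```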
Local Videos:
For local pre-recorded videos, we separate the audio from the video file using the moviepy library. An audio file that exceeds the size limit is pushed through the ASR model by fragmenting it into multiple temporary audio chunks at an intermediate stage. At the end, the transcription chunks are merged into one JSON file.
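A minimal sketch of this local-video path is shown below: moviepy separates the audio, and ffmpeg (already part of the toolchain) cuts an oversized file into roughly 10-minute chunks before transcription. The file names and chunk length are illustrative assumptions.

```python
# Minimal sketch: separate audio from a local video, then split it into chunks
# small enough for the ASR model's size limit.
import subprocess

from moviepy.editor import VideoFileClip

# 1. Separate the audio track from the video file (names are placeholders).
clip = VideoFileClip("local_recording.mp4")
clip.audio.write_audiofile("local_recording.mp3")

# 2. If the file exceeds the size limit, cut it into temporary ~10-minute chunks.
subprocess.run(
    [
        "ffmpeg", "-i", "local_recording.mp3",
        "-f", "segment", "-segment_time", "600",  # 600 seconds per chunk
        "-c", "copy", "chunk_%03d.mp3",
    ],
    check=True,
)
# Each chunk is transcribed separately and the resulting JSON transcripts are merged.
```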
3.4. Embedding generation
To generate embeddings for the text chunks as well as the query, different M2 models
can be used, such as OpenAI embeddings or HuggingFace embeddings. These embeddings
capture the semantic representation of the text and are stored using a vector database.
To embed the documents and the query, we utilized OpenAI's text-embedding-ada-002
as the primary embedding model.
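The sketch below shows one way to go from transcript text to stored embeddings with LangChain; the chunk size, overlap, and file paths are illustrative assumptions, and the classic langchain import paths from the time of this project are used.

```python
# Minimal sketch: split transcripts into chunks, embed them with
# text-embedding-ada-002, and persist the vectors in a FAISS index.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

with open("transcripts.txt") as f:  # merged transcript text (placeholder path)
    raw_text = f.read()

# Split the transcript into overlapping chunks before embedding.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(raw_text)

# Embed each chunk and build the vector store.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vector_store = FAISS.from_texts(chunks, embeddings)
vector_store.save_local("faiss_index")
```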
3.5. Vector Store and Retriever methods
The next stage is to perform semantic searches using the FAISS library. Retriever methods involve contextualizing the embeddings and building a semantic search index depending on the query. This allows the chatbot to efficiently retrieve relevant responses based on query similarity.
Why FAISS?
The efficiency of FAISS lies in its ability to use hardware acceleration, such as GPUs,
to perform comparison computations in parallel. This allows for faster search and
clustering of large datasets. Overall, FAISS is a powerful tool for similarity search and
clustering of large datasets, particularly in the field of machine learning and NLP.
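As a small illustration of the retrieval step, the sketch below loads the saved index and fetches the chunks most similar to a query; the query string and the value of k are illustrative.

```python
# Minimal sketch: semantic search over the FAISS index built earlier.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vector_store = FAISS.load_local("faiss_index", embeddings)

query = "What are the library timings at CMI?"
relevant_docs = vector_store.similarity_search(query, k=4)  # top-4 similar chunks

# The store can also be exposed as a retriever for use inside a chain.
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
```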
3.6. Output generation using LLM
The integration of the LLM enables the chatbot to understand the context of user
queries more effectively than deterministic models. In this case, we pass the query and
the retrieved documents (from Knowledge Base using semantic search) as context to
the LLM. The LLM then generates answers to the query based on the passed context.
For my specific use case, I used OpenAI's GPT-3.5-turbo as the LLM, which has a 4K token limit. If we want to pass documents with a larger token count, we could use models with a 16K token limit, which are specifically built to handle such cases.
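Putting the pieces together, the sketch below wires the retriever and GPT-3.5-turbo into a question-answering chain with LangChain's RetrievalQA, which also illustrates the chain setup of section 3.7. The "stuff" chain type simply places the retrieved chunks into the prompt as context; the sample question is taken from the examples later in this report.

```python
# Minimal sketch: retrieval-augmented question answering over the FAISS index.
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
retriever = FAISS.load_local("faiss_index", embeddings).as_retriever(
    search_kwargs={"k": 4}
)

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

print(qa_chain.run("Describe the admission process of CMI."))
```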
Application of the Tool
Below is a flow diagram depicting how to choose a language model depending on the use case. For the use case discussed above, it is not advisable to use a deterministic model, because the questions can vary even within a small context window.
Depending on the data used, we can also break the applications down into two categories, though there may be some crossover between them.
3.1 Proprietary services
Proprietary services require enormous compute, not only to train these models (e.g., GPT-4) but also to serve the responses. For this reason, these extremely large models will likely remain under the control of large organizations, and require us to send our data to their servers in order to interact with their language models. This raises privacy and security concerns, and subjects users to “black box” models whose training and guardrails they have no control over. Also, due to the compute required, these services are not free beyond very limited use, so cost becomes a factor in applying them at scale.
Summary: Proprietary services are great to use if you have very complex tasks, are okay
with sharing your data with a third party, and are prepared to incur costs if operating at
any significant scale.
Use Case:
To enhance subject understanding, we could use the above pipeline to extract knowledge from:
o Academic videos
o Interviews
o YouTube podcasts
Model Examples:
ASR Model ~ Whisper-1, Rev AI
Text Embedding Model ~ text-embedding-ada-002
LLM ~ gpt-3.5-turbo, gpt-4
3.2 Open-source models
It does currently take a bit more work to grab an open-source model and make it production ready, but progress is moving very quickly to make such models more accessible to users. Oftentimes, we can find an open-source model that solves our specific problem and is orders of magnitude smaller than ChatGPT. This allows us to bring the model into a localized environment and host it ourselves, which means we can keep the data under our control for privacy and governance reasons, as well as minimize utilization cost.
Use Case:
To understand client needs and to manage employee sentiment, we could use the above pipeline to extract knowledge from:
o Customer calls
o Business meeting recordings
Model Examples:
ASR Model ~ Wav2Vec2 (Meta)
Text Embedding Model ~ MiniLM, MPNet
LLM ~ LLaMa, Orca, Falcon
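As a rough sketch of this open-source route, the example below swaps the proprietary embedding model for a locally hosted MiniLM model through LangChain's HuggingFaceEmbeddings wrapper; the sample chunks and query are placeholders, and a local LLM such as LLaMA or Falcon could replace GPT-3.5-turbo in the same chain.

```python
# Minimal sketch: build the vector store with a local open-source embedding
# model so the data never leaves our environment.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

chunks = [
    "Transcript chunk from a customer call ...",               # placeholder data
    "Transcript chunk from a business meeting recording ...",
]

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"  # runs locally
)
vector_store = FAISS.from_texts(chunks, embeddings)

print(vector_store.similarity_search("What did the client ask for?", k=1))
```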
Some Examples of Question-Answering with the Chatbot
Query which has context in the Knowledge Base
EXAMPLE 1
The library at CMI is open 24 hours a day, 7 days a week, except for a 1.5
hour cleaning period in the morning. Anyone can walk in during the open
hours. The library has some rules in place to prevent book loss. The library
started with a small collection of 2500 books and has now grown to have an
excellent collection. Books are purchased based on suggestions from
students and faculty. It takes about two months for a book to arrive after
ordering. The library also subscribes to about 38 journals, both physical
copies and online access. Online journals are accessible through the CMI
network. For journals not available, the library uses interlibrary loan services
or contacts other institutions directly for PDF copies.
EXAMPLE 2
What courses does Chennai Mathematical Institute (CMI) offer?
Note: CMI may consider starting new programs in the future based on
demand and expertise.
EXAMPLE 3
Describe the admission process of CMI.
The admission process of CMI starts in March when the applications open.
EXAMPLE 4
Why did CMI start a Data Science course?
Query which does not have context in the Knowledge Base
EXAMPLE 2
I don't know
Conclusion
The use of large language models in building AI chatbots provides a significant
improvement in accuracy and relevance of responses. The workflow pipeline, starting
from data preprocessing to the integration of LLMs, ensures a seamless interaction
between the chatbot and users. However, the choice between proprietary and open-source models depends on the data privacy requirements and computational resources available.
Overall, large language models have the potential to revolutionize chatbot applications
and enhance user experience.
By adding more data to the knowledge base, we can improve the usability and relevance of this chatbot for a broader spectrum of users. Along with that, we can add multi-modal features in an iterative manner, further expanding its use cases.
Appendix
1. Tools used
o Python
o LangChain
o OpenAI
o FAISS
o HuggingFace
o FFmpeg
o Streamlit
2. Issues Faced
Here are the issues I faced during the implementation of this project:
https://docs.google.com/document/d/18sbbaf17y_cOFgMYQSNw5c9cSPyqlUqk1Z
X6qM4HFuw/edit?usp=sharing
3. References
Large Language Model: gpt-3.5-turbo,
https://platform.openai.com/docs/api-reference/completions/object
Text Embedding model: text-embedding-ada-002,
https://platform.openai.com/docs/guides/embeddings
ASR Model: Whisper-1,
https://platform.openai.com/docs/guides/speech-to-text
Vector Store: FAISS,
https://github.com/facebookresearch/faiss
Wrapping Framework: LangChain,
https://python.langchain.com/docs/get_started/introduction.html
Web UI Framework: Streamlit
https://streamlit.io/
Sequential data handling tool: FFmpeg
https://www.ffmpeg.org/