Chatbot On Videos


Summer Internship Project Report

Author
Adarsha Mondal
[email protected]

Supervisors
Mr. Sudhir Kumar & Mr. Rohan Nandode

Abstract
This project focuses on the use of generative models to extract information and insights from sequential data such as video and audio. The main objective is to develop a pipeline that can
process large amounts of sequential data and generate meaningful responses to user queries.
The pipeline includes steps for data processing, knowledge base creation, and query response
generation. The project primarily uses proprietary models, but with adequate resources it can be
modified to use open-source models.
Contents
Acknowledgement
Types of Models used
    1. Automatic Speech Recognition Model (M1)
    2. Text Embedding Model (M2)
    3. Large Language Models (M3)
Building a Chatbot with Generative AI Models
    1. Problem Statement
    2. Data
    3. Workflow Pipeline
        3.1. Data Preprocessing
        3.2. Transcription Method
        3.3. Chunking of data
        3.4. Embedding generation
        3.5. Vector Store and Retriever methods
        3.6. Output generation using LLM
        3.7. Setting up of chain
Application of the Tool
    1. Proprietary services
    2. Open-source models
Some Examples of Question-Answering with the Chatbot
    Query which has context in the Knowledge Base (Examples 1–4)
    Query which does not have context in the Knowledge Base (Examples 1–2)
Conclusion
Appendix
    1. Tools used
    2. Issues Faced
    3. References

Acknowledgement
I wish to express my sincere appreciation to Coriolis Management
and its CEO, Mr. Basant Rajan, for affording me the invaluable opportunity
to engage as an intern within their esteemed organization. I am profoundly
grateful to my mentors, Mr. Sudhir Kumar and Mr. Rohan Nandode, whose
guidance and commitment were instrumental throughout the course of my
internship. Their continuous enthusiasm and unwavering dedication have
consistently motivated me to embark on thorough and exhaustive
explorations. This internship has proven to be an exceptional source of
knowledge and practical experience, significantly enriching my skill set. I am
humbled to acknowledge that the enriching experience I gained here has
exceeded all my expectations.

Types of Models used
1. Automatic Speech Recognition Model (M1)
ASR is a technology that converts spoken language into written text. This enables a
wide range of applications that enhance human-machine interaction and automate text
generation from audio sources.

The underlying architecture can take various forms, but given recent advances in attention-based Transformer architectures, I used OpenAI's Whisper-1 in my pipeline.

ASR is used in transcription services, voice search, call-center automation, accessibility tools, language learning, and more. For my use case, I relied only on its transcription capability.
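As an illustration, the sketch below shows a single transcription call to Whisper-1. It assumes the pre-1.0 openai Python SDK with an API key set in the environment; the file name is a placeholder.

```python
# Minimal sketch: one Whisper-1 transcription call via the pre-1.0 openai SDK.
import openai

def transcribe(audio_path: str) -> str:
    """Send one audio file (at most 25 MB) to the Whisper-1 ASR endpoint."""
    with open(audio_path, "rb") as audio_file:
        result = openai.Audio.transcribe("whisper-1", audio_file)
    return result["text"]

print(transcribe("interview_session_01.mp3"))  # illustrative file name
```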

2. Text Embedding Model (M2)


Text embedding models are advanced language models designed to convert textual
input into high-dimensional vectors, also known as embeddings. These embeddings
capture the semantic meaning and contextual information of the text, allowing for more
effective natural language understanding and processing.

Incorporating text embeddings into various applications can significantly improve their performance, as the models have learned intricate language patterns and nuances. OpenAI's text-embedding-ada-002, the embedding model used in this pipeline, offers a sophisticated tool for extracting and utilizing the rich information embedded within text data.
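As a small illustration of how such embeddings capture semantic similarity, the sketch below embeds two related sentences with text-embedding-ada-002 and compares them with cosine similarity. It assumes the pre-1.0 openai Python SDK and numpy; the sentences are invented for the example.

```python
# Sketch: embed two sentences and measure their cosine similarity.
import numpy as np
import openai

def embed(texts):
    # Returns one embedding vector per input text, in order.
    response = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(item["embedding"]) for item in response["data"]]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vec_a, vec_b = embed([
    "CMI offers a data science programme.",
    "The institute teaches analytics courses.",
])
# Semantically related sentences score noticeably closer to 1 than unrelated ones.
print(cosine_similarity(vec_a, vec_b))
```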

3. Large Language Models (M3)

What are Large Language Models?
Large Language Models (LLMs) belong to a class of sophisticated deep
learning models meticulously designed for the comprehensive comprehension of
extensive natural language data. These models undergo training on colossal
datasets containing billions of words, where their underlying architecture relies
on intricate algorithms like transformer architectures. Through these
mechanisms, LLMs proficiently navigate voluminous datasets, discerning
intricate linguistic patterns at the word level. Such proficiency empowers these
models to adeptly predict outcomes involving text generation, text classification,
and more.
Employing neural network architectures, particularly the transformer
framework, LLMs demonstrate a profound ability to capture intricate language
nuances and the intricate connections between words or phrases embedded in
vast textual datasets. In essence, LLMs can be regarded as a derivation of
transformer models. The transformer's effectiveness hinges upon its attention mechanisms, chiefly self-attention and, in encoder-decoder variants, cross-attention.
Self-attention endows the model with the capability to weigh the significance of various segments of its input against one another, which is instrumental in accurately predicting subsequent words in generated text.
Cross-attention, in turn, allows the decoder of an encoder-decoder transformer to selectively focus on the encoder's representation of the input during generation, enhancing its comprehension and subsequent output.

Large Language Models (LLMs), such as OpenAI's GPT-3.5-turbo, are advanced AI
models designed to understand and generate human-like language.

These models are known for their language-generation prowess, making them valuable tools for natural language processing tasks. Their capacity to understand and generate contextually rich text has garnered attention across industries, from content creation to the automation of various textual tasks. However, it is important to use such models responsibly, keeping their strengths and limitations in mind.
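As a minimal illustration, the sketch below sends one prompt to GPT-3.5-turbo through the Chat Completions API, again assuming the pre-1.0 openai Python SDK; the prompt text is illustrative.

```python
# Sketch: a single GPT-3.5-turbo chat completion call.
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain in one sentence what a large language model is."},
    ],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])
```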

Building a Chatbot with Generative AI Models
1. Problem Statement
This project focuses on addressing the challenge of extracting information from
videos without the need to watch them in their entirety. The objective is to create a
pipeline capable of efficiently processing data and furnishing relevant answers to user
inquiries derived from the data. The pipeline's capability extends to managing situations
involving both sensitive and non-sensitive data.

2. Data
Public Domain: YouTube videos (URLs referred at the end), which contain information about Chennai Mathematical Institute (CMI). These videos were mostly recorded in an interview setting by various people. The primary topics covered in these videos are the academics, environment, and culture of CMI. The cumulative length of these videos is more than 1.5 hours.

Private Domain: Pre-recorded videos, captured in an interview setting across multiple sessions. The primary interviewer, Mr. Sudhir Kumar, conducted these sessions with the professors and academic staff of CMI. The cumulative length of these videos is more than 2.8 hours.

3. Workflow Pipeline
The pipeline for building the AI chatbot with large language models consists of
several stages, including data preprocessing, embeddings generation, retriever
methods, semantic search index, and the integration of a large language model as the
chatbot.

3.1. Data Preprocessing


YouTube:
I split the YouTube videos into two groups, depending on whether or not they have manually generated English subtitles.

If a video has manually generated subtitles, I extract them directly into a JSON file using the YouTubeTranscriptApi library. This saves the hassle of extracting audio as an intermediate step.

For videos that do not have subtitles, I use the yt_dlp library to extract the audio.

Local Videos:
For local pre-recorded videos, we separate the audio from the video file using the moviepy library. A sketch covering all three cases follows.
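The sketch below outlines the three preprocessing paths. It assumes the youtube_transcript_api, yt_dlp, and moviepy packages; the video IDs, URLs, and file paths are placeholders.

```python
# Sketch of the preprocessing step: subtitles if available, otherwise audio extraction.
import json
from youtube_transcript_api import YouTubeTranscriptApi
from yt_dlp import YoutubeDL
from moviepy.editor import VideoFileClip

def save_youtube_subtitles(video_id: str, out_path: str) -> None:
    """Case 1: the YouTube video has manually generated English subtitles."""
    transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])
    with open(out_path, "w") as f:
        json.dump(transcript, f)

def download_youtube_audio(url: str) -> None:
    """Case 2: no subtitles, so download the audio track for the ASR model."""
    opts = {
        "format": "bestaudio/best",
        "outtmpl": "%(id)s.%(ext)s",
        "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
    }
    with YoutubeDL(opts) as ydl:
        ydl.download([url])

def extract_local_audio(video_path: str, audio_path: str) -> None:
    """Case 3: local pre-recorded video; the audio is separated with moviepy."""
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path)
    clip.close()
```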

3.2. Transcription Method


The audio files extracted from the YouTube or local videos are transcribed next. I use the M1 model for this, specifically Whisper-1, which accepts an audio file of at most 25 MB per request and transcribes it.

An audio file that exceeds this size limit is pushed through the ASR model by first fragmenting it into multiple temporary audio chunks at an intermediate stage; at the end, the transcription chunks are merged into one JSON file.
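The sketch below illustrates this fragment-and-merge step. The report does not name the splitting library or the segment length, so the use of pydub and the 10-minute slices are assumptions made for illustration.

```python
# Sketch: split oversized audio into temporary chunks, transcribe each, merge the text.
import os
import openai
from pydub import AudioSegment

SEGMENT_MS = 10 * 60 * 1000  # 10-minute slices, assumed to stay well under 25 MB

def transcribe_long_audio(audio_path: str) -> str:
    audio = AudioSegment.from_file(audio_path)
    pieces = []
    for i, start in enumerate(range(0, len(audio), SEGMENT_MS)):
        chunk_path = f"_tmp_chunk_{i}.mp3"
        audio[start:start + SEGMENT_MS].export(chunk_path, format="mp3")
        with open(chunk_path, "rb") as f:
            pieces.append(openai.Audio.transcribe("whisper-1", f)["text"])
        os.remove(chunk_path)
    return " ".join(pieces)  # merged transcription, later written to a single JSON file
```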

3.3. Chunking of data


We take the text data from the JSON file and split it into smaller chunks, or LangChain documents. The reason for doing this is that the data is to be fed into an M3 model, and the current large integrated models we use have a finite input token limit; they can process only a fixed amount of text at a time. Each chunk has a specific length that we can configure.

Here, I have used LangChain's RecursiveCharacterTextSplitter, which splits the data into chunks of at most 4096 characters each, with a 500-character overlap between consecutive chunks.
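A minimal sketch of this step with LangChain's RecursiveCharacterTextSplitter is shown below, using the chunk size and overlap mentioned above. The transcript path is a placeholder, and the JSON is assumed to be a list of segments with a "text" field.

```python
# Sketch: split the merged transcript into overlapping LangChain documents.
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("transcripts/merged_transcript.json") as f:
    full_text = " ".join(segment["text"] for segment in json.load(f))

splitter = RecursiveCharacterTextSplitter(chunk_size=4096, chunk_overlap=500)
documents = splitter.create_documents(
    [full_text], metadatas=[{"source": "transcripts/merged_transcript.json"}]
)
print(len(documents), "chunks created")
```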

3.4. Embedding generation
To generate embeddings for the text chunks as well as the query, different M2 models
can be used, such as OpenAI embeddings or HuggingFace embeddings. These embeddings
capture the semantic representation of the text and are stored using a vector database.
To embed the documents and the query, we utilized OpenAI's text-embedding-ada-002
as the primary embedding model.

3.5. Vector Store and Retriever methods


Once the text embeddings were obtained, the next step was to store them in a vector store such as FAISS (Facebook AI Similarity Search). We call this vector store the Knowledge Base.

The next stage is to perform semantic searches using the FAISS library. Retriever methods involve contextualizing the embeddings and building a semantic search index based on the query. This allows the chatbot to efficiently retrieve relevant content based on query similarity.

Why FAISS?

The efficiency of FAISS lies in its ability to use hardware acceleration, such as GPUs,
to perform comparison computations in parallel. This allows for faster search and
clustering of large datasets. Overall, FAISS is a powerful tool for similarity search and
clustering of large datasets, particularly in the field of machine learning and NLP.
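A minimal sketch of building the Knowledge Base and its retriever is shown below, assuming the 2023-era LangChain API; documents comes from the chunking step above, and the sample query and k value are illustrative.

```python
# Sketch: embed the chunks, index them in FAISS, and expose a retriever.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()                   # defaults to text-embedding-ada-002
knowledge_base = FAISS.from_documents(documents, embeddings)
knowledge_base.save_local("faiss_index")          # persist the Knowledge Base

retriever = knowledge_base.as_retriever(search_kwargs={"k": 4})
relevant_docs = retriever.get_relevant_documents("How does the library at CMI operate?")
```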

3.6. Output generation using LLM
The integration of the LLM enables the chatbot to understand the context of user queries more effectively than deterministic models. In this case, we pass the query and the documents retrieved from the Knowledge Base via semantic search as context to the LLM. The LLM then generates an answer to the query based on the passed context.

For my specific use case I used OpenAI's GPT-3.5-turbo as the LLM, which has a 4K token limit. If we want to pass documents with a larger token count, we could use the 16K-context variant of the model, which is specifically built to handle such cases.
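Viewed in isolation, this generation step can be sketched as below: the retrieved chunks are stuffed into the prompt as context, and GPT-3.5-turbo is instructed to answer only from that context. The prompt wording is illustrative, and relevant_docs refers to the documents returned by the retriever above.

```python
# Sketch: answer a query using only the retrieved context.
import openai

def answer_from_context(query: str, docs) -> str:
    context = "\n\n".join(doc.page_content for doc in docs)
    messages = [
        {"role": "system",
         "content": "Answer using only the context below. If the answer is not in "
                    "the context, say you don't know.\n\nContext:\n" + context},
        {"role": "user", "content": query},
    ]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0
    )
    return response["choices"][0]["message"]["content"]

print(answer_from_context("How does the library at CMI operate?", relevant_docs))
```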

3.7. Setting up of chain


In order to combine all the above components into a single chain, I used LangChain's RetrievalQA chain with the chain type "stuff". The chain is formed by passing in the LLM object along with the retriever. Once the chain has been set up, it returns a human-like answer based on the user's query and prompt, along with links to the source documents.
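A minimal sketch of this chain setup is shown below, assuming the 2023-era LangChain API; retriever comes from the Knowledge Base built earlier and the query is illustrative.

```python
# Sketch: wire the retriever and the LLM into a RetrievalQA chain (chain type "stuff").
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,              # built from the FAISS Knowledge Base above
    return_source_documents=True,
)

result = qa_chain({"query": "What courses does CMI offer?"})
print(result["result"])                                      # the generated answer
print([doc.metadata for doc in result["source_documents"]])  # source references
```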

Application of the Tool
Below is a flow diagram depicting which language model should be used depending on the use case. For the use case discussed above, it is not advisable to use a deterministic model, because the question can vary even within a small context window.

Depending on the data used, we can also break the applications down into two categories, although there may be some crossover between them.

1. Proprietary services


As the first widely available LLM powered service, OpenAI’s ChatGPT was the
explosive charge that brought LLMs into the mainstream. ChatGPT provides a nice user
interface (or API) where users can feed prompts to one of many models (GPT-3.5, GPT-
4, and more) and typically get a fast response. These are among the highest-performing
models, trained on enormous data sets, and are capable of extremely complex tasks
both from a technical standpoint, such as code generation, as well as from a creative
perspective like writing poetry in a specific style.

The downside of these services is the enormous amount of compute required not only to train them (OpenAI reportedly spent over $100 million to develop GPT-4) but also to serve responses. For this reason, these extremely large models will likely remain under the control of large organizations, and using them requires sending our data to their servers in order to interact with their language models. This raises privacy and security concerns and subjects users to “black box” models whose training and guardrails they have no control over. Also, because of the compute required, these services are not free beyond very limited use, so cost becomes a factor when applying them at scale.

Summary: Proprietary services are great to use if you have very complex tasks, are okay
with sharing your data with a third party, and are prepared to incur costs if operating at
any significant scale.

Use Case:
• To enhance subject understanding, we could use the above pipeline to extract knowledge from:
o Academic videos
o Interviews
o YouTube podcasts

Model Examples:
• ASR Model ~ Whisper-1, Rev AI
• Text Embedding Model ~ text-embedding-ada-002
• LLM ~ gpt-3.5-turbo, gpt-4

2. Open-source models


The other avenue for language models is the open-source community, where there has been similarly explosive growth over the past few years. Communities like HuggingFace gather hundreds of models from open-source contributors that can help solve many specific use cases, such as text generation, summarization, and classification. The open-source community has been quickly catching up to the performance of proprietary models, but due to resource limitations there is still a long road ahead before it produces a model like GPT-4.

It currently takes a little more work to take an open-source model and make it production-ready, but progress toward making these models more accessible is moving quickly. Often we can find an open-source model that solves our specific problem and is orders of magnitude smaller than ChatGPT. This allows us to bring the model into a localized environment and host it ourselves, which means we can keep the data under our control for privacy and governance, as well as minimize usage cost.

Use Case:
• To understand client needs and manage employee sentiment, we could use the above pipeline to extract knowledge from:
o Customer calls
o Business meeting recordings

Model Examples:
• ASR Model ~ Wav2Vec2 (Meta)
• Text Embedding Model ~ MiniLM, MPNet
• LLM ~ LLaMA, Orca, Falcon
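As a purely hypothetical sketch, the snippet below shows how open-source components could be swapped into the same LangChain pipeline so the data never leaves a local environment. The specific checkpoints (all-MiniLM-L6-v2 for embeddings and a Falcon instruct model as the LLM) are illustrative choices, not the ones used in this project, and suitable hardware would be needed in practice.

```python
# Sketch: the same retrieval pipeline with locally hosted open-source models.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Open-source embedding model (MiniLM family), run locally.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
knowledge_base = FAISS.from_documents(documents, embeddings)   # `documents` as before

# Open-source LLM served through a local transformers pipeline.
llm = HuggingFacePipeline.from_model_id(
    model_id="tiiuae/falcon-7b-instruct",
    task="text-generation",
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=knowledge_base.as_retriever()
)
```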

Some Examples of Question-Answering with the Chatbot
Query which has context in the Knowledge Base
EXAMPLE 1

How does the library at CMI operate?

The library at CMI is open 24 hours a day, 7 days a week, except for a 1.5
hour cleaning period in the morning. Anyone can walk in during the open
hours. The library has some rules in place to prevent book loss. The library
started with a small collection of 2500 books and has now grown to have an
excellent collection. Books are purchased based on suggestions from
students and faculty. It takes about two months for a book to arrive after
ordering. The library also subscribes to about 38 journals, both physical
copies and online access. Online journals are accessible through the CMI
network. For journals not available, the library uses interlibrary loan services
or contacts other institutions directly for PDF copies.

EXAMPLE 2
What courses does Chennai Mathematical Institute (CMI) offer?

Chennai Mathematical Institute (CMI) offers the following courses:

1. Bachelor's program in mathematics and allied sciences.
2. Master's program in mathematics and allied sciences.
3. Data Science course, focusing on analytics and bridging math and
engineering backgrounds.

Note: CMI may consider starting new programs in the future based on
demand and expertise.

EXAMPLE 3
Describe the admission process of CMI.

The admission process of CMI starts in March when the applications open.

• The administrative team, including Rajeshwari Nair, starts working on the admissions process in February.
• The application forms are available online and need to be tested before being displayed to the public.
• Some application forms are demanded by post, while the majority are submitted online.
• CMI students who are graduating from BSc or MSc also apply to CMI.
• The process goes through July, with the announcement of results and the entrance examination.
• There are about 22 centers all over India where the entrance examination is administered.
• The centers include schools and universities in both big and small cities.
• The selection of centers is based on requests from candidates and the data of applicants.
• The administrative team ensures that all exam materials reach the centers before the exams in mid-May.

EXAMPLE 4
Why did CMI start a Data Science course?

CMI started a Data Science course because of the following reasons:

1. Failed experimental course: CMI had previously started an Applications of Math course, which included finance and required a stronger math background. However, they were not getting the right pool of students for this course.
2. UGC review committee recommendation: The UGC review
committee recommended that CMI refocus the course to focus only
on analytics. This recommendation justified CMI's motivation to start
a Data Science course.
3. Demand for Data Science: CMI had already started training
programs in machine learning for industry through AlgoLabs and
observed a growing demand for Data Science.
4. Target audience: CMI decided to make the Data Science course
accessible to a broad spectrum of students, including those with a
math and stats background or an engineering background.
5. Bridge program: CMI aimed to bridge the gap between students
with different backgrounds by including programming components
in the course.

Query which does not have context in the Knowledge Base
EXAMPLE 1

How do I travel to CMI from Mumbai?

I don't know the answer to that question

EXAMPLE 2

Which IT company is situated beside CMI campus?

I don't know

Conclusion
The use of large language models in building AI chatbots provides a significant
improvement in accuracy and relevance of responses. The workflow pipeline, starting
from data preprocessing to the integration of LLMs, ensures a seamless interaction
between the chatbot and users. However, the choice between proprietary and open-source models depends on the data privacy requirements and the computational resources available.
Overall, large language models have the potential to revolutionize chatbot applications
and enhance user experience.

By adding more data to the knowledge base, we can improve the usability and relevance of this chatbot for a broader spectrum of users. Along with that, we can add multi-modal features in an iterative manner, further expanding its use cases.

Appendix
1. Tools used
• Python
• LangChain
• OpenAI
• HuggingFace
• FAISS
• FFmpeg
• Streamlit

2. Issues Faced
Here are the issues I faced during the implementation of this project:
https://docs.google.com/document/d/18sbbaf17y_cOFgMYQSNw5c9cSPyqlUqk1ZX6qM4HFuw/edit?usp=sharing

3. References
• Large Language Model: gpt-3.5-turbo, https://platform.openai.com/docs/api-reference/completions/object
• Text Embedding Model: text-embedding-ada-002, https://platform.openai.com/docs/guides/embeddings
• ASR Model: Whisper-1, https://platform.openai.com/docs/guides/speech-to-text
• Vector Store: FAISS, https://github.com/facebookresearch/faiss
• Wrapping Framework: LangChain, https://python.langchain.com/docs/get_started/introduction.html
• Web UI Framework: Streamlit, https://streamlit.io/
• Sequential data handling tool: FFmpeg, https://www.ffmpeg.org/
