YouTube Video Search and Transcript-Based QA With LLM (Project)
Note: We are implementing a basic version that includes the following steps:
1. YouTube Video Search: Use a library like yt-dlp to search for videos on YouTube.
2. Fetch Transcript: Use youtube-transcript-api to fetch the transcript of the selected video.
3. Process Transcript: Break down the transcript into manageable chunks.
4. QA with LLM: Use a language model (LLM) to extract answers from the transcript chunks and map them to timestamps.
+------------------+       +---------------------+       +---------------------+       +---------------------+
| YouTube Search   | ----> | Fetch Transcript    | ----> | Process Chunks      | ----> | Question Answering  |
+------------------+       +---------------------+       +---------------------+       +---------------------+
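A rough end-to-end sketch of how the four stages chain together, assuming the helpers defined in the rest of this notebook (query_llm_with_timestamps is the name used here for the Step 6 helper):

# End-to-end sketch using the helpers defined below in this notebook
videos = search_youtube("twosetai")                      # 1. search YouTube
transcript = get_transcript(videos[0]["videoId"])        # 2. fetch transcript of top result
if transcript is not None:
    chunks = split_transcript_into_chunks(transcript)    # 3. group into ~30-second chunks
    formatted = format_transcript_chunks(chunks)
    # 4. QA with LLM (query_llm_with_timestamps is defined in Step 6)
    answer = query_llm_with_timestamps(formatted, "What is this video about?")
    print(answer)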
from yt_dlp import YoutubeDL

def search_youtube(query):
    # "ytsearch5" makes yt-dlp treat the input as a search and return the top 5 results
    ydl_opts = {"quiet": True, "default_search": "ytsearch5"}
    with YoutubeDL(ydl_opts) as ydl:
        result = ydl.extract_info(query, download=False)
        videos = [
            {"title": entry["title"], "videoId": entry["id"]}
            for entry in result["entries"]
        ]
    return videos
videos = search_youtube("twosetai")
print(videos)
from youtube_transcript_api import YouTubeTranscriptApi

def get_transcript(video_id):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        # for entry in transcript:
        #     print(f"Timestamp: {entry['start']} - {entry['start'] + entry['duration']} seconds")
        #     print(f"Text: {entry['text']}")
        #     print("-" * 50)
        return transcript
    except Exception as e:
        print("Transcript not available:", str(e))
        return None
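The output below was produced by fetching the transcript of a search result and printing its first five entries. The exact video ID isn't shown in the notebook, so taking the first result is an assumption:

# Assumed call behind the output below: take the top search result's ID
video_id = videos[0]["videoId"]
transcript = get_transcript(video_id)
print(transcript[:5])  # first five timestamped entries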
[{'text': 'all right this is M and Angelina today', 'start': 0.08, 'duration': 5.08},
 {'text': "we're going to talk about document", 'start': 3.08, 'duration': 5.32},
 {'text': 'search using molden BT if you want to', 'start': 5.16, 'duration': 5.519},
 {'text': 'search in a document the most important', 'start': 8.4, 'duration': 3.96},
 {'text': 'step is to really understand the', 'start': 10.679, 'duration': 4.08}]
def split_transcript_into_chunks(transcript, chunk_duration=30):
    """Group transcript entries into chunks of roughly chunk_duration seconds."""
    chunks = []
    current_chunk = []
    chunk_start = 0
    current_time = 0
    for entry in transcript:
        current_chunk.append(entry["text"])
        current_time = entry["start"] + entry["duration"]
        # When the chunk reaches the specified duration, store it and start a new chunk
        if current_time - chunk_start >= chunk_duration:
            chunks.append({"text": " ".join(current_chunk), "start": chunk_start, "end": current_time})
            current_chunk = []
            chunk_start = current_time
    # Keep whatever is left over as a final, shorter chunk
    if current_chunk:
        chunks.append({"text": " ".join(current_chunk), "start": chunk_start, "end": current_time})
    return chunks
# Example usage
chunks = split_transcript_into_chunks(transcript)
print(f"First Chunk: {chunks[0]}")
First Chunk: {'text': "all right this is M and Angelina today we're going to
talk about document search using molden BT if you want to search in a document
the most important step is to really understand the document right that's where
this model comes in and we will show you a prototype Search application today
using this model and together with how we build it speaking of Bert it's kind of
a dinosaur model right in terms of the AI in terms of the AI ears if you're from
with the burd model it's released in", 'start': 0, 'end': 30.24}
def format_transcript_chunks(chunks):
    """Format transcript chunks into a single timestamped string for the LLM prompt.

    Args:
        chunks (list): List of transcript chunks, each with 'text', 'start', and 'end' timestamps.

    Returns:
        str: Formatted transcript text.
    """
    formatted_chunks = []
    for chunk in chunks:
        formatted_chunks.append(f"[{chunk['start']}s - {chunk['end']}s] {chunk['text']}")
    return "\n".join(formatted_chunks)
# Example usage
formatted_transcript = format_transcript_chunks(chunks)
print(formatted_transcript[:500]) # Print first 500 characters for preview
[0s - 30.24s] all right this is M and Angelina today we're going to talk about
document search using molden BT if you want to search in a document the most
important step is to really understand the document right that's where this
model comes in and we will show you a prototype Search application today using
this model and together with how we build it speaking of Bert it's kind of a
dinosaur model right in terms of the AI in terms of the AI ears if you're from
with the burd model it's released
1.6 Step 6: Query LLM for Answer with Timestamps
[305]: from litellm import completion, acompletion
from IPython.display import display, Markdown
import nest_asyncio
nest_asyncio.apply()

def query_llm_with_timestamps(formatted_transcript, question, model="ollama/deepseek-r1"):
    # model name is a placeholder; any litellm-supported model works here
    system_prompt = (
        "Your task is to analyze the given transcript and answer the user's question.\n"
        "Make sure to include the most relevant timestamps in your response. "
        "Timestamps must be from the transcript and must follow this format [531.88s - 564.6s].\n\n"
    )
    response = completion(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Transcript:\n{formatted_transcript}\n\nQuestion: {question}"},
        ],
        stream=True,
    )
    answer_text = ""
    for chunk in response:  # accumulate the streamed answer
        content = chunk.choices[0].delta.content
        if content is None:
            break
        answer_text += content
    return answer_text
    # return response.choices[0].message.content.strip()  # non-streaming alternative
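With the helper in place, the run below asks about the batch size used in the video's vector-database demo. The exact question string isn't shown in the export, so this wording is inferred from the model's answer:

# Question wording is an assumption, inferred from the answer below
question = "What batch size was used when inserting documents into the vector database?"
answer = query_llm_with_timestamps(formatted_transcript, question)
print(answer)  # the model emits a <think> reasoning block before its final answer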
<think>
Alright, so looking at the user's question, they're asking about the batch size
used when inserting documents into the vector database. From what I remember in
the provided tutorial, there was a mention of using batches to insert data
efficiently.
I'll need to find where that information is located. The user included
timestamps with each section, so maybe those can help me pinpoint exactly where
the batch size was discussed.
Looking through the timeline:
• At 1067.28s to 1099.52s, it says they created multiple batches of size 50 for inserting data
into the vector database.
So that seems like the relevant section. The user wants the specific batch size,
which is 50 in this case. I should make sure my answer directly references that
timestamp to show where the information came from.
I think it's important to state clearly that each document was inserted in
batches of 50 and mention how many batches there were for 1,000 documents (which
would be 20 batches). That way, anyone reading the answer understands both the
batch size and the total number of batches used.
</think>
The batch size for inserting documents into the vector database was **50**. This
means that each document was inserted in batches of 50, with a total of 20
batches to process all 1,000 documents.
Answer:
Batch size = 50 (each batch contains 50 documents) and there were 20 batches for
inserting 1,000 documents.
[408]: formatted_transcript
[408]: "[0s - 30.24s] all right this is M and Angelina today we're going to talk about
document search using molden BT if you want to search in a document the most
important step is to really understand the document right that's where this
model comes in and we will show you a prototype Search application today using
this model and together with how we build it speaking of Bert it's kind of a
dinosaur model right in terms of the AI in terms of the AI ears if you're from
with the burd model it's released in\n[30.24s - 62.039s] 2018 um but it did Mark
the beginning of the AI ERA with the most popular model architecture which is
Transformer uh but we're now going to talk about history of AI language models
today how about let's that [Music] in awesome as you said today we're going to
show how to create a Search application so we are essentially going back to the
fundamental and Basics the race and the progress in AI is crazy\n[62.039s -
93.159s] nowadays and every day you would see a lot of different model
techniques so on and so forth and a lot of people will get distracted by them
what I think is if you are really going to work in this domain you need to
understand the basics and fundamentals anyway and since we are working with rag
search question answering so on and so forth I think one of the very big
fundamentals or\n[93.159s - 123.28s] component is the retrieval or search part
and semantic search you have seen a lot of different semantic search
applications out there so we are basically going to again show how semantic
search works here and instead of using a lot of these new models and llms and
all that we're going to use a new model but this model is not new in a sense
that it's built on the\n[123.28s - 155.319s] previous uh version of that so
we're talking about modern bird it's a new model that answer AI with you know
hugging phase they trained on a new data set they changed the architectures a
little bit and they increased the context length of this model so it has a lot
of different advantages over the previous be model which was introduced in 2018
that's why it's a dinosaur model right in terms of how fast AI is\n[155.319s -
188.319s] developing it's old that even but even that old bir model is I think
the most downloaded model from hugging face so don't underestimate even that old
bird model it is used in a lot of different NL NLP tasks like classification
sentiment analysis you name it and now we have a new model which is like a
descendant of that model but it's a lot more capable and more\n[188.319s -
220.76s] powerful and I think this is going to really change a lot of different
things uh this morning I was checking hugging face and I saw two different
models based on this modern bird one is um and both of them from Alibaba one is
for reranking your search results and the other one is for embedding and and we
can use it for embedding for you know for classification it has capability to
understand code as well so I think this is a very capable person and in
this\n[220.76s - 250.92s] tutorial we're going to show how to use this you said
capable person oh sorry so this is a very capable model and in this tutorial
we're going to see how we can use that for the very basic schematic search and
before show you the code I would like just to show you the demo that's another
motivation because often times we will do things in Jupiter notebook which is
fantastic my favorite you know python environment for for creating and\n[250.92s
- 281.199s] developing but at the end we should really move our code to a kind
of framework right to an socalled web application and I created a very basic
JavaScript application using it framework called White and a back end developed
by fast API I'm going to show you how it works and then I can show you the code
here we have a very basic search bar where you can start typing\n[281.199s -
311.24s] and while you are typing right you can see active learning it kind of
opens up a drop- down and it gives me some suggestions yes question what are we
searching against here so that's a good question I have downloaded a subset uh
of a data set of research papers related to machine learning and deep learning
the original data set is very large but I only used\n[311.24s - 343.199s] uh
1,000 examples which include the title and the abstract of the papers and here
when I start typing it's going to search through the titles and based on the
similarity and this similarity is not semantic it's like fuzzy search string
matching for this Auto suggestion but of course we can even replace this one
with a better search like semantic search and when I\n[343.199s - 375.319s]
start typing and you can see that it shows me a bunch of these titles I can
select any of them or just I can just type Active Learning if I want to for
example here and then when I do that it's going to give me the top search
results here and typically the first title uh of the search result should match
exactly this one because uh this is also in the data set this could be improve
but right now you can see this\n[375.319s - 408.08s] is how it works so it shows
the title and the relevances for for that and again unsupervised learning and
you can just typee that and based on that it's going to do some search so that's
the idea of this search of course it needs to be improved now that um you saw
the application let's dive into the code and then explain like the technology
and how I develop that\n[408.08s - 440.199s] yeah this is very useful I mean a
lot of the companies have like internal documents and this is very much needed
right have having being able to match the relevant document fast and rank them
on the top yes right now there is no ranking it's completely based on the
semantic search and you saw that the accuracy is not extremely high but it is
good enough however that could be improved this is just the basic version of
that and in\n[440.199s - 471.0s] this case I'm using this mwest Vector database
and there are a bunch of vector databases out there I have used almost all of
them and recently I started playing with milis and I realized it's also
extremely capable and kind of similar to quadrant and it kind of becomes my
favorite so just today I decided to use milest instead of quadrant and why not
this Jupiter notebook so I just\n[471.0s - 501.4s] wrote some U explanations
about this document search and modern bird and why you know modern bird is
really powerful and here is a very basic system architecture of semantic search
where what happens is we have a collection of documents unstructured text in
this case and then we give them to some embedding model and the embedding model
here is the modern\n[501.4s - 531.88s] bird that I am using this modern bird has
two different variations one with um a smaller model with 150 plus million
parameters both are very capable even the a smaller one and the good thing is
they are so fast if you use them for embedding a lot of documents it is actually
extremely fast and the context length of the text has been increased from 512 of
the original birth model to more\n[531.88s - 564.6s] than 8,000 uh wordss so
this is fantastic we embed the documents into embeddings or vector
representations then we store them in a vector database in this case I am using
mest and later on when the user types a query we will embed the query as well
and then we'll go inside the vector database and perform a semantic search or a
nearest neighbor search there are a few\n[564.6s - 597.76s] techniques out there
but both work based on this concept of uh nearest neighbor it's an approximation
algorithm and then it's going to give us the most relevant based on some
similarity metrics I am using cosine similarity it will return the top 10
documents I am not doing any chunking anything like that the documents here are
the concatenation of the title of the papers and Abstract of the papers so I put
them all together\n[597.76s - 629.76s] and then I embed them so in order to set
up your environment you need to install a few Library sentence Transformers
because we're going to use modern bird data sets from hugging face so that's why
you need to install data sets I'm using milis I need to install Pi Milas for
that and um do EnV for later on when I am using this llms for generating some
synthetic you know questions I am using misal so\n[629.76s - 663.959s] that's
why we need this and then what you need to import is a few libraries here the
most important one sentence Transformers and then we need to load the model I am
using this nomic AI modern bird um if you search for that they have a um model
card on hugging face that explains all the details accuracy so on and don't
worth about that I have um defined this function here which is going to embed
a\n[663.959s - 694.519s] text so I'm going to pass a text and then it's going to
embed that using that modern bird to prepare the data set I am loading this data
set from hugging phase and this is like 50,000 research papers about machine
learning and AI from this archive website however I am not really using the
entire 50,000 I'm using a subset of that which is 1,000 and this data set has um
a few\n[694.519s - 726.68s] columns including the title of the research paper
and the abstract so I'm using these two to do the semantic search on and then I
select randomly a thousand of these papers uh one thing that they should mention
here when we are using modern bird then um we we need to add like a prefix to to
queries and documents later on you know for embedding and then later for the
search\n[726.68s - 758.48s] that it's something that we should be doing however
I have seen some tutorials that they are not necessarily just add this to the
beginning of the query if you want to do a search on a query when you have the
query you just uh concatenate that query with this one you start with or prepend
the query with this search query colon and then you just have your user query
for embeddings because we're going to embed the documents then the the prefix
is\n[758.48s - 788.68s] documentor search uncore document so these are because
this is the way that this modern bird has been trained uh to be able to to
handle you know short and long form of text for embedding and search that's why
I have these two here I have another function which is going to concatenate the
the title and the abstract it also prepend them with this prefix here and so
here I am creating a a\n[788.68s - 819.639s] combined text of the I show you how
many examples I have how many research paper it shows 1,000 now that we have our
documents we need to embed them I am using this generate embedding function I'm
passing each example which is like a dictionary and then it's going to grab the
text property from that which is the concatenation of the prefix plus the the
title and the abstract it it'll embed\n[819.639s - 849.959s] them and I apply
that on the entire data set using this map function um I also created a panda
version of that just if you want to further explore or do some kind of data
analysis um that's why I did this here yes not a question just a comment I feel
like we should emphasize on the on you know visualizing the data and looking at
the data as a step for\n[849.959s - 880.92s] whatever you are building a lot of
people actually ignore this step right chunk ever B them you didn't you didn't
even look what they are right exactly so that's a very good point yes of course
the first step before even implementing a search is to do a lot of Eda
exploratory data analysis again I skip mostly that part but that is very
necessary I doing a little bit of like here because I want to know what is the
longest uh piece of text in my data set\n[880.92s - 911.36s] right it gives me
some idea in terms of the distribution if I want to be even kind of more
accurate and better and then I have to create some visualizations to see the
distribution of the length of each document that I have here here so this is
definitely a very important step I even didn't do cleaning because these Texs
have maybe some special characters right new line\n[911.36s - 943.279s]
characters things like that so it's really a good idea to just remove those
special characters we don't need to remove stop Wars and things like that for
this bird model but cleaning the text is also very essential but yes to be
really doing implementing all the best practices that's something that I should
have done I should have you know done some Eda before even I start embedding and
implementing the application and I am using milis it's a\n[943.279s - 975.0s]
very powerful Vector database that you can use and they have different version
of that if your data set is very small then you can use milest light for that up
to 1 million documents if you have then this is good enough uh I am using the
Standalone the docker based version of milest and if you have more than 100
million documents you want to use milest distributed version now in order to
set\n[975.0s - 1005.24s] up the milest vector database similar to quadrant and
other Vector databases um for m the good thing is you can even Define the schema
of your index like all the fields that you want to store an index in your vector
database and that's what I have done the most important thing here is the dense
Vector I call it dense vector and then the data type is numbers\n[1005.24s -
1036.48s] because our embeddings are essentially numbers and the dimension is
768 and 768 that Dimension that that bird modern Bird model mod gives us another
good thing about modern bird is we can work with variable embeddings meaning
that we can have from pretty much 64 or 256 Dimension size all the way to 768
I'm using this this one here and so\n[1036.48s - 1067.28s] I create the index
and then the The Collection now we need to insert the data into the vector
database and I created like multiple batches here of size 50 so we have 1,000
different documents it's going to put them into batches of 50 and then it's
going to just insert that instead of inserting one at a time this one is faster
we can\n[1067.28s - 1099.52s] see there are 20 batches Here and Now The Next
Step would be just to do some search I have this piece of text or user query and
I need to search that I have written this then search function which is going to
first pass uh my user query um to the model and it's going to embed that you can
see here I have also prepended my query with the prefix and then I embed that
and I do the search\n[1099.52s - 1131.799s] and I return the top K and the K
here is 10 so it's going to return the top uh 10 results and then I iterate over
these top 10 results and print out the titles you can see here this is like the
title of the top one and also the score the score shows the relevance or
similarity to my query so that's the basic of the semantic search so far okay
awesome thank you m a few things we should um\n[1131.799s - 1163.2s] clarify yes
the Bert model is not a generat model it is a encoder only model right this is
yes because it's not um like llm model where it generate text this is for
embeddings yes correct yeah yeah uh and uh uh to summarize what you uh taught me
today about why modern bird is interesting is because one uh\n[1163.2s -
1193.72s] compared with its original version it has a extended context uh window
now so it will be very easily handling long context understanding that's why you
like in just the title and abstracts definitely enough um for for understanding
the do document um so here the data set was basically titles in abstract and I
measured the longest uh piece of text had uh 2,200 words right that's why it's
still\n[1193.72s - 1226.12s] less than 8,000 right right other improvements
include it's so still a very small model and it's efficient right yes you can
see the you know the the improvements they are using different types of
embeddings when they were training in the architecture so they are using local
and Global attention right and you there are a bunch of other explanations and
features and you can see even within the this\n[1226.12s - 1256.64s] very short
period of time that it was released it's been heavily used more than 4 million
right almost 5 million downloads last month only there are two different
versions there is a base and there is a large 149 million parameters versus
almost 400 million parameters you can see that and some Snippets how to use it
so again this model is in my opinion going to be one of the it's going
to\n[1256.64s - 1288.08s] essentially replace the old bird models and it's going
to be used for a lot of different tasks yeah it's very essential right the
search use case is very common in the industry so in today's video we introduced
the mod Bird model and how to make a a Search application out of this uh modern
version of birds embedding model which is for those who are familiar with this
type of model it's a makeover of the older version of the a family of small and
very efficient encoder only models that's it for today\n[1288.08s - 1309.22s] I
hope that this tutorial is going to be helpful for our audience sounds good but
don't forget to subscribe to our Channel and stay tuned for our new content see
you next time see you [Music]"