IET Final Year Project - Making YouTube Transcript
Building A Tool To Summarize YouTube Videos
As Well As Implementing Content Filtering
Bachelor of Technology
in
Computer Science & Engineering
by
June, 2025
Declaration
We hereby declare that this project submission is our own work and that, to the best of our
knowledge and belief, it contains no material previously published or written by another person,
nor material which to a substantial extent has been accepted for the award of any other degree
of the university or other institute of higher learning, except where due acknowledgement has
been made in the text.
This project report has not been submitted by us to any other institute for the requirement of
any other degree.
Certificate
This is to certify that the project report entitled: Building A Tool To Summarize YouTube
Videos As Well As Implementing Content Filtering submitted by Milind Nautiyal
and Faiz Khan in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology
in Computer Science & Engineering is a record of the bonafide work carried out by them under
our supervision and guidance at the Department of Computer Science & Engineering, Institute
of Engineering & Technology Lucknow.
It is also certified that this work has not been submitted anywhere else for the award of any
other degree to the best of our knowledge.
Acknowledgement
We take this opportunity to express our sincere gratitude to all those who supported and guided
us throughout our project work. We are thankful to the Almighty for the grace and blessings
that enabled us to complete this project successfully.
We are deeply grateful to our supervisor, Dr. Vineet Kansal, for his consistent support, ex-
pert guidance, and encouragement throughout the B.Tech project. His insights and suggestions
were crucial to the success of our work.
We would also like to thank our supervisor, Dr. Ankita Srivastava, for her valuable inputs
and continuous support during the course of this project. Her insights and suggestions were
instrumental in shaping the direction of our work.
Our sincere thanks to Dr. Pawan Kumar Tiwari, our project coordinator, for providing us
with the opportunity and resources to undertake this project, as well as his suggestions which
helped us further refine our initial idea.
We are also thankful to the Project Evaluation Committee Members for their feedback
and constructive suggestions, which helped refine our work.
We extend our appreciation to Dr. Girish Chandra, Head of the Department of
Computer Science & Engineering, Institute of Engineering & Technology, Lucknow,
for fostering a supportive academic environment and ensuring access to the resources essential
for the smooth execution of this project.
Lastly, we thank our families and friends for their unwavering support and motivation through-
out this journey.
Milind Nautiyal
Faiz Khan
Abstract
With the exponential growth of media online, controlling the flow of information, especially
for younger audiences, has become increasingly important for parents and guardians. This
project aims to solve this critical problem by presenting a Google Chrome extension which
leverages advanced Natural Language Processing techniques to provide meaningful as well
as concise summaries of YouTube video transcripts, while also enhancing parental oversight
over the type of content children are exposed to online.
The extension integrates both traditional and modern NLP approaches, including a Latent Se-
mantic Analysis (LSA) pipeline and a BERT-based abstractive summarization model,
to generate human-readable summaries of video transcripts fetched via the YouTube Data API.
Additionally, users may optionally harness Large Language Models (LLMs) through third-
party APIs for more context-aware summarization.
In order to ensure safe media consumption, this system includes a keyword-based filtering mech-
anism that detects and flags potentially sensitive themes such as violence, explicit language, or
adult content using curated and dynamic keyword databases. Flagged content can be restricted
from playback, and summaries of such content can be routed to guardians for review.
These capabilities are especially crucial in light of documented controversies such as Elsagate [3],
where disturbing and inappropriate content managed to bypass platform-level filters and reach
child audiences. The growing presence of AI-generated and bot-driven videos—many of which
mimic child-friendly content while containing harmful material—further underscores the need
for more robust and fine-grained moderation systems. As such, the project highlights the im-
portance of automated content analysis and filtering mechanisms that can assist guardians in
managing online media consumption effectively.
The backend is powered by a lightweight and modular Flask server, enabling fast, scalable
processing and paving the way for future enhancements. Future iterations aim to incorporate
fine-tuned Transformer models for improved contextual understanding, and possibly real-time
content monitoring as well as an interactive dashboard on the browser’s new tab page, allowing
guardians to customize filtering rules and monitor viewed content.
Contents
Declaration i
Certificate ii
Acknowledgements iii
Abstract iv
Contents v
List of Tables ix
1 Introduction 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Common Use Cases in Daily Life . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Literature Review 7
2.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Foundation of Extractive Summarization . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Early Efforts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Transition to Semantic Techniques in the 1990s . . . . . . . . . . . . . . 9
2.3 Statistical and Machine Learning Approaches . . . . . . . . . . . . . . . . . . . 10
2.3.1 Probabilistic Models and Feature Engineering (1990s) . . . . . . . . . . . 10
2.3.2 Evaluation and Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Rise of Deep Learning and Transformer-Based Models . . . . . . . . . . . . . . . 11
2.4.1 BERT and Contextual Representations . . . . . . . . . . . . . . . . . . . 11
2.4.2 Abstractive Models: BART, T5, and Beyond . . . . . . . . . . . . . . . . 12
2.5 Meta-Review and Evolutionary Perspectives . . . . . . . . . . . . . . . . . . . . 12
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6.1 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Methodology 17
3.1 Existing Methodologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.0.1 System Pipeline Overview . . . . . . . . . . . . . . . . . . . . . 19
3.3 Advantages of Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4 Disadvantages and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Comparison Between LSA, BART & LLMs . . . . . . . . . . . . . . . . . . . . . 22
3.6 Transcript Extraction Method (from YouTube Transcripts) . . . . . . . . . . . . 23
3.7 Preprocessing the Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.8 Parental Control and Kids Safety Filtering . . . . . . . . . . . . . . . . . . . . . 25
3.8.1 Keyword-Based Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.8.2 Semantic Filtering via Embedding Similarity . . . . . . . . . . . . . . . . 26
3.8.2.1 Cosine Similarity Filtering . . . . . . . . . . . . . . . . . . . . . 27
3.8.3 Modularity and Control . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.8.4 Future Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.9 Summarization Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.9.1 Latent Semantic Analysis (LSA) – Extractive Summarization . . . . . . . 29
3.9.1.1 Term Frequency-Inverse Document Frequency (TF-IDF) . . . . 29
3.9.1.2 Matrix Construction . . . . . . . . . . . . . . . . . . . . . . . . 32
3.9.1.3 Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . 33
3.9.1.4 Sentence Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.9.1.5 Sentence Selection . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.9.1.6 Post-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.9.2 BART – Abstractive Summarization . . . . . . . . . . . . . . . . . . . . 35
3.9.2.1 Token-Aware Chunking . . . . . . . . . . . . . . . . . . . . . . 35
3.9.2.2 Chunk-Level Summarization and Merging . . . . . . . . . . . . 36
3.9.2.3 Fallback Mechanism . . . . . . . . . . . . . . . . . . . . . . . . 36
3.9.2.4 Advantages of BART-Based Summarization . . . . . . . . . . . 37
3.9.3 Large Language Models (LLMs) . . . . . . . . . . . . . . . . . . . . . . . 37
3.10 Flask Backend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.10.1 Flask Development Server Execution . . . . . . . . . . . . . . . . . . . . 39
3.11 Chrome Extension Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.11.1 Content Script (content.js) . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.11.2 Background Script (background.js) . . . . . . . . . . . . . . . . . . . . . 41
3.11.3 Extension Manifest (manifest.json) . . . . . . . . . . . . . . . . . . . . . 41
3.11.4 Extension Workflow Summary . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Experimental Results 43
4.1 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5 Conclusion 46
5.1 Backend Implementation Using NLP Libraries . . . . . . . . . . . . . . . . . . . 46
A Appendix 49
References 50
List of Figures
1.1 Proposed System Overview: Abstract pipeline of the YouTube Transcript Sum-
marization and Parental Control System, illustrating the transformation from
raw video input to semantically enriched output. . . . . . . . . . . . . . . . . . . 5
List of Tables
Chapter 1
Introduction
In the digital age, video content has emerged as one of the most dominant and widely consumed
forms of media. Platforms like YouTube host billions of videos spanning diverse domains such
as education, entertainment, news, and personal content creation [1]. However, this abundance
also presents a growing challenge, as users are often overwhelmed by the sheer volume of
information, making it difficult to locate relevant content quickly and efficiently.
This issue becomes particularly evident when dealing with long-form videos or scenarios that
require time-sensitive viewing. For instance, students revisiting recorded lectures, profes-
sionals consulting instructional content, or parents reviewing what their children are watching
all face the same challenge: extracting relevant information from lengthy videos is often te-
dious, inefficient, and cognitively demanding. Without mechanisms for quick summarization
or content preview, viewers are forced to manually scrub through timelines—an approach that
is both time-consuming and error-prone.
In addition, the unrestricted availability of video content raises serious concerns regarding con-
tent safety for children. Young viewers typically lack the maturity and critical thinking
skills needed to evaluate content quality or intent [2], making them particularly susceptible to
inappropriate or manipulative material. Content creators may exploit this vulnerability by pro-
ducing seemingly child-friendly videos that are designed primarily to generate revenue through
advertisements and sponsorships, often prioritizing engagement over ethical responsibility. As a
result, children can inadvertently be exposed to content that promotes consumerism, ideological
bias, or even harmful narratives. This situation underscores the necessity for active parental
involvement and more intelligent moderation tools, especially in the face of known failures in
platform-level filters, such as those highlighted in the Elsagate controversy [3].
The project addresses these problems by proposing that the solution lies in automatic sum-
marization and semantic analysis of video transcripts. By distilling long videos into concise
summaries and tagging them with relevant topics or potential red flags, users can engage with
video content more effectively and safely. This project introduces a browser-based tool that
combines Natural Language Processing (NLP) techniques, keyword extraction, and parental
control features to empower viewers, particularly guardians, with greater control, efficiency,
and transparency over the content being consumed by themselves and their wards.
1.1 Problem Statement

While platforms like YouTube have revolutionized content accessibility and democratized in-
formation sharing, they have also inadvertently opened the floodgates to inappropriate and
harmful material, particularly for a young audience which is more vulnerable than ever. Despite
attempts to moderate content using algorithms and platform-level filters, incidents such as the
2017 “Elsagate” controversy have exposed critical shortcomings in these systems. Thousands
of seemingly child-friendly videos—often featuring popular cartoon characters—were discovered
to contain disturbing, violent, or sexually suggestive content, all while being disguised under
misleading titles and thumbnails.
Such incidents highlight the fundamental limitations of keyword-based moderation and super-
ficial content filtering. They underscore the pressing need for more nuanced, context-aware
systems capable of understanding the semantic content of videos, rather than relying solely
on metadata or surface-level heuristics.
Beyond the issue of content safety, there is also the broader challenge of information over-
load. For everyday users—such as students revisiting lectures, professionals referencing tuto-
rials, or parents reviewing their children’s media consumption—the ability to extract relevant
information from long-form videos is severely limited. Manually scanning through videos is
inefficient, time-consuming, and often ineffective, especially when only small segments contain
useful information. There is a clear need, therefore, for tools that can:
• Intelligently summarize and analyze video transcripts using semantic understanding, and
• Flag potentially inappropriate or harmful content based on context rather than surface-
level keywords.
This project addresses that gap by developing a Chrome extension that integrates Transformer-
based summarization, semantic keyword extraction, and customizable parental control fea-
tures. The proposed tool aims to facilitate faster, safer, and more informed video consump-
tion across diverse user groups.
1.2 Objective
The primary objective of this project is to design and develop a Chrome extension that trans-
forms how users, in particular guardians, educators, and students, interact with online video
content by offering intelligent summarization and context-sensitive content analysis.
The extension aims to enhance both comprehension and safety in digital media consumption.
By fulfilling these objectives, the project aims to deliver a practical and user-centric tool that
improves access to relevant video content while addressing growing concerns around content
safety in the digital age. The integration of advanced NLP techniques, context-aware sensitivity
analysis, and extensible system design ensures the tool remains adaptable for future use cases
like multilingual summarization, integration with other video platforms, or enhancement via
fine-tuned large language models.
1.3 System Overview
At its core, the YouTube Transcript Summarization and Parental Control Chrome
Extension is designed to process YouTube video transcripts to generate intelligent summaries,
semantic topic tags, and sensitivity flags. This facilitates a safer, more efficient, and context-
aware video consumption experience.
Figure 1.1: Proposed System Overview: Abstract pipeline of the YouTube Transcript Sum-
marization and Parental Control System, illustrating the transformation from raw video input
to semantically enriched output.
The NLP-driven pipeline ensures that the generated output goes beyond mere textual extrac-
tion. It produces contextually meaningful summaries, enriched with topic-level annota-
tions and content sensitivity alerts. By applying both statistical and deep learning-based
techniques, the system offers a more nuanced, human-like understanding of the video’s content.
This facilitates not only quicker comprehension but also active content moderation and parental
oversight.
1.4 Common Use Cases in Daily Life

1. A concerned parent wants to verify whether a trending YouTube video contains mature,
misleading, or inappropriate content before allowing their child to watch it.
2. A student preparing for exams needs to quickly revisit the key points of a lengthy
educational video without rewatching the entire lecture.
5. A late-night viewer prefers to silently skim summaries and topic tags rather than playing
video audio, avoiding disturbance while still engaging with the content.
6. A parent unfamiliar with English relies on visual summaries and topic tags to evaluate
whether an English-language video is suitable for their child.
These scenarios illustrate the system’s practical value in daily digital interactions. Whether
supporting parental oversight, enhancing educational efficiency, or enabling informed decision-
making, the proposed extension addresses real-world challenges. By bridging the gap between
content accessibility and content safety, the tool fosters a more mindful, secure, and user-centric
digital media environment.
Chapter 2
Literature Review
Text summarization, a fundamental task within the domain of Natural Language Processing
(NLP), focuses on condensing large volumes of textual information into concise, coherent, and
informative representations. As digital content proliferates across the web, effective summa-
rization techniques have become indispensable for enhancing information accessibility, reducing
cognitive load, and enabling quick decision-making across domains such as journalism, educa-
tion, and multimedia content analysis.
Summarization techniques are generally classified into two primary categories: extractive and
abstractive. Extractive summarization identifies and selects key sentences or phrases directly
from the source material, maintaining their original form. Abstractive summarization, on the
other hand, involves generating novel sentences that may not explicitly exist in the source,
thereby mimicking the human approach to summarization. While abstractive methods are more
flexible and expressive, they often demand extensive computational resources, sophisticated
language modeling, and large annotated datasets for training. In contrast, extractive methods
offer greater scalability and are more suitable for real-time or resource-constrained applications,
such as browser extensions.
Recent advancements have extended summarization research beyond static textual inputs into
multimedia domains, particularly video content. In this context, transcripts derived from au-
tomatic speech recognition (ASR) systems act as intermediary textual data. However, summa-
rizing transcripts presents unique challenges, including disfluencies, informal language, missing
punctuation, and irregular sentence boundaries. Addressing these challenges requires enhanced
semantic understanding through models such as Latent Semantic Analysis (LSA) and contex-
tualized embeddings from pretrained transformer-based architectures.
This project builds upon these developments by applying both extractive and abstractive sum-
marization techniques to YouTube video transcripts. It further integrates semantic keyword
extraction and sensitivity detection, offering a real-time, browser-based interface aimed at im-
proving both the efficiency and safety of video content consumption. In doing so, it contributes
to the growing body of research focused on multimodal NLP applications and user-centric media
moderation tools.
The origins of extractive summarization can be traced back to foundational work in the mid-
20th century. One of the earliest and most influential contributions was made by Hans Peter
Luhn (1958), who proposed a frequency-based method of summarization. Luhn’s approach
emphasized that words appearing more frequently, excluding common stop words, were likely
to represent the core themes of a document [4].
These early approaches formed the basis for extractive summarization techniques, prioritizing
sentence selection based on surface-level features such as frequency, positional importance, and
lexical cues. While these methods were computationally efficient and easy to interpret, the
resulting summaries often suffered from a lack of coherence and contextual understanding.
The 1990s witnessed a shift from purely statistical models toward semantically enriched ap-
proaches, driven by advances in machine learning and information retrieval. Techniques such
as Latent Semantic Analysis (LSA), and later graph-based ranking algorithms like TextRank,
significantly improved the quality of extractive summarization by leveraging semantic relation-
ships and structural properties of text.
A pivotal development during this period was introduced by Scott Deerwester et al. (1990),
who proposed Latent Semantic Analysis (LSA)—a technique that applies Singular Value De-
composition (SVD) to a term-document matrix in order to uncover latent structures in textual
data [6]. By reducing the dimensionality of this matrix, LSA was able to identify semantically
similar terms and documents, even when they did not share exact lexical overlap. This allowed
summarization systems to capture underlying conceptual relationships rather than relying solely
on surface-level frequency counts.
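The core operation can be illustrated with a toy term-document matrix; the vocabulary and counts below are fabricated purely for demonstration and are not drawn from the project's data:

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents;
# entry (i, j) counts occurrences of term i in document j.
A = np.array([
    [2.0, 0.0, 1.0],   # "video"
    [1.0, 1.0, 0.0],   # "summary"
    [0.0, 2.0, 1.0],   # "safety"
    [1.0, 0.0, 2.0],   # "child"
])

# SVD of the 4x3 matrix, then truncation to k latent concepts.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k approximation

# Document vectors in the reduced concept space: documents can now be
# compared semantically even without exact lexical overlap.
doc_vecs = (np.diag(S[:k]) @ Vt[:k, :]).T     # shape (3 docs, k concepts)
```

Truncating to the top k singular values discards noisy, low-variance dimensions, which is precisely how LSA surfaces latent topical structure from raw counts.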
Despite its innovation, LSA had inherent limitations. Its reliance on linear algebra and the
bag-of-words model meant it could not effectively account for word order, syntax, or linguistic
ambiguity. As a result, while LSA offered improved thematic coherence over earlier frequency-
based models, its summaries sometimes lacked grammatical fluency and contextual nuance.
Nevertheless, LSA laid crucial groundwork for subsequent developments in semantic summa-
rization and keyword extraction, forming a bridge between early statistical models and the
more sophisticated, context-aware methods that emerged in the deep learning era.
Julian Kupiec (1995) introduced one of the first probabilistic frameworks for extractive summa-
rization, utilizing surface-level features such as sentence position, length, and keyword density
to determine sentence importance [7]. This marked a transition from purely heuristic methods
to early statistical learning, although these models lacked any deep understanding of language
semantics.
A major challenge in summarization research has been evaluation. Chin-Yew Lin (2003) pro-
posed the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, which com-
pares n-gram overlap between generated summaries and human-written references [9]. ROUGE
quickly became a standard for assessing summarization quality. However, it tends to reward
lexical similarity over semantic correctness, making it less suited for evaluating abstractive or
highly paraphrased outputs.
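As a rough illustration of the idea (not the official ROUGE toolkit, which adds stemming, stopword handling, and further variants), ROUGE-1 recall reduces to clipped unigram overlap between candidate and reference:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: overlapping unigrams / reference unigrams.
    Counts are clipped so repeated words are not over-credited."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

# 5 of the 6 reference unigrams appear in the candidate.
print(rouge1_recall("the cat sat on the mat", "the cat is on the mat"))
```

This also makes the metric's weakness concrete: a paraphrase with no lexical overlap scores zero even when it is semantically faithful.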
The advent of deep learning significantly enhanced the quality and fluency of text summa-
rization systems. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM)
networks were among the first neural architectures used to model sequential dependencies in
text, but they struggled with long-range context.
The real breakthrough came with the introduction of the Transformer architecture by Vaswani
et al. (2017), which enabled parallelized processing of entire sequences through self-attention
mechanisms.
While BERT is not a summarization model per se, its architecture has been widely adapted in
summarization pipelines—for example, via fine-tuning on sentence ranking tasks or using BERT
embeddings as inputs to summarizers. However, its computational demands pose limitations
in resource-constrained environments like browser extensions.
Despite their high performance, such models require significant memory and compute—making
them more suitable for server-side processing than real-time, client-side summarization.
Haopeng Zhang (2024) conducted a recent meta-review, tracing the progression of summa-
rization research from rule-based heuristics to hybrid and transformer-driven architectures [16].
The review highlights a growing preference for semantically rich summarization and hybrid
systems that balance extractiveness and abstraction. It also emphasizes the need for domain-
specific summarization solutions—such as education, healthcare, and online safety—where con-
text and ethical safeguards are paramount.
2.6 Conclusion
While numerous tools exist for summarizing transcripts—such as AutoSub for subtitle genera-
tion or Chrome extensions like Eightify—they primarily focus on summarization without any
integrated moderation or parental control. They also lack topic modeling, filtering logic, or
safety tagging based on the semantic content of videos.
• Parental control logic that flags content based on topic and keyword patterns, and
This hybrid, safety-focused approach directly responds to issues like Elsagate and the growing
need for parental involvement in content moderation. By bridging fast summarization and
ethical screening, the system provides not just accessibility but also responsibility in content
consumption.
Gong & Liu [8] (2001). Approach: TF-IDF and LSA, with sentence selection based on singular
vector components. Key finding: LSA-based relevance measures can effectively capture latent
topical structure for generic summarization. Limitations: limited to extractive summaries;
computationally intensive for large datasets.
Chapter 3
Methodology
The development of an intelligent summarization system for YouTube video transcripts within a
browser-based environment requires a deep integration of Natural Language Processing (NLP)
techniques, lightweight modeling strategies, and modular design principles. This chapter details
the methodology used in building the Chrome extension, which transforms raw transcript data
into structured, human-readable summaries.
The summarization pipeline combines classical extractive summarization techniques like TF-
IDF and Latent Semantic Analysis (LSA) with modern filtering approaches based on cosine
similarity, and optionally explores transformer-based models (BERT/BART) for abstractive
summarization. Emphasis is placed on balancing performance with computational efficiency to
ensure seamless operation within the resource constraints of a browser extension.
One of the earliest attempts at text summarization was proposed by Luhn in 1958. His method
utilized word and phrase frequency analysis to identify key content in technical papers, laying
the groundwork for modern extractive summarization.
Luhn’s 1958 paper, ‘The Automatic Creation of Literature Abstracts’ dealt with using
frequency analysis to identify significant words and sentences in a document. It proposed
that words occurring more frequently in a text (excluding stopwords) are likely central to its
content. Sentences were scored based on their concentration of these key terms, and high-scoring
sentences were extracted to form summaries [1]. This approach provided a systematic and
computational method to create abstracts, serving as a cornerstone for modern summarization
techniques in NLP. As technological capabilities expanded, Machine Learning and NLP gained
prominence, leading to increasingly sophisticated summarization methods.
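A minimal sketch of Luhn-style scoring might look like the following; the stopword list, sentence splitter, and example text are simplified placeholders rather than Luhn's original implementation:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "it", "for", "on"}

def luhn_summary(text: str, top_n: int = 2) -> list:
    """Score each sentence by the summed frequency of its significant
    (non-stopword) terms; return the top_n sentences in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower())
                   if w not in STOPWORDS)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                    reverse=True)[:top_n]
    return [sentences[i] for i in sorted(ranked)]
```

Because frequent terms dominate the score, sentences dense in the document's recurring vocabulary rise to the top, exactly the behaviour Luhn proposed.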
While foundational, Luhn’s method lacked semantic understanding and struggled with com-
plex sentence structures—a limitation modern techniques aim to overcome. Automatic text
summarization addresses the challenge of limited time for comprehensively understanding vast
amounts of material, mitigating the risk of overlooking critical information.
The proposed system is initiated when a user provides a YouTube video link through the
frontend interface. Upon receiving the link, the backend extracts the video’s unique identifier
and retrieves the associated transcript using the YouTube API or related caption-extraction
techniques. Once the transcript is obtained, it undergoes a cleaning and preprocessing phase
to remove noise such as timestamps and formatting artifacts.
The cleaned transcript is then analyzed for appropriateness using a two-stage kid-safe filtering
mechanism: a keyword-based scan and a semantic similarity check using sentence embeddings.
If the content passes the safety checks and meets the minimum length requirements, it proceeds
to the summarization phase.
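The two-stage check could be sketched as follows; the banned keyword list, the toy embedding vectors, and the 0.7 threshold are illustrative assumptions, with a real deployment supplying embeddings from a sentence-embedding model:

```python
import numpy as np

BANNED_KEYWORDS = {"gore", "gambling"}  # placeholder list, not the real database

def keyword_flag(text: str) -> bool:
    """Stage 1: fast lexical scan for banned terms."""
    return bool(set(text.lower().split()) & BANNED_KEYWORDS)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_flag(text_vec: np.ndarray, banned_vecs, threshold: float = 0.7) -> bool:
    """Stage 2: flag if the transcript embedding is too close to any
    pre-embedded banned phrase (embeddings come from an external model)."""
    return any(cosine(text_vec, v) >= threshold for v in banned_vecs)

def is_kid_safe(text: str, text_vec: np.ndarray, banned_vecs) -> bool:
    return not keyword_flag(text) and not semantic_flag(text_vec, banned_vecs)
```

Running the cheap keyword scan first lets most benign transcripts skip the comparatively expensive embedding comparison.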
Depending on the length and complexity of the transcript, the system employs either a direct
abstractive summarization using a pretrained transformer model or a smart chunking strategy to
handle longer inputs. In cases where the transcript is too long for abstractive summarization,
an extractive approach based on TF-IDF and Latent Semantic Analysis (LSA) is used as a
fallback.
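This branching logic can be sketched as a small dispatcher. The 1024-token limit mirrors BART's usual context window, but the chunking cutoff and the path names are assumptions for illustration, not the system's exact thresholds:

```python
def choose_summarizer(token_count: int, api_available: bool,
                      bart_limit: int = 1024) -> str:
    """Pick a summarization path from transcript length and API status."""
    if api_available:
        return "llm_api"        # external LLM for context-aware summaries
    if token_count <= bart_limit:
        return "bart"           # direct abstractive summarization
    if token_count <= 4 * bart_limit:
        return "bart_chunked"   # token-aware chunking, then merge chunk summaries
    return "lsa"                # extractive fallback for very long inputs
```

Keeping the decision in one pure function makes the pipeline easy to unit-test and to retune as model limits change.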
Finally, the generated summary is returned to the frontend and displayed to the user in a clean
and readable format. This end-to-end pipeline is designed to operate automatically, ensuring
efficient, safe, and contextually relevant summarization of YouTube video content.
The summarization system follows a modular, decision-driven pipeline for extracting and filter-
ing YouTube video transcripts. Figure 3.1 illustrates the end-to-end architecture of the system.
• Transcript Extraction: The backend invokes the YouTube Data API to retrieve the
transcript when a video is opened.
• Text Preprocessing: This includes noise removal, sentence splitting, and tokenization
to clean the transcript for downstream processing.
• Kids Safety Filter: Applied immediately post-preprocessing, this filter combines keyword-
based screening and cosine similarity filtering using pre-embedded phrases.
• Model Selection Logic: Based on transcript length and API availability, the system
branches to one of three summarization paths:
– Latent Semantic Analysis (LSA): For longer transcripts where BART or API
usage is constrained.
• Postprocessing and Output: All models return final summaries along with status
flags, which are then displayed via a Chrome extension interface.
Figure 3.1: System Pipeline Overview: From YouTube Transcript Extraction to Frontend
Summary Display
The proposed summarization pipeline integrates classical extractive techniques with modern
transformer-based abstractive models under a modular design optimized for browser deploy-
ment. This hybrid approach allows efficient transcript analysis, summary generation, and safety
filtering, all within the constraints of a lightweight Chrome extension.
• Content Moderation via Cosine Similarity: Safety filtering includes both keyword-
based checks and semantic similarity comparison against a list of embedded banned
phrases. This reduces the likelihood of inappropriate or unsafe content appearing in
summaries.
• Scalable and Language-Agnostic: LSA and cosine-based filters are unsupervised and
work independently of grammar, syntax, or labeled training data, supporting scalability
and multilingual adaptability.
• Modular and API-Compatible: The architecture supports external API fallbacks for
advanced summarization engines (e.g., OpenAI, DeepSeek), enabling future enhancements
with minimal disruption.
Despite its efficiency and modularity, the current approach has the following limitations:
• Semantic Filter Tradeoffs: While cosine similarity allows context-aware safety filtering,
it can lead to false positives (flagging benign phrases) or false negatives (if embeddings
don’t sufficiently capture nuance).
To initiate the summarization pipeline, the system uses the yt-dlp Python package to extract
subtitle data directly from YouTube videos. This method provides a more robust alternative
to frontend-dependent APIs, supporting both manually uploaded and auto-generated captions
in multiple formats (e.g., TTML, VTT, SRT).
Figure 3.2: Example terminal log showing successful transcript extraction. The system logs
each request, identifies the video ID, fetches the transcript using internal API calls, and reports
the cleaned word count as well as the token count.
• Video ID Resolution: The YouTube video ID is extracted from the user-provided URL
and passed to the downloader.
• Subtitle Discovery: yt-dlp is configured to skip media download and instead fetch
caption metadata. The system checks both manually uploaded subtitle tracks and auto-
generated captions.
• Subtitle URL Retrieval: Once a suitable caption track is identified (via the metadata
JSON), the corresponding subtitle file URL is fetched.
• Transcript Downloading and Parsing: The subtitle file is retrieved using an HTTP
request, and the transcript is parsed from the raw caption payload (e.g., YouTube's JSON-
based caption format) by iterating over event segments and extracting UTF-8 encoded tokens.
• Plain Text Conversion: All extracted tokens are concatenated into a plain-text tran-
script, with newline and formatting characters removed.
• Error Detection: Failures such as HTTP errors, malformed caption files, and quota
limits are detected and reported.
This method offers greater flexibility and reliability compared to frontend-based libraries like
YouTubeTranscriptAPI, allowing full access to caption metadata and supporting low-level pars-
ing of subtitles in multiple formats.
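The parsing step described above can be sketched as follows. This is an illustrative sketch rather than the project's actual code: the dictionary keys passed to yt-dlp are real yt-dlp options, but the parse_json3_transcript helper and the event/segment field names assume YouTube's JSON-based caption layout (events → segs → utf8).

```python
import json

# Options that would be passed to yt_dlp.YoutubeDL to fetch caption
# metadata without downloading media (illustrative configuration).
YDL_OPTS = {
    "skip_download": True,       # no audio/video, metadata only
    "writesubtitles": True,      # manually uploaded caption tracks
    "writeautomaticsub": True,   # auto-generated captions
    "subtitleslangs": ["en"],
}

def parse_json3_transcript(raw: str) -> str:
    """Parse a JSON caption payload into plain text.

    Iterates over event segments, collects UTF-8 tokens, and drops
    newline/formatting characters, mirroring the parsing step above.
    """
    data = json.loads(raw)
    tokens = []
    for event in data.get("events", []):
        for seg in event.get("segs", []):
            text = seg.get("utf8", "").strip()
            if text:
                tokens.append(text)
    return " ".join(tokens)
```

Because the parser only touches the decoded JSON, it can be exercised independently of yt-dlp and the network, which simplifies testing of the extraction pipeline.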
Once the transcript is extracted from YouTube, it undergoes several preprocessing steps to clean
and normalize the data before summarization. This phase is critical for removing transcription
noise and ensuring consistency across various downstream modules, including TF-IDF, LSA,
and sentence embedding comparison for content filtering.
In the current implementation, transcript text is first assembled from raw subtitle data retrieved
in TTML or JSON format using yt-dlp. Captions are parsed into individual segments, with
each segment’s textual content concatenated to form the full transcript.
• Noise Removal: Time-stamps, symbols (e.g., “[Music]”, “–>”), and formatting artifacts
are stripped during raw segment processing. This cleaning stage ensures only spoken
content is retained.
• Tokenization: Text is tokenized into word-level and sentence-level units using standard
NLP tools such as NLTK or spaCy, preparing it for vectorization and summarization.
• Stopword Removal: Common English stopwords are excluded to reduce noise in the
term frequency matrix, improving the quality of latent semantic representations.
The output of this stage is a cleaned, normalized text corpus that is passed downstream to
either extractive summarization models like LSA or transformer-based summarizers such as
BART, depending on transcript length and system availability.
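The preprocessing stages above can be sketched in a few lines. For self-containment this sketch uses regular expressions and a small hand-rolled stopword set instead of the NLTK/spaCy tooling the system actually relies on.

```python
import re

# Abbreviated stopword set; the system uses NLTK/spaCy's full lists.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "it"}

def clean_and_split(transcript: str) -> list[str]:
    """Noise removal plus naive sentence splitting."""
    # Strip bracketed cues like [Music] and "-->" formatting artifacts.
    text = re.sub(r"\[[^\]]*\]|-->", " ", transcript)
    text = re.sub(r"\s+", " ", text).strip()
    # Split on terminal punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def tokenize(sentence: str) -> list[str]:
    """Word-level tokens, lowercased, with stopwords removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]
```

The output mirrors the pipeline above: a list of cleaned sentences for LSA or BART, and stopword-free token lists for the TF-IDF matrix.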
To ensure the summarized content remains suitable for children and sensitive audiences, the
system incorporates a dedicated Kids Safety Filter. This module is applied to the transcript
before summarization, filtering out sentences that contain explicit, violent, or age-inappropriate
material based on two complementary strategies: keyword-based detection and semantic simi-
larity filtering.
A predefined list of hardcoded terms and phrases (e.g., explicit words, references to violence,
hate speech) is used to flag and optionally remove sentences that directly contain inappropriate
content. This fast and interpretable rule-based filter acts as the first line of defense.
• If a sentence contains any word from the blocked list, it is marked as flagged and excluded
from downstream summarization.
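The keyword-based first line of defense amounts to a set-intersection check per sentence. The blocked terms below are placeholders; the deployed filter uses a much larger curated list.

```python
# Placeholder blocklist; the real list is larger and curated.
BLOCKED_TERMS = {"violence", "gore", "weapon"}

def keyword_filter(sentences: list[str]) -> tuple[list[str], list[str]]:
    """Rule-based screening: any sentence containing a blocked term
    is flagged and excluded from downstream summarization."""
    safe, flagged = [], []
    for sentence in sentences:
        words = {w.strip(".,!?").lower() for w in sentence.split()}
        if words & BLOCKED_TERMS:
            flagged.append(sentence)
        else:
            safe.append(sentence)
    return safe, flagged
```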
To capture implicit or euphemistic references not caught by direct keyword matches, the system
uses a second semantic filtering stage based on sentence embeddings.
• Each sentence in the transcript is converted into a high-dimensional vector using a sen-
tence embedding model (e.g., all-MiniLM-L6-v2 from SentenceTransformers).
• The cosine similarity is computed between each sentence vector and a curated list of
pre-embedded offensive phrases.
• Sentences whose similarity exceeds a predefined threshold (e.g., 0.75) are flagged as
potentially unsafe; if any sentence is flagged, the video itself is marked as unsafe.
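The semantic stage reduces to a cosine-similarity check of each sentence vector against the banned-phrase embeddings. In the real system both sets of vectors come from all-MiniLM-L6-v2; here plain Python lists stand in so the logic is self-contained.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: (A . B) / (||A|| ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def flag_unsafe(sentence_vecs, banned_vecs, threshold=0.75):
    """Indices of sentences semantically too close to any banned phrase."""
    return [
        i for i, vec in enumerate(sentence_vecs)
        if any(cosine(vec, banned) >= threshold for banned in banned_vecs)
    ]
```

The same cosine helper serves the redundancy-removal step later in the pipeline, since both tasks compare sentence vectors against a reference set.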
Cosine similarity is used to measure the semantic similarity between sentence vectors in the
latent topic space derived from TF-IDF or LSA. It quantifies the cosine of the angle between
two non-zero vectors in a multi-dimensional space, offering a normalized measure of similarity.
cosine(A, B) = (A · B) / (∥A∥ ∥B∥)    (3.1)
where A and B are the sentence embedding vectors, · denotes the dot product, and ∥A∥ is the
Euclidean norm of vector A.
• Redundancy Removal: If two sentences are highly similar, only one is retained.
• Parental Filtering: Sentences too similar to flagged examples are removed from candi-
date outputs.
Figure 3.4: Cosine similarity illustrated for vector pairs with similarity close to 1, close
to 0, and close to -1.
This dual use of cosine similarity helps the system maintain high semantic precision and safety
without compromising performance, making it suitable for browser-based summarization.
The Kids Safety Filter operates as a standalone preprocessing module. It can be toggled
based on user preference (e.g., parental control mode), and all filtering decisions are logged for
auditability. This modular structure ensures the core summarization logic remains unaffected,
while still supporting strong content safety guarantees.
This section describes the models used to generate summaries from YouTube video transcripts.
The system employs a hybrid approach, using both extractive and abstractive summarization
techniques while optionally offloading heavier abstractive summarization to cloud-based APIs.
A parental control filtering step is applied after each summarization process to ensure the
generated content is safe for children.
Latent Semantic Analysis (LSA) is an unsupervised extractive technique used to identify and
retain the most semantically significant sentences from a document. In the context of this
system, LSA is applied to preprocessed YouTube transcripts to generate concise, informative
summaries. It works by reducing the dimensionality of a term-sentence matrix via Singular
Value Decomposition (SVD), uncovering latent semantic structures in the data.
• Term Frequency (TF): Measures how frequently a term occurs in a document, normal-
ized to account for document length.
• Inverse Document Frequency (IDF): Measures how rare a term is across a corpus,
reducing the influence of commonly occurring words.
TF-IDF helps distinguish important content words from generic ones (e.g., "the", "is"). In this
pipeline, TF-IDF is applied at the sentence level, treating each sentence as a "document" and
the transcript as the corpus. This enables construction of a weighted representation of term
importance within each sentence.
Term Frequency:
The term frequency is calculated based on the formula:

TF(t, d) = f(t, d) / N_d    (3.2)
where f(t, d) is the number of times term t occurs in document d, and N_d is the total
number of terms in d. This normalizes the term count so that longer documents don't
dominate due to their size.
Inverse Document Frequency:
The inverse document frequency is calculated as:

IDF(t, D) = log(N / n_t)    (3.3)
where N is the total number of documents in the corpus D, and n_t is the number of
documents that contain the term t. The logarithmic scale reduces the impact of very
common words, like "the" or "and," which occur in nearly every document.
TF-IDF Score: The TF-IDF score for a term t in a document d is calculated as:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

This combines the term's frequency in a document with its rarity across the entire corpus.
The implementation of TF-IDF involves several steps. First, preprocessing is applied to the
text, including tokenization, removal of stop words (e.g., "the," "is," "and"), and stemming
or lemmatization to reduce words to their root forms. Term frequencies are then calculated
for each word in a document, followed by computing the inverse document frequency for the
entire corpus [14]. Finally, the TF-IDF score for each term is calculated by combining these two
measures, enabling the identification of the most significant terms within a document.
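The steps above can be sketched as a compact sentence-level TF-IDF builder. This is an illustrative stand-in for scikit-learn's vectorizer: each sentence is treated as a "document" and the transcript as the corpus, exactly as described, but tokenization is reduced to a whitespace split.

```python
import math
from collections import Counter

def tfidf_matrix(sentences: list[str]):
    """Sentence-level TF-IDF: each sentence is a 'document',
    the transcript is the corpus."""
    docs = [s.lower().split() for s in sentences]
    n_docs = len(docs)
    doc_freq = Counter()                 # n_t: documents containing term t
    for doc in docs:
        for term in set(doc):
            doc_freq[term] += 1
    vocab = sorted(doc_freq)
    matrix = []
    for doc in docs:
        tf = Counter(doc)                # f(t, d)
        row = [
            (tf[term] / len(doc)) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ]
        matrix.append(row)
    return vocab, matrix
```

Note that a term occurring in every sentence receives IDF = log(1) = 0, which is exactly the discounting of generic words discussed above.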
TF-IDF has several practical applications. In text summarization, it helps identify key terms
that are used to extract meaningful sentences from a document. Search engines rely on TF-
IDF to rank documents based on the relevance of search queries. It is also widely used for
keyword extraction, content categorization, and even recommender systems to identify and
suggest relevant content.
Despite its simplicity and computational efficiency, TF-IDF has limitations. It treats words as
independent units, ignoring their context and relationships, which is a disadvantage for tasks
requiring semantic understanding. Additionally, it does not account for the sequence of words,
which can be crucial in certain text analysis tasks. While IDF discounts common terms, it may
overly penalize very rare terms that could still be important. Furthermore, for large corpora,
calculating IDF values can become computationally expensive, posing scalability challenges.
This matrix encodes semantic information, capturing the relevance of each term to each sen-
tence. The resulting matrix is typically sparse and high-dimensional. The term-sentence
matrix G is then factorized using Singular Value Decomposition:

G = U Σ V^T

where U contains the left singular vectors (term-topic associations), Σ is a diagonal matrix
of singular values ordered by magnitude, and V^T contains the right singular vectors
(sentence-topic associations).
Figure 3.7: Truncated Singular Value Decomposition (SVD) applied to matrix G. Only the
top r singular values and their corresponding left and right singular vectors are retained to
form a low-rank approximation Ũ Σ̃Ṽ T . This process captures the most significant semantic
structure while discarding noise.
To better understand this process, Figure 3.7 illustrates the truncated SVD decomposition.
Only the top r singular values (and their corresponding vectors) are retained to form a low-
rank approximation of the original matrix. This reduction step is critical in LSA as it discards
less important dimensions (i.e., noise or minor topics), focusing the model on dominant semantic
concepts.
The decomposition reduces noise and captures the core semantic structure of the text. By
retaining only the top k singular values (typically 1–3 for short documents), we focus on the
most significant semantic concepts.
Each sentence is scored based on its relevance to the dominant latent topics:
• The matrix V T encodes how strongly each sentence relates to each topic.
• Sentence importance is calculated by computing the Euclidean norm (or weighted mag-
nitude) across the top k components of each column in V T .
• Higher scores indicate a sentence that aligns closely with the main semantic themes of
the document.
The top n scoring sentences are selected as the summary. The parameter n is either fixed
(e.g., 3 sentences) or computed dynamically as a proportion (e.g., 30%) of the total number
of sentences. The selected sentences are arranged in their original order to preserve narrative
flow.
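The scoring and selection steps can be sketched in a few lines. This is an illustrative sketch: V^T is represented as a plain list of topic rows (one row per latent topic, one column per sentence), and the optional singular-value weighting mentioned above is omitted for brevity.

```python
import math

def score_sentences(vt: list[list[float]], k: int = 2) -> list[float]:
    """Score each sentence by the Euclidean norm of the top-k
    components of its column in V^T."""
    k = min(k, len(vt))
    n_sentences = len(vt[0])
    return [
        math.sqrt(sum(vt[topic][j] ** 2 for topic in range(k)))
        for j in range(n_sentences)
    ]

def select_summary(sentences: list[str], scores: list[float],
                   n: int = 3) -> list[str]:
    """Pick the n highest-scoring sentences, returned in their
    original order to preserve narrative flow."""
    top = sorted(range(len(sentences)),
                 key=lambda j: scores[j], reverse=True)[:n]
    return [sentences[j] for j in sorted(top)]
```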
3.9.1.6 Post-Processing
To improve readability and user experience, the following steps are applied:
• Redundancy Filtering: Sentences with high cosine similarity (above a fixed threshold)
are discarded to reduce duplication.
• Parental Control Filtering: The final summary is passed through a content filter to
remove or flag inappropriate material.
This process yields a compact, informative, and safe summary suitable for both children and
adults.
To enable fluent and abstractive summaries, the system utilizes DistilBART-CNN, a distilled
variant of Facebook’s BART model that is fine-tuned on the CNN/DailyMail dataset. This
model is accessed through HuggingFace’s pipeline API, enabling easy integration and fast
inference.
The BART summarization module is designed to handle large transcripts that may exceed the
model’s input token limit (commonly 1024 tokens). To address this, a custom chunking and
summarization pipeline is implemented.
If the transcript exceeds the token limit, it is processed using a sentence-based, tokenizer-aware
chunking mechanism:
• Token Estimation: Each sentence is tokenized using the same tokenizer as the BART
model (i.e., facebook/bart-large-cnn) to estimate the token length.
• Chunk Construction: Sentences are grouped into chunks of 800–900 tokens to stay
well within the model’s upper limit.
• Adaptive Chunk Count: The number of chunks dynamically depends on the total
length of the transcript, ensuring coverage without truncation.
• Summary Merging: The resulting chunk-level summaries are concatenated into a co-
herent intermediate summary.
• Recursive Summarization: If the merged summary is still too lengthy for frontend
display, it may undergo a second round of summarization through BART to improve
conciseness.
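The chunking mechanism above can be sketched as follows. In the real pipeline the token estimate comes from the facebook/bart-large-cnn tokenizer; here a whitespace word count stands in as a rough, self-contained proxy.

```python
def chunk_sentences(sentences, max_tokens=850, estimate=None):
    """Group sentences into chunks that stay below max_tokens.

    `estimate` stands in for counting tokens with the BART tokenizer;
    a word count is used by default so the sketch runs stand-alone.
    """
    estimate = estimate or (lambda s: len(s.split()))
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = estimate(sentence)
        # Close the current chunk if adding this sentence would overflow.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunk boundaries always fall between sentences, no sentence is truncated mid-way, and the number of chunks adapts automatically to the transcript length.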
If any step in the BART summarization pipeline fails—due to memory issues, API errors, or
tokenization mismatches—the system gracefully falls back to the Latent Semantic Analysis
(LSA) based extractive summarizer. This ensures robust performance under varying runtime
constraints.
This architecture ensures scalability across video lengths and quality expectations, and inte-
grates seamlessly with the frontend for delivering structured summaries.
In addition to the default DistilBART-based summarization, the system also integrates general-
purpose large language models (LLMs) such as GPT-4, Claude, and Gemini via secure API calls
routed through chutes.ai. These LLMs support both few-shot and zero-shot summarization,
offering superior abstraction, coherence, and linguistic fluency.
• Task Adaptivity: Through prompt engineering, the summary output can be tailored by
style (e.g., explanatory, conversational) or format (e.g., bulleted list, paragraph, Q&A).
• Multilingual and Safety Features: Most LLMs include built-in safety filters and
multilingual understanding, making them robust for diverse audiences.
The backend is built using Flask and serves as the interface between the frontend and the core
processing components. It exposes modular, RESTful API endpoints that manage transcript
retrieval, summarization (both extractive and abstractive), and kid-safe filtering.
• GET /transcript
Accepts a YouTube video URL and returns the cleaned transcript text using a dedi-
cated caption extraction module. The transcript is preprocessed to remove noise (e.g.,
timestamps, filler text).
• GET /summary
Accepts a video URL and produces a summary using a multi-stage decision pipeline:
– The transcript is first validated for length and content safety using both keyword-
based and embedding-based checks.
• GET /health
A lightweight endpoint used to monitor backend health and confirm if the summarization
model is loaded successfully.
Internally, the backend also uses a semantic filtering module based on cosine similarity between
transcript embeddings (generated via all-MiniLM-L6-v2) and a sensitive phrase bank. This
enables deeper context-aware filtering beyond basic keyword matching.
The Flask server is CORS-enabled to support direct communication with the Chrome extension
and other frontends.
The backend handles all preprocessing, LSA computation, keyword extraction, and content
filtering, returning results in structured JSON format to be consumed by the frontend.
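The decision logic behind the summary endpoint can be sketched as a plain function, separated from Flask routing so it is testable on its own. Function names, the token threshold, and the response fields are illustrative assumptions, not the project's exact code.

```python
def summary_route(transcript: str, bart_available: bool,
                  lsa_summarize, bart_summarize, token_limit: int = 1024):
    """Multi-stage decision pipeline behind GET /summary (sketch).

    Chooses BART when it is available and the transcript fits its
    token limit; otherwise falls back to the LSA extractive path.
    """
    if not transcript or not transcript.strip():
        return {"status": "error", "reason": "empty transcript"}
    est_tokens = len(transcript.split())   # crude stand-in for tokenizer count
    if bart_available and est_tokens <= token_limit:
        return {"status": "ok", "engine": "bart",
                "summary": bart_summarize(transcript)}
    return {"status": "ok", "engine": "lsa",
            "summary": lsa_summarize(transcript)}
```

A Flask view would simply extract the URL parameter, fetch the transcript, and return this dictionary with jsonify, keeping the routing layer thin.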
The backend Flask server, titled TranscriptApp, is responsible for summarization, content
filtering, and communication with the Chrome extension. Upon running, it outputs a startup
log confirming the successful initialization of core components.
Note: As shown in Figure 3.8, this is a development server only. For production deployment,
a WSGI server such as gunicorn or uWSGI should be used.
The Chrome Extension serves as the client-side interface for kid-safe filtering of YouTube videos.
It is structured around three primary components: the content script, background service
worker, and the extension manifest.
The content script is injected into YouTube pages and is responsible for real-time interaction
with the video player and DOM. Its core functionalities include:
• Video Blocking: Initially pauses or mutes videos until they are explicitly approved.
This is enforced via a periodic check on the <video> element and event listeners for play
attempts.
• Visual Overlay: Displays a full-screen blocker with the extension’s branding and safety
warning. This ensures users are aware that the content requires safety verification.
The background service worker performs higher-level extension management and cross-tab co-
ordination:
• Script Injection: Ensures that the content script is loaded into YouTube tabs upon
navigation or extension installation.
• Flask Server Communication: Sends health-check requests to the local Flask backend
to determine server availability.
• Global Messaging: Handles global commands such as retrieving all approved video IDs
or clearing them from storage.
The extension manifest defines the extension's configuration and capabilities:
• Permissions: Includes tabs, scripting, storage, and YouTube URL host permissions
for full interaction.
• Content Script Registration: Specifies that content.js should run on all YouTube
URLs at document_start.
• Extension UI: Defines icons, a popup interface (popup.html), and default titles.
1. When a YouTube page is loaded, the content script runs and checks whether the video has
already been approved.
2. If the video is not approved, playback is blocked and the warning overlay is displayed.
3. The user can interact with the popup to request a safety check (via the Flask backend).
Experimental Results
Although no formal benchmark evaluation was conducted, a series of functional tests were
carried out to verify the system’s effectiveness in real-world usage. Summarization outputs
were visually inspected, and the system was tested across a range of YouTube videos varying
in length, topic, and language clarity.
• Parental Control Filtering: The system accurately flagged or removed content con-
taining inappropriate language, as seen in Figure ??, using both keyword-based and
embedding-based filtering methods.
Screenshots from the Chrome extension and backend output console illustrate successful sum-
marization and filtering in real scenarios.
Conclusion
The backend of the YouTube Transcript Summarizer was implemented in Python, leveraging
a combination of classical and modern Natural Language Processing (NLP) techniques. Li-
braries such as SpaCy, NLTK, and scikit-learn were used for text preprocessing and TF-IDF
matrix construction. Latent Semantic Analysis (LSA), powered by Singular Value Decompo-
sition (SVD), serves as the primary extractive summarization engine due to its computational
efficiency and semantic consistency in resource-constrained environments.
To improve output quality, cosine similarity filtering is applied to eliminate redundancy and
enhance the diversity of selected sentences. Additionally, the backend supports optional ab-
stractive summarization using Hugging Face’s Transformers library (e.g., BART), enabling
fluent and context-aware summaries when required.
Deployment is managed through a lightweight Flask backend that exposes several RESTful API
endpoints:
The user interface is delivered through a Chrome Extension that enables users to input YouTube
video links and receive summaries and tags directly within the browser. This modular archi-
tecture ensures cross-platform compatibility, ease of use, and fast response times.
Although the current implementation offers robust summarization and filtering, several im-
provements are envisioned for future development:
• Robust Error Handling: Enhanced handling of invalid URLs, private videos, livestreams,
and API failures.
• User Customization: Enable control over summary length, topic tag density, and con-
tent sensitivity levels.
Appendix
Bibliography
[2] Calvert, S. L. (2008). Children as consumers: Advertising and marketing. The Future of
Children, 205–234.
[4] Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research
and Development, 2(2), 159–165.
[5] Edmundson, H. P. (1969). New methods in automatic extracting. Journal of the ACM
(JACM), 16(2), 264–285.
[6] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990).
Indexing by latent semantic analysis. Journal of the American Society for Information
Science, 41(6), 391–407.
[7] Kupiec, J., Pedersen, J., & Chen, F. (1995). A trainable document summarizer. In Pro-
ceedings of the 18th Annual International ACM SIGIR Conference on Research and Devel-
opment in Information Retrieval (pp. 68–73). ACM.
[8] Gong, Y., & Liu, X. (2001). Generic text summarization using relevance measure and latent
semantic analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval (pp. 19–25). ACM.
[9] Lin, C. Y. (2004, July). ROUGE: A package for automatic evaluation of summaries. In
Text Summarization Branches Out (pp. 74–81).
[10] Rush, A. M., Chopra, S., & Weston, J. (2015). A neural attention model for abstractive
sentence summarization. arXiv preprint arXiv:1509.00685.
[11] Kumar, K., Shrimankar, D. D., & Singh, N. (2016, November). Equal partition based clus-
tering approach for event summarization in videos. In 2016 12th International Conference
on Signal-Image Technology & Internet-Based Systems (SITIS) (pp. 119–126). IEEE.
[12] See, A., Liu, P. J., & Manning, C. D. (2017). Get to the point: Summarization with
pointer-generator networks. arXiv preprint arXiv:1704.04368.
[13] Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The long-document trans-
former. arXiv preprint arXiv:2004.05150.
[14] Raposo, G., Raposo, A., & Carmo, A. S. (2022). Document-Level Abstractive Summariza-
tion. arXiv preprint arXiv:2212.03013.
[15] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies (NAACL-HLT), pp. 4171–4186.
[16] Zhang, H., Yu, P. S., & Zhang, J. (2024). A Systematic Survey of Text Summarization:
From Statistical Methods to Large Language Models. arXiv preprint arXiv:2406.11289.
[18] Beel, J., Gipp, B., Langer, S., & Breitinger, C. (2016). Paper recommender systems: a
literature survey. International Journal on Digital Libraries, 17, 305–338.
[19] Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application
in retrieval. Journal of Documentation, 28(1), 11–21.
[20] Shleifer, S. (2021). DistilBART-CNN-12-6: A distilled BART model for abstractive
summarization. Available at: https://huggingface.co/sshleifer/distilbart-cnn-12-6
(accessed June 9, 2025).