Building A Tool To Summarize YouTube Videos

A Project Report Submitted


In Partial Fulfilment of the Requirements
for the Degree of

Bachelor of Technology
in
Computer Science & Engineering

by

Milind Nautiyal (Roll No. 2100520100045)


Faiz Khan (Roll No. 2100520100028)

Under the Guidance of


Prof. Vineet Kansal
Dr. Ankita Srivastava

Department of Computer Science and Engineering


Institute of Engineering & Technology
Dr. APJ Abdul Kalam Technical University, Uttar Pradesh

June, 2025
Declaration
We hereby declare that this project submission is our own work and that, to the best of our knowledge and belief, it contains no material previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree of the university or other institute of higher learning, except where due acknowledgement has been made in the text.

This project report has not been submitted by us to any other institute for the requirement of any other degree.

Signature of the Students

Milind Nautiyal (2100520100045)

Faiz Khan (2100520100028)

Certificate
This is to certify that the project report entitled: Building A Tool To Summarize YouTube
Videos As Well As Implementing Content Filtering submitted by Milind Nautiyal
and Faiz Khan in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology
in Computer Science & Engineering is a record of the bonafide work carried out by them under
our supervision and guidance at the Department of Computer Science & Engineering, Institute
of Engineering & Technology Lucknow.

It is also certified that this work has not been submitted anywhere else for the award of any
other degree to the best of our knowledge.

(Prof. Vineet Kansal)


Department of Computer Science and Engineering,
Institute of Engineering & Technology, Lucknow

(Dr. Ankita Srivastava)


Department of Computer Science and Engineering,
Institute of Engineering & Technology, Lucknow

Acknowledgement
We take this opportunity to express our sincere gratitude to all those who supported and guided
us throughout our project work. We are thankful to the Almighty for the grace and blessings
that enabled us to complete this project successfully.

We are deeply grateful to our supervisor, Dr. Vineet Kansal, for his consistent support, ex-
pert guidance, and encouragement throughout the B.Tech project. His insights and suggestions
were crucial to the success of our work.

We would also like to thank our supervisor, Dr. Ankita Srivastava, for her valuable inputs
and continuous support during the course of this project. Her insights and suggestions were
instrumental in shaping the direction of our work.

Our sincere thanks to Dr. Pawan Kumar Tiwari, our project coordinator, for providing us
with the opportunity and resources to undertake this project, as well as his suggestions which
helped us further refine our initial idea.

We are also thankful to the Project Evaluation Committee Members for their feedback
and constructive suggestions, which helped refine our work.

We extend our appreciation to Dr. Girish Chandra, the Head of the Department, Department of Computer Science & Engineering, Institute of Engineering & Technology, Lucknow, for ensuring access to essential resources and fostering a supportive academic environment that enabled the smooth execution of this project.

Lastly, we thank our families and friends for their unwavering support and motivation through-
out this journey.

Milind Nautiyal

Faiz Khan

Abstract

With the exponential growth of media online, controlling the flow of information, especially
for younger audiences, has become increasingly important for parents and guardians. This
project aims to solve this critical problem by presenting a Google Chrome extension which
leverages advanced Natural Language Processing techniques to provide meaningful as well
as concise summaries of YouTube video transcripts, while also enhancing parental oversight
over the type of content children are exposed to online.

The extension integrates both traditional and modern NLP approaches, including a Latent Semantic Analysis (LSA) pipeline and a BART-based abstractive summarization model, to generate human-readable summaries of video transcripts fetched via the YouTube Data API.
Additionally, users may optionally harness Large Language Models (LLMs) through third-
party APIs for more context-aware summarization.

To ensure safe media consumption, this system includes a keyword-based filtering mech-
anism that detects and flags potentially sensitive themes such as violence, explicit language, or
adult content using curated and dynamic keyword databases. Flagged content can be restricted
from playback, and summaries of such content can be routed to guardians for review.

These capabilities are especially crucial in light of documented controversies such as Elsagate [3],
where disturbing and inappropriate content managed to bypass platform-level filters and reach
child audiences. The growing presence of AI-generated and bot-driven videos—many of which
mimic child-friendly content while containing harmful material—further underscores the need
for more robust and fine-grained moderation systems. As such, the project highlights the im-
portance of automated content analysis and filtering mechanisms that can assist guardians in
managing online media consumption effectively.

The backend is powered by a lightweight and modular Flask server, enabling fast, scalable
processing and paving the way for future enhancements. Future iterations aim to incorporate
fine-tuned Transformer models for improved contextual understanding, and possibly real-time
content monitoring as well as an interactive dashboard on the browser’s new tab page, allowing
guardians to customize filtering rules and monitor viewed content.

Contents

Declaration
Certificate
Acknowledgements
Abstract
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Problem Statement
  1.2 Objective
  1.3 System Overview
  1.4 Common Use Cases in Daily Life

2 Literature Review
  2.1 Related Works
  2.2 Foundation of Extractive Summarization
    2.2.1 Early Efforts
    2.2.2 Transition to Semantic Techniques in the 1990s
  2.3 Statistical and Machine Learning Approaches
    2.3.1 Probabilistic Models and Feature Engineering (1990s)
    2.3.2 Evaluation and Benchmarking
  2.4 Rise of Deep Learning and Transformer-Based Models
    2.4.1 BERT and Contextual Representations
    2.4.2 Abstractive Models: BART, T5, and Beyond
  2.5 Meta-Review and Evolutionary Perspectives
  2.6 Conclusion
    2.6.1 Literature Review

3 Methodology
  3.1 Existing Methodologies
  3.2 Proposed Methodology
    3.2.0.1 System Pipeline Overview
  3.3 Advantages of Proposed Methodology
  3.4 Disadvantages and Limitations
  3.5 Comparison Between LSA, BART & LLMs
  3.6 Transcript Extraction Method (from YouTube Transcripts)
  3.7 Preprocessing the Text
  3.8 Parental Control and Kids Safety Filtering
    3.8.1 Keyword-Based Filtering
    3.8.2 Semantic Filtering via Embedding Similarity
      3.8.2.1 Cosine Similarity Filtering
    3.8.3 Modularity and Control
    3.8.4 Future Improvements
  3.9 Summarization Models
    3.9.1 Latent Semantic Analysis (LSA) – Extractive Summarization
      3.9.1.1 Term Frequency-Inverse Document Frequency (TF-IDF)
      3.9.1.2 Matrix Construction
      3.9.1.3 Singular Value Decomposition (SVD)
      3.9.1.4 Sentence Scoring
      3.9.1.5 Sentence Selection
      3.9.1.6 Post-Processing
    3.9.2 BART – Abstractive Summarization
      3.9.2.1 Token-Aware Chunking
      3.9.2.2 Chunk-Level Summarization and Merging
      3.9.2.3 Fallback Mechanism
      3.9.2.4 Advantages of BART-Based Summarization
    3.9.3 Large Language Models (LLMs)
  3.10 Flask Backend
    3.10.1 Flask Development Server Execution
  3.11 Chrome Extension Architecture
    3.11.1 Content Script (content.js)
    3.11.2 Background Script (background.js)
    3.11.3 Extension Manifest (manifest.json)
    3.11.4 Extension Workflow Summary

4 Experimental Results
  4.1 Experimental Results

5 Conclusion
  5.1 Backend Implementation Using NLP Libraries
  5.2 YouTube Transcript Extraction
  5.3 Deployment via Flask and Chrome Extension
  5.4 Planned Enhancements

A Appendix

References

List of Figures

1.1 Proposed System Overview: Abstract pipeline of the YouTube Transcript Summarization and Parental Control System, illustrating the transformation from raw video input to semantically enriched output
3.1 System Pipeline Overview: From YouTube Transcript Extraction to Frontend Summary Display
3.2 Example terminal log showing successful transcript extraction
3.3 Terminal Message: Cosine Similarity
3.4 Cosine Similarity showing two vectors with similarities close to 1, close to 0, and close to -1
3.5 Converting Text To TF-IDF Matrix
3.6 Term-Sentence TF-IDF Matrix
3.7 Truncated Singular Value Decomposition (SVD) applied to the term-sentence matrix
3.8 Flask development server log showing initialization of TranscriptApp
4.1 The Terminal At The End (Fullscreen)
4.2 Opening YouTube Video
4.3 After Summary
4.4 Summary Through API

List of Tables

2.1 Summary of Related Work on Text Summarization Techniques
3.1 Comparison of Summarization Strategies Used in the Proposed System
Chapter 1

Introduction

In the digital age, video content has emerged as one of the most dominant and widely consumed
forms of media. Platforms like YouTube host billions of videos spanning diverse domains such
as education, entertainment, news, and personal content creation [1]. However, this abundance
also presents a growing challenge, as users are often overwhelmed by the sheer volume of
information, making it difficult to locate relevant content quickly and efficiently.

This issue becomes particularly evident when dealing with long-form videos or scenarios that
require time-sensitive viewing. For instance, students revisiting recorded lectures, profes-
sionals consulting instructional content, or parents reviewing what their children are watching
all face the same challenge: extracting relevant information from lengthy videos is often te-
dious, inefficient, and cognitively demanding. Without mechanisms for quick summarization
or content preview, viewers are forced to manually scrub through timelines—an approach that
is both time-consuming and error-prone.

In addition, the unrestricted availability of video content raises serious concerns regarding con-
tent safety for children. Young viewers typically lack the maturity and critical thinking
skills needed to evaluate content quality or intent [2], making them particularly susceptible to


inappropriate or manipulative material. Content creators may exploit this vulnerability by pro-
ducing seemingly child-friendly videos that are designed primarily to generate revenue through
advertisements and sponsorships, often prioritizing engagement over ethical responsibility. As a
result, children can inadvertently be exposed to content that promotes consumerism, ideological
bias, or even harmful narratives. This situation underscores the necessity for active parental
involvement and more intelligent moderation tools, especially in the face of known failures in
platform-level filters, such as those highlighted in the Elsagate controversy [3].

The project addresses these problems by proposing that the solution lies in automatic sum-
marization and semantic analysis of video transcripts. By distilling long videos into concise
summaries and tagging them with relevant topics or potential red flags, users can engage with
video content more effectively and safely. This project introduces a browser-based tool that
combines Natural Language Processing (NLP) techniques, keyword extraction, and parental
control features to empower viewers, particularly guardians, with greater control, efficiency,
and transparency over the content being consumed by themselves and their wards.

1.1 Problem Statement

While platforms like YouTube have revolutionized content accessibility and democratized in-
formation sharing, they have also inadvertently opened the floodgates to inappropriate and
harmful material, particularly for a young audience which is more vulnerable than ever. Despite
attempts to moderate content using algorithms and platform-level filters, incidents such as the
2017 “Elsagate” controversy have exposed critical shortcomings in these systems. Thousands
of seemingly child-friendly videos—often featuring popular cartoon characters—were discovered
to contain disturbing, violent, or sexually suggestive content, all while being disguised under
misleading titles and thumbnails.

Such incidents highlight the fundamental limitations of keyword-based moderation and super-
ficial content filtering. They underscore the pressing need for more nuanced, context-aware
systems capable of understanding the semantic content of videos, rather than relying solely
on metadata or surface-level heuristics.

Beyond the issue of content safety, there is also the broader challenge of information over-
load. For everyday users—such as students revisiting lectures, professionals referencing tuto-
rials, or parents reviewing their children’s media consumption—the ability to extract relevant
information from long-form videos is severely limited. Manually scanning through videos is
inefficient, time-consuming, and often ineffective, especially when only small segments contain
useful information.

Currently, there is a noticeable lack of integrated tools that can:

• Intelligently summarize and analyze video transcripts using semantic understanding, and

• Flag potentially inappropriate or harmful content based on context rather than surface-
level keywords.

This project addresses that gap by developing a Chrome extension that integrates Transformer-
based summarization, semantic keyword extraction, and customizable parental control fea-
tures. The proposed tool aims to facilitate faster, safer, and more informed video consump-
tion across diverse user groups.

1.2 Objective

The primary objective of this project is to design and develop a Chrome extension that trans-
forms how users, in particular guardians, educators, and students, interact with online video
content by offering intelligent summarization and context-sensitive content analysis.

The extension aims to enhance both comprehension and safety in digital media consumption.

The specific goals of the project are as follows:

• Automatic Transcript Summarization: Implement state-of-the-art summarization models to generate concise, human-readable summaries of YouTube video transcripts, enabling users to quickly grasp the core content without viewing the entire video.

• Context-Aware Keyword and Topic Extraction: Utilize Natural Language Processing (NLP) techniques to extract semantically meaningful keywords and topics, aiding in content classification, relevance scoring, and insight generation.

• Parental Control and Sensitivity Detection: Develop a robust flagging mechanism that detects potentially inappropriate or disturbing content based on extracted semantic features.

• User-Friendly Interface and Dashboard: Lay the groundwork for an interactive browser dashboard where users can access summaries, view associated topics, and configure sensitivity filters and parental controls.

• Scalable and Modular Backend Architecture: Build a Flask-based backend to support core functionalities such as transcript processing, summarization, and flagging, ensuring modularity and scalability for future extensions.

By fulfilling these objectives, the project aims to deliver a practical and user-centric tool that
improves access to relevant video content while addressing growing concerns around content
safety in the digital age. The integration of advanced NLP techniques, context-aware sensitivity
analysis, and extensible system design ensures the tool remains adaptable for future use cases
like multilingual summarization, integration with other video platforms, or enhancement via
fine-tuned large language models.

1.3 System Overview

At its core, the YouTube Transcript Summarization and Parental Control Chrome
Extension is designed to process YouTube video transcripts to generate intelligent summaries,
semantic topic tags, and sensitivity flags. This facilitates a safer, more efficient, and context-
aware video consumption experience.

The end-to-end architecture of the system can be visualized as follows:

Figure 1.1: Proposed System Overview: Abstract pipeline of the YouTube Transcript Sum-
marization and Parental Control System, illustrating the transformation from raw video input
to semantically enriched output.

The NLP-driven pipeline ensures that the generated output goes beyond mere textual extrac-
tion. It produces contextually meaningful summaries, enriched with topic-level annota-
tions and content sensitivity alerts. By applying both statistical and deep learning-based
techniques, the system offers a more nuanced, human-like understanding of the video’s content.

This facilitates not only quicker comprehension but also active content moderation and parental
oversight.

1.4 Common Use Cases in Daily Life

1. A concerned parent wants to verify whether a trending YouTube video contains mature,
misleading, or inappropriate content before allowing their child to watch it.

2. A student preparing for exams needs to quickly revisit the key points of a lengthy
educational video without rewatching the entire lecture.

3. A time-constrained viewer searching for cooking or DIY content wants an at-a-glance summary to determine whether a video is worth following.

4. A teacher curating educational resources seeks to ensure that recommended videos are aligned with curriculum goals and are free from off-topic or inappropriate segments.

5. A late-night viewer prefers to silently skim summaries and topic tags rather than playing
video audio, avoiding disturbance while still engaging with the content.

6. A parent unfamiliar with English relies on visual summaries and topic tags to evaluate
whether an English-language video is suitable for their child.

These scenarios illustrate the system’s practical value in daily digital interactions. Whether
supporting parental oversight, enhancing educational efficiency, or enabling informed decision-
making, the proposed extension addresses real-world challenges. By bridging the gap between
content accessibility and content safety, the tool fosters a more mindful, secure, and user-centric
digital media environment.
Chapter 2

Literature Review

2.1 Related Works

Text summarization, a fundamental task within the domain of Natural Language Processing
(NLP), focuses on condensing large volumes of textual information into concise, coherent, and
informative representations. As digital content proliferates across the web, effective summa-
rization techniques have become indispensable for enhancing information accessibility, reducing
cognitive load, and enabling quick decision-making across domains such as journalism, educa-
tion, and multimedia content analysis.

Summarization techniques are generally classified into two primary categories: extractive and
abstractive. Extractive summarization identifies and selects key sentences or phrases directly
from the source material, maintaining their original form. Abstractive summarization, on the
other hand, involves generating novel sentences that may not explicitly exist in the source,
thereby mimicking the human approach to summarization. While abstractive methods are more
flexible and expressive, they often demand extensive computational resources, sophisticated
language modeling, and large annotated datasets for training. In contrast, extractive methods


offer greater scalability and are more suitable for real-time or resource-constrained applications,
such as browser extensions.

Recent advancements have extended summarization research beyond static textual inputs into
multimedia domains, particularly video content. In this context, transcripts derived from au-
tomatic speech recognition (ASR) systems act as intermediary textual data. However, summa-
rizing transcripts presents unique challenges, including disfluencies, informal language, missing
punctuation, and irregular sentence boundaries. Addressing these challenges requires enhanced
semantic understanding through models such as Latent Semantic Analysis (LSA) and contex-
tualized embeddings from pretrained transformer-based architectures.

This project builds upon these developments by applying both extractive and abstractive sum-
marization techniques to YouTube video transcripts. It further integrates semantic keyword
extraction and sensitivity detection, offering a real-time, browser-based interface aimed at im-
proving both the efficiency and safety of video content consumption. In doing so, it contributes
to the growing body of research focused on multimodal NLP applications and user-centric media
moderation tools.

2.2 Foundation of Extractive Summarization

2.2.1 Early Efforts

The origins of extractive summarization can be traced back to foundational work in the mid-
20th century. One of the earliest and most influential contributions was made by Hans Peter
Luhn (1958), who proposed a frequency-based method of summarization. Luhn’s approach
emphasized that words appearing more frequently, excluding common stop words, were likely
to represent the core themes of a document [4].

Building upon this foundation, Harold P. Edmundson (1969) introduced a heuristic-based model that incorporated additional features such as the presence of cue words (e.g., “in conclusion,” “important,” etc.), sentence position (e.g., opening or closing sentences of a paragraph), and keyword frequency to improve the relevance of selected sentences [5].

These early approaches formed the basis for extractive summarization techniques, prioritizing
sentence selection based on surface-level features such as frequency, positional importance, and
lexical cues. While these methods were computationally efficient and easy to interpret, the
resulting summaries often suffered from a lack of coherence and contextual understanding.

2.2.2 Transition to Semantic Techniques in the 1990s

The 1990s witnessed a shift from purely statistical models toward semantically enriched ap-
proaches, driven by advances in machine learning and information retrieval. Techniques such
as Latent Semantic Analysis (LSA) and graph-based ranking algorithms like TextRank signif-
icantly improved the quality of extractive summarization by leveraging semantic relationships
and structural properties of text.

A pivotal development during this period was introduced by Scott Deerwester et al. (1990),
who proposed Latent Semantic Analysis (LSA)—a technique that applies Singular Value De-
composition (SVD) to a term-document matrix in order to uncover latent structures in textual
data [6]. By reducing the dimensionality of this matrix, LSA was able to identify semantically
similar terms and documents, even when they did not share exact lexical overlap. This allowed
summarization systems to capture underlying conceptual relationships rather than relying solely
on surface-level frequency counts.

Despite its innovation, LSA had inherent limitations. Its reliance on linear algebra and the
bag-of-words model meant it could not effectively account for word order, syntax, or linguistic

ambiguity. As a result, while LSA offered improved thematic coherence over earlier frequency-
based models, its summaries sometimes lacked grammatical fluency and contextual nuance.

Nevertheless, LSA laid crucial groundwork for subsequent developments in semantic summa-
rization and keyword extraction, forming a bridge between early statistical models and the
more sophisticated, context-aware methods that emerged in the deep learning era.

2.3 Statistical and Machine Learning Approaches

2.3.1 Probabilistic Models and Feature Engineering (1990s)

Paul Kupiec (1995) introduced one of the first probabilistic frameworks for extractive summa-
rization, utilizing surface-level features such as sentence position, length, and keyword density
to determine sentence importance [7]. This marked a transition from purely heuristic methods
to early statistical learning, although these models lacked any deep understanding of language
semantics.

2.3.2 Evaluation and Benchmarking

A major challenge in summarization research has been evaluation. Chin-Yew Lin (2003) pro-
posed the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, which com-
pares n-gram overlap between generated summaries and human-written references [9]. ROUGE
quickly became a standard for assessing summarization quality. However, it tends to reward
lexical similarity over semantic correctness, making it less suited for evaluating abstractive or
highly paraphrased outputs.

2.4 Rise of Deep Learning and Transformer-Based Models

The advent of deep learning significantly enhanced the quality and fluency of text summa-
rization systems. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM)
networks were first to model sequential dependencies in text, but struggled with long-range
context.

The real breakthrough came with the introduction of the Transformer architecture by Vaswani
et al. (2017), which enabled parallelized processing of entire sequences through self-attention
mechanisms.

2.4.1 BERT and Contextual Representations

BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. (2018), was a transformative leap in contextual language modeling [15]. Pretrained using masked language modeling and next sentence prediction objectives, BERT enabled deep bidirectional understanding of context, making it ideal for a variety of downstream tasks including summarization, question answering, and classification.

While BERT is not a summarization model per se, its architecture has been widely adapted in
summarization pipelines—for example, via fine-tuning on sentence ranking tasks or using BERT
embeddings as inputs to summarizers. However, its computational demands pose limitations
in resource-constrained environments like browser extensions.

2.4.2 Abstractive Models: BART, T5, and Beyond

Following BERT, models like BART (Bidirectional and Auto-Regressive Transformers) and T5 (Text-To-Text Transfer Transformer) provided end-to-end abstractive summarization capabilities, capable of paraphrasing and condensing content with near-human fluency. These models employ encoder-decoder architectures to generate text from scratch rather than selecting from the original.

Despite their high performance, such models require significant memory and compute—making
them more suitable for server-side processing than real-time, client-side summarization.

2.5 Meta-Review and Evolutionary Perspectives

Haopeng Zhang (2024) conducted a recent meta-review, tracing the progression of summa-
rization research from rule-based heuristics to hybrid and transformer-driven architectures [16].
The review highlights a growing preference for semantically rich summarization and hybrid
systems that balance extractiveness and abstraction. It also emphasizes the need for domain-
specific summarization solutions—such as education, healthcare, and online safety—where con-
text and ethical safeguards are paramount.

2.6 Conclusion

While numerous tools exist for summarizing transcripts—such as AutoSub for subtitle genera-
tion or Chrome extensions like Eightify—they primarily focus on summarization without any
integrated moderation or parental control. They also lack topic modeling, filtering logic, or
safety tagging based on the semantic content of videos.

The system proposed in this project fills that gap by combining:

• Extractive summarization via LSA for low-resource and interpretable summaries,

• Contextual keyword extraction via Transformer embeddings (e.g., BERT),

• Parental control logic that flags content based on topic and keyword patterns, and

• Modular extensibility that accommodates future Transformer models or domain-specific classifiers.

This hybrid, safety-focused approach directly responds to issues like Elsagate and the growing
need for parental involvement in content moderation. By bridging fast summarization and
ethical screening, the system provides not just accessibility but also responsibility in content
consumption.

2.6.1 Literature Review

Table 2.1: Summary of Related Work on Text Summarization Techniques

Hans Peter Luhn [4] (1958)
Method: Frequency-based extractive summarization.
Findings: Statistical techniques could provide meaningful summaries without semantic understanding of a document.
Limitations: Lacked semantic understanding; favored common words and could not generate abstract summaries.

Harold P. Edmundson [5] (1969)
Method: Heuristic-based extractive summarization.
Findings: Cue words, sentence position, and keyphrase identification improved sentence-importance determination.
Limitations: Struggled with domain-specific texts; lacked deep semantic modeling.

Scott Craig Deerwester [6] (1990)
Method: Singular Value Decomposition (SVD) to reduce the dimensionality of a term-document matrix.
Findings: Introduced Latent Semantic Analysis to discover latent structure in documents for IR and clustering.
Limitations: Computationally expensive due to SVD; ignored syntax and word-sense ambiguity.

Paul H. Kupiec [7] (1995)
Method: Probabilistic model for extractive summarization.
Findings: Showed that sentence position, length, and keyword density can effectively identify important sentences.
Limitations: Lacked semantic representation; could not capture deeper meanings.

Gong & Liu [8] (2001)
Method: TF-IDF, LSA, and sentence selection based on singular vector components.
Findings: LSA-based relevance measures can effectively capture latent topical structure for generic summarization.
Limitations: Limited to extractive summaries; computationally intensive for large datasets.

Chin-Yew Lin [9] (2003)
Method: Development of ROUGE as a recall-based metric for summarization evaluation.
Findings: ROUGE improved benchmarking; favored high lexical overlap with reference summaries.
Limitations: Cannot assess semantic equivalence; limited for abstractive evaluations.

Alexander M. Rush [10] (2015)
Method: Neural sequence-to-sequence (Seq2Seq) model.
Findings: Applied neural MT models to summarization; showed promise for abstraction.
Limitations: Limited document-level understanding and factual grounding.

Krishan Kumar [11] (2016)
Method: Clustering-based summarization using K-means and equal partitioning.
Findings: Semantic clustering improved over frequency-based selection.
Limitations: Struggled with topic drift and polysemous sentences.

Abigail See [12] (2017)
Method: Pointer-generator networks with reinforcement learning.
Findings: Improved fluency and factual accuracy of abstractive summaries.
Limitations: Occasionally repetitive; could miss important details.

Beltagy et al. [13] (2020)
Method: Transformer architecture with a sparse attention mechanism for long documents.
Findings: Achieved state-of-the-art performance on long-document tasks with reduced memory complexity.
Limitations: Not designed for generative summarization tasks; requires retraining or fine-tuning for summarization.

Gonçalo Raposo [14] (2022)
Method: Chunked summarization with retrieval-enhanced Transformers.
Findings: Improved long-document summarization using retrieval-based context management.
Limitations: Inconsistencies across chunks; sometimes missed key content.

Jacob Devlin et al. [15] (2018)
Method: Pre-trained BERT model for contextual language representation.
Findings: Enabled deep bidirectional understanding of language; useful in keyword extraction and sentence ranking.
Limitations: High memory/computation cost; not optimized for summarization out-of-the-box.

Haopeng Zhang [16] (2024)
Method: Meta-review of extractive, abstractive, hybrid, and transformer-based summarization models.
Findings: Traced the evolution of summarization; highlighted the shift to semantically rich and hybrid models.
Limitations: Limited real-world evaluation; abstractiveness benchmarks not comprehensive.
Chapter 3

Methodology

The development of an intelligent summarization system for YouTube video transcripts within a
browser-based environment requires a deep integration of Natural Language Processing (NLP)
techniques, lightweight modeling strategies, and modular design principles. This chapter details
the methodology used in building the Chrome extension, which transforms raw transcript data
into structured, human-readable summaries.

The summarization pipeline combines classical extractive summarization techniques like TF-
IDF and Latent Semantic Analysis (LSA) with modern filtering approaches based on cosine
similarity, and optionally explores transformer-based models (BERT/BART) for abstractive
summarization. Emphasis is placed on balancing performance with computational efficiency to
ensure seamless operation within the resource constraints of a browser extension.

3.1 Existing Methodologies

One of the earliest attempts at text summarization was proposed by Luhn in 1958. His method
utilized word and phrase frequency analysis to identify key content in technical papers, laying


the groundwork for modern summarization techniques.

Luhn’s 1958 paper, ‘The Automatic Creation of Literature Abstracts’, dealt with using
frequency analysis to identify significant words and sentences in a document. It proposed
that words occurring more frequently in a text (excluding stopwords) are likely central to its
content. Sentences were scored based on their concentration of these key terms, and high-scoring
sentences were extracted to form summaries [4]. This approach provided a systematic and
computational method to create abstracts, serving as a cornerstone for modern summarization
techniques in NLP. As technological capabilities expanded, Machine Learning and NLP gained
prominence, leading to increasingly sophisticated summarization methods.

While foundational, Luhn’s method lacked semantic understanding and struggled with com-
plex sentence structures—a limitation modern techniques aim to overcome. Automatic text
summarization addresses the challenge of limited time for comprehensively understanding vast
amounts of material, mitigating the risk of overlooking critical information.

3.2 Proposed Methodology

The proposed system is initiated when a user provides a YouTube video link through the
frontend interface. Upon receiving the link, the backend extracts the video’s unique identifier
and retrieves the associated transcript using the YouTube API or related caption-extraction
techniques. Once the transcript is obtained, it undergoes a cleaning and preprocessing phase
to remove noise such as timestamps and formatting artifacts.

The cleaned transcript is then analyzed for appropriateness using a two-stage kid-safe filtering
mechanism: a keyword-based scan and a semantic similarity check using sentence embeddings.
If the content passes the safety checks and meets the minimum length requirements, it proceeds
to the summarization phase.

Depending on the length and complexity of the transcript, the system employs either a direct
abstractive summarization using a pretrained transformer model or a smart chunking strategy to
handle longer inputs. In cases where the transcript is too long for abstractive summarization,
an extractive approach based on TF-IDF and Latent Semantic Analysis (LSA) is used as a
fallback.

Finally, the generated summary is returned to the frontend and displayed to the user in a clean
and readable format. This end-to-end pipeline is designed to operate automatically, ensuring
efficient, safe, and contextually relevant summarization of YouTube video content.

3.2.0.1 System Pipeline Overview

The summarization system follows a modular, decision-driven pipeline for extracting and filter-
ing YouTube video transcripts. Figure 3.1 illustrates the end-to-end architecture of the system.

• Transcript Extraction: The backend invokes the YouTube Data API to retrieve the
transcript when a video is opened.

• Text Preprocessing: This includes noise removal, sentence splitting, and tokenization
to clean the transcript for downstream processing.

• Kids Safety Filter: Applied immediately post-preprocessing, this filter combines keyword-
based screening and cosine similarity filtering using pre-embedded phrases.

• Model Selection Logic: Based on transcript length and API availability, the system
branches to one of three summarization paths:

– Latent Semantic Analysis (LSA): For longer transcripts where BART or API
usage is constrained.

– BART Summarization: With token-aware chunking and optional recursive summarization.

– LLM-Based Summarization: Optional path using large-scale transformer APIs like ChatGPT or DeepSeek.

• Postprocessing and Output: All models return final summaries along with status
flags, which are then displayed via a Chrome extension interface.

Figure 3.1: System Pipeline Overview: From YouTube Transcript Extraction to Frontend
Summary Display
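
The decision logic of Figure 3.1 can be condensed into a small dispatch function. The following is a minimal sketch, not the project's actual code: the filter, tokenizer, and summarizer callables are hypothetical stand-ins for the modules detailed in Sections 3.6 to 3.9, and the token budget is illustrative.

```python
from typing import Callable, Optional

Summarizer = Callable[[str], str]

def summarize_transcript(
    transcript: str,
    is_safe: Callable[[str], bool],      # kids safety filter (Sec. 3.8)
    count_tokens: Callable[[str], int],  # e.g., a BART tokenizer length function
    lsa: Summarizer,                     # extractive path (Sec. 3.9.1)
    bart: Summarizer,                    # chunked abstractive path (Sec. 3.9.2)
    llm: Optional[Summarizer] = None,    # optional API path (Sec. 3.9.3)
    bart_budget: int = 4096,             # illustrative threshold, not a tuned value
) -> dict:
    """Safety gate first, then branch to one of the three summarization paths."""
    if not is_safe(transcript):
        return {"status": "flagged", "summary": None}
    if llm is not None:                  # prefer the LLM when an API is available
        return {"status": "ok", "summary": llm(transcript)}
    if count_tokens(transcript) <= bart_budget:
        return {"status": "ok", "summary": bart(transcript)}
    return {"status": "ok", "summary": lsa(transcript)}  # extractive fallback
```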

3.3 Advantages of Proposed Methodology

The proposed summarization pipeline integrates classical extractive techniques with modern
transformer-based abstractive models under a modular design optimized for browser deploy-
ment. This hybrid approach allows efficient transcript analysis, summary generation, and safety
filtering, all within the constraints of a lightweight Chrome extension.

• Hybrid Architecture: Combines extractive (TF-IDF + LSA) and abstractive (BART with chunking) methods to balance speed, interpretability, and semantic richness.

• Smart Chunking for Abstractive Summarization: BART is used with token-aware chunking strategies to overcome input length limitations, inspired by approaches such as Gong and Liu’s sliding window [8] and overlapping chunk attention; a minimal chunking sketch follows this list.

• Content Moderation via Cosine Similarity: Safety filtering includes both keyword-
based checks and semantic similarity comparison against a list of embedded banned
phrases. This reduces the likelihood of inappropriate or unsafe content appearing in
summaries.

• Scalable and Language-Agnostic: LSA and cosine-based filters are unsupervised and
work independently of grammar, syntax, or labeled training data, supporting scalability
and multilingual adaptability.

• Modular and API-Compatible: The architecture supports external API fallbacks for
advanced summarization engines (e.g., OpenAI, DeepSeek), enabling future enhancements
with minimal disruption.
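
As referenced in the chunking bullet above, token-aware chunking can be sketched with the Hugging Face tokenizer for BART. This is a minimal sketch assuming sentence-split input; the 900-token budget and one-sentence overlap are illustrative, not the project's tuned values.

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

def chunk_by_tokens(sentences: list[str], max_tokens: int = 900,
                    overlap: int = 1) -> list[str]:
    """Group sentences into chunks under BART's 1024-token input limit,
    carrying the last `overlap` sentences forward for cross-chunk context."""
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for sent in sentences:
        n = len(tokenizer.tokenize(sent))
        if current and length + n > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:]  # repeat the tail for continuity
            length = sum(len(tokenizer.tokenize(s)) for s in current)
        current.append(sent)
        length += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```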

3.4 Disadvantages and Limitations

Despite its efficiency and modularity, the current approach has the following limitations:

• Loss of Coherence in Extractive Summarization: LSA-based summaries may include disjointed or out-of-context sentences due to lack of deep language modeling.

• Chunking-Induced Discontinuity: Abstractive summarization using BART can result in discontinuous or repetitive summaries across chunks if overlap is insufficient or sentence splitting is misaligned.

• Resource Constraints for Local Models: Transformer-based summarizers like BART are computationally intensive and not yet feasible for pure in-browser inference, necessitating server-side or API usage.

• Semantic Filter Tradeoffs: While cosine similarity allows context-aware safety filtering,
it can lead to false positives (flagging benign phrases) or false negatives (if embeddings
don’t sufficiently capture nuance).

3.5 Comparison Between LSA, BART & LLMs

Feature | LSA (Extractive) | BART + Chunking (Abstractive) | LLMs (API-based)
Summary Style | Extracts sentences verbatim | Generates new text chunks | Fully generated summary
Coherence | Low–Moderate | Moderate | High
Resource Requirements | Minimal | High | High
Deployment Feasibility | In-browser | Server only | API only
Interpretability | High (score-based selection) | Moderate | Low (black-box)
Content Safety Integration | Easy with cosine filter | Embedded check per chunk | Post-processing required

Table 3.1: Comparison of Summarization Strategies Used in the Proposed System



3.6 Transcript Extraction Method (from YouTube Transcripts)

To initiate the summarization pipeline, the system uses the yt-dlp Python package to extract
subtitle data directly from YouTube videos. This method provides a more robust alternative
to frontend-dependent APIs, supporting both manually uploaded and auto-generated captions
in multiple formats (e.g., TTML, VTT, SRT).

Figure 3.2: Example terminal log showing successful transcript extraction. The system logs
each request, identifies the video ID, fetches the transcript using internal API calls, and reports
the cleaned word count as well as the token count.

The extraction process includes the following steps:

• Video ID Resolution: The YouTube video ID is extracted from the user-provided URL
and passed to the downloader.

• Subtitle Discovery: yt-dlp is configured to skip media download and instead fetch
caption metadata. The system checks both:

– Manual Captions (Preferred): Higher accuracy and better segmentation.

– Auto-Generated Captions (Fallback): Available in most videos but less reliable.

• Subtitle URL Retrieval: Once a suitable caption track is identified (via the metadata
JSON), the corresponding subtitle file URL is fetched.

• Transcript Downloading and Parsing: The subtitle file is retrieved using an HTTP
request. The transcript is parsed from its raw JSON format (typically TTML) by iterating
over event segments and extracting UTF-8 encoded tokens.

• Plain Text Conversion: All extracted tokens are concatenated into a plain-text tran-
script, with newline and formatting characters removed.

Additional fallback and validation logic includes:

• Graceful handling of missing captions (e.g., private videos, disabled subtitles).

• Language filtering (default: English; future-proofed for multilingual support).

• Error detection (e.g., HTTP failures, malformed caption files, quota limits).

This method offers greater flexibility and reliability compared to frontend-based libraries like
YouTubeTranscriptAPI, allowing full access to caption metadata and supporting low-level pars-
ing of subtitles in multiple formats.
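
A condensed sketch of this flow using yt-dlp's Python API is shown below. It assumes the caption track is available in YouTube's json3 format (a list of timed events, each holding UTF-8 text segments); production code would add the fallback and error handling described above.

```python
import requests
import yt_dlp

def fetch_transcript(video_url: str, lang: str = "en") -> str:
    """Fetch a caption track via yt-dlp and flatten it to plain text."""
    with yt_dlp.YoutubeDL({"skip_download": True, "quiet": True}) as ydl:
        info = ydl.extract_info(video_url, download=False)

    # Prefer manual captions; fall back to auto-generated ones.
    tracks = (info.get("subtitles", {}).get(lang)
              or info.get("automatic_captions", {}).get(lang) or [])
    track = next((t for t in tracks if t.get("ext") == "json3"), None)
    if track is None:
        raise RuntimeError("No usable caption track found")

    data = requests.get(track["url"], timeout=30).json()
    # json3 captions list timed 'events', each with 'segs' of UTF-8 text pieces.
    pieces = [seg["utf8"] for ev in data.get("events", [])
              for seg in ev.get("segs", []) if seg.get("utf8")]
    return " ".join("".join(pieces).split())  # strip newlines, collapse whitespace
```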

3.7 Preprocessing the Text

Once the transcript is extracted from YouTube, it undergoes several preprocessing steps to clean
and normalize the data before summarization. This phase is critical for removing transcription
noise and ensuring consistency across various downstream modules, including TF-IDF, LSA,
and sentence embedding comparison for content filtering.

In the current implementation, transcript text is first assembled from raw subtitle data retrieved
in TTML or JSON format using yt-dlp. Captions are parsed into individual segments, with
each segment’s textual content concatenated to form the full transcript.

The preprocessing pipeline includes:

• Lowercasing: All transcript text is converted to lowercase to maintain case consistency and avoid treating "Video" and "video" as distinct tokens.

• Noise Removal: Time-stamps, symbols (e.g., “[Music]”, “–>”), and formatting artifacts
are stripped during raw segment processing. This cleaning stage ensures only spoken
content is retained.

• Sentence Splitting: Heuristic-based segmentation is applied to concatenate adjacent subtitle lines and split them into longer, coherent sentences. This step improves sentence granularity and summary coherence.

• Tokenization: Text is tokenized into word-level and sentence-level units using standard
NLP tools such as NLTK or spaCy, preparing it for vectorization and summarization.

• Stopword Removal: Common English stopwords are excluded to reduce noise in the
term frequency matrix, improving the quality of latent semantic representations.

• Lemmatization (Optional): Depending on the summarization method, words may be lemmatized to their base forms using rule-based or model-based taggers (e.g., spaCy or WordNetLemmatizer). This step helps unify variations like “running” and “ran” into “run”.

The output of this stage is a cleaned, normalized text corpus that is passed downstream to
either extractive summarization models like LSA or transformer-based summarizers such as
BART, depending on transcript length and system availability.
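
The pipeline above can be sketched with NLTK as follows. This is a minimal version: the noise-removal patterns are illustrative, and the actual implementation may use spaCy or different heuristics.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Tokenizer and lexical resources (punkt_tab is needed on newer NLTK releases).
for pkg in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(raw_transcript: str):
    """Clean a raw transcript and return (sentences, per-sentence token lists)."""
    text = raw_transcript.lower()                        # case consistency
    text = re.sub(r"\[(?:music|applause)\]", " ", text)  # illustrative noise tags
    text = re.sub(r"\s+", " ", text).strip()             # collapse whitespace
    sentences = nltk.sent_tokenize(text)                 # sentence splitting
    tokens = [
        [LEMMATIZER.lemmatize(w) for w in nltk.word_tokenize(s)
         if w.isalpha() and w not in STOPWORDS]          # stopword removal
        for s in sentences
    ]
    return sentences, tokens
```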

3.8 Parental Control and Kids Safety Filtering

To ensure the summarized content remains suitable for children and sensitive audiences, the
system incorporates a dedicated Kids Safety Filter. This module is applied to the transcript
before summarization, filtering out sentences that contain explicit, violent, or age-inappropriate
material based on two complementary strategies: keyword-based detection and semantic simi-
larity filtering.

3.8.1 Keyword-Based Filtering

A predefined list of hardcoded terms and phrases (e.g., explicit words, references to violence,
hate speech) is used to flag and optionally remove sentences that directly contain inappropriate
content. This fast and interpretable rule-based filter acts as the first line of defense.

• Input text is scanned sentence by sentence.

• If a sentence contains any word from the blocked list, it is marked as flagged and excluded
from downstream summarization.

• A flagging report is optionally returned for user visibility or alerting mechanisms.
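
A minimal sketch of this rule-based pass is given below; the blocked-word set is a tiny illustrative placeholder for the project's curated keyword database.

```python
import re

BLOCKED_WORDS = {"gore", "gunfight", "explicit"}  # illustrative placeholder list

def keyword_filter(sentences: list[str], blocked: set[str] = BLOCKED_WORDS):
    """Split sentences into (kept, flagged) based on exact word matches."""
    kept, flagged = [], []
    for sentence in sentences:
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        (flagged if words & blocked else kept).append(sentence)
    return kept, flagged
```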

3.8.2 Semantic Filtering via Embedding Similarity

To capture implicit or euphemistic references not caught by direct keyword matches, the system
uses a second semantic filtering stage based on sentence embeddings.

Figure 3.3: Terminal Message: Cosine Similarity

• Each sentence in the transcript is converted into a high-dimensional vector using a sen-
tence embedding model (e.g., all-MiniLM-L6-v2 from SentenceTransformers).

• The cosine similarity is computed between each sentence vector and a curated list of
pre-embedded offensive phrases.

• Sentences whose similarity exceeds a predefined threshold (e.g., 0.75) are flagged as po-
tentially unsafe and the video is flagged.

3.8.2.1 Cosine Similarity Filtering

Cosine similarity is used to measure the semantic similarity between sentence vectors in the
latent topic space derived from TF-IDF or LSA. It quantifies the cosine of the angle between
two non-zero vectors in a multi-dimensional space, offering a normalized measure of similarity.

cosine(A, B) = (A · B) / (∥A∥ ∥B∥)    (3.1)

where A and B are the sentence embedding vectors, · denotes the dot product, and ∥A∥ is the
Euclidean norm of vector A.

A predefined threshold (e.g., 0.8) is used during both:

• Redundancy Removal: If two sentences are highly similar, only one is retained.

• Parental Filtering: Sentences too similar to flagged examples are removed from candi-
date outputs.

Figure 3.4: Cosine Similarity showing two vectors with similarities close to 1, close to 0, and
close to -1.

This dual use of cosine similarity helps the system maintain high semantic precision and safety
without compromising performance, making it suitable for browser-based summarization.
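
This semantic stage can be sketched with the SentenceTransformers library, assuming the all-MiniLM-L6-v2 model and the 0.75 threshold mentioned above; the banned-phrase list here is illustrative, not the project's curated set.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
BANNED_PHRASES = ["graphic violence", "adult content"]  # illustrative examples
banned_emb = model.encode(BANNED_PHRASES, convert_to_tensor=True)

def semantic_flags(sentences: list[str], threshold: float = 0.75) -> list[bool]:
    """Flag each sentence whose best match against any banned phrase
    exceeds the cosine-similarity threshold (Eq. 3.1)."""
    sent_emb = model.encode(sentences, convert_to_tensor=True)
    sims = util.cos_sim(sent_emb, banned_emb)   # shape: (n_sentences, n_banned)
    return (sims.max(dim=1).values > threshold).tolist()
```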

3.8.3 Modularity and Control

The Kids Safety Filter operates as a standalone preprocessing module. It can be toggled
based on user preference (e.g., parental control mode), and all filtering decisions are logged for
auditability. This modular structure ensures the core summarization logic remains unaffected,
while still supporting strong content safety guarantees.

3.8.4 Future Improvements

Planned enhancements include:

• Incorporation of multilingual filtering using multilingual embeddings.

• Integration with real-time abuse detection APIs for dynamic updates.

• Adaptive thresholds based on context (e.g., educational vs entertainment videos).

3.9 Summarization Models

This section describes the models used to generate summaries from YouTube video transcripts.
The system employs a hybrid approach, using both extractive and abstractive summarization
techniques while optionally offloading heavier abstractive summarization to cloud-based APIs.
A parental control filtering step is applied after each summarization process to ensure the
generated content is safe for children.

3.9.1 Latent Semantic Analysis (LSA) – Extractive Summarization

Latent Semantic Analysis (LSA) is an unsupervised extractive technique used to identify and
retain the most semantically significant sentences from a document. In the context of this
system, LSA is applied to preprocessed YouTube transcripts to generate concise, informative
summaries. It works by reducing the dimensionality of a term-sentence matrix via Singular
Value Decomposition (SVD), uncovering latent semantic structures in the data.

3.9.1.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents. It combines two metrics:

• Term Frequency (TF): Measures how frequently a term occurs in a document, normal-
ized to account for document length.

• Inverse Document Frequency (IDF): Measures how rare a term is across a corpus,
reducing the influence of commonly occurring words.

TF-IDF helps distinguish important content words from generic ones (e.g., "the", "is"). In this
pipeline, TF-IDF is applied at the sentence level, treating each sentence as a "document" and
the transcript as the corpus. This enables construction of a weighted representation of term
importance within each sentence.

Term Frequency:
The term frequency is calculated based on the formula:

TF(t, d) = f(t, d) / N_d    (3.2)

Where:

• f(t, d) = number of times the term t appears in document d.

• N_d = total number of terms in document d.

This normalizes the term count so that longer documents don’t dominate due to their size.

Inverse Document Frequency (IDF):
The inverse document frequency is calculated using the formula:

IDF(t, D) = log(N / n_t)    (3.3)

Where:

• N = Total number of documents in the corpus.

• n_t = number of documents containing the term t.

The logarithmic scale reduces the impact of very common words, like "the" or "and," which
occur in nearly every document.

TF-IDF Score: The TF-IDF score for a term t in a document d is calculated as:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)    (3.4)

This combines the term’s frequency in a document with its rarity across the entire corpus.
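
Equations (3.2)–(3.4) can be checked with a few lines of Python on a toy corpus (the sentences below are illustrative, not project data):

```python
import math

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

def tf(term, doc):            # Eq. (3.2): f(t, d) / N_d
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):          # Eq. (3.3): log(N / n_t)
    n_t = sum(term in d.split() for d in docs)
    return math.log(len(docs) / n_t)

def tf_idf(term, doc, docs):  # Eq. (3.4): TF x IDF
    return tf(term, doc) * idf(term, docs)

# "cat" appears once among 6 words, and in 2 of the 3 documents:
print(tf_idf("cat", corpus[0], corpus))  # (1/6) * log(3/2) ~ 0.0676
```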

Figure 3.5: Converting Text To TF-IDF Matrix

The implementation of TF-IDF involves several steps. First, preprocessing is applied to the
text, including tokenization, removal of stop words (e.g., "the," "is," "and"), and stemming
or lemmatization to reduce words to their root forms. Term frequencies are then calculated
for each word in a document, followed by computing the inverse document frequency for the entire corpus [14]. Finally, the TF-IDF score for each term is calculated by combining these two measures, enabling the identification of the most significant terms within a document.

TF-IDF has several practical applications. In text summarization, it helps identify key terms
that are used to extract meaningful sentences from a document. Search engines rely on TF-IDF to rank documents based on the relevance of search queries. It is also widely used for
keyword extraction, content categorization, and even recommender systems to identify and
suggest relevant content.

Despite its simplicity and computational efficiency, TF-IDF has limitations. It treats words as
independent units, ignoring their context and relationships, which is a disadvantage for tasks
requiring semantic understanding. Additionally, it does not account for the sequence of words,
which can be crucial in certain text analysis tasks. While IDF discounts common terms, it may
overly penalize very rare terms that could still be important. Furthermore, for large corpora,
calculating IDF values can become computationally expensive, posing scalability challenges.

In conclusion, while TF-IDF is a relatively simple algorithm compared to modern natural language processing techniques, it remains a foundational tool for text analysis. Its balance of interpretability, efficiency, and utility makes it indispensable for many text mining and retrieval applications, even as more advanced models continue to emerge.

3.9.1.2 Matrix Construction

The TF-IDF values are arranged into a term-sentence matrix A, where:

• Each row corresponds to a unique term (word) in the transcript.

• Each column corresponds to a sentence.

• Each entry Aij represents the TF-IDF weight of term i in sentence j.

This matrix encodes semantic information, capturing the relevance of each term to each sentence. The resulting matrix is typically sparse and high-dimensional.
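As a minimal sketch of this step, the matrix can be built with scikit-learn (the sample sentences are illustrative, and scikit-learn's IDF is a smoothed variant of Eq. (3.3)):

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The solar system has eight planets.",
    "Planets orbit the sun in elliptical paths.",
    "The sun is a star at the center of the solar system.",
]

# Each sentence is treated as a "document"; stop words are removed during
# preprocessing, as described earlier.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)  # sentence-term matrix (sparse)

# Transpose to obtain the term-sentence matrix A (rows = terms, columns = sentences).
A = X.T
print(A.shape)  # (number of unique terms, 3)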

Figure 3.6: Term-Sentence TF-IDF Matrix



3.9.1.3 Singular Value Decomposition (SVD)

The matrix A is factorized using Singular Value Decomposition (SVD):

\[ A = U \cdot \Sigma \cdot V^{T} \tag{3.5} \]

where:

• $U \in \mathbb{R}^{m \times k}$: Term-topic matrix.

• $\Sigma \in \mathbb{R}^{k \times k}$: Diagonal matrix of singular values (topic strength).

• $V^{T} \in \mathbb{R}^{k \times n}$: Topic-sentence matrix.

Figure 3.7: Truncated Singular Value Decomposition (SVD) applied to matrix G. Only the top r singular values and their corresponding left and right singular vectors are retained to form a low-rank approximation $\tilde{U}\tilde{\Sigma}\tilde{V}^{T}$. This process captures the most significant semantic structure while discarding noise.

To better understand this process, Figure 3.7 illustrates the truncated SVD decomposition.
Only the top r singular values (and their corresponding vectors) are retained to form a low-
rank approximation of the original matrix. This reduction step is critical in LSA as it discards
less important dimensions (i.e., noise or minor topics), focusing the model on dominant semantic
concepts.

The decomposition reduces noise and captures the core semantic structure of the text. By
retaining only the top k singular values (typically 1–3 for short documents), we focus on the
most significant semantic concepts.
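Continuing the previous sketch, the decomposition and truncation can be performed with NumPy (k = 2 here purely for illustration):

import numpy as np

A_dense = A.toarray()  # term-sentence TF-IDF matrix from the previous sketch

# Full SVD, then keep only the top-k singular values and vectors.
U, s, Vt = np.linalg.svd(A_dense, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Low-rank approximation capturing the dominant semantic concepts.
A_k = U_k @ np.diag(s_k) @ Vt_k
print(Vt_k.shape)  # (k topics, number of sentences)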

3.9.1.4 Sentence Scoring

Each sentence is scored based on its relevance to the dominant latent topics:

• The matrix $V^{T}$ encodes how strongly each sentence relates to each topic.

• Sentence importance is calculated by computing the Euclidean norm (or weighted magnitude) across the top k components of each column in $V^{T}$.

• Higher scores indicate a sentence that aligns closely with the main semantic themes of the document.

3.9.1.5 Sentence Selection

The top n scoring sentences are selected as the summary. The parameter n is either fixed
(e.g., 3 sentences) or computed dynamically as a proportion (e.g., 30%) of the total number
of sentences. The selected sentences are arranged in their original order to preserve narrative
flow.
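A minimal sketch of the scoring and selection steps, continuing from the factors computed above (the summary length n is illustrative):

import numpy as np

# Weight each topic row of Vt_k by its singular value, then score each
# sentence (column) by the Euclidean norm over the top-k components.
scores = np.sqrt(((s_k[:, None] * Vt_k) ** 2).sum(axis=0))

n = 2  # fixed summary length; alternatively a proportion of the sentence count
top = np.argsort(scores)[::-1][:n]             # indices of top-scoring sentences
summary = [sentences[i] for i in sorted(top)]  # restore original transcript order
print(" ".join(summary))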

3.9.1.6 Post-Processing

To improve readability and user experience, the following steps are applied:

• Redundancy Filtering: Sentences with high cosine similarity (above a fixed threshold) are discarded to reduce duplication; a sketch of this step appears at the end of this subsection.

• Sentence Reordering: Sentences are reordered according to their original position in the transcript to ensure coherence.

• Parental Control Filtering: The final summary is passed through a content filter to
remove or flag inappropriate material.

This process yields a compact, informative, and safe summary suitable for both children and
adults.
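The redundancy-filtering step can be sketched as follows; the threshold of 0.8 is illustrative rather than the system's tuned value, and the function would run on the TF-IDF rows of the selected sentences before reordering:

from sklearn.metrics.pairwise import cosine_similarity

def drop_redundant(sentence_vectors, selected, threshold=0.8):
    # sentence_vectors: sentence-term TF-IDF matrix (rows = sentences);
    # selected: candidate sentence indices in original order.
    kept = []
    for i in selected:
        # Keep a sentence only if it is not too similar to any kept sentence.
        if all(cosine_similarity(sentence_vectors[i], sentence_vectors[j])[0, 0] < threshold
               for j in kept):
            kept.append(i)
    return kept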

3.9.2 BART – Abstractive Summarization

To enable fluent and abstractive summaries, the system utilizes DistilBART-CNN, a distilled
variant of Facebook’s BART model that is fine-tuned on the CNN/DailyMail dataset. This
model is accessed through HuggingFace’s pipeline API, enabling easy integration and fast
inference.

The BART summarization module is designed to handle large transcripts that may exceed the
model’s input token limit (commonly 1024 tokens). To address this, a custom chunking and
summarization pipeline is implemented.

3.9.2.1 Token-Aware Chunking

If the transcript exceeds the token limit, it is processed using a sentence-based, tokenizer-aware chunking mechanism (a sketch follows this list):

• Sentence Grouping: The preprocessed transcript is divided into semantically coherent sentences.

• Token Estimation: Each sentence is tokenized using the same tokenizer as the BART
model (i.e., facebook/bart-large-cnn) to estimate the token length.

• Chunk Construction: Sentences are grouped into chunks of 800–900 tokens to stay
well within the model’s upper limit.

• Adaptive Chunk Count: The number of chunks dynamically depends on the total
length of the transcript, ensuring coverage without truncation.
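A minimal sketch of this chunking mechanism, assuming sentences have already been split (the 850-token budget sits inside the 800–900 range noted above):

from transformers import AutoTokenizer

# Same tokenizer family as the summarization model, as described above.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

def chunk_sentences(sentences, max_tokens=850):
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n_tok = len(tokenizer.encode(sent, add_special_tokens=False))
        if current and current_len + n_tok > max_tokens:
            chunks.append(" ".join(current))  # close the current chunk
            current, current_len = [], 0
        current.append(sent)
        current_len += n_tok
    if current:
        chunks.append(" ".join(current))
    return chunks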

3.9.2.2 Chunk-Level Summarization and Merging

Each chunk is then independently passed through the BART summarizer:

• Individual Chunk Summarization: Each chunk undergoes abstractive summarization via the BART pipeline.

• Summary Merging: The resulting chunk-level summaries are concatenated into a coherent intermediate summary.

• Recursive Summarization: If the merged summary is still too lengthy for frontend
display, it may undergo a second round of summarization through BART to improve
conciseness.

3.9.2.3 Fallback Mechanism

If any step in the BART summarization pipeline fails—due to memory issues, API errors, or
tokenization mismatches—the system gracefully falls back to the Latent Semantic Analysis
(LSA) based extractive summarizer. This ensures robust performance under varying runtime
constraints.
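A sketch combining chunk-level summarization, merging, recursion, and the LSA fallback; it reuses the tokenizer from the chunking sketch above, lsa_summarize stands in for the extractive summarizer of Section 3.9.1 (name assumed for illustration), and the generation lengths are illustrative:

from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def summarize_chunks(chunks):
    try:
        # Summarize each chunk independently, then merge.
        partials = [
            summarizer(c, max_length=150, min_length=40, do_sample=False)[0]["summary_text"]
            for c in chunks
        ]
        merged = " ".join(partials)
        # Recursive pass if the merged summary is still too long for display.
        if len(tokenizer.encode(merged)) > 850:
            merged = summarizer(merged, max_length=150, min_length=40)[0]["summary_text"]
        return merged
    except Exception:
        # Graceful degradation to the extractive path (Section 3.9.2.3).
        return lsa_summarize(" ".join(chunks))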

3.9.2.4 Advantages of BART-Based Summarization

• Produces highly fluent, human-like summaries.

• Handles longer transcripts via intelligent chunking.

• Maintains abstraction and compression not possible with extractive methods.

• Resilient to failure via LSA fallback.

This architecture ensures scalability across video lengths and quality expectations, and integrates seamlessly with the frontend for delivering structured summaries.

3.9.3 Large Language Models (LLMs)

In addition to the default DistilBART-based summarization, the system also integrates general-purpose large language models (LLMs) such as GPT-4, Claude, and Gemini via secure API calls through chutes.ai. These LLMs support both few-shot and zero-shot summarization, offering superior abstraction, coherence, and linguistic fluency.

LLM-Based Summarization Features

• High-Level Abstraction: LLMs are capable of generating concise, rephrased summaries that go beyond surface-level extraction, often capturing context, tone, and subtle nuances.

• Task Adaptivity: Through prompt engineering, the summary output can be tailored by
style (e.g., explanatory, conversational) or format (e.g., bulleted list, paragraph, Q&A).

• Multilingual and Safety Features: Most LLMs include built-in safety filters and
multilingual understanding, making them robust for diverse audiences.

LLMs serve as an optional but powerful alternative to traditional summarization techniques when API availability and latency constraints allow. This hybrid architecture improves coverage and quality across a wide range of video content.
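As a sketch only: the snippet below assumes chutes.ai exposes an OpenAI-compatible chat-completions endpoint; the URL, model identifier, and key handling are placeholders, not the system's actual configuration.

import os
import requests

API_URL = "https://llm.chutes.ai/v1/chat/completions"  # placeholder endpoint

def llm_summarize(transcript, style="bulleted list"):
    # Prompt engineering tailors the output style, as described above.
    prompt = (
        f"Summarize this video transcript as a {style}. "
        f"Keep it concise and child-safe:\n\n{transcript}"
    )
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['CHUTES_API_KEY']}"},
        json={
            "model": "gpt-4",  # placeholder model identifier
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]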

3.10 Flask Backend

The backend is built using Flask and serves as the interface between the frontend and the core
processing components. It exposes modular, RESTful API endpoints that manage transcript
retrieval, summarization (both extractive and abstractive), and kid-safe filtering.

The major endpoints include:

• GET /transcript
Accepts a YouTube video URL and returns the cleaned transcript text using a dedicated caption extraction module. The transcript is preprocessed to remove noise (e.g., timestamps, filler text).

• GET /summary
Accepts a video URL and produces a summary using a multi-stage decision pipeline:

– The transcript is first validated for length and content safety using both keyword-based and embedding-based checks.

– If deemed appropriate, the system applies a DistilBART-based abstractive summarization model, with fallback to extractive summarization (TF-IDF + SVD) in case of failure or overload.

– For longer transcripts, the summarization is handled in token-aware chunks, and recursive summarization is used if the output exceeds the model's context window.

• GET /health
A lightweight endpoint used to monitor backend health and confirm if the summarization
model is loaded successfully.

Internally, the backend also uses a semantic filtering module based on cosine similarity between
transcript embeddings (generated via all-MiniLM-L6-v2) and a sensitive phrase bank. This
enables deeper context-aware filtering beyond basic keyword matching.
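A minimal sketch of this semantic filter; the phrase bank and threshold here are illustrative stand-ins for the curated values used by the system:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

SENSITIVE_PHRASES = ["graphic violence", "explicit language", "drug use"]
phrase_embs = model.encode(SENSITIVE_PHRASES, convert_to_tensor=True)

def is_flagged(text, threshold=0.5):
    # Flag text whose embedding is close to any sensitive phrase.
    emb = model.encode(text, convert_to_tensor=True)
    return util.cos_sim(emb, phrase_embs).max().item() >= threshold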

The Flask server is CORS-enabled to support direct communication with the Chrome extension
and other frontends.

The backend handles all preprocessing, LSA computation, keyword extraction, and content
filtering, returning results in structured JSON format to be consumed by the frontend.
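A skeletal sketch of this endpoint layer; handler names such as get_transcript and summarize_text are assumed for illustration and stand in for the modules described above:

from flask import Flask, jsonify, request
from flask_cors import CORS

app = Flask("TranscriptApp")
CORS(app)  # allow the Chrome extension to call the API directly

@app.route("/transcript", methods=["GET"])
def transcript():
    url = request.args.get("url", "")
    return jsonify({"transcript": get_transcript(url)})

@app.route("/summary", methods=["GET"])
def summary():
    text = get_transcript(request.args.get("url", ""))
    if is_flagged(text):  # keyword + embedding safety checks
        return jsonify({"error": "content flagged as unsafe"}), 403
    return jsonify({"summary": summarize_text(text)})

@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(port=5000, debug=True)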

3.10.1 Flask Development Server Execution

The backend Flask server, titled TranscriptApp, is responsible for summarization, content
filtering, and communication with the Chrome extension. Upon running, it outputs a startup
log confirming the successful initialization of core components:

Figure 3.8: Flask development server log showing initialization of TranscriptApp

Note: As shown in Figure 3.8, this is a development server only. For production deployment,
a WSGI server such as gunicorn or uWSGI should be used.

3.11 Chrome Extension Architecture

The Chrome Extension serves as the client-side interface for kid-safe filtering of YouTube videos.
It is structured around three primary components: the content script, background service
worker, and the extension manifest.

3.11.1 Content Script (content.js)

The content script is injected into YouTube pages and is responsible for real-time interaction
with the video player and DOM. Its core functionalities include:

• Video Blocking: Initially pauses or mutes videos until they are explicitly approved.
This is enforced via a periodic check on the <video> element and event listeners for play
attempts.

• Visual Overlay: Displays a full-screen blocker with the extension’s branding and safety
warning. This ensures users are aware that the content requires safety verification.

• Storage-Based Approval: Checks local storage (via chrome.storage.local) for video approval status using the video ID as a key (approved_<videoId>).

• SPA Navigation Handling: Uses both a MutationObserver and a chrome.webNavigation event to detect in-site navigation changes (as YouTube is a single-page application).

• Message Handling: Responds to runtime messages (e.g., unblockVideo, getVideoStatus) from the extension popup or background worker.

3.11.2 Background Script (background.js)

The background service worker performs higher-level extension management and cross-tab coordination:

• Script Injection: Ensures that the content script is loaded into YouTube tabs upon
navigation or extension installation.

• Navigation Events: Monitors URL changes through the chrome.webNavigation API to trigger re-verification of video safety.

• Flask Server Communication: Sends health-check requests to the local Flask backend
to determine server availability.

• Storage Maintenance: Implements a cleanup routine via chrome.alarms, retaining only the most recent 100 approved videos to reduce local storage bloat.

• Global Messaging: Handles global commands such as retrieving all approved video IDs
or clearing them from storage.

3.11.3 Extension Manifest (manifest.json)

The manifest defines the extension’s configuration and capabilities:

• Permissions: Includes tabs, scripting, storage, and YouTube URL host permissions
for full interaction.

• Content Script Registration: Specifies that content.js should run on all YouTube
URLs at document_start.

• Service Worker: Registers background.js as a persistent event listener.

• Extension UI: Defines icons, a popup interface (popup.html), and default titles.

3.11.4 Extension Workflow Summary

1. When a YouTube page is loaded, the content script runs and checks if the video is approved.

2. If unapproved, a blocking overlay is shown and video playback is disabled.

3. The user can interact with the popup to request a safety check (via Flask backend).

4. Once approved, the overlay is removed and video playback is restored.

5. The approval is cached locally, and cleanup is performed periodically.


Chapter 4

Experimental Results

4.1 Experimental Results

Although no formal benchmark evaluation was conducted, a series of functional tests were
carried out to verify the system’s effectiveness in real-world usage. Summarization outputs
were visually inspected, and the system was tested across a range of YouTube videos varying
in length, topic, and language clarity.

• Summarization Output: The BART-based abstractive summarizer produced coherent and concise summaries, while the LSA-based extractive summarizer reliably retained key sentences from the transcript. Example outputs are shown in Figures ?? and ??.

• Transcript Handling: Long transcripts were successfully chunked and processed in parts, with summaries from each chunk merged. Recursive summarization was applied when intermediate results were too verbose.


• Parental Control Filtering: The system accurately flagged or removed content containing inappropriate language, as seen in Figure ??, using both keyword-based and embedding-based filtering methods.

• End-to-End Usability: From transcript extraction to summary display in the browser extension, the full pipeline executed without crashes or critical bugs. The UI provided a seamless user experience for summary generation.

Screenshots from the Chrome extension and backend output console illustrate successful summarization and filtering in real scenarios.

Figure 4.1: The Terminal At The End (Fullscreen)

Figure 4.2: Opening YouTube Video



Figure 4.3: After Summary

Figure 4.4: Summary Through API


Chapter 5

Conclusion

5.1 Backend Implementation Using NLP Libraries

The backend of the YouTube Transcript Summarizer was implemented in Python, leveraging a combination of classical and modern Natural Language Processing (NLP) techniques. Libraries such as SpaCy, NLTK, and scikit-learn were used for text preprocessing and TF-IDF matrix construction. Latent Semantic Analysis (LSA), powered by Singular Value Decomposition (SVD), serves as the primary extractive summarization engine due to its computational efficiency and semantic consistency in resource-constrained environments.

To improve output quality, cosine similarity filtering is applied to eliminate redundancy and enhance the diversity of selected sentences. Additionally, the backend supports optional abstractive summarization using Hugging Face's Transformers library (e.g., BART), enabling fluent and context-aware summaries when required.


5.2 YouTube Transcript Extraction

The system integrates the YouTubeTranscriptAPI to programmatically extract transcripts from public YouTube videos. This enables real-time retrieval of caption data, which serves as the foundation for summarization and content filtering. The integration is resilient to variations in caption quality and supports both manually uploaded and auto-generated subtitles.
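A minimal sketch of the retrieval call, using the library's classic get_transcript interface (the video ID is illustrative):

from youtube_transcript_api import YouTubeTranscriptApi

# Fetch caption segments for a public video and join them into plain text.
segments = YouTubeTranscriptApi.get_transcript("dQw4w9WgXcQ", languages=["en"])
text = " ".join(seg["text"] for seg in segments)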

5.3 Deployment via Flask and Chrome Extension

Deployment is managed through a lightweight Flask backend that exposes several RESTful API
endpoints:

• /summarize — returns an LSA-based summary of the transcript

• /keywords — returns extracted tags and keywords from the transcript

• /filter — flags inappropriate content using a combination of keyword matching and semantic vector similarity

The user interface is delivered through a Chrome Extension that enables users to input YouTube video links and receive summaries and tags directly within the browser. This modular architecture ensures cross-platform compatibility, ease of use, and fast response times.

5.4 Planned Enhancements

Although the current implementation offers robust summarization and filtering, several improvements are envisioned for future development:

• Robust Error Handling: Enhanced handling of invalid URLs, private videos, livestreams,
and API failures.

• Timestamp-Based Summaries: Introduce summaries segmented by video timestamps to support timeline-based navigation.

• Speech-to-Text Integration: Integrate automatic speech recognition (ASR) models to support videos without existing subtitles.

• Multilingual Support: Expand summarization and filtering capabilities to non-English transcripts.

• User Customization: Enable control over summary length, topic tag density, and content sensitivity levels.

These enhancements aim to further improve usability, personalization, and accessibility, particularly for parents, educators, and users seeking efficient content consumption.
Appendix A

Appendix

Attach source code

Bibliography

[1] Statista, Hours of video uploaded to YouTube every minute, 2024. [Online]. Available: https://www.statista.com/statistics/259477/hours-of-video-uploaded-to-youtube-every-minute/ [Accessed: Dec. 2024].

[2] Sandra L. Calvert, Children as consumers: Advertising and marketing, The Future of
Children, pp. 205–234, 2008. JSTOR.

[3] D. DiPlacido, YouTube's 'Elsagate' Illuminates The Unintended Horrors Of The Digital Age, Forbes, Nov. 28, 2017. [Online]. Available: https://www.forbes.com/sites/danidiplacido/2017/11/28/youtubes-elsagate-illuminates-the-unintended-horrors-of-the-digital-age/ [Accessed: Dec. 2024].

[4] Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research
and Development, 2(2), 159–165.

[5] Edmundson, H. P. (1969). New methods in automatic extracting. Journal of the ACM
(JACM), 16(2), 264–285.

[6] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990).
Indexing by latent semantic analysis. Journal of the American Society for Information
Science, 41(6), 391–407.


[7] Kupiec, J., Pedersen, J., & Chen, F. (1995). A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 68–73). ACM.

[8] Yihong Gong and Xin Liu. Generic text summarization using relevance measure and latent
semantic analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval, pages 19–25. ACM, 2001.

[9] Lin, C. Y. (2004, July). ROUGE: A package for automatic evaluation of summaries. In
Text Summarization Branches Out (pp. 74–81).

[10] Rush, A. (2015). A neural attention model for abstractive sentence summarization. arXiv
preprint arXiv:1509.00685.

[11] Kumar, K., Shrimankar, D. D., & Singh, N. (2016, November). Equal partition based clus-
tering approach for event summarization in videos. In 2016 12th International Conference
on Signal-Image Technology & Internet-Based Systems (SITIS) (pp. 119–126). IEEE.

[12] See, A., Liu, P. J., & Manning, C. D. (2017). Get to the point: Summarization with
pointer-generator networks. arXiv preprint arXiv:1704.04368.

[13] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document
transformer. arXiv preprint arXiv:2004.05150, 2020.

[14] Raposo, G., Raposo, A., & Carmo, A. S. (2022). Document-Level Abstractive Summariza-
tion. arXiv preprint arXiv:2212.03013.

[15] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies (NAACL-HLT), pp. 4171–4186.

[16] Zhang, H., Yu, P. S., & Zhang, J. (2024). A Systematic Survey of Text Summarization:
From Statistical Methods to Large Language Models. arXiv preprint arXiv:2406.11289.

[17] Rajaraman, A., & Ullman, J. D. (2011). Mining of Massive Datasets. Cambridge University Press.

[18] Beel, J., Gipp, B., Langer, S., & Breitinger, C. (2016). Paper recommender systems: a
literature survey. International Journal on Digital Libraries, 17, 305–338.

[19] Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application
in retrieval. Journal of Documentation, 28(1), 11–21.

[20] Sam Shleifer et al., DistilBART-CNN-12-6: A distilled BART model for abstractive summarization, 2021. Available at: https://huggingface.co/sshleifer/distilbart-cnn-12-6 (accessed June 9, 2025).
