
AUDIO TO TEXT SUMMARIZER

A PROJECT REPORT
Submitted by

AJAY
(P19DU23S126012)
ARYA P
(P19DU23S126036)

Under the guidance of


Ms. BHARGAVI K,
Assistant Professor
Department of Computer Applications

in partial fulfillment for the award of the degree


of

MASTER OF COMPUTER APPLICATIONS


of

BENGALURU NORTH UNIVERSITY

DEPARTMENT OF COMPUTER APPLICATIONS


KRUPANIDHI COLLEGE OF MANAGEMENT
BENGALURU, KARNATAKA

2023-2025
KRUPANIDHI COLLEGE OF MANAGEMENT
BENGALURU -560035

Department of Computer Applications

JUNE 2025

BONAFIDE CERTIFICATE

This is to certify that the project entitled “AUDIO TO TEXT SUMMARIZER” is the bonafide record

of project work done by AJAY bearing Reg No. P19DU23S126012 and ARYA P bearing Reg

No.P19DU23S126036 and is being submitted in partial fulfilment for the award of the Master’s Degree

in Computer Applications of Bengaluru North University.

Project Guide Principal

Submitted for the Project Viva-Voce examination held on

Internal Examiner External Examiner


KRUPANIDHI COLLEGE OF MANAGEMENT
BENGALURU -560035

Department of Computer Applications

JUNE 2025

DECLARATION BY THE CANDIDATE

I hereby declare that “AUDIO TO TEXT SUMMARIZER” is the result of the project

work carried out by me under the guidance of Ms. BHARGAVI K, Assistant Professor

in partial fulfillment for the award of Master’s Degree in Computer Applications by

Bengaluru North University.

I also declare that this project is the outcome of my own efforts and that it has not been

submitted to any other university or Institute for the award of any other degree or Diploma

or Certificate.

Signature of the Candidate Signature of the Candidate

AJAY ARYA P
P19DU23S126012 P19DU23S126036

I certify that the declaration made above by the candidate is true

Signature of the Guide

Ms. BHARGAVI K
Assistant Professor
ACKNOWLEDGEMENT

I owe my deepest gratitude to God Almighty for blessing me all the way through the
successful completion of this work.

I take this opportunity to express my heartfelt thanks to Dr. SURESH NAGPAL,
Chairman of Krupanidhi Group of Institutions, Bengaluru, and Dr.
C.J. Rajendra Prasad, Principal, Krupanidhi College of Management, for their valuable
advice and encouragement in carrying out this project.

My special thanks to my faculty guide, Ms. BHARGAVI K, Assistant Professor, Department
of Computer Applications, who guided me with full support in completing this project
through her constant encouragement and suggestions.

I thank all the other faculty members of the Department of Computer Applications for their
continuous support in carrying out this project.

I offer my humble and sincere thanks to my beloved parents and friends, who are a never-
ending source of inspiration to me.

AJAY
(P19DU23S126012)

ARYA P
(P19DU23S126036)
ABSTRACT

In the era of digital transformation, vast amounts of audio data are generated daily,
including lectures, meetings, podcasts, and interviews. Manually transcribing and
summarizing such content is time-consuming and inefficient. This project aims to
develop an automated system that converts spoken content into text and generates
concise summaries while preserving essential information.

The system utilizes Automatic Speech Recognition (ASR) to transcribe audio into
text and Natural Language Processing (NLP) techniques for summarization. Deep
learning models, such as Wav2Vec for speech-to-text conversion and transformer-
based models like T5 for text summarization, are integrated to enhance accuracy
and efficiency.

The project finds applications in various domains, including education, journalism,


business meetings, and accessibility for the hearing-impaired. By reducing manual
effort and improving information retrieval, this system enhances productivity and
ensures quick access to key insights from lengthy audio recordings.

Keywords: Speech-to-Text, Transformer Architecture, ASR, NLP, Wav2Vec,
T5, Text Summarization.
LIST OF FIGURES

1. Fig 3.3.1 System Architecture of Transformer Model
2. Fig 3.4.1 Level-0 DFD
3. Fig 3.4.2 Level-1 DFD
4. Fig 4.1 Architecture of Wav2Vec2
5. Fig 4.2 Architecture of T5
6. Fig 6.1 Transcribed Text
7. Fig 6.2 Summarized Text
8. Fig 6.3 Accuracy and F1 Score
9. Fig 6.4 Graphical Representation of Confusion Matrix


LIST OF ABBREVIATIONS

1. AI: Artificial Intelligence
2. ASR: Automatic Speech Recognition
3. NLP: Natural Language Processing
4. TTS: Text-to-Speech
5. STT: Speech-to-Text
6. DFD: Data Flow Diagram
7. MT: Machine Translation
8. F1 Score: Harmonic Mean of Precision and Recall
TABLE OF CONTENTS

1. INTRODUCTION
1.1 Background
1.2 Problem Statement
1.3 Objective of the Study
1.4 Scope of the Study
1.5 Limitations of the Study

2. LITERATURE REVIEW
2.1 Overview of Existing Literature
2.2 Comparison with Existing System

3. SYSTEM ANALYSIS AND DESIGN
3.1 Requirement Analysis
3.2 System Specification
3.3 System Architecture
3.4 Data Flow Diagram

4. METHODOLOGY

5. IMPLEMENTATION AND TESTING
5.1 Code Snippets
5.2 Types of Testing
5.3 Test Cases

6. RESULTS AND DISCUSSION
6.1 Analysis of Results
6.2 Discussion

7. CONCLUSION AND FUTURE WORK
7.1 Summary
7.2 Recommendations
7.3 Future Enhancements

BIBLIOGRAPHY

SAMPLE CODING

CHAPTER 1

INTRODUCTION

1.1 Background

Every day, lectures, meetings, podcasts, and many other activities produce
large volumes of audio data. Extracting pertinent information from this data
takes a lot of effort and time, so efficient transcription and summarization
techniques are required. Thanks to developments in AI and NLP, these
procedures can now be automated, making them quicker and more precise.

The introduction of transformer-based models has unquestionably revolutionized
the field, offering high accuracy, scalability, and efficiency for both ASR and
text summarization. Traditional systems built on statistical and rule-based
approaches handled linguistic subtleties and contextual understanding poorly
in comparison with modern techniques. Models such as Wav2Vec for ASR and
T5-based transformers for summarization were among the first to apply deep
learning and self-attention to improve transcription and summarization quality.

This project takes these improvements a step further by proposing a system
that combines Wav2Vec2-based English speech-to-text transcription with
transformer-based summarization, with the aim of creating accurate and
succinct summaries. The system is also designed to handle lengthy audio files.

Deep Learning: Deep learning is one of the main branches of machine learning
and draws inspiration from the workings of the human brain. It uses
multi-layered artificial neural networks (hence the word "deep") to
automatically extract complex representations and patterns from massive
datasets. Unlike other machine learning techniques, which need hand-engineered
features and domain knowledge, deep learning models can learn directly from
raw inputs such as text, audio, or images.

Over the past few years, advancements in computer vision, speech recognition,
natural language processing (NLP), and real-time translation software have all
been fueled by deep learning. This has made it possible to automate tasks that
previously demanded a great deal of accuracy and human labor.

For the purposes of this project, deep learning powers both speech-to-text
conversion and summarization. By identifying intricate audio patterns, the
Wav2Vec2 model, which is built on a deep neural architecture, effectively
converts speech to text. Similarly, by capturing linguistic structure and
contextual meaning, the T5 transformer-based model summarizes the transcribed
text. Together, they eliminate the need for manual transcription and
summarization and create a system that is quicker, more scalable, and
domain-adaptable. This reflects the current trend of integrating deep learning
into offline, real-time, end-to-end automated products for accessibility,
education, and communication.

1.2 Problem Statement

In today's digital world, a wide variety of audio content, including lectures,
podcasts, meetings, and interviews, is recorded every day. Processing this
material is extremely time-consuming, because extraction, transcription, and
summarization are often only possible by hand.

Traditional techniques are not only inefficient but also struggle with context
preservation and faithful paraphrasing during transcription and summarization;
their limited accuracy and inability to retain context make information loss
almost inevitable.

Striking the correct balance between automation, efficiency, and accuracy when
turning captured speech into text and summaries is difficult. Current
approaches, governed by statistical models or rule-based methodologies, cannot
reliably capture contextual meaning, and they degrade further in the presence
of background noise and subtle variations in voice patterns.

This project seeks to address these issues by combining ASR techniques such as
Wav2Vec with summarization transformers such as T5. Using advanced deep-learning
NLP models enables a significant reduction in verbal content without sacrificing
essential information, paving the way for fast and reliable content processing.
Timely extraction of key points through precise analysis can then benefit
journalism, boardroom meetings, educational systems, and accessibility for
people with disabilities, making this a modern, practical solution.

1.3 Objective of the Study

The primary goal of this research is to develop an automated speech
summarization tool that generates intelligent summaries from verbatim
transcriptions of audio content. In a variety of fields, including education,
business, journalism, and assistive services, this tool will improve information
retrieval, lessen the need for human labor, and improve usability.

The specific objectives are:

1. To Create a Functional Speech-to-Text Interface

Use ASR (Automatic Speech Recognition) systems based on the Wav2Vec


model to convert audio files to text while accounting for background noise,
accents, and speech variations.

2. To Create a Reliable Text Summarization Model

Create a brief but accurate summary of the given text using the transformer-
based natural language processing (NLP) model T5.

3. To Improve Contextual Meaning Capture

Preserve the original meaning and the speaker's intention in long audio recordings
so that meaningful insights can be extracted from the audio data.


4. To Increase Productivity for Practical Uses

The system should be designed to work easily with many types of audio, including
lectures, meetings, interviews, and podcasts, so that it can be used across a
variety of industries.

5. To Enhance Usability

Offer a usable solution for teachers, hearing-impaired users, and the
professionals and assistants who work with them.

1.4 Scope of the Study

The goal of this study is to build and implement an automated audio-to-text
summarization system that reliably captures and summarizes speech data. The
study covers a wide range of subjects, including deep learning-based
summarization, natural language processing (NLP), and speech recognition.

1. Speech-to-Text Conversion: Speech is efficiently converted to
text using Automatic Speech Recognition (ASR) models such as Wav2Vec, while
handling differences in speech patterns, accents, and a reasonable level of background
noise.

2. Text Summarization: Concise and pertinent summaries of the transcribed text
are produced using transformer-based natural language processing (NLP)
models (T5), minimizing redundancy while preserving important information.

3. Supported Audio Inputs and Sources: Live speech transcription and pre-
recorded audio files (such as WAV and MP3), covering transcription of podcasts,
lectures, meetings, and interviews.

4. Performance Evaluation: Word Error Rate (WER) is used to measure
transcription accuracy, while ROUGE and BLEU metrics are used to measure the
quality of summaries (a short evaluation sketch follows this list).

5. Application Domains: In education, lectures are turned into concise synopses
for students. In business and corporate settings, meetings are transcribed and
summarized. In journalism and the media, interviews and reports are summarized.
Assistive technologies give people with hearing impairments access to spoken
information.
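
To illustrate how such an evaluation can be scripted, the following sketch computes WER
and ROUGE for a pair of example strings. It is illustrative only and assumes the
third-party jiwer and rouge_score packages, which are not among the libraries listed
later in this report; the example texts are hypothetical.

# Hypothetical evaluation sketch: WER for the ASR output, ROUGE for the summary.
# Assumes: pip install jiwer rouge_score (illustrative packages, not project dependencies).
from jiwer import wer
from rouge_score import rouge_scorer

reference_transcript = "peep peer and glimpse are all verbs of sight"
predicted_transcript = "peep pier and glimpse are all verbs of sight"

# Word Error Rate: substitutions, insertions and deletions per reference word.
print("WER:", wer(reference_transcript, predicted_transcript))

reference_summary = "Peep, peer and glimpse are verbs of sight used in different situations."
predicted_summary = "Peep, peer and glimpse are all verbs of sight."

# ROUGE-1 and ROUGE-L measure unigram and longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference_summary, predicted_summary))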

1.5 Limitations of the Study

The performance and usability of the audio-to-text summarization system


may be impacted by certain research limitations, despite advancements in
Automatic Speech Recognition (ASR) and Natural Language Processing
(NLP). They are:
1. Language Restrictions: The system may not function well for
multilingual or highly accented speech because it is primarily designed for
English audio. Expanding support to more languages would require additional
training and model refinement.

2. Accuracy Problems in Noisy Environments: Background noise, speech


overlap, and poor audio quality can all negatively impact speech-to-text
accuracy, which in turn can lead to transcription and summarization errors.

3. Restricted Real-Time Summarization Capability:

Although real-time transcription is possible, real-time summarization requires
extra processing time, which makes it difficult to serve applications that need
summaries instantly.

4. Relying on Pre-Trained Models:

The technology is based on pre-trained models like Wav2Vec for ASR and
transformer models like T5 for summarization. These models may not be
optimized for a particular domain (e.g., legal or medical jargon) without
further fine-tuning.

5. Limitations in Processing Long Audio Files:

Very long audio files (e.g., multi-hour lectures or meetings) require
segmentation methods to allow efficient processing and
summarization.

6. Computational and Hardware Requirements:

Deep learning models require high computational power, hence are resource-
hungry for real-time processing on low-end devices.

7. Ethical and Privacy Issues:

Handling sensitive or confidential audio content raises privacy issues. Secure


handling of data and compliance with legal frameworks (like GDPR) is
imperative.


CHAPTER 2

LITERATURE REVIEW

2.1 Overview of Existing Literature

1. Haoran Li, Junnan Zhu, Cong Ma, Jiajun Zhang, Chengqing Zong IEEE
Transactions on Knowledge and Data Engineering 31 (5), 996-1009, 2018

Automatic summarization of text is a basic natural language processing (NLP) task


that tries to reduce a source document to its essence by creating a shorter version.
The recent growth in multimedia data transmission via the Internet requires multi-
modal summarization (MMS) from asynchronous collections of text, image, audio,
and video. In this paper, we introduce an extractive MMS approach that integrates
the methods of NLP, speech processing, and computer vision to tap into the
abundant information hidden in multi-modal data and enhance the quality of
multimedia news summarization. The fundamental idea is to fill in the semantic gaps
in multimodal content. In the video, audio and visual modalities predominate. We
suggest a technique for selectively using the transcription of audio data and
determining the transcription's salience based on audio signals. We use a neural
network to learn the joint representations of text and images for visual information.
Next, we use multi-modal topic modeling or text-image matching to capture the
coverage of the generated summary for important visual information. Finally,
through the budgeted optimization of submodular functions, all the multi-modal
factors are considered to produce a textual summary by optimizing salience, non-
redundancy, readability, and coverage. We also present a publicly available English
and Chinese MMS corpus. Our experimental results on our dataset show that our
image matching and image topic framework-based methods surpass other state-of-
the-art competitive baseline methods. The suggested framework exhibits resilience
in a variety of media formats and languages. To further improve multi-modal
summarization, future research will concentrate on adding temporal alignment and
video captioning.


2. Nitin B Raut, AS Pranesh, B Nagulan, S Pranesh, R Vasantharajan 2023 3rd


International Conference on Innovative Mechanisms for Industry Applications
(ICIMIA), 1251-1257, 2023

This literature review examines evolving trends in audio-to-text and text


summarization techniques applied to YouTube video content. With the rapid
expansion of online video platforms, the demand for efficient transcription and
summarization of multimedia data has grown significantly. This study offers a
comprehensive analysis of current methodologies, algorithms, and tools for
converting audio content to text, highlighting the intricacies of speech
recognition technologies and their applicability in various contexts.
Additionally, the survey explores text summarization techniques specifically
designed for YouTube videos, including both extractive and abstractive
methods. It assesses their effectiveness in condensing lengthy video transcripts
into concise and coherent textual representations. Special emphasis is given to
NLP (Natural Language Processing) algorithms and ML (Machine Learning)
models that aid in extracting essential information while maintaining
contextual relevance.

3. Aneesh Vartakavi, Amanmeet Garg arXiv preprint arXiv:2009.10315, 2020

The diverse nature, scale, and specificity of podcasts present a unique


challenge to content discovery systems. Listeners often rely on text
descriptions of episodes provided by the podcast creators to discover new
content. Some factors like the presentation style of the narrator and production
quality are significant indicators of subjective user preference but are difficult
to quantify and not reflected in the text descriptions provided by the podcast
creators. We propose the automated creation of podcast audio summaries to
aid in content discovery and help listeners to quickly preview podcast content
before investing time in listening to an entire episode. In this paper, we present
a method to automatically construct a podcast summary via guidance from the
text-domain. Our method performs two key steps, namely, audio to text
transcription and text summary generation. Motivated by a lack of datasets for
this task, we curate an internal dataset, find an effective scheme for data


augmentation, and design a protocol to gather summaries from annotators. We


fine-tune a PreSumm[10] model with our augmented dataset and perform an
ablation study. Our method achieves ROUGE-F(1/2/L) scores of
0.63/0.53/0.63 on our dataset.

4. Georgios Evangelopoulos, Athanasia Zlatintsi, Georgios Skoumas,


Konstantinos Rapantzikos, Alexandros Potamianos, Petros Maragos, Yannis
Avrithis, 2009 IEEE International Conference on Acoustics, Speech and Signal
Processing, 3553-3556, 2009

Detection of perceptually significant video events is formulated here on the basis of
saliency models of the textual, visual, and audio information in a video
stream. Cues measuring multifrequency waveform modulations extracted by
nonlinear operators and energy tracking are used to assess audio saliency. An
intensity, color, and motion-driven spatiotemporal attention model is used to
estimate visual saliency. Part-of-speech tagging on the subtitles that most
movie distributors provide is the source of text saliency. The occurrence of an
event is marked in one or more domains, and the various modality curves are
combined into a single attention curve. A bottom-up video summarization
algorithm that improves on the outcomes of unimodal or audiovisual-based
skimming is based on this multimodal saliency curve. The algorithm performs
well for video summarization in terms of informativeness and enjoyability.

5. Karan Pandita, Purab Kulranjan Singh Thakur, Suresh Annamalai
Proceedings of the 5th International Conference on Information
Management & Machine Intelligence, 1-9, 2023

The field of Natural Language Processing (NLP) has revolutionized the way
human language interacts with computer systems. NLP applications span
machine translation, information extraction, summarization, and question
answering, driven by vast computational resources and big data
methodologies. Despite these advancements, NLP tools haven't fully
integrated with Internet of Things (IoT) devices, like audio recorders,


hindering their accessibility and usability. This paper introduces an innovative


solution: a method for audio transcription and contextual summarization using
NLP, addressing this gap and enhancing comprehension. Our approach
employs cutting-edge NLP techniques, including word embedding methods
and knowledge-based graphs, to create a system that efficiently converts audio
content into written text and generates coherent summaries. Unlike existing AI
tools, our system's summaries are not only accurate but also rich and deep,
providing insightful representations of the original content. This depth is
achieved through advanced linguistic analysis, surpassing tools like ChatGPT.
Furthermore, our system breaks language barriers, enabling multilingual data
traversal, enhancing accessibility on a global scale. Our research methodology
ensures the system's adherence to industry standards like Request for
Comments (RFC) and Constrained Application Protocol (CoAP), guaranteeing
interoperability and reliability.
By incorporating knowledge-based graphs, our system comprehensively
understands audio content, enhancing the accuracy of summarization. This
approach addresses the unmet need for seamlessly integrating NLP with IoT
devices, making the technology accessible to a broader audience.

6. Oleh Basystiuk, Natalya Shakhovska, Violetta Bilynska, Oleksij Syvokon,


Oleksii Shamuratov, Volodymyr Kuchkovskiy IT&AS, 1-8, 2021

The paper describes the possibilities provided by open APIs and how
to use them to create unified interfaces based on recurrent neural
networks. In the last decade, AI technologies have become widespread and easy to
implement and use. One of the most promising technologies in the AI field is
speech recognition as part of natural language processing. New speech
recognition technologies and methods will become a central part of future life
because they save a lot of communication time, replacing common texting with
voice/audio. In addition, this paper explores the advantages and disadvantages
of well-known chatbots and proposes a method for their improvement. A
sequence-to-sequence algorithm based on recurrent neural networks is used.
The time complexity of the proposed algorithm is compared with that of the existing one.


Scientific novelty of the obtained results is the method for converting audio
signals into text based on a sequential ensemble of recurrent encoding and
decoding networks. The practical significance is the modified existing chatbot
system for converting audio signals into text.

7. Kazumasa Yamamoto et al. (2023)

Devised a system for automated speech recognition, translation, and


summarization of TED English lectures, with the objective of improving
understanding for Japanese-speaking individuals. The system comprises
three fundamental elements: DNN-HMM for speech recognition, a
Transformer model for translation, and BERT-based summarization
(BertSumExt) for the extraction of salient sentences. In the domain of speech
recognition, the investigators attained a word accuracy of approximately 88%
through the utilization of a combination of TED and LibriSpeech audio
corpora.
The Transformer-based translation system exhibited diminished efficacy with
respect to speech inputs relative to text, yielding BLEU scores that were
approximately 14% lower when processing recognized speech due to errors
in recognition. Nonetheless, speech summarization demonstrated resilience to
such inaccuracies, as the results of significant sentence extraction closely
mirrored those of the original text summarization. This methodology offers
effective instruments for subtitling English lectures accompanied by Japanese
translations, emphasizing readability while ensuring robust summarization
even amidst recognition challenges.

8. Zhara Nabila et al. (2022)

Undertook the development of a translation application leveraging the Google


Translate API, intended to enhance accessibility and facilitate understanding
of languages across multiple media formats. This application is designed to
extract audio or textual content from YouTube videos and subsequently
translate it into multiple languages, employing the Google Translate and Text-
to-Speech (TTS) libraries. The system was engineered in Python and
incorporates deep learning, machine translation, and natural language processing


(NLP) elements to ensure the provision of precise outcomes. The primary
objective of the translation tool is to aid users in engaging with content in
foreign languages, particularly emphasizing educational applications, public
outreach, and assistance for individuals with disabilities. During the
evaluation phase, the application successfully translated 90.38% of videos into
both text and audio formats, achieving a synchronization accuracy ranging
from 89% to 97% between the produced text and audio outputs. The main
limitations identified were instances where videos were subject to restricted
public access, thereby obstructing translation efforts.

9. Adhika Pramita Widyassari, Supriadi Rustad, Guruh Fajar Shidik, Edi


Noersasongko, Abdul Syukur, Affandy Affandy, De Rosal Ignatius Moses
Setiadi
Journal of King Saud University-Computer and Information Sciences 34 (4),
1029-1046, 2022

Text summarization automatically produces a summary containing important


sentences and includes all relevant important information from the original
document. The main approaches, when viewed from the summary
results, are extractive and abstractive. Extractive summarization is heading
towards maturity, and research has now shifted towards abstractive summarization
and real-time summarization.

Although there have been so many achievements in the acquisition of datasets,


methods, and techniques published, there are not many papers that can provide
a broad picture of the current state of research in this field. This paper provides
a broad and systematic review of research in the field of text summarization
published from 2008 to 2019. There are 85 journal and conference publications
which are the results of the extraction of selected studies for identification and
analysis to describe research topics/trends, datasets, preprocessing, features,
techniques, methods, evaluations, and problems in this field of research. The
results of the analysis provide an in-depth explanation of the topics/trends that
are the focus of their research in the field of text summarization; provide


references to public datasets, preprocessing and features that have been used;
describes the techniques and methods that are often used by researchers as a
comparison and means for developing methods. At the end of this paper,
several recommendations for opportunities and challenges related to text
summarization research are mentioned.

2.2 Comparison with Existing System

To implement the audio-to-text summarizer, Wav2Vec2 was used for speech-


to-text and T5 (Text-To-Text Transfer Transformer) for summarization.
Accuracy, efficiency, suitability for offline use, and modularity were the
criteria used to choose both models. Wav2Vec2 + T5 has a few key advantages
over other models currently in use.

Facebook AI created the self-supervised learning-based Wav2Vec2 model,


which uses less labeled data and has good accuracy, particularly when it comes
to English speech. Because it is open-source and lightweight, it is perfect for
low-resource and offline settings. Although models such as Google Speech-
to-Text API offer great real-time transcription, they are cloud-based and
eventually charge users. Similar to Wav2Vec2, Mozilla DeepSpeech is open-
source but performs worse and is less accurate. T5 is unique for abstractive
summarization because of its text-to-text architecture, which makes it simple
to adapt to a variety of NLP tasks, including summarization. It comes pre-
trained to fit into mid-range hardware and is completely open-source. It
provides a higher performance-to-resource ratio than Google's Pegasus and
Facebook's BART. BART is heavier and slower, but it has great abstractive
summarization. Although Pegasus produces outputs that are class-leading, its
model size and hardware requirements are much larger.

A fully offline, free, and fully customizable speech recognition and


summarization pipeline is provided by the combination of Wav2Vec2 and T5.
It provides fine-grained control, is cloud-free, and makes model replacement
or update simple. This makes it ideal for embedded systems, research,
education, and rural deployments where access to high-end GPUs and the
internet may be limited.

CHAPTER 3

SYSTEM ANALYSIS AND DESIGN

3.1 Requirement Analysis

The Audio-to-Text Summarizer system takes advantage of deep learning


models to convert spoken audio into text and create brief summaries of the text
versions. The following section presents the necessary hardware, software, and
model specifications for developing and deploying the system successfully.

3.2 System Specification

System specification defines the software and hardware requirements along


with the operational context necessary to implement and deploy the system
effectively. This specification serves as a blueprint that guides developers,
testers, and system administrators throughout the system’s development
lifecycle.

Hardware Specification

This part refers to the physical equipment your system needs to function
properly.

1. Processor (CPU)

• The CPU performs all the main calculations and runs your software.
• A multi-core processor (e.g., Intel i5 or better) is important to handle the
data processing and machine learning tasks efficiently.

2. RAM (Memory)

• RAM temporarily stores data the computer is actively using.


• More RAM helps with handling large datasets and running simulations
or ML models smoothly.
• Minimum: 8 GB is fine for small projects. Recommended: 16 GB or
more for faster performance.

3. Storage

• You need disk space to store:


o Datasets (which can be large),
o Trained models,
o Logs and results.
• Using an SSD (Solid State Drive) makes reading/writing data much
faster than a traditional hard drive.
• At least 256 GB SSD is suggested. More if your data is large.

4. Graphics Card (GPU)

• If you are using deep learning models (such as the transformer models in this
project), a GPU can significantly speed up training and inference.
• Although the models can run on the CPU, GPU acceleration, ideally NVIDIA GPUs
with CUDA support, greatly improves performance, especially during inference and
training of the models.

Software Specification

This includes the programs, libraries, and tools your system will use.

1. Operating System

• You can use Windows or Linux (Ubuntu recommended).

2. Programming Language

• Python is the primary language:


• Easy to learn
• Supports many libraries for ML, NLP and visualization,
• Widely used in academic and industrial projects.


3. IDE (Integrated Development Environment)

• Tools like Jupyter Notebook, VS Code, or Google Colab help you write
and test your code.
• Jupyter is great for step-by-step experiments.
• VS Code is good for full applications.

4. Python Libraries

These are packages that help you build your project:

torch: Core deep learning framework used to load and run the Wav2Vec2 and T5 models.
torchaudio: Loading and processing audio files (e.g., waveform extraction, resampling, transformations).
transformers: Provides access to pre-trained models like Wav2Vec2 and T5 from Hugging Face.
pandas: Handling and organizing transcription and summary data (e.g., storing results in DataFrames).
tarfile: Extracting datasets or models compressed in .tar archives.
librosa: Audio analysis and feature extraction (e.g., duration, sampling rate, visualization).
nltk: Natural Language Toolkit; used for postprocessing text, e.g., tokenization and stopword removal.
pytest: Writing and running test cases to ensure each module works correctly.

Python Libraries


Processor: Intel Core i5 or above, multi-core (Recommended: i7 or Ryzen 7 for faster training)
RAM: Minimum 8 GB (Recommended: 16 GB for larger datasets and simulations)
Storage: 256 GB SSD minimum; 512 GB recommended for model logs, datasets, and backups
Operating System: Linux (Ubuntu preferred) or Windows 10+
Programming Language: Python 3.8+
IDE: VS Code, Jupyter Notebook, or Google Colab
Libraries Used: torch, torchaudio, transformers, pandas, tarfile, librosa, nltk, pytest

Hardware and Software Specification

3.3 System Architecture

The figure illustrates the Transformer model, a foundational architecture in contemporary
deep learning, particularly for Natural Language Processing (NLP) applications
including text summarization, machine translation, and language modeling. It is
composed of two primary parts: the encoder (left) and the decoder (right).

The encoder receives the input sentence, feeds it through several stacked layers (represented
by Nx), and processes it with multi-head attention mechanisms and feed-forward
neural networks. Every input token is first embedded and combined with a positional
encoding to preserve word-order information, since the transformer itself does not
model sequence order. The attention mechanism enables the model to assign varying
levels of importance to the words in a sentence regardless of where they appear.

The decoder is structured similarly but includes an additional masked multi-head
attention layer that prevents the model from seeing future tokens during training (i.e., for
autoregressive generation). It also attends to the encoder outputs in order to
use the contextual information from the input sequence. The final outputs are
passed through a linear layer and a softmax to produce probability distributions over
the vocabulary for every position in the output sequence.


This architecture allows sequence parallelization, processes long-term dependencies


more efficiently than RNNs, and forms the backbone of models such as T5, BERT,
and GPT, all of which are central to state-of-the-art NLP systems.

Fig 3.3.1 System Architecture of Transformer model
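
As a small, generic illustration of the attention mechanism described above (not code
taken from this project), the scaled dot-product attention used inside each multi-head
attention block can be written in a few lines of PyTorch:

# Minimal sketch of scaled dot-product attention (illustrative only).
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: tensors of shape (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # token-to-token similarity
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # e.g., decoder look-ahead mask
    weights = torch.softmax(scores, dim=-1)              # attention distribution
    return weights @ v                                    # weighted sum of value vectors

# Example: one sentence of 4 tokens with 8-dimensional representations (self-attention).
x = torch.randn(1, 4, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([1, 4, 8])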

3.4 Data Flow Diagram

Fig 3.4.1 Level-0 DFD


An Audio-to-Text Summarizer pipeline is depicted in the diagram. Wav2Vec2, a


deep learning model made for automated speech recognition (ASR), begins by
processing an audio input and turning spoken language into text. The T5 (Text-To-
Text Transfer Transformer) model, a potent language model utilized for a variety
of NLP tasks, including summarization, is then given this transcribed text. Lastly,
a succinct synopsis of the original spoken content is produced by the T5 model.
This two-step process efficiently converts audio data into a succinct and insightful
textual synopsis.

Fig 3.4.2 Level-1 DFD

1. Audio Input
- Purpose: The system takes in voice input of spoken language, like recorded
speech, live speech, or a podcast.
- Formats: May be a microphone input, audio file (i.e., MP3, WAV), or streaming
audio.


2. Preprocessing
- Purpose: Cleans the audio and reduces noise to prepare it for the STT model
(a preprocessing sketch follows this list).
- Steps:
- Noise reduction (removing background sounds).
- Normalization (adjusting volume levels).
- Splitting audio into manageable chunks (e.g., by silence detection).

3. STT Model (Speech-to-Text)


- Purpose: Transcribes the processed audio into raw text.

- Tools: Employs AI models such as OpenAI's Whisper, Google Speech-


to-Text, or Mozilla DeepSpeech.
- Output: Raw text can contain errors (e.g., misheard words) or be unformatted.

4. Postprocessing
- Purpose: Fine-tunes the raw text for readability and accuracy.
- Steps:
- Correction of grammar/spelling errors.
- Elimination of filler words (e.g., "um," "ah").
- Punctuation, capitalization.

5. Cleaned Text
- Output: Refined, accurate transcript available for further analysis or storage.

6. Summarization Model
- Purpose: Produces a short summary of the cleaned text.
- Methods:
- Extractive: Chooses important sentences/phrases (e.g., with NLP libraries such
as spaCy).
- Abstractive: Rephrases content in condensed form (e.g., GPT, BERT).

7. Output/Save
- Final Deliverables:
- The summary (e.g., for ease of review, meeting minutes, or reports).
- Optionally, the cleaned transcript and audio may also be stored.
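
The preprocessing stage referred to above can be sketched as follows. This is an
illustrative example only; it assumes the librosa library listed in Chapter 3, and the
chunking threshold and file name are hypothetical.

# Illustrative preprocessing sketch: load, normalize, and split audio on silence.
# Assumes: pip install librosa soundfile; the thresholds below are hypothetical defaults.
import librosa

def preprocess_audio(path, target_sr=16000, top_db=30):
    # Load and resample to the 16 kHz mono format expected by Wav2Vec2-style models.
    y, sr = librosa.load(path, sr=target_sr, mono=True)

    # Peak-normalize the waveform so volume differences do not dominate.
    y = librosa.util.normalize(y)

    # Split on silence into manageable chunks (regions louder than top_db below the peak).
    intervals = librosa.effects.split(y, top_db=top_db)
    chunks = [y[start:end] for start, end in intervals]
    return chunks, sr

# Example usage with a hypothetical file name:
# chunks, sr = preprocess_audio("meeting.wav")
# print(len(chunks), "speech segments")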


CHAPTER 4

METHODOLOGY

Modules

1. Audio Input Module

Purpose:
This module takes raw audio data as input. The source could be a live
microphone input or a previously recorded audio file (such as .wav or .mp3).

Functions:
1. Accept audio from a directory or from the user.
2. Verify and adjust the sample rate and audio format.
3. Pass the audio waveform to the speech-to-text model.

Technologies Used:
torchaudio, librosa, pyaudio (for real-time microphone input)

2. Speech-to-Text (STT) Module

Purpose:
This module transforms the audio waveform into raw text using the Wav2Vec2
model; this is the automatic speech recognition (ASR) step.

Functions:
1. Tokenize the waveform so that the model can use it.
2. Decode the audio and create transcriptions.
3. Return the raw transcript for summarization.

Advantages:
1. Can process noisy audio.
2. No handcrafted audio features needed.


Technologies Used:
transformers, torch, facebook/wav2vec2-base-960h

Fig 4.1 Architecture of Wav2Vec2
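
A minimal single-file version of this module might look as follows. This is an
illustrative sketch using the checkpoint named above; the file name is hypothetical,
and the project's actual batch-processing code appears in Section 5.1.

# Illustrative sketch: transcribe one audio file with Wav2Vec2 (greedy CTC decoding).
import torch
import torchaudio
from transformers import AutoModelForCTC, AutoProcessor

MODEL_NAME = "facebook/wav2vec2-base-960h"
processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = AutoModelForCTC.from_pretrained(MODEL_NAME)

def transcribe(path):
    waveform, sr = torchaudio.load(path)                      # load audio file
    if sr != 16000:                                           # model expects 16 kHz audio
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    audio = waveform.mean(dim=0).numpy()                      # mix down to mono
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits            # frame-level character scores
    ids = torch.argmax(logits, dim=-1)                        # greedy CTC decoding
    return processor.batch_decode(ids)[0]

# Example usage (hypothetical file):
# print(transcribe("lecture.wav"))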

3. Text Summarization Module – T5

Purpose:
Using the T5 (Text-to-Text Transfer Transformer) model, this module takes
the entire transcribed text and provides a succinct, useful summary.

Functions:
1. Preprocess the transcription (e.g., add a prefix like "summarize:").
2. Generate a summary with the transformer model.
3. Return the final summary text.

Advantages:
1. Flexible and powerful for various NLP tasks.
2. Handles long input texts with contextual understanding.

Technologies Used:
transformers, torch, google/t5-small or t5-base


Fig 4.2 Architecture of T5
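
A minimal sketch of this module is shown below. It is illustrative only; the checkpoint
name and generation parameters are assumptions rather than the exact configuration
used in this project.

# Illustrative sketch: abstractive summarization of a transcript with T5.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "t5-small"  # assumed small checkpoint; t5-base trades speed for quality
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def summarize(text, max_input_tokens=512, max_summary_tokens=150):
    # T5 is a text-to-text model, so the task is signalled with a "summarize:" prefix.
    inputs = tokenizer("summarize: " + text,
                       max_length=max_input_tokens,
                       truncation=True,
                       return_tensors="pt")
    summary_ids = model.generate(inputs.input_ids,
                                 max_length=max_summary_tokens,
                                 num_beams=4,          # beam search for more fluent output
                                 early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Example usage with a hypothetical transcript string:
# print(summarize(transcript_text))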

4. Output Module

Purpose:
This module displays and/or saves the resulting summary generated by the T5
model.

Functions:
1. Display summary on console, GUI, or web application.
2. Save results in a .txt, .json, or .csv file.

Technologies Used:
Python I/O, Tkinter or Flask (optional interface)
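
For example, persisting the results could be as simple as the following sketch
(illustrative only; the file name is hypothetical):

# Illustrative sketch: persist the transcript and summary produced by the pipeline.
import json

def save_results(transcript, summary, path="summary_output.json"):
    # Store both texts together so each summary can be traced back to its source.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"transcript": transcript, "summary": summary}, f, indent=2)

# save_results(transcript_text, summary_text)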


Sources & Tools

Web Framework: Flask (REST API and traffic handling)
Machine Learning: NLP, NLTK (natural language processing)
Data Manipulation: Pandas, NumPy (data handling and transformations)
Model Persistence: torch.save / Joblib (save and load model artifacts)
Configuration: Python class, config.py (central parameter tuning)
Logging: Python logging module (event tracing and debugging)

Sources and Tools


CHAPTER 5

IMPLEMENTATION AND TESTING

5.1 Code Snippets

# Loading the LibriSpeech dataset from the local drive.
# (extract_path and the T5 tokenizer used below are defined earlier in the full code.)
import os

import torch
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

dataset = load_dataset(
    'audiofolder',
    # Include the subdirectory path
    data_dir=os.path.join(extract_path, 'LibriSpeech', 'train-clean-100'),
    drop_labels=False
)

# First the audio/speech is converted to text, and then the text is condensed
# into a short summary. AutoProcessor and AutoModelForCTC from transformers
# are used for the speech-to-text step.

def speech_to_text(batch):
    # Process audio files
    inputs = processor(
        batch["audio"]["array"],
        sampling_rate=batch["audio"]["sampling_rate"],
        return_tensors="pt",
        padding=True
    ).to(device)

    # Run inference
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Decode predictions (greedy CTC decoding)
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    return {"predicted_text": transcription}

# Load ASR model
model_name = "facebook/wav2vec2-base-960h"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCTC.from_pretrained(model_name).to(device)

def generate_summary(text, sentences_count=3):
    """Generate an extractive summary using the LexRank algorithm."""
    try:
        parser = PlaintextParser.from_string(text, Tokenizer("english"))
        summarizer = LexRankSummarizer()
        summary = summarizer(parser.document, sentences_count)
        return " ".join(str(sentence) for sentence in summary)
    except Exception as e:
        print(f"Error generating summary: {str(e)}")
        return "Summary not available"

# Final data preprocessing for summary formation (inputs for T5 fine-tuning)
def preprocess_data(source, target):
    input_text = "summarize: " + source
    model_inputs = tokenizer(input_text, max_length=512,
                             truncation=True, padding="max_length")

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(target, max_length=150,
                           truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


5.2 Types of Testing

Testing is a key phase of developing any machine learning or deep
learning system, ensuring that the model works accurately, reliably,
and as expected under different conditions. In this project, testing
was carried out to assess the performance of the audio transcription
and summarization pipeline, particularly the accuracy of the
transcribed text and the quality of the generated summaries.
• Unit Testing:
Verified that individual functions such as audio loading, preprocessing,
transcription, and summary generation behaved as intended (a small
pytest sketch follows this list).

• Model Testing:
Evaluated the performance of the ASR and summarization models on
validation and test sets with metrics like accuracy, precision, recall, and F1-score.

• Functional Testing:
Confirmed that the overall workflow, from audio file input to the final
summary, worked properly and seamlessly.

• Stress Testing:
Tested the system with noisy, distorted, or low-quality audio to
measure model robustness in real-world scenarios.

• Performance Visualization:
Utilized confusion matrices and accuracy/loss plots for result
analysis and identification of improvement areas.
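
As an illustration of the unit-testing approach, a hypothetical pytest sketch for the
generate_summary helper shown in Section 5.1 could look like this (the module name
summarizer is assumed):

# test_summarizer.py: hypothetical pytest sketch for the summarization helper.
# Assumes generate_summary() from Section 5.1 lives in a module named summarizer.
from summarizer import generate_summary


def test_generate_summary_returns_text():
    text = ("Peep, peer and glimpse are verbs of sight. "
            "Peep means to look quickly and secretively. "
            "Peer means to look intently. "
            "Glimpse means to see something briefly.")
    summary = generate_summary(text, sentences_count=2)
    assert isinstance(summary, str)
    assert len(summary) > 0


def test_generate_summary_fails_gracefully():
    # Even for empty input the helper should return a string rather than raise.
    assert isinstance(generate_summary("", sentences_count=2), str)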


5.3 Test Cases

Test Case 1: File: Peep vs Peer vs Glimpse - English In A Minute.mp3


Full Path: /content/drive/MyDrive/Audio Files/Peep vs Peer vs Glimpse
- English In A Minute.mp3
Duration: 58.77 seconds

Transcription:

I EVERYONE WALCOME BACK TO ENGLISH IN A MINUTE


PEEP PEER AND GLIMPSE ARE ALL VERBS OF SIGHT THAT
MEAN LOOK AT SOMETHING BUT ARE USED IN DIFFERENT
SITUATIONS LET'S LOOK AT SOME EXAMPLES MY FRIEND
PEEPED AT MY TEST ANCERS THIS VERB MEANS TO LOOK AT
SOMETHING QUICKLY AND SECRETIVELY I PEERED AT THE
DOCUMENT TRYING TO UNDERSTAND IT PEER MEANS TO
LOOK AT SOMETHING INTENTLY OR CAREFULLY IN DETAIL
IT CAN ALSO BE USED IN ANOTHER WAY I WAS PEERING AT
THE CLOCK IN THE DISTANCE THIS EXAMPLE MEANS THAT I
HAD DIFFICULTY READING THE CLOCK MAYBE THE CLOCK
WAS VERY SMALL OR I HAVE BAD EYESIGHT I GLIMPSE THE
SUNLIGHT THROUGH THE TREES GLIMPSE MEANS TO SEE
SOMETHING FOR A SHORT TIME OR TO ONLY SEE PART OF
SOMETHING WE OFTEN US GLIMPSE AS A NOWN WITH THE
VERB CATCH FOR EXAMPLE I CAUGHT A GLIMPSE OF FILL
AS HE LEFT THE OFFICE B EVERYONE

Summary:

I EVERYONE WALCOME BACK TO ENGLISH IN A MINUTE


PEEP PEER AND GLIMPSE ARE ALL VERBS OF SIGHT THAT


MEAN LOOK AT SOMETHING BUT ARE USED IN DIFFERENT


SITUATIONS LET'S LOOK AT SOME EXAMPLES MY FRIEND
PEEPED AT MY TEST ANCERS THIS VERB MEANS TO LOOK AT
SOMETHING QUICKLY

Test Case 2: File: 1034-121119-0008.flac


Full Path: /content/LibriSpeech/LibriSpeech/train-clean-
100/1034/121119/1034-121119-0008.flac
Duration: 15.10 seconds

Transcription:

HAD RETIRED IN HASTE TO HIS CLUB WHERE HE WAS CHATTING


WITH SOME FRIENDS UPON THE EVENTS WHICH SERVED AS A
SUBJECT OF CONVERSATION FOR THREE FOURTHS OF THAT
CITY KNOWN AS THE CAPITAL OF THE WORLD AT THE PRECISE
TIME WHEN MADAME DANGLARS DRESSED IN BLACK AND
CONCEALED IN A LONG VEIL

Summary:

HAD RETIRED IN HASTE TO HIS CLUB WHERE HE WAS CHATTING


WITH SOME FRIENDS UPON THE EVENTS WHICH SERVED AS a
SUBJECT OF CONVERSATION FOR THREE FOURTHS OF THAT
CITY KNOWN AS THE CAPITAL OF THE WORLD AT
THE PRECISE TIME.


CHAPTER 6

RESULTS AND DISCUSSIONS

Fig 6.1 Transcribed Text

Fig 6.2 Summarized Text

Fig 6.3 Accuracy and F1 score


Fig 6.4 Graphical Representation of Confusion Matrix

6.1 Analysis of the results

1. Performance of the speech-to-text model

The notebook logs confirm that the pre-trained automatic speech recognition (ASR)
model, based on the Wav2Vec2 architecture, loaded successfully. The model weights
and configuration files were downloaded without issues, with model.safetensors
(378 MB) received in about one second at 227 MB/s. This indicates strong network
performance and adequate hardware acceleration, most likely from a GPU. However,
the logs do not report transcription accuracy in explicit terms. In real-world use,
the model's performance will depend strongly on audio quality, speaker loudness,
and background noise. Domain-specific terminology or heavy accents would also
reduce accuracy unless the model is fine-tuned on suitable data. Additional
post-processing, such as language-model integration (e.g., pyctcdecode with KenLM),
could further improve the transcripts by correcting common mistakes and
enhancing contextual coherence.
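
As a hedged illustration of this idea (it is not part of the implemented system and
assumes the pyctcdecode package together with an externally trained KenLM model,
given a hypothetical file name here), language-model rescoring could be wired in
roughly as follows, reusing the processor, model, and inputs objects from the
Section 5.1 snippet:

# Illustrative sketch: CTC beam-search decoding rescored with a KenLM language model.
# Assumes: pip install pyctcdecode kenlm; "lm.arpa" is a hypothetical language-model file.
import torch
from pyctcdecode import build_ctcdecoder

# Labels must be ordered by the ASR model's output indices.
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]

decoder = build_ctcdecoder(labels, kenlm_model_path="lm.arpa")

with torch.no_grad():
    logits = model(inputs.input_values).logits[0].cpu().numpy()

# Beam search over the frame-level logits, rescored by the n-gram language model.
transcript = decoder.decode(logits)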

2. Effectiveness of text summarization

The summarization component appears to use an abstractive model, most likely
facebook/bart-large-cnn or t5-small, as indicated by the generation_config.json
file in the logs. Abstractive summarization is well suited to producing brief,
paraphrased outputs, but it can introduce inaccuracies if the input text contains
errors from the ASR phase. The quality of the summaries should be evaluated in
terms of fluency, coherence, and retention of key information. Larger models such
as BART-Large usually generate higher-quality summaries, but at the cost of
increased latency. Conversely, smaller models such as T5-Small offer faster
inference but can produce less coherent or overly general summaries. A hybrid
approach that combines extractive methods (e.g., keyphrase extraction) with
abstractive summarization could improve robustness, especially when handling
noisy or imperfect transcriptions.

3. System integration and efficiency

The pipeline shows smooth integration between the ASR and summarization stages,
with Google Drive serving as backend storage for inputs and outputs. The logs
indicate batch-processing capability, as evidenced by the handling of 28,539
training examples, which suggests scalability to large datasets.

However, a dependency conflict involving fsspec highlights potential stability
problems in a production environment, where version mismatches can disrupt
workflows. End-to-end latency is dominated by the ASR component, which took
approximately 6.41 seconds in the observed runs. Summarization adds only marginal
overhead, but GPU usage remains decisive for maintaining real-time performance.
Future iterations could benefit from model quantization or distillation to reduce
inference time, especially for edge deployments where computational resources
are limited.

4. Recommendations for improvement

To increase the accuracy and reliability of the system, it would be beneficial
to fine-tune the ASR model on domain-specific datasets, especially for
applications in specialized areas such as medicine or law. Incorporating a
spell-correction mechanism (e.g., SymSpell) and a more sophisticated language
model could further refine the transcripts. For summarization, a hybrid
approach that combines extractive and abstractive techniques can yield more
consistent results, especially when processing noisy or erroneous text. From a
deployment perspective, moving to a cloud inference API (e.g., Hugging Face
Inference Endpoints) could offload the computational load, while edge
optimization (e.g., ONNX Runtime) would suit low-latency applications.
Continuous monitoring with metrics such as Word Error Rate (WER) for ASR and
ROUGE for summarization would ensure lasting performance quality.

6.2 Discussion

The implemented pipeline effectively combines speech recognition and text
summarization, demonstrating scalability and efficient use of resources. However,
its performance depends on audio quality, model selection, and available
computing resources. The key areas for improvement include domain adaptation of
the ASR model, hybrid summarization techniques, and optimized deployment
strategies. Addressing these aspects would increase the robustness of the system
and make it more suitable for real-world applications where accuracy and speed
are paramount. Future work should also focus on comprehensive benchmarking
against industry standards to verify its competitiveness with existing solutions.


CHAPTER 7

CONCLUSION AND FUTURE ENHANCEMENTS

7.1 Summary

The project has successfully developed an integrated speech-to-text and automated
summarization pipeline, demonstrating the practical use of modern automatic
speech recognition (ASR) and natural language processing (NLP). By using
state-of-the-art models such as Wav2Vec2 for transcription and BART/T5 for
summarization, the system achieved a functional balance between accuracy and
computational efficiency. The ASR component reliably transcribed clear audio
inputs, although its effectiveness dropped for noisy recordings, strong accents,
or specialized terminology. Meanwhile, the summarization module produced coherent
and concise outputs, albeit with occasional factual discrepancies typical of
abstractive approaches.
The system architecture proved scalable and capable of efficiently processing
large batches of audio files. Latency, however, remained a problem for real-time
applications, and dependency issues underlined the importance of a robust
environment configuration in production settings. These findings highlight both
the potential and the limitations of current AI speech systems and provide
valuable insights for future improvement.

7.2 Recommendations
Several targeted improvements should be prioritized to increase system
performance. First, the accuracy of the ASR model could be significantly
improved by fine-tuning it on domain-specific datasets, ensuring better handling
of specialized vocabulary and diverse accents. Incorporating post-recognition
correction, such as spell checking and language-model rescoring, would further
refine the transcripts by resolving common errors and improving contextual
coherence.

For the summarization component, adopting a hybrid approach that combines
extractive and abstractive techniques would help retain critical information
while maintaining fluency. In addition, allowing users to control summarization
parameters, such as length or focus areas, would increase the versatility of the
system.

On the technical front, model optimization through quantization and distillation
would reduce latency and resource consumption, making the system more efficient
for large-scale or real-time use.
Finally, the reliability of system deployment needs to be strengthened.
Containerization with Docker and orchestration via Kubernetes would ensure
consistent performance across environments, while robust monitoring tools could
track key metrics such as Word Error Rate (WER) and ROUGE scores to maintain
quality standards. Together, these improvements would increase the system's
readiness for production use.

7.3 Future Enhancements

Looking forward, the system could be extended in several innovative directions
to broaden its usability and impact. Real-time processing capabilities would
unlock new use cases, such as live transcription and summarization of meetings
or lectures; integrating streaming-capable ASR models with incremental
summarization would allow the output to be updated dynamically as new speech
segments arrive (a rough sketch of this idea is given below). Multilingual
support is another critical frontier, as extending the system to non-English
languages would significantly increase its usefulness in global contexts.
Similarly, incorporating multimodal inputs, for example video or image content,
could provide richer contextual information for summarization. Advanced NLP
features such as sentiment analysis and question answering would further
increase the system's value by enabling deeper insight and interactive querying.
From a user-experience perspective, offering personalized summary profiles and
interactive editing tools would let users tailor the outputs to their specific
needs. To ensure responsible deployment, ethical considerations, including bias
mitigation and privacy protection, must also be addressed. By pursuing these
enhancements, the project could evolve into a comprehensive next-generation
tool for transforming unstructured audio into accessible knowledge.
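
A rough sketch of one way to approximate incremental processing, splitting long
audio into fixed-length chunks and transcribing them as they arrive; this is an
illustrative pattern rather than the project's implementation, and a true
streaming system would use an ASR model designed for it.

import torch
import librosa
from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def transcribe_in_chunks(path, chunk_seconds=30):
    # Load the full recording at 16 kHz and process it in fixed-length windows
    speech, sr = librosa.load(path, sr=16000)
    chunk_size = chunk_seconds * sr
    partial_transcript = []
    for start in range(0, len(speech), chunk_size):
        chunk = speech[start:start + chunk_size]
        inputs = processor(chunk, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        partial_transcript.append(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
        # In a live setting, each partial result could be pushed to an incremental summarizer here
    return " ".join(partial_transcript)
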
This project exemplifies the transformative potential of AI in bridging the gap
between spoken language and structured information. While the current
implementation provides a solid foundation, the proposed recommendations and
future enhancements outline a clear path toward a more robust, versatile, and
user-focused system. By prioritizing accuracy, efficiency, and ethical
considerations, further phases of development could establish this technology
as an indispensable resource across sectors such as education, journalism, and
corporate communication.

The path from speech to summary is complex, but with continued innovation and
refinement this pipeline has the potential to redefine how we interact with and
derive value from audio content. Future work should focus on extensive
benchmarking, integration of user feedback, and community collaboration to
ensure that the system meets the evolving needs of its users. Ultimately, this
project underlines the power of AI to turn the spoken word into meaningful
knowledge and paves the way for smarter, more accessible information processing
in the digital age.


SAMPLE CODING

from google.colab import drive


drive.mount('/content/drive')

# Install required packages


#!pip install torch torchaudio transformers datasets soundfile librosa --quiet
#!pip install pyctcdecode pypi-kenlm --quiet

# Install only the essential packages with specific versions to avoid conflicts
!pip install torch torchaudio transformers datasets soundfile librosa --quiet
!pip install -U fsspec # Update fsspec to resolve the conflict

!pip install -U sentence-transformers nltk sumy --quiet

# Verify installations
!pip list | grep -E "torch|transformers|datasets"

import tarfile
import os
from datasets import load_dataset, Audio
from transformers import AutoProcessor, AutoModelForCTC
import torch
import pandas as pd

dataset_path = '/content/drive/MyDrive/train-clean-100.tar.gz' # Update this path


extract_path = '/content/LibriSpeech'

# Create the extraction directory if it doesn't exist
os.makedirs(extract_path, exist_ok=True)

# Extract the LibriSpeech archive into the target directory
with tarfile.open(dataset_path, 'r:gz') as tar:
    tar.extractall(extract_path)

# Walk the extracted directory tree and list its contents
for root, dirs, files in os.walk(extract_path):
    print(f"Directory: {root}")
    for file in files:
        print(f"  File: {file}")

import os

# List the contents of the extracted directory; the archive unpacks into a nested
# 'LibriSpeech' folder, so that segment is added to the path
print(os.listdir(os.path.join(extract_path, 'LibriSpeech', 'train-clean-100')))

import os
from datasets import Dataset, Audio

# Define the directory containing the audio data


# This path should point to where the audio files were extracted
audio_data_dir = os.path.join(extract_path, 'LibriSpeech', 'train-clean-100')

# Step 1: Manually get all audio file paths


all_audio_files = [
os.path.join(root, file)
for root, _, files in os.walk(audio_data_dir)
for file in files if file.endswith('.flac')
]

# Step 2: Create dataset from dictionary


dataset = Dataset.from_dict({'audio': all_audio_files})

# Step 3: Cast column to audio and resample


dataset = dataset.cast_column('audio', Audio(sampling_rate=16000))

print(dataset)

# Resample the audio column to 16 kHz (standard for ASR models); this repeats the
# cast from the previous step and is kept only as a safeguard
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

print(dataset)

from transformers import AutoProcessor, AutoModelForCTC


import torch

# Use a pre-trained model for faster inference


model_name = "facebook/wav2vec2-base-960h"  # Good balance between speed and accuracy

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCTC.from_pretrained(model_name)

# Move model to GPU if available


device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
print(f"Model loaded on {device}")


def speech_to_text(batch):
    # Convert one example's raw waveform into model inputs
    inputs = processor(
        batch["audio"]["array"],
        sampling_rate=batch["audio"]["sampling_rate"],
        return_tensors="pt",
        padding=True
    ).to(device)

    # Run inference without tracking gradients
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Greedy CTC decoding: take the most likely token at each time step
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)

    # batch_decode returns a list; keep the single transcription string
    return {"predicted_text": transcription[0]}

# Test on a small subset for quick results


test_samples = dataset.select(range(5)) # First 5 samples
results = test_samples.map(speech_to_text, remove_columns=["audio"])



import pandas as pd

# Create a DataFrame for better visualization
df = pd.DataFrame({
    # Access the 'path' field within each decoded audio item
    "File": [os.path.basename(x["path"]) for x in test_samples["audio"]],
    "Predicted Text": results["predicted_text"],
    "Actual Text": test_samples["text"] if "text" in test_samples.features else ["N/A"] * len(results)
})

# Display results
print("\nAudio to Text Conversion Results:")
display(df)

# Print sample audio info
sample = test_samples[0]
print(f"\nSample audio info: Duration={len(sample['audio']['array'])/sample['audio']['sampling_rate']:.2f}s, "
      f"Sample rate={sample['audio']['sampling_rate']}Hz, "
      f"Channels={len(sample['audio']['array'].shape)}")

# Save results to CSV


output_path = '/content/drive/MyDrive/asr_results.csv'
df.to_csv(output_path, index=False)
print(f"\nResults saved to {output_path}")

import librosa
import nltk
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# Download NLTK data


nltk.download('punkt')

# Install required packages


!pip install torch torchaudio transformers librosa soundfile --quiet

# Import libraries
from transformers import AutoProcessor, AutoModelForCTC
import torch
import librosa
from IPython.display import Audio, display


# Mount Google Drive (if not already mounted)


from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Load ASR model


model_name = "facebook/wav2vec2-base-960h"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCTC.from_pretrained(model_name).to("cuda" if
torch.cuda.is_available() else "cpu")

def generate_summary(text, sentences_count=3):


"""Generate summary using LexRank algorithm"""
try:
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()
summary = summarizer(parser.document, sentences_count)
return " ".join([str(sentence) for sentence in summary])
except Exception as e:
print(f"Error generating summary: {str(e)}")
return "Summary not available"

def transcribe_from_drive(file_path):
"""
Transcribe audio file from Google Drive path

Args:
file_path (str): Full path to audio file in Google Drive
Example: '/content/drive/MyDrive/audio_files/103-1240-0002.flac'
"""
try:
# Verify file exists
if not os.path.exists(file_path):
print(f"Error: File not found at {file_path}")
return

# Load and display audio


print("\nAudio Preview:")
display(Audio(file_path))

# Process audio
speech, sr = librosa.load(file_path, sr=16000)
inputs = processor(speech, sampling_rate=sr,
return_tensors="pt").to(model.device)


with torch.no_grad():
logits = model(**inputs).logits

transcription = processor.batch_decode(torch.argmax(logits, dim=-1))[0]

print("\n" + "="*50)
print(f"File: {os.path.basename(file_path)}")
print(f"Full Path: {file_path}")
print(f"Duration: {len(speech)/sr:.2f} seconds")
print("\nTranscription:")
print("="*50)
print(transcription)
print("="*50)

return transcription

except Exception as e:
print(f"Error processing file: {str(e)}")
return None

# Example usage:
file_path = "/content/drive/MyDrive/Audio Files/Beside vs Besides_ English In A Minute.mp3"  # Change this to your actual file path
transcription = transcribe_from_drive(file_path)

!pip install sumy --quiet


from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

def extractive_summary(text, sentence_count=3):


parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, sentence_count)
return " ".join(str(sentence) for sentence in summary)

# Create a pseudo-labeled dataset; wrap the single transcription string in a list
# so the comprehension iterates over transcripts rather than individual characters
transcriptions = [transcription]
summaries = [extractive_summary(text) for text in transcriptions]

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")


def preprocess_data(source, target):


input_text = "summarize: " + source
model_inputs = tokenizer(input_text, max_length=512, truncation=True,
padding="max_length")

with tokenizer.as_target_tokenizer():
labels = tokenizer(target, max_length=150, truncation=True,
padding="max_length")

model_inputs["labels"] = labels["input_ids"]
return model_inputs

dataset = [preprocess_data(src, tgt) for src, tgt in zip(transcriptions, summaries)]

from datasets import Dataset


train_dataset = Dataset.from_list(dataset)
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask',
'labels'])

from transformers import T5Tokenizer


import numpy as np

tokenizer = T5Tokenizer.from_pretrained("t5-small")

def preprocess_data(source, target):


input_text = "summarize: " + source
model_inputs = tokenizer(input_text, max_length=512, truncation=True,
padding="max_length")

with tokenizer.as_target_tokenizer():
labels = tokenizer(target, max_length=150, truncation=True,
padding="max_length")

# Convert lists to numpy arrays before returning


model_inputs["input_ids"] = np.array(model_inputs["input_ids"])
model_inputs["attention_mask"] = np.array(model_inputs["attention_mask"])
model_inputs["labels"] = np.array(labels["input_ids"]) # Ensure labels is also a
numpy array

return model_inputs

dataset = [preprocess_data(src, tgt) for src, tgt in zip(transcriptions, summaries)]

# Instead of using the ASR model for generation, load a text summarization model
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load a pre-trained T5 model for summarization
summarization_model_name = "t5-small"  # Or a larger checkpoint such as "t5-base"
# T5Tokenizer was already loaded earlier, but it is re-loaded here for clarity
summarization_tokenizer = T5Tokenizer.from_pretrained(summarization_model_name)
summarization_model = T5ForConditionalGeneration.from_pretrained(summarization_model_name)

# Move the summarization model to the same device as the ASR model if available
summarization_model.to("cuda" if torch.cuda.is_available() else "cpu")

# Prepare the input for the summarization model


# The T5 model expects input in the format "summarize: <text to summarize>"
new_input = "summarize: " + transcription

# Encode the input using the summarization tokenizer


input_ids = summarization_tokenizer.encode(new_input, return_tensors="pt",
max_length=512, truncation=True).to(summarization_model.device)

# Generate the summary using the summarization model's generate method


summary_ids = summarization_model.generate(
input_ids,
max_length=150,
min_length=30,
num_beams=4,
early_stopping=True
)

# Decode the generated summary


summary = summarization_tokenizer.decode(summary_ids[0],
skip_special_tokens=True)

print(" Summary:\n", summary)



The Report is Generated by DrillBit AI Content Detection Software

Submission Information

Author Name: Ajay
Title: Audio_to_text
Paper/Submission ID: 4005501
Submitted by: [email protected]
Submission Date: 2025-06-28 09:44:06
Total Pages: 49
Document Type: Project Work

Result Information

AI Text: 9.0%
Human Text: 91.0%

Disclaimer:
* The content detection system employed here is powered by artificial intelligence (AI) technology.
* It is not always accurate and only helps the author identify text that might have been prepared by an AI tool.
* Although it is designed to assist in identifying and moderating content that may violate community guidelines or legal regulations, it may not be perfect.

The Report is Generated by DrillBit Plagiarism Detection Software

Submission Information

Author Name: Ajay
Title: Audio_to_text
Paper/Submission ID: 4005501
Submitted by: [email protected]
Submission Date: 2025-06-28 09:44:06
Total Pages, Total Words: 49, 7742
Document Type: Project Work

Result Information

Similarity: 15%

Sources Type: Student Paper 0.15%, Internet 4.9%, Journal/Publication 9.95%
Report Content: Quotes 2.92%, Words < 14 3.97%, Ref/Bib 3.49%

Exclude Information

Quotes: Excluded
References/Bibliography: Excluded
Source: Excluded < 14 Words: Not Excluded
Excluded Source: 0%
Excluded Phrases: Not Excluded

Database Selection

Language: English
Student Papers: Yes
Journals & Publishers: Yes
Internet or Web: Yes
Institution Repository: Yes
