Audio To Text Summarizer Mini Project Final Report
A PROJECT REPORT
Submitted by
AJAY
(P19DU23S126012)
ARYA P
(P19DU23S126036)
2023-2025
KRUPANIDHI COLLEGE OF MANAGEMENT
BENGALURU -560035
JUNE 2025
BONAFIDE CERTIFICATE
This is to certify that the project entitled “AUDIO TO TEXT SUMMARIZER” is the bonafide record
of project work done by AJAY bearing Reg No. P19DU23S126012 and ARYA P bearing Reg
No.P19DU23S126036 and is being submitted in partial fulfilment for the award of the Master’s Degree
JUNE 2025
I hereby declare that “AUDIO TO TEXT SUMMARIZER” is the result of the project
work carried out by me under the guidance of Ms. BHARGAVI K, Assistant Professor
I also declare that this project is the outcome of my own efforts and that it has not been
submitted to any other university or Institute for the award of any other degree or Diploma
or Certificate.
AJAY ARYA P
P19DU23S126012 P19DU23S126036
Ms. BHARGAVI K
Assistant Professor
ACKNOWLEDGEMENT
I owe my deepest gratitude to God Almighty for blessing me all the way to complete this work
successfully.
I proudly utilize this opportunity to express my heartfelt thanks to Dr. SURESH NAGPAL,
Chairman of Krupanidhi Group of Institutions, Bengaluru, and Dr.
C.J. Rajendra Prasad, Principal, Krupanidhi College of Management, for their valuable
advice and encouragement in carrying out this project.
I thank all the other faculty members of the Department of Computer Applications for their
continuous support in carrying out this project.
I offer my humble and sincere thanks to my beloved parents and friends, who are a never-ending
source of inspiration to me.
AJAY
(P19DU23S126012)
ARYA P
(P19DU23S126036)
ABSTRACT
In the era of digital transformation, vast amounts of audio data are generated daily,
including lectures, meetings, podcasts, and interviews. Manually transcribing and
summarizing such content is time-consuming and inefficient. This project aims to
develop an automated system that converts spoken content into text and generates
concise summaries while preserving essential information.
The system utilizes Automatic Speech Recognition (ASR) to transcribe audio into
text and Natural Language Processing (NLP) techniques for summarization. Deep
learning models, such as Wav2Vec for speech-to-text conversion and transformer-
based models like T5 for text summarization, are integrated to enhance accuracy
and efficiency.
LIST OF ABBREVIATIONS
1. AI - Artificial Intelligence
4. TTS - Text-to-Speech
5. STT - Speech-to-Text
7. MT - Machine Translation
1. INTRODUCTION 1-6
1.1 Background 1
4. METHODOLOGY 21-24
6.2 Discussion 33
7. CONCLUSION AND FUTURE WORK 34-36
7.1 Summary 34
BIBLIOGRAPHY 37-38
SAMPLE CODING
CHAPTER 1
INTRODUCTION
1.1 Background
Every day, lectures, meetings, podcasts, and many other activities produce large
volumes of audio data. Extracting pertinent information from this data takes
considerable time and effort, so efficient transcription and summarization
techniques are required. Thanks to developments in AI and NLP, these procedures
can now be automated, making them quicker and more precise.
This project takes these improvements a step further by proposing a system that
combines Wav2Vec2-based English speech-to-text transcription with transformer-based
summarization, with the aim of producing accurate and succinct summaries. The
system is also intended to handle lengthy audio files.
Deep Learning: Deep learning is one of the main branches of machine learning and
draws inspiration from the workings of the human brain. It uses multi-layered
artificial neural networks (hence the word "deep") to automatically extract complex
representations and patterns from massive datasets. Whereas other machine learning
techniques need hand-engineered features and domain knowledge, deep learning models
can learn directly from raw inputs such as text, audio, or images.
Over the past few years, advancements in computer vision, speech recognition,
natural language processing (NLP), and real-time translation software have all
been fueled by deep learning. This has made it possible to automate tasks that
previously required a great deal of human labor while maintaining high accuracy.
For the purposes of this project, deep learning powers both speech-to-text
conversion and summarization. The Wav2Vec2 model, which is built on a deep neural
architecture, converts speech to text by identifying intricate audio patterns.
Similarly, the transformer-based T5 model summarizes the transcribed text by
capturing linguistic structure and contextual meaning. By eliminating the need for
manual transcription and summarization, these two components create a system that
is quicker, more scalable, and domain-adaptable. This reflects the current trend of
integrating deep learning into offline, real-time, end-to-end automated products
for accessibility, education, and communication.
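As a rough end-to-end illustration of this idea, the following minimal sketch chains Hugging Face pipelines for ASR and summarization. The model names match those used elsewhere in this report, but the file name is a placeholder and the project's actual code appears in the Sample Coding section.

from transformers import pipeline

# Speech-to-text using a pre-trained Wav2Vec2 checkpoint
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

# Abstractive summarization using a small T5 checkpoint
summarizer = pipeline("summarization", model="t5-small")

transcript = asr("lecture.wav")["text"]  # "lecture.wav" is a placeholder input file
summary = summarizer(transcript, max_length=120, min_length=30)[0]["summary_text"]
print(summary)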
articulation to background noise and subtle voice patterns, despite the fact that
they are understandable.
Create a brief but accurate summary of the given text using the transformer-
based natural language processing (NLP) model T5.
Determine the original meaning and speaker's intention in long audio formats
to extract meaningful comprehension from the audio data.
5. To Enhance Usability
3. Supported Audio Inputs and Sources: Live speech transcription and pre-recorded
audio files (such as WAV and MP3); transcription of podcasts, lectures, meetings,
and interviews.
Applications include summarizing interviews and reports in journalism and the
media, and assistive technology that lets people with hearing impairments access
information.
The system is based on pre-trained models such as Wav2Vec2 for ASR and transformer
models such as T5 for summarization. Without further fine-tuning, these models may
not be optimized for a particular domain (e.g., legal or medical jargon).
Processing very long audio files (e.g., multi-hour lectures or meetings) may
require segmentation methods to allow efficient processing and
summarization.
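As a rough sketch of such segmentation (the 30-second chunk length is an assumption, and the project's own code appears in the Sample Coding section), a long recording can be split into fixed-length windows, each window transcribed, and the pieces concatenated before summarization:

import librosa
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

def transcribe_long_audio(path, chunk_seconds=30, sr=16000):
    # Load the full recording at 16 kHz mono
    speech, _ = librosa.load(path, sr=sr)
    chunk_len = chunk_seconds * sr
    texts = []
    # Transcribe each fixed-length window and join the pieces
    for start in range(0, len(speech), chunk_len):
        chunk = speech[start:start + chunk_len]
        texts.append(asr({"raw": chunk, "sampling_rate": sr})["text"])
    return " ".join(texts)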
Deep learning models require high computational power and are therefore resource-intensive
for real-time processing on low-end devices.
CHAPTER 2
LITERATURE REVIEW
1. Haoran Li, Junnan Zhu, Cong Ma, Jiajun Zhang, Chengqing Zong IEEE
Transactions on Knowledge and Data Engineering 31 (5), 996-1009, 2018
The field of Natural Language Processing (NLP) has revolutionized the way
human language interacts with computer systems. NLP applications span
machine translation, information extraction, summarization, and question
answering, driven by vast computational resources and big-data
methodologies. Despite these advancements, NLP tools have not yet been fully
integrated with Internet of Things (IoT) devices such as audio recorders.
The paper describes the possibilities provided by open APIs and how to use them to
create unified interfaces based on recurrent neural networks. In the last decade,
AI technologies have become widespread and easy to implement and use. One of the
most promising technologies in the AI field is speech recognition as part of
natural language processing. New speech recognition technologies and methods will
become a central part of future life because they save a lot of communication time,
replacing common texting with voice/audio. In addition, the paper explores the
advantages and disadvantages of well-known chatbots and builds a method for their
improvement. A sequence-to-sequence algorithm based on recurrent neural networks is
used, and the time complexity of the proposed algorithm is compared with the
existing one. The scientific novelty of the obtained results is a method for
converting audio signals into text based on a sequential ensemble of recurrent
encoding and decoding networks. The practical significance is a modified existing
chatbot system for converting audio signals into text.
This survey paper provides references to public datasets, preprocessing steps, and
features that have been used, and describes the techniques and methods that
researchers often use as comparisons and as a basis for developing new methods. At
the end of the paper, several recommendations regarding opportunities and
challenges in text summarization research are given.
CHAPTER 3
Hardware Specification
This part refers to the physical equipment your system needs to function
properly.
1. Processor (CPU)
• The CPU performs all the main calculations and runs your software.
• A multi-core processor (e.g., Intel i5 or better) is important to handle the
data processing and machine learning tasks efficiently.
2. RAM (Memory)
• More RAM helps with handling large datasets and running simulations
or ML models smoothly.
• Minimum: 8 GB is fine for small projects. Recommended: 16 GB or
more for faster performance.
3. Storage
• An SSD (256 GB minimum, 512 GB recommended) provides enough space for datasets,
model files, and logs.
4. GPU (Optional)
• If you're using deep learning models (like LSTM or CNN), a GPU can
significantly speed up training.
• Although models can run on the CPU, GPU acceleration (most commonly NVIDIA GPUs
with CUDA support) improves performance, especially during inference and training
of the model.
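As a small illustration, the following check (the same pattern used later in the Sample Coding section) selects the GPU when CUDA is available and otherwise falls back to the CPU:

import torch

# Pick the GPU if CUDA is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")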
Software Specification
This includes the programs, libraries, and tools your system will use.
1. Operating System
• Linux (Ubuntu preferred) or Windows 10+.
2. Programming Language
• Python 3.8+ is used for the entire pipeline.
3. IDE / Development Tools
• Tools like Jupyter Notebook, VS Code, or Google Colab help you write
and test your code.
• Jupyter is great for step-by-step experiments.
• VS Code is good for full applications.
4. Python Libraries
• torch, torchaudio, transformers, pandas, tarfile, librosa, nltk, and pytest
(see the table below).
Component               Specification
Processor               Intel Core i5 or above, multi-core (Recommended: i7 or Ryzen 7 for faster training)
RAM                     Minimum 8 GB (Recommended: 16 GB for larger datasets and simulations)
Storage                 256 GB SSD minimum; 512 GB recommended for model logs, datasets, and backups
Operating System        Linux (Ubuntu preferred) or Windows 10+
Programming Language    Python 3.8+
IDE                     VS Code, Jupyter Notebook, or Google Colab
Libraries Used          torch, torchaudio, transformers, pandas, tarfile, librosa, nltk, pytest
The encoder receives the input sentence, feeds it through several stacked layers
(represented by Nx), and processes it with multi-head attention mechanisms and
feed-forward neural networks. Every input token is first embedded and combined with
a positional encoding to preserve word-order information, since transformers do not
otherwise model sequence order. The attention mechanism enables the model to assign
varying levels of importance to the words in a sentence, regardless of where they
appear in it.
The decoder works in a similar way but adds a masked multi-head attention layer to
prevent the model from seeing future tokens during training (i.e., for
autoregressive generation). It also attends to the outputs of the encoder in order
to use the contextual information from the input sequence. The final outputs are
passed through a linear layer and a softmax to produce probability distributions over
the vocabulary for every word in the output sequence.
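To make the attention step concrete, the following minimal PyTorch sketch (an illustration only, not part of the project code) computes scaled dot-product attention with an optional causal mask of the kind the decoder uses:

import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    # q, k, v: tensors of shape (batch, seq_len, d_model)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # token-to-token similarity
    if causal:
        seq_len = scores.size(-1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))  # hide future tokens
    weights = torch.softmax(scores, dim=-1)               # attention weights
    return weights @ v                                     # weighted sum of values

# Example: one contextualized vector per token, with future positions masked
x = torch.randn(1, 5, 64)
out = scaled_dot_product_attention(x, x, x, causal=True)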
1. Audio Input
- Purpose: The system takes in voice input of spoken language, like recorded
speech, live speech, or a podcast.
- Formats: May be microphone input, an audio file (e.g., MP3, WAV), or streaming
audio.
2. Preprocessing
- Purpose: Processes the audio so that it is clear and devoid of much noise to
prepare for use with the STT model.
- Steps:
- Noise reduction (removing background sounds).
- Normalization (adjusting volume levels).
- Splitting audio into manageable chunks (e.g., by silence detection).
4. Postprocessing
- Purpose: Fine-tunes the raw text for readability and accuracy.
- Steps:
- Correction of grammar/spelling errors.
- Elimination of filler words (e.g., "um," "ah").
- Punctuation, capitalization.
5. Cleaned Text
- Output: A refined, accurate transcript available for additional analysis or storage.
6. Summarization Model
- Purpose: Produces a short summary of the cleaned text.
- Methods:
- Extractive: Chooses important sentences/phrases (e.g., with NLP libraries such
as spaCy).
- Abstractive: Rephrases content in condensed form (e.g., GPT, BART).
7. Output/Save
- Final Deliverables:
- The summary (e.g., for ease of review, meeting minutes, or reports).
- Optionally, the cleaned transcript and audio may also be stored.
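As a rough sketch of steps 2 and 4 above (the silence threshold and filler-word list are assumptions, not the project's exact settings), the audio can be normalized and split on silence with librosa, and the raw transcript lightly cleaned with regular expressions:

import re
import librosa

def preprocess_audio(path, sr=16000, top_db=30):
    # Load, resample to 16 kHz, and peak-normalize the waveform
    speech, _ = librosa.load(path, sr=sr)
    speech = speech / max(abs(speech).max(), 1e-9)
    # Split into non-silent chunks (simple energy-based silence detection)
    intervals = librosa.effects.split(speech, top_db=top_db)
    return [speech[start:end] for start, end in intervals]

FILLERS = re.compile(r"\b(um+|uh+|ah+|er+)\b", flags=re.IGNORECASE)

def postprocess_text(raw_text):
    # Remove filler words, collapse extra whitespace, and capitalize the result
    text = FILLERS.sub("", raw_text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.capitalize()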
CHAPTER 4
METHODOLOGY
Modules
1. Audio Input Module
Purpose:
This module takes raw audio data as input. The source can be a live
microphone input or a previously recorded audio file (such as .wav or .mp3).
Functions:
1. Enter audio from a directory or user.
2. Verify and adjust the sample rate and audio format.
3. Send the speech-to-text model the audio waveform.
Technologies Used:
torchaudio, librosa, pyaudio (for real-time microphone input)
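A minimal sketch of this module (the target sample rate of 16 kHz matches what Wav2Vec2 expects; the helper name is illustrative), loading a file with torchaudio, resampling it, and down-mixing to mono:

import torchaudio

def load_audio(path, target_sr=16000):
    # Load the file and check/adjust the sample rate and channel layout
    waveform, sr = torchaudio.load(path)           # shape: (channels, samples)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    if waveform.size(0) > 1:                        # down-mix stereo to mono
        waveform = waveform.mean(dim=0, keepdim=True)
    return waveform.squeeze(0), target_sr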
2. Speech-to-Text Module
Purpose:
This module transforms the audio waveform into raw text using the Wav2Vec2
model; that is, it performs automatic speech recognition (ASR).
Functions:
1. Tokenize the waveform so that the model can use it.
2. Create transcriptions after decoding the audio.
3. Send back the uncondensed text for summarization.
Advantages:
1. Can process noisy audio.
2. No handcrafted audio features needed.
Technologies Used:
transformers, torch, facebook/wav2vec2-base-960h
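A minimal sketch of this module using the technologies listed above (facebook/wav2vec2-base-960h with greedy CTC decoding); the helper function name is illustrative:

import torch
from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def transcribe(speech, sampling_rate=16000):
    # `speech` is a 1-D NumPy array of the waveform at 16 kHz
    inputs = processor(speech, sampling_rate=sampling_rate, return_tensors="pt")
    # Run the model and pick the most likely token at each frame (greedy CTC decoding)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]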
3. Summarization Module
Purpose:
Using the T5 (Text-to-Text Transfer Transformer) model, this module takes
the full transcribed text and produces a succinct, useful summary.
Functions:
1. Preprocess the transcription (e.g., add a prefix like "summarize:").
2. Generate a summary with the transformer model.
3. Return the final summary text.
Advantages:
1. Flexible and powerful for various NLP tasks.
2. Handles long input texts with contextual understanding.
Technologies Used:
transformers, torch, google/t5-small or t5-base
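A minimal sketch of this module (t5-small as listed above; the beam-search and length settings are illustrative assumptions):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
summarizer = T5ForConditionalGeneration.from_pretrained("t5-small")

def summarize(text, max_input_tokens=512):
    # T5 is a text-to-text model, so the task is given as a "summarize:" prefix
    inputs = tokenizer("summarize: " + text, return_tensors="pt",
                       max_length=max_input_tokens, truncation=True)
    summary_ids = summarizer.generate(inputs.input_ids, num_beams=4,
                                      max_length=150, min_length=30,
                                      early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)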
4. Output Module
Purpose:
This module displays and/or saves the resulting summary generated by the T5
model.
Functions:
1. Display summary on console, GUI, or web application.
2. Save results in a .txt, .json, or .csv file.
Technologies Used:
Python I/O, Tkinter or Flask (optional interface)
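A small sketch of this module (file names are placeholders), saving the summary as a .txt file and, optionally, the transcript alongside it as JSON:

import json

def save_summary(summary, transcript=None, basename="summary"):
    # Write the summary as plain text
    with open(f"{basename}.txt", "w", encoding="utf-8") as f:
        f.write(summary)
    # Optionally store transcript and summary together as JSON
    with open(f"{basename}.json", "w", encoding="utf-8") as f:
        json.dump({"transcript": transcript, "summary": summary}, f, indent=2)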
Component            Library / Tool               Purpose
Web Framework        Flask                        REST API for traffic handling
Machine Learning     NLP, NLTK                    Natural language processing
Data Manipulation    Pandas, NumPy                Data handling and transformations
Model Persistence    torch.save / Joblib          Save/load model artifacts
Configuration        Python class (config.py)     Central parameter tuning
Logging              Python logging module        Event tracing and debugging
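A minimal sketch of the configuration and logging components listed above (the file name config.py comes from the table; the specific parameter values are illustrative assumptions):

# config.py - central place for tunable parameters (illustrative values)
import logging

class Config:
    ASR_MODEL = "facebook/wav2vec2-base-960h"
    SUMMARIZER_MODEL = "t5-small"
    SAMPLE_RATE = 16000
    SUMMARY_MAX_LENGTH = 150

# Event tracing and debugging via the standard logging module
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("audio_summarizer")
logger.info("Loaded configuration: ASR=%s, summarizer=%s",
            Config.ASR_MODEL, Config.SUMMARIZER_MODEL)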
CHAPTER 5
def speech_to_text(batch):
    # Process audio files
    inputs = processor(
        batch["audio"]["array"],
        sampling_rate=batch["audio"]["sampling_rate"],
        return_tensors="pt",
        padding=True
    ).to(device)
    # Run inference
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Decode predictions
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
AutoModelForCTC.from_pretrained(model_name).to("cuda" if torch.cuda.is_available() else "cpu")
• Model Testing:
Tested the performance of the ASR and summarization models on validation
and test sets with metrics like accuracy, precision, recall, and F1-score.
• Functional Testing:
Confirmed the overall workflow from audio upload to the final
summary worked properly and seamlessly.
• Stress Testing:
Tested the system with noisy, distorted, or low-quality audio to
measure model robustness in real-world scenarios.
• Performance Visualization:
Utilized confusion matrices and accuracy/loss plots for result
analysis and identification of improvement areas.
CHAPTER 6
The summarization component appears to use an abstractive model, likely
facebook/bart-large-cnn or t5-small, given the presence of a
generation_config.json file in the logs. Abstractive summarization is
advantageous for creating brief, paraphrased outputs, but it can introduce
inaccuracies if the input text contains errors from the ASR phase. The
summarization model's performance would be evaluated on the basis of fluency,
coherence, and retention of key information. Larger models like BART-Large
usually generate better-quality summaries, but at the cost of increased
computational latency. Smaller models such as T5-Small, on the other hand,
offer faster inference but can produce less coherent or overly generic
summaries. A hybrid approach, combining extractive methods (e.g., keyphrase
extraction) with abstractive summarization, could improve robustness, especially
when dealing with noisy or imperfect transcriptions.
The pipeline shows smooth integration between ASR and summarization, with Google
Drive serving as the backend storage for inputs and outputs. The logs indicate
batch-processing capability, as evidenced by the processing of 28,539
training examples, which suggests scalability to large datasets.
Further optimizations could be applied to reduce inference time, especially for
edge deployments where computational resources are limited.
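As one illustrative option (not something implemented in this project), PyTorch dynamic quantization can shrink the linear layers of the summarization model for faster CPU inference:

import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Dynamically quantize the linear layers to int8 for faster CPU inference
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized_model.state_dict(), "t5_small_quantized.pt")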
6.2 Discussion
CHAPTER 7
CONCLUSION AND FUTURE WORK
7.1 Summary
The project has successfully developed an integrated speech-to-text and automated
summarization pipeline, demonstrating the practical use of modern automatic speech
recognition (ASR) and natural language processing (NLP). By using state-of-the-art
models such as Wav2Vec2 for transcription and BART/T5 for summarization, the
system achieved a functional balance between accuracy and computational efficiency.
The ASR component reliably transcribed clear audio inputs, although its accuracy
dropped on noisy recordings, strong accents, or specialized terminology. Meanwhile,
the summarization module produced coherent and concise outputs, albeit with
occasional factual discrepancies associated with abstractive approaches.
The system architecture proved to be scalable, capable of efficiently processing
large batches of audio files. Latency, however, remained a problem for real-time
applications, and dependency-management issues underscored the importance of robust
environment configuration in production settings. These findings highlight both the
potential and the limitations of current speech-to-text AI systems and provide
valuable insights for future improvements.
7.2 Recommendations
Several targeted improvements should be prioritized to increase system
performance. First, the accuracy of the ASR model could be significantly
improved by fine-tuning it on domain-specific datasets, ensuring better
handling of specialized vocabulary and diverse accents. Incorporating
post-processing correction, such as spell checking and language-model-based
rescoring, would further refine the transcripts by resolving common errors and
improving contextual coherence.
To ensure responsible deployment of the system, ethical considerations,
including bias mitigation and privacy-preserving techniques, should also be
addressed. By pursuing these improvements, the project could evolve into a
comprehensive next-generation tool for transforming unstructured audio into
accessible knowledge.
This project exemplifies the transformative potential of AI in bridging the gap
between spoken language and structured information. While the current
implementation provides a solid foundation, the proposed enhancements outline a
clear path toward a more robust, versatile, and user-focused system. By
prioritizing accuracy, efficiency, and ethical considerations, further phases of
development could establish this technology as an indispensable resource across
sectors such as education, journalism, and corporate communication.
BIBLIOGRAPHY
Processing. EMNLP.
18. Bojar, O., et al. (2014). Findings of the 2014 Workshop on Statistical
Machine Translation (WMT14). ACL.
19. Bojar, O., et al. (2016). Findings of the 2016 Conference on Machine
Translation (WMT16). ACL.
20. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You
Need. Advances in Neural Information Processing Systems (NeurIPS).
SAMPLE CODING
# Install only the essential packages with specific versions to avoid conflicts
!pip install torch torchaudio transformers datasets soundfile librosa --quiet
!pip install -U fsspec # Update fsspec to resolve the conflict
# Verify installations
!pip list | grep -E "torch|transformers|datasets"
import tarfile
import os
from datasets import load_dataset, Audio
from transformers import AutoProcessor, AutoModelForCTC
import torch
import pandas as pd
import os
# List the contents of the directory, adjusting the path to account for potential nesting
import os
from datasets import Dataset, Audio
print(dataset)
model_name = "facebook/wav2vec2-base-960h"  # ASR checkpoint used in this project
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCTC.from_pretrained(model_name).to(device)
def speech_to_text(batch):
    # Process audio files
    inputs = processor(
        batch["audio"]["array"],
        sampling_rate=batch["audio"]["sampling_rate"],
        return_tensors="pt",
        padding=True
    ).to(device)
    # Run inference
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Decode predictions
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    # Store the decoded text so later cells can read results["predicted_text"]
    batch["predicted_text"] = transcription
    return batch
import pandas as pd
# Create a DataFrame for better visualization
df = pd.DataFrame({
    # Access 'path' within each audio item
    "File": [os.path.basename(x["path"]) for x in test_samples["audio"]],
    "Predicted Text": results["predicted_text"],
    "Actual Text": test_samples["text"] if "text" in test_samples.features else ["N/A"] * len(results)
})
# Display results
print("\nAudio to Text Conversion Results:")
display(df)
import librosa
import nltk
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
# Import libraries
from transformers import AutoProcessor, AutoModelForCTC
import torch
import librosa
from IPython.display import Audio, display
def transcribe_from_drive(file_path):
    """
    Transcribe audio file from Google Drive path
    Args:
        file_path (str): Full path to audio file in Google Drive
            Example: '/content/drive/MyDrive/audio_files/103-1240-0002.flac'
    """
    try:
        # Verify file exists
        if not os.path.exists(file_path):
            print(f"Error: File not found at {file_path}")
            return
        # Process audio
        speech, sr = librosa.load(file_path, sr=16000)
        inputs = processor(speech, sampling_rate=sr, return_tensors="pt").to(model.device)
        with torch.no_grad():
            logits = model(**inputs).logits
        # Decode predictions (greedy CTC decoding, as in speech_to_text above)
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = processor.batch_decode(predicted_ids)[0]
        print("\n" + "=" * 50)
        print(f"File: {os.path.basename(file_path)}")
        print(f"Full Path: {file_path}")
        print(f"Duration: {len(speech)/sr:.2f} seconds")
        print("\nTranscription:")
        print("=" * 50)
        print(transcription)
        print("=" * 50)
        return transcription
    except Exception as e:
        print(f"Error processing file: {str(e)}")
        return None

# Example usage:
file_path = "/content/drive/MyDrive/Audio Files/Beside vs Besides_ English In A Minute.mp3"  # Change this to your actual file path
transcription = transcribe_from_drive(file_path)
tokenizer = T5Tokenizer.from_pretrained("t5-small")

def preprocess(example):
    # Tokenize the transcript as input (the field names "text" and "summary" are assumed)
    model_inputs = tokenizer("summarize: " + example["text"],
                             max_length=512, truncation=True, padding="max_length")
    target = example["summary"]
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(target, max_length=150, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
# Instead of using the ASR model for generation, load a text summarization model
from transformers import T5ForConditionalGeneration, T5Tokenizer

# t5-small is assumed here, matching the summarization module described earlier
summarization_tokenizer = T5Tokenizer.from_pretrained("t5-small")
summarization_model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Move the summarization model to the same device as your ASR model if available
summarization_model.to("cuda" if torch.cuda.is_available() else "cpu")
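Finally, a short usage sketch (not part of the original notebook; the generation settings are assumptions) showing how the transcription produced above can be summarized with the loaded model:

# Summarize the transcription obtained earlier
device = "cuda" if torch.cuda.is_available() else "cpu"
inputs = summarization_tokenizer("summarize: " + transcription,
                                 return_tensors="pt", max_length=512,
                                 truncation=True).to(device)
summary_ids = summarization_model.generate(inputs.input_ids, num_beams=4,
                                           max_length=150, min_length=30,
                                           early_stopping=True)
print("\nSummary:")
print(summarization_tokenizer.decode(summary_ids[0], skip_special_tokens=True))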