Speech To Text
MAHAVIDHYALAYA
(Deemed to be University under Section 3 of the UGC Act, 1956)
(Accredited with “A” by NAAC)
Enathur, Kanchipuram – 631561, Tamil Nadu
www.kanchiuniv.ac.in
Guided By
Dr. M. Senthil Kumaran
Associate Professor
Dept. of CSE
Keywords:
• Flask
• PyDub
• Speech Recognition
• Video Processing
Speech to Text Transcripts
Problem Statement:
• Challenge: Current methods of audio and video processing are time-consuming and
cumbersome.
• Manual Tasks: Users need to perform manual conversion and manipulation of audio and
video formats.
• Lack of Automation: Existing methods lack automation and user-friendly interfaces.
• Integration Issues: There is a lack of seamless integration between different processing tasks.
• Productivity Hindrance: Inefficiencies in the workflow hinder productivity for content
creators and professionals.
• Project Objective: Our project aims to streamline audio and video file conversion and
manipulation.
• Solution: We plan to create a user-friendly platform that automates tasks and provides a
unified interface.
• Goal: Improve efficiency and reduce the time and effort required for audio and video
processing.
Literature survey
Proposed Method / Architecture / Algorithm Used / Methodology
• The proposed method involves a speech recognition system that accepts various types of audio inputs.
• It includes four main input methods: microphone, audio file, video file, and camera.
• The system uses the speech_recognition library to perform speech recognition, which interfaces with the Google Web Speech API.
• Depending on the selected input type, the system records audio, processes it, and attempts to recognize the speech.
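The four input paths above can be sketched as a simple dispatch table. This is a minimal illustration; the handler names are placeholders, not the project's actual functions:

```python
# Sketch of the input-type selection described above. The handler
# names are illustrative placeholders, not the project's real code.

def handle_microphone():
    return "microphone"

def handle_audio_file():
    return "audio file"

def handle_video_file():
    return "video file"

def handle_camera():
    return "camera"

# Map the menu choice (1-4) to the matching handler.
INPUT_HANDLERS = {
    "1": handle_microphone,
    "2": handle_audio_file,
    "3": handle_video_file,
    "4": handle_camera,
}

def dispatch(choice):
    # Look up and invoke the handler for the selected input type.
    handler = INPUT_HANDLERS.get(choice)
    if handler is None:
        raise ValueError(f"Unknown input type: {choice}")
    return handler()
```

A dictionary dispatch like this keeps each input source in its own function instead of one long if/elif chain.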
Modules description
Algorithm used
The algorithm used in the project is primarily based on the speech_recognition library for speech recognition,
which abstracts the details of working with the Google Web Speech API. Here's an overview of the key steps of
the algorithm used in the code:
1. Select Input Type:
   • The user is prompted to select an input type, which can be one of the following: microphone input, audio file input, video file input, or camera input.
2. Input Handling:
   • Depending on the selected input type, the code handles the input differently. It includes the following scenarios:
     • Microphone Input:
       • The code enters a loop for real-time microphone input.
       • It continuously records audio using a microphone as the audio source.
       • The recorded audio is then processed for speech recognition.
     • Audio File Input:
       • The user is prompted to provide the path to an MP3 audio file.
       • The code converts the MP3 file to WAV format using the PyDub library.
       • The WAV file is then processed for speech recognition.
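The MP3-to-WAV conversion step can be sketched with PyDub as follows. This is a minimal sketch: PyDub depends on ffmpeg for MP3 decoding, and the optional-argument form of `convert_mp3_to_wav` is an assumption, not the project's exact signature:

```python
import os

def wav_path_for(mp3_file):
    # Derive a sibling .wav path from the given .mp3 path.
    base, _ = os.path.splitext(mp3_file)
    return base + ".wav"

def convert_mp3_to_wav(mp3_file, wav_file=None):
    # PyDub decodes the MP3 (via ffmpeg) and re-exports it as WAV,
    # a format the speech_recognition library can read directly.
    from pydub import AudioSegment  # requires pydub + ffmpeg installed
    if wav_file is None:
        wav_file = wav_path_for(mp3_file)
    AudioSegment.from_mp3(mp3_file).export(wav_file, format="wav")
    return wav_file
```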
     • Video File Input:
       • The user is prompted to provide the path to a video file (e.g., MP4, AVI).
       • The code extracts the audio from the video file using the moviepy library, creating a temporary audio file.
       • The temporary audio file is processed for speech recognition.
     • Camera Input:
       • The code opens the camera stream using OpenCV (cv2).
       • While capturing video frames from the camera, it simultaneously records audio from a microphone.
       • Real-time audio is processed for speech recognition.
3. Speech Recognition:
   • In all scenarios, the speech_recognition library is used to perform speech recognition.
   • The audio data, obtained from the different sources described above, is sent to the Google Web Speech API for recognition.
4. Transcript Display:
   • The recognized speech is displayed as a transcript.
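The video-extraction step can be sketched with moviepy as below. This is an illustrative sketch, assuming moviepy is installed; the temp-file helper is an addition for clarity, not part of the project's code:

```python
import os
import tempfile

def temp_wav_path():
    # Create a unique temporary .wav path for the extracted audio.
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    return path

def extract_audio_from_video(video_path):
    # moviepy opens the video container and writes its audio track
    # out as a standalone WAV file for later speech recognition.
    from moviepy.editor import VideoFileClip  # requires moviepy
    wav_path = temp_wav_path()
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(wav_path)
    clip.close()
    return wav_path
```

Writing to a `tempfile`-generated path avoids name collisions and makes the later cleanup step straightforward.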
5. Error Handling:
   • The code includes error handling for potential issues during speech recognition.
   • It differentiates between two types of errors:
     • sr.UnknownValueError: raised when speech recognition could not understand the audio input.
     • sr.RequestError: raised when there is an issue requesting results from the Google Web Speech API.
6. Cleanup:
   • Temporary audio files created during audio file input and video file input are cleaned up to avoid cluttering the file system.
The algorithm's core component is the use of the speech_recognition library to capture and process audio data and then send it to the Google Web Speech API for recognition. The code is designed to handle various input sources and provide real-time or batch recognition based on the chosen input type.
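The recognition, error-handling, and cleanup steps above can be combined into one helper. This is a sketch, assuming the SpeechRecognition package and network access to the Google Web Speech API; the helper names are illustrative:

```python
import os

def remove_if_exists(path):
    # Cleanup step: delete a temporary file if it is present.
    if os.path.exists(path):
        os.remove(path)

def recognize_wav(wav_path, cleanup=False):
    # Send a WAV file to the Google Web Speech API via the
    # speech_recognition library, handling the two error types the
    # slides describe, then optionally remove the temporary file.
    import speech_recognition as sr  # requires SpeechRecognition
    r = sr.Recognizer()
    try:
        with sr.AudioFile(wav_path) as source:
            audio = r.record(source)
        return r.recognize_google(audio)
    except sr.UnknownValueError:
        return "Could not understand the audio."
    except sr.RequestError as e:
        return f"Could not request results: {e}"
    finally:
        if cleanup:
            remove_if_exists(wav_path)
```

The `finally` clause guarantees the cleanup step runs whether recognition succeeds or raises.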
Methodology
Step 1: Select Input Type
• Choose an input type by entering a number (1 for microphone, 2 for audio file, 3 for video file, 4 for
camera).
Step 2: Input Handling
• Depending on your choice:
• Microphone Input: Speak into the microphone. Recognized speech is displayed.
• Audio File Input: Provide an MP3 file path. The code converts it to WAV and displays the recognized
speech.
• Video File Input: Provide a video file path. Audio is extracted, recognized, and displayed.
• Camera Input: Camera captures video, and real-time speech is recognized.
Step 3: Error Handling
• The code handles errors and provides error messages as needed.
Step 4: Cleanup
• If you use audio or video file input, temporary files are cleaned up.
Step 5: Interaction Completion
• The code continues to prompt for input or allows you to exit based on your choice.
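Steps 1–5 can be tied together in a single prompt loop, sketched below. The prompt handling here is illustrative and may differ from the project's exact code; any unrecognized entry ends the loop in this sketch:

```python
# Illustrative main loop for the five methodology steps above.

MENU = {
    "1": "microphone",
    "2": "audio file",
    "3": "video file",
    "4": "camera",
}

def parse_choice(raw):
    # Step 1: map the user's entry to an input type; None means exit
    # (any unrecognized entry also ends the loop in this sketch).
    raw = raw.strip()
    if raw.lower() in ("q", "quit", "exit"):
        return None
    return MENU.get(raw)

def run(inputs, handle):
    # Steps 2-5: handle each choice in turn until the user exits.
    transcripts = []
    for raw in inputs:
        choice = parse_choice(raw)
        if choice is None:
            break  # Step 5: interaction completion
        transcripts.append(handle(choice))
    return transcripts
```

Passing the handler in as a function makes the loop easy to test without a real microphone or network connection.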
Implementation Details
1. Importing Required Libraries:
   • The code begins by importing the necessary Python libraries, including os, random, speech_recognition, pydub, moviepy.editor, and cv2. These libraries provide the various functionalities used below.
2. Functions for Audio and Video Processing:
   • The code defines two key functions:
     • extract_audio_from_video(video_path): extracts the audio from a video file and returns the path to a temporary audio file in WAV format.
     • convert_mp3_to_wav(mp3_file, wav_file): converts an MP3 audio file to WAV format using the PyDub library.
3. Setting Up the Speech Recognizer:
   • The code initializes a speech recognizer object (r) from the speech_recognition library. This object manages the speech recognition process.
4. Input Type Selection:
   • The code prompts the user to select an input type (microphone, audio file, video file, or camera) and stores the choice in the input_type variable.
5. Handling Different Input Types:
   • The code includes conditional branches for each input type:
     • Microphone Input:
       • Enters a loop for real-time audio input via the microphone.
       • Records audio, attempts speech recognition, and displays the recognized transcript.
       • If a specific "easter egg" trigger phrase is detected, it responds with "VANI."
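The microphone branch, including the easter-egg check, might look like the sketch below. The trigger phrase is a placeholder, since the slides do not state the actual phrase; only the "VANI" response comes from the slides:

```python
TRIGGER_PHRASE = "hello vani"  # placeholder; the real phrase is not shown in the slides

def respond_to(transcript):
    # Easter-egg check: answer "VANI" when the trigger phrase is
    # heard; otherwise echo the recognized transcript.
    if TRIGGER_PHRASE in transcript.lower():
        return "VANI"
    return transcript

def microphone_loop(max_utterances=5):
    # Real-time loop: listen on the default microphone, recognize
    # each utterance, and print the (possibly easter-egg) response.
    import speech_recognition as sr  # requires SpeechRecognition + PyAudio
    r = sr.Recognizer()
    for _ in range(max_utterances):
        with sr.Microphone() as source:
            audio = r.listen(source)
        try:
            print(respond_to(r.recognize_google(audio)))
        except sr.UnknownValueError:
            print("Could not understand the audio.")
```

Keeping the trigger check in its own small function separates the easter-egg logic from the audio-capture loop.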
7. Cleanup:
   • If audio file input or video file input is used, the code removes the temporary audio files to avoid cluttering the file system.
8. Interaction Completion:
   • After using the selected input type, the code either continues to prompt for more input or allows the user to exit, based on their choice.
Demo
Results
Conclusion
In summary, the code is a versatile speech recognition system that handles multiple input sources, including microphones, audio files, video files, and cameras. It uses the speech_recognition library to capture, process, and recognize spoken words, and includes error handling and automatic cleanup of temporary files. The result is a flexible, interactive tool for a range of speech recognition applications.
Future Enhancements
1. Multi-Language Support: Extend the system to recognize and work with multiple languages,
allowing users to choose the language they want to use.
2. Improved Accuracy: Implement techniques to enhance speech recognition accuracy, such as model
fine-tuning and noise reduction algorithms.
3. Voice Command Integration: Integrate the system with other applications, devices, or services to
respond to voice commands, providing automation and control.
4. Natural Language Processing (NLP): Incorporate NLP techniques to extract meaning and context
from recognized speech, enabling more sophisticated interactions.
5. Real-Time Transcription: Provide real-time transcription of speeches, lectures, or meetings, making
it a valuable tool for note-taking and accessibility.
6. Cloud Integration: Implement cloud-based speech recognition services for scalability and improved
recognition accuracy.