Speech To Text
MAHAVIDHYALAYA
(Deemed to be University under Section 3 of the UGC Act, 1956)
(Accredited with “A” by NAAC)
Enathur, Kanchipuram – 631561, Tamil Nadu
www.kanchiuniv.ac.in
Guided By
Dr. M. Senthil Kumaran
Associate Professor
Dept. of CSE
Keywords:
• Flask
• PyDub
• Speech Recognition
• Video Processing
Speech to Text Transcripts
Problem Statement:
• Challenge: Current methods of audio and video processing are time-consuming and
cumbersome.
• Manual Tasks: Users need to perform manual conversion and manipulation of audio and
video formats.
• Lack of Automation: Existing methods lack automation and user-friendly interfaces.
• Integration Issues: There is a lack of seamless integration between different processing tasks.
• Productivity Hindrance: Inefficiencies in the workflow hinder productivity for content
creators and professionals.
• Project Objective: Our project aims to streamline audio and video file conversion and
manipulation.
• Solution: We plan to create a user-friendly platform that automates tasks and provides a
unified interface.
• Goal: Improve efficiency and reduce the time and effort required for audio and video
processing.
Literature survey
Proposed Method / Architecture / Algorithm Used / Methodology
• The proposed method involves a speech recognition system that accepts various types of audio inputs.
• It includes four main input methods: microphone, audio file, video file, and camera.
• The system uses the speech_recognition library to perform speech recognition, which interfaces with the Google Web Speech API.
• Depending on the selected input type, the system records audio, processes it, and attempts to recognize the speech.
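The four input paths above can be sketched as a simple dispatch table. This is a minimal illustration; the handler names are placeholders, not the project's actual functions:

```python
# Sketch of the input-type selection described above. The handler
# names are illustrative placeholders, not the project's real code.

def handle_microphone():
    return "microphone"

def handle_audio_file():
    return "audio file"

def handle_video_file():
    return "video file"

def handle_camera():
    return "camera"

# Map the menu choice (1-4) to the matching handler.
INPUT_HANDLERS = {
    "1": handle_microphone,
    "2": handle_audio_file,
    "3": handle_video_file,
    "4": handle_camera,
}

def dispatch(choice):
    # Look up and invoke the handler for the selected input type.
    handler = INPUT_HANDLERS.get(choice)
    if handler is None:
        raise ValueError(f"Unknown input type: {choice}")
    return handler()
```

A dictionary dispatch like this keeps each input source in its own function instead of one long if/elif chain.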
Modules description
Algorithm used
The algorithm used in the project is primarily based on the speech_recognition library for speech recognition,
which abstracts the details of working with the Google Web Speech API. Here's an overview of the key steps of
the algorithm used in the code:
1. Select Input Type:
   • The user is prompted to select an input type, which can be one of the following: microphone input, audio file input, video file input, or camera input.
2. Input Handling:
   • Depending on the selected input type, the code handles the input differently. It includes the following scenarios:
     • Microphone Input:
       • The code enters a loop for real-time microphone input.
       • It continuously records audio using a microphone as the audio source.
       • The recorded audio is then processed for speech recognition.
     • Audio File Input:
       • The user is prompted to provide the path to an MP3 audio file.
       • The code converts the MP3 file to WAV format using the PyDub library.
       • The WAV file is then processed for speech recognition.
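The MP3-to-WAV conversion step can be sketched with PyDub as follows. This is a minimal sketch: PyDub depends on ffmpeg for MP3 decoding, and the optional-argument form of `convert_mp3_to_wav` is an assumption, not the project's exact signature:

```python
import os

def wav_path_for(mp3_file):
    # Derive a sibling .wav path from the given .mp3 path.
    base, _ = os.path.splitext(mp3_file)
    return base + ".wav"

def convert_mp3_to_wav(mp3_file, wav_file=None):
    # PyDub decodes the MP3 (via ffmpeg) and re-exports it as WAV,
    # a format the speech_recognition library can read directly.
    from pydub import AudioSegment  # requires pydub + ffmpeg installed
    if wav_file is None:
        wav_file = wav_path_for(mp3_file)
    AudioSegment.from_mp3(mp3_file).export(wav_file, format="wav")
    return wav_file
```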
     • Video File Input:
       • The user is prompted to provide the path to a video file (e.g., MP4, AVI).
       • The code extracts the audio from the video file using the moviepy library, creating a temporary audio file.
       • The temporary audio file is processed for speech recognition.
     • Camera Input:
       • The code opens the camera stream using OpenCV (cv2).
       • While capturing video frames from the camera, it simultaneously records audio from a microphone.
       • Real-time audio is processed for speech recognition.
3. Speech Recognition:
   • In all scenarios, the speech_recognition library is used to perform speech recognition.
   • The audio data, obtained from the different sources described above, is sent to the Google Web Speech API for recognition.
4. Transcript Display:
   • The recognized speech is displayed as a transcript.
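The video-extraction step can be sketched with moviepy as below. This is an illustrative sketch, assuming moviepy is installed; the temp-file helper is an addition for clarity, not part of the project's code:

```python
import os
import tempfile

def temp_wav_path():
    # Create a unique temporary .wav path for the extracted audio.
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    return path

def extract_audio_from_video(video_path):
    # moviepy opens the video container and writes its audio track
    # out as a standalone WAV file for later speech recognition.
    from moviepy.editor import VideoFileClip  # requires moviepy
    wav_path = temp_wav_path()
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(wav_path)
    clip.close()
    return wav_path
```

Writing to a `tempfile`-generated path avoids name collisions and makes the later cleanup step straightforward.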
5. Error Handling:
   • The code includes error handling for potential issues during speech recognition.
   • It differentiates between two types of errors:
     • sr.UnknownValueError: raised when speech recognition could not understand the audio input.
     • sr.RequestError: raised when there is an issue requesting results from the Google Web Speech API.
6. Cleanup:
   • Temporary audio files created during audio file input and video file input are cleaned up to avoid cluttering the file system.
The algorithm's core component is the use of the speech_recognition library to capture and process audio data and then send it to the Google Web Speech API for recognition. The code is designed to handle various input sources and provide real-time or batch recognition based on the chosen input type.
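The recognition, error-handling, and cleanup steps above can be combined into one helper. This is a sketch, assuming the SpeechRecognition package and network access to the Google Web Speech API; the helper names are illustrative:

```python
import os

def remove_if_exists(path):
    # Cleanup step: delete a temporary file if it is present.
    if os.path.exists(path):
        os.remove(path)

def recognize_wav(wav_path, cleanup=False):
    # Send a WAV file to the Google Web Speech API via the
    # speech_recognition library, handling the two error types the
    # slides describe, then optionally remove the temporary file.
    import speech_recognition as sr  # requires SpeechRecognition
    r = sr.Recognizer()
    try:
        with sr.AudioFile(wav_path) as source:
            audio = r.record(source)
        return r.recognize_google(audio)
    except sr.UnknownValueError:
        return "Could not understand the audio."
    except sr.RequestError as e:
        return f"Could not request results: {e}"
    finally:
        if cleanup:
            remove_if_exists(wav_path)
```

The `finally` clause guarantees the cleanup step runs whether recognition succeeds or raises.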
Methodology
Step 1: Select Input Type
• Choose an input type by entering a number (1 for microphone, 2 for audio file, 3 for video file, 4 for
camera).
Step 2: Input Handling
• Depending on your choice:
• Microphone Input: Speak into the microphone. Recognized speech is displayed.
• Audio File Input: Provide an MP3 file path. The code converts it to WAV and displays the recognized
speech.
• Video File Input: Provide a video file path. Audio is extracted, recognized, and displayed.
• Camera Input: Camera captures video, and real-time speech is recognized.
Step 3: Error Handling
• The code handles errors and provides error messages as needed.
Step 4: Cleanup
• If you use audio or video file input, temporary files are cleaned up.
Step 5: Interaction Completion
• The code continues to prompt for input or allows you to exit based on your choice.
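Steps 1–5 can be tied together in a single prompt loop, sketched below. The prompt handling here is illustrative and may differ from the project's exact code; any unrecognized entry ends the loop in this sketch:

```python
# Illustrative main loop for the five methodology steps above.

MENU = {
    "1": "microphone",
    "2": "audio file",
    "3": "video file",
    "4": "camera",
}

def parse_choice(raw):
    # Step 1: map the user's entry to an input type; None means exit
    # (any unrecognized entry also ends the loop in this sketch).
    raw = raw.strip()
    if raw.lower() in ("q", "quit", "exit"):
        return None
    return MENU.get(raw)

def run(inputs, handle):
    # Steps 2-5: handle each choice in turn until the user exits.
    transcripts = []
    for raw in inputs:
        choice = parse_choice(raw)
        if choice is None:
            break  # Step 5: interaction completion
        transcripts.append(handle(choice))
    return transcripts
```

Passing the handler in as a function makes the loop easy to test without a real microphone or network connection.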
Implementation Details
1. Importing Required Libraries:
   • The code begins by importing the necessary Python libraries, including os, random, speech_recognition, pydub, moviepy.editor, and cv2. These libraries provide the various functionalities used below.
2. Functions for Audio and Video Processing:
   • The code defines two key functions:
     • extract_audio_from_video(video_path): extracts the audio from a video file and returns the path to a temporary audio file in WAV format.
     • convert_mp3_to_wav(mp3_file, wav_file): converts an MP3 audio file to WAV format using the PyDub library.
3. Setting Up the Speech Recognizer:
   • The code initializes a speech recognizer object (r) from the speech_recognition library. This object manages the speech recognition process.
4. Input Type Selection:
   • The code prompts the user to select an input type (microphone, audio file, video file, or camera) and stores the choice in the input_type variable.
5. Handling Different Input Types:
   • The code includes conditional branches for each input type:
     • Microphone Input:
       • Enters a loop for real-time audio input via the microphone.
       • Records audio, attempts speech recognition, and displays the recognized transcript.
       • If a specific "easter egg" trigger phrase is detected, it responds with "VANI."
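The microphone branch, including the easter-egg check, might look like the sketch below. The trigger phrase is a placeholder, since the slides do not state the actual phrase; only the "VANI" response comes from the slides:

```python
TRIGGER_PHRASE = "hello vani"  # placeholder; the real phrase is not shown in the slides

def respond_to(transcript):
    # Easter-egg check: answer "VANI" when the trigger phrase is
    # heard; otherwise echo the recognized transcript.
    if TRIGGER_PHRASE in transcript.lower():
        return "VANI"
    return transcript

def microphone_loop(max_utterances=5):
    # Real-time loop: listen on the default microphone, recognize
    # each utterance, and print the (possibly easter-egg) response.
    import speech_recognition as sr  # requires SpeechRecognition + PyAudio
    r = sr.Recognizer()
    for _ in range(max_utterances):
        with sr.Microphone() as source:
            audio = r.listen(source)
        try:
            print(respond_to(r.recognize_google(audio)))
        except sr.UnknownValueError:
            print("Could not understand the audio.")
```

Keeping the trigger check in its own small function separates the easter-egg logic from the audio-capture loop.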
7. Cleanup:
   • If audio file input or video file input is used, the code removes the temporary audio files to avoid cluttering the file system.
8. Interaction Completion:
   • After using the selected input type, the code either continues to prompt for more input or allows the user to exit, based on their choice.
Demo
Results
Conclusion
In summary, the code is a versatile speech recognition system that handles multiple input sources, including microphones, audio files, video files, and cameras. It uses the speech_recognition library to capture, process, and recognize spoken words, and includes error handling and automatic cleanup of temporary files. The result is a flexible, interactive tool for a range of speech recognition applications.
Future Enhancements
1. Multi-Language Support: Extend the system to recognize and work with multiple languages,
allowing users to choose the language they want to use.
2. Improved Accuracy: Implement techniques to enhance speech recognition accuracy, such as model
fine-tuning and noise reduction algorithms.
3. Voice Command Integration: Integrate the system with other applications, devices, or services to
respond to voice commands, providing automation and control.
4. Natural Language Processing (NLP): Incorporate NLP techniques to extract meaning and context
from recognized speech, enabling more sophisticated interactions.
5. Real-Time Transcription: Provide real-time transcription of speeches, lectures, or meetings, making
it a valuable tool for note-taking and accessibility.
6. Cloud Integration: Implement cloud-based speech recognition services for scalability and improved
recognition accuracy.