
SRI CHANDRASEKHARENDRA SARASWATHI VISWA MAHAVIDHYALAYA
(Deemed to be University u/s 3 of the UGC Act, 1956)
(Accredited with “A” Grade by NAAC)
Enathur, Kanchipuram – 631561, Tamil Nadu
www.kanchiuniv.ac.in

BCSF187Z50 - Project Work - Phase I


Speech to Text Transcripts

Vustepalle Aniketh (11209A021)
Ramacharla Saiteja (11209A013)

Guided by
Dr. M. Senthil Kumaran
Associate Professor
Dept. of CSE

University Review: <<Date>>


Abstract:
• This project focuses on speech recognition and audio processing.
• It aims to convert spoken words into text and process audio data.
• Various input sources, including microphones, audio files, video files, and cameras, are supported.
• Audio Conversion: The project includes capabilities for converting audio data between different formats,
such as MP3 to WAV, enabling versatile input compatibility.
• Web Application: The system is implemented as a web application using Flask, making it accessible and
user-friendly through web browsers, adding to its convenience and usability.

Keywords:
• Flask
• PyDub
• Speech Recognition
• Video Processing

Problem Statement:
• Challenge: Current methods of audio and video processing are time-consuming and
cumbersome.
• Manual Tasks: Users need to perform manual conversion and manipulation of audio and
video formats.
• Lack of Automation: Existing methods lack automation and user-friendly interfaces.
• Integration Issues: There is a lack of seamless integration between different processing tasks.
• Productivity Hindrance: Inefficiencies in the workflow hinder productivity for content
creators and professionals.
• Project Objective: Our project aims to streamline audio and video file conversion and
manipulation.
• Solution: We plan to create a user-friendly platform that automates tasks and provides a
unified interface.
• Goal: Improve efficiency and reduce the time and effort required for audio and video
processing.


Existing system and its drawbacks:


The existing system for audio and video processing relies on manual processes that have several drawbacks.
Users must manually convert audio and video files, which consumes a lot of time and effort. Moreover,
different tools are used for various processing tasks, resulting in a lack of integration and efficiency. The lack
of automation also means that users must perform repetitive and time-consuming tasks, which can hinder their
productivity and creativity.
Drawbacks of existing system:
• Manual file conversion consumes significant time and effort.
• Different tools are needed for different processing tasks, with no integration between them.
• No automation for repetitive, time-consuming tasks.
• Users struggle to handle audio and video processing tasks seamlessly.
• Manual processes produce inconsistent results and errors.
• The resulting inefficiency hampers productivity, content creation, and professional work, and degrades the user experience.

Literature survey
Proposed Method / Architecture / Algorithm / Methodology

• The proposed method involves a speech recognition system that accepts various types of audio inputs.
• It includes four main input methods: microphone, audio file, video file, and camera.
• The system uses the speech_recognition library to perform speech recognition, which interfaces with the Google Web Speech API.
• Depending on the selected input type, the system records audio, processes it, and attempts to recognize the speech.
Modules description

The system relies on several Python libraries and modules:


• speech_recognition for speech recognition and managing audio input.
• pydub for converting audio file formats, such as MP3 to WAV.
• moviepy.editor for extracting audio from video files.
• cv2 (OpenCV) for capturing and displaying camera input.
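Before the system starts, the presence of these modules can be probed with the standard library alone. This is a small sketch, not part of the project itself; the roles in the comments restate the list above.

```python
from importlib.util import find_spec

# Roles of the third-party modules listed above. The dict keys are the import
# names; the corresponding PyPI package names (SpeechRecognition, pydub,
# moviepy, opencv-python) are assumptions, not taken from the project.
REQUIRED = {
    "speech_recognition": "speech recognition and managing audio input",
    "pydub": "converting audio file formats, such as MP3 to WAV",
    "moviepy": "extracting audio from video files",
    "cv2": "capturing and displaying camera input (OpenCV)",
}

def missing_modules(required=REQUIRED):
    """Return the import names that are not available in this environment."""
    return [name for name in required if find_spec(name) is None]
```

Calling missing_modules() at startup gives a clear message about what to install instead of a mid-run ImportError.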

Algorithm used

The algorithm used in the project is primarily based on the speech_recognition library for speech recognition,
which abstracts the details of working with the Google Web Speech API. Here's an overview of the key steps of
the algorithm used in the code:
1. Select Input Type:
• The user is prompted to select an input type, which can be one of the following: microphone input, audio file input, video file input, or camera input.
2. Input Handling:
• Depending on the selected input type, the code handles the input differently. It includes the following scenarios:
  • Microphone Input:
    • The code enters a loop for real-time microphone input.
    • It continuously records audio using a microphone as the audio source.
    • The recorded audio is then processed for speech recognition.
  • Audio File Input:
    • The user is prompted to provide the path to an MP3 audio file.
    • The code converts the MP3 file to WAV format using the PyDub library.
    • The WAV file is then processed for speech recognition.
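The MP3-to-WAV step described above might be sketched as follows, assuming PyDub is installed and FFmpeg is on the PATH (PyDub shells out to FFmpeg to decode MP3):

```python
def convert_mp3_to_wav(mp3_file, wav_file):
    """Convert an MP3 file to WAV so speech_recognition can read it."""
    # Imported lazily so this sketch can be defined even where PyDub is absent.
    from pydub import AudioSegment
    AudioSegment.from_mp3(mp3_file).export(wav_file, format="wav")
    return wav_file
```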

  • Video File Input:
    • The user is prompted to provide the path to a video file (e.g., MP4, AVI).
    • The code extracts the audio from the video file using the moviepy library, creating a temporary audio file.
    • The temporary audio file is processed for speech recognition.
  • Camera Input:
    • The code opens the camera stream using OpenCV (cv2).
    • While capturing video frames from the camera, it simultaneously records audio from a microphone.
    • Real-time audio is processed for speech recognition.
3. Speech Recognition:
• In all scenarios, the speech_recognition library is used to perform speech recognition.
• The audio data, obtained from the different sources described above, is sent to the Google Web Speech API for recognition.
4. Transcript Display:
• The recognized speech is displayed as a transcript.
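For a file-based source, steps 3 and 4 can be sketched like this, assuming the SpeechRecognition package; recognize_google sends the captured audio to the Google Web Speech API over the network:

```python
def transcribe(wav_path):
    """Recognize speech from a WAV file and return the transcript."""
    # Imported lazily so the sketch can be defined without the package present.
    import speech_recognition as sr
    r = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = r.record(source)       # read the whole file into an AudioData object
    return r.recognize_google(audio)   # network call to the Google Web Speech API
```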

5. Error Handling:
• The code includes error handling for potential issues during speech recognition.
• It differentiates between two types of errors:
  • sr.UnknownValueError: raised when speech recognition could not understand the audio input.
  • sr.RequestError: raised when there is an issue requesting results from the Google Web Speech API.
6. Cleanup:
• Temporary audio files created during audio file input and video file input are removed to avoid cluttering the file system.

The algorithm's core component is the use of the speech_recognition library to capture and process audio data and then send it to the Google Web Speech API for recognition. The code is designed to handle various input sources and provide real-time or batch recognition based on the chosen input type.
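The two-way error handling in step 5 can be sketched as a wrapper around the recognition call, assuming SpeechRecognition's exception names; the fallback messages are illustrative:

```python
def safe_transcribe(recognizer, audio):
    """Attempt recognition, returning a readable message instead of raising."""
    # Imported lazily so the sketch can be defined without the package present.
    import speech_recognition as sr
    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        # Raised when the recognizer could not understand the audio input.
        return "[unintelligible audio]"
    except sr.RequestError as err:
        # Raised when the request to the Google Web Speech API fails.
        return f"[API request failed: {err}]"
```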

Methodology
Step 1: Select Input Type
• Choose an input type by entering a number (1 for microphone, 2 for audio file, 3 for video file, 4 for
camera).
Step 2: Input Handling
• Depending on your choice:
• Microphone Input: Speak into the microphone. Recognized speech is displayed.
• Audio File Input: Provide an MP3 file path. The code converts it to WAV and displays the recognized
speech.
• Video File Input: Provide a video file path. Audio is extracted, recognized, and displayed.
• Camera Input: The camera captures video while speech is recognized in real time.
Step 3: Error Handling
• The code handles errors and provides error messages as needed.
Step 4: Cleanup
• If you use audio or video file input, temporary files are cleaned up.
Step 5: Interaction Completion
• The code continues to prompt for input or allows you to exit based on your choice.
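The menu in Step 1 reduces to a small lookup. The names below are illustrative placeholders for this sketch, not the project's actual identifiers:

```python
# Map the user's numeric choice to an input-source name, as described in
# Step 1 (1 = microphone, 2 = audio file, 3 = video file, 4 = camera).
INPUT_TYPES = {
    "1": "microphone",
    "2": "audio_file",
    "3": "video_file",
    "4": "camera",
}

def select_input(choice):
    """Return the input-source name for a menu choice, or None if invalid."""
    return INPUT_TYPES.get(choice.strip())
```

Returning None for an unrecognized choice lets the caller re-prompt instead of crashing on bad input.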

Implementation Details
1. Importing Required Libraries:
• The code begins by importing the necessary Python libraries, including os, random, speech_recognition, pydub, moviepy.editor, and cv2. These libraries provide the functionalities described below.
2. Functions for Audio and Video Processing:
• The code defines two key functions:
  • extract_audio_from_video(video_path): extracts audio from a video file and returns the path to the temporary audio file in WAV format.
  • convert_mp3_to_wav(mp3_file, wav_file): converts an MP3 audio file to WAV format using the PyDub library.
3. Setting Up the Speech Recognizer:
• The code initializes a speech recognizer object (r) from the speech_recognition library. This object manages the speech recognition process.
4. Input Type Selection:
• The code prompts the user to select an input type (microphone, audio file, video file, or camera) and stores the choice in the input_type variable.
5. Handling Different Input Types:
• The code includes conditional branches for each input type:
  • Microphone Input:
    • Enters a loop for real-time audio input via the microphone.
    • Records audio, attempts speech recognition, and displays the recognized transcript.
    • If a specific "easter egg" trigger phrase is detected, it responds with "VANI."
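The microphone branch might look like the sketch below, assuming SpeechRecognition with PyAudio for microphone access; the trigger phrase is a placeholder, since the actual easter-egg phrase is not given here:

```python
TRIGGER_PHRASE = "<easter egg phrase>"   # placeholder; the real phrase is project-specific

def microphone_loop():
    """Continuously listen on the microphone and print recognized transcripts."""
    # Imported lazily so the sketch can be defined without the package present.
    import speech_recognition as sr
    r = sr.Recognizer()
    while True:
        with sr.Microphone() as source:
            r.adjust_for_ambient_noise(source)   # calibrate for background noise
            audio = r.listen(source)
        try:
            text = r.recognize_google(audio)
        except sr.UnknownValueError:
            continue                             # nothing intelligible; keep listening
        print("VANI" if TRIGGER_PHRASE in text.lower() else text)
```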


  • Audio File Input:
    • Prompts the user to provide the path to an MP3 audio file.
    • Converts the MP3 file to WAV format using the PyDub library.
    • Records and processes the audio for speech recognition, displaying the recognized transcript.
  • Video File Input:
    • Prompts the user to provide the path to a video file (e.g., MP4, AVI).
    • Extracts audio from the video using the moviepy library, creating a temporary audio file.
    • Records and processes the extracted audio for speech recognition, displaying the recognized transcript with sentences separated.
  • Camera Input:
    • Opens the computer's camera using OpenCV.
    • Captures video frames from the camera and records audio simultaneously.
    • Listens for speech in real time, attempts recognition, and displays the recognized transcript.
    • Similar to microphone input, it checks for the "easter egg" trigger phrase.
6. Error Handling:
• Throughout the code, error handling is implemented for two types of errors:
  • sr.UnknownValueError: occurs when the system could not understand the audio.
  • sr.RequestError: occurs when there are issues requesting results from the Google Web Speech API.
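The video side of the camera branch can be sketched with OpenCV as below; the audio side runs alongside it as in the microphone branch. Quitting on the 'q' key is an assumption of this sketch:

```python
def camera_preview():
    """Show live camera frames until the user presses 'q'."""
    # Imported lazily so the sketch can be defined without OpenCV present.
    import cv2
    cap = cv2.VideoCapture(0)                  # open the default camera
    try:
        while True:
            ok, frame = cap.read()
            if not ok:                         # camera unavailable or stream ended
                break
            cv2.imshow("camera", frame)        # display the current frame
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()
```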


7. Cleanup:
• If audio file input or video file input is used, the code removes the temporary audio files to avoid cluttering the file system.
8. Interaction Completion:
• After handling the selected input type, the code either continues to prompt for more input or allows the user to exit, based on the user's choice.

Demo

Results

Conclusion
In summary, the provided code is a versatile speech recognition system that can handle different input sources,
including microphones, audio files, video files, and cameras. It uses the speech_recognition library to capture,
process, and recognize spoken words. The code also includes error handling and automatic cleanup of
temporary files. This system offers a flexible and interactive tool for various speech recognition applications.

Future Enhancements
1. Multi-Language Support: Extend the system to recognize and work with multiple languages,
allowing users to choose the language they want to use.
2. Improved Accuracy: Implement techniques to enhance speech recognition accuracy, such as model
fine-tuning and noise reduction algorithms.
3. Voice Command Integration: Integrate the system with other applications, devices, or services to
respond to voice commands, providing automation and control.
4. Natural Language Processing (NLP): Incorporate NLP techniques to extract meaning and context
from recognized speech, enabling more sophisticated interactions.
5. Real-Time Transcription: Provide real-time transcription of speeches, lectures, or meetings, making
it a valuable tool for note-taking and accessibility.
6. Cloud Integration: Implement cloud-based speech recognition services for scalability and improved
recognition accuracy.
