Labs 9

This document outlines an experiment focused on implementing Text to Speech (TTS) recognition and synthesis using APIs. It provides a detailed description of speech synthesis and recognition, including steps to use the gTTS library for converting text to speech and the SpeechRecognition library for transcribing audio files in Google Colab. Additionally, it discusses speech segmentation and presents tasks for generating audio and comparing transcriptions.


EXPERIMENT 9: Text-to-Speech Synthesis and Speech Recognition through APIs

Aim:
To implement text-to-speech synthesis and speech recognition through APIs.

Description:
1. Speech synthesis is the artificial production of human speech. A computer system used
for this purpose is called a speech synthesizer, and can be implemented in software or
hardware products. A text-to-speech (TTS) system converts normal language text into
speech; other systems render symbolic linguistic representations like phonetic
transcriptions into speech. The reverse process is speech recognition.
Synthesized speech can be created by concatenating pieces of recorded speech that are
stored in a database. Systems differ in the size of the stored speech units; a system that
stores phones or diphones provides the largest output range, but may lack clarity. For
specific usage domains, the storage of entire words or sentences allows for high-quality
output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other
human voice characteristics to create a completely "synthetic" voice output.
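The concatenative approach described above can be illustrated with a toy sketch in which the stored "units" are just small sample arrays that get joined in order to form an utterance. The unit inventory below is a made-up illustration, not real recorded speech, and real systems store thousands of recorded diphones with smoothing at the joins.

```python
# Toy concatenative synthesis: stitch stored waveform units together.
# The "units" below are fabricated sample lists standing in for recorded diphones.
units = {
    "he": [0.1, 0.3, 0.2],
    "el": [0.2, 0.4],
    "lo": [0.1, -0.2, 0.0],
}

def synthesize(unit_names, inventory):
    """Concatenate stored units, in order, into one waveform."""
    waveform = []
    for name in unit_names:
        waveform.extend(inventory[name])
    return waveform

print(synthesize(["he", "el", "lo"], units))
# -> [0.1, 0.3, 0.2, 0.2, 0.4, 0.1, -0.2, 0.0]
```

The trade-off mentioned above shows up directly here: smaller units (phones, diphones) cover more possible utterances but join more often, and each join is a chance for an audible artifact.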

Here’s how you can use Google Text-to-Speech (gTTS) in Google Colab to convert text to
speech. The gTTS library is straightforward and well-suited for generating speech in multiple
languages.

Step 1: Install the gTTS Library

First, install the gTTS library.

# Install gTTS
!pip install gTTS

Step 2: Convert Text to Speech and Save as an Audio File

Now, you can use the gTTS library to convert text into speech and save it as an MP3 file.

from gtts import gTTS

# Specify the text and language
text = "Hello, my name is (your full name)."
language = 'en'  # Change to any supported language code, e.g. 'es' for Spanish, 'fr' for French

# Create a gTTS object
speech = gTTS(text=text, lang=language, slow=False)

# Save the audio file
speech.save("output_audio.mp3")

Step 3: Play the Audio (Optional)

You can play the audio file directly in Colab using the IPython.display.Audio function.

from IPython.display import Audio

# Play the saved audio file
Audio("output_audio.mp3")

This code will convert your text into speech, save it as an MP3 file, and allow you to play or
download it directly from Colab.

2. Speech recognition, also known as automatic speech recognition (ASR), computer
speech recognition, or speech-to-text, is a capability that enables a program to
process human speech into a written format. While it is commonly confused with voice
recognition, speech recognition focuses on translating speech from a verbal format
to a text one, whereas voice recognition seeks only to identify an individual user's
voice.
Key features of effective speech recognition
Many speech recognition applications and devices are available, but the more advanced
solutions use AI and machine learning. They integrate the grammar, syntax, structure, and
composition of audio and voice signals to understand and process human speech.
Ideally, they learn as they go, evolving their responses with each interaction.
The best systems also allow organizations to customize and adapt the technology
to their specific requirements, from language and nuances of speech to
brand recognition. For example:
- Language weighting: improve precision by weighting specific words that are
spoken frequently (such as product names or industry jargon), beyond terms already
in the base vocabulary.
- Speaker labeling: output a transcription that cites or tags each speaker's
contributions to a multi-participant conversation.
- Acoustics training: attend to the acoustic side of the business. Train the system to
adapt to an acoustic environment (such as the ambient noise in a call center) and to
speaker styles (voice pitch, volume, and pace).
- Profanity filtering: use filters to identify certain words or phrases and sanitize
speech output.
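As a toy illustration of the profanity-filtering idea above, a transcript can be post-processed against a blocklist. The blocklist, the masking scheme, and the `filter_transcript` helper here are illustrative assumptions for this sketch, not part of any speech API.

```python
import re

def filter_transcript(text, blocklist):
    """Mask each blocklisted word in the transcript with asterisks."""
    for word in blocklist:
        # \b word boundaries keep us from masking substrings inside longer words
        pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
        text = pattern.sub('*' * len(word), text)
    return text

print(filter_transcript("darn this darned machine", ["darn"]))
# -> **** this darned machine
```

Production systems typically apply such filters on the recognizer's output text, exactly as this sketch does, so the audio itself is never altered.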

Here’s a Python code snippet to perform speech recognition in Google Colab using the
SpeechRecognition library. This code will allow you to transcribe speech from an audio
file.

Step 1: Install Required Libraries

First, you need to install the SpeechRecognition and pydub libraries. pydub is useful for
handling audio files in different formats.

# Install the required libraries


!pip install SpeechRecognition pydub
Step 2: Import and Set Up Libraries

Import the necessary libraries and set up the recognizer.

import speech_recognition as sr
from pydub import AudioSegment

Step 3: Upload an Audio File

Google Colab allows you to upload files directly. Use the following code to upload an audio
file.

from google.colab import files

# Upload an audio file
uploaded = files.upload()

Make sure the audio file is in a format supported by SpeechRecognition (e.g., .wav, .aiff,
or .flac). If it is in another format, such as .mp3, convert it to .wav using pydub.

Step 4: Convert Audio to WAV Format (if needed)

If your file is in .mp3 format, you can convert it to .wav using pydub.

# Convert the uploaded audio file to WAV format
audio_file = next(iter(uploaded.keys()))  # Get the uploaded file name
sound = AudioSegment.from_file(audio_file)
wav_file = "converted_audio.wav"
sound.export(wav_file, format="wav")

Step 5: Perform Speech Recognition

Use the SpeechRecognition library to transcribe the audio.

# Initialize the recognizer
recognizer = sr.Recognizer()

# Load the audio file and record its contents
with sr.AudioFile(wav_file) as source:
    audio_data = recognizer.record(source)

# Recognize speech using the Google Web Speech API
try:
    text = recognizer.recognize_google(audio_data)
    print("Transcribed Text:")
    print(text)
except sr.UnknownValueError:
    print("Speech Recognition could not understand the audio.")
except sr.RequestError as e:
    print(f"Could not request results; {e}")

Step 6: Optional - Display the Transcribed Text

The transcribed text will be printed directly in the Colab output cell.

3. Speech segmentation is the process of identifying the boundaries between words,
syllables, or phonemes in spoken natural languages. The term applies both to the
mental processes used by humans and to artificial processes of natural language
processing.
Speech segmentation is a subfield of general speech perception and an important
subproblem of the technologically focused field of speech recognition, and it cannot be
adequately solved in isolation. As in most natural language processing problems, one
must take into account context, grammar, and semantics; even so, the result is often
a probabilistic division (statistically based on likelihood) rather than a categorical one.
Coarticulation, a phenomenon that can occur between adjacent words just as easily as
within a single word, appears to present the main challenge in speech segmentation
across languages.
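To make the probabilistic flavour of segmentation concrete, here is a toy dynamic-programming sketch that splits an unspaced character string into the highest-scoring word sequence. The vocabulary, the scores, and the fallback penalty are all illustrative assumptions, and real speech segmenters operate on acoustic features rather than text, but the "pick the most likely division" structure is the same.

```python
def segment(text, scores):
    """Split an unspaced string into the highest-scoring word sequence.

    best[i] holds (score, words) for the best segmentation of text[:i].
    """
    best = [(0.0, [])]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(i):
            word = text[j:i]
            if word in scores:
                prev_score, prev_words = best[j]
                candidates.append((prev_score + scores[word], prev_words + [word]))
        # Fall back to a heavily penalised single character if nothing matches
        if not candidates:
            prev_score, prev_words = best[i - 1]
            candidates.append((prev_score - 10.0, prev_words + [text[i - 1]]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]

scores = {"speech": 2.0, "recognition": 3.0, "rec": 1.0, "speechrec": 1.5}
print(segment("speechrecognition", scores))  # -> ['speech', 'recognition']
```

Note that the result depends entirely on the score table: raising the score of "speechrec" high enough would change the preferred division, which is the probabilistic behaviour described above.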

Task 1: Implement the gTTS code and provide the link. You must use your full name as the
input text and generate the audio in English. (4 marks)

(Optional: try text in your mother tongue and generate an audio file.)

Task 2: Speech recognition. Implement the SpeechRecognition code and provide the link.
You must use the audio generated in Task 1 as the input to Task 2.

Compare the text input given in Task 1 with the transcribed text obtained in Task 2. Test
more sentences and write your inference. (6 marks)
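For the comparison in Task 2, one simple option (an assumption for this sketch, not a requirement of the lab) is to compute a similarity ratio between the original sentence and the transcription using Python's standard difflib module. Lower-casing both strings first keeps capitalization differences, which the recognizer does not preserve reliably, from dominating the score.

```python
import difflib

def similarity(reference, hypothesis):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return difflib.SequenceMatcher(None, reference.lower(), hypothesis.lower()).ratio()

# Example: a transcription that dropped the punctuation of the input sentence
reference = "Hello, my name is Jane Doe."
hypothesis = "hello my name is jane doe"
print(round(similarity(reference, hypothesis), 2))  # -> 0.96
```

A score near 1.0 suggests the recognizer reproduced the sentence almost verbatim; lower scores usually come from misheard words, which is worth noting in your inference.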
