A special algorithm is then applied to determine the most likely word (or
words) that produce the given sequence of phonemes.
One can imagine that this whole process may be computationally expensive. In many modern speech recognition
systems, neural networks are used to simplify the speech signal using techniques for feature transformation and
dimensionality reduction before HMM recognition. Voice activity detectors (VADs) are also used to reduce an audio signal
to only the portions that are likely to contain speech. This prevents the recognizer from wasting time analyzing
unnecessary parts of the signal.
Fortunately, as a Python programmer, you don’t have to worry about any of this. A number of speech recognition
services are available for use online through an API, and many of these services offer Python SDKs.
apiai
assemblyai
google-cloud-speech
pocketsphinx
SpeechRecognition
watson-developer-cloud
wit
Some of these packages—such as wit and apiai—offer built-in features, like natural language processing for identifying a
speaker’s intent, which go beyond basic speech recognition. Others, like google-cloud-speech, focus solely on speech-
to-text conversion.
Recognizing speech requires audio input, and SpeechRecognition makes retrieving this input really easy. Instead of
having to build scripts for accessing microphones and processing audio files from scratch, SpeechRecognition will have
you up and running in just a few minutes.
The SpeechRecognition library acts as a wrapper for several popular speech APIs and is thus extremely flexible. One of
these—the Google Web Speech API—supports a default API key that is hard-coded into the SpeechRecognition library.
That means you can get started without having to sign up for a service.
The flexibility and ease-of-use of the SpeechRecognition package make it an excellent choice for any Python project.
However, support for every feature of each API it wraps is not guaranteed. You will need to spend some time researching
the available options to find out if SpeechRecognition will work in your particular case.
So, now that you’re convinced you should try out SpeechRecognition, the next step is getting it installed in your
environment.
Installing SpeechRecognition
SpeechRecognition is compatible with Python 2.6, 2.7 and 3.3+, but requires some additional installation steps for
Python 2. For this tutorial, I’ll assume you are using Python 3.3+.
Shell
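$ pip install SpeechRecognition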
Using record() to Capture Data From a File
Type the following into your interpreter session to process the contents of the “harvard.wav” file:
Python >>>
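>>> import speech_recognition as sr
>>> r = sr.Recognizer()

>>> harvard = sr.AudioFile('harvard.wav')
>>> with harvard as source:
...     audio = r.record(source)
...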
The context manager opens the file and reads its contents, storing the data in an AudioFile instance called source.
Then the record() method records the data from the entire file into an AudioData instance. You can confirm this by
checking the type of audio:
Python >>>
>>> type(audio)
<class 'speech_recognition.AudioData'>
You can now invoke recognize_google() to attempt to recognize any speech in the audio. Depending on your internet
connection speed, you may have to wait several seconds before seeing the result.
Python >>>
>>> r.recognize_google(audio)
'the stale smell of old beer lingers it takes heat
to bring out the odor a cold dip restores health and
zest a salt pickle taste fine with ham tacos al
Pastore are my favorite a zestful food is the hot
cross bun'
If you’re wondering where the phrases in the “harvard.wav” file come from, they are examples of Harvard Sentences.
These phrases were published by the IEEE in 1965 for use in speech intelligibility testing of telephone lines. They are still
used in VoIP and cellular testing today.
The Harvard Sentences comprise 72 lists of ten phrases. You can find freely available recordings of these phrases
on the Open Speech Repository website. Recordings are available in English, Mandarin Chinese, French, and Hindi. They
provide an excellent source of free material for testing your code.
If you only want to capture a portion of the speech in a file, the record() method accepts a duration keyword argument that stops the recording after a specified number of seconds. For example, the following captures any speech in the first four seconds of the file:
Python >>>
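>>> with harvard as source:
...     audio = r.record(source, duration=4)
...
>>> r.recognize_google(audio)
'the stale smell of old beer lingers'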
The record() method, when used inside a with block, always moves ahead in the file stream. This means that if you record once for four seconds and then record again for four seconds, the second call returns the four seconds of audio after the first recording.

Python >>>
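>>> with harvard as source:
...     audio1 = r.record(source, duration=4)
...     audio2 = r.record(source, duration=4)
...
>>> r.recognize_google(audio1)
'the stale smell of old beer lingers'
>>> r.recognize_google(audio2)
'it takes heat to bring out the odor a cold dip'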
Notice that audio2 contains a portion of the third phrase in the file. When specifying a duration, the recording might stop
mid-phrase—or even mid-word—which can hurt the accuracy of the transcription. More on this in a bit.
In addition to specifying a recording duration, the record() method can be given a specific starting point using the
offset keyword argument. This value represents the number of seconds from the beginning of the file to ignore before
starting to record.
To capture only the second phrase in the file, you could start with an offset of four seconds and record for, say, three
seconds.
Python >>>
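>>> with harvard as source:
...     audio = r.record(source, offset=4, duration=3)
...
>>> r.recognize_google(audio)
'it takes heat to bring out the odor'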
The offset and duration keyword arguments are useful for segmenting an audio file if you have prior knowledge of the
structure of the speech in the file. However, using them hastily can result in poor transcriptions. To see this effect, try the
following in your interpreter:
Python >>>
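>>> with harvard as source:
...     audio = r.record(source, offset=4.7, duration=2.8)
...
>>> r.recognize_google(audio)
'Mesquite to bring out the odor Aiko'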
By starting the recording at 4.7 seconds, you miss the “it t” portion at the beginning of the phrase “it takes heat to bring out the odor,” so the API only got “akes heat,” which it matched to “Mesquite.”
Similarly, at the end of the recording, you captured “a co,” which is the beginning of the third phrase “a cold dip restores
health and zest.” This was matched to “Aiko” by the API.
There is another reason you may get inaccurate transcriptions. Noise! The above examples worked well because the
audio file is reasonably clean. In the real world, unless you have the opportunity to process audio files beforehand, you
cannot expect the audio to be noise-free.
To get a feel for how noise can affect speech recognition, download the “jackhammer.wav” file here. As always, make
sure you save this to your interpreter session’s working directory.
This file has the phrase “the stale smell of old beer lingers” spoken with a loud jackhammer in the background.
Python >>>
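>>> jackhammer = sr.AudioFile('jackhammer.wav')
>>> with jackhammer as source:
...     audio = r.record(source)
...
>>> r.recognize_google(audio)  # your exact transcription may differ
'the snail smell of old gear vendors'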
Way off!
So how do you deal with this? One thing you can try is using the adjust_for_ambient_noise() method of the
Recognizer class.
Python >>>
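>>> with jackhammer as source:
...     r.adjust_for_ambient_noise(source)
...     audio = r.record(source)
...
>>> r.recognize_google(audio)
'still smell of old beer vendors'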
That got you a little closer to the actual phrase, but it still isn’t perfect. Also, “the” is missing from the beginning of the
phrase. Why is that?
The adjust_for_ambient_noise() method reads the first second of the file stream and calibrates the recognizer to the
noise level of the audio. Hence, that portion of the stream is consumed before you call record() to capture the data.
You can adjust the time-frame that adjust_for_ambient_noise() uses for analysis with the duration keyword
argument. This argument takes a numerical value in seconds and is set to 1 by default. Try lowering this value to 0.5.
Python >>>
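>>> with jackhammer as source:
...     r.adjust_for_ambient_noise(source, duration=0.5)
...     audio = r.record(source)
...
>>> r.recognize_google(audio)  # again, your result may vary
'the snail smell like old Beer Mongers'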
Well, that got you “the” at the beginning of the phrase, but now you have some new issues! Sometimes it isn’t possible
to remove the effect of the noise—the signal is just too noisy to be dealt with successfully. That’s the case with this file.
If you find yourself running up against these issues frequently, you may have to resort to some pre-processing of the
audio. This can be done with audio editing software or a Python package (such as SciPy) that can apply filters to the files.
A detailed discussion of this is beyond the scope of this tutorial—check out Allen Downey’s Think DSP book if you are
interested. For now, just be aware that ambient noise in an audio file can cause problems and must be addressed in
order to maximize the accuracy of speech recognition.
When working with noisy files, it can be helpful to see the actual API response. Most APIs return a JSON string containing
many possible transcriptions. The recognize_google() method will always return the most likely transcription unless
you force it to give you the full response.
You can do this by setting the show_all keyword argument of the recognize_google() method to True.
Python >>>
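>>> r.recognize_google(audio, show_all=True)  # alternatives abridged here
{'alternative': [{'transcript': 'the snail smell like old Beer Mongers'},
                 ...], 'final': True}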
As you can see, recognize_google() returns a dictionary with the key 'alternative' that points to a list of possible
transcripts. The structure of this response may vary from API to API and is mainly useful for debugging.
By now, you have a pretty good idea of the basics of the SpeechRecognition package. You’ve seen how to create an
AudioFile instance from an audio file and use the record() method to capture data from the file. You learned how to record segments of a file using the offset and duration keyword arguments of record(), and you experienced the
detrimental effect noise can have on transcription accuracy.
Now for the fun part. Let’s transition from transcribing static audio files to making your project interactive by accepting
input from a microphone.
Installing PyAudio
The process for installing PyAudio will vary depending on your operating system.
Debian Linux
If you’re on Debian-based Linux (like Ubuntu) you can install PyAudio with apt:
Shell
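$ sudo apt-get install python-pyaudio python3-pyaudio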
Once installed, you may still need to run pip install pyaudio, especially if you are working in a virtual environment.
To access your microphone with SpeechRecognition, you’ll need to create an instance of the Microphone class. Open up another interpreter session and type:

Python >>>

>>> import speech_recognition as sr
>>> r = sr.Recognizer()
>>> mic = sr.Microphone()
If your system has no default microphone (such as on a Raspberry Pi), or you want to use a microphone other than the
default, you will need to specify which one to use by supplying a device index. You can get a list of microphone names by
calling the list_microphone_names() static method of the Microphone class.
Python >>>
>>> sr.Microphone.list_microphone_names()
['HDA Intel PCH: ALC272 Analog (hw:0,0)',
'HDA Intel PCH: HDMI 0 (hw:0,3)',
'sysdefault',
'front',
'surround40',
'surround51',
'surround71',
'hdmi',
'pulse',
'dmix',
'default']
Note that your output may differ from the above example.
The device index of the microphone is the index of its name in the list returned by list_microphone_names(). For
example, given the above output, if you want to use the microphone called “front,” which has index 3 in the list, you
would create a microphone instance like this:
Python >>>
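>>> mic = sr.Microphone(device_index=3)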
For most projects, though, you’ll probably want to use the default system microphone.
Just like the AudioFile class, Microphone is a context manager. You can capture input from the microphone using the
listen() method of the Recognizer class inside of the with block. This method takes an audio source as its first
argument and records input from the source until silence is detected.
Python >>>
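>>> with mic as source:
...     audio = r.listen(source)
...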
Once you execute the with block, try speaking “hello” into your microphone. Wait a moment for the interpreter prompt
to display again. Once the “>>>” prompt returns, you’re ready to recognize the speech.
Python >>>
>>> r.recognize_google(audio)
'hello'
If the prompt never returns, your microphone is most likely picking up too much ambient noise. You can interrupt the process with Ctrl+C to get your prompt back.

Putting It All Together: A “Guess the Word” Game

Now that you’ve seen the basics of recognizing speech with the SpeechRecognition package, let’s put your new-found knowledge to use with a small game that picks a random word from a list and gives the user three attempts to guess the word. Here is the full script:

Python
import random
import time

import speech_recognition as sr


def recognize_speech_from_mic(recognizer, microphone):
    """Transcribe speech recorded from `microphone`."""
    # check that recognizer and microphone arguments are the correct type
    if not isinstance(recognizer, sr.Recognizer):
        raise TypeError("`recognizer` must be `Recognizer` instance")

    if not isinstance(microphone, sr.Microphone):
        raise TypeError("`microphone` must be `Microphone` instance")

    # adjust the recognizer sensitivity to ambient noise and record audio
    # from the microphone
    with microphone as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)

    # set up the response object
    response = {
        "success": True,
        "error": None,
        "transcription": None
    }

    # try recognizing the speech in the recording
    # if a RequestError or UnknownValueError exception is caught,
    # update the response object accordingly
    try:
        response["transcription"] = recognizer.recognize_google(audio)
    except sr.RequestError:
        # API was unreachable or unresponsive
        response["success"] = False
        response["error"] = "API unavailable"
    except sr.UnknownValueError:
        # speech was unintelligible
        response["error"] = "Unable to recognize speech"

    return response


if __name__ == "__main__":
    # set the list of words, max number of guesses, and prompt limit
    WORDS = ["apple", "banana", "grape", "orange", "mango", "lemon"]
    NUM_GUESSES = 3
    PROMPT_LIMIT = 5

    # create recognizer and mic instances
    recognizer = sr.Recognizer()
    microphone = sr.Microphone()

    # get a random word from the list
    word = random.choice(WORDS)

    # show instructions and wait 3 seconds before starting the game
    print("I'm thinking of one of these words: {}".format(", ".join(WORDS)))
    print("You have {} tries to guess which one.\n".format(NUM_GUESSES))
    time.sleep(3)

    for i in range(NUM_GUESSES):
        # get the guess from the user:
        # if a transcription is returned, break out of the loop and continue
        # if no transcription was returned and the API request failed,
        # break out of the loop and continue
        # if the API request succeeded but no transcription was returned,
        # re-prompt the user to say their guess again, up to PROMPT_LIMIT times
        for j in range(PROMPT_LIMIT):
            print('Guess {}. Speak!'.format(i+1))
            guess = recognize_speech_from_mic(recognizer, microphone)
            if guess["transcription"]:
                break
            if not guess["success"]:
                break
            print("I didn't catch that. What did you say?\n")

        # if there was an error, stop the game
        if guess['error']:
            print("ERROR: {}".format(guess["error"]))
            break

        # show the user the transcription
        print("You said: {}".format(guess["transcription"]))

        # determine if the guess is correct and if any attempts remain
        guess_is_correct = guess["transcription"].lower() == word.lower()
        user_has_more_attempts = i < NUM_GUESSES - 1

        # determine if the user has won the game; if not, repeat the loop if
        # attempts remain, otherwise the user loses
        if guess_is_correct:
            print("Correct! You win!")
            break
        elif user_has_more_attempts:
            print("Incorrect. Try again.\n")
        else:
            print("Sorry, you lose!\nThe word was: {}".format(word))
The recognize_speech_from_mic() function takes a Recognizer and Microphone instance as arguments and returns a
dictionary with three keys. The first key, "success", is a boolean that indicates whether or not the API request was
successful. The second key, "error", is either None or an error message indicating that the API is unavailable or the
speech was unintelligible. Finally, the "transcription" key contains the transcription of the audio recorded by the
microphone.
The function first checks that the recognizer and microphone arguments are of the correct type, and raises a TypeError
if either is invalid:
Python
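if not isinstance(recognizer, sr.Recognizer):
    raise TypeError("`recognizer` must be `Recognizer` instance")

if not isinstance(microphone, sr.Microphone):
    raise TypeError("`microphone` must be `Microphone` instance")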
The listen() method is then used to capture input from the microphone:

Python
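with microphone as source:
    recognizer.adjust_for_ambient_noise(source)
    audio = recognizer.listen(source)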
The adjust_for_ambient_noise() method is used to calibrate the recognizer for changing noise conditions each time
the recognize_speech_from_mic() function is called.
Next, recognize_google() is called to transcribe any speech in the recording. A try...except block is used to catch the
RequestError and UnknownValueError exceptions and handle them accordingly. The success of the API request, any
error messages, and the transcribed speech are stored in the success, error and transcription keys of the response
dictionary, which is returned by the recognize_speech_from_mic() function.
Python
response = {
    "success": True,
    "error": None,
    "transcription": None
}

try:
    response["transcription"] = recognizer.recognize_google(audio)
except sr.RequestError:
    # API was unreachable or unresponsive
    response["success"] = False
    response["error"] = "API unavailable"
except sr.UnknownValueError:
    # speech was unintelligible
    response["error"] = "Unable to recognize speech"

return response
You can test the recognize_speech_from_mic() function by saving the above script to a file called “guessing_game.py”
and running the following in an interpreter session:
Python >>>
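>>> import speech_recognition as sr
>>> from guessing_game import recognize_speech_from_mic
>>> r = sr.Recognizer()
>>> m = sr.Microphone()
>>> recognize_speech_from_mic(r, m)  # say "hello" after running this line
{'success': True, 'error': None, 'transcription': 'hello'}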
The game itself is pretty simple. First, a list of words, a maximum number of allowed guesses, and a prompt limit are declared:

Python
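WORDS = ["apple", "banana", "grape", "orange", "mango", "lemon"]
NUM_GUESSES = 3
PROMPT_LIMIT = 5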
Next, a Recognizer and Microphone instance is created and a random word is chosen from WORDS:
Python
recognizer = sr.Recognizer()
microphone = sr.Microphone()
word = random.choice(WORDS)
After printing some instructions and waiting for three seconds, a for loop is used to manage each user attempt at
guessing the chosen word. The first thing inside the for loop is another for loop that prompts the user at most
PROMPT_LIMIT times for a guess, attempting to recognize the input each time with the recognize_speech_from_mic()
function and storing the dictionary returned to the local variable guess.
If the "transcription" key of guess is not None, then the user’s speech was transcribed and the inner loop is terminated
with break. If the speech was not transcribed and the "success" key is set to False, then an API error occurred and the
loop is again terminated with break. Otherwise, the API request was successful but the speech was unrecognizable. The
user is warned and the for loop repeats, giving the user another chance at the current attempt.
Python
for j in range(PROMPT_LIMIT):
    print('Guess {}. Speak!'.format(i+1))
    guess = recognize_speech_from_mic(recognizer, microphone)
    if guess["transcription"]:
        break
    if not guess["success"]:
        break
    print("I didn't catch that. What did you say?\n")
Once the inner for loop terminates, the guess dictionary is checked for errors. If any occurred, the error message is
displayed and the outer for loop is terminated with break, which will end the program execution.
Python
if guess['error']:
    print("ERROR: {}".format(guess["error"]))
    break
If there weren’t any errors, the transcription is compared to the randomly selected word. The lower() method for string
objects is used to ensure better matching of the guess to the chosen word. The API may return speech matched to the
word “apple” as “Apple” or “apple,” and either response should count as a correct answer.
If the guess was correct, the user wins and the game is terminated. If the user was incorrect and has any remaining
attempts, the outer for loop repeats and a new guess is retrieved. Otherwise, the user loses the game.
Python
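# show the user the transcription
print("You said: {}".format(guess["transcription"]))

# determine if the guess is correct and if any attempts remain
guess_is_correct = guess["transcription"].lower() == word.lower()
user_has_more_attempts = i < NUM_GUESSES - 1

# determine if the user has won the game; if not, repeat the loop if
# attempts remain, otherwise the user loses
if guess_is_correct:
    print("Correct! You win!")
    break
elif user_has_more_attempts:
    print("Incorrect. Try again.\n")
else:
    print("Sorry, you lose!\nThe word was: {}".format(word))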