14-Speech Recognition
Speech Recognition
Speech
Automatic Speech Recognition (ASR)
• History
• Challenges
• Performance Evaluation
ASR Approaches
• Template-based ASR
• Statistical ASR
Speech
• Speech is the most common analog signal
produced by humans.
• It conveys human thoughts and intentions.
• Speech production involves physiology, cognition,
and acoustics to produce meaningful sounds to
communicate
• Consonants
• Vowels
• Speech signal can be decomposed into source and filter.
– The source is the vocal folds in voiced speech.
– The filter is the vocal tract and articulators.
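The source-filter decomposition can be illustrated with a minimal sketch. It assumes an impulse train as a crude stand-in for vocal fold pulses and two-pole resonators as a crude stand-in for vocal tract formants. The function names and the numeric values (100 Hz pitch, formants at 700 Hz and 1200 Hz, 100 Hz bandwidth) are illustrative assumptions, not values from the slides.

```python
import math

def glottal_source(f0, sample_rate, duration):
    """Impulse train at f0 Hz (assumed, very simplified voiced source)."""
    n = round(sample_rate * duration)
    period = round(sample_rate / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, freq, bandwidth, sample_rate):
    """Two-pole resonator modelling a single formant (assumed filter shape)."""
    r = math.exp(-math.pi * bandwidth / sample_rate)   # pole radius < 1, so stable
    theta = 2 * math.pi * freq / sample_rate           # pole angle sets the resonance
    a1, a2 = 2 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

sample_rate = 8000
source = glottal_source(f0=100, sample_rate=sample_rate, duration=0.05)
# Filter: two formant resonators applied in cascade.
speech_like = resonator(resonator(source, 700, 100, sample_rate),
                        1200, 100, sample_rate)
```

Changing `f0` alters only the source (the pitch), while changing the resonator frequencies alters only the filter (the vowel quality), which is the point of the decomposition.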
Speech
• Speech visualization combines the science of speech
production with the art of visual representation
– It aims to make the invisible sounds of speech visible, offering
insights into the temporal and spectral dynamics of spoken
language.
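One common visualization of temporal and spectral dynamics is the spectrogram. The sketch below computes one with a naive short-time Fourier transform in plain Python. The frame length, hop size, Hann window, and test tone are assumptions chosen for illustration; a practical system would likely use an FFT library instead.

```python
import cmath
import math

def spectrogram(samples, frame_len=64, hop=32):
    """Return one list of magnitudes (bins 0..frame_len//2) per frame."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        # Hann window reduces leakage at the frame edges.
        frame = [samples[start + n]
                 * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1)))
                 for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT of the windowed frame at bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
spec = spectrogram(tone)
# Strongest frequency bin in the first frame, converted to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Each row of `spec` is one time slice and each column one frequency band, so plotting it as an image gives time on one axis and frequency on the other.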
Automatic Speech Recognition (ASR)
• Automatic Speech Recognition (ASR) is the
process of converting spoken language into text.
• How does it work?
– Analyzes the acoustic signal (sound waves) of spoken
language.
– Extracts features like pitch, formants, and spectral energy.
– Uses these features to identify individual sounds
(phonemes) and then combine them into words.
– Employs language models to predict the most likely
sequence of words based on the context and grammar.
– Outputs the recognized text.
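The language-model step can be sketched with a toy bigram model that scores candidate transcripts. The training corpus, the candidate sentences, and the add-one smoothing are all assumptions made for illustration; real systems train on far larger corpora and typically use stronger models.

```python
import math
from collections import Counter

def train_bigrams(sentences):
    """Count bigrams and their left-context words; also return vocabulary size."""
    context, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        words = ["<s>"] + s.split()      # <s> marks the sentence start
        vocab.update(words)
        context.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return context, bigrams, len(vocab)

def log_prob(sentence, context, bigrams, vocab_size):
    """Log-probability of a sentence under add-one smoothed bigrams."""
    words = ["<s>"] + sentence.split()
    return sum(math.log((bigrams[(p, w)] + 1) / (context[p] + vocab_size))
               for p, w in zip(words, words[1:]))

# Hypothetical toy corpus and acoustically similar candidates.
corpus = ["i want to recognize speech",
          "they want to go to the beach",
          "it is nice to recognize speech"]
context, bigrams, vocab_size = train_bigrams(corpus)
candidates = ["i want to recognize speech", "i want to wreck a nice beach"]
best = max(candidates,
           key=lambda c: log_prob(c, context, bigrams, vocab_size))
```

Here `best` is the candidate whose word sequence is more plausible given the toy corpus, even if both candidates sound alike.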
ASR Example
ASR Challenges
• Converting spoken language into text using computational
methods is a complex challenge with various facets
encompassing many factors of input, processing, and output.
• Input factors:
– Acoustic signal - captured through microphones, contains the
speaker's voice but can also be mixed with background noise and
other environmental factors.
• Microphone: close-mic, throat-mic, microphone array
• Sources: band-limited, background noise
• Speaker: speaker dependent, speaker independent
– Language - the specific spoken language, with its vocabulary,
grammar, and pronunciation rules, guides the interpretation of the
acoustic signal.
ASR Challenges
• Processing factors:
– Feature extraction - Identifying the relevant aspects of the acoustic
signal that represent the spoken words, like pitch, formants, and
spectral energy.
• Pitch: the fundamental frequency (F0) of vocal fold vibration, determining
the perceived "highness" or "lowness" of a voice.
• Formants: resonant frequencies of the vocal tract, shaping
the sound of vowels and certain consonants.
• Spectral energy: the distribution of signal energy across frequencies,
which reflects pitch, formants, noise, breath, speaker variation, and
emotion.
– Acoustic modeling - This stage decodes the sequence of sounds
(phonemes) from the extracted features, essentially recognizing the
building blocks of speech.
– Language modeling - Based on acoustic modeling, the system
then applies its understanding of language rules and context to
predict the most likely sequence of words.
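As one example of feature extraction, pitch can be estimated from the autocorrelation of a short voiced segment. The sketch below is a simplified version under stated assumptions: a clean synthetic signal, a search range of 50 to 400 Hz, and an unnormalized autocorrelation. The function name and parameters are hypothetical.

```python
import math

def estimate_pitch(samples, sample_rate, fmin=50.0, fmax=400.0):
    """Estimate F0 in Hz from the strongest autocorrelation lag in range."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_score = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the signal and a copy of itself shifted by `lag`.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

sample_rate = 8000
# Synthetic "voiced" signal: 200 Hz fundamental plus a weaker 400 Hz harmonic.
voiced = [math.sin(2 * math.pi * 200 * t / sample_rate)
          + 0.5 * math.sin(2 * math.pi * 400 * t / sample_rate)
          for t in range(800)]
f0 = estimate_pitch(voiced, sample_rate)
```

On real speech this simple estimator is prone to octave errors and noise, which is part of why feature extraction is listed among the processing challenges.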
ASR Challenges
• Output factors:
– Text transcript - The final product, the text version of the spoken
language, is the result of the previous stages.
• Accuracy and fluency of sentences are crucial, but factors like
keywords, punctuation, and speaker identification can also be part of
the output.
• What is hard about that?
– Digitization – Converting the analogue signal to a digital representation
– Signal processing – Separating speech from background noise
– Phonetics – Variability in human speech
– Phonology – Recognizing individual sound distinctions (similar
phonemes)
– Lexicology and syntax – Disambiguating homophones and
features of continuous speech
– Pragmatics – Filtering of performance errors (disfluencies)
ASR Performance Evaluation
• Evaluating the performance of an ASR system is
crucial for:
– Understanding its strengths and weaknesses
– Identifying areas for improvement
– Comparing different approaches
• Feature Selection is a crucial step in ASR systems, as
it plays a vital role in determining the accuracy and
efficiency of the recognition process
– Choosing the right features from the extracted acoustic
signal can significantly improve performance
– Selecting irrelevant or redundant features can lead to errors
and wasted computational resources
ASR Performance Evaluation
• Key metrics to evaluate the performance of an ASR
system:
– Accuracy – Percentage of tokens correctly recognized
– Error Rate – Percentage of tokens the system gets wrong
(roughly the complement of accuracy)
– Speed and latency – Time taken to process the speech and
generate the output text (transcript)
– Resource consumption - The amount of memory,
processing power, and other resources required by the
system.
– User experience - Subjective factors like ease of use,
clarity of transcripts, and overall satisfaction with the
system.
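A widely used concrete form of the error-rate metric is word error rate (WER): the minimum number of word substitutions, deletions, and insertions needed to turn the reference transcript into the system output, divided by the number of reference words. The sketch below assumes whitespace tokenization and a non-empty reference; the example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1] / len(ref)

wer = word_error_rate("recognize speech using common sense",
                      "wreck a nice beach using common sense")
```

Because insertions count as errors, WER can exceed 100%, so it is not always exactly 100% minus accuracy.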
ASR Approaches – Template-Based ASR
• Originally worked only for isolated words from a single user.
• Performs best when training and testing conditions
match.
• For each word we want to recognize, we store a
template or example based on actual data.
• Each test utterance is checked against the templates
to find the best match.
• Uses the Dynamic Time Warping (DTW) algorithm to align
utterances spoken at different speeds before comparing them.
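Template matching with DTW can be sketched as follows. This is a minimal illustration, assuming each utterance is a sequence of one-dimensional feature values compared by absolute difference; real systems would compare a feature vector per frame. The templates and test utterance are hypothetical.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best monotonic alignment of sequences a and b."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[-1][-1]

def recognize(utterance, templates):
    """Return the word whose stored template is closest under DTW."""
    return min(templates,
               key=lambda word: dtw_distance(utterance, templates[word]))

# One stored template per word (made-up feature values).
templates = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 1, 1]}
# The same shape as "yes", but spoken more slowly.
slow_yes = [1, 1, 3, 3, 4, 4, 3, 1]
result = recognize(slow_yes, templates)
```

The slowed-down utterance still matches the "yes" template with zero cost, because DTW lets one template frame align with several test frames.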