14-Speech Recognition

Uploaded by

thatsarra

CS463 – Natural Language Processing

Speech Recognition
• Speech
• Automatic Speech Recognition (ASR)
– History
– Challenges
– Performance Evaluation
• ASR Approaches
– Template-based ASR
– Statistical ASR
Speech
• Speech is the most common analog signal
produced by humans.
• It conveys human thoughts and intentions.
• Speech production involves physiology, cognition,
and acoustics to produce meaningful sounds for
communication:
– Consonants
– Vowels
• Speech signal can be decomposed into source and filter.
– The source is the vocal folds in voiced speech.
– The filter is the vocal tract and articulators.
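A minimal sketch of the source-filter decomposition, assuming an impulse train as a crude stand-in for vocal-fold vibration and a cascade of two-pole resonators as a stand-in for the vocal tract. The function names and the formant values (700 Hz and 1200 Hz) are illustrative assumptions, not values from the lecture.

```python
import math

def glottal_source(f0_hz, duration_s, sample_rate):
    """Source: an impulse train with one pulse per pitch period."""
    n_samples = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Filter: two-pole resonator, y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius (< 1, so stable)
    theta = 2 * math.pi * center_hz / sample_rate        # pole angle
    a1 = 2 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

sample_rate = 16000
source = glottal_source(f0_hz=100, duration_s=0.05, sample_rate=sample_rate)

# Assumed, illustrative formant (center, bandwidth) pairs in Hz.
voiced = source
for center_hz, bandwidth_hz in [(700, 80), (1200, 90)]:
    voiced = resonator(voiced, center_hz, bandwidth_hz, sample_rate)
```

Changing `f0_hz` changes the source (the pitch) while changing the resonator settings changes the filter (the vowel quality), which is the point of the decomposition.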

Speech
• Speech visualization combines the science of speech
production with the art of visual representation
– It aims to make the invisible sounds of speech visible, offering
insights into the temporal and spectral dynamics of spoken
language.

Automatic Speech Recognition (ASR)
• Automatic Speech Recognition (ASR) is the
process of converting spoken language into text.
• How does it work?
– Analyzes the acoustic signal (sound waves) of spoken
language.
– Extracts features like pitch, formants, and spectral energy.
– Uses these features to identify individual sounds
(phonemes) and then combine them into words.
– Employs language models to predict the most likely
sequence of words based on the context and grammar.
– Outputs the recognized text.
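The language-model step can be illustrated with a toy bigram model that scores two acoustically similar candidate transcripts. This is a sketch under stated assumptions: the corpus, the candidates, and the add-one smoothing are hypothetical choices for illustration, and a real recognizer would combine this score with an acoustic score.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real language model is trained on far more text.
corpus = [
    "we recognize speech with computers",
    "computers recognize speech",
    "they wreck a car",
    "a nice beach",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()  # <s> marks the sentence start
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

vocab_size = len(unigrams)

def log_prob(sentence):
    """Sum of add-one-smoothed bigram log-probabilities.

    The context count is approximated by the unigram count of the previous word.
    """
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

# Two candidates that sound alike; the model prefers the more plausible word sequence.
candidates = ["recognize speech", "wreck a nice beach"]
best = max(candidates, key=log_prob)
print(best)
```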

ASR Example

ASR Challenges
• Converting spoken language into text computationally is a
complex problem whose difficulties span input, processing,
and output.
• Input factors:
– Acoustic signal - captured through microphones, contains the
speaker's voice but can also be mixed with background noise and
other environmental factors.
• Microphone: close-mic, throat-mic, microphone array
• Sources: band-limited, background noise
• Speaker: speaker dependent, speaker independent
– Language - the specific spoken language, with its vocabulary,
grammar, and pronunciation rules, guides the interpretation of the
acoustic signal.

ASR Challenges
• Processing factors:
– Feature extraction - Identifying the relevant aspects of the acoustic
signal that represent the spoken words, like pitch, formants, and
spectral energy.
• Pitch: the perceived "highness" or "lowness" of a voice, determined
by the fundamental frequency at which the vocal folds vibrate.
• Formants: resonant frequencies of the vocal tract, which shape
the sound of vowels and certain consonants.
• Spectral energy: the distribution of the signal's energy across
frequencies, which reflects pitch, formants, noise, breath, speaker
variation, and emotion.
– Acoustic modeling - This stage decodes the sequence of sounds
(phonemes) from the extracted features, essentially recognizing the
building blocks of speech.
– Language modeling - Based on acoustic modeling, the system
then applies its understanding of language rules and context to
predict the most likely sequence of words.
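As an illustration of the feature-extraction stage listed above, pitch can be estimated from one frame of audio by autocorrelation: a periodic signal correlates strongly with itself when shifted by one pitch period. This is a minimal sketch, not the method of any particular ASR system; the function name, the search range (50 to 400 Hz), and the synthetic test frame are assumptions.

```python
import math

def estimate_pitch(signal, sample_rate, fmin=50.0, fmax=400.0):
    """Return an F0 estimate in Hz from the lag with the largest autocorrelation."""
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic voiced frame: a 200 Hz fundamental plus a weaker second harmonic.
sample_rate = 8000
f0 = 200.0
frame = [
    math.sin(2 * math.pi * f0 * n / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / sample_rate)
    for n in range(400)
]
pitch = estimate_pitch(frame, sample_rate)
print(pitch)
```

On real speech the frame would first be windowed, and unvoiced frames (which have no clear period) would need to be detected and skipped.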
ASR Challenges
• Output factors:
– Text transcript - The final product of the previous stages: the
text version of the spoken language.
• Accuracy and fluency of the sentences are crucial, but keywords,
punctuation, and speaker identification can also be part of the
output.
• What is hard about that?
– Digitization – Converting the analog signal to a digital representation
– Signal processing – Separating speech from background noise
– Phonetics – Variability in human speech
– Phonology – Recognizing individual sound distinctions (similar
phonemes)
– Lexicology and syntax – Disambiguating homophones and
features of continuous speech
– Pragmatics – Filtering of performance errors (disfluencies)
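The digitization step in the list above amounts to sampling the analog signal at regular intervals and rounding each sample to an integer. A minimal sketch, assuming a 16 kHz sample rate and 16-bit samples (common choices, but assumptions here), with a synthetic 440 Hz tone standing in for the microphone signal:

```python
import math

def sample_and_quantize(analog, duration_s, sample_rate=16000, bits=16):
    """Sample a continuous-time function and round each sample to a signed integer."""
    max_int = 2 ** (bits - 1) - 1            # 32767 for 16-bit samples
    n_samples = int(round(duration_s * sample_rate))
    samples = []
    for n in range(n_samples):
        value = analog(n / sample_rate)      # sampling: evaluate at t = n / sample_rate
        value = max(-1.0, min(1.0, value))   # clip to the representable range
        samples.append(int(round(value * max_int)))  # quantization
    return samples

def tone(t):
    """Hypothetical analog input: a 440 Hz sine at half of full scale."""
    return 0.5 * math.sin(2 * math.pi * 440 * t)

pcm = sample_and_quantize(tone, duration_s=0.01)
print(len(pcm), pcm[:5])
```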
ASR Performance Evaluation
• Evaluating the performance of an ASR system is
crucial for:
– Understanding its strengths and weaknesses
– Identifying areas for improvement
– Comparing different approaches
• Feature Selection is a crucial step in ASR systems, as
it plays a vital role in determining the accuracy and
efficiency of the recognition process
– Choosing the right features from the extracted acoustic
signal can significantly improve performance
– Selecting irrelevant or redundant features can lead to errors
and wasted computational resources
ASR Performance Evaluation
• Key metrics to evaluate the performance of an ASR
system:
– Accuracy – Percentage of tokens correctly recognized
– Error Rate – Percentage of tokens the system gets wrong
(the complement of accuracy), commonly reported as word
error rate (WER)
– Speed and latency – Time taken to process the speech and
generate the output text (transcript)
– Resource consumption - The amount of memory,
processing power, and other resources required by the
system.
– User experience - Subjective factors like ease of use,
clarity of transcripts, and overall satisfaction with the
system.
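The error-rate metric above is commonly computed as word error rate: the minimum number of word substitutions, deletions, and insertions needed to turn the reference transcript into the system output, divided by the number of reference words. A minimal sketch using the standard edit-distance recurrence; the function name and example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # dist[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))  # 0.333
```

Because insertions count as errors, WER can exceed 100% when the output is much longer than the reference.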
ASR Approaches – Template-Based ASR
• Originally only worked for isolated words, one user.
• Performs best when training and testing conditions
are closely matched.
• For each word we want to recognize, we store a
template or example based on actual data.
• Each test utterance is checked against the templates
to find the best match.
• Uses the Dynamic Time Warping (DTW) algorithm to align each
test utterance with a template despite differences in speaking rate.
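A minimal sketch of template matching with DTW, assuming each utterance is reduced to a sequence of single numbers per frame. Real systems compare vectors of acoustic features per frame; the one-dimensional sequences, the template values, and the word labels below are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two sequences of numbers."""
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b frame is repeated
                cost[i][j - 1],      # b advances, a frame is repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[-1][-1]

# Hypothetical stored templates, one per word.
templates = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 2.0, 4.0],
}

# A slower rendition of the "yes" contour: same shape, more frames.
utterance = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 3.0, 1.0]

best = min(templates, key=lambda word: dtw_distance(utterance, templates[word]))
print(best)  # yes
```

The warping lets repeated frames in the slow utterance map onto a single template frame, so the stretched "yes" matches its template with zero distance even though the two sequences differ in length.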
