Unit 5 (Automatic Speech Recognition)
Acoustic modeling is a crucial component of speech recognition systems, where it deals with the
representation of the relationship between linguistic units of speech (such as phonemes or words)
and the corresponding audio signal. It focuses on how to statistically model the way phonetic units
are produced in various contexts, including differences in speakers, accents, and environmental
noise.
1. Phonemes
Phonemes are the smallest units of sound in a language, and acoustic models attempt to recognize
these by mapping the audio signal to the corresponding phonetic sounds. For example, the words
"cat" and "bat" differ by just one phoneme: /k/ and /b/.
2. Feature Extraction
Before building acoustic models, speech data is processed to extract key features. These features
are typically derived using techniques such as Mel-Frequency Cepstral Coefficients (MFCCs),
Linear Predictive Coding (LPC), and Perceptual Linear Prediction (PLP), which are covered in
detail in the Feature Extraction section below.
3. Hidden Markov Models (HMMs)
HMMs have been the backbone of many traditional speech recognition systems. They model
sequences of sounds probabilistically, representing transitions between phonemes or states over
time. Each state in the HMM is associated with a probability distribution that represents the
likelihood of observing a particular feature vector.
4. Gaussian Mixture Models (GMMs)
GMMs are used to model the distribution of the acoustic features associated with each HMM state.
A GMM is a weighted sum of several Gaussian distributions and helps capture the variability in
speech signals for a particular phoneme.
5. Deep Learning Models
Modern speech recognition systems have largely moved from HMM-GMM-based systems to
Deep Neural Networks (DNNs) and more advanced architectures, such as hybrid HMM-DNN
systems and end-to-end deep learning models (both discussed later in this unit).
6. Context-Dependent Models
In practical systems, context-dependent acoustic models are often used. These models account for
coarticulation, the phenomenon where the pronunciation of phonemes depends on the
surrounding phonemes. For example, the /æ/ vowel in "cat" is shaped by the preceding /k/ and
differs subtly from the /æ/ in "bat," which follows /b/.
7. Training
Acoustic models are trained using large datasets of transcribed speech. The training process
typically involves extracting feature vectors from the labeled audio, aligning them with the
phonetic transcriptions, and iteratively estimating the model parameters (for example, with the
Baum-Welch algorithm for HMM-based models).
8. Adaptation
Speech recognition systems often need to adapt to new speakers, environments, or languages.
Techniques like Maximum Likelihood Linear Regression (MLLR) or speaker adaptive training
can be used to fine-tune acoustic models for specific speakers or conditions.
9. Integration with Language Models
In modern systems, combining acoustic models with language models (which capture the
structure and probability of word sequences) results in highly accurate speech recognition systems.
FEATURE EXTRACTION
Raw speech waveforms are high-dimensional, highly variable signals that are difficult to model
directly. To address this, feature extraction techniques convert the time-domain signal into a more
compact and informative representation, typically in the frequency domain.
Step 1: Pre-Emphasis
• The high-frequency components of speech often carry important information, but they tend
to have less energy than lower frequencies. Pre-emphasis is a filtering step that amplifies
the high-frequency components of the signal.
• This step applies a simple first-order high-pass filter, which balances the spectral energy of
the signal and improves the signal-to-noise ratio of its high-frequency components.
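A minimal NumPy sketch of this pre-emphasis step, assuming the common first-order form
y[n] = x[n] − α·x[n−1] with α = 0.97 (a typical but not mandated choice):

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass (pre-emphasis) filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# Example: apply pre-emphasis to a synthetic 1-second tone sampled at 16 kHz
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)   # 440 Hz tone as a stand-in for speech
y = pre_emphasis(x)
```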
Step 2: Framing
• Speech is a non-stationary signal, meaning its characteristics change over time. To deal
with this, the signal is divided into short segments called frames. Each frame is treated as
stationary (constant over time) for processing.
• Typically, frames are around 20–40 milliseconds long with some overlap (e.g., 50%)
between adjacent frames to capture smooth transitions.
Step 3: Windowing
• Each frame is multiplied by a window function (commonly a Hamming or Hanning window)
that tapers the frame edges. This reduces the spectral leakage that would otherwise be
introduced by cutting the signal into abrupt segments.
Step 4: Fast Fourier Transform (FFT)
• To convert each windowed frame from the time domain to the frequency domain, the Fast
Fourier Transform (FFT) is applied. This step decomposes the signal into its frequency
components, which are more informative for speech analysis.
• The resulting frequency-domain representation reveals how the energy of the signal is
distributed across different frequencies.
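A minimal NumPy sketch of framing, Hamming windowing, and the per-frame magnitude
spectrum, assuming a 25 ms frame length and a 10 ms hop (typical values; the text only requires
20–40 ms frames with some overlap):

```python
import numpy as np

def frame_signal(signal, sr, frame_ms=25, hop_ms=10):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

sr = 16000
x = np.random.randn(sr)                      # 1 second of noise as placeholder audio
frames = frame_signal(x, sr)                 # shape: (n_frames, frame_len)
window = np.hamming(frames.shape[1])         # taper frame edges to reduce spectral leakage
spectrum = np.abs(np.fft.rfft(frames * window, axis=1))  # magnitude spectrum per frame
```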
1. Mel-Frequency Cepstral Coefficients (MFCC)
• MFCC is the most commonly used feature in speech recognition due to its ability to mimic
the human auditory system.
• Steps to Extract MFCC:
1. Apply the FFT: Convert the signal from the time domain to the frequency domain.
2. Mel-Scale Filter Bank: Pass the frequency spectrum through a series of filters
arranged according to the Mel scale, which reflects human auditory perception.
The Mel scale is a logarithmic transformation of the frequency axis, emphasizing
lower frequencies.
3. Logarithm of Filtered Energies: Take the logarithm of the filter bank outputs to
simulate the human ear’s sensitivity to loudness changes.
4. Discrete Cosine Transform (DCT): Apply the DCT to compress the log filter
bank outputs into a small number of coefficients (typically 12–13). These
coefficients are called the MFCCs and form the feature vector.
• MFCCs capture the broad spectral shape of speech and are highly effective in
distinguishing between different phonemes.
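The MFCC pipeline described above can be reproduced with the librosa library (a library choice
assumed here; the file path and frame settings are placeholders):

```python
import librosa

# Load an utterance (hypothetical path) and compute 13 MFCCs per frame.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # ~25 ms frames, 10 ms hop
print(mfcc.shape)  # (13, n_frames): one 13-dimensional feature vector per frame
```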
2. Linear Predictive Coding (LPC)
• LPC models the human vocal tract and attempts to predict future speech samples based on
past samples. It assumes that speech can be modeled as a linear combination of previous
samples.
• The result is a set of coefficients that represent the vocal tract configuration for each frame
of speech.
• While LPC was widely used in early speech recognition systems, it has been largely
replaced by MFCC due to MFCC’s superior performance.
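For comparison, a short sketch of LPC analysis on a single frame using librosa.lpc (the library
choice, file path, and prediction order of 12 are assumptions for illustration):

```python
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file path
frame = y[:400]                                    # one ~25 ms frame at 16 kHz
a = librosa.lpc(frame, order=12)                   # 12th-order LPC coefficients
print(a.shape)  # (13,): a leading 1 followed by the 12 predictor coefficients
```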
3. Perceptual Linear Prediction (PLP)
• PLP is similar to LPC but incorporates perceptual modeling, such as critical-band filtering,
equal-loudness pre-emphasis, and intensity-loudness compression. These steps help align
the features more closely with human auditory perception.
• PLP features are considered more perceptually accurate than LPC.
4. Mel-Spectrogram
• The Mel-spectrogram is the (usually log-scaled) output of the Mel filter bank, i.e., the
MFCC pipeline before the DCT step. It retains more spectral detail than MFCCs and is
widely used as input to neural acoustic models.
5. Delta and Delta-Delta Features
• In addition to static features (e.g., MFCCs), dynamic features are often included to
capture the temporal evolution of speech.
• Delta features represent the change in MFCCs over time (first-order derivative), and
Delta-Delta features represent the rate of change of Delta features (second-order
derivative).
• These dynamic features provide information about the speed and acceleration of speech,
which improves recognition accuracy.
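A brief sketch of appending delta and delta-delta features to MFCCs with librosa (library and file
path assumed), yielding the common 39-dimensional feature vector:

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)     # hypothetical file path
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # static features
delta = librosa.feature.delta(mfcc)                 # first-order (velocity) features
delta2 = librosa.feature.delta(mfcc, order=2)       # second-order (acceleration) features
features = np.vstack([mfcc, delta, delta2])         # 39-dimensional vector per frame
```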
6. Chroma Features
• Chroma features represent the energy of different pitch classes in the signal. Although
more commonly used in music analysis, they can be useful for analyzing tonal qualities of
speech.
Challenges in Feature Extraction
1. Noise Robustness
• Speech signals often contain background noise, which can degrade the quality of the
extracted features. Techniques like spectral subtraction and Wiener filtering are used to
reduce noise during preprocessing.
• Robust feature extraction methods, such as RASTA filtering (Relative Spectral
Processing), can be employed to reduce the impact of noise.
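A hedged sketch of simple magnitude spectral subtraction using NumPy and librosa, assuming
(purely for illustration) that the first few frames of the recording contain only background noise:

```python
import numpy as np
import librosa

y, sr = librosa.load("noisy_utterance.wav", sr=16000)   # hypothetical file path
S = librosa.stft(y, n_fft=400, hop_length=160)           # complex spectrogram

# Assume the first 10 frames are speech-free and estimate the noise magnitude from them.
noise_mag = np.abs(S[:, :10]).mean(axis=1, keepdims=True)

# Subtract the noise estimate from each frame's magnitude, floor at zero,
# and resynthesize using the original phase.
clean_mag = np.maximum(np.abs(S) - noise_mag, 0.0)
S_clean = clean_mag * np.exp(1j * np.angle(S))
y_clean = librosa.istft(S_clean, hop_length=160)
```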
2. Speaker Variability
• Different speakers have different vocal tract shapes, accents, and speaking styles, which
can affect the extracted features. Techniques like speaker normalization and Vocal Tract
Length Normalization (VTLN) are used to account for this variability.
3. Real-Time Processing
• In real-time speech recognition systems (e.g., virtual assistants), feature extraction must be
performed quickly to allow for real-time feedback. This requires optimizing the feature
extraction pipeline for speed without sacrificing accuracy.
Advanced Techniques
2. Wavelet Transform
• In addition to the Fourier transform, the wavelet transform can be used to analyze speech
signals. It provides both time and frequency resolution, making it suitable for analyzing
non-stationary signals like speech.
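A short sketch of a discrete wavelet decomposition using the PyWavelets (pywt) package (the
library, the 'db4' wavelet, and the decomposition depth are assumptions for illustration):

```python
import numpy as np
import pywt

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)              # placeholder signal

# 4-level discrete wavelet decomposition with a Daubechies-4 wavelet:
# returns approximation coefficients plus detail coefficients at each level,
# giving a joint time-frequency view of the signal.
coeffs = pywt.wavedec(x, 'db4', level=4)
for i, c in enumerate(coeffs):
    print(f"band {i}: {len(c)} coefficients")
```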
HMM
A Hidden Markov Model (HMM) is a statistical model that is widely used in speech recognition,
natural language processing, and various other time-series applications. It is particularly well-
suited for modeling sequences where observations are generated by underlying hidden states,
which evolve over time.
Here is an overview of Hidden Markov Models and how they relate to speech recognition.
1. Understanding HMMs
• States: The system is assumed to be in one of several hidden (unobservable) states at any
given time.
• Observations: For each state, an observation (which is visible) is generated according to
a probability distribution specific to that state.
• State transitions: The system transitions between states according to a set of probabilities.
• The objective of an HMM is to model both the transitions between states and the likelihood
of observing certain outputs given those states.
2. Components of an HMM
An HMM is specified by:
• A set of hidden states.
• The state transition probability matrix A, where each entry gives the probability of moving
from one state to another.
• The emission (observation) probability distributions B, giving the probability of each
observation in each state.
• The initial state distribution π, giving the probability of starting in each state.
3. The Three Fundamental Problems of HMMs
To apply HMMs in speech recognition (or any other sequence-based task), three key problems
need to be solved:
1. Evaluation (Likelihood): Given the model and a sequence of observations, what is the
probability that the sequence was generated by the model?
• This is solved using the Forward Algorithm, a dynamic programming algorithm that
efficiently computes the likelihood of the observation sequence by summing over all
possible state sequences.
2. Decoding: Given the model and a sequence of observations, what is the most likely
sequence of hidden states that produced the observations?
• This problem is solved using the Viterbi Algorithm, which finds the most likely path of
hidden states by tracking the maximum-probability path at each step in the sequence.
3. Learning (Training): Given an observation sequence and a model with unknown
parameters, how can we adjust the model parameters (A, B, and π) to best fit the data?
• This is solved using the Baum-Welch Algorithm (a special case of the Expectation-
Maximization algorithm), which iteratively refines the parameters to maximize the
likelihood of the observed data.
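As an illustration of the decoding problem above, here is a minimal NumPy sketch of the Viterbi
algorithm for a discrete-observation HMM, working in log space to avoid underflow. The toy
states, transition matrix A, emission matrix B, and initial distribution π are arbitrary illustrative
values, not taken from the text:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden-state path for a discrete-observation HMM.

    log_pi: (S,) log initial probabilities
    log_A:  (S, S) log transition probabilities
    log_B:  (S, V) log emission probabilities
    obs:    (T,) observation indices
    """
    S, T = len(log_pi), len(obs)
    score = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    score[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_A      # rows: previous state, cols: next state
        back[t] = cand.argmax(axis=0)             # best predecessor for each next state
        score[t] = cand.max(axis=0) + log_B[:, obs[t]]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):                 # backtrack through stored predecessors
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 3-symbol HMM (values purely illustrative)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = np.array([0, 1, 2, 2])
print(viterbi(np.log(pi), np.log(A), np.log(B), obs))
```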
4. HMMs in Speech Recognition
In speech recognition, HMMs are used to model sequences of speech sounds (phonemes) and their
temporal dependencies. Here is how they fit into the process:
1. Phoneme Modeling:
o Each phoneme (or sub-phoneme) in a language is typically modeled as a separate
HMM. For example, the sound "aa" might have its own HMM, with transitions
between states representing different parts of the sound.
o The speech signal is divided into short frames (e.g., 25 milliseconds each), and for
each frame, a feature vector (such as MFCC) is extracted.
2. Recognition Task:
o Given an observation sequence (features extracted from the speech signal), the task
of the HMM is to find the most likely sequence of phonemes (and thus words) that
produced the sequence.
o This involves decoding the hidden state sequence using the Viterbi Algorithm,
which yields the best path through the HMMs corresponding to the phonemes of
the language.
3. Acoustic Modeling:
o In traditional systems, HMMs are combined with Gaussian Mixture Models
(GMMs) to model the emission probabilities. The GMMs estimate the probability
of each observation given the hidden state (phoneme).
o In modern systems, GMMs are often replaced by Deep Neural Networks (DNNs),
which directly model the emission probabilities more accurately.
4. Integration with Language Models:
o HMMs for speech recognition are often combined with language models to
improve accuracy. Language models capture the likelihood of word sequences
(e.g., "the cat" is more likely than "cat the"). The combination of acoustic models
(HMMs) and language models helps predict both phoneme sequences and word
sequences.
5. HMM Training
Training an HMM-based recognizer typically involves the following steps:
1. Prepare labeled speech data: Each phoneme or word is labeled in the training set, and
feature vectors (e.g., MFCCs) are extracted from the audio.
2. Initialize model parameters: The transition probabilities, emission probabilities, and
initial state probabilities are initialized randomly or based on prior knowledge.
3. Apply the Baum-Welch Algorithm: This algorithm iteratively refines the HMM
parameters to fit the training data, maximizing the likelihood of observing the labeled
speech.
Once trained, the HMM is capable of recognizing phonemes and words in new, unseen speech.
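As an illustration of this training loop, here is a hedged sketch using the third-party hmmlearn
library (a library choice assumed here, not prescribed by the text). Random vectors stand in for
real MFCC frames of a single phoneme, and GaussianHMM.fit runs EM (the Baum-Welch
algorithm for Gaussian emissions):

```python
import numpy as np
from hmmlearn import hmm

# Stand-in training data: 20 example utterances of one phoneme,
# each a sequence of 13-dimensional MFCC frames (random here for illustration).
rng = np.random.default_rng(0)
sequences = [rng.normal(size=(rng.integers(10, 30), 13)) for _ in range(20)]
X = np.concatenate(sequences)                 # hmmlearn expects stacked frames
lengths = [len(s) for s in sequences]         # plus the length of each sequence

# A 3-state HMM with diagonal-covariance Gaussian emissions, trained with EM.
model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
model.fit(X, lengths)

# Score a new observation sequence (Problem 1: log-likelihood of the data under the model).
print(model.score(rng.normal(size=(15, 13))))
```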
6. Limitations of HMMs
• Markov Assumption: HMMs assume that the current state depends only on the previous
state, which limits their ability to capture long-term dependencies in speech.
• Gaussian Mixture Models (GMMs) Limitations: GMMs, traditionally used with HMMs
to model the emission probabilities, struggle with representing complex, non-linear
distributions in speech data.
• Static Features: HMMs often rely on handcrafted features (such as MFCCs), which may
not fully capture the complexities of speech.
7. HMMs vs. Deep Learning Models
While HMMs were once the dominant method in speech recognition, they have largely been
replaced by deep learning-based models, especially for large-vocabulary and real-time speech
recognition systems. Key differences include:
• Deep models learn useful representations directly from data, rather than relying solely on
handcrafted features such as MFCCs.
• Recurrent and attention-based architectures can capture long-range dependencies that the
Markov assumption prevents HMMs from modeling.
• End-to-end models map audio directly to text, removing the need for separately trained
acoustic, pronunciation, and language model components.
Despite these advances, HMMs remain a foundational concept in the history of speech recognition
and are still used in some domains.
HMM-DNN systems
HMM-DNN (Hidden Markov Model-Deep Neural Network) systems represent a hybrid
approach to speech recognition, combining the strengths of traditional HMMs for temporal
modeling with the powerful pattern recognition capabilities of deep neural networks (DNNs). This
hybrid architecture has significantly improved speech recognition performance and was a major
breakthrough before fully end-to-end deep learning models became dominant.
1. Why Combine HMMs and DNNs?
HMMs are excellent for modeling the sequential nature of speech, particularly for modeling
phonemes and their transitions over time. However, HMMs traditionally rely on Gaussian
Mixture Models (GMMs) for modeling the emission probabilities (the probability of observing
a particular speech feature given a hidden state). GMMs have several limitations:
• They struggle to represent complex, non-linear distributions in high-dimensional acoustic
data.
• They are typically trained generatively rather than discriminatively, so they do not directly
optimize classification accuracy.
• They need many mixture components (and therefore many parameters) to cover the
variability introduced by different speakers, accents, and noise conditions.
On the other hand, Deep Neural Networks (DNNs) excel at learning complex patterns in data
and can model the relationship between acoustic features and phonetic states more effectively. By
replacing GMMs with DNNs, we get a system that can better estimate the emission probabilities,
thus improving the overall performance of the speech recognition system.
2. HMM-DNN Architecture
• HMM: The HMM handles the temporal dependencies in speech by modeling transitions
between hidden states (phonemes or sub-phonemes). It defines the sequence of phonemes
over time.
• DNN: The DNN models the emission probabilities, replacing the traditional Gaussian
Mixture Models. The DNN is trained to predict the likelihood of each HMM state
(phoneme or sub-phoneme) given an input acoustic feature (e.g., MFCC, log-Mel
spectrogram).
The recognition pipeline proceeds as follows:
1. Feature Extraction: Acoustic features (e.g., MFCC, log-Mel spectrogram) are extracted
from the input speech signal.
2. DNN Acoustic Model: The extracted features are fed into a trained DNN, which outputs
the posterior probabilities of different HMM states (phonemes or sub-phonemes) for each
time frame.
o The DNN is typically a fully connected feedforward network with multiple hidden
layers that can learn complex mappings from acoustic features to phonetic states.
3. HMM Temporal Modeling: The HMM takes the state posterior probabilities from the
DNN and uses them to model the sequence of states over time, calculating the likelihood
of different phoneme sequences and finding the best match using the Viterbi algorithm.
The key innovation is that the DNN improves the modeling of the acoustic features, while the
HMM still handles the temporal dynamics of speech.
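A minimal PyTorch sketch of the DNN acoustic model described above: a feedforward classifier
mapping one acoustic feature frame to posteriors over tied HMM states. The feature dimension
(39), number of states (2000), and layer sizes are illustrative assumptions, not values from the text:

```python
import torch
import torch.nn as nn

class DNNAcousticModel(nn.Module):
    """Feedforward network mapping one acoustic feature frame to HMM-state posteriors."""
    def __init__(self, feat_dim=39, num_states=2000, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_states),   # logits over tied HMM states
        )

    def forward(self, x):
        return self.net(x)

model = DNNAcousticModel()
features = torch.randn(32, 39)                       # a batch of 32 feature frames
targets = torch.randint(0, 2000, (32,))              # frame-level state labels from alignment
loss = nn.CrossEntropyLoss()(model(features), targets)   # frame-level classification loss
posteriors = torch.softmax(model(features), dim=-1)  # per-frame state posteriors
```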
3. Training an HMM-DNN System
The training process for an HMM-DNN system typically follows these steps:
• First, a conventional HMM-GMM system is trained and used to produce frame-level state
alignments (forced alignment) for the training data, so that each feature frame is labeled
with an HMM state.
• Once the frame-level alignments are obtained, the DNN is trained as a classifier. The input
to the DNN is the acoustic feature vector (e.g., MFCC or spectrogram), and the target
output is the HMM state (phoneme or sub-phoneme) label for that frame.
• The DNN learns to predict the posterior probability of each HMM state, given the input
feature vector.
• After the DNN is trained, it is used to predict the posterior probabilities of HMM states for
new, unseen speech data.
• These posterior probabilities are then converted into likelihoods (using Bayes’ rule) and
used by the HMM to perform speech recognition by finding the most likely sequence of
phonemes that correspond to the input speech.
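A small NumPy sketch of this posterior-to-likelihood conversion, assuming state priors estimated
from the frame-level alignment counts (the division by the prior is the Bayes' rule step mentioned
above; all arrays here are placeholders):

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, priors, eps=1e-10):
    """Bayes' rule: p(frame | state) is proportional to P(state | frame) / P(state)."""
    return np.log(posteriors + eps) - np.log(priors + eps)   # log scaled likelihoods

T, S = 100, 2000
posteriors = np.random.dirichlet(np.ones(S), size=T)   # placeholder DNN outputs, shape (T, S)
priors = np.random.dirichlet(np.ones(S))               # placeholder state priors from alignments
log_likelihoods = posteriors_to_scaled_likelihoods(posteriors, priors)  # fed to HMM decoding
```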
The hybrid system typically improves over GMM-based systems due to the superior ability of
DNNs to model complex, high-dimensional acoustic data.
4. Advantages of HMM-DNN Systems
1. Improved Acoustic Modeling:
o DNNs are far better than GMMs at modeling complex, non-linear relationships in
the data. They can capture subtle variations in the speech signal, such as speaker
differences, accents, and noise conditions.
2. Ability to Model Large Feature Spaces:
o DNNs can handle high-dimensional input data such as log-Mel spectrograms,
which provide more detailed acoustic information compared to traditional features
like MFCC.
3. Discriminative Training:
o Unlike GMMs, which are typically trained generatively, DNNs are trained
discriminatively, meaning they directly optimize for classification accuracy. This
leads to better performance in recognizing phonemes and words.
4. Better Generalization:
o DNNs can generalize better across different speakers and environmental conditions
because of their ability to learn complex patterns in the training data.
5. Challenges of HMM-DNN Systems
1. Training Complexity:
o Training a DNN involves tuning many hyperparameters, such as learning rate,
number of hidden layers, number of units per layer, and regularization techniques.
This can be computationally intensive, especially for large datasets.
2. Data Requirements:
o DNNs require large amounts of labeled data to achieve good performance. In
speech recognition, this typically means having a large corpus of transcribed
speech data.
3. Computational Resources:
o DNNs are more computationally demanding than GMMs, requiring powerful
hardware (e.g., GPUs) for training and, in some cases, for real-time inference.
6. HMM-DNN vs. End-to-End Models
While HMM-DNN hybrid systems were a significant step forward, they have been gradually
replaced by end-to-end deep learning approaches in modern speech recognition systems. Here is
how they compare:
HMM-DNN Systems:
• Separation of Components: The acoustic model (DNN) and the temporal model (HMM)
are separate, with the DNN predicting state posteriors and the HMM handling sequence
decoding.
• Frame-by-Frame Modeling: HMM-DNN systems still rely on frame-by-frame modeling
of the speech signal, and the temporal dependencies are modeled with HMMs.
• Legacy: They offer a practical way to transition from HMM-GMM systems and
incorporate deep learning.
End-to-End Models:
• A single neural network maps acoustic features (or even raw audio) directly to characters
or words, with no separate HMM, pronunciation lexicon, or frame-level alignment step.
• Temporal dependencies are learned by the network itself (for example, with recurrent or
attention layers) rather than modeled explicitly with HMM state transitions.
• They simplify the training pipeline but typically require large amounts of transcribed
speech to reach their full accuracy.
7. HMM-DNN in Practice
Despite the rise of end-to-end systems, HMM-DNN systems are still used in various applications,
especially where:
• There is a need to leverage existing HMM-GMM infrastructure, and full transition to end-
to-end models is not feasible.
• Smaller or more specialized systems where DNNs can still improve over GMMs, but a full
deep learning pipeline is unnecessary.
HMM-DNN systems were the dominant speech recognition architecture in the 2010s and were
used in systems like Google Voice, Siri, and Cortana before end-to-end models began taking over.