
CCS369 - TEXT AND SPEECH ANALYSIS


UNIT V
AUTOMATIC SPEECH RECOGNITION
Speech recognition: Acoustic modelling – Feature Extraction – HMM, HMM-DNN systems

Acoustic modeling is a crucial component of speech recognition systems: it represents the relationship between linguistic units of speech (such as phonemes or words) and the corresponding audio signal. It focuses on how to statistically model the way phonetic units are produced in various contexts, including differences in speakers, accents, and environmental noise.

Key concepts in acoustic modeling for speech recognition:

1. Phonemes

Phonemes are the smallest units of sound in a language, and acoustic models attempt to recognize
these by mapping the audio signal to the corresponding phonetic sounds. For example, the words
"cat" and "bat" differ by just one phoneme: /k/ and /b/.

2. Feature Extraction

Before building acoustic models, speech data is processed to extract key features. These features
are typically derived using techniques such as:

• Mel-Frequency Cepstral Coefficients (MFCCs): One of the most commonly used features, capturing important spectral properties of sound.
• Linear Predictive Coding (LPC): Used to approximate the human vocal tract.
• Spectrograms: Visual representations of the frequencies in the signal over time.

3. Hidden Markov Models (HMMs)

HMMs have been the backbone of many traditional speech recognition systems. They model
sequences of sounds probabilistically, representing transitions between phonemes or states over
time. Each state in the HMM is associated with a distribution that represents the likelihood of
observing a particular feature vector.

• States: Each state corresponds to a part of the phoneme.
• Transition probabilities: Capture the likelihood of moving from one state to another.
• Emission probabilities: The probability of the observed feature vector given a state.


4. Gaussian Mixture Models (GMMs)

GMMs are used to model the distribution of the acoustic features associated with each HMM state.
A GMM is a weighted sum of several Gaussian distributions and helps capture the variability in
speech signals for a particular phoneme.
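
As a concrete illustration, the sketch below fits a GMM to the feature vectors aligned to a single HMM state using scikit-learn's GaussianMixture; the random matrix X is only a placeholder for real MFCC frames obtained from forced alignment.

    # Minimal sketch: a GMM as the emission model for one HMM state.
    # X stands in for the (num_frames x 13) MFCC vectors aligned to this state.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.randn(500, 13)                   # placeholder MFCC frames

    gmm = GaussianMixture(n_components=8,          # 8 Gaussians in the mixture
                          covariance_type="diag")  # diagonal covariances, common in ASR
    gmm.fit(X)

    # Per-frame log-likelihoods log p(o_t | state): these act as the HMM
    # emission scores for this state during decoding.
    log_likelihoods = gmm.score_samples(X)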

5. Deep Neural Networks (DNNs)

Modern speech recognition systems have largely moved from HMM-GMM-based systems to
Deep Neural Networks (DNNs) and more advanced models like:

• Convolutional Neural Networks (CNNs): Effective at capturing local temporal and frequency patterns in the speech signal.
• Recurrent Neural Networks (RNNs): Can model temporal sequences, making them ideal for capturing the temporal nature of speech.
• Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs): Extensions of RNNs that are particularly good at remembering long-term dependencies.

6. Context-Dependent Models

In practical systems, context-dependent acoustic models are often used. These models account for coarticulation, the phenomenon where the pronunciation of a phoneme depends on the surrounding phonemes. For example, the /t/ in "top" (aspirated) differs from the /t/ in "stop", where the preceding /s/ suppresses the aspiration.

7. Training Acoustic Models

Acoustic models are trained using large datasets of transcribed speech. The training process
typically involves:

• Labeling: Aligning audio with text transcriptions to create training examples.
• Supervised learning: Using algorithms like Expectation-Maximization (for GMMs) or backpropagation (for DNNs) to learn model parameters.
• Evaluation: Testing the model's performance on unseen data to ensure it generalizes well.

8. Acoustic Model Adaptation

Speech recognition systems often need to adapt to new speakers, environments, or languages. Techniques like Maximum Likelihood Linear Regression (MLLR) or speaker adaptive training can be used to fine-tune acoustic models for specific speakers or conditions.

9. Applications of Acoustic Models

Acoustic models are used in various speech recognition systems, including:

• Voice Assistants (e.g., Siri, Alexa, Google Assistant)
• Speech-to-Text systems
• Automatic captioning
• Transcription services


In modern systems, combining acoustic models with language models (which capture the
structure and probability of word sequences) results in highly accurate speech recognition systems.

FEATURE EXTRACTION

Feature extraction is a fundamental step in speech processing, particularly in speech recognition systems. The goal of feature extraction is to transform the raw audio signal into a set of concise, informative representations (features) that retain the essential characteristics of speech and discard irrelevant information, such as background noise. These features are then used as inputs for further processing, such as pattern recognition or machine learning models.

An overview of feature extraction in the context of speech recognition:

1. Understanding the Raw Speech Signal

• Time-Domain Representation: The speech signal is captured as a time-varying waveform, which records the variation of air pressure caused by human speech.
• Challenges: The raw waveform contains too much information, including irrelevant details like noise and speaker-specific traits. Directly using the raw signal as input to machine learning models can be inefficient and error-prone.

To address this, feature extraction techniques convert the time-domain signal into a more
manageable and informative form, typically in the frequency domain.

2. Steps in Feature Extraction

Step 1: Pre-Emphasis

• The high-frequency components of speech often carry important information, but they tend
to have less energy than lower frequencies. Pre-emphasis is a filtering step that amplifies
the high-frequency components of the signal.
• This step applies a high-pass filter, which can help reduce noise and balance the frequency
content of the signal.

Step 2: Framing

• Speech is a non-stationary signal, meaning its characteristics change over time. To deal
with this, the signal is divided into short segments called frames. Each frame is treated as
stationary (constant over time) for processing.
• Typically, frames are around 20–40 milliseconds long with some overlap (e.g., 50%)
between adjacent frames to capture smooth transitions.


Step 3: Windowing

• Each frame is multiplied by a window function (such as a Hamming or Hanning window) to taper the edges of the frame and reduce spectral leakage. This prevents discontinuities at the boundaries between frames.

Step 4: Fourier Transform

• To convert each frame from the time domain to the frequency domain, the Fast Fourier
Transform (FFT) is applied. This step decomposes the signal into its frequency
components, which is more informative for speech analysis.
• The resulting frequency domain representation reveals how the energy of the signal is
distributed across different frequencies.
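
The four steps above can be sketched with NumPy alone. The pre-emphasis coefficient (0.97), 25 ms frames, and 10 ms hop below are typical textbook values rather than values fixed by these notes.

    # Front-end sketch: pre-emphasis -> framing -> windowing -> FFT.
    import numpy as np

    sr = 16000
    signal = np.random.randn(sr)                 # placeholder: 1 second of audio at 16 kHz

    # Step 1: pre-emphasis, y[n] = x[n] - 0.97 * x[n-1], boosts high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Step 2: framing into 25 ms frames with a 10 ms hop (overlapping frames).
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)
    num_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(num_frames)])

    # Step 3: windowing with a Hamming window to reduce spectral leakage.
    frames *= np.hamming(frame_len)

    # Step 4: FFT of each frame; keep the magnitude spectrum.
    spectrum = np.abs(np.fft.rfft(frames, n=512, axis=1))   # (num_frames, 257)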

3. Key Features for Speech Recognition

1. Mel-Frequency Cepstral Coefficients (MFCC)

• MFCC is the most commonly used feature in speech recognition due to its ability to mimic
the human auditory system.
• Steps to Extract MFCC:
1. Apply the FFT: Convert the signal from the time domain to the frequency domain.
2. Mel-Scale Filter Bank: Pass the frequency spectrum through a series of filters
arranged according to the Mel scale, which reflects human auditory perception.
The Mel scale is a logarithmic transformation of the frequency axis, emphasizing
lower frequencies.
3. Logarithm of Filtered Energies: Take the logarithm of the filter bank outputs to
simulate the human ear’s sensitivity to loudness changes.
4. Discrete Cosine Transform (DCT): Apply the DCT to compress the log filter
bank outputs into a small number of coefficients (typically 12–13). These
coefficients are called the MFCCs and form the feature vector.
• MFCCs capture the broad spectral shape of speech and are highly effective in
distinguishing between different phonemes.
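
In practice these MFCC steps are rarely implemented by hand; the sketch below uses the librosa library (an assumed tool choice, with a placeholder file name) to produce 13 coefficients per frame.

    # MFCC extraction with librosa; the FFT, Mel filter bank, log, and DCT
    # steps described above are performed internally.
    import librosa

    y, sr = librosa.load("speech.wav", sr=16000)        # placeholder file path
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, num_frames)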

2. Linear Predictive Coding (LPC)

• LPC models the human vocal tract and attempts to predict future speech samples based on
past samples. It assumes that speech can be modeled as a linear combination of previous
samples.
• The result is a set of coefficients that represent the vocal tract configuration for each frame
of speech.
• While LPC was widely used in early speech recognition systems, it has been largely
replaced by MFCC due to MFCC’s superior performance.
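
For comparison, a brief LPC sketch is shown below, again using librosa; the analysis order of 12 and the random frame are assumptions chosen only for illustration.

    # LPC coefficients for one 25 ms frame of 16 kHz speech.
    import numpy as np
    import librosa

    frame = np.random.randn(400)               # placeholder: one 25 ms frame
    lpc_coeffs = librosa.lpc(frame, order=12)  # 13 values: leading 1 plus 12 predictor coefficients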


3. Perceptual Linear Prediction (PLP)

• PLP is similar to LPC but incorporates perceptual modeling, such as critical-band filtering,
equal loudness pre-emphasis, and intensity-loudness compression. These steps help align
the features more closely with human auditory perception.
• PLP features are considered more perceptually accurate than LPC.

4. Mel-Spectrogram

• A Mel-spectrogram is a time-frequency representation where the frequency axis is mapped onto the Mel scale. It shows how the energy of the speech signal is distributed over different Mel-frequency bands across time.
• Mel-spectrograms are often used in deep learning models, where they can serve as inputs
for neural networks, especially convolutional neural networks (CNNs).
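
A minimal sketch of computing a log-Mel-spectrogram with librosa follows; the 80 Mel bands and the dB conversion are common choices for neural models, not values specified in these notes.

    # Mel-spectrogram followed by log (dB) compression.
    import librosa

    y, sr = librosa.load("speech.wav", sr=16000)                 # placeholder path
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)  # (80, num_frames) power values
    log_mel = librosa.power_to_db(mel)                           # log-compressed Mel-spectrogram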

5. Delta and Delta-Delta Coefficients

• In addition to static features (e.g., MFCCs), dynamic features are often included to
capture the temporal evolution of speech.
• Delta features represent the change in MFCCs over time (first-order derivative), and
Delta-Delta features represent the rate of change of Delta features (second-order
derivative).
• These dynamic features provide information about the speed and acceleration of speech,
which improves recognition accuracy.
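
The sketch below appends delta and delta-delta coefficients to a static MFCC matrix with librosa, giving the familiar 39-dimensional feature vectors (13 static + 13 delta + 13 delta-delta); the file name is a placeholder.

    # Static MFCCs plus first- and second-order dynamic features.
    import numpy as np
    import librosa

    y, sr = librosa.load("speech.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)              # first-order derivative (delta)
    delta2 = librosa.feature.delta(mfcc, order=2)    # second-order derivative (delta-delta)
    features = np.vstack([mfcc, delta, delta2])      # shape: (39, num_frames)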

6. Chroma Features

• Chroma features represent the energy of different pitch classes in the signal. Although
more commonly used in music analysis, they can be useful for analyzing tonal qualities of
speech.

4. Challenges in Feature Extraction

1. Noise Robustness

• Speech signals often contain background noise, which can degrade the quality of the
extracted features. Techniques like spectral subtraction and Wiener filtering are used to
reduce noise during preprocessing.
• Robust feature extraction methods, such as RASTA filtering (Relative Spectral
Processing), can be employed to reduce the impact of noise.

2. Speaker Variability

• Different speakers have different vocal tract shapes, accents, and speaking styles, which
can affect the extracted features. Techniques like speaker normalization and Vocal Tract
Length Normalization (VTLN) are used to account for this variability.


3. Real-Time Processing

• In real-time speech recognition systems (e.g., virtual assistants), feature extraction must be
performed quickly to allow for real-time feedback. This requires optimizing the feature
extraction pipeline for speed without sacrificing accuracy.

5. Advanced Techniques in Feature Extraction

1. Deep Learning-Based Features

• Modern speech recognition systems increasingly rely on deep learning techniques to directly learn feature representations from raw audio. For instance, convolutional neural networks (CNNs) can automatically learn relevant patterns from spectrograms or raw audio waveforms.
• End-to-End Models: These models bypass traditional feature extraction steps by learning
features in an integrated manner during training. Examples include models based on
WaveNet or transformers.

2. Wavelet Transform

• In addition to the Fourier transform, the wavelet transform can be used to analyze speech
signals. It provides both time and frequency resolution, making it suitable for analyzing
non-stationary signals like speech.

HMM

A Hidden Markov Model (HMM) is a statistical model that is widely used in speech recognition, natural language processing, and various other time-series applications. It is particularly well-suited for modeling sequences where observations are generated by underlying hidden states, which evolve over time.

An overview of Hidden Markov Models and how they relate to speech recognition:

1. Understanding HMMs

An HMM is defined by:

• States: The system is assumed to be in one of several hidden (unobservable) states at any
given time.
• Observations: For each state, an observation (which is visible) is generated according to
a probability distribution specific to that state.
• State transitions: The system transitions between states according to a set of probabilities.
• The objective of an HMM is to model both the transitions between states and the likelihood
of observing certain outputs given those states.


In the context of speech recognition:

• States typically represent phonemes or sub-phonemes (small units of speech).
• Observations are features extracted from the audio signal, such as Mel-Frequency Cepstral Coefficients (MFCCs).
• Transitions model how speech moves from one phoneme to the next over time.

2. Components of an HMM

An HMM is characterized by the following components:

1. Set of States (S):
o The model consists of a finite number of hidden states S = {S_1, S_2, ..., S_N}.
o In speech recognition, these states often correspond to phonemes or sub-phonemes.
2. Observation Sequence (O):
o The observed data (such as the sequence of MFCC vectors) is denoted as O = {O_1, O_2, ..., O_T}, where T is the total number of observations (i.e., time steps or frames).
3. Transition Probabilities (A):
o A = {a_ij} is the matrix of state transition probabilities, where a_ij represents the probability of transitioning from state S_i to state S_j.
o The transition probabilities capture how likely one phoneme is to follow another.
4. Emission Probabilities (B):
o B = {b_j(O_t)} is the set of emission probabilities. These probabilities define the likelihood of observing O_t (e.g., an MFCC vector) given that the system is in state S_j at time t.
o The emission probabilities are typically modeled using Gaussian Mixture Models (GMMs), or in modern systems, by Deep Neural Networks (DNNs).
5. Initial State Distribution (π):
o π = {π_i} represents the probability distribution over the initial states. π_i is the probability of starting in state S_i.
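
To make the five components concrete, here is a toy 3-state phoneme HMM written out as NumPy arrays; all numbers are illustrative, and a single Gaussian per state stands in for the GMM or DNN emission model.

    # Toy left-to-right HMM for one phoneme (states S1, S2, S3).
    import numpy as np

    N = 3                                  # number of hidden states

    pi = np.array([1.0, 0.0, 0.0])         # initial distribution: always start in S1

    A = np.array([[0.7, 0.3, 0.0],         # each state may repeat or move forward,
                  [0.0, 0.7, 0.3],         # but never move backwards
                  [0.0, 0.0, 1.0]])

    # Emission parameters: one 13-dimensional Gaussian per state (placeholder values);
    # a GMM or DNN would normally supply b_j(O_t).
    means = np.zeros((N, 13))
    variances = np.ones((N, 13))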

3. Three Key Problems of HMMs

To apply HMMs in speech recognition (or any other sequence-based tasks), three key problems
need to be solved:

1. The Evaluation Problem (Likelihood Calculation)

• Given the model and a sequence of observations, what is the probability that the
sequence was generated by the model?
• This is solved using the Forward Algorithm, a dynamic programming algorithm that
efficiently computes the likelihood of the observation sequence by summing over all
possible state sequences.
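
A compact NumPy version of the Forward Algorithm is sketched below. It assumes the emission likelihoods b_j(O_t) have already been evaluated into a matrix (called B_eval here, a name introduced only for this sketch); a production implementation would work in log space to avoid underflow.

    # Forward Algorithm: P(O | model) by summing over all state sequences.
    # pi: (N,) initial probabilities, A: (N, N) transitions,
    # B_eval: (T, N) with B_eval[t, j] = b_j(O_t).
    import numpy as np

    def forward(pi, A, B_eval):
        T, N = B_eval.shape
        alpha = np.zeros((T, N))
        alpha[0] = pi * B_eval[0]                      # initialization
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B_eval[t]  # recursion over previous states
        return alpha[-1].sum()                         # termination: total likelihood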


2. The Decoding Problem (State Sequence Inference)

• Given the model and a sequence of observations, what is the most likely sequence of
hidden states that produced the observations?
• This problem is solved using the Viterbi Algorithm, which finds the most likely path of
hidden states by tracking the maximum probability path at each step in the sequence.
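
The Viterbi Algorithm differs from the forward pass only in replacing the sum with a max and keeping back-pointers; the sketch below uses the same pi, A, and B_eval conventions as the forward sketch above.

    # Viterbi Algorithm: most likely hidden state sequence for the observations.
    import numpy as np

    def viterbi(pi, A, B_eval):
        T, N = B_eval.shape
        delta = np.zeros((T, N))                # best path score ending in each state
        psi = np.zeros((T, N), dtype=int)       # back-pointers to the best previous state
        delta[0] = pi * B_eval[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A  # (from-state, to-state) scores
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) * B_eval[t]
        # Backtrack from the most probable final state.
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1]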

3. The Learning Problem (Parameter Estimation)

• Given an observation sequence and a model with unknown parameters, how can we
adjust the model parameters (A, B, and π) to best fit the data?
• This is solved using the Baum-Welch Algorithm (a special case of the Expectation-
Maximization algorithm), which iteratively refines the parameters to maximize the
likelihood of the observed data.

4. HMM in Speech Recognition

In speech recognition, HMMs are used to model sequences of speech sounds (phonemes) and their
temporal dependencies. Here’s how they fit into the process:

1. Phoneme Modeling:
o Each phoneme (or sub-phoneme) in a language is typically modeled as a separate
HMM. For example, the sound "aa" might have its own HMM, with transitions
between states representing different parts of the sound.
o The speech signal is divided into short frames (e.g., 25 milliseconds each), and for
each frame, a feature vector (such as MFCC) is extracted.
2. Recognition Task:
o Given an observation sequence (features extracted from the speech signal), the task
of the HMM is to find the most likely sequence of phonemes (and thus words) that
produced the sequence.
o This involves decoding the hidden state sequence using the Viterbi Algorithm,
which yields the best path through the HMMs corresponding to the phonemes of
the language.
3. Acoustic Modeling:
o In traditional systems, HMMs are combined with Gaussian Mixture Models
(GMMs) to model the emission probabilities. The GMMs estimate the probability
of each observation given the hidden state (phoneme).
o In modern systems, GMMs are often replaced by Deep Neural Networks (DNNs),
which directly model the emission probabilities more accurately.
4. Integration with Language Models:
o HMMs for speech recognition are often combined with language models to
improve accuracy. Language models capture the likelihood of word sequences
(e.g., "the cat" is more likely than "cat the"). The combination of acoustic models
(HMMs) and language models helps predict both phoneme sequences and word
sequences.


5. HMM Training

To train HMMs for speech recognition:

1. Prepare labeled speech data: Each phoneme or word is labeled in the training set, and
feature vectors (e.g., MFCCs) are extracted from the audio.
2. Initialize model parameters: The transition probabilities, emission probabilities, and
initial state probabilities are initialized randomly or based on prior knowledge.
3. Apply the Baum-Welch Algorithm: This algorithm iteratively refines the HMM
parameters to fit the training data, maximizing the likelihood of observing the labeled
speech.

Once trained, the HMM is capable of recognizing phonemes and words in new, unseen speech.
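
One convenient way to run this training loop in practice is the hmmlearn library, which implements Baum-Welch internally; the library choice and the random arrays below (standing in for real MFCC sequences of one phoneme) are assumptions made only for this sketch.

    # Training a 3-state Gaussian HMM on MFCC sequences with hmmlearn.
    import numpy as np
    from hmmlearn import hmm

    # Two "utterances" of the same phoneme, concatenated with their lengths.
    seq1 = np.random.randn(40, 13)            # 40 frames of 13-dimensional MFCCs (placeholder)
    seq2 = np.random.randn(55, 13)
    X = np.vstack([seq1, seq2])
    lengths = [len(seq1), len(seq2)]

    model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)                     # Baum-Welch re-estimation of pi, A, and emissions

    log_prob, states = model.decode(seq1)     # Viterbi decoding of an observation sequence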

6. Limitations of HMMs

• Markov Assumption: HMMs assume that the current state depends only on the previous
state, which limits their ability to capture long-term dependencies in speech.
• Gaussian Mixture Models (GMMs) Limitations: GMMs, traditionally used with HMMs
to model the emission probabilities, struggle with representing complex, non-linear
distributions in speech data.
• Static Features: HMMs often rely on handcrafted features (such as MFCCs), which may
not fully capture the complexities of speech.

7. HMM vs. Modern Deep Learning Methods

While HMMs were once the dominant method in speech recognition, they have largely been
replaced by deep learning-based models, especially for large-vocabulary and real-time speech
recognition systems. Key differences include:

• Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks can model long-term dependencies, addressing the Markov assumption limitation of HMMs.
• End-to-End Models: Models like Connectionist Temporal Classification (CTC) and
Transformer-based architectures perform speech recognition without the need for
separate acoustic models, language models, or HMMs.

Despite these advances, HMMs remain a foundational concept in the history of speech recognition
and are still used in some domains.


HMM-DNN systems
HMM-DNN (Hidden Markov Model-Deep Neural Network) systems represent a hybrid
approach to speech recognition, combining the strengths of traditional HMMs for temporal
modeling with the powerful pattern recognition capabilities of deep neural networks (DNNs). This
hybrid architecture has significantly improved speech recognition performance and was a major
breakthrough before fully end-to-end deep learning models became dominant.

An overview of HMM-DNN systems and their role in speech recognition:

1. Why Combine HMM and DNN?

HMMs are excellent for modeling the sequential nature of speech, particularly for modeling
phonemes and their transitions over time. However, HMMs traditionally rely on Gaussian
Mixture Models (GMMs) for modeling the emission probabilities (the probability of observing
a particular speech feature given a hidden state). GMMs have several limitations:

• Assumptions of Gaussian distribution: GMMs struggle to model complex, non-Gaussian distributions.
• Limited expressiveness: GMMs are not able to capture intricate, high-dimensional relationships in speech features as well as modern deep learning techniques.

On the other hand, Deep Neural Networks (DNNs) excel at learning complex patterns in data
and can model the relationship between acoustic features and phonetic states more effectively. By
replacing GMMs with DNNs, we get a system that can better estimate the emission probabilities,
thus improving the overall performance of the speech recognition system.

2. HMM-DNN Architecture

The hybrid HMM-DNN system combines two main components:

• HMM: The HMM handles the temporal dependencies in speech by modeling transitions
between hidden states (phonemes or sub-phonemes). It defines the sequence of phonemes
over time.
• DNN: The DNN models the emission probabilities, replacing the traditional Gaussian
Mixture Models. The DNN is trained to predict the likelihood of each HMM state
(phoneme or sub-phoneme) given an input acoustic feature (e.g., MFCC, log-Mel
spectrogram).

Here’s how it works:

1. Feature Extraction: Acoustic features (e.g., MFCC, log-Mel spectrogram) are extracted
from the input speech signal.


2. DNN Acoustic Model: The extracted features are fed into a trained DNN, which outputs
the posterior probabilities of different HMM states (phonemes or sub-phonemes) for each
time frame.
o The DNN is typically a fully connected feedforward network with multiple hidden
layers that can learn complex mappings from acoustic features to phonetic states.
3. HMM Temporal Modeling: The HMM takes the state posterior probabilities from the
DNN and uses them to model the sequence of states over time, calculating the likelihood
of different phoneme sequences and finding the best match using the Viterbi algorithm.

The key innovation is that the DNN improves the modeling of the acoustic features, while the
HMM still handles the temporal dynamics of speech.
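
A minimal sketch of such a DNN acoustic model in PyTorch is shown below; the input dimension, layer sizes, and number of HMM states are illustrative assumptions, not values taken from these notes.

    # Feedforward DNN that maps one acoustic feature vector to posterior
    # probabilities over HMM states.
    import torch
    import torch.nn as nn

    NUM_FEATURES = 40          # assumed input size (e.g., log-Mel filter bank values)
    NUM_HMM_STATES = 2000      # assumed number of context-dependent HMM states

    dnn = nn.Sequential(
        nn.Linear(NUM_FEATURES, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Linear(1024, NUM_HMM_STATES),       # unnormalized score per HMM state
    )

    frames = torch.randn(8, NUM_FEATURES)      # a batch of 8 feature frames (placeholder)
    posteriors = torch.softmax(dnn(frames), dim=1)   # P(state | frame) for each frame

During training, these outputs are compared against the frame-level state labels obtained from the HMM-GMM alignment (for example with a cross-entropy loss), as described in the training steps below.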

3. Training HMM-DNN Systems

The training process for an HMM-DNN system typically follows these steps:

Step 1: Frame Alignment with HMM-GMM

• Initially, a traditional HMM-GMM system is trained. This system is used to generate frame-level alignments, which map each frame of the input speech to a specific HMM state (phoneme or sub-phoneme). This provides the labels needed for training the DNN.

Step 2: DNN Training

• Once the frame-level alignments are obtained, the DNN is trained as a classifier. The input
to the DNN is the acoustic feature vector (e.g., MFCC or spectrogram), and the target
output is the HMM state (phoneme or sub-phoneme) label for that frame.
• The DNN learns to predict the posterior probability of each HMM state, given the input
feature vector.

Step 3: Hybrid System Integration

• After the DNN is trained, it is used to predict the posterior probabilities of HMM states for
new, unseen speech data.
• These posterior probabilities are then converted into scaled likelihoods by dividing by the state priors (an application of Bayes’ rule; see the sketch below) and used by the HMM to perform speech recognition by finding the most likely sequence of phonemes that corresponds to the input speech.

The hybrid system typically improves over GMM-based systems due to the superior ability of
DNNs to model complex, high-dimensional acoustic data.
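
The posterior-to-likelihood conversion mentioned above amounts to dividing each DNN posterior by the prior probability of its state (Bayes’ rule, dropping the frame probability, which is constant across states); a small numeric sketch with illustrative values:

    # Scaled likelihoods for hybrid decoding: p(o_t | s) is proportional to p(s | o_t) / p(s).
    import numpy as np

    posteriors = np.array([[0.70, 0.20, 0.10],   # DNN outputs P(state | frame),
                           [0.15, 0.60, 0.25]])  # one row per frame (illustrative values)
    state_priors = np.array([0.50, 0.30, 0.20])  # state frequencies from the training alignments

    # Work in log space; these scaled log-likelihoods replace log b_j(O_t) in Viterbi decoding.
    log_likelihoods = np.log(posteriors) - np.log(state_priors)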

4. Advantages of HMM-DNN Systems

1. Improved Acoustic Modeling:


o DNNs are far better than GMMs at modeling complex, non-linear relationships in
the data. They can capture subtle variations in the speech signal, such as speaker
differences, accents, and noise conditions.
2. Ability to Model Large Feature Spaces:
o DNNs can handle high-dimensional input data such as log-Mel spectrograms,
which provide more detailed acoustic information compared to traditional features
like MFCC.
3. Discriminative Training:
o Unlike GMMs, which are typically trained generatively, DNNs are trained
discriminatively, meaning they directly optimize for classification accuracy. This
leads to better performance in recognizing phonemes and words.
4. Better Generalization:
o DNNs can generalize better across different speakers and environmental conditions
because of their ability to learn complex patterns in the training data.

5. Challenges of HMM-DNN Systems

1. Training Complexity:
o Training a DNN involves tuning many hyperparameters, such as learning rate,
number of hidden layers, number of units per layer, and regularization techniques.
This can be computationally intensive, especially for large datasets.
2. Data Requirements:
o DNNs require large amounts of labeled data to achieve good performance. In
speech recognition, this typically means having a large corpus of transcribed
speech data.
3. Computational Resources:
o DNNs are more computationally demanding than GMMs, requiring powerful
hardware (e.g., GPUs) for training and, in some cases, for real-time inference.

6. HMM-DNN vs. End-to-End Deep Learning Systems

While HMM-DNN hybrid systems were a significant step forward, they have been gradually
replaced by end-to-end deep learning approaches in modern speech recognition systems. Here’s
how they compare:

HMM-DNN Systems:

• Separation of Components: The acoustic model (DNN) and the temporal model (HMM)
are separate, with the DNN predicting state posteriors and the HMM handling sequence
decoding.
• Frame-by-Frame Modeling: HMM-DNN systems still rely on frame-by-frame modeling
of the speech signal, and the temporal dependencies are modeled with HMMs.
• Legacy: They offer a practical way to transition from HMM-GMM systems and
incorporate deep learning.


End-to-End Models:

• Unified Model: In contrast, end-to-end models such as Connectionist Temporal Classification (CTC), Recurrent Neural Networks (RNNs), or Transformer-based models (e.g., DeepSpeech or Wav2Vec) directly map the input speech features to text sequences without requiring a separate HMM.
• Longer Context Modeling: Models like RNNs and LSTMs, or Transformers, can capture
long-range dependencies in the speech signal, which HMMs struggle with due to the
Markov assumption (where the next state depends only on the current state).
• Simplicity: End-to-end systems simplify the architecture by removing the need for
phoneme-based alignment and hybrid modeling.

7. HMM-DNN in Practice

Despite the rise of end-to-end systems, HMM-DNN systems are still used in various applications,
especially where:

• There is a need to leverage existing HMM-GMM infrastructure, and full transition to end-
to-end models is not feasible.
• The system is smaller or more specialized, so that DNNs still improve over GMMs but a full end-to-end deep learning pipeline is unnecessary.

HMM-DNN systems were the dominant speech recognition architecture in the 2010s and were used in systems like Google Voice, Siri, and Cortana before end-to-end models started taking over.
