Automatic Speech Recognition
Dr. Simran Setia
Recap: Statistical Models for Speech Recognition
Traditional HMM/GMM Model
Recap: Statistical Models for Speech Recognition
Lexicon Model: The lexicon model describes how words are pronounced
phonetically. You usually need a custom phoneme set for each language,
handcrafted by expert phoneticians.
Acoustic Model: The acoustic model (AM) models the acoustic patterns of
speech. The job of the acoustic model is to predict which sound or phoneme is
being spoken at each speech segment.
Language Model: The language model (LM) models the statistics of language. It
learns which sequences of words are most likely to be spoken, and its job is to
predict which words will follow on from the current words and with what probability.
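Taken together, these three components implement the standard decoding rule (not shown explicitly on the slide, added here as a brief reminder): find the word sequence W that best explains the acoustic observations X.

```latex
W^{*} \;=\; \arg\max_{W} P(W \mid X)
      \;=\; \arg\max_{W}\;
      \underbrace{P(X \mid W)}_{\text{acoustic model (via the lexicon)}}\;
      \underbrace{P(W)}_{\text{language model}}
```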
Disadvantages of using Traditional Approaches
Each model must be trained independently, which makes the process time- and
labor-intensive.
Experts are needed to build a custom phonetic set in order to boost the model's
accuracy.
Model accuracy is generally low on real-time data.
Prerequisites: Phonetics
Some facts about Phonetics:
Speech can be classified into voiced and voiceless sounds. Don't be misled by the
terms: both produce sound, as shown in this 2-minute video.
For voiced sounds, we tense up our vocal folds. When we exhale air from the lungs, it
pushes the vocal folds open. The airflow speeds up and the pressure at the vocal fold
drops. This closes them again.
These open and close cycles continue and produce a series of sound waves.
But the key component in speaking is the vocal tract, which is composed of the oral and
the nasal cavities. It acts as a resonator. Both voiced and voiceless sounds are further
modulated by articulation, which creates different resonances in the vocal tract.
Prerequisites: Phonetics
Some facts about Phonetics:
Speech sounds are classified into consonants and vowels.
Consonants are sounds that are articulated with a complete or partial closure of the
vocal tract. They can be voiced or voiceless.
To classify a consonant, we ask where and how it is produced. Constrictions can be
made at different places in the vocal tract, and the three major places of articulation are
coronal, dorsal, and labial. Labial consonants mainly involve the lip(s), teeth, and tongue;
coronal consonants are made with the tip or blade of the tongue; and dorsal consonants
use the back of the tongue. Other articulators include the jaw, velum, lips, and mouth.
Prerequisites: Phonetics
Besides which areas are involved in articulation, consonant sounds also depend
on how we articulate them: stops, fricatives, nasals, laterals, trills, taps, flaps,
clicks, affricates, approximants, etc. For example, in "stops", the airstream is
completely obstructed.
Vowels are voiced sounds.
The pronunciation of a vowel can be modeled by the vowel height (how far we
raise the tongue or lower the jaw) and how far we move the tongue to the front
or to the back.
Prerequisites: Phonetics
Phonetics plays a crucial role in Automatic Speech Recognition (ASR) by providing insights into how human speech
sounds are produced, transmitted, and perceived. Here’s how phonetic knowledge helps improve ASR systems:
1. Phoneme-Based Speech Modeling
● Speech is made up of phonemes, the smallest units of sound in a language (e.g., /p/, /b/, /t/).
● ASR systems use phonetic transcription to map spoken words to their corresponding phonemes, improving
accuracy in recognizing different pronunciations.
2. Pronunciation Dictionaries
● Phonetic knowledge helps in creating lexicons (word-to-pronunciation mappings), which ASR systems use to
match spoken input with expected word pronunciations.
● Example: The word "data" may be pronounced as /ˈdeɪ.tə/ or /ˈdæ.tə/, and phonetic models account for such
variations.
3. Acoustic Modeling
● ASR systems use phonetic features (e.g., voicing, place of articulation) to train deep learning models on
speech waveforms.
● Phonetics helps in identifying and classifying sounds based on their acoustic properties, making the system
robust to variations in speech.
Spectrogram: Representation of the Sound Waves
Short Time Fourier Transform of the underlying sound wave
While frequency-domain representations such as the DTFT and the DFT are
useful, both are obtained by summing the time function x[n] from -∞ to ∞. This
means that the DTFT and DFT describe frequency components in the signal
averaged over all time.
Interesting signals like music and speech are characterized by the ways in which
frequency components change over time. (These components could represent
objects such as the phonemes that constitute a spoken word or the individual
notes that constitute a musical composition.)
Spectrogram: Representation of the Sound Waves
The STFT considers only a short-duration segment of a longer signal and
computes its Fourier transform. Typically this is accomplished by multiplying a
longer time function x[n] by a window function w[n] that is brief in duration.
The window used can be either finite or infinite in duration. For speech waveforms
we use a Hamming window, which smoothly tapers the signal at its edges and
minimizes spectral leakage by reducing discontinuities at the window boundaries.
A Hamming window is a raised-cosine window.
Spectrogram: Representation of the Sound Waves
When using the Short-Time Fourier Transform (STFT) for audio signal analysis, there are
several crucial factors to consider. One of them is the width of the analysis window.
Wideband Analysis: With a short window, the STFT offers higher time resolution, allowing you
to capture rapid changes in the audio signal. This is particularly useful for analyzing transient or
rapidly evolving sounds, such as speech. However, a short window results in lower frequency
resolution, which can make it difficult to distinguish between closely spaced frequency
components.
Narrowband Analysis: A long window provides higher frequency resolution, enabling you to
identify small frequency differences in the audio signal. This can be beneficial when analyzing
steady-state sounds, like sustained musical notes or constant background noise. However, a
long window reduces time resolution, making it less suitable for capturing rapidly changing
events in the audio signal.
Equivalently, since each segment is obtained by multiplying the signal by a shifted
window in the time domain, the spectrum of that segment is the spectrum of the
original signal convolved with the Fourier transform of the window function.
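As a rough illustration of the window-length trade-off, the sketch below (Python with SciPy, assumed available; the sampling rate, the chirp test signal, and the window lengths are illustrative choices, not values from the slides) computes two STFTs of the same signal with a short and a long Hamming window.

```python
import numpy as np
from scipy import signal

fs = 16000                                     # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)
x = signal.chirp(t, f0=200, f1=3000, t1=1.0)   # toy signal whose frequency changes over time

# Wideband analysis: short (5 ms) Hamming window -> good time, poor frequency resolution
f_wb, t_wb, S_wb = signal.stft(x, fs=fs, window="hamming", nperseg=80, noverlap=40)

# Narrowband analysis: long (32 ms) Hamming window -> good frequency, poor time resolution
f_nb, t_nb, S_nb = signal.stft(x, fs=fs, window="hamming", nperseg=512, noverlap=256)

print(S_wb.shape, S_nb.shape)  # (freq bins x frames): few bins / many frames vs. the reverse
```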
STFT of a Sine Wave
Feature Extraction: MFCC
One popular audio feature extraction method is Mel-frequency cepstral
coefficients (MFCC), which yield 39 features per frame. The feature count is small
enough to force the model to learn the essential information in the audio. Of the
39 features, 12 parameters are related to the amplitude of the frequencies.
Feature Extraction: MFCC
Preemphasis: Pre-emphasis boosts the amount of energy in the high frequencies.
Higher-frequency components of a signal are more susceptible to noise and
attenuation during transmission, so by boosting them beforehand we improve the
signal-to-noise ratio at the receiver.
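A minimal sketch of the pre-emphasis filter y[n] = x[n] − α·x[n−1]; the coefficient α = 0.97 is a commonly used value and is an assumption here, not taken from the slide.

```python
import numpy as np

def preemphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])
```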
Feature Extraction: MFCC
Windowing involves slicing the audio waveform into sliding frames.
Given an audio segment, we use a sliding window 25 ms wide to extract audio
features. If we speak 3 words per second, each with 4 phones, and each phone is
sub-divided into 3 states, then there are 36 states per second, or about 28 ms per
state. So the 25 ms window is about right.
Pronunciation changes according to the articulation before and after a phone.
Each sliding window is about 10 ms apart, so we can capture the dynamics among
frames needed to identify the proper phone.
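A sketch of the framing step just described (25 ms frames, 10 ms hop) with a Hamming window applied to each frame; the 16 kHz sampling rate is an assumed value.

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=25, hop_ms=10):
    """Slice x into overlapping frames and apply a Hamming window to each frame."""
    frame_len = int(fs * frame_ms / 1000)           # 400 samples at 16 kHz
    hop_len = int(fs * hop_ms / 1000)               # 160 samples at 16 kHz
    n_frames = 1 + (len(x) - frame_len) // hop_len  # assumes len(x) >= frame_len
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop_len : i * hop_len + frame_len] * window
                       for i in range(n_frames)])
    return frames                                    # shape: (n_frames, frame_len)
```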
Feature Extraction: MFCC
In addition to the size of the window, we should also take into consideration the
type of the window.
Feature Extraction: MFCC
On the top right is a sound wave in the time domain. It is mainly composed of only
two frequencies. As shown, the frame chopped with a Hamming or Hanning window
preserves the original frequency information better, with less noise, than a
rectangular window.
As shown, for the Hamming and Hanning windows the amplitude drops off near the
edges. (The Hamming window has a slight sudden drop at the edge, while the
Hanning window does not.)
Feature Extraction: MFCC
DFT: Next, we apply the DFT to extract information in the frequency domain.
Mel Filterbank: The Mel scale maps the measured frequency to the frequency we
perceive, reflecting the ear's frequency resolution.
First, we square the output of the DFT. This reflects the power of the speech at
each frequency (|x[k]|²) and we call it the DFT power spectrum. We then apply
triangular Mel-scale filter banks to transform it into a Mel-scale power spectrum. The
output of each Mel-scale power spectrum slot represents the energy from the range
of frequency bands it covers. This mapping is called Mel binning.
In feature extraction, we apply triangular band-pass filters to convert the frequency
information to mimic what a human perceives. The human ear perceives frequencies
non-linearly: it is more sensitive to low frequencies than to high frequencies.
The triangular band-pass filters are wider at higher frequencies to reflect the fact that
human hearing is less sensitive at high frequencies.
All these efforts try to mimic how the basilar membrane in our ear senses the vibration
of sounds. The basilar membrane has about 15,000 hair cells inside the cochlea at birth.
The diagram below shows the frequency response of those hair cells; the curve-shaped
responses are simply approximated by triangles in the Mel filterbank.
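A sketch of how the triangular Mel filterbank and Mel binning could be implemented; the number of filters (26), the FFT size, and the sampling rate are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, fs=16000):
    """Triangular filters equally spaced on the Mel scale (wider at high frequencies in Hz)."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fbank

# Mel binning: DFT power spectrum (|x[k]|^2) of each frame -> Mel-scale power spectrum
# mel_energies = power_spectrum @ mel_filterbank().T
```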
Feature Extraction: MFCC
Log: The Mel filterbank outputs a power spectrum. Keeping the Mel scale in mind, the
next step is to take the log of the power spectrum output. This also reduces
acoustic variations that are not significant for speech recognition.
Cepstrum: "Cepstrum" is the word "spectrum" with its first four letters reversed. Our
next step is to compute the cepstrum, which separates the glottal source and the
filter. (This matches the source-filter model of human sound production that we learnt earlier.)
Feature Extraction: MFCC
Diagram (a) is the spectrum, with the y-axis being the magnitude. Diagram (b)
takes the log of the magnitude. Looking closer, the wave fluctuates about 8 times
between 1000 and 2000; in fact, it fluctuates about 8 times for every 1000 Hz.
That spacing of about 125 Hz is the source vibration of the vocal folds.
As observed, the log spectrum (the first diagram below) is composed of information
related to the phone (the second diagram) and the pitch (the third diagram). The
peaks in the second diagram identify the formants that distinguish phones.
Feature Extraction: MFCC
We will realise this using IDFT.
The solid line on the left diagram is the signal in the frequency domain. It is
composed of the phone information drawn in the dotted line and the pitch
information. After the IDFT (inverse Discrete Fourier Transform), the pitch
information, which ripples with period 1/T in the frequency domain, is transformed
into a peak near quefrency T on the right side.
MFCC just takes the first 12 cepstral values. This is because
● Low-order cepstral coefficients → Represent slow spectral variations (vocal tract).
● High-order cepstral coefficients → Represent rapid variations (source excitation, noise).
We apply the Discrete Cosine Transform (DCT) to the log Mel energies to obtain the cepstral values:

c_n = Σ_{m=1..M} log(S_m) · cos( π·n·(m − 0.5) / M ),   n = 1, …, 12

where S_m is the output of the m-th Mel filter and M is the number of filters.

The PLP features compared next are instead based on linear prediction, where each speech sample s(n) is predicted from the previous p samples and the residual is

e(n) = s(n) − Σ_{k=1..p} a_k · s(n − k)

where:
● p is the order of the predictor
● a_k are the LP coefficients.
● e(n) is the prediction error (residual).
The goal is to minimize the total squared error:

E = Σ_n e(n)²
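A sketch of the final MFCC step described above: take the log of the Mel energies, apply the DCT, and keep the first 12 coefficients. The use of `scipy.fftpack.dct` and the small flooring constant are implementation choices, not prescribed by the slides.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_mel(mel_energies, n_coeffs=12):
    """mel_energies: (n_frames, n_filters) Mel-scale power spectrum per frame."""
    log_mel = np.log(mel_energies + 1e-10)            # log compresses the dynamic range
    cepstra = dct(log_mel, type=2, axis=1, norm="ortho")
    return cepstra[:, 1:n_coeffs + 1]                 # keep 12 low-order coefficients (vocal tract)
```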
MFCC vs PLP
If you prioritize psychoacoustic accuracy: use PLP
If you prioritize low resource computing: use MFCC
If you prioritize both: use hybrid features
Raw Waveform-based Features: Wav2Vec
Unsupervised Learning Approach
Uses CNN
CNN (Convolutional Neural Network)
Image Analysis
If we use MLPs?
For a 100×100 image with 3 channels, each neuron in the first fully connected layer needs 30,000 weights.
Also, spatial information is lost because the image must be flattened.
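A quick illustration of the parameter-count argument (PyTorch is assumed here purely for illustration): a single output neuron of a fully connected layer on the flattened 100×100×3 image already needs 30,000 weights, while a small convolutional layer reuses a few hundred weights across all spatial positions.

```python
import torch.nn as nn

flat = nn.Linear(100 * 100 * 3, 1)        # one output neuron alone needs 30,000 weights (+ bias)
conv = nn.Conv2d(3, 16, kernel_size=3)    # 16 filters: 3*3*3*16 + 16 = 448 parameters, reused spatially

print(sum(p.numel() for p in flat.parameters()))   # 30001
print(sum(p.numel() for p in conv.parameters()))   # 448
```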
CNNs
Filters
Wav2Vec
Unsupervised Learning for Speech
Wav2Vec learns meaningful speech representations from raw audio without requiring
transcriptions.
The model first learns to understand speech sounds and patterns in an unsupervised way
and is then fine-tuned on smaller labeled datasets.
It has two phases:
Pre-Training: The model is trained on large, unlabeled speech datasets, where it learns
audio features without explicit supervision.
Fine-Tuning: After pre-training, the model is fine-tuned with a much smaller labeled dataset
(i.e., transcribed speech) to perform ASR tasks effectively.
Wav2Vec Architecture
1. Raw Audio Input
2. Feature Encoder
● Converts the raw waveform into a lower-dimensional representation that
captures important speech features.
● It is realised by a five-layer convolutional network. The encoder layers have
kernel sizes (10, 8, 4, 4, 4) and strides (5, 4, 2, 2, 2). (Inspired by
"Representation Learning with Contrastive Predictive Coding".)
● The output of the encoder is a parameterised feature representation.
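A minimal PyTorch sketch of a feature encoder with the kernel sizes and strides listed above; the 512-channel width, group normalization, and ReLU follow the later slide on layer structure, while the use of plain (non-causal) convolutions and other details are simplifications, so this is not the released wav2vec implementation.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, stride):
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel, stride=stride),
        nn.GroupNorm(1, out_ch),   # group normalization over the channel dimension
        nn.ReLU(),
    )

class FeatureEncoder(nn.Module):
    """Raw waveform -> latent representations z (one vector roughly every 10 ms)."""
    def __init__(self, dim=512):
        super().__init__()
        kernels, strides = (10, 8, 4, 4, 4), (5, 4, 2, 2, 2)
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers.append(conv_block(in_ch, dim, k, s))
            in_ch = dim
        self.net = nn.Sequential(*layers)

    def forward(self, wav):                # wav: (batch, samples)
        return self.net(wav.unsqueeze(1))  # -> (batch, dim, frames)

z = FeatureEncoder()(torch.randn(2, 16000))  # 1 s of 16 kHz audio
print(z.shape)                                # roughly 100 latent frames
```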
Wav2Vec Architecture
3. Context Network
● The context network is applied to the output generated by the encoder network.
● It models long-range dependencies in the audio signal — meaning it helps the
model understand the relationship between different parts of the speech (e.g.,
phonemes, words, and sentences) over time.
● This embedding is no longer just a local acoustic feature — it incorporates
information from the surrounding audio, helping the model understand the
overall speech pattern.
● The feature encoder focuses on extracting “what the audio sounds like” (short-
term features), while the context network captures “what the audio means”
(long-term structure).
Wav2Vec Architecture
The context network consists of multiple stacked temporal convolution layers
Each temporal convolution layer performs:
1. Feature Aggregation:
○ It takes the encoded speech features from the feature encoder (CNN) and applies 1D convolutions
along the time axis.
○ This helps combine short-term acoustic features into higher-level contextualized representations.
2. Long-Range Context Learning:
○ The deeper the convolution stack, the wider the receptive field, meaning later layers capture broader
temporal relationships.
○ This allows the model to understand phonemes, syllables, and even word-level dependencies over
time.
3. Hierarchical Representation Learning:
○ Early layers capture local speech patterns (e.g., phonemes).
○ Deeper layers capture higher-level language structures (e.g., words, phrases).
Wav2Vec Architecture
● Combines multiple latent representations z_i … z_{i−v} into a single
contextualized tensor c_i = g(z_i … z_{i−v}) for a receptive field of size v.
● The context network has nine layers with kernel size three and stride one.
The total receptive field of the context network is about 210 ms.
● The layers in both the encoder and context networks consist of a causal
convolution with 512 channels, a group normalization layer and a ReLU
nonlinearity.
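A matching sketch of the context network (nine convolutions, kernel size 3, stride 1, 512 channels, group norm, ReLU); causality is approximated here by left-padding each layer, which is an implementation assumption.

```python
import torch
import torch.nn as nn

class ContextNetwork(nn.Module):
    """Latents z -> contextualized representations c (total receptive field ≈ 210 ms per the slide)."""
    def __init__(self, dim=512, n_layers=9, kernel=3):
        super().__init__()
        blocks = []
        for _ in range(n_layers):
            blocks += [
                nn.ConstantPad1d((kernel - 1, 0), 0.0),  # left-pad only: causal convolution
                nn.Conv1d(dim, dim, kernel),
                nn.GroupNorm(1, dim),
                nn.ReLU(),
            ]
        self.net = nn.Sequential(*blocks)

    def forward(self, z):        # z: (batch, dim, frames)
        return self.net(z)       # c: same shape, each frame now sees past context

c = ContextNetwork()(torch.randn(2, 512, 98))
print(c.shape)                   # (2, 512, 98)
```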
Wav2Vec Architecture
Contrastive Loss Training
Masking of Latent Representations
● A certain percentage of the latent speech representations are randomly
masked (hidden from the model).
● The goal is to force the model to predict the masked regions using
surrounding speech context.
● This is similar to masked language modeling (MLM) in NLP models like BERT
but applied to speech data.
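A small sketch of what masking the latent representations could look like; the masking probability of 15% and the mask value are assumptions for illustration, not values given on the slide.

```python
import torch

def mask_latents(z, mask_prob=0.15, mask_value=0.0):
    """Randomly hide a fraction of the latent time steps from the model."""
    # z: (batch, dim, frames)
    mask = torch.rand(z.size(0), 1, z.size(2)) < mask_prob   # True where a frame is masked
    z_masked = z.masked_fill(mask, mask_value)
    return z_masked, mask.squeeze(1)                          # the mask marks the prediction targets
```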
Wav2Vec Architecture
Contrastive Learning: Predicting the Correct Representation
● The model is trained to identify the true latent representation of a masked
time step from a set of multiple possible candidates.
● It is presented with:
○ One correct (positive) example: The actual masked representation.
○ Several incorrect (negative) examples: Representations from different parts of the audio or
other samples.
● The model must learn to distinguish the correct representation from the
incorrect ones.
Contrastive Loss Calculation
● The model assigns probabilities to each candidate and is optimized using a contrastive
loss function, which encourages high similarity with the positive example and low
similarity with negative examples
The loss function is based on the InfoNCE (Information Noise Contrastive Estimation) formula:

L_t = −log [ exp(sim(c_t, z_t)/τ) / ( exp(sim(c_t, z_t)/τ) + Σ_{i=1..N} exp(sim(c_t, z_i)/τ) ) ]

where:
● z_t = true latent speech representation (positive sample).
● c_t = contextualized representation predicted by the model.
● z_i = negative samples (distractors).
● sim(z, c) = similarity function (typically cosine similarity).
● τ = temperature parameter (controls the sharpness of the probability distribution).
● N = number of negative samples.
Wav2Vec Architecture
Role of Temperature parameter (typically kept at a lower value of 0.05 to 0.2 for speech tasks)
With a low τ, the model is very confident in distinguishing positives from negatives.
With a high τ, the model is more relaxed, leading to smoother, less extreme probabilities.
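A hedged sketch of this contrastive objective (cosine similarity, temperature τ, one positive and N negatives per masked step), written directly from the InfoNCE form above rather than from any particular library implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, z_pos, z_negs, tau=0.1):
    """
    c_t:    (batch, dim)      context vector at a masked time step
    z_pos:  (batch, dim)      true (masked) latent at that step
    z_negs: (batch, N, dim)   N distractor latents sampled elsewhere
    """
    candidates = torch.cat([z_pos.unsqueeze(1), z_negs], dim=1)       # (batch, N+1, dim)
    sims = F.cosine_similarity(c_t.unsqueeze(1), candidates, dim=-1)  # (batch, N+1)
    logits = sims / tau                                               # temperature sharpens/softens
    targets = torch.zeros(c_t.size(0), dtype=torch.long)              # positive sits at index 0
    return F.cross_entropy(logits, targets)

loss = contrastive_loss(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 10, 512))
print(loss.item())
```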
Wav2Vec: Phase 1 (Pretraining)
Pre-training substantially improves WER in simulated low-resource setups on the
audio data of WSJ compared to wav2letter++ with log-mel filterbank features
(Baseline). Pre-training on the audio data of the full 960 h Librispeech dataset
(wav2vec Libri) performs better than pre-training on the 81 h WSJ dataset
(wav2vec WSJ).
Wav2Vec Architecture: Phase 2 (Fine Tuning)
Fine-tuning is the process of adapting a pre-trained Wav2Vec model to a specific task, such as speech-to-text
transcription. It involves:
● Freezing some layers and updating others.
● Training on labeled data (pairs of speech waveforms and transcriptions).
Why Fine Tuning?
(A) Self-Supervised Pre-Training Alone is Not Enough
● Wav2Vec 1.0 only learns to distinguish speech representations during pre-training.
● It does not learn actual phonemes, words, or sentence structures.
● Fine-tuning is required to map learned speech features to text labels.
(B) Adapting to a Specific Language or Domain
● Wav2Vec 1.0 is pre-trained on generic speech data, but fine-tuning adapts it to a specific language,
accent, or domain.
● Example: If the pre-trained model was trained on English, but you want it for Hindi, fine-tuning on Hindi
speech-to-text data is necessary.
Wav2Vec Architecture: Phase 2 (Fine Tuning)
Procedure
Step 1: Load the Pre-Trained Wav2Vec 1.0 Model
● Use a model that has been pre-trained on unlabeled speech data.
Step 2: Add a New Output Layer
● Replace the last layer with a linear classifier that maps speech features to text labels
(phonemes or characters).
Step 3: Train Using Supervised Data
● Use labeled speech-to-text datasets (e.g., Librispeech for English, Common Voice for other
languages).
● Optimize using CTC Loss (for sequence-to-sequence learning without needing word
boundaries).
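A minimal sketch of the fine-tuning setup described in these steps: a new linear output layer on top of the pre-trained encoder, trained with CTC loss. The `pretrained_model` object, the hidden dimension, and the vocabulary size are placeholders.

```python
import torch
import torch.nn as nn

class Wav2VecForASR(nn.Module):
    def __init__(self, pretrained_model, hidden_dim=512, vocab_size=32):
        super().__init__()
        self.encoder = pretrained_model                # pre-trained feature + context networks
        self.head = nn.Linear(hidden_dim, vocab_size)  # new output layer: features -> characters
        for p in self.encoder.parameters():            # optionally freeze the pre-trained layers
            p.requires_grad = False

    def forward(self, wav):
        feats = self.encoder(wav)                  # (batch, hidden_dim, frames), placeholder interface
        logits = self.head(feats.transpose(1, 2))  # (batch, frames, vocab_size)
        return logits.log_softmax(dim=-1)

ctc = nn.CTCLoss(blank=0)
# Training step (labels and lengths come from the transcribed dataset):
# loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```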