
The diagram outlines the key steps involved in converting spoken language into text using

ASR. Here's a breakdown of the process:

Recorded Speech: This represents the raw audio signal captured by a microphone when
someone speaks. It's a continuous analog waveform that encodes the sound variations over
time.

Signal Analysis: The analog speech signal is converted into a digital format suitable for
processing by a computer. This typically involves analog-to-digital conversion (ADC), where
the signal is sampled at a specific rate and its amplitude values are quantized into discrete
digital values.

Acoustic Model: This component analyzes the characteristics of the digitized speech signal.
It extracts features like Mel-Frequency Cepstral Coefficients (MFCCs) that represent the
speech's spectral information relevant for ASR.

Search Space: This represents the possible sequences of words or sounds that the ASR
system considers as potential matches for the speech input. It can be vast, encompassing all
possible words and word combinations in the target language.

Training Data: To function accurately, the ASR system needs to be trained on a large corpus
of speech data with corresponding text transcripts. This data allows the system to learn the
relationships between acoustic features and words or phonemes (basic units of sound).

Language Model: This component incorporates knowledge about language structure and
grammar. It helps the ASR system choose the most likely word sequence based on the
extracted features and the context of the surrounding words.

Decoded Text (Transcription): After processing the speech signal and considering both the
acoustic features and language rules, the ASR system outputs the recognized text, which is
the transcription of the spoken input.

Here's a table summarizing the main components:

Component                      Description
Recorded Speech                Analog audio signal from the microphone
Signal Analysis                Converts analog speech to digital format and extracts features
Acoustic Model                 Analyzes the spectral characteristics of the speech
Search Space                   All possible word or sound sequences
Training Data                  Speech data with corresponding text transcripts used for training
Language Model                 Encodes language structure and grammar rules
Decoded Text (Transcription)   Recognized text output by the ASR system

Sampling is a crucial step in the process of transforming a continuous analog signal into a
discrete digital signal.

Cut-off frequency: the frequency above which a filter significantly attenuates or reduces the
signal's amplitude.
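As an illustration of these two ideas, here is a minimal sketch, assuming NumPy and SciPy and made-up signal frequencies, of low-pass filtering at a cut-off frequency below the new Nyquist limit before sampling at 16 kHz:

    import numpy as np
    from scipy.signal import butter, filtfilt

    # A finely sampled stand-in for the "continuous" analog signal (values are illustrative).
    fs_fine = 48000
    t = np.arange(0, 1.0, 1.0 / fs_fine)
    signal = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 10000 * t)

    # Low-pass filter with a cut-off below the target Nyquist frequency (16 kHz / 2 = 8 kHz),
    # so content above the cut-off is strongly attenuated before sampling.
    cutoff_hz = 7500
    b, a = butter(N=6, Wn=cutoff_hz / (fs_fine / 2), btype="low")
    filtered = filtfilt(b, a, signal)

    # "Sample" at 16 kHz by keeping every third value (48 kHz / 16 kHz = 3).
    sampled = filtered[::3]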

Speech signal analysis produces a sequence of acoustic feature vectors.

Desirable characteristics for acoustic features used in Automatic Speech Recognition (ASR) are
listed below, followed by a short parameter sketch:

1. Distinguishing Phones: The features should capture enough information to differentiate
between phonemes (basic units of sound) in spoken language. This allows the ASR system to
identify the individual sounds that make up a word.

2. Time Resolution (10 ms): The features should provide good temporal resolution, typically
around 10 milliseconds. This allows the system to capture the rapid changes in speech sounds
over time, which are crucial for distinguishing between similar phonemes.

3. Frequency Resolution (20-40 channels): The features should offer good frequency resolution,
typically represented by 20 to 40 frequency channels. This helps differentiate sounds based on
their spectral content (pitch and harmonics).

4. Separation from F0 (Fundamental Frequency) and Harmonics: The features should ideally be
independent of the speaker's fundamental frequency (F0) and its harmonics. These can vary
significantly between speakers and don't necessarily contribute to distinguishing phonemes.

5. Robustness to Speaker Variation: The features should be resilient to variations in speaker
characteristics like gender, age, or accent. This ensures the ASR system can perform well
across diverse speakers.

6. Robustness to Noise and Channel Distortions: The features should be resistant to background
noise or distortions introduced by the communication channel (e.g., phone calls). This helps
the ASR system function accurately even in less-than-ideal environments.

7. Pattern Recognition Characteristics: The features should be suitable for the pattern
recognition algorithms used in ASR systems. This allows the system to effectively learn the
patterns associated with different phonemes and words.

8. Low Feature Dimension: While capturing enough information is important, a lower feature
dimension is generally desirable. This reduces computational complexity and storage
requirements for the ASR system.

9. Feature Independence (for GMMs): In the context of Gaussian Mixture Models (GMMs), features
should ideally be statistically independent. This simplifies the training process and improves
the performance of GMM-based ASR systems. However, this is not a strict requirement for neural
network (NN)-based approaches.
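As a rough illustration of how these characteristics translate into common extraction settings (about a 10 ms hop, 20-40 Mel channels, a low-dimensional output), here is a minimal sketch assuming the librosa library and a hypothetical file speech.wav; the parameter values are typical choices, not requirements:

    import librosa

    # Load and resample to 16 kHz (file path is a placeholder).
    y, sr = librosa.load("speech.wav", sr=16000)

    mfcc = librosa.feature.mfcc(
        y=y,
        sr=sr,
        n_mfcc=13,        # low feature dimension
        n_fft=400,        # 25 ms analysis window at 16 kHz
        hop_length=160,   # 10 ms time resolution
        n_mels=40,        # 20-40 frequency channels
    )
    print(mfcc.shape)     # (13, number_of_frames)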
A/D conversion

A/D conversion samples the audio clip and digitizes the content, i.e., converting the analog
signal into a discrete representation. A sampling frequency of 8 or 16 kHz is often used.
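A minimal sketch of reading an already digitized recording, assuming SciPy and a placeholder file name; the sampling rate and 16-bit quantization are the choices made at A/D conversion time:

    import numpy as np
    from scipy.io import wavfile

    # wavfile.read returns the sampling rate chosen during A/D conversion
    # and the quantized sample values.
    fs, samples = wavfile.read("recording.wav")
    print(fs)   # e.g. 8000 or 16000 samples per second

    # 16-bit PCM stores each amplitude as an integer in [-32768, 32767];
    # convert to floats in [-1, 1) for the processing steps below.
    if samples.dtype == np.int16:
        samples = samples.astype(np.float32) / 32768.0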


Pre-emphasis

Pre-emphasis boosts the amount of energy in the high frequencies. For voiced segments like
vowels, there is more energy at the lower frequencies than at the higher frequencies. This is
called spectral tilt, and it is related to the glottal source (how the vocal folds produce
sound). Boosting the high-frequency energy makes information in the higher formants more
available to the acoustic model, which improves phone detection accuracy. For humans, hearing
problems often begin when we can no longer hear these high-frequency sounds. Noise also tends
to be high-frequency; in engineering, pre-emphasis makes the system less susceptible to noise
introduced later in the process. For some applications, we simply undo the boosting at the end.

Pre-emphasis uses a filter to boost the higher frequencies. Below are the before and after
signals, showing how the high-frequency content is boosted.

Jurafsky & Martin, fig. 9.9
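A minimal sketch of the pre-emphasis filter, assuming NumPy and the samples array from the A/D sketch above; alpha = 0.97 is a commonly used value, not something fixed by the text:

    import numpy as np

    def pre_emphasis(x, alpha=0.97):
        # First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
        # Boosting the high frequencies counteracts the spectral tilt of voiced speech.
        return np.append(x[0], x[1:] - alpha * x[:-1])

    emphasized = pre_emphasis(samples)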


Windowing

Windowing involves slicing the audio waveform into sliding frames.

We cannot simply chop the signal off at the edge of a frame: the sudden fall in amplitude
creates a lot of noise that shows up at high frequencies. Instead, when slicing the audio, the
amplitude should drop off gradually near the edges of a frame. Let w be the window applied to
the original audio clip in the time domain.

Common choices for w are the Hamming window and the Hanning window. The following diagram
indicates how a sinusoidal waveform is chopped using these windows. As shown, for the Hamming
and Hanning windows the amplitude drops off near the edge. (The Hamming window has a slight
sudden drop at the edge, while the Hanning window does not.)

The corresponding equations for w are:

Hamming: w[n] = 0.54 − 0.46 cos(2πn / (N − 1)), for 0 ≤ n ≤ N − 1
Hanning: w[n] = 0.5 − 0.5 cos(2πn / (N − 1)), for 0 ≤ n ≤ N − 1

On the top right below is a sound wave in the time domain that is mainly composed of only two
frequencies. As shown, frames extracted with the Hamming and Hanning windows preserve the
original frequency information better, with less noise, than a rectangular window.

Top right: a signal composed of two frequencies.
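Here is a minimal framing-and-windowing sketch, assuming NumPy, a 16 kHz signal, and the emphasized array from the pre-emphasis sketch; the 25 ms / 10 ms frame sizes are typical choices:

    import numpy as np

    frame_length = 400   # 25 ms at 16 kHz
    hop_length = 160     # 10 ms hop between successive frames

    # Slice the signal into overlapping sliding frames.
    num_frames = 1 + (len(emphasized) - frame_length) // hop_length
    frames = np.stack([
        emphasized[i * hop_length : i * hop_length + frame_length]
        for i in range(num_frames)
    ])

    # Taper each frame so the amplitude drops off gradually near the edges.
    window = np.hamming(frame_length)   # np.hanning(frame_length) is the alternative
    windowed = frames * window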

Discrete Fourier Transform (DFT)

Next, we apply DFT to extract information in the frequency


domain.
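A minimal sketch, assuming NumPy and the windowed frames from the previous step; 512 is a typical FFT size for 25 ms frames at 16 kHz:

    import numpy as np

    # One-sided DFT of each windowed frame, then the power spectrum x[k]^2.
    n_fft = 512
    spectrum = np.fft.rfft(windowed, n=n_fft)
    power_spectrum = np.abs(spectrum) ** 2   # shape: (num_frames, n_fft // 2 + 1)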
Mel filterbank

As mentioned in the previous article, equipment measurements are not the same as our hearing
perception. For humans, perceived loudness changes with frequency. Also, perceived frequency
resolution decreases as frequency increases, i.e., humans are less sensitive to differences
between higher frequencies. The diagram on the left indicates how the Mel scale maps the
measured frequency to the perceived frequency resolution.


All these mappings are non-linear. In feature extraction, we apply triangular band-pass filters
to convert the frequency information so that it mimics what a human perceives.

First, we square the output of the DFT. This reflects the power of the speech at each frequency
(x[k]²), and we call it the DFT power spectrum. We then apply the triangular Mel-scale filter
banks to transform it into a Mel-scale power spectrum. The output for each Mel-scale slot
represents the energy from the frequency bands that it covers. This mapping is called Mel
binning. The equation for slot m is:

s[m] = Σ_k x[k]² H_m[k], where H_m[k] is the m-th triangular filter

The triangular bandpass filters are wider at the higher frequencies to reflect that human
hearing is less sensitive at high frequencies. Specifically, the filters are spaced linearly
below 1000 Hz and logarithmically above that.
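A minimal Mel-binning sketch, assuming librosa for building the triangular filters (they can also be constructed by hand) and the power_spectrum array from the DFT step:

    import numpy as np
    import librosa

    sr, n_fft, n_mels = 16000, 512, 40

    # Triangular Mel filters: narrow and densely spaced at low frequencies,
    # wider and sparser at high frequencies.
    mel_filters = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)

    # Mel binning: each slot m sums the DFT power it covers, weighted by its triangle.
    mel_power = power_spectrum @ mel_filters.T   # shape: (num_frames, n_mels)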
All these efforts try to mimic how the basilar membrane in our ear senses the vibration of
sounds. The basilar membrane has about 15,000 hair cells inside the cochlea at birth. The
diagram below demonstrates the frequency response of those hair cells; the curve-shaped
responses are simply approximated by triangles in the Mel filterbank.

In short, we imitate how our ears perceive sound through those hair cells by modeling their
response with the triangular filters of the Mel filterbank.

Log

The Mel filterbank outputs a power spectrum. Humans are less sensitive to small energy changes
at high energy levels than to small changes at low energy levels; in fact, the perception is
logarithmic. So the next step takes the log of the output of the Mel filterbank. This also
reduces acoustic variations that are not significant for speech recognition. Next, we need to
address two more requirements: remove the F0 information (the pitch) and make the extracted
features independent of each other.
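A one-line sketch of the log compression step, assuming NumPy and the mel_power array from the previous step; the small constant only guards against log(0):

    import numpy as np

    log_mel = np.log(mel_power + 1e-10)   # log Mel-filterbank energies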

Cepstrum — IDFT

Below is the model of how speech is produced.



Our articulations control the shape of the vocal tract. The source-filter model combines the
vibrations produced by the vocal folds with the filter created by our articulations: the
glottal source waveform is suppressed or amplified at different frequencies by the shape of the
vocal tract.

"Cepstrum" is the word "spectrum" with its first four letters reversed. Our next step is to
compute the cepstrum, which separates the glottal source from the filter. Diagram (a) is the
spectrum, with the y-axis being the magnitude. Diagram (b) takes the log of the magnitude. Look
closer: the wave fluctuates about 8 times between 1000 and 2000, and in fact about 8 times for
every 1000 Hz. That periodicity corresponds to about 125 Hz, the source vibration frequency of
the vocal folds.

Paul Taylor (2008)

As observed, the log spectrum (the first diagram below) is composed of information related to
the phone (the second diagram) and the pitch (the third diagram). The peaks in the second
diagram identify the formants that distinguish phones. But how can we separate them?

Recall that a period in the time or frequency domain is inverted (becomes its reciprocal) after
transformation.

Recall that the pitch information has short periods in the frequency domain. We can apply the
inverse Fourier transform to separate the pitch information from the formants. As shown below,
the pitch information shows up in the middle and on the right side; the peak in the middle
corresponds to F0, while the phone-related information is located on the far left.

Here is another visualization. The solid line in the left diagram is the signal in the
frequency domain. It is composed of the phone information, drawn as the dotted line, and the
pitch information. After the IDFT (inverse Discrete Fourier Transform), the pitch information
with period 1/T is transformed into a peak near T on the right side.
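As an illustration of this separation, here is a minimal cepstral pitch sketch, assuming NumPy, 16 kHz audio, and a voiced frame taken from the windowed frames above; the F0 search range of 50-400 Hz is an assumption:

    import numpy as np

    sr = 16000
    frame = windowed[10]   # some voiced frame (index chosen arbitrarily)

    # Log magnitude spectrum, then its inverse DFT (the cepstrum).
    log_spectrum = np.log(np.abs(np.fft.rfft(frame, n=512)) + 1e-10)
    cepstrum = np.fft.irfft(log_spectrum)

    # The low-index coefficients describe the vocal-tract envelope (formants).
    # A peak further out corresponds to the pitch period; search the typical F0 range.
    lo, hi = sr // 400, sr // 50
    pitch_period = lo + np.argmax(cepstrum[lo:hi])
    f0_estimate = sr / pitch_period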


So for speech recognition, we just need the coefficients on the far left and can discard the
others. In fact, MFCC takes only the first 12 cepstral values. There is another important
property related to these 12 coefficients: the log power spectrum is real and symmetric, so its
inverse DFT is equivalent to a discrete cosine transform (DCT).

The DCT is an orthogonal transformation. Mathematically, the transformation produces
uncorrelated features, so the MFCC features are largely uncorrelated. In machine learning, this
makes them easier to model and to train on. If we model these parameters with a multivariate
Gaussian distribution, all the off-diagonal values in the covariance matrix will be zero.
Mathematically, the output of this stage is:

c[n] = Σ_{m=0}^{M−1} log(s[m]) cos(πn(m + 0.5) / M), for n = 0, 1, ..., 11
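A minimal sketch of this stage, assuming SciPy and the log_mel array from the log step; taking the first 12 coefficients mirrors the description above:

    import numpy as np
    from scipy.fftpack import dct

    # DCT of the log Mel energies; because the log power spectrum is real and
    # symmetric, this is equivalent to the inverse DFT described above.
    cepstral = dct(log_mel, type=2, axis=-1, norm="ortho")
    mfcc_12 = cepstral[:, :12]   # the first 12 cepstral values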

The following is a visualization of the 12 cepstral coefficients.

Dynamic features (delta)

MFCC has 39 features in total. We have covered 12 so far; what are the rest? The 13th parameter
is the energy in each frame, which helps us identify phones.

In pronunciation, context and dynamic information are important. Articulations like stop
closures and releases can be recognized from their formant transitions, and characterizing how
the features change over time provides this context for a phone. Another 13 values are the
delta values d(t), which measure the change in each feature from the previous frame to the next
frame, for example d(t) = (c(t+1) − c(t−1)) / 2. This acts as the first-order derivative of the
features.

The last 13 parameters are the dynamic changes of d(t) from the previous frame to the next
frame. They act as the second-order derivative of c(t).

So the 39 MFCC parameters are the 12 cepstral coefficients plus the energy term, together with
two more sets of 13 corresponding to the delta and double-delta values.
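A minimal sketch of assembling the 39-dimensional vectors, assuming NumPy, the mfcc_12 and windowed arrays from the earlier steps, and a simple symmetric-difference delta (one common formulation, not necessarily the exact one used here):

    import numpy as np

    def delta(features):
        # Symmetric difference over time: d(t) = (c(t+1) - c(t-1)) / 2,
        # with edge frames repeated so the output length matches the input.
        padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
        return (padded[2:] - padded[:-2]) / 2.0

    # 13 static features per frame: 12 cepstral coefficients plus a log-energy term.
    frame_energy = np.log(np.sum(windowed ** 2, axis=1) + 1e-10)
    static = np.hstack([mfcc_12, frame_energy[:, None]])

    d1 = delta(static)            # delta (first-order derivative)
    d2 = delta(d1)                # double delta (second-order derivative)
    mfcc_39 = np.hstack([static, d1, d2])   # 39 features per frame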
