
The diagram outlines the key steps involved in converting spoken language into text using

ASR. Here's a breakdown of the process:

Recorded Speech: This represents the raw audio signal captured by a microphone when
someone speaks. It's a continuous analog waveform that encodes the sound variations over
time.

Signal Analysis: The analog speech signal is converted into a digital format suitable for
processing by a computer. This typically involves analog-to-digital conversion (ADC), where
the signal is sampled at a specific rate and its amplitude values are quantized into discrete
digital values.

Acoustic Model: This component analyzes the characteristics of the digitized speech signal.
It extracts features like Mel-Frequency Cepstral Coefficients (MFCCs) that represent the
speech's spectral information relevant for ASR.

Search Space: This represents the possible sequences of words or sounds that the ASR
system considers as potential matches for the speech input. It can be vast, encompassing all
possible words and word combinations in the target language.

Training Data: To function accurately, the ASR system needs to be trained on a large corpus
of speech data with corresponding text transcripts. This data allows the system to learn the
relationships between acoustic features and words or phonemes (basic units of sound).

Language Model: This component incorporates knowledge about language structure and
grammar. It helps the ASR system choose the most likely word sequence based on the
extracted features and the context of the surrounding words.

Decoded Text (Transcription): After processing the speech signal and considering both the
acoustic features and language rules, the ASR system outputs the recognized text, which is
the transcription of the spoken input.

Here's a table summarizing the main components:

Component                      Description
Recorded Speech                Analog audio signal from the microphone
Signal Analysis                Converts analog speech to digital format and extracts features
Acoustic Model                 Analyzes the spectral characteristics of the speech
Search Space                   All possible word or sound sequences
Training Data                  Speech data with corresponding text transcripts used for training
Language Model                 Encodes language structure and grammar rules
Decoded Text (Transcription)   Recognized text output by the ASR system

Sampling is a crucial step in the process of transforming a continuous analog signal into a
discrete digital signal.

Cut-off frequency: the frequency above which a filter significantly attenuates or reduces the
signal's amplitude.
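As an illustration of these two ideas, here is a minimal sketch, assuming NumPy and SciPy and made-up signal frequencies, of low-pass filtering at a cut-off frequency below the new Nyquist limit before sampling at 16 kHz:

    import numpy as np
    from scipy.signal import butter, filtfilt

    # A finely sampled stand-in for the "continuous" analog signal (values are illustrative).
    fs_fine = 48000
    t = np.arange(0, 1.0, 1.0 / fs_fine)
    signal = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 10000 * t)

    # Low-pass filter with a cut-off below the target Nyquist frequency (16 kHz / 2 = 8 kHz),
    # so content above the cut-off is strongly attenuated before sampling.
    cutoff_hz = 7500
    b, a = butter(N=6, Wn=cutoff_hz / (fs_fine / 2), btype="low")
    filtered = filtfilt(b, a, signal)

    # "Sample" at 16 kHz by keeping every third value (48 kHz / 16 kHz = 3).
    sampled = filtered[::3]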

Speech signal analysis produces a sequence of acoustic feature vectors.

Desirable characteristics for acoustic features used in Automatic Speech Recognition (ASR) are
listed below, followed by a short parameter sketch:

1. Distinguishing Phones: The features should capture enough information to differentiate
between phonemes (basic units of sound) in spoken language. This allows the ASR system to
identify the individual sounds that make up a word.

2. Time Resolution (10 ms): The features should provide good temporal resolution, typically
around 10 milliseconds. This allows the system to capture the rapid changes in speech sounds
over time, which are crucial for distinguishing between similar phonemes.

3. Frequency Resolution (20-40 channels): The features should offer good frequency resolution,
typically represented by 20 to 40 frequency channels. This helps differentiate sounds based on
their spectral content (pitch and harmonics).

4. Separation from F0 (Fundamental Frequency) and Harmonics: The features should ideally be
independent of the speaker's fundamental frequency (F0) and its harmonics. These can vary
significantly between speakers and don't necessarily contribute to distinguishing phonemes.

5. Robustness to Speaker Variation: The features should be resilient to variations in speaker
characteristics like gender, age, or accent. This ensures the ASR system can perform well
across diverse speakers.

6. Robustness to Noise and Channel Distortions: The features should be resistant to background
noise or distortions introduced by the communication channel (e.g., phone calls). This helps
the ASR system function accurately even in less-than-ideal environments.

7. Pattern Recognition Characteristics: The features should be suitable for the pattern
recognition algorithms used in ASR systems. This allows the system to effectively learn the
patterns associated with different phonemes and words.

8. Low Feature Dimension: While capturing enough information is important, a lower feature
dimension is generally desirable. This reduces computational complexity and storage
requirements for the ASR system.

9. Feature Independence (for GMMs): In the context of Gaussian Mixture Models (GMMs), features
should ideally be statistically independent. This simplifies the training process and improves
the performance of GMM-based ASR systems. However, this is not a strict requirement for neural
network (NN)-based approaches.
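As a rough illustration of how these characteristics translate into common extraction settings (about a 10 ms hop, 20-40 Mel channels, a low-dimensional output), here is a minimal sketch assuming the librosa library and a hypothetical file speech.wav; the parameter values are typical choices, not requirements:

    import librosa

    # Load and resample to 16 kHz (file path is a placeholder).
    y, sr = librosa.load("speech.wav", sr=16000)

    mfcc = librosa.feature.mfcc(
        y=y,
        sr=sr,
        n_mfcc=13,        # low feature dimension
        n_fft=400,        # 25 ms analysis window at 16 kHz
        hop_length=160,   # 10 ms time resolution
        n_mels=40,        # 20-40 frequency channels
    )
    print(mfcc.shape)     # (13, number_of_frames)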
A/D conversion

A/D conversion samples the audio clip and digitizes the content, i.e., converting the analog
signal into a discrete representation. A sampling frequency of 8 or 16 kHz is often used.
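A minimal sketch of reading an already digitized recording, assuming SciPy and a placeholder file name; the sampling rate and 16-bit quantization are the choices made at A/D conversion time:

    import numpy as np
    from scipy.io import wavfile

    # wavfile.read returns the sampling rate chosen during A/D conversion
    # and the quantized sample values.
    fs, samples = wavfile.read("recording.wav")
    print(fs)   # e.g. 8000 or 16000 samples per second

    # 16-bit PCM stores each amplitude as an integer in [-32768, 32767];
    # convert to floats in [-1, 1) for the processing steps below.
    if samples.dtype == np.int16:
        samples = samples.astype(np.float32) / 32768.0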


Pre-emphasis

Pre-emphasis boosts the amount of energy in the high frequencies. For voiced segments like
vowels, there is more energy at the lower frequencies than at the higher frequencies. This is
called spectral tilt, and it is related to the glottal source (how the vocal folds produce
sound). Boosting the high-frequency energy makes information in the higher formants more
available to the acoustic model, which improves phone detection accuracy. For humans, hearing
problems often begin when we can no longer hear these high-frequency sounds. Noise also tends
to be high-frequency; in engineering, pre-emphasis makes the system less susceptible to noise
introduced later in the process. For some applications, we simply undo the boosting at the end.

Pre-emphasis uses a filter to boost the higher frequencies. Below are the before and after
signals, showing how the high-frequency content is boosted.

Jurafsky & Martin, fig. 9.9
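A minimal sketch of the pre-emphasis filter, assuming NumPy and the samples array from the A/D sketch above; alpha = 0.97 is a commonly used value, not something fixed by the text:

    import numpy as np

    def pre_emphasis(x, alpha=0.97):
        # First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
        # Boosting the high frequencies counteracts the spectral tilt of voiced speech.
        return np.append(x[0], x[1:] - alpha * x[:-1])

    emphasized = pre_emphasis(samples)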


Windowing

Windowing involves slicing the audio waveform into sliding frames.

We cannot simply chop the signal off at the edge of a frame: the sudden fall in amplitude
creates a lot of noise that shows up at high frequencies. Instead, when slicing the audio, the
amplitude should drop off gradually near the edges of a frame. Let w be the window applied to
the original audio clip in the time domain.

Common choices for w are the Hamming window and the Hanning window. The following diagram
indicates how a sinusoidal waveform is chopped using these windows. As shown, for the Hamming
and Hanning windows the amplitude drops off near the edge. (The Hamming window has a slight
sudden drop at the edge, while the Hanning window does not.)

The corresponding equations for w are:

Hamming: w[n] = 0.54 − 0.46 cos(2πn / (N − 1)), for 0 ≤ n ≤ N − 1
Hanning: w[n] = 0.5 − 0.5 cos(2πn / (N − 1)), for 0 ≤ n ≤ N − 1

On the top right below is a sound wave in the time domain that is mainly composed of only two
frequencies. As shown, frames extracted with the Hamming and Hanning windows preserve the
original frequency information better, with less noise, than a rectangular window.

Top right: a signal composed of two frequencies.
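Here is a minimal framing-and-windowing sketch, assuming NumPy, a 16 kHz signal, and the emphasized array from the pre-emphasis sketch; the 25 ms / 10 ms frame sizes are typical choices:

    import numpy as np

    frame_length = 400   # 25 ms at 16 kHz
    hop_length = 160     # 10 ms hop between successive frames

    # Slice the signal into overlapping sliding frames.
    num_frames = 1 + (len(emphasized) - frame_length) // hop_length
    frames = np.stack([
        emphasized[i * hop_length : i * hop_length + frame_length]
        for i in range(num_frames)
    ])

    # Taper each frame so the amplitude drops off gradually near the edges.
    window = np.hamming(frame_length)   # np.hanning(frame_length) is the alternative
    windowed = frames * window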

Discrete Fourier Transform (DFT)

Next, we apply DFT to extract information in the frequency


domain.
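A minimal sketch, assuming NumPy and the windowed frames from the previous step; 512 is a typical FFT size for 25 ms frames at 16 kHz:

    import numpy as np

    # One-sided DFT of each windowed frame, then the power spectrum x[k]^2.
    n_fft = 512
    spectrum = np.fft.rfft(windowed, n=n_fft)
    power_spectrum = np.abs(spectrum) ** 2   # shape: (num_frames, n_fft // 2 + 1)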
Mel filterbank

As mentioned in the previous article, equipment measurements are not the same as our hearing
perception. For humans, perceived loudness changes with frequency. Also, perceived frequency
resolution decreases as frequency increases, i.e., humans are less sensitive to differences
between higher frequencies. The diagram on the left indicates how the Mel scale maps the
measured frequency to the perceived frequency resolution.


All these mappings are non-linear. In feature extraction, we apply triangular band-pass filters
to convert the frequency information so that it mimics what a human perceives.

First, we square the output of the DFT. This reflects the power of the speech at each frequency
(x[k]²), and we call it the DFT power spectrum. We then apply the triangular Mel-scale filter
banks to transform it into a Mel-scale power spectrum. The output for each Mel-scale slot
represents the energy from the frequency bands that it covers. This mapping is called Mel
binning. The equation for slot m is:

s[m] = Σ_k x[k]² H_m[k], where H_m[k] is the m-th triangular filter

The triangular bandpass filters are wider at the higher frequencies to reflect that human
hearing is less sensitive at high frequencies. Specifically, the filters are spaced linearly
below 1000 Hz and logarithmically above that.
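A minimal Mel-binning sketch, assuming librosa for building the triangular filters (they can also be constructed by hand) and the power_spectrum array from the DFT step:

    import numpy as np
    import librosa

    sr, n_fft, n_mels = 16000, 512, 40

    # Triangular Mel filters: narrow and densely spaced at low frequencies,
    # wider and sparser at high frequencies.
    mel_filters = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)

    # Mel binning: each slot m sums the DFT power it covers, weighted by its triangle.
    mel_power = power_spectrum @ mel_filters.T   # shape: (num_frames, n_mels)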
All these efforts try to mimic how the basilar membrane in our ear senses the vibration of
sounds. The basilar membrane has about 15,000 hair cells inside the cochlea at birth. The
diagram below demonstrates the frequency response of those hair cells; the curve-shaped
responses are simply approximated by triangles in the Mel filterbank.

In short, we imitate how our ears perceive sound through those hair cells by modeling their
response with the triangular filters of the Mel filterbank.

Log

The Mel filterbank outputs a power spectrum. Humans are less sensitive to small energy changes
at high energy levels than to small changes at low energy levels; in fact, the perception is
logarithmic. So the next step takes the log of the output of the Mel filterbank. This also
reduces acoustic variations that are not significant for speech recognition. Next, we need to
address two more requirements: remove the F0 information (the pitch) and make the extracted
features independent of each other.
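A one-line sketch of the log compression step, assuming NumPy and the mel_power array from the previous step; the small constant only guards against log(0):

    import numpy as np

    log_mel = np.log(mel_power + 1e-10)   # log Mel-filterbank energies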

Cepstrum — IDFT

Below is the model of how speech is produced.



Our articulations control the shape of the vocal tract. The source-filter model combines the
vibrations produced by the vocal folds with the filter created by our articulations: the
glottal source waveform is suppressed or amplified at different frequencies by the shape of the
vocal tract.

"Cepstrum" is the word "spectrum" with its first four letters reversed. Our next step is to
compute the cepstrum, which separates the glottal source from the filter. Diagram (a) is the
spectrum, with the y-axis being the magnitude. Diagram (b) takes the log of the magnitude. Look
closer: the wave fluctuates about 8 times between 1000 and 2000, and in fact about 8 times for
every 1000 Hz. That periodicity corresponds to about 125 Hz, the source vibration frequency of
the vocal folds.

Paul Taylor (2008)

As observed, the log spectrum (the first diagram below) is composed of information related to
the phone (the second diagram) and the pitch (the third diagram). The peaks in the second
diagram identify the formants that distinguish phones. But how can we separate them?

Recall that a period in the time or frequency domain is inverted (becomes its reciprocal) after
transformation.

Recall that the pitch information has short periods in the frequency domain. We can apply the
inverse Fourier transform to separate the pitch information from the formants. As shown below,
the pitch information shows up in the middle and on the right side; the peak in the middle
corresponds to F0, while the phone-related information is located on the far left.

Here is another visualization. The solid line in the left diagram is the signal in the
frequency domain. It is composed of the phone information, drawn as the dotted line, and the
pitch information. After the IDFT (inverse Discrete Fourier Transform), the pitch information
with period 1/T is transformed into a peak near T on the right side.
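As an illustration of this separation, here is a minimal cepstral pitch sketch, assuming NumPy, 16 kHz audio, and a voiced frame taken from the windowed frames above; the F0 search range of 50-400 Hz is an assumption:

    import numpy as np

    sr = 16000
    frame = windowed[10]   # some voiced frame (index chosen arbitrarily)

    # Log magnitude spectrum, then its inverse DFT (the cepstrum).
    log_spectrum = np.log(np.abs(np.fft.rfft(frame, n=512)) + 1e-10)
    cepstrum = np.fft.irfft(log_spectrum)

    # The low-index coefficients describe the vocal-tract envelope (formants).
    # A peak further out corresponds to the pitch period; search the typical F0 range.
    lo, hi = sr // 400, sr // 50
    pitch_period = lo + np.argmax(cepstrum[lo:hi])
    f0_estimate = sr / pitch_period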


So for speech recognition, we just need the coefficients on the far left and can discard the
others. In fact, MFCC takes only the first 12 cepstral values. There is another important
property related to these 12 coefficients: the log power spectrum is real and symmetric, so its
inverse DFT is equivalent to a discrete cosine transform (DCT).

The DCT is an orthogonal transformation. Mathematically, the transformation produces
uncorrelated features, so the MFCC features are largely uncorrelated. In machine learning, this
makes them easier to model and to train on. If we model these parameters with a multivariate
Gaussian distribution, all the off-diagonal values in the covariance matrix will be zero.
Mathematically, the output of this stage is:

c[n] = Σ_{m=0}^{M−1} log(s[m]) cos(πn(m + 0.5) / M), for n = 0, 1, ..., 11
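A minimal sketch of this stage, assuming SciPy and the log_mel array from the log step; taking the first 12 coefficients mirrors the description above:

    import numpy as np
    from scipy.fftpack import dct

    # DCT of the log Mel energies; because the log power spectrum is real and
    # symmetric, this is equivalent to the inverse DFT described above.
    cepstral = dct(log_mel, type=2, axis=-1, norm="ortho")
    mfcc_12 = cepstral[:, :12]   # the first 12 cepstral values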

The following is a visualization of the 12 cepstral coefficients.

Dynamic features (delta)

MFCC has 39 features in total. We have covered 12 so far; what are the rest? The 13th parameter
is the energy in each frame, which helps us identify phones.

In pronunciation, context and dynamic information are important. Articulations like stop
closures and releases can be recognized from their formant transitions, and characterizing how
the features change over time provides this context for a phone. Another 13 values are the
delta values d(t), which measure the change in each feature from the previous frame to the next
frame, for example d(t) = (c(t+1) − c(t−1)) / 2. This acts as the first-order derivative of the
features.

The last 13 parameters are the dynamic changes of d(t) from the previous frame to the next
frame. They act as the second-order derivative of c(t).

So the 39 MFCC parameters are the 12 cepstral coefficients plus the energy term, together with
two more sets of 13 corresponding to the delta and double-delta values.
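A minimal sketch of assembling the 39-dimensional vectors, assuming NumPy, the mfcc_12 and windowed arrays from the earlier steps, and a simple symmetric-difference delta (one common formulation, not necessarily the exact one used here):

    import numpy as np

    def delta(features):
        # Symmetric difference over time: d(t) = (c(t+1) - c(t-1)) / 2,
        # with edge frames repeated so the output length matches the input.
        padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
        return (padded[2:] - padded[:-2]) / 2.0

    # 13 static features per frame: 12 cepstral coefficients plus a log-energy term.
    frame_energy = np.log(np.sum(windowed ** 2, axis=1) + 1e-10)
    static = np.hstack([mfcc_12, frame_energy[:, None]])

    d1 = delta(static)            # delta (first-order derivative)
    d2 = delta(d1)                # double delta (second-order derivative)
    mfcc_39 = np.hstack([static, d1, d2])   # 39 features per frame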
