CS 224S / LINGUIST 281
Speech Recognition, Synthesis, and
Dialogue
Dan Jurafsky
Lecture 9: Feature Extraction and start of
Acoustic Modeling (VQ)
IP Notice:
Outline for Today
• Speech Recognition Architectural Overview
• Hidden Markov Models in general and for speech
Forward
Viterbi Decoding
• How this fits into the ASR component of course
Jan 27 HMMs, Forward, Viterbi,
Jan 29 Baum-Welch (Forward-Backward)
Feb 3: Feature Extraction, MFCCs, start of AM
Feb 5: Acoustic Modeling and GMMs
Feb 10: N-grams and Language Modeling
Feb 24: Search and Advanced Decoding
Feb 26: Dealing with Variation
Mar 3: Dealing with Disfluencies
Outline for Today
• Feature Extraction
Mel-Frequency Cepstral Coefficients
• Acoustic Model
Increasingly sophisticated models
Acoustic Likelihood for each state:
Gaussians
Multivariate Gaussians
Mixtures of Multivariate Gaussians
Where a state is progressively:
CI (context-independent) subphone (3-ish per phone)
CD (context-dependent) phone (= triphones)
State-tying of CD phones
• Evaluation
Word Error Rate
Discrete Representation of
Signal
• Represent the continuous signal in discrete form.
Thanks to Bryan Pellom for this slide
Digitizing The Signal (A-D)
Sampling:
measuring the amplitude of the signal at time t
16,000 Hz (samples/sec): microphone (“wideband”)
8,000 Hz (samples/sec): telephone
Why these rates?
– Need at least 2 samples per cycle
– So the maximum measurable frequency is half the sampling rate (the Nyquist frequency)
– Human speech < 10,000 Hz, so we need at most a 20 kHz sampling rate
– Telephone speech is filtered at 4 kHz, so 8 kHz is enough
Digitizing Speech (II)
Quantization
Representing the real-valued amplitude of each sample as an integer
8-bit (-128 to 127) or 16-bit (-32768 to 32767)
Formats:
16-bit PCM
8-bit mu-law (log compression; see the sketch below)
Byte order: LSB (Intel) vs. MSB (Sun, Apple)
Headers:
Raw (no header) vs. a header (e.g., 40 bytes)
Microsoft wav
Sun .au
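A minimal sketch (not from the slides) of the 8-bit mu-law idea: log-compress the amplitude, then quantize to 8 bits. The standard mu = 255 companding formula is assumed, and the function name is illustrative:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Log-compress samples in [-1, 1] and quantize to 8-bit codes (0..255)."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((compressed + 1) / 2 * mu).astype(np.uint8)
```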
Discrete Representation of
Signal
• Byte swapping
Little-endian vs. Big-endian
• Some audio formats have headers
Headers contain meta-information such as
the sampling rate and recording conditions
A 'raw' file has no header
Examples with headers: Microsoft wav, NIST SPHERE
• A nice sound manipulation tool: sox
change sampling rates
convert between speech formats
MFCC
• Mel-Frequency Cepstral Coefficients (MFCCs)
The most widely used spectral representation in
ASR
Pre-Emphasis
• Pre-emphasis: boosting the energy in the high
frequencies
• Q: Why do this?
• A: The spectrum for voiced segments has more
energy at lower frequencies than higher
frequencies.
This is called spectral tilt
Spectral tilt is caused by the nature of the glottal pulse
• Boosting high-frequency energy gives more info
to Acoustic Model
Improves phone recognition performance
George Miller figure
Example of pre-emphasis
• Before and after pre-emphasis
Spectral slice from the vowel [aa]
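A minimal sketch of pre-emphasis as a first-order filter y[n] = x[n] − a·x[n−1], assuming a = 0.97 (the coefficient listed later in these slides); the function name is illustrative:

```python
import numpy as np

def pre_emphasis(signal, coeff=0.97):
    """Boost high-frequency energy: y[n] = x[n] - coeff * x[n-1]."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])
```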
MFCC
Windowing
Slide from Bryan Pellom
Windowing
• Why divide speech signal into successive
overlapping frames?
Speech is not a stationary signal; we want information
about a small enough region that the spectral
information is a useful cue.
• Frames
Frame size: typically, 10-25ms
Frame shift: the length of time between successive
frames, typically, 5-10ms
Common window shapes
• Rectangular window:
• Hamming window
Window in time domain
Window in the frequency
domain
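A minimal sketch of framing and Hamming windowing, assuming 25 ms frames with a 10 ms shift at 16 kHz (the values used elsewhere in these slides) and a signal at least one frame long; the helper name is illustrative:

```python
import numpy as np

def frame_and_window(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Cut the signal into successive overlapping frames and Hamming-window each one."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(num_frames)])
    return frames * np.hamming(frame_len)
```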
MFCC
Discrete Fourier
Transform
• Input:
Windowed signal x[n]…x[m]
• Output:
For each of N discrete frequency bands
A complex number X[k] representing the magnitude and phase of that
frequency component in the original signal
• Discrete Fourier Transform (DFT)
• Standard algorithm for computing DFT:
Fast Fourier Transform (FFT) with complexity N*log(N)
In general, choose N=512 or 1024
Discrete Fourier Transform
computing a spectrum
• A 25 ms Hamming-windowed signal from [iy]
And its spectrum as computed by DFT (plus
other smoothing)
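A minimal sketch of computing each windowed frame's spectrum with the FFT, assuming N = 512 as suggested above; the power-spectrum normalization is one common convention, not specified in the slides:

```python
import numpy as np

def power_spectrum(windowed_frames, n_fft=512):
    """Per-frame power spectrum |X[k]|^2 over the N/2 + 1 frequency bins."""
    spectrum = np.fft.rfft(windowed_frames, n=n_fft)  # complex X[k]: magnitude and phase
    return (np.abs(spectrum) ** 2) / n_fft
```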
MFCC
Mel-scale
• Human hearing is not equally sensitive to all
frequency bands
• Less sensitive at higher frequencies, roughly >
1000 Hz
• I.e. human perception of frequency is non-linear:
Mel-scale
• A mel is a unit of pitch
Definition: pairs of sounds that are perceptually equidistant in pitch
are separated by an equal number of mels
• Mel-scale is approximately linear below 1
kHz and logarithmic above 1 kHz
• Definition: mel(f) = 1127 ln(1 + f/700)
Mel Filter Bank Processing
• Mel filter bank
Filters uniformly spaced below 1 kHz
Logarithmically spaced above 1 kHz
Mel-filter Bank Processing
• Apply the bank of filters, spaced according to the mel scale,
to the spectrum (see the sketch below)
• Each filter's output is the sum of its filtered
spectral components
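A minimal sketch of building the triangular mel-spaced filters and applying them to the power spectrum. The 26-filter count is a common choice assumed here, and the 2595·log10(1 + f/700) mel formula is the same scale as the 1127·ln form above:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters=26, n_fft=512, sample_rate=16000):
    """Triangular filters whose center frequencies are evenly spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# Each filter's output is the sum of the spectral components it passes:
# mel_energies = power_spec @ mel_filterbank().T
```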
MFCC
Log energy computation
• Compute the logarithm of the squared
magnitude of the output of the mel filter bank
Log energy computation
• Why log energy?
Logarithm compresses dynamic range of
values
Human response to signal level is logarithmic
– humans less sensitive to slight differences in amplitude
at high amplitudes than low amplitudes
Makes frequency estimates less sensitive to
slight variations in input (power variation
due to speaker’s mouth moving closer to
mike)
Phase information not helpful in speech
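A minimal sketch of the log-compression step; the small floor that avoids taking the log of zero is an implementation detail assumed here:

```python
import numpy as np

def log_mel_energies(mel_energies, floor=1e-10):
    """Compress the dynamic range of the mel filter bank outputs."""
    return np.log(np.maximum(mel_energies, floor))
```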
MFCC
The Cepstrum
• One way to think about this
Separating the source and filter
The speech waveform is created by
A glottal source waveform
That passes through a vocal tract which, because of its
shape, has a particular filtering characteristic
• Articulatory facts:
The vocal cord vibrations create harmonics
The mouth is an amplifier
Depending on the shape of the oral cavity, some
harmonics are amplified more than others
Vocal Fold Vibration
UCLA Phonetics Lab Demo
George Miller figure
We care about the filter not
the source
• Most characteristics of the source
F0
Details of glottal pulse
• Don’t matter for phone detection
• What we care about is the filter
The exact position of the articulators in the
oral tract
• So we want a way to separate these
And use only the filter function
The Cepstrum
• The spectrum of the log of the spectrum
[Figure panels: spectrum; log spectrum; spectrum of the log spectrum]
Thinking about the
Cepstrum
Mel Frequency cepstrum
• The cepstrum requires Fourier analysis
• But we’re going from frequency space back to
time
• So we actually apply inverse DFT
• Details for signal processing gurus: Since the log
power spectrum is real and symmetric, inverse
DFT reduces to a Discrete Cosine Transform
(DCT)
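A minimal sketch of this step using scipy's DCT in place of the inverse DFT (as noted above, the two coincide for a real, symmetric log power spectrum); keeping the first 12 coefficients anticipates the next slide:

```python
from scipy.fftpack import dct

def cepstral_coefficients(log_mel_energies, num_ceps=12):
    """Cepstrum via the DCT of the log mel spectrum; keep the first 12 coefficients."""
    return dct(log_mel_energies, type=2, axis=-1, norm='ortho')[..., :num_ceps]
```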
Another advantage of the
Cepstrum
• DCT produces highly uncorrelated features
• We’ll see when we get to acoustic modeling that
these will be much easier to model than the
spectrum
Simply modelled by linear combinations of Gaussian
density functions with diagonal covariance matrices
• In general we’ll just use the first 12 cepstral
coefficients (we don’t want the later ones which
have e.g. the F0 spike)
MFCC
Dynamic Cepstral Coefficients
• The cepstral coefficients do not capture energy
• So we add an energy feature
• Also, we know that the speech signal is not constant (slope of
formants, change from stop burst to release).
• So we want to add the changes in features (the slopes).
• We call these delta features
• We also add double-delta acceleration features
Delta and double-delta
• We take derivatives of the features in order to obtain temporal information (see the sketch after the feature list below)
Typical MFCC features
• Window size: 25ms
• Window shift: 10ms
• Pre-emphasis coefficient: 0.97
• MFCC:
12 MFCC (mel frequency cepstral coefficients)
1 energy feature
12 delta MFCC features
12 double-delta MFCC features
1 delta energy feature
1 double-delta energy feature
• Total 39-dimensional features
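A minimal sketch of one common way to compute the delta features (a regression over ±N neighboring frames; the slides do not fix a particular formula) and of stacking the 39-dimensional vector:

```python
import numpy as np

def delta(features, N=2):
    """Delta of each feature, computed as a regression over +/- N surrounding frames."""
    padded = np.pad(features, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return sum(n * (padded[N + n : len(features) + N + n] -
                    padded[N - n : len(features) + N - n])
               for n in range(1, N + 1)) / denom

# 39 dimensions per frame: 13 static (12 cepstra + 1 energy),
# 13 deltas, and 13 double-deltas:
# static = np.hstack([cepstra, energy[:, None]])
# full   = np.hstack([static, delta(static), delta(delta(static))])
```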
Why is MFCC so popular?
• Efficient to compute
• Incorporates a perceptual Mel frequency
scale
• Separates the source and filter
• The IDFT (DCT) decorrelates the features
Supports the diagonal-covariance assumption in HMM
modeling
• An alternative: PLP (Perceptual Linear Prediction)
Now on to Acoustic Modeling
Problem: how to apply HMM model
to continuous observations?
• We have assumed that the output
alphabet V has a finite number of symbols
• But spectral feature vectors are real-
valued!
• How to deal with real-valued features?
Decoding: Given ot, how to compute P(ot|q)
Learning: How to modify EM to deal with real-
valued features
Vector Quantization
• Create a training set of feature vectors
• Cluster them into a small number of
classes
• Represent each class by a discrete symbol
• For each class vk, we can compute the
probability that it is generated by a given
HMM state using Baum-Welch as above
VQ
• We’ll define a
Codebook, which lists for each symbol
A prototype vector, or codeword
• If we had 256 classes (‘8-bit VQ’), we'd have
a codebook with 256 prototype vectors
Given an incoming feature vector, we
compare it to each of the 256 prototype
vectors
We pick whichever one is closest (by some
‘distance metric’)
And replace the input vector by the index of
this prototype vector
VQ
VQ requirements
• A distance metric or distortion metric
Specifies how similar two vectors are
Used:
to build clusters
To find prototype vector for cluster
And to compare incoming vector to prototypes
• A clustering algorithm
K-means, etc.
Distance metrics
• Simplest:
(square of) Euclidean
distance
d^2(x, y) = \sum_{i=1}^{D} (x_i - y_i)^2
Also called ‘sum-
squared error’
Distance metrics
• More sophisticated:
(square of) Mahalanobis distance
Assume that dimension i of the feature vector
has variance \sigma_i^2
d^2(x, y) = \sum_{i=1}^{D} \frac{(x_i - y_i)^2}{\sigma_i^2}
The equation above assumes a diagonal covariance
matrix; more on this later
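A minimal sketch of both distance metrics for feature vectors, with the Mahalanobis version written for the diagonal-covariance case assumed above; the function names are illustrative:

```python
import numpy as np

def euclidean_sq(x, y):
    """Squared Euclidean distance ('sum-squared error')."""
    return np.sum((x - y) ** 2)

def mahalanobis_sq_diag(x, y, variances):
    """Squared Mahalanobis distance, assuming a diagonal covariance matrix."""
    return np.sum((x - y) ** 2 / variances)
```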
Training a VQ system (generating
codebook): K-means clustering
1. Initialization:
choose M vectors from the L training vectors (typically
M = 2^B for a B-bit codebook)
as initial code words… chosen at random or by maximum distance.
2. Search:
for each training vector, find the closest code word,
assign this training vector to that cell
3. Centroid Update:
for each cell, compute centroid of that cell. The
new code word is the centroid.
4. Repeat (2)-(3) until average distance falls below threshold
(or no change)
Slide from John-Paul Hosum, OHSU/OGI
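A minimal sketch of this K-means codebook training, simplified to run for a fixed number of iterations rather than until the average distance falls below a threshold; squared Euclidean distance, random initialization, and the names are assumptions:

```python
import numpy as np

def train_codebook(vectors, M=256, iters=20, seed=0):
    """K-means VQ: pick M initial codewords, then alternate Search and Centroid Update."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), M, replace=False)].astype(float)
    for _ in range(iters):
        # Search: assign each training vector to the cell of its closest codeword
        dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        assignments = dists.argmin(axis=1)
        # Centroid update: each codeword becomes the centroid of its cell
        for k in range(M):
            members = vectors[assignments == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(vector, codebook):
    """Replace an incoming feature vector by the index of its closest codeword."""
    return int(((codebook - vector) ** 2).sum(axis=1).argmin())
```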
Vector Quantization Slide thanks to John-Paul Hosum, OHSU/OGI
• Example
Given data points, split into 4 codebook vectors with initial
values at (2,2), (4,6), (6,5), and (8,8)
Vector Quantization Slide from John-Paul Hosum, OHSU/OGI
• Example
compute centroids of each codebook, re-compute nearest
neighbor, re-compute centroids...
Vector Quantization Slide from John-Paul Hosum, OHSU/OGI
• Example
Once there’s no more change, the feature space will be
partitioned into 4 regions. Any input feature can be classified
as belonging to one of the 4 regions. The entire codebook
can be specified by the 4 centroid points.
Summary: VQ
• To compute p(ot|qj)
Compute distance between feature vector ot
and each codeword (prototype vector)
in a preclustered codebook
where distance is either
• Euclidean
• Mahalanobis
Choose the vector that is the closest to ot
and take its codeword vk
And then look up the likelihood of vk given HMM state
j in the B matrix
• b_j(o_t) = b_j(v_k), where v_k is the codeword of the prototype vector closest
to o_t
• Using Baum-Welch as above
Computing bj(vk)
Slide from John-Paul Hosum, OHSU/OGI
[Scatter plot: feature value 1 vs. feature value 2 for the vectors assigned to state j]
• b_j(v_k) = (number of vectors with codebook index k in state j) / (number of vectors in state j) = 14/56 = 1/4
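A minimal sketch of this relative-frequency estimate of b_j(v_k), using hard counts from a fixed alignment of frames to states; in actual training, Baum-Welch replaces these hard counts with expected (soft) counts. Names and array shapes are illustrative:

```python
import numpy as np

def estimate_b(codeword_indices, state_alignment, num_states, M):
    """b_j(v_k) = count(codeword k while in state j) / count(frames in state j)."""
    B = np.zeros((num_states, M))
    for j, k in zip(state_alignment, codeword_indices):
        B[j, k] += 1
    # Normalize each row; guard against states with no assigned frames
    return B / np.maximum(B.sum(axis=1, keepdims=True), 1)
```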
Summary: VQ
• Training:
Do VQ and then use Baum-Welch to assign
probabilities to each symbol
• Decoding:
Do VQ and then use the symbol probabilities
in decoding
Next Time: Directly Modeling
Continuous Observations
• Gaussians
Univariate Gaussians
Baum-Welch for univariate Gaussians
Multivariate Gaussians
Baum-Welch for multivariate Gaussians
Gaussian Mixture Models (GMMs)
Baum-Welch for GMMs
Summary
• Feature Extraction
• Beginning of Acoustic Modeling
Vector Quantization (VQ)