Lecture 1
4 Machine learning
9 Acoustic modeling
11 Language modeling
Language processing, deep learning
Course materials
• Textbooks
• Other references: BA 7666-11, 2 860499/1
What makes modeling this so difficult?
[Figure: the speech chain from production to perception (ear drum); Narayanan et al., 2014]
Speech is the most natural way to communicate
• Interacting with machines the same way we do with humans has been a dream for decades
• How is it perceived?
• What is the code?
• Signal properties?
• Signal representation?
• Learning representation?
• Sequence learning?
Review of Digital Signal Processing (DSP)
• Unsupervised Clustering
[Figure: two signal panels (amplitude/frequency vs. time, 0-1 s)]
What can we say about what the person heard, or wanted to hear?
Speech representation
• Feature extraction for speech recognition, speaker identification, speech enhancement, emotion recognition, etc.
• Dynamic features (see the sketch after this slide)
• MFCC and PLP
[Figure: features for the word "two", segmented into /T/ /UW/]
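Dynamic features are typically delta coefficients: the local slope of each static feature over a few neighboring frames. A minimal sketch in NumPy, assuming a frames-by-coefficients matrix such as MFCCs (the window half-width N=2 is a common but illustrative choice):

```python
import numpy as np

def delta(features, N=2):
    """Delta (dynamic) features: least-squares slope of each
    coefficient over a window of +/-N neighboring frames."""
    T = len(features)
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    num = sum(n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
              for n in range(1, N + 1))
    return num / denom
```

In classical front ends the deltas (and deltas of deltas) are appended to the static MFCC or PLP vector.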
Acoustic modeling
[Figure: acoustic model mapping a feature sequence to the word "two"]
Statistical sequence recognition
• How do we put together a sequence of phonemes into words?
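The standard statistical formulation behind this question (the Bayes decision rule of classical ASR, implied but not written out on the slide): choose the word sequence W that is most probable given the observed features X.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}} \;
          \underbrace{P(W)}_{\text{language model}}
```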
Language models
• How do we put together words into sentences?
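One standard concrete answer, as a worked equation (the n-gram model is an assumption of the classical approach, not spelled out on the slide): factor the sentence probability by the chain rule, then truncate the history, e.g. to one previous word (a bigram model).

```latex
P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
                    \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```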
Auditory scene analysis, speech enhancement and separation
Speech processing technologies
Automatic Speech Recognition (speech-to-text)
Speech-to-speech translation
The Hitchhiker's Guide to the Galaxy (Adams, 1978)
Why is ASR important?
• Most natural way for us to communicate
• Fastest means of communication with computers (speech: ~150-200 words/min, typing: ~60 words/min)
• Can be used simultaneously with other tasks
• Can enable communication with people with disabilities
• Allows analysis, searching, transcription, etc. of spoken recordings
Disciplines relevant to ASR
• Signal processing (front-end encoding of waveform)
• Statistics (modeling)
A simple idea
Not so easy! This is hard!
Which two are the same digit?
Speech production
Spectrogram: Fourier transform over short windows; a plot of energy at each frequency over time
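A minimal sketch of that computation in NumPy (the 25 ms frame / 10 ms hop and Hamming window are conventional but illustrative choices, not from the slide):

```python
import numpy as np

def spectrogram_db(x, frame_len=400, hop=160):
    """Short-time Fourier analysis: slide a Hamming window along x,
    FFT each frame, and return energy (dB) per frequency per frame."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(power + 1e-10)

# For 16 kHz audio these defaults give 25 ms frames every 10 ms.
```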
Speech production
Several examples of the digit "eight"
What is the problem?
• Finding a useful set of features x[n] -> f1, f2, … is difficult
• Many sources of variation:
• Acoustic: channel, noise, vocal tract differences, pitch, …
• Phonetic: eight: [ey tcl] vs. [ey tcl t]
• Phonological: eight before vowel: [ey dx], sandwich
• Dialect: either: [iy dh er] vs. [ay dh er]
• Coarticulation: she, shoe
• Semantics: the baby cried vs. the Bay Bee cried
• Segmentation: an iceman vs. a nice man
Architecture of a typical speech recognizer
• Must take into account (and take advantage of!) sources of variation/constraint
Radio Rex
Radio Rex (1922): toy dog responded to high-energy signal around 500 Hz (as in the vowel in "Rex")
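In modern terms, Rex implements a one-band energy detector. A minimal sketch, assuming a mono signal x sampled at fs Hz (the bandwidth and threshold are illustrative assumptions, not Rex's actual mechanics):

```python
import numpy as np

def band_energy_fraction(x, fs, center=500.0, half_width=100.0):
    """Fraction of the signal's energy within center +/- half_width Hz."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= center - half_width) & (freqs <= center + half_width)
    return power[band].sum() / (power.sum() + 1e-12)

def rex_springs_open(x, fs, threshold=0.3):
    # Threshold is a made-up illustration of the spring mechanism.
    return band_energy_fraction(x, fs) > threshold
```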
A simple task
• Small-vocabulary, isolated-word/phoneme recognition
• Nearest-neighbor classification (see the sketch after this list)
• Fixed-length feature vectors of ad hoc spectral measurements
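A minimal sketch of that pipeline, assuming each utterance has already been reduced to one fixed-length feature vector (all names and shapes here are illustrative):

```python
import numpy as np

def nearest_neighbor_label(test_vec, ref_vecs, ref_labels):
    """1-NN classification: return the label of the reference
    vector closest (in Euclidean distance) to the test vector."""
    dists = np.linalg.norm(ref_vecs - test_vec, axis=1)
    return ref_labels[int(np.argmin(dists))]
```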
1960s
• Dynamic time warping is introduced, allowing for different-length test and reference vectors for isolated word/phoneme recognition (Vintsyuk '68, Sakoe and Chiba '78)
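A minimal sketch of the classic dynamic-programming recursion (Euclidean frame distance and the three standard step types are illustrative choices; published variants differ in step patterns and weights):

```python
import numpy as np

def dtw(test, ref):
    """Dynamic time warping distance between two sequences of
    feature vectors (shapes [T, d] and [R, d]) that may differ
    in length. Classic O(T*R) dynamic program."""
    T, R = len(test), len(ref)
    D = np.full((T + 1, R + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, R + 1):
            cost = np.linalg.norm(test[i - 1] - ref[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # skip a test frame
                                 D[i, j - 1],      # skip a reference frame
                                 D[i - 1, j - 1])  # match frames
    return D[T, R]
```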
1970s-1980s
• Hidden Markov model (HMM) approaches (Baum '72, Levinson '83)
• Gaussian mixture observation densities
• Training on data via expectation-maximization algorithms
(fig. from [Juang and Rabiner '05])
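A minimal sketch of the HMM forward algorithm, which computes the likelihood an HMM assigns to an observation sequence; in the systems above the per-state observation likelihoods come from Gaussian mixtures, but here they are passed in precomputed (all names are illustrative):

```python
import numpy as np

def forward_likelihood(pi, A, B):
    """HMM forward algorithm.
    pi: (S,) initial state probabilities
    A:  (S, S) transitions, A[i, j] = P(state j at t+1 | state i at t)
    B:  (T, S) observation likelihoods, B[t, s] = p(x_t | state s)
    Returns p(x_1, ..., x_T | model)."""
    alpha = pi * B[0]
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]  # propagate, then weight by evidence
    # Real systems work in log space or rescale alpha each frame
    # to avoid numerical underflow on long utterances.
    return alpha.sum()
```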
1990s
• More of everything
• Context-dependent models
Deep learning
Word error rate (WER)
[Figure: WER on the Switchboard task (telephone conversation speech), 1995-2016, log scale from 10% to 1%, reaching 5.9% in 2016]
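Word error rate is the Levenshtein (edit) distance between reference and hypothesis word strings, normalized by the reference length; a minimal sketch:

```python
def wer(ref_words, hyp_words):
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words."""
    R, H = len(ref_words), len(hyp_words)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[R][H] / R

# e.g. wer("the cat sat".split(), "the cat sat down".split()) -> 1/3
```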
Course overview
• Brief history of speech recognition
• Discrete Signal Processing (DSP) overview
• Pattern recognition and machine learning overview
• Speech signal production
• Speech signal representation
• Auditory scene analysis, speech enhancement and separation
• Speech processing in the auditory system
• Acoustic modeling
• Sequence recognition and Hidden Markov Models
• Language models
• Music signal processing