
ELEN 6820
Speech and audio signal processing

Instructor: Nima Mesgarani (nm2764)
3 credits
TA: Yi Luo (yl3364)
Office hours: TBD
Course overview
• Brief history of speech recognition
• Discrete Signal Processing (DSP) overview
• Pattern recognition and deep learning overview
• Speech signal production
• Speech signal representation
• Auditory Scene Analysis, speech enhancement and separation
• Speech processing in the auditory system
• Acoustic modeling
• Sequence recognition and Hidden Markov Models
• Language models
• Music signal processing

Homeworks
• HW1: Discrete signal processing (written) (W2)
• HW2: Neural networks and voice activity detection (programming) (W3&4)
• HW3: Speech signal production and representation (written) (W5)
• HW4: Speech enhancement and separation (programming) (W6)
• HW5: Acoustic event detection and speaker identification (programming) (W7&8)
• HW6: Phoneme recognition and automatic speech recognition (programming) (W9&10)
• Final project (programming) (W11-13)

Week  Topic                               HW
1     Introduction and history            -
2     Discrete signal processing          DSP (W)
3     Machine learning 1                  Neural network and VAD (P)
4     Machine learning 2                  -
5     Speech signal production            Speech production (W)
6     Speech signal representation        Speech enhancement (P)
7     Speech enhancement and separation   -
8     Human speech perception             Acoustic event detection (P)
9     Acoustic modeling                   -
10    Sequence modeling and HMMs          Phoneme recognition and ASR (P)
11    Language modeling                   -
12    Automatic speech recognition        Project
13    Music signal processing             -
(W = written, P = programming)

Course evaluation
• Two written homeworks (20%)
• Four programming homeworks (60%)
• Final project (20%)
• Late submission: 10% penalty per day

Main resources
• Signal processing background
• Neural processing of acoustic signals
• Language processing
• Deep learning

Course materials
• Textbooks
• Other references:
• Handbook of Speech Perception, Pisoni & Remez
• Jim Glass (MIT)
• Dan Ellis (Columbia University)
• Karen Livescu (Toyota Technological Institute at Chicago)
• Shinji Watanabe (Johns Hopkins University)

Acoustic description of the environment

Sound
[Figure: Brüel & Kjær "Sound" poster, BA 7666-11, 2 860499/1]

What makes modeling this so difficult?

The cocktail party "problem" (Cherry, 1953)

Neural basis of speech communication

[Figure: speech production and perception pathways, from the vocal tract to the ear drum; Narayanan et al. 2014]
Cocktail party problem (Cherry, 1953)

Speech is the most natural way to communicate
• Interacting with machines the same way we do with humans has been a dream for decades
• It has only recently become actually possible

[Film still: 2001: A Space Odyssey]

Speech signal processing
• How is it perceived?
• What is the code?
• Signal properties?
• Signal representation?
• Learning representations?
• Sequence learning?

Review of Digital Signal Processing (DSP)
• Discrete-time signals and systems
• Discrete-time Fourier transform, z-transform
• Digital filters, IIR and FIR
• Sampling theorem, resampling
• Spectrogram representation of speech (a minimal sketch follows below)

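To make the last bullet concrete, here is a minimal sketch of computing a spectrogram with NumPy/SciPy. The synthetic chirp signal, window length, and hop size are illustrative choices, not values prescribed by the course.

```python
# Minimal spectrogram sketch (illustrative, not the course's code).
import numpy as np
from scipy import signal

fs = 16000                                   # sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)                # 1 second of samples
x = signal.chirp(t, f0=200, f1=4000, t1=1.0) # frequency sweep standing in for speech

# 20 ms windows (320 samples at 16 kHz), 10 ms hop
f, times, Sxx = signal.spectrogram(x, fs=fs, nperseg=320, noverlap=160)
log_spec = 10 * np.log10(Sxx + 1e-10)        # energy (dB) at each (frequency, time) point
print(log_spec.shape)                        # (161, number of frames)
```
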
Review of machine learning
• Feature extraction
• Minimum-distance classifiers, discriminant functions (a minimal sketch follows below)
• Support Vector Machines
• Unsupervised clustering
• Statistical methods: likelihood, MAP, statistical linear discriminant
• Neural network models, deep learning

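A minimal sketch of one of the listed topics, a minimum-distance (nearest-centroid) classifier, in NumPy. The toy 2-D features are randomly generated and purely illustrative.

```python
# Minimum-distance (nearest-centroid) classifier sketch on toy data.
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # class 0 features
X1 = rng.normal(loc=[3, 3], scale=1.0, size=(50, 2))   # class 1 features

centroids = np.stack([X0.mean(axis=0), X1.mean(axis=0)])  # one centroid per class

def classify(x):
    """Assign x to the class whose centroid is closest (Euclidean distance)."""
    d = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(d))

print(classify(np.array([0.5, -0.2])))  # 0
print(classify(np.array([2.8, 3.4])))   # 1
```
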
Fundamentals of human speech production
• The mechanism of speech production
• Articulatory and acoustic speech features
• Acoustic theory of speech production
• Acoustic phonetics
• Distinctive features of the English phonemes

Hearing and speech perception
• Anatomy and function of the auditory pathway
• Perception of sound: loudness, pitch, timbre
• Masking (temporal, spectral) and applications in coding
• Auditory models: PLP, Lyon's, Shamma
• Speech perception and adaptation in noise
• Speech processing in the auditory cortex

Encoding / Decoding

[Figure: spectrogram S(t,f) encoded into a neural response R(t), then decoded back into a reconstructed spectrogram Ŝ(t,f); axes: frequency vs. time (s)]

What can we say about what the person heard, or wanted to hear?
• Possibility of speech Brain-Computer Interfaces
• Brain-controlled hearing technologies

Speech representation
• Feature extraction for speech recognition, speaker identification, speech enhancement, emotion recognition, etc.
• Cepstrum as a spectral analyzer
• Linear prediction (LP)
• Dynamic features
• MFCC and PLP (a minimal sketch follows below)

[Pipeline: feature extraction -> acoustic modeling -> lexicon -> language modeling; /T/ /UW/ -> "two"]

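A minimal sketch of MFCC and delta ("dynamic") feature extraction. Using librosa is my assumption (the course does not prescribe a toolkit), and the file name is hypothetical.

```python
# MFCC + delta feature sketch with librosa (assumed toolkit, hypothetical file).
import librosa

y, sr = librosa.load("two.wav", sr=16000)           # waveform at 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 cepstral coefficients per frame
delta = librosa.feature.delta(mfcc)                 # "dynamic" (delta) features
print(mfcc.shape, delta.shape)                      # (13, n_frames) for each
```
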
Acoustic modeling
• How do we go from the sound waveform to linguistic units such as phonemes? (e.g. /k/ in "cat")
• Acoustic-to-phonetic transformations
• Gaussian Mixture Models (a minimal sketch follows below)
• Deep neural network models

[Pipeline: feature extraction -> acoustic modeling -> lexicon -> language modeling; /T/ /UW/ -> "two"]

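A minimal sketch of fitting a Gaussian Mixture Model to acoustic feature frames with scikit-learn. The random feature matrix stands in for real MFCC frames of one phonetic unit, and the number of mixture components is an arbitrary choice.

```python
# GMM acoustic-model sketch (toy data, not the course's implementation).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 13))        # 500 frames x 13 MFCCs (placeholder data)

gmm = GaussianMixture(n_components=8, covariance_type='diag')
gmm.fit(frames)

# Per-frame log-likelihoods under this unit's model; a full recognizer would
# compare one GMM per phonetic state.
log_lik = gmm.score_samples(frames)
print(log_lik.shape)                       # (500,)
```
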
Statistical sequence recognition
• How do we put together a sequence of phonemes into words?
• Markov models and Hidden Markov Models (HMMs)
• Three problems of HMMs (scoring, training, decoding): forward-backward, Viterbi, and Baum-Welch algorithms (a minimal Viterbi sketch follows below)
• Neural network models for sequence recognition

[Pipeline: feature extraction -> acoustic modeling -> lexicon -> language modeling; /T/ /UW/ -> "two"]

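A minimal sketch of the Viterbi (decoding) algorithm for a toy two-state HMM in NumPy; the probabilities are made up for illustration.

```python
# Viterbi decoding sketch for a toy 2-state HMM.
import numpy as np

pi = np.array([0.6, 0.4])                 # initial state probabilities
A = np.array([[0.7, 0.3],                 # transition probabilities A[i, j]
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],            # emission probabilities B[state, symbol]
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2]                        # observed symbol sequence

def viterbi(pi, A, B, obs):
    """Return the most likely hidden state sequence for obs."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))              # best log-prob of a path ending in state j at time t
    psi = np.zeros((T, N), dtype=int)     # back-pointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]      # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

print(viterbi(pi, A, B, obs))             # [0, 0, 1, 1] for these toy parameters
```
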
Language models
• How do we put together words into sentences?
• Language models and n-grams (a minimal bigram sketch follows below)
• Decoding with acoustic and language models
• Solving the sparsity of LMs
• A complete system

[Pipeline: feature extraction -> acoustic modeling -> lexicon -> language modeling; /T/ /UW/ -> "two"]

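A minimal sketch of a bigram language model with add-one smoothing, one simple way to deal with the sparsity issue mentioned above; the three-sentence corpus is illustrative.

```python
# Bigram language model sketch with add-one (Laplace) smoothing.
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab = {w for sent in corpus for w in sent} | {"<s>", "</s>"}

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent + ["</s>"]
    unigrams.update(toks[:-1])                 # context counts
    bigrams.update(zip(toks[:-1], toks[1:]))   # word-pair counts

def p_bigram(w_prev, w):
    """P(w | w_prev), smoothed so unseen pairs keep nonzero probability."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + len(vocab))

print(p_bigram("the", "cat"))   # 0.3  (seen pair)
print(p_bigram("cat", "dog"))   # ~0.11 (unseen pair, thanks to smoothing)
```
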
Auditory scene analysis, speech enhancement and separation

Speech processing technologies
• Automatic Speech Recognition (speech-to-text): e.g. "Four Zero One"
• Speaker/language/emotion identification: e.g. Mary / English / neutral

Speech technologies
• Speech synthesis (text-to-speech): e.g. "Four Zero One"
• Speech-to-speech translation (cf. The Hitchhiker's Guide to the Galaxy, Adams, 1978)

Why is ASR important?
• The most natural way for us to communicate
• The fastest means of communication with computers (speech: ~150-200 words/min, typing: ~60 words/min)
• Can be used simultaneously with other tasks
• Can enable communication with people with disabilities
• Allows analysis, searching, transcription, etc. of spoken recordings

Disciplines relevant to ASR
• Signal processing (front-end encoding of the waveform)
• Statistics (modeling)
• Machine learning (phonetic units, use of linguistic context)
• Algorithms (efficient data structures, search for the best hypothesis)

Single digit classification
Given a 1-second speech waveform, determine which digit (0-9) was spoken.
• Recording from a microphone: instantaneous air pressure vs. time
• Discretized in time (in this case, to 16,000 samples, i.e. a sampling rate of 16 kHz)
• Discretized in magnitude (in this case, to 16 bits per sample)
• Result: a 16,000-dimensional vector (e.g. x[n] = [1, 4, -2, 6, 10, ...])

A simple idea
• Record an example ("template") of each digit, e.g. x_d[n], d = 0, ..., 9
• For a test waveform x[n], compute the Euclidean distance to each template:
  dist_d = Σ_{n=0}^{15999} (x[n] - x_d[n])^2
• Pick the digit d with the minimum distance (a minimal sketch follows below)
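
A minimal sketch of the template-matching idea above in NumPy; random vectors stand in for real 1-second recordings.

```python
# Template-matching digit classifier sketch (random placeholders for waveforms).
import numpy as np

rng = np.random.default_rng(0)
templates = rng.normal(size=(10, 16000))          # one stored example x_d[n] per digit d
x = templates[7] + 0.1 * rng.normal(size=16000)   # a noisy "test" waveform

# Euclidean distance of x to each template, then pick the closest digit.
dists = np.sum((templates - x) ** 2, axis=1)
print(int(np.argmin(dists)))                      # 7
```
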


Not so easy! This is hard!
Which two are the same digit?

Idea 2: Go to the frequency domain

Spectrogram: a better representation. Take the Fourier transform over short windows (e.g. 20 ms) and plot the energy at each frequency over time: f1(t), f2(t), ...

This is still hard!

[Figure: several examples of the digit "eight"]

What is the problem?
• Finding a useful set of features x[n] -> f1, f2, ... is difficult
• Many sources of variation:
• Acoustic: channel, noise, vocal tract differences, pitch, ...
• Phonetic: eight: [ey tcl] vs. [ey tcl t]
• Phonological: eight before a vowel: [ey dx], sandwich
• Dialect: either: [iy dh er] vs. [ay dh er]
• Coarticulation: she, shoe
• Semantics: the baby cried vs. the Bay Bee cried
• Segmentation: an iceman vs. a nice man

Architecture of a typical speech recognizer
• Must take into account (and take advantage of!) sources of variation/constraint
• We'll consider each component, starting with feature extraction

Solved vs. unsolved problems in ASR
What works today?
• Close-talk, restricted, single-talker ASR is largely solved
What still needs work?
• ASR with far-field microphones: living room, meeting room, field video recordings
• ASR under very noisy conditions, e.g. when music is playing
• ASR with accented speech
• ASR with multi-talker speech
• ASR with spontaneous speech
Bottom line: something must be constrained for ASR to work well

1922: The first ASR system?

Radio Rex (1922): a toy dog that responded to high energy around 500 Hz (as in the vowel of "Rex")

1950s: Bell Labs digit recognizer
• Small-vocabulary, isolated-word / phoneme recognition
• Nearest-neighbor classification
• Fixed-length feature vectors of ad hoc spectral measurements

1960s
• Dynamic time warping (DTW) is introduced, allowing for different-length test and reference vectors in isolated word/phoneme recognition (Vintsyuk '68, Sakoe and Chiba '78); a minimal sketch follows below

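A minimal sketch of dynamic time warping between two 1-D sequences of different lengths; real recognizers align sequences of frame-level feature vectors, but the recursion is the same.

```python
# Dynamic time warping (DTW) sketch for 1-D sequences of different lengths.
import numpy as np

def dtw_distance(a, b):
    """Total cost of the best monotonic alignment between sequences a and b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed predecessor cells
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

ref = np.array([0.0, 1.0, 2.0, 1.0, 0.0])              # "template"
test = np.array([0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0])   # slower rendition of the same shape
print(dtw_distance(ref, test))                         # 0.0: alignable despite different lengths
```
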
1970s
• DARPA funds large-vocabulary, continuous-speech recognition research
• Finite-state machines, graph search algorithms, and statistical models trained on data emerge (Jelinek '75, Reddy '76, Juang '85)

1980s
• Hidden Markov Model (HMM) based approaches (Baum '72, Levinson '83)
• Gaussian mixture observation densities
• Training on data via expectation-maximization (EM) algorithms
(figure from Juang and Rabiner '05)

1990s
• More of everything
• More data, more computing power: models with more parameters
• Multi-pass systems
• Context-dependent models
• Robustness/adaptation to noise, speaker, style
• Discriminative training, alternatives to HMMs

2000s
• Closer ties with machine learning
• Discriminative structured models
• Increased interest in speech tasks among the ML community, especially structured prediction and neural network researchers

2010s: Debut of Deep Neural Networks in ASR
• 2009: DNNs for phoneme recognition (U. Toronto)
• 2010: Large-vocabulary ASR (Microsoft): 39.6% -> 36.2%
• 2011: CD-DNN-HMM on Switchboard (Microsoft), error cut by 1/3

[Figure: word error rate (WER) on the Switchboard task (telephone conversational speech), 1995-2016, log scale from 100% down to 1%; deep learning systems reach 5.9% WER (Pallett '03, Saon '15, Xiong '16)]

Course overview
• Brief history of speech recognition
• Discrete Signal Processing (DSP) overview
• Pattern recognition and machine learning overview
• Speech signal production
• Speech signal representation
• Auditory Scene Analysis, speech enhancement and separation
• Speech processing in the auditory system
• Acoustic modeling
• Sequence recognition and Hidden Markov Models
• Language models
• Music signal processing
