Lecture 1
4 Machine learning
9 Acoustic modeling
11 Language modeling
Language processing, deep learning
Course materials
• Textbooks
• Other references: BA 7666-11, 2 860499/1
What makes modeling this so difficult?
[Figure: the speech chain from production to perception (ear drum); Narayanan et al., 2014]
Speech is the most natural way to communicate
• Interacting with machines the same way we do with humans has been a dream for decades
• How is it perceived?
• What is the code?
• Signal properties?
• Signal representation?
• Learning representation?
• Sequence learning?
Review of Digital Signal Processing (DSP)
• Unsupervised Clustering
[Figure: two signal panels (amplitude/frequency vs. time, 0-1 s)]
What can we say about what the person heard, or wanted to hear?
Speech representation
• Feature extraction for speech recognition, speaker identification, speech enhancement, emotion recognition, etc.
• Dynamic features (see the sketch after this slide)
• MFCC and PLP
[Figure: features for the word "two", segmented into /T/ /UW/]
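Dynamic features are typically delta coefficients: the local slope of each static feature over a few neighboring frames. A minimal sketch in NumPy, assuming a frames-by-coefficients matrix such as MFCCs (the window half-width N=2 is a common but illustrative choice):

```python
import numpy as np

def delta(features, N=2):
    """Delta (dynamic) features: least-squares slope of each
    coefficient over a window of +/-N neighboring frames."""
    T = len(features)
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    num = sum(n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
              for n in range(1, N + 1))
    return num / denom
```

In classical front ends the deltas (and deltas of deltas) are appended to the static MFCC or PLP vector.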
Acoustic modeling
[Figure: acoustic model mapping a feature sequence to the word "two"]
Statistical sequence recognition
• How do we put together a sequence of phonemes into words?
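The standard statistical formulation behind this question (the Bayes decision rule of classical ASR, implied but not written out on the slide): choose the word sequence W that is most probable given the observed features X.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}} \;
          \underbrace{P(W)}_{\text{language model}}
```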
Language models
• How do we put together words into sentences?
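One standard concrete answer, as a worked equation (the n-gram model is an assumption of the classical approach, not spelled out on the slide): factor the sentence probability by the chain rule, then truncate the history, e.g. to one previous word (a bigram model).

```latex
P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
                    \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```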
Auditory scene analysis, speech enhancement and separation
Speech processing technologies
Automatic Speech Recognition (speech-to-text)
Speech-to-speech translation
The Hitchhiker's Guide to the Galaxy (Adams, 1978)
Why is ASR important?
• Most natural way for us to communicate
• Fastest means of communication with computers (speech: ~150-200 words/min, typing: ~60 words/min)
• Can be used simultaneously with other tasks
• Can enable communication with people with disabilities
• Allows analysis, searching, transcription, etc. of spoken recordings
Disciplines relevant to ASR
• Signal processing (front-end encoding of waveform)
• Statistics (modeling)
A simple idea
Not so easy! This is hard!
Which two are the same digit?
Speech production
Spectrogram: Fourier transform over short windows; a plot of energy at each frequency over time
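A minimal sketch of that computation in NumPy (the 25 ms frame / 10 ms hop and Hamming window are conventional but illustrative choices, not from the slide):

```python
import numpy as np

def spectrogram_db(x, frame_len=400, hop=160):
    """Short-time Fourier analysis: slide a Hamming window along x,
    FFT each frame, and return energy (dB) per frequency per frame."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(power + 1e-10)

# For 16 kHz audio these defaults give 25 ms frames every 10 ms.
```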
Speech production
Several examples of the digit "eight"
What is the problem?
• Finding a useful set of features x[n] -> f1, f2, … is difficult
• Many sources of variation:
• Acoustic: channel, noise, vocal tract differences, pitch, …
• Phonetic: eight: [ey tcl] vs. [ey tcl t]
• Phonological: eight before vowel: [ey dx], sandwich
• Dialect: either: [iy dh er] vs. [ay dh er]
• Coarticulation: she, shoe
• Semantics: the baby cried vs. the Bay Bee cried
• Segmentation: an iceman vs. a nice man
Architecture of a typical speech recognizer
• Must take into account (and take advantage of!) sources of variation/constraint
Radio Rex
Radio Rex (1922): toy dog responded to high-energy signal around 500 Hz (as in the vowel in "Rex")
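In modern terms, Rex implements a one-band energy detector. A minimal sketch, assuming a mono signal x sampled at fs Hz (the bandwidth and threshold are illustrative assumptions, not Rex's actual mechanics):

```python
import numpy as np

def band_energy_fraction(x, fs, center=500.0, half_width=100.0):
    """Fraction of the signal's energy within center +/- half_width Hz."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= center - half_width) & (freqs <= center + half_width)
    return power[band].sum() / (power.sum() + 1e-12)

def rex_springs_open(x, fs, threshold=0.3):
    # Threshold is a made-up illustration of the spring mechanism.
    return band_energy_fraction(x, fs) > threshold
```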
A simple task
• Small-vocabulary, isolated-word/phoneme recognition
• Nearest-neighbor classification (see the sketch after this list)
• Fixed-length feature vectors of ad hoc spectral measurements
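A minimal sketch of that pipeline, assuming each utterance has already been reduced to one fixed-length feature vector (all names and shapes here are illustrative):

```python
import numpy as np

def nearest_neighbor_label(test_vec, ref_vecs, ref_labels):
    """1-NN classification: return the label of the reference
    vector closest (in Euclidean distance) to the test vector."""
    dists = np.linalg.norm(ref_vecs - test_vec, axis=1)
    return ref_labels[int(np.argmin(dists))]
```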
1960s
• Dynamic time warping is introduced, allowing for different-length test and reference vectors for isolated word/phoneme recognition (Vintsyuk '68, Sakoe and Chiba '78)
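A minimal sketch of the classic dynamic-programming recursion (Euclidean frame distance and the three standard step types are illustrative choices; published variants differ in step patterns and weights):

```python
import numpy as np

def dtw(test, ref):
    """Dynamic time warping distance between two sequences of
    feature vectors (shapes [T, d] and [R, d]) that may differ
    in length. Classic O(T*R) dynamic program."""
    T, R = len(test), len(ref)
    D = np.full((T + 1, R + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, R + 1):
            cost = np.linalg.norm(test[i - 1] - ref[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # skip a test frame
                                 D[i, j - 1],      # skip a reference frame
                                 D[i - 1, j - 1])  # match frames
    return D[T, R]
```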
1970s-1980s
• Hidden Markov model (HMM) approaches (Baum '72, Levinson '83)
• Gaussian mixture observation densities
• Training on data via expectation-maximization algorithms
(fig. from [Juang and Rabiner '05])
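A minimal sketch of the HMM forward algorithm, which computes the likelihood an HMM assigns to an observation sequence; in the systems above the per-state observation likelihoods come from Gaussian mixtures, but here they are passed in precomputed (all names are illustrative):

```python
import numpy as np

def forward_likelihood(pi, A, B):
    """HMM forward algorithm.
    pi: (S,) initial state probabilities
    A:  (S, S) transitions, A[i, j] = P(state j at t+1 | state i at t)
    B:  (T, S) observation likelihoods, B[t, s] = p(x_t | state s)
    Returns p(x_1, ..., x_T | model)."""
    alpha = pi * B[0]
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]  # propagate, then weight by evidence
    # Real systems work in log space or rescale alpha each frame
    # to avoid numerical underflow on long utterances.
    return alpha.sum()
```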
1990s
• More of everything
• Context-dependent models
Deep learning
Word error rate (WER)
[Figure: WER on the Switchboard task (telephone conversation speech), 1995-2016, log scale from 10% to 1%, reaching 5.9% in 2016]
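Word error rate is the Levenshtein (edit) distance between reference and hypothesis word strings, normalized by the reference length; a minimal sketch:

```python
def wer(ref_words, hyp_words):
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words."""
    R, H = len(ref_words), len(hyp_words)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[R][H] / R

# e.g. wer("the cat sat".split(), "the cat sat down".split()) -> 1/3
```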
Course overview
• Brief history of speech recognition
• Discrete Signal Processing (DSP) overview
• Pattern recognition and machine learning overview
• Speech signal production
• Speech signal representation
• Auditory scene analysis, speech enhancement and separation
• Speech processing in the auditory system
• Acoustic modeling
• Sequence recognition and Hidden Markov Models
• Language models
• Music signal processing