
EE E6820: Speech & Audio Processing & Recognition

Lecture 1:
Introduction & DSP
Dan Ellis <[email protected]>
Mike Mandel <[email protected]>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/dpwe/e6820
January 22, 2009
1. Sound and information
2. Course Structure
3. DSP review: Timescale modification
Outline
1. Sound and information
2. Course Structure
3. DSP review: Timescale modification
Sound and information
Sound is air pressure variation
Mechanical vibration → pressure waves in air → motion of sensor → time-varying voltage v(t)

Transducers convert air pressure to voltage.
What use is sound?
Footsteps examples:
[Figure: waveforms of two footstep examples, amplitude vs. time / s]
Hearing confers an evolutionary advantage:
  - useful information, complements vision
  - ... at a distance, in the dark, around corners
  - listeners are highly adapted to natural sounds (including speech)
The scope of audio processing
The acoustic communication chain
message → signal → channel → receiver → decoder
(synthesis · audio processing · recognition act along this chain)
Sound is an information bearer
Received sound reflects source(s) plus the effect of the environment (channel).
Levels of abstraction
Much processing concerns shifting between levels of abstraction
[Diagram: analysis moves up from concrete sound p(t), through a representation (e.g. t-f energy), to abstract information; synthesis moves back down]
Different representations serve different tasks:
  - separating aspects, making things explicit, ...
Outline
1. Sound and information
2. Course Structure
3. DSP review: Timescale modification
Course structure

Goals:
  - survey topics in sound analysis & processing
  - develop an intuition for sound signals
  - learn some specific technologies

Course structure:
  - weekly assignments (25%)
  - midterm event (25%)
  - final project (50%)
Text
Speech and Audio Signal Processing
Ben Gold & Nelson Morgan
Wiley, 2000
ISBN: 0-471-35154-7
Web-based
Course website:
http://www.ee.columbia.edu/dpwe/e6820/
for lecture notes, problem sets, examples, . . .
+ student web pages for homework, etc.
Course outline
Fundamentals:
  - L1: DSP
  - L2: Acoustics
  - L3: Pattern recognition
  - L4: Auditory perception

Audio processing:
  - L5: Signal models
  - L6: Music analysis/synthesis
  - L7: Audio compression
  - L8: Spatial sound & rendering

Applications:
  - L9: Speech recognition
  - L10: Music retrieval
  - L11: Signal separation
  - L12: Multimedia indexing
Weekly assignments
Research papers:
  - journal & conference publications
  - summarize & discuss in class
  - written summaries on web page + Courseworks discussion

Practical experiments:
  - Matlab-based (+ Signal Processing Toolbox)
  - direct experience of sound processing
  - skills for project

Book sections
Final project
Most significant part of the course (50% of grade)

Oral proposals mid-semester; presentations in final class + website

Scope:
  - practical (Matlab recommended)
  - identify a problem; try some solutions
  - evaluation

Topic:
  - few restrictions within the world of audio
  - investigate other resources
  - develop in discussion with me

Citation & plagiarism
Examples of past projects
  - Automatic prosody classification
  - Model-based note transcription
Outline
1. Sound and information
2. Course Structure
3. DSP review: Timescale modification
DSP review: digital signals
x_d[n] = Q( x_c(nT) )

  - Discrete-time sampling limits bandwidth
  - Discrete-level quantization limits dynamic range

Sampling interval T; sampling frequency Ω_T = 2π/T.
Quantizer Q(y) = ε ⌊y/ε⌋ for quantization step ε.
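As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of ideal sampling followed by uniform quantization; the sampling rate, test signal, and step size ε are arbitrary choices, and the course practicals themselves use Matlab rather than Python.

```python
import numpy as np

fs = 8000.0                       # sampling frequency 1/T in Hz (assumed)
T = 1.0 / fs                      # sampling interval
n = np.arange(400)                # sample indices

def x_c(t):
    """A toy continuous-time signal: a 440 Hz tone plus a weak harmonic."""
    return 0.6 * np.sin(2 * np.pi * 440 * t) + 0.2 * np.sin(2 * np.pi * 880 * t)

def quantize(y, eps):
    """Uniform quantizer Q(y) = eps * floor(y / eps): limits dynamic range."""
    return eps * np.floor(y / eps)

x_d = x_c(n * T)                        # discrete-time sampling: limits bandwidth to fs/2
x_q = quantize(x_d, eps=2.0 / 2**8)     # roughly 8-bit steps over a +/-1 range
print("max quantization error:", np.max(np.abs(x_q - x_d)))   # always < eps
```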
The speech signal: time domain
Speech is a sequence of different sound types:

[Figure: speech waveform (time / s) with zoomed views of four segment types; the words "has", "watch", "thin as a dime" are marked]

  - Vowel: periodic ("has")
  - Fricative: aperiodic ("watch")
  - Glide: smooth transition ("watch")
  - Stop burst: transient ("dime")
Timescale modification (TSM)

Can we modify a sound to make it slower?
  - i.e. speech pronounced more slowly, e.g. to help comprehension, analysis
  - or more quickly, for speed listening?
Why not just slow it down?
x_s(t) = x_o(t/r),   r = slowdown factor (r > 1 ⇒ slower)

  - equivalent to playback at a different sampling rate

[Figure: the original waveform and the 2× slower version (r = 2), amplitude vs. time / s]
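A minimal sketch of this naive approach, assuming a synthetic test tone: resampling by linear interpolation stretches the waveform, but it also lowers every frequency (including pitch) by the same factor r, which is exactly why plain slowdown is unsatisfactory.

```python
import numpy as np

def naive_slowdown(x, r):
    """Stretch x by factor r via resampling: x_s(t) = x_o(t / r).
    Equivalent to playback at fs / r, so pitch also drops by the factor r."""
    n_out = int(round(len(x) * r))
    t_in = np.arange(n_out) / r            # output sample m reads the input at m / r
    return np.interp(t_in, np.arange(len(x)), x)

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)            # a 200 Hz tone standing in for speech
y = naive_slowdown(x, r=2.0)               # twice as long, but now sounds like 100 Hz
print(len(x), len(y))
```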
Time-domain TSM
Problem: want to preserve local time structure but alter global time structure.

Repeat segments:
  - but: artifacts from abrupt edges

Cross-fade & overlap:

y_m[mL + n] = y_{m-1}[mL + n] + w[n] · x[⌊m/r⌋ L + n]

[Figure: original and stretched waveforms, with numbered segments windowed, repeated, and cross-faded]
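A sketch of the cross-fade & overlap idea above, assuming a Hann window of length 2L (so copies at hop L sum to one); this is illustrative Python, not the course's reference Matlab code.

```python
import numpy as np

def ola_stretch(x, r, L=256):
    """Slow x down by factor r (> 1): re-use input segments of length 2L,
    overlapped by L and cross-faded with a Hann window (copies at hop L sum to 1)."""
    N = 2 * L
    w = np.hanning(N + 1)[:N]                 # "periodic" Hann
    n_frames = int((len(x) - N) // L * r)
    y = np.zeros(n_frames * L + N)
    for m in range(n_frames):
        start = int(m / r) * L                # read position floor(m/r)*L in the input
        y[m * L : m * L + N] += w * x[start : start + N]
    return y

fs = 8000
x = np.sin(2 * np.pi * 300 * np.arange(2 * fs) / fs)   # 2 s test tone
y = ola_stretch(x, r=2.0)    # ~4 s at the same pitch, but with frame-edge artifacts
print(len(x) / fs, len(y) / fs)
```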
Synchronous overlap-add (SOLA)
Idea: allow some leeway in placing each window to optimize the alignment of waveforms.

[Figure: offset K_m maximizes the alignment of overlapping segments 1 and 2]

Hence,

y_m[mL + n] = y_{m-1}[mL + n] + w[n] · x[⌊m/r⌋ L + n + K_m]

where K_m is chosen by cross-correlation over the overlap region (N_ov samples):

K_m = argmax_{0 ≤ K ≤ K_u}  [ Σ_{n=0}^{N_ov} y_{m-1}[mL+n] · x[⌊m/r⌋L + n + K] ]  /  √( Σ_n (y_{m-1}[mL+n])² · Σ_n (x[⌊m/r⌋L + n + K])² )
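The alignment search can be sketched as below. `best_offset` is a hypothetical helper of my own (names and the small regularizing constant are assumptions); it would replace the fixed read position in the plain overlap-add sketch above with `start + K_m`.

```python
import numpy as np

def best_offset(y_prev, x_seg, K_max):
    """Return K in [0, K_max] maximizing the normalized cross-correlation between
    y_prev (the last N_ov samples already synthesized) and x_seg[K : K + N_ov].
    Caller must ensure len(x_seg) >= K_max + len(y_prev)."""
    N_ov = len(y_prev)
    best_K, best_score = 0, -np.inf
    for K in range(K_max + 1):
        seg = x_seg[K : K + N_ov]
        denom = np.sqrt(np.sum(y_prev ** 2) * np.sum(seg ** 2)) + 1e-12   # avoid 0/0
        score = np.sum(y_prev * seg) / denom
        if score > best_score:
            best_K, best_score = K, score
    return best_K
```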
The Fourier domain
Fourier Series (periodic continuous x), ω_0 = 2π/T:

x(t) = Σ_k c_k e^{j k ω_0 t}
c_k = (1/T) ∫_{-T/2}^{T/2} x(t) e^{-j k ω_0 t} dt

[Figure: periodic x(t) and its line spectrum |c_k| vs k]

Fourier Transform (aperiodic continuous x):

x(t) = (1/2π) ∫ X(jΩ) e^{jΩt} dΩ
X(jΩ) = ∫ x(t) e^{-jΩt} dt

[Figure: waveform x(t) (time / sec) and magnitude spectrum |X(jΩ)| (level / dB vs freq / Hz)]
Discrete-time Fourier
DT Fourier Transform (aperiodic sampled x):

x[n] = (1/2π) ∫_{2π} X(e^{jω}) e^{jωn} dω
X(e^{jω}) = Σ_n x[n] e^{-jωn}

[Figure: x[n] vs n and the periodic spectrum |X(e^{jω})| vs ω]

Discrete Fourier Transform (N-point x):

x[n] = (1/N) Σ_k X[k] e^{j 2πkn/N}
X[k] = Σ_n x[n] e^{-j 2πkn/N}

[Figure: x[n], n = 0 ... N−1, and |X[k]|, k = 0 ... N−1, sampling |X(e^{jω})|]
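The DFT analysis sum can be checked numerically; this sketch (not from the slides) evaluates the definition directly and compares it with NumPy's FFT, which computes the same transform efficiently.

```python
import numpy as np

def dft(x):
    """O(N^2) DFT straight from the definition X[k] = sum_n x[n] exp(-j 2 pi k n / N)."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)                    # one row per frequency bin k
    return np.sum(x * np.exp(-2j * np.pi * k * n / N), axis=1)

x = np.random.randn(64)
print(np.allclose(dft(x), np.fft.fft(x)))   # True: identical up to rounding error
```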
Sampling and aliasing
Discrete-time signals equal the continuous-time signal at discrete sampling instants:

x_d[n] = x_c(nT)

Sampling cannot represent rapid fluctuations:

sin( (ω_M + 2π/T) T n ) = sin( ω_M T n ),   n ∈ Z

Nyquist limit (Ω_T / 2) from the periodic spectrum:

[Figure: the periodic spectrum G_p(jΩ) repeats with period Ω_T; the copy at Ω_T puts energy at Ω_T ± Ω_M, an alias G_a(jΩ) of the baseband signal]
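A quick numerical confirmation of the identity above (sampling rate and test frequency are arbitrary choices): a component offset by the sampling frequency 1/T lands on exactly the same samples as the baseband component.

```python
import numpy as np

fs = 8000.0                      # 1/T, assumed
T = 1.0 / fs
n = np.arange(32)
w_M = 2 * np.pi * 1000.0         # 1 kHz, below the Nyquist limit fs/2

x_base  = np.sin(w_M * T * n)                       # 1 kHz sampled at 8 kHz
x_alias = np.sin((w_M + 2 * np.pi / T) * T * n)     # 9 kHz sampled at 8 kHz
print(np.allclose(x_base, x_alias))                 # True: indistinguishable
```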
Speech sounds in the Fourier domain
[Figure: time-domain waveforms (time / s) and Fourier magnitude spectra (energy / dB vs freq / Hz) for four segment types]

  - Vowel: periodic ("has")
  - Fricative: aperiodic ("watch")
  - Glide: transition ("watch")
  - Stop: transient ("dime")

dB = 20 log₁₀(amplitude) = 10 log₁₀(power)
Voiced spectrum has pitch + formants
Short-time Fourier Transform
Want to localize energy in time and frequency:
  - break sound into short-time pieces
  - calculate the DFT of each one

[Figure: waveform divided into overlapping short-time windows (m = 0, 1, 2, 3 at offsets 0, L, 2L, 3L); the DFT of each windowed piece gives one column of the time-frequency plane]

Mathematically,

X[k, m] = Σ_{n=0}^{N−1} x[n] · w[n − mL] · exp( −j 2πk(n − mL) / N )
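A minimal sketch of the STFT equation above, assuming an N-point Hann window and hop L (both arbitrary values); column m of the result is the DFT of the m-th windowed piece.

```python
import numpy as np

def stft(x, N=256, L=128):
    """Return the complex STFT as an array X[k, m] (frequency bin k, frame m)."""
    w = np.hanning(N)
    n_frames = 1 + (len(x) - N) // L
    X = np.empty((N, n_frames), dtype=complex)
    for m in range(n_frames):
        frame = x[m * L : m * L + N] * w       # short-time windowed piece
        X[:, m] = np.fft.fft(frame)            # DFT of each piece
    return X

fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
X = stft(x)
print(X.shape)          # (256, number of frames)
```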
The Spectrogram
Plot STFT X[k, m] as a gray-scale image
[Figure: waveform (top) and its spectrogram (bottom): freq / Hz (0-4000) vs time / s, intensity / dB on a gray scale]
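One way to produce such a plot (a sketch using SciPy's spectrogram rather than the hand-rolled STFT above; the window length and the synthetic test signal are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(2 * fs) / fs)    # stand-in for a speech recording

f, t, Sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), cmap='gray_r')   # intensity / dB
plt.xlabel('time / s')
plt.ylabel('freq / Hz')
plt.colorbar(label='intensity / dB')
plt.show()
```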
Time-frequency tradeoff

A longer window w[n] gains frequency resolution at the cost of time resolution.

[Figure: narrowband spectrogram (window = 256 pt) vs wideband spectrogram (window = 48 pt) of the same utterance, freq / Hz vs time / s, level / dB]
Speech sounds on the Spectrogram
Most popular speech visualization
[Spectrogram of the utterance (0-4000 Hz, freq / Hz vs time / s), with regions marked: vowel (periodic, "has"), fricative (aperiodic, "watch"), glide (transition, "watch"), stop (transient, "dime")]

Wideband (short window) better than narrowband (long window) to see formants.
TSM with the Spectrogram
Just stretch out the spectrogram?
[Figure: a spectrogram and the same spectrogram stretched to twice its length, frequency vs. time]

  - how to resynthesize?
  - the spectrogram is only |Y[k, m]|
The Phase Vocoder
Timescale modification in the STFT domain.

Take the magnitude from the stretched spectrogram:

|Y[k, m]| = |X[k, m/r]|    (e.g. by linear interpolation)

But preserve the phase increment between slices:

Δ_m ∠Y[k, m] = Δ_m ∠X[k, m/r]    (e.g. by a discrete differentiator)

Does the right thing for a single sinusoid:
  - keeps the overlapped parts of the sinusoid aligned

[Figure: a sinusoid seen through successive overlapping windows; the phase advance per hop is preserved, so the overlapped copies stay in phase]
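A compact phase-vocoder sketch along these lines, assuming Hann windows, an analysis hop La and a synthesis hop Ls ≈ r·La (so magnitudes are effectively read at m/r), with phases advanced by the estimated per-bin instantaneous frequency; this is an illustrative Python version, not the course's reference implementation.

```python
import numpy as np

def phase_vocoder_stretch(x, r, N=1024, La=256):
    """Stretch x by factor r: copy STFT magnitudes, advance phases by the
    estimated per-bin instantaneous frequency times the synthesis hop."""
    Ls = int(round(r * La))                    # synthesis hop ~ r * analysis hop
    w = np.hanning(N)
    omega = 2 * np.pi * np.arange(N) / N       # nominal bin frequencies (rad/sample)

    n_frames = 1 + (len(x) - N) // La          # analysis STFT at hop La
    X = np.array([np.fft.fft(w * x[m * La : m * La + N]) for m in range(n_frames)])

    y = np.zeros(n_frames * Ls + N)
    wsum = np.zeros_like(y)
    phase = np.angle(X[0])
    for m in range(n_frames):
        if m > 0:
            # deviation of the measured phase increment from the expected omega*La,
            # wrapped to (-pi, pi]
            dphi = np.angle(X[m]) - np.angle(X[m - 1]) - omega * La
            dphi = (dphi + np.pi) % (2 * np.pi) - np.pi
            phase = phase + (omega + dphi / La) * Ls   # advance by the synthesis hop
        frame = np.real(np.fft.ifft(np.abs(X[m]) * np.exp(1j * phase)))
        y[m * Ls : m * Ls + N] += w * frame            # windowed overlap-add
        wsum[m * Ls : m * Ls + N] += w ** 2
    nz = wsum > 1e-3
    y[nz] /= wsum[nz]                                  # undo the window overlap gain
    return y

fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(2 * fs) / fs)   # 2 s, 440 Hz test tone
y = phase_vocoder_stretch(x, r=2.0)                    # ~4 s, pitch unchanged
print(len(x) / fs, len(y) / fs)
```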
General issues in TSM
Time window:
  - stretching a narrowband spectrogram

Malleability of different sounds:
  - vowels stretch well, stops lose their nature

Not a well-formed problem?
  - want to alter time without altering frequency
  - ... but time and frequency are not separate!
  - a satisfying result is a subjective judgment
  - the solution depends on auditory perception ...
Summary
Information in sound:
  - lots of it, multiple levels of abstraction

Course overview:
  - survey of audio processing topics
  - practicals, readings, project

DSP review:
  - digital signals, time domain
  - Fourier domain, STFT

Timescale modification:
  - properties of the speech signal
  - time-domain methods
  - phase vocoder
References
J. L. Flanagan and R. M. Golden. Phase vocoder. Bell System Technical Journal, pages 1493-1509, 1966.

M. Dolson. The phase vocoder: A tutorial. Computer Music Journal, 10(4):14-27, 1986.

M. Puckette. Phase-locked vocoder. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 222-225, 1995.

A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals. In Proc. 13th European Signal Processing Conference, Antalya, Turkey, 2005.