Introduction & DSP: EE E6820: Speech & Audio Processing & Recognition
Lecture 1:
Introduction & DSP
Dan Ellis <[email protected]>
Mike Mandel <[email protected]>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/dpwe/e6820
January 22, 2009
1 Sound and information
2 Course Structure
3 DSP review: Timescale modification
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 1 / 33
Sound and information
Sound is air pressure variation
Mechanical vibration → Pressure waves in air → Motion of sensor → Time-varying voltage v(t)
Transducers convert air pressure ↔ voltage
What use is sound?
Footsteps examples:
[Figure: two footstep recordings, waveform amplitude vs. time / s over 0 to 5 s]
Hearing confers an evolutionary advantage
useful information, complements vision
. . . at a distance, in the dark, around corners
listeners are highly adapted to natural sounds (including
speech)
The scope of audio processing
The acoustic communication chain
message → signal → channel → receiver → decoder
[Figure labels: synthesis, audio processing, recognition]
Sound is an information bearer
Received sound reflects source(s)
plus effect of environment (channel)
Levels of abstraction
Much processing concerns shifting between levels of abstraction
[Figure: levels from concrete to abstract: sound p(t), representation (e.g. t-f energy), information. Analysis moves toward the abstract; Synthesis moves toward the concrete]
Different representations serve different tasks
separating aspects, making things explicit, . . .
Source structure
Goals
evaluation
Topic
sampling interval $T$
sampling frequency $\Omega_T = \frac{2\pi}{T}$
quantizer $Q(y) = \epsilon \lfloor y / \epsilon \rfloor$
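A minimal sketch of sampling and uniform quantization may help here. It assumes a 440 Hz sine as a stand-in for the continuous signal and a floor-based quantizer with step size `eps`; the sampling rate, the tone, the step size, and the function names are illustrative choices, not values from the lecture.

```python
import numpy as np

def sample(x_c, T, n_samples):
    """Sample a continuous-time function x_c at t = n*T, i.e. x_d[n] = x_c(nT)."""
    n = np.arange(n_samples)
    return x_c(n * T)

def quantize(y, eps):
    """Uniform quantizer with step eps: round each value down to a multiple of eps."""
    return eps * np.floor(y / eps)

fs = 8000.0                      # assumed sampling rate in Hz
T = 1.0 / fs                     # sampling interval in seconds
omega_T = 2 * np.pi / T          # sampling frequency in rad/s
x_c = lambda t: np.sin(2 * np.pi * 440.0 * t)   # illustrative "continuous" signal
x_d = sample(x_c, T, 64)         # discrete-time samples
eps = 2.0 ** -7                  # assumed step size (roughly 8 bits over [-1, 1))
x_q = quantize(x_d, eps)         # quantized samples
err = x_d - x_q                  # quantization error, always in [0, eps)
```

With a floor-based quantizer the error is one-sided; a rounding quantizer would instead center it on zero.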
The speech signal: time domain
Speech is a sequence of different sound types
[Figure: waveform of the phrase "has a watch thin as a dime", amplitude vs. time/s (1.4 to 2.6 s), with four zoomed panels]
Vowel: periodic ("has")
Fricative: aperiodic ("watch")
Glide: smooth transition ("watch")
Stop burst: transient ("dime")
Timescale modification (TSM)
Can we modify a sound to make it slower?
i.e. speech pronounced more slowly
e.g. to help comprehension, analysis
or more quickly for speed listening?
Why not just slow it down?
$x_s(t) = x_o\left(\frac{t}{r}\right)$, $r$ = slowdown factor ($r > 1$ is slower)
equivalent to playback at a different sampling rate
[Figure: waveform vs. time/s (2.35 to 2.6 s). Panels: Original; 2x slower (r = 2)]
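The effect of simply slowing the waveform down can be checked numerically. The sketch below assumes a 400 Hz test tone at an 8 kHz sampling rate and uses linear interpolation to evaluate $x_o(t/r)$; the helper names `slow_down` and `peak_freq` are hypothetical.

```python
import numpy as np

def slow_down(x, r):
    """Return x_s[n] = x_o(n / r) by linear interpolation (r > 1 plays slower)."""
    n_out = int(len(x) * r)
    t_src = np.arange(n_out) / r          # fractional read positions in x
    return np.interp(t_src, np.arange(len(x)), x)

def peak_freq(x, fs):
    """Frequency (Hz) of the largest magnitude bin of the real FFT."""
    spectrum = np.abs(np.fft.rfft(x))
    return np.argmax(spectrum) * fs / len(x)

fs = 8000
t = np.arange(fs) / fs                    # one second of signal
x_o = np.sin(2 * np.pi * 400 * t)         # illustrative 400 Hz tone
x_s = slow_down(x_o, r=2)                 # twice as long, played at the same fs
f_orig, f_slow = peak_freq(x_o, fs), peak_freq(x_s, fs)
```

The slowed signal lasts twice as long, but its spectral peak also drops by the factor r, so the pitch changes along with the duration. That is the motivation for methods that alter global timing while leaving local structure alone.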
Time-domain TSM
Problem: want to preserve local time structure
but alter global time structure
Repeat segments
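$$\frac{\sum_{n=0}^{N_{ov}} y^{m-1}[mL+n] \cdot x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n + K\right]}{\sqrt{\sum \left(y^{m-1}[mL+n]\right)^2 \sum \left(x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n + K\right]\right)^2}}$$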
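One way to realize "repeat segments while preserving local structure" is an overlap-add scheme that searches for the input offset K giving the best normalized cross-correlation with the output built so far. The following is a rough sketch under assumptions, not the lecture's implementation: the function name `sola_stretch`, the segment sizes, the linear cross-fade, and the search range `k_max` are all illustrative, and the input is assumed to be longer than one segment.

```python
import numpy as np

def sola_stretch(x, r, L=256, n_ov=128, k_max=128):
    """Time-stretch x by factor r (> 1 is slower) using overlap-add with alignment search."""
    x = np.asarray(x, dtype=float)
    N = L + n_ov                                  # segment length: hop plus overlap
    y = np.zeros(int(len(x) * r) + 2 * N)
    y[:N] = x[:N]                                 # segment m = 0 is copied unchanged
    fade = np.arange(n_ov) / n_ov                 # linear cross-fade weights
    m = 1
    while True:
        base = int(np.floor(m / r)) * L           # nominal read position floor(m/r) * L
        if base + k_max + N > len(x):
            break                                 # not enough input left for a full search
        out = m * L                               # write position in the output
        ref = y[out:out + n_ov]                   # tail of the output built so far
        best_k, best_score = 0, -np.inf
        for k in range(k_max + 1):                # try each candidate offset K
            cand = x[base + k:base + k + n_ov]
            norm = np.sqrt(np.dot(ref, ref) * np.dot(cand, cand)) + 1e-12
            score = np.dot(ref, cand) / norm      # normalized cross-correlation
            if score > best_score:
                best_k, best_score = k, score
        seg = x[base + best_k:base + best_k + N]
        y[out:out + n_ov] = (1 - fade) * ref + fade * seg[:n_ov]   # cross-fade overlap
        y[out + n_ov:out + N] = seg[n_ov:]        # append the rest of the segment
        m += 1
    return y[:(m - 1) * L + N]

fs = 8000
x = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)  # illustrative 200 Hz tone
y = sola_stretch(x, r=2.0)
```

For a steady tone the search finds an offset where the waveforms line up, so the output is roughly r times longer while its period is unchanged.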
The Fourier domain
Fourier Series (periodic continuous x)
$\Omega_0 = \frac{2\pi}{T}$
$x(t) = \sum_k c_k e^{jk\Omega_0 t}$
$c_k = \frac{1}{T} \int_{-T/2}^{T/2} x(t) e^{-jk\Omega_0 t} \, dt$
[Figure: coefficient magnitudes $|c_k|$ vs. k (1 to 7), and the periodic waveform x(t) vs. t]
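The Fourier series coefficients can be approximated numerically. This sketch uses the standard normalization $c_k = \frac{1}{T}\int_{-T/2}^{T/2} x(t) e^{-jk\Omega_0 t}\,dt$ with a midpoint-rule integral, and assumes a ±1 square wave as the test signal; the function name `fourier_coeff` and the grid size are illustrative.

```python
import numpy as np

def fourier_coeff(x, k, T, n_points=4096):
    """Approximate c_k = (1/T) * integral over one period of x(t) exp(-j k Omega_0 t) dt."""
    omega_0 = 2 * np.pi / T
    # Midpoint rule on [-T/2, T/2): averaging over the grid supplies the 1/T factor.
    t = (np.arange(n_points) + 0.5) / n_points * T - T / 2
    return np.mean(x(t) * np.exp(-1j * k * omega_0 * t))

T = 1.0
square = lambda t: np.sign(np.sin(2 * np.pi * t / T))   # illustrative +/-1 square wave
c = {k: fourier_coeff(square, k, T) for k in range(1, 8)}
```

For this square wave the odd harmonics have magnitude $2/(\pi k)$ and the even harmonics vanish, which gives a quick sanity check on the normalization.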
Fourier Transform (aperiodic continuous x)
$x(t) = \frac{1}{2\pi} \int X(j\Omega) e^{j\Omega t} \, d\Omega$
$X(j\Omega) = \int x(t) e^{-j\Omega t} \, dt$
[Figure: x(t) vs. time / sec, and $|X(j\Omega)|$ level / dB vs. freq / Hz (0 to 8000 Hz)]
Discrete-time Fourier
DT Fourier Transform (aperiodic sampled x)
$x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega}) e^{j\omega n} \, d\omega$
$X(e^{j\omega}) = \sum_n x[n] e^{-j\omega n}$
[Figure: x[n] vs. n, and $|X(e^{j\omega})|$ vs. $\omega$]
Discrete Fourier Transform (N-point x)
$x[n] = \frac{1}{N} \sum_k X[k] e^{j \frac{2\pi k n}{N}}$
$X[k] = \sum_n x[n] e^{-j \frac{2\pi k n}{N}}$
[Figure: x[n] vs. n, and $|X[k]|$ as samples of $|X(e^{j\omega})|$ vs. k]
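The DFT pair can be written out directly as matrix products and compared against a library FFT. This is a minimal sketch: the forward transform carries no scale factor and the inverse carries $1/N$ (the convention NumPy also uses), and the 8-point test signal is an arbitrary illustration.

```python
import numpy as np

def dft(x):
    """Direct N-point DFT: X[k] = sum_n x[n] exp(-j 2 pi k n / N)."""
    N = len(x)
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)   # W[k, n] = exp(-j 2 pi k n / N)
    return W @ x

def idft(X):
    """Inverse DFT: x[n] = (1/N) sum_k X[k] exp(+j 2 pi k n / N)."""
    N = len(X)
    k = np.arange(N)
    W = np.exp(2j * np.pi * np.outer(k, k) / N)
    return (W @ X) / N

x = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0])   # illustrative 8-point signal
X = dft(x)            # O(N^2) direct evaluation
x_back = idft(X)      # should recover x up to rounding error
```

The direct form costs $O(N^2)$ operations; `np.fft.fft` computes the same values with an FFT algorithm.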
Sampling and aliasing
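Discrete-time signals equal the continuous-time signal at discrete sampling instants:
$x_d[n] = x_c(nT)$
Sampling cannot represent rapid fluctuations
[Figure: sampled sinusoid, amplitude -1 to 1 vs. sample index 0 to 10]
$\sin\left(\left(\Omega_M + \frac{2\pi}{T}\right) T n\right) = \sin(\Omega_M T n) \quad \forall n \in \mathbb{Z}$
Nyquist limit ($\Omega_T / 2$) from periodic spectrum:
[Figure: periodic spectrum $G_p(j\Omega)$ containing the baseband spectrum $G_a(j\Omega)$ and an alias of the baseband signal, with axis marks at $\Omega_T$, $\Omega_T - \Omega_M$, and $\Omega_T + \Omega_M$]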
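The aliasing identity above can be verified on actual sample values. The sketch assumes an 8 kHz sampling rate and a 1 kHz component; `alias_frequency` is a hypothetical helper that folds a frequency in Hz back into the range below the Nyquist limit.

```python
import numpy as np

T = 1.0 / 8000.0                       # assumed sampling interval (8 kHz rate)
omega_T = 2 * np.pi / T                # sampling frequency in rad/s
omega_M = 2 * np.pi * 1000.0           # illustrative 1 kHz component
n = np.arange(-20, 21)                 # a range of integer sample indices

low = np.sin(omega_M * T * n)                  # samples of the 1 kHz sinusoid
high = np.sin((omega_M + omega_T) * T * n)     # samples of the 9 kHz sinusoid

def alias_frequency(f, fs):
    """Fold a frequency f (Hz) into the baseband range [0, fs/2]."""
    f = f % fs
    return fs - f if f > fs / 2 else f
```

Both sinusoids give the same samples, so after sampling they cannot be told apart; only components below half the sampling frequency are represented unambiguously.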
Speech sounds in the Fourier domain
[Figure: four speech segments shown in the time domain (amplitude vs. time / s) and in the frequency domain (energy / dB vs. freq / Hz)]
Vowel: periodic ("has")
Fricative: aperiodic ("watch")
Glide: transition ("watch")
Stop: transient ("dime")
$\text{dB} = 20 \log_{10}(\text{amplitude}) = 10 \log_{10}(\text{power})$
Voiced spectrum has pitch + formants
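The two forms of the dB relation agree because power is proportional to amplitude squared. A small sketch, with arbitrary example ratios and hypothetical helper names:

```python
import numpy as np

def amplitude_to_db(a):
    """Level in dB from an amplitude ratio: 20 log10(a)."""
    return 20 * np.log10(a)

def power_to_db(p):
    """Level in dB from a power ratio: 10 log10(p)."""
    return 10 * np.log10(p)

a = np.array([1.0, 0.5, 0.1, 0.01])    # illustrative amplitude ratios
db_from_amp = amplitude_to_db(a)
db_from_pow = power_to_db(a ** 2)      # power is proportional to amplitude squared
```

A factor of 10 in amplitude is 20 dB, and halving the amplitude is about -6 dB.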
Short-time Fourier Transform
Want to localize energy in time and frequency
break sound into short-time pieces
calculate DFT of each one
[Figure: waveform (2.35 to 2.6 s) cut by a short-time window at hops L, 2L, 3L (m = 0, 1, 2, 3); the DFT of each piece gives one column of freq / Hz (index k, 0 to 4000 Hz) vs. time / s]
Mathematically,
$X[k, m] = \sum_{n=0}^{N-1} x[n] \, w[n - mL] \exp\left(-j \frac{2\pi k (n - mL)}{N}\right)$
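A minimal STFT sketch follows, assuming a length-N Hann window that is nonzero only on its N samples, so that frame m uses the input samples from mL to mL + N - 1. The window shape, N = 256, L = 128, and the test tone are illustrative assumptions.

```python
import numpy as np

def stft(x, N=256, L=128):
    """Return X[k, m]: the N-point DFT of the windowed frame starting at sample m*L."""
    w = np.hanning(N)                            # assumed window shape
    n_frames = 1 + (len(x) - N) // L
    X = np.empty((N // 2 + 1, n_frames), dtype=complex)
    for m in range(n_frames):
        frame = x[m * L:m * L + N] * w           # x[n] w[n - mL] for n = mL .. mL+N-1
        X[:, m] = np.fft.rfft(frame)             # keep non-negative frequencies only
    return X

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)                 # illustrative 1 kHz tone
X = stft(x)
spec_db = 20 * np.log10(np.abs(X) + 1e-10)       # spectrogram values in dB
```

Each column of `X` is one short-time spectrum; displaying `spec_db` as an image with time on the horizontal axis and frequency on the vertical axis gives a spectrogram.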
The Spectrogram
Plot STFT X[k, m] as a gray-scale image
[Figure: waveform and spectrogram, freq / Hz (0 to 4000) vs. time / s, with intensity / dB from -50 to 10; one view spans 0 to 2.5 s and a zoomed view spans 2.35 to 2.6 s]
Time-frequency tradeoff
Longer window w[n] gains frequency resolution at cost of time resolution
[Figure: two spectrograms of the same speech, freq / Hz (0 to 4000) vs. time / s (1.4 to 2.6), level / dB from -50 to 10. Panels: Window = 256 pt (Narrowband); Window = 48 pt (Wideband)]
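As a rough guide, a window of N samples at sampling rate fs spans N/fs seconds and resolves frequency on the order of fs/N Hz. The sketch below applies this to the two window lengths in the figure, assuming an 8 kHz sampling rate (the plots extend to 4000 Hz, which suggests but does not confirm this). The helper name `resolution` is hypothetical, and the exact resolution also depends on the window shape.

```python
import numpy as np

def resolution(window_len, fs):
    """Return (frequency spacing in Hz, window duration in s) for a given window length."""
    return fs / window_len, window_len / fs

fs = 8000.0                                   # assumed sampling rate in Hz
narrow = resolution(256, fs)                  # long window: narrowband analysis
wide = resolution(48, fs)                     # short window: wideband analysis
product = (narrow[0] * narrow[1], wide[0] * wide[1])   # both equal 1 by construction
```

The product of the two quantities is fixed, so any gain in frequency resolution is paid for in time resolution, and vice versa.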
Speech sounds on the Spectrogram
Most popular speech visualization
[Figure: spectrogram of the phrase "has a watch thin as a dime", freq / Hz (0 to 4000) vs. time/s (1.4 to 2.6), with annotated regions]
Vowel: periodic ("has")
Fricative: aperiodic ("watch")
Glide: transition ("watch")
Stop: transient ("dime")
Wideband (short window) better than narrowband (long window) to see formants
TSM with the Spectrogram
Just stretch out the spectrogram?
[Figure: two spectrograms, Frequency (0 to 4000 Hz) vs. Time (0 to 1.4 s): the original and a time-stretched version]
how to resynthesize?
spectrogram is only |Y[k, m]|
The Phase Vocoder
Timescale modification in the STFT domain
Magnitude from stretched spectrogram:
$|Y[k, m]| = \left| X\left[k, \frac{m}{r}\right] \right|$
$\dot{\theta}_Y[k, m] = \dot{\theta}_X\left[k, \frac{m}{r}\right]$
[Figure panels: time-domain; phase vocoder]
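A compact phase vocoder sketch: the magnitude of each output frame is read from the analysis STFT at the fractional frame position m/r (by linear interpolation), while the phase is accumulated from the increment between neighboring analysis frames. This is one common formulation under stated assumptions, not necessarily the lecture's code; the Hann window, N = 512, hop = 128, and all function names are illustrative.

```python
import numpy as np

def stft(x, N, hop):
    """Analysis STFT: columns are rfft of Hann-windowed frames."""
    w = np.hanning(N)
    n_frames = 1 + (len(x) - N) // hop
    return np.stack([np.fft.rfft(w * x[m * hop:m * hop + N])
                     for m in range(n_frames)], axis=1)

def istft(Y, N, hop):
    """Synthesis by windowed overlap-add, scaled for an approximately unit gain."""
    w = np.hanning(N)
    y = np.zeros(N + hop * (Y.shape[1] - 1))
    for m in range(Y.shape[1]):
        y[m * hop:m * hop + N] += w * np.fft.irfft(Y[:, m], N)
    return y * hop / np.sum(w ** 2)

def phase_vocoder(x, r, N=512, hop=128):
    """Stretch x in time by factor r (> 1 is slower) without changing pitch."""
    X = stft(x, N, hop)
    n_bins, n_frames = X.shape
    steps = np.arange(0, n_frames - 1, 1.0 / r)           # fractional positions m / r
    expected = 2 * np.pi * hop * np.arange(n_bins) / N    # nominal phase advance per hop
    phase = np.angle(X[:, 0])                             # start from the first frame
    Y = np.empty((n_bins, len(steps)), dtype=complex)
    for i, s in enumerate(steps):
        m0 = min(int(s), n_frames - 2)                    # left neighbor frame
        frac = s - m0
        mag = (1 - frac) * np.abs(X[:, m0]) + frac * np.abs(X[:, m0 + 1])
        Y[:, i] = mag * np.exp(1j * phase)
        # Phase increment between adjacent analysis frames, with the deviation
        # from the nominal advance wrapped into [-pi, pi].
        dphi = np.angle(X[:, m0 + 1]) - np.angle(X[:, m0]) - expected
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase = phase + expected + dphi
    return istft(Y, N, hop)

fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)          # illustrative 440 Hz tone
y = phase_vocoder(x, r=2.0)
```

For a single steady sinusoid the accumulated phase keeps the overlapping frames aligned, so the output is about r times longer with the same frequency.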