Speech Acoustics Project
Abstract: In this paper, basic methods for analyzing recorded speech are presented. The
spectrogram is introduced and subsequently utilized in a Matlab environment to reveal patterns
in recorded voice data. Several examples of speech are recorded, analyzed, and compared. A
model for voice production is introduced in order to explain the variety of time-frequency
patterns in the waveforms. Specifically, a single tube and then a multi-tube model for the vocal
tract are considered and related to resonances in the speech spectrum. It is shown that a series of
connected acoustic tubes results in resonances similar to those that occur in speech.
Introduction
Motivation:
Consider the problem of speech recognition. When two different people speak the same phrase
(or if one person utters the same phrase twice), a human listener will generally have no trouble
understanding each instance of that phrase. This leads us to believe that even though the two
speakers may have different vocal qualities (different pitch, different accents, etc.) there must be
some sort of invariant quality between the two instances of the spoken phrase.
Thinking about the problem a bit further, we realize that when two different people articulate the
same phrase, they perform essentially the same mechanical motions. In other words, they move
their mouths, tongue, lips, etc., in roughly the same way. We hypothesize that as a result of
the similarities in speech mechanics from person to person there should be some features in the
recorded speech waveform that are similar for multiple instances of a spoken phrase.
One such set of speech features is called formants, which are resonances in the vocal tract. The
frequencies at which these resonances occur are a direct result of the particular configuration of
the vocal tract. As words are spoken, the speaker moves his or her tongue, mouth, and lips,
changing the resonant frequencies with time. Analysis of these time-varying frequency patterns
forms the basis for all modern speech recognition systems.
Organization:
This paper is broadly divided into two sections. Part 1 is concerned with analysis of voice
waveforms. In Part 2, we will delve into models for voice production and relate them to the data
presented in Part 1.
Part 1 is organized as follows. In Section 1.1 we briefly describe the spectrogram, a widely used
tool for time-frequency analysis of acoustic data, and illustrate its benefits with an example. A
Matlab program for recording sounds and viewing their spectrograms is presented in Section 1.2.
In Section 1.3 we divide speech sounds into two broad categories, voiced and unvoiced speech,
restricting our analysis to voiced speech. Finally, in Section 1.4, several speech waveforms are
presented and analyzed.
Part 2 is organized as follows. Section 2.1 briefly describes the vocal tract, and then Section 2.2
presents a single acoustic tube model for the vocal tract. Section 2.3 presents a multi-tube model
and discusses various ways that the model can be analyzed. Closing remarks are made in
Section 2.4.
1.1 The Spectrogram
The spectrogram displays the frequency content of a signal as a function of time. It is computed
from a recorded waveform in three steps:
1. The original waveform is first broken into smaller blocks, each of equal size. The
choice of the block size depends on the frequency of the underlying data. For speech, a
width of 20 to 30 ms is often used. Blocks are allowed to overlap. An overlap of 50% is
typical.
2. Each block is multiplied by a window function. Most window functions have a value of
1 in the middle and taper down towards 0 at the edges. Windowing a block of data has
the effect of diminishing the magnitude of the samples at the edges, while maintaining
the magnitude of the samples in the middle.
3. The Discrete Fourier Transform (DFT) of each windowed block is computed. Only the
magnitude of the DFT is retained. The result is several vectors of frequency data (the
magnitude of the DFT), one vector for each block of the original waveform. The
frequency information is localized in time depending on the location of the time block
that it was computed from.
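The steps above can be sketched directly in Matlab. The following is only a minimal illustration
(not the Record program described later): it assumes x is a recorded mono signal and fs its
sampling rate, and uses the 25 ms block length and 50% overlap quoted above (hamming is from
the Signal Processing Toolbox).

    x = x(:);                                 % ensure the signal is a column vector
    blockLen = round(0.025*fs);               % ~25 ms blocks (within the 20-30 ms range above)
    hop      = round(blockLen/2);             % 50% overlap between blocks
    w        = hamming(blockLen);             % window that tapers toward 0 at the edges
    nBlocks  = floor((length(x) - blockLen)/hop) + 1;
    S = zeros(blockLen, nBlocks);
    for m = 1:nBlocks
        idx     = (m-1)*hop + (1:blockLen).'; % sample indices of the m-th block
        S(:, m) = abs(fft(x(idx) .* w));      % keep only the DFT magnitude
    end
    % Each column of S now holds the frequency content of one time block.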
A simple example will help to illustrate the point. Below we have the waveform and
spectrogram of a bird chirping. This sound was borrowed from a Matlab demonstration.
The waveform and spectrogram of a chirping bird
The upper plot shows the time domain waveform of a bird chirping. Below this is the
spectrogram, which shows frequency content as a function of time. Frequency is on the vertical
axis and time is on the horizontal. Blue indicates larger magnitude while red indicates smaller
magnitude.
The beauty of the spectrogram is that it clearly illustrates how the frequency of a signal varies
with time. In this example we can see that each chirp starts at a high frequency, usually between
3 and 4 kHz, and over the course of about 0.1 seconds, decreases in frequency to about 2 kHz.
This type of detail would be lost if we chose to take the DFT of the entire waveform.
Technical details:
The remainder of this section describes the program, both its inner workings and its functionality.
Running the program:
The program can be run by typing record at the Matlab prompt or by opening the program in the
Matlab editor and selecting Run from the Debug menu.
Recording:
Sound recording is initiated through the Matlab graphical user interface (GUI) by clicking on the
record button. The duration of the recording can be adjusted to be anywhere from 1 to 6
seconds. (These are the GUI defaults, but the code can be modified to record for longer
durations if desired.)
Upon being clicked, the record button executes a function that reads in mono data from the
microphone jack on the sound card and stores it in a Matlab vector.
Most of the important information in a typical voice waveform is found below a frequency of
about 4 kHz. Accordingly, we should sample at least twice this frequency, or 8 kHz. (Note
that all sound cards have a built-in pre-filter to limit the effects of aliasing.) Since there is at
least some valuable information above 4 kHz, the Record GUI has a default sampling rate of
11.025 kHz; this can be modified in the code. A sampling rate of 16 kHz was used in the past,
but the data acquisition toolbox in Matlab 6.0 does not support this rate.
Once recorded, the time data is normalized to a maximum amplitude of 0.99 and displayed on
the upper plot in the GUI window. In addition to the time domain waveform, a spectrogram is
computed using Matlab's built-in specgram function (part of the Signal Processing Toolbox).
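A minimal sketch of these two steps is shown below. It assumes y holds the recorded mono data
and fs = 11025 is the GUI's default sampling rate; the FFT and window lengths are assumed,
typical values rather than the ones hard-coded in the program.

    y = 0.99 * y / max(abs(y));               % normalize the peak amplitude to 0.99
    specgram(y, 512, fs, hamming(256), 128);  % ~23 ms Hamming window with 50% overlap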
An example recording of the sentence "We were away a year ago" is shown below.
We were away a year ago
One can examine a region of interest in the waveform using the Zoom in button. When Zoom in
is clicked, the cursor will change to a cross hair. Clicking the left mouse button and dragging a
rectangle around the region of interest in the time domain waveform will select a sub-section of
data. In the example below we have zoomed in on the region from about 1 to 1.2 seconds.
The Zoom out button will change the axis back to what it was before Zoom in was used. If you
zoom in multiple times, zooming out will return you to the previous axis limits.
The Play button uses Matlab's sound function to play back (send to the speakers) the waveform
that appears in the GUI. If you have zoomed in on a particular section of the waveform, only
that portion of the waveform will be sent to the speakers.
Save is used to write the waveform to a wave file. If you have zoomed in on a segment of data, only that
portion of the waveform will be saved.
Click Load to import any mono wave file into the Record GUI for analysis.
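For reference, rough command-line equivalents of the Play, Save, and Load operations are shown
below. The file name is only an example; wavwrite and wavread were current in the Matlab 6.x
era, while newer releases use audiowrite and audioread instead.

    sound(y, fs);                         % Play: send the current waveform to the speakers
    wavwrite(y, fs, 'example.wav');       % Save: write it to a wave file
    [y2, fs2] = wavread('example.wav');   % Load: read a mono wave file back in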
Voiced Speech:
All voiced speech originates as vibrations of the vocal cords. Its primary characteristic is its
periodic nature.
Voiced speech is created by pushing air from the lungs up the trachea to the vocal folds (cords),
where pressure builds until the folds part, releasing a puff of air. The folds then return to their
original position as pressure on each side is equalized. Muscles controlling the tension and
elasticity of the folds determine the rate at which they vibrate. See [3].
The puffs of air from the vocal cords are subsequently passed through the vocal tract and then
through the air to our ears. The periodicity of the vocal cord vibrations is directly related to the
perceived pitch of the sound. We will examine the effects of the vocal tract in more detail later
on.
Vowel sounds are one example of voiced speech. Consider the /aa/ sound in father, or the /o/
sound in boat. In the segment of voiced speech below, note the periodicity of the waveform.
Segment of voiced speech, /aa/ in father
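One simple way to quantify this periodicity (not something done by the Record program) is to
look at the autocorrelation of a short voiced segment. The sketch below assumes seg is a 30-40 ms
voiced segment and fs the sampling rate; the 66-400 Hz search range is an assumed, typical pitch
range.

    seg = seg(:) - mean(seg);                 % remove any DC offset
    [c, lags] = xcorr(seg, 'coeff');          % normalized autocorrelation
    cpos = c(lags >= 0);                      % cpos(1) corresponds to lag 0
    minLag = round(fs/400);                   % shortest period considered (400 Hz)
    maxLag = round(fs/66);                    % longest period considered (66 Hz)
    [pk, i] = max(cpos(minLag+1:maxLag+1));   % strongest peak in the plausible range
    f0 = fs / (minLag + i - 1)                % estimated pitch in Hz

The lag of the strongest autocorrelation peak is the pitch period; its reciprocal corresponds to the
perceived pitch mentioned above.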
Unvoiced speech:
Unvoiced speech does not have the periodicity associated with voiced speech. In many kinds of
unvoiced speech, noise-like sound is produced at the front of the mouth using the tongue, lips,
and/or teeth. The vocal folds are held open for these sounds.
Consider the sounds /f/ as in fish and /s/ as in sound. The /f/ sound is created by forcing air between
the lower lip and teeth, while /s/ is created by forcing air through a constriction between the
tongue and the roof of the mouth or the teeth.
The waveform below shows a small segment of unvoiced speech. Note its distinguishing
characteristics. It is low amplitude, noise-like, and it changes more rapidly than voiced speech.
Let's further examine the waveform and spectrogram of a word containing both voiced and
unvoiced speech. A recording of the word sky was made with the Matlab program. The /s/
sound, we now know, is unvoiced, while the /eye/ sound is voiced. (The /k/ is also unvoiced, but
not noise-like.) The plots are shown below.
Looking at the spectrogram we note that /s/ contains a broad range of frequencies, but is
concentrated at higher frequencies. The resonances, or formants, in the speech waveform can be
seen as blue, horizontal stripes in the spectrogram. These formants, mentioned in the
introduction, aren't particularly clear in the unvoiced /s/, but are quite obvious in the voiced
/eye/. It is for this reason that we restrict our analysis to voiced speech.
In addition to the waveforms and spectrograms, we will analyze the spectrum of small segments
of the waveforms. These spectra will help us compare formant frequencies from sound to sound.
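These segment spectra can be computed with a few lines of Matlab. The sketch below is an
illustration rather than the exact code used for the figures; it assumes y is the recorded waveform,
fs the sampling rate, and t1 and t2 (in seconds, with t1 > 0) the boundaries of the segment of
interest.

    seg = y(round(t1*fs) : round(t2*fs));     % cut the segment out of the waveform
    seg = seg(:) .* hamming(length(seg));     % window it
    nfft = 1024;                              % assumed FFT length
    Y = abs(fft(seg, nfft));                  % DFT magnitude
    f = (0:nfft/2-1) * fs / nfft;             % frequency axis in Hz
    plot(f, 20*log10(Y(1:nfft/2)));           % spectrum in dB, up to fs/2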
Waveform 1: Already
The word already contains several different sounds, all of them voiced. Notice how the formants
change with time. One significant feature that is also easy to identify in the spectrogram is the
/d/ sound. This sound is called a stop, for obvious reasons. When the /d/ is pronounced, the
tongue temporarily stops air from leaving the oral cavity. This action leads to a small amplitude
in the waveform and a hole in the spectrogram. Can you spot it?
Waveform and spectrogram of the word already
We'll now take a look at a small sub-section of the waveform. We've isolated about 0.07
seconds of sound towards the beginning of the word. By itself, this portion of the waveform
sounds a bit like the /au/ in the word caught. The plot below shows the waveform and spectrum
of the sub-signal.
Moving on, we isolate a different portion of the same waveform, the /ee/ sound at the end of the
word already. By looking at either the waveform or the spectrum, we can see that /ee/ contains
more high frequency information than /au/. /ee/ has a narrow first formant and 3 additional
formants at higher frequencies.
Waveform 2: Cool
The word cool contains three distinct sounds. The /c/ at the beginning is an unvoiced sound. The
/oo/ and the /L/ are both voiced sounds, but each has different properties. For the sake of time
we'll stick to the /L/ sound at the end of the word. An /L/ is created by arching the tongue in the
mouth. This leads to a sound that differs greatly from most of the vowel sounds.
Waveform and spectrogram of the word cool
Below, you'll see a segment of the /L/ sound. It is different from the last two sounds we've
analyzed in that it contains virtually no high frequency components. There are two high
frequency formants at about 3000 and 3500 Hz, but they are a great deal smaller in magnitude
than the first formant.
Waveform 3: Cot
We chose the word cot to get a look at the /ah/ sound in the middle. You'll notice from the
spectrogram that this sound is quite frequency rich, containing several large formants between 0
and 5 kHz.
Compare the /ah/ waveform (below) to the previous /L/ waveform. There is a great deal more
activity in this waveform, which explains the variety of frequencies in the spectrogram. The
spectrum of the signal reveals 5 (possibly 6) significant formants, each one having a sizable
bandwidth.
/ah/ in cot from about 0.17 to 0.22 sec
It is clear from the data that different vocal sounds have widely varying spectral content.
However, each sound contains similar features, like formants, that arise from the mechanics of
speech production. We'll now begin to talk a bit about the vocal tract and the attempts that have
been made to model its functionality. The predictions of these models will then be related to the
empirical results from this section.
2.1 The Vocal Tract
The figure above shows a schematic diagram of the vocal tract on the left and, on the right, a plot of
the area of the vocal tract as a function of distance (in centimeters) from the vocal cords. The
area/distance function plotted here is for the sound /i/, as in bit. The configuration of the vocal
tract, and hence the plot, is different for different sounds.
Looking at the area vs. distance function in the plot above, you'll notice that there are two major
resonant chambers in the vocal tract: the first, the pharyngeal cavity, extends from about 1 to 8 cm,
and the second, the oral cavity, from about 14 to 16 cm. This manner of thinking about the vocal
tract, identifying resonant chambers, leads us to model it as an acoustic tube, or a concatenation
of acoustic tubes, an idea we'll examine further in subsequent sections.
Aside: Some sounds, called nasals, also use the nasal cavity as an additional resonant chamber
and path to the outside world. However, these sounds are a small subset of the set of voiced
sounds, and we will ignore them in this presentation.
More detail about the vocal cords can be found in [3], and an in-depth analysis of the vocal tract
can be found in [4].
We'll now examine some models of speech production. All of these models make a couple of
simplifying assumptions: that the excitation source and the vocal tract can be treated
independently of one another, and that sound propagates through the tract as plane waves in a
lossless tube. Neither assumption is entirely true, but both help to simplify the analysis. The
acoustic tube models that we will examine can also be viewed as source-filter models. Either
way you look at it, there is an excitation sent through a channel, and that channel alters the
spectral content of the excitation.
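As a toy illustration of the source-filter view (not a model used elsewhere in this paper), the sketch
below passes an impulse-train excitation through an all-pole filter with two resonances. All of the
numerical values are assumptions chosen only to make the idea concrete.

    fs = 11025;                                % sampling rate in Hz (assumed)
    f0 = 120;                                  % pitch of the excitation in Hz (assumed)
    src = zeros(1, round(0.5*fs));             % half a second of excitation
    src(1 : round(fs/f0) : end) = 1;           % impulse train standing in for glottal pulses
    F  = [500 1500];                           % resonance (formant) frequencies in Hz (assumed)
    BW = [80 120];                             % bandwidths in Hz (assumed)
    a = 1;
    for m = 1:2
        r = exp(-pi*BW(m)/fs);                 % pole radius set by the bandwidth
        a = conv(a, [1, -2*r*cos(2*pi*F(m)/fs), r^2]);
    end
    out = filter(1, a, src);                   % the excitation shaped by the "channel"

Playing the result with soundsc(out, fs) gives a buzzy, vowel-like tone whose spectrum peaks near
the two chosen resonance frequencies.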
2.2 A Single Tube Model
Suppose for a moment that we model the vocal tract as a single acoustic tube of uniform area.
Suppose further that the tube is excited at one end by a periodic input, and open at the other. We
will now analyze such a configuration, and attempt to determine the resonant frequencies that
this model predicts.
A uniform acoustic tube of length L, excited at x = 0 and open at x = L

The volume velocities at the two ends of the tube are
$$u(0, t) = \mathrm{Re}\{U(0)e^{j\omega t}\}, \qquad u(L, t) = \mathrm{Re}\{U(L)e^{j\omega t}\}.$$
The wave equation for the pressure inside the tube is
$$\frac{\partial^2 p}{\partial x^2} = \frac{1}{c_0^2}\,\frac{\partial^2 p}{\partial t^2}.$$
Assume a solution of the form $p(x, t) = \mathrm{Re}\{P(x)e^{j\omega t}\}$. We can now use complex notation to
write the reduced wave equation:
$$\frac{d^2 P}{dx^2} = \frac{1}{c_0^2}(j\omega)^2 P,$$
or
$$\frac{d^2 P}{dx^2} + k^2 P = 0, \qquad k = \frac{\omega}{c_0}. \tag{1}$$
Equation (1) has a solution of the form
$$P(x) = A\cos(kx) + B\sin(kx). \tag{2}$$
The acoustic momentum equation relates the volume velocity to the pressure gradient,
$$\rho_0 (j\omega) U = -\frac{dP}{dx}, \tag{3}$$
so that
$$U(x) = \frac{j}{\rho_0 c_0}\left[-A\sin(kx) + B\cos(kx)\right]. \tag{4}$$
Evaluating (4) at the excitation end gives
$$U(0) = \frac{j}{\rho_0 c_0}B, \qquad \text{so} \qquad B = -j\rho_0 c_0\,U(0). \tag{5}$$
Since the tube is open at x = L, the acoustic pressure there vanishes:
$$P(L) = 0. \tag{6}$$
We are interested in the relation between the input and the output, that is, the transfer function of
the system. Combining (4), (5), and (6), we have
$$\frac{U(L)}{U(0)} = \tan(kL)\sin(kL) + \cos(kL) = \frac{1}{\cos(kL)}. \tag{7}$$
Resonance occurs where $\cos(kL) = 0$, that is,
$$\omega = (2n + 1)\frac{\pi c_0}{2L}, \qquad \text{or} \qquad f = (2n + 1)\frac{c_0}{4L}, \qquad n = 0, 1, 2, \ldots \tag{8}$$
The results show that resonance occurs at odd multiples of the fundamental resonance, $c_0/4L$. For
a vocal tract of length 17 cm (L = 17 cm), resonances will occur at approximately 500, 1500, 2500
Hz, etc.
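The predicted resonances are easy to compute; the short sketch below uses an assumed speed of
sound of 345 m/s.

    c0 = 345;                      % speed of sound in m/s (approximate)
    L  = 0.17;                     % vocal tract length in meters
    n  = 0:3;
    f  = (2*n + 1) * c0 / (4*L)    % approximately 507, 1522, 2537, 3551 Hz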
This model seems to be a decent first approximation, yielding results similar to those observed in
the data. Clearly, however, real resonances do not occur at such evenly spaced intervals; the
recorded data show formants at a wide variety of frequencies, so a more flexible model is needed.
2.3 A Multi-Tube Model
Most derivations of multi-tube models, [1], [2], [4], [5], rely on either electrical analogs of
acoustic circuits, which translate into transmission line problems, or discretized versions of the
pressure and volume velocity equations. In either case the solution to the wave equation can be
written as the sum of forward- and backward-traveling waves. For the i-th tube, with cross-sectional
area $A_i$, the volume velocity $u_i(x, t)$ and pressure $p_i(x, t)$ are
$$u_i(x, t) = u_i^{+}(t - x/c) - u_i^{-}(t + x/c)$$
$$p_i(x, t) = \frac{\rho c}{A_i}\left[u_i^{+}(t - x/c) + u_i^{-}(t + x/c)\right]$$
where $u_i^{+}$ and $u_i^{-}$ are the forward- and backward-traveling components.
Since we are primarily interested in what happens at the boundaries of the tubes, many authors,
[5], [7], [4], will discretize the velocity and pressure equations in time and space, by evaluating
the functions only at these boundaries. See the plot below.
2-Tube model of vocal tract (here, v is the volume velocity), from [7]
This sort of discrete interpretation of the multi-tube model gives rise to a useful signal flow
model. The model can further be changed to a digital waveguide equivalent if the length of each
tube is the same.
The important parameters in the signal flow diagram are the reflection coefficients. These
coefficients arise in many areas of acoustics. They determine what percentage of a wave will
pass through the boundary and what percentage will be reflected. Simulations of digital waveguides
built on this model confirm that it does indeed lead to multiple resonances of the form witnessed in
the analysis of our data.
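In the usual lossless-tube formulation, the reflection coefficient at the junction between tube k,
with area A_k, and tube k+1 is r_k = (A_{k+1} - A_k)/(A_{k+1} + A_k). A small sketch with
assumed areas:

    A = [2.6 1.0 0.6 3.0 4.5];     % cross-sectional areas of consecutive tubes, cm^2 (assumed)
    r = (A(2:end) - A(1:end-1)) ./ (A(2:end) + A(1:end-1))
    % r is near 0 where adjacent areas match (the wave mostly passes through)
    % and near +/-1 where the area changes abruptly (the wave is mostly reflected).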
Actual simulation of these models is well beyond the scope of this paper. However, Gold and
Morgan assure us that "an acoustic configuration [of this form] can always be found to match
the measured steady-state spectrum of any speech signal" ([5], page 152).
The next step in the investigation of this topic would be to test some of the digital waveguide
models for speech production described by Gold and Morgan. Unfortunately, we never reached the
point in this presentation where we could show in detail exactly how the multi-tube model leads to
resonances similar to those in the recorded data. This was the goal at the outset, but the task
proved formidable.
A data analysis technique called Linear Predictive Coding (LPC) can be used to estimate a set of
reflection coefficients corresponding to an acoustic tube model for speech production. In LPC, an
all-pole model is fit to the spectral envelope of a segment of the speech waveform. The
corresponding reflection coefficients could be used to determine the relative areas of the tube
segments. An animation could then show how the shape of the vocal tract changes during the
articulation of a phrase.
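As a rough illustration of this idea (it was not implemented for this paper), Matlab's lpc and
poly2rc functions can produce such reflection coefficients from a voiced segment seg, and relative
tube areas then follow from the standard recursion. The model order and window are assumed,
typical values, and depending on the sign convention the coefficients may need to be negated
before the area recursion is applied.

    p = 10;                                      % model order (assumed, typical value)
    a = lpc(seg(:) .* hamming(length(seg)), p);  % all-pole (predictor) coefficients
    k = poly2rc(a);                              % reflection coefficients of the lattice/tube model
    A = ones(p+1, 1);                            % relative areas; the first section is set to 1
    for m = 1:p
        A(m+1) = A(m) * (1 + k(m)) / (1 - k(m)); % inverts r = (A2 - A1)/(A2 + A1)
    end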
Recording equipment:
All sounds shown in this paper were recorded using the equipment in the Experiential Signal
Processing Laboratory (ESPLab) in the department of Electrical and Computer Engineering at
the University of Rhode Island. Laboratory funding is provided by the National Science
Foundation and the Champlin Foundation.
The microphone is a Passport P-51 with a cardioid pick-up pattern, designed to reject as much of
the sound coming from the side and rear of the microphone as possible. In other words, the P-51
has a certain directionality, with the main lobe on the vertical axis of the microphone. The
precise technical specifications for the P-51 are not readily available, but we can assume that,
because of the microphone's intended use (faithful representation of voice and acoustic music),
it ought to have a relatively flat response between 20 Hz and 20 kHz. Thus the microphone is
more than adequate for recording voice waveforms.
System Description
The microphone output is sent to the first channel of the amplifier. The gain for channel one is
adjusted appropriately, and all other channel levels are set to zero. The amplifier's output (tape
out) is sent to the microphone jack on the computer's sound card. Recording is initiated from
the Matlab GUI, Record, as described in the program's documentation.
Data flow: Sound goes from the microphone to the amplifier and then to the computer.
References
[1] G. Fant, Acoustic Theory of Speech Production, Mouton & Co., The Hague, 1970.
[2] K.N. Stevens and A.S. House, "An Acoustic Theory of Vowel Production and Some of its
Implications," Journal of Speech and Hearing Research, 4:303-320, 1961.
[3] D.G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley & Sons, Inc., New
York, 2000.
[5] B. Gold and N. Morgan, Speech and Audio Signal Processing: Processing and Perception of
Speech and Music, John Wiley & Sons, Inc., New York, 2000.
[7] J.H.L. Hansen, Slides for ECEN-5022 Speech Processing & Recognition, University of
Colorado Boulder, 2000, http://cslr.colorado.edu/classes/ECEN5022/.