Module1 SSP
ECE3028
BY
DR. K. GOWRI
ASSISTANT PROFESSOR,
DEPARTMENT OF ECE,
PRESIDENCY UNIVERSITY,
BANGALORE
Module 1: Fundamentals of speech signal
production
Fig. 17 Plots of simulated glottal volume velocity flow and radiated pressure at
the lips at the beginning of voicing.
The vocal cords – contd…
• Figure 17 shows plots from a simulation of the glottal volume velocity air flow
(upper plot) and the resulting sound pressure at the mouth for the first 30 msec
of a voiced sound (such as a vowel).
• The cycle of opening and closing of the vocal cords is clearly seen in the glottal
volume velocity flow.
• Notice that the first 15 msec (or so) is a period of buildup in the glottal flow
(top); the resulting pressure waveform at the mouth (bottom) shows the same
buildup before it begins to look like a quasi-periodic signal.
• This transient behavior at the onset (and also termination) of voicing is a
source of some difficulty in algorithms for deciding exactly when voicing
begins and ends, and in estimating parameters of the speech signal during this
buildup period.
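The difficulty described above can be illustrated with a small sketch. It is not the simulation behind Figure 17: the signal is an assumed stand-in (a 100 Hz sinusoid whose amplitude ramps up linearly over 15 msec after 10 msec of silence), and the sampling rate, frame length, and thresholds are all illustrative choices. A simple short-time energy detector then reports a different voicing onset for each threshold.

```python
import math

FS = 8000          # sampling rate in Hz (assumed)
F0 = 100           # pitch in Hz (assumed)
BUILDUP_MS = 15    # buildup duration, taken from the text


def synth_voicing_onset(voiced_ms=30, silence_ms=10):
    """Silence followed by a periodic signal whose amplitude ramps up."""
    n_ramp = FS * BUILDUP_MS // 1000
    signal = [0.0] * (FS * silence_ms // 1000)
    for n in range(FS * voiced_ms // 1000):
        gain = min(1.0, n / n_ramp)   # linear buildup, then steady state
        signal.append(gain * math.sin(2 * math.pi * F0 * n / FS))
    return signal


def short_time_energy(signal, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def detect_onset_ms(signal, threshold, frame_ms=5):
    """Start time (msec) of the first frame whose energy exceeds threshold."""
    frame_len = FS * frame_ms // 1000
    for k, energy in enumerate(short_time_energy(signal, frame_len)):
        if energy > threshold:
            return k * frame_ms
    return None


sig = synth_voicing_onset()
for thr in (0.001, 0.05, 0.3):
    print(f"threshold {thr}: onset detected at {detect_onset_ms(sig, thr)} msec")
```

Voicing actually starts at 10 msec in this toy signal, but only the lowest threshold finds it there; higher thresholds place the onset later, inside the buildup period.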
The vocal cords – contd…
Fig.19 Schematized
diagram of the vocal
apparatus.
The vocal cords – contd…
• The muscle force from the chest muscles pushes air out of the lungs and then
through the bronchi and the trachea to the vocal cords.
• The vocal tract is the combination of the larynx tube, the pharynx cavity, and
the mouth (oral cavity).
• If the vocal cords are tensed, the air flow causes them to vibrate, producing
puffs of air at a quasi-periodic rate, which excite the vocal tract and/or nasal
cavity, producing “voiced” or quasi-periodic speech sounds, such as steady
state vowel sounds, which radiate from the mouth and/or nose.
• If the vocal cords are relaxed, then the vocal cord membranes are spread
apart and the air flow from the lungs continues unimpeded through the
vocal tract until it hits a constriction in the vocal tract.
• If the constriction is only partial, the air flow may become turbulent, thereby
producing so-called “unvoiced” sounds (such as the initial sound in the word
/see/, or the word /shout/).
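The two excitation types above can be sketched as idealized source signals. This is a toy model under stated assumptions: the quasi-periodic puffs of air are reduced to one unit pulse per pitch period, the turbulent flow is reduced to Gaussian noise, and the sampling rate and pitch are illustrative. The zero-crossing rate is used here only as a simple way to tell the two sources apart.

```python
import random

FS = 8000  # sampling rate in Hz (assumed)


def voiced_excitation(n_samples, f0=100):
    """Quasi-periodic source idealized as one unit pulse per pitch period."""
    period = FS // f0
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def unvoiced_excitation(n_samples, seed=0):
    """Turbulent source idealized as zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(signal) - 1)


voiced = voiced_excitation(800)
unvoiced = unvoiced_excitation(800)
print("voiced ZCR:  ", zero_crossing_rate(voiced))
print("unvoiced ZCR:", zero_crossing_rate(unvoiced))
```

The pulse train never changes sign, so its zero-crossing rate is zero, while the noise source changes sign on roughly half of its samples.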
The vocal cords – contd…
• If the constriction is total, pressure builds up behind the total constriction.
When the constriction is released, the pressure is suddenly and abruptly
released, causing a brief transient sound, such as occurs at the beginning of the
words /put/, /take/, or /kick/.
• Again the sound pressure variations at the mouth and/or nose constitute the
speech signal that is produced by the speech generation mechanism.
• The vocal tract and nasal tract are shown in Figure 19 as tubes of non-uniform
cross-sectional area laid out along a straight line.
• In reality, the vocal tract bends at almost a right angle between the larynx
and pharynx.
• As sound, generated as discussed above, propagates down these tubes, the
frequency spectrum is shaped by the frequency selectivity of the tube.
• This effect is very similar to the resonance effects observed with organ pipes or
wind instruments.
The vocal cords – contd…
• In the context of speech production, the resonance frequencies of the vocal
tract tube are called formant frequencies or simply formants.
• The formant frequencies depend upon the shape and dimensions of the vocal
tract; each shape is characterized by a set of formant frequencies.
• Different sounds are formed by varying the shape of the vocal tract. Thus, the
spectral properties of the speech signal vary with time as the vocal tract shape
varies.
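The dependence of the formants on vocal tract dimensions can be illustrated with the simplest textbook idealization, which is an assumption added here and not part of the slides: a uniform lossless tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c / (4L). The tube length (17.5 cm) and speed of sound (350 m/s) are commonly quoted illustrative values.

```python
def uniform_tube_formants(length_m=0.175, speed_of_sound=350.0, count=3):
    """Resonances (Hz) of a uniform lossless tube closed at one end.

    F_k = (2k - 1) * c / (4 * L), for k = 1, 2, ..., count
    """
    return [
        (2 * k - 1) * speed_of_sound / (4 * length_m)
        for k in range(1, count + 1)
    ]


# A 17.5 cm tube gives formants near 500, 1500, and 2500 Hz.
print([round(f) for f in uniform_tube_formants()])

# A shorter tube (hypothetical 14 cm tract) has higher formants.
print([round(f) for f in uniform_tube_formants(length_m=0.14)])
```

A real vocal tract has a non-uniform area function, so its formants are not evenly spaced like this; the sketch only shows that changing the tube dimensions moves the resonance frequencies.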
Speech Properties and the Speech Waveform
• Speech is a sequence of ever-changing sounds.
• The properties of the speech signal are highly dependent on the sounds that
are produced in order to encode the content of the implicit message.
• They are also highly dependent on the context in which the sounds are
produced; i.e., the sounds that occur before and after the current sound. This
effect is called speech sound co-articulation.
• The state of the vocal cords and the positions, shapes, and sizes of the various
articulators (lips, teeth, tongue, jaw, velum) all change slowly over time,
thereby producing the desired speech sounds.
Speech Properties and the Speech Waveform- contd…
• The area function (the cross-sectional area of the vocal tract as a function of
distance along the tract) for a particular vowel is determined primarily by the
position of the tongue, but the positions of the jaw, lips, and, to a small extent,
the velum also influence the resulting sound.
Fig. 22 Schematic vocal tract configurations for the
vowels /i/, /æ/, /a/, and /u/ (/IY/, /AE/, /AA/, and /UW/
in ARPAbet).
Semivowels
• The group of sounds consisting of /w/, /l/, /r/, and /y/ (/W/, /L/, /R/, and /Y/)
is called the semivowels because of their vowel-like nature.
• The semivowels /w/ and /y/ are often called glides.
• The semivowels /r/ and /l/ are often called liquids.
• The semivowels are characterized by a constriction in the vocal tract, but one
at which no turbulence is created.
• This is due to the fact that the tongue tip generally forms the constriction for
the semivowels, and therefore the constriction does not totally block air flow
through the vocal tract.
• The semivowels have properties similar to corresponding vowels, but with
more pronounced articulations.
Fig.23 Articulatory configurations for the semivowels of
American English.
Semivowels-Contd…
• The vowels most closely corresponding to the four
semivowels are the following:
• the semivowel /w/ is closest to the vowel /u/ (as in boot)
• the semivowel /y/ is closest to the vowel /i/ (as in beet)
• the semivowel /r/ is closest to the vowel /ɝ/ (as in bird)
• the semivowel /l/ is closest to the vowel /o/ (as in boat)
Nasals
• The nasal consonants /m/, /n/, and /ŋ/ (/M/, /N/, and /NX/) are produced with
glottal excitation (hence these are voiced sounds) and the vocal tract totally
constricted at some point along the oral passageway.
• The velum is lowered so that air flows through the nasal tract, with sound
being radiated at the nostrils.
• Thus, the mouth serves as a resonant cavity that traps acoustic energy at
certain natural frequencies.
Fig.23 Articulatory configurations for the nasal
consonants.
Unvoiced Fricatives
• The unvoiced fricatives /f/, /θ/, /s/, and /š/ (/F/, /TH/, /S/,
and /SH/) are produced by exciting the vocal tract by a
steady air flow that becomes turbulent in the region of a
constriction in the vocal tract.
• The location of the constriction serves to determine which
fricative sound is produced.
• As shown in Figure 24,
• for the fricative /f/ the constriction is near the lips;
• for /θ/ it is near the teeth;
• for /s/ it is near the middle of the oral tract; and
• for /š/ it is near the back of the oral tract.
Fig. 24 Articulatory configurations for the unvoiced
fricatives.
Voiced Fricatives
• The voiced fricatives /v/, /ð/, /z/, and /ž/ (/V/, /DH/, /Z/, and /ZH/) are the
counterparts of the unvoiced fricatives /f/, /θ/, /s/, and /š/ (/F/, /TH/, /S/, and
/SH/), respectively, in that the place of constriction for each of the
corresponding phonemes is essentially the same.
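• The voiced fricatives differ from their unvoiced counterparts in that two
excitation sources are involved in their production.
• For voiced fricatives the vocal cords are vibrating, and thus one excitation
source is at the glottis.
• The second source is the turbulent air flow created at the constriction in the
vocal tract, as for the unvoiced fricatives.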
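A minimal sketch of the two-source idea for voiced fricatives, under stated assumptions: the glottal source is idealized as a unit pulse train, the turbulence at the constriction as Gaussian noise, and the two are simply added. The function name, gains, pitch, and sampling rate are hypothetical values chosen for illustration.

```python
import random

FS = 8000  # sampling rate in Hz (assumed)


def mixed_excitation(n_samples, f0=100, glottal_gain=1.0,
                     noise_gain=0.1, seed=0):
    """Sum of a glottal pulse train and turbulence noise (idealized)."""
    rng = random.Random(seed)
    period = FS // f0
    out = []
    for n in range(n_samples):
        glottal = glottal_gain if n % period == 0 else 0.0
        frication = noise_gain * rng.gauss(0.0, 1.0)
        out.append(glottal + frication)
    return out


# Voiced fricative such as /z/: both sources active.
voiced_fricative = mixed_excitation(400)

# Unvoiced counterpart such as /s/: glottal source switched off.
unvoiced_fricative = mixed_excitation(400, glottal_gain=0.0)
```

Setting `glottal_gain` to zero leaves only the noise source, which corresponds to the unvoiced counterpart produced at the same place of constriction.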
Voiced Stops
• The voiced stop consonants /b/, /d/, and /g/ (/B/, /D/, and
/G/) are transient, non-continuant sounds that are produced
by building up pressure behind a total constriction
somewhere in the oral tract and suddenly releasing the
pressure.
• As shown in Figure 25,
• for /b/ the constriction is at the lips;
• for /d/ the constriction is behind the teeth;
• and for /g/ it is near the velum.
Fig. 25 Articulatory configurations for the voiced stop
consonants.
Voiced Stops- Contd…
• During the period when there is a total constriction in the
tract, there is no sound radiated from the lips.
• However, there is often a small amount of low frequency
energy radiated through the walls of the throat (sometimes
called a voice bar).
• This occurs when the vocal cords are able to vibrate even
though the vocal tract is closed at some point.
• Since the stop sounds are dynamical in nature, their
properties are highly influenced by the vowel that follows
the stop consonant.
Unvoiced Stops
• The unvoiced stop consonants /p/, /t/, and /k/ (/P/, /T/, and
/K/) are similar to their voiced counterparts /b/, /d/, and /g/
with one major exception.
• During the period of total closure of the tract, as the
pressure builds up, the vocal cords do not vibrate.
• Thus, following the period of closure, as the air pressure is
released, there is a brief interval of frication (due to sudden
turbulence of the escaping air) followed by a period of
aspiration (steady air flow from the glottis exciting the
resonances of the vocal tract) before voiced excitation
begins.
Affricates and Whisper
• The remaining consonants of American English are the affricates
/č/ and /ǰ/ (/CH/ and /JH/), and the aspirated phoneme /h/
(/HH/).
• The unvoiced affricate /č/ is a dynamical sound that can be
modeled as the concatenation of the stop /t/ and the fricative /š/.
• The voiced affricate /ǰ/ can be modeled as the concatenation of
the stop /d/ and the fricative /ž/.
• Finally, the phoneme /h/ is produced by exciting the vocal tract by
a steady air flow—i.e., without the vocal cords vibrating, but with
turbulent flow being produced at the glottis.
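The consonant classes covered above can be collected into a small lookup table. This is only an organizational sketch: the `Consonant` type and the `voicing_counterpart` helper are hypothetical names, the table is keyed by the ARPAbet symbols used in these notes, and the place strings paraphrase the slides (unvoiced stops are assumed to share the place of their voiced counterparts; semivowels are marked voiced because of their vowel-like nature).

```python
from typing import NamedTuple, Optional


class Consonant(NamedTuple):
    arpabet: str
    manner: str
    voiced: bool
    place: Optional[str]  # constriction location, where the notes give one


CONSONANTS = [
    Consonant("W", "semivowel", True, None),
    Consonant("L", "semivowel", True, None),
    Consonant("R", "semivowel", True, None),
    Consonant("Y", "semivowel", True, None),
    Consonant("M", "nasal", True, None),
    Consonant("N", "nasal", True, None),
    Consonant("NX", "nasal", True, None),
    Consonant("F", "fricative", False, "near the lips"),
    Consonant("TH", "fricative", False, "near the teeth"),
    Consonant("S", "fricative", False, "middle of the oral tract"),
    Consonant("SH", "fricative", False, "back of the oral tract"),
    Consonant("V", "fricative", True, "near the lips"),
    Consonant("DH", "fricative", True, "near the teeth"),
    Consonant("Z", "fricative", True, "middle of the oral tract"),
    Consonant("ZH", "fricative", True, "back of the oral tract"),
    Consonant("B", "stop", True, "lips"),
    Consonant("D", "stop", True, "behind the teeth"),
    Consonant("G", "stop", True, "near the velum"),
    Consonant("P", "stop", False, "lips"),
    Consonant("T", "stop", False, "behind the teeth"),
    Consonant("K", "stop", False, "near the velum"),
    Consonant("CH", "affricate", False, None),   # modeled as T + SH
    Consonant("JH", "affricate", True, None),    # modeled as D + ZH
    Consonant("HH", "aspirate", False, "glottis"),
]


def voicing_counterpart(arpabet):
    """Return the phoneme with the same manner and place but opposite voicing."""
    src = next(c for c in CONSONANTS if c.arpabet == arpabet)
    for c in CONSONANTS:
        if (c.manner == src.manner and c.place == src.place
                and c.voiced != src.voiced):
            return c.arpabet
    return None


print(voicing_counterpart("S"))   # voiced counterpart of /s/
print(voicing_counterpart("B"))   # unvoiced counterpart of /b/
```

Nasals, semivowels, and /HH/ have no counterpart in this table, so the helper returns `None` for them.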
THE SPEECH CHAIN - from production to perception (Fig.
25).
THE SPEECH CHAIN- Contd…
• Levels of representation between the speaker and the listener.
1. the linguistic level - where the basic sounds of the communication are
chosen to express some thought or idea
2. the physiological level - where the vocal tract components produce the
sounds associated with the linguistic units of the utterance
3. the acoustic level - where sound is released from the lips and nostrils and
transmitted to both the speaker, as feedback, and to the listener
4. the physiological level - where the sound is analyzed by the ear and the
auditory nerves
5. the linguistic level - where the speech is perceived as a sequence of linguistic
units and understood in terms of the ideas being communicated
THE SPEECH CHAIN- contd…
• Our focus is on the listener (or auditory) side of the speech chain.
Fig.26- Block diagram of processes for conversion from acoustic wave to perceived sound.