
SPEECH SIGNAL PROCESSING

ECE3028

BY
DR. K. GOWRI
ASSISTANT PROFESSOR,
DEPARTMENT OF ECE,
PRESIDENCY UNIVERSITY,
BANGALORE
Module 1: Fundamentals of speech signal production

Introduction to Speech, The Mechanism of speech production, Acoustic phonetics: vowels, diphthongs, semivowels, nasals, fricatives, stops and affricates
INTRODUCTION
• The fundamental purpose of speech is communication, i.e., the
transmission of messages.
• According to Shannon’s information theory,
❖ a message, represented as a sequence of discrete symbols, can be quantified by its information content in bits, and
❖ the rate of transmission of information is measured in bits/second (bps).
• Speech signal: the fundamental analog form of the message is an acoustic waveform.
Speech signal → electrical form → acoustic form (microphone, headphone, and headset)
Fig.1 Speech signal with phonetic labels for the text message “Should we chase.”
Fig.2 The Speech Chain: from message, to speech signal, to understanding.
The speech chain
• Fig shows the complete process of producing and perceiving speech from the
formulation of a message in the brain of a talker, to the creation of the
speech signal, and finally to the understanding of the message by a listener.
• Message – has different representations (e.g., English text).
• 1. Talker - converts the text into a symbolic representation of the sequence of sounds corresponding to the spoken version of the text.
• 2. Language code generator - converts text symbols to phonetic symbols (along with stress and durational information) that describe the basic sounds of a spoken version of the message and the manner (i.e., the speed and emphasis) in which the sounds are intended to be produced.
The speech chain
• The segments of the waveform of Figure 1 are labeled with phonetic symbols
using a computer-keyboard-friendly code called ARPAbet.
should we chase - SH UH D — W IY — CH EY S.
• 3. Neuro-muscular controls- the set of control signals that direct the neuro-
muscular system to move the speech articulators, namely the tongue, lips,
teeth, jaw and velum.
• It is consistent with the sounds of the desired spoken message and with the
desired degree of emphasis.
• Neuro-muscular controls - cause the vocal tract articulators to move in a
prescribed manner in order to create the desired sounds.
• 4. Speech Production process: “vocal tract system”- that physically creates
the necessary sound sources and the appropriate vocal tract shapes over time
so as to create an acoustic waveform.
The speech chain
• To determine the rate of information flow during speech production, assume that there are about 32 symbols (letters) in the language (in English there are 26 letters, but if we include simple punctuation we get a count closer to 32 = 2^5 symbols).
• The rate of speaking for most people is about 10 symbols per second.
• Assuming independent letters as a simple approximation, the information rate is 10 symbols/s × 5 bits/symbol → 50 bps.
• At the second stage of the process, where the text representation is converted into phonemes and prosody (e.g., pitch and stress) markers, the information rate is estimated to increase by a factor of 4 to about 200 bps.
The speech chain
• The ARPAbet phonetic symbol set used to label the speech sounds in Figure 1 contains approximately 64 = 2^6 symbols, or about 6 bits/phoneme → 8 phonemes in approximately 600 ms. Therefore 8 × 6/0.6 = 80 bps.
• Additional information required to describe prosodic features (duration, pitch, loudness) adds roughly another 100 bps to this estimate.
• In speech production, the first two stages are discrete, so we can readily estimate the rate of information flow with some simple assumptions.
• At the next stage - articulatory motion - the representation becomes continuous; we estimate the spectral bandwidth, then sample and quantize these signals to obtain equivalent digital signals → this gives the total data rate.
• Estimates of bandwidth and required accuracy suggest that the total data rate of the sampled articulatory control signals is about 2000 bps.
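These estimates are easy to reproduce. Below is a minimal Python sketch of the arithmetic above; the symbol counts and timings (32 text symbols, 10 symbols/s, 64 phonemes, 8 phonemes in 600 ms) are the slide's stated assumptions.

```python
import math

# Text stage: ~32 = 2^5 symbols, spoken at ~10 symbols/s (slide assumptions).
bits_per_symbol = math.log2(32)            # 5 bits
text_rate = 10 * bits_per_symbol           # 50 bps

# Phoneme stage: ~64 = 2^6 ARPAbet symbols, 8 phonemes in ~600 ms.
bits_per_phoneme = math.log2(64)           # 6 bits
phoneme_rate = 8 * bits_per_phoneme / 0.6  # 80 bps

print(f"text: {text_rate:.0f} bps, phonemes: {phoneme_rate:.0f} bps")
```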
The speech chain
• The digitized speech waveform requires a much higher data rate than the message it carries.
• Digitized speech waveform at the end of the speech production part of the speech chain → 64,000 to 700,000 bps.
• Telephone quality - 8000 samples/s, 0–4 kHz bandwidth, 8 bits/sample; therefore, a bit rate of 64,000 bps.
• The speech waveform reaches the listener by acoustic wave propagation and is robustly decoded by the hearing mechanism.
• Part of the extra data rate encodes emotional state, speech mannerisms, accent, etc.; the rest reflects the inefficiency of simply sampling and finely quantizing analog signals.
• Aim: to obtain a digital representation with a lower data rate than that of the sampled waveform.
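For comparison, the waveform data rate follows directly from the sampling rate and word length. A small sketch, assuming 8 bits/sample log-PCM (implied by the 64,000 bps telephone figure):

```python
fs = 8000                # samples/s for 0-4 kHz telephone bandwidth
bits_per_sample = 8      # implied by the 64,000 bps figure
waveform_rate = fs * bits_per_sample       # 64,000 bps

text_rate = 50           # information rate of the underlying message (bps)
print(f"waveform rate: {waveform_rate} bps, "
      f"~{waveform_rate // text_rate}x the message information rate")
```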
The speech chain
• Left bottom half - The speech perception model shows the series of steps from
capturing speech at the ear to understanding the message encoded in the speech
signal.
• First step- acoustic waveform to a spectral representation.
• This is done within the inner ear by the basilar membrane, which acts as a non-
uniform spectrum analyzer by spatially separating the spectral components of the
incoming speech signal and thereby analyzing them by what amounts to a non-
uniform filter bank.
• The next step - neural transduction -spectral features into a set of sound features -
that can be decoded and processed by the brain.
• The next step - conversion of the sound features into the set of phonemes, words, and sentences associated with the incoming message by a language translation process in the human brain.
• Finally, conversion of the phonemes, words and sentences of the message into an
understanding of the meaning of the basic message.
Applications of Digital Speech Processing
1.Speech Coding
2.Text-to-Speech Synthesis
3.Speech Recognition and Other Pattern Matching
Problems
4.Other Speech Applications
1. Speech Coding (Fig. 3)
2. Text-to-Speech Synthesis (Fig. 4)
3. Speech Recognition and Other Pattern Matching
Problems (Fig. 5)
4. Other Speech Applications (Fig. 5)
General block diagram for application of digital signal
processing to speech signals (Fig.6)
Phonetic Representation of Speech
• Speech can be represented phonetically by a finite set of symbols
called the phonemes of the language.
• The number of phonemes depends upon the language and the
refinement of the analysis.
• For most languages the number of phonemes is between 32 and 64.
• Table 1 also includes some simple examples of ARPAbet
transcriptions of words containing each of the phonemes of English.
Condensed list of ARPAbet phonetic symbols for North American English (Table 1)
Models for Speech Production
• A schematic longitudinal cross-sectional drawing of the human vocal
tract mechanism is given in Figure 7.
Models for Speech Production-contd…
• This diagram highlights the essential physical features of human
anatomy that enter into the final stages of the speech production
process.
• The vocal tract can be viewed as a tube of nonuniform cross-sectional area, with the vocal cords at one end and the mouth opening at the other.
• This tube - serves as an acoustic transmission system for sounds
generated inside the vocal tract.
• For creating nasal sounds like /M/, /N/, or /NG/, a side-branch tube,
called the nasal tract, is connected to the main acoustic branch by
the trapdoor action of the velum.
Models for Speech Production - contd…
• This branch path radiates sound at the nostrils.
• The shape of the vocal tract varies with time due to motions of the
lips, jaw, tongue, and velum.
• Although the actual human vocal tract is not laid out along a straight line, this type of model is a reasonable approximation for the wavelengths of the sounds in speech.
• The sounds of speech are generated in the system of Figure 7 in
several ways.
• 1. Voiced sounds - (vowels, liquids, glides, nasals in Table 1) are
produced when the vocal tract tube is excited by pulses of air
pressure resulting from quasi-periodic opening and closing of the
glottal orifice (opening between the vocal cords).
Models for Speech Production - contd…
• Examples: vowels -/UH/, /IY/, and /EY/,
liquid consonant - /W/
• 2. Unvoiced sounds - are produced by creating a constriction somewhere in
the vocal tract tube and forcing air through that constriction, thereby creating
turbulent air flow, which acts as a random noise excitation of the vocal tract
tube.
• Examples: unvoiced fricative sounds - /SH/ and /S/
• 3. Voiced fricatives - when the vocal tract is partially closed off causing
turbulent flow due to the constriction, at the same time allowing quasi-
periodic flow due to vocal cord vibrations.
• Examples: voiced fricatives - /V/, /DH/, /Z/, and /ZH/
Models for Speech Production - contd…
• 4. Plosive sounds and affricates: closing off air flow, allowing
pressure to build up behind the closure, and then abruptly releasing
the pressure.
• Examples: Plosive sounds - /P/, /T/, and /K/
affricates - /CH/
• Vocal tract tube- acts as an acoustic transmission line with certain
vocal tract shape-dependent resonances that tend to emphasize
some frequencies of the excitation relative to others.
Models for Speech Production - contd…
• The speech signal varies at the phoneme rate, which is on the order of 10
phonemes per second.
• The detailed time variations of the speech waveform are at a much higher
rate.
• That is, the changes in vocal tract configuration occur relatively slowly
compared to the detailed time variation of the speech signal.
• The sounds created in the vocal tract are shaped in the frequency domain by
the frequency response of the vocal tract.
• The resonance frequencies resulting from a particular configuration of the
articulators are instrumental in forming the sound corresponding to a given
phoneme- formant frequencies.
Models for Speech Production - contd…
• In summary, the fine structure of the time waveform is created by
the sound sources in the vocal tract, and the resonances of the
vocal tract tube shape these sound sources into the phonemes.
• A discrete-time system model for the production of a sampled speech signal is given in Fig. 8.
• The excitation generator on the left simulates the different modes
of sound generation in the vocal tract.
• Samples of a speech signal are assumed to be the output of the
time-varying linear system.
Fig.8 Source/system model for a speech signal
Source/system model for a speech signal
• In general such a model is called a source/system model of speech production.
• The short-time frequency response of the linear system simulates the
frequency shaping of the vocal tract system.
• Since the vocal tract changes shape relatively slowly, it is reasonable to assume that the linear system response does not vary over time intervals on the order of 10 ms or so.
Source/system model for a speech signal
• Characterizing the discrete-time linear system by a system function of the form,

$$H(z) = G\,\frac{\prod_{k=1}^{M}\left(1 - d_k z^{-1}\right)}{\prod_{k=1}^{N}\left(1 - c_k z^{-1}\right)}$$

• where the gain G and the pole and zero locations (c_k and d_k) are the filter parameters, which change at a rate on the order of 50–100 times/s.
• Some of the poles (c_k) of the system function lie close to the unit circle and create resonances to model the formant frequencies.
Source/system model for a speech signal
• In detailed modeling of speech production, it is sometimes useful to employ
zeros (dk) of the system function to model nasal and fricative sounds.
• For voiced speech the excitation to the linear system is a quasi-periodic
sequence of discrete (glottal) pulses that look very much like those shown in
the righthand half of the excitation signal waveform in Figure 8.
• The fundamental frequency of the glottal excitation determines the perceived
pitch of the voice.
• The individual finite-duration glottal pulses have a lowpass spectrum that
depends on a number of factors.
• Therefore, the periodic sequence of smooth glottal pulses has a harmonic line
spectrum with components that decrease in amplitude with increasing
frequency.
Source/system model for a speech signal
• For unvoiced speech, the linear system is excited by a random number
generator that produces a discrete-time noise signal with flat spectrum as
shown in the left-hand half of the excitation signal.
• This models speech as the output of a slowly time-varying digital filter, capturing the nature of the voiced/unvoiced distinction in speech production.
• By assuming that the properties of the speech signal (and the model) are
constant over short time intervals, it is possible to
compute/measure/estimate the parameters of the model by analyzing short
blocks of samples of the speech signal.
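As an illustration of this source/system model, the sketch below (assuming NumPy and SciPy are available) drives a fixed all-pole filter with either a quasi-periodic impulse train or white noise. The formant frequencies, bandwidths, sampling rate, and fundamental frequency are assumed values chosen for the example, not parameters from the slides.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                      # sampling rate (assumed)
formants = [500, 1500, 2500]   # assumed formant (resonance) frequencies, Hz
bw = 100                       # assumed bandwidth of each formant, Hz

# All-pole vocal tract model: one complex-conjugate pole pair per formant,
# with pole radius set by the bandwidth and pole angle by the frequency.
a = np.array([1.0])
for f in formants:
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * f / fs
    a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r * r])

n = fs // 2                                # half a second of samples
voiced_exc = np.zeros(n)
voiced_exc[:: fs // 100] = 1.0             # impulse train, 100 Hz fundamental
unvoiced_exc = np.random.randn(n)          # flat-spectrum noise excitation

vowel_like = lfilter([1.0], a, voiced_exc)        # quasi-periodic output
fricative_like = lfilter([1.0], a, unvoiced_exc)  # noise-like output
```

The poles sit close to the unit circle (r ≈ 0.96 here), so the filter's resonances emphasize the formant frequencies of the excitation, exactly as described above.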
Fig.9 Schematic view of the human ear (inner and
middle structures enlarged)
Human Ear
• Figure 9 shows a schematic view of the human ear showing the three distinct
sound processing sections
• The outer ear:
Consisting of the pinna, which gathers sound and conducts it through the external canal
to the middle ear;
• The middle ear:
The middle ear beginning at the tympanic membrane, or eardrum.
Includes three small bones, 1. The malleus (also called the hammer),
2. The incus (also called the anvil) and
3. The stapes (also called the stirrup),
which perform a transduction from acoustic waves to mechanical pressure waves;
• The inner ear:
consists of the cochlea and the set of neural connections to the auditory nerve, which
conducts the neural signals to the brain.
Fig.10 Schematic model of the auditory mechanism
Auditory mechanism
• The acoustic wave is transmitted from the outer ear to the middle ear, where the eardrum and bone structures convert the sound wave to mechanical vibrations.
• These are transferred to the basilar membrane inside the cochlea (inner ear).
• The basilar membrane vibrates in a frequency-selective manner along its extent and thereby performs a rough (non-uniform) spectral analysis of the sound.
• A set of inner hair cells converts motion along the basilar membrane to neural activity.
• This produces an auditory nerve representation in both time and
frequency.
Auditory mechanism
• The processing at higher levels in the brain can be viewed as a sequence of central processing stages with multiple representations, followed by some type of pattern recognition.
• We can only postulate the mechanisms used by the human brain to perceive
sound or speech.
Perception of Loudness
• A key factor in the perception of speech and other sounds is loudness.
• Loudness is a perceptual quality that is related to the physical property of
sound pressure level.
• Loudness is quantified by relating the actual sound pressure level of a pure tone (in dB) to the perceived loudness of the same tone (in phons) over the range of human hearing (20 Hz–20 kHz).
Fig. 11 Loudness level for human hearing.
Critical Bands
• The non-uniform frequency analysis performed by the basilar membrane behaves like a bank of bandpass filters whose frequency responses become increasingly broad with increasing frequency.

Fig.12 Relation between subjective pitch and frequency of a pure tone
Critical Bands
• Center frequency < 500 Hz → effective bandwidth ≈ 100 Hz; center frequency > 500 Hz → effective bandwidth ≈ 20% of the center frequency.
• An equation that fits empirical measurements over the auditory range is

$$\Delta f_c = 25 + 75\left[1 + 1.4\left(\frac{f_c}{1000}\right)^2\right]^{0.69}\ \text{Hz}$$

• where Δf_c is the critical bandwidth associated with the center frequency f_c.
• Approximately 25 critical band filters span the range from 0 to 20 kHz.
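A quick numerical check of this formula; the step-by-one-bandwidth counting scheme below is just one simple way to tally the bands:

```python
def critical_bandwidth(fc):
    """Empirical critical bandwidth (Hz) at center frequency fc (Hz)."""
    return 25 + 75 * (1 + 1.4 * (fc / 1000.0) ** 2) ** 0.69

# Count critical bands from 0 to 20 kHz by stepping one bandwidth at a time.
f, bands = 0.0, 0
while f < 20000:
    f += critical_bandwidth(f)
    bands += 1
print(bands)   # roughly 25, matching the slide
```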
Pitch Perception
• Most musical sounds as well as voiced speech sounds have a periodic
structure when viewed over short time intervals, and such sounds are
perceived by the auditory system as having a quality known as pitch.
• Pitch is a subjective attribute of sound that is related to the fundamental
frequency of the sound, which is a physical attribute of the acoustic waveform.
• The relationship between pitch (measured on a nonlinear frequency scale called the mel-scale) and frequency of a pure tone is approximated by the equation

$$\text{Pitch in mels} = 1127\,\log_e\!\left(1 + \frac{f}{700}\right)$$
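In code, this pitch-frequency mapping is a one-liner; a small sketch:

```python
import math

def hz_to_mel(f):
    """Perceived pitch in mels for a pure tone of frequency f (Hz)."""
    return 1127.0 * math.log(1.0 + f / 700.0)

# Roughly linear below 1 kHz, logarithmic above: 1000 Hz maps to ~1000 mels.
print(hz_to_mel(1000.0))
```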
Fig.13 Illustration of effects of masking.
Auditory Masking
• The phenomenon of critical band auditory analysis can be explained intuitively
in terms of vibrations of the basilar membrane - masking.
• Masking occurs when one sound makes a second superimposed sound
inaudible.
The Mechanism of Speech Production

Fig. 13 Sagittal plane X-ray of the human vocal apparatus
The Mechanism of Speech Production
• The vocal tract, outlined by the dotted lines in Figure 13, begins at the opening
between the vocal cords, or glottis, and ends at the lips.
• The vocal tract thus consists of the pharynx (the connection from the
esophagus to the mouth) and the mouth or oral cavity.
• In the average male, the total length of the vocal tract is about 17–17.5 cm.
• The cross-sectional area of the vocal tract, which is determined by the positions
of the tongue, lips, jaw, and velum, varies from zero (complete closure) to
about 20 cm^2.
• The nasal tract begins at the velum and ends at the nostrils.
• The velum is a trapdoor-like mechanism at the back of the mouth cavity.
• When the velum is lowered, the nasal tract is acoustically coupled to the vocal
tract to produce the nasal sounds of speech.
(a) Example of a typical vocal tract MR (magnetic resonance) image showing contours
of interest;
(b) a vocal tract schematic showing size and shape parameters.
Fig.15 Schematic view of the human vocal tract
Human vocal tract
• A schematic cross-sectional view of the human vocal system is shown in Figure
15.
• The parts of the body involved in speech production include
1. The lungs and chest cavity
• As the source of air to excite the vocal tract and as the source of pressure to force the air
from the lungs.
2. The trachea or windpipe
• which conducts the air from the lungs to the vocal cords and vocal tract.
3. The vocal cords
• which vibrate when tensed and excited by air flow.
Human vocal tract –Contd…
4. The vocal tract consisting of the pharynx
• the throat cavity
5. The mouth cavity
• including the tongue, lips, jaw, and mouth
6. Nasal cavity
• depending on the position of the velum
The speech production mechanism for voiced sounds
such as vowels works as follows:
• Air enters the lungs via normal breathing and no speech is produced (generally)
on intake;
• As air is expelled from the lungs via the trachea, the tensed vocal cords within
the larynx are made to vibrate by Bernoulli-Law variations of air pressure in
the glottal opening;
• Air flow is chopped up by the opening and closing of the glottal orifice into
quasiperiodic pulses;
• These pulses are frequency-shaped when passing through the pharynx (the
throat cavity), the mouth cavity, and possibly the nasal cavity.
• The positions of the various articulators (jaw, tongue, velum, lips, and mouth)
determine the sound that is produced.
The vocal cords

Fig. 16 The vocal cords: (top) an artist’s rendering from the top; (bottom) a schematic cross-sectional view.
The vocal cords – contd…
• Figure 16 shows two views of the vocal cords,
1. An artist’s top view- looking down into the vocal cords
2. A schematic longitudinal cross-sectional view - showing the path for air flow
from the lungs through the vocal cords and through the vocal tract.
• The top view shows the two vocal cords (VC, also called vocal folds; literally membranes), along with:
1. AC - arytenoid cartilage
2. TC - thyroid cartilage
• When the vocal cords are tensed, they form a relaxation oscillator.
• Air pressure builds up behind the closed vocal cords until they are
eventually blown apart.
The vocal cords – contd…
• Air then flows through the orifice (Bernoulli’s Law) , the air pressure drops,
causing the vocal cords to return to the closed position.
• The cycle of building up pressure, blowing apart the vocal cords, and then
closing shut is repeated quasi-periodically as air continues to be forced out of
the lungs.
• The rate of opening and closing is controlled mainly by the tension in the vocal
cords.
The vocal cords – contd…

Fig. 17 Plots of simulated glottal volume velocity flow and radiated pressure at the lips at the beginning of voicing.
The vocal cords – contd…
• Figure 17 shows plots from a simulation of the glottal volume velocity air flow
(upper plot) and the resulting sound pressure at the mouth for the first 30 msec
of a voiced sound (such as a vowel).
• The cycle of opening and closing of the vocal cords is clearly seen in the glottal
volume velocity flow.
• Notice that the first 15 msec (or so) show a period of buildup in the glottal flow (top); the resulting pressure waveform at the mouth (bottom) also shows a buildup until it begins to look like a quasi-periodic signal.
• This transient behavior at the onset (and also termination) of voicing is a
source of some difficulty in algorithms for deciding exactly when voicing
begins and ends, and in estimating parameters of the speech signal during this
buildup period.
The vocal cords – contd…

Fig. 18 The artificial larynx, a demonstration of an artificial method for generating vocal cord excitation for speech production.
The vocal cords – contd…
• Pathologies of the larynx sometimes lead to its complete removal, thereby
depriving a person of the means to generate natural voiced speech.
• The operation of the vocal cords in speech production was the basis for the design of the artificial larynx shown in Figure 18.
• The artificial larynx was designed and built by AT&T as an aid to patients who had their larynx surgically removed.
• It is a vibrating diaphragm that can produce a quasi-periodic excitation sound that can be coupled directly into the human vocal tract by holding the artificial larynx tightly against the neck, as shown by the human user in Figure 18.
• The artificial larynx does not cause the vocal cords to open and close ;
however, its vibrations are transmitted through the soft tissue of the neck to the
pharynx where the air flow is amplitude modulated.
The vocal cords – contd…
• By using the on-off control, along with a “rate of vibration” control, an
accomplished human user can create an appropriate excitation signal that
essentially mimics the one produced by the vibrating vocal cords, thereby
enabling the user to create (albeit somewhat buzzy sounding) speech for
communication with other humans.

Fig.19 Schematized diagram of the vocal apparatus.
The vocal cords – contd…
• The muscle force from the chest muscles pushes air out of the lungs and then
through the bronchi and the trachea to the vocal cords.
• Vocal tract - the combination of larynx tube, pharynx cavity, and the mouth.
• If the vocal cords are tensed, the air flow causes them to vibrate, producing
puffs of air at a quasi-periodic rate, which excite the vocal tract and/or nasal
cavity, producing “voiced” or quasi-periodic speech sounds, such as steady
state vowel sounds, which radiate from the mouth and/or nose.
• If the vocal cords are relaxed, then the vocal cord membranes are spread
apart and the air flow from the lungs continues unimpeded through the
vocal tract until it hits a constriction in the vocal tract.
• If the constriction is only partial, the air flow may become turbulent, thereby
producing so-called “unvoiced” sounds (such as the initial sound in the word
/see/, or the word /shout/).
The vocal cords – contd…
• If the constriction is total, pressure builds up behind the total constriction.
When the constriction is released, the pressure is suddenly and abruptly
released, causing a brief transient sound, such as occurs at the beginning of the
words /put/, /take/, or /kick/.
• Again the sound pressure variations at the mouth and/or nose constitute the
speech signal that is produced by the speech generation mechanism.
• The vocal tract and nasal tract are shown in Figure 19 as tubes of non-uniform
cross-sectional area laid out along a straight line.
• The vocal tract bends at almost a right angle between the larynx and pharynx.
• As sound, generated as discussed above, propagates down these tubes, the
frequency spectrum is shaped by the frequency selectivity of the tube.
• This effect is very similar to the resonance effects observed with organ pipes or
wind instruments.
The vocal cords – contd…
• In the context of speech production, the resonance frequencies of the vocal
tract tube are called formant frequencies or simply formants.
• The formant frequencies depend upon the shape and dimensions of the vocal
tract; each shape is characterized by a set of formant frequencies.
• Different sounds are formed by varying the shape of the vocal tract. Thus, the
spectral properties of the speech signal vary with time as the vocal tract shape
varies.
Speech Properties and the Speech Waveform
• Speech is a sequence of ever-changing sounds, highly dependent on the sounds that are produced in order to encode the content of the implicit message.
• The properties of the speech signal are highly dependent on the context in which the sounds are produced, i.e., the sounds that occur before and after the current sound. This effect is called speech sound co-articulation.
• The state of the vocal cords and the positions, shapes, and sizes of the various
articulators (lips, teeth, tongue, jaw, velum) all change slowly over time,
thereby producing the desired speech sounds.
Speech Properties and the Speech Waveform- contd…

Fig.20 Example of a speech waveform and its classification into intervals of voicing (V), unvoicing (U), and silence or background signal (S).
Speech Properties and the Speech Waveform- contd…
• Figure 20 shows a waveform plot of 500 msec of a speech signal.
• V- voiced,
• U- unvoiced
• S- silence or break
• Voiced sounds: produced by forcing air through the glottis with the tension of
the vocal cords adjusted so that they vibrate in a relaxation oscillation, thereby
producing quasi-periodic pulses of air that excite the vocal tract, leading to a
quasi-periodic waveform.
• Unvoiced or fricative sounds are generated by forming a partial constriction at
some point in the vocal tract (usually toward the mouth end), and forcing air
through the constriction at a high enough velocity to produce turbulence.
• This creates a broad-spectrum noise source that excites the vocal tract.
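The V/U/S labels of Figure 20 were assigned by inspection, but the signal properties just described (quasi-periodic and strong for voiced, noise-like for unvoiced, weak for silence) suggest a simple automatic heuristic based on short-time energy and zero-crossing rate. This is only an illustrative sketch, not a method from the slides; the frame length and thresholds are assumptions.

```python
import numpy as np

def label_frames(x, fs, frame_ms=25):
    """Crude V/U/S labeling of a waveform x (floats, |x| <= 1 assumed)."""
    n = int(fs * frame_ms / 1000)
    labels = []
    for i in range(0, len(x) - n, n):
        frame = x[i : i + n]
        energy = np.mean(frame ** 2)
        # Zero-crossing rate: sign changes per sample.
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
        if energy < 1e-4:        # assumed silence threshold
            labels.append("S")
        elif zcr > 0.25:         # noise-like: many sign changes -> unvoiced
            labels.append("U")
        else:                    # strong, low-ZCR, quasi-periodic -> voiced
            labels.append("V")
    return labels
```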
Speech Properties and the Speech Waveform- contd…
• Silence or background sounds are identified by their lack of the
characteristics of either voiced or unvoiced sounds and usually occur at the
beginning and end of speech utterances, although intervals of silence often
occur within speech utterances.
• The voiced intervals are readily identified as the quasiperiodic waveform
regions in Figure 20.
• Knowing the linguistic (phonetic) transcription of this utterance, we can
segment the waveform into the constituent sounds and syllables.
e.g., “Should we chase.”
• The variation of pitch period over time is often called the speech rhythm since
it enables humans to form questions (by making the period fall at the end of a
sentence), or make declarative statements, etc.
Table 2 shows typical ranges (from minimum to maximum pitch periods),
along with the average pitch period for male, female, and child speakers
Fig.21 Phonemes in American English.
39 sounds
• 11 vowels
• 4 diphthongs
• 4 semivowels
• 3 nasal consonants
• 6 voiced and unvoiced stop consonants
• 8 voiced and unvoiced fricatives
• 2 affricate consonants
• 1 whispered sound.
• Each of the phonemes in Figure 21 can be classified as
1. A continuant, or
• Continuant sounds are produced by a fixed (non-time-varying) vocal tract configuration
excited by the appropriate source.
• The class of continuant sounds includes the vowels, the fricatives (both unvoiced and
voiced), and the nasals.
2. A non-continuant sound.
• The remaining sounds (diphthongs, semivowels, stops, and affricates) are produced by a
changing (time-varying) vocal tract configuration.
Vowels
• Vowels generally have the longest duration in natural speech, and they are the
most well defined of the sounds of the language.
• They can be held indefinitely, e.g., while singing.
• Although vowels play a major role in spoken language, they carry very little
linguistic information about the orthography of the sentence that is spoken.
• There are some languages whose orthography does not include any vowels,
e.g., Arabic and Hebrew.
• As an example, consider the two sentences shown below, the first with the
vowel orthography removed, and the second with the consonant orthography
removed.
Vowels –contd….
• (all vowels deleted) Th_y n_t_d s_gn_f_c_nt _mpr_v_m_nts _n th_ c_mp_ny’s _m_g_, s_p_rv_s_ _n _nd m_n_g_m_nt.

• (all consonants deleted) A_ _i_u_e_ _o_a_ _ _a_ _ _a_e_ e_ _e_ _ia_ _ _ _ _e _a_e, _i_ _ _ _e _ _o_e_ o_ o_ _u_a_io_a_ e_ _ _o_ee_ __i_____ _e__ea_i_ _.

• The area function for a particular vowel is determined primarily by the position
of the tongue, but the positions of the jaw, lips, and, to a small extent, the
velum also influence the resulting sound.
Fig. 22 Schematic vocal tract configurations for the
vowels /i/, /æ/, /a/, and /u/ (/IY/, /AE/, /AA/, and /UW/
in ARPAbet).
Semivowels
• The group of sounds consisting of /w/, /l/, /r/, and /y/ (/W/, /L/, /R/, and /Y/)
is called the semivowels because of their vowel-like nature.
• The semivowels /w/ and /y/ are often called glides.
• The semivowels /r/ and /l/ are often called liquids.
• The semivowels are characterized by a constriction in the vocal tract, but one
at which no turbulence is created.
• This is due to the fact that the tongue tip generally forms the constriction for
the semivowels, and therefore the constriction does not totally block air flow
through the vocal tract.
• The semivowels have properties similar to corresponding vowels, but with
more pronounced articulations.
Fig.23 Articulatory configurations for the semivowels of
American English.
Semivowels-Contd…
• The vowels most closely corresponding to the four
semivowels are the following:
• the semivowel /w/ is closest to the vowel /u/ (as in boot)
• the semivowel /y/ is closest to the vowel /i/ (as in beet)
• the semivowel /r/ is closest to the vowel /ɝ/ (as in bird)
• the semivowel /l/ is closest to the vowel /o/ (as in boat)
Nasals
• The nasal consonants /m/, /n/, /N/ (/M/, /N/, and /NX/) are produced with
glottal excitation (hence these are voiced sounds) and the vocal tract totally
constricted at some point along the oral passageway.
• The velum is lowered so that air flows through the nasal tract, with sound
being radiated at the nostrils.
• Thus, the mouth serves as a resonant cavity that traps acoustic energy at
certain natural frequencies.
Fig.23 Articulatory configurations for the nasal
consonants.
Unvoiced Fricatives
• The unvoiced fricatives /f/, /T/, /s/, and /š/ (/F/, /TH/, /S/,
and /SH/) are produced by exciting the vocal tract by a
steady air flow that becomes turbulent in the region of a
constriction in the vocal tract.
• The location of the constriction serves to determine which
fricative sound is produced.
• As shown in Figure 24,
• for the fricative /f/ the constriction is near the lips;
• for /T/ it is near the teeth;
• for /s/ it is near the middle of the oral tract; and
• for /š/ it is near the back of the oral tract.
Fig. 24 Articulatory configurations for the unvoiced
fricatives.
Voiced Fricatives
• The voiced fricatives /v/, /D/, /z/, and /ž/ (/V/, /DH/, /Z/, and /ZH/) are the
counterparts of the unvoiced fricatives /f/, /T/, /s/, and /š/ (/F/, /TH/, /S/, and
/SH/), respectively, in that the place of constriction for each of the
corresponding phonemes is essentially the same.
• The voiced fricatives differ from their unvoiced counterparts in that two
excitation sources are involved in their production.
• For voiced fricatives the vocal cords are vibrating, and thus one excitation
source is at the glottis.
Voiced Stops
• The voiced stop consonants /b/, /d/, and /g/ (/B/, /D/, and
/G/) are transient, non-continuant sounds that are produced
by building up pressure behind a total constriction
somewhere in the oral tract and suddenly releasing the
pressure.
• As shown in Figure 25,
• for /b/ the constriction is at the lips;
• for /d/ the constriction is back of the teeth;
• and for /g/ it is near the velum.
Fig. 25 Articulatory configurations for the voiced stop
consonants.
Voiced Stops- Contd…
• During the period when there is a total constriction in the
tract, there is no sound radiated from the lips.
• However, there is often a small amount of low frequency
energy radiated through the walls of the throat (sometimes
called a voice bar).
• This occurs when the vocal cords are able to vibrate even
though the vocal tract is closed at some point.
• Since the stop sounds are dynamical in nature, their
properties are highly influenced by the vowel that follows
the stop consonant.
Unvoiced Stops
• The unvoiced stop consonants /p/, /t/, and /k/ (/P/, /T/, and
/K/) are similar to their voiced counterparts /b/, /d/, and /g/
with one major exception.
• During the period of total closure of the tract, as the
pressure builds up, the vocal cords do not vibrate.
• Thus, following the period of closure, as the air pressure is
released, there is a brief interval of frication (due to sudden
turbulence of the escaping air) followed by a period of
aspiration (steady air flow from the glottis exciting the
resonances of the vocal tract) before voiced excitation
begins.
Affricates and Whisper
• The remaining consonants of American English are the affricates /č/ and /ǰ/ (/CH/ and /JH/), and the aspirated phoneme /h/ (/HH/).
• The unvoiced affricate /č/ is a dynamical sound that can be modeled as the concatenation of the stop /t/ and the fricative /š/.
• The voiced affricate /ǰ/ can be modeled as the concatenation of the stop /d/ and the fricative /ž/.
• Finally, the phoneme /h/ is produced by exciting the vocal tract by
a steady air flow—i.e., without the vocal cords vibrating, but with
turbulent flow being produced at the glottis.
THE SPEECH CHAIN - from production to perception (Fig. 25).
THE SPEECH CHAIN- Contd…
• Levels of representation between the speaker and the listener.
1. the linguistic level - where the basic sounds of the communication are
chosen to express some thought or idea
2. the physiological level - where the vocal tract components produce the
sounds associated with the linguistic units of the utterance
3. the acoustic level - where sound is released from the lips and nostrils and
transmitted to both the speaker, as feedback, and to the listener
4. the physiological level - where the sound is analyzed by the ear and the
auditory nerves
5. linguistic level - where the speech is perceived as a sequence of linguistic
units and understood in terms of the ideas being communicated
THE SPEECH CHAIN- contd…
• Our focus is on the listener (or auditory) side of the speech chain.
Fig.26 Block diagram of processes for conversion from acoustic wave to perceived sound.

• Figure 26 shows a block diagram of the physical processes involved in human hearing and speech (sound) perception.
1. The acoustic signal corresponding to the spoken speech utterance is first
converted to a neural representation by processing in the ear.
• The acoustic-to-neural conversion takes place in stages at the outer, middle,
and inner ear.
• These processes are subject to measurement and therefore to mathematical
simulation and characterization.
2. Neural transduction, takes place between the output of the inner ear and the
neural pathways to the brain, and consists of a statistical process of nerve firings
at the hair cells of the inner ear, which are transmitted along the auditory nerve
to the brain.
3. Finally the nerve firing signals along the auditory nerve are processed by the
brain to create the perceived sound corresponding to the spoken utterance.
“Black box” behavioral model of hearing and perception - Fig. 27
• The processes used in the neural processing step are, as yet, not well
understood and we can only speculate as to exactly how the neural
information is transformed into sounds, phonemes, syllables, words, and
sentences, and how the brain is able to decode the sound sequence into an
understanding of the message embedded in the utterance.

• This model assumes that an acoustic signal enters the auditory system, causing behavior that we record as psychophysical observations.

• Psychophysical methods and sound perception experiments are used to determine how the brain processes signals with different loudness levels, different spectral characteristics, and different temporal properties.
ANATOMY AND FUNCTION OF THE EAR
• A block diagram of the mechanisms in human hearing is shown in Figure 28.
AUDITORY MODELS
• Spectral analysis on a non-linear frequency scale (linear up to 1000 Hz, non-linear above 1000 Hz);
• Spectral amplitude compression
• Loudness compression via some type of logarithmic compression process;
• Decreased sensitivity at lower (and higher) frequencies based on results from
equal loudness contours;
• Utilization of temporal features based on long spectral integration intervals
(e.g., syllabic rate processing);
• Auditory masking whereby loud tones (or noise) mask adjacent signals that
are below some threshold and contained within a critical frequency band of
the tone (or noise).
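Several of these properties (non-linear frequency resolution, bandwidths that broaden with frequency, logarithmic loudness compression) are what mel-filterbank front ends approximate in practice. The following is a rough sketch under assumed parameters (25 filters, 16 kHz sampling, 512-point FFT), offered as an illustration rather than a prescribed auditory model.

```python
import numpy as np

def mel_filterbank(n_filters=25, n_fft=512, fs=16000):
    """Triangular filters uniformly spaced in mels (non-uniform in Hz)."""
    mel = lambda f: 1127.0 * np.log(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (np.exp(m / 1127.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        lo, c, hi = bins[j], bins[j + 1], bins[j + 2]
        if c > lo:  # rising edge of the triangle
            fb[j, lo:c] = np.linspace(0.0, 1.0, c - lo, endpoint=False)
        if hi > c:  # falling edge of the triangle
            fb[j, c:hi] = np.linspace(1.0, 0.0, hi - c, endpoint=False)
    return fb

# Loudness compression: take the log of the filter-bank energies, e.g.
#   log_energies = np.log(mel_filterbank() @ power_spectrum + 1e-10)
```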
