Unit 1

THE HUMAN SPEECH PRODUCTION MECHANISM:
[Figure: The human vocal organs: 1. Nasal cavity, 2. Hard palate, 3. Alveolar ridge, 4. Soft palate
(velum), 5. Tip of the tongue (apex), 6. Dorsum, 7. Uvula, 8. Radix, 9. Pharynx, 10. Epiglottis,
11. False vocal cords, 12. Vocal cords, 13. Larynx, 14. Esophagus, and 15. Trachea. ]
The human speech production system is shown in figure. The various organs responsible for
producing speech are labeled. The main source of energy is the lungs with the diaphragm. When a
person speaks, the air is forced through the glottis between the vocal cords and the larynx and it
passes via three main cavities, namely, the vocal tract, the pharynx, and the oral and nasal cavities.
To produce speech, air is forced by the oral and nasal cavities through the mouth.
Excitation signals are generated in the following ways – phonation, whispering, frication,
compression, and vibration. Phonation is generated when vocal cords oscillate. These cords can close
and stretch because of cartilages. The oscillations depend on the mass and tension of the cords, and
are also governed by the Bernoulli effect of the air passing through the vocal cords. The opening and
closing of the cords breaks the air stream impulses. The shape and duty cycle of these pulses will
depend on loudness and pitch of the speech signal. Pitch is defined as the repetition rate of these
pulses. At low levels of air pressure, the oscillations may become irregular and occasionally the pitch
may drop. These irregularities are called vocal fry.
The V-shaped opening between the vocal cords, known as the glottis, is the most important sound
source in the vocal system, generating a periodic excitation because of the natural frequency of
vibration of vocal cords. The vocal cords may act in several different ways during speech. Their most
important function is to modulate the air flow by rapidly opening and closing, thus causing a buzzing
sound to produce vowels and voiced consonants. The fundamental frequency of vibration of the
vocal cords – which function as a tuning fork – depends on the mass and tension of the cords, and is
about 110 Hz, 200 Hz, and 300 Hz with men, women, and children, respectively.
In the case of whispering, the vocal cords are drawn closer with a very small triangular opening
between the cartilages. The air passing through this opening will generate turbulence (wideband
noise). This turbulence will function as excitation for whispering. When the vocal tract is constricted
at any other point, the air flow again becomes turbulent, generating wideband noise. It is observed
that the frequency spectrum of this wideband noise reflects the location of constriction. The sounds
produced with this excitation are called fricatives or sibilants. Frication can occur with or without
phonation. If the talker continues to exhale when the vocal cord is completely shut off, then the
pressure will build up and when the vocal cord responds, a small explosion will occur. A combination
of short silence with a short noise burst has a characteristic sound. If the release is abrupt and clean,
it is called a stop or a plosive. If the release is gradual and turbulent, it is termed as an affricate.
When stop consonants are produced, air flow may be cut completely by closing the vocal cords. The
vocal cords may then change to the completely open position, producing a light cough or a glottal
stop. On the other hand, with unvoiced consonants, such as /s/ or /f/, the vocal cords may be
completelyopen throughout. For phonemes such as /h/, an intermediate position may also occur.
Vibrations are set up especially at the tongue and occasionally between the lips, if the air is forced
through the closure.
The pharynx connects the larynx to the oral cavity. The dimension of the pharynx is fixed but its
length changes slightly when the larynx is raised or lowered at one end and the soft palate is raised
or lowered at the other end. The soft palate also isolates the nasal cavity and the pharynx or forms
the route from the nasal cavity to the pharynx. The epiglottis and false vocal cords are at the bottom
of the pharynx to prevent food from reaching the larynx and to isolate the esophagus acoustically
from the vocal tract. The epiglottis, the false vocal cords, and the vocal cords are closed during
swallowing and they remain open during normal breathing.
The oral cavity is one of the most important parts connected to the vocal tract. Its size, shape, and
acoustics can be varied by the movements of the palate, the tongue, the lips, the cheeks, and the teeth,
which are called articulating organs. The tongue in particular is very flexible, allowing the tip and the
edges to be moved independently. The complete tongue can also be moved forward, backward, up,
and down. The lips can control the size and shape of the mouth opening when speech sound is
radiated. Unlike the oral cavity, the nasal cavity has fixed dimensions and shape. Its length is about
12 cm and volume 60 cm3. The air stream to the nasal cavity is controlled by the soft palate.
To analyze sound production, the vocal system may be considered as a single acoustic tube between
the glottis and the mouth. It is about 17 cm long in adult males. Its non-uniform cross-sectional area
depends strongly on the position of the articulators and varies from 0 cm2 at closure to about 20 cm2
when open. The vocal tract has certain normal resonant frequencies – called formant frequencies –
depending on the position of the articulators. The first three formants determine the sound produced
whereas higher formants are necessary for acceptable speech quality. The glottal excited vocal tract
may be then approximated as a straight pipe closed at the vocal cords where the acoustic impedance
is Zg = ∞ and open at the mouth with an acoustic impedance of Zm = 0. We say that the vocal tract
resembles a circular waveguide with proper terminating conditions.
Let us consider the anatomy of the human ear. It is divided into three parts, namely, the outer ear,
the middle ear, and the inner ear. The external canal of the outer ear is uniform in cross-section and
is 2.7 cm long and 0.7 cm across. It has a number of resonating frequencies. The resonant frequency
of 3 kHz falls in the range of speech. The middle ear provides acoustical coupling. It provides
impedance transformation and amplitude limiting. Amplitude limiting protects the ear at high sound
levels. This function is performed by the inner muscles, which alternate the transmission of sound
by contracting. The inner ear contains a transducer which converts acoustic vibrations to nerve
impulses. The part of the inner ear is called the cochlea, and produces a spatial distortion of
frequency components. It acts as a mechanical neural spectrum analyzer.
The frequency range for sound perception is approximately between 16 Hz and 16 kHz. The upper
limit falls off with increasing age. For a young person, it occasionally reaches 20 kHz. In case of old
people, it can be as low as 10 kHz. At the low end of the range, the perceived sound becomes a pulse
stream and at the high end, it fades off into silence. Perceived loudness is a function of frequency and
the level of the sound. By comparing the tones of different frequencies and amplitudes, contours of
equal subjective loudness can be drawn. These contours indicate a dip near 3–4 kHz, indicating
increased sensitivity. Loudness is measured in sones, whereas loudness level is measured in fones. It
is observed that different types of degradation affect speech intelligibility.
DISCRETE – TIME MODEL OF SPEECH:
SPEECH PERCEPTION:
HUMAN AUDITORY SYSTEM:
PHONETICS:
Any language can be described in terms of a set of distinctive sounds called phonemes. Phonetics is
a branch of linguistics that involves the study of the sounds of human speech. It deals with the
physical properties of speech sounds – namely, phones, their physiological production, acoustic
properties, and auditory perception. There are three basic areas of study are articulatory phonetics,
acoustic phonetics, and auditory phonetics.
1. Articulatory phonetics:
The study of the production of speech by the articulatory and vocal tract by the speaker. The main
goal of articulatory phonetics is to provide a common notation and a frame of reference for
linguistics. This allows accurate reproduction of any unknown utterance which is written in the form
of phonetics transcriptions.
Consonants are defined in anatomical terms using the point of articulation, manner of articulation
and phonations. The point of articulation is the location of principal constrictions in the vocal tract
defined in the terms of the participating organs. For example, when the principal point of articulation
is between lips, the consonant is termed bilabial. If it is between the lower lip and upper teeth, the
consonant is said to be labiodental. The manner of articulation is principally a degree of constriction
at the point of articulation and the manner of release. For example, in the case of a plosive, the vocal
tract is shut off at the point of constriction. Plosives have a clean and sharp release; they are also
called stops. If the consonants are accompanied by voicing they are called voice consonants. For
example, even during a stop it is possible to force air through the vocal cords for a short time.
Vowels are less well defined than consonants. This is because the tongue never touches another
organ when making a vowel. So we cannot specify parameters like the point of articulation. In the
case of vowels we specify only the distance from other parts of the mouth. Vowels are described by
the following variables:
1. Tongue high or low
2. Tongue front or back
3. Lips rounded or unrounded

4. Nasalized or un-nasalized
The vowel diagram is shown in Figure. If we consider only the tongue position, we can map high/low
or front/back positions by means of a diagram. The space spanned by these dimensions is called
vowel space. The diagram shows the position of some known vowels to provide points of reference.
Normally, the front vowels are pronounced with the lips unrounded and back vowels with lips
rounded. Phonations distinguish the vowels as tense or lax respectively. The vowels, such as a, e, i, o,
and u are called tense vowels. These vowels are associated with more extreme positions on the vowel
diagram, as they particularly require greater muscular tension to produce.
Figure: Basic Vowel Diagram

Nasalization
If the valve in a closed nasal cavity is disconnected from the system, the vowel sound is determined
by the position of tongue and the lips. If the valve is opened, sound passes through the nasal cavity
also. Such vowels are called nasalized vowels. In phonetics an individual sound is called a phone and
in phonemics the individual sound is called a phonon.
1.5.2 Acoustic Phonetics Acoustically, the vocal tract is a tube of non-uniform cross-section
approximately 17 cm long, which is usually opened at one end and closed at other. The vocal tract
tube has many natural frequencies. If the vocal tract were of uniform cross-section, these frequencies
would be given by
2. Acoustic phonetics:
The study of the transmission of speech from the speaker to the listener. Acoustic phonetics is the
study of sound waves made by the human vocal organs for communication. Acoustically, the vocal
tract is a tube of non-uniform cross-section approximately 17 cm long, which is usually opened at
one end and closed at other. The vocal tract tube has many natural frequencies. If the vocal tract
were of uniform cross-section, these frequencies would be given by
Here, C =350 m/s and l=17 cm. Natural frequencies occur at odd multiples of 500 Hz. These resonant
frequencies are called formants. These appear as dark bands in spectrograms and they are
considered to be a very important acoustical feature. In reality, the cross-sectional area is not
uniform, so resonances are not equally spaced. Typical formant frequencies for selected vowels are
shown in Table 1. Acoustic phonetics is a subfield of phonetics which deals with the acoustic aspects
of speech sound. Acoustic phonetics investigates properties of speech sound, such as the mean
squared amplitude of a speech waveform, its duration and its fundamental frequency. The study of
acoustic phonetics was advanced when in the late 19th century the Edison phonograph was invented.
The phonograph allowed the speech signal to be recorded and later processed and analyzed. By
exposing a segment of speech to the phonograph, and filtering it with a different band pass filter each
time, a spectrogram of the speech utterance could be generated. Ludimar Hermann investigated the
spectral properties of vowels and consonants using the Edison phonograph, and it was in his papers
that the term formant was first introduced.
3. Auditory phonetics:
The study of phonetics of the reception and perception of speech by the listener. Auditory phonetics
is related to the perception and interpretation of sounds. The transmitter and receiver are involved
in the process of linguistic communication. In case of auditory phonetics, the receiver is the listener,
that is, the human ear. The sound is heard by human listener and then is transmitted to the brain.
The ear perceives the auditory stimuli, analyses the stimuli and transmits them to the brain. We have
to discuss three components of ear, namely, the outer, the middle, and the inner ear. The outer ear
consists of the auricle or the pinna, and the auditory meatus or the outer ear canal. The auricle is the
visible part of the ear. The meatus, or the outer ear canal is a tube-like structure having two functions.
The first is that it protects the middle ear, and secondly it functions as a resonator for the sound
waves entering our ear. The middle ear is a cavity that consists of a number of little anatomical
structures, such as the eardrum. The eardrum is a diaphragm or a membrane which vibrates in
response to sound waves. It acts as both a filter and a transmitter for incident waves. The middle ear
also contains tiny bones, namely, the mallet, the anvil and the stirrup. The middle ear plays the role
of protection. When sound waves with high intensity hit the ear, the muscles associated with the
three little bones stated above contract and protect the ear. The next segment is the inner ear. It has
a cavity filled with liquid and is called the cochlea. It also includes the vestibule of the ear and the
semicircular canals. There are two membranes inside the cochlea, namely, the vestibular membrane,
and the basilar membrane. The basilar membrane plays a central role in audition. The cochlea is the
real auditory receptor. The cells of the corti include many ciliate cells capable of detecting the
slightest vibrating movement. These vibrations are converted into neural signals that are
transmitted via the auditory nerves to the brain. The brain is the central receptor and controller of
the entire process.
CATEGORIZATION OF SPEECH SOUNDS:

The speech signal is divided in two parts, a voiced segment and an unvoiced segment. A waveform is
a two-dimensional representation of a sound. The two dimensions in a waveform display are time
and intensity. In case of a sound waveform, the vertical dimension is intensity and the horizontal
dimension is time. Waveforms are also known as time-domain representations of sound, as they are
representations of changes in intensity over time. Sounds of American English are categorized in
four classes, namely, vowels, diphthongs, semivowels, and consonants, as shown in Figure. We will
now go through these different components of a speech signal.
1.Vowels or Voiced Segments :
Vowels are voiced components of the sound, that is, /a/, /e/, /i/, /o/, /u/. The excitation is the
periodic excitation generated by the fundamental frequency of the vocal cords and the sound gets
modulated when it passes via the vocal tract.
2. A Diphthong
It means two sounds or two tones. A diphthong is also known as a gliding vowel, meaning that there
are two adjacent vowel sounds occurring within the same syllable. A diphthong is a gliding
monosyllabic speech item starting at or near the articulatory position for one vowel and moving
towards the position of the other vowel. While pronouncing a diphthong, that is, in the case of words
like eye, hay, boy, low, and cow the tongue moves and these are said to contain diphthongs.
Diphthongs can be characterized by a time-varying vocal tract area function that varies between the
two vowel configurations concerned, that is, in the case of /eI/ it will vary between the vowel
configurations of /e/ and /i/.
3. Semivowel (or Glide)
A semivowel is a sound, such as /w/ or /j/ in English, which is phonetically similar to a vowel sound
but functions often as the syllable boundary.. These are called semivowels due to their vowel-like
structure. They occur between adjacent phonemes and are recognized by a gliding transition in the
functioning of the vocal tract area. Depending on the context, the acoustic characteristics of
semivowels change.
4. Unvoiced Sounds
With purely unvoiced sounds, there is no fundamental frequency in the excitation signal and hence
we consider the excitation as white noise. The air flow is forced through a vocal tract constriction
occurringat several places between the glottis and mouth. Some sounds are produced when a
complete stoppage of air flow is followed by a sudden release. This produces an impulsive turbulent
excitation that is often followed by a more turbulent excitation. Unvoiced sounds are found to be
more silent and having less energy. They are less steady than voiced sounds. Whispering is a special
case of speech. In the case of whispering a voiced sound, there is no fundamental frequency in the
excitation and the first formant frequencies produced by vocal tract are perceived by us.
5. Consonants
A consonant is a speech segment that is articulated using a partial or complete closure of vocal tract.
The examples of consonants are as follows. The sound segment /p/ is pronounced with the lips
closed. The sound segment /t/ is pronounced by closing the front part of the tongue. The sound
segment /k/ is pronounced with the closure of the back of the tongue whereas the sound segment
/h/ is directly pronounced in the throat; /f/ and /s/ are pronounced by forcing air through a narrow
channel of the tongue (fricatives). Sounds like /m/ and /n/ are uttered by forcing air through the
nose (nasals).
6. Nasal Consonant (Also Called Nasal Stop or Nasal Continuant)
It is produced with a lowered velum. It allows air to escape freely through the nose. The oral cavity
will function as a resonant cavity, but the air does not escape through the mouth because it is blocked
either by the lips or the tongue. In the case of utterance of /m/, the constriction is at the lips and for
/n/, the constriction is just at the back of the teeth. Nasal stops have bands of energy at around 200
Hz and 2000 Hz.
7. Stop Consonants
A stop or occlusive is a consonant sound produced by stopping the air flow in the vocal tract. Stops
can be voiced or unvoiced. Plosives are stops with air f low out of the mouth. Sometimes the term
stop also includes nasal stops, in which air flow is stopped in the mouth, and is released through the
nose. The voiced stop consonants are /b/, /d/, and /g/. These are transient sounds produced by
building up pressure behind a total constriction in the oral tract. In case of /b/, the constriction is at
the lips, in case of /d/, the constriction is at the back of the teeth, in case of /g/, the constriction is
near the velum. The properties of these stops are influenced by the vowel which follows them. The
unvoiced stop consonants are /p/, /t/, and /k/. Here, during the closure of the tract when air
pressure builds up, vocal cords do not vibrate.
8. Fricatives
Fricatives are a special class of consonants produced by forcing air through a narrow channel-like
thing made by placing any two articulators close together. The articulators are the lower lip and the
upper teeth, in the case of /f/. The turbulent air flow is called frication. A subset of fricatives is the
sibilants. To produce a sibilant, one is forcing air through a narrow channel, but in addition the
tongue is often curled lengthwise to direct the air over the edge of the teeth.
Fricatives can be voiced or unvoiced. Unvoiced fricatives /f/, /s/, and /sh/ are produced by air
turbulence as the excitation for vocal tract. The location of the constriction decides which fricative it
is. The constriction for /f/ is near the lips, for /s/ it is near the middle of the oral tract and for /sh/,
the constriction is near the back of the oral tract. The corresponding voiced fricatives with same
point of constriction are /v/, /z/, and /zh/, respectively. Two excitation sources are involved in the
case of voiced fricatives. One source is periodic excitation due to vibrating vocal cords and the other
source is the air turbulence that gets generated due to constriction.
9. Affricates
Affricates are consonants and such as /pf/ and /kx/. They will start and stop, such as in the utterance
of /t/ or /d/, but sometimes they release as fricatives such as in the utterance of /s/ or /z/.
SPECTROGRAPHIC ANALYSIS OF SPEECH SOUNDS
PITCH FREQUENCY / FUNDAMENTAL FREQUENCY
A speech signal consists of different frequencies which are harmonically related to each other in the
form of a series. The lowest frequency of this harmonic series is known as the fundamental frequency
or pitch frequency. Pitch frequency is the fundamental frequency of vibrations of the vocal cords.
This frequency generated by vocal cords in the form of periodic excitation passes via the vocal tract
filter and gets convolved with the impulse response of the filter to produce a speech signal. Thus,
speech is basically a convolved signal. Fundamental frequency is related to a voiced speech segment.
Let us have a look at the waveform of a voiced speech segment.
If we have a closer look at the waveform, we can see that a peculiar pattern is repeating after sample
number 650 with approximately the same number of samples.
The number of samples after which the waveform repeats itself will reveal the pitch period in terms
of number of samples. If we can find the time corresponding to these samples, (because the sampling
frequency for the speech is known to us) and calculate the inverse of time period, we will get the
fundamental frequency in hertz (Hz). The main principle used in time domain pitch detection
algorithms is to find the similarity between the original speech signal and its shifted version.
We will now use a variety of methods, such as autocorrelation, average magnitude difference function
(AMDF), etc., for finding the pitch period.
1. Autocorrelation Method for Finding Pitch Period of a Voiced Speech Segment
Autocorrelation is the correlation of a signal with itself. We can say that it is a measure of the
similarity between samples as a function of the time separation between them. It can be considered
as a mathematical tool to find repeating patterns and their periods. Autocorrelation methods need
at least two pitch periods to detect pitch.
An algorithm to detect autocorrelation can be described as follows:
1. Divide the speech in number of segments. Take a speech segment which is at least equal to two
pitch periods. We will use a speech file having a sampling frequency of 22 kHz. So the sampling
interval is (22 kHz)−1 which is equal to 0.045 ms. In an interval of 20 ms, there will be 444.4 samples.
Let us consider a speech segment of size 400 samples.
2. We will calculate the autocorrelation for, say, 45 overlapping samples. This means that two speech
segments, one extending over sample numbers 1–45 is correlated to the other segment from sample
numbers 2–46, then 3–46, and so on. Hence, the sample shifts in steps of 1 starting from 1. We will
use a shift up to 400 samples to find the shift value for which the correlation is the highest. The
distance between two successive maxima in correlation will give a pitch period in terms of number
of samples.
3. Calculate the autocorrelation using the formula given by
The equation is simplified for practical implementation as
1.1 Non-Linear Processing

The low-amplitude portions of speech tend to contain most of the formant information and
highamplitude portions contain most of the pitch information. Hence, non-linear processing, which
deemphasizes the low amplitude portions, will improve the performance of the autocorrelator. Non-
linear processing can be done directly in the time domain. We will discuss two methods of non-linear
processing, namely, centre clipping and cubing.
Centre clipping
First, the low amplitude portions are removed using a centre clipper whose clipping point is decided
by the peak amplitude of the speech signal. Then the clipper is set to remove all portions of the
waveform below some threshold T. The centre clipped speech is then passed through the correlator.
The autocorrelation will be zero for most lags and will have sharp peaks at the pitch period with very
small secondary peaks. Typically, the clipper will have the input-output function given by
The value of T is 30% of the peak value of the signal. The input–output plot is shown in Figure.
[Figure: Clipper input – output graph]
Cubing
The speech waveform is passed through a non-linear circuit whose transfer function is
Cubing tends to suppress the low-amplitude portions of the speech waveform. Here, it is not required
to keep an adjustable threshold.
1.2 AMDF Method for Finding Pitch Period of a Voiced Speech Segment
We form the difference signal Dm by delaying the input speech by various amounts, subtracting the
delayed waveform from the original, and summing the magnitude of the differences between sample
values. Finally, we take the average of the difference function over the number of samples. The
difference signal is always zero at delay = 0, and is particularly small at delays corresponding to the
pitch period of a voiced sound having a quasi-periodic structure. The main advantage of the AMDF
method is that it requires only subtractions.
The algorithm for this method can be described as follows:
1. Divide the speech in number of segments. First take a speech segment which is at least equal to
two pitch periods. In this case, we are using a speech file with a sampling frequency of 22 kHz. So,
the sampling interval is (22 kHz)−1, which is equal to 0.045 ms. In an interval of 20 ms, there will be
444.4 samples. Let us consider a speech segment of size 400 samples. We will then calculate the
AMDF for, say, 45 overlapping samples. Hence, two speech segments, one extending over sample
numbers 1–45 is correlated to the segment extending oversample numbers 2–46, then the segment
of sample numbers 3–47, that is, with a shift of 1, 2, 3 samples and so on. We will use a shift up to
400 samples. We have to find the shift value for which AMDF value is the smallest. The distance
between two successive minima in AMDF will give a pitch period in terms of number of samples.
2. Calculate the AMDF using the formula given by
2. Parallel Processing Approach for Calculation of Pitch Frequency

The block diagram for the algorithm is shown in Figure. The figure illustrates that there are four
blocks in the algorithm. These are:
1. The first block is a low-pass filter with a cut-off frequency of approximately 600 Hz, to eliminate
the higher harmonics of the speech signal.
2. The second block is a processor of signal peaks. This block uses the low-pass filtered signal from
the first block and tracks the peaks in the waveform. Consider a speech waveform obtained at the
output of a filter as shown in Fig. 7. The waveform indicates that the waveform has a first as well as
a second harmonic. The processor of signal peaks will generate a pulse when it encounters a maxima
position in the waveform – which can be either a positive peak or a negative peak. For a positive
peak which is less than the maxima of the entire waveform, the amplitude A3 is measured, which is
the distance between actual peak and the maximum peak value. Similarly, A6 is the distance between
negative peak value and negative maxima value. A1 and A4 are the actual amplitudes of the wave at
maxima position. A1, A2, and A3 are measured at every positive peak and A4, A5, and A6 are
measured for every negative peak. If the signal has only the fundamental notes, then the values of A3
and A6 will be zero and other measurements will indicate strong peaks. For a signal containing
strong second harmonics, A3 and A6 will dominate. Pulses of different heights are produced for each
positive and negative peak as shown in Fig. 7. Six pulse trains are given as input to six identical peak
detectors.
3. Let us now see how the parallel processing element (PPE) works. Each PPE will take a train of
pulses A1, A2, etc. as input at different heights and at different time instants. The PPE will function as
a pitch detector run down circuit as shown in Fig. 8. This means that it will generate a constant
blanking time after the pulse, and decay is produced after a blanking time. If the next pulse input
crosses the decay output, it is detected and a run down decay circuit is reset. The interval between
the two start positions of the circuit is the pitch period as shown in Fig. 8. The values of the blanking
time and the decay time constant are updated according to the previous pitch period estimate. The
output estimates of the 6 PPEs are given as input to a final pitch period estimator block.
4. The final pitch period estimator block uses the successive three pitch estimates of all the PPEs and
these are entered in the first 3 rows of the 6 s6 matrix as shown in Table 1. It also forms the last 3
rows as P14 = P11 + P12, P15 = P12 + P13, and P16 = P11 + P12 + P13. The first 3 pitch estimates of
the first PPE are P11, P12, and P13, and the same is true for all other PPEs. The final PPE block
compares the first entry with rest of the 35 entries and counts the number of coincidences. If most of
the entries are matching, then the pitch estimate is calculated as the frequently repeating value of
pitch. This new value of pitch is used with the previous average values of pitch to update the value
of average pitch.
This algorithm avoids wrong estimate of pitch period even when the waveform contains first and
second harmonic. Because of parallel functioning pitch estimators, the estimation can be done every
5 ms which is found to be an appropriate interval. In the case of unvoiced segment, this coincidence
will not occur. This indicates that no pitch is present and the segment is unvoiced.
2.1 Pitch Contour
If we track the pitch period over the entire voiced segment, we find that there is a small variation in
the pitch period. The contour of variations of the pitch period is termed as pitch contour. The pitch
contour also contains some information related to the spoken word and the speaker. Hence, pitch
contour is also used as a feature in speech recognition and speaker verification tasks.
PITCH PERIOD MEASUREMENT USING SPECTRAL DOMAIN
Here frequency-domain approach is used for pitch period measurement. The frequency domain pitch
detection algorithms operate on the speech spectrum. The periodic signal will have a harmonic structure.
The frequency domain algorithm tracks the distance between the harmonics. The main drawback of
frequency domain methods is its high computational complexity. We will describe four different methods
for finding fundamental frequency using the spectrum as listed below.
1. FFT-based method
2. Harmonic peak detection method
3. Spectrum similarity method

4. Spectral autocorrelation method
1. FFT-Based Method
FFT is a fast algorithm for computation of discrete Fourier transform (DFT). The k th DFT coefficient X(k)
of any sampled signal (sequence) x(n) is given by
where W is given by
Computation of the DFT coefficient requires complex multiplications and complex additions. FFT
algorithms are designed to simplify the computations. There are two FFT algorithms, namely, decimation
in frequency and decimation in time. If one computes a 512-point FFT for a sequence containing 512
samples, then we get a plot that is symmetric about the centre value (256 in this case) because of the
complex conjugate property of DFT coefficients. This plot is a graph of the amplitude of the FFT plotted
against the frequency sample number. In the case of a 512-point FFT, if the sampling frequency of the
signal is 22,100 Hz then this 22,100 Hz range gets divided into 512 frequency components with each
frequency point corresponding to a frequency of 43.16 Hz. We can easily calibrate the DFT coefficient
number axis to the actual frequency value in Hz. The spectrum domain analyzes the amplitude of the
spectral component of a signal. It has the spectral amplitude plotted with respect to frequency.
The algorithm for finding pitch period using FFT-based method can be described as follows.
1. Take the voiced segment of speech containing 2048 samples.
2. Take a 2048-point FFT and plot it. Since the spectrum is symmetrical, we will use only first 1024 values
of FFT.
3. We will then track the first peak in the FFT output to find the fundamental frequency. Note that the
resolution for fundamental frequency measurement is decided by the number of points in the FFT. If we
take a 2048-point FFT, the frequency resolution will improve to 22,100/2048 = 10.79 Hz. The
measurement of the fundamental frequency cannot be accurate using FFT analysis. The accuracy depends
on the resolution in DFT domain.
4. The fundamental frequency is the FFT point number (where first peak occurs) multiplied by the
frequency resolution, that is, 10.79 Hz. 5. The signal plot and the output plot of the FFT are shown in Fig.
9(a). It indicates that the first peak has a value of 12560 and it occurs at the 16th position, and hence, the
fundamental frequency is 10.79 x 16 = 172.26 Hz. This is a valid pitch frequency value and hence, the
speech segment is a voiced speech segment. In case of front-end filtering, sometimes the first peak is
missing.
2 Harmonic Peak Detection Method
Harmonic peak detection method is also used to find the fundamental frequency. The algorithm for
finding pitch period using harmonic peak detection method can be described as follows.
1. Take a speech segment as described in the algorithm for spectrum domain method above. Take the
FFT of each segment.
2. Detect the peaks in the spectrum. If the speech segment is voiced, we will get harmonic peaks, that is,
the peaks at regular intervals.
3. Find the common divisor of these harmonics. The common divisor is the fundamental frequency.
4. From figure, we find that we get peaks at regular intervals for the voiced segment.
3. Spectrum Similarity Method
There is another method called spectrum similarity method. This method assumes that the spectrum is
fully voiced and has peaks located at multiples of the fundamental frequency. The procedure for finding
fundamental frequency using spectrum similarity method can be described as follows.
1. This method has stored templates of the spectrum for different values of pitch frequency.
2. The signal to be analyzed is used to find the spectrum. We divide the signal into a number of segments.
The FFT of the segment is calculated. We then find the squared magnitude of the FFT for a segment. This
is the squared magnitude spectrum.
3. When the spectrum is found for a segment for which pitch frequency is to be determined, one will
compare its spectrum with stored templates to find a match. When a match is found, the corresponding
value of pitch frequency is picked up.
4. The process is repeated for successive segments for tracking pitch values.
4 Spectral Autocorrelation Method

Pitch frequency can also be measured by calculating the spectral autocorrelation. The procedure for
finding fundamental frequency using spectral autocorrelation method can be described as follows.
1. The speech segment will be divided into a number of segments as described in previous algorithms.
2. The FFT of each segment is taken. The squared magnitude of the FFT is evaluated.
3. Find the autocorrelation of the spectrum using the equation
where X(i) represents the i th FFT coefficient and 2N is the number of FFT points. We concentrate on the
half the spectrum as the spectrum is symmetrical about the centre point.
4. The spectral autocorrelation will be highest for frequency interval equal to pitch frequency for voiced
segment. A plot of spectral autocorrelation in figure below shows spectral peaks at regular intervals. The
distance between two peaks gives fundamental frequency.
PITCH PERIOD MEASUREMENT USING CEPSTRAL DOMAIN

A cepstrum is obtained when we take the Fourier transform (FT) of the log spectrum. The name
cepstrum is derived by reversing the first four letters of ‘spectrum’. We can have a power cepstrum
or a complex cepstrum. Let us define a power cepstrum. It is the squared magnitude of the FT of the
log of squared magnitude of FT of a signal. We can algorithmically define it as
Signal ⤇ FT ⤇ abs() ⤇square⤇ log ⤇ FT⤇ abs()⤇square⤇ power cepstrum
Let us also define a complex cepstrum. The complex cepstrum was defined by Oppenheim in his
development of homomorphic system theory. The complex cepstrum of any signal is the Fourier
transform of the logarithm obtained with unwrapped phase of the FT of a signal. Sometimes it is called
the spectrum of a spectrum. We will define it algorithmically as
signal m FT m abs() m log m phase unwrapping m FT m complex cepstrum
The real cepstrum will use the log function defined for real values. The real cepstrum is related to the
power cepstrum as follows: 4(Real spectrum) Power spectrum

Unit 1

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Unit 1

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Unit 1

Uploaded by

Copyright:

Available Formats

THE HUMAN SPEECH PRODUCTION MECHANISM:

3. Lips rounded or unrounded

Figure: Basic Vowel Diagram

CATEGORIZATION OF SPEECH SOUNDS:

3. Calculate the autocorrelation using the formula given by

The equation is simplified for practical implementation as

1.1 Non-Linear Processing

2. Parallel Processing Approach for Calculation of Pitch Frequency

2. Harmonic peak detection method

3. Spectrum similarity method

1. Take the voiced segment of speech containing 2048 samples.

3. Spectrum Similarity Method

4 Spectral Autocorrelation Method

3. Find the autocorrelation of the spectrum using the equation

PITCH PERIOD MEASUREMENT USING CEPSTRAL DOMAIN

You might also like