Lecture Notes - Speech Processing

UNIT I FUNDAMENTALS OF SPEECH

THE HUMAN SPEECH PRODUCTION MECHANISM


The human speech production mechanism involves a complex process that includes various structures
and organs working together to produce speech sounds. Here's an overview of the speech production
mechanism along with a simplified diagram:

Fig. Block Diagram of Speech Production Mechanism

1. Lungs: The process of speech production starts with the lungs, which provide the airflow
necessary for speech. Air is exhaled from the lungs and travels upwards towards the vocal tract.

2. Trachea: The trachea, also known as the windpipe, is the passage through which air travels from
the lungs towards the vocal tract.
3. Vocal cords (Vocal Folds): The vocal cords are a pair of muscular folds located in the larynx
(voice box). They are capable of opening and closing, and their vibration produces sound. When
air passes between the vibrating vocal cords, it creates a buzzing sound.
4. Larynx: The larynx is a cartilaginous structure located in the throat. It houses the vocal cords
and plays a crucial role in pitch and voice modulation.
5. Pharynx: The pharynx is a cavity at the back of the throat. It serves as a resonating chamber for
speech sounds.
6. Oral Cavity: The oral cavity consists of the mouth and its various structures, including the
tongue, teeth, and hard and soft palates. It plays a significant role in shaping speech sounds.
7. Articulators:
✓ Tongue: The tongue is a highly flexible organ that assists in shaping the sound produced by
the vocal cords. It can make various configurations against the roof of the mouth or other
parts of the oral cavity to produce different speech sounds.
✓ Teeth: The interaction of the tongue with the teeth can create specific sounds like "th," "s,"
and "z."
✓ Hard Palate: The hard palate, located on the roof of the mouth, also plays a role in shaping
sounds.
✓ Soft Palate (Velum): The soft palate can be raised or lowered to close off the nasal passage
or allow air to pass through the nose, producing nasal sounds.
✓ Lips: The movement of the lips plays a crucial role in shaping many speech sounds,
especially bilabial sounds like "p," "b," and "m."
8. Nasal Cavity: The nasal cavity is located behind the nose. During speech, air can pass through
the nasal cavity, resulting in nasal sounds for certain speech sounds like "m" and "n."

PHONETICS - ARTICULATORY PHONETICS, ACOUSTIC PHONETICS, AND AUDITORY PHONETICS, CATEGORIZATION OF SPEECH SOUNDS
✓ Phonetics is the branch of linguistics that deals with the study of speech sounds. It focuses on
the physical properties of speech sounds, how they are produced, transmitted, and perceived by
humans.
✓ The primary goal of phonetics is to understand and describe the sounds of human language,
irrespective of the specific language being spoken.

Fig. Understanding the Phonetics

1. Articulatory Phonetics:
✓ Articulatory phonetics is concerned with the study of how speech sounds are physically
produced or articulated by the human vocal tract and articulatory organs.
✓ It investigates the movements and positions of the various speech organs, such as the tongue,
lips, teeth, alveolar ridge, and velum (soft palate), during speech sound production.
✓ Articulatory phonetics describes the specific articulatory configurations that lead to the creation
of different speech sounds.
For example:
✓ The articulation of the vowel /i/ involves the tongue being in a high front position.
✓ The articulation of the consonant /p/ requires the lips to come together to block the airflow and
then release the blockage to create a sound.

2. Acoustic Phonetics:
✓ Acoustic phonetics is the study of the physical properties of speech sounds as sound waves
travel through the air. It deals with the analysis of the frequencies, amplitudes, and durations of
these sound waves.
✓ Acoustic phonetics helps us understand how speech sounds differ in terms of their acoustic
properties and how they are perceived by the listener's auditory system.
For example:
✓ The vowel /a/ is characterized by energy concentrated at low frequencies; it is produced with a relatively open vocal tract, resulting in a larger, more open sound wave pattern.
✓ The fricative /s/ is characterized by high-frequency sound waves due to the turbulent airflow
caused by the constriction between the tongue and the alveolar ridge.
3. Auditory Phonetics:
✓ Auditory phonetics focuses on the perception and processing of speech sounds by the human
auditory system.
✓ It examines how the brain interprets the acoustic information received from the environment
and recognizes different speech sounds.
✓ Auditory phonetics plays a crucial role in understanding how humans perceive and
distinguish speech sounds, even in challenging listening conditions.
For example:
✓ The auditory system can discriminate between two similar speech sounds, such as /b/ and /p/, based on subtle differences in their acoustic properties, like voice onset time (VOT).
✓ Auditory phonetics helps explain how listeners can recognize speech sounds even in the
presence of background noise.
4. Categorization of Speech Sounds:
Speech sounds can be categorized based on various phonetic features. Here are some common
categorizations:
✓ Place of Articulation: Categorizes consonant sounds based on where in the vocal tract the
airflow is constricted or blocked during articulation. Examples include bilabials (/p/, /b/),
alveolars (/t/, /d/), and velars (/k/, /g/).
✓ Manner of Articulation: Classifies consonant sounds based on the degree of airflow
constriction or how the airflow is manipulated during articulation. Examples include stops (/p/,
/t/, /k/), fricatives (/f/, /s/, /ʃ/), and nasals (/m/, /n/, /ŋ/).
✓ Voicing: Distinguishes between consonant sounds based on whether the vocal cords vibrate
during their production. Examples include voiced (/b/, /d/, /g/) and voiceless (/p/, /t/, /k/)
sounds.
✓ Vowels: Vowels are categorized based on the position of the tongue and lips during articulation
and the height and backness of the tongue. Examples include /i/, /e/, /a/, /o/, and /u/.

✓ Prosody: This refers to the rhythm, intonation, and stress patterns of speech. It helps convey
meaning and emotion in connected speech.
✓ These categorizations and the understanding of articulatory, acoustic, and auditory aspects of
speech sounds are fundamental to the study of phonetics and linguistics, contributing to our
comprehension of human language and communication.

1. Short-Time Fourier Transform (STFT)


✓ Speech is not a stationary signal, i.e., it has properties that change with time
✓ Thus a single spectral representation based on all the samples of a speech utterance is, for the most part, not meaningful
✓ Instead, we define a time-dependent Fourier transform (TDFT or STFT) of speech that is recomputed periodically (frame by frame) as the speech properties change over time:

X_n̂(e^jω) = Σ_m x[m] * w[n̂ − m] * e^(−jωm)

where w[n̂ − m] is an analysis window positioned at analysis time n̂

2. Cepstrum
✓ Cepstrum - inverse Fourier transform of the log magnitude spectrum of a signal
✓ The cepstrum of a discrete-time signal is defined as

c[n] = (1/2π) ∫ log|X(e^jω)| * e^(jωn) dω (integral over −π ≤ ω ≤ π)

where log|X(e^jω)| is the logarithm of the magnitude of the DTFT of the signal. The concept is extended by defining the complex cepstrum as

x̂[n] = (1/2π) ∫ log X(e^jω) * e^(jωn) dω

in which the complex logarithm log X(e^jω) = log|X(e^jω)| + j arg X(e^jω) is used.
✓ Computation Using the DFT: in practice the DTFT is replaced by an N-point DFT, giving the approximation

c[n] ≈ (1/N) * Σ[k=0 to N−1] log|X[k]| * e^(j2πkn/N)
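As a rough illustration of the DFT-based computation, here is a minimal NumPy sketch for one windowed frame (the function name, FFT length, and the small offset that guards against log(0) are assumptions made for this example):

```python
import numpy as np

def real_cepstrum(frame, n_fft=512):
    """Approximate the real cepstrum of one windowed frame via the DFT."""
    spectrum = np.fft.fft(frame, n_fft)             # X[k]
    log_mag = np.log(np.abs(spectrum) + 1e-12)      # log|X[k]|; the offset avoids log(0)
    return np.real(np.fft.ifft(log_mag))            # c[n]: inverse DFT of the log magnitude
```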

3. Mel-Frequency Cepstrum Coefficients


✓ The basic idea is to compute a frequency analysis based upon a filter bank with approximately
critical band spacing of the filters and bandwidths.
✓ For 4 kHz bandwidth, approximately 20 filters are used.
✓ In most implementations, a short-time Fourier analysis is done first, resulting in a DFT X_n̂[k] at analysis time n̂.
✓ Then the DFT values are grouped together into critical bands and weighted by a triangular weighting function

Fig. Weighting functions for Mel-frequency filter bank.
Note that the bandwidths in the figure are constant for center frequencies below 1 kHz and then increase exponentially up to half the sampling rate (4 kHz), resulting in a total of 22 “filters.”

✓ The mel-frequency spectrum at analysis time n̂ is defined for r = 1, 2, . . . , R as

MF_n̂[r] = (1/A_r) * Σ[k=L_r to U_r] |V_r[k] * X_n̂[k]|^2

✓ where V_r[k] is the triangular weighting function for the rth filter, ranging from DFT index L_r to U_r, and

A_r = Σ[k=L_r to U_r] |V_r[k]|^2

✓ is a normalizing factor for the rth mel-filter. This normalization is built into the weighting functions of the figure.
✓ It is needed so that a perfectly flat input Fourier spectrum will produce a flat mel-spectrum.
✓ For each frame, a discrete cosine transform of the log of the magnitude of the filter outputs is computed to form the function mfcc_n̂[m]:

mfcc_n̂[m] = (1/R) * Σ[r=1 to R] log(MF_n̂[r]) * cos(2π(r + 1/2)m / R)
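The recipe above can be sketched in NumPy as follows. This is an illustrative, simplified implementation: the mel-scale conversion formula, the placement of the filter edges, and the default parameter values are assumptions made here, and production front-ends differ in details.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs=8000, n_fft=512, n_filters=22, n_ceps=13):
    """MFCCs of one windowed frame via a triangular mel filter bank and a DCT."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2                  # |X_n[k]|^2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    # Triangular weighting functions V_r[k], with edges equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2))
    mel_spectrum = np.empty(n_filters)
    for r in range(n_filters):
        lo, ctr, hi = edges[r], edges[r + 1], edges[r + 2]
        rising = np.clip((freqs - lo) / (ctr - lo), 0.0, 1.0)
        falling = np.clip((hi - freqs) / (hi - ctr), 0.0, 1.0)
        v = np.minimum(rising, falling)                             # V_r[k]
        a_r = np.sum(v ** 2) + 1e-12                                # normalizing factor A_r
        mel_spectrum[r] = np.sum((v ** 2) * power) / a_r + 1e-12    # MF_n[r]

    # DCT of the log mel-spectrum gives mfcc_n[m]
    m_idx = np.arange(n_ceps)[:, None]
    r_idx = np.arange(n_filters)[None, :]
    dct = np.cos(2.0 * np.pi * (r_idx + 0.5) * m_idx / n_filters)
    return (dct @ np.log(mel_spectrum)) / n_filters
```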

4. Short-Time Energy and Zero-Crossing Rate

✓ Two basic short-time analysis functions useful for speech signals are the short-time energy and
the short-time zero-crossing rate.
✓ These functions are simple to compute, and they are useful for estimating properties of the
excitation function in the model.
✓ The short-time energy is defined as

E_n̂ = Σ_m (x[m] * w[n̂ − m])^2

where w[n] is the analysis window.
✓ The short-time zero-crossing rate is defined as the weighted average of the number of times the speech signal changes sign within the time window.
✓ Representing this operator in terms of linear filtering leads to

Z_n̂ = Σ_m (1/2) * |sgn(x[m]) − sgn(x[m − 1])| * w̃[n̂ − m]

where sgn(·) is the sign function and w̃[n] is an averaging window (a short computational sketch of both measures is given at the end of this section)
✓ Below Figure shows an example of the short-time energy and zero crossing rate for a segment
of speech with a transition from unvoiced to voiced speech.
✓ In both cases, the window is a Hamming window (two examples shown) of duration 25ms
(equivalent to 401 samples at a 16 kHz sampling rate).

Fig. Section of speech waveform with short-time energy and zero-crossing rate superimposed
✓ Note that during the unvoiced interval, the zero-crossing rate is relatively high compared to the
zero-crossing rate in the voiced interval.

✓ Conversely, the energy is relatively low in the unvoiced region compared to the energy in the
voiced region
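A minimal NumPy sketch of both measures is given below, assuming a 25 ms Hamming window and a 10 ms hop; these values and the function name are illustrative choices, and the zero-crossing average here uses a plain mean over the frame rather than a second weighting window.

```python
import numpy as np

def short_time_energy_zcr(x, fs=16000, win_ms=25.0, hop_ms=10.0):
    """Short-time energy and zero-crossing rate with a sliding Hamming window."""
    win_len = int(round(win_ms * 1e-3 * fs))
    hop = int(round(hop_ms * 1e-3 * fs))
    w = np.hamming(win_len)
    energy, zcr = [], []
    for start in range(0, len(x) - win_len + 1, hop):
        frame = x[start:start + win_len]
        energy.append(np.sum((frame * w) ** 2))               # E_n for this frame
        signs = np.sign(frame)
        zcr.append(0.5 * np.mean(np.abs(np.diff(signs))))     # average crossings per sample
    return np.array(energy), np.array(zcr)
```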

5. Short-Time Autocorrelation Function (STACF)


✓ The autocorrelation function is often used as a means of detecting periodicity in signals, and it
is also the basis for many spectrum analysis methods.
✓ This makes it a useful tool for short-time speech analysis.
✓ The STACF is defined as the deterministic autocorrelation function of the windowed sequence

x_n̂[m] = x[m] * w[n̂ − m]

that is selected by the window shifted to time n̂:

R_n̂[ℓ] = Σ_m x_n̂[m] * x_n̂[m + ℓ]

Fig. Voiced and unvoiced segments of speech and their corresponding STACF

✓ The analysis window segments the waveform into the short frames on which the autocorrelation is computed

✓ Note the peak in the autocorrelation function for the voiced segment at the pitch period and
twice the pitch period, and note the absence of such peaks in the autocorrelation function for
the unvoiced segment.
✓ This suggests that the STACF could be the basis for an algorithm for estimating/detecting the
pitch period of speech.
✓ Usually such algorithms involve the autocorrelation function together with other short-time measurements, such as zero-crossings and energy, to aid in making the voiced/unvoiced decision.
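A minimal autocorrelation-based pitch estimator for a single voiced frame might look like the sketch below. The 60-400 Hz search range and the function name are assumptions, the frame is assumed to be longer than the maximum lag, and a practical detector would combine this with energy and zero-crossing cues for the voiced/unvoiced decision, as noted above.

```python
import numpy as np

def acf_pitch(frame, fs=16000, f0_min=60.0, f0_max=400.0):
    """Estimate the pitch period of a voiced frame from its short-time autocorrelation."""
    frame = frame - np.mean(frame)
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # R_n[l] for l >= 0
    lag_min = int(fs / f0_max)                                       # shortest plausible period
    lag_max = int(fs / f0_min)                                       # longest plausible period
    peak_lag = lag_min + np.argmax(acf[lag_min:lag_max])             # strongest in-range peak
    return peak_lag / fs, fs / peak_lag                              # pitch period (s), F0 (Hz)
```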

DISCRETE-TIME MODEL OF SPEECH PRODUCTION

Fig.Discrete-Time model of speech production

✓ Model uses a more detailed representation of the excitation in terms of separate source
generators for voiced and unvoiced speech
✓ In this model the unvoiced excitation is assumed to be a random noise sequence, and the voiced
excitation is assumed to be a periodic impulse train with impulses spaced by the pitch period
(P0) rounded to the nearest sample
✓ The pulses needed to model the glottal flow waveform during voiced speech are assumed to be
combined (by convolution) with the impulse response of the linear system, which is assumed
to be slowly-time-varying (changing every 50–100 ms or so)
✓ The system can be described by the convolution expression

s[n] = Σ_k e[k] * h[n − k] = e[n] * h[n]

where e[n] is the excitation and h[n] is the (slowly time-varying) impulse response of the vocal tract system
✓ It is often assumed that the system is an all-pole system with system function of the form:

H(z) = G / (1 − Σ[k=1 to p] a_k * z^(−k))
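To make the model concrete, the SciPy sketch below synthesizes a voiced and an unvoiced segment by exciting the same all-pole filter with an impulse train and with white noise. The pitch period, filter coefficients, and gain are made-up illustrative values, not parameters taken from the text.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000
n = 4000
pitch_period = 80                              # P0 in samples (100 Hz pitch), illustrative only

# Voiced excitation: periodic impulse train; unvoiced excitation: random noise
voiced_exc = np.zeros(n)
voiced_exc[::pitch_period] = 1.0
unvoiced_exc = np.random.randn(n)

# Slowly varying all-pole system H(z) = G / (1 - a1*z^-1 - a2*z^-2); coefficients are illustrative
a = np.array([1.0, -1.3, 0.9])                 # denominator [1, -a1, -a2]
g = 1.0
voiced_speech = lfilter([g], a, voiced_exc)
unvoiced_speech = lfilter([g], a, unvoiced_exc)
```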

SPECTROGRAPHIC ANALYSIS OF SPEECH SOUNDS


Spectrographic analysis of speech sounds is a powerful tool used in phonetics and speech science to
visualize and study the acoustic properties of speech. It provides a detailed and visual representation of
the speech signal, showing how different frequencies change over time. Spectrograms are widely used
to analyze and compare speech sounds, study speech patterns, and understand the acoustic
characteristics of different languages and dialects.
Principle of Spectrographic Analysis:
A spectrogram is a 2D representation of a speech signal in which time is represented on the x-axis,
frequency on the y-axis, and the intensity (or energy) of each frequency at a given time is represented
by the darkness or color of the corresponding point.
Process of Spectrographic Analysis:
Recording the Speech Signal: The first step in spectrographic analysis is to record the speech signal.
This can be done using a microphone or other recording devices.

Preprocessing the Speech Signal: The recorded speech signal may contain background noise and
unwanted artifacts. Before generating a spectrogram, the speech signal is preprocessed to remove noise
and enhance the quality of the speech signal.
1. Time-Frequency Analysis:
✓ The speech signal is then analyzed using a technique called Short-Time Fourier
Transform (STFT) or other time-frequency analysis methods.
✓ STFT breaks the speech signal into small overlapping segments and calculates the
frequency spectrum for each segment.
✓ The spectrogram analysis involves the use of the Short-Time Fourier Transform
(STFT), which is a variation of the traditional Fourier Transform.
✓ The Fourier Transform is a mathematical operation that transforms a signal from the
time domain to the frequency domain.
✓ It provides information about the frequency components present in the signal.
✓ However, in speech analysis, using the traditional Fourier Transform on the entire
speech signal would not provide sufficient temporal resolution, as speech sounds
change rapidly over time.
✓ The STFT addresses this issue by analysing short segments of the speech signal,
allowing for a time-frequency representation.

STFT(t, f) = ∫[x(τ) * w(t - τ)] * e^(-j2πfτ) dτ

Where:
STFT(t, f) represents the value of the Short-Time Fourier Transform at time t and frequency f.
x(τ) is the original speech signal.
w(t - τ) is a windowing function that reduces spectral leakage and smooths the spectrogram.
e^(-j2πfτ) is a complex exponential that represents the contribution of frequency f at time τ.

Figure: Spectrogram of a speech signal with breath sound (marked as "Breath").
2. Visualization as a Spectrogram: The frequency spectrum of each segment is represented as a
column in the spectrogram. As the analysis progresses through time, the columns are stacked
side by side to create a 2D image. The intensity of each frequency component is represented by
the color or darkness of the corresponding point.
3. Interpreting Spectrograms:
Spectrograms provide valuable information about the acoustic properties of speech sounds:
a. Formants: Formants are regions of concentrated energy in the speech spectrum. They
correspond to the resonant frequencies of the vocal tract and are essential for vowel
identification.
b. Consonant Transitions: Spectrograms show transitions between consonant sounds and the
adjacent vowels. These transitions reveal information about the place and manner of
articulation.
c. Vowel Quality: Vowels are characterized by specific patterns of formants, which determine
their quality (e.g., high or low, front or back).
d. Voice Onset Time (VOT): Spectrograms can reveal the voicing properties of consonants,
including voice onset time, which distinguishes between voiced and voiceless stops.
e. Prosody: Spectrograms display variations in pitch (fundamental frequency) and intensity,
which are crucial for understanding prosodic features such as stress and intonation.
Spectrographic analysis is widely used in various fields, including linguistics, phonetics, speech
pathology, and speech technology. It helps researchers and practitioners gain valuable insights into the
acoustic properties of speech and aids in the study and characterization of different speech sounds and
patterns.
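The analysis described above can be sketched in a few lines of NumPy. The frame length, hop size, and FFT size used here are illustrative assumptions (400 samples ≈ 25 ms and 160 samples ≈ 10 ms at a 16 kHz sampling rate); a plotting library would then render the returned array as the image.

```python
import numpy as np

def spectrogram(x, frame_len=400, hop=160, n_fft=512):
    """Magnitude spectrogram |X(m, k)|: one column of the output per analysis frame."""
    w = np.hamming(frame_len)
    frames = [x[s:s + frame_len] * w
              for s in range(0, len(x) - frame_len + 1, hop)]
    stft = np.array([np.fft.rfft(f, n_fft) for f in frames])   # rows: frame m, cols: bin k
    return np.abs(stft).T                                      # shape (frequency bins, frames)
```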

PITCH FREQUENCY:
✓ The pitch frequency, also known as the fundamental frequency (F0), is the rate at which the
vocal cords vibrate during the production of voiced speech sounds.
✓ The fundamental frequency determines the perceived pitch of a sound.
✓ In speech, pitch is measured in Hertz (Hz), which represents the number of cycles of vocal cord
vibration per second.
✓ The fundamental frequency is typically denoted by "F0" and can be calculated using the
following equation:

F0 = 1 / T
Where:
F0 is the fundamental frequency in Hertz (cycles per second).
T is the period of one vocal cord vibration in seconds.
The period (T) represents the time taken for one complete cycle of vocal cord vibration. It is
the reciprocal of the fundamental frequency (F0):

T = 1 / F0

✓ In practice, measuring the exact period or fundamental frequency of the vocal cord vibration
directly is challenging.
✓ Instead, researchers and speech scientists use various methods, such as signal processing
techniques, to estimate the fundamental frequency from the speech signal.
✓ Common methods include the use of autocorrelation, cepstral analysis, and pitch detection
algorithms.
✓ Pitch period measurement in the spectral domain involves estimating the fundamental
frequency (F0) by analyzing the periodicity of the harmonics in the speech signal's spectrum.
✓ The pitch period (T) can be obtained as the reciprocal of the fundamental frequency (T = 1 /
F0).
PITCH PERIOD MEASUREMENT IN THE SPECTRAL DOMAIN:
✓ Compute the Spectrum: Calculate the Short-Time Fourier Transform (STFT) of the speech
signal to obtain its magnitude spectrum.
✓ The STFT breaks the signal into short overlapping segments and computes the frequency
spectrum for each segment
✓ Given a discrete-time signal x(n) of length N, the STFT is calculated by performing the Fourier
Transform on short segments (frames) of the signal. The process involves the following steps:

a. Windowing: The signal is divided into short overlapping segments, and a window function w(n)
is applied to each segment to reduce spectral leakage. Common window functions include
Hamming, Hanning, and Blackman.
b. Zero-padding (Optional): Zero-padding may be applied to each windowed segment to increase the number of frequency samples in the resulting spectrum (i.e., a more finely sampled spectrum).
c. Fourier Transform: The Discrete Fourier Transform (DFT) is applied to each windowed
segment to obtain its frequency spectrum.
d. Assembling the STFT: The spectra of successive frames are collected side by side, one column per frame, to form the STFT (overlap-add recombination is used only when re-synthesizing a time-domain signal from the STFT).
Equations for STFT Computation:
Let's define the parameters involved in the equations:

N: Length of the original signal x(n).


M: Length of each windowed segment (frame).
H: Hop size or the number of samples between the starting points of consecutive frames. It
determines the overlap between frames.
w(n): Window function of length M.
X(m, k): The STFT of the signal x(n) at time index m and frequency index k.
The STFT can be calculated as follows:
X(m, k) = Σ [ x(n) * w(n - mH) * e^(-j2πkn/M) ]
Where:
m: Index of the frame (time index).
k: Index of the frequency bin.
n: Index of the sample in the original signal.
e^(-j2πkn/M): Complex exponential term for the DFT.
Σ: Summation over the sample index n within the window of length M.
The STFT, X(m, k), represents the magnitude and phase of the frequency components at each time
frame and frequency bin. To obtain the spectrogram, we typically calculate the magnitude of the STFT:
Spectrogram(m, k) = |X(m, k)|

The spectrogram provides a time-frequency representation of the signal, allowing us to observe how
the frequency content of the signal changes over time.
Find Harmonic Peaks: Identify the peaks in the spectrum that correspond to the harmonics of the vocal
cord vibrations. Harmonics are integer multiples of the fundamental frequency, and their presence in
the spectrum indicates the periodicity of the speech signal.

Estimate the Fundamental Frequency (F0): Once the harmonic peaks are identified, the fundamental frequency (F0) can be estimated from the spacing between consecutive harmonic peaks, since adjacent harmonics are separated by F0 (in Hz). The reciprocal of F0 then gives the pitch period (T) in seconds.
a. Equation for Pitch Period Estimation:
If the pitch period is expressed as a spacing of n samples (for example, after converting the harmonic spacing into a period, T = 1/F0, and multiplying by the sampling rate), the pitch period can be calculated as follows:
T = n * (1 / Fs)
Where:
T is the pitch period in seconds.
n is the pitch period expressed as a number of samples.
Fs is the sampling rate of the speech signal, which represents the number of samples per second.
For example, if the pitch period corresponds to 50 samples and the sampling rate is 16,000 samples per second (Fs = 16000), then the pitch period (T) would be:

T = 50 * (1 / 16000) = 0.003125 seconds

The reciprocal of the pitch period provides the fundamental frequency (F0):

F0 = 1 / T = 1 / 0.003125 = 320 Hz

In this example, the estimated fundamental frequency (F0) of the speech signal is 320 Hz.
Pitch period measurement in the spectral domain is a valuable tool for understanding the pitch
characteristics of speech sounds, analyzing prosody, and studying intonation patterns in speech. It is
widely used in speech processing, speech recognition systems, and other applications involving the
analysis of speech signals.
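As an illustration of the harmonic-spacing idea, here is a minimal NumPy/SciPy sketch for one frame. The FFT length, peak-picking threshold, minimum peak spacing, and function name are ad hoc assumptions, and robust pitch trackers add many refinements on top of this.

```python
import numpy as np
from scipy.signal import find_peaks

def pitch_from_harmonics(frame, fs=16000, n_fft=4096):
    """Rough F0 estimate from the spacing of harmonic peaks in one frame's spectrum."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    min_dist = int(50.0 / (fs / n_fft))                  # keep peaks at least ~50 Hz apart
    peaks, _ = find_peaks(spec, height=0.1 * spec.max(), distance=min_dist)
    if len(peaks) < 2:
        return None                                      # no usable harmonic structure
    f0 = np.median(np.diff(freqs[peaks]))                # typical spacing between harmonics
    return f0, 1.0 / f0                                  # F0 (Hz) and pitch period T (s)
```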

PITCH PERIOD MEASUREMENT IN THE CEPSTRAL DOMAIN:


In the cepstral domain, the pitch period is estimated based on the periodic structure of the speech signal
in the cepstrum. The cepstrum is obtained by taking the inverse Fourier transform of the log magnitude
spectrum of the speech signal.
Step 1: Preprocess the speech signal (same as mentioned before).
Step 2: Compute the Short-Time Fourier Transform (STFT) of the preprocessed speech signal and
calculate the magnitude spectrum, |X(t, f)|, as explained in the spectral domain.
Step 3: Compute the cepstrum of the speech signal. The cepstrum C(t, quefrency) is obtained by taking
the inverse Fourier transform of the log magnitude spectrum:
C(t, quefrency) = IFFT(log(|X(t, f)|))

Where "quefrency" is the quefrency index, representing the time domain after the inverse Fourier
transform.
Step 4: Identify the dominant peak in the cepstral domain to estimate the pitch period (T0). The position
of the peak corresponds to the periodicity of the speech signal.
Step 5: Calculate the pitch period (T0) based on the position of the dominant peak in the cepstrum. The
value of T0 will depend on the quefrency index of the peak.
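The steps above can be sketched as follows for a single frame; the 60-400 Hz search range, the small offset guarding the logarithm, and the function name are assumptions, and the frame is assumed to span at least a few pitch periods.

```python
import numpy as np

def cepstral_pitch(frame, fs=16000, f0_min=60.0, f0_max=400.0):
    """Pitch period from the dominant peak of the real cepstrum of one frame."""
    spec = np.abs(np.fft.fft(frame * np.hamming(len(frame))))
    cep = np.real(np.fft.ifft(np.log(spec + 1e-12)))         # real cepstrum C(quefrency)
    q_min = int(fs / f0_max)                                 # search plausible quefrencies only
    q_max = int(fs / f0_min)
    peak_q = q_min + np.argmax(cep[q_min:q_max])             # dominant cepstral peak
    return peak_q / fs, fs / peak_q                          # T0 (s), F0 (Hz)
```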

FORMANTS

✓ Formants are essential acoustic resonances that characterize the vocal tract's frequency response
during speech production.
✓ They are created as a result of the vocal tract's configuration, which acts as a series of tubes and
cavities. Formants play a crucial role in determining the quality and timbre of speech sounds
and are critical in distinguishing different vowels.
✓ When we produce speech sounds, the vocal tract (the throat, mouth, and nasal cavity) acts as a
resonating system.
✓ As air is expelled from the lungs and passes through the vocal cords, the configuration of the
vocal tract changes, causing certain frequencies to resonate more strongly than others. These
resonant frequencies are known as formants.
✓ The first three formants, denoted as F1, F2, and F3, are particularly important in speech
production:
F1: First Formant (Frequency influenced by tongue height)
F2: Second Formant (Frequency influenced by tongue front-back position)
F3: Third Formant (Frequency influenced by lip rounding)
Explanation:
1. The vocal tract consists of the oral cavity, pharynx, and larynx.
2. The tongue, a primary articulator, can be positioned at different heights and front-back positions to
create different vowel sounds.
3. F1 is primarily influenced by the height of the tongue (high or low).
4. F2 is mainly influenced by the front-back position of the tongue (front or back).
5. F3 is influenced by the degree of lip rounding (rounded or unrounded).
F1 (First Formant):
• Represents the lowest resonant frequency of the vocal tract.
• Primarily influenced by the height of the tongue.
• A high tongue position results in a lower F1 frequency (e.g., in the vowel sound /i/ as
in "see"), while a low tongue position results in a higher F1 frequency (e.g., in the
vowel sound /ɑ/ as in "father").
F2 (Second Formant):
• Represents the second resonant frequency of the vocal tract.
• Mainly influenced by the front-back position of the tongue.
• A front tongue position results in a higher F2 frequency (e.g., in the vowel sound /i/ as
in "see"), while a back tongue position results in a lower F2 frequency (e.g., in the
vowel sound /u/ as in "blue").

F3 (Third Formant):
• Represents the third resonant frequency of the vocal tract.
• Influenced by the shape of the lips.
• The degree of lip rounding affects the F3 frequency.
• Rounded lips result in a lower F3 frequency (e.g., in the vowel sound /u/ as in "blue"),
while unrounded lips result in a higher F3 frequency (e.g., in the vowel sound /i/ as in
"see").
Formants are essential for the perception of vowels in human speech. Different vowel sounds are
characterized by distinct patterns of formant frequencies, allowing us to identify and differentiate vowel
sounds in speech.
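Formant frequencies are often estimated from the angles of the complex roots of a linear-prediction (LPC) polynomial fitted to a voiced frame. The sketch below illustrates the idea; the LPC order, the 90 Hz floor used to discard near-DC poles, and the function name are ad hoc assumptions, and no pole-bandwidth test is applied.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def formants_lpc(frame, fs=8000, order=10):
    """Rough F1-F3 estimates from LPC pole angles (autocorrelation method)."""
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]      # lags 0..order
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])          # predictor coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))                       # roots of A(z)
    roots = roots[np.imag(roots) > 0]                                   # one per conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2.0 * np.pi))               # pole angles -> Hz
    return freqs[freqs > 90.0][:3]                                      # lowest resonances
```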

EVALUATION OF FORMANTS FOR VOICED AND UNVOICED SPEECH

Formants play a significant role in both voiced and unvoiced speech, but their characteristics and
behavior differ between these two types of speech sounds. Voiced speech is characterized by periodic
vocal cord vibrations, while unvoiced speech is produced without vocal cord vibration. Let's evaluate
the behavior of formants in both types of speech:
1. Voiced Speech:
✓ Formant Presence: In voiced speech, formants are present and prominent. The periodic
vocal cord vibrations create resonances in the vocal tract, resulting in clearly identifiable
formants.
✓ Stability: Formants in voiced speech are relatively stable over time because the vocal cord
vibrations maintain consistent resonances in the vocal tract.
✓ Frequency Range: The first three formants (F1, F2, and F3) are typically well-defined and
fall within specific frequency ranges associated with different vowel sounds. These formants
are essential in distinguishing and identifying vowel sounds during speech production.
✓ Spectral Envelope: The spectral envelope of voiced speech shows peaks corresponding to
the formant frequencies, indicating the resonance of the vocal tract at specific frequencies.
2. Unvoiced Speech:
✓ Formant Absence: In unvoiced speech, formants are generally less prominent or absent.
Since unvoiced sounds are produced without vocal cord vibration, there are no periodic
oscillations to create stable resonances in the vocal tract.
✓ Spectral Characteristics: Unvoiced speech is characterized by broad-spectrum noise-like
sounds. Without the presence of periodic vibrations and formants, the speech signal lacks the
distinct peaks associated with voiced speech.

✓ Spectral Continuity: The spectral envelope of unvoiced speech is relatively flat or shows
gradual changes rather than sharp peaks associated with formants.
✓ Noise-like Properties: Unvoiced speech sounds often have a turbulent or hissing quality due
to the random airflow through the constriction in the vocal tract (e.g., /s/ as in "see").

UNIT II SPEECH FEATURES AND DISTORTION MEASURES

SIGNIFICANCE OF SPEECH FEATURES IN SPEECH-BASED APPLICATIONS

✓ Speech features play a crucial role in speech-based applications, as they are fundamental in extracting relevant information from speech signals.
✓ These features are extracted from the raw speech waveform and provide a more
compact representation of the speech signal, capturing essential characteristics for
further processing and analysis.
✓ The significance of speech features lies in their ability to enable a wide range of
applications and functionalities in the field of speech signal processing and natural
language processing.
✓ Here are some key areas where speech features are essential:
1. Speech Recognition: Speech features are vital for automatic speech recognition
(ASR) systems. They help in transforming the speech signal into a format suitable
for pattern recognition algorithms, enabling the system to identify and transcribe
spoken words or phrases.
2. Speaker Identification/Verification: In speaker identification or verification
systems, speech features are used to create unique representations of individual
speakers' voices, allowing the system to distinguish between different speakers.
3. Speech Synthesis: Speech features aid in generating artificial speech signals in
text-to-speech (TTS) systems. These features provide information about the
prosody, pitch, duration, and other speech characteristics required to produce
natural-sounding synthesized speech.
4. Speech Coding/Compression: Speech features are used in speech coding and
compression algorithms to reduce the data size of speech signals while maintaining
acceptable quality.
5. Emotion Recognition: Speech features can be used to analyze and recognize
emotional content in speech, enabling applications in sentiment analysis and
emotion detection.
6. Language Identification: Speech features help in determining the language being
spoken, which is useful in multilingual speech processing applications.
7. Speech Enhancement: In speech enhancement applications, speech features are
used to distinguish between speech and noise components, allowing for noise
reduction or removal.
8. Speech-based Biometrics: Speech features are employed in biometric systems for
speaker recognition, where they aid in identifying individuals based on their voice
patterns.
✓ Commonly used speech features include Mel-frequency cepstral coefficients (MFCCs),
linear predictive coding (LPC) coefficients, pitch, formants, energy, and various
prosodic features.
✓ Overall, speech features are critical for converting raw speech data into meaningful
representations that can be used in a wide range of applications, making speech-based
technologies more accessible, efficient, and accurate.

SPEECH FEATURES - CEPSTRAL COEFFICIENTS

✓ Cepstral coefficients are features derived from the cepstral domain of a signal and are
commonly used in speech and audio processing tasks.
✓ The cepstral domain represents the inverse Fourier transform of the logarithm of the
magnitude of the Fourier transform of a signal.
✓ Cepstral coefficients are obtained by further processing the cepstrum to extract useful
information for various applications.
✓ The most widely used cepstral coefficients are Mel-frequency cepstral coefficients
(MFCCs), which are popular in speech recognition and speaker identification tasks.
✓ Here's an explanation of cepstral coefficients, with a focus on MFCCs:

1. Cepstral Domain:

As mentioned earlier, the cepstral domain is obtained by taking the inverse Fourier transform
of the logarithm of the magnitude spectrum of a signal. The process involves the following
steps:
a. Compute the Short-Time Fourier Transform (STFT) of the signal.
b. Calculate the magnitude spectrum from the STFT.
c. Take the logarithm of the magnitude spectrum.
d. Compute the inverse Fourier transform of the log magnitude to obtain the cepstrum.
2. Mel-frequency Cepstral Coefficients (MFCCs):
MFCCs are a type of cepstral coefficients that are widely used in speech processing tasks,
especially for automatic speech recognition (ASR). The MFCC extraction process involves the
following steps:
✓ Pre-emphasis: A pre-emphasis filter is applied to boost the higher frequencies of the signal to
compensate for attenuation during sound propagation.
✓ Framing: The speech signal is divided into short overlapping frames to make the analysis more
localized.
✓ Windowing: A window function (e.g., Hamming window) is applied to each frame to reduce
spectral leakage during Fourier transform computation.
✓ Fourier Transform: The Fast Fourier Transform (FFT) is applied to each framed segment to
convert the signal from the time domain to the frequency domain.
✓ Mel Filterbank: A set of triangular filters in the Mel scale is applied to the power spectrum
obtained from the Fourier transform. The Mel scale is a perceptual scale of pitches that
approximates the human auditory system's frequency response.
✓ Log Compression: The logarithm of the filterbank energies is taken to convert the values into
the logarithmic scale. This step emphasizes lower energy values, making the representation
more robust to noise and variations.
✓ Discrete Cosine Transform (DCT): Finally, the Discrete Cosine Transform is applied to the log-
filter bank energies to obtain the MFCCs. The DCT decorrelates the filterbank coefficients,
reducing the dimensionality and capturing the most relevant information.
1. Pre-processing:
Before computing the MFCCs, the speech signal undergoes some pre-processing steps, such as:
a. Pre-emphasis: The speech signal is pre-emphasized to boost higher frequencies and
compensate for attenuation during sound propagation. The pre-emphasis filter is typically
applied to the speech signal to emphasize higher frequencies:
y(t) = x(t) - α * x(t-1)
where:
y(t) is the pre-emphasized signal at time t,
x(t) is the original speech signal at time t,
α is the pre-emphasis coefficient (typically around 0.97),
x(t-1) is the speech signal at the previous time instant.
2. Framing and Windowing:
The pre-emphasized speech signal is divided into short overlapping frames, and a window
function (e.g., Hamming window) is applied to each frame to reduce spectral leakage during
Fourier transform computation.
3. Short-Time Fourier Transform (STFT):
The Fast Fourier Transform (FFT) is applied to each framed segment of the pre-emphasized
speech signal to convert the signal from the time domain to the frequency domain.
4. Mel Filterbank:
A set of triangular filters in the Mel scale is applied to the power spectrum obtained from the
Fourier transform. The Mel scale is a perceptual scale of pitches that approximates the human
auditory system's frequency response.
The Mel filterbank is created by a series of triangular filters, each spanning a specific range of
frequencies in the Mel scale. The filterbank converts the power spectrum into a set of filterbank
energies.
5. Log Compression:
The log of the filterbank energies is taken to convert the values into the logarithmic scale. This
step emphasizes lower energy values, making the representation more robust to noise and
variations:
C(i) = log(Energy(i))
where:
C(i) is the log-compressed energy of the i-th filterbank,
Energy(i) is the energy obtained from the i-th filterbank.
6. Discrete Cosine Transform (DCT):
Finally, the Discrete Cosine Transform is applied to the log-filterbank energies to obtain the
MFCCs. The DCT decorrelates the filterbank coefficients, reducing the dimensionality and
capturing the most relevant information:
MFCCs(k) = ∑ [C(i) * cos((π * k * (i - 0.5)) / num_filters)]
where:
MFCCs(k) is the k-th MFCC coefficient,
C(i) is the log-compressed energy from the i-th filterbank,
num_filters is the number of filterbanks,
k is the index of the MFCC coefficient (usually starts from 1).
The resulting MFCCs form the feature vector representing the speech signal and are commonly
used in speech recognition, speaker identification, and other speech-related tasks.
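In practice these steps are rarely coded by hand. Assuming the librosa package is available, a typical usage looks like the snippet below; the file name "speech.wav" is a placeholder.

```python
import librosa

# Load a speech file at its native sampling rate and compute 13 MFCCs per frame
y, sr = librosa.load("speech.wav", sr=None)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # array of shape (13, number_of_frames)
```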

PERCEPTUAL LINEAR PREDICTION (PLP)


✓ Perceptual Linear Prediction (PLP) is a speech processing technique that involves perceptual
weighting to improve the representation of the speech signal.
✓ The main steps of PLP analysis include pre-emphasis, frame blocking, windowing, Discrete
Fourier Transform (DFT), and perceptually motivated filtering.
✓ Below are the necessary equations for each step:
1. Pre-Emphasis:
✓ The pre-emphasis operation is usually implemented using a first-order high-pass filter.
✓ The equation for pre-emphasis is as follows:
y(t) = x(t) - α * x(t-1)

Where:
y(t) is the pre-emphasized signal at time t,
x(t) is the original speech signal at time t,
α is the pre-emphasis coefficient (typically around 0.97),
x(t-1) is the speech signal at the previous time instant.
2. Frame Blocking and Windowing:
✓ The pre-emphasized speech signal is divided into short overlapping frames, and a window
function (e.g., Hamming window) is applied to each frame.
✓ The windowing operation is given by:
w(n) = 0.54 - 0.46 * cos((2 * π * n) / (N - 1))
Where:
w(n) is the value of the window at index n,
N is the length of the window (usually the frame size).
The window function helps to taper the signal at the frame edges, reducing spectral leakage during
Fourier transform computation.
3. Discrete Fourier Transform (DFT):
✓ The Fourier transform is applied to each windowed frame to convert the signal from the time
domain to the frequency domain.
✓ The DFT equation is:

X(k) = ∑ [x(n) * exp(-j * 2 * π * k * n / N)]

Where:
X(k) is the frequency domain representation at frequency index k,
x(n) is the windowed speech signal at time index n,
N is the length of the DFT (usually the frame size).
4. Perceptually Motivated Filterbank:
✓ The perceptually motivated filterbank is used to convert the magnitude spectrum obtained from
the DFT into a perceptually relevant representation.
✓ The filterbank coefficients are designed to mimic the human auditory system's frequency
resolution and emphasis on specific frequency regions.
✓ The exact equations for the filterbank depend on the design and type of PLP analysis being
used. Typically, the filterbank is implemented as a set of triangular filters in the Mel scale or as
Bark filters.
5. Inverse Discrete Fourier Transform (IDFT):
✓ After passing the spectrum through the perceptually motivated filterbank, the inverse discrete
Fourier transform (IDFT) is applied to obtain the PLP coefficients, which are cepstral
coefficients representing the speech signal in the perceptually enhanced domain.
✓ The IDFT equation is similar to the DFT equation, but with the inverse transform:
c(n) = (1/N) * ∑ [X(k) * exp(j * 2 * π * k * n / N)]
Where:
c(n) is the PLP coefficient at cepstral index n,
X(k) is the filterbank output at frequency index k,
N is the length of the IDFT (usually the number of filterbank coefficients).
✓ The resulting PLP coefficients provide a perceptually enhanced representation of the speech
signal, which is more robust and discriminative for various speech processing tasks.
✓ These coefficients are particularly useful in automatic speech recognition (ASR) tasks, where
accurate feature extraction is crucial for recognition accuracy.
LOG FREQUENCY POWER COEFFICIENTS (LFPC)
✓ Log Frequency Power Coefficients (LFPC) is a speech processing technique that involves
extracting log-scaled power coefficients from the speech signal's frequency domain
representation.
✓ LFPC is an alternative to traditional Mel-frequency cepstral coefficients (MFCCs) and is used
in automatic speech recognition (ASR) and other speech-related tasks
✓ The LFPC analysis follows similar steps to MFCC extraction, including pre-emphasis, framing,
windowing, Discrete Fourier Transform (DFT), and power spectrum computation.
✓ However, instead of applying the Mel filterbank, LFPC employs a logarithmically spaced
filterbank to approximate the human auditory system's frequency resolution.
1. Pre-Emphasis:
Similar to other speech processing techniques, the speech signal is pre-emphasized to enhance
higher frequencies and compensate for spectral tilt.
2. Frame Blocking and Windowing:
The pre-emphasized speech signal is divided into short overlapping frames, and a window
function (e.g., Hamming window) is applied to each frame to reduce spectral leakage during
Fourier transform computation.
3. Discrete Fourier Transform (DFT):
The Fourier transform is applied to each windowed frame to convert the signal from the time
domain to the frequency domain. The resulting complex-valued spectrum represents the
frequency content of the speech signal in each frame.
4. Power Spectrum Computation:
The power spectrum is computed from the complex-valued spectrum by squaring the magnitude
of each complex value:
Power(k) = |X(k)|^2
Where:
Power(k) is the power spectrum at frequency index k,
X(k) is the complex-valued spectrum at frequency index k.
5. Logarithmically Spaced Filterbank:
✓ Unlike the Mel filterbank used in MFCC analysis, LFPC employs a logarithmically
spaced filterbank to approximate the human auditory system's frequency resolution.
✓ The logarithmically spaced filterbank is designed to have a constant bandwidth on the logarithmic frequency scale.
✓ The filterbank divides the power spectrum into a set of equally spaced frequency bins
on the logarithmic scale.
✓ The spacing between bins increases with frequency, providing better resolution at lower
frequencies, which is more in line with human auditory perception.
6. Log Compression:
✓ The LFPC coefficients are obtained by taking the logarithm of the energy in each
filterbank bin:
LFPC(k) = log(Energy(k))
Where:
LFPC(k) is the log frequency power coefficient at coefficient index k,
Energy(k) is the energy obtained from the k-th filterbank bin.
✓ The resulting LFPC coefficients provide a log-scaled representation of the speech
signal's spectral characteristics, emphasizing perceptually important frequency regions.
✓ LFPC is particularly useful in ASR tasks, where it can help capture relevant phonetic
information and improve recognition accuracy.
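A simplified sketch of the idea is shown below, assuming log-spaced rectangular bands from 100 Hz up to the Nyquist frequency; the band edges, the number of bands, and the function name are illustrative choices, and published LFPC front-ends define the bands more carefully.

```python
import numpy as np

def lfpc_frame(frame, fs=8000, n_fft=512, n_bands=20, f_lo=100.0):
    """Log Frequency Power Coefficients: log of the power in log-spaced frequency bands."""
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    edges = np.logspace(np.log10(f_lo), np.log10(fs / 2.0), n_bands + 1)   # log-spaced edges
    coeffs = np.empty(n_bands)
    for k in range(n_bands):
        in_band = (freqs >= edges[k]) & (freqs < edges[k + 1])
        coeffs[k] = np.log(np.sum(power[in_band]) + 1e-12)                 # LFPC(k) = log(Energy(k))
    return coeffs
```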

SPEECH DISTORTION MEASURES–SIMPLIFIED DISTANCE MEASURE

✓ Speech distortion measures are used to evaluate the quality of speech signals after they have
undergone various processing or transmission.
✓ A simplified distance measure is a type of distortion measure that quantifies the difference or
distance between the original (reference) speech signal and the processed (distorted) speech
signal.
✓ One common simplified distance measure used for speech distortion evaluation is the Mean
Squared Error (MSE).
✓ The Mean Squared Error calculates the average squared difference between corresponding
samples of the reference and distorted speech signals.
✓ Let's define the reference speech signal as x[n] and the distorted speech signal as y[n], where
"n" represents the sample index.
✓ The Mean Squared Error (MSE) is calculated as follows:

MSE = (1 / N) * Σ[(x[n] - y[n])^2]


Where:
MSE is the Mean Squared Error.
N is the total number of samples in the speech signal (the length of the signals).
Σ denotes the summation over all samples.
(x[n] - y[n]) is the difference between the corresponding samples of the reference and
distorted speech signals.
(x[n] - y[n])^2 is the squared difference between the samples.
✓ A lower MSE value indicates a smaller difference between the reference and distorted signals,
implying less distortion and better quality.
✓ Conversely, a higher MSE value suggests greater distortion and poorer quality.
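The measure is straightforward to compute; a minimal NumPy version (the function name is chosen here) is shown below. For example, mse(x, x) is 0, and the value grows as the processed signal departs from the reference.

```python
import numpy as np

def mse(reference, distorted):
    """Mean Squared Error between two equal-length speech signals."""
    reference = np.asarray(reference, dtype=float)
    distorted = np.asarray(distorted, dtype=float)
    return np.mean((reference - distorted) ** 2)   # (1/N) * sum of squared sample differences
```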

SPEECH DISTORTION MEASURES– LPC-BASED DISTANCE MEASURE


✓ Linear Predictive Coding (LPC) is a widely used method in speech and audio processing to
model the spectral envelope of a speech signal.
✓ LPC-based distance measures are used to compare two LPC models or LPC coefficients to
assess the similarity or distortion between speech signals.
✓ One such LPC-based distance measure is the Log Spectral Distance (LSD).
✓ Below is an explanation of the Log Spectral Distance measure:
Log Spectral Distance (LSD):
✓ LSD is an objective distance measure used to quantify the spectral distortion between two
speech signals, represented by their LPC models.
✓ It evaluates how much the log magnitude spectra of the two LPC models differ from each other.
✓ Assume we have two LPC models represented as polynomials:
Original LPC model:
H(z) = 1 / (1 - a1 * z^-1 - a2 * z^-2 - ... - ap * z^-p)

✓ Processed LPC model:


G(z) = 1 / (1 - b1 * z^-1 - b2 * z^-2 - ... - bq * z^-q)
Where p and q are the orders of the LPC models (typically around 10-20).
✓ The Log Spectral Distance (LSD) is calculated as follows:
1. Compute the log magnitude spectra of the two LPC models H(z) and G(z) over a set of
frequencies (usually represented as log-scale frequency bins).
Log magnitude spectrum of H(z):
log(|H(f)|) = log(|H(z)|) evaluated on the unit circle, z = e^(j2πf/Fs), at the chosen frequencies f
Log magnitude spectrum of G(z):
log(|G(f)|) = log(|G(z)|) evaluated on the unit circle at the same frequencies f
Here, |H(f)| and |G(f)| represent the magnitudes of H(z) and G(z) at the different frequencies
f, respectively.
2. Calculate the squared error between the log spectra of the original and processed LPC
models for each frequency bin:
Squared Error(f) = (log(|H(f)|) - log(|G(f)|))^2
3. Finally, sum up the squared errors over all frequency bins and take the square root to obtain
the Log Spectral Distance (LSD):
LSD = sqrt(1/F * sum(Squared Error(f)))
Where F is the total number of frequency bins used for evaluation.
✓ A smaller value of the LSD indicates a closer match between the two LPC models and,
therefore, less spectral distortion between the original and processed speech signals.
✓ The Log Spectral Distance (LSD) is just one of many objective measures used to evaluate
speech quality.
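A minimal sketch of the computation is given below, assuming both models are all-pole filters described by their denominator coefficient vectors [1, -a1, ..., -ap]; the uniform 256-point frequency grid and the small offsets guarding the logarithm are assumptions.

```python
import numpy as np
from scipy.signal import freqz

def log_spectral_distance(a_ref, a_proc, n_freq=256):
    """Log Spectral Distance between two all-pole LPC models H(z) = 1/A(z) and G(z) = 1/B(z)."""
    _, h = freqz([1.0], a_ref, worN=n_freq)                 # H evaluated on the unit circle
    _, g = freqz([1.0], a_proc, worN=n_freq)                # G evaluated on the unit circle
    diff = np.log(np.abs(h) + 1e-12) - np.log(np.abs(g) + 1e-12)
    return np.sqrt(np.mean(diff ** 2))                      # sqrt of the mean squared log error
```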

SPECTRAL DISTORTION MEASURE


✓ The Spectral Distortion Measure commonly used in speech signal processing is the Mean
Square Error (MSE).
✓ As mentioned earlier, the MSE calculates the average of the squared differences between
corresponding spectral values of the original and processed speech signals.
✓ Let's represent the magnitude spectra as X(f) and Y(f), respectively.
✓ Assuming we have two spectra:
Original Spectrum:
X(f) - The magnitude spectrum of the original speech signal at different frequency
bins f.
Processed Spectrum:
Y(f) - The magnitude spectrum of the processed or transmitted speech signal at the
same frequency bins f.
The Mean Square Error (MSE) is calculated as follows:
MSE = (1/N) * sum((X(f) - Y(f))^2)
where N is the total number of frequency bins or spectral samples.
Here's a step-by-step explanation of the calculation:
1. Compute the difference between the magnitude spectra of the original and processed speech
signals for each frequency bin:
Difference(f) = X(f) - Y(f)
2. Square the differences to get rid of negative values and emphasize larger discrepancies:
Squared Difference(f) = (Difference(f))^2
3. Sum up the squared differences over all frequency bins:
sum(Squared Difference(f)) = sum((X(f) - Y(f))^2)
4. Calculate the average of the squared differences by dividing the sum by the total number
of frequency bins:
MSE = (1/N) * sum(Squared Difference(f))
✓ A smaller value of MSE indicates a closer match between the spectra of the original and
processed speech signals, suggesting lower spectral distortion.
✓ On the other hand, a larger MSE value implies greater spectral distortion and a larger difference
between the spectra.
✓ The Mean Square Error is a commonly used objective measure for evaluating spectral
distortion.
✓ However, as mentioned before, there are other spectral distortion measures and objective
quality measures used in speech signal processing, and the choice of the appropriate measure
depends on the specific application and requirements.

PERCEPTUAL DISTORTION MEASURE


✓ Perceptual distortion measures are used to evaluate the quality of speech signals based on
human perception.
✓ These measures take into account the characteristics of the human auditory system and how
individuals perceive speech quality.
✓ Unlike purely objective measures such as the Mean Square Error (MSE), perceptual distortion
measures aim to align more closely with human judgment and perception of speech quality.
✓ One widely used perceptual distortion measure for speech quality evaluation is the Perceptual
Evaluation of Speech Quality (PESQ).
✓ PESQ was developed by the International Telecommunication Union (ITU-T) as an objective
measure for predicting subjective speech quality ratings.
✓ PESQ works by comparing the original (reference) and processed speech signals and simulating
the effects of transmission or processing on human perception.
✓ It considers factors such as speech intelligibility, annoying artifacts, and overall quality.
✓ The result is a numerical score that correlates with subjective ratings given by human listeners.
✓ PESQ is expressed as a Mean Opinion Score (MOS), which represents the average subjective
rating of processed speech quality on a scale from 1 to 5.
✓ A higher MOS score indicates better perceptual quality, while a lower score suggests more
perceptual distortion.
UNIT III SPEECH CODING
3.1. NEED FOR SPEECH CODING
✓ Speech coding, also known as speech compression or speech encoding, is a process of
converting analog speech signals into digital format in order to efficiently represent and
transmit the speech over digital communication channels with reduced data size.
✓ The primary goal of speech coding is to achieve high-quality speech reproduction while
minimizing the required bit rate for transmission or storage.
✓ Speech coding is essential in various applications, including telecommunications, Voice over
Internet Protocol (VoIP) systems, speech storage and transmission, voice messaging, and
speech recognition.
✓ There are two main types of speech coding:
1. Lossless Speech Coding:
• Lossless speech coding aims to compress the speech signal without any loss of
information.
• It is primarily used in applications where preserving the exact original speech
waveform is critical, such as in some telecommunication systems and voice
storage applications.
• However, lossless speech coding typically achieves lower compression ratios
compared to lossy speech coding.

2. Lossy Speech Coding:


• Lossy speech coding aims to achieve higher compression ratios by discarding
some non-critical information from the speech signal.
• While this results in some loss of audio quality, the goal is to maintain the
perceptual quality of the speech while reducing the data size significantly.
✓ Common Speech Coding Techniques:
1. Pulse Code Modulation (PCM):
• PCM is a basic form of speech coding that quantizes the speech waveform into
discrete amplitude levels and then encodes these levels into digital values. It
provides a lossless representation of speech but requires a relatively high bit
rate.
2. Adaptive Differential Pulse Code Modulation (ADPCM):
• ADPCM is a form of differential coding that predicts the next speech sample
based on previous samples and only encodes the difference between the
predicted and actual sample values. It achieves higher compression ratios
compared to PCM but introduces some loss of quality.
3. Linear Predictive Coding (LPC):
• LPC is a widely used speech coding technique that models the speech signal
using a linear prediction model and represents the residual error as a codebook.
• It achieves good compression ratios while maintaining reasonable speech
quality.
4. Code-Excited Linear Prediction (CELP):
• CELP is a popular speech coding technique that combines LPC and vector
quantization to represent the speech signal.
• It uses a codebook to represent the excitation signal, allowing higher
compression ratios and maintaining good speech quality.

3.2. WAVEFORM CODING OF SPEECH – PCM


Pulse Code Modulation (PCM) is commonly used to digitize speech signals for transmission, storage,
and processing.
Sampling: Speech signals are continuous analog waveforms. In PCM, these analog signals are
sampled at a fixed rate. Let the sampling rate be "fs" (samples per second).
Quantization: After sampling, the amplitude of each sample is quantized into a discrete set of levels.
Let "L" be the number of quantization levels.
Quantization Levels and Step Size: The step size (Δ) between quantization levels is determined by
the range of amplitudes that can be represented and the number of quantization levels. It's calculated
as:
Δ = (Max Amplitude - Min Amplitude) / (L - 1)
The greater the number of quantization levels, the finer the quantization and the better the fidelity of
the digital representation.
Quantization Function:
The quantization function maps the continuous analog amplitude to the nearest quantization level. For a speech sample "x(t)" at time "t," the quantization function can be represented as:
Q(x(t)) = Δ * floor(x(t) / Δ + 0.5)
where "floor" rounds down to the nearest integer; adding 0.5 before the floor operation makes the mapping round to the nearest level rather than always downward.

Encoding: The quantized values are then encoded into binary codes. This encoding step typically
involves representing the quantization levels using a fixed number of bits (e.g., 8 bits, 16 bits) per
sample.
Decoding: To reconstruct the analog signal from the digital representation, the decoding process
involves reversing the quantization and reconstructing the continuous waveform.
Fig. The operations of sampling and quantization
✓ The sampler produces a sequence of numbers x[n] = xc(nT), where T is the sampling period
and fs = 1/T is the sampling frequency.
✓ The quantizer simply takes the real-number inputs x[n] and assigns an output x̂[n] according to the nonlinear discrete-output mapping Q{ }.
✓ In the example of the figure shown below, the output samples are mapped to one of eight possible values, with samples within the peak-to-peak range being rounded and samples outside the range being “clipped” to either the maximum positive or negative level.

Fig. 8-level mid-tread quantizer


✓ For samples within range, the quantization error, defined as

e[n] = x̂[n] − x[n]

✓ satisfies the condition

−∆/2 < e[n] ≤ ∆/2

✓ where ∆ is the quantizer step size
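A minimal NumPy sketch of a uniform quantizer following the step-size formula above is given below; the function name, bit depth, and full-scale value are illustrative assumptions.

```python
import numpy as np

def uniform_quantize(x, n_bits=8, x_max=1.0):
    """Uniform quantizer: round each sample to the nearest of L = 2**n_bits levels."""
    levels = 2 ** n_bits
    delta = 2.0 * x_max / (levels - 1)              # step size, as in the text
    clipped = np.clip(x, -x_max, x_max)             # out-of-range samples are clipped
    codes = np.round(clipped / delta).astype(int)   # index of the nearest level
    x_hat = codes * delta                           # reconstructed (decoded) amplitude
    return codes, x_hat
```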

Adaptive Pulse Code Modulation (ADPCM)


✓ Adaptive Pulse Code Modulation (ADPCM) is an extension of the basic PCM technique that
adapts the quantization step size based on the characteristics of the input signal.
✓ ADPCM is particularly useful for compressing speech signals efficiently while maintaining
reasonable audio quality. Here's a step-by-step explanation of how Adaptive PCM works for
speech signals:
1. Initialization:
• Define an initial quantization step size, denoted as Δ₀.
• Set up an initial predicted value for the current sample, denoted as P₀.
2. Prediction and Residual Calculation:
• Calculate the difference between the actual sample and the predicted value to obtain
the prediction residual:

R₀ = x₀ - P₀
3. Quantization of Residual:
• Quantize the prediction residual R₀ using the current step size Δ₀ to obtain a quantized
residual value Q₀.
4. Encoding:
• Encode the quantized residual Q₀ using a fixed number of bits.
5. Decoding and Reconstruction:
• Decode the encoded quantized residual to obtain the quantized residual value Q₀.
• Reconstruct the quantization level by scaling the quantized residual with the step size
Δ₀:
• R₀' = Q₀ * Δ₀
6. Prediction Update:
• Update the predicted value for the next sample based on the reconstructed quantization
level R₀':
P₁ = P₀ + R₀'

7. Adaptation of Step Size:


• Calculate a new step size Δ₁ based on the current quantization error (difference between
the original sample and the reconstructed sample):
Δ₁ = f(Δ₀, R₀, x₀)
• The function f() adjusts the step size based on the characteristics of the signal. A common approach is to increase the step size when the quantized residuals are repeatedly large (the quantizer is saturating) and to decrease it when they are small, as is done in adaptive differential PCM (ADPCM) coders.
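The loop below is a toy illustration of these steps, not the algorithm of any particular standard; the bit width, initial step size, adaptation factors of 1.5 and 0.9, and the step-size limits are all made-up values.

```python
import numpy as np

def adaptive_pcm_encode_decode(x, n_bits=4, delta0=0.05):
    """Toy adaptive differential coder: predict, quantize the residual, adapt the step size."""
    q_max = 2 ** (n_bits - 1) - 1
    delta, pred = delta0, 0.0
    reconstructed = np.zeros(len(x))
    for i, sample in enumerate(x):
        residual = sample - pred                                   # prediction residual R
        code = int(np.clip(np.round(residual / delta), -q_max - 1, q_max))
        r_hat = code * delta                                       # dequantized residual R'
        pred = pred + r_hat                                        # prediction update P
        reconstructed[i] = pred
        # Step-size adaptation: grow when the quantizer saturates, shrink otherwise
        delta = delta * 1.5 if abs(code) >= q_max else delta * 0.9
        delta = float(np.clip(delta, 1e-4, 1.0))
    return reconstructed
```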
3.8. G.726 STANDARD FOR ADPCM
✓ G.726 is an ITU-T standard that defines a particular form of Adaptive Differential Pulse Code
Modulation (ADPCM) for speech coding.
✓ The G.726 standard specifies four bit rates for encoding speech signals: 16 kbps, 24 kbps, 32
kbps, and 40 kbps.
✓ These bit rates offer different trade-offs between speech quality and data compression.
✓ The main features of the G.726 standard for ADPCM include:
1. ADPCM Coding:
• ADPCM is a form of speech coding that predicts the next speech sample based on the
previous sample and only encodes the difference (residual) between the predicted and
actual sample values.
• This difference is then quantized and encoded using a small number of bits. ADPCM
achieves higher compression ratios compared to PCM while maintaining reasonable
speech quality.
2. Adaptive Quantization:
• G.726 ADPCM uses adaptive quantization, where the step size for quantization is
adjusted based on the characteristics of the input speech signal.
• During periods of high signal activity, the step size is reduced to provide higher
precision and better speech quality.
• Conversely, during less active periods, the step size is increased to achieve higher
compression and reduce the bit rate.
3. Four Bit Rates:
• G.726 offers four bit rates for speech encoding:
• 40 kbps: Highest speech quality of the four modes; also used for carrying voice-band data such as modem signals.
• 32 kbps: The most widely used mode, providing toll-quality speech for general telephony and voice communications.
• 24 kbps: Reduced speech quality with higher compression, used when channel capacity is limited.
• 16 kbps: Lowest speech quality and highest compression, suitable for applications with severe bandwidth constraints.
4. Sample Rate Reduction:
• In addition to ADPCM coding, G.726 also includes a sample rate reduction option for
reducing the sampling rate of the speech signal.
• The reduced sampling rate helps in further reducing the bit rate.
• G.726 is widely used in various telecommunication systems, voice over packet
networks, and digital voice storage applications.
• It strikes a good balance between speech quality and data compression, making it
suitable for a wide range of communication systems with varying bandwidth
requirements.

Speech coding - Delta Modulation

✓ Delta modulation is a simple form of analog-to-digital conversion used in speech and audio
coding.
✓ It's a method that encodes the difference between consecutive samples of an analog signal rather
than encoding the absolute values of each sample.
✓ This can help reduce the amount of data needed to represent the signal, making it suitable for
low-bitrate communication channels.
✓ Let's denote the input analog signal as x(t), and the sampled version of the signal as x[n], where
n is the discrete time index corresponding to the sample.
1. Sampling:
x[n] = x(nTs), where Ts is the sampling interval.
2. Prediction and Encoding:
The predicted value of the next sample, xp[n+1], is simply the most recently reconstructed
(staircase) sample:
xp[n+1] = x̂[n]
✓ The difference between the actual next sample x[n+1] and the predicted value is calculated as:
e[n+1] = x[n+1] − xp[n+1]
The encoder then transmits a single bit b[n+1] = sign(e[n+1]); the corresponding delta value
is d[n+1] = ∆ · b[n+1], where ∆ is the fixed step size.
3. Decoding:

✓ The reconstructed sample value is then calculated as the predicted value plus the delta value:
x̂[n+1] = xp[n+1] + d[n+1]
✓ Finally, the reconstructed sample is used to predict the value for the next sample in the
sequence:
xp[n+2] = x̂[n+1]
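
✓ A minimal Python sketch of the encoder and decoder described above (NumPy assumed; the step
size and the test signal are illustrative):

    import numpy as np

    def delta_mod_encode(x, step=0.05, x0=0.0):
        """1-bit delta modulation: transmit only the sign of the prediction error."""
        pred = x0
        bits = []
        for sample in x:
            bit = 1 if sample >= pred else -1   # b[n+1] = sign(e[n+1])
            bits.append(bit)
            pred += bit * step                  # staircase approximation tracks the input
        return bits

    def delta_mod_decode(bits, step=0.05, x0=0.0):
        """Reconstruct by accumulating +/- step for each received bit."""
        pred = x0
        out = []
        for bit in bits:
            pred += bit * step
            out.append(pred)
        return np.array(out)

    # A slowly varying signal is tracked well; a faster one shows slope overload.
    t = np.arange(0, 0.02, 1 / 8000)
    x = 0.5 * np.sin(2 * np.pi * 100 * t)
    y = delta_mod_decode(delta_mod_encode(x))
    print("mean absolute error:", np.mean(np.abs(x - y)))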
Adaptive Delta Modulation
✓ Adaptive Delta Modulation (ADM) is an enhancement over basic delta modulation that
dynamically adjusts the step size to better accommodate the characteristics of the input signal.
✓ This helps mitigate some of the limitations of basic delta modulation, such as granularity noise
and limited dynamic range.
✓ A common adaptation rule compares the current output bit with the previous one: if they are
equal (slope overload), the step size is increased; if they differ (granular noise), it is decreased:
∆[n+1] = K · ∆[n] if b[n+1] = b[n], and ∆[n+1] = ∆[n] / K otherwise, where K > 1 is the
adaptation factor. A minimal code sketch of this rule is given below.
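
✓ A minimal Python sketch of adaptive delta modulation using the adaptation rule above (NumPy
assumed; K, the step-size limits and the test signal are illustrative choices):

    import numpy as np

    def adaptive_delta_mod(x, step0=0.01, K=1.5, step_min=1e-4, step_max=0.5):
        """Adaptive delta modulation: grow the step size by K when consecutive bits
        agree (slope overload), shrink it by K when they differ (granular noise).
        Returns the bit stream and the local reconstruction."""
        pred, step, prev_bit = 0.0, step0, 1
        bits, recon = [], []
        for sample in x:
            bit = 1 if sample >= pred else -1
            step = step * K if bit == prev_bit else step / K     # step-size adaptation rule
            step = float(np.clip(step, step_min, step_max))
            pred += bit * step
            bits.append(bit)
            recon.append(pred)
            prev_bit = bit
        return bits, np.array(recon)

    # Compare tracking of a faster sinusoid with what fixed-step delta modulation would give.
    t = np.arange(0, 0.02, 1 / 8000)
    x = 0.5 * np.sin(2 * np.pi * 300 * t)
    _, y = adaptive_delta_mod(x)
    print("mean absolute error:", np.mean(np.abs(x - y)))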
3.9. PARAMETRIC SPEECH CODING
✓ Parametric speech coding is a type of speech coding technique that represents speech signals
using a set of model parameters instead of directly encoding the speech waveform.
✓ The primary goal of parametric speech coding is to achieve high compression ratios while
maintaining acceptable speech quality by capturing the essential characteristics of the speech
using a compact set of parameters.
✓ In parametric speech coding, the speech signal is typically modeled using linear prediction
analysis, and the model parameters are quantized and transmitted or stored to be later decoded
and reconstructed at the receiver end.
✓ The main steps involved in parametric speech coding are as follows:
1. Linear Prediction Analysis:
• The speech signal is analyzed using linear prediction analysis (LPC) to
estimate the spectral envelope of the speech signal.
• LPC models the speech signal as a linear combination of past samples, and
the model coefficients represent the spectral envelope of the speech signal.
• In linear prediction analysis, the speech signal is modeled as a linear
combination of past samples. The model can be represented as follows:
x(n) = Σ[i=1 to p] a(i) * x(n-i),
• where:
• x(n) is the current sample of the speech signal.
• a(i) are the linear prediction coefficients, also known as LPC coefficients.
• p is the prediction order, which determines the number of past samples
used in the prediction.
2. Model Parameter Extraction:
• From the LPC analysis, a set of model parameters is extracted, which
typically includes the LPC coefficients, pitch period, and other parameters
describing the excitation signal, such as voicing information.
3. Quantization:
• The model parameters are quantized to reduce the number of bits required
for representation.
• Quantization involves mapping the continuous parameter values to a finite
set of discrete values, reducing the data size for transmission or storage.
4. Encoding:
• The quantized model parameters are encoded into digital format using a
binary representation. The encoded parameters are transmitted or stored
for later reconstruction.
5. Decoding and Synthesis:
• At the receiver end, the encoded model parameters are decoded, and the
speech signal is synthesized using the LPC model and the decoded
parameters.
• The synthesis process involves generating the excitation signal from the
decoded parameters and passing it through the LPC filter to reconstruct
the speech waveform.
• The synthesis equation can be represented as follows:
x(n) = Σ[i=1 to p] a_hat(i) * x(n-i) + excitation_hat(n),
• where x(n) is the reconstructed speech signal at time index n, a_hat(i) are
the decoded LPC coefficients, excitation_hat(n) represents the decoded
excitation signal, and p is the prediction order.
• The excitation signal, excitation_hat(n), can be generated using various
techniques, such as codebooks, pulse trains, or stochastic models,
depending on the specific parametric speech coding algorithm.
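✓ A minimal Python sketch of the linear prediction analysis in step 1 above, using the
autocorrelation method and the Levinson-Durbin recursion (NumPy assumed; the frame length,
prediction order and synthetic test signal are illustrative):

    import numpy as np

    def levinson_durbin(r, order):
        """Solve the Toeplitz normal equations for the LP error filter
        A(z) = 1 + a(1) z^-1 + ... + a(p) z^-p. Returns (a, prediction error power)."""
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err                      # reflection coefficient
            new_a = a.copy()
            new_a[1:i] = a[1:i] + k * a[i - 1:0:-1]
            new_a[i] = k
            a, err = new_a, err * (1.0 - k * k)
        return a, err

    def lpc_analysis(frame, order=10):
        """Autocorrelation-method LPC analysis of one speech frame. The predictor
        coefficients in the notes' convention, x(n) ~ sum a(i) x(n-i), are the
        negated error-filter taps: a_pred = -a[1:]."""
        w = frame * np.hamming(len(frame))                      # analysis window
        r = np.correlate(w, w, mode="full")[len(w) - 1:len(w) + order]
        a, err = levinson_durbin(r, order)
        return -a[1:], err

    # Analyse a synthetic AR(2) frame and inspect the estimated coefficients.
    rng = np.random.default_rng(0)
    x = np.zeros(400)
    for n in range(2, 400):                     # x(n) = 1.3 x(n-1) - 0.6 x(n-2) + noise
        x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + rng.standard_normal()
    coeffs, err = lpc_analysis(x, order=2)
    print("estimated predictor coefficients:", coeffs)   # expected near [1.3, -0.6]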
✓ Advantages of Parametric Speech Coding:
• High Compression Ratios: Parametric speech coding can achieve high
compression ratios, significantly reducing the bit rate required for speech
transmission or storage.
• Low Data Size: The compact representation of speech using model
parameters results in smaller data sizes, making it suitable for low-
bandwidth communication systems.
• Speech Quality: Despite the high compression, parametric speech coding
can preserve acceptable speech quality, especially at higher bit rates.
• Robustness: Parametric speech coding is often more robust to
transmission errors and noise compared to waveform coding techniques.
✓ Disadvantages of Parametric Speech Coding:
• Complexity: The encoding and decoding process can be computationally
complex, requiring additional processing resources.
• Sensitivity to Errors: In certain scenarios, parametric coding may be
more sensitive to transmission errors or losses compared to waveform
coding.
✓ Parametric speech coding has been used in various speech coding standards and applications,
including telecommunications, VoIP systems, and voice storage applications. Examples of
parametric speech coding standards include MELP (Mixed Excitation Linear Prediction) and
some modes of the ITU-T G.723 and G.729 codecs.

3.10. CHANNEL VOCODERS

✓ Channel vocoders are a class of speech coding algorithms used to encode and transmit speech
signals over band-limited communication channels.
✓ They are particularly suited for low-bit-rate applications, where the available channel
bandwidth is limited, such as in mobile communication systems, internet telephony (VoIP), and
satellite communications.
✓ The primary objective of channel vocoders is to reduce the bit rate required to transmit speech
while maintaining acceptable speech quality.
✓ They achieve this by exploiting the characteristics of the human vocal tract and the perceptual
properties of human hearing.
Fig. Channel Vocoder

✓ Channel vocoders work based on the following principles:


1. Source-Filter Model:
• Channel vocoders use a source-filter model to represent speech signals.
• The vocal tract can be represented as a fixed linear filter (the vocal tract filter)
through which an excitation signal (the source) passes.
• The excitation signal models the speech source, which includes the voicing
information and speech dynamics.
2. Analysis and Synthesis:
• The channel vocoder first analyzes the input speech signal to estimate the
parameters of the vocal tract filter and the excitation signal.
• These parameters are then quantized and transmitted over the channel. At the
receiver end, the vocoder synthesizes the speech signal using the received
parameters to reconstruct the speech waveform.
3. Parameter Quantization:
• To achieve compression, the vocoder quantizes the estimated parameters
before transmission.
• The quantization process involves mapping the continuous parameter values to
a limited set of discrete values.
• Higher compression ratios can be achieved by using fewer bits for
quantization, but this may result in a loss of speech quality.
4. Perceptual Coding:
• Channel vocoders often use perceptual coding techniques that take advantage
of the human auditory system's limited ability to perceive certain details.
• By prioritizing perceptually important features of speech and reducing the
representation of less critical components, channel vocoders can achieve
higher compression without a significant perceived loss in speech quality.
5. Trade-off between Bit Rate and Quality:
• Channel vocoders typically allow users to choose the desired bit rate, which
influences the speech quality.
• Lower bit rates lead to higher compression and lower speech quality, while
higher bit rates result in improved speech quality but require more data for
transmission.

3.1.1. LINEAR PREDICTION BASED VOCODERS


✓ Linear Prediction (LP)-based vocoders are a class of speech coding algorithms that use linear
prediction analysis to model the spectral characteristics of the speech signal.
✓ These vocoders typically use a source-filter model to represent the speech signal, where the
vocal tract is considered as a fixed linear filter, and the excitation signal models the source.
✓ The key equations involved in LP-based vocoders are related to linear prediction analysis,
synthesis filtering, and quantization of the model parameters.
✓ Here's an overview of the main equations used in LP-based vocoders:
1. Linear Prediction Analysis (LP Analysis):
• In LP-based vocoders, the speech signal is modeled as a linear combination of past
samples using the following equation:
x(n) = Σ[i=1 to p] a(i) * x(n-i),
where:
• x(n) is the current sample of the speech signal.
• a(i) are the LP coefficients, also known as LPC coefficients.
• p is the prediction order, which determines the number of past samples used in
the prediction.
• The LP coefficients, a(i), are estimated using various methods, such as the
autocorrelation method or the Levinson-Durbin recursion.
2. Excitation Signal Generation:
• The excitation signal is the source that drives the vocal tract filter to produce the speech
signal. In LP-based vocoders, the excitation signal can be generated using different
methods, such as:
• a) Pitch Periodic Pulse Train: A periodic pulse train with a pitch period (T0) represents
voiced sounds.
• b) White Noise: Random white noise represents unvoiced sounds.
• c) Mixed Excitation: A combination of periodic pulse train and white noise for mixed
excitation.
• The excitation signal can be represented as:
excitation(n) = pitch_excitation(n) + noise_excitation(n),
where pitch_excitation(n) and noise_excitation(n) represent the pitch and noise
components of the excitation signal, respectively.
3. Synthesis Filtering:
• The synthesis filtering process involves passing the excitation signal through the vocal
tract filter (modeled by the LPC coefficients) to reconstruct the speech waveform. The
synthesis filter equation can be represented as:
y(n) = Σ[i=1 to p] a(i) * y(n-i) + excitation(n),
where:
• y(n) is the reconstructed speech signal at time index n.
• a(i) are the LPC coefficients.
• excitation(n) is the excitation signal at time index n.
4. Quantization:

• The LP coefficients and other model parameters (e.g., pitch period, excitation type) are
quantized to reduce the number of bits required for their representation.
• Quantization involves mapping the continuous parameter values to a finite set of
discrete values.
5. Decoding and Synthesis:
• At the receiver end, the quantized model parameters are decoded, and the speech signal
is synthesized using the LP synthesis filter and the decoded parameters.
• The synthesis process involves generating the excitation signal from the decoded
parameters and passing it through the LP filter to reconstruct the speech waveform.
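
✓ A minimal Python sketch of the decoding/synthesis stage described above: an excitation signal
(a pulse train for voiced frames, white noise for unvoiced frames) is passed through the all-pole
LP synthesis filter (NumPy and SciPy assumed; the decoded parameter values shown are hypothetical):

    import numpy as np
    from scipy.signal import lfilter

    def lp_synthesize(lpc_pred, gain, n_samples, pitch_period=None, seed=0):
        """Synthesize one frame from decoded LP parameters. lpc_pred are the predictor
        coefficients a(i) in the convention y(n) = sum a(i) y(n-i) + g * excitation(n);
        pitch_period=None selects the unvoiced (noise) excitation."""
        rng = np.random.default_rng(seed)
        if pitch_period is None:
            excitation = rng.standard_normal(n_samples)        # white noise (unvoiced)
        else:
            excitation = np.zeros(n_samples)                   # periodic pulse train (voiced)
            excitation[::pitch_period] = 1.0
        # All-pole synthesis filter 1/A(z): filter numerator [g], denominator [1, -a(1), ..., -a(p)].
        a_filter = np.concatenate(([1.0], -np.asarray(lpc_pred)))
        return lfilter([gain], a_filter, excitation)

    # Hypothetical decoded parameters: a voiced frame at 8 kHz with a 100 Hz pitch
    # (period of 80 samples) and a 2nd-order vocal-tract model.
    frame = lp_synthesize(lpc_pred=[1.3, -0.6], gain=0.5, n_samples=240, pitch_period=80)
    print(frame[:5])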

Code-Excited Linear Prediction (CELP)


✓ Code-Excited Linear Prediction (CELP) is a widely used class of speech coding algorithms that
provides high-quality speech coding at low bit rates.
✓ CELP-based vocoders are known for their efficiency in compressing speech signals while
maintaining good speech quality, making them suitable for various communication systems
with limited bandwidth.
✓ The main idea behind CELP-based vocoders is to represent the excitation signal using a
codebook containing a set of pre-recorded excitation vectors. These excitation vectors are used
to reconstruct the speech signal by shaping them through a linear predictive filter, which models
the spectral characteristics of the speech.
✓ Codebook Search: During encoding, the CELP vocoder performs a codebook search to find
the best excitation vector that closely matches the residual (difference between the original
speech and the filtered speech) of the current speech segment. The selected excitation vector is
then quantized and transmitted to the decoder.
✓ Linear Predictive Coding (LPC): CELP vocoders utilize linear predictive coding to model
the spectral characteristics of the speech signal. LPC coefficients are estimated at both the
encoder and decoder to create the linear predictive filter. The LPC filter shapes the excitation
vector to approximate the original speech signal.
✓ Long-Term Prediction (Pitch Prediction): CELP also incorporates long-term prediction to
capture the periodicity of voiced speech segments. The pitch period is estimated, and the
excitation vectors are adjusted to match the pitch period, which improves the quality of the
reconstructed speech for voiced sounds.
✓ Low Bit Rates: CELP-based vocoders can achieve high compression ratios at low bit rates,
typically ranging from 4 kbps to 16 kbps. This makes them suitable for applications with limited
bandwidth, such as mobile communication and voice over packet networks.
✓ Perceptual Optimization: CELP vocoders employ perceptual optimization techniques to
prioritize perceptually important features of speech and allocate more bits to those components.
This ensures that the most critical aspects of speech quality are preserved.
✓ Examples of CELP-based vocoders include the ITU-T G.729 and G.728 codecs, which have
been widely used in various communication systems for speech coding applications.
✓ Code-Excited Linear Prediction (CELP) is a speech coding algorithm that uses linear prediction
(LP) analysis and a codebook of excitation vectors to efficiently represent speech signals.
CELP-based vocoders achieve high compression ratios while maintaining good speech quality.
Let's explore the key equations involved in CELP-based vocoders:
1. Linear Prediction Analysis (LP Analysis):
As mentioned before, LPC analysis models the speech signal as a linear combination of past
samples using the following equation:
x(n) = Σ[i=1 to p] a(i) * x(n-i),
where:
• x(n) is the current sample of the speech signal.
• a(i) are the LPC coefficients, also known as LPC prediction coefficients.
• p is the prediction order, which determines the number of past samples used in the
prediction.
2. In CELP-based vocoders, the excitation signal is generated using a codebook that contains a set
of pre-recorded excitation vectors (codewords).
The codebook can be represented as C = {c(1), c(2), ..., c(N)}, where c(i) is the ith excitation
vector.
During encoding, the codebook search finds the best excitation vector, c_best, that minimizes
the distortion between the original speech and the reconstructed speech.
The excitation signal for the current frame can be represented as:
excitation(n) = c_best(n),
where
• excitation(n) is the excitation signal at time index n.
• c_best(n) is the selected excitation vector from the codebook at time index n.

3. Linear Predictive Synthesis Filtering:


The synthesis filtering process involves passing the excitation signal through the LPC filter to
reconstruct the speech waveform.
The synthesis filter equation can be represented as:
y(n) = Σ[i=1 to p] a(i) * y(n-i) + excitation(n),
where:
• y(n) is the reconstructed speech signal at time index n.
• a(i) are the LPC coefficients.
• excitation(n) is the excitation signal at time index n.
4. Quantization:
The excitation vector c_best(n) is quantized to reduce the number of bits required for its
representation.
During quantization, the index i_best, corresponding to the selected excitation vector c_best(n),
is transmitted.
5. Codebook Search and Decoding:
At the decoder, the index i_best received during quantization is used to retrieve the excitation
vector c_best(n) from the codebook.
The LPC synthesis filter is then applied to reconstruct the speech signal using the retrieved
excitation vector.
The closed-loop nature of CELP involves an iterative process of codebook search and excitation
vector refinement to minimize the distortion between the original and reconstructed speech.
CELP-based vocoders, such as the ITU-T G.729 codec, have been widely used in various
communication systems for speech coding applications due to their efficiency in low-bit-rate
scenarios while maintaining acceptable speech quality.
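
✓ A minimal Python sketch of the codebook search and synthesis filtering described above (NumPy
and SciPy assumed; the random Gaussian codebook, its size, and the omission of the adaptive
codebook and perceptual weighting are simplifications for illustration only):

    import numpy as np
    from scipy.signal import lfilter

    def celp_codebook_search(target, codebook, lpc_pred):
        """Analysis-by-synthesis search: filter each codeword through the LP synthesis
        filter, find the optimal gain, and keep the codeword minimizing the squared
        error against the target frame."""
        a_filter = np.concatenate(([1.0], -np.asarray(lpc_pred)))
        best = (None, 0.0, np.inf)                       # (index, gain, error)
        for idx, codeword in enumerate(codebook):
            synth = lfilter([1.0], a_filter, codeword)   # shaped excitation
            gain = np.dot(target, synth) / (np.dot(synth, synth) + 1e-12)
            err = np.sum((target - gain * synth) ** 2)
            if err < best[2]:
                best = (idx, gain, err)
        return best

    # Toy data: a random Gaussian codebook and a synthetic target frame.
    rng = np.random.default_rng(1)
    codebook = rng.standard_normal((64, 40))             # 64 codewords of 40 samples
    lpc_pred = [1.3, -0.6]
    target = lfilter([1.0], [1.0, -1.3, 0.6], rng.standard_normal(40))
    index, gain, error = celp_codebook_search(target, codebook, lpc_pred)
    print("best codeword:", index, "gain: %.2f" % gain, "error: %.3f" % error)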

Sinusoidal speech coding techniques:


1. Sinusoidal Analysis:
✓ In sinusoidal analysis, the voiced speech signal is decomposed into a sum of sinusoids.
The speech signal can be modelled as follows:
x(n) = Σ[i=1 to M] A(i) * cos[2π * f(i) * n + φ(i)],
where:
x(n) is the speech signal at time index n.
M is the number of sinusoidal components used to represent the speech segment.
A(i) is the amplitude of the ith sinusoidal component.
f(i) is the frequency (in Hz) of the ith sinusoidal component.
φ(i) is the phase (in radians) of the ith sinusoidal component.
2. Parameter Extraction:
✓ During sinusoidal analysis, the goal is to estimate the parameters A(i), f(i), and φ(i) for
each sinusoidal component. Various algorithms, such as the Short-Time Fourier
Transform (STFT) or the Harmonic Model Algorithm (HMA), are used to perform the
sinusoidal analysis and extract these parameters.
3. Quantization:
✓ The extracted sinusoidal parameters A(i), f(i), and φ(i) are typically quantized to reduce
the number of bits required for their representation.
✓ Quantization involves mapping the continuous parameter values to a finite set of
discrete values.
✓ The quantized sinusoidal parameters A(i), f(i), and φ(i) are encoded into digital format
using a binary representation.
✓ The encoded parameters can then be transmitted or stored for later synthesis.
4. Decoding and Synthesis:
✓ At the receiver end, the encoded sinusoidal parameters are decoded, and the speech
signal is synthesized using sinusoidal synthesis.
✓ The sinusoidal synthesis process involves summing the sinusoidal components with
their corresponding frequencies, amplitudes, and phases to reconstruct the voiced
speech segment.
✓ The synthesis equation can be represented as follows:
x(n) = Σ[i=1 to M] A_hat(i) * cos[2π * f_hat(i) * n + φ_hat(i)],
where:
x(n) is the reconstructed speech signal at time index n.
A_hat(i), f_hat(i), and φ_hat(i) are the decoded and quantized values of amplitude,
frequency, and phase for the ith sinusoidal component.
✓ Sinusoidal speech coding techniques are known for their ability to provide high-quality
speech reproduction, particularly for voiced speech segments.
✓ They are commonly used in speech synthesis applications, where they allow for the
generation of natural-sounding speech by combining sinusoidal components to mimic
the characteristics of the human vocal tract.
✓ The sinusoidal synthesis process effectively reconstructs the original speech signal
from the encoded sinusoidal parameters, providing high-fidelity speech reproduction.
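
✓ A minimal Python sketch of sinusoidal analysis and synthesis for a single frame (NumPy assumed;
simple peak picking without frequency interpolation or frame-to-frame tracking, so it is only a
rough illustration of the model above):

    import numpy as np

    def sinusoidal_analysis(frame, fs, num_sines=5):
        """Estimate amplitudes, frequencies and phases of the strongest spectral
        peaks of one windowed frame."""
        n = len(frame)
        win = np.hanning(n)
        spectrum = np.fft.rfft(frame * win)
        mags = np.abs(spectrum)
        peaks = np.argsort(mags)[-num_sines:]            # indices of the largest bins
        freqs = peaks * fs / n
        amps = 2.0 * mags[peaks] / np.sum(win)           # compensate window gain
        phases = np.angle(spectrum[peaks])
        return amps, freqs, phases

    def sinusoidal_synthesis(amps, freqs, phases, n, fs):
        """Reconstruct x(n) = sum A(i) cos(2*pi*f(i)*n/fs + phi(i))."""
        t = np.arange(n) / fs
        return sum(A * np.cos(2 * np.pi * f * t + p) for A, f, p in zip(amps, freqs, phases))

    # Analyse and resynthesize a two-tone "voiced" frame.
    fs, n = 8000, 400
    t = np.arange(n) / fs
    x = 0.8 * np.cos(2 * np.pi * 200 * t) + 0.3 * np.cos(2 * np.pi * 400 * t + 0.5)
    y = sinusoidal_synthesis(*sinusoidal_analysis(x, fs, num_sines=2), n=n, fs=fs)
    print("reconstruction SNR (dB): %.1f" % (10 * np.log10(np.sum(x**2) / np.sum((x - y)**2))))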

HYBRID SPEECH CODER

✓ A hybrid speech coder is a type of speech coding algorithm that combines the advantages of
different coding techniques to achieve high-quality speech coding at low bit rates.
✓ Hybrid coders use a combination of waveform coding and model-based coding methods to
efficiently represent speech signals.
✓ The main idea behind hybrid coding is to use waveform coding for unvoiced speech segments,
which typically have less predictable and more complex waveforms, and model-based coding
for voiced speech segments, which have more regular and periodic characteristics.
✓ By exploiting the strengths of both approaches, hybrid coders can achieve better speech quality
and compression performance compared to individual coding methods alone.
✓ Here are the key components and features of a typical hybrid coder:
1. Voiced-Unvoiced Decision: At the encoder, the speech signal is analyzed to determine
whether a given segment is voiced (contains periodic speech) or unvoiced (contains non-
periodic speech). This decision is essential for selecting the appropriate coding method for
each segment.
2. Model-Based Coding for Voiced Segments: For voiced segments, model-based coding
techniques like Code-Excited Linear Prediction (CELP) or Sinusoidal Coding are used.
These techniques effectively represent the periodic and quasi-periodic characteristics of
voiced speech, providing efficient coding and good speech quality.
3. Waveform Coding for Unvoiced Segments: For unvoiced segments, waveform coding
techniques such as Differential Pulse Code Modulation (DPCM) or Adaptive Differential
Pulse Code Modulation (ADPCM) are used. These techniques are better suited for
capturing the complex and non-periodic nature of unvoiced speech segments.
4. Low Bit Rates: Hybrid coders are typically designed to operate at low bit rates, ranging
from a few kilobits per second to tens of kilobits per second. This makes them suitable for
applications with limited bandwidth, such as mobile communication and voice over packet
networks.
5. Seamless Switching: A critical aspect of hybrid coding is the seamless switching between
the waveform coding and model-based coding methods. The coder must accurately
determine the optimal transition points between the two coding techniques to ensure smooth
transitions in the reconstructed speech.

Transform domain coding of speech


✓ Transform domain coding of speech is a speech coding technique that represents and
compresses speech signals in a transformed domain instead of the time domain.
✓ In this approach, the speech signal is transformed into a different domain using mathematical
transforms, and then the transformed coefficients are quantized and encoded to achieve data
compression.
✓ The main advantage of transform domain coding is its ability to concentrate the speech energy
into a few transformed coefficients, enabling higher compression ratios with minimal loss of
speech quality. Some of the commonly used transforms in speech coding include the Discrete
Fourier Transform (DFT) and the Discrete Cosine Transform (DCT).
✓ The typical steps involved in transform domain coding of speech are as follows:
1. Frame Segmentation: The speech signal is divided into short frames of typically 10-30
milliseconds (N samples each), so that each frame can be processed independently.
Let's denote the speech signal as x(n), where n represents the sample index. A frame can
be represented as x_frame(k), where k represents the sample index within the frame:
x_frame(k) = x(n + k),
where k = 0, 1, 2, ..., N-1.
2. Windowing: A window function is applied to each speech frame to reduce spectral
leakage. The Hamming window is commonly used for this purpose. The windowed speech
frame is denoted as xw_frame(k).
xw_frame(k) = x_frame(k) * w(k),
where w(k) is the Hamming window function.
3. Discrete Fourier Transform (DFT): The DFT transforms the windowed speech frame
xw_frame(k) from the time domain to the frequency domain. The DFT equation is as
follows:
X_frame(m) = Σ[k=0 to N-1] xw_frame(k) * exp(-j * 2π * m * k / N),
where X_frame(m) represents the DFT coefficients of the frame in the frequency
domain, and m = 0, 1, 2, ..., N-1 represents the frequency index.
4. Quantization: The DFT coefficients X_frame(m) obtained from the DFT step are
quantized to reduce the number of bits required for their representation. Quantization
involves mapping the continuous coefficient values to a finite set of discrete values.
5. Encoding and Transmission: The quantized DFT coefficients are encoded into digital
format using a binary representation. The encoded coefficients are transmitted or stored for
later synthesis.
6. Decoding and Inverse Transform: At the receiver end, the encoded DFT coefficients are
decoded, and the inverse DFT (IDFT) is applied to reconstruct the speech signal in the time
domain. The IDFT equation is as follows:
xw_frame(k) = (1/N) * Σ[m=0 to N-1] X_frame(m) * exp(j * 2π * m * k / N),
where xw_frame(k) represents the reconstructed windowed speech frame in the time
domain, and k = 0, 1, 2, ..., N-1 represents the sample index within the frame.
7. The reconstructed speech frame x_frame(k) is obtained by removing the window
function:
x_frame(k) = xw_frame(k) / w(k).
The above process is repeated for each frame of the speech signal to reconstruct the entire
speech signal in the time domain.
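
✓ A minimal Python sketch of the per-frame procedure above: window, DFT, uniform quantization of
the coefficients, inverse DFT and window removal (NumPy assumed; the quantization step size is an
illustrative choice):

    import numpy as np

    def transform_code_frame(frame, step=0.01):
        """Code one frame: window, DFT, uniform quantization of the (complex)
        coefficients, inverse DFT, and removal of the analysis window."""
        n = len(frame)
        w = np.hamming(n)
        X = np.fft.fft(frame * w)                  # analysis transform
        Xq = step * np.round(X / step)             # uniform quantization of real/imag parts
        xw = np.real(np.fft.ifft(Xq))              # inverse transform
        return xw / w                              # undo the analysis window

    # Process one 20 ms frame of a synthetic signal and measure the error.
    fs = 8000
    t = np.arange(160) / fs
    x = 0.5 * np.sin(2 * np.pi * 300 * t) + 0.1 * np.sin(2 * np.pi * 1100 * t)
    y = transform_code_frame(x, step=0.01)
    print("max reconstruction error: %.4f" % np.max(np.abs(x - y)))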
UNIT IV SPEECH ENHANCEMENT

4. CLASSES OF SPEECH ENHANCEMENT ALGORITHMS

✓ Speech enhancement algorithms are designed to improve the quality and intelligibility of
speech signals in the presence of noise or other distortions.
✓ These algorithms can be categorized into several classes based on their underlying principles
and methods.
✓ Some of the common classes of speech enhancement algorithms are:
1. Spectral Subtraction Methods:
• These algorithms estimate the noise power spectral density (PSD) and subtract it
from the noisy speech signal in the frequency domain.
• The Wiener filter and Minimum Mean Square Error (MMSE) filter are examples
of spectral subtraction methods.
2. Adaptive Filtering Methods:
• Adaptive filtering algorithms dynamically adjust their filter coefficients based on
the input signal and noise characteristics.
• They can adapt to non-stationary noise environments.
• One popular adaptive filtering method is the Normalized Least Mean Squares
(NLMS) algorithm.
3. Statistical Model-Based Methods:
• These algorithms use statistical models to estimate the clean speech signal and the
noise from the noisy input. Hidden Markov Model (HMM) and Gaussian Mixture
Model (GMM) based approaches fall under this category.
4. Spectral Masking Methods:
• These algorithms use masks to selectively enhance or suppress certain frequency
regions in the noisy speech signal based on the signal-to-noise ratio (SNR).
• Spectral masking is often used in speech enhancement for hearing aids and
cochlear implants.
5. Time-Frequency Domain Methods:
• Time-frequency domain algorithms analyze the speech signal in both time and
frequency domains to exploit its characteristics more effectively.
• Short-Time Fourier Transform (STFT) and Time-Frequency Masking are examples
of such methods.

4.1. SPECTRAL-SUBTRACTIVE ALGORITHMS

✓ Spectral-subtractive algorithms are a class of speech enhancement techniques used to reduce


noise from a noisy speech signal.
✓ These algorithms work in the spectral domain by estimating the noise components and then
subtracting them from the noisy speech signal to obtain a denoised version of the speech.
✓ The main idea behind spectral-subtractive algorithms is that the noise in a noisy speech signal
often occupies different frequency regions from the speech signal.
✓ By estimating the spectral characteristics of the noise, these algorithms attempt to attenuate or
remove the noise components while preserving the speech components.

4.1.1 MULTIBAND SPECTRAL SUBTRACTION


✓ It is an extension of the basic Spectral Subtraction algorithm that divides the frequency
spectrum into multiple subbands to provide better noise reduction performance.
✓ Basic steps with equations:
1. Short-Time Fourier Transform (STFT):
The input speech signal is divided into short overlapping frames, and the Short-Time
Fourier Transform (STFT) is applied to each frame. The STFT converts the time-domain
signal into the frequency domain, yielding a complex-valued spectrogram:
X(k, i) = FFT(x[n, i])
where:

X(k, i) is the complex-valued spectrogram at frequency bin "k" and frame "i",

x[n, i] is the speech signal in frame "i" starting at sample index "n",

FFT denotes the Fast Fourier Transform.

2. Noise Estimation:
An estimate of the noise spectrum is required for spectral subtraction. Let's denote the
estimated noise spectrum as N(k, i):
N(k, i) = Estimation_Method(X(k, i))
where:

N(k, i) is the estimated noise spectrum at frequency bin "k" and frame "i",
Estimation Method is a function that estimates the noise spectrum, which can be
obtained from noise-only segments or adaptively estimated from the noisy speech
itself.
3. Multiband Division:
✓ The frequency spectrum is divided into multiple subbands, denoted by B subbands.
Each subband may span a different frequency range.

✓ Let's denote the lower and upper frequency bounds of each subband as f_lower(b)
and f_upper(b), respectively, where "b" represents the subband index:

X_b(k, i) = X(k, i) for f_lower(b) ≤ f < f_upper(b)


where:
X_b(k, i) is the complex-valued spectrogram in subband "b" at frequency
bin "k" and frame "i".
4. Spectral Subtraction:
✓ For each subband, the magnitude of the estimated noise spectrum is subtracted
from the magnitude of the noisy speech spectrum.
✓ The result is floored at zero (half-wave rectified) to avoid introducing negative
magnitude values:
Magnitude_Subtracted_b(k, i) = max(|X_b(k, i)| - |N_b(k, i)|, 0)
where:
Magnitude_Subtracted_b(k, i) is the magnitude of the subband after spectral
subtraction in subband "b" at frequency bin "k" and frame "i",
|X_b(k, i)| is the magnitude of the noisy speech subband in subband "b" at
frequency bin "k" and frame "i",
|N_b(k, i)| is the magnitude of the estimated noise subband in subband "b" at
frequency bin "k" and frame "i",
max(x, y) returns the maximum of "x" and "y", preventing negative values after
subtraction.
5. Phase Reconstruction:
The phase information of the original noisy speech is retained to reconstruct the complex-
valued spectrogram of the processed signal:

Phase_Reconstructed_b(k, i) = ∠X_b(k, i)

where:
Phase_Reconstructed_b(k, i) is the reconstructed phase in subband "b"
at frequency bin "k" and frame "i",
∠X_b(k, i) is the phase of the noisy speech subband in subband "b" at
frequency bin "k" and frame "i".
6. Inverse Short-Time Fourier Transform (ISTFT):
The processed complex-valued spectrogram is converted back to the time domain using the
Inverse Short-Time Fourier Transform (ISTFT) to obtain the enhanced speech signal:
y[n, i] = ISTFT(Magnitude_Subtracted_b(k, i) * exp(j * Phase_Reconstructed_b(k, i)))
where:

y[n, i] is the enhanced speech signal in frame "i" starting at sample index "n",
ISTFT denotes the Inverse Short-Time Fourier Transform.
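
✓ A minimal Python sketch of the procedure above (NumPy and SciPy assumed; the noise estimate is
taken from the first few frames, which are assumed to be speech-free, and the number of bands,
over-subtraction factors and spectral floor are illustrative choices):

    import numpy as np
    from scipy.signal import stft, istft

    def multiband_spectral_subtraction(noisy, fs, n_bands=4, noise_frames=10, alpha=None):
        """Split the STFT frequency axis into bands and subtract the estimated noise
        magnitude in each band with its own over-subtraction factor alpha[b]."""
        if alpha is None:
            alpha = np.linspace(2.0, 1.0, n_bands)       # subtract more in low bands
        f, t, X = stft(noisy, fs=fs, nperseg=256)
        mag, phase = np.abs(X), np.angle(X)
        noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)    # N(k)
        enhanced_mag = np.empty_like(mag)
        bands = np.array_split(np.arange(mag.shape[0]), n_bands)         # bin indices per band
        for b, bins in enumerate(bands):
            sub = mag[bins, :] - alpha[b] * noise_mag[bins, :]
            enhanced_mag[bins, :] = np.maximum(sub, 0.05 * mag[bins, :])  # spectral floor
        _, enhanced = istft(enhanced_mag * np.exp(1j * phase), fs=fs, nperseg=256)
        return enhanced

    # Clean tone plus white noise; the first ~0.3 s is noise only.
    fs = 8000
    t = np.arange(2 * fs) / fs
    clean = 0.5 * np.sin(2 * np.pi * 440 * t) * (t > 0.3)
    noisy = clean + 0.05 * np.random.default_rng(2).standard_normal(len(t))
    enhanced = multiband_spectral_subtraction(noisy, fs)
    print("noisy power:", np.mean(noisy**2), "enhanced power:", np.mean(enhanced**2))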

4.1.2 MMSE SPECTRAL SUBTRACTION ALGORITHM


✓ The MMSE Spectral Subtraction algorithm is based on Bayesian estimation principles and can
be applied in the Short-Time Fourier Transform (STFT) domain. Here are the main steps of the
MMSE Spectral Subtraction algorithm:
1. Short-Time Fourier Transform (STFT):
The noisy speech signal is divided into short overlapping frames, and the Short-Time
Fourier Transform (STFT) is applied to each frame.
This converts the time-domain signal into the frequency domain, resulting in a complex-
valued spectrogram.
2. Noise Estimation:
To perform MMSE spectral subtraction, an estimate of the noise power spectrum is
required.
The noise power spectrum can be estimated using the noise-only segments of the signal or
can be adaptively estimated from the noisy speech itself during periods of silence or low-
level speech.
3. MMSE Estimation:
For each frequency bin and frame, the MMSE estimator is used to estimate the clean speech
power spectrum based on the noisy speech power spectrum and the estimated noise power
spectrum.
Estimated_Clean_Power(k, i) =
Max(0, Magnitude_Noisy(k, i)^2 - Estimated_Noise_Power(k, i))
where:
Estimated_Clean_Power(k, i) is the estimated clean speech power spectrum at
frequency bin "k" and frame "i",
Magnitude_Noisy(k, i) is the magnitude of the noisy speech spectrum at frequency bin
"k" and frame "i",
Estimated_Noise_Power(k, i) is the estimated noise power spectrum at frequency bin
"k" and frame "i",
Max(x, y) returns the maximum of "x" and "y", preventing negative values in the
estimation.
4. Phase Reconstruction:
The phase information of the original noisy speech is retained to reconstruct the complex-
valued spectrogram of the processed signal.

Phase_Reconstructed(k, i) = ∠X(k, i)
where:
Phase_Reconstructed(k, i) is the reconstructed phase at frequency bin "k" and
frame "i",
∠X(k, i) is the phase of the noisy speech spectrum at frequency bin "k" and frame "i".
5. Inverse Short-Time Fourier Transform (ISTFT):
The processed complex-valued spectrogram is converted back to the time domain using the
Inverse Short-Time Fourier Transform (ISTFT) to obtain the enhanced speech signal.
y[n, i] = ISTFT(sqrt(Estimated_Clean_Power(k, i)) * exp(j * Phase_Reconstructed(k, i)))
where:
y[n, i] is the enhanced speech signal in frame "i" starting at sample index "n",
ISTFT denotes the Inverse Short-Time Fourier Transform.
✓ The MMSE Spectral Subtraction algorithm provides improved noise reduction performance
compared to basic spectral subtraction methods by taking into account the statistical properties
of the noisy speech and noise.
✓ It allows for better preservation of the speech signal while reducing noise, making it useful for
various speech enhancement applications in noisy environments.

4.1.3 SPECTRAL SUBTRACTION BASED ON PERCEPTUAL PROPERTIES

✓ Spectral Subtraction based on perceptual properties involves modifying the standard spectral
subtraction algorithm by incorporating perceptual weighting factors.
✓ These factors represent the importance of different frequency regions to human auditory
perception.
✓ Let's go through the steps with equations:
1. Short-Time Fourier Transform (STFT):
The input speech signal is divided into short overlapping frames, and the Short-Time
Fourier Transform (STFT) is applied to each frame to obtain the complex-valued
spectrogram:
X(k, i) = FFT(x[n, i])
where:

X(k, i) is the complex-valued spectrogram at frequency bin "k" and frame "i",
x[n, i] is the speech signal in frame "i" starting at sample index "n",
FFT denotes the Fast Fourier Transform.
2. Noise Estimation:
✓ An estimate of the noise power spectrum is required for spectral subtraction.

✓ This can be obtained from noise-only segments of the signal or adaptively
estimated from the noisy speech itself during periods of silence or low-level
speech.
✓ Let's denote the estimated noise power spectrum as N(k, i).
3. Perceptual Weighting:
✓ In the perceptual weighting step, each frequency bin of the magnitude spectrum is
multiplied by a perceptual weighting factor that represents the importance of that
frequency region to human hearing.
✓ Let's denote the perceptual weighting factor as W(k).
4. Magnitude Subtraction:
✓ The perceptually weighted magnitude spectrum is subtracted from the magnitude
spectrum of the noisy speech.
✓ The subtraction is typically performed on a logarithmic scale to avoid introducing
negative values:

Magnitude_Subtracted(k, i) = max(log(|X(k, i)|) - log(W(k) * N(k, i)), 0)

where:

Magnitude_Subtracted(k, i) is the magnitude after perceptual spectral


subtraction at frequency bin "k" and frame "i",
|X(k, i)| is the magnitude of the noisy speech spectrum at frequency bin "k" and
frame "i",
W(k) is the perceptual weighting factor at frequency bin "k",
N(k, i) is the estimated noise power spectrum at frequency bin "k" and
frame "i",
log denotes the natural logarithm,
max(x, y) returns the maximum of "x" and "y" to prevent negative values after
subtraction.
5. Phase Reconstruction:
The phase information of the original noisy speech is retained to reconstruct the complex-
valued spectrogram of the processed signal:

Phase_Reconstructed(k, i) = ∠X(k, i)

where:
Phase_Reconstructed(k, i) is the reconstructed phase at frequency bin "k" and
frame "i", and ∠X(k, i) is the phase of the noisy speech spectrum at frequency bin
k" and frame "i".

4.2. WIENER FILTERING

✓ Wiener filtering is a signal processing technique used for noise reduction or signal
enhancement in various applications, including speech and image processing.
✓ It is based on the principles of statistical estimation and aims to reconstruct a clean signal
from a noisy or degraded version by minimizing the mean square error between the
estimated signal and the original signal.
✓ The Wiener filter is an optimal linear filter that considers the statistical properties of the
noisy signal and the signal of interest to find the best estimate of the clean signal.
✓ It works in the frequency domain and requires knowledge of the power spectral density
(PSD) of both the noise and the signal.

Fig. Wiener Filter


✓ Suppose we have a noisy speech signal represented in the time domain as x[n], and we want
to estimate the clean speech signal s[n] (i.e., the desired signal) from the noisy signal x[n].
✓ We assume that the relationship between the clean speech signal and the noisy signal is
given by:
x[n] = s[n] + v[n],

where v[n] represents the additive noise.

✓ The goal of Wiener filtering is to estimate the clean speech signal s[n] from the noisy signal
x[n] by designing a filter in the frequency domain.
✓ Step-by-step Wiener Filtering Equations:
1. Compute the Power Spectral Density (PSD) of the clean speech signal and of the noisy
signal from their Fourier transforms S(f) and X(f):
PSD of the clean speech signal:
P_s(f) = |S(f)|^2,
PSD of the noisy signal:
P_x(f) = |X(f)|^2,
where f represents frequency.
2. Estimate the Power Spectral Density (PSD) of the noise, P_v(f), using statistical methods or
noise estimation techniques; assuming the noise is uncorrelated with the speech,
P_v(f) = P_x(f) - P_s(f).
3. Estimate the Signal-to-Noise Ratio (SNR) for each frequency bin:
SNR(f) = P_s(f) / P_v(f).
4. Define the Wiener filter transfer function H(f) for each frequency bin:
H(f) = SNR(f) / (1 + SNR(f)) = P_s(f) / (P_s(f) + P_v(f)).
5. Apply the Wiener filter transfer function H(f) to the noisy spectrum X(f) in the frequency
domain to obtain the estimated clean speech spectrum Y(f):
Y(f) = H(f) * X(f).
6. Take the inverse Fourier transform of Y(f) to obtain the estimated clean speech signal y[n]
in the time domain:
y[n] = Inverse Fourier Transform(Y(f)).
The resulting signal y[n] is the denoised version of the original noisy speech signal x[n].
✓ It's important to note that the success of Wiener filtering heavily relies on accurate noise
estimation and stationarity assumptions.
✓ In practical scenarios, noise can be non-stationary, and advanced techniques like adaptive
filtering are employed to handle such cases.
✓ Nevertheless, Wiener filtering provides a fundamental understanding of noise reduction
principles and serves as a basis for more sophisticated methods used in speech enhancement
and other applications.

4.2.1 WIENER FILTERS IN THE TIME DOMAIN


✓ Wiener filters, also known as minimum mean square error (MMSE) filters, are commonly
used in various signal processing applications, including speech and audio processing.
✓ In the time domain, the Wiener filter is designed to minimize the mean square error between
the desired signal and the filtered signal, considering the statistical properties of both the
desired signal and the noise.
✓ The Wiener filter is particularly effective for noise reduction and speech enhancement when
the statistical properties of the desired signal and the noise are known or can be estimated
accurately.
✓ Here's the basic concept of designing a Wiener filter in the time domain:
1. Problem Formulation:
Given a noisy speech signal s(n), the goal is to design a Wiener filter h(n) that minimizes
the mean square error (MSE) between the desired signal d(n) (the clean speech) and the
filtered signal y(n) (the enhanced speech):
MSE = E[(d(n) - y(n))^2]
where:
E[.] denotes the expectation operator.

2. Autocorrelation and Cross-Correlation:
To design the Wiener filter, we need to estimate the autocorrelation of the noisy input signal
and the cross-correlation between the desired signal and the noisy input.
The autocorrelation of the noisy input is represented by R_ss(m), and the cross-correlation
between the desired signal and the noisy input is represented by R_ds(m):

R_ss(m) = E[s(n) * s(n-m)]    R_ds(m) = E[d(n) * s(n-m)]

where:

d(n) is the desired signal (clean speech),

s(n) = d(n) + v(n) is the noisy input, with v(n) the additive noise,
m is the time lag.
3. Wiener Filter Coefficients:
The Wiener filter h(n) of length L is designed to minimize the MSE. Setting the derivative of
the MSE with respect to each coefficient to zero gives the Wiener-Hopf equations:

Σ[k=0 to L-1] h(k) * R_ss(m-k) = R_ds(m),  for m = 0, 1, ..., L-1

where:

R_ds(m) is the cross-correlation between the desired signal and the noisy input at
time lag m,

R_ss(m) is the autocorrelation of the noisy input at time lag m.

Solving this set of linear equations (the autocorrelation matrix is Toeplitz, so the
Levinson-Durbin recursion can be used) yields the filter coefficients h(0), ..., h(L-1).

4. Filtering Operation:
The Wiener filter is then applied to the noisy speech signal s(n) in the time domain to
obtain the enhanced speech signal y(n):

y(n) = ∑[h(m) * s(n-m)]

where:

y(n) is the enhanced speech signal at time index n,

h(m) is the Wiener filter coefficient at time lag m.

✓ The Wiener filter in the time domain is an effective tool for noise reduction and speech
enhancement when the statistical properties of the desired signal and noise are known or
can be estimated accurately.
✓ It provides a means to optimally trade off noise reduction and speech preservation, leading
to improved speech quality in noisy environments.

4.2.2 WIENER FILTERS IN THE FREQUENCY DOMAIN / WIENER FILTERS FOR NOISE
REDUCTION

✓ In the frequency domain, Wiener filters are used for signal processing tasks such as noise
reduction, speech enhancement, and system identification.
✓ Wiener filters in the frequency domain are based on the concept of minimizing the mean
square error (MSE) between the desired signal and the filtered signal, taking advantage of
the frequency representation of the signals.
✓ The basic idea behind designing a Wiener filter in the frequency domain is to modify the
frequency components of the noisy signal to minimize the noise while preserving the
important signal information.
✓ The Wiener filter operates on the Short-Time Fourier Transform (STFT) representation of
the signals, where the signals are divided into short overlapping frames and transformed
into the frequency domain.
✓ Here's the general concept of designing a Wiener filter in the frequency domain:
1. STFT Representation:
The noisy signal s(n) is divided into short overlapping frames, and the Short-Time Fourier
Transform (STFT) is applied to each frame to obtain the complex-valued spectrogram:
S(k, i) = STFT(s[n, i])
where:
S(k, i) is the complex-valued spectrogram at frequency bin "k" and frame "i",
s[n, i] is the noisy signal in frame "i" starting at sample index "n",
STFT denotes the Short-Time Fourier Transform.
2. Noise Estimation:
• An estimate of the noise power spectrum is required for the Wiener filter.
• This can be obtained from noise-only segments of the signal or adaptively
estimated from the noisy speech itself during periods of silence or low-level
speech.
• Let's denote the estimated noise power spectrum as R_nn(k, i).
3. Wiener Filter Coefficients:
The Wiener filter in the frequency domain is designed to minimize the mean square error
(MSE) between the desired signal D(k, i) (the clean speech spectrum) and the filtered signal
Y(k, i) (the enhanced speech spectrum):

Wiener Filter Coefficient:

H(k, i) = R_dd(k, i) / (R_dd(k, i) + R_nn(k, i))

where:

H(k, i) is the Wiener filter coefficient at frequency bin "k" and frame "i",
R_dd(k, i) is the power spectral density of the desired signal (clean speech) at frequency bin
"k" and frame "i",
R_nn(k, i) is the power spectral density of the noise at frequency bin "k" and frame "i".

4. Wiener Filtering:
The Wiener filter coefficients H(k, i) are applied to each frequency bin of the noisy speech
spectrogram S(k, i) to obtain the enhanced speech spectrogram Y(k, i):
Y(k, i) = H(k, i) * S(k, i)
where:
Y(k, i) is the enhanced speech spectrogram at frequency bin "k" and frame "i",
H(k, i) is the Wiener filter coefficient at frequency bin "k" and frame "i",
S(k, i) is the complex-valued spectrogram of the noisy speech at frequency bin
"k" and frame "i".
5. Inverse STFT (ISTFT):
The enhanced speech spectrogram Y(k, i) is transformed back to the time domain using the
Inverse Short-Time Fourier Transform (ISTFT) to obtain the enhanced speech signal
y(n, i).
y(n, i) = ISTFT(Y(k, i))
where:
y(n, i) is the enhanced speech signal in frame "i" starting at sample index "n",
ISTFT denotes the Inverse Short-Time Fourier Transform.

✓ Wiener filters in the frequency domain provide an effective way to perform noise reduction
and speech enhancement by taking advantage of the frequency representation of signals.
✓ By minimizing the mean square error between the desired and filtered signals, the Wiener
filter can effectively suppress noise while preserving the important speech information,
leading to improved speech quality in noisy environments.
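
✓ A minimal Python sketch of frequency-domain Wiener filtering (NumPy and SciPy assumed; the
noise PSD is estimated from the first few frames, assumed speech-free, and the clean-speech PSD is
approximated by power subtraction, which is a common simplification rather than part of the
Wiener theory itself):

    import numpy as np
    from scipy.signal import stft, istft

    def wiener_enhance(noisy, fs, noise_frames=10, nperseg=256):
        """Apply the Wiener gain H = Pss / (Pss + Pnn) to every time-frequency bin
        of the noisy STFT and transform back to the time domain."""
        f, t, X = stft(noisy, fs=fs, nperseg=nperseg)
        noise_psd = np.mean(np.abs(X[:, :noise_frames]) ** 2, axis=1, keepdims=True)
        noisy_psd = np.abs(X) ** 2
        clean_psd = np.maximum(noisy_psd - noise_psd, 1e-12)    # crude estimate of Pss
        gain = clean_psd / (clean_psd + noise_psd)               # Wiener gain per bin
        _, enhanced = istft(gain * X, fs=fs, nperseg=nperseg)
        return enhanced

    # A noisy tone with a noise-only lead-in of about 0.3 s.
    fs = 8000
    t = np.arange(2 * fs) / fs
    clean = 0.5 * np.sin(2 * np.pi * 500 * t) * (t > 0.3)
    noisy = clean + 0.05 * np.random.default_rng(3).standard_normal(len(t))
    enhanced = wiener_enhance(noisy, fs)
    print("output length:", len(enhanced))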
4.3 MAXIMUM-LIKELIHOOD ESTIMATORS
✓ Maximum-Likelihood Estimators (MLE) are statistical methods used to estimate the
parameters of a probability distribution that are most likely to have generated a given set
of observed data.
✓ The basic idea behind MLE is to find the values of the parameters that maximize the
likelihood of observing the data under the assumed probability distribution.
✓ Let's say we have a set of observed data points {x1, x2, ..., xn}, and we want to estimate
the parameters θ of a probability distribution f(x; θ) that governs the data generation
process.

✓ Here, f(x; θ) represents the probability density function (PDF) or probability mass function
(PMF) of the distribution, and θ represents the unknown parameters.
✓ The likelihood function L(θ) is defined as the joint probability of observing the data given
the parameters:
L(θ) = f(x1; θ) * f(x2; θ) * ... * f(xn; θ)
✓ The goal of MLE is to find the values of θ that maximize the likelihood function L(θ).
✓ This can be achieved by taking the derivative of the log-likelihood function (log L(θ)) with
respect to θ, setting it equal to zero, and solving for θ.
✓ The resulting estimates of the parameters are called maximum-likelihood estimates.
✓ Mathematically, the MLE θ̂ can be obtained as follows:
1. Take the logarithm of the likelihood function to simplify calculations:
log L(θ) = log f(x1; θ) + log f(x2; θ) + ... + log f(xn; θ)
2. Compute the derivative of the log-likelihood function with respect to θ:
∂(log L(θ))/∂θ = ∂(log f(x1; θ))/∂θ + ∂(log f(x2; θ))/∂θ + ... + ∂(log f(xn; θ))/∂θ
3. Set the derivative to zero and solve for θ:
∂(log L(θ))/∂θ = 0
4. The solution θ̂ obtained from the equation represents the maximum-likelihood estimate
of the parameters.
✓ MLE is often used for estimating the parameters of statistical models in speech recognition,
speech synthesis, and various other applications where probability distributions play a
crucial role.
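
✓ A small numeric illustration in Python for the Gaussian case, where solving ∂(log L)/∂θ = 0
gives closed-form estimates (NumPy assumed; the true mean and variance used to generate the data
are arbitrary):

    import numpy as np

    # For x_i ~ N(mu, var), maximizing the log-likelihood yields
    #   mu_hat = (1/n) * sum(x_i)   and   var_hat = (1/n) * sum((x_i - mu_hat)^2).
    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=1.5, size=10000)    # observed samples
    mu_hat = np.mean(data)
    var_hat = np.mean((data - mu_hat) ** 2)              # note: 1/n, not 1/(n-1)
    print("MLE mean: %.3f (true 2.0), MLE variance: %.3f (true 2.25)" % (mu_hat, var_hat))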

4.4 BAYESIAN ESTIMATORS

✓ Bayesian estimators are statistical methods used to estimate unknown parameters in a


model based on Bayes' theorem and the principles of Bayesian statistics.
✓ Unlike maximum likelihood estimators (MLE) that focus solely on finding the most likely
point estimate for the parameters, Bayesian estimators incorporate prior information or
knowledge about the parameters' distribution.
✓ In Bayesian statistics, the unknown parameters are treated as random variables with their
own probability distributions.
✓ The goal is to update our knowledge about these parameters given the observed data and
the prior information using Bayes' theorem.
✓ Let's say we have a set of observed data points {x1, x2, ..., xn} and want to estimate the
parameters θ of a probability distribution f(x; θ) that governs the data generation process.
✓ Here, f(x; θ) represents the likelihood function, and θ represents the unknown parameters.
✓ In Bayesian estimation, we start with a prior probability distribution for the parameters θ,
denoted as P(θ).

✓ This prior represents our beliefs or knowledge about the parameters before observing the
data.
✓ After observing the data, we update our beliefs about the parameters to obtain the posterior
distribution P(θ|x), which represents our updated knowledge about the parameters given
the data.
✓ Bayes' theorem relates the posterior distribution to the prior and the likelihood:
P(θ|x) = (f(x; θ) * P(θ)) / P(x),
where
P(θ|x) is the posterior distribution, f(x; θ) is the likelihood function, P(θ) is
the prior distribution, and P(x) is the marginal likelihood (also known as the
evidence).
✓ The maximum a posteriori (MAP) estimator is a common Bayesian estimator that finds the
mode (peak) of the posterior distribution, which represents the most likely estimate of the
parameters given the data and the prior information.
✓ Mathematically, the MAP estimator θ̂_MAP can be obtained as follows:
a. Calculate the posterior distribution P(θ|x) using Bayes' theorem.
b. Find the value of θ that maximizes the posterior distribution, which corresponds to the
mode of the distribution:
θ̂_MAP = argmax P(θ|x).
✓ Bayesian estimation is widely used in various fields, including speech processing, where
uncertainty and prior knowledge play a critical role in parameter estimation.
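
✓ A small numeric illustration in Python of MAP estimation for a Gaussian mean with a Gaussian
prior, the conjugate case in which the posterior is also Gaussian and the MAP estimate has a
closed form (NumPy assumed; the prior and noise parameters are arbitrary):

    import numpy as np

    # Likelihood: x_i ~ N(theta, sigma^2); prior: theta ~ N(mu0, tau^2).
    # The posterior is Gaussian, so the MAP estimate equals the posterior mean:
    #   theta_MAP = (n/sigma^2 * mean(x) + mu0/tau^2) / (n/sigma^2 + 1/tau^2).
    rng = np.random.default_rng(0)
    sigma, tau, mu0 = 1.0, 0.5, 0.0          # known noise std, prior std, prior mean
    data = rng.normal(loc=1.2, scale=sigma, size=20)
    n, xbar = len(data), np.mean(data)
    theta_map = (n / sigma**2 * xbar + mu0 / tau**2) / (n / sigma**2 + 1 / tau**2)
    print("sample mean (MLE): %.3f, MAP estimate (pulled toward prior): %.3f" % (xbar, theta_map))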

4.5 MMSE AND LOG-MMSE ESTIMATOR

✓ The MMSE (Minimum Mean Square Error) estimator and the Log-MMSE (Logarithmic
Minimum Mean Square Error) estimator are statistical methods used for signal processing
and estimation tasks, including speech enhancement and denoising.
✓ Both estimators aim to find an optimal estimate of a random variable, given noisy or
observed data, by minimizing the mean square error.

MMSE Estimator:

✓ The MMSE estimator is used to estimate an unknown parameter or signal based on noisy
observations.
✓ It is derived from the principle of minimizing the expected value of the squared difference
between the estimated value and the true value of the parameter.
✓ Let's say we want to estimate an unknown random variable or signal s from noisy
observations x, and we have a model that describes the relationship between s and x as:
x = s + v,

where v represents the additive noise.
✓ The MMSE estimator for s, denoted as ŝ_MMSE, is given by:
ŝ_MMSE = E(s|x),
where E(s|x) represents the conditional expectation of s given the observed
data x.
✓ The MMSE estimator is optimal in the sense that it minimizes the expected mean square
error between the estimate and the true value of the parameter.
✓ It is widely used in various signal processing applications, including speech enhancement
and channel estimation.

Log-MMSE Estimator:

✓ The Log-MMSE estimator is a modification of the MMSE estimator that operates in the
logarithmic domain.
✓ It is particularly useful when the observed data is corrupted by multiplicative noise or when
the underlying signal is positive or has a log-normal distribution.
✓ The Log-MMSE estimator operates on the logarithm of the random variables.
✓ Let's say we have log-transformed observations y and we want to estimate the unknown
log-transformed signal log(s) from noisy log-transformed observations y, and we have a
model that describes the relationship between log(s) and y as:
y = log(s) + w,

where w represents the additive noise in the logarithmic domain.

✓ The Log-MMSE estimator for log(s), denoted as log(ŝ)_MMSE, is given by:

log(ŝ)_MMSE = E(log(s)|y),

where E(log(s)|y) represents the conditional expectation of log(s) given the observed
data y.

✓ The Log-MMSE estimator essentially operates on the logarithmic scale, which can be
advantageous when dealing with multiplicative noise or positive-valued signals.
✓ It can provide better performance in scenarios where the signal-to-noise ratio is low or
when the underlying signal has a log-normal distribution.
✓ Both MMSE and Log-MMSE estimators are widely used in various signal processing and
estimation tasks, offering optimal or near-optimal performance under certain assumptions
and noise conditions.

4.6 SUBSPACE ALGORITHMS

✓ Subspace algorithms are a class of signal processing techniques used for various
applications, including noise reduction, signal separation, and source localization.
✓ These algorithms exploit the subspace structure of signals to achieve their objectives.
✓ The basic idea behind subspace algorithms is to transform the data into a lower-
dimensional subspace where the signal of interest is concentrated, while noise or
interference is spread out.
✓ One of the most common subspace-based techniques is Principal Component Analysis
(PCA). PCA is used for dimensionality reduction and signal denoising.
✓ It identifies the principal components or eigenvectors of the covariance matrix of the data,
which represent the directions of maximum variance.
✓ By projecting the data onto the principal components with the largest eigenvalues, the
signal is enhanced, and noise is attenuated in the lower-dimensional subspace.
✓ Another widely used subspace algorithm is Independent Component Analysis (ICA).
✓ ICA aims to separate a mixture of statistically independent source signals from their
observed mixtures.
✓ It is particularly useful in applications like blind source separation, where the sources are
unknown and uncorrelated.
✓ Here are some key points about subspace algorithms:
1. Signal Subspace: In subspace algorithms, the signal subspace is the subspace spanned
by the columns of a matrix containing the signal and noise data. The signal subspace
captures the intrinsic structure of the signals of interest.
2. Noise Subspace: The noise subspace is orthogonal to the signal subspace and represents
the subspace containing only noise or interference.
3. Signal Subspace Projection: Subspace algorithms often involve projecting the data onto
the signal subspace to enhance the signal components and attenuate noise.
4. Eigendecomposition: Many subspace algorithms rely on the eigendecomposition of
covariance matrices or other matrices derived from the data.
5. Array Processing: Subspace algorithms are commonly used in array processing
applications, such as beamforming and direction-of-arrival estimation.
✓ Subspace algorithms have been applied in various fields, including speech processing,
image processing, and sensor array processing.
✓ They offer effective ways to exploit the inherent structure of data to achieve noise
reduction, signal separation, and other signal processing tasks.
✓ However, their performance can depend on the specific application and the assumptions
made about the data and noise characteristics.
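
✓ A minimal Python sketch of PCA-based subspace denoising (NumPy assumed; the frame length and
the number of retained principal components are illustrative choices, and a real system would
estimate the signal-subspace dimension from the eigenvalues):

    import numpy as np

    def pca_denoise(noisy, frame_len=32, n_components=4):
        """Stack the signal into frames, keep only the projections onto the principal
        components with the largest eigenvalues (the assumed signal subspace), and
        reconstruct the signal."""
        n_frames = len(noisy) // frame_len
        frames = noisy[:n_frames * frame_len].reshape(n_frames, frame_len)
        mean = frames.mean(axis=0)
        centered = frames - mean
        cov = centered.T @ centered / n_frames                  # covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)                  # eigenvalues in ascending order
        signal_basis = eigvecs[:, -n_components:]               # signal subspace
        projected = centered @ signal_basis @ signal_basis.T    # project and reconstruct
        return (projected + mean).reshape(-1)

    # A sinusoid in white noise; most of its energy lives in a low-rank subspace.
    fs = 8000
    t = np.arange(fs) / fs
    clean = 0.5 * np.sin(2 * np.pi * 440 * t)
    noisy = clean + 0.2 * np.random.default_rng(4).standard_normal(fs)
    denoised = pca_denoise(noisy)
    print("residual noise power before: %.4f  after: %.4f"
          % (np.var(noisy - clean), np.var(denoised - clean)))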

UNIT V SPEECH SYNTHESIS AND APPLICATION
5.1. A TEXT-TO-SPEECH SYSTEMS (TTS)

Fig. Block diagram of TTS


✓ Text-to-Speech (TTS) systems, also known as speech synthesis systems, are technology-driven
applications that convert written text into spoken speech.
✓ The primary goal of TTS systems is to generate natural-sounding human speech from input
text, enabling machines to "speak" and interact with users in a more human-like manner.
✓ TTS technology finds applications in various fields, including assistive technology,
accessibility, voice assistants, navigation systems, and entertainment.
✓ Key components and processes involved in Text-to-Speech systems include:
1. Text Analysis: The input text is analyzed to identify the linguistic elements, including
words, punctuation, and sentence structures. This step is crucial for proper
pronunciation, emphasis, and intonation.
2. Text Normalization: During text normalization, abbreviations, acronyms, and
numerical expressions are expanded into their full forms to ensure accurate speech
synthesis.
3. Phonetic Processing: TTS systems use a lexicon or dictionary containing phonetic
representations of words. The system maps each word to its corresponding phonetic
representation, which helps in generating the correct pronunciation.
4. Prosody and Intonation: Prosody refers to the rhythm, pitch, and stress patterns of
speech. TTS systems use prosodic models to determine the appropriate intonation and
stress in synthesized speech to sound more natural and expressive.
5. Speech Synthesis: TTS systems employ various techniques for speech synthesis, such
as concatenative synthesis and parametric synthesis.
• Concatenative Synthesis: In this method, pre-recorded speech
segments are concatenated to create the desired speech. These
segments are often recorded from a human speaker. Unit-selection methods are
used to choose appropriate segments based on context and to ensure smooth
transitions between them.
• Parametric Synthesis: Parametric TTS models generate speech using
mathematical models that describe speech parameters such as pitch,
duration, and spectral features. These models are trained using large
amounts of speech data and can generate speech more efficiently than
concatenative methods.
6. Voice Selection: TTS systems may offer multiple voices to choose from, representing
different genders, ages, and accents, allowing users to customize the synthesized
speech according to their preferences.
7. Output Rendering: The final synthesized speech is rendered as an audio waveform that
can be played through speakers or audio output devices.

5.2. SYNTHESIZERS TECHNOLOGIES – CONCATENATIVE SYNTHESIS


✓ Concatenative synthesis is a speech synthesis technology used in Text-to-Speech (TTS)
systems. It involves the concatenation (joining together) of pre-recorded speech segments to
create the desired speech output.
✓ These speech segments are typically small units of speech, such as phonemes, diphones, or
triphones, recorded from a human speaker.
✓ The process of concatenative synthesis can be broken down into the following steps:
1. Speech Database Creation:
To build a concatenative TTS system, a large database of recorded speech segments is
needed. These speech segments are carefully selected and recorded to cover a wide
range of phonetic variations, intonations, and prosodic patterns.
2. Unit Selection:
During synthesis, the TTS system selects appropriate speech units from the database to
create the desired output. The selection is based on the input text and the desired
prosody to ensure a natural-sounding and expressive speech synthesis.
3. Concatenation:
The selected speech units are concatenated in real-time to form the synthesized speech.
The transitions between concatenated units are smoothed out to avoid audible
discontinuities and ensure a smooth and natural flow of speech.
4. Prosody Control:
Concatenative synthesis allows for fine-grained prosody control. By selecting
appropriate speech units with different intonations and stress patterns, the system can
accurately reproduce the intended prosody of the input text.

✓ The overall concatenative synthesis process can be represented in a simplified form as follows:
Synthesized Speech = Concatenate(U_1, U_2, ..., U_n),
✓ where U_1, U_2, ..., U_n are the speech units selected from the speech database,
concatenated in sequence to form the final synthesized speech waveform.
✓ It's important to note that in practical concatenative synthesis systems, there are additional
considerations and algorithms to handle issues like prosody matching, unit selection, and
waveform concatenation to achieve more natural and expressive speech output.
✓ These considerations may involve signal processing techniques, prosodic modeling, and more
sophisticated algorithms for unit selection and concatenation.
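As a rough illustration of the concatenation and smoothing steps, the sketch below joins unit waveforms with a short linear crossfade; the synthetic "units" and the crossfade length are assumptions for the example and stand in for real recorded diphones chosen by a unit-selection search.

import numpy as np

def concatenate_units(units, fade_len=80):
    """Concatenate unit waveforms, crossfading fade_len samples at each join."""
    out = units[0].astype(float)
    fade_out = np.linspace(1.0, 0.0, fade_len)
    fade_in = 1.0 - fade_out
    for unit in units[1:]:
        unit = unit.astype(float)
        # Overlap-add the boundary region to avoid audible discontinuities
        out[-fade_len:] = out[-fade_len:] * fade_out + unit[:fade_len] * fade_in
        out = np.concatenate([out, unit[fade_len:]])
    return out

# Example with synthetic "units" (stand-ins for recorded speech segments)
u1 = np.sin(2 * np.pi * 220 * np.arange(4000) / 16000)
u2 = np.sin(2 * np.pi * 330 * np.arange(4000) / 16000)
speech = concatenate_units([u1, u2])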

5.3. USE OF FORMANTS FOR CONCATENATIVE SYNTHESIS


✓ Formants play a crucial role in concatenative speech synthesis, especially in systems that aim
to produce natural and intelligible speech.
✓ Formants are resonant frequencies in the vocal tract that correspond to peaks in the speech
spectrum.
✓ They are responsible for the distinctive timbre and vowel sounds in human speech.
✓ The main steps involving formants in concatenative synthesis are:
1. Formant Analysis:
• Formant analysis is performed on the recorded speech segments in the speech
database to identify the formant frequencies and their bandwidths.
• Formants are typically represented as resonant frequencies (f_i) and their
corresponding bandwidths (BW_i) for each speech segment.
2. Formant Matching:
• During synthesis, when selecting speech units (U_i) from the speech database,
formant matching is considered to ensure smoother concatenation and better
continuity in the synthesized speech.
• The formant frequencies and bandwidths of the selected units should be compatible
with the target speech's formants (f_target and BW_target).
3. Formant Shaping:
• To achieve seamless concatenation between speech units, formant shaping may be
applied.
• Formant shaping involves modifying the formant frequencies and bandwidths of
the selected speech units to better match the target speech or to achieve specific
prosodic effects.
• This process helps in reducing perceptible discontinuities and artifacts during
concatenation.

4. Vowel Synthesis:
• For synthesizing vowels, formants are of particular importance.
• Vowel sounds are characterized by specific formant patterns, and correctly
reproducing these formants is crucial for accurate vowel synthesis.
• In concatenative synthesis, the formant frequencies and bandwidths of the selected
speech units are carefully adjusted to achieve the desired vowel sounds.
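A minimal sketch of formant matching during unit selection might look like the following; the candidate units, their formant values, and the plain Euclidean distance are illustrative assumptions (real systems combine such target costs with join costs between neighbouring units).

import numpy as np

def formant_distance(f_unit, f_target):
    """Euclidean distance between formant-frequency vectors (Hz)."""
    return float(np.linalg.norm(np.asarray(f_unit) - np.asarray(f_target)))

def select_unit(candidates, f_target):
    """Pick the candidate whose formants (F1, F2, F3) best match the target."""
    return min(candidates, key=lambda c: formant_distance(c["formants"], f_target))

# Illustrative candidate units for an /a/-like vowel (hypothetical values)
candidates = [
    {"name": "unit_017", "formants": [710, 1100, 2540]},
    {"name": "unit_203", "formants": [680, 1220, 2600]},
]
best = select_unit(candidates, f_target=[700, 1150, 2550])
print(best["name"])   # the unit whose formants are closest to the target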
5.4. USE OF LPC FOR CONCATENATIVE SYNTHESIS
• LPC is a widely used speech analysis technique that models the speech signal as a linear
combination of past speech samples.
• It estimates the vocal tract filter parameters, also known as LPC coefficients, which represent
the resonant characteristics of the vocal tract.
• These coefficients are used to model the formant frequencies and bandwidths in the speech
signal.
• In concatenative synthesis, LPC is used to analyze the recorded speech segments in the speech
database to extract LPC coefficients.
• These coefficients are stored in the database and used during the synthesis process.
• Here's how LPC is utilized in concatenative synthesis:
1. LPC Analysis:
• LPC is used to analyze recorded speech segments to estimate the LPC coefficients.
• The LPC analysis involves modeling the speech signal as a linear combination of
past speech samples using an all-pole linear prediction model.
• The LPC coefficients are computed based on the autocorrelation method or the
Levinson-Durbin algorithm.
• The LPC model equation is given as:
s[n] = Σ(a_i * s[n - i]) + e[n], for i = 1 to p,
• where s[n] is the speech sample at time n, a_i represents the LPC coefficients, p is the
LPC order (the number of coefficients), and e[n] is the prediction error (excitation).
2. Formant Estimation:
• From the LPC coefficients, the formant frequencies can be estimated.
• Formants correspond to the resonant frequencies in the vocal tract and are crucial
for accurately capturing the unique vowel qualities in the speech signal.
• The formant frequencies (f_i) are estimated from the roots of the LPC polynomial
A(z) = 1 - Σ(a_i * z^(-i)), for i = 1 to p.
• Each complex root z_i = r_i * e^(jθ_i) with a positive angle corresponds to a formant:
f_i = (Fs / (2π)) * θ_i, BW_i = -(Fs / π) * ln(r_i),
• where Fs is the sampling frequency, θ_i is the root angle (in radians), and r_i is the
root magnitude (a code sketch at the end of this section illustrates this).

3. Unit Selection and Concatenation:
• In concatenative synthesis, the recorded speech database contains multiple speech
units, and unit selection is performed during synthesis to choose appropriate units
for concatenation.
• The LPC-derived formant frequencies play a significant role in this selection
process.
• Units with formant characteristics that match the target speech formants are chosen
to ensure smooth concatenation and continuity in the synthesized speech.
4. Formant Shaping:
• LPC-derived formant information can also be used for formant shaping, where the
formant frequencies of selected units are modified to better match the formant
frequencies of adjacent units.
• Formant shaping helps achieve a more seamless and natural transition between
concatenated units.
✓ To summarize, LPC is not directly used to synthesize speech in concatenative
synthesis.
✓ Instead, it is employed for speech analysis, specifically in estimating formant
frequencies and aiding in unit selection and shaping, which are essential steps in
concatenative synthesis.
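The code sketch referenced above is given below: a minimal illustration of autocorrelation-based LPC analysis and root-based formant estimation using NumPy and SciPy. The analysis frame, the LPC order, and the sampling rate are assumptions made for the example.

import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC: solve the Toeplitz normal equations for a_1..a_p."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # lags 0..N-1
    return solve_toeplitz(r[:order], r[1:order + 1])               # predictor coefficients

def formants_from_lpc(a, fs):
    """Estimate formant frequencies (Hz) from the roots of A(z) = 1 - sum(a_i z^-i)."""
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]            # keep one root of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)   # f_i = (Fs / 2*pi) * theta_i
    return np.sort(freqs)

# Example on a synthetic vowel-like frame (values chosen only for illustration)
fs = 16000
t = np.arange(400) / fs
frame = (np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
         + 0.01 * np.random.randn(len(t)))      # small noise for numerical stability
a = lpc_coefficients(frame, order=10)
print(formants_from_lpc(a, fs))                 # peaks near 700 Hz and 1200 Hz expected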

5.5. HMM-BASED SPEECH SYNTHESIS

• Hidden Markov Model (HMM)-based speech synthesis, also known as HMM-based Text-to-
Speech (TTS), is a popular statistical parametric approach for synthesizing high-quality and
natural-sounding speech from text input.
• HMM-based speech synthesis utilizes Hidden Markov Models to model the relationship
between linguistic units (e.g., phonemes, diphones) and acoustic features (e.g., Mel-cepstral
coefficients, spectral parameters, prosody) of speech.

Fig. A hidden Markov model for speech recognition
• Below are the key equations involved in HMM-based speech synthesis:
1. HMM State Transition Probability (A):
• The state transition probabilities in an HMM represent the probabilities of moving
from one state to another. In the context of speech synthesis, the states correspond
to different phonetic or linguistic units.
• Let's denote the state transition probability matrix as A, where A[i][j] represents
the probability of transitioning from state i to state j.
2. HMM Emission Probability (B):
• The emission probabilities in an HMM represent the probabilities of observing a
particular acoustic feature given a specific state.
• Let's denote the emission probability matrix as B, where B[i][j] represents the
probability of emitting the acoustic feature j when in state i.
3. Initial State Probability (π):
• The initial state probabilities represent the probabilities of starting from each state
in the HMM.
• Let's denote the initial state probability vector as π, where π[i] represents the
probability of starting from state i.
5.5.1 HMM Forward Algorithm based speech synthesis:
✓ The forward algorithm is used to compute the probability of observing a sequence of acoustic
features given the HMM and its parameters. This probability is used to perform alignment and
parameter estimation during training.
✓ The forward algorithm recursively computes the forward probabilities α(t, i), which represent
the probability of being in state i at time t and observing the acoustic features up to time t.
✓ The forward probability α(t, i) is computed as follows:
α(t, i) = Σ_j [α(t-1, j) * A[j][i]] * B[i][O(t)],
where the sum is over all states j at time t-1 and O(t) represents the observed acoustic feature at time t.
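A minimal NumPy sketch of the forward recursion is given below; the toy two-state HMM with three discrete observation symbols and its parameter values are assumptions chosen only to make the example runnable.

import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: alpha[t, i] = P(O_1..O_t, state_t = i | model)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # recursion: sum over previous states j
    return alpha, alpha[-1].sum()                      # total observation likelihood

# Toy 2-state HMM with 3 discrete observation symbols (illustrative values)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
alpha, likelihood = forward(A, B, pi, obs=[0, 1, 2])
print(likelihood)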

5.5.2 HMM Viterbi Algorithm based speech synthesis:
✓ The Viterbi algorithm is used to find the most likely state sequence given the observed acoustic
features.
✓ This is useful for alignment during training and for state selection during synthesis.
✓ The Viterbi algorithm recursively computes the Viterbi path probabilities δ(t, i), which represent
the probability of being in the most likely state sequence up to time t and ending in state i.
✓ The Viterbi path probability δ(t, i) is computed as follows:
δ(t, i) = max[δ(t-1, j) * A[j][i] * B[i][O(t)]],
where max is taken over all possible states j at time t-1.
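Likewise, a minimal sketch of the Viterbi recursion with backtracking is shown below; the same kind of toy two-state model is assumed (and repeated here so the example is self-contained).

import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi algorithm: most likely state sequence for an observation sequence."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)                  # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A             # scores[j, i] = delta(t-1, j) * A[j][i]
        psi[t] = scores.argmax(axis=0)                 # best previous state for each i
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy 2-state HMM with 3 discrete observation symbols (illustrative values)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, obs=[0, 1, 2]))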
5.6. SINEWAVE SPEECH SYNTHESIS
✓ Sinewave speech synthesis is a speech synthesis technique that uses pure sinusoidal tones
(sinewaves) to recreate speech-like sounds.
✓ It is a simple but effective method for generating speech-like waveforms with intelligible
speech content.
✓ The basic idea behind sinewave speech synthesis is to analyse the formant structure of human
speech and represent it using pure sinewaves.
✓ Formants are the resonant frequencies in the vocal tract that give each vowel its characteristic
sound.
✓ By synthesizing speech using only the formant frequencies, sinewave speech provides a
stripped-down representation of speech that is still recognizable as speech.
1. Formant Analysis:
The first step in sinewave speech synthesis is to analyze the speech waveform to identify
the formant frequencies of the vowels present in the speech. Let's denote the formant
frequencies as F_1, F_2, F_3, ..., F_n.
2. Sinewave Generation:
• For each formant frequency, a sinewave tone is generated with the corresponding
frequency and amplitude.
• The amplitude of each sinewave is typically set to 1 to keep it simple, as the relative
amplitudes are adjusted later during the synthesis process.
• The equation for generating a sinewave tone with frequency F_i is given by:
s_i(t) = A * sin(2π * F_i * t),
• where s_i(t) represents the sinewave tone at time t with frequency F_i, and A is the
amplitude of the sinewave.

3. Superposition:
• The next step involves superimposing the sinewaves corresponding to different
formants to create a composite sinewave representing the synthesized speech
sound.
• Let's assume there are n formants in the speech. The composite sinewave is
obtained by summing the individual sinewaves:
s(t) = s_1(t) + s_2(t) + ... + s_n(t),
• where s(t) represents the synthesized speech waveform at time t.
4. Time-Varying Amplitude:
• To create the desired speech sound, the amplitudes of the sinewaves are varied over
time according to the phonetic and prosodic features of the speech.
• The time-varying amplitude can be determined based on the desired speech sound
and is typically controlled by some parameters or functions.
• Let's denote the time-varying amplitude for the sinewave with frequency F_i as
A_i(t).
• The final sinewave speech synthesis equation incorporating time-varying
amplitude is given by:
s(t) = A_1(t) * sin(2π * F_1 * t) + A_2(t) * sin(2π * F_2 * t) + ...
+ A_n(t) * sin(2π * F_n * t).
• The time-varying amplitudes A_i(t) can be determined based on the target speech
sound and desired prosody.
• They control the loudness and shape of the individual formants over time, resulting
in a synthesized speech waveform that approximates the speech sound represented
by the formant frequencies.
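The whole procedure can be sketched in a few lines of NumPy; the three static formant frequencies, their relative amplitudes, and the Hanning-shaped amplitude envelope below are assumptions chosen purely for illustration.

import numpy as np

def sinewave_speech(formant_tracks, amp_tracks, fs=16000):
    """Sum sinewaves at the formant frequencies with time-varying amplitudes.

    formant_tracks, amp_tracks: arrays of shape (n_formants, n_samples),
    giving F_i(t) in Hz and A_i(t) for every output sample.
    """
    n_formants, n_samples = formant_tracks.shape
    s = np.zeros(n_samples)
    for i in range(n_formants):
        # Integrate the instantaneous frequency so F_i may vary smoothly over time
        phase = 2 * np.pi * np.cumsum(formant_tracks[i]) / fs
        s += amp_tracks[i] * np.sin(phase)
    return s

# Example: a 0.5 s /a/-like sound from three static formants (illustrative values)
fs, dur = 16000, 0.5
n = int(fs * dur)
formants = np.tile(np.array([[700.0], [1220.0], [2600.0]]), (1, n))
amps = np.tile(np.array([[1.0], [0.5], [0.25]]), (1, n)) * np.hanning(n)
speech = sinewave_speech(formants, amps, fs)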
5.7 SPEECH TRANSFORMATIONS
✓ Speech transformations refer to various signal processing and manipulation techniques used to
modify and enhance speech signals.
✓ These transformations can be applied for a range of purposes, including speech enhancement,
voice conversion, speech synthesis, and more.
✓ Some common speech transformations include:
1. Pitch Shifting: Pitch shifting modifies the fundamental frequency (pitch) of the speech
signal while maintaining the speech content. It is used in applications like voice
modulation, creating harmonies, and altering the perceived gender of the speaker.
2. Time Stretching: Time stretching alters the duration of the speech signal without changing
its pitch. It can be used to speed up or slow down speech, which finds applications in
language learning, voice dubbing, and audio editing.

3. Speech Enhancement: Speech enhancement techniques aim to improve the quality and
intelligibility of speech in noisy environments. Methods such as spectral subtraction,
Wiener filtering, and minimum mean-square error estimation are used to reduce
background noise and enhance speech signals.
4. Voice Conversion: Voice conversion transforms the voice of a speaker to sound like
another speaker without altering the linguistic content. It is commonly used in the
entertainment industry and voice acting.
5. Formant Shifting: Formant shifting modifies the resonant frequencies (formants) in the
speech signal, affecting the vowel quality. It can be used to simulate different accents or
change the speaker's perceived vocal tract characteristics.
6. Vocoder Techniques: Vocoder algorithms analyse and synthesize speech by separating the
speech signal into its spectral and temporal components. Vocoder techniques are used in
speech synthesis, voice coding, and voice encryption.
7. Resampling: Resampling changes the sample rate of the speech signal, effectively altering
its playback speed. It can be used in speech synthesis and audio processing applications.
8. Spectral Manipulation: Spectral manipulation techniques modify the spectral content of
the speech signal. This includes methods like spectral shaping, filtering, and spectral
envelope modification.
9. Formant Synthesis: Formant synthesis involves synthesizing speech using the formant
frequencies of the vocal tract to create specific vowel sounds. It is used in speech synthesis
and phonetic research.
10. Prosody Modification: Prosody refers to the rhythm, intonation, and stress patterns of
speech. Prosody modification techniques can alter the emotional expression or emphasis in
the speech signal.
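As a small illustration of the resampling idea behind several of these transformations, the sketch below changes playback rate by linear-interpolation resampling; note that this shifts pitch and duration together, so practical pitch shifters pair it with a separate time-scale modification step. The tone, sampling rate, and shift amount are assumptions for the example.

import numpy as np

def resample_rate(x, rate):
    """Resample x by the given rate (>1 shortens the signal and raises pitch, <1 the opposite)."""
    n_out = int(len(x) / rate)
    old_idx = np.arange(len(x))
    new_idx = np.linspace(0, len(x) - 1, n_out)
    return np.interp(new_idx, old_idx, x)   # linear-interpolation resampling

# Example: raise a 220 Hz tone by roughly four semitones (rate = 2**(4/12))
fs = 16000
tone = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)
shifted = resample_rate(tone, rate=2 ** (4 / 12))   # ~277 Hz, but about 21% shorter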
5.8 WATERMARKING FOR AUTHENTICATION OF SPEECH
✓ Speech watermarking is a technique used to embed an imperceptible and robust watermark into
speech signals for the purpose of authentication and copyright protection.
✓ The watermark serves as a unique identifier that can be used to verify the authenticity of the
speech signal and detect any unauthorized alterations or tampering.
1. Watermark Generation:
• A watermark is typically a short sequence of bits or a digital signature generated
using cryptographic techniques.
• The watermark is designed to be imperceptible, meaning it should not be audible
to human listeners, but robust enough to withstand common signal processing
operations and attacks.
2. Watermark Embedding:

• The watermark is embedded into the speech signal using specialized algorithms.
• The embedding process modifies the speech signal in a way that the watermark
information is invisibly embedded within the speech waveform.
3. Authentication:
• During the authentication process, the embedded watermark is extracted from
the received speech signal.
• The extracted watermark is then compared with the original watermark to
determine if the speech signal is authentic or has been altered in any way.
4. Robustness:
• Speech watermarking should be robust against common signal processing
operations such as compression, noise addition, filtering, and other
transformations that the speech signal may undergo during transmission or
storage.
• Robust watermarking ensures that the watermark can still be reliably extracted
even after such operations.
5. Security:
• Watermarking algorithms should be designed to be resistant to attacks
attempting to remove or modify the watermark, ensuring the integrity of the
authentication process.
6. Imperceptibility:
• The watermark should be imperceptible to human listeners so that it does not
degrade the quality or intelligibility of the speech signal.
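A minimal, hypothetical sketch of the embed-and-detect idea using an additive spread-spectrum watermark is shown below; the key, embedding strength, and correlation threshold are illustrative assumptions, and real systems add perceptual shaping, synchronization, and cryptographic protection of the key.

import numpy as np

def embed_watermark(signal, key, strength=0.01):
    """Add a key-seeded pseudo-random +/-1 sequence, scaled to stay inaudible."""
    rng = np.random.default_rng(key)
    w = rng.choice([-1.0, 1.0], size=len(signal))
    return signal + strength * w

def detect_watermark(signal, key, threshold=0.005):
    """Correlate with the same key-seeded sequence; high correlation => watermark present."""
    rng = np.random.default_rng(key)
    w = rng.choice([-1.0, 1.0], size=len(signal))
    correlation = float(np.mean(signal * w))
    return correlation > threshold, correlation

# Example round trip (the host tone and sampling rate are assumptions)
fs = 16000
host = 0.5 * np.sin(2 * np.pi * 300 * np.arange(4 * fs) / fs)
marked = embed_watermark(host, key=1234)
print(detect_watermark(marked, key=1234))   # (True, ~0.01)
print(detect_watermark(host, key=1234))     # (False, ~0.0)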
Applications of Speech Watermarking:
✓ Copyright Protection: Speech watermarking can be used to protect copyrighted audio content,
such as speeches, podcasts, or audio books, from unauthorized distribution and reproduction.
✓ Authentication of Digital Audio: Speech watermarking can be applied in forensics to verify
the authenticity of recorded audio evidence used in legal proceedings.
✓ Multimedia Content Authentication: Speech watermarking can be used as part of a
comprehensive multimedia content authentication system to verify the integrity and origin of
audiovisual content.
5.9 EMOTION RECOGNITION FROM SPEECH
✓ Emotion recognition from speech is a field of research in speech processing and machine
learning that focuses on automatically detecting and identifying emotions expressed in speech
signals.
✓ Emotion recognition systems aim to infer the emotional state of a speaker based on the acoustic
features and prosodic characteristics present in their speech.

✓ The process of emotion recognition from speech typically involves the following steps:
1. Data Collection: Emotion recognition systems require a labeled dataset containing speech
samples with corresponding emotion labels. These samples are often collected by recording
speakers while inducing various emotional states, such as happiness, sadness, anger, fear,
etc.
2. Feature Extraction: Acoustic features are extracted from the speech signal to capture the
characteristics relevant to emotion expression.
Commonly used features include:
o Mel-frequency cepstral coefficients (MFCCs) to capture spectral information.
o Pitch and pitch variation (fundamental frequency) to assess prosodic features.
o Energy to represent the intensity of the speech signal.
3. Feature Selection and Dimensionality Reduction: Depending on the application and the
algorithm used, feature selection and dimensionality reduction techniques may be applied
to reduce the number of features while preserving relevant information.
4. Emotion Classification: Machine learning algorithms, such as Support Vector Machines
(SVM), Neural Networks, or Hidden Markov Models (HMM), are trained on the extracted
features and corresponding emotion labels. During training, the model learns the patterns
that distinguish different emotions in the speech data.
5. Evaluation: The trained model is evaluated on a separate set of speech samples to assess
its performance in correctly recognizing emotions. Various metrics, such as accuracy,
precision, recall, and F1 score, are used to measure the system's performance.
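A skeletal sketch of steps 2-5 is shown below, using librosa for feature extraction and scikit-learn for classification; load_labeled_clips() is a placeholder for a real labeled emotion corpus (here it just returns random noise with random labels), so the printed accuracy is meaningless and only the shape of the pipeline is illustrated.

import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def extract_features(y, sr):
    """Mean MFCCs plus mean pitch and energy as a simple utterance-level feature vector."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)    # frame-wise pitch track
    return np.concatenate([mfcc, [np.nanmean(f0), float(np.mean(y ** 2))]])

def load_labeled_clips(n=40, sr=16000):
    """Placeholder for a real labeled corpus: random 1 s clips with random labels."""
    rng = np.random.default_rng(0)
    labels = ["happy", "sad", "angry", "neutral"]
    return [(0.1 * rng.standard_normal(sr), sr, rng.choice(labels)) for _ in range(n)]

clips = load_labeled_clips()
X = np.array([extract_features(y, sr) for y, sr, _ in clips])
y_labels = [label for _, _, label in clips]

X_train, X_test, y_train, y_test = train_test_split(X, y_labels, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))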
Challenges in Emotion Recognition from Speech:
✓ Speech is highly variable and context-dependent, making it challenging to
accurately capture emotions across different speakers and languages.
✓ The same emotion can be expressed differently by different individuals, making it
difficult to define universal emotional features.
✓ Emotions can be subtle and continuous, making it challenging to categorize them
into discrete classes.
Applications of Emotion Recognition from Speech:
✓ Human-Computer Interaction: Emotion recognition can enhance the interaction
between humans and computers by enabling systems to respond appropriately to
users' emotional states.
✓ Market Research: Emotion recognition can be used in market research to assess
consumers' emotional responses to products and advertisements.
✓ Health Care: Emotion recognition systems can assist in diagnosing and
monitoring emotional disorders, such as depression and anxiety.

