Lecture Notes - Speech Processing
1. Lungs: The process of speech production starts with the lungs, which provide the airflow
necessary for speech. Air is exhaled from the lungs and travels upwards towards the vocal tract.
2. Trachea: The trachea, also known as the windpipe, is the passage through which air travels from
the lungs towards the vocal tract.
3. Vocal cords (Vocal Folds): The vocal cords are a pair of muscular folds located in the larynx
(voice box). They are capable of opening and closing, and their vibration produces sound. When
air passes between the vibrating vocal cords, it creates a buzzing sound.
4. Larynx: The larynx is a cartilaginous structure located in the throat. It houses the vocal cords
and plays a crucial role in pitch and voice modulation.
5. Pharynx: The pharynx is a cavity at the back of the throat. It serves as a resonating chamber for
speech sounds.
6. Oral Cavity: The oral cavity consists of the mouth and its various structures, including the
tongue, teeth, and hard and soft palates. It plays a significant role in shaping speech sounds.
7. Articulators:
✓ Tongue: The tongue is a highly flexible organ that assists in shaping the sound produced by
the vocal cords. It can make various configurations against the roof of the mouth or other
parts of the oral cavity to produce different speech sounds.
✓ Teeth: The interaction of the tongue with the teeth can create specific sounds like "th," "s,"
and "z."
✓ Hard Palate: The hard palate, located on the roof of the mouth, also plays a role in shaping
sounds.
✓ Soft Palate (Velum): The soft palate can be raised or lowered to close off the nasal passage
or allow air to pass through the nose, producing nasal sounds.
✓ Lips: The movement of the lips plays a crucial role in shaping many speech sounds,
especially bilabial sounds like "p," "b," and "m."
8. Nasal Cavity: The nasal cavity is located behind the nose. During speech, air can pass through
the nasal cavity, resulting in nasal sounds for certain speech sounds like "m" and "n."
Fig. Understanding Phonetics
1. Articulatory Phonetics:
✓ Articulatory phonetics is concerned with the study of how speech sounds are physically
produced or articulated by the human vocal tract and articulatory organs.
✓ It investigates the movements and positions of the various speech organs, such as the tongue,
lips, teeth, alveolar ridge, and velum (soft palate), during speech sound production.
✓ Articulatory phonetics describes the specific articulatory configurations that lead to the creation
of different speech sounds.
For example:
✓ The articulation of the vowel /i/ involves the tongue being in a high front position.
✓ The articulation of the consonant /p/ requires the lips to come together to block the airflow and
then release the blockage to create a sound.
2. Acoustic Phonetics:
✓ Acoustic phonetics is the study of the physical properties of speech sounds as sound waves
travel through the air. It deals with the analysis of the frequencies, amplitudes, and durations of
these sound waves.
✓ Acoustic phonetics helps us understand how speech sounds differ in terms of their acoustic
properties and how they are perceived by the listener's auditory system.
For example:
✓ The vowel /a/ is characterized by a low-frequency sound with a relatively open vocal tract,
resulting in a larger and more open sound wave pattern.
✓ The fricative /s/ is characterized by high-frequency sound waves due to the turbulent airflow
caused by the constriction between the tongue and the alveolar ridge.
3. Auditory Phonetics:
✓ Auditory phonetics focuses on the perception and processing of speech sounds by the human
auditory system.
✓ It examines how the brain interprets the acoustic information received from the environment
and recognizes different speech sounds.
✓ Auditory phonetics plays a crucial role in understanding how humans perceive and
distinguish speech sounds, even in challenging listening conditions.
For example:
✓ The auditory system can discriminate between two similar speech sounds, such as /b/ and /p/,
based on subtle differences in their acoustic properties, like voicing onset time.
✓ Auditory phonetics helps explain how listeners can recognize speech sounds even in the
presence of background noise.
4. Categorization of Speech Sounds:
Speech sounds can be categorized based on various phonetic features. Here are some common
categorizations:
✓ Place of Articulation: Categorizes consonant sounds based on where in the vocal tract the
airflow is constricted or blocked during articulation. Examples include bilabials (/p/, /b/),
alveolars (/t/, /d/), and velars (/k/, /g/).
✓ Manner of Articulation: Classifies consonant sounds based on the degree of airflow
constriction or how the airflow is manipulated during articulation. Examples include stops (/p/,
/t/, /k/), fricatives (/f/, /s/, /ʃ/), and nasals (/m/, /n/, /ŋ/).
✓ Voicing: Distinguishes between consonant sounds based on whether the vocal cords vibrate
during their production. Examples include voiced (/b/, /d/, /g/) and voiceless (/p/, /t/, /k/)
sounds.
✓ Vowels: Vowels are categorized based on the position of the tongue and lips during articulation
and the height and backness of the tongue. Examples include /i/, /e/, /a/, /o/, and /u/.
✓ Prosody: This refers to the rhythm, intonation, and stress patterns of speech. It helps convey
meaning and emotion in connected speech.
✓ These categorizations and the understanding of articulatory, acoustic, and auditory aspects of
speech sounds are fundamental to the study of phonetics and linguistics, contributing to our
comprehension of human language and communication.
2. Cepstrum
✓ Cepstrum - the inverse Fourier transform of the log magnitude spectrum of a signal.
✓ The cepstrum of a discrete-time signal is defined as
c[n] = (1/2π) ∫ log |X(e^jω)| e^(jωn) dω (integral over −π ≤ ω ≤ π),
where log |X(e^jω)| is the logarithm of the magnitude of the DTFT of the signal. The concept is extended by defining the complex cepstrum as
x̂[n] = (1/2π) ∫ log X(e^jω) e^(jωn) dω (integral over −π ≤ ω ≤ π),
which uses the complex logarithm (magnitude and phase) of the DTFT.
✓ Computation Using the DFT: in practice the DTFT is replaced by the DFT, so the cepstrum is approximated as c[n] ≈ IDFT{ log |DFT{x[n]}| }.
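A minimal NumPy sketch of this DFT-based computation (assuming a single mono signal frame is already available; the small epsilon is added here only to avoid taking the log of zero):

```python
import numpy as np

def real_cepstrum(x, n_fft=512):
    """Approximate the real cepstrum of one frame using the DFT."""
    spectrum = np.fft.rfft(x, n=n_fft)           # DFT of the frame
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # log magnitude spectrum
    return np.fft.irfft(log_mag, n=n_fft)        # inverse DFT -> cepstrum

# Example: cepstrum of a synthetic periodic frame
fs = 16000
t = np.arange(0, 0.032, 1 / fs)
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
c = real_cepstrum(frame)
```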
Fig. Weighting functions for Mel-frequency filter bank.
Note that the bandwidths in the figure are constant for center frequencies below 1 kHz and then increase exponentially up to half the sampling rate (4 kHz), resulting in a total of 22 “filters.”
✓ The mel-frequency spectrum of frame n is computed as
MF_n[r] = (1/A_r) * Σ[k = L_r to U_r] |V_r[k] * X_n[k]|^2,
where V_r[k] is the triangular weighting function for the rth filter, ranging from DFT index L_r to U_r, and
A_r = Σ[k = L_r to U_r] |V_r[k]|^2
✓ is a normalizing factor for the rth mel-filter. This normalization is built into the weighting functions of the figure.
✓ It is needed so that a perfectly flat input Fourier spectrum will produce a flat mel-spectrum.
✓ For each frame, a discrete cosine transform of the log of the magnitude of the filter outputs is computed to form the function mfcc_n[m]:
mfcc_n[m] = (1/R) * Σ[r = 1 to R] log(MF_n[r]) * cos((2π/R) * (r + 1/2) * m),
where R is the number of mel-filters.
✓ Two basic short-time analysis functions useful for speech signals are the short-time energy and
the short-time zero-crossing rate.
✓ These functions are simple to compute, and they are useful for estimating properties of the
excitation function in the model.
✓ The short-time energy is defined as
E_n = Σ[m] (x[m] * w[n − m])^2,
where w[n] is the analysis window.
✓ The short-time zero-crossing rate is defined as the weighted average of the number of times the speech signal changes sign within the time window.
✓ Representing this operator in terms of linear filtering leads to
Z_n = Σ[m] 0.5 * |sgn(x[m]) − sgn(x[m − 1])| * w[n − m],
where sgn(·) is the sign function.
✓ The figure below shows an example of the short-time energy and zero-crossing rate for a segment
of speech with a transition from unvoiced to voiced speech.
✓ In both cases, the window is a Hamming window (two examples shown) of duration 25ms
(equivalent to 401 samples at a 16 kHz sampling rate).
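A short NumPy sketch of these two measurements, assuming a 16 kHz mono signal and the 25 ms Hamming window described above (the 10 ms hop is an illustrative choice):

```python
import numpy as np

def short_time_energy_zcr(x, fs=16000, win_ms=25, hop_ms=10):
    """Frame-by-frame short-time energy and zero-crossing rate."""
    win_len = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    w = np.hamming(win_len)
    energy, zcr = [], []
    for start in range(0, len(x) - win_len, hop):
        frame = x[start:start + win_len]
        energy.append(np.sum((frame * w) ** 2))          # short-time energy
        signs = np.sign(frame)
        zcr.append(0.5 * np.mean(np.abs(np.diff(signs))))  # crossings per sample
    return np.array(energy), np.array(zcr)
```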
Fig. Section of speech waveform with short-time energy and zero-crossing rate superimposed
✓ Note that during the unvoiced interval, the zero-crossing rate is relatively high compared to the
zero-crossing rate in the voiced interval.
✓ Conversely, the energy is relatively low in the unvoiced region compared to the energy in the
voiced region
Fig. Voiced and unvoiced segments of speech and their corresponding STACF
✓ Segmentation by window
✓ Note the peak in the autocorrelation function for the voiced segment at the pitch period and
twice the pitch period, and note the absence of such peaks in the autocorrelation function for
the unvoiced segment.
✓ This suggests that the STACF could be the basis for an algorithm for estimating/detecting the
pitch period of speech.
✓ Usually such algorithms combine the autocorrelation function with other short-time measurements, such as the zero-crossing rate and energy, to aid in making the voiced/unvoiced decision (see the sketch below).
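A hedged sketch of such an estimator, assuming a voiced frame sampled at 16 kHz and plausible pitch limits of 50–400 Hz; the energy/ZCR thresholds for a full voiced/unvoiced decision would be tuned in practice:

```python
import numpy as np

def estimate_pitch_autocorr(frame, fs=16000, f0_min=50, f0_max=400):
    """Estimate the pitch period from the short-time autocorrelation peak."""
    frame = frame - np.mean(frame)
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    peak_lag = lag_min + np.argmax(acf[lag_min:lag_max])
    return peak_lag / fs                       # pitch period in seconds

# A simple voiced/unvoiced check could combine this with energy and ZCR:
# treat a frame as voiced if its energy is high, its ZCR is low, and the
# autocorrelation peak is a large fraction of acf[0].
```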
✓ The model uses a more detailed representation of the excitation in terms of separate source generators for voiced and unvoiced speech.
✓ In this model the unvoiced excitation is assumed to be a random noise sequence, and the voiced
excitation is assumed to be a periodic impulse train with impulses spaced by the pitch period
(P0) rounded to the nearest sample
✓ The pulses needed to model the glottal flow waveform during voiced speech are assumed to be
combined (by convolution) with the impulse response of the linear system, which is assumed
to be slowly-time-varying (changing every 50–100 ms or so)
✓ The system can be described by the convolution expression
s[n] = Σ[m] h[m] * e[n − m] = h[n] * e[n],
where e[n] is the excitation and h[n] is the (slowly time-varying) impulse response.
✓ It is often assumed that the system is an all-pole system with system function of the form:
H(z) = G / (1 − Σ[k = 1 to p] a_k * z^(−k))
Preprocessing the Speech Signal: The recorded speech signal may contain background noise and
unwanted artifacts. Before generating a spectrogram, the speech signal is preprocessed to remove noise
and enhance the quality of the speech signal.
1. Time-Frequency Analysis:
✓ The speech signal is then analyzed using a technique called Short-Time Fourier
Transform (STFT) or other time-frequency analysis methods.
✓ STFT breaks the speech signal into small overlapping segments and calculates the
frequency spectrum for each segment.
✓ The spectrogram analysis involves the use of the Short-Time Fourier Transform
(STFT), which is a variation of the traditional Fourier Transform.
✓ The Fourier Transform is a mathematical operation that transforms a signal from the
time domain to the frequency domain.
✓ It provides information about the frequency components present in the signal.
✓ However, in speech analysis, using the traditional Fourier Transform on the entire
speech signal would not provide sufficient temporal resolution, as speech sounds
change rapidly over time.
✓ The STFT addresses this issue by analysing short segments of the speech signal,
allowing for a time-frequency representation.
STFT(t, f) = ∫ x(τ) * w(t − τ) * e^(−j2πfτ) dτ
Where:
STFT(t, f) represents the value of the Short-Time Fourier Transform at time t and frequency f.
x(τ) is the original speech signal.
w(t - τ) is a windowing function that reduces spectral leakage and smoothens the spectrogram.
e^(-j2πfτ) is a complex exponential that represents the contribution of frequency f at time τ.
Figure: Spectrogram of a speech signal with breath sound (marked as Breath).
2. Visualization as a Spectrogram: The frequency spectrum of each segment is represented as a
column in the spectrogram. As the analysis progresses through time, the columns are stacked
side by side to create a 2D image. The intensity of each frequency component is represented by
the color or darkness of the corresponding point.
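A minimal sketch of this procedure using SciPy's STFT, assuming a mono signal x sampled at fs; the log magnitude returned here is what is usually displayed as the spectrogram image:

```python
import numpy as np
from scipy.signal import stft

def log_spectrogram(x, fs=16000, win_ms=25, hop_ms=10):
    """Compute a log-magnitude spectrogram (frequency bins x frames)."""
    nperseg = int(fs * win_ms / 1000)
    noverlap = nperseg - int(fs * hop_ms / 1000)
    f, t, Z = stft(x, fs=fs, window="hamming",
                   nperseg=nperseg, noverlap=noverlap)
    return f, t, 20 * np.log10(np.abs(Z) + 1e-10)   # dB scale for display
```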
3. Interpreting Spectrograms:
Spectrograms provide valuable information about the acoustic properties of speech sounds:
a. Formants: Formants are regions of concentrated energy in the speech spectrum. They
correspond to the resonant frequencies of the vocal tract and are essential for vowel
identification.
b. Consonant Transitions: Spectrograms show transitions between consonant sounds and the
adjacent vowels. These transitions reveal information about the place and manner of
articulation.
c. Vowel Quality: Vowels are characterized by specific patterns of formants, which determine
their quality (e.g., high or low, front or back).
d. Voice Onset Time (VOT): Spectrograms can reveal the voicing properties of consonants,
including voice onset time, which distinguishes between voiced and voiceless stops.
e. Prosody: Spectrograms display variations in pitch (fundamental frequency) and intensity,
which are crucial for understanding prosodic features such as stress and intonation.
Spectrographic analysis is widely used in various fields, including linguistics, phonetics, speech
pathology, and speech technology. It helps researchers and practitioners gain valuable insights into the
acoustic properties of speech and aids in the study and characterization of different speech sounds and
patterns.
PITCH FREQUENCY:
✓ The pitch frequency, also known as the fundamental frequency (F0), is the rate at which the
vocal cords vibrate during the production of voiced speech sounds.
✓ The fundamental frequency determines the perceived pitch of a sound.
✓ In speech, pitch is measured in Hertz (Hz), which represents the number of cycles of vocal cord
vibration per second.
✓ The fundamental frequency is typically denoted by "F0" and can be calculated using the
following equation:
F0 = 1 / T
Where:
F0 is the fundamental frequency in Hertz (cycles per second).
T is the period of one vocal cord vibration in seconds.
The period (T) represents the time taken for one complete cycle of vocal cord vibration. It is
the reciprocal of the fundamental frequency (F0):
T = 1 / F0
✓ In practice, measuring the exact period or fundamental frequency of the vocal cord vibration
directly is challenging.
✓ Instead, researchers and speech scientists use various methods, such as signal processing
techniques, to estimate the fundamental frequency from the speech signal.
✓ Common methods include the use of autocorrelation, cepstral analysis, and pitch detection
algorithms.
✓ Pitch period measurement in the spectral domain involves estimating the fundamental
frequency (F0) by analyzing the periodicity of the harmonics in the speech signal's spectrum.
✓ The pitch period (T) can be obtained as the reciprocal of the fundamental frequency (T = 1 /
F0).
PITCH PERIOD MEASUREMENT IN THE SPECTRAL DOMAIN:
✓ Compute the Spectrum: Calculate the Short-Time Fourier Transform (STFT) of the speech
signal to obtain its magnitude spectrum.
✓ The STFT breaks the signal into short overlapping segments and computes the frequency
spectrum for each segment
✓ Given a discrete-time signal x(n) of length N, the STFT is calculated by performing the Fourier
Transform on short segments (frames) of the signal. The process involves the following steps:
a. Windowing: The signal is divided into short overlapping segments, and a window function w(n)
is applied to each segment to reduce spectral leakage. Common window functions include
Hamming, Hanning, and Blackman.
b. Zero-padding (Optional): Zero-padding may be applied to each windowed segment to increase
the frequency resolution of the resulting spectrum.
c. Fourier Transform: The Discrete Fourier Transform (DFT) is applied to each windowed
segment to obtain its frequency spectrum.
d. Assembly: The spectra of successive frames are collected side by side over time to form the STFT (overlap-add is used later, when reconstructing a time-domain signal from a modified STFT).
Equations for STFT Computation:
For frame index m, hop size H, and a window w(n) of length N, the discrete STFT can be written as
X(k, m) = Σ[n = 0 to N − 1] x(n + mH) * w(n) * e^(−j2πkn/N),
where k is the frequency-bin index.
The spectrogram provides a time-frequency representation of the signal, allowing us to observe how
the frequency content of the signal changes over time.
Find Harmonic Peaks: Identify the peaks in the spectrum that correspond to the harmonics of the vocal
cord vibrations. Harmonics are integer multiples of the fundamental frequency, and their presence in
the spectrum indicates the periodicity of the speech signal.
Estimate the Fundamental Frequency (F0): Once the harmonic peaks are identified, the fundamental
frequency (F0) can be estimated as the distance between consecutive harmonic peaks. The distance
between two consecutive harmonics represents the pitch period (T) in seconds.
a. Equation for Pitch Period Estimation:
The pitch period (T) can be calculated as follows:
T = (n - 1) * (1 / Fs)
Where:
T is the pitch period in seconds.
n is the distance (in number of samples) between consecutive harmonic peaks in the spectrum.
Fs is the sampling rate of the speech signal, which represents the number of samples per second.
For example, if the distance between two consecutive harmonic peaks is 50 samples, and the sampling rate is 16,000 samples per second (Fs = 16000), then the pitch period (T) would be:
T = (50 − 1) * (1 / 16000) = 49 / 16000 ≈ 0.0030625 seconds
The reciprocal of the pitch period provides the fundamental frequency (F0):
F0 = 1 / T = 1 / 0.0030625 ≈ 326.5 Hz
In this example, the estimated fundamental frequency (F0) of the speech signal is approximately 326.5 Hz.
Pitch period measurement in the spectral domain is a valuable tool for understanding the pitch
characteristics of speech sounds, analyzing prosody, and studying intonation patterns in speech. It is
widely used in speech processing, speech recognition systems, and other applications involving the
analysis of speech signals.
PITCH PERIOD MEASUREMENT IN THE CEPSTRAL DOMAIN:
Step 1: Divide the speech signal into short overlapping frames and apply a window function to each frame.
Step 2: Compute the magnitude spectrum of each frame and take its logarithm.
Step 3: Apply the inverse Fourier transform to the log magnitude spectrum to obtain the cepstrum, c(quefrency).
Where "quefrency" is the quefrency index, representing the time domain after the inverse Fourier transform.
Step 4: Identify the dominant peak in the cepstral domain to estimate the pitch period (T0). The position
of the peak corresponds to the periodicity of the speech signal.
Step 5: Calculate the pitch period (T0) based on the position of the dominant peak in the cepstrum. The
value of T0 will depend on the quefrency index of the peak.
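A compact sketch of Steps 1–5 for a single frame, assuming a voiced frame sampled at 16 kHz and a plausible pitch search range of 50–400 Hz:

```python
import numpy as np

def cepstral_pitch(frame, fs=16000, f0_min=50, f0_max=400):
    """Estimate the pitch period from the dominant cepstral peak."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))      # Steps 1-2
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))   # Step 3
    q_min, q_max = int(fs / f0_max), int(fs / f0_min)           # quefrency range
    peak_q = q_min + np.argmax(cepstrum[q_min:q_max])           # Step 4
    T0 = peak_q / fs                                            # Step 5: period (s)
    return T0, 1.0 / T0                                         # (T0, F0)
```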
FORMANTS
✓ Formants are essential acoustic resonances that characterize the vocal tract's frequency response
during speech production.
✓ They are created as a result of the vocal tract's configuration, which acts as a series of tubes and
cavities. Formants play a crucial role in determining the quality and timbre of speech sounds
and are critical in distinguishing different vowels.
✓ When we produce speech sounds, the vocal tract (the throat, mouth, and nasal cavity) acts as a
resonating system.
✓ As air is expelled from the lungs and passes through the vocal cords, the configuration of the
vocal tract changes, causing certain frequencies to resonate more strongly than others. These
resonant frequencies are known as formants.
✓ The first three formants, denoted as F1, F2, and F3, are particularly important in speech
production:
F1: First Formant (Frequency influenced by tongue height)
F2: Second Formant (Frequency influenced by tongue front-back position)
F3: Third Formant (Frequency influenced by lip rounding)
Explanation:
1. The vocal tract consists of the oral cavity, pharynx, and larynx.
2. The tongue, a primary articulator, can be positioned at different heights and front-back positions to
create different vowel sounds.
3. F1 is primarily influenced by the height of the tongue (high or low).
4. F2 is mainly influenced by the front-back position of the tongue (front or back).
5. F3 is influenced by the degree of lip rounding (rounded or unrounded).
F1 (First Formant):
• Represents the lowest resonant frequency of the vocal tract.
• Primarily influenced by the height of the tongue.
• A high tongue position results in a lower F1 frequency (e.g., in the vowel sound /i/ as
in "see"), while a low tongue position results in a higher F1 frequency (e.g., in the
vowel sound /ɑ/ as in "father").
F2 (Second Formant):
• Represents the second resonant frequency of the vocal tract.
• Mainly influenced by the front-back position of the tongue.
• A front tongue position results in a higher F2 frequency (e.g., in the vowel sound /i/ as
in "see"), while a back tongue position results in a lower F2 frequency (e.g., in the
vowel sound /u/ as in "blue").
F3 (Third Formant):
• Represents the third resonant frequency of the vocal tract.
• Influenced by the shape of the lips.
• The degree of lip rounding affects the F3 frequency.
• Rounded lips result in a lower F3 frequency (e.g., in the vowel sound /u/ as in "blue"),
while unrounded lips result in a higher F3 frequency (e.g., in the vowel sound /i/ as in
"see").
Formants are essential for the perception of vowels in human speech. Different vowel sounds are
characterized by distinct patterns of formant frequencies, allowing us to identify and differentiate vowel
sounds in speech.
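One common practical approach, not described above, estimates formant frequencies from the roots of a linear prediction (LPC) polynomial; LPC itself is introduced later in these notes. The sketch below assumes a 16 kHz voiced frame and an LPC order of 12, and is only a rough illustration:

```python
import numpy as np

def lpc_coefficients(frame, order=12):
    """LPC coefficients via the autocorrelation (Yule-Walker) method."""
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[acf[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, acf[1:order + 1])        # a(1)..a(p)

def estimate_formants(frame, fs=16000, order=12):
    """Rough formant estimates from the angles of the LPC polynomial roots."""
    a = lpc_coefficients(frame * np.hamming(len(frame)), order)
    roots = np.roots(np.concatenate(([1.0], -a)))       # roots of 1 - sum a_k z^-k
    roots = roots[np.imag(roots) > 0]                   # keep one of each conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi)) # convert angles to Hz
    return freqs[freqs > 90][:3]                        # rough F1, F2, F3 (Hz)
```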
Formants play a significant role in both voiced and unvoiced speech, but their characteristics and
behavior differ between these two types of speech sounds. Voiced speech is characterized by periodic
vocal cord vibrations, while unvoiced speech is produced without vocal cord vibration. Let's evaluate
the behavior of formants in both types of speech:
1. Voiced Speech:
✓ Formant Presence: In voiced speech, formants are present and prominent. The periodic
vocal cord vibrations create resonances in the vocal tract, resulting in clearly identifiable
formants.
✓ Stability: Formants in voiced speech are relatively stable over time because the vocal cord
vibrations maintain consistent resonances in the vocal tract.
✓ Frequency Range: The first three formants (F1, F2, and F3) are typically well-defined and
fall within specific frequency ranges associated with different vowel sounds. These formants
are essential in distinguishing and identifying vowel sounds during speech production.
✓ Spectral Envelope: The spectral envelope of voiced speech shows peaks corresponding to
the formant frequencies, indicating the resonance of the vocal tract at specific frequencies.
2. Unvoiced Speech:
✓ Formant Absence: In unvoiced speech, formants are generally less prominent or absent.
Since unvoiced sounds are produced without vocal cord vibration, there are no periodic
oscillations to create stable resonances in the vocal tract.
✓ Spectral Characteristics: Unvoiced speech is characterized by broad-spectrum noise-like
sounds. Without the presence of periodic vibrations and formants, the speech signal lacks the
distinct peaks associated with voiced speech.
✓ Spectral Continuity: The spectral envelope of unvoiced speech is relatively flat or shows
gradual changes rather than sharp peaks associated with formants.
✓ Noise-like Properties: Unvoiced speech sounds often have a turbulent or hissing quality due
to the random airflow through the constriction in the vocal tract (e.g., /s/ as in "see").
UNIT II SPEECH FEATURES AND DISTORTION MEASURES
✓ Cepstral coefficients are features derived from the cepstral domain of a signal and are
commonly used in speech and audio processing tasks.
✓ The cepstral domain represents the inverse Fourier transform of the logarithm of the
magnitude of the Fourier transform of a signal.
✓ Cepstral coefficients are obtained by further processing the cepstrum to extract useful
information for various applications.
✓ The most widely used cepstral coefficients are Mel-frequency cepstral coefficients
(MFCCs), which are popular in speech recognition and speaker identification tasks.
✓ Here's an explanation of cepstral coefficients, with a focus on MFCCs:
1. Cepstral Domain:
As mentioned earlier, the cepstral domain is obtained by taking the inverse Fourier transform
of the logarithm of the magnitude spectrum of a signal. The process involves the following
steps:
a. Compute the Short-Time Fourier Transform (STFT) of the signal.
b. Calculate the magnitude spectrum from the STFT.
c. Take the logarithm of the magnitude spectrum.
d. Compute the inverse Fourier transform of the log magnitude to obtain the cepstrum.
2. Mel-frequency Cepstral Coefficients (MFCCs):
MFCCs are a type of cepstral coefficients that are widely used in speech processing tasks,
especially for automatic speech recognition (ASR). The MFCC extraction process involves the
following steps:
✓ Pre-emphasis: A pre-emphasis filter is applied to boost the higher frequencies of the signal and compensate for the natural high-frequency roll-off (spectral tilt) of speech.
✓ Framing: The speech signal is divided into short overlapping frames to make the analysis more
localized.
✓ Windowing: A window function (e.g., Hamming window) is applied to each frame to reduce
spectral leakage during Fourier transform computation.
✓ Fourier Transform: The Fast Fourier Transform (FFT) is applied to each framed segment to
convert the signal from the time domain to the frequency domain.
✓ Mel Filterbank: A set of triangular filters in the Mel scale is applied to the power spectrum
obtained from the Fourier transform. The Mel scale is a perceptual scale of pitches that
approximates the human auditory system's frequency response.
✓ Log Compression: The logarithm of the filterbank energies is taken to convert the values into
the logarithmic scale. This step emphasizes lower energy values, making the representation
more robust to noise and variations.
✓ Discrete Cosine Transform (DCT): Finally, the Discrete Cosine Transform is applied to the log-
filter bank energies to obtain the MFCCs. The DCT decorrelates the filterbank coefficients,
reducing the dimensionality and capturing the most relevant information.
1. Pre-processing:
Before computing the MFCCs, the speech signal undergoes some pre-processing steps, such as:
a. Pre-emphasis: The speech signal is pre-emphasized to boost higher frequencies and compensate for the natural high-frequency roll-off (spectral tilt) of speech. The pre-emphasis filter is typically applied as:
y(t) = x(t) - α * x(t-1)
where:
y(t) is the pre-emphasized signal at time t,
x(t) is the original speech signal at time t,
α is the pre-emphasis coefficient (typically around 0.97),
x(t-1) is the speech signal at the previous time instant.
2. Framing and Windowing:
The pre-emphasized speech signal is divided into short overlapping frames, and a window
function (e.g., Hamming window) is applied to each frame to reduce spectral leakage during
Fourier transform computation.
3. Short-Time Fourier Transform (STFT):
The Fast Fourier Transform (FFT) is applied to each framed segment of the pre-emphasized
speech signal to convert the signal from the time domain to the frequency domain.
4. Mel Filterbank:
A set of triangular filters in the Mel scale is applied to the power spectrum obtained from the
Fourier transform. The Mel scale is a perceptual scale of pitches that approximates the human
auditory system's frequency response.
The Mel filterbank is created by a series of triangular filters, each spanning a specific range of
frequencies in the Mel scale. The filterbank converts the power spectrum into a set of filterbank
energies.
5. Log Compression:
The log of the filterbank energies is taken to convert the values into the logarithmic scale. This
step emphasizes lower energy values, making the representation more robust to noise and
variations:
C(i) = log(Energy(i))
where:
C(i) is the log-compressed energy of the i-th filterbank,
Energy(i) is the energy obtained from the i-th filterbank.
6. Discrete Cosine Transform (DCT):
Finally, the Discrete Cosine Transform is applied to the log-filterbank energies to obtain the
MFCCs. The DCT decorrelates the filterbank coefficients, reducing the dimensionality and
capturing the most relevant information:
MFCCs(k) = ∑ [C(i) * cos((π * k * (i - 0.5)) / num_filters)]
where:
MFCCs(k) is the k-th MFCC coefficient,
C(i) is the log-compressed energy from the i-th filterbank,
num_filters is the number of filterbanks,
k is the index of the MFCC coefficient (usually starts from 1).
The resulting MFCCs form the feature vector representing the speech signal and are commonly
used in speech recognition, speaker identification, and other speech-related tasks.
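A condensed NumPy/SciPy sketch of the whole pipeline (pre-emphasis through DCT), assuming 16 kHz input, 25 ms frames with a 10 ms hop, and 26 mel filters; a production implementation would add details such as liftering and energy handling:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):  return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m):  return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fbank

def mfcc(x, fs=16000, n_fft=512, n_filters=26, n_ceps=13, alpha=0.97):
    """MFCCs: pre-emphasis, framing, Hamming window, FFT, mel filterbank, log, DCT."""
    x = np.append(x[0], x[1:] - alpha * x[:-1])            # pre-emphasis
    frame_len, hop = int(0.025 * fs), int(0.010 * fs)
    fbank = mel_filterbank(n_filters, n_fft, fs)
    feats = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2      # power spectrum
        log_energies = np.log(fbank @ power + 1e-10)        # log mel energies
        feats.append(dct(log_energies, type=2, norm="ortho")[:n_ceps])
    return np.array(feats)
```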
PERCEPTUAL LINEAR PREDICTION (PLP)
1. Pre-Emphasis:
The speech signal is pre-emphasized to boost higher frequencies:
y(t) = x(t) - α * x(t-1)
Where:
y(t) is the pre-emphasized signal at time t,
x(t) is the original speech signal at time t,
α is the pre-emphasis coefficient (typically around 0.97),
x(t-1) is the speech signal at the previous time instant.
2. Frame Blocking and Windowing:
✓ The pre-emphasized speech signal is divided into short overlapping frames, and a window
function (e.g., Hamming window) is applied to each frame.
✓ The windowing operation is given by:
w(n) = 0.54 - 0.46 * cos((2 * π * n) / (N - 1))
Where:
w(n) is the value of the window at index n,
N is the length of the window (usually the frame size).
The window function helps to taper the signal at the frame edges, reducing spectral leakage during
Fourier transform computation.
3. Discrete Fourier Transform (DFT):
✓ The Fourier transform is applied to each windowed frame to convert the signal from the time
domain to the frequency domain.
✓ The DFT equation is:
X(k) = Σ[n = 0 to N − 1] x(n) * e^(−j * 2 * π * k * n / N)
Where:
X(k) is the frequency domain representation at frequency index k,
x(n) is the windowed speech signal at time index n,
N is the length of the DFT (usually the frame size).
4. Perceptually Motivated Filterbank:
✓ The perceptually motivated filterbank is used to convert the magnitude spectrum obtained from
the DFT into a perceptually relevant representation.
✓ The filterbank coefficients are designed to mimic the human auditory system's frequency
resolution and emphasis on specific frequency regions.
✓ The exact equations for the filterbank depend on the design and type of PLP analysis being
used. Typically, the filterbank is implemented as a set of triangular filters in the Mel scale or as
Bark filters.
5. Inverse Discrete Fourier Transform (IDFT):
✓ After passing the spectrum through the perceptually motivated filterbank, the inverse discrete
Fourier transform (IDFT) is applied to obtain the PLP coefficients, which are cepstral
coefficients representing the speech signal in the perceptually enhanced domain.
✓ The IDFT equation is similar to the DFT equation, but with the inverse transform:
c(n) = (1/N) * ∑[k = 0 to N − 1] [X(k) * exp(j * 2 * π * k * n / N)]
Where:
c(n) is the PLP coefficient at cepstral index n,
X(k) is the filterbank output at frequency index k,
N is the length of the IDFT (usually the number of filterbank coefficients).
✓ The resulting PLP coefficients provide a perceptually enhanced representation of the speech
signal, which is more robust and discriminative for various speech processing tasks.
✓ These coefficients are particularly useful in automatic speech recognition (ASR) tasks, where
accurate feature extraction is crucial for recognition accuracy.
LOG FREQUENCY POWER COEFFICIENTS (LFPC)
✓ Log Frequency Power Coefficients (LFPC) is a speech processing technique that involves
extracting log-scaled power coefficients from the speech signal's frequency domain
representation.
✓ LFPC is an alternative to traditional Mel-frequency cepstral coefficients (MFCCs) and is used
in automatic speech recognition (ASR) and other speech-related tasks
✓ The LFPC analysis follows similar steps to MFCC extraction, including pre-emphasis, framing,
windowing, Discrete Fourier Transform (DFT), and power spectrum computation.
✓ However, instead of applying the Mel filterbank, LFPC employs a logarithmically spaced
filterbank to approximate the human auditory system's frequency resolution.
1. Pre-Emphasis:
Similar to other speech processing techniques, the speech signal is pre-emphasized to enhance
higher frequencies and compensate for spectral tilt.
2. Frame Blocking and Windowing:
The pre-emphasized speech signal is divided into short overlapping frames, and a window
function (e.g., Hamming window) is applied to each frame to reduce spectral leakage during
Fourier transform computation.
3. Discrete Fourier Transform (DFT):
The Fourier transform is applied to each windowed frame to convert the signal from the time
domain to the frequency domain. The resulting complex-valued spectrum represents the
frequency content of the speech signal in each frame.
4. Power Spectrum Computation:
The power spectrum is computed from the complex-valued spectrum by squaring the magnitude
of each complex value:
Power(k) = |X(k)|^2
Where:
Power(k) is the power spectrum at frequency index k,
X(k) is the complex-valued spectrum at frequency index k.
5. Logarithmically Spaced Filterbank:
✓ Unlike the Mel filterbank used in MFCC analysis, LFPC employs a logarithmically
spaced filterbank to approximate the human auditory system's frequency resolution.
✓ The logarithmically spaced filterbank is designed to have a constant bandwidth on the logarithmic frequency scale.
✓ The filterbank divides the power spectrum into a set of equally spaced frequency bins
on the logarithmic scale.
✓ The spacing between bins increases with frequency, providing better resolution at lower
frequencies, which is more in line with human auditory perception.
6. Log Compression:
✓ The LFPC coefficients are obtained by taking the logarithm of the energy in each
filterbank bin:
LFPC(k) = log(Energy(k))
Where:
LFPC(k) is the log frequency power coefficient at coefficient index k,
Energy(k) is the energy obtained from the k-th filterbank bin.
✓ The resulting LFPC coefficients provide a log-scaled representation of the speech
signal's spectral characteristics, emphasizing perceptually important frequency regions.
✓ LFPC is particularly useful in ASR tasks, where it can help capture relevant phonetic
information and improve recognition accuracy.
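A brief sketch of the distinguishing step only: a log-spaced filterbank applied to one frame's power spectrum. The number of bands, the 100 Hz lower edge, and the band-edge placement are illustrative assumptions:

```python
import numpy as np

def lfpc_frame(power_spectrum, fs=16000, n_bands=12, f_low=100.0):
    """Log Frequency Power Coefficients for one frame's power spectrum."""
    n_bins = len(power_spectrum)
    freqs = np.linspace(0, fs / 2, n_bins)
    # Band edges spaced logarithmically between f_low and the Nyquist frequency
    edges = np.logspace(np.log10(f_low), np.log10(fs / 2), n_bands + 1)
    coeffs = []
    for b in range(n_bands):
        mask = (freqs >= edges[b]) & (freqs < edges[b + 1])
        coeffs.append(np.log(np.sum(power_spectrum[mask]) + 1e-10))
    return np.array(coeffs)
```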
✓ Speech distortion measures are used to evaluate the quality of speech signals after they have
undergone various processing or transmission.
✓ A simplified distance measure is a type of distortion measure that quantifies the difference or
distance between the original (reference) speech signal and the processed (distorted) speech
signal.
✓ One common simplified distance measure used for speech distortion evaluation is the Mean
Squared Error (MSE).
✓ The Mean Squared Error calculates the average squared difference between corresponding
samples of the reference and distorted speech signals.
✓ Let's define the reference speech signal as x[n] and the distorted speech signal as y[n], where
"n" represents the sample index.
✓ The Mean Squared Error (MSE) is calculated as follows:
MSE = (1/N) * Σ[n = 0 to N − 1] (x[n] − y[n])^2,
where N is the number of samples being compared.
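A one-function sketch, assuming the reference and distorted signals are already aligned NumPy arrays of equal length:

```python
import numpy as np

def mse(reference, distorted):
    """Mean Squared Error between reference x[n] and distorted y[n]."""
    reference = np.asarray(reference, dtype=float)
    distorted = np.asarray(distorted, dtype=float)
    return np.mean((reference - distorted) ** 2)
```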
Encoding: The quantized values are then encoded into binary codes. This encoding step typically
involves representing the quantization levels using a fixed number of bits (e.g., 8 bits, 16 bits) per
sample.
Decoding: To reconstruct the analog signal from the digital representation, the decoding process
involves reversing the quantization and reconstructing the continuous waveform.
Fig. The operations of sampling and quantization
✓ The sampler produces a sequence of numbers x[n] = xc(nT), where T is the sampling period
and fs = 1/T is the sampling frequency.
✓ The quantizer simply takes the real-number inputs x[n] and assigns an output x̂[n] according to the nonlinear discrete-output mapping Q{·}.
✓ In the example of the figure shown below, the output samples are mapped to one of eight possible values, with samples within the peak-to-peak range being rounded and samples outside the range being “clipped” to either the maximum positive or negative level.
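A small sketch of such a uniform quantizer with clipping, assuming a signal normalized to the range [−1, 1) and, as in the figure described above, eight output levels (3 bits):

```python
import numpy as np

def uniform_quantize(x, n_bits=3, x_max=1.0):
    """Round samples to 2**n_bits uniform levels, clipping out-of-range values."""
    levels = 2 ** n_bits
    step = 2 * x_max / levels
    idx = np.floor(np.asarray(x) / step).astype(int)      # mid-rise quantizer
    idx = np.clip(idx, -levels // 2, levels // 2 - 1)     # clip to extreme levels
    return (idx + 0.5) * step                             # reconstruction levels

x = np.array([-1.2, -0.4, 0.0, 0.3, 0.95, 1.4])
x_hat = uniform_quantize(x)    # values clipped and rounded to one of 8 levels
```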
2. Prediction Residual:
• Compute the prediction residual R₀ as the difference between the current sample x₀ and the predicted value P₀:
R₀ = x₀ - P₀
3. Quantization of Residual:
• Quantize the prediction residual R₀ using the current step size Δ₀ to obtain a quantized
residual value Q₀.
4. Encoding:
• Encode the quantized residual Q₀ using a fixed number of bits.
5. Decoding and Reconstruction:
• Decode the encoded quantized residual to obtain the quantized residual value Q₀.
• Reconstruct the quantization level by scaling the quantized residual with the step size
Δ₀:
• R₀' = Q₀ * Δ₀
6. Prediction Update:
• Update the predicted value for the next sample based on the reconstructed quantization
level R₀':
P₁ = P₀ + R₀'
✓ Delta modulation is a simple form of analog-to-digital conversion used in speech and audio
coding.
✓ It's a method that encodes the difference between consecutive samples of an analog signal rather
than encoding the absolute values of each sample.
✓ This can help reduce the amount of data needed to represent the signal, making it suitable for
low-bitrate communication channels.
✓ Let's denote the input analog signal as x(t), and the sampled version of the signal as x[n], where
n is the discrete time index corresponding to the sample.
1. Sampling:
x[n] = x(n·Ts), where Ts is the sampling interval.
2. Prediction and Encoding:
The predicted value of the next sample, xp(n+1), can be calculated as the previous sample value plus the delta value d(n):
xp(n+1) = x(n) + d(n)
✓ The difference between the predicted value and the actual next sample x(n+1) is calculated as:
e[n+1]=x[n+1]−xp[n+1]
The encoder then generates a binary output based on the sign of e[n+1]
3. Decoding:
✓ The reconstructed sample value is then calculated as the predicted value plus the delta value:
x̂(n+1) = xp(n+1) + d(n+1), where d(n+1) = +Δ if the received bit is 1 and −Δ otherwise.
✓ Finally, the reconstructed sample is used to predict the value for the next sample in the sequence:
xp(n+2) = x̂(n+1)
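A minimal sketch of a delta modulator and demodulator with a fixed step size Δ (the step size and test signal here are illustrative):

```python
import numpy as np

def delta_modulate(x, delta=0.1):
    """Encode each sample as one bit: +1 if above the running prediction, else -1."""
    bits, prediction = [], 0.0
    for sample in x:
        bit = 1 if sample >= prediction else -1
        bits.append(bit)
        prediction += bit * delta          # staircase approximation tracked by decoder
    return np.array(bits)

def delta_demodulate(bits, delta=0.1):
    """Rebuild the staircase approximation from the bit stream."""
    return np.cumsum(bits) * delta

fs = 8000
t = np.arange(0, 0.01, 1 / fs)
x = 0.5 * np.sin(2 * np.pi * 200 * t)
x_hat = delta_demodulate(delta_modulate(x))
```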
Adaptive Delta Modulation
✓ Adaptive Delta Modulation (ADM) is an enhancement over basic delta modulation that
dynamically adjusts the step size to better accommodate the characteristics of the input signal.
✓ This helps mitigate some of the limitations of basic delta modulation, such as granularity noise
and limited dynamic range.
✓ The basic idea: the step size Δ is increased when successive output bits have the same sign (indicating slope overload) and decreased when they alternate (indicating granular noise).
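A hedged sketch of one common adaptation rule matching the description above (multiply the step size by K when successive bits agree, divide by K otherwise; K, the step limits, and the initial step are illustrative choices):

```python
import numpy as np

def adaptive_delta_modulate(x, delta0=0.05, K=1.5, delta_min=0.01, delta_max=1.0):
    """ADM encoder: one bit per sample plus a step size that adapts to the slope."""
    bits, prediction, delta, prev_bit = [], 0.0, delta0, 0
    for sample in x:
        bit = 1 if sample >= prediction else -1
        # Same sign as the last bit -> slope overload, grow the step; else shrink it
        delta = min(delta * K, delta_max) if bit == prev_bit else max(delta / K, delta_min)
        prediction += bit * delta
        bits.append(bit)
        prev_bit = bit
    return np.array(bits)
```

The decoder applies the identical step-size rule to the received bit stream, so no extra side information needs to be transmitted.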
3.9. PARAMETRIC SPEECH CODING
✓ Parametric speech coding is a type of speech coding technique that represents speech signals
using a set of model parameters instead of directly encoding the speech waveform.
✓ The primary goal of parametric speech coding is to achieve high compression ratios while
maintaining acceptable speech quality by capturing the essential characteristics of the speech
using a compact set of parameters.
✓ In parametric speech coding, the speech signal is typically modeled using linear prediction
analysis, and the model parameters are quantized and transmitted or stored to be later decoded
and reconstructed at the receiver end.
✓ The main steps involved in parametric speech coding are as follows:
1. Linear Prediction Analysis:
• The speech signal is analyzed using linear prediction analysis (LPC) to
estimate the spectral envelope of the speech signal.
• LPC models the speech signal as a linear combination of past samples, and
the model coefficients represent the spectral envelope of the speech signal.
• In linear prediction analysis, the speech signal is modeled as a linear
combination of past samples. The model can be represented as follows:
x(n) = Σ[i=1 to p] a(i) * x(n-i),
• where:
• x(n) is the current sample of the speech signal.
• a(i) are the linear prediction coefficients, also known as LPC coefficients.
• p is the prediction order, which determines the number of past samples
used in the prediction.
2. Model Parameter Extraction:
• From the LPC analysis, a set of model parameters is extracted, which
typically includes the LPC coefficients, pitch period, and other parameters
describing the excitation signal, such as voicing information.
3. Quantization:
• The model parameters are quantized to reduce the number of bits required
for representation.
• Quantization involves mapping the continuous parameter values to a finite
set of discrete values, reducing the data size for transmission or storage.
4. Encoding:
• The quantized model parameters are encoded into digital format using a
binary representation. The encoded parameters are transmitted or stored
for later reconstruction.
5. Decoding and Synthesis:
• At the receiver end, the encoded model parameters are decoded, and the
speech signal is synthesized using the LPC model and the decoded
parameters.
• The synthesis process involves generating the excitation signal from the
decoded parameters and passing it through the LPC filter to reconstruct
the speech waveform.
• The synthesis equation can be represented as follows:
x(n) = Σ[i=1 to p] a_hat(i) * x(n-i) + excitation_hat(n),
• where x(n) is the reconstructed speech signal at time index n, a_hat(i) are
the decoded LPC coefficients, excitation_hat(n) represents the decoded
excitation signal, and p is the prediction order.
• The excitation signal, excitation_hat(n), can be generated using various
techniques, such as codebooks, pulse trains, or stochastic models,
depending on the specific parametric speech coding algorithm.
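A compact sketch of the analysis and synthesis steps for one frame, using the autocorrelation (Yule-Walker) method for the LPC coefficients; quantization and excitation modeling are omitted, and the white-noise excitation below is only a stand-in for a decoded excitation signal:

```python
import numpy as np
from scipy.signal import lfilter

def lpc_analysis(frame, order=10):
    """Estimate LPC coefficients a(1)..a(p) for one windowed frame."""
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[acf[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, acf[1:order + 1])

def lpc_synthesis(a, excitation):
    """Reconstruct speech by passing the excitation through the all-pole filter."""
    return lfilter([1.0], np.concatenate(([1.0], -a)), excitation)

frame = np.hamming(400) * np.random.randn(400)      # placeholder analysis frame
a = lpc_analysis(frame)
y = lpc_synthesis(a, np.random.randn(400))          # stand-in unvoiced excitation
```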
✓ Advantages of Parametric Speech Coding:
• High Compression Ratios: Parametric speech coding can achieve high
compression ratios, significantly reducing the bit rate required for speech
transmission or storage.
• Low Data Size: The compact representation of speech using model
parameters results in smaller data sizes, making it suitable for low-
bandwidth communication systems.
• Speech Quality: Despite the high compression, parametric speech coding
can preserve acceptable speech quality, especially at higher bit rates.
• Robustness: Parametric speech coding is often more robust to
transmission errors and noise compared to waveform coding techniques.
✓ Disadvantages of Parametric Speech Coding:
• Complexity: The encoding and decoding process can be computationally
complex, requiring additional processing resources.
• Sensitivity to Errors: In certain scenarios, parametric coding may be
more sensitive to transmission errors or losses compared to waveform
coding.
✓ Parametric speech coding has been used in various speech coding standards and applications,
including telecommunications, VoIP systems, and voice storage applications. Examples of
parametric speech coding standards include MELP (Mixed Excitation Linear Prediction) and
some modes of the ITU-T G.723 and G.729 codecs.
✓ Channel vocoders are a class of speech coding algorithms used to encode and transmit speech
signals over band-limited communication channels.
✓ They are particularly suited for low-bit-rate applications, where the available channel
bandwidth is limited, such as in mobile communication systems, internet telephony (VoIP), and
satellite communications.
✓ The primary objective of channel vocoders is to reduce the bit rate required to transmit speech
while maintaining acceptable speech quality.
✓ They achieve this by exploiting the characteristics of the human vocal tract and the perceptual
properties of human hearing.
Fig. Channel Vocoder
4. Quantization:
• The LP coefficients and other model parameters (e.g., pitch period, excitation type) are
quantized to reduce the number of bits required for their representation.
• Quantization involves mapping the continuous parameter values to a finite set of
discrete values.
5. Decoding and Synthesis:
• At the receiver end, the quantized model parameters are decoded, and the speech signal
is synthesized using the LP synthesis filter and the decoded parameters.
• The synthesis process involves generating the excitation signal from the decoded
parameters and passing it through the LP filter to reconstruct the speech waveform.
✓ A hybrid speech coder is a type of speech coding algorithm that combines the advantages of
different coding techniques to achieve high-quality speech coding at low bit rates.
✓ Hybrid coders use a combination of waveform coding and model-based coding methods to
efficiently represent speech signals.
✓ The main idea behind hybrid coding is to use waveform coding for unvoiced speech segments,
which typically have less predictable and more complex waveforms, and model-based coding
for voiced speech segments, which have more regular and periodic characteristics.
✓ By exploiting the strengths of both approaches, hybrid coders can achieve better speech quality
and compression performance compared to individual coding methods alone.
✓ Here are the key components and features of a typical hybrid coder:
1. Voiced-Unvoiced Decision: At the encoder, the speech signal is analyzed to determine
whether a given segment is voiced (contains periodic speech) or unvoiced (contains non-
periodic speech). This decision is essential for selecting the appropriate coding method for
each segment.
2. Model-Based Coding for Voiced Segments: For voiced segments, model-based coding
techniques like Code-Excited Linear Prediction (CELP) or Sinusoidal Coding are used.
These techniques effectively represent the periodic and quasi-periodic characteristics of
voiced speech, providing efficient coding and good speech quality.
3. Waveform Coding for Unvoiced Segments: For unvoiced segments, waveform coding
techniques such as Differential Pulse Code Modulation (DPCM) or Adaptive Differential
Pulse Code Modulation (ADPCM) are used. These techniques are better suited for
capturing the complex and non-periodic nature of unvoiced speech segments.
4. Low Bit Rates: Hybrid coders are typically designed to operate at low bit rates, ranging
from a few kilobits per second to tens of kilobits per second. This makes them suitable for
applications with limited bandwidth, such as mobile communication and voice over packet
networks.
5. Seamless Switching: A critical aspect of hybrid coding is the seamless switching between the waveform coding and model-based coding methods. The coder must accurately determine the optimal transition points between the two coding techniques to ensure smooth transitions and consistent perceived speech quality.
✓ Speech enhancement algorithms are designed to improve the quality and intelligibility of
speech signals in the presence of noise or other distortions.
✓ These algorithms can be categorized into several classes based on their underlying principles
and methods.
✓ Some of the common classes of speech enhancement algorithms are:
1. Spectral Subtraction Methods:
• These algorithms estimate the noise power spectral density (PSD) and subtract it
from the noisy speech signal in the frequency domain.
• The Wiener filter and Minimum Mean Square Error (MMSE) filter are examples
of spectral subtraction methods.
2. Adaptive Filtering Methods:
• Adaptive filtering algorithms dynamically adjust their filter coefficients based on
the input signal and noise characteristics.
• They can adapt to non-stationary noise environments.
• One popular adaptive filtering method is the Normalized Least Mean Squares
(NLMS) algorithm.
3. Statistical Model-Based Methods:
• These algorithms use statistical models to estimate the clean speech signal and the
noise from the noisy input. Hidden Markov Model (HMM) and Gaussian Mixture
Model (GMM) based approaches fall under this category.
4. Spectral Masking Methods:
• These algorithms use masks to selectively enhance or suppress certain frequency
regions in the noisy speech signal based on the signal-to-noise ratio (SNR).
• Spectral masking is often used in speech enhancement for hearing aids and
cochlear implants.
5. Time-Frequency Domain Methods:
• Time-frequency domain algorithms analyze the speech signal in both time and
frequency domains to exploit its characteristics more effectively.
• Short-Time Fourier Transform (STFT) and Time-Frequency Masking are examples
of such methods.
4.1. SPECTRAL-SUBTRACTIVE ALGORITHMS
1. Short-Time Fourier Transform (STFT):
The noisy speech signal is divided into short overlapping frames, and the Short-Time Fourier Transform (STFT) is applied to each frame to obtain the complex-valued spectrogram:
X(k, i) = FFT(x[n, i])
where:
X(k, i) is the complex-valued spectrogram at frequency bin "k" and frame "i",
x[n, i] is the speech signal in frame "i" starting at sample index "n",
FFT denotes the Fast Fourier Transform.
2. Noise Estimation:
An estimate of the noise spectrum is required for spectral subtraction. Let's denote the
estimated noise spectrum as N(k, i):
N(k, i) = Estimation_Method(X(k, i))
where:
N(k, i) is the estimated noise spectrum at frequency bin "k" and frame "i",
Estimation_Method is a function that estimates the noise spectrum, which can be
obtained from noise-only segments or adaptively estimated from the noisy speech
itself.
3. Multiband Division:
✓ The frequency spectrum is divided into multiple subbands, denoted by B subbands.
Each subband may span a different frequency range.
✓ Let's denote the lower and upper frequency bounds of each subband as f_lower(b) and f_upper(b), respectively, where "b" represents the subband index.
4. Subband Magnitude Subtraction:
In each subband, the estimated noise magnitude is subtracted from the noisy speech magnitude to obtain Magnitude_Subtracted_b(k, i).
5. Phase Reconstruction:
The phase of the noisy subband spectrum is retained for reconstruction:
Phase_Reconstructed_b(k, i) = ∠X_b(k, i)
where:
Phase_Reconstructed_b(k, i) is the reconstructed phase in subband "b"
at frequency bin "k" and frame "i",
∠X_b(k, i) is the phase of the noisy speech subband in subband "b" at
frequency bin "k" and frame "i".
6. Inverse Short-Time Fourier Transform (ISTFT):
The processed complex-valued spectrogram is converted back to the time domain using the
Inverse Short-Time Fourier Transform (ISTFT) to obtain the enhanced speech signal:
y[n, i] = ISTFT(Phase_Reconstructed_b(k, i) * Magnitude_Subtracted_b(k, i))
where:
y[n, i] is the enhanced speech signal in frame "i" starting at sample index "n",
ISTFT denotes the Inverse Short-Time Fourier Transform.
4. Phase Reconstruction:
The phase of the noisy speech spectrum is retained for reconstruction:
Phase_Reconstructed(k, i) = ∠X(k, i)
where:
Phase_Reconstructed(k, i) is the reconstructed phase at frequency bin "k" and
frame "i",
∠X(k, i) is the phase of the noisy speech spectrum at frequency bin "k" and frame "i".
5. Inverse Short-Time Fourier Transform (ISTFT):
The processed complex-valued spectrogram is converted back to the time domain using the
Inverse Short-Time Fourier Transform (ISTFT) to obtain the enhanced speech signal.
y[n, i] = ISTFT(Phase_Reconstructed(k, i) * sqrt(Estimated_Clean_Power(k, i)))
where:
y[n, i] is the enhanced speech signal in frame "i" starting at sample index "n",
ISTFT denotes the Inverse Short-Time Fourier Transform.
✓ The MMSE Spectral Subtraction algorithm provides improved noise reduction performance
compared to basic spectral subtraction methods by taking into account the statistical properties
of the noisy speech and noise.
✓ It allows for better preservation of the speech signal while reducing noise, making it useful for
various speech enhancement applications in noisy environments.
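A basic magnitude spectral subtraction sketch following these steps, assuming the first few frames of the input contain noise only (used for the noise estimate) and keeping the noisy phase for reconstruction:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs=16000, noise_frames=10):
    """Subtract an average noise magnitude estimate from each frame's magnitude."""
    f, t, X = stft(x, fs=fs, nperseg=400, noverlap=300)
    magnitude, phase = np.abs(X), np.angle(X)
    noise_mag = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(magnitude - noise_mag, 0.0)   # floor at zero
    _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=400, noverlap=300)
    return y
```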
✓ Spectral Subtraction based on perceptual properties involves modifying the standard spectral subtraction algorithm by incorporating perceptual weighting factors.
✓ These factors represent the importance of different frequency regions to human auditory perception.
✓ Let's go through the steps with equations:
1. Short-Time Fourier Transform (STFT):
The input speech signal is divided into short overlapping frames, and the Short-Time
Fourier Transform (STFT) is applied to each frame to obtain the complex-valued
spectrogram:
X(k, i) = FFT(x[n, i])
where:
X(k, i) is the complex-valued spectrogram at frequency bin "k" and frame "i",
x[n, i] is the speech signal in frame "i" starting at sample index "n",
FFT denotes the Fast Fourier Transform.
2. Noise Estimation:
✓ An estimate of the noise power spectrum is required for spectral subtraction.
✓ This can be obtained from noise-only segments of the signal or adaptively
estimated from the noisy speech itself during periods of silence or low-level
speech.
✓ Let's denote the estimated noise power spectrum as N(k, i).
3. Perceptual Weighting:
✓ In the perceptual weighting step, each frequency bin of the magnitude spectrum is
multiplied by a perceptual weighting factor that represents the importance of that
frequency region to human hearing.
✓ Let's denote the perceptual weighting factor as W(k).
4. Magnitude Subtraction:
✓ The perceptually weighted magnitude spectrum is subtracted from the magnitude
spectrum of the noisy speech.
✓ The subtraction is typically performed on a logarithmic scale to avoid introducing negative values.
5. Phase Reconstruction:
The phase of the noisy speech spectrum is retained for reconstruction:
Phase_Reconstructed(k, i) = ∠X(k, i)
where:
Phase_Reconstructed(k, i) is the reconstructed phase at frequency bin "k" and
frame "i", and ∠X(k, i) is the phase of the noisy speech spectrum at frequency bin
k" and frame "i".
4.2. WIENER FILTERING
✓ Wiener filtering is a signal processing technique used for noise reduction or signal
enhancement in various applications, including speech and image processing.
✓ It is based on the principles of statistical estimation and aims to reconstruct a clean signal
from a noisy or degraded version by minimizing the mean square error between the
estimated signal and the original signal.
✓ The Wiener filter is an optimal linear filter that considers the statistical properties of the
noisy signal and the signal of interest to find the best estimate of the clean signal.
✓ It works in the frequency domain and requires knowledge of the power spectral density
(PSD) of both the noise and the signal.
✓ The goal of Wiener filtering is to estimate the clean speech signal s[n] from the noisy signal
x[n] by designing a filter in the frequency domain.
✓ Step-by-step Wiener Filtering Equations:
1. Compute the Power Spectral Density (PSD) of the clean speech signal S(f) and the noisy signal X(f):
PSD of the clean speech signal:
P_s(f) = |S(f)|^2,
PSD of the noisy signal:
P_x(f) = |X(f)|^2,
where f represents frequency.
2. Estimate the Power Spectral Density (PSD) of the noise, P_v(f), using statistical methods or noise estimation techniques. Assuming the noise is additive and uncorrelated with the speech, the PSDs add, so
P_v(f) = P_x(f) - P_s(f).
3. Estimate the Signal-to-Noise Ratio (SNR) for each frequency bin:
SNR(f) = P_s(f) / P_v(f).
4. Define the Wiener filter transfer function H(f) for each frequency bin:
H(f) = SNR(f) / (1 + SNR(f)).
5. Apply the Wiener filter transfer function H(f) to the noisy signal X(f) in the frequency
domain to obtain the estimated clean speech signal Y(f):
Y(f) = H(f) * X(f).
6. Take the inverse Fourier transform of Y(f) to obtain the estimated clean speech signal y[n]
in the time domain:
y[n] = Inverse Fourier Transform(Y(f)).
The resulting signal y[n] is the denoised version of the original noisy speech signal x[n].
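A frame-wise sketch of the Wiener gain described in the steps above, assuming the noise PSD is estimated from a few noise-only frames at the start of the signal and the gain is applied in the STFT domain:

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_denoise(x, fs=16000, noise_frames=10, eps=1e-10):
    """Apply the Wiener gain SNR/(1+SNR) per time-frequency bin."""
    f, t, X = stft(x, fs=fs, nperseg=400, noverlap=300)
    noisy_psd = np.abs(X) ** 2
    noise_psd = noisy_psd[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_psd = np.maximum(noisy_psd - noise_psd, 0.0)    # rough clean-speech PSD
    snr = clean_psd / (noise_psd + eps)
    gain = snr / (1.0 + snr)                              # H(f) = SNR / (1 + SNR)
    _, y = istft(gain * X, fs=fs, nperseg=400, noverlap=300)
    return y
```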
✓ It's important to note that the success of Wiener filtering heavily relies on accurate noise
estimation and stationarity assumptions.
✓ In practical scenarios, noise can be non-stationary, and advanced techniques like adaptive
filtering are employed to handle such cases.
✓ Nevertheless, Wiener filtering provides a fundamental understanding of noise reduction
principles and serves as a basis for more sophisticated methods used in speech enhancement
and other applications.
2. Autocorrelation and Cross-Correlation:
To design the Wiener filter, we need to estimate the autocorrelation and cross-correlation
of the desired signal and the noise.
The autocorrelation of the desired signal is represented by R_dd(m), and the cross-
correlation between the desired signal and the noise is represented by R_dn(m):
where:
R_dd(m) is the autocorrelation of the desired signal at time lag m, and
R_dn(m) is the cross-correlation between the desired signal and the noise at time lag m.
4. Filtering Operation:
The Wiener filter is then applied to the noisy speech signal s(n) in the time domain to obtain the enhanced speech signal y(n):
y(n) = Σ[m = 0 to M − 1] h(m) * s(n − m)
where:
h(m) are the Wiener filter coefficients (of length M) obtained from the correlation estimates above.
✓ The Wiener filter in the time domain is an effective tool for noise reduction and speech
enhancement when the statistical properties of the desired signal and noise are known or
can be estimated accurately.
✓ It provides a means to optimally trade off noise reduction and speech preservation, leading
to improved speech quality in noisy environments.
4.2.2 WIENER FILTERS IN THE FREQUENCY DOMAIN / WIENER FILTERS FOR NOISE
REDUCTION
✓ In the frequency domain, Wiener filters are used for signal processing tasks such as noise
reduction, speech enhancement, and system identification.
✓ Wiener filters in the frequency domain are based on the concept of minimizing the mean
square error (MSE) between the desired signal and the filtered signal, taking advantage of
the frequency representation of the signals.
✓ The basic idea behind designing a Wiener filter in the frequency domain is to modify the
frequency components of the noisy signal to minimize the noise while preserving the
important signal information.
✓ The Wiener filter operates on the Short-Time Fourier Transform (STFT) representation of
the signals, where the signals are divided into short overlapping frames and transformed
into the frequency domain.
✓ Here's the general concept of designing a Wiener filter in the frequency domain:
1. STFT Representation:
The noisy signal s(n) is divided into short overlapping frames, and the Short-Time Fourier
Transform (STFT) is applied to each frame to obtain the complex-valued spectrogram:
S(k, i) = STFT(s[n, i])
where:
S(k, i) is the complex-valued spectrogram at frequency bin "k" and frame "i",
s[n, i] is the noisy signal in frame "i" starting at sample index "n",
STFT denotes the Short-Time Fourier Transform.
2. Noise Estimation:
• An estimate of the noise power spectrum is required for the Wiener filter.
• This can be obtained from noise-only segments of the signal or adaptively
estimated from the noisy speech itself during periods of silence or low-level
speech.
• Let's denote the estimated noise power spectrum as N(k, i).
3. Wiener Filter Coefficients:
The Wiener filter in the frequency domain is designed to minimize the mean square error
(MSE) between the desired signal D(k, i) (the clean speech spectrum) and the filtered signal
Y(k, i) (the enhanced speech spectrum):
H(k, i) = R_dd(k, i) / (R_dd(k, i) + R_nn(k, i))
where:
H(k, i) is the Wiener filter coefficient at frequency bin "k" and frame "i",
R_dd(k, i) is the autocorrelation of the desired signal (clean speech) at frequency bin
"k" and frame "i",
R_nn(k, i) is the autocorrelation of the noise at frequency bin "k" and frame "i".
4. Wiener Filtering:
The Wiener filter coefficients H(k, i) are applied to each frequency bin of the noisy speech
spectrogram S(k, i) to obtain the enhanced speech spectrogram Y(k, i):
Y(k, i) = H(k, i) * S(k, i)
where:
Y(k, i) is the enhanced speech spectrogram at frequency bin "k" and frame "i",
H(k, i) is the Wiener filter coefficient at frequency bin "k" and frame "i",
S(k, i) is the complex-valued spectrogram of the noisy speech at frequency bin
"k" and frame "i".
5. Inverse STFT (ISTFT):
The enhanced speech spectrogram Y(k, i) is transformed back to the time domain using the
Inverse Short-Time Fourier Transform (ISTFT) to obtain the enhanced speech signal
y(n, i).
y(n, i) = ISTFT(Y(k, i))
where:
y(n, i) is the enhanced speech signal in frame "i" starting at sample index "n",
ISTFT denotes the Inverse Short-Time Fourier Transform.
✓ Wiener filters in the frequency domain provide an effective way to perform noise reduction
and speech enhancement by taking advantage of the frequency representation of signals.
✓ By minimizing the mean square error between the desired and filtered signals, the Wiener
filter can effectively suppress noise while preserving the important speech information,
leading to improved speech quality in noisy environments.
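The five steps above can be put together in a short Python sketch. It uses scipy's STFT/ISTFT, assumes the first few frames of the recording are noise-only for the noise estimate, and approximates R_dd(k, i) by spectral subtraction; all names and parameter values are illustrative.

    import numpy as np
    from scipy.signal import stft, istft

    def wiener_enhance(noisy, fs, noise_frames=10, nperseg=512):
        # Step 1: STFT of the noisy speech
        _, _, S = stft(noisy, fs=fs, nperseg=nperseg)
        power = np.abs(S) ** 2
        # Step 2: noise power estimate from frames assumed to be noise-only
        N = power[:, :noise_frames].mean(axis=1, keepdims=True)
        # Step 3: Wiener gain H = R_dd / (R_dd + R_nn), with R_dd approximated
        # by max(|S|^2 - N, 0)
        R_dd = np.maximum(power - N, 0.0)
        H = R_dd / (R_dd + N + 1e-12)
        # Step 4: apply the gain to every time-frequency bin
        Y = H * S
        # Step 5: inverse STFT back to the time domain
        _, y = istft(Y, fs=fs, nperseg=nperseg)
        return y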
4.3 MAXIMUM-LIKELIHOOD ESTIMATORS
✓ Maximum-Likelihood Estimators (MLE) are statistical methods used to estimate the
parameters of a probability distribution that are most likely to have generated a given set
of observed data.
✓ The basic idea behind MLE is to find the values of the parameters that maximize the
likelihood of observing the data under the assumed probability distribution.
✓ Let's say we have a set of observed data points {x1, x2, ..., xn}, and we want to estimate
the parameters θ of a probability distribution f(x; θ) that governs the data generation
process.
✓ Here, f(x; θ) represents the probability density function (PDF) or probability mass function
(PMF) of the distribution, and θ represents the unknown parameters.
✓ The likelihood function L(θ) is defined as the joint probability of observing the data given
the parameters:
L(θ) = f(x1; θ) * f(x2; θ) * ... * f(xn; θ)
✓ The goal of MLE is to find the values of θ that maximize the likelihood function L(θ).
✓ This can be achieved by taking the derivative of the log-likelihood function (log L(θ)) with
respect to θ, setting it equal to zero, and solving for θ.
✓ The resulting estimates of the parameters are called maximum-likelihood estimates.
✓ Mathematically, the MLE θ̂ can be obtained as follows:
1. Take the logarithm of the likelihood function to simplify calculations:
log L(θ) = log f(x1; θ) + log f(x2; θ) + ... + log f(xn; θ)
2. Compute the derivative of the log-likelihood function with respect to θ:
∂(log L(θ))/∂θ = ∂(log f(x1; θ))/∂θ + ∂(log f(x2; θ))/∂θ + ... + ∂(log f(xn; θ))/∂θ
3. Set the derivative to zero and solve for θ:
∂(log L(θ))/∂θ = 0
4. The solution θ̂ obtained from the equation represents the maximum-likelihood estimate
of the parameters.
✓ MLE is often used for estimating the parameters of statistical models in speech recognition,
speech synthesis, and various other applications where probability distributions play a
crucial role.
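As a small worked example of the procedure above, the Python sketch below finds the maximum-likelihood estimates of a Gaussian's mean and standard deviation by numerically maximizing the log-likelihood; the data and all names are illustrative (for the Gaussian, the closed-form MLEs are simply the sample mean and the sample standard deviation).

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=1000)      # observed data x1..xn

    def neg_log_likelihood(theta):
        mu, log_sigma = theta                          # optimize log(sigma) so sigma > 0
        sigma = np.exp(log_sigma)
        return -np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                       - (x - mu) ** 2 / (2 * sigma ** 2))

    res = minimize(neg_log_likelihood, x0=[0.0, 0.0])  # maximize log L by minimizing -log L
    mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
    # mu_hat is close to x.mean() and sigma_hat is close to x.std(), the closed-form MLEs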
4.4 BAYESIAN ESTIMATION
✓ Bayesian estimation treats the unknown parameters θ as random variables and assigns them
a prior distribution P(θ).
✓ This prior represents our beliefs or knowledge about the parameters before observing the
data.
✓ After observing the data, we update our beliefs about the parameters to obtain the posterior
distribution P(θ|x), which represents our updated knowledge about the parameters given
the data.
✓ Bayes' theorem relates the posterior distribution to the prior and the likelihood:
P(θ|x) = (f(x; θ) * P(θ)) / P(x),
where
P(θ|x) is the posterior distribution, f(x; θ) is the likelihood function, P(θ) is
the prior distribution, and P(x) is the marginal likelihood (also known as the
evidence).
✓ The maximum a posteriori (MAP) estimator is a common Bayesian estimator that finds the
mode (peak) of the posterior distribution, which represents the most likely estimate of the
parameters given the data and the prior information.
✓ Mathematically, the MAP estimator θ̂_MAP can be obtained as follows:
a. Calculate the posterior distribution P(θ|x) using Bayes' theorem.
b. Find the value of θ that maximizes the posterior distribution, which corresponds to the
mode of the distribution:
θ̂_MAP = argmax P(θ|x).
✓ Bayesian estimation is widely used in various fields, including speech processing, where
uncertainty and prior knowledge play a critical role in parameter estimation.
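For the common conjugate case of a Gaussian likelihood with known noise variance and a Gaussian prior on the mean, the posterior is again Gaussian and its mode has a closed form. The short sketch below computes that MAP estimate; the parameter values are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    sigma = 1.0                      # known observation noise standard deviation
    mu0, tau = 0.0, 2.0              # prior: theta ~ N(mu0, tau^2)
    x = rng.normal(loc=1.2, scale=sigma, size=50)      # observed data

    n = len(x)
    # Mode of the Gaussian posterior P(theta | x), i.e. the MAP estimate:
    theta_map = (tau ** 2 * x.sum() + sigma ** 2 * mu0) / (n * tau ** 2 + sigma ** 2)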
4.5 MMSE AND LOG-MMSE ESTIMATORS
✓ The MMSE (Minimum Mean Square Error) estimator and the Log-MMSE (Logarithmic
Minimum Mean Square Error) estimator are statistical methods used for signal processing
and estimation tasks, including speech enhancement and denoising.
✓ Both estimators aim to find an optimal estimate of a random variable, given noisy or
observed data, by minimizing the mean square error.
MMSE Estimator:
✓ The MMSE estimator is used to estimate an unknown parameter or signal based on noisy
observations.
✓ It is derived from the principle of minimizing the expected value of the squared difference
between the estimated value and the true value of the parameter.
✓ Let's say we want to estimate an unknown random variable or signal s from noisy
observations x, and we have a model that describes the relationship between s and x as:
x = s + v,
where v represents the additive noise.
✓ The MMSE estimator for s, denoted as ŝ_MMSE, is given by:
ŝ_MMSE = E(s|x),
where E(s|x) represents the conditional expectation of s given the observed
data x.
✓ The MMSE estimator is optimal in the sense that it minimizes the expected mean square
error between the estimate and the true value of the parameter.
✓ It is widely used in various signal processing applications, including speech enhancement
and channel estimation.
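For the additive model x = s + v with s and v zero-mean, independent and Gaussian, the conditional expectation E(s|x) is linear in x, which gives the simple sketch below (the variances and names are illustrative).

    import numpy as np

    rng = np.random.default_rng(2)
    var_s, var_v = 4.0, 1.0
    s = rng.normal(0.0, np.sqrt(var_s), size=10_000)   # true signal
    v = rng.normal(0.0, np.sqrt(var_v), size=10_000)   # additive noise
    x = s + v                                          # noisy observations

    # MMSE estimate: for jointly Gaussian s and x, E(s | x) = var_s / (var_s + var_v) * x
    s_hat = (var_s / (var_s + var_v)) * x
    mse = np.mean((s_hat - s) ** 2)                    # close to var_s * var_v / (var_s + var_v)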
Log-MMSE Estimator:
✓ The Log-MMSE estimator is a modification of the MMSE estimator that operates in the
logarithmic domain.
✓ It is particularly useful when the observed data is corrupted by multiplicative noise or when
the underlying signal is positive or has a log-normal distribution.
✓ The Log-MMSE estimator operates on the logarithm of the random variables.
✓ Let's say we want to estimate the unknown log-transformed signal log(s) from noisy
log-transformed observations y, and we have a model that describes the relationship
between log(s) and y as:
y = log(s) + w,
where w represents the additive noise in the logarithmic domain.
✓ The Log-MMSE estimate of log(s) is then given by:
log(ŝ)_MMSE = E(log(s)|y),
where E(log(s)|y) represents the conditional expectation of log(s) given the observed
data y.
✓ The Log-MMSE estimator essentially operates on the logarithmic scale, which can be
advantageous when dealing with multiplicative noise or positive-valued signals.
✓ It can provide better performance in scenarios where the signal-to-noise ratio is low or
when the underlying signal has a log-normal distribution.
✓ Both MMSE and Log-MMSE estimators are widely used in various signal processing and
estimation tasks, offering optimal or near-optimal performance under certain assumptions
and noise conditions.
4.6 SUBSPACE ALGORITHMS
✓ Subspace algorithms are a class of signal processing techniques used for various
applications, including noise reduction, signal separation, and source localization.
✓ These algorithms exploit the subspace structure of signals to achieve their objectives.
✓ The basic idea behind subspace algorithms is to transform the data into a lower-
dimensional subspace where the signal of interest is concentrated, while noise or
interference is spread out.
✓ One of the most common subspace-based techniques is Principal Component Analysis
(PCA). PCA is used for dimensionality reduction and signal denoising.
✓ It identifies the principal components or eigenvectors of the covariance matrix of the data,
which represent the directions of maximum variance.
✓ By projecting the data onto the principal components with the largest eigenvalues, the
signal is enhanced, and noise is attenuated in the lower-dimensional subspace.
✓ Another widely used subspace algorithm is Independent Component Analysis (ICA).
✓ ICA aims to separate a mixture of statistically independent source signals from their
observed mixtures.
✓ It is particularly useful in applications like blind source separation, where the source signals
are unknown and assumed to be statistically independent.
✓ Here are some key points about subspace algorithms:
1. Signal Subspace: In subspace algorithms, the signal subspace is the subspace spanned
by the columns of a matrix containing the signal and noise data. The signal subspace
captures the intrinsic structure of the signals of interest.
2. Noise Subspace: The noise subspace is orthogonal to the signal subspace and represents
the subspace containing only noise or interference.
3. Signal Subspace Projection: Subspace algorithms often involve projecting the data onto
the signal subspace to enhance the signal components and attenuate noise.
4. Eigendecomposition: Many subspace algorithms rely on the eigendecomposition of
covariance matrices or other matrices derived from the data.
5. Array Processing: Subspace algorithms are commonly used in array processing
applications, such as beamforming and direction-of-arrival estimation.
✓ Subspace algorithms have been applied in various fields, including speech processing,
image processing, and sensor array processing.
✓ They offer effective ways to exploit the inherent structure of data to achieve noise
reduction, signal separation, and other signal processing tasks.
✓ However, their performance can depend on the specific application and the assumptions
made about the data and noise characteristics.
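A minimal PCA-based subspace sketch is shown below: frames of a noisy signal are projected onto the eigenvectors of their covariance matrix with the largest eigenvalues (the signal subspace), which attenuates the noise spread across the remaining directions. The framing of the data and the choice of k are assumptions of this illustration.

    import numpy as np

    def pca_denoise(frames, k):
        # frames: (num_frames, frame_len) matrix of noisy signal frames
        mean = frames.mean(axis=0)
        X = frames - mean
        # Eigendecomposition of the covariance matrix (eigenvalues in ascending order)
        eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
        U_signal = eigvecs[:, -k:]          # signal subspace: top-k eigenvectors
        # Project each frame onto the signal subspace and reconstruct
        return X @ U_signal @ U_signal.T + mean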
UNIT V SPEECH SYNTHESIS AND APPLICATION
5.1 TEXT-TO-SPEECH SYSTEMS (TTS)
• Concatenative Synthesis: Pre-recorded speech units are selected from a speech
database and joined together; statistical parametric methods are used to select
appropriate segments based on context and smooth transitions.
• Parametric Synthesis: Parametric TTS models generate speech using
mathematical models that describe speech parameters such as pitch,
duration, and spectral features. These models are trained using large
amounts of speech data and can generate speech more efficiently than
concatenative methods.
6. Voice Selection: TTS systems may offer multiple voices to choose from, representing
different genders, ages, and accents, allowing users to customize the synthesized
speech according to their preferences.
7. Output Rendering: The final synthesized speech is rendered as an audio waveform that
can be played through speakers or audio output devices.
✓ The overall concatenative synthesis process can be represented in a simplified form as follows:
Synthesized Speech = Concatenate(U_1, U_2, ..., U_n),
✓ where U_1, U_2, ..., U_n are the selected speech units from the speech database D,
concatenated in sequence to form the final synthesized speech waveform.
✓ It's important to note that in practical concatenative synthesis systems, there are additional
considerations and algorithms to handle issues like prosody matching, unit selection, and
waveform concatenation to achieve more natural and expressive speech output.
✓ These considerations may involve signal processing techniques, prosodic modeling, and more
sophisticated algorithms for unit selection and concatenation.
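The Concatenate(U_1, U_2, ..., U_n) operation above can be illustrated with a very small Python sketch that joins pre-selected units with a short linear crossfade; unit selection, prosody matching, and the speech database itself are outside the scope of this illustration.

    import numpy as np

    def concatenate_units(units, fade_len=160):
        # units: list of 1-D arrays (selected speech units U_1 ... U_n),
        # each assumed longer than fade_len samples
        out = units[0].astype(float)
        ramp = np.linspace(0.0, 1.0, fade_len)
        for u in units[1:]:
            u = u.astype(float)
            # Linear crossfade between the end of the output and the start of the next unit
            out[-fade_len:] = out[-fade_len:] * (1.0 - ramp) + u[:fade_len] * ramp
            out = np.concatenate([out, u[fade_len:]])
        return out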
4. Vowel Synthesis:
• For synthesizing vowels, formants are of particular importance.
• Vowel sounds are characterized by specific formant patterns, and correctly
reproducing these formants is crucial for accurate vowel synthesis.
• In concatenative synthesis, the formant frequencies and bandwidths of the selected
speech units are carefully adjusted to achieve the desired vowel sounds.
5.4. USE OF LPC FOR CONCATENATIVE SYNTHESIS
• LPC is a widely used speech analysis technique that models the speech signal as a linear
combination of past speech samples.
• It estimates the vocal tract filter parameters, also known as LPC coefficients, which represent
the resonant characteristics of the vocal tract.
• These coefficients are used to model the formant frequencies and bandwidths in the speech
signal.
• In concatenative synthesis, LPC is used to analyze the recorded speech segments in the speech
database to extract LPC coefficients.
• These coefficients are stored in the database and used during the synthesis process.
• Here's how LPC is utilized in concatenative synthesis:
1. LPC Analysis:
• LPC is used to analyze recorded speech segments to estimate the LPC coefficients.
• The LPC analysis involves modeling the speech signal as a linear combination of
past speech samples using an all-pole linear prediction model.
• The LPC coefficients are computed based on the autocorrelation method or the
Levinson-Durbin algorithm.
• The LPC model approximates each speech sample as a weighted combination of past
samples plus a prediction error (excitation) term:
s[n] = Σ(a_i * s[n - i]) + e[n], for i = 1 to p,
• where s[n] is the speech sample at time n, a_i represents the LPC coefficients, e[n] is
the prediction error, and p is the LPC order (the number of coefficients).
2. Formant Estimation:
• From the LPC coefficients, the formant frequencies can be estimated.
• Formants correspond to the resonant frequencies in the vocal tract and are crucial
for accurately capturing the unique vowel qualities in the speech signal.
• The formant frequencies (f_i) are estimated from the complex roots of the LPC
prediction-error polynomial A(z) = 1 - Σ(a_i * z^(-i)): a root z_i with angle θ_i
(in radians) corresponds to the formant frequency
f_i = (Fs / (2π)) * θ_i,
• where Fs is the sampling frequency; the radius of the root determines the formant
bandwidth (a short code sketch of this analysis is given after this subsection).
3. Unit Selection and Concatenation:
• In concatenative synthesis, the recorded speech database contains multiple speech
units, and unit selection is performed during synthesis to choose appropriate units
for concatenation.
• The LPC-derived formant frequencies play a significant role in this selection
process.
• Units with formant characteristics that match the target speech formants are chosen
to ensure smooth concatenation and continuity in the synthesized speech.
4. Formant Shaping:
• LPC-derived formant information can also be used for formant shaping, where the
formant frequencies of selected units are modified to better match the formant
frequencies of adjacent units.
• Formant shaping helps achieve a more seamless and natural transition between
concatenated units.
✓ To summarize, LPC is not directly used to synthesize speech in concatenative
synthesis.
✓ Instead, it is employed for speech analysis, specifically in estimating formant
frequencies and aiding in unit selection and shaping, which are essential steps in
concatenative synthesis.
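The LPC analysis and formant estimation described above can be sketched in Python as follows. The Levinson-Durbin recursion below returns the prediction-error filter A(z) = 1 + a_1 z^(-1) + ... + a_p z^(-p) (the predictor coefficients used in the notes are the negatives of a_1..a_p), and the formant frequencies are read off the angles of the complex roots of A(z). This is an illustrative sketch, not the exact analysis pipeline of any particular synthesizer.

    import numpy as np

    def lpc(frame, order):
        # Autocorrelation method: r[0..order] from a windowed speech frame
        n = len(frame)
        r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        # Levinson-Durbin recursion
        for i in range(1, order + 1):
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
            a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
            err *= 1.0 - k * k
        return a

    def formants(a, fs):
        # Formants from the angles of the complex roots of A(z)
        roots = np.roots(a)
        roots = roots[np.imag(roots) > 0]            # keep one root per conjugate pair
        freqs = np.angle(roots) * fs / (2 * np.pi)   # f_i = (Fs / (2*pi)) * theta_i
        return np.sort(freqs)                        # in practice, wide-bandwidth roots are discarded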
5.5 HMM-BASED SPEECH SYNTHESIS
• Hidden Markov Model (HMM)-based speech synthesis, also known as HMM-based Text-to-
Speech (TTS), is a popular statistical parametric approach for synthesizing high-quality and
natural-sounding speech from text input.
• HMM-based speech synthesis uses Hidden Markov Models to model the relationship
between linguistic units (e.g., phonemes, diphones) and acoustic features (e.g., spectral
parameters such as Mel-cepstral coefficients, and prosody) of speech.
Fig. A hidden Markov model for speech recognition
• Below are the key equations involved in HMM-based speech synthesis:
1. HMM State Transition Probability (A):
• The state transition probabilities in an HMM represent the probabilities of moving
from one state to another. In the context of speech synthesis, the states correspond
to different phonetic or linguistic units.
• Let's denote the state transition probability matrix as A, where A[i][j] represents
the probability of transitioning from state i to state j.
2. HMM Emission Probability (B):
• The emission probabilities in an HMM represent the probabilities of observing a
particular acoustic feature given a specific state.
• Let's denote the emission probability matrix as B, where B[i][j] represents the
probability of emitting the acoustic feature j when in state i.
3. Initial State Probability (π):
• The initial state probabilities represent the probabilities of starting from each state
in the HMM.
• Let's denote the initial state probability vector as π, where π[i] represents the
probability of starting from state i.
5.5.1 HMM Forward Algorithm based speech synthesis:
✓ The forward algorithm is used to compute the probability of observing a sequence of acoustic
features given the HMM and its parameters. This probability is used to perform alignment and
parameter estimation during training.
✓ The forward algorithm recursively computes the forward probabilities α(t, i), which represent
the probability of being in state i at time t and observing the acoustic features up to time t.
✓ The forward probability α(t, i) is computed as follows:
α(t, i) = Σ_j [α(t-1, j) * A[j][i]] * B[i][O(t)],
where O(t) represents the observed acoustic feature at time t and the sum is taken over
all states j at time t-1.
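A compact Python sketch of the forward recursion for a discrete-observation HMM is shown below; A, B, and π are the matrices defined earlier, and O is a sequence of observation indices (real TTS systems use continuous acoustic features with Gaussian mixture emissions instead).

    import numpy as np

    def forward(A, B, pi, O):
        # A: (N, N) transition matrix, B: (N, M) emission matrix,
        # pi: (N,) initial probabilities, O: list of observation indices
        N, T = A.shape[0], len(O)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, O[0]]
        for t in range(1, T):
            # alpha(t, i) = sum_j alpha(t-1, j) * A[j][i] * B[i][O(t)]
            alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
        return alpha, alpha[-1].sum()      # forward probabilities and P(O | model)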
5.5.2 HMM Viterbi Algorithm based speech synthesis:
✓ The Viterbi algorithm is used to find the most likely state sequence given the observed acoustic
features.
✓ This is useful for alignment during training and for state selection during synthesis.
✓ The Viterbi algorithm recursively computes the Viterbi path probabilities δ(t, i), which represent
the probability of being in the most likely state sequence up to time t and ending in state i.
✓ The Viterbi path probability δ(t, i) is computed as follows:
δ(t, i) = max[δ(t-1, j) * A[j][i] * B[i][O(t)]],
where max is taken over all possible states j at time t-1.
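The corresponding Viterbi sketch, using the same illustrative A, B, π, and observation index sequence as the forward-algorithm example, additionally backtracks the most likely state sequence:

    import numpy as np

    def viterbi(A, B, pi, O):
        N, T = A.shape[0], len(O)
        delta = np.zeros((T, N))
        psi = np.zeros((T, N), dtype=int)
        delta[0] = pi * B[:, O[0]]
        for t in range(1, T):
            # scores[j, i] = delta(t-1, j) * A[j][i]
            scores = delta[t - 1][:, None] * A
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) * B[:, O[t]]
        # Backtrack the most likely state sequence
        path = np.zeros(T, dtype=int)
        path[-1] = delta[-1].argmax()
        for t in range(T - 2, -1, -1):
            path[t] = psi[t + 1, path[t + 1]]
        return path, delta[-1].max()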
5.6. SINEWAVE SPEECH SYNTHESIS
✓ Sinewave speech synthesis is a speech synthesis technique that uses pure sinusoidal tones
(sinewaves) to recreate speech-like sounds.
✓ It is a simple but effective method for generating speech-like waveforms with intelligible
speech content.
✓ The basic idea behind sinewave speech synthesis is to analyse the formant structure of human
speech and represent it using pure sinewaves.
✓ Formants are the resonant frequencies in the vocal tract that give each vowel its characteristic
sound.
✓ By synthesizing speech using only the formant frequencies, sinewave speech provides a
stripped-down representation of speech that is still recognizable as speech.
1. Formant Analysis:
The first step in sinewave speech synthesis is to analyze the speech waveform to identify
the formant frequencies of the vowels present in the speech. Let's denote the formant
frequencies as F_1, F_2, F_3, ..., F_n.
2. Sinewave Generation:
• For each formant frequency, a sinewave tone is generated with the corresponding
frequency and amplitude.
• The amplitude of each sinewave is typically set to 1 to keep it simple, as the relative
amplitudes are adjusted later during the synthesis process.
• The equation for generating a sinewave tone with frequency F_i is given by:
s_i(t) = A * sin(2π * F_i * t),
• where s_i(t) represents the sinewave tone at time t with frequency F_i, and A is the
amplitude of the sinewave.
3. Superposition:
• The next step involves superimposing the sinewaves corresponding to different
formants to create a composite sinewave representing the synthesized speech
sound.
• Let's assume there are n formants in the speech. The composite sinewave is
obtained by summing the individual sinewaves:
s(t) = s_1(t) + s_2(t) + ... + s_n(t),
• where s(t) represents the synthesized speech waveform at time t.
4. Time-Varying Amplitude:
• To create the desired speech sound, the amplitudes of the sinewaves are varied over
time according to the phonetic and prosodic features of the speech.
• The time-varying amplitude can be determined based on the desired speech sound
and is typically controlled by some parameters or functions.
• Let's denote the time-varying amplitude for the sinewave with frequency F_i as
A_i(t).
• The final sinewave speech synthesis equation incorporating time-varying
amplitude is given by:
s(t) = A_1(t) * sin(2π * F_1 * t) + A_2(t) * sin(2π * F_2 * t) + ...
+ A_n(t) * sin(2π * F_n * t).
• The time-varying amplitudes A_i(t) can be determined based on the target speech
sound and desired prosody.
• They control the loudness and shape of the individual formants over time, resulting
in a synthesized speech waveform that approximates the speech sound represented
by the formant frequencies.
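The equations above translate almost directly into Python. The sketch below synthesizes 0.3 s of a rough /a/-like sound from three fixed formant frequencies with a simple Hanning-window amplitude envelope; the formant values and the envelope are illustrative choices, not measured data.

    import numpy as np

    fs = 16000
    t = np.arange(int(0.3 * fs)) / fs                  # 0.3 s time axis

    formants = [700.0, 1220.0, 2600.0]                 # rough /a/-like F1, F2, F3 in Hz
    envelope = np.hanning(len(t))                      # simple time-varying amplitude A_i(t)

    # s(t) = A_1(t) sin(2*pi*F_1*t) + A_2(t) sin(2*pi*F_2*t) + A_3(t) sin(2*pi*F_3*t)
    s = sum(envelope * np.sin(2 * np.pi * F_i * t) for F_i in formants)
    s = s / np.max(np.abs(s))                          # normalize before saving or playback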
5.7 SPEECH TRANSFORMATIONS
✓ Speech transformations refer to various signal processing and manipulation techniques used to
modify and enhance speech signals.
✓ These transformations can be applied for a range of purposes, including speech enhancement,
voice conversion, speech synthesis, and more.
✓ Some common speech transformations include:
1. Pitch Shifting: Pitch shifting modifies the fundamental frequency (pitch) of the speech
signal while maintaining the speech content. It is used in applications like voice
modulation, creating harmonies, and altering the perceived gender of the speaker.
2. Time Stretching: Time stretching alters the duration of the speech signal without changing
its pitch. It can be used to speed up or slow down speech, which finds applications in
language learning, voice dubbing, and audio editing.
3. Speech Enhancement: Speech enhancement techniques aim to improve the quality and
intelligibility of speech in noisy environments. Methods such as spectral subtraction,
Wiener filtering, and minimum mean-square error estimation are used to reduce
background noise and enhance speech signals.
4. Voice Conversion: Voice conversion transforms the voice of a speaker to sound like
another speaker without altering the linguistic content. It is commonly used in the
entertainment industry and voice acting.
5. Formant Shifting: Formant shifting modifies the resonant frequencies (formants) in the
speech signal, affecting the vowel quality. It can be used to simulate different accents or
change the speaker's perceived vocal tract characteristics.
6. Vocoder Techniques: Vocoder algorithms analyse and synthesize speech by separating the
speech signal into its spectral and temporal components. Vocoder techniques are used in
speech synthesis, voice coding, and voice encryption.
7. Resampling: Resampling changes the sample rate of the speech signal; if the resampled
signal is played back at the original rate, its speed and pitch change. Resampling is used in
speech synthesis and audio processing applications (a small sketch follows this list).
8. Spectral Manipulation: Spectral manipulation techniques modify the spectral content of
the speech signal. This includes methods like spectral shaping, filtering, and spectral
envelope modification.
9. Formant Synthesis: Formant synthesis involves synthesizing speech using the formant
frequencies of the vocal tract to create specific vowel sounds. It is used in speech synthesis
and phonetic research.
10. Prosody Modification: Prosody refers to the rhythm, intonation, and stress patterns of
speech. Prosody modification techniques can alter the emotional expression or emphasis in
the speech signal.
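As a small example of item 7 above, the sketch below converts 48 kHz audio to 16 kHz with a polyphase resampler; the rates are illustrative.

    from math import gcd
    from scipy.signal import resample_poly

    def resample(x, orig_fs=48000, target_fs=16000):
        # Polyphase resampling by the rational factor target_fs / orig_fs
        g = gcd(target_fs, orig_fs)
        return resample_poly(x, up=target_fs // g, down=orig_fs // g)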
5.8 WATERMARKING FOR AUTHENTICATION OF A SPEECH
✓ Speech watermarking is a technique used to embed an imperceptible and robust watermark into
speech signals for the purpose of authentication and copyright protection.
✓ The watermark serves as a unique identifier that can be used to verify the authenticity of the
speech signal and detect any unauthorized alterations or tampering.
1. Watermark Generation:
• A watermark is typically a short sequence of bits or a digital signature generated
using cryptographic techniques.
• The watermark is designed to be imperceptible, meaning it should not be audible
to human listeners, but robust enough to withstand common signal processing
operations and attacks.
2. Watermark Embedding:
• The watermark is embedded into the speech signal using specialized algorithms.
• The embedding process modifies the speech signal so that the watermark
information is imperceptibly (inaudibly) embedded within the speech waveform.
3. Authentication:
• During the authentication process, the embedded watermark is extracted from
the received speech signal.
• The extracted watermark is then compared with the original watermark to
determine if the speech signal is authentic or has been altered in any way.
4. Robustness:
• Speech watermarking should be robust against common signal processing
operations such as compression, noise addition, filtering, and other
transformations that the speech signal may undergo during transmission or
storage.
• Robust watermarking ensures that the watermark can still be reliably extracted
even after such operations.
5. Security:
• Watermarking algorithms should be designed to be resistant to attacks
attempting to remove or modify the watermark, ensuring the integrity of the
authentication process.
6. Imperceptibility:
• The watermark should be imperceptible to human listeners so that it does not
degrade the quality or intelligibility of the speech signal.
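As a toy illustration of embedding and extraction (not the specific algorithms referred to above), the sketch below adds a low-amplitude pseudo-random sequence whose sign carries each watermark bit and recovers the bits by correlating with the same key-generated sequence. Real systems use more robust, perceptually shaped schemes; all names and parameters here are arbitrary.

    import numpy as np

    def embed_watermark(speech, bits, key=42, alpha=0.005, chip_len=16000):
        rng = np.random.default_rng(key)
        pn = rng.choice([-1.0, 1.0], size=chip_len)        # key-dependent pseudo-noise
        marked = speech.astype(float).copy()
        for i, b in enumerate(bits):
            seg = slice(i * chip_len, (i + 1) * chip_len)
            marked[seg] += alpha * (1.0 if b else -1.0) * pn
        return marked

    def extract_watermark(marked, num_bits, key=42, chip_len=16000):
        rng = np.random.default_rng(key)
        pn = rng.choice([-1.0, 1.0], size=chip_len)
        # The sign of the correlation with the pseudo-noise sequence recovers each bit
        return [int(np.dot(marked[i * chip_len:(i + 1) * chip_len], pn) > 0)
                for i in range(num_bits)]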
Applications of Speech Watermarking:
✓ Copyright Protection: Speech watermarking can be used to protect copyrighted audio content,
such as speeches, podcasts, or audio books, from unauthorized distribution and reproduction.
✓ Authentication of Digital Audio: Speech watermarking can be applied in forensics to verify
the authenticity of recorded audio evidence used in legal proceedings.
✓ Multimedia Content Authentication: Speech watermarking can be used as part of a
comprehensive multimedia content authentication system to verify the integrity and origin of
audiovisual content.
5.9 EMOTION RECOGNITION FROM SPEECH
✓ Emotion recognition from speech is a field of research in speech processing and machine
learning that focuses on automatically detecting and identifying emotions expressed in speech
signals.
✓ Emotion recognition systems aim to infer the emotional state of a speaker based on the acoustic
features and prosodic characteristics present in their speech.
✓ The process of emotion recognition from speech typically involves the following steps:
1. Data Collection: Emotion recognition systems require a labeled dataset containing speech
samples with corresponding emotion labels. These samples are often collected by recording
speakers while inducing various emotional states, such as happiness, sadness, anger, fear,
etc.
2. Feature Extraction: Acoustic features are extracted from the speech signal to capture the
characteristics relevant to emotion expression.
Commonly used features include:
o Mel-frequency cepstral coefficients (MFCCs) to capture spectral information.
o Pitch and pitch variation (fundamental frequency) to assess prosodic features.
o Energy to represent the intensity of the speech signal.
3. Feature Selection and Dimensionality Reduction: Depending on the application and the
algorithm used, feature selection and dimensionality reduction techniques may be applied
to reduce the number of features while preserving relevant information.
4. Emotion Classification: Machine learning algorithms, such as Support Vector Machines
(SVM), Neural Networks, or Hidden Markov Models (HMM), are trained on the extracted
features and corresponding emotion labels. During training, the model learns the patterns
that distinguish different emotions in the speech data.
5. Evaluation: The trained model is evaluated on a separate set of speech samples to assess
its performance in correctly recognizing emotions. Various metrics, such as accuracy,
precision, recall, and F1 score, are used to measure the system's performance.
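The feature-extraction and classification steps above can be sketched with off-the-shelf tools as follows; the function names, the use of librosa for MFCCs, and the SVM with default settings are illustrative choices, and `files` is assumed to be a list of (wav_path, emotion_label) pairs from a labeled corpus.

    import numpy as np
    import librosa
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    def mfcc_features(path, n_mfcc=13):
        y, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        # Summarize frame-level MFCCs with per-coefficient mean and standard deviation
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    def train_emotion_classifier(files):
        X = np.array([mfcc_features(path) for path, _ in files])
        y = np.array([label for _, label in files])
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        clf = SVC(kernel="rbf").fit(X_tr, y_tr)
        return clf, clf.score(X_te, y_te)      # classifier and held-out accuracy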
Challenges in Emotion Recognition from Speech:
✓ Speech is highly variable and context-dependent, making it challenging to
accurately capture emotions across different speakers and languages.
✓ The same emotion can be expressed differently by different individuals, making it
difficult to define universal emotional features.
✓ Emotions can be subtle and continuous, making it challenging to categorize them
into discrete classes.
Applications of Emotion Recognition from Speech:
✓ Human-Computer Interaction: Emotion recognition can enhance the interaction
between humans and computers by enabling systems to respond appropriately to
users' emotional states.
✓ Market Research: Emotion recognition can be used in market research to assess
consumers' emotional responses to products and advertisements.
✓ Health Care: Emotion recognition systems can assist in diagnosing and
monitoring emotional disorders, such as depression and anxiety.