UNIT 2-Speech Processing
Speech Features
Cepstral coefficients are values that represent the rate at which different frequencies in a speech signal change over time. Think of them as a summary of how the sound frequencies vary, helping us capture important patterns in speech sounds.
These coefficients are useful in speech recognition and other speech-processing applications because they help highlight the unique qualities of different speech sounds (like vowels and consonants), making it easier for computers to distinguish between them.
To extract cepstral coefficients, a process called cepstral analysis is used, involving these main steps:
1. Pre-Emphasis: The speech signal is passed through a filter that amplifies higher frequencies to balance the spectrum. This helps capture details in speech sounds that may otherwise be lost.
2. Framing and Windowing: The signal is split into short frames (20-40 ms each) because speech characteristics change over time. Each frame is windowed, often with a Hamming window, to reduce edge effects.
3. Fourier Transform: For each frame, a Fourier Transform is applied to convert the time-domain signal into a frequency-domain representation. This step reveals the frequencies that make up each frame of the speech signal.
4. Logarithmic Transformation: The magnitude of each frequency component is converted to a logarithmic scale. This step mimics the human ear's perception, which is more sensitive to changes in lower frequencies than higher ones.
5. Inverse Fourier Transform (Cepstrum): Finally, an inverse Fourier Transform is applied to the log spectrum, transforming it back to the "cepstral domain." The result is a set of coefficients, called cepstral coefficients, that describe the rate of frequency changes in each frame of the signal.
These coefficients capture essential features of the speech signal, making them highly useful in tasks like speech recognition and speaker identification.
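To make the five steps concrete, here is a minimal NumPy sketch of cepstral analysis. The frame length, hop size, pre-emphasis coefficient (0.97), and the number of retained coefficients (13) are illustrative assumptions, not values fixed by the steps above.

```python
import numpy as np

def cepstral_coefficients(signal, frame_len=400, hop=160, n_coeffs=13):
    # 1. Pre-emphasis: boost higher frequencies (y[n] = x[n] - 0.97*x[n-1]).
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    window = np.hamming(frame_len)
    coeffs = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        # 2. Framing and windowing: short frame, Hamming window.
        frame = emphasized[start:start + frame_len] * window
        # 3. Fourier transform: time domain -> frequency domain.
        spectrum = np.fft.rfft(frame)
        # 4. Logarithmic transformation of the magnitude spectrum.
        log_mag = np.log(np.abs(spectrum) + 1e-10)
        # 5. Inverse Fourier transform: back to the "cepstral domain".
        cepstrum = np.fft.irfft(log_mag)
        coeffs.append(cepstrum[:n_coeffs])
    return np.array(coeffs)

# Example: 1 second of a synthetic 200 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
ceps = cepstral_coefficients(np.sin(2 * np.pi * 200 * t))
print(ceps.shape)  # (number_of_frames, 13)
```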
Mel Frequency Cepstral Coefficients (MFCCs)
Mel Frequency Cepstral Coefficients (MFCCs) are a set of features commonly used in speech processing to capture the unique characteristics of a person's voice. MFCCs are extracted by processing the speech signal in a way that mimics how humans hear sounds, especially focusing on lower frequencies, where our ears are more sensitive. Here's a simplified breakdown of how MFCCs are extracted:
1. Pre-Emphasis: The speech signal is first filtered to boost higher frequencies, making details more noticeable.
2. Framing and Windowing: The signal is divided into small time frames (about 20-40 ms each) because speech changes quickly. Each frame is "windowed" with a Hamming window to smooth out the edges.
3. Fourier Transform: For each frame, a Fourier Transform is applied to convert the time-based signal into a frequency-based one, showing the different frequencies and their strengths.
4. Mel Filter Bank: The frequency spectrum is then passed through a set of filters spaced on the Mel scale, which is designed to match human hearing. These filters give more weight to lower frequencies and less to higher frequencies, as we naturally perceive sound this way.
5. Logarithmic Transformation: The output from each Mel filter is converted to a logarithmic scale. This step helps to approximate how our ears detect volume changes, especially small changes at low volumes.
6. Discrete Cosine Transform (DCT): Finally, a mathematical transformation (DCT) is applied to the log Mel values to create a compact set of values. These are the Mel Frequency Cepstral Coefficients (MFCCs).
MFCCs capture important patterns in speech sounds that help machines recognize different words
and voices. They’re especially valuable because they focus on sounds humans hear best, making
them ideal for tasks like speech recognition and speaker identification.
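As a quick illustration, the sketch below extracts MFCCs with the librosa library (an assumed tool choice; any implementation of steps 1-6 would serve). Librosa performs the mel filter bank, log compression, and DCT internally, while pre-emphasis is applied explicitly here since it is not part of librosa's MFCC routine by default.

```python
import numpy as np
import librosa

sr = 16000
y = np.sin(2 * np.pi * 200 * np.arange(sr) / sr)   # 1 s synthetic tone

# Step 1: explicit pre-emphasis.
y = librosa.effects.preemphasis(y)

# Steps 2-6: framing/windowing, FFT, mel filter bank, log, DCT,
# all handled inside librosa. Frame parameters are illustrative.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                             n_fft=512, hop_length=160)
print(mfccs.shape)  # (13, number_of_frames)
```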
Log Frequency Power Coefficients (LFPCs)
Log Frequency Power Coefficients (LFPCs) are features used in speech processing to capture important frequency information in a way that emphasizes lower frequencies, which are generally more important for understanding speech. Unlike MFCCs, which use the Mel scale, LFPCs apply a logarithmic transformation directly to the frequency spectrum. Here's a simplified process of how LFPCs are extracted:
• Pre-Emphasis: As with other speech features, the speech signal is first filtered to boost higher frequencies. This step ensures that subtle details in the signal aren't lost.
• Framing and Windowing: The signal is divided into short frames (20-40 ms each) since speech
characteristics change rapidly. A window function, such as a Hamming window, is applied to each
frame to reduce edge effects.
• Fourier Transform: Each frame is then transformed from the time domain to the frequency
domain using a Fourier Transform. This creates a frequency spectrum that reveals the signal’s
different frequency components and their intensities.
• Logarithmic Frequency Scaling: The frequency axis is then transformed using a logarithmic scale. Unlike the Mel scale used in MFCCs, the log scale directly compresses higher frequencies, making them appear closer together. This step reflects the way humans perceive sound frequencies, with greater sensitivity to lower frequencies.
• Power Spectrum Calculation: The power of each frequency component is computed on
this log-scaled frequency axis, which shows the intensity of each frequency in the
speech signal.
• Discrete Cosine Transform (DCT): Finally, the power spectrum values on the log frequency scale are compressed further using a Discrete Cosine Transform (DCT). This step creates a set of coefficients, known as the Log Frequency Power Coefficients (LFPCs).
LFPCs are particularly useful in speech recognition and speaker identification because they capture the important patterns in speech frequencies with a focus on lower-frequency information. This makes LFPCs effective in applications where emphasis on lower frequencies improves understanding, such as in low-quality or noisy recordings.
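The sketch below shows one plausible LFPC pipeline in NumPy/SciPy. It assumes triangular filters whose center frequencies are spaced logarithmically between f_min and the Nyquist frequency; the exact filter shapes and all parameter values are illustrative and vary between implementations.

```python
import numpy as np
from scipy.fft import dct

def lfpc(signal, sr=16000, frame_len=400, hop=160,
         n_filters=20, n_coeffs=12, f_min=100.0):
    f_max = sr / 2
    # Pre-emphasis, framing, Hamming window, power spectrum.
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    window = np.hamming(frame_len)
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular filters centered at log-spaced frequencies.
    centers = np.geomspace(f_min, f_max, n_filters + 2)
    bins = np.floor(frame_len * centers / sr).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)

    # Log filter-bank energies, then DCT to compress/decorrelate.
    log_energies = np.log(power @ fbank.T + 1e-10)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_coeffs]
```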
Speech Distortion Measures
Speech distortion measures are used to quantify differences between original and processed
speech signals. These measures are critical in evaluating the quality of speech processing
methods, especially in compression, enhancement, and recognition applications.
1. Simplified Distance Measure: This measure calculates the distance between two speech signals in terms of basic Euclidean or cosine distance. Although straightforward, it may not always align with perceptual differences as observed by listeners.
2. LPC-Based Distance Measure: Linear Predictive Coding (LPC) analyzes the formant structure of speech by approximating the vocal tract as a series of linear filters. LPC-based distance measures focus on the coefficients derived from LPC analysis, quantifying the dissimilarity between speech signals based on vocal tract shape.
3. Spectral Distortion Measure: This measure assesses the difference between the spectral representations of two speech signals. It's particularly useful in applications like speech coding and compression, where the preservation of spectral fidelity is essential for natural-sounding speech.
4. Perceptual Distortion Measure: Perceptual distortion measures aim to align closely with how humans perceive sound quality. They often use auditory models to emphasize perceptually relevant features of speech, focusing on attributes like frequency masking and loudness perception. This measure is valuable in applications where subjective quality is important, such as telecommunication and audio streaming services.
Simplified distance measure
A simplified distance measure in speech distortion is a basic way to quantify how different two speech signals are from each other. It's often used to measure how much a speech signal changes after processing, like compression or noise reduction. This measure doesn't account for complex details; instead, it gives a quick estimate of overall differences.
One common simplified distance measure is the Euclidean distance. Here’s how it works in simple steps:
1. Select Key Points: First, choose some key points in each speech signal. These could be specific frames or feature values (like MFCCs).
2. Calculate Difference: For each key point, find the difference in value between the original and modified signal.
3. Square the Differences: To avoid canceling out positive and negative differences, square each difference.
4. Sum and Square Root: Add up all these squared differences, then take the square root of the sum. This result is the Euclidean distance.
Mathematically, if the original signal is x and the modified signal is y, with values compared at N points, the Euclidean distance D is:
D = √( (x₁ − y₁)² + (x₂ − y₂)² + … + (x_N − y_N)² )
This measure is straightforward but gives a general sense of how much a processed signal differs from the original. It's helpful for quick comparisons but may not reflect perceptual (human hearing) differences as well as more complex measures.
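A minimal NumPy sketch of this computation, assuming the two signals' features (for example, MFCC values) are already extracted and aligned to the same length:

```python
import numpy as np

def euclidean_distance(x, y):
    # Steps 2-4: difference, square, sum, square root.
    x, y = np.asarray(x), np.asarray(y)
    return np.sqrt(np.sum((x - y) ** 2))

# Hypothetical feature values for an original and a processed signal.
original = np.array([1.0, 2.0, 3.0])
processed = np.array([1.1, 1.8, 3.3])
print(euclidean_distance(original, processed))  # ~0.374
```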
LPC (Linear Predictive Coding)-based distance measure
An LPC (Linear Predictive Coding)-based distance measure is a way to compare two speech signals by focusing on how their vocal tract shapes differ. LPC models a speech signal by estimating the shape of the vocal tract (throat, mouth, etc.) as the sound is produced. The LPC coefficients represent this shape, so comparing these coefficients can show how similar or different two speech signals are.
Here’s a breakdown of how LPC-based distance measurement works:
1. Extract LPC Coefficients: For each speech signal (original and modified), we calculate LPC coefficients. These coefficients represent the vocal tract's shape and are extracted by using linear prediction on short frames of the signal.
2. Calculate the Difference: To find the distance between the original and modified speech signals, we compare their LPC coefficients. There are several ways to do this:
   1. Itakura-Saito Distance: Measures the difference between two LPC models by evaluating how well one model's spectrum fits the other.
   2. Log-Area Ratio (LAR): Compares parameters derived from the LPC reflection coefficients, which correspond to the cross-sectional areas of an acoustic-tube model of the vocal tract.
   3. Cepstral Distance: Converts LPC coefficients into cepstral coefficients and computes the distance between them.
3. Summing or Averaging the Distances: The distances from each frame are then summed or averaged to give an overall LPC-based distance measure between the two signals.
The LPC-based distance measure is effective in applications like speaker recognition, where it's important to distinguish voices by analyzing their unique vocal tract shapes. Since this measure captures vocal characteristics, it often provides a more accurate comparison for speech applications than simplified measures like Euclidean distance.
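The sketch below outlines a per-frame LPC cepstral distance (option 3 above), assuming librosa for the LPC analysis; the model order (12) and the standard LPC-to-cepstrum recursion are illustrative choices.

```python
import numpy as np
import librosa

def lpc_to_cepstrum(a, n_ceps):
    # a = [1, a1, ..., ap] is the LPC polynomial from librosa.lpc.
    # Standard recursion converting LPC to cepstral coefficients.
    p = len(a) - 1
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k]
        c[n - 1] = acc
    return c

def lpc_cepstral_distance(frame_x, frame_y, order=12):
    # Step 1: LPC coefficients for each frame; step 2: cepstral distance.
    cx = lpc_to_cepstrum(librosa.lpc(frame_x, order=order), order)
    cy = lpc_to_cepstrum(librosa.lpc(frame_y, order=order), order)
    return np.sqrt(np.sum((cx - cy) ** 2))

# Two synthetic 25 ms frames; per-frame distances would then be
# averaged over all frames (step 3).
sr = 16000
t = np.arange(400) / sr
print(lpc_cepstral_distance(np.sin(2 * np.pi * 200 * t),
                            np.sin(2 * np.pi * 210 * t)))
```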
Spectral distortion measures
Spectral distortion measures are techniques used to quantify the difference between the spectral
representations of two speech signals. These measures help assess how much a processed speech signal
(like one that's been compressed or denoised) differs from the original signal, particularly in terms of frequency
content. Here’s a simplified breakdown of how spectral distortion is measured:
Key Steps in Spectral Distortion Measurement
1. Extract the Spectrum: For both the original and modified speech signals, the frequency spectrum is calculated, typically using the Short-Time Fourier Transform (STFT). This breaks the signal into small segments and converts each segment into its frequency components.
2. Calculate the Magnitude Spectrum: The magnitude of each frequency component is determined, which represents how much energy is present at each frequency. The resulting spectra are often referred to as the magnitude spectra of the original and modified signals.
3. Choose a Distortion Measure: There are several methods to quantify the spectral distortion between the two spectra, including:
   1. Mean Squared Error (MSE): Calculates the average squared differences between corresponding frequency magnitudes in the two spectra. This is a simple but effective measure.
   2. Log Spectral Distortion (LSD): Computes the difference in the logarithmic scale of the magnitudes. This measure emphasizes lower frequencies and is more aligned with human perception of sound.
   3. Spectral Flatness Measure: Evaluates how much the spectral shape differs. A flat spectrum indicates noise, while a peaked spectrum indicates tonal signals.
4. Aggregate the Distortion: The calculated distortion values for each frequency can be aggregated (summed or averaged) to obtain a single value representing the overall spectral distortion, as in the sketch below.
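Here is a minimal NumPy sketch of Log Spectral Distortion (LSD), one of the measures listed above. The frame parameters are illustrative, and the two signals are assumed to be time-aligned and of equal length.

```python
import numpy as np

def log_spectral_distortion(x, y, frame_len=512, hop=256):
    window = np.hamming(frame_len)

    def mag_spectra(s):
        # Steps 1-2: STFT frames and their magnitude spectra.
        frames = [s[i:i + frame_len] * window
                  for i in range(0, len(s) - frame_len + 1, hop)]
        return np.abs(np.fft.rfft(frames, axis=1)) + 1e-10

    X, Y = mag_spectra(x), mag_spectra(y)
    # Steps 3-4: per-frame RMS difference of log magnitudes (in dB),
    # then averaged across frames into a single distortion value.
    lsd_per_frame = np.sqrt(np.mean((20 * np.log10(X / Y)) ** 2, axis=1))
    return np.mean(lsd_per_frame)
```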
Importance of Spectral Distortion Measures
Spectral distortion measures are crucial in various applications, such as:
• Speech Recognition: To evaluate how well recognition systems perform on processed speech.
• Speech Enhancement: To assess the effectiveness of noise reduction and other processing techniques.
• Codec Development: To ensure that audio codecs maintain the quality of the original signal after compression.
By quantifying differences in spectral content, these measures help improve the quality and
intelligibility of speech processing systems.
Perceptual distortion measures
Perceptual distortion measures evaluate the differences between two audio signals based on human auditory
perception. Unlike traditional measures that focus purely on mathematical differences in signal values (like
mean squared error), perceptual distortion measures take into account how humans actually perceive sound,
making them more relevant for applications like speech recognition, audio compression, and enhancement.
Here’s a simplified overview of how perceptual distortion measures work:
Key Concepts of Perceptual Distortion Measurement
• Human Hearing Model:
• Perceptual measures are often based on models of human hearing, which consider factors like
frequency sensitivity, loudness, and how sounds mask each other. One common model used is the
Zwicker model or Bark scale, which accounts for the critical bands of hearing, where frequencies are
grouped together based on how they are perceived.
• Calculate Spectra:
• Similar to spectral distortion measures, perceptual measures begin by calculating the frequency
spectra of both the original and modified signals, typically using the Short-Time Fourier Transform
(STFT).
• Apply Masking Effects:
• In perceptual measures, the effects of masking are considered. Masking occurs when a louder sound
makes it difficult to hear a quieter sound in the same frequency range. This means that differences in
parts of the spectrum that are masked by louder sounds can be ignored since they are less perceptible
to listeners.
• Weight the Differences: Differences between the original and modified signals' spectra are weighted according to their perceived importance. Frequencies to which human hearing is more sensitive (like those in the mid-range) are given greater weight in the distortion calculation, while less sensitive frequencies are downweighted.
• Aggregate the Distortion: Finally, the weighted differences across the spectrum are
aggregated to produce a single perceptual distortion value. This can involve methods such as
summing squared differences or calculating a logarithmic measure of difference, emphasizing
the parts of the signal that are most critical to perception.
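As a closing illustration, the sketch below computes a highly simplified perceptually weighted distortion: power spectra are grouped into Bark-scale critical bands and log-band differences are averaged. Full perceptual measures also model masking and loudness, which are omitted here; all parameters are illustrative.

```python
import numpy as np

def hz_to_bark(f):
    # Traunmüller's approximation of the Bark scale (critical bands).
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_band_distortion(x, y, sr=16000, frame_len=512, hop=256, n_bands=24):
    window = np.hamming(frame_len)
    freqs = np.fft.rfftfreq(frame_len, 1.0 / sr)
    # Assign each FFT bin to a critical band.
    band_of_bin = np.clip(hz_to_bark(freqs).astype(int), 0, n_bands - 1)

    def band_power(s):
        frames = [s[i:i + frame_len] * window
                  for i in range(0, len(s) - frame_len + 1, hop)]
        power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
        bands = np.zeros((power.shape[0], n_bands))
        for b in range(n_bands):
            mask = band_of_bin == b
            if mask.any():
                bands[:, b] = power[:, mask].sum(axis=1)
        return np.log(bands + 1e-10)

    # Aggregate: average absolute log-band difference over frames and bands.
    return np.mean(np.abs(band_power(x) - band_power(y)))
```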