
Speech Processing

Unit 2
Speech Features

In speech-based applications, extracting and analyzing speech features is essential for accurate processing and recognition. These features capture key characteristics of human speech, such as frequency, amplitude, and phonetic elements, which are used in various applications, including speech recognition, speaker identification, and emotion detection.
Speech Features
1.Cepstral Coefficients: These are derived from the cepstrum of a signal, which represents the rate of
change in the spectral bands. Cepstral coefficients help in characterizing the speech signal's spectral
properties, which are important for distinguishing different speech sounds.
2.Mel Frequency Cepstral Coefficients (MFCCs): MFCCs are one of the most widely used features in
speech processing. They are based on the Mel scale, which approximates human auditory perception
by placing more emphasis on lower frequencies. MFCCs are extracted by applying a Fourier transform
to each frame of the speech signal, passing the resulting spectrum through a series of filters spaced
on the Mel scale, computing the log power of each filter output, and finally taking the discrete cosine
transform (DCT). MFCCs are highly effective in capturing speech-specific information, making them
ideal for speech and speaker recognition.
3.Perceptual Linear Prediction (PLP): PLP is designed to model human auditory perception more
closely than traditional linear prediction. It emphasizes spectral peaks that correspond to formants
(vocal tract resonances) by warping the frequency axis to resemble human auditory responses. This
feature extraction method helps improve the robustness of speech recognition systems in noisy
environments.
4.Log Frequency Power Coefficients (LFPCs): LFPCs apply a logarithmic scale to frequency, which helps
in capturing the non-linear perception of sound frequencies in humans. Like MFCCs, LFPCs are based
on spectral analysis but use a log frequency scale instead of the Mel scale. LFPCs can improve
performance in certain speech processing tasks, particularly where a higher emphasis on low-
frequency regions is needed.
Cepstral coefficients

Cepstral coefficients are values that represent the rate at which different
frequencies in a speech signal change over time. Think of them as a summary
of how the sound frequencies vary, helping us capture important patterns in
speech sounds.
These coefficients are useful in speech recognition and other speech-
processing applications because they help highlight the unique qualities of
different speech sounds (like vowels and consonants), making it easier for
computers to distinguish between them.
To extract cepstral coefficients, a process called cepstral analysis is used, involving these
main steps:
1.Pre-Emphasis: The speech signal is passed through a filter that amplifies higher
frequencies to balance the spectrum. This helps capture details in speech sounds that
may otherwise be lost.
2.Framing and Windowing: The signal is split into short frames (20-40 ms each) because
speech characteristics change over time. Each frame is windowed, often with a Hamming
window, to reduce edge effects.
3.Fourier Transform: For each frame, a Fourier Transform is applied to convert the time-
domain signal into a frequency-domain representation. This step reveals the frequencies
that make up each frame of the speech signal.
4.Logarithmic Transformation: The magnitude of each frequency component is converted
to a logarithmic scale. This step mimics the human ear’s perception, which is more
sensitive to changes in lower frequencies than higher ones.
5.Inverse Fourier Transform (Cepstrum): Finally, an inverse Fourier Transform is applied to
the log spectrum, transforming it back to the "cepstral domain." The result is a set of
coefficients, called cepstral coefficients, that describe the rate of frequency changes in
each frame of the signal.
These coefficients capture essential features of the speech signal, making them highly
useful in tasks like speech recognition and speaker identification.
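
A minimal sketch of steps 3-5 in Python with NumPy (pre-emphasis and framing are assumed to have already been applied; the function name and the synthetic test frame below are illustrative, not part of the notes):

```python
import numpy as np

def real_cepstrum(frame, eps=1e-10):
    """Real cepstrum of one windowed frame: FFT -> log magnitude -> inverse FFT."""
    spectrum = np.fft.rfft(frame)              # step 3: frequency-domain representation
    log_mag = np.log(np.abs(spectrum) + eps)   # step 4: logarithmic transformation
    return np.fft.irfft(log_mag)               # step 5: back to the cepstral domain

# Illustrative usage on a synthetic 25 ms frame at 16 kHz
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
frame = np.sin(2 * np.pi * 200 * t) * np.hamming(len(t))  # windowed test tone
coeffs = real_cepstrum(frame)[:13]  # keep the first 13 cepstral coefficients
```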
Mel Frequency Cepstral Coefficients (MFCCs)

Mel Frequency Cepstral Coefficients (MFCCs) are a set of features commonly used in speech
processing to capture the unique characteristics of a person's voice. MFCCs are extracted by
processing the speech signal in a way that mimics how humans hear sounds, especially focusing
on lower frequencies, where our ears are more sensitive. Here's a simplified breakdown of how
MFCCs are extracted:
1.Pre-Emphasis: The speech signal is first filtered to boost higher frequencies, making details more
noticeable.
2.Framing and Windowing: The signal is divided into small time frames (about 20-40 ms each)
because speech changes quickly. Each frame is "windowed" with a Hamming window to smooth out
the edges.
3.Fourier Transform: For each frame, a Fourier Transform is applied to convert the time-based signal
into a frequency-based one, showing the different frequencies and their strengths.
4.Mel Filter Bank: The frequency spectrum is then passed through a set of filters spaced on the Mel
scale, which is designed to match human hearing. These filters give more weight to lower frequencies
and less to higher frequencies, as we naturally perceive sound this way.
5.Logarithmic Transformation: The output from each Mel filter is converted to a logarithmic scale.
This step helps to approximate how our ears detect volume changes, especially small changes at low
volumes.
6.Discrete Cosine Transform (DCT): Finally, a mathematical transformation (DCT) is applied to the log
Mel values to create a compact set of values. These are the Mel Frequency Cepstral Coefficients
(MFCCs).
MFCCs capture important patterns in speech sounds that help machines recognize different words
and voices. They’re especially valuable because they focus on sounds humans hear best, making
them ideal for tasks like speech recognition and speaker identification.
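
For reference, this whole chain (framing, Fourier transform, Mel filter bank, log, DCT) is implemented in the librosa library; a small usage sketch follows, where the file name and parameter values are illustrative, and pre-emphasis, if desired, can be applied beforehand with librosa.effects.preemphasis:

```python
import librosa

# Load a speech recording (path is illustrative) and extract 13 MFCCs per frame
y, sr = librosa.load("speech.wav", sr=16000)
y = librosa.effects.preemphasis(y)  # optional pre-emphasis step
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                             n_fft=512, hop_length=160)  # ~32 ms frames, 10 ms hop
print(mfccs.shape)  # (13, number_of_frames)
```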
Perceptual Linear Prediction (PLP)

Perceptual Linear Prediction (PLP) is a method used in speech processing to capture
important features of a speech signal in a way that closely resembles human hearing.
PLP is particularly useful for tasks like speech recognition, especially in noisy
environments, because it emphasizes the most important aspects of speech sounds.
Here’s a simple explanation of how PLP features are extracted:
• Pre-Emphasis: Just like in MFCC extraction, the speech signal is first filtered to
emphasize higher frequencies, making details easier to capture.
• Framing and Windowing: The speech signal is split into short time frames (about 20-30
ms each), since speech characteristics change quickly. A window function (like a
Hamming window) is applied to each frame to smooth out the edges.
• Power Spectrum Calculation: For each frame, a Fourier Transform is applied to convert
the signal from the time domain to the frequency domain. This step provides the power
spectrum, showing which frequencies are most present.
• Critical-Band Filtering: The power spectrum is then passed through a set of filters
based on the critical bands of hearing, which represent ranges of frequencies that our
ears naturally group together. This step models how humans perceive frequency ranges.
• Equal-Loudness Pre-Emphasis: The critical-band energies are adjusted using an equal-
loudness curve, which reflects the fact that our ears are more sensitive to certain
frequencies. This curve emphasizes mid-frequency ranges where human hearing is
most sensitive.
• Intensity Loudness Compression: To mimic how the ear compresses loudness, the
critical-band energies are compressed with a cube-root (power-law) function, following
Stevens' law of loudness. This step makes PLP less sensitive to variations in volume.
• Linear Predictive Coding (LPC): Finally, Linear Predictive Coding (LPC) is applied to the
compressed critical-band energies. LPC models the shape of the vocal tract and helps
reduce the number of features, resulting in the PLP coefficients.
• PLP coefficients capture the essential characteristics of speech in a way that's close to
human hearing. This makes them useful for robust speech recognition, as they
emphasize the parts of speech sounds that humans rely on most, even in noisy
conditions.
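
The Python/NumPy/SciPy sketch below follows these steps in a deliberately simplified form: rectangular critical bands stand in for Hermansky's asymmetric masking curves, the equal-loudness weight is approximate, and the optional final conversion to cepstral coefficients is omitted. All names and parameter choices are illustrative, not a reference implementation:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def hz_to_bark(f):
    # Traunmüller's approximation of the Bark (critical-band) scale
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(b):
    return 1960.0 * (b + 0.53) / (26.28 - b)

def equal_loudness(f):
    # Approximate equal-loudness weighting (angular-frequency form)
    w2 = (2 * np.pi * f) ** 2
    return ((w2 + 56.8e6) * w2 ** 2) / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))

def plp_frame(frame, sr, n_bands=17, order=12):
    """Very simplified PLP for one pre-emphasized, windowed frame."""
    power = np.abs(np.fft.rfft(frame)) ** 2                  # power spectrum
    bark = hz_to_bark(np.fft.rfftfreq(len(frame), 1.0 / sr))

    # Critical-band integration (rectangular bands, a simplification)
    edges = np.linspace(bark[0], bark[-1] + 1e-6, n_bands + 1)
    energies = np.array([power[(bark >= lo) & (bark < hi)].sum() + 1e-10
                         for lo, hi in zip(edges[:-1], edges[1:])])

    # Equal-loudness pre-emphasis, then cube-root loudness compression
    centers = bark_to_hz((edges[:-1] + edges[1:]) / 2.0)
    loudness = (equal_loudness(centers) * energies) ** 0.33

    # All-pole (LPC) model of the compressed auditory spectrum
    autocorr = np.fft.irfft(loudness)
    return solve_toeplitz(autocorr[:order], autocorr[1:order + 1])
```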
Log Frequency Power Coefficients

Log Frequency Power Coefficients (LFPCs) are features used in speech processing to capture
important frequency information in a way that emphasizes lower frequencies, which are generally
more important for understanding speech. Unlike MFCCs, which use the Mel scale, LFPCs apply a
logarithmic transformation directly to the frequency spectrum. Here's a simplified process of how
LFPCs are extracted:
• Pre-Emphasis: As with other speech features, the speech signal is first filtered to boost higher
frequencies. This step ensures that subtle details in the signal aren't lost.
• Framing and Windowing: The signal is divided into short frames (20-40 ms each) since speech
characteristics change rapidly. A window function, such as a Hamming window, is applied to each
frame to reduce edge effects.
• Fourier Transform: Each frame is then transformed from the time domain to the frequency
domain using a Fourier Transform. This creates a frequency spectrum that reveals the signal’s
different frequency components and their intensities.
• Logarithmic Frequency Scaling: The frequency axis is then transformed using a logarithmic scale.
Unlike the Mel scale used in MFCCs, the log scale directly compresses higher frequencies, making
them appear closer together. This step reflects the way humans perceive sound frequencies, with
greater sensitivity to lower frequencies.
• Power Spectrum Calculation: The power of each frequency component is computed on
this log-scaled frequency axis, which shows the intensity of each frequency in the
speech signal.
• Discrete Cosine Transform (DCT): Finally, the power spectrum values on the log
frequency scale are compressed further using a Discrete Cosine Transform (DCT). This
step creates a set of coefficients, known as the Log Frequency Power Coefficients
(LFPCs).
• LFPCs are particularly useful in speech recognition and speaker identification because
they capture the important patterns in speech frequencies with a focus on lower-
frequency information. This makes LFPCs effective in applications where emphasis on
lower frequencies improves understanding, such as in low-quality or noisy recordings.
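
Because LFPCs are less standardized than MFCCs, the following is only a plausible Python sketch of the steps above: a log-spaced filter bank replaces the Mel filter bank, and a DCT compresses the log band powers. The function name, band count, and minimum frequency are assumptions for illustration:

```python
import numpy as np
from scipy.fft import dct

def lfpc_frame(frame, sr, n_bands=12, fmin=100.0, n_coeffs=12):
    """Illustrative LFPC extraction for one pre-emphasized, windowed frame."""
    power = np.abs(np.fft.rfft(frame)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)

    # Band edges spaced logarithmically from fmin up to the Nyquist frequency
    edges = np.logspace(np.log10(fmin), np.log10(sr / 2.0), n_bands + 1)
    band_power = np.array([power[(freqs >= lo) & (freqs < hi)].sum() + 1e-10
                           for lo, hi in zip(edges[:-1], edges[1:])])

    # Log power, then DCT to decorrelate (the same final step as MFCCs)
    return dct(np.log(band_power), type=2, norm='ortho')[:n_coeffs]
```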
Speech Distortion Measures
Speech distortion measures are used to quantify differences between original and processed
speech signals. These measures are critical in evaluating the quality of speech processing
methods, especially in compression, enhancement, and recognition applications.
1.Simplified Distance Measure: This measure calculates the distance between two speech signals
in terms of basic Euclidean or cosine distance. Although straightforward, it may not always align
with perceptual differences as observed by listeners.
2.LPC-Based Distance Measure: Linear Predictive Coding (LPC) analyzes the formant structure of
speech by approximating the vocal tract as a series of linear filters. LPC-based distance measures
focus on the coefficients derived from LPC analysis, quantifying the dissimilarity between speech
signals based on vocal tract shape.
3.Spectral Distortion Measure: This measure assesses the difference between the spectral
representations of two speech signals. It's particularly useful in applications like speech coding
and compression, where the preservation of spectral fidelity is essential for natural-sounding
speech.
4.Perceptual Distortion Measure: Perceptual distortion measures aim to align closely with how
humans perceive sound quality. They often use auditory models to emphasize perceptually relevant
features of speech, focusing on attributes like frequency masking and loudness perception. This
measure is valuable in applications where subjective quality is important, such as
telecommunication and audio streaming services.
Simplified distance measure
A simplified distance measure in speech distortion is a basic way to quantify how different two speech
signals are from each other. It's often used to measure how much a speech signal changes after
processing, like compression or noise reduction. This measure doesn't account for complex details; instead,
it gives a quick estimate of overall differences.
One common simplified distance measure is the Euclidean distance. Here’s how it works in simple steps:
1.Select Key Points: First, choose some key points in each speech signal. These could be specific frames or
feature values (like MFCCs).
2.Calculate Difference: For each key point, find the difference in value between the original and modified
signal.
3.Square the Differences: To avoid canceling out positive and negative differences, square each difference.
4.Sum and Square Root: Add up all these squared differences, then take the square root of the sum. This
result is the Euclidean distance.
Mathematically, if the original signal is x and the modified signal is y, the Euclidean distance D over N sample points is:

D = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}
This measure is straightforward but gives a general sense of how much a processed signal differs from the
original. It's helpful for quick comparisons but may not reflect perceptual (human hearing) differences as
well as more complex measures.
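
A direct NumPy translation of these steps might look like this (the two example vectors are invented purely for illustration):

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length feature vectors,
    e.g. MFCCs of the original and the processed signal."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))  # square, sum, square root

# Illustrative comparison of two 3-dimensional feature vectors
original = np.array([12.1, -3.4, 0.8])
processed = np.array([11.7, -3.9, 1.1])
print(euclidean_distance(original, processed))
```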
LPC (Linear Predictive Coding)-based distance measure
An LPC (Linear Predictive Coding)-based distance measure is a way to compare two speech signals by
focusing on how their vocal tract shapes differ. LPC models a speech signal by estimating the shape of the
vocal tract (throat, mouth, etc.) as the sound is produced. The LPC coefficients represent this shape, so
comparing these coefficients can show how similar or different two speech signals are.
Here’s a breakdown of how LPC-based distance measurement works:
1.Extract LPC Coefficients: For each speech signal (original and modified), we calculate LPC coefficients.
These coefficients represent the vocal tract's shape and are extracted by using linear prediction on short
frames of the signal.
2.Calculate the Difference: To find the distance between the original and modified speech signals, we compare
their LPC coefficients. There are several ways to do this:
1. Itakura-Saito Distance: Measures the difference between two LPC models by evaluating how much one
model can predict the other.
2. Log-Area Ratio (LAR): Compares the area under the spectral envelope of the LPC coefficients.
3. Cepstral Distance: Converts LPC coefficients into cepstral coefficients and computes the distance
between them.
3.Summing or Averaging the Distances: The distances from each frame are then summed or averaged to give
an overall LPC-based distance measure between the two signals.
The LPC-based distance measure is effective in applications like speaker recognition, where it's important to
distinguish voices by analyzing their unique vocal tract shapes. Since this measure captures vocal
characteristics, it often provides a more accurate comparison for speech applications than simplified
measures like Euclidean distance.
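
As a sketch of option 3 above (the cepstral distance), the code below estimates LPC coefficients with librosa, converts them to cepstral coefficients via the standard recursion, and takes a Euclidean distance per frame. The helper names are illustrative, and the recursion assumes the number of cepstral coefficients does not exceed the LPC order:

```python
import numpy as np
import librosa

def lpc_to_cepstrum(a, n_ceps):
    """LPC polynomial [1, a1, ..., ap] -> first n_ceps cepstral coefficients
    (requires n_ceps <= p)."""
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        # c_n = -a_n - sum_{k=1}^{n-1} (k/n) c_k a_{n-k}
        c[n - 1] = -a[n] - sum((k / n) * c[k - 1] * a[n - k]
                               for k in range(1, n))
    return c

def lpc_cepstral_distance(frame_a, frame_b, order=12):
    """Cepstral distance between the LPC models of two speech frames."""
    c_a = lpc_to_cepstrum(librosa.lpc(frame_a, order=order), order)
    c_b = lpc_to_cepstrum(librosa.lpc(frame_b, order=order), order)
    return np.sqrt(np.sum((c_a - c_b) ** 2))
```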
Spectral distortion measures
Spectral distortion measures are techniques used to quantify the difference between the spectral
representations of two speech signals. These measures help assess how much a processed speech signal
(like one that's been compressed or denoised) differs from the original signal, particularly in terms of frequency
content. Here’s a simplified breakdown of how spectral distortion is measured:
Key Steps in Spectral Distortion Measurement
1.Extract the Spectrum: For both the original and modified speech signals, the frequency spectrum is
calculated, typically using the Short-Time Fourier Transform (STFT). This breaks the signal into small
segments and converts each segment into its frequency components.
2.Calculate the Magnitude Spectrum: The magnitude of each frequency component is determined, which
represents how much energy is present at each frequency. The resulting spectra are often referred to as
the magnitude spectra of the original and modified signals.
3.Choose a Distortion Measure: There are several methods to quantify the spectral distortion between the
two spectra, including:
1. Mean Squared Error (MSE): Calculates the average squared differences between corresponding
frequency magnitudes in the two spectra. This is a simple but effective measure.
2. Log Spectral Distortion (LSD): Computes the difference in the logarithmic scale of the magnitudes.
This measure emphasizes lower frequencies and is more aligned with human perception of sound.
3. Spectral Flatness Measure: Evaluates how much the spectral shape differs. A flat spectrum
indicates noise, while a peaked spectrum indicates tonal signals.
4.Aggregate the Distortion: The calculated distortion values for each frequency can be aggregated
(summed or averaged) to obtain a single value representing the overall spectral distortion.
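
As one concrete example, Log Spectral Distortion (LSD) can be sketched as below, assuming librosa for the STFT; the frame parameters and function name are illustrative:

```python
import numpy as np
import librosa

def log_spectral_distortion(ref, deg, n_fft=512, hop=160, eps=1e-10):
    """Mean log-spectral distortion (in dB) between two equal-length signals."""
    S_ref = np.abs(librosa.stft(ref, n_fft=n_fft, hop_length=hop))  # magnitude spectra
    S_deg = np.abs(librosa.stft(deg, n_fft=n_fft, hop_length=hop))
    diff_db = 20.0 * (np.log10(S_ref + eps) - np.log10(S_deg + eps))
    # RMS over frequency in each frame, then averaged across frames
    return np.mean(np.sqrt(np.mean(diff_db ** 2, axis=0)))
```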
Importance of Spectral Distortion Measures
Spectral distortion measures are crucial in various applications, such as:
•Speech Recognition: To evaluate how well recognition systems perform on processed
speech.
•Speech Enhancement: To assess the effectiveness of noise reduction and other processing
techniques.
•Codec Development: To ensure that audio codecs maintain the quality of the original signal
after compression.
By quantifying differences in spectral content, these measures help improve the quality and
intelligibility of speech processing systems.
Perceptual distortion measures
Perceptual distortion measures evaluate the differences between two audio signals based on human auditory
perception. Unlike traditional measures that focus purely on mathematical differences in signal values (like
mean squared error), perceptual distortion measures take into account how humans actually perceive sound,
making them more relevant for applications like speech recognition, audio compression, and enhancement.
Here’s a simplified overview of how perceptual distortion measures work:
Key Concepts of Perceptual Distortion Measurement
• Human Hearing Model:
• Perceptual measures are often based on models of human hearing, which consider factors like
frequency sensitivity, loudness, and how sounds mask each other. One common model used is the
Zwicker model or Bark scale, which accounts for the critical bands of hearing, where frequencies are
grouped together based on how they are perceived.
• Calculate Spectra:
• Similar to spectral distortion measures, perceptual measures begin by calculating the frequency
spectra of both the original and modified signals, typically using the Short-Time Fourier Transform
(STFT).
• Apply Masking Effects:
• In perceptual measures, the effects of masking are considered. Masking occurs when a louder sound
makes it difficult to hear a quieter sound in the same frequency range. This means that differences in
parts of the spectrum that are masked by louder sounds can be ignored since they are less perceptible
to listeners.
• Weight the Differences: Differences between the original and modified signals' spectra are
weighted according to their perceived importance. Frequencies that are more sensitive to
human hearing (like those in the mid-range) are given greater importance in the distortion
calculation, while less sensitive frequencies are downweighted.
• Aggregate the Distortion: Finally, the weighted differences across the spectrum are
aggregated to produce a single perceptual distortion value. This can involve methods such as
summing squared differences or calculating a logarithmic measure of difference, emphasizing
the parts of the signal that are most critical to perception.

Example of Perceptual Distortion Measures

One of the commonly used perceptual distortion measures is Perceptual Evaluation of Speech
Quality (PESQ), which assesses the quality of speech signals. PESQ involves several steps:
•Preprocessing: Aligns the reference and degraded signals.
•Modeling: Uses a model of human hearing to compute the difference between the signals in a
way that reflects perceptual differences.
•Scoring: Produces a score that represents the quality of the processed speech compared to the
original.
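
In practice, PESQ is usually computed with an existing implementation rather than from scratch. The sketch below assumes the third-party pesq package (a Python wrapper of the ITU-T P.862 reference code) and soundfile for I/O; the file names are illustrative:

```python
import soundfile as sf
from pesq import pesq  # third-party wrapper of the ITU-T P.862 reference code

# Load time-aligned reference and degraded recordings sampled at 16 kHz
ref, sr = sf.read("reference.wav")
deg, _ = sf.read("degraded.wav")

# 'wb' selects wideband PESQ; the result is a MOS-like quality score
score = pesq(sr, ref, deg, 'wb')
print(f"PESQ score: {score:.2f}")
```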
Importance of Perceptual Distortion Measures
Perceptual distortion measures are crucial for:
•Audio Compression: Ensuring that audio codecs maintain sound quality while reducing file size by
prioritizing perceptually significant sounds.
•Speech Enhancement: Evaluating the effectiveness of noise reduction techniques in maintaining speech
intelligibility.
•Quality Assessment: Providing more accurate assessments of perceived audio quality in various
applications.
By incorporating human hearing characteristics into the evaluation process, perceptual distortion
measures offer a more relevant assessment of audio quality and intelligibility, especially in the context of
speech processing.
