Introduction (UCS749)

The document outlines a course on Speech Processing and Synthesis, covering fundamental concepts such as automatic speech recognition, statistical modeling, and speech synthesis techniques. It discusses sound wave properties, the journey of sound to the brain, and the process of analog-to-digital conversion, including sampling and quantization. Additionally, it explores the source-filter model of speech production and the importance of convolution in signal processing.


Introduction to Speech Processing
Introduction

Course Name: Speech Processing and Synthesis


Course Code: UCS749
L T P: 2 0 2
Evaluation Scheme

MST:

EST:

Quiz:

Assignments/Project:
Material (Book)
Daniel Jurafsky and James H. Martin, "Speech and Language Processing", 3rd
edition draft, 2019 [JM-2019]
What is this course about?
Fundamentals of Speech Processing: Speech processing involves analyzing
and manipulating speech signals for applications like recognition and synthesis.
Key steps include feature extraction, filtering, and segmentation to prepare the
speech signal for further tasks.

● Feature extraction reduces the amount of information in a speech signal
while retaining the most important data.

● Filtering reduces noise in the signal.

● Speech segmentation is the process of identifying the boundaries between
words, syllables, or phonemes in spoken language.
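As a hedged illustration of the filtering step, here is a simple moving-average smoother, one elementary way to reduce noise; the function name and window size are illustrative, not taken from the course material:

```python
def moving_average(signal, window=3):
    """Simple low-pass filter: replace each sample with the mean of a
    window of neighbouring samples (edges use a shorter window)."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

# A constant signal with one noisy spike: the filter flattens the spike.
noisy = [1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0]
smooth = moving_average(noisy, window=3)
```

Averaging spreads the spike's energy over its neighbours, so the peak is reduced; real speech front-ends use more selective filters, but the principle is the same.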
What is this course about?
Automatic Speech Recognition: ASR systems convert spoken language into
text, often employing techniques like Hidden Markov Models (HMMs) or neural
networks.
What is this course about?
Statistical Modelling: Statistical models like HMMs or Gaussian Mixture Models
(GMMs) are used to represent and predict patterns in speech data.

Language Models: Language models predict the probability of a sequence of
words. They enhance speech recognition systems by improving word sequence
prediction and correcting errors.

Speech Synthesis: Speech synthesis generates artificial speech from text input.
Modern systems use neural networks, such as Tacotron and WaveNet, to produce
natural-sounding, human-like voices.
Applications of Speech Processing and Synthesis

Voice Assistants: Siri, Alexa, and Google Assistant

Google Translate

Healthcare

Entertainment

Forensics and Security


Sound as a wave

Sound waves are generated by a sound source, which creates vibrations in
the surrounding medium.

As the source continues to vibrate the medium, the vibrations propagate
away from the source at the speed of sound, forming the sound wave.

These sound waves reach the ears of the listener, who deciphers them into a
received message.

● Production/Generation: An object vibrates, and mechanical energy is
converted to acoustic energy.
● Propagation: The sound travels from its source to a receptor.
● Perception: The sound is received and interpreted.
Speech Signal
Sound as a Wave
Sound is a longitudinal pressure variation.

Amplitude: A measure of the height of the wave, directly related to the
sound's loudness. It is the maximum displacement of a vibrating object from
its central position.
Sound as a Wave

● The wavelength (λ) of a sound wave is the distance between two identical
parts of a wave, such as the distance between successive peaks or between
any two corresponding points on the cycle.

● The frequency f specifies the number of cycles per second, measured in hertz
(Hz). Human hearing range: ~20 Hz to 20,000 Hz
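Wavelength and frequency are linked by the speed of sound, λ = v/f. A small sketch (the 343 m/s figure assumes air at roughly room temperature; names are illustrative):

```python
SPEED_OF_SOUND = 343.0  # m/s in air at ~20 °C (an assumed value)

def wavelength(frequency_hz):
    """lambda = v / f: wavelength of a sound wave in air."""
    return SPEED_OF_SOUND / frequency_hz

# The audible range, ~20 Hz to 20,000 Hz, spans wavelengths from
# about 17 m down to about 17 mm.
low = wavelength(20.0)
high = wavelength(20000.0)
```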
Sound as a Wave

Sound pressure is the local pressure deviation from the ambient atmospheric
pressure caused by a sound wave.

Compression describes the region of high pressure and rarefaction describes the
region of low pressure.
Sound as a Wave
Compression
A region where particles in a medium are close together, causing a temporary
decrease in volume and a region of high pressure

Rarefaction
A region where particles in a medium are spread apart, causing a temporary
increase in volume and a region of low pressure
Human Ear

The journey of sound to the brain involves several processes in which
sound waves are captured by the ear, converted into electrical signals,
and interpreted by the brain.

This process relies on the anatomy of the human ear, which is divided into
three main sections: the outer ear, middle ear, and inner ear.
Journey of Sound to the Brain
Sound travels to the brain through the following steps:
1. Outer ear
Sound waves enter the ear through the outer ear and travel through the ear
canal to the eardrum.
2. Middle ear
The eardrum vibrates and sends the vibrations to the malleus, incus, and
stapes, three tiny bones in the middle ear. These bones amplify the vibrations
and send them to the inner ear.
3. Inner ear
The cochlea, a spiral-shaped, fluid-filled structure responsible for
hearing, converts the vibrations into electrical signals that travel along
the auditory nerve to the brain.
Journey of Sound to the Brain
4. Brain interpretation
The brain interprets the electrical signals as sound. Some areas of the brain
compare signals from both ears to determine where the sound came from, while
other areas process language and music.
Problem with sound waves
● Ambiguous nature of speech: no two utterances of the same word are the same
● Variations of the speech signals from one speaker to another
● Surrounding noise and channel distortion
● Context effects
● Computers deal with discrete data (zeros and ones); we need to convert the
continuous signal into a digital representation
Digital Sound Waves
● Digital sound waves are a representation of sound that is converted into a
series of binary digits, or bits, that can be stored, processed, and played back
on a computer.
● Analog-to-Digital Conversion (ADC) of speech sound is the process of
converting continuous analog speech signals into a discrete digital format.
Analog to Digital Conversion (continuous to discrete)

Steps required to convert analog sound waves to digital sound waves:

● Sampling
● Quantization

Sampling converts a time-varying voltage signal into a discrete-time
signal, a sequence of real numbers.
Analog to Digital Conversion

Sampling:
● Imagine a sound wave as a moving object (like a car driving by).
● Sampling is like taking snapshots of that car at regular intervals of time.
● Instead of having the whole continuous motion, you now have a collection of
still pictures at specific moments.

Sampling takes measurements of the wave at equal time intervals (e.g., 8,000
times per second for telephone sound or 44,100 times per second for CD quality).

The more frequent the snapshots (higher sampling rate), the closer it looks to the
original sound wave.
Analog to Digital Conversion
Sampling Rate: The sampling rate is the number of samples taken from a
continuous signal per second.

Sample Size: The number of bits per sample, i.e., the level of detail each
sample has. The most common sample sizes are 8 bits and 16 bits.
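Sampling rate and sample size together determine the data rate of uncompressed digital audio. A small sketch (the function name is illustrative):

```python
def audio_data_rate(sampling_rate_hz, bits_per_sample, channels=1):
    """Bits per second of uncompressed PCM audio:
    rate = sampling_rate * sample_size * channels."""
    return sampling_rate_hz * bits_per_sample * channels

# Telephone speech: 8,000 samples/s at 8 bits, mono -> 64,000 bit/s.
telephone = audio_data_rate(8000, 8)
# CD quality: 44,100 samples/s at 16 bits, stereo -> 1,411,200 bit/s.
cd = audio_data_rate(44100, 16, channels=2)
```

This is why higher sampling rates and larger sample sizes demand more storage and processing, the trade-off discussed below.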
Signal Sampling

Typical sampling rate: 8,000 samples per second. [Figure: the same
waveform sampled at 5, 10, and 20 samples per cycle.]
Sampling

To sample a signal at a high rate, we need a high-speed ADC, large memory
capacity, and high processing power, which is not always practical.

● Nyquist Sampling Theorem:
○ If all significant frequencies of a signal are less than B,
○ and if we sample the signal at a frequency of 2B or higher,
○ we can exactly reconstruct the signal.
○ Any sampling rate less than 2B will lose information.
The Nyquist–Shannon Theorem

The Nyquist-Shannon Theorem (or Sampling Theorem) is a fundamental
principle in signal processing that defines the conditions under which an
analog signal can be sampled and perfectly reconstructed.
The Nyquist–Shannon Theorem

A continuous-time signal can be represented by its samples and can be
recovered when the sampling frequency Fs is greater than or equal to twice
the highest frequency component of the message signal, f_max.

Fs ≥ 2⋅f_max

Here, Fs is the sampling rate, and f_max is the highest frequency in the signal.

This minimum sampling rate, 2⋅f_max, is called the Nyquist rate.
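To see what goes wrong below the Nyquist rate, here is a small sketch (assuming cosine test signals; names are illustrative): a 3 Hz cosine sampled at only 4 Hz produces exactly the same samples as a 1 Hz cosine, so the two are indistinguishable after sampling. This is aliasing.

```python
import math

def sample(freq_hz, fs_hz, n_samples):
    """Take n_samples of cos(2*pi*f*t) at sampling rate fs."""
    return [math.cos(2 * math.pi * freq_hz * n / fs_hz)
            for n in range(n_samples)]

# Fs = 4 Hz violates Fs >= 2*f_max for a 3 Hz signal.
fs = 4.0
alias = sample(3.0, fs, 8)     # under-sampled 3 Hz cosine
true_low = sample(1.0, fs, 8)  # properly sampled 1 Hz cosine
same = all(abs(a - b) < 1e-9 for a, b in zip(alias, true_low))
```

Once the samples coincide, no reconstruction method can tell the original 3 Hz signal from its 1 Hz alias, which is why information is lost.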


Signal Sampling

Each bit can store two possible states: 0 or 1.
With n bits, the total number of unique combinations (values) is 2^n.
For 8 bits: 2^8 = 256 values.

Quantization: Think of it like rounding off numbers.
The next step in analog to digital conversion is Quantization.

Once we have our samples, it’s time to assign each of them a digital value in a
process called quantization. Since digital systems can't handle continuous values,
quantization involves converting the infinite possibilities of an analog signal to a
finite number of discrete values.
More levels = more accurate sound = better sound quality.
The range of the signal's amplitudes is divided into discrete intervals, called
quantization levels.
Suppose an analog signal has amplitude values ranging from −1 to +1.
If we use 4-bit quantization:

● 2^4 = 16 levels.
● The range −1 to +1 is divided into 16 intervals of size 2/16 = 0.125.
● A sample value of 0.73 falls in the interval [0.625, 0.75) and is
mapped to the level 0.625.
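A minimal sketch of the 4-bit example above, assuming truncation to the lower edge of each interval (an assumption about the mapping rule, chosen because it reproduces the slide's 0.73 → 0.625 result):

```python
def quantize(x, bits=4, lo=-1.0, hi=1.0):
    """Uniform quantization: snap x to the lower edge of its interval.
    With 4 bits, [-1, 1] is split into 2**4 = 16 intervals of width
    0.125."""
    levels = 2 ** bits
    step = (hi - lo) / levels
    index = min(int((x - lo) / step), levels - 1)  # clamp the top edge
    return lo + index * step

q = quantize(0.73)  # falls in [0.625, 0.75) -> 0.625
```

The gap between `x` and `quantize(x)` is the quantization error; adding bits shrinks the step size and therefore the maximum error.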
Quantization
Think of it like assigning each point from our snapshots a seat in a limited room.
The more seats (intervals) available, the closer the representation to the actual
signal. However, more seats mean more digital data, which requires more storage.
● A higher number of quantization levels leads to a more accurate signal
representation but requires more data.
● Quantization error can occur if a signal is not captured accurately, leading to a
rough representation of the sound.
Analog to Digital Conversion: Summary
Representation of Speech

Signals can be examined from two points of view, or domains. These two
domains are the time domain and the frequency domain.

A time-domain graph displays the changes in a signal over a span of time and
frequency domain displays how much of the signal exists within a given frequency
band.
Frequency domain vs Time domain
Analysis in the time domain: Signal samples taken over time render a
representation of the signal as measured by its periodic change. For
example, data showing the progression of amplitude over a specific time
period would be "amplitude given time".

Analysis in the frequency domain: In the frequency domain, we observe
amplitude versus frequency. The amplitude of a wave or vibration is
expressed in positive numbers, with the peak amplitude as a measure of
deviation from its central value.
Frequency domain vs Time domain
Frequency domain advantage
If the signal is made up of narrow-band spectral features, then we can
understand the signal much better and much faster by looking at the
frequency domain compared to the time domain.

With frequency-domain analysis, one can figure out the key points in the
total data set, rather than examining every variation that occurs in the
time domain.

A frequency domain graph shows either the phase shift or magnitude of a signal at
each frequency that it exists at. It shows how much of the signal lies within each
given frequency band.
Fourier Transform

• The time domain and the frequency domain are inversely related.

• If the mathematical description of a signal in one domain is known, it is
possible to perform an operation on the signal to see what it looks like in
the other domain. This operation is called the Fourier Transform.

• The Fourier Transform decomposes a time-domain signal into its
constituent frequencies. Conversely, the Inverse Fourier Transform
reconstructs the time-domain signal from its frequency components.
Fourier Transform

A signal can be described as the sum of many sine waves (a "Fourier
series") that have differing frequencies, phases, and amplitudes.

Switching between the time domain and the frequency domain and back again, is
accomplished by performing mathematical integration using the "Fourier
Transform" equations.
Fourier Transform
For a continuous-time signal x(t), the Fourier Transform X(f) is defined as:

X(f) = ∫ x(t) e^(−j2πft) dt   (integrated over all time t)

X(f) represents the frequency-domain representation of x(t) as a sum of
sines and cosines of different frequencies.

The Inverse Fourier Transform is just the opposite of the Fourier Transform.
It takes the frequency-domain representation X(f) of a given signal as input
and converts it back to the original signal x(t):

x(t) = ∫ X(f) e^(j2πft) df   (integrated over all frequencies f)

x(t): time-domain signal
X(f): frequency-domain representation
e^(−j2πft): complex exponential representing a sinusoidal wave with
frequency f (e ≈ 2.718, π ≈ 3.1416)
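For sampled signals, the continuous transform becomes the Discrete Fourier Transform (DFT). A direct, unoptimized sketch (illustrative; real systems use the FFT algorithm for speed):

```python
import cmath
import math

def dft(x):
    """Discrete Fourier Transform, direct form:
    X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

# A cosine completing exactly one cycle over 8 samples concentrates
# its energy in bins 1 and N-1 of the spectrum.
x = [math.cos(2 * math.pi * n / 8) for n in range(8)]
X = dft(x)
magnitudes = [abs(v) for v in X]
```

The spectrum makes the single narrow-band component obvious at a glance, which is exactly the frequency-domain advantage described above.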
Fourier Transform
Quantifying Sound

Perceptual Features: These are subjective qualities that describe how humans
perceive sound
● Loudness: Loudness is the perception of how "loud" or "soft" a sound is.
Larger amplitudes correspond to louder sounds.

● Pitch: Pitch is the perception of the "highness" or "lowness" of a sound.
Higher frequencies correspond to higher pitches.

● Timbre (tone color): Timbre is the quality of a sound that allows us to
distinguish between different sources producing the same pitch and loudness.
For example, the waveform of a violin and a piano playing the same note at
the same loudness will have different harmonic structures, giving them
distinct timbres.
Quantifying Sound

Physical Features: These are objective, measurable properties of sound waves:

● Intensity: Intensity refers to the power or energy of a sound wave per unit
area; its level is commonly measured in decibels (dB). Intensity is related to
loudness: higher intensity generally means louder sounds.
● Frequency: Frequency refers to the number of cycles of a sound wave per second.
Frequency determines the pitch of the sound. Higher frequencies correspond to
higher pitches.
If a wave completes one cycle in T seconds, the number of cycles per second
(frequency) is:
f = 1/T
● Time variations: Time variations refer to how sound changes over time
Speech Production

It is a key component of speech that conveys meaning, emotion, and
emphasis beyond the literal words being spoken.

Timbre determines the quality of sound.
Speech Production in Humans
● Voicing source: The vocal cords vibrate. Modulation of the airflow from
the lungs makes the vocal cords vibrate.
● The space between the cords is called the glottis.
● If the vocal cords are close to each other, the air pressure from the
lungs makes them vibrate periodically, which generates the voice.
● The sound is then filtered and shaped by the articulators.
Source-Filter Approach:
The source-filter model is a fundamental concept in speech production and signal
processing. It breaks down speech production into two main components:

Source: This is the initial (raw) sound produced by the vocal cords. Source
generates the acoustic energy.

Filter: The vocal tract acts as a filter that modifies the sound produced by the source.
Source-Filter
In the time domain, the source signal X(m) is passed through the vocal tract
filter H(m) to produce the observed speech signal S(m) = X(m) ∗ H(m), where
∗ represents the convolution operation.

Convolution can also be used to denoise or enhance a signal.
Convolution
Goal of convolution: take two time-series signals and combine them to
create a third signal. We combine the properties of the source and the
filter.
Source-Filter

S[t] is the current value of the signal at time t.
E[t] is the excitation signal at time t.
S[t−k] is the value of the signal s at a past time step t−k.
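The terms above suggest an autoregressive (all-pole) filter of the form s[t] = e[t] + Σₖ aₖ·s[t−k]; since the slide's equation appears as an image, that form is an assumption here. A minimal sketch under that assumption:

```python
def synthesize(excitation, coeffs):
    """All-pole (autoregressive) filter, a common source-filter form:
    s[t] = e[t] + sum_k a_k * s[t-k].
    coeffs[k-1] holds a_k; past samples before t=0 are taken as 0."""
    s = []
    for t, e in enumerate(excitation):
        past = sum(a * s[t - k]
                   for k, a in enumerate(coeffs, start=1)
                   if t - k >= 0)
        s.append(e + past)
    return s

# A single excitation impulse through a one-pole filter (a_1 = 0.5)
# decays geometrically, like a damped resonance of the vocal tract.
impulse = [1.0, 0.0, 0.0, 0.0]
out = synthesize(impulse, coeffs=[0.5])
```

The excitation plays the role of the source (glottal pulses or noise) and the coefficients shape it, which is the filter's job in the model.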
Convolution

● g∗f means the convolution of two functions g(t) and f(t):

(g∗f)(x) = ∫ g(t) f(x−t) dt

● g(t) is the filter and f(t) is the source.
● The variable x indicates the point at which the convolution result is being
evaluated.
● The function f(x−t) represents f(t) shifted by x.
● The integral computes the weighted sum of the product of g(t) and the
shifted version of f(t) over all possible values of t.
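In discrete time, the integral becomes a sum. A minimal sketch (names are illustrative); convolving two rectangular signals produces the flat-topped overlap shape described in the figure discussion:

```python
def convolve(f, g):
    """Discrete convolution: (f*g)[n] = sum_t f[t] * g[n-t].
    Output length is len(f) + len(g) - 1."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fv in enumerate(f):
        for j, gv in enumerate(g):
            out[i + j] += fv * gv  # each shift-and-multiply overlap
    return out

# Two rectangles: the result ramps up, plateaus while the overlap is
# maximal, then ramps back down.
source = [1.0, 1.0, 1.0, 1.0]
filt = [1.0, 1.0]
result = convolve(source, filt)
```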
Convolution

Figure: Two signals (red and blue) are convolved with each other to
produce the result signal (green). First, one of the signals is inverted
in time and then slid over the other signal. At each time step, the
values at the same moment are multiplied, and the sum of the products is
calculated. The result is stored at the position where the first value of
the sliding signal is located. The shaded region represents the overlap
between the two signals.


Convolution
This figure represents the result of a convolution operation, denoted
(f∗g)(x), where two functions f(x) and g(x) are convolved.

The x-axis represents the independent variable x, and the y-axis
represents the value of the convolution result.

The curve starts at 0 for x < −1, indicating no overlap between f and g
at that point.
Convolution

Between −1 and 1, the value of (f∗g)(x) increases, reaching a plateau at
0.5.

From x = 1 to x = 2, the curve decreases back to 0 as the overlap
diminishes.

The constant value in the middle indicates that the overlap remains
maximal and consistent during this range.
Speech Sounds: Phonemics and Phonetics
● The study of phonemes is the focus of phonemics. A phoneme is the smallest
unit of sound that distinguishes one word from another in a language.
● It is widely assumed that English has between 40 and 45 phonemes.
● Each phoneme represents a distinct collection of articulatory motions that
include the type and location of sound excitation as well as the position or
movement of the vocal tract articulators.
● The English words pet and bet differ in their initial consonant (phone) and
have different meanings. As a result, /p/ and /b/ are phonemes of the
language. Two forward slashes are widely used to represent a phoneme:
/a/, /p/, /b/.
● Another minimal pair: coil and foil, which differ only in /k/ vs. /f/.
Speech Sounds
● Allophones are different sounds that are used to pronounce the same
phoneme in a language. An allophone is a variation of phoneme.

● For example, the /p/ sound in "spin" and "pin" is pronounced differently, but
both are still recognized as the /p/ phoneme. Likewise, the /l/ sounds in
"clear" and "silly" are variants of the same phoneme.

● In square brackets, [p] and [pʰ] are allophones of the /p/ phoneme.

● It is important to note that phonemes in one language may be allophones in
another. For example, in English, the sounds /z/ and /s/ are separate
phonemes, whereas in Spanish, they are allophones.
Phonemes and Graphemes
A grapheme is a written representation of a phoneme, which is a spoken sound.
Grapheme is the way we write a phoneme. The letter "b" in "bat" represents the
phoneme /b/. The letter "p" in "pat" represents the phoneme /p/.
Differences between Phonemes and Graphemes:
● Visual and verbal: Graphemes are visual, while phonemes are verbal.
● Multiple graphemes for one phoneme: it is possible for a single phoneme to
be represented by more than one grapheme. For example, the phoneme /f/ can
be written with the grapheme "f" (as in "fan") or "ph" (as in "phone").
● A single grapheme can represent multiple phonemes depending on the context
(e.g., "c" in "cat" /k/ vs. "c" in "city" /s/, or "g" in "get" /g/ vs. "g"
in "giraffe" /dʒ/).
Grapheme to Phoneme Conversion
Grapheme-to-Phoneme Conversion (G2P) stands as a fundamental process
within the vast domain of natural language processing (NLP), where it plays a
pivotal role in bridging the gap between the written word and its spoken form.

G2P conversion is essential in various NLP applications, most notably:

● Text-to-Speech (TTS) Systems: Enabling computers to read text out loud in a
human-like voice.
● Automatic Speech Recognition (ASR): Assisting in the accurate transcription
of spoken language into text.
● Language Learning Tools: Aiding learners in understanding the correct
pronunciation of new words.
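At its simplest, G2P can be a dictionary lookup; production systems combine large pronunciation dictionaries (such as CMUdict) with learned rules for out-of-vocabulary words. A toy sketch whose lexicon entries use hypothetical ARPAbet-style symbols and are not drawn from any real dictionary:

```python
# Hypothetical toy lexicon mapping words to phoneme sequences.
LEXICON = {
    "pet": ["p", "eh", "t"],
    "bet": ["b", "eh", "t"],
    "cat": ["k", "ae", "t"],
}

def g2p(word, lexicon=LEXICON):
    """Dictionary-lookup G2P: return the phoneme sequence for a word,
    or None when the word is out of vocabulary."""
    return lexicon.get(word.lower())

phones = g2p("pet")
```

The None case is where real G2P gets hard: unseen words need rule-based or statistical prediction rather than lookup.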
Thank you
