Introduction (UCS749: Speech Processing)
● MST:
● EST:
● Quiz:
● Assignments/Project:
Material (Book)
Daniel Jurafsky and James H. Martin, "Speech and Language Processing", 3rd
edition draft, 2019 [JM-2019]
What is this course about?
Fundamentals of Speech Processing: Speech processing involves analyzing
and manipulating speech signals for applications like recognition and synthesis.
Key steps include feature extraction, filtering, and segmentation to prepare the
speech signal for further tasks.
Speech Synthesis: Speech synthesis generates artificial speech from text input.
Modern systems use neural networks, such as Tacotron and WaveNet, to produce
natural-sounding, human-like voices.
Applications of Speech Processing and Synthesis
● Google Translate
● Healthcare
● Entertainment
Sound waves are generated by a sound source, which creates vibrations in the
surrounding medium.
As the source continues to vibrate the medium, the vibrations propagate away
from the source at the speed of sound, forming the sound wave.
These sound waves reach the ears of the listener, who deciphers them into a
received message.
Amplitude: the maximum displacement of a vibrating object from its central
(rest) position. It is a measure of the height of the wave and is directly
related to the sound's loudness.
Sound as a Wave
● The wavelength (λ) of a sound wave is the distance between two identical
parts of a wave, such as the distance between successive peaks or between
any two corresponding points on the cycle.
● The frequency f specifies the number of cycles per second, measured in hertz
(Hz). Human hearing range: ~20 Hz to 20,000 Hz
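Wavelength and frequency are tied together by the speed of sound v through λ = v/f. For example, taking the speed of sound in air to be roughly 343 m/s (an assumed typical value), a 1,000 Hz tone has a wavelength of about 343/1000 ≈ 0.34 m.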
Sound pressure is the local pressure deviation from the ambient atmospheric
pressure caused by a sound wave.
Compression describes the region of high pressure and rarefaction describes the
region of low pressure.
Compression: a region where particles in a medium are close together, causing a
temporary decrease in volume and a region of high pressure.
Rarefaction: a region where particles in a medium are spread apart, causing a
temporary increase in volume and a region of low pressure.
Human Ear
Sampling:
● Imagine a sound wave as a moving object (like a car driving by).
● Sampling is like taking snapshots of that car at regular intervals of time.
● Instead of having the whole continuous motion, you now have a collection of
still pictures at specific moments.
Sampling takes measurements of the wave at equal time intervals (e.g., 8,000
times per second for telephone sound or 44,100 times per second for CD quality).
The more frequent the snapshots (higher sampling rate), the closer it looks to the
original sound wave.
Analog to Digital Conversion
Sampling Rate: the number of samples taken from a continuous signal per second.
Sample Size: the number of bits per sample, i.e., the level of detail or quality
each sample has. The most common sample sizes are 8-bit and 16-bit.
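Together, these two parameters determine how much data the digitized audio produces: the raw (uncompressed) bit rate is simply the sampling rate multiplied by the sample size. A minimal Python sketch, using the telephone and CD figures discussed below (the function name and the mono, uncompressed assumption are ours):

```python
def data_rate_bits_per_second(sampling_rate_hz: int, bits_per_sample: int) -> int:
    """Raw bit rate of uncompressed mono audio: samples/s x bits/sample."""
    return sampling_rate_hz * bits_per_sample

# Telephone speech: 8,000 samples/s at 8 bits per sample
print(data_rate_bits_per_second(8_000, 8))    # 64,000 bit/s (64 kbit/s)

# CD-quality audio (one channel): 44,100 samples/s at 16 bits per sample
print(data_rate_bits_per_second(44_100, 16))  # 705,600 bit/s (~706 kbit/s)
```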
Signal Sampling
Typical Sampling Rates
● Telephone speech: 8,000 samples per second (8 kHz)
● CD-quality audio: 44,100 samples per second (44.1 kHz)
Sampling a signal at a high rate requires a high-speed ADC, large memory
capacity, and high processing power, which is not always practical.
A continuous-time signal can be represented by its samples, and recovered from
them, when the sampling frequency Fs is greater than or equal to twice the
highest frequency component f_max of the message signal:
Fs ≥ 2⋅f_max
Here, Fs is the sampling rate, and f_max is the highest frequency in the signal.
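As a small illustration of the sampling theorem, the Python sketch below (NumPy assumed available; the 5 Hz tone and the chosen rate are made-up example values) samples a sine wave at a rate comfortably above its Nyquist rate:

```python
import numpy as np

f_max = 5            # highest frequency in the signal (Hz), example value
fs = 4 * f_max       # sampling rate, comfortably above the Nyquist rate 2*f_max

t = np.arange(0, 1.0, 1 / fs)       # one second of sample instants, 1/fs apart
x = np.sin(2 * np.pi * f_max * t)   # the sampled 5 Hz sine wave

print(f"Nyquist rate: {2 * f_max} Hz, chosen Fs: {fs} Hz, samples taken: {len(x)}")
```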
Once we have our samples, it’s time to assign each of them a digital value in a
process called quantization. Since digital systems can't handle continuous values,
quantization involves converting the infinite possibilities of an analog signal to a
finite number of discrete values.
More levels = more accurate sound = better sound quality.
The range of the signal's amplitudes is divided into discrete intervals, called
quantization levels.
Suppose an analog signal has amplitude values ranging from −1 to +1.
If we use 4-bit quantization:
● 2^4 = 16 levels.
● The range −1 to +1 is divided into 16 intervals of size 2/16 = 0.125.
● A sample value of 0.73 falls in the interval [0.625, 0.75) and is mapped to
the level 0.625, the lower boundary of its interval.
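A minimal Python sketch of this mapping (assuming, as the worked example implies, that each sample is mapped to the lower boundary of its quantization interval):

```python
import numpy as np

def quantize(x, bits=4, lo=-1.0, hi=1.0):
    """Map each sample to the lower boundary of its quantization interval."""
    levels = 2 ** bits                  # 4 bits -> 16 levels
    step = (hi - lo) / levels           # interval size: 2/16 = 0.125
    idx = np.floor((x - lo) / step)     # index of the interval the sample falls in
    idx = np.clip(idx, 0, levels - 1)   # keep x == hi inside the top interval
    return lo + idx * step

print(quantize(np.array([0.73])))       # -> [0.625], matching the example above
```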
Quantization
Think of it like assigning each point from our snapshots a seat in a limited room.
The more seats (intervals) available, the closer the representation to the actual
signal. However, more seats mean more digital data, which requires more storage.
● A higher number of quantization levels leads to a more accurate signal
representation, but requires more data.
● Quantization error occurs because each sample must be rounded to one of the
available levels, leading to a rough representation of the sound.
Analog to Digital Conversion: Summary
● Sampling converts the continuous-time signal into discrete samples taken at
regular intervals (the sampling rate).
● Quantization converts each sample's continuous amplitude into one of a finite
number of discrete levels (determined by the sample size in bits).
Representation of Speech
Signals can be examined from two points of view, or domains. These two
domains are the time domain and the frequency domain.
A time-domain graph displays how a signal changes over a span of time, while a
frequency-domain graph displays how much of the signal exists within a given
frequency band.
Frequency domain vs Time domain
Analysis in Time Domain: signal samples taken over time render a representation
of how the signal or data changes. For example, data showing the progression of
amplitude over a specific time period would be "amplitude given time".
With frequency-domain analysis, one can identify the key components of the
entire data set rather than examining every variation that occurs in the time
domain.
A frequency-domain graph shows either the phase shift or the magnitude of a
signal at each frequency it contains, i.e., how much of the signal lies within
each given frequency band.
Fourier Transform
A signal can be described as the sum of many sine waves (a "Fourier series")
that have differing frequencies, phases, and amplitudes.
Switching between the time domain and the frequency domain, and back again, is
accomplished by performing mathematical integration using the "Fourier
Transform" equations.
For a continuous-time signal x(t), the Fourier Transform X(f) is defined as:

X(f) = ∫ x(t) e^(−j2πft) dt   (integral over all t, from −∞ to +∞)

X(f) represents the frequency-domain representation of x(t), expressed in terms
of sines and cosines of different frequencies.

The Inverse Fourier Transform is just the opposite of the Fourier Transform. It
takes the frequency-domain representation X(f) of a given signal as input and
converts it back into the original signal x(t):

x(t) = ∫ X(f) e^(+j2πft) df   (integral over all f, from −∞ to +∞)

● x(t): time-domain signal
● X(f): frequency-domain representation
● e^(−j2πft): complex exponential representing a sinusoidal wave with frequency f
● e ≈ 2.718, π ≈ 3.141
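In practice, sampled signals use the discrete counterpart of these integrals, computed efficiently with the Fast Fourier Transform (FFT). A minimal NumPy sketch (the 50 Hz and 120 Hz test tones and the 1,000 Hz sampling rate are made-up example values) that moves a signal into the frequency domain and back:

```python
import numpy as np

fs = 1000                        # sampling rate (Hz), example value
t = np.arange(0, 1.0, 1 / fs)    # one second of samples
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

X = np.fft.rfft(x)                        # frequency-domain representation
freqs = np.fft.rfftfreq(len(x), 1 / fs)   # frequency (Hz) of each bin

# The two strongest frequency bins sit at the two sine frequencies.
print(sorted(freqs[np.argsort(np.abs(X))[-2:]]))   # -> [50.0, 120.0]

# The inverse transform recovers the original time-domain signal.
x_back = np.fft.irfft(X, n=len(x))
print(np.allclose(x, x_back))                      # -> True
```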
Quantifying Sound
Perceptual Features: These are subjective qualities that describe how humans
perceive sound.
● Loudness: Loudness is the perception of how "loud" or "soft" a sound is.
Larger amplitudes correspond to louder sounds.
● Intensity: Intensity refers to the power or energy of a sound wave per unit area,
commonly expressed in decibels (dB). Intensity is related to loudness: higher
intensity generally means louder sounds.
● Frequency: Frequency refers to the number of cycles of a sound wave per second.
Frequency determines the pitch of the sound. Higher frequencies correspond to
higher pitches.
If a wave completes one cycle in T seconds, the number of cycles per second
(frequency) is:
f = 1/T
For example, a wave that completes one cycle in T = 0.01 s has a frequency of
f = 1/0.01 = 100 Hz.
● Time variations: Time variations refer to how sound changes over time.
Speech Production
Source: This is the initial (raw) sound produced by the vocal cords. Source
generates the acoustic energy.
Filter: The vocal tract acts as a filter that modifies the sound produced by the source.
Source-Filter
In the time domain, the source signal x(m) is convolved with the vocal-tract
filter h(m) to produce the observed speech signal s(m):
s(m) = x(m) ∗ h(m)
Here, ∗ represents the convolution operation.
In general, the convolution of two signals f and g is:
(f ∗ g)(x) = ∫ g(t) f(x − t) dt   (integral over all t, from −∞ to +∞)
● The variable x indicates the point at which the convolution result is being
evaluated.
● The integral computes the weighted sum of the product of g(t) and the
shifted version of f(t) over all possible values of t.
Convolution
Figure: two signals (red and blue) are convolved with each other to produce the
result signal (green). First, one of the signals is inverted in time and then
slid over the other signal. At each time instant, the values at the same time
are multiplied and the sum of the products is calculated. The result is stored
at the position where the first value of the sliding signal currently lies.
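A minimal NumPy sketch of the same procedure (the signal values and the 2-point averaging filter are made-up examples; np.convolve performs exactly the flip, slide, multiply, and sum steps described above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # example "source" signal
h = np.array([0.5, 0.5])             # example filter: a 2-point moving average

# Flip one signal in time, slide it across the other, and at each shift
# multiply the overlapping values and sum the products.
s = np.convolve(x, h)
print(s)   # -> [0.5 1.5 2.5 3.5 2. ]
```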
Speech Sounds
● Allophones are different sounds that are used to pronounce the same
phoneme in a language. An allophone is a variation of a phoneme.
● For example, the /p/ sound in "spin" and "pin" is pronounced differently, but
both are still recognized as the /p/ phoneme; in phonetic notation, [p] and
[pʰ] are allophones of the /p/ phoneme. The /l/ in "clear" and "silly" is
another such pair.