Introduction (UCS749: Speech Processing)
● MST:
● EST:
● Quiz:
● Assignments/Project:
Material (Book)
Daniel Jurafsky and James H. Martin, "Speech and Language Processing", 3rd
edition draft, 2019 [JM-2019]
What is this course about?
Fundamentals of Speech Processing: Speech processing involves analyzing
and manipulating speech signals for applications like recognition and synthesis.
Key steps include feature extraction, filtering, and segmentation to prepare the
speech signal for further tasks.
Speech Synthesis: Speech synthesis generates artificial speech from text input.
Modern systems use neural networks, such as Tacotron and WaveNet, to produce
natural-sounding, human-like voices.
Applications of Speech Processing and Synthesis
● Google Translate
● Healthcare
● Entertainment
Sound waves are generated by a sound source, which creates vibrations in the
surrounding medium.
As the source continues to vibrate the medium, the vibrations propagate away
from the source at the speed of sound, forming the sound wave.
These sound waves reach the ears of the listener, who deciphers them into a
received message.
Amplitude: the maximum displacement of a vibrating object from its central
(rest) position. It is a measure of the height of the wave and is directly
related to the sound's loudness.
Sound as a Wave
● The wavelength (λ) of a sound wave is the distance between two identical
parts of a wave, such as the distance between successive peaks or between
any two corresponding points on the cycle.
● The frequency f specifies the number of cycles per second, measured in hertz
(Hz). Human hearing range: ~20 Hz to 20,000 Hz
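Wavelength and frequency are tied together by the speed of sound v through λ = v/f. For example, taking the speed of sound in air to be roughly 343 m/s (an assumed typical value), a 1,000 Hz tone has a wavelength of about 343/1000 ≈ 0.34 m.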
Sound pressure is the local pressure deviation from the ambient atmospheric
pressure caused by a sound wave.
Compression describes the region of high pressure and rarefaction describes the
region of low pressure.
Compression: a region where particles in a medium are close together, causing a
temporary decrease in volume and a region of high pressure.
Rarefaction: a region where particles in a medium are spread apart, causing a
temporary increase in volume and a region of low pressure.
Human Ear
Sampling:
● Imagine a sound wave as a moving object (like a car driving by).
● Sampling is like taking snapshots of that car at regular intervals of time.
● Instead of having the whole continuous motion, you now have a collection of
still pictures at specific moments.
Sampling takes measurements of the wave at equal time intervals (e.g., 8,000
times per second for telephone sound or 44,100 times per second for CD quality).
The more frequent the snapshots (higher sampling rate), the closer it looks to the
original sound wave.
Analog to Digital Conversion
Sampling Rate: the number of samples taken from a continuous signal per second.
Sample Size: the number of bits per sample, i.e., the level of detail or quality
each sample has. The most common sample sizes are 8-bit and 16-bit.
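Together, these two parameters determine how much data the digitized audio produces: the raw (uncompressed) bit rate is simply the sampling rate multiplied by the sample size. A minimal Python sketch, using the telephone and CD figures discussed below (the function name and the mono, uncompressed assumption are ours):

```python
def data_rate_bits_per_second(sampling_rate_hz: int, bits_per_sample: int) -> int:
    """Raw bit rate of uncompressed mono audio: samples/s x bits/sample."""
    return sampling_rate_hz * bits_per_sample

# Telephone speech: 8,000 samples/s at 8 bits per sample
print(data_rate_bits_per_second(8_000, 8))    # 64,000 bit/s (64 kbit/s)

# CD-quality audio (one channel): 44,100 samples/s at 16 bits per sample
print(data_rate_bits_per_second(44_100, 16))  # 705,600 bit/s (~706 kbit/s)
```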
Signal Sampling
Typical Sampling Rates
● Telephone speech: 8,000 samples per second (8 kHz)
● CD-quality audio: 44,100 samples per second (44.1 kHz)
Sampling a signal at a high rate requires a high-speed ADC, large memory
capacity, and high processing power, which is not always practical.
A continuous-time signal can be represented by its samples, and recovered from
them, when the sampling frequency Fs is greater than or equal to twice the
highest frequency component f_max of the message signal:
Fs ≥ 2⋅f_max
Here, Fs is the sampling rate, and f_max is the highest frequency in the signal.
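As a small illustration of the sampling theorem, the Python sketch below (NumPy assumed available; the 5 Hz tone and the chosen rate are made-up example values) samples a sine wave at a rate comfortably above its Nyquist rate:

```python
import numpy as np

f_max = 5            # highest frequency in the signal (Hz), example value
fs = 4 * f_max       # sampling rate, comfortably above the Nyquist rate 2*f_max

t = np.arange(0, 1.0, 1 / fs)       # one second of sample instants, 1/fs apart
x = np.sin(2 * np.pi * f_max * t)   # the sampled 5 Hz sine wave

print(f"Nyquist rate: {2 * f_max} Hz, chosen Fs: {fs} Hz, samples taken: {len(x)}")
```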
Once we have our samples, it’s time to assign each of them a digital value in a
process called quantization. Since digital systems can't handle continuous values,
quantization involves converting the infinite possibilities of an analog signal to a
finite number of discrete values.
More levels = more accurate sound = better sound quality.
The range of the signal's amplitudes is divided into discrete intervals, called
quantization levels.
Suppose an analog signal has amplitude values ranging from −1 to +1.
If we use 4-bit quantization:
● 2^4 = 16 levels.
● The range −1 to +1 is divided into 16 intervals of size 2/16 = 0.125.
● A sample value of 0.73 falls in the interval [0.625, 0.75) and is mapped to
the level 0.625, the lower boundary of its interval.
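A minimal Python sketch of this mapping (assuming, as the worked example implies, that each sample is mapped to the lower boundary of its quantization interval):

```python
import numpy as np

def quantize(x, bits=4, lo=-1.0, hi=1.0):
    """Map each sample to the lower boundary of its quantization interval."""
    levels = 2 ** bits                  # 4 bits -> 16 levels
    step = (hi - lo) / levels           # interval size: 2/16 = 0.125
    idx = np.floor((x - lo) / step)     # index of the interval the sample falls in
    idx = np.clip(idx, 0, levels - 1)   # keep x == hi inside the top interval
    return lo + idx * step

print(quantize(np.array([0.73])))       # -> [0.625], matching the example above
```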
Quantization
Think of it like assigning each point from our snapshots a seat in a limited room.
The more seats (intervals) available, the closer the representation to the actual
signal. However, more seats mean more digital data, which requires more storage.
● A higher number of quantization levels leads to a more accurate signal
representation, but requires more data.
● Quantization error occurs because each sample must be rounded to one of the
available levels, leading to a rough representation of the sound.
Analog to Digital Conversion: Summary
● Sampling converts the continuous-time signal into discrete samples taken at
regular intervals (the sampling rate).
● Quantization converts each sample's continuous amplitude into one of a finite
number of discrete levels (determined by the sample size in bits).
Representation of Speech
Signals can be examined from two points of view, or domains. These two
domains are the time domain and the frequency domain.
A time-domain graph displays how a signal changes over a span of time, while a
frequency-domain graph displays how much of the signal exists within a given
frequency band.
Frequency domain vs Time domain
Analysis in Time Domain: signal samples taken over time render a representation
of how the signal or data changes. For example, data showing the progression of
amplitude over a specific time period would be "amplitude given time".
With frequency-domain analysis, one can identify the key components of the
entire data set rather than examining every variation that occurs in the time
domain.
A frequency-domain graph shows either the phase shift or the magnitude of a
signal at each frequency it contains, i.e., how much of the signal lies within
each given frequency band.
Fourier Transform
A signal can be described as the sum of many sine waves (a "Fourier series")
that have differing frequencies, phases, and amplitudes.
Switching between the time domain and the frequency domain, and back again, is
accomplished by performing mathematical integration using the "Fourier
Transform" equations.
For a continuous-time signal x(t), the Fourier Transform X(f) is defined as:

X(f) = ∫ x(t) e^(−j2πft) dt   (integral over all t, from −∞ to +∞)

X(f) represents the frequency-domain representation of x(t), expressed in terms
of sines and cosines of different frequencies.

The Inverse Fourier Transform is just the opposite of the Fourier Transform. It
takes the frequency-domain representation X(f) of a given signal as input and
converts it back into the original signal x(t):

x(t) = ∫ X(f) e^(+j2πft) df   (integral over all f, from −∞ to +∞)

● x(t): time-domain signal
● X(f): frequency-domain representation
● e^(−j2πft): complex exponential representing a sinusoidal wave with frequency f
● e ≈ 2.718, π ≈ 3.141
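In practice, sampled signals use the discrete counterpart of these integrals, computed efficiently with the Fast Fourier Transform (FFT). A minimal NumPy sketch (the 50 Hz and 120 Hz test tones and the 1,000 Hz sampling rate are made-up example values) that moves a signal into the frequency domain and back:

```python
import numpy as np

fs = 1000                        # sampling rate (Hz), example value
t = np.arange(0, 1.0, 1 / fs)    # one second of samples
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

X = np.fft.rfft(x)                        # frequency-domain representation
freqs = np.fft.rfftfreq(len(x), 1 / fs)   # frequency (Hz) of each bin

# The two strongest frequency bins sit at the two sine frequencies.
print(sorted(freqs[np.argsort(np.abs(X))[-2:]]))   # -> [50.0, 120.0]

# The inverse transform recovers the original time-domain signal.
x_back = np.fft.irfft(X, n=len(x))
print(np.allclose(x, x_back))                      # -> True
```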
Quantifying Sound
Perceptual Features: These are subjective qualities that describe how humans
perceive sound.
● Loudness: Loudness is the perception of how "loud" or "soft" a sound is.
Larger amplitudes correspond to louder sounds.
● Intensity: Intensity refers to the power or energy of a sound wave per unit area,
commonly expressed in decibels (dB). Intensity is related to loudness: higher
intensity generally means louder sounds.
● Frequency: Frequency refers to the number of cycles of a sound wave per second.
Frequency determines the pitch of the sound. Higher frequencies correspond to
higher pitches.
If a wave completes one cycle in T seconds, the number of cycles per second
(frequency) is:
f = 1/T
For example, a wave that completes one cycle in T = 0.01 s has a frequency of
f = 1/0.01 = 100 Hz.
● Time variations: Time variations refer to how sound changes over time.
Speech Production
Source: This is the initial (raw) sound produced by the vocal cords. Source
generates the acoustic energy.
Filter: The vocal tract acts as a filter that modifies the sound produced by the source.
Source-Filter
In the time domain, the source signal x(m) is convolved with the vocal-tract
filter h(m) to produce the observed speech signal s(m):
s(m) = x(m) ∗ h(m)
Here, ∗ represents the convolution operation.
In general, the convolution of two signals f and g is:
(f ∗ g)(x) = ∫ g(t) f(x − t) dt   (integral over all t, from −∞ to +∞)
● The variable x indicates the point at which the convolution result is being
evaluated.
● The integral computes the weighted sum of the product of g(t) and the
shifted version of f(t) over all possible values of t.
Convolution
Figure: two signals (red and blue) are convolved with each other to produce the
result signal (green). First, one of the signals is inverted in time and then
slid over the other signal. At each time instant, the values at the same time
are multiplied and the sum of the products is calculated. The result is stored
at the position where the first value of the sliding signal currently lies.
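A minimal NumPy sketch of the same procedure (the signal values and the 2-point averaging filter are made-up examples; np.convolve performs exactly the flip, slide, multiply, and sum steps described above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # example "source" signal
h = np.array([0.5, 0.5])             # example filter: a 2-point moving average

# Flip one signal in time, slide it across the other, and at each shift
# multiply the overlapping values and sum the products.
s = np.convolve(x, h)
print(s)   # -> [0.5 1.5 2.5 3.5 2. ]
```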
Speech Sounds
● Allophones are different sounds that are used to pronounce the same
phoneme in a language. An allophone is a variation of a phoneme.
● For example, the /p/ sound in "spin" and "pin" is pronounced differently, but
both are still recognized as the /p/ phoneme; in phonetic notation, [p] and
[pʰ] are allophones of the /p/ phoneme. The /l/ in "clear" and "silly" is
another such pair.