Speech Signal Processing Core

Uploaded by

lucky.pics45
Sound and waveforms

Valerio Velardo
Sound

● Produced by vibration of an object


● Vibrations cause air molecules to oscillate
● Change in air pressure creates a wave
Mechanical wave

● Oscillation that travels through space


● Energy travels from one point to another
● The medium is deformed
Sound wave
Waveform

● Carries multifactorial information:


○ Frequency
○ Intensity
○ Timbre
Periodic and aperiodic sound
Frequency

period
Amplitude

amplitude
Phase

phase
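The three waveform parameters above (frequency, amplitude, phase) can be illustrated with a single sinusoid, y(t) = A · sin(2πft + φ). The following is a minimal sketch, not taken from the slides; the function name `sine_wave` and the `sample_rate` parameter are assumptions made for illustration.

```python
import numpy as np

def sine_wave(frequency, amplitude=1.0, phase=0.0, duration=1.0, sample_rate=44100):
    """Return (t, y) with y(t) = amplitude * sin(2*pi*frequency*t + phase)."""
    # sample_rate is an assumed discretisation step, needed only to plot or play the wave
    t = np.arange(int(duration * sample_rate)) / sample_rate
    y = amplitude * np.sin(2 * np.pi * frequency * t + phase)
    return t, y

t, y = sine_wave(frequency=440.0, amplitude=0.5)
period = 1 / 440.0  # the period is the inverse of the frequency, in seconds
```

Changing `frequency` compresses or stretches the wave in time, `amplitude` scales it vertically, and `phase` shifts it left or right.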
Frequency and amplitude

● Higher frequency -> higher-pitched sound
● Larger amplitude -> louder sound
Hearing range
Pitch

● Logarithmic perception
● Two frequencies are perceived as similar if their ratio is a power of 2 (one or more octaves apart)
MIDI notes

● 440 Hz and 880 Hz: same note, one octave apart
Pitch-frequency chart
Mapping pitch to frequency
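The mapping from pitch to frequency can be sketched in code. This example assumes 12-tone equal temperament with MIDI note 69 tuned to 440 Hz, which is the usual convention but is not spelled out in the extracted slide text. The names `midi_to_hz` and `hz_to_midi` are hypothetical.

```python
import math

A4_MIDI = 69    # assumed MIDI note number for the 440 Hz reference
A4_HZ = 440.0   # reference frequency shown on the slides

def midi_to_hz(note):
    """Frequency of a MIDI note: each semitone multiplies frequency by 2**(1/12)."""
    return A4_HZ * 2 ** ((note - A4_MIDI) / 12)

def hz_to_midi(freq):
    """Inverse mapping; returns a fractional note number for out-of-tune pitches."""
    return A4_MIDI + 12 * math.log2(freq / A4_HZ)
```

Twelve semitones above the reference (`midi_to_hz(81)`) doubles the frequency to 880 Hz, which matches the octave relationship described under Pitch.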
Cents

● Octave divided into 1200 cents


● 100 cents in a semitone
● Noticeable pitch difference: 10-25 cents
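Because an octave (a frequency ratio of 2) spans 1200 cents, the distance in cents between two frequencies follows from a base-2 logarithm. A small sketch, with the hypothetical helper name `cents_between`:

```python
import math

def cents_between(f1, f2):
    """Signed interval from f1 to f2 in cents (1200 cents per octave)."""
    return 1200 * math.log2(f2 / f1)

octave = cents_between(440.0, 880.0)                  # one octave
semitone = cents_between(440.0, 440.0 * 2 ** (1 / 12))  # one equal-tempered semitone
small = cents_between(440.0, 441.0)                   # a 1 Hz offset near 440 Hz
```

The last value comes out just under 4 cents, which is below the 10-25 cent range quoted above as noticeable.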
What’s up next?

● Intensity, power, loudness


● Timbre
Join the community!

thesoundofai.slack.com
Intensity, loudness, and timbre
Valerio Velardo
The power of sound!
Sound power

● Rate at which energy is transferred


● Energy per unit of time emitted by a sound source in all directions
● Measured in watts (W)
Sound intensity

● Sound power per unit area


● Measured in W/m²
1 Watt
= 100 W
Threshold of hearing

● Humans can perceive sounds with very small intensities
Threshold of pain
Intensity level

● Logarithmic scale
● Measured in decibels (dB)
● Ratio between two intensity values
● Uses a reference intensity (threshold of hearing, TOH)
● At the reference intensity the level is 0 dB, since log(1) = 0
● Every ~3 dB, intensity doubles
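The intensity level can be worked through numerically as 10 · log10(I / I_TOH). The sketch below assumes the commonly quoted threshold of hearing of 10⁻¹² W/m² as the reference; that value does not survive in the extracted slide text, so treat it as an assumption. The function name is hypothetical.

```python
import math

I_TOH = 1e-12  # W/m^2, assumed reference intensity (threshold of hearing)

def intensity_level_db(intensity, reference=I_TOH):
    """Intensity level in decibels relative to the reference intensity."""
    return 10 * math.log10(intensity / reference)

at_threshold = intensity_level_db(I_TOH)     # ratio of 1, so log(1) = 0
doubled = intensity_level_db(2 * I_TOH)      # doubling the intensity
```

`doubled` evaluates to about 3.01 dB, which is where the "every ~3 dB, intensity doubles" rule comes from.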
Loudness

● Subjective perception of sound intensity


● Depends on duration / frequency of a sound
● Depends on age
● Measured in phons
Equal loudness contours
Timbre

● Colour of sound
● Difference between two sounds with the same intensity, frequency, and duration
● Described with words like: bright, dark, dull, harsh, warm
What are the features of timbre?

● Timbre is multidimensional
● Sound envelope
● Harmonic content
● Amplitude / frequency modulation
Sound envelope

● Attack-Decay-Sustain-Release Model
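One simple way to picture the Attack-Decay-Sustain-Release model is as a piecewise-linear amplitude curve. This is only an illustrative sketch: the linear segments, the segment lengths, and the `sustain_level` parameter are assumptions, and real instruments have more irregular envelopes.

```python
import numpy as np

def adsr_envelope(n_attack, n_decay, n_sustain, n_release, sustain_level=0.7):
    """Piecewise-linear ADSR envelope; each argument is a length in samples."""
    attack = np.linspace(0.0, 1.0, n_attack, endpoint=False)          # rise to peak
    decay = np.linspace(1.0, sustain_level, n_decay, endpoint=False)  # fall to sustain
    sustain = np.full(n_sustain, sustain_level)                       # hold steady
    release = np.linspace(sustain_level, 0.0, n_release)              # fade to silence
    return np.concatenate([attack, decay, sustain, release])

envelope = adsr_envelope(100, 200, 500, 200)
```

Multiplying a waveform of the same length by `envelope` shapes how the sound starts, holds, and dies away.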
Complex sound

● Superposition of sinusoids
● A partial is a sinusoid used to describe a sound
● The lowest partial is called the fundamental frequency
● A harmonic partial is a partial whose frequency is an integer multiple of the fundamental frequency
● Inharmonicity indicates a deviation from a harmonic partial
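The idea of a complex sound as a superposition of sinusoids can be sketched by summing harmonic partials. The function name `harmonic_tone`, the amplitudes, and the sample rate are illustrative assumptions.

```python
import numpy as np

def harmonic_tone(f0, amplitudes, duration=1.0, sample_rate=44100):
    """Sum of harmonic partials: partial k has frequency k * f0."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    tone = np.zeros_like(t)
    for k, amp in enumerate(amplitudes, start=1):
        tone += amp * np.sin(2 * np.pi * k * f0 * t)  # k = 1 is the fundamental
    return tone

# Fundamental at 220 Hz plus two harmonic partials at 440 Hz and 660 Hz.
tone = harmonic_tone(220.0, [1.0, 0.5, 0.25])
```

Replacing `k * f0` with a slightly detuned frequency for some partials would give an inharmonic sound in the sense defined above.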
Harmonic vs inharmonic instruments
Harmonic content
Frequency modulation

● AKA vibrato
● Periodic variation in frequency
● In music, used for expressive purposes
Amplitude modulation

● AKA tremolo
● Periodic variation in amplitude
● In music, used for expressive purposes
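Both kinds of modulation can be sketched on a plain sine carrier. The carrier frequency, modulation rate, and depths below are arbitrary values chosen for illustration, not figures from the slides.

```python
import numpy as np

sample_rate = 44100
t = np.arange(sample_rate) / sample_rate  # one second of time stamps
carrier_hz, mod_hz = 440.0, 5.0           # assumed carrier and modulation rates

# Tremolo: the amplitude oscillates periodically between 0.7 and 1.3.
tremolo = (1.0 + 0.3 * np.sin(2 * np.pi * mod_hz * t)) * np.sin(2 * np.pi * carrier_hz * t)

# Vibrato: the instantaneous frequency is carrier_hz + depth * cos(2*pi*mod_hz*t).
# The phase is the integral of that frequency, hence the depth / mod_hz factor.
vibrato_depth_hz = 6.0
phase = 2 * np.pi * carrier_hz * t + (vibrato_depth_hz / mod_hz) * np.sin(2 * np.pi * mod_hz * t)
vibrato = np.sin(phase)
```

In the tremolo signal only the loudness wobbles, while in the vibrato signal the amplitude stays constant and the pitch wobbles around 440 Hz.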
Timbre recap

● Multifactorial sound dimension


● Amplitude envelope
● Distribution of energy across partials
● Signal modulation (frequency/amplitude)
Sound recap

● Sound is a wave
● Frequency, intensity, timbre
● Pitch, loudness, timbre
What’s up next?

● Introducing audio signal


● Analog to Digital Conversion (ADC)
● Digital to Analog Conversion (DAC)
Understanding audio signals for ML
Valerio Velardo
Audio signal

● Representation of sound
● Encodes all info we need to reproduce sound
Houston, we have a problem!

Analog vs digital
Analog signal

● Continuous values for time


● Continuous values for amplitude
Digital signal

● Sequence of discrete values


● Data points can only take on a finite number of values
Analog to digital conversion

● Sampling
● Quantization
Pulse-code modulation
Sampling
Sampling period

T
Locating samples
Sampling rate
Why sampling rate = 44100 Hz?
Nyquist frequency
Nyquist frequency for CD
Aliasing
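The sampling quantities named above are linked by a few standard relations: sample n is located at time n · T, the sampling rate is 1 / T, and the Nyquist frequency is half the sampling rate. The formulas themselves did not survive in the extracted text, so the sketch below states them as assumptions, and all function names are hypothetical.

```python
def sampling_period(sample_rate):
    """T = 1 / sr, in seconds."""
    return 1.0 / sample_rate

def sample_time(n, sample_rate):
    """Time of sample n: t_n = n * T."""
    return n / sample_rate

def nyquist_frequency(sample_rate):
    """Highest frequency that can be represented without aliasing."""
    return sample_rate / 2

def alias_frequency(freq, sample_rate):
    """Apparent frequency of a sinusoid at `freq` after sampling at `sample_rate`."""
    folded = freq % sample_rate
    return min(folded, sample_rate - folded)

cd_nyquist = nyquist_frequency(44100)       # Nyquist frequency for CD audio
aliased = alias_frequency(30000, 44100)     # a 30 kHz tone is above Nyquist
```

For CD audio the Nyquist frequency is 22050 Hz, just above the upper end of the human hearing range, and a 30 kHz tone sampled at 44100 Hz would fold back to 14100 Hz.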
Quantization

● Resolution = num. of bits


● Bit depth
● CD resolution = 16 bits
Memory for 1 minute of sound

● Sampling rate = 44100 Hz
● Bit depth = 16 bits
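The memory needed follows from multiplying bit depth, sampling rate, and duration. The sketch below assumes a single (mono) channel and reports mebibytes (2²⁰ bytes); both choices are assumptions, as the worked numbers are missing from the extracted slides.

```python
def audio_memory_bytes(duration_s, sample_rate=44100, bit_depth=16, channels=1):
    """Uncompressed PCM size in bytes (channels=1 is an assumed mono recording)."""
    bits = duration_s * sample_rate * bit_depth * channels
    return bits / 8

size_bytes = audio_memory_bytes(60)   # one minute of audio
size_mib = size_bytes / 2 ** 20       # convert bytes to mebibytes
```

One minute at these settings takes 5,292,000 bytes, or roughly 5.05 MiB per channel.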
Dynamic range

● Difference between the largest and smallest signal a system can record
● Higher resolution -> larger dynamic range
Signal-to-quantization-noise ratio

● Relationship between max signal strength and quantization error
● Correlates with dynamic range
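A common rule of thumb is that each bit of resolution adds about 6.02 dB of signal-to-quantization-noise ratio, i.e. SQNR ≈ 20 · log10(2^Q) for a bit depth Q. The formula is not present in the extracted text, so this sketch treats it as an assumption. It also omits the small constant offset that depends on the signal type.

```python
import math

def sqnr_db(bit_depth):
    """Approximate SQNR in dB for a uniform quantizer with `bit_depth` bits."""
    return 20 * math.log10(2 ** bit_depth)

cd_sqnr = sqnr_db(16)   # CD resolution
```

For 16 bits this gives roughly 96 dB, which is why CD audio is usually quoted with a dynamic range of about 96 dB.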
How do we record sound?

ADC
How do we reproduce sound?

DAC
What’s up next?

● Overview of audio features


How do we extract audio features?
Valerio Velardo
Previously on Audio Processing for ML

● Time-domain features
● Frequency-domain features
● Time-frequency domain features
Time-domain feature pipeline

ADC -> framing

frame 1: sample 1 … 128
frame 2: sample 64 … 192
frame 3: sample 128 … 256
frame 4: sample 192 … 320
...
Frames

● Perceivable audio chunk
● 1 sample @ 44.1 kHz = 0.0227 ms
● Duration of 1 sample << ear's time resolution (10 ms)
● Power of 2 num. samples
● Typical values: 256 - 8192
● Frame duration: 512 / 44100 = 11.6 ms
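The durations quoted for a single sample and for a 512-sample frame can be checked directly, since a frame of K samples lasts K divided by the sampling rate. The helper names are hypothetical.

```python
def sample_duration_ms(sample_rate):
    """Duration of one sample in milliseconds."""
    return 1000.0 / sample_rate

def frame_duration_ms(frame_size, sample_rate):
    """Duration of a frame of `frame_size` samples in milliseconds."""
    return 1000.0 * frame_size / sample_rate

one_sample = sample_duration_ms(44100)       # far below the ear's ~10 ms resolution
one_frame = frame_duration_ms(512, 44100)    # just above it
```

A 512-sample frame at 44.1 kHz lasts about 11.6 ms, long enough to be a perceivable chunk of audio.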
Time-domain feature pipeline

ADC -> framing -> feature computation -> aggregation (mean, median, GMM) -> feature value/vector/matrix
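The time-domain pipeline can be sketched end to end with NumPy. The slides do not name a specific feature at this point, so root-mean-square energy is used here purely as an illustrative choice. The synthetic sine stands in for the ADC output, and all function names are assumptions.

```python
import numpy as np

def frame_signal(signal, frame_size, hop_length):
    """Framing: split the signal into (possibly overlapping) frames."""
    n_frames = 1 + (len(signal) - frame_size) // hop_length
    return np.stack([signal[i * hop_length: i * hop_length + frame_size]
                     for i in range(n_frames)])

def rms_per_frame(frames):
    """Feature computation: one RMS value per frame (illustrative feature)."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
signal = 0.5 * np.sin(2 * np.pi * 441.0 * t)   # stand-in for digitised audio

frames = frame_signal(signal, frame_size=128, hop_length=64)
rms = rms_per_frame(frames)
feature = float(np.mean(rms))                  # aggregation: mean over frames
```

Swapping `np.mean` for `np.median`, or keeping `rms` unaggregated, gives the other output shapes listed in the pipeline (single value or vector).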
Frequency-domain feature pipeline

ADC -> framing
From time to frequency domain
Spectral leakage

● Processed signal isn’t an integer number of periods
● Endpoints are discontinuous
● Discontinuities appear as high-frequency components not present in the original signal
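Spectral leakage can be observed by comparing two frames of a sine wave: one holding a whole number of periods and one cut mid-period. The sketch below measures how much spectral energy falls outside the strongest frequency bin. The helper name and the frame length are illustrative assumptions.

```python
import numpy as np

def energy_outside_peak(signal):
    """Fraction of spectral energy that is not in the strongest frequency bin."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    return 1.0 - power.max() / power.sum()

n = 1024
k = np.arange(n)
whole_periods = np.sin(2 * np.pi * 8.0 * k / n)  # exactly 8 periods in the frame
cut_period = np.sin(2 * np.pi * 8.5 * k / n)     # 8.5 periods: endpoints do not match

leak_whole = energy_outside_peak(whole_periods)
leak_cut = energy_outside_peak(cut_period)
```

With a whole number of periods essentially all the energy sits in one bin, whereas the cut frame spreads a large share of its energy across neighbouring bins.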
Frequency-domain feature pipeline

ADC -> framing -> windowing
Windowing

● Apply windowing function to each frame
● Tapers the samples at both ends of a frame toward zero
● Generates a periodic signal
Hann window
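Applying a Hann window to a frame can be sketched as an element-wise multiplication. The window formula used here, 0.5 · (1 − cos(2πk / (K − 1))), is the symmetric Hann definition that NumPy's `np.hanning` also implements. The formula is missing from the extracted slides, so treat it as an assumption about what they showed.

```python
import numpy as np

frame_size = 512
k = np.arange(frame_size)

# Symmetric Hann window: zero at both ends, one in the middle.
hann = 0.5 * (1 - np.cos(2 * np.pi * k / (frame_size - 1)))

# A frame that holds 8.5 periods, so its raw endpoints are discontinuous.
frame = np.sin(2 * np.pi * 8.5 * k / frame_size)
windowed = frame * hann   # endpoints are now pulled to zero
```

Because the windowed frame starts and ends at zero, repeating it periodically no longer introduces a jump at the boundary, which reduces spectral leakage.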
Houston, we have another problem!

● Windowing tapers the ends of each frame, so non-overlapping frames lose signal at the frame boundaries
Non-overlapping frames
Overlapping frames

frame size K
hop length
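The relationship between frame size K and hop length can be sketched by listing where each frame starts and ends. Indices here are 0-based with an exclusive end, which is a convention chosen for this example, and the function name is hypothetical.

```python
def frame_boundaries(n_samples, frame_size, hop_length):
    """Return (start, end) indices, end exclusive, for every full frame."""
    starts = range(0, n_samples - frame_size + 1, hop_length)
    return [(s, s + frame_size) for s in starts]

# Hop length equal to the frame size: frames do not overlap.
non_overlapping = frame_boundaries(320, frame_size=128, hop_length=128)

# Hop length of half the frame size: consecutive frames share 64 samples.
overlapping = frame_boundaries(320, frame_size=128, hop_length=64)
```

With a hop length of 64 the same 320 samples yield four frames instead of two, and every sample that falls near the edge of one frame sits near the middle of a neighbouring one.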
Frequency-domain feature pipeline

ADC -> framing -> windowing -> FT -> feature computation -> aggregation (mean, median, GMM) -> feature value/vector/matrix
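The frequency-domain pipeline can be sketched in the same spirit. The slides do not fix a particular feature here, so the spectral centroid (the magnitude-weighted mean frequency) is used as an illustrative choice. The frame size, hop length, and function name are likewise assumptions.

```python
import numpy as np

def spectral_centroid_per_frame(signal, sample_rate, frame_size=512, hop_length=256):
    """Framing -> windowing -> FT -> one spectral centroid (Hz) per frame."""
    window = np.hanning(frame_size)
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
    centroids = []
    for start in range(0, len(signal) - frame_size + 1, hop_length):
        frame = signal[start:start + frame_size] * window   # framing + windowing
        magnitude = np.abs(np.fft.rfft(frame))               # Fourier transform
        centroids.append(np.sum(freqs * magnitude) / np.sum(magnitude))
    return np.array(centroids)

sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 1000.0 * t)      # stand-in for digitised audio

centroids = spectral_centroid_per_frame(signal, sample_rate)
feature = float(np.mean(centroids))          # aggregation: mean over frames
```

For a pure 1000 Hz tone the aggregated centroid lands close to 1000 Hz, as expected when nearly all spectral energy sits around a single frequency.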
What’s up next?

● Time-domain features
