Audio Toolbox™
User’s Guide
R2024a
Contents

Active Noise Control Using a Filtered-X LMS FIR Adaptive Filter (1-129)
Pitch Shifting and Time Dilation Using a Phase Vocoder in MATLAB (1-156)
Pitch Shifting and Time Dilation Using a Phase Vocoder in Simulink (1-160)
Remove Interfering Tone From Audio Stream (1-162)
Denoise Speech Using Deep Learning Networks (1-293)
Train Voice Activity Detection in Noise Model Using Deep Learning (1-409)
Train Spoken Digit Recognition Network Using Out-of-Memory Audio Data (1-653)
Train 3-D Sound Event Localization and Detection (SELD) Using Deep Learning (1-767)
Adapt Pretrained Audio Network for New Data Using Deep Network Designer (1-851)
Keyword Spotting in Simulink (1-915)
3-D Speech Enhancement Using Trained Filter and Sum Network (1-948)

3 Measure Impulse Response of an Audio System
Measure and Manage Impulse Responses (3-2)
Configure Audio I/O System (3-2)
Configure IR Acquisition Method (3-3)
Acquire IR Measurements (3-4)
Analyze and Manage IR Measurements (3-5)
Export IR Measurements (3-7)
Generate MATLAB Code (3-8)

7 MIDI Control for Audio Plugins
MIDI Control for Audio Plugins (7-2)
MIDI and Plugins (7-2)
Use MIDI with MATLAB Plugins (7-2)

11 Equalization
Equalization (11-2)
Equalization Design Using Audio Toolbox (11-2)
EQ Filter Design (11-2)
Lowpass and Highpass Filter Design (11-5)
Shelving Filter Design (11-6)

12 Deployment
Desktop Real-Time Audio Acceleration with MATLAB Coder (12-2)

Detect Presence of Speech (15-12)
Use Octave Filter Bank to Create Flanging Chorus Effect for Guitar Layers (Overdubs) (15-50)
Design Auditory Filter Bank (15-66)
1 Audio Toolbox Examples

Transfer Learning with Pretrained Audio Networks
This example shows how to use transfer learning to retrain YAMNet, a pretrained convolutional
neural network, to classify a new set of audio signals. To get started with audio deep learning from
scratch, see “Classify Sound Using Deep Learning”.
Transfer learning is commonly used in deep learning applications. You can take a pretrained network
and use it as a starting point to learn a new task. Fine-tuning a network with transfer learning is
usually much faster and easier than training a network with randomly initialized weights from
scratch. You can quickly transfer learned features to a new task using a smaller number of training
signals.
Audio Toolbox™ additionally provides the classifySound function, which implements necessary
preprocessing for YAMNet and convenient postprocessing to interpret the results. Audio Toolbox also
provides the pretrained VGGish network (vggish) as well as the vggishEmbeddings function,
which implements preprocessing and postprocessing for the VGGish network.
Create Data
Generate 100 white noise signals, 100 brown noise signals, and 100 pink noise signals. Each signal
represents a duration of 0.98 seconds assuming a 16 kHz sample rate.
fs = 16e3;
duration = 0.98;
N = duration*fs;
numSignals = 100;
wNoise = 2*rand([N,numSignals]) - 1;
wLabels = repelem(categorical("white"),numSignals,1);
bNoise = filter(1,[1,-0.999],wNoise);
bNoise = bNoise./max(abs(bNoise),[],"all");
bLabels = repelem(categorical("brown"),numSignals,1);
pNoise = pinknoise([N,numSignals]);
pLabels = repelem(categorical("pink"),numSignals,1);
Split the data into training and test sets. Normally, the training set consists of most of the data.
However, to illustrate the power of transfer learning, you will use only a few samples for training and
the majority for validation.
K = 5; % number of training signals per noise type
trainAudio = [wNoise(:,1:K),bNoise(:,1:K),pNoise(:,1:K)];
trainLabels = [wLabels(1:K);bLabels(1:K);pLabels(1:K)];
validationAudio = [wNoise(:,K+1:end),bNoise(:,K+1:end),pNoise(:,K+1:end)];
validationLabels = [wLabels(K+1:end);bLabels(K+1:end);pLabels(K+1:end)];
Extract Features
Use yamnetPreprocess to extract log-mel spectrograms from both the training set and the
validation set using the same parameters as the YAMNet model was trained on.
trainFeatures = yamnetPreprocess(trainAudio,fs);
validationFeatures = yamnetPreprocess(validationAudio,fs);
Transfer Learning
To load the pretrained network, call yamnet. If the Audio Toolbox model for YAMNet is not installed,
then the function provides a link to the location of the network weights. To download the model, click
the link. Unzip the file to a location on the MATLAB path. The YAMNet model can classify audio into
one of 521 sound categories, including white noise and pink noise (but not brown noise).
net = yamnet;
net.Layers(end).Classes
Chant
Mantra
Child singing
⋮
Prepare the model for transfer learning by first converting the network to a layerGraph (Deep
Learning Toolbox). Use replaceLayer (Deep Learning Toolbox) to replace the fully-connected layer
with an untrained fully-connected layer. Replace the classification layer with a classification layer that
classifies the input as "white", "pink", or "brown". See “List of Deep Learning Layers” (Deep Learning
Toolbox) for deep learning layers supported in MATLAB®.
uniqueLabels = unique(trainLabels);
numLabels = numel(uniqueLabels);
lgraph = layerGraph(net.Layers);
lgraph = replaceLayer(lgraph,"dense",fullyConnectedLayer(numLabels,Name="dense"));
lgraph = replaceLayer(lgraph,"Sound",classificationLayer(Name="Sounds",Classes=uniqueLabels));
options = trainingOptions("adam",ValidationData={single(validationFeatures),validationLabels});
To train the network, use trainNetwork (Deep Learning Toolbox). The network achieves a validation
accuracy of 100% using only 5 signals per noise type.
trainNetwork(single(trainFeatures),trainLabels,lgraph,options);
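If you instead capture the output of trainNetwork, you can check the retrained network on the validation set directly. The following lines are a minimal sketch that is not part of the original example; the variable name noiseNet is arbitrary.
noiseNet = trainNetwork(single(trainFeatures),trainLabels,lgraph,options);
predictedLabels = classify(noiseNet,single(validationFeatures));
validationAccuracy = mean(predictedLabels == validationLabels)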
Effect of Soundproofing on Perceived Noise Levels
In this example, you measure engine noise and use psychoacoustic metrics to model its perceived
loudness, sharpness, fluctuation strength, roughness, and overall annoyance level. You then simulate
the addition of soundproofing material and recompute the overall annoyance level. Finally, you
compare annoyance levels and show the perceptual improvements gained from applying
soundproofing.
Psychoacoustic measurements produce the most accurate results with a calibrated microphone input
level. To use calibrateMicrophone to match your recording level to the reading of an SPL meter,
you can use a 1 kHz tone source (such as an online tone generator or cell phone app) and a calibrated
SPL meter. The SPL of the 1 kHz calibration tone should be loud enough to dominate any background
noise. For a calibration example using MATLAB as the 1 kHz tone source, see
calibrateMicrophone.
Simulate the tone recording and include some background noise. Assume an SPL meter reading of
83.1 dB (C-weighted).
FS = 48e3;
t = (1:2*FS)/FS;
s = rng("default");
testTone = 0.46*sin(2*pi*t*1000).' + .1*pinknoise(2*FS);
rng(s)
splMeterReading = 83.1;
To compute the calibration level of a recording chain, call calibrateMicrophone and specify the
test tone, the sample rate, the SPL reading, and the frequency weighting of the SPL meter. To
compensate for possible background noise and produce a precise calibration level, match the
frequency weighting setting of the SPL meter.
calib = calibrateMicrophone(testTone,FS,splMeterReading,FrequencyWeighting="C-weighting");
Once you have a calibration factor for your recording chain, you can make acoustic measurements.
When using a physical meter, you are limited to the settings selected during measurement time. With
the splMeter object, you can change your settings after the recording has been made. This makes it
easy to experiment with different time and frequency weighting options.
[x,FS] = audioread("Engine-16-44p1-stereo-20sec.wav");
x = x(1:8*FS,1); % use only channel 1 and keep only 8 seconds.
t = (1:size(x,1))/FS;
Create an splMeter object and select C-weighting, fast time weighting, and a 0.2 second interval for
peak SPL measurement.
spl = splMeter(TimeWeighting="Fast", ...
    FrequencyWeighting="C-weighting", ...
    TimeInterval=0.2, ...
    SampleRate=FS);
[LCF,~,LCpeak] = spl(x);
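Because the settings are applied in software, you can also compute alternative readings from the same recording. The following variant (A-weighting with slow time weighting) is illustrative only and is not part of the original example.
splA = splMeter(FrequencyWeighting="A-weighting",TimeWeighting="Slow",SampleRate=FS);
LAS = splA(x);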
Psychoacoustic Metrics
Loudness level
Monitoring SPL is important for occupational safety compliance. However, SPL measurements do not
reflect loudness as perceived by an actual listener. acousticLoudness measures loudness levels as
perceived by a human listener with normal hearing (no hearing impairments). The
acousticLoudness function also shows which frequency bands contribute the most to the
perceptual sensation of loudness.
Using the same calibration level as before, and assuming a free-field recording (the default), plot
stationary loudness.
acousticLoudness(x,FS,calib)
The loudness is 23.8 sones, and much of the noise is below 3.3 on the Bark scale. Convert 3.3 Bark to Hz
using bark2hz.
fprintf("Loudness frequency of 3.3 Bark: %d Hz\n",round(bark2hz(3.3),-1));
The acousticLoudness function returns perceived loudness in sones. To understand the sone
measurement, compare it to an SPL (dB) reading. A signal with a loudness of 60 phons is perceived to
be as loud as a 1 kHz tone measured at 60 dB SPL. Converting 23.8 sones to phons using sone2phon
demonstrates the loudness perception of the engine noise is as loud as a 1 kHz tone measured at 86
dB SPL.
fprintf("Equivalent 1 kHz SPL: %d phons\n",round(sone2phon(23.8)));
Make your own plot with units in phons and frequency in Hz on a log scale.
[sone,spec] = acousticLoudness(x,FS,calib);
barks = 0.1:0.1:24; % Bark scale for ISO 532-1 loudness
hz = bark2hz(barks);
specPhon = sone2phon(spec);
semilogx(hz,specPhon)
title("Specific Loudness")
subtitle(sprintf("Loudness = %.1f phons",sone2phon(sone)))
xlabel("Frequency (Hz)")
ylabel("Loudness (phons/Bark)")
xlim(hz([1,end]))
grid on
You can also plot time-varying loudness and specific loudness to analyze the sound of the engine if it
changes with time. This can be displayed with other relevant time-varying data, such as engine
revolutions per minute (RPMs). In this case, the noise is stationary, but you can observe the impulsive
nature of the noise.
acousticLoudness(x,FS,calib,TimeVarying=true,TimeResolution="high")
[tvsoneHD,tvspecHD,perc] = acousticLoudness(x,FS,calib,TimeVarying=true,TimeResolution="high");
tvspec = tvspecHD(1:4:end,:,:); % for standard resolution measurements
spectimeHD = 0:5e-4:5e-4*(size(tvspecHD,1)-1); % time axis for loudness output
clf; % do not reuse the previous subplot
surf(spectimeHD,hz,sone2phon(tvspecHD).',EdgeColor="interp");
set(gca,View=[0 90],YScale="log",YLim=hz([1,end]));
title("Specific Loudness (HD)")
zlabel("Specific Loudness (phons/Bark)")
ylabel("Frequency (Hz)")
xlabel("Time (seconds)")
colorbar
Sharpness Level
The perceived sharpness of a sound can significantly contribute to its overall annoyance level.
Estimate the perceived sharpness level of an acoustic signal using the acousticSharpness
function.
sharp = acousticSharpness(spec)
sharp = 1.1512
Pink noise has a sharpness of 2 acums. This means the engine noise is biased towards low
frequencies.
acousticSharpness(x,FS,calib,TimeVarying=true);
Fluctuation Strength
In the case of engine noise, low-frequency modulations contribute to the perceived annoyance level.
N = 2^nextpow2(size(x,1));
xa = abs(x); % Use the rectified signal
pspectrum(xa-mean(xa),FS,FrequencyLimits=[0 80],FrequencyResolution=1)
title("Modulation Frequencies")
The modulation frequency peaks at 24.9 Hz. Below 30 Hz, modulation is perceived dominantly as
fluctuation. There is a second peak at 49.7 Hz, which is in the range of roughness.
Use acousticFluctuation to compute the perceived fluctuation strength. The engine noise is
relatively constant in this recording, so we have the algorithm automatically detect the most audible
fluctuation frequency (fMod).
acousticFluctuation(x,FS,calib)
Interpret the results in Hertz instead of Bark. To reduce computations, reuse the previously computed
time-varying specific loudness. Alternatively, you can also specify the modulation frequency that you
are interested in.
[vacil,spec,fMod] = acousticFluctuation(tvspec,ModulationFrequency=24.9);
clf; % do not reuse previous subplot
flucHz = bark2hz(0.5:0.5:23.5);
spectime = 0:2e-3:2e-3*(size(spec,1)-1);
surf(spectime,flucHz,spec.',EdgeColor="interp");
set(gca,View=[0 90],YScale="log",YLim=flucHz([1,end]));
title("Specific Fluctuation Strength")
zlabel("Specific Fluctuation Strength (vacils/Bark)")
ylabel("Frequency (Hz)")
xlabel("Time (seconds)")
colorbar
Roughness
Use the acousticRoughness function to compute the perceived roughness of the signal. Let the
algorithm automatically detect the most audible modulation frequency (fMod).
acousticRoughness(x,FS,calib)
Interpret the results in Hertz instead of Bark. To reduce computations, reuse the previously computed
time-varying specific loudness. Specify the modulation frequency.
[asper,specR,fModR] = acousticRoughness(tvspecHD,ModulationFrequency=49.7);
clf; % do not reuse previous subplot
rougHz = bark2hz(0.5:0.5:23.5);
surf(spectimeHD,rougHz,specR.',EdgeColor="interp");
set(gca,View=[0 90],YScale="log",YLim=rougHz([1,end]));
title("Specific Roughness")
zlabel("Specific Roughness (aspers/Bark)")
ylabel("Frequency (Hz)")
xlabel("Time (seconds)")
colorbar
Tonality
Another perceivable attribute of engine noise is tonality, including the ratio between tonality and
other noises.
Plot the tone-to-noise ratio of the signal, which includes datatips with information such as a
prominence indicator. There is no calibration factor required for tonality metrics.
acousticToneToNoiseRatio(x,FS)
Also, display the time-varying prominence ratio. Click on the plot to get detailed information for that
time and frequency.
acousticProminenceRatio(x,FS,TimeVarying=true)
Sound Quality
For overall sound quality evaluation, combine the previous metrics to produce the psychoacoustic
annoyance metric. The relation is as follows [1]:

PA ∼ N * (1 + g1(S) + g2(F,R))

PA = N5 * (1 + sqrt(wS^2 + wFR^2))

with:

wS = 0.25 * (S - 1.75) * log10(N5 + 10) for S > 1.75 (and wS = 0 otherwise)
wFR = (2.18 / N5^0.4) * (0.4*F + 0.6*R)
In this example, sharpness was less than 1.75, so it is not a contributing factor. Therefore, you can set
ws to zero.
Percentile loudness, N5, is the second value returned by the third output of acousticLoudness
when TimeVarying is set to true.
N5 = perc(2);
Compute the average fluctuation strength ignoring the first second of the signal.
f = mean(vacil(501:end,1));
Compute the average roughness ignoring the first second of the signal.
r = mean(asper(2001:end,1));
pa = N5 * (1 + abs(2.18/(N5^.4)*(.4*f+.6*r)))
pa = 26.3402
For noise with tonal components, such as aircraft or engine noise, S. R. More enhanced this metric to
include tonality [2]. The additional tonality weight is:

wT^2 = (1 - e^(γ4*N5))^2 * (1 - e^(γ5*K5))^2
In this case, no tone was marked as "prominent" in the range specified by the standard (89.1 to
11200 Hz), so you can use the equation for PA (without tonality).
Measure the impact of improved soundproofing on the measured SPL and the perceived noise.
Design a graphicEQ object to simulate the attenuation of the proposed soundproofing. Low
frequencies are harder to attenuate, so we create a model that is best above 200 Hz.
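The graphic EQ design itself is not captured in this text. A minimal sketch of what it could look like is shown below; the per-band gains and the cf vector of octave-band center frequencies are assumptions chosen so that most of the attenuation is above 200 Hz.
geq = graphicEQ(Bandwidth="1 octave",SampleRate=FS);
geq.Gains = [0 0 0 -3 -6 -9 -12 -15 -18 -21]; % assumed attenuation per octave band (dB)
cf = [31.25 62.5 125 250 500 1000 2000 4000 8000 16000]; % octave-band center frequencies (Hz)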
[B,A] = coeffs(geq);
sos = [B;A].';
[H,w] = freqz(sos,2^16,FS);
semilogx(w,db(abs(H)))
title("Frequency Response of Soundproofing Simulation Filter")
ylabel("Relative SPL (dB)")
xlabel("Frequency (Hz)")
xlim(cf([1,end]))
grid on
Filter the engine recording using the graphic EQ to simulate the soundproofing.
x2 = geq(x);
Compare the SPL with and without soundproofing. Reuse the same SPL meter settings, but use the
filtered recording.
reset(spl)
[LCFnew,~,LCpeaknew] = spl(x2);
plot(t,LCpeak,t,LCF,t,LCpeaknew,t,LCFnew)
legend("LCpeak (original)", ...
"LCF (original)", ...
"LCpeak (with soundproofing)", ...
"LCF (with soundproofing)", ...
Location="southeast")
title("SPL Measurement of Engine Noise")
xlabel("Time (seconds)")
ylabel("SPL (dB)")
ylim([70 95])
grid on
acousticLoudness(x2,FS,calib)
Loudness decreased from 23.8 to 16.3 sones. However, it might be easier to interpret loudness in
phons. Convert the sone units to phons using sone2phon.
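For example, using the sone values quoted above (this line is illustrative and not from the original code):
fprintf("Loudness reduced from %.1f phons to %.1f phons\n",sone2phon(23.8),sone2phon(16.3))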
After soundproofing, acousticLoudness shows the perception of the engine noise is approximately
5.5 dB quieter. Human perception of sound is limited at very low frequencies, where most of the
engine noise is. The soundproofing is more effective at higher frequencies.
Calculate the reduction in the psychoacoustic annoyance factor. Start by computing the mean of the
acoustic sharpness.
[~,spec2hd,perc2] = acousticLoudness(x2,FS,calib,TimeVarying=true,TimeResolution="high");
spec2 = spec2hd(1:4:end,:,:);
shp = acousticSharpness(spec2,TimeVarying=true);
new_sharp = mean(shp(501:end))
new_sharp = 1.0796
Sharpness has decreased because the soundproofing is more effective at high frequency attenuation.
It is below the threshold of 1.75, so it is ignored for the annoyance factor.
vacil2 = acousticFluctuation(spec2);
f2 = mean(vacil2(501:end,1));
asper2 = acousticRoughness(spec2hd);
r2 = mean(asper2(2001:end,1));
Compute the new psychoacoustic annoyance factor. It has decreased from 26.3 to 18.1.
pahp = perc2(2)*(1 + abs(2.18/(perc2(2)^.4)*(.4*f2+.6*r2)))
pahp = 18.0626
References
[1] Zwicker, Eberhard, and Hugo Fastl. Psychoacoustics: Facts and Models. Vol. 22. Springer Science
& Business Media, 2013.
[2] More, Shashikant R. Aircraft Noise Characteristics and Metrics. PhD thesis, 2010, pp. 201-204.
Speech Command Recognition Code Generation on Raspberry Pi
This example shows how to deploy feature extraction and a convolutional neural network (CNN) for
speech command recognition to Raspberry Pi™. To generate the feature extraction and network code,
you use MATLAB Coder™, MATLAB® Support Package for Raspberry Pi Hardware, and the ARM®
Compute Library. In this example, the generated code is an executable on your Raspberry Pi, which is
called by a MATLAB script that displays the predicted speech command along with the signal and
auditory spectrogram. Interaction between the MATLAB script and the executable on your Raspberry
Pi is handled using the user datagram protocol (UDP). For details about audio preprocessing and
network training, see "Train Speech Command Recognition Model Using Deep Learning" on page 1-313.
Prerequisites
For supported versions of libraries and for information about setting up environment variables, see
“Prerequisites for Deep Learning with MATLAB Coder” (MATLAB Coder).
Use the same parameters for the feature extraction pipeline and classification as developed in “Train
Speech Command Recognition Model Using Deep Learning” on page 1-313.
Define the same sample rate the network was trained on (16 kHz). Define the classification rate and
the number of audio samples input per frame. The feature input to the network is a Bark spectrogram
that corresponds to 1 second of audio data. The Bark spectrogram is calculated for 25 ms windows
with 10 ms hops. Calculate the number of individual spectrums in each spectrogram.
fs = 16000;
classificationRate = 20;
samplesPerCapture = fs/classificationRate;
segmentDuration = 1;
segmentSamples = round(segmentDuration*fs);
frameDuration = 0.025;
frameSamples = round(frameDuration*fs);
hopDuration = 0.010;
hopSamples = round(hopDuration*fs);
numSpectrumPerSpectrogram = floor((segmentSamples-frameSamples)/hopSamples) + 1;
afe = audioFeatureExtractor( ...
    'SampleRate',fs, ...
    'FFTLength',512, ...
    'Window',hann(frameSamples,'periodic'), ...
    'OverlapLength',frameSamples - hopSamples, ...
    'barkSpectrum',true);
numBands = 50;
setExtractorParameters(afe,'barkSpectrum','NumBands',numBands,'WindowNormalization',false);
numElementsPerSpectrogram = numSpectrumPerSpectrogram*numBands;
load('commandNet.mat')
labels = trainedNet.Layers(end).Classes;
NumLabels = numel(labels);
BackGroundIdx = find(labels == 'background');
probBuffer = single(zeros([NumLabels,classificationRate/2]));
YBuffer = single(NumLabels * ones(1, classificationRate/2));
countThreshold = ceil(classificationRate*0.2);
probThreshold = single(0.7);
Create an audioDeviceReader object to read audio from your device. Create a dsp.AsyncBuffer
object to buffer the audio into chunks.
adr = audioDeviceReader('SampleRate',fs,'SamplesPerFrame',samplesPerCapture,'OutputDataType','single');
audioBuffer = dsp.AsyncBuffer(fs);
Show the time scope and matrix viewer. Detect commands as long as both the time scope and matrix
viewer are open or until the time limit is reached. To stop the live detection before the time limit is
reached, close the time scope window or matrix viewer window.
show(timeScope)
show(matrixViewer)
timeLimit = 10;
tic
while isVisible(timeScope) && isVisible(matrixViewer) && toc < timeLimit
% Capture audio
x = adr();
write(audioBuffer,x);
y = read(audioBuffer,fs,fs-samplesPerCapture);
% Compute auditory features
features = extract(afe,y);
auditoryFeatures = log10(features + 1e-6);
% Perform prediction
probs = predict(trainedNet, auditoryFeatures);
[~, YPredicted] = max(probs);
% Perform statistical post processing
YBuffer = [YBuffer(2:end),YPredicted];
probBuffer = [probBuffer(:,2:end),probs(:)];
[YModeIdx, count] = mode(YBuffer);
maxProb = max(probBuffer(YModeIdx,:));
if YModeIdx == single(BackGroundIdx) || single(count) < countThreshold || maxProb < probThreshold
speechCommandIdx = BackGroundIdx;
else
speechCommandIdx = YModeIdx;
end
% Update plots
matrixViewer(auditoryFeatures');
timeScope(x);
if (speechCommandIdx == BackGroundIdx)
timeScope.Title = ' ';
else
timeScope.Title = char(labels(speechCommandIdx));
end
drawnow limitrate
end
To create a function to perform feature extraction compatible with code generation, call
generateMATLABFunction on the audioFeatureExtractor object. The
generateMATLABFunction object function creates a standalone function that performs equivalent
feature extraction and is compatible with code generation.
generateMATLABFunction(afe,'extractSpeechFeatures')
Replace the hostIPAddress with your machine's address. Your Raspberry Pi sends auditory
spectrograms and the predicted speech command to this IP address.
hostIPAddress = coder.Constant('172.18.230.30');
Create a code generation configuration object to generate an executable program. Specify the target
language as C++.
cfg = coder.config('exe');
cfg.TargetLang = 'C++';
Create a configuration object for deep learning code generation with the ARM compute library that is
on your Raspberry Pi. Specify the architecture of the Raspberry Pi and attach the deep learning
configuration object to the code generation configuration object.
dlcfg = coder.DeepLearningConfig('arm-compute');
dlcfg.ArmArchitecture = 'armv7';
dlcfg.ArmComputeVersion = '20.02.1';
cfg.DeepLearningConfig = dlcfg;
Use the Raspberry Pi Support Package function, raspi, to create a connection to your Raspberry Pi.
In the following code, replace 'raspiname' with the name of your Raspberry Pi, 'pi' with your user name, and 'password' with your password.
r = raspi('raspiname','pi','password');
Create a coder.hardware (MATLAB Coder) object for Raspberry Pi and attach it to the code
generation configuration object.
hw = coder.hardware('Raspberry Pi');
cfg.Hardware = hw;
buildDir = '~/remoteBuildDir';
cfg.Hardware.BuildDir = buildDir;
Use an auto generated C++ main file for the generation of a standalone executable.
cfg.GenerateExampleMain = 'GenerateCodeAndCompile';
Call codegen (MATLAB Coder) to generate C++ code and the executable on your Raspberry Pi. By
default, the Raspberry Pi application name is the same as the MATLAB function.
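The codegen call itself does not appear in this text. A sketch of what it might look like, based on the inputs described above (the exact -args values are an assumption):
codegen -config cfg HelperSpeechCommandRecognitionRasPi -args {rand(samplesPerCapture,1,'single'),hostIPAddress} -report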
------------------------------------------------------------------------
### Generating compilation report ...
Warning: Function 'HelperSpeechCommandRecognitionRasPi' does not terminate due to an infinite loop.
applicationName = 'HelperSpeechCommandRecognitionRasPi';
applicationDirPaths = raspi.utils.getRemoteBuildDirectory('applicationName',applicationName);
targetDirPath = applicationDirPaths{1}.directory;
exeName = strcat(applicationName,'.elf');
command = ['cd ' targetDirPath '; ./' exeName ' &> 1 &'];
system(r,command);
Create a dsp.UDPSender System object to send audio captured in MATLAB to your Raspberry Pi.
Update the targetIPAddress for your Raspberry Pi. Raspberry Pi receives the captured audio from
the same port using the dsp.UDPReceiver system object.
targetIPAddress = '172.18.231.92';
UDPSend = dsp.UDPSender('RemoteIPPort',26000,'RemoteIPAddress',targetIPAddress);
Create a dsp.UDPReceiver system object to receive auditory features and the predicted speech
command index from your Raspberry Pi. Each UDP packet received from the Raspberry Pi consists of
auditory features in column-major order followed by the predicted speech command index. The
maximum message length for the dsp.UDPReceiver object is 65507 bytes. Calculate the buffer size
to accommodate the maximum number of UDP packets.
sizeOfFloatInBytes = 4;
maxUDPMessageLength = floor(65507/sizeOfFloatInBytes);
samplesPerPacket = 1 + numElementsPerSpectrogram;
numPackets = floor(maxUDPMessageLength/samplesPerPacket);
bufferSize = numPackets*samplesPerPacket*sizeOfFloatInBytes;
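The receiving object itself is not captured in this text. A minimal sketch of how it could be created (the local port number is an assumption; the name UDPReceive matches the release call later in the example):
UDPReceive = dsp.UDPReceiver('LocalIPPort',25000, ...
    'MessageDataType','single', ...
    'MaximumMessageLength',samplesPerPacket, ...
    'ReceiveBufferSize',bufferSize);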
Reduce initialization overhead by sending a frame of zeros to the executable running on your
Raspberry Pi.
UDPSend(zeros(samplesPerCapture,1,"single"));
Detect commands as long as both the time scope and matrix viewer are open or until the time limit is
reached. To stop the live detection before the time limit is reached, close the time scope or matrix
viewer window.
show(timeScope)
show(matrixViewer)
timeLimit = 20;
tic
while isVisible(timeScope) && isVisible(matrixViewer) && toc < timeLimit
% Capture audio and send that to RasPi
x = adr();
UDPSend(x);
% Receive the UDP packet (auditory features and predicted index) from the Raspberry Pi
udpRec = UDPReceive();
if ~isempty(udpRec)
% Extract predicted index, the last sample of received UDP packet
speechCommandIdx = udpRec(end);
if speechCommandIdx == BackGroundIdx
timeScope.Title = ' ';
else
timeScope.Title = char(labels(speechCommandIdx));
end
drawnow limitrate
end
end
hide(matrixViewer)
hide(timeScope)
To stop the executable on your Raspberry Pi, use stopExecutable. Release the UDP objects.
stopExecutable(codertarget.raspi.raspberrypi,exeName)
release(UDPSend)
release(UDPReceive)
You can measure the execution time taken on the Raspberry Pi using a processor-in-the-loop (PIL)
workflow of Embedded Coder®. The ProfileSpeechCommandRecognitionRaspi supporting
function is the equivalent of the HelperSpeechCommandRecognitionRaspi function, except that the
former returns the speech command index and auditory spectrogram while the latter sends the same
parameters using UDP. The time taken by the UDP calls is less than 1 ms, which is relatively small
compared to the overall execution time.
cfg = coder.config('lib','ecoder',true);
cfg.VerificationMode = 'PIL';
dlcfg = coder.DeepLearningConfig('arm-compute');
cfg.DeepLearningConfig = dlcfg ;
cfg.DeepLearningConfig.ArmArchitecture = 'armv7';
cfg.DeepLearningConfig.ArmComputeVersion = '19.05';
if (~exist('r','var'))
r = raspi('raspiname','pi','password');
end
hw = coder.hardware('Raspberry Pi');
cfg.Hardware = hw;
buildDir = '~/remoteBuildDir';
cfg.Hardware.BuildDir = buildDir;
cfg.TargetLang = 'C++';
Enable profiling and then generate the PIL code. A MEX file named
ProfileSpeechCommandRecognitionRaspi_pil is generated in your current folder.
cfg.CodeExecutionProfiling = true;
codegen -config cfg ProfileSpeechCommandRecognitionRaspi -args {rand(samplesPerCapture, 1, 'single')}
------------------------------------------------------------------------
### Generating compilation report ...
Code generation successful: View report
Call the generated PIL function multiple times to get the average execution time.
testDur = 50e-3;
numCalls = 100;
for k = 1:numCalls
x = pinknoise(fs*testDur,'single');
[speechCommandIdx,auditoryFeatures] = ProfileSpeechCommandRecognitionRaspi_pil(x);
end
clear ProfileSpeechCommandRecognitionRaspi_pil
### Host application produced the following standard output (stdout) and standard error (stderr)
executionProfile = getCoderExecutionProfile('ProfileSpeechCommandRecognitionRaspi');
report(executionProfile, ...
'Units','Seconds', ...
'ScaleFactor','1e-03', ...
'NumericFormat','%0.4f')
ans =
'W:\Ex\ExampleManager\sporwal.Bdoc22a.j1844576\deeplearning_shared-ex00376115\codegen\lib\Profile
Speech Command Recognition Code Generation with Intel MKL-DNN
This example shows how to deploy feature extraction and a convolutional neural network (CNN) for
speech command recognition on Intel® processors. To generate the feature extraction and network
code, you use MATLAB® Coder™ and the Intel® Math Kernel Library for Deep Neural Networks
(MKL-DNN). In this example, the generated code is a MATLAB executable (MEX) function, which is
called by a MATLAB script that displays the predicted speech command along with the time domain
signal and auditory spectrogram. For details about audio preprocessing and network training, see
“Train Speech Command Recognition Model Using Deep Learning” on page 1-313.
Prerequisites
• The MATLAB Coder Interface for Deep Learning Libraries support package
• Xeon processor with support for Intel Advanced Vector Extensions 2 (Intel AVX2)
• Intel Math Kernel Library for Deep Neural Networks (MKL-DNN)
• Environment variables for Intel MKL-DNN
For supported versions of libraries and for information about setting up environment variables, see
“Prerequisites for Deep Learning with MATLAB Coder” (MATLAB Coder).
Use the same parameters for the feature extraction pipeline and classification as developed in “Train
Speech Command Recognition Model Using Deep Learning” on page 1-313.
Define the same sample rate the network was trained on (16 kHz). Define the classification rate and
the number of audio samples input per frame. The feature input to the network is a Bark spectrogram
that corresponds to 1 second of audio data. The Bark spectrogram is calculated for 25 ms windows
with 10 ms hops.
fs = 16000;
classificationRate = 20;
samplesPerCapture = fs/classificationRate;
segmentDuration = 1;
segmentSamples = round(segmentDuration*fs);
frameDuration = 0.025;
frameSamples = round(frameDuration*fs);
hopDuration = 0.010;
hopSamples = round(hopDuration*fs);
afe = audioFeatureExtractor(SampleRate=fs, ...
    FFTLength=512, ...
    Window=hann(frameSamples,"periodic"), ...
    OverlapLength=frameSamples - hopSamples, ...
    barkSpectrum=true);
numBands = 50;
setExtractorParameters(afe,"barkSpectrum",NumBands=numBands,WindowNormalization=false);
load('SpeechCommandRecognitionNetwork.mat')
numLabels = numel(labels);
backgroundIdx = find(labels == 'background');
probBuffer = single(zeros([numLabels,classificationRate/2]));
YBuffer = single(numLabels * ones(1, classificationRate/2));
countThreshold = ceil(classificationRate*0.2);
probThreshold = single(0.7);
Create an audioDeviceReader object to read audio from your device. Create a dsp.AsyncBuffer
object to buffer the audio into chunks.
adr = audioDeviceReader(SampleRate=fs,SamplesPerFrame=samplesPerCapture,OutputDataType='single');
audioBuffer = dsp.AsyncBuffer(fs);
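The time scope and matrix viewer objects are not created in the text above. A minimal sketch of how they might be set up (all property values here are assumptions):
timeScope = timescope(SampleRate=fs,TimeSpanSource="property",TimeSpan=1,YLimits=[-1 1],Title="Speech Command Recognition");
matrixViewer = dsp.MatrixViewer(ColorBarLabel="Power per band (dB)",XLabel="Frames",YLabel="Bark Bands");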
timeScope.YLabel = "Amplitude";
timeScope.ShowGrid = true;
Show the time scope and matrix viewer. Detect commands as long as both the time scope and matrix
viewer are open or until the time limit is reached. To stop the live detection before the time limit is
reached, close the time scope window or matrix viewer window.
show(timeScope)
show(matrixViewer)
timeLimit = 10;
tic
while isVisible(timeScope) && isVisible(matrixViewer) && toc < timeLimit
%% Capture Audio
x = adr();
write(audioBuffer,x);
y = read(audioBuffer,fs,fs-samplesPerCapture);
% Compute auditory features
features = extract(afe,y);
auditory_features = log10(features + 1e-6);
auditorySpectrum = auditory_features.';
% Perform prediction
score = predict(net,auditory_features);
[~,YPredicted] = max(score);
% Perform statistical post processing
YBuffer = [YBuffer(2:end),YPredicted];
probBuffer = [probBuffer(:,2:end),score(:)];
[YModeIdx,count] = mode(YBuffer);
maxProb = max(probBuffer(YModeIdx,:));
if YModeIdx == single(backgroundIdx) || single(count) < countThreshold || maxProb < probThreshold
speechCommandIdx = backgroundIdx;
else
speechCommandIdx = YModeIdx;
end
% Update plots
matrixViewer(auditorySpectrum);
timeScope(x);
if (speechCommandIdx == backgroundIdx)
timeScope.Title = ' ';
else
timeScope.Title = char(labels(speechCommandIdx));
end
drawnow
end
To create a function to perform feature extraction compatible with code generation, call
generateMATLABFunction on the audioFeatureExtractor object. The
generateMATLABFunction object function creates a standalone function that performs equivalent
feature extraction and is compatible with code generation.
generateMATLABFunction(afe,"extractSpeechFeatures")
So that the network is compatible with code generation, the supporting function uses the
coder.loadDeepLearningNetwork (MATLAB Coder) function to load the network.
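A minimal sketch of the pattern such a supporting function typically uses is shown below. The function name and body are assumptions for illustration, not the shipped helper; only the coder.loadDeepLearningNetwork call and the MAT-file name follow from the text above.
function scores = predictCommand(auditoryFeatures)
% Hypothetical helper sketch: load the network once, then reuse it (codegen compatible).
persistent net
if isempty(net)
    net = coder.loadDeepLearningNetwork("SpeechCommandRecognitionNetwork.mat");
end
scores = predict(net,auditoryFeatures);
end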
show(timeScope)
show(matrixViewer)
timeLimit = 10;
tic
while isVisible(timeScope) && isVisible(matrixViewer) && toc < timeLimit
x = adr();
[speechCommandIdx,auditorySpectrum] = HelperSpeechCommandRecognition(x);
matrixViewer(auditorySpectrum);
timeScope(x);
if (speechCommandIdx == backgroundIdx)
timeScope.Title = ' ';
else
timeScope.Title = char(labels(speechCommandIdx));
end
drawnow
end
hide(timeScope)
hide(matrixViewer)
Create a code generation configuration object for generation of a MEX function. Specify the target
language as C++.
cfg = coder.config('mex');
cfg.TargetLang = 'C++';
Create a configuration object for deep learning code generation with the MKL-DNN library. Attach
the configuration object to the code generation configuration object.
dlcfg = coder.DeepLearningConfig('mkldnn');
cfg.DeepLearningConfig = dlcfg;
Call codegen (MATLAB Coder) to generate C++ code for the HelperSpeechCommandRecognition
function. Specify the configuration object and prototype arguments. A MEX file named
HelperSpeechCommandRecognition_mex is generated to your current folder.
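The codegen call itself does not appear in this text. A sketch of what it might look like (the -args value is an assumption based on the frame size used earlier):
codegen -config cfg HelperSpeechCommandRecognition -args {rand(samplesPerCapture,1,'single')} -report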
------------------------------------------------------------------------
### Generating compilation report ...
Code generation successful: View report
Show the time scope and matrix viewer. Detect commands using the generated MEX for as long as
both the time scope and matrix viewer are open or until the time limit is reached. To stop the live
detection before the time limit is reached, close the time scope window or matrix viewer window.
show(timeScope)
show(matrixViewer)
timeLimit = 20;
tic
while isVisible(timeScope) && isVisible(matrixViewer) && toc < timeLimit
x = adr();
[speechCommandIdx,auditorySpectrum] = HelperSpeechCommandRecognition_mex(x);
matrixViewer(auditorySpectrum);
timeScope(x);
if (speechCommandIdx == backgroundIdx)
timeScope.Title = ' ';
else
timeScope.Title = char(labels(speechCommandIdx));
end
drawnow
end
hide(matrixViewer)
hide(timeScope)
Use tic and toc to compare the execution time to run the simulation completely in MATLAB with
the execution time of the MEX function.
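The MATLAB-only timing loop is not captured in this text. A minimal sketch is shown below; numLoops, the test frame x, and the call to the original HelperSpeechCommandRecognition function are assumptions.
numLoops = 100;                             % assumed number of timed iterations
x = pinknoise(samplesPerCapture,'single');  % assumed 50 ms test frame
tic
for k = 1:numLoops
    [speechCommandIdx,auditory_features] = HelperSpeechCommandRecognition(x);
end
exeTime = toc;
fprintf('MATLAB execution time per 50 ms of audio = %0.4f ms\n',(exeTime/numLoops)*1000);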
tic
for k = 1:numLoops
[speechCommandIdx, auditory_features] = HelperSpeechCommandRecognition_mex(x);
end
exeTimeMex = toc;
fprintf('MEX execution time per 50 ms of audio = %0.4f ms\n',(exeTimeMex/numLoops)*1000);
Evaluate the performance gained from using the MEX function. This performance test is performed
on a machine using NVIDIA Titan V GPU and Intel(R) Xeon(R) W-2133 CPU running at 3.60 GHz.
PerformanceGain = exeTime/exeTimeMex
PerformanceGain = 6.7643
Speech Command Recognition in Simulink
This example shows a Simulink® model that detects the presence of speech commands in audio. The
model uses a pretrained convolutional neural network to recognize a given set of commands.
• "yes"
• "no"
• "up"
• "down"
• "left"
• "right"
• "on"
• "off"
• "stop"
• "go"
The model uses a pretrained convolutional deep learning network. Refer to the example “Train
Speech Command Recognition Model Using Deep Learning” on page 1-313 for details on the
architecture of this network and how you train it.
model = "cmdrecog";
open_system(model)
The model breaks the audio stream into one-second overlapping segments. A bark spectrogram is
computed from each segment. The spectrograms are fed to the pretrained network.
Use the manual switch to select either a live stream from your microphone or read commands stored
in audio files. For commands on file, use the rotary switch to select one of three commands (Go, Yes,
or Stop).
The deep learning network was trained on auditory spectrograms computed using an
audioFeatureExtractor. The Auditory Spectrogram block in the model has been configured to
extract the same features as the network was trained on.
set_param(model,StopTime="20");
open_system(model + "/Time Scope")
sim(model);
The recognized command is printed in the display block. The network activations, which give a level
of confidence in the different supported commands, are displayed in a time scope.
close_system(model,0)
See Also
Auditory Spectrogram
Related Examples
• “Train Speech Command Recognition Model Using Deep Learning” on page 1-313
• “Speech Command Recognition Using Deep Learning” on page 1-905
• “Speech Command Recognition Code Generation with Intel MKL-DNN Using Simulink” on page 1-857
Time-Frequency Masking for Harmonic-Percussive Source Separation
The goal of harmonic-percussive source separation (HPSS) is to decompose an audio signal into
harmonic and percussive components. Applications of HPSS include audio remixing, improving the
quality of chroma features, tempo estimation, and time-scale modification [1 on page 1-66]. Another
use of HPSS is as a parallel representation when creating a late fusion deep learning system. Many of
the top performing systems of the Detection and Classification of Acoustic Scenes and Events
(DCASE) 2017 and 2018 challenges used HPSS for this reason.
This example walks through the algorithm described in [1 on page 1-66] to apply time-frequency
masking to the task of harmonic-percussive source separation.
For an example of deriving time-frequency masks using deep learning, see “Cocktail Party Source
Separation Using Deep Learning Networks” on page 1-349.
Read in harmonic and percussive audio files. Both have a sample rate of 16 kHz.
[harmonicAudio,fs] = audioread("violin.wav");
percussiveAudio = audioread("drums.wav");
Listen to the harmonic signal and plot the spectrogram. Note that there is continuity along the
horizontal (time) axis.
sound(harmonicAudio,fs)
spectrogram(harmonicAudio,1024,512,1024,fs,"yaxis")
title("Harmonic Audio")
Listen to the percussive signal and plot the spectrogram. Note that there is continuity along the
vertical (frequency) axis.
sound(percussiveAudio,fs)
spectrogram(percussiveAudio,1024,512,1024,fs,"yaxis")
title("Percussive Audio")
Mix the harmonic and percussive signals. Listen to the harmonic-percussive audio and plot the
spectrogram.
mix = harmonicAudio + percussiveAudio;
sound(mix,fs)
spectrogram(mix,1024,512,1024,fs,"yaxis")
title("Harmonic-Percussive Audio")
HPSS Using Binary Mask
The HPSS proposed by [1 on page 1-66] creates two enhanced spectrograms: a harmonic-enhanced
spectrogram and a percussive-enhanced spectrogram. The harmonic-enhanced spectrogram is
created by applying median filtering along the time axis. The percussive-enhanced spectrogram is
created by applying median filtering along the frequency axis. The enhanced spectrograms are then
compared to create harmonic and percussive time-frequency masks. In the simplest form, the masks
are binary.
Convert the mixed signal to a half-sided magnitude short-time Fourier transform (STFT).
win = sqrt(hann(1024,"periodic"));
overlapLength = floor(numel(win)/2);
fftLength = 2^nextpow2(numel(win) + 1);
y = stft(mix,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided");
ymag = abs(y);
Apply median smoothing along the time axis to enhance the harmonic audio and diminish the
percussive audio. Use a filter length of 200 ms, as suggested by [1 on page 1-66]. Plot the power
spectrum of the harmonic-enhanced audio.
timeFilterLength = 0.2;
timeFilterLengthInSamples = timeFilterLength/((numel(win) - overlapLength)/fs);
ymagharm = movmedian(ymag,timeFilterLengthInSamples,2);
surf(log10(ymagharm.^2),EdgeColor="none")
title("Harmonic Enhanced Audio")
view([0,90])
axis tight
Apply median smoothing along the frequency axis to enhance the percussive audio and diminish the
harmonic audio. Use a filter length of 500 Hz, as suggested by [1 on page 1-66]. Plot the power
spectrum of the percussive-enhanced audio.
frequencyFilterLength = 500;
frequencyFilterLengthInSamples = frequencyFilterLength/(fs/fftLength);
ymagperc = movmedian(ymag,frequencyFilterLengthInSamples,1);
surf(log10(ymagperc.^2),EdgeColor="none")
title("Percussive Enhanced Audio")
view([0,90])
axis tight
To create a binary mask, first sum the percussive- and harmonic-enhanced spectrums to determine
the total magnitude per bin.
totalMagnitudePerBin = ymagharm + ymagperc;
If the magnitude in a given harmonic-enhanced or percussive-enhanced bin accounts for more than
half of the total magnitude of that bin, then assign that bin to the corresponding mask.
harmonicMask = ymagharm > (totalMagnitudePerBin*0.5);
percussiveMask = ymagperc > (totalMagnitudePerBin*0.5);
Apply the harmonic and percussive masks and then return the masked audio to the time domain.
yharm = harmonicMask.*y;
yperc = percussiveMask.*y;
Perform the inverse short-time Fourier transform to return the signals to the time domain.
h = istft(yharm,Window=win,OverlapLength=overlapLength, ...
FFTLength=fftLength,ConjugateSymmetric=true,FrequencyRange="onesided");
p = istft(yperc,Window=win,OverlapLength=overlapLength, ...
FFTLength=fftLength,ConjugateSymmetric=true,FrequencyRange="onesided");
Listen to the recovered harmonic audio and plot the spectrogram.
sound(h,fs)
spectrogram(h,1024,512,1024,fs,"yaxis")
title("Recovered Harmonic Audio")
sound(p,fs)
spectrogram(p,1024,512,1024,fs,"yaxis")
title("Recovered Percussive Audio")
sound(h + p,fs)
spectrogram(h + p,1024,512,1024,fs,"yaxis")
title("Recovered Harmonic + Percussive Audio")
As noted in [1 on page 1-66], cleanly decomposing a signal into only harmonic and percussive sounds is
often impossible. The authors propose adding a thresholding parameter: if a spectrogram bin is not
clearly harmonic or percussive, categorize it as residual.
Perform the same steps described in HPSS Using Binary Mask on page 1-48 to create harmonic-
enhanced and percussive-enhanced spectrograms.
win = sqrt(hann(1024,"periodic"));
overlapLength = floor(numel(win)/2);
fftLength = 2^nextpow2(numel(win) + 1);
y = stft(mix,Window=win,OverlapLength=overlapLength, ...
FFTLength=fftLength,FrequencyRange="onesided");
ymag = abs(y);
timeFilterLength = 0.2;
timeFilterLengthInSamples = timeFilterLength/((numel(win) - overlapLength)/fs);
ymagharm = movmedian(ymag,timeFilterLengthInSamples,2);
frequencyFilterLength = 500;
frequencyFilterLengthInSamples = frequencyFilterLength/(fs/fftLength);
ymagperc = movmedian(ymag,frequencyFilterLengthInSamples,1);
Using a threshold, create three binary masks: harmonic, percussive, and residual. Set the threshold
to 0.65. This means that if the magnitude of a bin of the harmonic-enhanced spectrogram is 65% of
the total magnitude for that bin, you assign that bin to the harmonic portion. If the magnitude of a bin
of the percussive-enhanced spectrogram is 65% of the total magnitude for that bin, you assign that
bin to the percussive portion. Otherwise, the bin is assigned to the residual portion. The optimal
thresholding parameter depends on the harmonic-percussive mix and your application.
threshold = 0.65;
harmonicMask = ymagharm > (totalMagnitudePerBin*threshold);
percussiveMask = ymagperc > (totalMagnitudePerBin*threshold);
residualMask = ~(harmonicMask+percussiveMask);
Perform the same steps described in HPSS Using Binary Mask on page 1-48 to return the masked
signals to the time domain.
yharm = harmonicMask.*y;
yperc = percussiveMask.*y;
yresi = residualMask.*y;
h = istft(yharm,Window=win,OverlapLength=overlapLength, ...
FFTLength=fftLength,ConjugateSymmetric=true,FrequencyRange="onesided");
p = istft(yperc,Window=win,OverlapLength=overlapLength, ...
FFTLength=fftLength,ConjugateSymmetric=true,FrequencyRange="onesided");
r = istft(yresi,Window=win,OverlapLength=overlapLength, ...
FFTLength=fftLength,ConjugateSymmetric=true,FrequencyRange="onesided");
sound(h,fs)
spectrogram(h,1024,512,1024,fs,"yaxis")
title("Recovered Harmonic Audio")
sound(p,fs)
spectrogram(p,1024,512,1024,fs,"yaxis")
title("Recovered Percussive Audio")
sound(r,fs)
spectrogram(r,1024,512,1024,fs,"yaxis")
title("Recovered Residual Audio")
Listen to the combination of the harmonic, percussive, and residual signals and plot the spectrogram.
sound(h + p + r,fs)
spectrogram(h + p + r,1024,512,1024,fs,"yaxis")
title("Recovered Harmonic + Percussive + Residual Audio")
For time-frequency masking, masks are generally either binary or soft. Soft masking separates the
energy of the mixed bins into harmonic and percussive portions depending on the relative weights of
their enhanced spectrograms.
Perform the same steps described in HPSS Using Binary Mask on page 1-48 to create harmonic-
enhanced and percussive-enhanced spectrograms.
win = sqrt(hann(1024,"periodic"));
overlapLength = floor(numel(win)/2);
fftLength = 2^nextpow2(numel(win) + 1);
y = stft(mix,Window=win,OverlapLength=overlapLength, ...
FFTLength=fftLength,FrequencyRange="onesided");
ymag = abs(y);
timeFilterLength = 0.2;
timeFilterLengthInSamples = timeFilterLength/((numel(win)-overlapLength)/fs);
ymagharm = movmedian(ymag,timeFilterLengthInSamples,2);
frequencyFilterLength = 500;
frequencyFilterLengthInSamples = frequencyFilterLength/(fs/fftLength);
ymagperc = movmedian(ymag,frequencyFilterLengthInSamples,1);
Create soft masks that separate the bin energy to the harmonic and percussive portions relative to
the weights of their enhanced spectrograms.
harmonicMask = ymagharm./totalMagnitudePerBin;
percussiveMask = ymagperc./totalMagnitudePerBin;
Perform the same steps described in HPSS Using Binary Mask on page 1-48 to return the masked
signals to the time domain.
yharm = harmonicMask.*y;
yperc = percussiveMask.*y;
h = istft(yharm,Window=win,OverlapLength=overlapLength, ...
FFTLength=fftLength,ConjugateSymmetric=true,FrequencyRange="onesided");
p = istft(yperc,Window=win,OverlapLength=overlapLength, ...
FFTLength=fftLength,ConjugateSymmetric=true,FrequencyRange="onesided");
spectrogram(h,1024,512,1024,fs,"yaxis")
title("Recovered Harmonic Audio")
spectrogram(p,1024,512,1024,fs,"yaxis")
title("Recovered Percussive Audio")
Example Function
The example function, HelperHPSS, provides the harmonic-percussive source separation capabilities
described in this example. You can use it to quickly explore how parameters affect the algorithm
performance.
help HelperHPSS
[h,p] = HelperHPSS(...,'FrequencyFilterLength',FREQUENCYFILTERLENGTH)
specifies the median filter length along the frequency dimension of a
spectrogram, in Hz. If unspecified, FREQUENCYFILTERLENGTH defaults to 500
Hz.
Example:
% Load a sound file and listen to it.
[audio,fs] = audioread('Laughter-16-8-mono-4secs.wav');
sound(audio,fs)
[1 on page 1-66] observed that a large frame size in the STFT calculation moves the energy towards
the harmonic component, while a small frame size moves the energy towards the percussive
component. [1 on page 1-66] proposed using an iterative procedure to take advantage of this
insight. In the iterative procedure:
1 Perform HPSS using a large frame size to isolate the harmonic component.
2 Sum the residual and percussive portions.
3 Perform HPSS using a small frame size to isolate the percussive component.
threshold1 = ;
N1 = ;
[h1,p1,r1] = HelperHPSS(mix,fs,Threshold=threshold1,Window=sqrt(hann(N1,"periodic")),OverlapLengt
mix1 = p1 + r1;
threshold2 = ;
N2 = ;
[h2,p2,r2] = HelperHPSS(mix1,fs,Threshold=threshold2,Window=sqrt(hann(N2,"periodic")),OverlapLeng
h = h1;
p = p2;
r = h2 + r2;
sound(h,fs)
spectrogram(h,1024,512,1024,fs,"yaxis")
title("Recovered Harmonic Audio")
sound(p,fs)
spectrogram(p,1024,512,1024,fs,"yaxis")
title("Recovered Percussive Audio")
sound(r,fs)
spectrogram(r,1024,512,1024,fs,"yaxis")
title("Recovered Residual Audio")
Listen to the combination of the harmonic, percussive, and residual signals and plot the spectrogram.
sound(h + p + r,fs)
spectrogram(h+p+r,1024,512,1024,fs,"yaxis")
title("Recovered Harmonic + Percussive + Residual Audio")
[2 on page 1-66] proposes that time scale modification (TSM) can be improved by first separating a
signal into harmonic and percussive portions and then applying a TSM algorithm optimal for the
portion. After TSM, the signal is reconstituted by summing the stretched audio.
To listen to a stretched audio without HPSS, apply time-scale modification using the default
stretchAudio function. By default, stretchAudio uses the phase vocoder algorithm.
alpha = ;
mixStretched = stretchAudio(mix,alpha);
sound(mixStretched,fs)
Separate the harmonic-percussive mix into harmonic and percussive portions using HelperHPSS. As
proposed in [2 on page 1-66], use the default vocoder algorithm to stretch the harmonic portion and
the WSOLA algorithm to stretch the percussive portion. Sum the stretched portions and listen to the
results.
[h,p] = HelperHPSS(mix,fs);
hStretched = stretchAudio(h,alpha);
pStretched = stretchAudio(p,alpha,Method="wsola");
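The final summing and playback step is not captured above. A minimal sketch follows; trimming both signals to a common length is an assumption to guard against small length differences after stretching.
L = min(numel(hStretched),numel(pStretched));
sound(hStretched(1:L) + pStretched(1:L),fs)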
References
[1] Driedger, J., M. Muller, and S. Disch. "Extending harmonic-percussive separation of audio signals."
Proceedings of the International Society for Music Information Retrieval Conference. Vol. 15, 2014.
[2] Driedger, J., M. Muller, and S. Ewert. "Improving Time-Scale Modification of Music Signals Using
Harmonic-Percussive Separation." IEEE Signal Processing Letters. Vol. 21. Issue 1. pp. 105-109,
2014.
See Also
Related Examples
• “Cocktail Party Source Separation Using Deep Learning Networks” on page 1-349
Binaural Audio Rendering Using Head Tracking
Track head orientation by fusing data received from an IMU, and then control the direction of arrival
of a sound source by applying head-related transfer functions (HRTF).
In a typical virtual reality setup, the IMU sensor is attached to the user's headphones or VR headset
so that the perceived position of a sound source is relative to a visual cue independent of head
movements. For example, if the sound is perceived as coming from the monitor, it remains that way
even if the user turns his head to the side.
Required Hardware
• Arduino Uno
• Invensense MPU-9250
Hardware Connection
First, connect the Invensense MPU-9250 to the Arduino board. For more details, see “Estimating
Orientation Using Inertial Sensor Fusion and MPU-9250” (Sensor Fusion and Tracking Toolbox).
a = arduino;
imu = mpu9250(a);
Fs = imu.SampleRate;
imufilt = imufilter('SampleRate',Fs);
When sound travels from a point in space to your ears, you can localize it based on interaural time
and level differences (ITD and ILD). These frequency-dependent ITDs and ILDs can be measured and
represented as a pair of impulse responses for any given source elevation and azimuth. The ARI
HRTF Dataset contains 1550 pairs of impulse responses which span azimuths over 360 degrees and
elevations from -30 to 80 degrees. You use these impulse responses to filter a sound source so that it
is perceived as coming from a position determined by the sensor's orientation. If the sensor is
attached to a device on a user's head, the sound is perceived as coming from one fixed place despite
head movements.
ARIDataset = load('ReferenceHRTF.mat');
Then, get the relevant HRTF data from the dataset and put it in a useful format for our processing.
hrtfData = double(ARIDataset.hrtfData);
hrtfData = permute(hrtfData,[2,3,1]);
Get the associated source positions. Angles should be in the same range as the sensor. Convert the
azimuths from [0,360] to [-180,180].
sourcePosition = ARIDataset.sourcePosition(:,[1,2]);
sourcePosition(:,1) = sourcePosition(:,1) - 180;
Load an ambisonic recording of a helicopter. Keep only the first channel, which corresponds to an
omnidirectional recording. Resample it to 48 kHz for compatibility with the HRTF data set.
[heli,originalSampleRate] = audioread('Heli_16ch_ACN_SN3D.wav');
heli = 12*heli(:,1); % keep only one channel
sampleRate = 48e3;
heli = resample(heli,sampleRate,originalSampleRate);
Load the audio data into a SignalSource object. Set the SamplesPerFrame to 0.1 seconds.
Create an audioDeviceWriter with the same sample rate as the audio signal.
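The SignalSource construction itself is not reproduced in this excerpt. A minimal sketch of that step, assuming dsp.SignalSource with a 0.1 second frame at the 48 kHz rate set above and cyclic playback:
sigsrc = dsp.SignalSource('Signal',heli, ...
    'SamplesPerFrame',sampleRate/10, ...      % 0.1 second frames
    'SignalEndAction','Cyclic repetition');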
deviceWriter = audioDeviceWriter('SampleRate',sampleRate);
FIR = cell(1,2);
FIR{1} = dsp.FIRFilter('NumeratorSource','Input port');
FIR{2} = dsp.FIRFilter('NumeratorSource','Input port');
Create an object to perform real-time visualization for the orientation of the IMU sensor. Call the IMU
filter once and display the initial orientation.
orientationScope = HelperOrientationViewer;
data = read(imu);
qimu = imufilt(data.Acceleration,data.AngularVelocity);
orientationScope(qimu);
Execute the processing loop for 30 seconds. This loop performs the following steps:
imuOverruns = 0;
audioUnderruns = 0;
audioFiltered = zeros(sigsrc.SamplesPerFrame,2);
tic
while toc < 30
% Convert the orientation from a quaternion representation to pitch and yaw in Euler angles.
ypr = eulerd(qimu,'zyx','frame');
yaw = ypr(end,1);
pitch = ypr(end,2);
desiredPosition = [yaw,pitch];
% Apply HRTFs
audioFiltered(:,1) = FIR{1}(audioIn, interpolatedIR(1,:)); % Left
audioFiltered(:,2) = FIR{2}(audioIn, interpolatedIR(2,:)); % Right
audioUnderruns = audioUnderruns + deviceWriter(squeeze(audioFiltered));
end
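Several steps executed inside this loop are not shown in this excerpt: reading an audio frame from the signal source, reading and fusing the IMU samples, and interpolating the HRTF pair for the desired position. A hedged sketch of those steps, reusing the objects created earlier and the Audio Toolbox interpolateHRTF function:
audioIn = sigsrc();                                        % read one 0.1 s audio frame
data = read(imu);                                          % read the latest IMU samples
qimu = imufilt(data.Acceleration,data.AngularVelocity);    % fuse to a quaternion orientation
orientationScope(qimu);
% After desiredPosition is computed from yaw and pitch, pick the matching HRTF pair:
interpolatedIR = interpolateHRTF(hrtfData,sourcePosition,desiredPosition);
interpolatedIR = squeeze(permute(interpolatedIR,[2,3,1])); % 2-by-N left/right impulse responses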
Cleanup
release(sigsrc)
release(deviceWriter)
clear imu a
Speech Emotion Recognition
This example illustrates a simple speech emotion recognition (SER) system using a BiLSTM network.
You begin by downloading the data set and then testing the trained network on individual files. The
network was trained on a small German-language database [1] on page 1-84.
The example walks you through training the network, which includes downloading, augmenting, and
training the data set. Finally, you perform leave-one-speaker-out (LOSO) 10-fold cross validation to
evaluate the network architecture.
The features used in this example were chosen using sequential feature selection, similar to the
method described in “Sequential Feature Selection for Audio Features” on page 1-522.
Download the Berlin Database of Emotional Speech [1] on page 1-84. The database contains 535
utterances spoken by 10 actors intended to convey one of the following emotions: anger, boredom,
disgust, anxiety/fear, happiness, sadness, or neutral. The emotions are text independent.
dataFolder = tempdir;
dataset = fullfile(dataFolder,"Emo-DB");
if ~datasetExists(dataset)
url = "http://emodb.bilderbar.info/download/download.zip";
disp("Downloading Emo-DB (40.5 MB) ...")
unzip(url,dataset)
end
The file names are codes indicating the speaker ID, text spoken, emotion, and version. The website
contains a key for interpreting the code and additional information about the speakers such as
gender and age. Create a table with the variables Speaker and Emotion. Decode the file names into
the table.
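The audioDatastore construction does not appear in this excerpt; a minimal sketch, assuming the extracted Emo-DB recordings sit in a wav subfolder of the download location:
ads = audioDatastore(fullfile(dataset,"wav"));   % assumed folder layout of the extracted archive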
filepaths = ads.Files;
emotionCodes = cellfun(@(x)x(end-5),filepaths,UniformOutput=false);
emotions = replace(emotionCodes,["W","L","E","A","F","T","N"], ...
["Anger","Boredom","Disgust","Anxiety/Fear","Happiness","Sadness","Neutral"]);
speakerCodes = cellfun(@(x)x(end-10:end-9),filepaths,UniformOutput=false);
labelTable = cell2table([speakerCodes,emotions],VariableNames=["Speaker","Emotion"]);
labelTable.Emotion = categorical(labelTable.Emotion);
labelTable.Speaker = categorical(labelTable.Speaker);
summary(labelTable)
Variables:
    Speaker: 535×1 categorical
        Values:
            03       49
            08       58
            09       43
            10       38
            11       55
            12       35
            13       61
            14       69
            15       56
            16       71
    Emotion: 535×1 categorical
        Values:
            Anger           127
            Anxiety/Fear     69
            Boredom          81
            Disgust          46
            Happiness        71
            Neutral          79
            Sadness          62
labelTable is in the same order as the files in audioDatastore. Set the Labels property of the
audioDatastore to the labelTable.
ads.Labels = labelTable;
Download and load the pretrained network, the audioFeatureExtractor object used to train the
network, and normalization factors for the features. This network was trained using all speakers in
the data set except speaker 03.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio/examples","serbilstm.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
load(fullfile(dataFolder,"network_Audio_SER.mat"));
The sample rate set on the audioFeatureExtractor corresponds to the sample rate of the data
set.
fs = afe.SampleRate;
Select a speaker and emotion, then subset the datastore to only include the chosen speaker and
emotion. Read from the datastore and listen to the file.
speaker = categorical("03");   % example selections; the live control values are omitted in this excerpt
emotion = categorical("Anger");
adsSubset = subset(ads,ads.Labels.Speaker==speaker & ads.Labels.Emotion==emotion);
audio = read(adsSubset);
sound(audio,fs)
Use the audioFeatureExtractor object to extract the features. Normalize the features and then
convert them to 20-element sequences with 10-element overlap, which corresponds to approximately
600 ms windows with 300 ms overlap. Use the supporting function, HelperFeatureVector2Sequence
on page 1-81, to convert the array of feature vectors to sequences.
features = extract(afe,audio);
featuresNormalized = (features - normalizers.Mean)./normalizers.StandardDeviation;
numOverlap = 10; % 10-element overlap, as described above
featureSequences = HelperFeatureVector2Sequence(featuresNormalized,20,numOverlap);
Feed the feature sequences into the network for prediction. Compute the mean prediction and plot
the probability distribution of the chosen emotions as a pie chart. You can try different speakers,
emotions, sequence overlap, and prediction average to test the network's performance. To get a
realistic approximation of the network's performance, use speaker 03, which the network was not
trained on.
YPred = double(minibatchpredict(net,featureSequences));
average = "mode"; % example selection; the live control value is omitted in this excerpt
switch average
case "mean"
probs = mean(YPred,1);
case "median"
probs = median(YPred,1);
case "mode"
probs = mode(YPred,1);
end
pie(probs./sum(probs))
legend(string(classes),Location="eastoutside");
The remainder of the example illustrates how the network was trained and validated.
Train Network
The 10-fold cross validation accuracy of a first attempt at training was about 60% because of
insufficient training data. A model trained on the insufficient data overfits some folds and underfits
others. To improve overall fit, increase the size of the data set using audioDataAugmenter. 50
augmentations per file was chosen empirically as a good tradeoff between processing time and
accuracy improvement. You can decrease the number of augmentations to speed up the example.
Create an audioDataAugmenter object. Set the probability of applying pitch shifting to 0.5 and use
the default range. Set the probability of applying time shifting to 1 and use a range of [-0.3,0.3]
seconds. Set the probability of adding noise to 1 and specify the SNR range as [-20,40] dB.
numAugmentations = 50; % chosen empirically; decrease to speed up the example
augmenter = audioDataAugmenter(NumAugmentations=numAugmentations, ...
TimeStretchProbability=0, ...
VolumeControlProbability=0, ...
...
PitchShiftProbability=0.5, ...
...
TimeShiftProbability=1, ...
TimeShiftRange=[-0.3,0.3], ...
...
AddNoiseProbability=1, ...
SNRRange=[-20,40]);
Create a new folder in your current folder to hold the augmented data set.
currentDir = pwd;
writeDirectory = fullfile(currentDir,"augmentedData");
mkdir(writeDirectory)
For each file in the original datastore:
1 Create 50 augmentations.
2 Normalize the audio to have a max absolute value of 1.
3 Write the augmented audio data as a WAV file. Append _augK to each of the file names, where K
is the augmentation number. To speed up processing, use parfor and partition the datastore.
This method of augmenting the database is time consuming and space consuming. However, when
iterating on choosing a network architecture or feature extraction pipeline, this upfront cost is
generally advantageous.
N = numel(ads.Files)*numAugmentations;
reset(ads)
numPartitions = 18;
tic
parfor ii = 1:numPartitions
adsPart = partition(ads,numPartitions,ii);
while hasdata(adsPart)
[x,adsInfo] = read(adsPart);
data = augment(augmenter,x,fs);
[~,fn] = fileparts(adsInfo.FileName);
for i = 1:size(data,1)
augmentedAudio = data.Audio{i};
augmentedAudio = augmentedAudio/max(abs(augmentedAudio),[],"all");
augNum = num2str(i);
if isscalar(augNum)
iString = ['0',augNum];
else
iString = augNum;
end
audiowrite(fullfile(writeDirectory,sprintf('%s_aug%s.wav',fn,iString)),augmentedAudio,fs);
end
end
end
Create an audio datastore that points to the augmented data set. Replicate the rows of the label table
of the original datastore NumAugmentations times to determine the labels of the augmented
datastore.
adsAug = audioDatastore(writeDirectory);
adsAug.Labels = repelem(ads.Labels,augmenter.NumAugmentations,1);
win = hamming(round(0.03*fs),"periodic");
overlapLength = 0;
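The audioFeatureExtractor construction does not appear in this excerpt. A hedged sketch that reuses the window and overlap defined above; the specific features enabled here are illustrative only, since the example states the features were chosen by sequential feature selection:
afe = audioFeatureExtractor(SampleRate=fs,Window=win,OverlapLength=overlapLength, ...
    gtcc=true,gtccDelta=true,mfccDelta=true, ...                 % assumed feature set
    SpectralDescriptorInput="melSpectrum",spectralCrest=true);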
When you train for deployment, use all available speakers in the data set. Set the training datastore
to the augmented datastore.
adsAug = subset(adsAug,adsAug.Labels.Speaker~=categorical("03"))
adsAug =
audioDatastore with properties:
Files: {
' ...\deeplearning_shared-ex37272868\augmentedData\08a01Ab_aug01.wa
' ...\deeplearning_shared-ex37272868\augmentedData\08a01Ab_aug02.wa
' ...\deeplearning_shared-ex37272868\augmentedData\08a01Ab_aug03.wa
... and 24297 more
}
Folders: {
' ...\bhemmat.Bdoc24a.j2470636\deeplearning_shared-ex37272868\augme
}
Labels: 24300-by-2 table
AlternateFileSystemRoots: {}
OutputDataType: 'double'
OutputEnvironment: 'cpu'
SupportedOutputFormats: ["wav" "flac" "ogg" "opus" "mp3" "mp4" "m4a"]
DefaultOutputFormat: "wav"
adsTrain = adsAug;
Extract the training features and reorient the features so that time is along rows to be compatible
with sequenceInputLayer (Deep Learning Toolbox). If you have Parallel Computing Toolbox™, run
the feature extraction in parallel.
featuresTrain = extract(afe,adsTrain,UseParallel=canUseParallelPool);
Use the training set to determine the mean and standard deviation of each feature.
allFeatures = cat(1,featuresTrain{:});
[S,M] = std(allFeatures,0,1,"omitnan");
featuresTrain = cellfun(@(x)(x-M)./S,featuresTrain,UniformOutput=false);
Buffer the feature vectors into sequences so that each sequence consists of 20 feature vectors with
overlaps of 10 feature vectors.
featureVectorsPerSequence = 20;
featureVectorOverlap = 10;
[sequencesTrain,sequencePerFileTrain] = HelperFeatureVector2Sequence(featuresTrain,featureVectorsPerSequence,featureVectorOverlap);
Replicate the labels of the training and validation sets so that they are in one-to-one correspondence
with the sequences. Not all speakers have utterances for all emotions. Create an empty
categorical array that contains all the emotional categories and append it to the validation labels
so that the categorical array contains all emotions.
labelsTrain = repelem(adsTrain.Labels.Emotion,[sequencePerFileTrain{:}]);
emptyEmotions = ads.Labels.Emotion;
emptyEmotions(:) = [];
Define a BiLSTM network using bilstmLayer (Deep Learning Toolbox). Place a dropoutLayer
(Deep Learning Toolbox) before and after the bilstmLayer to help prevent overfitting.
dropoutProb1 = 0.3;
numUnits = 200;
dropoutProb2 = 0.6;
layers = [ ...
sequenceInputLayer(afe.FeatureVectorLength)
dropoutLayer(dropoutProb1)
bilstmLayer(numUnits,OutputMode="last")
dropoutLayer(dropoutProb2)
fullyConnectedLayer(numel(categories(emptyEmotions)))
softmaxLayer];
miniBatchSize = 512;
initialLearnRate = 0.005;
learnRateDropPeriod = 2;
maxEpochs = 3;
options = trainingOptions("adam", ...
MiniBatchSize=miniBatchSize, ...
InitialLearnRate=initialLearnRate, ...
LearnRateDropPeriod=learnRateDropPeriod, ...
LearnRateSchedule="piecewise", ...
MaxEpochs=maxEpochs, ...
Shuffle="every-epoch", ...
Verbose=false, ...
Plots="Training-Progress", ...
Metrics="accuracy", ...
InputDataFormats="TCB");
net = trainnet(sequencesTrain,labelsTrain,layers,"crossentropy",options);
saveSERSystem = false; % example value; set to true to overwrite the downloaded network
if saveSERSystem
normalizers.Mean = M;
normalizers.StandardDeviation = S;
classes = unique(labelsTrain);
save("network_Audio_SER.mat","net","afe","normalizers","classes")
end
To provide an accurate assessment of the model you created in this example, train and validate using
leave-one-speaker-out (LOSO) k-fold cross validation. In this method, you train using k − 1 speakers
and then validate on the left-out speaker. You repeat this procedure k times. The final validation
accuracy is the average of the k folds.
Create a variable that contains the speaker IDs. Determine the number of folds: 1 for each speaker.
The database contains utterances from 10 unique speakers. Use summary to display the speaker IDs
(left column) and the number of utterances they contribute to the database (right column).
speaker = ads.Labels.Speaker;
numFolds = numel(unique(speaker));
summary(speaker)
03 49
08 58
09 43
10 38
11 55
12 35
13 61
14 69
15 56
16 71
The helper function HelperTrainAndValidateNetwork on page 1-82 performs the steps outlined
above for all 10 folds and returns the true and predicted labels for each fold. Call
HelperTrainAndValidateNetwork with the audioDatastore, the augmented audioDatastore,
and the audioFeatureExtractor.
[labelsTrue,labelsPred] = HelperTrainAndValidateNetwork(ads,adsAug,afe);
Print the accuracy per fold and plot the 10-fold confusion chart.
for ii = 1:numel(labelsTrue)
foldAcc = mean(labelsTrue{ii}==labelsPred{ii})*100;
disp("Fold " + ii + ", Accuracy = " + round(foldAcc,2))
end
labelsTrueMat = cat(1,labelsTrue{:});
labelsPredMat = cat(1,labelsPred{:});
figure
cm = confusionchart(labelsTrueMat,labelsPredMat, ...
Title=["Confusion Matrix for 10-Fold Cross-Validation","Average Accuracy = " round(mean(label
ColumnSummary="column-normalized",RowSummary="row-normalized");
sortClasses(cm,categories(emptyEmotions))
Supporting Functions
function [sequences,sequencePerFile] = HelperFeatureVector2Sequence(features,featureVectorsPerSequence,featureVectorOverlap)
if ~iscell(features)
features = {features};
end
hopLength = featureVectorsPerSequence - featureVectorOverlap;
idx1 = 1;
sequences = {};
sequencePerFile = cell(numel(features),1);
for ii = 1:numel(features)
sequencePerFile{ii} = floor((size(features{ii},1) - featureVectorsPerSequence)/hopLength) + 1;
idx2 = 1;
for j = 1:sequencePerFile{ii}
sequences{idx1,1} = features{ii}(idx2:idx2 + featureVectorsPerSequence - 1,:); %#ok<AGROW>
idx1 = idx1 + 1;
idx2 = idx2 + hopLength;
end
end
end
for i = 1:numFolds
end
end
References
[1] Burkhardt, F., A. Paeschke, M. Rolfes, W.F. Sendlmeier, and B. Weiss, "A Database of German
Emotional Speech." In Proceedings Interspeech 2005. Lisbon, Portugal: International Speech
Communication Association, 2005.
End-to-End Deep Speaker Separation
This example showcases an end-to-end deep learning network for speaker-independent speech
separation.
Introduction
Speaker separation is a challenging and critical speech processing task. A number of speaker
separation methods based on deep learning have been proposed recently, most of which rely on time-
frequency transformations of the time-domain audio mixture (See “Cocktail Party Source Separation
Using Deep Learning Networks” on page 1-349 for an implementation of such a deep learning
system). Such time-frequency approaches have two main drawbacks:
• The conversion of the time-frequency representations back to the time domain requires phase
estimation, which introduces errors and leads to imperfect reconstruction.
• Relatively long windows are required to yield high resolution frequency representations, which
leads to high computational complexity and unacceptable latency for real-time scenarios.
In this example, you train a deep learning speech separation network based on the Conv-TasNet
architecture [1] on page 1-91. The Conv-TasNet model acts directly on the audio signal and bypasses
the issues arising from time-frequency transformations.
For a comparison of the performance of the different models, see “Compare Speaker Separation
Models” on page 1-1033.
To train the network with the entire data set and achieve the highest possible accuracy, set
speedupExample to false. To run this example more quickly, set speedupExample to true.
speedupExample = true; % example value; the live control setting is omitted in this excerpt
The network is based on [1] on page 1-91 and consists of three stages: Encoding, mask estimation or
separation, and decoding.
• The encoder transforms the time-domain input mixture signals into an intermediate
representation using convolutional layers.
• The mask estimator computes one mask per speaker. The intermediate representation of each
speaker is obtained by multiplying the encoder's output by its respective mask. The mask
estimator consists of 32 blocks of convolutional and normalization layers with skip
connections between blocks.
• The decoder transforms the intermediate representations to time-domain separated speech
signals using transposed convolutional layers.
To calculate loss, use the supporting functions uPIT on page 1-93 for utterance-level permutation-
invariant training (uPIT) and SISNR on page 1-94 to calculate scale-invariant signal-to-noise ratio
(SI-SNR) [1] on page 1-91. SI-SNR encourages the network to learn how to separate signals
regardless of their initial relative energy. Without scale invariance, the network would learn how to
recover the more dominant energy signal at the cost of the less dominant. uPIT resolves the problem
that there is no a priori way to associate the predictions with the targets by minimizing the loss of the
best permutation between predictions and targets.
Use a subset of the LibriSpeech data set [2] on page 1-91 to train the network. The LibriSpeech data
set is a large corpus of read English speech sampled at 16 kHz. The data is derived from audiobooks
read from the LibriVox project.
Download the LibriSpeech data set. If speedupExample is true, download the approximately 322
MB dev-clean set. If speedupExample is false, download the approximately 28 GB train-
clean-360 set.
downloadDatasetFolder = tempdir;
if speedupExample
filename = "dev-clean.tar.gz";
datasetFolder = fullfile(downloadDatasetFolder,"LibriSpeech","dev-clean");
else
filename = "train-clean-360.tar.gz";
datasetFolder = fullfile(downloadDatasetFolder,"LibriSpeech","train-clean-360");
end
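The download commands themselves do not appear in this excerpt; a hedged sketch, assuming the archive is fetched from the OpenSLR mirror that hosts LibriSpeech:
url = "http://www.openslr.org/resources/12/" + filename;      % assumed mirror URL
if ~datasetExists(datasetFolder)
    tarFile = fullfile(downloadDatasetFolder,filename);
    websave(tarFile,url);                  % download the .tar.gz archive
    untar(tarFile,downloadDatasetFolder);  % untar extracts gzipped tar files directly
end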
The LibriSpeech data set consists of many audio files, each containing a single speaker. It does not
contain mixture signals in which two or more people speak simultaneously.
You will process the original data set to create a new data set that is suitable for training the speech
separation network.
The steps for creating the training data set are encapsulated in createTrainingDataset on page
1-91. The function creates mixture signals comprised of utterances of two random speakers. The
function returns three audio datastores:
• mixDatastore points to mixture files (where two speakers are talking simultaneously).
• speaker1Datastore points to files containing the isolated speech of the first speaker in the
mixture.
• speaker2Datastore points to files containing the isolated speech of the second speaker in the
mixture.
Define the mini-batch size and the maximum training signal length (in number of samples).
miniBatchSize = 4;
duration = 5*8000;
[mixDatastore,speaker1Datastore,speaker2Datastore] = createTrainingDataset(datasetFolder,downloadDatasetFolder);
Combine the datastores. This ensures that the files stay in the correct order when you shuffle them at
the start of each new epoch in the training loop.
ds = combine(mixDatastore,speaker1Datastore,speaker2Datastore);
Train on a GPU if one is available. Using a GPU requires Parallel Computing Toolbox™.
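The definition of executionEnvironment, which is referenced later when moving the validation mix to the GPU, does not appear in this excerpt; a minimal placeholder whose value is an assumption:
executionEnvironment = "auto";   % example value; "gpu" forces GPU use, "cpu" avoids it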
if speedupExample
numEpochs = 1;
else
numEpochs = 10;
end
Specify the options for Adam optimization. Set the initial learning rate to 1e-3. Use a gradient decay
factor of 0.9 and a squared gradient decay factor of 0.999.
learnRate = 1e-3;
averageGrad = [];
averageSqGrad = [];
gradDecay = 0.9;
sqGradDecay = 0.999;
Create a validation mixture from a prerecorded multiple-speaker audio file, using the first two channels as the two source signals.
multipleSpeakersSignal = audioread("MultipleSpeakers-16-8-4channel-5secs.flac");
s1 = multipleSpeakersSignal(:,1);
s2 = multipleSpeakersSignal(:,2);
mix = s1 + s2;
mix = mix/max(abs(mix));
mix = dlarray(mix,"SCB");
if (executionEnvironment == "auto" && canUseGPU) || executionEnvironment == "gpu"
mix = gpuArray(mix);
end
numIterPerValidation = 100;
valSNR = [];
bestSNR = -Inf;
Define a variable to hold the epoch in which the best validation score occurred.
bestEpoch = 1;
Initialize Network
Initialize the network parameters. learnables is a structure containing the learnable parameters
from the network layers. states is a structure containing the states from the normalization layers.
[learnables,states] = initializeNetworkParams;
Train Network
Execute the training loop. This can take many hours to run.
The validation SI-SNR is computed periodically. If the SI-SNR is the best value so far, the network
parameters are saved to params.mat.
iteration = 0;
updateInfo(monitor,LearnRate=learnRate,Epoch=jj)
while hasdata(mqueue)
[z1,z2] = separateSpeakersConvTasNet(mix,learnables,states,false);
l = uPIT(z1,s1,z2,s2);
valSNR(end+1) = l; %#ok
recordMetrics(monitor,iteration,ValidationLoss=-l);
if l > bestSNR
bestSNR = l;
bestEpoch = jj;
filename = "params.mat";
save(filename,"learnables","states");
end
end
iteration = iteration + 1;
% Evaluate the model gradients and states using dlfeval and the modelLoss function.
[loss,gradients,states] = dlfeval(@modelLoss,mixBatch,x1Batch,x2Batch,learnables,states,m
recordMetrics(monitor,iteration,TrainingLoss=loss);
if monitor.Stop
return
end
end
% Reduce the learning rate if the validation accuracy did not improve
% during the epoch
if bestEpoch ~= jj
learnRate = learnRate/2;
end
if monitor.Stop
return
end
end
References
[1] Yi Luo, Nima Mesgarani, "Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for
Speech Separation," 2019 IEEE/ACM transactions on audio, speech, and language processing, vol.
29, issue 8, pp. 1256-1266.
[2] V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: An ASR corpus based on public
domain audio books," 2015 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), Brisbane, QLD, 2015, pp. 5206-5210, doi: 10.1109/ICASSP.2015.7178964
Supporting Functions
newDatasetPath = fullfile(downloadDatasetFolder,"speech-sep-dataset");
% You will create mixture signals comprised of utterances of two random speakers.
% Randomize the IDs of all the speakers.
names = unique(ads.Labels);
names = names(randperm(length(names)));
% In this example, you create training signals based on 400 speakers. You
% generate mixture signals based on combining utterances from 200 pairs of
% speakers.
% Create audio datastores pointing to the files corresponding to the individual speakers and the
mixDatastore = audioDatastore(fullfile(newDatasetPath,"mix"));
speaker1Datastore = audioDatastore(fullfile(newDatasetPath,"sp1"));
speaker2Datastore = audioDatastore(fullfile(newDatasetPath,"sp2"));
end
duration = varargin{1};
x1 = data{1};
x2 = data{2};
% SNR [-5 5] dB
s = snr(x1,x2);
[~,s1] = fileparts(writeInfo.ReadInfo{1}.FileName);
[~,s2] = fileparts(writeInfo.ReadInfo{2}.FileName);
name = sprintf("%s-%s.wav",s1,s2);
audiowrite(sprintf("%s",fullfile(writeInfo.Location,"sp1",name)),x1,8000);
audiowrite(sprintf("%s",fullfile(writeInfo.Location,"sp2",name)),x2,8000);
audiowrite(sprintf("%s",fullfile(writeInfo.Location,"mix",name)),mix,8000);
end
Model Loss
[y1,y2,states] = separateSpeakersConvTasNet(mix,learnables,states,true);
m = uPIT(x1,y1,x2,y2);
l = sum(m);
loss = -l./miniBatchSize;
gradients = dlgradient(loss,learnables);
end
function m = uPIT(x1,y1,x2,y2)
%uPIT Compute utterance-level permutation invariant training
v1 = SISNR(y1,x1);
v2 = SISNR(y2,x2);
m1 = mean([v1;v2]);
v1 = SISNR(y2,x1);
v2 = SISNR(y1,x2);
m2 = mean([v1;v2]);
m = max(m1,m2);
end
function z = SISNR(x,y)
%SISNR Compute SI-SNR
x = x - mean(x);
y = y - mean(y);
t = sum(x.*y).*y./(sum(y.^2)+eps);
z = 20*log((sqrt(sum(t.^2))+eps)./sqrt((sum((x-t).^2))+eps))/log(10);
end
learnables.Conv1W = initializeGlorot(20,1,256);
learnables.Conv1B = dlarray(zeros(256,1,"single"));
learnables.ln_weight = dlarray(ones(1,256,"single"));
learnables.ln_bias = dlarray(zeros(1,256,"single"));
learnables.Conv2W = initializeGlorot(1,256,256);
learnables.Conv2B = dlarray(zeros(256,1,"single"));
blk.Conv1B = dlarray(zeros(512,1,"single"));
blk.Prelu1 = dlarray(single(0.25));
blk.BN1Offset = dlarray(zeros(512,1,"single"));
blk.BN1Scale = dlarray(ones(512,1,"single"));
blk.Conv2B = dlarray(zeros(512,1,"single"));
blk.Prelu2 = dlarray(single(0.25));
blk.BN2Offset = dlarray(zeros(512,1,"single"));
blk.BN2Scale= dlarray(ones(512,1,"single"));
blk.Conv3B = dlarray(ones(256,1,"single"));
s.BN1Mean = dlarray(zeros(512,1,"single"));
s.BN1Var = dlarray(ones(512,1,"single"));
s.BN2Mean = dlarray(zeros(512,1,"single"));
s.BN2Var = dlarray(ones(512,1,"single"));
learnables.Conv3W = initializeGlorot(1,256,512);
learnables.Conv3B = dlarray(zeros(512,1,"single"));
learnables.TransConv1W = initializeGlorot(20,1,256);
learnables.TransConv1B = dlarray(zeros(1,1,"single"));
end
Glorot Initialization
function weights = initializeGlorot(filterSize,numChannels,numFilters)
% initializeGlorot - Perform Glorot initialization
sz = [filterSize,numChannels,numFilters];
numOut = filterSize*numFilters;
numIn = filterSize*numChannels;
Z = 2*rand(sz,"single") - 1;
bound = sqrt(6 / (numIn + numOut));
weights = dlarray(bound*Z);
end
if ~isdlarray(input)
input = dlarray(input,"SCB");
end
x = dlconv(input,learnables.Conv1W,learnables.Conv1B,Stride=10);
x = relu(x);
x0 = x;
x = x-mean(x, 2);
x = x./sqrt(mean(x.^2,2) + 1e-5);
x = x.*learnables.ln_weight + learnables.ln_bias;
encoderOut = dlconv(x,learnables.Conv2W,learnables.Conv2B);
masks = dlconv(encoderOut,learnables.Conv3W,learnables.Conv3B);
masks = relu(masks);
mask1 = masks(:,1:256,:);
mask2 = masks(:,257:512,:);
out1 = x0.*mask1;
out2 = x0.*mask2;
weights = learnables.TransConv1W;
bias = learnables.TransConv1B;
output2 = dltranspconv(out1,weights,bias,Stride=10);
output1 = dltranspconv(out2,weights,bias,Stride=10);
if ~training
output1 = gather(extractdata(output1));
output2 = gather(extractdata(output2));
output1 = output1./max(abs(output1));
output2 = output2./max(abs(output2));
end
end
% Conv:
conv1Out = dlconv(input,learnables.Conv1W,learnables.Conv1B);
% PRelu:
conv1Out = relu(conv1Out) - learnables.Prelu1.*relu(-conv1Out);
% BatchNormalization:
offset = learnables.BN1Offset;
scale = learnables.BN1Scale;
datasetMean = state.BN1Mean;
datasetVariance = state.BN1Var;
if training
[batchOut,dsmean,dsvar] = batchnorm(conv1Out,offset,scale,datasetMean,datasetVariance);
state.BN1Mean = dsmean;
state.BN1Var = dsvar;
else
batchOut = batchnorm(conv1Out,offset,scale,datasetMean,datasetVariance);
end
% Conv:
padding = [1 1] * 2^(mod(count,8));
dilationFactor = 2^(mod(count,8));
convOut = dlconv(batchOut,learnables.Conv2W,learnables.Conv2B,DilationFactor=dilationFactor,Padding=padding);
% PRelu:
convOut = relu(convOut) - learnables.Prelu2.*relu(-convOut);
% BatchNormalization:
if training
[batchOut,dsmean,dsvar] = batchnorm(convOut,learnables.BN2Offset,learnables.BN2Scale,state.BN2Mean,state.BN2Var);
state.BN2Mean = dsmean;
state.BN2Var = dsvar;
else
batchOut = batchnorm(convOut,learnables.BN2Offset,learnables.BN2Scale,state.BN2Mean,state.BN2Var);
end
% Conv:
output = dlconv(batchOut,learnables.Conv3W,learnables.Conv3B);
% Skip connection
output = output + input;
end
See Also
Related Examples
• “Cocktail Party Source Separation Using Deep Learning Networks” on page 1-349
Delay-Based Pitch Shifter
This example shows an audio plugin designed to shift the pitch of a sound in real time.
Algorithm
The algorithm is based on cross-fading between two channels with time-varying delays and gains.
This method takes advantage of the pitch-shift Doppler effect that occurs as a signal's delay is
increased or decreased.
The figure below illustrates the variation of channel delays and gains for an upward pitch shift
scenario: The delay of channel 1 decreases at a fixed rate from its maximum value (in this example,
30 ms). Since the gain of channel 2 is initially equal to zero, it does not contribute to the output. As
the delay of channel 1 approaches zero, the delay of channel 2 starts decreasing down from 30 ms. In
this cross-fading region, the gains of the two channels are adjusted to preserve the output power
level. Channel 1 is completely faded out by the time its delay reaches zero. The process is then
repeated, going back and forth between the two channels.
For a downward pitch effect, the delays are increased from zero to the maximum value.
The desired output pitch may be controlled by varying the rate of change of the channel delays.
Cross-fading reduces the audible glitches that occur during the transition between channels.
However, if cross-fading happens over too long a time, the repetitions present in the overlap area may
create spurious modulation and comb-filtering effects.
In addition to the output audio signal, the object returns two extra outputs, corresponding to the
delays and gains of the two channels, respectively.
You can open a test bench for audiopluginexample.PitchShifter by using Audio Test Bench.
The test bench provides a user interface (UI) to help you test your audio plugin in MATLAB. You can
tune the plugin parameters as the test bench is executing. You can also open a dsp.TimeScope and
a spectrumAnalyzer to view and compare the input and output signals in the time and frequency
domains, respectively.
You can also use audiopluginexample.PitchShifter in MATLAB just as you would use any other
MATLAB object. You can use the configureMIDI command to enable tuning the object via a MIDI
device. This is particularly useful if the object is part of a streaming MATLAB simulation where the
command window is not free.
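As a rough illustration of that workflow, here is a hedged streaming sketch. The property names PitchShift and Overlap are assumptions based on the parameter names mentioned below, and the audio file is simply one that ships with Audio Toolbox:
shifter = audiopluginexample.PitchShifter;
setSampleRate(shifter,44100);
shifter.PitchShift = 5;        % assumed property: shift in semitones
shifter.Overlap = 0.3;         % assumed property: cross-fade overlap
reader = dsp.AudioFileReader('RockGuitar-16-44p1-stereo-72secs.wav');
writer = audioDeviceWriter('SampleRate',reader.SampleRate);
while ~isDone(reader)
    in = reader();
    [out,delays,gains] = process(shifter,in);   % extra outputs described above
    writer(out);
end
release(reader); release(writer)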
runPitchShift is a simple function that may be used to perform pitch shifting as part of a larger
MATLAB simulation. The function instantiates an audiopluginexample.PitchShifter plugin,
and uses the setSampleRate method to set its sampling rate to the input argument Fs. The plugin's
parameters are tuned by setting their values to the input arguments pitch and overlap, respectively.
Note that it is also possible to generate a MEX-file from this function using the codegen command.
Performance is improved in this mode without compromising the ability to tune parameters.
MATLAB Simulation
MATLAB Coder can be used to generate C code for HelperPitchShifterSim. In order to generate
a MEX-file for your platform, execute HelperPitchShifterCodeGeneration from a folder with
write permissions.
Call audioPitchShifterExampleApp with true as the input argument to use the MEX-file for simulation.
Again, the simulation runs until you explicitly stop it from the UI.
References
[1] Bogdanowicz, K., and R. Belcher. "Using Multiple Processors for Real-Time Audio Effects." AES,
May 1989.
Psychoacoustic Bass Enhancement for Band-Limited Signals
This example shows an audio plugin designed to enhance the perceived sound level in the lower part
of the audible spectrum.
Introduction
Small loudspeakers typically have a poor low frequency response, which can have a negative impact
on overall sound quality. This example implements psychoacoustic bass enhancement to improve
sound quality of audio played on small loudspeakers.
The example is based on the algorithm in [1 on page 1-104]. A non-linear device shifts the low-
frequency range of the signal to a high-frequency range through the generation of harmonics. The
pitch of the original signal is preserved due to the "virtual pitch" psychoacoustic phenomenon.
Algorithm
1. The input stereo signal is split into lowpass and highpass components using a crossover filter. The
filter's crossover frequency is equal to the speaker's cutoff frequency (set to 60 Hz in this example).
2. The highpass component, hpstereo, is split into left and right channels: hpleft and hpright,
respectively.
3. The lowpass component, lpstereo, is converted to mono, lpmono, by adding the left and right
channels element by element.
4. lpmono is passed through a full wave integrator. The full wave integrator shifts lpmono to higher
harmonics.
y[n] = 0                      if u[n] > 0 and u[n−1] ≤ 0
y[n] = y[n−1] + u[n−1]        otherwise
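A minimal MATLAB sketch of this recurrence, operating on one frame u of the lowpass mono signal lpmono (hypothetical variable names; per-frame state handling is omitted):
y = zeros(size(u));
for n = 2:numel(u)
    if u(n) > 0 && u(n-1) <= 0
        y(n) = 0;                    % reset at an upward zero crossing
    else
        y(n) = y(n-1) + u(n-1);      % otherwise accumulate the previous input
    end
end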
5. y[n] is passed through a bandpass filter with lower cutoff frequency set to the speaker's cutoff
frequency. The filter's upper cutoff frequency may be adjusted to fine-tune output sound quality.
8. The left and right channels are concatenated into a single matrix and output.
Although the resulting output stereo signal does not contain low-frequency elements, the input's bass
pitch is preserved thanks to the generated harmonics.
You can open a test bench for audiopluginexample.BassEnhancer using Audio Test Bench. The
test bench provides a graphical user interface to help you test your audio plugin in MATLAB. You can
tune the plugin parameters as the test bench is executing. You can also open a timescope and a
spectrumAnalyzer to view and compare the input and output signals in the time and frequency
domains, respectively.
bassEnhancer = audiopluginexample.BassEnhancer;
audioTestBench(bassEnhancer)
You can also use audiopluginexample.BassEnhancer in MATLAB just as you would use any other
MATLAB object. You can use configureMIDI to enable tuning the object using a MIDI device. This is
particularly useful if the object is part of a streaming MATLAB simulation where the command
window is not free.
References
[1] Aarts, Ronald M, Erik Larsen, and Daniel Schobben. “Improving Perceived Bass and
Reconstruction of High Frequencies for Band Limited Signals.” Proceedings 1st IEEE Benelux
Workshop on Model Based Coding of Audio (MPCA-2002) , November 15, 2002, 59–71.
Tunable Filtering and Visualization Using Audio Plugins
This example shows how to visualize the magnitude response of a tunable filter. The filters in this
example are implemented as audio plugins. This example uses the visualize and audioTestBench
functionality of the Audio Toolbox™.
Audio Toolbox provides several examples of tunable filters that have been implemented as audio
plugins:
audiopluginexample.BandpassIIRFilter
audiopluginexample.HighpassIIRFilter
audiopluginexample.LowpassIIRFilter
audiopluginexample.ParametricEqualizerWithUDP
audiopluginexample.ShelvingEqualizer
audiopluginexample.VarSlopeBandpassFilter
visualize
All of these example audio plugins can be used with the visualize function in order to view the
magnitude response of the filters as they are tuned in real time.
audioTestBench
Any audio plugin can be tuned in real time using audioTestBench. The tool allows you to test an
audio plugin with audio signals from a file or device. The tool also enables you to view the power
spectrum and the time-domain waveform for the input and output signals.
hpf = audiopluginexample.HighpassIIRFilter;
visualize(hpf)
audioTestBench(hpf)
Note that moving the cutoff frequency in audioTestBench does not update the magnitude response
plot. However, once the 'Run' (or play) button is pressed, you can see and hear the changing
magnitude response of the filter as the cutoff frequency is tuned in real time.
audiopluginexample.ShelvingEqualizer and
audiopluginexample.VarSlopeBandpassFilter have visualize functions which update the
magnitude response plot even when not processing data. The visualization is also updated in real
time once audio is being processed.
audioTestBench('-close')
varfilter = audiopluginexample.VarSlopeBandpassFilter;
visualize(varfilter)
audioTestBench(varfilter)
audioTestBench('-close')
equalizer = audiopluginexample.ParametricEqualizerWithUDP;
visualize(equalizer)
audioTestBench(equalizer)
audioTestBench('-close')
Communicate Between a DAW and MATLAB Using UDP
This example shows how to communicate between a digital audio workstation (DAW) and MATLAB®
using the user datagram protocol (UDP). The information shared between the DAW and MATLAB can
be used to visualize, in real time in MATLAB, parameters that are being changed in the DAW.
UDP is a core member of the Internet protocol suite. It is a simple connectionless transmission that
does not employ any methods for error checking. Because it does not check for errors, UDP is a fast
but unreliable alternative to the transmission control protocol (TCP) and stream control transmission
protocol (SCTP). UDP is widely used in applications that are willing to trade fidelity for high-speed
transmission, such as video conferencing and real-time computer games. If you use UDP for
communication within a single machine, packets are less likely to drop. The tutorials outlined here
work best when executed on a single machine.
To communicate between a DAW and MATLAB using UDP, place a UDP sender in the plugin used in
the DAW, and run a corresponding UDP receiver in MATLAB.
The dsp.UDPSender and dsp.UDPReceiver System objects use prebuilt library files that are
included with MATLAB.
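As a rough illustration of the MATLAB side, here is a hedged receive-loop sketch; the port, maximum message length, and data type are assumptions and must match whatever the plugin's UDP sender uses:
udpr = dsp.UDPReceiver('LocalIPPort',20000, ...            % assumed port
    'MessageDataType','single','MaximumMessageLength',4096);
for k = 1:1000
    payload = udpr();          % returns an empty vector when no packet has arrived
    if ~isempty(payload)
        % visualize or log the received samples or parameter values here
    end
end
release(udpr)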
Example Plugins
generateAudioPlugin audiopluginexample.UDPSender
.......
The generated plugin is saved to your current folder and named UDPSender.
To run the UDP sender outside of MATLAB, you must open the DAW from a command terminal with
the appropriate environment variables set. Setting environment variables enables the deployed UDP
sender to use the necessary library files in MATLAB. To learn how to set the environment variables,
see the tutorial specific to your system:
After you set the environment variables, open your DAW from the same command terminal, such as in
this example terminal from a Windows system.
1. Follow steps 1-2 from Send Audio from DAW to MATLAB, replacing
audiopluginexample.UDPSender with
audiopluginexample.ParametricEqualizerWithUDP.
a. In the DAW, open the generated ParametricEqualizerWithUDP file. The plugin display name is
ParametricEQ.
Acoustic Echo Cancellation (AEC)
This example shows how to apply adaptive filters to acoustic echo cancellation (AEC).
Introduction
Acoustic echo cancellation is important for audio teleconferencing when simultaneous communication
(or full-duplex transmission) of speech is necessary. In acoustic echo cancellation, a measured
microphone signal contains two signals:
• The near-end speech signal
• The far-end echoed speech signal
The goal is to remove the far-end echoed speech signal from the microphone signal so that only the
near-end speech signal is transmitted. This example has some sound clips, so you might want to
adjust your computer's volume now.
You first need to model the acoustics of the loudspeaker-to-microphone signal path where the
speakerphone is located. Use a long finite impulse response filter to describe the characteristics of
the room. The following code generates a random impulse response that is not unlike what a
conference room would exhibit. Assume a system sample rate of 16000 Hz.
fs = 16000;
M = fs/2 + 1;
frameSize = 2048;
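The synthesis of roomImpulseResponse is not shown in this excerpt. A hedged sketch that creates a 0.5 second decaying random response and the FIR filter used to echo the far-end speech later in the example (the filter name room is an assumption):
roomImpulseResponse = log(0.99*rand(1,M)+0.01).*sign(randn(1,M)).*exp(-0.002*(1:M)); % decaying random response
roomImpulseResponse = roomImpulseResponse/norm(roomImpulseResponse)*4;               % scale the response
room = dsp.FIRFilter('Numerator',roomImpulseResponse);                               % loudspeaker-to-mic path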
fig = figure;
plot(0:1/fs:0.5, roomImpulseResponse);
xlabel('Time (s)');
ylabel('Amplitude');
title('Room Impulse Response');
fig.Color = [1 1 1];
The teleconferencing system's user is typically located near the system's microphone. Here is what
male speech sounds like at the microphone.
load nearspeech
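The signal-source and audio-player setup does not appear in this excerpt; a hedged sketch, assuming the MAT-file loads the near-end speech into a variable named v:
nearSpeechSrc = dsp.SignalSource('Signal',v,'SamplesPerFrame',frameSize);
player = audioDeviceWriter('SampleRate',fs,'SupportVariableSizeInput',true, ...
    'BufferSize',512);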
In a teleconferencing system, a voice travels out the loudspeaker, bounces around in the room, and
then is picked up by the system's microphone. Listen to what the speech sounds like if it is picked up
at the microphone without the near-end speech present.
load farspeech
farSpeechSrc = dsp.SignalSource('Signal',x,'SamplesPerFrame',frameSize);
farSpeechSink = dsp.SignalSink;
farSpeechScope = timescope('SampleRate', fs, 'TimeSpanSource','Property',...
'TimeSpan', 35, 'TimeSpanOverrunAction', 'Scroll', ...
'YLimits', [-0.5 0.5], ...
'BufferLength', length(x), ...
'Title', 'Far-End Speech Signal', ...
'ShowGrid', true);
farSpeechSink(farSpeechEcho);
end
release(farSpeechScope);
The signal at the microphone contains both the near-end speech and the far-end speech that has been
echoed throughout the room. The goal of the acoustic echo canceler is to cancel out the far-end
speech, such that only the near-end speech is transmitted back to the far-end listener.
reset(nearSpeechSrc);
farSpeechEchoSrc = dsp.SignalSource('Signal', farSpeechSink.Buffer, ...
'SamplesPerFrame', frameSize);
micSink = dsp.SignalSink;
micScope = timescope('SampleRate', fs,'TimeSpanSource','Property',...
'TimeSpan', 35, 'TimeSpanOverrunAction', 'Scroll',...
'YLimits', [-1 1], ...
'BufferLength', length(x), ...
'Title', 'Microphone Signal', ...
'ShowGrid', true);
player(micSignal);
% Plot the signal
micScope(micSignal);
% Log the signal
micSink(micSignal);
end
release(micScope);
The algorithm in this example is the Frequency-Domain Adaptive Filter (FDAF). This algorithm is
very useful when the impulse response of the system to be identified is long. The FDAF uses a fast
convolution technique to compute the output signal and filter updates. This computation executes
quickly in MATLAB®. It also has fast convergence performance through frequency-bin step size
normalization. Pick some initial parameters for the filter and see how well the far-end speech is
cancelled in the error signal.
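The canceller and scope construction are not shown in this excerpt; a hedged sketch using dsp.FrequencyDomainAdaptiveFilter with the step size quoted in the plot titles below, plus a four-display timescope matching the configuration that follows (the filter length of 2048 is an assumption):
echoCanceller = dsp.FrequencyDomainAdaptiveFilter('Length',2048,'StepSize',0.025);
AECScope1 = timescope('NumInputPorts',4,'SampleRate',fs, ...
    'LayoutDimensions',[4 1],'TimeSpan',35,'TimeSpanSource','Property', ...
    'BufferLength',length(x));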
AECScope1.ActiveDisplay = 1;
AECScope1.ShowGrid = true;
AECScope1.YLimits = [-1.5 1.5];
AECScope1.Title = 'Near-End Speech Signal';
AECScope1.ActiveDisplay = 2;
AECScope1.ShowGrid = true;
AECScope1.YLimits = [-1.5 1.5];
AECScope1.Title = 'Microphone Signal';
AECScope1.ActiveDisplay = 3;
AECScope1.ShowGrid = true;
AECScope1.YLimits = [-1.5 1.5];
AECScope1.Title = 'Output of Acoustic Echo Canceller mu=0.025';
AECScope1.ActiveDisplay = 4;
AECScope1.ShowGrid = true;
AECScope1.YLimits = [0 50];
AECScope1.YLabel = 'ERLE (dB)';
AECScope1.Title = 'Echo Return Loss Enhancement mu=0.025';
Since you have access to both the near-end and far-end speech signals, you can compute the echo
return loss enhancement (ERLE), which is a smoothed measure of the amount (in dB) that the
echo has been attenuated. From the plot, observe that you achieved about a 35 dB ERLE at the end of
the convergence period.
diffAverager = dsp.FIRFilter('Numerator', ones(1,1024));
farEchoAverager = clone(diffAverager);
setfilter(FVT,diffAverager);
To get faster convergence, you can try using a larger step size value. However, this increase causes
another effect: the adaptive filter is "misadjusted" while the near-end speaker is talking. Listen to
what happens when you choose a step size that is 60% larger than before.
AECScope2 = clone(AECScope1);
AECScope2.ActiveDisplay = 3;
AECScope2.Title = 'Output of Acoustic Echo Canceller mu=0.04';
AECScope2.ActiveDisplay = 4;
AECScope2.Title = 'Echo Return Loss Enhancement mu=0.04';
reset(nearSpeechSrc);
reset(farSpeechSrc);
reset(farSpeechEchoSrc);
reset(micSrc);
reset(diffAverager);
reset(farEchoAverager);
nearSpeech = nearSpeechSrc();
farSpeech = farSpeechSrc();
farSpeechEcho = farSpeechEchoSrc();
micSignal = micSrc();
% Apply FDAF
[y,e] = echoCanceller(farSpeech, micSignal);
% Send the speech samples to the output audio device
player(e);
% Compute ERLE
erle = diffAverager((e-nearSpeech).^2)./ farEchoAverager(farSpeechEcho.^2);
erledB = -10*log10(erle);
% Plot near-end, far-end, microphone, AEC output and ERLE
AECScope2(nearSpeech, micSignal, e, erledB);
end
release(nearSpeechSrc);
release(farSpeechSrc);
release(farSpeechEchoSrc);
release(micSrc);
release(diffAverager);
release(farEchoAverager);
release(echoCanceller);
release(AECScope2);
With a larger step size, the ERLE performance is not as good due to the misadjustment introduced by
the near-end speech. To deal with this performance difficulty, acoustic echo cancelers include a
detection scheme to tell when near-end speech is present and lower the step size value over these
periods. Without such detection schemes, the performance of the system with the larger step size is
not as good as the former, as can be seen from the ERLE plots.
Traditional FDAF is numerically more efficient than time-domain adaptive filtering for long impulse
responses, but it imposes high latency, because the input frame size must be a multiple of the
specified filter length. This can be unacceptable for many real-world applications. Latency may be
reduced by using partitioned FDAF, which partitions the filter impulse response into shorter
segments, applies FDAF to each segment, and then combines the intermediate results. The frame size
in that case must be a multiple of the partition (block) length, thereby greatly reducing the latency
for long impulse responses.
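A hedged configuration sketch of this idea; 'Partitioned constrained FDAF' is one of the Method choices of dsp.FrequencyDomainAdaptiveFilter, and the block length here is illustrative:
pfdafCanceller = dsp.FrequencyDomainAdaptiveFilter('Method','Partitioned constrained FDAF', ...
    'Length',2048,'BlockLength',256,'StepSize',0.025);   % frames need only be a multiple of 256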
Active Noise Control Using a Filtered-X LMS FIR Adaptive Filter
This example shows how to apply adaptive filters to the attenuation of acoustic noise via active noise
control.
In active noise control, one attempts to reduce the volume of an unwanted noise propagating through
the air using an electro-acoustic system with measurement sensors such as microphones and output
actuators such as loudspeakers. The noise signal usually comes from some device, such as a rotating
machine, so it is possible to measure the noise near its source. The goal of the active noise
control system is to produce an "anti-noise" that attenuates the unwanted noise in a desired quiet
region using an adaptive filter. This problem differs from traditional adaptive noise cancellation in
two ways:
• The desired response signal cannot be directly measured; only the attenuated signal is available.
• The active noise control system must take into account the secondary loudspeaker-to-microphone
error path in its adaptation.
For more implementation details on active noise control tasks, see S.M. Kuo and D.R. Morgan, "Active
Noise Control Systems: Algorithms and DSP Implementations", Wiley-Interscience, New York, 1996.
The secondary propagation path is the path the anti-noise takes from the output loudspeaker to the
error microphone within the quiet zone. The following commands generate a loudspeaker-to-error
microphone impulse response that is bandlimited to the range 160 - 2000 Hz and with a filter length
of 0.1 seconds. For this active noise control task, we shall use a sampling frequency of 8000 Hz.
Fs = 8e3; % 8 kHz
N = 800; % 800 samples@8 kHz = 0.1 seconds
Flow = 160; % Lower band-edge: 160 Hz
Fhigh = 2000; % Upper band-edge: 2000 Hz
delayS = 7;
Ast = 20; % 20 dB stopband attenuation
Nfilt = 8; % Filter order
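The synthesis of secondaryPathCoeffsActual is not shown in this excerpt. A hedged sketch that bandlimits decaying random noise with an FIR designed by fir1; the FIR order is illustrative, and the sketch does not use Nfilt or Ast, which belong to the omitted original design:
bandLimiter = fir1(255,[Flow Fhigh]/(Fs/2));              % simple bandlimiting FIR
raw = [zeros(delayS,1); ...
       log(0.99*rand(N-delayS,1)+0.01).*sign(randn(N-delayS,1)).*exp(-0.01*(1:N-delayS).')];
secondaryPathCoeffsActual = filter(bandLimiter,1,raw);    % bandlimited, delayed, decaying response
secondaryPathCoeffsActual = secondaryPathCoeffsActual/norm(secondaryPathCoeffsActual);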
t = (1:N)/Fs;
plot(t,secondaryPathCoeffsActual,'b');
xlabel('Time [sec]');
ylabel('Coefficient value');
title('True Secondary Path Impulse Response');
The first task in active noise control is to estimate the impulse response of the secondary propagation
path. This step is usually performed prior to noise control using a synthetic random signal played
through the output loudspeaker while the unwanted noise is not present. The following commands
generate 3.75 seconds of this random noise as well as the measured signal at the error microphone.
ntrS = 30000;
randomSignal = randn(ntrS,1); % Synthetic random signal to be played
secondaryPathGenerator = dsp.FIRFilter('Numerator',secondaryPathCoeffsActual.');
secondaryPathMeasured = secondaryPathGenerator(randomSignal) + ... % random signal propagated through the secondary path
0.01*randn(ntrS,1); % measurement noise at the microphone
Typically, the length of the secondary path filter estimate is not as long as the actual secondary path
and need not be for adequate control in most cases. We shall use a secondary path filter length of 250
taps, corresponding to an impulse response length of 31 ms. While any adaptive FIR filtering
algorithm could be used for this purpose, the normalized LMS algorithm is often used due to its
simplicity and robustness. Plots of the output and error signals show that the algorithm converges
after about 10000 iterations.
M = 250;
muS = 0.1;
secondaryPathEstimator = dsp.LMSFilter('Method','Normalized LMS','StepSize', muS, ...
'Length', M);
[yS,eS,SecondaryPathCoeffsEst] = secondaryPathEstimator(randomSignal,secondaryPathMeasured);
n = 1:ntrS;
figure, plot(n,secondaryPathMeasured,n,yS,n,eS);
xlabel('Number of iterations');
ylabel('Signal value');
title('Secondary Identification Using the NLMS Adaptive Filter');
legend('Desired Signal','Output Signal','Error Signal');
How accurate is the secondary path impulse response estimate? This plot shows the coefficients of
both the true and estimated path. Only the tail of the true impulse response is not estimated
accurately. This residual error does not significantly harm the performance of the active noise control
system during its operation in the chosen task.
The propagation path of the noise to be cancelled can also be characterized by a linear filter. The
following commands generate an input-to-error microphone impulse response that is bandlimited to
the range 200 - 800 Hz and has a filter length of 0.1 seconds.
delayW = 15;
Flow = 200; % Lower band-edge: 200 Hz
Fhigh = 800; % Upper band-edge: 800 Hz
Ast = 20; % 20 dB stopband attenuation
Nfilt = 10; % Filter order
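The primary path coefficients can be synthesized the same way. This is a hedged sketch, not the omitted original design; the FIR wrapper name primaryPathGenerator matches its use later in this example:
bandLimiter = fir1(255,[Flow Fhigh]/(Fs/2));
raw = [zeros(delayW,1); ...
       log(0.99*rand(N-delayW,1)+0.01).*sign(randn(N-delayW,1)).*exp(-0.01*(1:N-delayW).')];
primaryPathCoeffs = filter(bandLimiter,1,raw);
primaryPathCoeffs = primaryPathCoeffs/norm(primaryPathCoeffs);
primaryPathGenerator = dsp.FIRFilter('Numerator',primaryPathCoeffs.');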
figure, plot(t,primaryPathCoeffs,'b');
xlabel('Time [sec]');
ylabel('Coefficient value');
title('Primary Path Impulse Response');
Typical active noise control applications involve the sounds of rotating machinery due to their
annoying characteristics. Here, we synthetically generate noise that might come from a typical
electric motor.
The most popular adaptive algorithm for active noise control is the filtered-X LMS algorithm. This
algorithm uses the secondary path estimate to calculate an output signal whose contribution at the
error sensor destructively interferes with the undesired noise. The reference signal is a noisy version
of the undesired sound measured near its source. We shall use a controller filter length of about 44
ms and a step size of 0.0001 for these signal statistics.
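The controller construction is not shown in this excerpt; a hedged sketch using dsp.FilteredXLMSFilter with the length and step size quoted above and the secondary path estimate identified earlier:
L = 350;                           % about 44 ms at 8 kHz
muW = 0.0001;
noiseController = dsp.FilteredXLMSFilter('Length',L,'StepSize',muW, ...
    'SecondaryPathCoefficients',SecondaryPathCoeffsEst);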
F0 = 60;
k = 1:La;
F = F0*k;
phase = rand(1,La); % Random initial phase
sine = audioOscillator('NumTones', La, 'Amplitude',A,'Frequency',F, ...
'PhaseOffset',phase,'SamplesPerFrame',512,'SampleRate',Fs);
Here we simulate the active noise control system. To emphasize the difference we run the system
with no active noise control for the first 200 iterations. Listening to its sound at the error microphone
before cancellation, it has the characteristic industrial "whine" of such motors.
Once the adaptive filter is enabled, the resulting algorithm converges after about 5 (simulated)
seconds of adaptation. Comparing the spectrum of the residual error signal with that of the original
noise signal, we see that most of the periodic components have been attenuated considerably. The
steady-state cancellation performance may not be uniform across all frequencies, however. Such is
often the case for real-world systems applied to active noise control tasks. Listening to the error
signal, the annoying "whine" is reduced considerably.
for m = 1:400
% Generate synthetic noise by adding sine waves with random phase
x = sine();
d = primaryPathGenerator(x) + ... % Propagate noise through primary path
0.1*randn(size(x)); % Add measurement noise
if m <= 200
% No noise control for first 200 iterations
e = d;
else
% Enable active noise control after 200 iterations
xhat = x + 0.1*randn(size(x));
[y,e] = noiseController(xhat,d);
end
player(e); % Play noise signal
scope([d,e]); % Show spectrum of original (Channel 1)
% and attenuated noise (Channel 2)
end
release(player); % Release audio device
release(scope); % Release spectrum analyzer
Acoustic Noise Cancellation Using LMS
This example shows how to use the Least Mean Square (LMS) algorithm to subtract noise from an
input signal. The LMS adaptive filter uses the reference signal on the Input port and the desired
signal on the Desired port to automatically match the filter response. As it converges to the correct
filter model, the filtered noise is subtracted and the error signal should contain only the original
signal.
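The shipped example is a Simulink model, but the same idea can be sketched in a few lines of MATLAB using dsp.LMSFilter; the synthetic signals below are illustrative only and do not correspond to the model's actual blocks:
fs = 8000;
n = (0:5*fs-1).';
signal = 0.5*sin(2*pi*440*n/fs);                 % stand-in for the drum track
noise = randn(size(n));                          % white reference noise ("Exterior Mic")
coloredNoise = filter(fir1(31,0.5),1,noise);     % noise as heard at the second microphone
desired = signal + coloredNoise;                 % noisy signal ("Pilot's Mic")
lms = dsp.LMSFilter('Length',32,'Method','Normalized LMS','StepSize',0.1);
[y,err] = lms(noise,desired);                    % err converges to the clean signal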
In the model, the signal output at the upper port of the Acoustic Environment subsystem is white
noise. The signal output at the lower port is composed of colored noise and a signal from a WAV file.
This example model uses an adaptive filter to remove the noise from the signal output at the lower
port. When you run the simulation, you hear both noise and a person playing the drums. Over time,
the adaptive filter in the model filters out the noise so you only hear the drums.
Run the model to listen to the audio signal in real time. The stop time is set to infinity. This allows you
to interact with the model while it runs. For example, you can change the filter or alternate from
slow adaptation to fast adaptation (and vice versa), and get a sense of the real-time audio processing
behavior under these conditions.
Notice the colors of the blocks in the model. These are sample time colors that indicate how fast a
block executes. Here, the fastest discrete sample time is red, and the second fastest discrete sample
time is green. You can see that the color changes from red to green after down-sampling by 32 (in the
Downsample block before the Waterfall Scope block). Further information on displaying sample time
colors can be found in the Simulink® documentation.
Waterfall Scope
The Waterfall window displays the behavior of the adaptive filter's filter coefficients. It displays
multiple vectors of data at one time. These vectors represent the values of the filter's coefficients of a
normalized LMS adaptive filter, and are the input data at consecutive sample times. The data is
displayed in a three-dimensional axis in the Waterfall window. By default, the x-axis represents
amplitude, the y-axis represents samples, and the z-axis represents time. The Waterfall window has
toolbar buttons that enable you to zoom in on displayed data, suspend data capture, freeze the
scope's display, save the scope position, and export data to the workspace.
You can see the details of the Acoustic Environment subsystem by double clicking on that block.
Gaussian noise is used to create the signal sent to the Exterior Mic output port. If the input to the
Filter port changes from 0 to 1, the Digital Filter block changes from a lowpass filter to a bandpass
filter. The filtered noise output from the Digital Filter block is added to the signal coming from a WAV-
file to produce the signal sent to the Pilot's Mic output port.
References
[1] Haykin, Simon S. Adaptive Filter Theory. 3rd ed, Prentice Hall, 1996.
Delay-Based Audio Effects
This example shows how to design and use three audio effects that are based on varying delay: echo,
chorus and flanger. The example also shows how the algorithms, developed in MATLAB, can be easily
ported to Simulink.
Introduction
Audio effects can be generated by adding a processed ('wet') signal to the original ('dry') audio signal.
A simple effect, echo, adds a delayed version of the signal to the original. More complex effects, like
chorus and flanger, modulate the delayed version of the signal.
Echo
You can model the echo effect by delaying the audio signal and adding it back. Feedback is often
added to the delay line to give a fading effect. The echo effect is implemented in the
audioexample.Echo class. The block diagram shows a high-level implementation of an echo effect.
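Conceptually, the echo is just the input plus a delayed, fed-back copy of itself. The short sketch below
implements that directly in MATLAB; the audio file name and the delay, gain, and feedback values are
illustrative assumptions rather than the settings used by audioexample.Echo.
[x,fs] = audioread('RockGuitar-16-44p1-stereo-72secs.wav'); % any audio file works
delaySamples = round(0.3*fs); % 300 ms echo delay
gain = 0.5;                   % level of the echoed ("wet") signal
feedback = 0.35;              % portion fed back into the delay line
d = zeros(size(x));           % delayed signal with feedback
y = zeros(size(x));
for n = 1:size(x,1)
    k = n - delaySamples;
    if k > 0
        d(n,:) = x(k,:) + feedback*d(k,:); % delayed input plus feedback
    end
    y(n,:) = x(n,:) + gain*d(n,:);         % dry plus wet
end
sound(y,fs)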
The echo effect example has four tunable parameters that can be modified while the simulation is
running:
Chorus
The chorus effect usually has multiple independent delays, each modulated by a low-frequency
oscillator. audioexample.Chorus implements this effect. The block diagram shows a high-level
implementation of a chorus effect.
The chorus effect example has six tunable parameters that can be modified while the simulation is
running:
Flanger
You can model the flanging effect by delaying the audio input by an amount that is modulated by a
low-frequency oscillator (LFO). The delay line used in flanger can also have a feedback path.
audioexample.Flanger implements this effect. The block diagram shows a high-level implementation
of a flanger effect.
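A compact way to hear the idea is to modulate a fractional delay with a slow sine wave, as in the
sketch below. This is not the audioexample.Flanger implementation; the file name, sweep rate, delay
range, and mix values are illustrative assumptions.
[x,fs] = audioread('RockGuitar-16-44p1-stereo-72secs.wav');
x = x(:,1);
n = (0:numel(x)-1).';
delaySamp = (0.004 + 0.002*sin(2*pi*0.5*n/fs))*fs; % 2 to 6 ms delay, swept at 0.5 Hz
vfd = dsp.VariableFractionalDelay('MaximumDelay',ceil(0.01*fs));
wet = vfd(x,delaySamp); % LFO-modulated delayed copy of the input
y = 0.7*x + 0.7*wet;    % mix dry and wet paths to create the sweeping notches
sound(y,fs)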
The flanger effect example has five tunable parameters that can be modified while the simulation is
running:
You can select the effect to be applied by double-clicking on the Effect Selector block.
Once the effect has been selected, you can click on Launch Parameter Tuning UI button to bring
up the dialog that has all tunable parameters of the effect.
This dialog will remain available even during simulation. You can run the model and tune properties
of the effect to listen to how they affect the audio output.
Add Reverberation Using Freeverb Algorithm
This example shows how to apply reverberation to audio by using the Freeverb reverberation
algorithm. The reverberation can be tuned using a user interface (UI) in MATLAB or through a MIDI
controller. This example illustrates MATLAB® and Simulink® implementations.
Introduction
Reverberators are used to add the effect of multiple decaying echoes, or reverbs, to audio signals. A
common use of reverberation is to simulate music played in a closed room. Most digital audio
workstations (DAWs) have options to add such effects to the sound track.
In this example, you add reverberation to audio through the Freeverb algorithm. Freeverb is a
popular implementation of the Schroeder reverberator. A high-level model of the Freeverb algorithm
is shown below:
Example Architecture
MATLAB Simulation
audioFreeverbReverberationExampleApp
The UI also has three buttons: the Reset button resets the states of the comb and allpass sections in
the reverberator to their initial values, and the Pause Simulation button holds the simulation until you
click it again. To end the simulation, either close the UI or click the Stop simulation button. If you
have a MIDI controller, you can synchronize it with the UI. To do so, right-click a slider or button,
select "Synchronize" from the context menu, and choose a MIDI control in the dialog that opens. The
chosen MIDI control is then linked to the slider or button, so operating one control is tracked by the
other.
If you see many queue underrun warnings, adjust the buffer and queue sizes of the audio player used
in audioFreeverbReverberationExampleApp. More information on this can be found
at the documentation page for audioDeviceWriter. The audio source in this example is an audio
file, but you can replace it with an audio input device (through audioDeviceReader) to add
reverberation to live audio. For ways to reduce latency while not having any overruns/underruns, you
can follow the example “Measure Audio Latency” on page 1-259.
Using MATLAB Coder™, you can generate a MEX file for the main processing algorithm by executing
the HelperFreeverbCodeGeneration command. You can use the generated MEX file by executing
the audioFreeverbReverberationExampleApp command with true as an argument.
audioFreeverbReverberationExampleApp(true)
Simulink Version
The model generates code when it is simulated. Therefore, it must be executed from a folder with
write permissions.
Acknowledgement
The algorithm in this example is based on the public domain 'Freeverb' model written by Jezar at
Dreampoint (June 2000).
Reference
Multiband Dynamic Range Compression
This example shows how to simulate a digital audio multiband dynamic range compression system.
Introduction
Dynamic range compression reduces the dynamic range of a signal by attenuating the level of strong
peaks, while leaving weaker peaks unchanged. Compression has applications in audio recording,
mixing, and broadcasting.
Multiband compression compresses different audio frequency bands separately, by first splitting the
audio signal into multiple bands and then passing each band through its own independently
adjustable compressor. Multiband compression is widely used in audio production and is often
included in audio workstations.
The multiband compressor in this example first splits an audio signal into different bands using a
multiband crossover filter. Linkwitz-Riley crossover filters are used to obtain an overall allpass
frequency response. Each band is then compressed using a separate dynamic range compressor. Key
compressor characteristics, such as the compression ratio, the attack and release time, the threshold
and the knee width, are independently tunable for each band. The effect of compression on the
dynamic range of the signal is showcased.
A Linkwitz-Riley crossover filter consists of a combination of a lowpass and highpass filter, each
formed by cascading two lowpass or highpass Butterworth filters. Summing the response of the two
filters yields a gain of 0 dB at the crossover frequency, so that the crossover acts like an allpass filter
(and therefore introduces no distortion in the audio signal).
Here is an example where a fourth-order Linkwitz-Riley crossover is used to filter a signal. Notice
that the lowpass and highpass sections each have a -6 dB gain at the crossover frequency. The sum of
the lowpass and highpass sections is allpass.
Fs = 44100;
% Linkwitz-Riley filter
crossover = crossoverFilter(1,5000,4*6,Fs);
frameLength = 1024;
% The estimator and scope used in the loop below are not shown in this excerpt;
% the settings here are reasonable assumptions.
transferFuncEstimator = dsp.TransferFunctionEstimator('FrequencyRange','onesided');
scope = dsp.ArrayPlot('PlotType','Line','XOffset',0, ...
    'SampleIncrement',.5*Fs/(frameLength/2+1), ...
    'YLimits',[-120 5],'YLabel','Frequency Response (dB)', ...
    'XLabel','Frequency (Hz)','ShowLegend',true, ...
    'ChannelNames',{'Lowpass','Highpass','Sum'});
tic
while toc < 10
in = randn(frameLength,1);
% Return lowpass and highpass responses of the crossover filter
[ylp,yhp] = crossover(in);
% sum the responses
y = ylp + yhp;
v = transferFuncEstimator(repmat(in,1,3),[ylp yhp y]);
scope(20*log10(abs(v)));
end
crossoverFilter may also be used to implement a multiband crossover filter by combining two-
band crossover filters and allpass filters in a tree-like structure. The filter divides the spectrum into
multiple bands such that their sum is a perfect allpass filter.
The example below shows a four-band crossover filter formed of fourth-order Linkwitz-Riley crossover
filters. Notice the allpass response of the sum of the four bands.
Fs = 44100;
crossover = crossoverFilter(3,[2e3 5e3 10e3],[24 24 24],44100);
transferFuncEstimator = dsp.TransferFunctionEstimator('FrequencyRange','onesided', ...
    'SpectralAverages',20); % averaging count was truncated in this excerpt; 20 is an assumed value
L = 2^14;
scope = dsp.ArrayPlot( ...
'PlotType','Line', ...
'XOffset',0, ...
'YLimits',[-120 5], ...
'XScale','log', ...
'SampleIncrement', .5 * Fs/(L/2 + 1 ), ...
'YLabel','Frequency Response (dB)', ...
'XLabel','Frequency (Hz)', ...
'Title','Four-Band Crossover Filter', ...
'ShowLegend',true, ...
'ChannelNames',{'Band 1','Band 2','Band 3','Band 4','Sum'});
tic;
while toc < 10
in = randn(L,1);
% Split the signal into four bands
[ylp,ybp1,ybp2,yhp] = crossover(in);
y = ylp + ybp1 + ybp2 + yhp;
z = transferFuncEstimator(repmat(in,1,5),[ylp,ybp1,ybp2,yhp,y]);
scope(20*log10(abs(z)))
end
compressor is a dynamic range compressor System object. The input signal is compressed when it
exceeds the specified threshold. The amount of compression is controlled by the specified
compression ratio. The attack and release times determine how quickly the compressor starts or
stops compressing. The knee width provides a smooth transition for the compressor gain around the
threshold. Finally, a make-up gain can be applied at the output of the compressor. This make-up gain
amplifies both strong and weak peaks equally.
The static compression characteristic of the compressor depends on the compression ratio, the
threshold and the knee width. The example below illustrates the static compression characteristic for
a hard knee:
drc = compressor(-15,5);
visualize(drc);
In order to view the effect of threshold, ratio and knee width on the compressor's static
characteristic, change the values of the Threshold, Ratio and KneeWidth properties. The static
characteristic plot should change accordingly.
The compressor's attack time is defined as the time (in seconds) it takes for the compressor's gain to
rise from 10% to 90% of its final value when the signal level exceeds the threshold. The compressor's
release time is defined as the time (in seconds) it takes the compressor's gain to drop from 90% to
10% of its value when the signal level drops below the threshold. The example below illustrates the
signal envelope for different release and attack times:
Fs = 44100;
% Compressor with the -10 dB threshold and compression ratio of 5 discussed
% below (attack and release times are left at their default values here)
drc = compressor(-10,5,'SampleRate',Fs);
x = [ones(Fs,1);0.1*ones(Fs,1)];
t = (1/Fs:1/Fs:2)'; % time vector for the two-second test signal
[y,g] = drc(x);
figure
subplot(211)
plot(t,x);
hold on
grid on
plot(t,y,'r')
ylabel('Amplitude')
legend('Input','Compressed Output')
subplot(212)
plot(t,g)
grid on
legend('Compressor gain (dB)')
xlabel('Time (sec)')
ylabel('Gain (dB)')
The input maximum level is 0 dB, which is above the specified -10 dB threshold. The steady-state
compressor output for a 0 dB input is -10 + 10/5 = -8 dB. The gain is therefore -8 dB. The attack time
is defined as the time it takes the compressor gain to rise from 10% to 90% of its final value when the
input level goes above the threshold, or in this case, from -0.8 dB to -7.2 dB. Let's find the times at
which the gains in the compression stage are equal to -0.8 dB and -7.2 dB, respectively:
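A sketch of that measurement, assuming the gain signal g and time vector t computed in the code
above:
i1 = find(g <= -0.8,1); % gain reaches 10% of its final value (-0.8 dB)
i2 = find(g <= -7.2,1); % gain reaches 90% of its final value (-7.2 dB)
attackTime = t(i2) - t(i1)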
The input signal then drops back down to 0, where there is no compression. The release time is
defined as the time it takes the gain to drop from 90% to 10% of its absolute value when the input
goes below the threshold, or in this case, -7.2 dB to -0.8 dB. Let's find the times at which the gains in
the no-compression stage are equal to -7.2 dB and -0.8 dB, respectively:
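Similarly, restricting attention to the second half of the signal, where the input has stepped back
down (a sketch using the same g and t):
idx = find(t > 1);           % the input steps down at t = 1 s
j1 = find(g(idx) >= -7.2,1); % gain recovers to -7.2 dB
j2 = find(g(idx) >= -0.8,1); % gain recovers to -0.8 dB
releaseTime = t(idx(j2)) - t(idx(j1))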
The example below illustrates the effect of dynamic range compression on an audio signal. The
compression threshold is set to -15 dB, and the compression ratio is 7.
frameLength = 1024;
reader = dsp.AudioFileReader('Filename', ...
'RockGuitar-16-44p1-stereo-72secs.wav', ...
'SamplesPerFrame',frameLength);
% Compressor. Threshold = -15 dB, ratio = 7
drc = compressor(-15,7, ...
'SampleRate',reader.SampleRate, ...
'MakeUpGainMode','Property', ...
'KneeWidth',5);
scope = timescope('SampleRate',reader.SampleRate, ...
'TimeSpanSource','property',...
'TimeSpan',1,'BufferLength',Fs*4, ...
'ShowGrid',true, ...
'LayoutDimensions',[2 1], ...
'NumInputPorts',2, ...
'TimeSpanOverrunAction','Scroll');
scope.ActiveDisplay = 1;
scope.YLimits = [-1 1];
scope.ShowLegend = true;
scope.ChannelNames = {'Original versus compressed audio'};
scope.ActiveDisplay = 2;
scope.YLimits = [-6 0];
scope.YLabel = 'Gain (dB)';
scope.ShowLegend = true;
scope.ChannelNames = {'compressor gain in dB'};
while ~isDone(reader)
x = reader();
[y,g] = drc(x);
x1 = x(:,1);
y1 = y(:,1);
scope([x1,y1],g(:,1))
end
The following model implements the multiband dynamic range compression example:
model = 'audiomultibanddynamiccompression';
open_system(model)
In this example, the audio signal is first divided into four bands using a multiband crossover filter.
Each band is compressed using a separate compressor. The four bands are then recombined to form
the audio output. The dynamic range of the uncompressed and compressed signals (defined as the
ratio of the largest absolute value of the signal to the signal RMS) is computed. To hear the difference
between the original and compressed audio signals, toggle the switch on the top level.
The model integrates a User Interface (UI) designed to interact with the simulation. The UI allows
you to tune parameters and the results are reflected in the simulation instantly. To launch the UI that
controls the simulation, click the 'Launch Parameter Tuning UI' link on the model.
bdclose(model)
Pitch Shifting and Time Dilation Using a Phase Vocoder in MATLAB
This example shows how to implement a phase vocoder to time stretch and pitch scale an audio
signal.
Introduction
The phase vocoder performs time stretching and pitch scaling by transforming the audio into
frequency domain. The following block diagram shows the operations involved in the phase vocoder
implementation.
The phase vocoder has an analysis section that performs an overlapped short-time FFT (ST-FFT) and
a synthesis section that performs an overlapped inverse short-time FFT (IST-FFT). To time stretch a
signal, the phase vocoder uses a larger hop size for the overlap-add operation in the synthesis section
than the analysis section. Here, the hop size is the number of samples processed at one time. As a
result, there are more samples at the output than at the input although the frequency content
remains the same. Now, you can pitch scale this signal by playing it back at a higher sample rate,
which produces a signal with the original duration but a higher pitch.
Initialization
To achieve optimal performance, you must create and initialize your System objects before using
them in a processing loop. Use these next sections of code to initialize the required variables and load
the input speech data. You set an analysis hop size of 64 and a synthesis hop size of 90 because you
want to stretch the signal by a factor of 90/64.
Initialize some variables used in configuring the System objects you create below.
WindowLen = 256;
AnalysisLen = 64;
SynthesisLen = 90;
Hopratio = SynthesisLen/AnalysisLen;
Create a System object to read in the input speech signal from an audio file.
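The reader creation itself is not shown in this excerpt. A minimal sketch (the file name is illustrative;
the frame size matches the analysis hop used below):
reader = dsp.AudioFileReader('speech_dft_8kHz.wav', ... % illustrative 8 kHz speech file
    'SamplesPerFrame',AnalysisLen);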
win = sqrt(hanning(WindowLen,'periodic'));
stft = dsp.STFT(win, WindowLen - AnalysisLen, WindowLen);
istft = dsp.ISTFT(win, WindowLen - SynthesisLen );
Fs = 8000;
player = audioDeviceWriter('SampleRate',Fs, ...
'SupportVariableSizeInput',true, ...
'BufferSize',512);
logger = dsp.SignalSink;
unwrapdata = 2*pi*AnalysisLen*(0:WindowLen-1)'/WindowLen;
yangle = zeros(WindowLen,1);
firsttime = true;
Now that you have instantiated your System objects, you can create a processing loop that performs
time stretching on the input signal. The loop is stopped when you reach the end of the input file,
which is detected by the AudioFileReader System object.
while ~isDone(reader)
y = reader();
% ST-FFT
yfft = stft(y);
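% --- Phase calculation (reconstructed sketch; the shipping example may differ) ---
% Scale the per-bin phase increments by the synthesis/analysis hop ratio so the
% time-stretched frames keep the original instantaneous frequencies.
ymag = abs(yfft);
yprevangle = yangle;
yangle = angle(yfft);
yunwrap = (yangle - yprevangle) - unwrapdata;   % remove the expected phase advance
yunwrap = yunwrap - round(yunwrap/(2*pi))*2*pi; % wrap to [-pi, pi]
yunwrap = (yunwrap + unwrapdata)*Hopratio;      % rescale by the hop ratio
if firsttime
    ysangle = yangle;
    firsttime = false;
else
    ysangle = ysangle + yunwrap;                % accumulate the synthesis phase
end
ys = ymag.*complex(cos(ysangle),sin(ysangle));  % recombine magnitude and phase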
% IST-FFT
yistfft = istft(ys);
% Log the synthesized (time-stretched) signal for playback after the loop
logger(yistfft);
end
Release
Call release on the System objects to close any open files and devices.
release(reader)
release(player)
loggedSpeech = logger.Buffer(200:end)';
player = audioDeviceWriter('SampleRate',Fs, ...
'SupportVariableSizeInput',true, ...
'BufferSize',512);
player(loggedSpeech.');
The pitch-scaled signal is the time-stretched signal played at a higher sampling rate which produces a
signal with a higher pitch.
Fs_new = Fs*(SynthesisLen/AnalysisLen);
player = audioDeviceWriter('SampleRate',Fs_new, ...
'SupportVariableSizeInput',true, ...
'BufferSize',1024);
player(loggedSpeech.');
You can easily apply time dilation with audioTimeScaler. audioTimeScaler implements an
analysis-synthesis phase vocoder for time scaling.
Instantiate an audioTimeScaler with the desired speedup factor, window, and analysis hop length:
ats = audioTimeScaler(AnalysisLen/SynthesisLen,'Window',win,'OverlapLength',WindowLen-AnalysisLen
Create a processing loop that performs time stretching on the input signal.
while ~isDone(reader)
x = reader();
% Time-scale the frame and play it back
y = ats(x);
player(y);
end
release(reader)
release(player)
Summary
This example shows the implementation of a phase vocoder to perform time stretching and pitch
scaling of a speech signal. You can hear these time-stretched and pitch-scaled signals when you run
the example.
References
Pitch Shifting and Time Dilation Using a Phase Vocoder in Simulink
This example shows how to use a phase vocoder to implement time dilation and pitch shifting of an
audio signal.
The phase vocoder in this example consists of an analysis section, a phase calculation section and a
synthesis section. The analysis section consists of an overlapped, short-time windowed FFT. The start
of each frame to be transformed is delayed from the previous frame by the amount specified in the
Analysis hop size parameter. The synthesis section consists of a short-time windowed IFFT and an
overlap add of the resulting frames. The overlap size during synthesis is specified by the Synthesis
hop size parameter.
The vocoder output has a different sample rate than its input. The ratio of the output to input sample
rates is the Synthesis hop size divided by the Analysis hop size. If the output is played at the input
sample rate, it is time stretched or time reduced depending on that ratio. If the output is played at
the output sample rate, the sound duration is identical to the input, but is pitch shifted either up or
down.
To prevent distortion, the phase of the frequency domain signal is modified in the phase calculation
section. In the frequency domain, the signal is split into its magnitude and phase components. For
each bin, a phase difference between frames is calculated, then normalized by the nominal phase of
the bin. Phase modification first requires that the normalized phase differences be unwrapped. The
unwrapped differences are multiplied by the Synthesis hop size divided by the Analysis hop size.
The differences are accumulated, frame by frame, to recover the phase components. Magnitude and
phase components are then recombined.
On running the model, the pitch-scaled signal is automatically played once the simulation has
finished. The Audio Playback block allows you to choose between Pitch Shifting and Time Dilation
modes.
Double-click the Phase Vocoder block. Change the Synthesis hop-size parameter to 64, the same
value as the Analysis hop-size parameter. Run the simulation and listen to the three signals. The
pitch-scaled signal has the same pitch as the original signal, and the time-stretched signal has the
same speed as the original signal.
Next change the Synthesis hop-size parameter in the Phase Vocoder block to 48, which is less than
the Analysis hop-size parameter. Run the simulation and listen to the three signals. The pitch-scaled
signal has a lower pitch than the original signal. The time-stretched signal is faster than the original
signal.
To see the implementation, right-click on the Phase Vocoder block and select Mask > Look Under
Mask.
References
Remove Interfering Tone From Audio Stream
This example shows how to remove a 250 Hz interfering tone from a streaming audio signal using a
notch filter.
Introduction
A notch filter is used to eliminate a specific frequency from a given signal. In their most common
form, the filter design parameters for notch filters are the center frequency of the notch and the 3 dB
bandwidth. The center frequency is the frequency point at which the filter has a gain of zero. The 3
dB bandwidth measures the frequency width of the notch filter computed at the half-power, or 3 dB,
attenuation point.
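A non-streaming sketch of the same idea, using dsp.NotchPeakFilter directly (the sample rate,
bandwidth, and test signal are illustrative assumptions):
fs = 44100;
notch = dsp.NotchPeakFilter('CenterFrequency',250,'Bandwidth',50,'SampleRate',fs);
t = (0:2*fs-1)'/fs;
x = 0.05*randn(size(t)) + 0.1*sin(2*pi*250*t); % noise plus a 250 Hz interfering tone
y = notch(x);                                  % first output is the notch-filtered signal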
In this example, you tune a notch filter in order to eliminate a 250 Hz sinusoidal tone corrupting an
audio signal. You can control both the center frequency and the bandwidth of the notch filter and
listen to the filtered audio signal as you tune the design parameters.
Example Architecture
The audioToneRemovalExampleApp command opens a user interface designed to interact with the
simulation. It also opens a spectrum analyzer to view the spectrum of the audio with and without
filtering and the magnitude response of the notch filter.
audioToneRemovalExampleApp
The notch filter is implemented using dsp.NotchPeakFilter. The filter has two specification
modes: 'Design parameters' and 'Coefficients'. The 'Design parameters' mode allows you to specify
the center frequency and bandwidth in Hz. This is the only mode used in this example. The
'Coefficients' mode allows you to specify the multipliers or coefficients in the filter directly. In the
latter mode, each coefficient affects only one characteristic of the filter (either the center frequency
or the 3 dB bandwidth). In other words, the effect of tuning the coefficients is completely decoupled.
Using MATLAB Coder, you can generate a MEX file for the main processing algorithm by executing
the HelperAudioToneRemovalCodeGeneration command. You can use the generated MEX file by
executing the audioToneRemovalExampleApp(true) command.
Vorbis Decoder
This example shows how to implement a Vorbis decoder, which is a freeware, open-source alternative
to the MP3 standard. This audio decoding format supports the segmentation of encoded data into
small packets for network transmission.
Vorbis Basics
The Vorbis encoding format [1] is an open-source lossy audio compression algorithm similar to
MPEG-1 Audio Layer 3, more commonly known as MP3. Vorbis has many of the same features as
MP3, while adding flexibility and functionality. The Vorbis specification only defines the format of the
bitstream and the decoding algorithm. This allows developers to improve the encoding algorithm over
time and remain compatible with existing decoders.
Encoding starts by splitting the original signal into overlapping frames. Vorbis allows frames of
different lengths so that it can efficiently handle stationary and transient signals. Each frame is
multiplied by a window and transformed using the modified discrete cosine transform (MDCT). The
frames are then split into a rough approximation called the floor, and a remainder called the residue.
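As a quick, Vorbis-independent illustration of the transform step, the Audio Toolbox mdct and imdct
functions form a perfect-reconstruction pair when the window satisfies the Princen-Bradley condition;
the window length used here is an arbitrary choice:
x = randn(4096,1);
win = sqrt(hann(2048,'periodic')); % satisfies the Princen-Bradley condition
C = mdct(x,win);                   % frequency-domain coefficients, 50% overlap
y = imdct(C,win);                  % overlap-added reconstruction of x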
The flexibility of the Vorbis format is illustrated by its use of different methods to represent and
encode the floor and residue portions of the signal. The algorithm introduces modes as a mechanism
to specify these different methods and thereby code various frames differently.
Vorbis uses Huffman coding to compress the data contained in the floor and residue portions. Vorbis
uses a dynamic probability model rather than the static probability model of MP3. Specifically, Vorbis
builds custom codebooks for audio signals, which can differ for 'floor' and 'residue' and from frame to
frame.
After Huffman encoding is complete, the frame data is bitpacked into a logical packet. In Vorbis, a
series of such packets is always preceded by a header. The header contains all the information
needed for correct decoding. This information includes a complete set of codebooks, descriptions of
methods to represent the floor and residue, and the modes and mappings for multichannel support.
The header can also include general information such as bit rates, sampling rate, song and artist
names, etc.
Vorbis provides its own format, known as 'Ogg', to encapsulate logical packets into transport streams.
The Ogg format provides mechanisms such as framing, synchronization, positioning, and error
correction, which are necessary for data transfer over networks.
The Vorbis decoder in this example implements the specifications of the Vorbis I format, which is a
subset of Vorbis. The example model decodes any raw binary OGG file containing a mono or stereo
audio signal. The example model has the capability to decode and play back a wide variety of Vorbis
audio files in real time.
You can test this example with any Vorbis audio file, or with the included handel file. To load the file
into the model, replace the file name in the annotated code at the top level of the model with the
name of the file you want to test. When this step is complete, click the annotated code to load the new
audio file. The model is configured to notify you if the output sampling rate has been changed due to
a change in the input data. In this case, the simulation needs to be restarted with the new sample
rate.
In order to implement a Vorbis decoder in Simulink®, you must address the variable-sized data
packets. This example addresses the variable-sized packets by capturing a whole page of the Ogg
bitstream using the 'OggS' synchronization pattern. For practical purposes, a page is assumed to be
no larger than 5500 bytes. After obtaining a segmentation table at the beginning of the page, the
model extracts logical packets from the remainder of the page. Asynchronous control over the
decoding sequence is implemented using the Stateflow chart 'Decode All Pages of Data'.
Initially, the chart tries to detect the 'OggS' synchronization pattern and then follow the decoding
steps described above. Decoding the page is done with the Simulink function 'decodePage' and then
the model immediately goes back to detecting the next 'OggS' sequence. The state
'ResetPageCounter' is added in parallel with the Stateflow algorithm described above to support the
looping of the compressed input file for an unlimited number of iterations.
Data pages contain different types of information: header, codebooks, and audio signal data. The
'Read Setup Info', 'Read the Header', and 'Decode Audio' subsystems inside the 'decodePage'
Simulink function are responsible for handling each of these different kinds of information.
The decoding process is implemented using MATLAB Function blocks. Most bit-unpacking routines in
the example are implemented with MATLAB code.
The recombining of the floor and residue and the subsequent inverse MDCT (IMDCT) are also
implemented with a MATLAB Function block that uses the fast imdct function of Audio Toolbox. The
variable frame lengths are taken into account using a fixed-size maximum-length frame at the input
and output of the Function block, and by using a window length parameter in both the Function block
code and a Selector block immediately following the Function block.
The IMDCT transforms the frames back to the time domain, ready to be multiplied by the synthesis
window and then combined with an overlap-add operation.
The output block in the top level of the model feeds the output of the decoding block to the audio
playback device on your system. The valid portion of the decoded signal is input to the Audio Device
Writer block.
References
Dynamic Range Compression Using Overlap-Add Reconstruction
This example shows how to compress the dynamic range of a signal by modifying the range of the
magnitude at each frequency bin. This nonlinear spectral modification is followed by an overlap-add
FFT algorithm for reconstruction. This system might be used as a speech enhancement system for the
hearing impaired. The algorithm in this simulation is derived from a patented system for adaptive
processing of telephone voice signals for the hearing impaired originally developed by Alvin M. Terry
and Thomas P. Krauss at US West Advanced Technologies Inc., US patent number 5,388,185.
This system decomposes the input signal into overlapping sections of length 256. The overlap is 192
so that every 64 samples, a new section is defined and a new FFT is computed. After the spectrum is
modified and the inverse FFT is computed, the overlapping parts of the sections are added together.
If no spectral modification is performed, the output is a scaled replica of the input. A reference for
the overlap-add method used for the audio signal reconstruction is Rabiner, L. R. and R. W. Schafer.
Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice Hall, 1978, pgs. 274-277.
Compression maps the dynamic range of the magnitude at each frequency bin from the range 0 to
100 dB to the range ymin to ymax dB. ymin and ymax are vectors in the MATLAB® workspace with
one element for each frequency bin; in this case 256. The phase is not altered. This is a non-linear
spectral modification. By compressing the dynamic range at certain frequencies, the listener should
be able to perceive quieter sounds without being blasted out when they get loud, as in linear
equalization.
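A bare-bones sketch of this analysis-modification-synthesis structure, with the spectral compression
left as a placeholder, is shown below; the random test signal is a stand-in for the speech input.
frameLen = 256; hop = 64;          % 192-sample overlap, as described above
x = randn(8192,1);                 % stand-in for the speech input
win = hann(frameLen,'periodic');
y = zeros(numel(x)+frameLen,1);
for s = 1:hop:(numel(x)-frameLen+1)
    seg = x(s:s+frameLen-1).*win;  % windowed 256-sample section
    S = fft(seg);
    % ... map 20*log10(abs(S)) from [0,100] dB into [ymin,ymax] here ...
    seg = real(ifft(S));
    y(s:s+frameLen-1) = y(s:s+frameLen-1) + seg; % overlap-add reconstruction
end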
To use this system to demonstrate frequency-dependent dynamic range compression, start the
simulation. After repositioning the input and output figures so you can see them at the same time,
change the Slider Gain from 1 to 1000, and then to 10000. Notice how the relative heights of the output peaks
change as you increase the magnitude.
LPC Analysis and Synthesis of Speech
This example shows how to use the Levinson-Durbin and Time-Varying Lattice Filter blocks for low-
bandwidth transmission of speech using linear predictive coding.
Example Model
Example Description
The example consists of two parts: analysis and synthesis. The analysis portion 'LPC Analysis' is found
in the transmitter section of the system. Reflection coefficients and the residual signal are extracted
from the original speech signal and then transmitted over a channel. The synthesis portion 'LPC
Synthesis', which is found in the receiver section of the system, reconstructs the original signal using
the reflection coefficients and the residual signal.
In this simulation, the speech signal is divided into 20 ms frames (160 samples), with an overlap of 10
ms (80 samples). Each frame is windowed using a Hamming window. Eleventh-order autocorrelation
coefficients are found, and then the reflection coefficients are calculated from the autocorrelation
coefficients using the Levinson-Durbin algorithm. The original speech signal is passed through an
analysis filter, which is an all-zero filter whose coefficients are the reflection coefficients obtained
above. The output of the filter is the residual signal. This residual signal is passed through a synthesis
filter which is the inverse of the analysis filter. The output of the synthesis filter is the original signal.
This is played through the 'Audio Device Writer' block.
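The MATLAB sketch below walks through the same analysis and synthesis steps on a single frame;
the file name is illustrative, and the standard xcorr, levinson, and filter functions stand in for the
Simulink blocks.
[x,fs] = audioread('speech_dft_8kHz.wav');  % illustrative 8 kHz speech file
frame = x(1:160).*hamming(160);             % one 20 ms frame
[r,lags] = xcorr(frame,11,'biased');        % autocorrelation up to lag 11
r = r(lags >= 0);
[a,~,k] = levinson(r,11);                   % prediction and reflection coefficients
residual = filter(a,1,frame);               % analysis (all-zero) filter
reconstructed = filter(1,a,residual);       % synthesis (all-pole) filter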
Simulation of a Plucked String
This example shows how to simulate a plucked string using digital waveguide synthesis.
Introduction
A digital waveguide is a computational model of a physical medium through which sound propagates.
It is essentially a bidirectional delay line with some wave impedance, and each delay line can be
thought of as a sampled acoustic traveling wave. A digital waveguide can model a linear, one-
dimensional acoustic system such as a vibrating guitar string.
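The classic Karplus-Strong algorithm is a simplified digital waveguide of this kind and is easy to
sketch directly in MATLAB; the pitch and loop-loss values below are illustrative.
fs = 44100; f0 = 220;                % simulate a 220 Hz string
N = round(fs/f0);                    % delay-line length sets the pitch
buf = 2*rand(N,1) - 1;               % "pluck": fill the line with a noise burst
y = zeros(2*fs,1);                   % two seconds of output
for n = 1:numel(y)
    y(n) = buf(1);
    avg = 0.996*0.5*(buf(1)+buf(2)); % lowpass plus slight loss in the loop
    buf = [buf(2:end); avg];         % circulate through the delay line
end
sound(y,fs)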
The result of the simulation is automatically played back using the Audio Device Writer block. To
see the implementation, look under the Digital Waveguide Synthesis block by right clicking on the
block and selecting Mask > Look Under Mask.
Acknowledgements
This Simulink® implementation is based on a MATLAB® file implementation available from Daniel
Ellis's home page at Columbia University.
References
The online textbook Digital Waveguide Modeling of Musical Instruments by Julius O. Smith III
covers significant background related to digital waveguides.
The Harmony Central website also provides useful background information on a variety of related
topics.
Audio Phaser Using Multiband Parametric Equalizer
This example shows how to implement a real-time audio "phaser" effect which can be tuned by a user
interface (UI). It also shows how to generate a VST plugin for the phaser that you can import into a
Digital Audio Workstation (DAW).
Introduction
The phaser is an audio effect produced when an audio signal is passed through one or more notch
filters. The center frequencies of the notch filters are typically modulated at some consistent rate to
produce a "swirling" effect on the audio. The modulation source is typically a low frequency oscillator
such as a sine wave. Different waveform shapes create different phaser effects.
You can use any audio file with this example. However, the phasing effect is more audible with some
audio files than with others. A file that is suggested for this example is RockGuitar-16-44p1-
stereo-72secs.wav. Another option is to use a pink noise source instead of a file.
This example uses the audiopluginexample.Phaser audio plugin class. The plugin implements a
multi-notch filter with notch frequencies modulated by an audioOscillator. The multi-notch filter
is implemented through the multibandParametricEQ System object. The bands of the equalizer
can be made to act as individual notch filters by setting their gain to -inf.
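For instance, a three-notch response can be sketched directly with multibandParametricEQ; the
frequencies and quality factors below are illustrative, not the plugin's settings:
mPEQ = multibandParametricEQ('NumEQBands',3, ...
    'Frequencies',[400 800 1600],'QualityFactors',[4 4 4], ...
    'PeakGains',[-inf -inf -inf],'SampleRate',44100);
visualize(mPEQ)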
You can test the phaser implemented in audiopluginexample.Phaser using Audio Test Bench. The
audio test bench sets up the audio file reader and audio device writer objects, and streams the audio
through the phaser in a processing loop.
phaser = audiopluginexample.Phaser;
visualize(phaser)
audioTestBench(phaser)
The Audio Test Bench enables you to tune the audio phaser using dials and drop-down menus.
Changing dial or drop-down values updates the magnitude response plot of the phaser in real time.
• Rate - Controls the rate at which the center frequency of the notch filters sweep up and down the
audio spectrum.
• Center Frequency - Controls the center frequency of the lowest notch. The center frequency of
other notches is calculated relative to this value and the modulation source.
• Depth - Controls how far the notch frequencies modulate around the center frequency.
• Quality Factor - Sets the quality factor (or "Q") of each notch. A higher Q setting creates a
narrower bandwidth notch.
• Notches - Sets the number of notch filters. More notches can be used to create a more dramatic
effect.
• Modulation Source - The waveform that controls the center frequencies of the notch filters.
Different waveforms create different sweep sounds.
The audio test bench by default streams audio from a file on disk. You can change it to a sound card
microphone/line-in input, or pink noise (useful for testing).
Click the Run button on the UI to start streaming and hear the phaser effect.
You may find that audio dropouts occur when using higher numbers of notches or high Rate settings.
One way to work around this is to generate a VST plugin to take the place of the portion of the code
that performs the actual audio processing. Switch the Run As drop-down to VST Plugin. When you
run the simulation now, a VST plugin is generated and loaded back into MATLAB for use in the
simulation.
To generate and port a VST plugin to a Digital Audio Workstation, click on the Generate Audio
Plugin button on the toolbar of audio test bench, or run the generateAudioPlugin command.
generateAudioPlugin audiopluginexample.Phaser
Loudness Normalization in Accordance with EBU R 128 Standard
This example shows how to use tools from Audio Toolbox™ to measure loudness, loudness range, and
true-peak value. It also shows how to normalize audio to meet the EBU R 128 standard compliance.
Introduction
Volume normalization was traditionally performed by looking at peak signal measurements. However,
this method had the drawback that excessively compressed audio could pass a signal-level threshold
but still be very loud to hear. The result was a loudness war, where recordings tended to be louder
than before and inconsistent across genres.
The modern solution to the loudness war is to measure the perceived loudness in combination with
a true-peak level measurement. International standards like ITU BS.1770-4, EBU R 128, and ATSC
A/85 have been developed to standardize loudness measurements based on the power of the audio
signal. Many countries have already passed legislations for compliance with broadcast standards on
loudness levels.
In this example, you measure loudness and supplementary parameters for both offline (file-based)
and live (streaming) audio signals. You also see ways to normalize audio to be compliant with target
levels.
Audio Toolbox enables you to measure loudness and associated parameters according to the EBU R
128 standard. This standard defines the following measures of loudness:
• Momentary loudness
• Short-term loudness
• Integrated loudness
• Loudness range (LRA)
• True-peak value
For a more detailed description of these parameters, refer to the documentation for EBU R 128
standard.
For cases where you already have the recorded audio samples, you can use the
integratedLoudness function to measure loudness. It returns the integrated loudness, in units of
LUFS, and loudness range, in units of LU, of the complete audio file.
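For example, assuming the rock guitar clip used elsewhere in these examples:
[x,fs] = audioread('RockGuitar-16-44p1-stereo-72secs.wav');
[loudness,loudnessRange] = integratedLoudness(x,fs)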
EBU R 128 defines the target loudness level to be -23 LUFS. The loudness of the audio file is clearly
above this level. A simple level reduction operation can be used to normalize the loudness.
target = -23;
gaindB = target - loudness;
gain = 10^(gaindB/20);
xn = x.*gain;
audiowrite('RockGuitar_normalized.wav',xn,fs)
For streaming audio, EBU R 128 defines momentary and short-term loudness. You can use the
loudnessMeter System object to measure momentary loudness, short-term loudness, integrated
loudness, loudness range, and true-peak value of a live audio signal.
First, stream the audio signal to your sound card and measure its loudness using loudnessMeter.
The visualize method of loudnessMeter opens a user interface (UI) that displays all the
loudness-related measurements as the simulation progresses.
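The streaming measurement code is not shown in this excerpt. A minimal sketch that sets up the
objects referenced later in this example (the audio file and meter settings are assumptions):
reader = dsp.AudioFileReader('RockGuitar-16-44p1-stereo-72secs.wav');
fs = reader.SampleRate;
player = audioDeviceWriter('SampleRate',fs);
inputLoudness = loudnessMeter('SampleRate',fs);
runningMax = dsp.MovingMaximum('SpecifyWindowLength',false); % track the maximum true-peak
visualize(inputLoudness)
while ~isDone(reader)
    audioIn = reader();
    [mLoud,sLoud,iLoud,LRA,tp] = inputLoudness(audioIn);
    maxTP = runningMax(tp);
    player(audioIn);
end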
release(reader)
release(player)
As you can see on the UI, the loudness of the audio stream is clearly above the -23 LUFS threshold.
Its maximum true-peak level of -0.3 dBTP is also above the threshold of -1 dBTP specified by EBU R
128. Normalizing the loudness of a live audio stream is trickier than normalizing the loudness of a
file. One way to help get the loudness value close to a target threshold is to use an Automatic Gain
Controller (AGC). In the following code, you use the audioexample.AGC System object to normalize
the power of an audio signal to -23 dB. The AGC estimates the audio signal's power by looking at the
previous 400 ms, which is the window size used to calculate momentary loudness. There are two
loudness meters used in this example - one for the input to AGC and one for the output from AGC.
The UIs for the two loudness meters may launch at the same location on your screen, so you will have
to move one to the side to compare the measured loudness before and after AGC.
outputLoudness = loudnessMeter('SampleRate',fs);
gainController = audioexample.AGC('DesiredOutputPower',-23, ...
'AveragingLength',0.4*fs,'MaxPowerGain',20);
reset(inputLoudness) % Reuse the same loudness meter from before
reset(runningMax)
visualize(inputLoudness)
visualize(outputLoudness)
while ~isDone(reader)
audioIn = reader();
loudnessBeforeNorm = inputLoudness(audioIn);
[audioOut, gain] = gainController(audioIn);
[loudnessAfterNorm,~,~,~,tp] = outputLoudness(audioOut);
maxTP = runningMax(tp);
player(audioOut);
end
release(reader)
release(player)
Using AGC not only brought the loudness of the audio close to the target of -23 LUFS, but it also got
the maximum true-peak value below the allowed -1 dBTP. In some cases, the maximum true-peak
value remains above -1 dBTP although the loudness is at or below -23 LUFS. For such scenarios, you
can pass the audio through a limiter.
Multistage Sample-Rate Conversion of Audio Signals
This example shows how to use a multistage/multirate approach to sample rate conversion between
different audio sampling rates.
This example focuses on converting an audio signal sampled at 96 kHz (DVD quality) to an audio
signal sampled at 44.1 kHz (CD quality).
Setup
frameSize = 64;
inFs = 96e3;
Create two spectrum analyzers. These will be used to visualize the frequency content of the original
signal as well as that of the signals converted to 44.1 kHz.
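The source and analyzer setup is not shown in this excerpt. A sketch consistent with the chirp
described below (the object names match those used in the loops; other settings are assumptions):
source = dsp.Chirp('SampleRate',inFs,'SamplesPerFrame',frameSize, ...
    'InitialFrequency',0,'TargetFrequency',inFs/2, ...
    'TargetTime',8,'SweepTime',8);              % 0 to 48 kHz over 8 seconds
SpectrumAnalyzer96 = spectrumAnalyzer('SampleRate',inFs, ...
    'ViewType','spectrum-and-spectrogram');
SpectrumAnalyzer44p1 = spectrumAnalyzer('SampleRate',44.1e3, ...
    'ViewType','spectrum-and-spectrogram');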
The loop below plots the spectrogram and power spectrum of the original 96 kHz signal. The chirp
signal starts at 0 and sweeps to 48 kHz over a simulated time of 8 seconds.
NFrames = 8*inFs/frameSize;
for k = 1:NFrames
sig96 = source(); % Source
SpectrumAnalyzer96(sig96); % Spectrogram
end
release(source)
release(SpectrumAnalyzer96)
In order to convert the signal, dsp.SampleRateConverter is used. A first attempt sets the
bandwidth of interest to 40 kHz, that is, to cover the range [-20 kHz, 20 kHz]. This is the generally
accepted range audible to humans. The stopband attenuation of the filters used to remove spectral
images and aliased replicas is left at the default value of 80 dB.
BW40 = 40e3;
OutFs = 44.1e3;
SRC40kHz80dB = dsp.SampleRateConverter(Bandwidth=BW40, ...
InputSampleRate=inFs,OutputSampleRate=OutFs);
Use info to get information on the filters that are designed to perform the conversion. This reveals
that the conversion will be performed in two steps. The first step involves a decimation by two filter
which converts the signal from 96 kHz to 48 kHz. The second step involves an FIR rate converter that
interpolates by 147 and decimates by 160. This results in the 44.1 kHz required. The freqz
command can be used to visualize the combined frequency response of the two stages involved.
Zooming in reveals that the passband extends up to 20 kHz as specified and that the passband ripple
is in the milli-dB range (less than 0.003 dB).
info(SRC40kHz80dB)
[H80dB,f] = freqz(SRC40kHz80dB,0:10:25e3);
plot(f,20*log10(abs(H80dB)/norm(H80dB,inf)))
xlabel("Frequency (Hz)")
ylabel("Magnitude (dB)")
axis([0 25e3 -140 5])
Asynchronous Buffer
The sample rate conversion from 96 kHz to 44.1 kHz produces 147 samples for every 320 input
samples. Because the chirp signal is generated with frames of 64 samples, an asynchronous buffer is
needed. The chirp signal is written 64 samples at a time, and whenever there are enough samples
buffered, 320 of them are read and fed to the sample rate converter.
buff = dsp.AsyncBuffer;
The loop below performs the sample rate conversion in streaming fashion. The computation is fast
enough to operate in real time if need be.
The spectrogram and power spectrum of the converted signal are plotted. The extra lines in the
spectrogram correspond to spectral aliases/images remaining after filtering. The replicas are
attenuated by better than 80 dB as can be verified with the power spectrum plot.
srcFrameSize = 320;
for k = 1:NFrames
sig96 = source(); % Generate chirp
write(buff,sig96); % Buffer data
if buff.NumUnreadSamples >= srcFrameSize
sig96buffered = read(buff,srcFrameSize);
sig44p1 = SRC40kHz80dB(sig96buffered); % Convert sample-rate
SpectrumAnalyzer44p1(sig44p1); % View spectrum of converted signal
end
end
release(source)
release(SpectrumAnalyzer44p1)
release(buff)
In order to improve the sample rate converter quality, two changes can be made. First, the bandwidth
can be extended from 40 kHz to 43.5 kHz. This in turn requires filters with a sharper transition.
Second, the stopband attenuation can be increased from 80 dB to 160 dB. Both these changes come
at the expense of more filter coefficients overall, as well as more multiplications per input sample.
BW43p5 = 43.5e3;
SRC43p5kHz160dB = dsp.SampleRateConverter(Bandwidth=BW43p5, ...
InputSampleRate=inFs,OutputSampleRate=OutFs, ...
StopbandAttenuation=160);
The previous sample rate converter involved 8618 filter coefficients and a computational cost of 42.3
multiplications per input sample. By increasing the bandwidth and stopband attenuation, the cost
increases substantially to 123896 filter coefficients and 440.34 multiplications per input sample. The
frequency response reveals a much sharper filter transition as well as larger stopband attenuation.
Moreover, the passband ripple is now in the micro-dB scale. NOTE: this implementation involves the
design of very long filters, which takes several minutes to complete. However, this is a one-time cost
incurred offline (before the actual sample rate conversion).
info(SRC43p5kHz160dB)
[H160dB,f] = freqz(SRC43p5kHz160dB,0:10:25e3);
plot(f,20*log10(abs(H160dB)/norm(H160dB,inf)));
xlabel("Frequency (Hz)")
ylabel("Magnitude (dB)")
axis([0 25e3 -250 5])
The processing is repeated with the more precise sample rate converter.
Once again the spectrogram and power spectrum of the converted signal are plotted. Notice that the
imaging/aliasing is attenuated enough that they are not visible in the spectrogram. The power
spectrum shows spectral aliases attenuated by more than 160 dB (the peak is at about 20 dB).
for k = 1:NFrames
sig96 = source(); % Generate chirp
over = write(buff,sig96); % Buffer data
if buff.NumUnreadSamples >= srcFrameSize
[sig96buffered,under] = read(buff,srcFrameSize);
sig44p1 = SRC43p5kHz160dB(sig96buffered); % Convert sample-rate
SpectrumAnalyzer44p1(sig44p1); % View spectrum of converted signal
end
end
release(source)
release(SpectrumAnalyzer44p1)
release(buff)
Graphic Equalization
This example demonstrates two forms of graphic equalizers constructed using building blocks from
Audio Toolbox™. It also shows how to export them as VST plugins to be used in a Digital Audio
Workstation (DAW).
Graphic Equalizers
Equalizers are commonly used by audio engineers and consumers to adjust the frequency response of
audio. For example, they can be used to compensate for bias introduced by speakers, or to add bass
to a song. They are essentially a group of filters designed to provide a custom overall frequency
response.
While parametric equalizers are useful when you want to fine-tune the frequency response, there are
simpler equalizers for cases when you need fewer controls. Octave, two-third octave, and one-third
octave have emerged as common bandwidths for equalizers based on the behavior of the human ear.
Standards like ISO 266:1997(E), ANSI S1.11-2004, and IEC 61672-1:2013 define center frequencies
for octave and fractional octave filters. This leaves only one parameter to tune: filter gain. Graphic
equalizers provide control over the gain parameter while using standard center frequencies and
common bandwidths.
In this example, you use two implementations of graphic equalizers. They differ in arrangement of
constituent filters: One uses a bank of parallel octave- or fractional octave-band filters, and the other
uses a cascade of biquad filters. The center frequencies in both implementations follow the ANSI
S1.11-2004 standard.
One way to construct a graphic equalizer is to place a group of bandpass filters in parallel. The
bandwidth of each filter is octave or fractional octave, and their center frequency is set so that
together they cover the audio frequency range of [20, 20000] Hz.
You can tune the gains to boost or cut the corresponding frequency band while the simulation runs.
Because the gains are independent of the filter design, tuning the gains does not have a significant
computational cost. The parallel filter structure is well suited to parallel hardware implementation.
The magnitude response of each bandpass filter should be close to zero at frequencies outside its
bandwidth to avoid interaction between the filters. However, this is not practical, so some inter-band
interference remains.
You can use the graphicEQ System object to implement a graphic equalizer with a parallel structure.
eq = graphicEQ('Structure','Parallel')
eq =
graphicEQ with properties:
EQOrder: 2
Bandwidth: '1 octave'
Structure: 'Parallel'
Gains: [0 0 0 0 0 0 0 0 0 0]
SampleRate: 44100
This designs a parallel implementation of second order filters with 1-octave bandwidth. It takes ten
octave filters to cover the range of audible frequencies. Each element of the Gains property controls
the gain of one branch of the parallel configuration.
Configure the object you created to boost low and high frequencies, similar to a rock preset.
eq.Gains = [4, 4.2, 4.6, 2.7, -3.7, -5.2, -2.5, 2.3, 5.4, 6.5, 6.5]
eq =
graphicEQ with properties:
EQOrder: 2
Bandwidth: '1 octave'
Structure: 'Parallel'
Gains: [4 4.2000 4.6000 2.7000 -3.7000 -5.2000 -2.5000 2.3000 5.4000 6.5000]
SampleRate: 44100
visualize(eq)
You can test the equalizer implemented in graphicEQ using Audio Test Bench. The audio test bench
sets up the audio file reader and audio device writer objects, and streams the audio through the
equalizer in a processing loop. It also assigns a slider to each gain value and labels the center
frequency it corresponds to, so you can easily change the gain and hear its effect. Modifying the value
of the slider simultaneously updates the magnitude response plot.
audioTestBench(eq)
A different implementation of the graphic equalizer uses cascaded equalizing filters (peak or notch)
implemented as biquad filters. The transfer function of the equalizer can be written as a product of
the transfer function of individual biquads.
To motivate the usefulness of this implementation, first look at the magnitude response of the parallel
structure when all gains are 0 dB.
parallelGraphicEQ = graphicEQ('Structure','Parallel');
visualize(parallelGraphicEQ)
You will notice that the magnitude response is not flat. This is because the filters have been designed
independently, and each has a transition width where the magnitude response droops. Moreover,
because of the non-ideal stopband response, there is leakage from the stopband of one filter to the passband of its
neighbor. The leakage can cause actual gains to differ from expected gains.
parallelGraphicEQ_10dB = graphicEQ('Structure','Parallel');
parallelGraphicEQ_10dB.Gains = 10*ones(1,10);
visualize(parallelGraphicEQ_10dB)
Note that the gains are never 10 dB in the frequency response. A cascaded structure can mitigate
this to an extent because the gain is inherent in the design of the filter. Setting the gain of all
cascaded biquads to 0 dB leads to them being bypassed. Since there are no branches in this type of
structure, this means you have a no-gain path between the input and the output. graphicEQ
implements the cascaded structure by default.
cascadeGraphicEQ = graphicEQ;
visualize(cascadeGraphicEQ)
Moreover, when you set the gains to 10 dB, notice that the resultant frequency response has close to
10 dB of gain at the center frequencies.
cascadeGraphicEQ_10dB = graphicEQ;
cascadeGraphicEQ_10dB.Gains = 10*ones(1,10);
visualize(cascadeGraphicEQ_10dB)
The drawback of the cascade design is that the coefficients of a biquad stage must be redesigned
whenever the corresponding gain changes. This isn't needed for the parallel implementation because
gain is just a multiplier to each parallel branch. A parallel connection of bandpass filters also avoids
accumulating phase errors and quantization noise found in the cascade.
The graphicEQ object supports 1 octave, 2/3 octave, and 1/3 octave bandwidths. Reducing
the bandwidth of the individual filters gives you finer control over the frequency response. To verify this, set
the gains to boost mid frequencies, similar to a pop preset.
octaveGraphicEQ = graphicEQ;
octaveGraphicEQ.Gains = [-2.1,-1.8,-1.4,2.7,4.2,4.6,3.1,-1,-1.8,-1.8,-1.4];
visualize(octaveGraphicEQ)
oneThirdOctaveGraphicEQ = graphicEQ;
oneThirdOctaveGraphicEQ.Bandwidth = '1/3 octave';
oneThirdOctaveGraphicEQ.Gains = [-2,-1.9,-1.8,-1.6,-1.5,-1.4,0,1.2,2.7, ...
3.2,3.8,4.2,4.4,4.5,4.6,4,3.5,3.1,1.5,-0.1,-1,-1.2,-1.6,-1.8,-1.8, ...
-1.8,-1.8,-1.7,-1.5,-1.4,-1.3];
visualize(oneThirdOctaveGraphicEQ)
To generate and port a VST plugin to a Digital Audio Workstation, run the generateAudioPlugin
command. For example, you can generate a two-third octave graphic equalizer through the
commands shown below. You will need to be in a directory with write permissions when you run these
commands.
twoThirdOctaveGraphicEQ = graphicEQ;
twoThirdOctaveGraphicEQ.Bandwidth = '2/3 octave';
createAudioPluginClass(twoThirdOctaveGraphicEQ);
generateAudioPlugin twoThirdOctaveGraphicEQPlugin
You can use the same features described in this example in Simulink through the Graphic EQ block. It
provides a slider for each gain value so you can easily boost or cut a frequency band while the
simulation is running.
Audio Weighting Filters
This example shows how to obtain A-weighting and C-weighting filters using the weightingFilter
System object in the Audio Toolbox™.
In many applications involving acoustic measurements, the final sensor is the human ear. For this
reason, acoustic measurements usually attempt to describe the subjective perception of a sound by
this organ. Instrumentation devices are built to provide a linear response, but the ear is a nonlinear
sensor. Special filters, known as weighting filters, are used to account for the nonlinearities.
You can design A and C weighting filters that follow the ANSI S1.42 [1 on page 1-199] and IEC
61672-1 [2 on page 1-199] standards using weightingFilter System object. An A-weighting filter is
a bandpass filter designed to simulate the perceived loudness of low-level tones. An A-weighting filter
progressively de-emphasizes frequencies below 500 Hz. A C-weighting filter removes sounds outside
the audio range of 20 Hz to 20 kHz and simulates the loudness perception of high-level tones. The
following code designs an IIR filter for A-weighting with a sampling rate of 48 kHz.
AWeighting = weightingFilter('A-weighting',48000)
AWeighting =
weightingFilter with properties:
Method: 'A-weighting'
SampleRate: 48000
A- and C-weighting filter designs are based on a direct implementation of the filter's transfer
function, using the poles and zeros specified in the ANSI S1.42 standard.
The IEC 61672-1 standard requires that the filter magnitudes fall within a specified tolerance mask.
The standard defines two masks, one with stricter tolerance values than the other. A filter that meets
the tolerance specifications of the stricter mask is referred to as a Class 1 filter. A filter that meets
the specifications of the less strict mask is referred to as a Class 2 filter. You can view the magnitude
response of the filter along with a mask corresponding to Class 1 or Class 2 specifications by calling
the visualize method on the object. Note that the choice of the Class value will not affect the filter
design itself but it will be used to render the correct tolerance mask in the visualization plot.
visualize(AWeighting,'class 1')
The A- and C-weighting standards specify tolerance magnitude values for up to 20 kHz. In the
following example we use a sample rate of 28 kHz and design a C-weighting filter. Even though the
Nyquist interval for this sample rate is below the maximum specified 20 kHz frequency, the design
still meets the Class 2 tolerances as shown by the green mask around the magnitude response plot.
The design, however, does not meet Class 1 tolerances because of the low sample rate, and you will
see the mask around the magnitude response plot turn red.
CWeighting = weightingFilter('C-weighting',28000)
CWeighting =
weightingFilter with properties:
Method: 'C-weighting'
SampleRate: 28000
visualize(CWeighting,'class 2')
visualize(CWeighting,'class 1')
References
[1] "Design Response of Weighting Networks for Acoustical Measurements." American National
Standard, ANSI S1.42–2001.
[2] "Electroacoustics Sound Level Meters Part 1: Specifications." IEC 61672-1, First Edition, 2002–05.
Sound Pressure Measurement of Octave Frequency Bands
This example demonstrates how to measure sound pressure levels of octave frequency bands. A user
interface (UI) enables you to experiment with various parameters while the measurement is
displayed.
Many applications involving acoustic measurements must take into account the non-linear
characteristics of the human auditory system. For that reason, sound levels are generally reported in
decibels (dB) and on a frequency scale that increases logarithmically. Frequency weighting adjusts
levels to take into account the ear's frequency-dependent sensitivity. A-weighting is the most
common, as it cuts low and high frequencies similarly to the auditory system at "normal" levels.
C-weighting is an alternative for measuring very loud sounds, as it mimics the human ear's flatter
response at levels over 100 dB.
This example uses the splMeter System object to measure sound pressure levels (SPL). You can
measure sound pressure levels of audio files or perform live SPL measurements with a microphone.
You can specify the weighting filter (Z/A/C) and frequency bandwidth used for the measurements. For
more information on the weighting filters, see the “Audio Weighting Filters” on page 1-197 example.
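As a rough sketch of how such a measurement can be set up programmatically (the audio file and the parameter values here are illustrative, not the ones used by the app):

reader = dsp.AudioFileReader("Counting-16-44p1-mono-15secs.wav");
SPL = splMeter(Bandwidth="1 octave", ...
    FrequencyWeighting="A-weighting", ...
    SampleRate=reader.SampleRate);
while ~isDone(reader)
    [Lt,Leq] = SPL(reader());  % time-weighted and equivalent-continuous levels per band
end
release(reader)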
MATLAB Simulation
soundPressureMeasurementExampleApp loads the SPL meter user interface (shown below). The
demonstration begins with pink noise, which measures relatively flat on the octave frequency scale.
You can experiment with different audio sources, frequency weightings, and bandwidths.
Cochlear Implant Speech Processor
This example shows how to simulate the design of a cochlear implant that can be placed in the inner
ear of a profoundly deaf person to restore partial hearing. Signal processing is used in cochlear
implants to convert sound to electrical pulses. The pulses can bypass the damaged parts of a deaf
person's ear and be transmitted to the brain to provide partial hearing.
This example highlights some of the choices made when designing cochlear implant speech
processors using Audio Toolbox™. In particular, the benefits of using a cascaded multirate, multistage
FIR filter bank instead of a parallel, single-rate, second-order-section IIR filter bank are shown.
Human Hearing
Converting sound into something the human brain can understand involves the inner, middle, and
outer ear, hair cells, neurons, and the central nervous system. When a sound is made, the outer ear
picks up acoustic waves, which are converted into mechanical vibrations by tiny bones in the middle
ear. The vibrations move to the inner ear, where they travel through fluid in a snail-shaped structure
called the cochlea. The fluid displaces different points along the basilar membrane of the cochlea.
Displacements along the basilar membrane contain the frequency information of the acoustic signal.
A schematic of the membrane is shown here (not drawn to scale).
Different frequencies cause the membrane to displace maximally at different positions. Low
frequencies cause the membrane to be displaced near its apex, while high frequencies stimulate the
membrane at its base. The amplitude of the displacement of the membrane at a particular point is
proportional to the amplitude of the frequency that has excited it. When a sound is composed of many
frequencies, the basilar membrane is displaced at multiple points. In this way the cochlea separates
complex sounds into frequency components.
Each region of the basilar membrane is attached to hair cells that bend proportionally to the
displacement of the membrane. The bending causes an electrochemical reaction that stimulates
neurons to communicate the sound information to the brain through the central nervous system.
Deafness is most often caused by degeneration or loss of hair cells in the inner ear, rather than a
problem with the associated neurons. This means that if the neurons can be stimulated by a means
other than hair cells, some hearing can be restored. A cochlear implant does just that. The implant
electrically stimulates neurons directly to provide information about sound to the brain.
The problem of how to convert acoustic waves to electrical impulses is one that signal processing
helps to solve. Multichannel cochlear implants share a set of common components.
Just as the basilar membrane of the cochlea resolves a wave into its component frequencies, so does
the signal processor in a cochlear implant divide an acoustic signal into component frequencies, that
are each then transmitted to an electrode. The electrodes are surgically implanted into the cochlea of
the deaf person so that they each stimulate the appropriate regions in the cochlea for the frequency
they are transmitting. Electrodes transmitting high-frequency (high-pitched) signals are placed near
the base, while those transmitting low-frequency (low-pitched) signals are placed near the apex.
Nerve fibers in the vicinity of the electrodes are stimulated and relay the information to the brain.
Loud sounds produce high-amplitude electrical pulses that excite a greater number of nerve fibers,
while quiet sounds excite fewer. In this way, information about both the frequencies and amplitudes of the
components making up a sound can be transmitted to the brain of a deaf person.
The block diagram at the top of the model represents a cochlear implant speech processor, from the
microphone which picks up the sound (Input Source block) to the electrical pulses that are generated.
The frequencies increase in pitch from Channel 0, which transmits the lowest frequency, to Channel
7, which transmits the highest.
To hear the original input signal, double-click the Original Signal block at the bottom of the model. To
hear the output signal of the simulated cochlear implant, double-click the Reconstructed Signal block.
There are a number of changes you can make to the model to see how different variables affect the
output of the cochlear implant speech processor. Remember that after you make a change, you must
rerun the model to implement the changes before you listen to the reconstructed signal again.
Research has shown that about eight frequency channels are necessary for an implant to provide
good auditory understanding for a cochlear implant user. Above eight channels, the reconstructed
signal usually does not improve sufficiently to justify the rising complexity. Therefore, this example
resolves the input signal into eight component frequencies, or electrical pulses.
The Speech Synthesized from Generated Pulses block at the bottom left of the model allows you to
play the electrical channels either simultaneously or sequentially. Cochlear implant users often
experience inferior results with simultaneous frequencies, because the electrical pulses interact with
each other and cause interference. Emitting the pulses in an interleaved manner mitigates this
problem for many people. You can toggle the Synthesis mode of the Speech Synthesized From
Generated Pulses block to hear the difference between these two modes. Zoom in on the Time Scope
block to observe that the pulses are interleaved.
Noise presents a significant challenge to cochlear implant users. Select the Add noise parameter in
the Input Source block to simulate the effects of a noisy environment on the reconstructed signal.
Observe that the signal becomes difficult to hear. The Denoise block in the model uses a Soft
Threshold block to attempt to remove noise from the signal. When the Denoise parameter in the
Denoise block is selected, you can listen to the reconstructed signal and observe that not all the noise
is removed. There is no perfect solution to the noise problem, and the results afforded by any
denoising technology must be weighed against its cost.
The purpose of the Filter Bank Signal Processing block is to decompose the input speech signal into
eight overlapping subbands. More information is contained in the lower frequencies of speech signals
than in the higher frequencies. To get as much resolution as possible where the most information is
contained, the subbands are spaced such that the lower-frequency bands are more narrow than the
higher-frequency bands. In this example, the four low-frequency bands are equally spaced, while each
of the four remaining high-frequency bands is twice the bandwidth of its lower-frequency neighbor.
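As an illustration of this spacing (the start frequency and base bandwidth below are assumed values, not necessarily those used by the model):

f0 = 250;                        % lower edge of the first band, Hz (assumed)
w  = 250;                        % bandwidth of each low-frequency band, Hz (assumed)
bandwidths = [w w w w 2*w 4*w 8*w 16*w];
bandEdges  = f0 + [0 cumsum(bandwidths)]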
To examine the frequency contents of the eight filter banks, run the model using the Chirp Source
type in the Input Source block.
Two filter bank implementations are illustrated in this example: a parallel, single-rate, second-order-
section IIR filter bank and a cascaded, multirate, multistage FIR filter bank. Double click on the
Design Filter Banks button to examine their design and frequency specifications.
Parallel Single-Rate SOS IIR Filter Bank: In this bank, the sixth-order IIR filters are implemented as
second-order sections (SOS). The eight filters run in parallel at the input signal rate. You can
look at their frequency responses by double clicking the Plot IIR Filter Bank Response button.
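A minimal sketch of one such subband filter is shown below (the band edges and sample rate are illustrative, not the values used by the model):

fs = 16000;
[z,p,k] = butter(3,[250 500]/(fs/2),"bandpass");  % order 2*3 = 6 bandpass IIR
sos = zp2sos(z,p,k);                              % realize as second-order sections
y = sosfilt(sos,randn(1024,1));                   % filter one frame of test data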
Cascaded Multirate Multistage FIR Filter Bank: The design of this filter bank is based on the
principles of an approach that combines downsampling and filtering at each filter stage. The overall
filter response for each subband is obtained by cascading its components. Double click on the Design
Filter Banks button to examine how design functions from the Audio Toolbox are used in
constructing these filter banks.
Since downsampling is applied at each filter stage, the later stages are running at a fraction of the
input signal rate. For example, the last filter stages are running at one-eighth of the input signal rate.
Consequently, this design is very suitable for implementations on the low-power DSPs with limited
processing cycles that are used in cochlear implant speech processors. You can look at the frequency
responses for this filter bank by double clicking on the Plot FIR Filter Bank Response button.
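The key idea can be sketched with a two-stage decimating cascade (the filter specifications here are illustrative):

frameIn = randn(1024,1);
stage1 = dsp.FIRDecimator(2,designMultirateFIR(1,2));  % halves the rate
stage2 = dsp.FIRDecimator(2,designMultirateFIR(1,2));  % halves it again
frameOut = stage2(stage1(frameIn));                    % runs at fs/4, length 256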
Notice that this design produces sharper and flatter subband definition compared to the parallel
single-rate SOS IIR filter bank. This is another benefit of a multirate, multistage filter design
approach. For a related example see “Multistage Rate Conversion” in the DSP System Toolbox™ FIR
Filter Design examples.
Thanks to Professor Philip Loizou for his help in creating this example.
• Loizou, Philip C., "Mimicking the Human Ear," IEEE® Signal Processing Magazine, Vol. 15, No.
5, pp. 101-130, 1998.
Acoustic Beamforming Using a Microphone Array
This example illustrates microphone array beamforming to extract desired speech signals in an
interference-dominant, noisy environment. Such operations are useful to enhance speech signal
quality for perception or further processing. For example, the noisy environment can be a trading
room, and the microphone array can be mounted on the monitor of a trading computer. If the trading
computer must accept speech commands from a trader, the beamformer operation is crucial to
enhance the received speech quality and achieve the designed speech recognition accuracy.
The example shows two types of time domain beamformers: the time delay beamformer and the Frost
beamformer. It also illustrates how you can use diagonal loading to improve the robustness of the
Frost beamformer. You can listen to the speech signals at each processing step.
First, define a uniform linear array (ULA) to receive the signal. The array contains 10 omnidirectional
elements (microphones) spaced 5 cm apart. Set the upper bound for frequency range of interest to 4
kHz because the signals used in this example are sampled at 8 kHz.
microphone = ...
phased.OmnidirectionalMicrophoneElement('FrequencyRange',[20 4000]);
Nele = 10;
ula = phased.ULA(Nele,0.05,'Element',microphone);
c = 340; % speed of sound, in m/s
Next, simulate the multichannel signal received by the microphone array. Two speech signals are
used as audio of interest. A laughter audio segment is used as interference. The sampling frequency
of the audio signals is 8 kHz.
Because audio signals are usually large, it is often not practical to read the entire signal into
memory. Therefore, in this example, you read and process the signal in a streaming fashion, that is,
break the signal into small blocks at the input, process each block, and then assemble them at the
output.
The incident direction of the first speech signal is -30 degrees in azimuth and 0 degrees in elevation.
The direction of the second speech signal is -10 degrees in azimuth and 10 degrees in elevation. The
interference comes from 20 degrees in azimuth and 0 degrees in elevation.
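Capture these directions as [azimuth; elevation] vectors with names matching the code below:

ang_dft         = [-30; 0];   % first speech signal
ang_cleanspeech = [-10; 10];  % second speech signal
ang_laughter    = [20; 0];    % interference (laughter)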
Now you can use a wideband collector to simulate a 3-second signal received by the array. Notice that
this approach assumes that each input single-channel signal is received at the origin of the array by a
single microphone.
fs = 8000;
collector = phased.WidebandCollector('Sensor',ula,'PropagationSpeed',c, ...
'SampleRate',fs,'NumSubbands',1000,'ModulatedInput', false);
t_duration = 3; % 3 seconds
t = 0:1/fs:t_duration-1/fs;
Generate a white noise signal with a power of 1e-4 Watts to represent the thermal noise for each
sensor. A local random number stream ensures reproducible results.
prevS = rng(2008);
noisePwr = 1e-4;
Run the simulation. At the output, the received signal is stored in a 10-column matrix. Each column of
the matrix represents the signal collected by one microphone. Note that the audio is played back
during the simulation.
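The playback object used inside the loop can be set up as follows (a minimal sketch, assuming the default audio output device):

player = audioDeviceWriter(SampleRate=fs);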
% preallocate
NSampPerFrame = 1000;
NTSample = t_duration*fs;
sigArray = zeros(NTSample,Nele);
voice_dft = zeros(NTSample,1);
voice_cleanspeech = zeros(NTSample,1);
voice_laugh = zeros(NTSample,1);
% simulate
for m = 1:NSampPerFrame:NTSample
sig_idx = m:m+NSampPerFrame-1;
x1 = dftFileReader();
x2 = speechFileReader();
x3 = 2*laughterFileReader();
temp = collector([x1 x2 x3], ...
[ang_dft ang_cleanspeech ang_laughter]) + ...
sqrt(noisePwr)*randn(NSampPerFrame,Nele);
player(0.5*temp(:,3));
sigArray(sig_idx,:) = temp;
voice_dft(sig_idx) = x1;
voice_cleanspeech(sig_idx) = x2;
voice_laugh(sig_idx) = x3;
end
Notice that the laughter masks the speech signals, rendering them unintelligible. Plot the signal in
channel 3.
plot(t,sigArray(:,3));
xlabel('Time (sec)'); ylabel ('Amplitude (V)');
title('Signal Received at Channel 3'); ylim([-3 3]);
The time delay beamformer compensates for the arrival time differences across the array for a signal
coming from a specific direction. The time aligned multichannel signals are coherently averaged to
improve the signal-to-noise ratio (SNR). Define a steering angle corresponding to the incident
direction of the first speech signal and construct a time delay beamformer.
angSteer = ang_dft;
beamformer = phased.TimeDelayBeamformer('SensorArray',ula, ...
'SampleRate',fs,'Direction',angSteer,'PropagationSpeed',c)
beamformer =
Process the synthesized signal, then plot and listen to the output of the conventional beamformer.
signalsource = dsp.SignalSource('Signal',sigArray, ...
'SamplesPerFrame',NSampPerFrame);
cbfOut = zeros(NTSample,1);
for m = 1:NSampPerFrame:NTSample
temp = beamformer(signalsource());
player(temp);
cbfOut(m:m+NSampPerFrame-1,:) = temp;
end
plot(t,cbfOut);
xlabel('Time (s)'); ylabel ('Amplitude');
title('Time Delay Beamformer Output'); ylim([-3 3]);
You can measure the speech enhancement by the array gain, which is the ratio of the output signal-
to-interference-plus-noise ratio (SINR) to the input SINR.
agCbf = pow2db(mean((voice_cleanspeech+voice_laugh).^2+noisePwr)/ ...
mean((cbfOut - voice_dft).^2))
agCbf =
9.5022
Notice that the first speech signal begins to emerge in the time delay beamformer output. You obtain
an SINR improvement of about 9.5 dB. However, the background laughter is still comparable to the speech.
To obtain better beamformer performance, use a Frost beamformer.
By attaching FIR filters to each sensor, the Frost beamformer has more beamforming weights to
suppress the interference. It is an adaptive algorithm that places nulls at learned interference
directions to better suppress the interference. In the steering direction, the Frost beamformer uses
distortionless constraints to ensure desired signals are not suppressed. Create a Frost beamformer
with a 20-tap FIR after each sensor.
frostbeamformer = ...
phased.FrostBeamformer('SensorArray',ula,'SampleRate',fs, ...
'PropagationSpeed',c,'FilterLength',20,'DirectionSource','Input port');
Process and play the synthesized signal using the Frost beamformer.
reset(signalsource);
FrostOut = zeros(NTSample,1);
for m = 1:NSampPerFrame:NTSample
temp = frostbeamformer(signalsource(),ang_dft);
player(temp);
FrostOut(m:m+NSampPerFrame-1,:) = temp;
end
plot(t,FrostOut);
xlabel('Time (sec)'); ylabel ('Amplitude (V)');
title('Frost Beamformer Output'); ylim([-3 3]);
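The array gain can be computed following the same pattern as agCbf above:

agFrost = pow2db(mean((voice_cleanspeech+voice_laugh).^2+noisePwr)/ ...
    mean((FrostOut - voice_dft).^2))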
agFrost =
14.4385
Notice that the interference is now canceled. The Frost beamformer has an array gain of about
14.4 dB, which is roughly 5 dB higher than that of the time delay beamformer. The performance
improvement is impressive, but comes at a high computational cost. In the preceding example, an FIR filter of order 20 is
used for each microphone. With all 10 sensors, it needs to invert a 200-by-200 matrix, which may be
expensive in real-time processing.
Next, steer the array in the direction of the second speech signal. Suppose you only know a rough
estimate of azimuth -5 degrees and elevation 5 degrees for the direction of the second speech signal.
release(frostbeamformer);
ang_cleanspeech_est = [-5; 5]; % Estimated steering direction
reset(signalsource);
FrostOut2 = zeros(NTSample,1);
for m = 1:NSampPerFrame:NTSample
temp = frostbeamformer(signalsource(), ang_cleanspeech_est);
player(temp);
FrostOut2(m:m+NSampPerFrame-1,:) = temp;
end
plot(t,FrostOut2);
xlabel('Time (sec)'); ylabel ('Amplitude (V)');
title('Frost Beamformer Output'); ylim([-3 3]);
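Following the same pattern, compute the array gain with the second speech signal as the desired signal (the interference is now the first speech signal plus the laughter):

agFrost2 = pow2db(mean((voice_dft+voice_laugh).^2+noisePwr)/ ...
    mean((FrostOut2 - voice_cleanspeech).^2))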
agFrost2 =
6.1927
The speech is barely audible. Despite the 6.2 dB array gain from the beamformer, performance suffers from
the inaccurate steering direction. One way to improve the robustness of the Frost beamformer
against direction of arrival mismatch is to use diagonal loading. This approach adds a small quantity
to the diagonal elements of the estimated covariance matrix. The drawback of this method is that it is
difficult to estimate the correct loading factor. Here you try diagonal loading with a value of 1e-3.
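One way to apply the loading is through the beamformer's DiagonalLoadingFactor property, set before reprocessing the signal:

frostbeamformer.DiagonalLoadingFactor = 1e-3;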
reset(signalsource);
FrostOut2_dl = zeros(NTSample,1);
for m = 1:NSampPerFrame:NTSample
temp = frostbeamformer(signalsource(),ang_cleanspeech_est);
player(temp);
FrostOut2_dl(m:m+NSampPerFrame-1,:) = temp;
end
plot(t,FrostOut2_dl);
xlabel('Time (sec)'); ylabel ('Amplitude (V)');
title('Frost Beamformer Output'); ylim([-3 3]);
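Compute the array gain for the diagonally loaded beamformer in the same way:

agFrost2_dl = pow2db(mean((voice_dft+voice_laugh).^2+noisePwr)/ ...
    mean((FrostOut2_dl - voice_cleanspeech).^2))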
agFrost2_dl =
6.4788
The output speech signal is improved and you obtain a 0.3 dB gain improvement from the diagonal
loading technique.
release(frostbeamformer);
release(signalsource);
release(player);
rng(prevS);
Summary
This example shows how to use time domain beamformers to retrieve speech signals from noisy
microphone array measurements. The example also shows how to simulate an interference-dominant
signal received by a microphone array. The example used both time delay and the Frost beamformers
and compared their performance. The Frost beamformer has a better interference suppression
capability. The example also illustrates the use of diagonal loading to improve the robustness of the
Frost beamformer.
Reference
[1] O. L. Frost III, "An Algorithm for Linearly Constrained Adaptive Array Processing," Proceedings of the
IEEE, Vol. 60, No. 8, Aug. 1972, pp. 925-935.
Identification and Separation of Panned Audio Sources in a Stereo Mix
This example shows how to extract an audio source from a stereo mix based on its panning
coefficient. This example illustrates MATLAB® and Simulink® implementations.
Introduction
Panning is a technique used to spread a mono or stereo sound signal into a new stereo or multi-
channel sound signal. Panning can simulate the spatial perspective of the listener by varying the
amplitude or power level of the original source across the new audio channels.
Panning is an essential component of sound engineering and stereo mixing. In studio stereo
recordings, different sources or tracks (corresponding to different musical instruments, voices, and
other sound sources) are often recorded separately and then mixed into a stereo signal. Panning is
usually controlled by a physical or virtual control knob that may be placed anywhere from the "hard-
left" position (usually referred to as 8 o'clock) to the hard-right position (4 o'clock). When a signal is
panned to the 8 o'clock position, the sound only appears in the left channel (or speaker). Conversely,
when a signal is panned to the 4 o'clock position, the sound only appears in the right speaker. At the
12 o'clock position, the sound is equally distributed across the two speakers. An artificial position or
direction relative to the listener may be generated by varying the level of panning.
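For example, a simple constant-power pan of a mono signal into a stereo pair can be written as follows (a sketch, not the exact panning law used by the model):

x = randn(48000,1);                     % mono source (illustrative)
pan = 0.2;                              % 0 = hard left, 1 = hard right
theta = pan*pi/2;
stereo = [cos(theta)*x sin(theta)*x];   % left and right channels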
Source separation consists of the identification and extraction of individual audio sources from a
stereo mix recording. Source separation has many applications, such as speech enhancement,
sampling of musical sounds for electronic music composition, and real-time speech separation. It also
plays a role in stereo-to-multichannel (e.g. 5.1 or 7.1) upmix, where the different extracted sources
may be distributed across the channels of the new mix.
This example showcases a source separation algorithm applied to an audio stereo signal. The stereo
signal is a mix of two independently panned audio sources: The first source is a man counting from
one to ten, and the second source is a toy train whistle.
The example uses a frequency-domain technique based on short-time FFT analysis to identify and
separate the sources based on their different panning coefficients.
Simulink Version
The model audiosourceseparation implements the panned audio source separation example.
The stereo signal is mixed in the Panned Source subsystem. The stereo signal is formed of two
panned signals as shown below.
The train whistle source is panned with a constant panning coefficient of 0.2. You may vary the
panning coefficient of the speech source by double-clicking the Panned Source subsystem and
modifying the position of the 'Panning Index' knob.
The source separation algorithm is implemented in the 'Compute Panning Index Function' subsystem.
The algorithm is based on the comparison of the short-time Fourier Transforms of the right and left
channels of the stereo mix. A frequency-domain, time-varying panning index function [1] is computed
based on the cross-correlations of the left and right short-time FFT pair. There is a one-to-one
relationship between the panning coefficient of the sources and the derived panning index. A
running-window histogram is implemented in the 'Panning Index Histogram' subsystem to identify the
dominant panning indices in the mix. The desired source is then unmixed by applying a masking
function modeled using a Gaussian window centered at the target panning index. Finally, the unmixed
extracted source is obtained by applying a short-time IFFT.
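The core of the similarity measure behind the panning index can be sketched for a single STFT frame as follows (simplified from [1]; the frame data here is illustrative):

frameLen = 1024;
xl = randn(frameLen,1);                   % left-channel frame (illustrative)
xr = 0.7*xl + 0.1*randn(frameLen,1);      % right-channel frame (illustrative)
XL = fft(xl.*hann(frameLen));
XR = fft(xr.*hann(frameLen));
% Similarity is close to 1 in bins where both channels carry the same source
similarity = 2*abs(XL.*conj(XR))./(abs(XL).^2 + abs(XR).^2 + eps);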
The mixed signal and the extracted speech signal are visualized using a scope. The estimated panning
coefficient is shown on a Display block. You can listen to either the mixed stereo or the unmixed
speech source by flipping the manual switch at the input of the Audio Device Writer block. The
streaming algorithm can adapt to a change in the value of the panning coefficient. For example, you
can modify the panning coefficient from 0.4 to 0.6 and observe that the displayed panning coefficient
value is updated with the correct value.
MATLAB Version
Running audioSourceSeparationApp opens a UI designed to interact with the simulation. The UI allows
you to tune the panning coefficient of the speech source. You can also toggle between listening to
either the mixed signal (whistle + speech) or the unmixed speech signal by changing the value of the
'Audio Output' drop-down box in the UI. The UI also has three buttons: the 'Reset' button resets the
simulation internal state to its initial condition, and the 'Pause Simulation' button holds the
simulation until you press it again. You can terminate the simulation by either closing the UI or
clicking the 'Stop simulation' button.
Execute audioSourceSeparationApp to run the simulation and plot the results. Note that the
simulation runs until you explicitly stop it.
References
[1] 'A Frequency-Domain Approach to Multichannel Upmix', Avendano, Carlos; Jot, Jean-Marc, JAES
Volume 52 Issue 7/8 pp. 740-749; July 2004
Live Direction of Arrival Estimation with a Linear Microphone Array
This example shows how to acquire and process live multichannel audio. It also presents a simple
algorithm for estimating the Direction Of Arrival (DOA) of a sound source using multiple microphone
pairs within a linear array.
If a multichannel input audio interface is available, then modify this script to set sourceChoice to
'live'. In this mode the example uses live audio input signals. The example assumes all inputs (two
or more) are driven by microphones arranged on a linear array. If no microphone array or
multichannel audio card is available, then set sourceChoice to 'recorded'. In this mode the
example uses prerecorded audio samples acquired with a linear array. For sourceChoice =
'live', the following code uses audioDeviceReader to acquire 4 live audio channels through a
Microsoft® Kinect™ for Windows®. To use another microphone array setup, ensure the installed
audio device driver is one of the conventional types supported by MATLAB® and set the Device
property of audioDeviceReader accordingly. You can query valid Device assignments for your
computer by calling the getAudioDevices object function of audioDeviceReader. Note that even
when using Microsoft Kinect, the device name can vary across machines and may not match the one
used in this example. Use tab completion to get the correct name on your machine.
sourceChoice = 'recorded'; % set to 'live' if a multichannel audio input device is available
Set the duration of live processing. Set how many samples per channel to acquire and process each
iteration.
endTime = 20;
audioFrameLength = 3200;
switch sourceChoice
case 'live'
fs = 16000;
audioInput = audioDeviceReader( ...
'Device','Microphone Array (Microsoft Kinect USB Audio)', ...
'SampleRate',fs, ...
'NumChannels',4, ...
'OutputDataType','double', ...
'SamplesPerFrame',audioFrameLength);
case 'recorded'
% This audio file holds a 20-second recording of 4 raw audio
% channels acquired with a Microsoft Kinect(TM) for Windows(R) in
% the presence of a noisy source moving in front of the array
% roughly from -40 to about +40 degrees and then back to the
% initial position.
audioFileName = 'AudioArray-16-16-4channels-20secs.wav';
audioInput = dsp.AudioFileReader( ...
'OutputDataType','double', ...
'Filename',audioFileName, ...
'PlayCount',inf, ...
'SamplesPerFrame',audioFrameLength);
fs = audioInput.SampleRate;
end
The following values identify the approximate linear coordinates of the 4 built-in microphones of the
Microsoft Kinect™ relative to the position of the RGB camera (not used in this example). For 3D
coordinates use [[x1;y1;z1], [x2;y2;z2], ..., [xN;yN;zN]]
micPositions = [-0.088, 0.042, 0.078, 0.11];
The algorithm used in this example works with pairs of microphones independently. It then combines
the individual DOA estimates to provide a single live DOA output. The more pairs available, the more
robust (but more computationally expensive) the DOA estimation. The maximum number of available pairs
can be computed as nchoosek(length(micPositions),2). In this case, the 3 pairs with the largest
inter-microphone distances are selected. The larger the inter-microphone distance, the more sensitive
the DOA estimate. Each row of the following matrix describes a choice of microphone pair within
the array. All values must be integers between 1 and length(micPositions).
micPairs = [1 4; 1 3; 1 2];
numPairs = size(micPairs, 1);
Create an instance of the helper plotting object DOADisplay. This displays the estimated DOA live
with an arrow on a polar plot.
DOAPointer = DOADisplay();
Use a helper object to rearrange the input samples according to how the microphone pairs are
selected.
bufferLength = 64;
preprocessor = PairArrayPreprocessor( ...
'MicPositions',micPositions, ...
'MicPairs',micPairs, ...
'BufferLength',bufferLength);
micSeparations = getPairSeparations(preprocessor);
The main algorithmic building block of this example is a cross-correlator, which is used in conjunction
with an interpolator to achieve a finer DOA resolution. In this simple case it is sufficient to use the
same two objects across the available pairs. In general, however, different channels may need to save
their internal states independently and hence be handled by separate objects.
interpFactor = 8;
b = interpFactor * fir1((2*interpFactor*8-1),1/interpFactor);
groupDelay = median(grpdelay(b));
interpolator = dsp.FIRInterpolator('InterpolationFactor',interpFactor,'Numerator',b);
Each iteration of the following for loop reads audioFrameLength samples for each audio channel,
processes the data to estimate a DOA value, and displays the result on a bespoke arrow-based polar
visualization.
tic
for idx = 1:(endTime*fs/audioFrameLength)
cycleStart = toc;
% Read a multichannel frame from the audio source
% The returned array is of size AudioFrameLength x size(micPositions,2)
multichannelAudioFrame = audioInput();
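% Estimate a DOA angle for each microphone pair. This is a minimal sketch
% assuming far-field propagation; the full example refines the delay
% estimate using the preprocessor and FIR interpolator objects created above.
anglesInRadians = zeros(1,numPairs);
for p = 1:numPairs
    s1 = multichannelAudioFrame(:,micPairs(p,1));
    s2 = multichannelAudioFrame(:,micPairs(p,2));
    [xc,lags] = xcorr(s2,s1);             % cross-correlate the pair
    [~,iMax] = max(xc);                   % lag of the correlation peak
    tau = lags(iMax)/fs;                  % time delay in seconds
    % Convert the delay to an angle, clamping to the valid range of asin
    anglesInRadians(p) = asin(max(-1,min(1,tau*c/micSeparations(p))));
end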
% Combine DOA estimation across pairs by keeping only the median value
DOAInRadians = median(anglesInRadians);
% Arrow display
DOAPointer(DOAInRadians)
% Pace the loop to approximately real time
pause(max(0,audioFrameLength/fs - toc + cycleStart))
end
release(audioInput)
Positional Audio
This example shows several basic aspects of audio signal positioning. The listener occupies a location
in the center of a circle, and the position of the sound source is varied so that it remains within the
circle. In this example, the sound source is a monaural recording of a helicopter. The sound field is
represented by five discrete speaker locations on the circumference of the circle and a low-frequency
output that is presumed to be in the center of the circle.
Example Prerequisites
This example requires a 5.1-channel speaker configuration, and relies on the audio channels being
mapped to physical locations as follows:
1 Front left
2 Front right
3 Front center
4 Low frequency
5 Rear left
6 Rear right
This is the default Windows® speaker configuration for 5.1 channels. Depending on the type of sound
card used, this example may work reasonably well for other speaker configurations.
Example Basics
There are two source blocks of interest in the model. The first is the audio signal itself, and the
second is the spatial location of the helicopter. The spatial location of the helicopter is represented by
a pair of Cartesian coordinates that are constrained to lie within the unit circle. By default, this
location is determined by the block labeled "Set position randomly." This block supplies the input for
the MATLAB Function block labeled "Speaker volume computation," which determines a matrix of
speaker volumes. The outer product of the sound source with this speaker volume matrix is then
supplied to the six speakers via the To Audio Device block.
You can also determine the helicopter position manually. To do this, select the switch in the model so
that the signal being supplied to the computeVol block is coming from the block labeled "Set position
visually." Then, double-click on the new source block. A GUI appears that enables you to move the
helicopter to different locations within the circle using the mouse, thereby changing the speaker
amplitudes.
The monaural audio source is mixed into six channels, each of which corresponds to a speaker. There
is one low-frequency channel in the center of the circle and five speakers that lie on the
circumference, as shown in the grey area of the GUI above. The listener is represented by a stick
figure in the center of the circle.
The speaker amplitudes are computed as follows:
1. At the center of the circle, all of the amplitudes are equal. The value for each speaker, including
the low-frequency speaker, is set to 1/sqrt(5).
2. On the perimeter of the circle, the amplitudes of the speakers are determined using Vector Base
Amplitude Panning (VBAP). This algorithm operates as follows:
a) Determine the two speakers on either side of the source or, in the degenerate case, the single
speaker.
b) Interpret the vectors determined by the speaker positions in (a) as basis vectors. Use these basis
vectors to represent the normalized source position vector. The coefficients in this new basis
represent the relative speaker amplitudes after normalization (see the code sketch after this list).
For this part of the algorithm, the amplitude of the low-frequency channel is set to zero.
3. As the source moves from the center to the periphery, there is a transition from algorithm (1) to
algorithm (2). This transition decays as a cubic function of the radial distance. The amplitude vectors
are normalized so that the power is constant, independent of the source location.
4. Finally, the amplitudes decay as the distance from the center increases according to an inverse
square law, such that the amplitude at the perimeter of the circle is one-quarter of the amplitude at
the center.
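A minimal two-dimensional sketch of step (b), for a source on the perimeter (the speaker and source angles here are illustrative):

spk1 = [cosd(30);  sind(30)];   % unit vector toward one adjacent speaker
spk2 = [cosd(110); sind(110)];  % unit vector toward the other adjacent speaker
src  = [cosd(70);  sind(70)];   % unit vector toward the source
g = [spk1 spk2] \ src;          % express the source in the speaker basis
g = g/norm(g);                  % normalize for constant power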
For more details about Vector Base Amplitude Panning, please consult the references.
References
Pulkki, Ville. "Virtual Sound Source Positioning Using Vector Base Amplitude Panning." Journal of the
Audio Engineering Society. Vol. 45, No. 6, June 1997.
Surround Sound Matrix Encoding and Decoding
This example shows how to generate a stereo signal from a multichannel audio signal using matrix
encoding, and how to recover the original channels from the stereo mix using matrix decoding. This
example illustrates MATLAB® and Simulink® implementations. This example also shows how
performance can be improved by using dataflow execution domain.
Introduction
Matrix decoding is an audio technique that decodes an audio signal with M channels into an audio
signal with N channels (N > M) for playback on a system with N speakers. The original audio signal
is usually generated using a matrix encoder, which transforms N-channel signals to M-channel
signals.
Matrix encoding and decoding enables the same audio content to be played on different systems. For
example, a surround sound multichannel signal may be encoded into a stereo signal. The stereo
signal may be played back on a stereo system to accommodate settings where a surround sound
receiver does not exist, or it may be decoded and played as surround if surround equipment is
present [1].
In this example, we showcase a matrix encoder used to encode a four-channel signal (left, right,
center and surround) to a stereo signal. The four original signals are then regenerated using a matrix
decoder. This example is a simplified version of the encoding and decoding scheme used in the Dolby
Pro Logic system [2].
Simulink Version
The input to the matrix encoder consists of four separate audio channels (center, left, right and
surround).
Double-click the Audio Channels subsystem to launch a tuning dialog. The dialog enables you to
control the relative power between the right channel and left channel inputs, as well as the power
level of the surround channel.
You can also toggle between listening to any of the original, encoded or decoded audio channels by
double-clicking the Audio Player Selector subsystem and selecting the channel of your choice
from the dialog drop down menu.
Matrix Encoder
The Matrix Encoder encodes the four input channels into a stereo signal.
Notice that since the input left and right channels only contribute to the output left and right
channels, respectively, the output stereo signal conserves the balance between left and right
channels.
The surround input channel is passed through a Hilbert transformer, thereby creating a 180 degree
phase differential between the surround components feeding the left and right stereo outputs [2].
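A simplified time-domain sketch of this encoding is shown below (the signal names and the common -3 dB gains are illustrative; the model's exact coefficients may differ):

n = 48000;                              % one second of illustrative signals
left = randn(n,1); right = randn(n,1);
center = randn(n,1); surround = randn(n,1);
g = 1/sqrt(2);                          % common mix gain (assumed)
s90 = imag(hilbert(surround));          % ~90-degree phase-shifted surround
Lt = left  + g*center + g*s90;          % encoded total left
Rt = right + g*center - g*s90;          % encoded total right (surround 180 degrees out of phase)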
You may listen to the encoded left and right stereo signals by double-clicking the Audio Player
Selector subsystem and selecting either the 'Encoded Total Left' or 'Encoded Total Right' channels.
Matrix Decoder
The Matrix Decoder extracts the four original channels from the encoded stereo signal.
The lowpass frequencies are first separated using a Linkwitz-Riley cross-over filter. For more
information about the implementation of the Linkwitz-Riley filter, refer to “Multiband Dynamic Range
Compression” on page 1-147.
The left and right stereo channels are passed through to the left and right output channels,
respectively. Therefore, there is no loss of separation between left and right channels in the output.
The center output channel is equal to the sum of the stereo input signals, thereby cancelling the
phase-shifted surround left and right components.
The surround output channel is derived by first taking the difference of the stereo signals. Since the
original input center signal contributes equally to both stereo channels, the center channel does not
leak into the output surround signal. Moreover, note that the original left and right signals contribute
to the output surround channel. The surround signal is delayed by 10 msec to achieve a precedence
effect [3].
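Continuing the encoding sketch above, the corresponding decoding can be written as follows (again illustrative, not the exact gains used by the model):

fs = 48000;
leftOut   = Lt;                              % left passes straight through
rightOut  = Rt;                              % right passes straight through
centerOut = g*(Lt + Rt);                     % surround components cancel in the sum
surroundOut = g*(Lt - Rt);                   % center component cancels in the difference
delaySamples = round(0.01*fs);               % 10 ms precedence delay
surroundOut = [zeros(delaySamples,1); surroundOut(1:end-delaySamples)];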
You may listen to the decoded surround signal by double-clicking the Audio Player Selector
subsystem and selecting one of the decoded signals.
This example can use the dataflow execution domain in Simulink to take advantage of multiple cores on
your desktop and improve simulation performance. To learn more about dataflow and how to run Simulink
models using multiple threads, see “Multicore Execution Using Dataflow Domain”.
In Simulink, you specify dataflow as the execution domain for a subsystem by setting the Domain
parameter to Dataflow using Property Inspector. To access Property Inspector, in the Simulink
Toolstrip, on the Modeling tab, in the Design gallery select Property Inspector or on the Simulation
tab, Prepare gallery, select Property Inspector.
Dataflow domains automatically partition your model into multiple threads for better performance.
Once you set the Domain parameter to Dataflow, you can use the Multicore tab analysis to
analyze your model to get better performance. The Multicore tab is available in the toolstrip when
there is a dataflow domain in the model. To learn more about the Multicore tab, see “Perform
Multicore Analysis for Dataflow”.
For this example the Multicore tab mode is set to Simulation Profiling for simulation
performance analysis.
For the best simulation performance, optimize the model settings. To accept the proposed model
settings, on the Multicore tab, click Optimize. Alternatively, you can use the drop-down menu below
the Optimize button to change the settings individually.
On the Multicore tab, click the Run Analysis button to start the analysis of the dataflow domain for
simulation performance. Once the analysis is finished, the Analysis Report and Suggestions window
shows how many threads the dataflow subsystem uses during simulation.
After analyzing the model, the Analysis Report and Suggestions window shows one thread because
the data dependency between the blocks in the model prevents blocks from being executed
concurrently. By pipelining the data dependent blocks, the dataflow subsystem can increase
concurrency for higher data throughput. The Analysis Report and Suggestions window shows the
recommended number of pipeline delays as Suggested for Increasing Concurrency. The suggested
latency value is computed to give the best performance.
The following diagram shows the Analysis Report and Suggestions window where the suggested
latency is 2 for the dataflow subsystem.
Click the Accept button to use the recommended latency for the dataflow subsystem. This value can
also be entered directly in the Property Inspector for Latency parameter. Simulink shows the latency
parameter value using tags at the output ports of the dataflow subsystem.
The Analysis Report and Suggestions window now shows the number of threads as 2 meaning that
the blocks inside the dataflow subsystem simulate in parallel using 2 threads. Highlight threads
highlights the blocks with colors based on their thread allocation as shown in the Thread
Highlighting Legend. Show pipeline delays shows where pipelining delays were inserted within
the dataflow subsystem using tags.
When latency is increased in the dataflow execution domain to break data dependencies between
blocks and create concurrency, that delay needs to be accounted for in other parts of the model. For
example, signals that are compared or combined with the signals at the output ports of the dataflow
subsystem must be delayed to align in time with the signals at the output ports of the dataflow
subsystem. In this example, the audio signal from the Audio Channels block that goes to the Audio
Player Selector must be delayed to align with other signals going into the Audio Player Selector
block. To compensate for the latency specified on the dataflow subsystem, use a delay block to delay
this signal by two frames. For this signal, the frame length is 1024. A delay value of 2048 is set in the
delay block to align the signal from the Audio Channels block and the signal processed through the
dataflow subsystem.
To measure the performance improvement gained by using dataflow, compare the execution time of the
model with and without dataflow. The Audio Device Writer block runs in real time and limits the
simulation speed of the model to real time, so comment it out when measuring execution time. On a
Windows desktop computer with an Intel® Xeon® W-2133 CPU @ 3.6 GHz (6 cores, 12 threads), this model
executes about 2.3 times faster with the dataflow domain than the original model.
MATLAB Version
Execute audioMatrixDecoderApp to run the simulation. Note that the simulation runs until you
explicitly stop it.
The UI has three buttons: the 'Reset' button resets the simulation internal state to its initial
condition, and the 'Pause Simulation' button holds the simulation until you press it again. You can
terminate the simulation by either closing the UI or clicking the 'Stop simulation' button.
MATLAB Coder can be used to generate C code for the function HelperAudioMatrixDecoderSim.
In order to generate a MEX-file for your platform, execute the command
HelperMatrixDecodingCodeGeneration from a folder with write permissions.
References
[1] https://en.wikipedia.org/wiki/Matrix_decoder
[2] Dolby Pro Logic Surround Decoder: Principles of Operation, Roger Dressler, Dolby Labs
[3] https://en.wikipedia.org/wiki/Precedence_effect
Speaker Identification Using Pitch and MFCC
This example demonstrates a machine learning approach to identify people based on features
extracted from recorded speech. The features used to train the classifier are the pitch of the voiced
segments of the speech and the mel frequency cepstrum coefficients (MFCC). This is a closed-set
speaker identification: the audio of the speaker under test is compared against all the available
speaker models (a finite set) and the closest match is returned.
Introduction
The approach used in this example for speaker identification is shown in the diagram.
Pitch and MFCC are extracted from speech signals recorded for 10 speakers. These features are used
to train a K-nearest neighbor (KNN) classifier. Then, new speech signals that need to be classified go
through the same feature extraction. The trained KNN classifier predicts which one of the 10
speakers is the closest match.
This section discusses pitch, zero-crossing rate, short-time energy, and MFCC. Pitch and MFCC are
the two features that are used to classify speakers. Zero-crossing rate and short-time energy are used
to determine when the pitch feature is used.
Pitch
Speech can be broadly categorized as voiced and unvoiced. In the case of voiced speech, air from the
lungs is modulated by vocal cords and results in a quasi-periodic excitation. The resulting sound is
dominated by a relatively low-frequency oscillation, referred to as pitch. In the case of unvoiced
speech, air from the lungs passes through a constriction in the vocal tract and becomes a turbulent,
noise-like excitation. In the source-filter model of speech, the excitation is referred to as the source,
and the vocal tract is referred to as the filter. Characterizing the source is an important part of
characterizing the speech system.
As an example of voiced and unvoiced speech, consider a time-domain representation of the word
"two" (/T UW/). The consonant /T/ (unvoiced speech) looks like noise, while the vowel /UW/ (voiced
speech) is characterized by a strong fundamental frequency.
[audioIn,fs] = audioread("Counting-16-44p1-mono-15secs.wav");
twoStart = 110e3;
twoStop = 135e3;
audioIn = audioIn(twoStart:twoStop);
timeVector = linspace(twoStart/fs,twoStop/fs,numel(audioIn));
sound(audioIn,fs)
figure
plot(timeVector,audioIn)
axis([(twoStart/fs) (twoStop/fs) -1 1])
ylabel("Amplitude")
xlabel("Time (s)")
title("Utterance - Two")
A speech signal is dynamic in nature and changes over time. It is assumed that speech signals are
stationary on short time scales, and their processing is done in windows of 20-40 ms. This example
uses a 30 ms window with a 25 ms overlap. Use the pitch function to see how the pitch changes
over time.
windowLength = round(0.03*fs);
overlapLength = round(0.025*fs);
f0 = pitch(audioIn,fs,WindowLength=windowLength,OverlapLength=overlapLength,Range=[50,250]);
figure
tiledlayout(2,1)
nexttile()
plot(timeVector,audioIn)
axis([(110e3/fs) (135e3/fs) -1 1])
ylabel("Amplitude")
xlabel("Time (s)")
title("Utterance - Two")
nexttile()
timeVectorPitch = linspace(twoStart/fs,twoStop/fs,numel(f0));
plot(timeVectorPitch,f0,"*")
axis([(110e3/fs) (135e3/fs) min(f0) max(f0)])
ylabel("Pitch (Hz)")
xlabel("Time (s)")
title("Pitch Contour")
The pitch function estimates a pitch value for every frame. However, pitch is only characteristic of a
source in regions of voiced speech. The simplest method to distinguish between silence and speech is
to analyze the short time energy. If the energy in a frame is above a given threshold, you declare the
frame as speech.
energyThreshold = 20;
[segments,~] = buffer(audioIn,windowLength,overlapLength,"nodelay");
ste = sum((segments.*hamming(windowLength,"periodic")).^2,1);
isSpeech = ste(:) > energyThreshold;
The simplest method to distinguish between voiced and unvoiced speech is to analyze the zero
crossing rate. A large number of zero crossings implies that there is no dominant low-frequency
oscillation. If the zero crossing rate for a frame is below a given threshold, you declare it as voiced.
zcrThreshold = 0.02;
zcr = zerocrossrate(audioIn,WindowLength=windowLength,OverlapLength=overlapLength);
isVoiced = zcr < zcrThreshold;
Combine isSpeech and isVoiced to determine whether a frame contains voiced speech.
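voicedSpeech = isSpeech & isVoiced;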
Remove regions that do not correspond to voiced speech from the pitch estimate and plot.
f0(~voicedSpeech) = NaN;
figure
tiledlayout(2,1)
nexttile()
plot(timeVector,audioIn)
axis([(110e3/fs) (135e3/fs) -1 1])
axis tight
ylabel("Amplitude")
xlabel("Time (s)")
title("Utterance - Two")
nexttile()
plot(timeVectorPitch,f0,"*")
axis([(110e3/fs) (135e3/fs) min(f0) max(f0)])
ylabel("Pitch (Hz)")
xlabel("Time (s)")
title("Pitch Contour")
Mel-Frequency Cepstrum Coefficients (MFCC)
MFCC are popular features extracted from speech signals for use in recognition tasks. In the source-
filter model of speech, MFCC are understood to represent the filter (vocal tract). The frequency
response of the vocal tract is relatively smooth, whereas the source of voiced speech can be modeled
as an impulse train. The result is that the vocal tract can be estimated by the spectral envelope of a
speech segment.
The motivating idea of MFCC is to compress information about the vocal tract (smoothed spectrum)
into a small number of coefficients based on an understanding of the cochlea.
Although there is no hard standard for calculating MFCC, the basic steps are outlined by the
diagram.
The mel filterbank linearly spaces the first 10 triangular filters and logarithmically spaces the
remaining filters. The individual bands are weighted for even energy. The graph represents a typical
mel filterbank.
This example uses audioFeatureExtractor to calculate the MFCC for every file.
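For a single signal, you can also compute the coefficients directly with the mfcc function; for example, using the same window and overlap settings as elsewhere in this example:

[audioIn,fs] = audioread("Counting-16-44p1-mono-15secs.wav");
coeffs = mfcc(audioIn,fs, ...
    Window=hamming(round(0.03*fs),"periodic"), ...
    OverlapLength=round(0.025*fs));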
Data Set
This example uses a subset of the Common Voice data set from Mozilla [1] on page 1-248. The data
set contains 48 kHz recordings of subjects speaking short sentences. The helper function in this
section organizes the downloaded data and returns an audioDatastore object. The data set uses
1.36 GB of memory.
Download the data set if it doesn't already exist and unzip it into tempdir.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","commonvoice.zip");
dataFolder = tempdir;
if ~datasetExists(string(dataFolder) + "commonvoice")
unzip(downloadFolder,dataFolder);
end
Extract the speech files for 10 speakers (5 female and 5 male) and place them into an
audioDatastore using the commonVoiceHelper function. The datastore enables you to collect the
necessary audio files and read them. The helper function is placed in your current folder when you
open this example.
ads = commonVoiceHelper
ads =
audioDatastore with properties:
Files: {
' ...\AppData\Local\Temp\commonvoice\train\clips\common_voice_en_11
' ...\AppData\Local\Temp\commonvoice\train\clips\common_voice_en_11
' ...\AppData\Local\Temp\commonvoice\train\clips\common_voice_en_11
... and 172 more
}
Folders: {
'C:\Users\bhemmat\AppData\Local\Temp\commonvoice\train\clips'
}
Labels: [3; 3; 3 ... and 172 more categorical]
AlternateFileSystemRoots: {}
OutputDataType: 'double'
SupportedOutputFormats: ["wav" "flac" "ogg" "opus" "mp4" "m4a"]
DefaultOutputFormat: "wav"
The splitEachLabel function of audioDatastore splits the datastore into two or more datastores.
The resulting datastores have the specified proportion of the audio files from each label. In this
example, the datastore is split into two parts. 80% of the data for each label is used for training, and
the remaining 20% is used for testing. The countEachLabel function of audioDatastore is used to
count the number of audio files per label. In this example, the label identifies the speaker.
[adsTrain,adsTest] = splitEachLabel(ads,0.8);
Display the datastore and the number of speakers in the train datastore.
adsTrain
adsTrain =
audioDatastore with properties:
Files: {
' ...\AppData\Local\Temp\commonvoice\train\clips\common_voice_en_11
' ...\AppData\Local\Temp\commonvoice\train\clips\common_voice_en_11
' ...\AppData\Local\Temp\commonvoice\train\clips\common_voice_en_11
... and 136 more
}
Folders: {
'C:\Users\bhemmat\AppData\Local\Temp\commonvoice\train\clips'
}
Labels: [3; 3; 3 ... and 136 more categorical]
AlternateFileSystemRoots: {}
OutputDataType: 'double'
SupportedOutputFormats: ["wav" "flac" "ogg" "opus" "mp4" "m4a"]
DefaultOutputFormat: "wav"
trainDatastoreCount = countEachLabel(adsTrain)
trainDatastoreCount=10×2 table
Label Count
_____ _____
1 14
10 12
2 12
3 18
4 14
5 16
6 17
7 11
8 11
9 14
Display the datastore and the number of speakers in the test datastore.
adsTest
adsTest =
audioDatastore with properties:
Files: {
' ...\AppData\Local\Temp\commonvoice\train\clips\common_voice_en_11
' ...\AppData\Local\Temp\commonvoice\train\clips\common_voice_en_11
' ...\AppData\Local\Temp\commonvoice\train\clips\common_voice_en_11
... and 33 more
}
Folders: {
'C:\Users\bhemmat\AppData\Local\Temp\commonvoice\train\clips'
}
Labels: [3; 3; 3 ... and 33 more categorical]
AlternateFileSystemRoots: {}
OutputDataType: 'double'
SupportedOutputFormats: ["wav" "flac" "ogg" "opus" "mp4" "m4a"]
DefaultOutputFormat: "wav"
testDatastoreCount = countEachLabel(adsTest)
testDatastoreCount=10×2 table
Label Count
_____ _____
1 4
10 3
2 3
3 4
4 4
5 4
6 4
7 3
8 3
9 4
To preview the content of your datastore, read a sample file and play it using your default audio
device.
[sampleTrain,dsInfo] = read(adsTrain);
sound(sampleTrain,dsInfo.SampleRate)
Reading from the train datastore advances the read pointer so that you can iterate through the
datastore. Reset the train datastore to return the read pointer to the start before the following
feature extraction.
reset(adsTrain)
Feature Extraction
Extract pitch and MFCC features from each frame that corresponds to voiced speech in the training
datastore. Audio Toolbox™ provides audioFeatureExtractor so that you can quickly and
efficiently extract multiple features. Configure an audioFeatureExtractor to extract pitch, short-
time energy, zcr, and MFCC.
fs = dsInfo.SampleRate;
windowLength = round(0.03*fs);
overlapLength = round(0.025*fs);
afe = audioFeatureExtractor(SampleRate=fs, ...
Window=hamming(windowLength,"periodic"),OverlapLength=overlapLength, ...
zerocrossrate=true,shortTimeEnergy=true,pitch=true,mfcc=true);
When you call the extract function of audioFeatureExtractor, all features are concatenated and
returned in a matrix. You can use the info function to determine which columns of the matrix
correspond to which features.
featureMap = info(afe)
features = [];
labels = [];
energyThreshold = 0.005;
zcrThreshold = 0.2;
allFeatures = extract(afe,adsTrain);
allLabels = adsTrain.Labels;
for ii = 1:numel(allFeatures)
thisFeature = allFeatures{ii};
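% Keep only frames that contain voiced speech, using the short-time energy
% and zero-crossing rate columns returned by the feature extractor
% (thresholds defined above; a reconstruction of the intended masking step)
isSpeech = thisFeature(:,featureMap.shortTimeEnergy) > energyThreshold;
isVoiced = thisFeature(:,featureMap.zerocrossrate) < zcrThreshold;
voicedSpeech = isSpeech & isVoiced;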
thisFeature(~voicedSpeech,:) = [];
thisFeature(:,[featureMap.zerocrossrate,featureMap.shortTimeEnergy]) = [];
label = repelem(allLabels(ii),size(thisFeature,1));
features = [features;thisFeature];
labels = [labels,label];
end
Pitch and MFCC are not on the same scale, which would bias the classifier. Normalize the features by
subtracting the mean and dividing by the standard deviation.
M = mean(features,1);
S = std(features,[],1);
features = (features-M)./S;
Train Classifier
Now that you have collected features for all 10 speakers, you can train a classifier based on them. In
this example, you use a K-nearest neighbor (KNN) classifier. KNN is a classification technique
naturally suited for multiclass classification. The hyperparameters for the nearest neighbor classifier
include the number of nearest neighbors, the distance metric used to compute distance to the
neighbors, and the weight of the distance metric. The hyperparameters are selected to optimize
validation accuracy and performance on the test set. In this example, the number of neighbors is set
to 5 and the metric for distance chosen is squared-inverse weighted Euclidean distance. For more
information about the classifier, refer to fitcknn (Statistics and Machine Learning Toolbox).
Train the classifier and print the cross-validation accuracy. crossval (Statistics and Machine
Learning Toolbox) and kfoldLoss (Statistics and Machine Learning Toolbox) are used to compute
the cross-validation accuracy for the KNN classifier.
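A training call consistent with these choices looks like the following (a sketch; the exact name-value arguments may differ from the full example):

trainedClassifier = fitcknn(features,labels, ...
    NumNeighbors=5, ...
    Distance="euclidean", ...
    DistanceWeight="squaredinverse", ...
    ClassNames=unique(labels));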
Perform cross-validation.
k = 5;
group = labels;
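% k-fold partition and cross-validated model (a sketch consistent with the
% text above; assumes trainedClassifier was fit with fitcknn as shown earlier)
c = cvpartition(group,KFold=k);
partitionedModel = crossval(trainedClassifier,CVPartition=c);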
validationAccuracy = 1 - kfoldLoss(partitionedModel,LossFun="ClassifError");
fprintf('\nValidation accuracy = %.2f%%\n', validationAccuracy*100);
validationPredictions = kfoldPredict(partitionedModel);
figure(Units="normalized",Position=[0.4 0.4 0.4 0.4])
confusionchart(labels,validationPredictions,title="Validation Accuracy", ...
ColumnSummary="column-normalized",RowSummary="row-normalized");
You can also use the Classification Learner (Statistics and Machine Learning Toolbox) app to try out
and compare various classifiers with your table of features.
Test Classifier
In this section, you test the trained KNN classifier with speech signals from each of the 10 speakers
to see how well it behaves with signals that were not used to train it.
Read files, extract features from the test set, and normalize them.
features = [];
labels = [];
numVectorsPerFile = [];
allFeatures = extract(afe,adsTest);
allLabels = adsTest.Labels;
for ii = 1:numel(allFeatures)
thisFeature = allFeatures{ii};
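% Keep only voiced-speech frames, as in the training loop
isSpeech = thisFeature(:,featureMap.shortTimeEnergy) > energyThreshold;
isVoiced = thisFeature(:,featureMap.zerocrossrate) < zcrThreshold;
voicedSpeech = isSpeech & isVoiced;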
thisFeature(~voicedSpeech,:) = [];
numVec = size(thisFeature,1);
thisFeature(:,[featureMap.zerocrossrate,featureMap.shortTimeEnergy]) = [];
label = repelem(allLabels(ii),numVec);
numVectorsPerFile = [numVectorsPerFile,numVec];
features = [features;thisFeature];
labels = [labels,label];
end
features = (features-M)./S;
Predict the label (speaker) for each frame by calling predict on trainedClassifier.
prediction = predict(trainedClassifier,features);
prediction = categorical(string(prediction));
For a given file, predictions are made for every frame. Determine the mode of predictions for each file
and then plot the confusion chart.
r2 = prediction(1:numel(adsTest.Files));
idx = 1;
for ii = 1:numel(adsTest.Files)
r2(ii) = mode(prediction(idx:idx+numVectorsPerFile(ii)-1));
idx = idx + numVectorsPerFile(ii);
end
The predicted speakers match the expected speakers for all files under test.
See Also
pitch | mfcc
Related Examples
• “Speaker Identification Using Custom SincNet Layer and Deep Learning” on page 1-699
• “Accelerate Audio Machine Learning Workflows Using a GPU” on page 1-249
Accelerate Audio Machine Learning Workflows Using a GPU
This example shows how to use GPU computing to accelerate machine learning workflows for audio,
speech, and acoustic applications.
One of the easiest ways to speed up your code is to run it on a GPU, and many functions in MATLAB®
automatically run on a GPU if you supply a gpuArray data argument. Starting from the code in the
“Speaker Identification Using Pitch and MFCC” on page 1-235 example, this example demonstrates
how to speed up execution in a machine learning workflow by modifying it to run on a GPU. You can
use a similar approach to accelerate many of your machine learning audio workflows.
As this figure shows, you can significantly speed up feature extraction, prediction, and loss
calculation using a GPU.
Using a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For information on
supported devices, see “GPU Computing Requirements” (Parallel Computing Toolbox).
gpu = gpuDevice;
disp(gpu.Name + " GPU selected.")
If a function supports GPU array input, the documentation page for that function lists GPU support in
the Extended Capabilities section. You can also filter lists of functions in the documentation to show
only functions that support GPU array input. For more information, see “Run MATLAB Functions on a
GPU” (Parallel Computing Toolbox).
After checking that you have a supported GPU, you follow the same steps as the previous example,
with minor modifications to send data to the GPU and run functions on the GPU where possible. The
code requires very little modification to run on a GPU. This diagram shows the approach used in this
example, which includes feature extraction, training a classifier model, and testing the model on
unknown data.
This example uses a subset of the Common Voice data set from Mozilla [1] on page 1-258. The data
set contains 48 kHz recordings of subjects speaking short sentences. The helper function in this
section organizes the downloaded data and returns an audioDatastore object. The data set uses
1.36 GB of memory.
Download the data set if it doesn't already exist and unzip it into tempdir.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","commonvoice.zip");
dataFolder = tempdir;
if ~datasetExists(string(dataFolder) + "commonvoice")
unzip(downloadFolder,dataFolder);
end
Extract the speech files for 10 speakers (5 female and 5 male) and place them into an
audioDatastore using the commonVoiceHelper function, which is placed in the current folder
when you open this example. The datastore lets you collect the audio files of a given format and read them.
ads = commonVoiceHelper;
To make the datastore output gpuArray (Parallel Computing Toolbox) data, set the
OutputEnvironment property to "gpu". If your workflow does not use an audioDatastore, you
can copy any numeric or logical data to GPU memory by calling gpuArray on your data.
ads.OutputEnvironment = "gpu";
The splitEachLabel function of audioDatastore splits the datastore into two or more datastores.
The resulting datastores have the specified proportion of the audio files from each label. In this
example, you split the datastore into two parts, using 80% of the data for each label for training and
using the remaining 20% for testing. Here, the label identifies the speaker.
[adsTrain,adsTest] = splitEachLabel(ads,0.8);
The splitEachLabel function creates datastores with the same OutputEnvironment property as
the original datastore ads. Check the OutputEnvironment properties of the training and testing
datastores.
adsTrain.OutputEnvironment
ans =
'gpu'
adsTest.OutputEnvironment
ans =
'gpu'
To preview the content of your datastore, read a sample file and play it using your default audio
device.
[sampleTrain,dsInfo] = read(adsTrain);
sound(sampleTrain,dsInfo.SampleRate)
Reading from the training datastore advances its read pointer so that you can iterate through the files. Reset the training datastore to return the read pointer to the start before feature extraction.
reset(adsTrain)
Extract Features
Extract pitch features and mel frequency cepstrum coefficients (MFCC) features from each frame
that corresponds to voiced speech in the training datastore. Audio Toolbox™ provides
audioFeatureExtractor so that you can quickly and efficiently extract multiple features.
Configure an audioFeatureExtractor to extract pitch, short-time energy, zero-crossing rate
(ZCR), and MFCC.
fs = dsInfo.SampleRate;
windowLength = round(0.03*fs);
overlapLength = round(0.025*fs);
afe = audioFeatureExtractor(SampleRate=fs, ...
Window=hamming(windowLength,"periodic"),OverlapLength=overlapLength, ...
zerocrossrate=true,shortTimeEnergy=true,pitch=true,mfcc=true);
When you call the extract function of audioFeatureExtractor, all features are concatenated and
returned in a matrix. You can use the info function to determine which columns of the matrix
correspond to which features.
featureMap = info(afe)
Extract features from the data set. As the training datastore outputs gpuArray data, the extract
function runs on the GPU.
features = [];
labels = [];
energyThreshold = 0.005;
zcrThreshold = 0.2;
allFeatures = extract(afe,adsTrain);
allLabels = adsTrain.Labels;
for ii = 1:numel(allFeatures)
    thisFeature = allFeatures{ii};
    isSpeech = thisFeature(:,featureMap.shortTimeEnergy) > energyThreshold;
    isVoiced = thisFeature(:,featureMap.zerocrossrate) < zcrThreshold;
    voicedSpeech = isSpeech & isVoiced;
    thisFeature(~voicedSpeech,:) = [];
    thisFeature(:,[featureMap.zerocrossrate,featureMap.shortTimeEnergy]) = [];
    label = repelem(allLabels(ii),size(thisFeature,1));
    features = [features;thisFeature];
    labels = [labels,label];
end
Pitch and MFCC are not on the same scale, which will bias the classifier. Normalize the features by subtracting the mean and dividing by the standard deviation.
M = mean(features,1);
S = std(features,[],1);
features = (features-M)./S;
Train Classifier
Now that you have features for all 10 speakers, you can train a classifier based on them. In this
example, you use a K-nearest neighbor (KNN) classifier. For more information about the classifier,
refer to fitcknn (Statistics and Machine Learning Toolbox).
Train the classifier and compute the cross-validation accuracy. Use the crossval (Statistics and
Machine Learning Toolbox) and kfoldLoss (Statistics and Machine Learning Toolbox) functions to
compute the cross-validation accuracy for the KNN classifier.
Specify all the classifier options and train the classifier. Because the training features are stored in a gpuArray, the classifier is trained on the GPU.
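The fitcknn call itself is not reproduced on this page. As a minimal sketch using the same assumed options as in the previous example, passing the gpuArray features fits the model on the GPU:
trainedClassifier = fitcknn(features,labels, ...
    Distance="euclidean",NumNeighbors=5,DistanceWeight="squaredinverse");   % sketch only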
Perform cross-validation.
k = 5;
group = labels;
c = cvpartition(group,KFold=k);
partitionedModel = crossval(trainedClassifier,CVPartition=c);
validationAccuracy = 1 - kfoldLoss(partitionedModel,LossFun="ClassifError");
fprintf('\nValidation accuracy = %.2f%%\n', validationAccuracy*100);
validationPredictions = kfoldPredict(partitionedModel);
figure(Units="normalized",Position=[0.4 0.4 0.4 0.4])
confusionchart(labels,validationPredictions,title="Validation Accuracy", ...
ColumnSummary="column-normalized",RowSummary="row-normalized");
You can also use the Classification Learner (Statistics and Machine Learning Toolbox) app to compare
various classifiers using your table of features.
Test Classifier
In this section, you test the trained KNN classifier with speech signals from each of the 10 speakers
to see how well it behaves with signals not included in the training dataset.
Read files, extract features from the test set, and normalize them. Like the training datastore, the testing datastore outputs gpuArray data, so the extract function runs on the GPU.
features = [];
labels = [];
numVectorsPerFile = [];
allFeatures = extract(afe,adsTest);
allLabels = adsTest.Labels;
for ii = 1:numel(allFeatures)
    thisFeature = allFeatures{ii};
    isSpeech = thisFeature(:,featureMap.shortTimeEnergy) > energyThreshold;
    isVoiced = thisFeature(:,featureMap.zerocrossrate) < zcrThreshold;
    voicedSpeech = isSpeech & isVoiced;
    thisFeature(~voicedSpeech,:) = [];
    numVec = size(thisFeature,1);
    thisFeature(:,[featureMap.zerocrossrate,featureMap.shortTimeEnergy]) = [];
    label = repelem(allLabels(ii),numVec);
    numVectorsPerFile = [numVectorsPerFile,numVec];
    features = [features;thisFeature];
    labels = [labels,label];
end
features = (features-M)./S;
Predict the label (speaker) for each frame by calling predict on trainedClassifier.
prediction = predict(trainedClassifier,features);
prediction = categorical(string(prediction));
For a given file, predictions are made for every frame. Determine the mode of predictions for each file
and then plot the confusion chart.
r2 = prediction(1:numel(adsTest.Files));
idx = 1;
for ii = 1:numel(adsTest.Files)
r2(ii) = mode(prediction(idx:idx+numVectorsPerFile(ii)-1));
idx = idx + numVectorsPerFile(ii);
end
The predicted speakers match the expected speakers for all of the test files.
Note that the resulting model is the same as the model from the Speaker Identification Using Pitch
and MFCC example, as you can see by comparing the confusion charts in that example and this one.
The longest running steps in this example are extracting using the audioFeatureExtractor,
making predictions using kfoldPredict, and calculating loss using kfoldLoss.
Time the execution of these functions on the GPU. To accurately time function execution on the GPU,
use the gputimeit (Parallel Computing Toolbox) function, which runs a function multiple times to
average out variation and compensate for overhead. The gputimeit function also ensures that all
operations on the GPU are complete before recording the time.
reset(adsTrain)
timeExtractGPU = gputimeit(@() extract(afe,adsTrain))
timeExtractGPU = 4.3300
timePredictGPU = gputimeit(@() kfoldPredict(partitionedModel))
timePredictGPU = 3.2398
timeLossGPU = gputimeit(@() kfoldLoss(partitionedModel,LossFun="ClassifError"))
timeLossGPU = 3.2719
For comparison, time the same functions running on the CPU using the timeit function.
adsTrain.OutputEnvironment = "cpu";
reset(adsTrain)
timeExtractCPU = timeit(@() extract(afe,adsTrain))
timeExtractCPU = 23.4533
partitionedModel = gather(partitionedModel);
timePredictCPU = timeit(@() kfoldPredict(partitionedModel))
timePredictCPU = 20.1054
timeLossCPU = timeit(@() kfoldLoss(partitionedModel,LossFun="ClassifError"))
timeLossCPU = 19.9273
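Based on these timings, feature extraction runs roughly 5.4 times faster on the GPU (4.33 s versus 23.45 s), and the prediction and loss computations each run roughly 6 times faster.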
Running your code on a GPU is straightforward and can provide a significant speedup for many
workflows. Generally, using a GPU is more beneficial when you are performing computations on
larger amounts of data, though the speedup you can achieve depends on your specific hardware and
code.
References
[1] https://commonvoice.mozilla.org/
See Also
gpuArray | gputimeit | audioDatastore | audioFeatureExtractor | pitch | mfcc
Related Examples
• “Accelerate Audio Deep Learning Using GPU-Based Feature Extraction” on page 1-757
• “Speaker Identification Using Custom SincNet Layer and Deep Learning” on page 1-699
Measure Audio Latency
This example shows how to measure the latency of an audio device. The example uses
audioLatencyMeasurementExampleApp which in turn uses audioPlayerRecorder along with a test
signal and cross correlation to determine latency. To avoid disk access interference, the test signal is
loaded into a dsp.AsyncBuffer object first, and frames are streamed from that object through the
audio device.
Introduction
In general terms, latency is defined as the time from when the audio signal enters a system until it
exits. In a digital audio processing chain, there are multiple parameters that cause latency:
1 Hardware (including A/D and D/A conversion)
2 Audio drivers that communicate with the system's sound card
3 Sampling rate
4 Samples per frame (buffer size)
5 Algorithmic latency (e.g. delay introduced by a filter or audio effect)
This example shows how to measure round trip latency. That is, the latency incurred when playing
audio through a device, looping back the audio with a physical loopback cable, and recording the
loopback audio with the same audio device. In order to compute latency for your own audio device,
you need to connect the audio out and audio in ports using a loopback cable.
A round-trip latency measurement does not separate output latency from input latency; it measures only their combined effect. Also, most practical applications will not use a
loopback setup. Typically the processing chain consists of recording audio, processing it, and playing
the processed audio. However, the latency involved should be the same either way provided the other
factors (frame size, sampling rate, algorithm latency) don't change.
Hardware Latency
Smaller frame sizes and higher sampling rates reduce the roundtrip latency. However, the tradeoff is
a higher chance of dropouts occurring (overruns/underruns).
In addition to potentially increasing latency, the amount of processing involved in the audio algorithm
can also cause dropouts.
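For example, at a 48 kHz sample rate with 64 samples per frame, the input and output buffers alone account for at least 2*64/48000 ≈ 2.7 ms of round-trip latency, before driver, converter, and algorithmic overhead are added (illustrative figures, not a measurement).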
audioLatencyMeasurementExampleApp('SamplesPerFrame',64,'SampleRate',48e3)
Real-time processing on a general-purpose operating system is only possible if you minimize other tasks being performed by the computer.
On Windows, you can use the asiosettings function to launch the dialog to control the hardware
settings. On macOS, you should launch the Audio MIDI Setup.
When using ASIO (or CoreAudio on macOS), the latency measurements are consistent as long as no dropouts occur. For small buffer sizes, it is possible to get a clean measurement in one instance and dropouts the next. You can use the Ntrials option to repeat the measurement and check that the behavior is consistent. For example, to perform 3 measurements, use:
audioLatencyMeasurementExampleApp('SamplesPerFrame',96,...
'SampleRate',48e3,'Ntrials',3)
On macOS, it is also possible to try different frame sizes without changing the hardware settings. To
make this convenient, you can specify a vector of SamplesPerFrame:
BufferSizes = [64;96;128];
t = audioLatencyMeasurementExampleApp('SamplesPerFrame',BufferSizes)
% Notice that for every sample increment in the buffer size, the additional
% latency is 3*SamplesPerFrameIncrement/SampleRate (macOS only).
t.Latency_ms - 3*BufferSizes/48
ans =
0.0020 0.0020
ans =
4.3125
4.3125
4.3125
The measurements performed so far assume that channel #1 is used for both input and output. If
your device has a loopback cable connected to other channels, you can specify them using the
IOChannels option to measureLatency. This is specified as a 2-element vector, corresponding to the
input and output channels to be used (the measurement is always on a mono signal). For example for
an RME Fireface UFX+:
audioLatencyMeasurementExampleApp('SamplesPerFrame',[32 64 96],...
'SampleRate',96e3,'Device','Fireface UFX+ (23767940)',...
'IOChannels',[1 3])
Algorithmic Latency
The measurements so far have not included algorithm latency. Therefore, they represent the minimal
roundtrip latency that can be achieved for a given device, buffer size, and sampling rate. You can add
a linear-phase FIR filter to the processing chain to verify that the latency measurements are as
expected. Moreover, it provides a way of verifying robustness of the real-time audio processing under
a given workload. For example,
L = 961;
Fs = 48e3;
audioLatencyMeasurementExampleApp('SamplesPerFrame',128,...
'SampleRate',Fs,'FilterLength',L,'Ntrials',3)
GroupDelay = (L-1)/2/Fs
% The group delay accounts for the 10 ms of additional latency when using a
% 961-tap linear-phase FIR filter vs. the minimal achievable latency.
% If the optional FIR filtering is used, the waveforms are not affected
% because the filter used has a broader bandwidth than the test audio
% signal.
Measure Performance of Streaming Real-Time Audio Algorithms
This example presents a utility that can be used to analyze the timing performance of signal
processing algorithms designed for real-time streaming applications.
Introduction
The ability to prototype an audio signal processing algorithm in real time using MATLAB depends
primarily on its execution performance. Performance is affected by a number of factors, such as the
algorithm's complexity, the sampling frequency and the input frame size. Ultimately, the algorithm
must be fast enough to ensure it can always execute within the available time budget and not drop
any frames. Frames are dropped whenever the audio input queue is overrun with new samples (not
read fast enough) or the audio output queue is underrun (not written fast enough). Dropped frames
result in undesirable artifacts in the output audio signal.
This example presents a utility to profile the execution performance of an audio signal processing
algorithm within MATLAB and compare it to the available time budget.
Results in this example were obtained on a machine running an Intel (R) Xeon (R) CPU with a clock
speed of 3.50 GHz, and 64 GB of RAM. Results vary depending on system specifications.
In this example, you measure performance of an eighth-order notch filter, implemented using
dsp.BiquadFilter.
helperAudioLoopTimerExample defines and instantiates the variables used in the algorithm. The
input is read from a file using a dsp.AudioFileReader object, and then streamed through the
notch filter in a processing loop.
audioexample.AudioLoopTimer is the utility object used to profile execution performance and display
a summary of the results. The utility uses simple tic/toc commands to log the timing of different
stages of the simulation. The initialization time (which is the time it takes to instantiate and set up
variables and objects before the simulation loop begins) is measured using the ticInit and
tocInit methods. The individual simulation loop times are measured using the ticLoop and
tocLoop methods. After the simulation loop is done, a performance report is generated using the
object's generateReport method.
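As a rough sketch of how the timing utility is used in a streaming loop (the AudioLoopTimer constructor arguments shown here are assumptions; only the ticInit/tocInit, ticLoop/tocLoop, and generateReport methods are described above):
numLoops = 500;                        % number of simulation loops to time (example value)
frameSize = 512;                       % samples per frame, as in this example
fs = 44100;                            % sampling rate, as in this example
looptimer = audioexample.AudioLoopTimer(numLoops,frameSize,fs);   % hypothetical constructor signature
ticInit(looptimer)                     % start timing the initialization stage
fileReader = dsp.AudioFileReader("speech_dft.mp3",SamplesPerFrame=frameSize);  % assumed source file
notchFilter = dsp.BiquadFilter;        % placeholder for the eighth-order notch filter
tocInit(looptimer)                     % stop timing the initialization stage
for k = 1:numLoops
    ticLoop(looptimer)                 % time one simulation loop
    audioIn = fileReader();
    audioOut = notchFilter(audioIn);
    tocLoop(looptimer)
end
generateReport(looptimer)              % display the performance report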
Execute helperAudioLoopTimerExample to run the simulation and view the performance report:
helperAudioLoopTimerExample;
The performance report figure displays a histogram of the loop execution times in the top plot. The
red line represents the maximum allowed loop execution time, or budget, above which samples will
be dropped. The budget per simulation loop is equal to L/Fs, where L is the input frame size, and Fs
is the sampling rate. In this example, L = 512, Fs = 44100 Hz, and the budget per loop is around 11.6
milliseconds. The performance report also displays the runtime of the individual simulation loops in
the bottom plot. Again, the red line represents the allowed budget per loop.
Notice that although the median loop time is well within the budget, the maximum loop time exceeds
the budget. From the bottom plot, it is evident that the budget is exceeded on the very first loop pass,
and that subsequent loop runs are within the budget. The relative slow performance of the first
simulation loop is due to the penalty incurred the first time you call the dsp.BiquadFilter and
dsp.AudioFileReader objects. The first call to the object triggers the execution of one-time tasks
that do not depend on the inputs, such as hardware resource allocation and state initialization. This
problem can be alleviated by executing one-time tasks before the simulation loop. You can perform
the one-time tasks by calling the simulation objects in the initialization stage. Execute
helperAudioLoopTimerExample(true) to re-run the simulation with pre-loop setup enabled.
helperAudioLoopTimerExample(true);
All loop runs are now within the budget. Notice that the maximum and total loop times have been
drastically reduced compared to the first performance report, at the expense of a higher initialization
time.
THD+N Measurement with Tone-Tracking
This example shows how to measure total harmonic distortion and noise level of audio input and
output devices.
Introduction
Audio input and output devices are non-linear in nature. This causes harmonic distortion in the audio
signal. Apart from the unwanted signals that may be harmonically related to the signal, these devices
can also add uncorrelated noise to the audio signal.
Total Harmonic Distortion and Noise (THD+N) quantifies the sum of these two distortions. It is
defined as the root mean square (RMS) level of all harmonics and noise components over a specified
bandwidth. The signal level is also specified as a reference.
Measurement of THD+N
This example introduces a reference model that can be used for THD+N measurements of audio input
and output devices. The steps involved in measurement are:
1 Generate a pure sine wave of a specific frequency.
2 Play the signal through an audio output device and record it through an audio input device.
3 From the recorded signal, identify the sine wave peak. This will give the reference signal RMS
level.
4 Remove the identified sine wave from recorded signal. What remains is everything unwanted,
and its RMS will give THD+N value.
This example follows the AES17-1998(r2004) [1] standard for THD+N measurement. The standard
recommends a 997 Hz frequency sine wave. It also recommends a notch filter having Q between 1
and 5 for filtering out the sine wave from recorded signal. A Q value of 5 is used in this example.
The model generates a test sine wave with a frequency of 997 Hz. The subsystem System Under Test is a variant subsystem. By default, it selects a nonlinear model implemented in Simulink for measuring the THD+N. To perform the measurement on your machine's audio input and output device, set the SUT variable in the base workspace to
THDNDemoSUT.AudioHardware.
Dual-Bandpass Controller
The measurement system in the model uses a dual-peak tracking filter to locate the notch at the test
tone's fundamental. This accommodates signal generators that are not synchronized to the ADC
clock. The output of this block is the center frequency coefficient of the notch filter that will be used
to extract the test sine tone. The two peaking filters in the controller are implemented using
dsp.NotchPeakFilter System objects. When the model is run, the feedback loop works to adjust
the center frequencies of the two peaking filters in such a way that the output locks on to the peak
tone of the input.
Notch-Peak Filter
Once the frequency of the sine wave has been identified, pass it to a peaking filter to extract the test
tone signal. This will be used to determine the test signal's peak level. A notch filter will then use the
same center frequency to remove the sine wave. The remaining signal is the sum of the total
harmonic distortion and noise. Use a single dsp.NotchPeakFilter to get both the notch and peak outputs. The Q factor of this filter is chosen as 5, conforming to the AES17-1998 standard.
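Outside of the Simulink model, the same notch/peak measurement can be sketched in a few lines of MATLAB. This is only an illustration: the recorded test signal x, its sample rate fs, and the detected tone frequency f0 are assumed to already exist, and the two-output call and its output order are assumptions based on the description above.
npFilter = dsp.NotchPeakFilter(Specification="Quality factor and center frequency", ...
    CenterFrequency=f0,QualityFactor=5,SampleRate=fs);
[residue,tone] = npFilter(x);          % notch output (THD+N residue) and peak output (test tone)
toneRMS = sqrt(mean(tone.^2));         % reference signal level
residueRMS = sqrt(mean(residue.^2));   % level of everything except the tone
thdnDB = 20*log10(residueRMS/toneRMS)  % THD+N in dB relative to the signal level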
THD+N Computer
The THD+N Computer subsystem mimics a signal level meter. It takes the notch and peak outputs
and smooths them using a lowpass filter. It then converts the level of the signals to dB.
You can run the model and see the displays update with measured sine wave frequency, THD+N level
in dB, and reference signal level in dB.
References
[1] AES17-1998 "AES standard method for digital audio engineering - Measurement of digital audio
equipment", Audio Engineering Society (1998), r2004.
Measure Impulse Response of an Audio System
The impulse response (IR) is an important tool for characterizing or representing a linear time-
invariant (LTI) system. The Impulse Response Measurer enables you to measure and capture the
impulse response of audio systems.
In this example, you use the Impulse Response Measurer to measure the impulse response of your
room. You then use the acquired impulse response with audiopluginexample.FastConvolver to
add reverberation to an audio signal.
This example requires that your machine has an audio device capable of full-duplex mode and an
appropriate audio driver. To learn more about how the app records and plays audio data, see
audioPlayerRecorder.
The Swept Sine measurement technique uses an exponential time-growing frequency sweep as an
output signal. The output signal is recorded, and deconvolution is used to recover the impulse response from the swept sine tone. For more details, see [1].
The Maximum-Length-Sequence (MLS) technique is based upon the excitation of the acoustical space
by a periodic pseudo-random signal. The impulse response is obtained by circular cross-correlation
between the measured output and the test tone (MLS sequence). For more details, see [2].
1. Open the app by entering impulseResponseMeasurer at the MATLAB command prompt.
impulseResponseMeasurer
2. Use the default settings of the app and click Capture. Make sure the device name and the channel
number match your system's configuration.
3. Once you capture the impulse response, click the Export button and select To Workspace.
Time-domain convolution of an input frame with a long impulse response adds latency equal to the
length of the impulse response. The algorithm used by the audiopluginexample.FastConvolver
plugin uses frequency-domain partitioned convolution to reduce the latency to twice the partition size
[3]. audiopluginexample.FastConvolver is well-suited to impulse responses acquired using
impulseResponseMeasurer.
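For example, with a partition size of 1024 samples at a 48 kHz sample rate, the added latency is about 2*1024/48000 ≈ 43 ms, independent of the length of the acquired impulse response (the partition size here is only an illustrative value).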
1. Create an audiopluginexample.FastConvolver object.
fastConvolver = audiopluginexample.FastConvolver
fastConvolver =
audiopluginexample.FastConvolver with properties:
2. Set the ImpulseResponse property to your acquired impulse response measurement. You can clear the impulse response from your workspace once it is saved to the fast convolver.
load measuredImpulseResponse
irEstimate = measuredImpulseResponse.ImpulseResponse.Amplitude(:,1);
fastConvolver.ImpulseResponse = irEstimate;
3. Open the audio test bench and specify your fast convolver object.
audioTestBench(fastConvolver)
4. By default, the Audio Test Bench reads from an audio file and writes to your audio device. Click
Run to listen to an audio file convolved with your acquired impulse response.
The excitation level slider on the impulseResponseMeasurer applies gain to the output test tone. A
higher output level is generally recommended to maximize signal-to-noise ratio (SNR). However, if
the output level is too high, undesired distortion may occur.
Export the measurement to the Filter Visualization Tool (FVTool) through the Export button to look at other useful plots, such as the phase response and group delay.
References
[1] Farina, Angelo. "Advancements in impulse response measurements by sine sweeps." Presented at
the Audio Engineering Society 122nd Convention, Vienna, Austria, 2007.
[2] Guy-Bart, Stan, Jean-Jacques Embrechts, and Dominique Archambeau. "Comparison of different
impulse response measurement techniques." Journal of the Audio Engineering Society. Vol. 50, Issue 4,
pp. 249-262.
[3] Armelloni, Enrico, Christian Giottoli, and Angelo Farina. "Implementation of real-time partitioned
convolution on a DSP board." Applications of Signal Processing to Audio and Acoustics, 2003 IEEE
Workshop on., pp. 71-74. IEEE, 2003.
Measure Frequency Response of an Audio Device
The frequency response (FR) is an important tool for characterizing the fidelity of an audio device or
component.
This example requires an audio device capable of recording and playing audio and an appropriate
audio driver. To learn more about how the example records and plays audio data, see
audioDeviceReader and audioDeviceWriter.
An FR measurement compares the output levels of an audio device to known input levels. A basic FR
measurement consists of two or three test tones: mid, high, and low.
In this example you perform an audible range FR measurement by sweeping a sine wave from the
lowest frequency in the range to the highest. A flat response indicates an audio device that responds
equally to all frequencies.
Setup Experiment
In this example, you measure the FR by playing an audio signal through audioDeviceWriter and
then recording the signal through audioDeviceReader. A loopback cable is used to physically
connect the audio-out port of the sound card to its audio-in port.
To start, use the audioDeviceReader System object™ and audioDeviceWriter System object to
connect to the audio device. This example uses a Focusrite Scarlett 2i2 audio device with a 48 kHz
sampling rate.
sampleRate = 48e3;
device = "Focusrite USB ASIO";
% Create the device reader and writer (reconstructed; the creation lines fall on the preceding page).
aDR = audioDeviceReader(SampleRate=sampleRate,Device=device,ChannelMapping=1);
aDW = audioDeviceWriter(SampleRate=sampleRate,Device=device,ChannelMapping=1);
Test Signal
The test signal is a sine wave with 1024 samples per frame and an initial frequency of 0 Hz. The
frequency is increased in 50 Hz increments to sweep the audible range.
samplesPerFrame = 1024;
sineSource = audioOscillator( ...
Frequency=0, ...
SignalType="sine", ...
SampleRate=sampleRate, ...
SamplesPerFrame=samplesPerFrame);
Spectrum Analyzer
Use the spectrumAnalyzer to visualize the FR of your audio I/O system. 20 averages of the
spectrum estimate are used throughout the experiment and the resolution bandwidth is set to 50 Hz.
The sampling frequency is set to 48 kHz.
RBW = 50;
Navg = 20;
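The line that creates the spectrum analyzer falls on a page that is not reproduced here. A minimal sketch of the scope variable used in the measurement loop below, assuming standard spectrumAnalyzer properties (the averaging-related settings of the original are not shown and are omitted):
scope = spectrumAnalyzer( ...
    SampleRate=sampleRate, ...
    RBWSource="property",RBW=RBW, ...
    PlotMaxHoldTrace=true, ...
    ShowLegend=true, ...
    Title="Audio Device Frequency Response");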
To avoid the impact of setup time on the FR measurement, prerun your audio loop for 5 seconds.
Once the actual FR measurement starts, sweep the test signal through the audible frequency range.
Use the spectrum analyzer to visualize the FR.
tic
while toc < 5
x = sineSource();
aDW(x);
y = aDR();
scope(y);
end
count = 1;
readerDrops = 0;
writerDrops = 0;
while true
if count == Navg
newFreq = sineSource.Frequency + RBW;
if newFreq > sampleRate/2
break
end
sineSource.Frequency = newFreq;
count = 1;
end
x = sineSource();
writerUnderruns = aDW(x);
[y,readerOverruns] = aDR();
readerDrops = readerDrops + readerOverruns;
writerDrops = writerDrops + writerUnderruns;
scope(y);
count = count + 1;
end
release(aDR)
release(aDW)
release(scope)
The spectrum analyzer shows two plots. The first plot is the spectrum estimate of the last recorded
data. The second plot is the maximum power the spectrum analyzer computed for each frequency bin,
as the sine wave swept over the spectrum. To get the maximum hold plot data and the frequency
vector, you can use the object function getSpectrumData and plot the maximum hold trace only.
data = getSpectrumData(scope);
freqVector = data.FrequencyVector{1};
freqResponse = data.MaxHoldTrace{1};
semilogx(freqVector,freqResponse);
xlabel("Frequency (Hz)");
ylabel("Power (dBm)");
title("Audio Device Frequency Response");
The frequency response plot indicates that the audio device tested in this example has a flat
frequency response in the audible range.
Generate Standalone Executable for Parametric Audio Equalizer
This example shows how to generate a standalone executable for parametric equalization using
MATLAB Coder™ and use it on an audio file. multibandParametricEQ is used for the equalization
algorithm. The example allows you to dynamically adjust the coefficients of the filters using a user
interface (UI) that is running in MATLAB.
Introduction
multibandParametricEQ allows up to ten equalizer bands in cascade. In this example, you create
an equalizer with three bands. Each of the three biquad filters allows three parameters to be tuned:
center frequency, Q factor, and the peak (or dip) gain.
You can use MATLAB Coder to generate readable, standalone C code from the parametric
equalizer algorithm code. Because the algorithm code uses System objects for reading and playing
audio files, there are additional dependencies for the generated code and executable file. These are
available in the /bin directory of your MATLAB installation.
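As a rough illustration of the code generation step, the commands below use MATLAB Coder to build a standalone executable. The function name audioEqualizerAlgorithm is hypothetical (the actual algorithm function of this example is not shown in this excerpt), and the original example may configure the build differently.
cfg = coder.config("exe");
cfg.GenerateExampleMain = "GenerateCodeAndCompile";   % emit an example main() and compile it
codegen audioEqualizerAlgorithm -config cfg -report   % assumes the function takes no inputs; otherwise add -args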
Once you have generated the executable, run audioEqualizerEXEExampleApp to launch the
executable and a user interface (UI) designed to interact with the simulation. The UI allows you to
tune parameters and the results are reflected in the simulation instantly. For example, moving the
slider for the 'Center Frequency1' to the right while the simulation is running increases the center
frequency of the first parametric equalizer biquad filter. You can verify this by noticing the change
immediately in the magnitude response plot.
Deploy Audio Applications with MATLAB Compiler
This example shows how to use MATLAB Compiler™ to create a standalone application from a
MATLAB function. The function implements an audio processing algorithm and plays the result
through your audio output device.
Introduction
In this example, you generate and run an executable application that applies artificial reverberation
to an audio signal and plays it through your selected audio device. The benefit of such applications is
that they can be run on a machine that need not have MATLAB installed. You would only need an
installation of MATLAB Runtime to deploy the application created in this example.
Reverberation Algorithm
The reverberation algorithm is implemented using the System object reverberator. It allows you to
add a reverberation effect to mono or stereo channel audio input. The object provides six properties
that control the nature of reverberation. Each of them can be tuned while the simulation is running.
MATLAB Simulation
audioReverberationCompilerExampleApp
Once you have verified the MATLAB simulation, you can compile the function. Before compiling,
create a temporary directory in which you have write permissions. Copy the main MATLAB function
and the associated helper files into this temporary directory.
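The directory setup itself is not shown on this page. A minimal sketch, reusing the curDir and compilerDir variable names that appear in the cleanup code at the end of this example (the folder name and copied file list are assumptions):
curDir = pwd;                                    % remember the starting folder
compilerDir = fullfile(tempdir,"compilerTemp");  % temporary build folder (name assumed)
mkdir(compilerDir)
copyfile("audioReverberationCompilerExampleApp.m",compilerDir)   % plus any helper files
cd(compilerDir)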
Use the mcc (MATLAB Compiler) function from MATLAB Compiler to compile
audioReverberationCompilerExampleApp into a standalone application. This will be saved in
the current directory. Specify the '-m' option to generate a standalone application and the '-N' option to include only the directories on the path that you specify with the '-p' option.
mcc('-mN','audioReverberationCompilerExampleApp', ...
'-p',fullfile(matlabroot,'toolbox','dsp'), ...
'-p',fullfile(matlabroot,'toolbox','audio'));
Use the system command to run the generated standalone application. Note that running the
standalone application using the system command uses the current MATLAB environment and any
library files needed from this installation of MATLAB. To deploy this application on a machine which
does not have MATLAB installed, refer to “About the MATLAB Runtime” (MATLAB Compiler).
if ismac
status = system(fullfile('audioReverberationCompilerExampleApp.app', ...
'Contents','MacOS','audioReverberationCompilerExampleApp'));
else
status = system(fullfile(pwd,'audioReverberationCompilerExampleApp'));
end
Similar to the MATLAB simulation, running this deployed application will first ask you to choose the
audio device that you want to use to play audio. Then, it launches the user interface (UI) to interact
with the reverberation algorithm while the simulation is running.
After generating and deploying the executable, you can clean up the temporary directory by running
the following in the MATLAB command prompt:
cd(curDir);
rmdir(compilerDir,'s');
Parametric Audio Equalizer for Android Devices
This example shows how to use the Single-Band Parametric EQ block and the
multibandParametricEQ System object™ from the Audio Toolbox™ to implement a parametric
audio equalizer model. You can run the model on your host computer or deploy it to an Android
device.
Introduction
Parametric equalizers are used to adjust the frequency response of audio systems. For example, a
parametric equalizer can compensate for biases introduced by specific speakers. Equalization is a
primary tool in audio recording technologies.
In this example, you design a parametric audio equalizer in a Simulink® model. You can run your
model on the host computer or an Android device. The equalization algorithm is a cascade of three
filters with tunable center frequencies, bandwidths, and gains. You can visualize the frequency
response in real time while adjusting the parameters.
Required Hardware
To run this example on Android devices you need the following hardware:
Model Setup
The audioEqualizerAndroid model provides a choice of device (host computer or Android device), and
audio source (MATLAB workspace or microphone). You can choose the configuration by clicking the
Configuration UI button.
Configuration UI:
When you choose to run the model on the host computer, a UI designed to interact with the
simulation is provided and can be opened by clicking Host Tuning UI.
The UI allows you to tune the parameters of three filters individually, and view the frequency
response in real time. You can also check the Bypass check box to compare the modified sound with
the original sound.
Click the View Frequency Response button to visualize the frequency response of the filters.
To run the model on your Android device, you need to first ensure that you have installed Simulink
Support Package for Android Devices and that your Android device is provisioned.
Once your Android device is properly configured, use a USB cable to connect the device to your host
computer.
You can choose to make a standalone Android equalizer app by clicking the Deploy to hardware
button on the Simulink Editor toolbar. After deployment, the app can run on your Android device even
when it is disconnected from the host computer. The parameter tuning UI and the frequency response
display on your Android device screen, as shown below:
Denoise Speech Using Deep Learning Networks
This example shows how to denoise speech signals using deep learning networks. The example
compares two types of networks applied to the same task: fully connected, and convolutional.
Introduction
The aim of speech denoising is to remove noise from speech signals while enhancing the quality and
intelligibility of speech. This example showcases the removal of washing machine noise from speech
signals using deep learning networks. The example compares two types of networks applied to the
same task: fully connected, and convolutional.
Problem Summary
Read in a clean speech signal sampled at 8 kHz and listen to it.
[cleanAudio,fs] = audioread("SpeechDFT-16-8-mono-5secs.wav");
sound(cleanAudio,fs)
Add washing machine noise to the speech signal. Set the noise power such that the signal-to-noise
ratio (SNR) is zero dB.
noise = audioread("WashingMachine-16-8-mono-1000secs.mp3");
randind = randi(numel(noise) - numel(cleanAudio),[1 1]);
noiseSegment = noise(randind:randind + numel(cleanAudio) - 1);
speechPower = sum(cleanAudio.^2);
noisePower = sum(noiseSegment.^2);
noisyAudio = cleanAudio + sqrt(speechPower/noisePower)*noiseSegment;
sound(noisyAudio,fs)
t = (1/fs)*(0:numel(cleanAudio) - 1);
figure(1)
tiledlayout(2,1)
nexttile
plot(t,cleanAudio)
title("Clean Audio")
grid on
nexttile
plot(t,noisyAudio)
title("Noisy Audio")
xlabel("Time (s)")
grid on
The objective of speech denoising is to remove the washing machine noise from the speech signal
while minimizing undesired artifacts in the output speech.
This example uses a subset of the Mozilla Common Voice dataset [1 on page 1-312] to train and test
the deep learning networks. The data set contains 48 kHz recordings of subjects speaking short
sentences. Download the data set and unzip the downloaded file.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","commonvoice.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"commonvoice");
Use audioDatastore to create a datastore for the training set. To speed up the runtime of the
example at the cost of performance, set speedupExample to true.
adsTrain = audioDatastore(fullfile(dataset,"train"),IncludeSubfolders=true);
speedupExample = false;
if speedupExample
adsTrain = shuffle(adsTrain);
adsTrain = subset(adsTrain,1:1000);
end
Use read to get the contents of the first file in the datastore.
[audio,adsTrainInfo] = read(adsTrain);
sound(audio,adsTrainInfo.SampleRate)
figure(2)
t = (1/adsTrainInfo.SampleRate) * (0:numel(audio)-1);
plot(t,audio)
title("Example Speech Signal")
xlabel("Time (s)")
grid on
The basic deep learning training scheme is shown below. Note that, since speech generally falls
below 4 kHz, you first downsample the clean and noisy audio signals to 8 kHz to reduce the
computational load of the network. The predictor and target network signals are the magnitude
spectra of the noisy and clean audio signals, respectively. The network's output is the magnitude
spectrum of the denoised signal. The regression network uses the predictor input to minimize the
mean square error between its output and the input target. The denoised audio is converted back to
the time domain using the output magnitude spectrum and the phase of the noisy signal [2 on page 1-
312].
You transform the audio to the frequency domain using the Short-Time Fourier transform (STFT),
with a window length of 256 samples, an overlap of 75%, and a Hamming window. You reduce the size
of the spectral vector to 129 by dropping the frequency samples corresponding to negative
frequencies (because the time-domain speech signal is real, this does not lead to any information
loss). The predictor input consists of 8 consecutive noisy STFT vectors, so that each STFT output
estimate is computed based on the current noisy STFT and the 7 previous noisy STFT vectors.
This section illustrates how to generate the target and predictor signals from one training file.
windowLength = 256;
win = hamming(windowLength,"periodic");
overlap = round(0.75*windowLength);
fftLength = windowLength;
inputFs = 48e3;
fs = 8e3;
numFeatures = fftLength/2 + 1;
numSegments = 8;
src = dsp.SampleRateConverter(InputSampleRate=inputFs,OutputSampleRate=fs,Bandwidth=7920);
Use read to get the contents of an audio file from the datastore.
audio = read(adsTrain);
Make sure the audio length is a multiple of the sample rate converter decimation factor.
decimationFactor = inputFs/fs;
L = floor(numel(audio)/decimationFactor);
audio = audio(1:decimationFactor*L);
Create a random noise segment from the washing machine noise vector.
randind = randi(numel(noise) - numel(audio),[1 1]);
noiseSegment = noise(randind:randind + numel(audio) - 1);
Add noise to the speech signal such that the SNR is 0 dB.
noisePower = sum(noiseSegment.^2);
cleanPower = sum(audio.^2);
noiseSegment = noiseSegment.*sqrt(cleanPower/noisePower);
noisyAudio = audio + noiseSegment;
Use stft to generate magnitude STFT vectors from the original and noisy audio signals.
cleanSTFT = stft(audio,Window=win,OverlapLength=overlap,fftLength=fftLength);
cleanSTFT = abs(cleanSTFT(numFeatures-1:end,:));
noisySTFT = stft(noisyAudio,Window=win,OverlapLength=overlap,fftLength=fftLength);
noisySTFT = abs(noisySTFT(numFeatures-1:end,:));
Generate the 8-segment training predictor signals from the noisy STFT. The overlap between
consecutive predictors is 7 segments.
noisySTFT = [noisySTFT(:,1:numSegments - 1),noisySTFT];
stftSegments = zeros(numFeatures,numSegments,size(noisySTFT,2) - numSegments + 1);
for index = 1:size(noisySTFT,2) - numSegments + 1
stftSegments(:,:,index) = noisySTFT(:,index:index + numSegments - 1);
end
Set the targets and predictors. The last dimension of both variables corresponds to the number of
distinct predictor/target pairs generated by the audio file. Each predictor is 129-by-8, and each target
is 129-by-1.
targets = cleanSTFT;
size(targets)
ans = 1×2
129 751
predictors = stftSegments;
size(predictors)
ans = 1×3
129 8 751
To speed up processing, extract feature sequences from the speech segments of all audio files in the
datastore using tall arrays. Unlike in-memory arrays, tall arrays typically remain unevaluated until
you call the gather function. This deferred evaluation enables you to work quickly with large data
sets. When you eventually request output using gather, MATLAB combines the queued calculations
where possible and takes the minimum number of passes through the data. If you have Parallel
Computing Toolbox™, you can use tall arrays in your local MATLAB session, or on a local parallel
pool. You can also run tall array calculations on a cluster if you have MATLAB® Parallel Server™
installed.
reset(adsTrain)
T = tall(adsTrain)
T =
{281712×1 double}
{289776×1 double}
{251760×1 double}
{332400×1 double}
{296688×1 double}
{113520×1 double}
{211440×1 double}
{ 97392×1 double}
: : :
: : :
The display indicates that the number of rows (corresponding to the number of files in the datastore),
M, is not yet known. M is a placeholder until the calculation completes.
Extract the target and predictor magnitude STFT from the tall table. This action creates new tall
array variables to use in subsequent calculations. The function
HelperGenerateSpeechDenoisingFeatures performs the steps already highlighted in the STFT
Targets and Predictors on page 1-297 section. The cellfun command applies
HelperGenerateSpeechDenoisingFeatures to the contents of each audio file in the datastore.
[targets,predictors] = cellfun(@(x)HelperGenerateSpeechDenoisingFeatures(x,noise,src),T,UniformOutput=false);
[targets,predictors] = gather(targets,predictors);
It is good practice to normalize all features to zero mean and unity standard deviation.
Compute the mean and standard deviation of the predictors and targets, respectively, and use them to
normalize the data.
predictors = cat(3,predictors{:});
noisyMean = mean(predictors(:));
noisyStd = std(predictors(:));
predictors(:) = (predictors(:) - noisyMean)/noisyStd;
targets = cat(2,targets{:});
cleanMean = mean(targets(:));
cleanStd = std(targets(:));
targets(:) = (targets(:) - cleanMean)/cleanStd;
Reshape predictors and targets to the dimensions expected by the deep learning networks.
predictors = reshape(predictors,size(predictors,1),size(predictors,2),1,size(predictors,3));
targets = reshape(targets,1,1,size(targets,1),size(targets,2));
You will use 1% of the data for validation during training. Validation is useful to detect scenarios where the network is overfitting the training data.
Randomly permute the predictor/target pairs and reserve the last 1% for validation.
inds = randperm(size(predictors,4));
L = round(0.99*size(predictors,4));
trainPredictors = predictors(:,:,:,inds(1:L));
trainTargets = targets(:,:,:,inds(1:L));
validatePredictors = predictors(:,:,:,inds(L+1:end));
validateTargets = targets(:,:,:,inds(L+1:end));
You first consider a denoising network comprised of fully connected layers. Each neuron in a fully
connected layer is connected to all activations from the previous layer. A fully connected layer
multiplies the input by a weight matrix and then adds a bias vector. The dimensions of the weight
matrix and bias vector are determined by the number of neurons in the layer and the number of
activations from the previous layer.
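In other words, a fully connected layer with N neurons acting on M input activations computes y = W*x + b, where W is an N-by-M weight matrix, b is an N-by-1 bias vector, and x is the vector of activations from the previous layer.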
Define the layers of the network. Specify the input size to be images of size NumFeatures-by-NumSegments (129-by-8 in this example). Define two hidden fully connected layers, each with 1024 neurons. Because a cascade of fully connected layers alone is purely linear, follow each hidden fully connected layer with a rectified linear unit (ReLU) layer. The batch normalization layers normalize the means and standard deviations of the outputs. Add a final fully connected layer with 129 neurons. The network is trained with a mean squared error regression loss.
layers = [
imageInputLayer([numFeatures,numSegments])
fullyConnectedLayer(1024)
batchNormalizationLayer
reluLayer
fullyConnectedLayer(1024)
batchNormalizationLayer
reluLayer
fullyConnectedLayer(numFeatures)
];
Next, specify the training options for the network. Set MaxEpochs to 3 so that the network makes 3
passes through the training data. Set MiniBatchSize to 128 so that the network looks at 128
training signals at a time. Specify Plots as "training-progress" to generate plots that show the
training progress as the number of iterations increases. Set Verbose to false to disable printing the
table output that corresponds to the data shown in the plot into the command line window. Specify
Shuffle as "every-epoch" to shuffle the training sequences at the beginning of each epoch.
Set LearnRateSchedule to "piecewise" to decrease the learning rate by a specified factor
(0.9) every time a certain number of epochs (1) has passed. Set ValidationData to the validation
predictors and targets. Set ValidationFrequency such that the validation mean square error is
computed once per epoch. This example uses the adaptive moment estimation (Adam) solver.
miniBatchSize = 128;
options = trainingOptions("adam", ...
MaxEpochs=3, ...
InitialLearnRate=1e-5,...
MiniBatchSize=miniBatchSize, ...
Shuffle="every-epoch", ...
Plots="training-progress", ...
Verbose=false, ...
ValidationFrequency=floor(size(trainPredictors,4)/miniBatchSize), ...
LearnRateSchedule="piecewise", ...
LearnRateDropFactor=0.9, ...
LearnRateDropPeriod=1, ...
ValidationData={validatePredictors,squeeze(validateTargets)'});
Train the network with the specified training options and layer architecture using trainnet.
Because the training set is large, the training process can take several minutes. To download and load
a pre-trained network instead of training a network from scratch, set downloadPretrainedSystem
to true.
downloadPretrainedSystem = false;
if downloadPretrainedSystem
downloadFolder = matlab.internal.examples.downloadSupportFile("audio/examples","sefc.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
netFolder = fullfile(dataFolder,"SpeechDenoising");
s = load(fullfile(dataFolder,"denoiseNetFullyConnected.mat"));
cleanMean = s.cleanMean;
cleanStd = s.cleanStd;
noisyMean = s.noisyMean;
noisyStd = s.noisyStd;
denoiseNetFullyConnected = s.denoiseNetFullyConnected;
else
denoiseNetFullyConnected = trainnet(trainPredictors,squeeze(trainTargets)',layers,"mse",options);
end
summary(denoiseNetFullyConnected)
Initialized: true
Inputs:
1 'imageinput' 129×8×1 images
Consider a network that uses convolutional layers instead of fully connected layers [3 on page 1-312].
A 2-D convolutional layer applies sliding filters to the input. The layer convolves the input by moving
the filters along the input vertically and horizontally and computing the dot product of the weights
and the input, and then adding a bias term. Convolutional layers typically consist of fewer parameters
than fully connected layers.
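For example, the first 1024-neuron fully connected layer above has 1024*(129*8) ≈ 1.06 million weights, whereas a convolutional layer with 18 filters of size 9-by-8 has only 9*8*18 = 1296 weights (plus biases).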
Define the layers of the fully convolutional network described in [3 on page 1-312], comprising 16
convolutional layers. The first 15 convolutional layers are groups of 3 layers, repeated 5 times, with
filter widths of 9, 5, and 9, and number of filters of 18, 30 and 8, respectively. The last convolutional
layer has a filter width of 129 and 1 filter. In this network, convolutions are performed in only one
direction (along the frequency dimension), and the filter width along the time dimension is set to 1 for
all layers except the first one. Similar to the fully connected network, convolutional layers are
followed by ReLU and batch normalization layers.
layers = [imageInputLayer([numFeatures,numSegments])
    convolution2dLayer([9 8],18,Stride=[1 100],Padding="same")
    batchNormalizationLayer
    reluLayer
    repmat( ...
    [convolution2dLayer([5 1],30,Stride=[1 100],Padding="same")
    batchNormalizationLayer
    reluLayer
    convolution2dLayer([9 1],8,Stride=[1 100],Padding="same")
    batchNormalizationLayer
    reluLayer
    convolution2dLayer([9 1],18,Stride=[1 100],Padding="same")
    batchNormalizationLayer
    reluLayer],4,1)
    % The remaining layers are reconstructed from the description above.
    convolution2dLayer([5 1],30,Stride=[1 100],Padding="same")
    batchNormalizationLayer
    reluLayer
    convolution2dLayer([9 1],8,Stride=[1 100],Padding="same")
    batchNormalizationLayer
    reluLayer
    convolution2dLayer([129 1],1,Stride=[1 100],Padding="same")
    ];
The training options are identical to the options for the fully connected network, except that the
dimensions of the validation target signals are permuted to be consistent with the dimensions
expected by the network output.
options = trainingOptions("adam", ...
MaxEpochs=3, ...
InitialLearnRate=1e-5, ...
MiniBatchSize=miniBatchSize, ...
Shuffle="every-epoch", ...
Plots="training-progress", ...
Verbose=false, ...
ValidationFrequency=floor(size(trainPredictors,4)/miniBatchSize), ...
LearnRateSchedule="piecewise", ...
LearnRateDropFactor=0.9, ...
LearnRateDropPeriod=1, ...
ValidationData={validatePredictors,permute(validateTargets,[3 1 2 4])});
Train the network with the specified training options and layer architecture using trainnet.
Because the training set is large, the training process can take several minutes. To download and load
a pre-trained network instead of training a network from scratch, set downloadPretrainedSystem
to true.
downloadPretrainedSystem = false;
if downloadPretrainedSystem
downloadFolder = matlab.internal.examples.downloadSupportFile("audio/examples","secnn.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
s = load(fullfile(dataFolder,"denoiseNetFullyConvolutional.mat"));
cleanMean = s.cleanMean;
cleanStd = s.cleanStd;
noisyMean = s.noisyMean;
noisyStd = s.noisyStd;
denoiseNetFullyConvolutional = s.denoiseNetFullyConvolutional;
else
denoiseNetFullyConvolutional = trainnet(trainPredictors,permute(trainTargets,[3 1 2 4]),layers,"mse",options);
end
summary(denoiseNetFullyConvolutional)
Initialized: true
Inputs:
1 'imageinput' 129×8×1 images
Test the Denoising Networks
Read in a signal from the test set.
adsTest = audioDatastore(fullfile(dataset,"test"),IncludeSubfolders=true);
[cleanAudio,adsTestInfo] = read(adsTest);
Make sure the audio length is a multiple of the sample rate converter decimation factor.
L = floor(numel(cleanAudio)/decimationFactor);
cleanAudio = cleanAudio(1:decimationFactor*L);
Convert the audio signal from 48 kHz to 8 kHz.
cleanAudio = src(cleanAudio);
reset(src)
In this testing stage, you corrupt speech with washing machine noise not used in the training stage.
noise = audioread("WashingMachine-16-8-mono-200secs.mp3");
Create a random noise segment from the washing machine noise vector.
randind = randi(numel(noise) - numel(cleanAudio),[1 1]);
noiseSegment = noise(randind:randind + numel(cleanAudio) - 1);
Add noise to the speech signal such that the SNR is 0 dB.
noisePower = sum(noiseSegment.^2);
cleanPower = sum(cleanAudio.^2);
noiseSegment = noiseSegment.*sqrt(cleanPower/noisePower);
noisyAudio = cleanAudio + noiseSegment;
Use stft to generate magnitude STFT vectors from the noisy audio signals.
noisySTFT = stft(noisyAudio,Window=win,OverlapLength=overlap,fftLength=fftLength);
noisyPhase = angle(noisySTFT(numFeatures-1:end,:));
noisySTFT = abs(noisySTFT(numFeatures-1:end,:));
Generate the 8-segment training predictor signals from the noisy STFT. The overlap between
consecutive predictors is 7 segments.
Normalize the predictors by the mean and standard deviation computed in the training stage.
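The code that forms and normalizes the test predictors is not reproduced on this page. The following sketch mirrors the training-stage code shown earlier:
noisySTFT = [noisySTFT(:,1:numSegments - 1),noisySTFT];
predictors = zeros(numFeatures,numSegments,size(noisySTFT,2) - numSegments + 1);
for index = 1:size(noisySTFT,2) - numSegments + 1
    predictors(:,:,index) = noisySTFT(:,index:index + numSegments - 1);
end
predictors(:) = (predictors(:) - noisyMean)/noisyStd;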
Compute the denoised magnitude STFT by using predict with the two trained networks.
predictors = reshape(predictors,[numFeatures,numSegments,1,size(predictors,3)]);
STFTFullyConnected = predict(denoiseNetFullyConnected,predictors);
STFTFullyConvolutional = predict(denoiseNetFullyConvolutional,predictors);
Scale the outputs by the mean and standard deviation used in the training stage.
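The scaling lines themselves are not reproduced here; a sketch that inverts the normalization applied to the targets during training:
STFTFullyConnected(:) = cleanStd*STFTFullyConnected(:) + cleanMean;
STFTFullyConvolutional(:) = cleanStd*STFTFullyConvolutional(:) + cleanMean;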
STFTFullyConnected = (STFTFullyConnected.').*exp(1j*noisyPhase);
STFTFullyConnected = [conj(STFTFullyConnected(end-1:-1:2,:));STFTFullyConnected];
STFTFullyConvolutional = squeeze(STFTFullyConvolutional).*exp(1j*noisyPhase);
STFTFullyConvolutional = [conj(STFTFullyConvolutional(end-1:-1:2,:));STFTFullyConvolutional];
Compute the denoised speech signals. istft performs the inverse STFT. Use the phase of the noisy
STFT vectors to reconstruct the time-domain signal.
denoisedAudioFullyConnected = istft(STFTFullyConnected,Window=win,OverlapLength=overlap,fftLength=fftLength,ConjugateSymmetric=true);
denoisedAudioFullyConvolutional = istft(STFTFullyConvolutional,Window=win,OverlapLength=overlap,fftLength=fftLength,ConjugateSymmetric=true);
t = (1/fs)*(0:numel(denoisedAudioFullyConnected)-1);
figure(3)
tiledlayout(4,1)
nexttile
plot(t,cleanAudio(1:numel(denoisedAudioFullyConnected)))
title("Clean Speech")
grid on
nexttile
plot(t,noisyAudio(1:numel(denoisedAudioFullyConnected)))
title("Noisy Speech")
grid on
nexttile
plot(t,denoisedAudioFullyConnected)
title("Denoised Speech (Fully Connected Layers)")
grid on
nexttile
plot(t,denoisedAudioFullyConvolutional)
title("Denoised Speech (Convolutional Layers)")
grid on
xlabel("Time (s)")
h = figure(4);
tiledlayout(4,1)
nexttile
spectrogram(cleanAudio,win,overlap,fftLength,fs);
title("Clean Speech")
grid on
nexttile
spectrogram(noisyAudio,win,overlap,fftLength,fs);
title("Noisy Speech")
grid on
nexttile
spectrogram(denoisedAudioFullyConnected,win,overlap,fftLength,fs);
title("Denoised Speech (Fully Connected Layers)")
grid on
nexttile
spectrogram(denoisedAudioFullyConvolutional,win,overlap,fftLength,fs);
title("Denoised Speech (Convolutional Layers)")
grid on
p = get(h,"Position");
set(h,"Position",[p(1) 65 p(3) 800]);
sound(noisyAudio,fs)
Listen to the denoised speech from the network with fully connected layers.
sound(denoisedAudioFullyConnected,fs)
Listen to the denoised speech from the network with convolutional layers.
sound(denoisedAudioFullyConvolutional,fs)
sound(cleanAudio,fs)
You can test more files from the datastore by calling testDenoisingNets. The function produces
the time-domain and frequency-domain plots highlighted above, and also returns the clean, noisy, and
denoised audio signals.
[cleanAudio,noisyAudio,denoisedAudioFullyConnected,denoisedAudioFullyConvolutional] = testDenoisi
Real-Time Application
The procedure in the previous section passes the entire spectrum of the noisy signal to predict.
This is not suitable for real-time applications where low latency is a requirement.
The scope plots the clean, noisy and denoised signals, as well as the gain of the noise gate.
See Also
Related Examples
• “3-D Speech Enhancement Using Trained Filter and Sum Network” on page 1-948
Train Speech Command Recognition Model Using Deep Learning
This example shows how to train a deep learning model that detects the presence of speech
commands in audio. The example uses the Speech Commands Dataset [1] on page 1-324 to train a
convolutional neural network to recognize a set of commands.
To use a pretrained speech command recognition system, see “Speech Command Recognition Using
Deep Learning” on page 1-905.
To run the example quickly, set speedupExample to true. To run the full example as published, set
speedupExample to false.
speedupExample = false;
rng default
Load Data
This example uses the Google Speech Commands Dataset [1] on page 1-324. Download and unzip the
data set.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","google_speech.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"google_speech");
Augment Data
The network should be able to not only recognize different spoken words but also to detect if the
audio input is silence or background noise.
The supporting function, augmentDataset on page 1-323, uses the long audio files in the
background folder of the Google Speech Commands Dataset to create one-second segments of
background noise. The function creates an equal number of background segments from each
background noise file and then splits the segments between the train and validation folders.
augmentDataset(dataset)
Specify the words that you want your model to recognize as commands. Label all files that are not
commands or background noise as unknown. Labeling words that are not commands as unknown
creates a group of words that approximates the distribution of all words other than the commands.
The network uses this group to learn the difference between commands and all other words.
To reduce the class imbalance between the known and unknown words and speed up processing, only
include a fraction of the unknown words in the training set.
Use subset to create a datastore that contains only the commands, the background noise, and the
subset of unknown words. Count the number of examples belonging to each category.
commands = categorical(["yes","no","up","down","left","right","on","off","stop","go"]);
background = categorical("background");
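The training datastore itself does not appear in this excerpt; a minimal sketch that mirrors the validation datastore created below:
ads = audioDatastore(fullfile(dataset,"train"), ...
    IncludeSubfolders=true, ...
    FileExtensions=".wav", ...
    LabelSource="foldernames");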
isCommand = ismember(ads.Labels,commands);
isBackground = ismember(ads.Labels,background);
isUnknown = ~(isCommand|isBackground);
ads.Labels(isUnknown) = categorical("unknown");
adsTrain = subset(ads,isCommand|isUnknown|isBackground);
adsTrain.Labels = removecats(adsTrain.Labels);
Create an audioDatastore that points to the validation data set. Follow the same steps used to
create the training datastore.
ads = audioDatastore(fullfile(dataset,"validation"), ...
IncludeSubfolders=true, ...
FileExtensions=".wav", ...
LabelSource="foldernames");
isCommand = ismember(ads.Labels,commands);
isBackground = ismember(ads.Labels,background);
isUnknown = ~(isCommand|isBackground);
ads.Labels(isUnknown) = categorical("unknown");
adsValidation = subset(ads,isCommand|isUnknown|isBackground);
adsValidation.Labels = removecats(adsValidation.Labels);
figure(Units="normalized",Position=[0.2,0.2,0.5,0.5])
tiledlayout(2,1)
nexttile
histogram(adsTrain.Labels)
title("Training Label Distribution")
ylabel("Number of Observations")
grid on
nexttile
histogram(adsValidation.Labels)
title("Validation Label Distribution")
ylabel("Number of Observations")
grid on
if speedupExample
numUniqueLabels = numel(unique(adsTrain.Labels)); %#ok<UNRCH>
% Reduce the dataset by a factor of 20
adsTrain = splitEachLabel(adsTrain,round(numel(adsTrain.Files) / numUniqueLabels / 20));
adsValidation = splitEachLabel(adsValidation,round(numel(adsValidation.Files) / numUniqueLabels / 20));
end
To prepare the data for efficient training of a convolutional neural network, convert the speech
waveforms to auditory-based spectrograms.
To speed up processing, you can distribute the feature extraction across multiple workers. Start a
parallel pool if you have access to Parallel Computing Toolbox™.
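The useParallel flag referenced later is defined here as a sketch; canUseParallelPool returns false if Parallel Computing Toolbox is not available.
useParallel = canUseParallelPool;
if useParallel
    gcp; % start a parallel pool if one is not already running
end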
Extract Features
Define the parameters to extract auditory spectrograms from the audio input. segmentDuration is
the duration of each speech clip in seconds. frameDuration is the duration of each frame for
spectrum calculation. hopDuration is the time step between each spectrum. numBands is the
number of filters in the auditory spectrogram.
segmentDuration = 1;
frameDuration = 0.025;
hopDuration = 0.010;
FFTLength = 512;
numBands = 50;
segmentSamples = round(segmentDuration*fs);
frameSamples = round(frameDuration*fs);
hopSamples = round(hopDuration*fs);
overlapSamples = frameSamples - hopSamples;
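The audioFeatureExtractor object used by the transforms below does not appear in this excerpt. A plausible configuration based on the parameters above is sketched here; the Bark spectrum settings and the 16 kHz sample rate fs are assumptions.
afe = audioFeatureExtractor( ...
    SampleRate=fs, ...
    FFTLength=FFTLength, ...
    Window=hann(frameSamples,"periodic"), ...
    OverlapLength=overlapSamples, ...
    barkSpectrum=true);
setExtractorParameters(afe,"barkSpectrum",NumBands=numBands);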
Define a series of transforms on the audioDatastore to pad the audio to a consistent length,
extract the features, and then apply a logarithm.
transform1 = transform(adsTrain,@(x)[zeros(floor((segmentSamples-size(x,1))/2),1);x;zeros(ceil((segmentSamples-size(x,1))/2),1)]);
transform2 = transform(transform1,@(x)extract(afe,x));
transform3 = transform(transform2,@(x){log10(x+1e-6)});
Use the readall function to read all data from the datastore. As each file is read, it is passed
through the transforms before the data is returned.
XTrain = readall(transform3,UseParallel=useParallel);
The output is a numFiles-by-1 cell array. Each element of the cell array corresponds to the auditory
spectrogram extracted from a file.
numFiles = numel(XTrain)
numFiles = 28463
[numHops,numBands,numChannels] = size(XTrain{1})
numHops = 98
numBands = 50
numChannels = 1
Convert the cell array to a 4-dimensional array with auditory spectrograms along the fourth
dimension.
XTrain = cat(4,XTrain{:});
[numHops,numBands,numChannels,numFiles] = size(XTrain)
numHops = 98
numBands = 50
numChannels = 1
numFiles = 28463
Perform the feature extraction steps described above on the validation set.
transform1 = transform(adsValidation,@(x)[zeros(floor((segmentSamples-size(x,1))/2),1);x;zeros(ceil((segmentSamples-size(x,1))/2),1)]);
transform2 = transform(transform1,@(x)extract(afe,x));
transform3 = transform(transform2,@(x){log10(x+1e-6)});
XValidation = readall(transform3,UseParallel=useParallel);
XValidation = cat(4,XValidation{:});
TTrain = adsTrain.Labels;
TValidation = adsValidation.Labels;
Visualize Data
Plot the waveforms and auditory spectrograms of a few training samples. Play the corresponding
audio clips.
specMin = min(XTrain,[],"all");
specMax = max(XTrain,[],"all");
idx = randperm(numel(adsTrain.Files),3);
figure(Units="normalized",Position=[0.2,0.2,0.6,0.6]);
tlh = tiledlayout(2,3);
for ii = 1:3
[x,fs] = audioread(adsTrain.Files{idx(ii)});
nexttile(tlh,ii)
plot(x)
axis tight
title(string(adsTrain.Labels(idx(ii))))
nexttile(tlh,ii+3)
spect = XTrain(:,:,1,idx(ii))';
pcolor(spect)
clim([specMin specMax])
shading flat
sound(x,fs)
pause(2)
end
Create a simple network architecture as an array of layers. Use convolutional and batch
normalization layers, and downsample the feature maps "spatially" (that is, in time and frequency)
using max pooling layers. Add a final max pooling layer that pools the input feature map globally over
time. This enforces (approximate) time-translation invariance in the input spectrograms, allowing the
network to perform the same classification independent of the exact position of the speech in time.
Global pooling also significantly reduces the number of parameters in the final fully connected layer.
To reduce the possibility of the network memorizing specific features of the training data, add a small
amount of dropout to the input to the last fully connected layer.
The network is small, as it has only five convolutional layers with few filters. numF controls the
number of filters in the convolutional layers. To increase the accuracy of the network, try increasing
the network depth by adding identical blocks of convolutional, batch normalization, and ReLU layers.
You can also try increasing the number of convolutional filters by increasing numF.
To give each class equal total weight in the loss, use class weights that are inversely proportional to
the number of training examples in each class. When using the Adam optimizer to train the network,
the training algorithm is independent of the overall normalization of the class weights.
classes = categories(TTrain);
classWeights = 1./countcats(TTrain);
classWeights = classWeights'/mean(classWeights);
numClasses = numel(classes);
timePoolSize = ceil(numHops/8);
dropoutProb = 0.2;
numF = 12;
layers = [
imageInputLayer([numHops,afe.FeatureVectorLength])
convolution2dLayer(3,numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,2*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer([timePoolSize,1])
dropoutLayer(dropoutProb)
fullyConnectedLayer(numClasses)
softmaxLayer];
To define parameters for training, use trainingOptions (Deep Learning Toolbox). Use the Adam
optimizer with a mini-batch size of 128.
miniBatchSize = 128;
validationFrequency = floor(numel(TTrain)/miniBatchSize);
options = trainingOptions("adam", ...
InitialLearnRate=3e-4, ...
MaxEpochs=15, ...
MiniBatchSize=miniBatchSize, ...
Shuffle="every-epoch", ...
Plots="training-progress", ...
Verbose=false, ...
ValidationData={XValidation,TValidation}, ...
ValidationFrequency=validationFrequency, ...
Metrics="accuracy");
Train Network
To train the network, use trainnet. If you do not have a GPU, then training the network can take
time.
trainedNet = trainnet(XTrain,TTrain,layers,@(Y,T)crossentropy(Y,T,classWeights(:),WeightsFormat="unnormalized"),options);
To calculate the final accuracy of the network on the training and validation sets, use
minibatchpredict. The network is very accurate on this data set. However, the training, validation,
and test data all have similar distributions that do not necessarily reflect real-world environments.
This limitation particularly applies to the unknown category, which contains utterances of only a
small number of words.
scores = minibatchpredict(trainedNet,XValidation);
YValidation = scores2label(scores,classes,"auto");
validationError = mean(YValidation ~= TValidation);
scores = minibatchpredict(trainedNet,XTrain);
YTrain = scores2label(scores,classes,"auto");
trainError = mean(YTrain ~= TTrain);
disp(["Training error: " + trainError*100 + " %";"Validation error: " + validationError*100 + " %
To plot the confusion matrix for the validation set, use confusionchart (Deep Learning Toolbox).
Display the precision and recall for each class by using column and row summaries.
figure(Units="normalized",Position=[0.2,0.2,0.5,0.5]);
cm = confusionchart(TValidation,YValidation, ...
Title="Confusion Matrix for Validation Data", ...
ColumnSummary="column-normalized",RowSummary="row-normalized");
sortClasses(cm,[commands,"unknown","background"])
When working on applications with constrained hardware resources, such as mobile applications, it is
important to consider the limitations on available memory and computational resources. Compute the
total size of the network in kilobytes and test its prediction speed when using a CPU. The prediction
time is the time for classifying a single input image. If you input multiple images to the network,
these can be classified simultaneously, leading to shorter prediction times per image. When
classifying streaming audio, however, the single-image prediction time is the most relevant.
for ii = 1:100
x = randn([numHops,numBands]);
predictionTimer = tic;
y = predict(trainedNet,x);
time(ii) = toc(predictionTimer);
end
Supporting Functions
function augmentDataset(datasetloc)
adsBkg = audioDatastore(fullfile(datasetloc,"background"));
fs = 16e3; % Known sample rate of the data set
segmentDuration = 1;
segmentSamples = round(segmentDuration*fs);
volumeRange = log10([1e-4,1]);
numBkgSegments = 4000;
numBkgFiles = numel(adsBkg.Files);
numSegmentsPerFile = floor(numBkgSegments/numBkgFiles);
fpTrain = fullfile(datasetloc,"train","background");
fpValidation = fullfile(datasetloc,"validation","background");
if ~datasetExists(fpTrain)
% Create directories
mkdir(fpTrain)
mkdir(fpValidation)
end
% Print progress
fprintf('Progress = %d (%%)\n',round(100*progress(adsBkg)))
end
end
end
References
[1] Warden P. "Speech Commands: A public dataset for single-word speech recognition", 2017.
Available from https://storage.googleapis.com/download.tensorflow.org/data/
speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed
under the Creative Commons Attribution 4.0 license, available here: https://creativecommons.org/
licenses/by/4.0/legalcode.
See Also
Related Examples
• “Speech Command Recognition Using Deep Learning” on page 1-905
• “Accelerate Audio Machine Learning Workflows Using a GPU” on page 1-249
Ambisonic Plugin Generation
This example shows how to create ambisonic plugins using MATLAB® higher order ambisonic (HOA) demo functions. Ambisonics is a spatial audio technique that represents a three-dimensional sound field using spherical harmonics. This example contains an encoder plugin, a function to
generate custom encoder plugins, a decoder plugin, and a function to generate custom decoder
plugins. The customization of plugin generation enables a user to specify various ambisonic orders
and various device lists for a given ambisonic configuration.
Background
Ambisonic encoding is the process of decomposing a sound field into spherical harmonics. The
encoding matrix is the amount of spherical harmonics present at a specific device position. In mode-
matching decoding, the decoding matrix is the pseudo-inverse of the encoding matrix. Ambisonic
decoding is the process of reconstructing spherical harmonics into a sound field.
This example involves higher order ambisonics, which include traditional first-order ambisonics. In
ambisonics, there is a relationship between the number of ambisonic channels and the ambisonic
order:
ambisonic_channels = (ambisonic_order + 1)^2
For example: First-order ambisonics requires four audio channels while fourth-order ambisonics
requires 25 audio channels.
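For example, you can tabulate the channel count for the first few ambisonic orders directly from this relationship:
ambisonicOrder = 1:4;
numAmbisonicChannels = (ambisonicOrder + 1).^2   % 4 9 16 25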
Supported Conventions
Ambisonic devices are divided into two groups: elements and speakers. Each device has an audio
signal and metadata describing its position and operation. Elements correspond to multi-element
microphone arrays, and speakers correspond to loudspeaker arrays for ambisonic playback.
The ambisonic encoder applies the ambisonic encoding matrix to raw audio from microphone
elements. The position (azimuth, elevation) and deviceType of the microphone elements along with
desired ambisonic order are needed to create the ambisonic encoding matrix.
The ambisonic decoder applies the ambisonic decoding matrix to ambisonic audio for playback on
speakers. The position (azimuth, elevation) and deviceType of the speakers along with desired
ambisonic order are needed to create the ambisonic decoding matrix.
In order to capture, represent, or reproduce a sound field with ambisonics, the number of devices
(elements or speakers) must be greater than or equal to the number of ambisonic channels.
For the encoding example, audio captured with a 32-channel spherical array microphone may be
encoded up to fourth-order ambisonics (25 channels). For the decoding example, a loudspeaker array
containing 64 speakers is configured for ambisonic playback up to seventh-order. If the playback
content is fourth-order ambisonics, then even though the array is set up for seventh-order, only
fourth-order ambisonics is realized through the system.
For an encoder, if the number of devices (elements) is less than the number of ambisonic channels,
then audio from the device (elements) positions may be represented in ambisonics, but a sound field
is not represented. One or more audio channels may be encoded into ambisonics in an effort to
position sources in an ambisonic field. Each encoder represents the intensity of the sound field to be
encoded at a specified device (element) location.
For a decoder, if the number of devices (speakers) is less than the number of ambisonic channels, the
devices (speakers) do not fully reproduce a sound field at the specified ambisonic order. A sound field
may be reproduced at a lower ambisonic order. For example, third-order ambisonics played on a
speaker array with 10 speakers can be realized as a second-order (9 channel) system with an
additional speaker for playback. Each decoder represents an intensity of the ambisonic field at the
specified device (speaker) position.
There are many decoding options. This example uses pseudoinverse decoding, also known as mode
matching. This decoding method favors regular-shaped device layouts. Other decoding methods may
favor irregular-shaped device layouts.
Device Type
The deviceType for encoders turns the device (element) encoding on or off for a particular element.
The deviceType for decoders turns the device (speaker) decoding on or off for a particular speaker. If
the deviceType vector is omitted, then the deviceTypes are set to 1 (on). The intention behind
deviceType is to provide flexibility of padding encoder inputs or decoder outputs with off channels to
fit an ambisonic encoder or decoder plugin into an environment with fixed channel counts such as an
8-, 16- or 32-channel audio bus.
For example: A second-order ambisonic encoder with 14 elements has 14 inputs and 9 outputs. If you
add two more devices (elements) with deviceType 0 (off) to the encoder, then the encoder has 16
inputs and 9 outputs. A fourth-order ambisonic decoder with 29 devices (speakers) has 25 inputs and
29 outputs. If you add three more devices (speakers) with deviceType 0 (off) to the decoder, then the
channel count becomes 25 inputs and 32 outputs.
When the deviceType is set to 0 (off), the azimuth and elevation for that channel are ignored;
however, a value is still needed. It is recommended to set the azimuth and elevation to 0 degrees
when the device types are set to 0 (off).
The encoder plugin inherits directly from the audioPlugin base class. The plugin constructor calls
audioexample.ambisonics.ambiencodemtrx to build the initial encoder matrix. The process
function calls audioexample.ambisonics.ambiencode to apply the encoder matrix to the audio
input. The output of the plugin is ambisonic encoded audio. The encoder matrix is recalculated only
when a plugin property is modified which minimizes computations inside the process loop.
The plugin interface populates azimuth and elevation but not device type. The idea behind device type is to add off-channels to an encoder matrix so that the matrix fits into a frame whose channel count is a multiple of eight. For example, second-order ambisonics has 9 channels, so you can create a 16-channel encoder matrix with the first 9 channels having a device type of 1 (on) and the remaining 7 channels having a device type of 0 (off).
audioTestBench(audiopluginexample.AmbiEncoderPlugin)
audioTestBench('-close')
Once designed, the audio plugin can be validated, generated, and deployed to a third-party digital
audio workstation (DAW).
The decoder plugin inherits directly from the audioPlugin base class. The plugin constructor calls
audioexample.ambisonics.ambidecodemtrx to build the initial decoder matrix. The process
function calls audioexample.ambisonics.ambidecode to apply the decoder matrix to the audio
input. The output of the plugin is decoded audio. The decoder matrix is recalculated only when a
plugin property is modified which minimizes computations inside the process loop.
The plugin interface populates azimuth and elevation but not device type. The idea behind device type is to add off-channels to a decoder matrix so that the matrix fits into a frame whose channel count is a multiple of eight. For example, second-order ambisonics has 9 channels, so you can create a 16-channel decoder matrix with the first 9 channels having a device type of 1 (on) and the remaining 7 channels having a device type of 0 (off).
audioTestBench(audiopluginexample.AmbiDecoderPlugin)
audioTestBench('-close')
order = 3;
audioexample.ambisonics.ambiGenerateDecoderPlugin(name,device,order,format)
Once designed, the audio plugin can be validated, generated, and deployed to a third-party digital
audio workstation (DAW).
See Also
Related Topics
Ambisonic Binaural Decoding
This example shows how to decode ambisonic audio into binaural audio using virtual loudspeakers. A
virtual loudspeaker is a sound source positioned on the surface of a sphere, with the listener located
at the center of the sphere. Each virtual loudspeaker has a pair of Head-Related Transfer Functions
(HRTF) associated with it: one for the left ear and one for the right ear. The virtual loudspeaker
locations along with the ambisonic order are used to calculate the ambisonic decoder matrix. The
output of the decoder is filtered by the HRTFs corresponding to the virtual loudspeaker position. The
signals from the left HRTFs are summed together and fed to the left ear. The signals from the right
HRTFs are summed together and fed to the right ear. A block diagram of the audio signal flow is
shown here.
ARIDataset = sofaread("ReferenceHRTF.sofa");
hrtfData = ARIDataset.Numerator;
ARIDataset.SourcePositionType
ans =
'spherical'
sourcePosition = ARIDataset.SourcePosition(:,[1,2]);
The ARI HRTF database used in this example is based on the work of the Acoustics Research Institute. The HRTF data and source positions in ReferenceHRTF.sofa are from the ARI NH2 subject.
The HRTF databases by the Acoustics Research Institute, Austrian Academy of Sciences, are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License: https://creativecommons.org/licenses/by-sa/3.0/.
Now that the HRTF Dataset is loaded, determine which points to pick for virtual loudspeakers. This
example picks random points distributed on the surface of a sphere and selects the points of the
HRTF dataset closest to the picked points.
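The point-picking code does not appear in this excerpt. A minimal sketch that draws nPoints azimuth/elevation pairs approximately uniformly over the sphere (the number of points and the random draw are assumptions):
nPoints = 24; % number of virtual loudspeakers (assumed)
rng default
az = 360*rand(nPoints,1) - 180;              % azimuth in degrees
el = rad2deg(asin(2*rand(nPoints,1) - 1));   % elevation in degrees, uniform over the sphere
pickedSphere = [az el];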
% Compare distributed points on the sphere to points from the HRTF dataset
pick = zeros(1, nPoints);
d = zeros(size(pickedSphere,1), size(sourcePosition,1));
for ii = 1:size(pickedSphere,1)
for jj = 1:size(sourcePosition,1)
% Calculate arc length
d(ii,jj) = acos( ...
sind(pickedSphere(ii,2))*sind(sourcePosition(jj,2)) + ...
cosd(pickedSphere(ii,2))*cosd(sourcePosition(jj,2)) * ...
cosd(pickedSphere(ii,1) - sourcePosition(jj,1)));
end
[~,Idx] = sort(d(ii,:)); % Sort points
pick(ii) = Idx(1); % Pick the closest point
end
Specify a desired ambisonic order and desired virtual loudspeaker source positions as inputs to the
audioexample.ambisonics.ambidecodemtrx helper function. The function returns an
ambisonics decoder matrix.
order = 7;
devices = sourcePosition(pick,:)';
dmtrx = audioexample.ambisonics.ambidecodemtrx(order, devices);
Create an FIR filter to perform binaural HRTF filtering based on the position of the virtual
loudspeakers.
filters = squeeze(hrtfData(pick,:,:));
filters = permute(filters,[2 1 3]);
filters = reshape(filters,size(filters,1)*size(filters,2),[]);
filt = dsp.FrequencyDomainFIRFilter(filters, SumFilteredOutputs=true);
Load the ambisonic audio file of helicopter sound and convert it to 48 kHz for compatibility with the
HRTF dataset. Specify the ambisonic format of the audio file.
Create an audio file sampled at 48 kHz for compatibility with the HRTF dataset.
desiredFs = 48e3;
[audio,fs] = audioread("Heli_16ch_ACN_SN3D.wav");
audio = resample(audio,desiredFs,fs);
audiowrite("Heli_16ch_ACN_SN3D_48.wav",audio,desiredFs);
Specify the ambisonic format of the audio file. Set up the audio input and audio output objects.
format = "acn-sn3d";
samplesPerFrame = 2048;
fileReader = dsp.AudioFileReader("Heli_16ch_ACN_SN3D_48.wav", ...
SamplesPerFrame=samplesPerFrame);
deviceWriter = audioDeviceWriter(SampleRate=desiredFs);
audioFiltered = zeros(samplesPerFrame,size(filters,1),2);
Process Audio
while ~isDone(fileReader)
audioAmbi = fileReader();
audioDecoded = audioexample.ambisonics.ambidecode(audioAmbi, dmtrx, format);
audioFiltered = 10*filt(audioDecoded);
numUnderrun = deviceWriter(audioFiltered);
end
% Release resources
release(fileReader)
release(deviceWriter)
See Also
“Ambisonic Plugin Generation” on page 1-325
Multicore Simulation of Acoustic Beamforming Using a Microphone Array
This example shows how to beamform signals received by an array of microphones to extract a desired speech signal in a noisy environment. It uses the dataflow domain in Simulink® to partition the data-driven portions of the system into multiple threads and thereby improve the performance of the simulation by executing it on your desktop's multiple cores.
Introduction
The model simulates receiving three audio signals from different directions on a 10-element uniform linear microphone array (ULA). After the addition of thermal noise at the receiver, beamforming is applied and the result is played on a sound device.
The Audio Sources subsystem reads from audio files and specifies the direction for each audio source.
The Wideband Rx Array block simulates receiving audio signals at the ULA. The first input to the
Wideband Rx Array block is a 1000x3 matrix, where the three columns of the input correspond to the
three audio sources. The second input (Ang) specifies the incident direction of the signals. The first
row of Ang specifies the azimuth angle in degrees for each signal and the second row specifies the
elevation angle in degrees for each signal. The output of this block is a 1000x10 matrix. Each column
of the output corresponds to the audio recorded at each element of the microphone array. The
microphone array's configuration is specified in the Sensor Array tab of the block dialog panel. The
Receiver Preamp block adds white noise to the received signals.
Beamforming
There are three Frost Beamformer blocks that perform beamforming on the matrix passed via the
input port X along the direction specified by the input port Ang. Each of the three beamformers
steers their beam towards one of the three sources. The output of the beamformer is played in the
Audio Device Writer block. Different sources can be selected using the Select Source block.
This example can use the dataflow domain in Simulink to automatically partition the data-driven portions of the system into multiple threads and thereby improve the performance of the simulation by executing it on your desktop's multiple cores. To learn more about dataflow and how to run
Simulink models using multiple threads, see “Multicore Execution Using Dataflow Domain”.
This example uses dataflow domain in Simulink to make use of multiple cores on your desktop to
improve simulation performance. The Domain parameter of the dataflow subsystem in this model is
set as Dataflow. You can view this by selecting the subsystem and then accessing Property
Inspector. To access Property Inspector, in the Simulink Toolstrip, on the Modeling tab, in the Design
gallery select Property Inspector or on the Simulation tab, Prepare gallery, select Property Inspector.
Dataflow domains automatically partition your model into multiple threads for better performance.
Once you set the Domain parameter to Dataflow, you can use the Multicore tab analysis to
analyze your model to get better performance. The Multicore tab is available in the toolstrip when
there is a dataflow domain in the model. To learn more about the Multicore tab, see “Perform
Multicore Analysis for Dataflow”.
For this example the Multicore tab mode is set to Simulation Profiling for simulation
performance analysis.
It is recommended to optimize model settings for optimal simulation performance. To accept the
proposed model settings, on the Multicore tab, click Optimize. Alternatively, you can use the drop-down menu below the Optimize button to change the settings individually. In this example the model
settings are already optimal.
On the Multicore tab, click the Run Analysis button to start the analysis of the dataflow domain for
simulation performance. Once the analysis is finished, the Analysis Report and Suggestions window
shows how many threads the dataflow subsystem uses during simulation.
After analyzing the model, the Analysis Report and Suggestions window shows 3 threads. This is
because the three Frost beamformer blocks are computationally intensive and can run in parallel. The
three Frost beamformer blocks however, depend on the Microphone Array and the Receiver blocks.
Pipeline delays can be used to break this dependency and increase concurrency. The Analysis Report
and Suggestions window shows the recommended number of pipeline delays as Suggested for
Increasing Concurrency. The suggested latency value is computed to give the best performance.
The following diagram shows the Analysis Report and Suggestions window where the suggested
latency is 1 for the dataflow subsystem.
Click the Accept button to use the recommended latency for the dataflow subsystem. This value can
also be entered directly in the Property Inspector for Latency parameter. Simulink shows the latency
parameter value using tags at the output ports of the dataflow subsystem.
The Analysis Report and Suggestions window now shows the number of threads as 4 meaning that
the blocks inside the dataflow subsystem simulate in parallel using 4 threads. Highlight threads
highlights the blocks with colors based on their thread allocation as shown in the Thread
Highlighting Legend. Show pipeline delays shows where pipelining delays were inserted within
the dataflow subsystem using tags.
To measure performance improvement gained by using dataflow, compare execution time of the
model with and without dataflow. The Audio Device Writer runs in real time and limits the simulation
speed of the model to real time. Comment out the Audio Device Writer block when measuring
execution time. On a Windows desktop computer with Intel® Xeon® CPU W-2133 @ 3.6 GHz 6 Cores
12 Threads processor, this model using the dataflow domain executes about 1.8 times faster than the original model.
Summary
This example showed how to beamform signals received by an array of microphones to extract a
desired speech signal in a noisy environment. It also showed how to use the dataflow domain to
automatically partition the data-driven part of the model into concurrent execution threads and run
the model using multiple threads.
Convert MIDI Files into MIDI Messages
This example shows how to convert ordinary MIDI files into MIDI message representation using Audio Toolbox™. In this example, you read a MIDI file byte-by-byte, parse the bytes into midimsg objects, and play the parsed messages using a simple synthesizer.
For more information about interacting with MIDI devices using MATLAB, see “MIDI Device
Interface” on page 5-2. To learn more about MIDI in general, consult The MIDI Manufacturers
Association.
Introduction
MIDI files contain MIDI messages, timing information, and metadata about the encoded music. This
example shows how to extract MIDI messages and timing information. To simplify the code, this
example ignores metadata. Because metadata includes information like time signature and tempo,
this example assumes the MIDI file is in 4/4 time at 120 beats per minute (BPM).
Read a MIDI file using the fread function. The fread function returns a vector of bytes, represented
as integers.
readme = fopen('CmajorScale.mid');
[readOut, byteCount] = fread(readme);
fclose(readme);
MIDI files have header chunks and track chunks. Header chunks provide basic information required
to interpret the rest of the file. MIDI files always start with a header chunk. Track chunks come after
the header chunk. Track chunks provide the MIDI messages, timing information, and metadata of the
file. Each track chunk has a track chunk header that includes the length of the track chunk. The track
chunk contains MIDI events after the track chunk header. Every MIDI event has a delta-time and a
MIDI message.
The MIDI header chunk includes the timing division of the file. The timing division determines how to
interpret the resolution of ticks in the MIDI file. Ticks are the unit of time used to set timestamps for
MIDI files. A MIDI file with more ticks per unit time has MIDI messages with more granular time
stamps. Timing division does not determine tempo. MIDI files specify timing division either by ticks
per quarter note or frames per second. This example assumes the MIDI timing division is in ticks per
quarter note.
The fread function reads binary files byte-by-byte, but the timing division is stored as a 16-bit (2-
byte) value. To evaluate multiple bytes as one value, use the polyval function. A vector of bytes can
be evaluated as a polynomial where x is set at 256. For example, the vector of bytes [1 2 3] can be
evaluated as:
1*256^2 + 2*256^1 + 3*256^0
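In MATLAB, this evaluation is a single polyval call:
bytes = [1 2 3];
value = polyval(bytes,256)   % returns 66051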
The MIDI track chunk contains a header and MIDI events. The track chunk header contains the
length of the track chunk. The rest of the track chunk contains one or more MIDI events.
• A delta-time value—The time difference in ticks between the previous MIDI track event and the
current one
• A MIDI message—The raw data of the MIDI track event
To parse MIDI track events sequentially, construct a loop within a loop. In the outer loop, parse track
chunks, iterating by chunkIndex. In the inner loop, parse MIDI events, iterating by a pointer ptr.
MIDI message:
NoteOn Channel: 1 Note: 60 Velocity: 127 Timestamp: 0 [ 90 3C 7F ]
NoteOff Channel: 1 Note: 60 Velocity: 0 Timestamp: 0.5 [ 80 3C 00 ]
NoteOn Channel: 1 Note: 62 Velocity: 127 Timestamp: 0.5 [ 90 3E 7F ]
NoteOff Channel: 1 Note: 62 Velocity: 0 Timestamp: 1 [ 80 3E 00 ]
NoteOn Channel: 1 Note: 64 Velocity: 127 Timestamp: 1 [ 90 40 7F ]
NoteOff Channel: 1 Note: 64 Velocity: 0 Timestamp: 1.5 [ 80 40 00 ]
NoteOn Channel: 1 Note: 65 Velocity: 127 Timestamp: 1.5 [ 90 41 7F ]
NoteOff Channel: 1 Note: 65 Velocity: 0 Timestamp: 1.75 [ 80 41 00 ]
NoteOn Channel: 1 Note: 67 Velocity: 127 Timestamp: 2 [ 90 43 7F ]
NoteOff Channel: 1 Note: 67 Velocity: 0 Timestamp: 2.5 [ 80 43 00 ]
NoteOn Channel: 1 Note: 69 Velocity: 127 Timestamp: 2.5 [ 90 45 7F ]
NoteOff Channel: 1 Note: 69 Velocity: 0 Timestamp: 3 [ 80 45 00 ]
NoteOn Channel: 1 Note: 71 Velocity: 127 Timestamp: 3 [ 90 47 7F ]
NoteOff Channel: 1 Note: 71 Velocity: 0 Timestamp: 3.5 [ 80 47 00 ]
NoteOn Channel: 1 Note: 72 Velocity: 127 Timestamp: 3.5 [ 90 48 7F ]
NoteOff Channel: 1 Note: 72 Velocity: 0 Timestamp: 3.75 [ 80 48 00 ]
NoteOn Channel: 1 Note: 72 Velocity: 127 Timestamp: 4 [ 90 48 7F ]
NoteOff Channel: 1 Note: 72 Velocity: 0 Timestamp: 4.5 [ 80 48 00 ]
NoteOn Channel: 1 Note: 71 Velocity: 127 Timestamp: 4.5 [ 90 47 7F ]
NoteOff Channel: 1 Note: 71 Velocity: 0 Timestamp: 5 [ 80 47 00 ]
NoteOn Channel: 1 Note: 69 Velocity: 127 Timestamp: 5 [ 90 45 7F ]
NoteOff Channel: 1 Note: 69 Velocity: 0 Timestamp: 5.5 [ 80 45 00 ]
NoteOn Channel: 1 Note: 67 Velocity: 127 Timestamp: 5.5 [ 90 43 7F ]
NoteOff Channel: 1 Note: 67 Velocity: 0 Timestamp: 5.75 [ 80 43 00 ]
NoteOn Channel: 1 Note: 65 Velocity: 127 Timestamp: 6 [ 90 41 7F ]
NoteOff Channel: 1 Note: 65 Velocity: 0 Timestamp: 6.5 [ 80 41 00 ]
NoteOn Channel: 1 Note: 64 Velocity: 127 Timestamp: 6.5 [ 90 40 7F ]
NoteOff Channel: 1 Note: 64 Velocity: 0 Timestamp: 7 [ 80 40 00 ]
NoteOn Channel: 1 Note: 62 Velocity: 127 Timestamp: 7 [ 90 3E 7F ]
NoteOff Channel: 1 Note: 62 Velocity: 0 Timestamp: 7.5 [ 80 3E 00 ]
NoteOn Channel: 1 Note: 60 Velocity: 127 Timestamp: 7.5 [ 90 3C 7F ]
NoteOff Channel: 1 Note: 60 Velocity: 0 Timestamp: 7.75 [ 80 3C 00 ]
AllNotesOff Channel: 1 Timestamp: 8 [ B0 7B 00 ]
This example plays parsed MIDI messages using a simple monophonic synthesizer. To see a
demonstration of this synthesizer, see “Design and Play a MIDI Synthesizer” on page 4-2.
simplesynth(msgArray,osc,deviceWriter);
You can also send parsed MIDI messages to a MIDI device using midisend. For more information
about interacting with MIDI devices using MATLAB, see “MIDI Device Interface” on page 5-2.
Helper Functions
Read Delta-Times
The delta-times of MIDI track events are stored as variable-length values. These values are 1 to 4
bytes long, with the most significant bit of each byte serving as a flag. The most significant bit of the
final byte is set to 0, and the most significant bit of every other byte is set to 1.
In a MIDI track event, the delta-time is always placed before the MIDI message. There is no gap
between a delta-time and the end of the previous MIDI event.
The findVariableLength function reads variable-length values like delta-times. It returns the
length of the input value and the value itself. First, the function creates a 4-byte vector byteStream,
which is set to all zeros. Then, it pushes a pointer to the beginning of the MIDI event. The function
checks the four bytes after the pointer in a loop. For each byte, it checks the most significant bit
(MSB). If the MSB is zero, findVariableLength adds the byte to byteStream and exits the loop.
Otherwise, it adds the byte to byteStream and continues to the next byte.
Once the findVariableLength function reaches the final byte of the variable-length value, it
evaluates the bytes collected in byteStream using the polyval function.
function [valueOut,byteLength] = findVariableLength(lengthIndex,readOut)
byteStream = zeros(4,1);
for i = 1:4
valCheck = readOut(lengthIndex+i);
byteStream(i) = bitand(valCheck,127); % Mask MSB for value
if ~bitand(valCheck,uint32(128)) % If MSB is 0, no need to append further
break
end
end
valueOut = polyval(byteStream(1:i),128); % Base is 128 because 7 bits are used for value
byteLength = i;
end
To interpret a MIDI message, read the status byte. The status byte is the first byte of a MIDI message.
Even though this example ignores Sysex messages and meta-events, it is important to identify these
messages and determine their lengths. The lengths of Sysex messages and meta-events are key to
determining where the next message starts. Sysex messages have 'F0' or 'F7' as the status byte,
and meta-events have 'FF' as the status byte. Sysex messages and meta-events can be of varying
lengths. After the status byte, Sysex messages and meta-events specify event lengths. The event
length values are variable-length values like delta-time values. The length of the event can be
determined using the findVariableLength function.
For MIDI messages, the message length can be determined by the value of the status byte. However,
MIDI files support running status. If a MIDI message has the same status byte as the previous MIDI
message, the status byte can be omitted. If the first byte of an incoming message is not a valid status
byte, use the status byte of the previous MIDI message.
The interpretMessage function returns a status byte, a length, and a vector of bytes. The status
byte is returned to the inner loop in case the next message is a running status message. The length is
returned to the inner loop, where it specifies how far to push the inner loop pointer. Finally, the
vector of bytes carries the raw binary data of a MIDI message. interpretMessage requires an
output even if the function ignores a given message. For Sysex messages and meta-events,
interpretMessage returns -1 instead of a vector of bytes.
function [statusOut,lenOut,message] = interpretMessage(statusIn,eventIn,readOut)
introValue = readOut(eventIn+1);
if isStatusByte(introValue)
statusOut = introValue; % New status
running = false;
else
statusOut = statusIn; % Running status—Keep old status
running = true;
end
switch statusOut
case 255 % Meta-event (FF)—IGNORE
[eventLength, lengthLen] = findVariableLength(eventIn+2, ...
readOut); % Meta-events have an extra byte for type of meta-event
lenOut = 2+lengthLen+eventLength;
message = -1;
case 240 % Sysex message (F0)—IGNORE
[eventLength, lengthLen] = findVariableLength(eventIn+1, ...
readOut);
lenOut = 1+lengthLen+eventLength;
message = -1;
otherwise % Channel voice message
eventLength = msgnbytes(statusOut);
if running % Running status: the status byte is not retransmitted
lenOut = eventLength - 1;
message = uint8([statusOut;readOut(eventIn+(1:lenOut))]);
else
lenOut = eventLength;
message = uint8(readOut(eventIn+(1:lenOut)));
end
end
end
% ----
function n = msgnbytes(statusByte)
n = 2;
else
n = 1;
end
end
% ----
The midimsg object can generate a MIDI message from a struct using the format:
midiStruct = struct('RawBytes',[144 65 127 0 0 0 0 0],'Timestamp',1);
msg = midimsg.fromStruct(midiStruct)
This returns:
msg =
MIDI message:
NoteOn Channel: 1 Note: 65 Velocity: 127 Timestamp: 1 [ 90 41 7F ]
The createMessage function returns a midimsg object and a timestamp. The midimsg object
requires its input struct to have two fields:
To set the RawBytes field, take the vector of bytes created by interpretMessage and append
enough zeros to create a 1-by-8 vector of bytes.
To set the Timestamp field, create a timestamp variable ts. Set ts to 0 before parsing any track
chunks. For every MIDI message sent, convert the delta-time value from ticks to seconds. Then, add
that value to ts. To convert MIDI ticks to seconds, use:
timeAdd = (numTicks * tempo) / (ticksPerQuarterNote * 1e6)
Where tempo is in microseconds (μs) per quarter note. To convert beats per minute (BPM) to μs per
quarter note, use:
tempo = 6e7 / BPM
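For example, at 120 BPM and an assumed timing division of 96 ticks per quarter note, a delta-time of 48 ticks corresponds to 0.25 seconds:
bpm = 120;
ticksPerQuarterNote = 96;   % assumed timing division for illustration
tempo = 6e7/bpm;            % microseconds per quarter note
numTicks = 48;
timeAdd = numTicks*tempo/(ticksPerQuarterNote*1e6)   % 0.25 seconds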
Once you fill both fields of the struct, create a midimsg object. Return the midimsg object and the
modified value of ts.
The createMessage function ignores Sysex messages and meta-events. When interpretMessage
handles Sysex messages and meta-events, it returns -1 instead of a vector of bytes. The
createMessage function then checks for that value. If createMessage identifies a Sysex message
or meta-event, it returns the ts value it was given and an empty midimsg object.
function [tsOut,msgOut] = createMessage(messageIn,tsIn,deltaTimeIn,ticksPerQNoteIn,bpmIn)
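% The body of createMessage is omitted from this excerpt. The following is a
% minimal sketch of the steps described above, not the shipped implementation.
if isequal(messageIn,-1)            % Sysex message or meta-event: skip
    tsOut = tsIn;
    msgOut = midimsg.empty;         % empty midimsg object
    return
end
rawBytes = zeros(1,8);
rawBytes(1:numel(messageIn)) = messageIn;                   % pad to a 1-by-8 vector
tempo = 6e7/bpmIn;                                          % microseconds per quarter note
tsOut = tsIn + double(deltaTimeIn)*tempo/(ticksPerQNoteIn*1e6);
midiStruct = struct('RawBytes',rawBytes,'Timestamp',tsOut);
msgOut = midimsg.fromStruct(midiStruct);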
end
This example plays parsed MIDI messages using a simple monophonic synthesizer. To see a
demonstration of this synthesizer, see “Design and Play a MIDI Synthesizer” on page 4-2.
You can also send parsed MIDI messages to a MIDI device using midisend. For more information
about interacting with MIDI devices using MATLAB, see “MIDI Device Interface” on page 5-2.
function simplesynth(msgArray,osc,deviceWriter)
i = 1;
tic
endTime = msgArray(length(msgArray)).Timestamp;
end
% ----
% ----
% ----
Cocktail Party Source Separation Using Deep Learning Networks
This example shows how to isolate a speech signal using a deep learning network.
Introduction
The cocktail party effect refers to the ability of the brain to focus on a single speaker while filtering
out other voices and background noise. Humans perform very well at the cocktail party problem. This
example shows how to use a deep learning network to separate individual speakers from a speech
mix where one male and one female are speaking simultaneously.
Before going into the example in detail, you will download a pre-trained network and 4 audio files.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio/examples","cocktailpartyfc.zip");
dataFolder = tempdir;
dataset = fullfile(dataFolder,"CocktailPartySourceSeparation");
unzip(downloadFolder,dataset)
Problem Summary
Load audio files containing male and female speech sampled at 4 kHz. Listen to the audio files
individually for reference.
[mSpeech,Fs] = audioread(fullfile(dataset,"MaleSpeech-16-4-mono-20secs.wav"));
sound(mSpeech,Fs)
[fSpeech] = audioread(fullfile(dataset,"FemaleSpeech-16-4-mono-20secs.wav"));
sound(fSpeech,Fs)
Combine the two speech sources. Ensure the sources have equal power in the mix. Scale the mix so
that its max amplitude is one.
mSpeech = mSpeech/norm(mSpeech);
fSpeech = fSpeech/norm(fSpeech);
ampAdj = max(abs([mSpeech;fSpeech]));
mSpeech = mSpeech/ampAdj;
fSpeech = fSpeech/ampAdj;
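The mix itself is not shown in this excerpt; forming it follows directly from the description above:
mix = mSpeech + fSpeech;
mix = mix/max(abs(mix));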
Visualize the original and mix signals. Listen to the mixed speech signal. This example shows a source
separation scheme that extracts the male and female sources from the speech mix.
t = (0:numel(mix)-1)*(1/Fs);
figure
tiledlayout(3,1)
nexttile
plot(t,mSpeech)
title("Male Speech")
grid on
nexttile
plot(t,fSpeech)
title("Female Speech")
grid on
nexttile
plot(t,mix)
title("Speech Mix")
xlabel("Time (s)")
grid on
Time-Frequency Representation
Use stft to visualize the time-frequency (TF) representation of the male, female, and mix speech
signals. Use a Hann window of length 128, an FFT length of 128, and an overlap length of 96.
windowLength = 128;
fftLength = 128;
overlapLength = 96;
win = hann(windowLength,"periodic");
figure
tiledlayout(3,1)
nexttile
stft(mSpeech,Fs,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided")
title("Male Speech")
nexttile
stft(fSpeech,Fs,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided")
title("Female Speech")
nexttile
stft(mix,Fs,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided")
title("Mix Speech")
The application of a TF mask has been shown to be an effective method for separating desired audio
signals from competing sounds. A TF mask is a matrix of the same size as the underlying STFT. The
mask is multiplied element-by-element with the underlying STFT to isolate the desired source. The TF
mask can be binary or soft.
In an ideal binary mask, the mask cell values are either 0 or 1. If the power of the desired source is
greater than the combined power of other sources at a particular TF cell, then that cell is set to 1.
Otherwise, the cell is set to 0.
Compute the ideal binary mask for the male speaker and then visualize it.
P_M = stft(mSpeech,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided");
P_F = stft(fSpeech,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided");
[P_mix,F] = stft(mix,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided");
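The mask computation itself is not shown here; a minimal sketch that compares the magnitudes bin by bin, as described above:
binaryMask = abs(P_M) >= abs(P_F);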
figure
plotMask(binaryMask,windowLength - overlapLength,F,Fs)
Estimate the male speech STFT by multiplying the mix STFT by the male speaker's binary mask.
Estimate the female speech STFT by multiplying the mix STFT by the inverse of the male speaker's
binary mask.
P_M_Hard = P_mix.*binaryMask;
P_F_Hard = P_mix.*(1-binaryMask);
Estimate the male and female audio signals using the inverse short-time FFT (ISTFT). Visualize the
estimated and original signals. Listen to the estimated male and female speech signals.
mSpeech_Hard = istft(P_M_Hard,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided",ConjugateSymmetric=true);
fSpeech_Hard = istft(P_F_Hard,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided",ConjugateSymmetric=true);
figure
tiledlayout(2,2)
nexttile
plot(t,mSpeech)
axis([t(1) t(end) -1 1])
title("Original Male Speech")
grid on
nexttile
plot(t,mSpeech_Hard)
axis([t(1) t(end) -1 1])
xlabel("Time (s)")
title("Estimated Male Speech")
grid on
nexttile
plot(t,fSpeech)
axis([t(1) t(end) -1 1])
title("Original Female Speech")
grid on
nexttile
plot(t,fSpeech_Hard)
axis([t(1) t(end) -1 1])
title("Estimated Female Speech")
xlabel("Time (s)")
grid on
sound(mSpeech_Hard,Fs)
sound(fSpeech_Hard,Fs)
In a soft mask, the TF mask cell value is equal to the ratio of the desired source power to the total
mix power. TF cells have values in the range [0,1].
Compute the soft mask for the male speaker. Estimate the STFT of the male speaker by multiplying
the mix STFT by the male speaker's soft mask. Estimate the STFT of the female speaker by
multiplying the mix STFT by the female speaker's soft mask.
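The soft mask computation is not shown in this excerpt; a minimal sketch based on the magnitude ratio described above:
softMask = abs(P_M)./(abs(P_M) + abs(P_F) + eps);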
Estimate the male and female audio signals using the ISTFT.
P_M_Soft = P_mix.*softMask;
P_F_Soft = P_mix.*(1-softMask);
mSpeech_Soft = istft(P_M_Soft,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided",ConjugateSymmetric=true);
fSpeech_Soft = istft(P_F_Soft,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided",ConjugateSymmetric=true);
Visualize the estimated and original signals. Listen to the estimated male and female speech signals.
Note that the results are very good because the mask is created with full knowledge of the separated
male and female signals.
figure
tiledlayout(2,2)
nexttile
plot(t,mSpeech)
axis([t(1) t(end) -1 1])
title("Original Male Speech")
grid on
nexttile
plot(t,mSpeech_Soft)
axis([t(1) t(end) -1 1])
title("Estimated Male Speech")
grid on
nexttile
plot(t,fSpeech)
axis([t(1) t(end) -1 1])
xlabel("Time (s)")
title("Original Female Speech")
grid on
nexttile
plot(t,fSpeech_Soft)
axis([t(1) t(end) -1 1])
xlabel("Time (s)")
title("Estimated Female Speech")
grid on
sound(mSpeech_Soft,Fs)
sound(fSpeech_Soft,Fs)
The goal of the deep learning network in this example is to estimate the ideal soft mask described
above. The network estimates the mask corresponding to the male speaker. The female speaker mask
is derived directly from the male mask.
The basic deep learning training scheme is shown below. The predictor is the magnitude spectra of
the mixed (male + female) audio. The target is the ideal soft masks corresponding to the male
speaker. The regression network uses the predictor input to minimize the mean square error between
its output and the input target. At the output, the audio STFT is converted back to the time domain
using the output magnitude spectrum and the phase of the mix signal.
You transform the audio to the frequency domain using the Short-Time Fourier transform (STFT),
with a window length of 128 samples, an overlap of 127, and a Hann window. You reduce the size of
the spectral vector to 65 by dropping the frequency samples corresponding to negative frequencies
(because the time-domain speech signal is real, this does not lead to any information loss). The
predictor input consists of 20 consecutive STFT vectors. The output is a 65-by-20 soft mask.
You use the trained network to estimate the male speech. The input to the trained network is the
mixture (male + female) speech audio.
This section illustrates how to generate the target and predictor signals from the training dataset.
Read in training signals consisting of around 400 seconds of speech from male and female speakers,
respectively, sampled at 4 kHz. The low sample rate is used to speed up training. Trim the training
signals so that they are the same length.
mSpeechTrain = audioread(fullfile(dataset,"MaleSpeech-16-4-mono-405secs.wav"));
fSpeechTrain = audioread(fullfile(dataset,"FemaleSpeech-16-4-mono-405secs.wav"));
L = min(length(mSpeechTrain),length(fSpeechTrain));
mSpeechTrain = mSpeechTrain(1:L);
fSpeechTrain = fSpeechTrain(1:L);
Read in validation signals consisting of around 20 seconds of speech from male and female speakers,
respectively, sampled at 4 kHz. Trim the validation signals so that they are the same length.
mSpeechValidate = audioread(fullfile(dataset,"MaleSpeech-16-4-mono-20secs.wav"));
fSpeechValidate = audioread(fullfile(dataset,"FemaleSpeech-16-4-mono-20secs.wav"));
L = min(length(mSpeechValidate),length(fSpeechValidate));
mSpeechValidate = mSpeechValidate(1:L);
fSpeechValidate = fSpeechValidate(1:L);
Scale the training signals to the same power. Scale the validation signals to the same power.
mSpeechTrain = mSpeechTrain/norm(mSpeechTrain);
fSpeechTrain = fSpeechTrain/norm(fSpeechTrain);
ampAdj = max(abs([mSpeechTrain;fSpeechTrain]));
mSpeechTrain = mSpeechTrain/ampAdj;
fSpeechTrain = fSpeechTrain/ampAdj;
mSpeechValidate = mSpeechValidate/norm(mSpeechValidate);
fSpeechValidate = fSpeechValidate/norm(fSpeechValidate);
ampAdj = max(abs([mSpeechValidate;fSpeechValidate]));
mSpeechValidate = mSpeechValidate/ampAdj;
fSpeechValidate = fSpeechValidate/ampAdj;
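The training and validation mixes used below are formed the same way as the mix in the earlier section (sketch):
mixTrain = mSpeechTrain + fSpeechTrain;
mixTrain = mixTrain/max(abs(mixTrain));

mixValidate = mSpeechValidate + fSpeechValidate;
mixValidate = mixValidate/max(abs(mixValidate));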
P_mix0 = abs(stft(mixTrain,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided"));
P_M = abs(stft(mSpeechTrain,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided"));
P_F = abs(stft(fSpeechTrain,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided"));
Take the log of the mix STFT. Normalize the values by their mean and standard deviation.
P_mix = log(P_mix0 + eps);
MP = mean(P_mix(:));
SP = std(P_mix(:));
P_mix = (P_mix - MP)/SP;
Generate validation STFTs. Take the log of the mix STFT. Normalize the values by their mean and
standard deviation.
P_Val_mix0 = stft(mixValidate,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided");
P_Val_M = abs(stft(mSpeechValidate,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided"));
P_Val_F = abs(stft(fSpeechValidate,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided"));
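The validation normalization mirrors the training-stage normalization above (sketch, reusing MP and SP):
P_Val_mix = log(abs(P_Val_mix0) + eps);
P_Val_mix = (P_Val_mix - MP)/SP;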
Training neural networks is easiest when the inputs to the network have a reasonably smooth
distribution and are normalized. To check that the data distribution is smooth, plot a histogram of the
STFT values of the training data.
figure
histogram(P_mix,EdgeColor="none",Normalization="pdf")
xlabel("Input Value")
ylabel("Probability Density")
Compute the training soft mask. Use this mask as the target signal while training the network.
Compute the validation soft mask. Use this mask to evaluate the mask emitted by the trained
network.
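The mask computation itself is not included in this text. A minimal sketch, assuming the ideal soft mask, that is, the ratio of the male magnitude spectrum to the sum of the male and female magnitude spectra:
maskTrain = P_M./(P_M + P_F + eps);
maskValidate = P_Val_M./(P_Val_M + P_Val_F + eps);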
To check that the target data distribution is smooth, plot a histogram of the mask values of the
training data.
figure
histogram(maskTrain,EdgeColor="none",Normalization="pdf")
xlabel("Input Value")
ylabel("Probability Density")
Create chunks of size (65, 20) from the predictor and target signals. In order to get more training
samples, use an overlap of 10 segments between consecutive chunks.
seqLen = 20;
seqOverlap = 10;
mixSequences = zeros(1 + fftLength/2,seqLen,1,0);
maskSequences = zeros(1 + fftLength/2,seqLen,1,0);
loc = 1;
while loc < size(P_mix,2) - seqLen
mixSequences(:,:,:,end+1) = P_mix(:,loc:loc+seqLen-1);
maskSequences(:,:,:,end+1) = maskTrain(:,loc:loc+seqLen-1);
loc = loc + seqOverlap;
end
Create chunks of size (65,20) from the validation predictor and target signals.
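The validation chunk arrays need the same empty preallocation used for the training chunks; that initialization is not shown, so a minimal sketch follows.
mixValSequences = zeros(1 + fftLength/2,seqLen,1,0);
maskValSequences = zeros(1 + fftLength/2,seqLen,1,0);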
loc = 1;
while loc < size(P_Val_mix,2) - seqLen
mixValSequences(:,:,:,end+1) = P_Val_mix(:,loc:loc+seqLen-1);
maskValSequences(:,:,:,end+1) = maskValidate(:,loc:loc+seqLen-1);
loc = loc + seqOverlap;
end
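The network defined below expects each predictor and target as a 1-by-1-by-1300 image, so the 65-by-20 chunks must be flattened. That reshaping step is not shown here; the following sketch produces the mixSequencesT, maskSequencesT, mixSequencesV, and maskSequencesV variables that the training and prediction calls below rely on (the variable names are taken from those calls).
mixSequencesT = reshape(mixSequences,[1 1 (1 + fftLength/2)*seqLen size(mixSequences,4)]);
mixSequencesV = reshape(mixValSequences,[1 1 (1 + fftLength/2)*seqLen size(mixValSequences,4)]);
maskSequencesT = reshape(maskSequences,[1 1 (1 + fftLength/2)*seqLen size(maskSequences,4)]);
maskSequencesV = reshape(maskValSequences,[1 1 (1 + fftLength/2)*seqLen size(maskValSequences,4)]);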
Define the layers of the network. Specify the input size to be images of size 1-by-1-by-1300. Define two hidden fully connected layers, each with 1300 neurons. Follow each hidden fully connected layer with a sigmoid layer, a batch normalization layer (which normalizes the means and standard deviations of the outputs), and a dropout layer. Add a final fully connected layer with 1300 neurons, followed by a sigmoid layer.
numNodes = (1 + fftLength/2)*seqLen;
layers = [ ...
imageInputLayer([1 1 (1 + fftLength/2)*seqLen],Normalization="None")
fullyConnectedLayer(numNodes)
BiasedSigmoidLayer(6)
batchNormalizationLayer
dropoutLayer(0.1)
fullyConnectedLayer(numNodes)
BiasedSigmoidLayer(6)
batchNormalizationLayer
dropoutLayer(0.1)
fullyConnectedLayer(numNodes)
BiasedSigmoidLayer(0)
];
Specify the training options for the network. Set MaxEpochs to 3 so that the network makes three
passes through the training data. Set MiniBatchSize to 64 so that the network looks at 64 training
signals at a time. Set Plots to training-progress to generate plots that show the training
progress as the number of iterations increases. Set Verbose to false to disable printing the table
output that corresponds to the data shown in the plot into the command line window. Set Shuffle to
every-epoch to shuffle the training sequences at the beginning of each epoch. Set
LearnRateSchedule to piecewise to decrease the learning rate by a specified factor (0.1) every
time a certain number of epochs (1) has passed. Set ValidationData to the validation predictors
and targets. Set ValidationFrequency such that the validation mean square error is computed
once per epoch. This example uses the adaptive moment estimation (ADAM) solver.
maxEpochs = 3;
miniBatchSize = 64;
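The trainingOptions call itself is not shown in this text. The following is a minimal sketch consistent with the settings described above; pairing the validation targets with the same permutation used in the trainnet call below is an assumption.
options = trainingOptions("adam", ...
    MaxEpochs=maxEpochs, ...
    MiniBatchSize=miniBatchSize, ...
    Shuffle="every-epoch", ...
    Verbose=false, ...
    Plots="training-progress", ...
    LearnRateSchedule="piecewise", ...
    LearnRateDropFactor=0.1, ...
    LearnRateDropPeriod=1, ...
    ValidationFrequency=floor(size(mixSequencesT,4)/miniBatchSize), ...
    ValidationData={mixSequencesV,permute(maskSequencesV,[4 3 1 2])});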
Train the network with the specified training options and layer architecture using trainnet.
Because the training set is large, the training process can take several minutes. To load a pre-trained
network, set speedupExample to true.
speedupExample = false; % set to true to load the pretrained network instead of training
if ~speedupExample
lossFcn = @(Y,T)0.5*l2loss(Y,T,NormalizationFactor="batch-size");
CocktailPartyNet = trainnet(mixSequencesT,permute(maskSequencesT,[4 3 1 2]),layers,lossFcn,options);
else
s = load(fullfile(dataset,"CocktailPartyNet.mat"));
CocktailPartyNet = s.CocktailPartyNet;
end
Pass the validation predictors to the network. The output is the estimated mask. Reshape the
estimated mask.
estimatedMasks0 = predict(CocktailPartyNet,mixSequencesV);
estimatedMasks0 = estimatedMasks0.';
estimatedMasks0 = reshape(estimatedMasks0,1 + fftLength/2,numel(estimatedMasks0)/(1 + fftLength/2));
Plot a histogram of the error between the actual and expected mask.
figure
histogram(maskValSequences(:) - estimatedMasks0(:),EdgeColor="none",Normalization="pdf")
xlabel("Mask Error")
ylabel("Probability Density")
Estimate the male and female soft masks.
SoftMaleMask = estimatedMasks0;
SoftFemaleMask = 1 - SoftMaleMask;
Shorten the mix STFT to match the size of the estimated mask.
P_Val_mix0 = P_Val_mix0(:,1:size(SoftMaleMask,2));
Multiply the mix STFT by the male soft mask to get the estimated male speech STFT.
P_Male = P_Val_mix0.*SoftMaleMask;
Use the ISTFT to get the estimated male audio signal. Scale the audio.
maleSpeech_est_soft = istft(P_Male,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided",ConjugateSymmetric=true);
maleSpeech_est_soft = maleSpeech_est_soft/max(abs(maleSpeech_est_soft));
range = windowLength:numel(maleSpeech_est_soft)-windowLength;
t = range*(1/Fs);
Visualize the estimated and original male speech signals. Listen to the estimated soft mask male
speech.
sound(maleSpeech_est_soft(range),Fs)
figure
tiledlayout(2,1)
nexttile
plot(t,mSpeechValidate(range))
title("Original Male Speech")
xlabel("Time (s)")
grid on
nexttile
plot(t,maleSpeech_est_soft(range))
xlabel("Time (s)")
title("Estimated Male Speech (Soft Mask)")
grid on
Multiply the mix STFT by the female soft mask to get the estimated female speech STFT. Use the ISTFT to get the estimated female audio signal. Scale the audio.
P_Female = P_Val_mix0.*SoftFemaleMask;
femaleSpeech_est_soft = istft(P_Female,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided",ConjugateSymmetric=true);
femaleSpeech_est_soft = femaleSpeech_est_soft/max(abs(femaleSpeech_est_soft));
Visualize the estimated and original female signals. Listen to the estimated female speech.
sound(femaleSpeech_est_soft(range),Fs)
figure
tiledlayout(2,1)
nexttile
plot(t,fSpeechValidate(range))
title("Original Female Speech")
grid on
nexttile
plot(t,femaleSpeech_est_soft(range))
xlabel("Time (s)")
title("Estimated Female Speech (Soft Mask)")
grid on
Estimate male and female binary masks by thresholding the soft masks.
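The thresholding code is not shown in this text. A minimal sketch, assuming a 0.5 threshold on the soft mask (HardMaleMask and HardFemaleMask are the names used in the code that follows):
HardMaleMask = SoftMaleMask >= 0.5;
HardFemaleMask = SoftMaleMask < 0.5;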
Multiply the mix STFT by the male binary mask to get the estimated male speech STFT. Use the
ISTFT to get the estimated male audio signal. Scale the audio.
P_Male = P_Val_mix0.*HardMaleMask;
maleSpeech_est_hard = istft(P_Male,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided",ConjugateSymmetric=true);
maleSpeech_est_hard = maleSpeech_est_hard/max(abs(maleSpeech_est_hard));
Visualize the estimated and original male speech signals. Listen to the estimated binary mask male
speech.
sound(maleSpeech_est_hard(range),Fs)
figure
tiledlayout(2,1)
nexttile
plot(t,mSpeechValidate(range))
title("Original Male Speech")
grid on
nexttile
plot(t,maleSpeech_est_hard(range))
xlabel("Time (s)")
title("Estimated Male Speech (Binary Mask)")
grid on
Multiply the mix STFT by the female binary mask to get the estimated female speech STFT. Use the ISTFT to get the estimated female audio signal. Scale the audio.
P_Female = P_Val_mix0.*HardFemaleMask;
femaleSpeech_est_hard = istft(P_Female,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided",ConjugateSymmetric=true);
femaleSpeech_est_hard = femaleSpeech_est_hard/max(abs(femaleSpeech_est_hard));
Visualize the estimated and original female speech signals. Listen to the estimated female speech.
sound(femaleSpeech_est_hard(range),Fs)
figure
tiledlayout(2,1)
nexttile
plot(t,fSpeechValidate(range))
title("Original Female Speech")
grid on
nexttile
plot(t,femaleSpeech_est_hard(range))
title("Estimated Female Speech (Binary Mask)")
grid on
Compare STFTs of a one-second segment for mix, original female and male, and estimated female and
male, respectively.
range = 7e4:7.4e4;
figure
stft(mixValidate(range),Fs,Window=win,OverlapLength=64,FFTLength=fftLength,FrequencyRange="onesided")
title("Mix STFT")
figure
tiledlayout(3,1)
nexttile
stft(mSpeechValidate(range),Fs,Window=win,OverlapLength=64,FFTLength=fftLength,FrequencyRange="onesided")
title("Male STFT (Actual)")
nexttile
stft(maleSpeech_est_soft(range),Fs,Window=win,OverlapLength=64,FFTLength=fftLength,FrequencyRange="onesided")
title("Male STFT (Estimated - Soft Mask)")
nexttile
stft(maleSpeech_est_hard(range),Fs,Window=win,OverlapLength=64,FFTLength=fftLength,FrequencyRange="onesided")
title("Male STFT (Estimated - Binary Mask)");
figure
tiledlayout(3,1)
nexttile
stft(fSpeechValidate(range),Fs,Window=win,OverlapLength=64,FFTLength=fftLength,FrequencyRange="onesided")
title("Female STFT (Actual)")
nexttile
stft(femaleSpeech_est_soft(range),Fs,Window=win,OverlapLength=64,FFTLength=fftLength,FrequencyRange="onesided")
title("Female STFT (Estimated - Soft Mask)")
nexttile
stft(femaleSpeech_est_hard(range),Fs,Window=win,OverlapLength=64,FFTLength=fftLength,FrequencyRange="onesided")
title("Female STFT (Estimated - Binary Mask)")
Related Examples
• “End-to-End Deep Speaker Separation” on page 1-85
Parametric Equalizer Design
This example shows how to design parametric equalizer filters. Parametric equalizers are digital
filters used in audio for adjusting the frequency content of a sound signal. Parametric equalizers
provide capabilities beyond those of graphic equalizers by allowing the adjustment of gain, center
frequency, and bandwidth of each filter. In contrast, graphic equalizers only allow for the adjustment
of the gain of each filter.
Typically, parametric equalizers are designed as second-order IIR filters. These filters have the
drawback that because of their low order, they can present relatively large ripple or transition
regions and may overlap with each other when several of them are connected in cascade. Audio
Toolbox™ provides the capability to design high-order IIR parametric equalizers. Such high-order
designs provide much more control over the shape of each filter. In addition, the designs special-case
to traditional second-order parametric equalizers if the order of the filter is set to two.
This example uses designParamEQ, a simple function that supports the most common designs. It also supports C code generation, which you need if you want to tune the filter at run time with generated code.
Consider the following two designs of parametric equalizers. The design specifications are the same
except for the filter order. The first design is a typical second-order parametric equalizer that boosts
the signal around 10 kHz by 5 dB. The second design does the same with a sixth-order filter. Notice
how the sixth-order filter is closer to an ideal brickwall filter when compared to the second-order
design. Obviously the approximation can be improved by increasing the filter order even further. The
price to pay for such improved approximation is increased implementation cost as more multipliers
are required.
Fs = 48e3;
N1 = 2;
N2 = 6;
G = 5; % 5 dB
Wo = 10000/(Fs/2);
BW = 4000/(Fs/2);
[B1,A1] = designParamEQ(N1,G,Wo,BW,'Orientation','row');
[B2,A2] = designParamEQ(N2,G,Wo,BW,'Orientation','row');
BQ1 = dsp.SOSFilter('Numerator',B1,'Denominator',A1);
BQ2 = dsp.SOSFilter('Numerator',B2,'Denominator',A2);
hfvt = fvtool(BQ1,BQ2,'Fs',Fs,'Color','white');
legend(hfvt,'2nd-Order Design','6th-Order Design');
One of the design parameters is the filter bandwidth, BW. In the previous example, the bandwidth
was specified as 4 kHz. The 4 kHz bandwidth occurs at half the gain (2.5 dB).
Another common design parameter is the quality factor, Q. The Q of the filter is defined as Wo/BW
(center frequency/bandwidth). It provides a measure of the sharpness of the filter, i.e., how sharply
the filter transitions between the reference value (0 dB) and the gain G. Consider two designs with
same G and Wo, but different Q values.
Fs = 48e3;
N = 2;
Q1 = 1.5;
Q2 = 10;
G = 15; % 15 dB
Wo = 6000/(Fs/2);
BW1 = Wo/Q1;
BW2 = Wo/Q2;
[B1,A1] = designParamEQ(N,G,Wo,BW1,'Orientation','row');
[B2,A2] = designParamEQ(N,G,Wo,BW2,'Orientation','row');
BQ1 = dsp.SOSFilter('Numerator',B1,'Denominator',A1);
BQ2 = dsp.SOSFilter('Numerator',B2,'Denominator',A2);
hfvt = fvtool(BQ1,BQ2,'Fs',Fs,'Color','white');
legend(hfvt,'Q = 1.5','Q = 10');
Although a higher Q factor corresponds to a sharper filter, it must also be noted that for a given
bandwidth, the Q factor increases simply by increasing the center frequency. This might seem
unintuitive. For example, the following two filters have the same Q factor, but one clearly occupies a
larger bandwidth than the other.
Fs = 48e3;
N = 2;
Q = 10;
G = 9; % 9 dB
Wo1 = 2000/(Fs/2);
Wo2 = 12000/(Fs/2);
BW1 = Wo1/Q;
BW2 = Wo2/Q;
[B1,A1] = designParamEQ(N,G,Wo1,BW1,'Orientation','row');
[B2,A2] = designParamEQ(N,G,Wo2,BW2,'Orientation','row');
BQ1 = dsp.SOSFilter('Numerator',B1,'Denominator',A1);
BQ2 = dsp.SOSFilter('Numerator',B2,'Denominator',A2);
hfvt = fvtool(BQ1,BQ2,'Fs',Fs,'Color','white');
legend(hfvt,'BW1 = 200 Hz; Q = 10','BW2 = 1200 Hz; Q = 10');
When viewed on a log-frequency scale though, the "octave bandwidth" of the two filters is the same.
hfvt = fvtool(BQ1,BQ2,'FrequencyScale','log','Fs',Fs,'Color','white');
legend(hfvt,'Fo1 = 2 kHz','Fo2 = 12 kHz');
The filter's bandwidth BW is perfectly centered around the center frequency Wo only when Wo is set to 0.5*pi (half the Nyquist rate). When Wo is closer to 0 or to pi, a warping effect causes a larger portion of the bandwidth to occur on one side of the center frequency. In the edge cases, if the center frequency is set to 0 (or pi), the entire bandwidth of the filter occurs to the right (or left) of the center frequency. The result is a so-called low (or high) shelving filter.
Fs = 48e3;
N = 4;
G = 10; % 10 dB
Wo1 = 0;
Wo2 = 1; % Corresponds to Fs/2 (Hz) or pi (rad/sample)
BW = 1000/(Fs/2); % Bandwidth occurs at 7.4 dB in this case
[B1,A1] = designParamEQ(N,G,Wo1,BW,'Orientation','row');
[B2,A2] = designParamEQ(N,G,Wo2,BW,'Orientation','row');
BQ1 = dsp.SOSFilter('Numerator',B1,'Denominator',A1);
BQ2 = dsp.SOSFilter('Numerator',B2,'Denominator',A2);
hfvt = fvtool(BQ1,BQ2,'Fs',Fs,'Color','white');
legend(hfvt,'Low Shelf Filter','High Shelf Filter');
All previous designs are examples of a parametric equalizer that boosts the signal over a certain
frequency band. You can also design equalizers that cut (attenuate) the signal in a given region.
Fs = 48e3;
N = 2;
G = -5; % -5 dB
Wo = 6000/(Fs/2);
BW = 2000/(Fs/2);
[B,A] = designParamEQ(N,G,Wo,BW,'Orientation','row');
BQ = dsp.SOSFilter('Numerator',B,'Denominator',A);
hfvt = fvtool(BQ,'Fs',Fs,'Color','white');
legend(hfvt,'G = -5 dB');
At the limit, the filter can be designed to have a gain of zero (-Inf dB) at the specified frequency. This allows you to design second-order or higher-order notch filters.
Fs = 44.1e3;
N = 8;
G = -inf;
Q = 1.8;
Wo = 60/(Fs/2); % Notch at 60 Hz
BW = Wo/Q; % Bandwidth will occur at -3 dB for this special case
[B1,A1] = designParamEQ(N,G,Wo,BW,'Orientation','row');
[NUM,DEN] = iirnotch(Wo,BW); % or [NUM,DEN] = designParamEQ(2,G,Wo,BW);
BQ1 = dsp.SOSFilter('Numerator',B1,'Denominator',A1);
BQ2 = dsp.SOSFilter('Numerator',NUM,'Denominator',DEN);
hfvt = fvtool(BQ1,BQ2,'Fs',Fs,'FrequencyScale','Log','Color','white');
legend(hfvt,'8th order notch filter','2nd order notch filter');
Parametric equalizers are usually connected in cascade (in series) so that several are used
simultaneously to equalize an audio signal. To connect several equalizers in this way, use the
dsp.FilterCascade.
Fs = 48e3;
N = 2;
G1 = 3; % 3 dB
G2 = -2; % -2 dB
Wo1 = 400/(Fs/2);
Wo2 = 1000/(Fs/2);
BW = 500/(Fs/2); % Bandwidth occurs at 7.4 dB in this case
[B1,A1] = designParamEQ(N,G1,Wo1,BW,'Orientation','row');
[B2,A2] = designParamEQ(N,G2,Wo2,BW,'Orientation','row');
BQ1 = dsp.SOSFilter('Numerator',B1,'Denominator',A1);
BQ2 = dsp.SOSFilter('Numerator',B2,'Denominator',A2);
FC = dsp.FilterCascade(BQ1,BQ2);
hfvt = fvtool(FC,'Fs',Fs,'Color','white','FrequencyScale','Log');
legend(hfvt,'Cascade of 2nd order filters');
Low-order designs such as the second-order filters above can interfere with each other if their center
frequencies are closely spaced. In the example above, the filter centered at 1 kHz was supposed to
have a gain of -2 dB. Due to the interference from the other filter, the actual gain is more like -1 dB.
Higher-order designs are less prone to such interference.
Fs = 48e3;
N = 8;
G1 = 3; % 3 dB
G2 = -2; % -2 dB
Wo1 = 400/(Fs/2);
Wo2 = 1000/(Fs/2);
BW = 500/(Fs/2); % Bandwidth occurs at 7.4 dB in this case
[B1,A1] = designParamEQ(N,G1,Wo1,BW,'Orientation','row');
[B2,A2] = designParamEQ(N,G2,Wo2,BW,'Orientation','row');
BQ1a = dsp.SOSFilter('Numerator',B1,'Denominator',A1);
BQ2a = dsp.SOSFilter('Numerator',B2,'Denominator',A2);
FC2 = dsp.FilterCascade(BQ1a,BQ2a);
hfvt = fvtool(FC,FC2,'Fs',Fs,'Color','white','FrequencyScale','Log');
legend(hfvt,'Cascade of 2nd order filters','Cascade of 8th order filters');
Octave-Band and Fractional Octave-Band Filters
This example shows how to design octave-band and fractional octave-band filters, including filter
banks and octave SPL meters. Octave-band and fractional-octave-band filters are commonly used in
acoustics. For example, octave filters are used to perform spectral analysis for noise control.
Acousticians work with octave or fractional-octave (often 1/3 octave) filter banks because they provide a meaningful measure of the noise power in different frequency bands.
Octave-Band Filter
An octave is the interval between two frequencies having a ratio of 2:1 (or 10^(3/10) ≈ 1.995 for base-10
octave ratios). An octave-band or fractional-octave-band filter is a bandpass filter determined by its
center frequency, order, and bandwidth. The magnitude attenuation limits are defined in the ANSI®
S1.11-2004 standard for three classes of filters: class 0, class 1 and class 2. Class 0 allows only
+/-0.15 dB of ripple in the passband, while class 1 allows +/-0.3 dB and class 2 allows +/-0.5 dB.
Levels of stopband attenuation vary from 60 to 75 dB, depending on the class of the filter.
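The creation of the octave filter object of that is visualized below is not part of this text. A minimal sketch, assuming a 1000 Hz, full-octave, sixth-order filter at a 48 kHz sample rate (these specific values are placeholders you can tune):
Fs = 48e3;
of = octaveFilter(1000,"1 octave",SampleRate=Fs,FilterOrder=6);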
visualize(of,"class 1")
The visualizer plot is synchronized to the object, so you can see the magnitude response update as
you change the filter parameters. The mask around the magnitude response is green if the filter
complies with the ANSI S1.11-2004 standard (including being centered at a valid frequency), and red
otherwise. To change the specifications of the filter with a graphical user interface, use
parameterTuner. You can also use the Audio Test Bench app to quickly set up a test bench for the
octave filter you designed. For example, run audioTestBench(of) to launch a test bench with
the octave filter.
Open a parameter tuner that enables you to modify the filter in real time.
parameterTuner(of)
Open a spectrum analyzer to display white noise filtered by the octave filter. You can modify the filter
settings with the parameter tuner while the loop runs.
Nx = 100000;
scope1 = spectrumAnalyzer(SampleRate=Fs,Method="filter-bank", ...
AveragingMethod="exponential",PlotAsTwoSidedSpectrum=false, ...
FrequencyScale="log",FrequencySpan="start-and-stop-frequencies", ...
StartFrequency=1,StopFrequency=Fs/2,YLimits=[-60 10], ...
RBWSource="property",RBW=1);
tic
while toc < 20
% Run for 20 seconds
x1 = randn(Nx,1);
y1 = of(x1);
scope1(y1)
end
Many applications require a complete set of octave filters to form a filter bank. To design each filter
manually, you would use getANSICenterFrequencies(of) to get a list of center frequencies for
each individual filter. However, it is usually much simpler to use the octaveFilterBank object.
Filter the output of a pink noise generator with the 1/3-octave filter bank and compute the total
power at the output of each filter.
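The 1/3-octave filter bank, pink noise source, and second spectrum analyzer used below are created earlier in the full example. A minimal sketch of that setup, with object names matching the code that follows (the scope settings mirror the first spectrum analyzer and are otherwise assumptions):
ofb = octaveFilterBank("1/3 octave",Fs);
pinkNoise = dsp.ColoredNoise(Color="pink",SamplesPerFrame=Nx);
scope2 = spectrumAnalyzer(SampleRate=Fs,Method="filter-bank", ...
    AveragingMethod="exponential",PlotAsTwoSidedSpectrum=false, ...
    FrequencyScale="log",FrequencySpan="start-and-stop-frequencies", ...
    StartFrequency=1,StopFrequency=Fs/2,YLimits=[-80 20], ...
    RBWSource="property",RBW=10);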
centerOct = getCenterFrequencies(ofb);
nbOct = numel(centerOct);
bandPower = zeros(1,nbOct);
nbSamples = 0;
tic
while toc < 10
xp = pinkNoise();
yp = ofb(xp);
bandPower = bandPower + sum(yp.^2,1);
nbSamples = nbSamples + Nx;
scope2(yp)
end
Pink noise has the same total power in each octave band, so the power between 5 Hz and 10 Hz is the
same as between 5,000 Hz and 10,000 Hz. Consequently, in the spectrum analyzer, you can observe
the 10 dB/decade fall-off that is characteristic of pink noise on a log-log scale, and how that signal is
split into the 30 1/3-octave bands. The higher frequency bands have less power density, but the log
scale means that they are also wider, so that their total power is constant.
Plot the power spectrum to show that pink noise has a flat octave spectrum.
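The octave ratio base b and the per-band power in dB used in the plot below are computed from the accumulated bandPower; that step is not shown here. A minimal sketch (the base-10 octave ratio matches the definition at the start of this example):
b = 10^(3/10);
octPower = 10*log10(bandPower/nbSamples);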
bar(log(centerOct)/log(b),octPower);
set(gca,Xticklabel=round(b.^get(gca,"Xtick"),2,"significant"));
title("1/3-Octave Power Spectrum")
xlabel("Octave Frequency Band (Hz)")
ylabel("Power (dB)")
Octave SPL
The SPL Meter object (splMeter) also supports octave-band measurements. Reproduce the same
power spectrum measurement in real time. Use a dsp.ArrayPlot object to visualize the power per
band. Use the Z-weighting option to omit the frequency weighting filter.
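The code for this measurement is not included in this text. A minimal sketch, assuming a 1/3-octave splMeter with Z-weighting and a dsp.ArrayPlot display; the loop mirrors the pink-noise loop above.
spl = splMeter(Bandwidth="1/3 octave",SampleRate=Fs,FrequencyWeighting="Z-weighting");
scope3 = dsp.ArrayPlot(Title="Octave SPL",XLabel="Band Number",YLabel="Level (dB)");
tic
while toc < 10
    xp = pinkNoise();
    splPerBand = spl(xp);
    scope3(splPerBand(end,:).')
end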
Pitch Tracking Using Multiple Pitch Estimations and HMM
This example shows how to perform pitch tracking using multiple pitch estimations, octave and
median smoothing, and a hidden Markov model (HMM).
Introduction
Pitch detection is a fundamental building block in speech processing, speech coding, and music
information retrieval (MIR). In speech and speaker recognition, pitch is used as a feature in a
machine learning system. For call centers, pitch is used to indicate the emotional state and gender of
customers. In speech therapy, pitch is used to indicate and analyze pathologies and diagnose physical
defects. In MIR, pitch is used to categorize music, for query-by-humming systems, and as a primary
feature in song identification systems.
Pitch detection for clean speech is mostly considered a solved problem. Pitch detection with noise
and multi-pitch tracking remain difficult problems. There are many algorithms that have been
extensively reported on in the literature with known trade-offs between computational cost and
robustness to different types of noise.
Usually, a pitch detection algorithm (PDA) estimates the pitch for a given time instant. The pitch
estimate is then validated or corrected within a pitch tracking system. Pitch tracking systems enforce
continuity of pitch estimates over time.
Problem Summary
Load an audio file and corresponding reference pitch for the audio file. The reference pitch is
reported every 10 ms and was determined as an average of several third-party algorithms on the
clean speech file. Regions without voiced speech are represented as nan.
[x,fs] = audioread("Counting-16-44p1-mono-15secs.wav");
load TruePitch.mat truePitch
Use the pitch function to estimate the pitch of the audio over time.
[f0,locs] = pitch(x,fs);
Two metrics are commonly reported when defining pitch error: gross pitch error (GPE) and voicing
decision error (VDE). Because the pitch algorithms in this example do not provide a voicing decision,
only GPE is reported. In this example, GPE is calculated as the percent of pitch estimates outside
±10 % of the reference pitch over the span of the voiced segments.
Calculate the GPE for regions of speech and plot the results. Listen to the clean audio signal.
isVoiced = ~isnan(truePitch);
f0(~isVoiced) = nan;
p = 0.1;
GPE = mean(abs(f0(isVoiced)-truePitch(isVoiced)) > truePitch(isVoiced).*p).*100;
t = (0:length(x)-1)/fs;
t0 = (locs-1)/fs;
sound(x,fs)
figure(1)
tiledlayout(2,1)
nexttile
plot(t,x)
ylabel("Amplitude")
title("Pitch Estimation of Clean Signal")
nexttile
plot(t0,[truePitch,f0])
legend("Reference","Estimate",Location="northwest")
ylabel("F0 (Hz)")
xlabel("Time (s)")
title("GPE = " + round(GPE,2) + " (%)")
Corrupt the audio with noise at a -5 dB SNR. Use the pitch function on the noisy audio to estimate the pitch over time. Calculate the GPE for regions of voiced speech and plot the results. Listen to the noisy audio signal.
desiredSNR = -5;
x = mixSNR(x,rand(size(x)),desiredSNR);
[f0,locs] = pitch(x,fs);
f0(~isVoiced) = nan;
GPE = mean(abs(f0(isVoiced) - truePitch(isVoiced)) > truePitch(isVoiced).*p).*100;
sound(x,fs)
figure(2)
tiledlayout(2,1)
nexttile
plot(t,x)
ylabel("Amplitude")
title("Pitch Estimation of Noisy Signal")
nexttile
plot(t0,[truePitch,f0])
legend("Reference","Estimate",Location="northwest")
ylabel("F0 (Hz)")
xlabel("Time (s)")
title("GPE = " + GPE + " (%)")
This example shows how to improve the pitch estimation of noisy speech signals using multiple pitch
candidate generation, octave-smoothing, median-smoothing, and an HMM.
help HelperPitchTracker
f0 = HelperPitchTracker(...,'EmissionMatrix',EMISSIONMATRIX) specifies
the emission matrix used for the HMM during the forward pass. The
default emission matrix was trained on the Pitch Tracking Database from
Graz University of Technology. The database consists of 4720 speech
segments with corresponding pitch trajectories derived from
laryngograph signals. The emission matrix corresponds to the
probability that a speaker leaves one pitch state to another, in the
range [50, 400] Hz. Specify the emission matrix such that rows
correspond to the current state, columns correspond to the possible
future state, and the values of the matrix correspond to the
probability of moving from the current state to the future state. If
you specify your own emission matrix, specify its corresponding
EMISSIONMATRIXRANGE. EMISSIONMATRIX must be a real N-by-N matrix of
integers.
f0 = HelperPitchTracker(...,'EmissionMatrixRange',EMISSIONMATRIXRANGE)
specifies how the EMISSIONMATRIX corresponds to Hz. If unspecified,
EMISSIONMATRIXRANGE defaults to 50:400.
The graphic provides an overview of the pitch tracking system implemented in the example function.
The following code walks through the internal workings of the HelperPitchTracker example
function.
In the first stage of the pitch tracking system, you generate multiple pitch candidates using multiple
pitch detection algorithms. The primary pitch candidates, which are generally more accurate, are
generated using algorithms based on the Summation of Residual Harmonics (SRH) [2 on page 1-407]
algorithm and the Pitch Estimation Filter with Amplitude Compression (PEFAC) [3 on page 1-407]
algorithm.
Buffer the noisy input signal into overlapped frames, and then use audio.internal.pitch.SRH to
generate 5 pitch candidates for each hop. Also return the relative confidence of each pitch candidate.
Plot the results.
RANGE = [50,400];
HOPLENGTH = round(fs.*0.01);
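The buffering and SRH call that produce f0_SRH and conf_SRH are not included in this text. The following sketch assumes the internal SRH function follows the same calling pattern as the PEF call shown later; the 60 ms window and the parameter struct fields are assumptions based on that pattern.
xBuff_SRH = buffer(x,round(0.06*fs),round(0.05*fs),"nodelay");
params_SRH = struct(Method="SRH", ...
    Range=RANGE, ...
    WindowLength=round(fs*0.06), ...
    OverlapLength=round(fs*0.06-HOPLENGTH), ...
    SampleRate=fs, ...
    NumChannels=size(x,2), ...
    SamplesPerChannel=size(x,1));
multiCandidate_params_SRH = struct(NumCandidates=5,MinPeakDistance=1);
[f0_SRH,conf_SRH] = audio.internal.pitch.SRH(xBuff_SRH,fs,params_SRH,multiCandidate_params_SRH);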
figure(3)
tiledlayout(2,1)
nexttile
plot(t0,f0_SRH)
ylabel("F0 Candidates (Hz)")
title("Multiple Candidates from SRH Pitch Estimation")
nexttile
plot(t0,conf_SRH)
ylabel("Relative Confidence")
xlabel("Time (s)")
Generate an additional set of primary pitch candidates and associated confidence using the PEF
algorithm. Generate backup candidates and associated confidences using the normalized correlation
function (NCF) algorithm and cepstrum pitch determination (CEP) algorithm. Log only the most
confident estimate from the backup candidates.
xBuff_PEF = buffer(x,round(0.06*fs),round(0.05*fs),"nodelay");
params_PEF = struct(Method="PEF", ...
Range=RANGE, ...
WindowLength=round(fs*0.06), ...
OverlapLength=round(fs*0.06-HOPLENGTH), ...
SampleRate=fs, ...
NumChannels=size(x,2), ...
SamplesPerChannel=size(x,1));
multiCandidate_params_PEF = struct(NumCandidates=5,MinPeakDistance=5);
[f0_PEF,conf_PEF] = audio.internal.pitch.PEF(xBuff_PEF,fs, ...
params_PEF, ...
multiCandidate_params_PEF);
xBuff_NCF = buffer(x,round(0.04*fs),round(0.03*fs),"nodelay");
xBuff_NCF = xBuff_NCF(:,2:end-1);
params_NCF = struct(Method="NCF", ...
Range=RANGE, ...
WindowLength=round(fs*0.04), ...
OverlapLength=round(fs*0.04-HOPLENGTH), ...
SampleRate=fs, ...
NumChannels=size(x,2), ...
SamplesPerChannel=size(x,1));
multiCandidate_params_NCF = struct(NumCandidates=5,MinPeakDistance=1);
f0_NCF = audio.internal.pitch.NCF(xBuff_NCF,fs, ...
params_NCF, ...
multiCandidate_params_NCF);
xBuff_CEP = buffer(x,round(0.04*fs),round(0.03*fs),"nodelay");
xBuff_CEP = xBuff_CEP(:,2:end-1);
params_CEP = struct(Method="CEP", ...
Range=RANGE, ...
WindowLength=round(fs*0.04), ...
OverlapLength=round(fs*0.04-HOPLENGTH), ...
SampleRate=fs, ...
NumChannels=size(x,2), ...
SamplesPerChannel=size(x,1));
multiCandidate_params_CEP = struct(NumCandidates=5,MinPeakDistance=1);
f0_CEP = audio.internal.pitch.CEP(xBuff_CEP,fs, ...
params_CEP, ...
multiCandidate_params_CEP);
BackupCandidates = [f0_NCF(:,1),f0_CEP(:,1)];
The long-term median of the pitch candidates is used to reduce the number of pitch candidates. To
calculate the long-term median pitch, first calculate the harmonic ratio. Pitch estimates are only valid
in regions of voiced speech, where the harmonic ratio is high.
hr = harmonicRatio(xBuff_PEF,fs, ...
Window=hamming(size(xBuff_NCF,1),"periodic"), ...
OverlapLength=0);
figure(4)
tiledlayout(2,1)
nexttile
plot(t,x)
ylabel("Amplitude")
nexttile
plot(t0,hr)
ylabel("Harmonic Ratio")
xlabel("Time (s)")
Use the harmonic ratio to threshold out regions that do not include voiced speech in the long-term
median calculation. After determining the long-term median, calculate lower and upper bounds for
pitch candidates. In this example, the lower and upper bounds were determined empirically as 2/3
and 4/3 the median pitch. Candidates outside of these bounds are penalized in the following stage.
idxToKeep = logical(movmedian(hr>((3/4)*max(hr)),3));
longTermMedian = median([f0_PEF(idxToKeep,1);f0_SRH(idxToKeep,1)]);
lower = max((2/3)*longTermMedian,RANGE(1));
upper = min((4/3)*longTermMedian,RANGE(2));
figure(5)
tiledlayout(1,1)
nexttile
plot(t0,[f0_PEF,f0_SRH])
hold on
plot(t0,longTermMedian.*ones(size(f0_PEF,1)),"r:",LineWidth=3)
plot(t0,upper.*ones(size(f0_PEF,1)),"r",LineWidth=2)
plot(t0,lower.*ones(size(f0_PEF,1)),"r",LineWidth=2)
hold off
xlabel("Time (s)")
ylabel("Frequency (Hz)")
title("Long Term Median")
3. Candidate Reduction
By default, candidates returned by the pitch detection algorithm are sorted in descending order of
confidence. Decrease the confidence of any primary candidate outside the lower and upper bounds.
Decrease the confidence by a factor of 10. Re-sort the candidates for both the PEF and SRH
algorithms so they are in descending order of confidence. Concatenate the candidates, keeping only
the two most confident candidates from each algorithm.
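The penalty and re-sorting step described above does not appear in this text. A minimal sketch of one way to implement it: divide the confidence of out-of-bounds candidates by 10, then re-sort each row of candidates by descending confidence.
outOfBounds = f0_PEF < lower | f0_PEF > upper;
conf_PEF(outOfBounds) = conf_PEF(outOfBounds)/10;
[conf_PEF,idx] = sort(conf_PEF,2,"descend");
for r = 1:size(f0_PEF,1)
    f0_PEF(r,:) = f0_PEF(r,idx(r,:));
end
outOfBounds = f0_SRH < lower | f0_SRH > upper;
conf_SRH(outOfBounds) = conf_SRH(outOfBounds)/10;
[conf_SRH,idx] = sort(conf_SRH,2,"descend");
for r = 1:size(f0_SRH,1)
    f0_SRH(r,:) = f0_SRH(r,idx(r,:));
end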
candidates = [f0_PEF(:,1:2),f0_SRH(:,1:2)];
confidence = [conf_PEF(:,1:2),conf_SRH(:,1:2)];
figure(6)
plot(t0,candidates)
xlabel("Time (s)")
ylabel("Frequency (Hz)")
title("Reduced Candidates")
4. Make Distinctive
If two or more candidates are within a given 5 Hz span, set the candidates to their mean and sum
their confidence.
span = 5;
confidenceFactor = 1;
for r = 1:size(candidates,1)
for c = 1:size(candidates,2)
tf = abs(candidates(r,c)-candidates(r,:)) < span;
candidates(r,c) = mean(candidates(r,tf));
confidence(r,c) = sum(confidence(r,tf))*confidenceFactor;
end
end
candidates = max(min(candidates,400),50);
Now that the candidates have been reduced, you can feed them into an HMM to enforce continuity
constraints. Pitch contours are generally continuous for speech signals when analyzed in 10 ms hops.
The probability of a pitch moving from one state to another across time is referred to as the emission
probability. Emission probabilities can be encapsulated into a matrix which describes the probability
of going from any pitch value in a set range to any other in a set range. The emission matrix used in
this example was created using the Graz database. [1 on page 1-407]
Load the emission matrix and associated range. Plot the probability density function (PDF) of a pitch
in 150 Hz state.
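The load command is not included in this text. The MAT-file name below is hypothetical; the variables it is assumed to contain, emissionMatrix and emissionMatrixRange, are the ones used in the rest of the example.
load("EmissionMatrix.mat","emissionMatrix","emissionMatrixRange") % hypothetical file name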
currentState = 150; % Hz
figure(7)
plot(emissionMatrixRange(1):emissionMatrixRange(2),emissionMatrix(currentState - emissionMatrixRa
title("Emission PDF for " + currentState + " Hz")
xlabel("Next Pitch Value (Hz)")
ylabel("Probability")
The HMM used in this example combines the emission probabilities, which enforce continuity, and the
relative confidence of the pitch. At each hop, the emission probabilities are combined with the
relative confidence to create a confidence matrix. A best choice for each path is determined as the
max of the confidence matrix. The HMM used in this example also assumes that only one path can be
assigned to a given state (an assumption of the Viterbi algorithm).
In addition to the HMM, this example monitors for octave jumps relative to the short-term median of
the pitch paths. If an octave jump is detected, then the backup pitch candidates are added as options
for the HMM.
% Preallocation
numPaths = 4;
numHops = size(candidates,1);
logbook = zeros(numHops,3,numPaths);
suspectHops = zeros(numHops,1);
% Make distinctive
span = 10;
confidenceFactor = 1.2;
for r = 1:size(nowCandidates,1)
for c = 1:size(nowCandidates,2)
tf = abs(nowCandidates(r,c)-nowCandidates(r,:)) < span;
nowCandidates(r,c) = mean(nowCandidates(r,tf));
nowConfidence(r,c) = sum(nowConfidence(r,tf))*confidenceFactor;
end
end
end
end
logbook(hopNumber,:,pageIdx) = ...
[nowCandidates(chosenPitch), ...
confidenceMatrix(chosenPitch,pastPitchIdx), ...
pastPitchIdx];
end
% Normalize confidence
logbook(hopNumber,2,:) = logbook(hopNumber,2,:)/sum(logbook(hopNumber,2,:));
end
6. Traceback of HMM
Once the forward iteration of the HMM is complete, the final pitch contour is chosen as the most
confident path. Walk backward through the log book to determine the pitch contour output by the
HMM. Calculate the GPE and plot the new pitch contour and the known contour.
numHops = size(logbook,1);
keepLooking = true;
index = numHops + 1;
while keepLooking
index = index - 1;
if abs(max(logbook(index,2,:))-min(logbook(index,2,:)))~=0
keepLooking = false;
end
end
[~,bestPathIdx] = max(logbook(index,2,:));
bestIndices = zeros(numHops,1);
bestIndices(index) = bestPathIdx;
for ii = index:-1:2
bestIndices(ii-1) = logbook(ii,3,bestIndices(ii));
end
bestIndices(bestIndices==0) = 1;
f0 = zeros(numHops,1);
for ii = (numHops):-1:2
f0(ii) = logbook(ii,1,bestIndices(ii));
end
f0toPlot = f0;
f0toPlot(~isVoiced) = NaN;
GPE = mean( abs(f0toPlot(isVoiced) - truePitch(isVoiced)) > truePitch(isVoiced).*p).*100;
figure(8)
plot(t0,[truePitch,f0toPlot])
legend("Reference","Estimate")
ylabel("F0 (Hz)")
xlabel("Time (s)")
title("GPE = " + round(GPE,2) + " (%)")
As a final post-processing step, apply a moving median filter with a window length of three hops.
Calculate the final GPE and plot the final pitch contour and the known contour.
f0 = movmedian(f0,3);
f0(~isVoiced) = NaN;
Performance Evaluation
The HelperPitchTracker function uses an HMM to apply continuity constraints to pitch contours.
The emission matrix of the HMM can be set directly. It is best to train the emission matrix on sound
sources similar to the ones you want to track.
This example uses the Pitch Tracking Database from Graz University of Technology (PTDB-TUG) [4]
on page 1-407. The data set consists of 20 English native speakers reading 2342 phonetically rich
sentences from the TIMIT corpus. Download and extract the data set.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","ptdb-tug.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"ptdb-tug");
Create an audio datastore that points to the microphone recordings in the database. Set the label
associated with each file to the location of the associated known pitch file. The dataset contains
recordings of 10 female and 10 male speakers. Use subset to isolate the 10th female and male
speakers. Train an emission matrix based on the reference pitch contours for both male and female
speakers 1 through 9.
ads = audioDatastore([fullfile(dataset,"SPEECH DATA","FEMALE","MIC"),fullfile(dataset,"SPEECH DATA","MALE","MIC")], ...
IncludeSubfolders=true, ...
FileExtensions=".wav");
wavFileNames = ads.Files;
ads.Labels = replace(wavFileNames,["MIC","mic","wav"],["REF","ref","f0"]);
idxToRemove = contains(ads.Files,["F10","M10"]);
ads1 = subset(ads,idxToRemove);
ads9 = subset(ads,~idxToRemove);
ads1 = shuffle(ads1);
ads9 = shuffle(ads9);
The emission matrix describes the probability of going from one pitch state to another. In the
following step, you create an emission matrix based on speakers 1 through 9 for both male and
female. The database stores reference pitch values, short-term energy, and additional information in
the text files with files extension f0. The getReferencePitch function reads in the pitch values if
the short-term energy is above a threshold. The threshold was determined empirically in listening
tests. The HelperUpdateEmissionMatrix creates a 2-dimensional histogram based on the current
pitch state and the next pitch state. After the histogram is created, it is normalized to create an
emission matrix.
emissionMatrixRange = [50,400];
emissionMatrix = [];
for i = 1:numel(ads9.Files)
x = getReferencePitch(ads9.Labels{i});
emissionMatrix = HelperUpdateEmissionMatrix(x,emissionMatrixRange,emissionMatrix);
end
emissionMatrix = emissionMatrix + sqrt(eps);
emissionMatrix = emissionMatrix./norm(emissionMatrix);
Define different types of background noise: white, ambiance, engine, jet, and street. Resample them
to 16 kHz to help speed up testing the database.
noiseType = ["white","ambiance","engine","jet","street"];
numNoiseToTest = numel(noiseType);
desiredFs = 16e3;
whiteNoiseMaker = dsp.ColoredNoise(Color="white",SamplesPerFrame=40000,RandomStream="mt19937ar with seed");
noise{1} = whiteNoiseMaker();
[ambiance,ambianceFs] = audioread("Ambiance-16-44p1-mono-12secs.wav");
noise{2} = resample(ambiance,desiredFs,ambianceFs);
[engine,engineFs] = audioread("Engine-16-44p1-stereo-20sec.wav");
noise{3} = resample(engine,desiredFs,engineFs);
[jet,jetFs] = audioread("JetAirplane-16-11p025-mono-16secs.wav");
noise{4} = resample(jet,desiredFs,jetFs);
[street,streetFs] = audioread("MainStreetOne-16-16-mono-12secs.wav");
noise{5} = resample(street,desiredFs,streetFs);
snrToTest = [10,5,0,-5,-10];
numSNRtoTest = numel(snrToTest);
Run the pitch detection algorithm for each SNR and noise type for each file. Calculate the average
GPE across speech files. This example compares performance with the popular pitch tracking
algorithm: Sawtooth Waveform Inspired Pitch Estimator (SWIPE). A MATLAB® implementation of the
algorithm can be found at [5 on page 1-407]. To run this example without comparing to other
algorithms, set compare to false. The following comparison takes around 15 minutes.
compare = false; % set to true to also run the third-party SWIPE implementation for comparison
numFilesToTest = 20;
p = 0.1;
GPE_pitchTracker = zeros(numSNRtoTest,numNoiseToTest,numFilesToTest);
if compare
GPE_swipe = GPE_pitchTracker;
end
for i = 1:numFilesToTest
[cleanSpeech,info] = read(ads1);
cleanSpeech = resample(cleanSpeech,desiredFs,info.SampleRate);
truePitch = getReferencePitch(info.Label{:});
isVoiced = truePitch~=0;
truePitchInVoicedRegions = truePitch(isVoiced);
for j = 1:numSNRtoTest
for k = 1:numNoiseToTest
noisySpeech = mixSNR(cleanSpeech,noise{k},snrToTest(j));
f0 = HelperPitchTracker(noisySpeech,desiredFs,EmissionMatrix=emissionMatrix,EmissionMatrixRange=emissionMatrixRange);
f0 = [0;f0]; % manual alignment with database.
GPE_pitchTracker(j,k,i) = mean(abs(f0(isVoiced) - truePitchInVoicedRegions) > truePitchInVoicedRegions.*p).*100;
if compare
f0 = swipep(noisySpeech,desiredFs,[50,400],0.01);
f0 = f0(3:end); % manual alignment with database.
GPE_swipe(j,k,i) = mean(abs(f0(isVoiced) - truePitchInVoicedRegions) > truePitchInVoicedRegions.*p).*100;
end
end
end
end
GPE_pitchTracker = mean(GPE_pitchTracker,3);
if compare
GPE_swipe = mean(GPE_swipe,3);
end
for ii = 1:numel(noise)
figure
plot(snrToTest,GPE_pitchTracker(:,ii),"b")
hold on
if compare
plot(snrToTest,GPE_swipe(:,ii),"g")
end
plot(snrToTest,GPE_pitchTracker(:,ii),"bo")
if compare
plot(snrToTest,GPE_swipe(:,ii),"gv")
end
title(noiseType(ii))
xlabel("SNR (dB)")
ylabel("Gross Pitch Error (p = " + round(p,2) + " )")
if compare
legend("HelperPitchTracker","SWIPE")
else
legend("HelperPitchTracker")
end
grid on
hold off
end
Conclusion
You can use HelperPitchTracker as a baseline for evaluating GPE performance of your pitch
tracking system, or adapt this example to your application.
References
[1] G. Pirker, M. Wohlmayr, S. Petrik, and F. Pernkopf, "A Pitch Tracking Corpus with Evaluation on
Multipitch Tracking Scenario", Interspeech, pp. 1509-1512, 2011.
[2] Drugman, Thomas, and Abeer Alwan. "Joint Robust Voicing Detection and Pitch Estimation Based
on Residual Harmonics." Proceedings of the Annual Conference of the International Speech
Communication Association, INTERSPEECH. 2011, pp. 1973-1976.
[3] Gonzalez, Sira, and Mike Brookes. "A Pitch Estimation Filter robust to high levels of noise
(PEFAC)." 19th European Signal Processing Conference. Barcelona, 2011, pp. 451-455.
[4] Signal Processing and Speech Communication Laboratory. Accessed September 26, 2018. https://
www.spsc.tugraz.at/databases-and-tools/ptdb-tug-pitch-tracking-database-from-graz-university-of-
technology.html.
Train Voice Activity Detection in Noise Model Using Deep Learning
This example shows how to detect regions of speech in a low signal-to-noise environment using deep
learning. You train a bidirectional long short-term memory (BiLSTM) network from scratch to perform
voice activity detection (VAD) and compare that network to a pretrained deep learning-based VAD. To
explore the model trained from scratch in this example, see “Voice Activity Detection in Noise Using
Deep Learning” on page 1-428. To use an off-the-shelf deep learning-based VAD, see
detectspeechnn.
Introduction
Voice activity detection is an essential component of many audio systems, such as automatic speech
recognition, speaker recognition, and audio conferencing. Voice activity detection can be especially
challenging in low signal-to-noise (SNR) situations, where speech is obstructed by noise.
rng default
In high SNR scenarios, traditional speech detection algorithms perform adequately. Read in an audio
file that consists of words spoken with pauses between and listen to it.
fs = 16e3;
[speech,fileFs] = audioread("MaleVolumeUp-16-mono-6secs.ogg");
sound(speech,fs)
Use the detectSpeech function to locate regions of speech. The detectSpeech function correctly
identifies all regions of speech.
detectSpeech(speech,fs)
Load two noise signals and resample to the audio sample rate.
[noise200,fileFs200] = audioread("WashingMachine-16-8-mono-200secs.mp3");
[noise1000,fileFs1000] = audioread("WashingMachine-16-8-mono-1000secs.mp3");
noise200 = resample(noise200,fs,fileFs200);
noise1000 = resample(noise1000,fs,fileFs1000);
Use the supporting function mixSNR on page 1-426 to corrupt the clean speech signal with washing
machine noise at a desired SNR level in dB. Listen to the corrupted audio.
SNR = -10; % dB
noisySpeech = mixSNR(speech,noise200,SNR);
sound(noisySpeech,fs)
Call detectSpeech on the noisy speech signal. The function fails to detect the speech regions given
the very low SNR. The remainder of the example walks through training and evaluating deep
learning-based VAD networks that can perform well under low SNR.
detectSpeech(noisySpeech,fs)
Download and extract the Google Speech Commands Dataset [1] on page 1-427.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","google_speech.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"google_speech");
Create audioDatastore objects to point to the training and validation data sets.
adsTrain = audioDatastore(fullfile(dataset,"train"),IncludeSubfolders=true);
adsValidation = audioDatastore(fullfile(dataset,"validation"),IncludeSubfolders=true);
The Google dataset consists of isolated words. Use the supporting function, constructSignal on
page 1-426, to construct train and validation signals that consist of isolated words and regions of
silence. The constructSignal function also returns ground truth binary masks indicating the
regions of speech in the train and validation signals.
[audioTrain,TTrainPerSample] = constructSignal(adsTrain,fs,1000);
[audioValidation,TValidationPerSample] = constructSignal(adsValidation,fs,200);
Listen to the first 10 seconds of the constructed signal. Use signalMask and plotsigroi to
visualize the signal and ground truth binary mask.
duration = 10; % seconds
sound(audioTrain(1:duration*fs),fs)
mask = signalMask(TTrainPerSample,SampleRate=fs);
plotsigroi(mask,audioTrain,true)
xlim([0,duration])
title("Clean Signal ("+duration+" seconds)")
Use the supporting function mixSNR on page 1-426 to corrupt the train and validation signals with
noise.
audioTrain = mixSNR(audioTrain,noise1000,SNR);
audioValidation = mixSNR(audioValidation,noise200,SNR);
Listen to the first 10 seconds of the train signal and visualize the signal and mask.
sound(audioTrain(1:duration*fs),fs)
plotsigroi(mask,audioTrain,true)
xlim([0,duration])
title("Training Signal ("+duration+" seconds)")
Input Pipeline
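The audioFeatureExtractor object afe used throughout this section is created before this point in the full example. A sketch under the analysis settings stated later in this section (256-sample periodic Hann window, 128-sample hop); the exact set of nine features is an assumption.
afe = audioFeatureExtractor(SampleRate=fs, ...
    Window=hann(256,"periodic"), ...
    OverlapLength=128, ...
    spectralCentroid=true, ...
    spectralCrest=true, ...
    spectralEntropy=true, ...
    spectralFlux=true, ...
    spectralKurtosis=true, ...
    spectralRolloffPoint=true, ...
    spectralSkewness=true, ...
    spectralSlope=true, ...
    harmonicRatio=true);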
featuresTrain = extract(afe,audioTrain);
Display the dimensions of the features matrix. The first dimension corresponds to the number of
windows the signal was broken into (it depends on the signal length, window length, and overlap
length). The second dimension is the number of features used in this example.
[numWindows,numFeatures] = size(featuresTrain)
numWindows = 124999
numFeatures = 9
In classification applications, it is a good practice to normalize all features to have zero mean and
unity standard deviation.
Compute the mean and standard deviation for each coefficient, and use them to normalize the data.
M = mean(featuresTrain,1);
S = std(featuresTrain,[],1);
featuresTrain = (featuresTrain - M) ./ S;
Extract features from the validation signal using the same process.
XValidation = extract(afe,audioValidation);
XValidation = (XValidation - mean(XValidation,1)) ./ std(XValidation,[],1);
Each feature corresponds to 256 samples of data (the window length), sampled every 128 samples
(the hop length). For each window, set the expected voice/no voice value to the mode of the baseline
mask values corresponding to those 256 samples. Convert the voice/no voice mask to categorical.
windowLength = numel(afe.Window);
overlapLength = afe.OverlapLength;
TTrain = mode(buffer(TTrainPerSample,windowLength,overlapLength,"nodelay"),1);
TTrain = categorical(TTrain);
TValidation = mode(buffer(TValidationPerSample,windowLength,overlapLength,"nodelay"),1);
TValidation = categorical(TValidation);
Use the supporting function featureBuffer on page 1-425 to split the training features and the
mask into sequences with a duration of approximately 8 seconds and a 75% overlap between
consecutive sequences.
sequenceDuration = 8; % seconds
analysisHopLength = numel(afe.Window) - afe.OverlapLength;
sequenceLength = round(sequenceDuration*fs/analysisHopLength);
overlapPercent = 0.75;
XTrain = featureBuffer(featuresTrain',sequenceLength,overlapPercent);
TTrain = featureBuffer(TTrain,sequenceLength,overlapPercent);
Network Architecture
LSTM networks can learn long-term dependencies between time steps of sequence data. This
example uses the bidirectional LSTM layer bilstmLayer (Deep Learning Toolbox) to look at the
sequence in both forward and backward directions.
layers = [ ...
sequenceInputLayer(afe.FeatureVectorLength)
bilstmLayer(200,OutputMode="sequence")
bilstmLayer(200,OutputMode="sequence")
fullyConnectedLayer(2)
softmaxLayer
];
Training Options
To define parameters for training, use trainingOptions (Deep Learning Toolbox). Use the Adam
optimizer with a mini-batch size of 64 and a piecewise learn rate schedule.
maxEpochs = ;
miniBatchSize = 64;
options = trainingOptions("adam", ...
MaxEpochs=maxEpochs, ...
MiniBatchSize=miniBatchSize, ...
Shuffle="every-epoch", ...
Verbose=false, ...
ValidationFrequency=floor(numel(XTrain)/miniBatchSize), ...
ValidationData={XValidation.',TValidation}, ...
Plots="training-progress", ...
LearnRateSchedule="piecewise", ...
Metrics = "Accuracy",...
LearnRateDropFactor=0.1, ...
LearnRateDropPeriod=5, ...
OutputNetwork="best-validation-loss",...
InputDataFormats = "CTB");
Train Network
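The training call itself is not shown in this text. A minimal sketch, assuming cross-entropy loss on the categorical voice/no-voice targets; speechDetectNet is the variable name used by the evaluation code below.
speechDetectNet = trainnet(XTrain,TTrain,layers,"crossentropy",options);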
Estimate voice activity in the validation signal using the trained network. Convert the estimated VAD
mask from categorical to double, then replicate the window-based decisions to sample-based
decisions.
YValidation = predict(speechDetectNet,XValidation);
YValidation = scores2label(YValidation,unique(TValidation));
YValidation = double(YValidation)-1;
wL = numel(afe.Window);
hL = wL - afe.OverlapLength;
YValidationPerSample = [repelem(YValidation(1),floor(wL/2 + hL/2),1);
repelem(YValidation(2:end-1),hL,1);
repelem(YValidation(end),ceil(wL/2 + hL/2),1)];
Calculate and plot the validation confusion matrix from the vectors of actual and estimated labels.
Save the results for later analysis.
cc = confusionchart(TValidationPerSample,YValidationPerSample, ...
title="speechDetect - Validation Confusion Chart", ...
ColumnSummary="column-normalized",RowSummary="row-normalized");
speechDetectResults = cc.NormalizedValues;
The vadnet network is a pretrained network for voice activity detection. You can use it with the
vadnetPreprocess and vadnetPostprocess functions for applications such as transfer learning,
or you can use detectspeechnn, which encapsulates vadnetPreprocess, vadnet, and
vadnetPostprocess for inference-only applications. The vadnet network performs well under
everyday adverse conditions; however, it fails in cases of extreme SNR, such as the -10 dB SNR
used in this example. Also, vadnet was trained to detect regions of continuous speech (meaning
several words in a row), not isolated words. In short, the pretrained vadnet fails for the validation
signal in this example.
net = audioPretrainedNetwork("vadnet");
Extract features from the validation signal using the same input pipeline used to train the network.
XValidation = vadnetPreprocess(audioValidation,fs);
y = predict(net,gpuArray(XValidation));
boundaries = vadnetPostprocess(audioValidation,16e3,y);
The vadnetPostprocess function returns the decisions as time boundaries. To convert the
boundaries to a binary mask that corresponds to the original signal samples, use sigroi2binmask.
YValidationPerSample = double(sigroi2binmask(boundaries,size(audioValidation,1)));
To create a confusion chart to analyze the error, use confusionchart (Deep Learning Toolbox).
confusionchart(TValidationPerSample,YValidationPerSample, ...
title="vadnet - Validation Confusion Chart", ...
ColumnSummary="column-normalized",RowSummary="row-normalized");
Transfer Learning
Apply transfer learning to the pretrained vadnet to make use of both the pretrained weights and the
network architecture.
featuresTrain = vadnetPreprocess(audioTrain,fs);
Buffer the ground truth mask so that decisions correspond to the analysis windows used in
vadnetPreprocess.
windowLength = 400;
overlapLength = 240;
TTrainPerSamplePadded = [zeros(floor(windowLength/2),1);TTrainPerSample;zeros(ceil(windowLength/2),1)];
TTrain = mode(buffer(TTrainPerSamplePadded,windowLength,overlapLength,"nodelay"),1);
TValidationPerSamplePadded = [zeros(floor(windowLength/2),1);TValidationPerSample;zeros(ceil(windowLength/2),1)];
TValidation = mode(buffer(TValidationPerSamplePadded,windowLength,overlapLength,"nodelay"),1);
Split the long training signal into overlapped sequences for training. Do the same for the ground-
truth mask.
sequenceDuration = 8; % seconds (assumed: same value as for the network trained from scratch)
analysisHopLength = windowLength - overlapLength;
sequenceLength = round(sequenceDuration*fs/analysisHopLength);
overlapPercent = 0.75; % (assumed: same overlap as for the network trained from scratch)
XTrain = featureBuffer(featuresTrain,sequenceLength,overlapPercent);
TTrain = featureBuffer(TTrain,sequenceLength,overlapPercent);
miniBatchSize = ;
maxEpochs = ;
options = trainingOptions("adam", ...
InitialLearnRate=0.01, ...
LearnRateSchedule="piecewise", ...
LearnRateDropPeriod=3, ...
MiniBatchSize=miniBatchSize, ...
Shuffle="every-epoch", ...
ValidationFrequency=floor(numel(XTrain)/miniBatchSize), ...
ValidationData={XValidation,TValidation}, ...
Verbose=false, ...
Plots="training-progress", ...
MaxEpochs=maxEpochs, ...
OutputNetwork="best-validation-loss" ...
);
noisyvadnet = trainnet(XTrain,TTrain,net,"mse",options);
Estimate voice activity in the validation signal using the trained network. Postprocess the predictions
using vadnetPostprocess, then convert the boundaries in time to a sample-based mask.
y = predict(noisyvadnet,gpuArray(XValidation));
boundaries = vadnetPostprocess(audioValidation,fs,y);
YValidationPerSample = double(sigroi2binmask(boundaries,size(audioValidation,1)));
Calculate and plot the validation confusion matrix from the vectors of actual and estimated labels.
Save the results for later analysis.
cc = confusionchart(TValidationPerSample,YValidationPerSample, ...
title="noisyvadnet - Validation Confusion Chart", ...
ColumnSummary="column-normalized",RowSummary="row-normalized");
noisyvadnetResults = cc.NormalizedValues;
Compare Networks
There are several considerations when choosing a network, such as size, inference speed, error, and
streaming capabilities.
Streaming
The speechDetectNet trained from scratch in this example is well-suited for streaming inference
because its BiLSTM layers retain state between calls. See “Voice Activity Detection in Noise Using
Deep Learning” on page 1-428 for an example of using speechDetect for streaming voice activity
detection.
The vadnet architecture consists of convolutional, recurrent, and fully-connected layers, and is not
well-suited for low-latency streaming. See the vadnet documentation for an example of streaming
VAD detection using vadnet.
Network Size
networks = ["speechDetect","noisyvadnet"];
b = bar(reordercats(categorical(networks),networks),[whos("speechDetectNet").bytes/1024,whos("noisyvadnet").bytes/1024]);
title("Network Size")
ylabel("Size (KB)")
grid on
b.FaceColor = "flat";
b.CData(2,:) = [0.8500 0.3250 0.0980];
Compare the network inference speeds. The simple speechDetect architecture has faster inference
speed on both the CPU and the GPU for short durations (approximately 8 second chunks or less). For
longer durations, speechDetect is faster than noisyvadnet on the GPU and slower on the CPU.
durationsToTest = [1,5,10,20,40];
environment = ["CPU","GPU"];
speechDetectSpeed = zeros(numel(durationsToTest),numel(environment));
noisyvadnetSpeed = zeros(numel(durationsToTest),numel(environment));
for jj = 1:numel(environment)
for ii = 1:numel(durationsToTest)
idx = 1:durationsToTest(ii)*fs;
speechDetectFeatures = extract(afe,audioValidation(idx))';
vadnetFeatures = vadnetPreprocess(audioValidation(idx),fs);
switch environment(jj)
case "CPU"
speechDetectSpeed(ii,1) = timeit(@()predict(speechDetectNet,speechDetectFeatures.'),1);
noisyvadnetSpeed(ii,1) = 0;%timeit(@()predict(noisyvadnet,vadnetFeatures),1);
case "GPU"
speechDetectSpeed(ii,2) = gputimeit(@()predict(speechDetectNet,gpuArray(speechDetectFeatures.')),1);
noisyvadnetSpeed(ii,2) = gputimeit(@()predict(noisyvadnet,gpuArray(vadnetFeatures)),1);
end
end
end
tiledlayout(2,1)
for ii = 1:numel(environment)
nexttile
plot(durationsToTest,speechDetectSpeed(:,ii),"b-", ...
durationsToTest,noisyvadnetSpeed(:,ii),"r-", ...
durationsToTest,speechDetectSpeed(:,ii),"bo", ...
durationsToTest,noisyvadnetSpeed(:,ii),"ro")
legend(["speechDetect","noisyvadnet"],Location="best")
grid on
xlabel("Audio Duration (s)")
ylabel("Computation Duration (s)")
title("Inference Speed ("+environment(ii)+")")
end
Network Error
Use the previously calculated confusion charts to display common statistics for error analysis.
Accuracy, recall, precision, and f1 score are all derived from the confusion matrices previously
plotted.
Accuracy is defined as the ratio of correctly predicted observations to the total observations. It is the most intuitive metric but can be misleading for imbalanced data sets. For example, if speech is present in only 5% of the audio, then classifying all audio as nonspeech results in 95% accuracy.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall, also called sensitivity, is the ratio of correctly predicted positive observations to all
observations that belong to the positive class. Recall answers the question: Of all speech regions, how
many were correctly classified? A low recall indicates that regions of speech were misclassified as
regions of nonspeech.
Recall = TP / (TP + FN)
Precision is the ratio of correctly predicted positive observations to the total predicted positive
observations. Precision answers the question: Of all the observations the network classified as
speech, how many were actually speech? A low precision indicates that regions of nonspeech were
misclassified as regions of speech.
Precision = TP / (TP + FP)
F1 score is the harmonic mean of the precision and recall: it accounts for both false positives and
false negatives.
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
The true measure of a network depends on your application. In real-world situations, you typically optimize a cost function that weights the costs of false positives and false negatives according to their impact.
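For illustration, such a weighted cost can be computed directly from the confusion matrices; the weights below are hypothetical values chosen only to show the idea, not part of the example.
costFN = 5;   % hypothetical cost of classifying speech as nonspeech
costFP = 1;   % hypothetical cost of classifying nonspeech as speech
speechDetectCost = costFN*speechDetectResults(2,1) + costFP*speechDetectResults(1,2);
noisyvadnetCost = costFN*noisyvadnetResults(2,1) + costFP*noisyvadnetResults(1,2);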
TP = speechDetectResults(2,2);
TN = speechDetectResults(1,1);
FP = speechDetectResults(1,2);
FN = speechDetectResults(2,1);
speechDetectAccuracy = (TP+TN)/(TP+TN+FP+FN);
speechDetectRecall = TP/(TP+FN);
speechDetectPrecision = TP/(TP+FP);
speechDetectF1Score = 2*(speechDetectRecall*speechDetectPrecision)/(speechDetectRecall+speechDetectPrecision);
TP = noisyvadnetResults(2,2);
TN = noisyvadnetResults(1,1);
FP = noisyvadnetResults(1,2);
FN = noisyvadnetResults(2,1);
noisyvadnetAccuracy = (TP+TN)/(TP+TN+FP+FN);
noisyvadnetRecall = TP/(TP+FN);
noisyvadnetPrecision = TP/(TP+FP);
noisyvadnetF1Score = 2*(noisyvadnetRecall*noisyvadnetPrecision)/(noisyvadnetRecall+noisyvadnetPrecision);
figure
bar(categorical(["Accuracy","Recall","Precision","F1 Score"]), ...
[speechDetectAccuracy,noisyvadnetAccuracy; ...
speechDetectRecall,noisyvadnetRecall; ...
speechDetectPrecision,noisyvadnetPrecision; ...
speechDetectF1Score,noisyvadnetF1Score]);
title("Error Analysis")
legend("speechDetect","noisyvadnet",Location="bestoutside")
ylim([0.5,1])
grid on
Supporting Functions
Feature Buffer
function sequences = featureBuffer(features,featureVectorsPerSequence,overlapPercent)
% Buffer a feature matrix into overlapped sequences. The signature is
% reconstructed from how featureBuffer is called earlier in this example.
featureVectorOverlap = round(overlapPercent*featureVectorsPerSequence);
hopLength = featureVectorsPerSequence - featureVectorOverlap;
N = floor((size(features,2) - featureVectorsPerSequence)/hopLength) + 1;
sequences = cell(N,1);
idx = 1;
for jj = 1:N
sequences{jj} = features(:,idx:idx + featureVectorsPerSequence - 1);
idx = idx + hopLength;
end
end
Mix SNR
function noisySignal = mixSNR(signal,noise,ratio)
% Mix noise into signal at the SNR (in dB) specified by ratio. The signature is
% reconstructed from how mixSNR is called in this example.
numSamples = size(signal,1);
signalNorm = norm(signal);
noiseNorm = norm(noise);
goalNoiseNorm = signalNorm/(10^(ratio/20));
factor = goalNoiseNorm/noiseNorm;
requestedNoise = noise.*factor;
noisySignal = signal + requestedNoise;
noisySignal = noisySignal./max(abs(noisySignal));
end
Construct Signal
win = hamming(50e-3*fs,"periodic");
References
[1] Warden, P. "Speech Commands: A public dataset for single-word speech recognition." 2017. Available from https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed under the Creative Commons Attribution 4.0 license.
Voice Activity Detection in Noise Using Deep Learning
In this example, you perform batch and streaming voice activity detection (VAD) in a low SNR
environment using a pretrained deep learning model. For details about the model and how it was
trained, see “Train Voice Activity Detection in Noise Model Using Deep Learning” on page 1-409.
Read in an audio file that consists of words spoken with pauses between them, and listen to it. Use resample to resample the signal to 16 kHz. Use detectSpeech on the clean signal to determine the ground-truth speech regions.
fs = 16e3;
[speech,fileFs] = audioread("Counting-16-44p1-mono-15secs.wav");
speech = resample(speech,fs,fileFs);
speech = speech./max(abs(speech));
sound(speech,fs)
detectSpeech(speech,fs,Window=hamming(0.04*fs,"periodic"),MergeDistance=round(0.5*fs))
[noise,fileFs] = audioread("WashingMachine-16-8-mono-200secs.mp3");
noise = resample(noise,fs,fileFs);
Use the supporting function mixSNR on page 1-436 to corrupt the clean speech signal with washing
machine noise at a desired SNR level in dB. Listen to the corrupted audio. The network was trained
under -10 dB SNR conditions.
SNR = ;
noisySpeech = mixSNR(speech,noise,SNR);
sound(noisySpeech,fs)
Download and load a pretrained network and a configured audioFeatureExtractor object. The
network was trained to detect speech in low SNR environments given features output from the
audioFeatureExtractor object.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio/examples","vadbilsmtnet.zip");
dataFolder = tempdir;
netFolder = fullfile(dataFolder,"vadbilsmtnet");
unzip(downloadFolder,netFolder)
pretrainedNetwork = load(fullfile(netFolder,"voiceActivityDetectionExample.mat"));
afe = pretrainedNetwork.afe;
net = pretrainedNetwork.speechDetectNet;
afe
afe =
audioFeatureExtractor with properties:
Properties
Window: [256×1 double]
OverlapLength: 128
SampleRate: 16000
FFTLength: []
SpectralDescriptorInput: 'linearSpectrum'
FeatureVectorLength: 9
Enabled Features
spectralCentroid, spectralCrest, spectralEntropy, spectralFlux, spectralKurtosis, spectralRolloffPoint, spectralSkewness, spectralSlope, harmonicRatio
Disabled Features
linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfcc, mfccDelta
mfccDeltaDelta, gtcc, gtccDelta, gtccDeltaDelta, spectralDecrease, spectralFlatness
spectralSpread, pitch, zerocrossrate, shortTimeEnergy
The network consists of two bidirectional LSTM layers, each with 200 hidden units, and a
classification output that returns either class 0 corresponding to no voice activity detected or class 1
corresponding to voice activity detected.
net.Layers
ans =
5×1 Layer array with layers:
Extract features from the speech data and then standardize them. Orient the features so that time is
across columns.
features = extract(afe,noisySpeech);
features = (features - mean(features,1))./std(features,[],1);
features = features';
Pass the features through the speech detection network to classify each feature vector as belonging
to a frame of speech or not.
scores = predict(net,features.');
decisionsCategorical = scores2label(scores,categorical([0 1]));
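The step that expands the frame-level decisions into the sample-based vector decisionsPerSample used below does not appear in this extract. A minimal sketch, assuming each frame decision is held for one analysis hop of the feature extractor (window length minus overlap length):
decisions = decisionsCategorical == "1";                % logical decision per analysis frame
hopLength = numel(afe.Window) - afe.OverlapLength;      % samples advanced per frame
decisionsPerSample = repelem(decisions,hopLength,1);    % hold each decision for one hop
decisionsPerSample(end+1:size(noisySpeech,1)) = false;  % pad to the signal length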
tiledlayout(2,1)
nexttile
detectSpeech(speech,fs,Window=hamming(0.04*fs,"periodic"),MergeDistance=round(0.5*fs))
title("Ground Truth VAD")
xlabel("")
nexttile
mask = signalMask(decisionsPerSample,SampleRate=fs,Categories="Activity");
plotsigroi(mask,noisySpeech,true)
title("Predicted VAD")
The audioFeatureExtractor object is intended for batch processing and does not retain state
between calls. Use generateMATLABFunction to create a streaming-friendly feature extractor. You
can use the trained VAD network in a streaming context using classifyAndUpdateState (Deep
Learning Toolbox).
generateMATLABFunction(afe,"featureExtractor",IsStreaming=true)
To simulate a streaming environment, save the speech and noise signals as WAV files. You will use dsp.AudioFileReader to read frames from the files and mix them at a desired SNR. You can also use audioDeviceReader so that your microphone is the speech source.
audiowrite("Speech.wav",speech,fs)
audiowrite("Noise.wav",noise,fs)
Define parameters for the streaming voice activity detection in noise demonstration:
• signal - Signal source, specified as either the speech file previously recorded, or your
microphone.
• noise - Noise source, specified as a noise sound file to mix with the signal.
• SNR - Signal-to-noise ratio to mix the signal and noise, specified in dB.
• testDuration - Test duration, specified in seconds.
• playbackSource - Playback source, specified as either the original clean signal, the noisy signal,
or the detected speech. An audioDeviceWriter object is used to play the audio to your
speakers.
signal = ;
noise = "Noise.wav";
SNR = ; % dB
testDuration = ; % seconds
playbackSource = ;
Call the supporting function streamingDemo on page 1-433 to observe the performance of the VAD network on streaming audio. Adjusting the live controls does not interrupt a running demo. After the streaming demo completes, you can modify the parameters and then run the demo again.
streamingDemo(net,afe, ...
signal,noise,SNR, ...
testDuration,playbackSource);
References
[1] Warden P. "Speech Commands: A public dataset for single-word speech recognition", 2017.
Available from https://storage.googleapis.com/download.tensorflow.org/data/
speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed
under the Creative Commons Attribution 4.0 license
Supporting Functions
Streaming Demo
function streamingDemo(net,afe,signal,noise,SNR,testDuration,playbackSource)
% streamingDemo(net,afe,signal,noise,SNR,testDuration,playbackSource) runs
% a real-time VAD demo.
% Create a dsp.MovingRMS object. You will use this to determine the signal
% and noise mix at the desired SNR. This object is only useful for example
% purposes where you are artificially adding noise.
movRMS = dsp.MovingRMS(Method="Exponential weighting",ForgettingFactor=1);
% Create three dsp.AsyncBuffer objects. One to buffer the input audio, one
% to buffer the extracted features, and one to buffer the output audio so
% that VAD decisions correspond to the audio signal. The output buffer is
% only necessary for visualizing the decisions in real time.
audioInBuffer = dsp.AsyncBuffer(2*speechReader.SamplesPerFrame);
featureBuffer = dsp.AsyncBuffer(ceil(2*speechReader.SamplesPerFrame/(numel(afe.Window)-afe.OverlapLength)));
audioOutBuffer = dsp.AsyncBuffer(2*speechReader.SamplesPerFrame);
% Create a time scope to visualize the original speech signal, the noisy
% signal that the network is applied to, and the decision output from the
% network.
scope = timescope(SampleRate=fs, ...
TimeSpanSource="property", ...
TimeSpan=3, ...
BufferLength=fs*3*3, ...
TimeSpanOverrunAction="Scroll", ...
AxesScaling="updates", ...
MaximizeAxes="on", ...
AxesScalingNumUpdates=20, ...
NumInputPorts=3, ...
LayoutDimensions=[3,1], ...
ChannelNames=["Noisy Speech","Clean Speech (Original)","Detected Speech"], ...
...
ActiveDisplay = 1, ...
ShowGrid=true, ...
...
ActiveDisplay = 2, ...
ShowGrid=true, ...
...
ActiveDisplay=3, ...
ShowGrid=true);
setup(scope,{1,1,1})
% Read a frame of the speech signal and a frame of the noise signal
speechIn = speechReader();
noiseIn = noiseReader();
if featureBuffer.NumUnreadSamples >= 1
% Read the audio data corresponding to the number of unread
% feature vectors.
audioHop = read(audioOutBuffer,featureBuffer.NumUnreadSamples*hopLength);
% Use only the new features to update the standard deviation and
% mean. Normalize the features.
rmean = movMean(features);
rstd = movSTD(features);
features = (features - rmean(end,:)) ./ rstd(end,:);
% Network inference
[score,state] = predict(net,features);
net.State = state;
[~,decision] = max(score,[],2);
decision = decision-1;
end
end
end
Mix SNR
function noisySignal = mixSNR(signal,noise,ratio)
% Mix noise into signal at the SNR (in dB) specified by ratio. The signature is
% reconstructed from how mixSNR is called in this example.
numSamples = size(signal,1);
signalNorm = norm(signal);
noiseNorm = norm(noise);
goalNoiseNorm = signalNorm/(10^(ratio/20));
factor = goalNoiseNorm/noiseNorm;
requestedNoise = noise.*factor;
noisySignal = signal + requestedNoise;
noisySignal = noisySignal./max(abs(noisySignal));
end
Using a MIDI Control Surface to Interact with a Simulink Model
This example shows how to use a MIDI control surface as a physical user interface to a Simulink®
model, allowing you to use knobs, sliders and buttons to interact with that model. It can be used in
Simulink as well as with generated code running on a workstation.
Introduction
Although MIDI is best known for its use in audio applications, this example illustrates that MIDI
control surfaces have uses in many other applications besides audio. In this example, we use a MIDI
controller to provide a user configurable value that can vary at runtime, we use it to control the
amplitude of signals, and for several other illustrative purposes. This example is not comprehensive,
but rather can provide inspiration for other creative uses of the control surface to interact with a
model.
Many MIDI controllers plug into the USB port on a computer and make use of the MIDI support built
into modern operating systems. Specific MIDI control surfaces that we have used include the Korg
nanoKONTROL and the Behringer BCF2000. An advantage of the Korg device is its cost: it is readily
available online at prices comparable to that of a good mouse. The Behringer device is more costly,
but has the enhanced capability to both send and receive MIDI signals (the Korg can only send
signals). This ability can be used to send data back from a model to keep a control surface in sync
with changes to the model. We use this capability to bring a control surface in sync with the starting
point of a model, so that initially changes to a specific control do not produce abrupt changes in the
block output.
To use your own controller with this example, plug it into the USB port on the computer and run the
model audiomidi. Be sure that the model is not running when you plug in the control device. The
model is originally configured such that it responds to movement of any control on the default MIDI
device. This construction is meant to make it easier and more likely that this example works out of
the box for all users. In a real use case, you would probably want to tie individual controls to each
sub-portion of the model. For that purpose, you can use the midiid function to explicitly set the MIDI
device parameter on the appropriate blocks in the model to recognize a specific control. For example,
running midiid with the Korg nanoKONTROL device produces the following information:
ctl =
1002
device =
nanoKONTROL
If you will be using a particular controller repeatedly, you may want to use the setpref command to set that controller as the default MIDI device:
>> setpref('midi','DefaultDevice','nanoKONTROL')
This capability is particularly helpful on Linux, where your control surface may not be immediately
recognized as the default device.
After the controller is plugged in, press the play button in audiomidi, then move any knob or slider. Because the model is initially configured to respond to any control, you should see the signals plotted in the various scopes change in response.
Examples
Next, several example use cases are provided. Each example uses the basic MIDI Controls block to
accomplish a different task. Look under the mask of the appropriate block in each example to see how
that use case was accomplished. To reuse these in your own model, just drag a copy of the desired
block into your model.
In example 1 of the model, we see the simplest use of this control: it acts as a source that is under user control. The original MIDI Controls block (in the DSP sources block library) outputs a value between 0 and 1. We have also created a slightly modified block by placing a mask on the original block to output a source whose values cover a user-defined range.
In example 2, a straightforward application of the MIDI Controls block uses the 0 to 1 range as an amplitude control on a given signal.
Example 3: MIDI Controls to Split a Signal Into Two Streams With User-Controlled Relative Amplitudes.
In this example, a signal is split into two streams whose relative amplitudes are interactively controlled by the user with the control surface.
Lastly, example 5 allows the user to input a desired phase with the control surface. A sinusoid with that phase is then generated, and the phase can be varied interactively as the model runs.
Conclusions
This model is provided to give inspiration for how the MIDI Controls block can be used to interact
with a model. Other uses are possible and encouraged, including use with generated code.
Spoken Digit Recognition with Wavelet Scattering and Deep Learning
This example shows how to classify spoken digits using both machine and deep learning techniques.
In the example, you perform classification using wavelet time scattering with a support vector
machine (SVM) and with a long short-term memory (LSTM) network. You also apply Bayesian
optimization to determine suitable hyperparameters to improve the accuracy of the LSTM network. In
addition, the example illustrates an approach using a deep convolutional neural network (CNN) and
mel-frequency spectrograms.
Data
Download the Free Spoken Digit Dataset (FSDD). FSDD consists of 2000 recordings in English of the
digits 0 through 9 obtained from four speakers. Two of the speakers are native speakers of American
English, one speaker is a nonnative speaker of English with a Belgian French accent, and one speaker
is a nonnative speaker of English with a German accent. The data is sampled at 8000 Hz.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"FSDD");
Use audioDatastore to manage data access and ensure the random division of the recordings into
training and test sets. Create an audioDatastore that points to the dataset.
ads = audioDatastore(dataset,IncludeSubfolders=true);
The helper function helpergenLabels creates a categorical array of labels from the FSDD files. The
source code for helpergenLabels is listed in the appendix. List the classes and the number of
examples in each class.
ads.Labels = helpergenLabels(ads);
summary(ads.Labels)
0 200
1 200
2 200
3 200
4 200
5 200
6 200
7 200
8 200
9 200
The FSDD data set consists of 10 balanced classes with 200 recordings each. The recordings in the
FSDD are not of equal duration. The FSDD is not prohibitively large, so read through the FSDD files
and construct a histogram of the signal lengths.
LenSig = zeros(numel(ads.Files),1);
nr = 1;
while hasdata(ads)
digit = read(ads);
LenSig(nr) = numel(digit);
nr = nr+1;
end
reset(ads)
histogram(LenSig)
grid on
xlabel("Signal Length (Samples)")
ylabel("Frequency")
The histogram shows that the distribution of recording lengths is positively skewed. For classification,
this example uses a common signal length of 8192 samples, a conservative value that ensures that
truncating longer recordings does not cut off speech content. If the signal is greater than 8192
samples (1.024 seconds) in length, the recording is truncated to 8192 samples. If the signal is less
than 8192 samples in length, the signal is prepadded and postpadded symmetrically with zeros out to
a length of 8192 samples.
Use waveletScattering (Wavelet Toolbox) to create a wavelet time scattering framework using an
invariant scale of 0.22 seconds. In this example, you create feature vectors by averaging the
scattering transform over all time samples. To have a sufficient number of scattering coefficients per
time window to average, set OversamplingFactor to 2 to produce a four-fold increase in the
number of scattering coefficients for each path with respect to the critically downsampled value.
sf = waveletScattering(SignalLength=8192,InvarianceScale=0.22, ...
SamplingFrequency=8000,OversamplingFactor=2);
Split the FSDD into training and test sets. Allocate 80% of the data to the training set and retain 20%
for the test set. The training data is for training the classifier based on the scattering transform. The
test data is for validating the model.
rng("default")
ads = shuffle(ads);
[adsTrain,adsTest] = splitEachLabel(ads,0.8);
countEachLabel(adsTrain)
ans=10×2 table
Label Count
_____ _____
0 160
1 160
2 160
3 160
4 160
5 160
6 160
7 160
8 160
9 160
countEachLabel(adsTest)
ans=10×2 table
Label Count
_____ _____
0 40
1 40
2 40
3 40
4 40
5 40
6 40
7 40
8 40
9 40
The helper function helperReadSPData truncates or pads the data to a length of 8192 and
normalizes each recording by its maximum value. The source code for helperReadSPData is listed
in the appendix. Create an 8192-by-1600 matrix where each column is a spoken-digit recording.
Xtrain = [];
scatds_Train = transform(adsTrain,@(x)helperReadSPData(x));
while hasdata(scatds_Train)
smat = read(scatds_Train);
Xtrain = cat(2,Xtrain,smat);
end
Repeat the process for the test set. The resulting matrix is 8192-by-400.
Xtest = [];
scatds_Test = transform(adsTest,@(x)helperReadSPData(x));
while hasdata(scatds_Test)
smat = read(scatds_Test);
Xtest = cat(2,Xtest,smat);
end
Apply the wavelet scattering transform to the training and test sets.
Strain = sf.featureMatrix(Xtrain);
Stest = sf.featureMatrix(Xtest);
Obtain the mean scattering features for the training and test sets. Exclude the zeroth-order
scattering coefficients.
TrainFeatures = Strain(2:end,:,:);
TrainFeatures = squeeze(mean(TrainFeatures,2))';
TestFeatures = Stest(2:end,:,:);
TestFeatures = squeeze(mean(TestFeatures,2))';
SVM Classifier
Now that the data has been reduced to a feature vector for each recording, the next step is to use
these features for classifying the recordings. Create an SVM learner template with a quadratic
polynomial kernel. Fit the SVM to the training data.
template = templateSVM(...
KernelFunction="polynomial", ...
PolynomialOrder=2, ...
KernelScale="auto", ...
BoxConstraint=1, ...
Standardize=true);
classificationSVM = fitcecoc( ...
TrainFeatures, ...
adsTrain.Labels, ...
Learners=template, ...
Coding="onevsone", ...
ClassNames=categorical({'0'; '1'; '2'; '3'; '4'; '5'; '6'; '7'; '8'; '9'}));
Use k-fold cross-validation to predict the generalization accuracy of the model based on the training
data. Split the training set into five groups.
partitionedModel = crossval(classificationSVM,KFold=5);
[validationPredictions, validationScores] = kfoldPredict(partitionedModel);
validationAccuracy = (1 - kfoldLoss(partitionedModel,LossFun="ClassifError"))*100
validationAccuracy = 96.9375
The estimated generalization accuracy is approximately 97%. Use the trained SVM to predict the
spoken-digit classes in the test set.
predLabels = predict(classificationSVM,TestFeatures);
testAccuracy = sum(predLabels==adsTest.Labels)/numel(predLabels)*100 %#ok<NASGU>
testAccuracy = 98
Summarize the performance of the model on the test set with a confusion chart. Display the precision
and recall for each class by using column and row summaries. The table at the bottom of the
confusion chart shows the precision values for each class. The table to the right of the confusion
chart shows the recall values.
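The plotting code does not appear in this extract. A minimal sketch of the confusion chart described here, mirroring the confusionchart call used for the CNN later in this example:
figure(Units="normalized",Position=[0.2 0.2 0.5 0.5])
confusionchart(adsTest.Labels,predLabels, ...
    title=["Test Accuracy - SVM","Accuracy = " + testAccuracy + "%"], ...
    ColumnSummary="column-normalized",RowSummary="row-normalized");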
The scattering transform coupled with a SVM classifier classifies the spoken digits in the test set with
an accuracy of 98% (or an error rate of 2%).
An LSTM network is a type of recurrent neural network (RNN). RNNs are neural networks that are
specialized for working with sequential or temporal data such as speech data. Because the wavelet
scattering coefficients are sequences, they can be used as inputs to an LSTM. By using scattering
features as opposed to the raw data, you can reduce the variability that your network needs to learn.
Modify the training and testing scattering features to be used with the LSTM network. Exclude the
zeroth-order scattering coefficients and convert the features to cell arrays.
TrainFeatures = Strain(2:end,:,:);
TrainFeatures = squeeze(num2cell(TrainFeatures,[1 2]));
TestFeatures = Stest(2:end,:,:);
TestFeatures = squeeze(num2cell(TestFeatures, [1 2]));
[inputSize, ~] = size(TrainFeatures{1});
YTrain = adsTrain.Labels;
numHiddenUnits = 512;
numClasses = numel(unique(YTrain));
layers = [ ...
sequenceInputLayer(inputSize)
lstmLayer(numHiddenUnits,OutputMode="last")
fullyConnectedLayer(numClasses)
softmaxLayer
classificationLayer];
Set the hyperparameters. Use Adam optimization and a mini-batch size of 50. Set the maximum
number of epochs to 300. Use a learning rate of 1e-4. You can turn off the training progress plot if
you do not want to track the progress using plots. The training uses a GPU by default if one is
available. Otherwise, it uses a CPU. For more information, see trainingOptions (Deep Learning
Toolbox).
maxEpochs = 300;
miniBatchSize = 50;
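The trainingOptions call that defines options is not shown in this extract. A minimal sketch based on the hyperparameters described above (Adam, mini-batch size of 50, 300 epochs, learning rate of 1e-4):
options = trainingOptions("adam", ...
    InitialLearnRate=1e-4, ...
    MaxEpochs=maxEpochs, ...
    MiniBatchSize=miniBatchSize, ...
    Shuffle="every-epoch", ...
    Plots="training-progress", ...
    Verbose=false);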
net = trainNetwork(TrainFeatures,YTrain,layers,options);
predLabels = classify(net,TestFeatures);
testAccuracy = sum(predLabels==adsTest.Labels)/numel(predLabels)*100 %#ok<NASGU>
testAccuracy = 96.2500
Bayesian Optimization
Determining suitable hyperparameter settings is often one of the most difficult parts of training a
deep network. To mitigate this, you can use Bayesian optimization. In this example, you optimize the
number of hidden units and the initial learning rate by using Bayesian techniques. Create a new
directory to store the MAT-files containing information about hyperparameter settings and the
network along with the corresponding error rates.
YTrain = adsTrain.Labels;
YTest = adsTest.Labels;
if ~exist("results/",'dir')
mkdir results
end
Initialize the variables to be optimized and their value ranges. Because the number of hidden units must be an integer, set Type to "integer".
optVars = [
optimizableVariable(InitialLearnRate=[1e-5, 1e-1],Transform="log")
optimizableVariable(NumHiddenUnits=[10, 1000],Type="integer")
];
Bayesian optimization is computationally intensive and can take several hours to finish. For the
purposes of this example, set optimizeCondition to false to download and use predetermined
optimized hyperparameter settings. If you set optimizeCondition to true, the objective function
helperBayesOptLSTM is minimized using Bayesian optimization. The objective function, listed in the
appendix, is the error rate of the network given specific hyperparameter settings. The loaded settings
are for the objective function minimum of 0.02 (2% error rate).
ObjFcn = helperBayesOptLSTM(TrainFeatures,YTrain,TestFeatures,YTest);
optimizeCondition = false;
if optimizeCondition
BayesObject = bayesopt(ObjFcn,optVars, ...
MaxObjectiveEvaluations=15, ...
IsObjectiveDeterministic=false, ...
UseParallel=true); %#ok<UNRCH>
else
url = "http://ssd.mathworks.com/supportfiles/audio/SpokenDigitRecognition.zip";
downloadNetFolder = tempdir;
netFolder = fullfile(downloadNetFolder,"SpokenDigitRecognition");
if ~exist(netFolder,"dir")
disp("Downloading pretrained network (1 file - 12 MB) ...")
unzip(url,downloadNetFolder)
end
load(fullfile(netFolder,"0.02.mat"))
end
If you perform Bayesian optimization, figures similar to the following are generated to track the
objective function values with the corresponding hyperparameter values and the number of
iterations. You can increase the number of Bayesian optimization iterations to ensure that the global
minimum of the objective function is reached.
Use the optimized values for the number of hidden units and initial learning rate and retrain the
network.
numHiddenUnits = 768;
numClasses = numel(unique(YTrain));
layers = [ ...
sequenceInputLayer(inputSize)
lstmLayer(numHiddenUnits,OutputMode="last")
fullyConnectedLayer(numClasses)
softmaxLayer
classificationLayer];
maxEpochs = 300;
miniBatchSize = 50;
net = trainNetwork(TrainFeatures,YTrain,layers,options);
predLabels = classify(net,TestFeatures);
testAccuracy = sum(predLabels==adsTest.Labels)/numel(predLabels)*100
testAccuracy = 96.7500
As the plot shows, using Bayesian optimization yields an LSTM with a higher accuracy.
As another approach to the task of spoken digit recognition, use a deep convolutional neural network
(DCNN) based on mel-frequency spectrograms to classify the FSDD data set. Use the same signal
truncation/padding procedure as in the scattering transform. Similarly, normalize each recording by
dividing each signal sample by the maximum absolute value. For consistency, use the same training
and test sets as for the scattering transform.
Set the parameters for the mel-frequency spectrograms. Use the same window, or frame, duration as
in the scattering transform, 0.22 seconds. Set the hop between windows to 10 ms. Use 40 frequency
bands.
segmentDuration = 8192*(1/8000);
frameDuration = 0.22;
hopDuration = 0.01;
numBands = 40;
reset(adsTrain)
reset(adsTest)
The helper function helperspeechSpectrograms, defined at the end of this example, uses
melSpectrogram to obtain the mel-frequency spectrogram after standardizing the recording length
and normalizing the amplitude. Use the logarithm of the mel-frequency spectrograms as the inputs to
the DCNN. To avoid taking the logarithm of zero, add a small epsilon to each element.
epsil = 1e-6;
XTrain = helperspeechSpectrograms(adsTrain,segmentDuration,frameDuration,hopDuration,numBands);
XTest = helperspeechSpectrograms(adsTest,segmentDuration,frameDuration,hopDuration,numBands);
YTrain = adsTrain.Labels;
YTest = adsTest.Labels;
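The step that applies the logarithm is not shown in this extract. A minimal sketch, assuming the helper function returns linear mel spectrograms and the log is taken afterwards with the epsilon defined above:
XTrain = log10(XTrain + epsil);
XTest = log10(XTest + epsil);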
Construct a small DCNN as an array of layers. Use convolutional and batch normalization layers, and
downsample the feature maps using max pooling layers. To reduce the possibility of the network
memorizing specific features of the training data, add a small amount of dropout to the input to the
last fully connected layer.
sz = size(XTrain);
specSize = sz(1:2);
imageSize = [specSize 1];
numClasses = numel(categories(YTrain));
dropoutProb = 0.2;
numF = 12;
layers = [
imageInputLayer(imageSize)
convolution2dLayer(5,numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,2*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(2)
dropoutLayer(dropoutProb)
fullyConnectedLayer(numClasses)
softmaxLayer
classificationLayer(Classes=categories(YTrain));
];
Set the hyperparameters to use in training the network. Use a mini-batch size of 50 and a learning
rate of 1e-4. Specify Adam optimization. Because the amount of data in this example is relatively
small, set the execution environment to 'cpu' for reproducibility. You can also train the network on
an available GPU by setting the execution environment to either 'gpu' or 'auto'. For more
information, see trainingOptions (Deep Learning Toolbox).
miniBatchSize = 50;
options = trainingOptions("adam", ...
InitialLearnRate=1e-4, ...
MaxEpochs=30, ...
MiniBatchSize=miniBatchSize, ...
Shuffle="every-epoch", ...
Plots="training-progress", ...
Verbose=false, ...
ExecutionEnvironment="cpu");
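The training call itself does not appear in this extract. A minimal sketch that produces the trainedNet used below:
trainedNet = trainNetwork(XTrain,YTrain,layers,options);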
Use the trained network to predict the digit labels for the test set.
[Ypredicted,probs] = classify(trainedNet,XTest,ExecutionEnvironment="CPU");
cnnAccuracy = sum(Ypredicted==YTest)/numel(YTest)*100
cnnAccuracy = 98.2500
Summarize the performance of the trained network on the test set with a confusion chart. Display the
precision and recall for each class by using column and row summaries. The table at the bottom of
the confusion chart shows the precision values. The table to the right of the confusion chart shows
the recall values.
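The plotting code is not shown in this extract. A minimal sketch of the confusion chart described here:
figure(Units="normalized",Position=[0.2 0.2 0.5 0.5])
confusionchart(YTest,Ypredicted, ...
    title=["Test Accuracy - DCNN","Accuracy = " + cnnAccuracy + "%"], ...
    ColumnSummary="column-normalized",RowSummary="row-normalized");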
The DCNN using mel-frequency spectrograms as inputs classifies the spoken digits in the test set
with an accuracy rate of approximately 98% as well.
Summary
This example shows how to use different machine and deep learning approaches for classifying
spoken digits in the FSDD. The example illustrated wavelet scattering paired with both an SVM and a
LSTM. Bayesian techniques were used to optimize LSTM hyperparameters. Finally, the example
shows how to use a CNN with mel-frequency spectrograms.
The goal of the example is to demonstrate how to use MathWorks® tools to approach the problem in
fundamentally different but complementary ways. All workflows use audioDatastore to manage
flow of data from disk and ensure proper randomization.
All approaches used in this example performed equally well on the test set. This example is not
intended as a direct comparison between the various approaches. For example, you can also use
Bayesian optimization for hyperparameter selection in the CNN. An additional strategy that is useful
in deep learning with small training sets like this version of the FSDD is to use data augmentation.
Because it is not always known how a given manipulation affects the class label, data augmentation is not always feasible.
However, for speech, established data augmentation strategies are available through
audioDataAugmenter.
In the case of wavelet time scattering, there are also a number of modifications you can try. For
example, you can change the invariant scale of the transform, vary the number of wavelet filters per
filter bank, and try different classifiers.
function x = helperReadSPData(x)
% This function is only for use in Wavelet Toolbox examples. It may change or
% be removed in a future release.
N = numel(x);
if N > 8192
x = x(1:8192);
elseif N < 8192
pad = 8192-N;
prepad = floor(pad/2);
postpad = ceil(pad/2);
x = [zeros(prepad,1) ; x ; zeros(postpad,1)];
end
x = x./max(abs(x));
end
layers = [ ...
sequenceInputLayer(inputSize)
bilstmLayer(optVars.NumHiddenUnits,OutputMode="last") % Use the optimized number of hidden units
fullyConnectedLayer(numClasses)
softmaxLayer
classificationLayer];
function X = helperspeechSpectrograms(ads,segmentDuration,frameDuration,hopDuration,numBands)
% This function is only for use in the
% "Spoken Digit Recognition with Wavelet Scattering and Deep Learning"
% example. It may change or be removed in a future release.
%
% helperspeechSpectrograms(ads,segmentDuration,frameDuration,hopDuration,numBands)
% computes speech spectrograms for the files in the datastore ads.
% segmentDuration is the total duration of the speech clips (in seconds),
% frameDuration the duration of each spectrogram frame, hopDuration the
% time shift between each spectrogram frame, and numBands the number of
% frequency bands.
disp("Computing speech spectrograms...");
for i = 1:numFiles
[x,info] = read(ads);
x = normalizeAndResize(x);
fs = info.SampleRate;
frameLength = round(frameDuration*fs);
hopLength = round(hopDuration*fs);
if mod(i,500) == 0
disp("Processed " + i + " files out of " + numFiles)
end
end
disp("...done");
end
%--------------------------------------------------------------------------
function x = normalizeAndResize(x)
% This function is only for use in the
% "Spoken Digit Recognition with Wavelet Scattering and Deep Learning"
% example. It may change or be removed in a future release.
N = numel(x);
if N > 8192
x = x(1:8192);
elseif N < 8192
pad = 8192-N;
prepad = floor(pad/2);
postpad = ceil(pad/2);
x = [zeros(prepad,1) ; x ; zeros(postpad,1)];
end
x = x./max(abs(x));
end
Active Noise Control with Simulink Real-Time
Design a real-time active noise control system using a Speedgoat® Simulink® Real-Time™ target.
The goal of active noise control is to reduce unwanted sound by producing an “anti-noise” signal that
cancels the undesired sound wave. This principle has been applied successfully to a wide variety of
applications, such as noise-cancelling headphones, active sound design in car interiors, and noise
reduction in ventilation conduits and ventilated enclosures.
In this example, we apply the principles of model-based design. First, we design the ANC without any
hardware by using a simple acoustic model in our simulation. Then, we complete our prototype by
replacing the simulated acoustic path by the “Speedgoat Target Computers and Speedgoat Support”
(Simulink Real-Time) and its IO104 analog module. The Speedgoat is an external Real-Time target for
Simulink, which allows us to execute our model in real time and observe any data of interest, such as
the adaptive filter coefficients, in real time.
This example has a companion video: Active Noise Control – From Modeling to Real-Time
Prototyping.
The following figure illustrates a classic example of feedforward ANC. A noise source at the entrance
of a duct, such as a fan, is “cancelled” by a loudspeaker. The noise source b(n) is measured with a
reference microphone, and the signal present at the output of the system is monitored with an error
microphone, e(n). Note that the smaller the distance between the reference microphone and the
loudspeaker, the faster the ANC must be able to compute and play back the “anti-noise”.
The primary path is the transfer function between the two microphones, W(z) is the adaptive filter
computed from the last available error signal e(n), and the secondary path S(z) is the transfer
function between the ANC output and the error microphone. The secondary path estimate S'(z) is
used to filter the input of the NLMS update function. Also, the acoustic feedback F(z) from the ANC
loudspeaker to the reference microphone can be estimated (F'(z)) and removed from the reference
signal b(n).
To implement a successful ANC system, we must estimate both the primary and the secondary paths.
In this example, we estimate the secondary path and the acoustic feedback first and then keep it
constant while the ANC system adapts the primary path.
With Simulink and model-based design, you can start with a basic model of the desired system and a
simulated environment. Then, you can improve the realism of that model or replace the simulated
environment by the real one. You can also iterate by refining your simulated environment when you
learn more about the challenges of the real-world system. For example, you could add acoustic
feedback or measurement noise to the simulated environment if those are elements that limit the
performance of the real-world system.
Start with a model of a Filtered-X NLMS ANC system, including both the ANC controller and the
duct’s acoustic environment. Assume that we already have an estimate of the secondary path, since
we will design a system to measure that later. Simulate the signal at the error microphone as the sum
of the noise source filtered by the primary acoustic path and the ANC output filtered by the
secondary acoustic path. Use an “LMS Update” block in a configuration that minimizes the signal
captured by the error microphone. In a Filtered-X system, the NLMS update’s input is the noise
source filtered by the estimate of the secondary path. To avoid an algebraic loop, there is a delay of
one sample between the computation of the new filter coefficients and their use by the LMS filter.
Set the secondary path to s(n) = [0.5 0.5 -.3 -.3 -.2 -.2] and the primary path to conv(s(n), f(n)), where
f(n) = [.1 -.1 .2 -.2 .3 -.3 .15 -.15]. Verify that the adaptive filter properly converges to f(n), in which
case it matches the primary path in our model once convolved with the secondary path. Note that s(n)
and f(n) were set arbitrarily, but we could try any FIR transfer functions, such as an actual impulse
response measurement.
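The Simulink model itself is not reproduced here, but the convergence behavior described above can be sketched in a few lines of MATLAB, assuming DSP System Toolbox's dsp.FilteredXLMSFilter and a perfect secondary-path estimate S'(z) = S(z). The sample rate, noise source, and step size below are illustrative assumptions, not values from the model.
fs = 16e3;                                        % assumed sample rate
s = [0.5 0.5 -0.3 -0.3 -0.2 -0.2];                % secondary path S(z)
f = [0.1 -0.1 0.2 -0.2 0.3 -0.3 0.15 -0.15];      % response the adaptive filter should learn
p = conv(s,f);                                    % primary path = conv(s(n),f(n))
b = randn(10*fs,1);                               % broadband noise source b(n)
d = filter(p,1,b);                                % noise reaching the error microphone
fxlms = dsp.FilteredXLMSFilter(numel(f), ...
    StepSize=0.02,LeakageFactor=1, ...
    SecondaryPathCoefficients=s);                 % assume S'(z) = S(z)
[~,e] = fxlms(b,d);                               % e(n) is the residual at the error microphone
fprintf("Residual power, first second: %.3f\n",var(e(1:fs)))
fprintf("Residual power, last second:  %.3f\n",var(e(end-fs+1:end)))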
Design a model to estimate the secondary path. Use an adaptive filter in a configuration appropriate
for the identification of an unknown system. We can then verify that it converges to f(n).
To experiment with ANC in a real-time environment, we built the classic duct example. In the
following image, from right to left, we have a loudspeaker playing the noise source, the reference
microphone, the ANC loudspeaker, and the error microphone.
Latency is critical: the system must record the reference microphone, compute the response and play
it back on the ANC loudspeaker in the time it takes for sound to travel between these points. In this
example, the distance between the reference microphone and the beginning of the “Y” section is 34
cm. The speed of sound is 343 m/s, thus our maximum latency is 1 ms, or 16 samples at the 16 kHz
sampling rate used in this example.
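A quick sanity check of this latency budget, using only the numbers quoted above:
distance = 0.34;        % meters from the reference microphone to the "Y" section
speedOfSound = 343;     % m/s
fs = 16e3;              % Hz, sampling rate used in this example
maxLatency = distance/speedOfSound        % approximately 1 ms
maxLatencySamples = round(maxLatency*fs)  % approximately 16 samples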
We will be using the Speedgoat real-time target in Simulink, with the IO104 analog I/O interface card.
The Speedgoat allows us to achieve a latency as low as one or two samples.
To realize our real-time model, we use the building blocks that we tested earlier, and simply replace
the acoustic models by the Speedgoat I/O blocks. We also included the measurement of the acoustic
feedback from the ANC loudspeaker to the reference microphone, and we added some logic to
automatically measure the secondary path for 15 seconds before switching to the actual ANC mode.
During the first 15 seconds, white noise is played back on the ANC loudspeaker and two NLMS filters
are enabled, one per microphone. Then, a “noise source” is played back by the model for
convenience, but the actual input of the ANC system is the reference microphone (this playback could
be replaced by a real noise source, such as a fan at the right end of the duct). The system records the
reference microphone, adapts the ANC NLMS filter and computes a signal for the ANC loudspeaker.
To access the model’s folder, open the example by clicking the “Open Script” button. The model’s file
name is “Speedgoat_FXLMS_ANC_model.slx”.
We have measured the performance of this ANC prototype with both dual tones and the actual
recording of a washing machine. Tonal or repetitive noises are easier because the adaptive filter will
work even if it is adapting to an earlier or later cycle. With the washing machine, there is a mix of
tones and noise. In this case, the Speedgoat's low-latency processing lets us adapt the ANC filter and produce the anti-noise in time for it to reach the error microphone and cancel the noise. We obtained a noise reduction of 30-40 dB for the dual tones and 8-10 dB for the recording.
Also, the convergence rate for the filter is a few seconds with the tones, but requires more time for
the washing machine recording. These were the coefficients for the dual tones:
We can also look at the coefficients for the secondary path estimation:
References
S. M. Kuo and D. R. Morgan, "Active noise control: a tutorial review," in Proceedings of the IEEE, vol.
87, no. 6, pp. 943-973, June 1999.
K.-C. Chen, C.-Y. Chang, and S. M. Kuo, "Active noise control in a duct to cancel broadband noise," in IOP Conference Series: Materials Science and Engineering, vol. 237, no. 1, 2017. https://iopscience.iop.org/article/10.1088/1757-899X/237/1/012015.
See also: “Active Noise Control Using a Filtered-X LMS FIR Adaptive Filter” on page 1-129
Acoustic Scene Recognition Using Late Fusion
This example shows how to create a multi-model late fusion system for acoustic scene recognition.
The example trains a convolutional neural network (CNN) using mel spectrograms and an ensemble
classifier using wavelet scattering. The example uses the TUT dataset for training and evaluation [1]
on page 1-480.
Introduction
Acoustic scene classification (ASC) is the task of classifying environments from the sounds they
produce. ASC is a generic classification problem that is foundational for context awareness in
devices, robots, and many other applications [1] on page 1-480. Early attempts at ASC used mel-
frequency cepstral coefficients (mfcc) and Gaussian mixture models (GMMs) to describe their
statistical distribution. Other popular features used for ASC include zero crossing rate, spectral
centroid (spectralCentroid), spectral rolloff (spectralRolloffPoint), spectral flux
(spectralFlux), and linear prediction coefficients (lpc) [5] on page 1-480. Hidden Markov models
(HMMs) were trained to describe the temporal evolution of the GMMs. More recently, the best
performing systems have used deep learning, usually CNNs, and a fusion of multiple models. The
most popular feature for top-ranked systems in the DCASE 2017 contest was the mel spectrogram
(melSpectrogram). The top-ranked systems in the challenge used late fusion and data augmentation
to help their systems generalize.
To illustrate a simple approach that produces reasonable results, this example trains a CNN using
mel spectrograms and an ensemble classifier using wavelet scattering. The CNN and ensemble
classifier produce roughly equivalent overall accuracy, but each performs better at distinguishing certain acoustic scenes. To increase overall accuracy, you merge the CNN and ensemble classifier results
using late fusion.
To run the example, you must first download the data set [1] on page 1-480. The full data set is
approximately 15.5 GB. Depending on your machine and internet connection, downloading the data
can take about 4 hours.
downloadFolder = tempdir;
dataset = fullfile(downloadFolder,"TUT-acoustic-scenes-2017");
if ~datasetExists(dataset)
disp("Downloading TUT-acoustic-scenes-2017 (15.5 GB) ...")
HelperDownload_TUT_acoustic_scenes_2017(dataset);
end
Read in the development set metadata as a table. Name the table variables FileName,
AcousticScene, and SpecificLocation.
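The readtable calls do not appear in this extract. A minimal sketch, assuming the metadata is stored as tab-delimited text files named meta.txt in the development and evaluation folders (the file names are an assumption):
trainMetaData = readtable(fullfile(dataset,"TUT-acoustic-scenes-2017-development","meta.txt"), ...
    FileType="text",Delimiter="tab",ReadVariableNames=false);
trainMetaData.Properties.VariableNames = ["FileName","AcousticScene","SpecificLocation"];
testMetaData = readtable(fullfile(dataset,"TUT-acoustic-scenes-2017-evaluation","meta.txt"), ...
    FileType="text",Delimiter="tab",ReadVariableNames=false);
testMetaData.Properties.VariableNames = ["FileName","AcousticScene","SpecificLocation"];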
Note that the specific recording locations in the test set do not intersect with the specific recording
locations in the development set. This makes it easier to validate that the trained models can
generalize to real-world scenarios.
sharedRecordingLocations = intersect(testMetaData.SpecificLocation,trainMetaData.SpecificLocation);
disp("Number of specific recording locations in both train and test sets = " + numel(sharedRecordingLocations))
The first variable of the metadata tables contains the file names. Concatenate the file names with the
file paths.
trainFilePaths = fullfile(dataset,"TUT-acoustic-scenes-2017-development",trainMetaData.FileName);
testFilePaths = fullfile(dataset,"TUT-acoustic-scenes-2017-evaluation",testMetaData.FileName);
There may be files listed in the metadata that are not present in the data set. Remove the file paths
and acoustic scene labels that correspond to the missing files.
ads = audioDatastore(dataset,IncludeSubfolders=true);
allFiles = ads.Files;
trainIdxToRemove = ~ismember(trainFilePaths,allFiles);
trainFilePaths(trainIdxToRemove) = [];
trainLabels = categorical(trainMetaData.AcousticScene);
trainLabels(trainIdxToRemove) = [];
testIdxToRemove = ~ismember(testFilePaths,allFiles);
testFilePaths(testIdxToRemove) = [];
testLabels = categorical(testMetaData.AcousticScene);
testLabels(testIdxToRemove) = [];
Create audio datastores for the train and test sets. Set the Labels property of the audioDatastore
to the acoustic scene. Call countEachLabel to verify an even distribution of labels in both the train
and test sets.
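The datastore creation does not appear in this extract. A minimal sketch; the two tables that follow are the countEachLabel outputs for the train and test sets, respectively:
adsTrain = audioDatastore(trainFilePaths,Labels=trainLabels);
countEachLabel(adsTrain)
adsTest = audioDatastore(testFilePaths,Labels=testLabels);
countEachLabel(adsTest)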
15×2 table
Label Count
________________ _____
beach 312
bus 312
cafe/restaurant 312
car 312
city_center 312
forest_path 312
grocery_store 312
home 312
library 312
metro_station 312
office 312
park 312
residential_area 312
train 312
tram 312
15×2 table
Label Count
________________ _____
beach 108
bus 108
cafe/restaurant 108
car 108
city_center 108
forest_path 108
grocery_store 108
home 108
library 108
metro_station 108
office 108
park 108
residential_area 108
train 108
tram 108
You can reduce the data set used in this example to speed up the run time at the cost of performance.
In general, reducing the data set is a good practice for development and debugging. Set
speedupExample to true to reduce the data set.
speedupExample = ;
if speedupExample
adsTrain = splitEachLabel(adsTrain,20);
adsTest = splitEachLabel(adsTest,10);
end
trainLabels = adsTrain.Labels;
testLabels = adsTest.Labels;
Call read to get the data and sample rate of a file from the train set. Audio in the database has
consistent sample rate and duration. Normalize the audio and listen to it. Display the corresponding
label.
[data,adsInfo] = read(adsTrain);
data = data./max(data,[],"all");
fs = adsInfo.SampleRate;
sound(data,fs)
reset(adsTrain)
Each audio clip in the dataset consists of 10 seconds of stereo (left-right) audio. The feature
extraction pipeline and the CNN architecture in this example are based on [3] on page 1-480.
Hyperparameters for the feature extraction, the CNN architecture, and the training options were
modified from the original paper using a systematic hyperparameter optimization workflow.
First, convert the audio to mid-side encoding. [3] on page 1-480 suggests that mid-side encoded data
provides better spatial information that the CNN can use to identify moving sources (such as a train
moving across an acoustic scene).
dataMidSide = [sum(data,2),data(:,1)-data(:,2)];
Divide the signal into one-second segments with overlap. The final system uses a probability-weighted
average on the one-second segments to predict the scene for each 10-second audio clip in the test
set. Dividing the audio clips into one-second segments makes the network easier to train and helps
prevent overfitting to specific acoustic events in the training set. The overlap helps to ensure all
combinations of features relative to one another are captured by the training data. It also provides
the system with additional data that can be mixed uniquely during augmentation.
segmentLength = 1;
segmentOverlap = 0.5;
[dataBufferedMid,~] = buffer(dataMidSide(:,1),round(segmentLength*fs),round(segmentOverlap*fs),"nodelay");
[dataBufferedSide,~] = buffer(dataMidSide(:,2),round(segmentLength*fs),round(segmentOverlap*fs),"nodelay");
dataBuffered = zeros(size(dataBufferedMid,1),size(dataBufferedMid,2)+size(dataBufferedSide,2));
dataBuffered(:,1:2:end) = dataBufferedMid;
dataBuffered(:,2:2:end) = dataBufferedSide;
Use melSpectrogram to transform the data into a compact frequency-domain representation. Define
parameters for the mel spectrogram as suggested by [3] on page 1-480.
windowLength = 2048;
samplesPerHop = 1024;
samplesOverlap = windowLength - samplesPerHop;
fftLength = 2*windowLength;
numBands = 128;
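The melSpectrogram call that produces spec is not shown in this extract. A minimal sketch using the parameters defined above, mirroring the plotting call that follows:
spec = melSpectrogram(dataBuffered,fs, ...
    Window=hamming(windowLength,"periodic"), ...
    OverlapLength=samplesOverlap, ...
    FFTLength=fftLength, ...
    NumBands=numBands, ...
    ApplyLog=true);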
X = reshape(spec,size(spec,1),size(spec,2),size(data,2),[]);
Call melSpectrogram without output arguments to plot the mel spectrogram of the mid channel for
the first six of the one-second increments.
tiledlayout(3,2)
for channel = 1:2:11
nexttile
melSpectrogram(dataBuffered(:,channel),fs, ...
Window=hamming(windowLength,"periodic"), ...
OverlapLength=samplesOverlap, ...
FFTLength=fftLength, ...
NumBands=numBands, ...
ApplyLog=true);
title("Segment " + ceil(channel/2))
end
To apply the feature extraction steps to all files in the datastores, create transform datastores and
specify the HelperSegmentedMelSpectrograms function as the transform. To speed up subsequent
processing, use readall to read all of the files and place the extracted features in memory.
Transform datastores can leverage parallel pools to speed up file reading and preprocessing if you
have Parallel Computing Toolbox™.
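The transform datastore creation and readall call for the training set do not appear in this extract. A minimal sketch that mirrors the test-set call below (the SegmentLength parameter name is an assumption):
tdsTrain = transform(adsTrain,@(x){HelperSegmentedMelSpectrograms(x,fs, ...
    SegmentLength=segmentLength, ...
    SegmentOverlap=segmentOverlap, ...
    WindowLength=windowLength, ...
    HopLength=samplesPerHop, ...
    NumBands=numBands, ...
    FFTLength=fftLength)});
xTrain = readall(tdsTrain,UseParallel=canUseParallelPool);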
xTrain = cat(4,xTrain{:});
tdsTest = transform(adsTest,@(x){HelperSegmentedMelSpectrograms(x,fs, ...
    SegmentLength=segmentLength, ...
    SegmentOverlap=segmentOverlap, ...
    WindowLength=windowLength, ...
    HopLength=samplesPerHop, ...
    NumBands=numBands, ...
    FFTLength=fftLength)});
xTest = readall(tdsTest,UseParallel=canUseParallelPool);
xTest = cat(4,xTest{:});
Replicate the labels of the training and test sets so that they are in one-to-one correspondence with
the segments.
numSegmentsPer10seconds = size(dataBuffered,2)/2;
yTrain = repmat(trainLabels,1,numSegmentsPer10seconds)';
yTrain = yTrain(:);
yTest = repmat(testLabels,1,numSegmentsPer10seconds)';
yTest = yTest(:);
The DCASE 2017 dataset contains a relatively small number of acoustic recordings for the task, and
the development set and evaluation set were recorded at different specific locations. As a result, it is
easy to overfit to the data during training. One popular method to reduce overfitting is mixup. In
mixup, you augment your dataset by mixing the features of two different classes. When you mix the
features, you mix the labels in equal proportion. That is:
x̃ = λ xi + (1 − λ) xj
ỹ = λ yi + (1 − λ) yj
Mixup was reformulated by [2] on page 1-480 as labels drawn from a probability distribution instead
of mixed labels. The implementation of mixup in this example is a simplified version of mixup: each
spectrogram is mixed with a spectrogram of a different label with lambda set to 0.5. The original and
mixed datasets are combined for training.
xTrainExtra = xTrain;
yTrainExtra = yTrain;
lambda = 0.5;
for ii = 1:size(xTrain,4)
    % Pick a random spectrogram with a different label, then mix.
    availableSpectrograms = find(yTrain~=yTrain(ii));
    idx = availableSpectrograms(randi(numel(availableSpectrograms)));
    xTrainExtra(:,:,:,ii) = lambda*xTrain(:,:,:,ii) + (1-lambda)*xTrain(:,:,:,idx);
    % Keep the label of either source with equal probability (lambda = 0.5).
    if rand > lambda, yTrainExtra(ii) = yTrain(idx); end
end
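As described above, the original and mixed data sets are combined for training; the combination step is not shown in this extract. A minimal sketch:
xTrain = cat(4,xTrain,xTrainExtra);
yTrain = [yTrain;yTrainExtra];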
Call summary to display the distribution of labels for the augmented training set.
summary(yTrain)
beach 11769
bus 11904
cafe/restaurant 11873
car 11820
city_center 11886
forest_path 11936
grocery_store 11914
home 11923
library 11817
metro_station 11804
office 11922
park 11871
residential_area 11704
train 11773
tram 11924
Define the CNN architecture. This architecture is based on [1] on page 1-480 and modified through
trial and error. See “List of Deep Learning Layers” (Deep Learning Toolbox) to learn more about deep
learning layers available in MATLAB®.
imgSize = [size(xTrain,1),size(xTrain,2),size(xTrain,3)];
numF = 32;
layers = [ ...
imageInputLayer(imgSize)
batchNormalizationLayer
convolution2dLayer(3,numF,Padding="same")
batchNormalizationLayer
reluLayer
convolution2dLayer(3,numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,2*numF,Padding="same")
batchNormalizationLayer
reluLayer
convolution2dLayer(3,2*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,8*numF,Padding="same")
batchNormalizationLayer
reluLayer
convolution2dLayer(3,8*numF,Padding="same")
batchNormalizationLayer
reluLayer
globalAveragePooling2dLayer
dropoutLayer(0.5)
fullyConnectedLayer(15)
softmaxLayer];
Define trainingOptions (Deep Learning Toolbox) for the CNN. These options are based on [3] on
page 1-480 and modified through a systematic hyperparameter optimization workflow.
miniBatchSize = 128;
tuneme = 128;
lr = 0.05*miniBatchSize/tuneme;
options = trainingOptions( ...
"sgdm", ...
Momentum=0.9, ...
L2Regularization=0.005, ...
...
MiniBatchSize=miniBatchSize, ...
MaxEpochs=8, ...
Shuffle="every-epoch", ...
...
Plots="training-progress", ...
Verbose=false, ...
Metrics="accuracy", ...
...
InitialLearnRate=lr, ...
LearnRateSchedule="piecewise", ...
LearnRateDropPeriod=2, ...
LearnRateDropFactor=0.2, ...
...
ValidationData={xTest,yTest}, ...
ValidationFrequency=floor(size(xTrain,4)/miniBatchSize));
trainedNet = trainnet(xTrain,yTrain,layers,"crossentropy",options);
Evaluate CNN
Call minibatchpredict to predict responses from the trained network using the held-out test set.
cnnResponsesPerSegment = minibatchpredict(trainedNet,xTest);
counter = 1;
cnnResponses = zeros(numFiles,numel(classes));
for channel = 1:numFiles
cnnResponses(channel,:) = sum(cnnResponsesPerSegment(counter:counter+numSegmentsPer10seconds-1,:));
counter = counter + numSegmentsPer10seconds;
end
For each 10-second audio clip, choose the maximum of the predictions, then map it to the
corresponding predicted location.
[~,classIdx] = max(cnnResponses,[],2);
cnnPredictedLabels = classes(classIdx);
Call confusionchart (Deep Learning Toolbox) to visualize the accuracy on the test set.
figure(Units="normalized",Position=[0.2 0.2 0.5 0.5])
confusionchart(testLabels,cnnPredictedLabels, ...
title=["Test Accuracy - CNN","Average Accuracy = " + mean(testLabels==cnnPredictedLabels)*100
ColumnSummary="column-normalized",RowSummary="row-normalized");
Wavelet scattering has been shown in [4] on page 1-480 to provide a good representation of acoustic
scenes. Define a waveletScattering (Wavelet Toolbox) object. The invariance scale and quality
factors were determined through trial and error.
sf = waveletScattering(SignalLength=size(data,1), ...
SamplingFrequency=fs, ...
InvarianceScale=0.75, ...
QualityFactors=[4 1]);
Convert the audio signal to mono, and then call featureMatrix (Wavelet Toolbox) to return the
scattering coefficients for the scattering decomposition framework, sf.
dataMono = mean(data,2);
scatteringCoeffients = featureMatrix(sf,dataMono,Transform="log");
The helper function HelperWaveletFeatureVector on page 1-480 performs the above steps. Use a
transform datastore to parallelize the feature extraction. Extract wavelet feature vectors for the train
and test sets.
scatteringTrain = transform(adsTrain,@(x)HelperWaveletFeatureVector(x,sf));
xTrain = readall(scatteringTrain,UseParallel=canUseParallelPool);
xTrain = (reshape(xTrain,numel(featureVector),[]))';
scatteringTest = transform(adsTest,@(x)HelperWaveletFeatureVector(x,sf));
xTest = readall(scatteringTest,UseParallel=canUseParallelPool);
xTest = (reshape(xTest,numel(featureVector),[]))';
For each 10-second audio clip, call predict to return the predicted labels and the scores, then map
them to the corresponding predicted location. Call confusionchart (Deep Learning Toolbox) to
visualize the accuracy on the test set.
[waveletPredictedLabels,waveletResponses] = predict(classificationEnsemble,xTest);
For each 10-second clip, calling predict on the wavelet classifier and the CNN returns a vector
indicating the relative confidence in their decision. Multiply the waveletResponses with the
cnnResponses to create a late fusion system.
fused = waveletResponses.*cnnResponses;
[~,classIdx] = max(fused,[],2);
predictedLabels = classes(classIdx);
Supporting Functions
HelperSegmentedMelSpectrograms
function X = HelperSegmentedMelSpectrograms(x,fs,options)
% Copyright 2019-2023 The MathWorks, Inc.
arguments
x
fs
options.WindowLength = 1024
options.HopLength = 512
options.NumBands = 128
options.SegmentLength = 1
options.SegmentOverlap = 0
options.FFTLength = 1024
end
x = [sum(x,2),x(:,1)-x(:,2)];
x = x./max(max(x));
[xb_m,~] = buffer(x(:,1),round(options.SegmentLength*fs),round(options.SegmentOverlap*fs),"nodelay");
[xb_s,~] = buffer(x(:,2),round(options.SegmentLength*fs),round(options.SegmentOverlap*fs),"nodelay");
xb = zeros(size(xb_m,1),size(xb_m,2)+size(xb_s,2));
xb(:,1:2:end) = xb_m;
xb(:,2:2:end) = xb_s;
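% Compute the mel spectrogram of the buffered segments (assumed
% reconstruction of a step omitted from this excerpt; the exact parameter
% mapping from the options structure is an assumption).
spec = melSpectrogram(xb,fs, ...
Window=hamming(options.WindowLength,"periodic"), ...
OverlapLength=options.WindowLength - options.HopLength, ...
NumBands=options.NumBands, ...
FFTLength=options.FFTLength);
spec = log10(spec+eps);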
X = reshape(spec,size(spec,1),size(spec,2),size(x,2),[]);
end
HelperWaveletFeatureExtractor
References
[1] A. Mesaros, T. Heittola, and T. Virtanen. Acoustic Scene Classification: an Overview of DCASE
2017 Challenge Entries. In proc. International Workshop on Acoustic Signal Enhancement, 2018.
[2] Huszar, Ferenc. "Mixup: Data-Dependent Data Augmentation." InFERENCe. November 03, 2017.
Accessed January 15, 2019. https://www.inference.vc/mixup-data-dependent-data-augmentation/.
[3] Han, Yoonchang, Jeongsoo Park, and Kyogu Lee. "Convolutional neural networks with binaural
representations and background subtraction for acoustic scene classification." the Detection and
Classification of Acoustic Scenes and Events (DCASE) (2017): 1-5.
[4] Lostanlen, Vincent, and Joakim Anden. Binaural scene classification with wavelet scattering.
Technical Report, DCASE2016 Challenge, 2016.
Keyword Spotting in Noise Using MFCC and LSTM Networks
This example shows how to identify a keyword in noisy speech using a deep learning network. In
particular, the example uses a Bidirectional Long Short-Term Memory (BiLSTM) network and mel
frequency cepstral coefficients (MFCC).
Introduction
Keyword spotting (KWS) is an essential component of voice-assist technologies, where the user
speaks a predefined keyword to wake up a system before speaking a complete command or query to
the device.
This example trains a KWS deep network with feature sequences of mel-frequency cepstral
coefficients (MFCC). The example also demonstrates how network accuracy in a noisy environment
can be improved using data augmentation.
This example uses long short-term memory (LSTM) networks, which are a type of recurrent neural
network (RNN) well-suited to study sequence and time-series data. An LSTM network can learn long-
term dependencies between time steps of a sequence. An LSTM layer (lstmLayer (Deep Learning
Toolbox)) can look at the time sequence in the forward direction, while a bidirectional LSTM layer
(bilstmLayer (Deep Learning Toolbox)) can look at the time sequence in both forward and
backward directions. This example uses a bidirectional LSTM layer.
The example uses the Google Speech Commands Dataset to train the deep learning model. To run the
example, you must first download the data set. If you do not want to download the data set or train
the network, then you can download and use a pretrained network by opening this example in
MATLAB® and running the Spot Keyword with Pretrained Network section.
Before going into the training process in detail, you will download and use a pretrained keyword
spotting network to identify a keyword.
[audioIn,fs] = audioread("keywordTestSignal.wav");
sound(audioIn,fs)
Download and load the pretrained network, the mean (M) and standard deviation (S) vectors used for
feature normalization, as well as 2 audio files used for validating the network later in the example.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","KeywordSpotting.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
netFolder = fullfile(dataFolder,"KeywordSpotting");
load(fullfile(netFolder,"KWSNet.mat"));
windowLength = 512;
overlapLength = 384;
afe = audioFeatureExtractor(SampleRate=fs, ...
Window=hann(windowLength,"periodic"),OverlapLength=overlapLength, ...
mfcc=true,mfccDelta=true,mfccDeltaDelta=true);
features = extract(afe,audioIn);
Compute the keyword spotting binary mask. A mask value of one corresponds to a segment where the
keyword was spotted.
mask = classify(KWSNet,features.');
Each sample in the mask corresponds to 128 samples from the speech signal (windowLength -
overlapLength).
mask = repmat(mask,windowLength-overlapLength,1);
mask = double(mask) - 1;
mask = mask(:);
figure
audioIn = audioIn(1:length(mask));
t = (0:length(audioIn)-1)/fs;
plot(t,audioIn)
grid on
hold on
plot(t, mask)
legend("Speech","YES")
Test your pretrained keyword spotting network on streaming audio from your microphone. Try
saying random words, including the keyword (YES).
Define an audio device reader that can read audio from your microphone. Set the frame length to the
hop length. This enables you to compute a new set of features for every new audio frame from the
microphone.
hopLength = windowLength - overlapLength;
frameLength = hopLength;
adr = audioDeviceReader(SampleRate=fs,SamplesPerFrame=frameLength);
Create a scope for visualizing the speech signal and the estimated mask.
scope = timescope(SampleRate=fs, ...
TimeSpanSource="property", ...
TimeSpan=5, ...
TimeSpanOverrunAction="Scroll", ...
BufferLength=fs*5*2, ...
ShowLegend=true, ...
ChannelNames={'Speech','Keyword Mask'}, ...
YLimits=[-1.2,1.2], ...
Title="Keyword Spotting");
Define the rate at which you estimate the mask. You will generate a mask once every
numHopsPerUpdate audio frames.
numHopsPerUpdate = 16;
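The streaming loop below uses buffer objects for the raw audio, the plotted audio, and the extracted features. Their creation is not shown in this excerpt; one possible setup (an assumption, not the original code) uses dsp.AsyncBuffer objects:

dataBuff = dsp.AsyncBuffer(fs);
plotBuff = dsp.AsyncBuffer(fs);
featureBuff = dsp.AsyncBuffer(1000);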
To run the loop indefinitely, set timeLimit to Inf. To stop the simulation, close the scope.
timeLimit = 20;
tic
while toc < timeLimit
data = adr();
write(dataBuff,data);
write(plotBuff,data);
frame = read(dataBuff,windowLength,overlapLength);
features = generateKeywordFeatures(frame,fs);
write(featureBuff,features.');
if featureBuff.NumUnreadSamples == numHopsPerUpdate
featureMatrix = read(featureBuff);
featureMatrix(~isfinite(featureMatrix)) = 0;
featureMatrix = (featureMatrix - M)./S;
[keywordNet,v] = classifyAndUpdateState(KWSNet,featureMatrix.');
v = double(v) - 1;
v = repmat(v,hopLength,1);
v = v(:);
v = mode(v);
v = repmat(v,numHopsPerUpdate*hopLength,1);
data = read(plotBuff);
scope([data,v]);
if ~isVisible(scope)
break;
end
end
end
hide(scope)
In the rest of the example, you will learn how to train the keyword spotting network.
You use a sample speech signal to validate the KWS network. The validation signal consists of 34
seconds of speech with the keyword YES appearing intermittently.
Load the KWS baseline. This baseline was obtained using speech2text and Signal Labeler. For a
related example, see “Label Spoken Words in Audio Signals”.
load("KWSBaseline.mat","KWSBaseline")
The baseline is a logical vector of the same length as the validation audio signal. Segments in
audioIn where the keyword is uttered are set to one in KWSBaseline.
fig = figure;
plot(t,[audioIn,KWSBaseline'])
grid on
xlabel("Time (s)")
legend("Speech","KWS Baseline",Location="southeast")
l = findall(fig,"type","line");
l(1).LineWidth = 2;
title("Validation Signal")
sound(audioIn(KWSBaseline),fs)
The objective of the network that you train is to output a KWS mask of zeros and ones like this
baseline.
Download and extract the Google Speech Commands Dataset [1] on page 1-503.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","google_speech.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"google_speech");
ads = audioDatastore(dataset,LabelSource="foldername",Includesubfolders=true);
ads = shuffle(ads);
The dataset contains background noise files that are not used in this example. Use subset to create
a new datastore that does not have the background noise files.
isBackNoise = ismember(ads.Labels,"background");
ads = subset(ads,~isBackNoise);
The dataset has approximately 65,000 one-second long utterances of 30 short words (including the
keyword YES). Get a breakdown of the word distribution in the datastore.
countEachLabel(ads)
ans=30×2 table
Label Count
______ _____
bed 1713
bird 1731
cat 1733
dog 1746
down 2359
eight 2352
five 2357
four 2372
go 2372
happy 1742
house 1750
left 2353
marvin 1746
nine 2364
no 2375
off 2357
⋮
Split ads into two datastores: The first datastore contains files corresponding to the keyword. The
second datastore contains all the other words.
keyword = "yes";
isKeyword = ismember(ads.Labels,keyword);
adsKeyword = subset(ads,isKeyword);
adsOther = subset(ads,~isKeyword);
To train the network with the entire dataset and achieve the highest possible accuracy, set
speedupExample to false. To run this example quickly, set speedupExample to true.
speedupExample = ;
if speedupExample
% Reduce the dataset by a factor of 20
adsKeyword = splitEachLabel(adsKeyword,round(numel(adsKeyword.Files)/20));
numUniqueLabels = numel(unique(adsOther.Labels));
adsOther = splitEachLabel(adsOther,round(numel(adsOther.Files)/numUniqueLabels/20));
end
Get a breakdown of the word distribution in each datastore. Shuffle the adsOther datastore so that
consecutive reads return different words.
countEachLabel(adsKeyword)
ans=1×2 table
Label Count
_____ _____
yes 2377
countEachLabel(adsOther)
ans=29×2 table
Label Count
______ _____
bed 1713
bird 1731
cat 1733
dog 1746
down 2359
eight 2352
five 2357
four 2372
go 2372
happy 1742
house 1750
left 2353
marvin 1746
nine 2364
no 2375
off 2357
⋮
adsOther = shuffle(adsOther);
The training datastores contain one-second speech signals where one word is uttered. You will create
more complex training speech utterances that contain a mixture of the keyword along with other
words.
Here is an example of a constructed utterance. Read one keyword from the keyword datastore and
normalize it to have a maximum value of one.
yes = read(adsKeyword);
yes = yes/max(abs(yes));
The signal has non-speech portions (silence, background noise, etc.) that do not contain useful speech
information. This example removes silence using detectSpeech.
Get the start and end indices of the useful portion of the signal.
speechIndices = detectSpeech(yes,fs);
Randomly select the number of words to use in the synthesized training sentence. Use a maximum of
10 words.
numWords = randi([0,10]);
keywordLocation = randi([1,numWords+1]);
Read the desired number of non-keyword utterances, and construct the training sentence and mask.
sentence = [];
mask = [];
for index = 1:numWords+1
if index == keywordLocation
sentence = [sentence;yes]; %#ok
newMask = zeros(size(yes));
newMask(speechIndices(1,1):speechIndices(1,2)) = 1;
mask = [mask;newMask]; %#ok
else
other = read(adsOther);
other = other./max(abs(other));
sentence = [sentence;other]; %#ok
mask = [mask;zeros(size(other))]; %#ok
end
end
figure
t = (1/fs)*(0:length(sentence)-1);
fig = figure;
plot(t,[sentence,mask])
grid on
xlabel("Time (s)")
legend("Training Signal","Mask",Location="southeast")
l = findall(fig,"type","line");
l(1).LineWidth = 2;
title("Example Utterance")
Extract Features
This example trains a deep learning network using 39 MFCC coefficients (13 MFCC, 13 delta and 13
delta-delta coefficients).
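The feature extraction call that produces the size shown below is not included in this excerpt; it has this form (assumed):

featureMatrix = extract(afe,sentence);
size(featureMatrix)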
ans = 1×2
1372 39
Note that you compute MFCC by sliding a window through the input, so the feature matrix is shorter
than the input speech signal. Each row in featureMatrix corresponds to 128 samples from the
speech signal (windowLength - overlapLength).
Sentence synthesis and feature extraction for the whole training dataset can be quite time-
consuming. To speed up processing, if you have Parallel Computing Toolbox™, partition the training
datastore, and process each partition on a separate worker.
numPartitions = 6;
TrainingFeatures = {};
TrainingMasks= {};
Perform sentence synthesis, feature extraction, and mask creation using parfor.
tic
parfor ii = 1:numPartitions
subadsKeyword = partition(adsKeyword,numPartitions,ii);
subadsOther = partition(adsOther,numPartitions,ii);
count = 1;
localFeatures = cell(length(subadsKeyword.Files),1);
localMasks = cell(length(subadsKeyword.Files),1);
while hasdata(subadsKeyword)
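% Synthesize a training sentence and extract its features (assumed steps;
% the corresponding code was omitted from this excerpt and mirrors the
% augmented-training loop later in this example).
[sentence,mask] = synthesizeSentence(subadsKeyword,subadsOther,fs,windowLength);
featureMatrix = extract(afe,sentence);
featureMatrix(~isfinite(featureMatrix)) = 0;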
% Create mask
range = hopLength*(1:size(featureMatrix,1)) + hopLength;
featureMask = zeros(size(range));
for index = 1:numel(range)
featureMask(index) = mode(mask((index-1)*hopLength+1:(index-1)*hopLength+windowLength));
end
localFeatures{count} = featureMatrix;
localMasks{count} = [emptyCategories,categorical(featureMask)];
count = count + 1;
end
TrainingFeatures = [TrainingFeatures;localFeatures];
TrainingMasks = [TrainingMasks;localMasks];
end
It is good practice to normalize all features to have zero mean and unity standard deviation. Compute
the mean and standard deviation for each coefficient and use them to normalize the data.
sampleFeature = TrainingFeatures{1};
numFeatures = size(sampleFeature,2);
featuresMatrix = cat(1,TrainingFeatures{:});
if speedupExample
load(fullfile(netFolder,"keywordNetNoAugmentation.mat"),"keywordNetNoAugmentation","M","S");
else
M = mean(featuresMatrix);
S = std(featuresMatrix);
end
for index = 1:length(TrainingFeatures)
f = TrainingFeatures{index};
f = (f - M)./S;
TrainingFeatures{index} = f.'; %#ok
end
featureMask = zeros(size(range));
for index = 1:numel(range)
featureMask(index) = mode(KWSBaseline((index-1)*hopLength+1:(index-1)*hopLength+windowLength));
end
BaselineV = categorical(featureMask);
LSTM networks can learn long-term dependencies between time steps of sequence data. This
example uses the bidirectional LSTM layer bilstmLayer (Deep Learning Toolbox) to look at the
sequence in both forward and backward directions.
Specify the input size to be sequences of size numFeatures. Specify two hidden bidirectional LSTM
layers with an output size of 150 that output sequences. This instructs each bidirectional LSTM
layer to map the input time series into 150 features that are passed to the next layer. Specify
two classes by including a fully connected layer of size 2 followed by a softmax layer.
layers = [ ...
sequenceInputLayer(numFeatures)
bilstmLayer(150,OutputMode="sequence")
bilstmLayer(150,OutputMode="sequence")
fullyConnectedLayer(2)
softmaxLayer
];
Specify the training options for the classifier. Set MaxEpochs to 10 so that the network makes 10
passes through the training data. Set MiniBatchSize to 64 so that the network looks at 64 training
signals at a time. Set Plots to "training-progress" to generate plots that show the training
progress as the number of iterations increases. Set Verbose to false to disable printing the table
output that corresponds to the data shown in the plot. Set Shuffle to "every-epoch" to shuffle the
training sequence at the beginning of each epoch. Set LearnRateSchedule to "piecewise" to
decrease the learning rate by a specified factor (0.1) every time a certain number of epochs (5) has
passed. Set ValidationData to the validation predictors and targets.
This example uses the adaptive moment estimation (ADAM) solver. ADAM performs better with
recurrent neural networks (RNNs) like LSTMs than the default stochastic gradient descent with
momentum (SGDM) solver.
maxEpochs = 10;
miniBatchSize = 64;
options = trainingOptions("adam", ...
InitialLearnRate=1e-4, ...
MaxEpochs=maxEpochs, ...
MiniBatchSize=miniBatchSize, ...
Shuffle="every-epoch", ...
Verbose=false, ...
ValidationFrequency=floor(numel(TrainingFeatures)/miniBatchSize), ...
ValidationData={FeaturesValidationClean.',BaselineV}, ...
Plots="training-progress", ...
Metrics = "accuracy",...
LearnRateSchedule="piecewise", ...
LearnRateDropFactor=0.1, ...
LearnRateDropPeriod=5,...
InputDataFormats="CTB");
Train the LSTM network with the specified training options and layer architecture using trainnet.
Because the training set is large, the training process can take several minutes.
keywordNetNoAugmentation = trainnet(TrainingFeatures,TrainingMasks,layers,"crossentropy",options)
if speedupExample
load(fullfile(netFolder,"keywordNetNoAugmentation.mat"),"keywordNetNoAugmentation","M","S");
end
Estimate the KWS mask for the validation signal using the trained network.
v = predict(keywordNetNoAugmentation,FeaturesValidationClean);
v = scores2label(v,unique(BaselineV));
Calculate and plot the validation confusion matrix from the vectors of actual and estimated labels.
figure
confusionchart(BaselineV,v, ...
Title="Validation Accuracy", ...
ColumnSummary="column-normalized",RowSummary="row-normalized");
v = double(v.')-1;
v = repmat(v,hopLength,1);
v = v(:);
sound(audioIn(logical(v)),fs)
baseline = double(BaselineV) - 1;
baseline = repmat(baseline,hopLength,1);
baseline = baseline(:);
t = (1/fs)*(0:length(v)-1);
fig = figure;
plot(t,[audioIn(1:length(v)),v,0.8*baseline])
grid on
xlabel("Time (s)")
legend("Training Signal","Network Mask","Baseline Mask",Location="southeast")
l = findall(fig,"type","line");
l(1).LineWidth = 2;
l(2).LineWidth = 2;
title("Results for Noise-Free Speech")
You will now check the network accuracy for a noisy speech signal. The noisy signal was obtained by
corrupting the clean validation signal with additive white Gaussian noise.
[audioInNoisy,fs] = audioread(fullfile(netFolder,"NoisyKeywordSpeech-16-16-mono-34secs.flac"));
sound(audioInNoisy,fs)
figure
t = (1/fs)*(0:length(audioInNoisy)-1);
plot(t,audioInNoisy)
grid on
xlabel("Time (s)")
title("Noisy Validation Speech Signal")
v = predict(keywordNetNoAugmentation,FeaturesValidationNoisy);
v = scores2label(v,unique(BaselineV));
Compare the network output to the baseline. Note that the accuracy is lower than the one you got for
a clean signal.
figure
confusionchart(BaselineV,v, ...
Title="Validation Accuracy - Noisy Speech", ...
ColumnSummary="column-normalized",RowSummary="row-normalized");
v = double(v.')-1;
v = repmat(v,hopLength,1);
v = v(:);
sound(audioIn(logical(v)),fs)
t = (1/fs)*(0:length(v)-1);
fig = figure;
plot(t,[audioInNoisy(1:length(v)),v,0.8*baseline])
grid on
xlabel("Time (s)")
legend("Training Signal","Network Mask","Baseline Mask",Location="southeast")
l = findall(fig,"type","line");
l(1).LineWidth = 2;
l(2).LineWidth = 2;
title("Results for Noisy Speech - No Data Augmentation")
The trained network did not perform well on the noisy signal because the training dataset contained
only noise-free sentences. You will rectify this by augmenting the dataset to include noisy sentences.
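Create an audioDataAugmenter object. Its exact configuration is not shown in this excerpt; a configuration consistent with the description below (an assumption, not the original code) is:

ada = audioDataAugmenter( ...
    TimeStretchProbability=0, ...
    PitchShiftProbability=0, ...
    VolumeControlProbability=0, ...
    TimeShiftProbability=0, ...
    AddNoiseProbability=0.85, ...
    SNRRange=[-1,1]);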
With these settings, the audioDataAugmenter object corrupts an input audio signal with white
Gaussian noise with a probability of 85%. The SNR is randomly selected from the range [-1 1] (in dB).
There is a 15% probability that the augmenter does not modify your input signal.
reset(adsKeyword)
x = read(adsKeyword);
data = augment(ada,x,fs)
data=1×2 table
Audio AugmentationInfo
________________ ________________
Inspect the AugmentationInfo variable in data to verify how the signal was modified.
data.AugmentationInfo
reset(adsKeyword)
reset(adsOther)
TrainingFeatures = {};
TrainingMasks = {};
Perform feature extraction again. Each signal is corrupted by noise with a probability of 85%, so your
augmented dataset has approximately 85% noisy data and 15% noise-free data.
tic
parfor ii = 1:numPartitions
subadsKeyword = partition(adsKeyword,numPartitions,ii);
subadsOther = partition(adsOther,numPartitions,ii);
count = 1;
localFeatures = cell(length(subadsKeyword.Files),1);
localMasks = cell(length(subadsKeyword.Files),1);
while hasdata(subadsKeyword)
[sentence,mask] = synthesizeSentence(subadsKeyword,subadsOther,fs,windowLength);
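% Corrupt the sentence with noise, extract features, and create the
% frame-level mask (assumed steps, mirroring the earlier noise-free loop;
% the augment call follows the table output shown above).
augmented = augment(ada,sentence,fs);
sentence = augmented.Audio{1};
featureMatrix = extract(afe,sentence);
featureMatrix(~isfinite(featureMatrix)) = 0;
range = hopLength*(1:size(featureMatrix,1)) + hopLength;
featureMask = zeros(size(range));
for index = 1:numel(range)
featureMask(index) = mode(mask((index-1)*hopLength+1:(index-1)*hopLength+windowLength));
end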
localFeatures{count} = featureMatrix;
localMasks{count} = [emptyCategories,categorical(featureMask)];
count = count + 1;
end
TrainingFeatures = [TrainingFeatures;localFeatures];
TrainingMasks = [TrainingMasks;localMasks];
end
disp("Training feature extraction took " + toc + " seconds.")
Compute the mean and standard deviation for each coefficient; use them to normalize the data.
sampleFeature = TrainingFeatures{1};
numFeatures = size(sampleFeature,2);
featuresMatrix = cat(1,TrainingFeatures{:});
if speedupExample
load(fullfile(netFolder,"KWSNet.mat"),"KWSNet","M","S");
else
M = mean(featuresMatrix);
S = std(featuresMatrix);
end
for index = 1:length(TrainingFeatures)
f = TrainingFeatures{index};
f = (f - M) ./ S;
TrainingFeatures{index} = f.'; %#ok
end
Normalize the validation features with the new mean and standard deviation values.
Recreate the training options. Use the noisy baseline features and mask for validation.
[KWSNet,netInfo] = trainnet(TrainingFeatures,TrainingMasks,layers,"crossentropy",options);
if speedupExample
load(fullfile(netFolder,"KWSNet.mat"));
end
v = predict(KWSNet,FeaturesValidationNoisy);
v = scores2label(v,unique(BaselineV));
figure
confusionchart(BaselineV,v, ...
Title="Validation Accuracy with Data Augmentation", ...
ColumnSummary="column-normalized",RowSummary="row-normalized");
v = double(v.')-1;
v = repmat(v,hopLength,1);
v = v(:);
sound(audioIn(logical(v)),fs)
fig = figure;
plot(t,[audioInNoisy(1:length(v)),v,0.8*baseline])
grid on
xlabel("Time (s)")
legend("Training Signal","Network Mask","Baseline Mask",Location="southeast")
l = findall(fig,"type","line");
l(1).LineWidth = 2;
l(2).LineWidth = 2;
title("Results for Noisy Speech - With Data Augmentation")
Supporting Functions
Synthesize Sentence
mask = [mask;newMask];
else
other = read(adsOther);
other = other./max(abs(other));
sentence = [sentence;other];
mask = [mask;zeros(size(other))];
end
end
end
References
[1] Warden P. "Speech Commands: A public dataset for single-word speech recognition", 2017.
Available from https://storage.googleapis.com/download.tensorflow.org/data/
speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed
under the Creative Commons Attribution 4.0 license.
Speaker Verification Using Gaussian Mixture Model
Speaker verification, or authentication, is the task of verifying that a given speech segment belongs
to a given speaker. In speaker verification systems, there is an unknown set of all other speakers, so
the likelihood that an utterance belongs to the verification target is compared to the likelihood that it
does not. This contrasts with speaker identification tasks, where the likelihood of each speaker is
calculated, and those likelihoods are compared. Both speaker verification and speaker identification
can be text dependent or text independent. In this example, you create a text-dependent speaker
verification system using a Gaussian mixture model/universal background model (GMM-UBM).
To motivate this example, you will first perform speaker verification using a pre-trained universal
background model (UBM). The model was trained using the word "stop" from the Google Speech
Commands data set [1] on page 1-521.
Enroll
If you would like to test enrolling yourself, set enrollYourself to true. You will be prompted to
record yourself saying "stop" several times. Say "stop" only once per prompt. Increasing the number
of recordings should increase the verification accuracy.
enrollYourself = ;
if enrollYourself
numToRecord = ;
ID = ;
helperAddUser(afe.SampleRate,numToRecord,ID);
end
Create an audioDatastore object to point to the five audio files included with this example, and, if
you enrolled yourself, the audio files you just recorded. The audio files included with this example are
part of an internally created data set and were not used to train the UBM.
ads = audioDatastore(pwd);
The files included with this example consist of the word "stop" spoken five times by three different
speakers: BFn (1), BHm (3), and RPalanim (1). The file names are in the format
SpeakerID_RecordingNumber. Set the datastore labels to the corresponding speaker ID.
[~,fileName] = cellfun(@(x)fileparts(x),ads.Files,UniformOutput=false);
fileName = split(fileName,"_");
speaker = strcat(fileName(:,1));
ads.Labels = categorical(speaker);
Use all but one file from the speaker you are enrolling for the enrollment process. The remaining files
are used to test the system.
if enrollYourself
enrollLabel = ID;
else
enrollLabel = "BHm";
end
forEnrollment = ads.Labels==enrollLabel;
forEnrollment(find(forEnrollment==1,1)) = false;
adsEnroll = subset(ads,forEnrollment);
adsTest = subset(ads,~forEnrollment);
Enroll the chosen speaker using maximum a posteriori (MAP) adaptation. You can find details of the
enrollment algorithm later in the example on page 1-511.
speakerGMM = helperEnroll(ubm,afe,normFactors,adsEnroll);
Verification
For each of the files in the test set, use the likelihood ratio test and a threshold to determine whether
the speaker is the enrolled speaker or an imposter.
threshold = ;
reset(adsTest)
while hasdata(adsTest)
disp("Identity to confirm: " + enrollLabel)
[audioData,adsInfo] = read(adsTest);
verificationStatus = helperVerify(audioData,afe,normFactors,speakerGMM,ubm,threshold);
if verificationStatus
disp(" | Confirmed.");
else
disp(" | Imposter!");
end
end
| Imposter!
| Confirmed.
| Imposter!
The remainder of the example details the creation of the UBM and the enrollment algorithm, and
then evaluates the system using commonly reported metrics.
The UBM used in this example is trained using [1] on page 1-521. Download and extract the data
set.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","google_speech.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
Create an audioDatastore that points to the dataset. Use the folder names as the labels. The folder
names indicate the words spoken in the dataset.
ads = audioDatastore(dataFolder,Includesubfolders=true,LabelSource="folderNames");
ads = subset(ads,ads.Labels==categorical("stop"));
Set the labels to the unique speaker IDs encoded in the file names. The speaker IDs sometimes start
with a number; add an 'a' to all the IDs so that they are easier to use as variable and field names.
[~,fileName] = cellfun(@(x)fileparts(x),ads.Files,UniformOutput=false);
fileName = split(fileName,"_");
speaker = strcat("a",fileName(:,1));
ads.Labels = categorical(speaker);
Create three datastores: one for enrollment, one for evaluating the verification system, and one for
training the UBM. Enroll speakers who have at least three utterances. For each of the speakers, place
two of the utterances in the enrollment set. The others will go in the test set. The test set consists of
utterances from all speakers who have three or more utterances in the dataset. The UBM training set
consists of the remaining utterances.
numSpeakersToEnroll = ;
labelCount = countEachLabel(ads);
forEnrollAndTestSet = labelCount{:,1}(labelCount{:,2}>=3);
forEnroll = forEnrollAndTestSet(randi([1,numel(forEnrollAndTestSet)],numSpeakersToEnroll,1));
tf = ismember(ads.Labels,forEnroll);
adsEnrollAndValidate = subset(ads,tf);
adsEnroll = splitEachLabel(adsEnrollAndValidate,2);
adsTest = subset(ads,ismember(ads.Labels,forEnrollAndTestSet));
adsTest = subset(adsTest,~ismember(adsTest.Files,adsEnroll.Files));
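The creation of the UBM training datastore is not shown in this excerpt. A statement consistent with the description above (an assumption, not the original code) is:

adsTrainUBM = subset(ads,~ismember(ads.Labels,forEnrollAndTestSet));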
Read from the training datastore and listen to a file. Reset the datastore.
[audioData,audioInfo] = read(adsTrainUBM);
fs = audioInfo.SampleRate;
sound(audioData,fs)
reset(adsTrainUBM)
Feature Extraction
First, create an audioFeatureExtractor object to extract the MFCC. Specify a 40 ms duration and
10 ms hop for the frames.
windowDuration = 0.04;
hopDuration = 0.01;
windowSamples = round(windowDuration*fs);
hopSamples = round(hopDuration*fs);
overlapSamples = windowSamples - hopSamples;
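The audioFeatureExtractor construction is not shown in this excerpt. A minimal configuration consistent with the 13 MFCC features reported later (an assumption, not the original code) is:

afe = audioFeatureExtractor(SampleRate=fs, ...
    Window=hann(windowSamples,"periodic"), ...
    OverlapLength=overlapSamples, ...
    mfcc=true);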
audioData = audioData./max(abs(audioData));
Use the detectSpeech function to locate the region of speech in the audio clip. Call detectSpeech
without any output arguments to visualize the detected region of speech.
detectSpeech(audioData,fs);
Call detectSpeech again. This time, return the indices of the speech region and use them to remove
nonspeech regions from the audio clip.
idx = detectSpeech(audioData,fs);
audioData = audioData(idx(1,1):idx(1,2));
Call extract on the audioFeatureExtractor object to extract features from audio data. The size
output from extract is numHops-by-numFeatures.
features = extract(afe,audioData);
[numHops,numFeatures] = size(features)
numHops = 21
numFeatures = 13
Normalize the features by their global mean and variance. The next section of the example walks
through calculating the global mean and variance. For now, just use the precalculated mean and
variance already loaded.
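Applied to the extracted features, the normalization takes this form (assumed, matching the feature extraction helper at the end of this example):

features = (features-normFactors.Mean')./normFactors.STD';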
Extract all features from the data set. If you have the Parallel Computing Toolbox™, determine the
optimal number of partitions for the dataset and spread the computation across available workers. If
you do not have Parallel Computing Toolbox™, use a single partition.
featuresAll = {};
if ~isempty(ver("parallel"))
numPar = 18;
else
numPar = 1;
end
Use the helper function, helperFeatureExtraction, to extract all features from the dataset.
Calling helperFeatureExtraction with an empty third argument performs the feature extraction
steps described in Feature Extraction on page 1-507 except for the normalization by global mean and
variance.
parfor ii = 1:numPar
adsPart = partition(ads,numPar,ii);
featuresPart = cell(0,numel(adsPart.Files));
for iii = 1:numel(adsPart.Files)
audioData = read(adsPart);
featuresPart{iii} = helperFeatureExtraction(audioData,afe,[]);
end
featuresAll = [featuresAll,featuresPart];
end
allFeatures = cat(2,featuresAll{:});
normFactors.Mean = mean(allFeatures,2,"omitnan");
normFactors.STD = std(allFeatures,[],2,"omitnan");
Initialize GMM
The universal background model is a Gaussian mixture model. Define the number of components in
the mixture. [2] on page 1-521 suggests more than 512 for text-independent systems. The
component weights begin evenly distributed.
numComponents = ;
alpha = ones(1,numComponents)/numComponents;
Use random initialization for the mu and sigma of each GMM component. Create a structure to hold
the necessary UBM information.
mu = randn(numFeatures,numComponents);
sigma = rand(numFeatures,numComponents);
ubm = struct(ComponentProportion=alpha,mu=mu,sigma=sigma);
Fit the GMM to the training set to create the UBM. Use the expectation-maximization algorithm.
maxIter = 20;
targetLogLikelihood = 0;
tol = 0.5;
pastL = -inf; % initialization of previous log-likelihood
tic
for iter = 1:maxIter
% EXPECTATION
N = zeros(1,numComponents);
F = zeros(numFeatures,numComponents);
S = zeros(numFeatures,numComponents);
L = 0;
parfor ii = 1:numPar
adsPart = partition(adsTrainUBM,numPar,ii);
while hasdata(adsPart)
audioData = read(adsPart);
% Extract features
features = helperFeatureExtraction(audioData,afe,normFactors);
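% Accumulate the Baum-Welch statistics for this partition (assumed step;
% the accumulation mirrors the enrollment loop later in this example).
[n,f,s,l] = helperExpectation(features,ubm);
N = N + n;
F = F + f;
S = S + s;
L = L + l;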
% MAXIMIZATION
N = max(N,eps);
ubm.ComponentProportion = max(N/sum(N),eps);
ubm.ComponentProportion = ubm.ComponentProportion/sum(ubm.ComponentProportion);
ubm.mu = bsxfun(@rdivide,F,N);
ubm.sigma = max(bsxfun(@rdivide,S,N) - ubm.mu.^2,eps);
end
Once you have a universal background model, you can enroll speakers and adapt the UBM to the
speakers. [2] on page 1-521 suggests an adaptation relevance factor of 16. The relevance factor
controls how much to move each component of the UBM to the speaker GMM.
relevanceFactor = 16;
speakers = unique(adsEnroll.Labels);
numSpeakers = numel(speakers);
gmmCellArray = cell(numSpeakers,1);
tic
parfor ii = 1:numSpeakers
% Subset the datastore to the speaker you are adapting.
adsTrainSubset = subset(adsEnroll,adsEnroll.Labels==speakers(ii));
N = zeros(1,numComponents);
F = zeros(numFeatures,numComponents);
S = zeros(numFeatures,numComponents);
while hasdata(adsTrainSubset)
audioData = read(adsTrainSubset);
features = helperFeatureExtraction(audioData,afe,normFactors);
[n,f,s,l] = helperExpectation(features,ubm);
N = N + n;
F = F + f;
S = S + s;
end
gmmCellArray{ii} = gmm;
end
disp("Enrollment completed in " + round(toc,2) + " seconds.")
For bookkeeping purposes, convert the cell array of GMMs to a struct, with the fields being the
speaker IDs and the values being the GMM structs.
for i = 1:numel(gmmCellArray)
enrolledGMMs.(string(speakers(i))) = gmmCellArray{i};
end
Evaluation
The speaker false rejection rate (FRR) is the rate that a given speaker is incorrectly rejected. Use the
known speaker set to determine the speaker false rejection rate for a set of thresholds.
speakers = unique(adsEnroll.Labels);
numSpeakers = numel(speakers);
llr = cell(numSpeakers,1);
tic
parfor speakerIdx = 1:numSpeakers
localGMM = enrolledGMMs.(string(speakers(speakerIdx)));
adsTestSubset = subset(adsTest,adsTest.Labels==speakers(speakerIdx));
llrPerSpeaker = zeros(numel(adsTestSubset.Files),1);
for fileIdx = 1:numel(adsTestSubset.Files)
audioData = read(adsTestSubset);
[x,numFrames] = helperFeatureExtraction(audioData,afe,normFactors);
logLikelihood = helperGMMLogLikelihood(x,localGMM);
Lspeaker = helperLogSumExp(logLikelihood);
logLikelihood = helperGMMLogLikelihood(x,ubm);
Lubm = helperLogSumExp(logLikelihood);
end
llr{speakerIdx} = llrPerSpeaker;
end
disp("False rejection rate computed in " + round(toc,2) + " seconds.")
llr = cat(1,llr{:});
thresholds = -0.5:0.01:2.5;
FRR = mean(llr<thresholds);
plot(thresholds,FRR*100)
title("False Rejection Rate (FRR)")
xlabel("Threshold")
ylabel("Incorrectly Rejected (%)")
grid on
The speaker false acceptance rate (FAR) is the rate that utterances not belonging to an enrolled
speaker are incorrectly accepted as belonging to the enrolled speaker. Use the known speaker set to
determine the speaker FAR for a set of thresholds. Use the same set of thresholds used to determine
FRR.
speakersTest = unique(adsTest.Labels);
llr = cell(numSpeakers,1);
tic
parfor speakerIdx = 1:numel(speakers)
localGMM = enrolledGMMs.(string(speakers(speakerIdx)));
adsTestSubset = subset(adsTest,adsTest.Labels~=speakers(speakerIdx));
llrPerSpeaker = zeros(numel(adsTestSubset.Files),1);
for fileIdx = 1:numel(adsTestSubset.Files)
audioData = read(adsTestSubset);
[x,numFrames] = helperFeatureExtraction(audioData,afe,normFactors);
logLikelihood = helperGMMLogLikelihood(x,localGMM);
Lspeaker = helperLogSumExp(logLikelihood);
logLikelihood = helperGMMLogLikelihood(x,ubm);
Lubm = helperLogSumExp(logLikelihood);
llr = cat(1,llr{:});
FAR = mean(llr>thresholds);
plot(thresholds,FAR*100)
title("False Acceptance Rate (FAR)")
xlabel("Threshold")
ylabel("Incorrectly Rejected (%)")
grid on
As you move the threshold in a speaker verification system, you trade off between FAR and FRR. This
is referred to as the detection error tradeoff (DET) and is commonly reported for binary classification
problems.
x1 = FAR*100;
y1 = FRR*100;
plot(x1,y1)
grid on
xlabel("False Acceptance Rate (%)")
ylabel("False Rejection Rate (%)")
title("Detection Error Tradeoff (DET) Curve")
To compare multiple systems, you need a single metric that combines the FAR and FRR
performances. For this, you determine the equal error rate (EER), which is the threshold where the
FAR and FRR curves meet. In practice, the EER threshold may not be the best choice. For example, if
speaker verification is used as part of a multi-authentication approach for wire transfers, FAR would
most likely be weighed more heavily than FRR.
If you changed parameters of the UBM training, consider resaving the MAT file with the new
universal background model, audioFeatureExtractor, and norm factors.
resave = ;
if resave
save("speakerVerificationExampleData.mat","ubm","afe","normFactors")
end
Supporting Functions
function helperAddUser(fs,numToRecord,ID)
% Create an audio device reader to read from your audio device
deviceReader = audioDeviceReader(SampleRate=fs);
% Initialize variables
numRecordings = 1;
audioIn = [];
while toc<2
audioIn = [audioIn;deviceReader()];
end
fprintf('complete.\n')
idx = detectSpeech(audioIn,fs);
if isempty(idx)
fprintf('Speech not detected. Try again.\n')
else
audiowrite(sprintf('%s_%i.flac',ID,numRecordings),audioIn,fs)
numRecordings = numRecordings+1;
end
pause(0.2)
audioIn = [];
end
Enroll
while hasdata(adsEnroll)
% Read from the enrollment datastore
audioData = read(adsEnroll);
% Adapt the UBM to create the speaker model. Use a relevance factor of 16,
% as proposed in [2]
relevanceFactor = 16;
Verify
% Determine the log-likelihood the audio came from the GMM adapted to
% the speaker
post = helperGMMLogLikelihood(x,speakerGMM);
Lspeaker = helperLogSumExp(post);
% Determine the log-likelihood the audio came from the GMM fit to all
% speakers
post = helperGMMLogLikelihood(x,ubm);
Lubm = helperLogSumExp(post);
% Calculate the ratio for all frames. Apply a moving median filter
% to remove outliers, and then take the mean across the frames
llr = mean(movmedian(Lspeaker - Lubm,3));
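% Compare the log-likelihood ratio to the threshold to make the decision
% (assumed final step of the verification helper, based on the description
% in the Verification section above).
verificationStatus = llr > threshold;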
Feature Extraction
% Feature extraction
features = extract(afe,audioData);
% Feature normalization
if ~isempty(normFactors)
features = (features-normFactors.Mean')./normFactors.STD';
end
features = features';
numFrames = size(features,2);
end
Log-sum-exponent
function y = helperLogSumExp(x)
% Calculate the log-sum-exponent while avoiding overflow
a = max(x,[],1);
y = a + log(sum(exp(bsxfun(@minus,x,a)),1));
end
Expectation
function [N,F,S,L] = helperExpectation(features,gmm)
post = helperGMMLogLikelihood(features,gmm);
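% Convert the per-component log-likelihoods to posteriors (assumed step;
% omitted from this excerpt).
L = helperLogSumExp(post);
gamma = exp(bsxfun(@minus,post,L))';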
N = sum(gamma,1);
F = features * gamma;
S = (features.*features) * gamma;
L = sum(L);
end
Maximization
function gmm = helperMaximization(N,F,S)
N = max(N,eps);
gmm.ComponentProportion = max(N/sum(N),eps);
gmm.mu = bsxfun(@rdivide,F,N);
gmm.sigma = max(bsxfun(@rdivide,S,N) - gmm.mu.^2,eps);
end
temp = squeeze(permute(Lunweighted,[1,3,2]));
if size(temp,1)==1
% If there is only one frame, the trailing singleton dimension was
% removed in the permute. This accounts for that edge case
temp = temp';
end
L = bsxfun(@plus,temp,log(gmm.ComponentProportion)');
end
References
[1] Warden P. "Speech Commands: A public dataset for single-word speech recognition", 2017.
Available from https://storage.googleapis.com/download.tensorflow.org/data/
speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed
under the Creative Commons Attribution 4.0 license, available here: https://creativecommons.org/
licenses/by/4.0/legalcode.
[2] Reynolds, Douglas A., Thomas F. Quatieri, and Robert B. Dunn. "Speaker Verification Using
Adapted Gaussian Mixture Models." Digital Signal Processing 10, no. 1-3 (2000): 19-41. https://
doi.org/10.1006/dspr.1999.0361.
Sequential Feature Selection for Audio Features
This example shows a typical workflow for feature selection applied to the task of spoken digit
recognition.
In sequential feature selection, you train a network on a given feature set and then incrementally add
or remove features until the highest accuracy is reached [1] on page 1-532. In this example, you
apply sequential forward selection to the task of spoken digit recognition using the Free Spoken Digit
Dataset [2] on page 1-532.
deviceReader = audioDeviceReader(SampleRate=fs,SamplesPerFrame=256);
audioBuffer = dsp.AsyncBuffer(fs*3);
steBuffer = dsp.AsyncBuffer(1000);
predictionBuffer = dsp.AsyncBuffer(5);
Create a plot to display the streaming audio, the probability the network outputs during inference,
and the prediction.
fig = figure;
streamAxes = subplot(3,1,1);
streamPlot = plot(zeros(fs,1));
ylabel("Amplitude")
xlabel("Time (s)")
title("Audio Stream")
streamAxes.XTick = [0,fs];
streamAxes.XTickLabel = [0,1];
streamAxes.YLim = [-1,1];
analyzedAxes = subplot(3,1,2);
analyzedPlot = plot(zeros(fs/2,1));
title("Analyzed Segment")
ylabel("Amplitude")
xlabel("Time (s)")
set(gca,XTickLabel=[])
analyzedAxes.XTick = [0,fs/2];
analyzedAxes.XTickLabel = [0,0.5];
analyzedAxes.YLim = [-1,1];
probabilityAxes = subplot(3,1,3);
probabilityPlot = bar(0:9,0.1*ones(1,10));
axis([-1,10,0,1])
ylabel("Probability")
xlabel("Class")
Perform streaming digit recognition (digits 0 through 9) for 20 seconds. While the loop runs, speak
one of the digits and test its accuracy.
First, define a short-term energy threshold under which to assume a signal contains no speech.
steThreshold = 0.015;
idxVec = 1:fs;
tic
while toc < 20
ste = mean(abs(audioToAnalyze));
write(steBuffer,ste);
if steBuffer.NumUnreadSamples > 5
abc = sort(peek(steBuffer));
steThreshold = abc(round(0.4*numel(abc)));
end
if ste > steThreshold
% winning label.
features(isnan(features)|isinf(features)) = 0;
scores = predict(bestNet,features);
if predictionBuffer.NumUnreadSamples == predictionBuffer.Capacity
lastTen = peek(predictionBuffer);
[~,decision] = max(mean(lastTen.*hann(size(lastTen,1)),1));
probabilityAxes.Title.String = num2str(decision-1);
end
end
else
% If the signal energy is below the threshold, assume no speech
% detected.
probabilityAxes.Title.String = "";
probabilityPlot.YData = 0.1*ones(10,1);
analyzedPlot.YData = zeros(fs/2,1);
reset(predictionBuffer)
end
drawnow limitrate
end
end
The remainder of the example illustrates how the network used in the streaming detection was
trained and how the features fed into the network were chosen.
Download the Free Spoken Digit Dataset (FSDD) [2] on page 1-532. FSDD consists of short audio
files with spoken digits (0-9).
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"FSDD");
Create an audioDatastore to point to the recordings. Get the sample rate of the data set.
ads = audioDatastore(dataset,IncludeSubfolders=true);
[~,adsInfo] = read(ads);
fs = adsInfo.SampleRate;
The first element of the file names is the digit spoken in the file. Get the first element of the file
names, convert them to categorical, and then set the Labels property of the audioDatastore.
[~,filenames] = cellfun(@(x)fileparts(x),ads.Files,UniformOutput=false);
ads.Labels = categorical(string(cellfun(@(x)x(1),filenames)));
To split the datastore into a development set and a validation set, use splitEachLabel. Allocate
80% of the data for development and the remaining 20% for validation.
[adsTrain,adsValidation] = splitEachLabel(ads,0.8);
win = hamming(round(0.03*fs),"periodic");
overlapLength = round(0.02*fs);
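% Opening of the audioFeatureExtractor constructor (assumed reconstruction;
% the beginning of this call and part of its feature list were omitted from
% this excerpt, so the exact set of enabled features is an assumption).
afe = audioFeatureExtractor( ...
SampleRate=fs, ...
Window=win, ...
OverlapLength=overlapLength, ...
...
mfcc=true, ...
mfccDelta=true, ...
mfccDeltaDelta=true, ...
gtcc=true, ...
gtccDelta=true, ...
gtccDeltaDelta=true, ...
spectralCentroid=true, ...
spectralCrest=true, ...
spectralDecrease=true, ...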
spectralEntropy=true, ...
spectralFlatness=true, ...
spectralFlux=true, ...
spectralKurtosis=true, ...
spectralRolloffPoint=true, ...
spectralSkewness=true, ...
spectralSlope=true, ...
spectralSpread=true, ...
...
pitch=false, ...
harmonicRatio=false, ...
zerocrossrate=false, ...
shortTimeEnergy=false);
Define the “List of Deep Learning Layers” (Deep Learning Toolbox) and trainingOptions (Deep
Learning Toolbox) used in this example. The first layer, sequenceInputLayer (Deep Learning
Toolbox), is just a placeholder. Depending on which features you test during sequential feature
selection, the first layer is replaced with a sequenceInputLayer of the appropriate size.
numUnits = ;
layers = [ ...
sequenceInputLayer(1)
bilstmLayer(numUnits,OutputMode="last")
fullyConnectedLayer(numel(categories(adsTrain.Labels)))
softmaxLayer];
In the basic form of sequential feature selection, you train a network on a given feature set and then
incrementally add or remove features until the accuracy no longer improves [1] on page 1-532.
Forward Selection
Consider a simple case of forward selection on a set of four features. In the first forward selection
loop, each of the four features are tested independently by training a network and comparing their
validation accuracy. The feature that resulted in the highest validation accuracy is noted. In the
second forward selection loop, the best feature from the first loop is combined with each of the
remaining features. Now each pair of features is used for training. If the accuracy in the second loop
did not improve over the accuracy in the first loop, the selection process ends. Otherwise, a new best
feature set is selected. The forward selection loop continues until the accuracy no longer improves.
Backward Selection
In backward feature selection, you begin by training on a feature set that consists of all features and
test whether or not accuracy improves as you remove features.
direction = ;
[logbook,bestFeatures,bestNet] = sequentialFeatureSelection(adsTrain,adsValidation,afe,layers,options,direction);
logbook
logbook=62×2 table
Features Accuracy
_______________________________________________________ ________
bestFeatures
set(afe,bestFeatures)
afe
afe =
audioFeatureExtractor with properties:
Properties
Window: [240×1 double]
OverlapLength: 160
SampleRate: 8000
FFTLength: []
SpectralDescriptorInput: 'linearSpectrum'
FeatureVectorLength: 39
Enabled Features
mfcc, mfccDeltaDelta, gtccDelta
Disabled Features
linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfccDelta, gtcc
gtccDeltaDelta, spectralCentroid, spectralCrest, spectralDecrease, spectralEntropy, spectral
spectralFlux, spectralKurtosis, spectralRolloffPoint, spectralSkewness, spectralSlope, spect
pitch, harmonicRatio, zerocrossrate, shortTimeEnergy
sequentialFeatureSelection also outputs the best performing network and the normalization
factors that correspond to the chosen features. To save the network and configured
audioFeatureExtractor, uncomment this line:
% save('network_Audio_SequentialFeatureSelection.mat','bestNet','afe')
Supporting Functions
fs = afe.SampleRate;
tLabels = adsValidation.Labels;
% Use the training set to determine the mean and standard deviation of each
% feature. Normalize the training and validation sets.
allFeatures = cat(1,featuresTrain{:});
allFeatures(isinf(allFeatures)) = nan;
[S,M] = std(allFeatures,0,1,"omitnan");
end
afe = copy(afeThis);
featuresToTest = fieldnames(info(afe));
N = numel(featuresToTest);
bestValidationAccuracy = 0;
% Update Logbook
result = table(currentConfig,valAccuracy,VariableNames=["Feature Configuration","Accuracy"]);
logbook = [logbook;result]; %#ok<AGROW>
end
% Determine and print the setting with the best accuracy. If accuracy
% did not improve, end the run.
[a,b] = max(logbook{:,"Accuracy"});
if a <= bestAccuracy
wrapperIdx = inf;
else
wrapperIdx = wrapperIdx + 1;
end
bestAccuracy = a;
end
References
[1] Jain, A., and D. Zongker. "Feature Selection: Evaluation, Application, and Small Sample
Performance." IEEE Transactions on Pattern Analysis and Machine Intelligence. Vol. 19, Issue 2,
1997, pp. 153-158.
Train Generative Adversarial Network (GAN) for Sound Synthesis
This example shows how to train and use a generative adversarial network (GAN) to generate sounds.
Introduction
In generative adversarial networks, a generator and a discriminator compete against each other to
improve the generation quality.
GANs have generated significant interest in the field of audio and speech processing. Applications
include text-to-speech synthesis, voice conversion, and speech enhancement.
This example trains a GAN for unsupervised synthesis of audio waveforms. The GAN in this example
generates percussive sounds. The same approach can be followed to generate other types of sound,
including speech.
Before you train a GAN from scratch, use a pretrained GAN generator to synthesize percussive
sounds.
matFileName = "drumGeneratorWeights.mat";
loc = matlab.internal.examples.downloadSupportFile("audio","GanAudioSynthesis/" + matFileName);
copyfile(loc,pwd)
synthsound = synthesizePercussiveSound;
fs = 16e3;
sound(synthsound,fs)
t = (0:length(synthsound)-1)/fs;
plot(t,synthsound)
grid on
xlabel("Time (s)")
title("Synthesized Percussive Sound")
You can use the percussive sounds synthesizer with other audio effects to create more complex
applications. For example, you can apply reverberation to the synthesized percussive sounds.
Create a reverberator object and open its parameter tuner UI. This UI enables you to tune the
reverberator parameters as the simulation runs.
reverb = reverberator(SampleRate=fs);
parameterTuner(reverb);
In a loop, synthesize the percussive sounds and apply reverberation. Use the parameter tuner UI to
tune reverberation. If you want to run the simulation for a longer time, increase the value of the
loopCount parameter.
loopCount = 20;
for ii = 1:loopCount
synthsound = synthesizePercussiveSound;
synthsound = reverb(synthsound);
ts(synthsound(:,1));
soundsc(synthsound,fs)
pause(0.5)
end
Now that you have seen the pretrained percussive sounds generator in action, you can investigate the
training process in detail.
A GAN is a type of deep learning network that generates data with characteristics similar to the
training data.
A GAN consists of two networks that train together, a generator and a discriminator:
• Generator - Given a vector of random values as input, this network generates data with the same
structure as the training data. It is the generator's job to fool the discriminator.
• Discriminator - Given batches of data containing observations from both the training data and the
generated data, this network attempts to classify the observations as real or generated.
To maximize the performance of the generator, maximize the loss of the discriminator when given
generated data. That is, the objective of the generator is to generate data that the discriminator
classifies as real. To maximize the performance of the discriminator, minimize the loss of the
discriminator when given batches of both real and generated data. Ideally, these strategies result in a
generator that generates convincingly realistic data and a discriminator that has learned strong
feature representations that are characteristic of the training data.
In this example, you train the generator to create fake time-frequency short-time Fourier transform
(STFT) representations of percussive sounds. You train the discriminator to identify whether an STFT
was synthesized by the generator or computed from a real audio signal. You create the real STFTs by
computing the STFT of short recordings of real percussive sounds.
Train a GAN using the Freesound One-Shot Percussive Sounds dataset [2] on page 1-556. Download
and extract the dataset. Remove any files with licenses that prohibit commercial use.
url1 = "https://zenodo.org/record/4687854/files/one_shot_percussive_sounds.zip";
url2 = "https://zenodo.org/record/4687854/files/licenses.txt";
downloadFolder = tempdir;
percussivesoundsFolder = fullfile(downloadFolder,"one_shot_percussive_sounds");
licensefilename = fullfile(percussivesoundsFolder,"licenses.txt");
if ~datasetExists(percussivesoundsFolder)
disp("Downloading Freesound One-Shot Percussive Sounds Dataset (112.6 MB) ...")
unzip(url1,downloadFolder)
websave(licensefilename,url2);
removeRestrictiveLicence(percussivesoundsFolder,licensefilename)
end
ads = audioDatastore(percussivesoundsFolder,IncludeSubfolders=true);
Define a network that generates STFTs from 1-by-1-by-100 arrays of random values. Create a network
that upscales 1-by-1-by-100 arrays to 128-by-128-by-1 arrays using a fully connected layer followed
by a reshape layer and a series of transposed convolution layers with ReLU layers.
This figure shows the dimensions of the signal as it travels through the generator. The generator
architecture is defined in Table 4 of [1] on page 1-556.
The generator network is defined in modelGenerator, which is included at the end of this example.
Create a network that takes 128-by-128 images and outputs a scalar prediction score using a series of
convolution layers with leaky ReLU layers followed by a fully connected layer.
This figure shows the dimensions of the signal as it travels through the discriminator. The
discriminator architecture is defined in Table 5 of [1] on page 1-556.
Generate STFT data from the percussive sound signals in the datastore.
To speed up processing, distribute the feature extraction across multiple workers using parfor.
First, determine the number of partitions for the dataset. If you do not have Parallel Computing
Toolbox™, use a single partition.
if canUseParallelPool
pool = gcp;
numPar = numpartitions(ads,pool);
else
numPar = 1;
end
For each partition, read from the datastore and compute the STFT.
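The STFT analysis parameters are defined earlier in the full example and do not appear in this excerpt. The values below are assumptions chosen to match the 256-point FFT, periodic Hann window, and 128-sample overlap used elsewhere in this example.
fftLength = 256;
win = hann(fftLength,"periodic");
overlapLength = 128;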
STrainC = cell(1,numPar);
parfor ii = 1:numPar
subds = partition(ads,numPar,ii);
STrain = zeros(fftLength/2+1,128,1,numel(subds.Files));
idx = 1;
while hasdata(subds)
% Read audio
[x,xinfo] = read(subds);
% Preprocess
x = preprocessAudio(single(x),xinfo.SampleRate);
% STFT
S0 = stft(x,Window=win,OverlapLength=overlapLength,FrequencyRange="onesided");
% Magnitude
S = abs(S0);
STrain(:,:,:,idx) = S;
idx = idx + 1;
end
STrainC{ii} = STrain;
end
Convert the output to a four-dimensional array with STFTs along the fourth dimension.
STrain = cat(4,STrainC{:});
Convert the data to the log scale to better align with human perception.
Normalize training data to have zero mean and unit standard deviation.
Compute the STFT mean and standard deviation of each frequency bin.
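The log conversion and the per-bin statistics are not shown in this excerpt. A minimal sketch, consistent with the exp, SMean, and SStd operations used later in the synthesis section, follows; the small offset that guards the logarithm is an assumption.
STrain = log(STrain + 1e-6); % offset value assumed
SMean = mean(STrain,[2 3 4]);
SStd = std(STrain,1,[2 3 4]);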
STrain = (STrain-SMean)./SStd;
The computed STFTs have unbounded values. Following the approach in [1] on page 1-556, make
the data bounded by clipping the spectra to 3 standard deviations and rescaling to [-1 1].
STrain = STrain/3;
Y = reshape(STrain,numel(STrain),1);
Y(Y<-1) = -1;
Y(Y>1) = 1;
STrain = reshape(Y,size(STrain));
Discard the last frequency bin to force the number of STFT bins to a power of two (which works well
with convolutional layers).
STrain = STrain(1:end-1,:,:,:);
maxEpochs = 1000;
miniBatchSize = 64;
numIterationsPerEpoch = floor(size(STrain,4)/miniBatchSize);
Specify the options for Adam optimization. Set the learn rate of the generator and discriminator to
0.0002. For both networks, use a gradient decay factor of 0.5 and a squared gradient decay factor of
0.999.
learnRateGenerator = 0.0002;
learnRateDiscriminator = 0.0002;
gradientDecayFactor = 0.5;
squaredGradientDecayFactor = 0.999;
Train on a GPU if one is available. Using a GPU requires Parallel Computing Toolbox™.
executionEnvironment = ;
generatorParameters = initializeGeneratorWeights;
discriminatorParameters = initializeDiscriminatorWeights;
Train Model
Train the model using a custom training loop. Loop over the training data and update the network
parameters at each iteration.
For each epoch, shuffle the training data and loop over mini-batches of data.
• Generate a dlarray object containing an array of random values for the generator network.
• For GPU training, convert the data to a gpuArray (Parallel Computing Toolbox) object.
• Evaluate the model gradients using dlfeval (Deep Learning Toolbox) and the helper functions,
modelDiscriminatorGradients and modelGeneratorGradients.
• Update the network parameters using the adamupdate (Deep Learning Toolbox) function.
trailingAvgGenerator = [];
trailingAvgSqGenerator = [];
trailingAvgDiscriminator = [];
trailingAvgSqDiscriminator = [];
Depending on your machine, training this network can take hours. To skip training, set doTraining
to false.
doTraining = ;
You can set saveCheckpoints to true to save the updated weights and states to a MAT file every
ten epochs. You can then use this MAT file to resume training if it is interrupted.
saveCheckpoints = ;
numLatentInputs = 100;
iteration = 0;
if doTraining
for epoch = 1:maxEpochs
% Shuffle the training data.
shuffleIdx = randperm(size(STrain,4));
STrain = STrain(:,:,:,shuffleIdx);
for index = 1:numIterationsPerEpoch
iteration = iteration + 1;
% Generate latent inputs, evaluate gradients with dlfeval and the helper
% gradient functions, then update both networks with adamupdate.
end
end
end
Synthesize Sounds
Now that you have trained the network, you can investigate the synthesis process in more detail.
The trained percussive sound generator synthesizes short-time Fourier transform (STFT) matrices
from input arrays of random values. An inverse STFT (ISTFT) operation converts the time-frequency
STFT to a synthesized time-domain audio signal.
The generator takes 1-by-1-by-100 vectors of random values as an input. Generate a sample input
vector.
dlZ = dlarray(2*(rand(1,1,numLatentInputs,1,"single") - 0.5));
Pass the random vector to the generator to create an STFT image. generatorParameters is a
structure containing the weights of the pretrained generator.
dlXGenerated = modelGenerator(dlZ,generatorParameters);
Convert the generated output from a dlarray to a numeric matrix, then transpose the STFT to align its dimensions with the istft function.
S = squeeze(extractdata(dlXGenerated));
S = S.';
The STFT is a 128-by-128 matrix, where the first dimension represents 128 frequency bins linearly
spaced from 0 to 8 kHz. The generator was trained to generate a one-sided STFT from an FFT length
of 256, with the last bin omitted. Reintroduce that bin by inserting a row of zeros into the STFT.
S = [S;zeros(1,128)];
Revert the normalization and scaling steps used when you generated the STFTs for training.
S = S * 3;
S = (S.*SStd) + SMean;
Convert the STFT from the log domain to the linear domain.
S = exp(S);
Mirror the one-sided STFT to a two-sided spectrum, then pad the STFT with zeros at the start and end to reduce edge effects in the phase reconstruction.
S = [S;S(end-1:-1:2,:)];
S = [zeros(256,100),S,zeros(256,100)];
The STFT matrix does not contain any phase information. Use a fast version of the Griffin-Lim
algorithm with 20 iterations to estimate the signal phase and produce audio samples.
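The reconstruction call itself is not shown in this excerpt. One way to perform it, assuming the Signal Processing Toolbox function stftmag2sig with the fast Griffin-Lim ("fgla") method, is sketched below.
% Estimate the phase from the two-sided magnitude STFT (sketch; exact call assumed)
myAudio = stftmag2sig(S,256,FrequencyRange="twosided", ...
    Window=hann(256,"periodic"),OverlapLength=128, ...
    Method="fgla",MaxIterations=20);
myAudio = myAudio./max(abs(myAudio),[],"all");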
sound(gather(myAudio),fs)
t = (0:length(myAudio)-1)/fs;
plot(t,myAudio)
grid on
xlabel("Time (s)")
title("Synthesized GAN Sound")
figure
stft(myAudio,fs,Window=hann(256,"periodic"),OverlapLength=128);
The modelGenerator function upscales 1-by-1-by-100 arrays (dlX) to 128-by-128-by-1 arrays (dlY).
parameters is a structure holding the weights of the generator layers. The generator architecture is
defined in Table 4 of [1] on page 1-556.
function dlY = modelGenerator(dlX,parameters)
dlY = fullyconnect(dlX,parameters.FC.Weights,parameters.FC.Bias,Dataformat="SSCB");
% Reshape the fully connected output to 4-by-4-by-1024 per observation
% (this reshape is an assumption consistent with the layer sizes below)
dlY = reshape(dlY,[4 4 1024 size(dlY,2)]);
dlY = dltranspconv(dlY,parameters.Conv1.Weights,parameters.Conv1.Bias,Stride=2,Cropping="same",Dataformat="SSCB");
dlY = relu(dlY);
dlY = dltranspconv(dlY,parameters.Conv2.Weights,parameters.Conv2.Bias,Stride=2,Cropping="same",Dataformat="SSCB");
dlY = relu(dlY);
dlY = dltranspconv(dlY,parameters.Conv3.Weights,parameters.Conv3.Bias,Stride=2,Cropping="same",Dataformat="SSCB");
dlY = relu(dlY);
dlY = dltranspconv(dlY,parameters.Conv4.Weights,parameters.Conv4.Bias,Stride=2,Cropping="same",Dataformat="SSCB");
dlY = relu(dlY);
dlY = dltranspconv(dlY,parameters.Conv5.Weights,parameters.Conv5.Bias,Stride=2,Cropping="same",Dataformat="SSCB");
dlY = tanh(dlY);
end
The modelDiscriminator function takes 128-by-128 images and outputs a scalar prediction score.
The discriminator architecture is defined in Table 5 of [1].
function dlY = modelDiscriminator(dlX,parameters)
dlY = dlconv(dlX,parameters.Conv1.Weights,parameters.Conv1.Bias,Stride=2,Padding="same");
dlY = leakyrelu(dlY,0.2);
dlY = dlconv(dlY,parameters.Conv2.Weights,parameters.Conv2.Bias,Stride=2,Padding="same");
dlY = leakyrelu(dlY,0.2);
dlY = dlconv(dlY,parameters.Conv3.Weights,parameters.Conv3.Bias,Stride=2,Padding="same");
dlY = leakyrelu(dlY,0.2);
dlY = dlconv(dlY,parameters.Conv4.Weights,parameters.Conv4.Bias,Stride=2,Padding="same");
dlY = leakyrelu(dlY,0.2);
dlY = dlconv(dlY,parameters.Conv5.Weights,parameters.Conv5.Bias,Stride=2,Padding="same");
dlY = leakyrelu(dlY,0.2);
dlY = stripdims(dlY);
dlY = permute(dlY,[3 2 1 4]);
dlY = reshape(dlY,4*4*64*16,numel(dlY)/(4*4*64*16));
weights = parameters.FC.Weights;
bias = parameters.FC.Bias;
dlY = fullyconnect(dlY,weights,bias,Dataformat="CB");
end
The modelDiscriminatorGradients function takes as input the discriminator and generator learnable
parameters, a mini-batch of real data X, and an array of random values Z, and returns the gradients of
the discriminator loss with respect to the learnable parameters in the networks.
function gradientsDiscriminator = modelDiscriminatorGradients(discriminatorParameters,generatorParameters,X,Z)
% Calculate the predictions for real data with the discriminator network.
Y = modelDiscriminator(X,discriminatorParameters);
% Calculate the predictions for generated data with the discriminator network.
Xgen = modelGenerator(Z,generatorParameters);
Ygen = modelDiscriminator(dlarray(Xgen,"SSCB"),discriminatorParameters);
% Calculate the GAN loss.
lossDiscriminator = ganDiscriminatorLoss(Y,Ygen);
% For each network, calculate the gradients with respect to the loss.
gradientsDiscriminator = dlgradient(lossDiscriminator,discriminatorParameters);
end
The modelGeneratorGradients function takes as input the discriminator and generator learnable
parameters and an array of random values Z, and returns the gradients of the generator loss with
respect to the learnable parameters in the networks.
function gradientsGenerator = modelGeneratorGradients(discriminatorParameters,generatorParameters,Z)
% Calculate the predictions for generated data with the discriminator network.
Xgen = modelGenerator(Z,generatorParameters);
Ygen = modelDiscriminator(dlarray(Xgen,"SSCB"),discriminatorParameters);
% Calculate the GAN loss.
lossGenerator = ganGeneratorLoss(Ygen);
% For each network, calculate the gradients with respect to the loss.
gradientsGenerator = dlgradient(lossGenerator,generatorParameters);
end
The objective of the discriminator is to not be fooled by the generator. To maximize the probability
that the discriminator successfully discriminates between the real and generated images, minimize
the discriminator loss function. The loss function for the discriminator follows the DCGAN approach
highlighted in [1] on page 1-556.
function lossDiscriminator = ganDiscriminatorLoss(dlYPred,dlYPredGenerated)
fake = dlarray(zeros(1,size(dlYPred,2)));
real = dlarray(ones(1,size(dlYPred,2)));
D_loss = mean(sigmoid_cross_entropy_with_logits(dlYPredGenerated,fake));
D_loss = D_loss + mean(sigmoid_cross_entropy_with_logits(dlYPred,real));
lossDiscriminator = D_loss / 2;
end
The objective of the generator is to generate data that the discriminator classifies as "real". To
maximize the probability that images from the generator are classified as real by the discriminator,
minimize the generator loss function. The loss function for the generator follows the deep
convolutional generative adverarial network (DCGAN) approach highlighted in [1] on page 1-556.
function lossGenerator = ganGeneratorLoss(dlYPredGenerated)
real = dlarray(ones(1,size(dlYPredGenerated,2)));
lossGenerator = mean(sigmoid_cross_entropy_with_logits(dlYPredGenerated,real));
end
function discriminatorParameters = initializeDiscriminatorWeights
filterSize = [5 5];
dim = 64;
% Conv2D
weights = iGlorotInitialize([filterSize(1) filterSize(2) 1 dim]);
bias = zeros(1,1,dim,"single");
discriminatorParameters.Conv1.Weights = dlarray(weights);
discriminatorParameters.Conv1.Bias = dlarray(bias);
% Conv2D
weights = iGlorotInitialize([filterSize(1) filterSize(2) dim 2*dim]);
bias = zeros(1,1,2*dim,"single");
discriminatorParameters.Conv2.Weights = dlarray(weights);
discriminatorParameters.Conv2.Bias = dlarray(bias);
% Conv2D
weights = iGlorotInitialize([filterSize(1) filterSize(2) 2*dim 4*dim]);
bias = zeros(1,1,4*dim,"single");
discriminatorParameters.Conv3.Weights = dlarray(weights);
discriminatorParameters.Conv3.Bias = dlarray(bias);
% Conv2D
weights = iGlorotInitialize([filterSize(1) filterSize(2) 4*dim 8*dim]);
bias = zeros(1,1,8*dim,"single");
discriminatorParameters.Conv4.Weights = dlarray(weights);
discriminatorParameters.Conv4.Bias = dlarray(bias);
% Conv2D
weights = iGlorotInitialize([filterSize(1) filterSize(2) 8*dim 16*dim]);
bias = zeros(1,1,16*dim,"single");
discriminatorParameters.Conv5.Weights = dlarray(weights);
discriminatorParameters.Conv5.Bias = dlarray(bias);
% fully connected
weights = iGlorotInitialize([1,4 * 4 * dim * 16]);
bias = zeros(1,1,"single");
discriminatorParameters.FC.Weights = dlarray(weights);
discriminatorParameters.FC.Bias = dlarray(bias);
end
function generatorParameters = initializeGeneratorWeights
dim = 64;
% Dense 1
weights = iGlorotInitialize([dim*256,100]);
bias = zeros(dim*256,1,"single");
generatorParameters.FC.Weights = dlarray(weights);
generatorParameters.FC.Bias = dlarray(bias);
filterSize = [5 5];
% Trans Conv2D
weights = iGlorotInitialize([filterSize(1) filterSize(2) 8*dim 16*dim]);
bias = zeros(1,1,dim*8,"single");
generatorParameters.Conv1.Weights = dlarray(weights);
generatorParameters.Conv1.Bias = dlarray(bias);
% Trans Conv2D
weights = iGlorotInitialize([filterSize(1) filterSize(2) 4*dim 8*dim]);
bias = zeros(1,1,dim*4,"single");
generatorParameters.Conv2.Weights = dlarray(weights);
generatorParameters.Conv2.Bias = dlarray(bias);
% Trans Conv2D
weights = iGlorotInitialize([filterSize(1) filterSize(2) 2*dim 4*dim]);
bias = zeros(1,1,dim*2,"single");
generatorParameters.Conv3.Weights = dlarray(weights);
generatorParameters.Conv3.Bias = dlarray(bias);
% Trans Conv2D
weights = iGlorotInitialize([filterSize(1) filterSize(2) dim 2*dim]);
bias = zeros(1,1,dim,"single");
generatorParameters.Conv4.Weights = dlarray(weights);
generatorParameters.Conv4.Bias = dlarray(bias);
% Trans Conv2D
weights = iGlorotInitialize([filterSize(1) filterSize(2) 1 dim]);
bias = zeros(1,1,1,"single");
generatorParameters.Conv5.Weights = dlarray(weights);
generatorParameters.Conv5.Bias = dlarray(bias);
end
function y = synthesizePercussiveSound
% pGeneratorParameters holds the pretrained generator weights (loaded elsewhere in the full example)
% Generate a latent random vector and synthesize a spectrogram
dlZ = dlarray(2*(rand(1,1,100,1,"single") - 0.5));
dlXGenerated = modelGenerator(dlZ,pGeneratorParameters);
S = squeeze(extractdata(dlXGenerated));
S = S.';
% Reintroduce the dropped frequency bin
S = [S ; zeros(1,128)];
% Make it two-sided
S = [S ; S(end-1:-1:2,:)];
% Pad with zeros at end and start
S = [zeros(256,100) S zeros(256,100)];
Utility Functions
function w = iGlorotInitialize(sz)
if numel(sz) == 2
numInputs = sz(2);
numOutputs = sz(1);
else
numInputs = prod(sz(1:3));
numOutputs = prod(sz([1 2 4]));
end
multiplier = sqrt(2 / (numInputs + numOutputs));
w = multiplier * sqrt(3) * (2 * rand(sz,"single") - 1);
end
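The sigmoid_cross_entropy_with_logits helper called by the loss functions does not appear in this excerpt. A standard numerically stable formulation, given here as an assumption rather than the exact code of the example, is:
function out = sigmoid_cross_entropy_with_logits(x,z)
% Numerically stable sigmoid cross-entropy between logits x and targets z
out = max(x,0) - x.*z + log(1 + exp(-abs(x)));
end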
function out = preprocessAudio(in,fs)
% Resample to 16 kHz
x = resample(in,16e3,fs);
% Trim or pad x to the fixed analysis length used for training (with the
% trimOrPad helper below, producing y); the target length is set in the full example
% Normalize
out = y./max(abs(y));
end
function y = trimOrPad(x,n)
a = size(x,1);
if a < n
frontPad = floor((n-a)/2);
backPad = n - a - frontPad;
y = [zeros(frontPad,size(x,2),like=x);x;zeros(backPad,size(x,2),like=x)];
elseif a > n
frontTrim = floor((a-n)/2) + 1;
backTrim = a - n - frontTrim + 1;
y = x(frontTrim:end-backTrim,:);
else
y = x;
end
end
function removeRestrictiveLicence(percussivesoundsFolder,licensefilename)
%removeRestrictiveLicense Remove restrictive license
% Parse the licenses file that maps ids to license. Create a table to hold the info.
f = fileread(licensefilename);
K = jsondecode(f);
fns = fields(K);
T = table(Size=[numel(fns),4], ...
VariableTypes=["string","string","string","string"], ...
VariableNames=["ID","FileName","UserName","License"]);
for ii = 1:numel(fns)
fn = string(K.(fns{ii}).name);
li = string(K.(fns{ii}).license);
id = extractAfter(string(fns{ii}),"x");
un = string(K.(fns{ii}).username);
T(ii,:) = {id,fn,un,li};
end
% Remove any files that prohibit commercial use. Find the file inside the
% appropriate folder, and then delete it.
unsupportedLicense = "http://creativecommons.org/licenses/by-nc/3.0/";
fileToRemove = T.ID(strcmp(T.License,unsupportedLicense));
for ii = 1:numel(fileToRemove)
fileInfo = dir(fullfile(percussivesoundsFolder,"**",fileToRemove(ii)+".wav"));
delete(fullfile(fileInfo.folder,fileInfo.name))
end
end
References
[1] Donahue, C., J. McAuley, and M. Puckette. 2019. "Adversarial Audio Synthesis." ICLR.
[2] Ramires, Antonio, Pritish Chandna, Xavier Favory, Emilia Gomez, and Xavier Serra. "Neural
Percussive Synthesis Parameterised by High-Level Timbral Features." ICASSP 2020 - 2020 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020. https://doi.org/
10.1109/icassp40776.2020.9053128.
See Also
text2speech
Speaker Verification Using i-vectors
Speaker verification, or authentication, is the task of confirming that the identity of a speaker is who
they purport to be. Speaker verification has been an active research area for many years. An early
performance breakthrough was to use a Gaussian mixture model and universal background model
(GMM-UBM) [1] on page 1-581 on acoustic features (usually MFCCs). For an example, see “Speaker
Verification Using Gaussian Mixture Model” on page 1-504. One of the main difficulties of GMM-UBM
systems involves intersession variability. Joint factor analysis (JFA) was proposed to compensate for
this variability by separately modeling inter-speaker variability and channel or session variability [2]
on page 1-581 [3] on page 1-581. However, [4] on page 1-581 discovered that channel factors in
the JFA also contained information about the speakers, and proposed combining the channel and
speaker spaces into a total variability space. Intersession variability was then compensated for by
using backend procedures, such as linear discriminant analysis (LDA) and within-class covariance
normalization (WCCN), followed by a scoring, such as the cosine similarity score. [5] on page 1-581
proposed replacing the cosine similarity scoring with a probabilistic LDA (PLDA) model. [11] on page
1-582 and [12] on page 1-582 proposed a method to Gaussianize the i-vectors and therefore make
Gaussian assumptions in the PLDA, referred to as G-PLDA or simplified PLDA. While i-vectors were
originally proposed for speaker verification, they have been applied to many problems, like language
recognition, speaker diarization, emotion recognition, age estimation, and anti-spoofing [10] on page
1-582. Recently, deep learning techniques have been proposed to replace i-vectors with d-vectors or
x-vectors [8] on page 1-581 [6] on page 1-581.
Audio Toolbox provides ivectorSystem which encapsulates the ability to train an i-vector system,
enroll speakers or other audio labels, evaluate the system for a decision threshold, and identify or
verify speakers or other audio labels. See ivectorSystem for examples of using this feature and
applying it to several applications.
To learn more about how an i-vector system works, continue with the example.
In this example, you develop a standard i-vector system for speaker verification that uses an LDA-
WCCN backend with either cosine similarity scoring or a G-PLDA scoring.
Throughout the example, you will find live controls on tunable parameters. Changing the controls
does not rerun the example. If you change a control, you must rerun the example.
This example uses the Pitch Tracking Database from Graz University of Technology (PTDB-TUG) [7]
on page 1-581. The data set consists of 20 English native speakers reading 2342 phonetically rich
sentences from the TIMIT corpus. Download and extract the data set.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","ptdb-tug.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"ptdb-tug");
Create an audioDatastore object that points to the data set. The data set was originally intended
for use in pitch-tracking training and evaluation, and includes laryngograph readings and baseline
pitch decisions. Use only the original audio recordings.
The file names contain the speaker IDs. Decode the file names to set the labels on the
audioDatastore object.
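The datastore creation and file-name extraction are not shown in this excerpt. A minimal sketch, assuming the same SPEECH DATA/FEMALE/MIC and SPEECH DATA/MALE/MIC folder layout used by the loadDataset helper later in this chapter, is:
ads = audioDatastore([fullfile(dataset,"SPEECH DATA","FEMALE","MIC"), ...
    fullfile(dataset,"SPEECH DATA","MALE","MIC")], ...
    IncludeSubfolders=true,FileExtensions=".wav");
fileNames = ads.Files;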
speakerIDs = extractBetween(fileNames,"mic_","_");
ads.Labels = categorical(speakerIDs);
countEachLabel(ads)
ans=20×2 table
Label Count
_____ _____
F01 236
F02 236
F03 236
F04 236
F05 236
F06 236
F07 236
F08 234
F09 236
F10 236
M01 236
M02 236
M03 236
M04 236
M05 236
M06 236
⋮
Separate the audioDatastore object into training, evaluation, and test sets. The training set
contains 16 speakers. The evaluation set contains four speakers and is further divided into an
enrollment set and a set to evaluate the detection error tradeoff of the trained i-vector system, and a
test set.
developmentLabels = categorical(["M01","M02","M03","M04","M06","M07","M08","M09","F01","F02","F03","F04","F06","F07","F08","F09"]);
evaluationLabels = categorical(["M05","M10","F05","F10"]);
adsTrain = subset(ads,ismember(ads.Labels,developmentLabels));
adsEvaluate = subset(ads,ismember(ads.Labels,evaluationLabels));
numFilesPerSpeakerForEnrollment = ;
[adsEnroll,adsTest,adsDET] = splitEachLabel(adsEvaluate,numFilesPerSpeakerForEnrollment,2);
countEachLabel(adsTrain)
ans=16×2 table
Label Count
_____ _____
F01 236
F02 236
F03 236
F04 236
F06 236
F07 236
F08 234
F09 236
M01 236
M02 236
M03 236
M04 236
M06 236
M07 236
M08 236
M09 236
countEachLabel(adsEnroll)
ans=4×2 table
Label Count
_____ _____
F05 3
F10 3
M05 3
M10 3
countEachLabel(adsDET)
ans=4×2 table
Label Count
_____ _____
F05 231
F10 231
M05 231
M10 231
countEachLabel(adsTest)
ans=4×2 table
Label Count
_____ _____
F05 2
F10 2
M05 2
M10 2
Read an audio file from the training data set, listen to it, and plot it. Reset the datastore.
[audio,audioInfo] = read(adsTrain);
fs = audioInfo.SampleRate;
t = (0:size(audio,1)-1)/fs;
sound(audio,fs)
plot(t,audio)
xlabel("Time (s)")
ylabel("Amplitude")
axis([0 t(end) -1 1])
title("Sample Utterance from Training Set")
reset(adsTrain)
You can reduce the data set and the number of parameters used in this example to speed up the
runtime at the cost of performance. In general, reducing the data set is a good practice for
development and debugging.
speedUpExample = ;
if speedUpExample
adsTrain = splitEachLabel(adsTrain,30);
adsDET = splitEachLabel(adsDET,21);
end
Feature Extraction
numCoeffs = ;
deltaWindowLength = ;
windowDuration = ;
hopDuration = ;
windowSamples = round(windowDuration*fs);
hopSamples = round(hopDuration*fs);
overlapSamples = windowSamples - hopSamples;
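The audioFeatureExtractor configuration is not shown in this excerpt. A minimal sketch, assuming MFCC plus delta and delta-delta features (consistent with the 60 features reported below for 20 coefficients), is:
afe = audioFeatureExtractor( ...
    SampleRate=fs, ...
    Window=hann(windowSamples,"periodic"), ...
    OverlapLength=overlapSamples, ...
    mfcc=true,mfccDelta=true,mfccDeltaDelta=true);
setExtractorParameters(afe,"mfcc",DeltaWindowLength=deltaWindowLength,NumCoeffs=numCoeffs);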
Extract features from the audio read from the training datastore. Features are returned as a
numHops-by-numFeatures matrix.
features = extract(afe,audio);
[numHops,numFeatures] = size(features)
numHops = 797
numFeatures = 60
Training
Training an i-vector system is computationally expensive and time-consuming. If you have Parallel
Computing Toolbox™, you can spread the work across multiple cores to speed up the example.
Determine the optimal number of partitions for your system. If you do not have Parallel Computing
Toolbox™, use a single partition.
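The partitioning code is not shown here; a minimal sketch that follows the same pattern used for the GAN example earlier in this chapter is:
if canUseParallelPool
pool = gcp;
numPar = numpartitions(adsTrain,pool);
else
numPar = 1;
end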
Use the helper function, helperFeatureExtraction, to extract all features from the data set. The
helperFeatureExtraction on page 1-579 function extracts MFCC from regions of speech in the
audio. The speech detection is performed by the detectSpeech function.
featuresAll = {};
tic
parfor ii = 1:numPar
adsPart = partition(adsTrain,numPar,ii);
featuresPart = cell(0,numel(adsPart.Files));
for iii = 1:numel(adsPart.Files)
audioData = read(adsPart);
featuresPart{iii} = helperFeatureExtraction(audioData,afe,[]);
end
featuresAll = [featuresAll,featuresPart];
end
allFeatures = cat(2,featuresAll{:});
disp("Feature extraction from training set complete (" + toc + " seconds).")
Calculate the global mean and standard deviation of each feature. You will use these in future calls to
the helperFeatureExtraction function to normalize the features.
normFactors.Mean = mean(allFeatures,2,"omitnan");
normFactors.STD = std(allFeatures,[],2,"omitnan");
Initialize the Gaussian mixture model (GMM) that will be the universal background model (UBM) in
the i-vector system. The component weights are initialized as evenly distributed. Systems trained on
the TIMIT data set usually contain around 2048 components.
numComponents = ;
if speedUpExample
numComponents = 32;
end
alpha = ones(1,numComponents)/numComponents;
mu = randn(numFeatures,numComponents);
vari = rand(numFeatures,numComponents) + eps;
ubm = struct(ComponentProportion=alpha,mu=mu,sigma=vari);
maxIter = ;
if speedUpExample
maxIter = 2;
end
tic
for iter = 1:maxIter
tic
% EXPECTATION
N = zeros(1,numComponents);
F = zeros(numFeatures,numComponents);
S = zeros(numFeatures,numComponents);
L = 0;
parfor ii = 1:numPar
adsPart = partition(adsTrain,numPar,ii);
while hasdata(adsPart)
audioData = read(adsPart);
% Extract features
Y = helperFeatureExtraction(audioData,afe,normFactors);
% Compute a posteriori log-likelihood and responsibilities
logLikelihood = helperGMMLogLikelihood(Y,ubm);
amax = max(logLikelihood,[],1);
logLikelihoodSum = amax + log(sum(exp(logLikelihood-amax),1));
gamma = exp(logLikelihood-logLikelihoodSum)';
% Accumulate the zeroth-, first-, and second-order statistics
n = sum(gamma,1);
f = Y*gamma;
s = (Y.*Y)*gamma;
N = N + n;
F = F + f;
S = S + s;
L = L + sum(logLikelihoodSum);
end
end
% MAXIMIZATION
N = max(N,eps);
ubm.ComponentProportion = max(N/sum(N),eps);
ubm.ComponentProportion = ubm.ComponentProportion/sum(ubm.ComponentProportion);
ubm.mu = F./N;
ubm.sigma = max(S./N - ubm.mu.^2,eps);
end
The Baum-Welch statistics are the N (zeroth order) and F (first order) statistics used in the EM
algorithm, calculated using the final UBM.
$$N_c(s) = \sum_t \gamma_t(c)$$

$$F_c(s) = \sum_t \gamma_t(c)\, Y_t$$
Calculate the zeroth and first order Baum-Welch statistics over the training set.
numSpeakers = numel(adsTrain.Files);
Nc = {};
Fc = {};
tic
parfor ii = 1:numPar
adsPart = partition(adsTrain,numPar,ii);
numFiles = numel(adsPart.Files);
Npart = cell(1,numFiles);
Fpart = cell(1,numFiles);
for jj = 1:numFiles
audioData = read(adsPart);
% Extract features
Y = helperFeatureExtraction(audioData,afe,normFactors);
% Compute a posteriori log-likelihood and responsibilities
logLikelihood = helperGMMLogLikelihood(Y,ubm);
amax = max(logLikelihood,[],1);
logLikelihoodSum = amax + log(sum(exp(logLikelihood-amax),1));
gamma = exp(logLikelihood-logLikelihoodSum)';
% Compute the zeroth- and first-order Baum-Welch statistics
n = sum(gamma,1);
f = Y*gamma;
Npart{jj} = reshape(n,1,1,numComponents);
Fpart{jj} = reshape(f,numFeatures,1,numComponents);
end
Nc = [Nc,Npart];
Fc = [Fc,Fpart];
end
disp("Baum-Welch statistics completed (" + toc + " seconds).")
Expand the statistics into matrices and center F(s), as described in [3] on page 1-581: N(s) becomes a CF-by-CF diagonal matrix whose blocks are Nc(s)I, and F(s) becomes a CF-by-1 supervector formed by concatenating Fc(s) − Nc(s)μc.
N = Nc;
F = Fc;
muc = reshape(ubm.mu,numFeatures,1,[]);
for s = 1:numSpeakers
N{s} = repelem(reshape(Nc{s},1,[]),numFeatures);
F{s} = reshape(Fc{s} - Nc{s}.*muc,[],1);
end
Because this example assumes a diagonal covariance matrix for the UBM, the N(s) matrices are also
diagonal and are saved as vectors for efficient computation.
In the i-vector model, the ideal speaker supervector consists of a speaker-independent component
and a speaker-dependent component. The speaker-dependent component consists of the total
variability space model and the speaker's i-vector.
M = m + Tw
The dimensionality of the i-vector, w, is typically much lower than the CF-dimensional speaker
utterance supervector, making the i-vector a much more compact and tractable representation.
To train the total variability space, T, first randomly initialize T, then perform these steps iteratively
[3] on page 1-581:

$$l_T(s) = I + T' \Sigma^{-1} N(s)\, T$$

$$K = \sum_s F(s)\,\bigl(l_T^{-1}(s)\, T' \Sigma^{-1} F(s)\bigr)'$$

$$A_c = \sum_s N_c(s)\, l_T^{-1}(s)$$

$$T_c = A_c^{-1} K, \qquad T = \begin{bmatrix} T_1 \\ T_2 \\ \vdots \\ T_C \end{bmatrix}$$

[3] on page 1-581 proposes initializing Σ by the UBM variance, and then updating Σ according to the
equation:

$$\Sigma = \Bigl(\sum_s N(s)\Bigr)^{-1}\Bigl(\sum_s S(s) - \operatorname{diag}(K\, T')\Bigr)$$
where S(s) is the centered second-order Baum-Welch statistic. However, updating Σ is often dropped
in practice as it has little effect. This example does not update Σ.
Sigma = ubm.sigma(:);
Specify the dimension of the total variability space. A typical value used for the TIMIT data set is
1000.
numTdim = ;
if speedUpExample
numTdim = 16;
end
T = randn(numel(ubm.sigma),numTdim);
T = T/norm(T);
I = eye(numTdim);
Ey = cell(numSpeakers,1);
Eyy = cell(numSpeakers,1);
Linv = cell(numSpeakers,1);
Set the number of iterations for training. A typical value reported is 20.
numIterations = ;
for iterIdx = 1:numIterations
tic
% Posterior distribution of the hidden variable
TtimesInverseSSdiag = (T./Sigma)';
parfor s = 1:numSpeakers
L = (I + TtimesInverseSSdiag.*N{s}*T);
Linv{s} = pinv(L);
Ey{s} = Linv{s}*TtimesInverseSSdiag*F{s};
Eyy{s} = Linv{s} + Ey{s}*Ey{s}';
end
% Accumulate statistics across speakers
Eymat = cat(2,Ey{:});
FFmat = cat(2,F{:});
Kt = FFmat*Eymat';
K = mat2cell(Kt',numTdim,repelem(numFeatures,numComponents));
newT = cell(numComponents,1);
for c = 1:numComponents
AcLocal = zeros(numTdim);
for s = 1:numSpeakers
AcLocal = AcLocal + Nc{s}(:,:,c)*Eyy{s};
end
% Update the total variability space
newT{c} = (pinv(AcLocal)*K{c})';
end
T = cat(1,newT{:});
disp("Training Total Variability Space: " + iterIdx + "/" + numIterations + " complete (" + round(toc,2) + " seconds).")
end
i-vector Extraction
Once the total variability space is calculated, you can calculate the i-vectors as [4] on page 1-581:

$$w = \bigl(I + T' \Sigma^{-1} N(s)\, T\bigr)^{-1} T' \Sigma^{-1} F(s)$$
At this point, you are still considering each training file as a separate speaker. However, in the next
step, when you train a projection matrix to reduce dimensionality and increase inter-speaker
differences, the i-vectors must be labeled with the appropriate, distinct speaker IDs.
Create a cell array where each element of the cell array contains a matrix of i-vectors across files for
a particular speaker.
speakers = unique(adsTrain.Labels);
numSpeakers = numel(speakers);
ivectorPerSpeaker = cell(numSpeakers,1);
TS = T./Sigma;
TSi = TS';
ubmMu = ubm.mu;
tic
parfor speakerIdx = 1:numSpeakers
% Subset the datastore to the speaker you are adapting.
adsPart = subset(adsTrain,adsTrain.Labels==speakers(speakerIdx));
numFiles = numel(adsPart.Files);
ivectorPerFile = zeros(numTdim,numFiles);
for fileIdx = 1:numFiles
audioData = read(adsPart);
% Extract features
Y = helperFeatureExtraction(audioData,afe,normFactors);
logLikelihood = helperGMMLogLikelihood(Y,ubm);
amax = max(logLikelihood,[],1);
logLikelihoodSum = amax + log(sum(exp(logLikelihood-amax),1));
gamma = exp(logLikelihood-logLikelihoodSum)';
n = sum(gamma,1);
f = Y*gamma - n.*(ubmMu);
% Extract i-vector
ivectorPerFile(:,fileIdx) = pinv(I + (TS.*repelem(n(:),numFeatures))' * T) * TSi * f(:);
end
ivectorPerSpeaker{speakerIdx} = ivectorPerFile;
end
disp("I-vectors extracted from training set (" + toc + " seconds).")
Projection Matrix
Many different backends have been proposed for i-vectors. The most straightforward and still well-
performing one is the combination of linear discriminant analysis (LDA) and within-class covariance
normalization (WCCN).
Create a matrix of the training vectors and a map indicating which i-vector corresponds to which
speaker. Initialize the projection matrix as an identity matrix.
w = ivectorPerSpeaker;
utterancePerSpeaker = cellfun(@(x)size(x,2),w);
ivectorsTrain = cat(2,w{:});
projectionMatrix = eye(size(w{1},1));
LDA attempts to minimize the intra-class variance and maximize the variance between speakers. It
can be calculated as outlined in [4] on page 1-581:
Given:

$$S_b = \sum_{s=1}^{S} (\bar{w}_s - \bar{w})(\bar{w}_s - \bar{w})'$$

$$S_w = \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} (w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)'$$

where

• $\bar{w}_s = \frac{1}{n_s}\sum_{i=1}^{n_s} w_i^s$ is the mean of i-vectors for each speaker.
• $\bar{w} = \frac{1}{N}\sum_{s=1}^{S}\sum_{i=1}^{n_s} w_i^s$ is the mean i-vector across all speakers.
• $n_s$ is the number of utterances for each speaker.

The projection matrix is composed of the eigenvectors with the largest eigenvalues of the generalized eigenvalue problem:

$$S_b v = \lambda S_w v$$
performLDA = ;
if performLDA
tic
numEigenvectors = ;
Sw = zeros(size(projectionMatrix,1));
Sb = zeros(size(projectionMatrix,1));
wbar = mean(cat(2,w{:}),2);
for ii = 1:numel(w)
ws = w{ii};
wsbar = mean(ws,2);
Sb = Sb + (wsbar - wbar)*(wsbar - wbar)';
Sw = Sw + cov(ws',1);
end
[A,~] = eigs(Sb,Sw,numEigenvectors);
A = (A./vecnorm(A))';
ivectorsTrain = A * ivectorsTrain;
w = mat2cell(ivectorsTrain,size(ivectorsTrain,1),utterancePerSpeaker);
projectionMatrix = A * projectionMatrix;
WCCN attempts to scale the i-vector space inversely to the in-class covariance, so that directions of
high intra-speaker variability are de-emphasized in i-vector comparisons [9] on page 1-582.
Given the within-class covariance matrix

$$W = \frac{1}{S}\sum_{s=1}^{S}\frac{1}{n_s}\sum_{i=1}^{n_s}(w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)'$$

where $\bar{w}_s = \frac{1}{n_s}\sum_{i=1}^{n_s} w_i^s$ is the mean of i-vectors for each speaker, the projection matrix B is derived from the Cholesky decomposition of the inverse of W:

$$W^{-1} = BB'$$
performWCCN = ;
if performWCCN
tic
alpha = ;
W = zeros(size(projectionMatrix,1));
for ii = 1:numel(w)
W = W + cov(w{ii}',1);
end
W = W/numel(w);
W = (1 - alpha)*W + alpha*eye(size(W,1));
B = chol(pinv(W),"lower");
projectionMatrix = B * projectionMatrix;
The training stage is now complete. You can now use the universal background model (UBM), total
variability space (T), and projection matrix to enroll and verify speakers.
ivectors = cellfun(@(x)projectionMatrix*x,ivectorPerSpeaker,UniformOutput=false);
The algorithm implemented in this example is a Gaussian PLDA as outlined in [13] on page 1-582. In
the Gaussian PLDA, the i-vector is represented with the following equation:

$$\phi_{ij} = \mu + V y_i + \varepsilon_{ij}$$

$$y_i \sim N(0, I)$$

$$\varepsilon_{ij} \sim N(0, \Lambda^{-1})$$
where μ is a global mean of the i-vectors, Λ is a full precision matrix of the noise term εij, and V is the
factor loading matrix, also known as the eigenvoices.
Specify the number of eigenvoices to use. Typical values are between 10 and 400.
numEigenVoices = ;
Determine the number of disjoint persons, the number of dimensions in the feature vectors, and the
number of utterances per speaker.
K = numel(ivectors);
D = size(ivectors{1},1);
utterancePerSpeaker = cellfun(@(x)size(x,2),ivectors);
$$N = \sum_{i=1}^{K} n_i$$

$$\mu = \frac{1}{N}\sum_{i,j}\phi_{ij}$$

$$\varphi_{ij} = \phi_{ij} - \mu$$
ivectorsMatrix = cat(2,ivectors{:});
N = size(ivectorsMatrix,2);
mu = mean(ivectorsMatrix,2);
Determine a whitening matrix from the training i-vectors and then whiten the i-vectors. Specify either
ZCA whitening, PCA whitening, or no whitening.
whiteningType = ;
if strcmpi(whiteningType,"ZCA")
S = cov(ivectorsMatrix');
[~,sD,sV] = svd(S);
W = diag(1./(sqrt(diag(sD)) + eps))*sV';
ivectorsMatrix = W * ivectorsMatrix;
elseif strcmpi(whiteningType,"PCA")
S = cov(ivectorsMatrix');
[sV,sD] = eig(S);
W = diag(1./(sqrt(diag(sD)) + eps))*sV';
ivectorsMatrix = W * ivectorsMatrix;
else
W = eye(size(ivectorsMatrix,1));
end
Apply length normalization and then convert the training i-vector matrix back to a cell array.
ivectorsMatrix = ivectorsMatrix./vecnorm(ivectorsMatrix);
ivectors = mat2cell(ivectorsMatrix,D,utterancePerSpeaker);
$$S = \sum_{ij} \varphi_{ij}\varphi_{ij}^T$$
S = ivectorsMatrix*ivectorsMatrix';
Sort persons according to the number of samples and then group the i-vectors by number of
utterances per speaker. Precalculate the first-order moment of the i-th person as
$$f_i = \sum_{j=1}^{n_i} \varphi_{ij}$$
uniqueLengths = unique(utterancePerSpeaker);
numUniqueLengths = numel(uniqueLengths);
speakerIdx = 1;
f = zeros(D,K);
for uniqueLengthIdx = 1:numUniqueLengths
idx = find(utterancePerSpeaker==uniqueLengths(uniqueLengthIdx));
temp = {};
for speakerIdxWithinUniqueLength = 1:numel(idx)
rho = ivectors(idx(speakerIdxWithinUniqueLength));
temp = [temp;rho]; %#ok<AGROW>
f(:,speakerIdx) = sum(rho{:},2);
speakerIdx = speakerIdx+1;
end
ivectorsSorted{uniqueLengthIdx} = temp; %#ok<SAGROW>
end
Initialize the eigenvoices matrix, V, and the inverse noise variance term, Λ.
V = randn(D,numEigenVoices);
Lambda = pinv(S/N);
Specify the number of iterations for the EM algorithm and whether or not to apply the minimum
divergence.
numIter = ;
minimumDivergence = ;
Train the G-PLDA model using the EM algorithm described in [13] on page 1-582.
for iter = 1:numIter
% EXPECTATION
gamma = zeros(numEigenVoices,numEigenVoices);
EyTotal = zeros(numEigenVoices,K);
R = zeros(numEigenVoices,numEigenVoices);
idx = 1;
for lengthIndex = 1:numUniqueLengths
ivectorLength = uniqueLengths(lengthIndex);
% Isolate i-vectors of the same length
iv = ivectorsSorted{lengthIndex};
% Calculate M
M = pinv(ivectorLength*(V'*(Lambda*V)) + eye(numEigenVoices)); % Equation (A.7) in [13]
for speakerIndex = 1:numel(iv)
% First moment of the latent variable for V
Ey = M*V'*Lambda*f(:,idx); % Equation (A.8) in [13]
% Calculate second moment
Eyy = Ey*Ey';
% Update Ryy
R = R + ivectorLength*(M + Eyy); % Equation (A.13) in [13]
% Append EyTotal
EyTotal(:,idx) = Ey;
idx = idx + 1;
% If using minimum divergence, update gamma
if minimumDivergence
gamma = gamma + (M + Eyy); % Equation (A.18) in [13]
end
end
end
% Calculate T
TT = EyTotal*f'; % Equation (A.12) in [13]
% MAXIMIZATION
V = TT'*pinv(R); % Equation (A.16) in [13]
Lambda = pinv((S - V*TT)/N); % Equation (A.17) in [13]
% MINIMUM DIVERGENCE
if minimumDivergence
gamma = gamma/K; % Equation (A.18) in [13]
V = V*chol(gamma,'lower'); % Equation (A.22) in [13]
end
end
Once you've trained the G-PLDA model, you can use it to calculate a score based on the log-likelihood
ratio as described in [14] on page 1-582. Given two i-vectors that have been centered, whitened, and
length-normalized, the score is calculated as:
$$\text{score}(w_1,w_t) = \begin{bmatrix} w_1 \\ w_t \end{bmatrix}^T \begin{bmatrix} \Sigma + VV^T & VV^T \\ VV^T & \Sigma + VV^T \end{bmatrix}^{-1} \begin{bmatrix} w_1 \\ w_t \end{bmatrix} - w_1^T\bigl(\Sigma + VV^T\bigr)^{-1}w_1 - w_t^T\bigl(\Sigma + VV^T\bigr)^{-1}w_t + C$$

where w1 and wt are the enrollment and test i-vectors, Σ is the variance matrix of the noise term, and V is
the eigenvoice matrix. The C term collects factored-out constants and can be dropped in practice.
speakerIdx = ;
utteranceIdx = ;
w1 = ivectors{speakerIdx}(:,utteranceIdx);
speakerIdx = ;
utteranceIdx = ;
wt = ivectors{speakerIdx}(:,utteranceIdx);
VVt = V*V';
SigmaPlusVVt = pinv(Lambda) + VVt;
term1 = pinv([SigmaPlusVVt VVt;VVt SigmaPlusVVt]);
term2 = pinv(SigmaPlusVVt);
w1wt = [w1;wt];
score = w1wt'*term1*w1wt - w1'*term2*w1 - wt'*term2*wt
score = 56.2336
In practice, the test i-vectors, and depending on your system, the enrollment i-vectors, are not used in
the training of the G-PLDA model. In the following evaluation section, you use previously unseen data
for enrollment and verification. The supporting function gpldaScore on page 1-580 encapsulates the
scoring steps above and additionally performs centering, whitening, and normalization. Save the
trained G-PLDA model as a struct for use with the supporting function gpldaScore.
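The struct itself is not shown in this excerpt. A minimal sketch follows; the field names are assumptions chosen to match the gpldaScore supporting function at the end of this example.
gpldaModel = struct(mu=mu,WhiteningMatrix=W,EigenVoices=V,Sigma=pinv(Lambda));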
Enroll
Enroll new speakers that were not in the training data set.
Create i-vectors for each file for each speaker in the enroll set using this sequence of steps:
1 Feature Extraction
2 Baum-Welch Statistics: Determine the zeroth and first order statistics
3 i-vector Extraction
4 Intersession compensation
Then average the i-vectors across files to create an i-vector model for the speaker. Repeat this
process for each speaker.
speakers = unique(adsEnroll.Labels);
numSpeakers = numel(speakers);
enrolledSpeakersByIdx = cell(numSpeakers,1);
tic
parfor speakerIdx = 1:numSpeakers
% Subset the datastore to the speaker you are adapting.
adsPart = subset(adsEnroll,adsEnroll.Labels==speakers(speakerIdx));
numFiles = numel(adsPart.Files);
ivectorMat = zeros(size(projectionMatrix,1),numFiles);
for fileIdx = 1:numFiles
audioData = read(adsPart);
% Extract features
Y = helperFeatureExtraction(audioData,afe,normFactors);
% Compute a posteriori log-likelihood and Baum-Welch statistics
logLikelihood = helperGMMLogLikelihood(Y,ubm);
amax = max(logLikelihood,[],1);
logLikelihoodSum = amax + log(sum(exp(logLikelihood-amax),1));
gamma = exp(logLikelihood - logLikelihoodSum)';
n = sum(gamma,1);
f = Y*gamma - n.*(ubmMu);
% i-vector Extraction
w = pinv(I + (TS.*repelem(n(:),numFeatures))' * T) * TSi * f(:);
% Intersession Compensation
w = projectionMatrix*w;
ivectorMat(:,fileIdx) = w;
end
% i-vector model
enrolledSpeakersByIdx{speakerIdx} = mean(ivectorMat,2);
end
disp("Speakers enrolled (" + round(toc,2) + " seconds).")
For bookkeeping purposes, convert the cell array of i-vectors to a structure, with the speaker IDs as
fields and the i-vectors as values
enrolledSpeakers = struct;
for s = 1:numSpeakers
enrolledSpeakers.(string(speakers(s))) = enrolledSpeakersByIdx{s};
end
Verification
scoringMethod = ;
The speaker false rejection rate (FRR) is the rate that a given speaker is incorrectly rejected. Create
an array of scores for enrolled speaker i-vectors and i-vectors of the same speaker.
speakersToTest = unique(adsDET.Labels);
numSpeakers = numel(speakersToTest);
scoreFRR = cell(numSpeakers,1);
tic
parfor speakerIdx = 1:numSpeakers
adsPart = subset(adsDET,adsDET.Labels==speakersToTest(speakerIdx));
numFiles = numel(adsPart.Files);
ivectorToTest = enrolledSpeakers.(string(speakersToTest(speakerIdx)));
score = zeros(numFiles,1);
for fileIdx = 1:numFiles
audioData = read(adsPart);
% Extract features
Y = helperFeatureExtraction(audioData,afe,normFactors);
logLikelihood = helperGMMLogLikelihood(Y,ubm);
amax = max(logLikelihood,[],1);
logLikelihoodSum = amax + log(sum(exp(logLikelihood-amax),1));
gamma = exp(logLikelihood - logLikelihoodSum)';
n = sum(gamma,1);
f = Y*gamma - n.*(ubmMu);
% Extract i-vector
w = pinv(I + (TS.*repelem(n(:),numFeatures))' * T) * TSi * f(:);
% Intersession Compensation
w = projectionMatrix*w;
% Score
if strcmpi(scoringMethod,"CSS")
score(fileIdx) = dot(ivectorToTest,w)/(norm(w)*norm(ivectorToTest));
else
score(fileIdx) = gpldaScore(gpldaModel,w,ivectorToTest);
end
end
scoreFRR{speakerIdx} = score;
end
disp("FRR calculated (" + round(toc,2) + " seconds).")
The speaker false acceptance rate (FAR) is the rate that utterances not belonging to an enrolled
speaker are incorrectly accepted as belonging to the enrolled speaker. Create an array of scores for
enrolled speakers and i-vectors of different speakers.
speakersToTest = unique(adsDET.Labels);
numSpeakers = numel(speakersToTest);
scoreFAR = cell(numSpeakers,1);
tic
parfor speakerIdx = 1:numSpeakers
adsPart = subset(adsDET,adsDET.Labels~=speakersToTest(speakerIdx));
numFiles = numel(adsPart.Files);
ivectorToTest = enrolledSpeakers.(string(speakersToTest(speakerIdx)));
score = zeros(numFiles,1);
for fileIdx = 1:numFiles
audioData = read(adsPart);
% Extract features
Y = helperFeatureExtraction(audioData,afe,normFactors);
logLikelihood = helperGMMLogLikelihood(Y,ubm);
amax = max(logLikelihood,[],1);
logLikelihoodSum = amax + log(sum(exp(logLikelihood-amax),1));
gamma = exp(logLikelihood - logLikelihoodSum)';
n = sum(gamma,1);
f = Y*gamma - n.*(ubmMu);
% Extract i-vector
w = pinv(I + (TS.*repelem(n(:),numFeatures))' * T) * TSi * f(:);
% Intersession compensation
w = projectionMatrix * w;
% Score
if strcmpi(scoringMethod,"CSS")
score(fileIdx) = dot(ivectorToTest,w)/(norm(w)*norm(ivectorToTest));
else
score(fileIdx) = gpldaScore(gpldaModel,w,ivectorToTest);
end
end
scoreFAR{speakerIdx} = score;
end
disp("FAR calculated (" + round(toc,2) + " seconds).")
To compare multiple systems, you need a single metric that combines the FAR and FRR performance.
For this, you determine the equal error rate (EER), which is the error rate at the threshold where the
FAR and FRR curves meet. In practice, the EER operating point might not be the best choice. For example, if speaker
verification is used as part of a multi-authentication approach for wire transfers, FAR would most
likely be more heavily weighted than FRR.
amin = min(cat(1,scoreFRR{:},scoreFAR{:}));
amax = max(cat(1,scoreFRR{:},scoreFAR{:}));
thresholdsToTest = linspace(amin,amax,1000);
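The FAR, FRR, and EER computations are not shown in this excerpt. A minimal sketch that produces the quantities plotted below is:
% Fraction of target scores rejected (FRR) and non-target scores accepted (FAR) at each threshold
FRR = mean(cat(1,scoreFRR{:}) < thresholdsToTest,1);
FAR = mean(cat(1,scoreFAR{:}) >= thresholdsToTest,1);
% Equal error rate: the threshold where FAR and FRR are closest
[~,EERThresholdIdx] = min(abs(FAR - FRR));
EERThreshold = thresholdsToTest(EERThresholdIdx);
EER = mean([FAR(EERThresholdIdx),FRR(EERThresholdIdx)]);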
figure
plot(thresholdsToTest,FAR,"k", ...
thresholdsToTest,FRR,"b", ...
EERThreshold,EER,"ro",MarkerFaceColor="r")
title(["Equal Error Rate = " + round(EER,4),"Threshold = " + round(EERThreshold,4)])
xlabel("Threshold")
ylabel("Error Rate")
legend("False Acceptance Rate (FAR)","False Rejection Rate (FRR)","Equal Error Rate (EER)",Locati
grid on
axis tight
Supporting Functions
function [features,numFrames] = helperFeatureExtraction(audioData,afe,normFactors)
% Normalize
audioData = audioData/max(abs(audioData(:)));
% Feature normalization
if ~isempty(normFactors)
features = (features-normFactors.Mean')./normFactors.STD';
end
features = features';
numFrames = size(features,2);
end
function L = helperGMMLogLikelihood(x,gmm)
xMinusMu = repmat(x,1,1,numel(gmm.ComponentProportion)) - permute(gmm.mu,[1,3,2]);
permuteSigma = permute(gmm.sigma,[1,3,2]);
% Gaussian log-likelihood for each component (diagonal covariance)
Lunweighted = -0.5*(sum(log(permuteSigma),1) + sum(xMinusMu.*(xMinusMu./permuteSigma),1) + size(gmm.mu,1)*log(2*pi));
temp = squeeze(permute(Lunweighted,[1,3,2]));
if size(temp,1)==1
% If there is only one frame, the trailing singleton dimension was
% removed in the permute. This accounts for that edge case.
temp = temp';
end
L = temp + log(gmm.ComponentProportion)';
end
G-PLDA Score
function score = gpldaScore(gpldaModel,w1,wt)
% Center, whiten, and length-normalize the i-vectors, then score them (gpldaModel field names are assumptions)
w1 = w1 - gpldaModel.mu;
wt = wt - gpldaModel.mu;
w1 = gpldaModel.WhiteningMatrix*w1;
wt = gpldaModel.WhiteningMatrix*wt;
w1 = w1./vecnorm(w1);
wt = wt./vecnorm(wt);
VVt = gpldaModel.EigenVoices*gpldaModel.EigenVoices';
SigmaPlusVVt = gpldaModel.Sigma + VVt;
term1 = pinv([SigmaPlusVVt VVt;VVt SigmaPlusVVt]);
term2 = pinv(SigmaPlusVVt);
w1wt = [w1;wt];
score = w1wt'*term1*w1wt - w1'*term2*w1 - wt'*term2*wt;
end
References
[1] Reynolds, Douglas A., et al. "Speaker Verification Using Adapted Gaussian Mixture Models."
Digital Signal Processing, vol. 10, no. 1–3, Jan. 2000, pp. 19–41. DOI.org (Crossref), doi:10.1006/
dspr.1999.0361.
[2] Kenny, Patrick, et al. "Joint Factor Analysis Versus Eigenchannels in Speaker Recognition." IEEE
Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, May 2007, pp. 1435–47.
DOI.org (Crossref), doi:10.1109/TASL.2006.881693.
[3] Kenny, P., et al. "A Study of Interspeaker Variability in Speaker Verification." IEEE Transactions on
Audio, Speech, and Language Processing, vol. 16, no. 5, July 2008, pp. 980–88. DOI.org (Crossref),
doi:10.1109/TASL.2008.925147.
[4] Dehak, Najim, et al. "Front-End Factor Analysis for Speaker Verification." IEEE Transactions on
Audio, Speech, and Language Processing, vol. 19, no. 4, May 2011, pp. 788–98. DOI.org (Crossref),
doi:10.1109/TASL.2010.2064307.
[5] Matejka, Pavel, Ondrej Glembek, Fabio Castaldo, M.j. Alam, Oldrich Plchot, Patrick Kenny, Lukas
Burget, and Jan Cernocky. "Full-Covariance UBM and Heavy-Tailed PLDA in i-Vector Speaker
Verification." 2011 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2011. https://doi.org/10.1109/icassp.2011.5947436.
[6] Snyder, David, et al. "X-Vectors: Robust DNN Embeddings for Speaker Recognition." 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp.
5329–33. DOI.org (Crossref), doi:10.1109/ICASSP.2018.8461375.
[7] Signal Processing and Speech Communication Laboratory. Accessed December 12, 2019. https://
www.spsc.tugraz.at/databases-and-tools/ptdb-tug-pitch-tracking-database-from-graz-university-of-
technology.html.
[8] Variani, Ehsan, et al. "Deep Neural Networks for Small Footprint Text-Dependent Speaker
Verification." 2014 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, 2014, pp. 4052–56. DOI.org (Crossref), doi:10.1109/ICASSP.2014.6854363.
[9] Dehak, Najim, Réda Dehak, James R. Glass, Douglas A. Reynolds and Patrick Kenny. “Cosine
Similarity Scoring without Score Normalization Techniques.” Odyssey (2010).
[10] Verma, Pulkit, and Pradip K. Das. “I-Vectors in Speech Processing Applications: A Survey.”
International Journal of Speech Technology, vol. 18, no. 4, Dec. 2015, pp. 529–46. DOI.org (Crossref),
doi:10.1007/s10772-015-9295-3.
[12] Kenny, Patrick. "Bayesian Speaker Verification with Heavy-Tailed Priors". Odyssey 2010 - The
Speaker and Language Recognition Workshop, Brno, Czech Republic, 2010.
[13] Sizov, Aleksandr, Kong Aik Lee, and Tomi Kinnunen. “Unifying Probabilistic Linear Discriminant
Analysis Variants in Biometric Authentication.” Lecture Notes in Computer Science Structural,
Syntactic, and Statistical Pattern Recognition, 2014, 464–75. https://doi.org/
10.1007/978-3-662-44415-3_47.
[14] Rajan, Padmanabhan, Anton Afanasyev, Ville Hautamäki, and Tomi Kinnunen. 2014. “From Single
to Multiple Enrollment I-Vectors: Practical PLDA Scoring Variants for Speaker Verification.” Digital
Signal Processing 31 (August): 93–101. https://doi.org/10.1016/j.dsp.2014.05.001.
i-vector Score Normalization
An i-vector system outputs a raw score specific to the data and parameters used to develop the
system. This makes interpreting the score and finding a consistent decision threshold for verification
tasks difficult.
To address these difficulties, researchers developed score normalization and score calibration
techniques.
• In score normalization, raw scores are normalized in relation to an 'imposter cohort'. Score
normalization occurs before evaluating the detection error tradeoff and can improve the accuracy
of a system and its ability to adapt to new data.
• In score calibration, raw scores are mapped to probabilities, which are used to better understand
the system's confidence in decisions.
In this example, you apply score normalization to an i-vector system. To learn about score calibration,
see “i-vector Score Calibration” on page 1-602.
For example purposes, you use cosine similarity scoring (CSS) throughout this example. Probabilistic
linear discriminant analysis (PLDA) scoring is also improved by normalization, although less
dramatically.
ivs = speakerRecognition();
The pretrained i-vector system achieves an equal error rate (EER) around 6.73% using CSS on the
LibriSpeech test set. The EER achieved using PLDA is considerably better. However, because CSS is
simpler, for the purposes of this example you investigate CSS only.
detectionErrorTradeoff(ivs)
The error rate on the LibriSpeech test set, and the accompanying default decision threshold for
speaker verification, do not extend well to unseen data. To confirm this, download the PTDB-TUG [3]
on page 1-600 data set. The supporting function, loadDataset on page 1-594, downloads the data
set and then resamples it from 48 kHz to 16 kHz, which is the sample rate that the i-vector system
was trained on. The loadDataset function returns four audioDatastore objects:
targetSampleRate = ivs.SampleRate;
[adsEnroll,adsTest,adsDET,adsImposter] = loadDataset(targetSampleRate);
Enroll speakers from the enrollment dataset. When you enroll speakers, an i-vector template is
created for each unique speaker label.
enroll(ivs,adsEnroll,adsEnroll.Labels)
enrolledLabels = categorical(ivs.EnrolledLabels.Properties.RowNames);
Spot-check the false rejection rate (FRR) and the false acceptance rate (FAR) using the verify
object function. The verify object function scores the i-vector derived from the audio input against
the i-vector template corresponding to the specified label. The function then compares the score to a
decision threshold and either accepts or rejects the proposition that the audio input belongs to the
specified speaker label. The default decision threshold corresponds to the equal error rate (EER)
determined the last time the detection error tradeoff was evaluated.
FA = 0;
FR = 0;
reset(adsTest)
numToSpotCheck = 50;
for ii = 1:numToSpotCheck
[audioIn,fileInfo] = read(adsTest);
targetLabel = fileInfo.Label;
FR = FR + ~verify(ivs,audioIn,targetLabel,"css");
nontargetIdx = find(~ismember(enrolledLabels,targetLabel));
nontargetLabel = enrolledLabels(nontargetIdx(randperm(numel(nontargetIdx),1)));
FA = FA + verify(ivs,audioIn,nontargetLabel,"css");
end
FRR = FR./numToSpotCheck;
FAR = FA./numToSpotCheck;
disp(["False Rejection Rate = " + round(100*FRR,2) + " (%)"; ...
"False Acceptance Rate = " + round(100*FAR,2) + " (%)"])
The performance on this new dataset does not match performance reported when training and
evaluating the i-vector system on the LibriSpeech data set. Also, the default decision threshold on the
LibriSpeech data set does not correspond to the equal error rate on the PTDB-TUG data set.
To better evaluate the system's performance, and to choose a new decision threshold, call
detectionErrorTradeoff again. This time call detectionErrorTradeoff with new evaluation
data that is more suited to the target application. The evaluation data should be as close as possible
to the data that your deployed system encounters in terms of vocabulary, prosody, signal duration,
noise level, noise type, accents, channel characteristics, and so on.
detectionErrorTradeoff(ivs,adsDET,adsDET.Labels,Scorer="css")
Spot-check the FAR and FRR of the updated system. The FAR and FRR are now reasonably close to
the EER reported in the detection error tradeoff analysis. Note that calling
detectionErrorTradeoff does not modify the i-vector extraction or scoring, only the default
decision threshold for speaker verification. In the following sections, you enhance an i-vector system
to perform score normalization. Score normalization helps an i-vector system extend to new datasets
without the need to reevaluate the detection error tradeoff. Score normalization also helps bridge the
performance gap between training a system and deploying it.
FA = 0;
FR = 0;
reset(adsTest)
for ii = 1:numToSpotCheck
[audioIn,fileInfo] = read(adsTest);
trueLabel = fileInfo.Label;
FR = FR + ~verify(ivs,audioIn,trueLabel,"css");
imposterIdx = find(~ismember(enrolledLabels,trueLabel));
imposter = enrolledLabels(imposterIdx(randperm(numel(imposterIdx),1)));
FA = FA + verify(ivs,audioIn,imposter,"css");
end
FRR = FR./numToSpotCheck;
FAR = FA./numToSpotCheck;
disp(["False Rejection Rate = " + round(100*FRR,2) + " (%)"; ...
"False Acceptance Rate = " + round(100*FAR,2) + " (%)"])
Score Normalization
Score normalization is a common approach to make target and non-target score distributions across
speakers more similar. This enables a system to set a decision threshold that is closer to optimal for
more speakers. In this example, you explore adaptive symmetric normalization variant 1 (S-norm1)
[1] on page 1-600.
To motivate score normalization, first inspect the target and non-target score distributions for two
enrolled labels against the same test cohort.
enrolledIvecs = cat(2,ivs.EnrolledLabels.ivector{1},ivs.EnrolledLabels.ivector{9});
label_e = categorical([ivs.EnrolledLabels.Properties.RowNames(1),ivs.EnrolledLabels.Properties.RowNames(9)]);
Extract i-vectors from the test set. The test set labels overlap with the enrolled labels.
testIvecs = ivector(ivs,adsTest);
label_t = adsTest.Labels;
Create indexing vectors to keep track of which test i-vectors correspond to which enrolled label. In
the targets matrix, the columns correspond to the enrolled speakers, and the rows correspond to
the test files. If the test label corresponds to the enrolled label, the value in the matrix is true,
otherwise, the value is false.
targets = [ismember(label_t,label_e(1)),ismember(label_t,label_e(2))];
Score the template i-vectors against the target and non-target i-vectors. The supporting function,
scoreTargets on page 1-596, scores all combinations of the enrolled i-vectors against the test i-
vectors and returns the scores separated into target scores (when the test and enroll labels are the
same) and non-target scores (when the test and enroll labels are different).
[targetScores,nontargetScores] = scoreTargets(enrolledIvecs,testIvecs,targets);
Use the supporting function, plotScoreDistributions on page 1-597, to display the target and
non-target scores for each of the enrolled speakers. Note that the equal error rate (where the target
and non-target distributions cross) is different for the two speakers. That is, assuming the equal error
rate is the goal of the system, a single decision threshold cannot capture the equal error rate for both
speakers.
plotScoreDistributions(targetScores,nontargetScores,Analyze="label")
Analyze the score distributions using an EER plot. EER plots reveal the relationship between a
decision threshold and the probability of a false alarm or false rejection and are often analyzed to
determine a decision threshold.
plotEER(targetScores,nontargetScores,Analyze="label")
In this example, you use adaptive symmetric normalization variant 1 (S-norm1) [1] on page 1-600. S-
norm1 computes an average of normalized scores from Z-norm (zero score normalization) and T-norm
(test score normalization).
$$s(e,t)_{\text{s-norm1}} = \frac{1}{2}\left(\frac{s(e,t) - \mu\bigl(S_e(\xi_e^{\text{top}})\bigr)}{\sigma\bigl(S_e(\xi_e^{\text{top}})\bigr)} + \frac{s(e,t) - \mu\bigl(S_t(\xi_t^{\text{top}})\bigr)}{\sigma\bigl(S_t(\xi_t^{\text{top}})\bigr)}\right)$$

where s(e,t) is the raw score between enrollment i-vector e and test i-vector t, $S_e(\xi_e^{\text{top}})$ is the set of top scores of e against the imposter cohort, and $S_t(\xi_t^{\text{top}})$ is the corresponding set for the test i-vector t.

First, extract i-vectors for the speakers in the imposter cohort.
imposterIvecs = ivector(ivs,adsImposter);
Score the enrolled i-vectors against the imposter cohort, $S_e(\xi_e)$, and then isolate only the K best
scores, $S_e(\xi_e^{\text{top}})$. [1] on page 1-600 suggests using the top 200-500 scoring files to create a speaker-dependent
cohort. Finally, calculate the mean, $\mu(S_e(\xi_e^{\text{top}}))$, and standard deviation, $\sigma(S_e(\xi_e^{\text{top}}))$.
topK = ;
imposterScores = sort(cosineSimilarityScore(enrolledIvecs,imposterIvecs),"descend");
imposterScores = imposterScores(1:min(topK,size(imposterScores,1)),:);
mu_e = mean(imposterScores,1);
std_e = std(imposterScores,[],1);
Calculate $\mu(S_t(\xi_t^{\text{top}}))$ and $\sigma(S_t(\xi_t^{\text{top}}))$ for the test i-vectors in the same way.
imposterScores = sort(cosineSimilarityScore(testIvecs,imposterIvecs),"descend");
imposterScores = imposterScores(1:min(topK,size(imposterScores,1)),:);
mu_t = mean(imposterScores,1);
std_t = std(imposterScores,[],1);
Score the test and enrollment i-vectors again, this time specifying the required normalization factors
to perform adaptive s-norm1. The supporting function, scoreTargets on page 1-596, applies the
normalization on the raw scores.
normFactorsSe = struct(mu=mu_e,std=std_e);
normFactorsSt = struct(mu=mu_t,std=std_t);
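The scoring call is not shown in this excerpt; a minimal sketch that reuses the i-vectors and targets matrix from above is:
[targetScores,nontargetScores] = scoreTargets(enrolledIvecs,testIvecs,targets, ...
    NormFactorsSe=normFactorsSe,NormFactorsSt=normFactorsSt);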
Plot the score distributions of the scores after applying adaptive s-norm1.
plotScoreDistributions(targetScores,nontargetScores,Analyze="label")
Analyze the equal error rate plots after applying adaptive s-norm1. The thresholds corresponding to
the equal error rates for the two speakers are now closer together.
plotEER(targetScores,nontargetScores,Analyze="label")
Calculate the normalization statistics for each enrolled and test i-vector. The supporting function,
getNormFactors on page 1-600, performs the same operations as in Analyze Score Normalization on
Two Speakers on page 1-589.
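The i-vectors for this group analysis are not shown in this excerpt. A minimal sketch, assuming all enrolled templates are gathered from the ivectorSystem and the DET i-vectors are extracted with the ivector object function, is:
enrolledIvecs = cat(2,ivs.EnrolledLabels.ivector{:});
testIvecs = ivector(ivs,adsDET);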
topK = ;
normFactorsSe = getNormFactors(enrolledIvecs,imposterIvecs,TopK=topK);
normFactorsSt = getNormFactors(testIvecs,imposterIvecs,TopK=topK);
Create a targets matrix indicating which i-vector pairs have corresponding labels.
targets = true(numel(adsDET.Labels),height(ivs.EnrolledLabels));
for ii = 1:height(ivs.EnrolledLabels)
targets(:,ii) = ismember(adsDET.Labels,ivs.EnrolledLabels.Properties.RowNames(ii));
end
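Score all pairs with the adaptive s-norm1 statistics; the call below is a sketch mirroring the earlier two-speaker analysis.
[targetScores,nontargetScores] = scoreTargets(enrolledIvecs,testIvecs,targets, ...
    NormFactorsSe=normFactorsSe,NormFactorsSt=normFactorsSt);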
Plot the target and non-target score distributions for the group.
plotScoreDistributions(targetScores,nontargetScores,Analyze="group")
Plot the equal error rate of this new system. The equal error rate after applying adaptive s-norm1 is
approximately 5 %. The equal error rate prior to adaptive s-norm1 is approximately 8 %.
plotEER(targetScores,nontargetScores,Analyze="group")
Supporting Functions
Load Dataset
function [adsEnroll,adsTest,adsDET,adsImposter] = loadDataset(targetSampleRate)
arguments
targetSampleRate (1,1) {mustBeNumeric,mustBePositive}
end
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","ptdb-tug.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"ptdb-tug");
if ~isfolder(fullfile(pwd,"MIC"))
ads = audioDatastore([fullfile(dataset,"SPEECH DATA","FEMALE","MIC"),fullfile(dataset,"SPEECH DATA","MALE","MIC")], ...
IncludeSubfolders=true, ...
FileExtensions=".wav", ...
LabelSource="foldernames");
reduceDataset = false;
if reduceDataset
ads = splitEachLabel(ads,55);
end
adsTransform = transform(ads,@(x,y)fileResampler(x,y,targetSampleRate),IncludeInfo=true);
writeall(adsTransform,pwd,OutputFormat="flac",UseParallel=~isempty(ver("parallel")))
end
% Create a datastore that points to the resampled dataset. Use the folder
% names as the labels.
ads = audioDatastore(fullfile(pwd,"MIC"),IncludeSubfolders=true,LabelSource="foldernames");
% Split the data set into enrollment, test, DET, and imposter sets.
imposterLabels = categorical(["M05","M10","F05","F10"]);
adsImposter = subset(ads,ismember(ads.Labels,imposterLabels));
adsDev = subset(ads,~ismember(ads.Labels,imposterLabels));
rng default
numToEnroll = 2;
[adsEnroll,adsDev] = splitEachLabel(adsDev,numToEnroll);
numToTest = 50;
[adsTest,adsDET] = splitEachLabel(adsDev,numToTest);
end
File Resampler
function [audioOut,adsInfo] = fileResampler(audioIn,adsInfo,targetSampleRate)
arguments
audioIn (:,1) {mustBeA(audioIn,["single","double"])}
adsInfo (1,1) {mustBeA(adsInfo,"struct")}
targetSampleRate (1,1) {mustBeNumeric,mustBePositive}
end
originalSampleRate = adsInfo.SampleRate;
audioOut = audioIn;
% Resample if necessary
if originalSampleRate ~= targetSampleRate
audioOut = resample(audioIn,targetSampleRate,originalSampleRate);
amax = max(abs(audioOut));
if max(amax>1)
audioOut = audioOut./amax;
end
end
end
arguments
e (:,:) {mustBeA(e,["single","double"])}
t (:,:) {mustBeA(t,["single","double"])}
targetMap (:,:) {mustBeA(targetMap,"logical")}
nvargs.NormFactorsSe = [];
nvargs.NormFactorsSt = [];
end
function scores = cosineSimilarityScore(a,b)
arguments
a (:,:) {mustBeA(a,["single","double"])}
b (:,:) {mustBeA(b,["single","double"])}
end
scores = squeeze(sum(a.*reshape(b,size(b,1),1,[]),1)./(vecnorm(a).*reshape(vecnorm(b),1,1,[])));
scores = scores';
end
function plotScoreDistributions(targetScores,nontargetScores,nvargs)
%PLOTSCOREDISTRIBUTIONS Plot target and non-target score distributions
% plotScoreDistribution(targetScores,nontargetScores) plots empirical
% estimations of the distribution for target scores and nontarget scores.
% Specify targetScores and nontargetScores as cell arrays where each
% element contains a vector of speaker-specific scores.
%
% plotScoreDistributions(targetScores,nontargetScores,Analyze=ANALYZE)
% specifies the scope for analysis as either 'label' or 'group'. If ANALYZE
% is set to 'label', then a score distribution plot is created for each
% label. If ANALYZE is set to 'group', then a score distribution plot is
% created for the entire group by combining scores across speakers. If
% unspecified, ANALYZE defaults to 'group'.
arguments
targetScores (1,:) cell
nontargetScores (1,:) cell
nvargs.Analyze (1,:) char {mustBeMember(nvargs.Analyze,{'label','group'})} = 'group'
end
% Combine all scores to determine good bins for analyzing both the target
% and non-target scores together.
allScores = cat(1,targetScores{:},nontargetScores{:});
[~,edges] = histcounts(allScores);
if strcmpi(nvargs.Analyze,"group")
% Plot the score distributions for the group.
targetScoresBinCounts = histcounts(cat(1,targetScores{:}),edges);
targetScoresBinProb = targetScoresBinCounts(:)./sum(targetScoresBinCounts);
nontargetScoresBinCounts = histcounts(cat(1,nontargetScores{:}),edges);
nontargetScoresBinProb = nontargetScoresBinCounts(:)./sum(nontargetScoresBinCounts);
figure
plot(centers,[targetScoresBinProb,nontargetScoresBinProb])
title("Score Distributions")
xlabel("Score")
ylabel("Probability")
legend(["target","non-target"],Location="northwest")
axis tight
else
% Create a tiled layout and plot the score distributions for each speaker.
N = numel(targetScores);
tiledlayout(N,1)
for ii = 1:N
targetScoresBinCounts = histcounts(targetScores{ii},edges);
targetScoresBinProb = targetScoresBinCounts(:)./sum(targetScoresBinCounts);
nontargetScoresBinCounts = histcounts(nontargetScores{ii},edges);
nontargetScoresBinProb = nontargetScoresBinCounts(:)./sum(nontargetScoresBinCounts);
nexttile
hold on
plot(centers,[targetScoresBinProb,nontargetScoresBinProb])
title("Score Distribution for Speaker " + string(ii))
xlabel("Score")
ylabel("Probability")
legend(["target","non-target"],Location="northwest")
axis tight
end
end
end
function plotEER(targetScores,nontargetScores,nvargs)
%PLOTEER Plot equal error rate (EER)
% plotEER(targetScores,nontargetScores) creates an equal error rate plot
% using the target scores and the non-target scores. Specify targetScores
% and nontargetScores as cell arrays where each element contains a vector
% of speaker-specific scores.
%
% plotEER(targetScores,nontargetScores,Analyze=ANALYZE) specifies the
% scope for analysis as either 'label' or 'group'. If ANALYZE is set to
% 'label', then an equal error rate plot is created for each label. If
% ANALYZE is set to 'group', then an equal error rate plot is created for
% the entire group by combining scores across speakers. If unspecified,
% ANALYZE defaults to 'group'.
arguments
targetScores (1,:) cell
nontargetScores (1,:) cell
nvargs.Analyze (1,:) char {mustBeMember(nvargs.Analyze,{'label','group'})} = 'group'
end
% Combine all scores to determine good bins for analyzing both the target
% and non-target scores together.
allScores = cat(1,targetScores{:},nontargetScores{:});
[~,edges] = histcounts(allScores,BinWidth=0.002);
if strcmpi(nvargs.Analyze,"group")
% Plot the equal error rate for the group.
targetScoresBinCounts = histcounts(cat(1,targetScores{:}),edges);
targetScoresBinProb = targetScoresBinCounts(:)./sum(targetScoresBinCounts);
nontargetScoresBinCounts = histcounts(cat(1,nontargetScores{:}),edges);
nontargetScoresBinProb = nontargetScoresBinCounts(:)./sum(nontargetScoresBinCounts);
targetScoresCDF = cumsum(targetScoresBinProb);
nontargetScoresCDF = cumsum(nontargetScoresBinProb,"reverse");
[~,idx] = min(abs(targetScoresCDF(:) - nontargetScoresCDF));
figure
plot(centers,[targetScoresCDF,nontargetScoresCDF])
xline(centers(idx),"-",num2str(centers(idx),3),LabelOrientation="horizontal")
legend(["FRR","FAR"],Location="best")
xlabel("Threshold Score")
ylabel("Error Rate")
title(sprintf("Equal Error Plot, EER = %0.2f (%%)",100*mean([targetScoresCDF(idx);nontargetSc
axis tight
else
% Create a tiled layout and plot the equal error rate for each speaker.
N = numel(targetScores);
f = figure;
tiledlayout(f,N,1,Padding="tight",TileSpacing="tight")
for ii = 1:N
targetScoresBinCounts = histcounts(targetScores{ii},edges);
targetScoresBinProb = targetScoresBinCounts(:)./sum(targetScoresBinCounts);
nontargetScoresBinCounts = histcounts(nontargetScores{ii},edges);
nontargetScoresBinProb = nontargetScoresBinCounts(:)./sum(nontargetScoresBinCounts);
targetScoresCDF = cumsum(targetScoresBinProb);
nontargetScoresCDF = cumsum(nontargetScoresBinProb,"reverse");
[~,idx] = min(abs(targetScoresCDF(:) - nontargetScoresCDF));
nexttile
plot(centers,[targetScoresCDF,nontargetScoresCDF])
xline(centers(idx),"-",num2str(centers(idx),3),LabelOrientation="horizontal")
legend(["FRR","FAR"],Location="southwest")
xlabel("Threshold Score")
ylabel("Error Rate")
title(sprintf("Equal Error Plot for Speaker " + string(ii) + ", EER = %0.2f (%%)", ...
100*mean([targetScoresCDF(idx);nontargetScoresCDF(idx)])))
axis tight
end
end
end
arguments
w (:,:) {mustBeA(w,["single","double"])}
imposterCohort (:,:) {mustBeA(imposterCohort,["single","double"])}
nvargs.TopK (1,1) {mustBePositive} = inf
end
topK = min(ceil(nvargs.TopK),size(imposterCohort,2));
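The listing above omits the function declaration and the remainder of getNormFactors. A minimal sketch of how the function could be completed, following the per-speaker computation shown earlier in this example (the declaration, output name, and field names are assumptions):
% Sketch (assumed): the declaration would be something like
% function normFactors = getNormFactors(w,imposterCohort,nvargs)
% Score against the imposter cohort, keep the top-K scores per i-vector, and
% return the mean and standard deviation as the normalization factors.
imposterScores = sort(cosineSimilarityScore(w,imposterCohort),"descend");
imposterScores = imposterScores(1:topK,:);
mu = mean(imposterScores,1);
stdev = std(imposterScores,[],1);
normFactors = struct(mu=mu,std=stdev);
end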
References
[1] Matejka, Pavel, Ondrej Novotny, Oldrich Plchot, Lukas Burget, Mireia Diez Sanchez, and Jan
Cernocky. "Analysis of Score Normalization in Multilingual Speaker Recognition." Interspeech 2017,
2017. https://doi.org/10.21437/interspeech.2017-803.
[2] van Leeuwen, David A., and Niko Brummer. "An Introduction to Application-Independent
Evaluation of Speaker Recognition Systems." Lecture Notes in Computer Science, 2007, 330–53.
https://doi.org/10.1007/978-3-540-74200-5_19.
[3] G. Pirker, M. Wohlmayr, S. Petrik, and F. Pernkopf, "A Pitch Tracking Corpus with Evaluation on
Multipitch Tracking Scenario", Interspeech, pp. 1509-1512, 2011.
i-vector Score Calibration
An i-vector system outputs a raw score specific to the data and parameters used to develop the
system. This makes interpreting the score and finding a consistent decision threshold for verification
tasks difficult.
To address these difficulties, researchers developed score normalization and score calibration
techniques.
• In score normalization, raw scores are normalized in relation to an 'imposter cohort'. Score
normalization occurs before evaluating the detection error tradeoff and can improve the accuracy
of a system and its ability to adapt to new data.
• In score calibration, raw scores are mapped to probabilities, which in turn are used to better
understand the system's confidence in decisions.
In this example, you apply score calibration to an i-vector system. To learn about score normalization,
see “i-vector Score Normalization” on page 1-583.
For example purposes, you use cosine similarity scoring (CSS) throughout this example. The
interpretability of probabilistic linear discriminant analysis (PLDA) scoring is also improved by
calibration.
Starting in R2022a, you can use the calibrate method of ivectorSystem to calibrate both CSS
and PLDA scoring.
Download the PTDB-TUG data set [1] on page 1-617. The supporting function, loadDataset on
page 1-609, downloads the data set and then resamples it from 48 kHz to 16 kHz, which is the
sample rate that the i-vector system was trained on. The loadDataset function returns these
audioDatastore objects:
• adsEnroll - Contains the set used to enroll speakers.
• adsDev - Contains the development set used to evaluate the calibration. It has no speaker overlap with the calibration set.
• adsCalibrate - Contains the set used to learn the calibration warping function.
Score Calibration
In score calibration, you apply a warping function to scores so that they are more easily and
consistently interpretable as measures of confidence. Generally, score calibration has no effect on the
performance of a verification system because the mapping is monotonically increasing and therefore preserves the ranking of the scores. The two most
popular approaches to calibration are Platt scaling and isotonic regression. Isotonic regression
usually results in better performance, but is more prone to overfitting if the calibration data is too
small [2] on page 1-617.
In this example, you perform calibration using both Platt scaling and isotonic regression, and then
compare the calibrations using reliability diagrams.
Extract i-vectors
To properly calibrate a system, you must use data that does not overlap with the evaluation data.
Extract i-vectors from the calibration set. You will use these i-vectors to create a calibration warping
function.
calibrationIvecs = ivector(ivs,adsCalibrate);
You will score each i-vector against each other i-vector to create a matrix of scores, some of which
correspond to target scores where both i-vectors belong to the same speaker, and some of which
correspond to non-target scores where the i-vectors belong to two different speakers. First, create a
targets matrix to keep track of which scores are target and which are non-target.
targets = true(size(calibrationIvecs,2),size(calibrationIvecs,2));
calibrationLabels = adsCalibrate.Labels;
for ii = 1:size(calibrationIvecs,2)
targets(:,ii) = ismember(calibrationLabels,calibrationLabels(ii));
end
Discard the target scores that correspond to an i-vector scored with itself by setting the
corresponding value in the target matrix to NaN. The supporting function, scoreTargets on page 1-
611, scores each valid i-vector pair and returns the results in cell arrays of target and non-target
scores.
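The call that produces targetScores and nontargetScores does not appear on this page. Based on the identical step performed later for the development set, it would look something like this (a sketch, not the original listing):
% Sketch (assumed): exclude self-comparisons, then score every remaining pair.
targets = targets + diag(diag(targets)*nan);
[targetScores,nontargetScores] = scoreTargets(calibrationIvecs,calibrationIvecs,targets);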
Use the supporting function, plotScoreDistributions on page 1-612, to plot the target and non-
target score distributions for the group. The scores range from around 0.64 to 1. In a properly
calibrated system, scores should range from 0 to 1. The job of calibrating a binary classification
system is to map the raw score to a score between 0 and 1. The calibrated score should be
interpretable as the probability that the score corresponds to a target pair.
plotScoreDistributions(targetScores,nontargetScores,Analyze="group")
Platt Scaling
Platt scaling (also referred to as Platt calibration or logistic regression) works by fitting a logistic
regression model to a classifier's scores.
The supporting function logistic on page 1-615 implements a general logistic function defined as
p(x) = 1/(1 + e^-(Ax + B))
The supporting function logRegCost on page 1-615 defines the cost function for logistic regression
as defined in [3] on page 1-617:
argmin over (A, B) of  -Σ_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]
As described in [3] on page 1-617, the target values are modified from 0 and 1 to avoid overfitting:
y+ = (N+ + 1)/(N+ + 2),  y− = 1/(N− + 2)
where y+ is the positive sample value and N+ is the number of positive samples, and y− is the
negative sample value and N− is the number of negative samples.
tS = cat(1,targetScores{:});
ntS = cat(1,nontargetScores{:});
x = [tS;ntS];
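The construction of the regression targets y is not shown in the listing. Here is a sketch that follows the modified target values defined above (the construction itself is an assumption; only the variable name y is required by the later call to logRegCost):
% Sketch (assumed): build the modified regression targets described above.
yplus = (numel(tS) + 1)/(numel(tS) + 2);   % positive sample value
yminus = 1/(numel(ntS) + 2);               % negative sample value
y = [yplus*ones(numel(tS),1);yminus*ones(numel(ntS),1)];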
Use fminsearch to find the values of A and B that minimize the cost function.
init = [1,1];
AB = fminsearch(@(AB)logRegCost(y,x,AB),init);
[x,idx] = sort(x,"ascend");
trueLabel = [ones(numel(tS),1);zeros(numel(ntS),1)];
trueLabel = trueLabel(idx);
Use the supporting function calibrateScores on page 1-613 to calibrate the raw scores. Plot the
warping function that maps the raw scores to the calibrated scores. Also plot the target scores you
are modeling.
calibratedScores = calibrateScores(x,AB);
plot(x,trueLabel,"o")
hold on
plot(x,calibratedScores,LineWidth=1.5)
grid on
xlabel("Raw Score")
ylabel("Calibrated Score")
hold off
Isotonic Regression
Isotonic regression fits a free-form line to observations with the only condition being that it is non-
decreasing (or non-increasing). The supporting function isotonicRegression on page 1-616 uses
the pool adjacent violators (PAV) algorithm [3] on page 1-617 for isotonic regression.
Call isotonicRegression with the raw score and true labels. The function outputs a struct
containing a map between raw scores and calibrated scores.
scoringMap = isotonicRegression(x,trueLabel);
Plot the raw score against the calibrated score. The line is the learned isotonic fit. The circles are the
data you are fitting.
plot(x,trueLabel,"o")
hold on
plot(scoringMap.Raw,scoringMap.Calibrated,LineWidth=1.5)
grid on
xlabel("Raw Score")
ylabel("Calibrated Score")
hold off
Reliability Diagram
Reliability diagrams assess how well calibrated a system is by plotting the mean of the predicted value against the known
fraction of positives. A system is reliable if the mean of the predicted value is equal to the fraction of
positives [4] on page 1-617.
Reliability must be assessed using a different data set than the one used to calibrate the system.
Extract i-vectors from the development data set, adsDev. The development data set has no speaker
overlap with the calibration data set.
devIvecs = ivector(ivs,adsDev);
devLabels = adsDev.Labels;
targets = true(size(devIvecs,2),size(devIvecs,2));
for ii = 1:size(devIvecs,2)
targets(:,ii) = ismember(devLabels,devLabels(ii));
end
targets = targets + diag(diag(targets)*nan);
[targetScores,nontargetScores] = scoreTargets(devIvecs,devIvecs,targets);
ts = cat(1,targetScores{:});
nts = cat(1,nontargetScores{:});
scores = [ts;nts];
trueLabels = [true(numel(ts),1);false(numel(nts),1)];
calibratedScoresPlattScaling = calibrateScores(scores,AB);
calibratedScoresIsotonicRegression = calibrateScores(scores,scoringMap);
When interpreting the reliability diagram, values below the diagonal indicate that the system is
giving higher probability scores than it should be, and values above the diagonal indicate the system
is giving lower probability scores than it should. In both cases, increasing the amount of calibration
data, and using calibration data like the target application, should improve performance.
numBins = ;
Plot the reliability diagram for the i-vector system calibrated using Platt scaling.
reliabilityDiagram(trueLabels,calibratedScoresPlattScaling,numBins)
Plot the reliability diagram for the i-vector system calibrated using isotonic regression.
reliabilityDiagram(trueLabels,calibratedScoresIsotonicRegression,numBins)
Supporting Functions
Load Dataset
rng(0)
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","ptdb-tug.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"ptdb-tug");
if ~isfolder(fullfile(pwd,"MIC"))
ads = audioDatastore([fullfile(dataset,"SPEECH DATA","FEMALE","MIC"),fullfile(dataset,"SPEECH DATA","MALE","MIC")], ...
IncludeSubfolders=true, ...
FileExtensions=".wav", ...
LabelSource="foldernames");
reduceDataset = false;
if reduceDataset
ads = splitEachLabel(ads,10);
end
adsTransform = transform(ads,@(x,y)fileResampler(x,y,targetSampleRate),IncludeInfo=true);
writeall(adsTransform,pwd,OutputFormat="flac",UseParallel=canUseParallelPool)
end
% Create a datastore that points to the resampled dataset. Use the folder
% names as the labels.
ads = audioDatastore(fullfile(pwd,"MIC"),IncludeSubfolders=true,LabelSource="foldernames");
% Split the data set into enrollment, development, and calibration sets.
calibrationLabels = categorical(["M01","M03","M05","M07","M09","F01","F03","F05","F07","F09"]);
adsCalibrate = subset(ads,ismember(ads.Labels,calibrationLabels));
adsDev = subset(ads,~ismember(ads.Labels,calibrationLabels));
numToEnroll = 2;
[adsEnroll,adsDev] = splitEachLabel(adsDev,numToEnroll);
end
File Resampler
function [audioOut,adsInfo] = fileResampler(audioIn,adsInfo,targetSampleRate)
arguments
audioIn (:,1) {mustBeA(audioIn,["single","double"])}
adsInfo (1,1) {mustBeA(adsInfo,"struct")}
targetSampleRate (1,1) {mustBeNumeric,mustBePositive}
end
originalSampleRate = adsInfo.SampleRate;
audioOut = audioIn;
% Resample if necessary
if originalSampleRate ~= targetSampleRate
audioOut = resample(audioIn,targetSampleRate,originalSampleRate);
amax = max(abs(audioOut));
if max(amax>1)
audioOut = audioOut./amax;
end
end
end
arguments
e (:,:) {mustBeA(e,["single","double"])}
t (:,:) {mustBeA(t,["single","double"])}
targetMap (:,:)
nvargs.NormFactorsSe = [];
nvargs.NormFactorsSt = [];
end
targetScores{ii} = localScores(logical(localMap));
nontargetScores{ii} = localScores(~localMap);
end
end
function scores = cosineSimilarityScore(a,b)
arguments
a (:,:) {mustBeA(a,["single","double"])}
b (:,:) {mustBeA(b,["single","double"])}
end
scores = squeeze(sum(a.*reshape(b,size(b,1),1,[]),1)./(vecnorm(a).*reshape(vecnorm(b),1,1,[])));
scores = scores';
end
function plotScoreDistributions(targetScores,nontargetScores,nvargs)
%PLOTSCOREDISTRIBUTIONS Plot target and non-target score distributions
% plotScoreDistribution(targetScores,nontargetScores) plots empirical
% estimations of the distribution for target scores and nontarget scores.
% Specify targetScores and nontargetScores as cell arrays where each
% element contains a vector of speaker-specific scores.
%
% plotScoreDistributions(targetScores,nontargetScores,Analyze=ANALYZE)
% specifies the scope for analysis as either "label" or "group". If ANALYZE
% is set to "label", then a score distribution plot is created for each
% label. If ANALYZE is set to "group", then a score distribution plot is
% created for the entire group by combining scores across speakers. If
% unspecified, ANALYZE defaults to "group".
arguments
targetScores (1,:) cell
nontargetScores (1,:) cell
nvargs.Analyze (1,:) char {mustBeMember(nvargs.Analyze,["label","group"])} = "group"
end
% Combine all scores to determine good bins for analyzing both the target
% and non-target scores together.
allScores = cat(1,targetScores{:},nontargetScores{:});
[~,edges] = histcounts(allScores);
if strcmpi(nvargs.Analyze,"group")
% Plot the score distributions for the group.
targetScoresBinCounts = histcounts(cat(1,targetScores{:}),edges);
targetScoresBinProb = targetScoresBinCounts(:)./sum(targetScoresBinCounts);
nontargetScoresBinCounts = histcounts(cat(1,nontargetScores{:}),edges);
nontargetScoresBinProb = nontargetScoresBinCounts(:)./sum(nontargetScoresBinCounts);
figure
plot(centers,[targetScoresBinProb,nontargetScoresBinProb])
title("Score Distributions")
xlabel("Score")
ylabel("Probability")
legend(["target","non-target"],Location="northwest")
axis tight
else
% Create a tiled layout and plot the score distributions for each speaker.
N = numel(targetScores);
tiledlayout(N,1)
for ii = 1:N
targetScoresBinCounts = histcounts(targetScores{ii},edges);
targetScoresBinProb = targetScoresBinCounts(:)./sum(targetScoresBinCounts);
nontargetScoresBinCounts = histcounts(nontargetScores{ii},edges);
nontargetScoresBinProb = nontargetScoresBinCounts(:)./sum(nontargetScoresBinCounts);
nexttile
hold on
plot(centers,[targetScoresBinProb,nontargetScoresBinProb])
title("Score Distribution for Speaker " + string(ii))
xlabel("Score")
ylabel("Probability")
legend(["target","non-target"],Location="northwest")
axis tight
end
end
end
Calibrate Scores
function y = calibrateScores(score,scoreMapping)
%CALIBRATESCORES Calibrate scores
% y = calibrateScores(score,scoreMapping) maps the raw scores to calibrated
% scores, y, using the score mapping information in scoreMapping.
% Specify score as a vector or matrix of raw scores. Specify score mapping
% as either struct or a two-element vector. If scoreMapping is specified as
% a struct, then it should have two fields: Raw and Calibrated, that
% together form a score mapping. If scoreMapping is specified as a vector,
% then the elements are used as the coefficients in the logistic function.
% y is returned as a vector or matrix of the same size as the raw scores.
arguments
score (:,:) {mustBeA(score,["single","double"])}
scoreMapping
end
if isstruct(scoreMapping)
% Calibration using isotonic regression
rawScore = scoreMapping.Raw;
interpretedScore = scoreMapping.Calibrated;
n = numel(score);
% Find the index of the raw score in the mapping closest to the score provided.
idx = zeros(n,1);
for ii = 1:n
[~,idx(ii)] = min(abs(score(ii)-rawScore));
end
% Map each raw score to the calibrated value at the closest raw score.
y = reshape(interpretedScore(idx),size(score));
else
% Calibration using logistic regression (Platt scaling).
y = logistic(score,scoreMapping);
end
end
Reliability Diagram
function reliabilityDiagram(targets,predictions,numBins)
%RELIABILITYDIAGRAM Plot reliability diagram
% reliabilityDiagram(targets,predictions) plots a reliability diagram for
% targets and predictions. Specify targets as an M-by-1 logical vector.
% Specify predictions as an M-by-1 numeric vector.
%
% reliabilityDiagram(targets,predictions,numBins) specifies the number of
% bins for the reliability diagram. If unspecified, numBins defaults to 10.
arguments
targets (:,1) {mustBeA(targets,"logical")}
predictions (:,1) {mustBeA(predictions,["single","double"])}
numBins (1,1) {mustBePositive,mustBeInteger} = 10;
end
% Bin the predictions into the requested number of bins. Count the number of
% predictions per bin.
[predictionsPerBin,~,predictionsInBin] = histcounts(predictions,numBins);
% Determine the mean of the targets per bin. This is the fraction of
% positives--the number of targets in the bin over the total number of
% predictions in the bin.
meanTargets = accumarray(predictionsInBin,targets)./predictionsPerBin(:);
% Determine the mean of the predictions per bin.
meanPredictions = accumarray(predictionsInBin,predictions)./predictionsPerBin(:);
plot([0,1],[0,1])
hold on
plot(meanPredictions,meanTargets,"o")
legend("Ideal Calibration",Location="best")
xlabel("Mean Predicted Value")
ylabel("Fraction of Positives")
title("Reliability Diagram")
grid on
hold off
end
function cost = logRegCost(y,f,iparams)
arguments
y (:,1) {mustBeA(y,["single","double"])}
f (:,1) {mustBeA(f,["single","double"])}
iparams (1,2) {mustBeA(iparams,["single","double"])}
end
p = logistic(f,iparams);
cost = -sum(y.*log(p) + (1-y).*log(1-p));
end
Logistic Function
function p = logistic(f,iparams)
%LOGISTIC Logistic function
% p = logistic(f,iparams) applies the general logistic function to input f
% with parameters iparams. Specify f as a numeric array. Specify iparams as
% a two-element vector. p is returned as the same size as f.
arguments
f
iparams = [1 0];
end
p = 1./(1+exp(-iparams(1).*f - iparams(2)));
end
Isotonic Regression
N = numel(x);
% Initialize blocks, one per point. These will merge and the number of
% blocks will reduce as the algorithm proceeds.
blockMap = 1:N;
w = ones(size(m));
while true
diffs = diff(m);
if all(diffs >= 0)
else
% Calculate the mean of each block and update the weights for the
% blocks. We're merging all the points in the blocks here.
m = accumarray(blockIndices,w.*m);
w = accumarray(blockIndices,w);
m = m ./ w;
end
end
scoreMapping = struct(Raw=b,Calibrated=a);
end
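The isotonicRegression listing above is incomplete: it omits the function declaration, the initialization of m, and the block bookkeeping that merges adjacent violators. For reference, here is a self-contained sketch of pool adjacent violators isotonic regression that produces the same kind of Raw/Calibrated mapping. It is an illustration, not the example's original implementation.
function scoreMapping = pavIsotonicRegressionSketch(x,y)
% Sketch (assumed) of isotonic regression using the pool adjacent violators
% (PAV) algorithm. x contains raw scores sorted in ascending order and y the
% corresponding true labels (0 or 1). The output maps raw scores to a
% non-decreasing set of calibrated scores.
m = y(:);              % fitted values, initialized to the observations
w = ones(size(m));     % block weights (number of observations merged per block)
while true
    d = diff(m);
    if all(d >= 0)
        break          % the fit is non-decreasing, so stop
    end
    % Merge the first adjacent pair that violates monotonicity by replacing
    % the pair with its weighted mean.
    ii = find(d < 0,1);
    m(ii) = (w(ii)*m(ii) + w(ii+1)*m(ii+1))/(w(ii) + w(ii+1));
    w(ii) = w(ii) + w(ii+1);
    m(ii+1) = [];
    w(ii+1) = [];
end
% Expand the merged blocks back to one calibrated value per raw score.
calibrated = repelem(m,w);
scoreMapping = struct(Raw=x(:),Calibrated=calibrated(:));
end
Calling scoreMapping = pavIsotonicRegressionSketch(x,trueLabel) with the sorted raw scores and labels from the Platt scaling section yields a mapping with the same Raw and Calibrated fields that calibrateScores expects.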
References
[1] G. Pirker, M. Wohlmayr, S. Petrik, and F. Pernkopf, "A Pitch Tracking Corpus with Evaluation on
Multipitch Tracking Scenario", Interspeech, pp. 1509-1512, 2011.
[2] van Leeuwen, David A., and Niko Brummer. "An Introduction to Application-Independent
Evaluation of Speaker Recognition Systems." Lecture Notes in Computer Science, 2007, 330–53.
[3] Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning.
Proceedings of the 22nd International Conference on Machine Learning - ICML '05.
doi:10.1145/1102351.1102430
[4] Brocker, Jochen, and Leonard A. Smith. “Increasing the Reliability of Reliability Diagrams.”
Weather and Forecasting 22, no. 3 (2007): 651–61. https://doi.org/10.1175/waf993.1.
Speaker Recognition Using x-vectors
Speaker recognition answers the question "Who is speaking?". Speaker recognition is usually divided
into two tasks: speaker identification and speaker verification. In speaker identification, a speaker is
recognized by comparing their speech to a closed set of templates. In speaker verification, a speaker
is recognized by comparing the likelihood that the speech belongs to a particular speaker against a
predetermined threshold. Traditional machine learning methods perform well at these tasks in ideal
conditions. For examples of speaker identification using traditional machine learning methods, see
“Speaker Identification Using Pitch and MFCC” on page 1-235 and “Speaker Verification Using i-
vectors” on page 1-558. Audio Toolbox™ provides ivectorSystem which encapsulates the ability to
train an i-vector system, enroll speakers or other audio labels, evaluate the system for a decision
threshold, and identify or verify speakers or other audio labels.
In adverse conditions, the deep learning approach of x-vectors has been shown to achieve state-of-
the-art results for many scenarios and applications [1] on page 1-629. The x-vector system is an
evolution of i-vectors originally developed for the task of speaker verification.
In this example, you develop an x-vector system. First, you train a time-delay neural network (TDNN)
to perform speaker identification. Then you train the traditional backends for an x-vector-based
speaker verification system: an LDA projection matrix and a PLDA model. You then perform speaker
verification using the TDNN and the backend dimensionality reduction and scoring. The x-vector
system backend, or classifier, is the same as developed for i-vector systems. For details on the
backend, see “Speaker Verification Using i-vectors” on page 1-558 and ivectorSystem.
In “Speaker Diarization Using x-vectors” on page 1-632, you use the x-vector system trained in this
example to perform speaker diarization. Speaker diarization answers the question, "Who spoke
when?".
This example uses a subset of the LibriSpeech Dataset [2] on page 1-629. The LibriSpeech Dataset is
a large corpus of read English speech sampled at 16 kHz. The data is derived from audiobooks read
from the LibriVox project. Download the 100-hour subset of the LibriSpeech training data, the clean
development set, and the clean test set.
dataFolder = tempdir;
datasetTrain = fullfile(dataFolder,"LibriSpeech","train-clean-100");
if ~datasetExists(datasetTrain)
filename = "train-clean-100.tar.gz";
url = "http://www.openSLR.org/resources/12/" + filename;
gunzip(url,dataFolder);
unzippedFile = fullfile(dataFolder,filename);
untar(unzippedFile{1}(1:end-3),dataFolder);
end
datasetDev = fullfile(dataFolder,"LibriSpeech","dev-clean");
if ~datasetExists(datasetDev)
filename = "dev-clean.tar.gz";
url = "http://www.openSLR.org/resources/12/" + filename;
gunzip(url,dataFolder);
unzippedFile = fullfile(dataFolder,filename);
untar(unzippedFile{1}(1:end-3),dataFolder);
end
datasetTest = fullfile(dataFolder,"LibriSpeech","test-clean");
if ~datasetExists(datasetTest)
filename = "test-clean.tar.gz";
url = "http://www.openSLR.org/resources/12/" + filename;
gunzip(url,dataFolder);
unzippedFile = fullfile(dataFolder,filename);
untar(unzippedFile{1}(1:end-3),dataFolder);
end
Create audioDatastore objects that point to the data. The speaker labels are encoded in the file
names. Split the datastore into train, validation, and test sets. You will use these sets to train,
validate, and test a TDNN.
adsTrain = audioDatastore(datasetTrain,IncludeSubfolders=true);
trainLabels = filenames2labels(adsTrain,ExtractBefore="-");
adsDev = audioDatastore(datasetDev,IncludeSubfolders=true);
devLabels = filenames2labels(adsDev,ExtractBefore="-");
adsEvaluate = audioDatastore(datasetTest,IncludeSubfolders=true);
evalLabels = filenames2labels(adsEvaluate,ExtractBefore="-");
• adsTrain - Contains training set for the TDNN and backend classifier.
• adsValidation - Contains validation set to evaluate TDNN training progress.
• adsTest - Contains test set to evaluate the TDNN performance for speaker identification.
• adsEnroll - Contains enrollment set to evaluate the detection error tradeoff of the x-vector
system for speaker verification.
• adsDET - Contains evaluation set used to determine the detection error tradeoff of the x-vector
system for speaker verification.
indices = splitlabels(trainLabels,[0.8,0.1,0.1],"randomized");
adsValidation = subset(adsTrain,indices{2});
labelsValidation = trainLabels(indices{2});
adsTest = subset(adsTrain,indices{3});
labelsTest = trainLabels(indices{3});
adsTrain = subset(adsTrain,indices{1});
labelsTrain = trainLabels(indices{1});
indices = splitlabels(evalLabels,3,"randomized");
adsEnroll = subset(adsEvaluate,indices{1});
labelsEnroll = filenames2labels(adsEnroll,ExtractBefore="-");
adsLeftover = subset(adsEvaluate,indices{2});
adsDET = audioDatastore([adsLeftover.Files;adsDev.Files]);
labelsDET = filenames2labels(adsDET,ExtractBefore="-");
You can reduce the training and detection error tradeoff datasets used in this example to speed up
the runtime at the cost of performance. In general, reducing the data set is a good practice for
development and debugging.
speedupExample = ;
if speedupExample
idxs = splitlabels(labelsTrain,5);
adsTrain = subset(adsTrain,idxs{1});
labelsTrain = labelsTrain(idxs{1});
idxs = splitlabels(labelsValidation,2);
adsValidation = subset(adsValidation,idxs{1});
labelsValidation = labelsValidation(idxs{1});
idxs = splitlabels(labelsDET,5);
adsDET = subset(adsDET,idxs{1});
labelsDET = labelsDET(idxs{1});
end
Input Pipeline
windowDuration = ;
hopDuration = ;
fs = 16e3; % LibriSpeech audio is sampled at 16 kHz
windowSamples = round(windowDuration*fs);
hopSamples = round(hopDuration*fs);
overlapSamples = windowSamples - hopSamples;
numCoeffs = ;
afe = audioFeatureExtractor( ...
SampleRate=fs, ...
Window=hann(windowSamples,"periodic"), ...
OverlapLength=overlapSamples, ...
mfcc=true);
setExtractorParameters(afe,"mfcc",NumCoeffs=numCoeffs)
Create a transform datastore that applies preprocessing to the audio and outputs features. The
supporting function, xVectorPreprocess on page 1-629, performs speech detection and extracts
features from regions of speech. When the parameter Segment is set to false, the detected regions
of speech are concatenated together.
adsTrainTransform = transform(adsTrain,@(x)xVectorPreprocess(x,afe,Segment=false,MinimumDuration=
Concatenate the features and then save the global mean and standard deviation in a struct. You will
use these factors to normalize features.
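The read of the transform datastore is not shown on this page. A sketch of the step that would produce the features cell array used below (the exact call is an assumption):
% Sketch (assumed): read all preprocessed feature matrices into memory.
features = readall(adsTrainTransform,UseParallel=canUseParallelPool);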
features = cat(1,features{:});
[S,M] = std(features,0,1);
factors = struct(Mean=M,STD=S);
clear features
cdsTest = combine(adsTest,arrayDatastore(labelsTest));
cdsEnroll = combine(adsEnroll,arrayDatastore(labelsEnroll));
cdsDET = combine(adsDET,arrayDatastore(labelsDET));
Create a new transform datastore for the training set, this time specifying the normalization factors
and Segment as true. Now, features are normalized by the global mean and standard deviation, and
then the file-level mean. The individual speech regions detected are not concatenated. Extract all the
features and corresponding targets.
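The construction of xtTrain is not visible in this listing. Here is a sketch under the assumption that the training audio datastore is combined with its labels and passed through xVectorPreprocess with the normalization factors (every name other than xtTrain, adsTrain, labelsTrain, afe, and factors is an assumption):
% Sketch (assumed): combine audio and labels, extract normalized, segmented
% features, and read everything into memory.
cdsTrain = combine(adsTrain,arrayDatastore(labelsTrain));
tdsTrain = transform(cdsTrain,@(x)xVectorPreprocess(x,afe,Factors=factors,Segment=true));
xtTrain = readall(tdsTrain,UseParallel=canUseParallelPool);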
xTrain = cat(1,xtTrain{:,1});
tTrain = cat(1,xtTrain{:,2});
Apply the same transformation to the validation, test, enrollment, and DET sets.
x-vector Network
The table summarizes the architecture of the network described in [1] on page 1-629 and
implemented in this example. T is the total number of frames (feature vectors over time) in an audio
signal. N is the number of classes (speakers) in the training set.
Define the network. You can change the model size by increasing or decreasing the numFilters
parameter.
numFilters = ;
dropProb = 0.2;
classes = unique(labelsTrain);
numClasses = numel(classes);
layers = [
sequenceInputLayer(afe.FeatureVectorLength,MinLength=15,Name="input")
convolution1dLayer(5,numFilters,DilationFactor=1,Name="conv_1")
batchNormalizationLayer(Name="BN_1")
dropoutLayer(dropProb,Name="drop_1")
reluLayer(Name="act_1")
convolution1dLayer(3,numFilters,DilationFactor=2,Name="conv_2")
batchNormalizationLayer(Name="BN_2")
dropoutLayer(dropProb,Name="drop_2")
reluLayer(Name="act_2")
convolution1dLayer(3,numFilters,DilationFactor=3,Name="conv_3")
batchNormalizationLayer(Name="BN_3")
dropoutLayer(dropProb,Name="drop_3")
reluLayer(Name="act_3")
convolution1dLayer(1,numFilters,DilationFactor=1,Name="conv_4")
batchNormalizationLayer(Name="BN_4")
dropoutLayer(dropProb,Name="drop_4")
reluLayer(Name="act_4")
convolution1dLayer(1,1500,DilationFactor=1,Name="conv_5")
batchNormalizationLayer(Name="BN_5")
dropoutLayer(dropProb,Name="drop_5")
reluLayer(Name="act_5")
statisticsPooling1dLayer(Name="statistics_pooling")
fullyConnectedLayer(numFilters,Name="fc_1")
batchNormalizationLayer(Name="BN_6")
dropoutLayer(dropProb,Name="drop_6")
reluLayer(Name="act_6")
fullyConnectedLayer(numFilters,Name="fc_2")
batchNormalizationLayer(Name="BN_7")
dropoutLayer(dropProb,Name="drop_7")
reluLayer(Name="act_7")
fullyConnectedLayer(numClasses,Name="fc_3")
softmaxLayer(Name="softmax")
];
The model requires statistical pooling which is implemented as a custom layer and placed in your
current folder when you open this example. Display the contents of the custom layer.
type("statisticsPooling1dLayer.m")
methods
function this = statisticsPooling1dLayer(options)
arguments
options.Name = ''
end
this.Name = options.Name;
end
function X = predict(~, X)
X = dlarray(stripdims([mean(X,3);std(X,0,3)]),"CB");
end
function X = forward(~, X)
X = X + 0.0001*rand(size(X),"single");
X = dlarray(stripdims([mean(X,3);std(X,0,3)]),"CB");
end
end
end
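The listing above shows only the methods block; the class declaration is outside the displayed text. A sketch of how the full layer might be structured (the class name and superclasses here are assumptions, not the shipped file):
classdef statisticsPooling1dLayerSketch < nnet.layer.Layer & nnet.layer.Formattable
    % Sketch (assumed): concatenate the mean and standard deviation over the
    % time dimension, turning a variable-length "CBT" input into a
    % fixed-length "CB" output.
    methods
        function this = statisticsPooling1dLayerSketch(options)
            arguments
                options.Name = ''
            end
            this.Name = options.Name;
        end
        function X = predict(~,X)
            X = dlarray(stripdims([mean(X,3);std(X,0,3)]),"CB");
        end
    end
end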
Training Options
To specify training options, use trainingOptions. Set the mini-batch size as appropriate for your
hardware.
mbsize = 256;
options = trainingOptions("adam", ...
OutputNetwork="best-validation", ...
LearnRateSchedule="piecewise", ...
LearnRateDropPeriod=2, ...
MaxEpochs=5, ...
Shuffle="every-epoch", ...
MiniBatchSize=mbsize, ...
Plots="training-progress", ...
Metrics="accuracy", ...
Verbose=false, ...
SequenceLength="shortest", ...
ValidationData={xValidation,tValidation}, ...
ValidationFrequency=400, ...
DispatchInBackground=canUseParallelPool);
Train Network
dlnet = trainnet(xTrain,tTrain,layers,"crossentropy",options);
Evaluate the TDNN speaker recognition accuracy using the held-out test set. Decode the predictions
and then compute the prediction accuracy. To speed up this example, use a large mini-batch size. If
you reduce the mini-batch size to 1, the accuracy is approximately 98% because you are no longer
discarding relevant data to make consistent sequence lengths.
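The prediction call is not shown on this page. A sketch of how the test scores could be obtained, assuming xTest contains the transformed test features (the option values are assumptions):
% Sketch (assumed): predict class scores for the held-out test features.
scores = minibatchpredict(dlnet,xTest, ...
    MiniBatchSize=mbsize, ...
    SequenceLength="shortest");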
yTest = scores2label(scores,classes,2);
accuracy = 100*mean(tTest(:)==yTest(:))
accuracy = 91.0644
In the x-vector system for speaker verification, the TDNN you just trained is used to output an
embedding layer. The output from the embedding layer ("fc_1" in this example) are the "x-vectors"
in an x-vector system.
The backend (or classifier) of an x-vector system is the same as the backend of an i-vector system. For
details on the algorithms, see ivectorSystem and “Speaker Verification Using i-vectors” on page 1-
558.
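The extraction of the training x-vectors is not visible here. Based on the analogous extraction for the DET set later in the example, it would look something like this (a sketch):
% Sketch (assumed): extract training x-vectors from the embedding layer "fc_1".
xvecsTrain = minibatchpredict(dlnet,xTrain, ...
    MiniBatchSize=1, ...
    Outputs="fc_1");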
xvecsTrain = xvecsTrain';
Create a linear discriminant analysis (LDA) projection matrix to reduce the dimensionality of the x-
vectors. LDA attempts to minimize the intra-class variance and maximize the variance between
speakers.
numEigenvectors = ;
projMat = helperTrainProjectionMatrix(xvecsTrain,tTrain,numEigenvectors);
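Before training the PLDA model, the training x-vectors are projected with the LDA matrix; that line does not appear in the listing. A sketch mirroring the DET-set projection used later:
% Sketch (assumed): project the training x-vectors into the LDA subspace.
xvecsTrainP = projMat*xvecsTrain;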
numIterations = ;
numDimensions = ;
plda = helperTrainPLDA(xvecsTrainP,tTrain,numIterations,numDimensions);
Speaker verification systems verify that a speaker is who they purport to be. Before a speaker can be
verified, they must be enrolled in the system. Enrollment in the system means that the system has a
template x-vector representation of the speaker.
Enroll Speakers
Create template x-vectors for each speaker by averaging the x-vectors of individual speakers across
enrollment files.
uniqueLabels = unique(tEnroll);
enrollmentTable = cell2table(cell(0,2),VariableNames=["xvector","NumSamples"]);
for ii = 1:numel(uniqueLabels)
idx = uniqueLabels(ii)==tEnroll;
wLocalMean = mean(xvecsEnrollP(:,idx),2);
localTable = table({wLocalMean},(sum(idx)), ...
VariableNames=["xvector","NumSamples"], ...
RowNames=string(uniqueLabels(ii)));
enrollmentTable = [enrollmentTable;localTable]; %#ok<AGROW>
end
Speaker verification systems require you to set a threshold that balances the probability of a false
acceptance (FA) and the probability of a false rejection (FR), according to the requirements of your
application. To determine the threshold that meets your FA/FR requirements, evaluate the detection
error tradeoff of the system.
xvecsDET = minibatchpredict(dlnet,xDET, ...
MiniBatchSize=1, ...
Outputs="fc_1");
xvecsDET = xvecsDET';
xvecsDETP = projMat*xvecsDET;
detTable = helperDetectionErrorTradeoff(xvecsDETP,tDET,enrollmentTable,plda);
Plot the results of the detection error tradeoff evaluation for both PLDA scoring and cosine similarity
scoring (CSS).
figure
plot(detTable.PLDA.Threshold,detTable.PLDA.FAR, ...
detTable.PLDA.Threshold,detTable.PLDA.FRR)
eer = helperEqualErrorRate(detTable.PLDA);
title(["Speaker Verification","Detection Error Tradeoff","PLDA Scoring","Equal Error Rate = " + e
xlabel("Threshold")
ylabel("Error Rate")
legend(["FAR","FRR"])
figure
plot(detTable.CSS.Threshold,detTable.CSS.FAR, ...
detTable.CSS.Threshold,detTable.CSS.FRR)
eer = helperEqualErrorRate(detTable.CSS);
title(["Speaker Verification","Detection Error Tradeoff","Cosine Similarity Scoring","Equal Error
xlabel("Threshold")
ylabel("Error Rate")
legend(["FAR","FRR"])
References
[1] Snyder, David, et al. "x-vectors: Robust DNN Embeddings for Speaker Recognition." 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp.
5329–33. DOI.org (Crossref), doi:10.1109/ICASSP.2018.8461375.
[2] V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: An ASR corpus based on public
domain audio books," 2015 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), Brisbane, QLD, 2015, pp. 5206-5210, doi: 10.1109/ICASSP.2015.7178964
Supporting Functions
isCombined = numel(audioData)==2;
if isCombined
label = audioData{2};
audioData = audioData{1};
end
% Scale
audioData = audioData/max(abs(audioData(:)));
% Extract features
numSegments = size(idx,1);
features = cell(numSegments,1);
for ii = 1:numSegments
temp = single(extract(afe,audioData(idx(ii,1):idx(ii,2))));
if isempty(temp)
temp = zeros(15,30,"single");
end
features{ii} = temp;
end
if ~isempty(nvargs.Factors)
% Standardize features
features = cellfun(@(x)(x-nvargs.Factors.Mean)./nvargs.Factors.STD,features,UniformOutput=false);
if ~nvargs.Segment
features = {cat(1,features{:})};
end
if isCombined
output = {features,repelem(label,numel(features),1)};
else
output = features;
end
end
Speaker Diarization Using x-vectors
Speaker diarization is the process of partitioning an audio signal into segments according to speaker
identity. It answers the question "who spoke when" without prior knowledge of the speakers and,
depending on the application, without prior knowledge of the number of speakers.
Speaker diarization has many applications, including: enhancing speech transcription by structuring
text according to active speaker, video captioning, content retrieval (what did Jane say?) and speaker
counting (how many speakers were present in the meeting?).
In this example, you perform speaker diarization using a pretrained x-vector system [1] on page 1-
645 to characterize regions of audio and agglomerative hierarchical clustering (AHC) to group
similar regions of audio [2] on page 1-645. To see how the x-vector system was defined and trained,
see “Speaker Recognition Using x-vectors” on page 1-618.
Download the pretrained speaker diarization system and supporting files. The total size is
approximately 22 MB.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","SpeakerDiarization.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
netFolder = fullfile(dataFolder,"SpeakerDiarization");
addpath(netFolder)
Load an audio signal and a table containing ground truth annotations. The signal consists of five
speakers. Listen to the audio signal and plot its time-domain waveform.
[audioIn,fs] = audioread("exampleconversation.flac");
load("exampleconversationlabels.mat")
audioIn = audioIn./max(abs(audioIn));
sound(audioIn,fs)
t = (0:size(audioIn,1)-1)/fs;
figure(1)
plot(t,audioIn)
xlabel("Time (s)")
ylabel("Amplitude")
axis tight
Extract x-vectors
In this example, you use a pretrained x-vector system based on [1] on page 1-
vector system was defined and trained, see “Speaker Recognition Using x-vectors” on page 1-618.
Load the lightweight pretrained x-vector system. The x-vector system consists of:
• afe - an audioFeatureExtractor object that extracts the MFCC features
• factors - the mean and standard deviation used to standardize the features
• dlnet - the trained TDNN used to extract x-vectors
• projMat - the trained LDA projection matrix used to reduce the dimensionality of the x-vectors
• plda - the trained PLDA model used for scoring
Extract standardized MFCC features from the audio data. View the feature distributions to confirm
that the standardization factors learned from a separate data set approximately standardize the
features derived in this example. A standard distribution has a mean of zero and a standard deviation
of 1.
features = single((extract(xvecsys.afe,audioIn)-xvecsys.factors.Mean')./xvecsys.factors.STD');
figure(2)
histogram(features)
xlabel("Standardized MFCC")
Extract x-Vectors
Each acoustic feature vector represents approximately 0.01 seconds of audio data. Group the
features into approximately 2 second segments with 0.1 second hops between segments.
featureVectorHopDur = (numel(xvecsys.afe.Window) - xvecsys.afe.OverlapLength)/xvecsys.afe.SampleRate;
segmentDur = ;
segmentHopDur = ;
segmentLength = round(segmentDur/featureVectorHopDur);
segmentHop = round(segmentHopDur/featureVectorHopDur);
idx = 1:segmentLength;
featuresSegmented = [];
while idx(end) < size(features,1)
featuresSegmented = cat(3,featuresSegmented,features(idx,:));
idx = idx + segmentHop;
end
Extract x-vectors from each segment. x-vectors correspond to the output from the first fully-
connected layer in the x-vector model trained in “Speaker Recognition Using x-vectors” on page 1-
618. The first fully-connected layer is the first segment-level layer after statistics are calculated for
the time-dilated frame-level layers. Visualize the x-vectors over time.
xvecs = zeros(512,size(featuresSegmented,3));
for sample = 1:size(featuresSegmented,3)
dlX = dlarray(featuresSegmented(:,:,sample),"TCB");
xvecs(:,sample) = predict(xvecsys.dlnet,dlX,Outputs="fc_1");
end
figure(3)
surf(xvecs',EdgeColor="none")
view([90,-90])
axis([1 size(xvecs,1) 1 size(xvecs,2)])
xlabel("Features")
ylabel("Segment")
Apply the pretrained linear discriminant analysis (LDA) projection matrix to reduce the
dimensionality of the x-vectors and then visualize the x-vectors over time.
x = xvecsys.projMat*xvecs;
figure(4)
surf(x',EdgeColor="none")
view([90,-90])
axis([1 size(x,1) 1 size(x,2)])
xlabel("Features")
ylabel("Segment")
Cluster x-vectors
An x-vector system learns to extract compact representations (x-vectors) of speakers. Cluster the x-
vectors to group similar regions of audio using either agglomerative hierarchical clustering
(clusterdata (Statistics and Machine Learning Toolbox)) or k-means clustering (kmeans (Statistics
and Machine Learning Toolbox)). [2] on page 1-645 suggests using agglomerative hierarchical
clustering with PLDA scoring as the distance measurement. K-means clustering using a cosine
similarity score is also commonly used. Assume prior knowledge of the number of speakers in the
audio. Set the maximum clusters to the number of known speakers + 1 so that the background is
clustered independently.
knownNumberOfSpeakers = numel(unique(groundTruth.Label));
maxclusters = knownNumberOfSpeakers + 1;
clusterMethod = ;
switch clusterMethod
case "agglomerative - PLDA scoring"
T = clusterdata(x',Criterion="distance",distance=@(a,b)helperPLDAScorer(a,b,xvecsys.plda),linkage="average",maxclust=maxclusters);
case "agglomerative - CSS scoring"
T = clusterdata(x',Criterion="distance",distance="cosine",linkage="average",maxclust=maxclusters);
case "kmeans - CSS scoring"
T = kmeans(x',maxclusters,Distance="cosine");
end
figure(5)
tiledlayout(2,1)
nexttile
plot(t,audioIn)
axis tight
ylabel("Amplitude")
xlabel("Time (s)")
nexttile
plot(T)
axis tight
ylabel("Cluster Index")
xlabel("Segment")
To isolate segments of speech corresponding to clusters, map the segments back to audio samples.
Plot the results.
mask = zeros(size(audioIn,1),1);
start = round((segmentDur/2)*fs);
segmentHopSamples = round(segmentHopDur*fs);
mask(1:start) = T(1);
start = start + 1;
for ii = 1:numel(T)
finish = start + segmentHopSamples;
mask(start:start + segmentHopSamples) = T(ii);
start = finish + 1;
end
mask(finish:end) = T(end);
figure(6)
tiledlayout(2,1)
nexttile
plot(t,audioIn)
axis tight
nexttile
plot(t,mask)
ylabel("Cluster Index")
axis tight
xlabel("Time (s)")
Use detectSpeech to determine speech regions. Use sigroi2binmask to convert speech regions
to a binary voice activity detection (VAD) mask. Call detectSpeech a second time without any
arguments to plot the detected speech regions.
mergeDuration = ;
VADidx = detectSpeech(audioIn,fs,MergeDistance=fs*mergeDuration);
VADmask = sigroi2binmask(VADidx,numel(audioIn));
figure(7)
detectSpeech(audioIn,fs,MergeDistance=fs*mergeDuration)
Apply the VAD mask to the speaker mask and plot the results. A cluster index of 0 indicates a region
of no speech.
mask = mask.*VADmask;
figure(8)
tiledlayout(2,1)
nexttile
plot(t,audioIn)
axis tight
nexttile
plot(t,mask)
ylabel("Cluster Index")
axis tight
xlabel("Time (s)")
In this example, you assume each detected speech region belongs to a single speaker. If more than one cluster label is present in a speech region, merge the labels to the most frequently occurring label.
maskLabels = zeros(size(VADidx,1),1);
for ii = 1:size(VADidx,1)
maskLabels(ii) = mode(mask(VADidx(ii,1):VADidx(ii,2)),"all");
mask(VADidx(ii,1):VADidx(ii,2)) = maskLabels(ii);
end
figure(9)
tiledlayout(2,1)
nexttile
plot(t,audioIn)
axis tight
nexttile
plot(t,mask)
ylabel("Cluster Index")
axis tight
xlabel("Time (s)")
uniqueSpeakerClusters = unique(maskLabels);
numSpeakers = numel(uniqueSpeakerClusters)
numSpeakers = 5
Create a signalMask object and then plot the speaker clusters. Label the plot with the ground truth
labels. The cluster labels are color coded with a key on the right of the plot. The true labels are
printed above the plot.
msk = signalMask(table(VADidx,categorical(maskLabels)));
figure(10)
plotsigroi(msk,audioIn,true)
axis([0 numel(audioIn) -1 1])
trueLabel = groundTruth.Label;
for ii = 1:numel(trueLabel)
text(VADidx(ii,1),1.1,trueLabel(ii),FontWeight="bold")
end
Choose a cluster to inspect and then use binmask to isolate the speaker. Plot the isolated speech
signal and listen to the speaker cluster.
speakerToInspect = ;
cutOutSilenceFromAudio = ;
bmsk = binmask(msk,numel(audioIn));
audioToPlay = audioIn;
if cutOutSilenceFromAudio
audioToPlay(~bmsk(:,speakerToInspect)) = [];
end
sound(audioToPlay,fs)
figure(11)
tiledlayout(2,1)
nexttile
plot(t,audioIn)
axis tight
ylabel("Amplitude")
nexttile
plot(t,audioIn.*bmsk(:,speakerToInspect))
axis tight
xlabel("Time (s)")
ylabel("Amplitude")
title("Speaker Group "+speakerToInspect)
The common metric for speaker diarization systems is the diarization error rate (DER). The DER is
the sum of the miss rate (classifying speech as non-speech), the false alarm rate (classifying non-
speech as speech) and the speaker error rate (confusing one speaker's speech for another).
In this simple example, the miss rate and false alarm rate are trivial problems. You evaluate the
speaker error rate only.
Map each true speaker to the corresponding best-fitting speaker cluster. To determine the speaker
error rate, count the number of mismatches between the true speakers and the best-fitting speaker
clusters, and then divide by the number of true speaker regions.
uniqueLabels = unique(trueLabel);
guessLabels = maskLabels;
uniqueGuessLabels = unique(guessLabels);
totalNumErrors = 0;
for ii = 1:numel(uniqueLabels)
isSpeaker = uniqueLabels(ii)==trueLabel;
minNumErrors = inf;
for jj = 1:numel(uniqueGuessLabels)
groupCandidate = uniqueGuessLabels(jj) == guessLabels;
numErrors = nnz(isSpeaker - groupCandidate);
if numErrors < minNumErrors
minNumErrors = numErrors;
bestCandidate = jj;
end
minNumErrors = min(minNumErrors,numErrors);
end
uniqueGuessLabels(bestCandidate) = [];
totalNumErrors = totalNumErrors + minNumErrors;
if isempty(uniqueGuessLabels)
break
end
end
SpeakerErrorRate = totalNumErrors/numel(trueLabel)
SpeakerErrorRate = 0
References
[1] Snyder, David, et al. “X-Vectors: Robust DNN Embeddings for Speaker Recognition.” 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp.
5329–33. DOI.org (Crossref), doi:10.1109/ICASSP.2018.8461375.
[2] Sell, G., Snyder, D., McCree, A., Garcia-Romero, D., Villalba, J., Maciejewski, M., Manohar, V.,
Dehak, N., Povey, D., Watanabe, S., Khudanpur, S. (2018) Diarization is Hard: Some Experiences and
Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge. Proc. Interspeech 2018,
2808-2812, DOI: 10.21437/Interspeech.2018-1893.
Train Spoken Digit Recognition Network Using Out-of-Memory Features
This example trains a spoken digit recognition network on out-of-memory auditory spectrograms
using a transformed datastore. In this example, you extract auditory spectrograms from audio using
audioDatastore and audioFeatureExtractor, and you write them to disk. You then use a
signalDatastore to access the features during training. The workflow is useful when the training
features do not fit in memory. In this workflow, you only extract features once, which speeds up your
workflow if you are iterating on the deep learning model design.
Data
Download the Free Spoken Digit Data Set (FSDD). FSDD consists of 2000 recordings of four speakers
saying the numbers 0 through 9 in English.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"FSDD");
ads = audioDatastore(dataset,IncludeSubfolders=true);
[~,filenames] = fileparts(ads.Files);
ads.Labels = categorical(extractBefore(filenames,'_'));
summary(ads.Labels)
0 200
1 200
2 200
3 200
4 200
5 200
6 200
7 200
8 200
9 200
Split the FSDD into training and test sets. Allocate 80% of the data to the training set and retain 20%
for the test set. You use the training set to train the model and the test set to validate the trained
model.
rng default
ads = shuffle(ads);
[adsTrain,adsTest] = splitEachLabel(ads,0.8);
countEachLabel(adsTrain)
ans=10×2 table
Label Count
_____ _____
0 160
1 160
2 160
3 160
4 160
5 160
6 160
7 160
8 160
9 160
countEachLabel(adsTest)
ans=10×2 table
Label Count
_____ _____
0 40
1 40
2 40
3 40
4 40
5 40
6 40
7 40
8 40
9 40
To train the network with the entire dataset and achieve the highest possible accuracy, set
speedupExample to false. To run this example quickly, set speedupExample to true.
speedupExample = ;
if speedupExample
adsTrain = splitEachLabel(adsTrain,2);
adsTest = splitEachLabel(adsTest,2);
end
Define parameters used to extract mel-frequency spectrograms. Use 220 ms windows with 10 ms
hops between windows. Use a 2048-point DFT and 40 frequency bands.
fs = 8000;
frameDuration = 0.22;
frameLength = round(frameDuration*fs);
hopDuration = 0.01;
hopLength = round(hopDuration*fs);
segmentLength = 8192;
numBands = 40;
fftLength = 2048;
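The construction of the audioFeatureExtractor object afe does not appear in this listing. A sketch consistent with the parameters above (the window choice is an assumption):
% Sketch (assumed): create the feature extractor used by getSpeechSpectrogram.
afe = audioFeatureExtractor( ...
    SampleRate=fs, ...
    Window=hann(frameLength,"periodic"), ...
    OverlapLength=frameLength - hopLength, ...
    FFTLength=fftLength, ...
    melSpectrum=true);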
setExtractorParameters(afe,"melSpectrum",NumBands=numBands,FrequencyRange=[50 fs/2],WindowNormali
Create a transformed datastore that computes mel-frequency spectrograms from audio data. The
supporting function, getSpeechSpectrogram on page 1-651, standardizes the recording length
and normalizes the amplitude of the audio input. getSpeechSpectrogram uses the
audioFeatureExtractor object afe to obtain the log-based mel-frequency spectrograms.
adsSpecTrain = transform(adsTrain,@(x)getSpeechSpectrogram(x,afe,segmentLength));
Use writeall to write auditory spectrograms to disk. Set UseParallel to true to perform writing
in parallel.
outputLocation = fullfile(tempdir,"FSDD_Features");
writeall(adsSpecTrain,outputLocation,WriteFcn=@myCustomWriter,UseParallel=true);
Create a signalDatastore that points to the out-of-memory features. The read function returns a
spectrogram/label pair.
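The creation of the signalDatastore does not appear in this listing. A sketch of how it could be set up (the name-value choices are assumptions):
% Sketch (assumed): point a signalDatastore at the MAT files written above and
% return the spectrogram and label together on each read.
sds = signalDatastore(outputLocation, ...
    IncludeSubfolders=true, ...
    SignalVariableNames=["spec","label"], ...
    ReadOutputOrientation="row");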
Validation Data
adsTestT = transform(adsTest,@(x){getSpeechSpectrogram(x,afe,segmentLength)});
XTest = readall(adsTestT);
XTest = cat(4,XTest{:});
YTest = adsTest.Labels;
Construct a small CNN as an array of layers. Use convolutional and batch normalization layers, and
downsample the feature maps using max pooling layers. To reduce the possibility of the network
memorizing specific features of the training data, add a small amount of dropout to the input to the
last fully connected layer.
sz = size(XTest);
specSize = sz(1:2);
imageSize = [specSize 1];
numClasses = numel(categories(YTest));
dropoutProb = 0.2;
numF = 12;
layers = [
imageInputLayer(imageSize,Normalization="none")
convolution2dLayer(5,numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,2*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(2)
dropoutLayer(dropoutProb)
fullyConnectedLayer(numClasses)
softmaxLayer
];
classes = categories(YTest);
Set the hyperparameters to use in training the network. Use a mini-batch size of 50 and a learning
rate of 1e-4. Specify "adam" optimization. To use the parallel pool to read the transformed datastore,
set PreprocessingEnvironment to "parallel". For more information, see trainingOptions
(Deep Learning Toolbox).
miniBatchSize = 50;
options = trainingOptions("adam", ...
InitialLearnRate=1e-4, ...
MaxEpochs=30, ...
LearnRateSchedule="piecewise", ...
LearnRateDropFactor=0.1, ...
LearnRateDropPeriod=15, ...
MiniBatchSize=miniBatchSize, ...
Shuffle="every-epoch", ...
Plots="training-progress", ...
Metrics="accuracy", ...
Verbose=false, ...
ValidationData={XTest,YTest}, ...
ValidationFrequency=ceil(numel(adsTrain.Files)/miniBatchSize), ...
ExecutionEnvironment="auto", ...
PreprocessingEnvironment="parallel");
trainedNet = trainnet(sds,layers,"crossentropy",options);
Use the trained network to predict the digit labels for the test set.
scores = predict(trainedNet,XTest);
Ypredicted = scores2label(scores,classes);
cnnAccuracy = sum(Ypredicted==YTest)/numel(YTest)*100
cnnAccuracy = 96
Summarize the performance of the trained network on the test set with a confusion chart. Display the
precision and recall for each class by using column and row summaries. The table at the bottom of
the confusion chart shows the precision values. The table to the right of the confusion chart shows
the recall values.
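The confusion chart code does not appear on this page. A sketch that matches the description above:
% Sketch (assumed): confusion chart with per-class precision (column summary)
% and recall (row summary).
figure
confusionchart(YTest,Ypredicted, ...
    ColumnSummary="column-normalized", ...
    RowSummary="row-normalized");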
Supporting Functions
function X = getSpeechSpectrogram(x,afe,segmentLength)
% getSpeechSpectrogram(x,afe,params) computes a speech spectrogram for the
% signal x using the audioFeatureExtractor afe.
x = scaleAndResize(single(x),segmentLength);
spec = extract(afe,x).';
X = log10(spec + 1e-6);
end
function x = scaleAndResize(x,segmentLength)
% scaleAndResize(x,segmentLength) trims or zero-pads x to segmentLength
% samples and normalizes its amplitude.
L = segmentLength;
N = size(x,1);
if N > L
x = x(1:L,:);
elseif N < L
pad = L - N;
prepad = floor(pad/2);
postpad = ceil(pad/2);
x = [zeros(prepad,1);x;zeros(postpad,1)];
end
x = x./max(abs(x));
end
function myCustomWriter(spec,writeInfo,~)
% myCustomWriter(spec,writeInfo,~) writes an auditory spectrogram/label
% pair to MAT files.
filename = strrep(writeInfo.SuggestedOutputName,".wav",".mat");
label = writeInfo.ReadInfo.Label;
save(filename,"label","spec");
end
See Also
audioFeatureExtractor | audioDatastore | signalDatastore
Related Examples
• “Train Spoken Digit Recognition Network Using Out-of-Memory Audio Data” on page 1-653
• “Accelerate Audio Machine Learning Workflows Using a GPU” on page 1-249
Train Spoken Digit Recognition Network Using Out-of-Memory Audio Data
This example trains a spoken digit recognition network on out-of-memory audio data using a
transformed datastore. In this example, you apply a random pitch shift to audio data used to train a
convolutional neural network (CNN). For each training iteration, the audio data is augmented using
the audioDataAugmenter object and then features are extracted using the
audioFeatureExtractor object. The workflow in this example applies to any random data
augmentation used in a training loop. The workflow also applies when the underlying audio data set
or training features do not fit in memory.
Data
Download the Free Spoken Digit Data Set (FSDD). FSDD consists of 2000 recordings of four speakers
saying the numbers 0 through 9 in English.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"FSDD");
ads = audioDatastore(dataset,IncludeSubfolders=true,OutputDataType="single");
Decode the file names to set the labels on the datastore. Display the classes and the number of
examples in each class.
labels = filenames2labels(ads,ExtractBefore="_");
summary(labels)
0 200
1 200
2 200
3 200
4 200
5 200
6 200
7 200
8 200
9 200
Split the FSDD into training and test sets. Allocate 90% of the data to the training set and retain 10%
for the test set. You use the training set to train the model and the test set to validate the trained
model.
idxs = splitlabels(labels,0.9,"randomized");
adsTrain = subset(ads,idxs{1});
adsTest = subset(ads,idxs{2});
labelsTrain = labels(idxs{1});
labelsTest = labels(idxs{2});
classes = unique(labelsTrain);
To train the network with the entire dataset and achieve the highest possible accuracy, set
speedupExample to false. To run this example quickly, set speedupExample to true.
speedupExample = false;
if speedupExample
adsTrain = subset(adsTrain,1:90:numel(labelsTrain));
adsTest = subset(adsTest,1:10:numel(labelsTest));
labelsTrain = labelsTrain(1:90:numel(labelsTrain));
labelsTest = labelsTest(1:10:numel(labelsTest));
end
Data Augmentation
Augment the training data by applying pitch shifting with an audioDataAugmenter object.
Create an audioDataAugmenter. The augmenter applies pitch shifting on an input audio signal with
a 0.5 probability. The augmenter selects a random pitch shifting value in the range [–12 12]
semitones.
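The audioDataAugmenter construction itself is not shown here. A minimal sketch consistent with the description above (pitch shifting applied with 0.5 probability over a [-12 12] semitone range, all other augmentations disabled):
augmenter = audioDataAugmenter( ...
    AugmentationMode="independent", ...
    NumAugmentations=1, ...
    ApplyPitchShift=true, ...
    PitchShiftProbability=0.5, ...
    SemitoneShiftRange=[-12 12], ...
    ApplyTimeStretch=false, ...
    ApplyVolumeControl=false, ...
    ApplyAddNoise=false, ...
    ApplyTimeShift=false);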
Set custom pitch-shifting parameters. Use identity phase locking and preserve formants using
spectral envelope estimation with 30th order cepstral analysis.
setAugmenterParams(augmenter,"shiftPitch",LockPhase=true,PreserveFormants=true,CepstralOrder=30);
Create a transformed datastore that applies data augmentation to the training data.
fs = 8000;
adsAugTrain = transform(adsTrain,@(y)deal(augment(augmenter,y,fs).Audio{1}));
Define parameters used to extract mel-frequency spectrograms. Use 220 ms windows with 10 ms
hops between windows. Use a 2048-point DFT and 40 frequency bands.
frameDuration = 0.22;
frameLength = round(frameDuration*fs);
hopDuration = 0.01;
hopLength = round(hopDuration*fs);
segmentLength = 8192;
numBands = 40;
fftLength = 2048;
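% The audioFeatureExtractor construction is not shown here. A minimal
% sketch, assuming a periodic Hamming window of frameLength samples and
% mel spectrum extraction:
afe = audioFeatureExtractor(SampleRate=fs, ...
    Window=hamming(frameLength,"periodic"), ...
    OverlapLength=frameLength - hopLength, ...
    FFTLength=fftLength, ...
    melSpectrum=true);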
setExtractorParameters(afe,"melSpectrum", ...
NumBands=numBands, ...
FrequencyRange=[50 fs/2], ...
WindowNormalization=true, ...
ApplyLog=true);
Create a transformed datastore that computes mel-frequency spectrograms from pitch-shifted audio
data. The supporting function, getSpeechSpectrogram on page 1-658, standardizes the recording
length and normalizes the amplitude of the audio input. getSpeechSpectrogram uses the
audioFeatureExtractor object to obtain the log-based mel-frequency spectrograms.
adsSpecTrain = transform(adsAugTrain,@(x)getSpeechSpectrogram(x,afe,segmentLength));
Training Labels
labelsTrain = arrayDatastore(labelsTrain);
Create a combined datastore that points to the mel-frequency spectrogram data and the
corresponding labels.
tdsTrain = combine(adsSpecTrain,labelsTrain);
Validation Data
adsTestT = transform(adsTest,@(x){getSpeechSpectrogram(x,afe,segmentLength)});
XTest = readall(adsTestT);
XTest = cat(4,XTest{:});
Construct a small CNN as an array of layers. Use convolutional and batch normalization layers, and
downsample the feature maps using max pooling layers. To reduce the possibility of the network
memorizing specific features of the training data, add a small amount of dropout to the input to the
last fully connected layer.
sz = size(XTest);
specSize = sz(1:2);
imageSize = [specSize 1];
numClasses = numel(classes);
dropoutProb = 0.2;
numF = 12;
layers = [
imageInputLayer(imageSize,Normalization="none")
convolution2dLayer(5,numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,2*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(3,Stride=2,Padding="same")
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
convolution2dLayer(3,4*numF,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer(2)
dropoutLayer(dropoutProb)
fullyConnectedLayer(numClasses)
softmaxLayer
];
Set the hyperparameters to use in training the network. Use a mini-batch size of 128 and a learning
rate of 1e-4. Specify "adam" optimization. To use the parallel pool to read the transformed datastore,
set PreprocessingEnvironment to "parallel". For more information, see trainingOptions
(Deep Learning Toolbox).
miniBatchSize = 128;
options = trainingOptions("adam", ...
Metrics="accuracy", ...
InitialLearnRate=1e-4, ...
MaxEpochs=40, ...
LearnRateSchedule="piecewise", ...
LearnRateDropFactor=0.1, ...
LearnRateDropPeriod=30, ...
MiniBatchSize=miniBatchSize, ...
Shuffle="every-epoch", ...
Plots="training-progress", ...
Verbose=false, ...
ValidationData={XTest,labelsTest}, ...
ValidationFrequency=ceil(2*numel(adsTrain.Files)/miniBatchSize), ...
ValidationPatience=5, ...
ExecutionEnvironment="auto", ...
OutputNetwork="best-validation");
trainedNet = trainnet(tdsTrain,layers,"crossentropy",options);
Use the trained network to predict the digit labels for the test set.
probs = minibatchpredict(trainedNet,XTest);
Ypredicted = scores2label(probs,classes);
cnnAccuracy = mean(Ypredicted==labelsTest)*100
cnnAccuracy = 93.5000
Summarize the performance of the trained network on the test set with a confusion chart. Display the
precision and recall for each class by using column and row summaries. The table at the bottom of
the confusion chart shows the precision values. The table to the right of the confusion chart shows
the recall values.
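The confusion chart code is not shown here. A minimal sketch using confusionchart:
figure(Units="normalized",Position=[0.2 0.2 0.5 0.5])
confusionchart(labelsTest,Ypredicted, ...
    ColumnSummary="column-normalized",RowSummary="row-normalized");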
Supporting Functions
function X = getSpeechSpectrogram(x,afe,segmentLength)
% getSpeechSpectrogram(x,afe,segmentLength) computes a speech spectrogram for the
% signal x using the audioFeatureExtractor afe.
x = resize(x,segmentLength,Side="both");
x = x./max(abs(x));
X = extract(afe,x).';
end
Keyword Spotting in Noise Code Generation with Intel MKL-DNN
This example demonstrates code generation for keyword spotting using a Bidirectional Long Short-
Term Memory (BiLSTM) network and mel frequency cepstral coefficient (MFCC) feature extraction.
MATLAB® Coder™ with Deep Learning Support enables the generation of a standalone executable
(.exe) file. Communication between the MATLAB® (.mlx) file and the generated executable file
occurs over asynchronous User Datagram Protocol (UDP). The incoming speech signal is displayed
using a timescope. A mask is shown as a blue rectangle surrounding spotted instances of the
keyword, YES. For more details on MFCC feature extraction and deep learning network training, visit
“Keyword Spotting in Noise Using MFCC and LSTM Networks” on page 1-481.
Example Requirements
For supported versions of libraries and for information about setting up environment variables, see
“Prerequisites for Deep Learning with MATLAB Coder” (MATLAB Coder).
Pretrained Network Keyword Spotting Using MATLAB and Streaming Audio from
Microphone
The sample rate of the pretrained network is 16 kHz. Set the window length to 512 samples, with an
overlap length of 384 samples, and a hop length defined as the difference between the window and
overlap lengths. Define the rate at which the mask is estimated. A mask is generated once for every
numHopsPerUpdate audio frames.
fs = 16e3;
windowLength = 512;
overlapLength = 384;
hopLength = windowLength - overlapLength;
numHopsPerUpdate = 16;
maskLength = hopLength*numHopsPerUpdate;
Download and load the pretrained network, as well as the mean (M) and the standard deviation (S)
vectors used for Feature Standardization.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio/examples","kwslstm.zip");
dataFolder = './';
netFolder = fullfile(dataFolder,"KeywordSpotting");
unzip(downloadFolder,netFolder)
load(fullfile(netFolder,'KWSNet.mat'),"KWSNet","M","S");
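% The audioFeatureExtractor used for MFCC extraction is not shown here.
% A minimal sketch consistent with the referenced MFCC/LSTM example (the
% exact feature set is an assumption):
afe = audioFeatureExtractor(SampleRate=fs, ...
    Window=hann(windowLength,"periodic"), ...
    OverlapLength=overlapLength, ...
    mfcc=true,mfccDelta=true,mfccDeltaDelta=true);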
generateMATLABFunction(afe,'generateKeywordFeatures','IsStreaming',true);
Define an Audio Device Reader that can read audio from your microphone. Set the frame length equal
to the hop length. This enables you to compute a new set of features for every new audio frame from
the microphone.
frameLength = hopLength;
adr = audioDeviceReader('SampleRate',fs, ...
'SamplesPerFrame',frameLength);
Create a Time Scope to visualize the speech signals and estimated mask.
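The scope construction is not shown here. A minimal sketch, assuming a two-channel timescope for the speech signal and the estimated mask (the display settings are assumptions):
scope = timescope(SampleRate=fs, ...
    TimeSpanSource="property",TimeSpan=5, ...
    YLimits=[-1.2 1.2], ...
    Title="Keyword Spotting", ...
    ShowGrid=true,ShowLegend=true, ...
    ChannelNames={'Speech','Keyword Mask'});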
Initialize a buffer for the audio data, a buffer for the computed features, and a buffer to plot the input
audio and the output speech mask.
dataBuff = dsp.AsyncBuffer(windowLength);
featureBuff = dsp.AsyncBuffer(numHopsPerUpdate);
plotBuff = dsp.AsyncBuffer(numHopsPerUpdate*windowLength);
Perform keyword spotting on speech received from your microphone. To run the loop indefinitely, set
timeLimit to Inf. To stop the simulation, close the scope.
timeLimit = 20;
show(scope);
tic
while toc < timeLimit && isVisible(scope)
data = adr();
write(dataBuff,data);
write(plotBuff,data);
frame = read(dataBuff,windowLength,overlapLength);
features = generateKeywordFeatures(frame,fs);
write(featureBuff,features.');
if featureBuff.NumUnreadSamples == numHopsPerUpdate
featureMatrix = read(featureBuff);
featureMatrix(~isfinite(featureMatrix)) = 0;
featureMatrix = (featureMatrix - M)./S;
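% The network inference call is not shown here. A minimal sketch, assuming
% KWSNet returns frame-by-class scores and an updated network state for the
% standardized feature matrix:
[scores,state] = predict(KWSNet,featureMatrix);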
KWSNet.State = state;
[~,v] = max(scores,[],2);
v = double(v) - 1;
v = mode(v);
predictedMask = repmat(v,numHopsPerUpdate*hopLength,1);
data = read(plotBuff);
scope([data,predictedMask]);
drawnow limitrate;
end
end
release(adr)
hide(scope)
The supporting function uses a dsp.UDPSender System object to send the input data along with the
output mask predicted by the network to MATLAB. The MATLAB script uses the dsp.UDPReceiver
System object to receive the input data along with the output mask predicted by the network running
in the supporting function.
Create a code generation configuration object to generate an executable. Specify the target language
as C++.
cfg = coder.config('exe');
cfg.TargetLang = 'C++';
Create a configuration object for deep learning code generation with the MKL-DNN library. Attach
the deep learning configuration object to the code generation configuration object.
dlcfg = coder.DeepLearningConfig('mkldnn');
cfg.DeepLearningConfig = dlcfg;
Generate the C++ main file required to produce the standalone executable.
cfg.GenerateExampleMain = 'GenerateCodeAndCompile';
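% The code generation command is not shown here. A minimal sketch, assuming
% the entry-point function helperKeywordSpotting takes no input arguments
% (add -args if your helper's signature differs):
codegen -config cfg helperKeywordSpotting -report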
In this section, you generate all the required dependency files and put them into a single folder.
During the build process, MATLAB Coder generates buildInfo.mat, a file that contains the
compilation and run-time dependency information for the standalone executable.
projName = 'helperKeywordSpotting';
packageName = [projName,'Package'];
if ispc
exeName = [projName,'.exe'];
else
exeName = projName;
end
Load buildInfo.mat and use packNGo (MATLAB Coder) to produce a .zip package.
load(['codegen',filesep,'exe',filesep,projName,filesep,'buildInfo.mat']);
packNGo(buildInfo,'fileName',[packageName,'.zip'],'minimalHeaders',false);
Unzip the package and place the executable file in the unzipped directory.
unzip([packageName,'.zip'],packageName);
copyfile(exeName, packageName,'f');
To invoke a standalone executable that depends on the MKL-DNN Dynamic Link Library, append the
path to the MKL-DNN library location to the environment variable PATH.
setenv('PATH',[getenv('INTEL_MKLDNN'),filesep,'lib',pathsep,getenv('PATH')]);
if ispc
system(['start cmd /k "title ',packageName,' && cd ',packageName,' && ',exeName]);
else
cd(packageName);
system(['./',exeName,' &']);
cd ..;
end
Create a dsp.UDPReceiver System object to receive speech data and the predicted speech mask
from the standalone executable. Each UDP packet received from the executable consists of
maskLength mask samples and speech samples. The maximum message length for the
dsp.UDPReceiver object is 65507 bytes. Calculate the buffer size to accommodate the maximum
number of UDP packets.
sizeOfFloatInBytes = 4;
speechDataLength = maskLength;
numElementsPerUDPPacket = maskLength + speechDataLength;
maxUDPMessageLength = floor(65507/sizeOfFloatInBytes);
samplesPerPacket = 1 + numElementsPerUDPPacket;
numPackets = floor(maxUDPMessageLength/samplesPerPacket);
bufferSize = numPackets*samplesPerPacket*sizeOfFloatInBytes;
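% The dsp.UDPReceiver construction is not shown here. A minimal sketch (the
% local port is an assumption; it must match the port used by the
% executable):
UDPReceive = dsp.UDPReceiver(LocalIPPort=20000, ...
    MessageDataType="single", ...
    MaximumMessageLength=samplesPerPacket, ...
    ReceiveBufferSize=bufferSize);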
To run the keyword spotting indefinitely, set timelimit to Inf. To stop the simulation, close the
scope.
tic;
timelimit = 20;
show(scope);
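% The receive-and-plot loop is not shown here. A minimal sketch that mirrors
% the Raspberry Pi version of this example (any extra flag element in the
% packet is ignored in this sketch):
while toc < timelimit && isVisible(scope)
    data = UDPReceive();
    if ~isempty(data)
        mask = data(1:maskLength);
        dataForPlot = data(maskLength + 1:numElementsPerUDPPacket);
        scope([dataForPlot,mask]);
    end
    drawnow limitrate;
end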
hide(scope);
SUCCESS: The process with PID 17424 (child process of PID 1568) has been terminated.
A similar workflow involves using a MEX file instead of the standalone executable. Perform MEX
profiling to measure the computation time for the workflow.
Create a code generation configuration object to generate the MEX function. Specify the target
language as C++.
cfg = coder.config('mex');
cfg.TargetLang = 'C++';
Create a configuration object for deep learning code generation with the MKL-DNN library. Attach
the deep learning configuration object to the code generation configuration object.
dlcfg = coder.DeepLearningConfig('mkldnn');
cfg.DeepLearningConfig = dlcfg;
x = pinknoise(hopLength,1,'single');
numPredictCalls = 100;
totalNumCalls = numPredictCalls*numHopsPerUpdate;
exeTimeStart = tic;
for call = 1:totalNumCalls
[outputMask,inputData,plotFlag] = profileKeywordSpotting(x);
end
exeTime = toc(exeTimeStart);
fprintf('MATLAB execution time per %d ms of audio = %0.4f ms\n',int32(1000*numHopsPerUpdate*hopLe
exeTimeMexStart = tic;
for call = 1:totalNumCalls
[outputMask,inputData,plotFlag] = profileKeywordSpotting_mex(x);
end
exeTimeMex = toc(exeTimeMexStart);
fprintf('MEX execution time per %d ms of audio = %0.4f ms\n',int32(1000*numHopsPerUpdate*hopLengt
Compare total execution time of the standalone executable approach with the MEX function
approach. This performance test is done on a machine using an NVIDIA Titan Xp® (compute
capability 6.1) GPU with 12.8 GB memory and an Intel Xeon W-2133 CPU running at 3.60 GHz.
PerformanceGain = exeTime/exeTimeMex
PerformanceGain = 3.0739
Keyword Spotting in Noise Code Generation on Raspberry Pi
This example demonstrates code generation for keyword spotting using a Bidirectional Long Short-
Term Memory (BiLSTM) network and mel frequency cepstral coefficient (MFCC) feature extraction on
Raspberry Pi™. MATLAB® Coder™ with Deep Learning Support enables the generation of a
standalone executable (.elf) file on Raspberry Pi. Communication between MATLAB® (.mlx) file and
the generated executable file occurs over asynchronous User Datagram Protocol (UDP). The incoming
speech signal is displayed using a timescope. A mask is shown as a blue rectangle surrounding
spotted instances of the keyword, YES. For more details on MFCC feature extraction and deep
learning network training, visit “Keyword Spotting in Noise Using MFCC and LSTM Networks” on
page 1-481.
Example Requirements
For supported versions of libraries and for information about setting up environment variables, see
“Prerequisites for Deep Learning with MATLAB Coder” (MATLAB Coder).
Pretrained Network Keyword Spotting Using MATLAB® and Streaming Audio from
Microphone
The sample rate of the pretrained network is 16 kHz. Set the window length to 512 samples, with an
overlap length of 384 samples, and a hop length defined as the difference between the window and
overlap lengths. Define the rate at which the mask is estimated. A mask is generated once for every
numHopsPerUpdate audio frames.
fs = 16e3;
windowLength = 512;
overlapLength = 384;
hopLength = windowLength - overlapLength;
numHopsPerUpdate = 16;
maskLength = hopLength * numHopsPerUpdate;
Download and load the pretrained network, as well as the mean (M) and the standard deviation (S)
vectors used for feature standardization.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio/examples","kwslstm.zip");
dataFolder = './';
netFolder = fullfile(dataFolder,"KeywordSpotting");
unzip(downloadFolder,netFolder)
load(fullfile(netFolder,'KWSNet.mat'),"KWSNet","M","S");
generateMATLABFunction(afe,'generateKeywordFeatures','IsStreaming',true);
Define an Audio Device Reader System object™ to read audio from your microphone. Set the frame
length equal to the hop length. This enables the computation of a new set of features for every new
audio frame received from the microphone.
frameLength = hopLength;
adr = audioDeviceReader('SampleRate',fs, ...
'SamplesPerFrame',frameLength,'OutputDataType','single');
Create a Time Scope to visualize the speech signals and estimated mask.
Initialize a buffer for the audio data, a buffer for the computed features, and a buffer to plot the input
audio and the output speech mask.
dataBuff = dsp.AsyncBuffer(windowLength);
featureBuff = dsp.AsyncBuffer(numHopsPerUpdate);
plotBuff = dsp.AsyncBuffer(numHopsPerUpdate*windowLength);
Perform keyword spotting on speech received from your microphone. To run the loop indefinitely, set
timeLimit to Inf. To stop the simulation, close the scope.
show(scope);
timeLimit = 20;
tic
while toc < timeLimit && isVisible(scope)
data = adr();
write(dataBuff,data);
write(plotBuff,data);
frame = read(dataBuff,windowLength,overlapLength);
features = generateKeywordFeatures(frame,fs);
write(featureBuff,features.');
if featureBuff.NumUnreadSamples == numHopsPerUpdate
featureMatrix = read(featureBuff);
featureMatrix(~isfinite(featureMatrix)) = 0;
featureMatrix = (featureMatrix - M)./S;
KWSNet.State = state;
[~,v] = max(scores,[],2);
v = double(v) - 1;
v = mode(v);
predictedMask = repmat(v,numHopsPerUpdate*hopLength,1);
data = read(plotBuff);
scope([data,predictedMask]);
drawnow limitrate;
end
end
hide(scope)
The supporting function uses a dsp.UDPReceiver System object to receive the captured audio from
MATLAB® and uses a dsp.UDPSender System object to send the input speech signal along with the
estimated mask predicted by the network to MATLAB®. Similarly, the MATLAB® live script uses the
dsp.UDPSender System object to send the captured speech signal to the executable running on
Raspberry Pi and the dsp.UDPReceiver System object to receive the speech signal and estimated
mask from Raspberry Pi.
Replace the hostIPAddress with your machine's address. Your Raspberry Pi sends the input speech
signal and estimated mask to the specified IP address.
Create a code generation configuration object to generate an executable program. Specify the target
language as C++.
cfg = coder.config('exe');
cfg.TargetLang = 'C++';
Create a configuration object for deep learning code generation with the ARM compute library that is
on your Raspberry Pi. Specify the architecture of the Raspberry Pi and attach the deep learning
configuration object to the code generation configuration object.
dlcfg = coder.DeepLearningConfig('arm-compute');
dlcfg.ArmArchitecture = 'armv7';
dlcfg.ArmComputeVersion = '20.02.1';
cfg.DeepLearningConfig = dlcfg;
Use the Raspberry Pi Support Package function, raspi, to create a connection to your Raspberry Pi.
In the following code, replace the device address, user name, and password with the values for your Raspberry Pi.
r = raspi('172.29.252.166','pi','raspberrypi123')
r =
raspi with properties:
DeviceAddress: '172.29.252.166'
Port: 18734
BoardName: 'Raspberry Pi 2 Model B'
AvailableLEDs: {'led0'}
AvailableDigitalPins: [4,5,6,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]
AvailableSPIChannels: {'CE0','CE1'}
AvailableI2CBuses: {'i2c-1'}
AvailableWebcams: {}
I2CBusSpeed: 100000
AvailableCANInterfaces: {}
Supported peripherals
Create a coder.hardware (MATLAB Coder) object for Raspberry Pi and attach it to the code
generation configuration object.
hw = coder.hardware('Raspberry Pi');
cfg.Hardware = hw;
buildDir = '~/remoteBuildDir';
cfg.Hardware.BuildDir = buildDir;
Generate the C++ main file required to produce the standalone executable.
cfg.GenerateExampleMain = 'GenerateCodeAndCompile';
applicationName = 'helperKeywordSpottingRaspi';
applicationDirPaths = raspi.utils.getRemoteBuildDirectory('applicationName',applicationName);
targetDirPath = applicationDirPaths{1}.directory;
exeName = strcat(applicationName,'.elf');
command = ['cd ',targetDirPath,'; ./',exeName,' &> 1 &'];
system(r,command);
Create a dsp.UDPSender System object to send audio captured in MATLAB® to your Raspberry Pi.
Update the targetIPAddress for your Raspberry Pi. Raspberry Pi receives the captured audio from
the same port using the dsp.UDPReceiver System object.
targetIPAddress = '172.29.252.166';
UDPSend = dsp.UDPSender('RemoteIPPort',26000,'RemoteIPAddress',targetIPAddress);
Create a dsp.UDPReceiver System object to receive speech data and the predicted speech mask
from your Raspberry Pi. Each UDP packet received from the Raspberry Pi consists of maskLength
mask and speech samples. The maximum message length for the dsp.UDPReceiver object is 65507
bytes. Calculate the buffer size to accommodate the maximum number of UDP packets.
sizeOfFloatInBytes = 4;
speechDataLength = maskLength;
numElementsPerUDPPacket = maskLength + speechDataLength;
maxUDPMessageLength = floor(65507/sizeOfFloatInBytes);
numPackets = floor(maxUDPMessageLength/numElementsPerUDPPacket);
bufferSize = numPackets*numElementsPerUDPPacket*sizeOfFloatInBytes;
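% The dsp.UDPReceiver construction is not shown here. A minimal sketch (the
% local port is an assumption; it must match the port the Raspberry Pi
% executable sends to):
UDPReceive = dsp.UDPReceiver(LocalIPPort=21000, ...
    MessageDataType="single", ...
    MaximumMessageLength=numElementsPerUDPPacket, ...
    ReceiveBufferSize=bufferSize);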
Spot the keyword as long as the time scope is open or until the time limit is reached. To stop the live
detection before the time limit is reached, close the time scope.
tic;
show(scope);
timelimit = 20;
while toc < timelimit && isVisible(scope)
x = adr();
UDPSend(x);
data = UDPReceive();
if ~isempty(data)
mask = data(1:maskLength);
dataForPlot = data(maskLength + 1 : numElementsPerUDPPacket);
scope([dataForPlot,mask]);
end
drawnow limitrate;
end
To evaluate the execution time taken by the standalone executable on Raspberry Pi, use a PIL
(processor-in-the-loop) workflow. To perform PIL profiling, generate a PIL function for the supporting
function profileKeywordSpotting.
Enable profiling and generate the PIL code. A MEX file named profileKeywordSpotting_pil is
generated in your current folder.
cfg.CodeExecutionProfiling = true;
codegen -config cfg profileKeywordSpotting -args {pinknoise(hopLength,1,'single')} -report
Warning: Reduce length of the paths for these directories to reduce the build time:
Call the generated PIL function multiple times to get the average execution time.
numPredictCalls = 10;
totalCalls = numHopsPerUpdate * numPredictCalls;
x = pinknoise(hopLength,1,'single');
for k = 1:totalCalls
[maskReceived,inputSignal,plotFlag] = profileKeywordSpotting_pil(x);
end
clear profileKeywordSpotting_pil
### Host application produced the following standard output (stdout) and standard error (stderr)
executionProfile = getCoderExecutionProfile('profileKeywordSpotting');
report(executionProfile, ...
'Units','Seconds', ...
'ScaleFactor','1e-03', ...
'NumericFormat','%0.4f')
ans =
'C:\Users\jibrahim\OneDrive - MathWorks\Documents\MATLAB\ExampleManager\jibrahim.elcm\deeplearnin
Plot the Execution Time of each frame from the generated report.
Processing of the first frame took ~20 ms due to initialization overhead costs. The spikes in the time
graph at every 16th frame (numHopsPerUpdate) correspond to the computationally intensive predict
function called every 16th frame. The maximum execution time is ~30 ms, which is below the 128 ms
budget for real-time streaming. The performance is measured on a Raspberry Pi 4 Model B Rev 1.1.
Dereverberate Speech Using Deep Learning Networks
This example shows how to train a U-Net fully convolutional network (FCN) [1] on page 1-698 to
dereverberate speech signals.
Introduction
Reverberation occurs when a speech signal is reflected off objects in space, causing multiple
reflections that build up and eventually degrade speech quality. Dereverberation is the process of
reducing the reverberation effects in a speech signal.
Before going into the training process in detail, use a pretrained network to dereverberate a speech
signal.
Download the pretrained network. This network was trained on 56-speaker versions of the training
datasets. The example walks through training on the 28-speaker version.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio/examples","dereverbunet.zip"
dataFolder = tempdir;
netFolder = fullfile(dataFolder,"dereverbunet");
unzip(downloadFolder,netFolder)
load(fullfile(netFolder,"dereverbNet.mat"));
[cleanAudio,fs] = audioread("clean_speech_signal.wav");
sound(cleanAudio,fs)
An acoustic path can be modelled using a room impulse response. You can model reverberation by
convolving an anechoic signal with a room impulse response.
[rirAudio,fsR] = audioread("room_impulse_response.wav");
tAxis = (1/fsR)*(0:numel(rirAudio)-1);
figure
plot(tAxis,rirAudio)
xlabel("Time (s)")
ylabel("Amplitude")
grid on
Convolve the clean speech with the room impulse response to obtain reverberated speech. Align the
lengths and amplitudes of the reverberated and clean speech signals.
revAudio = conv(cleanAudio,rirAudio);
revAudio = revAudio(1:numel(cleanAudio));
revAudio = revAudio.*(max(abs(cleanAudio))/max(abs(revAudio)));
sound(revAudio,fs)
The input to the pretrained network is the log-magnitude short-time Fourier transform (STFT) of the
reverberant audio. The network predicts the log-magnitude STFT of the dereverberated input. To
estimate the original time-domain audio signal, you perform an inverse STFT and assume the phase
of the reverberant audio.
params.WindowdowLength = 512;
params.Window = hamming(params.WindowdowLength,"periodic");
params.OverlapLength = round(0.75*params.WindowdowLength);
params.FFTLength = params.WindowdowLength;
Use stft to compute the one-sided log-magnitude STFT. Use single precision when computing
features to reduce memory usage and speed up training. Even though the one-sided STFT yields 257
frequency bins, consider only 256 bins and ignore the highest frequency bin.
revAudio = single(revAudio);
audioSTFT = stft(revAudio,Window=params.Window,OverlapLength=params.OverlapLength, ...
FFTLength=params.FFTLength,FrequencyRange="onesided");
Eps = realmin("single");
reverbFeats = log(abs(audioSTFT(1:end-1,:)) + Eps);
phaseOriginal = angle(audioSTFT(1:end-1,:));
Each input will have dimensions 256-by-256 (frequency bins by time steps). Split the log-magnitude
STFT into segments of 256 time-steps.
params.NumSegments = 256;
params.NumFeatures = 256;
totalFrames = size(reverbFeats,2);
chunks = ceil(totalFrames/params.NumSegments);
reverbSTFTSegments = mat2cell(reverbFeats,params.NumFeatures, ...
[params.NumSegments*ones(1,chunks - 1),(totalFrames - (chunks-1)*params.NumSegments)]);
reverbSTFTSegments{chunks} = reverbFeats(:,end-params.NumSegments + 1:end);
Scale the segmented features to the range [-1,1]. Retain the minimum and maximum values used to
scale for reconstructing the dereverberated signal.
minVals = num2cell(cellfun(@(x)min(x,[],"all"),reverbSTFTSegments));
maxVals = num2cell(cellfun(@(x)max(x,[],"all"),reverbSTFTSegments));
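% The scaling code is not shown here. A minimal sketch that maps each
% segment to [-1,1] using its own minimum and maximum:
featNorm = cellfun(@(feat,minFeat,maxFeat)2.*(feat - minFeat)./(maxFeat - minFeat) - 1, ...
    reverbSTFTSegments,minVals,maxVals,UniformOutput=false);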
Reshape the features so that chunks are along the fourth dimension.
featNorm = reshape(cell2mat(featNorm),params.NumFeatures,params.NumSegments,1,chunks);
Predict the log-magnitude spectra of the reverberated speech signal using the pretrained network.
predictedSTFT4D = predict(dereverbNet,featNorm);
Reshape to 3-dimensions and scale the predicted STFTs to the original range using the saved
minimum-maximum pairs.
predictedSTFT = squeeze(mat2cell(predictedSTFT4D,params.NumFeatures,params.NumSegments,1,ones(1,c
featDeNorm = cellfun(@(feat,minFeat,maxFeat) (feat + 1).*(maxFeat-minFeat)./2 + minFeat, ...
predictedSTFT,minVals,maxVals,UniformOutput=false);
predictedSTFT = cellfun(@exp,featDeNorm,UniformOutput=false);
Concatenate the predicted 256-by-256 magnitude STFT segments to obtain the magnitude
spectrogram of original length.
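The concatenation code is not shown here. A minimal sketch that trims the overlap introduced when the final segment was padded (variable names follow the surrounding code):
predictedSTFTAll = cat(2,predictedSTFT{1:chunks-1}, ...
    predictedSTFT{chunks}(:,end-(totalFrames-(chunks-1)*params.NumSegments)+1:end));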
Before taking the inverse STFT, append zeros to the predicted log-magnitude spectrum and the phase
in lieu of the highest frequency bin which was excluded when preparing input features.
nCount = size(predictedSTFTAll,3);
predictedSTFTAll = cat(1,predictedSTFTAll,zeros(1,totalFrames,nCount));
phase = cat(1,phaseOriginal,zeros(1,totalFrames,nCount));
Use the inverse STFT function to reconstruct the dereverberated time-domain speech signal using
the predicted log-magnitude STFT and the phase of reverberant speech signal.
oneSidedSTFT = predictedSTFTAll.*exp(1j*phase);
dereverbedAudio = istft(oneSidedSTFT, ...
Window=params.Window,OverlapLength=params.OverlapLength, ...
FFTLength=params.FFTLength,ConjugateSymmetric=true, ...
FrequencyRange="onesided");
dereverbedAudio = dereverbedAudio./max(abs([dereverbedAudio;revAudio]));
dereverbedAudio = [dereverbedAudio;zeros(length(revAudio) - numel(dereverbedAudio), 1)];
sound(dereverbedAudio,fs)
t = (1/fs)*(0:numel(cleanAudio)-1);
figure
tiledlayout(3,1)
nexttile
plot(t,cleanAudio)
xlabel("Time (s)")
grid on
subtitle("Clean Speech Signal")
nexttile
plot(t,revAudio)
xlabel("Time (s)")
grid on
subtitle("Revereberated Speech Signal")
nexttile
plot(t,dereverbedAudio)
xlabel("Time (s)")
grid on
subtitle("Derevereberated Speech Signal")
Visualize the spectrograms of the clean, reverberant, and dereverberated speech signals.
figure(Position=[100,100,800,800])
tiledlayout(3,1)
nexttile
spectrogram(cleanAudio,params.Window,params.OverlapLength,params.FFTLength,fs,"yaxis");
subtitle("Clean")
nexttile
spectrogram(revAudio,params.Window,params.OverlapLength,params.FFTLength,fs,"yaxis");
subtitle("Reverberated")
nexttile
spectrogram(dereverbedAudio,params.Window,params.OverlapLength,params.FFTLength,fs,"yaxis");
subtitle("Predicted (Dereverberated)")
This example uses the Reverberant Speech Database [2] on page 1-698 and the corresponding Clean
Speech Database [3] on page 1-698 to train the network.
downloadFolder = tempdir;
cleanDataFolder = fullfile(downloadFolder,"DS_10283_2791");
if ~datasetExists(cleanDataFolder)
disp("Downloading data set (6 GB) ...")
unzip(url1,cleanDataFolder)
unzip(url2,cleanDataFolder)
end
url3 = "https://datashare.is.ed.ac.uk/bitstream/handle/10283/2031/reverb_trainset_28spk_wav.zip";
url4 = "https://datashare.is.ed.ac.uk/bitstream/handle/10283/2031/reverb_testset_wav.zip";
downloadFolder = tempdir;
reverbDataFolder = fullfile(downloadFolder,"DS_10283_2031");
if ~datasetExists(reverbDataFolder)
disp("Downloading data set (6 GB) ...")
unzip(url3,reverbDataFolder)
unzip(url4,reverbDataFolder)
end
Once the data is downloaded, preprocess the downloaded data and extract features before training
the DNN model:
First, create two audioDatastore objects that point to the clean and reverberant speech datasets.
adsCleanTrain = audioDatastore(fullfile(cleanDataFolder,"clean_trainset_28spk_wav"),IncludeSubfol
adsReverbTrain = audioDatastore(fullfile(reverbDataFolder,"reverb_trainset_28spk_wav"),IncludeSub
The amount of reverberation in the original data is relatively small. You will augment the reverberant
speech data with significant reverberation effects using the reverberator object.
Create an audioDatastore that points to the clean speech dataset allocated for synthetic
reverberant data generation.
adsSyntheticCleanTrain = subset(adsCleanTrain,10e3+1:length(adsCleanTrain.Files));
adsCleanTrain = subset(adsCleanTrain,1:10e3);
adsReverbTrain = subset(adsReverbTrain,1:10e3);
adsSyntheticCleanTrain = transform(adsSyntheticCleanTrain,@(x)resample(x,16e3,48e3));
adsCleanTrain = transform(adsCleanTrain,@(x)resample(x,16e3,48e3));
adsReverbTrain = transform(adsReverbTrain,@(x)resample(x,16e3,48e3));
Combine the two audio datastores, maintaining the correspondence between the clean and
reverberant speech samples.
adsCombinedTrain = combine(adsCleanTrain,adsReverbTrain);
The applyReverb on page 1-693 function creates a reverberator object, updates the pre delay,
decay factor, and wet-dry mix parameters as specified, and then applies reverberation. Use
audioDataAugmenter to create synthetically generated reverberant data.
augmenter = audioDataAugmenter(AugmentationMode="independent",NumAugmentations=1,ApplyAddNoise=0,
ApplyTimeStretch=0,ApplyPitchShift=0,ApplyVolumeControl=0,ApplyTimeShift=0);
algorithmHandle = @(y,preDelay,decayFactor,wetDryMix,samplingRate) ...
applyReverb(y,preDelay,decayFactor,wetDryMix,samplingRate);
addAugmentationMethod(augmenter,"Reverb",algorithmHandle, ...
AugmentationParameter={'PreDelay','DecayFactor','WetDryMix','SamplingRate'}, ...
ParameterRange={[0.15,0.25],[0.2,0.5],[0.3,0.45],[16000,16000]})
augmenter.ReverbProbability = 1;
disp(augmenter)
AugmentationMode: "independent"
AugmentationParameterSource: 'random'
NumAugmentations: 1
ApplyTimeStretch: 0
ApplyPitchShift: 0
ApplyVolumeControl: 0
ApplyAddNoise: 0
ApplyTimeShift: 0
ApplyReverb: 1
PreDelayRange: [0.1500 0.2500]
DecayFactorRange: [0.2000 0.5000]
WetDryMixRange: [0.3000 0.4500]
SamplingRateRange: [16000 16000]
Next, based on the dimensions of the input features to the network, segment the audio into chunks of
2.072 s duration with an overlap of 50%.
Having too many silent segments can adversely affect the DNN model training. Remove the segments
which are mostly silent (more than 50% of the duration) and exclude those from the model training.
Do not completely remove silence because the model will not be robust to silent regions and slight
reverberation effects could be identified as silence. detectSpeech can identify the start and end
points of silent regions. After these two steps, the feature extraction process can be carried out as
explained in the first section. helperFeatureExtract on page 1-694 implements these steps.
Define the feature extraction parameters. By setting speedupExample to true, you choose a small
subset of the datasets to perform the subsequent steps.
speedupExample = false;
params.fs = 16000;
params.WindowdowLength = 512;
params.Window = hamming(params.WindowdowLength,"periodic");
params.OverlapLength = round(0.75*params.WindowdowLength);
params.FFTLength = params.WindowdowLength;
samplesPerMs = params.fs/1000;
params.samplesPerImage = (24+256*8)*samplesPerMs;
params.shiftImage = params.samplesPerImage/2;
params.NumSegments = 256;
params.NumFeatures = 256
To speed up processing, distribute the preprocessing and feature extraction task across multiple
workers using parfor.
Determine the number of partitions for the dataset. If you do not have Parallel Computing Toolbox™,
use a single partition.
if ~isempty(ver("parallel"))
pool = gcp;
numPar = numpartitions(adsCombinedTrain,pool);
else
numPar = 1;
end
For each partition, read from the datastore, preprocess the audio signal, and then extract the
features.
if speedupExample
adsCombinedTrain = shuffle(adsCombinedTrain);
adsCombinedTrain = subset(adsCombinedTrain,1:200);
adsSyntheticCombinedTrain = shuffle(adsSyntheticCombinedTrain);
adsSyntheticCombinedTrain = subset(adsSyntheticCombinedTrain,1:200);
end
allCleanFeatures = cell(1,numPar);
allReverbFeatures = cell(1,numPar);
cPartitionSize = numel(combinedPartition.UnderlyingDatastores{1}.UnderlyingDatastores{1}.File
cSyntheticPartitionSize = numel(combinedSyntheticPartition.UnderlyingDatastores{1}.Underlying
cleanFeaturesPartition = cell(1,partitionSize);
reverbFeaturesPartition = cell(1,partitionSize);
allCleanFeatures = cat(2,allCleanFeatures{:});
allReverbFeatures = cat(2,allReverbFeatures{:});
Normalize the extracted features to the range [-1,1] and then reshape as explained in the first
section, using the featureNormalizeAndReshape on page 1-695 function.
trainClean = featureNormalizeAndReshape(allCleanFeatures);
trainReverb = featureNormalizeAndReshape(allReverbFeatures);
Now that you have extracted the log-magnitude STFT features from the training datasets, follow the
same procedure to extract features from the validation datasets. For reconstruction purposes, retain
the phase of the reverberant speech samples of the validation dataset. In addition, retain the audio
data for both the clean and reverberant speech samples in the validation set to use in the evaluation
process (next section).
adsCleanVal = audioDatastore(fullfile(cleanDataFolder,"clean_testset_wav"),IncludeSubfolders=true
adsReverbVal = audioDatastore(fullfile(reverbDataFolder,"reverb_testset_wav"),IncludeSubfolders=t
adsCleanVal = transform(adsCleanVal,@(x)resample(x,16e3,48e3));
adsReverbVal = transform(adsReverbVal,@(x)resample(x,16e3,48e3));
adsCombinedVal = combine(adsCleanVal,adsReverbVal);
if speedupExample
adsCombinedVal = shuffle(adsCombinedVal);
adsCombinedVal = subset(adsCombinedVal,1:50);
end
allValCleanFeatures = cell(1,numPar);
allValReverbFeatures = cell(1,numPar);
allValReverbPhase = cell(1,numPar);
allValCleanAudios = cell(1,numPar);
allValReverbAudios = cell(1,numPar);
combinedPartition = partition(adsCombinedVal,numPar,iPartition);
partitionSize = numel(combinedPartition.UnderlyingDatastores{1}.UnderlyingDatastores{1}.Files
cleanFeaturesPartition = cell(1,partitionSize);
reverbFeaturesPartition = cell(1,partitionSize);
reverbPhasePartition = cell(1,partitionSize);
cleanAudiosPartition = cell(1,partitionSize);
reverbAudiosPartition = cell(1,partitionSize);
cleanAudio = single(audios(:,1));
reverbAudio = single(audios(:,2));
[a,b,c,d,e] = helperFeatureExtract(cleanAudio,reverbAudio,true,params);
cleanFeaturesPartition{idx} = a;
reverbFeaturesPartition{idx} = b;
reverbPhasePartition{idx} = c;
cleanAudiosPartition{idx} = d;
reverbAudiosPartition{idx} = e;
end
allValCleanFeatures{iPartition} = cat(2,cleanFeaturesPartition{:});
allValReverbFeatures{iPartition} = cat(2,reverbFeaturesPartition{:});
allValReverbPhase{iPartition} = cat(2,reverbPhasePartition{:});
allValCleanAudios{iPartition} = cat(2,cleanAudiosPartition{:});
allValReverbAudios{iPartition} = cat(2,reverbAudiosPartition{:});
end
allValCleanFeatures = cat(2,allValCleanFeatures{:});
allValReverbFeatures = cat(2,allValReverbFeatures{:});
allValReverbPhase = cat(2,allValReverbPhase{:});
allValCleanAudios = cat(2,allValCleanAudios{:});
allValReverbAudios = cat(2,allValReverbAudios{:});
valClean = featureNormalizeAndReshape(allValCleanFeatures);
Retain the minimum and maximum values of each feature of the reverberant validation set. You will
use these values in the reconstruction process.
[valReverb,valMinMaxPairs] = featureNormalizeAndReshape(allValReverbFeatures);
A fully convolutional network architecture named U-Net was adapted for this speech dereverberation
task as proposed in [1] on page 1-698. "U-Net" is an encoder-decoder network with skip
connections. In the U-Net model, each layer downsamples its input (stride of 2) until a bottleneck
layer is reached (encoding path). In subsequent layers, the input is upsampled by each layer until the
output is returned to the original shape (decoding path). To minimize the loss of low-level information
during the downsampling process, connections are made between the mirrored layers by directly
concatenating outputs of corresponding layers (skip connections).
params.NumFeatures = params.FFTLength/2;
params.NumSegments = 256;
filterH = 6;
filterW = 6;
numChannels = 1;
net = dlnetwork;
tempNet = [
imageInputLayer([params.NumFeatures,params.NumSegments,numChannels],"Name","input","Normaliza
convolution2dLayer([filterH filterW],64,"Name","conv1","Padding","same","Stride",[2 2])
leakyReluLayer(0.2,"Name","leaky-relu1")];
net = addLayers(net,tempNet);
tempNet = [
convolution2dLayer([filterH filterW],128,"Name","conv2","Padding","same","Stride",[2 2])
batchNormalizationLayer("Name","batchnorm2")
leakyReluLayer(0.2,"Name","leaky-relu2")];
net = addLayers(net,tempNet);
tempNet = [
convolution2dLayer([filterH filterW],256,"Name","conv3","Padding","same","Stride",[2 2])
batchNormalizationLayer("Name","batchnorm3")
leakyReluLayer(0.2,"Name","leaky-relu3")];
net = addLayers(net,tempNet);
tempNet = [
convolution2dLayer([filterH filterW],512,"Name","conv4","Padding","same","Stride",[2 2])
batchNormalizationLayer("Name","batchnorm4")
leakyReluLayer(0.2,"Name","leaky-relu4")];
net = addLayers(net,tempNet);
tempNet = [
convolution2dLayer([filterH filterW],512,"Name","conv5","Padding","same","Stride",[2 2])
batchNormalizationLayer("Name","batchnorm5")
leakyReluLayer(0.2,"Name","leaky-relu5")];
net = addLayers(net,tempNet);
tempNet = [
convolution2dLayer([filterH filterW],512,"Name","conv6","Padding","same","Stride",[2 2])
batchNormalizationLayer("Name","batchnorm6")
leakyReluLayer(0.2,"Name","leaky-relu6")];
net = addLayers(net,tempNet);
tempNet = [
convolution2dLayer([filterH filterW],512,"Name","conv7","Padding","same","Stride",[2 2])
batchNormalizationLayer("Name","batchnorm7")
leakyReluLayer(0.2,"Name","leaky-relu7")];
net = addLayers(net,tempNet);
tempNet = [
convolution2dLayer([filterH filterW],512,"Name","conv8","Padding","same","Stride",[2 2])
batchNormalizationLayer("Name","batchnorm8")
reluLayer("Name","relu8")
transposedConv2dLayer([filterH filterW],512,"Name","deconv7","Cropping","same","Stride",[2 2]
batchNormalizationLayer("Name","de-batchnorm7")
dropoutLayer(0.5,"Name","de-dropout7")
reluLayer("Name","de-relu7")];
net = addLayers(net,tempNet);
tempNet = [
concatenationLayer(3,2,"Name","concat7")
transposedConv2dLayer([filterH filterW],512,"Name","deconv6","Cropping","same","Stride",[2 2]
batchNormalizationLayer("Name","de-batchnorm6")
dropoutLayer(0.5,"Name","de-dropout6")
reluLayer("Name","de-relu6")];
net = addLayers(net,tempNet);
tempNet = [
concatenationLayer(3,2,"Name","concat6")
transposedConv2dLayer([filterH filterW],512,"Name","deconv5","Cropping","same","Stride",[2 2]
batchNormalizationLayer("Name","de-batchnorm5")
dropoutLayer(0.5,"Name","de-dropout5")
reluLayer("Name","de-relu5")];
net = addLayers(net,tempNet);
tempNet = [
concatenationLayer(3,2,"Name","concat5")
transposedConv2dLayer([filterH filterW],512,"Name","deconv4","Cropping","same","Stride",[2 2]
batchNormalizationLayer("Name","de-batchnorm4")
reluLayer("Name","de-relu4")];
net = addLayers(net,tempNet);
tempNet = [
concatenationLayer(3,2,"Name","concat4")
transposedConv2dLayer([filterH filterW],256,"Name","deconv3","Cropping","same","Stride",[2 2]
batchNormalizationLayer("Name","de-batchnorm3")
reluLayer("Name","de-relu3")];
net = addLayers(net,tempNet);
tempNet = [
concatenationLayer(3,2,"Name","concat3")
transposedConv2dLayer([filterH filterW],128,"Name","deconv2","Cropping","same","Stride",[2 2]
batchNormalizationLayer("Name","de-batchnorm2")
reluLayer("Name","de-relu2")];
net = addLayers(net,tempNet);
tempNet = [
concatenationLayer(3,2,"Name","concat2")
transposedConv2dLayer([filterH filterW],64,"Name","deconv1","Cropping","same","Stride",[2 2])
batchNormalizationLayer("Name","de-batchnorm1")
reluLayer("Name","de-relu1")];
net = addLayers(net,tempNet);
tempNet = [
concatenationLayer(3,2,"Name","concat1")
transposedConv2dLayer([filterH filterW],1,"Name","deconv0","Cropping","same","Stride",[2 2])
tanhLayer("Name","de-tanh0")];
net = addLayers(net,tempNet);
net = connectLayers(net,"leaky-relu1","conv2");
net = connectLayers(net,"leaky-relu1","concat1/in2");
net = connectLayers(net,"leaky-relu2","conv3");
net = connectLayers(net,"leaky-relu2","concat2/in2");
net = connectLayers(net,"leaky-relu3","conv4");
net = connectLayers(net,"leaky-relu3","concat3/in2");
net = connectLayers(net,"leaky-relu4","conv5");
net = connectLayers(net,"leaky-relu4","concat4/in2");
net = connectLayers(net,"leaky-relu5","conv6");
net = connectLayers(net,"leaky-relu5","concat5/in2");
net = connectLayers(net,"leaky-relu6","conv7");
net = connectLayers(net,"leaky-relu6","concat6/in2");
net = connectLayers(net,"leaky-relu7","conv8");
net = connectLayers(net,"leaky-relu7","concat7/in2");
net = connectLayers(net,"de-relu7","concat7/in1");
net = connectLayers(net,"de-relu6","concat6/in1");
net = connectLayers(net,"de-relu5","concat5/in1");
net = connectLayers(net,"de-relu4","concat4/in1");
net = connectLayers(net,"de-relu3","concat3/in1");
net = connectLayers(net,"de-relu2","concat2/in1");
net = connectLayers(net,"de-relu1","concat1/in1");
unet = initialize(net);
Use analyzeNetwork to view the model architecture. This is a good way to visualize the connections
between layers.
analyzeNetwork(unet);
You will use the mean squared error (MSE) between the log-magnitude spectra of the dereverberated
speech sample (output of the model) and the corresponding clean speech sample (target) as the loss
function. Use the adam optimizer and a mini-batch size of 64 for the training. Allow the model to
train for a maximum of 50 epochs. If the validation loss doesn't improve for 5 consecutive epochs,
terminate the training process. Reduce the learning rate by a factor of 10 every 15 epochs.
Define the training options as below. Change the execution environment and whether to perform
background dispatching depending on your hardware availability and whether you have access to
Parallel Computing Toolbox™.
initialLearnRate = 8e-4;
miniBatchSize = 64;
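% The trainingOptions call is not shown here. A minimal sketch consistent
% with the settings described above (adjust ExecutionEnvironment to match
% your hardware):
options = trainingOptions("adam", ...
    MaxEpochs=50, ...
    InitialLearnRate=initialLearnRate, ...
    MiniBatchSize=miniBatchSize, ...
    Shuffle="every-epoch", ...
    Plots="training-progress", ...
    Verbose=false, ...
    ValidationData={valReverb,valClean}, ...
    ValidationPatience=5, ...
    LearnRateSchedule="piecewise", ...
    LearnRateDropFactor=0.1, ...
    LearnRateDropPeriod=15, ...
    ExecutionEnvironment="auto");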
dereverbNet = trainnet(trainReverb,trainClean,unet,"mse",options);
predictedSTFT4D = predict(dereverbNet,valReverb);
params.WindowdowLength = 512;
params.Window = hamming(params.WindowdowLength,"periodic");
params.OverlapLength = round(0.75*params.WindowdowLength);
params.FFTLength = params.WindowdowLength;
params.fs = 16000;
dereverbedAudioAll = helperReconstructPredictedAudios(predictedSTFT4D,valMinMaxPairs,allValReverb
Visualize the log-magnitude STFTs of the clean, reverberant, and corresponding dereverberated
speech signals.
figure(Position=[100,100,1024,1200])
tiledlayout(3,1)
nexttile
imagesc(squeeze(allValCleanFeatures{1}))
set(gca,Ydir="normal")
subtitle("Clean")
xlabel("Time")
ylabel("Frequency")
colorbar
nexttile
imagesc(squeeze(allValReverbFeatures{1}))
set(gca,Ydir="normal")
subtitle("Reverberated")
xlabel("Time")
ylabel("Frequency")
colorbar
nexttile
imagesc(squeeze(predictedSTFT4D(:,:,:,1)))
set(gca,Ydir="normal")
subtitle("Predicted (Dereverberated)")
xlabel("Time")
ylabel("Frequency")
clim([-1,1])
colorbar
Evaluation Metrics
You will use a subset of objective measures used in [1] on page 1-698 to evaluate the performance of
the network. These metrics are computed on the time-domain signals.
• Cepstrum distance (CD) - Provides an estimate of the log spectral distance between two spectra
(predicted and clean). Smaller values indicate better quality.
• Log likelihood ratio (LLR) - Linear predictive coding (LPC) based objective measurement. Smaller
values indicate better quality.
Compute these measurements for the reverberant speech and the dereverberated speech signals.
[summaryMeasuresReconstructed,allMeasuresReconstructed] = calculateObjectiveMeasures(dereverbedAu
[summaryMeasuresReverb,allMeasuresReverb] = calculateObjectiveMeasures(allValReverbAudios,allValC
disp(summaryMeasuresReconstructed)
avgCdMean: 3.2559
avgCdMedian: 2.7702
avgLlrMean: 0.5292
avgLlrMedian: 0.4329
disp(summaryMeasuresReverb)
avgCdMean: 4.2945
avgCdMedian: 3.6521
avgLlrMean: 0.9802
avgLlrMedian: 0.8620
The histograms illustrate the distribution of mean CD and mean LLR of the reverberant and
dereverberated data.
figure(Position=[50,50,1100,1300])
tiledlayout(2,1)
nexttile
histogram(allMeasuresReverb.cdMean,10)
hold on
histogram(allMeasuresReconstructed.cdMean,10)
subtitle("Mean Cepstral Distance Distribution")
ylabel("Count")
xlabel("Mean CD")
legend("Reverberant (Original)","Dereverberated (Predicted)")
nexttile
histogram(allMeasuresReverb.llrMean,10)
hold on
histogram(allMeasuresReconstructed.llrMean,10)
subtitle("Mean Log Likelihood Ratio Distribution")
ylabel("Count")
xlabel("Mean LLR")
legend("Reverberant (Original)","Dereverberated (Predicted)")
Supporting Functions
Apply Reverberation
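The applyReverb function itself is not included here. A minimal sketch, based on the description earlier in the example (collapsing the stereo reverberator output to mono is an assumption):
function audioOut = applyReverb(audioIn,preDelay,decayFactor,wetDryMix,fs)
% Create a reverberator with the requested parameters and apply it.
reverb = reverberator(PreDelay=preDelay,DecayFactor=decayFactor, ...
    WetDryMix=wetDryMix,SampleRate=fs);
audioOut = reverb(audioIn);
audioOut = mean(audioOut,2); % collapse the stereo output to mono
end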
function [featuresClean,featuresReverb,phaseReverb,cleanAudios,reverbAudios] = helperFeatureExtract(cleanAudio,reverbAudio,isVal,params)
% helperFeatureExtract segments a clean/reverberant pair, discards mostly
% silent segments, and extracts log-magnitude STFT features. For validation
% data (isVal = true), it also returns the phase and the raw audio segments.
assert(length(cleanAudio) == length(reverbAudio));
nSegments = floor((length(reverbAudio) - (params.samplesPerImage - params.shiftImage))/params.shi
featuresClean = {};
featuresReverb = {};
phaseReverb = {};
cleanAudios = {};
reverbAudios = {};
nGood = 0;
nonSilentRegions = detectSpeech(reverbAudio, params.fs);
nonSilentRegionIdx = 1;
totalRegions = size(nonSilentRegions, 1);
nonSilentSamples = 0;
while nonSilentRegionIdx < totalRegions && nonSilentRegions(nonSilentRegionIdx, 2) < start
nonSilentRegionIdx = nonSilentRegionIdx + 1;
end
nonSilentStart = nonSilentRegionIdx;
while nonSilentStart <= totalRegions && nonSilentRegions(nonSilentStart, 1) <= en
nonSilentDuration = min(en, nonSilentRegions(nonSilentStart,2)) - max(start,nonSilentRegi
nonSilentSamples = nonSilentSamples + nonSilentDuration;
nonSilentStart = nonSilentStart + 1;
end
reverbAudioSegment = reverbAudio(start:en);
if ~silent
nGood = nGood + 1;
cleanAudioSegment = cleanAudio(start:en);
assert(length(cleanAudioSegment)==length(reverbAudioSegment),"Lengths do not match after
% Clean Audio
[featsUnit, ~] = featureExtract(cleanAudioSegment, params);
featuresClean{nGood} = featsUnit; %#ok
% Reverb Audio
[featsUnit, phaseUnit] = featureExtract(reverbAudioSegment, params);
featuresReverb{nGood} = featsUnit; %#ok
if isVal
phaseReverb{nGood} = phaseUnit; %#ok
reverbAudios{nGood} = reverbAudioSegment;%#ok
cleanAudios{nGood} = cleanAudioSegment;%#ok
end
end
end
end
Extract Features
function [features,phase,lastFBin] = featureExtract(audio,params)
% featureExtract computes the one-sided STFT of audio and returns the
% log-magnitude spectrum, the phase, and the highest frequency bin.
% (The stft call below is a sketch using the window parameters in params.)
audioSTFT = stft(audio,Window=params.Window,OverlapLength=params.OverlapLength, ...
    FFTLength=params.FFTLength,FrequencyRange="onesided");
phase = single(angle(audioSTFT(1:end-1,:)));
features = single(log(abs(audioSTFT(1:end-1,:)) + 10e-30));
lastFBin = audioSTFT(end,:);
end
function [featNorm,minMaxPairs] = featureNormalizeAndReshape(feats)
% featureNormalizeAndReshape scales each feature matrix in feats to the
% range [-1,1] and reshapes the result into a 4-D array with observations
% along the fourth dimension.
nSamples = length(feats);
minMaxPairs = zeros(nSamples,2,"single");
featNorm = zeros([size(feats{1}),nSamples],"single");
parfor i = 1:nSamples
feat = feats{i};
maxFeat = max(feat,[],"all");
minFeat = min(feat,[],"all");
featNorm(:,:,i) = 2.*(feat - minFeat)./(maxFeat - minFeat) - 1;
minMaxPairs(i,:) = [minFeat,maxFeat];
end
featNorm = reshape(featNorm,size(featNorm,1),size(featNorm,2),1,size(featNorm,3));
end
function dereverbedAudioAll = helperReconstructPredictedAudios(predictedSTFT4D,minMaxPairs,reverbPhase,reverbAudios,params)
% helperReconstructPredictedAudios denormalizes the predicted log-magnitude
% STFTs, reapplies the reverberant phase, and reconstructs the time-domain
% signals with the inverse STFT.
predictedSTFT = squeeze(predictedSTFT4D);
denormalizedFeatures = zeros(size(predictedSTFT),"single");
for ii = 1:size(predictedSTFT,3)
feat = predictedSTFT(:,:,ii);
maxFeat = minMaxPairs(ii,2);
minFeat = minMaxPairs(ii,1);
denormalizedFeatures(:,:,ii) = (feat + 1).*(maxFeat-minFeat)./2 + minFeat;
end
predictedSTFT = exp(denormalizedFeatures);
nCount = size(predictedSTFT,3);
dereverbedAudioAll = cell(1,nCount);
nSeg = params.NumSegments;
win = params.Window;
ovrlp = params.OverlapLength;
FFTLength = params.FFTLength;
parfor ii = 1:nCount
% Append zeros to the highest frequency bin
stftUnit = predictedSTFT(:,:,ii);
stftUnit = cat(1,stftUnit, zeros(1,nSeg));
phase = reverbPhase{ii};
phase = cat(1,phase,zeros(1,nSeg));
oneSidedSTFT = stftUnit.*exp(1j*phase);
dereverbedAudio = istft(oneSidedSTFT, ...
Window=win,OverlapLength=ovrlp, ...
FFTLength=FFTLength,ConjugateSymmetric=true,...
FrequencyRange="onesided");
dereverbedAudioAll{ii} = dereverbedAudio./max(max(abs(dereverbedAudio)),max(abs(reverbAudios{
end
end
function [summaryMeasures,allMeasures] = calculateObjectiveMeasures(reconstructedAudios,cleanAudios,fs)
% calculateObjectiveMeasures computes the cepstral distance and log
% likelihood ratio for each reconstructed/clean pair and returns per-file
% and summary statistics.
nAudios = length(reconstructedAudios);
cdMean = zeros(nAudios,1);
cdMedian = zeros(nAudios,1);
llrMean = zeros(nAudios,1);
llrMedian = zeros(nAudios,1);
parfor k = 1 : nAudios
y = reconstructedAudios{k};
x = cleanAudios{k};
y = y./max(abs(y));
x = x./max(abs(x));
[cdMean(k),cdMedian(k)] = cepstralDistance(x,y,fs);
[llrMean(k),llrMedian(k)] = lpcLogLikelihoodRatio(y,x,fs);
end
summaryMeasures.avgCdMean = mean(cdMean);
summaryMeasures.avgCdMedian = mean(cdMedian);
summaryMeasures.avgLlrMean = mean(llrMean);
summaryMeasures.avgLlrMedian = mean(llrMedian);
allMeasures.cdMean = cdMean;
allMeasures.llrMean = llrMean;
end
Cepstral Distance
function [meanVal, medianVal] = cepstralDistance(x,y,fs)
x = x/sqrt(sum(x.^2));
y = y/sqrt(sum(y.^2));
width = round(0.025*fs);
shift = round(0.01*fs);
nSamples = length(x);
nFrames = floor((nSamples - width + shift)/shift);
win = window(@hanning,width);
xFrames = x(winIndex).*win;
yFrames = y(winIndex).*win;
xCeps = cepstralReal(xFrames,width);
yCeps = cepstralReal(yFrames,width);
meanVal = mean(cepsD);
medianVal = median(cepsD);
end
Real Cepstrum
function realC = cepstralReal(x,width)
width2p = 2^nextpow2(width);
powX = abs(fft(x,width2p));
lowCutoff = max(powX(:))*10^-5;
powX = max(powX,lowCutoff);
realC = real(ifft(log(powX)));
order = 24;
realC = realC(1:order + 1,:);
realC = realC - mean(realC,2);
end
nSamples = length(x);
nFrames = floor((nSamples - width + shift)/shift);
win = window(@hanning,width);
xFrames = x(winIndex).*win;
yFrames = y(winIndex).*win;
lpcX = realLpc(xFrames,width,order);
[lpcY,realY] = realLpc(yFrames,width,order);
llr = zeros(nFrames,1);
for n = 1:nFrames
R = toeplitz(realY(1:order+1,n));
num = lpcX(:,n)'*R*lpcX(:,n);
den = lpcY(:,n)'*R*lpcY(:,n);
llr(n) = log(num/den);
end
llr = sort(llr);
llr = llr(1:ceil(nFrames*0.95));
llr = max(min(llr,2),0);
meanLlr = mean(llr);
medianLlr = median(llr);
end
Rx = ifft(abs(X).^2);
Rx = Rx./width;
realX = real(Rx);
lpcX = levinson(realX,order);
lpcCoeffs = real(lpcX');
end
References
[1] Ernst, O., Chazan, S.E., Gannot, S., & Goldberger, J. (2018). Speech Dereverberation Using Fully
Convolutional Networks. 2018 26th European Signal Processing Conference (EUSIPCO), 390-394.
[2] https://datashare.is.ed.ac.uk/handle/10283/2031
[3] https://datashare.is.ed.ac.uk/handle/10283/2791
[4] https://github.com/MuSAELab/SRMRToolbox
Speaker Identification Using Custom SincNet Layer and Deep Learning
In this example, you train three convolutional neural networks (CNNs) to perform speaker
identification and then compare the performances of the architectures. The architectures of the three
CNNs are all equivalent except for the first convolutional layer in each:
1 In the first architecture, the first convolutional layer is a "standard" convolutional layer,
implemented using convolution2dLayer.
2 In the second architecture, the first convolutional layer is a constant sinc filterbank, implemented
using a custom layer.
3 In the third architecture, the first convolutional layer is a trainable sinc filterbank, implemented
using a custom layer. This architecture is referred to as SincNet [1] on page 1-712.
[1] on page 1-712 shows that replacing the standard convolutional layer with a filterbank layer leads
to faster training convergence and higher accuracy. [1] on page 1-712 also shows that making the
parameters of the filter bank learnable yields additional performance gains.
Introduction
Speaker identification is a prominent research area with a variety of applications including forensics
and biometric authentication. Many speaker identification systems depend on precomputed features
such as i-vectors or MFCCs, which are then fed into machine learning or deep learning networks for
classification. Other deep learning speech systems bypass the feature extraction stage and feed the
audio signal directly to the network. In such end-to-end systems, the network directly learns low-level
audio signal characteristics.
In this example, you first train a traditional end-to-end speaker identification CNN. The filters learned
tend to have random shapes that do not correspond to perceptual evidence or knowledge of how the
human ear works, especially in scenarios where the amount of training data is limited [1] on page 1-
712. You then replace the first convolutional layer in the network with a custom sinc filterbank layer
that introduces structure and constraints based on perceptual evidence. Finally, you train the SincNet
architecture, which adds learnability to the sinc filterbank parameters.
This example defines and trains the three neural network architectures summarized above and evaluates their performance on the LibriSpeech Dataset [2] on page 1-712.
Data Set
Download Dataset
In this example, you use a subset of the LibriSpeech Dataset [2] on page 1-712. The LibriSpeech
Dataset is a large corpus of read English speech sampled at 16 kHz. The data is derived from
audiobooks read from the LibriVox project.
dataFolder = tempdir;
dataset = fullfile(dataFolder,"LibriSpeech","train-clean-100");
if ~datasetExists(dataset)
filename = "train-clean-100.tar.gz";
url = "http://www.openSLR.org/resources/12/" + filename;
gunzip(url,dataFolder);
unzippedFile = fullfile(dataset,filename);
untar(unzippedFile{1}(1:end-3),dataset);
end
ads = audioDatastore(dataset,IncludeSubfolders=true);
ads.Labels = folders2labels(ads);
The full train-clean-100 dataset is around 6 GB of data. To run this example quickly, set speedupExample to true.
speedupExample = false; % set to true to run the example on a small subset of speakers
if speedupExample
allSpeakers = unique(ads.Labels);
subsetSpeakers = allSpeakers(1:50);
ads = subset(ads,ismember(ads.Labels,subsetSpeakers));
ads.Labels = removecats(ads.Labels);
end
ads = splitEachLabel(ads,0.1);
Split the audio files into training and test data. 80% of the audio files are assigned to the training set
and 20% are assigned to the test set.
[adsTrain,adsTest] = splitEachLabel(ads,0.8);
[audioIn,dsInfo] = read(adsTrain);
Fs = dsInfo.SampleRate;
sound(audioIn,Fs)
t = (1/Fs)*(0:length(audioIn)-1);
plot(t,audioIn)
title("Audio Sample")
xlabel("Time (s)")
ylabel("Amplitude")
grid on
reset(adsTrain)
Data Preprocessing
CNNs expect inputs to have consistent dimensions. You will preprocess the audio by removing regions of silence and then breaking the remaining speech into 200 ms frames with 40 ms overlap.
frameDuration = 200e-3;
overlapDuration = 40e-3;
frameLength = floor(Fs*frameDuration);
overlapLength = round(Fs*overlapDuration);
Use the supporting function, preprocessAudioData on page 1-712, to preprocess the training and
test data. Define a transform on the audio datastores to perform the preprocessing, then use readall
to preprocess the entire datasets and place the preprocessed data into memory. If you have Parallel
Computing Toolbox™, you can spread the computational load across workers. XTrain and XTest
contain the train and test speech frames, respectively. TTrain and TTest contain the train and test
labels, respectively.
pFlag = ~isempty(ver("parallel"));
adsTrainTransform = transform(adsTrain,@(x){preprocessAudioData(x,frameLength,overlapLength,Fs)})
XTrain = readall(adsTrainTransform,UseParallel=pFlag);
Replicate the labels so that each 200 ms chunk has a corresponding label.
chunksPerFile = cellfun(@(x)size(x,4),XTrain);
TTrain = repelem(adsTrain.Labels,chunksPerFile,1);
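The test features and labels referenced above (XTest, TTest) are prepared in the same way. The following sketch assumes the same supporting function, recomputes chunksPerFile for the test set (which the frame-combining code later in the example relies on), and concatenates both XTrain and XTest along the fourth dimension so that each 200 ms frame is one observation.
XTrain = cat(4,XTrain{:});
adsTestTransform = transform(adsTest,@(x){preprocessAudioData(x,frameLength,overlapLength,Fs)});
XTest = readall(adsTestTransform,UseParallel=pFlag);
chunksPerFile = cellfun(@(x)size(x,4),XTest);
TTest = repelem(adsTest.Labels,chunksPerFile,1);
XTest = cat(4,XTest{:});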
Standard CNN
Define Layers
The standard CNN is inspired by the neural network architecture in [1] on page 1-712.
numFilters = 80;
filterLength = 251;
numSpeakers = numel(unique(removecats(ads.Labels)));
layers = [
convolution2dLayer([1 filterLength],numFilters)
batchNormalizationLayer
leakyReluLayer(0.2)
maxPooling2dLayer([1 3])
convolution2dLayer([1 5],60)
batchNormalizationLayer
leakyReluLayer(0.2)
maxPooling2dLayer([1 3])
convolution2dLayer([1 5],60)
batchNormalizationLayer
leakyReluLayer(0.2)
maxPooling2dLayer([1 3])
fullyConnectedLayer(256)
batchNormalizationLayer
leakyReluLayer(0.2)
fullyConnectedLayer(256)
batchNormalizationLayer
leakyReluLayer(0.2)
fullyConnectedLayer(256)
batchNormalizationLayer
leakyReluLayer(0.2)
fullyConnectedLayer(numSpeakers)
softmaxLayer];
Analyze the layers of the neural network using the analyzeNetwork function.
analyzeNetwork(layers)
Train Network
Train the neural network for 15 epochs using adam optimization. Shuffle the training data before
every epoch. The training options for the neural network are set using trainingOptions. Use the
test data as the validation data to observe how the network performance improves as training
progresses.
numEpochs = 15;
miniBatchSize = 128;
validationFrequency = floor(numel(TTrain)/miniBatchSize);
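A representative trainingOptions call, consistent with the optimizer, epoch count, mini-batch size, shuffling, and validation settings described above; the remaining name-value choices are assumptions.
options = trainingOptions("adam", ...
    Shuffle="every-epoch", ...
    MaxEpochs=numEpochs, ...
    MiniBatchSize=miniBatchSize, ...
    Plots="training-progress", ...
    Metrics="accuracy", ...
    Verbose=false, ...
    ValidationData={XTest,TTest}, ...
    ValidationFrequency=validationFrequency);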
[convNet,convNetInfo] = trainnet(XTrain,TTrain,layers,"crossentropy",options);
Recall that each signal is broken into short frames. There is a predicted speaker for each frame. You
can achieve higher accuracy by combining all frame predictions into one signal prediction using a
mode operation.
predictions = minibatchpredict(convNet,XTest);
labels = unique(ads.Labels);
predictions = scores2label(predictions,labels);
ind = 1;
finalPredictions = repmat(TTrain(1),length(chunksPerFile),1);
for index=1:length(chunksPerFile)
numS = chunksPerFile(index);
finalPredictions(index) = mode(predictions(ind:ind+numS-1));
ind = ind+numS;
end
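The file-level accuracy of the standard CNN (standardNetAccuracy, which appears in the results table later) can then be computed from the combined predictions. This sketch assumes the test files are ordered as in adsTest.
standardNetAccuracy = mean(finalPredictions == adsTest.Labels);
fprintf("Standard CNN network accuracy: %f percent\n",100*standardNetAccuracy)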
Plot the magnitude frequency response of nine filters learned from the standard CNN network. The
shape of these filters is not intuitive and does not correspond to perceptual knowledge. The next
section explores the effect of using constrained filter shapes.
F = squeeze(convNet.Layers(2,1).Weights);
H = zeros(size(F));
Freq = zeros(size(F));
for ii = 1:size(F,2)
[h,f] = freqz(F(:,ii),1,251,Fs);
H(:,ii) = abs(h);
Freq(:,ii) = f;
end
idx = linspace(1,size(F,2),9);
idx = round(idx);
figure
for jj = 1:9
subplot(3,3,jj)
plot(Freq(:,idx(jj)),H(:,idx(jj)))
sgtitle("Frequency Response of Learned Standard CNN Filters")
xlabel("Frequency (Hz)")
end
Constant Sinc Filterbank Layer
In this section, you replace the first convolutional layer in the standard CNN with a constant sinc
filterbank layer. The constant sinc filterbank layer convolves the input frames with a bank of fixed
bandpass filters. The bandpass filters are a linear combination of two sinc filters in the time domain.
The frequencies of the bandpass filters are spaced linearly on the mel scale.
Define Layers
The implementation for the constant sinc filterbank layer can be found in the
constantSincLayer.m file (attached to this example). Define parameters for a
ConstantSincLayer. Use 80 filters and a filter length of 251.
numFilters = 80;
filterLength = 251;
numChannels = 1;
name = "constant_sinc";
Change the first convolutional layer from the standard CNN to the ConstantSincLayer and keep
the other layers unchanged.
cSL = constantSincLayer(numFilters,filterLength,Fs,numChannels,name)
cSL =
constantSincLayer with properties:
Name: 'constant_sinc'
Learnable Parameters
No properties.
State Parameters
No properties.
layers(2) = cSL;
Train Network
Train the network using the trainnet function. Use the same training options defined previously.
[constSincNet,constSincInfo] = trainnet(XTrain,TTrain,layers,"crossentropy",options);
Similar to the regular network, combine individual frame predictions into a single prediction for each
test audio signal.
constantSincNetAccuracy = getNetworkAccuracy(constSincNet,adsTest,XTest,labels,chunksPerFile);
fprintf("Constant SincNet network accuracy: %f percent\n",100*constantSincNetAccuracy);
The plotNFilters method plots the magnitude frequency response of n filters with equally spaced
filter indices. Plot the magnitude frequency response of nine filters in the ConstantSincLayer.
figure
n = 9;
plotNFilters(constSincNet.Layers(2),n)
SincNet
In this section, you use a trainable SincNet layer as the first convolutional layer in your network. The
SincNet layer convolves the input frames with a bank of bandpass filters. The bandwidth and the
initial frequencies of the SincNet filters are initialized as equally spaced in the mel scale. The SincNet
layer attempts to learn better parameters for these bandpass filters within the neural network
framework.
Define Layers
The implementation for the SincNet layer filterbank layer can be found in the sincNetLayer.m file
(attached to this example). Define parameters for a SincNetLayer. Use 80 filters and a filter length
of 251.
numFilters = 80;
filterLength = 251;
numChannels = 1;
name = "sinc";
Replace the ConstantSincLayer from the previous network with the SincNetLayer. This new
layer has two learnable parameters: FilterFrequencies and FilterBandwidths.
sNL = sincNetLayer(numFilters,filterLength,Fs,numChannels,name)
sNL =
sincNetLayer with properties:
Name: 'sinc'
Learnable Parameters
FilterFrequencies: [0.0019 0.0032 0.0047 0.0062 0.0078 0.0094 0.0111 0.0128 0.0145 0.0164 0.0
FilterBandwidths: [0.0028 0.0030 0.0031 0.0032 0.0033 0.0034 0.0035 0.0036 0.0037 0.0038 0.0
State Parameters
No properties.
layers(2) = sNL;
Train Network
Train the network using the trainnet function. Use the same training options defined previously.
[sincNet,sincNetInfo] = trainnet(XTrain,TTrain,layers,"crossentropy",options);
Similar to the regular network, combine individual frame predictions into a single prediction for each
test audio signal.
sincNetNetAccuracy = getNetworkAccuracy(sincNet,adsTest,XTest,labels,chunksPerFile);
fprintf("SincNet network accuracy: %f percent\n",100*sincNetNetAccuracy);
Use the plotNFilters method of SincNetLayer to visualize the magnitude frequency response of
nine filters with equally spaced indices learned by SincNet.
figure
plotNFilters(sincNet.Layers(2),9)
Results Summary
Accuracy
The table summarizes the frame accuracy for all three neural networks.
NetworkType = ["Standard CNN";"Constant Sinc Layer";"SincNet Layer"];
Accuracy = [convNetInfo.ValidationHistory.Accuracy(end);constSincInfo.ValidationHistory.Accuracy(end);sincNetInfo.ValidationHistory.Accuracy(end)];
RefinedAccuracy = 100*[standardNetAccuracy;constantSincNetAccuracy;sincNetNetAccuracy];
resultsSummary = table(NetworkType,Accuracy,RefinedAccuracy)
resultsSummary=3×3 table
NetworkType Accuracy RefinedAccuracy
_____________________ ________ _______________
Plot the accuracy on the test set against the epoch number to see how well the networks learn as the
number of epochs increase. SincNet outperforms the ConstantSincLayer network, especially
during the early stages of training. This shows that updating the parameters of the bandpass filters
within the neural network framework leads to faster convergence. This behavior is only observed
when the dataset is large enough, so it might not be seen when speedupExample is set to true.
epoch = 0:numEpochs;
sinc_valAcc = sincNetInfo.ValidationHistory.Accuracy;
const_sinc_valAcc = constSincInfo.ValidationHistory.Accuracy;
conv_valAcc = convNetInfo.ValidationHistory.Accuracy;
figure
plot(epoch,sinc_valAcc,"-*",MarkerSize=4)
hold on
plot(epoch,const_sinc_valAcc,"-*",MarkerSize=4)
plot(epoch,conv_valAcc,"-*",MarkerSize=4)
ylabel("Frame-Level Accuracy (Test Set)")
xlabel("Epoch")
xlim([0 numEpochs+0.3])
title("Frame-Level Accuracy Versus Epoch")
legend("sincNet","constantSincLayer","conv2dLayer",Location="southeast")
grid on
In the figure above, the final frame accuracy is a bit different from the frame accuracy that is
computed in the last iteration. While training, the batch normalization layers perform normalization
over mini-batches. However, at the end of training, the batch normalization layers normalize over the
entire training data, which results in a slight change in performance.
Supporting Functions
function xp = preprocessAudioData(x,frameLength,overlapLength,Fs)
speechIdx = detectSpeech(x,Fs);
xp = zeros(1,frameLength,1,0);
for ii = 1:size(speechIdx,1)
    % Isolate speech segment
    audioChunk = x(speechIdx(ii,1):speechIdx(ii,2));
    % Split the segment into frameLength-sample frames with overlapLength-sample
    % overlap and stack the frames along the fourth dimension (assumed framing)
    audioChunk = buffer(audioChunk,frameLength,overlapLength);
    audioChunk = reshape(audioChunk,1,frameLength,1,[]);
    xp = cat(4,xp,audioChunk);
end
end
function accuracy = getNetworkAccuracy(net,adsTest,XTest,labels,numSegmentPerObservation)
predictions = minibatchpredict(net,XTest);
predictions = scores2label(predictions,labels);
ind = 1;
finalPredictions = repmat(labels(1),length(numSegmentPerObservation),1);
for index = 1:length(numSegmentPerObservation)
    numS = numSegmentPerObservation(index);
    finalPredictions(index) = mode(predictions(ind:ind+numS-1));
    ind = ind + numS;
end
% File-level accuracy (assumes the labels in adsTest match the order of XTest)
accuracy = mean(finalPredictions == adsTest.Labels);
end
References
[1] M. Ravanelli and Y. Bengio, "Speaker Recognition from Raw Waveform with SincNet," 2018 IEEE
Spoken Language Technology Workshop (SLT), Athens, Greece, 2018, pp. 1021-1028, doi: 10.1109/
SLT.2018.8639585.
[2] V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: An ASR corpus based on public
domain audio books," 2015 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), Brisbane, QLD, 2015, pp. 5206-5210, doi: 10.1109/ICASSP.2015.7178964
Acoustics-Based Machine Fault Recognition
In this example, you develop a deep learning model to detect faults in an air compressor using
acoustic measurements. After developing the model, you package the system so that you can
recognize faults based on streaming input data.
Data Preparation
Download and unzip the air compressor data set [1] on page 1-736. This data set consists of
recordings from air compressors in a healthy state or one of seven faulty states.
rng default
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","AirCompressorDataset/AirCo
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"AirCompressorDataset");
Create an audioDatastore object to manage the data and split it into training and validation sets.
You can reduce the training data set used in this example to speed up the runtime at the cost of
performance. In general, reducing the data set is a good practice for development and debugging.
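A representative datastore creation for ads, pointing at the unzipped data set; the labels are derived from the folder names with folders2labels in the next steps.
ads = audioDatastore(dataset,IncludeSubfolders=true);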
speedupExample = false; % set to true to reduce the data set and speed up the example
if speedupExample
lbls = folders2labels(ads.Files);
idxs = splitlabels(lbls,0.2);
ads = subset(ads,idxs{1});
end
The data labels are encoded in the name of the containing folder. To split the data into training and validation sets, use folders2labels and splitlabels.
lbls = folders2labels(ads.Files);
idxs = splitlabels(lbls,0.9);
adsTrain = subset(ads,idxs{1});
labelsTrain = lbls(idxs{1});
adsValidation = subset(ads,idxs{2});
labelsValidation = lbls(idxs{2});
Call countlabels to inspect the distribution of labels in the train and validation sets.
countlabels(labelsTrain)
ans=8×3 table
Label Count Percent
_________ _____ _______
countlabels(labelsValidation)
ans=8×3 table
Label Count Percent
_________ _____ _______
Bearing 22 12.5
Flywheel 22 12.5
Healthy 22 12.5
LIV 22 12.5
LOV 22 12.5
NRV 22 12.5
Piston 22 12.5
Riderbelt 22 12.5
The data consists of time-series recordings of acoustics from faulty or healthy air compressors. As
such, there are strong relationships between samples in time. Listen to a recording and plot the
waveform.
[sampleData,sampleDataInfo] = read(adsTrain);
fs = sampleDataInfo.SampleRate;
soundsc(gather(sampleData),fs)
plot(sampleData)
xlabel("Sample")
ylabel("Amplitude")
title("State: " + string(labelsTrain(1)))
axis tight
Because the samples are related in time, you can use a recurrent neural network (RNN) to model the
data. A long short-term memory (LSTM) network is a popular choice of RNN because it is designed to
avoid vanishing and exploding gradients. Before you can train the network, it's important to prepare
the data adequately. Often, it is best to transform or extract features from 1-dimensional signal data
in order to provide a richer set of features for the model to learn from.
Feature Engineering
The next step is to extract a set of acoustic features used as inputs to the network. Audio Toolbox™
enables you to extract spectral descriptors that are commonly used as inputs in machine learning
tasks. You can extract the features using individual functions, or you can use
audioFeatureExtractor to simplify the workflow and do it all at once. After feature extraction,
orient time along rows which is the expected format for sequences in Deep Learning Toolbox™.
windowLength = 512;
overlapLength = 0;
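A representative audioFeatureExtractor configuration for afe, using the window and overlap lengths defined above and enabling the ten spectral descriptors that appear in the generated extractAudioFeatures function later in this example. The periodic Hamming window is an assumption.
afe = audioFeatureExtractor( ...
    SampleRate=fs, ...
    Window=hamming(windowLength,"periodic"), ...
    OverlapLength=overlapLength, ...
    spectralCentroid=true, ...
    spectralCrest=true, ...
    spectralDecrease=true, ...
    spectralEntropy=true, ...
    spectralFlatness=true, ...
    spectralKurtosis=true, ...
    spectralRolloffPoint=true, ...
    spectralSkewness=true, ...
    spectralSlope=true, ...
    spectralSpread=true);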
tic
trainFeatures = extract(afe,adsTrain);
disp("Feature extraction of train set took " + toc + " seconds.");
Data Augmentation
The training set contains a relatively small number of acoustic recordings for training a deep learning
model. A popular method to enlarge the data set is to use mixup. In mixup, you augment your dataset
by mixing the features and labels from two different class instances. Mixup was reformulated by [2]
on page 1-736 as labels drawn from a probability distribution instead of mixed labels. The
supporting function, mixup on page 1-735, takes the training features, associated labels, and the
number of mixes per observation and then outputs the mixes and associated labels.
numMixesPerInstance = ;
tic
[augData,augLabels] = mixup(trainFeatures,labelsTrain,numMixesPerInstance);
trainLabels = cat(1,labelsTrain,augLabels);
trainFeatures = cat(1,trainFeatures,augData);
disp("Feature augmentation of train set took " + toc + " seconds.");
Train Model
Next, you define and train a network. The pretrained model is also placed in the current folder when
you open this example. To skip training the network, simply continue to the next section on page 1-
720.
Define Network
An LSTM layer learns long-term dependencies between time steps of time series or sequence data.
The first lstmLayer outputs sequence data. Then a dropout layer is used to reduce overfitting. The
second lstmLayer outputs the last step of the time sequence.
numHiddenUnits = ;
dropProb = ;
layers = [ ...
sequenceInputLayer(afe.FeatureVectorLength,Normalization="zscore")
lstmLayer(numHiddenUnits,OutputMode="sequence")
dropoutLayer(dropProb)
lstmLayer(numHiddenUnits,OutputMode="last")
fullyConnectedLayer(numel(unique(labelsTrain)))
softmaxLayer];
miniBatchSize = ;
validationFrequency = floor(numel(trainFeatures)/miniBatchSize);
options = trainingOptions("adam", ...
Metric="accuracy", ...
MiniBatchSize=miniBatchSize, ...
MaxEpochs=100, ...
Plots="training-progress", ...
Verbose=false, ...
Shuffle="every-epoch", ...
LearnRateSchedule="piecewise", ...
LearnRateDropPeriod=30, ...
LearnRateDropFactor=0.1, ...
ValidationData={featuresValidation,labelsValidation}, ...
ValidationFrequency=validationFrequency, ...
ValidationPatience=10, ...
OutputNetwork="best-validation-loss");
Train Network
airCompNet = trainnet(trainFeatures,trainLabels,layers,"crossentropy",options);
Evaluate Network
validationResults = minibatchpredict(airCompNet,featuresValidation);
uniqueLabels = unique(labelsTrain);
validationResults = scores2label(validationResults,uniqueLabels,2);
confusionchart(labelsValidation,validationResults, ...
Title="Accuracy: " + mean(validationResults == labelsValidation)*100 + " (%)");
Once you have a trained network with satisfactory performance, you can apply the network to test
data in a streaming fashion.
There are many additional considerations to take into account when deploying the system to a real-world embedded system.
For example,
• The rate or interval at which classification can be performed with accurate results
• The size of the network in terms of generated code (program memory) and weights (data memory)
• The efficiency of the network in terms of computation speed
In MATLAB, you can mimic how the network is deployed and used in hardware on a real embedded
system and begin to answer these important questions.
Once you train your deep learning model, you will deploy it to an embedded target. That means you
also need to deploy the code used to perform the feature extraction. Use the
generateMATLABFunction method of audioFeatureExtractor to generate a MATLAB function
compatible with C/C++ code generation. Specify IsStreaming as true so that the generated
function is optimized for stream processing.
filename = fullfile(pwd,"extractAudioFeatures");
generateMATLABFunction(afe,filename,IsStreaming=true);
labels = uniqueLabels;
save("AirCompressorFaultRecognitionModel.mat","airCompNet","labels")
Create a function that combines the feature extraction and deep learning classification.
type recognizeAirCompressorFault.m
function scores = recognizeAirCompressorFault(audioIn,rs)
%#codegen
% Extract the spectral features for the current audio frame
features = extractAudioFeatures(audioIn);
persistent airCompNet
if isempty(airCompNet)
airCompNet = coder.loadDeepLearningNetwork('AirCompressorFaultRecognitionModel.mat');
end
if rs
airCompNet = resetState(airCompNet);
end
% Classify
if isa(airCompNet,'dlnetwork')
[scores,state] = predict(airCompNet,dlarray(features,"CT"));
airCompNet.State = state;
else
[airCompNet,scores] = predictAndUpdateState(airCompNet,features);
end
end
% x = read(inputBuffer,512,0);
% featureVector = extractAudioFeatures(x);
% % ... do something with featureVector ...
% end
% end
%
%
% % Example 2: Generate code
% targetDataType = "single";
% codegen extractAudioFeatures -args {ones(512,1,targetDataType)}
% source = dsp.ColoredNoise(OutputDataType=targetDataType);
% inputBuffer = dsp.AsyncBuffer;
% for ii = 1:10
% audioIn = source();
% write(inputBuffer,audioIn);
% while inputBuffer.NumUnreadSamples > 512
% x = read(inputBuffer,512,0);
% featureVector = extractAudioFeatures_mex(x);
% % ... do something with featureVector ...
% end
% end
%
% See also audioFeatureExtractor, dsp.AsyncBuffer, codegen.
dataType = underlyingType(x);
numChannels = size(x,2);
props = coder.const(getProps(dataType));
% Fourier transform
Y = fft(bsxfun(@times,x,props.Window),props.FFTLength);
Z = Y(config.OneSidedSpectrumBins,:);
Zpower = real(Z.*conj(Z));
% Linear spectrum
linearSpectrum = Zpower(config.linearSpectrum.FrequencyBins,:)*config.linearSpectrum.Normalizatio
linearSpectrum([1,end],:) = 0.5*linearSpectrum([1,end],:);
linearSpectrum = reshape(linearSpectrum,[],1,numChannels);
% Spectral descriptors
[featureVector(outputIndex.spectralKurtosis,:),featureVector(outputIndex.spectralSpread,:),featur
featureVector(outputIndex.spectralSkewness,:) = spectralSkewness(linearSpectrum,config.SpectralDe
featureVector(outputIndex.spectralCrest,:) = spectralCrest(linearSpectrum,config.SpectralDescript
featureVector(outputIndex.spectralDecrease,:) = spectralDecrease(linearSpectrum,config.SpectralDe
featureVector(outputIndex.spectralEntropy,:) = spectralEntropy(linearSpectrum,config.SpectralDesc
featureVector(outputIndex.spectralFlatness,:) = spectralFlatness(linearSpectrum,config.SpectralDe
featureVector(outputIndex.spectralRolloffPoint,:) = spectralRolloffPoint(linearSpectrum,config.Sp
featureVector(outputIndex.spectralSlope,:) = spectralSlope(linearSpectrum,config.SpectralDescript
end
config.OneSidedSpectrumBins = uint16(1:257);
linearSpectrumFrequencyBins = 1:257;
config.linearSpectrum.FrequencyBins = uint16(linearSpectrumFrequencyBins);
config.linearSpectrum.NormalizationFactor = cast(2*powerNormalizationFactor,dataType);
FFTLength = cast(props.FFTLength,like=props.SampleRate);
w = (props.SampleRate/FFTLength)*(linearSpectrumFrequencyBins-1);
config.SpectralDescriptorInput.FrequencyVector = cast(w(:),dataType);
outputIndex.spectralCentroid = uint8(1);
outputIndex.spectralCrest = uint8(2);
outputIndex.spectralDecrease = uint8(3);
outputIndex.spectralEntropy = uint8(4);
outputIndex.spectralFlatness = uint8(5);
outputIndex.spectralKurtosis = uint8(6);
outputIndex.spectralRolloffPoint = uint8(7);
outputIndex.spectralSkewness = uint8(8);
outputIndex.spectralSlope = uint8(9);
outputIndex.spectralSpread = uint8(10);
end
Next, you test the streaming classifier in MATLAB. Stream audio one frame at a time to represent a
system as it would be deployed in a real-time embedded system. This enables you to measure and
visualize the timing and accuracy of the streaming implementation.
Stream in several audio files and plot the output classification results for each frame of data. At a
time interval equal to the length of each file, evaluate the output of the classifier.
reset(adsValidation)
N = 10;
labels = categories(labelsValidation);
numLabels = numel(labels);
rs = false;
end
reset(audioSource)
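A minimal sketch of the streaming loop, assuming dsp.AsyncBuffer objects named audioSource, scoreBuffer, and timingBuffer (timingBuffer is read in the timing analysis below) and the recognizeAirCompressorFault function shown earlier; the exact bookkeeping is an assumption.
audioSource = dsp.AsyncBuffer;
scoreBuffer = dsp.AsyncBuffer;
timingBuffer = dsp.AsyncBuffer;
for ii = 1:N
    % Queue the next validation recording for frame-by-frame streaming
    audioIn = read(adsValidation);
    write(audioSource,audioIn);
    rs = true; % reset the network state at the start of each file
    while audioSource.NumUnreadSamples >= windowLength
        x = read(audioSource,windowLength);
        tic
        score = recognizeAirCompressorFault(x,rs);
        write(timingBuffer,toc);
        write(scoreBuffer,score);
        rs = false;
    end
    reset(audioSource)
end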
Compare the test results for the streaming and non-streaming versions of the classifier.
Analyze the execution time. The execution time when state is reset is often above the 32 ms budget.
However, in a real, deployed system, that initialization time will only be incurred once. The execution
time of the main loop is around 10 ms, which is well below the 32 ms budget for real-time
performance.
executionTime = read(timingBuffer)*1000;
budget = (windowLength/afe.SampleRate)*1000;
plot(executionTime(2:end),"o")
title("Execution Time Per Frame")
xlabel("Frame Number")
ylabel("Time (ms)")
yline(budget,"-","Budget",LineWidth=2)
Supporting Functions
function [augData,augLabels] = mixup(data,labels,numMixesPerInstance)
% Mix each observation with randomly chosen observations of a different
% label. The label of each mix is assigned probabilistically based on the
% mixing coefficient lambda.
augData = cell(numel(data)*numMixesPerInstance,1);
augLabels = repelem(labels,numMixesPerInstance);
kk = 1;
for ii = 1:numel(data)
    % Find all observations with a different label
    availableData = find(labels~=labels(ii));
    for jj = 1:numMixesPerInstance
        % Randomly choose one of the available observations
        idx = randi(numel(availableData));
        lambda = max(min((randn./10)+0.5,1),0);
        % Mix.
        augData{kk} = lambda*data{ii} + (1-lambda)*data{availableData(idx)};
        % Take the label of the other observation with probability 1-lambda
        if rand > lambda
            augLabels(kk) = labels(availableData(idx));
        end
        kk = kk + 1;
    end
end
end
References
[1] Verma, Nishchal K., et al. "Intelligent Condition Based Monitoring Using Acoustic Signals for Air
Compressors." IEEE Transactions on Reliability, vol. 65, no. 1, Mar. 2016, pp. 291–309. DOI.org
(Crossref), doi:10.1109/TR.2015.2459684.
[2] Huszar, Ferenc. "Mixup: Data-Dependent Data Augmentation." InFERENCe. November 03, 2017.
Accessed January 15, 2019. https://www.inference.vc/mixup-data-dependent-data-augmentation/.
See Also
Related Examples
• “Compress Machine Fault Recognition Neural Network Using Projection” on page 1-1051
• “Accelerate Audio Machine Learning Workflows Using a GPU” on page 1-249
Acoustics-Based Machine Fault Recognition Code Generation
This example demonstrates code generation for “Acoustics-Based Machine Fault Recognition” on
page 1-714 using a long short-term memory (LSTM) network and spectral descriptors. This example
uses MATLAB® Coder™ with deep learning support to generate a MEX (MATLAB executable)
function that leverages C++ code. The input data consists of acoustics time-series recordings from
faulty or healthy air compressors and the output is the state of the mechanical machine predicted by
the LSTM network. For details on audio preprocessing and network training, see “Acoustics-Based
Machine Fault Recognition” on page 1-714.
Specify a sample rate fs of 16 kHz and a windowLength of 512 samples, as defined in “Acoustics-
Based Machine Fault Recognition” on page 1-714. Set numFrames to 100.
fs = 16000;
windowLength = 512;
numFrames = 100;
To run the example on a test signal, generate a pink noise signal. To test the performance of the
system on a real dataset, download the air compressor dataset [1] on page 1-743.
downloadDataset =
if ~downloadDataset
pinkNoiseSignal = pinknoise(windowLength*numFrames,'single');
else
% Download AirCompressorDataset.zip
component = 'audio';
filename = 'AirCompressorDataset/AirCompressorDataset.zip';
localfile = matlab.internal.examples.downloadSupportFile(component,filename);
% Unzip the data set and point an audioDatastore at it (assumes the same
% folder layout as in the training example)
unzip(localfile,tempdir)
dataset = fullfile(tempdir,'AirCompressorDataset');
dataStore = audioDatastore(dataset,'IncludeSubfolders',true,'LabelSource','foldernames');
% Use countEachLabel to get the number of samples of each category in the dataset.
countEachLabel(dataStore)
end
ans=8×2 table
Label Count
_________ _____
Bearing 225
Flywheel 225
Healthy 225
LIV 225
LOV 225
NRV 225
Piston 225
Riderbelt 225
audioSource = dsp.AsyncBuffer;
scoreBuffer = dsp.AsyncBuffer;
model = load('AirCompressorFaultRecognitionModel.mat');
labels = string(model.labels);
Initialize signalToBeTested to pinkNoiseSignal or select a signal from the drop-down list to test
the file of your choice from the dataset.
if ~downloadDataset
signalToBeTested = pinkNoiseSignal;
else
[allFiles,~] = splitEachLabel(dataStore,1);
allData = readall(allFiles);
signalToBeTested = ;
signalToBeTested = cell2mat(signalToBeTested);
end
Stream one audio frame at a time to represent the system as it would be deployed in a real-time
embedded system. Use recognizeAirCompressorFault developed in “Acoustics-Based Machine
Fault Recognition” on page 1-714 to compute audio features and perform deep learning classification.
write(audioSource,signalToBeTested);
resetNetworkState = true;
resetNetworkState = false;
end
scores = read(scoreBuffer);
[~,labelIndex] = max(scores(end,:),[],2);
detectedFault = labels(labelIndex)
detectedFault =
"LOV"
plot(scores)
legend("" + labels,Location="northwest")
xlabel("Time Step")
ylabel("Score")
str = sprintf("Predicted Scores Over Time Steps.\nPredicted Class: %s",detectedFault);
title(str)
Create a code generation configuration object to generate a MEX function. Specify the target language as C++.
cfg = coder.config('mex');
cfg.TargetLang = 'C++';
audioFrame = ones(windowLength,1,"single");
Call the codegen (MATLAB Coder) function from MATLAB Coder to generate C++ code for the
recognizeAirCompressorFault function. Specify the configuration object and prototype
arguments. A MEX-file named recognizeAirCompressorFault_mex is generated to your current
folder.
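A representative codegen call, mirroring the command used for PIL profiling in the related Raspberry Pi example; the report flag and the literal logical argument are assumptions.
codegen -config cfg recognizeAirCompressorFault -args {audioFrame,true} -report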
Initialize signalToBeTested to pinkNoiseSignal or select a signal from the drop-down list to test
the file of your choice from the dataset.
if ~downloadDataset
signalToBeTested = pinkNoiseSignal;
else
[allFiles,~] = splitEachLabel(dataStore,1);
allData = readall(allFiles);
signalToBeTested = ;
signalToBeTested = cell2mat(signalToBeTested);
end
Stream one audio frame at a time to represent the system as it would be deployed in a real-time
embedded system. Use generated recognizeAirCompressorFault_mex to compute audio features
and perform deep learning classification.
write(audioSource,signalToBeTested);
resetNetworkState = true;
resetNetworkState = false;
end
scores = read(scoreBuffer);
[~,labelIndex] = max(scores(end,:),[],2);
detectedFault = labels(labelIndex)
detectedFault =
"Healthy"
plot(scores)
legend(labels,Location="northwest")
xlabel("Time Step")
ylabel("Score")
str = sprintf("Predicted Scores Over Time Steps.\nPredicted Class: %s",detectedFault);
title(str)
Use tic and toc to measure the execution time of MATLAB function
recognizeAirCompressorFault and MATLAB executable (MEX)
recognizeAirCompressorFault_mex.
Use the same recording that you chose in the previous section as input to the recognizeAirCompressorFault function and its MEX equivalent, recognizeAirCompressorFault_mex.
write(audioSource,signalToBeTested);
resetNetworkState = false;
end
Plot the execution time for each frame and analyze the profile. The first call of recognizeAirCompressorFault_mex consumes around four times the budget because it includes loading the network and resetting the states. However, in a real, deployed system, that initialization time is incurred only once. The execution time of the MATLAB function is around 10 ms and that of the MEX function is around 1 ms, both well below the 32 ms budget for real-time performance.
budget = (windowLength/fs)*1000;
timingMATLAB = read(timingBufferMATLAB)*1000;
timingMEX = read(timingBufferMEX)*1000;
frameNumber = 1:numel(timingMATLAB);
perfGain = timingMATLAB./timingMEX;
plot(frameNumber,timingMATLAB,frameNumber,timingMEX,LineWidth=2)
grid on
yline(budget,'',"Budget",LineWidth=2)
legend("MATLAB Function","MEX Function",Location="northwest")
xlabel("Time Step")
ylabel("Execution Time (in ms)")
title("Execution Time Profile of MATLAB and MEX Function")
Compute the performance gain of MEX over MATLAB function excluding the first call.
PerformanceGain = sum(timingMATLAB(2:end))/sum(timingMEX(2:end))
PerformanceGain = 19.4074
This example ends here. For deploying machine fault recognition on Raspberry Pi, see “Acoustics-
Based Machine Fault Recognition Code Generation on Raspberry Pi” on page 1-744.
References
[1] Verma, Nishchal K., et al. "Intelligent Condition Based Monitoring Using Acoustic Signals for Air
Compressors." IEEE Transactions on Reliability, vol. 65, no. 1, Mar. 2016, pp. 291–309. DOI.org
(Crossref), doi:10.1109/TR.2015.2459684.
Acoustics-Based Machine Fault Recognition Code Generation on Raspberry Pi
This example demonstrates code generation for “Acoustics-Based Machine Fault Recognition” on
page 1-714 using a long short-term memory (LSTM) network and spectral descriptors. This example
uses MATLAB® Coder™, MATLAB Coder Interface for Deep Learning, and MATLAB Support Package for Raspberry Pi™ Hardware to generate a standalone executable (.elf) file on a Raspberry Pi that leverages the performance of the ARM® Compute Library. The input data consists of acoustics time-
series recordings from faulty or healthy air compressors and the output is the state of the mechanical
machine predicted by the LSTM network. This standalone executable on Raspberry Pi runs the
streaming classifier on the input data received from MATLAB and sends the computed scores for each
label to MATLAB. Interaction between MATLAB script and the executable on your Raspberry Pi is
handled using the user datagram protocol (UDP). For more details on audio preprocessing and
network training, see “Acoustics-Based Machine Fault Recognition” on page 1-714.
Example Requirements
For supported versions of libraries and for information about setting up environment variables, see
“Prerequisites for Deep Learning with MATLAB Coder” (MATLAB Coder).
Specify a sample rate fs of 16 kHz and a windowLength of 512 samples, as defined in “Acoustics-
Based Machine Fault Recognition” on page 1-714. Set numFrames to 100.
fs = 16000;
windowLength = 512;
numFrames = 100;
To run the example on a test signal, generate a pink noise signal. To test the performance of the
system on a real dataset, download the air compressor dataset [1] on page 1-754.
downloadDataset =
if ~downloadDataset
pinkNoiseSignal = pinknoise(windowLength*numFrames);
else
% Download AirCompressorDataset.zip
component = 'audio';
filename = 'AirCompressorDataset/AirCompressorDataset.zip';
localfile = matlab.internal.examples.downloadSupportFile(component,filename);
% Use countEachLabel to get the number of samples of each category in the dataset.
countEachLabel(dataStore)
end
To run the streaming classifier in MATLAB, download and unzip the system developed in “Acoustics-
Based Machine Fault Recognition” on page 1-714.
component = 'audio';
filename = 'AcousticsBasedMachineFaultRecognition/AcousticsBasedMachineFaultRecognition.zip';
localfile = matlab.internal.examples.downloadSupportFile(component,filename);
downloadFolder = fullfile(fileparts(localfile),'system');
if ~exist(downloadFolder,'dir')
unzip(localfile,downloadFolder)
end
Load the pretrained network and extract labels from the network.
airCompNet = coder.loadDeepLearningNetwork('AirCompressorFaultRecognitionModel.mat');
labels = string(airCompNet.Layers(end).Classes);
Initialize signalToBeTested to pinkNoiseSignal or select a signal from the drop-down list to test
the file of your choice from the dataset.
if ~downloadDataset
signalToBeTested = pinkNoiseSignal;
else
[allFiles,~] = splitEachLabel(dataStore,1);
allData = readall(allFiles);
signalToBeTested = ;
signalToBeTested = cell2mat(signalToBeTested);
end
Stream one audio frame at a time to represent the system as it would be deployed in a real-time
embedded system. Use recognizeAirCompressorFault developed in “Acoustics-Based Machine
Fault Recognition” on page 1-714 to compute audio features and perform deep learning classification.
write(audioSource,signalToBeTested);
resetNetworkState = true;
resetNetworkState = false;
end
scores = read(scoreBuffer);
[~,labelIndex] = max(scores(end,:),[],2);
detectedFault = labels(labelIndex)
detectedFault =
"Flywheel"
plot(scores)
legend("" + labels,'Location','northwest')
xlabel("Time Step")
ylabel("Score")
str = sprintf("Predicted Scores Over Time Steps.\nPredicted Class: %s",detectedFault);
title(str)
reset(audioSource)
This example uses the dsp.UDPSender System object to send the audio frame to the executable
running on Raspberry Pi and the dsp.UDPReceiver System object to receive the score vector from
the Raspberry Pi. Create a dsp.UDPSender system object to send audio captured in MATLAB to your
Raspberry Pi. Set the targetIPAddress to the IP address of your Raspberry Pi. Set the
RemoteIPPort to 25000. Raspberry Pi receives the input audio frame from the same port using the
dsp.UDPReceiver system object.
targetIPAddress = '172.31.164.247';
UDPSend = dsp.UDPSender('RemoteIPPort',25000,'RemoteIPAddress',targetIPAddress);
Create a dsp.UDPReceiver system object to receive predicted scores from your Raspberry Pi. Each
UDP packet received from the Raspberry Pi is a vector of scores and each vector element is a score
for a state of the air compressor. The maximum message length for the dsp.UDPReceiver object is
65507 bytes. Calculate the buffer size to accommodate the maximum number of UDP packets.
sizeOfDoubleInBytes = 8;
numScores = 8;
maxUDPMessageLength = floor(65507/sizeOfDoubleInBytes);
numPackets = floor(maxUDPMessageLength/numScores);
bufferSize = numPackets*numScores*sizeOfDoubleInBytes;
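A representative dsp.UDPReceiver configuration that uses the buffer size computed above. The object name UDPReceive and the local port number are assumptions; the port must match the port that the executable on the Raspberry Pi sends to.
UDPReceive = dsp.UDPReceiver('LocalIPPort',21000, ...
    'MessageDataType','double', ...
    'MaximumMessageLength',numScores, ...
    'ReceiveBufferSize',bufferSize);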
type recognizeAirCompressorFaultRaspi
function recognizeAirCompressorFaultRaspi(hostIPAddress)
% This function receives acoustic input using dsp.UDPReceiver and runs a
% streaming classifier by calling recognizeAirCompressorFault, developed in
% the Acoustics-Based Machine Fault Recognition - MATLAB Example.
% Computed scores are sent to MATLAB using dsp.UDPSender.
%#codegen
frameLength = 512;
while true
% Receive audio frame of size frameLength x 1
x = UDPReceiveRaspi();
if(~isempty(x))
x = x(1:frameLength,1);
resetNetworkState = false;
end
end
Replace the hostIPAddress with your machine's address. Your Raspberry Pi sends the predicted
scores to the IP address you specify.
hostIPAddress = coder.Constant('172.18.230.30');
Create a code generation configuration object to generate an executable program. Specify the target
language as C++.
cfg = coder.config('exe');
cfg.TargetLang = 'C++';
Create a configuration object for deep learning code generation with the ARM compute library that is
on your Raspberry Pi. Specify the architecture of the Raspberry Pi and attach the deep learning
configuration object to the code generation configuration object.
dlcfg = coder.DeepLearningConfig('arm-compute');
dlcfg.ArmArchitecture = 'armv7';
dlcfg.ArmComputeVersion = '20.02.1';
cfg.DeepLearningConfig = dlcfg;
Use the Raspberry Pi Support Package function raspi to create a connection to your Raspberry Pi. In the call below, replace the host name, user name, and password with the values for your Raspberry Pi.
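A representative connection call; the host name, user name, and password shown are placeholders.
r = raspi('raspiname','pi','password');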
Create a coder.hardware (MATLAB Coder) object for Raspberry Pi and attach it to the code
generation configuration object.
hw = coder.hardware('Raspberry Pi');
cfg.Hardware = hw;
Call the codegen (MATLAB Coder) function from MATLAB Coder to generate C++ code and the
executable on your Raspberry Pi. By default, the Raspberry Pi executable has the same name as the
MATLAB function. You get a warning in the code generation logs that you can disregard because
recognizeAirCompressorFaultRaspi has an infinite loop that looks for an audio frame from
MATLAB.
codegen -config cfg recognizeAirCompressorFaultRaspi -args {hostIPAddress} -report
applicationName = 'recognizeAirCompressorFaultRaspi';
applicationDirPaths = raspi.utils.getRemoteBuildDirectory('applicationName',applicationName);
targetDirPath = applicationDirPaths{1}.directory;
exeName = strcat(applicationName,'.elf');
command = ['cd ',targetDirPath,'; ./',exeName,' &> 1 &'];
system(r,command);
Initialize signalToBeTested to pinkNoiseSignal or select a signal from the drop-down list to test
the file of your choice from the dataset.
if ~downloadDataset
signalToBeTested = pinkNoiseSignal;
else
[allFiles,~] = splitEachLabel(dataStore,1);
allData = readall(allFiles);
signalToBeTested = ;
signalToBeTested = cell2mat(signalToBeTested);
end
Stream one audio frame at a time to represent the system as it would be deployed in a real-time embedded system. Send each frame to the standalone executable running on your Raspberry Pi using the UDP sender, and use the UDP receiver to collect the predicted scores.
write(audioSource,signalToBeTested);
detectedFault =
"Flywheel"
xlabel("Time Step")
ylabel("Score")
str = sprintf("Predicted Scores Over Time Steps.\nPredicted Class: %s",detectedFault);
title(str)
To evaluate the execution time taken by the standalone executable on the Raspberry Pi, use a processor-in-the-loop (PIL) workflow. To perform PIL profiling, generate a PIL function for the supporting function recognizeAirCompressorFault.
if (~exist('r','var'))
r = raspi('raspiname','pi','password');
end
hw = coder.hardware('Raspberry Pi');
cfg.Hardware = hw;
buildDir = '~/remoteBuildDir';
cfg.Hardware.BuildDir = buildDir;
cfg.TargetLang = 'C++';
Enable profiling and generate the PIL code. A MEX file named
recognizeAirCompressorFault_pil is generated in your current folder.
cfg.CodeExecutionProfiling = true;
audioFrame = ones(windowLength,1);
resetNetworkStateFlag = true;
codegen -config cfg recognizeAirCompressorFault -args {audioFrame,resetNetworkStateFlag}
Call the generated PIL function 50 times to get the average execution time.
totalCalls = 50;
for k = 1:totalCalls
x = pinknoise(windowLength,1);
score = recognizeAirCompressorFault_pil(x,resetNetworkStateFlag);
resetNetworkStateFlag = false;
end
clear recognizeAirCompressorFault_pil
### Host application produced the following standard output (stdout) and standard error (stderr)
executionProfile = getCoderExecutionProfile('recognizeAirCompressorFault');
report(executionProfile, ...
'Units','Seconds', ...
'ScaleFactor','1e-03', ...
'NumericFormat','%0.4f');
References
[1] Verma, Nishchal K., et al. "Intelligent Condition Based Monitoring Using Acoustic Signals for Air
Compressors." IEEE Transactions on Reliability, vol. 65, no. 1, Mar. 2016, pp. 291–309. DOI.org
(Crossref), doi:10.1109/TR.2015.2459684.
See Also
Related Examples
• “Compress Machine Fault Recognition Neural Network Using Projection” on page 1-1051
audioDatastore Object Pointing to Audio Files
To create an audioDatastore object, first specify the file path to the audio samples included with
Audio Toolbox™.
folder = fullfile(matlabroot,'toolbox','audio','samples');
Create an audioDatastore object that points to the specified folder of audio files.
ADS = audioDatastore(folder)
ADS =
audioDatastore with properties:
Files: {
'B:\matlab\toolbox\audio\samples\Ambiance-16-44p1-mono-12secs.wav';
'B:\matlab\toolbox\audio\samples\AudioArray-16-16-4channels-20secs.
' ...\toolbox\audio\samples\ChurchImpulseResponse-16-44p1-mono-5sec
... and 36 more
}
Folders: {
'B:\matlab\toolbox\audio\samples'
}
AlternateFileSystemRoots: {}
OutputDataType: 'double'
OutputEnvironment: 'cpu'
Labels: {}
SupportedOutputFormats: ["wav" "flac" "ogg" "opus" "mp3" "mp4" "m4a"]
DefaultOutputFormat: "wav"
Generate a subset of the audio datastore that only includes audio files containing 'Guitar' in the
file name.
fileContainsGuitar = cellfun(@(c)contains(c,'Guitar'),ADS.Files);
ADSsubset = subset(ADS,fileContainsGuitar)
ADSsubset =
audioDatastore with properties:
Files: {
'B:\matlab\toolbox\audio\samples\RockGuitar-16-44p1-stereo-72secs.w
'B:\matlab\toolbox\audio\samples\RockGuitar-16-96-stereo-72secs.fla
'B:\matlab\toolbox\audio\samples\SoftGuitar-44p1_mono-10mins.ogg'
}
Folders: {
'B:\matlab\toolbox\audio\samples'
}
AlternateFileSystemRoots: {}
OutputDataType: 'double'
OutputEnvironment: 'cpu'
Labels: {}
SupportedOutputFormats: ["wav" "flac" "ogg" "opus" "mp3" "mp4" "m4a"]
DefaultOutputFormat: "wav"
Use the subset audio datastore as the source for a labeledSignalSet object.
audioLabSigSet = labeledSignalSet(ADSsubset)
audioLabSigSet =
labeledSignalSet with properties:
Open Signal Labeler and use Import From Workspace to import the labeledSignalSet.
Accelerate Audio Deep Learning Using GPU-Based Feature Extraction
In this example, you leverage a GPU for feature extraction and augmentation to decrease the time
required to train a deep learning model. The model you train is a convolutional neural network (CNN)
for acoustic fault recognition.
Audio Toolbox™ includes gpuArray (Parallel Computing Toolbox) support for most feature
extractors, including popular ones such as melSpectrogram and mfcc. For an overview of GPU
support, see “Code Generation and GPU Support”.
Download and unzip the air compressor data set [1] on page 1-766. This data set consists of
recordings from air compressors in a healthy state or one of seven faulty states.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","AirCompressorDataset/AirCo
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"AirCompressorDataset");
Create an audioDatastore object to manage the data and split it into training and validation sets.
ads = audioDatastore(dataset,IncludeSubfolders=true,LabelSource="foldernames");
rng default
[adsTrain,adsValidation] = splitEachLabel(ads,0.8);
uniqueLabels = unique(adsTrain.Labels);
tblTrain = countEachLabel(adsTrain);
tblValidation = countEachLabel(adsValidation);
H = bar(uniqueLabels,[tblTrain.Count, tblValidation.Count],"stacked");
legend(H,["Training Set","Validation Set"],Location="NorthEastOutside")
Select random examples from the training set for plotting. Each recording has 50,000 samples
sampled at 16 kHz.
t = (0:5e4-1)/16e3;
tiledlayout(4,2,TileSpacing="compact",Padding="compact")
for n = 1:numel(uniqueLabels)
idx = find(adsTrain.Labels==uniqueLabels(n));
[x,fs] = audioread(adsTrain.Files{idx(randperm(numel(idx),1))});
nexttile
plotHandle = plot(t,x);
if n == 7 || n == 8
xlabel("Seconds");
else
set(gca,xtick=[])
end
title(string(uniqueLabels(n)));
end
In this example, you perform feature extraction and data augmentation while training the network. In
this section, you define the feature extraction and augmentation pipeline and compare the speed of
the pipeline executed on a CPU against the speed of the pipeline executed on a GPU. The output of
this pipeline is the input to the CNN you train.
Create an audioFeatureExtractor object to extract log-mel spectrograms using 200 ms windows with a 5 ms hop. The output from extract is a numHops-by-128-by-1 array.
afe = audioFeatureExtractor(SampleRate=fs, ...
FFTLength=4096, ...
Window=hann(round(fs*0.2),"periodic"), ...
OverlapLength=round(fs*0.195), ...
melSpectrum=true);
setExtractorParameters(afe,"melSpectrum",NumBands=128,ApplyLog=true);
featureVector = extract(afe,x);
[numHops,numFeatures,numChannels] = size(featureVector)
numHops = 586
numFeatures = 128
numChannels = 1
Deep learning methods are data-hungry, and the training dataset in this example is relatively small.
Use the mixup [2] on page 1-766 augmentation technique to effectively enlarge the training set. In
mixup, you merge the features extracted from two audio signals as a weighted sum. The two signals
have different labels, and the label of the merged feature matrix is assigned probabilistically based on the mixing coefficient. The mixup augmentation is implemented in the supporting object,
Mixup on page 1-764.
Create two versions of the pipeline for comparison: one that executes the pipeline on your CPU, and
one that converts the raw audio signal to a gpuArray so that the pipeline is executed on your GPU.
offset = eps;
adsTrainCPU = transform(adsTrain,@(x)extract(afe,x));
mixerCPU = Mixup(adsTrainCPU);
adsTrainCPU = transform(adsTrainCPU,@(x,info)mix(mixerCPU,x,info),IncludeInfo=true);
adsTrainGPU = transform(adsTrain,@gpuArray);
adsTrainGPU = transform(adsTrainGPU,@(x)extract(afe,x));
mixerGPU = Mixup(adsTrainGPU);
adsTrainGPU = transform(adsTrainGPU,@(x,info)mix(mixerGPU,x,info),IncludeInfo=true);
For the validation set, apply the feature extraction pipeline but not the augmentation. Because you
are not applying mixup, create a combined datastore to output a cell array containing the features
and the label. Again, create one validation pipeline that executes on your GPU and one validation
pipeline that executes on your CPU.
adsValidationGPU = transform(adsValidation,@gpuArray);
adsValidationGPU = transform(adsValidationGPU,@(x){extract(afe,x)});
adsValidationGPU = combine(adsValidationGPU,arrayDatastore(adsValidation.Labels));
adsValidationCPU = transform(adsValidation,@(x){extract(afe,x)});
adsValidationCPU = combine(adsValidationCPU,arrayDatastore(adsValidation.Labels));
Compare the time it takes for the CPU and a single GPU to extract features and perform data
augmentation.
tic
for ii = 1:numel(adsTrain.Files)
x = read(adsTrainCPU);
end
cpuPipeline = toc;
reset(adsTrainCPU)
tic
for ii = 1:numel(adsTrain.Files)
x = read(adsTrainGPU);
end
wait(gpuDevice) % Ensure all calculations are completed
gpuPipeline = toc;
reset(adsTrainGPU)
disp(["Read, extract, and augment train set (CPU): "+cpuPipeline+" seconds"; ...
"Read, extract, and augment train set (GPU): "+gpuPipeline+" seconds"; ...
"Speedup (CPU time)/(GPU time): "+cpuPipeline/gpuPipeline]);
Reading from the datastore accounts for a significant portion of the overall pipeline time, so a comparison of just extraction and augmentation shows an even greater speedup. Compare feature extraction alone on the GPU versus on the CPU.
x = read(ads);
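A minimal sketch of the extraction-only comparison, using timeit and gputimeit as the timing functions (an assumption about how the original measurement was made).
tExtractCPU = timeit(@()extract(afe,x));
tExtractGPU = gputimeit(@()extract(afe,gpuArray(x)));
disp(["Extract features from one signal (CPU): "+tExtractCPU+" seconds"; ...
    "Extract features from one signal (GPU): "+tExtractGPU+" seconds"; ...
    "Speedup (CPU time)/(GPU time): "+tExtractCPU/tExtractGPU])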
Define Network
Define a convolutional neural network that takes the augmented mel spectrogram as input. This
network applies a single convolutional layer consisting of 48 filters with 3-by-3 kernels, followed by a
batch normalization layer and a ReLU activation layer. The time dimension is then collapsed using a
max pooling layer. Finally, the output of the pooling layer is reduced using a fully connected layer
followed by a softmax layer. See “List of Deep Learning Layers” (Deep Learning Toolbox) for more
information.
numClasses = numel(categories(adsTrain.Labels));
imageSize = [numHops,afe.FeatureVectorLength];
layers = [
imageInputLayer(imageSize,Normalization="none")
convolution2dLayer(3,48,Padding="same")
batchNormalizationLayer
reluLayer
maxPooling2dLayer([numHops,1])
fullyConnectedLayer(numClasses)
softmaxLayer
];
To define the training options, use trainingOptions (Deep Learning Toolbox). Set the
ExecutionEnvironment to gpu to leverage your GPU while training the network. The computer
used in this example uses a Titan V GPU device.
miniBatchSize = 128;
options = trainingOptions("adam", ...
Shuffle="every-epoch", ...
MaxEpochs=20, ...
LearnRateSchedule="piecewise", ...
LearnRateDropPeriod=15, ...
LearnRateDropFactor=0.2, ...
MiniBatchSize=miniBatchSize, ...
Plots="training-progress", ...
Verbose=false, ...
ValidationData=adsValidationCPU, ...
ValidationFrequency=2*ceil(numel(adsTrain.Files)/miniBatchSize), ...
ExecutionEnvironment="gpu", ...
Metrics="accuracy");
Train Network
Call trainnet to train the network using your CPU for the feature extraction pipeline. The execution
environment for the network training is your GPU.
tic
net = trainnet(adsTrainCPU,layers,"crossentropy",options);
cpuTrainTime = toc;
Replace the validation data in the training options with the GPU-based pipeline. Train the network
using your GPU for the feature extraction pipeline. The execution environment for the network
training is your GPU.
options.ValidationData = adsValidationGPU;
tic
net = trainnet(adsTrainGPU,layers,"crossentropy",options);
gpuTrainTime = toc;
Print the timing results for training using a CPU for feature extraction and augmentation, and
training using a GPU for feature extraction and augmentation.
disp(["Training time (CPU): "+cpuTrainTime+" seconds";
"Training time (GPU): "+gpuTrainTime+" seconds";
"Speedup (CPU time)/(GPU time): "+cpuTrainTime/gpuTrainTime])
Compare the time it takes to perform prediction on a single 3-second clip when feature extraction is
performed on the GPU versus the CPU. In both cases, the network prediction happens on your GPU.
signalToClassify = read(ads);
gpuFeatureExtraction = gputimeit(@()predict(net,extract(afe,gpuArray(signalToClassify))));
cpuFeatureExtraction = gputimeit(@()predict(net,extract(afe,(signalToClassify))));
Compare the time it takes to perform prediction on a set of 3-second clips when feature extraction is
performed on the GPU versus the CPU. In both cases, the network prediction happens on your GPU.
adsValidationGPU = transform(adsValidation,@(x)gpuArray(x));
adsValidationGPU = transform(adsValidationGPU,@(x){extract(afe,x)});
adsValidationCPU = transform(adsValidation,@(x){extract(afe,x)});
gpuFeatureExtraction = gputimeit(@()minibatchpredict(net,adsValidationGPU,ExecutionEnvironment="g
cpuFeatureExtraction = gputimeit(@()minibatchpredict(net,adsValidationCPU,ExecutionEnvironment="g
"Prediction time for validation set (feature extraction on CPU): 15.8432 seconds"
"Prediction time for validation set (feature extraction on GPU): 2.4479 seconds"
"Speedup (CPU time)/(GPU time): 6.4722"
Conclusion
It is well known that you can decrease the time it takes to train a network by leveraging GPU devices.
This enables you to more quickly iterate and develop your final system. In many training setups, you
can achieve additional performance gains by leveraging GPU devices for feature extraction and data
augmentation. This example shows a significant decrease in the overall time it takes to train a CNN
when leveraging GPU devices for feature extraction and data augmentation. Additionally, leveraging
GPU devices for feature extraction at inference time, for both single-observations and data sets,
achieves significant performance gains.
Supporting Functions
Mixup
The supporting object, Mixup, is placed in your current folder when you open this example.
type Mixup
properties (SetAccess=public,GetAccess=public)
%MixProbability Mix probability
% Specify the probability that mixing is applied as a scalar in the
% range [0,1]. If unspecified, MixProbability defaults to 1/3.
MixProbability (1,1) {mustBeNumeric} = 1/3;
end
properties (SetAccess=immutable,GetAccess=public)
%AUGDATASTORE Augmentation datastore
% Specify a datastore from which to get the mixing signals. The
% datastore must contain a label in the info returned from reading.
% This property is immutable, meaning it cannot be changed after
% construction.
AugDatastore
end
methods
function obj = Mixup(augDatastore)
obj.AugDatastore = augDatastore;
end
dataOut = [{x},{infoIn.Label}];
infoOut = infoIn;
end
end
end
end
References
[1] Verma, Nishchal K., et al. "Intelligent Condition Based Monitoring Using Acoustic Signals for Air
Compressors." IEEE Transactions on Reliability, vol. 65, no. 1, Mar. 2016, pp. 291–309. DOI.org
(Crossref), doi:10.1109/TR.2015.2459684.
[2] Huszar, Ferenc. "Mixup: Data-Dependent Data Augmentation." InFERENCe. November 03, 2017.
Accessed January 15, 2019. https://www.inference.vc/mixup-data-dependent-data-augmentation/.
See Also
gpuArray | audioFeatureExtractor | audioDatastore
Related Examples
• “Accelerate Audio Machine Learning Workflows Using a GPU” on page 1-249
• “Train Spoken Digit Recognition Network Using Out-of-Memory Audio Data” on page 1-653
• “Run MATLAB Functions on a GPU” (Parallel Computing Toolbox)
Train 3-D Sound Event Localization and Detection (SELD) Using Deep Learning
In this example, you train a deep learning model to perform sound localization and event detection
from ambisonic data. The model consists of two independently trained convolutional recurrent neural
networks (CRNN) [1] on page 1-783: one for sound event detection (SED), and one for direction of
arrival (DOA) estimation. To explore the models trained in this example, see “3-D Sound Event
Localization and Detection Using Trained Recurrent Convolutional Neural Network” on page 1-794.
Introduction
Ambisonics is a popular 3-D sound format that has shown promise in tasks like sound source
localization, speech enhancement, and source separation. Ambisonics is a full sphere surround sound
format that contains a speaker-independent sound field representation (B-format). First order B-
format ambisonic recordings contain components that correspond to the sound pressure captured by
an omnidirectional microphone (W) and sound pressure gradients X, Y, and Z that correspond to
front/back, left/right, and up/down captured by figure-of-eight capsules oriented along the three
spatial axes. 3-D SELD has applications in virtual reality, robotics, smart homes, and defense.
You will train two separate models for the sound event detection task and the localization task. Both
models are based on the convolutional recurrent neural network architecture described in [1] on
page 1-783. The sound event detection task is formulated as a classification task. The sound event
localization task estimates Cartesian coordinates of the sound source and is formulated as a
regression task. You use the L3DAS21 data set [2] on page 1-784 to train and validate the networks.
To explore the models trained in this example, see “3-D Sound Event Localization and Detection
Using Trained Recurrent Convolutional Neural Network” on page 1-794.
This example uses a subset of the L3DAS21 Task 2 challenge data set [2] on page 1-784. The data
set contains multiple-source and multiple-perspective (MSMP) B-format ambisonic audio recordings
collected at a sampling rate of 32 kHz. The train and validation splits are provided with the data set.
Each recording is one minute long and contains a simulated 3-D audio environment in which up to three acoustic events may be active simultaneously. In this example, you use only the data that contains non-overlapping sounds. The sound events belong to 14 sound classes. The labels are provided as CSV files that contain the sound class, the Cartesian coordinates of the sound source, and the onset and offset time stamps.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","L3DAS21_ov1.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"L3DAS21_ov1");
To train the networks with the entire data set and achieve a reasonable performance, set
speedupExample to false. To run this example quickly, set speedupExample to true.
speedupExample = false; % live-script control: set to true to run the example quickly
Create Datastores
Create audioDatastore objects to ingest the data. Each data point in the data set consists of two B-
format ambisonic recordings that correspond to the two microphones (A and B). For each data folder
(train and validation), use subset to create two subsets corresponding to the two microphones.
adsTrain = audioDatastore(fullfile(dataset,"train","data"));
adsTrainA = subset(adsTrain,cellfun(@(c)endsWith(c,"A.wav"),adsTrain.Files));
adsTrainB = subset(adsTrain,cellfun(@(c)endsWith(c,"B.wav"),adsTrain.Files));
adsValidation = audioDatastore(fullfile(dataset,"validation","data"));
adsValidationA = subset(adsValidation,cellfun(@(c)endsWith(c,"A.wav"),adsValidation.Files));
adsValidationB = subset(adsValidation,cellfun(@(c)endsWith(c,"B.wav"),adsValidation.Files));
if speedupExample
adsTrainA = subset(adsTrainA,1:2);
adsTrainB = subset(adsTrainB,1:2);
end
Inspect Data
micA = preview(adsTrainA);
micB = preview(adsTrainB);
tiledlayout(4,2,TileSpacing="tight")
nexttile
plot(micA(:,1))
title("Microphone A")
ylabel("W")
nexttile
plot(micB(:,1))
title("Microphone B")
nexttile
plot(micA(:,2))
ylabel("X")
nexttile
plot(micB(:,2))
nexttile
plot(micA(:,3))
ylabel("Y")
nexttile
plot(micB(:,3))
nexttile
plot(micA(:,4))
ylabel("Z")
nexttile
plot(micB(:,4))
microphone = 1; % live-script control: 1 (microphone A) or 2 (microphone B)
channel = 1;    % live-script control: 1 (W), 2 (X), 3 (Y), or 4 (Z)
duration = 5;   % live-script control: listening duration, in seconds
fs = 32e3; % Known sampling rate of data.
s = [micA,micB];
data = s(1:round(duration*fs),channel + (microphone-1)*4);
sound(data,fs)
Create Targets
Each data point in the data set has a corresponding CSV file containing the sound event class, the
start and end times of the sound, and the location of the sound. Create a container to map between
the sound classes and integers.
keySet = ["Chink_and_clink","Computer_keyboard","Cupboard_open_or_close","Drawer_open_or_close",
"Female_speech_and_woman_speaking","Finger_snapping","Keys_jangling","Knock","Laughter", ...
"Male_speech_and_man_speaking","Printer","Scissors","Telephone","Writing"];
valueSet = {1,2,3,4,5,6,7,8,9,10,11,12,13,14};
params.SoundClasses = containers.Map(keySet,valueSet);
Create a tabularTextDatastore to ingest the train file labels. Make sure the label files are in the
same order as the data files. Preview a label file from the datastore.
[folder,fn] = fileparts(adsTrainA.Files);
targetPath = fullfile(strrep(folder,filesep+"data",filesep+"labels"),"label_" + strrep(fn,"_A","") + ".csv");
ttdsTrain = tabularTextDatastore(targetPath);
labelTable = preview(ttdsTrain)
labelTable = 8×7 table (variables: File, Start, End, Class, X, Y, Z)
The labels in the dataset are provided with time stamps in seconds. To create targets and train a
network, you need to map the time stamps to frames. The total duration of each file is 60 seconds.
You will divide each file into 600 frames for the target, meaning the model will make a prediction
every 0.1 seconds.
params.Targets.TotalDuration = 60;
params.Targets.NumFrames = 600;
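For instance, with these parameters the frame duration is 0.1 seconds, and a hypothetical event with onset 1.23 seconds and offset 2.75 seconds maps to frames 13 through 28. This mapping is a sketch; the supporting functions used below perform the actual conversion.
frameDuration = params.Targets.TotalDuration/params.Targets.NumFrames; % 0.1 seconds
startFrame = floor(1.23/frameDuration) + 1
endFrame = ceil(2.75/frameDuration)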
SED Targets
The supporting function, extractSEDTargets on page 1-784, uses the label data to create an SED
target. The target is a one-hot encoded matrix of size numframes-by-numclasses. Frames with no
sounds present are encoded as all-zero vectors.
SEDTargets = extractSEDTargets(labelTable,params);
[numframes,numclasses] = size(SEDTargets{1})
numframes = 600
numclasses = 14
dsTTrain = transform(ttdsTrain,@(x)extractSEDTargets(x,params));
sedTTrain = readall(dsTTrain);
[folder,fn] = fileparts(adsValidationA.Files);
targetPath = fullfile(strrep(folder,filesep+"data",filesep+"labels"),"label_" + strrep(fn,"_A","") + ".csv");
ttdsValidation = tabularTextDatastore(targetPath);
dsTValidation = transform(ttdsValidation,@(x)extractSEDTargets(x,params));
sedTValidation = readall(dsTValidation);
DOA Targets
The supporting function, extractDOATargets on page 1-784, uses the label data to create a DOA
target. The target is a matrix of size numframes-by-numaxis. The axis values correspond to the
sound source location in 3-D space. Frames with no sounds present are encoded as all-zero vectors.
First, define a parameter to scale the target axis values so that they are between -1 and 1. This
scaling is necessary because the DOA network you define later uses tanh activation as its final layer.
params.DOA.ScaleFactor = 2;
DOATargets = extractDOATargets(labelTable,params);
[numframes,numaxis] = size(DOATargets{1})
numframes = 600
numaxis = 3
dsTTrain = transform(ttdsTrain,@(x)extractDOATargets(x,params));
doaTTrain = readall(dsTTrain);
[folder,fn] = fileparts(adsValidationA.Files);
targetPath = fullfile(strrep(folder,filesep+"data",filesep+"labels"),"label_" + strrep(fn,"_A","") + ".csv");
ttdsValidation = tabularTextDatastore(targetPath);
dsTValidation = transform(ttdsValidation,@(x)extractDOATargets(x,params));
doaTValidation = readall(dsTValidation);
Feature Extraction
The sound event detection model uses log-magnitude short-time Fourier transforms (STFT) as
predictors to the system. Specify a 512-point periodic Hamming window and a hop length of 400
samples.
params.SED.SampleRate = 32e3;
params.SED.HopLength = 400;
params.SED.Window = hamming(512,"periodic");
The supporting function, extractSTFT on page 1-785, takes a cell array of microphone readings and
extracts the half-sided centered log-magnitude STFTs. The STFT features corresponding to both
microphones are stacked along the third dimension.
stftFeats = extractSTFT({micA,micB},params);
[numfeaturesSED,numframesSED,numchannelsSED] = size(stftFeats)
numfeaturesSED = 256
numframesSED = 4800
numchannelsSED = 8
channel = 1; % live-script control: STFT channel to visualize (1 through 8)
figure
imagesc(stftFeats(:,:,channel))
colorbar
xlabel("Frame")
ylabel("Frequency (bin)")
set(gca,YDir="normal")
Extract features from the entire train and validation sets. First, combine the datastores
corresponding to microphones A and B. Then, define a transform on the datastore so that reading
from it returns the STFT. If you have Parallel Computing Toolbox™, you can speed up processing
using the UseParallel flag of readall.
pFlag = ~isempty(ver("parallel")) && ~speedupExample;
trainDS = combine(adsTrainA,adsTrainB);
trainDS_T = transform(trainDS,@(x){extractSTFT(x,params)},IncludeInfo=false);
XTrain = readall(trainDS_T,UseParallel=pFlag);
valDS = combine(adsValidationA,adsValidationB);
valDS_T = transform(valDS,@(x){extractSTFT(x,params)},IncludeInfo=false);
XValidation = readall(valDS_T,UseParallel=pFlag);
Combine the predictor arrays with the previously computed SED target arrays.
trainSedDS = combine(arrayDatastore(XTrain,OutputType="same"),arrayDatastore(sedTTrain,OutputType="same"));
valSedDS = combine(arrayDatastore(XValidation,OutputType="same"),arrayDatastore(sedTValidation,OutputType="same"));
Training Options
if speedupExample
trainOptionsSED.MaxEpochs = 1;
end
Create minibatchqueue (Deep Learning Toolbox) objects to read mini-batches from the train and
validation datastores.
trainSEDmbq = minibatchqueue(trainSedDS, ...
MiniBatchSize=trainOptionsSED.MiniBatchSize, ...
OutputAsDlarray=[1,1], ...
MiniBatchFormat=["SSCB","TCB"], ...
OutputEnvironment=["auto","auto"]);
The network is implemented in two stages: a convolutional neural network (CNN) followed by a recurrent neural network (RNN) built from gated recurrent unit (GRU) layers. You use a custom reshaping layer to recast the output of the CNN model into a sequence and pass it as the input to the RNN model. The custom reshaping layer is placed in your current folder when you open this example. The final output layer uses sigmoid activation.
seldnetCNNLayers = [
imageInputLayer([numfeaturesSED,numframesSED,numchannelsSED],Normalization="none")
convolution2dLayer([3,3],64,Padding="same",Name="conv1")
batchNormalizationLayer(Name="batchnorm1")
reluLayer(Name="relu1")
maxPooling2dLayer([8,2],Stride=[8,2],Padding="same",Name="maxpool1")
convolution2dLayer([3,3],128,Padding="same",Name="conv2")
batchNormalizationLayer(Name="batchnorm2")
reluLayer(Name="relu2")
maxPooling2dLayer([8,2],Stride=[8,2],Padding="same",Name="maxpool2")
convolution2dLayer([3,3],256,Padding="same",Name="conv3")
batchNormalizationLayer(Name="batchnorm3")
reluLayer(Name="relu3")
maxPooling2dLayer([2,2],Stride=[2,2],Padding="same",Name="maxpool3")
convolution2dLayer([3,3],512,Padding="same",Name="conv4")
batchNormalizationLayer(Name="batchnorm4")
reluLayer(Name="relu4")
maxPooling2dLayer([1,1],Stride=[1,1],Padding="same",Name="maxpool4")
reshapeLayer("reshape")
];
netCNN = dlnetwork(seldnetCNNLayers);
seldnetGRULayers = [
sequenceInputLayer(1024,Name="sequenceInputLayer")
bigruLayer(1024,256,Name="gru1")
bigruLayer(512,256,Name="gru2")
bigruLayer(512,256,Name="gru3")
fullyConnectedLayer(1024,Name="fc1")
reluLayer(Name="relu1")
fullyConnectedLayer(1024,Name="fc2")
reluLayer(Name="relu2")
fullyConnectedLayer(1024,Name="fc3")
reluLayer(Name="relu3")
fullyConnectedLayer(params.SoundClasses.Count,Name="fc4")
sigmoidLayer(Name="output")
];
netRNN = dlnetwork(seldnetGRULayers);
Create a struct to contain both the CNN and RNN sections of the full model.
sedModel.CNN = netCNN;
sedModel.RNN = netRNN;
iteration = 0;
averageGrad = [];
averageSqGrad = [];
epoch = 0;
bestLoss = Inf;
badEpochs = 0;
learnRate = trainOptionsSED.InitialLearnRate;
To display training progress, initialize the supporting object progressPlotterSELD. The supporting object, progressPlotterSELD, is placed in your current folder when you open this example.
pp = progressPlotterSELD();
rng(0)
while epoch < trainOptionsSED.MaxEpochs && badEpochs < trainOptionsSED.ValidationPatience
epoch = epoch + 1;
while hasdata(trainSEDmbq)
% Read a mini-batch of predictors and targets.
[X,T] = next(trainSEDmbq);
% Evaluate the model gradients and loss using dlfeval and the modelLoss function.
[loss,grad,state] = dlfeval(@modelLoss,sedModel,X,T);
loss = loss/size(T,2);
% Update state.
sedModel.CNN.State = state.CNN;
sedModel.RNN.State = state.RNN;
badEpochs = badEpochs + 1;
end
end
Feature Extraction
The direction of arrival estimation model uses generalized cross correlation phase transform (GCC-
PHAT) as predictors to the system. Specify a 1024-point Hann window, a hop length of 400 samples,
and the number of bands as 96.
params.DOA.SampleRate = 32e3;
params.DOA.Window = hann(1024);
params.DOA.NumBands = 96;
params.DOA.HopLength = 400;
Extract the GCC-PHAT features used as input predictors to the sound localization network. The GCC-PHAT algorithm measures the cross correlation between each pair of channels. The input signals have a total of 8 channels, so the output has 28 measurements, one for each unique channel pair.
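As a minimal sketch of the underlying computation for a single channel pair (the extractGCCPHAT supporting function handles all nchoosek(8,2) = 28 pairs and keeps only the central bands), the PHAT-weighted cross correlation of two assumed channels can be computed as follows:
x1 = micA(1:32000,1); % first second of two of the eight channels (assumed selection)
x2 = micB(1:32000,1);
Nfft = 2^nextpow2(2*numel(x1)-1); % FFT length for the linear cross correlation
X1 = fft(x1,Nfft);
X2 = fft(x2,Nfft);
R = X1.*conj(X2);
R = R./(abs(R)+eps); % phase transform (PHAT) weighting keeps only phase information
gccPair = fftshift(ifft(R,[],1,"symmetric"),1);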
gccPhatFeats = extractGCCPHAT({micA,micB},params);
[numfeaturesDOA,timestepsDOA,numchannelsDOA] = size(gccPhatFeats)
numfeaturesDOA = 96
timestepsDOA = 4800
numchannelsDOA = 28
channelpair = 1; % live-script control: channel pair to visualize (1 through 28)
figure
imagesc(gccPhatFeats(:,:,channelpair))
colorbar
xlabel("Frame")
ylabel("Band")
set(gca,YDir="normal")
Extract features from the entire train and validation sets. If you have Parallel Computing Toolbox™,
you can speed up processing using the UseParallel flag of readall.
pFlag = ~isempty(ver("parallel")) && ~speedupExample;
trainDS = combine(adsTrainA,adsTrainB);
trainDS_T = transform(trainDS,@(x){extractGCCPHAT(x,params)},IncludeInfo=false);
XTrain = readall(trainDS_T,UseParallel=pFlag);
valDS = combine(adsValidationA,adsValidationB);
valDS_T = transform(valDS,@(x){extractGCCPHAT(x,params)},IncludeInfo=false);
XValidation = readall(valDS_T,UseParallel=pFlag);
Combine the predictor arrays with the previously computed DOA target arrays.
trainDOA = combine(arrayDatastore(XTrain,OutputType="same"),arrayDatastore(doaTTrain,OutputType="same"));
validationDOA = combine(arrayDatastore(XValidation,OutputType="same"),arrayDatastore(doaTValidation,OutputType="same"));
Training Options
Use the same train options you defined when training the SED network.
trainOptionsDOA = trainOptionsSED;
The DOA network is very similar to the SED network defined earlier. The key differences are the size
of the input layer and the final activation layer.
Update the SELDnet architecture used for the SED network for use with DOA estimation.
seldnetCNNLayers(1) = imageInputLayer([numfeaturesDOA,timestepsDOA,numchannelsDOA],Normalization="none");
seldnetCNNLayers(5) = maxPooling2dLayer([3,2],Stride=[3,2],Padding="same",Name="maxpool1");
netCNN = dlnetwork(layerGraph(seldnetCNNLayers));
seldnetGRULayers(11) = fullyConnectedLayer(3,Name="fc4");
seldnetGRULayers(12) = tanhLayer(Name="output");
netRNN = dlnetwork(layerGraph(seldnetGRULayers));
Create a struct to contain both the CNN and RNN sections of the full model.
doaModel.CNN = netCNN;
doaModel.RNN = netRNN;
iteration = 0;
averageGrad = [];
averageSqGrad = [];
epoch = 0;
bestLoss = Inf;
badEpochs = 0;
learnRate = trainOptionsDOA.InitialLearnRate;
To display training progress, initialize the supporting object progressPlotterSELD. The supporting
object, progressPlotterSELD, is placed in your current folder when you open this example.
pp = progressPlotterSELD();
rng(0)
while epoch < trainOptionsDOA.MaxEpochs && badEpochs < trainOptionsDOA.ValidationPatience
epoch = epoch + 1;
while hasdata(trainDOAmbq)
% Read a mini-batch of predictors and targets.
[X,T] = next(trainDOAmbq);
% Evaluate the model gradients and loss using dlfeval and the modelLoss function.
[loss,grad,state] = dlfeval(@modelLoss,doaModel,X,T);
loss = loss/size(T,2);
% Update state.
doaModel.CNN.State = state.CNN;
doaModel.RNN.State = state.RNN;
To evaluate your system's performance, use the location-sensitive detection error defined in [4] on
page 1-784. Load the best-performing models.
sedModel = importdata("SED-BestModel.mat");
doaModel = importdata("DOA-BestModel.mat");
Location-sensitive detection is a joint metric that evaluates the results of both the sound event detection and sound event localization tasks. In this type of evaluation, a true positive occurs only when the predicted label is correct and the predicted location is within a predefined threshold of the true location. A threshold of 0.2 is used in this example, which is approximately 3% of the maximum possible error. To determine regions of silence in the prediction, set a confidence threshold on the SED decisions. If the SED predictions are below that threshold, the frame is considered silence.
params.SpatialThreshold = 0.2;
params.SilenceThreshold = 0.1;
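For a single frame, the joint check looks like the following sketch, using hypothetical predicted and true values; the computeMetrics supporting function performs the full evaluation over the validation set.
doaTrue = [0.5;-0.3;0.2];    classTrue = 8; % hypothetical ground truth for one frame
doaPred = [0.55;-0.25;0.18]; classPred = 8; % hypothetical prediction for one frame
isDOAnear = norm(doaPred-doaTrue) <= params.SpatialThreshold;
isTruePositive = isDOAnear && (classPred == classTrue)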
Compute the metrics for the validation data set using the computeMetrics on page 1-789
supporting function.
results = computeMetrics(sedModel,doaModel,validationSEDmbq,validationDOAmbq,params);
results
The computeMetrics supporting function can optionally smooth the decisions over time before
evaluating the system. This option requires the Statistics and Machine Learning Toolbox™. Evaluate
the system again, this time including the smoothing.
[results,cm] = computeMetrics(sedModel,doaModel,validationSEDmbq,validationDOAmbq,params,ApplySmoothing=true);
results
You can inspect the confusion matrix for SED predictions to get more insights on the prediction
errors. The confusion matrix is only calculated over regions where there is an active sound source.
Conclusion
For next steps, you can download and try out the pretrained models from this example in this second
example showing inference: “3-D Sound Event Localization and Detection Using Trained Recurrent
Convolutional Neural Network” on page 1-794.
References
[1] Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen, "Sound event
localization and detection of overlapping sources using convolutional recurrent neural networks,"
IEEE J. Sel. Top. Signal Process., vol. 13, no. 1, pp. 34–48, 2019.
[2] Eric Guizzo, Riccardo F. Gramaccioni, Saeid Jamili, Christian Marinoni, Edoardo Massaro, Claudia
Medaglia, Giuseppe Nachira, Leonardo Nucciarelli, Ludovica Paglialunga, Marco Pennese, Sveva
Pepe, Enrico Rocchi, Aurelio Uncini, and Danilo Comminiello "L3DAS21 Challenge: Machine Learning
for 3D Audio Signal Processing," 2021.
[3] Yin Cao, Qiuqiang Kong, Turab Iqbal, Fengyan An, Wenwu Wang, and Mark D. Plumbley,
"Polyphonic sound event detection and localization using a two-stage strategy," arXiv preprint:
arXiv:1905.00268v4, 2019.
[4] Mesaros, Annamaria, Sharath Adavanne, Archontis Politis, Toni Heittola, and Tuomas Virtanen.
"Joint Measurement of Localization and Detection of Sound Events." 2019 IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019. https://doi.org/10.1109/waspaa.2019.8937220.
Supporting Functions
function T = extractDOATargets(csvFile,params)
%EXTRACTDOATARGETS Extract direction of arrival (DOA) targets
% T = extractDOATargets(fileName,params) parses the CSV file
% fileName and returns a matrix, T. The target matrix is an N-by-3
% matrix, where N corresponds to the number of frames and 3 corresponds to
% the 3 axes describing location in 3-D space.
% For each sound source, fill the target matrix sound source location for
% the appropriate number of frames.
for ii = 1:size(startendFrame,1)
idx = startendFrame(ii,1):startendFrame(ii,2)-1;
T(idx,:) = repmat([csvFile.X(ii),csvFile.Y(ii),csvFile.Z(ii)],numel(idx),1);
end
% Scale the target so that it is between -1 and 1 (the bounds of the tanh
% activation layer). Wrap the target in a cell array for convenient batch
% processing.
T = {T/params.DOA.ScaleFactor};
end
function T = extractSEDTargets(csvFile,params)
%EXTRACTSEDTARGETS Extract sound event detection (SED) targets
% T = extractSEDTargets(fileName,params) parses the CSV file
% fileName and returns a matrix of SED targets, T. The target matrix is an N-by-K
% matrix, where N corresponds to the number of frames and K corresponds to
% the number of sound classes.
% For each sound source, fill the appropriate column of the target matrix
% with a 1, indicating that the sound class is present in that frame.
for ii = 1:size(startendFrame,1)
classID = params.SoundClasses(csvFile.Class{ii});
T(startendFrame(ii,1):startendFrame(ii,2)-1,classID) = 1;
end
function X = extractSTFT(s,params)
%EXTRACTSTFT Extract log-magnitude of centered STFT
% X = extractSTFT({s1,s2},params) concatenates s1 and s2 and then
% extracts the one-sided log-magnitude STFT. The signals are padded before
% the STFT so that the first window is centered on the first sample. The
% output is trimmed to remove the 1st (DC) coefficient and the last
% spectrum. The input params defines the STFT.
% Trim the 1st coefficient from all spectrums and trim the last spectrum.
S = S(2:end,1:end-1,:);
function X = extractGCCPHAT(s,params)
%EXTRACTGCCPHAT Extract generalized cross correlation phase transform (GCC-PHAT) features
% X = extractGCCPHAT({s1,s2},params) concatenates s1 and s2 and then
% extracts the GCC-PHAT for all pairs of channels.
% -----------------------------------
% Calculate GCC-PHAT for each pair of channels.
% Precompute STFT for each channel.
N = numel(params.DOA.Window);
overlapLength = N - params.DOA.HopLength;
micAB_stft = centeredSTFT(audio,params.DOA.Window,overlapLength,N);
conjmicAB_stft = conj(micAB_stft(:,:,2:end));
idx = 1;
for ii = 1:nChan - 1
R = micAB_stft(:,:,ii).*conjmicAB_stft(:,:,ii:end);
R = exp(1i .* angle(R));
R = padarray(R, N/2 - 1,"post");
gcc = fftshift(ifft(R,[],1,"symmetric"),1);
X(:,:,idx:idx+size(R,3)-1) = gcc(floor(N/2+1 - (params.DOA.NumBands-1)/2):floor(N/2+1 + (para
end
function s = centeredSTFT(audio,win,overlapLength,fftLength)
%CENTEREDSTFT Centered STFT
% s = centeredSTFT(audioIn,win,overlapLength,fftLength) computes an STFT
% with the first window centered around the first sample. The two ends are
% padded with the reflected audio signal.
% Perform STFT.
s = stft(sig,Window=win,OverlapLength=overlapLength,FFTLength=fftLength,FrequencyRange="onesided"
end
stp = dur/numFrames;
qt = round(t./stp).*stp;
end
% Label the dimensions output from the CNN for consumption by the RNN.
Y2 = dlarray(Y1,"TCUB");
end
% Label the dimensions output from the CNN for consumption by the RNN.
Y2 = dlarray(Y1,"TCUB");
end
Predict Batch
while hasdata(mbq)
% Pass the mini-batch through the model and calculate the loss.
lss = predictAll(model,X,T);
lss = lss/size(T,2);
end
end
state.CNN = cnnState;
state.RNN = rnnState;
end
if isDOAModel
% Calculate MSE loss.
doaLoss = mse(Y,T);
doaLossFactor = 2 / (size(Y,1) * size(Y,3));
loss = doaLoss * doaLossFactor; % To align with the original implementation
else
% Calculate cross-entropy loss.
loss = crossentropy(Y,T,ClassificationMode="multilabel",NormalizationFactor="all-elements");
end
end
% Initialize counters.
TP = 0;
FP = 0;
FN = 0;
it = 0;
ct = 0;
err = 0;
sedYAll = [];
sedTAll = [];
% Get the predictors, targets, and predictions for the SED model.
[sedXb,sedTb] = next(sedMBQ);
[~,sedYb] = predictAll(sedModel,sedXb,sedTb);
sedTb = extractdata(gather(sedTb));
sedYb = extractdata(gather(sedYb));
% Get the predictors, targets, and predictions for the DOA model.
[doaXb,doaTb] = next(doaMBQ);
[~,doaYb] = predictAll(doaModel,doaXb,doaTb);
doaTb = extractdata(gather(doaTb));
doaYb = extractdata(gather(doaYb));
doaYb = doaYb*params.DOA.ScaleFactor;
doaTb = doaTb*params.DOA.ScaleFactor;
[isActive,sedT] = max(sedT,[],1);
sedT = sedT.*isActive;
% Smooth outputs.
if nvargs.ApplySmoothing
[doaY,sedY] = smoothOutputs(doaY,sedY,params);
end
TP = TP + tp;
FP = FP + fp;
FN = FN + fn;
err = err + e;
ct = ct + c;
% Calculate distance.
dist = vecnorm(doaY-doaT);
% True positive:
TP = sum(isDOAnear & isReferenceActive & isPredictedActive & (sedT==sedY));
% False positive:
FP1 = sum(~isReferenceActive & isPredictedActive);
FP2 = sum(isReferenceActive & isPredictedActive & (sedT~=sedY | ~isDOAnear));
FP = FP1 + FP2;
% False negative:
end
Smooth Outputs
if clusters(stt) == clusters(enn)
enn = enn + 1;
else
doaYSmooth(:,stt:enn-1) = smoothDOA(doaY(:,stt:enn-1));
sedYSmooth(:,stt:enn-1) = smoothSED(sedY(:,stt:enn-1));
stt = enn;
end
end
doaYSmooth(:,stt:enn-1) = smoothDOA(doaY(:,stt:enn-1));
sedYSmooth(:,stt:enn-1) = smoothSED(sedY(:,stt:enn-1));
sedYSmooth = round(movmedian(sedYSmooth,5));
end
% Determine the length of the chunk, and then indices to cut out the middle
% half of the data.
chlen = size(chunk,2);
st = max(round(chlen*1/4),1);
en = max(round(chlen*3/4),1);
end
smoothed = repmat(mode(chunk),1,size(chunk,2));
end
3-D Sound Event Localization and Detection Using Trained Recurrent Convolutional Neural Network
In this example, you perform 3-D sound event localization and detection (SELD) using a pretrained
deep learning model. For details about the model and how it was trained, see “Train 3-D Sound Event
Localization and Detection (SELD) Using Deep Learning” on page 1-767. The SELD model uses two
B-format ambisonic audio recordings to detect the presence and location of one of 14 sound classes
commonly found in an office environment.
Download the pretrained SELD network, ambisonic test files, and labels. The model architecture is
based on [1] on page 1-807 and [3] on page 1-807. The data the model was trained on, the labels,
and the ambisonic test files, are provided as part of the L3DAS21 challenge [2] on page 1-807.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","SELDmodel.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
netFolder = fullfile(dataFolder,"SELDmodel");
addpath(netFolder)
Load the ambisonic data. First order B-format ambisonic recordings contain components that
correspond to the sound pressure captured by an omnidirectional microphone (W) and sound
pressure gradients X, Y, and Z that correspond to front/back, left/right, and up/down captured by
figure-of-eight capsules oriented along the three spatial axes.
[micA,fs] = audioread("micA.wav");
micB = audioread("micB.wav");
microphone = 1; % live-script control: 1 (microphone A) or 2 (microphone B)
channel = 1;    % live-script control: 1 (W), 2 (X), 3 (Y), or 4 (Z)
start = 1;      % live-script control: playback start time, in seconds
stop = 10;      % live-script control: playback stop time, in seconds
s = [micA,micB];
data = s(round(start*fs):round(stop*fs),channel+(microphone-1)*4);
sound(data,fs)
plotAmbisonics(micA,micB)
Use the supporting function, getLabels, to load the ground truth labels associated with the sound
event detection (SED) and direction of arrival (DOA).
[sedLabels,doaLabels] = getLabels();
sedLabels is a T-by-1 vector of keys over time, where the values map to one of 14 possible sound
classes. A key of zero indicates a region of silence. The 14 possible sound classes are chink/clink,
keyboard, cupboard, drawer, female speech, finger snapping, keys jangling, knock, laughter, male
speech, printer, scissors, telephone, and writing.
sedLabels
sedLabels = 600×1 vector of class keys (the first several values are 0, indicating silence)
soundClasses = getSoundClasses();
soundClasses(sedLabels+1)
doaLabels is a T-by-3 matrix where T is the number of time steps and 3 corresponds to the X, Y, and
Z axes in 3-D space.
doaLabels
doaLabels = 600×3
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
⋮
In both cases, the 60-second ground truth has been discretized into 600 time steps.
Use the supporting object, seldModel, to perform SELD. The object encapsulates the SELD model
developed in “Train 3-D Sound Event Localization and Detection (SELD) Using Deep Learning” on
page 1-767. Create the model, then call seld on the ambisonic data to detect and localize sound in
time and space.
If you have Statistics and Machine Learning Toolbox™, the model applies smoothing to the decisions
using moving averages and clustering.
model = seldModel();
[sed,doa] = seld(model,micA,micB);
To visualize the system's performance over time, call the supporting function plot2d on page 1-801.
plot2d(sedLabels,doaLabels,sed,doa)
To visualize the system's performance in three spatial dimensions, call the supporting function plot3d on page 1-802. You can move the slider to visualize sound event locations detected at different times. The ground truth source location is identified by a semi-transparent sphere. The predicted source location is identified by a circle connected to the ground-truth location by a dotted line.
plot3d(sedLabels,doaLabels,sed,doa);
SELD is a four-dimensional problem: you localize the sound source in three spatial dimensions and in time. To examine the system's performance in all four dimensions, call the supporting function plot4d on page 1-806. The plot4d function plays the 3-D plot and the corresponding ambisonic recording over time.
plot4d(micA(:,1),sedLabels,doaLabels,sed,doa)
Supporting Functions
Plot Ambisonics
function plotAmbisonics(micA,micB)
%PLOTAMBISONICS Plot B-format ambisonics over time
% plotAmbisonics(micA,micB) plots the ambisonic recordings collected from
% micA and micB. The channels are plotted along the rows of a 4-by-2 tiled
% layout (W,X,Y,Z). The first column of the plot corresponds to data from
% microphone A and the second column corresponds to data from microphone B.
figure(1)
tiledlayout(4,2,TileSpacing="tight")
t = linspace(0,60,size(micA,1));
nexttile
plot(t,micA(:,1))
title("Microphone A")
yL = ylabel("W",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
axis([t(1),t(end),-0.2,0.2])
set(gca,Xticklabel=[])
nexttile
plot(t,micB(:,1))
title("Microphone B")
axis([t(1),t(end),-0.2,0.2])
set(gca,Yticklabel=[],XtickLabel=[])
nexttile
plot(t,micA(:,2))
yL = ylabel("X",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
axis([t(1),t(end),-0.2,0.2])
set(gca,Xticklabel=[])
nexttile
plot(t,micB(:,2))
axis([t(1),t(end),-0.2,0.2])
set(gca,Yticklabel=[],XtickLabel=[])
nexttile
plot(t,micA(:,3))
yL = ylabel("Y",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
axis([t(1),t(end),-0.2,0.2])
set(gca,Xticklabel=[])
nexttile
plot(t,micB(:,3))
axis([t(1),t(end),-0.2,0.2])
set(gca,Yticklabel=[],XtickLabel=[])
nexttile
plot(t,micA(:,4))
yL = ylabel("Z",FontWeight="bold");
set(yL,Rotation=0)
axis([t(1),t(end),-0.2,0.2])
xlabel("Time (s)")
nexttile
plot(t,micB(:,4))
axis([t(1),t(end),-0.2,0.2])
set(gca,Yticklabel=[])
xlabel("Time (s)")
end
function plotTimeSeries(sed,values)
%PLOTTIMESERIES Plot time series
% plotTimeSeries(sed,values) is leveraged by plot2d to plot the color-coded
% SED or DOA estimation.
colors = getColors();
hold on
for ii = 1:numel(sed)
cls = sed(ii);
if cls > 0
x = [ii-1,ii];
y = repelem(values(ii),2);
plot(x,y,Color=colors{cls},LineWidth=8)
end
end
hold off
grid on
end
Plot 2D
function plot2d(sedLabels,doaLabels,sed,doa)
%PLOT2D Plot 2D
% plot2d(sedLabels,doaLabels,sed,doa) creates plots for SED, SED ground
% truth, DOA estimation, and DOA ground truth.
fh = figure(2);
set(fh,Position=[100 100 800 800])
tiledlayout(5,2,TileSpacing="tight")
nexttile([2,1])
plotTimeSeries(sedLabels,sedLabels);
yticks(1:14)
yticklabels(SoundClasses)
ylim([0.5,14.5])
ylabel("Class")
title("Ground Truth")
set(gca,Xticklabel=[])
nexttile([2,1])
plotTimeSeries(sed,sed);
yticks(1:14)
ylim([0.5,14.5])
title("Prediction")
set(gca,Yticklabel=[],XtickLabel=[])
nexttile
plotTimeSeries(sedLabels,doaLabels(:,1));
yL = ylabel("X",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
set(gca,Xticklabel=[])
nexttile
plotTimeSeries(sed,doa(:,1));
set(gca,Yticklabel=[],XtickLabel=[])
nexttile
plotTimeSeries(sedLabels,doaLabels(:,2));
yL = ylabel("Y",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
set(gca,Xticklabel=[])
nexttile
plotTimeSeries(sed,doa(:,2));
set(gca,Yticklabel=[],XtickLabel=[])
nexttile
plotTimeSeries(sedLabels,doaLabels(:,3));
xlabel("Frame")
yL = ylabel("Z",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
nexttile
plotTimeSeries(sed,doa(:,3));
xlabel("Frame")
set(gca,YtickLabel=[])
end
Plot 3-D
arguments
sedLabels
doaLabels
sed
doa
nvargs.IncludeSlider = true;
end
% Create figure.
if nvargs.IncludeSlider
data.FigureHandle = figure(3);
else
data.FigureHandle = figure(4);
end
set(data.FigureHandle,Position=[680,400,640,580],Color="k",MenuBar="none",Toolbar="none")
% Initialize plot.
data = initialize3DPlot(data);
end
patch([2.2,2.2,-2.2,-2.2],[2.2,2.2,2.2,2.2],[-2.2,2.2,2.2,-2.2],[3,2,1,2],FaceAlpha=0.5,FaceColor
patch([2.2,2.2,-2.2,-2.2],[2.2,-2.2,-2.2,2.2],[-2.2,-2.2,-2.2,-2.2],[3,2,1,2],FaceAlpha=0.5,FaceC
grid on
grid minor
axis equal
hold off
end
function update3DPlot(timeFrame,data)
%UPDATE3DPLOT Update 3-D Plot
% update3DPlot(timeFrame,data) updates the 3-D plot to display data
% corresponding to the specified time frame.
timeFrame = round(timeFrame*numel(data.sedLabels));
if data.sedLabels(timeFrame) > 0
% Turn plot visibility on.
data.TPlotDot.Visible = "on";
data.GTAnnotation.Visible = "on";
data.TPlotDot.FaceColor = tcol;
data.TPlotDot.MarkerEdgeColor = tcol;
else
% Turn plot visibility off.
data.TPlotDot.Visible = "off";
data.GTAnnotation.Visible = "off";
if data.sed(timeFrame) > 0
% Turn plot visibility on.
data.PredictedAnnotation.Visible = "on";
data.YPlot.Visible = "on";
data.YPlotDot.Visible = "on";
data.GTAnnotation.Color = col;
data.PredictedAnnotation.String = pClass;
data.PredictedAnnotation.Color = col;
drawnow
end
Plot 4D
function plot4d(audioToPlay,sedLabels,doaLabels,sed,doa)
%PLOT4D Plot 4D
% plot4d(audioToPlay,sedLabels,doaLabels,sed,doa) creates a "movie" of
% ground truth and estimated sound events in a 3-D environment over time.
% The movie runs in real time and plays the audioToPlay to your default
% sound device.
drawnow
% The true and predicted label definitions have resolutions of 0.1 seconds. Create a
% labels vector to only update the 3-D plot when necessary.
changepoints = 0.1:0.1:60;
% Initialize counters.
idx = 1;
elapsedTime = 0;
end
Get Colors
soundClasses = categorical(["Silence","Clink","Keyboard","Cupboard","Drawer","Female","Fingers Snapping", ...
"Keys","Knock","Laughter","Male","Printer","Scissors","Telephone","Writing"]);
end
References
[1] Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen, "Sound event
localization and detection of overlapping sources using convolutional recurrent neural networks,"
IEEE J. Sel. Top. Signal Process., vol. 13, no. 1, pp. 34-48, 2019.
[2] Eric Guizzo, Riccardo F. Gramaccioni, Saeid Jamili, Christian Marinoni, Edoardo Massaro, Claudia
Medaglia, Giuseppe Nachira, Leonardo Nucciarelli, Ludovica Paglialunga, Marco Pennese, Sveva
Pepe, Enrico Rocchi, Aurelio Uncini, and Danilo Comminiello "L3DAS21 Challenge: Machine Learning
for 3D Audio Signal Processing," 2021.
[3] Yin Cao, Qiuqiang Kong, Turab Iqbal, Fengyan An, Wenwu Wang, and Mark D. Plumbley,
"Polyphonic sound event detection and localization using a two-stage strategy," arXiv preprint:
arXiv:1905.00268v4, 2019.
Import Audacity Labels to Signal Labeler
This example shows how to import labels created in Audacity™ into Signal Labeler.
You have an audio file consisting of a human voice uttering "Volume up" several times.
audioFile = "speaker1.ogg";
[x,fs] = audioread(audioFile);
sound(x,fs)
You label the file in Audacity [1] and export the labels to speaker1.txt.
Read the labels using readtable. The labels consist of three columns corresponding to the speech
utterances along with their respective start and end times (in seconds).
labelFile = "speaker1.txt";
roiTable = readtable(labelFile,Delimiter="tab");
roiTable.Properties.VariableNames = ["StartTime","EndTime","Value"]
roiTable = 6×3 table (variables: StartTime, EndTime, Value)
In order to gain insight into the labels, you plot the audio signal along with a mask corresponding to
labeled regions of speech.
mask=signalMask(table(roiTable{:,1:2},categorical(roiTable{:,3})),SampleRate=fs);
plotsigroi(mask, x, true);
Next, convert the label data to a labeledSignalSet that can be imported to Signal Labeler.
First, define the label type. Specify "roi" (Region of interest) for the label type, and "string" for
the label datatype.
labelName = "Speech";
lblDef = signalLabelDefinition(string(labelName),...
LabelType="roi",...
LabelDataType="string");
Next, create a labeledSignalSet pointing to the labeled audio file. Add the label definition to the
labeled signal set.
lss = labeledSignalSet(audioDatastore(audioFile),lblDef);
You are now ready to read these labels into Signal Labeler.
1) Open signalLabeler
The audio signal is now available to you along with its labels.
The helper function importLabels creates a labeled signal set corresponding to labels for multiple
audio files. In this example, you work with two audio files with labels stored in text files.
Call importLabels to create a signal data set for multiple annotated files. Specify the label name as
"Speech".
lss = importLabels("Speech");
You can now load lss in Signal Labeler by following the same steps from the previous section.
labelFiles = dir("*.txt");
labelFiles = {labelFiles.name};
numFiles = length(labelFiles);
lss = labeledSignalSet(ads);
lblDef = signalLabelDefinition(string(labelName),...
LabelType="roi",...
LabelDataType="string");
addLabelDefinitions(lss,lblDef)
for index0=1:numFiles
filename = labelFiles{index0};
roiTable = readtable(filename,Delimiter="tab");
roiTable.Properties.VariableNames = ["StartTime","EndTime","Value"];
roiLimits = [roiTable.StartTime roiTable.EndTime];
setLabelValue(lss,index0,labelName,roiLimits,roiTable.Value);
end
end
References
[1] https://www.audacityteam.org/
Room Impulse Response Simulation with the Image-Source Method and HRTF Interpolation
Room impulse response simulation aims to model the reverberant properties of a space without
having to perform acoustic measurements. Many geometric and wave-based room acoustic simulation
methods exist in the literature [1] on page 1-822. The image-source method is a popular and
relatively straightforward geometric method [2] on page 1-823. It models the specular reflections
between a transmitter and a receiver.
This example showcases the image-source method for a simple "shoebox" (cuboid) room. The example
also uses head-related transfer function (HRTF) interpolation to simulate the received sound at the
ears of the listener.
Define the room dimensions, in meters (width, length and height, respectively).
roomDimensions = [4 4 2.5];
You treat the receiver and transmitter as points within the space of the room. Define their
coordinates, in meters.
receiverCoord = [2 1 1.8];
sourceCoord = [3 1 1.8];
Plot the room space along with the receiver (red circle) and transmitter (blue x).
h = figure;
plotRoom(roomDimensions,receiverCoord,sourceCoord,h)
Image-source is a geometric simulation method that models specular sound reflection paths between
the source and receiver. It assumes that sound travels in straight lines (rays) which undergo perfect
reflections when they encounter an obstacle (in our case, one of the four walls, the floor, or the
ceiling of the room).
When a sound ray hits a wall, it spawns a mirrored "image" source. The image source is the
symmetrical reflection of the original source with respect to the encountered boundary. Higher-order
reflections (rays that reach the receiver after bouncing off multiple obstacles) are modeled by
repeating the mirroring process with respect to each encountered obstacle.
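To make the mirroring rule concrete, the six first-order images of the source can be written down directly; this is a sketch, and the example generalizes it with the sourceXYZ permutations and the n, l, m loops below.
Lx = roomDimensions(1); Ly = roomDimensions(2); Lz = roomDimensions(3);
x = sourceCoord(1); y = sourceCoord(2); z = sourceCoord(3);
firstOrderImages = [ -x        y        z;     % wall at x = 0
                    2*Lx-x     y        z;     % wall at x = Lx
                      x       -y        z;     % wall at y = 0
                      x      2*Ly-y     z;     % wall at y = Ly
                      x        y       -z;     % floor at z = 0
                      x        y      2*Lz-z]; % ceiling at z = Lz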
As an example, consider the ray that bounces off two walls and the floor before arriving at the
receiver. Define the coordinates of the equivalent image for this ray.
imageSource = [-sourceCoord(1) -sourceCoord(2) -sourceCoord(3)];
The ray is modeled by the straight line connecting the image to the receiver. The length of this
straight line is equal to the traveled distance from the original source to the receiver along the
reflected ray.
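For instance, the traveled distance along this reflected path is the Euclidean distance between the image and the receiver:
pathLength = norm(imageSource - receiverCoord) % traveled distance, in meters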
To calculate the room impulse response, add the contributions of a large number of source images.
Extend the visible space around the room to ensure the images appear in the plot.
h2 = figure;
plotRoom(roomDimensions,receiverCoord,sourceCoord,h2)
Lx = roomDimensions(1);
Ly = roomDimensions(2);
Lz = roomDimensions(3);
xlim([-3*Lx,3*Lx]);
ylim([-3*Ly,3*Ly]);
zlim([-3*Lz,3*Lz]);
Visualize a subset of the source images. Compute the image coordinates based on equations 6 and 7
in [2].
Model the eight combinations stemming from possible reflections along the x-, y- and z- axes.
x = sourceCoord(1);
y = sourceCoord(2);
z = sourceCoord(3);
sourceXYZ = [-x -y -z;...
-x -y z;...
-x y -z;...
-x y z;...
x -y -z;...
x -y z;...
x y -z;...
x y z].';
Model scenarios with multiple reflections by looping over the x-, y-, and z-axes. These loops have infinite ranges in theory. You will see how to limit the ranges in practice in the next section. For now, select arbitrary limits for the loops.
nVect = -2:2; % arbitrary loop limits for this visualization
lVect = -2:2;
mVect = -2:2;
for n = nVect
for l = lVect
for m = mVect
xyz = [n*2*Lx; l*2*Ly; m*2*Lz];
isourceCoords = xyz - sourceXYZ;
for kk=1:8
isourceCoord=isourceCoords(:,kk);
plot3(isourceCoord(1),isourceCoord(2),isourceCoord(3),"g*")
end
end
end
end
The number of images is theoretically infinite. Restrict the number of images by limiting the
computed impulse response length to the time by which the reverberated sound pressure drops
below a certain level. Here, you use the reverberation time RT60 [3] on page 1-823, which is the
time by which the sound level has dropped by 60 dB.
First, define the absorption coefficients of the walls. The absorption coefficient is a measure of how
much sound is absorbed (rather than reflected) when hitting a surface.
The absorption coefficients are frequency dependent and are specified at the frequencies held in the variable FVect [4] on page 1-823.
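The numeric values come from the table in [4]. As a placeholder sketch with assumed, uniform coefficients (not the measured values used to produce the RT60 output below), the definitions take this form:
FVect = [125 250 500 1000 2000 4000]; % assumed octave-band center frequencies, in Hz
A = repmat(0.2,numel(FVect),6);       % placeholder absorption coefficients: one row per band, one column per surface (x = 0, x = Lx, y = 0, y = Ly, floor, ceiling)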
Estimate RT60
V = Lx*Ly*Lz;
WallXZ = Lx*Lz;
WallYZ = Ly*Lz;
WallXY = Lx*Ly;
S = WallYZ*(A(:,1)+A(:,2))+WallXZ.*(A(:,3)+A(:,4))+WallXY.*(A(:,5)+A(:,6));
Compute the frequency-dependent RT60, in seconds, based on Sabine's equation. Notice that RT60 is
frequency-dependent: There are 6 different RT60 values, one for each frequency band.
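A minimal sketch of this step, assuming a speed of sound of 343 m/s (the constant c is also used by the delay and range computations later in the example):
c = 343;                    % assumed speed of sound, in m/s
RT60 = (24*log(10)/c)*V./S; % Sabine's equation: T60 = 0.161*V/A, where A is the total absorption area (S above)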
RT60 = 6×1
1.3886
0.7191
0.3799
0.2581
0.3028
0.2455
Deduce the maximum impulse response length (in samples) based on the largest value in RT60.
Assume a sample rate of 48 kHz.
fs = 48000;
impResLength = fix(max(RT60)*fs)
impResLength = 66653
impResRange=c*(1/fs)*impResLength
impResRange = 476.2912
Use this value to limit the range over which to compute images. In this example, to limit the run time,
you restrict the loop ranges to [-10;10].
nMax = min(ceil(impResRange./(2.*Lx)),10);
lMax = min(ceil(impResRange./(2.*Ly)),10);
mMax = min(ceil(impResRange./(2.*Lz)),10);
In this section, you derive the contribution of one image to the room impulse response.
You later obtain the full room impulse response by summing the contributions of all images under
consideration.
B=sqrt(1-A);
BX1=B(:,1);
BX2=B(:,2);
BY1=B(:,3);
BY2=B(:,4);
BZ1=B(:,5);
BZ2=B(:,6);
Model the eight permutations representing the absence or presence of reflection on the x-, y-, and z-
axes.
surface_coeff=[0 0 0; 0 0 1; 0 1 0; 0 1 1; 1 0 0; 1 0 1; 1 1 0; 1 1 1];
q = surface_coeff(:,1).';
j = surface_coeff(:,2).';
k = surface_coeff(:,3).';
In this section, you focus on the contribution of a single image. Select index values corresponding to
an arbitrary image.
n = 1;
l = 1;
m = 1;
p = 1;
You start by computing the image delay. The delay is related to the total distance traveled by the
wave from the image to the receiver.
dist = norm((isourceCoord(:)-receiverCoord(:)),2);
delay = (fs/c).*dist;
ImagePower = BX1.^abs(n-q(p)).*BY1.^abs(l-j(p)).*BZ1.^abs(m-k(p)).*BX2.^abs(n).*(BY2.^abs(l)).*(BZ2.^abs(m));
The image power is only defined at 6 frequencies. Here, you first interpolate the response to the
entire frequency (Nyquist) range, and then perform an inverse FFT operation to derive the image's
contribution to the impulse response.
FFTLength = 512;
HalfLength = fix(FFTLength./2);
OneSidedLength = HalfLength+1;
ImagePower2 = interp1(FVect2./(fs/2),ImagePower2,linspace(0,1,257)).';
h_ImagePower = real(ifft(ImagePower2,FFTLength));
win = hann(FFTLength+1);
h_ImagePower = win.*[h_ImagePower(OneSidedLength:FFTLength); ...
h_ImagePower(1:OneSidedLength)];
HRTF Modeling
You have derived the image's contribution to the impulse response, where you assumed that the
receiver is a point in space. Here, you derive the image's contribution at the ears of a listener located
at the receiver coordinates by using 3-D head-related transfer function (HRTF) interpolation.
You use the ARI HRTF data set [5] on page 1-823. Load the data set.
ARIDataset = load("ReferenceHRTF.mat");
Calculate the elevation and azimuth corresponding to the coordinates of the image source.
sensor_xyz=receiverCoord;
xyz=isourceCoord-sensor_xyz;
hyp = sqrt(xyz(1)^2+xyz(2)^2);
elevation = atan(xyz(3)./(hyp+eps));
azimuth = atan2(xyz(2),xyz(1));
The desired HRTF position is formed by the computed azimuth and elevation, converted to degrees.
desiredPosition = [azimuth,elevation]*180/pi;
interpolatedIR = interpolateHRTF(hrtfData,sourcePosition,desiredPosition);
interpolatedIR = squeeze(permute(interpolatedIR,[3 2 1]));
Plot the overall contribution of the selected image. This contribution is added to the overall impulse
response at the computed image delay.
figure
plot(1:size(h,1),h)
grid on
xlabel("Sample Index")
ylabel("Impulse Response")
legend("Left","Right")
In this section, you compute the overall impulse response by summing the contributions of individual
images. The contribution of each image is computed exactly like in the previous section.
The helper function HelperImageSource encapsulates the steps you went over in the previous
section. It computes the impulse response by summing image contributions.
useHRTF = true;
h = HelperImageSource(roomDimensions,receiverCoord,sourceCoord,A,FVect,fs,useHRTF,hrtfData,source
figure
t= (1/fs)*(0:size(h,1)-1);
plot(t,h)
grid on
xlabel("Time (s)")
ylabel("Impulse Response")
legend("Left","Right")
Auralization
[audioIn,fs] = audioread("FunkyDrums-44p1-stereo-25secs.mp3");
audioIn = audioIn(:,1);
y1 = filter(h(:,1),1,audioIn);
y2 = filter(h(:,2),1,audioIn);
y = [y1 y2];
T = 10;
sound(audioIn(1:fs*T),fs)
pause(T)
sound(y(1:fs*T),fs);
References
[1] "Overview of geometrical room acoustic modeling techniques", Lauri Savioja, Journal of the
Acoustical Society of America 138, 708 (2015).
[2] Allen, J. and Berkley, D. "Image Method for Efficiently Simulating Small-Room Acoustics", The Journal of the Acoustical Society of America, Vol. 65, No. 4, pp. 943-950, 1979.
[3] https://www.sciencedirect.com/topics/engineering/sabine-equation
[4] https://www.acoustic.ua/st/web_absorption_data_eng.pdf
Helper Functions
function plotRoom(roomDimensions,receiverCoord,sourceCoord,figHandle)
% PLOTROOM Plot room, transmitter and receiver
figure(figHandle)
X = [0;roomDimensions(1);roomDimensions(1);0;0];
Y = [0;0;roomDimensions(2);roomDimensions(2);0];
Z = [0;0;0;0;0];
hold on;
plot3(X,Y,Z,"k","LineWidth",1.5); % draw a square in the xy plane with z = 0
plot3(X,Y,Z+roomDimensions(3),"k","LineWidth",1.5); % draw a square in the xy plane at the ceiling height
set(gca,"View",[-28,35]); % set the azimuth and elevation of the plot
for k=1:length(X)-1
plot3([X(k);X(k)],[Y(k);Y(k)],[0;roomDimensions(3)],"k","LineWidth",1.5);
end
grid on
xlabel("X (m)")
ylabel("Y (m)")
zlabel("Z (m)")
plot3(sourceCoord(1),sourceCoord(2),sourceCoord(3),"bx","LineWidth",2)
plot3(receiverCoord(1),receiverCoord(2),receiverCoord(3),"ro","LineWidth",2)
end
hrtfData = [];
sourcePosition = [];
if useHRTF
hrtfData = varargin{1};
sourcePosition = varargin{2};
end
x = sourceCoord(1);
y = sourceCoord(2);
z = sourceCoord(3);
sourceXYZ = [-x -y -z; ...
-x -y z; ...
-x y -z; ...
-x y z; ...
x -y -z; ...
x -y z; ...
x y -z; ...
x y z].';
Lx=roomDimensions(1);
Ly=roomDimensions(2);
Lz=roomDimensions(3);
V = Lx*Ly*Lz;
WallXZ=Lx*Lz;
WallYZ=Ly*Lz;
WallXY=Lx*Ly;
S = WallYZ*(A(:,1)+A(:,2))+WallXZ.*(A(:,3)+A(:,4))+WallXY.*(A(:,5)+A(:,6));
impResLength = fix(max(RT60)*fs);
impResRange=c*(1/fs)*impResLength;
nMax = min(ceil(impResRange./(2.*Lx)),10);
lMax = min(ceil(impResRange./(2.*Ly)),10);
mMax = min(ceil(impResRange./(2.*Lz)),10);
B=sqrt(1-A);
BX1=B(:,1);
BX2=B(:,2);
BY1=B(:,3);
BY2=B(:,4);
BZ1=B(:,5);
BZ2=B(:,6);
surface_coeff=[0 0 0; 0 0 1; 0 1 0; 0 1 1; 1 0 0; 1 0 1; 1 1 0; 1 1 1];
q=surface_coeff(:,1).';
j=surface_coeff(:,2).';
k=surface_coeff(:,3).';
FFTLength=512;
HalfLength=fix(FFTLength./2);
OneSidedLength = HalfLength+1;
win = hann(FFTLength+1);
h = zeros(impResLength,2);
for n=-nMax:nMax
Lxn2=n*2*Lx;
for l=-lMax:lMax
Lyl2=l*2*Ly;
if useHRTF
imagesVals = zeros(FFTLength+size(hrtfData,3),2,2*lMax+1,8);
else
imagesVals = zeros(FFTLength+1,2,2*lMax+1,8);
end
Li = size(imagesVals,1);
isDelayValid = zeros(2*lMax+1,8);
start_index_HpV = zeros(2*lMax+1,8);
stop_index_HpV = zeros(2*lMax+1,8);
start_index_hV = zeros(2*lMax+1,8);
parfor mInd=1:2*mMax+1
m = mInd - mMax - 1;
Lzm2=m*2*Lz;
xyz = [Lxn2; Lyl2; Lzm2];
isourceCoordV=xyz - sourceXYZ;
xyzV = isourceCoordV - receiverCoord.';
distV = sqrt(sum(xyzV.^2));
delayV = (fs/c)*distV;
ImagePower = BX1.^abs(n-q).*BY1.^abs(l-j).*BZ1.^abs(m-k).*BX2.^abs(n).*(BY2.^abs(l)).*(BZ2.^abs(m));
ImagePower2 = [ImagePower(1,:); ImagePower; ImagePower(6,:)];
ImagePower2 = ImagePower2./distV;
if sum(validDelay)==0
continue;
end
isDelayValid(mInd,:) = validDelay;
ImagePower2 = interp1(FVect2./(fs/2),ImagePower2,linspace(0,1,257));
if isrow(ImagePower2)
ImagePower2 = ImagePower2.';
end
ImagePower3 = [ImagePower2; conj(ImagePower2(HalfLength:-1:2,:))];
h_ImagePower = real(ifft(ImagePower3,FFTLength));
h_ImagePower = [h_ImagePower(OneSidedLength:FFTLength,:); h_ImagePower(1:OneSidedLength,:)];
h_ImagePower = win.*h_ImagePower;
if useHRTF
hyp = sqrt(xyzV(1,:).^2+xyzV(2,:).^2);
elevation = atan(xyzV(3,:)./(hyp+realmin));
azimuth = atan2(xyzV(2,:),xyzV(1,:));
desiredPosition = [azimuth.',elevation.']*180/pi;
interpolatedIR = interpolateHRTF(hrtfData,sourcePosition,desiredPosition,"Algori
interpolatedIR = squeeze(permute(interpolatedIR,[3 2 1]));
pad_ImagePower = zeros(512,2);
for index=1:8
hrir0 = interpolatedIR(:,:,index);
hrir_ext=[hrir0; pad_ImagePower];
for ear=1:2
imagesVals(:,ear,mInd,index)=filter(h_ImagePower(:,index),1,hrir_ext(:,ear));
end
end
else
for index=1:8
for ear=1:2
imagesVals(:,ear,mInd,index)=h_ImagePower(:,index);
end
end
end
len_h=Li;
start_index_HpV(mInd,:) = max(adjust_delay+1+(adjust_delay>=0),1);
stop_index_HpV(mInd,:) = min(adjust_delay+1+len_h,impResLength);
start_index_hV(mInd,:) = max(-adjust_delay,1);
end
stop_index_hV = start_index_hV + (stop_index_HpV - start_index_HpV);
for index2=1:size(imagesVals,3)
for index3=1:8
if isDelayValid(index2,index3)
h(start_index_HpV(index2,index3):stop_index_HpV(index2,index3),:)= h(start_in
end
end
end
end
end
h = h./max(abs(h));
end
Room Impulse Response Simulation with Stochastic Ray Tracing
Room impulse response simulation aims to model the reverberant properties of a space without
having to perform acoustic measurements. Many geometric and wave-based room acoustic simulation
methods exist in the literature [1] on page 1-836.
The image-source method is a popular geometric method (for an example, see “Room Impulse
Response Simulation with the Image-Source Method and HRTF Interpolation” on page 1-812). One
drawback of the image-source method is that it only models specular reflections between a
transmitter and a receiver. There are other geometric methods that address this limitation by also
taking sound diffusion and diffraction into account. Stochastic ray tracing is one such method.
This example showcases a stochastic ray tracing method for a simple "shoebox" (cuboid) room.
Ray tracing assumes that sound energy travels around the room in rays. The rays start at the sound
source, and are emitted in all directions, following a uniform random distribution. In this example,
you follow (trace) rays as they bounce off obstacles (walls, floor and ceiling) in the room. At each ray
reflection, you compute the measured ray energy at the receiver. You use the measured energy to
update a frequency-dependent histogram. You then compute the room impulse response by weighting
a Poisson random process by the histogram values [2] on page 1-836.
Define the room dimensions, in meters (width, length, and height, respectively).
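The assignment of roomDimensions does not appear in this excerpt. A placeholder definition, consistent with its later use in plotRoom and prod(roomDimensions), could be:
roomDimensions = [10 8 4]; % [width length height] in meters (illustrative values)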
Treat the transmitter as a point within the space of the room. Assume that the receiver is a sphere
with radius 8.75 cm.
sourceCoord = [2 2 2];
receiverCoord = [5 5 1.8];
r = 0.0875;
Plot the room space along with the receiver (red circle) and transmitter (blue x).
h = figure;
plotRoom(roomDimensions,receiverCoord,sourceCoord,h)
Generate the rays using the helper function RandSampleSphere. rays is a N-by-3 matrix. Each row
of rays holds the three-dimensional ray vector direction.
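The number of rays, N, is not set in this excerpt; the size output shown below implies a value of 5000:
N = 5000; % number of rays (inferred from the size(rays) output below)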
rng(0)
rays = RandSampleSphere(N);
size(rays)
ans = 1×2
5000 3
A sound ray is reflected when it hits a surface. The reflection is a combination of a specular
component and a diffused component. The relative strength of each component is determined by the
reflection and scattering coefficients of the surfaces.
Define the absorption coefficients of the walls. The absorption coefficient is a measure of how much
sound is absorbed (rather than reflected) when hitting a surface.
The frequency-dependent absorption coefficients are defined at the frequencies in the variable FVect
[4] on page 1-837.
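The definitions of FVect and of the absorption coefficients A do not appear in this excerpt. A hedged sketch, using the six octave-band center frequencies implied by the histogram legend later in the example and illustrative absorption values (one row per band, one column per surface), might be:
FVect = [125 250 500 1000 2000 4000]; % band center frequencies (Hz)
A = repmat([0.10; 0.12; 0.15; 0.20; 0.25; 0.30],1,6); % illustrative absorption, 6 bands x 6 surfaces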
Derive the reflection coefficients of the six surfaces from the absorption coefficients.
R = sqrt(1-A);
Define the frequency-dependent scattering coefficients [5] on page 1-837. The scattering coefficient
is defined as one minus the ratio between the specularly reflected acoustic energy and the total
reflected acoustic energy.
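The scattering coefficient values themselves are not listed here; an illustrative frequency-dependent vector (the variable name D is assumed), one value per band in FVect, could be:
D = [0.05 0.30 0.70 0.90 0.92 0.94]; % illustrative scattering coefficients per band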
As you trace the rays, you update a two-dimensional histogram of the energy detected at the receiver.
The histogram records values along time and frequency.
Set the histogram time resolution, in seconds. The time resolution is typically much larger than the
inverse of the audio sample rate.
histTimeStep = 0.0010;
Compute the number of histogram time bins. In this example, limit the impulse response length to
one second.
impResTime = 1;
nTBins = round(impResTime/histTimeStep);
The ray tracing algorithm is frequency-selective. In this example, focus on six frequency bands,
centered around the frequencies in FVect.
The number of frequency bins in the histogram is equal to the number of frequency bands.
nFBins = length(FVect);
TFHist = zeros(nTBins,nFBins);
Compute the received energy histogram by tracing the rays over each frequency band. When a ray
hits a surface, record the amount of diffused ray energy seen at the receiver based on the diffused
rain model [2] on page 1-836. The new ray direction upon hitting a surface is a combination of a
specular reflection and a random reflection. Continue tracing the ray until its travel time exceeds the
impulse response duration.
for iBand = 1:nFBins
% Perform ray tracing independently for each frequency band.
for iRay = 1:size(rays,1)
% Select ray direction
ray = rays(iRay,:);
% All rays start at the source/transmitter
ray_xyz = sourceCoord;
% Set initial ray direction. This direction changes as the ray is
% reflected off surfaces.
ray_dxyz = ray;
% Initialize ray travel time. Ray tracing is terminated when the
% travel time exceeds the impulse response length.
ray_time = 0;
% Initialize the ray energy to a normalized value of 1. Energy
% decreases when the ray hits a surface.
ray_energy = 1;
if recv_timeofarrival>impResTime
break
end
figure
bar(histTimeStep*(0:size(TFHist,1)-1),TFHist)
grid on
xlabel("Time (s)")
legend(["125 Hz","250 Hz","500 Hz","1000 Hz","2000 Hz","4000 Hz"])
The energy histogram represents the envelope of the room impulse response. Synthesize the impulse
response using a Poisson-distributed noise process [2] on page 1-836.
fs = 44100;
V = prod(roomDimensions);
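c = 343; % speed of sound in m/s; not defined in this excerpt, typical value assumed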
t0 = ((2*V*log(2))/(4*pi*c^3))^(1/3); % eq 5.45 in [2]
Initialize the random process vector and the vector containing the times at which events occur.
poissonProcess = [];
timeValues = [];
t = t0;
while (t<impResTime)
timeValues = [timeValues t]; %#ok
% Determine polarity.
if (round(t*fs)-t*fs) < 0
poissonProcess = [poissonProcess 1]; %#ok
else
poissonProcess = [poissonProcess -1];%#ok
end
% Determine the mean event occurrence (eq 5.44 in [2])
mu = min(1e4,4*pi*c^3*t^2/V);
% Determine the interval size (eq. 5.44 in [2])
deltaTA = (1/mu)*log(1/rand); % eq. 5.43 in [2])
t = t+deltaTA;
end
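The step that converts the Poisson event times and polarities into a sample-rate sequence, randSeq, is not shown in this excerpt. A sketch consistent with how randSeq is used below:
randSeq = zeros(impResTime*fs,1);
for index = 1:numel(timeValues)
    n = min(round(timeValues(index)*fs)+1,numel(randSeq)); % sample index of this event
    randSeq(n) = poissonProcess(index);
end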
You create the impulse response by passing the Poisson process through bandpass filters centered at
the frequencies in FVect, and then weighting the filtered signals with the received energy envelope
computed in the histogram.
Define the lower and upper cutoff frequencies of the bandpass filters.
flow = [85 170 340 680 1360 2720];
fhigh = [165 330 660 1320 2640 5280];
Create the short-time Fourier transform and inverse short-time Fourier transform objects you will use
to filter the Poisson process and reconstruct it. Use a Hann window with 50% overlap.
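NFFT is not defined in this excerpt; a value consistent with the 882-sample Hann window used below would be:
NFFT = 882; % FFT length matching the analysis window length (assumed)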
win = hann(882,"symmetric");
sfft = dsp.STFT(Window = win,OverlapLength=441,FFTLength=NFFT,FrequencyRange="onesided");
isfft = dsp.ISTFT(Window=win,OverlapLength=441,FrequencyRange="onesided");
F = sfft.getFrequencyVector(fs);
Create the bandpass filters (use equation 5.46 in [2] on page 1-836).
RCF = zeros(length(FVect),length(F));
for index0 = 1:length(FVect)
for index=1:length(F)
f = F(index);
if f<FVect(index0) && f>=flow(index0)
RCF(index0,index) = .5*(1+cos(2*pi*f/FVect(index0)));
end
if f<fhigh(index0) && f>=FVect(index0)
RCF(index0,index) = .5*(1-cos(2*pi*f/(2*FVect(index0))));
end
end
end
figure
semilogx(F,RCF(1,:))
hold on
semilogx(F,RCF(2,:))
semilogx(F,RCF(3,:))
semilogx(F,RCF(4,:))
semilogx(F,RCF(5,:))
semilogx(F,RCF(6,:))
xlabel("Frequency (Hz)")
ylabel("Response")
grid on
frameLength = 441;
numFrames = length(randSeq)/frameLength;
y = zeros(length(randSeq),6);
for index=1:numFrames
x = randSeq((index-1)*frameLength+1:index*frameLength);
X = sfft(x);
X = X.*RCF.';
y((index-1)*frameLength+1:index*frameLength,:) = isfft(X);
end
Construct the impulse response by weighting the filtered random sequences sample-wise using the
envelope (histogram) values.
impTimes = (1/fs)*(0:size(y,1)-1);
W = zeros(size(impTimes,2),numel(FVect));
BW = fhigh-flow;
for k=1:size(TFHist,1)
gk0 = floor((k-1)*fs*histTimeStep)+1;
gk1 = floor(k*fs*histTimeStep);
yy = y(gk0:gk1,:).^2;
val = sqrt(TFHist(k,:)./sum(yy,1)).*sqrt(BW/(fs/2));
for iRay=gk0:gk1
W(iRay,:)= val;
end
end
y = y.*W;
ip = sum(y,2);
ip = ip./max(abs(ip));
Auralization
figure
plot((1/fs)*(0:numel(ip)-1),ip)
grid on
xlabel("Time (s)")
ylabel("Impulse Response")
[audioIn,fs] = audioread("FunkyDrums-44p1-stereo-25secs.mp3");
audioIn = audioIn(:,1);
audioOut = filter(ip,1,audioIn);
audioOut = audioOut/max(abs(audioOut));
T = 10;
sound(audioIn(1:T*fs),fs)
pause(T)
sound(audioOut(1:T*fs),fs)
References
[1] "Overview of geometrical room acoustic modeling techniques", Lauri Savioja, Journal of the
Acoustical Society of America 138, 708 (2015).
[2] "Physically Based Real-Time Auralization of Interactive Virtual Environments", Dirk Schröder,
Aachen, Techn. Hochsch., Diss., 2011.
[3] "Auralization: Fundamentals of Acoustics, Modelling, Simulation, Algorithms and Acoustic Virtual
Reality", Michael Vorlander, Second Edition, Springer.
[4] https://www.acoustic.ua/st/web_absorption_data_eng.pdf
[5] "Scattering in Room Acoustics and Related Activities in ISO and AES", Jens Holger Rindel, 17th
ICA Conference, Rome, Italy, September 2001.
Helper Functions
function plotRoom(roomDimensions,receiverCoord,sourceCoord,figHandle)
% PLOTROOM Helper function to plot 3D room with receiver/transmitter points
figure(figHandle)
X = [0;roomDimensions(1);roomDimensions(1);0;0];
Y = [0;0;roomDimensions(2);roomDimensions(2);0];
Z = [0;0;0;0;0];
hold on;
plot3(X,Y,Z,"k",LineWidth=1.5);
plot3(X,Y,Z+roomDimensions(3),"k",LineWidth=1.5);
set(gca,"View",[-28,35]);
for k=1:length(X)-1
plot3([X(k);X(k)],[Y(k);Y(k)],[0;roomDimensions(3)],"k",LineWidth=1.5);
end
grid on
xlabel("X (m)")
ylabel("Y (m)")
zlabel("Z (m)")
plot3(sourceCoord(1),sourceCoord(2),sourceCoord(3),"bx",LineWidth=2)
plot3(receiverCoord(1),receiverCoord(2),receiverCoord(3),"ro",LineWidth=2)
end
function X=RandSampleSphere(N)
% RANDSAMPLESPHERE Return random ray directions
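% (Not shown in this excerpt) Draw uniform samples on the unfolded cylinder;
% a sketch consistent with the variables used below:
z = 2*rand(N,1)-1;    % uniform in [-1,1]
lon = 2*pi*rand(N,1); % uniform azimuth in [0,2*pi)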
% Convert z to latitude
z(z<-1) = -1;
z(z>1) = 1;
lat = acos(z);
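% Convert spherical coordinates to rectangular ray directions (sketch)
s = sin(lat);
x = cos(lon).*s;
y = sin(lon).*s;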
X = [x y z];
end
function N = getWallNormalVector(surfaceofimpact)
% GETWALLNORMALVECTOR Get the normal vector of a surface
switch surfaceofimpact
case 1
N = [1 0 0];
case 2
N = [-1 0 0];
case 3
N = [0 1 0];
case 4
N = [0 -1 0];
case 5
N = [0 0 1];
case 6
N = [0 0 -1];
end
end
Feature Selection for Audio Classification
Feature selection reduces the dimensionality of data by selecting a subset of measured features to
create a model. Performing feature selection enables you to train smaller models quickly without
sacrificing accuracy. For some tasks, properly selected features used with simple thresholding can
provide adequate results, especially in situations where model size and complexity must be
minimized.
In this example, you walk through a standard machine learning pipeline to develop an audio
classification system. The pipeline has been abstracted so that you can apply the same steps to either
speaker recognition or word recognition tasks.
Download the Free Spoken Digit Dataset (FSDD) [1] on page 1-850. FSDD consists of short audio
files with spoken digits (0-9). The data is sampled at 8 kHz.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"FSDD");
ads = audioDatastore(dataset,IncludeSubfolders=true);
task = "word recognition"; % the outputs later in this example correspond to word recognition; "speaker recognition" is the alternative
[~,filenames] = fileparts(ads.Files);
switch task
case "speaker recognition"
ads.Labels = extractBetween(filenames,"_","_");
case "word recognition"
ads.Labels = extractBefore(filenames,"_");
end
Split data into train and test sets. Use 80% for training and 20% for testing.
[adsTrain,adsTest] = splitEachLabel(ads,0.8);
Listen to a sample from the training set. Plot the waveform and display the associated label.
[x,xinfo] = read(adsTrain);
sound(x,xinfo.SampleRate)
t = (0:numel(x)-1)/xinfo.SampleRate;
figure
plot(t,x)
Audio signals can broadly be categorized as stationary or non-stationary. Stationary signals have
spectrums that do not change over time, like pure tones. Non-stationary signals have spectrums that
change over time, like speech signals. To make machine learning-based tasks tractable, non-
stationary signals can be approximated as stationary when analyzed at appropriately small time
scales. Generally, speech signals are considered stationary when viewed at time scales around 30 ms.
Therefore, speech can be characterized by extracting features from 30 ms analysis windows over
time.
Use the helper function, helperVisualizeBuffer, to visualize the analysis windows of an audio
file. Specify a 30 ms analysis window with 20 ms overlap between adjacent windows. The overlap
duration must be less than the window duration. The Analysis Windows of Signal plot shows the
individual analysis windows from which features are extracted.
windowDuration = 0.03; % seconds
overlapDuration = 0.02; % seconds
helperVisualizeBuffer(x,xinfo.SampleRate,WindowDuration=windowDuration,OverlapDuration=overlapDuration)
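The construction of the audioFeatureExtractor object is not shown in this excerpt. A construction consistent with the property display that follows (a 240-sample window, 160-sample overlap, and an 8 kHz sample rate; the window type is assumed) might be:
afe = audioFeatureExtractor(SampleRate=xinfo.SampleRate, ...
    Window=hamming(round(windowDuration*xinfo.SampleRate),"periodic"), ...
    OverlapLength=round(overlapDuration*xinfo.SampleRate))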
afe =
audioFeatureExtractor with properties:
Properties
Window: [240×1 double]
OverlapLength: 160
SampleRate: 8000
FFTLength: []
SpectralDescriptorInput: 'linearSpectrum'
FeatureVectorLength: 0
Enabled Features
none
Disabled Features
linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfcc, mfccDelta
mfccDeltaDelta, gtcc, gtccDelta, gtccDeltaDelta, spectralCentroid, spectralCrest
spectralDecrease, spectralEntropy, spectralFlatness, spectralFlux, spectralKurtosis, spectralRolloffPoint
spectralSkewness, spectralSlope, spectralSpread, pitch, harmonicRatio, zerocrossrate
shortTimeEnergy
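The code that enables the features is not shown in this excerpt. One way to enable every feature group, consistent with the second property display below, is to switch on each logical feature property (a sketch; the original example may do this differently):
featureSwitches = ["linearSpectrum" "melSpectrum" "barkSpectrum" "erbSpectrum" ...
    "mfcc" "mfccDelta" "mfccDeltaDelta" "gtcc" "gtccDelta" "gtccDeltaDelta" ...
    "spectralCentroid" "spectralCrest" "spectralDecrease" "spectralEntropy" ...
    "spectralFlatness" "spectralFlux" "spectralKurtosis" "spectralRolloffPoint" ...
    "spectralSkewness" "spectralSlope" "spectralSpread" "pitch" "harmonicRatio" ...
    "zerocrossrate" "shortTimeEnergy"];
for f = featureSwitches
    afe.(char(f)) = true; % each feature group is a settable logical property
end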
afe
afe =
audioFeatureExtractor with properties:
Properties
Window: [240×1 double]
OverlapLength: 160
SampleRate: 8000
FFTLength: []
SpectralDescriptorInput: 'linearSpectrum'
FeatureVectorLength: 306
Enabled Features
linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfcc, mfccDelta
mfccDeltaDelta, gtcc, gtccDelta, gtccDeltaDelta, spectralCentroid, spectralCrest
spectralDecrease, spectralEntropy, spectralFlatness, spectralFlux, spectralKurtosis, spectralRolloffPoint
spectralSkewness, spectralSlope, spectralSpread, pitch, harmonicRatio, zerocrossrate
shortTimeEnergy
Disabled Features
none
You can use the extract object function of audioFeatureExtractor to extract all the enabled
features from an audio signal. The features are concatenated into a matrix with analysis windows
along the rows and features along the columns.
featureMatrix = extract(afe,x);
[numWindows,numFeatures] = size(featureMatrix)
numWindows = 62
numFeatures = 306
You can use info to get a mapping between the columns of the output matrix and the feature names.
The term "features" is overloaded in the literature. features can refer to the feature group, such as
"linearSpectrum", "mfcc", or "spectralCentroid", or the individual feature elements, such as the first
element of the linear spectrum or the third element of the MFCC. The output map returned by info
is a struct where each field corresponds to a feature group and the values correspond to which
columns in the feature matrix the feature groups occupy.
outputMap = info(afe)
This figure is intended to help you interpret the feature matrix returned from extract.
Use extract to extract features from all files in the audio datastore. If you have Parallel Computing
Toolbox™, spread the computation across multiple workers.
The output is a (Number of files)-by-1 cell array. Each element of the cell array is a (Number of hops)-
by-(Number of features) matrix, where the number of hops depends on the length of the audio file.
features = extract(afe,adsTrain,UseParallel=canUseParallelPool);
Feature/Label Correspondence
Once you have extracted features from approximately stationary windows in time, the next question
is whether to feed the window-level features to your machine learning model or to combine the
features into file-level representations. The choice of window-level or file-level features depends on
your application and requirements. For file-level features, you will generally create summary
statistics of the window-level features to collapse the time dimension. Common summary statistics
include the mean and standard deviation. This example uses window-level features.
To train a machine learning model on window-level features, replicate the file-level labels so that they
are in one-to-one correspondence with the features.
N = cellfun(@(x)size(x,1),features);
T = repelem(adsTrain.Labels,N);
Concatenate the features into a single matrix for consumption by machine-learning tools.
X = cat(1,features{:});
Feature Selection
Statistics and Machine Learning Toolbox™ provides several tools to aid in feature selection. The best
feature selector will depend on your intended model. Use fscmrmr (Statistics and Machine Learning
Toolbox) to rank features for classification using the minimum-redundancy/maximum-relevance
(MRMR) algorithm. The MRMR is a sequential algorithm that finds an optimal set of features that is
mutually and maximally dissimilar and can represent the response variable effectively.
rng("default") % for reproducibility
[featureSelectionIdx,featureSelectionScores] = fscmrmr(X,T);
The fscmrmr function considers each column of the input feature matrix as a unique feature. Plot the
scores of each scalar in the feature matrix returned by audioFeatureExtractor.
figure
bar(featureSelectionScores)
ylabel("Feature Score")
xlabel("Feature Matrix Column")
The audioFeatureExtractor extracts feature groups with varying numbers of elements. For
example, the default number of elements of the MFCC feature group is 13, while the spectral
centroid feature always consists of 1 element. The output map returned by calling info on
audioFeatureExtractor is a struct with fields equal to the feature group and values equal to the
columns that feature group occupies in the matrix output by extract. Use the output map and the
supporting function uniqueFeatureName on page 1-849 to create a unique name for each scalar
feature, then plot the scores of the top 25 performing features.
featurenames = uniqueFeatureName(outputMap);
featurenamesSorted = featurenames(featureSelectionIdx);
figure
bar(reordercats(categorical(featurenames),featurenamesSorted),featureSelectionScores)
xlim([featurenamesSorted(1),featurenamesSorted(25)])
Depending on your application, you can approximate grouped feature selection by averaging the
scores of feature groups. Using grouped features (for example, all MFCC) may help you deploy more
efficient feature extraction. In this example, you use the top-performing feature scalars, regardless of
which feature group they belong to.
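As a sketch of that grouped approximation (assuming outputMap from the earlier call to info(afe)), you could average the MRMR scores over the columns that each feature group occupies:
groupNames = fieldnames(outputMap);
groupScores = zeros(numel(groupNames),1);
for ii = 1:numel(groupNames)
    cols = outputMap.(groupNames{ii}); % columns occupied by this feature group
    groupScores(ii) = mean(featureSelectionScores(cols));
end
[~,order] = sort(groupScores,"descend");
table(string(groupNames(order)),groupScores(order),VariableNames=["FeatureGroup","MeanScore"])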
Select some top scoring features. The number you select will depend on the model you are training
and the final constraints of your application.
numFeatures = ;
selectedFeatureIndex = featureSelectionIdx(1:numFeatures);
Train Model
To train a KNN model using your selected features, use fitcknn (Statistics and Machine Learning
Toolbox). If you are unsure of which machine learning model you want to use, try fitcauto
(Statistics and Machine Learning Toolbox) to automatically select a classification model with
optimized parameters, or try the Classification Learner (Statistics and Machine Learning Toolbox).
Mdl = fitcknn(X(:,selectedFeatureIndex),T,Standardize=true);
Evaluate Model
Read a sample from the test set. Listen to the sample and then plot its waveform and display the
ground-truth label.
[x,xInfo] = read(adsTest);
sound(x,xInfo.SampleRate)
t = (0:numel(x)-1)/xInfo.SampleRate;
figure
plot(t,x)
title("Label: " + xInfo.Label)
grid on
axis tight
ylabel("Amplitude")
xlabel("Time (s)")
yPerWindow = extract(afe,x);
t = predict(Mdl,yPerWindow(:,selectedFeatureIndex));
trueLabel = categorical(xInfo.Label)
trueLabel = categorical
0
predictionsPerWindow = categorical(t')
prediction = mode(predictionsPerWindow)
prediction = categorical
0
Tfile = categorical(adsTest.Labels);
featuresTest = extract(afe,adsTest,UseParallel=canUseParallelPool);
Y = cellfun(@(x)mode(categorical(predict(Mdl,x(:,selectedFeatureIndex)))),featuresTest,UniformOutput=false);
Y = cat(1,Y{:});
figure
confusionchart(Tfile,Y,Title="Accuracy = " + 100*mean(Tfile==Y) + " (%)")
You can apply a similar pattern as above to also select an optimal window, window length, window
overlap, DFT length, and input to spectral descriptors.
Supporting Functions
function c = uniqueFeatureName(afeInfo)
%UNIQUEFEATURENAME Create unique feature names
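% (Body not shown in this excerpt.) A sketch consistent with how the function
% is used above: expand each feature-group name into one unique name per
% column of the feature matrix.
c = strings(0,1);
groupNames = fieldnames(afeInfo);
for ii = 1:numel(groupNames)
    cols = afeInfo.(groupNames{ii});
    for jj = 1:numel(cols)
        c(cols(jj)) = groupNames{ii} + "_" + jj;
    end
end
end % close uniqueFeatureName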
References
See Also
audioFeatureExtractor | audioDatastore | fscmrmr | fitcknn
Related Examples
• “Accelerate Audio Machine Learning Workflows Using a GPU” on page 1-249
• “Accelerate Audio Deep Learning Using GPU-Based Feature Extraction” on page 1-757
Adapt Pretrained Audio Network for New Data Using Deep Network Designer
This example shows how to interactively adapt a pretrained network to classify new audio signals
using Deep Network Designer.
Transfer learning is commonly used in deep learning applications. You can take a pretrained network
and use it as a starting point to learn a new task. Fine-tuning a network with transfer learning is
usually much faster and easier than training a network with randomly initialized weights from
scratch. You can quickly transfer learned features to a new task using a smaller number of training
signals.
This example retrains YAMNet, a pretrained convolutional neural network, to classify a new set of
audio signals.
Load Data
Download and unzip the air compressor data set [1] on page 1-856. This data set consists of
recordings from air compressors in a healthy state or one of seven faulty states.
zipFile = matlab.internal.examples.downloadSupportFile('audio','AirCompressorDataset/AirCompressorDataset.zip');
dataFolder = fileparts(zipFile);
unzip(zipFile,dataFolder);
ads = audioDatastore(dataFolder,IncludeSubfolders=true,LabelSource="foldernames");
classNames = categories(ads.Labels);
Split the data into training, validation, and test sets using the splitEachLabel function.
[adsTrain,adsValidation,adsTest] = splitEachLabel(ads,0.7,0.2,0.1,"randomized");
Use the transform function to preprocess the data using the function audioPreprocess, found at
the end of this example. For each signal:
• Use yamnetPreprocess to generate mel spectrograms suitable for training using YAMNet. Each
audio signal produces multiple spectrograms.
• Duplicate the class label for each of the spectrograms.
tdsTrain = transform(adsTrain,@audioPreprocess,IncludeInfo=true);
tdsValidation = transform(adsValidation,@audioPreprocess,IncludeInfo=true);
tdsTest = transform(adsTest,@audioPreprocess,IncludeInfo=true);
Prepare and train the network interactively using Deep Network Designer (Deep Learning Toolbox).
To open Deep Network Designer, on the Apps tab, under Machine Learning and Deep Learning,
click the app icon. Alternatively, you can open the app from the command line.
deepNetworkDesigner
Deep Network Designer provides a selection of pretrained audio classification networks. These
models require both Audio Toolbox™ and Deep Learning Toolbox™.
Under Audio Networks, select YAMNet from the list of pretrained networks and click Open. If the
Audio Toolbox model for YAMNet is not installed, click Install instead. Deep Network Designer
provides a link to the location of the network weights. Unzip the file to a location on the MATLAB
path. Now close the Deep Network Designer Start Page and reopen it. When the network is correctly
installed and on the path, you can click the Open button on YAMNet. The YAMNet model can classify
audio into one of 521 sound categories. For more information, see yamnet.
Deep Network Designer displays a zoomed-out view of the whole network in the Designer pane. To
zoom in with the mouse, use Ctrl+scroll wheel. To pan, use the arrow keys, or hold down the scroll
wheel and drag the mouse. Select a layer to view its properties. Clear all layers to view the network
summary in the Properties pane.
To use a pretrained network for transfer learning, you must change the number of classes to match
your new data set. First, find the last learnable layer in the network. For YAMNet, the last learnable
layer is the last fully connected layer, dense. At the bottom of the Properties pane, click Unlock
Layer. In the warning dialog that appears, click Unlock Anyway. This unlocks the layer properties so
that you can adapt them to your new task.
Before R2023b: To edit the layer properties, you must replace the layers instead of unlocking them.
The OutputSize property defines the number of classes for classification problems. Change
OutputSize to the number of classes in the new data, in this example, 8.
Change the learning rates so that learning is faster in the new layer than in the transferred layers by
setting WeightLearnRateFactor and BiasLearnRateFactor to 10.
To check that the network is ready for training, click Analyze. If the Deep Learning Network
Analyzer reports zero errors, then the edited network is ready for training. To export the network,
click Export. The app saves the network as the variable net_1.
Specify the training options. Choosing among the options requires empirical analysis. To explore
different training option configurations by running experiments, you can use the Experiment
Manager app.
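The trainingOptions call itself does not appear in this excerpt. A hedged sketch of a configuration consistent with the workflow (the values are illustrative, not the example's exact settings):
options = trainingOptions("adam", ...
    MiniBatchSize=128, ...          % illustrative
    MaxEpochs=2, ...                % illustrative
    ValidationData=tdsValidation, ...
    Verbose=false, ...
    Plots="training-progress");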
Train Network
Train the neural network using the trainnet (Deep Learning Toolbox) function. For classification
tasks, use cross-entropy loss. By default, the trainnet function uses a GPU if one is available. Using
a GPU requires a Parallel Computing Toolbox™ license and a supported GPU device. For information
on supported devices, see “GPU Computing Requirements” (Parallel Computing Toolbox). Otherwise,
the trainnet function uses the CPU. To specify the execution environment, use the
ExecutionEnvironment training option.
net = trainnet(tdsTrain,net_1,"crossentropy",options);
Test Network
Make predictions using the minibatchpredict (Deep Learning Toolbox) function. By default, the
minibatchpredict function uses a GPU if one is available.
data = readall(tdsTest);
T = [data{:,2}];
scores = minibatchpredict(net,tdsTest);
Y = scores2label(scores,classNames);
accuracy = sum(Y == T')/numel(T)
accuracy = 0.9943
Supporting Function
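The beginning of the audioPreprocess function is not shown in this excerpt. A declaration consistent with its use in transform(...,IncludeInfo=true) and with the variables referenced below would be:
function [data,info] = audioPreprocess(audioIn,info)
class = info.Label;
fs = info.SampleRate;
features = yamnetPreprocess(audioIn,fs); % one mel spectrogram per audio segment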
numSpectrograms = size(features,4);
data = cell(numSpectrograms,2);
for index = 1:numSpectrograms
data{index,1} = features(:,:,:,index);
data{index,2} = class;
end
end
References
[1] Verma, Nishchal K., Rahul Kumar Sevakula, Sonal Dixit, and Al Salour. “Intelligent Condition
Based Monitoring Using Acoustic Signals for Air Compressors.” IEEE Transactions on Reliability 65,
no. 1 (March 2016): 291–309. https://doi.org/10.1109/TR.2015.2459684.
Speech Command Recognition Code Generation with Intel MKL-DNN Using Simulink
This example demonstrates how to deploy feature extraction and a convolutional neural network
(CNN) for speech command recognition on Intel® processors. To generate the feature extraction and
network code, you use Embedded Coder in Simulink® and the Intel® Math Kernel Library for Deep
Neural Networks (MKL-DNN). In this example you generate Software-in-the-loop (SIL) code
for a reference model which performs feature extraction and predicts the speech command. The
generated SIL code is called in a Simulink model which displays the predicted speech command and
predicted scores for the given inputs. For details about audio preprocessing and network training, see
“Train Speech Command Recognition Model Using Deep Learning” on page 1-313.
Prerequisites
For supported versions of libraries and for information about setting up environment variables, see
“Prerequisites for Deep Learning with MATLAB Coder” (MATLAB Coder).
Create a Simulink model and capture the feature extraction, convolutional neural network and
postprocessing as developed in “Speech Command Recognition in Simulink” on page 1-42. This model
is shipped with this example. Open the shipped model to understand its configurations.
modelToDeploy = "recognizeSpeechCommand";
open_system(modelToDeploy)
Set the Data type, Port dimensions, Sample time, and Signal type of the input port block as
shown.
Open the recognizeSpeechCommand model. Go to the MODELING tab and click Model Settings, or press Ctrl+E. Select Code Generation and set the System Target File to ert.tlc, whose Description is Embedded Coder. Set the Language to C++, which automatically sets the Language Standard to C++11 (ISO).
set_param(modelToDeploy,SystemTargetFile="ert.tlc")
set_param(modelToDeploy,TargetLang="C++")
set_param(modelToDeploy,TargetLangStandard="C++11 (ISO)")
To set Intel MKL-DNN Deep Learning Config, expand Code Generation and select Interface. Now
set the Deep Learning Target Library to MKL-DNN as shown.
Alternatively, use set_param to configure the Deep learning target library programmatically.
set_param(modelToDeploy,DLTargetLibrary="mkl-dnn")
Select a solver that supports code generation. Set Solver to auto (Automatic solver
selection) and Solver type to Fixed-step.
set_param(modelToDeploy,SolverName="FixedStepAuto")
set_param(modelToDeploy,SolverType="Fixed-step")
In Configuration > Hardware Implementation, set Device vendor to Intel and Device type to
x86-64 (Windows64) or x86-64 (Linux 64) or x86-64 (Mac OS X) depending on your target
system. Alternatively, use set_param to configure the settings programmatically.
switch(computer("arch"))
case "win64"
ProdHWDeviceType = "Intel->x86-64 (Windows64)";
case "glnxa64"
ProdHWDeviceType = "Intel->x86-64 (Linux 64)";
case "maci64"
ProdHWDeviceType = "Intel->x86-64 (Mac OS X)";
end
set_param(modelToDeploy, "ProdHWDeviceType", ProdHWDeviceType)
To automate setting the Device type, add the above code in Property Inspector > Properties >
Callbacks > PreLoadFcn of the recognizeSpeechCommand model.
Use the Embedded Coder app to generate and build the code. Click the APPS tab and then click Embedded Coder as shown.
This opens a new C++ CODE tab. Click Build to generate and build the code. The code is generated in a folder named recognizeSpeechCommand_ert_rtw. After generating the code, you can view the report by clicking Open Report.
slbuild(modelToDeploy);
Build Summary
save_system(modelToDeploy)
close_system(modelToDeploy)
Create a Simulink Model that Calls recognizeSpeechCommand and Displays its Output
Create a new Simulink model and add recognizeSpeechCommand as a model reference block to it. Add the same base workspace variables, source blocks, and sink blocks as developed in "Speech Command Recognition in Simulink" on page 1-42. Use a radio button group for selecting speech command files. For your reference, this model is shipped with this example. Open the shipped model.
mainModel = "slexSpeechCommRecognitionCodegenWithMklDnnExample";
open_system(mainModel)
To set the Software-in-the-loop (SIL) simulation mode for the model reference block, click the MODELING tab.
Click the drop-down button as shown above to open a window, and then select Property Inspector as shown below.
A Property Inspector pane appears at the right of your model. Click the Model block to view its properties. If Model name is not set, browse for recognizeSpeechCommand.slx and set the Model name. Then set Simulation mode to Software-in-the-loop (SIL) as shown.
Run the model to deploy the recognizeSpeechCommand.slx on your computer and perform speech
command recognition.
set_param(mainModel,StopTime="20")
sim(mainModel)
Build Summary
ans =
Simulink.SimulationOutput:
save_system(mainModel)
close_system(mainModel)
• Simulate “Speech Command Recognition in Simulink” on page 1-42 model using Intel® MKL-DNN
library by setting the Configuration > Simulation Target > Language to C++.
• Compare the simulation speed of the “Speech Command Recognition in Simulink” on page 1-42
model with and without Intel® MKL-DNN library. Use Simulink Profiler (Simulink) to profile the
model by setting the Configuration > Simulation Target > Language to C and C++.
Loudspeaker Modeling with Simscape
This example shows how to model a dynamic loudspeaker using linear and nonlinear lumped element
models.
Dynamic loudspeaker drivers convert electrical signals into acoustic waves using electromagnetic
energy to produce mechanical movements in a cone-shaped diaphragm. Therefore, three main
domains must be represented in the model: electrical, mechanical, and acoustical, in addition to the
bidirectional energy conversions between these.
[Figure: cutaway view of a dynamic loudspeaker driver, labeling the basket, surround, cone, magnet, suspension, and input wires.]
A common linear model for a loudspeaker is to represent it as an electrical circuit, which is known as
a lumped element model. The mechanical and acoustic effects are represented by electrical circuits
that are mathematically equivalent models.
[Figure: lumped element circuit model of the driver. The electrical side contains Re and Le driven by the voltage v(t) with current i(t); a gyrator with force factor Bl couples it to the mechanical elements Rm, Mm, and Cm; a transformer with ratio SD:1 couples the mechanical side to the acoustical load.]
For the electrical model, the motor is composed of the voice coil and the magnet. The voice coil is driven by a voltage v(t) and has a resistance Re and an inductance Le. These two parameters depend on the wire material, diameter, length, turn radius, number of turns, and other physical properties. The magnet also has an impact on the coil inductance because of the addition of a ferrous core.
The magnet creates a field in the gap with a flux density B. Multiplied by the wire length l, this is known as the force factor Bl. This is also the conversion factor between the electrical and mechanical domains, as a force F = Bl*i(t) is applied to the voice coil, where i(t) is the electrical current applied at the input. Inversely, a voltage e = Bl*u is generated by the coil motion, where u is the cone velocity. When converting the mechanical model to an electrical model, the coupling between them is represented by a gyrator, where velocity corresponds to the electrical current and force corresponds to the voltage.
For the mechanical model, lumped electrical components are used as analogues to mechanical properties such as mass and compliance. First, the inertia of the total moving mass Mm (including the coil, cone, and dust cap) is analogous to the effect of an inductance on varying electrical currents. Second, the stiffness of the suspension and spider is analogous to the effect of a capacitor Cm. Third, the mechanical loss in the suspension system is analogous to a resistor Rm. This mechanical model forms a resonant circuit with resonant frequency fs = 1/(2*pi*sqrt(Mm*Cm)), which implies that the efficient frequency range of the driver depends on its mass.
For the acoustical model, the interface between the driver cone surface and the air is analogous to a transformer with ratio SD:1, where SD is the effective cone area. The larger the cone is, the more mechanical energy is transformed to acoustic energy (at least for a given mass). An acoustic impedance models the radiation resistance for the front and the back of the cone. For a non-enclosed driver, this value is nonlinear but relatively small. An enclosed driver has a fixed amount of air, which creates a compliance modeled by a capacitor, and any air leaks (including a vent) contribute a resistance. For a vented enclosure, the mass of air moving in and out acts as an inductor.
For simplicity, the remainder of this example assumes a loudspeaker in free space, that is, no enclosure.
Using Simscape™, a loudspeaker can be modeled using a mixed-domain approach (mixing electrical
and mechanical domains), or with a familiar model that converts the mechanical domain into the
electrical domain.
First, implement the electrical model shown above, using a gyrator directly.
model = 'LinearGyrator';
open_system(model);
sim(model,1);
close_system(model,0)
For analysis, the gyrator is often removed by rearranging the circuit topology and the elements
values.
[Figure: equivalent circuit after removing the gyrator, with Re and Le in series and the transformed mechanical elements reflected into the electrical domain.]
It can be shown that this circuit behaves the same as above with the following components:
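The component values are not listed in this excerpt. With the gyrator removed, the usual lumped-parameter analysis reflects the mechanical elements into the electrical domain; a small sketch with illustrative parameter values:
Bl = 5.6;        % force factor, N/A (illustrative)
Mm = 0.02;       % moving mass, kg (illustrative)
Cm = 1e-3;       % suspension compliance, m/N (illustrative)
Rm = 0.5;        % mechanical loss, N*s/m (illustrative)
Cmes = Mm/Bl^2   % moving mass appears as a capacitance (F)
Lces = Cm*Bl^2   % compliance appears as an inductance (H)
Res = Bl^2/Rm    % mechanical loss appears as a resistance (ohm)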
model = 'LinearCircuit';
open_system(model);
sim(model,1);
close_system(model,0)
Simscape™ also allows mixing electrical and mechanical elements, so the loudspeaker model can be
simulated without any physical domain conversions. Again, the same results are obtained.
model = 'LinearMixedDomain';
open_system(model);
sim(model,1);
close_system(model,0)
Other elements can easily be added to this circuit model to account for a closed or vented enclosure.
In addition to combining two physical domains in one simulation, digital signal processing algorithms
can be included. The following model represents an active loudspeaker with a woofer and a tweeter.
The crossover, parametric EQ and shelving filters are implemented in the digital domain, followed by
an optimized power amplifier for each driver. The output of each driver is measured separately, and
the combined output is compared to the log-chirp input in the frequency domain.
model = 'MixedModeling';
open_system(model);
sim(model,3);
close_system(model,0)
Several loudspeaker elements are nonlinear. For example, the voice coil force factor and inductance
vary with its position in the magnet. Furthermore, the suspension spring rate changes at the
extremities of its displacement range.
Linear components in the previous model can be replaced by custom versions that implement the required nonlinearities. For example, the spring component can define the spring rate as a polynomial in the displacement, k(x) = k0 + k1*x + k2*x^2 + ..., where x is the displacement and k0, k1, k2, ... are the polynomial coefficients (k0 being the spring rate at displacement zero).
Plot sample values for the compliance, force factor, and inductance. Then run a model that implements the "woofer" driver of the previous model, but with three linear components replaced by nonlinear components.
plotNonLinearBHK;
model = 'NonLinearBHK';
open_system(model);
sim(model,3);
close_system(model,0)
Starting from these examples, add components to model an enclosure, or implement your own
nonlinear elements. Simscape™ will allow you to do this using the domain of your choice (electrical,
mechanical). You can also test any digital pre-processing that is required, all in one model.
Definitions
v(t): input voltage
i(t): input current
Bl: force factor
u: diaphragm velocity
x: diaphragm displacement
Mm: moving mass
Rm: mechanical loss
Cm: suspension compliance
Za: acoustical impedance
Investigate Audio Classifications Using Deep Learning Interpretability Techniques
This example shows how to use interpretability techniques to investigate the predictions of a deep
neural network trained to classify audio data.
Deep learning networks are often described as "black boxes" because why a network makes a certain
decision is not always obvious. You can use interpretability techniques to translate network behavior
into output that a person can interpret. This interpretable output can then answer questions about
the predictions of a network. This example uses interpretability techniques that explain network
predictions using visual representations of what a network is “looking” at. You can then use these
visual representations to see which parts of the input images the network is using to make decisions.
This example uses transfer learning to retrain VGGish, a pretrained convolutional neural network, to
classify a new set of audio signals.
Load Data
Download and unzip the environmental sound classification data set. This data set consists of
recordings labeled as one of 10 different audio sound classes (ESC-10). Download the ESC-10.zip
zip file from the MathWorks website, then unzip the file.
rng("default")
zipFile = matlab.internal.examples.downloadSupportFile("audio","ESC-10.zip");
filepath = fileparts(zipFile);
dataFolder = fullfile(filepath,"ESC-10");
unzip(zipFile,dataFolder)
Create an audioDatastore object to manage the data and split it into training and validation sets.
Use countEachLabel to display the distribution of sound classes and the number of unique labels.
ads = audioDatastore(dataFolder,IncludeSubfolders=true,LabelSource="foldernames");
labelTable = countEachLabel(ads)
labelTable=10×2 table
Label Count
______________ _____
chainsaw 40
clock_tick 40
crackling_fire 40
crying_baby 40
dog 40
helicopter 40
rain 40
rooster 38
sea_waves 40
sneezing 40
Use splitEachLabel to split the data set into training and validation sets. Use 80% of the data for
training and 20% for validation.
[adsTrain,adsValidation] = splitEachLabel(ads,0.8,0.2);
The VGGish pretrained network requires preprocessing of the audio signals into log mel
spectrograms. The supporting function helperAudioPreprocess, defined at the end of this
example, takes as input an audioDatastore object and the overlap percentage between log mel
spectrograms and returns matrices of predictors and responses suitable for input to the VGGish
network. Each audio file is split into several segments to feed into the VGGish network.
overlapPercentage = 75;
[trainFeatures,trainLabels] = helperAudioPreprocess(adsTrain,overlapPercentage);
[validationFeatures,validationLabels,segmentsPerFile] = helperAudioPreprocess(adsValidation,overlapPercentage);
Visualize Data
numImages = 9;
idxSubset = randi(numel(trainLabels),1,numImages);
viewingAngle = ;
figure
tiledlayout("flow",TileSpacing="compact");
for i = 1:numImages
img = trainFeatures(:,:,:,idxSubset(i));
label = trainLabels(idxSubset(i));
nexttile
surf(img,EdgeColor="none")
view(viewingAngle)
title("Class: " + string(label),interpreter="none")
end
colormap parula
Build Network
This example uses transfer learning to retrain VGGish, a pretrained convolutional neural network, to
classify a new set of audio signals.
Type vggish in the Command Window. If the Audio Toolbox model for VGGish is not installed, then
the function provides a link to the location of the network weights. To download the model, click the
link. Unzip the file to a location on the MATLAB path.
pretrainedNetwork = vggish;
lgraph = layerGraph(pretrainedNetwork.Layers);
Prepare the network for transfer learning by replacing the final layers with new layers suitable for
the new data. You can adapt VGGish for the new data programmatically or interactively using Deep
Network Designer. For an example showing how to use Deep Network Designer to perform transfer
learning with an audio classification network, see “Adapt Pretrained Audio Network for New Data
Using Deep Network Designer” (Deep Learning Toolbox).
Use removeLayers to remove the final regression output layer from the graph. After you remove the
regression layer, the new final layer of the graph is a ReLU layer named EmbeddingBatch.
lgraph = removeLayers(lgraph,"regressionoutput");
lgraph.Layers(end)
ans =
ReLULayer with properties:
Name: 'EmbeddingBatch'
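numClasses is not defined in this excerpt; a definition consistent with the ESC-10 data would be:
numClasses = numel(categories(adsTrain.Labels)); % 10 sound classes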
lgraph = addLayers(lgraph,fullyConnectedLayer(numClasses,Name="FCFinal"));
lgraph = addLayers(lgraph,softmaxLayer(Name="softmax"));
lgraph = addLayers(lgraph,classificationLayer(Name="classOut"));
Use connectLayers to append the fully connected, softmax, and classification layers to the layer
graph.
lgraph = connectLayers(lgraph,"EmbeddingBatch","FCFinal");
lgraph = connectLayers(lgraph,"FCFinal","softmax");
lgraph = connectLayers(lgraph,"softmax","classOut");
To define the training options, use the trainingOptions function. Set the solver to "adam" and
train for five epochs with a mini-batch size of 128. Specify an initial learning rate of 0.001 and drop
the learning rate after two epochs by multiplying by a factor of 0.5. Monitor the network accuracy
during training by specifying validation data and the validation frequency.
miniBatchSize = 128;
options = trainingOptions("adam", ...
MaxEpochs=5, ...
MiniBatchSize=miniBatchSize, ...
InitialLearnRate = 0.001, ...
LearnRateSchedule="piecewise", ...
LearnRateDropPeriod=2, ...
LearnRateDropFactor=0.5, ...
ValidationData={validationFeatures,validationLabels}, ...
ValidationFrequency=50, ...
Shuffle="every-epoch");
Train Network
To train the network, use the trainNetwork function. By default, trainNetwork uses a GPU if one
is available. Otherwise, it uses a CPU. Training on a GPU requires Parallel Computing Toolbox™ and a
supported GPU device. For information on supported devices, see “GPU Computing Requirements”
(Parallel Computing Toolbox). You can also specify the execution environment by using the
ExecutionEnvironment name-value argument of trainingOptions.
[net,netInfo] = trainNetwork(trainFeatures,trainLabels,lgraph,options);
Test Network
[validationPredictions,validationScores] = classify(net,validationFeatures);
Each audio file produces multiple mel spectrograms. Combine the predictions for each audio file in
the validation set using a majority-rule decision and calculate the classification accuracy.
idx = 1;
validationPredictionsPerFile = categorical;
for ii = 1:numel(adsValidation.Files)
validationPredictionsPerFile(ii,1) = mode(validationPredictions(idx:idx+segmentsPerFile(ii)-1));
idx = idx + segmentsPerFile(ii);
end
accuracy = mean(validationPredictionsPerFile==adsValidation.Labels)*100
accuracy = 92.5000
Use confusionchart to evaluate the performance of the network on the validation set.
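The confusionchart call does not appear in this excerpt; a sketch consistent with the per-file predictions computed above:
figure
confusionchart(adsValidation.Labels,validationPredictionsPerFile, ...
    Title="Validation Accuracy: " + accuracy + " (%)");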
Visualize Predictions
View a random sample of the input data with the true and predicted class labels.
numImages = ;
idxSubset = randi(numel(validationLabels),1,numImages);
viewingAngle = ;
figure
t1 = tiledlayout("flow",TileSpacing="compact");
for i = 1:numImages
img = validationFeatures(:,:,:,idxSubset(i));
YPred = validationPredictions(idxSubset(i));
YTrue = validationLabels(idxSubset(i));
nexttile
surf(img,EdgeColor="none")
view(viewingAngle)
title({"True: " + string(YTrue),"Predicted: " + string(YPred)},interpreter= "none")
end
colormap parula
The x-axis represents time, the y-axis represents frequency, and the colormap represents decibels.
For several of the classes, you can see interpretable features. For example, the spectrogram for the
clock_tick class shows a repeating pattern through time representing the ticking of a clock. The
first spectrogram from the helicopter class has the constant, loud, low-frequency sound of the
helicopter engine and a repeating high-frequency sound representing the spinning of the helicopter
blades.
As the network is a convolutional neural network with image input, the network might use these
features when making classification decisions. You can investigate this hypothesis using deep
learning interpretability techniques.
Investigate Predictions
Investigate the predictions of the validation mel spectrograms. For each input, generate the Grad-
CAM (gradCAM (Deep Learning Toolbox)), LIME (imageLIME (Deep Learning Toolbox)), and
occlusion sensitivity (occlusionSensitivity (Deep Learning Toolbox)) maps for the predicted
classes. These methods take an input image and a class label and produce a map indicating the
regions of the image that are important to the score for the specified class. Each visualization method
has a specific approach that determines the output it produces.
• Grad-CAM — Use the gradient of the classification score with respect to the convolutional features
determined by the network to understand which parts of the image are most important for
classification. The places where the gradient is large are the places where the final score depends
most on the data.
• LIME — Approximate the classification behavior of a deep learning network using a simpler, more
interpretable model, such as a linear model or a regression tree. The simple model determines the
importance of features of the input data as a proxy for the importance of the features to the deep
learning network.
• Occlusion sensitivity — Perturb small areas of the input by replacing them with an occluding
mask, typically a gray square. As the mask moves across the image, the technique measures the
change in probability score for a given class.
Comparing the results of different interpretability techniques is important for verifying the
conclusions you make. For more information about these techniques, see “Deep Learning
Visualization Methods” (Deep Learning Toolbox).
Using the supporting function helperPlotMaps, defined at the end of this example, plot the input
log mel spectrogram and the three interpretability maps for a selection of images and their predicted
classes.
viewingAngle = ;
imgIdx = [250 500 750];
numImages = length(imgIdx);
figure
t2 = tiledlayout(numImages,4,TileSpacing="compact");
for i = 1:numImages
img = validationFeatures(:,:,:,imgIdx(i));
YPred = validationPredictions(imgIdx(i));
YTrue = validationLabels(imgIdx(i));
mapClass = YPred;
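    % The map computations are not shown in this excerpt; a sketch using the
    % Deep Learning Toolbox interpretability functions named below:
    mapGradCAM = gradCAM(net,img,mapClass);
    mapLIME = imageLIME(net,img,mapClass);
    mapOcclusion = occlusionSensitivity(net,img,mapClass);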
maps = {mapGradCAM,mapLIME,mapOcclusion};
mapNames = ["Grad-CAM","LIME","Occlusion Sensitivity"];
helperPlotMaps(img,YPred,YTrue,maps,mapNames,viewingAngle,mapClass)
end
The interpretability mappings highlight regions of interest for the predicted class label of each
spectrogram.
• For the clock_tick class, all three methods focus on the same area of interest. The network uses
the region corresponding to the ticking sound to make its prediction.
• For the helicopter class, all the three methods focus on the same region at the bottom of the
spectrogram.
• For the crying_baby class, the three methods highlight different areas of the spectrogram,
possibly because this spectrogram contains many small features. Methods like Grad-CAM, which
produce lower resolution maps, might have difficulty picking out meaningful features. This
example highlights the limits of using interpretability methods to understand individual network
predictions.
As the results of training have an element of randomness, if you run this example again, you might
see different results. Additionally, to produce interpretable output for different images, you might
need to adjust the map parameters for the occlusion sensitivity and LIME maps. Grad-CAM does not
require parameter tuning, but it can produce lower resolution maps than the other two methods.
classToInvestigate = ;
idxClass = find(classes == classToInvestigate);
idxSubset = validationLabels==classes(idxClass);
subsetLabels = validationLabels(idxSubset);
subsetImages = validationFeatures(:,:,:,idxSubset);
subsetPredictions = validationPredictions(idxSubset);
Generate and plot the interpretability maps using the input spectrograms and the predicted class
labels.
viewingAngle = ;
figure
t3 = tiledlayout(numImages,4,"TileSpacing","compact");
for i = 1:numImages
img = subsetImages(:,:,:,imgIdx(i));
YPred = subsetPredictions(imgIdx(i));
YTrue = subsetLabels(imgIdx(i));
mapClass = YPred;
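    % Map computations (not shown in this excerpt); same sketch as above:
    mapGradCAM = gradCAM(net,img,mapClass);
    mapLIME = imageLIME(net,img,mapClass);
    mapOcclusion = occlusionSensitivity(net,img,mapClass);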
maps = {mapGradCAM,mapLIME,mapOcclusion};
mapNames = ["Grad-CAM","LIME","Occlusion Sensitivity"];
helperPlotMaps(img,YPred,YTrue,maps,mapNames,viewingAngle,mapClass)
end
The maps for each image show that the network is focusing on the area of high intensity and low
frequency. The result is surprising as you might expect the network to also be interested in the high-
frequency noise that repeats through time. Spotting patterns like this is important for understanding
the features a network is using to make predictions.
Investigate Misclassifications
Investigate a spectrogram with the true class chainsaw but the predicted class helicopter.
trueClass = "chainsaw";
predictedClass = "helicopter";
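% incorrectIdx is not computed in this excerpt; a consistent definition selects
% the validation spectrograms of the true class that were predicted as the other class:
incorrectIdx = find(validationLabels(:) == trueClass & validationPredictions(:) == predictedClass);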
idxToInvestigate = incorrectIdx(1);
YPred = validationPredictions(idxToInvestigate);
YTrue = validationLabels(idxToInvestigate);
Generate and plot the maps for both the true class (chainsaw) and the predicted class
(helicopter).
figure
t4 = tiledlayout(2,4,"TileSpacing","compact");
img = validationFeatures(:,:,:,idxToInvestigate);
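% The loop that generates maps for both the predicted and the true class is
% not shown in this excerpt; a sketch consistent with the 2-by-4 tiled layout
% above and the closing "end" below:
for mapClass = [YPred YTrue]
    mapGradCAM = gradCAM(net,img,mapClass);
    mapLIME = imageLIME(net,img,mapClass);
    mapOcclusion = occlusionSensitivity(net,img,mapClass);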
maps = {mapGradCAM,mapLIME,mapOcclusion};
mapNames = ["Grad-CAM","LIME","Occlusion Sensitivity"];
helperPlotMaps(img,YPred,YTrue,maps,mapNames,viewingAngle,mapClass)
end
The network focuses on the area of low frequency for the helicopter class. The result matches the
interpretability maps generated for the helicopter class. Visual inspection is important for
investigating what parts of an input the network is using to make its classification decisions.
Supporting Functions
helperPlotMaps
The supporting function helperPlotMap generates a plot of the input image and the specified
interpretability maps.
function helperPlotMaps(img,YPred,YTrue,maps,mapNames,viewingAngle,mapClass)
nexttile
surf(img,EdgeColor="none")
view(viewingAngle)
title({"True: "+ string(YTrue), "Predicted: " + string(YPred)}, ...
interpreter="none")
colormap parula
numMaps = length(maps);
for i = 1:numMaps
map = maps{i};
mapName = mapNames(i);
nexttile
surf(map,EdgeColor="none")
view(viewingAngle)
title(mapName,mapClass,interpreter="none")
end
end
helperAudioPreprocess
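The function declaration and the start of its per-file loop are not shown in this excerpt. A sketch consistent with the lines that follow and with how the function is called above:
function [predictor,response,segmentsPerFile] = helperAudioPreprocess(ads,overlap)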
numFiles = numel(ads.Files);
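% (Loop over the files in the datastore; not shown in this excerpt.)
for ii = 1:numFiles
    [audioIn,info] = read(ads); % read the next file along with its label and sample rate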
fs = info.SampleRate;
features = vggishPreprocess(audioIn,fs,OverlapPercentage=overlap);
numSpectrograms = size(features,4);
predictor{ii} = features;
response{ii} = repelem(info.Label,numSpectrograms);
segmentsPerFile(ii) = numSpectrograms;
end
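predictor = cat(4,predictor{:}); % concatenate spectrograms along the 4th dimension (sketch, not shown in this excerpt)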
response = cat(2,response{:});
end
Speech Command Recognition on Raspberry Pi Using Simulink
This example shows how to deploy feature extraction and a convolutional neural network (CNN) for
speech command recognition on Raspberry Pi®. In this example you develop a Simulink® model that
captures audio from the microphone connected to the Raspberry Pi board and performs speech
command recognition. You run the Simulink model on Raspberry Pi in External Mode and display
the recognized speech command. For details about audio preprocessing and network training, see
“Train Speech Command Recognition Model Using Deep Learning” on page 1-313.
Create a Simulink model and capture the feature extraction, convolutional neural network and
postprocessing as developed in “Speech Command Recognition in Simulink” on page 1-42. Add the
ALSA Audio Capture (Simulink) block from the Simulink Support Package for Raspberry Pi
Hardware library as shown.
Connect a microphone to your Raspberry Pi board and use listAudioDevices (Simulink) to list
all the audio capture devices connected to your board.
r = raspi("raspiname","pi","password");
a = listAudioDevices(r,"capture");
a(1)
a(2)
ans =
struct with fields:
Name: 'USB-Audio-LogitechUSBHeadsetH340-LogitechInc.LogitechUSBHeadsetH340atusb-0000:0
Device: '2,0'
Channels: {}
BitDepth: {}
SamplingRate: {}
ans =
Name: 'USB-Audio-PlantronicsBT600-PlantronicsPlantronicsBT600atusb-0000:01:00.0-1.1,fu
Device: '3,0'
Channels: {'1'}
BitDepth: {'16-bit integer'}
SamplingRate: {'16000'}
ALSA Audio Capture (Simulink) block captures the audio signal from the default audio device on the
Raspberry Pi hardware. You can also enter the name of an audio device such as plughw:2,0 to
capture audio from a device other than the default audio device. Double click on the ALSA Audio
Capture (Simulink) block and set Device name to plughw:2,0. Set the other parameters as shown.
ALSA Audio Capture (Simulink) outputs 16-bit fixed-point audio samples with values in the interval [-2^15, 2^15-1]. You cast the ALSA Audio Capture (Simulink) output to single-precision data and multiply it by 1/2^15 to change the numerical range to [-1, 1]. Note that you change the numerical range because the subsequent blocks expect the audio in the range [-1, 1]. Use the Audio File Read (Simulink) block and a Manual Switch to switch the audio from the microphone to the audio file and back.
model = "slexSpeechCommandRecognitionRaspiExample";
open_system(model)
set_param(model,SystemTargetFile="ert.tlc")
set_param(model,TargetLang="C++")
set_param(model,TargetLangStandard="C++11 (ISO)")
To run your model in External Mode, set Code Interface packaging to Nonreusable function
and check variable-size signals in Code Generation > Interface > Support as shown.
Select a solver that supports code generation. Set Solver to auto (Automatic solver
selection) and Solver type to Fixed-step.
set_param(model,SolverName="FixedStepAuto")
set_param(model,SolverType="Fixed-step")
In Configuration > Hardware Implementation, set Hardware board to Raspberry Pi and enter
your Raspberry Pi credentials in the Board Parameters as shown.
In the same window, set External mode > Communication interface to XCP on TCP/IP as
shown.
Check Signal logging in Data Import/Export to enable signal monitoring in External Mode.
save_system(model);
close_system(model);
• Simulate the “Speech Command Recognition Code Generation with Intel MKL-DNN Using Simulink”
on page 1-857 example in processor-in-the-loop (PIL) mode on Raspberry Pi.
• Use the LED (Simulink) block of the Simulink Support Package for Raspberry Pi Hardware and light it
up for the Go speech command. Use the Deploy pane on the Hardware tab to deploy the standalone
application on Raspberry Pi.
Speech Command Recognition Using Deep Learning
This example shows how to perform speech command recognition on streaming audio. The example
uses a pretrained deep learning model. To learn how the deep learning model was trained, see “Train
Speech Command Recognition Model Using Deep Learning” on page 1-313.
Load the pre-trained network. The network is trained to recognize the following speech commands:
yes, no, up, down, left, right, on, off, stop, and go, and to otherwise classify audio as an unknown
word or as background noise.
load("SpeechCommandRecognitionNetwork.mat")
labels
Load one of the following audio signals: noise, someone saying stop, or someone saying play. The
word stop is recognized by the network as a command. The word play is an unknown word to the
network. Listen to the signal.
audioData = ;
sound(audioData{1},audioData{2})
The pre-trained network takes auditory-based spectrograms as inputs. Use the supporting function
extractAuditorySpectrogram on page 1-909 to extract the spectrogram. Classify the audio
based on its auditory spectrogram.
auditorySpectrogram = extractAuditorySpectrogram(audioData{1},audioData{2});
score = predict(net,auditorySpectrogram);
prediction = scores2label(score,labels,2);
visualizeClassificationPipeline(audioData,net,labels)
The model was trained to classify auditory spectrograms that correspond to 1-second chunks of audio
data. It has no concept of memory between classifications. To adapt this model for streaming
applications, you can add logic to build up decision confidence over time.
Create a 9-second long audio clip of the background noise, the unknown word, and the known
command.
fs = 16e3;
audioPlay = audioread("playCommand.flac");
audioStop = audioread("stopCommand.flac");
audioBackground = 0.02*pinknoise(fs);
audioIn = repmat([audioBackground;audioPlay;audioStop],3,1);
Specify the classification rate in hertz. The classification rate is the number of classifications per
second. Every classification requires 1 second of audio data.
classificationRate = ; % Hz
Specify the time window for decisions. Decisions are made by considering all individual classifications
in a decision time window.
decisionTimeWindow = ; % seconds
Specify thresholds for the decision logic. The frameAgreementThreshold is the percent of frames
within a decisionTimeWindow that must agree to recognize a word. The probabilityThreshold
is the threshold that at least one of the classification probabilities in the decisionTimeWindow must
pass.
frameAgreementThreshold = ; % percent
probabilityThreshold = ;
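The sketch below illustrates one way a decision rule can combine buffered labels and scores with these thresholds. It is a simplified stand-in for the logic inside the detectCommands supporting function, and the labels, buffers, and threshold values shown here are illustrative placeholders.
% Stand-ins for the buffers a streaming loop would maintain (detectCommands
% uses YBuffer and scoreBuffer for this purpose).
labels = categorical(["yes","no","up","down","stop","go","unknown","background"]);
labelBuffer = repmat(categorical("stop"),10,1);
scoreBuffer = rand(numel(labels),10,"single");
frameAgreementThreshold = 50; % placeholder for the value you specified above (percent)
probabilityThreshold = 0.7;   % placeholder for the value you specified above
% The most frequent recent label is accepted if enough frames agree and its
% score exceeded the probability threshold at least once in the window.
[candidate,count] = mode(labelBuffer);
frameAgreement = 100*count/numel(labelBuffer);
maxProbability = max(scoreBuffer(string(labels)==string(candidate),:));
if frameAgreement >= frameAgreementThreshold && maxProbability >= probabilityThreshold
    detectedCommand = candidate;
else
    detectedCommand = categorical("background");
end
disp(detectedCommand)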
Use the supporting function, detectCommands on page 1-911, to simulate streaming command
detection. The function uses your default audio device to play the streaming audio.
detectCommands( ...
Input=audioIn, ...
SampleRate=fs, ...
Network=net, ...
Labels=labels, ...
ClassificationRate=classificationRate, ...
DecisionTimeWindow=decisionTimeWindow, ...
FrameAgreementThreshold=frameAgreementThreshold, ...
ProbabilityThreshold=probabilityThreshold);
You can test the model by performing speech command recognition on data input from your
microphone. In this case, audio is read from your default audio input device. The TimeLimit
parameter controls the duration of the audio recording. You can end the recording early by closing
the scopes.
The network is trained to recognize the following speech commands: yes, no, up, down, left, right, on,
off, stop, and go, and to otherwise classify audio as an unknown word or as background noise.
detectCommands( ...
SampleRate=fs, ...
Network=net, ...
Labels=labels, ...
ClassificationRate= , ...
DecisionTimeWindow= , ...
FrameAgreementThreshold= , ...
ProbabilityThreshold= , ...
TimeLimit= );
Supporting Functions
numBands = 50;
FFTLength = 512;
segmentSamples = round(segmentDuration*designFs);
frameSamples = round(frameDuration*designFs);
hopSamples = round(hopDuration*designFs);
overlapSamples = frameSamples - hopSamples;
% Extract features
features = extract(afe,x);
% Apply logarithm
features = log10(features + 1e-6);
end
function visualizeClassificationPipeline(audioData,net,labels)
%visualizeClassificationPipeline Visualize classification pipeline
%
% visualizeClassificationPipeline(audioData,net,labels) creates a tiled
% layout of the audio data, the extracted auditory spectrogram, and a word
% cloud indicating the relative prediction probability of each class.
function plotAuditorySpectrogram(auditorySpectrogram)
%plotAuditorySpectrogram Plot auditory spectrogram
bins = 1:size(auditorySpectrogram,2);
pcolor(t,bins,auditorySpectrogram')
shading flat
xlabel("Time (s)")
ylabel("Bark (bins)")
end
function plotAudio(audioIn,fs)
%plotAudio Plot audio
t = (0:size(audioIn,1)-1)/fs;
plot(t,audioIn)
xlabel("Time (s)")
ylabel("Amplitude")
grid on
axis tight
end
end
function detectCommands(options)
%detectCommand Detect commands
%
% detectCommand(SampleRate=fs,Network=net,Labels=lbls,ClassificationRate=cr, ...
% DecisionTimeWindow=dtw,FrameAgreementThreshold=fat,ProbabilityThreshold=pt, ...
% Input=audioIn)
% opens a timescope to visualize streaming audio and a dsp.MatrixViewer to
% visualize auditory spectrograms extracted from a simulation of streaming
% audioIn. The scopes display the detected speech command after it has been
% processed by the streaming algorithm. The streaming audio is played to
% your default audio output device.
%
% detectCommand(SampleRate=fs,Network=net,Labels=lbls,ClassificationRate=cr, ...
% DecisionTimeWindow=dtw,FrameAgreementThreshold=fat,ProbabilityThreshold=pt, ...
% TimeLimit=tl)
% opens a timescope to visualize streaming audio and a dsp.MatrixViewer to
% visualize auditory spectrograms extracted from audio streaming from your
% default audio input device. The scopes display the detected speech
% command after it has been processed by the streaming algorithm.
arguments
options.SampleRate
options.Network
options.Labels
options.ClassificationRate
options.DecisionTimeWindow
options.FrameAgreementThreshold
options.ProbabilityThreshold
options.Input = []
options.TimeLimit = inf;
end
if isempty(options.Input)
% Create an audioDeviceReader to read audio from your microphone.
adr = audioDeviceReader(SampleRate=options.SampleRate,SamplesPerFrame=floor(options.SampleRate/options.ClassificationRate));
newSamplesPerUpdate = floor(options.SampleRate/options.ClassificationRate);
% Convert the requested decision time window to the number of analysis frames.
numAnalysisFrame = round((options.DecisionTimeWindow-1)*(options.ClassificationRate) + 1);
% Convert the requested frame agreement threshold in percent to the number of frames that must agree.
countThreshold = round(options.FrameAgreementThreshold/100*numAnalysisFrame);
% Initialize buffers for the classification decisions and scores of the streaming audio.
YBuffer = repmat(categorical("background"),numAnalysisFrame,1);
scoreBuffer = zeros(numel(labels),numAnalysisFrame,"single");
if isempty(options.Input)
% Extract audio samples from the audio device and add the samples to
% the buffer.
audioIn = adr();
write(audioBuffer,audioIn);
end
% Classify the current spectrogram, save the label to the label buffer,
% and save the predicted probabilities to the probability buffer.
score = predict(options.Network,spec);
YPredicted = scores2label(score,labels,2);
YBuffer = [YBuffer(2:end);YPredicted];
scoreBuffer = [scoreBuffer(:,2:end),score(:)];
if ~isempty(options.Input)
% Write the new audio to your audio output device.
adw(ynew);
end
end
release(wavePlotter)
release(specPlotter)
function tf = whileCriteria(loopTimer,timeLimit,wavePlotter,specPlotter,Input,audioBuffer)
if isempty(Input)
tf = toc(loopTimer)<timeLimit && isVisible(wavePlotter) && isVisible(specPlotter);
else
tf = audioBuffer.NumUnreadSamples > 0;
end
end
end
See Also
Related Examples
• “Train Speech Command Recognition Model Using Deep Learning” on page 1-313
• “Accelerate Audio Deep Learning Using GPU-Based Feature Extraction” on page 1-757
• “Accelerate Audio Machine Learning Workflows Using a GPU” on page 1-249
Keyword Spotting in Simulink
This example shows a Simulink® model that identifies a keyword in speech using a pretrained deep
learning model. This model was trained to identify the keyword "yes". To learn about the model
architecture and training, see “Keyword Spotting in Noise Using MFCC and LSTM Networks” on
page 1-481.
Download and unzip the pretrained network and the standardization factors. The standardization
factors are the global mean and standard deviation of the features used to train the model.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio/examples","kwslstm.zip");
dataFolder = tempdir;
netFolder = fullfile(dataFolder,"KeywordSpotting");
unzip(downloadFolder,netFolder)
addpath(netFolder)
Model Description
The deep learning network was trained on mel-frequency cepstral coefficients (MFCC) computed
using an audioFeatureExtractor. The MFCC block in the model has been configured to extract the
same features that the network was trained on.
The MFCC block extracts feature vectors from the audio stream using 512-point analysis windows with
384-point overlap and then applies a buffer to output 16 feature vectors consisting of 39 features
each. Buffering the feature vectors enables vectorized computations in the Stateful Predict
block, which allows the system to keep pace with real time (given a short time delay).
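In samples, each new feature vector advances the analysis by 512 - 384 = 128 samples, so one buffer of 16 feature vectors spans 512 + 15*128 = 2432 samples, or about 0.15 seconds at 16 kHz:
winLength = 512; overlapLength = 384; numVectors = 16; fs = 16e3;
hopLength = winLength - overlapLength;              % 128 samples per feature vector
bufferSpan = winLength + (numVectors-1)*hopLength;  % 2432 samples
bufferSeconds = bufferSpan/fs                       % 0.152 seconds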
After the MFCC block, the features are standardized using precomputed coefficients and then
transposed so that time is along the second dimension.
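In MATLAB terms, the standardization and transpose amount to the following sketch, where M, S, and featureBuffer are stand-ins for the downloaded global mean, the downloaded standard deviation, and one 16-by-39 buffer of feature vectors.
featureBuffer = randn(16,39,"single"); % stand-in for 16 buffered feature vectors
M = zeros(1,39,"single");              % stand-in for the precomputed global mean
S = ones(1,39,"single");               % stand-in for the precomputed standard deviation
standardized = (featureBuffer - M)./S; % standardize each feature
networkInput = standardized.';         % 39-by-16, features-by-time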
The Stateful Predict block outputs a 2-element score vector for each feature vector. The scores
are converted to decisions by picking the index of the maximum score. The decisions are converted to
doubles and then upsampled to create a decision mask the same length as the corresponding audio.
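A minimal sketch of that score-to-mask conversion is shown below; the scores matrix and the number of audio samples represented by each decision are illustrative assumptions, not values taken from the model.
scores = rand(2,16,"single");        % stand-in for the Stateful Predict output
samplesPerDecision = 128;            % assumed hop between feature vectors, in samples
[~,decisions] = max(scores,[],1);    % index of the maximum score per feature vector
isKeyword = double(decisions==2);    % assume class 2 corresponds to the keyword
mask = repelem(isKeyword,1,samplesPerDecision); % sample-length decision mask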
open_system("keywordSpotting.slx")
Run Model
Use the Manual Switch block to select either a live stream from your microphone or a test signal
from an audio file.
sim("keywordSpotting.slx");
Close the model and remove the path to the pretrained network.
close_system("keywordSpotting.slx",0);
rmpath(netFolder)
Audio Transfer Learning Using Experiment Manager
This example shows how to configure an experiment that compares the performance of multiple
pretrained networks in a speech command recognition task using transfer learning. This example
uses the Experiment Manager (Deep Learning Toolbox) app to tune hyperparameters and compare
results between the different pretrained networks by using both built-in and user-defined metrics.
Audio Toolbox™ provides a variety of pretrained networks for audio processing, and each network
has a different architecture that requires different data preprocessing. These differences result in
tradeoffs between the accuracy, speed, and size of the various networks. Experiment Manager
organizes the results of the training experiments to highlight the strengths and weaknesses of each
individual network so you can select the network that best fits your constraints.
The example compares the performance of the YAMNet and VGGish pretrained networks, as well as a
custom network that you train from scratch. See Deep Network Designer (Deep Learning Toolbox) to
explore other pretrained network options supported by Audio Toolbox™.
This example uses the Google Speech Commands [1] data set. Download the data set and the
pretrained networks to your temporary directory. The data set and the two networks require 1.96 GB
and 470 MB of disk space, respectively.
Load the example by clicking the Open Example button. This opens the project in Experiment
Manager in your MATLAB editor.
The Hyperparameters section specifies the strategy and hyperparameter values to use for the
experiment. This example uses the exhaustive sweep strategy. When you run the experiment,
Experiment Manager trains the network using every combination of the hyperparameter values
specified in the hyperparameter table. As this example shows you how to test the different network
types, define one hyperparameter, Network, to represent the network names stored as strings.
The Setup Function field contains the name of the main function that configures the training data,
network architecture, and training options for the experiment. The input to the setup function is a
structure with fields from the hyperparameter table. The setup function returns the training data,
network architecture, and training parameters as outputs. This example uses a predesigned setup
function named compareNetSetup.
The Metrics list enables you to define your own custom metrics to compare across different trials of
the training experiment. Experiment Manager runs each metric in this table against the
networks it trains in each trial. This example defines three custom metrics. To add additional custom
metrics, list them in this table.
In this example, the setup function downloads the data set, selects the desired network, performs the
requisite data preprocessing, and sets the network training options. The input to this function is a
structure with fields for each hyperparameter you define in the Experiment Manager interface. In
the setup function for this example, the input variable is params and the output variables are
trainingData, layers, and options, representing the training data, network structure, and
training parameters, respectively. The key steps of the compareNetSetup setup function are
explained below. Open the example in MATLAB to see the full definition of the function.
To speed up the example, open compareNetSetup and set the speedUp flag to true. This reduces
the size of the data set to quickly test the basic functionality of the experiment.
speedUp = true;
The helper function setupDatastores downloads the Google Speech Commands [1] data set,
selects the commands for networks to recognize, and randomly partitions the data into training and
validation datastores.
[adsTrain,adsValidation] = setupDatastores(speedUp);
Transform the datastores based on the preprocessing required by each network you define in the
hyperparameter table, which you can access using params.Network. The helper function
extractSpectrogram processes the input data to the format required by each network. The helper
function getLayers returns a layerGraph (Deep Learning Toolbox) object that represents the
architecture of the network.
tdsTrain = transform(adsTrain,@(x)extractSpectrogram(x,params.Network));
tdsValidation = transform(adsValidation,@(x)extractSpectrogram(x,params.Network));
layers = getLayers(classes,classWeights,numClasses,netName);
Now that you have set up the datastores, read the data into the trainingData and
validationData variables.
trainingData = readall(tdsTrain,UseParallel=canUseParallelPool);
validationData = readall(tdsValidation,UseParallel=canUseParallelPool);
validationData = table(validationData(:,1),adsValidation.Labels);
trainingData = table(trainingData(:,1),adsTrain.Labels);
Set the training parameters by assigning a trainingOptions (Deep Learning Toolbox) object to the
options output variable. Train the networks for a maximum of 30 epochs with a patience of 8 epochs
using the Adam optimizer. Set the ExecutionEnvironment field to "auto" to use a GPU if
available. Training can be time consuming if you do not use a GPU.
maxEpochs = 30;
miniBatchSize = 256;
validationFrequency = floor(numel(TTrain)/miniBatchSize);
options = trainingOptions("adam", ...
GradientDecayFactor=0.7, ...
InitialLearnRate=params.LearnRate, ...
MaxEpochs=maxEpochs, ...
MiniBatchSize=miniBatchSize, ...
Shuffle="every-epoch", ...
Plots="training-progress", ...
Verbose=false, ...
ValidationData=validationData, ...
ValidationFrequency=validationFrequency, ...
ValidationPatience=10, ...
LearnRateSchedule="piecewise", ...
LearnRateDropFactor=0.2, ...
LearnRateDropPeriod=round(maxEpochs/3), ...
ExecutionEnvironment="auto");
Experiment Manager enables you to define custom metric functions to evaluate the performance of
the networks it trains in each trial. It computes basic metrics such as accuracy and loss by default. In
this example, you also compare the size of each model, because memory usage is an important metric
when you deploy deep neural networks to real-world applications.
Custom metric functions must take one input argument trialInfo, which is a structure containing
the fields trainedNetwork, trainingInfo, and parameters.
The metric functions must return a scalar number, logical output, or string which the Experiment
Manager displays in the results table. The example uses these custom metric functions:
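For instance, a custom metric that reports the approximate memory footprint of the trained network could be implemented as in the following sketch; the sizeMB metric shipped with the project may be implemented differently.
function metricOutput = sizeMB(trialInfo)
% sizeMB Approximate size of the trained network in megabytes.
net = trialInfo.trainedNetwork; %#ok<NASGU> Queried through whos below.
netInfo = whos("net");
metricOutput = netInfo.bytes/1e6;
end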
Run Experiment
Click Run on the Experiment Manager toolstrip to run the experiment. You can select to run each
trial sequentially, simultaneously, or in batches by using the mode option. For this experiment, set
Mode to Sequential.
Evaluate Results
Experiment Manager displays the results in a table once it finishes running the experiment. The
progress bar shows how many epochs each network trained for before violating the patience
parameter in terms of the percentage of MaxEpochs.
You can sort the table by each column by pointing to the column name and clicking the arrow that
appears. Click the table icon on the top right corner to select which columns to show or hide. To first
compare the networks by accuracy, sort the table over the Validation Accuracy column in
descending order.
In terms of accuracy, the Yamnet network performs the best, followed by the VGGish network, and then
the custom network. However, sorting by the Elapsed Time column shows that Yamnet takes the
longest to train. To compare the sizes of these networks, sort the table by the sizeMB column.
The custom network is the smallest, Yamnet is a few orders of magnitude larger, and VGGish is the
largest.
These results highlight the tradeoffs between the different network designs. The Yamnet network
performs the best at the classification task at the cost of more training time and moderately large
memory consumption. The VGGish network performs slightly worse in terms of accuracy and
requires over 20 times more memory than YAMNet. Lastly, the custom network has the worst
accuracy by a small margin, but the network also uses the least memory.
Even though Yamnet and VGGish are pretrained networks, the custom network converges the
fastest in terms of training time. Looking at the NumIters column, the custom network takes the most
batch iterations to converge because it is learning from scratch. However, because the custom network
is much smaller and shallower than the deep pretrained models, each of these batch updates is
processed much faster, which reduces the overall training time.
To save one of the trained networks from any of the trials, right-click the corresponding row in the
results table and select Export Trained Network.
To further analyze a trial, click the corresponding row, and under the Review Results tab in the
app toolstrip, choose to plot the training progress or a confusion matrix of the trained model. This
diagram shows the confusion matrix for the Yamnet model from trial 2 of the experiment.
The model struggles most at differentiating between the "off" and "up" commands and between the
"no" and "go" commands, although the accuracy is generally uniform across all classes. Further, the
model successfully predicts the "yes" command, as the false positive rate for that class is only 0.4%.
References
[1] Warden P. "Speech Commands: A public dataset for single-word speech recognition", 2017.
Available from https://storage.googleapis.com/download.tensorflow.org/data/
speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed
under the Creative Commons Attribution 4.0 license, available here: https://creativecommons.org/
licenses/by/4.0/legalcode.
Model Smart Speaker in Simulink
This example shows how to model a smart speaker system in Simulink. The smart speaker
incorporates voice command recognition and operates in real time.
Introduction
A smart speaker is a speaker that can be controlled by your voice. This example shows a smart
speaker model that responds to a number of voice commands. You make the smart speaker play
music with the command "Go". You make it stop playing music by saying "Stop". You increase or
decrease the music volume with the commands "Up" and "Down", respectively.
Model Summary
open_system("audioSmartSpeaker");
1 You can speak commands directly into a microphone connected to your machine while the model is
running. Set up your microphone through the dialog of the Audio Device Reader block.
2 You can also simulate the reception of signals at a microphone array. In this case, the source of
the voice commands is a set of audio files containing prerecorded commands.
Select the voice command source by toggling the manual switch in the Audio Input Path section of the
model.
Acoustic Beamforming
You apply acoustic beamforming when you simulate a microphone array. In this case, you model three
sound sources in the Voice Command subsystem (the voice command, plus two background noise
sources). The Acoustic Beamformer subsystem processes the different sound signals to isolate and
enhance the voice command.
When you utter commands as music is playing, the music is picked up by the microphone after it
reverberates around the room, creating an undesired feedback effect.
The Acoustic Echo Cancellation (AEC) subsystem removes the playback audio from the input signal
by using a Normalized LMS adaptive filter. This component applies only when you simulate a
microphone array using acoustic beamforming.
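The MATLAB sketch below shows the same idea using a dsp.LMSFilter object configured for normalized LMS; the filter length, step size, simulated room path, and signals are illustrative assumptions rather than the settings used in the AEC subsystem.
fs = 16e3;
playback = randn(fs,1);                             % stand-in for the music playback signal
roomPath = dsp.FIRFilter(Numerator=0.5.^(0:63));    % crude stand-in for the room echo path
micSignal = roomPath(playback) + 0.01*randn(fs,1);  % echo plus near-end noise
aec = dsp.LMSFilter(Length=128,Method="Normalized LMS",StepSize=0.5);
[~,echoRemoved] = aec(playback,micSignal);          % error signal approximates the echo-free input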
You can include or exclude the AEC component by changing the value of the check box on its mask
dialog.
To hear the effect of AEC on the input audio signal, flip the manual switch in the Audio Devices and
Scopes section of the model.
You pass the preprocessed speech command to the Speech Command Recognition subsystem. Speech
command recognition is based on a pretrained deep learning convolutional network identical to the
one in the “Train Speech Command Recognition Model Using Deep Learning” on page 1-313
example.
You extract auditory (Bark) spectrograms from the input audio, which you feed to the pretrained
network. The network outputs the predicted command. You use this command to drive the Audio
Output Path section of the model.
The decoded speech command goes into two different state charts:
1 The first chart controls playback. Music starts playing when the "Go" command is received, and
stops playing when "Stop" is received.
2 The second chart controls the playback volume by reacting to the commands "Up" and "Down".
When playback is triggered, the Speaker Output Processing subsystem is enabled. This subsystem
contains blocks commonly used to tune audio, such as a Graphic EQ, a multiband parametric
equalizer, and a dynamic range controller (limiter). You can tune your system sound as the model is
playing by opening the mask dialog of these blocks and changing the values of parameters (for
example, the frequencies of the Graphic EQ).
Smoothed Mute
When the music stops playing, it fades out smoothly rather than stopping abruptly. This is achieved by
the Smoothed Mute block, which applies a time-varying gain to the audio signal. This block is based
on the SmoothedMute System object.
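To picture the effect, the sketch below smooths a step change in gain with a one-pole filter before applying it to the audio; the smoothing coefficient and signals are illustrative and are not taken from the SmoothedMute System object.
fs = 44.1e3;
audioIn = 0.5*sin(2*pi*440*(0:fs-1).'/fs);     % one second of a test tone
targetGain = [ones(fs/2,1); zeros(fs/2,1)];    % mute requested at 0.5 seconds
alpha = 0.9995;                                % smoothing coefficient
smoothedGain = filter(1-alpha,[1 -alpha],targetGain,alpha); % start from unity gain
audioOut = audioIn.*smoothedGain;              % fades out instead of cutting abruptly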
Speaker Modeling
After Speaker Output Processing and Smoothed Mute, the signal goes into a Speaker Model
subsystem. This subsystem allows you to control how the loudspeaker is modeled:
1 You can choose a behavioral model, which implements the speaker model using basic
mathematical Simulink blocks (such as sum, delay, integrator, and gain).
2 You can choose a circuit model, which implements the speaker model using Simscape
components.
3 You can also bypass these models if you are using a real, physical loudspeaker to listen to the
audio.
Change the value of the variable speakerMode in the base workspace to select one of bypass
(speakerMode=0), behavioral (speakerMode=1), or circuit (speakerMode=2).
The model uses a Spectrum Analyzer block to plot the audio signal in the frequency domain, and a
time scope to visualize the streaming time-domain audio.
Train 3-D Speech Enhancement Network Using Deep Learning
In this example, you train a filter and sum network (FaSNet) [1] on page 1-939 to perform speech
enhancement (SE) using ambisonic data. The model has been updated to use stacked dual-path
recurrent neural networks (DPRNNs) which enable memory-efficient joint modeling of short- and
long-term sequences [4] on page 1-940. To explore the model trained in this example, see “3-D
Speech Enhancement Using Trained Filter and Sum Network” on page 1-948.
Introduction
The aim of speech enhancement (SE) is to suppress the noise in a noisy speech signal. The SE system
may be used as a front end in teleconferencing systems, where intelligibility and listening experience
are important metrics, or in a speech-to-text system, where the word error rate of the downstream
speech-to-text system is the important metric.
In this example, you use the L3DAS 2021 Task 1 dataset [2] on page 1-940 to train and evaluate a
model that uses B-format ambisonic data to perform speech enhancement. The enhanced speech is
output as a mono audio signal. To explore the model trained in this example, see “3-D Speech
Enhancement Using Trained Filter and Sum Network” on page 1-948.
To train the network with the entire data set, set speedupExample to false. To run this example
quickly, set speedupExample to true. This network requires a large amount of data to achieve
reasonable results.
speedupExample = ;
This example uses the L3DAS21 task 1 challenge data set [2] on page 1-940. The training data set
contains multiple-source, multiple-perspective (MSMP) B-format ambisonic recordings captured with
two microphones at a sampling rate of 16 kHz. The two microphones are labeled "A" and "B". In this
example, you discard the recordings from microphone B. Including the microphone B data in the
training should improve the final performance. The train and validation splits are provided with the
data set. The 3-D speech
enhancement data set contains more than 30,000 virtual 3-D audio environments with a duration up
to 10 seconds. Each sample contains a spoken voice and other office-like background noises. The
target data is the clean monophonic voice signal. The dev dataset is 2.6 GB, the train100 dataset is
7.6 GB, and the train360 dataset is 28.6 GB.
downloadLocation = tempdir;
datasetLocationDev = fullfile(downloadLocation,"L3DAS_Task1_dev");
datasetLocationTrain100 = fullfile(downloadLocation,"L3DAS_Task1_train100");
datasetLocationTrain360 = fullfile(downloadLocation,"L3DAS_Task1_train360");
if speedupExample
if ~datasetExists(datasetLocationDev)
urlDev = "https://zenodo.org/record/4642005/files/L3DAS_Task1_dev.zip";
unzip(urlDev,downloadLocation)
end
ads = audioDatastore(fullfile(downloadLocation,"L3DAS_Task1_dev"),IncludeSubfolders=true);
else
if ~datasetExists(datasetLocationDev)
urlDev = "https://zenodo.org/record/4642005/files/L3DAS_Task1_dev.zip";
unzip(urlDev,downloadLocation)
end
if ~datasetExists(datasetLocationTrain100)
urlTrain100 = "https://zenodo.org/record/4642005/files/L3DAS_Task1_train100.zip";
unzip(urlTrain100,downloadLocation)
end
if ~datasetExists(datasetLocationTrain360)
urlTrain360 = "https://zenodo.org/record/4642005/files/L3DAS_Task1_train360.zip";
unzip(urlTrain360,downloadLocation)
end
adsValidation = audioDatastore(fullfile(downloadLocation,"L3DAS_Task1_dev"),IncludeSubfolders=true);
adsTrain = audioDatastore([fullfile(downloadLocation,"L3DAS_Task1_train100"), ...
fullfile(downloadLocation,"L3DAS_Task1_train360")],IncludeSubfolders=true);
end
To subset the datastores into targets and predictors, use subset. Only use microphone A predictors.
Using both microphones should increase model performance at the cost of more training time.
if speedupExample
[~,fileNames] = fileparts(ads.Files);
targetFiles = ~endsWith(fileNames,["A","B"]);
micAFiles = endsWith(fileNames,"A");
T = subset(ads,targetFiles);
X = subset(ads,micAFiles);
XTrain = subset(X,1:40);
TTrain = subset(T,1:40);
XValidation = subset(X,41:50);
TValidation = subset(T,41:50);
else
[~,fileNames] = fileparts(adsTrain.Files);
targetFiles = ~endsWith(fileNames,["A","B"]);
micAFiles = endsWith(fileNames,"A");
TTrain = subset(adsTrain,targetFiles);
XTrain = subset(adsTrain,micAFiles);
[~,fileNames] = fileparts(adsValidation.Files);
targetFiles = ~endsWith(fileNames,["A","B"]);
micAFiles = endsWith(fileNames,"A");
TValidation = subset(adsValidation,targetFiles);
XValidation = subset(adsValidation,micAFiles);
end
Remove any files that do not overlap between targets and predictors.
[~,hFiles] = fileparts(TTrain.Files);
[~,kFiles] = fileparts(XTrain.Files);
kFiles = erase(kFiles,"_A");
validFiles = intersect(kFiles,hFiles);
targetValidFiles = ismember(validFiles,kFiles);
predictorsValidFiles = ismember(kFiles,validFiles);
TTrain = subset(TTrain,targetValidFiles);
XTrain = subset(XTrain,predictorsValidFiles);
[~,hFiles] = fileparts(TValidation.Files);
[~,kFiles] = fileparts(XValidation.Files);
kFiles = erase(kFiles,"_A");
validFiles = intersect(kFiles,hFiles);
targetValidFiles = ismember(validFiles,kFiles);
predictorsValidFiles = ismember(kFiles,validFiles);
TValidation = subset(TValidation,targetValidFiles);
XValidation = subset(XValidation,predictorsValidFiles);
To combine the predictor and target datastores so that reading from the combined datastore returns
the predictors and associated target, use combine.
dsTrain = combine(XTrain,TTrain);
dsValidation = combine(XValidation,TValidation);
Inspect Data
predictor = preview(XTrain);
target = preview(TTrain);
tiledlayout(2,1,TileSpacing="tight")
nexttile
plot(t,target)
title("Target")
xlabel("Time (s)")
axis tight
nexttile
plot(t,predictor)
title("Predictor")
xlabel("Time (s)")
legend(["W","X","Y","Z"])
axis tight
Listen to the target data, the mean of the ambisonic channels, or one of the ambisonic channels
individually.
soundSource = ;
soundsc(soundSource,fs)
Choosing an appropriate metric to evaluate a SE system performance depends on the final task of the
system. For speech-to-text applications, evaluating the word error rate (WER) using the target
speech-to-text system is a common approach. For teleconferencing applications, the short-time
objective intelligibility measure (STOI) is a common approach. Similarly, the choice of loss function
should depend on the final application of the speech enhancement system. In this example, you
attempt to optimize the system to reduce WER for a downstream speech-to-text system. One option
for the loss function is to use the WER directly, however this can be prohibitively time-consuming for
training, and couples the speech enhancement module tightly with the speech-to-text module.
Another approach is to use an auditory-based representation of the targets and predictors and
calculate the mean square error between them. This example takes the second approach. To get a
baseline for performance analysis, calculate the WER of the target (clean) signal, and the noisy signal
using a naive approach to SE (mean over channels). The supporting function, wordErrorRate on
page 1-940, uses the wav2vec2.0 option of the speech2text functionality. If you have not
downloaded the pretrained wav2vec 2.0 model, the function throws an error with a link to the
download. The WER is calculated using Text Analytics Toolbox™.
WERa = wordErrorRate(dsWER,TargetWER=true,BaselineWER=true);
progress = 1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32
WERa.Target
ans = 0.0296
WERa.Baseline
ans = 0.4001
This example uses the filter and sum network (FaSNet) architecture with dual-path recurrent neural
networks (DPRNN). FaSNet is a time-domain adaptive beamforming framework consisting of two
stages:
1 Estimate the beamforming filter for selected reference channel, and then denoise the reference
signal.
2 Beamform remaining channels using the denoised reference channel.
The FaSNet using DPRNN architecture is implemented in the supporting function FaSNet, which is
in the current folder when you open this example.
In stage one, a normalized cross correlation (NCC) metric is computed between the windows of the
reference channel with context and windows of the remaining channels. This example uses cosine
similarity as the correlation metric. The metric is pooled across the channels, passed through a
temporal convolutional network (TCN), and then through the beamforming filter learner. The output
from the beamformer module blocks is then used to filter the reference channel.
In stage two, a NCC metric is computed between the denoised windows of the reference channel and
windows of the remaining channels with context. A beamforming filter is learned for each of the
remaining channels. Each channel is separately denoised, and then the channels are summed to
create the beamformed final signal.
Beamformer
The beamformer module follows the design of [1] on page 1-939 except replaces the stacked TCN
blocks with stacked DPRNN blocks.
Dual-path recurrent neural networks (DPRNN) were introduced in [4] on page 1-940 as a method of
organizing RNN layers in a deep structure to model extremely long sequences. DPRNN splits
sequential input into chunks and then applies intra- and inter-chunk operations iteratively. The
approach has been shown to perform as well or better than 1-D CNN architectures with a
significantly smaller model size. The DPRNN model consists of three stages: segmentation, DPRNN
blocks (which may be stacked), and then overlap-add reconstruction.
Segmentation
The sequence is split into S segments of length K with overlap P. In this example, K = 2P.
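As a toy illustration of this segmentation step (using random data and the buffer function from Signal Processing Toolbox, not actual network features):
K = 24;                     % segment length (the SegmentSize parameter, 2P)
P = K/2;                    % overlap between segments
x = randn(400,1);           % stand-in for one channel of the encoded sequence
segments = buffer(x,K,P);   % K-by-S matrix of overlapping segments, zero-padded at the edges
S = size(segments,2);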
DPRNN Block
The segmented signal passes through B DPRNN blocks. In this example, B is set to 6. Each block
contains two sub-modules corresponding to intra- and inter-chunk processing. The intra-chunk RNN
is always bi-directional. The intra-chunk RNN processes each segment individually. The inter-chunk
RNN may be uni- or bi-directional, depending on latency requirements of your system. In this
example, the inter-chunk RNN is bi-directional. The inter-chunk RNN processes along the stacked
dimension of length S. The output of each DPRNN block is the same size as the input.
Overlap-Add
The output from the stacked DPRNN blocks is overlapped and added to reconstruct the sequence
data.
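Continuing the toy segmentation sketch above, overlap-add reconstruction amounts to summing the processed segments back in at their original offsets:
K = 24; P = K/2;
segments = buffer(randn(400,1),K,P);  % as in the segmentation sketch
S = size(segments,2);
y = zeros((S-1)*P + K,1);
for s = 1:S
    idx = (s-1)*P + (1:K);
    y(idx) = y(idx) + segments(:,s);  % add the segment back at its offset
end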
Define Parameters
% FaSNet-level parameters
parameters.WindowLength = 256; % L in FaSNet
parameters.EncoderDimension = 64; % Number filters in TCN
parameters.NumDPRNNBlocks = 6; % Number of stacked DPRNN blocks
% DPRNN-level parameters
parameters.FeatureDimension = 64; % Number of filters in convolutional blocks
parameters.SegmentSize = 24; % 2P
parameters.HiddenDimension = 128; % RNN size
Use the supporting function, initializeLearnables on page 1-946, to initialize the FaSNet
architecture for the specified parameters.
learnables = initializeLearnables(parameters);
Input Pipeline
Define the mini-batch size. Create minibatchqueue (Deep Learning Toolbox) objects to read mini-
batches from the training data set. The supporting function preprocessMiniBatch on page 1-943
randomly selects a single clip of the specified parameters.AnalysisLength on page 1-934 from
each audio file in the mini-batch. This approach avoids the need to buffer and save individual audio
files, which reduces disk space requirements. The approach has the added benefit of changing the
exact sequences seen between epochs. However, this approach puts more emphasis on shorter files in
the training data.
miniBatchSize = ;
Training Options
Choose the loss function used to train the network:
• auditory-mse: Use the mean-square-error (MSE) between a mel spectrogram computed from
the target and a mel spectrogram computed from the prediction.
• sample-mse: Use the sample-level MSE between the target and predictor.
• sample-sisdr: Use the sample-level scale-invariant signal-to-distortion ratio defined in [3] on
page 1-940.
lossType = ;
Define the maximum number of epochs, the initial learn rate, and piece-wise learning parameters
such as validation patience, learn rate drop factor, and minimum learn rate. The default settings
correspond to those reported in [4] on page 1-940 for the task of speaker separation.
maxEpochs = ;
initialLearnRate = ;
validationPatience = ;
learnRateDropFactor = ;
learnRateDropPeriod = ;
if speedupExample
maxEpochs = 1;
end
iteration = 0;
bestLoss = inf;
averageGrad = [];
averageSqGrad = [];
learnRate = initialLearnRate;
Train Network
Create a trainingProgressMonitor to monitor the training loss and validation loss while training.
validationLoss = mbqLoss(mbqValidation,learnables,parameters,lossType);
recordMetrics(monitor,0,ValidationLoss=validationLoss)
shuffle(mbqTrain)
while hasdata(mbqTrain)
iteration = iteration + 1;
% Read a mini-batch of predictors and targets.
[X,T] = next(mbqTrain);
% Pass the predictors through the network and return the loss and
% gradients.
[loss,gradients] = dlfeval(@modelLoss,learnables,parameters,X,T,lossType);
% Update the network parameters using the Adam optimizer.
[learnables,averageGrad,averageSqGrad] = adamupdate(learnables,gradients, ...
averageGrad,averageSqGrad,iteration,learnRate);
if monitor.Stop
break
end
end
if monitor.Stop
break
end
% Checkpoint
if validationLoss < bestLoss
bestLoss = validationLoss;
bestLossEpoch = epoch;
save("CheckPoint.mat","bestLoss","learnables","epoch", ...
"averageGrad","averageSqGrad","iteration","learnRate")
end
Evaluate System
load("CheckPoint.mat")
Compare the results of the baseline speech enhancement approach against the FaSNet approach
using listening tests and common metrics.
dsValidation = shuffle(dsValidation);
[x,t] = read(dsValidation);
predictor = x{1};
target = x{2};
As a baseline speech enhancement system, simply take the mean of the predictors across the
channels.
yBaseline = mean(predictor,2);
Pass the noisy speech through the network. The network was trained to process data in 2-second
segments. The architecture does accept longer and shorter segments, but performs best on inputs of
the same size as it was trained on. Use the preprocessSignal on page 1-943 supporting function
to split the audio input into the same segment length as your model was trained on. Pass the
segments through the FaSNet model. Treat each segment individually by placing the segment
dimension along the third dimension, which the FaSNet model recognizes as the batch dimension.
y = preprocessSignal(predictor,parameters.AnalysisLength);
y = FaSNet(dlarray(y),parameters,learnables);
Listen to the clean, baseline speech enhanced, and FaSNet speech enhanced signals.
dur = size(target,1)/fs;
soundsc(target,fs),pause(dur+1)
soundsc(yBaseline,fs),pause(dur+1)
soundsc(y,fs),pause(dur+1)
Compute the baseline and FaSNet sample MSE, auditory-based MSE, and SISDR. Another common
metric not implemented in this example is short-time objective intelligibility (STOI) [5] on page 1-
940, which is often used both as a training loss function and for system evaluation.
yBaselineMSE = 2*mse(yBaseline,target,DataFormat="TB")/size(target,1);
yMSE = 2*mse(y,target,DataFormat="TB")/size(target,1);
yABaseline = extractdata(dlmelspectrogram(yBaseline,parameters.SampleRate));
yA = extractdata(dlmelspectrogram(y,parameters.SampleRate));
targetA = extractdata(dlmelspectrogram(target,parameters.SampleRate));
yBaselineAMSE = mse(yABaseline,targetA,DataFormat="CTB")/(size(targetA,1)*size(targetA,2));
yAMSE = mse(yA,targetA,DataFormat="CTB")/(size(targetA,1)*size(targetA,2));
yBaselineSISDR = sisdr(yBaseline,target);
ySISDR = sisdr(y,target);
Plot the target signal, the baseline SE result, and the FaSNet SE result. Display performance metrics
in the plot titles.
tiledlayout(3,1)
nexttile
plot(yBaseline)
title("Baseline:"+" MSE="+yBaselineMSE+" Auditory MSE="+yBaselineAMSE+" SISDR="+yBaselineSISDR)
grid on
axis tight
nexttile
plot(y)
title("FaSNet: "+" MSE="+yMSE+" Auditory MSE="+yAMSE+" SISDR="+ySISDR)
grid on
axis tight
nexttile
plot(target)
grid on
title("Target")
axis tight
Evaluate the word error rate after FaSNet processing and compare to the target (clean) signal and
the baseline approach.
WER = wordErrorRate(dsWER,parameters,learnables,FaSNetWER=true);
progress = 1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32
WERa.Baseline
ans = 0.4001
WER.FaSNet
ans = 0.2760
WERa.Target
ans = 0.0296
References
[1] Luo, Yi, Cong Han, Nima Mesgarani, Enea Ceolini, and Shih-Chii Liu. "FaSNet: Low-Latency
Adaptive Beamforming for Multi-Microphone Audio Processing." In 2019 IEEE Automatic Speech
Recognition and Understanding Workshop (ASRU), 260–67. SG, Singapore: IEEE, 2019. https://
doi.org/10.1109/ASRU46091.2019.9003849.
[2] Guizzo, Eric, Riccardo F. Gramaccioni, Saeid Jamili, Christian Marinoni, Edoardo Massaro, Claudia
Medaglia, Giuseppe Nachira, et al. "L3DAS21 Challenge: Machine Learning for 3D Audio Signal
Processing." In 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing
(MLSP), 1–6. Gold Coast, Australia: IEEE, 2021. https://doi.org/10.1109/MLSP52302.2021.9596248.
[3] Roux, Jonathan Le, et al. "SDR – Half-Baked or Well Done?" ICASSP 2019 - 2019 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 626–
30. DOI.org (Crossref), https://doi.org/10.1109/ICASSP.2019.8683855.
[4] Luo, Yi, et al. "Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-
Channel Speech Separation." ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 46–50. DOI.org (Crossref), https://doi.org/
10.1109/ICASSP40776.2020.9054266.
[5] Taal, Cees H., Richard C. Hendriks, Richard Heusdens, and Jesper Jensen. "An Algorithm for
Intelligibility Prediction of Time–Frequency Weighted Noisy Speech." IEEE Transactions on Audio,
Speech, and Language Processing 19, no. 7 (September 2011): 2125–36. https://doi.org/10.1109/
TASL.2011.2114881.
Supporting Functions
arguments
ds
parameters = [];
learnables = [];
nvargs.TargetWER = false;
nvargs.BaselineWER = false;
nvargs.FaSNetWER = false;
nvargs.Verbose = true;
end
% Initialize counters
editDistanceTotal_t = 0;
editDistanceTotal_b = 0;
editDistanceTotal_y = 0;
numWordsTotal = 0;
p = 0;
% Print status
if nvargs.Verbose && (100*progress(ds))>p+1
p = round(100*progress(ds));
fprintf(string(p)+".")
end
end
fprintf("...complete.\n")
end
Model Loss
gradients = dlgradient(loss,learnables);
end
end
for ii = 1:numel(Xcell)
[Xcell{ii},idx] = preprocessSignalTrain(Xcell{ii},Samples=N);
Tcell{ii} = preprocessSignalTrain(Tcell{ii},Samples=N,Index=idx);
end
X = cat(3,Xcell{:});
T = cat(2,Tcell{:});
end
function y = preprocessSignal(x,L)
%preprocessSignal Preprocess signal for FaSNet
% y = preprocessSignal(x,L) splits the multi-channel
% signal x into analysis frames of length L and hop L. The output is a
% L-by-size(x,2)-by-numHop array, where the number of hops depends on the
% input signal length and L.
N = size(x,1);
nchan = size(x,2);
% Pad as necessary.
if N<L
numToPad = L-N;
x = cat(1,x,zeros(numToPad,nchan,like=x));
else
numHops = floor((N-L)/L) + 1;
numSamplesUsed = L+(L*(numHops-1));
if numSamplesUsed < N
numSamplesUnused = N-numSamplesUsed;
numToPad = L - numSamplesUnused;
x = cat(1,x,zeros(numToPad,nchan,like=x));
end
end
% Reshape the padded signal into L-sample frames with the frame (hop)
% dimension third, matching the output size described in the help text.
numHops = size(x,1)/L;
y = permute(reshape(x,L,numHops,nchan),[1,3,2]);
end
function y = dlmelspectrogram(x,fs)
%dlmelspectrogram Mel spectrogram compatible with dlarray
% y = dlmelspectrogram(x,fs) computes a mel spectrogram from the audio
% input.
% Power spectrum
y = S.*conj(S);
% Apply log10.
y = log(y+eps)/log(10);
end
y = y - mean(y,1);
t = t - mean(t,1);
alpha = sum(y.*t,1)./(sum(t.^2,1) + eps); % optimal scaling of the target (SI-SDR definition)
etarget = alpha.*t;
eres = y - etarget;
top = sum(etarget.^2);
bottom = sum(eres.^2);
metric = 10*log(top./(bottom+eps))/log(10);
end
arguments
x
options.Samples = 32000
options.Index = []
end
numSamples = size(x,1);
numChannels = size(x,2);
% Choose a random starting index in the signal, then clip a segment out of
% the signal.
if isempty(options.Index)
idx = randi(numSamples-options.Samples+1);
else
idx = options.Index;
end
y = x(idx:idx+options.Samples-1,:);
end
numMiniBatch = 0;
validationLoss = 0;
reset(mbq)
while hasdata(mbq)
[X,T] = next(mbq);
numMiniBatch = numMiniBatch + 1;
validationLoss = validationLoss + modelLoss(learnables,parameters,X,T,lossType);
end
loss = validationLoss/numMiniBatch;
end
validateattributes(parameters.SegmentSize,["single","double"],["even","positive"],"intializeLearn
validateattributes(parameters.WindowLength,["single","double"],["even","positive"],"initialzieLea
filterDimension = 2*parameters.WindowLength+1;
learnables.TCN.conv.weight = dlarray(permute(initializeGlorot(1,parameters.EncoderDimension,3*par
learnables.TCN.norm.offset = dlarray(zeros(parameters.EncoderDimension,1,"single"));
learnables.TCN.norm.scaleFactor = dlarray(ones(parameters.EncoderDimension,1,"single"));
learnables.("Beamformer"+jj).BN.conv.weight = dlarray(squeeze(initializeGlorot(1,parameters.F
learnables.("Beamformer"+jj).Output.prelu.alpha = dlarray(0.25);
learnables.("Beamformer"+jj).Output.conv.weight = dlarray(initializeGlorot(parameters.Feature
learnables.("Beamformer"+jj).Output.conv.bias = dlarray(initializeZeros([1,parameters.Feature
learnables.("Beamformer"+jj).GenerateFilter.X1.weight = dlarray(permute(initializeGlorot(para
learnables.("Beamformer"+jj).GenerateFilter.X1.bias = dlarray(initializeZeros([1,filterDimens
learnables.("Beamformer"+jj).GenerateFilter.X2.weight = dlarray(permute(initializeGlorot(para
learnables.("Beamformer"+jj).GenerateFilter.X2.bias = dlarray(initializeZeros([1,filterDimens
end
function weights = initializeGlorot(filterSize,numChannels,numFilters)
sz = [filterSize,numChannels,numFilters];
numOut = prod(filterSize)*numFilters;
numIn = prod(filterSize)*numChannels;
Z = 2*rand(sz,"single") - 1;
bound = sqrt(6/(numIn + numOut));
weights = bound*Z;
weights = dlarray(weights);
end
function parameter = initializeOrthogonal(numHiddenUnits)
sz = [4*numHiddenUnits,numHiddenUnits];
Z = randn(sz,"single");
[Q,R] = qr(Z,0);
D = diag(R);
Q = Q * diag(D./abs(D));
parameter = dlarray(Q);
end
function bias = initializeUnitForgetGate(numHiddenUnits)
bias = zeros(4*numHiddenUnits,1,"single");
idx = numHiddenUnits+1:2*numHiddenUnits;
bias(idx) = 1;
bias = dlarray(bias);
end
function parameter = initializeZeros(sz)
parameter = zeros(sz,"single");
parameter = dlarray(parameter);
end
function parameter = initializeOnes(sz)
parameter = ones(sz,"single");
parameter = dlarray(parameter);
end
end
3-D Speech Enhancement Using Trained Filter and Sum Network
In this example, you perform speech enhancement using a pretrained deep learning model. For
details about the model and how it was trained, see “Train 3-D Speech Enhancement Network Using
Deep Learning” on page 1-926. The speech enhancement model is an end-to-end deep beamformer
that takes B-format ambisonic audio recordings and outputs enhanced mono speech signals.
Download the pretrained speech enhancement (SE) network, ambisonic test files, and labels. The
model architecture is based on [1] on page 1-952 and [4] on page 1-952, as implemented in the
baseline system for the L3DAS21 challenge task 1 [2] on page 1-952. The data the model was
trained on and the ambisonic test files are provided as part of [2] on page 1-952.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","speechEnhancement/FaSNet.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
netFolder = fullfile(dataFolder,"speechEnhancement");
addpath(netFolder)
[cleanSpeech,fs] = audioread("cleanSpeech.wav");
soundsc(cleanSpeech,fs)
In the L3DAS21 challenge, "clean" speech files were taken from the LibriSpeech dataset and
augmented to obtain synthetic tridimensional acoustic scenes containing a randomly placed speaker
and other sound sources typical of background noise in an office environment. The data is encoded as
B-format ambisonics. Load the ambisonic data. First order B-format ambisonic channels correspond
to the sound pressure captured by an omnidirectional microphone (W) and sound pressure gradients
X, Y, and Z that correspond to front/back, left/right, and up/down captured by figure-of-eight capsules
oriented along the three spatial axes.
[ambisonicData,fs] = audioread("ambisonicRecording.wav");
channel = ;
soundsc(ambisonicData(:,channel),fs)
To plot the clean speech and the noisy ambisonic data, use the supporting function compareAudio
on page 1-952.
compareAudio(cleanSpeech,ambisonicData,SampleRate=fs)
To visualize the spectrograms of the clean speech and the noisy ambisonic data, use the supporting
function compareSpectrograms on page 1-954.
compareSpectrograms(cleanSpeech,ambisonicData)
compareSpectrograms(cleanSpeech,ambisonicData,Warp="mel")
Use the supporting object, seModel, to perform speech enhancement. The seModel class definition
is in the current folder when you open this example. The object encapsulates the SE model developed
in “Train 3-D Speech Enhancement Network Using Deep Learning” on page 1-926. Create the model,
then call enhanceSpeech on the ambisonic data to perform speech enhancement.
model = seModel(netFolder);
enhancedSpeech = enhanceSpeech(model,ambisonicData);
Listen to the enhanced speech. You can compare the enhanced speech listening experience with the
clean speech or noisy ambisonic data by selecting the desired sound source from the dropdown.
soundSource = ;
soundsc(soundSource,fs)
Compare the clean speech, noisy speech, and enhanced speech in the time domain, as spectrograms,
and as mel spectrograms.
compareAudio(cleanSpeech,ambisonicData,enhancedSpeech)
compareSpectrograms(cleanSpeech,ambisonicData,enhancedSpeech)
compareSpectrograms(cleanSpeech,ambisonicData,enhancedSpeech,Warp="mel")
Create a speechClient object that uses the wav2vec 2.0 pretrained network to perform speech-to-text transcription.
transcriber = speechClient("wav2vec2.0",segmentation="none");
Perform speech-to-text transcription using the clean speech, the ambisonic data, and the enhanced
speech.
cleanSpeechResults = speech2text(transcriber,cleanSpeech,fs)
cleanSpeechResults =
"i tell you it is not poison she cried"
noisySpeechResults = speech2text(transcriber,ambisonicData(:,channel),fs)
noisySpeechResults =
"i tell you it is not parzona she cried"
enhancedSpeechResults = speech2text(transcriber,enhancedSpeech,fs)
enhancedSpeechResults =
"i tell you it is not poisen she cried"
Compare the performance of the speech enhancement system using the short-time objective
intelligibility (STOI) measurement [5] on page 1-952. STOI has been shown to have a high
correlation with the intelligibility of noisy speech and is commonly used to evaluate speech
enhancement systems.
Calculate STOI for the omnidirectional channel of the ambisonics, and for the enhanced speech.
Perfect intelligibility has a score of 1.
stoi(ambisonicData(:,channel),cleanSpeech,fs)
ans = 0.6941
stoi(enhancedSpeech,cleanSpeech,fs)
ans = single
0.8418
References
[1] Luo, Yi, Cong Han, Nima Mesgarani, Enea Ceolini, and Shih-Chii Liu. "FaSNet: Low-Latency
Adaptive Beamforming for Multi-Microphone Audio Processing." In 2019 IEEE Automatic Speech
Recognition and Understanding Workshop (ASRU), 260–67. SG, Singapore: IEEE, 2019. https://
doi.org/10.1109/ASRU46091.2019.9003849.
[2] Guizzo, Eric, Riccardo F. Gramaccioni, Saeid Jamili, Christian Marinoni, Edoardo Massaro, Claudia
Medaglia, Giuseppe Nachira, et al. "L3DAS21 Challenge: Machine Learning for 3D Audio Signal
Processing." In 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing
(MLSP), 1–6. Gold Coast, Australia: IEEE, 2021. https://doi.org/10.1109/MLSP52302.2021.9596248.
[3] Roux, Jonathan Le, et al. "SDR – Half-Baked or Well Done?" ICASSP 2019 - 2019 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 626–
30. DOI.org (Crossref), https://doi.org/10.1109/ICASSP.2019.8683855.
[4] Luo, Yi, et al. "Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-
Channel Speech Separation." ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 46–50. DOI.org (Crossref), https://doi.org/
10.1109/ICASSP40776.2020.9054266.
[5] Taal, Cees H., Richard C. Hendriks, Richard Heusdens, and Jesper Jensen. "An Algorithm for
Intelligibility Prediction of Time–Frequency Weighted Noisy Speech." IEEE Transactions on Audio,
Speech, and Language Processing 19, no. 7 (September 2011): 2125–36. https://doi.org/10.1109/
TASL.2011.2114881.
Supporting Functions
Compare Audio
function compareAudio(target,x,y,parameters)
%compareAudio Plot clean speech, B-format ambisonics, and predicted speech
% over time
arguments
target
x
y = []
parameters.SampleRate = 16e3
end
numToPlot = 2 + ~isempty(y);
f = figure;
tiledlayout(4,numToPlot,TileSpacing="compact",TileIndexing="columnmajor")
f.Position = [f.Position(1),f.Position(2),f.Position(3)*numToPlot,f.Position(4)];
t = (0:(size(x,1)-1))/parameters.SampleRate;
xmax = max(x(:));
xmin = min(x(:));
nexttile(1,[4,1])
plot(t,target,Color=[0 0.4470 0.7410])
axis tight
ylabel("Amplitude")
xlabel("Time (s)")
title("Clean Speech (Target Data)")
grid on
nexttile(5)
plot(t,x(:,1),Color=[0.8500 0.3250 0.0980])
title("Noisy Speech (B-Format Ambisonic Data)")
axis([t(1),t(end),xmin,xmax])
set(gca,Xticklabel=[],YtickLabel=[])
grid on
yL = ylabel("W",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
nexttile(6)
plot(t,x(:,2),Color=[0.8600 0.3150 0.0990])
axis([t(1),t(end),xmin,xmax])
set(gca,Xticklabel=[],YtickLabel=[])
grid on
yL = ylabel("X",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
nexttile(7)
plot(t,x(:,3),Color=[0.8700 0.3050 0.1000])
axis([t(1),t(end),xmin,xmax])
set(gca,Xticklabel=[],YtickLabel=[])
grid on
yL = ylabel("Y",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
nexttile(8)
plot(t,x(:,4),Color=[0.8800 0.2950 0.1100])
axis([t(1),t(end),xmin,xmax])
xlabel("Time (s)")
set(gca,YtickLabel=[])
grid on
yL = ylabel("Z",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
if numToPlot==3
nexttile(9,[4,1])
plot(t,y,Color=[0 0.4470 0.7410])
axis tight
xlabel("Time (s)")
title("Enhanced Speech")
grid on
set(gca,YtickLabel=[])
end
end
Compare Spectrograms
function compareSpectrograms(target,x,y,parameters)
%compareSpectrograms Plot spectrograms of clean speech, B-format
% ambisonics, and predicted speech over time
arguments
target
x
y = []
parameters.SampleRate = 16e3
parameters.Warp = "linear"
end
fs = parameters.SampleRate;
switch parameters.Warp
case "linear"
fn = @(x)spectrogram(x,hann(round(0.03*fs),"periodic"),round(0.02*fs),round(0.03*fs),fs,"yaxis");
case "mel"
fn = @(x)melSpectrogram(x,fs);
end
numToPlot = 2 + ~isempty(y);
f = figure;
tiledlayout(4,numToPlot,TileSpacing="tight",TileIndexing="columnmajor")
f.Position = [f.Position(1),f.Position(2),f.Position(3)*numToPlot,f.Position(4)];
nexttile(1,[4,1])
fn(target)
fh = gcf;
fh.Children(1).Children(1).Visible="off";
title("Clean Speech")
nexttile(5)
fn(x(:,1))
fh = gcf;
fh.Children(1).Children(1).Visible="off";
set(gca,Yticklabel=[],XtickLabel=[],Xlabel=[])
yL = ylabel("W",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
title("Noisy Speech (B-Format Ambisonic Data)")
nexttile(6)
fn(x(:,2))
fh = gcf;
fh.Children(1).Children(1).Visible="off";
set(gca,Yticklabel=[],XtickLabel=[],Xlabel=[])
yL = ylabel("X",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
nexttile(7)
fn(x(:,3))
fh = gcf;
fh.Children(1).Children(1).Visible="off";
set(gca,Yticklabel=[],XtickLabel=[],Xlabel=[])
yL = ylabel("Y",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
nexttile(8)
fn(x(:,4))
fh = gcf;
fh.Children(1).Children(1).Visible="off";
set(gca,Yticklabel=[])
yL = ylabel("Z",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
if numToPlot==3
nexttile(9,[4,1])
fn(y)
fh = gcf;
fh.Children(1).Children(1).Visible="off";
set(gca,Yticklabel=[],Ylabel=[])
title("Enhanced Speech")
end
end
Automated Design of Audio Filters for Room Equalization
This example combines Optimization Toolbox™ and Audio Toolbox™ to develop an algorithm that
automatically tunes a set of filter parameters.
There are many audio applications where it is desirable to compute parametric equalizer parameters
to fit an arbitrary frequency response. For instance, one could fit a filter response to a measured
impulse response (IR) to obtain a lower-order implementation of the same filter. Alternatively, one can
apply correction to a measured loudspeaker response (anechoic or in-room) to smooth out any
imperfections and create a perceptually flat frequency response. The latter is demonstrated here by
designing an algorithm that automatically tunes the parameters of N parametric equalizers such that
when the resulting EQ is applied to the speaker, the frequency response is perceived as flat in the
room.
The example includes the following steps:
• Measure the in-room response of a loudspeaker using the Impulse Response Measurer
• Compute a fractional-octave smoothed response
• Take into account the microphone calibration data
• Compute a target response that is perceptually optimal for the given loudspeaker system and
room configuration
• Optimize a set of filter parameters that modifies the response to better fit the target
• Produce an audio filter or audio files to evaluate the results using headphones or listening room
In-Room Measurement
The first step is to obtain measurements for the system that needs to be improved.
Set up a full duplex sound interface so that it can both play on the loudspeaker, and record with a
calibration microphone (such as a Behringer ECM8000 connected to a sound interface capable of
supplying phantom power). Place the microphone on a stand so that you can move it into the listening
position(s). You can start with the microphone centered in the listener's position, but measuring at
several positions will help reduce the chances of overcorrecting for issues like a high frequency dip in
the response that could only be present in a region smaller than a listener's head.
Verify that the correct audio device is selected. Change the sample rate if desired (this example uses
96 kHz). Set the player and recorder channels to the loudspeaker and microphone, respectively.
Select the Swept Sine method. Set the number of runs to average several measurements together. This
example uses five. Set the duration per run to have time for a long enough swept sine and a period of
silence that is long enough for the reverberation to completely die off, which can be several seconds
in a typical room.
In the advanced settings, set a pause between runs that allows time to move the microphone around
the initial "center" position. You must keep silent during the "silence" part of the measurement, but
you can move the microphone and make noise during this pause. The start frequency should be set
below the range of the loudspeaker (10 Hz might be a good starting point). The stop frequency can be
set to half the sampling rate, unless measuring a subwoofer with limited high-frequency range. Set
the sweep duration to a few seconds, and make sure there is enough duration left for silence at the
end.
You may test levels with the level meter or try a first capture with 1 short run. Set playback level loud
enough to hide any background noise in the room and set the microphone level so that it is high but
does not overload/clip.
Now you can capture the data and export it to a MAT file. The rest of this example uses a file provided here.
Import the last measurement that was exported by the application (by indexing with end). The data used here is stored in a similar, compatible format.
load('measured_ir_data.mat','measurementData');
Fs = measurementData.SampleRate(end);
ir = measurementData.ImpulseResponse(end).Amplitude;
frequency = measurementData.MagnitudeResponse(end).Frequency;
if isfield(measurementData.MagnitudeResponse,'PowerDb')
magnitudeDB = measurementData.MagnitudeResponse(end).PowerDb;
else % Version R2022b of IRM has renamed this field to Magnitude (dB)
magnitudeDB = measurementData.MagnitudeResponse(end).MagnitudeDB;
end
Using a helper function provided here, smooth the response by 1/24-octave sections. Since magnitudeDB is a measurement, add extra smoothing (set the last argument to true).
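The call to the smoothing helper is summarized by the sketch below; octaveAverage ships with this example, and its second output (the band center frequencies) as well as the fullRange variable are assumptions inferred from how cfFR, pdbFR, and fullRange are used later in the example.
% Hypothetical sketch; the second output of octaveAverage and the value of
% fullRange are assumptions based on their later use in this example.
fullRange = [20 Fs/2];                                         % assumed analysis range in Hz
[pFR,cfFR] = octaveAverage(frequency,db2mag(magnitudeDB),24,fullRange,true);
pdbFR = 20*log10(pFR);                                         % smoothed response in dB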
semilogx(frequency,magnitudeDB,':',Color="#77AC30")
hold on
plot(cfFR,pdbFR,'b',LineWidth=2,Color="#0072BD")
title('Measured Speaker Response')
legend('Raw Data','Octave Smoothed',Location='southwest')
xlabel('Frequency (Hz) \rightarrow')
ylabel('Magnitude (dB) \rightarrow')
yrange = [-50 0];
axis([30 22e3 yrange])
hold off
grid on
Microphone Calibration
If there is calibration data available for the microphone, subtract it from the measurement.
In this case, generic calibration data for the Behringer ECM8000 microphone is used. Calling the
helper function with no output arguments plots the microphone frequency response.
getMicCalibration()
Subtract the microphone calibration data from the measured response, using the same frequency
values.
micGainDB = getMicCalibration(cfFR);
pdbFRmic = pdbFR - micGainDB;
Determine a "range of interest" (ROI) for the optimization based on the measurement above and the
manufacturer specifications for our loudspeaker. The bookshelf speaker measured above has a range
of 60 Hz to 22 kHz according to the manufacturer. Set the ROI to a range of 40 Hz to 20 kHz to
slightly increase the low end and to take into account the steep decline above 20 kHz.
ROI = [40 20e3]; % specify the region of interest for the optimization
The next step is to determine a suitable target response for the system in the ROI. The main goal is to
provide a response that is perceived as "flat", and potentially extend the low frequencies (within
reasonable limits).
To compute a target response, fit a straight line (in the log-frequency domain) onto the frequency
response (in dB) for a subset of the ROI. Add a roll off to the lower frequencies.
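A minimal sketch of such a fit is shown below; it assumes cf holds the smoothed band center frequencies (taken equal to cfFR), pdbFRmic is the microphone-corrected response from above, and lfCutOff marks the lower edge of the roughly linear part of the response (the 60 Hz value and the fit bounds are assumptions).
cf = cfFR;                                           % assumed: reuse the smoothed band centers
lfCutOff = 60;                                       % assumed low-frequency edge of the fit (Hz)
fitIdx = cf >= lfCutOff & cf <= ROI(2);              % fit only the usable part of the band
lineCoeffs = polyfit(log10(cf(fitIdx)),pdbFRmic(fitIdx),1);  % straight line in log-frequency
targetResp = polyval(lineCoeffs,log10(cf));          % evaluate the target over all bands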
% Roll off the low frequencies, starting slightly above the linear range above
lfcutoff = 1.05*lfCutOff;
idx = cf<lfcutoff;
targetResp(idx) = targetResp(idx) - min(30,ROI(1)/2)*((lfcutoff-cf(idx))/lfcutoff).^2;
The target response (in black) has a slight downward tilt and a roll off for the low frequencies that
allows for some boost (10 to 12 dB).
The variables being tuned by the optimization algorithm are typical audio parametric EQ parameters:
Center Frequency, Filter Bandwidth, and Peak Gain.
Use the response to produce settings for a 12-band parametric filter (10 peak filters and 2 shelf
filters).
Use the designParamEQ function from Audio Toolbox to design the filter. Use the lsqnonlin
(Optimization Toolbox) function to perform the fit by tuning the parameters of the EQ bands until the
speaker response is as flat as possible.
Before configuring the optimization algorithm, you can look at what a manual filter design looks like for two filters. Use the following controls to manually tune the filter parameters and observe the output response using fvtool. This allows you to visualize the parameters that the optimization algorithm automatically tunes.
gain = [ , ...
];
centerFreq = [ , ...
];
bandwidth = [ , ...
];
Next, generate the filter coefficients using the specified parameters by calling designParamEQ:
[B,A] = designParamEQ(2,gain,(centerFreq/(Fs/2)),(bandwidth/(Fs/2)),Orientation='row');
fvtool([B,A],'Fs',Fs)
To produce the optimized parametric filters, call a helper function that sets starting values and limits
for every filter parameter, then calls lsqnonlin. The optimizer uses the eqObjectiveFct function
that computes the response of the given EQ and compares it to the desired response. The lsqnonlin
optimizer attempts to minimize that error on every iteration.
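A simplified sketch of how that call might look is shown below; the parameter packing, starting values, bounds, and the eqObjectiveFct signature are all assumptions for illustration.
% Hypothetical sketch of the optimization setup; values and signatures are assumptions.
numFilters = 12;                                               % 10 peak filters plus 2 shelves
f0 = logspace(log10(ROI(1)),log10(ROI(2)),numFilters).';       % assumed starting center frequencies
x0 = [zeros(numFilters,1); f0; 0.5*f0];                        % assumed start: gains, centers, bandwidths
lb = [-20*ones(numFilters,1); ROI(1)*ones(numFilters,1); 10*ones(numFilters,1)];
ub = [ 20*ones(numFilters,1); ROI(2)*ones(numFilters,1); 5e3*ones(numFilters,1)];
opts = optimoptions("lsqnonlin",Display="iter");
xOpt = lsqnonlin(@(x)eqObjectiveFct(x,cf,pdbFRmic,targetResp,Fs),x0,lb,ub,opts);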
                                        Norm of      First-order
 Iteration  Func-count     Resnorm        step        optimality
     0          37         0.162731                      0.425
     1          74        0.0545925          10          0.524
     2         111        0.0289868          20          0.108
Results
Examine the results. Start by computing the filter frequency response over a larger frequency range.
freqzResp = eqFreqz(EQ,frequency,Fs);
pFR = octaveAverage(frequency,abs(freqzResp),24,fullRange,false);
filtRespFR = 20*log10(pFR);
outputRespFR = pdbFRmic + filtRespFR;
Also plot error relative to target, but invert the error curves to make it easier to compare them with
the optimized EQ.
EQt=12×4 table
Type Frequency (Hz) Gain (dB) Q/S
____ ______________ _________ _______
Compute an overall gain that avoids any tones increasing in amplitude. This is not a guarantee that
signals cannot clip but is generally more than sufficient in practice.
gainDB = -max(filtRespFR)-.1;
Instantiate a multibandParametricEQ object with the EQ settings. You can use this object to
visualize the filter, create an audio plugin, or load either the object or plugin into the Audio Test
Bench.
% Create EQ object
mbpeq = multibandParametricEQ(HasLowShelfFilter=true, HasHighShelfFilter=true, ...
    NumEQBands=numFilters-2, EQOrder=2, SampleRate=Fs, ...
    Frequencies=EQ(1:end-2,3)', QualityFactors=EQ(1:end-2,2)', PeakGains=EQ(1:end-2,1)', ...
    LowShelfCutoff=EQ(end-1,3), LowShelfSlope=EQ(end-1,2), LowShelfGain=EQ(end-1,1), ...
    HighShelfCutoff=EQ(end,3), HighShelfSlope=EQ(end,2), HighShelfGain=EQ(end,1)); % gains assumed in column 1 of EQ
Produce output files to subjectively evaluate the results, either with headphones or in the actual
listening room. The IR is included for headphone evaluation but should be omitted when testing in
the actual room (which produces that response itself).
[rock,fileFs] = audioread('RockDrums-44p1-stereo-11secs.mp3');
% Resample the test file to match the sample rate of the IR measurement
rock = resample(rock,Fs,fileFs,100);
% Convolve the IR with the test audio. This will simulate the
% effect of the room when evaluating the EQ using headphones.
rockIR = conv(rock(:,1),ir);
rockIR(:,2) = conv(rock(:,2),ir);
rockIR = rockIR*.97/max(abs(rockIR),[],'all');
audiowrite('RockDrums-with-IR.wav',rockIR,Fs,Comment=...
'Convolution of rock drums with impulse response (for simulated evaluation on headphones)');
Type audioTestBench(mbpeq) at the command prompt to try the plugin in the Audio Test Bench
app.
In the Audio Test Bench, click the "Audio File Reader" button with the gear icon and select
'RockDrums-with-IR.wav' if using headphones, or simply use 'RockDrums-44p1-
stereo-11secs.mp3' if playing back over the same system that was used to measure the impulse
response. With the Audio Test Bench, you can toggle the EQ on and off, and you can even further tune
the EQ settings to your liking.
You can also process files with the EQ to listen outside of the Audio Test Bench.
% Apply the EQ to the original file (for use in the measured room).
% Use the gain that was computed to avoid clipping.
rockEQ = db2mag(gainDB)*mbpeq(rock);
reset(mbpeq); % reset the EQ before processing another file
audiowrite('RockDrums-with-Correction-only.wav',rockEQ,Fs,Comment=...
['Convolution of rock drums with correction (for listening '...
'in the same environment the IR was measured in)']);
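The rockIREQ signal written out below can be produced the same way, by running the IR-convolved version through the EQ (a sketch mirroring the previous step):
rockIREQ = db2mag(gainDB)*mbpeq(rockIR);
reset(mbpeq) % reset the EQ state before the next file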
audiowrite('RockDrums-with-IR-and-Correction.wav',rockIREQ,Fs,Comment=...
['Convolution of rock drums with impulse response and '...
'correction (IR simulation for evaluation over headphones)']);
Alternatively, to apply the EQ to any playback on a selected device, Windows users can export the EQ
settings in a Room EQ Wizard (REW) compatible format and load it in Equalizer APO.
eqExport2APO("myroomeq.txt",EQ,gainDB);
type("myroomeq.txt")
Equalizer: Generic
Preamp: -8.5 dB
Filter: ON PK Fc 72.52 Hz Gain 3.91 dB Q 2.15162
Filter: ON PK Fc 122.84 Hz Gain -5.89 dB Q 3.02225
Filter: ON PK Fc 211.61 Hz Gain -9.73 dB Q 0.92132
Filter: ON PK Fc 492.18 Hz Gain -6.61 dB Q 3.72172
Filter: ON PK Fc 779.00 Hz Gain -4.25 dB Q 7.83389
Filter: ON PK Fc 1102.35 Hz Gain 5.19 dB Q 4.35797
Filter: ON PK Fc 2127.60 Hz Gain -15.94 dB Q 1.12182
Filter: ON PK Fc 4842.40 Hz Gain 8.07 dB Q 0.48624
Filter: ON PK Fc 9175.50 Hz Gain -6.19 dB Q 2.07367
Filter: ON PK Fc 11710.49 Hz Gain -6.45 dB Q 2.19118
Filter: ON LS Fc 3194.10 Hz Gain 6.74 dB Q 1.89623
Filter: ON HS Fc 15580.23 Hz Gain -3.54 dB Q 1.34551
Audio-Based Anomaly Detection for Machine Health Monitoring
This example shows how to design an autoencoder neural network to perform anomaly detection for
machine sounds using unsupervised learning. In this example you will download and process the data
using a log-mel spectrogram, design and train an autoencoder network, and make out-of-sample
predictions by applying a statistical model to the trained network output.
Audio-based anomaly detection is the process of identifying whether the sound generated by an
object is abnormal. This is applicable to the automatic detection of industrial component failures, as a
machine that emits an abnormal sound is likely malfunctioning.
The problem of classifying sounds as either normal or abnormal can be viewed as a standard
supervised learning task, where a model is trained on samples of both sound types and learns to
discriminate between them. However, in practice, a data set of abnormal sounds is generally not
available because machine malfunctions do not occur frequently enough or for long enough duration
to be properly recorded. Also, it would be impossible to create a data set representative of every type
of anomaly, as a machine could malfunction for a diverse set of reasons.
Autoencoders are useful for anomaly detection tasks because they train solely on the normal samples.
Autoencoder networks perform the unsupervised learning task of finding both a low-dimensional
encoding of the input as well as a rule to accurately reconstruct the input from its low-dimensional
representation. This forces the autoencoder to learn a process specifically for compressing and
decompressing normal samples. The motivating principle is that when an abnormal sample is fed into
the autoencoder, the reconstruction error will be much larger than expected from the training set
because the signal compression and decompression scheme learned by the network is only expected
to work well for normal samples. To make predictions on unseen samples, an error threshold is
picked based off the expected distribution of reconstruction errors for normal samples, and any input
with an error larger than the threshold is classified as an anomaly.
In this example, the autoencoder first passes the input through an encoding section of fully-connected
layers using a number of nodes on the same order of magnitude as the input dimension. The data
then feeds into a bottleneck layer with a number of nodes much smaller than the input size which
forces the network to compress the input signal into the lower-dimensional representation. This
compressed representation feeds into a decoding section that generally mirrors the same
architecture as the encoder section in order to recreate the input signal. Lastly, the decoder output is
passed into a final output layer with the same number of dimensions as the input. The network loss is
taken as the regression error between the original input and the reconstructed signal.
Download Data
This example applies to the second task of the Detection and Classification of Acoustic Scenes and
Events (DCASE) 2022 challenge [1] on page 1-982. The example uses a subset of the public data set
from Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection [2] on page
1-982 to train and evaluate the autoencoder. It implements ideas from the preprocessing steps and
network designs of both the autoencoder baseline system in [1] on page 1-982 and the proposed
network in [2] on page 1-982 and uses the performance metrics devised in [1] on page 1-982 to
analyze the testing results.
Download a subset of the data set in [2] on page 1-982 that contains recorded audio files of 4
different fan types, labeled by ID number. There are both normal and abnormal recordings for each
fan type. Each file contains one channel sampled at 16 kHz and is 10 seconds long. The samples are recordings of operating fans mixed with background noise at a signal-to-noise ratio of 6 dB. A full
explanation of the data collection process is available in [2] on page 1-982.
dataFolder = tempdir;
dataset = fullfile(dataFolder,"fan6db");
supportFileLoc = "mimii/mono/fan6db.zip";
downloadFolder = matlab.internal.examples.downloadSupportFile("audio",supportFileLoc);
unzip(downloadFolder,dataFolder)
Investigate Data
To briefly examine the data set and the differences between the normal and abnormal recordings,
select one recording of each type from the ID 00 fan data set and play the first two seconds over your
speaker.
[normalSample,fs] = audioread(fullfile(dataset,"id_00","normal_00","00000000.wav"));
abnormalSample = audioread(fullfile(dataset,"id_00","abnormal_00","00000000.wav"));
numSamples = 10*fs;
sound(normalSample(1:numSamples/5),fs)
pause(3)
sound(abnormalSample(1:numSamples/5),fs)
Both recordings are dominated by a single tone, and this tone is clearly higher pitched in the
abnormal sample.
Preprocess Data
You can optionally set the speedUp flag to true to reduce the size of the data set used in the
example. If you set this to true you can quickly verify that the script runs as expected, but the
results will be skewed.
speedUp = false; % set to true to run on a reduced data set
Separate the data set into two audioDatastore objects, one with the normal samples and one with
the abnormal samples. Since the autoencoder only trains on the normal samples, hold out the
abnormal samples to be included in the test set.
ads = audioDatastore(dataset, ...
IncludeSubfolders=true, ...
LabelSource="foldernames", ...
FileExtensions=".wav");
normalLabels = categorical(["normal_00","normal_02","normal_04","normal_06"]);
abnormalLabels = categorical(["abnormal_00","abnormal_02","abnormal_04","abnormal_06"]);
isNormal = ismember(ads.Labels,normalLabels);
isAbnormal = ~isNormal;
adsNormal = subset(ads,isNormal);
adsTestAbnormal = subset(ads,isAbnormal);
rng(3);
if speedUp
c = cvpartition(adsTestAbnormal.Labels,kFold=8,Stratify=true);
adsTestAbnormal = subset(adsTestAbnormal,c.test(1));
end
Divide the normal samples into training, validation, and test sets, stratified by ID number. Then
concatenate the normal test set with the abnormal samples to form the full test set.
c = cvpartition(adsNormal.Labels,kFold=8,Stratify=true);
if speedUp
trainInd = c.test(3);
else
trainInd = ~boolean(c.test(1)+c.test(2));
end
valInd = c.test(1);
testInd = c.test(2);
adsTrain = subset(adsNormal,trainInd);
adsVal = subset(adsNormal,valInd);
adsTestNormal = subset(adsNormal,testInd);
Transform each of the datastores by applying an STFT with frame length of 64 ms and hop length of
32 ms, find the log-mel energies for 128 frequency bands, and then concatenate these frames into
overlapping, consecutive groups of 5 to form a context window. It is common to use log-mel energies
as inputs to audio deep learning tasks as they represent the spectrum of tones on a scale similar to
how humans perceive sound. Visualize the log-mel spectrograms of the two clips played previously
using the plotLogMelSpect supporting function.
plotLogMelSpect(normalSample,abnormalSample);
Read the data into arrays where each row represents an input sample. Do this in parallel if you
have enabled Parallel Computing Toolbox™. Then combine the normal test set and abnormal data set
into the full test set, and label the samples accordingly.
trainingData = readall(tdsTrain,UseParallel=false);
valData = readall(tdsVal,UseParallel=canUseParallelPool);
normalTestData = readall(tdsTestNormal,UseParallel=canUseParallelPool);
abnormalTestData = readall(tdsTestAbnormal,UseParallel=canUseParallelPool);
testLabels = categorical([zeros(length(adsTestNormal.Labels),1);ones(length(adsTestAbnormal.Labels),1)],[0,1],["normal","abnormal"]);
testData = [normalTestData;abnormalTestData];
Network Architecture
The encoder section consists of 2 fully connected layers with output sizes of 128. The bottleneck layer
constrains the network to an 8-dimensional representation of the original 640-dimensional input. The
decoder section mirrors the encoder architecture as the input is reconstructed and fed into the
output layer. Use half-mean-squared-error as the loss function to train the network and quantify the
reconstruction error.
layers = [
featureInputLayer(640)
fullyConnectedLayer(128,Name="Encoder1")
batchNormalizationLayer
reluLayer
fullyConnectedLayer(128,Name="Encoder2")
batchNormalizationLayer
reluLayer
fullyConnectedLayer(8,Name="Bottleneck")
batchNormalizationLayer
reluLayer
fullyConnectedLayer(128,Name="Decoder1")
batchNormalizationLayer
reluLayer
fullyConnectedLayer(128,Name="Decoder2")
batchNormalizationLayer
reluLayer
fullyConnectedLayer(640,Name="Output")];
Train Network
Train the network using an ADAM optimizer for 40 epochs. Shuffle the mini-batches each epoch, and
set the ExecutionEnvironment field to "auto" so that a GPU is used instead of the CPU if
available. If using a GPU with limited memory, you may need to decrease the value of the
miniBatchSize field. The training parameter settings were found empirically to optimize
convergence speed. This may take 10-15 minutes depending on your hardware.
batchSize = length(trainingData)/4;
if speedUp
batchSize = 2*batchSize;
end
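One set of training options consistent with this description is sketched below; the validation data and plotting settings are assumptions.
options = trainingOptions("adam", ...
    MaxEpochs=40, ...
    MiniBatchSize=batchSize, ...
    Shuffle="every-epoch", ...
    ExecutionEnvironment="auto", ...
    ValidationData={valData,valData}, ... % assumed: validate the reconstruction on valData
    Plots="training-progress", ...
    Verbose=false);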
trainingData is both the input and the target output as the network attempts to regress the
training data on itself with the low-dimensional encoding constraint. Your results should look similar
to the training plots below.
[net,info] = trainnet(trainingData,trainingData,layers,"mse",options);
Evaluate Performance
For each input to the network, the autoencoder outputs an attempted reconstruction. However, each
network input is only one context window from a larger audio sample. For each of these network
inputs, the error is defined as the squared L-2 norm of the difference between the original input and
the network output. To calculate a decision metric for each entire audio sample, the errors for each
context window associated with that audio sample are added together, and this sum is divided by the
product of the network input dimension and the number of context groups per audio sample. For an audio sample X, the decision function metric is denoted A(X) and defined as

A(X) = \sum_{i=1}^{n} \frac{\lVert f(x_i) - x_i \rVert^2}{n \cdot \dim(x_i)}

where n is the number of context groups per sample, x_i is the ith context group constructed from X, and f(x_i) is the network output for x_i. A(X) represents the mean squared reconstruction error across each vector component of all context windows associated with an audio sample X.
For each input X, A(X) can also be interpreted as a relative measure of the network's confidence that X is abnormal, with higher values indicating larger confidence. To deploy this model and make predictions on new data, you must select a decision boundary on the values of A to separate positive and negative predictions. Model A(X) for normal samples as a gamma distribution. Gamma
distributions are commonly used to model autoencoder reconstruction errors since the errors are
usually skewed right with a heavy tail, which is the natural shape of a gamma distribution. In this
example, the decision boundary is selected as the point that corresponds to an expected false positive
rate (FPR) p = 0.1. This decision boundary attempts to capture all truly abnormal samples while
tolerating the expectation that 10% of normal samples will be falsely predicted as abnormal. You can
choose a specific value of p to fit your individual system constraints.
p = 0.1;
Compute the values of A over the training set and store them in the variable A_train using the
helper function getScore. Then solve for the maximum likelihood estimate for the gamma
distribution parameters, select the cutoff point from the inverse gamma cumulative distribution
function, and plot the fitted distribution with the histogram of A using the getCutoff helper
function.
trainRecons = minibatchpredict(net,trainingData,MiniBatchSize=size(trainingData,1)/4);
A_train = getScore(trainingData,trainRecons);
cutoff = getCutoff(A_train,p);
Verify that this cutoff point roughly corresponds to an FPR of 0.1 on the training set:
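One way to perform this check, using the training scores and cutoff computed above:
mean(A_train > cutoff) % fraction of training samples whose error exceeds the cutoff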
ans = 0.1221
Test the classification accuracy of this system with the chosen cutoff point on the holdout test set.
testRecons = minibatchpredict(net,testData,MiniBatchSize=size(testData,1));
A_test = getScore(testData,testRecons);
testPreds = categorical(A_test > cutoff,[false,true],["normal","abnormal"]).';
figure
cm = confusionchart(testLabels,testPreds);
cm.RowSummary="row-normalized";
Using this cutoff point, the model achieves a true positive rate (TPR) of 0.647 at the cost of an FPR of
0.125.
To evaluate the accuracy of the network over a range of decision boundaries, measure the overall
performance on the test set by the area under the receiver operating characteristic curve (AUC). Use
both the full AUC and the partial AUC (pAUC) to analyze the network performance. pAUC is the AUC on the subdomain where the FPR is in the interval [0, p], divided by the maximum possible area in that interval, which is p. It is important to consider pAUC since anomaly detection systems need to be able
to achieve high TPR while keeping the FPR to a minimum, as a system with frequent false alarms is
untrustworthy and unusable. Compute the AUC using the perfcurve function from Statistics and
Machine Learning Toolbox™.
[X,Y,T,AUC] = perfcurve(testLabels,A_test,categorical("abnormal"));
[~,cutoffIdx] = min(abs(T-cutoff));
figure
plot(X,Y);
xlabel("FPR");
ylabel("TPR");
title("Test Set ROC Curve");
hold on
plot(X(cutoffIdx),Y(cutoffIdx),'r*');
hold off
legend("ROC Curve","Cutoff Decision Point");
grid on
AUC
AUC = single
0.8949
To calculate the pAUC, approximate the area under the curve in the first tenth of the FPR domain
using trapz. For reference, the expected value of the pAUC of a random classifier is 0.05.
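A computation consistent with this description, reusing the ROC coordinates X and Y returned by perfcurve above (it mirrors the per-ID computation later in the example):
pX = X(X <= p);        % restrict the ROC curve to FPR values up to p
pY = Y(X <= p);
pAUC = trapz(pX,pY)/p  % normalize by the maximum possible area, p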
pAUC = single
0.5177
The network separates the normal and abnormal test samples fairly well and is able to learn a single
encoding across multiple fan IDs. Visualize the difference in reconstruction errors between the
normal and abnormal groups by their histograms.
figure
hold on
edges = linspace(min(A_test),1,100);
histogram(A_test(testLabels == categorical("normal")),edges,Normalization="probability");
histogram(A_test(testLabels == categorical("abnormal")),edges,Normalization="probability");
ylabel("Sample Probability");
xlabel("Reconstruction Error (A)");
legend("Normal","Abnormal");
Although there is some overlap, the distribution of reconstruction errors for the abnormal samples is
offset further to the right and contains a much heavier tail than the distribution of reconstruction
errors over the normal samples.
Lastly, evaluate the model's performance on each fan ID individually to reveal any imbalance between
the fan types and check if the model is able to predict universally well over all IDs.
IDs = [0;2;4;6];
AUCs = zeros(4,1);
pAUCs = AUCs;
A_testNormal = A_test(1:sum(testInd));
A_testAbnormal = A_test(sum(testInd)+1:end);
for ii = 1:4
normalMask = adsTestNormal.Labels == normalLabels(ii);
abnormalMask = adsTestAbnormal.Labels == abnormalLabels(ii);
A_testByID = [A_testNormal(normalMask) A_testAbnormal(abnormalMask)];
testLabelsByID = [adsTestNormal.Labels(normalMask);adsTestAbnormal.Labels(abnormalMask)];
[X_ID,Y_ID,T_ID,AUC_ID] = perfcurve(testLabelsByID,A_testByID,abnormalLabels(ii));
AUCs(ii) = AUC_ID;
pX_ID = X_ID(X_ID <= p);
pY_ID = Y_ID(X_ID <= p);
pAUCs(ii) = trapz(pX_ID,pY_ID)/p;
end
disp(table(IDs,AUCs,pAUCs));
    IDs     AUCs      pAUCs 
    ___    _______    _______

     0     0.72849     0.1776
     2     0.96949      0.683
     4     0.90909    0.47002
     6     0.95452    0.67158
The results show the model performance significantly varies by fan type. This result is important to
note as this network is relatively small and simple compared to the top performing DCASE challenge
submissions in [3] on page 1-982. To generalize better across fan types and to different domains, a
more complex model is needed. However, if you know the exact fan type that you are deploying an
anomaly detector for, a very light-weight model like the one in this example may suffice.
Supporting Functions
function plotLogMelSpect(normalSample,abnormalSample)
%PLOTLOGMELSPECT plots the log-mel spectrogram of the normal and abnormal samples
% plotLogMelSpect(normalSample,abnormalSample) plots the log-mel
% spectrogram of the two inputs side by side, with parameters consistent
% with the data preprocessing transformation used to prepare the signals
% to be fed into the autoencoder.
f = figure;
f.Position(3) = 900;
samples = {normalSample,abnormalSample};
fs = 16e3;
winDur = 64e-3;
winLen = winDur * fs;
numMelBands = 128;
tiledlayout(1,2)
for i = 1:2
nexttile
x = samples{i};
melSpectrogram(x,fs,Window=hamming(winLen,"periodic"),FFTLength=winLen,OverlapLength=winLen/2,NumBands=numMelBands);
xticks(1:10);
xticklabels(string(1:10));
colormap("jet");
if i == 2
cbar = colorbar;
cbar.Label.String = "Power (dB)";
title("Abnormal Log-Mel Spectrogram");
ylabel([]);
else
colorbar off
title("Normal Log-Mel Spectrogram");
end
end
end
function features = processData(x)
%PROCESSDATA transforms an audio file input x into the autoencoder network
%input format
% features = processData(x) takes the STFT of audio data x, transforms
% the STFT into the log-mel spectrogram, and then constructs context
% groups of consecutive mel-spectrogram frames. The function returns the
% features as a numContextGroupsPerSample-by-contextGroupSize matrix. For
% this data set, numContextGroupsPerSample = 309 and contextGroupSize =
% 640 = 128*5 (since there are 128 mel bands per frame and 5 frames are
% concatenated for each context group)
fs = 16e3;
winDur = 64e-3;
winLen = winDur*fs;
numMelBands = 128;
afe = audioFeatureExtractor(...
Window=hamming(winLen,"periodic"), ...
FFTLength=winLen, ...
OverlapLength=winLen/2, ...
SampleRate=fs, ...
melSpectrum=true);
setExtractorParameters(afe,"melSpectrum",numBands=numMelBands,ApplyLog=true);
% Zero pad
numPad = numel(x) + winLen - mod(numel(x),winLen);
xPadded = resize(x,numPad,Side="both");
% Extract
features = {extract(afe,xPadded)};
features = cellfun(@groupSTFT,features,UniformOutput=false);
features = cat(1,features{:});
end
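The groupSTFT helper called by processData is not listed above; a minimal sketch consistent with the dimensions described in the help comment is shown below (the exact ordering of the frames within a context group is an assumption).
function groups = groupSTFT(x)
%GROUPSTFT concatenate consecutive log-mel frames into overlapping context groups
% Sketch only: x is a numFrames-by-numBands matrix of log-mel frames; each output
% row holds 5 consecutive frames concatenated into one 640-element context group.
contextWin = 5;
numFrames = size(x,1);
numGroups = numFrames - contextWin + 1;
groups = zeros(numGroups,contextWin*size(x,2),"like",x);
for k = 1:numGroups
    chunk = x(k:k+contextWin-1,:);        % 5 consecutive frames (assumed frame-major order)
    groups(k,:) = reshape(chunk.',1,[]);
end
end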
function A = getScore(data,preds)
%GETSCORE returns the reconstruction error for each sample in data
% A = getScore(data,preds) returns A(X) for each X in the set of samples
% transformed into network input data.
err = sum((preds-data).^2,2);
numSTFTFrames = 313;
contextWin = 5;
numMelFilters = 128;
numContextGroupsPerSample = numSTFTFrames - contextWin + 1;
numSamples = length(err)/numContextGroupsPerSample;
A_total = reshape(err,[numContextGroupsPerSample,numSamples]); % Each column contains the reconstruction errors of one audio sample
A = sum(A_total)/(numMelFilters*contextWin*numSTFTFrames); % Each entry is the reconstruction error metric for one audio sample
end
function cutoff = getCutoff(A,p)
%GETCUTOFF fit a gamma distribution to the reconstruction errors A and return
% the cutoff corresponding to an expected false positive rate p.
gammaParams = gamfit(A); % maximum likelihood estimate of the gamma parameters (gamfit assumed)
a = gammaParams(1);
b = gammaParams(2);
cutoff = gaminv(1-p,a,b);
figure
ax1 = subplot(4,1,1:3);
histogram(A);
xticks([]);
title("Histogram of A with Fitted Gamma Dist. PDF");
ylTop = ylabel("Count");
xline(cutoff,"--",LineWidth=2,Label="cutoff",LabelOrientation="horizontal",LabelVerticalAlignment
ax2 = subplot(4,1,4);
t = linspace(0, max(A), 1000);
y = gampdf(t,a,b);
plot(t,y);
xline(cutoff,"--",LineWidth=2);
ylBottom = ylabel("\Gamma Density");
yticks([]);
linkaxes([ax1 ax2],"x");
ylBottom.Position(1) = ylTop.Position(1);
xlabel("Reconstruction Error (A)");
xlim([0 .4]);
ylBottom.Position(1) = ylTop.Position(1);
end
References
[1] "Unsupervised anomalous sound detection for machine condition monitoring applying domain
generalization techniques," DCASE 2022. [Online]. Available: https://dcase.community/
challenge2022/task-unsupervised-anomalous-sound-detection-for-machine-condition-monitoring.
[Accessed: 08-Jun-2022].
[2] "Purohit, Harsh, Tanabe, Ryo, Ichige, Kenji, Endo, Takashi, Nikaido, Yuki, Suefusa, Kaori, &
Kawaguchi, Yohei. (2019). MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine
Investigation and Inspection (public 1.0) [Data set]. 4th Workshop on Detection and Classification of
Acoustic Scenes and Events (DCASE 2019 Workshop), New York, USA. Zenodo. https://doi.org/
10.5281/zenodo.3384388. Dataset is licensed under the Creative Commons Attribution-ShareAlike 4.0
International License available at https://creativecommons.org/licenses/by-sa/4.0/
[3] "Unsupervised detection of anomalous sounds for Machine Condition Monitoring," DCASE 2020.
[Online]. Available: https://dcase.community/challenge2020/task-unsupervised-detection-of-
anomalous-sounds-results#Giri2020. [Accessed: 08-Jun-2022].
Use Datastores to Manage Audio Data Sets
Deep learning and machine learning models are popular for processing audio signals for various
tasks. Training these models requires working with large data sets containing both audio data and
labeling information. For example, when training a model to identify spoken commands, the data can
be a collection of audio files and the labels in this case are the ground truth commands for each file.
Datastores are useful for working with large collections of data, and the audioDatastore object
allows you to manage collections of audio files.
This example shows you how to use datastores to manage three different audio data sets. The first
data set uses the names of the folders containing the audio files as labels, the second data set uses
the file names as labels, and the third data set contains labels in a metadata file. You can then use
these datastores to train machine learning or deep learning models on the audio data.
The Google Speech Commands data set [1] on page 1-986 contains files with spoken command
words stored in folders whose names are the word labels. Download and extract the data set.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","google_speech.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"google_speech");
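Create an audioDatastore that points to the audio files. The train subfolder name used here is an assumption based on the validation subfolder mentioned below.
ads = audioDatastore(fullfile(dataset,"train"),IncludeSubfolders=true);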
Extract the labels for each file from the folder names using the folders2labels function. Use
countlabels to view the distribution of labels.
labels = folders2labels(ads);
countlabels(labels)
ans=30×3 table
Label Count Percent
______ _____ _______
Use combine to create a CombinedDatastore object from the audio data and the labels. Each call
to read on the datastore returns one of the audio signals and its label.
lds = arrayDatastore(labels);
cds = combine(ads,lds);
You can create a separate datastore for validation data by repeating the same steps after creating an
audioDatastore that instead points to the validation subfolder of the data set. Alternatively, you
can use splitlabels to separate an existing datastore into training and validation sets. Specify
UnderlyingDatastoreIndex to indicate which of the underlying datastores in the combined
datastore contains the labels.
idxs = splitlabels(cds,0.8,"randomized",UnderlyingDatastoreIndex=2);
trainDs = subset(cds,idxs{1});
valDs = subset(cds,idxs{2});
Call read on the train datastore. The function returns both the audio signal and the label in a cell
array.
read(trainDs)
The Free Spoken Digit Dataset (FSDD) [2] on page 1-986 contains recordings of spoken digits in
files whose names contain the digit labels as well as speaker labels. Download the data set and create
an audioDatastore that points to the data.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"FSDD","recordings");
ads = audioDatastore(dataset);
Select a random file from the data set and display its name. The file name is formatted as
digitLabel_speakerName_index.
[~,name] = fileparts(ads.Files{randi(length(ads.Files))})
name =
'1_jackson_45'
Use filenames2labels to extract the digit labels from the file names. Combine the labels with the
audio into a CombinedDatastore and see the label distribution of the data set.
labels = filenames2labels(ads,ExtractBefore="_");
lds = arrayDatastore(labels);
cds = combine(ads,lds);
countlabels(cds,UnderlyingDatastoreIndex=2)
ans=10×3 table
Label Count Percent
_____ _____ _______
0 200 10
1 200 10
2 200 10
3 200 10
4 200 10
5 200 10
6 200 10
7 200 10
8 200 10
9 200 10
The Mozilla Common Voice data set [3] on page 1-986 contains recordings of subjects speaking
short sentences. The data set has a metadata file with various labels including sentence
transcriptions and speaker IDs. Download the data set and create an audioDatastore that points to
the training data.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","commonvoice.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"commonvoice","train");
ads = audioDatastore(fullfile(dataset,"clips"));
metadata = readtable(fullfile(dataset,"train.tsv"),FileType="text");
Assert that the order of the files in the datastore matches the table. This ensures you can easily
associate the metadata information with the datastore.
[~,adsFilenames,~] = fileparts(ads.Files);
assert(length(adsFilenames)==length(metadata.path))
assert(all(strcmp(adsFilenames,metadata.path)))
Create a CombinedDatastore that pairs each audio file with its sentence transcription.
sentences = arrayDatastore(string(metadata.sentence));
transcriptDs = combine(ads,sentences);
Create another CombinedDatastore with speaker IDs as labels. Rename the speaker ID labels to
natural numbers for simplicity.
speakerLabels = categorical(metadata.client_id);
speakerIDs = string(1:length(categories(speakerLabels)));
speakerLabels = renamecats(speakerLabels,speakerIDs);
labelsDs = arrayDatastore(speakerLabels);
speakerDs = combine(ads,labelsDs);
countlabels(speakerDs,UnderlyingDatastoreIndex=2)
ans=595×3 table
Label Count Percent
_____ _____ _______
1 1 0.05
10 1 0.05
100 3 0.15
101 4 0.2
102 36 1.8
103 4 0.2
104 1 0.05
105 2 0.1
106 4 0.2
107 1 0.05
108 1 0.05
109 1 0.05
11 4 0.2
110 1 0.05
111 1 0.05
112 10 0.5
⋮
Next Steps
You can now use the data sets to train deep learning or machine learning models, and you can use
read and readall to access the data and labels. You can also perform feature extraction on the data
and use transform to create a new datastore that extracts features from the audio data.
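For example, the sketch below builds a transform datastore that extracts mel spectrograms on the fly from the speech commands training datastore created earlier; the extractor settings are placeholders.
% Hypothetical sketch; the feature extractor settings are placeholders.
afe = audioFeatureExtractor(SampleRate=16e3,melSpectrum=true);
featureDs = transform(trainDs,@(data){extract(afe,data{1}),data{2}});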
References
[1] Warden P. "Speech Commands: A public dataset for single-word speech recognition", 2017.
Available from https://storage.googleapis.com/download.tensorflow.org/data/
speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed
under the Creative Commons Attribution 4.0 license, available here: https://creativecommons.org/
licenses/by/4.0/legalcode.
[2] Zohar Jackson, César Souza, Jason Flaks, Yuxin Pan, Hereman Nicolas, and Adhish Thite.
“Jakobovski/free-spoken-digit-dataset: V1.0.8”. Zenodo, August 9, 2018. https://doi.org/10.5281/
zenodo.1342401.
See Also
Objects
audioDatastore
Functions
folders2labels | filenames2labels | countlabels
Read, Analyze and Process SOFA Files
SOFA (Spatially Oriented Format for Acoustics) [1] on page 1-1006 is a file format for storing spatially
oriented acoustic data like head-related transfer functions (HRTF) and binaural or spatial room
impulse responses. SOFA has been standardized by the Audio Engineering Society (AES) as
AES69-2015.
In this example, you load a SOFA file containing HRTF measurements for a single subject in MATLAB.
You then analyze the HRTF measurements in the time domain and the frequency domain. Finally, you
use the HRTF impulse responses to spatialize an audio signal in real time by modeling a moving
source based on desired azimuth and elevation values.
You use a SOFA file from the SADIE II database [2] on page 1-1006. The file corresponds to spatially
discrete free-field in-the-ear HRTF measurements for a single subject. The measurements
characterize how each ear receives a sound from a point in space.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","SOFA/SOFA.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
netFolder = fullfile(dataFolder,"SOFA");
addpath(netFolder)
filename = "H10_48K_24bit_256tap_FIR_SOFA.sofa";
SOFA files consist of binary data stored in the netCDF-4 format. You can use MATLAB to read and
write netCDF files.
Display the contents of the SOFA file using ncdisp (execute ncdisp(filename)).
The file contents consist of multiple fields corresponding to different aspects of the measurements,
such as the (fixed) listener position, the varying source position, the coordinate system used to
capture the data, general metadata related to the measurement, as well as the measured impulse
responses.
NetCDF is a "self-describing" file format, where data is stored along with attributes that can be used
to assist in its interpretation. Consider the display snippet corresponding to the source position for
example:
SourcePosition contains the coordinates for the varying source position used in the measurements
(here, there are 2114 separate positions). The file also contains attributes (Type, Units) describing
the coordinate system used to store the positions (here, spherical), as well as information about the
dimensions of the data (C,M). The dimensions are defined in the AES69 standard [3] on page 1-1006:
• M = 2114 (the total number of measurements, each corresponding to a unique source position).
• R = 2 (corresponding to the two ears).
• E = 1 (one emitter or sound source per measurement).
• N = 256 (the length of each recorded impulse response).
SOFAInfo = ncinfo(filename);
The fields of the structure SOFAInfo hold information related to the file's dimensions, variables and
attributes.
ir = ncread(filename,"Data.IR");
size(ir)
ans = 1×3
256 2 2114
This variable holds impulse responses for the left and right ear for 2114 independent measurements.
Each impulse response is of length 256.
fs = ncread(filename,"Data.SamplingRate")
fs = 48000
figure;
t = (0:size(ir,1)-1)/fs;
plot(t,ir(:,1,1))
grid on
xlabel("Time (s)")
ylabel("Impulse response")
It is possible to read and analyze the contents of the SOFA file using a combination of ncinfo and
ncread. However, the process can be cumbersome and time consuming.
The function sofaread allows you to read the entire contents of a SOFA file in one line of code.
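For example, read the file downloaded earlier:
s = sofaread(filename)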
s =
audio.sofa.SimpleFreeFieldHRIR handle with properties:
Click "Show all properties" in the display above to see the rest of the properties from the SOFA file.
ans = 1×3
2114 2 256
You access other variables in a similar fashion. For example, read the source positions along with the
coordinate system used to express them:
srcPositions = s.SourcePosition;
size(srcPositions)
ans = 1×2
2114 3
srcPositions(1,:)
ans = 1×3
0 -90.0000 1.2000
s.SourcePositionType
ans =
'spherical'
s.SourcePositionUnits
ans =
"degree, degree, meter"
figure;
plotGeometry(s)
Alternatively, specify input indices to restrict the plot to desired source locations. For example, plot
the source positions located in the median plane.
figure;
idx = findMeasurements(s,Plane="median");
plotGeometry(s,MeasurementIndex=idx)
The file in this example uses the SimpleFreeFieldHRIR convention, which stores impulse response
measurements in the time domain as FIR filters.
s.SOFAConventions
ans =
"SimpleFreeFieldHRIR"
s.DataType
ans =
"FIR"
Plot the impulse response of the first 3 measurements corresponding to the second receiver using
impz.
figure;
impz(s, MeasurementIndex=1:3,Receiver=2)
It is straightforward to compute and plot the frequency response of the impulse responses using
freqz.
Compute the frequency response of the first measurement for both ears. Use a frequency response
length of 512.
[H,F] = freqz(s, Receiver=1:2, NPoints=512);
H is the complex frequency response array. F is the vector of corresponding frequencies (in Hertz).
size(H)
ans = 1×3
512 1 2
size(F)
ans = 1×2
512 1
You can also use freqz to plot the frequency response by calling the function with no output
arguments.
Plot the frequency response of the first 3 measurements in the horizontal plane at zero elevation for
the first receiver.
figure;
idx = findMeasurements(s, Plane="horizontal");
freqz(s, MeasurementIndex=idx(1:3), Receiver=1)
It is often useful to compute and visualize the magnitude spectra of HRTF data in a specific plane in
space.
For example, compute the spectrum in the horizontal plane (corresponding to an elevation angle
equal to zero) for the first receiver. Use a frequency response length of 2048.
[S,F,azi] = spectrum(s, Plane="horizontal",Receiver=1, NPoints=2048); %#ok
• S is the horizontal plane spectrum of size 2048-by-L, where L is the number of measurements in
the specified plane.
• F is the frequency vector of length 2048.
• azi is a vector of length L containing the azimuth angles (in degrees) corresponding to the
horizontal plane measurements.
It is often useful to visualize the decay of the HRTF responses over time using an energy-time curve
(ETC).
ETC = energyTimeCurve(s);
size(ETC)
ans = 1×2
256 96
The first dimension of ETC is equal to the length of the impulse response. The second dimension of
ETC is equal to the number of points in the specified plane.
figure;
energyTimeCurve(s)
Interaural time difference (ITD) is the difference in arrival time of a sound between two ears. It is an
important binaural cue for sound source localization.
figure;
interauralTimeDifference(s, Specification="measurements");
You can also compute and plot the ITD in a specified plane. For example, plot the ITD in the
horizontal plane at an elevation of 35 degrees.
figure
interauralTimeDifference(s, Plane="horizontal",PlaneOffsetAngle=35)
Interaural level difference (ILD) measures the difference, in decibels, of the received signals at the
two ears in the desired plane.
[ILD,F] = interauralLevelDifference(s,NPoints=1024);
size(ILD)
ans = 1×2
1024 96
figure
interauralLevelDifference(s)
HRTF measurements are highly frequency selective. It is often useful to compare the received power
level of different frequencies in space.
Call directivity to plot the received level at 200 Hz for all measurements.
figure;
directivity(s, 200, Specification="measurements")
figure;
You can also compute and plot the directivity for selected planes. For example, plot the directivity at
700 Hz and 1200 Hz for the horizontal plane at an elevation of 35 degrees.
figure;
The HRTF measurements in the SOFA file correspond to a finite number of azimuth/elevation angle
combinations. It is possible to interpolate the data to any desired spatial location using 3-D HRTF
interpolation with interpolateHRTF.
desiredAz = [-120;-60;0;60;120;0;-120;120];
desiredEl = [-90;90;45;0;-45;0;45;45];
desiredPosition = [desiredAz, desiredEl];
Calculate the head-related impulse response (HRIR) using the vector base amplitude panning
interpolation (VBAP) algorithm at a desired source position.
figure;
interpolateHRTF(s, desiredPosition);
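To reuse the interpolated impulse responses in the filtering below, also store them in a variable; the Algorithm option shown here is an assumption consistent with the VBAP description above.
interpolatedIR = interpolateHRTF(s,desiredPosition,Algorithm="VBAP"); % Algorithm value assumed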
Filter a mono input through the interpolated impulse responses to model a moving source.
Create an audio file sampled at 48 kHz for compatibility with the HRTF dataset.
desiredFs = s.SamplingRate;
[x,fs] = audioread("Counting-16-44p1-mono-15secs.wav");
y = audioresample(x,InputRate=fs,OutputRate=desiredFs);
y = y./max(abs(y));
audiowrite("Counting-16-48-mono-15secs.wav",y,desiredFs);
fileReader = dsp.AudioFileReader("Counting-16-48-mono-15secs.wav");
deviceWriter = audioDeviceWriter(SampleRate=fileReader.SampleRate);
spatialFilter = dsp.FrequencyDomainFIRFilter(squeeze(interpolatedIR(1,:,:)),SumFilteredOutputs=false);
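The loop below switches the source position every samplesPerPosition samples. One possible choice, holding each position for roughly two seconds (an assumption) while staying aligned with the reader frame size, is:
% Assumed: hold each position for about two seconds, rounded down to a whole
% number of reader frames so the mod test in the loop fires exactly.
samplesPerPosition = 2*desiredFs;
samplesPerPosition = samplesPerPosition - rem(samplesPerPosition,fileReader.SamplesPerFrame);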
sourcePositionIndex = 1;
samplesRead = 0;
while ~isDone(fileReader)
audioIn = fileReader();
samplesRead = samplesRead + fileReader.SamplesPerFrame;
audioOut = spatialFilter(audioIn);
deviceWriter(audioOut);
if mod(samplesRead,samplesPerPosition) == 0
sourcePositionIndex = sourcePositionIndex + 1;
spatialFilter.Numerator = squeeze(interpolatedIR(sourcePositionIndex,:,:));
end
end
Simulate a sound source moving in the horizontal plane, with an initial azimuth of -90 degrees.
Gradually increase the azimuth as the simulation is running.
Compute the starting impulse responses based on the initial source position.
index = 1;
loc = [-90 0];
Execute the simulation loop. Shift the source azimuth by 30 degrees every 100 loop iterations. Use interpolateHRTF to estimate the new desired impulse responses.
while ~isDone(fileReader)
index=index+1;
frame = fileReader();
frame = frame(:,1);
audioOut = spatialFilter(frame);
deviceWriter(audioOut);
if mod(index,100)==0
loc(1)=loc(1)+30;
interpolatedIR = interpolateHRTF(s, ...
loc);
spatialFilter.Numerator = squeeze(interpolatedIR);
end
end
release(deviceWriter)
release(fileReader)
release(spatialFilter);
References
[1] Majdak, P., Zotter, F., Brinkmann, F., De Muynke, J., Mihocic, M., and Noisternig, M. (2022).
“Spatially Oriented Format for Acoustics 21: Introduction and Recent Advances,” J Audio Eng Soc 70,
565–584. DOI: 10.17743/jaes.2022.0026.
[2] https://www.york.ac.uk/sadie-project/database.html
[3] Majdak, P., De Muynke, J., Zotter, F., Brinkmann, F., Mihocic, M., and Ackermann, D. (2022). AES
standard for file exchange - Spatial acoustic data file format (AES69-2022), Standard of the Audio
Engineering Society. https://www.aes.org/publications/standards/search.cfm?docID=99
[4] Andreopoulou A, Katz BFG. Identification of perceptually relevant methods of inter-aural time
difference estimation. J Acoust Soc Am. 2017 Aug;142(2):588. doi: 10.1121/1.4996457. PMID:
28863557.
See Also
sofaread | sofawrite | interpolateHRTF
Related Examples
• “Room Impulse Response Simulation with the Image-Source Method and HRTF Interpolation”
on page 1-812
• “Binaural Audio Rendering Using Head Tracking” on page 1-67
Impulse Response Measurement Using a NI USB-4431 Device
This example shows how to measure an impulse response using a National Instruments™ (NI)
USB-4431 sound and vibration device and the Impulse Response Measurer app.
The app enables easy device setup, generation and playback of excitation signals, and simultaneous
recording of responses. You can examine the results, export them, or generate an equivalent MATLAB
script that can be modified to suit your requirements.
For this example, you need Audio Toolbox™ and Data Acquisition Toolbox™. You also need to have
installed the NI drivers (recommended) or the MATLAB NI support package.
The Impulse Response Measurer app allows you to measure an impulse response using the MLS or
exponential swept sine method with either a full-duplex audio device or a data acquisition device like
the NI USB-4431. Start the app by entering impulseResponseMeasurer at the command prompt.
You can also click the app icon on the Apps tab of the MATLAB® Toolstrip.
Start by selecting your Capture Device, in this case the National Instruments™ USB-4431. Set the
Sample Rate according to the device and your application, such as 48000 Hz to measure a
loudspeaker response. Select the I/O channels, for example Player Channels 0 for a loudspeaker on
ao0, and Recorder Channels 0 and 2 for inputs on ai0 and ai2. Select Swept Sine, as this method is
better suited to measure a loudspeaker where there is nonlinear distortion and background noise.
The other settings will depend on your device under test, but make sure that the Sweep start
frequency is not outside the range of the device (loudspeaker) as that can cause enough distortion to
impact the whole measurement.
Now click Capture to make the measurement. Zoom in on the Amplitude plot. Compare the response
of the first channel (blue) that has a loopback cable with the other channel (red) that is a
measurement microphone in front of a bookshelf loudspeaker on a table in a small room.
Next, instead of measuring the response of the loopback cable, use it to remove that device latency
from the loudspeaker measurement. Remove channel 0 from the Recorder Channels. Use the
Latency Compensation menu to remove the device latency from the measurement: enable the
loopback and specify channel 0 as output and channel 0 as input. Click Capture to make a new
measurement. You can also click on the "color" to change it to a more visible choice, like green. Reset
the zoom and zoom in at the beginning to better see the impulse in the time domain. The delay
introduced by the measurement system is removed, leaving the acoustic delay that depends on the
distance between the loudspeaker and the microphone.
Script Generation
If you want to automate measurements or further customize the DAQ settings in a way that is not
possible with the app, you can generate a script from the app and modify it.
Make your selections in the DEVICE, METHOD, METHOD SETTINGS and DISPLAY sections. You
can also set a linear or log scale for magnitude and phase responses (using the toolbar that appears
when hovering the mouse over these plots). For example, set channel 0 as input and output, and
disable the latency compensation. Then, click Generate Script in the toolstrip to create a new
document that will appear in the editor. You can make it a function by adding function capture = irm_script at the top and saving it as irm_script.m.
capture = irm_script;
Recording...
Computing results...
Using the generated code, you can integrate this functionality in your own project, or change options
that might not be provided by the app. For example, if you want to set the input mode to
"Microphone", change the daqInputType variable.
daqInputType = "Microphone";
daqOutputType = "Voltage";
for ii = 1:numel(recChMap)
ch = addinput(dq,daqDevID,daqInputs(recChMap(ii)),daqInputType);
ch.Coupling = "AC"; % select AC coupling
ch.Sensitivity = 1; % set sensitivity
end
You can also delete the legend. Now, run the modified script.
Recording...
Computing results...
See Also
Impulse Response Measurer
Plot Large Audio Files
This example shows how to plot large audio files in MATLAB. The first section shows a simple way to
read and plot all the data in an audio file. The next two sections show how to read and plot only the
envelope of an audio file without loading the entire audio file into memory.
Use the audioread function to read an 11 second MP3 audio file. The audioread function can
support other file formats. For a full list of viable formats, see “Supported File Formats for Import and
Export”.
filename = "RockDrums-48-stereo-11secs.mp3";
[y,fs] = audioread(filename);
Using the sample rate fs returned by audioread, create a duration vector t the same length as y to
represent elapsed time.
t = seconds(0:1/fs:(size(y,1)-1)/fs);
The audio file contains a stereo signal. Plot the two channels of audio data y as a function of time t.
plot(t,y)
title(filename)
xlabel("Time")
ylabel("Amplitude")
legend("Channel 1", "Channel 2")
xlim("tight")
ylim([-1 1])
When the audio file is very long (hours or even several minutes), reading and plotting all the data in
MATLAB might take significant time and memory resources. In such cases, you might not want to
read all the data in MATLAB, if the only purpose is to visualize the waveform. You can use the
audioEnvelope function to read an envelope of the audio file and plot only the overall envelope of
the audio waveform.
filename = "SoftGuitar-44p1_mono-10mins.ogg";
auInfo = audioinfo(filename)
[envMin,envMax,loc] = audioEnvelope(filename,NumPoints=2000);
audioEnvelope returns envMin and envMax containing the minimum and maximum sample values
over frames of length equal to floor(L/numPoints), where L is the length of the audio signal and
numPoints is the number of points returned by audioEnvelope. Connect envMin and envMax at
each point and plot them as a function of time t.
nChans = size(envMin,2);
envbars = [shiftdim(envMin,-1);
shiftdim(envMax,-1);
shiftdim(NaN(size(envMin)),-1)];
ybars = reshape(envbars,[],nChans);
t = seconds(loc/auInfo.SampleRate);
tbars = reshape(repmat(t,3,1),[],1);
plot(tbars,ybars);
title(filename,Interpreter="none")
xlabel("Time")
ylabel("Amplitude")
xlim("tight")
ylim([-1 1])
In the previous section, you plot the audio envelope of a 10-minute audio file using 2000 points. You
can zoom and pan the plot above, but when you zoom in, it does not fetch more data.
This section introduces a new custom chart, audioplot, which plots any audio file using the audio
envelope technique and also makes it interactive so that when you zoom or pan, the plot fetches more
data from the audio file and updates the visual as needed. The custom chart audioplot is a subclass
of the ChartContainer base class. By inheriting from the ChartContainer base class, instances of
audioplot are members of the graphics object hierarchy and can be embedded in any MATLAB
figure alongside other graphics objects. For more information, see “Chart Development Overview”.
audioplot displays audio data using two axes with interactive features. The top axes has panning
and zooming enabled along the x dimension to help examine a region of interest. The bottom axes
displays a plot over the entire time range along with a light blue time window, which indicates the
display range in the top axes.
• AudioSource - A public and dependent property that stores the audio file name or a numeric
array representing audio data.
• SampleRate - A public and dependent property that stores the sampling rate of the audio signal
in hertz. This property is read-only when the AudioSource property is an audio file name.
• DisplayLimits - A public property that sets the limits of the top axes and the width of the time
window in the bottom axes.
• WaveformAxes and PannerAxes - Read-only properties that store the axes objects.
filename = "SoftGuitar-44p1_mono-10mins.ogg";
ap = audioplot(filename);
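The fetch-on-zoom behavior can be sketched with an ordinary axes callback. The function below is not the audioplot implementation; it is a minimal illustration, with hypothetical function and argument names, that re-reads only the visible sample range of the file and recomputes a fixed number of envelope points whenever the waveform axes limits change.
function updateWaveform(ax,filename,fs,numPoints)
% Re-read only the samples that are currently visible and redraw a
% min/max envelope with a fixed number of points.
xl = seconds(ax.XLim); % visible time range in seconds (duration ruler assumed)
info = audioinfo(filename);
s1 = max(1,floor(xl(1)*fs)+1); % first visible sample
s2 = min(info.TotalSamples,ceil(xl(2)*fs)); % last visible sample
y = audioread(filename,[s1 s2]); % read only the visible samples
blk = max(1,floor(size(y,1)/numPoints)); % samples per envelope point
n = floor(size(y,1)/blk)*blk;
yb = reshape(y(1:n,1),blk,[]); % one column per envelope point
t = seconds(((s1-1):blk:(s1-2+n))/fs); % time of the first sample in each block
plot(ax,t,min(yb,[],1),t,max(yb,[],1)) % redraw the envelope
end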
See Also
audioEnvelope | Audio Viewer
Audio Event Classification Using TensorFlow Lite on Raspberry Pi
This example demonstrates audio event classification using a pretrained deep neural network,
YAMNet, from TensorFlow™ Lite library on Raspberry Pi™. You load the TensorFlow Lite model and
predict the class for the given audio frame on Raspberry Pi using a processor-in-the-loop (PIL)
workflow. To generate code on Raspberry Pi, you use Embedded Coder®, MATLAB® Support
Package for Raspberry Pi Hardware and Deep Learning Toolbox Interface for TensorFlow Lite. Refer
to Audio Classification and yamnet classification for more details on the YAMNet model.
Third-Party Prerequisites
• Raspberry Pi hardware
• TensorFlow Lite library (on the target ARM® hardware)
• Pretrained TensorFlow Lite Model
Download YAMNet
component = "audio";
filename = "yamnet.zip";
localfile = matlab.internal.examples.downloadSupportFile(component,filename);
downloadFolder = fileparts(localfile);
if exist(fullfile(downloadFolder,"yamnet"),"dir") ~= 7
unzip(localfile,downloadFolder)
end
addpath(fullfile(downloadFolder,"yamnet"))
Use audioread to read the audio file data and listen to it using the sound function.
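The read-and-listen code is not shown in this excerpt. A minimal sketch follows; the file name is a placeholder, and any recording containing a mix of everyday sounds works.
[audioIn,fs] = audioread("multipleSounds-16-16-mono-18secs.wav"); % placeholder file name
sound(audioIn,fs)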
Call classifySound to detect the different sounds present in the given audio.
detectedSounds = classifySound(audioIn,fs)
You detected the different sounds in the pre-recorded audio in offline mode. The later sections of this
example demonstrate audio event classification in a real-time scenario where you process one
audio frame at a time.
You load the TFLite YAMNet model using loadTFLiteModel (Deep Learning Toolbox). As mentioned on the
TFLiteModel (Deep Learning Toolbox) page, you set the Mean and StandardDeviation properties of the
TFLite model to 0 and 1, respectively, so that the model applies no additional normalization to the input.
modelFileName = "lite-model_yamnet_classification_tflite_1.tflite";
modelFullPath = fullfile(downloadFolder,"yamnet",modelFileName);
TFLiteYAMNet = loadTFLiteModel(modelFullPath);
TFLiteYAMNet.Mean = 0;
TFLiteYAMNet.StandardDeviation = 1;
Use yamnetGraph to load all the audio event classes supported by YAMNet as an array of strings.
Set the sample rate (in hertz), the input audio frame length, and the frame duration (in seconds)
supported by YAMNet.
modelSamplingRate = 16000;
frameDimension = TFLiteYAMNet.InputSize{1};
frameLength = frameDimension(2);
frameDuration = frameLength/modelSamplingRate;
Set the classificationRate, that is, the number of classifications per second. Because the number of hops
per second must equal the classification rate, set the hopDuration to the reciprocal of
classificationRate.
classificationRate = 10;
hopDuration = 1/classificationRate;
hopLength = floor(modelSamplingRate*hopDuration);
overlapLength = frameLength - hopLength;
You use a dropdown control to list the different input audio files. Use dsp.AudioFileReader to read
the audio file data.
afr = dsp.AudioFileReader( );
audioInSamplingRate = afr.SampleRate;
audioFileInfo = audioinfo(afr.Filename);
audioInFrameLength = floor(audioInSamplingRate*hopDuration);
afr.SamplesPerFrame = audioInFrameLength;
predictedAudioClassesDuration = 1;
audioClassBufferLength = floor(predictedAudioClassesDuration*classificationRate);
audioClassBuffer = dsp.AsyncBuffer(audioClassBufferLength);
audioBufferYamnet = dsp.AsyncBuffer(2*frameLength);
indexOfSilenceAudioClass = find(audioEventClasses == "Silence");
write(audioClassBuffer,ones(audioClassBufferLength,1)*indexOfSilenceAudioClass);
Set up a dsp.SampleRateConverter System object to convert the sample rate of the input audio
to 16000 Hz, because YAMNet is trained on audio signals sampled at 16000 Hz.
src = dsp.SampleRateConverter('InputSampleRate',audioInSamplingRate,...
'OutputSampleRate',modelSamplingRate,...
'Bandwidth',10000);
You feed one audio frame at a time to represent the system as it would be deployed in a real-time
embedded system. In the streaming loop, you first load one hop of audio samples and feed them to the
dsp.SampleRateConverter to convert the sample rate to 16000 Hz. The resampled frame is
written to a FIFO buffer, audioBufferYamnet. You load overlapping frames of length
frameLength from this buffer and feed them to YAMNet. The TensorFlow Lite YAMNet model outputs
a predicted score vector that contains a score for each audio event class. You find the index of
the maximum score in the score vector and write it to the FIFO buffer, audioClassBuffer. The
predicted index is the statistical mode of the contents of audioClassBuffer. The predicted
audio event class is the value of the audioEventClasses array at the predicted index. You visualize the
resampled audio frame in the time scope and display the predicted audio event class as the title of the
time scope.
while ~isDone(afr)
audioInFrame = afr();
resampledAudioInFrame = src(audioInFrame);
write(audioBufferYamnet,resampledAudioInFrame);
audioInYamnetFrame = read(audioBufferYamnet,frameLength,overlapLength);
scoresTFLite = TFLiteYAMNet.predict(audioInYamnetFrame');
[~, audioClassIndex] = max(scoresTFLite);
write(audioClassBuffer,audioClassIndex);
preditedSoundClass = audioEventClasses(mode(audioClassBuffer.peek(audioClassBufferLength)));
timeScope(resampledAudioInFrame);
timeScope.Title = char(preditedSoundClass);
drawnow
end
hide(timeScope)
reset(timeScope)
reset(afr)
type predictAudioClassUsingYAMNET.m
%#codegen
if isempty(TFLiteYAMNETModel)
TFLiteYAMNETModel = loadTFLiteModel("lite-model_yamnet_classification_tflite_1.tflite");
TFLiteYAMNETModel.NumThreads = 4;
TFLiteYAMNETModel.Mean = 0;
TFLiteYAMNETModel.StandardDeviation = 1;
scores = predict(TFLiteYAMNETModel,audioIn);
[~, audioClassIndex] = max(scores);
write(AudioClassBuffer,audioClassIndex);
predictedAudioClassHistory = peek(AudioClassBuffer,audioClassHistoryBufferLength);
preditedAudioClassIndex = mode(predictedAudioClassHistory);
end
Use the Raspberry Pi Support Package function, raspi, to create a connection to your Raspberry Pi.
In the following code, replace raspiname with the name of your Raspberry Pi board, pi with your user name, and password with your password.
if ~(exist("r","var"))
r = raspi("raspiname","pi","password");
end
Create a coder.hardware (MATLAB Coder) object for Raspberry Pi and attach it to the code
generation configuration object.
hw = coder.hardware("Raspberry Pi");
cfg.Hardware = hw;
buildDir = "~/remoteBuildDir";
cfg.Hardware.BuildDir = buildDir;
Copy TensorFlow Lite Model to the Target Hardware and the Current Directory
Copy the TensorFlow Lite model to the Raspberry Pi board. On the hardware board, set the
environment variable TFLITE_MODEL_PATH to the location of the TensorFlow Lite model. For more
information on setting environment variables, see “Prerequisites for Deep Learning with TensorFlow
Lite Models” (Deep Learning Toolbox).
Use the putFile method of the raspi object to copy the TFLite model to the Raspberry Pi.
putFile(r,char(modelFullPath),'/home/pi')
Copy the model to the current directory as it is required by codegen (MATLAB Coder) during code
generation.
copyfile(modelFullPath)
You use coder.Constant (MATLAB Coder) to make the constant input arguments compile-time
constants in the generated code. Run the codegen (MATLAB Coder) command to generate a PIL
MEX function predictAudioClassUsingYAMNET_pil.
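The exact codegen command is not fully listed in this excerpt. Based on the arguments of the predictAudioClassUsingYAMNET_pil call used later in the example, it likely has the following form; cfg is assumed to be an Embedded Coder code generation configuration object with PIL verification enabled (its creation is not shown here).
codegen -config cfg predictAudioClassUsingYAMNET ...
    -args {ones(1,15600,"single"), ...
    coder.Constant(audioClassBufferLength), ...
    coder.Constant(indexOfSilenceAudioClass)} -report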
You call the generated PIL function predictAudioClassUsingYAMNET_pil to stream one audio
frame at a time to represent the system as it would be deployed in a real-time embedded system.
show(timeScope)
while ~isDone(afr)
audioInFrame = afr();
resampledAudioInFrame = src(audioInFrame);
write(audioBufferYamnet,resampledAudioInFrame);
audioInYamnetFrame = read(audioBufferYamnet,frameLength,overlapLength);
predictedSoundClassIndex = predictAudioClassUsingYAMNET_pil(single(audioInYamnetFrame'),audio
preditedSoundClass = audioEventClasses(predictedSoundClassIndex);
timeScope(resampledAudioInFrame)
timeScope.Title = char(preditedSoundClass);
drawnow
end
hide(timeScope)
### Host application produced the following standard output (stdout) and standard error (stderr)
You use the PIL workflow to profile the predictAudioClassUsingYAMNET function. You enable
profiling in the code generation configuration and generate a PIL function that keeps a log of the
execution profile.
cfg.CodeExecutionProfiling = true;
codegen -config cfg predictAudioClassUsingYAMNET -args {ones(1,15600,"single"), coder.Constant(au
You call the generated PIL function multiple times to get the average execution time.
numCalls = 100;
for k = 1:numCalls
x = pinknoise(1,15600,"single");
scores = predictAudioClassUsingYAMNET_pil(x,audioClassBufferLength,indexOfSilenceAudioClass);
end
clear predictAudioClassUsingYAMNET_pil
### Host application produced the following standard output (stdout) and standard error (stderr)
executionProfile = getCoderExecutionProfile('predictAudioClassUsingYAMNET');
report(executionProfile, ...
'Units','Seconds', ...
'ScaleFactor','1e-03', ...
'NumericFormat','%0.4f');
In the code execution profiling report, you find that the average execution time taken by
predictAudioClassUsingYAMNET is 24.29 ms, which is within the budget of 100 ms. You
calculate the budget as the reciprocal of the classification rate. The performance is measured on
Raspberry Pi 3 Model B Plus Rev 1.2.
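For reference, the budget follows directly from the classification rate used earlier:
budget = 1/classificationRate % 10 classifications per second -> 0.1 s (100 ms) per classification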
Release the buffers, time scope, and other System objects used in the example.
release(audioBufferYamnet)
release(audioClassBuffer)
release(timeScope)
release(src)
release(afr)
Deploy Smart Speaker System on Raspberry Pi Using Simulink
This example demonstrates how to deploy a smart speaker system on Raspberry Pi® using
Simulink®. A smart speaker is a speaker that can be controlled by your voice. You run the smart
speaker Simulink model on Raspberry Pi in External Mode. The voice commands are captured
through the USB microphone connected to your Raspberry Pi board. You can optionally input voice
commands through the pre-recorded files. The smart speaker model plays the audio on the speaker
connected to the Raspberry Pi. You make the smart speaker play music with the command "Go". You
make it stop playing music by saying "Stop". You increase or decrease the music volume with the
commands "Up" and "Down", respectively. For details about modeling the various modules used in the
smart speaker model, see “Model Smart Speaker in Simulink” on page 1-923.
The model can be divided into four sub-modules that perform these sub-tasks:
1 Capture 16-bit speech samples and convert them to single precision format in the range [-1,1)
2 Recognize speech commands
3 Prepare audio frame based on the recognized speech commands
4 Convert audio samples to 16-bit signed integer format and play the audio on Raspberry Pi
modelName = "AudioSmartSpeakerOnRaspberryPi";
open_system(modelName)
The smart speaker model uses the ALSA Audio Capture (Simulink) block to capture the voice
commands from a microphone connected to your Raspberry Pi board. The model uses the ALSA Audio
Playback (Simulink) block to play the audio on a speaker connected to your Raspberry Pi board. The
ALSA Audio IO blocks come with Simulink Support Package for Raspberry Pi Hardware. After
connecting the microphone and speaker to your Raspberry Pi board, you list the audio capture and
audio playback devices using listAudioDevices (Simulink).
r = raspi("raspiname","pi","password");
audioCaptureDevicesList = listAudioDevices(r,"capture");
audioPlaybackDevicesList = listAudioDevices(r,"playback");
You set the Device name in the ALSA Audio Capture:Block Parameters dialog to the device of
your choice from audioCaptureDevicesList. Similarly, you configure the Device name in the
ALSA Audio Playback:Block Parameters dialog to the playback device of your choice from
audioPlaybackDevicesList.
Display the details of an audio capture and audio playback device from audioCaptureDevicesList
and audioPlaybackDevicesList.
audioCaptureDevicesList(1)
ans =
Name: 'USB-Audio-LogitechUSBHeadsetH340-LogitechInc.LogitechUSBHeadsetH340atusb-0000
Device: '2,0'
Channels: {'2'}
BitDepth: {'16-bit integer'}
SamplingRate: {'44100'}
audioPlaybackDevicesList(3)
ans =
Name: 'USB-Audio-LogitechUSBHeadsetH340-LogitechInc.LogitechUSBHeadsetH340atusb-0000
Device: '2,0'
Channels: {'2'}
BitDepth: {'16-bit integer'}
SamplingRate: {'44100'}
To use the above devices, you set the Device name in both the ALSA Audio Capture: Block Parameters
and the ALSA Audio Playback: Block Parameters dialogs to plughw:2,0. You set the Audio sampling
frequency (Hz) to 16000 as the subsequent convolutional neural network (CNN) used to recognize
voice commands was trained on a 16000 Hz sampling frequency.
The model provides a manual switch to change the audio source from the microphone to the pre-recorded audio files.
You select the voice commands using the Rotary switch. The model uses four Audio File Read
(Simulink) blocks to read the audio files go.wav, stop.wav, up.wav, and down.wav. Note that
Audio File Read (Simulink) block is included in Simulink Support Package for Raspberry Pi
Hardware.
The ALSA Audio Capture (Simulink) and Audio File Read (Simulink) blocks output 16-bit signed integer
audio samples with values in the interval [-2^15, 2^15). You cast the output of these blocks
to single-precision data and multiply it by 2^-15 to change the numerical range to [-1, 1). You
change the numerical range because the subsequent blocks expect the audio in the range [-1, 1).
The ALSA Audio Playback (Simulink) block expects 16-bit signed integers as input, so the output
of the preceding block that prepares the audio frame must be converted to 16-bit signed integers. The
range of the floating-point audio frame samples is [-1, 1). You multiply the floating-point audio
frame samples by 2^15 to bring the range to [-2^15, 2^15). After multiplying, you typecast the product
to the int16 data type. These int16 audio frame samples can be fed to the ALSA Audio Playback (Simulink)
block. The AudioSmartSpeakerOnRaspberryPi model uses the Gain (Simulink) block to multiply the
audio samples by the constant 2^-15 or 2^15. It uses the Data Type Conversion (Simulink) block to typecast
the audio samples to single or int16.
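For reference, the scaling performed by the Gain and Data Type Conversion blocks corresponds to these MATLAB operations (a sketch only, not part of the Simulink model):
xSingle = single(xInt16)*2^-15; % int16 samples -> single precision in the range [-1, 1)
yInt16 = int16(ySingle*2^15); % single precision in [-1, 1) -> int16 samples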
Configure Smart Speaker Model Settings and Run the Model in External Mode
• Select a solver that supports code generation. Set Solver to auto (Automatic solver
selection) and Solver type to Fixed-step.
• Select Code Generation and set the System Target File to ert.tlc whose Description is
Embedded Coder.
• Set the Language to C++, which will automatically set the Language Standard to C++11
(ISO).
• In Configuration > Hardware Implementation, set the Hardware board to Raspberry Pi
and enter your Raspberry Pi credentials in the Board Parameters.
• In the same window, set External mode > Communication interface to XCP on TCP/IP.
• Check Signal logging in Configuration > Data Import/Export to enable signal monitoring in
External Mode.
• Go to the Hardware tab and click Monitor & Tune to run the model in external mode. A programmatic sketch of these settings follows this list.
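The sketch below applies some of the same settings with set_param. Only parameters whose names are well established are shown; the solver-selection and external-mode communication options are omitted, and the calls are an illustration rather than part of the example.
set_param(modelName,"SolverType","Fixed-step"); % fixed-step solver
set_param(modelName,"SystemTargetFile","ert.tlc"); % Embedded Coder target
set_param(modelName,"TargetLang","C++"); % C++ code generation
set_param(modelName,"HardwareBoard","Raspberry Pi"); % hardware board selection
set_param(modelName,"SignalLogging","on"); % enable signal logging for monitoring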
Choose an App to Label Ground Truth Data
One key consideration is the type of data that you want to label.
• If your data is an image collection, use the Image Labeler app. An image collection is an
unordered set of images that can vary in size. For example, you can use the app to label images of
books for training a classifier. The Image Labeler can also handle very large images (at least one
dimension >8K).
• If your data is a single video or image sequence, use the Video Labeler app. An image sequence is
an ordered set of images that resembles a video. For example, you can use this app to label a
video or image sequence of cars driving on a highway for training an object detector.
• If your data includes multiple time-overlapped signals, such as videos, image sequences, or lidar
signals, use the Ground Truth Labeler app. For example, you can label data for a single scene
captured by multiple sensors mounted on a vehicle.
• If your data is only a lidar signal, use the Lidar Labeler. For example, you can use this app to label
data captured from a point cloud sensor.
• If your data consists of single-channel or multichannel one-dimensional signals, use the Signal
Labeler. For example, you can label biomedical, speech, communications, or vibration data. You
can also use Signal Labeler to perform audio-specific tasks, such as speech detection and speech-
to-text transcription.
• If your data is a 2-D medical image or image series, or a 3-D medical image volume, use the
Medical Image Labeler. For example, you can label computed tomography (CT) image volumes of
the chest to train a semantic segmentation network.
See Also
More About
• “Get Started with the Image Labeler” (Computer Vision Toolbox)
• “Get Started with the Video Labeler” (Computer Vision Toolbox)
• “Get Started with Ground Truth Labelling” (Automated Driving Toolbox)
• “Get Started with the Lidar Labeler” (Lidar Toolbox)
• “Using Signal Labeler App”
• “Label Spoken Words in Audio Signals”
• “Get Started with Medical Image Labeler” (Medical Imaging Toolbox)
Compare Speaker Separation Models
Compare the performance, size, and speed of deep learning speaker separation models.
Introduction
Speaker separation is a challenging and critical speech processing task. Modern speaker separation
methods use deep learning to achieve strong results. In this example, you compare four speaker
separation models: a time-frequency masking model, a 2-speaker Conv-TasNet model, a one-and-rest
Conv-TasNet (Conv-TasNet OR) model, and a SepFormer model.
You can find the training recipes for the time-frequency masking model and the 2-speaker Conv-TasNet
model in "Cocktail Party Source Separation Using Deep Learning Networks" on page 1-349 and
“End-to-End Deep Speaker Separation” on page 1-85, respectively. You can perform speaker
separation using the one-and-rest Conv-TasNet (Conv-TasNet OR) model and the SepFormer model
with the separateSpeakers function.
To spot-check model performance, load test data consisting of two speakers and their mix. Listen to
the speakers individually and mixed. Plot the mix and individual speakers using the supporting
function, plotSpeakerSeparation on page 1-1043.
[audioIn,fs] = audioread("MultipleSpeakers-16-8-4channel-5secs.flac");
t1 = audioIn(:,2);
t2 = audioIn(:,3);
x = t1 + t2;
x = x/max(abs(x));
plotSpeakerSeparation(t1,t2,x)
sound(t1,fs),pause(5)
sound(t2,fs),pause(5)
sound(x,fs),pause(5)
Load Models
Time-Frequency Masking
Load the pretrained speaker separation weights for the time-frequency masking model. The inference
model is defined in the supporting function separateSpeakersTimeFrequency on page 1-1044. To
examine and train this model, see “Cocktail Party Source Separation Using Deep Learning Networks”
on page 1-349.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio/examples","cocktailpartyfc.z
dataFolder = tempdir;
tfNetFolder = fullfile(dataFolder,"CocktailPartySourceSeparation");
unzip(downloadFolder,tfNetFolder)
Separate the mixed test signal and then plot and listen to the results.
y = separateSpeakersTimeFrequency(x,tfNetFolder);
plotSpeakerSeparation(t1,t2,x,y(:,1),y(:,2))
sound(y(:,1),fs),pause(5)
sound(y(:,2),fs),pause(5)
Conv-TasNet
Load the pretrained speaker separation weights for the Conv-TasNet model. The inference model is
defined in the supporting function separateSpeakersConvTasNet on page 1-1046. To examine and
train this model, see “End-to-End Deep Speaker Separation” on page 1-85.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","speechSeparation.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
convtasNetFolder = fullfile(dataFolder,"speechSeparation");
Separate the mixed test signal and then plot and listen to the results.
y = separateSpeakersConvTasNet(x,convtasNetFolder);
plotSpeakerSeparation(t1,t2,x,y(:,1),y(:,2));
sound(y(:,1),fs),pause(5)
sound(y(:,2),fs),pause(5)
The separateSpeakers function uses three models under the hood: a 2-speaker SepFormer model,
a 3-speaker SepFormer model, and a one-and-rest Conv-TasNet model. To use the one-and-rest Conv-
TasNet model, specify NumSpeakers as 1 or do not specify NumSpeakers. When NumSpeakers
is not specified, the function passes the "rest" from the separation back through the model until no
more speakers are detected. For the purposes of this example, call separateSpeakers twice with
NumSpeakers=1 for both calls.
Separate the mixed test signal and then plot and listen to the results. If you have not downloaded the
required files to use separateSpeakers, an error is thrown with the link to the download.
[y1,r] = separateSpeakers(x,fs,NumSpeakers=1);
[y2,r] = separateSpeakers(r,fs,NumSpeakers=1);
plotSpeakerSeparation(t1,t2,x,y1,y2)
sound(y1,fs),pause(5)
sound(y2,fs),pause(5)
SepFormer
Call separateSpeakers with NumSpeakers=2 to perform speaker separation using the 2-speaker
SepFormer model.
Separate the mixed test signal and then plot and listen to the results.
y = separateSpeakers(x,fs,NumSpeakers=2);
plotSpeakerSeparation(t1,t2,x,y(:,1),y(:,2))
sound(y(:,1),fs),pause(5)
sound(y(:,2),fs),pause(5)
Compare Models
Compare the computation time, model size, and performance of the models.
Computation Time
To compare execution times for different duration inputs, use the supporting function
compareComputationTime on page 1-1048. If the execution time is less than the input duration,
then the model can run in real time (without dropping samples).
compareComputationTime(DurationToTest=[1,5,10], ...
CompareCPU= , ...
CompareGPU= , ...
TimeFrequencyMaskNetFolder=tfNetFolder, ...
ConvTasNetFolder=convtasNetFolder)
Model Size
Compare the size of all models. Note that the Conv-TasNet model trained in the example and the
Conv-TasNet OR model provided with the separateSpeakers function are quite different in size. In
addition to different loss functions and training recipes, Conv-TasNet OR and Conv-TasNet are both
variations on the architecture described in [1] on page 1-1049. Most noticeably, the Conv-TasNet OR
model uses 24 convolutional blocks while the example Conv-TasNet model uses 32.
timefrequency_size = dir(fullfile(tfNetFolder,"CocktailPartyNet.mat")).bytes/1e6;
convtasnet_size = dir(fullfile(convtasNetFolder,"paramsBest.mat")).bytes/1e6;
convtasnet_or_size = dir(which("convtasnet-librimix-orpit.mat")).bytes/1e6;
sepformer_size = dir(which("sepformer-libri2mix-upit.mat")).bytes/1e6;
figure
bar(n,[timefrequency_size,convtasnet_size,convtasnet_or_size,sepformer_size])
grid on
ylabel("Size (MB)")
title("Disk Memory")
To compare model performance, download the LibriSpeech [3] on page 1-1049 test-clean dataset. The
dataset consists of files of single speakers reading.
downloadDatasetFolder = tempdir;
datasetFolder = fullfile(downloadDatasetFolder,"LibriSpeech","test-clean");
filename = "test-clean.tar.gz";
url = "http://www.openSLR.org/resources/12/" + filename;
if ~datasetExists(datasetFolder)
gunzip(url,downloadDatasetFolder);
unzippedFile = fullfile(downloadDatasetFolder,filename);
untar(unzippedFile{1}(1:end-3),downloadDatasetFolder);
end
ads = audioDatastore(datasetFolder,IncludeSubfolders=true);
Test the scale-invariant signal-to-noise ratio (SI-SNR) [6] on page 1-1049 performance of the models on a
sampling of the dataset. SI-SNR is a popular objective metric for the quality of speaker separation
algorithms. If a GPU and Parallel Computing Toolbox™ are available, use the GPU to speed up
processing.
The testModel on page 1-1041 supporting function combines randomly selected audio files, mixes
them, passes the mixed data through the specified model, and then calculates the permutation-
invariant SI-SNR.
The SepFormer model achieves the best results (higher SNR is better).
tf_sisnr = testModel(ads,@(x)separateSpeakersTimeFrequency(x,tfNetFolder),UseGPU=canUseGPU);
convtasnet_sisnr = testModel(ads,@(x)separateSpeakersConvTasNet(x,convtasNetFolder),UseGPU=canUse
convtasnet_orpit_sisnr = testModel(ads,@(x)separateSpeakers(x,8e3,NumSpeakers=1),UseGPU=canUseGPU
sepformer_sisnr = testModel(ads,@(x)separateSpeakers(x,8e3,NumSpeakers=2),UseGPU=canUseGPU);
figure
bar(n,[tf_sisnr,convtasnet_sisnr,convtasnet_orpit_sisnr,sepformer_sisnr])
grid on
ylabel("SI-SNR")
title("Separation Performance (Test Dataset)")
Supporting Functions
Test Model
arguments
ads
model
options.OneAndRest = false
options.UseGPU = false
options.NumTestPoints = 50
options.TestDuration = []
options.SignalRatio = [0.6 0.75 0.85 1]
end
total_sisnr = zeros(options.NumTestPoints,1);
fn = ads.Files;
spkids = filenames2labels(fn,ExtractBefore="-");
rng default
for ii = 1:options.NumTestPoints
% Choose a random file for speaker 1
idx1 = randi(numel(fn));
fn1 = fn{idx1};
% Mix
x = t1 + t2;
x = x./max(abs(x));
end
[y1,y2] = gather(y1,y2);
end
testSISNR = mean(total_sisnr);
end
function plotSpeakerSeparation(t1,t2,x,y1,y2)
%plotSpeakerSeparation Plot the ground truth and predictions
arguments
t1
t2
x
y1 = []
y2 = []
end
fs = 8e3;
timeVector = ((0:size(t1,1)-1)/fs)';
tiledlayout(3,1)
nexttile()
plot(timeVector,x)
xlabel("Time (s)")
ylabel("Mix")
grid on
xlim tight
ylim([-1 1])
% Match the targets and predictions based on which set of pairs results in
% the best SI-SNR
if ~isempty(y1)
[~,reverseOrder] = uPIT(t1,y1,t2,y2);
if reverseOrder
ytemp = y1;
y1 = y2;
y2 = ytemp;
end
end
nexttile()
if ~isempty(y1)
plot(timeVector,t1,"-",timeVector,y1,"--")
legend("Target","Prediction")
else
plot(timeVector,t1)
end
ylabel("Speaker 1",FontWeight="bold")
xlabel("Time (s)")
grid on
xlim tight
ylim([-1 1])
nexttile()
if ~isempty(y2)
plot(timeVector,t2,"-",timeVector,y2,"--")
legend("Target","Prediction",Location="best")
else
plot(timeVector,t2)
end
ylabel("Speaker 2",FontWeight="bold")
xlabel("Time (s)")
grid on
xlim tight
ylim([-1 1])
end
t = t - mean(t);
y = y - mean(y);
t = sum(t.*y) .* y ./ (sum(y.^2)+eps);
z = 20*log((sqrt(sum(t.^2))+eps)./sqrt((sum((y-t).^2))+eps))/log(10);
end
v1 = SISNR(t1,y1);
v2 = SISNR(t2,y2);
m1 = mean([v1;v2]);
v1 = SISNR(t1,y2);
v2 = SISNR(t2,y1);
m2 = mean([v1;v2]);
[m,idx] = max([m1,m2]);
reverseOrder = idx==2;
end
persistent CocktailPartyNet
if isempty(CocktailPartyNet)
s = load(fullfile(pathToNet,"CocktailPartyNet.mat"));
CocktailPartyNet = s.CocktailPartyNet;
end
WindowLength = 128;
FFTLength = 128;
OverlapLength = 128-1;
win = hann(WindowLength,"periodic");
% Downsample to 4 kHz
mixR = resample(mix,1,2);
P0 = stft(mixR, ...
Window=win, ...
OverlapLength=OverlapLength,...
FFTLength=FFTLength, ...
FrequencyRange="onesided");
P = log(abs(P0) + eps);
MP = mean(P(:));
SP = std(P(:));
P = (P-MP)/SP;
seqLen = 20;
PSeq = zeros(1 + FFTLength/2,seqLen,1,0);
seqOverlap = seqLen;
loc = 1;
while loc < size(P,2)-seqLen
PSeq(:,:,:,end+1) = P(:,loc:loc+seqLen-1); %#ok
loc = loc + seqOverlap;
end
estimatedMasks = predict(CocktailPartyNet,PSeq);
estimatedMasks = estimatedMasks.';
estimatedMasks = reshape(estimatedMasks,1 + FFTLength/2,numel(estimatedMasks)/(1 + FFTLength/2));
mask1 = estimatedMasks;
mask2 = 1 - mask1;
P0 = P0(:,1:size(mask1,2));
P_speaker1 = P0.*mask1;
P_speaker2 = P0.*mask2;
OverlapLength=OverlapLength,...
FFTLength=FFTLength, ...
ConjugateSymmetric=true,...
FrequencyRange="onesided");
speaker2 = speaker2/max(speaker2);
speaker1 = resample(double(speaker1),2,1);
speaker2 = resample(double(speaker2),2,1);
N = numel(mix) - numel(speaker1);
mixToAdd = mix(end-N+1:end);
speaker1 = [speaker1;mixToAdd];
speaker2 = [speaker2;mixToAdd];
output = [speaker1,speaker2];
end
if ~isdlarray(input)
input = dlarray(input,"SCB");
end
x = dlconv(input,learnables.Conv1W,learnables.Conv1B,Stride=10);
x = relu(x);
x0 = x;
x = x - mean(x,2);
x = x./sqrt(mean(x.^2, 2) + 1e-5);
x = x.*learnables.ln_weight + learnables.ln_bias;
encoderOut = dlconv(x,learnables.Conv2W,learnables.Conv2B);
masks = dlconv(encoderOut,learnables.Conv3W,learnables.Conv3B);
masks = relu(masks);
mask1 = masks(:,1:256,:);
mask2 = masks(:,257:512,:);
out1 = x0.*mask1;
out2 = x0.*mask2;
weights = learnables.TransConv1W;
bias = learnables.TransConv1B;
output2 = dltranspconv(out1,weights,bias,Stride=10);
output1 = dltranspconv(out2,weights,bias,Stride=10);
output1 = gather(extractdata(output1));
output2 = gather(extractdata(output2));
output1 = output1./max(abs(output1));
output2 = output2./max(abs(output2));
output1 = trimOrPad(output1,numel(input));
output2 = trimOrPad(output2,numel(input));
output = [output1,output2];
end
% Conv:
conv1Out = dlconv(input,learnables.Conv1W,learnables.Conv1B);
% PRelu:
conv1Out = relu(conv1Out) - learnables.Prelu1.*relu(-conv1Out);
% BatchNormalization:
batchOut = batchnorm(conv1Out,learnables.BN1Offset,learnables.BN1Scale,state.BN1Mean,state.BN1Var
% Conv:
padding = [1 1] * 2^(mod(count,8));
dilationFactor = 2^(mod(count,8));
convOut = dlconv(batchOut,learnables.Conv2W,learnables.Conv2B,DilationFactor=dilationFactor,Paddi
% PRelu:
convOut = relu(convOut) - learnables.Prelu2.*relu(-convOut);
% BatchNormalization:
batchOut = batchnorm(convOut,learnables.BN2Offset,learnables.BN2Scale,state.BN2Mean,state.BN2Var)
% Conv:
output = dlconv(batchOut,learnables.Conv3W,learnables.Conv3B);
% Skip connection
output = output + input;
end
function y = trimOrPad(x,n)
%trimOrPad Trim or pad to desired length
end
function compareComputationTime(options)
%compareComputationTime Compare computation time
arguments
options.DurationToTest
options.CompareCPU
options.CompareGPU
options.TimeFrequencyMaskNetFolder
options.ConvTasNetFolder
end
fs = 8e3;
dur = options.DurationToTest;
if options.CompareCPU
tf.CPU = zeros(numel(dur),1);
convtas.CPU = zeros(numel(dur),1);
convtas_orpit.CPU = zeros(numel(dur),1);
sepformer.CPU = zeros(numel(dur),1);
for ii = 1:numel(dur)
x = pinknoise(dur(ii)*fs,"single");
tf.CPU(ii) = timeit(@()separateSpeakersTimeFrequency(x,options.TimeFrequencyMaskNetFolder
convtas.CPU(ii) = timeit(@()separateSpeakersConvTasNet(x,options.ConvTasNetFolder));
convtas_orpit.CPU(ii) = timeit(@()separateSpeakers(x,8e3,NumSpeakers=1,ConserveEnergy=fal
sepformer.CPU(ii) = timeit(@()separateSpeakers(x,8e3,NumSpeakers=2,ConserveEnergy=false))
end
convtas_orpit.CPU = 2*convtas_orpit.CPU; % Double to adjust for two-passes of one-and-rest.
end
if options.CompareGPU
tf.GPU = zeros(numel(dur),1);
convtas.GPU = zeros(numel(dur),1);
convtas_orpit.GPU = zeros(numel(dur),1);
sepformer.GPU = zeros(numel(dur),1);
for ii = 1:numel(dur)
x = gpuArray(pinknoise(dur(ii)*fs,"single"));
tf.GPU(ii) = gputimeit(@()separateSpeakersTimeFrequency(x,options.TimeFrequencyMaskNetFol
convtas.GPU(ii) = gputimeit(@()separateSpeakersConvTasNet(x,options.ConvTasNetFolder));
convtas_orpit.GPU(ii) = gputimeit(@()separateSpeakers(x,8e3,NumSpeakers=1,ConserveEnergy=
sepformer.GPU(ii) = gputimeit(@()separateSpeakers(x,8e3,NumSpeakers=2,ConserveEnergy=fals
end
convtas_orpit.GPU = 2*convtas_orpit.GPU; % Double to adjust for two-passes of one-and-rest.
end
numTiles = double(options.CompareCPU)+double(options.CompareGPU);
tlh = tiledlayout(numTiles,1);
environments = ["CPU","GPU"];
environments = environments([options.CompareCPU,options.CompareGPU]);
for ii = 1:numel(environments)
nexttile(tlh)
ee = environments(ii);
plot(dur,tf.(ee),'b',dur,convtas.(ee),'r',dur,convtas_orpit.(ee),'g',dur,sepformer.(ee),'k',
dur,tf.(ee),'bo',dur,convtas.(ee),'ro',dur,convtas_orpit.(ee),'go',dur,sepformer.(ee),'ko
legend("Time-Frequency Mask","Conv-TasNet","Conv-TasNet OR","SepFormer",Location="best")
xlabel("Input Duration (s)")
ylabel("Execution Time (s)")
title(ee + " Execution Time")
grid on
end
end
References
[1] Luo, Yi, and Nima Mesgarani. "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude
Masking for Speech Separation." IEEE/ACM Transactions on Audio, Speech, and Language
Processing 27, no. 8 (August 2019): 1256-66. https://doi.org/10.1109/TASLP.2019.2915167.
[3] Panayotov, Vassil, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. "Librispeech: An ASR
Corpus Based on Public Domain Audio Books." In 2015 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 5206-10. South Brisbane, Queensland, Australia: IEEE,
2015. https://doi.org/10.1109/ICASSP.2015.7178964.
[4] Subakan, Cem, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong. "Attention Is
All You Need In Speech Separation." In ICASSP 2021 - 2021 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 21-25. Toronto, ON, Canada: IEEE, 2021. https://
doi.org/10.1109/ICASSP39728.2021.9413901.
[5] Takahashi, Naoya, Sudarsanam Parthasaarathy, Nabarun Goswami, and Yuki Mitsufuji. "Recursive
Speech Separation for Unknown Number of Speakers." In Interspeech 2019, 1348-52. ISCA, 2019.
https://doi.org/10.21437/Interspeech.2019-1550.
[6] Roux, Jonathan Le, et al. "SDR – Half-Baked or Well Done?" ICASSP 2019 - 2019 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 626–
30. DOI.org (Crossref), https://doi.org/10.1109/ICASSP.2019.8683855.
See Also
separateSpeakers
Related Examples
• “Cocktail Party Source Separation Using Deep Learning Networks” on page 1-349
• “End-to-End Deep Speaker Separation” on page 1-85
Compress Machine Fault Recognition Neural Network Using Projection
In this example, you compress a pretrained acoustics-based machine fault recognition neural network
using projection and principal component analysis. Then, you generate C++ code from the
compressed neural network.
To learn how the deep learning model was trained, see “Acoustics-Based Machine Fault Recognition”
on page 1-714.
For a detailed example on deploying the machine fault recognition system to a hardware target, refer
to “Acoustics-Based Machine Fault Recognition Code Generation on Raspberry Pi” on page 1-744.
Prerequisites
For supported versions of libraries and for information about setting up environment variables, see
“Prerequisites for Deep Learning with MATLAB Coder” (MATLAB Coder).
Data Preparation
Download and unzip the air compressor data set [1] on page 1-1061. This data set consists of
recordings from air compressors in a healthy state or one of seven faulty states.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","AirCompressorDataset/AirCo
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"AirCompressorDataset");
Create an audioDatastore object to manage the data and split it into training and validation sets.
ads = audioDatastore(dataset,IncludeSubfolders=true);
The data labels are encoded in their containing folder name. To split the data into train and test sets,
use folders2labels and splitlabels.
lbls = folders2labels(ads.Files);
idxs = splitlabels(lbls,0.9);
adsTrain = subset(ads,idxs{1});
labelsTrain = lbls(idxs{1});
adsValidation = subset(ads,idxs{2});
labelsValidation = lbls(idxs{2});
Call countlabels to inspect the distribution of labels in the train and validation sets.
countlabels(labelsTrain)
ans=8×3 table
Label Count Percent
_________ _____ _______
countlabels(labelsValidation)
ans=8×3 table
Label Count Percent
_________ _____ _______
Bearing 22 12.5
Flywheel 22 12.5
Healthy 22 12.5
LIV 22 12.5
LOV 22 12.5
NRV 22 12.5
Piston 22 12.5
Riderbelt 22 12.5
Extract a set of acoustic features used as inputs to the network. The extraction process is identical to
the approach in “Acoustics-Based Machine Fault Recognition” on page 1-714.
windowLength = 512;
overlapLength = 0;
[~,info] = read(adsTrain);
fs = info.SampleRate;
aFE = audioFeatureExtractor(SampleRate=fs, ...
Window=hamming(windowLength,"periodic"),...
OverlapLength=overlapLength,...
spectralCentroid=true, ...
spectralCrest=true, ...
spectralDecrease=true, ...
spectralEntropy=true, ...
spectralFlatness=true, ...
spectralFlux=false, ...
spectralKurtosis=true, ...
spectralRolloffPoint=true, ...
spectralSkewness=true, ...
spectralSlope=true, ...
spectralSpread=true);
Load the pretrained network. To learn how the deep learning model was trained, see “Acoustics-
Based Machine Fault Recognition” on page 1-714.
load airCompressorNet
ans =
6×1 Layer array with layers:
You use principal component analysis (PCA) to identify the subspace of learnable parameters that
result in the highest variance in neuron activations by analyzing the network activations using the
training data set. This analysis requires only the predictors of the training data to compute the
network activations. It does not require the training targets.
Create a neuronPCA object. To view information about the steps of the neuron PCA algorithm, set the
VerbosityLevel option to "steps".
npca = neuronPCA(netOriginal,mbqTrain,VerbosityLevel="steps");
npca
npca =
neuronPCA with properties:
Project Network
If you want to compress a network so that it meets specific hardware memory requirements, then you
can manually calculate the reduction value such that the compressed network is of the desired size.
targetMemorySize = 256*1024
targetMemorySize = 262144
Calculate the memory size of the original network (in bytes) using the parameterMemory helper
function.
memorySizeOriginal = parameterMemory(netOriginal)
memorySizeOriginal = 502432
Calculate the factor to reduce the learnable parameters by such that the resulting network meets the
memory requirements.
reductionGoal = 1 - (targetMemorySize/memorySizeOriginal);
Project the network using the compressNetworkUsingProjection function and set the
LearnablesReductionGoal option to the calculated reduction factor.
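The projection call itself is not listed in this excerpt. Based on the option named above, it likely has the following form:
netProjected = compressNetworkUsingProjection(netOriginal,npca, ...
    LearnablesReductionGoal=reductionGoal);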
Calculate the memory size of the projected network using the parameterMemory function. Notice
that the value is very close to the target memory size.
memorySizeProjected = parameterMemory(netProjected)
memorySizeProjected = 261328
validationLabels = folders2labels(adsValidation.Files);
classNames = unique(validationLabels);
Create a mini-batch queue using the same steps as the training data.
mbqValidation = minibatchqueue(...
arrayDatastore(validationFeatures.',OutputType="same",ReadSize=miniBatchSize),...
MiniBatchSize=miniBatchSize ,...
MiniBatchFormat="CTB",...
MiniBatchFcn=@(X)cat(3,X{:}));
For comparison, calculate the classification accuracy of the original network using the test data and
the modelPredictions function.
YTest = modelPredictions(netOriginal,mbqValidation,string(classNames));
TTest = validationLabels;
accOriginal = mean(YTest == TTest)
accOriginal = 0.8807
figure
confusionchart(YTest,TTest, ...
Title="Accuracy: " + accOriginal*100 + " (%)");
Calculate the classification accuracy of the projected network using the test data and the
modelPredictions function.
YTest = modelPredictions(netProjected,mbqValidation,string(classNames));
accProjected = mean(YTest == TTest)
accProjected = 0.8409
figure
confusionchart(YTest,TTest, ...
Title="Accuracy: " + accProjected*100 + " (%)");
Compare the accuracy and the memory size of each network in a bar chart. Notice that memory size
has been significantly reduced, with a relatively slight reduction in accuracy.
figure
tiledlayout("flow")
nexttile
bar([accOriginal accProjected])
xticklabels(["Original" "Projected"])
ylabel("Accuracy (%)")
title("Accuracy")
nexttile
bar([memorySizeOriginal memorySizeProjected])
xticklabels(["Original" "Projected"])
yline(targetMemorySize,"r--","Memory Requirement")
ylabel("Memory (bytes)")
title("Memory Size")
Compressing a network using projection typically reduces the network accuracy. You can improve the
accuracy by retraining the network (also known as fine-tuning the network).
miniBatchSize = ;
trainLabels = folders2labels(adsTrain.Files);
validationFrequency = floor(numel(trainFeatures)/miniBatchSize);
options = trainingOptions("adam", ...
MiniBatchSize=miniBatchSize, ...
MaxEpochs=35, ...
Plots="training-progress", ...
Verbose=false, ...
Shuffle="every-epoch", ...
LearnRateSchedule="piecewise", ...
LearnRateDropPeriod=30, ...
LearnRateDropFactor=0.1, ...
ValidationData={validationFeatures,validationLabels}, ...
ValidationFrequency=validationFrequency,...
InputDataFormats = "CTB")
options =
TrainingOptionsADAM with properties:
GradientDecayFactor: 0.9000
SquaredGradientDecayFactor: 0.9990
Epsilon: 1.0000e-08
InitialLearnRate: 1.0000e-03
MaxEpochs: 35
LearnRateSchedule: 'piecewise'
LearnRateDropFactor: 0.1000
LearnRateDropPeriod: 30
MiniBatchSize: 128
Shuffle: 'every-epoch'
WorkerLoad: []
CheckpointFrequency: 1
CheckpointFrequencyUnit: 'epoch'
SequenceLength: 'longest'
DispatchInBackground: 0
L2Regularization: 1.0000e-04
GradientThresholdMethod: 'l2norm'
GradientThreshold: Inf
Verbose: 0
VerboseFrequency: 50
ValidationData: {{1×176 cell} [176×1 categorical]}
ValidationFrequency: 12
ValidationPatience: Inf
CheckpointPath: ''
ExecutionEnvironment: 'auto'
OutputFcn: []
Metrics: []
Plots: 'training-progress'
SequencePaddingValue: 0
SequencePaddingDirection: 'right'
InputDataFormats: {'CTB'}
TargetDataFormats: "auto"
ResetInputNormalization: 1
BatchNormalizationStatistics: 'auto'
OutputNetwork: 'last-iteration'
fineTunedNet = trainnet(trainFeatures,trainLabels,netProjected,"crossentropy",options);
YTest = modelPredictions(fineTunedNet,mbqValidation,string(classNames));
accProjected = mean(YTest == TTest)
accProjected = 0.8864
figure
confusionchart(YTest,TTest, ...
Title="Accuracy: " + accProjected*100 + " (%)");
You can generate C++ code from a machine fault recognition system that leverages the fine-tuned,
compressed network. The system consists of two stages:
• Feature extraction.
• Network inference.
Create a function that combines the feature extraction and deep learning classification.
type detectAirCompressorFault.m
Create a code generation configuration object to generate a MEX function. Specify the target
language as C++.
cfg = coder.config("mex");
cfg.TargetLang = "C++";
Call the codegen (MATLAB Coder) function from MATLAB Coder to generate C++ code. Set the
input audio frame length to 512 samples.
audioFrame = ones(512,1,"single");
codegen -config cfg detectAirCompressorFault -args {audioFrame} -report
For a more detailed example on deploying the machine fault recognition system to a hardware target,
refer to “Acoustics-Based Machine Fault Recognition Code Generation on Raspberry Pi” on page 1-
744.
Supporting Functions
function Y = modelPredictions(net,mbq,classNames)
Y = [];
reset(mbq)
while hasdata(mbq)
X = next(mbq);
scores = predict(net,X);
labels = onehotdecode(scores,classNames,1)';
Y = [Y; labels];%#ok
end
end
function N = numLearnables(net)
N = 0;
for i = 1:size(net.Learnables,1)
N = N + numel(net.Learnables.Value{i});
end
end
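The parameterMemory helper referenced earlier is not listed in this excerpt. A minimal sketch, consistent with numLearnables and assuming single-precision learnable parameters (4 bytes each):
function M = parameterMemory(net)
M = 4*numLearnables(net); % memory size of the learnable parameters in bytes
end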
References
[1] Verma, Nishchal K., et al. "Intelligent Condition Based Monitoring Using Acoustic Signals for Air
Compressors." IEEE Transactions on Reliability, vol. 65, no. 1, Mar. 2016, pp. 291–309. DOI.org
(Crossref), doi:10.1109/TR.2015.2459684.
See Also
Related Examples
• “Acoustics-Based Machine Fault Recognition” on page 1-714
• “Acoustics-Based Machine Fault Recognition Code Generation on Raspberry Pi” on page 1-744
Create an App to Play and Visualize Audio Files
This example shows how to create an app to play and visualize audio files. The app plots any audio
file and plays it using audioDeviceWriter. While playing the audio, the app updates a playback
cursor, a time status indicator, and a uiaudiometer component to perform sound level metering.
Create App
Add the components listed above and create an app in App Designer.
In the callback function for the Browse button, use uigetfile to browse for audio files. If a valid
audio file is selected, update the edit field value with the file name and load the audio file data.
if filename
app.AudioFileName = fullfile(pathname,filename);
app.AudioFileEditField.Value = app.AudioFileName;
loadAudioFile(app);
end
When a valid audio file is selected, read the contents of the audio file using audioread and plot the
audio waveform. Alternatively, you could follow the steps in “Plot Large Audio Files” on page 1-1012
example to load and plot only the overall envelope of the audio waveform.
function loadAudioFile(app)
% Read audio data from the file and plot its waveform
try
[y,fs] = audioread(app.AudioFileName);
t = seconds(0:1/fs:(size(y,1)-1)/fs);
catch ME
uialert(app.UIFigure,ME.message,'Invalid File');
end
Configure the callback functions of the playback buttons to play/pause audio, stop playing audio, and
toggle playing the file in a loop.
In the callback function for the Play button, create an audio stream loop to read and play audio
frame-by-frame and to update the UI. Use dsp.AudioFileReader to read an audio frame,
audioDeviceWriter to play that audio frame, and audioLevelMeter to compute the sound levels
and update the uiaudiometer component. Also, for every audio frame processed, update the
playback cursor and the playback status readout.
currPointer = app.AudioFileReadPointer;
while ~isDone(reader) && ~(app.StopRequested || app.PauseRequested)
% Read audio data, play, and update the meter
audioIn = reader();
player(audioIn);
uimeter.Value = levelMeter(audioIn);
% Increment read pointer and update cursor position
currPointer = currPointer + size(audioIn,1);
setPlaybackPosition(app,currPointer);
% Call drawnow to update graphics
drawnow limitrate
end
app.AudioFileReadPointer = currPointer;
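The setPlaybackPosition helper is not shown in this example. One possible implementation, assuming hypothetical app properties for the sample rate, the playback cursor (an xline object), and the time status label:
function setPlaybackPosition(app,pointer)
% Convert the read pointer to elapsed time and update the cursor and readout.
t = (pointer-1)/app.SampleRate; % hypothetical SampleRate property
app.PlaybackCursor.Value = t; % hypothetical xline used as the playback cursor
app.TimeStatusLabel.Text = sprintf("%.1f s",t); % hypothetical time status label
end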
See Also
uiaudiometer | audioLevelMeter
Related Examples
• “Plot Large Audio Files” on page 1-1012
• “Real-Time Audio in MATLAB”
Extract Features from Audio Data Sets
Feature extraction is an important part of machine learning and deep learning workflows for audio
signals. For these workflows, you often need to train your model using features extracted from a large
data set of audio files. Datastores are useful for working with large collections of data, and the
audioDatastore object allows you to manage collections of audio files.
This example shows different approaches to extracting features from an audio data set. It also shows
how to use parallel computing to accelerate file reading and feature extraction. Parallel file reading
and feature extraction requires Parallel Computing Toolbox™.
Create Datastore
Set the useFSDD flag to true to download the Free Spoken Digit Dataset (FSDD) [1] on page 1-1068,
which contains recordings of spoken digits, and create an audioDatastore object that points to the
data. Otherwise, create a datastore with a small set of audio recordings of spoken digits.
Set the OutputDataType property to "single" to read the audio data into single-precision arrays.
Deep learning workflows often require single-precision data, and using such data can help to speed
up feature extraction. You can also set the OutputEnvironment property to "gpu" to return data on
the GPU, which can also speed up feature extraction.
useFSDD = false;
fs = 8000; % sample rate of audio data
if useFSDD
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"FSDD","recordings");
ads = audioDatastore(dataset,OutputDataType="single");
else
ads = audioDatastore(".",OutputDataType="single");
end
To read files and extract features in parallel in this example, call gcp (Parallel Computing Toolbox) to
get the current parallel pool of workers. If no parallel pool exists, gcp starts a new pool.
pool = gcp;
The simplest way to extract audio features from a data set is to use the audioFeatureExtractor
object. Create an audioFeatureExtractor to extract mel-frequency cepstral coefficients (MFCCs)
from each audio file. Call extract to extract the MFCCs from each audio file in the datastore.
Calling extract on the datastore requires that all the features from the datastore fit in memory.
afe = audioFeatureExtractor(SampleRate=fs,mfcc=true);
mfccs = extract(afe,ads);
To read the files and extract features in parallel, call extract with UseParallel set to true.
mfccs = extract(afe,ads,UseParallel=true);
Another approach to extract all the features is to loop through the files in the datastore. You might
choose this method if you have a custom feature extraction algorithm that you cannot implement with
audioFeatureExtractor. In this example, you use stft, which computes the short-time Fourier
transform (STFT) of the audio signal.
Create a cell array to contain the features for each file. In a loop, read in the audio file from the
datastore and extract the features using stft. Store the features in the cell array. This approach
requires the features from the whole data set to fit in memory.
numFiles = length(ads.Files);
features = cell(1,numFiles);
for index = 1:numFiles
x = read(ads);
features{index} = stft(x);
end
You can partition the datastore and run the feature extraction loop on the partitions in parallel to
improve performance.
Use numpartitions to get a reasonable number of partitions given the number of files and number
of workers in the current parallel pool.
numPartitions = numpartitions(ads,pool);
In a parfor (Parallel Computing Toolbox) loop, partition the datastore and extract the features from
the files in each partition.
reset(ads)
partitionFeatures = cell(1,numPartitions);
parfor ii = 1:numPartitions
subds = partition(ads,numPartitions,ii);
feats = cell(1,numel(subds.Files));
for index = 1:numel(subds.Files)
x = read(subds);
feats{index} = stft(x);
end
partitionFeatures{ii} = feats;
end
Concatenate the features from each partition into one cell array.
features = cat(2,partitionFeatures{:});
Transform Datastore
Another approach for extracting features from the data set is to create a TransformedDatastore
that applies custom feature extraction when reading in a file. This method is useful when the features
from the whole data set do not fit in memory.
Create a function featureExtraction that takes the audio data from a file in the datastore and
performs the feature extraction. Call transform on the audioDatastore with the
featureExtraction function handle to create a new datastore that performs the feature extraction.
tds = transform(ads,@featureExtraction);
Calling read on the new datastore reads in a file and performs feature extraction using the provided
function.
fileFeatures = read(tds);
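The featureExtraction function referenced above is not listed here. A minimal sketch that matches the STFT-based extraction used earlier in this example:
function features = featureExtraction(audioIn)
% Wrap the features in a cell array so that readall returns one entry per file.
features = {stft(audioIn)};
end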
You can also use this method to read all the features from the data set into memory by calling
readall with the TransformedDatastore.
features = readall(tds);
Read the files and extract features in parallel by calling readall with UseParallel set to true.
features = readall(tds,UseParallel=true);
Next Steps
You can use the data set features to train a machine learning or deep learning model. Combine the
features with label information to perform supervised learning. For example, the trainnet (Deep
Learning Toolbox) function allows you to train deep neural networks using labeled datastores.
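As a brief, hedged illustration (the labels variable and its source are not part of this example), you could pair the transformed feature datastore with a label datastore before training:
labelDS = arrayDatastore(labels);   % "labels" assumed: one label per file
trainDS = combine(tds,labelDS);     % each read returns features and a label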
References
[1] Zohar Jackson, César Souza, Jason Flaks, Yuxin Pan, Hereman Nicolas, and Adhish Thite.
“Jakobovski/free-spoken-digit-dataset: V1.0.8”. Zenodo, August 9, 2018. https://doi.org/10.5281/
zenodo.1342401.
See Also
Related Examples
• “Use Datastores to Manage Audio Data Sets” on page 1-983
Measuring Speech Intelligibility and Perceived Audio Quality with STOI and ViSQOL
Evaluating speech and audio quality is fundamental to the development of communication and speech
enhancement systems. In speech communication systems, it is necessary to evaluate signal integrity,
speech intelligibility, and perceived audio quality. The goal of a speech communication system is to
provide the end user with a representation of the original signal that they perceive as high quality
and that meets their expectations. Subjective listening tests are the ground truth for evaluating speech and
audio quality, but are time consuming and expensive. Objective measures of speech and audio quality
have been an active research area for decades and continue to evolve with the technology of
communication, recording, and speech enhancement.
Speech and audio quality measures can be classified as intrusive and non-intrusive. Intrusive
measurements have access to both the reference audio and the audio output from a processing
system. Intrusive measurements give a score based on the perceptually-motivated differences
between the two. Non-intrusive measurements evaluate the audio output from a processing system
without access to the reference audio. Non-intrusive measurements generally require knowledge of
the speech or sound production mechanisms to evaluate the goodness of the output. Both metrics
presented in this example are intrusive.
The short-time objective intelligibility (STOI) metric was introduced in 2010 [1] on page 1-1080 for
the evaluation of noisy speech and time-frequency weighted (enhanced) noisy speech. It was
subsequently expanded on and extended in [2] on page 1-1080 and [3] on page 1-1080. The STOI
algorithm was shown to be strongly correlated with speech intelligibility in subjective listening tests.
Speech intelligibility is measured by the ratio of words correctly understood under different listening
conditions. Listening conditions used in the development of STOI include time-frequency masks
applied to speech signals to mimic speech enhancement, and the simulation of additive noise
conditions such as cafeteria, factory, car, and speech-shaped noise.
The virtual speech quality objective listener metric (ViSQOL) was introduced in 2012 [4] on page 1-
1080 based on prior work on the Neurogram Similarity Index Measure (NSIM) [5 on page 1-1080,6]
on page 1-1080. This metric was designed for the evaluation of speech quality in processed speech
signals, and it was extended for general audio [7] on page 1-1081. This metric objectively measures
the perceived quality of speech and/or audio, taking into account various aspects of human auditory
perception. The algorithm also compensates for temporal differences introduced by jitter, drift, or
packet loss. Similar to STOI, ViSQOL offers a valuable alternative to time-consuming and expensive
subjective listening tests by simulating the human auditory system. Its ability to capture the nuances
of human perception makes it a valuable addition to the arsenal of tools for assessing speech and
audio quality.
This example creates test data and uses both STOI and ViSQOL to evaluate a speech processing
system. The last section in this example, Evaluate Speech Enhancement System on page 1-1076,
requires Deep Learning Toolbox™.
Use the getTestSignals on page 1-1080 supporting function to create a reference and degraded
signal pair. The getTestSignals supporting function uses the specified signal and noise files and
creates a reference signal and corresponding degraded signal at the requested SNR and duration.
fs = 16e3;
signal = ;
noise = ;
[ref,deg10] = getTestSignals(SampleRate=fs,SNR=10,Duration=10,Noise=noise,Reference=signal);
t = (1/fs)*(0:numel(ref) - 1);
plot(t,ref)
title("Reference Signal")
xlabel("Time (s)")
sound(ref,fs)
plot(t,deg10)
title("Degraded Signal")
xlabel("Time (s)")
sound(deg10,fs)
The STOI algorithm assumes a speech signal is present. The algorithm first discards regions of
silence, and then compares the energy of perceptually spaced frequency bands of the original and
target signals. The intermediate analysis time scale is relatively small, approximately 386 ms. The
final objective intelligibility is the mean of the intermediate measures.
The authors of STOI designed the algorithm to optimize speed and simplicity. These attributes have
made STOI a popular tool to evaluate speech degradation and speech enhancement systems. The
simplicity and speed have also enabled STOI to be used as a loss function when training deep
learning models.
The Audio Toolbox implementation makes some modifications, such as improved resampling and a fix for the
recomposition issues introduced by the VAD in the original implementation. Additionally, the algorithm has
been reformulated for increased parallelization. These changes result in slightly different numerics
compared to the original implementation.
Call stoi with the degraded signal, reference signal, and the common sample rate. While the stoi
function accepts arbitrary sample rates, signals are resampled internally to 10 kHz.
metric = stoi(deg10,ref,fs)
metric = 0.8465
You can explore how the STOI algorithm behaves under different signal, noise, and mixture
conditions by modifying the test parameters below. While the theoretical range returned from STOI is
[-1,1], the practical range is closer to [0.4,1]. When the STOI metric approaches 0.4 the speech
approaches unintelligibility.
fs = ;
snrSweep = :2: ;
duration = ;
signal = ;
noise = ;
deg = zeros(size(deg10,1),numel(snrSweep));
for ii = 1:numel(snrSweep)
[~,deg(:,ii)] = getTestSignals(SNR=snrSweep(ii),SampleRate=fs,Duration=duration,Noise=noise,Reference=signal);
end
stoi_d = zeros(size(snrSweep));
for ii = 1:numel(snrSweep)
stoi_d(ii) = stoi(deg(:,ii),ref,fs);
end
Plot the STOI metric against the SNR sweep for the signals under test.
plot(snrSweep,stoi_d,"bo", ...
snrSweep,stoi_d,"b-")
xlabel("SNR")
ylabel("metric")
title("STOI")
grid on
The ViSQOL algorithm includes a time-matching algorithm to align the input and degraded signals
automatically. Not only is there a global alignment that applies to the whole processed signal, there is
also a granular matching of so-called "patches", corresponding to short periods of time. This allows
the alignment to work with signals that may drift or suffer from lost samples.
ViSQOL has both a speech mode and an audio mode. In speech mode, a voice activity detector (VAD)
algorithm removes irrelevant portions of the signal.
Once the signals are aligned, ViSQOL computes the Neurogram Similarity Index Measure (NSIM) in
the time-frequency domain (based on a spectrogram).
ViSQOL can provide an overall NSIM score, a Mean Opinion Score (MOS), and specific information
about each time patch and frequency bin.
Call visqol with the degraded signal, reference signal, and the common sample rate. Use a sample
rate of 16 kHz for speech, or 48 kHz for audio, and set Mode correspondingly. The alignment
procedure can be the most time-consuming part, so you can set the SearchWindowSize option to zero in
cases where the degraded signal is known to have constant latency (for example, no dropped samples or drift).
The OutputMetric and ScaleMOS options determine which metrics are returned (for example, NSIM, MOS, or both).
visqol_d = zeros(numel(snrSweep),2);
for ii = 1:numel(snrSweep)
visqol_d(ii,:) = visqol(deg(:,ii),ref,fs,Mode="speech",OutputMetric="MOS and NSIM");
end
Plot the ViSQOL MOS and NSIM metrics against the SNR sweep for the signals under test. The NSIM
metric is a normalized value between -1 and 1 (but generally positive), while the MOS metric is in the
1 to 5 range as seen in traditional listening tests.
figure
plot(snrSweep,visqol_d(:,2),"bo", ...
snrSweep,visqol_d(:,2),"b-")
xlabel("SNR")
ylabel("NSIM")
title("ViSQOL (speech)")
hold on
yyaxis right
plot(snrSweep,visqol_d(:,1),"md", ...
snrSweep,visqol_d(:,1),"m-")
set(gca,YColor="m");
ylabel("MOS")
grid on
hold off
The signals are not just composed of clean speech, so compute the ViSQOL metric in audio mode to
see how the results change.
deg48 = audioresample(deg,InputRate=fs,OutputRate=48e3);
ref48 = audioresample(ref,InputRate=fs,OutputRate=48e3);
visqol_d = zeros(numel(snrSweep),2);
for ii = 1:numel(snrSweep)
visqol_d(ii,:) = visqol(deg48(:,ii),ref48,48e3,Mode="audio",OutputMetric="MOS and NSIM");
end
figure
plot(snrSweep,visqol_d(:,2),"bo", ...
snrSweep,visqol_d(:,2),"b-")
xlabel("SNR")
ylabel("NSIM")
title("ViSQOL (audio)")
hold on
yyaxis right
plot(snrSweep,visqol_d(:,1),"md", ...
snrSweep,visqol_d(:,1),"m-")
set(gca,YColor="m");
ylabel("MOS")
grid on
hold off
Both STOI and ViSQOL are used for the evaluation of speech enhancement systems.
To perform speech enhancement, use enhanceSpeech. This functionality requires Deep Learning
Toolbox™.
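The enhancement step itself is not shown in this section. A minimal sketch, assuming enhanceSpeech returns a signal the same length as its input, is:
enh = zeros(size(deg));
for ii = 1:numel(snrSweep)
    enh(:,ii) = enhanceSpeech(deg(:,ii),fs);   % enhanced version of each mixture
end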
Calculate the STOI and ViSQOL NSIM metrics for each of the enhanced speech signals.
stoi_e = zeros(size(snrSweep));
nsim_e = zeros(size(snrSweep));
for ii = 1:numel(snrSweep)
stoi_e(ii) = stoi(enh(:,ii),ref,fs);
nsim_e(ii) = visqol(enh(:,ii),ref,fs,Mode="speech",OutputMetric="NSIM");
end
Plot the STOI metric against the SNR sweep for both the degraded and enhanced speech signals. The
speech enhancement model appears to perform best around 0 dB SNR, and actually decreases the
STOI metric above 15 dB SNR. When the noise level is very low, the artifacts introduced by the
speech enhancement system become more pronounced. When the noise level is very high, the speech
enhancement system has difficulty isolating the speech for enhancement, and sometimes instead
makes noise signals louder.
figure
plot(snrSweep,stoi_d,"b-", ...
snrSweep,stoi_e,"r-", ...
snrSweep,stoi_d,"bo", ...
snrSweep,stoi_e,"ro")
xlabel("SNR")
ylabel("STOI")
title("Speech Enhancement Performance Evaluation")
legend("Degraded","Enhanced",Location="best")
grid on
Plot the ViSQOL metric against the SNR sweep for both the degraded and enhanced speech signals.
figure
plot(snrSweep,visqol_d(:,2),"b-", ...
snrSweep,nsim_e,"r-", ...
snrSweep,visqol_d(:,2),"bo", ...
snrSweep,nsim_e,"ro")
xlabel("SNR")
ylabel("NSIM")
title("ViSQOL")
legend("Degraded","Enhanced",Location="best")
grid on
Supporting Functions
Mix SNR
function noisySignal = mixSNR(signal,noise,ratio)
signalNorm = norm(signal);
noiseNorm = norm(noise);
goalNoiseNorm = signalNorm/(10^(ratio/20));
factor = goalNoiseNorm/noiseNorm;
requestedNoise = noise.*factor;
noisySignal = signal + requestedNoise;
noisySignal = noisySignal./max(abs(noisySignal));
end
Get Test Signals
function [ref,deg] = getTestSignals(options)
% Signature inferred from the calls to getTestSignals earlier in this example.
arguments
    options.Reference
    options.Noise
    options.SampleRate
    options.SNR
    options.Duration
end
[ref,xfs] = audioread(options.Reference);
[n,nfs] = audioread(options.Noise);
ref = audioresample(ref,InputRate=xfs,OutputRate=options.SampleRate);
n = audioresample(n,InputRate=nfs,OutputRate=options.SampleRate);
ref = mean(ref,2);
n = mean(n,2);
numsamples = round(options.SampleRate*options.Duration);
ref = resize(ref,numsamples,Pattern="circular");
n = resize(n,numsamples,Pattern="circular");
deg = mixSNR(ref,n,options.SNR);
ref = ref./max(abs(ref));
end
References
[1] C.H.Taal, R.C.Hendriks, R.Heusdens, J.Jensen, "A Short-Time Objective Intelligibility Measure for
Time-Frequency Weighted Noisy Speech," ICASSP 2010, Dallas, Texas, US.
[2] C.H.Taal, R.C.Hendriks, R.Heusdens, J.Jensen, "An Algorithm for Intelligibility Prediction of Time-
Frequency Weighted Noisy Speech," IEEE Transactions on Audio, Speech, and Language Processing,
2011.
[3] Jesper Jensen and Cees H. Taal, "An Algorithm for Predicting the Intelligibility of Speech Masked
by Modulated Noise Maskers," IEEE Transactions on Audio, Speech and Language Processing, 2016.
[4] A. Hines, J. Skoglund, A. Kokaram, N. Harte, "ViSQOL: The Virtual Speech Quality Objective
Listener," International Workshop on Acoustic Signal Enhancement 2012, 4-6 September 2012,
Aachen, DE.
[5] A. Hines and N. Harte, “Speech Intelligibility Prediction using a Neurogram Similarity Index
Measure,” Speech Communication, vol. 54, no. 2, pp.306-320, 2012.
[6] A. Hines, "Predicting Speech Intelligibility", Doctoral Thesis, Trinity College Dublin, 2012.
[7] A. Hines, E. Gillen, D. Kelly, J. Skoglund, A. Kokaram, N. Harte, "ViSQOLAudio: An objective audio
quality metric for low bitrate codecs," Journal of the Acoustical Society of America, vol. 137, no. 6,
2015.
See Also
stoi | visqol | enhanceSpeech
Design User Interface for Audio Plugin
Audio plugins enable you to tune parameters of a processing algorithm while streaming audio in real
time. To enhance usability, you can define a custom user interface (UI) that maps parameters to
intuitively designed and positioned controls. You can use audioPluginInterface,
audioPluginParameter, and audioPluginGridLayout to define the custom UI. You can interact
with the custom UI in MATLAB® using parameterTuner, or deploy the plugin with a custom UI to a
digital audio workstation (DAW). This tutorial walks through key design capabilities of audio plugins
by sequentially enhancing a basic audio plugin UI.
To learn more about audio plugins in general, see “Audio Plugins in MATLAB”.
The equalizerV1 audio plugin enables you to tune the gains and center frequencies of a three-band
equalizer, tune the overall volume, and toggle between enabled and disabled states.
methods
function obj = equalizerV1
obj.mPEQ = multibandParametricEQ('HasHighpassFilter',false, ...
'HasLowShelfFilter',false,'HasHighShelfFilter',false, ...
'HasLowpassFilter',false,'Oversample',false,'NumEQBands',3, ...
'EQOrder',2);
end
function y = process(obj, x)
if obj.Enable
y = step(obj.mPEQ,x);
y = y*obj.Volume;
else
y = x;
end
end
function reset(obj)
obj.mPEQ.SampleRate = getSampleRate(obj);
reset(obj.mPEQ);
end
function set.FreqLow(obj,val)
obj.FreqLow = val;
obj.mPEQ.Frequencies(1) = val; %#ok<*MCSUP>
end
function set.GainLow(obj,val)
obj.GainLow = val;
obj.mPEQ.PeakGains(1) = val;
end
function set.FreqMid(obj,val)
obj.FreqMid = val;
obj.mPEQ.Frequencies(2) = val;
end
function set.GainMid(obj,val)
obj.GainMid = val;
obj.mPEQ.PeakGains(2) = val;
end
function set.FreqHigh(obj,val)
obj.FreqHigh = val;
obj.mPEQ.Frequencies(3) = val;
end
function set.GainHigh(obj,val)
obj.GainHigh = val;
obj.mPEQ.PeakGains(3) = val;
end
end
end
parameterTuner(equalizerV1)
To define the UI control style, update the audioPluginParameter definition of each parameter to
include the “Style” and “Layout” name-value pairs. Style defines the type of control (rotary knob,
slider, or switch, for example). Layout defines which cells the controls occupy on the UI grid. You can
specify Layout as the [row, column] of the grid to occupy, or as the [upper, left; lower, right] of the
group of cells to occupy. By default, control display names are also displayed and occupy their own
cells on the UI grid. The cells they occupy depend on the “DisplayNameLocation” name-value pair.
The commented arrows indicate the difference between equalizerV1 and equalizerV2.
parameterTuner(equalizerV2)
The BackgroundColor can be specified as a short or long color name string or as an RGB triplet.
When you specify BackgroundColor, the color is applied to all space on the UI except space
occupied by controls or a BackgroundImage. If the control or background image includes a
transparency, then the background color shows through the transparency.
The BackgroundImage can be specified as a PNG, GIF, or JPG file. The image is applied to the UI
grid by aligning the top left corners of the UI grid and image. If the image is larger than the UI grid
size defined in audioPluginGridLayout, then the image is clipped to the UI grid size. The
background image is not resized. If the image is smaller than the UI grid, then unoccupied regions of
the UI grid are treated as transparent.
In this example, you increase the padding around the perimeter of the grid to create space for the
MathWorks® logo. You can calculate the total width of the UI grid as the sum of all column widths
plus the left and right padding plus the column spacing (the default column spacing of 10 pixels is
used in this example): 100 + 100 + 100 + 50 + 150 + 20 + 20 + 4 × 10 = 580. The total height of
the UI grid is the sum of all row heights plus the top and bottom padding plus the row spacing (the
default row spacing of 10 pixels is used in this example):
20 + 20 + 160 + 20 + 100 + 20 + 120 + 4 × 10 = 500 . To locate the logo at the bottom of the UI
grid, use a 580-by-500 image:
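You can check this arithmetic directly in MATLAB using the values from the grid layout shown in the code below:
colWidth = [100,100,100,50,150];   % ColumnWidth
rowHeight = [20,20,160,20,100];    % RowHeight
padLeftRight = 20 + 20;            % left + right padding
padTopBottom = 20 + 120;           % top + bottom padding
spacing = 10;                      % default row and column spacing
uiWidth = sum(colWidth) + padLeftRight + (numel(colWidth)-1)*spacing    % 580
uiHeight = sum(rowHeight) + padTopBottom + (numel(rowHeight)-1)*spacing % 500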
audioPluginParameter('FreqMid', ...
'Label','Hz', ...
'Mapping',{'log',500,3e3}, ...
'Style','rotaryknob', ...
'Layout',[5,2], ...
'DisplayNameLocation','None'), ...
audioPluginParameter('GainHigh', ...
'Label','dB', ...
'Mapping',{'lin',-20,20}, ...
'Style','vslider', ...
'Layout',[2,3;4,3], ...
'DisplayName','High','DisplayNameLocation','Above'), ...
audioPluginParameter('FreqHigh', ...
'Label','Hz', ...
'Mapping',{'log',3e3,20e3}, ...
'Style','rotaryknob', ...
'Layout',[5,3], ...
'DisplayNameLocation','None'), ...
audioPluginParameter('Volume', ...
'DisplayName','Volume', ...
'Mapping',{'lin',0,2}, ...
'Style','rotaryknob', ...
'Layout',[3,5], ...
'DisplayNameLocation','Above'), ...
audioPluginParameter('Enable', ...
'Style','vtoggle', ...
'Layout',[5,5], ...
'DisplayNameLocation','None'), ...
...
audioPluginGridLayout( ...
'RowHeight',[20,20,160,20,100], ...
'ColumnWidth',[100,100,100,50,150], ...
'Padding',[20,120,20,20]), ... %<--
...
'BackgroundImage','background.png', ... %<--
'BackgroundColor',[210/255,210/255,210/255]) %<--
end
... % omitted for example purposes
end
parameterTuner(equalizerV3)
To use custom filmstrips, specify the “Filmstrip” and “FilmstripFrameSize” name-value pairs in
audioPluginParameter. The filmstrip can be a PNG, GIF, or JPG file, and should consist of frames
placed end-to-end either vertically or horizontally. The filmstrip is mapped to the control's range so
that the corresponding filmstrip frame is displayed on the plugin UI as you tune parameters. In this
example, specify a two-frame filmstrip for the Enable parameter. As a best practice, the size of each
frame of the filmstrip should equal the size of the region occupied by the parameter. The Enable
parameter occupies one cell that is 150-by-100 pixels. To create a vertical filmstrip where each frame
is 150-by-100, make the total filmstrip size 150-by-200 and set FilmstripFrameSize to
[150,100]. The filmstrip used in this example contains the frame corresponding to the off position
first, then the on position:
'Style','rotaryknob', ...
'Layout',[3,5], ...
'DisplayNameLocation','Above'), ...
audioPluginParameter('Enable', ...
'Style','vtoggle', ...
'Layout',[5,5], ...
'DisplayNameLocation','None', ...
'Filmstrip','vtoggle.png', ... %<--
'FilmstripFrameSize',[150,100]), ... %<--
...
audioPluginGridLayout( ...
'RowHeight',[20,20,160,20,100], ...
'ColumnWidth',[100,100,100,50,150], ...
'Padding',[20,120,20,20]), ...
...
'BackgroundImage','background.png', ...
'BackgroundColor',[210/255,210/255,210/255])
end
... % omitted for example purposes
end
Filmstrips are not supported by parameterTuner. To see the custom plugin UI, you must deploy the
plugin to a DAW. Use generateAudioPlugin to create a VST plugin.
generateAudioPlugin equalizerV4
.......
In this example, the plugin was opened in REAPER. A screenshot of the UI in REAPER is displayed
below.
See Also
More About
• “Audio Plugins in MATLAB”
• “Export a MATLAB Plugin to a DAW”
See Also
audioPlugin | audioPluginGridLayout | audioPluginInterface | audioPluginParameter |
generateAudioPlugin | parameterTuner
Measure and Manage Impulse Responses
To begin, open the Impulse Response Measurer app by selecting the icon from the app gallery.
• Windows® –– ASIO™: Click the button to open the settings panel for the ASIO driver.
• Mac –– CoreAudio
• Linux® –– ALSA
Valid values for sample rate and number of samples per frame depend on your specified audio device.
You can use the level monitor to verify the configuration of your audio I/O system.
To measure the audio device latency and remove it from captured measurements, you must use a
loopback cable to connect one of the device player channels directly to one of the recorder channels.
To enable latency measurement and removal, click the Latency Compensation drop-down list, select
Loopback Measurement, and then set the Loopback Channels to the player and recorder
channels that are connected by the loopback cable.
This example sets Latency Compensation to None, so it does not measure the audio device latency.
Both methods for IR acquisition have the same basic settings, including:
• Number of Runs –– Number of times the excitation signal is sent within a single capture.
Multiple runs are used to average individual impulse response captures to reduce measurement
noise.
• Duration per Run (s) –– Total time of each run in seconds.
• Excitation Level (dBFS) –– The level of the excitation signal in dBFS.
Both methods for IR acquisition also have the same advanced run settings, including:
• Wait before first run –– Delay before starting the first run. The delay allows time for any last-
minute tasks, such as exiting a room before testing its acoustics.
• Pause between runs –– Duration of the pause between runs. During a pause, the excitation
signal is not sent, and audio is not recorded. When using the Swept Sine method, include a pause
between runs to avoid buildup of reverberations. Pause between runs is always zero for the MLS
method.
• Number of warmup runs –– Number of times to output the excitation signal before acquisition.
The MLS method assumes the signal it acquires is a combination of the excitation signal and its
impulse response.
The total capture time is a sum of run durations, pauses, and the initial wait.
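As a rough illustration, assuming that a pause occurs between consecutive runs (settings values chosen only for this example):
numRuns = 3;             % Number of Runs
durationPerRun = 5;      % Duration per Run (s)
pauseBetweenRuns = 2;    % Pause between runs (s)
waitBeforeFirstRun = 4;  % Wait before first run (s)
totalCaptureTime = waitBeforeFirstRun + numRuns*durationPerRun + ...
    (numRuns - 1)*pauseBetweenRuns   % 4 + 15 + 4 = 23 seconds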
The Swept Sine method has additional Advanced Settings to control the excitation signal,
including:
When using the Swept Sine method, the Run Duration is divided into Sweep duration and End
silence duration. During the end silence, the app continues to record audio, enabling acquisition of
the response over the entire range of the frequency sweep.
Starting in R2022a, you can automatically save device, method, and advanced settings and use them
in future measurement sessions.
Acquire IR Measurements
For this example, use the Swept Sine method with default settings. Once you have your audio device
set up, click Capture. A dialog box opens that displays the progress of your capture. Capture IR
measurements twice.
By default, the impulse response and magnitude response are plotted. You can view any combination
of the impulse, magnitude, and phase response using the Display button. Here you can also remove
the measured audio device latency from the plotted impulse response and phase response.
Minimize Captured Data and Captured Data Information, then select the Phase Response.
You can toggle the relative size of the plot by moving the dividers. You can zoom in and out or toggle
between linear and logarithmic frequency axes by selecting the icons that appear when your pointer
is over the plot. Updating either the magnitude response or the phase response updates the other.
Zoom in on the impulse response plot, and zoom in on the 0–20 Hz range of your frequency response plots.
Zooming in, you can see the small delay between FirstCapture and SecondCapture. When the
zoom level is high enough, line markers automatically appear.
Export IR Measurements
To view export options for further analysis or use, click Export.
Export the data to your workspace. The data is saved as a table. To inspect how the data is saved,
display the table you exported.
irdata_160957
irdata_160957 =
2×17 table
When you export the data as a MAT-file, the same table is saved as when you export to the workspace.
When you select to export the data as a WAV file, each impulse response is saved as a separate WAV
file. The title of the capture is the name of the WAV file. In this example, selecting to export data to
audio WAV file places two WAV files in the specified folder, FirstCapture.wav and
SecondCapture.wav.
To analyze your captured data further, view the data in FVTool or Signal Analyzer.
Run the script to measure the impulse response and store the captured data in the capture
structure.
capture
capture =
The script optionally plots the impulse, magnitude, and phase response according to the Display
settings in the app.
You can examine the generated code to understand how the app performs the measurement, and you
can edit the script for customization.
See Also
Impulse Response Measurer | audioPlayerRecorder | splMeter | reverberator
Related Examples
• “Measure Impulse Response of an Audio System” on page 1-270
• “Measure Frequency Response of an Audio Device” on page 1-275
• “Impulse Response Measurement Using a NI USB-4431 Device” on page 1-1007
Design and Play a MIDI Synthesizer
To learn about interfacing with MIDI devices in general, see “MIDI Device Interface” on page 5-2.
• Velocity indicates how hard a note is played. By convention, Note On messages with velocity set
to zero represent note off messages. Representing note off messages with note on messages is
more efficient when using Running Status.
• Note indicates the frequency of the audio signal. The Note property takes a value between zero
and 127, inclusive. The MIDI protocol specifies that 60 is Middle C, with all other notes relative to
that note. Create a MIDI note on message that indicates to play Middle C.
channel = 1;
note = 60;
velocity = 64;
msg = midimsg('NoteOn',channel,note,velocity)
msg =
MIDI message:
NoteOn Channel: 1 Note: 60 Velocity: 64 Timestamp: 0 [ 90 3C 40 ]
To interpret the note property as frequency, use the equal tempered scale and the A440 convention:
frequency = 440 * 2^((msg.Note-69)/12)
frequency =
261.6256
Some MIDI synthesizers use an Attack Decay Sustain Release (ADSR) envelope to control the volume,
or amplitude, of a note over time. For simplicity, use the note velocity to determine the amplitude.
Conceptually, if a key is hit harder, the resulting sound is louder. The Velocity property takes a
value between zero and 127, inclusive. Normalize the velocity and interpret as the note amplitude.
amplitude = msg(1).Velocity/127
amplitude =
0.5039
To synthesize a sine wave, create an audioOscillator System object™. To play the sound to your
computer's default audio output device, create an audioDeviceWriter System object. Step the
objects for two seconds and listen to the note.
osc = audioOscillator('Frequency',frequency,'Amplitude',amplitude);
deviceWriter = audioDeviceWriter('SampleRate',osc.SampleRate);
tic
while toc < 2
synthesizedAudio = osc();
deviceWriter(synthesizedAudio);
end
First, create an array of midimsg objects and cache the note on and note off times in the variable
eventTimes.
msgs = [midimsg('Note',channel,60,64,0.5,0), ...
midimsg('Note',channel,62,64,0.5,.75), ...
midimsg('Note',channel,57,40,0.5,1.5), ...
midimsg('Note',channel,60,50,1,3)];
eventTimes = [msgs.Timestamp];
To mimic receiving notes in real time, create a loop that uses the eventTimes variable and tic
and toc to play notes according to the MIDI message timestamps. Release your audio device after
the loop is complete.
i = 1;
tic
while toc < max(eventTimes)
if toc > eventTimes(i)
msg = msgs(i);
i = i+1;
if msg.Velocity~= 0
osc.Frequency = 440 * 2^((msg.Note-69)/12);
osc.Amplitude = msg.Velocity/127;
else
osc.Amplitude = 0;
end
end
deviceWriter(osc());
end
release(deviceWriter)
simplesynth
function simplesynth(midiDeviceName)
midiInput = mididevice(midiDeviceName);
osc = audioOscillator('square', 'Amplitude', 0);
deviceWriter = audioDeviceWriter;
deviceWriter.SupportVariableSizeInput = true;
deviceWriter.BufferSize = 64; % small buffer keeps MIDI latency low
while true
msgs = midireceive(midiInput);
for i = 1:numel(msgs)
msg = msgs(i);
if isNoteOn(msg)
osc.Frequency = note2freq(msg.Note);
osc.Amplitude = msg.Velocity/127;
elseif isNoteOff(msg)
    % Silence the oscillator only if the released key matches the
    % note that is currently sounding.
    if note2freq(msg.Note) == osc.Frequency
osc.Amplitude = 0;
end
end
end
deviceWriter(osc());
end
end
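The isNoteOn, isNoteOff, and note2freq helper functions called by simplesynth are not listed in this excerpt. A sketch consistent with the conventions described earlier (a note on message with zero velocity represents note off, and notes map to frequency using the A440 equal tempered scale) could be:
function yes = isNoteOn(msg)
% A note on message with nonzero velocity starts a note.
yes = msg.Type == "NoteOn" && msg.Velocity > 0;
end
function yes = isNoteOff(msg)
% A note off message, or a note on message with zero velocity, ends a note.
yes = msg.Type == "NoteOff" || (msg.Type == "NoteOn" && msg.Velocity == 0);
end
function freq = note2freq(note)
% Convert a MIDI note number to frequency using the A440 convention.
freq = 440 * 2^((note - 69)/12);
end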
To query your system for your device name, use mididevinfo. To listen to your chosen device, call
the simplesynth function with the device name. This example uses an M-Audio KeyRig 25 device,
which registers with device name USB 02 on the machine used in this example.
mididevinfo
Call the simplesynth function with your device name. The simplesynth function listens for note
messages and plays them to your default audio output device. Play notes on your MIDI device and
listen to the synthesized audio.
simplesynth('USB 02')
See Also
Classes
midimsg | mididevice
Functions
midisend | midireceive | mididevinfo
External Websites
• https://www.midi.org
MIDI Device Interface
MIDI
This tutorial introduces the Musical Instrument Digital Interface (MIDI) protocol and how you can use
Audio Toolbox to interact with MIDI devices. The tools described here enable you to send and receive
all MIDI messages as described by the MIDI protocol. If you are interested only in sending and
receiving Control Change messages with a MIDI control surface, see “MIDI Control Surface
Interface” on page 8-2. If you are interested in using MIDI to control your audio plugins, see “MIDI
Control for Audio Plugins” on page 7-2. To learn more about MIDI in general, consult The MIDI
Manufacturers Association.
MIDI is a technical standard for communication between electronic instruments, computers, and
related devices. MIDI carries event messages specific to audio signals, such as pitch and velocity, as
well as control signals for parameters and clock signals to synchronize tempo.
MIDI Devices
A MIDI device is any device capable of sending or receiving MIDI messages. MIDI devices have input
ports, output ports, or both. The MIDI protocol defines messages as unidirectional. A MIDI device can
be real-world or virtual.
Audio Toolbox enables you to create an interface to a MIDI device using mididevice. To create a
MIDI interface to a specific device, use mididevinfo to query your system for available devices.
Then create a mididevice object by specifying a MIDI device by name or ID.
mididevinfo
device = mididevice('USB MIDI Interface ')
device =
mididevice connected to
Input: 'USB MIDI Interface ' (1)
Output: 'USB MIDI Interface ' (3)
You can specify a mididevice object to listen for input messages, send output messages, or both. In
this example, the mididevice object receives MIDI messages at the input port named 'USB MIDI
Interface ', and sends MIDI messages from the output port named 'USB MIDI Interface '.
MIDI Messages
A MIDI message contains information that describes an audio-related action. For example, when you
press a key on a keyboard, the corresponding MIDI message contains 3 bytes:
1 The first byte describes the kind of action and the channel. The first byte is referred to as the
Status Byte.
2 The second byte describes which key is pressed. The second byte is referred to as a Data Byte.
3 The third byte describes how hard the key is played. The third byte is also a Data Byte.
This message is a Note On message. Note On is referred to as the message name, command, or type.
In MATLAB, a MIDI message is packaged as a midimsg object and can be manipulated as scalars or
arrays. To create a MIDI message, call midimsg with a message type and then specify the required
parameters for the specific message type. For example, to create a note on message, specify the
midimsg Type as 'NoteOn' and then specify the required inputs: channel, note, and velocity.
channel = 1;
note = 60;
velocity = 64;
msg = midimsg('NoteOn',channel,note,velocity)
msg =
MIDI message:
NoteOn Channel: 1 Note: 60 Velocity: 64 Timestamp: 0 [ 90 3C 40 ]
For convenience, midimsg displays the message type, channel, additional parameters, timestamp,
and the constructed message in hexadecimal form. Hexadecimal is the preferred form because it has
a straightforward interpretation:
To send and receive MIDI messages, use the mididevice object functions midisend and
midireceive. When you create a mididevice object, it begins receiving data at its input and
placing it in a buffer.
receivedMessages = midireceive(device)
receivedMessages =
MIDI message:
NoteOn Channel: 1 Note: 36 Velocity: 64 Timestamp: 15861.9 [ 90 24 40 ]
NoteOn Channel: 1 Note: 36 Velocity: 0 Timestamp: 15862.1 [ 90 24 00 ]
The MIDI messages are returned as an array of midimsg objects. In this example, a MIDI keyboard
key is pressed.
midisend(device,msg)
The type of MIDI message you create is defined as a character vector or string. To create a MIDI
message, specify it by its type and the required property values. For example, create a Channel
Pressure MIDI message by entering the following at the command prompt:
channelPressureMessage = midimsg('ChannelPressure',1,20)
channelPressureMessage =
MIDI message:
ChannelPressure Channel: 1 ChannelPressure: 20 Timestamp: 0 [ D0 14 ]
After you create a MIDI message, you can modify the properties, but you cannot modify the type.
channelPressureMessage.ChannelPressure = 37
channelPressureMessage =
MIDI message:
ChannelPressure Channel: 1 ChannelPressure: 37 Timestamp: 0 [ D0 25 ]
The Audio Toolbox provides convenience syntaxes to create multiple MIDI messages used in sequence
and to create arrays of MIDI messages. See midimsg for a complete list of syntaxes.
The MIDI protocol does not define message timing and assumes that messages are acted on
immediately. Many applications require timing information for queuing and batch processing. For
convenience, the Audio Toolbox packages timing information with MIDI messages into a single
midimsg object. All midimsg objects have a Timestamp property, which is set during creation as an
optional last argument or after creation. The default Timestamp is zero.
The interpretation of the Timestamp property depends on how a MIDI message is created and used:
• When receiving MIDI messages using midireceive, the underlying infrastructure assigns a
timestamp when receiving MIDI messages. Conceptually, the timing clock starts when a
mididevice object is created and attached as a listener to a given MIDI input port. If another
mididevice is attached to the same input port, it receives timestamps from the same timing
clock as the first object.
• When sending MIDI messages using midisend, timestamps are interpreted as when to send the
message.
If there have been no recent calls to midisend, then midisend interprets timestamps as relative
to the current real-world time. A message with a timestamp of zero is sent immediately. If there
has been a recent call to midisend, then midisend interprets timestamps as relative to the
largest timestamp of the last call to midisend. The timestamp clock for midisend is specific to
the MIDI output port that mididevice is connected to.
Consider a pair of MIDI messages that turn a note on and off. The messages specify that the note
starts after one second and is sustained for one second.
Create Note On and Note Off messages. To create the Note Off message, use the 'NoteOn' MIDI
message type and specify zero velocity. (If you want to specify a velocity, use the 'NoteOff'
message type.) For more information, see midimsg.
OnMsg = midimsg('NoteOn',1,59,64);
OffMsg = midimsg('NoteOn',1,59,0);
To send on and off messages using a single call to midisend, specify the timestamps of the
messages relative to the same start time.
OnMsg.Timestamp = 1;
OffMsg.Timestamp = 2;
midisend(device,[OnMsg;OffMsg])
To send the Note Off message separately, specify the timestamp of the Note Off message relative
to the largest timestamp of the previous call to midisend.
OnMsg.Timestamp = 1;
OffMsg.Timestamp = 1;
midisend(device,OnMsg)
midisend(device,OffMsg)
The "start" time, or reference time, for midisend is the max between the absolute time and the
largest timestamp in the last call to midisend. For example, consider that x, the arbitrary start
time, is equal to the current absolute time. If there is a 1.5-second pause between sending the
note on and note off messages, the resulting note duration is 1.5 seconds.
OnMsg.Timestamp = 1;
OffMsg.Timestamp = 1;
midisend(device,OnMsg)
pause(1.5)
midisend(device,OffMsg)
Usually, MIDI messages are sent faster than or at real-time speeds so there is no need to track the
absolute time.
For live performances or to enable interrupts in a MIDI stream, you can set timestamps to zero
and then call midisend at appropriate real-world time intervals. Depending on your use case, you
can divide your MIDI stream into small repeatable time chunks.
See Also
Classes
midimsg | mididevice
Functions
midisend | midireceive | mididevinfo
Related Examples
• “Design and Play a MIDI Synthesizer” on page 4-2
External Websites
• MIDI Manufacturers Association
Dynamic Range Control
• Dynamic range compressor –– Attenuates the volume of loud sounds that cross a given threshold.
They are often used in recording systems to protect hardware and to increase overall loudness.
• Dynamic range limiter –– A type of compressor that brickwalls sound above a given threshold.
• Dynamic range expander –– Attenuates the volume of quiet sounds below a given threshold. They
are often used to make quiet sounds even quieter.
• Noise gate –– A type of expander that brickwalls sound below a given threshold.
This tutorial shows how to implement dynamic range control systems using the compressor,
expander, limiter, and noiseGate System objects from Audio Toolbox. The tutorial also provides
an illustrated example of dynamic range limiting at various stages of a dynamic range limiting
system.
In a dynamic range control system, a gain signal is calculated in a sidechain and then applied to the
input audio signal. The sidechain consists of the stages described in the following sections.
Linear to dB Conversion
The gain signal used in dynamic range control is processed on a dB scale for all dynamic range
controllers. There is no reference for the dB output; it is a straight conversion: xdB = 20log10(x). You
might need to adjust the output of a dynamic range control system to the range of your system.
Gain Computer
The gain computer provides the first rough estimate of a gain signal for dynamic range control. The
principal component of the gain computer is the static characteristic. Each type of dynamic range
control has a different static characteristic with different tunable properties:
• Threshold –– All static characteristics have a threshold. On one side of the threshold, the input is
given to the output with no modification. On the other side of the threshold, compression,
expansion, brickwall limiting, or brickwall gating is applied.
• Ratio –– Expanders and compressors enable you to adjust the input-to-output ratio of the static
characteristic above or below a given threshold.
• KneeWidth –– Expanders, compressors, and limiters enable you to adjust the knee width of the
static characteristic. The knee of a static characteristic is centered at the threshold. An increase in
knee width creates a smoother transition around the threshold. A knee width of zero provides no
smoothing and is known as a hard knee. A knee width greater than zero is known as a soft knee.
In these static characteristic plots, the expander, limiter, and compressor each have a 10 dB knee
width.
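To reproduce plots like these, you can call visualize on a dynamic range controller to display its static characteristic. For example (parameter values chosen only for illustration):
dRCHard = compressor('Threshold',-20,'Ratio',4,'KneeWidth',0);   % hard knee
dRCSoft = compressor('Threshold',-20,'Ratio',4,'KneeWidth',10);  % 10 dB soft knee
visualize(dRCHard)
visualize(dRCSoft)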
Gain Smoothing
All dynamic range controllers provide gain smoothing over time. Gain smoothing diminishes sharp
jumps in the applied gain, which can result in artifacts and an unnatural sound. You can
conceptualize gain smoothing as the addition of impedance to your gain signal.
The expander and noiseGate objects have the same smoothing equation, because a noise gate is a
type of expander. The limiter and compressor objects have the same smoothing equation, because
a limiter is a type of compressor.
The type of gain smoothing is specified by a combination of attack time, release time, and hold time
coefficients. Attack time and release time correspond to the time it takes the gain signal to go from
10% to 90% of its final value. Hold time is a delay period before gain is applied. See the algorithms of
individual dynamic range controller pages for more detailed explanations.
Smoothing Equations
expander and noiseGate
• αA and αR are determined by the sample rate and specified attack and release time:

  $\alpha_A = \exp\left(\frac{-\log(9)}{F_s \times T_A}\right), \qquad \alpha_R = \exp\left(\frac{-\log(9)}{F_s \times T_R}\right)$
• k is the specified hold time in samples.
• CA and CR are hold counters for attack and release, respectively.
limiter and compressor

• αA and αR are determined by the sample rate and specified attack and release time:

  $\alpha_A = \exp\left(\frac{-\log(9)}{F_s \times T_A}\right), \qquad \alpha_R = \exp\left(\frac{-\log(9)}{F_s \times T_R}\right)$
Examine a trivial case of dynamic range compression for a two-step input signal. In this example, the
compressor has a threshold of –10 dB, a compression ratio of 5, and a hard knee.
Several variations of gain smoothing are shown. On the top, a smoothed gain curve is shown for
different attack time values, with release time set to zero seconds. In the middle, release time is
varied and attack time is held constant at zero seconds. On the bottom, both attack and release time
are specified by nonzero values.
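A simplified sketch of this smoothing (hold time omitted), assuming the common convention that the attack coefficient applies while the gain is decreasing and the release coefficient while it is increasing, is:
Fs = 48e3; TA = 0.004; TR = 0.1;            % sample rate, attack time, release time
alphaA = exp(-log(9)/(Fs*TA));
alphaR = exp(-log(9)/(Fs*TR));
gc = [zeros(1,0.1*Fs),-8*ones(1,0.4*Fs)];   % example gain from the gain computer (dB)
gs = zeros(size(gc));                       % smoothed gain (dB)
prev = 0;
for n = 1:numel(gc)
    if gc(n) <= prev
        alpha = alphaA;                     % gain decreasing: attack
    else
        alpha = alphaR;                     % gain increasing: release
    end
    gs(n) = alpha*prev + (1 - alpha)*gc(n);
    prev = gs(n);
end
plot((0:numel(gs)-1)/Fs,[gc;gs].')
legend("Computed gain","Smoothed gain")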
Make-Up Gain
Make-up gain applies for compressors and limiters, where higher dB portions of a signal are
attenuated or brickwalled. The dB reduction can significantly reduce total signal power. In these
cases, make-up gain is applied after gain smoothing to increase the signal power. In Audio Toolbox,
you can specify a set amount of make-up gain or specify the make-up gain mode as 'auto'.
The 'auto' make-up gain ensures that a 0 dB input results in a 0 dB output. For example, assume a
static characteristic of a compressor with a soft knee:
$$x_{sc}(x_{dB}) = \begin{cases} x_{dB}, & x_{dB} < T - \frac{W}{2} \\ x_{dB} + \dfrac{\left(\frac{1}{R} - 1\right)\left(x_{dB} - T + \frac{W}{2}\right)^{2}}{2W}, & T - \frac{W}{2} \le x_{dB} \le T + \frac{W}{2} \\ T + \dfrac{x_{dB} - T}{R}, & x_{dB} > T + \frac{W}{2} \end{cases}$$
T is the threshold, W is the knee width, and R is the compression ratio. The calculated auto make-up
gain is the negative of the static characteristic equation evaluated at 0 dB:
$$\text{MAKE-UP GAIN} = -x_{sc}(0) = \begin{cases} 0, & \frac{W}{2} < T \\ -\dfrac{\left(\frac{1}{R} - 1\right)\left(T - \frac{W}{2}\right)^{2}}{2W}, & -\frac{W}{2} \le T \le \frac{W}{2} \\ -T + \dfrac{T}{R}, & -\frac{W}{2} > T \end{cases}$$
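As a numeric check of this formula with illustrative values T = –10 dB, R = 5, and W = 10 dB:
T = -10; R = 5; W = 10;
if W/2 < T
    makeUpGain = 0;
elseif -W/2 <= T && T <= W/2
    makeUpGain = -((1/R - 1)*(T - W/2)^2)/(2*W);
else
    makeUpGain = -T + T/R;   % this branch applies here: 10 - 2 = 8 dB
end
makeUpGain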
dB to Linear Conversion
Once the gain signal is determined in dB, it is transferred to the linear domain: $g_{lin} = 10^{g_m/20}$.
Consider a dynamic range limiter with these properties:
• Threshold = –15 dB
• Knee width = 0 (hard knee)
• Attack time = 0.004 seconds
• Release time = 0.1 seconds
• Make-up gain = 1 dB
To create a limiter System object with these properties, at the MATLAB command prompt, enter:
dRL = limiter('Threshold',-15,...
'KneeWidth',0,...
'AttackTime',0.004,...
'ReleaseTime',0.1,...
'MakeUpGainMode','property',...
'MakeUpGain',1);
This example provides a visual walkthrough of the various stages of the dynamic range limiter
system.
Linear to dB Conversion
Gain Computer
The static characteristic brickwall limits the dB signal at –15 dB. To determine the dB gain that
results in this limiting, the gain computer subtracts the original dB signal from the dB signal
processed by the static characteristic.
Gain Smoothing
The relatively short attack time specification results in a steep curve when the applied gain is
suddenly increased. The relatively long release time results in a gradual diminishing of the applied
gain.
Make-Up Gain
Assume a limiter with a 1 dB make-up gain value. The make-up gain is added to the smoothed gain
signal.
dB to Linear Conversion
References
[1] Zolzer, Udo. "Dynamic Range Control." Digital Audio Signal Processing. 2nd ed. Chichester, UK:
Wiley, 2008.
[2] Giannoulis, Dimitrios, Michael Massberg, and Joshua D. Reiss. "Digital Dynamic Range
Compressor Design –– A Tutorial And Analysis." Journal of Audio Engineering Society. Vol. 60,
Issue 6, 2012, pp. 399–408.
See Also
Compressor | Expander | Limiter | Noise Gate | compressor | expander | limiter | noiseGate
More About
• “Dynamic Range Compression Using Overlap-Add Reconstruction” on page 1-167
MIDI Control for Audio Plugins
In the MATLAB environment, audio plugins are defined as any valid class that derives from the
audioPlugin base class or the audioPluginSource base class. For more information about how
audio plugins are defined in the MATLAB environment, see “Audio Plugins in MATLAB”.
• configureMIDI –– Configure MIDI connections between audio plugin and MIDI controller.
• getMIDIConnections –– Get MIDI connections of audio plugin.
• disconnectMIDI –– Disconnect MIDI controls from audio plugin.
These functions combine the abilities of general MIDI functions into a streamlined and user-friendly
interface suited to audio plugins in MATLAB. For a tutorial on the general functions and the MIDI
protocol, see “MIDI Control Surface Interface” on page 8-2.
This tutorial walks you through the MIDI functions for audio plugins in MATLAB.
Before starting MATLAB, connect your MIDI control surface to your computer and turn it on. For
connection instructions, see the instructions for your MIDI device. If you start MATLAB before
connecting your device, MATLAB might not recognize your device when you connect it. To correct the
problem, restart MATLAB with the device already connected.
Use configureMIDI to establish MIDI connections between your default MIDI device and an audio
plugin. You can use configureMIDI programmatically, or you can open a user interface (UI) to guide
you through the process. The configureMIDI UI reads from your audio plugin and populates a drop-
down list of tunable plugin properties. You are then prompted to move individual controls on your
MIDI control surface to associate the position of each control with the normalized value of each
property you select. For example, create an object of audiopluginexample.PitchShifter and
then call configureMIDI with the object as the argument:
ctrlPitch = audiopluginexample.PitchShifter;
configureMIDI(ctrlPitch)
The Synchronize to MIDI controls dialog box opens with the tunable properties of your plugin
automatically populated. When you select a property and operate a MIDI control, its identification is
entered into the MIDI control column. After you synchronize tunable properties with MIDI controls,
click OK to complete the configuration. If your MIDI control surface is bidirectional, it automatically
shifts the position of the synchronized controls to the initial property values specified by your plugin.
To open a MATLAB function with the programmatic equivalent of your actions in the UI, select the
Generate MATLAB Code check box. Saving this function enables you to reuse your settings and
quickly establish the configuration in future sessions.
After you establish connections between plugin properties and MIDI controls, you can tune the
properties in real time using your MIDI control surface.
Audio Toolbox provides an all-in-one app for running and testing your audio plugin. The test bench
mimics how a DAW interacts with plugins.
audioTestBench(ctrlPitch)
When you adjust the controls on your MIDI surface, the corresponding plugin parameter dials move.
Click the Run button to run the plugin. Move the controls on your MIDI surface to hear the effect of tuning the
plugin parameters.
To establish MIDI connections and modify existing ones, click the Synchronize to MIDI Controls
button to open a configureMIDI UI.
Alternatively, you can use the MIDI connections you established in a script or function. For example,
run the following code and move your synchronized MIDI controls to hear the pitch-shifting effect:
fileReader = dsp.AudioFileReader(...
'Filename','Counting-16-44p1-mono-15secs.wav');
deviceWriter = audioDeviceWriter;

% Audio stream loop (assumed form): process each frame with the
% MIDI-controlled pitch shifter and play it.
while ~isDone(fileReader)
    in = fileReader();
    out = ctrlPitch(in);
    deviceWriter(out);
end

release(fileReader);
release(deviceWriter);
To query the MIDI connections established with your audio plugin, use the getMIDIConnections
function. getMIDIConnections returns a structure with fields corresponding to the tunable
properties of your plugin. The corresponding values are nested structures containing information
about the mapping between your plugin property and the specified MIDI control.
connectionInfo = getMIDIConnections(ctrlPitch)
connectionInfo =
connectionInfo.PitchShift
ans =
Law: 'int'
Min: -12
Max: 12
MIDIControl: 'control 1081 on 'BCF2000''
As a best practice, release external devices such as MIDI control surfaces when you are done.
disconnectMIDI(ctrlPitch)
See Also
Apps
Audio Test Bench
Classes
audioPlugin | audioPluginSource
Functions
configureMIDI | disconnectMIDI | getMIDIConnections
More About
• “What Are DAWs, Audio Plugins, and MIDI Controllers?”
• “MIDI Control Surface Interface” on page 8-2
• “Audio Plugins in MATLAB”
• “Host External Audio Plugins”
External Websites
• https://www.midi.org
MIDI Control Surface Interface
In this section...
“About MIDI” on page 8-2
“MIDI Control Surfaces” on page 8-2
“Use MIDI Control Surfaces with MATLAB and Simulink” on page 8-3
About MIDI
Musical Instrument Digital Interface (MIDI) was originally developed to interconnect electronic
musical instruments. This interface is flexible and has uses in applications far beyond musical
instruments. Its simple unidirectional messaging protocol supports many different kinds of
messaging. One kind of MIDI message is the Control Change message, which is used to communicate
changes in controls, such as knobs, sliders, and buttons.
Because the MIDI messaging protocol is unidirectional, determining a particular controller position
requires that the receiver listen for Control Change messages that the controller sends. The protocol
does not support querying the MIDI controller for its position.
The simplest MIDI control surfaces are unidirectional: They send MIDI Control Change messages but
do not receive them. More sophisticated control surfaces are bidirectional: They can both send and
receive Control Change messages. These control surfaces have knobs or sliders that can operate
automatically. For example, a control surface can have motorized sliders or knobs. When it receives a
Control Change message, the appropriate control moves to the position in the message.
This diagram shows a typical workflow involving general MIDI functions in MATLAB. For the Simulink
environment, follow steps 1 and 2, and then use the MIDI Controls block for a user-interface guided
workflow.
Before starting MATLAB, connect your MIDI control surface to your computer and turn it on. For
connection instructions, see the instructions for your MIDI device. If you start MATLAB before
connecting your device, MATLAB might not recognize your device when you connect it. To correct the
problem, restart MATLAB with the device already connected.
Use the midiid function to determine the device name and control numbers of your MIDI control
surface. After you call midiid, it continues to listen until it receives a Control Change message.
When it receives a Control Change message, it returns the control number associated with the MIDI
controller number that you manipulated, and optionally returns the device name of your MIDI control
surface. The manufacturer and host operating system determine the device name. See “Control
Numbers” on page 8-7 for an explanation of how MATLAB calculates the control number.
To set a default device name, see “Set Default MIDI Device” on page 8-7.
View Example
Call midiid with two outputs and then move a controller on your MIDI device. midiid returns the
control number specific to the controller you moved and the device name of the MIDI control surface.
[controlNumber,deviceName] = midiid;
Use the midicontrols function to create an object that listens for Control Change messages and
caches the most recent values corresponding to specified controllers. When you create a
midicontrols object, you specify a MIDI control surface by its device name and specific controllers
on the surface by their associated control numbers. Because the midicontrols object cannot query
the MIDI control surface for initial values, consider setting initial values when creating the object.
View Example
Identify two control numbers on your MIDI control surface. Choose initial control values for the
controls you identified. Create a midicontrols object that listens to Control Change messages that
arrive from the controllers you identified on the device you identified. When you create your
midicontrols object, also specify initial control values. These initial control values work as default
values until a Control Change message is received.
controlNum1 = midiid;
[controlNum2,deviceName] = midiid;
initialControlValues = [0.1,0.9];
midicontrolsObject = midicontrols([controlNum1,controlNum2],initialControlValues,'MIDIDevice',deviceName);
Use the midiread function to query your midicontrols object for current control values.
midiread returns a matrix with values corresponding to all controllers the midicontrols object is
listening to. Generally, you want to place midiread in an audio stream loop for continuous updating.
View Example
Place midiread in an audio stream loop to return the current control value of a specified controller.
Use the control value to apply gain to an audio signal.
[controlNumber, deviceName] = midiid;
initialControlValue = 1;
midicontrolsObject = midicontrols(controlNumber,initialControlValue,'MIDIDevice',deviceName);
% Create a file reader and a device writer to stream audio. (These setup
% lines are assumed; any audio file on the MATLAB path works.)
fileReader = dsp.AudioFileReader('Counting-16-44p1-mono-15secs.wav');
deviceWriter = audioDeviceWriter('SampleRate',fileReader.SampleRate);

% In an audio stream loop, read an audio signal frame from the file, apply
% gain specified by the control on your MIDI device, and then write the
% frame to your audio output device. By default, the control value returned
% by midiread is normalized.
while ~isDone(fileReader)
    audioData = step(fileReader);
    controlValue = midiread(midicontrolsObject);
    gain = controlValue*2;
    audioDataWithGain = audioData*gain;
    deviceWriter(audioDataWithGain);
end
You can use midisync to send Control Change messages to your MIDI control surface. If the MIDI
control surface is bidirectional, it adjusts the specified controllers. One important use of midisync is
to set the controller positions on your MIDI control surface to initial values.
View Example
In this example, you initialize a controller on your MIDI control surface to a specified position. Calling
midisync(midicontrolsObject) sends a Control Change message to your MIDI control surface,
using the initial control values specified when you created the midicontrols object.
[controlNumber,deviceName] = midiid;
initialControlValue = 0.5;
midicontrolsObject = midicontrols(controlNumber,initialControlValue,'MIDIDevice',deviceName);
midisync(midicontrolsObject);
Another important use of midisync is to update your MIDI control surface if control values are
adjusted in an audio stream loop. In this case, you call midisync with both your midicontrols
object and the updated control values.
View Example
In this example, you check the normalized output volume in an audio stream loop. If the volume is
above a given threshold, midisync is called and the MIDI controller that controls the applied gain is
reduced.
while ~isDone(fileReader)
    audioData = step(fileReader);
    controlValue = midiread(midicontrolsObject);
    gain = controlValue*2;
    audioDataWithGain = audioData*gain;
    % If the output level crosses a threshold (0.8 here is an illustrative
    % choice), halve the control value and send it back to the MIDI control
    % surface with midisync.
    if any(abs(audioDataWithGain) > 0.8,'all')
        controlValue = controlValue/2;
        midisync(midicontrolsObject,controlValue);
    end
    deviceWriter(audioDataWithGain);
end
release(fileReader);
release(deviceWriter);
8-6
MIDI Control Surface Interface
midisync is also a powerful tool in systems that also involve user interfaces (UIs), so that when one
control is changed, the other control tracks it. Typically, you implement such tracking by setting
callback functions on both the midicontrols object (using midicallback) and the UI control. The
callback for the midicontrols object sends new values to the UI control. The UI uses midisync to
send new values to the midicontrols object and MIDI control surface. See midisync for examples.
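The following is a minimal sketch of that pattern, assuming a uifigure-based slider and the control number 1081 used elsewhere in this chapter; adapt the names and ranges to your own control surface and UI.
% Minimal sketch: keep a UI slider and a MIDI control in sync.
mc  = midicontrols(1081);              % control number is an assumption
fig = uifigure;
s   = uislider(fig,'Limits',[0 1]);

% MIDI -> UI: when the surface control moves, update the slider.
midicallback(mc,@(varargin) set(s,'Value',midiread(mc)));

% UI -> MIDI: when the slider moves, update the midicontrols object and
% the control surface.
s.ValueChangedFcn = @(src,~) midisync(mc,src.Value);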
You can use midicallback as an alternative to placing midiread in an audio stream loop. If a
midicontrols object receives a Control Change message, midicallback automatically calls a
specified function handle. The callback function typically calls midiread to determine the new value
of the MIDI controls. You can use this callback when you want a MIDI controller to trigger an action,
such as updating a UI. Using this approach prevents having a MATLAB program continuously running
in the command window.
You can set the default MIDI device in the MATLAB environment by using the setpref function. Use
midiid to determine the name of the device, and then use setpref to set the preference.
[~,deviceName] = midiid
deviceName =
BCF2000
setpref('midi','DefaultDevice',deviceName)
This preference persists across MATLAB sessions, so you only have to set it once, unless you want to
change devices.
If you do not set this preference, MATLAB and the host operating system choose a device for you.
However, such autoselection can cause unpredictable results because many computers have "virtual"
(software) MIDI devices installed that you might not be aware of. For predictable behavior, set the
preference.
Control Numbers
MATLAB defines control numbers as (MIDI channel number) × 1000 + (MIDI controller number).
8-7
8 MIDI Control Surface Interface
• MIDI channel number is the transmission channel that your device uses to send messages. This
value is in the range 1–16.
• MIDI controller number is a number assigned to an individual control on your MIDI device. This
value is in the range 1–127.
Your MIDI device determines the values of MIDI channel number and MIDI controller number.
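For example, a control that transmits on MIDI channel 1 with MIDI controller number 81 has control number 1 × 1000 + 81 = 1081.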
See Also
Blocks
MIDI Controls
Functions
midicallback | midisync | midiread | midicontrols | midiid
More About
• “What Are DAWs, Audio Plugins, and MIDI Controllers?”
• “Real-Time Audio in MATLAB”
• “MIDI Device Interface” on page 5-2
• “MIDI Control for Audio Plugins” on page 7-2
External Websites
• https://www.midi.org
8-8
9
audioTestBench
In the Object Under Test box, enter audiopluginexample.VarSlopeBandpassFilter and press Enter. Alternatively, you can click the browse button to select a file containing a plugin
class definition or an external plugin binary. The Audio Test Bench automatically displays the tunable
parameters of the audiopluginexample.VarSlopeBandpassFilter audio plugin.
9-2
Develop, Analyze, and Debug Plugins In Audio Test Bench
The mapping between the tunable parameters of your object and the UI controls on the Audio Test
Bench is determined by audioPluginInterface and audioPluginParameter in the class
definition of your object.
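For illustration, a plugin class might expose a tunable property as follows; the plugin name, property name, range, and units in this sketch are hypothetical and not taken from the example plugins.
classdef tinyExamplePlugin < audioPlugin % hypothetical plugin
    properties
        Cutoff = 1000 % tunable property rendered as a UI control
    end
    properties (Constant)
        PluginInterface = audioPluginInterface( ...
            audioPluginParameter('Cutoff', ...
                'DisplayName','Cutoff','Label','Hz', ...
                'Mapping',{'log',20,20000}));
    end
    methods
        function out = process(~,in)
            out = in; % passthrough; the point here is the parameter mapping
        end
    end
end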
Under Objects Under Test in the Test Bench View, click the add button and add the
audiopluginexample.VolumeController plugin. The two plugins are now connected in a
cascade where the audiopluginexample.VarSlopeBandpassFilter plugin processes the input
signal, and its output is then processed by the audiopluginexample.VolumeController plugin.
Right-click the Volume Controller tab and select the Left/Right configuration under Tile All to
view the tunable parameters of both plugins side by side.
9-3
9 Use the Audio Test Bench
You can change the order of the plugins in the cascade by selecting a plugin from the Objects Under
Test and clicking the up or down arrow button to move it. You can also remove a plugin by selecting it
and clicking the remove button.
To run the Audio Test Bench and stream audio through your plugins, click the run button. Use the UI
controls to tune the plugin parameters while streaming.
To stop the audio stream loop, click the stop button. The MATLAB command line and the objects used
by the test bench are then released.
To reset internal states of your audio plugin and return the UI controls to their initial positions, select
the reset option.
9-4
Develop, Analyze, and Debug Plugins In Audio Test Bench
To open the source file of your audio plugin, click the source file button and select the plugin from the dropdown list.
For this example, select Volume Controller.
You can inspect the source code of your audio plugin, set breakpoints on it, and modify the code. Set
a breakpoint in the set.TransitionTime function and then click the run button on the Audio Test Bench.
The Audio Test Bench runs your plugin until it reaches the breakpoint. To reach the breakpoint, move
the Transition Time dial. To stop debugging, remove the breakpoint and click Continue in the
MATLAB editor.
Open Scopes
To open a time scope to visualize the time-domain input and output, click the time scope button. To
open a spectrum analyzer to visualize the frequency-domain input and output, click the spectrum
analyzer button. The scopes display the input to
the Audio Test Bench and the output of the last plugin in the cascade.
9-5
9 Use the Audio Test Bench
You can enter any file name included on the MATLAB path. To specify a file that is not on the
MATLAB path, specify the full file path.
3 In the Audio file box, enter: RockDrums-44p1-stereo-11secs.mp3
Click the settings button again to close the settings panel. To run the Audio Test Bench with your new
input, click the run button. To release your output object and stop the audio stream loop, click the
stop button.
The Audio Device Writer and Both options are not supported in MATLAB Online.
9-6
Develop, Analyze, and Debug Plugins In Audio Test Bench
1 Choose to output to device and file by selecting Both from the Output menu.
2 To open the settings panels for Audio Device Writer and Audio File Writer configuration,
click the settings button.
9-7
9 Use the Audio Test Bench
Custom visualizations are available only in MATLAB and not in generated plugins.
You can synchronize plugin parameters with MIDI controls. To open a MIDI configuration UI, click the MIDI button and select Variable Slope Bandpass.
Synchronize the LowCutoff and HighCutoff properties with MIDI controls you choose. Click OK.
To run your audio plugin, click the run button. Adjust your plugin properties in real time using your synchronized
MIDI controls and sliders. Your processed audio file is saved to the current folder according to the
Audio File Writer settings you configured for the Output.
9-8
Develop, Analyze, and Debug Plugins In Audio Test Bench
To open the validation and generation dialog box, click the validate and generate button and select Variable Slope Bandpass.
You can validate your MATLAB audio plugin code and generate audio plugin binaries. In the Coder
configuration section, you can specify libraries for deep learning and code replacement when
generating plugins. See generateAudioPlugin, validateAudioPlugin, and
audioPluginConfig for more information.
9-9
9 Use the Audio Test Bench
9-10
Develop, Analyze, and Debug Plugins In Audio Test Bench
You can modify the code for complete control over the test bench environment.
See Also
Audio Test Bench | validateAudioPlugin | generateAudioPlugin | audioPlugin
More About
• “Audio Plugins in MATLAB”
• “Audio Plugin Example Gallery” on page 10-2
• “Export a MATLAB Plugin to a DAW”
9-11
10
Audio Effects
Filters
Gain Control
Spatial Audio
Speech Processing
Deep Learning
See Also
Audio Test Bench | audioPlugin | audioPluginSource | audioPluginInterface |
audioPluginParameter
More About
• “Develop, Analyze, and Debug Plugins In Audio Test Bench” on page 9-2
• “Audio Plugins in MATLAB”
10-2
11
Equalization
11 Equalization
Equalization
Equalization (EQ) is the process of weighting the frequency spectrum of an audio signal.
• Lowpass and highpass filters –– Attenuate high frequency and low frequency content, respectively.
• Low-shelf and high-shelf equalizers –– Boost or cut frequencies equally above or below a desired
cutoff point.
• Parametric equalizers –– Selectively boost or cut frequency bands. Also known as peaking filters.
• Graphic equalizers –– Selectively boost or cut octave or fractional octave frequency bands. The
bands have standards-based center frequencies. Graphic equalizers are a special case of
parametric equalizers.
This tutorial describes how Audio Toolbox implements the design functions: designParamEQ,
designShelvingEQ, and designVarSlopeFilter. The multibandParametricEQ System object
combines the filter design functions into a multiband parametric equalizer. The graphicEQ System
object combines the filter design functions and the octaveFilter System object for standards-
based graphic equalization. For a tutorial focused on using the design functions in MATLAB, see
“Parametric Equalizer Design” on page 1-370.
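For example, a minimal call to designParamEQ might look like the following sketch. The order, gain, center frequency, and bandwidth values are placeholders, and the second-order-section assembly assumes the denominator output omits the unity leading coefficient (check the designParamEQ help for the exact output format).
N  = 2;    % filter order
G  = 6;    % peak gain (dB)
Wo = 0.25; % normalized center frequency
BW = 0.1;  % normalized bandwidth
[B,A] = designParamEQ(N,G,Wo,BW);

% Assemble second-order sections and inspect the magnitude response.
sos = [B.', ones(size(B,2),1), A.'];
fvtool(sos)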
EQ Filter Design
Audio Toolbox design functions use the bilinear transform method of digital filter design to determine
your equalizer coefficients. In the bilinear transform method, you design an analog prototype filter and
then apply a bilinear transformation to map the prototype to the desired digital equalizer.
11-2
Equalization
Audio Toolbox uses the high-order parametric equalizer design presented in [1]. In this design
method, the analog prototype is taken to be a low-shelf Butterworth filter:
$$H_a(s) = \left(\frac{g\beta + s}{\beta + s}\right)^{r}\prod_{i=1}^{L}\frac{g^2\beta^2 + 2\,g\,s_i\,\beta s + s^2}{\beta^2 + 2\,s_i\,\beta s + s^2}$$

where:

• $r = 0$ if $N$ is even, and $r = 1$ if $N$ is odd
• $g = G^{1/N}$
• $\beta = \Omega_B \times \left(\dfrac{G^2 - G_B^2}{G_B^2 - 1}\right)^{-\frac{1}{2N}} = \tan\!\left(\dfrac{\pi\,\Delta\omega}{2}\right) \times \left(\dfrac{G^2 - G_B^2}{G_B^2 - 1}\right)^{-\frac{1}{2N}}$, where $\Delta\omega$ is the desired digital bandwidth
• $s_i = \sin\!\left(\dfrac{(2i - 1)\pi}{2N}\right),\quad i = 1, 2, \ldots, L$
For parametric equalizers, the analog prototype is reduced by setting the bandwidth gain to the
square root of the peak gain (GB = sqrt(G)).
After the design parameters are specified, the analog prototype is transformed directly to the desired
digital equalizer by a bandpass bilinear transformation:
11-3
11 Equalization
This transformation doubles the filter order. Every first-order analog section becomes a second-order
digital section. Every second-order analog section becomes a fourth-order digital section. Audio
Toolbox always calculates fourth-order digital sections, which means that returning second-order
sections requires the computation of roots, and is less efficient.
Digital Coefficients
The digital transfer function is implemented as a cascade of second-order and fourth-order sections.
$$H(z) = \left(\frac{b_{00} + b_{01}z^{-1} + b_{02}z^{-2}}{1 + a_{01}z^{-1} + a_{02}z^{-2}}\right)^{r}\prod_{i=1}^{L}\frac{b_{i0} + b_{i1}z^{-1} + b_{i2}z^{-2} + b_{i3}z^{-3} + b_{i4}z^{-4}}{1 + a_{i1}z^{-1} + a_{i2}z^{-2} + a_{i3}z^{-3} + a_{i4}z^{-4}}$$
The coefficients are given by performing the bandpass bilinear transformation on the analog
prototype design.
Biquadratic Case
$$D_0 = 1 + \frac{\Omega_B}{\sqrt{G}}$$

$$b_{00} = \frac{1 + \Omega_B\sqrt{G}}{D_0},\qquad b_{01} = \frac{-2\cos(\omega_0)}{D_0},\qquad b_{02} = \frac{1 - \Omega_B\sqrt{G}}{D_0}$$

$$a_{01} = \frac{-2\cos(\omega_0)}{D_0},\qquad a_{02} = \frac{1 - \Omega_B/\sqrt{G}}{D_0}$$
Denormalizing the a00 coefficient, and making substitutions of A =sqrt(G), ΩB ≅ α yields the familiar
peaking EQ coefficients described in [2].
$$\Omega_B = \tan\!\left(\frac{\Delta\omega}{2}\right) = \sin(\omega_0)\,\sinh\!\left(\frac{\ln 2}{2}\,B\right),$$
11-4
Equalization
where

$$B = \frac{\omega_0}{\sin\omega_0}\times BW$$
Substituting the approximation for B into the ΩB equation yields the definition of α in [2]:
$$\alpha = \sin(\omega_0)\,\sinh\!\left(\frac{\ln 2}{2}\times\frac{\omega_0}{\sin\omega_0}\times BW\right)$$
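The following MATLAB sketch evaluates the biquadratic formulas above numerically; the sample rate, center frequency, bandwidth, and gain values are arbitrary examples.
Fs = 48000; f0 = 1000; BW = 1.5; GdB = 6;          % example values
w0 = 2*pi*f0/Fs;
G  = 10^(GdB/20);                                   % linear peak gain
OmegaB = sin(w0)*sinh((log(2)/2)*(w0/sin(w0))*BW);  % Omega_B (the alpha of [2])
D0 = 1 + OmegaB/sqrt(G);
b  = [1 + OmegaB*sqrt(G), -2*cos(w0), 1 - OmegaB*sqrt(G)]/D0;
a  = [1, -2*cos(w0)/D0, (1 - OmegaB/sqrt(G))/D0];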
To design lowpass and highpass filters, Audio Toolbox uses a special case of the filter design for
parametric equalizers. In this design, the peak gain, G, is set to 0, and $G_B^2$ is set to 0.5 (the –3 dB cutoff).
The cutoff frequency of the lowpass filter corresponds to 1 – ΩB. The cutoff frequency of the highpass
filter corresponds to ΩB.
Digital Coefficients
The table summarizes the results of the bandpass bilinear transformation. The digital center
frequency, ω0, is set to π for lowpass filters and 0 for highpass filters.
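A minimal use of designVarSlopeFilter, with placeholder slope and cutoff values and the highpass variant shown as a comment, looks like this sketch:
slope = 12;   % dB/octave
Fc    = 0.2;  % normalized cutoff frequency
[B,A] = designVarSlopeFilter(slope,Fc);         % lowpass design
% [B,A] = designVarSlopeFilter(slope,Fc,"hi");  % highpass variant
fvtool([B.', ones(size(B,2),1), A.'])           % inspect the response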
11-5
11 Equalization
Audio Toolbox implements the shelving filter design presented in [2]. In this design, the high-shelf
and low-shelf analog prototypes are presented separately:
$$H_L(s) = A\,\frac{s^2 + \dfrac{\sqrt{A}}{Q}s + A}{A\,s^2 + \dfrac{\sqrt{A}}{Q}s + 1} \qquad\qquad H_H(s) = A\,\frac{A\,s^2 + \dfrac{\sqrt{A}}{Q}s + 1}{s^2 + \dfrac{\sqrt{A}}{Q}s + A}$$
For compactness, the analog filters are presented with variables A and Q. You can convert A and Q to
available Audio Toolbox design parameters:
$$A = 10^{G/40}$$

$$\frac{1}{Q} = \sqrt{\left(A + \frac{1}{A}\right)\left(\frac{1}{\mathrm{slope}} - 1\right) + 2}$$
After you specify the design parameters, the analog prototype is transformed to the desired digital
shelving filter by a bilinear transformation with prewarping:
$$s = \frac{z - 1}{z + 1}\times\frac{1}{\tan\!\left(\dfrac{\omega_0}{2}\right)}$$
Digital Coefficients
The table summarizes the results of the bilinear transformation with prewarping.
11-6
Equalization
Low-Shelf

$$b_0 = A\left[(A + 1) - (A - 1)\cos(\omega_0) + 2\sqrt{A}\,\alpha\right]$$
$$b_1 = 2A\left[(A - 1) - (A + 1)\cos(\omega_0)\right]$$
$$b_2 = A\left[(A + 1) - (A - 1)\cos(\omega_0) - 2\sqrt{A}\,\alpha\right]$$
$$a_0 = (A + 1) + (A - 1)\cos(\omega_0) + 2\sqrt{A}\,\alpha$$
$$a_1 = -2\left[(A - 1) + (A + 1)\cos(\omega_0)\right]$$
$$a_2 = (A + 1) + (A - 1)\cos(\omega_0) - 2\sqrt{A}\,\alpha$$

High-Shelf

$$b_0 = A\left[(A + 1) + (A - 1)\cos(\omega_0) + 2\sqrt{A}\,\alpha\right]$$
$$b_1 = -2A\left[(A - 1) + (A + 1)\cos(\omega_0)\right]$$
$$b_2 = A\left[(A + 1) + (A - 1)\cos(\omega_0) - 2\sqrt{A}\,\alpha\right]$$
$$a_0 = (A + 1) - (A - 1)\cos(\omega_0) + 2\sqrt{A}\,\alpha$$
$$a_1 = 2\left[(A - 1) - (A + 1)\cos(\omega_0)\right]$$
$$a_2 = (A + 1) - (A - 1)\cos(\omega_0) - 2\sqrt{A}\,\alpha$$

Intermediate Variables

$$\alpha = \frac{\sin\omega_0}{2}\sqrt{\left(A + \frac{1}{A}\right)\left(\frac{1}{\mathrm{slope}} - 1\right) + 2}$$
$$\omega_0 = 2\pi\,\frac{\text{Cutoff Frequency}}{F_s}$$
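A corresponding low-shelf design call might look like the following sketch; the gain, slope, and normalized cutoff values are placeholders, and the high-shelf variant is shown as a comment (argument order follows the designShelvingEQ help).
gain  = 6;    % shelf gain (dB)
slope = 0.8;  % shelf slope
Fc    = 0.1;  % normalized cutoff frequency
[B,A] = designShelvingEQ(gain,slope,Fc);         % low-shelf design
% [B,A] = designShelvingEQ(gain,slope,Fc,"hi");  % high-shelf variant
fvtool([B.', 1, A.'])                            % inspect the response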
References
[1] Orfanidis, Sophocles J. "High-Order Digital Parametric Equalizer Design." Journal of the Audio
Engineering Society. Vol. 53, November 2005, pp. 1026–1046.
[2] Bristow-Johnson, Robert. "Cookbook Formulae for Audio EQ Biquad Filter Coefficients." Accessed
March 02, 2016. http://www.musicdsp.org/files/Audio-EQ-Cookbook.txt.
[3] Orfanidis, Sophocles J. Introduction to Signal Processing. Englewood Cliffs, NJ: Prentice Hall,
2010.
[4] Bristow-Johnson, Robert. "The Equivalence of Various Methods of Computing Biquad Coefficients
for Audio Parametric Equalizers." Presented at the 97th Convention of the AES, San
Francisco, November 1994, AES Preprint 3906.
See Also
designVarSlopeFilter | designParamEQ | designShelvingEQ | multibandParametricEQ |
graphicEQ
More About
• “Parametric Equalizer Design” on page 1-370
• “Graphic Equalization” on page 1-187
• “Octave-Band and Fractional Octave-Band Filters” on page 1-378
• “Audio Weighting Filters” on page 1-197
11-7
12
Deployment
This example shows how to accelerate a real-time audio application using C code generation with
MATLAB® Coder™. You must have the MATLAB Coder™ software installed to run this example.
Introduction
Replacing parts of your MATLAB code with an automatically generated MATLAB executable (MEX-
function) can speed up simulation. Using MATLAB Coder, you can generate readable and portable C
code and compile it into a MEX-function that replaces the equivalent section of your MATLAB
algorithm.
This example showcases code generation using an audio notch filtering application.
Notch Filtering
A notch filter is used to eliminate a specific frequency from a signal. Typical filter design parameters
for notch filters are the notch center frequency and the 3 dB bandwidth. The center frequency is the
frequency at which the filter has a linear gain of zero. The 3 dB bandwidth measures the frequency
width of the notch of the filter computed at the half-power or 3 dB attenuation point.
The helper function used in this example is helperAudioToneRemoval. The function reads an audio
signal corrupted by a 250 Hz sinusoidal tone from a file. helperAudioToneRemoval uses a notch
filter to remove the interfering tone and writes the filtered signal to a file.
You can visualize the corrupted audio signal using a spectrum analyzer.
reader = dsp.AudioFileReader("guitar_plus_tone.ogg");
while ~isDone(reader)
audio = reader();
scope(audio(:,1));
end
12-2
Desktop Real-Time Audio Acceleration with MATLAB Coder
Measure the time it takes to read the audio file, filter out the interfering tone, and write the filtered
output using MATLAB code.
tic
helperAudioToneRemoval
t1 = toc;
Next, generate a MEX-function from helperAudioToneRemoval using the MATLAB Coder function,
codegen (MATLAB Coder).
codegen helperAudioToneRemoval
Measure the time it takes to execute the MEX-function and calculate the speedup gain with a
compiled function.
tic
helperAudioToneRemoval_mex
t2 = toc;
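You can then compare the two timings, for example:
disp("Speedup factor: " + t1/t2)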
12-3
12 Deployment
See Also
Related Examples
• “Generate Standalone Executable for Parametric Audio Equalizer” on page 1-280
• “Deploy Audio Applications with MATLAB Compiler” on page 1-283
12-4
What is C Code Generation from MATLAB?
In general, the code you generate using the toolbox is portable ANSI C code. In order to use code
generation, you need a MATLAB Coder license. For more information, see “Get Started with MATLAB
Coder” (MATLAB Coder).
The simplest way to generate MEX files from your MATLAB code is by using the codegen function at
the command line. For example, if you have an existing function, myfunction.m, you can type the
following commands at the command line to compile and run the MEX function. codegen adds a platform-
specific extension to this name. In this case, the "mex" suffix is added.
codegen myfunction.m
myfunction_mex;
Within your code, you can run specific commands either as generated C code or by using the
MATLAB engine. In cases where an isolated command does not yet have code generation support,
you can use the coder.extrinsic command to embed the command in your code. This means that
the generated code reenters the MATLAB environment when it needs to run that particular
command. This is also useful if you want to embed commands that cannot generate code (such as
plotting functions).
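As a minimal sketch, a function that plots during simulation can declare the call extrinsic; the function and variable names here are placeholders.
function y = myVisualizedFilter(x) %#codegen
% plot runs back in MATLAB, while the rest of the function is generated C code.
coder.extrinsic('plot')
y = 2*x; % placeholder processing
plot(y)
end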
To generate standalone executables that run independently of the MATLAB environment, create a
MATLAB Coder project inside the MATLAB Coder Integrated Development Environment (IDE).
Alternatively, you can call the codegen command in the command line environment with appropriate
configuration parameters. A standalone executable requires you to write your own main.c or
main.cpp function. See “Generating Standalone C/C++ Executables from MATLAB Code” (MATLAB
Coder) for more information.
12-5
12 Deployment
After installation, at the MATLAB command prompt, run mex -setup. You can then use the codegen
function to compile your code.
See Also
Functions
codegen | mex
More About
• “Code Generation Workflow” (MATLAB Coder)
• Generate C Code from MATLAB Code Video
12-6
13
System Objects
• audioPlayerRecorder
• audioDeviceReader
• audioDeviceWriter
• dsp.AudioFileReader
• dsp.AudioFileWriter
Blocks
The generated code for the audio I/O features relies on prebuilt dynamic library files included with
MATLAB. You must account for these extra files when you run audio I/O features outside the MATLAB
and Simulink environments. To run a standalone executable generated from a model or code
containing the audio I/O features, set your system environment using commands specific to your
platform.
Platform    Command
Mac         setenv DYLD_LIBRARY_PATH "${DYLD_LIBRARY_PATH}:$MATLABROOT/bin/maci64" (csh/tcsh)
            export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:$MATLABROOT/bin/maci64 (Bash)
Linux       export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MATLABROOT/bin/glnxa64 (Bash)
Windows     set PATH=%PATH%;%MATLABROOT%\bin\win64
The path in these commands is valid only on systems that have MATLAB installed. If you run the
standalone app on a machine with only MCR, and no MATLAB installed, replace $MATLABROOT/
bin/... with the path to the MCR.
13-2
Run Audio I/O Features Outside MATLAB and Simulink
To run the code generated from the above System objects and blocks on a machine that does not have
MCR or MATLAB installed, use the packNGo function. The packNGo function packages all relevant
files in a compressed zip file so that you can relocate, unpack, and rebuild your project in another
development environment with no MATLAB installed.
You can use the packNGo function at the command line or the Package option in the MATLAB Coder
app. The files are packaged in a compressed file that you can relocate and unpack using a standard
zip utility. For more details on how to pack the code generated from MATLAB code, see “Package
Code for Other Development Environments” (MATLAB Coder). For more details on how to pack the
code generated from Simulink blocks, see the packNGo function.
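For example, after generating library code for a MATLAB function, you might package it as follows. The function name, build folder layout, and option values in this sketch are assumptions to adapt to your own project.
codegen myAudioFunction -config:lib -report   % add -args if your function takes inputs
load(fullfile('codegen','lib','myAudioFunction','buildInfo.mat'))
packNGo(buildInfo,'packType','hierarchical','fileName','myAudioFunction_pkg')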
See Also
More About
• “MATLAB Programming for Code Generation” (MATLAB Coder)
13-3
14
Decrease Underrun
Examine the Audio Device Writer block in a Simulink® model, determine underrun, and decrease
underrun.
1. Run the model. The Audio Device Writer sends an audio stream to your computer's default audio
output device. The Audio Device Writer block sends the number of samples underrun to your Time
Scope.
14-2
Decrease Underrun
2. Uncomment the Artificial Load block. This block performs computations that slow the simulation.
If your model continues to drop samples, increase the frame size again. The increased frame size
increases the buffer size used by the sound card. A larger buffer size decreases the possibility of
underruns at the cost of higher audio latency.
See Also
From Multimedia File | Time Scope
14-3
15
15-2
Extract Cepstral Coefficients
Use the Cepstral Feature Extractor block to extract and visualize cepstral coefficients from an audio
file.
See Also
Cepstral Feature Extractor | mfcc | gtcc | Audio Device Writer | From Multimedia File | Array Plot
15-3
15 Block Example Repository
Tune the center frequency of an Octave Filter block in Simulink® using the optional input port.
1. Run the simulation. The From Multimedia File block sends a stereo audio stream to the Octave
Filter block. The center frequency of the Octave Filter block can be tuned using the manual switches
routed into the optional input port. The filtered audio is sent to your computer's default audio device.
The filtered audio and unfiltered audio are sent to a Spectrum Analyzer block for visualization.
2. Tune the center frequency by toggling manual switches routing constant values. The constant
value routed from the left is multiplied with the constant value routed from the right. The center
frequency of the Octave Filter block can be set at 400, 800, 4000, and 8000 Hz.
3. Observe the Spectrum Analyzer as you tune the center frequency. Note how the center frequency
changes as you toggle the manual switches.
15-4
Tune Center Frequency Using Input Port
See Also
Audio Device Writer | From Multimedia File | Time Scope | Manual Switch | Octave Filter
15-5
15 Block Example Repository
This model uses if-else block signal routing to replace regions of no speech with zeros.
To explore this model, tune the Probability of transition from a silence frame to a speech frame
and Probability of transition from a speech frame to a silence frame parameters of the Voice
Activity Detector (VAD) and observe the effect on the speech presence probability. Toggle between
the gated and original audio signal to assess the quality of your system.
15-6
Gate Audio Signal Using VAD
See Also
Voice Activity Detector | Audio Device Writer | From Multimedia File | Time Scope | Random Source |
Manual Switch | If | If Action Subsystem
15-7
15 Block Example Repository
Voice Activity Detection is often used as an indication whether further processing or analysis of a
signal is required. Many processing and analysis techniques require a frequency-domain
representation of the signal. For example, the voice activity detection algorithm operates in the
frequency domain. To save computation, you can convert the audio signal to the frequency domain
once, and then feed the frequency-domain signal to downstream analysis and processing.
This model additionally buffers the signal so that the VAD operates on half-overlapped frames.
Overlapping the input frames to the VAD increases the accuracy and resolution in time of the
probability of speech.
See Also
Voice Activity Detector | Audio Device Writer | From Multimedia File | Time Scope | Window Function
| Buffer | Delay | FFT
15-8
Visualize Noise Power
This model plots the noise power estimated by the Voice Activity Detector.
To explore this model, tune the Frequency (Hz) parameter of the Sine Wave block and observe the
noise power estimate updated on the Array Plot block.
15-9
15 Block Example Repository
15-10
Visualize Noise Power
Zoom in on the Array Plot to verify that the Voice Activity Detector outputs a good estimate of the
noise tone.
See Also
Voice Activity Detector | Audio Device Writer | From Multimedia File | Time Scope | Array Plot | Sine
Wave
15-11
15 Block Example Repository
This model uses the Voice Activity Detector block to visualize the probability of speech presence in an
audio signal.
To explore this model, tune the Probability of transition from a silence frame to a speech frame
and Probability of transition from a speech frame to a silence frame parameters of the Voice
Activity Detector (VAD) and observe the effect on the speech presence probability.
The Time Scope block plots the audio signal and the associated voice activity probability.
15-12
Detect Presence of Speech
See Also
Voice Activity Detector | Audio Device Writer | From Multimedia File | Time Scope
15-13
15 Block Example Repository
2. In the Graphic EQ block, click Visualize equalizer response. Modify gains of the graphic
equalizer and see the magnitude response plot update automatically.
3. Run the model. Tune gains on the Graphic EQ to listen to the effect on your audio device and see
the effect on the Spectrum Analyzer display. Double-click the Manual Switch (Simulink) block to
toggle between the original and equalized signal as output.
15-14
Perform Graphic Equalization
See Also
Graphic EQ | Audio Device Writer | From Multimedia File | Spectrum Analyzer
15-15
15 Block Example Repository
Split-Band De-Essing
This model implements split-band de-essing by separating a speech signal into high and low
frequencies, applying dynamic range expansion to diminish the sibilant frequencies, and then
remixing the channels.
De-essing is the process of diminishing sibilant sounds in an audio signal. Sibilance refers to the s, z,
and sh sounds in speech, which can be disproportionately emphasized during recording. These sounds
fall under the category of unvoiced speech, along with all consonants, and have higher frequency
content than voiced speech.
To explore the model, tune the parameters of the Expander and Crossover Filter blocks. To switch
between listening to the processed and unprocessed speech signal, double-click the Manual Switch
block. To view the effect of the processing, double-click the Time Scope block.
See Also
Audio Device Writer | Time Scope | Expander | From Multimedia File | Crossover Filter
15-16
Diminish Plosives from Speech
This model minimizes the plosives of a speech signal by applying highpass filtering and low-band
compression.
Plosives are consonant sounds resulting from a sudden release of airflow. They are most pronounced
in p, d, and g words. Plosives can be emphasized by the recording process and are often displeasing
to hear.
To explore this model, tune the highpass filter cutoff and the parameters on the Compressor and
Crossover Filter blocks. Switch between listening to the original and processed signals by double-
clicking the Manual Switch block.
See Also
Audio Device Writer | Compressor | From Multimedia File | Crossover Filter
15-17
15 Block Example Repository
Use the Compressor block to suppress loud sounds and visualize the applied compression gain.
2. Run the model. To switch between listening to the compressed signal and the original signal,
double-click the Manual Switch (Simulink) block.
3. Observe how the applied gain depends on compression parameters and input signal dynamics by
tuning the Compressor block parameters and viewing the results on the Time Scope.
15-18
Suppress Loud Sounds
See Also
Audio Device Writer | Time Scope | From Multimedia File | Matrix Concatenate | Compressor
More About
• “Dynamic Range Control” on page 6-2
15-19
15 Block Example Repository
Use the Expander block to attenuate low-level noise and visualize the applied dynamic range control
gain.
2. Run the model. To switch between listening to the expanded signal and the original signal, double-
click the Manual Switch (Simulink) block.
3. Observe how the applied gain depends on expansion parameters and input signal dynamics by
tuning the Expander block parameters and viewing the results on the Time Scope.
15-20
Attenuate Low-Level Noise
See Also
Audio Device Writer | Time Scope | From Multimedia File | Matrix Concatenate | Colored Noise |
Expander
More About
• “Dynamic Range Control” on page 6-2
15-21
15 Block Example Repository
Suppress the volume of loud sounds and visualize the applied dynamic range control gain.
2. Run the model. To switch between listening to the gated signal and the original signal, double-click
the Manual Switch (Simulink) block.
3. Observe how the applied gain depends on dynamic range limiting parameters and input signal
dynamics by tuning Limiter block parameters and viewing the results on the Time Scope.
15-22
Suppress Volume of Loud Sounds
See Also
Audio Device Writer | Time Scope | From Multimedia File | Matrix Concatenate | Limiter
More About
• “Dynamic Range Control” on page 6-2
15-23
15 Block Example Repository
Apply dynamic range gating to remove low-level noise from an audio file.
2. Run the model. To switch between listening to the gated signal and the original signal, double-click
the Manual Switch (Simulink) block.
3. Observe how the applied gain depends on noise gate parameters and input signal dynamics by
tuning Noise Gate block parameters and viewing the results on the Time Scope.
15-24
Gate Background Noise
See Also
Audio Device Writer | Time Scope | From Multimedia File | Matrix Concatenate | Random Source |
Noise Gate
More About
• “Dynamic Range Control” on page 6-2
15-25
15 Block Example Repository
This example shows how to set the MIDI Controls block parameters to output control values from your
MIDI device.
1. Connect a MIDI device to your computer and then open the model.
2. Run the model with default settings. Move any controller on your default MIDI device to update the
Display block.
4. At the MATLAB™ command line, use midiid to determine the name of your MIDI device and two
control numbers associated with your device.
5. In the MIDI Controls block dialog box, set MIDI device to Specify other and enter the name of
your MIDI device.
6. Set MIDI controls to Respond to specified controls and enter the control numbers
determined using midiid.
7. Specify initial values as a vector the same size as MIDI control numbers. The initial values you
specify are quantized according to the MIDI protocol and your particular MIDI surface.
The dialog box shows sample values for a 'BCF2000' MIDI device with control numbers 1081 and
1083.
15-26
Output Values from MIDI Control Surface
8. Click OK, and then run the model. Verify that the Display block shows initial values and updates
when you move the specified controls.
See Also
Audio Device Writer | Time Scope | From Multimedia File | MIDI Controls
More About
• “MIDI Control Surface Interface” on page 8-2
15-27
15 Block Example Repository
Examine the Weighting Filter block in a Simulink® model and tune parameters.
2. Run the model. Switch between listening to the frequency-weighted signal and the original signal
by double-clicking the Manual Switch (Simulink) block.
3. Stop the model. Open the Weighting Filter block and choose a different weighting method. Observe
the difference in simulation.
15-28
Apply Frequency Weighting
See Also
Audio Device Writer | Spectrum Analyzer | From Multimedia File | Weighting Filter
15-29
15 Block Example Repository
Measure momentary and short-term loudness before and after compression of a streaming audio
signal in Simulink®.
2. Run the model. To switch between listening to the compressed signal and the original signal,
double-click the switch.
3. Observe the effect of compression on loudness by tuning the Compressor block parameters and
viewing the momentary loudness on the Time Scope block.
15-30
Compare Loudness Before and After Audio Processing
4. Stop the model. For both Loudness blocks, replace momentary loudness with short-term loudness
as input to the Matrix Concatenate block. Run the model again and observe the effect of compression
on short-term loudness.
See Also
Audio Device Writer | Time Scope | From Multimedia File | Matrix Concatenate | Compressor |
Loudness Meter
15-31
15 Block Example Repository
Divide a mono signal into a stereo signal with distinct frequency bands. To hear the full effect of this
simulation, use a stereo speaker system, such as headphones.
2. Run the model. To switch between listening to the filtered and original signal, double-click the
Manual Switch (Simulink) block.
3. Tune the crossover frequency on the Crossover Filter block to listen to the effect on your speakers
and view the effect on the Spectrum Analyzer block.
15-32
Two-Band Crossover Filtering for a Stereo Speaker System
See Also
Audio Device Writer | Spectrum Analyzer | From Multimedia File | Matrix Concatenate | Crossover
Filter
15-33
15 Block Example Repository
Examine the Reverberator block in a Simulink® model and tune parameters. The reverberation
parameters in this model mimic a large room with hard walls, such as a gymnasium.
1. Run the simulation. Listen to the audio signal with and without reverberation by double-clicking
the Manual Switch block.
3. Disconnect the To Multimedia File block so that you can run the model without recording.
5. Run the simulation and tune the parameters of the Reverberator block.
6. After you are satisfied with the reverberation environment, stop the simulation.
7. Reconnect the To Multimedia File block. Rename the output file with a description to match your
reverberation environment, and rerun the model.
See Also
Audio Device Writer | To Multimedia File | From Multimedia File | Matrix Concatenate | Reverberator
15-34
Perform Parametric Equalization
Examine the Single-Band Parametric EQ block in a Simulink® model and tune parameters.
2. In the Single-Band Parametric EQ block, click View Filter Response. Modify parameters of the
parametric equalizer and see the magnitude response plot update automatically.
3. Run the model. Tune parameters on the Single-Band Parametric EQ to listen to the effect on your
audio device and see the effect on the Spectrum Analyzer display. Double-click the Manual Switch
(Simulink) block to toggle between the original and equalized signal as output.
15-35
15 Block Example Repository
See Also
Audio Device Writer | Spectrum Analyzer | From Multimedia File | Single-Band Parametric EQ
15-36
Perform Octave Filtering
Examine the Octave Filter block in a Simulink® model and tune parameters.
1. Open the Octave Filter block and click Visualize filter response. Tune parameters on the Octave
Filter dialog. The filter response visualization updates automatically. If you break compliance with the
ANSI S1.11-2004 standard, the filter mask is drawn in red.
2. Run the model. Open the Spectrum Analyzer block. Tune parameters on the Octave Filter block to
listen to the effect on your audio device and see the effect on the Spectrum Analyzer display. Switch
between listening to the filtered and unfiltered audio by double-clicking the Manual Switch (Simulink)
block.
15-37
15 Block Example Repository
See Also
Audio Device Writer | Spectrum Analyzer | From Multimedia File | Octave Filter
15-38
Read from Microphone and Write to Speaker
Examine the Audio Device Reader block in a Simulink® model, modify parameters, and explore
overrun.
1. Run the model. The Audio Device Reader records an audio stream from your computer's default
audio input device. The Reverberator block processes your input audio. The Audio Device Writer
block sends the processed audio to your default audio output device.
15-39
15 Block Example Repository
2. Stop the model. Open the Audio Device Reader block and lower the Samples per frame
parameter. Open the Time Scope block to view overrun.
3. Run the model again. Lowering the Samples per frame decreases the buffer size of your Audio
Device Reader block. A smaller buffer size decreases audio latency while increasing the likelihood of
overruns.
See Also
Audio Device Writer | Audio Device Reader | Time Scope | Reverberator
More About
• “Audio I/O: Buffering, Latency, and Throughput”
15-40
Channel Mapping
Channel Mapping
Examine the Audio Device Writer block in a Simulink® model and specify a nondefault channel
mapping.
1. Run the simulation. The Audio Device Writer sends a stereo audio stream to your computer's
default audio output device. If you are using a stereo audio output device, such as headphones, you
can hear a tone from one speaker and noise from the other speaker.
c. On the Advanced tab, clear the Use default channel mapping parameter.
d. Specify the Device output channels in reverse order: [2,1]. If you are using a stereo output
device, such as headphones, you hear that the noise and tone have switched speakers.
See Also
Audio Device Writer | Random Source | Sine Wave | Matrix Concatenate
More About
• “Audio I/O: Buffering, Latency, and Throughput”
15-41
15 Block Example Repository
This model enables you to apply dynamic range compression to an audio signal while staying inside a
preset loudness range. In this model, a Compressor block increases the loudness and decreases the
dynamic range of an audio signal. A Loudness Meter block calculates the momentary loudness of the
compressed audio signal. If momentary loudness crosses a -23 LUFS threshold, an enabled subsystem
applies gain to lower the corresponding level of the audio signal.
2. Run the model. To switch between listening to the compressed signal with and without gain
adjustment, double-click the switch.
3. To observe the effect of compression on loudness, tune the Compressor block parameters and view
the compressed audio signal on the Time Scope block.
15-42
Trigger Gain Control Based on Loudness Measurement
See Also
Blocks
Audio Device Writer | Time Scope | From Multimedia File | Compressor | Loudness Meter
Objects
loudnessMeter
Functions
integratedLoudness
More About
• “Loudness Normalization in Accordance with EBU R 128 Standard” on page 1-176
15-43
15 Block Example Repository
Examine the Audio Oscillator block in a Simulink® model and tune the parameters.
1. Run the simulation. Listen to the tone from the Audio Oscillator block generating a sine wave.
Visualize the spectrums of all three waveforms on the Spectrum Analyzer. Visualize the waveforms on
the Time Scope.
2. Toggle the manual switches to listen to the square and sawtooth waves.
3. Open any of the Audio Oscillator blocks and modify the Frequency (Hz) or Amplitude parameters to
hear the effect and visualize the effect on the Spectrum Analyzer and Time Scope.
15-44
Generate Variable-Frequency Tones in Simulink
15-45
15 Block Example Repository
15-46
Trigger Reverberation Parameters
Examine the Reverberator block in a Simulink® model where the reverberation parameters are
triggered by time.
Run the simulation. Listen to the audio signal with the reverberation parameters set to Location A.
After 5 seconds, the switches change to the reverberation parameters of Location B.
15-47
15 Block Example Repository
In this model, the Wavetable Synthesizer block is used to synthesize realistic engine noise. Such a
system may be found in a vehicle where artificial engine noise enhancement is desired. The
wavetable sample is a real-world engine recorded at an unspecified RPM.
1. Run the simulation. Listen to the engine sound output from the Wavetable Synthesizer.
2. Tune the RPM source to adjust the perceived RPM of the generated engine sound. The RPM source
is lowpass smoothed using a Biquad filter, so that the engine sound ramps in a realistic fashion.
Visualize the RPM source before and after smoothing on a Scope.
15-48
Model Engine Noise
3. Use the tuning factor to increase or decrease the overall range of output frequencies. This
adjustment is useful because the wavetable sample RPM is unknown and the sound range might
require calibration.
15-49
15 Block Example Repository
Examine the Octave Filter Bank block in a Simulink® model. Apply octave band compression and
reverb to create a flanging chorus effect on the guitar signal. Use the processed signal as an overdub
layer to enhance your guitar recordings.
Using the Octave Filter Bank block allows you to separate the audio signal into multiple frequency
bands and process each band individually. In this model example, you split a guitar recording into 5
octave bands and apply compression and reverb to each band separately to create a flanging chorus
overdub layer for your recording project.
1. Double-click the Octave Filter Bank block to view its parameters. Notice the Bands as separate
output ports box is checked. This creates a direct output on the block for each filter in the bank. The
Octave ratio is also set to Base two (musical scale). To see the magnitude response of the filters in
the bank, click the View Filter Response button.
3. Tune parameters on the Reverberator and Compressor blocks to hear the effects on your audio
device and see the effect on the Spectrum Analyzer display. Switch between listening to the Original
Signal and the Filtered (Processed) Signal by double-clicking the Manual Switch (Simulink) block.
4. This Simulink® model can be used to provide overdub guitar layers in your digital audio
workstation (DAW) recording projects. Uncomment the To Multimedia File block to save your
Filtered (Processed) Signal audio to a file. In your DAW session, pan the Original Signal to the
left side of the stereo-field and pan the Filtered (Processed) Signal to the right side of the stereo-
field. This creates a wide, lush stereo image and adds depth and warmth to your guitar track.
15-50
Use Octave Filter Bank to Create Flanging Chorus Effect for Guitar Layers (Overdubs)
15-51
15 Block Example Repository
Use the Gammatone Filter Bank block to decompose a signal by passing it through a bank of
gammatone filters.
Use the Random Source block to generate the signal and observe the output of the Gammatone Filter
Bank block using the Spectrum Analyzer block. Configure the Gammatone Filter Bank block by setting
its parameters in the block dialog box.
Run the model and select the Spectrum Analyzer Block to view the output of the Gammatone Filter
Bank block.
15-52
Decompose Signal using Gammatone Filter Bank Block
See Also
Related Examples
• “Visualize Filter Response of Multiband Parametric Equalizer Block” on page 15-54
15-53
15 Block Example Repository
Perform multiband parametric equalization independently across each channel of an input using
specified center frequencies, gains, and quality factors.
Connect the Multiband Parametric EQ block to an audio input as shown in this model.
Configure the Multiband Parametric Equalizer block by setting its parameters as:
• EQ order -- 6
• Number of bands -- 3
• Frequencies (Hz) -- [100 390 800]
• Peak gains (dB) -- [3 -5 3]
• Quality factors -- [2 2 2]
• Input sample rate (Hz) -- 44100
Run the model and click the Visualize filter response button to plot the filter response in magnitude
(dB) vs. frequency (Hz).
15-54
Visualize Filter Response of Multiband Parametric Equalizer Block
Use the knobs to change frequency and gain and observe the changing response. For instance,
change Knob 5 to set the peak gain of the third frequency to 9 dB and observe the filter response.
You can also toggle the switch to listen to either the original or the filtered signal.
See Also
Multiband Parametric EQ
15-55
15 Block Example Repository
Related Examples
• “Decompose Signal using Gammatone Filter Bank Block” on page 15-52
15-56
Detect Music in Simulink Using YAMNet
The YAMNet network requires you to preprocess and extract features from audio signals by
converting them to the sample rate the network was trained on (16e3 Hz), and then extracting
overlapping mel spectrograms. The Sound Classifier block does the required preprocessing and
feature extraction that is necessary to match the preprocessing and feature extraction used to train
YAMNet.
To use YAMNet, a pretrained YAMNet network must be installed in a location on the MATLAB® path.
If a pretrained network is not installed, run the yamnetGraph function and the software provides a
download link. Click the link and unzip the file to a location on the MATLAB path.
Alternatively, execute the following commands to download and unzip the YAMNet model to your
temporary directory.
downloadFolder = fullfile(tempdir,'YAMNetDownload');
loc = websave(downloadFolder,'https://ssd.mathworks.com/supportfiles/audio/yamnet.zip');
YAMNetLocation = tempdir;
unzip(loc,YAMNetLocation)
addpath(fullfile(YAMNetLocation,'yamnet'))
Get all music sounds in the AudioSet ontology. The ontology covers a wide range of everyday sounds,
from human and animal sounds to natural and environmental sounds and to musical and
miscellaneous sounds. Use the yamnetGraph function to obtain a graph of the AudioSet ontology and
a list of all sounds supported by YAMNet. The dfsearch function returns a vector of 'Music'
sounds in the order of their discovery using depth-first search.
[ygraph,allSounds] = yamnetGraph;
musicSounds = dfsearch(ygraph,"Music");
Find the location of these musical sounds in the list of supported sounds.
[~,musicIndices] = intersect(allSounds,musicSounds);
The detectMusic model detects the musical sounds in input audio. Open and run the model. The
model starts by reading in an audio signal to classify using two From Multimedia File blocks. The first
block reads in a musical sound signal and the second block reads in an ambiance signal that is not
music. Both signals have a sample rate of 44100 Hz and contain 441 samples per channel. Using the
Manual Switch (Simulink) block, you can choose one of the two signals.
The Sound Classifier block in the model detects the scores and labels of the input audio. The Selector
(Simulink) block in the model picks the scores related to music using the vector of indices given by
musicIndices. If the maximum value of these scores is greater than 0.2, then the score is related to
music. The Scope (Simulink) block plots the maximum value of the score. The Activation dial in the
model shows this value as well. Using the Audio Device Writer block, confirm that you hear music
when the plot shows a score greater than 0.2.
open_system("detectMusic.slx")
sim("detectMusic.slx")
15-57
15 Block Example Repository
15-58
Detect Music in Simulink Using YAMNet
close_system("detectMusic.slx",0)
See Also
Functions
yamnetGraph | dfsearch
Blocks
Sound Classifier | From Multimedia File | Manual Switch | Selector | Scope | Audio Device Writer
Related Examples
• “Detect Air Compressor Sounds in Simulink Using YAMNet” on page 15-62
• “Compare Sound Classifier block with Equivalent YAMNet blocks” on page 15-60
15-59
15 Block Example Repository
The Sound Classifier block is equivalent to the cascade of the YAMNet Preprocess block and YAMNet
block. The model in this example compares the two implementations and shows their equivalence.
The input to the model is a single-channel audio signal. The signal has a sample rate of 44100 Hz and
contains 441 samples per channel. The first branch of the model contains the Sound Classifier block.
The second branch of the model contains the YAMNet Preprocess block followed by the YAMNet
block.
To use these blocks, a YAMNet pretrained network must be installed in a location on the MATLAB®
path. If a pretrained network is not installed, then open and run the model. The software provides a
download link. To download the network, click the link and unzip the file to a location on the MATLAB
path.
Alternatively, execute the following commands to download and unzip the YAMNet model to your
temporary directory.
downloadFolder = fullfile(tempdir,'YAMNetDownload');
loc = websave(downloadFolder,'https://ssd.mathworks.com/supportfiles/audio/yamnet.zip');
YAMNetLocation = tempdir;
unzip(loc,YAMNetLocation)
addpath(fullfile(YAMNetLocation,'yamnet'))
Open and run the model. The Maximum block on each branch computes the maximum value of the
vector of music scores predicted on each branch. Plot these maximum values on the Scope block and
confirm if they match. Similarly, confirm the equivalence in sound labels shown by the Display blocks.
open_system("compareblocks.slx")
sim("compareblocks.slx")
15-60
Compare Sound Classifier block with Equivalent YAMNet blocks
close_system("compareblocks.slx",0)
See Also
Blocks
Sound Classifier | YAMNet Preprocess | YAMNet | From Multimedia File | Maximum | Scope | Audio
Device Writer
Related Examples
• “Detect Music in Simulink Using YAMNet” on page 15-57
• “Detect Air Compressor Sounds in Simulink Using YAMNet” on page 15-62
15-61
15 Block Example Repository
This example shows how to use a pretrained network obtained from transfer learning within a
Simulink® model to classify audio signals obtained from an air compressor.
The network is pretrained using a data set that contains recordings from air compressors. The data
set is classified into one healthy state and seven faulty states, for a total of eight classes. For more
information on training, see “Transfer Learning Using YAMNet”.
To download this pretrained network and a set of air compressor sounds to detect, run the following
commands. These commands download and unzip the files to a location on the MATLAB® path. The
airCompressorNet.mat file stores the pretrained network.
url = 'https://ssd.mathworks.com/supportfiles/audio/YAMNetTransferLearning.zip';
AirCompressorLocation = tempdir;
dataFolder = fullfile(AirCompressorLocation,'YAMNetTransferLearning');
if ~exist(dataFolder,'dir')
disp('Downloading pretrained network ...')
unzip(url,AirCompressorLocation)
end
addpath(fullfile(AirCompressorLocation,'YAMNetTransferLearning'))
Open the detectsound.slx model. Click the Select Compressor State block. The default type
of sound is set to 'Bearing'. The model contains a YAMNet Preprocess block followed by an Image
Classifier (Deep Learning Toolbox) block.
Run the model. The YAMNet Preprocess block generates 96-by-64 sized mel spectrograms from the
input audio. The Image Classifier block uses the airCompressorNet.mat file and classifies the
signal into one of the eight classes the model is trained on. The label of the predicted class is
displayed using the Display block. The scope shows the score of the predicted class and the other
classes.
open_system("detectsound.slx")
sim("detectsound.slx")
15-62
Detect Air Compressor Sounds in Simulink Using YAMNet
While the simulation is running, you can change the input sound by double clicking the Select
Compressor State block and choosing a type of sound from the drop-down menu.
Select 'Healthy' while the simulation is running. The Display block updates the predicted label and
the Scope block shows the new scores.
15-63
15 Block Example Repository
15-64
Detect Air Compressor Sounds in Simulink Using YAMNet
See Also
Image Classifier | YAMNet Preprocess
Related Examples
• “Detect Music in Simulink Using YAMNet” on page 15-57
• “Compare Sound Classifier block with Equivalent YAMNet blocks” on page 15-60
15-65
15 Block Example Repository
Create an auditory filter bank and apply it to a signal in the frequency domain using the Design
Auditory Filter Bank block in Simulink.
15-66
Design Mel Filter Bank
Create a mel filter bank and apply it to a signal in the frequency domain using the Design Mel Filter
Bank block in Simulink.
15-67
15 Block Example Repository
Extract an auditory spectrogram from a signal using the Auditory Spectrogram block in Simulink.
15-68
Extract Mel Spectrogram
Extract the mel spectrogram from an audio signal using the Mel Spectrogram block in Simulink.
15-69
15 Block Example Repository
15-70
Filter Audio Using Shelving Filter Block
15-71
15 Block Example Repository
The VGGish Embeddings block is equivalent to the cascade of the VGGish Preprocess block and
VGGish block. The model in this example compares the two implementations and shows their
equivalence.
To use these blocks, a VGGish pretrained network must be installed in a location on the MATLAB®
path. If a pretrained network is not installed, then open and run the model. The software provides a
download link. To download the network, click the link and unzip the file to a location on the MATLAB
path.
15-72
Compare VGGish Embeddings Block with Equivalent VGGish Blocks
15-73
15 Block Example Repository
Extract gammatone cepstral coefficients and their delta features using the Auditory Spectrogram,
Cepstral Coefficients, and Audio Delta blocks in Simulink.
15-74
Include an Audio Plugin in Simulink
Double-click the block to open the dialog box. In the Audio plugin field, enter the plugin
audiopluginexample.Echo. To inspect the source code for this plugin, enter edit
audiopluginexample.Echo in the command line. Optionally, specify the name and location of the
generated System object class file using the Generated file name field.
Click OK to generate a block with the same functionality as the plugin. This also generates the
System object class file and places it in the current directory by default. The file must be on the
MATLAB path for the generated block to work.
Double-clicking the new block opens the parameter dialog box, where you can view and edit the
plugin parameters. You can also choose to specify the tunable parameters through additional input
ports on the block.
15-75
15 Block Example Repository
Use the plugin in a model to process an audio signal and listen to the results. Add a Slider (Simulink)
block to the model to tune the gain parameter of the plugin during simulation.
15-76
Include an Audio Plugin in Simulink
See Also
Audio Plugin
More About
• “Audio Plugins in MATLAB”
15-77
15 Block Example Repository
This example shows how to use a simple neural network in Simulink® to classify audio signals from
their VGGish feature embeddings using the VGGish Embeddings and Predict (Deep Learning Toolbox)
blocks.
The network is a small fully connected network that was trained on VGGish feature embeddings
extracted from air compressor audio signals. The air compressor data set consists of recordings from
air compressors in a healthy state or one of seven faulty states. For information on how the network
was trained, see “Use VGGish Embeddings for Deep Learning”.
While the simulation is running, you can change the input sound by double clicking the Select
Compressor State block and choosing a type of sound from the drop-down menu. After you change
the air compressor sound, see how the predicted class probabilities change.
15-78
Use VGGish Embeddings for Deep Learning in Simulink
15-79
16
Audio Toolbox is optimized for parameter tuning in a real-time audio stream. The System objects,
blocks, and audio plugins provide various tunable parameters, including sample rate and frame size,
making them robust tools when used in an audio stream loop.
To optimize your use of Audio Toolbox, package your audio processing algorithm as an audio plugin.
Packaging your audio algorithm as an audio plugin enables you to graphically tune your algorithm
using parameterTuner or Audio Test Bench:
• Audio Test Bench –– Creates a user interface (UI) for tunable parameters, enables you to specify
input and output from your audio stream loop, and provides access to analysis tools such as the
time scope and spectrum analyzer. Packaging your code as an audio plugin also enables you to
quickly synchronize your parameters with MIDI controls.
• parameterTuner –– Creates a UI for tunable parameters that can be used from any MATLAB
programmatic environment. You can customize your parameter controls to render as knobs,
sliders, rocker switches, toggle switches, check boxes, or drop-downs. You can also define a
custom background color, background image, or both. You can then place your audio plugin in an
audio processing loop in a programmatic environment such as a script, and then tune parameters
while the loop executes.
• App Designer –– Development environment for a large set of interactive controls with support for
2-D plots. See “Create and Run a Simple App Using App Designer” for more information.
• Programmatic workflow –– Use MATLAB functions to define your app element-by-element. This
tutorial uses a programmatic approach.
See “Ways to Build Apps” for a more detailed list of the costs and benefits of the different approaches
to parameter tuning.
Inspect the diagram for an overview of how real-time parameter tuning is implemented. To implement
real-time parameter tuning, walk through the example for explanations and step-by-step instructions.
16-2
Real-Time Parameter Tuning
To tune a parameter in an audio stream loop using a UI, you need to associate the parameter with the
position of a UI widget. To associate a parameter with a UI widget, make the parameter an object of a
handle class. Objects of handle classes are passed by reference, meaning that you can modify the
value of the object in one place and use the updated value in another. For example, you can modify
the value of the object using a slider on a figure and use the updated value in an audio processing
loop.
Objects of the parameterRef class have a name and value. The name is for display purposes on the
UI. You use the value for tuning.
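A minimal handle class matching this description might look like the following sketch (only the name and value properties described above are assumed).
classdef parameterRef < handle
    properties
        name  % display name used by the UI
        value % current value used by the audio processing loop
    end
end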
The parameterTuningUI function accepts your parameter, specified as an object handle, and the
desired range. The function creates a figure with a slider associated with your parameter. The nested
function, slidercb, is called whenever the slider position changes. The slider callback function maps
the position of the slider to the parameter range, updates the value of the parameter, and updates the
text on the UI. You can easily modify this function to tune multiple parameters in the same UI.
parameterTuningUI
Open parameterTuningUI.
function parameterTuningUI(parameter,parameterMin,parameterMax)
% Main figure
hMainFigure = figure( ...
'Name','Parameter Tuning', ...
'MenuBar','none', ...
'Toolbar','none', ...
'HandleVisibility','callback', ...
'NumberTitle','off', ...
'IntegerHandle','off');
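The remainder of the function (the slider, the value label, and the nested slidercb callback) is not shown in this excerpt. A sketch of how it might continue is below; the control layout values are assumptions.

% Continuation sketch of parameterTuningUI (layout values are assumptions).
hText = uicontrol(hMainFigure, ...
    'Style','text', ...
    'Position',[40 60 240 20], ...
    'String',sprintf('%s: %g',parameter.name,parameter.value));
uicontrol(hMainFigure, ...
    'Style','slider', ...
    'Position',[40 20 240 20], ...
    'Min',0,'Max',1, ...
    'Value',(parameter.value - parameterMin)/(parameterMax - parameterMin), ...
    'Callback',@slidercb);

    function slidercb(hSlider,~)
        % Map the slider position (0 to 1) onto the parameter range, update
        % the shared handle object, and refresh the label text.
        parameter.value = parameterMin + ...
            (parameterMax - parameterMin)*get(hSlider,'Value');
        set(hText,'String',sprintf('%s: %g',parameter.name,parameter.value));
    end
end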
Run AudioProcessingScript
drawnow limitrate
audioOut = audioIn.*x.value;
deviceWriter(audioOut);
end
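The rest of the script is not shown above. A sketch of what the complete script might look like, assuming an audio file as the source and a gain range of 0 to 1 (the file name, parameter name, and range are assumptions), is:

% AudioProcessingScript (sketch)
x = parameterRef;            % shared handle object
x.name = 'Gain';
x.value = 0.5;
parameterTuningUI(x,0,1)     % build the slider UI for the 0 to 1 range

fileReader = dsp.AudioFileReader('FunkyDrums-44p1-stereo-25secs.mp3');
deviceWriter = audioDeviceWriter('SampleRate',fileReader.SampleRate);

while ~isDone(fileReader)
    audioIn = fileReader();
    drawnow limitrate                  % let the slider callback run
    audioOut = audioIn.*x.value;       % apply the tuned gain
    deviceWriter(audioOut);
end

release(fileReader)
release(deviceWriter)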
While the script runs, move the position of the slider to update your parameter value and hear the
result.
See Also
Audio Test Bench | parameterTuner
More About
• “Real-Time Audio in MATLAB”
• “Audio Plugins in MATLAB”
• “Develop, Analyze, and Debug Plugins In Audio Test Bench” on page 9-2
• “Create and Run a Simple App Using App Designer”
• “Ways to Build Apps”
17
Tips and Tricks for Plugin Authoring
To learn more about audio plugins in general, see “Audio Plugins in MATLAB”.
While running, the Audio Test Bench calls the process method in a loop and then calls the set methods
for tuned properties. The plugin API does not specify the order in which the tuned properties are set.
It is possible to disrupt this normal method timing by interrupting the event queue. Common ways to
accidentally interrupt the event queue include calling the plot or drawnow functions.
Note plot and drawnow are only available in the MATLAB environment. plot and drawnow cannot
be included in generated plugins. See “Separate Code for Features Not Supported for Plugin
Generation” on page 17-4 for more information.
In the following code snippet, the gain applied to the left and right channels is not the same if the
associated Gain parameter is tuned during the call to process:
...
L = plugin.Gain*in(:,1);
drawnow
R = plugin.Gain*in(:,2);
out = [L,R];
...
properties (Constant)
PluginInterface = audioPluginInterface(audioPluginParameter('Gain'));
end
methods
function out = process(plugin,in)
L = plugin.Gain*in(:,1);
drawnow
R = plugin.Gain*in(:,2);
out = [L,R];
end
function set.Gain(plugin,val)
plugin.Gain = val;
end
end
end
The author interrupts the event queue in the code snippet, causing the set methods of properties
associated with parameters to be called while the process method is in the middle of execution.
Depending on your processing algorithm, interrupting the event queue can lead to inconsistent and
buggy behavior. Also, the set method might not be explicit, which can make the issue difficult to
track down. Possible fixes for the problem of event queue disruption include saving properties to local
variables, and moving the queue disruption to the beginning or end of the process method.
You can save tunable property values to local variables at the start of your processing. This technique
ensures that the values used during the process method are not updated within a single call to
process. Because accessing the value of a local variable is cheaper than accessing the value of a
property, saving properties to local variables that are accessed multiple times is a best practice.
...
gain = plugin.Gain;
L = gain*in(:,1);
drawnow
R = gain*in(:,2);
out = [L,R];
...
function out = process(plugin,in)
gain = plugin.Gain;
L = gain*in(:,1);
drawnow
R = gain*in(:,2);
out = [L,R];
end
function set.Gain(plugin,val)
plugin.Gain = val;
end
end
end
You can move the disruption to the event queue to the bottom or top of the process method. This
technique ensures that property values are not updated in the middle of the call.
...
L = plugin.Gain*in(:,1);
R = plugin.Gain*in(:,2);
out = [L,R];
drawnow
...
function out = process(plugin,in)
L = plugin.Gain*in(:,1);
R = plugin.Gain*in(:,2);
out = [L,R];
drawnow
end
function set.Gain(plugin,val)
plugin.Gain = val;
end
end
end
Separate Code for Features Not Supported for Plugin Generation
...
if coder.target('MATLAB')
...
end
...
If you generate the plugin using generateAudioPlugin, code inside the statement if
coder.target('MATLAB') is ignored.
For example, timescope is not enabled for code generation. If you run the following plugin in
MATLAB, you can use the visualize function to open a time scope that plots the input and output
power per frame.
power = 20*log10(mean(var(in)))*ones(numSamples,1);
adjustedPower = 20*log10(mean(var(out)))*ones(numSamples,1);
plugin.aScope([power,adjustedPower]);
end
end
end
function reset(plugin)
fs = getSampleRate(plugin);
plugin.aCompressor.SampleRate = fs;
reset(plugin.aCompressor)
end
end
end
function visualize(plugin)
% Visualization function. This function is public in the MATLAB
% environment. Because the plugin does not call this function
% directly, the function is not part of the code generated by
% generateAudioPlugin.
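The listing above is fragmentary in this excerpt. A minimal self-contained sketch of the same pattern is shown below; the class name, scope property, and power computation details are assumptions rather than the shipped example.

classdef gainWithScope < audioPlugin
    % Hypothetical plugin illustrating the coder.target('MATLAB') pattern.
    properties
        Gain = 1;
    end
    properties (Access = private)
        aScope = [];   % timescope used only in the MATLAB environment
    end
    properties (Constant)
        PluginInterface = audioPluginInterface(audioPluginParameter('Gain'));
    end
    methods
        function out = process(plugin,in)
            out = plugin.Gain*in;
            if coder.target('MATLAB')
                % This branch is ignored by generateAudioPlugin.
                if ~isempty(plugin.aScope)
                    numSamples = size(in,1);
                    power = 20*log10(mean(var(in)) + eps)*ones(numSamples,1);
                    adjustedPower = 20*log10(mean(var(out)) + eps)*ones(numSamples,1);
                    plugin.aScope([power,adjustedPower]);
                end
            end
        end
        function visualize(plugin)
            % Public helper, available only in MATLAB. Because process does
            % not call it directly, it is excluded from generated plugin code.
            plugin.aScope = timescope('SampleRate',getSampleRate(plugin));
        end
    end
end

Calling visualize from MATLAB creates the scope; a generated plugin simply skips the guarded branch.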
Valid uses of the reset method include:
• Clearing state
• Passing down calls to reset to component objects
• Updating properties which depend on sample rate
Invalid use of the reset method includes setting the value of any properties associated with
parameters. Do not use your reset method to set properties associated with parameters to their initial
conditions. Directly setting a property associated with a parameter causes the property to be out of
sync with the parameter. For example, the following plugin is an example of incorrect use of the reset
method.
classdef badReset < audioPlugin
properties
Gain = 1;
end
properties (Constant)
PluginInterface = audioPluginInterface(audioPluginParameter('Gain'));
end
methods
function out = process(plugin,in)
out = in*plugin.Gain;
end
function reset(plugin) % <-- Incorrect use of reset method.
plugin.Gain = 1; % <-- Never set values of a property that is
end % associated with a plugin parameter.
end
end
Implement Plugin Composition Correctly
The following excerpt from a composite plugin built from basic plugin objects shows how to pass sample rate and parameter values down to the component plugins:
setSampleRate(plugin.aPhaser, fs)
setSampleRate(plugin.aEcho, fs)
reset(plugin.aPhaser)
reset(plugin.aEcho);
end
% Use the set method of your properties to pass down property
% values to your component plugins.
function set.PhaserQ(plugin,val)
plugin.PhaserQ = val;
plugin.aPhaser.QualityFactor = val;
end
function set.EchoGain(plugin,val)
plugin.EchoGain = val;
plugin.aEcho.Gain = val;
end
end
end
Plugin composition using System objects has these key differences from plugin composition using
basic plugins.
• Immediately call setup on your component System object after it is constructed. Construction
and setup of the component object occurs inside the constructor of the composite plugin.
• If your component System object requires sample rate information, then it has a sample rate
property. Set the sample rate property in the reset method.
properties (Constant)
PluginInterface = audioPluginInterface( ...
audioPluginParameter('CrossoverFrequency', ...
'DisplayName','Crossover Frequency', ...
'Mapping',{'lin',50, 200}), ...
audioPluginParameter('CompressorThreshold', ...
'DisplayName','Compressor Threshold', ...
'Mapping',{'lin',-100,0}));
end
methods
function plugin = compositePluginWithSystemObjects
% Construct your component System objects within the composite
% plugin's constructor. Call setup immediately after
% construction.
%
% The audio plugin API requires plugins to declare the number
% of input and output channels in the plugin interface. This
% plugin uses the default 2-in 2-out configuration. Call setup
% with a sample input that has the same number of channels as
% defined in the plugin interface.
%
sampleInput = zeros(1,2);
plugin.aCrossoverFilter = crossoverFilter;
setup(plugin.aCrossoverFilter,sampleInput)
plugin.aCompressor = compressor;
setup(plugin.aCompressor,sampleInput)
end
function out = process(plugin,in)
% Call your component System objects inside the call to
% process of your composite plugin.
[band1,band2] = plugin.aCrossoverFilter(in);
band1Compressed = plugin.aCompressor(band1);
out = band1Compressed + band2;
end
function reset(plugin)
% Set the sample rate properties of your component System
% objects.
fs = getSampleRate(plugin);
plugin.aCrossoverFilter.SampleRate = fs;
plugin.aCompressor.SampleRate = fs;
reset(plugin.aCrossoverFilter)
reset(plugin.aCompressor);
end
% Use the set method of your properties to pass down property
% values to your component System objects.
function set.CrossoverFrequency(plugin,val)
plugin.CrossoverFrequency = val;
plugin.aCrossoverFilter.CrossoverFrequencies = val;
end
function set.CompressorThreshold(plugin,val)
plugin.CompressorThreshold = val;
plugin.aCompressor.Threshold = val;
end
end
end
The following code snippet follows the plugin authoring best practice for processing changes in
parameter property Cutoff.
classdef highpassFilter < audioPlugin
...
properties (Constant)
PluginInterface = audioPluginInterface( ...
audioPluginParameter('Cutoff', ...
'Label','Hz',...
'Mapping',{'log',20,2000}));
end
methods
function y = process(plugin,x)
[y,plugin.State] = filter(plugin.B,plugin.A,x,plugin.State);
end
function set.Cutoff(plugin,val)
plugin.Cutoff = val;
[plugin.B,plugin.A] = highpassCoeffs(plugin,val,getSampleRate(plugin)); % <<<< warning occurs here
end
end
...
end
%-----------------------------------------------------------------------
% Private Properties - Used for internal storage
%-----------------------------------------------------------------------
properties (Access = private)
State = zeros(2);
B = zeros(1,3);
A = zeros(1,3);
end
%-----------------------------------------------------------------------
% Constant Properties - Used to define plugin interface
%-----------------------------------------------------------------------
properties (Constant)
PluginInterface = audioPluginInterface( ...
audioPluginParameter('Cutoff', ...
'Label','Hz', ...
'Mapping',{'log',20,2000}));
end
methods
%-------------------------------------------------------------------
% Main processing function
%-------------------------------------------------------------------
function y = process(plugin,x)
[y,plugin.State] = filter(plugin.B,plugin.A,x,plugin.State);
end
%-------------------------------------------------------------------
% Set Method
%-------------------------------------------------------------------
function set.Cutoff(plugin,val)
plugin.Cutoff = val;
[plugin.B,plugin.A] = highpassCoeffs(plugin,val,getSampleRate(plugin)); % <<<< warning occurs here
end
%-------------------------------------------------------------------
% Reset Method
%-------------------------------------------------------------------
function reset(plugin)
plugin.State = zeros(2);
[plugin.B,plugin.A] = highpassCoeffs(plugin,plugin.Cutoff,getSampleRate(plugin));
end
end
methods (Access = private)
%-------------------------------------------------------------------
% Calculate Filter Coefficients
%-------------------------------------------------------------------
function [B,A] = highpassCoeffs(~,fc,fs)
w0 = 2*pi*fc/fs;
alpha = sin(w0)/sqrt(2);
cosw0 = cos(w0);
norm = 1/(1+alpha);
B = (1 + cosw0)*norm * [.5 -1 .5];
A = [1 -2*cosw0*norm (1 - alpha)*norm];
end
end
end
The highpassCoeffs function might be expensive, and should be called only when necessary. You
do not want to call highpassCoeffs in the process method, which runs in the real-time audio
processing loop. The logical place to call highpassCoeffs is in set.Cutoff. However, mlint
shows a warning for this practice. The warning is intended to help you avoid initialization order
issues when saving and loading classes. See “Avoid Property Initialization Order Dependency” for
more details. The solution recommended by the warning is to create a dependent property with a get
method and compute the value there. However, following the recommendation complicates the design
and pushes the computation back into the real-time processing method, which you are trying to avoid.
You might also incur the warning when correctly implementing plugin composition. For an example of
a correct implementation of composition, see “Implement Plugin Composition Correctly” on page 17-
6.
Audio plugins must support variable-size input frames. If your plugin uses a System object that does
not support variable-size signals, plugin validation fails. For example:
validateAudioPlugin BrokenAnalyticSignalTransformer
Checking plug-in class 'BrokenAnalyticSignalTransformer'... passed.
Generating testbench file 'testbench_BrokenAnalyticSignalTransformer.m'... done.
Running testbench...
Error using dsp.AnalyticSignal/parenReference
Changing the size on input 1 is not allowed without first calling the release() method.
Error in validateAudioPlugin
If you want to use the functionality of a System object that does not support variable-size signals, you
can buffer the input and output of the System object, or always call the object with one sample.
You can create a loop around your call to an object. The loop iterates for the number of samples in
your variable frame size. The call to the object inside the loop is always a single sample.
Note Depending on your implementation and the particular object, calling an object sample by
sample in a loop might result in significant computational cost.
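For instance, a minimal sketch of this approach inside a plugin process method might look like the following, where plugin.Analyzer is a hypothetical System object that accepts only single-sample input:

function out = process(plugin,in)
    % Call the hypothetical single-sample object once per sample.
    out = zeros(size(in),'like',in);
    for n = 1:size(in,1)
        out(n,:) = plugin.Analyzer(in(n,:));
    end
end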
You can buffer the input to your object to a consistent frame size, and then buffer the output of your
object back to the original frame size. The dsp.AsyncBuffer System object is well-suited for this
task.
plugin.InputBuffer = dsp.AsyncBuffer;
setup(plugin.InputBuffer,1);
plugin.OutputBuffer = dsp.AsyncBuffer;
setup(plugin.OutputBuffer,[1,1]);
end
function out = process(plugin,in)
write(plugin.InputBuffer,in);
Note Use of the asynchronous buffering object forces a minimum latency of your specified frame
size.
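The rest of the buffered process method is not shown above. A sketch of how it might continue is below; the internal frame length and the plugin.Analyzer object are assumptions.

function out = process(plugin,in)
    % Store the incoming (possibly variable-size) frame.
    write(plugin.InputBuffer,in);

    % Process in fixed-size chunks that the System object accepts.
    frameLength = 512;                          % assumed internal frame size
    while plugin.InputBuffer.NumUnreadSamples >= frameLength
        x = read(plugin.InputBuffer,frameLength);
        y = plugin.Analyzer(x);                 % hypothetical fixed-frame-size object
        write(plugin.OutputBuffer,y);
    end

    % Return a frame the same size as the input, outputting zeros until
    % enough processed samples have accumulated.
    if plugin.OutputBuffer.NumUnreadSamples >= size(in,1)
        out = read(plugin.OutputBuffer,size(in,1));
    else
        out = zeros(size(in),'like',in);
    end
end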
To work around this issue, you can use a separate enumeration class that maps the strings to the
enumerations, as described in the audioPluginParameter documentation.
Alternatively, if you want to avoid writing an enumeration class and keep all your code in one file, you
can use a dependent property to map your parameter names to a set of values. In this scenario, you
map your enumeration value to a value that you can cache.
See Also
More About
• “Audio Plugins in MATLAB”
• “Audio Plugin Example Gallery” on page 10-2
• “Export a MATLAB Plugin to a DAW”
18
Spectral Descriptors
Signal Processing Toolbox™ and Audio Toolbox™ provide a suite of functions that describe the shape, sometimes
referred to as timbre, of audio. This example defines the equations used to determine the spectral
features, cites common uses of each feature, and provides examples so that you can gain intuition
about what the spectral descriptors are describing.
Spectral descriptors are widely used in machine learning, deep learning, and perceptual analysis. They
have been applied to a wide range of audio applications.
Spectral Centroid
Spectral centroid (spectralCentroid) is the frequency-weighted sum of the spectrum normalized by its unweighted sum:

\mu_1 = \frac{\sum_{k=b_1}^{b_2} f_k\, s_k}{\sum_{k=b_1}^{b_2} s_k}

where
• fk is the frequency in Hz corresponding to bin k.
• sk is the spectral value at bin k. The magnitude spectrum and power spectrum are both commonly used.
• b1 and b2 are the band edges, in bins, over which to calculate the spectral centroid.
The spectral centroid represents the "center of gravity" of the spectrum. It is used as an indication of
brightness [2 on page 18-23] and is commonly used in music analysis and genre classification. For
example, observe the jumps in the centroid corresponding to high hat hits in the audio file.
[audio,fs] = audioread('FunkyDrums-44p1-stereo-25secs.mp3');
audio = sum(audio,2)/2;
centroid = spectralCentroid(audio,fs);
subplot(2,1,1)
t = linspace(0,size(audio,1)/fs,size(audio,1));
plot(t,audio)
ylabel('Amplitude')
subplot(2,1,2)
t = linspace(0,size(audio,1)/fs,size(centroid,1));
plot(t,centroid)
xlabel('Time (s)')
ylabel('Centroid (Hz)')
The spectral centroid is also commonly used to classify speech as voiced or unvoiced [3 on page 18-
23]. For example, the centroid jumps in regions of unvoiced speech.
[audio,fs] = audioread('Counting-16-44p1-mono-15secs.wav');
centroid = spectralCentroid(audio,fs);
subplot(2,1,1)
t = linspace(0,size(audio,1)/fs,size(audio,1));
plot(t,audio)
ylabel('Amplitude')
subplot(2,1,2)
t = linspace(0,size(audio,1)/fs,size(centroid,1));
plot(t,centroid)
xlabel('Time (s)')
ylabel('Centroid (Hz)')
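As a quick cross-check of the definition, the centroid of a single frame can be computed directly from a power spectrum. This is only an illustrative sketch; the test signal, windowing, and spectrum scaling are assumptions and do not necessarily match the defaults used by spectralCentroid.

% Illustrative only: compute the spectral centroid of one frame by the formula.
fs = 16e3;
t = (0:1/fs:1)';
x = chirp(t,500,1,2000);                                        % assumed test signal
[s,f] = periodogram(x,hamming(numel(x)),numel(x),fs,'power');   % power spectrum s(k) at frequencies f(k)
mu1 = sum(f.*s)/sum(s)                                          % frequency-weighted sum over unweighted sum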
Spectral Spread
Spectral spread (spectralSpread) is the standard deviation around the spectral centroid [1 on page
18-23]:
\mu_2 = \sqrt{\frac{\sum_{k=b_1}^{b_2} \left(f_k - \mu_1\right)^2 s_k}{\sum_{k=b_1}^{b_2} s_k}}

where
• fk is the frequency in Hz corresponding to bin k.
• sk is the spectral value at bin k. The magnitude spectrum and power spectrum are both commonly used.
• b1 and b2 are the band edges, in bins, over which to calculate the spectral spread.
• μ1 is the spectral centroid.
The spectral spread represents the "instantaneous bandwidth" of the spectrum. It is used as an
indication of the dominance of a tone. For example, the spread increases as the tones diverge and
decreases as the tones converge.
fs = 16e3;
tone = audioOscillator('SampleRate',fs,'NumTones',2,'SamplesPerFrame',512,'Frequency',[2000,100]);
duration = 5;
numLoops = floor(duration*fs/tone.SamplesPerFrame);
signal = [];
for i = 1:numLoops
signal = [signal;tone()];
if i<numLoops/2
tone.Frequency = tone.Frequency + [0,50];
else
tone.Frequency = tone.Frequency - [0,50];
end
end
spread = spectralSpread(signal,fs);
subplot(2,1,1)
spectrogram(signal,round(fs*0.05),round(fs*0.04),2048,fs,'yaxis')
subplot(2,1,2)
t = linspace(0,size(signal,1)/fs,size(spread,1));
plot(t,spread)
xlabel('Time (s)')
ylabel('Spread')
Spectral Skewness
Spectral skewness (spectralSkewness) is computed from the third order moment [1 on page 18-
23]:
\mu_3 = \frac{\sum_{k=b_1}^{b_2} \left(f_k - \mu_1\right)^3 s_k}{\mu_2^3 \sum_{k=b_1}^{b_2} s_k}

where
• fk is the frequency in Hz corresponding to bin k.
• sk is the spectral value at bin k. The magnitude spectrum and power spectrum are both commonly used.
• b1 and b2 are the band edges, in bins, over which to calculate the spectral skewness.
• μ1 is the spectral centroid.
• μ2 is the spectral spread.
The spectral skewness measures symmetry around the centroid. In phonetics, spectral skewness is
often referred to as spectral tilt and is used with other spectral moments to distinguish the place of
articulation [4 on page 18-23]. For harmonic signals, it indicates the relative strength of higher and
lower harmonics. For example, in the four-tone signal, there is a positive skew when the lower tone is
dominant and a negative skew when the upper tone is dominant.
fs = 16e3;
duration = 99;
tone = audioOscillator('SampleRate',fs,'NumTones',4,'SamplesPerFrame',fs,'Frequency',[500,2000,2500,4000]); % upper two frequencies assumed; the source line is truncated
signal = [];
for i = 1:duration
signal = [signal;tone()];
tone.Amplitude = tone.Amplitude + [0.01,0,0,-0.01];
end
skewness = spectralSkewness(signal,fs);
t = linspace(0,size(signal,1)/fs,size(skewness,1))/60;
subplot(2,1,1)
spectrogram(signal,round(fs*0.05),round(fs*0.04),round(fs*0.05),fs,'yaxis','power')
view([-58 33])
subplot(2,1,2)
plot(t,skewness)
xlabel('Time (minutes)')
ylabel('Skewness')
Spectral Kurtosis
Spectral kurtosis (spectralKurtosis) is computed from the fourth order moment [1 on page 18-
23]:
\mu_4 = \frac{\sum_{k=b_1}^{b_2} \left(f_k - \mu_1\right)^4 s_k}{\mu_2^4 \sum_{k=b_1}^{b_2} s_k}

where
• fk is the frequency in Hz corresponding to bin k.
• sk is the spectral value at bin k. The magnitude spectrum and power spectrum are both commonly used.
• b1 and b2 are the band edges, in bins, over which to calculate the spectral kurtosis.
• μ1 is the spectral centroid.
• μ2 is the spectral spread.
The spectral kurtosis measures the flatness, or non-Gaussianity, of the spectrum around its centroid.
Conversely, it is used to indicate the peakiness of a spectrum. For example, as the white noise is
increased on the speech signal, the kurtosis decreases, indicating a less peaky spectrum.
[audioIn,fs] = audioread('Counting-16-44p1-mono-15secs.wav');
noiseGenerator = dsp.ColoredNoise('Color','white','SamplesPerFrame',size(audioIn,1));
noise = noiseGenerator();
noise = noise/max(abs(noise));
ramp = linspace(0,.25,numel(noise))';
noise = noise.*ramp;
audioIn = audioIn + noise;
kurtosis = spectralKurtosis(audioIn,fs);
t = linspace(0,size(audioIn,1)/fs,size(audioIn,1));
subplot(2,1,1)
plot(t,audioIn)
ylabel('Amplitude')
t = linspace(0,size(audioIn,1)/fs,size(kurtosis,1));
subplot(2,1,2)
plot(t,kurtosis)
xlabel('Time (s)')
ylabel('Kurtosis')
Spectral Entropy
Spectral entropy (spectralEntropy) measures the peakiness of the spectrum [6 on page 18-23]:
entropy = \frac{-\sum_{k=b_1}^{b_2} s_k \log\left(s_k\right)}{\log\left(b_2 - b_1\right)}

where
• sk is the spectral value at bin k. The magnitude spectrum and power spectrum are both commonly used.
• b1 and b2 are the band edges, in bins, over which to calculate the spectral entropy.
Spectral entropy has been used successfully in voiced/unvoiced decisions for automatic speech
recognition [6 on page 18-23]. Because entropy is a measure of disorder, regions of voiced speech
have lower entropy compared to regions of unvoiced speech.
[audioIn,fs] = audioread('Counting-16-44p1-mono-15secs.wav');
entropy = spectralEntropy(audioIn,fs);
t = linspace(0,size(audioIn,1)/fs,size(audioIn,1));
subplot(2,1,1)
plot(t,audioIn)
ylabel('Amplitude')
t = linspace(0,size(audioIn,1)/fs,size(entropy,1));
subplot(2,1,2)
plot(t,entropy)
xlabel('Time (s)')
ylabel('Entropy')
Spectral entropy has also been used to discriminate between speech and music [7 on page 18-24] [8
on page 18-24]. For example, compare histograms of entropy for speech, music, and background
audio files.
fs = 8000;
[speech,speechFs] = audioread('Rainbow-16-8-mono-114secs.wav');
speech = resample(speech,fs,speechFs);
speech = speech./max(speech);
[music,musicFs] = audioread('RockGuitar-16-96-stereo-72secs.flac');
music = sum(music,2)/2;
music = resample(music,fs,musicFs);
music = music./max(music);
[background,backgroundFs] = audioread('Ambiance-16-44p1-mono-12secs.wav');
background = resample(background,fs,backgroundFs);
background = background./max(background);
speechEntropy = spectralEntropy(speech,fs);
musicEntropy = spectralEntropy(music,fs);
backgroundEntropy = spectralEntropy(background,fs);
figure
h1 = histogram(speechEntropy);
hold on
h2 = histogram(musicEntropy);
h3 = histogram(backgroundEntropy);
h1.Normalization = 'probability';
h2.Normalization = 'probability';
h3.Normalization = 'probability';
h1.BinWidth = 0.01;
h2.BinWidth = 0.01;
h3.BinWidth = 0.01;
title('Spectral Entropy')
legend('Speech','Music','Background','Location',"northwest")
xlabel('Entropy')
ylabel('Probability')
hold off
Spectral Flatness
Spectral flatness (spectralFlatness) measures the ratio of the geometric mean of the spectrum to
the arithmetic mean of the spectrum [9 on page 18-24]:
flatness = \frac{\left(\prod_{k=b_1}^{b_2} s_k\right)^{\frac{1}{b_2 - b_1}}}{\frac{1}{b_2 - b_1}\sum_{k=b_1}^{b_2} s_k}
where
• sk is the spectral value at bin k. The magnitude spectrum and power spectrum are both commonly
used.
• b1 and b2 are the band edges, in bins, over which to calculate the spectral flatness.
Spectral flatness is an indication of the peakiness of the spectrum. A higher spectral flatness
indicates noise, while a lower spectral flatness indicates tonality.
[audio,fs] = audioread('WaveGuideLoopOne-24-96-stereo-10secs.aif');
audio = sum(audio,2)/2;
noise = (2*rand(numel(audio),1)-1).*linspace(0,0.05,numel(audio))';
audio = audio + noise;
flatness = spectralFlatness(audio,fs);
subplot(2,1,1)
t = linspace(0,size(audio,1)/fs,size(audio,1));
plot(t,audio)
ylabel('Amplitude')
subplot(2,1,2)
t = linspace(0,size(audio,1)/fs,size(flatness,1));
plot(t,flatness)
ylabel('Flatness')
xlabel('Time (s)')
Spectral flatness has also been applied successfully to singing voice detection [10 on page 18-24]
and to audio scene recognition [11 on page 18-24].
Spectral Crest
Spectral crest (spectralCrest) measures the ratio of the maximum of the spectrum to the
arithmetic mean of the spectrum [1 on page 18-23]:
crest = \frac{\max\left(s_k \in \left[b_1, b_2\right]\right)}{\frac{1}{b_2 - b_1}\sum_{k=b_1}^{b_2} s_k}

where
• sk is the spectral value at bin k. The magnitude spectrum and power spectrum are both commonly
used.
• b1 and b2 are the band edges, in bins, over which to calculate the spectral crest.
Spectral crest is an indication of the peakiness of the spectrum. A higher spectral crest indicates
more tonality, while a lower spectral crest indicates more noise.
[audio,fs] = audioread('WaveGuideLoopOne-24-96-stereo-10secs.aif');
audio = sum(audio,2)/2;
noise = (2*rand(numel(audio),1)-1).*linspace(0,0.2,numel(audio))';
audio = audio + noise;
crest = spectralCrest(audio,fs);
subplot(2,1,1)
t = linspace(0,size(audio,1)/fs,size(audio,1));
plot(t,audio)
ylabel('Amplitude')
subplot(2,1,2)
t = linspace(0,size(audio,1)/fs,size(crest,1));
plot(t,crest)
ylabel('Crest')
xlabel('Time (s)')
Spectral Flux
Spectral flux (spectralFlux) is a measure of the variability of the spectrum over time [12 on page
18-24]:
flux\left(t\right) = \left(\sum_{k=b_1}^{b_2} \left|s_k\left(t\right) - s_k\left(t-1\right)\right|^{p}\right)^{\frac{1}{p}}
where
• sk is the spectral value at bin k. The magnitude spectrum and power spectrum are both commonly
used.
• b1 and b2 are the band edges, in bins, over which to calculate the spectral flux.
• p is the norm type.
Spectral flux is popularly used in onset detection [13 on page 18-24] and audio segmentation [14 on
page 18-24]. For example, the beats in the drum track correspond to high spectral flux.
[audio,fs] = audioread('FunkyDrums-48-stereo-25secs.mp3');
audio = sum(audio,2)/2;
flux = spectralFlux(audio,fs);
subplot(2,1,1)
t = linspace(0,size(audio,1)/fs,size(audio,1));
plot(t,audio)
ylabel('Amplitude')
subplot(2,1,2)
t = linspace(0,size(audio,1)/fs,size(flux,1));
plot(t,flux)
ylabel('Flux')
xlabel('Time (s)')
Spectral Slope
Spectral slope (spectralSlope) measures the amount of decrease of the spectrum [15 on page 18-
24]:
slope = \frac{\sum_{k=b_1}^{b_2} \left(f_k - \mu_f\right)\left(s_k - \mu_s\right)}{\sum_{k=b_1}^{b_2} \left(f_k - \mu_f\right)^2}

where
• fk is the frequency in Hz corresponding to bin k.
• sk is the spectral value at bin k. The magnitude spectrum and power spectrum are both commonly used.
• b1 and b2 are the band edges, in bins, over which to calculate the spectral slope.
• μf is the mean frequency over the band.
• μs is the mean spectral value over the band.
Spectral slope has been used extensively in speech analysis, particularly in modeling speaker stress
[19 on page 18-25]. The slope is directly related to the resonant characteristics of the vocal folds
and has also been applied to speaker identification [21 on page 18-25]. Spectral slope is a socially
important aspect of timbre. Spectral slope discrimination has been shown to occur in early childhood
development [20 on page 18-25]. Spectral slope is most pronounced when the energy in the lower
formants is much greater than the energy in the higher formants.
[female,femaleFs] = audioread('FemaleSpeech-16-8-mono-3secs.wav');
female = female./max(female);
femaleSlope = spectralSlope(female,femaleFs);
t = linspace(0,size(female,1)/femaleFs,size(femaleSlope,1));
subplot(2,1,1)
spectrogram(female,round(femaleFs*0.05),round(femaleFs*0.04),round(femaleFs*0.05),femaleFs,'yaxis')
subplot(2,1,2)
plot(t,femaleSlope)
title('Female Speaker')
ylabel('Slope')
xlabel('Time (s)')
Spectral Decrease
Spectral decrease (spectralDecrease) represents the amount of decrease of the spectrum, while
emphasizing the slopes of the lower frequencies [1 on page 18-23]:
decrease = \frac{\sum_{k=b_1+1}^{b_2} \frac{s_k - s_{b_1}}{k-1}}{\sum_{k=b_1+1}^{b_2} s_k}

where
• sk is the spectral value at bin k. The magnitude spectrum and power spectrum are both commonly used.
• b1 and b2 are the band edges, in bins, over which to calculate the spectral decrease.
Spectral decrease is used less frequently than spectral slope in the speech literature, but it is
commonly used, along with slope, in the analysis of music. In particular, spectral decrease has been
shown to perform well as a feature in instrument recognition [22 on page 18-25].
[guitar,guitarFs] = audioread('RockGuitar-16-44p1-stereo-72secs.wav');
guitar = mean(guitar,2);
[drums,drumsFs] = audioread('RockDrums-44p1-stereo-11secs.mp3');
drums = mean(drums,2);
guitarDecrease = spectralDecrease(guitar,guitarFs);
drumsDecrease = spectralDecrease(drums,drumsFs);
t1 = linspace(0,size(guitar,1)/guitarFs,size(guitarDecrease,1));
t2 = linspace(0,size(drums,1)/drumsFs,size(drumsDecrease,1));
subplot(2,1,1)
plot(t1,guitarDecrease)
title('Guitar')
ylabel('Decrease')
axis([0 10 -0.3 0.3])
subplot(2,1,2)
plot(t2,drumsDecrease)
title('Drums')
ylabel('Decrease')
xlabel('Time (s)')
axis([0 10 -0.3 0.3])
Spectral Rolloff Point
The spectral rolloff point (spectralRolloffPoint) measures the bandwidth of the audio signal by
determining the frequency bin under which a given percentage of the total energy exists [12 on page
18-24]:
\text{Rolloff Point} = i \quad \text{such that} \quad \sum_{k=b_1}^{i} s_k = \kappa \sum_{k=b_1}^{b_2} s_k
where
• sk is the spectral value at bin k. The magnitude spectrum and power spectrum are both commonly
used.
• b1 and b2 are the band edges, in bins, over which to calculate the spectral rolloff point.
• κ is the specified energy threshold, usually 95% or 85%.
The spectral rolloff point has been used to distinguish between voiced and unvoiced speech, speech/
music discrimination [12 on page 18-24], music genre classification [16 on page 18-24], acoustic
scene recognition [17 on page 18-24], and music mood classification [18 on page 18-24]. For
example, observe the different mean and variance of the rolloff point for speech, rock guitar, acoustic
guitar, and an acoustic scene.
dur = 5; % number of seconds of each recording to compare
[speech,fs1] = audioread('SpeechDFT-16-8-mono-5secs.wav');
speech = speech(1:min(end,fs1*dur));
[electricGuitar,fs2] = audioread('RockGuitar-16-44p1-stereo-72secs.wav');
electricGuitar = mean(electricGuitar,2); % Convert to mono for comparison.
electricGuitar = electricGuitar(1:fs2*dur);
[acousticGuitar,fs3] = audioread('SoftGuitar-44p1_mono-10mins.ogg');
acousticGuitar = acousticGuitar(1:fs3*dur);
[acousticScene,fs4] = audioread('MainStreetOne-16-16-mono-12secs.wav');
acousticScene = acousticScene(1:fs4*dur);
r1 = spectralRolloffPoint(speech,fs1);
r2 = spectralRolloffPoint(electricGuitar,fs2);
r3 = spectralRolloffPoint(acousticGuitar,fs3);
r4 = spectralRolloffPoint(acousticScene,fs4);
t1 = linspace(0,size(speech,1)/fs1,size(r1,1));
t2 = linspace(0,size(electricGuitar,1)/fs2,size(r2,1));
t3 = linspace(0,size(acousticGuitar,1)/fs3,size(r3,1));
t4 = linspace(0,size(acousticScene,1)/fs4,size(r4,1));
figure
plot(t1,r1)
title('Speech')
ylabel('Rolloff Point (Hz)')
xlabel('Time (s)')
axis([0 5 0 4000])
figure
plot(t2,r2)
title('Rock Guitar')
ylabel('Rolloff Point (Hz)')
xlabel('Time (s)')
axis([0 5 0 4000])
figure
plot(t3,r3)
title('Acoustic Guitar')
ylabel('Rolloff Point (Hz)')
xlabel('Time (s)')
axis([0 5 0 4000])
figure
plot(t4,r4)
title('Acoustic Scene')
ylabel('Rolloff Point (Hz)')
xlabel('Time (s)')
axis([0 5 0 4000])
References
[1] Peeters, G. "A Large Set of Audio Features for Sound Description (Similarity and Classification) in
the CUIDADO Project." Technical Report; IRCAM: Paris, France, 2004.
[2] Grey, John M., and John W. Gordon. “Perceptual Effects of Spectral Modifications on Musical
Timbres.” The Journal of the Acoustical Society of America. Vol. 63, Issue 5, 1978, pp. 1493–1500.
[3] Raimy, Eric, and Charles E. Cairns. The Segment in Phonetics and Phonology. Hoboken, NJ: John
Wiley & Sons Inc., 2015.
[4] Jongman, Allard, et al. “Acoustic Characteristics of English Fricatives.” The Journal of the
Acoustical Society of America. Vol. 108, Issue 3, 2000, pp. 1252–1263.
[5] S. Zhang, Y. Guo, and Q. Zhang, "Robust Voice Activity Detection Feature Design Based on
Spectral Kurtosis." First International Workshop on Education Technology and Computer Science,
2009, pp. 269–272.
[6] Misra, H., S. Ikbal, H. Bourlard, and H. Hermansky. "Spectral Entropy Based Feature for Robust
ASR." 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.
[8] Pikrakis, A., et al. “A Speech/Music Discriminator of Radio Recordings Based on Dynamic
Programming and Bayesian Networks.” IEEE Transactions on Multimedia. Vol. 10, Issue 5, 2008, pp.
846–857.
[9] Johnston, J. D. “Transform Coding of Audio Signals Using Perceptual Noise Criteria.” IEEE Journal
on Selected Areas in Communications. Vol. 6, Issue 2, 1988, pp. 314–323.
[10] Lehner, Bernhard, et al. “On the Reduction of False Positives in Singing Voice Detection.” 2014
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[11] Y. Petetin, C. Laroche and A. Mayoue, "Deep Neural Networks for Audio Scene Recognition,"
2015 23rd European Signal Processing Conference (EUSIPCO), 2015.
[12] Scheirer, E., and M. Slaney. “Construction and Evaluation of a Robust Multifeature Speech/Music
Discriminator.” 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing,
1997.
[13] S. Dixon, "Onset Detection Revisited." International Conference on Digital Audio Effects. Vol.
120, 2006, pp. 133–137.
[14] Tzanetakis, G., and P. Cook. “Multifeature Audio Segmentation for Browsing and Annotation.”
Proceedings of the 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics,
1999.
[15] Lerch, Alexander. An Introduction to Audio Content Analysis Applications in Signal Processing
and Music Informatics. Piscataway, NJ: IEEE Press, 2012.
[16] Li, Tao, and M. Ogihara. "Music Genre Classification with Taxonomy." IEEE International
Conference on Acoustics, Speech, and Signal Processing, 2005.
[17] Eronen, A. J., V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J.
Huopaniemi. "Audio-Based Context Recognition." IEEE Transactions on Audio, Speech and Language
Processing. Vol. 14, Issue 1, 2006, pp. 321–329.
[18] Ren, Jia-Min, Ming-Ju Wu, and Jyh-Shing Roger Jang. "Automatic Music Mood Classification
Based on Timbre and Modulation Features." IEEE Transactions on Affective Computing. Vol. 6, Issue
3, 2015, pp. 236–246.
[19] Hansen, John H. L., and Sanjay Patil. "Speech Under Stress: Analysis, Modeling and
Recognition." Lecture Notes in Computer Science. Vol. 4343, 2007, pp. 108–137.
[20] Tsang, Christine D., and Laurel J. Trainor. "Spectral Slope Discrimination in Infancy: Sensitivity
to Socially Important Timbres." Infant Behavior and Development. Vol. 25, Issue 2, 2002, pp. 183–
194.
[21] Murthy, H. A., F. Beaufays, L. P. Heck, and M. Weintraub. "Robust Text-Independent Speaker
Identification over Telephone Channels." IEEE Transactions on Speech and Audio Processing. Vol. 7,
Issue 5, 1999, pp. 554–568.
[22] Essid, S., G. Richard, and B. David. "Instrument Recognition in Polyphonic Music Based on
Automatic Taxonomies." IEEE Transactions on Audio, Speech and Language Processing. Vol. 14, Issue
1, 2006, pp. 68–80.