Introduction to Audio Signal Processing
Human-Computer Interaction

Angelo Antonio Salatino
http://infernusweb.altervista.org
License
This work is licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Overview
• Audio Signal Processing;
• Waveform Audio File Format;
• FFmpeg;
• Audio Processing with Matlab;
• Doing phonetics with Praat;
• Last but not least: Homework.
Audio Signal Processing
• Audio signal processing is an engineering field
that focuses on the computational methods for
intentionally altering auditory signals or sounds,
in order to achieve a particular goal.

[Figure: Input Signal → Audio Signal Processing → Output Signal / Data with meaning]
Audio Processing in HCI
Some HCI applications involving audio signal
processing are:
• Speech Emotion Recognition
• Speaker Recognition
▫ Speaker Verification
▫ Speaker Identification
• Voice Commands
• Speech to Text
• Etc.
Audio Signals
You can find audio signals represented in either digital or analog format.

• Digital – the pressure waveform is represented as a sequence of symbols, usually binary numbers.

• Analog – the pressure waveform is a smooth wave of energy represented by a continuous stream of data.
Analog to Digital Converter (ADC)
• Don’t worry, it’s only a fast review!!!
[Figure: ADC pipeline]
Analog Signal (continuous in time, continuous in amplitude)
→ Sample & Hold (discrete in time, continuous in amplitude); the sampling frequency must be defined
→ Quantization (discrete in time, discrete in amplitude); the number of bits per sample must be defined
→ Encoding (discrete in time, discrete in amplitude)
→ Digital Signal

• For each measurement a number is assigned according to its amplitude.
• The sampling frequency and the number of bits used to represent a sample can be considered the main features of a digital signal.
• How are these digital signals stored?
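The sampling and quantization steps reviewed above can be sketched in a few lines of Python (used here only for brevity; the deck itself works in C/C++ and Matlab). This is a minimal illustration, not a model of a real converter: the function name `sample_and_quantize`, the 1 kHz test tone and the assumption that amplitudes lie in [-1, 1] are all choices made for this example.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate, bits_per_sample):
    """Sample a continuous-time signal and map each sample to a signed integer code.

    `signal` is a function of time (in seconds) whose amplitude is assumed
    to lie in [-1.0, 1.0]; values outside that range are clipped.
    """
    max_code = 2 ** (bits_per_sample - 1) - 1       # e.g. 32767 for 16 bits
    num_samples = round(duration_s * sample_rate)
    codes = []
    for n in range(num_samples):
        t = n / sample_rate                          # sampling: discrete in time
        amplitude = max(-1.0, min(1.0, signal(t)))
        codes.append(round(amplitude * max_code))    # quantization: discrete in amplitude
    return codes

def tone(t):
    """Hypothetical 1 kHz test tone."""
    return math.sin(2 * math.pi * 1000 * t)

# One millisecond of the tone at 8000 Hz and 16 bits per sample.
codes = sample_and_quantize(tone, 0.001, 8000, 16)
print(codes)
```

Raising `sample_rate` adds more measurements per second, while raising `bits_per_sample` makes each measurement finer; these are the two parameters that the WAV header stores.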
Waveform Audio File Format (WAV)
The WAV file is an instance of a Resource Interchange File Format (RIFF) defined by IBM and Microsoft.

The RIFF is a generic file container format for storing data in tagged chunks (basic building blocks). It is a file structure that defines a class of more specific file formats, such as: wav, avi, rmi, etc.

Endianness  Byte Offset  Field Name     Field Size (bytes)  Description
Big         0            ChunkID        4                   RIFF Chunk Descriptor
Little      4            ChunkSize      4                   RIFF Chunk Descriptor
Big         8            Format         4                   RIFF Chunk Descriptor
Big         12           SubChunk1ID    4                   Format SubChunk
Little      16           SubChunk1Size  4                   Format SubChunk
Little      20           AudioFormat    2                   Format SubChunk
Little      22           NumChannels    2                   Format SubChunk
Little      24           SampleRate     4                   Format SubChunk
Little      28           ByteRate       4                   Format SubChunk
Little      32           BlockAlign     2                   Format SubChunk
Little      34           BitsPerSample  2                   Format SubChunk
Big         36           SubChunk2ID    4                   Data SubChunk
Little      40           SubChunk2Size  4                   Data SubChunk
Little      44           Data           SubChunk2Size       Data SubChunk
Waveform Audio File Format (WAV)
ChunkID (big-endian, offset 0, 4 bytes)
Contains the letters «RIFF» in ASCII form (0x52494646 in big-endian form).

ChunkSize (little-endian, offset 4, 4 bytes)
This is the size of the rest of the chunk following this number, i.e. the size of the entire file in bytes minus 8 for the two fields not included: ChunkID and ChunkSize.

Format (big-endian, offset 8, 4 bytes)
Contains the letters «WAVE» in ASCII form (0x57415645 in big-endian form).
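As a rough illustration of the RIFF chunk descriptor, the sketch below unpacks its twelve bytes with Python's `struct` module. The bytes are the ones shown in the deck's example header; the function name `parse_riff_descriptor` is an assumption made for this example.

```python
import struct

def parse_riff_descriptor(first12):
    """Validate the 12-byte RIFF chunk descriptor and return (ChunkSize, file size)."""
    # "<" selects little-endian; the 4-byte IDs are read as raw ASCII bytes,
    # so their "big-endian" reading is simply the letters in file order.
    chunk_id, chunk_size, fmt = struct.unpack("<4sI4s", first12)
    if chunk_id != b"RIFF" or fmt != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    # ChunkSize excludes ChunkID and ChunkSize themselves (8 bytes).
    return chunk_size, chunk_size + 8

descriptor = bytes.fromhex("52494646" "16020100" "57415645")
chunk_size, file_size = parse_riff_descriptor(descriptor)
print(chunk_size, file_size)

# The ASCII letters read as a big-endian number give the constant quoted above.
riff_as_int = int.from_bytes(b"RIFF", "big")
```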
Waveform Audio File Format (WAV)
SubChunk1ID (big-endian, offset 12, 4 bytes)
Contains the letters «fmt » in ASCII form (0x666d7420 in big-endian form).

SubChunk1Size (little-endian, offset 16, 4 bytes)
16 for PCM. This is the size of the SubChunk which follows this number.
Waveform Audio File Format (WAV)
AudioFormat (little-endian, offset 20, 2 bytes)
Format code or compression type:
• PCM = 0x0001 (linear quantization, uncompressed)
• IEEE_FLOAT = 0x0003
• Microsoft_ALAW = 0x0006
• Microsoft_MLAW = 0x0007
• IBM_ADPCM = 0x0103

NumChannels (little-endian, offset 22, 2 bytes)
Mono = 1, Stereo = 2, etc.
Note: channels are interleaved.
Waveform Audio File Format (WAV)
SampleRate (little-endian, offset 24, 4 bytes)
Sampling frequency: 8000, 16000, 44100, etc.

ByteRate (little-endian, offset 28, 4 bytes)
Average bytes per second. It is typically determined by Equation 1.

BlockAlign (little-endian, offset 32, 2 bytes)
The number of bytes for one sample including all channels. It is determined by Equation 2.

1) $\text{ByteRate} = \text{SampleRate} \cdot \text{NumChannels} \cdot \frac{\text{BitsPerSample}}{8}$
2) $\text{BlockAlign} = \text{NumChannels} \cdot \frac{\text{BitsPerSample}}{8}$
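Equations 1 and 2 can be checked numerically. The sketch below is only an illustration: the function names are made up for this example, and the sample values (16000 Hz, mono, 16 bits) are the ones used in the deck's example header.

```python
def byte_rate(sample_rate, num_channels, bits_per_sample):
    """Equation 1: average number of bytes per second."""
    return sample_rate * num_channels * bits_per_sample // 8

def block_align(num_channels, bits_per_sample):
    """Equation 2: bytes for one sample including all channels."""
    return num_channels * bits_per_sample // 8

# Mono, 16-bit audio at 16 kHz.
print(byte_rate(16000, 1, 16), block_align(1, 16))

# CD-quality stereo, for comparison.
print(byte_rate(44100, 2, 16), block_align(2, 16))
```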
Waveform Audio File Format (WAV)
BitsPerSample (little-endian, offset 34, 2 bytes)
8 bits = 8, 16 bits = 16, etc.

SubChunk2ID (big-endian, offset 36, 4 bytes)
Contains the letters «data» in ASCII form (0x64617461 in big-endian form).

SubChunk2Size (little-endian, offset 40, 4 bytes)
This is the number of bytes in the Data field. If AudioFormat = PCM, then you can compute the number of samples (see Equation 3).

3) $\text{NumOfSamples} = \frac{8 \cdot \text{SubChunk2Size}}{\text{NumChannels} \cdot \text{BitsPerSample}}$
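Equation 3 gives the number of samples per channel, and dividing that by the sample rate gives the duration. A small sketch, assuming PCM data and using hypothetical function names; the figures come from the deck's example header (66034 data bytes, mono, 16 bits, 16000 Hz):

```python
def num_samples(subchunk2_size, num_channels, bits_per_sample):
    """Equation 3: samples per channel stored in a PCM data sub-chunk."""
    return (8 * subchunk2_size) // (num_channels * bits_per_sample)

def duration_seconds(subchunk2_size, num_channels, bits_per_sample, sample_rate):
    """Playback time implied by the data size and the sample rate."""
    return num_samples(subchunk2_size, num_channels, bits_per_sample) / sample_rate

samples = num_samples(66034, 1, 16)
seconds = duration_seconds(66034, 1, 16, 16000)
print(samples, seconds)
```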
Example of wave header
Bytes (hex)   Field           Value
Chunk Descriptor
52 49 46 46   ChunkID         "RIFF"
16 02 01 00   ChunkSize       66070
57 41 56 45   Format          "WAVE"
Fmt SubChunk
66 6d 74 20   SubChunk1ID     "fmt "
10 00 00 00   SubChunk1Size   16
01 00         AudioFormat     1 (PCM)
01 00         NumChannels     1
80 3e 00 00   SampleRate      16000
00 7d 00 00   ByteRate        32000
02 00         BlockAlign      2
10 00         BitsPerSample   16
Data SubChunk
64 61 74 61   SubChunk2ID     "data"
f2 01 01 00   SubChunk2Size   66034
…             Data
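The example header can be decoded mechanically. The sketch below feeds the 44 bytes shown above to Python's `struct` module; the format string and the `FIELDS` tuple are written for this example, under the assumption of the canonical 44-byte PCM layout described in the previous slides.

```python
import struct

# The 44 header bytes from the example, grouped by field.
HEADER_HEX = (
    "52494646 16020100 57415645 "          # ChunkID, ChunkSize, Format
    "666d7420 10000000 0100 0100 "         # SubChunk1ID, SubChunk1Size, AudioFormat, NumChannels
    "803e0000 007d0000 0200 1000 "         # SampleRate, ByteRate, BlockAlign, BitsPerSample
    "64617461 f2010100"                    # SubChunk2ID, SubChunk2Size
)

FIELDS = (
    "ChunkID", "ChunkSize", "Format",
    "SubChunk1ID", "SubChunk1Size", "AudioFormat", "NumChannels",
    "SampleRate", "ByteRate", "BlockAlign", "BitsPerSample",
    "SubChunk2ID", "SubChunk2Size",
)

raw = bytes.fromhex(HEADER_HEX)

# "<" = little-endian without padding; 4s = 4 raw bytes, I = uint32, H = uint16.
values = struct.unpack("<4sI4s4sIHHIIHH4sI", raw)
header = dict(zip(FIELDS, values))

for name in FIELDS:
    print(name, header[name])
```

Reading the IDs as raw bytes sidesteps the big-endian/little-endian distinction for the text fields, since ASCII letters are simply stored in file order.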
Exercise
For the next 15 min, write a C/C++ program that
takes a wav file as input and prints the following
values on standard output:
• Header size;
• Sample rate;
• Bits per sample;
• Number of channels;
• Number of samples.

Good work!
Solution
typedef struct header_file
{
char chunk_id[4];
int chunk_size;
char format[4];
char subchunk1_id[4];
int subchunk1_size;
short int audio_format;
short int num_channels;
int sample_rate;
int byte_rate;
short int block_align;
short int bits_per_sample;
char subchunk2_id[4];
int subchunk2_size;
} header;

/************** Inside Main() **************/


header* meta = new header;
ifstream infile;

infile.exceptions (ifstream::eofbit | ifstream::failbit | ifstream::badbit);

infile.open("foo.wav", ios::in|ios::binary);
infile.read ((char*)meta, sizeof(header));

cout << " Header size: "<<sizeof(*meta)<<" bytes" << endl;


cout << " Sample Rate "<< meta->sample_rate <<" Hz" << endl;
cout << " Bits per samples: " << meta->bits_per_sample << " bit" <<endl;
cout << " Number of channels: " << meta->num_channels << endl;
long numOfSample = (meta->subchunk2_size/meta->num_channels)/(meta->bits_per_sample/8);
cout << " Number of samples: " << numOfSample << endl;

However, this solution contains an error. Can you spot it?
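One weakness of any fixed-layout reader, which may or may not be the error the slide has in mind, is the assumption that the «data» sub-chunk always starts at byte 36. RIFF files may carry other chunks before it. The sketch below walks the chunk list instead of assuming offsets; the function names and the in-memory test file (with an extra LIST chunk) are made up for this illustration.

```python
import io
import struct

def find_chunks(stream):
    """Return {chunk_id: (payload_offset, payload_size)} for a RIFF/WAVE stream."""
    riff, _, wave = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    chunks = {}
    while True:
        head = stream.read(8)
        if len(head) < 8:
            break                                     # end of file
        chunk_id, size = struct.unpack("<4sI", head)
        chunks[chunk_id] = (stream.tell(), size)
        stream.seek(size + (size & 1), io.SEEK_CUR)   # chunks are padded to an even length
    return chunks

def make_chunk(chunk_id, payload):
    """Build one RIFF chunk: ID, little-endian size, payload, optional pad byte."""
    pad = b"\x00" if len(payload) % 2 else b""
    return chunk_id + struct.pack("<I", len(payload)) + payload + pad

# Hypothetical file: fmt chunk, then an extra LIST chunk, then four 16-bit samples.
fmt_payload = struct.pack("<HHIIHH", 1, 1, 16000, 32000, 2, 16)
body = (b"WAVE"
        + make_chunk(b"fmt ", fmt_payload)
        + make_chunk(b"LIST", b"INFOdemo")
        + make_chunk(b"data", b"\x00\x00" * 4))
wav_bytes = b"RIFF" + struct.pack("<I", len(body)) + body

chunks = find_chunks(io.BytesIO(wav_bytes))
print(chunks[b"data"])
```

In this test file the sample data starts at byte 60 rather than 44, so a reader that copies the first 44 bytes into a struct would report the LIST chunk's ID and size as if they belonged to the data sub-chunk.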


What about reading samples?
// gWavDataIn (double**) and wBuffer (char*) are assumed to be declared elsewhere
short int* pU = NULL;
unsigned char* pC = NULL;
gWavDataIn = new double*[meta->num_channels]; // one array of samples per channel
for (int i = 0; i < meta->num_channels; i++) gWavDataIn[i] = new double[numOfSample];

wBuffer = new char[meta->subchunk2_size]; // raw bytes of the data sub-chunk
infile.read(wBuffer, meta->subchunk2_size); // fill the buffer before converting

/* data conversion: from bytes to samples (channels are interleaved) */
if(meta->bits_per_sample == 16)
{
    pU = (short*) wBuffer;
    for (int i = 0; i < numOfSample; i++)
        for (int j = 0; j < meta->num_channels; j++)
            gWavDataIn[j][i] = (double) (pU[i * meta->num_channels + j]);
}
else if(meta->bits_per_sample == 8)
{
    pC = (unsigned char*) wBuffer;
    for (int i = 0; i < numOfSample; i++)
        for (int j = 0; j < meta->num_channels; j++)
            gWavDataIn[j][i] = (double) (pC[i * meta->num_channels + j]);
}
else
{
    printERR("Unhandled case");
}

This solution is available at: https://github.com/angelosalatino/AudioSignalProcessing
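Because channels are interleaved in the Data field, converting bytes to samples means picking every NumChannels-th value for each channel. A compact sketch of that step for 16-bit PCM follows; the function name and the six-value test buffer are assumptions made for the example, not code from the repository.

```python
import struct

def deinterleave_pcm16(data, num_channels):
    """Split interleaved little-endian 16-bit PCM bytes into one list per channel."""
    count = len(data) // 2                            # number of 16-bit values
    samples = struct.unpack("<%dh" % count, data[:count * 2])
    # Channel ch owns the values at positions ch, ch + C, ch + 2C, ...
    return [list(samples[ch::num_channels]) for ch in range(num_channels)]

# Hypothetical stereo buffer laid out as L0 R0 L1 R1 L2 R2.
raw = struct.pack("<6h", 1, -1, 2, -2, 3, -3)
left, right = deinterleave_pcm16(raw, 2)
print(left, right)
```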


A better solution: FFmpeg
What FFmpeg says about itself:
• FFmpeg is the leading multimedia framework,
able to decode, encode, transcode, mux,
demux, stream, filter and play pretty much
anything that humans and machines have
created. It supports the most obscure ancient
formats up to the cutting edge. No matter if they
were designed by some standards committee,
the community or a corporation.
Why is FFmpeg better?
• Off-the-shelf;
• Open Source;
• We can read samples from different kinds of formats: wav, mp3, aac, flac and so on;
• The code is always the same for all these audio formats;
• It can also decode video formats.
A little bit of code …
Step 1
• Create AVFormatContext
▫ Format I/O context: nb_streams, filename,
start_time, duration, bit_rate, audio_codec_id,
video_codec_id and so on.
• Open file

AVFormatContext* formatContext = NULL;

// Legacy call; later FFmpeg releases replace it with avformat_open_input()
av_open_input_file(&formatContext, "foo.wav", NULL, 0, NULL);
A little bit of code …
Step 2
• Create AVStream
▫ Stream structure; It contains: nb_frames,
codec_context, duration and so on;
• Association between audio stream inside the
context and the new one.
// Find the audio stream (some container files can have multiple streams in them)
AVStream* audioStream = NULL;
for (unsigned int i = 0; i < formatContext->nb_streams; ++i)
if (formatContext->streams[i]->codec->codec_type == AVMEDIA_TYPE_AUDIO)
{
audioStream = formatContext->streams[i];
break;
}
A little bit of code …
Step 3
• Create AVCodecContext
▫ Main external API structure; It contains: codec_name,
codec_id and so on.
• Create AVCodec
▫ Codec Structure; It contains deep level information about
codec.
• Find codec availability
• Open Codec

AVCodecContext* codecContext = audioStream->codec;

AVCodec* codec = avcodec_find_decoder(codecContext->codec_id); // returns NULL if no decoder is available
avcodec_open(codecContext, codec);
A little bit of code …
Step 4
• Create AVPacket
▫ This structure stores compressed data.

• Create AVFrame
▫ This structure describes decoded (raw) audio or
video data.

AVPacket packet;
av_init_packet(&packet);

AVFrame* frame = avcodec_alloc_frame();
A little bit of code …
Step 5
• Read packets
▫ Packets are read from the AVFormatContext

• Decode packets
▫ Frames are decoded with the AVCodecContext
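// Read the packets in a loop
int frameFinished = 0; // set to non-zero when a complete frame has been decoded
while (av_read_frame(formatContext, &packet) == 0)
{
    if (packet.stream_index == audioStream->index)
    {
        avcodec_decode_audio4(codecContext, frame, &frameFinished, &packet);
        if (frameFinished)
            src_data = frame->data[0]; // decoded samples; src_data is assumed to be declared elsewhere
    }
    av_free_packet(&packet); // release the packet before reading the next one
}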
Problems with FFmpeg
• Update issues (with lib update, your previous
code might not work)
▫ Deprecated methods;
▫ Function name or parameters could change.
• Poor documentation (until today)

Example of migration:
• avcodec_open (AVCodecContext *avctx, const AVCodec *codec)
• avcodec_open2 (AVCodecContext *avctx, const AVCodec *codec,
AVDictionary **options)
Audio Processing with Matlab
• Matlab contains a lot of built-in functions to
read, listen, manipulate and save audio files.
• It also contains Signal Processing Toolbox and
DSP System Toolbox

Advantages
• Well documented;
• It works at different levels of abstraction;
• Direct access to samples;
• Coding is simple.

Disadvantages
• Only wave, flac, mp3, mpeg-4 and ogg formats are recognized in audioread (Is it really a disadvantage?);
• License is expensive.
Let’s code: Opening files
%% Reading file
% Section ID = 1

filename = './test.wav';
[data,fs] = audioread(filename);
% data = sample collection, fs = sampling frequency

% in older Matlab releases ---> [data,fs] = wavread(filename); % reads only wav files

% (Figure: formats recognized by audioread())

% write an audio file
audiowrite('./testCopy.wav',data,fs)
Information and play
%% Information & play
% Section ID = 2
numberOfSamples = length(data);
tempo = numberOfSamples / fs;

disp (sprintf('Length: %f seconds',tempo));


disp (sprintf('Number of Samples %d', numberOfSamples));
disp (sprintf('Sampling Frequency %d Hz',fs));
disp (sprintf('Number of Channels: %d', min(size(data))));

%play file
sound(data,fs);

% PLOT the signal


time = linspace(0,tempo,numberOfSamples);
plot(time,data);
Framing
$s(t) = x(t) \cdot \mathrm{rect}\left(\frac{t-\tau}{\#\mathrm{sample}}\right)$
%% Framing
% Section ID = 4

timeWindow = 0.04; % Frame length in seconds. Default: timeWindow = 40ms
timeStep = 0.01; % Seconds between two frames (used only with overlap). Default: timeStep = 10ms
overlap = 1; % 1 in case of overlap, 0 no overlap

sampleForWindow = round(timeWindow * fs); % frame length in samples

% buffer expects a vector, i.e. a single-channel signal
if overlap == 0
    Y = buffer(data,sampleForWindow);
else
    sampleToJump = sampleForWindow - round(timeStep * fs); % samples shared by consecutive frames
    Y = buffer(data,sampleForWindow,sampleToJump);
end

[m,n]=size(Y); % m corresponds to sampleForWindow


numFrames = n;

disp(sprintf('Number of Frames: %d',numFrames));
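The same framing idea can be written without `buffer`. The Python sketch below is an illustration with a hypothetical function name; unlike Matlab's `buffer`, it does not zero-pad, so only complete frames are returned and the frame count can differ from the Matlab result.

```python
def frame_signal(samples, fs, time_window=0.04, time_step=0.01):
    """Cut a mono signal into overlapping frames.

    time_window is the frame length in seconds and time_step the distance
    between the starts of two consecutive frames, as in the Matlab section.
    """
    frame_len = round(time_window * fs)
    hop = round(time_step * fs)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

# Toy signal: 100 samples at a (hypothetical) 1000 Hz sampling frequency,
# giving 40-sample frames that start every 10 samples.
signal = list(range(100))
frames = frame_signal(signal, 1000)
print(len(frames), len(frames[0]), frames[1][0])
```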


Windowing
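$w_{GAUSS}(n) = e^{-\frac{1}{2}\left(\frac{n-(N-1)/2}{\sigma (N-1)/2}\right)^{2}}, \quad \sigma \le 0.5$
$w_{HAMMING}(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$
$w_{HANN}(n) = 0.5\left(1 - \cos\left(\frac{2\pi n}{N-1}\right)\right)$
with $0 \le n \le N-1$.

%% Windowing
% Section ID = 5
num_points = sampleForWindow;
% some windows; use "help window" to list the others
w_gauss = gausswin(num_points);
w_hamming = hamming(num_points);
w_hann = hann(num_points);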
plot(1:num_points,[w_gauss,w_hamming, w_hann]); axis([1 num_points 0 2]);
legend('Gaussian','Hamming','Hann');

old_Y = Y;
for i=1:numFrames
Y(:,i)=Y(:,i).*w_hann;
end

%see the difference


index_to_plot = 88;
figure
plot (old_Y(:,index_to_plot))
hold on
plot (Y(:,index_to_plot), 'green')
hold off
clear num_points w_gauss w_hamming w_hann
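For reference, the three windows can be computed directly from their standard definitions for n = 0 … N−1 (with a minus sign before the cosine, so that the window peaks at the centre of the frame). The Python sketch below is illustrative only: the function names are made up, and `sigma = 0.4` is simply one value within the σ ≤ 0.5 bound.

```python
import math

def hamming(N):
    """Hamming window: 0.54 - 0.46 cos(2 pi n / (N - 1)), n = 0 .. N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def hann(N):
    """Hann window: 0.5 (1 - cos(2 pi n / (N - 1))), n = 0 .. N-1."""
    return [0.5 * (1 - math.cos(2 * math.pi * n / (N - 1))) for n in range(N)]

def gaussian(N, sigma=0.4):
    """Gaussian window centred on (N - 1) / 2."""
    half = (N - 1) / 2
    return [math.exp(-0.5 * ((n - half) / (sigma * half)) ** 2) for n in range(N)]

def apply_window(frame, window):
    """Multiply a frame by a window, sample by sample."""
    return [x * w for x, w in zip(frame, window)]

# A constant frame makes the shape of the window visible in the result.
frame = [1.0] * 5
windowed = apply_window(frame, hann(5))
print(windowed)
```

All three windows taper the frame towards its edges, which reduces the discontinuities introduced by cutting the signal into frames.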
Energy
%% Energy
% Section ID = 6

% It requires that signal is already framed


% Run Section ID=4

for i=1:numFrames
energy(i)=sum(abs(old_Y(:,i)).^2);
end

figure, plot(energy)

$E = \sum_{i=1}^{N} |x(i)|^{2}$
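The energy formula amounts to a sum of squared magnitudes per frame. A minimal Python sketch, with made-up frame values and a hypothetical function name:

```python
def frame_energy(frame):
    """Short-time energy: sum of squared magnitudes of the samples in a frame."""
    return sum(abs(x) ** 2 for x in frame)

# Hypothetical frames: a quiet one and a louder one.
frames = [[0.0, 0.1, -0.1, 0.0], [1.0, -2.0, 3.0, 0.0]]
energy = [frame_energy(f) for f in frames]
print(energy)
```

Frames with low energy are candidates for silence, which is one reason this feature is commonly computed before any further analysis.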
Fast Fourier Transform (FFT)
%% Fast Fourier Transform (on the whole signal)
% Section ID = 7

NFFT = 2^nextpow2(numberOfSamples); % Next higher power of 2 (in order to optimize FFT computation)
freqSignal = fft(data,NFFT);
f = fs/2*linspace(0,1,NFFT/2+1);

% PLOT
plot(f,abs(freqSignal(1:NFFT/2+1)))
title('Single-Sided Amplitude Spectrum of y(t)')
xlabel('Frequency (Hz)')
ylabel('|Y(f)|')

clear NFFT freqSignal f
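To see what the transform returns and how bins map to frequencies, the sketch below uses a direct DFT in plain Python. It computes the same values as an FFT, only much more slowly, so it is a teaching aid rather than a replacement; the function name, the 1 kHz tone and the 64-sample length are assumptions for this example.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform, O(N^2); an FFT gives the same result faster."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

fs = 8000                                   # hypothetical sampling frequency in Hz
N = 64
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(N)]

spectrum = dft(x)
half = spectrum[:N // 2 + 1]                # single-sided spectrum, bins 0 .. N/2
peak_bin = max(range(len(half)), key=lambda k: abs(half[k]))
peak_hz = peak_bin * fs / N                 # bin k corresponds to k * fs / N Hz
print(peak_bin, peak_hz)
```

The frequency axis built here (`k * fs / N` for k = 0 … N/2) is the same one the Matlab code constructs with `fs/2*linspace(0,1,NFFT/2+1)`.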


Short Term Fourier Transform (STFT)
%% Short Term Fourier Transform
% Section ID = 8
% It requires that signal is already framed. Run Section ID=4
NFFT = 2^nextpow2(sampleForWindow);
STFT = ones(NFFT,numFrames);

for i=1:numFrames
STFT(:,i)=fft(Y(:,i),NFFT);
end

indexToPlot = 80; %frame index to plot


if indexToPlot < numFrames
    f = fs/2*linspace(0,1,NFFT/2+1);
    plot(f,2*abs(STFT(1:NFFT/2+1,indexToPlot))) % PLOT
    title(sprintf('FFT of frame %d', indexToPlot));
    xlabel('Frequency (Hz)')
    ylabel(sprintf('|STFT_{%d}(f)|',indexToPlot))
else
    disp('Unable to create plot');
end
% *********************************************
specgram(data,sampleForWindow,fs) % SPECTROGRAM
title('Spectrogram [dB]')
Auto-correlation
$R_x(n) = \sum_{i=1}^{N} x(i) \cdot x(i+n)$

%% Autocorrelation per frame
% Section ID = 9

% It requires that signal is already framed
% Run Section ID=4

for i=1:numFrames
    autoCorr(:,i)=xcorr(Y(:,i));
end

indexToPlot = 80; %frame index to plot

if indexToPlot < numFrames
    % PLOT (non-negative lags only)
    plot(autoCorr(sampleForWindow:end,indexToPlot))
else
    disp('Unable to create plot');
end

clear indexToPlot
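The autocorrelation of a periodic frame peaks at lags equal to multiples of its period, which is why it is often used as a starting point for pitch estimation. The Python sketch below illustrates this on a synthetic 200 Hz tone; the function names, the 50 to 400 Hz search range and the simple peak picking are assumptions for this example and would need refinement on real speech.

```python
import math

def autocorrelation(frame):
    """R(lag) = sum_i frame[i] * frame[i + lag] for lag = 0 .. N-1.

    Terms that would fall past the end of the frame are dropped,
    i.e. the signal is treated as zero outside the frame.
    """
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(n)]

def estimate_pitch(frame, fs, f_min=50.0, f_max=400.0):
    """Pick the lag with the largest autocorrelation inside the allowed pitch range."""
    r = autocorrelation(frame)
    min_lag = int(fs / f_max)
    max_lag = min(int(fs / f_min), len(frame) - 1)
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: r[lag])
    return fs / best_lag

fs = 8000                                    # hypothetical sampling frequency in Hz
frame = [math.sin(2 * math.pi * 200 * n / fs) for n in range(320)]   # 40 ms frame
pitch = estimate_pitch(frame, fs)
print(pitch)
```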
A system for doing phonetics: Praat
• PRAAT is a comprehensive
speech analysis, synthesis, and
manipulation package
developed by Paul Boersma
and David Weenink at the
Institute of Phonetic Sciences
of the University of
Amsterdam, The Netherlands.
Pitch with Praat
Formants with Praat
[Figure: Praat display with the 1st to 5th formants labelled]
Other features with Praat
• Intensity;
• Mel-Frequency Cepstrum Coefficients (MFCC);
• Linear Predictive Coefficients (LPC);
• Harmonic-to-Noise Ratio (HNR);
• and many others.
Scripting in Praat
• Praat can run scripts containing all the different commands available
in its environment and perform the operations and functionalities
that they represent.
Here is an example that produces a pitch listing and saves it in a text file.

fileName$ = "test.wav"
Read from file... 'fileName$'
name$ = fileName$ - ".wav"
select Sound 'name$'
To Pitch (ac)... 0.0 50.0 15 off 0.1 0.60 0.01 0.35 0.14 500.0

numFrame=Get number of frames

for i to numFrame
    time=Get time from frame number... i
    value=Get value in frame... i Hertz
    if value = undefined
        value=0
    endif
    path$=name$+"_pitch.txt"
    fileappend 'path$' 'time' 'value' 'newline$'
endfor
select Pitch 'name$'
Remove
select Sound 'name$'
Remove
Homework
• Exercise 1) Consider a speech signal containing silence, unvoiced and voiced regions, as shown in the figure, and write a Matlab function (or use whatever language you prefer) capable of identifying these sections.

[Figure: speech signal with Silence, Voiced and Unvoiced regions labelled]

• Exercise 2) Then, in the voiced regions, identify the fundamental frequency, the so-called pitch.

Please, try this at home!!
