Digital Signal Processing Report
Group 18
Department of Information and Communication Technology, USTH
Digital Signal Processing
Dr Tung Tran Hoang
April 16, 2022
Table of Contents
Abstract
Introduction
Methodologies
    Import dependencies
    FIR methods
    Another method
        Mix channel using the frame buffer (or phase change method)
        Librosa framework
Conclusion
Outro
References
Abstract
Digital signal processing is an important part of music. A song consists of vocals and instrumental sounds, and separating the vocals from a song is useful for many purposes, such as using the instrumental part as a karaoke soundtrack. In this project we explore several different methods and compare them. The first method creates simple FIR filters to filter out the frequency bands that contain the vocals. The second method works through computation between the frame buffers of the channels and then overwrites the original file. The third method uses pre-written Python libraries for interacting with music data. The musical data in this project comes from a variety of sources, including studio recordings and live recordings, to provide the most intuitive results for comparison and interpretation.
Introduction
Humans can recognize sounds through the ear and classify them; more specifically, each type of sound can be recognized separately. A jazz song, for example, consists of a variety of separate sources such as guitar, saxophone and piano, and most songs also include vocals.
Separating vocals from songs has many uses, whether for the vocals or for the instrumentals. The instrumental sounds can be used as a karaoke backing track. The vocals are the most important element of a song, and whether a singer sings in a low, mid or high register can be discerned through digital signal processing.
What changed after our presentation? Our team realized that the technical approach we presented did not fully explain what was needed. So instead of trying to explain every possible step in detail, we ended up adding more practical examples. The goal is to get even a non-technical person interested in how close and realistic audio processing is to real life.
Methodologies
Import dependencies
To make the project more convenient, we use the following Python libraries and will
explain why we use them.
Figure 1
Python dependencies
Note. The numpy, scipy, matplotlib, librosa, IPython (for Jupyter) and contextlib packages must be installed in the machine's Python environment before they can be used in a Python project based on that environment.
The numpy framework is used for creating arrays for buffers, calculating averages, transposing array matrices, and other tasks mainly related to tuning the data flow.
The scipy framework is used to overwrite '*.wav' files.
The wave framework is used to read data such as the sample width, number of channels, number of frames and sample rate from the '.wav' file.
The contextlib framework is used to obtain a context manager that closes everything after the block completes.
The IPython framework is used to display the audio file in the Jupyter notebook editor.
The librosa framework is an advanced framework for interacting with the audio file.
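A minimal sketch of these imports is shown below; the aliases are illustrative and not necessarily the exact ones used in our notebook.
import contextlib                      # context manager that closes the file after the block
import wave                            # read sample width, channel count, frame count, sample rate
import numpy as np                     # buffers, averages, matrix transposes
import matplotlib.pyplot as plt        # waveform and spectrogram plots
import librosa                         # advanced interaction with the audio file
import IPython.display as ipd          # play audio inside the Jupyter notebook
from scipy.io import wavfile           # write / overwrite '*.wav' files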
Get raw data and interact with it
The purpose of this section is very simple: we take the necessary data from a ".wav" audio file and turn it into data that the computer can interact with. The signal that humans hear, or more precisely the signal that the microphone picks up from the song, is called an analog signal. In the computer environment everything must be in digital form, and the conversion between the two forms is done by a device called an Analog to Digital Converter (ADC for short). The processing itself is done by Digital Signal Processors, with the perhaps confusing abbreviation DSP. In the opposite direction, after the digital information has been edited, the Digital to Analog Converter (DAC) works as its name implies and converts the data back into a form that humans can hear. DAC is also the name of a specialized device used to improve output sound quality; one is built into phones, computers, TVs and any other device that outputs audio.
Get raw data
The target of this section is to get the data of a '.wav' audio file into a digital form that the computer can read. Getting the raw data has three main stages: reading the file information through the wave framework, converting the time of interest into a range of frames, and finally reading the frame buffers of the '.wav' file into a corresponding array for later use.
The first job is to use the wave library to open the file to be read. We wrap this in the contextlib framework so that the file is closed once we have taken the data we need. The data we get from the '.wav' file with this library are the sample rate, the sample width, the number of channels and the total number of frames.
Given the number of frames and the sample rate, we combine them with a time value to determine the frame index at the beginning and the end of the excerpt, as well as the number of frames between those two points. This determination is very simple: multiplying the sample rate by a time in seconds gives the corresponding frame index, so the start and end times map directly to frame positions. It should also be noted that a single sample can be roughly understood as the smallest unit of digital audio, something like a quantum.
Once we have the necessary information about the beginning and end frames, we use it to read each frame's data for that period. The third part is to put the data into an array for interaction; we also define the required data types, 'uint8' and 'int16', corresponding to unsigned 8-bit and signed 16-bit samples. Now all that remains is to read the data accordingly.
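The sketch below illustrates this stage; the file name 'song.wav' and the two-minute excerpt are placeholders rather than the exact values used in our notebook.
import contextlib
import wave
import numpy as np

with contextlib.closing(wave.open('song.wav', 'rb')) as wav:
    sample_rate = wav.getframerate()          # frames (samples per channel) per second
    n_channels  = wav.getnchannels()          # 1 = mono, 2 = stereo
    samp_width  = wav.getsampwidth()          # bytes per sample: 1 or 2 here
    n_frames    = wav.getnframes()

    start, end = 0.0, 120.0                   # excerpt in seconds (placeholder values)
    first = int(start * sample_rate)          # frame index = sample rate x time
    last  = min(int(end * sample_rate), n_frames)

    wav.setpos(first)
    raw = wav.readframes(last - first)        # interleaved bytes for all channels

dtype = np.uint8 if samp_width == 1 else np.int16                # 8-bit is unsigned, 16-bit is signed
data = np.frombuffer(raw, dtype=dtype).reshape(-1, n_channels)   # shape: (frames, channels)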
Interact with raw data
Since the amount of knowledge we gained could not yet be applied to multi-channel audio processing, we converted the required data into a one-dimensional form, which makes it more convenient to use later.
Now comes one of the important parts our team worked on, which is identifying the sound. The data we collect is presented in the form of a graph. This is part of the final exam, but our team went further than that: the audio data is identified mainly in the form of spectrograms of frequency and amplitude.
The use is very simple: from the one-dimensional signal combined with the original sample rate, we compute the spectrogram using the matplotlib library. The results obtained are very positive, and there are examples of this in the next section.
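As a rough illustration of this step, the sketch below averages the channels into a one-dimensional signal and draws its spectrogram with matplotlib; 'data' and 'sample_rate' come from the reading sketch above, and the FFT size and overlap are illustrative choices.
import numpy as np
import matplotlib.pyplot as plt

mono = data.astype(np.float64).mean(axis=1)                   # average the channels -> one-dimensional signal

plt.figure(figsize=(10, 4))
plt.specgram(mono, Fs=sample_rate, NFFT=1024, noverlap=512)   # spectrogram of the mix
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.colorbar(label='Intensity (dB)')
plt.show()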
FIR methods
For this part we use the knowledge learned in the DSP course. We create a low-pass filter, a high-pass filter, and band-pass and band-reject filters, mainly by applying the formulas we have learned. Sound consists of frequencies and amplitudes, and the sound emitted by each source has its own frequency and amplitude; the acquired sound can come from many separate sources. As mentioned above, the audio data and frequencies we obtain are represented as spectrograms, and from these data we test the filters accordingly. This approach also has certain limitations, and some problems arose while testing it on some songs.
High pass filter and low pass filter
The first thing we did was to learn how the low-pass filter works. A low-pass filter removes high frequencies and allows lower frequencies to pass through. The ideal low-pass filter is a sinc filter.
The sinc function (after normalization) is defined as:
sinc(x) = sin(πx) / (πx)
Then we have the impulse response of a sinc filter:
h[n] = 2 f_c sinc(2 f_c n)
where f_c is the cutoff frequency, specified as a fraction of the sampling rate.
Because the sinc filter has infinite length, the delay of the filter would also be infinite, which makes it unrealizable. The solution is to combine it with a window; in this project we use the Hamming window, i.e.
ω(n) = 0.54 − 0.46 cos(2πn / N)
The reason we chose the Hamming window is that it gives a good trade-off between frequency and amplitude accuracy and reduces spectral leakage.
After combining it with the sinc filter, we get the windowed-sinc filter:
h[n] = sinc(2 f_c (n − N/2)) (0.54 − 0.46 cos(2πn / N))
where N is the filter length, which must be odd.
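A minimal Python sketch of this windowed-sinc kernel follows; the helper name lowpass_kernel is ours, the kernel is centred at (N − 1)/2 so that it stays symmetric, and it is normalized for unity gain at DC. The 199 Hz cutoff and N = 461 are the values reported with Figure 3.
import numpy as np

def lowpass_kernel(fc, N):
    # fc is the cutoff as a fraction of the sample rate, N is the (odd) filter length.
    n = np.arange(N)
    h = np.sinc(2 * fc * (n - (N - 1) / 2))              # shifted sinc (np.sinc is the normalized sinc)
    h *= 0.54 - 0.46 * np.cos(2 * np.pi * n / N)         # Hamming window from the formula above
    return h / np.sum(h)                                 # unity gain at DC

h_lpf = lowpass_kernel(199 / sample_rate, 461)           # 199 Hz cutoff, N = 461
lows = np.convolve(mono, h_lpf, mode='same')             # one pass of the low-pass filter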
For the FIR high pass filter, we just implement a spectral inversion, i.e.
1. Change the sign of each value in ℎ[𝑛].
2. Add one to the value in the centre.
How does a spectral inversion work? It is based on the following idea. A low-pass filter
generates a signal with the high frequencies removed. Hence, if you subtract this signal
from the original one, you have exactly the high frequencies. This means that you can
implement a high-pass filter in two steps. First, you compute:
x_lpf[n] = x[n] * h_lpf[n]
where x[n] is the original signal, h_lpf[n] is the low-pass filter and x_lpf[n] is the low-pass-filtered signal. This is a convolution, represented by the asterisk.
Second, you compute:
x_hpf[n] = x[n] − x_lpf[n]
where x_hpf[n] is the high-pass-filtered signal.
The alternative is to adapt the filter itself through spectral inversion. To show that spectral inversion gives the same result, first note that x[n] = x[n] * δ[n], where δ[n] is the delta function (the unit impulse). Now we have:
x_hpf[n] = x[n] − x_lpf[n] = x[n] * δ[n] − x[n] * h_lpf[n] = x[n] * (δ[n] − h_lpf[n])
This means that the high-pass filter is
h_hpf[n] = δ[n] − h_lpf[n]
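A corresponding sketch of the spectral inversion, reusing the low-pass kernel h_lpf from the previous sketch:
import numpy as np

h_hpf = -h_lpf                                  # 1. change the sign of every value
h_hpf[(len(h_hpf) - 1) // 2] += 1.0             # 2. add one at the centre (the delta term)

highs = np.convolve(mono, h_hpf, mode='same')   # high-pass-filtered signal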
In our project we use the live version of a jazz song, the extraordinary "What A Wonderful World" by Louis Armstrong. Combining our simple filter function with the raw data gives a great result, which we present below.
Figure 2.
First two minutes spectrogram
Note. Our version of this song is a '.wav' file converted from the original '.mp3' file, so frequencies above 16 kHz do not appear: '.mp3' is a compressed format, and some details are already lost before the file is converted back to '.wav'.
Figure 3.
Low pass filter and high pass filter combined
Note. This is the result after applying the low-pass filter with a 199 Hz cutoff and the high-pass filter with a 7600 Hz cutoff, both with filter length N = 461, over two passes. The result is really impressive given that the vocals are gone, but the sound is not entirely clean.
Band-pass and band-reject filter
A band-pass filter passes frequencies between the lower limit f_L and the higher limit f_H, and rejects other frequencies. If you don't create a specific filter for this, you can get this result in two steps. In the first step, you apply a low-pass filter with cutoff frequency f_H:
x_lpf,H[n] = x[n] * h_lpf,H[n]
where x[n] is the original signal, h_lpf,H[n] is the low-pass filter with cutoff frequency f_H, and x_lpf,H[n] is the low-pass-filtered signal.
The asterisk represents convolution. The result is a signal in which the frequencies larger than f_H have been rejected. You can then filter that signal again, with a high-pass filter with cutoff frequency f_L:
x_bp,LH[n] = x_lpf,H[n] * h_hpf,L[n]
where h_hpf,L[n] is the high-pass filter with cutoff frequency f_L, and x_bp,LH[n] is the required band-pass-filtered signal.
However, you can do better and combine both of these filters into a single one.
How does that work? You can write
x_bp,LH[n] = (x[n] * h_lpf,H[n]) * h_hpf,L[n] = x[n] * (h_lpf,H[n] * h_hpf,L[n])
where the last step follows from the associative property of convolution. This means that the required band-pass filter is
h_bp,LH[n] = h_lpf,H[n] * h_hpf,L[n]
Hence, a band-pass filter can be created from a low-pass and a high-pass filter with appropriate cutoff frequencies by convolving the two filters.
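A minimal sketch of this combination, reusing the hypothetical lowpass_kernel helper from the earlier sketch and the cutoff values used in our tests:
import numpy as np

h_lpf_H = lowpass_kernel(7600 / sample_rate, 461)     # low-pass at f_H
h_hpf_L = -lowpass_kernel(199 / sample_rate, 461)     # high-pass at f_L via spectral inversion
h_hpf_L[(len(h_hpf_L) - 1) // 2] += 1.0

h_bp = np.convolve(h_lpf_H, h_hpf_L)                  # combined band-pass kernel
band = np.convolve(mono, h_bp, mode='same')           # both stages applied in one convolution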
A band-reject filter rejects frequencies between the lower limit f_L and the higher limit f_H, and passes other frequencies. As for the band-pass filter, you can get this result in two steps. In the first step, you apply a low-pass filter with cutoff frequency f_L:
x_lpf,L[n] = x[n] * h_lpf,L[n]
where x[n] is the original signal, h_lpf,L[n] is the low-pass filter with cutoff frequency f_L, and x_lpf,L[n] is the low-pass-filtered signal.
The result is a signal in which the frequencies in the rejection interval have been eliminated, but in which the frequencies higher than f_H are also gone. This can be corrected by filtering the original signal again with a high-pass filter with cutoff frequency f_H and adding the result to the first signal:
x_br,LH[n] = x_lpf,L[n] + x[n] * h_hpf,H[n]
where h_hpf,H[n] is the high-pass filter with cutoff frequency f_H, and x_br,LH[n] is the required band-reject-filtered signal.
You can again do better and combine both operations into a single filter. You can write:
x_br,LH[n] = x[n] * h_lpf,L[n] + x[n] * h_hpf,H[n] = x[n] * (h_lpf,L[n] + h_hpf,H[n])
where the last step follows from the distributive property of convolution. This means that the required band-reject filter is
h_br,LH[n] = h_lpf,L[n] + h_hpf,H[n]
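The corresponding band-reject sketch, built from the same hypothetical helper; because the two kernels have the same length they can simply be added:
import numpy as np

h_lpf_L = lowpass_kernel(199 / sample_rate, 461)      # low-pass at f_L
h_hpf_H = -lowpass_kernel(7600 / sample_rate, 461)    # high-pass at f_H via spectral inversion
h_hpf_H[(len(h_hpf_H) - 1) // 2] += 1.0

h_br = h_lpf_L + h_hpf_H                              # band-reject kernel (kernels of equal length)
notched = np.convolve(mono, h_br, mode='same')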
Our Python implementation of this part ran well, but we still have to find good cutoff frequencies and filter lengths. The resulting audio quality is quite poor and cannot yet be used for practical purposes.
Figure 4.
Band-pass filter
Note. The frequencies used for this part are the same as in the last part, i.e. 199 Hz for the low cutoff and 7600 Hz for the high cutoff, and the filter length is still 461 for both.
Figure 5.
Band-rejected filter
Note. The frequencies and filter length are the same as in the last part.
Problems and predictions
There were some problems while we were testing our simple filters on the songs, for example figuring out how to use the filters effectively. For both of the examples we used, the results depend on the parameters we chose. For the low-pass and high-pass filters, the result for each filter length N is very difficult to predict. Some tests show that the vocal sound can be filtered out, which suits the goal of "using the background sounds as a karaoke track", but the resulting sound is very quiet, sometimes inaudible, and must be played very loudly to be heard. That is because the audio we filtered out includes instrumental sounds as well. This makes part of the goal, separating the vocals from the song, possible, but the results are far from satisfactory. Another example is the band-pass filter used above. In our tests we also obtained the vocal part and filtered out a lot of the instrumental sound, but conversely some passages sounded torn, and with some parameter values the sound becomes completely inaudible, like the sound of a scratched CD.
Our explanation for the errors that occur during this process is the difficulty of choosing the parameters. According to our team's provisional conclusion, this happens because the filters we used are simple, not at the advanced level that other filtering tools can reach. As we said above, the audio can include many different sources, so the goal is very difficult to achieve even if we spend time tuning each parameter.
Another method
Mix channel using the frame buffer (or phase change method)
The method above did not give us a good result, so we looked for another one. This one comes from a feature of audio-editing software: it inverts the audio samples of one channel and mixes them with the other channel.
Before diving in, let's talk about the channels of a '.wav' file. A '.wav' file can have multiple channels. This is very useful for games or cinema, but in our case there are usually 1 (mono) or 2 (stereo: left, right) channels.
For this method, the process in audio software can be described as follows. Inversion flips the audio samples and reverses their polarity: positive samples are moved below the zero line (becoming negative) and negative samples are made positive. Inversion usually does not affect how the audio sounds, but it can be used to cancel sound. If an invert is applied to a track and that track is mixed with another, uninverted track that sounds the same, the identical audio is cancelled (muted). To prepare the song, if it is mono, you can try using software to convert it to stereo. We chose this method because the vocals are usually the same on both sides of a stereo mix while the music differs between the left and right sides, so the method mostly loses only the bass, which we consider acceptable for jazz music.
We mimic this method in Python by subtracting one channel from the other. The results after implementing this method are really good for a live recording, but we also tried a different song, "Dream a Little Dream of Me" by Louis Armstrong. The results for a song recorded in the studio are not impressive: the method cannot manage to separate the vocals from the audio file. From the results, our opinion is that a live recording suits this method because the stereo setup results in different music on the two channels.
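A minimal sketch of this channel subtraction, assuming the 16-bit stereo array 'data' from the reading step; the output file name is illustrative:
import numpy as np
from scipy.io import wavfile

left  = data[:, 0].astype(np.int32)       # widen to avoid overflow while subtracting
right = data[:, 1].astype(np.int32)

karaoke = left - right                    # identical (centred) content cancels out
karaoke = np.clip(karaoke, -32768, 32767).astype(np.int16)

wavfile.write('karaoke_attempt.wav', sample_rate, karaoke)   # write the mono result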
Figure 6.
Mix channel in a live song
We also implemented a version that combines the FIR filter and the frame-buffer method; the result is still good for the live song and not great for a studio recording.
Figure 7.
Mixing method filter for a live song
Librosa framework
For better results, we use the librosa framework, which gives a great result in separating vocals from audio files. With librosa we can use a method known as music/voice separation using a similarity matrix, or 'REPET-SIM'. We consulted the paper of the same name (Rafii and Pardo, 2012) for instructions on implementing this advanced method. The results obtained are very satisfactory, and the resulting sound is very good; this tool can extract both the vocals and the background sounds. We think that if we adjust the code more carefully, this tool could achieve what some of today's leading music-editing software, such as Audacity and Adobe Audition, can do.
But this can be very difficult, especially after we dig deeper into how this library works. According to the authors, this mechanism may not apply to an entire song; here is the relevant passage verbatim:
The original REPET method can be successfully applied for music/voice
separation on short excerpts (e.g. 10 second verse) [12]. For complete music
pieces, the repeating background is likely to vary over time (e.g. verse followed
by chorus). An extended version of REPET was therefore later introduced to
handle variations in the repeating structure [10]. Rather than finding a global
period, the method tracks local periods of the repeating structure. In both cases,
the algorithm needs to identify periods of the repeating structure, as both methods
assume periodically repeating patterns.
The librosa framework and the REPET-SIM method are very good, but they still have limitations in their working mechanism. There are still better ways to do it, and our team highlights a few potential tools in the final concluding part of this project.
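For reference, the vocal separation example in the librosa documentation implements this similarity-matrix idea with nearest-neighbour filtering and soft masks; the sketch below follows that example, and the margins, power and two-second window are illustrative parameters rather than the exact ones we used.
import numpy as np
import librosa

y, sr = librosa.load('song.wav')                      # placeholder file name
S_full, phase = librosa.magphase(librosa.stft(y))     # magnitude and phase of the STFT

# Nearest-neighbour filtering with a cosine similarity metric estimates the
# repeating (instrumental) background, as in REPET-SIM.
S_filter = librosa.decompose.nn_filter(S_full,
                                       aggregate=np.median,
                                       metric='cosine',
                                       width=int(librosa.time_to_frames(2, sr=sr)))
S_filter = np.minimum(S_full, S_filter)

# Soft masks split the spectrogram into background (instrumental) and foreground (vocals).
mask_i = librosa.util.softmask(S_filter, 2 * (S_full - S_filter), power=2)
mask_v = librosa.util.softmask(S_full - S_filter, 10 * S_filter, power=2)

vocals  = librosa.istft(mask_v * S_full * phase)      # estimated vocal track
backing = librosa.istft(mask_i * S_full * phase)      # estimated instrumental track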
Figure 8.
FFmpeg working diagram
After installing FFmpeg on the system, Python can convert a '.wav' file to an '.mp3' file using the shell command available from the notebook: !ffmpeg -y -loglevel panic -i *.wav *.mp3, with * standing for the file names.
Conclusion
In this project we have fulfilled our team's own goal, which is to use the knowledge learned from DSP, through Python, to interact with a song. In addition, we used several other techniques to find the most effective method. The first is the FIR filter option, which is only reasonable for a music file with a simple source. The second option, based on phase change, can be used for two-channel sources thanks to its mechanism of cancelling what the two sides share, but for sources recorded essentially in mono this method is useless. The third method is the best and most satisfactory in terms of results, but it still has a few minor limitations. Besides these methods, other approaches can be mentioned, such as using machine learning, using other libraries such as PYO and Dejavu, or using dedicated software, which is more user-friendly.
Outro
This is the end of our report. We have gained a lot of knowledge while working on this project, and although it took a lot of time to manage and run the code, the results have improved quite a bit. Thank you for your lectures. And I, the report writer, thank our team for managing their time and effort to come up with a good project.
References
Choudhury, A. (2020, October 10). 7 Python Libraries For Manipulating Audio That Data Scientists Use. …libraries-for-manipulating-audio-that-data-scientists-use/
Interpreting WAV data. (n.d.). Stack Overflow. https://stackoverflow.com/questions/2226853/interpreting-wav-data
Pq, R. (2020, May 3). How to Isolate or Remove Vocals from a Song. Icon Collective College of
Music. https://iconcollective.edu/remove-vocals-from-songs/
Roelandts, T. (2014a, April 15). How to Create a Simple Low-Pass Filter. TomRoelandts.Com.
https://tomroelandts.com/articles/how-to-create-a-simple-low-pass-filter
Roelandts, T. (2014b, April 27). How to Create a Simple High-Pass Filter. TomRoelandts.Com.
https://tomroelandts.com/articles/how-to-create-a-simple-high-pass-filter
Roelandts, T. (2014c, May 10). How to Create Simple Band-Pass and Band-Reject Filters. TomRoelandts.Com. https://tomroelandts.com/articles/how-to-create-simple-band-pass-and-band-reject-filters
Smith, S. W. (1997). The Scientist & Engineer’s Guide to Digital Signal Processing (1st ed.).