

Digital Signal Processing Report: Final project


Interacting with songs

Group 18
Department of Information and Communication Technology, USTH
Digital Signal Processing
Dr Tung Tran Hoang
April 16, 2022

Table of Contents

Abstract
Introduction
Breakdown and analysis
Methodologies
    Import dependencies
    Get raw data and interact with it
        Get raw data
        Interact with raw data
    FIR methods
        High pass filter and low pass filter
        Band-pass and band-reject filter
        Problems and predictions
    Another method
        Mix channel using the frame buffer (or phase change method)
        Librosa framework
        Subpart: compressing and resizing using the FFmpeg framework
Conclusion
Outro
References

Abstract
Digital signal processing is an important part of music. A song consists of vocals and
instrumental sounds. Separating vocals from songs is useful for many purposes, for example using
the instrumental part as a karaoke backing track. In this project we try several different methods
and compare them. The first method creates simple FIR filters to filter out the frequency bands
that contain vocals. The second method works by computing between frame buffers and overwriting
the original file. The third method uses pre-written Python libraries for interacting with music
data. The musical data in this project comes from a variety of sources, including studio
recordings and live recordings, to provide the most intuitive results for comparison and
interpretation.

Digital Signal Processing Report: Final project


Interacting with songs

Introduction
Humans can recognize sounds and classify them by ear, and more specifically, each type
of sound can be recognized separately. A jazz song, for example, consists of a variety of
separate sources such as guitar, saxophone and piano, and most songs also include vocals.
Separating vocals from songs has many uses, for both the vocal and the instrumental parts.
The instrumental track can be used as a karaoke backing track. The vocal is the most important
element in a song: whether a singer sings in a low, mid or high register can be analysed through
digital signal processing.
What changed after our presentation? Our team realized that the technical approach we
presented did not fully explain what was needed. So instead of trying to explain every possible
step in detail, we ended up adding more practical examples. The goal is to get even a
non-technical person interested in how close and realistic audio processing is to real life.

Breakdown and analysis


The audio signal that a microphone receives is in analog form, which is also the form that
humans can hear. Computers rely on analog-to-digital converters (and, in the other direction,
digital-to-analog converters) in their audio hardware. To process a song on a computer, it must
first be converted into digital form. The overall process is: the computer converts the analog
data to digital, we process that digital data, and the result is converted back to analog so it
can be played. The goal of this project is to process that data so that the voice can be
separated from the audio source, or equivalently to obtain the part of the audio that contains
only the instruments.
We use Python for this project. The reason we chose Python is that it comes with libraries
that can handle audio data and has a clear, readable syntax. Furthermore, we are also currently
using Python in the courses of our curriculum.
In the processing stage, we use three methods. The first is based on what our team has
learned in the DSP course: creating simple audio filters. This is also the method we go into in
the most detail. The second is based on phase cancellation, a method we came across while
researching techniques for separating voice from audio files. The third uses a rather powerful
Python audio framework for the extraction.
While making this project, we also extracted spectrograms of the songs, a simple form of
audio recognition, and in addition used a file compression technique to make the exported
results lighter. Since these are not the main goal, we will not go too deeply into them.

Methodologies
Import dependencies
To make the project more convenient, we use the following Python libraries and will
explain why we use them.
Figure 1
Python dependencies

Note. The numpy, scipy, matplotlib, librosa, IPython (for Jupyter) and contextlib
libraries must be installed in the machine's Python environment before they can be used in a
Python project based on that environment.
The numpy library is used for creating arrays for the buffers, calculating averages,
transposing arrays, and other operations related to manipulating the data stream.
The scipy library is used to write the processed ‘.wav’ files back to disk.
The wave library is used to read data such as the sample width, number of channels,
number of frames and sample rate from the ‘.wav’ file.
The contextlib library is used to obtain a context manager that closes the file after the
block completes.
The IPython library is used to play the audio file inside the Jupyter notebook editor.
The librosa library is a more advanced framework for interacting with audio files.
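As a minimal sketch (the exact import list in Figure 1 may differ slightly), the dependencies described above can be brought in as follows:

# Minimal sketch of the project's dependencies (names follow the libraries
# described above; the exact imports in Figure 1 may differ).
import contextlib
import wave

import numpy as np
import matplotlib.pyplot as plt
import librosa
from scipy.io import wavfile
from IPython.display import Audio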
Get raw data and interact with it
The purpose of this section is very simple: we take the necessary data from a “.wav” audio
file and turn it into data that the computer can work with. The signal that humans hear, or more
precisely the signal that the microphone picks up from the song, is called an analog signal. In
the computer environment, everything is processed in digital form. The conversion between the
two forms is done by a device called an analog-to-digital converter (ADC for short), and the
digital data is then processed, often by dedicated digital signal processors, which share the
abbreviation DSP with the field itself. In the opposite direction, after the digital data has
been edited, the digital-to-analog converter works as its name implies, converting the data back
into a form that humans can hear. Its abbreviation, DAC, is also the name of a specialized device
used to improve output sound quality; one is built into phones, computers, TVs and any other
device used to output audio.
Get raw data
The target of this section is to turn the ‘.wav’ audio file into digital data that the
computer can read. Getting the raw data has three main stages: reading the file metadata
through the wave library, working out which frames correspond to the time interval we want,
and finally reading the frame buffers of the ‘.wav’ file into an array for later use.
The first job is to use the wave library to open the file to be read. We wrap this in the
contextlib framework so that the file is closed cleanly once we have extracted the data. The
data we get from the ‘.wav’ file using this library are the sample rate, sample width, number
of channels and the total number of frames.
Given the number of frames and the sampling rate, we combine them with a time value to
determine the frame index at the beginning and at the end, as well as the number of frames
between those two points. This determination is very simple: the sample rate is the number of
frames per second, so multiplying a time in seconds by the sample rate gives the corresponding
frame index. It should also be noted that a single sample can be roughly understood as the
smallest unit of digital audio, something like a quantum.
Once we have the beginning and end frames, we use them to read the frame data for that
period. The third step is to put the data into an array for interaction. We also define the
required data types as 'uint8' and 'int16', corresponding to unsigned 8-bit and signed 16-bit
samples. After that, all that needs to be done is to read the data accordingly.
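A minimal sketch of this reading step, assuming a hypothetical file name and start/end times given in seconds, could look like this:

# Sketch of the raw-data step described above; the file name and the
# start/end times are hypothetical parameters.
import contextlib
import wave
import numpy as np

def read_wav_segment(path, start_s=0.0, end_s=None):
    with contextlib.closing(wave.open(path, 'rb')) as wf:
        sample_rate = wf.getframerate()
        sample_width = wf.getsampwidth()   # bytes per sample (1 or 2 here)
        n_channels = wf.getnchannels()
        n_frames = wf.getnframes()

        if end_s is None:
            end_s = n_frames / sample_rate
        start_frame = int(start_s * sample_rate)   # time (s) x sample rate
        end_frame = int(end_s * sample_rate)

        wf.setpos(start_frame)
        raw = wf.readframes(end_frame - start_frame)

    # 8-bit WAV samples are unsigned, 16-bit samples are signed.
    dtype = np.uint8 if sample_width == 1 else np.int16
    data = np.frombuffer(raw, dtype=dtype)
    return data, sample_rate, n_channels

# Example use (hypothetical file name):
# data, fs, ch = read_wav_segment('what_a_wonderful_world.wav', 0, 120)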
Interact with raw data

To make the data reasonable to use, and because the knowledge we gained could not yet be
applied to multichannel audio processing, we converted the required data into a one-dimensional
form, which makes it more convenient to work with later.
Now comes one of the important parts our team has worked on, which is sound recognition.
The data we collect is presented in the form of a graph. This is part of the final exam this
time around, but our team has gone further than that: the identification of the audio data is
mainly in the form of spectrograms of the audio frequencies and amplitudes.
The use is very simple: from the one-dimensional audio data combined with the original
sample rate, we compute the spectrogram using the matplotlib library. The results obtained are
very positive; there are examples of this in the next section.
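As a sketch, assuming `data`, `fs` and `ch` come from the reading step above, the mono conversion and spectrogram might look like this:

# Sketch of the mono conversion and spectrogram; `data`, `fs` and `ch` are
# assumed to come from the read_wav_segment() sketch above.
import matplotlib.pyplot as plt

# Interleaved samples -> (frames, channels), then keep a single channel so
# the later filtering code can work on one-dimensional data.
if ch > 1:
    mono = data.reshape(-1, ch)[:, 0].astype(float)
else:
    mono = data.astype(float)

plt.specgram(mono, Fs=fs, NFFT=1024, noverlap=512)
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.title('Spectrogram of the input song')
plt.show()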
FIR methods
For this part, we use the knowledge we learned in the DSP course. We create a low-pass
filter, a high-pass filter, and band-pass and band-reject filters, mainly by applying the
formulas we have learned. Sound consists of frequency and amplitude, and the sound emitted by
each source has its own frequency content and amplitude. A recorded sound can come from many
separate sources. As mentioned above, the audio data and frequencies we obtain are represented
as spectrograms, and from these we test the filters accordingly. This approach also has certain
limitations, and problems arise when testing on some songs.
High pass filter and low pass filter
The first thing we did was to learn about the working mechanism of the low pass
filter. A low pass filter is used for filtering high frequencies and allowing lower
frequencies to pass through. The ideal low pass filter is a sinc filter.
The sinc function (after normalization) is defined as

\mathrm{sinc}(x) = \frac{\sin(\pi x)}{\pi x}

The impulse response of the ideal (sinc) low-pass filter is then

h[n] = 2 f_c \,\mathrm{sinc}(2 f_c n)

where f_c is the cutoff frequency, specified as a fraction of the sampling rate.

Because the sinc filter has infinite length, the delay of the filter would also be
infinite, which makes it unrealizable. The solution is to combine it with a window; in this
project we use the Hamming window,

w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N}\right)

The reason we chose the Hamming window is that it offers a good trade-off between
frequency and amplitude accuracy and reduces spectral leakage.
After combining it with the sinc filter, we get a windowed-sinc filter,

h[n] = \mathrm{sinc}\!\left(2 f_c \left(n - \frac{N}{2}\right)\right)
\left(0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N}\right)\right)

where N is the filter length, which must be odd.
For the FIR high-pass filter, we simply apply spectral inversion, i.e.
1. Change the sign of each value in h[n].
2. Add one to the value at the centre.
How does spectral inversion work? It is based on the following idea. A low-pass filter
generates a signal with the high frequencies removed. Hence, if you subtract this signal
from the original one, you are left with exactly the high frequencies. This means that you can
implement a high-pass filter in two steps. First, you compute

x_{lpf}[n] = x[n] * h_{lpf}[n]

where x[n] is the original signal, h_{lpf}[n] is the low-pass filter and x_{lpf}[n] is the
low-pass-filtered signal. This is a convolution, represented by the asterisk.
Second, you compute

x_{hpf}[n] = x[n] - x_{lpf}[n]

where x_{hpf}[n] is the high-pass-filtered signal.
The alternative is to adapt the filter itself through spectral inversion. To show that
spectral inversion has the same result, first note that x[n] = x[n] * \delta[n], where
\delta[n] is the unit impulse. Now we have

x_{hpf}[n] = x[n] - x_{lpf}[n] = x[n] * \delta[n] - x[n] * h_{lpf}[n]
           = x[n] * (\delta[n] - h_{lpf}[n])

This means the high-pass filter is

h_{hpf}[n] = \delta[n] - h_{lpf}[n]
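A minimal sketch of the windowed-sinc low-pass kernel and the spectral-inversion high-pass kernel described above (the unity-gain normalization and the assumed sample rate of 44100 Hz are extra details not spelled out in the text):

# Sketch of the windowed-sinc low-pass filter and the spectral-inversion
# high-pass filter; fs = 44100 Hz is an assumption, cutoffs and N follow
# the figures below.
import numpy as np

def lowpass_kernel(fc, N):
    # fc is the cutoff as a fraction of the sample rate, N the (odd) length.
    n = np.arange(N)
    h = np.sinc(2 * fc * (n - (N - 1) / 2))              # shifted sinc filter
    h *= 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))   # Hamming window
    return h / np.sum(h)                                  # unity gain at DC

def highpass_kernel(fc, N):
    # Spectral inversion: negate the low-pass kernel, add 1 at the centre.
    h = -lowpass_kernel(fc, N)
    h[(N - 1) // 2] += 1.0
    return h

fs = 44100                                   # assumed sample rate
h_lp = lowpass_kernel(199 / fs, 461)
h_hp = highpass_kernel(7600 / fs, 461)
# filtered = np.convolve(mono, h_lp)         # one pass of the low-pass filter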

In our project, we use a live version of a jazz song, the extraordinary
“What A Wonderful World” by Louis Armstrong. Combining our simple filter functions with the
raw data gives a good result, which we present below.
Figure 2.
First two minutes spectrogram

Note. Our version of this song is a ‘.wav’ file converted from an original ‘.mp3’
file. Frequencies above about 16 kHz therefore do not appear: ‘.mp3’ is a lossy,
compressed format, so some detail is already missing before the file is converted
back to ‘.wav’.
Figure 3.
Low pass filter and high pass filter combined

Note. This is the result after setting the low-pass cutoff to 199 Hz and the
high-pass cutoff to 7600 Hz, both with filter length N = 461 and after two passes.
The result is quite impressive in that the vocals are gone, but the remaining sound
is not very clear.
Band-pass and band-reject filter
A band-pass filter passes frequencies between the lower limit f_L and the higher
limit f_H, and rejects all other frequencies. If you do not create a specific filter for this,
you can get the same result in two steps. In the first step, you apply a low-pass filter with
cutoff frequency f_H,

x_{lpf,H}[n] = x[n] * h_{lpf,H}[n]

where x[n] is the original signal, h_{lpf,H}[n] is the low-pass filter with cutoff
frequency f_H, and x_{lpf,H}[n] is the low-pass-filtered signal.
The asterisk represents convolution. The result is a signal in which frequencies larger
than f_H have been rejected. You can then filter that signal again, with a high-pass filter
with cutoff frequency f_L,

x_{bp,LH}[n] = x_{lpf,H}[n] * h_{hpf,L}[n]

where h_{hpf,L}[n] is the high-pass filter with cutoff frequency f_L, and x_{bp,LH}[n] is the
required band-pass-filtered signal.
However, you can do better and combine both of these filters into a single one.
How does that work? You can write

x_{bp,LH}[n] = (x[n] * h_{lpf,H}[n]) * h_{hpf,L}[n] = x[n] * (h_{lpf,H}[n] * h_{hpf,L}[n])

where the last step follows from the associative property of convolution. This means that
the required band-pass filter is

h_{bp,LH}[n] = h_{lpf,H}[n] * h_{hpf,L}[n]

Hence, a band-pass filter can be created from a low-pass and a high-pass filter
with appropriate cutoff frequencies by convolving the two filters.
A band-reject filter rejects frequencies between the lower limit f_L and the higher
limit f_H, and passes all other frequencies. As with the band-pass filter, you can get this
result in two steps. In the first step, you apply a low-pass filter with cutoff frequency f_L,

x_{lpf,L}[n] = x[n] * h_{lpf,L}[n]

where x[n] is the original signal, h_{lpf,L}[n] is the low-pass filter with cutoff
frequency f_L, and x_{lpf,L}[n] is the low-pass-filtered signal.
The result is a signal in which the frequencies in the rejection interval have been
eliminated, but in which the frequencies higher than f_H are also gone. This can be
corrected by filtering the original signal again, with a high-pass filter with cutoff
frequency f_H, and adding the result to the first signal,

x_{br,LH}[n] = x_{lpf,L}[n] + x[n] * h_{hpf,H}[n]

where h_{hpf,H}[n] is the high-pass filter with cutoff frequency f_H, and x_{br,LH}[n] is the
required band-reject-filtered signal.
You can again do better and combine both operations into a single filter. You can
write

x_{br,LH}[n] = x[n] * h_{lpf,L}[n] + x[n] * h_{hpf,H}[n] = x[n] * (h_{lpf,L}[n] + h_{hpf,H}[n])

where the last step follows from the distributive property of convolution.
This means that the required band-reject filter is

h_{br,LH}[n] = h_{lpf,L}[n] + h_{hpf,H}[n]
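A short sketch of these two identities, reusing the hypothetical lowpass_kernel() and highpass_kernel() helpers from the earlier sketch (again assuming fs = 44100 Hz):

# Sketch of the band-pass / band-reject constructions above, reusing the
# hypothetical lowpass_kernel() and highpass_kernel() helpers defined earlier.
import numpy as np

def bandpass_kernel(f_low, f_high, N):
    # Band-pass: low-pass at f_high convolved with high-pass at f_low.
    return np.convolve(lowpass_kernel(f_high, N), highpass_kernel(f_low, N))

def bandreject_kernel(f_low, f_high, N):
    # Band-reject: low-pass at f_low added to high-pass at f_high.
    return lowpass_kernel(f_low, N) + highpass_kernel(f_high, N)

# Example with the report's parameters (cutoffs as fractions of the sample rate):
fs = 44100                                   # assumed sample rate
h_bp = bandpass_kernel(199 / fs, 7600 / fs, 461)
h_br = bandreject_kernel(199 / fs, 7600 / fs, 461)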
Our Python implementation of this part ran well, but we still had to search for good
cutoff frequencies and filter lengths. The resulting audio quality is quite poor and would
not be usable in practice.
Figure 4.
Band-pass filter

Note. The frequencies used for this part are the same as in the last part, i.e.
199 Hz for the low and 7600 Hz for the high cutoff, and the filter length is still 461
for both.
Figure 5.
Band-rejected filter

Note. The frequency and filter length are the same as in the last part.
Problems and predictions
There were some problems while testing our simple filters on songs, for
example figuring out how to use the filters effectively. In both of the examples above,
the results depend heavily on the parameters we used. For the low-pass and high-pass pair,
for each filter length N we use, the results are very difficult to predict. Some tests show
that the vocal can be filtered out, which suits the idea of "using the background sound as a
karaoke track", but the resulting sound is very quiet, sometimes inaudible, and the volume
must be turned up very high to hear it. That is because the audio we filtered out includes
instrumental sounds as well. So part of the goal, separating the vocals from the song, is
achievable, but the results are far from satisfactory. Another example is the band-pass filter
used above. In our tests we obtained the vocal component and filtered out a lot of the
instrumental sound, but conversely some passages appeared to tear, and with some parameter
choices the sound becomes completely unlistenable, like the sound of a scratched CD.
Our explanation for the errors that occur during this process is the difficulty of
choosing parameters. The reason this happens, according to our team's provisional conclusion,
is that the filters we used are simple, not at the advanced level of dedicated filtering tools.
Audio can include many different sources, as we said above, so the goal is very difficult to
achieve even if we spend a lot of time tuning each parameter.
Another method
Mix channel using the frame buffer (or phase change method)
The method above did not give us a good result, so we looked for another one. It
comes from a feature of audio-processing software: invert the audio samples of one channel
and mix the result with the other channel.
Before diving in, let's talk about the channels in a ‘.wav’ file. A ‘.wav’ file can
have multiple channels. This is very useful for games or cinema, but in our case the files
usually have 1 (mono) or 2 (stereo: left, right) channels.
In audio software, the process can be described as follows. Inversion flips the audio
samples, reversing their polarity: positive samples are moved below the zero line (becoming
negative) and negative samples are made positive. Inversion usually does not change how the
audio sounds on its own, but it can be used to remove sound. If an inverted track is mixed
with an uninverted track that sounds the same, the identical audio cancels out (is muted). To
prepare the song, if it is mono, you can try using software to convert it to stereo.
We chose this method because the vocal is usually mixed identically on both channels, while
the instruments differ between left and right; the method therefore mainly loses centred
content such as the bass, which is acceptable for jazz music.
We mimic this behaviour in Python by subtracting one channel from the other. The
results after implementing this method are really good for a live recording, but we also
tried a different song, “Dream a Little Dream of Me” by Louis Armstrong. The results for a
song recorded in the studio are not impressive, as the method cannot separate the vocals from
the audio file. From these results, our opinion is that live recordings suit this method,
because the stereo microphone setup produces different content on the two channels.
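A minimal sketch of this channel subtraction, assuming a stereo array read with scipy (the file names are placeholders):

# Sketch of the channel-subtraction (phase cancellation) idea; `stereo` is
# assumed to be an int16 array of shape (frames, 2).
import numpy as np
from scipy.io import wavfile

def cancel_center(stereo):
    # Subtract the right channel from the left; audio that is identical on
    # both channels (typically the centred vocal) cancels out.
    left = stereo[:, 0].astype(np.float64)
    right = stereo[:, 1].astype(np.float64)
    diff = left - right
    # Rescale to the 16-bit range before writing back to a '.wav' file.
    diff = diff / (np.max(np.abs(diff)) + 1e-12)
    return (diff * 32767).astype(np.int16)

# Example use (hypothetical file names):
# fs, stereo = wavfile.read('live_song.wav')
# wavfile.write('live_song_instrumental.wav', fs, cancel_center(stereo))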
Figure 6.
Mix channel in a live song

We also implemented a version that combines the FIR filter with the frame-buffer
method; the result is still good for the live song and not great for a studio recording.
Figure 7.
Mixing method filter for a live song

Librosa framework
For better results, we use the librosa framework. It gives good results when
separating vocals from audio files. With librosa, we can use a music/voice separation
method based on a similarity matrix, known as REPET-SIM. We consulted the paper of the
same name (Rafii and Pardo, 2012) for guidance on implementing this more advanced method.
The results obtained are very satisfactory, and the resulting sound is very good. This tool
can extract both the vocals and the background sound. We think that if we tuned the code
more carefully, this tool could approach what some of today's leading music-editing software,
such as Audacity or Adobe Audition, can do.
But this can be very difficult, especially after digging deeper into how the library
works. According to the author, the mechanism may not apply to an entire song; here is the
relevant passage verbatim:
The original REPET method can be successfully applied for music/voice
separation on short excerpts (e.g. 10 second verse) [12]. For complete music
pieces, the repeating background is likely to vary over time (e.g. verse followed
by chorus). An extended version of REPET was therefore later introduced to
handle variations in the repeating structure [10]. Rather than finding a global
period, the method tracks local periods of the repeating structure. In both cases,
the algorithm needs to identify periods of the repeating structure, as both methods
assume periodically repeating patterns.

The librosa framework and the REPET-SIM method are very good, but they still have
limitations in their working mechanism. There are better ways to do this, and our team will
highlight a few potential tools in the concluding part of this project.
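As a sketch, librosa's documented vocal-separation recipe builds a REPET-SIM-style separation from nearest-neighbour filtering and soft masks; adapted to our setting (the file name, margins and mask power are assumptions), it looks roughly like this:

# Sketch of a REPET-SIM-style vocal separation with librosa, adapted from the
# library's documented nn_filter / softmask approach; parameters are assumptions.
import numpy as np
import librosa

y, sr = librosa.load('song.wav')                 # hypothetical file name
S_full, phase = librosa.magphase(librosa.stft(y))

# Estimate the repeating background by median-filtering over similar frames
# (the similarity-matrix idea behind REPET-SIM).
S_filter = librosa.decompose.nn_filter(
    S_full,
    aggregate=np.median,
    metric='cosine',
    width=int(librosa.time_to_frames(2, sr=sr)))
S_filter = np.minimum(S_full, S_filter)

# Soft masks for the vocal (foreground) and instrumental (background) parts.
margin_i, margin_v, power = 2, 10, 2
mask_i = librosa.util.softmask(S_filter, margin_i * (S_full - S_filter), power=power)
mask_v = librosa.util.softmask(S_full - S_filter, margin_v * S_filter, power=power)

vocals = librosa.istft(mask_v * S_full * phase)
background = librosa.istft(mask_i * S_full * phase)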

Subpart: compressing and resizing using the FFmpeg framework


One problem we ran into when interacting with ‘.wav’ files is that they are very large.
Through research, we learned that the ‘.wav’ file is a working format used in today's
professional audio environment. The reason it is so large is that it is an uncompressed
format, which also makes it well suited for audio editing. Its quality is very good; it is
arguably the best common format for this kind of audio work. But that leads to storage and
portability issues. The ‘.mp3’ format is probably more familiar to most users today; it is
used widely because of its compactness and convenience for storage. Most music-streaming
platforms and online podcast radios today use the ‘.mp3’ format for their services. This
format compresses the file and is lossy (for comparison, the ‘.wav’ format is lossless), so
the size is much smaller.

To convert ‘.wav’ files to ‘.mp3’, we use FFmpeg, a framework that is very popular
today because of its usefulness. Its working mechanism is summarized well in the
official documentation:

FFmpeg calls the libavformat library (containing demuxers) to read input files and
get packets containing encoded data from them. When there are multiple input files,
FFmpeg tries to keep them synchronized by tracking the lowest timestamp on any
active input stream.
Encoded packets are then passed to the decoder (unless stream copy is selected
for the stream, see further for a description). The decoder produces uncompressed
frames (raw video/PCM audio/...) which can be processed further by filtering (see
next section). After filtering, the frames are passed to the encoder, which encodes
them and outputs encoded packets. Finally, those are passed to the muxer, which
writes the encoded packets to the output file.

Figure 8.
FFmpeg working diagram

After installing FFmpeg on the system, Python (for example in a Jupyter notebook) can
convert a ‘.wav’ file to an ‘.mp3’ file by shelling out to the tool: !ffmpeg -y -loglevel panic
-i *.wav *.mp3, with * standing in for the file names.
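Outside a notebook, the same call can be made from plain Python; a minimal sketch, assuming ffmpeg is on the system PATH and using placeholder file names:

# Sketch of the same conversion via subprocess; ffmpeg is assumed to be on
# the system PATH, and the file names are placeholders.
import subprocess

subprocess.run(
    ['ffmpeg', '-y', '-loglevel', 'panic', '-i', 'song.wav', 'song.mp3'],
    check=True)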

Conclusion
In this project, we have largely achieved the goal our team set for itself: to use the
knowledge learned in DSP, through Python, to interact with songs. In addition, we tried
several other techniques to find the most effective method. The first is the FIR filter
option, which is only reasonable for a music file with a simple source. The second option,
based on phase cancellation, can be used for stereo sources because it works on the
difference between the two channels, but for recordings that are essentially mono it is
useless. The third option is the best and most satisfactory in terms of results, but it
still has a few minor limitations. Beyond these methods, other approaches include machine
learning, other libraries such as PYO or Dejavu, or dedicated audio software, which is more
user-friendly.

Outro
This is the end of our report. We have gained a lot of knowledge from working on this
project, and although it took a lot of time to manage and run the code, the results have
improved quite a bit. Thank you for your lectures. And I, the report writer, thank our team
for putting in the time and effort to produce a good project.

References
Choudhury, A. (2020, October 10). 7 Python Libraries For Manipulating Audio That Data
Scientists Use. Analytics India Magazine.
https://analyticsindiamag.com/7-python-libraries-for-manipulating-audio-that-data-scientists-use/

Interpreting WAV Data. (2010, February 9). Stack Overflow.
https://stackoverflow.com/questions/2226853/interpreting-wav-data

Pq, R. (2020, May 3). How to Isolate or Remove Vocals from a Song. Icon Collective College of
Music. https://iconcollective.edu/remove-vocals-from-songs/

Roelandts, T. (2014a, April 15). How to Create a Simple Low-Pass Filter. TomRoelandts.Com.
https://tomroelandts.com/articles/how-to-create-a-simple-low-pass-filter

Roelandts, T. (2014b, April 27). How to Create a Simple High-Pass Filter. TomRoelandts.Com.
https://tomroelandts.com/articles/how-to-create-a-simple-high-pass-filter

Roelandts, T. (2014c, May 10). How to Create Simple Band-Pass and Band-Reject Filters.
TomRoelandts.Com. https://tomroelandts.com/articles/how-to-create-simple-band-pass-and-band-reject-filters

Smith, S. W. (1997). The Scientist & Engineer's Guide to Digital Signal Processing (1st ed.).
California Technical Pub. http://www.dspguide.com/pdfbook.htm
