Learning FIR Filter Coefficients from Data for Speech-Music Separation
Boyang Wang and Jafar Saniie
Embedded Computing and Signal Processing (ECASP) Research Laboratory (http://ecasp.ece.iit.edu)
Department of Electrical and Computer Engineering
Illinois Institute of Technology, Chicago IL, U.S.A.

Abstract—A Finite Impulse Response (FIR) filter is a widely used digital filter whose impulse response has a finite duration. FIR filters are often favored because they are easy to design and easy to implement on a variety of system architectures. An FIR filter can easily be designed with a linear phase response, and its output is more predictable since it has no feedback components. Because both engineering and mathematical methods exist for designing FIR filters, machine learning has not played an important role in FIR filter design. In this paper, we present an alternative to traditional filter design methods that learns the FIR filter coefficients directly from input data with a machine learning algorithm. With the proposed algorithm, an FIR filter can easily be designed from input data mixed with a designed all-spectrum noise signal. To show the capability of this algorithm, an example application of suppressing background music from speech, or vice versa, is demonstrated. Although music and speech overlap substantially in their spectra, the filter designed by our algorithm can successfully suppress music or speech in a mixture of music and speech signals.

Keywords—FIR Filter, Convolutional Layer, Filter Design, Selective Filtering, Machine Learning, TensorFlow

I. INTRODUCTION

A Finite Impulse Response (FIR) filter has an impulse response of finite duration and is widely used in signal filtering applications such as communication, image processing, and many other signal processing tasks that require signal conditioning, owing to its stability [1] [2]. Equation 1 gives the output of filtering a signal x[n] with an FIR filter of N taps.

$$y[n] = \sum_{k=0}^{N-1} h[k] \cdot x[n-k] \tag{1}$$

The term h[k] is the impulse response of the FIR filter; its values are also referred to as the FIR coefficients. Each tap in an FIR filter is a multiply-accumulate (MAC) unit that contains a register, a multiplier, and an adder, as shown in Figure 1. Equation 1 can also be interpreted as a convolution between the input signal and the impulse response of the FIR filter kernel. Figure 1(a) is a classical FIR filter schematic with 20 taps. A disadvantage of this architecture is that its critical path is $T_{\mathrm{mult}} + 20T_{\mathrm{adder}}$, which dramatically reduces the maximum system clock and jeopardizes the speed of the FIR filter realization. Figure 1(b) shows a transposed implementation of the FIR filter, also called a broadcast FIR filter because the input signal is broadcast directly to all multipliers [3]. It is the preferred architecture since its critical path is always $T_{\mathrm{mult}} + T_{\mathrm{adder}}$, regardless of the number of taps. However, because the input signal is broadcast to N multipliers, fanout must be considered when designing a filter with a large number of taps.

Figure 1. FIR Filter Architecture: (a) direct form with 20 taps (h1 to h20, input X[n], output Y[n]); (b) transposed form (taps h20 to h1)

There are many special types of FIR filters, such as the raised cosine filter and the differentiator filter [4] [5]. These filters have their own specific design concepts and methods. There are multiple methods for designing typical FIR filters with a specified frequency response, including the window design method and the frequency sampling method [6] [7]. These conventional FIR design methods are either mathematically optimal or offer an efficient engineering solution, so machine learning usually plays no role in FIR design when a direct solution already exists. In this paper, we present a machine learning model that learns directly from the input signals and produces an optimized FIR filter. For example, suppose a speech signal is mixed with a music signal and the two need to be separated. This does not appear to be a problem a conventional FIR filter can solve, since speech and music have notable spectral overlap. With the proposed machine learning algorithm, we can learn a special FIR filter that adaptively separates speech signals from music signals.

Section II of this paper discusses the FIR filter design method based on machine learning and its validation procedures. Section III demonstrates an example of separating music and speech.

II. MACHINE LEARNING FIR FILTER DESIGN METHODS

This section introduces the machine learning models for FIR design and their validation. Figure 2 illustrates a case in which Signal A has significant spectral overlap with an unwanted Signal B. The goal of the FIR design is to separate Signal A from Signal B.
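Equation 1 and the transposed structure of Figure 1(b) compute the same output; only the placement of the delay registers differs. The following is a minimal NumPy sketch that checks this numerically. The function names `fir_direct` and `fir_transposed` are illustrative and not from the paper: `fir_direct` evaluates Equation 1 literally, and `fir_transposed` models the transposed filter one sample at a time. It models the arithmetic only, not clock timing or fanout.

```python
import numpy as np

def fir_direct(x, h):
    """Equation 1 evaluated literally: y[n] = sum_k h[k] * x[n - k], with x[m] = 0 for m < 0."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        for k in range(len(h)):
            if n - k >= 0:
                y[n] += h[k] * x[n - k]
    return y

def fir_transposed(x, h):
    """Sample-by-sample model of the transposed (broadcast) FIR of Figure 1(b).

    Every input sample is broadcast to all multipliers; the len(h) - 1 registers
    hold partial sums that advance one adder toward the output per sample.
    """
    regs = np.zeros(len(h) - 1)  # partial-sum registers between the adders
    y = np.zeros(len(x))
    for n, xn in enumerate(x):
        products = h * xn  # one product per tap, all from the same input sample
        y[n] = products[0] + (regs[0] if regs.size else 0.0)
        # Each register takes the next register's old value plus its own product;
        # the register farthest from the output takes only its product.
        regs = np.append(regs[1:], 0.0) + products[1:]
    return y

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 64)
h = rng.standard_normal(20)  # 20 taps, as in Figure 1
print(np.allclose(fir_direct(x, h), fir_transposed(x, h)))         # True
print(np.allclose(fir_direct(x, h), np.convolve(x, h)[: len(x)]))  # True
```

Both models produce identical samples. In hardware, each adder of the transposed form sits between one product and one register, which is why its critical path stays at one multiplier plus one adder regardless of the tap count.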

978-1-7281-5317-9/20/$31.00 ©2020 IEEE


Authorized licensed use limited to: Malaviya National Institute of Technology Jaipur. Downloaded on November 24,2023 at 05:40:41 UTC from IEEE Xplore. Restrictions apply.

Figure 2. Mixed signals with an overlapped spectrum

A. Design of Machine Learning Model

Figure 3 is the block diagram of the proposed model for learning FIR filter coefficients. The training input is a mixture of three signals, as shown in Equation 2. Signal A, $s_A[n]$, is the signal to be estimated. Signal B, $s_B[n]$, must be filtered out. Noise, $v[n]$, is used to eliminate the irrelevant frequency components in the process of filtering Signal B. Equation 3 represents an FIR convolution layer with coefficients w[k]; the model is designed so that the output of this layer estimates Signal A. The expected training output, $\bar{y}$, is therefore defined as Signal A. The loss function of the model is the Mean Square Error (MSE) given in Equation 4. In Equation 3, N is the number of filter taps; in Equation 4, N is the number of samples over which the error is averaged. The coefficients are trained with the backpropagation algorithm [8]: the gradient of the loss function with respect to each individual weight is computed and used to update that weight so as to minimize the MSE. Once training is done, we extract the weights of the convolutional layer and use them as the FIR filter impulse response.

Figure 3. Model for Learning FIR Filter Coefficients

$$x[n] = s_A[n] + s_B[n] + v[n] \tag{2}$$

$$\bar{s}_A[n] = \sum_{k=0}^{N-1} w[k] \cdot x[n-k] \tag{3}$$

$$\mathrm{MSE}\left(\bar{s}_A[n], s_A[n]\right) = \frac{1}{N} \sum_{i=1}^{N} \left(s_A[i] - \bar{s}_A[i]\right)^2 \tag{4}$$

TensorFlow is used in this paper to build and train the model. The Keras API in the TensorFlow library makes it flexible to structure and control the designed model.

B. FIR Filter Design

To show that the model can automatically compute the optimized filter kernel for a desired frequency response, randomly generated all-spectrum noise is used to train the model. In the validation experiment, uniformly distributed noise is used as the training data. As shown in Figure 4, the generated noise is transformed into the frequency domain using the FFT and multiplied by the desired frequency response mask [9]. The masked noise is then converted back to the time domain and used as the expected training output.

Figure 4. Model validation learning block diagram

The model is trained with 10,000 randomly generated noise signals; each noise signal is uniformly distributed and has a length of 8,192 samples. For demonstration, the frequency response mask is designed to look like the Chicago skyline. Figure 5(a) shows the FIR filter coefficients extracted from the trained model after training the convolutional layer with 400 coefficients for 100 epochs. Figure 5(b) is the frequency and phase response of the digital filter. Figure 5(c) is the frequency response of the designed filter kernel plotted against the frequency response mask that was used to modulate the noise in the frequency domain. This method provides a solution very close to that of the IDCT digital FIR filter design method, but it will never perform better, since the IDCT is the direct solution to the problem. The result of this experiment shows that our model is fully capable of learning the FIR filter kernel from time-domain input data.

Figure 5. FIR Filter kernel learned from the desired mask spectrum: (a) learned FIR filter coefficients; (b) frequency and phase response; (c) frequency response plotted against the mask

Another important validation experiment shows that the model can characterize different frequencies by learning directly from the input data. In this experiment, training data mixed from


8 sinusoidal signals that have different center frequencies, amplitudes, and phases are created. Figure 6(a) shows the generated training data and its frequency response. We choose three of these eight frequency components to create the corresponding training output, as shown in Figure 6(b). To suppress the irrelevant frequency bands, all-spectrum uniformly distributed noise is added to the training data. Figure 6(c) is the acquired FIR filter impulse response with 300 taps, and Figure 6(d) is the digital filter frequency response. Notably, the learned frequency response shows that the designed filter applies additional attenuation where the unwanted frequency components have higher energy. The result shows that our model is fully capable of learning, from time-domain data, to separate different frequency components.

This FIR coefficient design method has very little control over the phase response where a frequency component is attenuated to almost zero. This is because frequency components that are zeroed contribute very little to the backpropagation computation. The proposed algorithm computes only the statistically optimized solution, which is not always the best solution when the designer already knows the specific frequency response required.

Figure 6. Experiment to show that the model can be used to separate different frequency components: (a) training data; (b) training output; (c) learned impulse response (300 taps); (d) frequency response

III. EXAMPLE APPLICATION ON MUSIC AND SPEECH

Source separation is a classical problem in speech processing and other signal processing algorithms [10] [11]. Clean separation is difficult to achieve when the sources to be separated have overlapping frequency components. However, with the proposed machine learning filter coefficient design method, we can achieve acceptable suppression of one of two spectrally overlapping sources with a single FIR filter. Figure 7 shows the spectrum of flute music, speech, and a mixture of both. The sampling rate for all audio signals presented in this paper is 44,100 Hz. As can be observed in the figure, the music and speech signals overlap considerably in the frequency domain, which makes them harder to separate.

Figure 7. The spectrum of music, speech and their mixture

During training, we use the mixture of the music signal, the speech signal, and all-spectrum noise as the training input. Either the music signal or the speech signal is selected as the expected training output. To select the optimal number of FIR taps, we train the model with different tap numbers for 50 epochs and plot the final training loss against the tap number, as shown in Figure 8.

Figure 8. Training loss after 50 epochs plotted against FIR tap number

Figure 9 shows the MSE training loss against epochs after the model is trained with 2000 taps for 100 epochs.

Figure 9. Training loss against epochs
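The training of Equations 2 to 4 and the tap-number sweep of Figure 8 can be sketched without TensorFlow, because Equation 3 is a linear convolution with no bias or activation, so the MSE gradient has a closed form. The following NumPy sketch is an illustration under stated assumptions, not the authors' Keras code: plain full-batch gradient descent stands in for the paper's backpropagation optimizer, the two tones standing in for music and speech are hypothetical, and the helper names `delay_matrix` and `train_fir` are hypothetical.

```python
import numpy as np

def delay_matrix(x, num_taps):
    """Return X with X[n, k] = x[n - k] (zero when n < k), so X @ w evaluates Equation 3."""
    X = np.zeros((len(x), num_taps))
    for k in range(num_taps):
        X[k:, k] = x[: len(x) - k]
    return X

def train_fir(x, target, num_taps, epochs=100):
    """Fit FIR weights w[k] by full-batch gradient descent on the MSE of Equation 4.

    Returns (w, final_mse). Plain gradient descent is a stand-in; the paper
    does not name its optimizer in the text.
    """
    X = delay_matrix(x, num_taps)
    n = len(x)
    # Step 1/L, with L = (2/n) * sigma_max(X)**2 the Lipschitz constant of the
    # MSE gradient; this step size keeps gradient descent from diverging.
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    w = np.zeros(num_taps)
    for _ in range(epochs):
        err = target - X @ w           # s_A[n] minus the current estimate
        grad = -2.0 / n * (X.T @ err)  # dMSE/dw[k] = -(2/n) * sum_n err[n] * x[n - k]
        w -= step * grad
    return w, float(np.mean((target - X @ w) ** 2))

# Hypothetical stand-ins for Equation 2: two nearby tones plus uniform noise.
rng = np.random.default_rng(0)
n = 8192
t = np.arange(n)
s_a = np.sin(2 * np.pi * 0.010 * t)               # "Signal A" (to keep)
s_b = 0.8 * np.sin(2 * np.pi * 0.013 * t + 1.0)   # "Signal B" (to suppress)
v = rng.uniform(-0.5, 0.5, n)                     # all-spectrum noise v[n]
x = s_a + s_b + v                                 # Equation 2

# Tap-number sweep in the spirit of Figure 8: final training loss per tap count.
for taps in (8, 32, 128):
    _, loss = train_fir(x, s_a, taps, epochs=50)
    print(f"{taps:4d} taps: final MSE = {loss:.4f}")
```

Because Equation 4 is quadratic in w, the iterations approach the least-squares solution for the chosen tap count, which `np.linalg.lstsq(delay_matrix(x, taps), s_a, rcond=None)` returns directly; this is one reading of the "statistically optimized" filter described in Section II.B. If the weights are instead learned with a deep-learning convolution layer, note that such layers typically compute cross-correlation, so the learned kernel may need to be reversed before it is used as h[k].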


Figure 10 shows the designed FIR filter impulse responses and their frequency responses. As can be seen from the figure, the music signal extractor and the speech signal extractor overlap on the frequency axis. The designed filter applies extra attenuation in the parts of the spectrum where the other signal, which is to be suppressed, has high energy.

Figure 10. (a) FIR filter impulse responses (music, speech; axis: FIR filter coefficient) and (b) frequency responses (music extractor, speech extractor; axis: frequency in kHz)

Figure 11 shows the mixed speech and music signal processed by the FIR coefficients designed with machine learning. As can be observed in the figure, the filters successfully suppress the unwanted signal. Experimental results show that when the voice signal is suppressed, the mean square error between the filtered signal and the music signal is 0.0035. When the music signal is filtered out instead, the mean square error between the filtered output and the voice signal is 0.006. When the filtered signals are converted back into WAV files and played, the unwanted signal sounds like a background whisper; it cannot be completely removed, since this method is, after all, only an FIR approach.

Figure 11. Extracted music and speech signal spectra (music and speech panels; axis: frequency in kHz)

IV. CONCLUSION

In this paper, we provide a simple but effective machine learning model for learning FIR filter coefficients directly from input data. The proposed algorithm can produce an FIR filter impulse response that separates signals with highly overlapping spectra. Uniformly distributed noise is added to the training data to eliminate the irrelevant frequency components. This method finds the statistically optimized FIR filter impulse response for the given number of taps according to the training data set. The learned FIR filter coefficients have limited control over highly attenuated frequency components. The designed FIR filter provides a linear phase response over most of its passbands. An example application that extracts a speech or music signal from a mixture of the two is demonstrated. The result shows that our model can achieve a complicated filter design simply by training with the input data.

REFERENCES

[1] M. B. Trimale and Chilveri, "A review: FIR filter implementation," in 2017 2nd IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT), 2017.
[2] M. Ferrario, A. Spalvieri and R. Valtolina, "Design of transmit FIR filters for FDM data transmission systems," IEEE Transactions on Communications, vol. 52, no. 2, pp. 180-182, 2004.
[3] Xilinx, "PG149 LogiCORE IP FIR Compiler v7.1, Product Guide," 2 April 2014. [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/fir_compiler/v7_1/pg149-fir-compiler.pdf.
[4] N. S. Alagha and P. Kabal, "Generalized raised-cosine filters," IEEE Transactions on Communications, vol. 47, no. 7, pp. 989-997, 1999.
[5] C.-C. Tseng, "Digital differentiator design using fractional delay filter and limit computation," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 52, no. 10, pp. 2248-2259, 2005.
[6] A. E. Cetin, O. N. Gerek and Y. Yardimci, "Equiripple FIR filter design by the FFT algorithm," IEEE Signal Processing Magazine, vol. 12, no. 2, pp. 60-64, 1997.
[7] M. G. Shayesteh and M. Mottaghi-Kashtiban, "FIR filter design using a new window function," in 2009 16th International Conference on Digital Signal Processing, 2009.
[8] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, 2016.
[9] E. O. Brigham and R. E. Morrow, "The fast Fourier transform," IEEE Spectrum, vol. 4, no. 12, pp. 63-70, 1967.
[10] E. M. Grais and H. Erdogan, "Single channel speech-music separation using matching pursuit and spectral masks," in 2011 IEEE 19th Signal Processing and Communications Applications Conference (SIU), Antalya, 2011.
[11] P. Mowlaee, A. Sayadian, M. Sheikhan and M. Fallah, "Single-channel music/speech separation using non-linear masks," in 2008 International Symposium on Telecommunications, Tehran, 2008.
