Learning FIR Filter Coefficients From Data For Speech-Music Separation
Abstract—A Finite Impulse Response (FIR) filter is a widely used digital filter whose impulse response has a finite duration. FIR filters are favored for many reasons: they are easy to design, easy to implement on a variety of system architectures, can readily be designed with a linear phase response, and their output is more predictable since they contain no feedback paths. In this paper, we present a machine learning model that learns FIR filter coefficients directly from input data. Despite the fact that music and speech have a lot of overlap in their spectra, the filter designed by our algorithm can successfully suppress music or speech in a mixture of music and speech signals.

Keywords—FIR Filter, Convolutional Layer, Filter Design, Selective Filtering, Machine Learning, TensorFlow
I. INTRODUCTION

A Finite Impulse Response (FIR) filter has a finite-duration impulse response and is widely used in signal filtering applications such as communication, image processing, and other signal processing tasks that require signal conditioning, owing to its unconditional stability [1] [2]. Equation (1) shows the formula for filtering a signal x[n] with an FIR filter of N taps:

$y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k]$   (1)
The term h[k] is the impulse response of the FIR filter, also referred to as the FIR coefficients. Each tap in an FIR filter is a multiply-accumulate (MAC) unit containing a register, a multiplier, and an adder, as shown in Figure 1. The formula can also be interpreted as a convolution between the input signal and the FIR filter impulse response. Figure 1(a) is a classical FIR filter design schematic with 20 taps. A disadvantage of this architecture is that the critical path is T_mult + 20·T_adder, which dramatically reduces the maximum system clock and jeopardizes the speed of the FIR filter realization. Figure 1(b) shows a transposed implementation of the FIR filter, also called a broadcast FIR filter since the input signal is broadcast directly to all multipliers [3]. It is the preferred architecture since its critical path is always T_mult + T_adder regardless of the number of taps. However, because the input signal is broadcast to N multipliers, fanout must be considered when designing a filter with a large number of taps.

[Figure 1. FIR filter architecture: (a) classical (direct-form) structure; (b) transposed (broadcast) structure.]
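To make the convolution interpretation concrete, the following minimal NumPy sketch (our illustration, not code from the paper) implements Equation (1) directly and checks it against numpy.convolve:

    import numpy as np

    def fir_filter(x, h):
        # Equation (1): y[n] = sum_{k=0}^{N-1} h[k] * x[n-k]
        y = np.zeros(len(x))
        for n in range(len(x)):
            for k in range(len(h)):
                if n - k >= 0:
                    y[n] += h[k] * x[n - k]
        return y

    x = np.random.randn(1000)   # example input signal
    h = np.ones(20) / 20.0      # example 20-tap moving-average kernel
    assert np.allclose(fir_filter(x, h), np.convolve(x, h)[:len(x)])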
There are many special types of FIR filters, such as the raised cosine filter and the differentiator filter [4] [5], each with its own specific design concepts and methods. There are also multiple methods for designing typical FIR filters with a specified frequency response, including the window design method and the frequency sampling method [6] [7]. These conventional FIR design methods are mathematically optimized or offer an efficient engineering solution, so machine learning usually plays no role in FIR design when a direct solution already exists. In this paper, we present a machine learning model that learns directly from the input signals and produces an optimized FIR filter. For example, suppose a speech signal is mixed with a music signal and the two need to be separated. This does not seem to be a problem that a conventional FIR filter can solve, since speech and music have notable spectral overlap. With the proposed machine learning algorithm, we can learn a special FIR filter that decomposes speech signals from music signals adaptively.

Section II of this paper discusses the FIR filter design method with machine learning algorithms and its validation procedures. Section III demonstrates an example of separating music and speech.

II. MACHINE LEARNING FIR FILTER DESIGN METHODS

This section introduces the FIR machine learning models and their validation. Figure 2 illustrates that Signal A has significant spectral overlap with an unwanted Signal B; the goal of the FIR design is to separate Signal A from Signal B.
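The paper does not list its implementation, but its keywords name TensorFlow and a convolutional layer, so a minimal sketch of such a model could be a single bias-free, linear Conv1D layer whose kernel is the FIR impulse response, trained with an MSE loss (the optimizer and layer settings below are our assumptions):

    import tensorflow as tf

    NUM_TAPS = 300  # tap count is a design choice; see the sweep in Figure 8

    # One linear convolutional layer with no bias: its kernel IS the
    # learned FIR impulse response h[k].
    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(filters=1, kernel_size=NUM_TAPS,
                               padding="same", use_bias=False,
                               activation=None, input_shape=(None, 1)),
    ])
    model.compile(optimizer="adam", loss="mse")

    # x_mix: noisy mixture input, y_target: the signal to extract,
    # both shaped (batch, samples, 1):
    # model.fit(x_mix, y_target, epochs=50)

    # The learned FIR coefficients can be read out of the kernel:
    # h = model.layers[0].get_weights()[0][:, 0, 0]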
Eight sinusoidal signals with different center frequencies, amplitudes, and phases are created. Figure 6(a) shows the generated training data and its frequency response. We choose three of these eight frequency components to create the corresponding training output, as shown in Figure 6(b). In order to suppress the irrelevant frequency bands, all-spectrum uniformly distributed noise is added to the training data. Figure 6(c) is the acquired FIR filter impulse response with 300 taps, and Figure 6(d) is its frequency response. Notably, the learned frequency response shows that the designed filter applies additional attenuation where the unwanted frequency components have higher energy. The result shows that our model is fully capable of separating different frequency components in the time domain.

[Figure 6. (a) Training input and its frequency response; (b) training output; (c) learned FIR impulse response (300 taps); (d) learned filter frequency response.]
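The exact frequencies, amplitudes, and phases used in the paper are not given; a sketch of how such training data could be generated, under assumed values, is:

    import numpy as np

    rng = np.random.default_rng(0)
    fs = 44_100                    # sampling rate used in this paper (Hz)
    t = np.arange(fs) / fs         # one second of samples

    # 8 sinusoids with distinct (assumed) frequencies, amplitudes, phases.
    freqs = np.array([500, 1200, 2000, 3100, 4400, 5800, 7500, 9000])
    amps = rng.uniform(0.5, 1.0, size=8)
    phases = rng.uniform(0, 2 * np.pi, size=8)
    parts = amps[:, None] * np.sin(2 * np.pi * freqs[:, None] * t + phases[:, None])

    # Training input: all 8 components plus all-spectrum uniform noise.
    x_train = parts.sum(axis=0) + rng.uniform(-0.1, 0.1, size=t.size)
    # Training output: keep only 3 of the 8 components.
    y_train = parts[[1, 3, 6]].sum(axis=0)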
This FIR coefficient design method has very little control over its phase response wherever a frequency component is attenuated to almost zero. This is because frequency components that are zeroed out contribute very little to the backpropagation computation. The proposed algorithm only computes the statistically optimized solution, which is not always the best solution when the designer already knows the desired frequency response.
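One way to see this effect (our sketch, with a stand-in for the learned taps) is to examine the magnitude and phase of the learned filter on a dense frequency grid:

    import numpy as np

    h = np.random.randn(300)               # stand-in for the 300 learned taps
    H = np.fft.rfft(h, n=4096)             # frequency response on a dense grid
    freq = np.fft.rfftfreq(4096, d=1 / 44_100)
    magnitude_db = 20 * np.log10(np.abs(H) + 1e-12)
    phase = np.unwrap(np.angle(H))
    # In strongly attenuated bands (|H| near zero), the phase is effectively
    # unconstrained by training, as discussed above.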
Figure 7 shows the spectra of a music signal, a speech signal, and a mixture of both. The sampling rate for all the voices presented in this paper is 44,100 Hz. As can be observed in the figure, the music and speech signals have a lot of overlap in the frequency domain, which raises the difficulty of separating them.

[Figure 7. The spectrum of music, speech, and their mixture.]

During training, we use the mixture of the music signal, the speech signal, and all-spectrum noise as the training input. Either the music signal or the speech signal is selected as the expected training output. In order to select the optimized FIR tap number, we train the model with different tap numbers for 50 epochs and plot the final training loss against tap number, as shown in Figure 8.
[Figure 8. Training loss after 50 epochs plotted against FIR tap number.]
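A sketch of the sweep behind Figure 8 (the candidate tap counts, training setup, and stand-in data below are our assumptions):

    import numpy as np
    import tensorflow as tf

    # Stand-ins for the real training pair (mixture + noise in, target out),
    # shortened for the sketch.
    x_mix = np.random.randn(4, 8192, 1).astype("float32")
    y_target = np.random.randn(4, 8192, 1).astype("float32")

    def build_model(num_taps):
        # Same single-layer FIR model as the sketch in Section II.
        model = tf.keras.Sequential([
            tf.keras.layers.Conv1D(1, num_taps, padding="same",
                                   use_bias=False, input_shape=(None, 1)),
        ])
        model.compile(optimizer="adam", loss="mse")
        return model

    final_losses = {}
    for num_taps in [100, 250, 500, 1000, 2000, 4000]:
        model = build_model(num_taps)
        history = model.fit(x_mix, y_target, epochs=50, verbose=0)
        final_losses[num_taps] = history.history["loss"][-1]
    # final_losses is then plotted against tap count, as in Figure 8.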
After training the model with 2000 taps for 100 epochs, Figure 9 shows the MSE training loss against epochs.
Figure 10 shows the designed FIR filter impulse responses and their frequency responses. As can be seen from the figure, the music signal extractor and the speech signal extractor overlap on the frequency axis. Each designed filter applies extra attenuation to a band when the other signal, the one to be suppressed, has high energy in that frequency band.

[Figure 10. (a) FIR filter impulse responses and (b) frequency responses of the music extractor and the speech extractor.]
Figure 11 shows the mixed speech and music signal after filtering with the FIR coefficients learned by the model. As can be observed in the figure, the filters successfully suppress the unwanted signal. Experimental results show that when the voice signal is suppressed, the mean square error between the filtered signal and the music signal is 0.0035. On the other hand, when the music signal is chosen to be filtered out, the mean square error between the filtered output and the voice signal is 0.006. When the filtered signals are restored to wav files and played back, the unwanted signal sounds like a background whisper but cannot be completely removed, since this method is, after all, only an FIR approach.

[Figure 11. Extracted music and speech signal spectrums.]
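A sketch of this evaluation step (the arrays below are stand-ins; the real mixture, learned taps, and reference come from the trained model and data set):

    import numpy as np
    from scipy.io import wavfile

    fs = 44_100
    mixture = np.random.randn(fs)           # stand-in: mixed speech + music
    music = np.random.randn(fs)             # stand-in: clean music reference
    h_music = np.random.randn(2000) / 2000  # stand-in: learned 2000 taps

    y = np.convolve(mixture, h_music)[:len(mixture)]  # filter the mixture
    mse = np.mean((y - music) ** 2)  # the paper reports 0.0035 for music
    print(f"MSE vs. music reference: {mse:.4f}")

    # Restore the filtered signal to a wav file for listening tests.
    wavfile.write("extracted_music.wav", fs, y.astype(np.float32))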
IV. CONCLUSION

In this paper, we provide a simple but effective machine learning model for learning FIR filter coefficients directly from the input data. The proposed algorithm can produce FIR filter impulse responses that separate signals with heavily overlapping spectra. Uniformly distributed noise is added to the training data to eliminate the irrelevant frequency components. The method finds the statistically optimized FIR filter impulse response for the given number of taps and training data set. The learned FIR filter coefficients have limited control over the highly attenuated frequency components. The designed FIR filter provides a linear phase response over most of its passbands. An example application that extracts a speech or music signal from a mixture of the two is demonstrated. The results show that our model can achieve a very complicated design by simply training with the input data.

REFERENCES

[1] M. B. Trimale and Chilveri, "A review: FIR filter implementation," in 2017 2nd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), 2017.
[2] M. Ferrario, A. Spalvieri and R. Valtolina, "Design of transmit FIR filters for FDM data transmission systems," IEEE Transactions on Communications, vol. 52, no. 2, pp. 180-182, 2004.
[3] Xilinx, "PG149 LogiCORE IP FIR Compiler v7.1, Product Guide," 2 April 2014. [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/fir_compiler/v7_1/pg149-fir-compiler.pdf.
[4] N. S. Alagha and P. Kabal, "Generalized raised-cosine filters," IEEE Transactions on Communications, vol. 47, no. 7, pp. 989-997, 1999.
[5] C.-C. Tseng, "Digital differentiator design using fractional delay filter and limit computation," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 52, no. 10, pp. 2248-2259, 2005.
[6] A. E. Cetin, O. N. Gerek and Y. Yardimci, "Equiripple FIR filter design by the FFT algorithm," IEEE Signal Processing Magazine, vol. 14, no. 2, pp. 60-64, 1997.
[7] M. G. Shayesteh and M. Mottaghi-Kashtiban, "FIR filter design using a new window function," in 2009 16th International Conference on Digital Signal Processing, 2009.
[8] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, 2016.
[9] E. O. Brigham and R. E. Morrow, "The fast Fourier transform," IEEE Spectrum, vol. 4, no. 12, pp. 63-70, 1967.
[10] E. M. Grais and H. Erdogan, "Single channel speech-music separation using matching pursuit and spectral masks," in 2011 IEEE 19th Signal Processing and Communications Applications Conference (SIU), Antalya, 2011.
[11] P. Mowlaee, A. Sayadian, M. Sheikhan and M. Fallah, "Single-channel music/speech separation using non-linear masks," in 2008 International Symposium on Telecommunications, Tehran, 2008.