
AUDIO QUERY-BASED MUSIC SOURCE SEPARATION

Jie Hwan Lee∗ Hyeong-Seok Choi∗ Kyogu Lee


Music and Audio Research Group,
Graduate School of Convergence Science and Technology,
Seoul National University
{wiswisbus, kekepa15, kglee}@snu.ac.kr

ABSTRACT

In recent years, music source separation has been one of the most intensively studied research areas in music information retrieval. Improvements in deep learning have led to great progress in music source separation performance. However, most of the previous studies are restricted to separating a limited number of sources, such as vocals, drums, bass, and other. In this study, we propose a network for audio query-based music source separation that can explicitly encode the source information from a query signal regardless of the number and/or kind of target signals. The proposed method consists of a Query-net and a Separator: given a query and a mixture, the Query-net encodes the query into the latent space, and the Separator estimates masks conditioned on the latent vector, which are then applied to the mixture for separation. The Separator can also generate masks using latent vectors obtained from the training samples, allowing separation in the absence of a query. We evaluate our method on the MUSDB18 dataset, and experimental results show that the proposed method can separate multiple sources with a single network. In addition, through further investigation of the latent space, we demonstrate that our method can generate continuous outputs via latent vector interpolation.

∗ These authors contributed equally.

© Jie Hwan Lee, Hyeong-Seok Choi, Kyogu Lee. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Jie Hwan Lee, Hyeong-Seok Choi, Kyogu Lee. "Audio query-based music source separation", 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.

1. INTRODUCTION

Music source separation, isolating the signals of certain instruments from a mixture, has been intensively studied in recent years. Due to improvements in deep learning techniques, various approaches using deep learning for music source separation have been introduced. However, most of the previous studies are mainly focused on improving music source separation performance, not on widening the range of separable sources. To tackle this problem, a few studies have tried to separate a fixed number of sources of interest by conditioning on a one-hot label in the deep learning network [14, 15].

While being the most straightforward approach, we argue that such an approach is not a proper way to deal with outliers when generic and broadly defined class labels are the only available data at hand [7, 11]. To understand this situation more concretely, let us consider the mismatched situations where the target source is classified into a certain generic class but is still somewhat far from the general characteristics of that broadly defined class. For example, consider the situation where we desire to separate 'distorted singing voice' or 'acoustic guitar' sources. In these cases, we can imagine that the performance could be boosted if we had more fine-grained labels such as 'distorted singing voice' or 'acoustic guitar' rather than generic classes such as 'vocals' or 'guitar'. One of the simplest ad-hoc solutions, therefore, is to manually annotate such outliers based on a musical instrument ontology and condition the deep learning network on those new classes. Unfortunately, manually annotating an audio signal has limitations in many aspects. First, labeling audio is itself costly. Second, given the same audio samples, the number of samples per class is reduced, hence it is likely that the separation performance degrades. Third, such a method is not scalable to new outlier samples, and is thus limited.

Figure 1. t-SNE visualization [8] of encoded latent vectors of the test dataset in MUSDB18. Without any classification loss, the Query-net is trained to output latent vectors that provide useful information about various instruments. It is observable that the latent vectors from the same class are clustered in the latent space while not being identical.

To deal with these problems, in this paper, a novel audio query-based music source separation framework is proposed. The main idea is to directly compress diverse audio samples into latent vectors using the so-called Query-net, so that the audio samples can be mapped to non-identical points even when the samples are from the same class, as illustrated in Fig. 1. The encoded latent vector is then fed into a separation network to output a source whose characteristics are similar to the audio sample given to the Query-net. The proposed framework is scalable, as the Query-net is able to encode an unseen singing voice or instrument sound into the continuous latent space. This property allows many useful utilities as follows. First, it is capable of separating various numbers of sources with a single network. Second, we can expect an increase in separation performance, especially when the characteristics of the target source in the mixture are considered far from the given generic class, since the user can manually select and encode a held-out sound sample that is deemed similar to the target signal. Third, it allows natural control of the output of the separation network by interpolating latent vectors in the continuous latent space.

To demonstrate the usefulness of the proposed method, we show various experiments using the MUSDB18 dataset [11]. The experiments show that the output of the separation network is highly dependent on the latent vector, which allows smooth transitions at the signal level by controlling and interpolating the latent vectors. Also, we show that the proposed method becomes especially useful when the target source of interest is far from the general characteristics of a coarsely defined sound class. Finally, we show that the proposed method can even be automated by iteratively encoding the separation output.
2. RELATED WORK

In this section, we first introduce previous music source separation studies that tried to separate a mixture into multiple sound classes. One of the most basic ways is to estimate several separation masks with a single model. In [10], they tried to separate four sources with one stacked hourglass model [9]. While they showed competitive results, the method is not flexible as the model requires a fixed number of outputs. Next, [15] introduced a one-hot label conditioning approach and showed that their proposed method is capable of separating multiple sources. This method is more flexible than the aforementioned model, but the model does not assume a latent space and therefore is not capable of manipulating the output other than by conditioning on the one-hot label. Finally, [14] showed that they can embed each time-frequency bin of the mixture into a high-dimensional space using a deep clustering [1] approach. However, this approach still has a limitation in that the model is not capable of encoding an audio signal directly into the latent space. Apart from the music source separation studies, [21] suggested a speaker-dependent speech separation method by incorporating an LSTM-based anchor vector encoder which enables direct encoding of an audio signal into a latent space. Using this technique, they showed that the proposed method can cluster the time-frequency bin embeddings that are close to the anchor vector in the latent space.

Figure 2. Illustration of the (a) Query-net and (b) Separator. The Query-net encodes the query into the latent vector, and it is passed into the Separator by two methods. 1. Concatenation: the latent vector is concatenated with the mixture spectrogram by tiling the latent vector along the spatial dimensions. 2. AdaIN: adaptive instance normalization is used in every layer of the decoder part.

3. PROPOSED METHOD

3.1 Query-based Source Separation

The proposed framework is composed of two deep learning networks, a Query-net Q(·) and a Separator S(·). While most of the previous studies typically use S to extract a single class source from a mixture, we aim to separate the mixture by manipulating an additional input signal, a query. By doing so, we can expect to have control over the mixture just by choosing a different query input, which can be done either manually by the user or automatically by the system. Hence, the query signal is expected to be sampled from a sound class similar to the target signal within the mixture, but it does not have to be identical. To achieve this, Q directly encodes the query audio signal into a latent vector so that we can control the output of S by manipulating the latent space.

Q is composed of 6 strided-convolutional layers followed by a gated recurrent unit (GRU) layer. The stack of strided-convolutional layers is used to extract local features from the given query signal. Then, the extracted features are reshaped by stacking each feature map along the frequency axis. Finally, the reshaped tensor is passed into the GRU, and the last state of the GRU is used as a summary of the query signal. As we would like the encoded latent vector to carry meaningful high-level information, we designed Q to map the query into a dimension small enough compared to the dimension of the query signal. After the audio query has been encoded, the summarized information is passed into S.
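For concreteness, the following is a minimal PyTorch-style sketch of a Query-net that is consistent with the description above and with the hyperparameters later reported in Sec. 4.2. It is an illustrative reading rather than the authors' released code; the class and head names (QueryNet, to_mu, to_logvar) are our own, and the two output heads anticipate the Gaussian parameterization introduced in Sec. 3.2.2.

```python
import torch
import torch.nn as nn

class QueryNet(nn.Module):
    """Sketch of Q: six strided convolutions, a GRU summary, and Gaussian
    parameters (mean, log-variance) for the 32-dimensional latent vector."""
    def __init__(self, latent_dim=32):
        super().__init__()
        chans = [1, 32, 32, 64, 64, 128, 128]
        layers = []
        for i in range(6):
            # stride 2 on the frequency axis in every layer;
            # stride 2 on the time axis only in layers 2, 4 and 6
            stride = (2, 2) if i % 2 == 1 else (2, 1)
            layers += [nn.Conv2d(chans[i], chans[i + 1], kernel_size=4,
                                 stride=stride, padding=1),
                       nn.InstanceNorm2d(chans[i + 1]),
                       nn.ReLU()]
        self.conv = nn.Sequential(*layers)
        # assuming 513 frequency bins (1024-point STFT), six stride-2 convs leave 8 bins
        self.gru = nn.GRU(input_size=128 * 8, hidden_size=128, batch_first=True)
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)

    def forward(self, mag):                              # mag: (batch, 1, freq, time)
        h = self.conv(mag)                               # (batch, 128, freq', time')
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # stack feature maps along frequency
        _, last = self.gru(h)                            # last hidden state summarizes the query
        last = last.squeeze(0)
        return self.to_mu(last), self.to_logvar(last)
```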
S is a U-Net [13] based network, which has proven its effectiveness in many source separation studies [3, 10, 16, 18, 19]. It is a convolutional encoder and decoder with skip-connections between the layers. S takes the mixture signal and estimates a sigmoid mask to separate the mixture into a source, given the summarized information of the query from Q. To effectively pass the summary of the query signal to S, we applied two methods. First, we simply concatenated the latent vector along the channel dimension of the input mixture spectrogram, expecting the summarized information to be delivered from the start. Second, we used the adaptive instance normalization (AdaIN) technique in the decoding stage of S, which has proven to be effective in many studies for conditioning on latent vectors [2, 4]. AdaIN is done by applying two steps to each output x of a convolutional layer (before activation) in the decoder part of S. First, each i-th feature map x_i is normalized using the instance normalization technique [2]. Second, an affine transformation is applied to the normalized feature map using learned scale and bias parameters, which transform the encoded query vector z into y_s and y_b, respectively, as y_s = W_s^T z, y_b = W_b^T z, where W_s and W_b denote the trainable parameters,

AdaIN(x_i, y) = y_{s,i} · (x_i − µ(x_i)) / σ(x_i) + y_{b,i}.   (1)

The overall framework of the proposed method is illustrated in Fig. 2.
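As a sketch of the two conditioning paths (tiling-and-concatenation at the input, and AdaIN in the decoder as in Eq. 1), the snippet below shows one plausible implementation. The module and function names are ours, and details such as how the affine parameters are produced are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Eq. (1): map the latent vector z to per-channel scale and bias that
    modulate an instance-normalized decoder feature map."""
    def __init__(self, latent_dim, num_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.to_scale = nn.Linear(latent_dim, num_channels)   # y_s = W_s^T z
        self.to_bias = nn.Linear(latent_dim, num_channels)    # y_b = W_b^T z

    def forward(self, x, z):                   # x: (batch, C, F, T), z: (batch, latent_dim)
        y_s = self.to_scale(z).unsqueeze(-1).unsqueeze(-1)
        y_b = self.to_bias(z).unsqueeze(-1).unsqueeze(-1)
        return y_s * self.norm(x) + y_b        # applied before the activation

def tile_and_concat(mixture_mag, z):
    """The other conditioning path: tile z over the spatial dimensions and
    concatenate it with the mixture spectrogram along the channel axis."""
    b, _, f, t = mixture_mag.shape
    z_map = z.unsqueeze(-1).unsqueeze(-1).expand(b, z.shape[1], f, t)
    return torch.cat([mixture_mag, z_map], dim=1)
```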
3.2 Training

3.2.1 Data Sampling

We first describe how the mixture and target source are selected throughout the training phase. Let v_i be a single source sampled from the i-th source class, where i ∈ {1, 2, 3, ..., K} and K denotes the total number of source classes. We split the classes into two groups by randomly assigning each source class to group T (Target) or R (Rest) without replacement, until every class is assigned to one of the two groups. Next, we multiply a binary value α_i with v_i, where α_i is sampled from the Bernoulli distribution, α_i ∼ Bernoulli(0.5). This was done to make sure that there are not too many sources included in the mixture. After that, as a data augmentation strategy [20], we scale each source by multiplying a value β_i with v_i, where β_i is sampled from the uniform distribution, β_i ∼ U[0.25, 1.25]. Finally, the sources in each group are added to form two waveforms s_T and s_R, and the mixture m is constructed as the linear sum of s_T and s_R as follows,

m = s_T + s_R = Σ_{i∈T} (β_i · α_i · v_i) + Σ_{j∈R} (β_j · α_j · v_j).   (2)

As we use the magnitude spectrogram as the input of the modules, m, s_T, and s_R are transformed into the short-time Fourier transform (STFT) domain, which we denote with the capital letters M, S_T, and S_R, respectively. Note that we do not assume any musicality of the mixture signal; hence each class is sampled from arbitrary mixture tracks.
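A small NumPy sketch of this sampling scheme (Eq. 2), under one possible reading of the random group assignment; `sources` is a hypothetical dictionary mapping class names to equal-length waveforms.

```python
import numpy as np

def sample_mixture(sources, rng=np.random.default_rng()):
    """Draw one training mixture: each class is randomly assigned to the Target
    or Rest group, kept with probability 0.5 (alpha_i) and scaled by a random
    gain in [0.25, 1.25] (beta_i)."""
    length = len(next(iter(sources.values())))
    s_target, s_rest = np.zeros(length), np.zeros(length)
    for waveform in sources.values():
        alpha = rng.integers(0, 2)              # alpha_i ~ Bernoulli(0.5)
        beta = rng.uniform(0.25, 1.25)          # beta_i ~ U[0.25, 1.25]
        scaled = beta * alpha * waveform
        if rng.integers(0, 2):                  # assign the class to group T ...
            s_target += scaled
        else:                                   # ... or to group R
            s_rest += scaled
    return s_target + s_rest, s_target          # mixture m (Eq. 2) and target s_T
```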
3.2.2 cVAE with Latent Regressor

To design the proposed framework, we borrow the formulation of the conditional variational autoencoder (cVAE). While the latent vector z could be deterministically encoded into the latent space, in the cVAE framework z is instead sampled from a Gaussian distribution whose parameters (mean and variance) are estimated by Q. Then, S is used to reconstruct S_T given M and z ∼ Q(S_T). This is ensured by one of the two objectives of the cVAE, namely the reconstruction loss L_R. The purpose of L_R is to guarantee that the output of S is dependent on the encoded latent vector as follows,

L_R = E_{S_T ∼ p(S_T), M ∼ p(M), z ∼ Q(S_T)} [ ||S_T − S(M, z)||_1 ].   (3)

Note that the latent vector z is sampled using the re-parameterization trick to allow backpropagation during training [5]. Next, a KL-divergence loss is used to make the distribution of z close to the Gaussian distribution N(0, I), which guarantees that we can sample z at test time,

L_KL = E_{S_T ∼ p(S_T)} [ D_KL(Q(S_T) || N(0, I)) ].   (4)

Apart from the cVAE framework, we also adopted the latent regressor used in [24] to enforce the output of S to be more dependent on the latent vector. First, a random vector z is drawn from the prior Gaussian distribution N(0, I) and passed to S. Then, S produces a reasonable output reflecting the information in the random vector. Finally, Q is reused to restore the random vector from the output of S. Note that, unlike Eqs. 3 and 4, only the mean values (µ) are taken from Q as a point estimate of z,

L_latent = E_{M ∼ p(M), z ∼ p(z)} [ ||z − Q(S(M, z))||_1 ].   (5)

Finally, the total loss can be written as follows,

L_Total = λ_R L_R + λ_KL L_KL + λ_latent L_latent.   (6)
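The three objectives can be combined as in the following minimal PyTorch sketch. It assumes the (mean, log-variance) Query-net interface sketched in Sec. 3.1 and a Separator callable as separator(mixture, z); the λ values are those reported in Sec. 4.2.

```python
import torch
import torch.nn.functional as F

def training_loss(query_net, separator, mixture, target,
                  lambda_r=10.0, lambda_kl=0.01, lambda_latent=0.5):
    """One evaluation of the cVAE-style objective (Eqs. 3-6), as a sketch."""
    mu, logvar = query_net(target)                       # parameters of Q(S_T)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)                 # re-parameterization trick

    # Eq. 3: L1 reconstruction of the target source from the mixture and z
    loss_r = F.l1_loss(separator(mixture, z), target)

    # Eq. 4: KL divergence between Q(S_T) and N(0, I)
    loss_kl = -0.5 * torch.mean(
        torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))

    # Eq. 5: latent regressor -- draw z from the prior, separate, re-encode,
    # and compare only the predicted mean against the drawn vector
    z_prior = torch.randn_like(mu)
    mu_rec, _ = query_net(separator(mixture, z_prior))
    loss_latent = F.l1_loss(mu_rec, z_prior)

    # Eq. 6: weighted total
    return lambda_r * loss_r + lambda_kl * loss_kl + lambda_latent * loss_latent
```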

3.3 Test

During the training phase, S was trained to separate the target source by using the target source itself as the query, as in Eq. 3. In the test phase, however, the target source to be separated from the mixture is unknown. Hence, the target source and the query can no longer be the same. Nevertheless, since we designed the output dimension of Q to be small enough, the latent vector z is trained to carry high-level information such as the instrument class. In the test phase, therefore, we can utilize this property in many ways. For example, when the user wants to separate a specific source in the mixture, it is possible to collect a small amount of audio samples that have similar characteristics to, but are not exactly the same as, the source of interest. Then, the user can extract that specific source by feeding the collected audio samples into the Query-net and passing the summarized information to the Separator.

Apart from the query-dependent approach, we can also take the average of the latent vectors of each source class in the training set and use it as a representative latent vector that reflects the general characteristics of a single class.
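A short sketch of this query-free variant: average the Query-net means over the training segments of one class and feed the result to the Separator. The interfaces are the ones assumed in the earlier sketches, not the authors' API.

```python
import torch

def class_mean_latent(query_net, spectrograms):
    """Average the (mean) latent vectors of all training segments of one class,
    giving a representative vector that can replace an explicit query."""
    with torch.no_grad():
        mus = [query_net(s.unsqueeze(0))[0] for s in spectrograms]
    return torch.cat(mus, dim=0).mean(dim=0, keepdim=True)

# separation without a query, e.g. for vocals:
# estimate = separator(mixture_mag, z_vocals_mean)
```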

4. EXPERIMENT

4.1 Dataset

We trained our network with the MUSDB18 dataset. The dataset consists of 100 tracks for the training set and 50 tracks for the test set, and each track is recorded in 44.1 kHz, stereo format. The dataset provides the mixture and coarsely defined labels for the sources, namely 'vocals', 'drums', 'bass' and 'other'. The class 'other' includes every instrument other than 'vocals', 'drums' and 'bass', making it the most coarsely defined class. We resampled the audio to 22050 Hz and divided each track into 3-second segments. The magnitude spectrogram was obtained by applying the STFT with a window size of 1024 and 75% overlap. To restore the audio from the output, the inverse STFT is applied using the phase of the mixture. We evaluated our method on the test set of MUSDB18 using the official museval package (https://sigsep.github.io/sigsep-mus-eval), which computes the signal-to-distortion ratio (SDR) as a quantitative measurement.
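A minimal preprocessing sketch matching the numbers above (22,050 Hz, 3-second segments, 1024-sample window, 75% overlap), written with librosa rather than the authors' pipeline; function names and the reconstruction helper are our own.

```python
import librosa
import numpy as np

def segments_to_spectrograms(path, sr=22050, seg_seconds=3.0, n_fft=1024):
    """Load a track, resample to 22.05 kHz, cut it into 3-second segments, and
    compute magnitude spectrograms with a 1024-sample window and 75% overlap
    (hop of 256 samples)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    hop = n_fft // 4                                  # 75% overlap
    seg_len = int(sr * seg_seconds)
    mags = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        spec = librosa.stft(seg, n_fft=n_fft, hop_length=hop)
        mags.append(np.abs(spec))                     # (513, T) magnitude arrays
    return mags

def reconstruct(masked_mag, mixture_phase, n_fft=1024):
    """Inverse STFT using the separated magnitude and the mixture phase."""
    return librosa.istft(masked_mag * np.exp(1j * mixture_phase),
                         hop_length=n_fft // 4)
```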
4.2 Experiment Details

The following are the experimental details of our method. Q consists of 6 strided-convolutional layers with a 4 × 4 filter size, and the number of output channels for each layer is 32, 32, 64, 64, 128 and 128, respectively. Every strided-convolutional layer has a stride size of 2 along the frequency axis, and only the second, fourth and sixth layers have a stride size of 2 along the time axis. After every convolutional operation, we used instance normalization and ReLU. We used a GRU with 128 units. The length of the query segment was fixed to 3 seconds in every experiment. For S, the encoder part consists of 9 strided-convolutional layers and the decoder part consists of the same number of strided-deconvolutional layers, with a filter size of 4 × 4. The number of output channels for the first, second, and third layers is 64, 128, and 256, respectively, and 512 for the rest of the layers. Every layer has a stride size of 2 along the frequency axis, and the stride size along the time axis is set to 2 for every layer except the first layer of the encoder and the last layer of the decoder.

The dimension of the latent vector was set to 32 and the batch size was set to 5. The coefficients in Eq. 6 were set to λ_R = 10, λ_KL = 0.01, λ_latent = 0.5. The initial learning rate was set to 0.0002, and after 200,000 iterations the rate was decreased to 5 × 10⁻⁶ every 10,000 iterations. We used the Adam optimizer with β_1 = 0.5, β_2 = 0.999.

4.3 Manually Targeting a Specific Sound Source

To validate that our method captures the characteristics of the audio given in the query and separates them accordingly, we conducted an experiment of separating specific instruments. As shown in Fig. 3, audio queries of hi-hat and piano were given to the mixtures of (hi-hat + kick drum + bass) and (piano + electric guitar), respectively. Queries and mixtures were not from the training set, and neither query was sampled from its mixture. We can observe in the hi-hat separation result that the kick drum and the bass, which lie in the low-frequency band, were mostly removed while the broadband components of the hi-hat remained. The result of the piano separation is not as clear as in the case of the hi-hat, but we can see the guitar was removed considerably.

Figure 3. Results of manually targeting specific sound sources. The first row shows the separation results of hi-hat from the mixture of hi-hat, kick drum and bass. The second row shows the separation results of piano from the mixture of electric guitar and piano. It is worth noting that the network was never trained to separate only a hi-hat component from the 'drums' class nor piano from the 'other' class.

The noticeable fact is that we trained our method only with the MUSDB18 dataset, which has no hierarchical class label information besides the coarsely defined source labels 'vocals', 'drums', 'bass' and 'other'. Under the class definitions of the dataset, hi-hat and kick drum are grouped into 'drums', and piano and electric guitar into 'other'. Although our method was never trained to separate these subclasses from the mixture, it was able to separate hi-hat and piano from the mixture, which can be referred to as zero-shot separation. These results indicate the proposed method can be well applied to audio query-based separation.

4.4 Latent Interpolation

Furthermore, we conducted a latent interpolation experiment using the mean vector of each source. The mean vector of each source was computed by averaging the latent vectors of that source in the training set, z_c = (1/N_c) Σ_i Q(S_c,i), where S_c,i denotes the i-th 3-second magnitude spectrogram in the sound class c and N_c denotes the number of segments in class c. For the interpolation method, we used the spherical linear interpolation (Slerp) introduced in [23],

Slerp(z_1, z_2; α) = (sin((1 − α)θ) / sin θ) · z_1 + (sin(αθ) / sin θ) · z_2,   (7)

where α denotes the weight of the interpolation and θ denotes the angle between z_1 and z_2. As shown in Fig. 4, we interpolated between the mean vectors of sound sources, drums (z_drums) → bass (z_bass) and vocals (z_vocals) → drums (z_drums). We can see the ratio of the separated instruments changes as the weight α changes. These experimental results show that our method can generate continuous outputs just by manipulating the latent space.

Figure 4. Results of the mean vector interpolation. The first row shows the interpolation results between vocals and drums. The second row shows the interpolation results between drums and bass.
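The spherical interpolation of Eq. 7 can be implemented in a few lines of NumPy; the names z_vocals, z_drums and separator in the usage comment are assumptions carried over from the earlier sketches.

```python
import numpy as np

def slerp(z1, z2, alpha):
    """Spherical linear interpolation between two latent vectors (Eq. 7)."""
    cos_theta = np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return (1 - alpha) * z1 + alpha * z2      # vectors are (nearly) parallel
    return (np.sin((1 - alpha) * theta) * z1 + np.sin(alpha * theta) * z2) / np.sin(theta)

# e.g. sweep from the mean 'vocals' vector to the mean 'drums' vector:
# for alpha in np.linspace(0, 1, 5):
#     estimate = separator(mixture_mag, slerp(z_vocals, z_drums, alpha))
```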

4.5 Effects of Latent Vector on Performance

This subsection investigates the performance change obtained by varying the latent vector and examines in which situations we can achieve a performance improvement. For the experiment, we first obtained the mean vector of each vocal track from the entire dataset as z_i = (1/N_i) Σ_j Q(S_i,j), where i denotes the i-th vocal track, j denotes the j-th segment in the i-th vocal track, and N_i denotes the number of segments in the i-th vocal track. Then, we obtained the mean vector of the vocal tracks from the training set, z_mean = (1/100) Σ_{i ∈ training} z_i. Finally, we retrieved the latent vector z_ret_k of the training-set vocal track that has the closest cosine distance (CD) to the k-th test vocal track z_test_k = z_k, k ∈ test, as follows,

k̃ = argmin_{i ∈ training} CD(z_i, z_test_k),   z_ret_k = z_k̃.   (8)

We compare the performance of two cases where the goal is to separate the k-th vocal track of the test set. The first case uses z_mean to separate the target source, Ŝ_mean = S(M, z_mean). The second case uses z_ret_k to separate the target source, Ŝ_ret_k = S(M, z_ret_k). We defined the performance improvement in terms of SDR as follows,

∆SDR = SDR(S_GT_k, Ŝ_ret_k) − SDR(S_GT_k, Ŝ_mean),   (9)

where S_GT_k denotes the k-th ground-truth vocal track of the test set. To measure the distance between latent vectors we used the cosine distance, CD(z_1, z_2) = 1 − (z_1 / ||z_1||_2) · (z_2 / ||z_2||_2), and defined the cosine distance difference between (z_test_k, z_ret_k) and (z_test_k, z_mean) as follows,

∆CD = CD(z_test_k, z_mean) − CD(z_test_k, z_ret_k).   (10)

Figure 5. Illustration of the two ∆CD cases. (a) shows the positive ∆CD case, where we assume that the performance should be improved. (b) shows the negative ∆CD case, where we assume that the performance should be worsened.

Fig. 5 illustrates the two possible cases of using z_ret_k. (a) shows the positive ∆CD case, where we assume a positive effect on the performance (∆SDR > 0). In this case, we expect the performance to be improved, since z_ret_k is expected to contain information closer to z_test_k than z_mean. (b) shows the negative ∆CD case, where we assume a negative effect on the performance (∆SDR < 0). In this case, we expect the performance to be worsened, as the system could not retrieve a z_ret_k that is close enough to z_test_k. To empirically support our assumption, we show the relationship between ∆SDR and ∆CD in Fig. 6. We can observe that the closer the vector gets to the targeted ground-truth vector, the larger the performance gain becomes, reinforcing our assumption that better performance can be achieved if we can obtain latent vectors closer to the target latent vector.

Figure 6. The relationship between SDR improvement (∆SDR) and cosine distance difference (∆CD) in vocal tracks.
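The cosine distance and the retrieval rule of Eq. 8 translate directly into NumPy, as in this sketch (train_latents is a hypothetical list of per-track mean latent vectors).

```python
import numpy as np

def cosine_distance(z1, z2):
    """CD(z1, z2) = 1 minus cosine similarity, as defined in Sec. 4.5."""
    return 1.0 - np.dot(z1 / np.linalg.norm(z1), z2 / np.linalg.norm(z2))

def retrieve_latent(z_test, train_latents):
    """Eq. 8: pick the training-track latent with the smallest cosine distance."""
    dists = [cosine_distance(z, z_test) for z in train_latents]
    return train_latents[int(np.argmin(dists))]

# Eq. 10: positive when the retrieved vector is closer to z_test than the class mean
# delta_cd = cosine_distance(z_test, z_mean) - cosine_distance(z_test, z_ret)
```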

882
Proceedings of the 20th ISMIR Conference, Delft, Netherlands, November 4-8, 2019

way, which we refer to as an iterative method. The itera- Vocals Drums Bass Other
tive method is done as follows. First, we separate the tar- STL2 [16] 3.25 4.22 3.21 2.25
get source using the mean vector of certain sound class WK [22] 3.76 4.00 2.94 2.43
zmean . Then, we re-encode the separated source into a RGT1 [12] 3.85 3.44 2.70 2.63
latent space expecting the re-encoded latent vector to be JY3 [6] 5.74 4.66 3.67 3.40
closer to the target latent vector. Finally, we separate the UHL2 [20] 5.93 5.92 5.03 4.19
target source using the re-encoded latent vector. We verify TAK1 [18] 6.60 6.43 5.16 4.15
the effect of the proposed iterative method and show that Ours (mean) 4.90 4.34 3.09 3.16
it can be helpful under the harsh condition where the tar- Ours (GT) 5.48 4.59 3.45 3.26
get sources are far from generic class. The results (Single
step → Iterative) are as follows, ‘vocals’: 4.84 → 4.90,
‘drums’: 4.31 → 4.34, ‘bass’: 3.11 → 3.09, and ‘other’: Table 1. Median scores of SDR for the MUSDB18 dataset.
2.97 → 3.16. We can see the iterative method noticeably
improves the performance in ‘vocals’ and ‘other’. On the ing the sources of distinctive characteristics.
other hand, the differences are not significant in drums and
bass.
We looked into the tracks which gained significant im- 4.7 Algorithm Comparison
provement in terms of SDR in vocals. ‘Timboz - Pony’ and
‘Hollow Ground - Ill Fate’ gained more than 0.5dB in SDR In this subsection, we compare our method to other meth-
through the iterative method. We found the results intuitive ods with the evaluation result of the MUSDB18 dataset. As
as the vocals in the two songs feature a growling technique stated above, our method’s output is dependent on the en-
from heavy metal genres, which can be considered distant coded latent vector from a query. For the comparison with
from the general characteristics of vocals. other methods that do not require a query, therefore, we
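A sketch of the iterative procedure described above, reusing the Query-net and Separator interfaces assumed in the earlier sketches; a single iteration corresponds to the 'iterative' results reported here.

```python
def iterative_separation(separator, query_net, mixture_mag, z_mean, n_iters=1):
    """Separate with the class mean vector, re-encode the estimate as a new
    query (taking the predicted mean as a point estimate), and separate again."""
    z = z_mean
    estimate = separator(mixture_mag, z)
    for _ in range(n_iters):
        z, _ = query_net(estimate)          # re-encode the current estimate
        estimate = separator(mixture_mag, z)
    return estimate
```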
Figure 7. t-SNE visualization of encoded latent vectors from each source in the test dataset. Red points denote the vectors of the tracks which gained more than 0.4 dB in terms of SDR by the iterative method.

To verify our assumptions, we divided each source of the test set into segments and converted them into latent vectors. We divided the encoded vectors into two groups: the ones which gained more than 0.4 dB in terms of SDR by the iterative method and the ones that did not. Then, we visualized the encoded vectors using t-SNE (results shown in Fig. 7). The red dots in Fig. 7 represent the latent vectors from the group that showed a significant SDR improvement of more than 0.4 dB. Although some vectors lie around the center, most of them are located far from the center. These vectors can be regarded as outliers, and the results show that our iterative method is effective when it comes to separating sources with distinctive characteristics.

              Vocals  Drums  Bass   Other
STL2 [16]      3.25    4.22   3.21   2.25
WK [22]        3.76    4.00   2.94   2.43
RGT1 [12]      3.85    3.44   2.70   2.63
JY3 [6]        5.74    4.66   3.67   3.40
UHL2 [20]      5.93    5.92   5.03   4.19
TAK1 [18]      6.60    6.43   5.16   4.15
Ours (mean)    4.90    4.34   3.09   3.16
Ours (GT)      5.48    4.59   3.45   3.26

Table 1. Median scores of SDR (dB) for the MUSDB18 dataset.

4.7 Algorithm Comparison

In this subsection, we compare our method to other methods using the evaluation results on the MUSDB18 dataset. As stated above, our method's output is dependent on the latent vector encoded from a query. For the comparison with other methods that do not require a query, therefore, we used the mean vector in the latent space encoded from the training samples for each source; i.e., we ended up using four mean latent vectors for 'vocals', 'drums', 'bass', and 'other', respectively. Additionally, to show the upper bound of our proposed method, we used the encoded latent vector of the ground truth (GT) signal from the test set. Note also that the separation is done with a single network. Table 1 shows the median SDR scores of the methods reported in SiSEC2018 [17], including our method denoted as Ours. Although the proposed algorithm did not achieve the best performance, the results show that it is comparable to other deep learning-based models that are dedicated to separating just the four sources in the dataset. This means that our method is not limited to query-based separation, but can also be used for general music source separation just like other conventional methods. Additionally, there is room for improvement: applying a multi-channel Wiener filter and/or using another architecture besides U-Net for the separator could be such an option.

5. CONCLUSION

In this study, we presented a novel framework, consisting of a Query-net and a Separator, for audio query-based music source separation. Experimental results showed that our method is scalable, as the Query-net directly encodes an audio query into a latent space. The latent space is interpretable, as was shown by the t-SNE visualization and latent interpolation experiments. Furthermore, we have introduced various utilities of the proposed framework, including manual and automated approaches, showing the promise of audio query-based source separation. As future work, we plan to investigate more adequate conditioning methods for audio and better neural architectures for performance improvement.


6. ACKNOWLEDGEMENT

This work was supported partly by Kakao and Kakao Brain corporations, and partly by an Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01367, Infant-Mimic Neurocognitive Developmental Machine Learning from Interaction Experience with Real World (BabyMind)).

7. REFERENCES

[1] John R. Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 31–35. IEEE, 2016.

[2] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.

[3] Andreas Jansson, Eric J. Humphrey, Nicola Montecchio, Rachel M. Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017, pages 745–751, 2017.

[4] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.

[5] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[6] Jen-Yu Liu and Yi-Hsuan Yang. Denoising auto-encoder with recurrent skip connections and residual regression for music source separation. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 773–778. IEEE, 2018.

[7] Antoine Liutkus, Fabian-Robert Stöter, Zafar Rafii, Daichi Kitamura, Bertrand Rivet, Nobutaka Ito, Nobutaka Ono, and Julie Fontecave. The 2016 signal separation evaluation campaign. In Petr Tichavský, Massoud Babaie-Zadeh, Olivier J.J. Michel, and Nadège Thirion-Moreau, editors, Latent Variable Analysis and Signal Separation - 12th International Conference, LVA/ICA 2015, Liberec, Czech Republic, August 25-28, 2015, Proceedings, pages 323–332, Cham, 2017. Springer International Publishing.

[8] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[9] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.

[10] Sungheon Park, Taehoon Kim, Kyogu Lee, and Nojun Kwak. Music source separation using stacked hourglass networks. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018, pages 289–296, 2018.

[11] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation, December 2017.

[12] Gerard Roma, Owen Green, and Pierre Alexandre Tremblay. Improving single-network single-channel separation of musical audio with convolutional layers. In International Conference on Latent Variable Analysis and Signal Separation, pages 306–315. Springer, 2018.

[13] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[14] P. Seetharaman, G. Wichern, S. Venkataramani, and J. Le Roux. Class-conditional embeddings for music source separation. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 301–305, May 2019.

[15] O. Slizovskaia, L. Kim, G. Haro, and E. Gomez. End-to-end sound source separation conditioned on instrument labels. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 306–310, May 2019.

[16] Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018, pages 334–340, 2018.

[17] Fabian-Robert Stöter, Antoine Liutkus, and Nobutaka Ito. The 2018 signal separation evaluation campaign. In International Conference on Latent Variable Analysis and Signal Separation, pages 293–305. Springer, 2018.

[18] Naoya Takahashi, Nabarun Goswami, and Yuki Mitsufuji. MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation. In 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pages 106–110. IEEE, 2018.


[19] Naoya Takahashi and Yuki Mitsufuji. Multi-scale multi-band DenseNets for audio source separation. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 21–25. IEEE, 2017.

[20] Stefan Uhlich, Marcello Porcu, Franck Giron, Michael Enenkl, Thomas Kemp, Naoya Takahashi, and Yuki Mitsufuji. Improving music source separation based on deep neural networks through data augmentation and network blending. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 261–265. IEEE, 2017.

[21] Jun Wang, Jie Chen, Dan Su, Lianwu Chen, Meng Yu, Yanmin Qian, and Dong Yu. Deep extractor network for target speaker recovery from single channel speech mixtures. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, pages 307–311, 2018.

[22] Felix Weninger, John R. Hershey, Jonathan Le Roux, and Björn Schuller. Discriminatively trained recurrent neural networks for single-channel speech separation. In 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 577–581. IEEE, 2014.

[23] Tom White. Sampling generative networks. arXiv preprint arXiv:1609.04468, 2016.

[24] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.
