Audio Query-Based Music Source Separation

© Jie Hwan Lee, Hyeong-Seok Choi, Kyogu Lee. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Jie Hwan Lee, Hyeong-Seok Choi, Kyogu Lee. “Audio query-based music source separation”, 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.

* These authors contributed equally.
1. INTRODUCTION
Music source separation, i.e., isolating the signals of certain instruments from a mixture, has been studied intensively in recent years. Owing to improvements in deep learning techniques, various deep-learning-based approaches to music source separation have been introduced. However, most previous studies have focused mainly on improving separation performance rather than on widening the range of separable sources. To tackle this problem, a few studies have tried to separate a fixed number of sources of interest by conditioning the network on a one-hot label [14, 15].
To deal with these problems, in this paper, a novel audio query-based music source separation framework is proposed. The main idea is to directly compress the diverse audio

Figure 1. t-SNE visualization [8] of encoded latent vectors of the test dataset in MUSDB18. Without any classification loss, the Query-net is trained to output latent vectors that provide useful information about various instruments. It is observable that the latent vectors from the same class are clustered in the latent space while not being identical.
connections between the layers. S takes the mixture signal and estimates a sigmoid mask to separate the mixture into a source, given the summarized information of the query from Q. To effectively pass the summary of the query signal to S, we applied two methods. First, we simply concatenated the latent vector along the channel dimension of the input mixture spectrogram, expecting the summarized information to be delivered from the start. Second, we used the adaptive instance normalization (AdaIN) technique in the decoding stage of S, which has proven to be effective in many studies for conditioning latent vectors [2, 4]. AdaIN is done by applying two steps to each output x of a convolutional layer (before activation) of the decoder part of S. First, each i-th feature map x_i is normalized using the instance normalization technique [2]. Second, an affine transformation is applied to the normalized feature map using learned scale and bias parameters, which transform the encoded query vector z into y_s and y_b, respectively, as follows: y_s = W_s^T z, y_b = W_b^T z, where W_s and W_b denote the trainable parameters,

\mathrm{AdaIN}(x_i, y) = y_{s,i} \cdot \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}. \qquad (1)

The overall framework of the proposed method is illustrated in Fig. 2.
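To make the conditioning in Eq. (1) concrete, a minimal PyTorch-style sketch is given below. It assumes decoder feature maps of shape (batch, channels, frequency, time); the module name QueryAdaIN and the linear heads standing in for W_s and W_b are our own illustrative choices, not the authors' code.

import torch
import torch.nn as nn

class QueryAdaIN(nn.Module):
    """Condition decoder feature maps on the query latent vector z (Eq. 1).
    Hypothetical sketch: names and shapes are illustrative assumptions."""
    def __init__(self, num_channels: int, latent_dim: int):
        super().__init__()
        # Learned affine maps that turn z into per-channel scale and bias.
        self.to_scale = nn.Linear(latent_dim, num_channels)  # y_s = W_s^T z
        self.to_bias = nn.Linear(latent_dim, num_channels)   # y_b = W_b^T z

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) decoder activations (before non-linearity)
        # z: (batch, latent_dim) encoded query vector from the Query-net
        mu = x.mean(dim=(2, 3), keepdim=True)               # per-sample, per-channel mean
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-5      # per-sample, per-channel std
        y_s = self.to_scale(z).unsqueeze(-1).unsqueeze(-1)  # (batch, channels, 1, 1)
        y_b = self.to_bias(z).unsqueeze(-1).unsqueeze(-1)
        return y_s * (x - mu) / sigma + y_b                 # AdaIN(x_i, y), Eq. (1)

Following the description above, such a block would be applied to the output of each decoder convolution of S before its activation.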
3.2 Training

3.2.1 Data Sampling

We first describe how the mixture and the target source are selected throughout the training phase.
Let v_i be a single source sampled from the i-th source class, where i ∈ {1, 2, 3, ..., K} and K denotes the total number of source classes. We split the classes into two groups by randomly assigning each source class to group T (Target) or group R (Rest) without replacement, until every class is assigned to one of the two groups. Next, we multiply a binary value α_i with v_i, where α_i is sampled from the Bernoulli distribution, α_i ∼ Bernoulli(0.5). This is done to make sure that not too many sources are included in the mixture. Then, as a data augmentation strategy [20], we scale each source by multiplying a value β_i with v_i, where β_i is sampled from the uniform distribution, β_i ∼ U[0.25, 1.25]. Finally, the sources in each group are added to form two waveforms s_T and s_R, and the mixture m is constructed as the linear sum of s_T and s_R as follows,

m = s_T + s_R = \sum_{i \in T} \beta_i \cdot \alpha_i \cdot v_i + \sum_{j \in R} \beta_j \cdot \alpha_j \cdot v_j. \qquad (2)

As we use the magnitude spectrogram as the input of the modules, m, s_T, and s_R are transformed into the short-time Fourier transform (STFT) domain, which we denote with the capital letters M, S_T, and S_R, respectively. Note that we do not assume any musicality of the mixture signal; hence each class is sampled from arbitrary mixture tracks.
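As a reading aid, the sampling scheme of Eq. (2) could be sketched as follows. The function name and the per-class excerpt list are hypothetical, the waveforms are assumed to be equal-length float arrays, and the independent random assignment of each class to T or R is one possible reading of the description above, not the authors' exact procedure.

import numpy as np

def sample_training_example(sources_by_class, rng=None):
    """Sketch of the mixing scheme in Eq. (2). `sources_by_class` is an assumed
    list of K equal-length float waveform arrays, one excerpt v_i per class."""
    rng = rng or np.random.default_rng()
    s_T = np.zeros_like(sources_by_class[0])
    s_R = np.zeros_like(sources_by_class[0])
    for v in sources_by_class:
        in_target = rng.integers(0, 2)     # one reading of the random T/R split
        alpha = rng.integers(0, 2)         # alpha_i ~ Bernoulli(0.5): keep or drop
        beta = rng.uniform(0.25, 1.25)     # beta_i ~ U[0.25, 1.25]: gain augmentation
        scaled = beta * alpha * v
        if in_target:
            s_T += scaled
        else:
            s_R += scaled
    m = s_T + s_R                          # mixture, Eq. (2)
    return m, s_T, s_R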
3.2.2 cVAE with Latent Regressor

To design the proposed framework, we borrow the formulation of the conditional variational autoencoder (cVAE). While the latent vector z could be deterministically encoded into the latent space, in the cVAE framework z is instead sampled from a Gaussian distribution whose parameters (mean and variance) are estimated from Q. Then, S is used to reconstruct S_T given M and z ∼ Q(S_T). This is ensured by one of the two objectives of the cVAE, namely, the reconstruction loss L_R. The purpose of L_R is to guarantee that the output of S is dependent on the encoded latent vector, as follows,

L_R = \mathbb{E}_{S_T \sim p(S_T),\, M \sim p(M),\, z \sim Q(S_T)} \big[ \lVert S_T - S(M, z) \rVert_1 \big]. \qquad (3)

Note that, in the training phase, the latent vector z is sampled using the re-parameterization trick to allow backpropagation [5].
Next, a KL-divergence loss is used to make the distribution of z close to the Gaussian distribution N(0, I), which guarantees that z can be sampled at test time,

L_{KL} = \mathbb{E}_{S_T \sim p(S_T)} \big[ D_{KL}\big( Q(S_T) \,\Vert\, \mathcal{N}(0, I) \big) \big]. \qquad (4)

Apart from the cVAE framework, we also adopt the latent regressor used in [24] to enforce the output of S to be more dependent on the latent vector. First, a random vector z is drawn from the prior Gaussian distribution N(0, I) and passed to S. Then, S produces a reasonable output reflecting the information in the random vector. Finally, Q is reused to restore the random vector from the output of S. Note that, unlike Eqs. 3 and 4, only the mean values (µ) are taken from Q as a point estimate of z,

L_{latent} = \mathbb{E}_{M \sim p(M),\, z \sim p(z)} \big[ \lVert z - Q(S(M, z)) \rVert_1 \big]. \qquad (5)

Finally, the total loss can be written as follows,

L_{Total} = \lambda_R L_R + \lambda_{KL} L_{KL} + \lambda_{latent} L_{latent}. \qquad (6)
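For clarity, a minimal sketch of the objectives in Eqs. (3)-(6) for one training batch is shown below. Here query_net and separator are hypothetical stand-ins for Q and S, query_net is assumed to return the Gaussian mean and log-variance, and the reparameterization follows the standard VAE recipe [5]; this is an illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def training_losses(query_net, separator, M, S_T,
                    lambda_r=10.0, lambda_kl=0.01, lambda_latent=0.5):
    """One reading of Eqs. (3)-(6). query_net(spec) is assumed to return
    (mu, log_var); separator(M, z) returns the estimated target spectrogram."""
    # Reconstruction loss L_R (Eq. 3), with z ~ Q(S_T) via the reparameterization trick.
    mu, log_var = query_net(S_T)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
    L_R = F.l1_loss(separator(M, z), S_T)

    # KL loss L_KL (Eq. 4), pulling Q(S_T) towards N(0, I).
    L_KL = 0.5 * torch.mean(torch.sum(mu**2 + log_var.exp() - 1.0 - log_var, dim=1))

    # Latent regressor loss L_latent (Eq. 5): draw z from the prior, separate,
    # then re-encode; only the mean of Q is used as the point estimate.
    z_prior = torch.randn_like(mu)
    mu_rec, _ = query_net(separator(M, z_prior))
    L_latent = F.l1_loss(mu_rec, z_prior)

    # Total loss (Eq. 6).
    return lambda_r * L_R + lambda_kl * L_KL + lambda_latent * L_latent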
3.3 Test

During the training phase, S was trained to separate the target source by using the target source itself as the query, as in Eq. 3. In the test phase, however, the target source to be separated from the mixture is unknown. Hence, the target source and the query can no longer be the same. Nevertheless, since we designed the output dimension of Q to be small, the latent vector z is trained to carry high-level information such as the instrument class. In the test phase, therefore, we can utilize this property in many ways. For example, when the user wants to separate a specific source in the mixture, it is possible to collect a small amount of audio samples that have similar characteristics to, but are not exactly the same as, the source of interest. Then, the user can extract that specific source by feeding the collected audio samples into the Query-net and passing the summarized information to the Separator.
Apart from this query-dependent approach, we can also take the average of the latent vectors of each source class in the training set and use it as a representative latent vector that reflects the general characteristics of a single class.
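A corresponding test-time sketch, under the same assumptions about the hypothetical query_net and separator stand-ins, might look like this:

import torch

@torch.no_grad()
def separate_with_queries(query_net, separator, mixture_spec, query_specs):
    """Separate a target from `mixture_spec` given a few query excerpts that
    resemble the source of interest. Only the mean of Q is used at test time."""
    latents = [query_net(q)[0] for q in query_specs]   # take mu from Q for each query
    z = torch.stack(latents).mean(dim=0)               # summarize the collected queries
    return separator(mixture_spec, z)                  # estimated target spectrogram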
4. EXPERIMENT
4.1 Dataset
We trained our network with the MUSDB18 dataset. The dataset consists of 100 tracks for the training set and 50 tracks for the test set, and each track is recorded in 44.1 kHz stereo format. The dataset provides the mixture and coarsely defined labels for the sources, namely ‘vocals’, ‘drums’, ‘bass’ and ‘other’. The class ‘other’ includes every instrument other than ‘vocals’, ‘drums’ and ‘bass’, making it the most coarsely defined class. We resampled the audio to 22050 Hz and divided each track into 3-second segments. The magnitude spectrogram was obtained by applying the STFT with a window size of 1024 and 75% overlap. To restore the audio from the output, the inverse STFT is applied using the phase of the mixture. We evaluated our method on the test set of MUSDB18 using the official museval package 1, which computes the signal-to-distortion ratio (SDR) as a quantitative measurement.

1 https://sigsep.github.io/sigsep-mus-eval
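As a rough illustration of this pipeline (resampling, 3-second segmentation, magnitude STFT, and reconstruction with the mixture phase), a librosa-based sketch is given below. The paper specifies 75% overlap, so a hop size of 256 samples is assumed, and the function names are our own.

import librosa
import numpy as np

SR, SEG_SECONDS, N_FFT, HOP = 22050, 3, 1024, 256  # 75% overlap -> hop = 1024 / 4

def preprocess(wav, orig_sr):
    """Resample, cut into 3-second segments, and return magnitude/phase STFTs."""
    wav = librosa.resample(wav, orig_sr=orig_sr, target_sr=SR)
    seg_len = SR * SEG_SECONDS
    segments = [wav[i:i + seg_len] for i in range(0, len(wav) - seg_len + 1, seg_len)]
    specs = [librosa.stft(s, n_fft=N_FFT, hop_length=HOP) for s in segments]
    return [np.abs(S) for S in specs], [np.angle(S) for S in specs]

def reconstruct(est_magnitude, mixture_phase):
    """Restore a waveform from an estimated magnitude using the mixture phase."""
    return librosa.istft(est_magnitude * np.exp(1j * mixture_phase), hop_length=HOP)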
4.2 Experiment Details

The following are the experimental details of our method. Q consists of 6 strided-convolutional layers with a 4 × 4 filter size, and the number of output channels for each layer is 32, 32, 64, 64, 128 and 128, respectively. Every strided-convolutional layer has a stride size of 2 along the frequency axis, and only the second, fourth and sixth layers have a stride size of 2 along the time axis. After every convolutional operation, we used instance normalization and ReLU. We used a GRU with 128 units. The length of the query segment was fixed to 3 seconds in every experiment. For S, the encoder part consists of 9 strided-convolutional layers and the decoder part consists of the same number of strided-deconvolutional layers, with a filter size of 4 × 4. The number of output channels for the first, second, and third layers is 64, 128, and 256, respectively, and 512 for the rest of the layers. Every layer has a stride size of 2 along the frequency axis, and the stride size along the time axis is set to 2 for every layer except the first layer of the encoder and the last layer of the decoder.
The dimension of the latent vector was set to 32 and the batch size was set to 5. The coefficients in Eq. 6 were set to λ_R = 10, λ_KL = 0.01, λ_latent = 0.5. The initial learning rate was set to 0.0002 and, after 200000 iterations, the rate was decreased to 5 × 10^-6 every 10000 iterations. We used the Adam optimizer with β_1 = 0.5, β_2 = 0.999.
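A hedged PyTorch sketch of Q following the description above is shown below. Details the text does not pin down, such as how the convolutional features are pooled into the GRU and how the 32-dimensional mean and log-variance are produced, are our own assumptions rather than the authors' design.

import torch
import torch.nn as nn

class QueryNet(nn.Module):
    """Sketch of Q: 6 strided 4x4 convolutions (instance norm + ReLU), a 128-unit
    GRU over time, and linear heads for the 32-dim mean and log-variance.
    Pooling and head design are assumptions, not taken from the paper."""
    def __init__(self, latent_dim=32):
        super().__init__()
        chans = [1, 32, 32, 64, 64, 128, 128]
        layers = []
        for i in range(6):
            # Stride 2 on frequency for all layers; stride 2 on time for layers 2, 4, 6.
            stride = (2, 2) if i % 2 == 1 else (2, 1)
            layers += [nn.Conv2d(chans[i], chans[i + 1], kernel_size=4,
                                 stride=stride, padding=1),
                       nn.InstanceNorm2d(chans[i + 1]),
                       nn.ReLU(inplace=True)]
        self.conv = nn.Sequential(*layers)
        self.gru = nn.GRU(input_size=128, hidden_size=128, batch_first=True)
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_log_var = nn.Linear(128, latent_dim)

    def forward(self, spec):
        # spec: (batch, 1, freq, time) magnitude spectrogram of the query
        h = self.conv(spec)                  # (batch, 128, freq', time')
        h = h.mean(dim=2).transpose(1, 2)    # assumed frequency pooling -> (batch, time', 128)
        _, h_last = self.gru(h)              # final hidden state summarizes the query
        h_last = h_last.squeeze(0)           # (batch, 128)
        return self.to_mu(h_last), self.to_log_var(h_last)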
Figure 3. Results of manually targeting specific sound sources. The first row shows the separation results of hi-hat from the mixture of hi-hat, kick drum and bass. The second row shows the separation results of piano from the mixture of electric guitar and piano. It is worth noting that the network was never trained to separate only a hi-hat component from the ‘drums’ class, nor a piano from the ‘other’ class.

4.3 Manually Targeting a Specific Sound Source

To validate that our method captures the characteristics of the audio given in the query and separates the sources accordingly, we conducted an experiment of separating specific instruments. As shown in Fig. 3, audio queries of hi-hat and piano were given to mixtures of (hi-hat + kick drum + bass) and (piano + electric guitar), respectively. Queries and mixtures were not from the training set, and neither query was sampled from its mixture. We can observe in the hi-hat separation result that the kick drum and the bass, which lie in the low-frequency band, were mostly removed, while the broadband components of the hi-hat remained. The result of piano separation is not as clear as in the case of the hi-hat, but we can see that the guitar was removed considerably.
The noticeable fact is that we trained our method only with the MUSDB18 dataset, which has no hierarchical class label information besides the coarsely defined source labels ‘vocals’, ‘drums’, ‘bass’ and ‘other’. Under the class definitions of the dataset, hi-hat and kick drum are grouped into ‘drums’, and piano and electric guitar into ‘other’. Although our method was never trained to separate these subclasses from the mixture, it was able to separate the hi-hat and the piano, which can be referred to as zero-shot separation. These results indicate that the proposed method can be well applied to audio query-based separation.

4.4 Latent Interpolation

Furthermore, we conducted a latent interpolation experiment using the mean vector of each source. The mean vector of each source was computed by averaging the latent vectors of that source over the training set, z_c = \frac{1}{N_c} \sum_i Q(S_{c,i}), where S_{c,i} denotes the i-th 3-second magnitude spectrogram in sound class c and N_c denotes the number of segments in class c.
For the interpolation method, we used the spherical linear interpolation (Slerp) introduced in [23],

\mathrm{Slerp}(z_1, z_2; \alpha) = \frac{\sin\big((1-\alpha)\theta\big)}{\sin \theta}\, z_1 + \frac{\sin(\alpha\theta)}{\sin \theta}\, z_2, \qquad (7)

where α denotes the interpolation weight and θ denotes the angle between z_1 and z_2.
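Both the class-mean vector z_c and Eq. (7) translate directly into a few lines of numpy; the function names below are illustrative only.

import numpy as np

def class_mean(latents):
    """z_c: average of the Query-net outputs Q(S_{c,i}) for one class,
    given as an (N_c, latent_dim) array."""
    return latents.mean(axis=0)

def slerp(z1, z2, alpha):
    """Spherical linear interpolation between two latent vectors (Eq. 7)."""
    cos_theta = np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if np.isclose(theta, 0.0):              # nearly parallel vectors: fall back to lerp
        return (1.0 - alpha) * z1 + alpha * z2
    return (np.sin((1.0 - alpha) * theta) * z1 + np.sin(alpha * theta) * z2) / np.sin(theta)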
As shown in Fig. 4, we interpolated between the mean vectors of sound sources, drums (z_drums) → bass (z_bass) and vocals (z_vocals) → drums (z_drums). We can see that the ratio of the separated instruments changes as the weight α changes. These experimental results show that our method can generate continuous outputs just by manipulating the latent space.

4.5 Effects of Latent Vector on Performance

We compare the performance of two cases where the goal is to separate the k-th vocal track from the test set. The first case is to use z_mean to separate the target source, Ŝ_mean = S(M, z_mean). The second case is to use z_ret_k to separate the target source, Ŝ_ret_k = S(M, z_ret_k). We defined the performance improvement in terms of SDR as follows,

\Delta \mathrm{SDR} = \mathrm{SDR}(S_{GT_k}, \hat{S}_{ret_k}) - \mathrm{SDR}(S_{GT_k}, \hat{S}_{mean}), \qquad (9)

where S_GT_k denotes the k-th ground-truth vocal track from the test set. To measure the distance between latent vectors we used the cosine distance, CD(z_1, z_2) = 1 − (z_1 / ||z_1||_2) · (z_2 / ||z_2||_2), and defined the cosine distance difference between (z_test_k, z_ret_k) and (z_test_k, z_mean) as follows,
way, which we refer to as an iterative method. The iterative method is done as follows. First, we separate the target source using the mean vector of a certain sound class, z_mean. Then, we re-encode the separated source into the latent space, expecting the re-encoded latent vector to be closer to the target latent vector. Finally, we separate the target source using the re-encoded latent vector. We verify the effect of the proposed iterative method and show that it can be helpful under the harsh condition where the target sources are far from the generic class. The results (Single step → Iterative) are as follows: ‘vocals’: 4.84 → 4.90, ‘drums’: 4.31 → 4.34, ‘bass’: 3.11 → 3.09, and ‘other’: 2.97 → 3.16. We can see that the iterative method noticeably improves the performance for ‘vocals’ and ‘other’. On the other hand, the differences are not significant for ‘drums’ and ‘bass’.
We looked into the tracks which gained significant improvement in terms of SDR for vocals. ‘Timboz - Pony’ and ‘Hollow Ground - Ill Fate’ gained more than 0.5 dB in SDR through the iterative method. We found the results intuitive, as the vocals in the two songs feature a growling technique from heavy metal genres, which can be considered distant from the general characteristics of vocals.
To verify our assumptions, we divided each source of the test set into segments and converted them into latent vectors. We divided the encoded vectors into two groups: the ones which gained more than 0.4 dB in terms of SDR by the iterative method and the ones that did not. Then, we visualized the encoded vectors using t-SNE (results shown in Fig. 7). The red dots in Fig. 7 represent the latent vectors from the group that showed a significant SDR improvement of more than 0.4 dB. Although some vectors lie around the center, most of them are located far from the center. These vectors can be inferred to be outliers, and the results show that our iterative method is effective when it comes to separating sources with distinctive characteristics.
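The iterative refinement described above might be sketched as follows, reusing the hypothetical query_net and separator stand-ins for Q and S; as in the latent regressor, only the mean of Q is used when re-encoding.

import torch

@torch.no_grad()
def iterative_separation(query_net, separator, M, z_mean, num_iters=1):
    """Single-step separation with the class mean vector, followed by
    re-encoding the estimate and separating again (one reading of the text)."""
    z = z_mean
    estimate = separator(M, z)            # single-step result with z_mean
    for _ in range(num_iters):
        z, _ = query_net(estimate)        # re-encode: take the mean of Q as the new query
        estimate = separator(M, z)        # separate again with the refined latent vector
    return estimate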
Figure 7. t-SNE visualization of encoded latent vectors from each source in the test dataset. Red points denote the vectors of the tracks which gained more than 0.4 dB in terms of SDR by the iterative method.

             Vocals  Drums  Bass   Other
STL2 [16]     3.25    4.22   3.21   2.25
WK [22]       3.76    4.00   2.94   2.43
RGT1 [12]     3.85    3.44   2.70   2.63
JY3 [6]       5.74    4.66   3.67   3.40
UHL2 [20]     5.93    5.92   5.03   4.19
TAK1 [18]     6.60    6.43   5.16   4.15
Ours (mean)   4.90    4.34   3.09   3.16
Ours (GT)     5.48    4.59   3.45   3.26

Table 1. Median SDR scores (in dB) for the MUSDB18 dataset.

4.7 Algorithm Comparison

In this subsection, we compare our method to other methods using the evaluation results on the MUSDB18 dataset. As stated above, our method’s output depends on the latent vector encoded from a query. For the comparison with other methods that do not require a query, therefore, we used the mean vector in the latent space encoded from the training samples of each source, i.e., we ended up using four mean latent vectors, one for each of ‘vocals’, ‘drums’, ‘bass’, and ‘other’. Additionally, to show the upper bound of our proposed method, we used the encoded latent vector of the ground-truth (GT) signal from the test set. Note also that the separation is done with a single network.
Table 1 shows the median SDR scores of the methods reported in SiSEC2018 [17], including our method, denoted as Ours. Although the proposed algorithm did not achieve the best performance, the results show that it is comparable to the other deep learning-based models that are dedicated to separating just the four sources in the dataset. This means that our method is not limited to query-based separation but can also be used for general music source separation, just like other conventional methods. Additionally, there is room for improvement: applying the multi-channel Wiener filter and/or using another architecture than the U-net for the separator could be such options.

5. CONCLUSION

In this study, we presented a novel framework, consisting of a Query-net and a Separator, for audio query-based music source separation. Experimental results showed that our method is scalable, as the Query-net directly encodes an audio query into a latent space. The latent space is interpretable, as was shown by the t-SNE visualization and the latent interpolation experiments. Furthermore, we have introduced various utilities of the proposed framework, including manual and automated approaches, showing the promise of audio query-based source separation. As future work, we plan to investigate more adequate conditioning methods for audio and better neural architectures for performance improvement.
6. ACKNOWLEDGEMENT

This work was supported partly by Kakao and Kakao Brain corporations, and partly by an Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01367, Infant-Mimic Neurocognitive Developmental Machine Learning from Interaction Experience with Real World (BabyMind)).

7. REFERENCES

[1] John R. Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 31–35. IEEE, 2016.

[2] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.

[3] Andreas Jansson, Eric J. Humphrey, Nicola Montecchio, Rachel M. Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017, pages 745–751, 2017.

[4] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.

[5] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[6] Jen-Yu Liu and Yi-Hsuan Yang. Denoising auto-encoder with recurrent skip connections and residual regression for music source separation. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 773–778. IEEE, 2018.

[7] Antoine Liutkus, Fabian-Robert Stöter, Zafar Rafii, Daichi Kitamura, Bertrand Rivet, Nobutaka Ito, Nobutaka Ono, and Julie Fontecave. The 2016 signal separation evaluation campaign. In Petr Tichavský, Massoud Babaie-Zadeh, Olivier J.J. Michel, and Nadège Thirion-Moreau, editors, Latent Variable Analysis and Signal Separation - 12th International Conference, LVA/ICA 2015, Liberec, Czech Republic, August 25-28, 2015, Proceedings, pages 323–332, Cham, 2017. Springer International Publishing.

[8] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[9] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.

[10] Sungheon Park, Taehoon Kim, Kyogu Lee, and Nojun Kwak. Music source separation using stacked hourglass networks. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018, pages 289–296, 2018.

[11] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation, December 2017.

[12] Gerard Roma, Owen Green, and Pierre Alexandre Tremblay. Improving single-network single-channel separation of musical audio with convolutional layers. In International Conference on Latent Variable Analysis and Signal Separation, pages 306–315. Springer, 2018.

[13] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[14] P. Seetharaman, G. Wichern, S. Venkataramani, and J. Le Roux. Class-conditional embeddings for music source separation. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 301–305, May 2019.

[15] O. Slizovskaia, L. Kim, G. Haro, and E. Gomez. End-to-end sound source separation conditioned on instrument labels. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 306–310, May 2019.

[16] Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018, pages 334–340, 2018.

[17] Fabian-Robert Stöter, Antoine Liutkus, and Nobutaka Ito. The 2018 signal separation evaluation campaign. In International Conference on Latent Variable Analysis and Signal Separation, pages 293–305. Springer, 2018.

[18] Naoya Takahashi, Nabarun Goswami, and Yuki Mitsufuji. MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation. In 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pages 106–110. IEEE, 2018.

[21] Jun Wang, Jie Chen, Dan Su, Lianwu Chen, Meng Yu, Yanmin Qian, and Dong Yu. Deep extractor network for target speaker recovery from single channel speech mixtures. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, pages 307–311, 2018.