Zui Chen1*, Yansen Jing1*, Shengcheng Yuan2*, Yifei Xu2, Jian Wu2 and Hang Zhao1
1 Institute for Interdisciplinary Information Sciences, Tsinghua University
2 Beijing DeepMusic Technology Co., Ltd
{chenzui19, jingys19}@mails.tsinghua.edu.cn, {yuansc,xuyf}@lazycomposer.com,
wujian7752@163.com, Zhaohang0124@gmail.com
arXiv:2205.03043v1 [cs.SD] 6 May 2022
Figure 4: Illustration of different sound timbres having different harmonic features. From left to right, the audio clips are obtained from a sine wave, a piano, and a violin, respectively. The audio is transformed into Mel-spectrograms. The multi-stripe pattern caused by the harmonics is clearly visible in these acoustic scenes.

Considering relationships among harmonics as different timbres, downstream networks (in this case, the synthesizer parameter classifier) can leverage these harmonic features to improve performance. PDC constructs a sparse filter by extending the concept of dilated convolution to accurately reach all the integer harmonics in a log-scale spectrogram. Unlike regular dilated convolution, which has a fixed dilation step, PDC's dilated locations are not evenly distributed, since the distance between two harmonics in a spectrogram is not constant. PDC reaches all integer harmonics not at once, but by stacking itself. We apply the mathematical rule of prime factorization, decomposing an integer into the product of a few prime numbers. These primes are further decomposed into products of integer ratios between 1 and 2. PDC's dilated locations are then built according to these ratios. In this way, PDC has a fixed receptive field, and it only requires a few primes to complete the filter construction, rather than traversing every integer.

Prerequisite
Let $X \in \mathbb{R}^{C \times K \times T}$ denote the spectrogram, where C is the number of channels, K is the number of frequency bins, and T is the number of temporal segments. Let f(k) denote the frequency at the k-th bin; then X(c, k, t) refers to the energy of the sound in the c-th channel, at the t-th time frame, around frequency f(k). The inverse function of f is denoted as $f^{-1}(\cdot)$,

a wide range, solving Eq. (4) to get every possible distance between two harmonics would create a huge number of dilated locations, i.e., a tall and dense filter with a lot of parameters. Instead, we introduce the prime-ratio function to represent all the distances using a small list of prime numbers and create a sparse filter.

Prime-Ratio Function
For any prime number p, the prime-ratio function r(p) is defined as follows.

$$r(p) = 2^{-s} p, \quad s = \max\{ s \in \mathbb{N} \mid 2^{s} < p \} \tag{5}$$

By the mathematical rule of prime factorization, for every $n \in \mathbb{N}$ with $n \ge 2$ there exists only one way to decompose n into a product of prime powers. Since r(2) = 2, every prime satisfies $p = r(2)^{s} r(p)$, and therefore the following theorem holds for every integer.

Theorem 1. For all $n \ge 2$, n can be represented in exactly one way as a product of prime-ratio powers, i.e.,

$$n = \prod_{i=1}^{l} r(p_i)^{\alpha_i} \tag{6}$$

where $p_1 < p_2 < \cdots < p_l$ are prime numbers and the $\alpha_i$ are positive integers.

Calculating d(1, n) based on Eq. (6) results in the following equation:

$$d(1, n) = \sum_{i=1}^{l} \alpha_i \, d(1, r(p_i)) \tag{7}$$

Eq. (7) states that the distance between any integer harmonic and its fundamental frequency can be represented as a finite linear combination of the prime-ratio distances $d(1, r(p_i))$, where $d(1, r(p_i))$ has two characteristics:
1. It always represents the distance between two integer harmonics, because according to Eq. (4) and (5),

$$d(1, r(p_i)) = d(2^{s_i}, p_i) \tag{8}$$

where $2^{s_i}$ and $p_i$ are both integers.

2. It always lies between 0 and B, since $r(p) \in (1, 2]$ holds for every prime, and $0 = d(1, 1) < d(1, r(p_i)) \le d(1, 2) = B$.

Formulation
The main concept of PDC is inspired by the above theorem and its inferences: 1) any integer can be represented as a product of r(p); 2) the product of r(p) appears as a summation of d(1, r(p)); and 3) if the dilated locations are set to d(1, r(p)), the summation corresponds to the shift and stack of the convolution operation. In other words, if a series of distances $d(1, r(p_i))$ is taken as the dilated locations, as shown in Fig. 5, then any integer harmonic will be captured by the shift and stack of the filter. For example, $6 = r(2)^2 r(3)$, so the sixth harmonic can be reached by stacking the dilations d(1, r(2)), d(1, r(2)) and d(1, r(3)). In our experiments, we obtain such a stack by inserting a single PDC filter after every convolutional layer. All the distances d(1, r(p)) are no greater than B, leaving PDC with a fixed receptive field.

Figure 5: Main concept of the prime-dilated convolution, illustrated with the hyper-parameters B = 12 and l = 4. (a) shows the dilated structure of the asymmetric version, where j is the index, $p_j$ is the j-th smallest prime, $r(p_j)$ is the prime-ratio function, $d(1, r(p_j))$ is the distance in bins to the bottom line, $\vec{v}$ is the trainable parameters, $\vec{w}$ is the filter after dilation, and $k_j$ is the dilated location. (b) illustrates the way the filter shifts and stacks to capture all the harmonics in a log-scale spectrogram. (c) introduces the symmetric version of PDC, which is constructed by flipping and fusing the first version.

In practice, the dilated location in a filter has to be an integer, but $d(1, r(p_i))$ is often irrational. Therefore, PDC constructs the dilation according to the integer approximation of the $d(1, r(p_i))$ sequence, which is formally defined as follows: Let $p_1 < p_2 < \cdots < p_l$ denote the l smallest prime numbers. Let $k_j$ denote the integer closest to $d(1, r(p_j))$; we then have:

$$k_0 = 0, \qquad k_j = \arg\min_{k} \left| k - B \log_2 r(p_j) \right|, \quad j = 1, \dots, l \tag{9}$$

We introduce two versions of the PDC filter, which differ in how the dilated locations are constructed.

Asymmetric version. Let $\mathcal{K} = \{k_j\}_{j=0}^{l}$ denote the set of dilated locations. Let the vector $\vec{v} = (v_0, v_1, \dots, v_l)$ denote the trainable parameters in PDC. Let $\vec{w} = (w_0, w_1, \dots, w_B)$ denote the vector after dilation, where the receptive field is (B + 1) × 1. The values of $w_k$ are defined as follows.

$$w_{k_j} = v_j, \quad j = 0, 1, \dots, l; \qquad w_k = 0, \quad k \notin \mathcal{K} \tag{10}$$

Eq. (10) creates an asymmetric structure $\vec{w}$ which covers only higher harmonics, as shown in Fig. 5 (a) and (b).

Symmetric version. Symmetric PDC is created by flipping and fusing the asymmetric version, as shown in Fig. 5 (c). Let $\mathcal{K} = \{k_j\}_{j=-l}^{l}$ denote the set of dilated locations, where the negative index $k_{-j}$ is defined as the opposite of $k_j$:

$$k_{-j} = -k_j, \quad j = 1, \dots, l \tag{11}$$

Let the vector $\vec{v} = (v_{-l}, \dots, v_l)$ denote the trainable parameters in PDC, and $\vec{w} = (w_{-B}, \dots, w_B)$ denote the vector after dilation, where the receptive field is (2B + 1) × 1. The values of $w_k$ are defined as follows:

$$w_{k_j} = v_j, \quad j = -l, \dots, l; \qquad w_k = 0, \quad k \notin \mathcal{K} \tag{12}$$

The convolution operation pdc(·) can be parameterized with the dilated filter constructed from $\vec{w}$.

2.4 Multi-modal Feature Engineering
Besides spectrograms, chromagrams, and MFCC, we also utilized certain statistical features in the network, which are closely related to sound timbre. For example, the following information is widely used in audio processing tasks:
• Amplitude Envelope: The changes in the amplitude of a sound over time.
• RMS Energy: The root mean square energy of the audio.
• Zero Crossing Rate: The rate at which a signal changes between positive and negative values.
• Wiener Entropy: Also known as Spectral Flatness, a metric to measure whether a sound is tonal or noisy.
Notice that each of these features is a scalar per time step. Given that a fixed input note has a fixed duration, we can directly use an MLP mapping from time steps to feature dimensions to process each statistical feature.

2.5 Techniques
Label Smoothing
As mentioned above, by discretizing continuous parameters, all parameter estimation can be treated as classification problems. In practice, all continuous parameters lie in the range [0, 1] and are divided into K segments, as in a K-way classification task.

Unlike normal classification tasks, discretized numerical classes are not symmetric: wrongly classifying a class as an adjacent class has a smaller influence than classifying it as an arbitrary other class. Thus, we can split part of the probability mass of the ground-truth label into neighboring classes. Technically, the ground-truth label of length K is first 1D-convolved with a Gaussian kernel of $\sigma = \sigma_0 / K$, normalized so that the total probability mass sums to 1, and then used as the target for cross-entropy loss computation.
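The label-smoothing procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `smooth_label` and `soft_cross_entropy` are hypothetical, and the sketch assumes that the kernel width σ = σ0/K is expressed on the [0, 1] parameter axis, so that it corresponds to σ0 class bins when each of the K classes has width 1/K. Under that reading, convolving a one-hot vector with an untruncated Gaussian kernel and renormalizing is the same as evaluating a Gaussian centered on the true class and dividing by its sum.

```python
import math


def smooth_label(true_class, num_classes, sigma0=1.0):
    """Soft target: Gaussian bump centered on true_class, normalized to sum to 1."""
    # Assumption: sigma0 is measured in class bins (sigma = sigma0 / K on [0, 1]).
    weights = [
        math.exp(-0.5 * ((k - true_class) / sigma0) ** 2)
        for k in range(num_classes)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target, logits))


K = 64  # illustrative number of classes per continuous parameter
target = smooth_label(true_class=20, num_classes=K, sigma0=1.0)

near_miss = [0.0] * K
near_miss[21] = 5.0  # toy prediction that peaks on an adjacent class
far_miss = [0.0] * K
far_miss[50] = 5.0  # toy prediction that peaks on a distant class

loss_near = soft_cross_entropy(near_miss, target)
loss_far = soft_cross_entropy(far_miss, target)
```

With these toy logits, the near miss receives a lower loss than the far miss, which is the asymmetry between adjacent and arbitrary classes that the smoothing is meant to encode. With a plain one-hot target, both predictions would be penalized equally.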
Gradient-Inspired Weighting
As mentioned in Eq. (2.2), only considering the MSE loss in parameter space could result in overfitting the configuration space while performing badly in the audio space.

Observation 1. Most parameters in most presets are locally continuous: a small change in the preset also implies a small change in the rendered audio and MFCC.

This implies that we can approximate the local audio space loss using a linear loss term.

Observation 2. Based on preliminary experiments, our model is able to generate predictions relatively close to the ground truth.

This implies that we can use the gradient field around the ground truth $\theta^*$ to substitute the one around the prediction $\hat{\theta}$.

Combining the observations, we can approximately state:

$$
\begin{aligned}
\left.\frac{\partial L_{\mathrm{MFCCD}}(\theta)}{\partial \theta_i}\right|_{\theta=\hat{\theta}}
&= \left.\frac{\partial L_{\mathrm{MFCCD}}(\theta)}{\partial L_{\mathrm{MSE}}(\theta_i)}\right|_{\theta=\hat{\theta}} \cdot \left.\frac{\partial L_{\mathrm{MSE}}(\theta_i)}{\partial \theta_i}\right|_{\theta=\hat{\theta}} \\
&\approx \left.\frac{\Delta L_{\mathrm{MFCCD}}(\theta)}{\Delta L_{\mathrm{MSE}}(\theta_i)}\right|_{\theta=\theta^*} \cdot \left.\frac{\partial L_{\mathrm{MSE}}(\theta_i)}{\partial \theta_i}\right|_{\theta=\hat{\theta}}
\end{aligned}
\tag{13}
$$

By precomputing $\left.\frac{\Delta L_{\mathrm{MFCCD}}(\theta)}{\Delta L_{\mathrm{MSE}}(\theta_i)}\right|_{\theta=\theta^*}$ for each parameter $\theta_i$ of each preset $\theta$ in dataset D, we can estimate an importance weight for the prediction $\hat{\theta}_i$, to be used in training.

This technique can be applied only if the number of training samples in dataset D is small, since rendering audio using a synthesizer and computing the audio space loss for every parameter of every data point is very time-consuming. Optimizing this method is left as future work.

3 Dataset
There are 155 parameters for the Dexed synthesizer in total: 94 continuous parameters, 59 discrete parameters, and 2 fixed parameters (Algorithm and Output).

Figure 6: Different Algorithms. There are 32 Algorithms in total.

Algorithm is a special parameter in Dexed, which determines the modulation relationship between the six oscillators. The physical meaning of all parameters depends on the choice of Algorithm. WLOG, we restrict our experiment to a fixed Algorithm setting. In practical applications, a separate model should be trained for each Algorithm, and the output of the model with the smallest audio space loss is then selected as the final output of the system.

There is another special parameter called Output, which controls the volume of the generated sound. In our datasets, we fixed the value of Output to the maximum value; otherwise, the generated audio of many presets is so quiet that it affects the perception of the human ear. In the model training and inference process, we also fixed the Output parameter, which is consistent with the dataset. Fixing this parameter is also a reasonable decision in practice, since users can easily adjust the overall volume of the synthesizer output afterward, and it is often necessary to do so.

We combined three different methods to generate datasets:
• Preset Based: We collected 100k+ presets on the Internet, which are widely used in real-life music composition. We input each preset $\theta_i$ into the Dexed synthesizer to render the corresponding audio files $A_i$. We split the preset-audio pairs into 32 sub-datasets according to different values of Algorithm.
• Preset Augmentation: We applied simple data augmentation to enlarge the dataset. Given a preset, we can fix the value of most of its parameters and then uniformly sample the values of the remaining parameters.3 Since randomly generated presets may not be audible, we set a minimum threshold of audible volume to sieve those presets out.
• Random Walk Based: To improve and test generalizability, besides collecting presets online, we also randomly sampled presets from the configuration space to construct a random dataset. Similarly, we only preserved presets that can generate audible sounds.

3 Dexed presets are grouped into different themes, which are split and augmented separately so that there is no data leakage during augmentation.

4 Results
Quantitative Results on Dexed

Method                                        MFCCD
*Hill Climbing                                21.96
*Genetic Algorithm                            31.32
APVST LSTM                                    32.76
APVST LSTM++                                  22.59
PresetGen VAE                                 14.70
*Similarity Threshold for Human Perception    10 ∼ 15
Sound2Synth multi-modal (OURS)                5.36

Table 1: Experiment results. Lower MFCCD is better. All MFCCDs are measured under the T6 setting: 6 oscillators on Dexed.

* These figures are obtained from APVST [Yee-King et al., 2018]. Our subjective listening test also agrees with this similarity threshold.

The detailed experiment settings are elaborated in Appendix A.

From a quantitative perspective (Tab. 1), our model largely outperforms previous SOTA: PresetGen VAE [Le Vaillant et
al., 2021].4 From a visual perspective (Fig. 7), the spectrograms of our predictions are very similar to those of the ground truths. From an auditory perspective, the audio generated using the predicted presets and the ground-truth audio are very alike.

4 MFCCD metric is influenced by the number of filter banks; however, we computed MFCCD under both 13-band ([Yee-King et al., 2018]) and 40-band ([Le Vaillant et al., 2021]) settings and observed no remarkable difference in the evaluation results. Thus we reported the 13-band MFCCD in Tab. 1.

Figure 7: Four sampled spectrogram cases. Within each group, the ground-truth audio is on the top and the audio generated using the predicted parameters is on the bottom.

5 Conclusion
We proposed a novel multi-modal pipeline, along with a prime-dilated convolution structure and many other useful techniques in audio processing, to tackle the synthesizer parameter estimation problem. The result of our pipeline, Sound2Synth, is not only significantly better than the previous SOTA on the Dexed synthesizer but is also able to reach human auditory perception precision. We have released code, a plug-in, audio demos, and example use cases in which our plug-in boosts musicians' creativity and substantially simplifies the creation process. This could have an impact on the development of AI for art in the field of music and sound design, and it could be beneficial for other audio processing tasks in the

A Experiment Settings
WLOG, we fixed the input note $\eta_0$ to be at the middle C pitch (C4). The note is always pressed with maximum velocity, sustained for 4 beats, and recorded for 8 beats in total, at a tempo of 120 bpm. All audio is converted to a 48 kHz sample rate and 32-bit depth. All 6 oscillators of Dexed are used, giving 155 parameters in total: 94 continuous parameters, 59 classification parameters, and 2 fixed parameters. All continuous parameters are discretized into 64 classes.

Our experiments are carried out on a pre-generated dataset containing 30106 training/validation data points and 1679 test data points on Dexed. Among the training/validation data points, 6191 are directly sampled from existing presets, 22237 are augmented from those presets, and 1678 are generated purely at random. In practice, 80% of the data points are used for training and 20% are held out for validation. Note that the test dataset is generated from independent held-out themes of presets and random walk is not used, preventing data leakage.

On model architecture, the extracted global features have the same dimension of 2048 for all model structures. In the case of the multi-modal structure, each backbone is assigned a small portion of the features. Specifically, the convolutional backbones, which are used to extract features from the spectrogram and CQT chromagram, each have an output dimension of 512, while the other backbones, which are used to extract features from the waveform, MFCC, or statistical information, each have an output dimension of 128. The masked classifier has 64 hidden neurons for each group (a parameter or an oscillator).

We trained our models using the AdamW [Loshchilov and Hutter, 2019] optimizer with a universal weight decay of $10^{-4}$ and a linear warm-up cosine annealing scheduler with 4 fixed warm-up epochs and a peak learning rate of $2 \times 10^{-4}$ over at most 30 epochs. We used a virtual batch size of 64 data points per batch by using gradient accumulation. We adopted training tricks including gradient clipping, snapshots, early stopping, stochastic weight averaging, etc. It is worth noting that small Gaussian noise is added to the training data points to improve the robustness of the model.

We trained each of our models on a Linux server using a single NVIDIA GeForce GTX 1080Ti GPU. The maximum GPU RAM usage is no more than 9GB for a properly chosen physical batch size.

B Sound2Synth Plug-In
Figure 8: Screenshot of the Sound2Synth plug-in built on Dexed. The part highlighted by the rectangle is the Sound2Synth interface.

Using our Sound2Synth model, we developed and released a plug-in based on the Dexed synthesizer. The plug-in first "Ping"s the server running the neural network to establish a connection. Then, by "Match"ing an input audio file, our Sound2Synth model automatically calculates the corresponding parameters and assigns them back to the synthesizer. The plug-in also supports "Download" to serialize and save the preset in a human-readable JSON format.

Ethical Statement
There are no ethical issues.
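As a supplementary illustration of the optimizer settings in Appendix A, the schedule described there (4 linear warm-up epochs, a peak learning rate of 2 × 10^-4, cosine annealing over at most 30 epochs) could look like the following Python sketch. The per-epoch stepping, the warm-up starting value, and the decay toward zero are assumptions, since the text does not specify them, and `learning_rate` is a hypothetical helper rather than part of the released code.

```python
import math


def learning_rate(epoch, peak_lr=2e-4, warmup_epochs=4, max_epochs=30):
    """Learning rate for a 0-indexed epoch: linear warm-up, then cosine decay."""
    if epoch < warmup_epochs:
        # Assumption: ramp linearly so that the last warm-up epoch reaches the peak.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Assumption: cosine annealing from the peak toward zero at max_epochs.
    progress = (epoch - warmup_epochs) / (max_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# One learning rate per epoch for the longest run mentioned in the text.
schedule = [learning_rate(e) for e in range(30)]
```

In a real training loop the same shape would more likely come from the deep-learning framework's built-in schedulers; the explicit formula is shown only to make the warm-up and annealing phases easy to inspect.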
