1. Introduction

Cinematic source separation refers to the task of separating movie audio into dialogue (DX), music (MX), and sound effects (FX). While speech separation (Hershey et al., 2016; Chen et al., 2017; Yu et al., 2017) and music separation (Huang et al., 2012; Grais et al., 2014; Uhlich et al., 2015) have been studied extensively, cinematic source separation is a relatively recent field (Petermann et al., 2022) despite its numerous practical applications. These include enhancing old movies by converting them to formats like MPEG-H or Dolby Atmos, dubbing them into different languages, or generating subtitles including non-speech sounds present in an auditory scene.

The first work in the area of cinematic separation was dialogue enhancement (Uhle et al., 2008; Geiger et al., 2015; Paulus et al., 2019; Torcoli et al., 2021), which employs source separation to extract and remix the dialogue signal at a desired level. The problem was further formalized by Petermann et al. (2022), who introduced cinematic separation as a three-way problem of splitting the audio into dialogue, sound effects, and music, which they referred to as the cocktail fork problem. They created a new dataset, Divide and Remaster (DnR), which was built upon LibriSpeech (Panayotov et al., 2015) for dialogue, Free Music Archive (Defferrard et al., 2016) for music, and the Freesound Dataset 50k (Fonseca et al., 2021) for sound effects. Their exploration of various separation models revealed that their proposed multi-resolution extension of X-UMX (Sawata et al., 2021, 2023), termed MRX, provided the best performance. Subsequently, Petermann et al. (2023) extended this work to also consider the impact of source separation on downstream tasks. They proposed a two-stage approach where an MRX separator is used to obtain preliminary separation, which is followed by an activity detector to estimate the activity profile for every source. This activity information is then utilized in a second stage by a conditioned MRX, called MRX-C, to improve the separation performance. Recently, DnR was also used by Watcharasupat et al. (2023), who extended the band-split RNN (Luo and Yu, 2023) to cinematic separation by introducing the BandIt architecture.

Cinematic separation poses several unique challenges compared to speech or music separation. Firstly, the multi-channel format of most cinematic audio (stereo or 5.1 surround) necessitates suitable augmentation during training, as many datasets, such as DnR, are only monaural. Secondly, the scarcity of full-bandwidth training material at a sampling rate of 48 kHz poses a significant hurdle, as high-quality audio data is essential for effective model training. Thirdly, the lack of emotional speech in the speech datasets used for training presents a challenge: separation models trained on such data often struggle with the emotional speech found in real cinematic dialogue, an issue already noted in earlier work (Uhle et al., 2008). Fourthly, the sound effect class, which encompasses a wide variety of sounds, is particularly challenging to extract due to its broad and diverse nature.1 Finally, the three classes exhibit some overlap, such as the presence of vocals in music, background chatter, which is a sound effect but shares similarities with dialogue, or the use of musical instruments for sound design, as in the alien communication signal in Close Encounters of the Third Kind, a sound effect made of musical notes. These challenges highlight the complexity of cinematic separation and the need for further research and development in this field.

Hence, in addition to the music demixing (MDX) track (Fabbro et al., 2024), which was already present in the Music Demixing Challenge 2021 (MDX’21) (Mitsufuji et al., 2022), we added a new cinematic demixing (CDX) track to the Sound Demixing Challenge 2023 (SDX’23) in order to foster research in this direction. The challenge was facilitated through AIcrowd,2 and participants were invited to submit their systems to one of two leaderboards, depending on whether they used only DnR or additional training data. To rank the submissions, we developed a new hidden test set, called CDXDB23, derived from real movies. Through this challenge framework, we observed substantial performance improvements: the top-performing system trained solely on DnR improved by 1.8 dB over the cocktail-fork baseline based on MRX (Petermann et al., 2022), and the highest-performing system on the open leaderboard, which allowed the use of any data for training, improved by 5.7 dB. These results underscore the efficacy of the challenge in driving advances in cinematic audio separation.

This paper is organized as follows: Section 2 outlines the competition’s design, Section 3 discusses the training datasets and establishes the performance baseline, Section 4 presents the results and summarizes the most successful strategies, and Section 5 analyzes the differences between the provided training dataset, DnR, and the hidden test set, CDXDB23. Finally, Section 6 concludes the paper with key findings and future research directions.

2. CDX Challenge Setup

In the following, we will summarize the structure of the competition.

2.1 Task Definition

Participants in the CDX track of SDX’23 were asked to submit systems that can extract the dialogue $s_\text{DX}(n) \in \mathbb{R}^2$, sound effects $s_\text{FX}(n) \in \mathbb{R}^2$, and music $s_\text{MX}(n) \in \mathbb{R}^2$ from the stereo cinematic audio

(1)
$x(n) = s_\text{DX}(n) + s_\text{FX}(n) + s_\text{MX}(n),$

where n denotes the time index and all stereo signals are sampled at 44.1 kHz. We used the following definition for each class:3

  • Dialogue refers to all spoken content in a movie including conversations between characters, monologues, and any other spoken elements.
  • Sound effects are sounds that are used to support or complement the action on screen. They can be split into object sounds (e.g., footsteps) and ambient sounds (e.g., wind or rain).
  • Music refers to the soundtrack that accompanies the visuals and which is often used to provide an emotional context. It might be a single instrument (e.g., a violin in a dramatic moment) or a full orchestra or band.

We verified these definitions with mixing engineers from Sony Pictures.

A unique aspect of this challenge was the requirement for participants to submit their pre-trained models along with the corresponding inference code, as the test dataset was kept hidden. This stands in contrast to many other challenges where participants have access to unlabeled test data and are required to submit processed files or labels.

2.2 Leaderboards

Submissions were categorized under two leaderboards:

  • Leaderboard A was designated for models exclusively trained on the train and validation splits ‘tr’ and ‘cv’ of the Divide and Remaster (DnR) dataset (Petermann et al., 2022), while
  • Leaderboard B was for models trained on any data.

The rationale behind this dual-leaderboard approach is threefold. Firstly, it allows individuals who may not have access to extensive datasets to participate in the competition. Secondly, it provides a platform to explore data augmentation strategies, such as mono-to-stereo conversion, which is particularly relevant as the DnR dataset is monaural, while the hidden test set used for evaluation is in stereo format. Thus, the two leaderboards not only foster inclusivity but also encourage innovative approaches to data augmentation. Thirdly, the two leaderboards allow disentangling data improvements from algorithm improvements, as Leaderboard B performance could come from extra data or better augmentation strategies relying on additional data (e.g., room impulse responses), while Leaderboard A improvements must come from augmentations without additional data and from algorithms only. However, Leaderboard B is required to determine the true state of the art.

2.3 Ranking Metric

For the evaluation of the systems, we used the global signal-to-distortion ratio (SDR) which is defined for one movie clip as

(2)
$\text{SDR} = \tfrac{1}{3}\left(\text{SDR}_\text{DX} + \text{SDR}_\text{FX} + \text{SDR}_\text{MX}\right),$

with $\text{SDR}_j = 10 \log_{10} \frac{\sum_n \|s_j(n)\|^2}{\sum_n \|s_j(n) - \hat{s}_j(n)\|^2}$, where $s_j(n) \in \mathbb{R}^2$ and $\hat{s}_j(n) \in \mathbb{R}^2$ denote the stereo target and estimate for source j ∈ {DX, FX, MX}. The definition in Equation (2) is also called utterance-level SDR (cf., for example, Luo and Yu, 2023) and is equivalent to the SDR of multi-channel BSS Eval v3 (Vincent et al., 2007). Finally, the global SDR of (2) is averaged over all clips in the hidden test dataset and the three sources DX, FX, and MX to obtain the final score. We chose this metric to rank submissions over scale-invariant metrics like SI-SDR (Le Roux et al., 2019), because systems with good SDR performance have the advantage that they can easily be blended with other models (Uhlich et al., 2017) and also allow one to compute the residual $\hat{s}_{\neg j}(n) = x(n) - \hat{s}_j(n)$ without having to recover the correct scale.
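For reference, the ranking metric can be written in a few lines of numpy. The following is a minimal sketch of Equation (2) for a single clip; the small constant guarding against division by zero is our addition and not part of the official definition.

```python
import numpy as np

def global_sdr(references, estimates, eps=1e-12):
    """Global (utterance-level) SDR of Equation (2) for one clip.

    references/estimates: dicts mapping 'DX', 'FX', 'MX' to stereo
    signals of shape (num_samples, 2).
    """
    sdr_per_source = {}
    for src, ref in references.items():
        err = ref - estimates[src]
        sdr_per_source[src] = 10.0 * np.log10(
            (np.sum(ref ** 2) + eps) / (np.sum(err ** 2) + eps))
    return np.mean(list(sdr_per_source.values())), sdr_per_source
```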

Besides the chosen global SDR (2), there are also other metrics that were proposed in the literature for the comparison of source separation models. As part of MDX’21, a thorough comparison of different metrics was performed by Mitsufuji et al. (2022) to show that Equation (2) highly correlates with many other metrics, in particular those that were used in previous iterations of the SiSEC competition in 2015, 2016, and 2018. We refer the interested reader to Mitsufuji et al. (2022) for more details.

2.4 Timeline, Challenge Phases and Prizes

The challenge took place in two phases. Phase 1 started on January 23, 2023, and Phase 2 commenced on March 6, 2023, as planned. However, because the submission system had difficulties handling the surge in the number of submissions towards the end of the challenge, the end of Phase 2 was extended by one week, to May 8, 2023, to ensure a fair competition for all teams.

CDXDB23 was partitioned into three sets of approximately equal size, containing three, three, and four movies respectively. During Phase 1 of the competition, participants were able to assess the performance of their submissions using one-third of the movies from the hidden test set. In Phase 2, this was expanded to include two-thirds of the movies from the hidden test set. Upon the conclusion of Phase 2, participants were required to select three submissions for evaluation on the full hidden test set, the results of which were then displayed on the final leaderboards. This selection process was implemented to mitigate the potential impact of overfitting. In cases where participants did not explicitly select three submissions, the top three submissions from the Phase 2 leaderboard were automatically chosen for final evaluation.

For Leaderboard A, which was for models trained exclusively on the Divide and Remaster (DnR) dataset, a total of 5,000 USD was distributed among the top three submissions. The first-place winner received 2,500 USD, the second-place winner was awarded 1,500 USD, and the third-place winner received 1,000 USD. To be eligible for these prizes, participants were required to open-source both their training and inference code as well as the pretrained model. Similarly, for Leaderboard B, which was for models trained on any data, the same prize distribution was applied. For this leaderboard, participants were required to open-source their inference code as well as the pretrained model. Compliance with these open-source requirements was ensured by the organizers through a due diligence check. In the course of this evaluation, a thorough review of the source code was conducted to verify that participants in Leaderboard A exclusively trained their models using only DnR.

3. Datasets and Baseline

The following subsections offer detailed descriptions of the datasets employed throughout the challenge, as well as an overview of the baseline included in the starter kit.

3.1 Divide and Remaster (DnR) – Training Dataset

Introduced by Petermann et al. (2022), the Divide and Remaster (DnR) dataset serves as a tool for developing and evaluating mono audio signal separation algorithms applied to podcasts, television, and movies. It includes artificial mixtures sourced from LibriSpeech (Panayotov et al., 2015), Free Music Archive (FMA) (Defferrard et al., 2016), and Freesound Dataset 50k (FSD50k) (Fonseca et al., 2021). This dataset, available in both 16 kHz and 44.1 kHz sampling rates, comes with time-stamped annotations for each class: genre for music, audio-tags for sound effects, and transcription for speech.

The creation process of DnR was centered on addressing class overlap and relative source levels in the mix within a single-channel4 context. It includes four categories: speech, music, foreground effects, and background effects – the latter two being merged into a single submix. All mixtures have a duration of 60 seconds, encompassing multiple full speech utterances and sufficient onsets and offsets between classes. File count for each class was set via a zero-truncated Poisson distribution, and relative amplitude levels across the classes were determined per industry standards and prior studies as discussed by Petermann et al. (2022). Each sound file’s gain was individually adjusted to add variability while preserving realistic consistency across the mix. The final dataset, divided into training, validation, and testing subsets in line with base dataset proportions, comprises 3,406 training mixtures, 487 validation mixtures, and 973 test mixtures.

While the DnR dataset took care to simulate realistic cinematic mixtures, there are some notable differences between the source material used to create DnR and actual cinematic audio:

  • Read speech vs. emotional speech – First, LibriSpeech contains read speech from audio books, which may have significant timbral differences compared to the emotional speech typically used by film actors.
  • Vocals in music stems – Second, many of the musical genres from the FMA dataset contain vocals. While music with vocals is used in cinema, the majority of cinematic music does not contain singing. Thus, music with vocals may be overrepresented in FMA compared to the hidden test data.
  • Production quality – Finally, LibriSpeech, FMA, and FSD50k are all crowd-sourced datasets, and there may be significant differences in terms of recording hardware and post-production effects compared to actual movies. We will investigate this in more detail in Section 5.

In summary, it is expected that mismatches such as these may limit the performance of separation models trained only on DnR.

For Leaderboard A, participants were required to use only the training and validation splits of the DnR dataset when training their systems.

3.2 CDXDB23 – Hidden Test Dataset

To rank the submissions, we generated a novel dataset derived from authentic Sony Pictures movies; in the following, we refer to this dataset as the cinematic demixing database (CDXDB23). It comprises 11 movies with a total of 156 clips, each with an average length of 11 seconds, amounting to approximately 28.7 minutes of content. The audio was originally at a higher sample rate, but we downsampled it to 44.1 kHz stereo to match the sample rate of the DnR dataset. This was done to avoid requiring participants in Leaderboard A to design systems that can upscale to a higher sampling rate. Figure 1 shows the distribution of genres and release years of the eleven movies in CDXDB23. Please note that a single movie can fall under multiple genres, such as Animation and Family; this is reflected in the bar plot, where a movie contributes to every genre it belongs to. From Figure 1 we can observe that CDXDB23 contains recent movies covering a wide variety of genres.

Figure 1 

Statistics of movies in CDXDB23.

The original data supplied by Sony Pictures was formatted as 5.1-channel, 48 kHz, and 24-bit with several stem tracks for each movie, encompassing either dialogue, effects, music, or their combination. We manually annotated the sound events with one class label (dialogue, sound effects, or music) within each stem and carefully selected segments to ensure a balanced representation of each class in the resulting mixture. Specifically, to exclude extremely low-amplitude sound sources, we computed for each class the root mean square amplitude $\text{RMS}_j = \left(\frac{1}{N}\sum_{n=1}^{N} \|s_j(n)\|^2\right)^{1/2}$ and excluded segments where this value was below a threshold $\tau_j$ for any j ∈ {DX, FX, MX}. Empirically, we found the thresholds

$\tau_\text{DX} = 0.022, \quad \tau_\text{FX} = 0.005, \quad \tau_\text{MX} = 0.003,$

to give good test samples. On occasion, environmental noise was unintentionally recorded, or dialogue/vocal components appeared in the effects or music stems. We made diligent efforts to minimize the inclusion of such samples by manually inspecting all data.
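As an illustration, this segment selection reduces to a small filter over candidate segments. The sketch below assumes each stem is a float array of shape (num_samples, 2); this interface is our assumption.

```python
import numpy as np

# RMS thresholds from the text, one per class.
TAU = {"DX": 0.022, "FX": 0.005, "MX": 0.003}

def keep_segment(stems):
    """Keep a candidate segment only if every class exceeds its RMS threshold.

    stems: dict mapping 'DX', 'FX', 'MX' to stereo arrays of shape (num_samples, 2).
    """
    for src, tau in TAU.items():
        # RMS of the stereo vector norm, as in the definition of RMS_j above.
        rms = np.sqrt(np.mean(np.sum(stems[src] ** 2, axis=-1)))
        if rms < tau:
            return False
    return True
```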

Please note that we are unable to provide further details about the movies (e.g., title or actors) to prevent participants in a future challenge from fine-tuning their models based on this specific information. However, we made available demo samples from “Kilian’s Game”, a short film produced by Sony Pictures to demonstrate the latest filmmaking technologies. These samples could be used by participants to test their submissions and to see the performance on real movie audio.5 The samples from “Kilian’s Game” were not used to rank the submissions.

Ideally, we would also use authentic movie data for training models. During the preparation of CDXDB23, we noticed that this is not straightforward. One problem is the preparation of the three-way stems from movie audio, which is a time-intensive process: the material is not readily accessible, so all raw tracks must be reloaded into a Digital Audio Workstation (DAW), assigned to one of the three classes, and finally bounced into a stem per class. Additionally, we noticed the challenge of other sound classes infiltrating a single stem; for example, dialogue stems can contain sound effects recorded on-stage. Consequently, after bouncing the stems, one has to manually annotate all audio material to find suitable time regions for the three-way separation, which leads to a small dataset suitable only for testing, as exemplified by CDXDB23.

3.3 Cocktail-Fork Baseline

As part of the challenge, MERL open-sourced their multi-resolution CrossNet (MRX) (Petermann et al., 2022), an improved version of CrossNet-Open-Unmix (X-UMX) (Sawata et al., 2021, 2023), which itself is an improved version of Open-Unmix (UMX) (Stöter et al., 2019). MRX leverages multiple short-time Fourier transform (STFT) resolutions of the mixture, which enhances the estimation process as it allows the model to better address the variety of acoustic characteristics of the three source types. The entire system is available on GitHub.6

Using the available pre-trained model on DnR, a baseline submission was created and made available to all participants as part of the starter kit.7 We noticed already during the preparation of the baseline that scaling the input mixture is beneficial and, hence, applied the scaling

(3)
$x(n) \leftarrow \frac{x(n)}{\max_n |x(n)|},$

i.e., the cocktail-fork model is run on the peak normalized mixture. Training of MRX utilized scale-invariant signal-to-distortion ratio (SI-SDR) loss, necessitating subsequent scale estimation using least-squares according to the formula

(4)
$\hat{s}_j(n) \leftarrow \frac{\sum_n x(n)^T \hat{s}_j(n)}{10^{-7} + \sum_n \|\hat{s}_j(n)\|^2}\, \hat{s}_j(n)$

for any j ∈ {DX, FX, MX}. Furthermore, a post-processing step was implemented to ensure mixture consistency (Wisdom et al., 2019), where we first compute the residual $r(n) = x(n) - \hat{s}_\text{DX}(n) - \hat{s}_\text{FX}(n) - \hat{s}_\text{MX}(n)$, which is then distributed to the estimates

$\hat{s}_\text{DX}(n) \leftarrow \hat{s}_\text{DX}(n), \quad \hat{s}_\text{FX}(n) \leftarrow \hat{s}_\text{FX}(n) + \tfrac{1}{2} r(n), \quad \hat{s}_\text{MX}(n) \leftarrow \hat{s}_\text{MX}(n) + \tfrac{1}{2} r(n).$

This post-processing was beneficial as the residual contains mostly sound effects and background music: $\text{SDR}_\text{FX}$ improved by +1.1 dB and $\text{SDR}_\text{MX}$ by +0.7 dB, resulting in an overall improvement of +0.6 dB. The performance of the cocktail-fork baseline on CDXDB23 can be found in Table 1.
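Putting the three steps together, the baseline wrapper can be sketched as follows; `separate_fn` is a placeholder for the pre-trained MRX model, and the dictionary interface is our assumption.

```python
import numpy as np

def cocktail_fork_postprocess(x, separate_fn, eps=1e-7):
    """Sketch of the baseline pre-/post-processing around the MRX separator.

    x: stereo mixture of shape (num_samples, 2).
    separate_fn: placeholder returning a dict {'DX': ..., 'FX': ..., 'MX': ...}.
    """
    # Equation (3): run the model on the peak-normalized mixture.
    est = separate_fn(x / np.max(np.abs(x)))

    # Equation (4): least-squares rescaling of each SI-SDR-trained estimate.
    for src, s in est.items():
        alpha = np.sum(x * s) / (eps + np.sum(s ** 2))
        est[src] = alpha * s

    # Mixture consistency: distribute the residual to effects and music.
    r = x - est["DX"] - est["FX"] - est["MX"]
    est["FX"] += 0.5 * r
    est["MX"] += 0.5 * r
    return est
```

Note that no explicit undoing of the peak normalization is needed in this sketch, as the least-squares scaling of Equation (4) already restores the scale of the estimates relative to the original mixture.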

Table 1

Final Leaderboard A (models trained only on DnR; top 5).


Rank | Participant | Global SDR (dB): Mean / Dialogue / Effects / Music | Submissions to Ldb A: 1st phase / 2nd phase | Code

Submissions
1. | aim-less | 4.345 / 7.981 / 1.217 / 3.837 | 36 / 32 | Code8
2. | mp3d | 4.237 / 8.484 / 1.622 / 2.607 | 4 / 2 | Code9
3. | subatomicseer | 4.144 / 7.178 / 2.820 / 2.433 | 65 / 22 | Code10
4. | thanatoz | 3.871 / 8.948 / 1.224 / 1.442 | 21 / 22 |
5. | kuielab | 3.537 / 7.687 / 0.449 / 2.474 | 36 / 15 |

Baseline
Scaled identity $\hat{s}_j(n) = \tfrac{1}{3}\, x(n)$ | –0.019 / 1.562 / –1.236 / –0.383
Cocktail-Fork (Petermann et al., 2022) | 2.491 / 7.321 / –1.049 / 1.200

After the challenge, we revisited this baseline as many participants recognized a distribution mismatch between DnR and CDXDB23, which is also reflected in the low scores of this model in Table 1. In Section 5.2, we will present two new versions of the cocktail-fork baseline with improved performance obtained by adjusting the loudness or equalization of DnR during training.

4. Challenge Outcome

The CDX track saw a dynamic evolution in terms of both the number of submissions and the SDR performance. The competition attracted a total of 19 teams for Leaderboard A and 10 teams for Leaderboard B, with 369 and 179 submissions, respectively. Tables 1 and 2 present the final rankings for both leaderboards. The team aim-less emerged as the winner of Leaderboard A, achieving an average SDR of 4.345 dB, while Leaderboard B was topped by JusperLee with an impressive SDR of 8.181 dB. It is noteworthy that while all top five teams in Leaderboard A were from academic institutions, the highest scores in Leaderboard B were obtained by two commercial entities. This diversity of participants underscores the broad interest and applicability of the challenge across both academic and industry sectors. Figure 2 shows the progress that the teams achieved during the course of the competition. We can observe a continuous improvement of the SDR for each source; especially towards the end of the competition, a steady improvement is visible as participants tuned their submissions.

Table 2

Final Leaderboard B (models trained on any data; top 5).


Rank | Participant | Global SDR (dB): Mean / Dialogue / Effects / Music | Submissions to Ldb A + B: 1st phase / 2nd phase | Code

Submissions
1. | JusperLee | 8.181 / 14.619 / 3.958 / 5.966 | 42 / 102 |
2. | Audioshake | 8.077 / 14.963 / 4.034 / 5.234 | 19 / 7 |
3. | ZFTurbo | 7.630 / 14.734 / 3.323 / 4.834 | 25 / 131 | Code11
4. | aim-less | 4.345 / 7.981 / 1.217 / 3.837 | 36 / 153 | Code8
5. | mp3d | 4.237 / 8.484 / 1.622 / 2.607 | 14 / 8 | Code9

Figure 2 

Performance of submissions on full CDXDB23 over time.

To investigate whether this improvement resulted from participants overfitting to the visible portion of the test set, Figure 3 presents the difference between the hidden SDR (the SDR for all clips of CDXDB23 hidden from the participants) and the visible SDR (the SDR for all clips of CDXDB23 shown to the participants). If this difference decreases from one submission to the next, the participant is gaining less (or losing more) on the hidden SDR than on the visible SDR, which hints at overfitting to the displayed global SDR. Hence, “trajectories” of consecutive submissions with negative slopes in Figure 3 can be used to detect overfitting. Intriguingly, some degree of overfitting is apparent for the submissions to Leaderboard B towards the end of the challenge, but less overfitting is observed for submissions to Leaderboard A. For example, looking at the results for the teams JusperLee and Audioshake, we can see a negative trend in their submissions towards the end of the challenge. Especially for team Audioshake, this is visible as the models extracting sound effects and music appear to have been tuned in the last week of the challenge period. Consequently, to reduce the potential effect of overfitting, participants needed to select three submissions at the end of the challenge, which were then evaluated on the full CDXDB23 as discussed in Section 2.4.

Figure 3 

Analysis of overfitting of global SDR. The y-axis shows the difference between global SDR on the hidden test set and global SDR displayed to the participants (trajectories with negative slope indicate overfitting).

The substantial improvement upon the provided cocktail-fork baseline by the participants is noteworthy. This was achieved not only through enhanced architectures, such as MRX-C (Petermann et al., 2023) used by team mp3d, but also through the identification and rectification of two issues inherent in the DnR dataset. Firstly, the presence of vocals in the music category necessitated dataset cleaning. Secondly, the loudness mismatch of DnR resulted in suboptimal performance of systems trained on this dataset, which had to be accounted for by a suitable input normalization, as discussed by team mp3d in Section 4.3.2. Interestingly, none of the top teams explored mono-to-stereo augmentations, which presents an intriguing avenue for future research.

Comparing the results for Leaderboards A and B, we can observe that especially dialogue gains from having access to additional training data. This is in our opinion due to the access to much more speech and vocal material, which can be used as training material for dialogue. Particularly, the inclusion of vocal material proves advantageous due to its similarity to emotional speech. Additionally, the processing pipelines employed in cinematic production may align closely with those utilized in music production, further enhancing the benefit of vocal material.

In order to gain more insight into the benefit of additional data, we show in Figure 4 the performance of the winning submissions on both leaderboards in comparison to the cocktail-fork baseline. Please note that there is only a single clip for movie “000” and, hence, the box plot collapses to a horizontal line. Notably, the most significant disparities between the models trained on DnR and the winning entry in Leaderboard B are observed in animation movies (“002”, “006”) and action movies (“003”, “008”).

Figure 4 

Comparison of the cocktail-fork baseline with winning submissions on both leaderboards for individual movies. For movie “000”, we only have one clip and, hence, the box plot collapses to a horizontal line. Circles represent outliers that are outside the whiskers of the boxplot.

After the conclusion of the challenge, we contacted the top three teams in each leaderboard and invited them to contribute to this manuscript with a description of their approaches. In the following, the teams accepting our invitation present their submissions and discuss them. For the team subatomicseer, which ranked 3rd in Leaderboard A, we refer the interested reader to Fabbro et al. (2024) where the team explains their approach in detail.

4.1 Team JusperLee (Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu)

Final ranking: Leaderboard A: —, Leaderboard B: 1st

4.1.1 Dataset

We used the public Divide and Remaster (DnR) dataset (Petermann et al., 2022), the public deep noise suppression (DNS) dataset (Dubey et al., 2022), the public MUSDB18-HQ dataset (Rafii et al., 2019), and some extra internal data for system training. The extra internal data comprise 150 hours of speech originally collected for a text-to-speech task, 10 hours of cinematic sound effects, and 100 hours of cinematic background music.

One important step in our data preprocessing pipeline was that we found that the effect and music signals in both the DnR dataset and our internal dataset may contain human voice. We thus used a music source separation (MSS) model to preprocess all the effect and music signals to remove the “speech” or “vocal” signals from them. We found that doing this significantly improved the systems’ performance compared to directly using the original signals for training.

4.1.2 Methods

a) On-the-fly Data Mixing – We performed on-the-fly data mixing during training to increase the variety of the training data mixtures. For each mixture utterance, we randomly sampled 0–1 speech or vocal signals (we also treated a vocal signal as a form of dialogue signal in our setting), 0–2 music signals, and 0–3 effect signals, and rescaled each of them by a random energy gain in the range [–10, 10] dB. We truncated the signals to 3 seconds and then added them up to form the mixture. The sums of the individual music and effect signals were set as the training targets for the two tracks, respectively.
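A minimal sketch of this mixing strategy is given below, assuming mono source pools (DnR is monaural) and numpy arrays; the team's exact sampling and cropping details are not public, so treat this as illustrative only.

```python
import numpy as np

def sample_training_mixture(speech_pool, music_pool, fx_pool, sr=44100, dur=3.0,
                            rng=np.random):
    """Illustrative on-the-fly mixing: 0-1 speech/vocal, 0-2 music, 0-3 effect signals."""
    n = int(sr * dur)

    def draw(pool, k_max):
        signals = []
        for _ in range(rng.randint(0, k_max + 1)):          # 0..k_max sources of this class
            s = pool[rng.randint(len(pool))]
            start = rng.randint(max(1, len(s) - n))
            s = s[start:start + n]
            s = np.pad(s, (0, n - len(s)))                  # pad clips shorter than 3 s
            gain_db = rng.uniform(-10.0, 10.0)              # random energy gain in dB
            signals.append(s * 10.0 ** (gain_db / 20.0))
        return np.sum(signals, axis=0) if signals else np.zeros(n)

    targets = {"DX": draw(speech_pool, 1),                  # the sums are the training targets
               "MX": draw(music_pool, 2),
               "FX": draw(fx_pool, 3)}
    mixture = targets["DX"] + targets["MX"] + targets["FX"]
    return mixture, targets
```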

b) Model Design – Our system consists of three independent models, one for each of the dialogue, effect, and music sources. All models share the same architecture, the band-split RNN (BSRNN) we proposed for the MSS task (Luo and Yu, 2023). For the dialogue track, we directly use a BSRNN model trained for music source separation instead of the CDX task, as we eventually found that an MSS model trained on music-only data to extract the vocal track from the accompaniment leads to a better SDR score on the hidden test set than a speech extraction model trained on speech data (see the discussion section for more on this observation). For the effect and music sources, we used two separate BSRNN models: we first used the MSS model to subtract the separated dialogue signal from the mixture to create a pseudo music-and-effects-only mixture, and then trained the two models on this mixture to perform a slightly simpler separation task. We found that this leads to better performance than training the two models on mixtures containing dialogue data, and also better than training on mixtures without speech or vocal signals.

We used the standard BSRNN architecture, for which we do not include a detailed description here for the sake of brevity. The band-split scheme we used for all models was identical to the one we proposed in the original paper (Luo and Yu, 2023). The number of sequence and band modeling modules in the effect and music models were 8 and 12, respectively, and the feature dimension N was set to 64 and 128, respectively.

c) Training Configurations – All models were trained with the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001. We used 8 GPUs for each model with a per-GPU batch size of 2. Each training epoch contained 10k iterations, and the learning rate was decayed by 0.98 every two epochs. We did not apply early stopping as the evaluation was done on the hidden test set, and we submitted the latest model to the grading system every day to find the best model.

4.1.3 Results and Discussions

Our system achieved #1 on the Final Leaderboard B of the CDX challenge. Compared with the other top-ranking systems, it performed significantly better on the music source and on par or slightly worse on the other two sources; the overall improvement thus mainly came from the music source.

To better understand the effect of our vocal-removal preprocessing on DnR, we did an ablation study where we trained two BSRNN models: one using the original DnR dataset, and the other using DnR after applying vocal-removal preprocessing to the music and sound effect sources. Both models were configured identically and their performance was evaluated on CDXDB23 using the AIcrowd evaluation system. Compared to the model trained on the original DnR dataset, the one trained on the vocal-removed DnR dataset achieved 1.32 dB overall SDR improvement on the challenge’s test set. This confirmed our hypothesis and proved that vocal-removal for the music and sound effect class during training with DnR is an important step in our pipeline.

Another interesting observation we had was about the dialogue source – we initially tried to treat the “dialogue separation” task as a “speech enhancement” task which aims at removing any non-speech components out of the mixture, and we trained systems based on both our speech enhancement system which ranked 3rd in the 5th DNS challenge (Yu et al., 2022, 2023) and our MSS system (Luo and Yu, 2023) with the extra cinematic data. We perceptually evaluated the systems’ performance on internal movie data and found the quality of their outputs satisfying. However, all model weights trained in this fashion could not achieve 13 dB SDR on the hidden test set, no matter how we adjusted the training pipeline or the model design. Later we tried to directly submit the original MSS system trained on music-only data (MUSDB18-HQ and another internal music dataset), and the performance of the dialogue source on Leaderboard B suddenly reached 15 dB SDR. One possible explanation is that there might still exist non-speech human sounds that are categorized as noise by the speech enhancement system but identified as vocals by the music separation system, possibly due to the differences in the training data as well as the data mixing strategies used during training.

4.2 Team ZFTurbo (Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva)

Final ranking: Leaderboard A: —, Leaderboard B: 2nd

4.2.1 Approach

Our approach is based on an ensemble of models, each suited best for a particular stem. As we noticed that dialogue can be extracted with high quality by a vocal model originally trained for music separation, we first separate the dialogue with this model and then apply a model trained on the DnR dataset to the remaining part (music and sound effects). The source code is publicly available on GitHub.12

To compare different models, we developed new benchmarks and leaderboards for sound demixing (Solovyev et al., 2023).13 It can be seen from this leaderboard that various hybrid transformer demucs (HT demucs) (Rouard et al., 2023) models dominate all stems except for vocal separation, where models based on the MDX algorithm (Kim et al., 2021) are best. Therefore, ensembles of different models for the vocal and non-vocal stems are expected to provide the best overall performance.

To separate the dialogue, we used a combination of three pre-trained vocal models: UVR-MDX114 and UVR-MDX215 from the Ultimate Vocal Remover project,16 and HT demucs (finetuned).17 The vocals were separated independently by all of these models and the results were combined with weights:

$\hat{s}_{\text{DX},1} = \text{UVR-MDX1}(x,\ \text{overlap}=0.6),$
$\hat{s}_{\text{DX},2} = \text{UVR-MDX2}(x,\ \text{overlap}=0.6),$
$\hat{s}_{\text{DX},3} = \text{HT-demucs}(x,\ \text{`demucs\_ft'},\ \text{shifts}=1,\ \text{overlap}=0.6).$

We tried different weights for the DX ensemble and optimized them by considering two datasets that we created: “Multisong MVSep” and “Synth MVSep” as detailed by Solovyev et al. (2023). The weights 10, 4, and 2 for UVR-MDX1, UVR-MDX2, and HT demucs, respectively, produced the best results on these two datasets. Interestingly, we observed that the models with the best SDR for vocal extraction were also the best for dialogue and, hence, we can rely on the results of Solovyev et al. (2023).
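The blending itself reduces to a weighted average of the three stereo estimates. In the sketch below, normalizing the weights to sum to one is our assumption, as only the raw weights 10, 4, and 2 are reported.

```python
import numpy as np

# Raw blend weights for UVR-MDX1, UVR-MDX2, and HT demucs (finetuned).
WEIGHTS = np.array([10.0, 4.0, 2.0])

def blend_dialogue(estimates):
    """Weighted average of three dialogue estimates, each of shape (num_samples, 2)."""
    w = WEIGHTS / WEIGHTS.sum()                # normalize so the blend preserves the overall scale
    return np.tensordot(w, np.stack(estimates), axes=1)
```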

After obtaining the high-quality dialogue part, we subtract it from the original track to obtain the non-dialogue part. To separate it into the two remaining stems, we trained two versions of the HT demucs model (Rouard et al., 2023) on the DnR dataset. The first HT demucs model was trained using the standard protocol for all three stems, while the second was trained on only two stems, sound effects and music, excluding dialogue. Table 3 shows the global SDR on CDXDB23; the 2-stem model yields better scores, and music in particular benefits from the simplified training mixtures, improving by 2.5 dB. Interestingly, blending both models is still beneficial, as can also be seen from Table 3, where we blended four checkpoints from the 2-stem HT demucs training with seven checkpoints of the 3-stem HT demucs training, giving each the same weight. Please note that we also updated the vocal model in this submission and, hence, there is also a slight improvement for vocals compared to the individual models. Consequently, for the final submission, we used several checkpoints of each of the 2-stem and 3-stem models to average predictions and obtain better generalization.
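The overall cascade can then be sketched as below, reusing the `blend_dialogue` helper from the previous sketch. Whether the 3-stem checkpoints were run on the residual or on the full mixture is not specified, so this sketch simply averages all checkpoints on the dialogue-free residual.

```python
import numpy as np

def separate_cinematic(x, vocal_models, htdemucs_checkpoints):
    """Sketch of the cascade: blend dialogue, subtract it, then separate the rest."""
    dx = blend_dialogue([m(x) for m in vocal_models])          # dialogue via vocal models
    residual = x - dx                                          # pseudo music-and-effects mixture
    outs = [ckpt(residual) for ckpt in htdemucs_checkpoints]   # each returns {'FX': ..., 'MX': ...}
    fx = np.mean([o["FX"] for o in outs], axis=0)              # equal-weight checkpoint averaging
    mx = np.mean([o["MX"] for o in outs], axis=0)
    return {"DX": dx, "FX": fx, "MX": mx}
```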

Table 3

Comparison of 2-stem and 3-stem HT demucs models trained on DnR and evaluated on CDXDB23 (Team ZFTurbo).


Model | Global SDR (dB): Mean / Dialogue / Effects / Music

HT demucs trained on 2-stem mix | 7.560 / 14.532 / 3.355 / 4.794
HT demucs trained on 3-stem mix | 6.692 / 14.530 / 3.277 / 2.269
Ensemble of 2- and 3-stem HT demucs | 7.630 / 14.734 / 3.323 / 4.834

4.2.2 Results

Table 4 shows the results of the ablation study with and without separate dialogue removal. We used two validation sets: val1, a validation on the two tracks provided by the organizers, and val2, a subset of 20 random tracks from the DnR test set. The single HT demucs model showed promising results on both validation sets (see Table 4). However, its performance was poor on CDXDB23, and the best results were only obtained with our ensemble model, which first extracts the dialogue with a vocal model from music separation and then employs HT demucs for sound effect and music separation. The val1 results correlated better with the CDXDB23 dataset than val2, which is based on the DnR dataset. Still, the metrics on val1 did not strongly correlate with the final results, presumably due to the tiny size of the demo set.

Table 4

Comparison of single model HT demucs with final ensemble model (Team ZFTurbo).


Model | Global SDR on val1 (dB): Mean / Dialogue / Effects / Music | Global SDR on val2 (dB): Mean / Dialogue / Effects / Music | Global SDR on CDXDB23 (dB): Mean / Dialogue / Effects / Music

HT demucs (single) | 6.387 / 13.887 / 2.781 / 2.494 | 9.634 / 14.151 / 7.740 / 7.012 | 2.602 / 6.650 / 0.648 / 0.507
CDX23 best ensemble model | 8.922 / 14.927 / 3.780 / 8.060 | 7.585 / 9.949 / 6.377 / 6.429 | 7.630 / 14.734 / 3.323 / 4.834

4.2.3 Discussion

During the competition, we noticed that the music part of the DnR dataset sometimes contains vocals. Our SDR for dialogue on the leaderboard is very high, even though our vocal model extracts all vocals from the audio. Based on this, we concluded that the music in the competition dataset most likely never contains vocals.

4.3 Team mp3d (Mikhail Sukhovei)

Final ranking: Leaderboard A: 2nd, Leaderboard B: 5th

4.3.1 Approach

Since the training data for Leaderboard A was restricted to the DnR dataset, we focused on identifying the shortcomings of the baseline multi-resolution crossnet (MRX) (Petermann et al., 2022) model and improving it. We implemented a modification of this model called conditional multi-resolution crossnet (MRX-C) (Petermann et al., 2023). The essence of this modification is to train an additional CRNN model that predicts source activity labels. The output of the MRX model (which estimates music, dialogue, and sound effects) is converted into a mel-spectrogram and concatenated with the mel-spectrogram of the original audio to form a $(4, n_\text{mels}, n_\text{freq})$ tensor, which is then fed into the CRNN.
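To make the conditioning input concrete, the sketch below assembles such a tensor with torchaudio; the mel parameters, the channel averaging, and the ordering of the sources are our assumptions, not the team's exact configuration.

```python
import torch
import torchaudio

def crnn_input(mixture, mrx_estimates, sr=44100, n_mels=64):
    """Stack mel-spectrograms of the mixture and the three MRX estimates into one tensor.

    mixture and each estimate: tensors of shape (channels, num_samples).
    Returns a tensor of shape (4, n_mels, n_frames) for the CRNN.
    """
    melspec = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=n_mels)
    signals = [mixture, mrx_estimates["DX"], mrx_estimates["FX"], mrx_estimates["MX"]]
    mels = [melspec(s.mean(dim=0)) for s in signals]   # collapse channels, each (n_mels, n_frames)
    return torch.stack(mels)
```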

To further improve our solution, we analyzed the effect of the Wiener filter on the final score as well as the influence of post-processing source scaling on the final SDR score.

4.3.2 Results

During the competition, we observed that validation on DnR differs significantly from the metrics obtained on CDXDB23. This difference arises due to the dependence of the MRX and MRX-C model performance on the volume of the input signal. For testing on DnR, the optimal value was found to be –27 LUFS, which yielded the maximum SDR value. To obtain the optimal input volume for real-world data, we propose a Realistic Evaluation Dataset (RED) consisting of 26 stereo audio tracks of 20 seconds each. These audio samples were manually compiled, with an average sample volume of approximately –15 LUFS and a sampling rate of 44.1 kHz. Additionally, each audio file includes separate tracks for dialogue, music, and effects, all with the same duration and with average sample volumes of –24.4 ± 4.5 LUFS for dialogue, –18.8 ± 0.5 LUFS for effects, and –18.4 ± 5.0 LUFS for music. All the original audio files are sourced from the open archive.org platform. The RED dataset was used only to select the optimal input volume for the model.

As shown in Figure 5, the optimal volume value differed significantly from that of DnR, both in the case of RED and CDXDB23. Furthermore, the SDR metric on RED was more consistent with CDXDB23. After separating the sources, they were brought back to the original volume, and a Wiener filter and post-processing scaling were applied. Post-processing scaling involved multiplying the estimated sources by a factor of $1/\alpha$ where $\alpha = \frac{\sum_n x(n)^T \hat{s}(n)}{10^{-7} + \sum_n \|\hat{s}(n)\|^2}$, $x(n)$ is the mixture, and $\hat{s}(n)$ is an estimated source.
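The input-volume handling can be sketched with the pyloudnorm package (the same tool used for the loudness analysis in Section 5.1, though whether the team used it is not stated). The target of –27 LUFS below is the optimum found on DnR and would be replaced by the value selected on RED; the dictionary interface of the separator is our assumption.

```python
import pyloudnorm as pyln

def separate_with_input_normalization(x, separate_fn, sr=44100, target_lufs=-27.0):
    """Run the separator at a fixed input loudness and restore the original volume.

    x: stereo mixture of shape (num_samples, 2); separate_fn returns a dict of sources.
    """
    meter = pyln.Meter(sr)                                  # ITU-R BS.1770 meter
    gain = 10.0 ** ((target_lufs - meter.integrated_loudness(x)) / 20.0)
    estimates = separate_fn(gain * x)                       # separate at the chosen loudness
    return {src: s / gain for src, s in estimates.items()}  # bring sources back to the original volume
```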

Figure 5 

SDR dependencies on the input volume in LUFS for music, dialogue, and effects. A solid line shows SDR values on RED; crosses mark SDR on CDXDB23. Horizontal dashed and dotted lines show SDR for models without converting the volume of the input signal. The MRX model is blue, MRX-C is orange, MRX-C with a Wiener filter is green, and MRX-C with post-processing scaling is red. In the case of testing MRX-C scaling on the CDXDB23, the SDR values are only available for effects (Team mp3d).

Table 5 summarizes the SDR metrics on RED for the baseline MRX solution, MRX-C, MRX-C with Wiener filter, and MRX-C with source scaling. As shown in the table, the MRX-C model yields a 0.1 dB improvement in dialogue and a 0.1 dB decrease in effects. The Wiener filter yields a 0.1 dB improvement in music, a 0.3 dB improvement in dialogue, and a 0.3 dB improvement in effects. Post-processing scaling results in a 0.2 dB decrease in music, a 0.6 dB decrease in dialogue, and a 0.3 dB increase in effects. Our final solution involved applying the Wiener filter only to the dialogue and post-processing scaling only to the effects.

Table 5

SDR values obtained during testing on RED for MRX, MRX-C, MRX-C with Wiener filter, and MRX-C with scaling. The SDR values in the table are the maximum values over all input volumes (Team mp3d).


Model | Global SDR (dB): Mean / Dialogue / Effects / Music

MRX | 4.38 / 8.38 / 1.72 / 3.02
MRX-C | 4.36 / 8.48 / 1.62 / 2.99
MRX-C Wiener | 4.57 / 8.75 / 1.90 / 3.07
MRX-C scaling | 4.24 / 7.92 / 1.95 / 2.85

4.3.3 Discussion

In the future, we plan to study the effect of the additional activity labels contained in the DnR data on the accuracy of the MRX-C model. Additionally, to use the model on real data, we need to make the source separation result independent of the input volume. Unlike the approaches of other teams in this competition, we focused on training a single model rather than an ensemble of models. A key aspect of our solution is the normalization of the mixture at the input of the model.

5. Distribution Mismatch between DnR and CDXDB23

In the following, we will analyze the distribution mismatch between DnR and CDXDB23 that was noticed by the participants in Section 4. This will give us insight into the recording and production differences as well as allow us to train two improved cocktail-fork models with an adjusted version of DnR.

5.1 Difference in Signal Statistics

First, we will compare the signal characteristics between DnR and CDXDB23. Our focus will be on loudness, equalization, stereo panning, and dynamic range compression as they are key elements in audio mixing (Martínez-Ramírez et al., 2022).18

Loudness—We measured the loudness of each audio clip in both datasets according to ITU-R BS.1770-4 (International Telecommunications Union, 2015) with the help of pyloudnorm (Steinmetz and Reiss, 2021). The average loudness values are shown in Table 6 and the histograms can be found in Figure 6. We can observe that CDXDB23 utilizes the full range of loudness for all three classes, whereas DnR has a more limited range. On average, DnR is 4 LUFS louder than CDXDB23. Notably, CDXDB23 uses the same loudness level for effects and music, which is about 5 LUFS lower than that of dialogue. This balance is likely due to a post-production step which was not considered in DnR.
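A sketch of this measurement using pyloudnorm is shown below; the aggregation into mean and standard deviation mirrors the numbers reported in Table 6, while the file-reading details are our assumption.

```python
import numpy as np
import soundfile as sf
import pyloudnorm as pyln

def loudness_stats(stem_paths):
    """Integrated loudness (LUFS) statistics over a list of stem files."""
    values = []
    for path in stem_paths:
        audio, sr = sf.read(path)                 # (num_samples,) or (num_samples, channels)
        meter = pyln.Meter(sr)                    # ITU-R BS.1770 meter
        values.append(meter.integrated_loudness(audio))
    return np.mean(values), np.std(values)
```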

Table 6

Loudness and Dynamic Range Compression (DRC) statistics for DnR and CDXDB23.


Measure | Divide and Remaster (DnR): Dialogue / Effects / Music | CDXDB23: Dialogue / Effects / Music

Loudness (LUFS) | –24.4 ± 1.3 / –29.7 ± 1.9 / –31.4 ± 1.8 | –28.4 ± 3.1 / –33.9 ± 8.0 / –33.6 ± 7.1
DRC (dB) | –10.7 ± 0.9 / –5.1 ± 2.4 / –12.6 ± 1.4 | –11.4 ± 1.3 / –10.6 ± 3.7 / –11.2 ± 2.3

Figure 6 

Comparison of loudness between DnR and CDXDB23.

Equalization—To assess equalization differences, we normalized each waveform to –24 LUFS and calculated the magnitude STFT spectrogram using a Hann window of 4096 samples with 75% overlap. The average equalization curves are displayed in Figure 7. It shows that CDXDB23 generally has a faster decay at low and high frequencies but more energy in the mid-frequency range compared to DnR. We attribute this to the use of parametric EQs containing low and high shelf filters in the post-production process for CDXDB23, while DnR consists mostly of web content, which likely lacked professional post-production.
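The average equalization curve of one clip can be computed as sketched below; collapsing the channels to mono before the STFT is a simplification on our side.

```python
import numpy as np
import pyloudnorm as pyln
from scipy.signal import stft

def average_eq_curve(audio, sr, target_lufs=-24.0):
    """Average magnitude spectrum of one clip after loudness normalization."""
    meter = pyln.Meter(sr)
    gain = 10.0 ** ((target_lufs - meter.integrated_loudness(audio)) / 20.0)
    mono = gain * (audio.mean(axis=1) if audio.ndim == 2 else audio)
    # Hann window of 4096 samples with 75% overlap, as described in the text.
    freqs, _, Z = stft(mono, fs=sr, window="hann", nperseg=4096, noverlap=3072)
    return freqs, np.abs(Z).mean(axis=1)          # average over time frames
```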

Figure 7 

Comparison of average equalization between DnR and CDXDB23. Dashed curves give one standard deviation above/below average.

Amplitude panning—We calculated the Stereo Panning Spectrum as outlined by Tzanetakis et al. (2007); Avendano (2003) and as used by Martínez-Ramírez et al. (2022). Using the magnitude spectrogram from the earlier equalization analysis, we computed

(5a)
$\Psi(f) = \frac{2\, X_L(f)\, X_R(f)}{X_L(f)^2 + X_R(f)^2},$
(5b)
$\Delta(f) = \operatorname{sign}\bigl(\Psi_L(f) - \Psi_R(f)\bigr) = \operatorname{sign}\!\left(\frac{X_L(f)\, X_R(f)}{X_L(f)^2} - \frac{X_L(f)\, X_R(f)}{X_R(f)^2}\right),$

where $X_{L/R}(f)$ denotes the left/right channel magnitude spectrogram. $\Psi(f)$ measures whether frequencies are panned, regardless of direction, while $\Delta(f)$ measures which direction frequencies are panned to. Figures 8 and 9 show the average values for both datasets. It is important to note that DnR is monaural, which results in a horizontal line for $\Psi(f)$ and $\Delta(f)$. From these figures, we see that in the audio mixing process, dialogue is typically centered, while effects are more often panned to one side. Music shows the most varied panning. Figure 9 indicates that there is no specific preferred direction for panning in the datasets.
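In code, Equations (5a) and (5b) translate directly to the magnitude spectrograms of the two channels; the small constant avoiding division by zero and the averaging over time frames are our additions.

```python
import numpy as np
from scipy.signal import stft

def panning_spectrum(x, sr, eps=1e-12):
    """Psi(f) and Delta(f) of Equations (5a)-(5b) for a stereo clip x of shape (num_samples, 2)."""
    _, _, L = stft(x[:, 0], fs=sr, window="hann", nperseg=4096, noverlap=3072)
    _, _, R = stft(x[:, 1], fs=sr, window="hann", nperseg=4096, noverlap=3072)
    XL, XR = np.abs(L), np.abs(R)                                   # magnitude spectrograms
    psi = 2.0 * XL * XR / (XL ** 2 + XR ** 2 + eps)                 # Equation (5a)
    delta = np.sign(XL * XR / (XL ** 2 + eps)                       # Equation (5b)
                    - XL * XR / (XR ** 2 + eps))
    return psi.mean(axis=1), delta.mean(axis=1)                     # average over time frames
```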

Figure 8 

Comparison of average amplitude panning between DnR and CDXDB23. The channel amplitude similarity $\Psi(f)$ can take values $0 \leq \Psi(f) \leq 1$, where $\Psi(f) = 1$ refers to panning frequency f to the center whereas $\Psi(f) < 1$ denotes a panning to either side. Dashed curves give one standard deviation above/below average. Please note that DnR is monaural and, hence, $\Psi(f)$ collapses to a horizontal line at $\Psi(f) = 1$.

Figure 9 

Comparison of average amplitude panning between DnR and CDXDB23. $\Delta(f) = \operatorname{sign}(\Psi_L(f) - \Psi_R(f))$ denotes the panning direction, where $\Delta(f) < 0$ refers to a panning to the left and $\Delta(f) > 0$ to a panning to the right. Dashed curves give one standard deviation above/below average. Please note that DnR is monaural and, hence, $\Delta(f)$ collapses to a horizontal line at $\Delta(f) = 0$.

Dynamic range compression (DRC)—Lastly, we analyzed the DRC by calculating the average peak value, as DRC usually alters the transients. We started by normalizing the loudness of the audio waveform to –24 LUFS. Then, we used the high frequency content (HFC) method for onset detection, as described by Masri (1996); Brossier et al. (2019) and as implemented by Martínez-Ramírez et al. (2022). For each audio clip, we calculated the average peak level $P_\mu$. This measure helps us understand the extent of DRC applied; larger $P_\mu$ values indicate less compression since the peaks are more pronounced at the same loudness level. From the data in Table 6, we see that CDXDB23 exhibits more uniform compression compared to DnR. Notably, in DnR, effects are less compressed than dialogue and music. This contrast is not observed in CDXDB23, where compression is applied more consistently across all three classes due to the professional production.
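The sketch below mimics this analysis with a simple numpy implementation of the HFC onset detector instead of the original implementation by Martínez-Ramírez et al. (2022); the frame alignment and the peak-picking rule are crude approximations.

```python
import numpy as np
from scipy.signal import stft

def average_peak_level(x, sr, frame=4096, hop=1024, eps=1e-12):
    """Approximate average peak level P_mu (dB) around HFC-detected onsets."""
    mono = x.mean(axis=1) if x.ndim == 2 else x
    _, _, Z = stft(mono, fs=sr, window="hann", nperseg=frame, noverlap=frame - hop)
    mag = np.abs(Z)
    hfc = np.sum(mag * np.arange(mag.shape[0])[:, None], axis=0)   # high frequency content per frame
    # Crude peak picking: local maxima that stand out from the mean HFC.
    is_peak = (hfc[1:-1] > hfc[:-2]) & (hfc[1:-1] > hfc[2:]) & (hfc[1:-1] > hfc.mean() + hfc.std())
    peaks_db = []
    for k in np.where(is_peak)[0] + 1:
        segment = mono[k * hop: k * hop + frame]                   # samples roughly under frame k
        if segment.size:
            peaks_db.append(20.0 * np.log10(np.max(np.abs(segment)) + eps))
    return np.mean(peaks_db) if peaks_db else np.nan
```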

5.2 Improving the Cocktail-Fork Baseline

Using our understanding of the distribution differences from the previous section, we aim to enhance the cocktail-fork model from Section 3.3. We created two updated versions of DnR, modifying either the average loudness or the average equalization for each audio source. For the loudness adjustment, we changed the loudness of each source stem by a fixed amount such that the average loudness of the modified DnR matches that of CDXDB23; for example, each dialogue stem was lowered by 4 LUFS (from –24.4 to –28.4 LUFS). Regarding equalization, we designed a 101-tap FIR filter for each audio source. The magnitude response of this filter is the square root of the difference between the average equalization of CDXDB23 and DnR. We then applied this filter to each stem using forward-backward filtering (filtfilt), as mentioned by Martínez-Ramírez et al. (2022). This process modifies the amplitude without altering the phase of the audio.
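A possible realization of this per-source filter with scipy is sketched below; it interprets the equalization difference as a ratio of the average magnitude curves on the analysis frequency grid from Section 5.1, which is an assumption on our side.

```python
import numpy as np
from scipy.signal import firwin2, filtfilt

def eq_matching_filter(freqs, eq_cdx, eq_dnr, sr, numtaps=101):
    """101-tap FIR whose squared response (filtfilt applies it forward and backward)
    matches the average equalization of CDXDB23 relative to DnR for one source class."""
    desired = np.sqrt(eq_cdx / eq_dnr)               # square root, since filtfilt squares the response
    # firwin2 expects frequencies normalized to [0, 1] with 1 being Nyquist.
    return firwin2(numtaps, freqs / (sr / 2.0), desired)

# Zero-phase application to one DnR stem:
# adapted_stem = filtfilt(eq_matching_filter(freqs, eq_cdx, eq_dnr, sr), [1.0], stem)
```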

Table 7 shows the improvements to the cocktail-fork model. Please note that an additional feature for mixture loudness normalization was introduced with version 1.1, setting the mixture to –27 LUFS. We will discuss the results without this normalization, although similar trends are observed with it. The results in Table 7 indicate that adjusting the loudness is particularly effective: the mean SDR improved from –0.1 dB to 1.3 dB, primarily due to enhanced performance on dialogue. Adjusting the equalization also showed benefits, with an overall improvement of 0.3 dB. Here, both dialogue and effects improved, but there was a slight decrease of 0.2 dB for music. We believe this decrease is linked to the fluctuation in the equalization curve shown in Figure 7, which correlates with the frequencies of musical notes. This fluctuation could inadvertently act as a marker on the music stems during training, making them easier to separate on the adapted DnR but not on CDXDB23, leading to the slight decline in music performance. Both trained models are available on the cocktail-fork GitHub.6

Table 7

Results on CDXDB23 for training the cocktail-fork model with adjusted DnR versions, where we matched either the average loudness or the average equalization of CDXDB23. “Input norm” refers to the loudness normalization to –27 LUFS introduced with version 1.1 of the cocktail-fork model.


Training Dataset | Global SDR w/o input norm (dB): Mean / Dialogue / Effects / Music | Global SDR w/ input norm (dB): Mean / Dialogue / Effects / Music

DnR | –0.104 / 4.108 / –2.018 / –2.401 | 0.325 / 4.662 / –1.979 / –1.707
DnR w/ adapted loudness | 1.287 / 6.535 / –1.506 / –1.168 | 1.539 / 6.727 / –1.278 / –0.832
DnR w/ adapted equalization | 0.176 / 4.621 / –1.470 / –2.623 | 0.544 / 4.922 / –1.212 / –2.078

In summary, by aligning the DnR dataset more closely with a realistic dataset like CDXDB23, we significantly enhanced the performance of the model. This approach presents a promising avenue for future research in this field. Besides using mono-to-stereo augmentation for stereo panning and compressors to adjust the DRC, a combination of all of these adjustments should be considered to close the distribution mismatch as much as possible. The statistics reported in Section 5.1 can help to choose realistic parameters for such data augmentation.

6. Summary and Outlook

The CDX track of SDX’23 has provided valuable insights into the current state-of-the-art in cinematic audio separation and has highlighted areas for future research and development.

Looking at the results for Leaderboard A, we can observe that models suffered from the constraint of only being allowed to utilize the DnR dataset. This caused a “simulation-to-reality” gap where models were trained on simulated data, but evaluated on real-world data (CDXDB23). In particular, the following challenges were identified by the participants:

  • The DnR dataset sometimes contains vocals within the music, leading to confusion during model training. Removing these vocals prior to training was found to enhance model performance.
  • The dialogue data in the DnR dataset, being read speech, lacks emotional speech elements such as shouting, as well as other human sounds like breathing or humming. This absence posed a challenge for the models.
  • Lastly, a mismatch in loudness was observed between the training and evaluation data. If not accounted for, this mismatch could lead to suboptimal model performance as we also saw in Section 5.

Hosting a competition like SDX’23 allows such issues to be identified and addressed, thereby contributing significantly to the field. Moreover, allowing participants to utilize additional data, as in Leaderboard B, proved beneficial. In particular, an improvement of approximately 6 dB was observed for dialogue when comparing the results of Leaderboard B to those of Leaderboard A.

Another interesting observation was the successful application of cascaded approaches, which initially filtered out dialogue. The effectiveness of this strategy can likely be attributed to two factors. First, the substantial amount of available data for vocals and the existence of highly efficient models, honed through research in the field of music separation (Stöter et al., 2018; Mitsufuji et al., 2022; Fabbro et al., 2024), provide a strong foundation for dialogue extraction. Second, vocals, which include sounds like breathing, bear a close resemblance to dialogue, thereby facilitating the extraction of dialogue from movies. As cinematic separation is a relatively young field, further advances are required, particularly in the extraction of sound effects and music.

Looking forward to the next challenge, we anticipate further advances in the field. Exploring uncharted areas such as mono-to-stereo augmentation is interesting and could have a positive impact on performance. We also aim to encourage participants to develop models that are robust to variations in the input data, such as in loudness. The goal remains to push the boundaries of what is possible in cinematic audio separation and to continue fostering innovation in this exciting field. This first edition of the challenge has demonstrated how models, data, and concepts from music separation can be utilized to enhance cinematic separation. We believe that this represents an initial stage, and we look forward to the future development of more specialized ideas and approaches, also exploiting the signal statistics presented in Section 5.1.