Exploring the Potential of State-of-the-Art Speaker Diarization Frameworks for Multilingual Multi-Speaker Conversational Audio
Abstract—This study investigates the potential of three key speaker diarization approaches, namely Modularized Diarization, End-to-End Neural Diarization, and Multi-scale Speaker Diarization, in handling multilingual, multi-speaker conversational audio. As voice-based systems become more prevalent across various languages, achieving high accuracy in speaker diarization within complex multilingual settings is crucial. We assess state-of-the-art diarization frameworks, including Pyannote, Silero, and NVIDIA's NeMo, using the DISPLACE development set, a challenging multilingual dataset characterized by multiple speakers, frequent language switching, and overlapping speech. The evaluation focuses on each framework's capability to maintain high speaker identification accuracy and diarization quality in such demanding scenarios. Our findings highlight the comparative strengths and limitations of Pyannote, Silero, and NeMo, providing insights into their suitability for real-world multilingual applications and pointing out areas for further improvement.
Index Terms—Speaker diarization, Speaker embedding,
Speech processing, Speaker segmentation, Spectral clustering,
Embedding extractor, Deep neural networks
I. INTRODUCTION
Speaker diarization is the process of segmenting an audio recording into homogeneous segments so that each segment can be associated with a speaker label. The primary goal is to identify and distinguish different speakers in the audio data [1]. This technique contributes to a more comprehensive understanding of spoken content and facilitates various applications in fields such as speech recognition, transcription, audio indexing, voice assistants, security, and surveillance. Generally, an ideal speaker diarization system (SDS) should not rely on prior knowledge of the speakers, the number of speakers, or even the language of the audio. In favourable conditions, state-of-the-art diarization systems can attain error rates almost equivalent to human performance. Nevertheless, even top-tier diarization systems encounter challenges in identifying speakers under adverse conditions such as speaker variability, varying acoustic conditions, variability in speaker change time (dwell time), large amounts of overlapping speech, and meeting environments in multilingual scenarios. There is therefore a persistent need for a robust SDS that produces acceptable results even in such adverse scenarios [2]. The foundational principles and structures of speaker diarization systems can be classified into three distinct categories:

(a) Modularized Diarization: This approach is structured around a sequence of interconnected sub-modules, encompassing speaker segmentation, speaker embedding for generating distinct speaker representations, and clustering to group speech segments by speaker identity, as illustrated in Fig. 1. Typically, both agglomerative hierarchical clustering (AHC) and spectral clustering (SC) are applicable in the speaker diarization context. The speaker segmentation module consists of distinct sub-tasks: voice activity detection (VAD) first filters out non-speech regions, followed by speaker change detection (SCD), which identifies transitions between speakers. These sub-modules are typically trained and optimized independently. However, this method has two main limitations: firstly, errors from each stage can propagate, potentially leading to compounded errors; secondly, it does not inherently address overlapping speech regions, requiring additional postprocessing. Usually, the clustering output is subsequently refined using a variational Bayes hidden Markov model (VB-HMM) with posterior scaling.

Fig. 1. (a) The speaker diarization system. (b) The pipeline for a traditional modularized speaker diarization system.
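To make the modularized pipeline of category (a) concrete, the following minimal sketch strings the three stages together. The `run_vad` and `embed_segment` helpers are hypothetical placeholders for any concrete VAD and embedding extractor; only the clustering step uses a real library call (scikit-learn 1.2 or later is assumed for the `metric` argument).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def run_vad(waveform, sample_rate):
    # Hypothetical placeholder: return a list of (start, end) sample indices
    # of speech regions, as any VAD module would.
    raise NotImplementedError

def embed_segment(waveform, sample_rate, start, end):
    # Hypothetical placeholder: return a fixed-size speaker embedding
    # (e.g., an x-vector) for waveform[start:end].
    raise NotImplementedError

def modular_diarization(waveform, sample_rate, distance_threshold=0.7):
    """Sketch of a modularized SDS: VAD -> per-segment embeddings -> AHC."""
    segments = run_vad(waveform, sample_rate)
    embeddings = np.stack([embed_segment(waveform, sample_rate, s, e)
                           for s, e in segments])
    # Agglomerative hierarchical clustering over cosine distances; the
    # number of speakers is inferred from the distance threshold.
    ahc = AgglomerativeClustering(n_clusters=None,
                                  distance_threshold=distance_threshold,
                                  metric="cosine", linkage="average")
    labels = ahc.fit_predict(embeddings)
    return list(zip(segments, labels))
```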
(b) End-to-End Neural Diarization (EEND): In this paradigm, the traditional sub-modules of speaker diarization are replaced by a single neural network. The EEND approach directly consumes the audio input and generates the diarization output, thus minimizing error propagation. The problem is framed as a multi-label classification task, enabling simultaneous recognition of multiple overlapping speakers. Permutation-invariant training is pivotal, transforming clustering, an unsupervised task, into a supervised classification task. However, training such networks necessitates substantial data, presenting challenges for practitioners. Some approaches tackle this by integrating end-to-end speaker segmentation [3], concurrently training VAD, SCD, and overlapped speech detection (OSD), followed by global agglomerative clustering.

(c) Multi-scale Speaker Diarization: In speaker embedding extraction, attaining high-quality representation vectors often demands sacrificing temporal resolution by analyzing longer speech segments. As a result, speaker diarization systems are consistently confronted with a trade-off between two factors: temporal resolution and representation quality. In conventional diarization systems, audio segments typically span 1.5 to 3.0 seconds, as these durations strike a balance between speaker characteristic quality and temporal resolution. This segmentation method is commonly known as the single-scale approach. Even with overlapping techniques, single-scale segmentation maintains a temporal resolution of only 0.75 to 1.5 seconds, leaving room for improvement in temporal accuracy. Coarse temporal resolution not only affects diarization performance but also reduces speaker counting accuracy, as brief speech segments may not be adequately captured. To tackle this challenge, the multi-scale approach is introduced to manage this trade-off: speaker features are extracted from various segment lengths, speaker embeddings are derived at each scale, and the results from the multiple scales are integrated. T. J. Park et al. proposed a multiscale diarization decoder [4] that dynamically evaluates the significance of each scale at every time step. The multiscale diarization decoder processes multiple speaker embedding vectors from different scales and determines optimal scale weights; speaker labels are then generated based on the estimated scale weights.
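To illustrate the multi-scale idea, the sketch below (NumPy only, with made-up window/shift settings) slices an utterance into segments at several scales and fuses per-scale cosine affinities with fixed scale weights; a multiscale decoder such as the one in [4] instead estimates these weights dynamically at every time step.

```python
import numpy as np

# Assumed example scales (seconds): (window, shift) pairs, coarse to fine.
SCALES = [(3.0, 1.5), (1.5, 0.75), (0.5, 0.25)]

def segment_times(duration, window, shift):
    """Start/end times of overlapping segments covering [0, duration]."""
    starts = np.arange(0.0, max(duration - window, 0.0) + 1e-9, shift)
    return [(s, min(s + window, duration)) for s in starts]

def fuse_affinities(per_scale_embeddings, scale_weights):
    """Weighted sum of per-scale cosine affinity matrices.

    per_scale_embeddings: list of (num_base_segments, dim) arrays, one per
    scale, already mapped onto the finest-scale time base.
    """
    fused = None
    for emb, w in zip(per_scale_embeddings, scale_weights):
        norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        affinity = norm @ norm.T          # cosine similarity matrix
        fused = w * affinity if fused is None else fused + w * affinity
    return fused

# Example: three scales over a 10 s clip with toy 8-dim embeddings.
rng = np.random.default_rng(0)
base = segment_times(10.0, *SCALES[-1])   # finest scale defines the time base
embs = [rng.normal(size=(len(base), 8)) for _ in SCALES]
print(fuse_affinities(embs, scale_weights=[0.4, 0.35, 0.25]).shape)
```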
The representation of speakers plays a crucial role in speaker diarization systems, as it allows the similarity between speech segments to be assessed. With the increasing prominence of deep learning in speech processing, extensive research has explored harnessing the powerful modeling capabilities of neural networks to produce speaker embeddings. Notable examples include d-vectors and x-vectors [5], which are embedding vectors typically derived from the output of a deep neural network's bottleneck layer trained for speaker recognition. This shift from traditional MFCC or i-vector approaches to neural embeddings has led to performance enhancements, facilitated training with larger datasets, and increased robustness against speaker variations and diverse acoustic conditions.
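Since diarization decisions ultimately reduce to comparing such embeddings, the standard comparison is sketched below: cosine similarity between two fixed-size vectors. The vectors here are random stand-ins for real x-vectors; the 192-dimensional size matches the embeddings analyzed later in this paper.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
xvec_a, xvec_b = rng.normal(size=192), rng.normal(size=192)  # e.g., 192-dim
print(f"similarity = {cosine_similarity(xvec_a, xvec_b):+.3f}")
```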
Although effective in monolingual scenarios, these methods struggle with multilingual data due to differences in phonetics and speaker characteristics across languages. Multilingual speaker diarization is therefore a challenging but essential task for accurately processing and analyzing conversational speech. For the sake of brevity, the significant challenges of multilingual speaker diarization reported in the most recent literature are summarized in Table I. Traditional speech processing systems, including automatic speech recognition (ASR) and speaker diarization, are typically designed for single-language scenarios, while language recognition systems are not usually evaluated in environments with multiple speakers. Diarization of multilingual conversations is thus entangled with both the speakers and the languages involved.

Several studies have integrated language identification into speaker diarization to enhance performance in multilingual settings [6], [7]. Code-switching, a common phenomenon in multilingual communities, requires ASR systems to handle multiple languages with advanced models and robust language-handling techniques, which in turn improves multilingual diarization accuracy [8]. Code-mixing, where words from multiple languages are blended within a single sentence, poses challenges for language models and is prevalent in spoken language and informal writing. Fine-tuning massive multilingual models such as mBERT and XLM-R on code-mixed data improves the handling of such complexities; however, the lack of publicly available code-mixed datasets hampers the development of specialized models. Recent work [9] addresses overlapping speech in multilingual data using acoustic features such as MFCCs together with NLP techniques, but faces scalability issues due to high computational demands.
In this work, we meticulously selected three recent methods, each representing a distinct class of algorithms:

I. An End-to-End Neural Diarization (EEND) framework obtained from the Pyannote.audio toolkit, comprising a speaker segmentation model, x-vector-based speaker embeddings, and agglomerative hierarchical clustering (AHC).
II. NVIDIA's NeMo framework, incorporating MarbleNet [10] for VAD, TitaNet [11] for speaker embedding extraction, and a Multi-scale Diarization Decoder (MSDD) for neural diarization [4], estimating speaker labels from multi-scale segmentation.
III. A modularized system utilizing Silero-VAD¹ for voice activity detection, ECAPA-TDNN for speaker embeddings [12], and spectral clustering (SC) for diarization.

¹ https://github.com/snakers4/silero-vad

Throughout this article, these models are referred to as Pyannote, NeMo, and Silero, respectively, for brevity. Our primary aim is to assess how resilient and flexible each model is when faced with real-world conditions that can greatly affect diarization accuracy. We followed a stringent methodology to guarantee an equitable comparison, yielding valuable insights into the models' strengths and weaknesses across various demanding audio environments. The major contributions of this article are as follows:

• We assessed the performance of the three aforementioned speaker diarization systems on the DISPLACE (DIarization of SPeaker and LAnguage in Conversational Environments) 2024 challenge speech dataset, which contains conversational far-field speech recordings with multiple speakers conversing in different Indian languages.
• Our experimental results and analysis contribute to the understanding of the applicability and effectiveness of these models in practical scenarios, offering guidance for selecting the most suitable parameters and settings for a speaker diarization solution based on specific use cases and environmental conditions.

The rest of the paper is organized as follows: Section II provides an overview of the architectures of the three speaker diarization frameworks evaluated in this study. Section III details the dataset, the experimental setup, and the results for all frameworks. Finally, Section IV concludes the paper.

II. OVERVIEW OF SPEAKER DIARIZATION FRAMEWORKS

This section provides an overview of the evaluated speaker diarization frameworks, namely pyannote.audio, NVIDIA's NeMo, and the silero-models combined with SpeechBrain's ECAPA-TDNN embeddings. All of them have claimed state-of-the-art performance on various standard datasets.

A. Pyannote.audio

Pyannote.audio is an open-source Python toolkit designed for speaker diarization, built on the PyTorch machine learning framework. The toolkit provides a set of trainable end-to-end neural components that can be integrated and collectively fine-tuned to create speaker diarization workflows. Pyannote.audio offers the capability to train models directly from waveforms using learnable features derived from the SincNet architecture [13], or from standard features such as MFCCs or spectrograms. Additionally, Pyannote.audio provides pre-trained PyTorch models based on the generic PyanNet architecture, suitable for training recurrent neural networks for various speaker diarization sub-tasks, including voice activity detection, speaker change detection, overlapped speech detection, and re-segmentation.

Version 2.1 of pyannote.audio implements an overlap-aware speaker diarization approach comprising three main blocks: (a) end-to-end speaker segmentation, (b) neural speaker embedding, and (c) agglomerative hierarchical clustering (AHC). This approach applies the end-to-end segmentation method on small audio chunks to estimate an upper bound on the local number of speakers, followed by global constrained clustering of the resulting local speakers. The end-to-end speaker segmentation is formulated as a multi-label classification problem using permutation-invariant training, with neural speaker embeddings used for speaker representation and comparison. The model architecture is based on the canonical x-vector TDNN-based architecture, with a modification in the statistics pooling layer. This speaker diarization approach involves an iterative interplay between two main steps: segmentation and incremental clustering.
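As a usage sketch, the pretrained pipeline described in this subsection can be applied as follows; the checkpoint identifier follows the project's published model card, a Hugging Face access token is assumed, and the file name is illustrative.

```python
from pyannote.audio import Pipeline

# Load the pretrained overlap-aware diarization pipeline (segmentation +
# embedding + AHC). A Hugging Face token is assumed to be available.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1",
                                    use_auth_token="HF_TOKEN")

# Run diarization on a single conversation and print speaker turns.
diarization = pipeline("M018.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.2f}s -> {turn.end:7.2f}s  {speaker}")

# The result can be written to RTTM for scoring.
with open("M018.rttm", "w") as f:
    diarization.write_rttm(f)
```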
TABLE I
CHALLENGES

Code-Switching: Frequent language switching within conversations disrupts continuity in language identification and speaker diarization.
Code-Mixing: Elements from a secondary language, such as words, phrases, or grammatical structures, are integrated into a primary-language utterance.
Accent and Pronunciation Variability: Variations in accents and pronunciation patterns make it difficult to distinguish between different speakers of the same language.
Overlapping Speech: Simultaneous speech from multiple speakers complicates segmentation and attribution, especially with different languages involved.
Lack of Multilingual Data: The scarcity of large, annotated multilingual datasets limits model training and evaluation, affecting performance.
Real-Time Processing Constraints: Integrating language identification and diarization in a unified model increases computational demands, challenging real-time processing.
Language-Independent Feature Representation: Developing features that work effectively across multiple languages while preserving linguistic nuances is challenging.
TABLE II
SUMMARY OF THE SPEAKER DIARIZATION FRAMEWORKS DESCRIBED IN SECTION II

Model    | VAD                                             | Embedding extractor                            | Clustering | Resegmentation
Pyannote | RNN-based speech activity detection             | Canonical x-vector TDNN-based architecture [5] | AHC        | none
NeMo     | MarbleNet                                       | TitaNet-M (13.4M parameters)                   | MSDD       | none
Silero   | Multi-head attention (MHA) based neural network | ResNet-101                                     | SC         | VB-HMM
Every few hundred milliseconds (500 ms in this case), the speaker segmentation module conducts a fine-grained, overlap-aware diarization of a 5 s rolling buffer. The local diarization results are then processed by the incremental clustering module, which uses speaker embeddings to map local speakers to the appropriate global speakers (or to create new ones) before updating its internal state. Additionally, a recipe is provided for practitioners to adapt the pretrained pipeline to their specific use cases by leveraging their labeled data effectively.
B. NVIDIA NeMo

The speaker diarization pipeline in NeMo incorporates the MarbleNet model for voice activity detection (VAD), TitaNet models for speaker embedding extraction, and the Multi-scale Diarization Decoder (MSDD) for neural diarization. In conversational settings, where speaker turns can be very brief, achieving precise timestamps is crucial. Human conversations often include short interjections such as "yes," "uh-huh," or "oh," which pose challenges for machine transcription and speaker identification. Therefore, in segmenting audio recordings based on speaker identity, speaker diarization necessitates fine-grained decisions on relatively short segments, ranging from fractions of a second to several seconds. However, accurately discerning speaker traits from such short segments is difficult.

Traditional diarization systems typically use audio segment lengths ranging from 1.5 to 3.0 seconds, striking a balance between speaker characteristic quality and temporal resolution. This approach, termed the single-scale method, limits temporal resolution even with overlap techniques, leaving room for improvement in temporal accuracy. To address this trade-off, a multi-scale approach can be employed, in which features are extracted from multiple segment lengths and the results from these scales are combined. In the NeMo framework, 1-D convolutional neural networks dynamically determine the importance of each scale at each time step. The MSDD takes multiple speaker embedding vectors from the various scales and estimates the desirable scale weights, generating speaker labels based on these weights. The system thus places greater emphasis on larger scales when they are deemed to provide more accurate information.

In the multi-scale SDS, the audio input is processed to extract multi-scale segments, from which the corresponding speaker embedding vectors are generated using the TitaNet speaker embedding extractor. These multi-scale embeddings are then clustered, providing initial clustering results for the MSDD module. The MSDD module compares cluster-average speaker embedding vectors with the input speaker embedding sequences, estimating scale weights at each step to determine the significance of each scale. Ultimately, the sequence model is trained to produce speaker label probabilities for each speaker.
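For orientation, a minimal sketch of driving NeMo's MSDD-based neural diarizer is given below. It assumes one of the diarizer inference configs shipped with the NeMo repository; the file names are illustrative and the config keys should be checked against the installed NeMo version.

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models.msdd_models import NeuralDiarizer

# A diarizer config from the NeMo repo is assumed (paths are illustrative).
cfg = OmegaConf.load("diar_infer_telephonic.yaml")
cfg.diarizer.manifest_filepath = "displace_dev_manifest.json"
cfg.diarizer.out_dir = "./diar_outputs"

# Multi-scale settings: several (window, shift) pairs plus per-scale weights.
emb = cfg.diarizer.speaker_embeddings.parameters
emb.window_length_in_sec = [1.5, 1.25, 1.0, 0.75, 0.5]
emb.shift_length_in_sec = [0.75, 0.625, 0.5, 0.375, 0.25]
emb.multiscale_weights = [1, 1, 1, 1, 1]

# MSDD-based neural diarizer (TitaNet embeddings, MarbleNet VAD inside).
diarizer = NeuralDiarizer(cfg=cfg)
diarizer.diarize()  # writes RTTM files to cfg.diarizer.out_dir
```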
TABLE III
COMPARISON OF DER AND JER FOR THE PYANNOTE, NEMO, AND SILERO FRAMEWORKS ON THE DISPLACE DEV SET

Fig. 2. Visualization of the (2821, 192) embeddings generated by Silero+ECAPA-TDNN on the M018 Displace dev file.
C. Modularized SDS with Silero-VAD and ResNet-293

As mentioned in the introduction, modularized diarization systems rely on efficient voice activity detection (VAD) and neural speaker embeddings, coupled with a clustering algorithm. This modularized speaker diarization framework utilizes Silero-VAD, which employs a multi-head attention (MHA) based neural network with the short-time Fourier transform as input features. This architecture was selected because MHA-based networks have demonstrated promising results across various applications, from natural language processing to computer vision and speech processing.

Speaker embeddings are extracted using ECAPA-TDNN, a TDNN-based extractor that builds upon the x-vector architecture with a stronger focus on channel attention, propagation, and aggregation. This is achieved through the integration of Squeeze-Excitation blocks, multi-scale Res2Net features, extra skip connections, and channel-dependent attentive statistics pooling. To enhance the accuracy of speaker boundaries, the VB-HMM is individually initialized for each audio file using the clustering output.

Table II presents a summary of all the implemented frameworks (pipelines), explicitly specifying the type of each component: voice activity detection (VAD), speaker embedding extractor, and clustering algorithm.
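A minimal sketch of this modular combination is shown below, using the published silero-vad torch.hub entry point and SpeechBrain's pretrained ECAPA-TDNN encoder. The model identifiers are the publicly documented ones; the audio file and the assumption of 16 kHz mono input are illustrative, and newer SpeechBrain releases expose the encoder under speechbrain.inference instead.

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# 1) Silero-VAD: speech timestamps from a 16 kHz mono waveform.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

wav, sr = torchaudio.load("M018.wav")  # assumed 16 kHz mono
speech = get_speech_timestamps(wav.squeeze(0), vad_model, sampling_rate=sr)

# 2) ECAPA-TDNN embeddings (192-dim) for each detected speech region.
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb")
embeddings = [
    encoder.encode_batch(wav[:, seg["start"]:seg["end"]]).squeeze()
    for seg in speech
]
# 3) The resulting vectors are then grouped with spectral clustering
#    (see Section III-D) to obtain speaker labels.
```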
III. EXPERIMENTAL RESULTS AND DISCUSSION

A. Dataset

The DISPLACE challenge² presented a genuine dataset containing conversational far-field speech recordings with multiple speakers conversing in different languages. In regions with diverse linguistic communities, informal conversations often showcase a rich tapestry of languages and speakers. What sets the DISPLACE corpus apart is its distinctive blend of multilingual recordings, showcasing frequent occurrences of code-mixing and code-switching. To facilitate the challenge, the DISPLACE corpus was split into development (Dev) and evaluation (Eval) sets. There are two types of session IDs in the dev set, starting with the prefixes B and M. Eleven files contain four speakers, whereas twenty-four files contain three speakers.

For the sake of reproducibility, the study outlined in this paper is based on the pyannote.audio speaker diarization toolkit³ and the NVIDIA NeMo framework⁴. The open-source implementation of all experiments is accessible online⁵. All experiments were conducted using an NVIDIA RTX A4000 GPU and an Intel Xeon W-2245 CPU @ 3.90 GHz.

² https://displace2024.github.io/
³ https://github.com/pyannote/pyannote-audio
⁴ https://github.com/NVIDIA/NeMo/
⁵ https://github.com/sumansamui/SOTA Speaker Diarization models

B. Experimental setup for Pyannote framework

We utilized version 2.1 of pyannote.audio. While the pretrained models performed exceptionally well on their original benchmarks, the default pipeline may encounter the domain-mismatch issues common to many machine learning models, resulting in suboptimal performance on our data. To adapt the model to our data, we followed the recipe provided in [3]. The pretrained speaker diarization pipeline of pyannote relies on its own set of hyperparameters, tailored to the internal pretrained segmentation model. For instance, the segmentation threshold (a value between 0 and 1) controls the aggressiveness of speaker activity detection, where a higher value leads to fewer detected speech instances.
Fig. 3. Timestamps of the M018 Displace dev file as evaluated by the various frameworks.

Fig. 4. Speaker-attributed diarization error rate (DER) on the Displace dev set as evaluated by the various frameworks.
The clustering threshold determines the number of speakers, with a higher value resulting in fewer speakers. The segmentation min_duration_off parameter governs whether intra-speaker pauses are filled, typically depending on the downstream application; hence, it is initially set to zero during optimization. We trained the segmentation model and fine-tuned the hyperparameters using the AMI and Displace 2024 dev sets.
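These hyperparameters map onto the pipeline's parameter dictionary roughly as sketched below; the key names follow the pyannote.audio 2.1 pipeline structure and should be verified against the installed version, and the numeric values are illustrative rather than our tuned settings.

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1",
                                    use_auth_token="HF_TOKEN")

# Illustrative values only: a higher segmentation threshold yields fewer
# detected speech instances; a higher clustering threshold, fewer speakers.
pipeline.instantiate({
    "segmentation": {
        "threshold": 0.58,
        "min_duration_off": 0.0,   # do not fill intra-speaker pauses
    },
    "clustering": {
        "method": "centroid",
        "threshold": 0.72,
        "min_cluster_size": 15,
    },
})
diarization = pipeline("M018.wav")
```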
C. Experimental setup for NeMo framework

In the NeMo speaker diarization pipeline, we adopted a multi-scale approach to address the trade-off between long and short segment lengths. We utilized multiple scales (segment lengths) and fused the affinity values from each scale's result. Within the NeMo speaker diarization toolkit, such neural modules are referred to as neural diarizers. The Multi-scale Diarization Decoder (MSDD) model, a type of neural diarizer, was employed in the NeMo speaker diarization pipeline. For the speaker diarization problem, the MSDD model employs a divide-and-conquer strategy, in which a pairwise model is utilized for both training and inference. We trained the MSDD with a frozen speaker embedding extractor (TitaNet) using the AMI mixheadset and Displace 2024 Dev sets.
D. Experimental setup for Silero + Spectral clustering

In the Silero framework, a pre-trained Silero voice activity detector (VAD) was utilized, and embeddings were extracted using a pre-trained SincNet-based extractor [13] with a ResNet-101 architecture. Spectral clustering was chosen for its ability to handle complex and unknown cluster shapes, where conventional methods such as K-means and Expectation-Maximization (EM) based mixture models often fall short. Unlike these traditional approaches, which rely on explicit modelling of the data distribution, spectral clustering leverages the eigenstructure of an affinity matrix, making it particularly effective for high-dimensional audio data [14]. Given the varying lengths of audio segments, the commonly used Euclidean metric may not be suitable; instead, the Kullback-Leibler (KL) divergence is considered a more appropriate distance measure for comparing two audio segments.
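As a sketch of this clustering step, the snippet below groups a precomputed affinity matrix with scikit-learn's spectral clustering. The cosine-based affinity is a stand-in for whatever similarity is actually used, the embedding matrix is a random placeholder with the same (2821, 192) shape as ours, and the number of speakers is assumed known, as it is in this pipeline.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(2)
embeddings = rng.normal(size=(2821, 192))   # e.g., Silero+ECAPA embeddings

# Cosine-similarity affinity, shifted to be non-negative for clustering.
norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
affinity = (norm @ norm.T + 1.0) / 2.0

sc = SpectralClustering(n_clusters=3, affinity="precomputed",
                        assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(affinity)           # one speaker label per segment
```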
Fig. 3(a) shows the ground-truth timestamps of a "displace" development file (M018.wav), which is 29 minutes and 53 seconds long and contains three speakers. Fig. 3(b) presents a zoomed-in view of the ground-truth timestamps, trimmed to the interval from 600 to 660 seconds for better clarity. Fig. 3(c) displays the timestamps for the same interval as detected by PyAnnote, using a collar of 0 ms and allowing overlap. Fig. 3(d), Fig. 3(e), and Fig. 3(f) show the timestamps of the file as traced by Silero, the NeMo Clustering Diarizer with Oracle VAD, and the NeMo Neural Diarizer with Oracle VAD, respectively. This visualization highlights the discrepancies between the ground truth and the results produced by the respective frameworks. It is important to note that PyAnnote operates unsupervised, without the number of speakers being specified in the code that generates the RTTM files; in contrast, for the other frameworks the number of speakers must be provided manually for each file.

A detailed visualization of the embeddings generated by Silero+ECAPA-TDNN for the "displace" development file (M018.wav) is presented in Fig. 2(a). The dimensions of these embeddings are (2821, 192). It is observed that spectral clustering (SC) groups the speaker vectors better than agglomerative hierarchical clustering (AHC). The t-SNE representation of spectral clustering applied to these embeddings is shown in Fig. 2(b), while Fig. 2(c) and Fig. 2(d) provide the UMAP representations of SC and AHC, respectively.
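Projections of this kind can be reproduced roughly as follows, using scikit-learn's t-SNE with placeholder embeddings and labels; the umap-learn package provides an analogous `umap.UMAP(n_components=2).fit_transform(...)` for the UMAP views.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
embeddings = rng.normal(size=(2821, 192))   # placeholder for real embeddings
labels = rng.integers(0, 3, size=2821)      # placeholder speaker labels

# 2-D t-SNE projection for visual inspection of cluster separation.
proj = TSNE(n_components=2, init="pca",
            random_state=0).fit_transform(embeddings)
plt.scatter(proj[:, 0], proj[:, 1], c=labels, s=3, cmap="tab10")
plt.title("t-SNE of speaker embeddings (placeholder data)")
plt.show()
```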
Fig. 4(a) depicts the speaker-attributed diarization error rate (DER) for the "displace" development set, as evaluated by the various frameworks. The model-wise DER and JER for the "displace" development set are presented in Table III. Among the frameworks, the NeMo Neural Diarizer performs best. Comparable performance is observed for PyAnnote and the NeMo Clustering Diarizer, whereas Silero+ECAPA-TDNN shows the least effective performance. When a collar of 250 ms is used and overlaps are ignored, the DER of these models on the "displace" development set is significantly reduced. However, variability in collar size and overlap handling is found to have no significant impact on JER.
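The collar and overlap conditions referred to here correspond directly to options of the pyannote.metrics scorer, sketched below with illustrative file paths; note that in pyannote.metrics the `collar` argument is, to our understanding, the total width removed around each reference boundary, so 0.5 s corresponds to a 250 ms collar on each side (an assumption worth verifying against the installed version).

```python
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

reference = load_rttm("ground_truth/M018.rttm")["M018"]    # paths illustrative
hypothesis = load_rttm("outputs/M018.rttm")["M018"]

# Lenient condition: 250 ms collar per side, overlap regions skipped.
der_lenient = DiarizationErrorRate(collar=0.5, skip_overlap=True)
print("DER (lenient):", der_lenient(reference, hypothesis))

# Strict condition: no collar, overlaps included.
der_strict = DiarizationErrorRate(collar=0.0, skip_overlap=False)
print("DER (strict):", der_strict(reference, hypothesis))
```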
IV. CONCLUSION

This paper explores the potential of widely used speaker diarization models, namely Pyannote, NeMo, and Silero VAD + ECAPA-TDNN, when applied to multilingual conversational audio. The analysis explains the trade-offs related to missed speech, false alarms, and overall alignment with ground-truth data. Future work could explore the effectiveness of these techniques on more realistic datasets, such as recordings from the Main Control Room (MCR) of a process industry or of daily household tasks. Additionally, it would be intriguing to assess the performance of multilingual speaker diarization systems in the context of secure multimodal control operations.

REFERENCES

[1] T. J. Park, N. Kanda, D. Dimitriadis, K. J. Han, S. Watanabe, and S. Narayanan, "A review of speaker diarization: Recent advances with deep learning," Computer Speech & Language, vol. 72, p. 101317, 2022.
[2] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, and S. Khudanpur, "Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge," in Proc. Interspeech 2018, 2018, pp. 2808–2812.
[3] H. Bredin, "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe," in Proc. INTERSPEECH 2023, 2023.
[4] T. J. Park, N. R. Koluguri, J. Balam, and B. Ginsburg, "Multi-scale speaker diarization with dynamic scale weighting," in Proc. Interspeech 2022, 2022, pp. 5080–5084.
[5] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
[6] A. Tjandra, D. G. Choudhury, F. Zhang, K. Singh, A. Conneau, A. Baevski, A. Sela, Y. Saraf, and M. Auli, "Improved language identification through cross-lingual self-supervised learning," in ICASSP 2022. IEEE, 2022, pp. 6877–6881.
[7] W. Cai, Z. Cai, X. Zhang, X. Wang, and M. Li, "A novel learnable dictionary encoding layer for end-to-end language identification," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5189–5193.
[8] H. Liu, L. P. G. Perera, X. Zhang, J. Dauwels, A. W. Khong, S. Khudanpur, and S. J. Styles, "End-to-end language diarization for bilingual code-switching speech," in INTERSPEECH 2021, 2021, pp. 866–870.
[9] O. H. Anidjar, Y. Estève, C. Hajaj, A. Dvir, and I. Lapidot, "Speech and multilingual natural language framework for speaker change detection and diarization," Expert Systems with Applications, vol. 213, p. 119238, 2023.
[10] F. Jia, S. Majumdar, and B. Ginsburg, "MarbleNet: Deep 1D time-channel separable convolutional neural network for voice activity detection," in ICASSP 2021. IEEE, 2021, pp. 6818–6822.
[11] N. R. Koluguri, T. Park, and B. Ginsburg, "TitaNet: Neural model for speaker representation with 1D depth-wise separable convolutions and global context," in ICASSP 2022. IEEE, 2022, pp. 8102–8106.
[12] N. Dawalatabad, M. Ravanelli, F. Grondin, J. Thienpondt, B. Desplanques, and H. Na, "ECAPA-TDNN embeddings for speaker diarization," in INTERSPEECH 2021, 2021, pp. 3560–3564.
[13] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 1021–1028.
[14] H. Ning, M. Liu, H. Tang, and T. S. Huang, "A spectral clustering approach to speaker diarization," in INTERSPEECH 2006, 2006, pp. 2178–2181.