
Expert Systems With Applications 168 (2021) 114416


FuzzyGCP: A deep learning architecture for automatic spoken language identification from speech signals

Avishek Garain a, Pawan Kumar Singh b,*, Ram Sarkar a

a Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, West Bengal, India
b Department of Information Technology, Jadavpur University, Kolkata 700106, West Bengal, India

ARTICLE INFO

Keywords:
Spoken language identification
Speech signal
Deep learning
GAN
DNN
MLP
Ensemble learning
Choquet integral
Spectrogram

ABSTRACT

In this modern era, language has no geographic boundary. Therefore, for developing an automated system for search engines using audio, tele-medicine, emergency services via phone, etc., the first and foremost requirement is to identify the language. The fundamental difficulty of automatic speech recognition is that speech signals vary significantly due to different speakers, speech variation, language variation, age- and sex-wise voice modulation, content, acoustic conditions and so on. In this paper, we propose a deep learning based ensemble architecture, called FuzzyGCP, for spoken language identification from speech signals. This architecture combines the classification principles of a Deep Dumb Multi Layer Perceptron (DDMLP), a Deep Convolutional Neural Network (DCNN) and a Semi-supervised Generative Adversarial Network (SSGAN) to maximize precision, and finally applies ensemble learning using the Choquet integral to predict the final output, i.e., the language class. We have evaluated our model on four standard benchmark datasets comprising two Indic language datasets and two foreign language datasets. Irrespective of the languages, the F1-score of the proposed language identification model is as high as 98% on the MaSS dataset, and the worst performance is 67% on the VoxForge dataset, which is still much better than the maximum of 44% attained by state-of-the-art models on multi-class classification. The link to the source code of our model is available here.

1. Introduction

Automatic spoken language identification (SLID) refers to the process of identifying the spoken language through any computing device using speech signals. Such a system can act as a support for speech recognition purposes in multilingual countries. This can be accomplished by determining the language of the spoken segments with the help of language recognizers. However, the intermixing of various languages having a common origin poses a huge challenge for the precise classification of languages in an automated manner when a multi-lingual dataset is taken into consideration. Thus, to find a feasible solution to such mixing of data and many other challenges, SLID has attracted many researchers around the world. If we consider the country India alone, we can see that 23 languages are recognized for official use (Languages of India (2017)). The inter-dependency existing among the Indic languages poses a real challenge for the automatic classification of languages from voice signals. Besides, this field has a wide spectrum of applications and shall continue gaining them in the near future, ranging from usage in evolving speech based search engines, which segregate results based on automatic location identification from the spoken languages, to usage in the telecommunication industry for delivering proper customer care services, where spoken languages play a key role; the applications of SLID are truly widespread. In the aviation industry, where pilots need to know a standard language for communication purposes, an efficient automatic spoken language recognition system can eliminate this dependency by transferring control to proper language translation systems based on the language of the queries by the pilots. Doctors from all over the world can communicate with each other freely while making use of tele-medicine, if their spoken languages can be correctly identified and put forward properly to translation systems. The rising trend of globalization and the increasing popularity of the Internet have amplified the need for competent SLID systems. An important application arises in call centers across the world dealing with speakers of different languages. With the huge volume of vocabularies involved in indexing or searching speech data archives which contain multiple languages, SLID systems are gaining more and more importance in recent times. Therefore, a comprehensive system for SLID is a pressing need to address the above-mentioned needs.

∗ Corresponding author.
E-mail addresses: [email protected] (A. Garain), [email protected] (P.K. Singh), [email protected] (R. Sarkar).

https://doi.org/10.1016/j.eswa.2020.114416
Received 3 June 2020; Received in revised form 18 October 2020; Accepted 28 November 2020
Available online 8 December 2020
0957-4174/© 2020 Elsevier Ltd. All rights reserved.

In this paper, we have presented an ensemble based architecture which leverages the functionalities of deep learning models, namely a Deep Dumb Multi Layer Perceptron (DDMLP), a Deep Convolutional Neural Network (DCNN) and a Semi-supervised Generative Adversarial Network (SSGAN), in solving the problem of SLID mainly for Indic languages as well as some popular foreign languages. We have discussed the underlying challenges and probable discriminatory features, and provided a detailed analysis for the same.

The remaining paper is organized as follows. Section 2 provides a brief explanation of some previous works and their performances. Section 3 describes the datasets on which the proposed framework has been evaluated. The methodology followed in designing our architecture is described in Section 4. This is followed by the results and concluding remarks in Sections 5 and 6 respectively.

2. Literature survey

Previously, research works have been carried out in this domain mainly by making use of feature based approaches like MFCC (Mel-frequency cepstral coefficients) (Logan et al., 2000), LPC (Linear Predictive Coding) (O'Shaughnessy, 1988), Gaussian Mixture Model (GMM), PLP (Hermansky, 1990), PHCC (Perceptual Harmonic Cepstral Coefficients) (Gu & Rose, 2001), Mel Scale Cepstral Analysis (Imai, 1983), Power Spectral Analysis (Stoica et al., 2005), LFCC (Linear Frequency Cepstral Coefficient) (Zhou et al., 2011), RASTA (Relative Spectral Analysis Technique) (Hermansky & Morgan, 1994) and Shifted-delta features (Wang et al., 2012).

Li et al. (2013) have given an introductory note on the fundamentals of the theory and the solutions, from both computational and phonological aspects of spoken language recognition. They have also given a detailed and comprehensive review of current trends and future research directions using the language recognition evaluation (LRE) formulated by the National Institute of Standards and Technology (NIST).

Albadr et al. (2019) in their study have employed the extreme learning machine (ELM) as the learning model for the task of SLID using some standard features. In addition, the authors have proposed an optimized Genetic Algorithm (OGA) with three different selection criteria, namely K-tournament, roulette wheel and random, for selecting the most appropriate initial weights and biases of the input hidden layer of the ELM to minimize the classification error. The proposed OGA–ELM with the three selection criteria has produced highest accuracies of 99.50%, 100% and 99.38%, respectively.

In the paper by Zhang Jian et al. (2017), the authors have used the F-ratio analysis method for analyzing the importance of the weightage that should be given to different SLID feature vectors. After this, a weighted phone log-likelihood ratio (WPLLR) feature has been used to weight more heavily those dimensions which have high F-ratio values. The authors have tested on the NIST 2007 dataset. The results show the effectiveness of their feature, with relative improvements in terms of average cost and equal error rate compared with the phone log-likelihood ratio (PLLR) feature.

The authors Lee and Jang (2018) in their paper have presented an approach based on a perspective of linguistics, specifically that of syllable structure. Their approach contains a section for labeling common syllable structures. Then, the authors have made use of a long short-term memory (LSTM) network in order to transform the MFCCs of an audio sample into its syllable structure. They have applied their work on 10 different languages and have achieved an accuracy of 70.40%. Their results have outperformed most of the methods based on acoustic-phonetic and phonotactic features in terms of efficiency.

The authors Shukla et al. (2019) in their paper have focused on an implicit approach, owing to the absence of data in transcribed form. They have proposed a new model based on an attention mechanism which makes use of log-Mel spectrogram images as input. For training and evaluating the models, they have considered six languages, namely English, German, Spanish, French, Russian and Italian, and obtained an accuracy of 95.4%.

After the evolution of deep learning and the availability of computational resources, specifically Graphical Processing Units (GPUs), at cheaper costs, the research has seen a paradigm shift.

In the paper by Miao et al. (2019), the authors have aimed to improve traditional DNN (Delay Neural Network) x-vector language identification (LID) performance by employing Convolutional and Long Short Term Memory-Recurrent (CLSTM) Neural Networks, harnessing their advantage to strengthen feature extraction and capture longer temporal dependencies. The authors have introduced a frequency attention mechanism to give different weights to different frequency bands to generate weighted means and standard deviations. They have shown that CLSTM can significantly outperform a traditional DNN x-vector implementation and that the proposed frequency attention method outperforms time attention, particularly when the number of frequency bands matches the feature size.

The authors Madhu et al. (2017) in their work have proposed a framework using language dependent prosodic information and phonotactic features. It consists of a Phonetic Engine which serves as the front end of the SLID system and converts the speech sample fed as input into a sequence of phonetic symbols. Thereafter, syllable boundaries are recognized and phones within a syllable boundary are divided into groups. Then, rules which are phonotactic in nature are applied to get syllables. Numeric representation of successive pairs of syllables is done to get phonotactic feature vectors. Vectors for features which are prosodic in nature are obtained by concatenating the feature vectors of three successive syllables. These features are then fed to a multilayer feed forward neural network based classifier for language identification. The data on which the classifier is trained consist of speech samples with a total duration of two hours from each of the seven languages. The targeted language classes include Hindi, Bangla, Telugu, Assamese, Punjabi, Manipuri and Urdu.

The letter by Wang et al. (2013) presents a study of the application of phoneme posterior features for spoken language recognition. In their work, they have estimated phoneme posterior features from a Multi Layer Perceptron (MLP) based phoneme recognizer, and further processed them through transformations. These transformations include taking the logarithm, Principal Component Analysis (PCA), and appending shifted delta coefficients. The authors have reported that the resulting shifted-delta MLP (SDMLP) features show a similar distribution to conventional shifted-delta cepstral (SDC) features, and that SDMLP features are more robust compared to the SDC features.

Ferrer et al. (2014) in their work have proposed a new approach for SLID based on the estimated posteriors for a set of senones which represent the phonetic space of one or more languages. For speech recognition systems, these senones usually are the Hidden Markov Model (HMM) states of the acoustic model, which can be predicted by a neural network if a Delay Neural Network/HMM hybrid approach for acoustic modeling is applied. They have then derived a feature vector for every sample using these probabilities. Their proposed system is reported to give over 40% relative gain compared to state-of-the-art language identification systems at sample durations ranging from 3 to 120 s.

In the paper by Miao et al. (2018), the authors have exploited the latent abilities of conditional Generative Adversarial Networks (cGAN), firstly combining them with a DNN based i-vector approach, and have then tried to improve the language identification model using the cGAN architecture. First, they have extracted the deep bottleneck features (DBF), which are phoneme dependent. Then they have combined them with the output posteriors of a pre-trained DNN. After that, they have used them to extract i-vectors in the normal way and have classified these i-vectors using cGAN. Results show that the cGAN architecture can significantly outperform DBF, DNN and i-vector methods where 49-dimensional i-vectors are used, but not where 600-dimensional i-vectors are used.


In the work by Snyder et al. (2018a), the authors have applied the concept of x-vectors for recognition of spoken language. Their framework consists of a DNN that maps sequences of speech features to fixed-dimensional embeddings, called x-vectors. Long term language characteristics are captured in the network by a temporal pooling layer that aggregates information across time. Once the x-vectors are extracted, they make use of the same classification methodology as developed for i-vectors.

Dehak et al. (2011) in their paper have presented a new SLID system based on the total variability approach. They have employed various techniques to extract the most salient features in the lower dimensional i-vector space. Additional performance gain has been observed when the system is combined with other acoustic systems.

2.1. Research gap and motivation

In recent times, many approaches based on GMM have been developed for SLID purposes. Among all such methods, the modeling of i-vectors happens to be one of the best and results in a significant improvement in performance over the others (Dehak et al., 2011). In the i-vector model, features based on acoustics are first transformed into higher dimensional vectors. Then these vectors are mapped into a low-dimensional subspace. Each speech sample is denoted by a vector of fixed length called the i-vector. After the i-vectors are extracted, standard techniques like the Gaussian back-end and Logistic regression are applied to the i-vectors of the test samples. However, the performance of this method heavily depends on the choice of hyper-parameters and suffers from a drastic decrease in performance if used for classes of languages other than those used for training. In recent times, DNN based models have been used predominantly for acoustic modeling in the field of speech recognition as a replacement for GMM. In SLID, several strategies using DNNs have been investigated so far, and the most successful approaches that have paved their way are those frameworks built using hybrid techniques (Jog et al., 2018). These are the frameworks where DNNs are trained to differentiate between senones and are combined with conventional language identification models. However, recognition results seem to fail miserably when it comes to identification of Indic languages, which have various commonalities among them. Also, use of the same model for both foreign and Indic languages may give unsatisfactory results (Anjana & Poorna, 2018).

Though a significant amount of work has been performed by researchers, to the best of our knowledge the application of a common model on datasets with significant diversity has not been explored much. This is because the same model may not give desirable accuracy across the datasets; hence, consistent performance over varied datasets is required to prove the robustness and versatility of a model. Evaluating a common model in bi-lingual, tri-lingual and multi-lingual scenarios has not been investigated as such till now. However, this aspect of SLID research becomes more pertinent for Indic languages owing to their common roots of development, which may not be valid for many foreign languages. This fact is evident from our experimental outcomes. We have evaluated our model in such scenarios and analyzed the results, and it can be said that the results are quite satisfactory keeping in mind the complexity of the problem under consideration. The use of Generative Adversarial Networks (GANs) and ensemble mechanisms in this domain has so far been limited and is yet to be retrospected and worked upon.

2.2. Contributions

In the light of the above-mentioned facts, we have proposed a new SLID model. The highlights of this work are as follows:

1. We have used two types of features: one type being numeric values, while the other being images obtained from the corresponding spectrograms.
2. We have used a conventional DDMLP architecture as a classifier for the numeric features.
3. The image based features are used to train architectures like DCNN and SSGAN. Usage of SSGAN in this context has not yet been explored much, and may be first-of-its-kind.
4. Finally, to obtain results beyond the reach of each of the models used separately, we have formed a heterogeneous ensemble, called FuzzyGCP, by combining the results of the aforementioned architectures using a fuzzy integral measure.
5. The datasets we have considered are themselves diverse enough to prove the efficiency and robustness of our model. The results are quite impressive considering the multi-lingual classification approach and the domain.
6. We have shown a detailed analysis of the bi-lingual, tri-lingual and multi-lingual classification capabilities of our model for the Indic datasets, and the multi-lingual classification capability only for the foreign datasets.

3. Dataset used

There are more than 7000 languages spoken throughout the world. We have come across various spoken language datasets; however, we have selected the datasets which consist of speech signals of some popular languages. They are enlisted in the Annexure section (Table 16).

The datasets that we have selected for evaluating the performance of our model consist of both foreign and Indic languages. The datasets are diverse in terms of speakers, gender and ethnicity. Also, we have intentionally selected languages which tend to be similar in terms of semantics and phono-tactic features and show inter-dependency among themselves. This was done in order to put our model through intense training and confusion, to improve its precision and generalize it properly. The class balance for these datasets is approximately perfect, preventing any kind of class-biased training and thus giving good recall metrics.

3.1. IIIT Hyderabad dataset

The IIIT Hyderabad Indic speech databases (Prahallad et al., 2012) consist of data in textual and speech format for the languages Bangla, Hindi, Tamil, Kannada, Telugu, Malayalam, and Marathi. The creators of these datasets selected these languages based on the fact that the total number of articles found in Wikipedia written in each of the said languages is more than 10,000. The languages considered here have different dialects. To maintain originality, they decided that the recording of the speech should be done in the dialect with which the native speakers were comfortable. The dataset consists of 7000 audio samples approximately equally divided among the 7 language classes.

3.2. IIT Madras dataset

This dataset is the result of a project on developing text-to-speech (TTS) synthesis systems for Indian languages (Baby et al., 2016) as well as enhancing the quality of synthesis. We have applied our model on 6000 audio samples, 1000 of each language class. The comprising languages are English, Marathi, Tamil, Bangla, Telugu and Hindi.

3.3. VoxForge dataset

VoxForge (Voxforge.org) is a project which was set up to collect transcribed speech for use in Open Source Speech Recognition Engines ("SRE"s) such as Julius, ISIP, HTK and Sphinx. This dataset is huge and diverse both in terms of variety and size. Here, we have considered 2000 audio samples for each language category, thus avoiding class imbalance of any kind. The selection of samples is such that the total recording time is approximately the same for all the language classes. The languages considered are French, German, Italian, Portuguese and Spanish.


3.4. MaSS dataset

The MaSS (Multilingual corpus of Sentence-aligned Spoken utterances) dataset (Boito et al., 2020) is an extension of the CMU Wilderness Multilingual Speech dataset (Black, 2019). They have prepared this dataset by considering multilingual links between speech segments of different languages. It is a voluminous and clean dataset containing 8130 parallel spoken utterances of 8 languages with 56 language pairs. The language categories are Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish. The quality of the final corpus is attested by means of human evaluation performed on a corpus subset (8 language pairs with 100 utterances).

4. Methodology

Every audio signal considered here is sampled to 5 s duration with a sampling rate of 44.1 kHz to maintain uniformity in feature extraction. A lesser duration like 1 s or 2 s would have led to increased localization of feature learning, thereby reducing the generalization in learning over the whole time series information of the signal. For making the audio signals machine readable, we have made use of the Librosa library (McFee et al., 2015). For implementing our architecture, we have made use of libraries like Tensorflow (Abadi et al., 2015) and Keras (Chollet et al., 2015).

4.1. Feature extraction

In this section, we first discuss the challenges related to the inter-dependencies among the different Indic languages, owing to their common origin and other factors, which make the feature extraction process extremely difficult. We then discuss the different feature extraction processes applied here.

The Indic languages Marathi, Hindi and Bangla belong to the Indo-Aryan language family. Marathi has its grammar and syntax derived from Pali and Prakrit. It uses the retroflex nasal sound /ɳ/ most frequently. Much of the vocabulary of the Hindi language has been derived from Sanskrit. In Hindi, the distinction of length into long and short vowels has been neutralized. It is a syllable-timed language, meaning words are not distinguished based on stress alone. Default stress in Hindi is given on the last syllable, and each content word except the final one has a rising contour. Bangla is also derived from Magadhi Prakrit and Pali, and it is also a bound stress language. In this language, voiced stops have a shorter closure duration than voiceless stops, and breathy voiced stops have the shortest closure duration (Berkson, 2013).

The Indic languages Tamil, Telugu, Malayalam and Kannada belong to the Dravidian language family, where Telugu belongs to the south central group and Malayalam, Kannada and Tamil belong to the southern group. Tamil has neither aspirated nor voiced stops like the other Indian languages, and aspirated consonants are absent in this language (Keane, 2004). Telugu is influenced by Sanskrit and Prakrit. It shows the vowel harmony phenomenon, which is not characteristic of any other Dravidian language; in this phenomenon, the quality of a vowel in a syllable is decided by the vowels of the preceding syllable (Bhaskararao, 2011). Malayalam is thought to be a branch of classical Tamil but has a large contribution from Sanskrit vocabulary (Caldwell, 1875). Kannada is influenced by the Prakrit, Sanskrit and Pali languages (mustgo.com).

Here, 8 sets of features (see Fig. 1) are extracted from the audio signals to obtain useful information for the rest of the working pipeline. They are as follows:

1. MFCCs: coefficients which are derived from a type of cepstral representation of the audio clip
2. Spectral bandwidth: the wavelength interval in which a radiated spectral quantity is not less than half its maximum value
3. Spectral contrast: the mean of the level difference between peaks and valleys in the spectrum
4. Spectral roll-off: the frequency below which 85% of the distribution magnitude is concentrated
5. Spectral flatness: determined by dividing the geometric mean of the power spectrum by the arithmetic mean of the power spectrum
6. Spectral centroid: indicates the location of the center of mass of the spectrum
7. Polynomial features: coefficients of fitting an nth-order polynomial to the columns of a spectrogram
8. Tonnetz: tonal centroid features

Fig. 1. Illustration of output images for a sample audio signal representing: (a) MFCCs, (b) Spectral bandwidth, (c) Spectral contrast, (d) Spectral roll-off, (e) Spectral flatness, (f) Spectral centroid, (g) Polynomial features and (h) Tonnetz.

The features mentioned above are first scaled using the StandardScaler function of the Scikit-learn library (Pedregosa et al., 2011) and then fed to the DDMLP classifier by an averaging principle, described as follows.

Let us consider a set of features $S = \{S_1, S_2, \ldots, S_N\}$, where $S_i = \{s_1, s_2, \ldots, s_M\}$. Any element located at the $i$th row and $j$th column of the $N \times M$ dimensional feature array $S$ can be denoted by $S_{ij}$, where $1 \le i \le N$ and $1 \le j \le M$. Let us denote the set of end features to be fed to the network by $F$. Then any feature element $F_j$ of the set of end features $F$ is given by

$F_j = \frac{\sum_{i=1}^{N} S_{ij}}{N}$    (1)

where $1 \le j \le M$.

This makes $F$ a $1 \times M$ dimensional feature vector, which is later processed and fed to the DDMLP classifier for the classification purpose. For the DCNN and SSGAN networks, however, the spectrogram based features are fed, as these networks require image data to work upon. Each individual spectrogram is first converted to a grayscale image. These images are then concatenated together to form a single image as shown


in Fig. 2. Such images are fed to the DCNN and SSGAN networks for the classification purpose.

Fig. 2. Sample image representing feature after concatenation of individual spectrogram based features.
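To make the numeric branch of this pipeline concrete, the following is a minimal sketch, assuming the averaging of Eq. (1) is taken over time frames and that the per-feature means are simply concatenated (the concatenation order is our assumption; the paper does not fix it):

```python
import numpy as np
import librosa

def numeric_feature_vector(path):
    """Load a 5 s clip at 44.1 kHz and build the averaged numeric
    feature vector of Eq. (1) for the DDMLP branch."""
    y, sr = librosa.load(path, sr=44100, duration=5.0)
    feats = [
        librosa.feature.mfcc(y=y, sr=sr),                  # (1) MFCCs
        librosa.feature.spectral_bandwidth(y=y, sr=sr),    # (2) bandwidth
        librosa.feature.spectral_contrast(y=y, sr=sr),     # (3) contrast
        librosa.feature.spectral_rolloff(y=y, sr=sr,
                                         roll_percent=0.85),  # (4) roll-off
        librosa.feature.spectral_flatness(y=y),            # (5) flatness
        librosa.feature.spectral_centroid(y=y, sr=sr),     # (6) centroid
        librosa.feature.poly_features(y=y, sr=sr),         # (7) polynomial
        librosa.feature.tonnetz(y=y, sr=sr),               # (8) tonnetz
    ]
    # Each matrix is (n_coefficients, n_frames); Eq. (1) collapses the
    # frame axis by a simple mean before scaling and classification.
    return np.concatenate([f.mean(axis=1) for f in feats])
```

For the image branch, each of these features would instead be rendered as a grayscale spectrogram image and the images stacked into the single composite of Fig. 2.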

4.2. Architecture

The overall architecture, starting from the processing of an audio sample to the identification of the language class, is shown briefly in Fig. 3.

4.2.1. Deep dumb multi layer perceptron


Overview
MLP is a special class of neural network belonging to the class of
feed-forward artificial neural network (ANN). A basic MLP unit consists
of a minimum of three layers of nodes: an input layer, a hidden layer, and an output layer (Haykin, 1994). Of these, the input nodes use linear activation, and all the other nodes are neurons that use a nonlinear activation function. Generally, it applies a supervised learning technique called back propagation for the purpose of learning. The multiple layers that it contains, and their property of nonlinear activation, aid the classification of data that are not separable by linear techniques. An MLP is called Deep Dumb if it consists of many hidden layers simply stacked one after the other in a sequential manner.

Fig. 3. Architecture of the proposed FuzzyGCP used for Spoken Language Identification.
Implementation of the architecture
We have used a DDMLP network (see Fig. 4) with 14 hidden layers along with an input layer and an output layer with Softmax activation. The extracted averaged features from the audio clips are scaled and then fed as input to this model, and the output is the softmax probability assigned to each language class. The output of this network is used at a later stage to form the ensemble.

Fig. 4. A simple Deep Dumb Multi Layer Perceptron (DDMLP) network.
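A hedged Keras sketch of such a network is given below; the hidden-layer width of 512 and the ReLU activation are assumptions on our part, as the text fixes only the depth (14 hidden layers) and the softmax output:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_ddmlp(input_dim, n_classes, width=512, n_hidden=14):
    """14 hidden layers stacked sequentially with a softmax output, as
    described in Section 4.2.1; width and activation are assumed."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    for _ in range(n_hidden):
        model.add(layers.Dense(width, activation="relu"))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```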

4.2.2. Deep convolutional neural network


Convolutional layer

The primary building block of a CNN (see Fig. 5) is the convolutional layer. The layer has a collection of parameters which mainly consists of a set of learnable filters (or kernels). The kernels are simply small receptive fields, but extend through the full depth of the input volume. During the forward pass, every filter is convolved across the width and height of the input volume, and in the process the dot product is computed between the entries of the filter and the input. Hence, a 2-dimensional activation map of that filter is produced.


As a direct result, the network learns to recognize filters that activate whenever some specific type of feature is detected at some position in the input space. For all the filters, the activation maps are then stacked along the depth dimension. This generates the final output volume of the convolutional layer. Therefore, each and every entry in this output volume can also be interpreted as the output of a neuron that considers a small region in the input and shares parameters with neurons in the same activation map.

Fig. 5. Deep CNN model used in the present work.

Max Pooling layer

This is a method used to down-sample images using non-linear operations. It works by partitioning the input image into a collection of discrete, non-overlapping rectangular regions and, for each such region, outputting the maximum.

Intuitively, the exact location of a feature is less important than its rough location relative to other features. Thus, max pooling gives a rough estimation of edges. This is the idea behind the use of pooling in convolutional neural networks. It is common to periodically insert a pooling layer between successive convolutional layers in a CNN architecture (Ciresan et al., 2011).

The feature images obtained after concatenation of the spectrograms are fed to the model as input. The outputs thus obtained are later used to form the ensemble.
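Since only the caption of Fig. 5 survives here, the following Keras sketch uses assumed filter counts and kernel sizes merely to illustrate the convolution/pooling pattern described above, not the paper's exact stack:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dcnn(input_shape, n_classes):
    """Conv2D blocks with periodic max pooling over the composite
    grayscale spectrogram image; filter counts and kernel sizes are
    assumptions standing in for the exact stack of Fig. 5."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),                # e.g. (H, W, 1)
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),                        # keep rough feature locations
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```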

4.2.3. Semi supervised generative adversarial network

The semi-supervised GAN or SSGAN model is an extension of the GAN architecture that involves the simultaneous training of an unsupervised discriminator, a supervised discriminator and a generator. Let us consider a standard classifier for classifying a data point $x$ into one of $N$ possible classes, that is, the labels of the data. This model accepts $x$ as input and outputs an $N$-dimensional vector of logits $\{l_1, \ldots, l_N\}$, which can thereafter be turned into class probabilities by applying the softmax function:

$P_{model}(y = j|x) = \frac{\exp(l_j)}{\sum_{n=1}^{N} \exp(l_n)}$

In supervised learning, the training of such a model is done by minimizing the cross-entropy between the observed outcome and the model predictive distribution $P_{model}(y|x)$. To make the classifier adapt to swift changes in the quality of data samples, we can take a semi-supervised approach to generate fake data and mimic the possible noise and varieties that may be present in real data.

To apply the semi-supervised learning approach with any standard classifier, we simply add samples to our dataset which are generated from the generator $G$ of the GAN. These samples are then labeled with a new "generated" class $y = N + 1$, and we correspondingly increase the dimension of our classifier output from $N$ to $N + 1$, that is, the number of output classes increases by one. We may then use $P_{model}(y = N+1|x)$ to supply the probability that $x$ is fake, corresponding to $1 - D(x)$ in the original GAN framework. The model can also learn from unlabeled data, as long as it knows that the data corresponds to one of the $N$ classes of real data, by maximizing $\log P_{model}(y \in \{1, \ldots, N\}|x)$. The brief working of this model is shown in Fig. 6.

Fig. 6. Working procedure of multi-class SSGAN architecture.

Following the work by Salimans et al. (2016), we have implemented it in our own way. The loss of this multi-language classification framework can be decomposed into the supervised loss:

$Loss_{supervised} = -E_{x,y \sim P_{data}(x,y)} \log P_{model}(y|x, y < N+1)$    (2)


The unsupervised loss is given by:

$Loss_{unsupervised} = -\{E_{x \sim P_{data}(x)} \log[1 - P_{model}(y = N+1|x)] + E_{x \sim G} \log[P_{model}(y = N+1|x)]\}$    (3)

The GAN loss of a discriminator is:

$Loss_{GAN} = -\{E_{x,y \sim P_{data}(x,y)} [\log P_{model}(y|x)] + E_{x \sim G} [\log P_{model}(y = N+1|x)]\}$    (4)

The best solution for minimizing both $Loss_{supervised}$ and $Loss_{unsupervised}$ is to have $\exp[l_j(x)] = c(x)\,p(y=j, x)\ \forall j < N+1$ and $\exp[l_{N+1}(x)] = c(x)\,P_G(x)$ for some arbitrary scaling function $c(x)$. This can easily be deduced from the given forms of $Loss_{supervised}$ and $Loss_{unsupervised}$. The unsupervised loss is thus consistent with the supervised loss, as mentioned by Sutskever et al. (2015). Hence, we can get a clearer estimate of this optimal solution from the data by minimizing these two loss functions clubbed together. In practice, $Loss_{unsupervised}$ will only help if it is not trivial for our classifier to minimize, and thus we need to train $G$ to approximate the data distribution. One way to do this is by training $G$ to minimize the game-value of the GAN model, using the discriminator $D$ defined by our supervised classifier. This approach helps in introducing an interaction between $G$ and the supervised classifier. Empirically, we find that optimizing $G$ using feature matching (Sutskever et al., 2015) works very well for semi-supervised learning; on the other hand, training $G$ using a GAN that prevents learning from isolation does not work at all.

Lastly, it is noted that the classifier with $N + 1$ outputs is over-parameterized, the reason being the inclusion of an extra fake output class. Subtracting a general function $g(x)$ from each output logit, i.e. setting $l_j(x) \leftarrow l_j(x) - g(x)\ \forall j$, does not have any effect on the output of the softmax. This means that we may equivalently fix $l_{N+1}(x) = 0\ \forall x$. In such a case, $Loss_{supervised}$ becomes the standard supervised loss function of our original classifier with $N$ classes. Also, our discriminator $D$ is given by $D(x) = \frac{T(x)}{T(x) + 1}$, where $T(x) = \sum_{n=1}^{N} \exp[l_n(x)]$. The overall SSGAN architecture is shown in the Annexure section (Figure 11). The output of this model is later used for ensemble learning.
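Under the stated $l_{N+1}(x) = 0$ parameterization, where $D(x) = T(x)/(T(x)+1)$, the supervised and unsupervised losses of Eqs. (2) and (3) can be computed as in the TensorFlow sketch below; the function name and batching scheme are our assumptions:

```python
import tensorflow as tf

def ssgan_discriminator_losses(logits_lab, y_lab, logits_unl, logits_fake):
    """Sketch of Eqs. (2)-(3). Each `logits_*` tensor holds the N
    real-class logits; the fake logit l_{N+1} is fixed to 0, so that
    T(x) = sum_n exp(l_n(x)) and D(x) = T(x) / (T(x) + 1)."""
    # Eq. (2): standard cross-entropy over the N real classes.
    loss_sup = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=y_lab, logits=logits_lab))
    # Eq. (3): -log D(x) for real unlabeled x and -log(1 - D(x)) for
    # generated x, written via log-sum-exp for numerical stability:
    # log D(x) = lse - softplus(lse); log(1 - D(x)) = -softplus(lse).
    lse_unl = tf.reduce_logsumexp(logits_unl, axis=-1)
    lse_fake = tf.reduce_logsumexp(logits_fake, axis=-1)
    loss_unsup = (tf.reduce_mean(tf.nn.softplus(lse_unl) - lse_unl)
                  + tf.reduce_mean(tf.nn.softplus(lse_fake)))
    return loss_sup, loss_unsup
```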
4.2.4. Ensemble learning

Usually, the results obtained from a single classifier may not be precise or may lack certainty. For architectures like GANs and CNNs, if the trainable parameters are few enough, that is, the feature vectors have fewer dimensions, they give great results, but they show a degradation in precision if there is a steep increase in feature dimensions, as pointed out by Miao et al. (2018). Similarly, architectures like MLP show an increase in performance if higher dimensional features are fed to them for training. So, the application of fuzzy measures is useful for merging different classifiers to finally give one prediction result. It has been validated that popular fuzzy integral methods such as the Sugeno and Choquet integrals have great applications and have been applied in a wide range of domains, ranging from economics and mathematics to machine learning and pattern recognition (Wang et al., 2015).

Although both of these fuzzy integral methods are popular, the Choquet fuzzy integral has been more widely applied than the Sugeno integral (Krishnan et al., 2015). A Choquet integral can be defined as an aggregation function that simultaneously takes into consideration the importance of a classifier as well as its interaction with other classifiers in terms of output prediction. The definitions of the Choquet integral and fuzzy measures according to Murofushi and Sugeno (1989) are as follows. Let us assume X to be a set of various classifiers and let the power set of X be denoted by P(X).

Definition 1. The fuzzy measure of X is a set function $z : P(X) \rightarrow [0, 1]$. This function satisfies the following conditions:

1. The boundary condition of z: $z(\emptyset) = 0$, $z(X) = 1$
2. For each $A, B \in P(X)$, if $A \subset B$ then $z(A) \le z(B)$, where $z(k)$ is the grade of subjective importance of the classifier set $k$.

The fuzzy singleton measures for each classifier are $z(x_i) = z_i$ and are commonly referred to as densities. Not only must the value of each singleton be calculated, but also the value of the function $z$ for any combination of classifiers. The Sugeno $\lambda$-measure and the fuzzy densities are used to calculate the fuzzy measure of any combination of classifiers. The $\lambda$-measure can be calculated by the following formula:

$\lambda + 1 = \prod_{i=1}^{n} (1 + \lambda z_i), \quad \lambda > -1$    (5)

Definition 2. Let $z$ be the fuzzy measure of $X = \{x_1, x_2, \ldots, x_n\}$. The following equation shows the Choquet integral of a function $f : X \rightarrow R$ and its relation with $z$:

$C_z(f) = \sum_{i=1}^{n} f_i [z(A_i) - z(A_{i-1})]$    (6)

The prediction result of classifier $x_i$ is denoted by $f_i$, and $[z(A_i) - z(A_{i-1})]$ depicts the relative importance of the classifier $x_i$. The fuzzy integral of $f$ with respect to $z$ is the result of the integration.

Implementation of the architecture

The outputs from all the aforesaid architectures act as input for this fuzzy ensemble model. The class-wise confidence probabilities and the obtained confusion matrices are used for getting the values of the various parameters involved in calculating the ensemble model.

As mentioned in Siami et al. (2019), suppose that in a sample data space S, data are divided into classes by a classifier (E). A classifier index is specified by $i$ ($i = 1, \ldots, P$); $j$ is the class index ($j = 1, \ldots, M$); and $k$ is the instance index ($k = 1, \ldots, N$). For the $k$th sample, the prediction result of the $i$th classifier is $[g_{i1}(k), g_{i2}(k), \ldots, g_{iM}(k)]$, where $g_{ij}(k)$ is the probability result of the $i$th classifier, showing the probability of the $k$th data point belonging to class $j$. Collecting the $g_{ij}(k)$ column-wise as $g_j(s_k)$, this can be interpreted as $g_j : S \rightarrow [0, 1]$, $g_j(s_k) = [g_{1j}(k), g_{2j}(k), \ldots, g_{Pj}(k)]^T$. For sample $s_k$, we obtain $g_j(s_k)$ as the degree of support provided by each classifier with respect to the $j$th class. In addition to $g_j(s_k)$, the Choquet fuzzy integral operates on the fuzzy measures ($z$). This includes the fuzzy densities as well as the fuzzy measure of any possible combination of classifiers. By calculating the Choquet integral of $g_j(s_k)$ with respect to $z$, we obtain the degree of support given by the ensemble model with respect to the $j$th class for sample $s_k$. The output class $c_j$ for the sample $s_k$ is the class with the largest integral value $C_z(f)$.
for training. So application of fuzzy measures is useful for merging 5. Results and analysis
different classifiers to finally give one prediction result. It has been
validated that popular fuzzy integral methods such as Sugeno and In this section we have provided a detailed analysis of our findings
Choquet have great applications and been applied in a wide range of from our experiments using the aforementioned architecture.
domains ranging from economics, mathematics to machine learning
and pattern recognition (Wang et al., 2015). 5.1. Evaluation metrics
Although both of these fuzzy integral methods are popular, Cho-
quet fuzzy integral has been more widely applied than Sugeno inte- For analyzing the performance of our model on various datasets,
grals (Krishnan et al., 2015). A Choquet integral can be defined as we have considered three standard performance metrics namely Pre-
an aggregation function that simultaneously keeps into consideration cise, Recall and F1-score values with their corresponding class support
the importance of a classifier as well as its interaction with other division.
classifiers in terms of output prediction. The definition of Choquet Precision is defined as:
integral and fuzzy measures according to Murofushi and Sugeno (1989) 𝑇𝑃
are as follows. Let us assume X to be a set of various classifiers and the Precision = (7)
𝑇𝑃 + 𝐹𝑃
power set of X be denoted by P(X). Recall is defined as:
𝑇𝑃
Definition 1. The fuzzy measure of X is a set function 𝑧 ∶ 𝑃 (𝑋) → Recall = (8)
𝑇𝑃 + 𝐹𝑁
[0, 1]. This function satisfies the following conditions:
Here, TP (True Positive) = Number of audio files correctly classified
1. The boundary of z is : 𝑧(𝜙) = 0, 𝑧(𝑋) = 1 into corresponding language classes


FP (False Positive) = Number of audio files classified as belonging to a language class to which they do not belong
FN (False Negative) = Number of audio files classified as not belonging to a language class to which they actually belong

F1-score is defined as:

$F1\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall}$    (9)

Support for a language class is defined as the number of audio files that lie in that language class.
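For reference, the per-class values of Eqs. (7)-(9) along with the supports can be obtained in a single call; y_true and y_pred below are hypothetical label arrays standing in for the test labels and model predictions:

```python
from sklearn.metrics import classification_report

# Hypothetical label arrays; in the paper's setting these would be the
# test-set languages and the FuzzyGCP predictions.
y_true = ["Hindi", "Bangla", "Tamil", "Hindi"]
y_pred = ["Hindi", "Bangla", "Hindi", "Hindi"]

# Per-class Precision, Recall, F1-score and Support, in the same
# format as Tables 3, 5, 7 and 9.
print(classification_report(y_true, y_pred, digits=2))
```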

5.2. Effect of sampling duration on classification accuracy

As can be seen from Fig. 7, there is a direct proportionality between the length of the sampling duration and the language identification accuracy. A longer duration tends to give better results, and this can be justified by the fact that a longer sampling duration gives more information, i.e., features to be learned by the model for a proper representation of the data, which eventually helps the classifiers to distinguish the different classes of data.

Fig. 7. SLID accuracy attained for various datasets with respect to their sampling duration.

5.3. Overall results

All the datasets are divided into train, validation and test sets in the ratio 70:20:10. The corresponding identification accuracies of our model on the test sets are shown in Table 1.

We came across many works where both training and testing are done on the whole dataset. This may lead to impressive results on the given dataset, but when tested on new data, it may lead to a drastic degradation in performance. In contrast, our results are evaluated on a held-out test set which is fed as completely new data to our classification model for testing purposes.

The identification accuracies attained by different state-of-the-art feature vectors like Parallel Phone Recognition and Language Modeling (PPRLM) (Zissman & Singer, 1994), i-vector (Snyder et al., 2015) and x-vector (Snyder et al., 2018b) over the four datasets are shown in Table 1. Although these feature vectors find their use mainly for speaker recognition purposes, by considering each speaker as a language class we have tried to give a fair comparison of the strategies.

Table 1
Comparison of test accuracies of our proposed FuzzyGCP model with different state-of-the-art feature vectors and language models on four SLID datasets.

Dataset          PPRLM    i-vector  x-vector  FuzzyGCP model
IIIT Hyderabad   81.32%   79.87%    84.47%    95.00%
IIT Madras       72.33%   74.55%    74.65%    81.51%
VoxForge         57.12%   56.53%    59.67%    68.00%
MaSS             86.21%   84.73%    88.57%    98.75%

For carrying out the comparative analysis, we have used the 'Kaldi' framework (Povey et al., 2011) for computing i-vectors and x-vectors, and the project by Srivastava et al. (2017) for computing results for PPRLM (codes). Fortunately, Kaldi has support for pretrained x-vectors (here) and a Probabilistic Linear Discriminant Analysis (PLDA) backend. All scripts have been readily made available by the developers of the Kaldi project, the commands for which have been included in the Annexure. The same goes for the other sets of features. The number of feature units in the input layer is the same as the number of features extracted for the Time Delay Neural Network (TDNN) used for x-vector extraction. The whole procedure for x-vector feature generation is shown in Fig. 8. The X-Y-Z written in the blocks represent the number of filters, the filter window size and the dilation number at the corresponding layers respectively. After extraction of the embeddings, PLDA is used as backend scoring by the standard Kaldi x-vector project.

Fig. 8. Schematic diagram showing the baseline x-vector system.

For the i-vector calculation, given a speech utterance of the language, the channel dependent GMM-supervector (Campbell et al., 2006) $m_l$ can be written as:

$m_l = m + T w_l$    (10)

where $m$ is the GMM-supervector of the universal background model (UBM) (Reynolds et al., 2000), which is both language and channel independent, $T$ is a low-rank total variability matrix, and the posterior mean of $w_l$ is a low-dimensional vector called the i-vector.

From the speech samples, 13-dimensional MFCCs and various other spectral features, along with energy and their delta and acceleration coefficients, forming 38-dimensional acoustic features, are extracted. The GMM consists of 64 Gaussian mixtures, thus resulting in a feature vector of 2432 ($D$) dimensions. Given an utterance with a $D$-dimensional acoustic vector sequence $X = \{x_1, \ldots, x_T\}$ belonging to language $l$, we can write

$T = [t_1, \ldots, t_R]$    (11)

where $t_r$ is an $MD \times 1$ column vector.

It can be observed that the x-vector based feature outperforms the i-vector based feature, the probable reason being that the x-vectors exploit the large increase in the amount of in-domain data better than the (acoustic) i-vectors. However, all these state-of-the-art feature vectors show comparatively poorer results. The main reasons are that these feature vectors are used in scenarios where the number of channels is very low, and that enough resources are not available to perform an in-depth analysis of hyperparameters for comparison, as it would be out of the scope of the present work.

The distribution of sounds, shown in Table 2, helps in discriminating the Indo-Aryan and Dravidian languages from others through their specific acoustic, phonetic and prosody characteristics (Aarti & Kopparapu, 2018).

5.3.1. IIIT Hyderabad

It can be seen that the performance of our model in classifying Tamil is not as good as that for the other languages. It has been


Table 2
Sound distribution found in different Indic languages.

Language   Vowels  Diphthongs  Liquids  Glides  Consonants  Nasals  Stops  Fricatives  Affricates
Kannada    13      2           3        2       34          5       16     5           4
Malayalam  10      5           4        2       37          6       16     4           3
Hindi      10      2           4        2       37          5       20     4           4
Tamil      10      2           3        2       18          5       4      4           2
Marathi    12      2           3        2       45          3       16     3           7
Bangla     14      15          3        3       34          4       20     2           4
Telugu     11      2           3        2       36          3       15     5           4

Fig. 9. Fake image loss.

Fig. 10. Real image loss.

Fig. 11. Classification loss.

Fig. 12. SSGAN accuracy.

Table 3
Language-wise performance of our FuzzyGCP model on IIIT Hyderabad dataset.

Class      Precision  Recall  F1-score  Support
Bangla     0.95       0.95    0.95      104
Kannada    0.96       0.94    0.95      101
Marathi    0.92       0.90    0.91      94
Tamil      0.91       0.89    0.90      92
Malayalam  0.97       0.96    0.97      109
Telugu     0.96       0.96    0.96      96
Hindi      0.98       0.95    0.97      104
Macro avg  0.95       0.95    0.95      700

observed that there is a presence of several keywords like 'Malayalam', 'Hindi', 'India' etc., which are also found in the Malayalam and Tamil audio clips. The presence of these keywords adds confusion, thus resulting in mis-classification.

Additionally, in the Marathi audio clips there are words like ''Engreji'', meaning English, which are also frequently present in the Bangla language class. This, in turn, leads to comparatively poor identification results as well. The class-wise distribution of the different evaluation metrics of the multi-lingual approach is shown in Table 3. Our model performs best for the Hindi and Malayalam language classes of this dataset.

The worst performances are given for the Tamil and Marathi language classes. A comparison of the performance of our FuzzyGCP model with the state-of-the-art models (Anjana & Poorna, 2018) is shown in Table 4.

Table 4
Comparison of performance of our proposed model with other state-of-the-art models on IIIT Hyderabad dataset.

Classifier  Features used       Accuracy (%)  Precision (%)  Recall (%)
SVM         MFCC                78            24             53.3
SVM         Formants            86            62.47          60
SVM         Both                84            52.17          46.66
LDA         MFCC                80.02         31.51          57.14
LDA         Formants            93.06         80.57          75.71
LDA         Both                93.88         84.24          78.57
Our model   Extracted features  95.00         95.00          95.00

The SSGAN training metrics on the IIIT Hyderabad dataset are shown in Figs. 9-12. Fig. 9 shows the iterative training loss of the fake image generator, while Fig. 10 shows the iterative training loss of the real image discriminator. Fig. 11 shows the iterative training loss of the supervised classifier, and Fig. 12 shows the iterative training accuracy of the SSGAN's supervised classifier.

5.3.2. IIT Madras

The class-wise distribution of the different evaluation metrics of the multi-lingual approach on the IIT Madras dataset is shown in Table 5. Our model performs best for the Marathi and Telugu language classes. However, it is to be noted from Table 5 that the worst performances are obtained by our model for the Hindi language, followed by the Bangla and Tamil languages.


Table 5
Class-wise performance measures of our proposed FuzzyGCP model on IIT Madras dataset.

Class      Precision  Recall  F1-score  Support
English    0.76       0.80    0.78      100
Marathi    0.84       0.95    0.89      100
Tamil      0.79       0.73    0.76      100
Bangla     0.72       0.79    0.75      100
Telugu     0.90       0.72    0.80      100
Hindi      0.75       0.71    0.73      100
Macro avg  0.79       0.78    0.79      600

Table 7
Performance of our FuzzyGCP model for individual language classes on VoxForge dataset.

Class       Precision  Recall  F1-score  Support
French      0.64       0.67    0.65      188
German      0.68       0.59    0.63      191
Italian     0.66       0.75    0.70      210
Portuguese  0.70       0.83    0.76      202
Spanish     0.67       0.56    0.61      209
Macro avg   0.67       0.68    0.67      1000

The class balance, that is, the division of support values for each class, leads to proper similarity in the Recall values. It has already been mentioned that most of the Indic languages originated from the early Brahmi script. Out of all these languages, Hindi is semantically the most similar language to Sanskrit, followed by Bangla. Thus, we can find a transitive relation here.

Let us represent the language classes Hindi, Sanskrit and Bangla by p, q, r respectively, and let L denote the set of Indic languages. A relation R on the set L is a transitive relation if, for all $p, q, r \in L$, p R q and q R r imply p R r. Written logically:

$\forall p, q, r \in L : (p\,R\,q \wedge q\,R\,r) \Rightarrow p\,R\,r$    (12)

where p R q is the infix notation for $(p, q) \in R$.

Thus, we see that p and r are related to each other. This similarity in grammar and other linguistic features leads to more mis-classification between these language classes compared to others.

A performance comparison of our proposed model with other methods reported by Jog et al. (2018) on 6 different languages of the IIT Madras dataset, with the common languages being Hindi, Marathi and Telugu, is shown in Table 6. As we can see, the performance of the works by Jog et al. (2018) and Sarkar et al. (2013) is much better compared to our model. The probable significant reasons for this may be, firstly, a difference in the set of languages used for experimentation; perhaps those language classes show better distinguishing features. Secondly, they have trained and tested on the whole dataset, which is bound to give better results as the classifier has got sufficient samples to learn efficiently.

In Table 6, the language groups consist of the following languages:

• 1 - Kannada, Malayalam, Marathi, Telugu, Hindi, and Manipuri
• 2 - English, Marathi, Tamil, Bangla, Telugu, and Hindi

5.3.3. VoxForge

The audio samples found in this database are very noisy. Jargon (merriam webmaster) is a type of linguistic shortcut which helps in quicker and clearer communication, if everyone listening understands the terminology. However, if the listeners have different definitions of the terminologies, then jargon becomes noisy. In the VoxForge dataset, an adequate number of jargons are found. Additionally, the whole dataset is vast, consisting of samples with variable recording durations and sampling rates. So, extracting the best quality data was difficult for us from the computational perspective. However, our model performs the best for the Portuguese and Italian language classes of this dataset and worst for the Spanish and German language classes respectively, as seen in Table 7. As we can see from Table 8, the state-of-the-art models fail considerably when it comes to multi-lingual classification on this dataset.

5.3.4. MaSS

Hungary and Romania are neighboring countries. So, it is obvious that both countries have linguistic and cultural influences over each other. Over the years there has been an extension of deeper influence pertaining to the exchange of even basic characteristics of a language, such as morphology and grammar. This influence is the prime reason for the relatively poor classification of the Hungarian and Romanian language classes compared to others. However, the best performance of our model has been found in the case of this dataset. As seen in Table 9, our model performs best for the Finnish, Spanish and French language classes of this dataset. We could not show a comparison with other models, as no spoken language identification work has been done on this dataset so far.

5.4. Bi-lingual results

Bi-lingual results are shown for the Indic datasets due to the presence of dependency between the different languages. It can be seen that the bi-lingual classification results are much higher compared to the multi-lingual approach. This pertains to the fact that binary classification is much more robust to noise and dependency in features compared to the scenario where the classifier needs to distinguish among more language classes.
Table 6
Comparison of performance of our FuzzyGCP model with some existing works on IIT Madras dataset.

Method                      Features                                Language group  Test Accuracy / Error Rates
Jothilakshmi et al. (2012)  MFCC                                    1               80.56%
Singh et al. (2013)         Prosodic features                       1               EER is 7.46%
Aarti and Kopparapu (2017)  Delta MFCC, Double Delta MFCC           1               Less than 45% for different windows
Madhu et al. (2017)         Phonotactic, Prosodic features          1               72% and 68%
Sarkar et al. (2013)        MFCC                                    1               95.21%
Verma and Khanna (2013)     MFCC                                    1               81%
Jog et al. (2018)           Cochleagram Based Texture Descriptors   1               95.36%
Proposed method             Extracted features                      2               81.51%


Table 8
Performance comparison of FuzzyGCP model with other state-of-the-art models on VoxForge dataset.

Model       Precision (%)  Recall (%)  F1-score (%)
kNN         44.33          36.04       40
SVM         41.44          44.67       44
Extratrees  37.23          34.5        35.81
Our model   67             68          67

Table 9
Language-wise performance measures attained by our FuzzyGCP model on MaSS dataset.

Class      Precision  Recall  F1-score  Support
Basque     0.93       1.00    0.96      25
Russian    0.94       1.00    0.97      15
French     1.00       0.97    0.98      30
Romanian   1.00       0.88    0.93      24
Hungarian  0.96       0.92    0.94      29
Spanish    0.97       1.00    0.98      28
English    0.94       1.00    0.97      29
Finnish    1.00       1.00    1.00      28
Macro avg  0.98       0.98    0.98      208

Table 10
Precision values of class-wise binary classification on IIIT Hyderabad dataset.

Classes    Bangla  Kannada  Marathi  Tamil  Malayalam  Telugu  Hindi
Bangla     X       0.98     0.95     0.94   0.97       0.98    0.94
Kannada    0.98    X        0.98     0.92   0.97       1.00    0.99
Marathi    0.95    0.98     X        0.98   0.99       1.00    1.00
Tamil      0.94    0.92     0.98     X      0.97       0.92    0.96
Malayalam  0.97    0.97     0.99     0.97   X          1.00    0.98
Telugu     0.99    1.00     1.00     0.92   1.00       X       0.99
Hindi      0.94    0.99     1.00     0.96   0.98       0.99    X

Table 11
Recall values of class-wise binary classification on IIIT Hyderabad dataset.

Classes    Bangla  Kannada  Marathi  Tamil  Malayalam  Telugu  Hindi
Bangla     X       0.97     0.95     0.94   0.97       0.99    0.94
Kannada    0.97    X        0.98     0.92   0.97       1.00    0.98
Marathi    0.95    0.98     X        0.97   0.98       1.00    0.99
Tamil      0.94    0.92     0.97     X      0.97       0.92    0.96
Malayalam  0.97    0.97     0.98     0.97   X          1.00    0.97
Telugu     0.99    1.00     1.00     0.92   1.00       X       0.99
Hindi      0.94    0.98     0.99     0.96   0.97       0.99    X

Table 12
F1-scores of class-wise binary classification on IIIT Hyderabad dataset.

Classes    Bangla  Kannada  Marathi  Tamil  Malayalam  Telugu  Hindi
Bangla     X       0.97     0.95     0.93   0.97       0.99    0.93
Kannada    0.97    X        0.98     0.92   0.97       1.00    0.98
Marathi    0.95    0.98     X        0.98   0.99       1.00    0.99
Tamil      0.93    0.92     0.98     X      0.97       0.91    0.96
Malayalam  0.97    0.97     0.99     0.97   X          1.00    0.97
Telugu     0.99    1.00     1.00     0.91   1.00       X       0.99
Hindi      0.93    0.98     0.99     0.96   0.97       0.99    X

Table 13
Precision values of class-wise binary classification on IIT Madras dataset.

Classes  English  Marathi  Tamil  Bangla  Telugu  Hindi
English  X        0.93     0.75   0.61    0.73    0.72
Marathi  0.93     X        0.95   0.87    1.00    1.00
Tamil    0.75     0.95     X      0.89    0.89    0.98
Bangla   0.61     0.87     0.89   X       0.90    0.59
Telugu   0.73     1.00     0.89   0.90    X       0.92
Hindi    0.72     1.00     0.98   0.59    0.92    X

Table 14
Recall values of class-wise binary classification on IIT Madras dataset.

Classes  English  Marathi  Tamil  Bangla  Telugu  Hindi
English  X        0.92     0.74   0.60    0.72    0.71
Marathi  0.92     X        0.95   0.84    1.00    1.00
Tamil    0.74     0.95     X      0.88    0.88    0.97
Bangla   0.60     0.84     0.88   X       0.90    0.58
Telugu   0.72     1.00     0.88   0.90    X       0.90
Hindi    0.71     1.00     0.97   0.58    0.90    X

Table 15
F1-scores of class-wise binary classification on IIT Madras dataset.

Classes  English  Marathi  Tamil  Bangla  Telugu  Hindi
English  X        0.92     0.74   0.60    0.72    0.71
Marathi  0.92     X        0.95   0.84    1.00    1.00
Tamil    0.74     0.95     X      0.88    0.88    0.97
Bangla   0.60     0.84     0.88   X       0.89    0.58
Telugu   0.72     1.00     0.88   0.89    X       0.90
Hindi    0.71     1.00     0.97   0.58    0.90    X
The precision, recall and F1-score values for the bi-lingual binary
classification achieved by our proposed model on IIT Madras dataset
are shown in Tables 13, 14 and 15 respectively. As it can be seen that
5.4.1. Bi-lingual classification on IIIT Hyderabad dataset the binary classification of Hindi and Bangla shows the worst perfor-
The precision, recall and F1-score values for the bi-lingual binary mance (58%), thus proving our earlier analysis to be true. Just as seen
classification achieved by our proposed model are shown in Tables 10, before, the proposed model achieves 100% F1-score for the (Marathi,
11 and 12 respectively. The achievement of having F1-score values Telugu) language pair proving them to be the most distinguishable
of 100% for the language pairs (Kannada, Telugu), (Marathi, Telugu) among the Indic languages considered here. Also, the (Marathi, Hindi)
language pair gives F1-score of 100%. On the other hand, the highest
and (Malayalam, Telugu) with Telugu being the common language,
mis-classification is found for the (English, Bangla) and (English, Hindi)
shows that Telugu has significant features that clearly distinguish it
language pairs. Probably we fail to extract distinguishable features for
from all these languages. However, it can be seen that the language
the same.
pair (Malayalam, Telugu) has a lower F1-score value, showing com-
paratively more similarity. As discussed earlier, both the Malayalam 5.5. Tri-lingual results
and Tamil languages have some terms common between them. So, this
pair attains a F1-score of 0.97 which is much less than the collective We have also shown the classification results on tri-lingual scenarios
average. Similarly, the results for (Bangla, Hindi) language pair is also of Indic datasets as we can see that in the neighboring states many
found to be less (0.94) as compared to global average. people are used to speak multiple languages.
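The tri-lingual scores discussed in the following subsections can be obtained in the same spirit as the pair-wise ones. The sketch below again assumes a hypothetical helper, train_triplet_model, that fits a three-class model for a given language triplet; reporting per-class recall next to the macro F1 is what exposes failures such as the 0% Bangla recall noted in Section 5.5.1.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import f1_score, recall_score

def triplet_scores(X_test, y_test, train_triplet_model, n_classes):
    """Macro F1 and per-class recall for every 3-language subset."""
    results = {}
    for triplet in combinations(range(n_classes), 3):
        mask = np.isin(y_test, triplet)        # keep the three languages only
        model = train_triplet_model(triplet)   # hypothetical 3-class trainer
        y_pred = model.predict(X_test[mask])
        macro_f1 = f1_score(y_test[mask], y_pred, average="macro")
        # A zero here means one language is never detected within the triplet.
        per_class_recall = recall_score(
            y_test[mask], y_pred, labels=list(triplet), average=None)
        results[triplet] = (macro_f1, per_class_recall)
    return results
```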

Fig. 13. Tri-lingual classification F1-scores for various language class triplets in IIIT Hyderabad dataset.

Fig. 14. Tri-lingual classification F1-scores for various language class triplets in IIT Madras dataset.
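Bar charts in the style of Figs. 13 and 14 can be regenerated from such triplet-wise results with a few lines of matplotlib. The sketch below assumes that results maps each triplet of language codes to a (macro F1, per-class recall) pair, as in the earlier snippet, and that lang_codes is the code-to-language legend mapping; it is illustrative only.

```python
import matplotlib.pyplot as plt

def plot_triplet_f1(results, lang_codes):
    """Bar chart of macro F1 per language triplet (cf. Figs. 13 and 14)."""
    triplets = sorted(results)
    f1s = [results[t][0] for t in triplets]
    labels = ["(%s)" % ",".join(str(c) for c in t) for t in triplets]
    plt.figure(figsize=(12, 4))
    plt.bar(range(len(triplets)), f1s)
    plt.xticks(range(len(triplets)), labels, rotation=90)
    plt.ylabel("F1-score")
    plt.title("Tri-lingual classification F1-scores; legend: " +
              ", ".join("%d=%s" % (i, l) for i, l in enumerate(lang_codes)))
    plt.tight_layout()
    plt.show()
```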

5.5.1. Tri-lingual classification on IIIT Hyderabad dataset
The tri-lingual classification results on IIIT Hyderabad dataset are shown in Fig. 13. The number triplets on the X-axis correspond to the equivalent language codes as shown in the legend. For example, the triplet (0,1,2) corresponds to the languages Bangla, Kannada and Marathi respectively. All the triplets where the (Bangla, Hindi) and (Tamil, Malayalam) language pairs are present together have shown lower F1-scores compared to others. The reasons for the same have been explained before, and the results here prove the correctness of that analysis. The triplet (Bangla, Kannada, Telugu) shows an abnormal dip in F1-score: the model is unable to detect any Bangla samples in the presence of the other two languages and gives a 0% recall value for that class. It is a known fact that distinguishing the Kannada language from the Telugu language is very difficult. The situation might be that our model assigns too much weight to learning features that could help it to classify both the Kannada and Telugu language classes.

5.5.2. Tri-lingual classification on IIT Madras dataset
The tri-lingual classification results on IIT Madras dataset are shown in Fig. 14. The number triplets on the X-axis correspond to the equivalent language codes as shown in the legend. For example, the triplet (0,1,2) corresponds to the languages English, Marathi and Tamil respectively. All the triplets where the (Bangla, Hindi) and (Tamil, Bangla) language pairs are present together have shown lower F1-scores compared to others, proving the correctness of our analysis discussed above. It can be seen that the tri-lingual classification F1-scores are much better in comparison to the multi-lingual approach. The reason is surely the larger relative differences in characteristic features, an immediate result of the decreased diversity in language classes.
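As a recap before concluding: the final language prediction behind all of the scores above is obtained by fusing the per-class confidences of the three base classifiers with a Choquet integral. The following is a minimal sketch of one way such a fusion can be realised; the fuzzy measure G is an illustrative assumption (any monotone measure valued 0 on the empty set and 1 on the full set would do), not the measure used in our experiments, and the source labels are placeholders.

```python
import numpy as np

# Illustrative (assumed) fuzzy measure over subsets of the base classifiers.
G = {
    frozenset(): 0.0,
    frozenset({"ddmlp"}): 0.35,
    frozenset({"dcnn"}): 0.40,
    frozenset({"ssgan"}): 0.30,
    frozenset({"ddmlp", "dcnn"}): 0.70,
    frozenset({"ddmlp", "ssgan"}): 0.60,
    frozenset({"dcnn", "ssgan"}): 0.65,
    frozenset({"ddmlp", "dcnn", "ssgan"}): 1.0,
}

def choquet(scores):
    """Choquet integral of per-source confidences {source: value in [0, 1]}."""
    order = sorted(scores, key=scores.get, reverse=True)   # highest first
    vals = [scores[s] for s in order] + [0.0]
    coalition, total = set(), 0.0
    for k, src in enumerate(order):
        coalition.add(src)
        # Weight each successive confidence drop by the coalition's measure.
        total += (vals[k] - vals[k + 1]) * G[frozenset(coalition)]
    return total

def fuse_predict(probs):
    """probs: {source: softmax vector over language classes (np.ndarray)}."""
    n_classes = len(next(iter(probs.values())))
    fused = [choquet({s: float(p[c]) for s, p in probs.items()})
             for c in range(n_classes)]
    return int(np.argmax(fused))
```

Compared with a plain weighted average, defining the measure on coalitions lets agreement between particular classifiers count for more (or less) than the sum of their individual weights.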

6. Conclusion

In this paper, we have proposed a method for identifying spoken languages from speech signals. In doing so, we have designed a deep learning based ensemble architecture which we have named FuzzyGCP. The use of an SSGAN, together with the formation of an ensemble architecture, is a new approach in this domain. Mapping the audio classification problem to an image classification problem by making use of spectrograms is one of the key aspects of this architecture. A heterogeneous ensemble consisting of a conventional DDMLP as a classifier using numeric features, along with a DCNN and an SSGAN as classifiers for image based features, proved to be quite a useful approach, as observed from the results reported in Section 5. The diversity of the datasets we have considered, consisting of both Indic and foreign languages, proves the robustness and versatility of FuzzyGCP. A multi-lingual classification approach is always a challenging task compared to its bi-lingual and tri-lingual counterparts in the domain of SLID, and it has been quite successfully accomplished here. However, challenges like the inter-dependency of Indic languages among themselves, the presence of a common set of words across the languages and demographic influence on the languages need to be addressed with better understanding.

As for future scope, improvement may be achieved by using some feature selection algorithms, and by using less computationally intensive architectures. Usage of sequential models like Gated Recurrent Units (GRUs), LSTMs etc. can be explored to form an ensemble. Also, an in-depth analysis of the x-vector and i-vector based models, with proper tuning of hyperparameters on the datasets explored in this work, can be considered. Besides, there exist many other speech corpora which can be used for evaluation. The immediate benefit of a proper multi-lingual SLID system is that other applications, such as speaker profile generation, automatic translation switching frameworks and easier understanding in tele-medicine, can be developed based on the output of this model.

CRediT authorship contribution statement

Avishek Garain: Software, Methodology, Formal analysis, Writing - original draft. Pawan Kumar Singh: Conceptualization, Validation, Resources, Data curation, Writing - review & editing. Ram Sarkar: Investigation, Supervision, Project administration.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

We would like to thank the CMATER research laboratory of the Computer Science and Engineering Department, Jadavpur University, India for providing us the infrastructural support.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.eswa.2020.114416.

References

Aarti, B., & Kopparapu, S. K. (2017). Spoken Indian language classification using artificial neural network—An experimental study. In 2017 4th international conference on signal processing and integrated networks (pp. 424–430). IEEE.
Aarti, B., & Kopparapu, S. K. (2018). Spoken Indian language identification: A review of features and databases. Sādhanā, 43(4), 53.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., ... Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems.
Albadr, M. A. A., Tiun, S., Ayob, M., & AL-Dhief, F. T. (2019). Spoken language identification based on optimised genetic algorithm–extreme learning machine approach. International Journal of Speech Technology, 22(3), 711–727.
Anjana, J., & Poorna, S. (2018). Language identification from speech features using SVM and LDA. In 2018 International conference on wireless communications, signal processing and networking (pp. 1–4). IEEE.
Baby, A., Thomas, A., L, N., & Consortium, T. (2016). Resources for Indian languages.
Berkson, K. H. (2013). Phonation types in Marathi: An acoustic investigation (Ph.D. thesis). University of Kansas.
Bhaskararao, P. (2011). Salient phonetic features of Indian languages in speech technology. Sadhana, 36(5), 587–599.
Black, A. W. (2019). CMU wilderness multilingual speech dataset. In 2019 IEEE international conference on acoustics, speech and signal processing (pp. 5971–5975).
Boito, M. Z., Havard, W. N., Garnerin, M., Le Ferrand, É., & Besacier, L. (2020). MaSS: A large and clean multilingual corpus of sentence-aligned spoken utterances extracted from the Bible. In Language resources and evaluation conference.
Caldwell, R. (1875). A comparative grammar of the Dravidian or South-Indian family of languages. Trübner.
Campbell, W. M., Sturim, D. E., Reynolds, D. A., & Solomonoff, A. (2006). SVM based speaker verification using a GMM supervector kernel and NAP variability compensation. In 2006 IEEE international conference on acoustics, speech and signal processing proceedings (vol. 1) (p. I). IEEE.
Chollet, F. (2015). Keras. https://keras.io.
Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., & Schmidhuber, J. (2011). Flexible, high performance convolutional neural networks for image classification. In Twenty-second international joint conference on artificial intelligence.
Dehak, N., Torres-Carrasquillo, P. A., Reynolds, D., & Dehak, R. (2011). Language recognition via i-vectors and dimensionality reduction. In Twelfth annual conference of the international speech communication association.
Ferrer, L., Lei, Y., McLaren, M., & Scheffer, N. (2014). Spoken language recognition based on senone posteriors. In Fifteenth annual conference of the international speech communication association.
Gu, L., & Rose, K. (2001). Perceptual harmonic cepstral coefficients for speech recognition in noisy environment. In 2001 IEEE international conference on acoustics, speech, and signal processing proceedings (Cat. No. 01CH37221) (vol. 1) (pp. 125–128). IEEE.
Haykin, S. (1994). Neural networks: A comprehensive foundation. Prentice Hall PTR.
Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America, 87(4), 1738–1752.
Hermansky, H., & Morgan, N. (1994). RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 2(4), 578–589.
Imai, S. (1983). Cepstral analysis synthesis on the mel frequency scale. In IEEE international conference on acoustics, speech, and signal processing (vol. 8) (pp. 93–96). IEEE.
Jog, A. H., Jugade, O. A., Kadegaonkar, A. S., & Birajdar, G. K. (2018). Indian language identification using cochleagram based texture descriptors and ANN classifier. In 2018 15th IEEE India council international conference (pp. 1–6). IEEE.
Jothilakshmi, S., Ramalingam, V., & Palanivel, S. (2012). A hierarchical language identification system for Indian languages. Digital Signal Processing, 22(3), 544–553.
Keane, E. (2004). Tamil. Journal of the International Phonetic Association, 34(1), 111–116. http://dx.doi.org/10.1017/S0025100304001549.
Krishnan, A. R., Kasim, M. M., & Bakar, E. M. N. E. A. (2015). A short survey on the usage of Choquet integral and its associated fuzzy measure in multiple attribute analysis. Procedia Computer Science, 59, 427–434. http://dx.doi.org/10.1016/j.procs.2015.07.560. http://www.sciencedirect.com/science/article/pii/S187705091502089X.
Lee, R.-H. A., & Jang, J.-S. R. (2018). A syllable structure approach to spoken language recognition. In T. Dutoit, C. Martín-Vide, & G. Pironkov (Eds.), Statistical language and speech processing (pp. 56–66). Cham: Springer International Publishing.
Li, H., Ma, B., & Lee, K. A. (2013). Spoken language recognition: From fundamentals to practice. Proceedings of the IEEE, 101(5), 1136–1159.
Logan, B. (2000). Mel frequency cepstral coefficients for music modeling. In ISMIR (vol. 270) (pp. 1–11).
Madhu, C., George, A., & Mary, L. (2017). Automatic language identification for seven Indian languages using higher level features. In 2017 IEEE international conference on signal processing, informatics, communication and energy systems (pp. 1–6). IEEE.
McFee, B., McVicar, M., Raffel, C., Liang, D., Nieto, O., Moore, J., Ellis, D., Repetto, D., Viktorin, P., Santos, J. F., & Holovaty, A. (2015). librosa: v0.4.0. Zenodo. http://dx.doi.org/10.5281/zenodo.18369.
merriam webmaster (n.d.). Jargon. Merriam-Webster. https://www.merriam-webster.com/dictionary/jargon.
Miao, X., McLoughlin, I., & Yan, Y. (2019). A new time-frequency attention mechanism for TDNN and CNN-LSTM-TDNN, with application to language identification. In INTERSPEECH (pp. 4080–4084).
Miao, X., McLoughlin, I., Yao, S., & Yan, Y. (2018). Improved conditional generative adversarial net classification for spoken language recognition. In 2018 IEEE spoken language technology workshop (pp. 98–104).
Murofushi, T., & Sugeno, M. (1989). An interpretation of fuzzy measures and the Choquet integral as an integral with respect to a fuzzy measure. Fuzzy Sets and Systems, 29(2), 201–227. http://dx.doi.org/10.1016/0165-0114(89)90194-2.
mustgo.com (n.d.). Kannada language - structure, writing and alphabet - MustGo. https://www.mustgo.com/worldlanguages/kannada/.
O'Shaughnessy, D. (1988). Linear predictive coding. IEEE Potentials, 7(1), 29–32.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., & Dubourg, V. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825–2830.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., & Vesely, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society. IEEE Catalog No.: CFP11SRW-USB.
Prahallad, K., Elluru, N. K., Keri, V., Rajendran, S., & Black, A. W. (2012). The IIIT-H Indic speech databases. In INTERSPEECH.
Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1–3), 19–41.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training GANs. In Proceedings of the 30th international conference on neural information processing systems (pp. 2234–2242). Red Hook, NY, USA: Curran Associates Inc.
Sarkar, S., Rao, K. S., Nandi, D., & Kumar, S. S. (2013). Multilingual speaker recognition on Indian languages. In 2013 Annual IEEE India conference (pp. 1–5). IEEE.
Shukla, S., & Mittal, G. (2019). Spoken language identification using ConvNets. In European conference on ambient intelligence (pp. 252–265). Springer.
Siami, M., Naderpour, M., & Lu, J. (2019). A Choquet fuzzy integral vertical bagging classifier for mobile telematics data analysis. In 2019 IEEE international conference on fuzzy systems (pp. 1–6).
Singh, O. P., Haris, B., Sinha, R., Chettri, B., & Pradhan, A. (2013). Sparse representation based language identification using prosodic features for Indian languages. In 2013 Annual IEEE India conference (pp. 1–5). IEEE.
Snyder, D., Garcia-Romero, D., McCree, A., Sell, G., Povey, D., & Khudanpur, S. (2018). Spoken language recognition using x-vectors. In Odyssey (pp. 105–111).
Snyder, D., Garcia-Romero, D., & Povey, D. (2015). Time delay deep neural network-based universal background models for speaker recognition. In 2015 IEEE workshop on automatic speech recognition and understanding (pp. 92–97). IEEE.
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE international conference on acoustics, speech and signal processing (pp. 5329–5333). IEEE.
Srivastava, B. L., Vydana, H., Vuppala, A. K., & Shrivastava, M. (2017). Significance of neural phonotactic models for large-scale spoken language identification. In 2017 International joint conference on neural networks (pp. 2144–2151).
Stoica, P., & Moses, R. L. (2005). Spectral analysis of signals. Upper Saddle River, NJ: Pearson Prentice Hall.
Sutskever, I., Józefowicz, R., Gregor, K., Rezende, D. J., Lillicrap, T. P., & Vinyals, O. (2015). Towards principled unsupervised learning. CoRR abs/1511.06440. arXiv:1511.06440.
Verma, V. K., & Khanna, N. (2013). Indian language identification using k-means clustering and support vector machine (SVM). In 2013 Students conference on engineering and systems (pp. 1–5). IEEE.
Voxforge.org (2014). Free speech... Recognition (Linux, Windows and Mac) - voxforge.org. http://www.voxforge.org/. (Accessed 25 June 2014).
Wang, H., Leung, C.-C., Lee, T., Ma, B., & Li, H. (2012). Shifted-delta MLP features for spoken language recognition. IEEE Signal Processing Letters, 20(1), 15–18.
Wang, H., Leung, C., Lee, T., Ma, B., & Li, H. (2013). Shifted-delta MLP features for spoken language recognition. IEEE Signal Processing Letters, 20(1), 15–18.
Wang, Q., Zheng, C., Yu, H., & Deng, D. (2015). Integration of heterogeneous classifiers based on Choquet fuzzy integral. In 2015 7th international conference on intelligent human-machine systems and cybernetics (vol. 1) (pp. 543–547).
Zhang, J., Bao, X., Zhou, R., & Yan, Y. (2017). Weighted phone log-likelihood ratio feature for spoken language recognition. Journal of Tsinghua University (Science and Technology), 57(10), 1038. http://dx.doi.org/10.16511/j.cnki.qhdxxb.2017.25.042. http://jst.tsinghuajournals.com/EN/abstract/article_151999.shtml.
Zhou, X., Garcia-Romero, D., Duraiswami, R., Espy-Wilson, C., & Shamma, S. (2011). Linear versus mel frequency cepstral coefficients for speaker recognition. In 2011 IEEE workshop on automatic speech recognition & understanding (pp. 559–564). IEEE.
Zissman, M. A., & Singer, E. (1994). Automatic language identification of telephone speech messages using phoneme recognition and N-gram modeling. In Proceedings of ICASSP '94, IEEE international conference on acoustics, speech and signal processing (vol. i) (pp. I/305–I/308).
