
2020 6th International Conference on Advanced Computing & Communication Systems (ICACCS)

Voice Recognition and Voice Comparison using Machine Learning Techniques: A Survey

Nishtha H. Tandel, Harshadkumar B. Prajapati, Vipul K. Dabhi
Dept. of Information Technology, Dharmsinh Desai University, Nadiad, India

Abstract—Voice comparison is a variant of speaker recognition or voice recognition. Voice comparison plays a significant role in the forensic science field and in security systems. Precise voice comparison is a challenging problem. Traditionally, researchers used different classification and comparison models to solve speaker recognition and voice comparison, respectively, but deep learning is gaining popularity because of its strength in accuracy when trained with large amounts of data. This paper focuses on an elaborated literature survey of both traditional and deep learning-based methods of speaker recognition and voice comparison. This paper also discusses publicly available datasets that are used by researchers for speaker recognition and voice comparison. This concise paper would provide substantial input to beginners and researchers for understanding the domain of voice recognition and voice comparison.

Keywords—voice comparison, speaker recognition, deep learning, Siamese NN

I. INTRODUCTION

Voice comparison [1] is a difficult problem to solve because the voice of a person may change due to emotion, age gap, and throat infection [2]. Moreover, when a speaker tries to say precisely the same utterance twice, a measurable difference occurs in the speaker's voice. However, robust voice comparison is necessary because it can be used in many fields, such as forensic science [1], authentication/verification [30], [39], surveillance, etc. Though voice comparison is a hard problem for researchers, newer machine learning techniques, such as deep learning [3], have the capability to provide an appropriate solution for the problem.

We highlight differences among the different voice processing operations used in the literature. Speaker recognition is the method of recognizing who the speaker is by using the speaker's unique information. The recognition of speakers is typically divided into two categories: (1) speaker identification [9] and (2) speaker verification or authentication [30]. Speaker identification is the process of determining an unknown speaker's identity by matching his or her voice to the voices in a database of registered speakers. Speaker verification determines whether a person is who he or she claims to be based on his or her voice sample. There is an additional variant of speaker recognition called voice comparison [1], in which two voices are supplied as input to the voice comparison system and the system determines the similarity score between the two input voices. On the basis of the words or text used in speech, speaker recognition and voice comparison systems are divided into two categories: (1) text-dependent and (2) text-independent. Text-dependent systems employ the same text for training and testing, whereas text-independent systems employ different text for training and testing.

Saquib et al. [5] and Singh et al. [6] presented surveys of speaker recognition techniques in 2010 and 2017, respectively, which cover traditional approaches to speaker recognition. For speaker recognition and voice comparison, most works, e.g., [1], [7], [8], [9], [37], have been carried out using traditional approaches. Less research exists on the use of deep learning methods for speaker recognition and voice comparison. Therefore, there is a need for a survey that explores both traditional approaches and deep learning-based approaches to speaker recognition (identification and verification) and voice comparison.

This paper explores and analyzes various traditional and deep learning-based approaches to discuss potential solutions to the problem of voice comparison. It conducts a survey of major works carried out on speaker recognition and voice comparison and discusses the major issues and their solutions. Furthermore, the paper analyzes the suitability of the Siamese Neural Network for the problem of voice comparison. The paper also discusses and analyzes the datasets used by various researchers for speaker recognition and voice comparison.

This paper is arranged as follows: Section II introduces voice comparison, speaker identification, and speaker verification; it also discusses a general pipeline for voice comparison and studies traditional and deep learning approaches for speaker recognition and voice comparison. Section III presents a detailed literature survey on speaker recognition (identification and verification) and voice comparison, and also analyzes different datasets and the Siamese NN (Siamese Neural Network). Finally, Section IV concludes the paper.

II. VOICE RECOGNITION AND VOICE COMPARISON

This section presents variants of speaker recognition, a description of voice comparison, and traditional vs. deep learning-based approaches for voice comparison. Additionally, the Siamese architecture for voice comparison is also studied.

A. Variants of Speaker Recognition

Speaker recognition can be divided into two types: (1) speaker identification and (2) speaker verification. Fig. 1 shows an illustration of speaker identification (on left) and speaker verification (on right).

Fig. 1. Variants of speaker recognition (identification: "Whose is this voice?"; verification: "Is this Ale's voice?")

1) Speaker identification: The speech of an unknown speaker is processed and compared to established speaker voice models. The unknown speaker is identified as the one that best matches. Thus, the input to speaker identification is an unknown voice and the output is the name or ID of the speaker.

2) Speaker verification: In this variant of speaker recognition, an unknown speaker claims an identity, and his or her speech is compared to the registered speaker model of the claimed identity. Thus, the input to speaker verification is the name of the speaker and his or her voice, and the output is Yes or No.

There is one more variant of speaker recognition, called voice comparison.

3) Voice comparison: Voice comparison is the task of analyzing two recordings and deciding whether the voices belong to the same speaker or to different speakers. Fig. 2 shows an illustration of voice comparison. Thus, the input to voice comparison is two voice recordings and the output is a similarity score in the range 0 to 1.

Fig. 2. Voice comparison (a similarity measure answers "how similar are these two voices?")

As discussed earlier, based on the text or words used in the voice, systems that address speaker recognition (identification and verification) and comparison can be classified into two types [40]: (1) text-dependent and (2) text-independent. A text-dependent system is tied to a predefined text used for training and testing, whereas a text-independent system should be capable of using any text. This paper explores both text-dependent and text-independent methods of recognition.

B. Different approaches of voice comparison

Forensic voice comparison is based on four specific approaches: (1) auditory, (2) spectrographic, (3) acoustic, and (4) automatic [11]. In all the approaches, at least two recordings of a speaker are needed to compare a voice. The result of the auditory approach is the expert's subjective judgment based on listening to the speech recordings. The spectrographic approach is an image-based approach in which speech recordings are transformed into speech images, called spectrograms. In general, the spectrogram reflects the frequency spectrum and is also known as a "voiceprint". In the spectrographic approach, an expert pays attention to multiple words or phrases in both recordings; the expert can then examine specific patterns of information in the images to see how close they are. One example of a spectrogram is given in Fig. 3. Lukic et al. [12] used a voiceprint (spectrogram) as input to a CNN.

Fig. 3. Spectrogram of voice data (an audio wav file converted into a spectrogram)

The acoustic-phonetic approach requires making quantitative estimates of acoustic properties (pitch, formant, fundamental frequency, and HNR) on equivalent phonetic units in both recordings of the speakers. Cardoso et al. [1] proposed a technique to improve the performance of a Forensic Voice Comparison (FVC) system using fundamental frequency and formants. In the automatic approach, frame-wise speech features are extracted automatically. Unlike the acoustic-phonetic approach, the automatic approach does not use different acoustic features on specific parts of the signal. Examples of automatic approaches are MFCC [13], LPCC [14], etc.
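To make the spectrographic and automatic approaches concrete, the sketch below (a minimal illustration assuming the librosa library; the file name is a placeholder, not a corpus used in the surveyed works) converts a recording into a spectrogram "voiceprint" and extracts frame-wise MFCC features:

```python
# Sketch: spectrogram ("voiceprint") and frame-wise MFCC extraction with librosa.
# The file name is a placeholder; any mono WAV recording would work.
import librosa
import numpy as np

def voiceprint_and_mfcc(wav_path, sr=16000, n_mfcc=13):
    # Load the recording and resample to a common rate.
    y, sr = librosa.load(wav_path, sr=sr)

    # Spectrographic approach: magnitude spectrogram in dB (the "voiceprint").
    stft = librosa.stft(y, n_fft=512, hop_length=160)
    spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

    # Automatic approach: frame-wise MFCC feature vectors.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=160)
    return spectrogram_db, mfcc

spec, mfcc = voiceprint_and_mfcc("speaker_a.wav")
print(spec.shape, mfcc.shape)  # (freq_bins, frames), (n_mfcc, frames)
```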

C. Difference between traditional and deep learning-based techniques

Traditional methods for speaker recognition and voice comparison, such as HMM (Hidden Markov Model), GMM (Gaussian Mixture Model), and VQ (Vector Quantization), use unique characteristics of speech features from a collection of speakers; therefore, it is necessary to choose the most successful feature extraction approaches that truly represent the characteristics of speech. Many feature extraction techniques are available, such as MFCC [13], LPCC [15], and pitch [16]. As per the analyses of researchers, e.g., in [3] and [4], traditional methods are very time-consuming. Therefore, deep learning-based approaches are preferred for an automatic system to save time.

A generalized process to perform voice comparison efficiently using a deep learning model is shown in Fig. 4.

Fig. 4. Pipeline of a voice comparison system (data acquisition, preprocessing, deep learning model, similarity measure)
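A minimal skeleton of the four stages in Fig. 4 is sketched below; the stages are elaborated in 1) to 3) that follow. The function names, the toy embedding, and the 0.5 decision threshold are illustrative placeholders, not components of any surveyed system.

```python
# Sketch of the Fig. 4 pipeline: acquisition -> preprocessing -> model -> similarity.
# embed() stands in for any trained model (e.g., a Siamese sub-network);
# the 0.5 decision threshold is an arbitrary placeholder.
import numpy as np

def acquire(path):
    """Data acquisition: load a recording (placeholder returning raw samples)."""
    import librosa
    y, _ = librosa.load(path, sr=16000)
    return y

def preprocess(y, frame=400, hop=160, energy_ratio=0.1):
    """Preprocessing: crude VAD that keeps frames above an energy threshold."""
    frames = [y[i:i + frame] for i in range(0, len(y) - frame, hop)]
    energies = np.array([np.sum(f ** 2) for f in frames])
    keep = energies > energy_ratio * energies.max()
    return np.concatenate([f for f, k in zip(frames, keep) if k])

def embed(y):
    """Deep learning model: placeholder embedding (replace with a trained network)."""
    return np.array([y.mean(), y.std(), np.abs(y).max()])

def similarity(a, b):
    """Similarity measure: cosine similarity mapped to [0, 1]."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return 0.5 * (cos + 1.0)

score = similarity(embed(preprocess(acquire("voice1.wav"))),
                   embed(preprocess(acquire("voice2.wav"))))
print("same speaker" if score > 0.5 else "different speakers", score)
```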

1) Data Acquisition: Forensic scientists and researchers generally avoid modeling raw audio directly because it "ticks" so often, i.e., it has a very high sample rate. Generally, a text-independent voice comparison system requires datasets that contain audio of the same subject speaking different dialogues, whereas a text-dependent voice comparison system requires datasets that contain audio of the same subject speaking the same dialogue. Some datasets are available for recognition and comparison tasks; such datasets include microphone audio data (TIMIT [20]), telephone speech (NTIMIT [21]), and a large-scale celebrity speech corpus (VoxCeleb [22]).

2) Preprocessing: Preprocessing of audio data [17] is a very important step after data acquisition because real-world audio data is noisy. Generally, VAD (Voice Activity Detection) is used to separate voiced data from unvoiced data, i.e., VAD is used to find the presence and absence of human speech. VAD strategies exploit measures of the dissimilarity between speech and noise. In the past, VAD was based on extracting features such as short-time energy [23], zero-crossing rate [24], and pitch analysis [16]. Nowadays, the classification of voiced and unvoiced segments is done based on cepstral coefficients [13], [15] and wavelet transforms [25]. The important methods of VAD and their applications are listed in Table I.

TABLE I. VOICE ACTIVITY DETECTION METHODS

VAD Method | Application | References
Linear predictive coding (LPC) | Speech coding and speech synthesis | [14], [15]
Formant shape | Speech recognition, speaker recognition | [1]
Zero-crossing rate (ZCR) | Finding human presence | [23], [24]
Cepstral features | Speech recognition and speaker recognition | [13], [18]
Periodicity measure | Visualizing structural periodic changes | [51]
Pattern recognition | Voice-based personal verification | [19]
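Several of the methods in Table I amount to thresholding simple frame-level measures. A minimal sketch of an energy- and ZCR-based VAD follows; the thresholds are illustrative and would need tuning in practice.

```python
# Sketch: frame-level VAD using short-time energy and zero-crossing rate (ZCR).
# Thresholds are illustrative; practical systems tune them or use trained VADs.
import numpy as np

def vad_mask(y, frame=400, hop=160, energy_thresh=0.02, zcr_thresh=0.25):
    voiced = []
    for start in range(0, len(y) - frame, hop):
        f = y[start:start + frame]
        energy = np.mean(f ** 2)                        # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(f))) / 2)  # zero-crossing rate
        # Speech frames tend to have noticeable energy and relatively low ZCR.
        voiced.append(energy > energy_thresh and zcr < zcr_thresh)
    return np.array(voiced)
```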
In a multi-speaker environment, we may need to answer "who spoke when". In such a context, audio data often contains recordings of more than one person talking (e.g., telephone and meeting conversations). Speaker diarization is the method of splitting an input audio into homogeneous segments according to speaker identity. Wang et al. [26] proposed a novel speaker diarization technique based on LSTM-based d-vector audio embeddings. Speaker diarization itself is a wide domain and hence is out of the scope of this paper; however, a recent review on speaker diarization is available in [27], to which interested readers may refer.

3) Deep learning model: After preprocessing, the inputs are fed into the model. As per our understanding of the literature, the Siamese Neural Network (Siamese NN) is well adapted to the problem of comparison. A Siamese NN learns a similarity function that takes two inputs (e.g., spectrograms or voiceprints) and shows how similar the two inputs are. The goal of the Siamese architecture is not to classify the input objects but to distinguish between the two.

The architecture of the Siamese NN is shown in Fig. 5. Siamese NNs were first used for signature verification (whether two signatures belong to one person) in [28]. Recently, Siamese NNs have been designed for one-shot image recognition [29], where predictions must be made with only one instance of each new class. In [29], the authors discuss a method for learning a Siamese NN that uses a unique structure to naturally rank similarities between inputs, and they experimentally conclude that the Siamese NN provides better results than a CNN when training data is limited. In the human speech domain, every human's voice has a unique formant structure and vocal pattern; therefore, there is no need to train the network with many samples of one specific speaker's speech.

Fig. 5. Architecture of a Siamese NN (two inputs s1 and s2 pass through exactly the same network; their features are combined as |h(s1) - h(s2)| and a sigmoid unit outputs the similarity score)
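A minimal sketch of the Siamese architecture of Fig. 5 is given below using tf.keras, assuming fixed-length feature vectors (e.g., flattened spectrogram patches) as the two inputs; the layer sizes and the binary cross-entropy training setup are illustrative choices, not the configuration of any surveyed work.

```python
# Sketch of the Fig. 5 Siamese architecture (tf.keras), assuming fixed-length
# feature vectors as the two inputs. All sizes are illustrative placeholders.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_encoder(input_dim, embed_dim=128):
    """Shared sub-network: both inputs pass through exactly the same weights."""
    inp = layers.Input(shape=(input_dim,))
    x = layers.Dense(256, activation="relu")(inp)
    x = layers.Dense(embed_dim, activation="relu")(x)
    return Model(inp, x, name="shared_encoder")

def build_siamese(input_dim=4096):
    encoder = build_encoder(input_dim)
    s1 = layers.Input(shape=(input_dim,), name="input_s1")
    s2 = layers.Input(shape=(input_dim,), name="input_s2")
    h1, h2 = encoder(s1), encoder(s2)
    # |h(s1) - h(s2)| followed by a sigmoid unit gives a similarity score in [0, 1].
    diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([h1, h2])
    score = layers.Dense(1, activation="sigmoid", name="similarity")(diff)
    model = Model([s1, s2], score)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

siamese = build_siamese()
# Training would use pairs labelled 1 (same speaker) or 0 (different speakers):
# siamese.fit([x1, x2], labels, epochs=..., batch_size=...)
```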

III. SURVEY ON SPEAKER RECOGNITION AND VOICE COMPARISON

This section presents a broad survey of speaker identification, speaker verification, and voice comparison. We divide the survey into two subsections: traditional-approach-based and deep learning-based.

A. Survey on Speaker Recognition and Voice Comparison Systems based on the Traditional Approach

Many papers dealing with the problems and difficulties of speaker recognition and voice comparison systems have been published in recent years. Several of these papers are reviewed and analyzed in Table II. Most of the researchers used traditional methods. Reynolds and Rose [7] proposed speaker identification based on the GMM (Gaussian Mixture Model). The whole procedure of speaker identification is divided into two parts: (1) feature extraction and (2) classification. For feature extraction, the authors introduced a Mel-frequency filter bank for short utterances of a speaker; for classification, a GMM is used. The result of this speaker identification technique is that the accuracy decreases when the quality of sound is degraded: the GMM attains 96.8% and 80.8% accuracy for clean speech and telephone speech, respectively.

Reynolds [30] proposed speaker identification and verification systems providing superior performance based on Gaussian mixture speaker models, tested on publicly available datasets such as TIMIT [20], NTIMIT [21], Switchboard [45], and YOHO [46]. In this work, the whole procedure of speaker identification is again divided into (1) feature extraction and (2) classification. In the feature extraction, the speech signal of the speaker is first divided into separate speech frames, and then Mel-scale cepstral feature (MFCC) vectors are extracted from each frame. For classification, a GMM is used. MFCC is widely used for feature extraction by various researchers, e.g., Chakroborty et al. [9], Tolba et al. [34], Krishnamoorthy et al. [36], and Saeidi et al. [10].

Cardoso et al. [1] proposed a new voice comparison technique in which they extracted Vq (Voice Quality) features along with MFCC features from speech. In their work [1], 97 speakers from the DyViS corpus [52] are used: 32 speakers for training, 33 speakers for testing, and 32 speakers for reference. In the DyViS corpus, the speech is recorded in four ways: (1) HQ (high-quality recording), (2) TEL (telephone recording), (3) MOBHQ (mobile high-quality recording), and (4) MOBLQ (mobile low-quality recording). For feature extraction, the authors chose MFCC; to improve the performance of the system, Vq features are added to MFCC. For Vq, four measures are used: (1) F0 (fundamental frequency), (2) CPP (Cepstral Peak Prominence), (3) HNR (Harmonic-to-Noise Ratio), and (4) H1-A1, H1-A2, H1-A3. These four features are also called vocal characteristics of humans and are extracted from speech together with MFCC. As transmission quality degraded (HQ > TEL > MOBHQ > MOBLQ), the contribution of Vq to system performance became much more pronounced. In their work, the EER (Equal Error Rate) is 2.85% using MFCC alone and 0.09% using MFCC combined with Vq.

Zhang et al. [37] explored the effectiveness of the formant trajectory technique applied to tokens of the standard Chinese triphthong /iau/. The Chinese tokens are extracted from speech, and an acoustic-phonetic approach is used to extract features from the tokens. The test scores from the acoustic-phonetic and automatic systems were fused using logistic-regression fusion. Similar work was carried out by Morrison et al. [38], who proposed a forensic-voice-comparison system for the standard Chinese monophthongs /i/, /e/, and /a/.
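A compact sketch of GMM-based speaker identification in the spirit of Reynolds and Rose [7] is shown below, using scikit-learn's GaussianMixture; the number of mixture components and the diagonal covariance are illustrative assumptions, not the original configuration.

```python
# Sketch: GMM-based speaker identification. One GMM is fit per enrolled speaker
# on MFCC frames; an unknown utterance is assigned to the speaker whose model
# gives the highest average log-likelihood. Settings are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def enroll(speaker_mfccs, n_components=16):
    """speaker_mfccs: dict mapping speaker id -> (n_frames, n_mfcc) MFCC array."""
    models = {}
    for spk, feats in speaker_mfccs.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[spk] = gmm.fit(feats)
    return models

def identify(models, test_mfcc):
    """Return the enrolled speaker whose GMM best explains the test frames."""
    scores = {spk: gmm.score(test_mfcc) for spk, gmm in models.items()}
    return max(scores, key=scores.get), scores
```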

TABLE II. ANALYSIS OF SPEAKER RECOGNITION AND COMPARISON METHODS BASED ON TRADITIONAL APPROACH

Researchers | Dataset | No. of speakers | Feature Extraction | Model | Text-type | System-type | Accuracy or EER (in %)
Reynolds and Rose (1995) [7] | KING speech database | 49 | Mel-frequency filter bank for short utterances | GMM | TI | SI | AC: 96.3
Reynolds (1995) [30] | TIMIT, NTIMIT, Switchboard, YOHO | 630 / 630 / 113 / NA | Mel-scale cepstral | GMM | NA | SI, SV | SI AC: TIMIT 99.5, NTIMIT 60.7, Switchboard 82.8, YOHO NA; SV EER: TIMIT 0.24, NTIMIT 7.19, Switchboard 5.15, YOHO 0.51
Adami et al. (2001) [31] | Random | 30 | LPCC, FOR, PIT, LPC | MLP | NA | SI | LPCC: AC 100
Rabha et al. (2003) [32] | Random | 10 | LPC/Cepstral | SVD-based algorithm | TI | SI | Clean speech: AC 99.5; Noisy speech: AC 77.5
Shahin (2009) [33] | Non-professional database | 40 (20 male + 20 female) | LFPC | HMM / CHMM / SPHMM | TD | SI | HMM: AC 61.4; CHMM: AC 66.4; SPHMM: AC 69.1
Revathi et al. (2009) [8] | TIMIT | 50 | MF-PLP, PLP | Iterative clustering approach | TI | SR | MF-PLP: AC 91.0; PLP: AC 88
Chakroborty and Saha (2009) [9] | YOHO, POLYCOST | >130 | MFCC, IMFCC | GMM | TI | SI | YOHO: TF AC 97.26, GF AC 97.42; POLYCOST: TF AC 81.16, GF AC 82.76
Saeidi et al. (2010) [10] | Speech separation challenge corpus | 34 (18 male + 16 female) | MFCC | GMM-UBM | TI | SI | AC: 97.0
Tolba et al. (2011) [34] | Arabic speakers | 10 | MFCC | CHMM | TI | SI | AC: 80
Ajmera et al. (2011) [35] | TIMIT, SGGS | 630 / 151 | Spectrographic acoustic feature | DCT | TI | SI | TIMIT: AC 96.69; SGGS: AC 98.41
Krishnamoorthy et al. (2011) [36] | TIMIT | 100 | MFCC | GMM-UBM | TI | SR | AC: 80
Zhang et al. (2011) [37] | Chinese female speakers | 60 | Formant, MFCC | GMM-UBM | TD | VC | AC: NA
Morrison et al. (2011) [38] | Chinese male speakers | 64 | Formant | Likelihood ratio | TD | VC | AC: NA
Cardoso et al. (2019) [1] | DyViS corpus | 97 | HNR, CPP, F0, formant, MFCC | GMM-UBM | NA | VC | EER: 0.09

a. Result: EER - Equal Error Rate, AC - Accuracy, TF - Triangular Filter, GF - Gaussian Filter. b. Text-type: TI - Text-Independent, TD - Text-Dependent. c. Feature extraction: FOR - Formant, PIT - Pitch. d. System-type: SV - Speaker Verification, SI - Speaker Identification, SR - Speaker Recognition, VC - Voice Comparison.
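The tables in this survey report results as accuracy or EER. For readers unfamiliar with EER, the sketch below shows one simple way it can be computed from same-speaker (genuine) and different-speaker (impostor) comparison scores; it is a plain threshold sweep, not the scoring protocol of any cited work.

```python
# Sketch: Equal Error Rate (EER) from genuine and impostor comparison scores.
# The threshold sweep below is a simple illustration, not an official protocol.
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, eer = 1.0, None
    for t in thresholds:
        far = np.mean(impostor >= t)   # impostors wrongly accepted
        frr = np.mean(genuine < t)     # genuine trials wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), 0.5 * (far + frr)
    return eer

print(equal_error_rate([0.9, 0.8, 0.7, 0.6], [0.5, 0.4, 0.65, 0.2]))
```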

B. Survey on Speaker Recognition (Identification and Verification) and Voice Comparison Systems based on the Deep Learning Approach

Deep learning is becoming an interesting and powerful branch of machine learning, and deep learning strategies have been effective in recognizing speakers, though only a few researchers have worked on voice domains using deep learning-based methods. Table III presents a survey of the deep learning-based approach.

Variani et al. [39] proposed a DNN (Deep Neural Network) based method of speaker verification. The DNN is trained to classify speakers from acoustic characteristics at the frame level. The averaged frame-level features of a speaker, called d-vector features, are then used to verify speakers. Lukic et al. [12] proposed a new approach for optimizing the speaker identification pipeline and evaluated it on the TIMIT dataset. The authors used a Convolutional Neural Network (CNN) on spectrograms to learn speaker-specific characteristics from a rich representation of acoustic sources. The CNN consists of several convolution layers that apply a wide range of filters to small local input sections (e.g., a 3x3 area, repeated throughout the entire input space). Each convolution layer is followed by a max-pooling layer, which produces a lower-resolution version of the convolution layer's activations by keeping only the maximum filter activation from, e.g., a 2x2 window. At the end, fully connected layers integrate all outputs of the last max-pooling layer to classify speakers.

Plchot et al. [40] presented a DNN-based auto-encoder (DAE) for speaker recognition systems with microphone and noisy data. The function of the auto-encoder is to enhance the speech signal (i.e., to de-noise and de-reverberate it). Plchot et al. [40] concluded that an audio enhancement method offers good compensation for distortions caused by reverberation, whereas multi-condition training can handle the distortion caused by additive noise very well. Torfi et al. [42] proposed a novel method for text-independent speaker verification using a 3D convolutional neural network (3D-CNN) architecture. In their work, the authors proposed adaptive feature learning by using the 3D-CNN to directly create the speaker model; in the process, an identical number of spoken utterances per speaker is fed into the network.
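A small sketch of d-vector style verification, as described above for Variani et al. [39], is given below; the frame-level embedding arrays are assumed to come from a trained DNN, and the decision threshold is an illustrative placeholder.

```python
# Sketch: d-vector style verification. Frame-level embeddings from a trained
# network are averaged into one utterance vector (the d-vector) and compared
# with cosine similarity. The threshold is an illustrative placeholder.
import numpy as np

def d_vector(frame_embeddings):
    """Average (and length-normalise) frame-level embeddings of one utterance."""
    v = np.mean(frame_embeddings, axis=0)
    return v / (np.linalg.norm(v) + 1e-9)

def verify(enrolled_frames, test_frames, threshold=0.7):
    """Accept the claimed identity if cosine similarity exceeds a tuned threshold."""
    score = float(np.dot(d_vector(enrolled_frames), d_vector(test_frames)))
    return score >= threshold, score
```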

TABLE III. ANALYSIS OF SPEAKER RECOGNITION AND COMPARISON METHODS BASED ON DEEP LEARNING APPROACH

Researchers | Dataset | No. of speakers | Input | Model | Text-type | System-type | Accuracy or EER (in %)
Variani et al. (2014) [39] | NA | 646 | Energy features of frame | DNN | TD | SV | EER: 2.00 (for 20 utterances)
Lukic et al. (2016) [12] | TIMIT | 630 | Spectrogram of voice data | CNN | NA | SI | AC: 97
Plchot et al. (2016) [40] | PRISM (Fisher corpora / Switchboard / SRE) | 13916 / 1991 / 2740 | MFCC, PNCC | DNN auto-encoder | TD, TI | SR | NA
Chung et al. (2017) [41] | VoxCeleb | 1251 | Spectrogram | CNN | NA | SI, SV | SI AC: 80.5; SV EER: 7.8
Torfi et al. (2018) [42] | WVU-Multimodal 2013 | 1083 | Frame-wise MFEC | 3D-CNN | TI | SV | EER: 21.1
Muckenhirn et al. (2018) [43] | Voxforge | Selected 300 | Raw speech data | CNN, MLP | NA | SI, SV | SI EER: 1.18; SV EER: 1.20
Dhakal et al. (2019) [44] | ELSDSR | 22 | Statistical, Gabor and CNN-based features | SVM / RF / DNN | NA | SR | SVM AC: 98.07; RF AC: 99.41; DNN AC: 98.14

a. Result: EER - Equal Error Rate, AC - Accuracy. b. System-type: SV - Speaker Verification, SI - Speaker Identification, SR - Speaker Recognition. c. Text-type: TD - Text-Dependent, TI - Text-Independent.

C. Analysis of different datasets

In Table IV, we analyze widely used datasets such as TIMIT [20], NTIMIT [21], Switchboard [45], YOHO [46], VoxCeleb [22], ELSDSR [47], POLYCOST [48], ICSI Meeting speech [49], and the 2010 NIST SRE [50]. For analyzing the datasets, we use attributes such as the number of subjects, utterances, type of speech, sample rate, dataset size, and application. The TIMIT [20] corpus (440 MB) was created to provide speech data for acoustic-phonetic studies and automated speech recognition systems. VoxCeleb [22] is a dataset for large-scale speaker recognition prepared from celebrities' YouTube videos. The data [22] are roughly gender-balanced (55% male), and the videos cover a range of backgrounds, professions, ages, and genders.

D. Analysis of traditional and deep learning-based approaches

Table V provides a comparison of the traditional and deep learning-based models. From this analysis, we can state that traditional methods take a lot of time for feature extraction because they measure frame-wise features such as fundamental frequency, formant, pitch, etc. Instead of extracting features manually, many automatic feature extraction techniques such as MFCC [13] and LPCC [15] are available. Traditional approaches are usually two-step procedures: first, compute the features (e.g., MFCC) and then feed them into a classifier (e.g., GMM, HMM, or VQ). In a deep learning method, by contrast, we can give voiceprint images directly as input to the model; for example, a spectrogram or voiceprint was used as input to a CNN by Lukic et al. in [12], and the CNN learns directly from the input spectrogram. The main advantage of any deep learning-based system is that it is fully automatic. In a deep learning-based approach, a CNN is well suited for classification; for comparison, however, the Siamese NN is a popular choice compared to a CNN because the Siamese NN performs well on a limited dataset. In a Siamese NN, we can use a CNN as the sub-network that generates feature vectors.
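As a complement to the Siamese sketch given earlier, the snippet below outlines a small CNN over spectrogram inputs that could serve as the shared sub-network (the feature vector generator); the layer sizes are illustrative and not taken from any cited system.

```python
# Sketch: a small CNN encoder over spectrogram "images" that could act as the
# shared sub-network inside a Siamese model. All sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_cnn_encoder(freq_bins=128, frames=128, embed_dim=128):
    inp = layers.Input(shape=(freq_bins, frames, 1))    # spectrogram as an image
    x = layers.Conv2D(32, 3, activation="relu")(inp)    # local time-frequency filters
    x = layers.MaxPooling2D(2)(x)                       # keep strongest activations
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(embed_dim, activation="relu")(x)   # fixed-size voice embedding
    return Model(inp, x, name="cnn_encoder")
```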

We emphasize the following points related to the CNN and Siamese NN architectures:

• CNN is good for classification problems, while Siamese NN is good for comparison problems.
• Even when training data is limited, a Siamese NN can estimate well compared to a CNN.

IV. CONCLUSION

This paper focused on an elaborated survey of two useful voice processing operations: voice recognition and voice comparison. Voice comparison can become a very important task in Human-Machine Interface based systems. Before presenting the survey and analysis, this paper explained the essential concepts of speaker identification, speaker verification, and speaker comparison. Furthermore, the paper presented a whole pipeline of voice comparison with sufficient detail. In the survey, the paper studied and analyzed both traditional and deep learning-based approaches for speaker recognition and voice comparison and suggested the use of deep learning-based approaches for the voice processing domain. The paper also surveyed and presented various datasets used for automated voice processing. Finally, the suitability of the Siamese NN, combined with the CNN that is popular for classification problems, for voice comparison was discussed.

TABLE IV. ANALYSIS OF DATASETS

Dataset | # Subjects (speakers) | # Utterances | Type of speech | Sample rate (in Hz) | Dataset size | Application
TIMIT [20] | 630 | 6300 | Microphone speech | 16000 | 440 MB | Speaker identification and verification
NTIMIT [21] | 630 | 6300 | Telephone speech | 16000 | NA (25200 files) | Speaker identification and verification
Switchboard [45] | 3114 | 33039 | Telephone talking | 8000 | NA | Speaker identification
YOHO [46] | NA | NA | Microphone speech | 8000 | 1500 MB | Speaker verification
VoxCeleb [22] | 1251 | ~100000 | From YouTube videos | NA | 150 MB | Speaker classification
ELSDSR [47] | 22 | 198 | MARANTZ PMD670 recorded speech | 16000 | NA | Speaker identification and verification
POLYCOST [48] | 131 | >1285 | Telephone speech | 8000 | ~1246 MB | Speaker identification and verification
ICSI Meeting speech [49] | 53 | 922 | Microphone conversation | 16000 | NA | Speaker segmentation
2010 NIST SRE [50] | >2000 | NA | Microphone and telephone speech | 8000 | NA | Speaker identification

TABLE V. ANALYSIS OF SPEAKER RECOGNITION AND VOICE COMPARISON MODELS

Approach | Model | Advantages | Disadvantages
Traditional approaches | HMM | Provides good results when the system is text-dependent | High computational burden in pattern recognition; not suitable for text-independent voice comparison systems; suitable only for small speech recordings
Traditional approaches | VQ | Easy to use; lower computational burden than HMM for pattern recognition | Performance degrades when the speaker's recording is too long; slow generation of the codebook
Traditional approaches | GMM | Mixture modeling is very versatile; a probabilistic approach that achieves a fuzzy observation classification | Computationally costly if there is a huge number of distributions; needs large datasets
Deep learning-based approach | CNN | Fully automatic; easy model construction involving fewer formal statistics; capacity to capture non-linearity between predictors and results | Prone to over-fitting due to the complexity of the model structure
Deep learning-based approach | 3D-CNN | Offers direct modeling of the speaker | Requires an optimized structure
Deep learning-based approach | DAE | Includes de-noising | Problem of over-fitting
Deep learning-based approach | Siamese NN | Requires little training data; easy to label; more robust to irregularity in classes | Both sub-networks are required to use the same (hyper)parameters
References
[1] A. Cardoso, P. Foulkes, J.P. French, A.J. Gully, P.T. Harrison, and V. Hughes, "Forensic voice comparison using long-term acoustic measures of voice quality," in Proceedings of the 19th International Congress of Phonetic Sciences (ICPhS), York, 2019.
[2] K. Kolhatkar, M. Kolte, and J. Lele, "Implementation of pitch detection algorithms for pathological voices," in 2016 International Conference on Inventive Computation Technologies (ICICT), 2016, Vol. 1, pp. 1-5. IEEE.
[3] H. Lee, P. Pham, Y. Largman, and A.Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems, 2009, pp. 1096-1104.
[4] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1695-1699. IEEE.
[5] Z. Saquib, N. Salam, R.P. Nair, N. Pandey, and A. Joshi, "A survey on automatic speaker recognition systems," in Signal Processing and Multimedia, 2010, pp. 134-145. Springer, Berlin, Heidelberg.
[6] N. Singh, A. Agrawal, and R.A. Khan, "Automatic speaker recognition: current approaches and progress in last six decades," Global J Enterp Inf Syst. 2017;9(3):45-52.
[7] D.A. Reynolds and R.C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing. 1995;3(1):72-83.
[8] A. Revathi, R. Ganapathy, and Y. Venkataramani, "Text independent speaker recognition and speaker independent speech recognition using iterative clustering approach," International Journal of Computer Science & Information Technology (IJCSIT) 1.2 (2009): 30-42.
[9] S. Chakroborty and G. Saha, "Improved text-independent speaker identification using fused MFCC & IMFCC feature sets based on Gaussian filter," International Journal of Signal Processing. 2009;5(1):11-19.
[10] R. Saeidi, P. Mowlaee, T. Kinnunen, Z.H. Tan, M.G. Christensen, S.H. Jensen, and P. Franti, "Signal-to-signal ratio independent speaker identification for co-channel speech signals," in 2010 20th International Conference on Pattern Recognition, 2010, pp. 4565-4568. IEEE.
[11] G.S. Morrison and W.C. Thompson, "Assessing the admissibility of a new generation of forensic voice comparison testimony," Colum. Sci. & Tech. L. Rev. 18 (2016): 326.

[12] Y. Lukic, C. Vogt, O. Dürr, and T. Stadelmann, "Speaker identification and clustering using convolutional neural networks," in 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016, pp. 1-6. IEEE.
[13] M.A. Hossan, S. Memon, and M.A. Gregory, "A novel approach for MFCC feature extraction," in 2010 4th International Conference on Signal Processing and Communication Systems, 2010, pp. 1-5. IEEE.
[14] L.R. Rabiner and M.R. Sambur, "Voiced-unvoiced-silence detection using Itakura LPC distance measure," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 1977, pp. 323-326.
[15] Y. Yujin, Z. Peihua, and Z. Qun, "Research of speaker recognition based on combination of LPCC and MFCC," in 2010 IEEE International Conference on Intelligent Computing and Intelligent Systems, 2010, Vol. 3, pp. 765-767. IEEE.
[16] A.M. Noll, "Cepstrum pitch determination," The Journal of the Acoustical Society of America 41.2 (1967): 293-309.
[17] T. Kinnunen and P. Rajan, "A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7229-7233. IEEE.
[18] J.A. Haigh and J.S. Mason, "A voice activity detector based on cepstral analysis," in Eurospeech, 1993, Vol. 9, pp. 1103-1106.
[19] B. Atal and L. Rabiner, "A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing 24.3 (1976): 201-212.
[20] J.S. Garofolo, "TIMIT Acoustic-Phonetic Continuous Speech Corpus," LDC93S1. Philadelphia: Linguistic Data Consortium, 1993.
[21] W.M. Fisher, "NTIMIT," LDC93S2. Web Download. Philadelphia: Linguistic Data Consortium, 1993.
[22] J.S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep Speaker Recognition," INTERSPEECH, arXiv preprint arXiv:1806.05622, 2018.
[23] M. Jalil, F.A. Butt, and A. Malik, "Short-time energy, magnitude, zero crossing rate and autocorrelation measurement for discriminating voiced and unvoiced segments of speech signals," in 2013 International Conference on Technological Advances in Electrical, Electronics and Computer Engineering (TAEECE), 2013, pp. 208-212. IEEE.
[24] N.N. Lokhande, N.S. Nehe, and P.S. Vikhe, "Voice activity detection algorithm for speech recognition applications," in IJCA Proceedings on International Conference in Computational Intelligence (ICCIA 2012), 2012, No. 6, pp. 1-4.
[25] J. Stegmann and G. Schroder, "Robust voice-activity detection based on the wavelet transform," in 1997 IEEE Workshop on Speech Coding for Telecommunications Proceedings, 1997, pp. 99-100. IEEE.
[26] Q. Wang, C. Downey, L. Wan, P.A. Mansfield, and I.L. Moreno, "Speaker diarization with LSTM," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5239-5243. IEEE.
[27] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals, "Speaker diarization: A review of recent research," IEEE Transactions on Audio, Speech, and Language Processing 20.2 (2012): 356-370.
[28] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, "Signature verification using a 'Siamese' time delay neural network," in Advances in Neural Information Processing Systems, 1994, pp. 737-744.
[29] G. Koch, R. Zemel, and R. Salakhutdinov, "Siamese neural networks for one-shot image recognition," ICML Deep Learning Workshop, Vol. 2, 2015.
[30] D.A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Communication 17.1-2 (1995): 91-108.
[31] A.G. Adami and D.A. Barone, "A speaker identification system using a model of artificial neural networks for an elevator application," Information Sciences 138.1-4 (2001): 1-5.
[32] R.W. Aldhaheri and F.E. Al-Saadi, "Text-independent speaker identification in noisy environment using singular value decomposition," in Fourth International Conference on Information, Communications and Signal Processing and Fourth Pacific Rim Conference on Multimedia (Joint), 2003, Vol. 3, pp. 1624-1628. IEEE.
[33] I. Shahin, "Speaker identification in emotional environments," (2009): 41-46.
[34] H. Tolba, "A high-performance text-independent speaker identification of Arabic speakers using a CHMM-based approach," Alexandria Engineering Journal 50.1 (2011): 43-47.
[35] P.K. Ajmera, D.V. Jadhav, and R.S. Holambe, "Text-independent speaker identification using Radon and discrete cosine transforms based features from speech spectrogram," Pattern Recognition 44.10-11 (2011): 2749-2759.
[36] P. Krishnamoorthy, H.S. Jayanna, and S.M. Prasanna, "Speaker recognition under limited data condition by noise addition," Expert Systems with Applications 38.10 (2011): 13487-13490.
[37] C. Zhang, G.S. Morrison, and T. Thiruvaran, "Forensic voice comparison using Chinese /iau/," in ICPhS, 2011, pp. 2280-2283.
[38] G.S. Morrison, C. Zhang, and P. Rose, "An empirical estimate of the precision of likelihood ratios from a forensic-voice-comparison system," Forensic Science International 208.1-3 (2011): 59-65.
[39] E. Variani, X. Lei, E. McDermott, I.L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4052-4056. IEEE.
[40] O. Plchot, L. Burget, H. Aronowitz, and P. Matejka, "Audio enhancing with DNN autoencoder for speaker recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5090-5094. IEEE.
[41] A. Nagrani, J.S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[42] A. Torfi, J. Dawson, and N.M. Nasrabadi, "Text-independent speaker verification using 3D convolutional neural networks," in 2018 IEEE International Conference on Multimedia and Expo (ICME), 2018, pp. 1-6. IEEE.
[43] H. Muckenhirn, M.M. Doss, and S. Marcell, "Towards directly modeling raw speech signal for speaker verification using CNNs," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4884-4888. IEEE.
[44] P. Dhakal, P. Damacharla, A.Y. Javaid, and V. Devabhaktuni, "A near real-time automatic speaker recognition architecture for voice-based user interface," Machine Learning and Knowledge Extraction. 2019;1(1):504-520.
[45] J. Godfrey and E. Holliman, "Switchboard-1 Release 2," LDC97S62. Philadelphia: Linguistic Data Consortium, 1993.
[46] J. Campbell and A. Higgins, "YOHO speaker verification," Philadelphia: Linguistic Data Consortium, 1994.
[47] L. Feng, "Speaker recognition," Master's thesis, Technical University of Denmark (DTU), Kgs. Lyngby, Denmark.
[48] J. Hennebert, H. Melin, D. Petrovska, and D. Genoud, "POLYCOST: a telephone-speech database for speaker recognition," Speech Communication. 2000;31(2-3):265-270.
[49] A. Janin et al., "ICSI Meeting Speech," LDC2004S02. Web Download. Philadelphia: Linguistic Data Consortium, 2004.
[50] C. Greenberg et al., "2010 NIST Speaker Recognition Evaluation Test Set," LDC2017S06. Philadelphia: Linguistic Data Consortium, 2017.
[51] R. Tucker, "Voice activity detection using a periodicity measure," IEE Proceedings I (Communications, Speech and Vision). 1992;139(4):377-380.
[52] F. Nolan, K. McDougall, G. De Jong, and T. Hudson, "The DyViS database: style-controlled recordings of 100 homogeneous speakers for forensic phonetic research," International Journal of Speech, Language & the Law. 2009;16(1).
