Voice Recognition and Voice Comparison Using Machine Learning Techniques: A Survey
Abstract—Voice comparison is a variant of speaker recognition or voice recognition. Voice comparison plays a significant role in the forensic science field and in security systems. Precise voice comparison is a challenging problem. Traditionally, researchers used different classification and comparison models to solve speaker recognition and voice comparison, respectively, but deep learning is gaining popularity because of its strength in accuracy when trained with large amounts of data. This paper presents an elaborated literature survey of both traditional and deep learning-based methods of speaker recognition and voice comparison. The paper also discusses publicly available datasets that researchers use for speaker recognition and voice comparison. This concise survey should provide substantial input to beginners and researchers for understanding the domain of voice recognition and voice comparison.

Keywords—voice comparison, speaker recognition, deep learning, Siamese NN

I. INTRODUCTION

Voice comparison [1] is a difficult problem to solve because a person's voice may change due to emotion, age gap, or throat infection [2]. Moreover, when a speaker tries to say precisely the same utterance twice, a measurable difference occurs between the two renditions. Nevertheless, robust voice comparison is necessary because it can be used in many fields, such as forensic science [1], authentication/verification [30], [39], and surveillance. Though voice comparison is a hard problem for researchers, newer machine learning techniques, such as deep learning [3], have the capability to provide an appropriate solution.

We first highlight the differences among the voice processing operations used in the literature. Speaker recognition is the task of recognizing who is speaking by using the speaker's unique vocal characteristics. Speaker recognition is typically divided into two categories: (1) speaker identification [9] and (2) speaker verification or authentication [30]. Speaker identification is the process of determining an unknown speaker's identity by matching his or her voice against the voices in a database of registered speakers. Speaker verification determines whether a person is who he or she claims to be, based on a voice sample. There is an additional variant of speaker recognition called voice comparison [1], in which two voices are supplied as input and the system determines a similarity score between them. On the basis of the words or text used in the speech, speaker recognition and voice comparison systems are divided into two categories: (1) text-dependent and (2) text-independent. Text-dependent systems employ the same text for training and testing, whereas text-independent systems employ different text for training and testing.

Saquib et al. [5] and Singh et al. [6] presented surveys of speaker recognition techniques in 2010 and 2017, respectively, which cover the traditional approaches to speaker recognition. Most of the work on speaker recognition and voice comparison, e.g., [1], [7], [8], [9], [37], has been carried out using traditional approaches; comparatively little research exists on the use of deep learning methods for these problems. Therefore, there is a need for a survey that explores both traditional and deep learning-based approaches to speaker recognition (identification and verification) and voice comparison.

This paper explores and analyzes various traditional and deep learning-based approaches to discuss potential solutions to the problem of voice comparison. It surveys the major works on speaker recognition and voice comparison and discusses the main issues and their solutions. Furthermore, the paper analyzes the suitability of the Siamese Neural Network for the problem of voice comparison, and it discusses and analyzes the datasets used by various researchers for speaker recognition and voice comparison.

This paper is arranged as follows: Section II introduces voice comparison, speaker identification, and speaker verification; it also discusses a general pipeline for voice comparison and studies traditional and deep learning approaches for speaker recognition and voice comparison. Section III presents a detailed literature survey on speaker recognition (identification and verification) and voice comparison, along with analyses of different datasets and the Siamese NN (Siamese Neural Network). Finally, Section IV concludes the paper.

II. VOICE RECOGNITION AND VOICE COMPARISON

This section presents the variants of speaker recognition, a description of voice comparison, and traditional versus deep learning-based approaches for voice comparison. Additionally, the Siamese architecture for voice comparison is studied.
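As defined above, a voice comparison system outputs a similarity score for two input voices. The following is a minimal sketch of such a score in Python, assuming each recording has already been mapped to a fixed-length embedding vector; the embed_voice front-end named in the comments is a hypothetical placeholder, not a method from any surveyed paper.

    import numpy as np

    def similarity_score(emb_a, emb_b):
        # Cosine similarity between two fixed-length voice embeddings.
        # Values near 1 indicate the two recordings likely come from the
        # same speaker; lower values suggest different speakers.
        return float(np.dot(emb_a, emb_b) /
                     (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

    # Hypothetical usage: embed_voice() stands in for any front-end
    # (hand-crafted features or a neural sub-network) that maps a
    # recording to a fixed-length vector.
    # score = similarity_score(embed_voice("a.wav"), embed_voice("b.wav"))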
TABLE II. ANALYSIS OF SPEAKER RECOGNITION AND COMPARISON METHODS BASED ON TRADITIONAL APPROACH

Researchers | Dataset | No. of speakers | Feature extraction | Model | Text-type | System-type | Accuracy or EER (in %)
Reynolds and Rose (1995) [7] | KING speech database | 49 | Mel-frequency filter bank for short utterances | GMM | TI | SI | AC: 96.3
Reynolds (1995) [30] | TIMIT, NTIMIT, Switchboard, YOHO | TIMIT: 630, NTIMIT: 630, Switchboard: 113, YOHO: NA | Mel-scale cepstral | GMM | NA | SI, SV | SI: TIMIT AC: 99.5, NTIMIT AC: 60.7, Switchboard AC: 82.8, YOHO AC: NA; SV: TIMIT EER: 0.24, NTIMIT EER: 7.19, Switchboard EER: 5.15, YOHO EER: 0.51
Adami et al. (2001) [31] | Random | 30 | LPCC, FOR, PIT, LPC | MLP | NA | SI | LPCC AC: 100
Rabha et al. (2003) [32] | Random | 10 | LPC/cepstral | SVD-based algorithm | TI | SI | Clean speech AC: 99.5; noisy speech AC: 77.5
Shahin (2009) [33] | Non-professional database | 40 (20 male + 20 female) | LFPC | HMM, CHMM, SPHMM | TD | SI | HMM AC: 61.4; CHMM AC: 66.4; SPHMM AC: 69.1
Revathi et al. (2009) [8] | TIMIT | 50 | MF-PLP, PLP | Iterative clustering approach | TI | SR | MF-PLP AC: 91.0; PLP AC: 88
Chakroborty and Saha (2009) [9] | YOHO, POLYCOST | >130 | MFCC, IMFCC | GMM | TI | SI | YOHO: TF AC: 97.26, GF AC: 97.42; POLYCOST: TF AC: 81.16, GF AC: 82.76
Saeidi et al. (2010) [10] | Speech separation challenge corpus | 34 (18 male + 16 female) | MFCC | GMM-UBM | TI | SI | AC: 97.0
Tolba et al. (2011) [34] | Arabic speakers | 10 | MFCC | CHMM | TI | SI | AC: 80
Ajmera et al. (2011) [35] | TIMIT, SGGS | TIMIT: 630, SGGS: 151 | Spectrographic acoustic features | DCT | TI | SI | TIMIT AC: 96.69; SGGS AC: 98.41
Krishnamoorthy et al. (2011) [36] | TIMIT | 100 | MFCC | GMM-UBM | TI | SR | AC: 80
Zhang et al. (2011) [37] | Chinese female speakers | 60 | Formant, MFCC | GMM-UBM | TD | VC | AC: NA
Morrison et al. (2011) [38] | Chinese male speakers | 64 | Formant | Likelihood ratio | TD | VC | AC: NA
Cardoso et al. (2019) [1] | DyViS corpus | 97 | HNR, CPP, f0, formant, MFCC | GMM-UBM | NA | VC | EER: 0.09

a. Result: EER = Equal Error Rate, AC = Accuracy, TF = Triangular Filter, GF = Gaussian Filter. b. Text-type: TI = Text-Independent, TD = Text-Dependent. c. Feature extraction: FOR = Formant, PIT = Pitch. d. System-type: SV = Speaker Verification, SI = Speaker Identification, SR = Speaker Recognition.
TABLE III. ANALYSIS OF SPEAKER RECOGNITION AND COMPARISON METHODS BASED ON DEEP LEARNING APPROACH

Researchers | Dataset | No. of speakers | Input | Model | Text-type | System-type | Accuracy or EER (in %)
Variani et al. (2014) [39] | NA | 646 | Energy features of frame | DNN | TD | SV | EER: 2.00 (for 20 utterances)
Lukic et al. (2016) [12] | TIMIT | 630 | Spectrogram of voice data | CNN | NA | SI | AC: 97
Plchot et al. (2016) [40] | PRISM (Fisher corpora, Switchboard, SRE) | Fisher: 13916, Switchboard: 1991, SRE: 2740 | MFCC, PNCC | DNN autoencoder | TD, TI | SR | NA
Chung et al. (2017) [41] | VoxCeleb | 1251 | Spectrogram | CNN | NA | SI, SV | SI AC: 80.5; SV EER: 7.8
Torfi et al. (2018) [42] | WVU-Multimodal 2013 | 1083 | Frame-wise MFEC | 3D-CNN | TI | SV | EER: 21.1
Muckenhirn et al. (2018) [43] | Voxforge | Selected 300 | Raw speech data | CNN, MLP | NA | SI, SV | SI EER: 1.18; SV EER: 1.20
Dhakal et al. (2019) [44] | ELSDSR | 22 | Statistical, Gabor, and CNN-based features | SVM, RF, DNN | NA | SR | SVM AC: 98.07; RF AC: 99.41; DNN AC: 98.14

a. Result: EER = Equal Error Rate, AC = Accuracy. b. System-type: SV = Speaker Verification, SI = Speaker Identification, SR = Speaker Recognition. c. Text-type: TD = Text-Dependent, TI = Text-Independent.
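Both tables report results either as accuracy or as Equal Error Rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of computing the EER from verification trials, assuming scikit-learn is available; the arrays labels (1 for genuine pairs, 0 for impostor pairs) and scores (similarity scores) are hypothetical inputs.

    import numpy as np
    from sklearn.metrics import roc_curve

    def equal_error_rate(labels, scores):
        # roc_curve gives the false acceptance rate (fpr) and the true
        # positive rate; the false rejection rate is 1 - tpr. The EER is
        # where the two error curves cross.
        far, tpr, _ = roc_curve(labels, scores)
        frr = 1.0 - tpr
        idx = int(np.argmin(np.abs(far - frr)))
        return (far[idx] + frr[idx]) / 2.0  # fraction; multiply by 100 for %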
C. Analysis of different datasets

In Table IV, we analyze widely used datasets such as TIMIT [20], NTIMIT [21], Switchboard [45], YOHO [46], VoxCeleb [22], ELSDSR [47], POLYCOST [48], ICSI Meeting speech [49], and 2010 NIST SRE [50]. For this analysis, we use attributes such as the number of subjects, number of utterances, type of speech, sample rate, dataset size, and application. The TIMIT [20] corpus (440 MB) was created to provide speech data for acoustic-phonetic studies and automatic speech recognition systems. VoxCeleb [22] is a dataset for large-scale speaker recognition, prepared from celebrities' YouTube videos. The dataset [22] is roughly gender-balanced (55% of the speakers are male), and the videos cover a wide range of backgrounds, professions, ages, and genders.

D. Analysis of traditional and deep learning-based approaches

Table V provides a comparison of the traditional and deep learning-based models. From this analysis, we can state that traditional methods take a lot of time for feature extraction because they measure frame-wise features such as fundamental frequency, formants, and pitch. Instead of extracting features manually, many automatic feature extraction techniques, such as MFCC [13] and LPCC [15], are available. Traditional approaches are usually two-step procedures: first, compute the features (e.g., MFCC), and then feed them into a classifier (e.g., GMM, HMM, or VQ), as sketched in the first example below. In a deep learning method, by contrast, voiceprint images are given directly as input to the model. For example, Lukic et al. [12] used a spectrogram, or voiceprint, as the input to a CNN, which learns directly from the input. The main advantage of a deep learning-based system is that it is fully automatic. In a deep learning-based approach, a CNN is well suited for classification; for comparison, however, the Siamese NN is a popular choice because it performs well on limited data. In a Siamese NN, a CNN can be used as a shared sub-network that generates feature vectors, as illustrated in the second example below.
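The classic two-step pipeline can be illustrated with a minimal sketch, assuming librosa and scikit-learn are available. The variable train_files (a mapping from each speaker to a list of enrollment recordings) and all function names are hypothetical and do not come from any surveyed paper; they merely instantiate the feature-then-classifier pattern (MFCC features, one GMM per speaker).

    import librosa
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def mfcc_frames(path, sr=16000, n_mfcc=13):
        # Step 1: frame-wise feature extraction; returns (frames, n_mfcc).
        y, sr = librosa.load(path, sr=sr)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

    def train_speaker_models(train_files, n_components=16):
        # Step 2: fit one diagonal-covariance GMM per enrolled speaker.
        models = {}
        for speaker, paths in train_files.items():
            frames = np.vstack([mfcc_frames(p) for p in paths])
            models[speaker] = GaussianMixture(
                n_components=n_components, covariance_type="diag").fit(frames)
        return models

    def identify(models, test_path):
        # Closed-set identification: score the test utterance under every
        # speaker model and pick the highest average log-likelihood.
        frames = mfcc_frames(test_path)
        return max(models, key=lambda s: models[s].score(frames))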
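For the comparison setting, a minimal PyTorch sketch of a Siamese network with a shared CNN sub-network follows. It assumes fixed-size log-spectrogram inputs of shape (batch, 1, 128, 128); the layer sizes are illustrative only and do not reproduce any architecture from the surveyed works.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SiameseVoiceNet(nn.Module):
        def __init__(self, embed_dim=128):
            super().__init__()
            # Shared CNN sub-network: maps one spectrogram to an embedding.
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, embed_dim),
            )

        def forward(self, spec_a, spec_b):
            # The same weights embed both inputs (the Siamese property);
            # the similarity score is the cosine of the two embeddings.
            emb_a, emb_b = self.cnn(spec_a), self.cnn(spec_b)
            return F.cosine_similarity(emb_a, emb_b)

Training such a network would typically use a contrastive or triplet loss so that same-speaker pairs receive high scores and different-speaker pairs receive low scores, which is what makes the architecture attractive when only limited data per speaker is available.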