Voice Based Human Identification using Machine Learning
All content following this page was uploaded by Hadeel sami Abu Arja on 29 August 2022.
Abstract—Voice and natural language processing is at the forefront of every human-machine interaction domain. Speech is an effortless and usable method of communication based on the sound waves generated by the speaker. Speech signal processing and pattern recognition allow a machine to identify and comprehend spoken human language. In this work, a methodology for speaker recognition based on machine learning algorithms is proposed. Support Vector Machine (SVM) and Random Forest (RF) models are used, with statistical features and Mel-Frequency Cepstral Coefficients (MFCC) as the input features of the models. A new voice dataset was collected for the purpose of training and evaluating speaker recognition models. Samples were obtained from non-native English speakers from the Arab region over the course of two months. The experiments showed that, using the developed methodology and the collected dataset, an identification accuracy of 94% can be achieved.

Index Terms—Voice Identification, Support Vector Machines, Random Forest, Mel Frequency Cepstral Coefficients.

I. INTRODUCTION

Speaker recognition is a valuable biometric recognition method. It aims to recognize a person’s identity from their captured speech. Such biometric recognition technologies have been used in many fields, including secure access to highly classified areas, machine control such as voice dialing, banking, databases, and multi-factor authentication. The goal of a speaker recognition system is to convert the speaker’s voice waveform into a parametric representation, which is then processed and used as input to the recognition models and approaches built to satisfy the system’s needs.

In the 1980s, the AI field underwent a powerful revolution, and research on speaker identification advanced greatly. The process during that period consisted of the following stages: data pre-processing, feature extraction, model construction, and model scoring. These stages formed the bedrock for many models and applications that have been developed over time.

In recent years, speaker recognition technology has achieved tremendous success: it has become more affordable and reliable, and its software has been enriched by the power of machine learning techniques, which are efficient at automated data analysis, pattern identification, and running algorithms that automatically enhance the decision-making process through learning from experience. This success has come along with state-of-the-art feature extraction algorithms such as Mel-Frequency Cepstral Coefficients (MFCC) [7]. MFCC is a leading and widely used approach to speech feature extraction. Its main advantage is that it is good at error reduction and can produce robust features when the signal is affected by noise.

Voice-based systems can be deployed in many applications, such as speaker identification, speaker verification, voice identification, control applications, and many more. Triyono et al. [8] proposed a smart house system that uses voice commands to perform operations such as turning the light on and off, turning the fan on and off, and reading the news. Other researchers, such as Grimaldi et al. [9], focused on the problem of speaker recognition. The main focus of [9] was to evaluate the performance of a generic Gaussian Mixture Model (GMM) classification system for speaker identification. Still other researchers take advantage of voice recognition systems for medical purposes. For example, Das et al. [10] propose a system that can detect respiratory problems by analysing changes in the recorded voice.

In this paper, we present a study using two machine learning approaches, Support Vector Machine (SVM) and Random Forest (RF), to examine the capability of optimized machine learning models to achieve the best performance on speaker identification. Three feature selection techniques were used to explore which features are most important for training the models: Recursive Feature Elimination (RFE), Minimum Redundancy Maximum Relevance (MRMR), and the Chi-Square test. After that, the SVM kernels (linear, polynomial, and RBF) were investigated to find the best choice, and the best number of decision trees for the RF model was then estimated. Finally, the performance of the proposed models is compared.

This paper is organized as follows. Section II presents the details of the dataset alongside the pre-processing methodology used to clean and prepare the captured voice signals, as well as the methodology used for feature extraction, feature selection, and building the recognition models. The results of our work are provided in Section III, and the conclusion is presented in Section IV.
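As a rough illustration of the three SVM kernels named in the introduction (linear, polynomial, and RBF), the sketch below writes each kernel as a plain function of two feature vectors. The parameter names and default values (`degree`, `gamma`, `coef0`) and the toy vectors are assumptions made for illustration; the paper does not state the hyperparameters it used.

```python
import math


def linear_kernel(x, y):
    """Linear kernel: K(x, y) = <x, y>."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel: K(x, y) = (gamma * <x, y> + coef0) ** degree.

    The default degree, gamma and coef0 are illustrative assumptions.
    """
    return (gamma * linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2).

    The default gamma is an illustrative assumption.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two made-up 2-D feature vectors, standing in for real voice features.
u = [1.0, 2.0]
v = [3.0, 4.0]

for name, kernel in [
    ("linear", linear_kernel),
    ("polynomial", polynomial_kernel),
    ("rbf", rbf_kernel),
]:
    print(name, kernel(u, v))
```

An SVM trained with one of these kernels measures similarity between samples through the corresponding function, which is why the choice of kernel can change identification accuracy on the same features.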
II. MATERIAL AND METHODS
A. Dataset
The dataset was gathered to provide a new English speech dataset suitable for training and evaluating speaker recognition systems. It provides insights into the accents of English speakers of Middle Eastern descent. It consists of 150 speakers with a total of 3,000 data samples and about six hours of speech. Twenty voice samples were collected per speaker, each about five to ten seconds long. The dataset was divided into two sub-datasets. The first sub-dataset contains ten samples per speaker repeating the phrase
“Machine learning 1, 2, 3, 4, 5, 6, 7, 8, 9, 10”. The second sub-dataset also contains ten samples per speaker, in which the speakers say different random phrases. Table I shows the division of the dataset in detail.

Fig. 1. Architecture of the dataset.

... and then warping the frequencies on a Mel scale, followed by applying the inverse Discrete Cosine Transform (DCT) [5].
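The Mel-scale warping and the DCT step mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the constants 2595 and 700 are one widely used Mel formula, the transform shown is an unnormalized type-II DCT (a common choice in MFCC implementations, whereas the text speaks of an inverse DCT without naming the variant), and the filter-bank energies are made-up values.

```python
import math


def hz_to_mel(f_hz):
    """Map a frequency in Hz to the Mel scale (common 2595/700 convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def dct2(values):
    """Unnormalized type-II DCT: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    count = len(values)
    return [
        sum(x * math.cos(math.pi / count * (n + 0.5) * k) for n, x in enumerate(values))
        for k in range(count)
    ]


# Hypothetical Mel filter-bank energies for one frame (illustrative values only).
filterbank_energies = [1.2, 0.8, 0.5, 0.3]

# Cepstral coefficients are taken from the DCT of the log energies.
log_energies = [math.log(e) for e in filterbank_energies]
cepstral_coefficients = dct2(log_energies)

print("1000 Hz in Mel:", hz_to_mel(1000.0))
print("cepstral coefficients:", cepstral_coefficients)
```

The Mel mapping is roughly linear at low frequencies and logarithmic at high ones, so the warped filter bank has finer resolution at low frequencies, where much of the information in speech lies.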
TABLE I
DATASET SUBSETS

Details                      Different Phrase   Same Phrase
Number of Samples            1500               1500
Number of Male Speakers      54                 54
Number of Female Speakers    96                 96
Data Duration                3.1 hours          2.9 hours

2) Correlation and Feature Selection: Correlation [6] is a statistical technique that provides insight into the strength of the relationship between features. This approach was applied to the extracted features and produced interesting outcomes. Figure 2 below shows the correlation heat map between features in the time domain.
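A correlation matrix of the kind a heat map visualizes can be computed as sketched below. The sketch assumes the Pearson coefficient, which the paper does not explicitly name, and the feature names and values are hypothetical stand-ins for the extracted time-domain features.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


def correlation_matrix(features):
    """Return a nested dict holding the pairwise correlation of every feature."""
    names = list(features)
    return {a: {b: pearson(features[a], features[b]) for b in names} for a in names}


# Hypothetical time-domain features, one value per voice sample.
features = {
    "mean_amplitude": [0.1, 0.2, 0.3, 0.4],
    "rms_energy": [0.2, 0.4, 0.6, 0.8],          # rises with mean_amplitude
    "zero_crossing_rate": [0.9, 0.7, 0.5, 0.3],  # falls as mean_amplitude rises
}

matrix = correlation_matrix(features)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

Feature pairs whose coefficient is close to 1 or -1 carry largely overlapping information, which is one reason to inspect correlation before selecting features.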