
Voice-Based Human Identification using Machine Learning

Conference Paper · June 2022


DOI: 10.1109/ICICS55353.2022.9811154




Voice-Based Human Identification using Machine
Learning
Baha A. Alsaify∗, Hadeel S. Abu Arja∗, Baskal Y. Maayah∗, Masa M. Al-Taweel∗, Rami Alazrai†, Mohammad I. Daoud†
∗Department of Network Engineering and Security
Jordan University of Science and Technology
{baalsaify, hsabuarja17, byalmousa17, mmaltaweel172}@just.edu.jo
†Department of Computer Engineering
German Jordanian University
{rami.azrai, mohammad.daoud}@gju.edu.jo

Abstract—Voice and natural language processing is at the forefront of any human-machine interaction domain. Speech is an effortless and usable method of communication that is based on the sound waves generated by the speaker. It permits the machine to identify and comprehend human spoken language through speech signal processing and pattern recognition. In this work, a methodology for speaker recognition based on machine learning algorithms is proposed. Support Vector Machine (SVM) and Random Forest (RF) models are used with statistical features and Mel-Frequency Cepstral Coefficients (MFCC) as the input features of the models. A new voice dataset was collected for the purpose of training and evaluating speaker recognition models. Samples were obtained from non-native English speakers from the Arab region over the course of two months. The performed experiments showed that, using the developed methodology and the collected dataset, a 94% identification accuracy can be achieved.

Index Terms—Voice Identification, Support Vector Machines, Random Forest, Mel Frequency Cepstral Coefficients.

I. INTRODUCTION

Speaker recognition is a valuable bio-feature recognition method. It aims to recognize someone's identity based on their captured speech. These biometric recognition technologies have been used in many fields, such as secure access to highly classified areas, machine control (e.g., voice dialing), banking, databases, and multi-factor authentication. The goal of a speaker recognition system is to convert the speaker's voice waveform into a parametric representation that is later processed and used as the input data for the recognition models and approaches built to satisfy the system's needs.

In the 1980s, the AI field underwent a powerful revolution, and research on speaker identification developed greatly. The process during that period consisted of the following stages: data pre-processing, feature extraction, model construction, and model scoring. These stages formed the bedrock for many models and applications that have been developed over time.

In recent years, speaker recognition technology has achieved tremendous success as it has become more affordable and reliable, and its software has been enriched by the power of machine learning techniques, owing to their efficiency in automated data analysis, pattern identification, and processing algorithms that enhance the decision-making process automatically through the learning experience. This progress has been accompanied by state-of-the-art feature extraction algorithms such as Mel-Frequency Cepstral Coefficients (MFCC) [7]. MFCC is a leading and widely used approach to speech feature extraction; its main advantages are good error reduction and the ability to produce robust features when the signal is affected by noise.

Voice-based systems can be deployed in many applications, such as speaker identification, speaker verification, voice identification, control applications, and many more. Triyono et al. [8] proposed a smart house system that uses voice commands to perform certain operations, such as turning the light on and off, turning the fan on and off, and reading the news. Other researchers, such as Grimaldi et al. [9], focused on the problem of speaker recognition; the main focus of [9] was to evaluate the performance of a generic Gaussian Mixture Model (GMM) classification system for speaker identification. Still other researchers take advantage of voice recognition systems for medical purposes. For example, the work of Das et al. [10] proposes a system that can detect respiratory problems by analysing the changes in the recorded voice.

In this paper, we offer a study using two machine learning approaches, Support Vector Machine (SVM) and Random Forest (RF), to examine the capability of optimized machine learning models to achieve the best performance on speaker identification. Three feature selection techniques were used to explore which features are most important for training the models: Recursive Feature Elimination (RFE), Minimum Redundancy Maximum Relevance (MRMR), and the Chi-Square test. After that, an investigation is carried out to find the best choice among the SVM kernels, namely the linear, polynomial, and RBF kernels, and an attempt is made to estimate the best number of decision trees for the RF model. The process is followed by a comparison between the performance of the proposed models.

This paper is organized as follows. Section II presents the details of the used dataset, alongside the pre-processing methodology we used to clean and prepare the captured voice signals; the methodology used for feature extraction, feature selection, and building the recognition models is also presented in this section. The results of our work are provided in Section III, and the conclusion is presented in Section IV.
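The tuning workflow outlined above — a kernel comparison for the SVM and a parameter search for the RF — can be sketched end-to-end as follows. This is a minimal illustration using scikit-learn (an assumed toolkit; the paper does not publish its code), with synthetic vectors standing in for the MFCC and statistical features, and illustrative parameter ranges rather than the paper's exact ones:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the per-sample feature vectors (hypothetical data).
X, y = make_classification(n_samples=300, n_features=12, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
# 90%/10% train/test split, as used in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

# SVM: exhaustive grid search over the kernel and the C and gamma parameters.
svm_search = GridSearchCV(
    SVC(),
    {"kernel": ["linear", "poly", "rbf"],
     "C": [0.1, 1, 10],
     "gamma": [0.01, 0.1, 1]},
    cv=3,
)
svm_search.fit(X_tr, y_tr)

# RF: randomized search over a brute-force-generated parameter space.
rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 500, 1000],
     "max_depth": [10, 50, None],
     "min_samples_split": [2, 5],
     "min_samples_leaf": [1, 2]},
    n_iter=5, cv=3, random_state=0,
)
rf_search.fit(X_tr, y_tr)

print("best SVM parameters:", svm_search.best_params_)
print("best RF parameters: ", rf_search.best_params_)
print("RF held-out accuracy:", rf_search.score(X_te, y_te))
```

Grid search evaluates every combination in the grid, while randomized search samples a fixed number of candidates, which is why the latter suits the larger RF parameter space.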
II. MATERIAL AND METHODS

A. Dataset

The goal of gathering the dataset was to build a new English speech dataset suitable for training and evaluating speaker recognition systems. The dataset provides insights into the accents of English speakers of Middle Eastern descent. It consists of 150 speakers with a total of 3,000 data samples and about six hours of speech. Twenty voice samples were collected per speaker, each about five to ten seconds long. The dataset was divided into two sub-datasets. The first sub-dataset contains ten samples per speaker repeating the phrase "Machine learning 1, 2, 3, 4, 5, 6, 7, 8, 9, 10". The second sub-dataset contains ten samples per speaker speaking different, randomly chosen phrases. Table I shows the division of the dataset in detail.

TABLE I
DATASET SUBSETS

Details                    Different Phrase   Same Phrase
Number of Samples          1500               1500
Number of Male Speakers    54                 54
Number of Female Speakers  96                 96
Data Duration              3.1 hours          2.9 hours

The dataset was obtained in a real-world, noisy, single-speaker environment, which means it is free from audience noise and overlapping speech. The voice samples were recorded using different recording hardware (e.g., microphones, mobile devices, and laptops) and then uploaded to cloud applications. The collected samples were received in various formats (e.g., mp4 [1], Ogg [2]), and all of them were converted to the .flac format to obtain a unified format. Also, samples that have two channels (stereo) were converted into one channel (mono) to have balanced and unified data. The final architecture of the dataset is shown in Fig. 1. For more information regarding the dataset, please refer to [11].

Fig. 1. Architecture of the dataset.

B. Pre-processing methodology

1) Feature extraction: The first step for voice recognition is to extract features from a given signal. Overall, twenty Mel-Frequency Cepstral Coefficients (MFCCs) in addition to seventeen statistical features were extracted from each sample in the time domain. MFCCs mimic the human ear's greater sensitivity at lower frequencies: the cochlea [3] acts as an anatomical filter bank with many filters at low frequencies and very few filters at higher frequencies. The MFCC algorithm converts the conventional frequency scale to the Mel scale. The MFCC extraction technique includes windowing the signal, applying the Discrete Fourier Transform (DFT) [4], taking the log of the magnitude, warping the frequencies onto a Mel scale, and then applying the inverse Discrete Cosine Transform (DCT) [5].

2) Correlation and feature selection: Correlation [6] is a statistical technique that provides insight into the level of relationship between features. This approach was applied to the extracted features and yielded interesting outcomes. Figure 2 shows the correlation heat map between the features in the time domain.

Fig. 2. Correlation heat map between features in the time domain.

For each pair of features with a correlation higher than 0.9, one of the two features was excluded; this resulted in 12 features to work with (chroma stft, rmse, spectral centroid, spectral bandwidth, spectral contrast, spectral flatness, mean, max, min, skew, kurt, and iqr). After applying correlation, three feature selection techniques were applied to determine which of the features are most important for training the models.
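The redundancy filter described above — dropping one feature from every pair whose correlation exceeds 0.9 — can be sketched as follows. The 0.9 threshold comes from the text; the toy feature matrix and feature names are purely illustrative:

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Keep a feature only if its absolute Pearson correlation with
    every already-kept feature is at or below `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]

# Toy example: "f2" is a near-copy of "f0", so one of the pair is dropped.
rng = np.random.default_rng(0)
f0 = rng.normal(size=200)
f1 = rng.normal(size=200)
f2 = f0 + 0.01 * rng.normal(size=200)  # correlated with f0 well above 0.9
X = np.column_stack([f0, f1, f2])
X_reduced, kept = drop_correlated(X, ["f0", "f1", "f2"])
# kept -> ["f0", "f1"]
```

Scanning features in order and keeping a feature only if it is weakly correlated with all previously kept ones is one common way to realize "one of each highly correlated pair is excluded".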
• Recursive Feature Elimination (RFE)
• Minimum Redundancy Maximum Relevance (MRMR)
• Chi-Square

In our proposed work, after data preparation, the three feature selection techniques, namely RFE, MRMR, and Chi-2, were applied. They show that the models tend to give a higher performance without excluding any features. This is due to the small number of features that we have; every piece of extracted information was important enough to be fed into the models for training and evaluation. Table II lists the features selected by each technique and shows how much, on average, the accuracy of the machine learning models dropped.

TABLE II
FEATURE SELECTION TECHNIQUES RESULTS

                RFE                 MRMR                Chi-2
Output          spectral bandwidth  spectral centroid   spectral centroid
                Standard Deviation  spectral bandwidth  spectral bandwidth
                skew                mean                spectral flatness
                median              rolloff             rolloff
                MFCC 5              kurt                skew
                                    skew                kurt
Accuracy drop   20%                 13%                 32%

C. Model Training

In this work, to identify the speakers using their recorded voice, two main classification algorithms are used: Support Vector Machine (SVM) and Random Forest (RF).

1) Support Vector Machine (SVM): SVM is a supervised learning method used for regression, classification, and outlier detection. SVM was one of the most powerful classification approaches in machine learning before the revolution of neural networks, and it is still considered a valuable algorithm in many fields today. In the process of building the SVM model, methods were applied to estimate the best hyper-parameters in order to improve the model's accuracy and achieve the best performance. A brute-force approach was applied to determine which kernel performs best based on performance metrics such as precision, recall, and f1-score. Table III summarizes the performance of each kernel based on the classification report.

TABLE III
KERNELS' PERFORMANCES ON THE SVM MODEL

Kernel      precision   recall   f1-score
Polynomial  0.00        0.01     0.00
RBF         0.03        0.02     0.02
Linear      0.83        0.83     0.82
Sigmoid     0.00        0.00     0.00

The kernel that gave logical results and was used to build our model is the linear kernel. This was expected, given that the provided data is linearly separable. After finding the best kernel, a grid search method was applied to estimate the best values for the two other parameters (C and gamma). Grid search is an approach for hyper-parameter tuning; it methodically evaluates a model built from each combination of the parameters specified in a grid. Using the available dataset, the algorithm consumed approximately 9 minutes to fit all combinations and resulted in the following parameters: C=0.1, gamma=1, and kernel=linear. After applying the grid search methodology and hyper-parameter tuning, the model was trained on 90% of the data using the best combination of parameters, which yielded the most precise predictions the model was able to give. The model was then tested on the three hundred samples that represent the remaining 10% of the data.

2) Random Forest (RF): The RF algorithm is composed of many decision trees. The trees are built on random data samples; predictions are acquired from each tree, and the majority vote is selected as the final result. It prioritizes characteristics by removing the least significant ones and focusing on the most important ones. As done previously with the SVM model, a fine-tuning algorithm was applied to estimate the best parameters of the RF model. Candidate values were generated using a brute-force script, and the Random Search algorithm was then used to estimate the best value for each parameter. The algorithm fitted 3 folds for each of 100 candidates, totalling 300 fits. It took about 15 minutes to obtain the following results: n_estimators=1000, min_samples_split=2, min_samples_leaf=1, max_features=auto, max_depth=50, and bootstrap=False.

III. RESULTS

The results of applying the previous two algorithms to the acquired dataset are provided in this section. Table IV shows the results obtained from the two proposed machine learning models. The Random Forest shows promising results with an accuracy of 94%, compared to the Support Vector Machine, which achieved an accuracy of 83%.

TABLE IV
THE ACCURACY OF THE TWO MODELS

Algorithm   accuracy
SVM         83%
RF          94%

IV. CONCLUSION

Voice recognition has attracted scientists as an important domain and has created a technological impact on society that is expected to grow in this area of human-machine interaction. In this research, we tried to summarize the machine learning practices in the field of voice recognition. We described the basic principles, methods, and analysis of the issues that voice recognition faces. MFCCs are calculated as features to characterize audio content. A comparison of the performance of the models was carried out. In terms of recognition accuracy, the RF model outperformed the SVM model.

REFERENCES

[1] 3GPP2 (2007). 3GPP2 File Formats for Multimedia Services (1st ed., Vol. 3). 3GPP2 C.S0050-B.
[2] I. Goncalves, S. Pfeiffer, and C. Montgomery (2008). Ogg Media Types. sec. 10. doi:10.17487/RFC5334. RFC 5334.
[3] Anne M. Gilroy, Brian R. MacPherson, and Lawrence M. Ross (2008). Atlas of Anatomy. Thieme. p. 536. ISBN 978-1-60406-151-2.
[4] Smith, Steven W. (1999). "Chapter 8: The Discrete Fourier Transform". The Scientist and Engineer's Guide to Digital Signal Processing (Second ed.). San Diego, Calif.: California Technical Publishing.
[5] Stanković, Radomir S., and Astola, Jaakko T. (2012). "Reminiscences of the Early Work in DCT: Interview with K.R. Rao".
[6] Croxton, Frederick Emory, Cowden, Dudley Johnstone, and Klein, Sidney (1968). Applied General Statistics. Pitman. ISBN 9780273403159.
[7] Hossan, M. A., Memon, S., and Gregory, M. A. (2010, December). A novel approach for MFCC feature extraction. In 2010 4th International Conference on Signal Processing and Communication Systems (pp. 1-5). IEEE.
[8] Triyono, L., Yudantoro, T. R., Sukamto, S., and Hestinigsih, I. (2021, March). VeRO: Smart home assistant for blind with voice recognition. In IOP Conference Series: Materials Science and Engineering (Vol. 1108, No. 1, p. 012016). IOP Publishing.
[9] Grimaldi, M., and Cummins, F. (2008). Speaker identification using instantaneous frequencies. IEEE Transactions on Audio, Speech, and Language Processing, 16(6), 1097-1111.
[10] Das, S. (2019, March). A Machine Learning Model for Detecting Respiratory Problems using Voice Recognition. In 2019 IEEE 5th International Conference for Convergence in Technology (I2CT) (pp. 1-3). IEEE.
[11] Alsaify, B. A., Arja, H. S. A., Maayah, B. Y., and Al-Taweel, M. M. (2022). A dataset for voice-based human identity recognition. Data in Brief, 108070.
