Voice Gender Recognition Using Deep Learning: December 2016

This conference paper presents a deep learning model using a Multilayer Perceptron (MLP) for voice gender recognition, achieving 96.74% accuracy on a dataset of 3,168 voice samples. The study employs acoustic analysis to extract features and utilizes various Python libraries, including Keras and TensorFlow, for model training and implementation. Additionally, an interactive web application has been developed to facilitate gender prediction from uploaded voice files.



Voice Gender Recognition Using Deep Learning

Conference Paper · December 2016


DOI: 10.2991/msota-16.2016.90


Authors: Mücahit Büyükyılmaz (Konya Metropolitan Municipality) and Ali Osman Çıbıkdiken (Karatay University)



Advances in Computer Science Research, volume 58
Modeling, Simulation and Optimization Technologies and Applications (MSOTA 2016)

Voice Gender Recognition Using Deep Learning


Mucahit Buyukyilmaz (1,*) and Ali Osman Cibikdiken (2)
(1) Necmettin Erbakan University, Advanced Computation and Data Analysis Laboratory, Konya, Turkey
(2) Necmettin Erbakan University, Department of Computer Engineering, Konya, Turkey
(*) Corresponding author

Abstract—This article describes a Multilayer Perceptron (MLP) deep learning model for recognizing voice gender. The data set has 3,168 recorded samples of male and female voices. The samples are produced by acoustic analysis. An MLP deep learning algorithm has been applied to detect gender-specific traits. Our model achieves 96.74% accuracy on the test data set. An interactive web page has also been built for recognizing the gender of a voice.

Keywords-deep learning; voice recognition; multilayer perceptron networks

I. INTRODUCTION

Acoustic analysis of the voice depends upon parameter settings specific to sample characteristics such as intensity, duration, frequency and filtering [1]. The acoustic properties of the voice and speech can be used to detect the gender of the speaker. The warbleR R package is designed for acoustic analysis. A data set of acoustic parameters can be obtained with this analysis, and such a data set can be used to train different machine learning algorithms. In this paper, an MLP has been used to obtain the model. The results have been compared with related work. A web page has been designed to detect the gender of a voice by using the obtained model.

II. RELATED WORK

Becker [2] used a frequency-based baseline model, a logistic regression model [3], a classification and regression tree (CART) model [4], a random forest model [5], a boosted tree model [6], a Support Vector Machine (SVM) model [7], an XGBoost model [8] and a stacked model [9] for recognition on the voices data set [10]. The results for these models are shown in Table I.

TABLE I. ACCURACY OF MODELS FOR VOICE RECOGNITION.

Model                      Train accuracy (%)   Test accuracy (%)
Frequency-based baseline   61                   59
Logistic regression        72                   71
CART                       81                   78
Random forest              100                  87
Boosted tree               91                   84
SVM                        96                   85
XGBoost                    100                  87
Stacked                    100                  89

III. DATA SET AND SOFTWARE LIBRARIES

A. Data Set

Each voice sample is a .WAV file. The .WAV files have been pre-processed for acoustic analysis using the specan function of the warbleR R package [11]. The specan function measures 22 acoustic parameters on acoustic signals. These parameters are shown in Table II.

TABLE II. MEASURED ACOUSTIC PROPERTIES.

Property   Description
duration   length of signal
meanfreq   mean frequency (in kHz)
sd         standard deviation of frequency
median     median frequency (in kHz)
Q25        first quantile (in kHz)
Q75        third quantile (in kHz)
IQR        interquantile range (in kHz)
skew       skewness
kurt       kurtosis
sp.ent     spectral entropy
sfm        spectral flatness
mode       mode frequency
centroid   frequency centroid
peakf      peak frequency
meanfun    average of fundamental frequency measured across acoustic signal
minfun     minimum fundamental frequency measured across acoustic signal
maxfun     maximum fundamental frequency measured across acoustic signal
meandom    average of dominant frequency measured across acoustic signal
mindom     minimum of dominant frequency measured across acoustic signal
maxdom     maximum of dominant frequency measured across acoustic signal
dfrange    range of dominant frequency measured across acoustic signal
modindx    modulation index
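Several entries in Table II (meanfreq, sd, median, Q25, Q75, IQR) can be read as statistics of an amplitude spectrum treated as a weight distribution over frequency. The sketch below illustrates that reading only. It is an assumption about how such summaries are commonly computed, not a transcription of the specan function, and the name spectral_summary and the three-bin spectrum are hypothetical.

```python
import math


def spectral_summary(freqs_khz, amplitudes):
    """Summarize a discrete amplitude spectrum; frequencies are in kHz."""
    total = sum(amplitudes)
    # Normalize the amplitudes so they act as weights that sum to 1.
    weights = [a / total for a in amplitudes]

    def quantile(q):
        # First frequency at which the cumulative weight reaches q.
        cumulative = 0.0
        for f, w in zip(freqs_khz, weights):
            cumulative += w
            if cumulative >= q:
                return f
        return freqs_khz[-1]

    meanfreq = sum(f * w for f, w in zip(freqs_khz, weights))
    sd = math.sqrt(sum(w * (f - meanfreq) ** 2 for f, w in zip(freqs_khz, weights)))
    q25, median, q75 = quantile(0.25), quantile(0.50), quantile(0.75)
    return {
        "meanfreq": meanfreq,
        "sd": sd,
        "median": median,
        "Q25": q25,
        "Q75": q75,
        "IQR": q75 - q25,
    }


# Hypothetical spectrum: three bins, with most energy at 0.2 kHz.
summary = spectral_summary([0.1, 0.2, 0.3], [1.0, 2.0, 1.0])
```

The fundamental-frequency and dominant-frequency properties in Table II need pitch tracking over time and are outside the scope of this sketch.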

Copyright © 2016, the Authors. Published by Atlantis Press.


This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

The pre-processed WAV files have been saved into a CSV file. The CSV file contains 3,168 rows and 21 columns. These 21 columns hold the features and the male or female classification.

B. Software Libraries

Python is an interpreted, interactive, object-oriented, dynamically typed, easy-to-learn, open source programming language. Python combines remarkable power with very clear syntax [12].

Keras “is a high-level neural networks library, written in Python and capable of running on top of either TensorFlow or Theano” [13].

TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them [14]. TensorFlow's flexible architecture runs on either GPU or CPU. It is mainly used for machine learning and deep neural network research, but it can be adapted easily to other domains.

NumPy is the open source fundamental package for scientific computing with Python. It contains powerful capabilities such as N-dimensional array objects, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, useful linear algebra, Fourier transform, and random number capabilities [15]. Arbitrary data types can be defined with NumPy. This allows NumPy to integrate seamlessly and speedily with a wide variety of databases. Keras uses NumPy for input data types.

Django is a free and open source high-level Python Web framework that encourages rapid development and clean, pragmatic design. Django is reassuringly secure, exceedingly scalable and was designed to help developers create applications as quickly as possible [16].

warbleR is a package designed to streamline acoustic analysis in R. This package allows users to collect open-access acoustic data or input their own data into a workflow that facilitates automated spectrographic visualization and acoustic measurements.

Rpy2 is a Python package that provides an interface for running R code embedded in a Python process.

IV. MULTILAYER PERCEPTRON NETWORKS

Deep feedforward networks, or Multilayer Perceptron (MLP) networks, are used in supervised learning problems. These problems have a training set of input-output pairs. The network must produce a model that finds the dependency between them. An MLP is one of the typical deep learning algorithms. It uses a supervised learning technique called backpropagation for training the network [17]. An MLP network contains an input layer, one or more hidden layers of computation nodes, and an output layer of nodes.

An MLP function f is a mapping $f: R^D \to R^L$, where D is the size of the input vector x and L is the size of the output vector f(x). This function can be represented in matrix notation:

$$f(x) = G\left(b^{(2)} + W^{(2)}\, s\left(b^{(1)} + W^{(1)} x\right)\right) \qquad (1)$$

with bias vectors $b^{(1)}$, $b^{(2)}$; weight matrices $W^{(1)}$, $W^{(2)}$; and activation functions G and s. $W^{(1)}_i$ represents the weights from the input units to the i-th hidden unit [18]. Generally, tanh is chosen as the activation function, with $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$ [19].

V. METHOD

All training, test and prediction code has been written using Python libraries. The data set has been loaded from the CSV file into a two-dimensional NumPy array with built-in Python libraries. Each row has 20 parameters and 1 label. The array has been shuffled randomly and split into 5 chunks. The first 4 chunks have 633 rows each and the last has 636 rows. The last column of the data, which is the label, has been converted to an integer (0 for male and 1 for female) in each of the 5 chunks.

5-fold cross validation has been used and the average score has been obtained. The training and test loop has been run 5 times. On each run a different chunk has been used for testing, and the other chunks have been concatenated into a NumPy array and used for training. On each loop, 20% of the data has been used for testing and 10% of the data has been used for validation. Keras has been used on top of TensorFlow and has been configured to use the GPU.

The model has 1 input layer, 4 hidden layers and 1 output layer. The input layer has 20 inputs and is connected to the first hidden layer, which has 64 perceptrons. The second and third hidden layers have 256 perceptrons each. The fourth hidden layer has 64 perceptrons. The output layer has 2 perceptrons. A softmax activation function is used in the output layer to obtain the categorical distribution over the labels. Dropout of 0.25 has been applied between the hidden layers. Dropout consists of randomly setting a fraction of input units to 0 at each update during training time, which helps to prevent overfitting.

The Nadam optimization algorithm in Keras has been used to train our model. The learning rate has been set to 0.001. This gives slow learning but keeps the optimizer from overshooting the minimum. Because of the low learning rate, our model has been trained for 150 epochs. Total training time is around 100-120 s for each fold. Several loss functions have been tested with our model, and the Kullback–Leibler divergence [20] has been chosen because it gave the best performance and accuracy. The model achieved 96.74% accuracy on the test data set. The results for the test data set are shown in Table III.

TABLE III. RESULTS OF TEST DATA

Gender   Correct   Incorrect
Male     1553      31
Female   1512      72
Total    3065      103

Model weights have been saved to an HDF5 file on each fold by using Keras. The best weight file has been chosen by fold accuracy. The chosen model weights have been used on the website to predict the gender of an uploaded voice.
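The counts in Table III can be checked with a few lines of arithmetic. The sketch below recomputes per-gender and overall accuracy from the published counts. The counts sum to 3,168, the size of the whole data set, which suggests (an inference, not something the authors state) that the table pools the test predictions of all five folds. The variable and function names are hypothetical.

```python
# (correct, incorrect) counts copied from Table III.
table_iii = {"male": (1553, 31), "female": (1512, 72)}


def accuracy(correct, incorrect):
    """Return accuracy as a percentage."""
    return 100.0 * correct / (correct + incorrect)


per_gender = {gender: accuracy(c, i) for gender, (c, i) in table_iii.items()}

total_correct = sum(c for c, _ in table_iii.values())      # 3065
total_incorrect = sum(i for _, i in table_iii.values())    # 103
overall = accuracy(total_correct, total_incorrect)

for gender, value in per_gender.items():
    print(f"{gender}: {value:.2f}%")
print(f"overall: {overall:.4f}%")
```

The overall figure is 3065/3168, roughly 96.7487%, which agrees with the reported 96.74% if the value was truncated rather than rounded. The male class is recognized more reliably than the female class (about 98.0% against about 95.5%).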

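The chunking and 5-fold loop described in Section V can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' code: the helper names split_into_chunks and folds are hypothetical, integer row indices stand in for the real feature rows, and the 10% validation split is left out because the paper does not say how it was drawn.

```python
import random


def split_into_chunks(rows, n_chunks=5, seed=0):
    """Shuffle the rows and cut them into n_chunks pieces; the last piece takes the remainder."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    size = len(rows) // n_chunks  # 3168 // 5 == 633
    chunks = [rows[i * size:(i + 1) * size] for i in range(n_chunks - 1)]
    chunks.append(rows[(n_chunks - 1) * size:])
    return chunks


def folds(chunks):
    """Yield (train, test) pairs, using each chunk exactly once as the test set."""
    for i, test in enumerate(chunks):
        train = [row for j, chunk in enumerate(chunks) if j != i for row in chunk]
        yield train, test


# Row indices 0..3167 stand in for the 3,168 rows of the CSV file.
chunks = split_into_chunks(range(3168))
chunk_sizes = [len(chunk) for chunk in chunks]
fold_sizes = [(len(train), len(test)) for train, test in folds(chunks)]
```

With 3,168 rows, integer division gives four chunks of 633 rows and a final chunk of 636 rows, matching the sizes reported in the paper.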

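Equation (1) extends naturally to the four hidden layers used in Section V. The sketch below runs one forward pass through layers of size 20, 64, 256, 256, 64 and 2 in plain Python. Several points are assumptions: the paper does not name the hidden-layer activation, so tanh is used because Section IV calls it the general choice; the weights are random and untrained; and dropout is omitted because it is only active during training. All function names are hypothetical.

```python
import math
import random

# Input layer, four hidden layers and output layer, as listed in Section V.
LAYER_SIZES = [20, 64, 256, 256, 64, 2]


def init_layers(sizes, seed=0):
    """Create small random weights and zero biases for each pair of adjacent layers."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def affine(weights, biases, x):
    """Compute b + W x for one layer."""
    return [b + sum(w * v for w, v in zip(row, x)) for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax; plays the role of G in Equation (1)."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(layers, x):
    """Apply tanh (the role of s) on every hidden layer and softmax on the output layer."""
    for weights, biases in layers[:-1]:
        x = [math.tanh(v) for v in affine(weights, biases, x)]
    weights, biases = layers[-1]
    return softmax(affine(weights, biases, x))


layers = init_layers(LAYER_SIZES)
probs = forward(layers, [0.5] * 20)  # hypothetical 20-feature input
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
```

The two outputs form a probability distribution over the labels, and a network of this shape has 100,354 trainable parameters (weights plus biases).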
A website has been developed by using the Django framework:

https://www.konya.edu.tr/acdal/projects/deep-learning-voice-gender-detection

The model is built the same way as in the training part. After the model is compiled, the saved HDF5 file is loaded and the model weights are set. The user can upload a wav or mp3 file in the web browser. Mp3 files are converted to wav format. The rpy2 library has been used to run R code inside Django. After the file is loaded and converted, the filename is passed to the R code by using rpy2. The voice file is read as a data frame and passed to the specan function of the warbleR library. The specan function returns 22 parameters for the loaded file. The chosen 20 parameters are passed to our model to predict the result. The result is received by Django and shown to the user. All computations have been performed in the Advanced Computing and Data Analysis Laboratory (ACDAL), Necmettin Erbakan University, Konya.

VI. CONCLUSION

The model obtained in this paper shows that the acoustic properties of voices and speech can be used to detect voice gender. An MLP has been used to obtain the classification model from a data set holding the parameters of voice samples. A larger data set of voice samples could reduce the incorrect classifications caused by intonation. The web page has been published so that the model can be developed further from uploaded examples of male and female voice samples.

ACKNOWLEDGMENT

This work is supported in part by the Necmettin Erbakan University, BAP Coordination Office.

REFERENCES

[1] A.P. Vogel, P. Maruff, P.J. Snyder, J.C. Mundt, Standardization of pitch-range settings in voice acoustic analysis, Behavior Research Methods, v.41, n.2, p.318-324, 2009.
[2] K. Becker, “Identifying the Gender of a Voice using Machine Learning”, 2016, unpublished.
[3] J.M. Hilbe, Logistic Regression Models, CRC Press, 2009.
[4] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, CRC Press, 1984.
[5] L. Breiman, “Random forests”, Machine Learning, Springer US, 45:5–32, 2001.
[6] J.H. Friedman, Stochastic Gradient Boosting, 1999.
[7] C. Cortes, V. Vapnik, “Support-vector networks”, Machine Learning, 20 (3): 273–297, 1995.
[8] J.H. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, 1999.
[9] L. Breiman, “Stacked regressions”, Machine Learning, Springer US, 45:5–32, 2001.
[10] Dataset, https://raw.githubusercontent.com/primaryobjects/voice-gender/master/voice.csv
[11] M. Araya-Salas, G. Smith-Vidaurre, warbleR: an R package to streamline analysis of animal acoustic signals, Methods Ecol Evolution, 2016, doi:10.1111/2041-210X.12624.
[12] Python, https://docs.python.org/3/faq/general.html
[13] Keras, Chollet, François, 2015, https://github.com/fchollet/keras
[14] M. Abadi, A. Agarwal, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[15] S. van der Walt, S.C. Colbert, G. Varoquaux, The NumPy Array: A Structure for Efficient Numerical Computation, Computing in Science & Engineering, 13, 22-30, 2011, doi:10.1109/MCSE.2011.37
[16] Django, https://djangoproject.com
[17] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating errors, Nature 323: 533-536, 1986.
[18] http://deeplearning.net/tutorial/_sources/mlp.txt
[19] S. Haykin, Neural Networks: A Comprehensive Foundation (2 ed.), Prentice Hall, 1998.
[20] S. Kullback, R.A. Leibler, On information and sufficiency, Annals of Mathematical Statistics, 22 (1): 79–86, 1951.
