Gender Recognition Paper 4
ABSTRACT:
Gender classification is one of the major speech analysis problems. The acoustic features
of human speech differ between genders due to physiological differences in the glottis and in
the thickness and length of the vocal tract. The vocal folds of a male speaker are longer than
those of a female speaker, which means vocal fold length is gender dependent. Because of the
longer vocal folds, the male voice has a lower fundamental frequency. Based on this idea, the
frequency features of voice signals can be used to differentiate gender. The aim of this paper
is therefore to classify gender from voice signals using machine learning algorithms.
Keywords: Logistic Regression, Decision Tree, Support Vector Machine, Naïve Bayes
Classifier, k-Nearest Neighbors.
[1] INTRODUCTION
Speech signals carry different types of information. Speech recognition provides
information about the content of the speech signal. Speaker recognition conveys information
about the speaker’s identity. Emotion recognition delivers information about the speaker’s
emotional state. Health recognition offers information on the patient’s health status. Language
recognition yields information about the spoken language. Accent recognition produces
information about the speaker’s accent. Age recognition gives information about the speaker’s
age. Gender recognition carries information about the speaker’s gender [1]. Gender recognition
plays a vital role in speech signal processing. Using machine learning algorithms, the gender
of an individual can often be identified from the voice. Automatic gender recognition is the
process of recognizing whether the speaker is male or female.
It is easy for a human listener to recognize or identify a person’s gender by hearing the
voice. This paper was developed with the aim of making a machine learn to identify the gender
of a given voice (real-world input). Machine learning algorithms such as Logistic Regression,
Decision Tree, Support Vector Machine, Naïve Bayes and k-Nearest Neighbours were used.
Advantages
Feature selection is performed to identify relevant features.
The data is trained and tested with different classification algorithms.
The classification models are applied to features extracted from a new audio file.
[4] METHODOLOGY
The system takes an audio file as input. Acoustic features are then extracted from the
audio file in numerical form, and the various classification algorithms are trained and tested
with the available dataset. The accuracy of each algorithm is evaluated and compared to find
the best-fit model. Using this model, the gender of a person is predicted from a new audio file.
[Figure-1] shows the system architecture, which defines the sequence of steps involved:
data collection, data pre-processing, training and testing the data on various classification
algorithms, and identifying the gender.
The dataset used to train the model is the voice.csv dataset, which has 20 features, 1 target
label and 3168 observations.
The features in the voice.csv dataset are as follows:
meanfreq: mean frequency (in kHz) which defines the average of all the frequencies.
sd: standard deviation of frequency, which defines how much the frequency values in the list
vary from the mean of the list.
median: median frequency (in kHz) which defines the value separating the upper half from
the lower half of the frequency list.
Q25: first quartile (in kHz) which defines the median of the lower half of the frequency list,
i.e., 25% of frequencies are lower than Q25 and 75% of frequencies are above Q25.
Q75: third quartile (in kHz) which defines the median of the upper half of the frequency list,
i.e., 75% of frequencies are lower than Q75 and 25% of frequencies are above Q75.
IQR: interquartile range (in kHz) defines the difference between third quartile and first
quartile.
skew: skewness defines the measure of the asymmetry of the frequency spectrum. It can be
positive, negative or zero. If skewness is positive, the spectrum spreads out more to the right
of the mean than to the left; if it is negative, the spectrum spreads out more to the left of the
mean than to the right; if it is zero, the spectrum is symmetric about the mean.
kurt: kurtosis defines the measure of the combined weight of the distribution’s tails relative
to the centre of the distribution.
sp.ent: spectral entropy defines the measure of disorganisation of frequency.
sfm: spectral flatness defines the measure of similarity of the spectrum to noise.
mode: mode frequency defines the frequency which occurs the maximum number of times.
centroid: spectral centroid defines a measure used in digital signal processing to
characterise a spectrum.
peakf: peak frequency defines the frequency with highest energy.
meanfun: average of fundamental frequency measured across the acoustic signal.
minfun: minimum fundamental frequency measured across the acoustic signal.
maxfun: maximum fundamental frequency measured across the acoustic signal.
meandom: average of dominant frequency measured across the acoustic signal.
mindom: minimum of dominant frequency measured across the acoustic signal.
maxdom: maximum of dominant frequency measured across the acoustic signal.
dfrange: range of dominant frequency measured across the acoustic signal.
modindx: modulation index, calculated as the accumulated absolute difference between
adjacent measurements of the fundamental frequency divided by the frequency range.
label: male or female.
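To make the statistics listed above concrete, the following is a minimal sketch, in Python with NumPy, of one plausible way to compute several of these frequency statistics from a magnitude spectrum. The exact procedure used to build voice.csv is not specified in the paper, so this function is an illustrative assumption rather than the original extraction code.

import numpy as np

def spectral_stats(freqs_khz, magnitude):
    # Frequency statistics of a magnitude spectrum.
    # freqs_khz: frequency bins in kHz; magnitude: spectral magnitude per bin.
    weights = magnitude / magnitude.sum()            # normalise to a distribution
    meanfreq = np.sum(freqs_khz * weights)           # mean frequency
    sd = np.sqrt(np.sum(weights * (freqs_khz - meanfreq) ** 2))  # standard deviation
    cum = np.cumsum(weights)                         # cumulative distribution
    q25 = freqs_khz[np.searchsorted(cum, 0.25)]      # first quartile
    median = freqs_khz[np.searchsorted(cum, 0.50)]   # median frequency
    q75 = freqs_khz[np.searchsorted(cum, 0.75)]      # third quartile
    return {"meanfreq": meanfreq, "sd": sd, "median": median,
            "Q25": q25, "Q75": q75, "IQR": q75 - q25}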
Data pre-processing is the process of cleaning and transforming raw inputs before
processing and analysis. Data cleaning involves identifying missing values and filling them
with the mean value of the attribute, and is performed on the voice.csv dataset before training.
The dataset is then split into train and test sets. Feature selection is the process of selecting,
automatically or manually, the features that are most appropriate for classification or
prediction. The mean frequency, standard deviation, spectral entropy, skew, kurtosis, spectral
flatness, Q25, Q75, IQR and median features are selected; the remaining attributes are
dependent on these attributes.
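A minimal sketch of this pre-processing and feature-selection step, assuming pandas and scikit-learn are available and that voice.csv uses the column names listed above:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the voice.csv dataset (3168 observations, 20 features, 1 label).
data = pd.read_csv("voice.csv")

# The ten attributes selected above.
feature_cols = ["meanfreq", "sd", "sp.ent", "skew", "kurt",
                "sfm", "Q25", "Q75", "IQR", "median"]

# Data cleaning: fill any missing values with the mean of the attribute.
data[feature_cols] = data[feature_cols].fillna(data[feature_cols].mean())

# Encode the target label: 0 for male, 1 for female.
data["label"] = data["label"].map({"male": 0, "female": 1})

# Split the selected features and the label into train and test sets.
X = data[feature_cols]
y = data["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)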
The audio file needs to be converted into discrete data in order to extract the features of
the acoustic signal. The signal in the audio file is in continuous form; by sampling, the
continuous signal is converted into a discrete sequence of values. The sampling frequency or
sampling rate of a sound wave defines the rate at which amplitudes are captured. The Fourier
transform is a mathematical tool used to convert a continuous signal from the time domain to
the frequency domain. The Discrete Fourier Transform (DFT) performs the same conversion
for a discrete signal, decomposing a sequence of samples into its frequency constituents, and
the Fast Fourier Transform (FFT) is an efficient algorithm for computing the DFT. The spectral
and frequency values of the signal are therefore calculated using the DFT/FFT.
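A minimal sketch of the sampling and FFT step, assuming a mono WAV recording; the file name is hypothetical:

import numpy as np
from scipy.io import wavfile

# Read the audio file: returns the sampling rate (Hz) and the discrete samples.
rate, samples = wavfile.read("speaker.wav")   # hypothetical file name, mono assumed
samples = samples.astype(np.float64)

# FFT: convert the time-domain samples to the frequency domain.
spectrum = np.fft.rfft(samples)
freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)   # frequency bins in Hz

# Magnitude spectrum, from which features such as mean frequency,
# quartiles and spectral entropy can be derived.
magnitude = np.abs(spectrum)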
[4.1] LOGISTIC REGRESSION
Logistic regression is a classification algorithm used when the target variable is
categorical in nature. It is a supervised learning algorithm that estimates the probability that a
given data entry belongs to a particular target class. Logistic regression models the data using
the sigmoid function. Binomial logistic regression is used here, as the target variable (label)
takes two values: 0 (male) and 1 (female).
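A minimal sketch of this classifier, continuing the pre-processing sketch above (it reuses X_train, X_test, y_train and y_test):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Binomial logistic regression on the selected acoustic features.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = log_reg.predict(X_test)
print("Logistic Regression accuracy:", accuracy_score(y_test, y_pred))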
[4.2] DECISION TREE
First, the best attribute of the dataset is selected and placed at the root node. The training
set is then split into subsets such that each subset contains the same value for that attribute.
Leaf nodes are found in all branches by repeating the above steps on each subset. The root
node and internal nodes can be selected using the Gini index, information gain or entropy
methods.
The Gini index is the default method used to select which of the n attributes of the dataset
is placed at the root node or an internal node. The Gini index is a metric that measures how
often a randomly chosen element would be incorrectly classified, which means an attribute
with a lower Gini index should be preferred.
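A minimal sketch using the Gini criterion, again reusing the train/test split from the pre-processing sketch:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Decision tree using the Gini index as the attribute selection measure.
tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X_train, y_train)

print("Decision Tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))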
[4.4] NAÏVE BAYES CLASSIFIER
The Naïve Bayes classifier is a classification technique based on Bayes’ theorem with the
assumption of independence among predictors. Naïve Bayes classifiers assume that the
presence of a particular feature in a class is unrelated to the presence of any other feature, i.e.,
that all of these features contribute independently to the probability. Naïve Bayes is not a
single algorithm but a family of algorithms that share a common principle: every pair of
features being classified is independent of each other.
Bayes’ theorem is calculated as
P(C|x) = P(x|C) * P(C) / P(x)
where P(C) is the class prior probability, P(x) is the predictor prior probability, P(x|C) is the
likelihood and P(C|x) is the posterior probability.
The classifier computes the posterior probability for each class and assigns the data entry
to the class with the highest posterior probability.
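A minimal sketch using a Gaussian Naïve Bayes classifier, one common choice for continuous acoustic features (the paper does not state which variant was used), reusing the earlier train/test split:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Gaussian Naive Bayes: assumes each feature is normally distributed per class
# and that features are conditionally independent given the class.
nb = GaussianNB()
nb.fit(X_train, y_train)

print("Naive Bayes accuracy:", accuracy_score(y_test, nb.predict(X_test)))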
[5] RESULTS
The accuracy of the different classification models is calculated and compared. The model
with the highest accuracy is considered the best classification model for the voice data. The
accuracies are shown in [Table 1] below.
Classifier                 Accuracy
Logistic Regression        86%
Support Vector Machine     89%
Decision Tree              76%
K-Nearest Neighbours       93%
Naïve Bayes                85%
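The accuracies above are the results reported in the paper. The following sketch only illustrates how such a comparison could be run on the same train/test split; the hyperparameters (e.g. k = 5 for KNN) are illustrative assumptions.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Train every candidate classifier on the same split and report its accuracy.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(criterion="gini", random_state=42),
    "K-Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, ":", round(results[name] * 100, 1), "%")

# Keep the model with the highest test accuracy as the best-fit model.
best = max(results, key=results.get)
print("Best model:", best)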
The features are extracted from a voice recording or audio file, which is in the form of a
continuous signal, so it needs to be transformed into discrete data. Sampling is performed on
the audio file; the sampling frequency or sampling rate of the sound wave defines the rate at
which amplitudes are captured.
The audio file is represented as an amplitude-versus-time graph to differentiate between
different voice signals visually.
The frequency spectra of the signals are plotted as shown in [Figure-2] and [Figure-3].
The gender of a particular voice signal is identified using the features extracted from the
corresponding voice signal.
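A minimal plotting sketch, reusing the samples, rate, freqs and magnitude arrays from the FFT sketch above; matplotlib is assumed to be available.

import numpy as np
import matplotlib.pyplot as plt

# Time axis for the amplitude-versus-time plot.
time = np.arange(len(samples)) / rate

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))

# Amplitude versus time (as in the waveform plots).
ax1.plot(time, samples)
ax1.set_xlabel("Time (s)")
ax1.set_ylabel("Amplitude")

# Magnitude spectrum versus frequency (as in the spectrum plots).
ax2.plot(freqs, magnitude)
ax2.set_xlabel("Frequency (Hz)")
ax2.set_ylabel("Magnitude")

plt.tight_layout()
plt.show()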
[6] CONCLUSION
The ultimate aim is to build a system that is able to identify whether a voice is male or
female based on voice signals. Gender recognition has many uses, such as biometrics, mobile
healthcare systems, remote access to computers and automatic transfer of phone calls. The
acoustic features are selected and trained on various classification algorithms. Among them,
k-Nearest Neighbours showed the best accuracy, so the gender of a new audio file is identified
through the KNN model. Further classification algorithms can be trained and compared to
obtain a more appropriate model, and the features can be selected using other feature selection
methods. The audio file should be recorded without noise and disturbance to get accurate
results.
REFERENCES
[1] Ali Bou Nassif, Ismail Shahin, Imtinan Attili, Mohammad Azzeh, and Khaled Shaalan,
“Speech Recognition Using Deep Neural Networks: A Systematic Review”, IEEE, 2019
[2] R. Shiva Shankar, J. Raghaveni, Pravallika Rudraraju, Y. Vineela Sravya, “Classification
of Gender by Voice Recognition using Machine Learning Algorithms”, International Journal
of Advanced Science and Technology, 2020
[3] Steve Jadav, “Voice-Based Gender Identification Using Machine Learning”, International
Conference on Computing, Communication and Automation, IEEE, 2018
[4] A. Raahul, R. Sapthagiri, K. Pankaj and V. Vijayarajan, “Voice based gender classification
using machine learning”, IOP Publishing, 2017
[5] Hadi Harb, Liming Chen, “Voice-based Gender Identification in Multimedia
Applications”, 2016
[6] Remna R. Nair, Bhagya Vijayan, “Voice based Gender Recognition”, IRJET, 2019
[7] Francois Pachet, Pierre Roy, “Analytical Features: A Knowledge-Based Approach to Audio
Features Generation”, EURASIP Journal on Audio, Speech, and Music Processing, 2009
[8] Shai Shalev-Shwartz, Shai Ben-David, “Understanding Machine Learning”, Cambridge
University Press, 2014
[9] K.R. Rao, D.N. Kim, J.J. Hwang, “Fast Fourier Transform: Algorithms and Applications”,
Springer, 2010
[10] Julius O. Smith, “Spectral Audio Signal Processing”, W3K Publishing, 2011