Evaluation of Machine Learning Models For Employee Churn
Abstract— Employees are valuable assets of any organization. If they quit unexpectedly, the organization may incur a large cost: new hiring consumes money and time, and freshly hired employees take time to become profitable for the organization. In this paper we therefore try to build a model that predicts employee churn based on the HR analytics data set obtained from the Kaggle website. To show the relation between attributes, a correlation matrix and a heatmap are generated. In the experimental part, histograms are generated that contrast the employees who left against salary, department, satisfaction level, etc. For prediction, we use five machine learning algorithms: linear support vector machine, C 5.0 decision tree classifier, random forest, k-nearest neighbor, and Naïve Bayes classifier. This paper proposes the reasons behind employee attrition that an organization can address to reduce it.
Keywords—Turnover; Job Satisfaction; Attrition; Organization; Employee Retention Strategy
I. INTRODUCTION
Employee attrition [1] is a reduction in manpower in an organization, where employees voluntarily leave the organization or retire. Employee turnover is the number of existing employees replaced by new employees in a specific period. High attrition causes high employee turnover in an organization. This in turn causes large expenditure on human resources through the recruitment, training, and development of the freshly appointed employees, as well as performance management. Voluntary attrition [2] is unavoidable. Nevertheless, by improving employee morale and providing a desirable working environment, we can reduce this problem significantly.
The rate of attrition is defined by the recruitment and termination criteria of the company. An employee can leave the job for various reasons. ‘Turnover’ and ‘attrition’ are business terms that are often confused with each other. There are various kinds of turnover in an organization. A reduction in the number of employees is mainly considered attrition. For analyzing manpower data and other measurements that are necessary for manpower planning, these terms can be used interchangeably. When an employee leaves the company, both attrition and turnover happen. Turnover, however, can occur because of various work actions, for example discharge, termination, resignation, or abandonment of the position. Attrition happens when an employee resigns or when the organization eliminates the position. The real difference between the two is that when turnover happens, the organization looks for somebody to replace the employee. In cases of attrition, the employer leaves the vacancy unfilled or eliminates that job role.
This paper is organized as follows. Section II describes related work and the motivation for this analysis. Section III describes the machine learning algorithms used in this paper and their significance. Section IV describes the data set and presents statistical information about it. Section V contains the detailed experimental results of the machine learning algorithms on this data set, followed by the conclusion section.
II. LITERATURE SURVEY
Middle-level officers are more likely to leave, possibly due to some disagreement with their senior officer, as proposed by [3]. The author observed the major factors that influenced employees' departure from the firm. Two rules are derived in that work. A set of questions was put to both parties, and from their answers the author drew conclusions based on workload, objectives, career opportunity, and firm management. Human resource management [4] focuses mainly on termination rates and dismissal rates, although the two differ enormously in substance. The previous model shows that there are several distinct levels of attrition and turnover. Some research indicates that the consequences of dismissal and termination rates appear at the organizational level. Allen & Meyer (1990) [5] described three basic entities for the negative side of turnover.
A supervising officer is more likely to leave the organization because of a conflict with higher management than an employee who is in conflict with his immediate supervisor. The study identified the determining factors that influence employee acceptance [5] without protest from the organization. Two sets of data-gathering techniques were conducted. An equal number of employee and officer respondents were asked to answer a set of questionnaires organized by workload, objectives, identity, career advancement, and organizational management. The results of the two data-gathering methods
demonstrated that the most significant factor contributing to employee rejection [6] is monetary compensation.
III. MACHINE LEARNING ALGORITHMS
Various machine learning techniques are available for learning from given data, which is called training data. When new or unseen data arrives, the learned model analyzes it and predicts the desired class. In our experiment we apply several machine learning algorithms to the HR Analytics data set to predict the chance that an employee quits the job. These algorithms are described below.
A. K-Nearest Neighbour
The k-NN classifier [7] is known as a lazy learner in the machine learning community. It does not learn from the data in advance and does not build a model. Rather, it finds the examples in the training data set that are closest to the unknown example and predicts the class of the new example from these neighbors. The value of ‘k’ determines the number of closest examples selected from the training set.
B. Support Vector Machine
A Support Vector Machine [8] is a classification technique in which the data points are separated by a hyperplane (a line in two dimensions) in the case of a linear SVM, and by a hyperplane in a kernel-induced feature space in the case of a non-linear SVM. The separation is chosen so that the two sides of the hyperplane divide the data set into two classes. When an unknown example arrives, the model predicts which side, and hence which class, it belongs to. The margin between the hyperplane and the support vectors is made as large as possible to reduce the classification error.
C. Naïve Bayes Classifier
Naïve Bayes [9] is a popular classification technique that classifies an example according to the probability that it belongs to each class. It often performs very well on complex data sets that are hard to learn with traditional learning algorithms.
D. Decision Tree
This is one of the popular learning techniques in machine learning. C 4.5 [10] is the benchmark decision tree learning algorithm, against which newly developed algorithms are often compared. Here the C 5.0 [11] learning algorithm is used, which is an advanced version of the traditional decision tree learning algorithms. The internal nodes together with the edges represent a series of conditions, and the leaves are the class labels.
E. Random Forest
It is an ensemble learning technique [12] that consists of several decision trees rather than a single decision tree for classification. During classification, every tree in the random forest assigns a class to the unknown example, and the class with the most votes is assigned to the unknown example.
IV. DATA SET ANALYSIS
The ‘HR Analytics’ data set [13], obtained from the Kaggle website, is used in this paper for experimental verification. This data set comprises ten attributes and 15000 tuples.
Fig. 1. Correlation Matrix
Fig. 2. Histogram of employee status and satisfaction level
The categorical values are converted to numeric values to make the classification algorithms more efficient. For example, the categorical attribute ‘salary’ contains three values: low, medium, and high. It is converted to 0, 1, and 2 respectively. The misspelled attributes are also corrected. Figure (1) represents the correlation matrix, which helps to identify attributes with strong or weak correlation.
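As a minimal sketch of the preprocessing described above, the following Python fragment encodes ‘salary’ as 0, 1, and 2, computes a Pearson correlation matrix, and then classifies a new record by a k-nearest-neighbour majority vote as outlined in Section III-A. The toy records, the column names (satisfaction_level, salary, left), and all helper names are assumptions made for illustration. They are not rows of the Kaggle data set and not the implementation used in this paper.

```python
from collections import Counter
from math import dist, sqrt

# Ordinal codes for 'salary' as stated in the text: low, medium, high -> 0, 1, 2.
SALARY_CODES = {"low": 0, "medium": 1, "high": 2}

# Toy records invented for illustration; they are NOT rows of the Kaggle data set.
# The column names (satisfaction_level, salary, left) are assumed.
RECORDS = [
    {"satisfaction_level": 0.10, "salary": "low", "left": 1},
    {"satisfaction_level": 0.15, "salary": "low", "left": 1},
    {"satisfaction_level": 0.40, "salary": "medium", "left": 1},
    {"satisfaction_level": 0.60, "salary": "medium", "left": 0},
    {"satisfaction_level": 0.75, "salary": "high", "left": 0},
    {"satisfaction_level": 0.90, "salary": "high", "left": 0},
]


def encode(record):
    """Return a copy of the record with 'salary' replaced by its numeric code."""
    encoded = dict(record)
    encoded["salary"] = SALARY_CODES[record["salary"]]
    return encoded


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(rows, columns):
    """Map every (column_a, column_b) pair to its Pearson correlation."""
    series = {c: [row[c] for row in rows] for c in columns}
    return {(a, b): pearson(series[a], series[b]) for a in columns for b in columns}


def knn_predict(rows, features, target, query, k=3):
    """Predict `target` for `query` by majority vote among its k nearest rows."""
    query_point = [query[f] for f in features]
    by_distance = sorted(rows, key=lambda row: dist([row[f] for f in features], query_point))
    votes = Counter(row[target] for row in by_distance[:k])
    return votes.most_common(1)[0][0]


encoded_rows = [encode(r) for r in RECORDS]
corr = correlation_matrix(encoded_rows, ["satisfaction_level", "salary", "left"])
new_employee = encode({"satisfaction_level": 0.12, "salary": "low"})
prediction = knn_predict(encoded_rows, ["satisfaction_level", "salary"], "left", new_employee)
```

In this sketch the salary code spans a wider numeric range than the satisfaction level, so it dominates the Euclidean distance. Rescaling the features before the k-NN vote is a common remedy, but whether this was done in the experiments is not stated here.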
Figure (2) represents the histogram of employee status and satisfaction level [14]. It can be seen from the figure that there are three segments or behaviors.