Proc. of the International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME 2023)
19-21 July 2023, Tenerife, Canary Islands, Spain

EMPLOYEE ATTRITION PREDICTION USING MACHINE LEARNING

Sanidhya Barara
Student, Class XI
Modern School, Barakhamba Road
New Delhi, India
[email protected]

Umang Soni
Assistant Professor
Netaji Subhash University of Technology
New Delhi, India
[email protected]

Abstract— Data mining plays an important role in the internal Human Resource management processes of any firm [1]. These processes include prevention and prediction of employee attrition. This paper presents a machine learning based approach to attrition prediction for individual employees, by training different machine learning models on attrition data. In this paper, machine learning models like the logistic regression model, the linear regression model, the decision tree model, the random forest model, the K nearest neighbors model, the radius nearest neighbors model, the naïve Bayes classifier model and the Bayesian ridge model are trained on data obtained from Kaggle, an online community platform for data scientists and machine learning enthusiasts. The resulting models then predict whether an employee with particular attributes will attrit, and if so, within how many years they can be expected to do so. Prior to training the models, the data is cleaned (outlier removal, feature removal) and scaled, and categorical variables are converted to numerical ones. Then, predictions are carried out and feature importances are found using random forest and decision tree models. The utilities of these models are then compared against each other based on accuracy, precision, recall, f-beta score, kappa score and three self-made metrics. The results show that, surprisingly, the nearest neighbor models outperformed all others by a large margin (possibly due to data scaling being a part of preprocessing), and the logistic regression model was unable to predict attrition very satisfactorily. The results showed that satisfaction_level, average_monthly_hours and last_evaluation_rating are the most important features when predicting attrition or years. The research also shows that it is viable to use traditional ML models to predict the time in which an employee will attrit, and that using the methodology defined in this research on a real dataset will provide useful information to the corporation applying it.

Keywords— Artificial Intelligence, Machine Learning, Employee Attrition

I. INTRODUCTION

The human resource is a key factor in the success of any firm. As such, large amounts of time and effort are dedicated by the firm towards the effective management of this resource [2]. An important aspect of this management is the employee hiring process. Hiring must keep up with the personnel demand of the company to ensure maximum productivity [3]. In order to ensure that this demand is met, however, firms must also take into account the attrition of employees presently working for the firm.

Since attrition is at the most basic level a human decision, it is difficult to predict with absolute certainty. This results in it being one of the greatest issues a company can face, often leading to a non-ideal workforce [4]. However, the analysis of sizeable datasets of employee attrition data has the potential to reveal certain dominating trends in employee attritions [5, 6, 7, 8]. The modus operandi of such an approach is to process attrition data and give weights or importance values to all of the various attributes (i.e. all of the characteristics) of any particular employee's employment, which correspond to the correlation between the value of that attribute and the employee's likelihood to attrit [4, 6, 8-12]. The weights thus obtained can then be used as a foundation for making informed decisions relating to the attributes that most impact employee attrition.

The principles of machine learning provide us with a variety of ways in which to analyze the dataset of attrition data. Different models in machine learning correspond to different ideologies for processing this data. Various such models are of interest to us in this paper. These are the logistic regression model, the linear regression model, the decision tree model, the random forest model, the K nearest neighbors model, the radius nearest neighbors model, the naïve Bayes classifier model and the Bayesian ridge model. All of these models, with the exception of radius nearest neighbors, have been used in prior studies on the same topic and have been noticed to be staples in other prediction-based studies as well. This is because they have been seen to perform generally well at predicting any given target variables. The inclusion of the radius nearest neighbors (RNN) model in this paper is due to data scaling being a step in our preprocessing, which may provide an opportunity for the RNN model to provide good results.

These models may be trained on a given employee attrition dataset to predict not only whether an employee with some given values for the attributes will attrit or not, but also the exact duration for which the employee will stay with the firm in case they do attrit and thus, the number of years before the firm needs to look for a replacement. This may be achieved by means of a straightforward analysis of the correlation between the attributes of the employees in the dataset and the duration for which those who attrit stayed with the firm before doing so.

Furthermore, various metrics may be employed to better understand the effectiveness of these models in predicting employee attrition and give it an objective measure. For the purposes of this paper, the dataset has been subjected to an 85-15 train-test split and graded based on eight different metrics. These metrics are the accuracy, the precision, the f-beta score,

979-8-3503-2297-2/23/$31.00 ©2023 IEEE

Authorized licensed use limited to: Zhejiang University. Downloaded on November 13,2024 at 13:13:22 UTC from IEEE Xplore. Restrictions apply.
the recall and the Cohen kappa score (all five of which are standard reliable metrics generally applied together in multiple studies [13, 14, 15]) and three self-designed metrics, namely FP year difference, TP percentage mean and TP percentage median, of the machine learning models in question.

II. DATASET

The dataset used in this research paper was obtained from Kaggle, an online community platform for data scientists. The dataset is an employee attrition dataset, containing information about employees and the number of years they have already spent at the company, along with the status of their attrition.

The dataset has 2 predivided train and test sets, in an approximately 85-15 train-test split, with the training set containing ~25500 data points, and the test set containing ~4500 data points. The features contained in the dataset are "ID", "satisfaction_level", "last_evaluation_rating", "projects_worked_on", "average_montly_hours", "time_spend_company", "Work_accident", "Over18", "standard_hours", "promotion_last_5years", "Department", "salary", and "Attrition". Of these, "Over18" and "standard_hours" contained only one unique value, "1" and "9" respectively, and "ID" contained a unique numerical value for each data point.

Due to the nature of the field of research, data on a company's employees is required. Such data is highly private and is protected by most companies. As such, it is near impossible to obtain this data from a company, or via extensive surveying of employees of several similar companies. Thus, a dataset from Kaggle, on which the methods applied in our research have not been previously applied, was used for the purposes of our research.

III. RESEARCH METHODOLOGY

The methodology for conducting the research was as follows:

Step 1: Preprocessing.
Step 2: Feature importances are isolated while predicting attrition, and it is found that years worked with company (the second factor to be predicted) is a significantly important one.
Step 3: "time_spend_company" is made a target variable.
Step 4: A classification model is trained on the 9 predictor variables and is asked to predict which employees in the test set will attrit.
Step 5: For the predicted attritions, the regression model is used to predict the number of years of employment after which the attrition will occur, using the 8 predictor variables other than "time_spend_company", which is now the target variable.
Step 6: Each model's scoring metrics are calculated and analyzed.

Elaborating on Step 1, preprocessing the data involves the following steps:

Data Cleanup: Rows which contain null values, or rows in which any feature has an erroneous outlier value, are removed from the dataset entirely. Useless/unimportant features such as Over18, standard_hours, and ID were removed from the dataset.

Categorical variables: Categorical variables are dealt with using one hot encoding for logistic and linear regression, KNN Models, RNN Models and Hybrid Models [7, 16]. Label encoding is used for Decision Tree Models, Random Forest Models and Naïve Bayes models [17].

Data Scaling: The continuous numerical variables are scaled using Python's 'Standard Scaler' in order to achieve better predictions [18].

IV. SCORING MECHANISMS

Eight scoring mechanisms were used to check the effectiveness of the model in predicting attrition. Of these eight, the following six are mechanisms in which a higher value indicates a better model:

A. Accuracy Score

The accuracy score of a machine learning model is the simplest metric to evaluate its performance. To calculate the accuracy score of the model, the actual attritions and the model's predicted attritions are required. Once both these arguments are received, the fraction of correct predictions over the total predictions is calculated, and this provides us with the accuracy score of the model. Generally, an accuracy score of 0.9+ is regarded as excellent, a score of 0.7+ as good, and anything below 0.7 is regarded as poor. The formula for accuracy is as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

• TN: Number of True Negatives
• TP: Number of True Positives
• FN: Number of False Negatives
• FP: Number of False Positives

However, in our case, the accuracy score may not be the best possible scoring metric, as the model's False Positives will decrease the accuracy when they may not indicate any error with the model. For example, it may predict that a person will attrit in 2 more years than are written for them in the dataset, but the accuracy score of the model will be reduced since the accuracy will simply see that while the attrition column says "No" in the database, our model predicted "Yes".

B. Precision Score

The precision score of a machine learning model is yet another easily understood metric for performance evaluation. Again, the actual attritions and the model's predicted attritions are required. However, the precision score of a model represents the fraction of predicted positives (in our research, predicted attritions) which were correct. A precision of 0.85+ is excellent, that of 0.7+ is good, and anything below 0.7 is poor. Precision can be calculated by the formula:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

• TP: Number of True Positives
• FP: Number of False Positives

C. Recall Score

The precision and recall scores are used together as they are complementary. While the precision tells us the fraction of predicted attritions that were correct, the recall tells us the fraction of attritions our model managed to catch/predict. While the same benchmarks for evaluation as precision do apply to recall in most cases, for the purposes of our research we may set the benchmark at 0.9+ for excellent, 0.75+ for good, and consider anything below 0.75 to be poor. The formula for recall is:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

• TP: Number of True Positives
• FN: Number of False Negatives

D. F-Beta Score (Beta=1, 2)

The f-beta score is a metric that features the harmonic mean of precision and recall, i.e., it is a way to analyze precision and recall together and give one single aggregate quality value. While the f1 score is the most used due to its interpretability, the f2 score is utilized in our research to give more importance to recall (as in [19]). The f-beta score of a model may be calculated by setting the value of beta as required (in our case, 2) in the formula below:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}} \tag{4}$$

E. Cohen Kappa Score

The Cohen kappa score of an ML model informs us about the reliability of the model, accounting for the probability of the model predicting attrition correctly by random chance [20]. The kappa score reveals the correlation between the actual and the predicted attritions using the formula:

$$\kappa = \frac{P_{correct} - P_{chance}}{1 - P_{chance}} \tag{5}$$

• Pcorrect = Proportion of test dataset in which the model is correct
• Pchance = Proportion of test dataset in which the model could be expected to be correct by chance

F. FP Year Difference

It is the difference between the predicted years and actual years of employment before attrition in the incorrectly predicted attritions. A non-negative value represents that, on average, more years were predicted for the employee than the years the employee has currently been with the company, which does not indicate an error with the model. A negative value, on the other hand, represents that, on average, the predicted years before attrition are less than the years the employee has already been with the company, meaning that the model has made an erroneous prediction. For this scoring mechanism, magnitude is not of importance, only the sign is.

G. TP Percentage Mean

It is the mean of the absolute values of the percentage differences between the predicted years and actual years of employment before attrition in the correctly predicted attritions. A lower value of the mean represents a lower average percentage difference in predictions of years, meaning a better model.

H. TP Percentage Median

It is the median of the absolute values of the percentage differences between the predicted years and actual years of employment before attrition in the correctly predicted attritions. A lower value of the median represents a lower percentage difference in most predictions of years, meaning a better model.

V. MACHINE LEARNING MODELS

A. Logistic Regression

The Logistic Regression Model (proposed in 1958 [21]) provides the likelihood that Attrition (the target variable) is class 1 (i.e., the employee will attrit) from a linear sum of the predictor variables, in the formula:

$$t = \omega_0 + \omega_1 X_1 + \omega_2 X_2 + \cdots + \omega_p X_p \tag{6}$$

$$P(E = 1) = \frac{e^{t}}{e^{t} + 1} \tag{7}$$

Where $X_1, X_2, \ldots, X_p$ are the different variables, the omegas are the respective coefficients, and $P(E = 1)$ is the probability that the data point E is of class 1. Also, in order to optimize our task, we utilize the log-likelihood loss function:

$$-\frac{1}{m} \sum_{i=1}^{m} \left( y_i \log(P_i) + (1 - y_i) \log(1 - P_i) \right) \tag{8}$$

Where m is the number of samples in the training data, $y_i$ is the label of the i-th sample, and $P_i$ is the prediction value of the i-th sample.

Applying this model to our dataset yields results as follows. The model assigns weights to the various attributes in the dataset as can be seen in Fig. 1.

Fig. 1. Feature Importances from Logistic Regression Model
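The logistic model of Eqs. (6)-(8) and the classification metrics of Section IV can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names, the 0/1 label convention (1 meaning attrition) and the clipping constant `eps` are all assumptions made for the example.

```python
import math


def attrition_probability(weights, bias, features):
    """Eqs. (6)-(7): linear score t, then P(E = 1) = e^t / (e^t + 1)."""
    t = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-t))  # algebraically equal to e^t / (e^t + 1)


def log_loss(labels, probabilities, eps=1e-12):
    """Eq. (8): negative mean log-likelihood over the m samples."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for 0/1 labels, where 1 means attrition."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def f_beta(precision, recall, beta):
    """Eq. (4); returns 0.0 when precision and recall are both zero."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom


def cohen_kappa(tp, tn, fp, fn):
    """Eq. (5), with P_chance taken from the row and column totals.

    Undefined (division by zero) when P_chance equals 1.
    """
    n = tp + tn + fp + fn
    p_correct = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (p_correct - p_chance) / (1 - p_chance)
```

In use, each probability would be thresholded (for example at 0.5) to obtain a predicted label, and the resulting counts passed to `f_beta` with `beta=2` to weight recall more heavily, as the paper does.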

Therefore, according to the Logistic Regression Model, satisfaction_level, salary_high and work_accident are the most important attributes for an employee to stay, and salary_low, department_hr and salary_medium are most important for an employee to attrit.

Fig. 2. Logistic Regression Confusion Matrix

Owing to the low value of the Kappa score, we can say that this model is not reliable.

B. Linear Regression

The Linear Regression model predicts the years of employment of the person before attrition using the traditional least squares method, i.e., finding the line of best fit through the data such that the sum of the squared residuals of the points is minimized. It can be said to be solving a problem of the form

$$\min_{\omega} \lVert X\omega - y \rVert_2^2 \tag{9}$$

It can be represented mathematically in the same linear equation form as the logistic regression model (See [22] for more).

The following results are gained from the Linear Regression model:

Fig. 3. Feature Importances from Linear Regression Model

Therefore, linear regression predicts that salary and department are most important while predicting years of employment before attrition.

As can be seen by the scoring metrics, the Linear Regression model is acceptable at predicting years of employment before attrition.

C. Decision Tree Model

The Decision Tree model is a non-parametric, supervised machine learning method which has been used for classification and regression [23]. We have created a model that predicts our target variables by learning simple decision rules inferred from the data features. It decides which attribute provides the most information using the formula:

$$\text{Information Gain}(S, a) = \text{Entropy}(S) - \sum_{v \in \text{values}(a)} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v) \tag{10}$$

Where:
• a represents a specific attribute or class label
• Entropy(S) is the entropy of dataset S
• $|S_v|/|S|$ represents the proportion of the values in $S_v$ to the number of values in dataset S
• Entropy($S_v$) is the entropy of dataset $S_v$

Entropy is calculated by:

$$\text{Entropy}(S) = -\sum_{c \in C} p(c) \log_2 p(c) \tag{11}$$

The feature with greater information gain is more important. Based on the feature importances, it splits the data at nodes, using basic True/False structures to arrive at a decision for attrition and number of years.

Applying this model to our dataset yields results as follows. The model can also be used to find out feature importances [4, 24] as in Fig. 4 and Fig. 5.

Fig. 4. Feature Importances for Attrition Prediction from Decision Tree Model
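As a minimal sketch of the information gain and entropy formulas in Eqs. (10) and (11), the following plain-Python functions score one categorical attribute. The row format (a list of dicts), the function names and the use of base-2 logarithms are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Eq. (11): -sum over classes c of p(c) * log2 p(c)."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(rows, labels, attribute):
    """Eq. (10): Entropy(S) minus the size-weighted entropy of each subset S_v.

    rows is a list of dicts (one per employee); attribute is the key to split on.
    """
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted
```

An attribute whose values separate the attrition labels perfectly attains the full entropy of the labels as its gain, while an attribute that takes a single value for every row has a gain of zero, which is why constant columns such as "Over18" carry no information for a tree.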

Fig. 5. Feature Importances for Duration Prediction from Decision Tree Model

Therefore, according to Decision Tree, satisfaction_level, projects_worked_on and last_evaluation_rating are the most important attributes when predicting whether an employee will attrit, and if they will, then satisfaction_level, average_monthly_hours and last_evaluation_rating are important when predicting the number of years within which they can be expected to do so.

Fig. 6. Decision Tree Confusion Matrix

As is clear from the scoring metrics and the confusion matrix, Decision Tree models are excellent at predicting both attrition and years.

D. Random Forest Model

In random forests, the trees in the ensemble are constructed using a bootstrap sample (see the paper by Breiman [25]). Also, the best split is found from all of the input features. This randomness serves to decrease the variance of the forest estimator. Individual decision trees often show high variance and overfit. The randomness in the forest yields decision trees with somewhat decoupled prediction errors. By averaging all these predictions, some of the errors may cancel out. The random forest model helps achieve greatly decreased variance by combining many different trees. However, because of this, it may also result in a slight increase in bias. But due to the significance of the variance reduction, the result is still an overall better model.

Applying this model to our dataset yields results as follows. The model can also be used to find out feature importances as in Fig. 7 and Fig. 8.

Fig. 7. Feature Importances Predicting Attrition with Random Forest Model

Fig. 8. Feature Importances for Duration Prediction from Random Forest Model

Therefore, according to Random Forest, satisfaction_level, average_monthly_hours and last_evaluation_rating are the most important attributes when predicting whether or not an employee will attrit, and if they will, then the number of years within which they can be expected to do so.

Fig. 9. Random Forest Confusion Matrix
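The bootstrap-and-combine idea behind the random forest can be illustrated with a deliberately small sketch. Here a one-threshold "stump" on a single numeric feature stands in for a full decision tree, and the ensemble predicts by majority vote. All names, the single-feature setup and the rule "predict attrition when the feature is below the threshold" are assumptions chosen for brevity, not the model used in the paper.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(sample):
    """Pick the threshold that classifies the most sampled rows correctly.

    sample is a list of (x, label) pairs with label 0 or 1.
    The stump predicts 1 when x is below the threshold, else 0.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted({x for x, _ in sample}):
        correct = sum(1 for x, y in sample if (1 if x < threshold else 0) == y)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def fit_forest(rows, n_trees, seed=0):
    """Fit one stump per bootstrap sample; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_trees)]


def forest_predict(thresholds, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(1 if x < t else 0 for t in thresholds)
    return votes.most_common(1)[0][0]
```

Each stump sees a slightly different resample of the data, so individual thresholds vary, but the vote over many of them is more stable than any single one. A full implementation would grow complete trees rather than stumps.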

As is clear from the scoring metrics and the confusion matrix, Random Forest models are excellent at predicting both attritions and years.

E. Nearest Neighbors Models

Nearest neighbors is an ML model which simply remembers the data given by the user, and then, based on the points that are close by, classifies/predicts the target values. Two types of nearest neighbors models have been utilized in this research, the K nearest neighbors model (KNN), and the radius nearest neighbors model (RNN).

The KNN model simply finds the K (integer value specified for our research purposes to be 1) nearest points and (a) classifies an employee to attrit or not attrit based on a majority vote and (b) predicts the number of years by averaging the 2 nearest points (See [18] and [26] for more).

Fig. 10. KNN Confusion Matrix

As can be seen in Fig. 10, KNN is capable of predicting both types of attrition with great accuracy. Also, based on the mean and medians of predictions of years, we can see that it is the best model.

The RNN model chooses all points within a radius (float value specified for our research purposes to be 0.5 for classification and 100 for regression) and then (a) predicts attrition based on majority vote of points within the radius and (b) predicts number of years based on the average of points within the radius (See [27, 28, 29] for more).

Fig. 11. RNN Confusion Matrix

As seen from Fig. 11, RNN lags behind KNN while predicting attrition and years. However, it still provides excellent results for attrition and acceptable results for years.

F. Naïve Bayes Classifier

The naïve Bayes classifier, as the name indicates, works only for classification and as such can only be used to predict attrition. It assumes independence among the variables, i.e., the value in one of the variables does not depend on the values in any of the rest of the variables. This is a bad assumption to make in our scenario, since, for example, salary and satisfaction index may be inter-dependent. However, as the model deals well with large datasets, it has been included in the scope of the research (See [26] and [30] for more). It works off of the Bayes algorithm:

$$P(c \mid x) = \frac{P(x \mid c) \, P(c)}{P(x)} \tag{12}$$

$$P(c \mid X) = P(x_1 \mid c) \times \cdots \times P(x_n \mid c) \times P(c) \tag{13}$$

Above,
• P(c|x) is the posterior probability of target class c given the predictor attribute x.
• P(c) is the prior probability of the target.
• P(x|c) is the probability of the predictor given the class.
• P(x) is the prior probability of the attribute.

Fig. 12. Naïve Bayes Confusion Matrix

As can be seen in Fig. 12, the Naïve Bayes Classifier is lackluster in predicting attritions, as was expected.

G. Bayesian Ridge Model

Bayesian Ridge regression formulates linear regressions using probability distributions rather than point estimates. The prediction is assumed to be drawn from a probability distribution rather than estimated as a single value. To obtain a fully probabilistic model, the output y is assumed to be Gaussian distributed around Xω (See [32, 33] for more):

$$p(y \mid X, \omega, \alpha) = \mathcal{N}(y \mid X\omega, \alpha) \tag{14}$$

where α is again treated as a random variable that is to be estimated from the data. The prior for the coefficient ω is given by a spherical Gaussian:

$$p(\omega \mid \lambda) = \mathcal{N}(\omega \mid 0, \lambda^{-1} I_p) \tag{15}$$
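To make the Gaussian likelihood and prior of Eqs. (14) and (15) concrete, the sketch below computes the posterior over a single coefficient in closed form. It rests on several simplifying assumptions that are not from the paper: one feature, no intercept, α interpreted as the noise precision, and both α and λ held fixed rather than estimated from the data as the model described above would do.

```python
def posterior_weight(xs, ys, alpha, lam):
    """Posterior mean and variance of one weight w (single feature, no intercept).

    Assumed model: y_i ~ N(x_i * w, 1 / alpha) and prior w ~ N(0, 1 / lam).
    Completing the square in w gives a Gaussian posterior with
        precision = lam + alpha * sum(x_i^2)
        mean      = alpha * sum(x_i * y_i) / precision
    """
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    precision = lam + alpha * sxx
    mean = alpha * sxy / precision
    return mean, 1.0 / precision
```

With `lam` set to 0 the posterior mean coincides with the ordinary least squares slope of Eq. (9), and increasing `lam` shrinks the estimate toward zero. The returned variance is what distinguishes this from a point estimate: it quantifies how uncertain the coefficient remains after seeing the data.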

H. Hybrid Model

Looking at the results for the models, the KNN and RNN models were best at predicting attrition and years. Thus, we used all models except Random Forest, KNN, and RNN together to form a hybrid model (See [31]), using 2 independent ensembles with 5 of each type of model (15 models total) in order to predict attritions and then years.

Fig. 13. Hybrid Model Confusion Matrix

While these results are not bad, speaking with respect to the other models, they are not stellar either.

VI. RESULTS

As can be seen from Table 1 and Table 2, the K Nearest Neighbors model has clearly provided the best results in this research, showing the best values amongst all the models in each scoring metric.

TABLE I. SCORING METRICS FOR EACH MODEL WHEN PREDICTING ATTRITION

Model     Accuracy  Precision  Recall  F1     F2     Cohen Kappa
Log Reg   0.823     0.712      0.560   0.627  0.585  0.513
Lin Reg   NA        NA         NA      NA     NA     NA
DT        0.965     0.958      0.906   0.931  0.916  0.908
RF        0.970     0.992      0.894   0.940  0.912  0.920
KNN       0.999     0.999      0.996   0.998  0.997  0.997
RNN       0.992     0.979      0.993   0.986  0.990  0.980
NB        0.904     0.859      0.760   0.807  0.778  0.743
BRR       NA        NA         NA      NA     NA     NA
Hybrid    0.926     0.945      0.765   0.845  0.795  0.797

In terms of prediction of attrition, the Nearest Neighbors models have performed the best, followed closely by the tree based models, and then the hybrid model followed by the Naïve Bayes model, and then the Logistic Regression model. While the latter two do seem to show acceptable results in the other metrics, the lower values of their Cohen Kappa scores are discouraging. The Logistic Regression model surprisingly underperformed in this situation, being the worst in every metric with respect to the other models.

TABLE II. SCORING METRICS FOR EACH MODEL WHEN PREDICTING YEARS

Model     FP Year Difference  TP Percentage Mean  TP Percentage Median
Log Reg   NA                  NA                  NA
Lin Reg   -1.05               0.055               0.028
DT        0.213               0.053               0.0
RF        0.275               0.046               0.005
KNN       0.0                 0.0                 0.0
RNN       -0.055              0.178               0.209
NB        NA                  NA                  NA
BRR       0.017               0.046               0.018
Hybrid    -0.096              0.035               0.012

As for prediction of years, all the models seem to have performed well, except for the Linear Regression, Radius Nearest Neighbors, and Hybrid models, all of which have negative values of FP Year Difference. Surprisingly, the RNN model, which performed so excellently when predicting attrition, was unable to deliver a similar performance when predicting years, instead showing the worst metrics of all the models.

VII. CONCLUSION

This research aimed at training several machine learning models to predict attritions among company employees, then predicting the total number of years of employment with the company before attrition for the predicted attritions, and finding the best model. With this purpose in mind, 6 different models were trained on the same training set and subsequently tested on the same test set. Analysis of the results acquired from these models shows that the logistic regression model's performance was surprisingly subpar compared to the other models and both the nearest neighbors models used performed the best. After this, a 7th hybrid model consisting of all models except nearest neighbors and Random Forest was trained and tested, but it failed to produce results better than the nearest neighbors models.

The models were also used to show the feature importances to predict the results, and while decision tree and random forest were largely in agreement about satisfaction_level, average_monthly_hours and last_evaluation_rating being the most important features to predict anything, the logistic and linear regression models showed strong contrast to this and indicated that satisfaction_level, salary_high and salary_low are most important when predicting attrition, and salary and department are most important when predicting years. However, in light of the relative performances of the three models, the results from decision tree and random forest models are given preference over those from logistic and linear regression.

This research article develops upon previous studies in the field of employee attrition prediction, presenting not only an alternative way of prediction using Nearest Neighbour models and preprocessing steps, but also showing that methodologies/classifiers used previously to predict only the

attrition status of employees can, with slight modification and utilisation of regressors instead of classifiers, also be used to predict the years in which the employee will attrit, providing further invaluable information to the employers.

As compared to similar studies carried out in previous years, the paper reveals a contrast in the best performing model. While the random forest model, or sometimes the logistic regression model, is generally expected to perform best, with the indicated preprocessing of the data, the nearest neighbour models can also be utilized to their full potential. While the result that satisfaction level is an important predictor is in line with some other studies [34, 35], it is a generally uncommon one as can be seen in [36, 37, 38, 39]. In comparison to previous papers, a new target variable of "time_spend_company" (years the employee spends with the company before attriting) has been introduced.

While the study does propose a new prediction methodology and target variable, the research is put at a disadvantage due to a lack of extensive real world data on the topic. However, the results so obtained do show that the additional, extremely useful variable of years before attrition can also be predicted using traditional ML models (meaning that using the proposed methods on a real dataset can yield information crucial to a corporation), building upon prior studies on the topic of prediction in attrition.

In comparison to previous works, this paper presents a comparative analysis of some of the most used machine learning models at predicting attritions, complete with preprocessing methodologies, and also checks if these models can be used to predict the years of employment before attrition. The feature selection is done by hand due to the limited number of features; most are kept, and those which include only 1 unique value or which are wholly irrelevant to the research are removed. Random Forest and Decision Tree are used to find feature importances. Similar scoring metrics to the new ones introduced in this paper may be developed for future research. For future research into attrition, machine learning on a real dataset consisting of a vast variety of variables including some which are based on more psychological factors of attrition and some others like their

[5] V. Nagadevara, V. Srinivasan, R. Valk: Establishing a link between employee turnover and withdrawal behaviours: application of data mining techniques. Res. Pract. Hum. Resour. Manag. 16, 81–97 (2008)
[6] K. Suceendran, R. Saravanan, D.S. Divya Ananthram, R.K. Kumar, K. Sarukesi: Applying classifier algorithms to organizational memory to build an attrition predictor model
[7] R. Punnoose, P. Ajit: Prediction of employee turnover in organizations using machine learning algorithms. Int. J. Adv. Res. Artif. Intell. 5, 22–26 (2016)
[8] E. Sikaroudi, A. Mohammad, R. Ghousi, A. Sikaroudi: A data mining approach to employee turnover prediction (case study: Arak automotive parts manufacturing). J. Ind. Syst. Eng. 8, 106–121 (2015)
[9] Q.A. Al-Radaideh, E. Al Nagi: Using data mining techniques to build a classification model for predicting employees performance. Int. J. Adv. Comput. Sci. Appl. 3, 144–151 (2012)
[10] H.Y. Chang: Employee turnover: a novel prediction solution with effective feature selection. WSEAS Trans. Inf. Sci. Appl. 6, 417–426 (2009)
[11] C.F. Chien, L.F. Chen: Data mining to improve personnel selection and enhance human capital: a case study in high-technology industry. Expert Syst. Appl. 34, 280–290 (2008)
[12] R.S. Sexton, S. McMurtrey, J.O. Michalopoulos, A.M. Smith: Employee turnover: a neural network solution. Comput. Oper. Res. 32, 2635–2651 (2005)
[13] K. Ranveer, S. B. Dwivedi, and S. Gaur. "A comparative study of machine learning and Fuzzy-AHP technique to groundwater potential mapping in the data-scarce region." Computers & Geosciences 155 (2021)
[14] Rustam, Furqan, et al. "Wireless capsule endoscopy bleeding images classification using CNN based model." IEEE Access 9 (2021)
[15] Sors, Arnaud, et al. "A convolutional neural network for sleep stage scoring from raw single-channel EEG." Biomedical Signal Processing and Control 42 (2018)
[16] A. Quinn, J.R. Rycraft, D. Schoech: Building a model to predict caseworker and supervisor turnover using a neural network and logistic regression. J. Technol. Hum. Serv. 19, 65–85 (2002)
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
[18] J. Friedman, T. Hastie, R. Tibshirani: The elements of statistical learning. Springer, New York (2001)
[19] B. Prasetiyo, M. A. Muslim, and N. Baroroh. "Evaluation performance of the recall and F2 score of credit card fraud detection unbalanced dataset using SMOTE oversampling technique." Journal of Physics:
compensation ratio inside their salary bracket and inside their Conference Series. Vol. 1918. No. 4. IOP Publishing, 2021
line of work in their country. When carrying out the research [20] Sun, Shuyan. "Meta-analysis of Cohen’s kappa." Health Services and
on the larger dataset with a greater variety of features, the Outcomes Research Methodology 11 (2011)
feature selection can also be improved by the m-max-out [21] H. Zhang: The optimality of naive Bayes. AA, 1, 3
method, and the results can be improved by generating
[22] Su, Xiaogang, Xin Yan, and Chih‐Ling Tsai. "Linear regression."
smaller bootstrap datasets from the parent dataset, as Wiley Interdisciplinary Reviews: Computational Statistics 4.3 (2012)
described in [37].
[23] J.N. Morgan, J.A. Sonquist: Problems in the analysis of survey data,
REFERENCES and a proposal. J. Am. Stat. Assoc. 58, 415–434 (1963)

[1] J. Ranjan, D. P. Goyal, and S. I. Ahson. "Data mining techniques for [24] H. Jantan, A.R. Hamdan, Z.A. Othman: Human talent prediction in
better decisions in human resource management systems." HRM using C4. 5 classification algorithm. Int. J. Comput. Sci. Eng. 2,
International Journal of Business Information Systems 3.5 (2008) 2526–2534 (2010)

[2] B.L. Das, and Mukulesh Baruah. "Employee retention: A review of [25] L. Breiman: Random forests. Mach. Learn. 45, 5–32 (2001) Cox, D.R.:
literature." Journal of business and management 14.2 (2013) The regression analysis of binary sequences. J. Roy. Stat. Soc. B. Met.,
215–242 (1958)
[3] D. A. B. A. Alao, and A. B. Adeyemo. "Analyzing employee attrition
using decision tree algorithms." Computing, Information Systems, [26] K.P. Murphy: Machine learning: a probabilistic perspective. MIT
Development Informatics and Allied Research Journal 4.1 (2013) press, Cambridge (2012)

[4] D. Alao, A.B. Adeyemo: Analyzing employee attrition using decision [27] Scikit-Learn User Manual. Available online: https://scikit-
tree algorithms. Comput. Inf. Syst. Dev. Inform. Allied Res. J. 4 (2013) learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighbo
rsClassifier.html#sklearn.neighbors.RadiusNeighborsClassifier

[28] Scikit-Learn User Manual. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsRegressor.html#sklearn.neighbors.RadiusNeighborsRegressor
[29] Gursoy, Mehmet Emre, et al. "Differentially private nearest neighbor classification." Data Mining and Knowledge Discovery 31 (2017)
[30] A. Géron: Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O’Reilly Media (2017)
[31] Yi Tan, and P.P. Shenoy. "A bias-variance based heuristic for constructing a hybrid logistic regression-naïve Bayes model for classification." International Journal of Approximate Reasoning 117 (2020)
[32] McDonald, C. Gary. "Ridge regression." Wiley Interdisciplinary Reviews: Computational Statistics 1.1 (2009)
[33] Price, Bertram. "Ridge regression: Application to nonexperimental data." Psychological Bulletin 84.4 (1977)
[34] R. S. Kamath, S. S. Jamsandekar, P. G. Naik. "Machine Learning Approach for Employee Attrition Analysis." International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Special Issue: Fostering Innovation, Integration and Inclusion Through Interdisciplinary Practices in Management, March 2019, pp. 62–67
[35] .R. Srivastava, and P. Eachempati. "Intelligent employee retention system for attrition rate analysis and churn prediction: An ensemble machine learning and multi-criteria decision-making approach." Journal of Global Information Management (JGIM) 29, no. 6 (2021): 1-29.
[36] Yahia, Nesrine Ben, Jihen Hlel, and Ricardo Colomo-Palacios. "From big data to deep data to support people analytics for employee attrition prediction." IEEE Access 9 (2021): 60447-60458.
[37] S. Najafi-Zangeneh; N. Shams-Gharneh; A. Arjomandi-Nezhad; S. Hashemkhani Zolfani: An Improved Machine Learning-Based Employees Attrition Prediction Framework with Emphasis on Feature Selection. Mathematics 2021, 9, 1226.
[38] F. Fallucchi, C. Marco, R. Giuliano, and E.W.D. Luca. "Predicting employee attrition using machine learning techniques." Computers 9, no. 4 (2020): 86.
[39] N. El-Rayes, F. Ming, Michael Smith, and S. M. Taylor. "Predicting employee attrition using tree-based models." International Journal of Organizational Analysis (2020)


