
From Big Data To Deep Data To Support People Analytics For Employee Attrition Prediction


Received March 29, 2021, accepted April 13, 2021, date of publication April 20, 2021, date of current version April 27, 2021.


Digital Object Identifier 10.1109/ACCESS.2021.3074559

From Big Data to Deep Data to Support People Analytics for Employee Attrition Prediction
NESRINE BEN YAHIA 1, JIHEN HLEL 1, AND RICARDO COLOMO-PALACIOS 2 (Senior Member, IEEE)
1 RIADI Laboratory, National School of Computer Sciences, University of Manouba, Manouba 2010, Tunisia
2 Computer Science Department, Østfold University College, 1783 Halden, Norway
Corresponding author: Ricardo Colomo-Palacios ([email protected])
This work was supported in part by the ERASMUS+ KA2 Project “Information Technology Governance for Tunisian Universities” under Grant 561614-EPP-1-2015-1-ES-EPPKA2-CBHE-JP.

ABSTRACT In the era of data science and big data analytics, people analytics help organizations and their
human resources (HR) managers to reduce attrition by changing the way of attracting and retaining talent.
In this context, employee attrition presents a critical problem and a big risk for organizations as it affects
not only their productivity but also their planning continuity. The salient contributions of
this research are as follows. Firstly, we propose a people analytics approach to predict employee attrition
that shifts from a big data to a deep data context by focusing on data quality instead of its quantity. In fact,
this deep data-driven approach is based on a mixed method to construct a relevant employee attrition model
in order to identify key employee features influencing his/her attrition. In this method, we started thinking
‘big’ by collecting most of the common features from the literature (an exploratory research) then we tried
thinking ‘deep’ by filtering and selecting the most important features using survey and feature selection
algorithms (a quantitative method). Secondly, this attrition prediction approach is based on machine, deep
and ensemble learning models and is evaluated on a large and a medium-sized simulated human
resources dataset and then on a real small dataset built from a total of 450 responses. Our approach achieves
higher accuracy (0.96, 0.98 and 0.99 respectively) for the three datasets when compared with previous solutions.
Finally, while rewards and payments are generally considered as the most important keys to retention, our
findings indicate that ‘business travel’, which is less common in the literature, is the leading motivator for
employees and must be considered within HR retention policies.

INDEX TERMS Deep people analytics, employee attrition, retention, prediction, interpretation, policies
recommendation.

The associate editor coordinating the review of this manuscript and approving it for publication was Weiping Ding.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 9, 2021 60447
N. B. Yahia et al.: From Big Data to Deep Data to Support People Analytics for Employee Attrition Prediction

I. INTRODUCTION

Employee attrition or voluntary turnover presents a key issue for organizations as it affects not only their productivity and work sustainability but also their long term growth strategies [1]. On this path, employee retention is a major challenge for recruiters and employers alike, since employee attrition means not only the loss of skills, experiences and personnel but also the loss of business opportunities [2]. In the era of Big Data, people analytics help organizations and their human resources (HR) managers to reduce attrition by changing the way of attracting and retaining talent [3]. In this context, HR analytics is considered as a ‘must have’ capability for the HR management and profession and “a tool for creating value from people and a pathway to broadening the strategic influence of the HR function” [4]. So, it represents the quantification and the systematic identification of the people drivers of the business outcomes with the purpose of making better decisions. There are interchangeable terms used for HR analytics that are talent analytics, people analytics, and workforce analytics [5]. Thanks to people analytics, HR managers gain the ability to understand their departments and their employees, by providing more accessible and interpretable data about employee attributes, performance and behaviours [6]. Thus, HR analytics plays a significant role in every aspect of the HR function in organizations including recruiting, training and development, retention, engagement and compensation. In the context of HR Analytics, employee attrition analysis has caught more and more attention in the business world. In fact, how to use analytic methods to predict
whether employees will leave or not can help the organization improve the HR management and save the cost on it [4]. Therefore, for HR managers, it is crucial to have a better idea of what kind of employees will tend to leave and what kind of features will influence them to leave [6]. Most commonly, organizations desire to make sure the right employees are in the right place at the right time and to identify employees’ intention to leave by means of analytics [7].

Descriptive analytics are used to summarize or turn data into relevant information in order to investigate what has occurred. In other words, descriptive analytics have some meaningful impact by explaining what has already happened; however, they are not very helpful in predicting what will happen or may happen in the future. On the contrary, predictive analytics have been proposed and used to forecast what will happen in the future. In the field of HR, predictive analytics lead to the achievement of organizational benefits and surely help in better decision-making in the organization without any bias, especially with the most prosperous trend of the big data era and data science based on machine and deep learning techniques [8]. In fact, data is considered as one of the mandatory ingredients that a people analytics team requires to be effective [9]. Otherwise, HR is set to fail in handling Big Data challenges since Big Data focuses on capturing every piece of available information and collecting every suitable and unsuitable data. But, in the HR analytics context, the issue must move from the size of the data to its smartness and making better use of data to create and capture value, being a necessary prerequisite to the more advanced forms of big data analysis [4]. Additionally, [10] highlighted the limits of the application of Big Data within a contextual HR case study, whilst also noting the need to shift the focus from a quantitative to a qualitative analysis of HR data. In this context, the concept of deep data was born to deal with collecting only relevant and specific information and excluding information that might be unusable or otherwise redundant [11].

Thus, in this paper, we mainly focus on two dimensions: a functional dimension and a data dimension. From a functional dimension, we aim to test, compare and select the most accurate predictive model that can detect employee attrition early. We also aim to interpret the positive attrition to find the reasons behind it and so to support HR managers in building a retention plan. From a data dimension, the key property of the proposed approach, we aim to shift from big data to deep data to address data issues that organizations may face when implementing HR analytics.

Big data is a label commonly used to identify large volumes of (structured or unstructured) data that can generally be defined with the help of the 3Vs: volume, velocity, and variety. Volume refers to the quantity of data that are produced by various sources such as sensors, social media, business transactions, etc. Velocity represents the speed at which data are produced, and variety refers to the different formats of data. Over the last decade, the exploitation of big data has become very popular among organizations, which tend to adopt new data-driven strategic decision-making models and especially big data analytics across different HR functions [12]. One of the main challenges of using analytics in HR is the deficiency of empirical data. In fact, a lack of sufficient empirical data can concern both the number of candidates or samples and the number of features, and such a small dataset fails to adequately train a reliable model. Hence, organizations that plan to use HR analytics first have to face the data availability challenge and they must be able to produce very large volumes of data [13]. Consequently, organizations need large-scale storage solutions that tend to be cloud-based and which require high costs. Moreover, small organizations may not have high-quality HR data and may lack the analytical capabilities to adapt techniques designed for big data to areas where the volume of data is quite small (small data). In this context, the main challenge is the quality of data, where organizations must know exactly which data they need to support their HR analytics functions, as HR managers may not need all the data they collect. From this point of view, the volume of data is not very important, as what matters in this context is the value of data. Importantly, the identification of deep data, high-quality data that focuses on specific predictive trends, is a major barrier to the use of HR analytics for some organizations. So, the main objective of our approach is to shift from a big data to a deep data perspective and to pare down the massive amount of data by excluding useless or duplicate information.

Thereby, in this paper, we aim to propose a deep data-driven predictive approach that can detect and predict employee intention to leave early. Compared with related works, this approach focuses on small information-rich HR data within big data. In fact, recent related works such as [14]–[25] and [26] commonly focus on finding the best predictive models with high performances to predict employee attrition, generally using benchmarks and simulated open data such as the HR IBM¹ and HR Kaggle² datasets. But, in this paper, we argue that apart from models’ performances, the HR data must be well constructed and filtered to give relevant and rapid prediction without biases.

¹ https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
² https://github.com/ryankarlos/Human-Resource-Analytics-Kaggle-Dataset

This deep data-driven approach is based on small data providing the greatest business value at a lower cost than vast volumes of big data with regard to the factors that really impact employee attrition. Thus, the main goals of this research are to: 1) create an effective employee attrition model that contains the necessary and sufficient factors for early detection of attrition intent by deploying a mixed method based on exploratory as well as quantitative analyses, 2) build decision models to predict attrition using Machine, Ensemble and Deep Learning techniques (ML, EL and DL), 3) make interpretations to explain and identify the exact reasons behind employee attrition, and 4) make
recommendations to fight this possible attrition and to take necessary HR management policies.

The outline of this paper is as follows: In the second section, we will present an overview of related works. The research methodology conducted in this research to collect data for our study and to design our final employee attrition model will be presented in the third section. In the fourth section, we will present our approach and the various intelligent and predictive models proposed in order to predict employee attrition as soon as possible. The fifth section will show the experimental results as well as the findings of this research, i.e. interpretation of the results to understand what makes an employee quit. Finally, we conclude and present an outlook on future works.

II. RELATED WORKS

Literature reports several employee attrition and voluntary turnover predictive models. In this study, we particularly consider recent works that are based on machine and deep learning models applied to the simulated HR datasets of IBM and Kaggle, e.g. [14]–[25] and [26]. This choice is motivated by the existence of experimental results on predictive models’ accuracy for these open datasets, so we can compare them with our proposed models.

The IBM HR simulated dataset is a medium-sized dataset provided by IBM. It contains 1470 samples with 34 input features (Age, Business Travel, Daily Rate, Department, Distance From Home, Education, Education Field, Employee Count, Employee Number, Environment Satisfaction, Gender, Hourly Rate, Job Involvement, Job Level, Job Role, Job Satisfaction, Marital Status, Monthly Income, Monthly Rate, Num Companies Worked, Over18, Over Time, Percent Salary Hike, Performance Rating, Relationship Satisfaction, Standard Hours, Stock Option Level, Total Working Years, Training Times Last Year, Work Life Balance, Years At Company, Years In Current Role, Years Since Last Promotion, Years With Current Manager) and its target variable is attrition, represented as “No” (employee did not leave) or “Yes” (employee left).

The Kaggle HR dataset is a large-sized dataset supplied by Kaggle that contains 15000 samples, where the target variable is “left” and the 9 features are satisfaction level; last evaluation; number project; average monthly hours; time spend company; Work accident; promotion last 5 years; sales and Salary.

In Table 1, we present an overview of recent solutions to predict employee turnover. For each solution, the datasets used and the proposed models are presented, such as Support Vector Machine (SVM), Decision Tree (DT), Logistic Regression (LR), Random Forest (RF), XGBoost (XGB) and K Nearest Neighbors (KNN).

TABLE 1. Recent related works.

While these solutions proposed accurate predictive models to predict employee attrition, they suffer from two major criticisms: 1) there are no deep studies of the employee features selected and used to predict the attrition that justify the choice of the features. 2) They generally focus only on employee attrition prediction; however, for an HR manager it is important not only to predict an employee’s intention to leave as soon as possible but also to interpret and explain why the employee has this intention to leave.

III. A MIXED METHOD FOR EMPLOYEE ATTRITION MODELING

As employee attrition or voluntary turnover is a non-avoidable phenomenon, modelling it is a key issue for the process of attrition prediction. In addition, as we aim to adopt a deep data-driven approach, a research methodology that allows us to match theoretical models and experiments must be adopted. That’s why we propose to conduct a mixed research method based on the combination of an exploratory research and a quantitative method where the aim is to understand and explain employee attrition phenomena. These two combined methods are used sequentially (e.g., findings from one method inform the other). Thus, such a combined method can leverage the strengths and weaknesses of exploratory and quantitative methods and offer greater insights on a phenomenon that each of these methods individually cannot offer.

In fact, in order to gain a deeper understanding of the phenomenon of high attrition and to identify the factors behind it, an exploratory study based on reviewing available literature is firstly established in detail using studies, papers and open datasets provided by HR experts and researchers. Secondly, these collected features are compared with causal factors for attrition identified through a questionnaire and feature selection techniques (a quantitative research method).

The architecture of the research methodology conducted in this study is depicted in Fig. 1. We will explain in the following sections the different steps of the proposed mixed method.

FIGURE 1. Our mixed method for employee attrition modelling.

A. FEATURES COLLECTION: EXPLORATORY STUDY

The first step in this study was to identify and collect employee features that are suitable for our analysis. This step is carried out using an exploratory research of factors responsible for employee turnover that relies on secondary research by reviewing available literature. Thus, the exploratory research method, conducted in the first step, helped us identify and collect adequate and impactful features for our problem that are most commonly used in different related works and research in the available literature. In fact, through this exploratory study, based on reviewing many studies, experiments in HR management and open simulated HR datasets (the fictional data set created by IBM data scientists, and the simulated HR dataset supported by Kaggle) referenced in Table 1, we found that the strongest consensual predictors for employee voluntary turnover are Age, Education, Gender, Job involvement (implication of employee in decision making), Job satisfaction (Career satisfaction), Marital status, Job performance (skills adequacy), Tenure, Promotability (promotions in work), Business Travel, Grade, Rewards (Pay, organization based-rewards, Motivation factors, Salary), Relationship Satisfaction (Hostile organization culture), Environment Satisfaction (favourable or unfavourable working conditions), Training (Training time number, Uncongenial Work environment), Work/life balance. In Table 2, we summarize these 16 most cited features that are commonly and frequently used in the available literature.

TABLE 2. Features collected after the exploratory study.

B. FEATURE SELECTION: QUANTITATIVE METHOD

Following the exploratory study conducted to collect the most common factors used in the literature that influence employee attrition, a survey research method is adopted to gather the necessary data for the study. Then, some feature selection techniques are also adopted to better filter the chosen features and to end up with a final employee attrition model.

1) DATA COLLECTION: SURVEY

In order to collect real employee data and to tap the factors responsible for attrition in our study, an online questionnaire is prepared and used as a data gathering instrument from respondents (presented in the appendix and accessible through this link). Features collected through the exploratory method have been divided into three parts. Part 1 comprises demographic variables including: Gender, Age, Education, Marital status and Tenure. Part 2 is about their overall level of satisfaction, motivation, involvement and life interest (Job satisfaction, Job involvement, Job performance, Promotability, Environment satisfaction, Rewards, Relationship satisfaction, Business travel, Grade, Training, Work/life balance). Finally, part 3 aims to know the most impactful factors according to respondents and to collect their suggestions (if there are other features that can cause a turnover and so can be integrated into our study). From the designed survey we received 450 responses. Respondents were university people from different countries (Tunisia, Norway, France, United States, China, Italy, Pakistan, India, England and Germany). The questionnaire is anonymous. 44.5% of respondents are female and 55.5% are male. Age of the respondents varies from 27 to 62. Out of the total participants, 47.3% want to leave their jobs and 52.7% don’t have the intention to quit.

2) FEATURE SELECTION METHODS

To improve our first proposition of the employee attrition model, a feature selection procedure will now be followed to better filter features using real data collected from our survey. Feature selection is the automatic selection of the attributes in the data that are most relevant to the predictive modelling problem we are working on. Feature selection methods aim to create an accurate predictive model by choosing relevant features that will contribute to improving the accuracy and removing irrelevant and redundant attributes. In this study, we will
jointly benefit from two popular feature selection methods, namely Recursive Feature Elimination (a wrapper method) and SelectKBest (a filter method).

Recursive Feature Elimination (RFE) is a feature selection method that fits a model using an external estimator that assigns weights to the features (e.g., the coefficients of a linear model) and removes the weakest feature(s). Features are ranked by the model’s coefficients or feature importance attributes, and by recursively eliminating a small number of features per loop RFE attempts to eliminate dependencies and collinearity that may exist in the model. When a predictive model or algorithm assigns the value False to an attribute, the attribute has to be eliminated from the data columns; when the model assigns the value True to an attribute, it should be retained. In this step, we used 5 famous and accurate classifiers (XGB, RF, DT, LR and SVM). As employee attrition prediction is considered here as a classification problem, these classifiers have been chosen because they are the best representatives of the different classification approaches and at the same time they often behave well when dealing with statistical data [27]. Table 3 shows the results of the RFE method applied by the 5 classifiers or predictive models. Then, from these results, a majority vote was made to select candidate features of the RFE algorithm, so the ones selected to be eliminated by RFE (that have more False values than True values) are: Gender, Age, Grade, Education, Tenure, Promotability, Relationship satisfaction, and Work/life balance.

TABLE 3. Vote for keeping/eliminating collected features.

SelectKBest is a feature selection algorithm that scores the features of a dataset using a score function and then removes all but the k highest-scoring features. It then simply retains the first k features of the training set with the highest scores. It is helpful to mention here that we used the GridSearch technique to identify the value of k. In fact, we used cross-validation to divide data into three sets (10% for the validation set, which is used by GridSearch to find the best hyperparameter k, 70% for the training data and 20% for the test set). The best k recommended by GridSearch here is 8 features for both IBM and Kaggle HR analytics datasets. After applying the SelectKBest method to the collected data, the 5 algorithms select the same 8 features, which are: Age, Grade, Tenure, Job performance, Job satisfaction, Rewards, Environment satisfaction and Job involvement.

Fig. 2 presents a combination matrix of the results of the two feature selection techniques. Then, we propose to retain features that are selected by SelectKBest even though they are eliminated by RFE. Additionally, features that are not selected by SelectKBest and not eliminated by RFE are equally retained. Finally, features that are not selected by SelectKBest and eliminated by RFE are then removed. So, we ended up eliminating the following attributes: Gender, Education, Promotability, Relationship satisfaction and Work/life balance.

FIGURE 2. Combination matrix of SelectKBest and RFE results.

In conclusion, according to the combination of the two feature selection techniques (RFE and SelectKBest) and the collected data, the 11 main attritionary features necessary for employee attrition prediction are: Age, Marital status, Tenure, Grade, Rewards, Job involvement, Training, Business Travel, Job satisfaction, Job performance, and Environment satisfaction.

IV. THE PROPOSED ATTRITION PREDICTION APPROACH

The second part of the study deals with proposing a solution for employee attrition prediction. To do so, we will start this section by an overview of the related works with regard to attrition prediction solutions based on predictive models. Then, we will focus on our proposed predictive approach and the details of its steps.

With the help of our previous research studies and the data collected from the employees’ survey, we found the main impactful features on employee attrition, which will help us predict this attrition effectively. The collected and selected data will be considered as an input to our predictive approach that is based on three steps. Fig. 3 presents the architecture of
our proposed approach. The first step is data pre-processing. The second one deals with attrition prediction based on machine, ensemble and deep learning models. And the third one deals with interpretation, to explain to HR managers the reasons for this employee attrition.

FIGURE 3. Architecture of the proposed approach.

A. DATA PREPROCESSING

To better train predictive models, data pre-processing is one of the key steps. To do so, data provided by respondents is transformed and encoded to make it suitable for processing and training, using the library functions provided and implemented in Python’s library scikit-learn [27]. For instance, categorical features were One-Hot Encoded, by which each of the distinct values in the categorical fields was converted to numerical values, and then a scaling technique is used to put all the features on a similar scale by normalizing data to the range from −1 to 1, which keeps outliers from affecting the predictions.

B. ATTRITION PREDICTION MODELS

Employee attrition prediction is tackled as a supervised learning problem, and in particular, as a binary classification one. In other words, we are interested in detecting and confirming the existence or not of the employee’s intention to leave. To do so, we have put to the test different supervised machine and deep learning techniques, using also the implementations provided in Python’s library scikit-learn [29]. In particular, we have adhered to the following classifiers: Decision Tree, Logistic Regression and Support Vector Machine (as machine learning models), Random Forest, XGBoost and Voting Classifier (as ensemble learning models) and three deep learning models (DNN, LSTM and CNN). A grid-search algorithm was performed for each classifier to tune its hyperparameters, and the dataset was split 10:70:20 into validation, training and test sets. Then, the different models were trained using their best configuration on the training dataset.

1) MACHINE LEARNING BASED PREDICTIVE MODELS

1. Decision Tree is built through a recursive partitioning process where paths from root to leaf represent classification rules [28]. Each internal node represents a “test” on an attribute, each branch represents the partitioned outcome of the test, and each leaf represents a class label in the classification case or a numerical value in the regression case.
2. Support Vector Machine is a supervised learning algorithm that is used for linear as well as nonlinear classification problems. To achieve class separation, it uses a hyper-plane or a set of hyper-planes in a higher dimensional space. The intuition in this statistical learning based algorithm is that a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class [29].
3. Logistic Regression is a simple statistical technique and one of the basic linear models for classification that uses the logistic function to model categorical or binary dependent variables. It’s often used with regularization in the form of penalties based on the L1-norm or L2-norm to avoid over-fitting [30].

2) DEEP LEARNING BASED PREDICTIVE MODELS

1. Deep Neural Networks (DNN) [31] are deep Artificial Neural Networks (ANN) with multiple (at least two) hidden layers, where “deep” refers to the number of hidden layers through which the data is transformed from the input to the output layers. In a classical DNN, each layer is composed of a set of neurons and an activation function and is fully connected. A set of weights is assigned to each neuron, where each weight is multiplied by one input into the neuron. The products are then summed and fed through the activation function to form the output of the neuron.
2. Long Short-Term Memory Networks (LSTM) are an improvement of recurrent neural networks (RNN) that are able to model sequential and temporal data and to predict time series [32]. More specifically, a cell state is added in LSTM to store long-term states and to build a more stable RNN for time series prediction by detecting and memorizing the long-term dependencies existing in the time series.
3. Convolutional Neural Networks (CNN) [33] generally contain four types of layers in their structure: an input layer, convolutional layers, pooling layers, and a fully connected layer (output). In the convolutional layer, which represents the most important CNN part, the input will be convolved with different filters, where each filter is considered as a smaller matrix. Then, corresponding feature maps will be generated after the convolution operation. The pooling operation consists in reducing the size while preserving the important features. The efficiency of the network is thus improved, and over-fitting is avoided. So, the main role of convolutional and pooling layers is generally to extract features, and the main goal of fully connected layers is usually to output the information from feature maps together, and then provide them to latter layers.
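
The convolution and pooling operations described above for CNNs can be illustrated with a minimal sketch. It assumes a one-dimensional input (for instance, the scaled feature vector of one employee) and plain Python without any deep learning library. The function names `conv1d` and `max_pool1d`, the input values and the filter are hypothetical and chosen only for illustration; they are not the configuration used in this study.

```python
from typing import List


def conv1d(signal: List[float], kernel: List[float]) -> List[float]:
    """Slide the filter over the input without padding and return the feature map.

    As in most deep learning libraries, the filter is not flipped
    (strictly speaking this is a cross-correlation).
    """
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(feature_map: List[float], size: int) -> List[float]:
    """Keep the largest value of each non-overlapping window of length `size`.

    A trailing window shorter than `size` is dropped.
    """
    return [
        max(feature_map[i:i + size])
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# Hypothetical feature vector already scaled to [-1, 1] (values are made up).
x = [0.25, -0.5, 1.0, 0.25, -0.75, 0.5, 0.125]
# Illustrative fixed filter; in a real CNN the filter weights are learned.
difference_filter = [1.0, -1.0]

feature_map = conv1d(x, difference_filter)  # 7 inputs, filter of 2 -> 6 values
pooled = max_pool1d(feature_map, 2)         # 6 values, window of 2 -> 3 values

print(feature_map)
print(pooled)
```

The feature map is shorter than the input by the filter length minus one, and pooling with a window of 2 halves it again, which is the size reduction mentioned above. A fully connected layer would then take `pooled` as its input.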

3) ENSEMBLE LEARNING BASED PREDICTIVE MODELS
The main goal of ensemble learning (EL) is to combine several models in order to find a better solution that gives better results [34]. EL is used here to combine the classifiers and their predictions in order to improve robustness over a single classifier. In this study, we test four ensemble learning models:
1. Random Forest is a popular tree-based ensemble learning technique and a bagging algorithm in which successive trees are each constructed from a different bootstrap sample of the dataset. At the end, a simple majority vote is taken for prediction. Random forests differ from standard trees in that each node is split using the best among a subset of predictors randomly chosen at that node, which makes them robust against over-fitting [35].
2. XGBoost is a gradient boosted tree algorithm that fits a set of weak learners and produces the final prediction by combining the predictions of all of them through a weighted majority vote (or sum). This boosting algorithm uses a regularized model formalization to control over-fitting, which makes it highly robust and gives it better performance [36].
3. Voting Classifier is an ensemble learning model that trains an ensemble of classifiers and then predicts the output class by a vote, according to one of two strategies. The first is Hard Voting, where the predicted output class is the class predicted by the largest number of classifiers. The second is Soft Voting, where the output class is the one with the highest average predicted probability across the classifiers. In our case, we use a Voting Classifier that combines our chosen ML models and relies on the majority vote strategy (hard voting) to predict the output class. Such a classifier can be useful for a set of equally well performing models, in order to balance out their individual weaknesses.
4. Stacked ANN-based model, where the outputs of the three chosen deep learners (DNN, LSTM and CNN) are collected to create a new dataset that also includes, for each row, the real expected value; this dataset is used to train a new DNN learning model, called the meta-learner.
It is helpful to recall here that we used grid search, with 10% of the dataset as a validation set, to identify the best hyperparameters for each model (such as the decision criterion and max-depth for DT, and the number of hidden layers and of units or neurons in each layer for DNN, LSTM and CNN).

C. INTERPRETATION OF THE EMPLOYEE ATTRITION PHENOMENON
Employee retention refers to the practices and policies that organizations use to prevent valuable and skilled employees from leaving their jobs [37].
Thus, retention is the opposite of attrition: it means the ability of organizations to keep their employees, in particular productive ones, and stop them from going to work somewhere else. In fact, organizations’ retention policies and the governance of all other internal policies play a significant role in improving workplace productivity, engaging employees emotionally and, hence, controlling attrition. How to retain productive employees and their valued skills is one of the biggest problems that plague organizations, so we aim in this study not only to help HR managers detect an employee’s intention to leave early, but also to make them aware of the factors leading to attrition, so that they can take measures and adopt effective management strategies to retain their employees. Indeed, it is equally important for HR managers to have a predictive model that is not only accurate but also interpretable and explanatory, one that indicates which features trigger employee attrition and what makes an employee quit.
Thus, in this step of our approach, we show how our proposed models can be used for attrition interpretation as well as attrition prediction, using feature importance. Feature importance is a statistical method that allows us to evaluate and quantify the contribution of each feature to the prediction in the classification task. We use it here to identify the features that drive attrition and to understand their influence on employee attrition. Generally, feature importance provides a score for each attribute that indicates either how much the attribute contributes to improving performance, or how much the model depends on that feature in its predictions.
Our aim here is to search for the real reasons behind the phenomenon of attrition, so the interpretation has to focus only on employees who left or who intend to leave, i.e. on samples where Attrition = 1 (samples where Attrition = 0 are ignored). We then take the “Job satisfaction” feature as our new target, because employee job satisfaction is a key ingredient of employee retention. In fact, evidence suggests that employee attrition is triggered by job dissatisfaction, and many researchers have shown that employee job satisfaction is significantly correlated with the intention to leave [38]. We then proceed through the following steps:
1. Remove the rows for employees who did not leave their jobs or do not intend to leave (Attrition = 0).
2. Delete the “Attrition” column and consider “Job satisfaction” as the new target.
3. Convert the job satisfaction values 1, 2 and 3, 4 into 0 and 1 respectively, to obtain a binary satisfied/not satisfied label.
4. Apply feature importance using the Random Forest (RF) classifier to identify the features with the most impact on employee job satisfaction (we choose RF because it is the best-performing predictor, whereas the ensemble method cannot be used here because its inputs are classifiers and not data).
The results of applying feature importance to our RF classifier are depicted in Fig. 4.
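
The four interpretation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are random placeholders, the four feature names are hypothetical, and the impurity-based `feature_importances_` attribute of scikit-learn's `RandomForestClassifier` is assumed to be the importance measure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Random placeholder data (assumption): 400 employees, 4 hypothetical features.
feature_names = ["business_travel", "rewards", "training", "age"]
n_rows = 400
X = rng.integers(1, 5, size=(n_rows, len(feature_names))).astype(float)
attrition = rng.integers(0, 2, size=n_rows)         # 0 = stays, 1 = leaves
job_satisfaction = rng.integers(1, 5, size=n_rows)  # questionnaire scale 1..4

# Step 1: keep only the rows where Attrition = 1.
leavers = attrition == 1
X_leavers = X[leavers]

# Step 2: Attrition is dropped from here on; job satisfaction is the target.
# Step 3: binarise the target, values {1, 2} -> 0 and values {3, 4} -> 1.
y_leavers = (job_satisfaction[leavers] >= 3).astype(int)

# Step 4: fit a random forest and rank the features by importance.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_leavers, y_leavers)
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the placeholder target is random, the printed ranking carries no meaning here; on real survey data the same ranking would be the input to a plot such as Fig. 4.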

TABLE 4. Performance evaluation of models using the two simulated datasets.

FIGURE 4. Features importance of the random forest model.

TABLE 5. Performance evaluation of models using our real dataset.

V. EXPERIMENTATION RESULTS
After conducting an exploratory and deep data analysis and identifying all model settings (parameters and hyper-parameters), we are now ready to build our models and assess their performance. This section presents the experimental results of the machine, ensemble and deep learning predictive models. To assess these models in a variety of scenarios, we use the large-sized Kaggle HR simulated dataset (15000 samples), the medium-sized IBM HR simulated dataset (1470 samples) and our small-sized real HR dataset (450 samples). Finally, the salient contribution of these models is presented towards the end of this experimentation, to enable the HR manager not only to predict attrition but also to understand why it occurs and so to identify keys to retention. The evaluation criteria for these models and the comparison of their results are explained in the following sections.
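
Since the models are compared by accuracy and F1-score, and the Voting Classifier of Section IV relies on a majority vote, the two voting rules and the two metrics can be sketched in plain Python. The probabilities and labels below are made-up values chosen only to show a case where hard and soft voting disagree; they are not results from the paper.

```python
from collections import Counter

def hard_vote(labels):
    """Majority vote over predicted labels; ties go to the smallest label."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)

def soft_vote(probas):
    """Class with the highest probability averaged over the classifiers."""
    n_classes = len(probas[0])
    avg = [sum(p[k] for p in probas) / len(probas) for k in range(n_classes)]
    return avg.index(max(avg))

def accuracy(y_true, y_pred):
    """Ratio of correct predictions to the total number of predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up outputs of three classifiers for four employees, given as
# [P(stay), P(leave)] per classifier; class 1 means the employee leaves.
probas_per_employee = [
    [[0.40, 0.60], [0.45, 0.55], [0.90, 0.10]],
    [[0.20, 0.80], [0.30, 0.70], [0.40, 0.60]],
    [[0.70, 0.30], [0.60, 0.40], [0.80, 0.20]],
    [[0.10, 0.90], [0.60, 0.40], [0.30, 0.70]],
]
y_true = [1, 1, 0, 1]

hard_preds, soft_preds = [], []
for probas in probas_per_employee:
    labels = [p.index(max(p)) for p in probas]  # each classifier's own label
    hard_preds.append(hard_vote(labels))
    soft_preds.append(soft_vote(probas))

print("hard:", hard_preds, accuracy(y_true, hard_preds), f1(y_true, hard_preds))
print("soft:", soft_preds, accuracy(y_true, soft_preds), f1(y_true, soft_preds))
```

For the first employee, two classifiers lean weakly towards leaving while the third is strongly confident of staying, so hard voting predicts class 1 and soft voting predicts class 0.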

A. RESULTS OF PREDICTIVE MODELS FOR TWO SIMULATED HR DATASETS
In this section, the two simulated human resources datasets are used to assess the performance of our predictive models. The first is the large-sized dataset supplied by Kaggle, which contains 15000 samples; its target variable is “left” and its 9 features are satisfaction level; last evaluation; number project; average monthly hours; time spend company; Work accident; promotion last 5 years; sales and Salary. The second simulated human resources analytics dataset is a medium-sized dataset provided by IBM; it contains 1470 samples with 34 features, and its target variable is attrition, represented as “No” (employee did not leave) or “Yes” (employee left). Our 11 selected features are among the 34 features of this second dataset, so we first check the performance of our predictors using the entire IBM dataset with its 34 features. Then, we assess their performance on the same dataset keeping only the 11 selected features of our employee attrition model (Marital status, Age, Tenure, Grade, Rewards, Job involvement, Training, Business Travel, Job satisfaction, Job performance, and Environment satisfaction). Table 4 shows the results in terms of accuracy (defined as the percentage of data correctly classified by the model, i.e. the ratio of correct predictions to the total number of predictions) and F1-score using the two simulated datasets.

B. RESULTS OF PREDICTIVE MODELS FOR OUR REAL DATASET
In this section, we compare our classification predictors to understand which predictor is best suited to classifying churners and non-churners using our real dataset. Model accuracies are measured before and after feature selection: we first use the entire real dataset with its 16 features, and then evaluate the models using only the 11 features selected by the feature selection process that combines RFE and SelectKBest. Results are shown in Table 5.

VI. FINDINGS AND DISCUSSION
In this section, we discuss our experimental results and highlight the novelties of this research.
Firstly, regarding the quantitative assessment of our predictors’ performance, the results in Tables 4 and 5 show that the ensemble learning model Voting Classifier (VC) performs better than the other models for the simulated as well as
real data. In fact, VC outperforms all the other classifiers in terms of accuracy, especially on our real dataset compared to the simulated ones. Among the machine learning classifiers used, ensemble learning with VC gives better accuracy for both the simulated and the real datasets, regardless of whether feature selection is applied. In particular, for our final dataset, VC gives the best results with an accuracy of 0.99. This can be explained by the fact that ensemble learning combines (weak) learners in one method and takes advantage of their complementarity to output more accurate results. In addition, our ensemble learning VC also outperforms the deep learning predictors on both simulated and real data. This result may be explained by the quantity of data provided. In fact, deep learning algorithms require “relatively” large datasets to work well and give better results, and they also need the infrastructure to train them in reasonable time. Also, deep learning algorithms require many more experiences and are more beneficial for complex problems and real big data with a greater number of features.
Moreover, to compare the accuracy of our proposed models with recent works that reused the simulated HR datasets, Table 6 shows the different results. For the IBM HR simulated dataset, our ensemble learning VC gives the best results with an accuracy of 0.93. For the Kaggle HR simulated dataset, ensemble learning VC likewise gives the best results with an accuracy of 0.98.

TABLE 6. Comparison of accuracy models for IBM/Kaggle HR datasets with existing works.

Apart from the proposed predictive models and their combination to get more accurate employee attrition predictions, the salient contributions of this paper concern two points. The first is the proposal of a deep data-driven predictive approach. In fact, our approach focuses on the use of relevant data and the selection of impactful features instead of using all the collected data. It is worth mentioning here that feature selection offers an effective way to reduce the complexity of classification problems by removing irrelevant and redundant data, which can reduce computation time, improve learning accuracy, and facilitate a better understanding of the learning model. These claims are confirmed experimentally here, as shown in Tables 4 and 5. In fact, accuracy improves for most of the classifiers when feature selection is used. We also note an improvement of the F1-score after feature selection. This confirms the effectiveness of the employee attrition model chosen in this study, and the good results from multiple classifiers after feature selection indicate that the selected features do contribute to voluntary attrition. Even for the IBM simulated human resources dataset, predictors’ performance improved when the number of features was reduced to our 11 selected features; in particular, the accuracy of the ensemble method VC increased slightly, from 0.93 before feature selection to 0.96 after feature selection. Moreover, ensemble learning VC applied to our final dataset after feature selection gives the best results with an accuracy of 0.99. This also confirms that the choice of SelectKBest and RFE as the two feature selection algorithms is a good one for improving and validating our employee attrition model. So, this deep study also complements previous findings reported in the literature regarding the impactful features on employee attrition and confirms that only the 11 selected features are needed.
The second salient contribution of this paper concerns the interpretation and explanation of the attrition phenomenon, and hence the recommendations for effective retention. According to [37], retention policies fall into three levels of HR management: high, medium and low. Each level considers a different perspective and requires different kinds of strategies to combat the problem of attrition arising at that level. At the lower managerial level, understanding and money are keys to retention, whereas at the medium managerial level managers’ appreciation, training and business travel programs act as major keys. Finally, for high-level management, retention policies include freedom of decision making and creation of a trustworthy environment. Thus, generally, organizations should create an environment that fosters work appreciation and a friendly collaborative atmosphere that makes an employee feel involved and connected to the organization. For our real case study in particular, the results of feature importance applied to our RF classifier and plotted in Fig. 4 show that, for the 450 respondents, the highest importance is assigned first to the “Business Travel” feature and second to “Rewards”. This means that business travel is the most motivational attribute and the key to employee retention for the studied dataset. Thus, HR managers should adopt a retention strategy at the medium managerial level and try to organize some business travel for the employees. While rewards, pay and effort–reward imbalance are generally considered as the most impactful variables
on employee attrition, as in [2] and [16], the findings here indicate, however, that one of the leading features identified is less common in the literature: business travel. In fact, as reported in the literature (e.g., [39]), business travel, whether domestic or international, undoubtedly brings benefits for employees and is shown to have a significant effect above and beyond technology transfer, through innovation and inspiration from other environments. Indeed, it has been suggested that the experience of visiting clients, other companies, cities and countries broadens employees’ understanding of different cultures and makes them more open-minded.
At this stage, we assume that there might be some threats to the validity of our research findings, and we have self-assessed them here in order to indicate the trustworthiness of our results: to what extent they are true and not biased by our subjective point of view. These potential threats are addressed according to the classification proposed in [40]. Regarding construct validity, we assume that the provided measures could be biased towards the researchers’ expected results. However, to validate and evaluate the performance of the adopted classifiers, we have used accuracy, which is considered a standard metric often used for measuring performance by reducing biases. It is also robust, particularly for balanced data, which is almost our case here, since in our real dataset 47.3% of respondents want to leave their jobs and 52.7% do not intend to quit. Regarding external validity, there might be some issues with the generalization of our predictive approach, since the data collected through the employee survey were small (450 samples), which might indicate a low relevance of the obtained results. To overcome this issue, the approach and its learnt models are also assessed on the large-sized Kaggle HR simulated dataset (15000 samples) and the medium-sized IBM HR simulated dataset (1470 samples), which provide more consistent feedback about the relevance of our results. Finally, regarding reliability, there might be a potential threat concerning the dependency of the data and analysis on the specific researchers. However, we have made an effort to minimize this threat by collecting data from different countries with different cultures.

VII. CONCLUSION AND FUTURE WORKS
The main goal of this research is to help HR managers detect an employee’s intention to leave as soon as possible using predictive analytics methods, and so to fight this attrition. The contributions can be summarized in three points: i) The proposal of a new employee attrition model that contains only 11 features, necessary and sufficient to detect intention to leave and to predict positive attrition, using a mixed research methodology. ii) The proposal of machine, deep and ensemble learning predictive models and their experimentation in a variety of settings (large-sized simulated dataset, medium-sized simulated dataset and small-sized real dataset) to best assess their performance. iii) The interpretation and explanation that enable HR managers to understand what makes an employee want to leave and help them adopt key retention policies.
In terms of study limitations, considering dynamic features that deal with employees’ behaviour and emotional states would be a promising way to study their impact on employee attrition. In this case, the training of the predictive models must be on-line, as the data will be dynamic and new data can be added whenever required. We also acknowledge that our questionnaire respondents suggested other features that can cause voluntary turnover and so can be integrated into our future study. In fact, they proposed considering health issues, job security and the use of new technologies in the company. Finally, in future research, handling unbalanced data is a real challenge, especially for organizations and companies with a high turnover rate, because the adopted predictive models are experimentally not suitable for unbalanced data.

APPENDIX
QUANTITATIVE QUESTIONNAIRE
1. Country:
2. Gender: Female/Male
3. Grade:
4. Age:
5. Education: 1: ’Below College’, 2: ’College’, 3: ’Bachelor’, 4: ’Master’, 5: ’Doctor’, 6: Other
6. Specialty (Computer Science, Electronics, Mechanics, Business, Medicine, Education, etc.):
7. Marital status: 1: Single, 2: Married, 3: Divorced
8. Organization tenure (number of years at your organization):
9. Years since last promotion in the organization:
10. Rate the degree of your job satisfaction (motivational work, spirit of challenge, contentment with career progress, personal development): 1: Low, 2: Medium, 3: High, 4: Very high
11. Rate the degree of job performance (productivity, skills adequacy): 1: Low, 2: Medium, 3: High, 4: Very high
12. Rate the degree of environment satisfaction (simple tasks, clear roles, no stressors): 1: Low, 2: Medium, 3: High, 4: Very high
13. Do you feel you are well rewarded for your dedication and commitment towards the work (rewards, Pay)? Yes/No
14. How easy was it for you to get involved in your job (participation in decision making, opinions): 1: Slightly easy, 2: Moderately easy, 3: Very easy, 4: Extremely easy
15. Are you satisfied with your relationships at work (relationship with colleagues and manager)? 1: Slightly satisfied, 2: Moderately satisfied, 3: Very satisfied, 4: Extremely satisfied
16. Reward/Salary:
17. Trainings number offered by the organization:
18. How easy was it to balance your work life and personal life while working? 1: Low, 2: Medium, 3: Easy, 4: Very easy
19. How often did you travel for business at that organization? 1: Non-travel, 2: Travel rarely, 3: Travel frequently
20. Intention to quit the organization: Yes/No
21. Any other factors which you feel are responsible for Employee Attrition?

REFERENCES
[1] R. Punnoose and P. Ajit, “Prediction of employee turnover in organizations using machine learning algorithms,” Int. J. Adv. Res. Artif. Intell., vol. 5, no. 9, p. 5, 2016, doi: 10.14569/IJARAI.2016.050904.
[2] R. Colomo-Palacios, C. Casado-Lumbreras, S. Misra, and P. Soto-Acosta, “Career abandonment intentions among software workers,” Hum. Factors Ergonom. Manuf. Service Industries, vol. 24, no. 6, pp. 641–655, Nov. 2014, doi: 10.1002/hfm.20509.
[3] Amazon.fr—People Analytics in the era of big Data: Changing the way you Attract, Acquire, Develop, and Retain Talent—Jean Paul Isson—Livres. Accessed: Dec. 15, 2019. [Online]. Available: https://www.amazon.fr/People-Analytics-Era-Big-Data/dp/1119050782
[4] D. Angrave, A. Charlwood, I. Kirkpatrick, M. Lawrence, and M. Stuart, “HR and analytics: Why HR is set to fail the big data challenge,” Hum. Resource Manage. J., vol. 26, no. 1, pp. 1–11, Jan. 2016, doi: 10.1111/1748-8583.12090.
[5] A. Tursunbayeva, S. D. Lauro, and C. Pagliari, “People analytics—A scoping review of conceptual boundaries and value propositions,” Int. J. Inf. Manage., vol. 43, pp. 224–247, Dec. 2018.
[6] T. Pape, “Prioritising data items for business analytics: Framework and application to human resources,” Eur. J. Oper. Res., vol. 252, no. 2, pp. 687–698, Jul. 2016.
[7] S. N. Mishra, D. R. Lama, and Y. Pal, “Human resource predictive analytics (HRPA) for HR management in organizations,” Int. J. Sci. Technol. Res., vol. 5, no. 5, pp. 33–35, 2016.
[8] P. Likhitkar and P. Verma, “HR value proposition using predictive analytics: An overview,” in New Paradigm in Decision Science and Management. Singapore: Springer, 2020, pp. 165–171, doi: 10.1007/978-981-13-9330-3_15.
[9] T. Peeters, J. Paauwe, and K. Van De Voorde, “People analytics effectiveness: Developing a framework,” J. Organizational Effectiveness, People Perform., vol. 7, no. 2, pp. 203–219, Jul. 2020, doi: 10.1108/JOEPP-04-2020-0071.
[10] N. Shah, Z. Irani, and A. M. Sharif, “Big data in an HR context: Exploring organizational change readiness, employee attitudes and behaviors,” J. Bus. Res., vol. 70, pp. 366–378, Jan. 2017, doi: 10.1016/j.jbusres.2016.08.010.
[11] S. V. Kalinin, B. G. Sumpter, and R. K. Archibald, “Big-deep-smart data in imaging for guiding materials design,” Nature Mater., vol. 14, no. 10, pp. 973–980, Oct. 2015, doi: 10.1038/nmat4395.
[12] M. Nocker and V. Sena, “Big data and human resources management: The rise of talent analytics,” Social Sci., vol. 8, no. 10, p. 273, Sep. 2019, doi: 10.3390/socsci8100273.
[13] D. Pessach, G. Singer, D. Avrahami, H. C. Ben-Gal, E. Shmueli, and I. Ben-Gal, “Employees recruitment: A prescriptive analytics approach via machine learning and mathematical programming,” Decis. Support Syst., vol. 134, Jul. 2020, Art. no. 113290, doi: 10.1016/j.dss.2020.113290.
[14] S. S. Alduayj and K. Rajpoot, “Predicting employee attrition using machine learning,” in Proc. Int. Conf. Innov. Inf. Technol. (IIT), Al Ain, UAE, Nov. 2018, pp. 93–98, doi: 10.1109/INNOVATIONS.2018.8605976.
[15] M. Ganesh V. Aishwaryalakshmi, S. Aksshaya, and K. Abinaya, “Predicting employee attrition using machine learning,” Int. J. Sci. Res. Comput. Sci., Eng. Inf. Technol., vol. 3, no. 3, pp. 145–149, 2018.
[16] Y. Zhao, M. K. Hryniewicki, F. Cheng, B. Fu, and X. Zhu, “Employee turnover prediction with machine learning: A reliable approach,” in Intelligent Systems and Applications, vol. 869, K. Arai, S. Kapoor, and R. Bhatia, Eds. Cham, Switzerland: Springer, 2019, pp. 737–758.
[17] X. Gao, J. Wen, and C. Zhang, “An improved random forest algorithm for predicting employee turnover,” Math. Problems Eng., vol. 2019, pp. 1–12, Apr. 2019, doi: 10.1155/2019/4140707.
[18] S. N. Khera and Divya, “Predictive modelling of employee turnover in Indian IT industry using machine learning techniques,” Vis., J. Bus. Perspective, vol. 23, no. 1, pp. 12–21, Mar. 2019, doi: 10.1177/0972262918821221.
[19] S. Karande and L. Shyamala, “Prediction of employee turnover using ensemble learning,” in Ambient Communications and Computer Systems, vol. 904, Y.-C. Hu, S. Tiwari, K. K. Mishra, and M. C. Trivedi, Eds. Singapore: Springer, 2019, pp. 319–327.
[20] D. S. Sisodia, S. Vishwakarma, and A. Pujahari, “Evaluation of machine learning models for employee churn prediction,” in Proc. Int. Conf. Inventive Comput. Informat. (ICICI), Coimbatore, India, Nov. 2017, pp. 1016–1020, doi: 10.1109/ICICI.2017.8365293.
[21] M. M. Alam, K. Mohiuddin, K. M. Islam, M. Hassan, A.-U. M. Hoque, and S. M. Allayear, “A machine learning approach to analyze and reduce features to a significant number for employee’s turn over prediction model,” in Intelligent Computing, vol. 857, K. Arai, S. Kapoor, and R. Bhatia, Eds. Cham, Switzerland: Springer, 2019, pp. 142–159.
[22] S. Shah, S. Alatekar, Y. Bhangare, B. Kasar, and R. Patil, “Analysis of employee attrition and implementing a decision support system providing personalized feedback and observations,” J. Crit. Rev., vol. 7, no. 19, pp. 2372–2380, 2020.
[23] F. Fallucchi, M. Coladangelo, R. Giuliano, and E. W. De Luca, “Predicting employee attrition using machine learning techniques,” Computers, vol. 9, no. 4, p. 86, Nov. 2020, doi: 10.3390/computers9040086.
[24] S. R. Ponnuru, “Employee attrition prediction using logistic regression,” Int. J. Res. Appl. Sci. Eng. Technol., vol. 8, no. 5, pp. 2871–2875, May 2020, doi: 10.22214/ijraset.2020.5481.
[25] S. Kakad, R. Kadam, P. Deshpande, S. Karde, and R. Lalwani, “Employee attrition prediction system,” Int. J. Innov. Sci., Eng. Technol., vol. 7, no. 9, p. 7, 2020.
[26] N. Jain, A. Tomar, and P. K. Jana, “A novel scheme for employee churn problem using multi-attribute decision making approach and machine learning,” J. Intell. Inf. Syst., vol. 56, no. 2, pp. 279–302, Apr. 2021, doi: 10.1007/s10844-020-00614-9.
[27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, and J. Van der Plas, “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Oct. 2011.
[28] S. R. Safavian and D. Landgrebe, “A survey of decision tree classifier methodology,” IEEE Trans. Syst., Man, Cybern., vol. 21, no. 3, pp. 660–674, May 1991.
[29] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statist. Comput., vol. 14, no. 3, pp. 199–222, Aug. 2004, doi: 10.1023/B:STCO.0000035301.49549.88.
[30] G. King and L. Zeng, “Logistic regression in rare events data,” Political Anal., vol. 9, no. 2, pp. 137–163, 2001.
[31] G. E. Hinton, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006, doi: 10.1126/science.1127647.
[32] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct deep recurrent neural networks,” 2013, arXiv:1312.6026. [Online]. Available: https://arxiv.org/abs/1312.6026
[33] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, and T. Chen, “Recent advances in convolutional neural networks,” Pattern Recognit., vol. 77, pp. 354–377, May 2018, doi: 10.1016/j.patcog.2017.10.013.
[34] G. Brown, “Ensemble learning,” in Encyclopedia of Machine Learning, vol. 312. 2010, pp. 15–19.
[35] A. Liaw and M. Wiener, “Classification and regression by randomForest,” R News, vol. 2, pp. 18–22, Dec. 2002.
[36] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Ann. Statist., vol. 29, no. 5, pp. 1189–1232, Oct. 2001.
[37] B. K. Goswami and S. Jha, “Attrition issues and retention challenges of employees,” Int. J. Sci. Eng. Res., vol. 3, no. 4, pp. 1–6, Apr. 2012.
[38] A. H. Khan and M. Aleem, “Impact of job satisfaction on employee turnover: An empirical study of autonomous medical institutions of Pakistan,” J. Int. Stud., vol. 7, no. 1, pp. 122–132, May 2014, doi: 10.14254/2071-8330.2014/7-1/11.
[39] J. V. Beaverstock, B. Derudder, J. R. Faulconbridge, and F. Witlox, “International business travel: Some explorations,” Geografiska Annaler, B, Hum. Geogr., vol. 91, no. 3, pp. 193–202, Sep. 2009, doi: 10.1111/j.1468-0467.2009.00314.x.
[40] P. Runeson and M. Höst, “Guidelines for conducting and reporting case study research in software engineering,” Empirical Softw. Eng., vol. 14, no. 2, pp. 131–164, Apr. 2009, doi: 10.1007/s10664-008-9102-8.


NESRINE BEN YAHIA is currently a Doctor-Engineer in computer sciences. She is also an Associate Professor with the National School of Computer Science (ENSI), where she is a Coordinator of the master’s degree in smart systems and the Chief of the Information and Decision Systems Department. She participated in scientific national and international projects. Her current research interests include artificial intelligence, cooperative systems, CSCW, knowledge engineering, social networks analysis, and intelligent decision support systems. Her teaching interests include machine and deep learning, software engineering, UML, software design, and software reengineering.

JIHEN HLEL received the master’s degree in computer sciences from the National School of Computer Sciences, in 2019. She is currently pursuing the Ph.D. degree. Her current research interests include artificial intelligence, machine and deep learning, and intelligent decision support systems.

RICARDO COLOMO-PALACIOS (Senior Member, IEEE) received the bachelor’s, master’s, and M.B.A. degrees from the Instituto de Empresa, in 1994, 1997, and 2002, respectively, and the Ph.D. degree in computer science from the Universidad Politécnica of Madrid, in 2005. He is currently a Full Professor with Østfold University College, Norway. He has been working as a software engineer, a project manager, and a software engineering consultant with several companies, including Spanish IT Leader INDRA. His research interests include applied research in information systems, software project management, people in software projects, business software, and software and services process improvement. He is also an Associate Editor in journals, like IEEE Software, IEEE ACCESS, and Computer Standards & Interfaces. He has edited several special issues in journals, like Journal of Software: Evolution and Process, Software Quality Journal, Science of Computer Programming, and Future Generation Computer Systems.
