Absenteeism at Work Project Report
Machine Learning I
Prediction of Absenteeism
Project Report
Submitted by-Team 2
Saisanthosh Mamidala
Shuyu Sui
Srilakshmi Peesa
Competition in the current marketplace is stiff, and the productivity of its workforce helps an organization stay ahead of its competitors. Because employee productivity strongly affects a business's output, it is important for an organization's Human Resources department to understand what influences productivity. Absenteeism is one behavior that disrupts the regular workflow. It is defined as the habitual non-presence of an employee at his or her job (Will Kenton, 2019), and organizations have to address it. This paper explores different machine learning techniques to understand which factors contribute to absenteeism and to help organizations update or redesign their employee satisfaction metrics.
Introduction
This analysis intends to understand the relationships between different employee behaviors and
absenteeism. Understanding these relationships helps define which factors cause absenteeism and
how an organization can address them.
There are many ways to explore the data and understand these relationships, but for our analysis we are
leveraging supervised machine learning algorithms. Specifically, we are using four algorithms: Decision
Tree, Random Forest, Naïve Bayes Classifier, and Support Vector Machine. Each of these algorithms has
its advantages and disadvantages, so we analyze the outputs from all of them and suggest the algorithm
that best fits this data. For model validation, we use a combination of accuracy scores, confusion
matrices, ROC curves, and AUC.
Related Work
Supervised Machine Learning
This analysis applies multiple supervised machine learning algorithms. It cleans the data by
removing all features that are not directly related to the target, which risks discarding features that have
an indirect effect on the target. The analysis uses Decision Tree, Random Forest, Naïve Bayes Classifier,
and Support Vector Machine algorithms and concludes, based purely on the confusion matrix, that
Random Forest gives the best results (Ojo Olawale, 2019, Aug 27).
Unsupervised Machine Learning
This analysis uses a hierarchical clustering technique to see which features contribute to
absenteeism. It concludes that the features profiles and age groups are the biggest contributors to this
behavior (Parker Oakes, 2019, Feb 24).
Data Preprocessing
The dataset used for this analysis is the well-known Absenteeism at Work dataset, created from
employee records of a courier company in Brazil between July 2007 and July 2010. The dataset has 666
observations and 20 columns: one target variable and 19 feature variables.
Figure 1
Data Types of all the variables in the dataset
From the above metadata, we observed that two features, Age and Workload Average/day, are
stored as strings but should be numeric. In addition, Workload Average/day had a blank space in its name,
and its values contained thousands-separator commas. We therefore stripped the blank spaces from the
column name, removed the commas from the values, and converted both columns to numeric. The
conversion introduced null values, which were dropped from the analysis. We also performed null-value
and valid-value checks wherever applicable and dropped all observations with invalid values. In total, six
observations were dropped from the initial data.
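These cleanup steps can be sketched with pandas. The small frame below is purely illustrative: the column names and values mimic the dataset's problematic columns but are not taken from it.

```python
import pandas as pd

# Hypothetical frame mimicking the two problematic columns: 'Age'
# stored as strings and 'Workload Average/day ' with a trailing blank
# in its name and comma-separated values.
df = pd.DataFrame({
    "Age": ["33", "50", "38", "bad"],
    "Workload Average/day ": ["239,554", "239,554", "205,917", "222,196"],
})

# Strip blank spaces from column names.
df.columns = df.columns.str.strip()

# Remove the thousands-separator commas, then convert to numeric.
df["Workload Average/day"] = pd.to_numeric(
    df["Workload Average/day"].str.replace(",", "", regex=False))

# Coerce Age to numeric; unparseable entries become NaN.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

# Drop the observations whose conversion produced nulls.
df = df.dropna()
print(len(df))  # the row with the invalid Age is gone
```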
We also checked for multicollinearity in the data. We plotted the correlations between all
variables and dropped features with an absolute correlation coefficient above 0.8. Body Mass Index was
highly correlated with other features, so it was dropped.
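The correlation-based drop might look like the following sketch; the toy frame and its near-duplicate "Body mass index" column are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy frame standing in for the dataset: 'Body mass index' is built
# to be almost perfectly correlated with 'Weight'.
weight = rng.normal(70, 10, 200)
df = pd.DataFrame({
    "Weight": weight,
    "Body mass index": weight / 3.0 + rng.normal(0, 0.1, 200),
    "Age": rng.integers(20, 60, 200).astype(float),
})

# Absolute pairwise correlations; keep only the upper triangle so
# each pair is inspected once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any feature correlated above 0.8 with an earlier feature.
to_drop = [c for c in upper.columns if (upper[c] > 0.8).any()]
df = df.drop(columns=to_drop)
print(to_drop)
```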
Figure 2
Heat Map Showing Correlation Coefficient Values Between Features
The above heat map visualizes the correlation coefficients between features; the lighter the tile,
the higher the correlation coefficient. As figure 2 shows, dropping Body Mass Index removes the
multicollinearity. We also converted the target variable, Absenteeism time in hours, into a categorical
variable for our initial analysis, using the following thresholds.
Table 1
Target Value Threshold by Groups
Group Number Threshold
0 Number of hours=0
1 0 < Number of hours <= 6
2 Number of hours > 6
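The thresholds in Table 1 map directly onto pandas binning; the hour values below are hypothetical.

```python
import pandas as pd

# Hypothetical absenteeism hours for a handful of observations.
hours = pd.Series([0, 2, 6, 8, 40, 0, 3])

# Bin into the three groups from Table 1:
# 0 -> exactly zero hours, 1 -> (0, 6], 2 -> more than 6.
groups = pd.cut(hours, bins=[-1, 0, 6, float("inf")], labels=[0, 1, 2])
print(groups.tolist())  # [0, 1, 1, 2, 2, 0, 1]
```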
Technical Approach
Feature selection
Feature selection is the process of choosing the most useful features for training the model. This is
done to reduce dimensionality when many features do not contribute enough to the overall variance.
In our project, we used a correlation plot to eliminate a highly correlated variable, a random forest to find
feature importances, and Principal Component Analysis for dimensionality reduction.
a. Principal Component Analysis (PCA)
We use PCA primarily for dimensionality reduction: it extracts the most important components
from the larger set of variables in the dataset. We normalized the data before performing PCA.
After removing the BMI variable, the data has 18 input features, and 92% of the variance is explained by
14 components. So, we selected 14 components.
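One way to reproduce this component selection with scikit-learn is sketched below. The random stand-in matrix is illustrative, so the number of retained components will differ from the 14 found on the real data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12345)
X = rng.normal(size=(660, 18))  # stand-in for the 18 input features

# Standardize first: PCA is sensitive to feature scales.
X_std = StandardScaler().fit_transform(X)

# Keep the smallest number of components explaining >= 92% of
# the variance.
pca = PCA(n_components=0.92, svd_solver="full")
X_pca = pca.fit_transform(X_std)
print(X_pca.shape[1], pca.explained_variance_ratio_.sum())
```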
b. Data Splitting:
We set the randomization seed to 12345 and split the data into training and test sets in an 80:20 ratio.
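A minimal sketch of the split and the four-model comparison, using synthetic stand-in data (make_classification) since the original dataset is not reproduced here; the resulting scores will not match the tables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the absenteeism features.
X, y = make_classification(n_samples=660, n_features=18,
                           n_informative=8, n_classes=3,
                           random_state=12345)

# 80:20 split with the seed fixed at 12345, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

models = {
    "Random Forest": RandomForestClassifier(random_state=12345),
    "Decision Tree": DecisionTreeClassifier(random_state=12345),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.2%}")
```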
Table 2
Model Accuracy Scores After Performing PCA
Techniques Accuracy Score
Random Forest Classifier 75.00%
Decision Tree Classifier 74.24%
Naïve Bayes Classifier 71.96%
Support Vector Machine 67.42%
The accuracy scores appear moderately good. The Random Forest Classifier leads, closely
followed by the Decision Tree Classifier and the Naïve Bayes Classifier; the Support Vector Machine
performs the worst of the four tested classification models. The four models were then rebuilt without
applying PCA, i.e., without eliminating any features from the dataset (except Body Mass Index, which
was dropped earlier due to high correlation). The new models show a significant difference in the
accuracy scores.
Table 3
Model Accuracy Scores Without PCA
Techniques Accuracy Score
Random Forest Classifier 84.84%
Decision Tree Classifier 81.06%
Naïve Bayes Classifier 78.78%
Support Vector Machine 74.24%
Without Principal Component Analysis, the accuracy scores of the Random Forest Classifier and
the Decision Tree Classifier increase remarkably. Although PCA served its purpose of reducing the
dimensionality of the data, the models trained on the PCA-transformed data failed to capture the
underlying pattern and returned relatively lower accuracy scores. The Random Forest Classifier has the
best accuracy score, so we tuned its parameters with a grid search over n_estimators: [6, 100, 30] and
max_depth: [5, 7, 10]. Even with the best parameters found, {'max_depth': 7, 'n_estimators': 30}, the
accuracy score did not improve significantly. An interesting observation is that the SVM model classifies
every observation as either Group 1 or Group 2 and never predicts Group 0. This can be identified from
the confusion matrix below.
Figure 3
Confusion Matrix for SVM
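The grid search over the parameters mentioned above might be sketched as follows, again on synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, split 80:20 with seed 12345.
X, y = make_classification(n_samples=660, n_features=18,
                           n_informative=8, n_classes=3,
                           random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

# Grid over the same candidate values tuned in the report.
param_grid = {"n_estimators": [6, 100, 30], "max_depth": [5, 7, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=12345),
                      param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```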
Table 4
Model Accuracy Scores on the Test Data
Techniques Accuracy Score
Random Forest Classifier 71.62%
Decision Tree Classifier 71.62%
Naïve Bayes Classifier 60.81%
Support Vector Machine 60.60%
The Random Forest Classifier and the Decision Tree Classifier are tied for the highest accuracy
score at 71.62%; the Naïve Bayes Classifier stands third at about 60.81%. The Support Vector Machine
classifies every observation as Group 1, ignoring Group 0 and Group 2 entirely, which means that the
SVM is not a recommended model for this problem. This can be verified from the confusion matrix in the
figure below.
Figure 4
Confusion Matrix and Accuracy Score for SVM
An accuracy score above 70% is good considering that the problem falls within the Human
Resources domain. Human Resources deals directly with human behavior, so very high accuracy scores
cannot be expected. Moreover, we are dealing with limited variables to predict employee behavior:
several variables generally considered key indicators of an employee's performance and absenteeism,
such as morale, job satisfaction, relationship with the manager, and workplace ambiance, are not present
in the dataset. Thus, the Random Forest Classifier and Decision Tree Classifier, which both have an
accuracy score of 71.62%, are selected for further evaluation.
Model Evaluation
We utilized several methods to evaluate model performance. In addition to the accuracy score, we
leverage the receiver operating characteristic (ROC) curve with the area under the curve (AUC) and the
confusion matrix. We are particularly interested in comparing the random forest model with the decision
tree model: in the previous section, we observed that their accuracy scores are almost the same, so we
decided to explore more evaluation methods to choose the better model.
ROC Curve
The receiver operating characteristic curve plots the true positive rate on the Y-axis against the
false positive rate on the X-axis. In our case, there are three classes in total, and we plot a ROC curve for
each class. Plotting these two rates gives a clear picture of the sensitivity of the model.
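For a three-class problem, the per-class curves can be computed one-vs-rest; below is a sketch on synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic stand-in data, split 80:20 with seed 12345.
X, y = make_classification(n_samples=660, n_features=18,
                           n_informative=8, n_classes=3,
                           random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

model = RandomForestClassifier(random_state=12345).fit(X_train, y_train)
proba = model.predict_proba(X_test)

# One-vs-rest: binarize the labels, then compute a ROC curve and
# its AUC for each class.
y_bin = label_binarize(y_test, classes=[0, 1, 2])
aucs = {}
for k in range(3):
    fpr, tpr, _ = roc_curve(y_bin[:, k], proba[:, k])
    aucs[k] = auc(fpr, tpr)
    print(f"AUC for class {k}: {aucs[k]:.2f}")
```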
Plotted ROC
From the ROC curve plots below, we can see that, in terms of ROC and AUC, the random forest
performs better than the decision tree model: the average AUC for the random forest is 0.92, while the
average AUC for the decision tree is 0.84. We can also observe from the decision tree's ROC plot that the
curves for Absenteeism time between zero and six and Absenteeism time more than six are farther from
the ideal value of 1. Since an area under the curve of one is the best achievable result, the Random Forest
model performs better than the Decision Tree model in terms of ROC and AUC.
Combining the confusion matrix with the area under the curve, we can see that although we did
not correctly predict all data points where Absenteeism time equals zero, the AUC for that class is one.
This is because the ROC curve only considers false positive and true positive rates, and even in the
Random Forest model we have zero false-positive predictions for that class; therefore we get a perfect
AUC score for Absenteeism time equals zero.
Figure 5
ROC with Random Forest

Figure 6
ROC with Decision Tree
AUC for Absenteeism time equals zero: 1.0
AUC for Absenteeism time between zero and six: 0.76
AUC for Absenteeism time more than six: 0.76
Confusion Matrix
In the confusion matrix, we list the prediction outcomes for each class, so we can see the
prediction result per class. From the confusion matrix we can also calculate precision and recall scores
and gain more insight into how to fine-tune the model.
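A sketch of computing the confusion matrix plus per-class precision and recall, again on synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split 80:20 with seed 12345.
X, y = make_classification(n_samples=660, n_features=18,
                           n_informative=8, n_classes=3,
                           random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

model = RandomForestClassifier(random_state=12345).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))  # per-class precision/recall
```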
As the two confusion matrices below show, the Decision Tree performs better at predicting
Absenteeism time equals zero, while the Random Forest model performs better at predicting Absenteeism
time between zero and six and Absenteeism time more than six.
Table 5
Confusion Matrix for Random Forest

                                    Predicted Classes
Actual Classes                      h = 0   0 < h <= 6   h > 6
Absenteeism time between 0 and 6      0        22          6
Absenteeism time more than 6          0         2         27

Table 6
Confusion Matrix for Decision Tree

                                    Predicted Classes
Actual Classes                      h = 0   0 < h <= 6   h > 6
Absenteeism time between 0 and 6      0        27         11
Absenteeism time more than 6          0        10         19

(h = Absenteeism time in hours.)
Conclusion
In conclusion, after considering the ROC curves and area under the curve as well as the confusion
matrices, we find that the Random Forest model performs better than the Decision Tree model: it gives a
better AUC score and a lower false-prediction rate on the test dataset.
Table 7
RMSE for Different Models
Model Name RMSE
Lasso Regression 16.72
Ridge Regression 16.71
Random Forest 16.73
Xgboost 17.38
References
1) L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct. 2001.
2) Chen, T., & Guestrin, C. (2016). XGBoost. Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. doi:10.1145/2939672.2939785
3) A Survey on Decision Tree Algorithms of Classification in Data Mining. (2016). International
Journal of Science and Research (IJSR), 5(4), 2094-2097. doi:10.21275/v5i4.nov162954
4) Will Kenton (2019, Jun 4). Absenteeism. Retrieved from
https://www.investopedia.com/terms/a/absenteeism.asp
5) Andrea Martiniano, Ricardo Pinto Ferreira, and Renato Jose Sassi. Absenteeism at work.
Retrieved from https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work
6) Ojo Olawale (2019, Aug 27). Exploration of Absenteeism with Machine Learning. Retrieved
from https://medium.com/@ojoolawalejulius2016/exploration-of-
absenteeism-with-machine-learning-1f01a8f9357e
7) Parker Oakes (2019, Feb 24). Using Machine Learning to Discover Employee Absenteeism
Reasons. Retrieved from https://rpubs.com/alanoakes/EmployeeAbsenteeism