Project Employee Absenteeism
EMPLOYEE ABSENTEEISM
MAYANK AGARWAL
22-11-2018
Page 1 of 33
Contents
1. Introduction
1.1 Problem Statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3
1.2 Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3
1.3 Sample Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4
1.4 Unique Count. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5
2. Methodology
2.1 Pre – Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6
2.2 Missing Value Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6
2.3 Outlier Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Distribution of Continuous variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Distribution of Categorical variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Feature Selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12
2.7 Feature Scaling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12
2.8 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13
3. Modelling
3.1 Model Selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14
3.2 Decision Tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Random Forest. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15
3.4 Linear Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15
4. Conclusion
4.1 Model Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17
4.2 Model Selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17
4.3 Solution of Problem Statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5. Appendix
5.1 Figures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .22
6. R code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Chapter 1: Introduction
2. How much loss can we project for each month in 2011 if the same trend of absenteeism continues?
1.2 Variables
There are 21 variables in our data, of which 20 are independent variables and 1 (Absenteeism time in hours) is the dependent variable. Since the target variable is continuous, this is a regression problem.
Variable Information:
XXI Factors influencing health status and contact with health services.
And 7 categories without ICD (CID): patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).
3. Month of absence
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
6. Transportation expense
8. Service time
9. Age
13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
18. Weight
19. Height
Fig 1.3 – First five rows of data
Chapter 2: Methodology
2.3 Outlier Analysis
It can be observed from the distribution of variables that almost none of them are normally distributed. The skew in these distributions can be explained by the presence of outliers and extreme values in the data. One of the pre-processing steps is the detection and removal of such outliers. In this project, we use boxplots to visualize and remove outliers: any value lying outside the lower and upper whiskers of the boxplot is treated as an outlier.
All variables except Distance from residence to work, Weight and Body mass index contain outliers.
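The boxplot whisker rule described above can be sketched in Python (one of the project's implementation languages). This is an illustrative sketch, not the project's actual code; the 1.5 × IQR fences are the standard boxplot convention, and the sample values are made up:

```python
import numpy as np

def find_outliers(values):
    """Flag values outside the boxplot whiskers (1.5 * IQR rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return (values < lower) | (values > upper)

ages = np.array([27, 30, 33, 36, 38, 40, 41, 43, 90])  # 90 is an extreme value
mask = find_outliers(ages)
print(ages[mask])  # -> [90]
```

Only the extreme value falls outside the whiskers; the bulk of the distribution is untouched.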
Page 7 of 33
Imputing outlier values:
The outlier values detected by the boxplots are first converted to NA. These missing values are then imputed using the KNN imputation method.
2.4 Distribution of Continuous variables
Histograms are used to observe the distribution of the continuous variables. By looking at these distributions, it can be observed that the variables are not normally distributed.
Fig 2.4 – Distribution of Continuous variables using Histogram
Employees who are social drinkers have more absent hours than those who do not drink.
Employees having zero, one or two children have more absent hours.
Employees with ID number 3 and 28 are absent the most.
Employees are absent the most on Mondays and the least on Thursdays.
Reasons 23 and 28 are the reasons employees give most often for being absent.
Employees who have completed only high school education are absent more than others.
Employees are absent the most in the month of March.
Fig 2.5 – Distribution of Categorical variables using Bar graph
2.6 Feature Selection
Feature Selection reduces the complexity of a model and makes it easier to interpret. It also reduces
overfitting. Features are selected based on their scores in various statistical tests for their correlation
with the outcome variable. A correlation plot is used to check for multicollinearity between variables; highly collinear variables are dropped before the model is trained.
From the correlation analysis we found that Weight and Body Mass Index have a high correlation (> 0.7), so we excluded the Body Mass Index column.
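The correlation-threshold step can be sketched in Python with pandas. The toy frame below is constructed so that Weight and Body.mass.index are strongly related by design; the column names mirror the report's variables, but the data are synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic frame: BMI is a noisy linear function of Weight, Age is independent
rng = np.random.default_rng(0)
weight = rng.normal(70, 10, 200)
df = pd.DataFrame({
    "Weight": weight,
    "Body.mass.index": weight / 3.0 + rng.normal(0, 0.5, 200),
    "Age": rng.normal(36, 6, 200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)  # the highly collinear column(s)
```

For each correlated pair, only the second column of the pair is dropped, so one representative of the information always survives.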
2.7 Feature Scaling
Since our data is not normally distributed, we use normalization as the feature scaling method.
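Normalization here means min-max scaling, which maps every variable to the [0, 1] range. A minimal sketch (the expense values are illustrative, not taken from the data set):

```python
import numpy as np

def normalize(x):
    """Min-max normalization: rescale values linearly to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

expense = np.array([118, 179, 225, 260, 361])
scaled = normalize(expense)
print(scaled.min(), scaled.max())  # 0.0 1.0
```

After scaling, the smallest observed value maps to 0 and the largest to 1, so no variable dominates distance-based computations purely because of its units.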
2.8 Principal Component Analysis (PCA)
After applying PCA and observing the cumulative scree plot, it can be observed that almost 95% of the variance in the data is explained by the first 45 components out of 116. Hence, we choose only these 45 components as input to the models.
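The component-count selection can be sketched with scikit-learn: fit PCA, take the cumulative sum of the explained-variance ratios, and keep the smallest number of components reaching 95%. The synthetic data below has 3 latent factors by construction, so the selection should recover a small component count:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 correlated features driven by 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + rng.normal(scale=0.1, size=(300, 10))

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components explaining at least 95% of the variance
n_components = int(np.searchsorted(cum_var, 0.95) + 1)
X_reduced = PCA(n_components=n_components).fit_transform(X)
print(n_components, X_reduced.shape)
```

In the report's case the same procedure on the cumulative scree plot yields 45 components out of 116.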
Chapter 3: Modelling
3.2 Decision Tree
Fig 3.2 – Plot of actual values vs predicted values for Decision Tree
The RMSE and R² values obtained for the Decision Tree model in R are:
R: RMSE = 0.442, R² = 0.978
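The two evaluation metrics used throughout this chapter can be computed from their definitions; a small sketch with made-up true/predicted values (not the project's predictions):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared residual."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

y_true = [2.0, 4.0, 8.0, 16.0]
y_pred = [2.5, 3.5, 8.5, 15.5]
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

Every residual here is 0.5, so the RMSE is exactly 0.5, while R² is close to 1 because the residuals are tiny relative to the spread of the true values.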
3.3 Random Forest
Random Forest is a supervised learning algorithm that builds multiple decision trees and merges them to obtain a more accurate and stable prediction. It can be used for both classification and regression problems. The method of combining trees is known as an ensemble method: weak learners (the individual trees) are combined to produce a strong learner.
The number of decision trees used for prediction in the forest is 500.
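A Random Forest regressor with 500 trees, matching the report's setting, can be sketched with scikit-learn. The data below is synthetic and stands in for the PCA-transformed features; the train/test split sizes are arbitrary:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data standing in for the PCA-transformed features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2.0 + X[:, 1] + rng.normal(scale=0.1, size=200)

# 500 trees, as in the report; each tree is a weak learner, the forest averages them
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X[:150], y[:150])
preds = rf.predict(X[150:])
print(preds.shape)  # one prediction per held-out row
```

Averaging over many decorrelated trees is what makes the forest's predictions more stable than a single decision tree's.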
Fig 3.3 – Plot of actual values vs predicted values for Random Forest
The RMSE and R² values obtained for the Random Forest model in R are:
R: RMSE = 0.480, R² = 0.978
3.4 Linear Regression
The RMSE and R² values obtained for the Linear Regression model in R are:
R: RMSE = 0.003, R² = 0.9999
Fig 3.5 – Plot of actual values vs predicted values for Linear Regression
Chapter 4: Conclusion
Solution:
a. It can be observed that employees with only a high school education tend to be absent more than others. The company can either hire employees who have at least graduated from college, or ask employees with only a high school education to reduce their absent hours.
b. Employees with ID 3, 28 and 34 are among the employees who are absent the most. The company may warn such employees to reduce their absences and, if the behaviour continues, take action against them if necessary.
c. The reasons employees give most often for being absent are reasons 13, 20, 23 and 28. These include medical consultation, dental consultation, external causes of morbidity and mortality, and diseases of the musculoskeletal system and connective tissue. Company XYZ can help keep employees healthier by holding monthly on-campus health consultations.
d. Employees who are social drinkers tend to be absent more than those who do not drink. XYZ can keep track of these employees and advise them to reduce their alcohol intake on working days.
Fig 4.3.4 – Plot of Social Drinker vs Absent Hours
e. Employees are absent the most on Mondays (1426 absent hours) and Tuesdays (1322.4 absent hours). XYZ can encourage employees not to take as many absent hours on these days.
f. Employees are mostly absent during the spring season.
g. Employees with at most two children, or none at all, are absent the most.
4.3.2 How much loss can we project for each month in 2011 if the same trend of absenteeism continues?
Solution:
Considering the losses to be the absenteeism time in hours, if the same trend of absenteeism continues, then the total loss per month is as shown in the graph below.
Employees are absent the most in March, with total absenteeism equal to 458.2 hours, and the least in January, with total absenteeism equal to 173.6 hours.
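The per-month totals behind the projection are a simple group-and-sum over the absence records. A Python sketch with made-up records (the column names mirror the data set, the values do not):

```python
import pandas as pd

# Illustrative records: month of each absence and the hours lost
records = pd.DataFrame({
    "Month.of.absence": [1, 1, 3, 3, 3, 7, 12],
    "Absenteeism.time.in.hours": [2.0, 4.0, 8.0, 3.0, 5.0, 1.0, 6.0],
})

# Total projected loss per month, assuming the same trend continues
monthly = records.groupby("Month.of.absence")["Absenteeism.time.in.hours"].sum()
print(monthly.idxmax(), monthly.max())  # month with the highest projected loss
```

Applied to the actual data, the same aggregation gives the March (458.2 hours) and January (173.6 hours) figures quoted above.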
Chapter 5: Appendix
5.1 Figures
Fig 2.3.1 – Boxplots of continuous variables with outliers
Fig 2.4 – Distribution of Continuous variables using Histogram
Fig 2.5 – Distribution of Categorical variables using Bar graph
Fig 2.6 – Correlation plot of Continuous variables
Fig 3.2 – Plot of actual values vs predicted values for Decision Tree
Fig 3.3 – Plot of actual values vs predicted values for Random Forest
Fig 3.5 – Plot of actual values vs predicted values for Linear Regression
Chapter 6: R code
#Structure of variables
str(emp_absent)
# Convert missing-value counts to percentages of the total row count
missing_values$Miss_perc = (missing_values$Miss_perc / nrow(emp_absent)) * 100
# Set a known value to NA, then impute it with KNN (k = 5) to validate the imputation
df[["Body.mass.index"]][3] = NA
df = knnImputation(data = df, k = 5)
gridExtra::grid.arrange(bar1,bar2,bar3,bar4,ncol=2)
gridExtra::grid.arrange(bar5,bar6,bar7,ncol=2)
hist2 = ggplot(data = numeric_data, aes(x =Height)) + ggtitle("Distribution of Height") +
geom_histogram(bins = 25)
gridExtra::grid.arrange(hist1,hist2,hist3,hist4,ncol=2)
#########################OUTLIER ANALYSIS#########################
#Check if any missing values
sum(is.na(df))
#############################FEATURE SELECTION#############################
#Variable Reduction
df = subset.data.frame(df, select = -c(Body.mass.index))
#########################FEATURE SCALING#########################
#Normality check
hist(df$Absenteeism.time.in.hours)
rmExcept(keepers = c("df","emp_absent"))
#Compute variance
pr_var = pr_stdev^2
#Proportion of variance explained
prop_var = pr_var/sum(pr_var)
# From the above plot selecting 45 components since it explains almost 95+ % data variance
train.data =train.data[,1:45]
########################DECISION TREE########################
###########################RANDOM FOREST###########################
# Plot actual values (green, dashed) vs predicted values (blue)
plot(test$Absenteeism.time.in.hours, type = "l", lty = 2, col = "green")
lines(rf_predictions, col = "blue")
###########################LINEAR REGRESSION###########################