Assignment Report - Group A
Assignment Report
Submitted to
Prof. Dinesh Kumar
Executive Summary
Data Pre-processing
Exploratory Data Analysis
Logistic Regression
Decision Tree
Random Forest
Sampling Techniques
Final Recommendations
Exhibit 1: Distribution of explanatory variables
2|Page
Executive Summary
Since 2010, Kramerica Industries had been facing a high attrition rate, which was leading to
high costs in talent management and acquisition. In addition, each time an employee left,
there were indirect costs due to the impact on sales and gaps in knowledge transfer.
Bob Sacamano, head of talent management, was searching for a way to reduce attrition and
the associated costs with the help of machine learning algorithms.
Employee data across different departments was collected for two years, 2014 and 2015,
covering a number of variables that could impact the attrition rate. The aim was to predict
the probability of an employee leaving the organisation.
The dataset consisted of data for more than 16,000 past and present employees of the
organisation. Initially, the data was explored using summary statistics, and each variable's
relation with the dependent variable (attrition) was studied. After rigorous inspection,
certain variables were identified that should not be used in the model for making the
prediction. Hypothesis tests were then run on the remaining variables to understand their
significance in predicting attrition.
With the help of these variables, a logistic regression model was built, which achieved an
overall accuracy of 91.6% on the testing data. It was then observed that, due to imbalance in
the data, this model was predicting 0's with a lower error rate than 1's. So, with the help of
Youden's Index, an optimal cut-off was determined to increase the model's sensitivity. Next,
a decision tree model was built, which provided an overall accuracy of 85.1% on the testing
data and identified Ratio Difference as the primary variable for classification.
Random Forest was used among the ensemble techniques, increasing the accuracy to 86.9%
on the testing data. Random Forest identified Ratio Difference, Change in CTC and
Leaves in 2015 as the top three important variables.
Based on these three models, it was concluded that the logistic model would be the most
accurate and effective in deployment. Also, to tackle the problem of imbalance in the data,
various sampling techniques were employed, of which stratified sampling proved to be the best.
Data Pre-processing
The following measures were taken to avoid inconsistency issues in the data.
a) Removing entries for employees who left before 2014 and in 2016.
The provided dataset contains data for 2014 and 2015. Hence, for the employees who left the
organization before 2014, the data was inconsistent for various variables such as Leaves in
2014-15, CTC in 2014-15 and so on, so those entries were deleted. There were also 9 entries
for people who left in 2016; these entries were deleted as well, as their number was very
small compared to the whole dataset.
Exploratory Data Analysis
1. Identify variables that should not be used for building ML model among the data provided in
the spreadsheet. Justify your answer (2 points)
The following variables should not be used, for the reasons mentioned below:

Apart from the above-mentioned variables, we also need to statistically test all the variables
for their significance in building the model. For continuous variables we conducted ANOVA,
while for categorical variables we performed the chi-square test of independence. Below is a
summary of the tests, along with comments on whether each variable is significant at the 5%
significance level.
Variable               p-value     Result
Employment type        9.66e-55    Significant
Confirmation status    8.77e-55    Significant
Working status         0.0         Significant

Table 1.2: Chi-square test of independence for categorical variables
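The chi-square tests above were run in Rattle; the mechanics can be sketched in plain Python on a hypothetical 2x2 contingency table (the counts below are illustrative, not taken from the actual dataset):

```python
# Minimal sketch of a chi-square test of independence on a hypothetical
# 2x2 table (rows: levels of a categorical variable, columns: attrition 0/1).

def chi_square_statistic(table):
    """Return the chi-square statistic for a 2D contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

table = [[900, 100],   # e.g. confirmed employees: stayed, left (illustrative)
         [300, 200]]   # e.g. unconfirmed employees: stayed, left (illustrative)
stat = chi_square_statistic(table)
# For a 2x2 table, df = 1; the 5% critical value is about 3.84.
print(stat > 3.84)  # True -> reject independence at the 5% level
```

In practice the extremely small p-values in Table 1.2 correspond to chi-square statistics far above the critical value.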
Logistic Regression
2. Build a Logistic Regression Model on the given data. Perform diagnostic tests, interpret the
results, and comment on the accuracy of the model on the test data. (2 Points)
Ans. We generated the first logistic regression model with almost all the variables, except
those deemed unnecessary from the exploratory data analysis. Then, based on the p-values and
significance levels of Wald's test (with a significance cut-off of 0.05), performed in Rattle,
we found the following variables, with which we built the final model:
a. EmployeeCurrentAge
b. EmployeeAgeDuringDOJ
c. PriorWorkExp
d. Leavesin2015
e. Ratioofleaves2015
f. AutoParts_Department
g. None_DegreeType
All of these variables were significant under Wald's test. The results of these tests are given
in the Rattle models attached with the report.
According to the coefficient estimates, the chance of attrition increases with variables that
have positive coefficients, such as RatioOfLeaves2015 and EmployeeCurrentAge. The most
impactful of these is RatioOfLeaves2015, defined as the total number of leaves in 2015 divided
by the number of working days of an employee in 2015. The chance of attrition decreases with
an increase in EmployeeAgeDuringDOJ, PriorWorkExp, AutoParts and EducationLevelNone.
So people who joined at a higher age or with more prior work experience tend not to leave the
organisation, and the same goes for people working in the Auto Parts department and employees
who are neither undergraduates, graduates nor postgraduates.
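A coefficient's effect on the chance of attrition can be read as an odds ratio. A minimal sketch, using a hypothetical coefficient value (not the actual model estimate):

```python
import math

# Illustrative sketch: converting a logistic-regression coefficient into an
# odds ratio. The coefficient value below is hypothetical, not the fitted one.
beta_ratio_of_leaves = 2.0  # assumed coefficient for RatioOfLeaves2015

# A positive coefficient means the odds of attrition are multiplied by
# exp(beta) for a one-unit increase in the predictor.
odds_ratio = math.exp(beta_ratio_of_leaves)
print(round(odds_ratio, 3))  # 7.389
```

A negative coefficient (e.g. for PriorWorkExp) would give an odds ratio below 1, i.e. lower odds of attrition as the variable increases.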
The error matrices for the above model on the training and testing data are shown below
(cut-off probability = 0.5):
Training:

            Predicted
Actual      0        1       Error
0           3727     2       0.1
            (72.4)   (0.0)
So, on the training data, the sensitivity and specificity are 99.9% and 70.8% respectively,
the precision is 99.8%, and the F-score is 99.84%.
Testing:

            Predicted
Actual      0        1       Error
0           1636     1       0.1
            (74.1)   (0.0)
So, on the testing data, the sensitivity and specificity are 99.9% and 67.6% respectively,
the precision is 99.74%, and the F-score is 99.82%.
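The four metrics quoted above follow mechanically from the cells of an error matrix. A sketch with illustrative counts (not the actual confusion matrix):

```python
# Sketch of the error-matrix metrics: sensitivity, specificity, precision
# and F-score computed from hypothetical confusion-matrix counts.

def classification_metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, f_score

# Hypothetical counts: 80 true positives, 20 false negatives,
# 10 false positives, 890 true negatives.
sens, spec, prec, f1 = classification_metrics(80, 20, 10, 890)
print(round(sens, 3), round(spec, 3), round(prec, 3), round(f1, 3))
```

With imbalanced data, the F-score is more informative than overall accuracy because it ignores the abundant true negatives.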
The ROC Curve and the lift charts for the training and testing data are shown below:
ROC:
Training:
Testing:
For the testing data, the Area Under the ROC Curve (AUC) is 89%, signifying that, for a
randomly selected pair of positive and negative observations, the probability of ranking
them correctly is 0.89.
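This pairwise interpretation of the AUC can be sketched directly: count, over all positive/negative pairs, how often the positive example gets the higher predicted score (scores below are illustrative, not model outputs):

```python
# Sketch of the pairwise interpretation of AUC: the fraction of
# positive/negative pairs where the positive example scores higher
# (ties count as half).

def pairwise_auc(pos_scores, neg_scores):
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.6]        # hypothetical scores for employees who left
neg = [0.7, 0.3, 0.2, 0.1]   # hypothetical scores for employees who stayed
print(pairwise_auc(pos, neg))
```

This equals the area under the ROC curve, which is why an AUC of 0.89 reads as "89% of random positive/negative pairs are ranked correctly".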
Lift
Training:
Testing:
Calculation of Youden’s index for optimal cut-off probability:
[Figure: Youden's Index (y-axis) vs. cut-off probability (x-axis)]
We found the specificity and sensitivity values for different cut-off probabilities from the
observed and predicted probabilities. We then calculated Youden's Index for each cut-off
(Youden's Index = sensitivity + specificity - 1) and found that the optimal cut-off
probability is 0.25, at which the index reaches its highest value of 0.719.
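The search described above can be sketched as follows; the scores, labels and candidate cut-offs are illustrative, not the actual model outputs:

```python
# Sketch of the Youden's-index search: scan candidate cut-off probabilities,
# compute sensitivity + specificity - 1 at each, keep the best cut-off.

def youden_best_cutoff(scores, labels, cutoffs):
    best_cut, best_j = None, -1.0
    for c in cutoffs:
        preds = [1 if s >= c else 0 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
        fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
        tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
        fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        j = sens + spec - 1
        if j > best_j:
            best_cut, best_j = c, j
    return best_cut, best_j

scores = [0.05, 0.15, 0.2, 0.3, 0.45, 0.6, 0.8, 0.9]  # illustrative scores
labels = [0,    0,    0,   1,   0,    1,   1,   1]    # illustrative labels
cut, j = youden_best_cutoff(scores, labels, [0.1, 0.25, 0.5, 0.75])
print(cut, round(j, 3))
```

Lowering the cut-off below 0.5 trades some specificity for sensitivity, which is why it helps on imbalanced data.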
Decision Tree
Question 3: Construct a simple decision tree (classification tree) classifier. Use the
decision tree to generate at least two features. Check whether the newly derived features
have a statistically significant relationship with the outcome variable. (2 Points)
We built the decision tree using the variables that were found significant in Question 1.
In the resulting tree, the variables used were 'RatioDifference', 'changeCTC', 'Confirmed',
'Leaves.in.2015' and 'leaves.prev.year'.
[Decision tree diagram; root split: Ratio.Difference >= 0.105]
Rule number 6: 44 [Attrition=0 cover=251 (5%) prob=0.31]
In the decision tree model, Ratio.Difference and changeCTC were found to be the most
significant features.
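The tree's split points suggest derived binary features. A sketch of deriving two such indicators: the 0.105 threshold is the root split shown above, while the 3% changeCTC threshold is taken from the later discussion and is illustrative here:

```python
# Sketch of deriving binary features from the decision tree's split points.
# Field names and the 3% changeCTC threshold are illustrative assumptions.

def derive_features(rows):
    """rows: list of dicts with 'ratio_difference' and 'change_ctc' keys."""
    for r in rows:
        # 1 if the leave-ratio difference crosses the root-split threshold
        r['high_leave_ratio_diff'] = int(r['ratio_difference'] >= 0.105)
        # 1 if the employee's CTC grew by less than 3%
        r['low_ctc_change'] = int(r['change_ctc'] < 0.03)
    return rows

sample = [{'ratio_difference': 0.2,  'change_ctc': 0.01},
          {'ratio_difference': 0.05, 'change_ctc': 0.10}]
out = derive_features(sample)
print(out[0]['high_leave_ratio_diff'], out[1]['low_ctc_change'])
```

The significance of each derived binary feature against attrition can then be checked with the chi-square test of independence, as in Question 1.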
The ROC curves and lift charts for the training and testing data are shown below:
ROC:
Training:
AUC = 0.7873
Testing:
AUC = 0.8086
Lift:
Training:
Testing:
The error matrices for the above model are shown below:

Training:

            Predicted
Actual      0        1       Error
0           3692     37      1.0
            (71.7)   (0.7)
Testing:

            Predicted
Actual      0        1       Error
0           1620     17      1.0
            (73.4)   (0.8)
              Training   Testing
Sensitivity   43.87%     45.35%
Specificity   99.00%     98.96%
Precision     94.39%     93.84%
F-Score       59.90      61.15
Random Forest
Question 4: Develop models based on ensemble methods. What insights can you get based on
ensemble methods? Did the ensemble method improve accuracy? Rank variables based on their
importance.
Ans. We generated a Random Forest model with almost all the variables, except those deemed
unnecessary from the exploratory data analysis. Then, based on the p-values and significance
levels of Wald's test, which was performed in Rattle, we found the following variables, with
which we built the final model:
1. Ratio.of.leaves.2014
2. Gender
3. RM.Reportee.Female
4. Ratio.of.leaves.2015
5. RM.Age
6. joining.bonus
7. Avg.Weighted.Performance
8. age_diff
9. Department
10. emp_manage
11. changeCTC
12. leaves_current_year
13. RM.Reportee.Male
14. Leaves.in.2015
15. leaves_prev_year
16. RM.Reportees.Count
17. Prior.Work.Exp
18. Employee.Age.During.DOJ
19. f_to_m
20. Marital.Status
21. Working.in.Native.Place
22. Start.Date.2015
All these variables passed the significance test; the results are explained in Question 1.
The error matrices for the above model for training and testing data are shown below:
Training:

            Predicted
Actual      0        1       Error
0           3721     0       0.0
            (72.4)   (0.0)
1           0        1418    0.0
            (0.0)    (27.6)
So, on the training data, the sensitivity and specificity are both 100%, the precision is
100%, and the F-score is 100%.
Testing:

            Predicted
Actual      0        1       Error
0           789      17      2.1
            (71.7)   (1.5)
1           127      167     43.2
            (11.5)   (15.2)
So, on the testing data, the sensitivity and specificity are 56.8% and 97.9% respectively,
the precision is 90.8%, and the F-score is 69.9%.
Overall, we can see a slight decrease in accuracy with Random Forest compared to logistic
regression; however, it is higher than that of the decision tree.
ROC:
The ROC Curve and the lift charts for the training and testing data are shown below:
Training:
Testing:
For the training data, the Area Under the ROC Curve (AUC) is 100%, and for the testing data
it is 87%, signifying that for 87% of randomly selected pairs of positive and negative
observations from the testing data, the positive observation receives a higher predicted
probability than the negative one.
Lift:
Training:
Testing:
Variable Importance
From the model we obtain the Mean Decrease in Accuracy and the Mean Decrease in Gini, which
are used to rank variables based on their importance.
Rank   Variable                  Mean Decrease Accuracy   Mean Decrease Gini
16     age_diff                  4.42                     61.75
17     emp_manage                3.32                     2.41
18     RM.Reportee.Female        2.87                     27.41
19     Gender                    2.48                     5.00
20     joining.bonus             1.01                     0.11
21     Start.Date.2015           0.00                     0.00
22     Working.in.Native.Place   -0.81                    7.87
Sampling Techniques
Question 5: Based on previous questions, what kind of modelling problems would Julia expect
when the classes are not represented adequately (imbalanced data)? Suggest ways to handle
these problems (2 points).
Problem with imbalanced data: since most machine learning algorithms are designed to improve
accuracy by reducing error, the model will typically do so by developing a bias towards the
majority class. Accuracy as a measure of model performance therefore becomes misleading, and
other measures such as precision, recall and F-score have to be considered instead. Such an
imbalance also forces a stronger trade-off between precision and recall.
For solving imbalance problems, one way is to use the F-score or specificity as the model
selection criterion. Ideally, though, industrial practice suggests adopting techniques such as
undersampling, oversampling, SMOTE, and stratified sampling to remove imbalance from the data.
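Of the techniques listed, random undersampling is the simplest: shrink the majority class to the size of the minority class. A minimal sketch on illustrative data (not the actual employee records):

```python
import random

# Minimal sketch of random undersampling to balance classes.
# 'data' is a list of (features, label) pairs; values are illustrative.

def undersample(data, seed=0):
    rng = random.Random(seed)
    minority = [d for d in data if d[1] == 1]
    majority = [d for d in data if d[1] == 0]
    kept = rng.sample(majority, len(minority))  # shrink majority class
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced

# 90 majority-class rows, 10 minority-class rows (illustrative).
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(10)]
balanced = undersample(data)
print(len(balanced), sum(label for _, label in balanced))
```

The trade-off is that undersampling discards majority-class information; oversampling and SMOTE instead enlarge the minority class.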
Question 6: Use various sampling techniques that are best suited for the data based on model
accuracy (4 Points).
From the previous analysis, we concluded that logistic regression gave the best F-score on the
test data out of all the techniques mentioned; therefore, we shall use that model for
predictions. As given in the table above, for logistic regression, stratified sampling works
best, giving 93% overall model accuracy. Therefore, the company should use this technique
while predicting attrition.
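The stratified sampling used here can be sketched as a split that samples within each class, so the attrition rate is preserved in both partitions. Data and the 70/30 split ratio below are illustrative:

```python
import random

# Sketch of a stratified train/test split: shuffle and split each class
# separately so both partitions keep the original class proportions.

def stratified_split(data, train_frac=0.7, seed=0):
    rng = random.Random(seed)
    train, test = [], []
    for label in {d[1] for d in data}:
        group = [d for d in data if d[1] == label]
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# 80 stayers and 20 leavers (illustrative 20% attrition rate).
data = [(i, 0) for i in range(80)] + [(i, 1) for i in range(20)]
train, test = stratified_split(data)
print(len(train), sum(l for _, l in train), len(test), sum(l for _, l in test))
```

Both partitions end up with the same 20% positive rate, which keeps the test-set metrics representative.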
Final Recommendations
Question 7: Based on the different model results, what would be your final recommendation
to Kramerica Industries? (4 points)
The coefficients obtained from the logistic regression model for the significant variables are
given below:
From the table above, we can conclude that the attrition problem is more significant in other
departments of the company than in Auto Parts. Older employees and those having a long tenure
at Kramerica have a greater tendency to leave the organisation, and the behaviour worsens for
people with higher prior work experience. Interestingly, from the decision tree, we can see
that people experiencing a change in CTC of more than 3% have a lower chance of leaving the
company.
From the above insights, it may be concluded that the company is not providing enough salary
hikes and growth opportunities for employees in senior positions, and is therefore
experiencing attrition there. We would suggest that the company revise its workforce strategy
for senior employees to control the problem.
Appendix
We plotted histograms for continuous variables and box plots across classes for categorical
variables to visually validate whether each given variable is significant.