Assignment Report - Group A
Assignment Report
Submitted to
Prof. Dinesh Kumar
Executive Summary
Data Pre-processing
Exploratory Data Analysis
Logistic Regression
Decision Tree
Random Forest
Sampling Techniques
Final Recommendations
Exhibit 1: Distribution of explanatory variables
2|Page
Executive Summary
Since 2010, Kramerica Industries had been facing a high attrition rate, which was leading to
high costs in talent management and acquisition. In addition, each time an employee left,
there were indirect costs due to the impact on sales and gaps in knowledge transfer.
Bob Sacamano, head of talent management, was searching for a way to reduce attrition and
the associated costs with the help of machine learning algorithms.
Employee data across different departments was collected for two years, 2014 and 2015,
covering a number of variables that could impact the attrition rate. The aim was to predict
the probability of an employee leaving the organisation.
The dataset consisted of data for more than 16,000 past and present employees of the
organisation. Initially, the data was explored using summary statistics, and each variable's
relation with the dependent variable (attrition) was studied. After rigorous inspection,
certain variables were identified that should not be used in the model for making the
prediction. Hypothesis tests were then run on the remaining variables to understand their
significance in predicting attrition.
With the help of these variables, a logistic regression model was built, which achieved an
overall accuracy of 91.6% on the testing data. It was then observed that, due to imbalance in
the data, this model was predicting 0's with a lower error rate than 1's. So, with the help of
Youden's Index, an optimal cut-off was determined to increase the model's sensitivity. Next,
a decision tree model was built, which provided an overall accuracy of 85.1% on the testing
data and identified Ratio Difference as the primary variable for classification.
Random Forest was used among the ensemble techniques, increasing the accuracy to 86.9%
on the testing data. Random Forest identified Ratio Difference, Change in CTC and
Leaves in 2015 as the top three important variables.
Based on these three models, it was concluded that the logistic model would be the most
accurate and effective in deployment. Also, to tackle the problem of imbalance in the data,
various sampling techniques were employed, of which stratified sampling proved to be the best.
Data Pre-processing
The following measures were taken to avoid inconsistency issues in the data.
a) Removing entries for employees who left before 2014 and in 2016.
The provided dataset contains data for 2014 and 2015. Hence, for the employees who left the
organization before 2014, the data was inconsistent for various variables such as Leaves in
2014-15, CTC in 2014-15 and so on, so those entries were deleted. There were also 9 entries
for people who left in 2016; these entries were deleted as well, as their number was very
small compared to the whole dataset.
Exploratory Data Analysis
1. Identify variables that should not be used for building ML model among the data provided in
the spreadsheet. Justify your answer (2 points)
The following variables should not be used, for the reasons mentioned below:

Apart from the above-mentioned variables, we also need to statistically test all the variables
for their significance in building the model. For continuous variables we conducted ANOVA,
while for categorical variables we performed the chi-square test of independence. Below is a
summary of the tests, along with comments on whether each variable is significant at the 5%
significance level.
Variable               p-value     Result
Employment type        9.66e-55    Significant
Confirmation status    8.77e-55    Significant
Working status         0.0         Significant

Table 1.2: Chi-square test of independence for categorical variables
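The chi-square tests above were run in Rattle; the mechanics can be sketched in plain Python on a hypothetical 2x2 contingency table (the counts below are illustrative, not taken from the actual dataset):

```python
# Minimal sketch of a chi-square test of independence on a hypothetical
# 2x2 table (rows: levels of a categorical variable, columns: attrition 0/1).

def chi_square_statistic(table):
    """Return the chi-square statistic for a 2D contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

table = [[900, 100],   # e.g. confirmed employees: stayed, left (illustrative)
         [300, 200]]   # e.g. unconfirmed employees: stayed, left (illustrative)
stat = chi_square_statistic(table)
# For a 2x2 table, df = 1; the 5% critical value is about 3.84.
print(stat > 3.84)  # True -> reject independence at the 5% level
```

In practice the extremely small p-values in Table 1.2 correspond to chi-square statistics far above the critical value.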
Logistic Regression
2. Build a Logistic Regression Model on the given data. Perform diagnostic tests, interpret the
results, and comment on the accuracy of the model on the test data. (2 Points)
Ans. We generated the first logistic regression model with almost all the variables, except
those deemed unnecessary from the exploratory data analysis. Then, based on the p-values and
significance levels of Wald's test (with a significance cut-off of 0.05), performed in Rattle,
we found the following variables, with which we built the final model:
a. EmployeeCurrentAge
b. EmployeeAgeDuringDOJ
c. PriorWorkExp
d. Leavesin2015
e. Ratioofleaves2015
f. AutoParts_Department
g. None_DegreeType
All of these variables were significant under Wald's test. The results of these tests are given
in the Rattle models attached with the report.
According to the coefficient estimates, the chance of attrition increases with variables that
have positive coefficients, such as RatioOfLeaves2015 and EmployeeCurrentAge. The most
impactful of these is RatioOfLeaves2015, defined as the total number of leaves in 2015 divided
by the number of working days of an employee in 2015. The chance of attrition decreases with
an increase in EmployeeAgeDuringDOJ, PriorWorkExp, AutoParts and EducationLevelNone.
So people who joined at a higher age or with more prior work experience tend not to leave the
organisation, and the same goes for people working in the Auto Parts department and employees
who are neither undergraduates, graduates nor postgraduates.
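A coefficient's effect on the chance of attrition can be read as an odds ratio. A minimal sketch, using a hypothetical coefficient value (not the actual model estimate):

```python
import math

# Illustrative sketch: converting a logistic-regression coefficient into an
# odds ratio. The coefficient value below is hypothetical, not the fitted one.
beta_ratio_of_leaves = 2.0  # assumed coefficient for RatioOfLeaves2015

# A positive coefficient means the odds of attrition are multiplied by
# exp(beta) for a one-unit increase in the predictor.
odds_ratio = math.exp(beta_ratio_of_leaves)
print(round(odds_ratio, 3))  # 7.389
```

A negative coefficient (e.g. for PriorWorkExp) would give an odds ratio below 1, i.e. lower odds of attrition as the variable increases.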
The error matrices for the above model on the training and testing data are shown below
(cut-off probability = 0.5):
Training:

            Predicted
Actual      0        1       Error
0           3727     2       0.1
            (72.4)   (0.0)
So, on the training data, the sensitivity and specificity are 99.9% and 70.8% respectively,
the precision is 99.8%, and the F-score is 99.84%.
Testing:

            Predicted
Actual      0        1       Error
0           1636     1       0.1
            (74.1)   (0.0)
So, on the testing data, the sensitivity and specificity are 99.9% and 67.6% respectively,
the precision is 99.74%, and the F-score is 99.82%.
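The four metrics quoted above follow mechanically from the cells of an error matrix. A sketch with illustrative counts (not the actual confusion matrix):

```python
# Sketch of the error-matrix metrics: sensitivity, specificity, precision
# and F-score computed from hypothetical confusion-matrix counts.

def classification_metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, f_score

# Hypothetical counts: 80 true positives, 20 false negatives,
# 10 false positives, 890 true negatives.
sens, spec, prec, f1 = classification_metrics(80, 20, 10, 890)
print(round(sens, 3), round(spec, 3), round(prec, 3), round(f1, 3))
```

With imbalanced data, the F-score is more informative than overall accuracy because it ignores the abundant true negatives.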
The ROC Curve and the lift charts for the training and testing data are shown below:
ROC:
Training:
Testing:
For the testing data, the Area Under the ROC Curve (AUC) is 89%, signifying that, for a
randomly selected pair of positive and negative observations, the probability of ranking
them correctly is 0.89.
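This pairwise interpretation of the AUC can be sketched directly: count, over all positive/negative pairs, how often the positive example gets the higher predicted score (scores below are illustrative, not model outputs):

```python
# Sketch of the pairwise interpretation of AUC: the fraction of
# positive/negative pairs where the positive example scores higher
# (ties count as half).

def pairwise_auc(pos_scores, neg_scores):
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.6]        # hypothetical scores for employees who left
neg = [0.7, 0.3, 0.2, 0.1]   # hypothetical scores for employees who stayed
print(pairwise_auc(pos, neg))
```

This equals the area under the ROC curve, which is why an AUC of 0.89 reads as "89% of random positive/negative pairs are ranked correctly".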
Lift
Training:
Testing:
Calculation of Youden’s index for optimal cut-off probability:
[Figure: Youden's Index (y-axis) vs. cut-off probability (x-axis)]
We found the specificity and sensitivity values for different cut-off probabilities from the
observed and predicted probabilities. We then calculated Youden's Index for each cut-off
(Youden's Index = sensitivity + specificity - 1) and found that the optimal cut-off
probability is 0.25, at which the index reaches its highest value of 0.719.
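The search described above can be sketched as follows; the scores, labels and candidate cut-offs are illustrative, not the actual model outputs:

```python
# Sketch of the Youden's-index search: scan candidate cut-off probabilities,
# compute sensitivity + specificity - 1 at each, keep the best cut-off.

def youden_best_cutoff(scores, labels, cutoffs):
    best_cut, best_j = None, -1.0
    for c in cutoffs:
        preds = [1 if s >= c else 0 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
        fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
        tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
        fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        j = sens + spec - 1
        if j > best_j:
            best_cut, best_j = c, j
    return best_cut, best_j

scores = [0.05, 0.15, 0.2, 0.3, 0.45, 0.6, 0.8, 0.9]  # illustrative scores
labels = [0,    0,    0,   1,   0,    1,   1,   1]    # illustrative labels
cut, j = youden_best_cutoff(scores, labels, [0.1, 0.25, 0.5, 0.75])
print(cut, round(j, 3))
```

Lowering the cut-off below 0.5 trades some specificity for sensitivity, which is why it helps on imbalanced data.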
Decision Tree
Question 3: Construct a simple decision tree (classification tree) classifier. Use the
decision tree to generate at least two features. Check whether the newly derived features
have a statistically significant relationship with the outcome variable. (2 Points)
We built the decision tree using the variables that were found significant in Question 1.
In the resulting tree, the variables used were 'RatioDifference', 'changeCTC', 'Confirmed',
'Leaves.in.2015' and 'leaves.prev.year'.
[Decision tree diagram; root split: Ratio.Difference >= 0.105]
Rule number 6: 44 [Attrition=0 cover=251 (5%) prob=0.31]
In the decision tree model, Ratio.Difference and changeCTC were found to be the most
significant features.
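The tree's split points suggest derived binary features. A sketch of deriving two such indicators: the 0.105 threshold is the root split shown above, while the 3% changeCTC threshold is taken from the later discussion and is illustrative here:

```python
# Sketch of deriving binary features from the decision tree's split points.
# Field names and the 3% changeCTC threshold are illustrative assumptions.

def derive_features(rows):
    """rows: list of dicts with 'ratio_difference' and 'change_ctc' keys."""
    for r in rows:
        # 1 if the leave-ratio difference crosses the root-split threshold
        r['high_leave_ratio_diff'] = int(r['ratio_difference'] >= 0.105)
        # 1 if the employee's CTC grew by less than 3%
        r['low_ctc_change'] = int(r['change_ctc'] < 0.03)
    return rows

sample = [{'ratio_difference': 0.2,  'change_ctc': 0.01},
          {'ratio_difference': 0.05, 'change_ctc': 0.10}]
out = derive_features(sample)
print(out[0]['high_leave_ratio_diff'], out[1]['low_ctc_change'])
```

The significance of each derived binary feature against attrition can then be checked with the chi-square test of independence, as in Question 1.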
The ROC curves and lift charts for the training and testing data are shown below:
ROC:
Training:
AUC = 0.7873
Testing:
AUC = 0.8086
Lift:
Training:
Testing:
The error matrices for the above model are shown below:

Training:

            Predicted
Actual      0        1       Error
0           3692     37      1.0
            (71.7)   (0.7)
Testing:

            Predicted
Actual      0        1       Error
0           1620     17      1.0
            (73.4)   (0.8)
              Training   Testing
Sensitivity   43.87%     45.35%
Specificity   99.00%     98.96%
Precision     94.39%     93.84%
F-Score       59.90      61.15
Random Forest
Question 4: Develop models based on ensemble methods. What insights can you get based on
ensemble methods? Did the ensemble method improve accuracy? Rank variables based on their
importance.
Ans. We generated a Random Forest model with almost all the variables, except those deemed
unnecessary from the exploratory data analysis. Then, based on the p-values and significance
levels of Wald's test, which was performed in Rattle, we found the following variables, with
which we built the final model:
1. Ratio.of.leaves.2014
2. Gender
3. RM.Reportee.Female
4. Ratio.of.leaves.2015
5. RM.Age
6. joining.bonus
7. Avg.Weighted.Performance
8. age_diff
9. Department
10. emp_manage
11. changeCTC
12. leaves_current_year
13. RM.Reportee.Male
14. Leaves.in.2015
15. leaves_prev_year
16. RM.Reportees.Count
17. Prior.Work.Exp
18. Employee.Age.During.DOJ
19. f_to_m
20. Marital.Status
21. Working.in.Native.Place
22. Start.Date.2015
All these variables passed the significance test; the results are explained in Question 1.
The error matrices for the above model for training and testing data are shown below:
Training:

            Predicted
Actual      0        1       Error
0           3721     0       0.0
            (72.4)   (0.0)
1           0        1418    0.0
            (0.0)    (27.6)
So, on the training data, the sensitivity and specificity are both 100%, the precision is
100%, and the F-score is 100%.
Testing:

            Predicted
Actual      0        1       Error
0           789      17      2.1
            (71.7)   (1.5)
1           127      167     43.2
            (11.5)   (15.2)
So, on the testing data, the sensitivity and specificity are 56.8% and 97.9% respectively,
the precision is 90.8%, and the F-score is 69.9%.
Overall, we can see a slight decrease in accuracy with Random Forest compared to logistic
regression; however, it is higher than that of the decision tree.
ROC:
The ROC Curve and the lift charts for the training and testing data are shown below:
Training:
Testing:
For the training data, the Area Under the ROC Curve (AUC) is 100%, and for the testing data
it is 87%, signifying that for 87% of randomly selected pairs of positive and negative
observations from the testing data, the positive observation receives a higher predicted
probability than the negative one.
Lift:
Training:
Testing:
Variable Importance
From the model we obtain the Mean Decrease in Accuracy and the Mean Decrease in Gini, which
are used to rank variables based on their importance.
Rank   Variable                  Mean Decrease Accuracy   Mean Decrease Gini
16     age_diff                  4.42                     61.75
17     emp_manage                3.32                     2.41
18     RM.Reportee.Female        2.87                     27.41
19     Gender                    2.48                     5.00
20     joining.bonus             1.01                     0.11
21     Start.Date.2015           0.00                     0.00
22     Working.in.Native.Place   -0.81                    7.87
Sampling Techniques
Question 5: Based on previous questions, what kind of modelling problems would Julia expect
when the classes are not represented adequately (imbalanced data)? Suggest ways to handle
these problems (2 points).
Problem with imbalanced data: since most machine learning algorithms are designed to improve
accuracy by reducing error, the model will typically do so by developing a bias towards the
majority class. Accuracy as a measure of model performance therefore becomes misleading, and
other measures such as precision, recall and F-score have to be considered instead. Such an
imbalance also forces a stronger trade-off between precision and recall.
For solving imbalance problems, one way is to use the F-score or specificity as the model
selection criterion. Ideally, though, industrial practice suggests adopting techniques such as
undersampling, oversampling, SMOTE, and stratified sampling to remove imbalance from the data.
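Of the techniques listed, random undersampling is the simplest: shrink the majority class to the size of the minority class. A minimal sketch on illustrative data (not the actual employee records):

```python
import random

# Minimal sketch of random undersampling to balance classes.
# 'data' is a list of (features, label) pairs; values are illustrative.

def undersample(data, seed=0):
    rng = random.Random(seed)
    minority = [d for d in data if d[1] == 1]
    majority = [d for d in data if d[1] == 0]
    kept = rng.sample(majority, len(minority))  # shrink majority class
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced

# 90 majority-class rows, 10 minority-class rows (illustrative).
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(10)]
balanced = undersample(data)
print(len(balanced), sum(label for _, label in balanced))
```

The trade-off is that undersampling discards majority-class information; oversampling and SMOTE instead enlarge the minority class.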
Question 6: Use various sampling techniques that are best suited for the data based on model
accuracy (4 Points).
From the previous analysis, we concluded that logistic regression gave the best F-score on the
test data out of all the techniques mentioned; therefore, we shall use that model for
predictions. As given in the table above, for logistic regression, stratified sampling works
best, giving 93% overall model accuracy. Therefore, the company should use this technique
while predicting attrition.
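The stratified sampling used here can be sketched as a split that samples within each class, so the attrition rate is preserved in both partitions. Data and the 70/30 split ratio below are illustrative:

```python
import random

# Sketch of a stratified train/test split: shuffle and split each class
# separately so both partitions keep the original class proportions.

def stratified_split(data, train_frac=0.7, seed=0):
    rng = random.Random(seed)
    train, test = [], []
    for label in {d[1] for d in data}:
        group = [d for d in data if d[1] == label]
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# 80 stayers and 20 leavers (illustrative 20% attrition rate).
data = [(i, 0) for i in range(80)] + [(i, 1) for i in range(20)]
train, test = stratified_split(data)
print(len(train), sum(l for _, l in train), len(test), sum(l for _, l in test))
```

Both partitions end up with the same 20% positive rate, which keeps the test-set metrics representative.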
Final Recommendations
Question 7: Based on the different model results, what would be your final recommendation
to Kramerica Industries? (4 points)
The coefficients obtained from the logistic regression model for the significant variables are
given below:
From the table above, we can conclude that the attrition problem is more significant in other
departments of the company than in Auto Parts. Older employees and those having a long tenure
at Kramerica have a greater tendency to leave the organisation, and the behaviour worsens for
people with higher prior work experience. Interestingly, from the decision tree, we can see
that people experiencing a change in CTC of more than 3% have a lower chance of leaving the
company.
From the above insights, it may be concluded that the company is not providing enough salary
hikes and growth opportunities for employees in senior positions, and is therefore
experiencing attrition there. We would suggest that the company revise its workforce strategy
for senior employees to control the problem.
Appendix
We plotted histograms for continuous variables and box plots across classes for categorical
variables to visually validate whether each given variable is significant.