Employee Turnover
We'll be covering:
• Descriptive Analytics - What happened?
• Predictive Analytics - What might happen?
• Prescriptive Analytics - What should we do?
Objective:
• To understand what factors contributed most to employee turnover.
• To perform clustering of Employees who left based on their satisfaction and
evaluation
• To create a model that predicts the likelihood that a given employee will leave the
company.
• To create or improve different retention strategies on targeted employees.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('HR_comma_sep.csv')
df['salary'].head()
0       low
1    medium
2    medium
3       low
4       low
Name: salary, dtype: object
# Rename Columns
# Renaming certain columns for better readability
df = df.rename(columns={'satisfaction_level': 'satisfaction',
'last_evaluation': 'evaluation',
'number_project': 'projectCount',
'average_montly_hours': 'averageMonthlyHours',
'time_spend_company': 'yearsAtCompany',
'Work_accident': 'workAccident',
'promotion_last_5years': 'promotion',
'sales' : 'department',
'left' : 'turnover'
})
df.shape
(14999, 10)
df.head()
round(df.turnover.value_counts(1), 2)
0 0.76
1 0.24
Name: turnover, dtype: float64
plt.figure(figsize=(12,8))
turnover = df.turnover.value_counts()
sns.barplot(y=turnover.values, x=turnover.index, alpha=0.6)
plt.title('Distribution of Employee Turnover')
plt.xlabel('Employee Turnover', fontsize=16)
plt.ylabel('Count', fontsize=16);
1. Perform a data quality check by looking for missing values
# Check whether there are any missing values in the data set
df.isnull().any()
satisfaction False
evaluation False
projectCount False
averageMonthlyHours False
yearsAtCompany False
workAccident False
turnover False
promotion False
department False
salary False
dtype: bool
# Check the type of our features. Are there any data inconsistencies?
df.dtypes
satisfaction float64
evaluation float64
projectCount int64
averageMonthlyHours int64
yearsAtCompany int64
workAccident int64
turnover int64
promotion int64
department object
salary object
dtype: object
# Summary statistics (mean and standard deviation) grouped by turnover status
turnover_Summary = df.groupby('turnover')
round(turnover_Summary.mean(numeric_only=True), 2)
round(turnover_Summary.std(numeric_only=True), 2)
Correlation Matrix
# Create a correlation matrix. What features correlate the most with
# turnover? What other correlations did you find?
corr = df.corr(numeric_only=True)  # numeric_only skips the string columns
corr
                     turnover  promotion
satisfaction        -0.388375   0.025605
evaluation           0.006567  -0.008684
projectCount         0.023787  -0.006064
averageMonthlyHours  0.071287  -0.003544
yearsAtCompany       0.144822   0.067433
workAccident        -0.154622   0.039245
turnover             1.000000  -0.061788
promotion           -0.061788   1.000000
(truncated: only the turnover and promotion columns of the full matrix are shown)
plt.figure(figsize=(15,10))
sns.heatmap(corr, xticklabels=corr.columns.values,
yticklabels=corr.columns.values, annot=True)
plt.title('Heatmap of Correlation Matrix');
Distribution of Satisfaction, Evaluation, and Monthly Hours
# Plot the distribution of Employee Satisfaction, Evaluation, and
# Project Count. What story can you tell?
• More than half of the employees with 2, 6, and 7 projects left the company
• The majority of the employees who did not leave the company had 3, 4, and 5 projects
• All of the employees with 7 projects left the company
• The employee turnover rate increases as project count increases
3. Perform clustering of Employees who left based on their satisfaction and evaluation
# Import KMeans Model
from sklearn.cluster import KMeans
plt.show();
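The clustering step itself is not shown above; a sketch of how K-Means could be applied to the satisfaction/evaluation pairs of employees who left (synthetic points stand in for `df[df.turnover == 1][['satisfaction', 'evaluation']]`):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic satisfaction/evaluation pairs mimicking the three groups the
# notebook finds among leavers; replace with the real leaver subset.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal([0.1, 0.9], 0.05, (50, 2)),   # unhappy high performers
               rng.normal([0.4, 0.5], 0.05, (50, 2)),   # disengaged average performers
               rng.normal([0.8, 0.9], 0.05, (50, 2))])  # satisfied high performers

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_            # cluster assignment for each employee
centers = kmeans.cluster_centers_  # (satisfaction, evaluation) centroids
```

Plotting the points colored by `labels` (with `centers` overlaid) reproduces the cluster chart the `plt.show()` above refers to.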
Pre-processing
new_df.head()
(truncated preview: the one-hot encoded DataFrame with department_* and salary_*
dummy columns alongside the numeric features)
[5 rows x 21 columns]
new_df.shape
(14999, 21)
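The step that produced `new_df` is not shown; it is presumably a one-hot encoding of the two string columns. A sketch with a two-row stand-in frame (on the real data, 8 numeric columns plus 10 department and 3 salary dummies give the 21 columns reported above):

```python
import pandas as pd

# Minimal stand-in: the notebook's df has `department` and `salary` as the
# only non-numeric columns.
df = pd.DataFrame({'satisfaction': [0.38, 0.80],
                   'turnover': [1, 1],
                   'department': ['sales', 'support'],
                   'salary': ['low', 'medium']})

# One-hot encode the categorical columns; numeric columns pass through unchanged.
new_df = pd.get_dummies(df, columns=['department', 'salary'])
```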
Let's split our data into a train and test set. We'll fit our model on the train set and hold
out the test set for final evaluation.
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score,
                             confusion_matrix, precision_recall_curve)
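The split itself is not shown; a sketch using a stratified 80/20 split (the `stratify` argument keeps the 76/24 class balance in both halves, and the toy arrays stand in for the notebook's features and `turnover` target):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and imbalanced target (76% stayed / 24% left),
# standing in for new_df.drop('turnover', axis=1) and new_df.turnover.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 76 + [1] * 24)

# Stratifying on y preserves the class ratio in both train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)
```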
print(X_train.shape)
print(X_test.shape)
(11999, 20)
(3000, 20)
0 0.76
1 0.24
Name: turnover, dtype: float64
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr = lr.fit(x_train_sm, y_train_sm)
lr
LogisticRegression()
0.7877848483907197
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf = rf.fit(x_train_sm, y_train_sm)
rf
RandomForestClassifier()
0.9805593654003257
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()
gbc = gbc.fit(x_train_sm, y_train_sm)
gbc
GradientBoostingClassifier()
0.9580470716647115
from sklearn.metrics import roc_curve, auc

plt.figure(figsize=(15,12))
# Plot one ROC curve per fitted model so the legend has labeled entries
for name, model in [('Logistic Regression', lr),
                    ('Random Forest', rf),
                    ('Gradient Boosting', gbc)]:
    fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    plt.plot(fpr, tpr, label='{} (AUC = {:.2f})'.format(name, auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show();
# Confusion Matrix for Logistic Regression
confusion_matrix(y_test, lr.predict(X_test))
array([[1720, 566],
[ 153, 561]])
array([[2220, 66],
[ 43, 671]])
array([[2261, 25],
[ 16, 698]])
Recall or Precision?
It depends on how much cost/weight you want on your two types of errors: (1) False
Positives or (2) False Negatives
We want our machine learning model to capture as much of the minority class (the
turnover group) as possible. Our objective is to catch all of the highly probable turnover
employees, at the risk of flagging some low-risk non-turnover employees.
• Consider employee turnover domain where an employee is given treatment by
Human Resources because they think the employee will leave the company within a
month, but the employee actually does not. This is a false positive. This mistake
could be expensive, inconvenient, and time consuming for both the Human
Resources and employee, but is a good investment for relational growth.
• Compare this with the opposite error, where Human Resources does not give
treatment/incentives to the employees and they do leave. This is a false negative.
This type of error is more detrimental because the company lost an employee,
which could lead to great setbacks and more money to rehire.
• Depending on these errors, different costs are weighed based on the type of
employee being treated. For example, if it’s a high-salary employee then would we
need a costlier form of treatment? What if it’s a low-salary employee? The cost for
each error is different and should be weighed accordingly.
7a. Using the best model, predict the probability of employee turnover in the test data
Retention Plan
# Predicted probabilities of [staying, leaving] for ten test-set employees
rf.predict_proba(X_test)[175:185].round(2)
array([[0.96, 0.04],
[1. , 0. ],
[1. , 0. ],
[0. , 1. ],
[0. , 1. ],
[0.84, 0.16],
[0.81, 0.19],
[0.01, 0.99],
[0.18, 0.82],
[0.99, 0.01]])
list(rf.predict_proba(X_test)[175:185, 1])
[0.04, 0.0, 0.0, 1.0, 1.0, 0.16, 0.19, 0.99, 0.82, 0.01]
# Flag employees whose probability of leaving exceeds 0.5
list(rf.predict_proba(X_test)[175:185, 1] > 0.5)
[False, False, False, True, True, False, False, True, True, False]
Since this model is being used for people, we should refrain from relying solely on the
output of our model. Instead, we can use its probability output and design our own system
to treat each employee accordingly.
1. Safe Zone (Green) – Employees within this zone are considered safe.
2. Low Risk Zone (Yellow) – Employees within this zone are to be considered for
potential turnover. This is more of a long-term track.
3. Medium Risk Zone (Orange) – Employees within this zone are at risk of turnover.
Action should be taken and monitored accordingly.
4. High Risk Zone (Red) – Employees within this zone are considered to have the
highest chance of turnover. Action should be taken immediately.
Safe Zone (Green)
• No Action required
Low Risk Zone (Yellow)
• Action to be taken on long term basis
• Apply group interventions
• HR to track demographic data for these individuals to see if the risk profiles are
changing or if the equation needs to be altered
Medium Risk Zone (Orange)
• Action to be taken on medium term basis
• HR to keep a close watch on the behavioral status to change from "Medium" to
"High" risk. HR to analyze demographic data to identify high risk supervisors and
point them to the BU heads
• Apply group interventions
High Risk Zone (Red)
• Action to be taken on immediate basis
• HR to send list to the concerned managers for immediate action
• HR to validate the risks for consistency with the identified clusters
• Managers to have one-to-one conversations with the identified employees
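The four zones above can be implemented as a simple probability-bucketing helper; the thresholds here are illustrative assumptions, not values derived from the model:

```python
def risk_zone(p_turnover):
    """Map a predicted turnover probability to one of the four retention zones.

    The cut-offs (0.25 / 0.50 / 0.75) are illustrative; tune them against
    the business cost of false positives versus false negatives.
    """
    if p_turnover < 0.25:
        return 'Safe Zone (Green)'
    elif p_turnover < 0.50:
        return 'Low Risk Zone (Yellow)'
    elif p_turnover < 0.75:
        return 'Medium Risk Zone (Orange)'
    return 'High Risk Zone (Red)'
```

For example, an employee scored at 0.82 falls in the High Risk Zone, so the concerned manager would be notified immediately.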
Conclusion
What to Optimize
Binary Classification: Turnover vs. Non-Turnover
Instance Scoring: Likelihood of employee responding to an offer/incentive to save them
from leaving.
Need for Application: Save employees from leaving
In our employee retention problem, rather than simply predicting whether an employee
will leave the company within a certain time frame, we would much rather have an
estimate of the probability that he/she will leave the company.
Solution 1:
• We can rank employees by their probability of leaving, then allocate a limited
incentive budget to the highest probability instances.
• OR, we can allocate our incentive budget to the instances with the highest expected
loss, for which we'll need the probability of turnover.
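Solution 1's ranking step can be sketched in a few lines; the employee IDs and probabilities here are hypothetical stand-ins for `rf.predict_proba(X_test)[:, 1]`:

```python
import numpy as np

# Hypothetical turnover probabilities for five employees
employee_ids = np.array([101, 102, 103, 104, 105])
p_leave = np.array([0.04, 0.82, 0.16, 0.99, 0.37])

# Rank employees by probability of leaving, highest first, then spend the
# limited incentive budget from the top of the list down.
order = np.argsort(p_leave)[::-1]
ranked = list(zip(employee_ids[order], p_leave[order]))
budget_for = ranked[:2]  # e.g. the budget only covers the top two employees
```

Weighting each probability by the cost of losing that employee turns this into the expected-loss allocation described in the second bullet.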
Solution 2:
• Develop learning programs for managers, then use analytics to gauge their
performance and measure progress.
• Be a good coach. Empower the team and do not micromanage
• Express interest in team members' success
• Have clear vision / strategy for team
• Help team with career development
Selection Bias
• One thing to note about this dataset is the turnover feature. We don't know whether
the employees that left were interns, contractors, full-time, or part-time. These are
important variables to take into consideration when applying a machine learning
algorithm to the data.
• Another thing to note is the bias in the evaluation feature. Evaluation is heavily
subjective and can vary tremendously depending on who the evaluator is. If the
employee knows the evaluator, then he/she will probably have a higher score.