
INDIRA GANDHI DELHI TECHNICAL UNIVERSITY FOR WOMEN

Department of Information Technology

Master of Computer Applications (MCA)

Internship Project Report On

Employee Attrition Prediction


Conducted by the Centre of Excellence-AI, IGDTUW, Delhi
(5th June – 16th July, 2022)

Submitted to:                              Submitted by:

Ms. Charu Gupta                            Chandra Kiran Verma (01304092021)
IT Dept, IGDTUW                            Tisha Chopra (06604092021)
                                           MCA 2nd Year, IGDTUW
ACKNOWLEDGEMENT
We would like to express our gratitude to the AI-ML club of Indira Gandhi Delhi Technical
University for Women for giving us the wonderful opportunity to work on this project,
Employee Attrition Prediction, with proper guidance and workshops. We express our deepest
thanks to Dr. Ritu Jangra for giving us the necessary advice and guidance. Through sessions
conducted by industry professionals from various sectors, including agriculture, healthcare,
data science and more, we were also exposed to an industrial perspective on applications of
AI and ML and how AI is revolutionising business processes. We will strive to use the skills
and knowledge we gained in the best possible way, and will continue to work on their
improvement in order to attain our desired career objectives. Overall, there was a lot of
learning in each session, for which we are really thankful to the AI-ML club.

Chandra Kiran Verma (01304092021)

Tisha Chopra (06604092021)


CERTIFICATE
ABSTRACT
Employee attrition refers to reduction in the number of employees or staff members in an
organisation. It occurs when an employee leaves and isn’t replaced at all or for a significant
amount of time, resulting in a reduction of the workforce.

This project aims to predict the rate of employee attrition using Logistic Regression, Random
Forest classifier and Decision Tree. We use work-life balance, employee performance, standard
working hours and the number of years spent in the company, among other attributes, as our
features. The results of this research showed the superiority of Logistic Regression in terms of
accuracy and predictive effectiveness, as measured by the ROC curve.
Considering all the factors, we achieved a maximum accuracy of 88.66% using Logistic
Regression. The results of this research demonstrate that the Logistic Regression classifier is a
superior algorithm in terms of significantly higher accuracy, relatively low runtime and a good
F1-score. For these reasons it is recommended to use logistic regression for accurately
predicting employee turnover, giving organisations the insight needed to take necessary action.
INDEX

SNO. CONTENTS
I. Introduction
II. Literature review
III. Research Methodology
III A. Decision Tree
III B. Logistic Regression
III C. Random Forest
IV. Data Collection
V. Data Analysis
V A. Data pre-processing
V B. Feature selection
V C. Model Validation
VI. Results
VII. Discussion and Future Work
VIII. Conclusion
IX. References
I. INTRODUCTION

Employee attrition is a major problem faced by all organizations. Whenever a well-trained
and well-adapted employee leaves the organization, it creates a vacuum. Retaining employees
is a critical and ongoing effort. If the situation is not handled properly, it can lead to a decrease
in productivity, and the efficiency of work is hampered to a large extent. The higher the
attrition rate, the more costs the organization incurs to recruit, induct, place and train new
employees. The organization may have to employ new people and train them on the tools
being used, which is time consuming.

Many factors, such as organization size, location, policies and procedures, also have an impact
on employee attrition. This project involves a comparative study of various algorithms using
model evaluation metrics such as accuracy and F-measure. The data is imported from Kaggle
in the form of a CSV ("Comma-Separated Values") file, a format for saving tabular data, such
as spreadsheets, that is convenient for large datasets and easy to load into programs.

This project gives a brief overview of the employee turnover problem, the importance of
solving it, and the work done by others using machine learning algorithms to solve it. It
explores three different machine learning algorithms, namely decision tree, logistic regression
and random forest classifier, and outlines the experimental method employed in terms of the
features used, the pre-processing, and the metrics used to compare the algorithms. It presents
the results of the comparison and a discussion of them, along with possible future work, and
concludes by recommending Logistic Regression as an approach to the employee attrition
prediction problem.
II. LITERATURE REVIEW

Attrition is the gradual reduction in the number of employees through retirement, resignation
or demise. It is also referred to as employee turnover or employee defection. Most literature
on employee attrition categorizes it as either voluntary or involuntary. Involuntary attrition is
attributed to shortcomings on the employee's part and refers to the organization dismissing the
employee for various reasons. Voluntary attrition is when a worker leaves an organization of
their own free will, for example by leaving a current job for a new job elsewhere or by retiring.
It was found that the strongest predictors of voluntary turnover were age, pay, tenure, overall
job satisfaction, and the employee's perception of fairness. Other similar research findings
suggested that personal or demographic variables, specifically age, gender, ethnicity, education,
and marital status, were important factors in the prediction of voluntary employee turnover.
Other studies showed that several further features, such as working conditions, job satisfaction,
and growth potential, also contributed to voluntary attrition. Employees are a crucial resource
for any organization, and hence the withdrawal of productive employees can affect an
organization in various respects.

Some of the consequences of employee attrition are the cost of staffing and training new
employees, an increased burden on existing employees, and a decline in the performance of
the organization. We therefore tackle this problem by applying machine learning techniques
to predict turnover, giving organizations the insight needed to take necessary action.
III. Research Methodology
Supervised learning is a subcategory of machine learning and artificial intelligence. It is
defined by its use of labelled datasets to train algorithms to classify data or predict outcomes
accurately. This section outlines the theory behind each machine learning algorithm used. The
dataset contains information about employees such as monthly income, total working years,
gender, nature of work, position, education and salary. Overall, there are 35 features recorded
in the dataset. However, not all features were essential or useful for our analysis. For example,
standard weekly hours were not suitable because all records had the same value, so such
features were discarded from the analysis. Once the features to keep have been selected, the
data selection and cleansing step is complete.
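As an illustration, constant-valued columns such as standard weekly hours can be detected and dropped with a short pandas check. This is only a sketch; the file name is an assumption and is not taken from the report's own code.

```python
import pandas as pd

df = pd.read_csv("employee_attrition.csv")  # hypothetical local copy of the Kaggle CSV

# Columns with a single unique value (e.g. StandardHours) carry no
# information for the model and can be dropped.
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=constant_cols)
print("Dropped constant columns:", constant_cols)
```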

A. Decision Tree
Decision tree classifiers are regarded as among the most well-known methods of data
classification. A decision tree is a flowchart-like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test, and each leaf
node (terminal node) holds a class label.

Conventionally, a decision tree is used for making Boolean decisions, in which the splitting
power of an attribute is computed as its information gain, which in turn is computed as its
entropy reduction. Decision tree classifiers are valued for the interpretable view they give of
the decision process. Because of their strong precision, optimised splitting criteria and tree
pruning techniques, algorithms such as ID3, C4.5, CART, CHAID and QUEST are widely
used decision tree classifiers.

Fig 1

In the case of pursuing Boolean decisions, the entropy of a data set is computed as:

H(Set) = −P1 × log2(P1) − P2 × log2(P2)

where P1 is the proportion of the first decision and P2 is the proportion of the second decision.

The information gain of an attribute A is computed as:

Gain(A) = H(Set) − (w1 × H(a1) + w2 × H(a2) + ... + wm × H(am))

where a1, a2, ..., am are the different values of attribute A, and w1, w2, ..., wm are the weights
of the subsets split by using the values of attribute A.
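To make these formulas concrete, a minimal Python sketch is shown below. The column names ("OverTime", "Attrition") and file name are assumptions used only for illustration, not taken from the report's own code.

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """H(Set) = -sum over classes of P_i * log2(P_i)."""
    proportions = labels.value_counts(normalize=True)
    return float(-(proportions * np.log2(proportions)).sum())

def information_gain(df: pd.DataFrame, attribute: str, target: str) -> float:
    """Gain(A) = H(Set) - sum_j w_j * H(subset where A == a_j),
    for a categorical attribute A."""
    weighted_entropy = 0.0
    for _, subset in df.groupby(attribute):
        weight = len(subset) / len(df)              # w_j
        weighted_entropy += weight * entropy(subset[target])
    return entropy(df[target]) - weighted_entropy

# Hypothetical usage on the attrition dataset:
# df = pd.read_csv("employee_attrition.csv")
# print(information_gain(df, "OverTime", "Attrition"))
```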

B. Logistic Regression
Logistic regression is a supervised classification algorithm. In a classification problem, the
target variable (or output), y, can take only discrete values for a given set of features (or
inputs), X. It is often used with regularisation in the form of penalties based on the L1-norm
or L2-norm to avoid over-fitting. An L2-regularised logistic regression is used in this paper.
This technique obtains the posterior probabilities by assuming a model for them and estimating
the parameters involved in the assumed model.

Fig 2
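A minimal sketch of an L2-regularised logistic regression with scikit-learn is given below. The file name, the 70/30 split and the hyperparameter values are assumptions for illustration, not the exact settings used in the report.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("employee_attrition.csv")  # hypothetical local copy of the Kaggle CSV
X = pd.get_dummies(df.drop(columns=["Attrition"]), drop_first=True)
y = (df["Attrition"] == "Yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# penalty="l2" applies the L2 regularisation; C is the inverse regularisation strength.
model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```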

C. Random Forest
This algorithm is a popular tree-based ensemble learning technique. In the ensemble-learning
approach, a single prediction model is created by combining more than one classifier. The
type of ensembling used here is bagging. In bagging, successive trees do not depend on earlier
trees: each is independently constructed using a different bootstrap sample of the data set.
In the end, a simple majority vote is taken for the prediction. Random forests differ from
standard trees in that, for the latter, each node is split using the best split among all variables,
whereas in a random forest each node is split using the best among a subset of predictors
randomly chosen at that node. This additional layer of randomness makes the method robust
against overfitting. A random forest is an ensemble of decision trees and is therefore expected
to perform better and give higher accuracy.

Fig 3
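A sketch of such a bagged forest with per-node random feature subsets, using scikit-learn and reusing the X_train/X_test split from the previous sketch (the hyperparameter values are illustrative assumptions):

```python
from sklearn.ensemble import RandomForestClassifier

# bootstrap=True: each tree is grown on a different bootstrap sample (bagging).
# max_features="sqrt": each split considers a random subset of the predictors.
forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",
    bootstrap=True,
    random_state=42,
)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)  # majority vote across the trees
```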
IV. Data Collection
The data was pulled from a sample dataset on Kaggle in the form of a CSV file. There were
around 1470 observations, each observation corresponding to an employee, with 35 different
attributes: Business Travel, Daily Rate, Department, Distance From Home, Marital Status,
Monthly Income, Number of Companies Worked, Over18, Over Time, Percent Salary Hike,
Performance Rating, Relationship Satisfaction, Standard Hours, Stock Option Level, Education
Field, Environment Satisfaction, Gender, Hourly Rate, Job Involvement, Job Level, Job Role,
Job Satisfaction, Total Working Years, Training Times Last Year, Work-Life Balance, Years
At Company, Years In Current Role, Years Since Last Promotion and Years With Current
Manager. The dataset included various important features such as overtime, standard working
hours, total working years, years spent in the company and years since the last promotion.
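A sketch of loading and inspecting the data with pandas (the file name is an assumption; the CSV would first be downloaded from Kaggle):

```python
import pandas as pd

df = pd.read_csv("employee_attrition.csv")  # hypothetical local file name
print(df.shape)                  # roughly (1470, 35) for this dataset
print(df.columns.tolist())       # the 35 attributes listed below
print(df["Attrition"].value_counts())
```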

The factors that are included in this dataset are the following:

1. Age

2. Attrition: Yes/No

3. BusinessTravel: Non-Travel / Travel-Rarely / Travel-Frequently

4. DailyRate: The daily rate paid to the employee

5. Department: There are 3 departments, namely Human Resources, Research & Development and Sales

6. DistanceFromHome

7. Education: The level of education, coded as 1 ‘Below College’, 2 ‘College’, 3 ‘Bachelor’, 4 ‘Master’, 5 ‘Doctor’

8. EducationField: The education field of the employee, among Human Resources, Life Sciences, Marketing, Medical, Other and Technical Degree

9. EmployeeCount: The number of employees in the record

10. EmployeeNumber: A unique code for each employee

11. EnvironmentSatisfaction: The environment satisfaction rating given by the employee, as 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’

12. Gender: Male/Female

13. HourlyRate: The hourly rate paid to the employee

14. JobInvolvement: The degree to which the employee is involved in the job, given as 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’

15. JobLevel: The job hierarchy level of the employee, rated as 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’

16. JobRole: The job role of the employee, among Healthcare Representative, Human Resources, Laboratory Technician, Manager, Manufacturing Director, Research Director, Research Scientist, Sales Executive and Sales Representative

17. JobSatisfaction: The job satisfaction rating of the employee, given as 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’

18. MaritalStatus: Whether the employee is Married, Divorced or Single

19. MonthlyIncome: The monthly income of the employee in Rs

20. MonthlyRate: The monthly rate of the employee paid by the company in Rs

21. NumCompaniesWorked: The total number of companies the employee has worked for

22. Over18: Whether the employee is above 18 years of age

23. OverTime: Whether the employee does overtime

24. PercentSalaryHike: The percentage increase in salary given to the employee

25. PerformanceRating: The performance rating given to the employee, as 1 ‘Low’, 2 ‘Good’, 3 ‘Excellent’, 4 ‘Outstanding’

26. RelationshipSatisfaction: The relationship satisfaction rating given by the employee, as 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’

27. StandardHours: The standard hours for an employee per week

28. StockOptionLevel: The stock option level of the employee

29. TotalWorkingYears: The total working years of the employee

30. TrainingTimesLastYear: The number of times the employee was given training last year

31. WorkLifeBalance: The work-life balance rating given by the employee, as 1 ‘Bad’, 2 ‘Good’, 3 ‘Better’, 4 ‘Best’

32. YearsAtCompany: The total number of years the employee has been at the company

33. YearsInCurrentRole: The total number of years the employee has been in the current role

34. YearsSinceLastPromotion: The total number of years since the last promotion

35. YearsWithCurrManager: The total number of years the employee has been with the current manager


V. Data Analysis

A. Data pre-processing

The process of transforming raw data into a format suitable for modelling is called data
pre-processing. Real-world data is often incomplete and inconsistent.

It involves the following steps: getting the dataset, importing libraries, importing the dataset,
handling missing data, encoding categorical data, splitting the dataset into training and test
sets, and feature scaling.
Heat map representing the employee attrition data:

A heat map is a two-dimensional representation of data in which values are represented by
colours. A simple heat map provides an immediate visual summary of information.
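A correlation heat map of the numeric columns can be drawn with seaborn, for example. This is a sketch assuming df holds the loaded dataset as in the earlier snippets.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heat map of the numeric attributes.
numeric_corr = df.select_dtypes(include="number").corr()
plt.figure(figsize=(12, 10))
sns.heatmap(numeric_corr, cmap="coolwarm")
plt.title("Correlation heat map of the employee attrition data")
plt.tight_layout()
plt.show()
```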

For categorical variables, missing values were imputed using the mode of that field. Binary
categorical values, which were either ‘true’ or ‘false’, were converted to 0 and 1 respectively.
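A sketch of these pre-processing steps (mode imputation for categorical fields, 0/1 encoding of binary fields, a train/test split and feature scaling); the file name and parameter choices are assumptions, not the report's exact code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("employee_attrition.csv")  # hypothetical local file name

# Impute missing categorical values with the mode of each column.
for col in df.select_dtypes(include="object"):
    df[col] = df[col].fillna(df[col].mode()[0])

# Encode the binary target as 0/1 and one-hot encode the remaining categoricals.
df["Attrition"] = (df["Attrition"] == "Yes").astype(int)
df = pd.get_dummies(df, drop_first=True)

X, y = df.drop(columns=["Attrition"]), df["Attrition"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Feature scaling, fitted on the training set only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```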

B. Feature selection

The input variables that we give to our machine learning models are called features; each
column in our dataset constitutes a feature. To train an optimal model, we need to make sure
that we use only the essential features. If we have too many features, the model can capture
unimportant patterns and learn from noise. The method of choosing the important parameters
of our data is called feature selection. Popular methods for feature selection are correlation
analysis, chi-square analysis, exploratory bivariate analysis and information value analysis.
Correlation analysis is used for the numeric variables and chi-square analysis for the
categorical variables. A high correlation or chi-square value indicates that a feature is
significant.
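A minimal sketch of correlation analysis for the numeric columns and chi-square tests for the categorical ones, using pandas and scipy; the file and column names are assumptions.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("employee_attrition.csv")  # hypothetical local file name
target = (df["Attrition"] == "Yes").astype(int)

# Correlation of each numeric feature with the attrition label.
numeric_corr = df.select_dtypes(include="number").corrwith(target).abs()
print(numeric_corr.sort_values(ascending=False).head(10))

# Chi-square test of each categorical feature against attrition.
for col in df.select_dtypes(include="object").columns.drop("Attrition"):
    table = pd.crosstab(df[col], df["Attrition"])
    chi2, p_value, _, _ = chi2_contingency(table)
    print(f"{col}: chi2 = {chi2:.1f}, p = {p_value:.4f}")
```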

Visualising numerical columns


C. Model Validation

The dataset was split 70-30 into training and testing sets. The trained model was then used to
predict on the 30% test set. The chosen model validation technique is the area under the
receiver operating characteristic curve (ROC-AUC). The ROC curve is a visualization used to
evaluate the performance of different machine learning models: it plots the true positive rate
against the false positive rate at different classification thresholds. Additionally, the accuracy
and F1 scores of the classifiers are used to compare the models. These metrics are important
because they clearly show how suitable a model is for use in an application.
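A sketch of this evaluation with scikit-learn metrics, reusing the fitted model and the 70/30 split from the earlier snippets (variable names are assumptions):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, roc_auc_score)

y_pred = model.predict(X_test)                # hard class predictions
y_prob = model.predict_proba(X_test)[:, 1]    # predicted probability of attrition

print("Accuracy :", accuracy_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```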

1: RANDOM FOREST CLASSIFIER

Random forest is an ensemble of decision trees and is expected to perform better and hence
give higher accuracy. The model evaluation metrics for this particular model are shown below:

MODEL EVALUATION:
CONFUSION MATRIX :

ROC CURVE:

2: LOGISTIC REGRESSION

This technique obtains the posterior probabilities by assuming a model for them and
estimating the parameters involved in the assumed model. The model evaluation metrics for
this particular model are shown below:
MODEL EVALUATION:

Confusion matrix:

ROC Curve of the model:

3: DECISION TREE
Decision tree classifiers are regarded as among the most well-known methods of data
classification. Recall that the entropy of a data set is computed as:

H(Set) = −P1 × log2(P1) − P2 × log2(P2)

where P1 is the proportion of the first decision and P2 is the proportion of the second decision,
and the information gain of an attribute A is computed as:

Gain(A) = H(Set) − (w1 × H(a1) + w2 × H(a2) + ... + wm × H(am))

where a1, a2, ..., am are the different values of attribute A, and w1, w2, ..., wm are the weights
of the subsets split by using the values of attribute A.

The model evaluation metrics for this particular model are shown below:

MODEL EVALUATION :

Confusion matrix:
ROC Curve of the model:
VI. RESULTS

Comparing the models on the basis of accuracy and F1 Score:

Algorithm                 Training Accuracy    Testing Accuracy    F1 Score

1. Decision Tree                1.00                 0.79            0.35

2. Logistic Regression          0.87                 0.88            0.51

3. Random Forest                0.98                 0.85            0.21


VII. Discussion and Future Work

Using the different models, we were able to justify that the chosen features are causes that
contribute to voluntary attrition. Intuitively, we need to estimate the probability of event
success and event failure (attrition). Logistic regression is designed for exactly this setting,
where the dependent variable is binary (0/1, True/False, Yes/No) in nature, and this is the basis
for choosing logistic regression as the best approach in this paper. The logistic regression
algorithm achieves good ROC-AUC and accuracy values. Future work might include modifying
the algorithm or using other algorithms such as an XGBoost classifier or a Support Vector
Machine and checking their accuracy.
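If the comparison were extended in that direction, a support vector machine could be dropped into the same pipeline as the other classifiers; a sketch with scikit-learn's SVC, reusing the earlier split (the kernel and hyperparameters are illustrative assumptions):

```python
from sklearn.svm import SVC

# probability=True enables predict_proba, needed for the ROC-AUC comparison.
svm_model = SVC(kernel="rbf", C=1.0, probability=True, random_state=42)
svm_model.fit(X_train, y_train)
print("SVM test accuracy:", svm_model.score(X_test, y_test))
```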
VIII. CONCLUSION

The importance of predicting employee attrition using machine learning algorithms such as
Logistic Regression, Random Forest Classifier and Decision Tree was presented in this project.
The results of this research showed the superiority of Logistic Regression in terms of accuracy
and predictive effectiveness, as measured by the ROC curve. Data from the dataset was used to
compare Logistic Regression against two other supervised classifiers that have historically been
used to build turnover models. Considering all the factors, we achieved a maximum accuracy of
89.54% with Logistic Regression. The results of this research demonstrate that the Logistic
Regression classifier is a superior algorithm in terms of significantly higher accuracy, relatively
low runtime and a good F1-score. For these reasons it is recommended to use logistic regression
for accurately predicting employee turnover, giving organizations the insight needed to take
necessary action.

In this project, the importance of employee attrition in organizations and the proper use of
machine learning algorithms, through model evaluation, was presented. The focus was on using
different algorithms and studying their performance. Random Forest Classifier, Decision Tree
and Logistic Regression were applied to the employee data to predict employee turnover. The
use of regularization helps logistic regression stand out compared to the other models and
classifiers. The accuracy of the Logistic Regression model was analysed to be 88.66%. Hence,
comparing the predictive machine learning algorithms on the same dataset reveals that Logistic
Regression outperforms the others when accuracy is the preferred metric.
IX. REFERENCES
[1] A Review on Employee’s Voluntary Turnover: Psychological Perspective, 2020.

[2] Supervised Learning - A Systematic Literature Review, 2021.

[3] Classification Based on Decision Tree Algorithm for Machine Learning, 2021.

[4] An Introduction to Logistic Regression Analysis and Reporting. The Journal of Educational Research, Volume 96, Issue 1, 2002.

[5] A Meta-Analysis of Research in Random Forests for Classification, 2016.

[6] Pre-Processing: A Data Preparation Step, 2018.

[7] A Survey of Feature Selection and Feature Extraction Techniques in Machine Learning, 2014.

[8] Machine Learning Algorithm Validation, 2020.

[9] Introduction to ROC Analysis, 2006.
