Capstone Interim Report - HR CTC Prediction

The document is an interim report for an HR data capstone project. It aims to develop a tool to predict employee salaries to reduce effort for HR and avoid discrimination. The report details data collection, preprocessing of 25,000 applicant records with 29 parameters. Exploratory data analysis identified fresher records as outliers and examined relationships between variables. Three regression models were tested on preprocessed data, with boosted decision trees performing best with lower error rates. Recommendations include further outlier removal, parameter selection, and model tuning to improve accuracy.

Uploaded by

chinudash
Interim Report of

HR Data Capstone Project

Submitted By

Chinmaynanda Dash
Seshavataram Peesapati
Yogesh S

Under the guidance of

Prerna Bhardwaj

1. Introduction
The HR team plays a crucial role in determining employee salaries in an organization. If a
judgement or consideration goes wrong, the resulting employee dissatisfaction can hurt performance
and may lead to disengagement. At the same time, the HR team needs to retain talent in the
organization.

In the present situation (crisis or opportunity), people move out frequently, while the organization
needs more people both as replacements and for new project requirements. The HR team has to carry
out recruitment drives throughout the year, and freshers need to be hired each year.

To overcome such a cumbersome and judgmental process, can we have a prediction tool that predicts
the salary of each employee recruited by the firm? Such a tool would reduce the effort the HR team
spends negotiating salaries and help avoid discrimination in the organization.

2. Problem Statement, Scope and Objective


Our problem statement relates to an organization, Delta Ltd. The HR team of Delta wants a system
that predicts employee salaries from past data. The system should avoid discrimination, improve
employee satisfaction, be easy to use, avoid manual judgement, and work effectively with minimal
involvement.

Our scope is to develop a tool that solves this problem and reduces the HR team's effort in salary
calculation. It will be easy to use and will avoid manual work.

Our objective is to collect the past data of all employees of Delta Ltd that HR currently uses to
estimate an employee's annual salary, then understand and analyse the data and prepare a model that
predicts the salary of a new employee with a similar profile without manual judgement.
We test the model by comparing its predictions with the existing data as confirmation.

3. Data Description
We collected a substantial amount of data (25,000 applicants) from the HR team of Delta Ltd. It
contains 29 parameters on which the salary judgement (expected CTC) is based. The data contains
both numerical and categorical parameters.
Numerical data – There are 12 parameters: Index, Application ID, Total experience,
Experience in field, passing years of graduation, PG and PhD, Current CTC, No. of companies
worked, No. of publications, No. of certifications and Expected CTC.
Categorical data – The remaining 17 of the 29 parameters are categorical.
Ordinal categorical data are Education, Appraisal Rating and Designation.

There are missing values in the Department, Role, Designation, Education and education-related
columns. Most of the missing values arise from freshers and undergraduates.
The freshers are outliers.
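
Ordinal parameters such as Education carry a natural order, so they can be mapped to integer ranks before modelling. The snippet below is a minimal sketch of that idea, not the encoding used in this project: the level labels in EDUCATION_ORDER are assumed from the education hierarchy described in this report, and the helper name encode_ordinal is hypothetical.

```python
# Assumed level labels, ordered lowest to highest. The real dataset may
# use different spellings; adjust the list to match the actual values.
EDUCATION_ORDER = ["Under Grad", "Grad", "PG", "Doctorate"]


def encode_ordinal(value, order):
    """Return the rank of `value` within `order`, or None if it is missing."""
    if value is None:
        return None
    return order.index(value)


# Small illustrative sample, not taken from the HR data.
sample = ["Grad", "PG", None, "Under Grad", "Doctorate"]
encoded = [encode_ordinal(v, EDUCATION_ORDER) for v in sample]
print(encoded)
```

Missing values stay as None here so that they can be handled separately during pre-processing.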

4. Data Pre-processing
We observed that freshers (the “0 Experience” category) are outliers, so we remove those rows from
the data and repeat the model evaluation in a 2nd phase.
Higher education details are kept as null where they do not apply to a lower education level in the
hierarchy. For example, the graduate, postgraduate and PhD parameters are not applicable to an
undergraduate.
For industry-related parameters such as role, position, industry and department, we replaced null
values with “Others” for experienced candidates. For freshers (0 experience candidates) the value is
“NA” for industry, organization, department, role and designation.
Our dependent variable is the expected salary; we consider the median expected salary as the
dependent variable and the other 28 parameters as independent variables.
We evaluate the relationship between the dependent and independent variables through EDA.
We evaluate the model with all the data, then check the error reduction from eliminating the
outliers and from model tuning.
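
The null-handling and outlier rules above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the project's actual pipeline: the column names (Total_Experience, Industry, Organization, Department, Role, Designation) and the function names are hypothetical, and records are assumed to be dictionaries with None for missing values.

```python
# Hypothetical column names for the industry-related parameters.
INDUSTRY_COLUMNS = ["Industry", "Organization", "Department", "Role", "Designation"]


def fill_industry_nulls(records):
    """Fill missing industry fields: "NA" for freshers, "Others" otherwise."""
    cleaned = []
    for row in records:
        row = dict(row)  # copy so the input records are not modified
        is_fresher = row["Total_Experience"] == 0
        for col in INDUSTRY_COLUMNS:
            if row.get(col) is None:
                row[col] = "NA" if is_fresher else "Others"
        cleaned.append(row)
    return cleaned


def drop_freshers(records):
    """Remove zero-experience rows, which the report treats as outliers."""
    return [row for row in records if row["Total_Experience"] > 0]


# Two made-up records: one fresher and one experienced candidate.
raw = [
    {"Total_Experience": 0, "Industry": None, "Organization": None,
     "Department": None, "Role": None, "Designation": None},
    {"Total_Experience": 5, "Industry": "IT", "Organization": "A",
     "Department": None, "Role": "Analyst", "Designation": None},
]
filled = fill_industry_nulls(raw)
experienced_only = drop_freshers(filled)
print(filled[0]["Department"], filled[1]["Department"], len(experienced_only))
```

Keeping the two steps separate makes it easy to evaluate the model once with all rows and once with freshers removed.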

5. Exploratory Data Analysis

1. We carried out EDA-01 initially with all 26 independent parameters, replacing null values of
role, department, industry and designation with “Others”.
2. Higher education details are kept as null where they do not apply to a lower education level in
the hierarchy. For example, the graduate, postgraduate and PhD parameters are not applicable to
an undergraduate.
3. The graph below shows department and organization as independent variables against
expected CTC.
4. We considered the “median of expected CTC” to identify correlation with the
independent variables.

Other EDA graphs are covered in Appendix-01.

5. Our major observation was that freshers (zero experience) are outliers.


6. We removed all 908 fresher rows to carry out further EDA, and used the new data to check
the correlation of the dependent variable with all 26 independent variables.
7. The inferences of EDA-02 are listed in the table below.

The EDA graphs for the remaining variables are available in Appendix EDA-02.
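
The EDA steps above compare the median of expected CTC across the values of a categorical parameter. A minimal sketch of that grouping is shown here; the column names Department and Expected_CTC, the function name median_ctc_by, and the sample figures are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def median_ctc_by(records, column):
    """Return the median Expected_CTC for each distinct value of `column`."""
    groups = defaultdict(list)
    for row in records:
        groups[row[column]].append(row["Expected_CTC"])
    return {key: median(values) for key, values in groups.items()}


# Made-up rows, not taken from the HR data.
rows = [
    {"Department": "HR", "Expected_CTC": 500000},
    {"Department": "HR", "Expected_CTC": 700000},
    {"Department": "HR", "Expected_CTC": 900000},
    {"Department": "Sales", "Expected_CTC": 400000},
    {"Department": "Sales", "Expected_CTC": 600000},
]
medians = median_ctc_by(rows, "Department")
print(medians)
```

The median is less sensitive than the mean to a handful of extreme salaries, which is one plausible reason for preferring it in this comparison.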

6. Modelling Approach

We used Azure ML Studio on the initial data, without eliminating outliers, and trained three
different regression models.
We considered three metrics to identify the model best suited to our project:
1. Mean absolute error.
2. Root mean square error.
3. Coefficient of determination.
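
For reference, the three metrics can be computed as follows. These are the standard textbook definitions, assumed here to correspond to the values reported by Azure ML Studio; the function names and the sample numbers are illustrative only.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def coefficient_of_determination(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values for illustration.
actual = [100, 200, 300, 400]
predicted = [110, 190, 320, 380]
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
cod = coefficient_of_determination(actual, predicted)
print(mae, rmse, cod)
```

RMSE squares each error before averaging, so it is never smaller than MAE and grows faster when a few predictions are far off.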

We split the data into train and test sets in a 70:30 ratio.

MAE = Mean Absolute Error, RMSE = Root Mean Square Error, COD = Coefficient of Determination

Sl. no.  Model                              MAE        RMSE       COD
1        Boosted decision tree regression   17744.97   31778.9    0.9992
2        Linear regression                  53880.17   80657.2    0.9953
3        Decision forest regression         41877.84   63639.72   0.997

We observed that the boosted decision tree model gives the best results, with the lowest MAE and
RMSE and the highest COD. We will therefore use the boosted decision tree for further model tuning.

Results after eliminating freshers (zero experience) as outliers:

MAE = Mean Absolute Error, RMSE = Root Mean Square Error, COD = Coefficient of Determination

Sl. no.  Model                              MAE        RMSE       COD
1        Boosted decision tree regression   13403.08   17277.75   0.9997
2        Linear regression                  48251.56   65183.91   0.9968
3        Decision forest regression         39203.29   57430.4    0.9974
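
The effect of removing freshers can be expressed as a percentage reduction in each error metric. The sketch below does that arithmetic on the MAE and RMSE values reported in the two result tables; the helper name percent_reduction is hypothetical.

```python
# (MAE, RMSE) per model, copied from the two result tables in this report.
before = {
    "Boosted decision tree": (17744.97, 31778.9),
    "Linear regression": (53880.17, 80657.2),
    "Decision forest": (41877.84, 63639.72),
}
after = {
    "Boosted decision tree": (13403.08, 17277.75),
    "Linear regression": (48251.56, 65183.91),
    "Decision forest": (39203.29, 57430.4),
}


def percent_reduction(old, new):
    """Percentage by which `new` is lower than `old`."""
    return (old - new) / old * 100


reductions = {}
for model, (mae_old, rmse_old) in before.items():
    mae_new, rmse_new = after[model]
    reductions[model] = (
        percent_reduction(mae_old, mae_new),
        percent_reduction(rmse_old, rmse_new),
    )
    print(f"{model}: MAE -{reductions[model][0]:.1f}%, RMSE -{reductions[model][1]:.1f}%")
```

For the boosted decision tree this works out to roughly a 24% lower MAE and a 46% lower RMSE, the largest improvement of the three models.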

7. Actionable insights and recommendations to the stakeholder

1. We need to identify a few insights from the EDA and the reasons behind the observed patterns.
2. We need to further reduce the MAE and RMSE values and narrow the gap between them.
3. This can be done by identifying further outliers, by eliminating parameters that have
minimal relationship with the dependent variable, and by model tuning.
4. We will split the data in a 70:25:5 ratio to train, test and verify the model, holding out 5%
of the data as an external source to validate model accuracy from the user's perspective.
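
The proposed 70:25:5 split can be sketched as a shuffle followed by slicing. This is one simple way to do it, assuming a plain random split with no stratification; the function name three_way_split and the fixed seed are illustrative choices, not part of the project.

```python
import random


def three_way_split(records, percents=(70, 25, 5), seed=42):
    """Shuffle `records` and split them into train, test and hold-out lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n = len(shuffled)
    n_train = n * percents[0] // 100
    n_test = n * percents[1] // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    holdout = shuffled[n_train + n_test:]  # remainder, about 5% of the data
    return train, test, holdout


# Record indices stand in for applicant rows.
train, test, holdout = three_way_split(range(1000))
print(len(train), len(test), len(holdout))
```

The hold-out portion is never shown to the model during training or tuning, so it behaves like new applicant data when the final accuracy is checked.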

8. References and Bibliography


1. Tableau dashboard
2. Great learning lecturer videos
3. https://www.ijitee.org/wp-content/uploads/papers/v9i6/F4545049620.pdf
4. https://machinelearningmastery.com/difference-test-validation-datasets/
5. https://www.datascience2000.in/2021/05/employee-salary-prediction-in-machine.html
6. https://towardsdatascience.com/will-your-employee-leave-a-machine-learning-model-8484c2a6663e
7. https://medium.com/analytics-vidhya/machine-learning-project-3-predict-salary-using-polynomial-regression-7024c7bace4f
8. https://www.atlantis-press.com/journals/ijcis/25899235/view
9. https://www.hindawi.com/journals/sp/2021/8387277/

9. Appendix

1. EDA-01
2. EDA-02
3. MODEL -01
4. MODEL-02

1. EDA-02

Model -01

Model -02

