HR Analyst (Data Analyst)
Dataset: The dataset is available at the given link. You can download it at your convenience.
About Dataset
Updated 30 January 2023
Version 14 of Dataset
License Update:
There has been some confusion around licensing for this data set. Dr. Carla Patalano and Dr. Rich Huebner are the
original authors of this dataset.
We provide a license to anyone who wishes to use this dataset for learning or teaching. For the purposes of sharing,
please follow this license:
CC-BY-NC-ND
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
License.
Codebook
https://rpubs.com/rhuebner/hrd_cb_v14
PLEASE NOTE -- I recently updated the codebook - please use the above link. A few minor discrepancies were
identified between the codebook and the dataset. Please feel free to contact me through LinkedIn
(www.linkedin.com/in/RichHuebner) to report discrepancies and make requests.
Context
HR data can be hard to come by, and HR professionals generally lag behind with respect to analytics and data
visualization competency. Thus, Dr. Carla Patalano and I set out to create our own HR-related dataset, which is
used in one of our graduate MSHRM courses called HR Metrics and Analytics, at New England College of Business.
We created this data set ourselves. We use the data set to teach HR students how to use and analyze the data in
Tableau Desktop - a data visualization tool that's easy to learn.
This version provides a variety of features that are useful for both data visualization AND creating machine learning /
predictive analytics models. We are working on expanding the data set even further by generating even more
records and a few additional features. We will be keeping this as one file/one data set for now. We may create a
second file down the road so that you can join the files together to practice SQL joins.
Note that this dataset isn't perfect. By design, there are some issues that are present. It is primarily designed as a
teaching data set - to teach human resources professionals how to work with data and analytics.
Content
We have reduced the complexity of the dataset down to a single data file (v14). The CSV revolves around a fictitious
company and the core data set contains names, DOBs, age, gender, marital status, date of hire, reasons for
termination, department, whether they are active or terminated, position title, pay rate, manager name, and
performance score. It also includes:
● Absences
● Most Recent Performance Review Date
● Employee Engagement Score
Acknowledgements
Dr. Carla Patalano provided the baseline idea for creating this synthetic data set, which has been used now by over
200 Human Resource Management students at the college. Students in the course learn data visualization
techniques with Tableau Desktop and use this data set to complete a series of assignments.
Inspiration
We've included some open-ended questions that you can explore and try to address through creating Tableau
visualizations, or R or Python analyses. Good luck and enjoy the learning!
● Is there any relationship between who a person works for and their performance score?
● What is the overall diversity profile of the organization?
● What are our best recruiting sources if we want to ensure a diverse organization?
● Can we predict who is going to terminate and who isn't? What level of accuracy can we achieve on this?
● Are there areas of the company where pay is not equitable?
Many other interesting questions could be addressed through this data set. Dr. Patalano and I look forward to
seeing what we can come up with.
If you have any questions or comments about the dataset, please do not hesitate to reach out to me on LinkedIn:
http://www.linkedin.com/in/RichHuebner
Below is a guide for a Human Resources machine learning project: building a model in
Python to predict employee turnover, also known as employee attrition. The code uses
common libraries such as Pandas, Scikit-Learn, and Matplotlib for data processing,
model building, and visualization.
Objective:
To predict whether an employee will leave the company (attrition) based on various
features such as age, job satisfaction, salary, etc.
Step-by-Step Guide
Sample Data:
EmployeeID,Age,Gender,Department,Position,YearsAtCompany,JobSatisfaction,Salary,Attrition
1,30,Male,Sales,Manager,5,4,75000,No
2,28,Female,Marketing,Executive,3,3,65000,Yes
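For a quick check of the layout without a file on disk, the two sample rows can be parsed directly from text. This is only a sketch: the name `sample_df` is an assumption made for illustration and does not come from the guide.

```python
import io

import pandas as pd

# The header and the two sample rows, copied from the sample data.
SAMPLE_CSV = """EmployeeID,Age,Gender,Department,Position,YearsAtCompany,JobSatisfaction,Salary,Attrition
1,30,Male,Sales,Manager,5,4,75000,No
2,28,Female,Marketing,Executive,3,3,65000,Yes
"""

# io.StringIO lets read_csv treat the string as if it were a file.
sample_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # type pandas inferred for each column
```

Two rows are far too few to train on; the point is only to confirm the column names and types before working with the full data.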
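import pandas as pd
df = pd.read_csv('employee_data.csv')  # placeholder file name; point this at your own CSV
# Summary statistics
print(df.describe())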
3. Data Preprocessing
# Encode categorical variables
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
df['Attrition'] = df['Attrition'].map({'No': 0, 'Yes': 1})
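The evaluation calls later in this guide expect `model`, `y_test`, and `y_pred` to exist, and the Summary names a Random Forest classifier. A minimal sketch of how those objects might be produced is given here. Several things are assumptions made for illustration: the rows are randomly generated in the sample layout (they are not real data), the remaining text columns are one-hot encoded with `pd.get_dummies`, and the 80/20 split and `random_state` values are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up rows in the sample layout, generated only so the sketch runs end to end.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "EmployeeID": np.arange(1, n + 1),
    "Age": rng.integers(22, 60, size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Department": rng.choice(["Sales", "Marketing"], size=n),
    "Position": rng.choice(["Manager", "Executive"], size=n),
    "YearsAtCompany": rng.integers(0, 20, size=n),
    "JobSatisfaction": rng.integers(1, 6, size=n),
    "Salary": rng.integers(40000, 120000, size=n),
    "Attrition": rng.choice(["No", "Yes"], size=n, p=[0.7, 0.3]),
})

# Same binary encoding as the preprocessing step.
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})
df["Attrition"] = df["Attrition"].map({"No": 0, "Yes": 1})

# Drop the identifier and the target, then one-hot encode the remaining text columns.
X = pd.get_dummies(
    df.drop(columns=["EmployeeID", "Attrition"]),
    columns=["Department", "Position"],
)
y = df["Attrition"]

# Hold out 20% of the rows for evaluation, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the classifier on the training part and predict on the held-out part.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

`EmployeeID` is dropped because an arbitrary identifier carries no signal and would only invite overfitting. Because the labels here are random, any accuracy measured on this toy data says nothing about real attrition.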
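from sklearn.metrics import classification_report, confusion_matrix
# Compare the held-out labels (y_test) with the model's predictions (y_pred)
print('Classification Report:')
print(classification_report(y_test, y_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))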
6. Feature Importance
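A fitted Random Forest in Scikit-Learn exposes impurity-based importances through its `feature_importances_` attribute. The sketch below ranks and plots them. The data is a toy construction, not taken from the dataset: attrition is defined by a made-up rule (job satisfaction of 2 or lower), so that one feature clearly carries the signal.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy numeric features, generated only for illustration.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Age": rng.integers(22, 60, size=n),
    "YearsAtCompany": rng.integers(0, 20, size=n),
    "JobSatisfaction": rng.integers(1, 6, size=n),
    "Salary": rng.integers(40000, 120000, size=n),
})
# Made-up rule: employees with low job satisfaction are labeled as leaving.
y = (X["JobSatisfaction"] <= 2).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Pair each importance with its column name and rank from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Horizontal bar chart; sorted ascending so the largest bar is drawn at the top.
ax = importances.sort_values().plot.barh()
ax.set_xlabel("Importance")
ax.set_title("Random Forest feature importances")
plt.tight_layout()
plt.show()
```

Impurity-based importances can favor features with many distinct values, so treat the ranking as a starting point for investigation, not as proof of what drives attrition.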
7. Model Deployment
In a real-world scenario, you would save the model and deploy it using a framework like Flask or
Django for making predictions on new data.
import joblib
# Save the model
joblib.dump(model, 'employee_attrition_model.pkl')
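A saved model is only useful if it can be reloaded and asked for predictions. The sketch below shows that round trip with `joblib.load`. The tiny training frame, its labels, the temporary directory, and the `new_employees` row are all made up for illustration; a real service would load the file once at startup and call `predict` per request.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model trained on made-up rows, only so there is something to save.
train = pd.DataFrame({
    "Age": [30, 28, 45, 39, 25, 51],
    "JobSatisfaction": [4, 3, 2, 5, 1, 4],
    "Salary": [75000, 65000, 58000, 90000, 42000, 98000],
})
labels = [0, 1, 1, 0, 1, 0]  # 1 = left the company, 0 = stayed
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(train, labels)

# Save to a temporary directory so the example leaves the working directory untouched.
path = os.path.join(tempfile.mkdtemp(), "employee_attrition_model.pkl")
joblib.dump(model, path)

# Later, possibly in another process: load the model and score new rows.
loaded = joblib.load(path)
new_employees = pd.DataFrame({"Age": [33], "JobSatisfaction": [2], "Salary": [60000]})

prediction = loaded.predict(new_employees)               # predicted class per row
probability = loaded.predict_proba(new_employees)[:, 1]  # estimated probability of class 1
print(prediction, probability)
```

New rows must have the same columns, in the same order and with the same encoding, as the data the model was trained on. Only load `.pkl` files from sources you trust, since unpickling can execute arbitrary code.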
Summary
In this project, we built a machine learning model to predict employee attrition using a Random
Forest classifier. We started by loading and exploring the dataset, followed by data
preprocessing, model building, training, and evaluation. Finally, we analyzed feature importances
and discussed the steps for model deployment.
Feel free to expand on this foundation based on your specific requirements and data availability.