HR Analyst (Data Analyst)

Hr manager

Project Title: Human Resources Analyst

Tools: Machine Learning

Technologies: Data Analyst

Project Difficulty Level: Intermediate

Dataset: The dataset is available at the link below. You can download it at your convenience.

Click here to download data set

About Dataset
Updated 30 January 2023

Version 14 of Dataset

License Update:
There has been some confusion around licensing for this data set. Dr. Carla Patalano and Dr. Rich Huebner are the
original authors of this dataset.

We provide a license to anyone who wishes to use this dataset for learning or teaching. For the purposes of sharing,
please follow this license:

CC-BY-NC-ND
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
License.
Codebook
https://rpubs.com/rhuebner/hrd_cb_v14

PLEASE NOTE -- I recently updated the codebook - please use the above link. A few minor discrepancies were
identified between the codebook and the dataset. Please feel free to contact me through LinkedIn
(www.linkedin.com/in/RichHuebner) to report discrepancies and make requests.

Context
HR data can be hard to come by, and HR professionals generally lag behind with respect to analytics and data
visualization competency. Thus, Dr. Carla Patalano and I set out to create our own HR-related dataset, which is
used in one of our graduate MSHRM courses called HR Metrics and Analytics, at New England College of Business.
We created this data set ourselves. We use the data set to teach HR students how to use and analyze the data in
Tableau Desktop - a data visualization tool that's easy to learn.

This version provides a variety of features that are useful for both data visualization AND creating machine learning /
predictive analytics models. We are working on expanding the data set even further by generating even more
records and a few additional features. We will be keeping this as one file/one data set for now. There is a possibility
of creating a second file perhaps down the road where you can join the files together to practice SQL/joins, etc.

Note that this dataset isn't perfect. By design, there are some issues that are present. It is primarily designed as a
teaching data set - to teach human resources professionals how to work with data and analytics.

Content
We have reduced the complexity of the dataset down to a single data file (v14). The CSV revolves around a fictitious
company and the core data set contains names, DOBs, age, gender, marital status, date of hire, reasons for
termination, department, whether they are active or terminated, position title, pay rate, manager name, and
performance score.

Recent additions to the data include:

● Absences
● Most Recent Performance Review Date
● Employee Engagement Score

Acknowledgements
Dr. Carla Patalano provided the baseline idea for creating this synthetic data set, which has been used now by over
200 Human Resource Management students at the college. Students in the course learn data visualization
techniques with Tableau Desktop and use this data set to complete a series of assignments.
Inspiration
We've included some open-ended questions that you can explore and try to address through creating Tableau
visualizations, or R or Python analyses. Good luck and enjoy the learning!

● Is there any relationship between who a person works for and their performance score?
● What is the overall diversity profile of the organization?
● What are our best recruiting sources if we want to ensure a diverse organization?
● Can we predict who is going to terminate and who isn't? What level of accuracy can we achieve on this?
● Are there areas of the company where pay is not equitable?
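
As a starting point for the first question, the sketch below cross-tabulates manager against performance score with pandas. The column names `ManagerName` and `PerformanceScore`, the manager names, and the score labels are all assumptions made for illustration; check the codebook for the actual names in the dataset.

```python
import pandas as pd

# Tiny made-up frame. Column names, manager names, and score labels are
# hypothetical placeholders, not values taken from the real dataset.
hr = pd.DataFrame({
    "ManagerName": ["A. Lee", "A. Lee", "A. Lee", "B. Kim", "B. Kim", "B. Kim"],
    "PerformanceScore": ["Exceeds", "Fully Meets", "Fully Meets",
                         "Fully Meets", "Needs Improvement", "Needs Improvement"],
})

# Share of each performance score within each manager's team (each row sums to 1).
score_share = pd.crosstab(hr["ManagerName"], hr["PerformanceScore"], normalize="index")
print(score_share.round(2))
```

Large differences between rows would suggest a relationship worth testing further, for example with a chi-squared test, before drawing conclusions.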

There are so many other interesting questions that could be addressed through this interesting data set. Dr.
Patalano and I look forward to seeing what we can come up with.

If you have any questions or comments about the dataset, please do not hesitate to reach out to me on LinkedIn:
http://www.linkedin.com/in/RichHuebner

You can also reach me via email at: [email protected]

HOW TO BUILD IT: PROJECT GUIDELINE USING ML

Below is a step-by-step guide for a Human Resources machine learning project. The
project builds a model to predict employee turnover, also known as employee attrition,
using Python. The code uses common libraries: Pandas for data processing, Scikit-Learn
for model building, Matplotlib and Seaborn for visualization, and joblib for saving
the model.

Human Resources Machine Learning Project: Predicting Employee Turnover

Objective:

To predict whether an employee will leave the company (attrition) based on various
features such as age, job satisfaction, salary, etc.

Step-by-Step Guide

1. Data Collection and Preparation


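For this project, we'll use a sample dataset. The code below assumes a CSV file named
employees.csv with the columns shown in the sample data. If you use the dataset described
above, or another HR analytics dataset from a source like Kaggle, adjust the column names
in the code to match.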

Sample Data:

EmployeeID,Age,Gender,Department,Position,YearsAtCompany,JobSatisfaction,Salary,Attrition

1,30,Male,Sales,Manager,5,4,75000,No
2,28,Female,Marketing,Executive,3,3,65000,Yes
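
Two rows are not enough to train a model, so if you only want to run the steps end to end, one option is to generate a synthetic file with the same columns. The sketch below is an assumption-laden stand-in: every value is random, the row count, the value ranges, and the extra categories ("IT", "Analyst") are invented, and the labels carry no real signal, so any accuracy you get from it is meaningless.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200  # arbitrary number of synthetic employees

# All ranges and category lists below are illustrative assumptions.
sample = pd.DataFrame({
    "EmployeeID": np.arange(1, n + 1),
    "Age": rng.integers(22, 60, size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Department": rng.choice(["Sales", "Marketing", "IT"], size=n),
    "Position": rng.choice(["Manager", "Executive", "Analyst"], size=n),
    "YearsAtCompany": rng.integers(0, 20, size=n),
    "JobSatisfaction": rng.integers(1, 6, size=n),   # assumed 1-5 scale
    "Salary": rng.integers(40, 121, size=n) * 1000,
    "Attrition": rng.choice(["No", "Yes"], size=n, p=[0.8, 0.2]),
})

# Write the file that the loading step expects.
sample.to_csv("employees.csv", index=False)
```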

2. Load and Explore the Data

# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('employees.csv')

# Display the first few rows of the dataset
print(df.head())

# Check for missing values
print(df.isnull().sum())

# Summary statistics
print(df.describe())
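
One exploration step worth adding is a check of the class balance in the target, because attrition data is often skewed toward employees who stay. The sketch below uses a small made-up `Attrition` column (an assumption, not real data) to show the idea; the share of the majority class is the accuracy a model would reach by always predicting that class.

```python
import pandas as pd

# Made-up target column for illustration only.
df_demo = pd.DataFrame({"Attrition": ["No", "No", "No", "Yes", "No", "Yes", "No", "No"]})

# Proportion of each class in the target.
class_share = df_demo["Attrition"].value_counts(normalize=True)
print(class_share)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = class_share.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Any trained model should be compared against this baseline rather than against zero.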

3. Data Preprocessing

# Encode categorical variables (values other than those listed become NaN)
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
df['Attrition'] = df['Attrition'].map({'No': 0, 'Yes': 1})

# One-hot encoding for department and position
df = pd.get_dummies(df, columns=['Department', 'Position'], drop_first=True)

# Drop identifier columns that carry no predictive signal.
# errors='ignore' skips 'Name', which the sample data above does not contain.
df = df.drop(columns=['EmployeeID', 'Name'], errors='ignore')

# Separate features and target variable
X = df.drop('Attrition', axis=1)
y = df['Attrition']
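
An alternative to encoding the whole DataFrame up front is to put the encoding inside a Scikit-Learn `Pipeline`, so the same transformation is applied at training and prediction time. This is a sketch of that swapped-in approach, not the method used in this guide; the toy rows, the "IT" and "Finance" categories, and the variable names are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data (hypothetical values).
train = pd.DataFrame({
    "Age": [30, 28, 45, 36, 52, 25],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Department": ["Sales", "Marketing", "Sales", "IT", "IT", "Marketing"],
    "Salary": [75000, 65000, 90000, 80000, 110000, 55000],
    "Attrition": [0, 1, 0, 0, 0, 1],
})

categorical = ["Gender", "Department"]

# One-hot encode the categorical columns; pass numeric columns through unchanged.
# handle_unknown="ignore" encodes unseen categories as all zeros instead of failing.
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

clf = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
clf.fit(train.drop(columns="Attrition"), train["Attrition"])

# A new record with a department the model never saw during training.
new_employee = pd.DataFrame({
    "Age": [33], "Gender": ["Female"], "Department": ["Finance"], "Salary": [70000],
})
pred = clf.predict(new_employee)
print(pred)
```

The design choice here is robustness: with `pd.get_dummies`, new data can produce a different set of columns than the model was trained on, whereas the fitted encoder always produces the same ones.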

4. Split the Data into Training and Testing Sets

from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
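
When the target is imbalanced, a plain random split can leave the test set with too few leavers. A possible refinement is to pass `stratify` to `train_test_split`, which keeps the class proportions the same in both parts. The sketch below uses a made-up target with 20% positives to show the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 80 stayers (0) and 20 leavers (1).
y_demo = pd.Series([0] * 80 + [1] * 20, name="Attrition")
X_demo = pd.DataFrame({"Age": range(100)})

# stratify=y_demo preserves the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"Train attrition rate: {y_tr.mean():.2f}")
print(f"Test attrition rate:  {y_te.mean():.2f}")
```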

5. Build and Train the Model

We'll use a Random Forest classifier for this project.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

print('Classification Report:')
print(classification_report(y_test, y_pred))

print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))
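
To read the confusion matrix and the classification report, it helps to see how the metrics follow from the four cell counts. The worked example below uses ten made-up labels and predictions (assumptions for illustration) and derives accuracy, precision, and recall for the attrition class by hand, then checks them against Scikit-Learn.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = employee left, 0 = employee stayed.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 5 1 2 2

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 7 / 10
precision = tp / (tp + fp)                   # 2 / 3: flagged leavers who really left
recall = tp / (tp + fn)                      # 2 / 4: actual leavers the model caught

# The hand-computed values match the library functions.
assert abs(precision - precision_score(y_true, y_hat)) < 1e-9
assert abs(recall - recall_score(y_true, y_hat)) < 1e-9
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Here accuracy looks acceptable at 0.70 while recall shows that half of the actual leavers were missed, which is why accuracy alone can mislead on attrition data.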

6. Feature Importance

# Get feature importances
importances = model.feature_importances_
feature_names = X.columns
feature_importances = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

# Plot feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importances)
plt.title('Feature Importances')
plt.show()
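
The impurity-based importances above are computed on the training data and can favor features with many distinct values. As a cross-check, one option is permutation importance on the held-out set, which measures how much the score drops when a feature's values are shuffled. This sketch uses synthetic data in which, by construction, only the hypothetical `signal` column determines the label.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic features: the label depends only on "signal"; "noise" is irrelevant.
X_demo = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
y_demo = (X_demo["signal"] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the mean accuracy drop.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
perm = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(ascending=False)
print(perm)
```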

7. Model Deployment

In a real-world scenario, you would save the model and deploy it using a framework like Flask or
Django for making predictions on new data.

import joblib

# Save the model
joblib.dump(model, 'employee_attrition_model.pkl')

# Load the model
loaded_model = joblib.load('employee_attrition_model.pkl')

# Make predictions with the loaded model
new_predictions = loaded_model.predict(X_test)
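
In practice, new data will not arrive already encoded like `X_test`. One way to handle this, sketched below under stated assumptions, is to save the list of training columns alongside the model and reindex newly encoded records to that list before predicting. The bundle layout, the file name `attrition_bundle.pkl`, and the toy rows are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy training data (hypothetical values).
train = pd.DataFrame({
    "Age": [30, 28, 45, 36],
    "Department": ["Sales", "Marketing", "Sales", "Marketing"],
    "Attrition": [0, 1, 0, 1],
})
X_train_demo = pd.get_dummies(train.drop(columns="Attrition"), columns=["Department"])
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train_demo, train["Attrition"])

# Save the model together with the exact column order it was trained on.
path = os.path.join(tempfile.mkdtemp(), "attrition_bundle.pkl")
joblib.dump({"model": rf, "columns": list(X_train_demo.columns)}, path)

# Later, in the serving code: load the bundle and encode a new record.
bundle = joblib.load(path)
new_rows = pd.DataFrame({"Age": [33], "Department": ["Sales"]})
X_new = pd.get_dummies(new_rows, columns=["Department"])

# Add any dummy columns missing from the new data and enforce the training order.
X_new = X_new.reindex(columns=bundle["columns"], fill_value=0)
prediction = bundle["model"].predict(X_new)
print(prediction)
```

A Flask or Django endpoint would wrap the last four steps: parse the request into a DataFrame, encode it, align the columns, and return the prediction.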

Summary

In this project, we built a machine learning model to predict employee attrition using a Random
Forest classifier. We started by loading and exploring the dataset, followed by data
preprocessing, model building, training, and evaluation. Finally, we analyzed feature importances
and discussed the steps for model deployment.

This project can be further enhanced by:

● Tuning hyperparameters using GridSearchCV.
● Trying different machine learning algorithms.
● Incorporating additional features.
● Building a more sophisticated model evaluation process.
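
The first enhancement can be sketched as follows. The example runs `GridSearchCV` over a small Random Forest grid on synthetic data from `make_classification`; the grid values, the 3-fold setting, and the F1 scoring choice are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared feature matrix and target.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Small illustrative grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# 3-fold cross-validation, scored by F1 on the positive class.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```

With the project data, you would pass `X_train` and `y_train` to `search.fit` and keep the test set untouched for the final evaluation.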

Feel free to expand on this foundation based on your specific requirements and data availability.

Sample report
Reference link
