7 HR Data Sets For People Analytics - AIHR Analytics
HR data sets are rare finds. In this article, I will list the 7 best HR data sets available
online. In addition to the data set, I will also list the challenges in the data. This can
be a potential analysis or something to look out for in the data.
We strongly advocate using data and statistics as a means to an end. In analytics we want to
contribute to solving business issues using data and statistics. Analysis and statistics in themselves
are not an end – unless you want to learn how to use them. That’s what we wrote this article for.
Note. I may occasionally use the word ‘predict’ loosely in this article. Most data sets are
cross-sectional, making it impossible to ‘predict’ a dependent variable.
Now that we’ve wrapped up the formalities and disclaimers, let’s start playing with HR data!
1. Absenteeism at work
This enormous HR data set focuses on employee absence. It contains a staggering 8335
The data set contains employee numbers and names, gender, city, job title, department,
store location, business unit, division, age, length of service, and the number of hours
absent.
This data set is neatly structured. This means that every employee has a single line and that
addition, age and length of service may also be associated with absence – but how? This is
The data set can also be used as an exercise set to predict absence using decision trees or
linear models.
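As a minimal sketch of the linear-model exercise, the snippet below fits a one-predictor least-squares line of absence hours on age. The numbers are toy values invented for illustration, not rows from the data set, and the variable names are assumptions.

```python
# Toy (age, absent hours) pairs, invented for illustration only.
ages = [25, 35, 45, 55]
absent_hours = [10.0, 20.0, 30.0, 40.0]


def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ages, absent_hours)
# Fitted absence hours for a hypothetical 40-year-old employee.
predicted = slope * 40 + intercept
print(slope, intercept, predicted)
```

Because the data is cross-sectional, a fitted line like this describes an association within the sample rather than a forecast.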
Challenge
This data set is quite straightforward. It is large but still manageable in software like SPSS or
Excel. You may have to code a number of nominal variables into number values before you
can do your analysis but on top of that, the data itself doesn’t pose much of a challenge.
Note: The data does need to be cleaned. Everyone under 18 or above 65 may be removed
from the data set.
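The two preparation steps mentioned here, removing ages outside 18 to 65 and coding nominal variables into numbers, could look roughly like this. The rows, field names and department labels are invented for illustration and are not taken from the data set.

```python
# Toy rows invented for illustration; field names are assumptions.
rows = [
    {"age": 17, "gender": "F", "department": "Bakery"},
    {"age": 34, "gender": "M", "department": "Dairy"},
    {"age": 70, "gender": "F", "department": "Dairy"},
    {"age": 51, "gender": "F", "department": "Bakery"},
]


def clean_ages(records, low=18, high=65):
    """Keep only records whose age lies within [low, high]."""
    return [r for r in records if low <= r["age"] <= high]


def encode_nominal(records, column):
    """Add an integer code for each distinct label in `column`.

    Labels are sorted first so the codes are the same on every run.
    """
    labels = sorted({r[column] for r in records})
    codes = {label: i for i, label in enumerate(labels)}
    encoded = [{**r, column + "_code": codes[r[column]]} for r in records]
    return encoded, codes


cleaned = clean_ages(rows)
encoded, department_codes = encode_nominal(cleaned, "department")
print(department_codes)
```

Integer codes carry no real order, so for linear models it is usually safer to turn each label into its own 0/1 dummy variable instead.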
Download
This data set was created by Lyndon Sundmark, writer of Doing HR Analytics – A Practitioner’s
Handbook with R Examples, for the purpose of learning to predict absence as an outcome.
Lyndon provides a detailed explanation of how to do this in his book. Alternatively, you can
also download his free, two-part description of this case in which he runs both visual
(descriptive) analytics in part 1 (using R) before creating a decision tree and running linear
regression to predict absence in part 2. The relevant chapter in his book is based on these
two articles.
columns of data.
The data set contains a number of employee IDs. Each row represents a certain quantity of
Information on employees includes the number of children, work load, distance from work,
transportation expense, education, height, weight, BMI, and absenteeism time in hours.
Other information includes the season, month of absence, day of absence, and day of the
week.
The data set also classifies absence into 21 categories, or reasons of absence. These include
different types of illness, congenital malfunctions, and pregnancy. The full list can be found
Challenge
The challenge of this data set is mostly in structuring the data. An individual employee has
multiple records. These need to be combined prior to analysis. This data set also enables
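Combining multiple records per employee can be sketched as a simple group-and-sum. The records and field names below are invented for illustration and do not reproduce the actual file layout.

```python
from collections import defaultdict

# Toy absence records, one row per absence; values and field names are assumptions.
records = [
    {"employee_id": 11, "absent_hours": 4},
    {"employee_id": 36, "absent_hours": 0},
    {"employee_id": 11, "absent_hours": 8},
    {"employee_id": 3, "absent_hours": 2},
    {"employee_id": 11, "absent_hours": 1},
]


def total_per_employee(rows):
    """Collapse many rows per employee into one summary row each."""
    totals = defaultdict(lambda: {"records": 0, "absent_hours": 0})
    for row in rows:
        entry = totals[row["employee_id"]]
        entry["records"] += 1
        entry["absent_hours"] += row["absent_hours"]
    return dict(totals)


per_employee = total_per_employee(records)
print(per_employee[11])
```

Whether to sum, average or count depends on the question: total hours suits a cost analysis, while the number of records is closer to absence frequency.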
Download
This data set can be found on Kaggle (mirror).
The data set has some interesting properties as the sheets are linked. The
performance score, number of times they asked for help, daily error rate, and 90-day
complaints.
The data set was created by Dr. Rich Huebner and Dr. Carla Patalano for their graduate
HRM course on HR metrics and analytics.
Challenge
Other challenges include looking for predictors of suboptimal performance of production
staff (using the other data sheets). There are multiple dependent variables for suboptimal
performance, including performance ratings, daily error rates, and 90-day complaints. By
linking this back to the datasets that resemble the more general HRIS information, you can
deploy decision trees and linear regression models to predict performance.
Another data sheet is titled recruiting_cost.csv. This contains the spend on different
recruitment channels. The HRDataset_v9.csv contains the source of hire and date of hire,
allowing you to potentially calculate metrics like sourcing channel effectiveness and average
sourcing channel cost.
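A rough sketch of the average sourcing channel cost: divide the spend per channel by the number of hires from that channel. The channel names, amounts and data shapes below are invented assumptions. In practice you would first derive them from recruiting_cost.csv and HRDataset_v9.csv, whose real column names are not shown here.

```python
from collections import Counter

# Assumed shape: total spend per channel (hypothetical names and amounts).
spend_by_channel = {"Job board": 6000.0, "Referral": 0.0, "Career fair": 3000.0}

# Assumed shape: one source-of-hire label per hired employee.
source_of_hire = [
    "Job board", "Referral", "Job board",
    "Career fair", "Referral", "Job board",
]


def cost_per_hire(spend, sources):
    """Spend divided by number of hires, per channel.

    Channels with spend but no hires get None instead of a division error.
    """
    hires = Counter(sources)
    return {
        channel: (amount / hires[channel] if hires[channel] else None)
        for channel, amount in spend.items()
    }


channel_cost = cost_per_hire(spend_by_channel, source_of_hire)
print(channel_cost)
```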
The data sheet also contains data on active or termination status, allowing you to predict
termination as well, and associate it with all the other data contained in the other data
sheets.
This may mean that the main challenge is the abundance of information. Start with a
specific research question that you come up with and start to answer it using the data –
Download
The data set can be downloaded on Kaggle (mirror). The codebook for this data set can be
found here.
enables you to practice attrition modeling, you pay attention. The data set has 1470 rows
and 35 columns.
The data set contains data like age, gender, job satisfaction, environment satisfaction,
education field, job role, income, overtime, percentage salary hike, tenure, training time,
With these variables, IBM has created a fairly complete overview that contains the data of
the average HRIS combined with a full engagement survey. The data set is therefore great
to predict turnover, or to simply find differences between the group that stayed or that left.
Challenge
This data set opens up a lot of possible analyses. One of the most interesting might be to
find predictors using decision trees or logistic regression. Note: check Pasha Robert’s slide
deck on Why You Shouldn’t Use Logistic Regression to Predict Attrition beforehand!
Alternatively, you can use simpler one-way ANOVA or chi-squared tests to find differences
between the groups who left and stayed in factors like job satisfaction and whether or not
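To illustrate the chi-squared comparison of leavers and stayers, here is a minimal hand-rolled sketch of the Pearson chi-squared statistic for a table of counts. The counts are invented for illustration (overtime is used only as an example factor) and are not taken from the IBM data.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Toy counts: rows = overtime yes / no, columns = left / stayed.
observed = [
    [30, 70],
    [20, 180],
]
statistic, dof = chi_squared(observed)
print(round(statistic, 2), dof)
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05. This sketch applies no continuity correction, so statistical packages that apply one to 2x2 tables will report a slightly smaller value.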
Download
Originally, the data set was published on IBM’s website but has since been removed. The
data set is still available on Kaggle (mirror). Note that in the original IBM file there was a
second worksheet called Data Definitions. In Kaggle these data definitions have been
included in the description of the file.
he has built a large community of people analytics practitioners and has become the face of
people analytics in the East.
The data set contains information on gender, age, wage type, way of travel, traffic (source of
Challenge
In one of his translated posts he poses the question: Which employee will be most likely to
stay the longest, Johnson, Peterson, or Sidorson? In his support article, he then shows how
According to Edward, the data set is real – which is exciting! For the rest, the data is pretty
straightforward. The only thing to keep an eye on is that some terms got lost in translation
Download
You can download the dataset here (mirror) from Edward’s Dropbox. A support article
6. Job classification
Another one-of-a-kind data set by Lyndon Sundmark can be used for job classification. Job
classifications reflect both job families and pay grade related information. This is especially
relevant when new jobs are created which need to fit in the existing job structure.
Jobs have a number of distinct features which impact the job’s classification. These include
education level, experience, organizational impact, level of supervision, financial budget,
and more. Knowing these factors for different jobs enables a job analyst to classify jobs into
Challenge
Sundmark points out that Linear Discriminant Analysis (LDA) can be used to find
combinations of features which characterize a number of classes of objects or events. Using
LDA, Sundmark’s job classification data set can be used to classify newly created jobs in the
In this dataset, there are 66 job specifications covering 11 paygrades. All the factors
Download
You can download the data set here. A support article describing how to do the analysis in
7. Engagement survey
Engagement surveys are among the hardest data sets to come by. There are a few reasons for this,
the most important being the high level of confidentiality and company-sensitive
However, there is a data set available for those who want to learn. In our Statistics in HR
course we use an engagement data set with 85 individuals who all filled in an engagement
survey. The data set contains variables like performance rating, function group, but also
management behavior, mobility behavior (i.e. the likelihood of leaving the company),
A screenshot from the course with the data set on the left. Data is analyzed in SPSS.
The same data is also analyzed in R. In this fragment, the data is checked for homoscedasticity.
Challenge
The challenge for this data set is straightforward. Students get a data set briefing and
codebook with an explanation of the data. The briefing has six questions the students need
to answer. This is easier said than done: each answer is a full 30-minute lesson explaining
The course teaches you how to run these analyses both in SPSS and in R. Once you’re done
with the exercise, there are a number of other challenges that you can solve on your own.
Download
Unfortunately, this data set is not freely available. However, by enrolling for the Statistics in
HR course, you get full access to the data and learning material.
Conclusion
The lack of available data is one of the bottlenecks for HR analytics. We hope to partially
remove this bottleneck through this article. We also offered you a number of challenges for
each of the data sets to make sure you get the most out of it.
A drawback is that only two of these data sets contain real data. The rest is all artificially
generated. This can still be of good use for testing out different techniques. However, this
data likely has been created to share a practice for a statistical technique or to share a
narrative. The real data doesn’t have that same intention and is therefore more realistic.
This can be fixed by scraping real data from the internet. Jared Valdron started with this, by
sharing two scrapers for Meetup.com and WeWork. These can be used for inspiration to
If you know of any publicly available HR analytics data that we’ve missed, please let us know
in the comments. We will update this article accordingly.