Human Resources
Human Resources (HR) attrition is a significant concern for organisations across industries. It refers to the loss of employees through
resignation, retirement, or termination, which can impact productivity, organisational
morale, and business continuity. Therefore, understanding the drivers of attrition and
implementing strategic approaches to manage it is critical for all organisations.
Understanding HR Attrition
Attrition is a multifaceted issue, driven primarily by factors such as job dissatisfaction, limited growth opportunities, inadequate compensation, poor work-life balance, and ineffective leadership. External labour market conditions, such as competitive job opportunities and the prevailing economic climate, can also contribute to employee attrition.
3. Growth Opportunities: Employees are more likely to stay if they can see a clear
career path in the organisation. Thus, opportunities for learning, development, and
career progression can be an effective strategy to reduce attrition.
Conclusion
In this project I will be using the power of R to analyse a dataset and explore this fascinating subject. I will provide a detailed analysis of the dataset, as well as a detailed explanation of the R code I have written.
The Data
The IBM HR Analytics Employee Attrition dataset, which is available on Kaggle and can be accessed here, provides a comprehensive view of the various factors that may influence an employee's decision to leave an organisation (also referred to as "attrition"). This
dataset is synthetic and has been created by IBM data scientists to help others
explore meaningful analytics and predictive modelling around human resources
topics.
Analysing the above data can provide insights into the factors influencing employee
attrition, which is a key concern for any organisation. Identifying patterns and
correlations between parameters such as satisfaction level, number of projects, and
average monthly hours, could help decipher underlying reasons for employee
turnover. For instance, understanding the relationship between monthly hours and
employee satisfaction might shed light on the balance between workload and
contentment. This analysis can aid in devising strategies for improved workload
management and employee welfare. The study of attrition patterns can also indicate whether employees with a particular range of tenure, or those involved in a specific number of projects, are more likely to leave. This could be crucial in moulding retention strategies
or redefining job roles. Ultimately, such an analysis will be instrumental in increasing
the efficiency of the HR department, reducing employee turnover, and thereby,
enhancing the overall health of the organisation.
The Analysis
The above code signifies the start of my data analysis project in R. It's all about
setting the foundation by installing and loading the necessary packages and libraries I
require.
The first two lines install the 'corrplot' and 'naniar' packages using
install.packages(). 'corrplot' enables graphical display of a correlation matrix, which
is a compact way to represent relationships among variables. 'naniar', on the other
hand, is invaluable for handling missing data.
Following the installation, library() is used to load these packages, and other useful
ones into the current R session. 'Plotly' and 'ggplot2' are for data visualisation, with the former enabling interactive plots and the latter providing a robust, layered framework for building static graphics.
'Reshape2' allows me to modify the data layout, crucial for certain analyses, while
'GGally' extends 'ggplot2' capabilities, offering additional plot types.
In short, these libraries and packages, when installed and loaded, equip me with
essential tools for data manipulation, analysis, and visualisation, making them a vital
first step in my data analysis process.
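For reference, a sketch of the setup described above would look something like the following (the original code is not reproduced here, so the exact order of calls is an assumption on my part):

# Install the packages used for correlation plots and missing-data checks
install.packages("corrplot")
install.packages("naniar")

# Load the full toolkit for this analysis
library(corrplot)   # graphical display of correlation matrices
library(naniar)     # exploring and summarising missing data
library(plotly)     # interactive visualisations
library(ggplot2)    # layered static graphics
library(reshape2)   # reshaping data between wide and long formats
library(GGally)     # ggplot2 extensions such as ggpairs()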
The above snippet of code embodies the initial steps of my exploratory data analysis,
a crucial stage in comprehending the structure and content of the dataset that I am
about to examine. For me it is also the first step in my data cleaning process (or at least a chance to see whether any cleaning is required).
The function head(hrdata) allows me to preview the first six rows of my dataset,
hrdata. It serves as an initial inspection of the dataset, giving a quick sense of the
variable names, associated values, and the general organisation of the data.
With str(hrdata), I delve further into understanding the intricacies of my dataset. It
reveals the type of data structure, the total number of observations and variables in
hrdata. It also presents me with each variable's name and data type (like numeric,
factor, integer, and so forth), accompanied by the initial few entries. This function is
particularly handy for getting a summary view of the type of data each variable
contains, enabling me to plan subsequent data processing steps more effectively.
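A minimal sketch of this step, assuming the dataset has already been downloaded from Kaggle (the file name shown is the one Kaggle uses, and reading it in with read.csv() is my assumption rather than the original code):

# Read the dataset into a data frame and take a first look at it
hrdata <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

head(hrdata)   # preview the first six rows
str(hrdata)    # structure: number of observations, variables, and their types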
Upon execution of the preceding code snippet, an extensive exploration of the hrdata
dataset was conducted, yielding profound insights into its structure and content. This
dataset comprises 1470 observations spanning across 35 variables, establishing a
substantial groundwork for an in-depth analysis.
Executing the head(hrdata) function unveiled the initial few observations for each
variable, offering a preliminary overview. Age, for instance, fluctuates significantly,
ranging from early-career individuals at 18 to experienced professionals at 60. The
department variable displays a diverse array of sectors, including 'Sales' and
'Research & Development', indicating a broad workforce representation. However,
certain variables, like EmployeeCount and StandardHours, demonstrate fixed values,
suggesting their limited usefulness for predictive analytics.
Delving deeper, the str(hrdata) function exposed the data type of each variable
along with the total observation count. This revealed a balanced mix of integers and
characters, underscoring the versatility of the dataset for various types of analysis.
Using the above code, I've made a strategic decision to refine my hrdata
dataset by removing specific columns that do not contribute valuable information for
my forthcoming analysis. These columns include EmployeeCount, StandardHours,
and Over18.
This is accomplished via the subset selection function in R, combined with logical
negation and the %in% operator. Essentially, this piece of code is instructing R to
preserve only those columns in hrdata that are not listed in my array of unnecessary
fields.
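A sketch of one idiomatic way to write this, matching the description above (the original code may differ in detail):

# Drop columns that carry no analytical value (constant or redundant fields)
cols_to_remove <- c("EmployeeCount", "StandardHours", "Over18")
hrdata <- hrdata[, !(names(hrdata) %in% cols_to_remove)]

# Re-check the structure to confirm the columns have been removed
str(hrdata)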
The output now shows that there are 1470 observations across 32 variables,
compared to the previous 35. The three columns - "EmployeeCount",
"StandardHours", and "Over18" - are no longer part of the dataset, confirming that the
column-removal operation was effective.
Re-executing these functions post column-removal is a crucial practice in professional
data analysis. It ensures that any alterations made to the dataset are both accurate
and beneficial to the ensuing data exploration and modelling phases.
The output of this function shows the number of missing values in each column of the
hrdata dataframe. For each variable (column), it shows a count of how many of the
1470 observations (rows) have missing (NA) values.
According to the output, every column has a count of 0, meaning there are no missing
values in any column. This is a crucial step in data pre-processing as missing values
can significantly impact the accuracy and reliability of data analysis. In this case, the
fact that the dataset does not contain any missing values simplifies the pre-
processing phase and allows me to proceed with further analysis or modelling with the
knowledge that my dataset is complete.
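The exact function used is not shown in the text, but a typical way to produce this per-column count of missing values would be:

# Count missing (NA) values in every column of hrdata
colSums(is.na(hrdata))

# Alternatively, the naniar package offers a tidy per-variable summary
naniar::miss_var_summary(hrdata)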
The code I wrote above shows how easy it is to compute correlations. The function 'cor' is
applied to ascertain the correlation between certain designated columns of the
'hrdata' dataset. These selected columns are "Age", "DailyRate",
"DistanceFromHome", "Education", "HourlyRate", "MonthlyIncome", "MonthlyRate",
"NumCompaniesWorked", "TotalWorkingYears", and "TrainingTimesLastYear".
The correlation function 'cor' in R computes the Pearson correlation, providing a
coefficient that ranges between -1 and 1. This coefficient represents both the strength
and direction of the linear relationship between two variables. The resulting
correlation matrix has each cell containing a correlation coefficient indicating the
linear relationship between the corresponding pair of variables.
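A minimal sketch of the calculation described above:

# Pearson correlation matrix for the selected numeric columns
num_cols <- c("Age", "DailyRate", "DistanceFromHome", "Education",
              "HourlyRate", "MonthlyIncome", "MonthlyRate",
              "NumCompaniesWorked", "TotalWorkingYears", "TrainingTimesLastYear")
cor(hrdata[, num_cols])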
The correlation between Age and TotalWorkingYears is notably strong and positive
(0.680), suggesting that as employees age, they typically accumulate more work
experience. Additionally, Age correlates moderately positively with MonthlyIncome
(0.498) and NumCompaniesWorked (0.300), hinting that older employees tend to
have higher monthly incomes and have worked with more companies.
Remember, while these correlations provide valuable initial insights, they don't
confirm causality.
While the previous code provides a basic correlation matrix that is more than adequate, I prefer my visualisations to have a little more polish and to be interactive. Therefore I wrote the code above, in which a subset is curated from the 'hrdata' dataset, specifically extracting the ten columns of interest. The next step is the
calculation of a correlation matrix for this subset, followed by reshaping it into a 'long
format', ideal for visualisation.
To visualise these correlations, the code utilises Plotly's 'plot_ly' function to create a
heatmap. This graphical representation uses a colour gradient, ranging from blue
(negative correlation) through pink (neutral) to red (positive correlation), enabling
easy recognition of the correlation intensity between variables. Numerical correlation
values are overlaid as text annotations for precise interpretation.
The aesthetic attributes, including the title and fonts, are adjusted to enhance
readability, producing a clean and organised layout. The final outcome, an interactive
heatmap, is saved as an HTML file for convenient access and exploration in the future.
The preferential choice of Plotly as the visualisation tool is due to its interactive and
visually engaging features, enriching the data analysis experience and making the
process of understanding complex datasets considerably easier.
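A sketch of this approach (the colour stops, file name, and layout options are assumptions, and the text annotations mentioned above are omitted for brevity):

library(plotly)
library(reshape2)

# Reshape the correlation matrix into long format and plot it as an interactive heatmap
num_cols <- c("Age", "DailyRate", "DistanceFromHome", "Education",
              "HourlyRate", "MonthlyIncome", "MonthlyRate",
              "NumCompaniesWorked", "TotalWorkingYears", "TrainingTimesLastYear")
cor_long <- melt(cor(hrdata[, num_cols]))

heatmap_plot <- plot_ly(
  data = cor_long,
  x = ~Var1, y = ~Var2, z = ~value,
  type = "heatmap",
  colors = colorRamp(c("blue", "pink", "red"))   # blue = negative, red = positive
) %>%
  layout(title = "Correlation Heatmap of Selected HR Variables")

htmlwidgets::saveWidget(heatmap_plot, "CorrelationHeatmap.html")
heatmap_plot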
The code I wrote above performs the task of creating a scatterplot matrix using four
specific variables from the 'hrdata' dataset, which are 'MonthlyIncome', 'Age',
'TotalWorkingYears', and 'Education'. The code demonstrates (like the correlation
code) how easy it is to create the visualisation. The scatterplot matrix, a pivotal tool in
exploratory data analysis, provides a comprehensive visualisation of pairwise
relationships among these selected variables.
The process commences by calling the png function to open a new PNG device,
denoting "ScatterplotMatrix.png" as the filename for the ensuing plot. Using the pairs
function, the scatterplot matrix is constructed, incorporating arguments that specify
the variables to be analysed, the parent data frame, and the title for the plot.
After plotting, the dev.off function is employed to terminate the PNG device,
effectively capturing the plot as a PNG image within the working directory.
Subsequently, the scatterplot matrix is reproduced within the R environment itself via
the pairs function, fostering immediate, interactive data analysis and visual
inspection.
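A minimal sketch of what this might look like (the plot title and exact formula are assumptions):

# Save the scatterplot matrix to a PNG file in the working directory
png("ScatterplotMatrix.png")
pairs(~ MonthlyIncome + Age + TotalWorkingYears + Education,
      data = hrdata,
      main = "Scatterplot Matrix of Selected HR Variables")
dev.off()

# Reproduce the plot in the active R session for immediate inspection
pairs(~ MonthlyIncome + Age + TotalWorkingYears + Education,
      data = hrdata,
      main = "Scatterplot Matrix of Selected HR Variables")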
In a similar vein to what I did previously with the correlation heatmap, I decided to
write the above R code to facilitate the generation of an enhanced, interactive
scatterplot matrix with added linear regression lines. This approach builds upon the
basic scatterplot matrix offering richer insights into the nature of relationships
between variables.
At the outset, a bespoke function scatter_with_lm is created. This function uses the
'ggplot2' package's ggplot, geom_point and geom_smooth functions to generate
scatterplots overlaid with a linear regression line. The function takes three
parameters: the data, mapping, and any additional parameters for geom_smooth.
Following this, the ggpairs function from the 'GGally' package is used to generate an
advanced scatterplot matrix. This matrix includes variables 'MonthlyIncome', 'Age',
'TotalWorkingYears', 'Education', and 'Gender' from the 'hrdata' dataset. The
scatterplot matrix is customised further by colour coding data points according to
'Gender', adding correlation coefficients in the upper panels, and overlaying linear
regression lines in the lower panels using the previously defined scatter_with_lm
function. Manual colour scales are then specified for 'Gender' using
scale_color_manual and scale_fill_manual.
Subsequently, the plot title is set, and the entire plot is converted into an interactive
visual using the ggplotly function from the 'plotly' package. The resulting interactive
plot allows for more dynamic data exploration, where hovering over data points
reveals precise values, and zooming in facilitates a closer inspection of areas of
interest.
Finally, the interactive plot is saved as an HTML file using the 'htmlwidgets' package's
saveWidget function. This facilitates sharing and viewing the plot in web browsers,
extending its accessibility. The plot is also displayed in the R environment using the
print function.
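Sketched out, the approach might look roughly like this (panel options, titles, and file names are assumptions, and the manual colour scales described above are omitted for brevity):

library(ggplot2)
library(GGally)
library(plotly)

# Custom lower-panel function: scatterplot with a fitted linear regression line
scatter_with_lm <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "lm", se = FALSE, ...)
}

# Enhanced scatterplot matrix, coloured by Gender, with correlations in the upper panels
pair_plot <- ggpairs(
  hrdata[, c("MonthlyIncome", "Age", "TotalWorkingYears", "Education", "Gender")],
  mapping = aes(colour = Gender),
  upper = list(continuous = "cor"),
  lower = list(continuous = scatter_with_lm),
  title = "Scatterplot Matrix with Linear Regression Lines"
)

# Convert to an interactive plotly object, save it as HTML, and display it
interactive_pairs <- ggplotly(pair_plot)
htmlwidgets::saveWidget(interactive_pairs, "ScatterplotMatrixInteractive.html")
print(interactive_pairs)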
The enhanced scatterplot matrix and the embedded interactive features underscore
the strength of visual analytics. They empower stakeholders to glean intricate data
patterns, trends, and relationships effortlessly, fostering data-informed decision-
making processes.
From the enhanced scatterplot matrix three particularly strong positive correlations
are discernible in the data. Firstly, 'MonthlyIncome' and 'Age' demonstrate a
correlation of 0.498, suggesting that older employees tend to earn a higher monthly
income. When analysed across genders, this correlation appears to be slightly
stronger for females (0.505), indicating that age-related increments in monthly
income may be more pronounced for them as opposed to their male counterparts
(0.482).
The scatterplot matrix was enriched by encoding 'Gender' as a colour variable, which
allowed for a more granular, gender-specific exploration of trends in the HR data.
My next task revolves around probing assertions that the recent redundancies
predominantly impacted older employees (potentially indicating age discrimination)
and those who are relatively new to the organisation. To conduct a comprehensive
analysis of these claims, I will employ a two-pronged approach: first, I will generate
boxplots to visually inspect the distributions of age and tenure among the employees
who were laid off; second, I will perform statistical hypothesis tests to evaluate the
evidence supporting these claims, which will involve calculating p-values to quantify
the strength of this evidence.
I wrote two lots of code above, the first of which generates a basic boxplot, while the second produces a more detailed, interactive version using the plotly package. The boxplot
provides a graphical representation that illustrates the distribution of 'Age' for
different levels of 'Attrition'. This plot enables us to compare these distributions side-
by-side.
After generating the plot, the R script closes the PNG device. However, the boxplot is
subsequently recreated to be displayed within the R environment. This process
ensures that the boxplot is saved as an image file and is also available for immediate
viewing within R.
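A sketch of both versions (titles, labels, and file names are assumptions):

# Basic boxplot of Age by Attrition, written to a PNG file
png("AgeByAttritionBoxplot.png")
boxplot(Age ~ Attrition, data = hrdata,
        main = "Age Distribution by Attrition",
        xlab = "Attrition", ylab = "Age")
dev.off()

# Recreate the boxplot in the current session for immediate viewing
boxplot(Age ~ Attrition, data = hrdata,
        main = "Age Distribution by Attrition",
        xlab = "Attrition", ylab = "Age")

# Interactive version using plotly
library(plotly)
plot_ly(hrdata, x = ~Attrition, y = ~Age, type = "box")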
Based on visual inspection of the boxplot, it appears that the employees who were
made redundant predominantly fall within the age group of late twenties to late
thirties. However, to garner a more precise and statistically sound understanding, it's
critical to compare the mean ages of both groups — those who experienced
redundancy and those who didn't. For this purpose, I'll be implementing a Welch Two
Sample t-test, which will provide a more robust statistical framework to understand
any potential differences.
The code I wrote above is used to perform a statistical t-test. More specifically, this is
a Welch Two Sample t-test, which tests whether the means of two independent
groups are significantly different from each other.
Firstly, the code filters the 'hrdata' dataset to create two separate datasets. The
'yes_age' dataset consists of the ages of employees who experienced attrition (i.e.,
they left the company), while the 'no_age' dataset comprises the ages of employees
who did not experience attrition.
The t.test(yes_age, no_age) function is then used to perform the t-test. This test
will compare the means of the 'yes_age' and 'no_age' datasets. If the p-value
generated from the test is less than the significance level (commonly 0.05), we would
reject the null hypothesis that the two groups have the same mean. This would
indicate a statistically significant difference in the mean ages of the two groups,
providing further insight into whether age plays a role in employee attrition.
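As described above, the filtering and the test might be written as follows (the exact filtering syntax is an assumption):

# Split ages by attrition status
yes_age <- hrdata$Age[hrdata$Attrition == "Yes"]  # employees who left
no_age  <- hrdata$Age[hrdata$Attrition == "No"]   # employees who stayed

# Welch Two Sample t-test comparing the mean ages of the two groups
t.test(yes_age, no_age)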
Let's have a look at the output and see what insights can be drawn.
After running the test, the t-value is -5.828, which is the calculated difference in
sample means in units of standard error. Its negative sign indicates that the first
mean is smaller than the second.
The p-value is reported as 1.38e-08, which is a very small number. This is the
probability of observing such a large absolute t-value (or larger) if the null hypothesis
of equal means were true. Given this very low p-value, we would reject the null
hypothesis at the conventional 5% significance level, and conclude that there's a
statistically significant difference in the mean ages of the two groups.
The 95% confidence interval for the difference in means is [-5.288346, -2.618930].
This means that, if we repeated this sampling procedure many times, 95% of the intervals constructed in this way would contain the true difference in population means.
Importantly, as this interval does not include zero, it provides further evidence that
the mean ages of the two groups are significantly different.
The final part of the output provides the sample estimates for the mean of each
group. The mean age of the group that left the company ('yes_age') is approximately
33.61, while the mean age of the group that stayed ('no_age') is approximately 37.56.
I have decided to delve deeper into the HR dataset and shift my focus from the 'Age'
attribute to 'EmployeeNumber'. The aim is to uncover any potential disparities in
attrition rates based on the employee number. This choice stems from the hypothesis
that newer employees (indicated by a higher employee number) may have been
impacted differently in terms of attrition compared to those who have been with the
company for a longer period. To test this hypothesis, I will be generating a new boxplot (above) and performing another Welch Two Sample t-test (below). These
methods will provide additional visual and statistical analysis respectively, offering a
comprehensive insight into the relationship between 'EmployeeNumber' and
'Attrition'.
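A sketch of this second round of analysis, mirroring the approach used for Age (plot labels and object names are assumptions):

# Boxplot of EmployeeNumber by Attrition
boxplot(EmployeeNumber ~ Attrition, data = hrdata,
        main = "Employee Number Distribution by Attrition",
        xlab = "Attrition", ylab = "EmployeeNumber")

# Welch Two Sample t-test on EmployeeNumber for leavers versus stayers
yes_empno <- hrdata$EmployeeNumber[hrdata$Attrition == "Yes"]
no_empno  <- hrdata$EmployeeNumber[hrdata$Attrition == "No"]
t.test(yes_empno, no_empno)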
The key results from the t-test are as follows:
The t-statistic is -0.41725, which is the test statistic score. This score is used to compare the groups
under study. In this context, a negative value suggests that the first group (employees who left the
company) had a lower mean employee number compared to the second group (those who stayed).
The degrees of freedom (df) is 342.33, which is used in the calculation of the p-value. Degrees of
freedom generally represents the number of independent pieces of information available to estimate
another piece of information.
The p-value is 0.6768. This value is used to determine the statistical significance of the observed
difference in means. In general, a p-value of less than 0.05 is considered statistically significant. In
this case, since the p-value is quite high (greater than 0.05), we cannot reject the null hypothesis.
This means there is no significant evidence to suggest that the mean employee numbers differ
between the employees who left the company and those who stayed.
The 95 percent confidence interval is from -98.91087 to 64.29061. This range represents the interval
within which we can be 95% confident that the true difference in mean employee numbers between
the groups lies.
The mean of x and y represent the average employee numbers for those who left the company and
those who stayed, respectively. Here, 'x' corresponds to employees who left the company (mean =
1010.346), and 'y' corresponds to employees who stayed (mean = 1027.656).
In summary, the Welch Two Sample t-test provides no significant evidence to suggest
that newer employees (indicated by a higher employee number) are more likely to
leave the company.
The code I wrote above performs linear regression analysis on the 'hrdata' dataset,
with 'MonthlyIncome' as the dependent variable and 'Age' as the independent
variable.
In summary, the script fits a linear regression model to predict 'MonthlyIncome' based
on 'Age' and then generates a comprehensive summary of the model to enable
interpretation of the model's performance and the statistical significance of the
predictor variable 'Age'.
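A minimal sketch of the model fit (the model object name is an assumption):

# Simple linear regression: MonthlyIncome explained by Age
income_model <- lm(MonthlyIncome ~ Age, data = hrdata)
summary(income_model)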
Now that I have run the code, let's examine the output in more detail:
1. Call: This displays the function call used to compute the regression.
2. Residuals: These represent the differences between the observed (actual) and predicted values. Here,
the 1st quartile (25th percentile), median (50th percentile), and 3rd quartile (75th percentile) show
the spread and skewness of residuals. Large residuals suggest that our model doesn't capture the data trend well; in this case, the residuals seem quite large.
3. Coefficients: The 'Estimate' column under 'Coefficients' provides the intercept and slope of the
regression line, i.e., -2970.67 and 256.57, respectively. The intercept can be interpreted as the
expected value of MonthlyIncome when Age is zero, but in this context, a zero age doesn't make
sense, so the intercept may not have a meaningful interpretation. The positive slope suggests that
with every additional year of age, the monthly income is expected to increase by about 256.57 units,
on average.
The 'Std. Error' column provides the standard errors of the estimated coefficients,
which measure the variability in the estimate for the coefficients. The 't value' is the t-
statistic, and 'Pr(>|t|)' is the p-value associated with this t-statistic. The three
asterisks next to the p-values indicate that the predictors are highly significant (p <
0.001).
4. Residual standard error: This is the standard deviation of the residuals, and it's used to measure the model's accuracy. A smaller residual standard error means the model has a smaller random error component, and hence is a better fit.
5. Multiple R-squared: This is the proportion of variance in the dependent variable (MonthlyIncome) that can be explained by the independent variable (Age). Here, about 24.79% of the variability in MonthlyIncome can be explained by Age.
6. F-statistic and p-value: The F-statistic is used to test the overall significance of the model. The p-value associated with the F-statistic is practically zero, which provides evidence against the null hypothesis that all regression coefficients are zero. In other words, it suggests that the model as a whole is statistically significant.
Remember, while these statistics provide valuable insights about the relationship
between Age and MonthlyIncome, the data may not meet all assumptions of linear
regression, such as linearity, independence, homoscedasticity, and normality of
residuals.
Building upon my previous analysis, I've decided to delve deeper into the relationship
between employees' ages, their total years of work experience and their monthly
income. The aim is to understand how both age and total working years
simultaneously influence the monthly income of employees. To achieve this, I've
conducted another linear regression analysis, this time including the
'TotalWorkingYears' as an additional independent variable. This approach provides a
more comprehensive view of the factors affecting monthly income, enhancing the
explanatory power of the model.
The code I wrote above executes a multiple linear regression model to predict
'MonthlyIncome' using 'Age' and 'TotalWorkingYears' as predictors. It generates a
summary of the model's results which I will look into in more detail below.
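A minimal sketch of this second model (the object name is again an assumption):

# Multiple linear regression: MonthlyIncome explained by Age and TotalWorkingYears
income_model2 <- lm(MonthlyIncome ~ Age + TotalWorkingYears, data = hrdata)
summary(income_model2)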
The 'Coefficients' section provides the estimated effect size and statistical significance
for each predictor:
1. 'Intercept' is the estimated 'MonthlyIncome' when both 'Age' and 'TotalWorkingYears' are zero,
which doesn't practically apply to our case, but it's mathematically needed for the model. It is
statistically significant, as indicated by its p-value (2.36e-08) being much smaller than 0.05.
2. 'Age' has a negative coefficient (-26.87), which means older age is associated with a slight decrease in 'MonthlyIncome' when 'TotalWorkingYears' is held constant. The effect size is small but statistically significant (p-value = 0.021).
3. 'TotalWorkingYears' has a positive coefficient (489.13), indicating that as the total working years
increase, the 'MonthlyIncome' also increases, holding 'Age' constant. This effect is statistically
significant (p-value < 2e-16), implying that 'TotalWorkingYears' is a significant predictor of
'MonthlyIncome'.
The 'Residuals' section describes the spread of the residuals (the difference between
the actual and predicted values). The closer the residuals are to 0, the better the
model fits the data.
The 'Multiple R-squared' value (0.5988) indicates that around 60% of the variance in
'MonthlyIncome' can be explained by 'Age' and 'TotalWorkingYears', while the
'Adjusted R-squared' value (0.5983) adjusts this statistic based on the number of
predictors in the model.
The F-statistic (1095) and its associated p-value (< 2.2e-16) indicate that the model
as a whole (i.e., all predictors together) is statistically significant.
In summary, the multiple linear regression model demonstrates that both 'Age' and
'TotalWorkingYears' are statistically significant predictors of 'MonthlyIncome'.
However, 'Age' shows a minor negative effect, suggesting that older employees,
considering their total working years, may earn slightly less. 'TotalWorkingYears'
significantly influences 'MonthlyIncome', indicating that employees with more working
experience tend to earn more. The model accounts for approximately 60% of the
variability in 'MonthlyIncome', suggesting a decent fit to the data, although other
unconsidered factors could be contributing to 'MonthlyIncome'. The F-statistic
validates that the model's predictors are collectively significant.
In order to present a more comprehensive and interactive view of the second linear
regression model, I have created a 3D scatter plot. This plot visually captures the
relationship between Monthly Income, Age, and Total Working Years in our dataset. By
leveraging the power of three dimensions, the scatter plot effectively depicts the
correlations and patterns existing among these variables. The plot is colour-coded
based on the monthly income to provide an additional layer of understanding. This
visual tool enables me and stakeholders to comprehend and interpret the model's
results better, fostering a deeper understanding of the underlying data patterns. It's
an effective way to appreciate the model's complexity and nuances that a simple
numerical summary might not fully encapsulate.
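A sketch of how such a plot can be built with plotly (marker size, colour mapping, and file name are assumptions):

library(plotly)

# Interactive 3D scatter plot of Age, TotalWorkingYears and MonthlyIncome,
# with points coloured by MonthlyIncome
scatter_3d <- plot_ly(
  hrdata,
  x = ~Age, y = ~TotalWorkingYears, z = ~MonthlyIncome,
  color = ~MonthlyIncome,
  type = "scatter3d", mode = "markers",
  marker = list(size = 3)
) %>%
  layout(title = "Monthly Income by Age and Total Working Years")

htmlwidgets::saveWidget(scatter_3d, "Income3DScatter.html")
scatter_3d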
Conclusion/Insights
From my analysis I was able to draw multiple conclusions:
1. Age and Work Experience: Older employees generally have more years of work experience and
tend to have higher monthly incomes. This strong positive correlation between age and total working
years is more pronounced for females.
2. Monthly Income and Work Experience: There is a strong positive correlation between total
working years and monthly income. This suggests that longer tenures at the company are associated
with higher monthly incomes, with this association slightly more pronounced for male employees.
3. Education and Age/Number of Companies Worked: Older employees or those who have worked
with more companies tend to have slightly higher education levels, but this correlation is weak,
suggesting education may not be a significant predictor of age or the number of companies one has
worked for.
4. Distance from Home and Training: Employees living further from home may have slightly fewer
training instances in the last year, but this negative correlation is weak, suggesting the relationship
may not be strong or may be influenced by other factors.
5. Pay Rates and Other Variables: Daily rate, hourly rate, and monthly rate show minimal
correlations with other variables, suggesting they might not be good predictors for other factors such
as age, total working years, education, and so on.
6. Redundancy and Age: Based on the Welch Two Sample t-test, there is a statistically significant
difference in the mean ages of employees who were made redundant and those who weren't. The
group that experienced redundancy predominantly falls within the late twenties to late thirties age
group.
7. Linear Regression Analysis (Single and Multiple): In the case of single linear regression, age
explains about 24.79% of the variability in monthly income. The positive slope suggests an increase
in monthly income with an increase in age. In the multiple linear regression model, both age and
total working years significantly predict monthly income. However, age shows a slight negative
effect when total working years is held constant. On the other hand, total working years significantly
influence monthly income, suggesting that more working experience leads to higher income.
8. Employee Turnover: There's no significant evidence to suggest that newer employees are more
likely to leave the company.
Remember, correlation does not imply causation. While these relationships are
evident in the dataset, further analysis would be necessary to understand the
underlying causal mechanisms. For example, does working for longer necessarily lead
to a higher income, or are there other factors at play?
Moreover, while the models explain a good portion of the variability in the data (single
model: ~25%, multiple model: ~60%), a significant proportion remains unexplained.
There may be other important predictors not included in these models.
Lastly, the variables should be checked for multicollinearity, especially in the multiple
linear regression model. This situation, where predictor variables are correlated with
one another, can inflate the variance of the regression coefficients and make the
estimates very sensitive to minor changes in the model.