
Human Resources (HR) attrition, also known as employee turnover, is an enduring concern for organisations across industries. It refers to the loss of employees through
resignation, retirement, or termination, which can impact productivity, organisational
morale, and business continuity. Therefore, understanding the drivers of attrition and
implementing strategic approaches to manage it is critical for all organisations.

Understanding HR Attrition

Attrition is a multifaceted issue and is primarily driven by various factors, such as job
dissatisfaction, limited growth opportunities, inadequate compensation, poor work-life
balance, and ineffective leadership. Additionally, the external labour market
conditions, such as competitive job opportunities and prevailing economic conditions,
can also contribute to employee attrition.

Economic implications of high attrition rates

High attrition rates can exert considerable economic stress on organisations.


Recruitment, onboarding, training, and development of employees demand
substantial financial resources. Thus, when employees depart, organisations not only
lose the invested resources but also incur additional costs to replace the exiting staff.
Moreover, it affects knowledge retention and continuity, leading to decreased
productivity and, ultimately, profitability. For these reasons, monitoring and managing
attrition rates is of paramount importance to organisations.

Effects on organisational culture

Beyond the economic ramifications, attrition can significantly impact organisational culture and morale. The departure of valued employees can lower morale, reduce
motivation, and engender a sense of instability within the workforce. Conversely, low
attrition rates can cultivate a sense of continuity, security, and loyalty, fostering a
positive organisational culture that enhances productivity and job satisfaction.

Strategies to manage HR Attrition

Understanding and managing HR attrition requires a comprehensive and strategic approach. Here are some strategies that can be employed:

1. Employee Engagement: A highly engaged workforce is less likely to seek opportunities elsewhere. Regular communication, employee recognition programs, and opportunities for career development can enhance employee engagement.

2. Competitive Compensation: Ensuring that pay packages are competitive in the relevant industry and local area is crucial in reducing attrition rates.

3. Growth Opportunities: Employees are more likely to stay if they can see a clear
career path in the organisation. Thus, opportunities for learning, development, and
career progression can be an effective strategy to reduce attrition.

4. Leadership Development: Leaders play a critical role in creating a positive work environment that encourages employee retention. Investing in leadership development can therefore be a vital element in managing attrition.

5. Work-life Balance: Encouraging a healthy work-life balance through measures such as flexible working hours and remote working options can contribute to lower attrition rates.
6. Employee Wellbeing: Prioritising employee wellbeing, including both physical and
mental health, through initiatives such as health insurance, counselling services, and
fitness programs, can also be an effective strategy.

Conclusion

In conclusion, HR attrition is a complex phenomenon that poses significant challenges to businesses. It is essential to monitor attrition rates and understand the underlying
causes to effectively address this issue. By employing strategic approaches, including
fostering employee engagement, offering competitive compensation and growth
opportunities, developing effective leaders, promoting work-life balance, and
prioritising employee wellbeing, organisations can effectively manage attrition,
ensuring their longevity and success in today's competitive market.

In this project I will be using the power of R to analyse a dataset to look at this fascinating subject. I will be providing a detailed analysis of the dataset as well as a detailed explanation of the code I have written in R.

The Data

The IBM HR Analytics Employee Attrition dataset, available on Kaggle, provides a comprehensive view of various factors that may influence
an employee's decision to leave an organisation (also referred to as "attrition"). This
dataset is synthetic and has been created by IBM data scientists to help others
explore meaningful analytics and predictive modelling around human resources
topics.

Here are the key features of this dataset:

1. Age: This is the age of the employee, measured in years.
2. Attrition: This is a binary variable indicating whether the employee has left the company or not.
This is the primary target variable for most analyses using this dataset.
3. BusinessTravel: This categorical variable indicates how frequently an employee travels as part of
their role, with options such as 'Travel_Rarely', 'Travel_Frequently', and 'Non-Travel'.
4. DailyRate: This is the daily rate of payment for the employee.
5. Department: This categorical variable represents the department in which an employee works, such
as 'Sales', 'Research & Development', or 'Human Resources'.
6. DistanceFromHome: This represents the distance from the employee's home to the workplace,
measured in kilometres.
7. Education: This is an ordinal variable reflecting the level of education of the employee, where 1
'Below College', 2 'College', 3 'Bachelor', 4 'Master', and 5 'Doctor'.
8. EducationField: This categorical variable reflects the field of study of the employee.
9. EmployeeCount: All values are 1 for this dataset (more on this later).
10. EmployeeNumber: This is the unique identifier for each employee.
11. EnvironmentSatisfaction: This ordinal variable reflects the employee's satisfaction with the work
environment, where 1 'Low', 2 'Medium', 3 'High', and 4 'Very High'.
12. Gender: This categorical variable indicates the gender of the employee.
13. HourlyRate: This is the hourly rate of payment for the employee.
14. JobInvolvement: This ordinal variable reflects the level of job involvement for the employee, where
1 'Low', 2 'Medium', 3 'High', and 4 'Very High'.
15. JobLevel: This is an ordinal variable that represents the level of job held by the employee, where a
higher number indicates a more senior role.
16. JobRole: This categorical variable indicates the role of the employee within the company.
17. JobSatisfaction: This ordinal variable reflects the employee's satisfaction with their job, where 1
'Low', 2 'Medium', 3 'High', and 4 'Very High'.
18. MaritalStatus: This categorical variable reflects the marital status of the employee.
19. MonthlyIncome: This is the monthly income of the employee.
20. MonthlyRate: This is the monthly rate of the employee.
21. NumCompaniesWorked: This indicates the number of different companies for which an employee
has worked over their career.
22. Over18: This categorical variable indicates whether the employee is over 18 years of age.
23. OverTime: This categorical variable indicates whether the employee works overtime.
24. PercentSalaryHike: This reflects the percentage increase in salary for the employee from the last
fiscal year.
25. PerformanceRating: This ordinal variable reflects the employee's performance rating, where 1
'Low', 2 'Good', 3 'Excellent', and 4 'Outstanding'.
26. RelationshipSatisfaction: This ordinal variable reflects the employee's satisfaction with their work
relationships, where 1 'Low', 2 'Medium', 3 'High', and 4 'Very High'.
27. StandardHours: All employees in this dataset work the same standard hours (more on this later).
28. StockOptionLevel: This reflects the level of stock options provided to the employee.
29. TotalWorkingYears: This is the total number of years that an employee has been working.
30. TrainingTimesLastYear: This indicates the number of times that an employee underwent training
in the last year.
31. WorkLifeBalance: This ordinal variable reflects the employee's work-life balance, where 1 'Bad', 2
'Good', 3 'Better', and 4 'Best'.
32. YearsAtCompany: This is the number of years an employee has worked at the current company.
33. YearsInCurrentRole: This is the number of years an employee has been in their current role within
the company.
34. YearsSinceLastPromotion: This is the number of years since the employee's last promotion.
35. YearsWithCurrManager: This is the number of years an employee has been with their current
manager.

Analysing the above data can provide insights into the factors influencing employee
attrition, which is a key concern for any organisation. Identifying patterns and
correlations between variables such as job satisfaction, overtime, monthly income, and
years at the company could help decipher underlying reasons for employee turnover.
For instance, understanding the relationship between overtime and job satisfaction
might shed light on the balance between workload and contentment. This analysis can
aid in devising strategies for improved workload management and employee welfare.
The study of attrition patterns can also indicate whether employees with a particular
range of tenure, or those in particular job roles, are more likely to leave. This could be
crucial in moulding retention strategies or redefining job roles. Ultimately, such an
analysis will be instrumental in increasing the efficiency of the HR department,
reducing employee turnover, and thereby, enhancing the overall health of the
organisation.

The Analysis
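In outline, my setup code amounts to something like the sketch below (a minimal version; the exact calls and ordering in my script may differ slightly):

# Install the packages used for correlation plots and missing-data checks
install.packages("corrplot")
install.packages("naniar")

# Load these and the other libraries used throughout the analysis
library(corrplot)    # graphical display of correlation matrices
library(naniar)      # exploring and handling missing data
library(plotly)      # interactive plots
library(ggplot2)     # grammar-of-graphics plotting
library(reshape2)    # reshaping data between wide and long formats
library(GGally)      # ggplot2 extensions such as scatterplot matrices
library(htmlwidgets) # saving interactive, JavaScript-based plots as HTML
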
The above code signifies the start of my data analysis project in R. It's all about
setting the foundation by installing and loading the necessary packages and libraries I
require.

The first two lines install the 'corrplot' and 'naniar' packages using
install.packages(). 'corrplot' enables graphical display of a correlation matrix, which
is a compact way to represent relationships among variables. 'naniar', on the other
hand, is invaluable for handling missing data.

Following the installation, library() is used to load these packages, and other useful ones, into the current R session. 'plotly' and 'ggplot2' are for data visualisation, with the former enabling interactive plots and the latter providing a robust, layered framework for static plotting.

'reshape2' allows me to modify the data layout, crucial for certain analyses, while 'GGally' extends 'ggplot2' capabilities, offering additional plot types.

Finally, 'htmlwidgets' brings JavaScript-based data visualisation to R, supporting dynamic, web-based graphs directly from R data.

In short, these libraries and packages, when installed and loaded, equip me with
essential tools for data manipulation, analysis, and visualisation, making them a vital
first step in my data analysis process.

The code snippet hrdata <- read.csv("HR-Employee-Attrition.csv") accomplishes a crucial initial step in any data analysis: importing data into the R environment. The
read.csv() function reads the CSV file "HR-Employee-Attrition.csv" from the current
working directory, translating its contents into a data frame. This data frame, a
standard data structure in R used for easy manipulation and analysis, is then stored in
the object hrdata. This simple line of code forms the critical first step in converting
raw data into insightful analysis.
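The first look at the data then needs only three short calls, sketched here:

head(hrdata)     # preview the first six rows
str(hrdata)      # structure: number of observations, variable names and types
summary(hrdata)  # descriptive statistics for every variable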

The above snippet of code embodies the initial steps of my exploratory data analysis, a crucial stage in comprehending the structure and content of the dataset that I am about to examine. For me it is also the first step in my data cleaning process (or at least a check of whether any cleaning is required).

The function head(hrdata) allows me to preview the first six rows of my dataset,
hrdata. It serves as an initial inspection of the dataset, giving a quick sense of the
variable names, associated values, and the general organisation of the data.
With str(hrdata), I delve further into understanding the intricacies of my dataset. It
reveals the type of data structure, the total number of observations and variables in
hrdata. It also presents me with each variable's name and data type (like numeric,
factor, integer, and so forth), accompanied by the initial few entries. This function is
particularly handy for getting a summary view of the type of data each variable
contains, enabling me to plan subsequent data processing steps more effectively.

Lastly, summary(hrdata) is a function that offers a statistical summary of every variable in my dataset. It produces key descriptive statistics like the minimum, first
quartile, median, mean, third quartile, and maximum for numeric data, and level
counts for factor data. This function facilitates a quick comprehension of the
distributions and variations present in my data, helping me spot anomalies like
outliers, missing values, or data skewness.

Collectively, these commands provide a comprehensive overview of my dataset, laying a strong foundation for the forthcoming steps in my data analysis journey.

Upon execution of the preceding code snippet, an extensive exploration of the hrdata dataset was conducted, yielding useful insights into its structure and content. This dataset comprises 1470 observations across 35 variables, establishing a substantial groundwork for an in-depth analysis.

The diversity of variables within the dataset is remarkable, encompassing both numerical and categorical types. Numerical variables, including Age, DailyRate,
DistanceFromHome, Education, and others, offer a quantitative perspective on the
data. Concurrently, categorical variables, such as Attrition, BusinessTravel,
Department, and EducationField, provide a qualitative insight, contributing to a rich,
multi-dimensional analysis.

Executing the head(hrdata) function unveiled the initial few observations for each
variable, offering a preliminary overview. Age, for instance, varies widely, ranging from early-career individuals at 18 to experienced professionals at 60. The Department variable displays a diverse array of sectors, including 'Sales' and
'Research & Development', indicating a broad workforce representation. However,
certain variables, like EmployeeCount and StandardHours, demonstrate fixed values,
suggesting their limited usefulness for predictive analytics.

Delving deeper, the str(hrdata) function exposed the data type of each variable
along with the total observation count. This revealed a balanced mix of integers and
characters, underscoring the versatility of the dataset for various types of analysis.

Subsequently, the summary(hrdata) function was executed, providing a comprehensive statistical breakdown of each variable. For instance, Age exhibits a
median and mean value of approximately 36 and 37 respectively, implying a balanced
distribution. MonthlyIncome, however, demonstrates substantial diversity, with a
range from 1009 to an impressive 19999. The dataset also encompasses ordinal data
like Education, EnvironmentSatisfaction, and JobInvolvement, where integers denote
various levels.

Interestingly, variables such as Over18, EmployeeCount, and StandardHours seem invariant, reducing their potential contribution to predictive modelling. Conversely,
categorical variables, including Attrition, BusinessTravel, Department, EducationField,
and others, possess the potential to significantly influence the outcomes of the
analysis, depending on the research objectives.

In conclusion, this preliminary exploration was highly informative, offering a detailed perspective of the dataset's composition. It paves the way for more informed decision-making regarding the subsequent steps of this data analysis endeavour.
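A minimal sketch of the pruning step that follows, assuming base-R bracket subsetting with the %in% operator (the helper name drop_cols is purely illustrative):

# Drop the columns that carry no useful variation
drop_cols <- c("EmployeeCount", "StandardHours", "Over18")
hrdata <- hrdata[, !(names(hrdata) %in% drop_cols)]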

Using the above code, I've implemented a strategic decision to refine my hrdata
dataset by removing specific columns that do not contribute valuable information for
my forthcoming analysis. These columns include EmployeeCount, StandardHours,
and Over18.

This is accomplished via the subset selection function in R, combined with logical
negation and the %in% operator. Essentially, this piece of code is instructing R to
preserve only those columns in hrdata that are not listed in my array of unnecessary
fields.

Consequently, this operation results in a more streamlined hrdata dataset, free of uninformative variables, hence fostering a more efficient and focused analysis in subsequent steps.

After implementing the process of feature selection and eliminating the superfluous
columns from the hrdata dataset, re-running the str(hrdata) and
summary(hrdata) functions is indeed a judicious next step.

Executing str(hrdata) again allows me to confirm that the removal of unwanted columns - in this case, "EmployeeCount", "StandardHours", and "Over18" - has been
successful. It enables me to inspect the revised structure of my dataset. This process
ensures data integrity by verifying that only the relevant variables remain, and the
dataset is now more focused towards the objectives of our analysis.

Subsequently, running summary(hrdata) once more provides an updated statistical summary of the newly streamlined dataset. This function generates central tendency
measures, dispersion metrics, and other pertinent statistical information for each
remaining variable. These statistics help me understand the new landscape of my
data following the pruning process and prepare me for subsequent stages of the data
analysis pipeline, such as data normalization or transformation, modelling, and
evaluation.

The output now shows that there are 1470 observations across 32 variables,
compared to the previous 35. The three columns - "EmployeeCount",
"StandardHours", and "Over18" - are no longer part of the dataset, confirming that the
column-removal operation was effective.
Re-executing these functions post column-removal is a crucial practice in professional
data analysis. It ensures that any alterations made to the dataset are both accurate
and beneficial to the ensuing data exploration and modelling phases.

The code I wrote above, sapply(hrdata, function(x) sum(is.na(x))), is an application of the sapply() function in R, which is used to apply a function to each
element of the specified list or vector. Here, the function being applied is function(x)
sum(is.na(x)), which calculates the sum of missing values (represented by NA in R)
for each element x (which, in this context, refers to each column of hrdata).

The output of this function shows the number of missing values in each column of the
hrdata dataframe. For each variable (column), it shows a count of how many of the
1470 observations (rows) have missing (NA) values.

According to the output, every column has a count of 0, meaning there are no missing
values in any column. This is a crucial step in data pre-processing as missing values
can significantly impact the accuracy and reliability of data analysis. In this case, the
fact that the dataset does not contain any missing values simplifies the pre-
processing phase and allows me to proceed with further analysis or modelling with the
knowledge that my dataset is complete.
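In outline, the correlation step that follows looks something like this (the object names num_cols and cor_matrix are illustrative):

# Pearson correlations between selected numeric variables
num_cols <- c("Age", "DailyRate", "DistanceFromHome", "Education", "HourlyRate",
              "MonthlyIncome", "MonthlyRate", "NumCompaniesWorked",
              "TotalWorkingYears", "TrainingTimesLastYear")
cor_matrix <- cor(hrdata[, num_cols])
round(cor_matrix, 3)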

The code I wrote above shows how easy it is to compute correlations. The function 'cor' is
applied to ascertain the correlation between certain designated columns of the
'hrdata' dataset. These selected columns are "Age", "DailyRate",
"DistanceFromHome", "Education", "HourlyRate", "MonthlyIncome", "MonthlyRate",
"NumCompaniesWorked", "TotalWorkingYears", and "TrainingTimesLastYear".
The correlation function 'cor' in R computes the Pearson correlation, providing a
coefficient that ranges between -1 and 1. This coefficient represents both the strength
and direction of the linear relationship between two variables. The resulting
correlation matrix has each cell containing a correlation coefficient indicating the
linear relationship between the corresponding pair of variables.

Such computation of correlation matrices is fundamental in data analytics as it aids in understanding the interrelationships among different attributes of the dataset,
thereby playing a critical role in various data analytics and machine learning
applications.

The resulting correlation matrix provides me with quantitative values, or correlation coefficients, which denote both the strength and directionality of linear associations
between different variable pairs.

The correlation between Age and TotalWorkingYears is notably strong and positive
(0.680), suggesting that as employees age, they typically accumulate more work
experience. Additionally, Age correlates moderately positively with MonthlyIncome
(0.498) and NumCompaniesWorked (0.300), hinting that older employees tend to
have higher monthly incomes and have worked with more companies.

Education shows a weak positive correlation with Age (0.208) and NumCompaniesWorked (0.126), indicating a subtle tendency for older employees
or those with experience in multiple companies to have higher education levels.

Conversely, DistanceFromHome and TrainingTimesLastYear display a weak negative correlation (-0.037), suggesting that employees living further from home
may have slightly fewer training instances in the past year.

Rates of pay (DailyRate, HourlyRate, MonthlyRate) show almost negligible correlations with other variables, denoting minimal linear associations.

Remember, while these correlations provide valuable initial insights, they don't
confirm causality.
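To present the same matrix more visually, the interactive heatmap described next can be sketched roughly as follows (object names, colours, and the output filename are illustrative, and the exact styling options in my script may differ):

# Curate the subset of ten columns of interest and recompute the correlations
hr_subset <- hrdata[, num_cols]
cor_matrix <- cor(hr_subset)

# Reshape the correlation matrix into long format for plotting
cor_long <- melt(cor_matrix)

# Interactive heatmap with the correlation values overlaid as text
heatmap_plot <- plot_ly(
  data = cor_long,
  x = ~Var1, y = ~Var2, z = ~value,
  type = "heatmap",
  colors = colorRamp(c("blue", "pink", "red"))
) %>%
  add_annotations(
    x = cor_long$Var1, y = cor_long$Var2,
    text = round(cor_long$value, 2),
    showarrow = FALSE
  ) %>%
  layout(title = "Correlation Heatmap of Selected HR Variables")

# Save the interactive plot as HTML for later viewing in a browser
saveWidget(heatmap_plot, "CorrelationHeatmap.html")
heatmap_plot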

While a plain call to cor() is more than adequate, I prefer my visualisations to offer a little more and to be interactive. Therefore I wrote the code above, in which a subset is curated from the 'hrdata' dataset, specifically extracting ten columns of interest. The next step is the calculation of a correlation matrix for this subset, followed by reshaping it into a 'long format', ideal for visualisation.

To visualise these correlations, the code utilises Plotly's 'plot_ly' function to create a
heatmap. This graphical representation uses a colour gradient, ranging from blue
(negative correlation) through pink (neutral) to red (positive correlation), enabling
easy recognition of the correlation intensity between variables. Numerical correlation
values are overlaid as text annotations for precise interpretation.

The aesthetic attributes, including the title and fonts, are adjusted to enhance
readability, producing a clean and organised layout. The final outcome, an interactive
heatmap, is saved as an HTML file for convenient access and exploration in the future.
The preferential choice of Plotly as the visualisation tool is due to its interactive and
visually engaging features, enriching the data analysis experience and making the
process of understanding complex datasets considerably easier.
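In outline, the scatterplot matrix code described next amounts to the following sketch (the plot title is illustrative; the filename matches the one mentioned below):

# Write the scatterplot matrix to a PNG file, then close the device
png("ScatterplotMatrix.png")
pairs(~ MonthlyIncome + Age + TotalWorkingYears + Education,
      data = hrdata, main = "Scatterplot Matrix")
dev.off()

# Draw the same matrix again inside the R session for immediate inspection
pairs(~ MonthlyIncome + Age + TotalWorkingYears + Education,
      data = hrdata, main = "Scatterplot Matrix")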

The code I wrote above performs the task of creating a scatterplot matrix using four
specific variables from the 'hrdata' dataset, which are 'MonthlyIncome', 'Age',
'TotalWorkingYears', and 'Education'. The code demonstrates (like the correlation
code) how easy it is to create the visualisation. The scatterplot matrix, a pivotal tool in
exploratory data analysis, provides a comprehensive visualisation of pairwise
relationships among these selected variables.

The process commences by calling the png function to open a new PNG device,
denoting "ScatterplotMatrix.png" as the filename for the ensuing plot. Using the pairs
function, the scatterplot matrix is constructed, incorporating arguments that specify
the variables to be analysed, the parent data frame, and the title for the plot.
After plotting, the dev.off function is employed to terminate the PNG device,
effectively capturing the plot as a PNG image within the working directory.

Subsequently, the scatterplot matrix is reproduced within the R environment itself via
the pairs function, fostering immediate, interactive data analysis and visual
inspection.

Scatterplot matrices are invaluable in revealing underlying patterns, correlations, or potentially causal relationships among variables. They present a granular view of data
distributions and correlations, supporting the discovery of data trends and anomalies,
which in turn, inform model selection and feature engineering in predictive analytics.
Furthermore, being a visual tool, scatterplot matrices enable intuitive comprehension
of multidimensional data, bridging the gap between complex statistical relationships
and human cognition. The generated plot can be a practical asset for stakeholders'
presentations or exploratory data analysis reports.
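The enhanced, interactive version described next can be sketched roughly as follows (the colour values, plot title, and output filename are illustrative, and the exact GGally options in my script may differ):

# Custom lower-panel function: scatterplot with an overlaid linear regression line
scatter_with_lm <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "lm", ...)
}

# Scatterplot matrix coloured by Gender, with correlation coefficients in the
# upper panels and regression lines in the lower panels
enhanced_matrix <- ggpairs(
  hrdata,
  columns = c("MonthlyIncome", "Age", "TotalWorkingYears", "Education", "Gender"),
  mapping = aes(colour = Gender),
  upper = list(continuous = "cor"),
  lower = list(continuous = scatter_with_lm),
  title = "Enhanced Scatterplot Matrix by Gender"
) +
  scale_color_manual(values = c("Female" = "#D63384", "Male" = "#0D6EFD")) +
  scale_fill_manual(values = c("Female" = "#D63384", "Male" = "#0D6EFD"))

# Convert to an interactive plot, save it as HTML, and display it
interactive_matrix <- ggplotly(enhanced_matrix)
saveWidget(interactive_matrix, "EnhancedScatterplotMatrix.html")
print(interactive_matrix)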

In a similar vein to what I did previously with the correlation heatmap, I decided to
write the above R code to facilitate the generation of an enhanced, interactive
scatterplot matrix with added linear regression lines. This approach builds upon the
basic scatterplot matrix offering richer insights into the nature of relationships
between variables.

At the outset, a bespoke function scatter_with_lm is created. This function uses the
'ggplot2' package's ggplot, geom_point and geom_smooth functions to generate
scatterplots overlaid with a linear regression line. The function takes three
parameters: the data, mapping, and any additional parameters for geom_smooth.

Following this, the ggpairs function from the 'GGally' package is used to generate an
advanced scatterplot matrix. This matrix includes variables 'MonthlyIncome', 'Age',
'TotalWorkingYears', 'Education', and 'Gender' from the 'hrdata' dataset. The
scatterplot matrix is customised further by colour coding data points according to
'Gender', adding correlation coefficients in the upper panels, and overlaying linear
regression lines in the lower panels using the previously defined scatter_with_lm
function. Manual colour scales are then specified for 'Gender' using
scale_color_manual and scale_fill_manual.

Subsequently, the plot title is set, and the entire plot is converted into an interactive
visual using the ggplotly function from the 'plotly' package. The resulting interactive
plot allows for more dynamic data exploration, where hovering over data points
reveals precise values, and zooming in facilitates a closer inspection of areas of
interest.

Finally, the interactive plot is saved as an HTML file using the 'htmlwidgets' package's
saveWidget function. This facilitates sharing and viewing the plot in web browsers,
extending its accessibility. The plot is also displayed in the R environment using the
print function.

The enhanced scatterplot matrix and the embedded interactive features underscore
the strength of visual analytics. They empower stakeholders to glean intricate data
patterns, trends, and relationships effortlessly, fostering data-informed decision-
making processes.

From the enhanced scatterplot matrix three particularly strong positive correlations
are discernible in the data. Firstly, 'MonthlyIncome' and 'Age' demonstrate a
correlation of 0.498, suggesting that older employees tend to earn a higher monthly
income. When analysed across genders, this correlation appears to be slightly
stronger for females (0.505), indicating that age-related increments in monthly
income may be more pronounced for them as opposed to their male counterparts
(0.482).

Secondly, 'MonthlyIncome' and 'TotalWorkingYears' show a robust correlation of 0.773. This correlation underscores a discernible trend in the HR data: longer working careers are associated with higher monthly incomes. This correlation varies somewhat across genders, with males (0.781) seemingly having a slightly more pronounced association between working years and monthly income compared to females (0.761).

Lastly, 'TotalWorkingYears' and 'Age' share a robust correlation of 0.680. This correlation highlights the intuitive trend that older employees generally have more years of work experience. A gender-based split reveals that this correlation is stronger for females (0.704) than for males (0.663), indicating a tighter link between age and accumulated work experience among female employees.

The scatterplot matrix was enriched by encoding 'Gender' as a colour variable, which
allowed for a more granular, gender-specific exploration of trends in the HR data.

My next task revolves around probing assertions that the recent redundancies
predominantly impacted older employees (potentially indicating age discrimination)
and those who are relatively new to the organisation. To conduct a comprehensive
analysis of these claims, I will employ a two-pronged approach: first, I will generate
boxplots to visually inspect the distributions of age and tenure among the employees
who were laid off; second, I will perform statistical hypothesis tests to evaluate the
evidence supporting these claims, which will involve calculating p-values to quantify
the strength of this evidence.
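In outline, the two plotting approaches look like this (filenames, titles, and object names are illustrative):

# Basic boxplot of Age by Attrition, written to a PNG file and then redrawn
png("AgeByAttritionBoxplot.png")
boxplot(Age ~ Attrition, data = hrdata,
        main = "Age by Attrition", xlab = "Attrition", ylab = "Age")
dev.off()

boxplot(Age ~ Attrition, data = hrdata,
        main = "Age by Attrition", xlab = "Attrition", ylab = "Age")

# Interactive version of the same comparison using plotly
age_boxplot <- plot_ly(hrdata, x = ~Attrition, y = ~Age, color = ~Attrition,
                       type = "box") %>%
  layout(title = "Age Distribution by Attrition Status")
saveWidget(age_boxplot, "AgeByAttritionBoxplot.html")
age_boxplot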

I wrote two lots of code above, the first of which generates a basic boxplot and the
second is more detailed and interactive using the plotly library/package. The boxplot
provides a graphical representation that illustrates the distribution of 'Age' for
different levels of 'Attrition'. This plot enables us to compare these distributions side-
by-side.

A boxplot is designed to provide a summary of the central tendency and dispersion of the dataset, while also showing any signs of skewness and the presence of outliers. The box itself displays the interquartile range (IQR), which contains the middle 50% of data points situated around the median. The whiskers, meanwhile, represent the range for the rest of the data, excluding outliers.

In this instance, the boxplot is structured to differentiate between those who experienced attrition and those who did not, as indicated on the x-axis. The y-axis represents age.

After generating the plot, the R script closes the PNG device. However, the boxplot is
subsequently recreated to be displayed within the R environment. This process
ensures that the boxplot is saved as an image file and is also available for immediate
viewing within R.

Based on visual inspection of the boxplot, it appears that the employees who were
made redundant predominantly fall within the age group of late twenties to late
thirties. However, to garner a more precise and statistically sound understanding, it's
critical to compare the mean ages of both groups — those who experienced
redundancy and those who didn't. For this purpose, I'll be implementing a Welch Two
Sample t-test, which will provide a more robust statistical framework to understand
any potential differences.
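A minimal sketch of that test, assuming base-R subsetting (the object names yes_age and no_age are as described below):

# Ages of employees who left versus those who stayed
yes_age <- hrdata$Age[hrdata$Attrition == "Yes"]
no_age  <- hrdata$Age[hrdata$Attrition == "No"]

# Welch Two Sample t-test (t.test() does not assume equal variances by default)
t.test(yes_age, no_age)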

The code I wrote above is used to perform a statistical t-test. More specifically, this is
a Welch Two Sample t-test, which tests whether the means of two independent
groups are significantly different from each other.

Firstly, the code filters the 'hrdata' dataset to create two separate datasets. The
'yes_age' dataset consists of the ages of employees who experienced attrition (i.e.,
they left the company), while the 'no_age' dataset comprises the ages of employees
who did not experience attrition.

The t.test(yes_age, no_age) function is then used to perform the t-test. This test
will compare the means of the 'yes_age' and 'no_age' datasets. If the p-value
generated from the test is less than the significance level (commonly 0.05), we would
reject the null hypothesis that the two groups have the same mean. This would
indicate a statistically significant difference in the mean ages of the two groups,
providing further insight into whether age plays a role in employee attrition.

Let's have a look at the output and see what insights emerge.

After running the test, the t-value is -5.828, which is the calculated difference in
sample means in units of standard error. Its negative sign indicates that the first
mean is smaller than the second.

Degrees of freedom (df), a parameter of the t-distribution, is calculated to be approximately 316.93. It is somewhat less than the total number of observations in both groups minus 2, which would have been the case for an ordinary (Student's) t-test. Welch's t-test adjusts the degrees of freedom because the variances in the two groups are not assumed to be equal.

The p-value is reported as 1.38e-08, which is a very small number. This is the
probability of observing such a large absolute t-value (or larger) if the null hypothesis
of equal means were true. Given this very low p-value, we would reject the null
hypothesis at the conventional 5% significance level, and conclude that there's a
statistically significant difference in the mean ages of the two groups.

The 95% confidence interval for the difference in means is [-5.288346, -2.618930]. Under repeated sampling, 95% of intervals constructed in this way would be expected to contain the true difference in population means. Importantly, as this interval does not include zero, it provides further evidence that the mean ages of the two groups are significantly different.

The final part of the output provides the sample estimates for the mean of each
group. The mean age of the group that left the company ('yes_age') is approximately
33.61, while the mean age of the group that stayed ('no_age') is approximately 37.56.

I have decided to delve deeper into the HR dataset and shift my focus from the 'Age'
attribute to 'EmployeeNumber'. The aim is to uncover any potential disparities in
attrition rates based on the employee number. This choice stems from the hypothesis
that newer employees (indicated by a higher employee number) may have been
impacted differently in terms of attrition compared to those who have been with the
company for a longer period. To test this hypothesis, I will be generating a new boxplot (above) and performing another Welch Two Sample t-test (below). These
methods will provide additional visual and statistical analysis respectively, offering a
comprehensive insight into the relationship between 'EmployeeNumber' and
'Attrition'.
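The test itself mirrors the earlier age comparison; a sketch with illustrative object names:

# Employee numbers of leavers versus stayers
yes_empnum <- hrdata$EmployeeNumber[hrdata$Attrition == "Yes"]
no_empnum  <- hrdata$EmployeeNumber[hrdata$Attrition == "No"]

t.test(yes_empnum, no_empnum)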
The key results from the t-test are as follows:

- The t-statistic is -0.41725, which is the test statistic score. This score is used to compare the groups under study. In this context, a negative value suggests that the first group (employees who left the company) had a lower mean employee number compared to the second group (those who stayed).
- The degrees of freedom (df) is 342.33, which is used in the calculation of the p-value. Degrees of freedom generally represents the number of independent pieces of information available to estimate another piece of information.
- The p-value is 0.6768. This value is used to determine the statistical significance of the observed difference in means. In general, a p-value of less than 0.05 is considered statistically significant. In this case, since the p-value is quite high (greater than 0.05), we cannot reject the null hypothesis. This means there is no significant evidence to suggest that the mean employee numbers differ between the employees who left the company and those who stayed.
- The 95 percent confidence interval is from -98.91087 to 64.29061. This range represents the interval within which we can be 95% confident that the true difference in mean employee numbers between the groups lies.
- The means of x and y represent the average employee numbers for those who left the company and those who stayed, respectively. Here, 'x' corresponds to employees who left the company (mean = 1010.346), and 'y' corresponds to employees who stayed (mean = 1027.656).

In summary, the Welch Two Sample t-test provides no significant evidence to suggest
that newer employees (indicated by a higher employee number) are more likely to
leave the company.

I have decided to conduct a linear regression analysis between 'MonthlyIncome' and 'Age' in my HR dataset. This decision is driven by the desire to explore and
understand the possible linear relationship between an employee's age and their
monthly income. Linear regression analysis is a powerful statistical method that
allows me to quantitatively study the relationship between variables. Specifically, it
will allow me to discern whether age has a statistically significant impact on an
employee's income, and if so, how strong this influence is. By conducting this
analysis, I am looking to extract valuable insights that can potentially inform HR
policies and decision-making.

The code I wrote above performs linear regression analysis on the 'hrdata' dataset,
with 'MonthlyIncome' as the dependent variable and 'Age' as the independent
variable.

Here's a detailed explanation of the code:


1. model1 = lm(MonthlyIncome ~ Age, data=hrdata): This line uses the lm() function to fit a linear
model to the 'hrdata' data. The formula MonthlyIncome ~ Age indicates that 'MonthlyIncome' is the
dependent variable (the variable we're trying to predict), and 'Age' is the independent variable (the
predictor). The fitted model is stored in the variable 'model1'.
2. summary(model1): This line provides a comprehensive summary of the fitted model 'model1'. It
outputs the coefficients of the regression model (intercept and slope), the residuals, the R-squared
value (which indicates how well the model fits the data), the F-statistic, and the p-value (which tests
the null hypothesis that all of the regression coefficients are equal to zero).

In summary, the script fits a linear regression model to predict 'MonthlyIncome' based
on 'Age' and then generates a comprehensive summary of the model to enable
interpretation of the model's performance and the statistical significance of the
predictor variable 'Age'.

Now that I have run the code, let's examine the output in more detail:

1. Call: This displays the function call used to compute the regression.
2. Residuals: These represent the differences between the observed (actual) and predicted values. Here,
the 1st quartile (25th percentile), median (50th percentile), and 3rd quartile (75th percentile) show
the spread and skewness of residuals. Large residuals suggest that our model doesn’t capture the data
trend well; in this case, the residuals seem quite large.
3. Coefficients: The 'Estimate' column under 'Coefficients' provides the intercept and slope of the
regression line, i.e., -2970.67 and 256.57, respectively. The intercept can be interpreted as the
expected value of MonthlyIncome when Age is zero, but in this context, a zero age doesn't make
sense, so the intercept may not have a meaningful interpretation. The positive slope suggests that
with every additional year of age, the monthly income is expected to increase by about 256.57 units,
on average.

The 'Std. Error' column provides the standard errors of the estimated coefficients,
which measure the variability in the estimate for the coefficients. The 't value' is the t-
statistic, and 'Pr(>|t|)' is the p-value associated with this t-statistic. The three
asterisks next to the p-values indicate that the predictors are highly significant (p <
0.001).

4. Residual standard error: This is the standard deviation of the residuals, and it's used to measure the model's accuracy. A smaller residual standard error means the model has a smaller random error component, and hence is a better fit.
5. Multiple R-squared: This is the proportion of variance in the dependent variable (MonthlyIncome) that can be explained by the independent variable (Age). Here, about 24.79% of the variability in MonthlyIncome can be explained by Age.
6. F-statistic and p-value: The F-statistic is used to test the overall significance of the model. The p-value associated with the F-statistic is practically zero, which provides evidence against the null hypothesis that all regression coefficients are zero. In other words, it suggests that the model as a whole is statistically significant.

Remember, while these statistics provide valuable insights about the relationship
between Age and MonthlyIncome, the data may not meet all assumptions of linear
regression, such as linearity, independence, homoscedasticity, and normality of
residuals.


Building upon my previous analysis, I've decided to delve deeper into the relationship
between employees' ages, their total years of work experience and their monthly
income. The aim is to understand how both age and total working years
simultaneously influence the monthly income of employees. To achieve this, I've
conducted another linear regression analysis, this time including the
'TotalWorkingYears' as an additional independent variable. This approach provides a
more comprehensive view of the factors affecting monthly income, enhancing the
explanatory power of the model.
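In outline (the model object name model2 is illustrative):

# Multiple linear regression: MonthlyIncome modelled on Age and TotalWorkingYears
model2 <- lm(MonthlyIncome ~ Age + TotalWorkingYears, data = hrdata)
summary(model2)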
The code I wrote above executes a multiple linear regression model to predict
'MonthlyIncome' using 'Age' and 'TotalWorkingYears' as predictors. It generates a
summary of the model's results which I will look into in more detail below.

The 'Coefficients' section provides the estimated effect size and statistical significance
for each predictor:

1. 'Intercept' is the estimated 'MonthlyIncome' when both 'Age' and 'TotalWorkingYears' are zero,
which doesn't practically apply to our case, but it's mathematically needed for the model. It is
statistically significant, as indicated by its p-value (2.36e-08) being much smaller than 0.05.
2. 'Age' has a negative coefficient (-26.87), which means older age is associated with a slight decrease
in 'MonthlyIncome' when 'TotalWorkingYears' is held constant. However, the effect size is small, albeit statistically significant (p-value = 0.021).
3. 'TotalWorkingYears' has a positive coefficient (489.13), indicating that as the total working years
increase, the 'MonthlyIncome' also increases, holding 'Age' constant. This effect is statistically
significant (p-value < 2e-16), implying that 'TotalWorkingYears' is a significant predictor of
'MonthlyIncome'.

The 'Residuals' section describes the spread of the residuals (the difference between
the actual and predicted values). The closer the residuals are to 0, the better the
model fits the data.

The 'Multiple R-squared' value (0.5988) indicates that around 60% of the variance in
'MonthlyIncome' can be explained by 'Age' and 'TotalWorkingYears', while the
'Adjusted R-squared' value (0.5983) adjusts this statistic based on the number of
predictors in the model.

The F-statistic (1095) and its associated p-value (< 2.2e-16) indicate that the model
as a whole (i.e., all predictors together) is statistically significant.

In summary, the multiple linear regression model demonstrates that both 'Age' and
'TotalWorkingYears' are statistically significant predictors of 'MonthlyIncome'.
However, 'Age' shows a minor negative effect, suggesting that older employees,
considering their total working years, may earn slightly less. 'TotalWorkingYears'
significantly influences 'MonthlyIncome', indicating that employees with more working
experience tend to earn more. The model accounts for approximately 60% of the
variability in 'MonthlyIncome', suggesting a decent fit to the data, although other
unconsidered factors could be contributing to 'MonthlyIncome'. The F-statistic
validates that the model's predictors are collectively significant.


In order to present a more comprehensive and interactive view of the second linear
regression model, I have created a 3D scatter plot. This plot visually captures the
relationship between Monthly Income, Age, and Total Working Years in our dataset. By
leveraging the power of three dimensions, the scatter plot effectively depicts the
correlations and patterns existing among these variables. The plot is colour-coded
based on the monthly income to provide an additional layer of understanding. This
visual tool enables me and stakeholders to comprehend and interpret the model's
results better, fostering a deeper understanding of the underlying data patterns. It's
an effective way to appreciate the model's complexity and nuances that a simple
numerical summary might not fully encapsulate.
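A rough sketch of that 3D plot using plotly (the object name, marker size, and output filename are illustrative):

# Interactive 3D scatter plot: income against age and total working years,
# with marker colour mapped to monthly income
income_3d <- plot_ly(
  hrdata,
  x = ~Age, y = ~TotalWorkingYears, z = ~MonthlyIncome,
  color = ~MonthlyIncome,
  type = "scatter3d", mode = "markers",
  marker = list(size = 3)
) %>%
  layout(title = "Monthly Income by Age and Total Working Years")

saveWidget(income_3d, "Income3DScatter.html")
income_3d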

Conclusion/Insights
From my analysis I was able to draw multiple conclusions:

1. Age and Work Experience: Older employees generally have more years of work experience and
tend to have higher monthly incomes. This strong positive correlation between age and total working
years is more pronounced for females.
2. Monthly Income and Work Experience: There is a strong positive correlation between total
working years and monthly income. This suggests that longer working careers are associated with higher monthly incomes, with this association slightly more pronounced for male employees.
3. Education and Age/Number of Companies Worked: Older employees or those who have worked
with more companies tend to have slightly higher education levels, but this correlation is weak,
suggesting education may not be a significant predictor of age or the number of companies one has
worked for.
4. Distance from Home and Training: Employees living further from home may have slightly fewer
training instances in the last year, but this negative correlation is weak, suggesting the relationship
may not be strong or may be influenced by other factors.
5. Pay Rates and Other Variables: Daily rate, hourly rate, and monthly rate show minimal
correlations with other variables, suggesting they might not be good predictors for other factors such
as age, total working years, education, and so on.
6. Redundancy and Age: Based on the Welch Two Sample t-test, there is a statistically significant
difference in the mean ages of employees who were made redundant and those who weren't. The
group that experienced redundancy predominantly falls within the late twenties to late thirties age
group.
7. Linear Regression Analysis (Single and Multiple): In the case of single linear regression, age
explains about 24.79% of the variability in monthly income. The positive slope suggests an increase
in monthly income with an increase in age. In the multiple linear regression model, both age and
total working years significantly predict monthly income. However, age shows a slight negative
effect when total working years is held constant. On the other hand, total working years significantly
influence monthly income, suggesting that more working experience leads to higher income.
8. Employee Turnover: There's no significant evidence to suggest that newer employees are more
likely to leave the company.

Remember, correlation does not imply causation. While these relationships are
evident in the dataset, further analysis would be necessary to understand the
underlying causal mechanisms. For example, does working for longer necessarily lead
to a higher income, or are there other factors at play?

Moreover, while the models explain a good portion of the variability in the data (single
model: ~25%, multiple model: ~60%), a significant proportion remains unexplained.
There may be other important predictors not included in these models.

Lastly, the variables should be checked for multicollinearity, especially in the multiple
linear regression model. This situation, where predictor variables are correlated with
one another, can inflate the variance of the regression coefficients and make the
estimates very sensitive to minor changes in the model.
