Employee Satisfaction Report
Group 20 (L03)
TABLE OF CONTENTS
1.1. MOTIVATION
1.1.1. Context
1.1.2. Problem
1.2. OBJECTIVES
1.2.1. Overview
1.2.2. Goals & Research Questions
2. METHODOLOGY
2.1 Data
2.2 Approach
2.3 Workflow
2.4 Workload Distribution
3. MAIN RESULTS OF THE ANALYSIS
3.1 Multiple Regression Assumptions
3.2 Results
4. CONCLUSION AND DISCUSSION
4.1 Approach
4.2 Future Work
5. REFERENCES
6. APPENDIX
1. INTRODUCTION
1.1. MOTIVATION
1.1.1. Context
In this project, we chose to investigate an Employee Satisfaction Survey dataset from
Kaggle. This dataset offers a set of variables related to employee experiences. The question
investigated is: how do these work attributes impact employee satisfaction? “An
extensive study into happiness and productivity has found that workers are 13% more
productive when happy” (Bellet et al., 2023), and “Satisfied employees are more committed,
engaged, productive, resulting in lower turnover rates and higher overall performance levels”
(Long, 2023). How happy people are at work is therefore reflected in their work outcomes.
The project aims to determine which variables are significant in predicting employee satisfaction.
1.1.2. Problem
“The average person will spend 90,000 hours at work”, which amounts to one third
of a person's life (Naber, n.d.). The fulfillment, or lack thereof, from the work environment
extends beyond the work day. Satisfaction at work can negatively or positively affect a
person's mental health and well-being. The significance of workplace characteristics for
employee satisfaction will be investigated to determine the factors that make a satisfied and
happy employee. These factors include an employee's score on their last evaluation, the
number of projects they take part in, their average monthly hours, the length of time they have
been at the company, whether they have been in any work accidents, whether they have had a promotion
in the last 5 years, their department, and their salary. Additionally, from the employer's point
of view, “Companies with high worker satisfaction outperform low satisfaction companies by
202%” (Apollo Technical, 2022). If a company understands the attributes that matter to
its employees' overall satisfaction, it can improve employee performance.
1.2. OBJECTIVES
1.2.1. Overview
The overall intent of the project is to better understand how different characteristics
of work can impact an employee's satisfaction level. Understanding the elements
that make a satisfied employee can provide actionable insights for cultivating the
working conditions that lead to satisfied employees. If the assumptions of the linear
regression model are met, companies could use the model to predict the satisfaction of
their employees and make changes where necessary.
The expectation is that the model will be built using techniques learned in class.
The model is expected to meet the assumptions of linearity, equal variance, normality, and
independence. If one of the assumptions is not met, a transformation such as Box-Cox or log
will be applied. A check for outliers will be done, and any outliers found will be
removed. Lastly, a test for multicollinearity between variables will be done. If collinearity is
found among a group of variables, all but one of the collinear variables will be
removed.
This topic is important because we want to analyze and understand what creates the
highest satisfaction level within a work environment. When looking for jobs in the future, we
could use the insights gained to help decide which workplace attributes to prioritize.
Additionally, this evidence is important for companies seeking to
cultivate an environment that fosters productivity. This subject will be explored by building a
multiple regression model with the data provided.
2 METHODOLOGY
2.1 Data
The dataset has 10 columns: employee ID, 8 characteristics of the employee's work
environment, and satisfaction level, which is our response variable. The data was collected through an
online survey answered by employees. There are 12 783 entries. Below is a table that
describes the dataset and its variables.
This dataset was retrieved from Kaggle and is licensed under the Apache License, Version
2.0 (https://www.apache.org/licenses/LICENSE-2.0), which allows free and non-exclusive
use of the data.
Employee ID will not be used in the model to predict employee satisfaction as it serves
purely as an identifier for the employees and not as a predictor for satisfaction.
| Variable | Type | Description | Measurement |
| --- | --- | --- | --- |
| Employee ID | Categorical (Quantitative) | Employee ID | |
| Satisfaction level | Numerical (Quantitative) | Employee's self-reported job satisfaction level | Score from 0-1 |
| last_evaluation | Numerical (Quantitative) | Employee's most recent performance evaluation score | Score from 0-1 |
| number_project | Numerical (Quantitative) | Number of projects the employee is currently working on | Project # |
| average_monthly_hours | Numerical (Quantitative) | Average number of hours worked per month by the employee | Hours |
| time_spend_company | Numerical (Quantitative) | Number of years the employee has spent with the company | Years |
| Work_accident | Categorical (Qualitative) | Indicates whether the employee has experienced a work accident | 1: yes, 0: no |
| promotion_last_5years | Categorical (Qualitative) | Indicates whether the employee has received a promotion in the last 5 years | 1: yes, 0: no |
| dept | Categorical (Qualitative) | The department or division in which the employee works | department |
| salary | Categorical (Qualitative) | Employee's salary level (e.g., low, medium) | low, medium, high |
2.2 Approach
2.3 Workflow
Initially, a first-order model will be created using all of the potential predictor
variables. Individual t-tests will be performed to determine which of the coefficients are significant.
A stepwise regression procedure will then be used to check the
results of the t-tests. Next, an interaction model will be created. As before, t-tests will
be used to determine which of the interaction terms are significant. Once the model including
interaction terms is known, each predictor variable used in the model will be plotted
against the response variable to determine whether any higher-order terms should be included.
The final model is then built from the predictor variables and interaction terms deemed
significant, together with any necessary higher-order terms.
Once the model is created, six checks of the linear regression assumptions will be
performed on it: the linearity assumption, independence assumption,
normality assumption, equal variance assumption, multicollinearity, and an outlier
check. Outliers found to be influential (Cook's distance greater than 1) will be removed.
If the equal variance or normality assumption is not met, a Box-Cox or
log transformation can be applied to the response variable.
speed at which tasks were completed as those not sharing their screen could find info needed
to complete specific parts and write out any tedious code or descriptions related to the section
being worked on.
There are three group members, so the project was divided equally between three
roles. The first role included creating the model using the t-test and stepwise regression,
followed by checking for higher-order terms and interaction terms. The second and third roles
checked the regression assumptions. Six checks covered in Data 603 were
performed: the linearity assumption, independence assumption, equal variance
assumption, normality assumption, multicollinearity, and outliers. The first three checks were
completed by the person in role 2 and the remaining three by the
person in role 3. Each group member was responsible for documenting their
assigned role in the report. The remaining sections of the report were worked on together,
with each person taking ownership of sections as needed.
3.1 Results
Because the sample size exceeds the limit for the Shapiro-Wilk test, we resampled
our data down to n = 5000 observations and computed our results on that sample.
A significance level (alpha) of 0.05 is used for all significance tests.
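The subsampling step can be sketched as follows. This is a minimal Python illustration, not the code used for the analysis: the function name `draw_subsample` and the seed value are assumptions, and sampling without replacement is assumed because the report does not state how the 5000 rows were drawn.

```python
import random

N_ENTRIES = 12783   # number of survey entries reported in Section 2.1
SAMPLE_SIZE = 5000  # subsample size used for the analysis


def draw_subsample(n_rows, k, seed=603):
    """Return k distinct row indices drawn uniformly without replacement."""
    rng = random.Random(seed)  # arbitrary seed (an assumption), fixed for reproducibility
    return sorted(rng.sample(range(n_rows), k))


indices = draw_subsample(N_ENTRIES, SAMPLE_SIZE)
print(len(indices), indices[:5])
```

Fixing the seed means the same 5000 rows are selected on every run, so the test results reported afterwards can be reproduced.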
$Y = \beta_0 + \beta_1 X_{\text{last eval}} + \beta_2 X_{\text{num project}} + \beta_3 X_{\text{avg mon hrs}} + \beta_4 X_{\text{time spend}} + \beta_5 X_{\text{work accidents}}$
To begin the selection procedure, we decided to use both the t-test and stepwise regression
analysis. This allowed for a systematic and comprehensive approach to selecting predictors.
T-Test
$H_0: \beta_i = 0$
$H_A: \beta_i \neq 0$
Applying the t-test with the hypotheses above, we removed the
insignificant variables. First, the p-value of 0.55 for average_monthly_hours is greater than
0.05, so we fail to reject the null hypothesis and drop the variable, as it is not
significant in the model. Second, the p-value of 0.127 for promotion_last_5years likewise leads us to
fail to reject the null hypothesis and drop the variable. The department (IT) coefficient
was close to the significance level of
0.05, with a p-value of 0.075. Based on the hypothesis test, we fail to reject the null hypothesis and
the dept variable can be removed. However, because the p-value is close to the alpha value,
dept may be kept in the model to test for its involvement in significant interaction
terms.
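The decision rule behind these individual t-tests can be written out as a short Python sketch. The function name `t_test_decision` is an assumption introduced for illustration; the p-values are the ones reported above. Under the usual rule, a p-value at or above alpha means the null hypothesis is not rejected, which is the case for all three coefficients.

```python
ALPHA = 0.05  # significance level used throughout the report


def t_test_decision(p_value, alpha=ALPHA):
    """Decision rule for H0: beta_i = 0 against HA: beta_i != 0."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# p-values reported for the first-order model
reported = {
    "average_monthly_hours": 0.55,
    "promotion_last_5years": 0.127,
    "dept (IT)": 0.075,
}

for name, p in reported.items():
    print(f"{name}: p = {p} -> {t_test_decision(p)}")
```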
The stepwise regression selection procedure produced a best model that dropped
average_monthly_hours, dept, and promotion_last_5years.
After reviewing the various scatter plots of employee satisfaction versus each predictor
variable, it was concluded that none of the variables displayed patterns that would deem
higher order terms necessary.
Interaction Terms
Hypothesis Statement for Individual T-tests (Interaction Terms):
$H_0: \beta_i = 0$
$H_A: \beta_i \neq 0$
After testing for interaction terms it was determined that there were 9 interactions that are
significant to the model:
● last_evaluation:number_project
● last_evaluation:time_spend_company
● last_evaluation:factor(Work_accident)
● last_evaluation:factor(dept)
● number_project:time_spend_company
● number_project:factor(Work_accident)
● number_project:factor(salary)
● time_spend_company:factor(dept)
● factor(dept):factor(salary)
These 9 interactions all have p-values less than the alpha value of 0.05. Therefore, we reject
the null hypothesis and conclude that the coefficients of these interactions are not equal to 0.
We therefore include these 9 terms in our final model. Additionally, since the dept variable
is significant in some interaction terms, it will be kept in the final model.
Final Model
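$Y = \beta_0 + \beta_1 X_{\text{last eval}} + \beta_2 X_{\text{num project}} + \beta_3 X_{\text{time spend}} + \beta_4 X_{\text{work accidents}} + \beta_5 X_{\text{dept}} + \beta_6 X_{\text{salary}}$
$\quad + \beta_7 X_{\text{last eval*num project}} + \beta_8 X_{\text{last eval*time spend}}$
$\quad + \beta_9 X_{\text{last eval*work accidents}} + \beta_{10} X_{\text{last eval*dept}} + \beta_{11} X_{\text{num project*time spend}} + \beta_{12} X_{\text{num project*work accidents}}$
$\quad + \beta_{13} X_{\text{num project*salary}} + \beta_{14} X_{\text{time spend*dept}} + \beta_{15} X_{\text{dept*salary}}$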
$R^2_{adj} = 0.1998$. This value indicates that 19.98% of the variation in the response variable,
satisfaction level, is explained by the final model containing the predictors as well as the
interaction terms.
$RMSE = 0.2218$. This value indicates that the standard deviation of the unexplained
variation in the estimate of the response variable, satisfaction level, is 0.2218.
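For reference, these two summary statistics are conventionally defined as shown below. These are the textbook definitions, not output from the analysis. The symbols are assumptions introduced here: $n$ is the number of observations, $p$ the number of estimated coefficients excluding the intercept, $SSE$ the residual sum of squares, and $SST$ the total sum of squares.

```latex
R^2_{adj} = 1 - \frac{SSE / (n - p - 1)}{SST / (n - 1)},
\qquad
RMSE = \sqrt{\frac{SSE}{n - p - 1}}
```

Both quantities divide by $n - p - 1$, so adding terms that do not reduce $SSE$ enough lowers $R^2_{adj}$ and raises $RMSE$.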
Interpretation of Coefficients:
There are a total of fifteen coefficients in our final model. These coefficients are interpreted
below in relation to employee satisfaction levels.
𝛃0: Intercept - This is the baseline satisfaction level when all other predictor variables are set
to their reference levels or zero. The reference levels for this model are
department(accounting), work_accidents(0), salary(High).
𝛃1: last_evaluation - Represents the change in satisfaction level for each unit increase in the
𝛃3: time_spend_company - The coefficient shows the change in satisfaction level for each
additional year spent in the company, assuming other factors are held constant.
𝛃4: Work_accident - The coefficient tells us the difference in employee satisfaction level
between employees who have not (0) (the base) and who have (1) been in a workplace
accident.
𝛃5: dept - The coefficients for different departments show the difference in satisfaction levels
compared to the baseline department of accounting, assuming other factors are held constant.
𝛃6: salary: The coefficients for different salary levels indicate how satisfaction varies across
salary levels compared to the baseline salary level of high, assuming other factors are held
constant.
𝛃7: last_evaluation:number_project - This interaction term is the product of last_evaluation
and number_project. The coefficient tells us that for each additional
project an employee takes on, satisfaction changes by a constant plus the
last evaluation score multiplied by 𝛃7. Equivalently, for each one-unit
increase in the last evaluation score, satisfaction changes by a constant plus the
number of projects multiplied by 𝛃7.
𝛃8: last_evaluation:time_spend_company - This interaction term is the product of
last_evaluation and time_spend_company. The coefficient indicates how satisfaction changes
with the interaction of last_evaluation and time_spend_company. For instance, for
every additional year an employee spends at the company, satisfaction changes
by a constant plus the last evaluation score multiplied by 𝛃8, and vice versa.
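𝛃9: last_evaluation:factor(Work_accident) - This term is the product of last_evaluation and the
categorical variable Work_accident. The coefficient of this interaction gives the difference in
the effect of the last evaluation on satisfaction between those who have not and those who have been in a
workplace accident.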
𝛃10: last_evaluation:factor(dept) - This term is the product of last_evaluation and dept.
The coefficient of this interaction gives the difference in the effect of the last evaluation on
satisfaction between those in accounting (the base) and those in other departments.
𝛃11: number_project:time_spend_company - This term is the product of number_project and
time_spend_company. The coefficient tells us that
for each additional project an employee takes on, satisfaction changes by a
constant plus time_spend_company multiplied by 𝛃11. Equivalently, for
each additional year spent at the company, satisfaction changes by a
constant plus the number of projects multiplied by 𝛃11.
𝛃13: number_project:factor(salary) - This interaction
assesses whether the effect of the number of projects on satisfaction varies across the salary
levels low, medium, and high, with high being the base. The coefficient of this interaction gives
the difference in the effect of the number of projects one undertakes on satisfaction between
those with a high salary and those with other salaries.
𝛃14: time_spend_company:factor(dept) - This interaction, created by multiplying
time_spend_company with the dept variable, explores how the impact of tenure at the
company on satisfaction might vary across departments. The coefficient of this interaction
gives the difference in the effect of the years spent at the company on satisfaction between those
in accounting (the base) and those in other departments.
𝛃15: factor(dept):factor(salary) - This term is the product of dept and salary. It examines whether
the relationship between department and satisfaction differs based on the salary level. The
coefficient tells us the difference in the effect of being in the accounting department on
satisfaction between those with a high salary and those with other salaries and, vice versa,
the difference in the effect of having a high salary on satisfaction between those in the accounting
department and those in other departments.
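The "constant plus a multiple of the other variable" reading of the interaction coefficients can be checked numerically. The sketch below is a minimal Python illustration on a reduced two-predictor model with a single interaction, not the fitted model: the function names and all coefficient values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def predicted_satisfaction(last_eval, num_project, b0, b1, b2, b7):
    """Reduced model: two main effects plus their interaction term."""
    return b0 + b1 * last_eval + b2 * num_project + b7 * last_eval * num_project


def effect_of_one_more_project(last_eval, b2, b7):
    """Change in the prediction when num_project rises by one unit."""
    return b2 + b7 * last_eval


# Hypothetical coefficients, for illustration only (not estimates from the report).
b0, b1, b2, b7 = 0.5, 0.25, -0.125, 0.0625

for last_eval in (0.5, 1.0):
    before = predicted_satisfaction(last_eval, 3, b0, b1, b2, b7)
    after = predicted_satisfaction(last_eval, 4, b0, b1, b2, b7)
    print(last_eval, after - before, effect_of_one_more_project(last_eval, b2, b7))
```

The printed difference between the two predictions matches the closed-form effect at each evaluation score, which shows that the effect of one more project depends on the value of the other variable in the interaction.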
Linearity Assumption
The residual plot does not satisfy the linearity assumption, as the fitted blue line is not
linear. This pattern, together with the lack of random scatter, leads us to conclude that the model
does not fulfill the linearity assumption.
Independence Assumption
The data was collected by employees individually answering a survey, which is not related to
time, space, or group. Therefore each entry is separate and does not influence the results of
another. This characteristic satisfies the independence assumption.
Normality Assumption
$H_0$: The sample data are significantly normally distributed
$H_A$: The sample data are not significantly normally distributed
From the histogram of the residuals we can see that the data is skewed to the right. The points
on the normal Q-Q plot show the tail beginning to deviate from the line as x increases. To obtain a quantified
result, the Shapiro-Wilk test was also performed. The calculated p-value of
2.2e-16 is less than 0.05; therefore, we reject the null hypothesis and conclude that the data is not
normally distributed.
Equal Variance Assumption
To test visually for homoscedasticity, we produced a residuals vs fitted values plot, and to
test statistically for homoscedasticity we performed the Breusch-Pagan test. The residuals vs
fitted values plot shows a rectangular shape, which we took as an indication that our model is heteroscedastic.
The Breusch-Pagan test gave a p-value of 2.2e-16, which is less than 0.05; therefore, we reject the null hypothesis
of homoscedasticity and conclude that heteroscedasticity is present, so the equal variance assumption is not fulfilled.
Multicollinearity Tests
To test for multicollinearity in our model, we computed the variance inflation factors
(VIF) to determine which variables should remain in our best-fitting model. The VIF values all
fell in the range 1 ≤ VIF ≤ 5, which suggests moderate collinearity that does not
require corrective measures.
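To make the 1 to 5 range concrete, the sketch below uses the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the remaining predictors. It is a minimal Python illustration: the function name `vif_from_r_squared` and the example R-squared values are assumptions, not values from the analysis.

```python
def vif_from_r_squared(r_squared):
    """Variance inflation factor for a predictor, given the R^2 obtained
    by regressing that predictor on all of the other predictors."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


# Illustrative R^2 values only: no collinearity, moderate, and the edge of the range.
for r2 in (0.0, 0.5, 0.8):
    print(r2, vif_from_r_squared(r2))
```

Under this definition, a VIF of 1 means the predictor is uncorrelated with the others, and a VIF of 5 corresponds to the other predictors explaining about 80% of its variance.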
Outliers
To identify influential cases, we plotted residuals vs leverage. In the residuals vs
leverage plot, all cases are well inside the Cook's distance lines.
The Cook's distance plot shows the Cook's distance for each observation. It measures the
overall influence that each outlying point has on our regression. The
points found to be outliers were 2227, 2360, and 3309. However, their Cook's
distance values are less than 0.5; therefore, they are not influential.
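The influence check can be sketched with the standard formula for Cook's distance, D_i = (e_i^2 / (p * MSE)) * h_ii / (1 - h_ii)^2. This is a minimal Python illustration: the function names and the numeric inputs are hypothetical, since the report does not list the residuals or leverages of the flagged points. The default threshold of 1 follows the removal rule stated in the workflow.

```python
def cooks_distance(residual, leverage, mse, n_params):
    """Cook's distance for one observation.

    residual: raw residual e_i; leverage: hat value h_ii;
    mse: mean squared error of the fit; n_params: number of estimated
    coefficients, including the intercept.
    """
    if not 0.0 <= leverage < 1.0:
        raise ValueError("leverage must lie in [0, 1)")
    return (residual ** 2 / (n_params * mse)) * leverage / (1.0 - leverage) ** 2


def is_influential(distance, threshold=1.0):
    """Flag an observation whose Cook's distance exceeds the threshold."""
    return distance > threshold


# Hypothetical inputs, for illustration only.
d = cooks_distance(residual=0.5, leverage=0.25, mse=0.25, n_params=4)
print(d, is_influential(d), is_influential(d, threshold=0.5))
```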
Transformation
Since our data violates the normality and equal variance assumptions, we decided to conduct
a Box-Cox transformation, because it can help stabilize the variance and bring the
distribution closer to normal.
After the transformation, the model still failed the constant variance and normality assumptions.
To further try to correct our model, we also attempted a log transformation. However, the
log-transformed model still failed the constant variance and normality assumptions.
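For readers unfamiliar with the two transformations, the sketch below shows the standard one-parameter Box-Cox transform, of which the log transformation is the special case lambda = 0. It is a minimal Python illustration: the function name `box_cox` and the example inputs are assumptions, and the lambda value actually selected in the analysis is not reported. The transform requires strictly positive responses.

```python
import math


def box_cox(y, lam):
    """One-parameter Box-Cox transform of a strictly positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)  # the log transformation is the lambda = 0 case
    return (y ** lam - 1.0) / lam


# Illustrative inputs only: a satisfaction-like score transformed with
# a few candidate lambda values.
for lam in (0, 0.5, 1, 2):
    print(lam, box_cox(0.64, lam))
```

With lambda = 1 the transform only shifts the response by a constant, so the fit is unchanged; values further from 1 reshape the distribution more strongly.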
3.2 Discussion
Overall, our model does not meet all of the assumptions necessary to be a viable model
for predicting employee satisfaction. To test the linearity assumption, we plotted the residuals vs
fitted values and found that the blue fitted line was not linear, violating
the linearity assumption. For the normality assumption, we plotted a histogram and a normal Q-Q
plot and computed the Shapiro-Wilk test. The histogram showed that the data was right-skewed
and the normal Q-Q plot showed data points leaving the line; both plots visually showed that
the data was not normally distributed. Additionally, the Shapiro-Wilk test gave
a p-value < 0.05, so we reject the null hypothesis that the
data is normally distributed. To investigate the equal variance assumption, we conducted the Breusch-Pagan test,
which gave a p-value < 0.05, so we reject the null hypothesis of
equal variance. The model therefore violates the equal variance assumption. To
attempt to improve the model, we conducted a Box-Cox transformation. However, this did not
improve our model, as it still violated the linearity, normality, and equal variance assumptions.
To test for multicollinearity, we computed the VIF values. All variables in the
model had a value of 1 ≤ VIF ≤ 5, which shows moderate collinearity that did not require
corrective measures.
The failure of these assumptions is an important part of our findings. Since our model
did not meet the assumptions, we cannot ensure the reliability or validity of the predictions
it makes. Without meeting these assumptions, we risk drawing false conclusions and interpretations
from our data. Moreover, if comparison to other models is desired, we would not be able to
produce a fair comparison with an invalid model. In all, these assumptions exist to ensure
that the model has no result-altering biases.
4.1 Approach
Yes, the overall approach we took is promising. We used a systematic approach and
confirmed our results through multiple visual and statistical methods. For verification of
results, we prioritized the findings from quantitative tests such as Shapiro-Wilk and
Breusch-Pagan. The precise results of these tests were less ambiguous and reduced
the room for interpretation errors. This approach leads to the most accurate result. A variant
of this approach would be to include more visual evidence in our checks. This would
improve our approach by allowing us to understand visually why our model fails, leaving
us better equipped to troubleshoot and decide on transformation methods.
5 REFERENCES
Apollo Technical. (2022). Retrieved from
https://www.apollotechnical.com/job-satisfaction-statistics/
Bellet, C. S., De Neve, J. E., & Ward, G. (2023). Does employee happiness have an impact
on productivity? Management Science.
Long, R. (2023, September 19). Why employee satisfaction matters more than happiness.
https://resources.workable.com/tutorial/employee-satisfaction-happiness#:~:text=Satisfied%20employees%20are%20more%20committed,and%20higher%20overall%20performance%20levels.
Naber, A. (n.d.). One third of your life is spent at work. Retrieved from
https://www.gettysburg.edu/news/stories?id=79db7b34-630c-4f49-ad32-4ab9ea48e72b#:~:text=The%20average%20person%20will%20spend%2090%2C000%20hours%20at%20work%20over%20a%20lifetime.
6 APPENDIX