
PROJECT REPORT

EMPLOYEE ABSENTEEISM
MAYANK AGARWAL
22-11-2018

Contents

1. Introduction
1.1 Problem Statement
1.2 Variables
1.3 Sample Data
1.4 Unique Count

2. Methodology
2.1 Pre-Processing
2.2 Missing Value Analysis
2.3 Outlier Analysis
2.4 Distribution of Continuous Variables
2.5 Distribution of Categorical Variables
2.6 Feature Selection
2.7 Feature Scaling
2.8 Principal Component Analysis (PCA)

3. Modelling
3.1 Model Selection
3.2 Decision Tree
3.3 Random Forest
3.4 Linear Regression

4. Conclusion
4.1 Model Evaluation
4.2 Model Selection
4.3 Solution of Problem Statement

5. Appendix
5.1 Figures

6. R Code

Chapter 1: Introduction

1.1 Problem Statement


XYZ is a courier company in which human capital plays an important role in collection,
transportation and delivery. The company is facing a genuine problem of absenteeism. It has
shared its dataset and requested answers on the following areas:

1. What changes should the company make to reduce absenteeism?

2. How much loss can we project for every month of 2011 if the same trend of absenteeism
continues?

1.2 Variables
There are 21 variables in our data, of which 20 are independent variables and 1 (Absenteeism time
in hours) is the dependent variable. Since the target variable is continuous, this is a regression
problem.

Variable Information:

1. Individual identification (ID)

2. Reason for absence (ICD).

- Absences attested by the International Code of Diseases (ICD), stratified into 21 categories (I to XXI) as follows:
I Certain infectious and parasitic diseases
II Neoplasms
III Diseases of the blood and blood-forming organs and certain disorders involving the immune
mechanism
IV Endocrine, nutritional and metabolic diseases
V Mental and behavioural disorders
VI Diseases of the nervous system
VII Diseases of the eye and adnexa
VIII Diseases of the ear and mastoid process
IX Diseases of the circulatory system
X Diseases of the respiratory system
XI Diseases of the digestive system
XII Diseases of the skin and subcutaneous tissue
XIII Diseases of the musculoskeletal system and connective tissue
XIV Diseases of the genitourinary system
XV Pregnancy, childbirth and the puerperium
XVI Certain conditions originating in the perinatal period
XVII Congenital malformations, deformations and chromosomal abnormalities
XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified
XIX Injury, poisoning and certain other consequences of external causes
XX External causes of morbidity and mortality

XXI Factors influencing health status and contact with health services.
And 7 categories without ICD: patient follow-up (22), medical consultation (23), blood donation
(24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation
(28).

3. Month of absence

4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))

5. Seasons (summer (1), autumn (2), winter (3), spring (4))

6. Transportation expense

7. Distance from Residence to Work (KMs)

8. Service time

9. Age

10. Work load Average/day

11. Hit target

12. Disciplinary failure (yes=1; no=0)

13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))

14. Son (number of children)

15. Social drinker (yes=1; no=0)

16. Social smoker (yes=1; no=0)

17. Pet (number of pet)

18. Weight

19. Height

20. Body mass index

21. Absenteeism time in hours (target)

1.3 Sample Data

Fig 1.3 – First five rows of data

1.4 Unique Count

The figure below shows the number of unique values in each variable of the data.

Fig 1.4 – Unique Count of data

Chapter 2: Methodology

2.1 Pre-Processing

Building a predictive model requires that we look at the data before we start to create the model.
In data mining, looking at the data means exploring it, cleaning it and visualizing it through graphs
and plots. This is known as Exploratory Data Analysis. In this project we examine the distributions
of the categorical and continuous variables, as well as the missing values and outliers present in
the data.

2.2 Missing Value Analysis


In statistics, missing data or missing values occur when no data value is stored for a variable in an
observation. Missing values are a common occurrence in data analysis, and they can have a
significant impact on the results or conclusions drawn from the data. If a variable has more than
30% of its values missing, those values can be ignored, or the column itself can be dropped. In our
case, none of the columns has a high percentage of missing values; the maximum missing
percentage is 4.18%, in the Body Mass Index column. The missing values have been imputed using
the kNN imputation method.

Fig 2.2 – Missing value Percentage
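For reference, a minimal R sketch of this step, assuming the data frame is named df and the DMwR package is available (the full script is given in Chapter 6):

#Percentage of missing values in each column
miss_perc = sapply(df, function(x) sum(is.na(x)) / nrow(df) * 100)
sort(miss_perc, decreasing = TRUE)

#Impute the missing values with kNN (k = 5)
library(DMwR)
df = knnImputation(df, k = 5)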

2.3 Outlier Analysis
It can be observed from the distribution of the variables that almost none of them are normally
distributed. The skew in these distributions can be explained by the presence of outliers and
extreme values in the data. One of the pre-processing steps is the detection and removal of such
outliers. In this project, we use boxplots to visualize and remove outliers: any value lying outside
the lower and upper whiskers of the boxplot is treated as an outlier.

All variables except Distance from Residence to Work, Weight and Body Mass Index contain outliers.

Fig 2.3.1 – Boxplots of continuous variables with outliers

Imputing outlier values:
The outlier values identified from the boxplots are first converted to NA values. These missing
values are then imputed using the kNN imputation method.
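A small sketch of this treatment for a single numeric column, using Transportation expense as an example and assuming the data frame df (boxplot.stats() returns the values lying beyond the whiskers):

#Flag boxplot outliers in a column and convert them to NA
out_vals = boxplot.stats(df$Transportation.expense)$out
df$Transportation.expense[df$Transportation.expense %in% out_vals] = NA

#Re-impute the newly created NA values with kNN
df = knnImputation(df, k = 5)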

Below figure shows the boxplots of variables after removing outliers.

Fig 2.3.2 – Boxplots of continuous variables without outliers

2.4 Distribution of Continuous variables
Histograms are used to observe the distribution of the continuous variables. Looking at them, it can
be seen that the variables are not normally distributed.

Fig 2.4 – Distribution of Continuous variables using Histogram

2.5 Distribution of Categorical Variables


Bar graphs are used to visualize the distribution of categorical variables.

Employees who are social drinkers have more absent hours than those who do not drink.
Employees having zero, one or two children have more absent hours.
Employees with ID numbers 3 and 28 are absent the most.
Employees are absent the most on Mondays and the least on Thursdays.
Reasons 23 and 28 are the reasons employees give most often for being absent.
Employees who have completed only high school education are absent more than others.
Employees are absent the most in the month of March.

Fig 2.5 – Distribution of Categorical variables using Bar graph

2.6 Feature Selection
Feature selection reduces the complexity of a model and makes it easier to interpret; it also reduces
overfitting. Features are selected based on their scores in various statistical tests of their correlation
with the outcome variable. A correlation plot is used to check for multicollinearity between
variables; highly collinear variables are dropped before the model is built.

From the correlation analysis we found that Weight and Body Mass Index have a high correlation
(>0.7), so we excluded the Body Mass Index column.

Fig 2.6 – Correlation plot of Continuous variables
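A brief sketch of this correlation check, assuming numeric_data holds only the continuous columns; any pair with an absolute correlation above 0.7 is a candidate for removal:

#Correlation matrix of the continuous variables
cor_mat = cor(numeric_data)

#List variable pairs with absolute correlation above 0.7
high_cor = which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high_cor[, 1]],
           var2 = colnames(cor_mat)[high_cor[, 2]],
           corr = cor_mat[high_cor])

#Drop Body.mass.index, which is highly correlated with Weight
df = subset(df, select = -c(Body.mass.index))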

2.7 Feature Scaling


Feature scaling is a method used to standardize the range of independent variables or features of
data. In data processing, it is also known as data normalization and is generally performed during the
data pre-processing step.
Most classifiers calculate the distance between two points by the Euclidean distance. If one of the
features has a broad range of values, the distance will be governed by this feature. Therefore, the
range of all features should be normalized so that each feature contributes proportionately to the
final distance. Since our data is not normally distributed, we will use normalization (min-max
scaling) as the feature scaling method.
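Min-max normalization rescales each feature to the [0, 1] range. A short sketch over the continuous columns (excluding the target), assuming their names are stored in numeric_columns:

#Min-max normalization: (x - min) / (max - min)
for(i in numeric_columns){
  df[, i] = (df[, i] - min(df[, i])) / (max(df[, i]) - min(df[, i]))
}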

2.8 Principal Component Analysis (PCA)


Principal component analysis is a method of extracting important variables (in the form of
components) from a large set of variables in a data set. It extracts a low-dimensional set of features
from a high-dimensional data set with the aim of capturing as much information as possible.
After creating dummy variables for the categorical variables, the data has 116 columns and 740
observations. Such a high number of columns can lead to overfitting and poor accuracy.

Fig 2.8 – Cumulative Scree Plot of Principal Components

After applying the PCA algorithm and observing the cumulative scree plot above, it can be seen that
almost 95% of the variance in the data is explained by 45 components out of 116. Hence, we use
only these 45 components as input to the models.
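A short sketch of how the number of components can be chosen from the cumulative variance, assuming the dummy-encoded training data is in train (this mirrors the PCA steps in Chapter 6):

#Run PCA and compute the proportion of variance explained by each component
prin_comp = prcomp(train)
prop_var = prin_comp$sdev^2 / sum(prin_comp$sdev^2)

#Smallest number of components explaining at least 95% of the variance
which(cumsum(prop_var) >= 0.95)[1]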

Chapter 3: Modelling

3.1 Model Selection


After thorough pre-processing, we apply regression models to the processed data to predict the
target variable, Absenteeism time in hours, which is continuous. The models chosen are Linear
Regression, Decision Tree and Random Forest. The error metric chosen for the given problem
statement is Root Mean Square Error (RMSE).

3.2 Decision Tree


The Decision Tree algorithm belongs to the family of supervised learning algorithms. Decision trees
are used for both classification and regression problems.
A decision tree is a tree in which each node represents a feature (attribute), each link (branch)
represents a decision (rule) and each leaf represents an outcome (a categorical or continuous
value). The general idea is to build a model from the training data that can predict the class or
value of the target variable by learning decision rules inferred from prior data.

Fig 3.2 – Plot of actual values vs predicted values for Decision Tree

The RMSE values and R^2 values for the given project in R and Python are:

DECISION TREE      RMSE       R^2
R                  0.442      0.978
PYTHON             0.0353     0.9998

3.3 Random Forest
Random Forest is a supervised learning algorithm. It builds multiple decision trees and merges them
to obtain a more accurate and stable prediction. It can be used for both classification and regression
problems. The method of combining trees is known as an ensemble method: a combination of
weak learners (the individual trees) produces a strong learner.
The number of decision trees used for prediction in the forest is 500.

Fig 3.3 – Plot of actual values vs predicted values for Random Forest

RANDOM FOREST      RMSE       R^2
R                  0.480      0.978
PYTHON             0.0445     0.9998

3.4 Linear Regression


Multiple linear regression is the most common form of linear regression analysis. It is used to
explain the relationship between one continuous dependent variable and two or more independent
variables, which can be continuous or categorical.

LINEAR REGRESSION  RMSE       R^2
R                  0.003      0.9999
PYTHON             0.0013     0.9999

Fig 3.4 – Plot of actual values vs predicted values for Linear Regression

Chapter 4: Conclusion

4.1 Model Evaluation


In the previous chapter we saw the Root Mean Square Error (RMSE) and R-squared values of the
different models. RMSE is the standard deviation of the residuals (prediction errors). Residuals
measure how far the data points are from the regression line; RMSE measures how spread out
these residuals are. In other words, it tells you how concentrated the data is around the line of best
fit. Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit. As the square
root of a variance, RMSE can be interpreted as the standard deviation of the unexplained variance,
and it has the useful property of being in the same units as the response variable. Lower RMSE
values and higher R-squared values indicate a better fit.
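For reference, both metrics can be computed directly from the predictions; a short sketch assuming numeric vectors actual and predicted:

#Root Mean Square Error
rmse = sqrt(mean((actual - predicted)^2))

#R-squared: proportion of the target's variance explained by the model
r_squared = 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)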

4.2 Model Selection


Comparing the RMSE and R-squared values of all the models, we conclude that the Linear
Regression model has the lowest RMSE and the highest R-squared value, so it is selected as the final model.

4.3 Solution of Problem Statement


4.3.1 What changes should the company make to reduce absenteeism?

Solution:

a. It can be observed that employees who have only a high school education tend to be absent
more than others. The company can either hire employees who have at least graduated from
college or ask employees with only a high school education to reduce the number of hours they
are absent.

Fig 4.3.1 – Plot of Education vs Absent Hours

b. Employees with IDs 3, 28 and 34 are among those who are absent the most. The company may
warn such employees about their frequent absence and, if it is repeated, take action against them
if necessary.

Fig 4.3.2 – Plot of ID vs Absent Hours

c. The reasons most frequently given by employees for being absent are reasons 13, 20, 23 and 28.
These include medical consultation, dental consultation, external causes of morbidity and
mortality, and diseases of the musculoskeletal system and connective tissue. Company XYZ can
help employees stay healthier by arranging monthly on-site health consultations.

Fig 4.3.3 – Plot of Reason of Absence vs Absent Hours

d. Employees who are social drinkers tend to be absent more than those who do not drink.
XYZ can keep track of these employees and advise them to reduce their alcohol intake on working
days.

Fig 4.3.4 – Plot of Social Drinker vs Absent Hours

e. Employees are absent the most on Mondays, with 1426 absent hours, followed by Tuesdays with
1322.4 absent hours. XYZ can encourage employees not to take as many absent hours on these
days.

Fig 4.3.5 – Plot of Day of the Week vs Absent Hours

f. Employees are absent the most during the spring season.

Fig 4.3.6 – Plot of Season vs Absent Hours

g. Employees with at most two children, or no children at all, are absent the most.

Fig: 4.3.7 – Plot of Sons vs Absent Hours

4.3.2 How much loss can we project for every month of 2011 if the same trend of absenteeism
continues?

Solution:
Considering the losses to be the absenteeism time in hours, if the same trend of absenteeism
continues, the total loss per month is as shown in the graph below.
Employees are absent the most in the month of March, with total absenteeism of 458.2 hours, and
the least in the month of January, with 173.6 hours.

Fig 4.3.2 – Absenteeism Hours per Month

The table below shows the monthly losses in absenteeism hours:

Month        Absent Hours
January      173.6
February     275.4
March        458.2
April        244.7
May          266.7
June         251
July         375.8
August       254.3
September    190.2
October      295.2
November     266
December     200.3

Chapter 5: Appendix

5.1 Figures

Fig 2.2 – Missing value Percentage

Fig 2.3.1 – Boxplots of continuous variables with outliers

Fig 2.3.2 – Boxplots of continuous variables without outliers

Fig 2.4 – Distribution of Continuous variables using Histogram

Fig 2.5 – Distribution of Categorical variables using Bar graph

Fig 2.6 – Correlation plot of Continuous variables

Fig 2.8 – Cumulative Scree Plot of Principal Components

Fig 3.2 – Plot of actual values vs predicted values for Decision Tree

Fig 3.3 – Plot of actual values vs predicted values for Random Forest

Fig 3.4 – Plot of actual values vs predicted values for Linear Regression

Chapter 6: R Code

#Load the required libraries
library(xlsx)
library(DMwR)
library(ggplot2)
library(gridExtra)
library(corrgram)
library(usdm)
library(dummies)
library(DataCombine)
library(rpart)
library(randomForest)
library(caret)

#Read the Excel data file
emp_absent = read.xlsx(file = "Absenteeism_at_work.xls", header = T, sheetIndex = 1)

#################################EXPLORE THE DATA#################################

#Check number of rows and columns


dim(emp_absent)

#Observe top 5 rows


head(emp_absent)

#Structure of variables
str(emp_absent)

#Transform data types


emp_absent$ID = as.factor(as.character(emp_absent$ID))
emp_absent$Reason.for.absence[emp_absent$Reason.for.absence %in% 0] = 20
emp_absent$Reason.for.absence = as.factor(as.character(emp_absent$Reason.for.absence))
emp_absent$Month.of.absence[emp_absent$Month.of.absence %in% 0] = NA
emp_absent$Month.of.absence = as.factor(as.character(emp_absent$Month.of.absence))
emp_absent$Day.of.the.week = as.factor(as.character(emp_absent$Day.of.the.week))
emp_absent$Seasons = as.factor(as.character(emp_absent$Seasons))
emp_absent$Disciplinary.failure = as.factor(as.character(emp_absent$Disciplinary.failure))
emp_absent$Education = as.factor(as.character(emp_absent$Education))
emp_absent$Son = as.factor(as.character(emp_absent$Son))
emp_absent$Social.drinker = as.factor(as.character(emp_absent$Social.drinker))
emp_absent$Social.smoker = as.factor(as.character(emp_absent$Social.smoker))
emp_absent$Pet = as.factor(as.character(emp_absent$Pet))

#Structure of variables
str(emp_absent)

#Make a copy of data


df = emp_absent

#############################MISSING VALUE ANALYSIS#############################

#Get number of missing values


sapply(df,function(x){sum(is.na(x))})
missing_values = data.frame(sapply(df,function(x){sum(is.na(x))}))

#Get the rownames as new column


missing_values$Variables = row.names(missing_values)

#Reset the row names


row.names(missing_values) = NULL

#Rename the column


names(missing_values)[1] = "Miss_perc"

#Calculate missing percentage
missing_values$Miss_perc = ((missing_values$Miss_perc/nrow(emp_absent)) *100)

#Reorder the columns


missing_values = missing_values[,c(2,1)]

#Sort the rows according to decreasing missing percentage


missing_values = missing_values[order(-missing_values$Miss_perc),]

#Create a bar plot to visualize the top 5 missing values


ggplot(data = missing_values[1:5,], aes(x=reorder(Variables, -Miss_perc),y = Miss_perc)) +
geom_bar(stat = "identity",fill = "grey")+xlab("Parameter")+
ggtitle("Missing data percentage") + theme_bw()

#Create a missing value and impute it using kNN
df[["Body.mass.index"]][3] = NA
df = knnImputation(data = df, k = 5)

#Check if any missing values


sum(is.na(df))

# Saving output result into excel file


write.xlsx(missing_values, "Missing_perc_R.xlsx", row.names = F)

###################EXPLORE DISTRIBUTION USING GRAPHS###################

#Get numerical data


numeric_index = sapply(df, is.numeric)
numeric_data = df[,numeric_index]

#Distribution of factor data using bar plot


bar1 = ggplot(data = df, aes(x = ID)) + geom_bar() + ggtitle("Count of ID") + theme_bw()
bar2 = ggplot(data = df, aes(x = Reason.for.absence)) + geom_bar() +
ggtitle("Count of Reason for absence") + theme_bw()
bar3 = ggplot(data = df, aes(x = Month.of.absence)) + geom_bar() + ggtitle("Count of Month") +
theme_bw()
bar4 = ggplot(data = df, aes(x = Disciplinary.failure)) + geom_bar() +
ggtitle("Count of Disciplinary failure") + theme_bw()
bar5 = ggplot(data = df, aes(x = Education)) + geom_bar() + ggtitle("Count of Education") + theme_bw()
bar6 = ggplot(data = df, aes(x = Son)) + geom_bar() + ggtitle("Count of Son") + theme_bw()
bar7 = ggplot(data = df, aes(x = Social.smoker)) + geom_bar() +
ggtitle("Count of Social smoker") + theme_bw()

gridExtra::grid.arrange(bar1,bar2,bar3,bar4,ncol=2)
gridExtra::grid.arrange(bar5,bar6,bar7,ncol=2)

#Check the distribution of numerical data using histogram


hist1 = ggplot(data = numeric_data, aes(x =Transportation.expense)) +
ggtitle("Transportation.expense") + geom_histogram(bins = 25)

hist2 = ggplot(data = numeric_data, aes(x =Height)) + ggtitle("Distribution of Height") +
geom_histogram(bins = 25)

hist3 = ggplot(data = numeric_data, aes(x = Body.mass.index)) + ggtitle("Distribution of Body.mass.index") + geom_histogram(bins = 25)

hist4 = ggplot(data = numeric_data, aes(x = Absenteeism.time.in.hours)) + ggtitle("Distribution of Absenteeism.time.in.hours") + geom_histogram(bins = 25)

gridExtra::grid.arrange(hist1,hist2,hist3,hist4,ncol=2)

#########################OUTLIER ANALYSIS#########################

#Get the data with only numeric columns


numeric_index = sapply(df, is.numeric)
numeric_data = df[,numeric_index]

#Get the data with only factor columns


factor_data = df[,!numeric_index]

#Check for outliers using boxplots


for(i in 1:ncol(numeric_data)) {
assign(paste0("box",i), ggplot(data = df, aes_string(y = numeric_data[,i])) + stat_boxplot(geom =
"errorbar", width = 0.5) + geom_boxplot(outlier.colour = "red", fill = "grey", outlier.size = 1) +
labs(y = colnames(numeric_data[i])) + ggtitle(paste("Boxplot: ",colnames(numeric_data[i]))))
}

#Arrange the plots in grids


gridExtra::grid.arrange(box1,box2,box3,box4,ncol=2)
gridExtra::grid.arrange(box5,box6,box7,box8,ncol=2)
gridExtra::grid.arrange(box9,box10,ncol=2)

#Get the names of the numeric columns
numeric_columns = colnames(numeric_data)

#Replace all outlier data with NA
for(i in numeric_columns){
  val = df[,i][df[,i] %in% boxplot.stats(df[,i])$out]
  print(paste(i, length(val)))
  df[,i][df[,i] %in% val] = NA
}

#Check number of missing values


sapply(df,function(x){sum(is.na(x))})

#Get number of missing values after replacing outliers as NA


missing_values_out = data.frame(sapply(df,function(x){sum(is.na(x))}))
missing_values_out$Columns = row.names(missing_values_out)
row.names(missing_values_out) = NULL
names(missing_values_out)[1] = "Miss_perc"
missing_values_out$Miss_perc = ((missing_values_out$Miss_perc/nrow(emp_absent)) *100)
missing_values_out = missing_values_out[,c(2,1)]
missing_values_out = missing_values_out[order(-missing_values_out$Miss_perc),]
missing_values_out

#Compute the NA values using KNN imputation


df = knnImputation(df, k = 5)

#Check if any missing values
sum(is.na(df))

#############################FEATURE SELECTION#############################

#Check for multicollinearity using VIF


vifcor(numeric_data)

#Check for multicollinearity using corelation graph


corrgram(numeric_data, order = F, upper.panel = panel.pie, text.panel = panel.txt, main = "Correlation Plot")

#Variable Reduction
df = subset.data.frame(df, select = -c(Body.mass.index))

#########################FEATURE SCALING#########################

#Normality check
hist(df$Absenteeism.time.in.hours)

#Remove dependent variable


numeric_index = sapply(df,is.numeric)
numeric_data = df[,numeric_index]
numeric_columns = names(numeric_data)
numeric_columns = numeric_columns[-9]

#Normalization of continuous variables


for(i in numeric_columns){
print(i)
df[,i] = (df[,i] - min(df[,i]))/
(max(df[,i]) - min(df[,i]))
}

#Get the names of factor variables


factor_columns = names(factor_data)

#Create dummy variables of factor variables


df = dummy.data.frame(df, factor_columns)

rmExcept(keepers = c("df","emp_absent"))
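
#The sections below use train and test, which are not defined elsewhere in this listing.
#A plausible 80/20 random split (an assumption, not taken from the original report):
set.seed(123)
train_index = sample(1:nrow(df), 0.8 * nrow(df))
train = df[train_index, ]
test = df[-train_index, ]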

########################DIMENSION REDUCTION USING PCA########################

#Principal component analysis


prin_comp = prcomp(train)

#Compute standard deviation of each principal component


pr_stdev = prin_comp$sdev

#Compute variance
pr_var = pr_stdev^2

#Proportion of variance explained
prop_var = pr_var/sum(pr_var)

#Cumulative scree plot


plot(cumsum(prop_var), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b")

#Add a training set with principal components


train.data = data.frame(Absenteeism.time.in.hours = train$Absenteeism.time.in.hours,
prin_comp$x)

# From the above plot selecting 45 components since it explains almost 95+ % data variance
train.data =train.data[,1:45]

#Transform test data into PCA


test.data = predict(prin_comp, newdata = test)
test.data = as.data.frame(test.data)

#Select the first 45 components


test.data=test.data[,1:45]

########################DECISION TREE########################

#Build decision tree using rpart


dt_model = rpart(Absenteeism.time.in.hours ~., data = train.data, method = "anova")

#Predict the test cases


dt_predictions = predict(dt_model,test.data)

#Create data frame for actual and predicted values


df_pred = data.frame("actual"=test[,116], "dt_pred"=dt_predictions)
head(df_pred)

#Calculate MAE, RMSE, R-squared for testing data


print(postResample(pred = dt_predictions, obs = test$Absenteeism.time.in.hours))

#Plot a graph for actual vs predicted values


plot(test$Absenteeism.time.in.hours,type="l",lty=2,col="green")
lines(dt_predictions,col="blue")

###########################RANDOM FOREST###########################

#Train the model using training data


rf_model = randomForest(Absenteeism.time.in.hours ~ ., data = train.data, ntree = 500)

#Predict the test cases


rf_predictions = predict(rf_model,test.data)

#Create dataframe for actual and predicted values


df_pred = cbind(df_pred,rf_predictions)
head(df_pred)

#Calculate MAE, RMSE, R-squared for testing data


print(postResample(pred = rf_predictions, obs = test$Absenteeism.time.in.hours))

#Plot a graph for actual vs predicted values
plot(test$Absenteeism.time.in.hours,type="l",lty=2,col="green")
lines(rf_predictions,col="blue")

###########################LINEAR REGRESSION###########################

#Train the model using training data


lr_model = lm(Absenteeism.time.in.hours ~ ., data = train.data)

#Get the summary of the model


summary(lr_model)

#Predict the test cases


lr_predictions = predict(lr_model,test.data)

#Create dataframe for actual and predicted values


df_pred = cbind(df_pred,lr_predictions)
head(df_pred)

#Calculate MAE, RMSE, R-squared for testing data


print(postResample(pred = lr_predictions, obs =test$Absenteeism.time.in.hours))

#Plot a graph for actual vs predicted values


plot(test$Absenteeism.time.in.hours,type="l",lty=2,col="green")
lines(lr_predictions,col="blue")

