Business Analytics Project

This document presents a summary of a dataset on employee absenteeism at a Brazilian courier company from 2007 to 2010. It contains 20 attributes related to employee demographics, health issues, and expenses. Summary statistics, visualizations, linear regression, and hypothesis tests were performed to analyze relationships between attributes like distance from work and expenses, age and service time, and the effect of education level on absenteeism.

Uploaded by

Aurva Bhardwaj

Business Analytics

Project

Submitted to-
Dr. S. Maheswaran

By-
Aurva Bhardwaj-201922066
Koushik G-201922077
Muzammil Quazi-201922083
Neerav Bhardwaj-201922084
Tulika Das-201922105
Overview
The database was created from records of absenteeism at work from
July 2007 to July 2010 at a courier company in Brazil.
Relevant Information:
The data set allows for several new combinations of attributes and
attribute exclusions, or the modification of the attribute type
(categorical, integer, or real), depending on the purpose of the
research. The data set (Absenteeism at work - Part I) was used in
academic research at the Universidade Nove de Julho - Postgraduate
Program in Informatics and Knowledge Management. The data captures
various attributes and their effects on employee absenteeism,
including factors such as age, distance from residence,
transportation and expenses. It also captures the reasons for each
absence, such as the various kinds of diseases that might affect
employees.
The reasons for absence are described by the following categories:
1) Certain infectious and parasitic diseases
2) Neoplasms
3) Diseases of the blood and blood-forming organs and certain
disorders involving the immune mechanism
4) Endocrine, nutritional and metabolic diseases
5) Mental and behavioural disorders
6) Diseases of the nervous system
7) Diseases of the eye and adnexa
8) Diseases of the ear and mastoid process
9) Diseases of the circulatory system
10) Diseases of the respiratory system
11) Diseases of the digestive system
12) Diseases of the skin and subcutaneous tissue
13) Diseases of the musculoskeletal system and connective tissue
14) Diseases of the genitourinary system
15) Pregnancy, childbirth and the puerperium
16) Certain conditions originating in the perinatal period
17) Congenital malformations, deformations and chromosomal
abnormalities
18) Symptoms, signs and abnormal clinical and laboratory findings,
not elsewhere classified
19) Injury, poisoning and certain other consequences of external
causes
20) External causes of morbidity and mortality
21) Factors influencing health status and contact with health services
The dataset contains both integer and real values, for attributes
such as education and age.

Description of Data
Data Set Characteristics: Multivariate, Time series
Attribute Characteristics: Integer, Real
Associated Tasks: Classification, Clustering
No. of Instances: 740
No. of Attributes: 20
Missing Values: N/A

Dataset Review
• Out of the total of 740 instances, a sample of 350 entries has
been taken.
• The dataset consists of both ordinal and nominal data.
• Quantitative attributes such as age, weight, height and body mass
index are present.
• A total of 20 attributes are present.
The dataset is multivariate and can be analysed using both
descriptive and inferential statistics. Summary statistics give the
measures of central tendency (mean, median and mode) of the various
attributes. Measures of variation describe the spread of the data;
for example, the standard deviation measures how far values typically
lie from the mean.
Visual statistics can also be used to describe the data and present
it in a more comprehensible manner. Common tools include pie charts,
histograms and box plots.
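As a small illustration of the measures named above, the following sketch computes them with Python's standard `statistics` module. The list `values` is hypothetical and invented for this example; it is not taken from the data set, and the report itself used Excel and Statcraft rather than Python.

```python
import statistics

# Hypothetical values, invented for illustration only (not from the data set).
values = [118, 179, 179, 231, 260, 388]

summary = {
    "mean": statistics.mean(values),      # arithmetic average
    "median": statistics.median(values),  # middle value of the sorted data
    "mode": statistics.mode(values),      # most frequent value
    "stdev": statistics.stdev(values),    # sample standard deviation (n - 1)
    "range": max(values) - min(values),   # maximum minus minimum
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```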

Statistical Analysis Tools
Summary Statistics

Mean 226.7020057
Standard Error 3.575000623
Median 231
Mode 179
Standard Deviation 66.78652319
Sample Variance 4460.43968
Kurtosis -0.455562906
Skewness 0.195363311
Range 270
Minimum 118
Maximum 388
Sum 79119
Count 349

The table above gives the summary statistics for transport expense.
The mean transport expense is about 226.7 and the median is 231. The
minimum transport expense is 118 and the maximum is 388.
Similarly, summary statistics were computed for distance from work.
The mean distance from work is 36.3 kilometres and the median is 36.
The minimum distance is 27 kilometres and the maximum is 58.
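The entries in the transport expense table are related to one another, which gives a quick way to check them. The sketch below recomputes the mean, sample variance, standard error and range from the other reported figures. It assumes the usual definitions (standard error = standard deviation / sqrt(count)); the variable names are chosen for this example.

```python
import math

# Figures reported in the transport expense summary table.
total = 79119
count = 349
std_dev = 66.78652319
minimum, maximum = 118, 388

mean = total / count                          # Sum / Count
sample_variance = std_dev ** 2                # variance = standard deviation squared
standard_error = std_dev / math.sqrt(count)   # standard error of the mean
value_range = maximum - minimum               # Maximum - Minimum

print(f"mean            = {mean:.7f}")
print(f"sample variance = {sample_variance:.5f}")
print(f"standard error  = {standard_error:.9f}")
print(f"range           = {value_range}")
```

The recomputed values agree with the Mean, Sample Variance, Standard Error and Range rows of the table.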

Visual Statistics

The following histogram shows the number of employees at each level
of education. Employees who studied up to high school form the
largest group.
The following pie chart shows the amount of absenteeism by weekday.
Monday has the highest number of absences.
Linear Regression
[Figure: scatter plot titled "Distance from Residence to Work" with a
fitted linear trend line, y = 0.0581x + 16.772, R² = 0.0687. Axis
labels as shown in the chart: Travel Expense (vertical), Distance
(horizontal).]

The linear model above relates distance from residence to work and
travel expense. With R² = 0.0687, only about 6.9% of the variation in
the dependent variable is explained by the model, so the linear
relationship between the two attributes is weak.
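To make the fitted line concrete, the sketch below evaluates the reported equation y = 0.0581x + 16.772 and converts R² into the magnitude of the correlation coefficient. The function name `predict` is chosen for this example, and `x` and `y` stand for whichever attributes the chart placed on its horizontal and vertical axes.

```python
import math

# Trend line and R-squared as reported on the chart.
SLOPE = 0.0581
INTERCEPT = 16.772
R_SQUARED = 0.0687

def predict(x):
    """Value on the fitted line for a given x."""
    return SLOPE * x + INTERCEPT

# For a simple (one-predictor) linear regression, |r| = sqrt(R^2).
r_magnitude = math.sqrt(R_SQUARED)

print(f"fitted value at x = 100: {predict(100):.3f}")
print(f"|r| = {r_magnitude:.4f}")
```

A correlation magnitude of roughly 0.26 is consistent with reading the relationship as weak.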

2)

Regression Statistics  
Multiple R 0.663525156
R Square 0.440265633
Adjusted R Square 0.438652565
Standard Error 2.960793027
Observations 349

The linear regression model above shows the relation between service
time and age. The correlation coefficient (Multiple R) is about 0.66,
so age and service time are moderately and positively correlated.
The coefficient of determination (R Square) is 0.44, which means that
about 44% of the variation in the dependent variable is explained by
the regression model.
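The regression statistics can be cross-checked against each other. The sketch below squares Multiple R to obtain R Square and then applies the standard adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - k - 1). It assumes a single predictor (k = 1), which is an assumption of this example rather than something the output states explicitly.

```python
# Values reported in the Regression Statistics table.
multiple_r = 0.663525156
n = 349   # Observations
k = 1     # assumption: one predictor in the model

r_squared = multiple_r ** 2
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"R Square          = {r_squared:.9f}")
print(f"Adjusted R Square = {adjusted_r_squared:.9f}")
```

Both results match the reported R Square and Adjusted R Square, which supports the single-predictor assumption.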

Parameters  Coefficients
Intercept -2.79337902
Age (slope; service time is the dependent variable) 0.418098451

The scatter plot above shows the linear regression equation and the
fitted model.
Correlation Significance
                                  Distance from residence to work   Service time
Distance from residence to work   NA                                0.00924438
Service time                      0.00924438                        NA

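The correlation between distance from residence to work and service
time is reported as positive, and its significance value of 0.0092 is
below 0.05, so the correlation is statistically significant. This
indicates an association between the two attributes; on its own it
does not show how strong the correlation is, or that service time is
affected by the distance from residence to work.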

Inferential Statistics
Hypothesis Testing
T-test for one sample mean

H0: the mean age is less than or equal to 35 years
H1: the mean age is more than 35 years

  Age
Mean 36.30
Variance 39.33
Observations 349.00
Hypothesized Mean Difference 0.00
df 348.00
t Stat 108.13
P(T<=t) one-tail 0.00
t Critical one-tail 1.65
P(T<=t) two-tail 0.00
t Critical two-tail 1.97

Since the p-value for the test is less than the 0.05 level of
significance, we reject the null hypothesis and conclude that the
mean age of employees is more than 35 years.
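The one-sample t statistic is (sample mean - hypothesized mean) / sqrt(variance / n). The sketch below applies this formula to the reported mean, variance and number of observations. It evaluates the statistic twice: once against 0.00, the hypothesized value shown in the table, and once against 35, the value named in the hypotheses. The function name `t_statistic` is chosen for this example, and the inputs are the rounded figures from the table.

```python
import math

# Values reported in the t-test table (rounded as printed there).
mean_age = 36.30
variance = 39.33
n = 349
T_CRITICAL_ONE_TAIL = 1.65

def t_statistic(hypothesized_mean):
    """One-sample t statistic for the given hypothesized mean."""
    standard_error = math.sqrt(variance / n)
    return (mean_age - hypothesized_mean) / standard_error

t_vs_zero = t_statistic(0.0)   # hypothesized value shown in the table
t_vs_35 = t_statistic(35.0)    # value stated in H0 and H1
reject_null = t_vs_35 > T_CRITICAL_ONE_TAIL

print(f"t against 0  = {t_vs_zero:.2f}")
print(f"t against 35 = {t_vs_35:.2f}")
print(f"reject H0 at the 5% level: {reject_null}")
```

The statistic against 0 reproduces the reported t Stat of 108.13. Against 35 the statistic is much smaller but still above the one-tail critical value of 1.65, so the conclusion that the mean age exceeds 35 years is unchanged.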
ANOVA

H0: There is no difference in mean absenteeism between the education
groups of employees
HA: There is a difference in mean absenteeism between at least two
education groups of employees

Anova: Single Factor

SUMMARY
Groups              Count   Sum    Average    Variance
Highschool          582     4323   7.427835   212.1867
Graduate            46      294    6.391304   45.62126
master and doctor   4       21     5.25       10.25
postgraduate        74      405    5.472973   67.0472

ANOVA
Source of Variation   SS            df    MS         F          P-value    F crit
Between Groups        293.9294517   3     97.97648   0.528023   0.663157   2.61759
Within Groups         130258.6215   702   185.5536

Total                 130552.551    705

Since the F-value (0.528) is less than the F-critical value (2.618),
and the p-value (0.663) is greater than 0.05, we fail to reject the
null hypothesis: there is no evidence of a difference in mean
absenteeism between the education groups.
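The ANOVA figures can be rebuilt from the group counts, sums and variances in the SUMMARY table. The sketch below is one way to do that in plain Python; the dictionary layout and variable names are chosen for this example, and the variances are the rounded values read from the table, so the within-groups sum of squares agrees only up to rounding.

```python
# (count, sum, variance) for each education group, read from the SUMMARY table.
groups = {
    "Highschool": (582, 4323, 212.1867),
    "Graduate": (46, 294, 45.62126),
    "master and doctor": (4, 21, 10.25),
    "postgraduate": (74, 405, 67.0472),
}
F_CRITICAL = 2.61759  # F crit as reported in the ANOVA table

n_total = sum(count for count, _, _ in groups.values())
grand_mean = sum(total for _, total, _ in groups.values()) / n_total

# Between groups: weighted squared distance of each group mean from the grand mean.
ss_between = sum(
    count * (total / count - grand_mean) ** 2
    for count, total, _ in groups.values()
)
# Within groups: each sample variance times its degrees of freedom.
ss_within = sum((count - 1) * var for count, _, var in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)

f_value = (ss_between / df_between) / (ss_within / df_within)
reject_null = f_value > F_CRITICAL

print(f"SS between = {ss_between:.4f} (df = {df_between})")
print(f"SS within  = {ss_within:.4f} (df = {df_within})")
print(f"F = {f_value:.6f}, reject H0: {reject_null}")
```

The recomputed sums of squares, degrees of freedom and F-value line up with the ANOVA table, and the F-value stays well below the critical value.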
References-
• www.kaggle.com
• http://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work
• www.statcraft.com

Tools Used
• Microsoft Excel
• Statcraft
