Business Analytics Project
Submitted to-
Dr. S. Maheswaran
By-
Aurva Bhardwaj-201922066
Koushik G-201922077
Muzammil Quazi-201922083
Neerav Bhardwaj-201922084
Tulika Das-201922105
Overview
The database was created with records of absenteeism at work from
July 2007 to July 2010 at a courier company in Brazil.
Relevant Information:
The data set allows for several new combinations of attributes and
attribute exclusions, or the modification of the attribute type
(categorical, integer, or real), depending on the purpose of the
research. The data set (Absenteeism at work - Part I) was used in
academic research at the Universidade Nove de Julho - Postgraduate
Program in Informatics and Knowledge Management. The data captures
various attributes and their effects on employee absenteeism,
including factors such as age, distance from residence,
transportation, and expenses. The dataset also captures various
reasons for an employee's absence, such as the various kinds of
diseases that might affect the employees.
Some of the attribute descriptions are:
1) Certain infectious and parasitic diseases
2) Neoplasms
3) Diseases of the blood and blood-forming organs and certain
   disorders involving the immune mechanism
4) Endocrine, nutritional and metabolic diseases
5) Mental and behavioural disorders
6) Diseases of the nervous system
7) Diseases of the eye and adnexa
8) Diseases of the ear and mastoid process
9) Diseases of the circulatory system
10) Diseases of the respiratory system
11) Diseases of the digestive system
12) Diseases of the skin and subcutaneous tissue
13) Diseases of the musculoskeletal system and connective tissue
14) Diseases of the genitourinary system
15) Pregnancy, childbirth and the puerperium
16) Certain conditions originating in the perinatal period
17) Congenital malformations, deformations and chromosomal
    abnormalities
18) Symptoms, signs and abnormal clinical and laboratory findings,
    not elsewhere classified
19) Injury, poisoning and certain other consequences of external
    causes
20) External causes of morbidity and mortality
21) Factors influencing health status and contact with health services
The dataset contains both real and integer values, for attributes
such as education and age.
Description of Data
Data Set Characteristics     Multivariate, Time series
Attribute Characteristics    Integer, Real
Associated Tasks             Classification, Clustering
No. of Instances             740
No. of Attributes            20
Missing Values               N/A
Dataset Review
Out of the total of 740 instances, a sample of 350 entries has been
taken.
The dataset consists of both ordinal and nominal data.
Quantitative attributes like age, weight, height and body mass
index are present.
A total of 20 attributes are present.
The dataset is multivariate and can be analysed using both descriptive
and inferential statistics. Using summary statistics, measures of
central tendency (mean, median, and mode) can be calculated for the
various attributes. Measures of variation describe the spread of the
data; for example, the standard deviation measures how far the values
deviate from the mean.
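As a minimal sketch of these measures, the snippet below uses Python's
standard `statistics` module on a small made-up list of values. The
list `values` is hypothetical and is not taken from the project's
sample; it only illustrates how each measure is obtained.

```python
import statistics

# Hypothetical values for one numeric attribute (NOT the project's data).
values = [118, 179, 179, 231, 260, 388]

mean = statistics.mean(values)      # arithmetic average
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value
stdev = statistics.stdev(values)    # sample standard deviation (divides by n - 1)

print(f"mean={mean:.2f} median={median} mode={mode} stdev={stdev:.2f}")
```

Note that `statistics.stdev` is the sample standard deviation, which is
also what spreadsheet descriptive-statistics tools typically report.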
Visual statistics can also be used to describe the data and represent
it in a more comprehensible manner. Various visual tools are available
to present data, such as pie charts, histograms, and box plots.
Statistical Analysis Tools
Summary Statistics
Mean 226.7020057
Standard Error 3.575000623
Median 231
Mode 179
Standard Deviation 66.78652319
Sample Variance 4460.43968
Kurtosis -0.455562906
Skewness 0.195363311
Range 270
Minimum 118
Maximum 388
Sum 79119
Count 349
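The figures in this table are linked by simple identities, so they can
be cross-checked against one another. The sketch below copies the
reported values and tests four relations (mean = sum / count,
variance = standard deviation squared, standard error = standard
deviation / sqrt(count), range = maximum - minimum). The variable
names are illustrative only.

```python
import math

# Figures copied from the summary statistics table above.
total, count = 79119, 349
mean, std_dev = 226.7020057, 66.78652319
std_error, variance = 3.575000623, 4460.43968
minimum, maximum, value_range = 118, 388, 270

# Each entry compares a recomputed value with the reported one.
checks = {
    "mean = sum / count": math.isclose(total / count, mean, rel_tol=1e-6),
    "variance = sd^2": math.isclose(std_dev ** 2, variance, rel_tol=1e-6),
    "std error = sd / sqrt(n)": math.isclose(
        std_dev / math.sqrt(count), std_error, rel_tol=1e-6
    ),
    "range = max - min": maximum - minimum == value_range,
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```

All four relations hold for the reported figures.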
Visual Statistics
[Figure: scatter plot of Distance from Residence to Work with a linear
trend line; x-axis: Distance]
Regression Statistics
Multiple R 0.663525156
R Square 0.440265633
Adjusted R Square 0.438652565
Standard Error 2.960793027
Observations 349
Parameters Coefficients
Age -2.79337902
Service time 0.418098451
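The regression statistics can be related to one another with the
standard formulas R Square = (Multiple R)^2 and Adjusted R Square =
1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below recomputes both
from the reported values. The number of predictors `p = 1` is an
assumption: the report does not state the model explicitly, but this
value reproduces the reported Adjusted R Square.

```python
# Values copied from the regression statistics table above.
multiple_r = 0.663525156
r_square = 0.440265633
adjusted_reported = 0.438652565
n = 349  # observations

# ASSUMPTION: one predictor; the report does not state the model explicitly.
p = 1

# R Square is the square of Multiple R.
r_square_from_r = multiple_r ** 2

# Adjusted R Square penalises R Square for the number of predictors.
adjusted = 1 - (1 - r_square) * (n - 1) / (n - p - 1)

print(f"R^2 from Multiple R: {r_square_from_r:.6f}")
print(f"Adjusted R^2 (p={p}): {adjusted:.6f}")
```

An R Square of about 0.44 means the fitted line accounts for roughly
44% of the variance in the dependent variable.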
Inferential Statistics
Hypothesis Testing
T-test for one sample mean
Age
Mean 36.30
Variance 39.33
Observations 349.00
Hypothesized Mean Difference 0.00
df 348.00
t Stat 108.13
P(T<=t) one-tail 0.00
t Critical one-tail 1.65
P(T<=t) two-tail 0.00
t Critical two-tail 1.97
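The t statistic in this table follows from the one-sample formula
t = (sample mean - hypothesized mean) / sqrt(variance / n). The sketch
below recomputes it from the rounded figures reported above, so the
result agrees with the table only to about two decimal places. The
variable names are illustrative.

```python
import math

# Rounded figures copied from the t-test table above.
mean_age = 36.30
variance = 39.33
n = 349
mu0 = 0.0                   # hypothesized mean
t_critical_two_tail = 1.97  # from the table

std_error = math.sqrt(variance / n)    # standard error of the mean
t_stat = (mean_age - mu0) / std_error  # one-sample t statistic
df = n - 1                             # degrees of freedom

# Two-tailed decision: reject H0 when |t| exceeds the critical value.
reject_null = abs(t_stat) > t_critical_two_tail

print(f"t = {t_stat:.2f}, df = {df}, reject H0: {reject_null}")
```

Because the t statistic far exceeds the critical value, the null
hypothesis that the mean age equals the hypothesized value of 0 is
rejected.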
Since the F-value is less than the F-critical value, we fail to reject
the null hypothesis; that is, there is no significant difference
between the mean absenteeism of the different education groups.
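A comparison of an F-value with an F-critical value comes from a
one-way analysis of variance (ANOVA). As a rough illustration of how
the F statistic is formed, the sketch below uses three small made-up
groups. The data and the helper name `one_way_anova_f` are
hypothetical and are not the project's education groups or results.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2
        for group, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for group, m in zip(groups, group_means) for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical absenteeism hours for three groups (NOT the project's data).
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

For these toy groups F is 3.0 with (2, 6) degrees of freedom, below the
5% critical value of about 5.14 from a standard F table, so the null
hypothesis of equal group means would not be rejected.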
References-
www.kaggle.com
http://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work
www.statcraft.com
Tools Used
Microsoft Excel
Statcraft