Project PDF
REPORT
ADVANCED STATISTICS
SANDYA V B
CONTENTS
PROBLEM 1: ANOVA
1) State the Null and Alternate Hypothesis for conducting one-way
ANOVA for both the variables ‘Manufacturer’ and ‘Technician’
individually.
2) Perform one-way ANOVA for variable ‘Manufacturer’ with respect to the
variable ‘Service Time’. State whether the Null Hypothesis is accepted or
rejected based on the ANOVA results.
3) Perform one-way ANOVA for variable ‘Technician’ with respect to the
variable ‘Service Time’. State whether the Null Hypothesis is accepted or
rejected based on the ANOVA results.
4) Analyse the effects of one variable on another with the help of an
interaction plot. What is an interaction between two treatments? [hint: use
the ‘pointplot’ function from the ‘seaborn’ graphical subroutine in Python]
5) Perform a two-way ANOVA based on the variables ‘Manufacturer’ &
‘Technician’ with respect to the variable ‘Service Time’ and state your
results.
6) Mention the business implications of performing ANOVA for this
particular case study.
PROBLEM 2: PCA
1) Perform Exploratory Data Analysis [both univariate and multivariate
analysis to be performed]. The inferences drawn from this should be
properly documented.
2) Scale the variables and write the inference for using the type of scaling
function for this case study.
3) Comment on the comparison between covariance and the correlation
matrix after scaling.
4) Check the dataset for outliers before and after scaling. Draw your
inferences from this exercise.
5) Build the covariance matrix, eigenvalues and eigenvector.
6) Write the explicit form of the first PC (in terms of Eigen Vectors)
7) Discuss the cumulative values of the eigenvalues. How does it help you
to decide on the optimum number of principal components? What do the
eigenvectors indicate? Perform PCA and export the data of the Principal
Component scores into a data frame.
8) Mention the business implication of using the Principal Component
Analysis for this case study.
Problem 1: ANOVA
The staff of a service centre for electrical appliances include three technicians who
specialize in repairing three widely used electrical appliances by three different
manufacturers. It was desired to study the effects of Technician and Manufacturer
on the service time. Each technician was randomly assigned five repair jobs on
each manufacturer's appliance and the time to complete each job (in minutes) was
recorded.
Data Dictionary:
Problem 1 data consists of –
• Technician
• Manufacturer
• Job
• ServiceTime
1.1 State the Null and Alternate Hypothesis for conducting one-way
ANOVA for both the variables ‘Manufacturer’ and ‘Technician’
individually.
One-way ANOVA is run separately for each factor, so a separate pair of hypotheses is
stated for ‘Manufacturer’ and for ‘Technician’.
For ‘Manufacturer’:
Ho: The mean service time is the same for all three manufacturers: µ1 = µ2 = µ3
Ha: The mean service time differs for at least one manufacturer.
For ‘Technician’:
Ho: The mean service time is the same for all three technicians: µ1 = µ2 = µ3
Ha: The mean service time differs for at least one technician.
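To make the test behind these hypotheses concrete, here is a minimal sketch of how the one-way ANOVA F statistic is built from between-group and within-group variation. The function name `one_way_anova_f` and the toy numbers are assumptions for illustration only; they are not the case-study data, and the p-value would still have to be read from the F distribution with the returned degrees of freedom.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict of {level: [observations]}."""
    all_values = [x for obs in groups.values() for x in obs]
    grand_mean = mean(all_values)
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(obs) * (mean(obs) - grand_mean) ** 2 for obs in groups.values())
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - mean(obs)) ** 2 for obs in groups.values() for x in obs)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical service times in minutes, NOT the case-study data.
toy = {"M1": [1, 2, 3], "M2": [2, 3, 4], "M3": [3, 4, 5]}
f_stat, df_b, df_w = one_way_anova_f(toy)
print(f_stat, df_b, df_w)
```

A large F means the group means differ by more than the within-group noise would explain, which is evidence against Ho.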
• The significance level (alpha) for all the tests is taken as 0.05 (5%).
• When analysing a two-way ANOVA, we first look at the interaction term. If the
interaction effect is statistically significant, the main effects (the p-values of the
individual independent variables) should not be interpreted on their own, because
the effect of one independent variable then depends on the level of the other and
reading them separately could be misleading.
• The p-value for Manufacturer is 0.656486, which is greater than 0.05, so we fail to
reject the null hypothesis: there is no statistically significant evidence of an
association between Manufacturer and Service Time.
• The p-value for Technician is 0.626250, which is also greater than 0.05, so there is
no statistically significant evidence of an association between Technician and
Service Time.
• The p-value for the interaction Manufacturer*Technician is 0.236268, which is
greater than 0.05, so the interaction is not statistically significant: the data do not
show that the relationship between Manufacturer and Service Time depends on the
Technician. Because the interaction effect is not significant, the main effects can
be interpreted on their own.
So, at the 5% significance level, these results do not show that the Service Time
required by a Technician varies across the products of different Manufacturers.
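The decision rule at alpha = 0.05 can be checked mechanically against the p-values reported above. This is only a sketch of the comparison step; the helper name `decide` is hypothetical, and the p-values are copied from this report rather than recomputed from the data.

```python
ALPHA = 0.05  # significance level stated in the report

# p-values as reported in this report's two-way ANOVA results.
p_values = {
    "Manufacturer": 0.656486,
    "Technician": 0.626250,
    "Manufacturer:Technician": 0.236268,
}


def decide(p_value, alpha=ALPHA):
    """Reject H0 only when the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


decisions = {term: decide(p) for term, p in p_values.items()}
for term, decision in decisions.items():
    print(f"{term}: p = {p_values[term]:.6f} -> {decision}")
```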
Problem 2: PCA
The ‘Hair Salon.csv’ dataset contains various variables used for the context of
Market Segmentation. This particular case study is based on various parameters of
a salon chain of hair products. You are expected to do Principal Component
Analysis for this case study according to the instructions given in the following
rubric.
Data Dictionary:
2.1 Perform Exploratory Data Analysis [both univariate and
multivariate analysis to be performed]. The inferences drawn from
this should be properly documented.
2.3 Comment on the comparison between covariance and the
correlation matrix after scaling.
• After scaling, the covariance matrix of the scaled data closely matches the
correlation matrix, with no or only negligible deviation. This is expected if z-score
scaling is used, since every scaled variable then has unit variance.
• The correlations between the variables after scaling appear strong enough for us to
move ahead with computing the eigenvalues needed for PCA.
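The relationship between the two matrices can be verified numerically. The sketch below assumes z-score scaling with the population standard deviation and uses synthetic data (not the Hair Salon dataset); under those assumptions the covariance matrix of the scaled data equals the correlation matrix of the original data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 rows, 3 columns on very different scales (hypothetical).
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0]) + np.array([5.0, 50.0, 500.0])
X[:, 2] += 20.0 * X[:, 0]  # induce some correlation between columns 0 and 2

# z-score scaling using the population standard deviation (ddof=0).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

cov_scaled = np.cov(Z, rowvar=False, bias=True)  # bias=True divides by n, matching ddof=0
corr_raw = np.corrcoef(X, rowvar=False)

print(np.allclose(cov_scaled, corr_raw))
```

If the sample standard deviation (ddof=1) is used for scaling instead, the same equality holds with the default `np.cov` normalisation.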
2.4 Check the dataset for outliers before and after scaling. Draw your
inferences from this exercise.
Before Scaling:
After Scaling:
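• After scaling, all variables are on a common scale, so their box plots are aligned
and directly comparable. Note that a linear scaling such as the z-score only changes
the units: it does not remove outliers, so any that need treatment must be handled
separately.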
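A small check of the outlier comparison, assuming the usual 1.5 x IQR box-plot rule and z-score scaling; the helper `iqr_outlier_mask` and the sample values are hypothetical. Because z-scoring is a linear transformation, the same points are flagged before and after scaling.

```python
import numpy as np


def iqr_outlier_mask(values):
    """Flag points outside the 1.5 * IQR whiskers used by a box plot."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)


# Hypothetical variable with one clear outlier (50.0).
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 50.0])
z = (x - x.mean()) / x.std()  # z-score scaling

mask_before = iqr_outlier_mask(x)
mask_after = iqr_outlier_mask(z)
print(mask_before.sum(), mask_after.sum())
```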
2.5 Build the covariance matrix, eigenvalues and eigenvector.
2.6 Write the explicit form of the first PC (in terms of Eigen Vectors).
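A general sketch of the form being asked for, assuming p scaled variables z_1, ..., z_p and writing the entries of the first eigenvector (the one with the largest eigenvalue) as e_11, ..., e_p1. These symbols are illustrative; the numeric coefficients come from the eigenvector computed in 2.5.

```latex
\mathrm{PC}_1 = e_{11} z_1 + e_{21} z_2 + \cdots + e_{p1} z_p,
\qquad \sum_{j=1}^{p} e_{j1}^{2} = 1
```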
2.7 Discuss the cumulative values of the eigenvalues. How does it help
you to decide on the optimum number of principal components? What
do the eigenvectors indicate? Perform PCA and export the data of the
Principal Component scores into a data frame.
• We can see that the optimum number of principal components would be 7, as the
slope of the scree plot flattens gradually after that point. The cumulative explained
variance, computed from the eigenvalues of the covariance matrix, gives a clear view
of the same.
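The steps discussed in this section can be sketched with NumPy as follows. The data here is synthetic, and the 90% cumulative-variance threshold is an assumption chosen for illustration (the report's own choice of 7 components is based on the scree plot); the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for the scaled dataset: 200 rows, 5 variables.
X = rng.normal(size=(200, 5))
X[:, 1] += 0.8 * X[:, 0]
X[:, 3] += 0.5 * X[:, 2]

# Standardise, then build the covariance matrix of the scaled data.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov = np.cov(Z, rowvar=False)

# eigh suits symmetric matrices; sort eigenpairs from largest to smallest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of total variance per component, and its running total.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

THRESHOLD = 0.90  # assumed cut-off, not taken from the report
n_components = int(np.searchsorted(cumulative, THRESHOLD) + 1)

# Principal component scores: project the scaled data onto the kept eigenvectors.
scores = Z @ eigvecs[:, :n_components]
print(n_components, np.round(cumulative, 3))
```

The `scores` array can then be wrapped in a data frame, for example with `pandas.DataFrame(scores)`, for export.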
2.8 Mention the business implication of using the Principal
Component Analysis for this case study.
A higher level of customer satisfaction relies on several factors, such as the product
quality score, pricing and delivery speed. From the correlation heat map, however, it
can be inferred that the greatest business gain can be achieved by putting more effort
into advertising. Because the correlations of most factors are strongly affected by
advertising, it can have a large impact on the business and thereby push the
satisfaction level in a positive direction.