Sayan Pal Business Report Advance Statistics Assignment PDF
Problem 1A:
[Assume that the data follows a normal distribution. In reality, the normality assumption may not
always hold if the sample size is small.]
1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education
and Occupation individually.
Solution:
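Formulation of hypotheses for conducting one-way ANOVA for education qualification w.r.t. salary:
H0: The mean salary is the same across all education levels.
H1: The mean salary differs for at least one education level.
Formulation of hypotheses for conducting one-way ANOVA for occupation w.r.t. salary:
H0: The mean salary is the same across all occupations.
H1: The mean salary differs for at least one occupation.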
1.2 Perform one-way ANOVA for Education with respect to the variable ‘Salary’. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.
Solution:
To perform one-way ANOVA for education w.r.t. the variable ‘Salary’, we fit the ANOVA model in
the Jupyter notebook and generate the AOV table. We get the following output:
From the above table, we find that the p-value is less than 0.05; hence, we reject the null hypothesis
and conclude that the mean salary differs across education levels.
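The test described above can be sketched as follows. This is a minimal illustration, assuming scipy.stats.f_oneway rather than the notebook's actual call, and the salaries are made-up numbers, not the assignment data.

```python
from scipy.stats import f_oneway

# Hypothetical salaries grouped by education level (NOT the assignment data).
salary_by_education = {
    "HS-grad": [52000, 61000, 58000, 49500, 55000],
    "Bachelors": [98000, 105000, 91000, 110000, 99500],
    "Doctorate": [150000, 162000, 171000, 158000, 149000],
}

# f_oneway tests H0: all group means are equal.
f_stat, p_value = f_oneway(*salary_by_education.values())

# Compare the p-value with the 5% significance level used in the report.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"F = {f_stat:.2f}, p = {p_value:.4g} -> {decision}")
```

The same call works for occupation by grouping the salaries by occupation instead of education.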
Advance Statistics Assignment
1.3 Perform one-way ANOVA for variable Occupation with respect to the variable ‘Salary’. State
whether the null hypothesis is accepted or rejected based on the ANOVA results.
Solution:
To perform one-way ANOVA for occupation w.r.t. the variable ‘Salary’, we fit the ANOVA model in
the Jupyter notebook and generate the AOV table. We get the following output:
From the above table, we find that the p-value is greater than 0.05; hence, we do not reject the null
hypothesis. There is no significant evidence that the mean salary differs across occupations.
Problem 1B:
1.5 What is the interaction between the two treatments? Analyze the effects of one variable on
the other (Education and Occupation) with the help of an interaction plot.
Solution:
As seen from the interaction plots below, there appears to be a moderate interaction between the two
categorical variables: the effect of education on salary is not the same for every occupation.
Adm-clerical and sales professionals with bachelor's and doctorate degrees earn nearly the same
salaries.
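An interaction plot draws the mean salary of each Education x Occupation cell, so the same information can be read from a table of cell means. The sketch below assumes pandas and uses hypothetical records (salaries in thousands); only the column names mirror the variables discussed above.

```python
import pandas as pd

# Hypothetical records (NOT the assignment data); salaries in thousands.
df = pd.DataFrame({
    "Education": ["Bachelors", "Bachelors", "Doctorate", "Doctorate"] * 2,
    "Occupation": ["Sales"] * 4 + ["Adm-clerical"] * 4,
    "Salary": [100, 110, 150, 160, 95, 105, 120, 130],
})

# Mean salary per Education x Occupation cell: the points an interaction plot draws.
cell_means = df.groupby(["Education", "Occupation"])["Salary"].mean().unstack()

# Education effect within each occupation. Unequal gaps mean the lines in
# the plot are not parallel, which is the visual sign of an interaction.
education_gap = cell_means.loc["Doctorate"] - cell_means.loc["Bachelors"]

print(cell_means)
print(education_gap)
```

If the gap were identical for every occupation, the lines would be parallel and there would be no sign of interaction.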
1.6 Perform a two-way ANOVA based on the Education and Occupation (along with their
interaction Education*Occupation) with the variable ‘Salary’. State the null and alternative
hypotheses and state your results. How will you interpret this result?
Solution:
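Formulation of hypotheses for conducting two-way ANOVA based on education and occupation w.r.t.
salary:
H0 (Education): The mean salary is the same across all education levels.
H0 (Occupation): The mean salary is the same across all occupations.
H0 (Interaction): There is no interaction effect between education and occupation on salary.
In each case, H1 states that the corresponding effect is present.
Considering both education and occupation, education is a significant factor because its p-value is
< 0.05, whereas occupation is not a significant factor because its p-value is > 0.05.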
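For intuition on what the two-way AOV table contains, the sums of squares can be computed by hand. The sketch below assumes a balanced design (equal observations per cell), which the assignment data may not satisfy, and uses made-up numbers, so its p-values will not match the report. It relies only on NumPy and scipy.stats.f.

```python
import numpy as np
from scipy.stats import f

# Hypothetical BALANCED data: salary[i, j, k] is the k-th observation for
# education level i and occupation j (2 x 2 cells, 3 observations each).
salary = np.array([
    [[50.0, 52.0, 54.0], [51.0, 53.0, 55.0]],
    [[80.0, 82.0, 84.0], [95.0, 97.0, 99.0]],
])
a, b, n = salary.shape

grand = salary.mean()
mean_a = salary.mean(axis=(1, 2))   # education-level means
mean_b = salary.mean(axis=(0, 2))   # occupation means
mean_ab = salary.mean(axis=2)       # cell means

# Sums of squares for the two main effects, the interaction, and the error.
ss_a = b * n * np.sum((mean_a - grand) ** 2)
ss_b = a * n * np.sum((mean_b - grand) ** 2)
ss_ab = n * np.sum((mean_ab - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_err = np.sum((salary - mean_ab[:, :, None]) ** 2)

df_a, df_b, df_ab, df_err = a - 1, b - 1, (a - 1) * (b - 1), a * b * (n - 1)
ms_err = ss_err / df_err

# F = (effect mean square) / (error mean square); p-value from the F distribution.
p_values = {}
for name, ss, dof in [("Education", ss_a, df_a),
                      ("Occupation", ss_b, df_b),
                      ("Education:Occupation", ss_ab, df_ab)]:
    f_stat = (ss / dof) / ms_err
    p_values[name] = f.sf(f_stat, dof, df_err)
    print(f"{name}: F = {f_stat:.2f}, p = {p_values[name]:.4g}")
```

Each effect is judged against the same 0.05 threshold as in the one-way case.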
1.7 Explain the business implications of performing ANOVA for this particular case study.
Solution:
By performing ANOVA on the given data set, we can conclude that salary depends significantly on
education, whereas occupation on its own does not show a significant effect on salary.
Problem 2:
The dataset Education - Post 12th Standard.csv contains information on various colleges. You are
expected to do a Principal Component Analysis for this case study according to the instructions
given.
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?
Solution:
First, after importing all the relevant libraries in the Jupyter notebook, we load the data set. Then, we
perform EDA to uncover patterns in the data.
The data set has a shape of (777, 18). We check the top 5 rows of the data set and then check
whether there are any missing values; as per the output, there are no missing values.
We then check the statistical summary of the data set, which is shown below.
The univariate output comprises 17*3 = 51 distinct charts in total. Hence, I have included the
screenshot for only one variable, i.e. Apps (please refer to the Python notebook for the rest).
Further, we perform multivariate analysis using the correlation function, which gives the output below.
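The EDA steps described above can be sketched with pandas as follows. The small data frame is a hypothetical stand-in, since the real file has 777 rows and 18 columns, and the column names are assumptions based on the variables named in this report.

```python
import pandas as pd

# Tiny hypothetical stand-in for the college data set.
df = pd.DataFrame({
    "Names": ["College A", "College B", "College C", "College D"],
    "Apps": [1660, 2186, 1428, 417],
    "Accept": [1232, 1924, 1097, 349],
    "Enroll": [721, 512, 336, 137],
})

print(df.shape)    # (rows, columns)
print(df.head())   # first rows of the data set

# Missing values per column.
missing = df.isnull().sum()

# Count, mean, std, min, quartiles and max for the numeric columns.
summary = df.describe()

# Pairwise Pearson correlations between the numeric columns.
corr = df.select_dtypes(include="number").corr()

print(missing)
print(summary)
print(corr)
```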
Insights:
The average (mean) number of applications received by the listed universities is around 3,001
The number of applications accepted ranges from 72 to 26,330
Average student enrolment is around 880
The median share of new students from the top 10% of their higher secondary class is 23%
Average book cost is around 550
The minimum student/faculty (S.F.) ratio is around 2.5
The average percentage of faculty with Ph.D.s is 72.6%
A considerable number of variables are highly correlated
“Apps” is highly correlated with “Accept” and “Enroll”
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
Solution:
Yes, it is necessary to perform scaling for PCA. For instance, in the given data set, applications and
several other variables take values in the thousands, whereas a few variables, such as the percentage
variables, have only two digits. Because these variables are measured on different scales, it is
difficult to compare them directly.
PCA calculates a new projection of the data set, and the new axes are based on the variance of the
variables. So a variable with a high standard deviation will carry a higher weight in the calculation
of the axes than a variable with a low standard deviation. After scaling, every variable has the same
standard deviation, so the variables contribute on an equal footing and can be compared easily.
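A minimal sketch of z-score scaling with NumPy is given below. The two columns are hypothetical values chosen to sit on very different scales, and the population standard deviation (ddof=0) is an assumption about the convention used.

```python
import numpy as np

# Hypothetical columns on very different scales:
# an application count and a two-digit percentage.
X = np.array([
    [1660.0, 23.0],
    [2186.0, 16.0],
    [1428.0, 22.0],
    [417.0, 60.0],
])

# z-score: subtract each column's mean and divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

print(X.std(axis=0))          # very different spreads before scaling
print(X_scaled.std(axis=0))   # unit spread for both columns after scaling
```

After this step every column has mean 0 and standard deviation 1, so no variable dominates PCA merely because of its units.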
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
Solution:
Before scaling, let’s plot a boxplot to check the outliers in all the variables. We get the following
output:
Post scaling, let’s plot a boxplot to check the outliers in all the variables. We get the following output:
Insights:
By scaling, all variables have the same standard deviation and therefore the same weight, so
PCA calculates axes that are not dominated by any single variable.
Before scaling, only one variable (Top25perc) had no outliers; after scaling, the boxplots share
a common axis and multiple variables appear to have negligible outliers. Scaling is a linear
transformation, so it changes the units of each variable rather than which observations lie
beyond the whiskers.
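Outliers can also be checked numerically with the 1.5 x IQR rule that boxplots commonly use for their whiskers. The sketch below uses a hypothetical skewed variable and a helper name, iqr_outlier_count, that is introduced here purely for illustration. Because z-score scaling is a linear transformation, the count is the same before and after scaling.

```python
import numpy as np

def iqr_outlier_count(values):
    """Count points lying more than 1.5 * IQR beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((values < lower) | (values > upper)))

# Hypothetical skewed variable with two extreme values.
apps = np.array([400.0, 650.0, 800.0, 900.0, 1100.0,
                 1300.0, 1500.0, 1800.0, 9000.0, 26000.0])

# z-score scaling of the same variable.
apps_scaled = (apps - apps.mean()) / apps.std()

before = iqr_outlier_count(apps)
after = iqr_outlier_count(apps_scaled)
print(before, after)
```

Any visual difference between the two sets of boxplots therefore comes from plotting all variables on one common axis, not from observations ceasing to be outliers.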
2.5 Perform PCA and export the data of the Principal Component scores into a data frame.
Solution:
We can perform PCA on the scaled data set by importing PCA from sklearn.decomposition.
We get the following component output:
After that, we load these components into a data frame along with the list of columns we had
earlier considered in df_num_scaled.
Below is a representative screenshot of df_pca_loading, into which we exported the principal
component scores.
Solution:
We can extract the eigenvalues and eigenvectors represented above from the covariance matrix.
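The extraction from the covariance matrix can be sketched with plain NumPy as follows. The data are randomly generated stand-ins (100 observations, 4 variables), not the college data, and this is one possible approach rather than necessarily the notebook's own code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 observations of 4 variables, two of them correlated.
X = rng.normal(size=(100, 4))
X[:, 1] += 0.8 * X[:, 0]
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the scaled data (variables are in columns).
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Re-order so that the largest eigenvalue (first PC) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # each column is one eigenvector

# Principal component scores: the data projected onto the eigenvectors.
scores = X_scaled @ eigenvectors

print(eigenvalues)
print(eigenvectors[:, 0])
```

The variance of each column of scores equals the corresponding eigenvalue, which is why eigenvalues are read as the variance explained by each component.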
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two
places of decimals only).
Solution:
[Output truncated in the source; only the trailing value survives: -2.73]
If we sort the eigenvectors in descending order of their eigenvalues, the first eigenvector accounts
for the largest spread in the data, the second one for the second largest spread, and so on.
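The explicit form of a principal component is a weighted sum of the scaled variables, with the eigenvector entries as weights. The sketch below formats such an expression to two decimal places; the three loadings are hypothetical placeholders, since the real first eigenvector has 17 entries.

```python
# Hypothetical loadings for three variables (NOT the values from the report).
columns = ["Apps", "Accept", "Enroll"]
pc1_loadings = [0.2549, -0.4012, 0.0871]

# One signed term per variable, rounded to two decimal places.
terms = [f"{weight:+.2f} * {name}" for weight, name in zip(pc1_loadings, columns)]
pc1_equation = "PC1 = " + " ".join(terms)

print(pc1_equation)
```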
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?
Solution:
From the screenshot below of the cumulative values of the eigenvalues, we can see that the first 8
principal components explain over 90% of the variance. Thus, the optimum number of principal
components can be taken as 8.
Furthermore, the eigenvectors indicate the directions of the principal components; we can multiply
the original data by the eigenvectors to re-orient the data onto the new axes.
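Choosing the number of components from cumulative eigenvalues can be sketched as below. The eigenvalues are hypothetical and already sorted in descending order, and the 90% threshold mirrors the cut-off used in this report.

```python
import numpy as np

# Hypothetical eigenvalues, sorted in descending order (NOT the report's values).
eigenvalues = np.array([5.0, 2.5, 1.0, 0.8, 0.4, 0.2, 0.1])

# Share of total variance explained by each component, then the running total.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.90
n_components = int(np.argmax(cumulative >= threshold)) + 1

print(np.round(cumulative, 2))
print(n_components)
```

Lowering the threshold to 0.80 in the same code would return a smaller number of components, which is the trade-off between compactness and retained variance.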
2.9 Explain the business implication of using the Principal Component Analysis for this case study.
How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components
Obtained]
Solution:
Each principal component captures the share of the total variance that can be explained along a
single new dimension of the data. As mentioned above, we have retained only 8 principal
components. These 8 components, which together represent more than 90% of the variance, can be
used for further analysis.
In this case study, we had 17 numeric variables to assess; with PCA, we reduced the dimensionality
from 17 to 8 (representing more than 90% of the variance).
However, we can see from the cumulative variance mentioned above that even 5 principal components
represent around 80% of the variance. To be on the safer side, we have chosen to go with 90% of the
variance.
Thus, the business implication of using PCA in this case is that we reduce a high-dimensional
space (with 17 variables) to a lower-dimensional space without (theoretically) losing much of the
explanatory power.
PC1: Explains out-of-state tuition and instructional expenditure per student
PC2: Represents the highly correlated variables such as Apps, Enroll and Accept
PC3: Highlights the estimated cost of books for a student
PC4: Represents the percentage of faculty with Ph.D.s and terminal degrees
PC5: Explains percentage of new students from top 10% and 25% of higher secondary class
including cost of room and board
PC6: Details about student/faculty ratio
PC7: Highlights estimated personal spending for a student and graduation rate
PC8: Explains number of part-time undergraduate students and alumni who donate