Project: Advanced Statistics: ANOVA, EDA and PCA
1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually
1.2 Perform one-way ANOVA for Education with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
1.3 Perform one-way ANOVA for variable Occupation with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out which class means are significantly different. Interpret the result.
1.5 What is the interaction between the two treatments? Analyze the effects of one variable on the other (Education and Occupation) with the help of an interaction plot.
1.6 Perform a two-way ANOVA based on the Education and Occupation (along with their interaction Education*Occupation) with the variable 'Salary'. State the null and alternative hypotheses and state your results. How will you interpret this result?
1.7 Explain the business implications of performing ANOVA for this particular case study.
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
2.3 Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data].
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only).
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]
ANOVA is a technique that belongs to the domain of "Experimental Designs". It helps in establishing, in a precise way, the cause-effect relation between variables. From the statistical inference point of view, ANOVA is an extension of the independent-samples t-test for testing the equality of two population means. When more than two population means have to be compared, the ANOVA technique is used. In this case, the null hypothesis (H0) is defined as

H0: µ1 = µ2 = µ3 = ... = µk

for testing the equality of population means for k populations, where µi denotes the mean of the i-th population.
In this work, an analysis of salary data has been performed and the results and business insights drawn are
listed.
Problem 1A:
Salary is hypothesized to depend on educational qualification and occupation. To
understand the dependency, the salaries of 40 individuals [SalaryData.csv] are collected
and each person’s educational qualification and occupation are noted. Educational
qualification is at three levels, High school graduate, Bachelor, and Doctorate.
Occupation is at four levels, Administrative and clerical, Sales, Professional or specialty,
and Executive or managerial. A different number of observations are in each level of
education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality assumption
may not always hold if the sample size is small.]
1. State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.
One way ANOVA (Education)
Null Hypothesis H0: The mean salary is the same across all the 3 categories of education (Doctorate, Bachelors, HS-Grad).
Alternate Hypothesis H1: The mean salary is different in at least one category of education.
One way ANOVA (Occupation)
Null Hypothesis H0: The mean salary is the same across all the 4 categories of occupation (Prof-Specialty, Sales, Adm-clerical, Exec-Managerial).
Alternate Hypothesis H1: The mean salary is different in at least one category of occupation.
2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
Since the p value = 1.257709e-08 is less than the significance level (alpha = 0.05), we
can reject the null hypothesis and conclude that there is a significant difference in the
mean salaries for at least one category of education.
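The one-way ANOVA above can be sketched with scipy's `f_oneway`. The per-group salary samples below are hypothetical stand-ins (the real analysis uses SalaryData.csv); only the shape of the call and the decision rule carry over.

```python
from scipy.stats import f_oneway

# hypothetical salary samples per education level (toy stand-in for SalaryData.csv)
doctorate = [180, 175, 190, 210, 205]
bachelors = [100, 95, 110, 105, 120]
hs_grad = [40, 45, 50, 42, 48]

# one-way ANOVA: does at least one group mean differ?
f_stat, p_value = f_oneway(doctorate, bachelors, hs_grad)
if p_value < 0.05:
    print("Reject H0: at least one education level has a different mean salary")
```

On the real data the same comparison of `p_value` against alpha = 0.05 drives the accept/reject decision.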
3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
                 df        sum_sq       mean_sq         F    PR(>F)
C(Occupation)   3.0  1.125878e+10  3.752928e+09  0.884144  0.458508
Residual       36.0  1.528092e+11  4.244701e+09       NaN       NaN
Since the p value = 0.458508 is greater than the significance level (alpha = 0.05), we
fail to reject the null hypothesis (i.e. we accept H0) and conclude that there is no
significant difference in the mean salaries across the 4 categories of occupation.
5. What is the interaction between the two treatments? Analyze the effects of one variable on
the other (Education and Occupation) with the help of an interaction plot. [hint: use the
'pointplot' function from the 'seaborn' library]
We analyze the effects of one variable on the other (Education and Occupation) with the
help of an interaction plot.
The interaction plot shows that there is a significant amount of interaction between the two variables, Education and Occupation.
The following are some of the observations from the interaction plot:
· People with HS-grad education do not reach the position of Exec-managerial.
· People with education as Bachelors and occupation as Prof-Specialty earn less than people with education as Doctorate in the same occupation.
· People with education as Bachelors and occupation Sales earn higher than people with education as Doctorate and occupation Exec-Managerial.
· Salespeople with Bachelors or Doctorate education earn about the same salaries, and earn higher than people with education as HS-grad.
· Adm-clerical people with education as HS-grad earn the lowest salaries.
· Prof-Specialty people with education as Doctorate earn the maximum salaries.
· There are no people with education as HS-grad who hold an Exec-managerial occupation.
· People with education as Bachelors and occupation Sales or Exec-Managerial earn comparable salaries.
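An interaction plot of the kind described can be produced with seaborn's `pointplot`, as the hint suggests. The DataFrame below is a small hypothetical stand-in for SalaryData.csv:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# toy stand-in for SalaryData.csv
df = pd.DataFrame({
    "Education": ["HS-grad", "Bachelors", "Doctorate"] * 4,
    "Occupation": ["Sales"] * 6 + ["Adm-clerical"] * 6,
    "Salary": [40, 75, 80, 42, 78, 82, 35, 60, 95, 36, 62, 97],
})

# one line per education level; non-parallel lines suggest interaction
sns.pointplot(data=df, x="Occupation", y="Salary", hue="Education")
plt.savefig("interaction_plot.png")
```

Parallel lines in such a plot indicate no interaction; crossing or diverging lines, as observed in the report, indicate an interaction between the treatments.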
6. Perform a two-way ANOVA based on Salary with respect to both Education and
Occupation (along with their interaction Education*Occupation). State the null and
alternative hypotheses and state your results. How will you interpret this result?
H0: The effect of the independent variable 'education' on the mean salary does not
depend on the effect of the other independent variable 'occupation' (i.e. there is no
interaction effect between the two).
H1: There is an interaction effect between the independent variable 'education' and the
independent variable 'occupation' on the mean salary.

                                  PR(>F)
C(Education)                5.466264e-12
C(Occupation)               7.211580e-02
C(Education):C(Occupation)  2.232500e-05
Residual                             NaN

From the table, we see that there is a significant amount of interaction between Education and Occupation. As the p value = 2.232500e-05 for the interaction term is less than the significance level (alpha = 0.05), we reject the null hypothesis.
Thus, we see that there is an interaction effect between education and occupation on the
mean salary.
7. Explain the business implications of performing ANOVA for this particular case study.
The following are the key takeaways:
· An employee's salary is significantly dependent on education. Although occupation on its own is less significant for salary, there is some level of interaction between occupation and education that affects salary.
· It is noteworthy that for a few occupations a higher salary may be awarded to a Bachelors degree holder than to their Doctorate counterparts. This points to an important shortcoming of the dataset, which reduces the accuracy of the tests and analyses performed: other important independent variables that impact salary, such as work experience, specialization/domain and industry, are not captured.
· Needless to say, on average a Doctorate would probably earn a higher salary than Bachelors and HS-grads. However, being a Doctorate may not necessarily mean a significantly higher salary in every occupation.
· The above point draws an important inference: a Doctorate graduate may not be highly preferred for a job role and may be considered over-qualified, which can result in at-par or even lower salaries.
· Hence, HR professionals may need a more comprehensive approach towards setting salary bands, as similar job titles in different industries also command varying salaries.
· The ANOVA test does indicate that occupation coupled with higher educational qualification has a significant impact on salary, even though occupation type/level alone does not.
Problem 2:
The dataset Education - Post 12th Standard.csv contains information on various colleges.
You are expected to do a Principal Component Analysis for this case study according to
the instructions given. The data dictionary of the 'Education - Post 12th Standard.csv'
can be found in the following file: Data Dictionary.xlsx.
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed].
What insight do you draw from the EDA?
Univariate Analysis
Univariate analysis helps us to understand the distribution of data in the dataset. With it we can
find patterns and summarize the data one variable at a time.
APPS
The boxplot of the Apps variable shows outliers, and the distribution of the data is skewed. We can also
see that most colleges or universities receive applications in the range of 3,000 to 5,000; the maximum
number of applications is around 50,000.
For the univariate analysis of Apps we use a boxplot and a dist plot to find information or patterns in the data.
From the boxplot we can clearly see that there are outliers in this variable.
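The boxplot-plus-distplot pattern used throughout this section can be sketched as follows, here on a synthetic right-skewed column standing in for Apps:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
apps = rng.lognormal(8, 0.8, 500)  # right-skewed toy stand-in for the Apps column

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.boxplot(apps)
ax1.set_title("Boxplot of Apps")
ax2.hist(apps, bins=30)
ax2.set_title("Distribution of Apps")
fig.savefig("apps_univariate.png")
```

For a right-skewed variable like this, the mean sits well above the median, which is what the long upper whisker and outlier points in the boxplot reflect.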
Accept
The Accept variable seems to have outliers. The dist plot shows that for the majority of
universities the number of accepted applications is in the range of 70 to 1,500. The Accept
variable seems to be right skewed.
Enroll
The boxplot of the Enroll variable also has outliers. The distribution of the data is positively skewed.
From the dist plot we understand that the majority of colleges have enrolled students in the range of 200
to 500.
Top 10 Perc
The boxplot of the students from the top 10 percent of their higher-secondary class seems to have outliers.
The distribution seems to be positively skewed. A good amount of the intake, about 30 to 50,
is from the top 10 percent of the higher-secondary class.
Top 25 Perc
The boxplot for the Top 25 Perc variable has no outliers. The distribution is almost normal.
A majority of students are from the top 25% of their higher-secondary class.
Fulltime undergraduate
The boxplot of the Full Time Undergraduate variable has outliers. The distribution of the
data is positively skewed; most colleges have full-time undergraduates in the range of about 3,000 to 5,000.
Parttime undergraduate
The boxplot of the Part Time Undergraduate variable has outliers. The distribution of the data is
positively skewed; most colleges have part-time undergraduates in the range of about 1,000 to 3,000.
Outstate
The boxplot of Outstate has only one outlier. The distribution of the data is almost normal.
Room Board
The boxplot of Room Board has a few outliers. The distribution of the data is normal.
Books
The boxplot of Books has outliers. The distribution of the data seems to be bimodal. The cost
of books per student is mostly in the range of 500 to 1,000.
Personal
The boxplot of the Personal expense variable has outliers. The distribution seems to be positively skewed.
The personal expenses of a few students are much higher than the others.
PHD
The boxplot of PhD has outliers. The distribution seems to be negatively skewed.
TERMINAL
The boxplot of Terminal has outliers. The distribution seems to be negatively skewed.
S F RATIO
The boxplot of S F Ratio has outliers. The distribution of the data is almost normal.
The student-faculty ratio is almost the same across all the universities.
PERC ALUMNI
The boxplot of the percentage-of-alumni variable has outliers. The distribution of the data is
normal.
EXPENDITURE
The boxplot of the Expenditure variable also has outliers. The distribution of the data is positively
skewed.
GRAD RATE
The boxplot of Graduation Rate has outliers. The distribution of the data is normal.
The graduation rate in most universities is above 60%.
MULTIVARIATE ANALYSIS
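The multivariate step described later (finding groups of highly correlated variables) can be sketched by computing the pairwise correlation matrix and flagging strong pairs. The columns below are synthetic stand-ins for the real dataset's features:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
apps = rng.normal(3000, 1000, 200)
accept = 0.7 * apps + rng.normal(0, 100, 200)  # strongly tied to Apps, as in the data
books = rng.normal(550, 100, 200)              # roughly independent
df = pd.DataFrame({"Apps": apps, "Accept": accept, "Books": books})

corr = df.corr()
# list variable pairs whose absolute correlation exceeds 0.8
pairs = [(a, b) for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:] if abs(corr.loc[a, b]) > 0.8]
print(pairs)  # → [('Apps', 'Accept')]
```

A heatmap or pairplot of `corr` is the usual visual companion to this table; the high-correlation pairs are exactly the multicollinearity that PCA later removes.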
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
Yes. PCA is driven by the variances of the features, and the features in this dataset are on very different scales, so without scaling the high-variance variables would dominate the principal components. The StandardScaler assumes the data is normally distributed within each feature and scales it such that the distribution is centred around 0 with a standard deviation of 1:

Z = (value - mean) / standard deviation

The Z score tells us how many standard deviations a point is away from the mean, and in which direction. Scaling is one of the most important steps to follow before fitting such models.
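The scaling step can be sketched with sklearn's StandardScaler; the two synthetic columns below stand in for differently scaled features such as Apps and Books:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# two toy columns on very different scales
X = rng.normal(loc=[3000, 550], scale=[1000, 100], size=(500, 2))

scaled = StandardScaler().fit_transform(X)
print(np.allclose(scaled.mean(axis=0), 0))  # each column centred at 0
print(np.allclose(scaled.std(axis=0), 1))   # each column has unit standard deviation
```

After this transform every feature contributes on an equal footing to the covariance matrix that PCA decomposes.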
2.3 Comment on the comparison between the covariance and the correlation matrices from this data [on
scaled data].
Both the covariance matrix and the correlation matrix measure the relationship and the dependency between two variables.
Covariance indicates the direction of the linear relationship between the variables, whether positive or
negative, i.e. whether they move together (directly proportional) or in opposite directions (inversely proportional).
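The relationship between the two matrices on scaled data can be checked numerically. The sketch below uses synthetic z-scored columns; n = 777 is an inference from the diagonal value 1.00128866 printed below, which equals n/(n-1) for n = 777:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 777  # inferred row count of the Education dataset
X = rng.normal(size=(n, 3))
X[:, 1] += 0.5 * X[:, 0]  # induce some correlation

# z-score with population std (ddof=0), as StandardScaler does
Z = (X - X.mean(axis=0)) / X.std(axis=0)

cov = np.cov(Z, rowvar=False)       # sample covariance, ddof=1
corr = np.corrcoef(Z, rowvar=False)
# on standardized data they differ only by the factor n/(n-1)
print(np.allclose(cov, corr * n / (n - 1)))  # → True
print(cov[0, 0])  # ≈ 1.00128866, the diagonal value seen in the matrix below
```

This is why the "covariance matrix" printed below looks like a correlation matrix with diagonal entries of 1.00128866 rather than exactly 1.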
Covariance Matrix
[[ 1.00128866 0.94466636 0.84791332 0.33927032 0.35209304 0.81554018
0.3987775 0.05022367 0.16515151 0.13272942 0.17896117 0.39120081
0.36996762 0.09575627 -0.09034216 0.2599265 0.14694372]
[ 0.94466636 1.00128866 0.91281145 0.19269493 0.24779465 0.87534985
0.44183938 -0.02578774 0.09101577 0.11367165 0.20124767 0.35621633
0.3380184 0.17645611 -0.16019604 0.12487773 0.06739929]
[ 0.84791332 0.91281145 1.00128866 0.18152715 0.2270373 0.96588274
0.51372977 -0.1556777 -0.04028353 0.11285614 0.28129148 0.33189629
0.30867133 0.23757707 -0.18102711 0.06425192 -0.02236983]
[ 0.33927032 0.19269493 0.18152715 1.00128866 0.89314445 0.1414708
-0.10549205 0.5630552 0.37195909 0.1190116 -0.09343665 0.53251337
0.49176793 -0.38537048 0.45607223 0.6617651 0.49562711]
[ 0.35209304 0.24779465 0.2270373 0.89314445 1.00128866 0.19970167
-0.05364569 0.49002449 0.33191707 0.115676 -0.08091441 0.54656564
0.52542506 -0.29500852 0.41840277 0.52812713 0.47789622]
[ 0.81554018 0.87534985 0.96588274 0.1414708 0.19970167 1.00128866
0.57124738 -0.21602002 -0.06897917 0.11569867 0.31760831 0.3187472
0.30040557 0.28006379 -0.22975792 0.01867565 -0.07887464]
[ 0.3987775 0.44183938 0.51372977 -0.10549205 -0.05364569 0.57124738
1.00128866 -0.25383901 -0.06140453 0.08130416 0.32029384 0.14930637
0.14208644 0.23283016 -0.28115421 -0.08367612 -0.25733218]
[ 0.05022367 -0.02578774 -0.1556777 0.5630552 0.49002449 -0.21602002
-0.25383901 1.00128866 0.65509951 0.03890494 -0.29947232 0.38347594
0.40850895 -0.55553625 0.56699214 0.6736456 0.57202613]
[ 0.16515151 0.09101577 -0.04028353 0.37195909 0.33191707 -0.06897917
-0.06140453 0.65509951 1.00128866 0.12812787 -0.19968518 0.32962651
0.3750222 -0.36309504 0.27271444 0.50238599 0.42548915]
[ 0.13272942 0.11367165 0.11285614 0.1190116 0.115676 0.11569867
0.08130416 0.03890494 0.12812787 1.00128866 0.17952581 0.0269404
0.10008351 -0.03197042 -0.04025955 0.11255393 0.00106226]
[ 0.17896117 0.20124767 0.28129148 -0.09343665 -0.08091441 0.31760831
0.32029384 -0.29947232 -0.19968518 0.17952581 1.00128866 -0.01094989
-0.03065256 0.13652054 -0.2863366 -0.09801804 -0.26969106]
[ 0.39120081 0.35621633 0.33189629 0.53251337 0.54656564 0.3187472
0.14930637 0.38347594 0.32962651 0.0269404 -0.01094989 1.00128866
0.85068186 -0.13069832 0.24932955 0.43331936 0.30543094]
[ 0.36996762 0.3380184 0.30867133 0.49176793 0.52542506 0.30040557
0.14208644 0.40850895 0.3750222 0.10008351 -0.03065256 0.85068186
1.00128866 -0.16031027 0.26747453 0.43936469 0.28990033]
[ 0.09575627 0.17645611 0.23757707 -0.38537048 -0.29500852 0.28006379
0.23283016 -0.55553625 -0.36309504 -0.03197042 0.13652054 -0.13069832
-0.16031027 1.00128866 -0.4034484 -0.5845844 -0.30710565]
[-0.09034216 -0.16019604 -0.18102711 0.45607223 0.41840277 -0.22975792
-0.28115421 0.56699214 0.27271444 -0.04025955 -0.2863366 0.24932955
0.26747453 -0.4034484 1.00128866 0.41825001 0.49153016]
[ 0.2599265 0.12487773 0.06425192 0.6617651 0.52812713 0.01867565
-0.08367612 0.6736456 0.50238599 0.11255393 -0.09801804 0.43331936
0.43936469 -0.5845844 0.41825001 1.00128866 0.39084571]
[ 0.14694372 0.06739929 -0.02236983 0.49562711 0.47789622 -0.07887464
-0.25733218 0.57202613 0.42548915 0.00106226 -0.26969106 0.30543094
0.28990033 -0.30710565 0.49153016 0.39084571 1.00128866]]
Correlation measures both the strength and the direction of the linear relationship between two variables; unlike covariance, it is standardized to lie between -1 and +1. On the scaled data the two matrices are essentially identical: the covariance matrix above differs from the correlation matrix only by the factor n/(n-1) coming from the sample-covariance denominator, which is why its diagonal entries are 1.00128866 instead of exactly 1.
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here? [Please do not
treat Outliers unless specifically asked to do so]
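The before/after comparison can be sketched by counting IQR-rule outliers in a column before and after z-scoring. The column below is a synthetic skewed stand-in; the point it illustrates is that scaling is a monotone affine transform, so it relocates values but does not remove outliers:

```python
import numpy as np

def iqr_outlier_count(x):
    """Count points outside the 1.5*IQR whisker fences."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((x < lo) | (x > hi)))

rng = np.random.default_rng(2)
apps = rng.lognormal(8, 0.8, 777)           # skewed toy column
scaled = (apps - apps.mean()) / apps.std()  # z-score scaling

# the same observations are flagged before and after scaling
print(iqr_outlier_count(apps) == iqr_outlier_count(scaled))  # → True
```

This is the insight the question is after: the boxplots before and after scaling show the same outlier pattern, only on a different axis scale.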
2.5 Extract the eigenvalues and eigenvectors.[Using Sklearn PCA Print Both]
Eigen Vectors
[[-2.48765602e-01 3.31598227e-01 -6.30921033e-02 2.81310530e-01
-5.74140964e-03 -1.62374420e-02 -4.24863486e-02 -1.03090398e-01
-9.02270802e-02 5.25098025e-02 -3.58970400e-01 4.59139498e-01
-4.30462074e-02 1.33405806e-01 -8.06328039e-02 -5.95830975e-01
2.40709086e-02]
[-2.07601502e-01 3.72116750e-01 -1.01249056e-01 2.67817346e-01
-5.57860920e-02 7.53468452e-03 -1.29497196e-02 -5.62709623e-02
-1.77864814e-01 4.11400844e-02 5.43427250e-01 -5.18568789e-01
5.84055850e-02 -1.45497511e-01 -3.34674281e-02 -2.92642398e-01
-1.45102446e-01]
[-1.76303592e-01 4.03724252e-01 -8.29855709e-02 1.61826771e-01
5.56936353e-02 -4.25579803e-02 -2.76928937e-02 5.86623552e-02
-1.28560713e-01 3.44879147e-02 -6.09651110e-01 -4.04318439e-01
6.93988831e-02 2.95896092e-02 8.56967180e-02 4.44638207e-01
1.11431545e-02]
[-3.54273947e-01 -8.24118211e-02 3.50555339e-02 -5.15472524e-02
3.95434345e-01 -5.26927980e-02 -1.61332069e-01 -1.22678028e-01
3.41099863e-01 6.40257785e-02 1.44986329e-01 -1.48738723e-01
8.10481404e-03 6.97722522e-01 1.07828189e-01 -1.02303616e-03
3.85543001e-02]
[-3.44001279e-01 -4.47786551e-02 -2.41479376e-02 -1.09766541e-01
4.26533594e-01 3.30915896e-02 -1.18485556e-01 -1.02491967e-01
4.03711989e-01 1.45492289e-02 -8.03478445e-02 5.18683400e-02
2.73128469e-01 -6.17274818e-01 -1.51742110e-01 -2.18838802e-02
-8.93515563e-02]
[-1.54640962e-01 4.17673774e-01 -6.13929764e-02 1.00412335e-01
4.34543659e-02 -4.34542349e-02 -2.50763629e-02 7.88896442e-02
-5.94419181e-02 2.08471834e-02 4.14705279e-01 5.60363054e-01
8.11578181e-02 9.91640992e-03 5.63728817e-02 5.23622267e-01
5.61767721e-02]
[-2.64425045e-02 3.15087830e-01 1.39681716e-01 -1.58558487e-01
-3.02385408e-01 -1.91198583e-01 6.10423460e-02 5.70783816e-01
5.60672902e-01 -2.23105808e-01 -9.01788964e-03 -5.27313042e-02
-1.00693324e-01 2.09515982e-02 -1.92857500e-02 -1.25997650e-01
-6.35360730e-02]
[-2.94736419e-01 -2.49643522e-01 4.65988731e-02 1.31291364e-01
-2.22532003e-01 -3.00003910e-02 1.08528966e-01 9.84599754e-03
-4.57332880e-03 1.86675363e-01 -5.08995918e-02 1.01594830e-01
-1.43220673e-01 3.83544794e-02 3.40115407e-02 1.41856014e-01
-8.23443779e-01]
[-2.49030449e-01 -1.37808883e-01 1.48967389e-01 1.84995991e-01
-5.60919470e-01 1.62755446e-01 2.09744235e-01 -2.21453442e-01
2.75022548e-01 2.98324237e-01 -1.14639620e-03 -2.59293381e-02
3.59321731e-01 3.40197083e-03 5.84289756e-02 6.97485854e-02
3.54559731e-01]
[-6.47575181e-02 5.63418434e-02 6.77411649e-01 8.70892205e-02
1.27288825e-01 6.41054950e-01 -1.49692034e-01 2.13293009e-01
-1.33663353e-01 -8.20292186e-02 -7.72631963e-04 2.88282896e-03
-3.19400370e-02 -9.43887925e-03 6.68494643e-02 -1.14379958e-02
-2.81593679e-02]
[ 4.25285386e-02 2.19929218e-01 4.99721120e-01 -2.30710568e-01
2.22311021e-01 -3.31398003e-01 6.33790064e-01 -2.32660840e-01
-9.44688900e-02 1.36027616e-01 1.11433396e-03 -1.28904022e-02
1.85784733e-02 -3.09001353e-03 -2.75286207e-02 -3.94547417e-02
-3.92640266e-02]
[-3.18312875e-01 5.83113174e-02 -1.27028371e-01 -5.34724832e-01
-1.40166326e-01 9.12555212e-02 -1.09641298e-03 -7.70400002e-02
-1.85181525e-01 -1.23452200e-01 -1.38133366e-02 2.98075465e-02
-4.03723253e-02 -1.12055599e-01 6.91126145e-01 -1.27696382e-01
2.32224316e-02]
[-3.17056016e-01 4.64294477e-02 -6.60375454e-02 -5.19443019e-01
-2.04719730e-01 1.54927646e-01 -2.84770105e-02 -1.21613297e-02
-2.54938198e-01 -8.85784627e-02 -6.20932749e-03 -2.70759809e-02
5.89734026e-02 1.58909651e-01 -6.71008607e-01 5.83134662e-02
1.64850420e-02]
[ 1.76957895e-01 2.46665277e-01 -2.89848401e-01 -1.61189487e-01
7.93882496e-02 4.87045875e-01 2.19259358e-01 -8.36048735e-02
2.74544380e-01 4.72045249e-01 2.22215182e-03 -2.12476294e-02
-4.45000727e-01 -2.08991284e-02 -4.13740967e-02 1.77152700e-02
-1.10262122e-02]
[-2.05082369e-01 -2.46595274e-01 -1.46989274e-01 1.73142230e-02
2.16297411e-01 -4.73400144e-02 2.43321156e-01 6.78523654e-01
-2.55334907e-01 4.22999706e-01 1.91869743e-02 3.33406243e-03
1.30727978e-01 -8.41789410e-03 2.71542091e-02 -1.04088088e-01
1.82660654e-01]
[-3.18908750e-01 -1.31689865e-01 2.26743985e-01 7.92734946e-02
-7.59581203e-02 -2.98118619e-01 -2.26584481e-01 -5.41593771e-02
-4.91388809e-02 1.32286331e-01 3.53098218e-02 -4.38803230e-02
-6.92088870e-01 -2.27742017e-01 -7.31225166e-02 9.37464497e-02
3.25982295e-01]
[-2.52315654e-01 -1.69240532e-01 -2.08064649e-01 2.69129066e-01
1.09267913e-01 2.16163313e-01 5.59943937e-01 -5.33553891e-03
4.19043052e-02 -5.90271067e-01 1.30710024e-02 -5.00844705e-03
-2.19839000e-01 -3.39433604e-03 -3.64767385e-02 6.91969778e-02
1.22106697e-01]]
Eigen Values
[5.45052162 4.48360686 1.17466761 1.00820573 0.93423123 0.84849117
0.6057878 0.58787222 0.53061262 0.4043029 0.02302787 0.03672545
0.31344588 0.08802464 0.1439785 0.16779415 0.22061096]
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with
the original features
from sklearn.decomposition import PCA
pca = PCA(n_components=17)
pca.fit(scaled_df)            # scaled_df is the standardized data from 2.2
pca.explained_variance_       # eigenvalues
pca.components_               # eigenvectors
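Exporting the components into a labelled data frame can be sketched as below. The feature names are an illustrative subset of the real columns, and the data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
features = ["Apps", "Accept", "Enroll", "Top10perc"]  # subset of the real columns
X = rng.normal(size=(100, len(features)))

pca = PCA(n_components=len(features))
pca.fit(X)

# one row per principal component, one column per original feature
loadings = pd.DataFrame(pca.components_, columns=features,
                        index=[f"PC{i + 1}" for i in range(len(features))])
print(loadings.shape)  # → (4, 4)
```

With the full dataset the same frame has 17 rows and 17 columns, and each row gives the loadings of one PC on the original features.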
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only).
[hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
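Reading off the first eigenvector from the output in 2.5, assuming the eigenvectors are stored column-wise (as np.linalg.eig returns them) and that the feature order follows the data dictionary, the first column rounded to two decimals gives (the overall sign of an eigenvector is arbitrary):

```latex
\begin{aligned}
PC_1 ={}& -0.25\,\mathrm{Apps} - 0.21\,\mathrm{Accept} - 0.18\,\mathrm{Enroll}
          - 0.35\,\mathrm{Top10perc} - 0.34\,\mathrm{Top25perc} \\
        & - 0.15\,\mathrm{F.Undergrad} - 0.03\,\mathrm{P.Undergrad}
          - 0.29\,\mathrm{Outstate} - 0.25\,\mathrm{Room.Board} - 0.06\,\mathrm{Books} \\
        & + 0.04\,\mathrm{Personal} - 0.32\,\mathrm{PhD} - 0.32\,\mathrm{Terminal}
          + 0.18\,\mathrm{S.F.Ratio} - 0.21\,\mathrm{perc.alumni} \\
        & - 0.32\,\mathrm{Expend} - 0.25\,\mathrm{Grad.Rate}
\end{aligned}
```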
The plot visually shows how much of the variance is explained by how many principal components.
In the plot we see that the 1st PC explains 33.13% of the variance, the first two PCs together explain 57.19%, and so on.
Effectively we can get the material variance explained (i.e. 90%) by analysing 9 principal components instead of all of the 17
variables (attributes) in the dataset.
PCA uses the eigenvectors of the covariance matrix to figure out how we should rotate the data. Because rotation is a kind of linear
transformation, the new dimensions are weighted sums of the old ones. The eigenvectors (principal components) determine the
directions, or axes, along which the linear transformation acts, stretching or compressing input vectors. They are the lines of change
that represent the action of the matrix, the very "line" in linear transformation.
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum
number of principal components? What do the eigenvectors indicate?
import numpy as np
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)   # cov_matrix from 2.3
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
cum_var_exp
Check the cumulative variance up to 90% and note the number of components at which it is reached.
In addition, the incremental variance contributed by each further component should not be less than 5%.
By that rule we can set the optimum number of principal components at 6; beyond this, the incremental
value is less than 5%.
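The cumulative-variance criterion can be checked directly from the eigenvalues printed in 2.5 (listed there in the unsorted order np.linalg.eig returns them), and it reproduces the figure of 9 components for 90% quoted earlier:

```python
import numpy as np

# eigenvalues as printed in section 2.5
eig_vals = np.array([5.45052162, 4.48360686, 1.17466761, 1.00820573, 0.93423123,
                     0.84849117, 0.6057878, 0.58787222, 0.53061262, 0.4043029,
                     0.02302787, 0.03672545, 0.31344588, 0.08802464, 0.1439785,
                     0.16779415, 0.22061096])

# sort descending, convert to percentage of total variance, accumulate
var_exp = np.sort(eig_vals)[::-1] / eig_vals.sum() * 100
cum_var_exp = np.cumsum(var_exp)

# first component count at which cumulative explained variance reaches 90%
n_for_90 = int(np.argmax(cum_var_exp >= 90)) + 1
print(n_for_90)  # → 9
```

The same `var_exp` array also drives the 5%-increment rule mentioned above: increments drop below 5% after the 6th component.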
PCA is performed and the components are exported into a data frame. After PCA, the multicollinearity is greatly reduced.
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may
PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]
The case study is about an education dataset which contains the names of various colleges and
universities along with their details.
To understand more about the dataset we performed univariate and multivariate analysis, which
gives us an understanding of the variables: the distribution of the data, its skewness and patterns,
and, from the multivariate analysis, the correlations between variables, which show that multiple
variables are highly correlated with each other.
Scaling standardizes the variables onto one scale. Outliers are imputed using IQR values, and once
they are imputed we can perform PCA.
Principal component analysis is used to reduce the multicollinearity between the variables.
Depending on the variance of the dataset we can reduce the number of PCA components; for this
business case 5 components capture the maximum variance of the dataset.
Using these components we can now work with a dataset in which multicollinearity is greatly reduced.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------