
Project: Advanced Statistics
ANOVA, EDA and PCA

This project analyzes salary data using one-way and two-way ANOVA, including an interaction plot and a test for an interaction effect between education level and occupation, and then performs exploratory data analysis and principal component analysis on a college dataset. The results and business implications of the analyses are discussed.
Contents

1 ANOVA test
1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually
1.2 Perform one-way ANOVA for Education with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results
1.3 Perform one-way ANOVA for the variable Occupation with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results
1.4 If the null hypothesis is rejected in either (1.2) or (1.3), find out which class means are significantly different. Interpret the result
1.5 What is the interaction between the two treatments? Analyze the effects of one variable on the other (Education and Occupation) with the help of an interaction plot
1.6 Perform a two-way ANOVA based on Education and Occupation (along with their interaction Education*Occupation) with the variable 'Salary'. State the null and alternative hypotheses and state your results. How will you interpret this result?
1.7 Explain the business implications of performing ANOVA for this particular case study

2 EDA and PCA
2.1 Perform exploratory data analysis [both univariate and multivariate analysis to be performed]
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling
2.3 Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data]
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
2.5 Extract the eigenvalues and eigenvectors [using sklearn PCA, print both]
2.6 Perform PCA and export the data of the principal component (eigenvectors) into a data frame with the original features
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors; use values with two places of decimals only)
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: write interpretations of the principal components obtained]
ANOVA is a technique which belongs to the domain of "Experimental Designs". It helps establish, in an exact way, the cause-effect relation between variables. From the statistical inference point of view, ANOVA is an extension of the independent-samples t test for testing the equality of two population means: when more than two population means have to be compared, the ANOVA technique is used. In this case, the null hypothesis (H0) for testing the equality of population means across k populations is

H0: µ1 = µ2 = µ3 = ... = µk

where µi denotes the mean of the i-th population.

In this work, an analysis of salary data has been performed, and the results and business insights drawn are listed.

Problem 1A:
Salary is hypothesized to depend on educational qualification and occupation. To
understand the dependency, the salaries of 40 individuals [SalaryData.csv] are collected
and each person’s educational qualification and occupation are noted. Educational
qualification is at three levels, High school graduate, Bachelor, and Doctorate.
Occupation is at four levels, Administrative and clerical, Sales, Professional or specialty,
and Executive or managerial. A different number of observations are in each level of
education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality assumption
may not always hold if the sample size is small.]

1. State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.

One-way ANOVA (Education)

Null hypothesis H0: The mean salary is the same across all 3 categories of education (Doctorate, Bachelors, HS-Grad).

Alternate hypothesis H1: The mean salary is different in at least one category of education.

One-way ANOVA (Occupation)

Null hypothesis H0: The mean salary is the same across all 4 categories of occupation (Prof-Specialty, Sales, Adm-clerical, Exec-Managerial).

Alternate hypothesis H1: The mean salary is different in at least one category of occupation.

2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null hypothesis is accepted or rejected based on the ANOVA results.

df sum_sq mean_sq F PR(>F)


C(Education) 2.0 1.026955e+11 5.134773e+10 30.95628 1.257709e-08
Residual 37.0 6.137256e+10 1.658718e+09 NaN NaN

The above is the ANOVA table for the Education variable.

Since the p value = 1.257709e-08 is less than the significance level (alpha = 0.05), we
can reject the null hypothesis and conclude that there is a significant difference in the
mean salaries for at least one category of education.
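A table like the one above can be produced with statsmodels; the following is a minimal sketch, assuming the data from SalaryData.csv has been loaded into a DataFrame df with columns 'Salary', 'Education' and 'Occupation' (the test in 1.3 is analogous, with C(Occupation) in the formula):

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('SalaryData.csv')                    # file name from the problem statement
model = ols('Salary ~ C(Education)', data=df).fit()   # C() treats Education as categorical
print(sm.stats.anova_lm(model, typ=1))                # df, sum_sq, mean_sq, F, PR(>F)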

3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
                df        sum_sq       mean_sq         F    PR(>F)
C(Occupation)  3.0  1.125878e+10  3.752928e+09  0.884144  0.458508
Residual      36.0  1.528092e+11  4.244701e+09       NaN       NaN

The above is the ANOVA table for the Occupation variable.

Since the p value = 0.458508 is greater than the significance level (alpha = 0.05), we
fail to reject the null hypothesis (i.e. we accept H0) and conclude that there is no
significant difference in the mean salaries across the 4 categories of occupation.

5. What is the interaction between the two treatments? Analyze the effects of one variable on the other (Education and Occupation) with the help of an interaction plot. [Hint: use the 'pointplot' function from the 'seaborn' library.]

We analyze the effects of one variable on the other (Education and Occupation) with the
help of an interaction plot.
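A minimal sketch of such an interaction plot with seaborn's pointplot, assuming the same DataFrame df as above:

import seaborn as sns
import matplotlib.pyplot as plt

# Mean Salary per Education level, one line per Occupation.
sns.pointplot(x='Education', y='Salary', hue='Occupation', data=df)
plt.title('Interaction plot: Education x Occupation')
plt.show()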
The interaction plot shows that there is a significant amount of interaction between the categorical variables Education and Occupation.

The following are some of the observations from the interaction plot:

· People with HS-grad education do not reach the Exec-managerial position; they hold only Adm-clerical, Sales and Prof-specialty occupations.

· People with Bachelors or Doctorate education in the Adm-clerical and Sales occupations earn almost the same salaries (in the range 170000 to 190000).

· People with Bachelors education in the Prof-specialty occupation earn less than people with Bachelors education in the Adm-clerical and Sales occupations.

· People with Bachelors education in Sales earn more than people with Bachelors education in Prof-specialty, whereas people with Doctorate education in Sales earn less than people with Doctorate education in Prof-specialty. We see a reversal in this part of the plot.

· Similarly, people with Bachelors education in Prof-specialty earn less than people with Bachelors education in Exec-managerial, whereas people with Doctorate education in Prof-specialty earn more than people with Doctorate education in Exec-managerial. There is a reversal in this part of the plot too.

· Salespeople with Bachelors or Doctorate education earn about the same salaries, and earn more than those with HS-grad education.

· Adm-clerical people with HS-grad education earn the lowest salaries compared to those with Bachelors or Doctorate education.

· Prof-specialty people with Doctorate education earn the maximum salaries, and people with HS-grad education earn the minimum.

· People with HS-grad education earn the minimum salaries overall.

· There are no people with HS-grad education holding the Exec-managerial occupation.

· People with Bachelors education in the Sales and Exec-managerial occupations earn the same salaries.

6. Perform a two-way ANOVA based on Salary with respect to both Education and Occupation (along with their interaction Education*Occupation). State the null and alternative hypotheses and state your results. How will you interpret this result?

Two-way ANOVA

H0: The effect of the independent variable 'Education' on the mean 'Salary' does not depend on the effect of the other independent variable 'Occupation' (i.e. there is no interaction effect between the two independent variables, Education and Occupation).

H1: There is an interaction effect between the independent variables 'Education' and 'Occupation' on the mean Salary.
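A minimal sketch of this two-way ANOVA with the interaction term, assuming df as before:

from statsmodels.formula.api import ols
import statsmodels.api as sm

model2 = ols('Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)',
             data=df).fit()
print(sm.stats.anova_lm(model2, typ=1))   # sequential (type-1) sums of squares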

Performing the two-way ANOVA gives the following table:


                              df        sum_sq       mean_sq          F        PR(>F)
C(Education)                 2.0  1.026955e+11  5.134773e+10  72.211958  5.466264e-12
C(Occupation)                3.0  5.519946e+09  1.839982e+09   2.587626  7.211580e-02
C(Education):C(Occupation)   6.0  3.634909e+10  6.058182e+09   8.519815  2.232500e-05
Residual                    29.0  2.062102e+10  7.110697e+08        NaN           NaN

From the table, we see that there is a significant amount of interaction between the variables Education and Occupation. As the p value = 2.232500e-05 for the interaction term is less than the significance level (alpha = 0.05), we reject the null hypothesis.

Thus, we see that there is an interaction effect between Education and Occupation on the mean salary.

7. Explain the business implications of performing ANOVA for this particular case study.

 Assuming the report is intended for the HR department of a company or an HR consulting firm, the following are the key takeaways: an employee's or graduate's salary is significantly dependent on their education compared to their occupation or job role.

 Given the statistical conclusion about the interaction effect of Education and Occupation on Salary, it is fair to say that despite Occupation's lower individual significance, the job role does have some level of impact on salary.

 It is noteworthy that for a few occupations a higher salary may be awarded to a bachelor's degree holder than to their doctorate counterparts. This points to an important shortcoming of the dataset, which reduces the accuracy of the tests and analyses performed: other important independent variables that affect salary, such as work experience, specialization/domain and industry, are not captured.

 Needless to say, on average a doctorate holder would probably earn a higher salary than bachelor's degree holders and HS-grads. However, being a doctorate holder does not necessarily mean a significantly higher salary than a bachelor's degree graduate/employee, as was observed in the interaction plot.

 The above point draws an important inference: a doctorate graduate may not be highly preferred for a job role and may be considered over-qualified, which results in a wage at par with, or not significantly higher than, that of a bachelor's degree holder.

 Hence, HR professionals may need a more comprehensive approach towards setting salary bands. Across industries, similar job titles demand varying salary packages depending on the job requirements/description. Nevertheless, work experience remains an important factor in deciding salary.

 The ANOVA test does indicate that occupation level coupled with higher educational qualification has a significant impact on salary, even though occupation type/level alone may not be a significant influencer compared to Education.

In summary, an employee's or graduate's salary is significantly dependent on their level of education compared to their occupation or job role.


Problem 2:
The given dataset consists of data points on various universities and colleges: the number of applications received, accepted and enrolled; the percentage of new students from the top 10% of their higher secondary class; the percentage of new students from the top 25% of their higher secondary class; the number of full-time undergraduates; the number of part-time undergraduates; out-of-state tuition; the cost of room and board; the estimated book costs for a student; the estimated personal spending for a student; the percentage of faculty with PhDs; the percentage of faculty with terminal degrees; the student/faculty ratio; the percentage of alumni who donate; the instructional expenditure per student; and the graduation rate.

Inference of the Dataset

The dataset has 777 rows and 18 columns.
All the columns are integer or float values; the Names column alone is categorical.
There are no duplicates in the dataset.
The dataset has no missing or null values, as the null counts below show:
Names 0
Apps 0
Accept 0
Enroll 0
Top10perc 0
Top25perc 0
F.Undergrad 0
P.Undergrad 0
Outstate 0
Room.Board 0
Books 0
Personal 0
PhD 0
Terminal 0
S.F.Ratio 0
perc.alumni 0
Expend 0
Grad.Rate 0
dtype: int64

The dataset Education - Post 12th Standard.csv contains information on various colleges.
You are expected to do a Principal Component Analysis for this case study according to
the instructions given. The data dictionary of the 'Education - Post 12th Standard.csv'
can be found in the following file: Data Dictionary.xlsx.

2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed].
What insight do you draw from the EDA?

Univariate Analysis

Univariate analysis helps us understand the distribution of the data in the dataset: with it we can find patterns and summarize the data for each variable.
APPS

The boxplot of the Apps variable shows outliers and the distribution of the data is skewed. Most colleges or universities receive applications in the range of 3,000 to 5,000; the maximum is around 50,000.

For the univariate analysis of Apps we use a boxplot and a distplot to find patterns in the data, and the boxplot clearly shows the outliers in the dataset.

Accept

The Accept variable also has outliers. The distplot shows that the majority of accepted applications per university lie in the range of 70 to 1,500. The Accept variable is right skewed.

Enroll

The boxplot of the Enroll variable also has outliers, and the distribution of the data is positively skewed. From the distplot we understand that the majority of colleges have enrolled students in the range of 200 to 500.

Top10perc

The boxplot of the percentage of students from the top 10% of their higher secondary class shows outliers, and the distribution is positively skewed. There is a good amount of intake, about 30 to 50 students, from the top 10% of the higher secondary class.

Top25perc

The boxplot for the top 25% has no outliers, and the distribution is almost normal. The majority of students are from the top 25% of their higher secondary class.

Full-Time Undergraduate

The boxplot of F.Undergrad has outliers and the distribution of the data is positively skewed. Most universities have about 3,000 to 5,000 full-time undergraduates.

Part-Time Undergraduate

The boxplot of P.Undergrad has outliers and the distribution of the data is positively skewed. Most universities have about 1,000 to 3,000 part-time undergraduates.

Outstate

The boxplot of Outstate has only one outlier, and the distribution of the data is almost normal.

Room.Board

The boxplot of Room.Board has a few outliers; the distribution of the data is normal.

Books

The boxplot of Books has outliers. The distribution of the data seems to be bimodal, and the cost of books per student seems to lie in the range of 500 to 1,000.

Personal

The boxplot of Personal expenses has outliers and the distribution is positively skewed; the personal expenses of a few students are much higher than the others'.

PhD

The boxplot of PhD has outliers, and the distribution is negatively skewed.

Terminal

The boxplot of Terminal has outliers, and the distribution is negatively skewed.

S.F.Ratio

The boxplot of S.F.Ratio has outliers, and the distribution of the data is almost normal; the student-faculty ratio is roughly the same across universities.

perc.alumni

The boxplot of the percentage of alumni who donate has outliers; the distribution of the data is normal.

Expend

The boxplot of the Expend variable also has outliers, and the distribution of the data is positively skewed.

Grad.Rate

The boxplot of the graduation rate has outliers; the distribution of the data is normal, and the graduation rate across universities is mostly above 60%.
MULTIVARIATE ANALYSIS

The pair plot helps us understand the relationship between all the numerical variables in the dataset. Comparing all the variables with each other lets us see the patterns and trends in the dataset.
HEATMAP
Heatmap showing correlation coefficient:
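A minimal sketch of such a heatmap, assuming the numeric columns are held in a DataFrame dx (Names dropped, as in 2.2):

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 9))
sns.heatmap(dx.corr(), annot=True, fmt='.2f', cmap='coolwarm')  # pairwise correlation coefficients
plt.show()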

Observation & Inference

 A few pairs have very high correlation, namely:
1. Applications & acceptances (Apps & Accept)
2. Students from the top 10% of their class & graduation rate
3. Faculty with terminal degrees & faculty with PhDs
4. Full-time undergraduates & enrollment (F.Undergrad & Enroll)
5. Students from the top 10% & top 25% of their class
 The heatmap exhibits the problem of multicollinearity, which can be observed in the significant number of highly correlated feature pairs. Multicollinearity is a problem because it undermines the statistical significance of an independent variable.

2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.

 The main objective of scaling or standardization is to normalize the data within a particular range. It is a preprocessing step applied to the independent variables of the data.
 Another benefit of scaling is that it speeds up subsequent calculations.

Before scaling, the Names variable, which is categorical, has been dropped, so the dataset consists of only numerical values. The Z-score method has been applied for this case study.

The dataset has 17 numerical columns with different scales.

The standard scaler assumes the data is normally distributed within each feature and scales it such that the distribution is centred around 0 with a standard deviation of 1.

from scipy.stats import zscore

# dx is the DataFrame of numeric columns (Names dropped).
dz = dx.apply(zscore)   # standardize every column to mean 0, std 1
dz.head()

Z = (value − mean) / standard deviation

The Z score tells us how many standard deviations a point is away from the mean, and in which direction. Scaling is one of the most important steps to perform before building models such as PCA.

2.3 Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data].

The commonality between the covariance and the correlation matrix is that both measure the relationship and dependency between two variables.

Covariance indicates the direction of the linear relationship between the variables, whether positive or negative; by direction we mean whether they are directly or inversely proportional.
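A minimal sketch of the comparison on the scaled data dz from 2.2 (numpy expects variables in rows, hence the transpose):

import numpy as np

cov_matrix = np.cov(dz.T)        # covariance matrix of the scaled data
corr_matrix = np.corrcoef(dz.T)  # correlation matrix of the scaled data
# On z-scored data the two matrices coincide up to the n/(n-1) sample-size
# factor, which is why the diagonal below reads 1.00128866 rather than 1.
print(np.allclose(cov_matrix, corr_matrix, atol=0.01))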

Covariance matrix (scaled data):
[[ 1.00128866 0.94466636 0.84791332 0.33927032 0.35209304 0.81554018
0.3987775 0.05022367 0.16515151 0.13272942 0.17896117 0.39120081
0.36996762 0.09575627 -0.09034216 0.2599265 0.14694372]
[ 0.94466636 1.00128866 0.91281145 0.19269493 0.24779465 0.87534985
0.44183938 -0.02578774 0.09101577 0.11367165 0.20124767 0.35621633
0.3380184 0.17645611 -0.16019604 0.12487773 0.06739929]
[ 0.84791332 0.91281145 1.00128866 0.18152715 0.2270373 0.96588274
0.51372977 -0.1556777 -0.04028353 0.11285614 0.28129148 0.33189629
0.30867133 0.23757707 -0.18102711 0.06425192 -0.02236983]
[ 0.33927032 0.19269493 0.18152715 1.00128866 0.89314445 0.1414708
-0.10549205 0.5630552 0.37195909 0.1190116 -0.09343665 0.53251337
0.49176793 -0.38537048 0.45607223 0.6617651 0.49562711]
[ 0.35209304 0.24779465 0.2270373 0.89314445 1.00128866 0.19970167
-0.05364569 0.49002449 0.33191707 0.115676 -0.08091441 0.54656564
0.52542506 -0.29500852 0.41840277 0.52812713 0.47789622]
[ 0.81554018 0.87534985 0.96588274 0.1414708 0.19970167 1.00128866
0.57124738 -0.21602002 -0.06897917 0.11569867 0.31760831 0.3187472
0.30040557 0.28006379 -0.22975792 0.01867565 -0.07887464]
[ 0.3987775 0.44183938 0.51372977 -0.10549205 -0.05364569 0.57124738
1.00128866 -0.25383901 -0.06140453 0.08130416 0.32029384 0.14930637
0.14208644 0.23283016 -0.28115421 -0.08367612 -0.25733218]
[ 0.05022367 -0.02578774 -0.1556777 0.5630552 0.49002449 -0.21602002
-0.25383901 1.00128866 0.65509951 0.03890494 -0.29947232 0.38347594
0.40850895 -0.55553625 0.56699214 0.6736456 0.57202613]
[ 0.16515151 0.09101577 -0.04028353 0.37195909 0.33191707 -0.06897917
-0.06140453 0.65509951 1.00128866 0.12812787 -0.19968518 0.32962651
0.3750222 -0.36309504 0.27271444 0.50238599 0.42548915]
[ 0.13272942 0.11367165 0.11285614 0.1190116 0.115676 0.11569867
0.08130416 0.03890494 0.12812787 1.00128866 0.17952581 0.0269404
0.10008351 -0.03197042 -0.04025955 0.11255393 0.00106226]
[ 0.17896117 0.20124767 0.28129148 -0.09343665 -0.08091441 0.31760831
0.32029384 -0.29947232 -0.19968518 0.17952581 1.00128866 -0.01094989
-0.03065256 0.13652054 -0.2863366 -0.09801804 -0.26969106]
[ 0.39120081 0.35621633 0.33189629 0.53251337 0.54656564 0.3187472
0.14930637 0.38347594 0.32962651 0.0269404 -0.01094989 1.00128866
0.85068186 -0.13069832 0.24932955 0.43331936 0.30543094]
[ 0.36996762 0.3380184 0.30867133 0.49176793 0.52542506 0.30040557
0.14208644 0.40850895 0.3750222 0.10008351 -0.03065256 0.85068186
1.00128866 -0.16031027 0.26747453 0.43936469 0.28990033]
[ 0.09575627 0.17645611 0.23757707 -0.38537048 -0.29500852 0.28006379
0.23283016 -0.55553625 -0.36309504 -0.03197042 0.13652054 -0.13069832
-0.16031027 1.00128866 -0.4034484 -0.5845844 -0.30710565]
[-0.09034216 -0.16019604 -0.18102711 0.45607223 0.41840277 -0.22975792
-0.28115421 0.56699214 0.27271444 -0.04025955 -0.2863366 0.24932955
0.26747453 -0.4034484 1.00128866 0.41825001 0.49153016]
[ 0.2599265 0.12487773 0.06425192 0.6617651 0.52812713 0.01867565
-0.08367612 0.6736456 0.50238599 0.11255393 -0.09801804 0.43331936
0.43936469 -0.5845844 0.41825001 1.00128866 0.39084571]
[ 0.14694372 0.06739929 -0.02236983 0.49562711 0.47789622 -0.07887464
-0.25733218 0.57202613 0.42548915 0.00106226 -0.26969106 0.30543094
0.28990033 -0.30710565 0.49153016 0.39084571 1.00128866]]

Correlation measures the strength and the direction of the relationship between two variables, i.e. whether they are positively or negatively correlated, and how strongly.

The correlation matrix before and after scaling remains the same, and on standardized data the covariance matrix coincides with the correlation matrix (up to the n/(n−1) factor visible on the diagonal above).

2.4 Check the dataset for outliers before and after scaling. What insight do you derive here? [Please do not
treat Outliers unless specifically asked to do so]

Checking the data before scaling

Checking the data after scaling


Inference

 The outliers are still present in the dataset.
 Scaling does not remove outliers; it only maps the values onto a Z-score distribution.
 We can use any one method to remove outliers for further processing: for example, treat points beyond 3 standard deviations as outliers and either remove them or impute them with IQR-based values.
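A minimal sketch of the before/after check, assuming dx (unscaled) and dz (scaled) as above:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(16, 5))
dx.boxplot(ax=axes[0], rot=90)   # original scales
axes[0].set_title('Before scaling')
dz.boxplot(ax=axes[1], rot=90)   # z-scores: same shapes, same outliers
axes[1].set_title('After scaling')
plt.tight_layout()
plt.show()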

2.5 Extract the eigenvalues and eigenvectors.[Using Sklearn PCA Print Both]
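A minimal sketch of the extraction, assuming the scaled data dz from 2.2. Note that np.linalg.eig returns the eigenpairs of the covariance matrix unsorted (which is why the eigenvalues printed below are not in descending order), while sklearn's PCA reports them sorted by explained variance, as in 2.6:

import numpy as np
from sklearn.decomposition import PCA

# Eigen-decomposition of the covariance matrix of the scaled data.
eig_vals, eig_vecs = np.linalg.eig(np.cov(dz.T))

# Equivalent via sklearn PCA, sorted by explained variance.
pca = PCA(n_components=17).fit(dz)
print(pca.components_)           # eigenvectors, one per row
print(pca.explained_variance_)   # eigenvalues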


Eigenvectors:
[[-2.48765602e-01 3.31598227e-01 -6.30921033e-02 2.81310530e-01
-5.74140964e-03 -1.62374420e-02 -4.24863486e-02 -1.03090398e-01
-9.02270802e-02 5.25098025e-02 -3.58970400e-01 4.59139498e-01
-4.30462074e-02 1.33405806e-01 -8.06328039e-02 -5.95830975e-01
2.40709086e-02]
[-2.07601502e-01 3.72116750e-01 -1.01249056e-01 2.67817346e-01
-5.57860920e-02 7.53468452e-03 -1.29497196e-02 -5.62709623e-02
-1.77864814e-01 4.11400844e-02 5.43427250e-01 -5.18568789e-01
5.84055850e-02 -1.45497511e-01 -3.34674281e-02 -2.92642398e-01
-1.45102446e-01]
[-1.76303592e-01 4.03724252e-01 -8.29855709e-02 1.61826771e-01
5.56936353e-02 -4.25579803e-02 -2.76928937e-02 5.86623552e-02
-1.28560713e-01 3.44879147e-02 -6.09651110e-01 -4.04318439e-01
6.93988831e-02 2.95896092e-02 8.56967180e-02 4.44638207e-01
1.11431545e-02]
[-3.54273947e-01 -8.24118211e-02 3.50555339e-02 -5.15472524e-02
3.95434345e-01 -5.26927980e-02 -1.61332069e-01 -1.22678028e-01
3.41099863e-01 6.40257785e-02 1.44986329e-01 -1.48738723e-01
8.10481404e-03 6.97722522e-01 1.07828189e-01 -1.02303616e-03
3.85543001e-02]
[-3.44001279e-01 -4.47786551e-02 -2.41479376e-02 -1.09766541e-01
4.26533594e-01 3.30915896e-02 -1.18485556e-01 -1.02491967e-01
4.03711989e-01 1.45492289e-02 -8.03478445e-02 5.18683400e-02
2.73128469e-01 -6.17274818e-01 -1.51742110e-01 -2.18838802e-02
-8.93515563e-02]
[-1.54640962e-01 4.17673774e-01 -6.13929764e-02 1.00412335e-01
4.34543659e-02 -4.34542349e-02 -2.50763629e-02 7.88896442e-02
-5.94419181e-02 2.08471834e-02 4.14705279e-01 5.60363054e-01
8.11578181e-02 9.91640992e-03 5.63728817e-02 5.23622267e-01
5.61767721e-02]
[-2.64425045e-02 3.15087830e-01 1.39681716e-01 -1.58558487e-01
-3.02385408e-01 -1.91198583e-01 6.10423460e-02 5.70783816e-01
5.60672902e-01 -2.23105808e-01 -9.01788964e-03 -5.27313042e-02
-1.00693324e-01 2.09515982e-02 -1.92857500e-02 -1.25997650e-01
-6.35360730e-02]
[-2.94736419e-01 -2.49643522e-01 4.65988731e-02 1.31291364e-01
-2.22532003e-01 -3.00003910e-02 1.08528966e-01 9.84599754e-03
-4.57332880e-03 1.86675363e-01 -5.08995918e-02 1.01594830e-01
-1.43220673e-01 3.83544794e-02 3.40115407e-02 1.41856014e-01
-8.23443779e-01]
[-2.49030449e-01 -1.37808883e-01 1.48967389e-01 1.84995991e-01
-5.60919470e-01 1.62755446e-01 2.09744235e-01 -2.21453442e-01
2.75022548e-01 2.98324237e-01 -1.14639620e-03 -2.59293381e-02
3.59321731e-01 3.40197083e-03 5.84289756e-02 6.97485854e-02
3.54559731e-01]
[-6.47575181e-02 5.63418434e-02 6.77411649e-01 8.70892205e-02
1.27288825e-01 6.41054950e-01 -1.49692034e-01 2.13293009e-01
-1.33663353e-01 -8.20292186e-02 -7.72631963e-04 2.88282896e-03
-3.19400370e-02 -9.43887925e-03 6.68494643e-02 -1.14379958e-02
-2.81593679e-02]
[ 4.25285386e-02 2.19929218e-01 4.99721120e-01 -2.30710568e-01
2.22311021e-01 -3.31398003e-01 6.33790064e-01 -2.32660840e-01
-9.44688900e-02 1.36027616e-01 1.11433396e-03 -1.28904022e-02
1.85784733e-02 -3.09001353e-03 -2.75286207e-02 -3.94547417e-02
-3.92640266e-02]
[-3.18312875e-01 5.83113174e-02 -1.27028371e-01 -5.34724832e-01
-1.40166326e-01 9.12555212e-02 -1.09641298e-03 -7.70400002e-02
-1.85181525e-01 -1.23452200e-01 -1.38133366e-02 2.98075465e-02
-4.03723253e-02 -1.12055599e-01 6.91126145e-01 -1.27696382e-01
2.32224316e-02]
[-3.17056016e-01 4.64294477e-02 -6.60375454e-02 -5.19443019e-01
-2.04719730e-01 1.54927646e-01 -2.84770105e-02 -1.21613297e-02
-2.54938198e-01 -8.85784627e-02 -6.20932749e-03 -2.70759809e-02
5.89734026e-02 1.58909651e-01 -6.71008607e-01 5.83134662e-02
1.64850420e-02]
[ 1.76957895e-01 2.46665277e-01 -2.89848401e-01 -1.61189487e-01
7.93882496e-02 4.87045875e-01 2.19259358e-01 -8.36048735e-02
2.74544380e-01 4.72045249e-01 2.22215182e-03 -2.12476294e-02
-4.45000727e-01 -2.08991284e-02 -4.13740967e-02 1.77152700e-02
-1.10262122e-02]
[-2.05082369e-01 -2.46595274e-01 -1.46989274e-01 1.73142230e-02
2.16297411e-01 -4.73400144e-02 2.43321156e-01 6.78523654e-01
-2.55334907e-01 4.22999706e-01 1.91869743e-02 3.33406243e-03
1.30727978e-01 -8.41789410e-03 2.71542091e-02 -1.04088088e-01
1.82660654e-01]
[-3.18908750e-01 -1.31689865e-01 2.26743985e-01 7.92734946e-02
-7.59581203e-02 -2.98118619e-01 -2.26584481e-01 -5.41593771e-02
-4.91388809e-02 1.32286331e-01 3.53098218e-02 -4.38803230e-02
-6.92088870e-01 -2.27742017e-01 -7.31225166e-02 9.37464497e-02
3.25982295e-01]
[-2.52315654e-01 -1.69240532e-01 -2.08064649e-01 2.69129066e-01
1.09267913e-01 2.16163313e-01 5.59943937e-01 -5.33553891e-03
4.19043052e-02 -5.90271067e-01 1.30710024e-02 -5.00844705e-03
-2.19839000e-01 -3.39433604e-03 -3.64767385e-02 6.91969778e-02
1.22106697e-01]]

Eigenvalues:
[5.45052162 4.48360686 1.17466761 1.00820573 0.93423123 0.84849117
0.6057878 0.58787222 0.53061262 0.4043029 0.02302787 0.03672545
0.31344588 0.08802464 0.1439785 0.16779415 0.22061096]

2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with
the original features

pca.explained_variance_

array([5.45052162, 4.48360686, 1.17466761, 1.00820573, 0.93423123,


0.84849117, 0.6057878 , 0.58787222, 0.53061262, 0.4043029 ,
0.31344588, 0.22061096, 0.16779415, 0.1439785 , 0.08802464,
0.03672545, 0.02302787])

pca.components_

array([[ 2.48765602e-01, 2.07601502e-01, 1.76303592e-01,


3.54273947e-01, 3.44001279e-01, 1.54640962e-01,
2.64425045e-02, 2.94736419e-01, 2.49030449e-01,
6.47575181e-02, -4.25285386e-02, 3.18312875e-01,
3.17056016e-01, -1.76957895e-01, 2.05082369e-01,
3.18908750e-01, 2.52315654e-01],
[ 3.31598227e-01, 3.72116750e-01, 4.03724252e-01,
-8.24118211e-02, -4.47786551e-02, 4.17673774e-01,
3.15087830e-01, -2.49643522e-01, -1.37808883e-01,
5.63418434e-02, 2.19929218e-01, 5.83113174e-02,
4.64294477e-02, 2.46665277e-01, -2.46595274e-01,
-1.31689865e-01, -1.69240532e-01],
[-6.30921033e-02, -1.01249056e-01, -8.29855709e-02,
3.50555339e-02, -2.41479376e-02, -6.13929764e-02,
1.39681716e-01, 4.65988731e-02, 1.48967389e-01,
6.77411649e-01, 4.99721120e-01, -1.27028371e-01,
-6.60375454e-02, -2.89848401e-01, -1.46989274e-01,
2.26743985e-01, -2.08064649e-01],
[ 2.81310530e-01, 2.67817346e-01, 1.61826771e-01,
-5.15472524e-02, -1.09766541e-01, 1.00412335e-01,
-1.58558487e-01, 1.31291364e-01, 1.84995991e-01,
8.70892205e-02, -2.30710568e-01, -5.34724832e-01,
-5.19443019e-01, -1.61189487e-01, 1.73142230e-02,
7.92734946e-02, 2.69129066e-01],
[ 5.74140964e-03, 5.57860920e-02, -5.56936353e-02,
-3.95434345e-01, -4.26533594e-01, -4.34543659e-02,
3.02385408e-01, 2.22532003e-01, 5.60919470e-01,
-1.27288825e-01, -2.22311021e-01, 1.40166326e-01,
2.04719730e-01, -7.93882496e-02, -2.16297411e-01,
7.59581203e-02, -1.09267913e-01],
[-1.62374420e-02, 7.53468452e-03, -4.25579803e-02,
-5.26927980e-02, 3.30915896e-02, -4.34542349e-02,
-1.91198583e-01, -3.00003910e-02, 1.62755446e-01,
6.41054950e-01, -3.31398003e-01, 9.12555212e-02,
1.54927646e-01, 4.87045875e-01, -4.73400144e-02,
-2.98118619e-01, 2.16163313e-01],
[-4.24863486e-02, -1.29497196e-02, -2.76928937e-02,
-1.61332069e-01, -1.18485556e-01, -2.50763629e-02,
6.10423460e-02, 1.08528966e-01, 2.09744235e-01,
-1.49692034e-01, 6.33790064e-01, -1.09641298e-03,
-2.84770105e-02, 2.19259358e-01, 2.43321156e-01,
-2.26584481e-01, 5.59943937e-01],
[-1.03090398e-01, -5.62709623e-02, 5.86623552e-02,
-1.22678028e-01, -1.02491967e-01, 7.88896442e-02,
5.70783816e-01, 9.84599754e-03, -2.21453442e-01,
2.13293009e-01, -2.32660840e-01, -7.70400002e-02,
-1.21613297e-02, -8.36048735e-02, 6.78523654e-01,
-5.41593771e-02, -5.33553891e-03],
[-9.02270802e-02, -1.77864814e-01, -1.28560713e-01,
3.41099863e-01, 4.03711989e-01, -5.94419181e-02,
5.60672902e-01, -4.57332880e-03, 2.75022548e-01,
-1.33663353e-01, -9.44688900e-02, -1.85181525e-01,
-2.54938198e-01, 2.74544380e-01, -2.55334907e-01,
-4.91388809e-02, 4.19043052e-02],
[ 5.25098025e-02, 4.11400844e-02, 3.44879147e-02,
6.40257785e-02, 1.45492289e-02, 2.08471834e-02,
-2.23105808e-01, 1.86675363e-01, 2.98324237e-01,
-8.20292186e-02, 1.36027616e-01, -1.23452200e-01,
-8.85784627e-02, 4.72045249e-01, 4.22999706e-01,
1.32286331e-01, -5.90271067e-01],
[ 4.30462074e-02, -5.84055850e-02, -6.93988831e-02,
-8.10481404e-03, -2.73128469e-01, -8.11578181e-02,
1.00693324e-01, 1.43220673e-01, -3.59321731e-01,
3.19400370e-02, -1.85784733e-02, 4.03723253e-02,
-5.89734026e-02, 4.45000727e-01, -1.30727978e-01,
6.92088870e-01, 2.19839000e-01],
[ 2.40709086e-02, -1.45102446e-01, 1.11431545e-02,
3.85543001e-02, -8.93515563e-02, 5.61767721e-02,
-6.35360730e-02, -8.23443779e-01, 3.54559731e-01,
-2.81593679e-02, -3.92640266e-02, 2.32224316e-02,
1.64850420e-02, -1.10262122e-02, 1.82660654e-01,
3.25982295e-01, 1.22106697e-01],
[ 5.95830975e-01, 2.92642398e-01, -4.44638207e-01,
1.02303616e-03, 2.18838802e-02, -5.23622267e-01,
1.25997650e-01, -1.41856014e-01, -6.97485854e-02,
1.14379958e-02, 3.94547417e-02, 1.27696382e-01,
-5.83134662e-02, -1.77152700e-02, 1.04088088e-01,
-9.37464497e-02, -6.91969778e-02],
[ 8.06328039e-02, 3.34674281e-02, -8.56967180e-02,
-1.07828189e-01, 1.51742110e-01, -5.63728817e-02,
1.92857500e-02, -3.40115407e-02, -5.84289756e-02,
-6.68494643e-02, 2.75286207e-02, -6.91126145e-01,
6.71008607e-01, 4.13740967e-02, -2.71542091e-02,
7.31225166e-02, 3.64767385e-02],
[ 1.33405806e-01, -1.45497511e-01, 2.95896092e-02,
6.97722522e-01, -6.17274818e-01, 9.91640992e-03,
2.09515982e-02, 3.83544794e-02, 3.40197083e-03,
-9.43887925e-03, -3.09001353e-03, -1.12055599e-01,
1.58909651e-01, -2.08991284e-02, -8.41789410e-03,
-2.27742017e-01, -3.39433604e-03],
[ 4.59139498e-01, -5.18568789e-01, -4.04318439e-01,
-1.48738723e-01, 5.18683400e-02, 5.60363054e-01,
-5.27313042e-02, 1.01594830e-01, -2.59293381e-02,
2.88282896e-03, -1.28904022e-02, 2.98075465e-02,
-2.70759809e-02, -2.12476294e-02, 3.33406243e-03,
-4.38803230e-02, -5.00844705e-03],
[ 3.58970400e-01, -5.43427250e-01, 6.09651110e-01,
-1.44986329e-01, 8.03478445e-02, -4.14705279e-01,
9.01788964e-03, 5.08995918e-02, 1.14639620e-03,
7.72631963e-04, -1.11433396e-03, 1.38133366e-02,
6.20932749e-03, -2.22215182e-03, -1.91869743e-02,
-3.53098218e-02, -1.30710024e-02]])
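A minimal sketch of the export, assuming pca and dz as above (the CSV file name is illustrative):

import pandas as pd

loadings = pd.DataFrame(pca.components_,
                        columns=dz.columns,                       # original feature names
                        index=['PC{}'.format(i + 1) for i in range(17)])
loadings.to_csv('pca_components.csv')   # illustrative output file name
loadings.head()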

2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only).
[hint: write the linear equation of PC in terms of eigenvectors and corresponding features]


import numpy as np

print('The linear equation of the 1st principal component:')

# Join each loading of PC1 (rounded to two decimals, as the question asks)
# with its feature name; pca and dz are as defined above.
terms = ['{} * {}'.format(np.round(pca.components_[0][i], 2), dz.columns[i])
         for i in range(dz.shape[1])]
print(' + '.join(terms))

The linear equation of the 1st principal component:

0.25 * Apps + 0.21 * Accept + 0.18 * Enroll + 0.35 * Top10perc + 0.34 * Top25perc + 0.15 * F.Undergrad + 0.03 * P.Undergrad + 0.29 * Outstate + 0.25 * Room.Board + 0.06 * Books + -0.04 * Personal + 0.32 * PhD + 0.32 * Terminal + -0.18 * S.F.Ratio + 0.21 * perc.alumni + 0.32 * Expend + 0.25 * Grad.Rate
Observations:

 The scree plot visually shows how much of the variance is explained by how many principal components.
 The 1st PC explains 32.02% of the variance, the first two PCs together explain 58.36%, and so on.
 Effectively we can get material variance explained (i.e. about 90%) by analysing 9 principal components instead of all 17 variables (attributes) in the dataset.

PCA uses the eigenvectors of the covariance matrix to figure out how we should rotate the data. Because rotation is a kind of linear transformation, the new dimensions are weighted sums of the old ones. The eigenvectors (principal components) determine the directions or axes along which the linear transformation acts, stretching or compressing input vectors; they are the lines of change that represent the action of the larger matrix, the very "line" in the linear transformation.
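A minimal sketch of the scree plot referred to above, assuming pca as fitted in 2.5:

import numpy as np
import matplotlib.pyplot as plt

plt.plot(range(1, 18), np.cumsum(pca.explained_variance_ratio_) * 100, marker='o')
plt.axhline(90, color='red', linestyle='--')   # 90% variance threshold
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative variance explained (%)')
plt.show()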

2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum
number of principal components? What do the eigenvectors indicate?

import numpy as np

tot = sum(eig_vals)   # eig_vals: eigenvalues from 2.5
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]  # % variance per PC
cum_var_exp = np.cumsum(var_exp)   # cumulative % variance explained
cum_var_exp

array([ 32.0206282 ,  58.36084263,  65.26175919,  71.18474841,
        76.67315352,  81.65785448,  85.21672597,  88.67034731,
        91.78758099,  94.16277251,  96.00419883,  97.30024023,
        98.28599436,  99.13183669,  99.64896227,  99.86471628,
       100.        ])

To decide the optimum number of principal components:

 Check the cumulative variance up to about 90% and note the number of components associated with it.
 The incremental variance contributed by each additional component should not be less than 5%.

The incremental variance drops below 5% from the 6th component onward, so we select 5 principal components for the case study:

array([[ 0.2487656 , 0.2076015 , 0.17630359, 0.35427395, 0.34400128,


0.15464096, 0.0264425 , 0.29473642, 0.24903045, 0.06475752,
-0.04252854, 0.31831287, 0.31705602, -0.17695789, 0.20508237,
0.31890875, 0.25231565],
[ 0.33159823, 0.37211675, 0.40372425, -0.08241182, -0.04477866,
0.41767377, 0.31508783, -0.24964352, -0.13780888, 0.05634184,
0.21992922, 0.05831132, 0.04642945, 0.24666528, -0.24659527,
-0.13168986, -0.16924053],
[-0.06309208, -0.10124909, -0.08298558, 0.03505553, -0.02414794,
-0.06139295, 0.13968171, 0.04659888, 0.14896739, 0.67741165,
0.49972112, -0.12702837, -0.06603755, -0.2898484 , -0.14698927,
0.22674398, -0.20806465],
[ 0.28131055, 0.26781732, 0.16182676, -0.05154726, -0.10976654,
0.10041235, -0.15855849, 0.13129137, 0.18499599, 0.08708922,
-0.23071057, -0.53472483, -0.51944302, -0.16118949, 0.01731422,
0.07927349, 0.26912907],
[ 0.00574142, 0.05578608, -0.05569365, -0.39543435, -0.42653359,
-0.04345435, 0.30238541, 0.222532 , 0.56091947, -0.12728883,
-0.22231102, 0.14016633, 0.20471973, -0.07938825, -0.21629741,
0.07595812, -0.10926791]])

1. The first component explains 32.02% of the variance in the data.
2. The first 2 components explain 58.36% of the variance in the data.
3. The first 3 components explain 65.26% of the variance in the data.
4. The first 4 components explain 71.18% of the variance in the data.
5. The first 5 components explain 76.67% of the variance in the data.

PCA has been performed and the components exported into a data frame. After PCA, the multicollinearity is greatly reduced, as the sketch below illustrates.
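A minimal sketch of projecting the scaled data onto the five selected components for further analysis, assuming dz as above:

import pandas as pd
from sklearn.decomposition import PCA

pca5 = PCA(n_components=5).fit(dz)
df_pca = pd.DataFrame(pca5.transform(dz),
                      columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])
df_pca.head()   # uncorrelated components ready for downstream models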
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may
PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]

The business implications of using Principal Component Analysis:

The case study is about an education dataset that contains the names of various colleges and universities along with their details.

 To understand the dataset better we perform univariate and multivariate analysis, which gives us an understanding of the variables.
 From the analysis we can understand the distribution of the dataset, its skew, and the patterns in the data; from the multivariate analysis we can understand the correlation between the variables.
 The multivariate analysis shows that multiple variables are highly correlated with each other.
 Scaling standardizes the variables onto one scale.
 Outliers may be imputed using IQR-based values; once the values are treated, we can perform PCA.
 Principal component analysis is used to reduce the multicollinearity between the variables.
 Depending on the variance of the dataset we can reduce the number of PCA components.
 Five principal components are retained for this business case, capturing the maximum variance of the dataset.
 Using these components we can now work with a dataset in which multicollinearity is greatly reduced.
