Advanced Statistics Project Module 3 - Advanced Statistics: Submitted To Great Learning

This document presents an analysis of salary data from 40 individuals to understand the relationship between salary, education, and occupation. It performs one-way and two-way ANOVA tests to analyze the effects of education and occupation on salary both individually and interactively. It also performs exploratory data analysis and principal component analysis to gain insights from the data and reduce its dimensionality for further analysis. The results of the statistical tests and analyses will help interpret the business implications for this case study.


Advanced Statistics Project

Module 3 – Advanced Statistics

Submitted to

Great Learning

By Raveena Kumari

Post Graduate Program in Data Science and Business Analytics


Table of Contents

Executive Summary
Hypothesis Testing: Theory
ANOVA Introduction
Data Description and Sample

Problem 1A:
Q1.1) State the null and the alternate hypotheses for conducting one-way ANOVA for both Education and Occupation individually.
Q1.2) Perform a one-way ANOVA on Salary with respect to Education. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
Q1.3) Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
Q1.4) If the null hypothesis is rejected in either (1.2) or (1.3), find out which class means are significantly different. Interpret the result.

Problem 1B:
Q1.5) What is the interaction between the two treatments? Analyze the effects of one variable on the other (Education and Occupation) with the help of an interaction plot.
Q1.6) Perform a two-way ANOVA based on Education and Occupation (along with their interaction Education*Occupation) with the variable 'Salary'. State the null and alternative hypotheses and state your results. How will you interpret this result?
Q1.7) Explain the business implications of performing ANOVA for this particular case study.

Problem 2 (EDA + Principal Component Analysis):
Executive Summary
Introduction
Data Sample and Description
Data Dictionary
Q2.1) Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
Q2.2) Is scaling necessary for PCA in this case? Give justification and perform scaling.
Statistical tests to be done before implementing PCA
Q2.3) Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data].
Q2.4) Check the dataset for outliers before and after scaling. What insight do you derive here?
Q2.5) Extract the eigenvalues and eigenvectors [print both].
Q2.6) Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features.
Q2.7) Write down the explicit form of the first PC (in terms of the eigenvectors; use values with two decimal places only).
Q2.8) Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
Q2.9) Explain the business implication of using Principal Component Analysis for this case study. How may PCs help in the further analysis?
List of Tables and Figures

List of Tables

Hypothesis Test (ANOVA):
Table 1 – Description and sample of the dataset
Table 2 – Summary of the dataset
Table 3 – Value count of independent data
Table 4 – One-way ANOVA on Education
Table 5 – One-way ANOVA on Occupation
Table 6 – Multiple comparison for Education
Table 7 – Two-way ANOVA without interaction
Table 8 – Two-way ANOVA with interaction

EDA + PCA:
Table 9 – Sample and description of the dataset
Table 10 – Summarised information
Table 11 – Data dictionary
Table 12 – Skewness of the data
Table 13 – Description and sample of scaled data
Table 14 – Covariance matrix
Table 15 – Eigenvalues
Table 16 – Eigenvectors
Table 17 – Sample of dataset after performing PCA
Table 18 – Loading principal components
Table 19 – Loading all principal components with original features
Table 20 – Explicit form of the first PC
Table 21 – Cumulative variance of the eigenvalues

List of Figures

Hypothesis Test (ANOVA):
Figure 1 – Interaction plot with Education and Occupation

EDA + PCA:
Figure 2 – Distplot for univariate analysis
Figure 3 – Boxplot analysis
Figure 4 – Correlation matrix
Figure 5 – Scatter plot for bivariate analysis
Figure 6 – Boxplot with scaling and without outlier treatment
Figure 7 – Boxplot with scaling and outlier treatment
Figure 8 – Heatmap to show explicit form
Figure 9 – Scree plot
Problem Statement

Salary is hypothesized to depend on educational qualification and occupation. To understand this dependency, the salaries of 40 individuals were collected, and each person's educational qualification and occupation were noted. Educational qualification has three levels: High school graduate, Bachelor's, and Doctorate. Occupation has four levels: Administrative and clerical, Sales, Professional or specialty, and Executive or managerial. The number of observations differs across the education-occupation combinations.

Theory

ANOVA

• An experiment has been carried out for the purpose of comparing means
• A cause-and-effect relationship between the factor and the response is hypothesized
• Treatments are randomized across experimental units
• More than two population means are compared

Assumptions:
• The samples drawn from the different populations are independent and random
• The response variable is continuous and ideally normally distributed (if not, the test does not break down, but results may be subject to variation); plot the distribution to check
• The variances of all the populations are equal (the hypotheses concern the means, not the variances)

ANOVA decomposes the total variation into two parts:

1. Variation between the group means (the treatment effect)

2. Variation within the groups (the residual)

Introduction

The purpose of this exercise is to explore the dataset and perform hypothesis testing. The main aim is to understand the relationship between the independent variables (Education and Occupation) and the dependent variable (Salary). The data consist of the salaries of 40 different individuals along with their occupation and education. This lets us compare the mean of a continuous response variable across the levels of the independent variables.

Describe and Sample Data

Table 1 – Description(Left) and sample(Right) of the dataset Salary

Table 2 – Informative Summary of the dataset Salary

Q1.2). Perform a one-way ANOVA on Salary with respect to Education. State
whether the null hypothesis is accepted or rejected based on the
ANOVA results.

So here we need to decide whether Education is significant in relation to the continuous dependent variable Salary.

Table 4 – One way ANOVA on Education

• Degrees of Freedom: for a factor with k categories, df = k - 1. Education has 3 categories, so its degrees of freedom are 3 - 1 = 2. There are 40 observations in total; with 3 group means estimated, 40 - 3 = 37 degrees of freedom remain for the residual.
• Sums of Squares: the education categories (Doctorate, Bachelors, HS-grad) have 3 different means. Comparing those sample means to the overall mean gives 102695500000, also called the between sums of squares. The within sums of squares appears in the residual row.
• Mean Sums of Squares = Sums of Squares / Degrees of Freedom = 102695500000 / 2 = 51347750000; the same calculation applies to the residual.
• F-Statistic = Education mean sums of squares / residual mean sums of squares = 30.95682. The variance between education categories is about 31 times the variance within each category.
• P-Value: since the p-value is less than alpha, we reject the null hypothesis; at least one education group has a mean salary different from the others. Education qualification (cause) is therefore a differentiator in the salary (effect) structure.
• F-Critical: Python's ANOVA output does not report F-critical; it can be computed in Excel using F.INV.RT.

Q1.3). Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null hypothesis is accepted or rejected based on the ANOVA results.

Table 5 – One way ANOVA on Occupation

• F-Statistic: the variance between occupation categories is about 0.884 times the variance within each category of Occupation, which is very low compared with Education as a factor.
• P-Value: since the p-value is greater than alpha (0.05), we fail to reject the null hypothesis; the mean salaries of the different occupation groups are the same. Therefore Occupation is not a differentiator in the salary structure.
• F-Critical: Python's ANOVA output does not report F-critical; it can be computed in Excel using F.INV.RT.

Note:

• When the p-value > alpha, we fail to reject the null hypothesis
• When the p-value < alpha, we reject the null hypothesis
• If F-statistic > F-critical, we reject the null hypothesis
• If F-statistic < F-critical, we fail to reject the null hypothesis

Inference:
Since the p-value is greater than alpha (0.05), we fail to reject the null hypothesis; the mean salaries of the different occupation groups are the same. Therefore Occupation is not a differentiator in the salary structure.

Q1.4). If the null hypothesis is rejected in either (1.2) or in (1.3), find out
which class means are significantly different. Interpret the result.

Table 6- Multiple Comparison For Education
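The multiple comparison behind Table 6 can be produced with statsmodels' `pairwise_tukeyhsd`. A sketch on hypothetical data (the real input is the 40-row salary file, which is not included here):

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical mini-sample standing in for the salary data
df = pd.DataFrame({
    "Education": ["HS-grad"] * 3 + ["Bachelors"] * 3 + ["Doctorate"] * 3,
    "Salary": [30000, 32000, 31000, 55000, 58000, 56000,
               90000, 95000, 92000],
})

# Pairwise Tukey HSD: which education groups differ significantly in mean salary
tukey = pairwise_tukeyhsd(endog=df["Salary"], groups=df["Education"], alpha=0.05)
print(tukey.summary())
```

Each row of the summary compares one pair of groups and flags `reject=True` where the class means differ significantly.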

Problem 1. B)

Q1.5) What is the interaction between two treatments? Analyse the effects of
one variable on the other (Education and Occupation) with the help of an
interaction plot.

Two-Way ANOVA:

• Used when the experiment involves multiple treatments from two independent variables (called factors)

• Checks whether there is an interaction between the two independent variables in their effect on the dependent variable

Assumptions:
• The dependent variable is measured at a continuous level

• There should be no significant outliers

• The dependent variable should be normally distributed

• Randomization is done beforehand

I have performed a two-way ANOVA to check the interaction between the two treatments, with an interaction plot to visualize it. A two-way ANOVA without the interaction term is fitted first, taking Education as the first factor since Education has a p-value less than alpha.

Table 7 – Two-way ANOVA without Interaction

Taking education and occupation together, Education's p-value is 0.00000001981539, which is less than alpha, while Occupation's p-value is 0.3545825, which is greater than alpha. Hence, even when education and occupation are considered together, education alone is a significant factor in determining the salary of the sample population.
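One way to sketch the interaction plot is to plot the cell means directly with pandas (statsmodels also offers `statsmodels.graphics.factorplots.interaction_plot`). The data below are hypothetical stand-ins for the salary file:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen

# Hypothetical education x occupation cells; the real plot uses all 40 rows
df = pd.DataFrame({
    "Education": ["HS-grad", "HS-grad", "Doctorate", "Doctorate"] * 2,
    "Occupation": ["Sales"] * 4 + ["Prof-specialty"] * 4,
    "Salary": [25000, 27000, 70000, 72000, 40000, 42000, 95000, 98000],
})

# Mean salary per education level, one line per occupation: roughly parallel
# lines suggest no interaction; crossing lines suggest an interaction.
cell_means = df.groupby(["Education", "Occupation"])["Salary"].mean().unstack()
ax = cell_means.plot(marker="o", figsize=(8, 5))
ax.set_ylabel("Mean Salary")
ax.set_title("Interaction plot: Education x Occupation")
```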

Figure 1 – Interaction Plot with Factor Education and Occupation together

Inference

From the above plot, there is no strong interaction between the variables. Salary is lowest for high school graduates in every occupation, increases with a bachelor's degree, and is highest with a doctorate. From the occupation perspective, Prof-specialty has the highest salaries, while high school graduates working in Sales have the lowest. We can also see that even with a doctorate degree, salaries in the Sales and Adm-clerical departments are not much different from those with a bachelor's degree.

Q1.6). Perform a two-way ANOVA based on Education and Occupation (along with their interaction Education*Occupation) with the variable 'Salary'. State the null and alternative hypotheses and state your results. How will you interpret this result?

Table 8 – Two Way ANOVA with Interaction

• Degrees of Freedom: Education (3 levels) and Occupation (4 levels) give the interaction term (3 - 1) * (4 - 1) = 6 degrees of freedom. Of the 40 observations, the intercept, the two main effects, and the interaction account for 1 + 2 + 3 + 6 = 12, leaving 28 as residual.
• Sums of Squares: the Education*Occupation cells have different means. Comparing those sample means to the overall mean gives 36349090000 for the interaction term, the between sums of squares; the within sums of squares appears in the residual row.
• Mean Sums of Squares = Sums of Squares / Degrees of Freedom = 36349090000 / 6 = 6058182000.
• F-Statistic = interaction mean sums of squares / residual mean sums of squares = 8.519815. The variance between the Education*Occupation cells is about 8.5 times the residual variance.
• P-Value: since the p-value, 0.000022325, is less than alpha (0.05), we reject the null hypothesis; the mean salaries of the Education*Occupation groups differ, so Education and Occupation have some kind of interaction. With the interaction term included, the p-values of the first two treatments change slightly compared with the two-way ANOVA without the interaction term, and the p-value of the 'Education'*'Occupation' interaction term indicates that the null hypothesis is rejected in this case.

Q1.7). Explain the business implications of performing ANOVA for this particular case study.

• Education alone is a significant variable for salary
• Occupation alone has no significant effect on salary, while education and occupation together have an interaction
• Human Resources can use this analysis to design the salary structure
• Other factors, such as experience, could be added to improve the model

EXPLORATORY DATA ANALYSIS AND PRINCIPAL COMPONENT ANALYSIS

Executive Summary

The dataset contains information on various colleges. A Principal Component Analysis is to be performed for this case study according to the instructions given. The data dictionary of the 'Education - Post 12th Standard' data is also available.

Introduction
The purpose of this exercise is to explore the dataset through exploratory data analysis, using central tendency and other parameters. The data consist of 777 different colleges with 17 unique features. We analyse the different attributes of the colleges and reduce the dimensionality of the data by scaling, finding eigenvalues and eigenvectors, and performing PCA.

Sample Data and Description

Table 9 – Sample and Description of the dataset

If we go by the process of EDA:

1. Know the Problem Statement or the Business Objective


2. Load and view the given data
3. Check the relevance of the data against the objective or goal to be achieved
1. Scope of the data
2. Time relevance of the data
3. Quantum of data
4. Features of the data
4. Understand each feature in the data with help of Data Dictionary
5. Know the central tendency and data distribution of each feature

Describe data

• We have 18 features and 777 colleges
• The min and max of each variable lie within the expected range for the data
• The mean and the median are almost equal for most variables, indicating that most of the numerical features are approximately normally distributed
• Some variables look highly skewed, such as Apps, Accept, Enroll, and F.Undergrad. This is expected, as some colleges receive a far greater number of applications, enrolments, etc. Further analysis is done during univariate analysis.

Table 10 – Summarized information

Inference -
• All values are non-null, so there are no missing values
• There are 3 data types, consistent with the requirements of the dataset: 16 integer columns, 1 object column, and 1 float column

Data Dictionary

Data Dictionary Type


1) Names: Names of various university and colleges (Categorical)
2) Apps: Number of applications received (Numerical)
3) Accept: Number of applications accepted (Numerical)
4) Enroll: Number of new students enrolled (Numerical)
5) Top10perc: Percentage of new students from top 10% of Higher Secondary class (Numerical)
6) Top25perc: Percentage of new students from top 25% of Higher Secondary class (Numerical)
7) F.Undergrad: Number of full-time undergraduate students (Numerical)
8) P.Undergrad: Number of part-time undergraduate students (Numerical)
9) Outstate: Out-of-state tuition (Numerical)
10) Room.Board: Cost of Room and board (Numerical)
11) Books: Estimated book costs for a student (Numerical)
12) Personal: Estimated personal spending for a student (Numerical)
13) PhD: Percentage of faculties with Ph.D.’s (Numerical)
14) Terminal: Percentage of faculties with terminal degree (Numerical)
15) S.F.Ratio: Student/faculty ratio (Numerical)
16) perc.alumni: Percentage of alumni who donate (Numerical)
17) Expend: The Instructional expenditure per student (Numerical)
18) Grad.Rate: Graduation rate (Numerical)
Table 11- Data Dictionary

Data Pre-processing

Practical datasets generally contain a lot of "noise" and/or undesired data points which might impact the outcome, so pre-processing is an important step. As these noise elements are well amalgamated with the complete dataset, the cleansing process is governed by the data scientist's judgement. Noise elements come in the form of:

• Bad values
• Anomalies
• Missing values
• Data that is not useful

There are no anomalies, bad values, or duplicate values in this dataset, so no data pre-processing is needed.

Data Visualization

Visualization is a technique for creating diagrams, images, or animations to communicate a message. Using charts or graphs to visualize large amounts of complex data is easier than poring over spreadsheets or reports.

Data analysis using visualization includes:
• Univariate analysis
• Bivariate analysis
• Multivariate analysis

The key to this analysis is generating insights and inferences aligned with the business problem.

Q2.1).Perform Exploratory Data Analysis [both univariate and multivariate


analysis to be performed]. What insight do you draw from the EDA?

Univariate Analysis

For Numerical Categories –

For univariate analysis I have used distribution plots and checked the skewness of each variable to understand the distribution of the data.

Figure 2 – Distplot for Univariate Analysis

Table 12- Skewness of the data

Figure 3- Boxplot Analysis
Inference –

From the skewness values, the distribution plots, and the description table, most colleges lie within a range of values for each variable, with a few colleges outside this range, nearly symmetrically placed on either side. P.Undergrad, Apps, Books, Expend, Accept, Enroll, F.Undergrad, Personal, Top10perc, S.F.Ratio, perc.alumni, Outstate, Room.Board, and Top25perc are positively (right) skewed, as their mean is greater than their median. This right skewness arises because only a few colleges have very high values for these variables while most have low values, so the tail falls on the right side of the distribution plot. Grad.Rate, PhD, and Terminal are negatively (left) skewed, as their mean is less than their median; this is because only a few colleges have a low percentage of faculty with PhD or terminal degrees. The boxplots show that there are many outliers in the data.

Multivariate Analysis –
Numerical Data

For bivariate analysis, I have used a correlation matrix as well as scatter plots to understand the relationship between pairs of variables.

Heat Map or Correlation Matrix –

Figure 4 – Correlation Matrix

Figure 5- Scatter Plot for Bivariate Analysis

Inferences:

• There is a strong positive correlation between 'Apps', 'Accept', 'Enroll', and 'F.Undergrad'. The logic is that as the number of applications increases, so does the number of acceptances and hence the number of enrolments.
• There are outliers in almost all the bivariate plots, mostly in Apps vs Accept, Apps vs Enroll, Apps vs F.Undergrad, Accept vs Enroll, and Accept vs F.Undergrad.
• There is a strong positive correlation between 'Top10perc' and 'Top25perc'. The reason is that the students in the top 10% of the higher secondary class are also in the top 25%.
• We also observe a high positive correlation between 'Terminal' and 'PhD'. This may be because a 'Terminal' degree holder is most probably also a PhD holder.
• There is a medium negative correlation between 'Expend' and 'S.F.Ratio'. A higher 'Expend' means higher instructional expenditure per student, but a higher student-to-faculty ratio means more students per faculty; thus as 'S.F.Ratio' increases, the per-student 'Expend' decreases.
• There is a medium positive correlation between 'Outstate' and 'Room.Board', and also between 'Outstate' and 'Expend'. The reason could be higher fees at public universities for out-of-state students.
• There is a medium correlation between 'Outstate' and 'Top10perc', and between 'Outstate' and 'Top25perc'. The reason could be that the top 10% and top 25% students are distributed throughout the country.
• There is a lot of correlation overall; PCA will help us remove these correlations.


Q2.2). Is scaling necessary for PCA in this case? Give justification and
perform scaling.

The dataset contains count variables such as 'Apps' (mean of approximately 3001) and currency-unit variables such as 'Expend' (mean of approximately 9660), alongside ratio and percentage variables such as 'S.F.Ratio' and 'Top25perc' with much smaller magnitudes. Since PCA essentially captures the variance in different directions, using the dataset as-is would bias the analysis: variables with higher magnitude, and hence higher variance, would dominate the results. To perform a fair and proper PCA, it is therefore important to scale the variables. Typically for PCA we mean-centre and then divide by the standard deviation, i.e. z-score scaling.

Often the variables of a dataset are on different scales, e.g. one variable in millions and another only in hundreds, which makes them hard to compare. Feature scaling (also known as data normalization) standardizes the range of the features and is a necessary pre-processing step for many machine learning algorithms; it converts variables with different scales of measurement onto a single scale. StandardScaler standardizes the data using the formula (x - mean) / standard deviation. We apply it to the numerical variables only.

From the data below, it is clear that:

• The mean of each scaled variable is nearly 0 and its standard deviation is nearly 1, which is the effect of z-score scaling.
• The values of the variables are now comparable and hence will give a better PCA.
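The z-score scaling described above can be sketched with scikit-learn's `StandardScaler`; the two columns below are hypothetical stand-ins for the college data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns standing in for the 777-row college data
num = pd.DataFrame({
    "Apps": [1660.0, 2186.0, 1428.0, 417.0, 193.0],
    "S.F.Ratio": [18.1, 12.2, 12.9, 7.7, 11.9],
})

# z-score scaling: (x - mean) / standard deviation, column by column
scaled = pd.DataFrame(StandardScaler().fit_transform(num), columns=num.columns)
print(scaled.describe())
```

After scaling, each column has mean 0 and (population) standard deviation 1, so the variables contribute comparably to the PCA.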

Table 13- Sample(Above) and Description(Below) of Scaled data

PCA Implementation Process

PCA works only on continuous data. As we have 17 features, we get 17 principal components. After scaling, the covariance matrix and the correlation matrix are the same; since the information is explained in terms of variance, we use the covariance matrix. We decompose this matrix into eigenvalues and eigenvectors.

Statistical tests to be done before PCA

1st Method

Bartletts Test of Sphericity

Bartlett's test of sphericity tests the hypothesis that the variables are uncorrelated in the
population.

• H0: All variables in the data are uncorrelated


• Ha: At least one pair of variables in the data is correlated

If the null hypothesis cannot be rejected, then PCA is not advisable.

If the p-value is small, then we can reject the null hypothesis and agree that there is at least
one pair of variables in the data which are correlated hence PCA is recommended.

We get a p-value of 0, so we reject the null hypothesis: there is enough evidence that the variables are correlated, and PCA is appropriate.

2nd Method

KMO Test

The Kaiser-Meyer-Olkin (KMO) - measure of sampling adequacy (MSA) is an index used to


examine how appropriate PCA is.

Generally, if MSA is less than 0.5, PCA is not recommended, since no reduction is expected. On the other hand, MSA > 0.7 is expected to provide a considerable reduction in the dimension and the extraction of meaningful components.

The KMO measure came out to be 0.813, which is greater than 0.7; therefore PCA is appropriate for this dataset.
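The `factor_analyzer` package exposes these tests directly (`calculate_bartlett_sphericity`, `calculate_kmo`). For illustration, Bartlett's test can also be sketched from first principles with numpy and scipy; the data below are hypothetical correlated columns, not the college data:

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Bartlett's test: H0 = variables are uncorrelated (identity correlation)."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Chi-square statistic from the determinant of the correlation matrix
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2.0
    return chi2, stats.chi2.sf(chi2, dof)

# Hypothetical strongly correlated columns standing in for the scaled data
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = np.column_stack([base + 0.3 * rng.normal(size=500) for _ in range(4)])

chi2, p_value = bartlett_sphericity(X)
print(chi2, p_value)  # a tiny p-value -> reject H0, PCA is appropriate
```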

Q2.3). Comment on the comparison between the covariance and the
correlation matrices from this data.

On scaled data, the covariance matrix equals the correlation matrix.

The diagonal entries of the correlation matrix are all 1 (unit variance).

The covariance matrix gives the direction of the relationships in the dataset: whether the variables are directly or inversely proportional to each other (increasing the value of one variable has a positive or a negative impact on the value of the other variable).

Table 14- Covariance Matrix

Inference –

• Once we standardize the data, the covariance matrix equals the correlation matrix
• Variances become 1 and covariances become correlations
• Positive entries indicate that two variables tend to be high together or low together; negative entries indicate that they move in opposite directions
• Taking Figure 4 (the correlation matrix) into consideration: the correlation coefficient is a dimensionless metric whose value ranges from -1 to +1. The closer it is to +1 or -1, the more closely the two variables are related.
• When the correlation coefficient is positive, an increase in one variable accompanies an increase in the other; when it is negative, the two variables change in opposite directions.

If there is no relationship at all between two variables, then the correlation coefficient will
certainly be 0. However, if it is 0 then we can only say that there is no linear relationship.
There could exist other functional relationships between the variables.

Covariance and correlation are related to each other: covariance indicates the direction of the linear relationship between two variables, while correlation measures both the direction and the strength of that relationship.

Q2.4).
Check the dataset for outliers before and after scaling. What insight do you
derive here? [Please do not treat Outliers unless specifically asked to do so]

Figure 3(Repeat) - Boxplot without scaling and outlier treatment

Figure 6- Boxplot with scaling and without outlier treatment

Additional –

Figure 7-Boxplot with scaling and outlier treatment (Not required) [Treatment of outliers is
shown in Jupyter]

Inferences

• Scaling shifts the distribution's mean to 0 and gives it unit variance
• There is no predetermined range after scaling
• It is best used on data that is approximately normally distributed
• Scaling does not remove outliers: the points that were outliers before scaling remain outliers after it; only the scale of the axes changes

Q2.5). Extract the eigenvalues and eigenvectors.[print both]

The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the
“core” of a PCA: The eigenvectors (principal components) determine the directions of the
new feature space, and the eigenvalues determine their magnitude. In other words, the
eigenvalues explain the variance of the data along the new feature axes.

Table 15- Eigen Values

Table 16 – Eigen Vectors

Each eigenvector direction is orthogonal to the other eigenvectors. The coefficients of a particular eigenvector are the loadings corresponding to each of the variables of the original dataset; if the eigenvectors are calculated from the covariance matrix of standard-scaled (z-scaled) data, these coefficients may be considered to be the correlations with the variables of the original dataset.
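The eigenvalues and eigenvectors can be extracted from the correlation matrix with numpy; the sketch below uses hypothetical data with 4 features in place of the 17 scaled college features:

```python
import numpy as np

# Hypothetical z-scaled data; the real analysis uses the 17 scaled features
rng = np.random.default_rng(42)
Z = rng.normal(size=(100, 4))
R = np.corrcoef(Z, rowvar=False)  # = covariance matrix of z-scaled data

# eigh handles symmetric matrices and returns ascending eigenvalues; re-sort
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)  # variance captured by each principal component
print(eigvecs)  # one column per principal component (loadings)
```

The eigenvalues of a correlation matrix sum to the number of variables, and the eigenvector columns are mutually orthogonal.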

Q2.6).Perform PCA and export the data of the Principal Component
(eigenvectors) into a data frame with the original features.

We will use all the 17 components to perform PCA.

Table 17- Sample of dataset after performing PCA

Table 18- Loading Principal Components


The PCA components are the same as the eigenvectors (see Table 16).

Table 19- Loading all Principal Components with Original features
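A sketch of this step with scikit-learn, loading the components into a data frame labelled with the original feature names; the 4 columns and the data below are hypothetical stand-ins for the 17 scaled features:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical 4-feature slice; the real data has 17 scaled features
cols = ["Apps", "Accept", "Enroll", "Top10perc"]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 4)), columns=cols)

pca = PCA(n_components=4)
scores = pca.fit_transform(X)  # the data expressed in PC space

# Eigenvectors as rows, labelled with the original feature names
loadings = pd.DataFrame(pca.components_,
                        columns=cols,
                        index=[f"PC{i + 1}" for i in range(4)])
print(loadings.round(2))
```

Either `scores` or `loadings` can be exported with `to_csv` for further analysis.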

Q2.7). Write down the explicit form of the first PC (in terms of the
eigenvectors. Use values with two places of decimals only).

The first principal component is a linear combination of the original variables:

C = W1(Y1) + W2(Y2) + W3(Y3) + W4(Y4) + ...

where C is the component, W1, W2, W3, W4, ... are the PCA component loadings, and Y1, Y2, Y3, Y4, ... are the features.

Table 20 – Explicit form of the first PC

To present the explicit form graphically, we take 6 components, which together capture about 83% of the variance.

Figure 8 – Heatmap to show Explicit Form

The heatmap highlights the highest-magnitude loading in each column. For example, Apps loads most heavily on PC1. Apps, Accept, Enroll, F.Undergrad, P.Undergrad, and perc.alumni are maximally loaded on PC1 and describe a similar aspect of the business. We can therefore group them and build a new dataset using PC1 (or their relationship) to interpret further results.
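Given the loadings rounded to two decimals, the explicit form of PC1 can be assembled as a string; the loading values below are hypothetical placeholders for the actual values in Table 16:

```python
import pandas as pd

# Hypothetical two-decimal loadings; the real values come from Table 16
pc1 = pd.Series({"Apps": 0.25, "Accept": 0.21, "Enroll": 0.18, "Top10perc": 0.35})

# Build "PC1 = w1*feature1 + w2*feature2 + ..." from the loadings
expr = "PC1 = " + " + ".join(f"{w:.2f}*{name}" for name, w in pc1.items())
print(expr)
```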

Q2.8). Consider the cumulative values of the eigenvalues. How does it help you
to decide on the optimum number of principal components? What do the
eigenvectors indicate?

Table 21- Cumulative variance of the Eigen Values

According to the eigenvalues, the first and second components capture more than half of the variability, and the components are orthogonal (independent of each other).

The eigenvalues sum to 17 (the number of scaled variables), and about 80% of the information lies in the first 6 eigenvalues.

• The values above are the eigenvalues of the covariance matrix, which show the variance captured by each principal component in decreasing order.

Figure 9- Scree Plot

• The scree plot shows the individual explained variance of each principal component.
• We can observe a sudden decrease in slope from the third principal component onwards, meaning the largest variances are captured by the first two principal components. This point is also called the elbow (inflection) point.
• The first two principal components capture approximately 62% of the total variance.
• There is a sudden drop in the variance captured after the second principal component.
• 90% of the total variance is captured by the first 8 principal components.
• After the first 11 principal components, there is less than a 1% increase in captured variance for each of the remaining principal components.
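The cumulative-variance rule above can be sketched as follows; the eigenvalues are hypothetical (an 8-variable example), not the ones in Table 21:

```python
import numpy as np

# Hypothetical descending eigenvalues for an 8-variable example
eigvals = np.array([6.0, 4.0, 2.0, 1.5, 1.2, 1.0, 0.8, 0.5])
explained = eigvals / eigvals.sum()  # proportion of variance per PC
cum = np.cumsum(explained)           # cumulative explained variance

# Smallest number of PCs whose cumulative variance reaches 80%
k = int(np.searchsorted(cum, 0.80)) + 1
print(cum.round(3), k)
```

The same computation on the real eigenvalues gives the 6-component choice used in this report.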

2.9 Explain the business implication of using the Principal Component Analysis
for this case study. How may PCs help in the further analysis?
[Hint: Write Interpretations of the Principal Components Obtained]

• I have selected 6 of the 17 new dimensions, which explain 83.75% of the variance, and reduced the dimensionality of the dataset accordingly.
• The new dimension variables are independent of each other, which also helps certain algorithms.
• The dimensionality reduction obtained from PCA requires less computing power, i.e. faster processing for further analysis.
• The dimensionality reduction also needs less storage space, and it helps address overfitting, which mainly occurs when there are too many variables.
• In our case study, the multivariate analysis showed that many of the variables are correlated. We therefore do not need all of them for analysis, but we are not sure which variables to drop and which to keep; hence we perform PCA, which captures the information (in the form of variance) from all these variables in new dimension variables. Based on the amount of information required, we can then select the number of new dimension variables.
• The range of the values is very high, so it is important to scale the data.
