Advanced Statistics Project Module 3 - Advanced Statistics: Submitted To Great Learning
Submitted to
Great Learning
By Raveena Kumari
EDA+PCA
List of Figures
Figure 2 – Distplot for Univariate Analysis
Figure 3 – Boxplot Analysis
Figure 4 – Correlation Matrix
Figure 5 – Scatter Plot for Bivariate Analysis
Figure 6 – Boxplot with scaling and without outlier treatment
Figure 7 – Boxplot with scaling and outlier treatment
Theory
ANOVA
• An experiment has been carried out for the purpose of comparing means
• The variables have a strong cause-and-effect relationship
• Randomisation is used in assigning treatments
• ANOVA compares more than 2 population means
Assumptions –
• The samples drawn from the different populations are independent and random
• The response variable is continuous and ideally normally distributed (if not, the test does not necessarily go wrong but may be subject to variation; plot a normal distribution to check)
• The variances of all the populations are equal (ANOVA tests for equality of means, not of variances)
Introduction
The purpose of this exercise is to explore the dataset and perform hypothesis test analysis. Our main aim is to understand the relationship between the independent and dependent variables. The data consists of the salaries of 40 different individuals along with their occupation and education. This will help us understand and compare the mean of the continuous variable (Salary) across the groups of the independent variables.
Q1.2). Perform a one-way ANOVA on Salary with respect to Education. State
whether the null hypothesis is accepted or rejected based on the
ANOVA results.
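A minimal sketch of how this one-way ANOVA could be run in Python with statsmodels; the file name and the 'Salary' and 'Education' column names are assumptions to be adjusted to the actual dataset:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical file and column names; adjust to the actual dataset.
df = pd.read_csv("SalaryData.csv")

# Fit a linear model with Education as a categorical factor,
# then compute the one-way ANOVA table (F-statistic and p-value).
model = ols("Salary ~ C(Education)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=1)
print(anova_table)

If the p-value in the resulting table is below alpha (0.05), the null hypothesis of equal mean salaries across Education groups is rejected.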
• F-Statistic – The variance between the Occupation categories is about 0.884 times the variance within each category of Occupation, which is very low compared to Education as a factor.
• P-Value – Since the p-value is greater than alpha (0.05), we fail to reject the null hypothesis and can clearly state that the mean salaries of the different occupation groups are the same. Therefore, Occupation is not a differentiator in the salary structure.
• F-Critical – There is no output for F-critical in Python's ANOVA table. We can calculate the F-critical value through Excel using F.INV.RT, or with scipy as shown in the sketch below.
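A small sketch of computing the F-critical value with scipy; the degrees of freedom below are placeholders to be read off the actual ANOVA table:

from scipy import stats

alpha = 0.05
dfn, dfd = 3, 36   # placeholder degrees of freedom (between, within)

# Upper-tail critical value, equivalent to Excel's F.INV.RT(alpha, dfn, dfd)
f_critical = stats.f.ppf(1 - alpha, dfn, dfd)
print(f_critical)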
Note –
• When the p-value > alpha, we fail to reject the null hypothesis
• When the p-value < alpha, we reject the null hypothesis
• If the F-statistic > F-critical, we reject the null hypothesis
• If the F-statistic < F-critical, we fail to reject the null hypothesis
Inference -
Since the p-value is greater than alpha (0.05), we fail to reject the null hypothesis and can clearly state that the mean salaries of the different occupation groups are the same. Therefore, Occupation is not a differentiator in the salary structure.
Q1.4). If the null hypothesis is rejected in either (1.2) or in (1.3), find out
which class means are significantly different. Interpret the result.
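A hedged sketch of a post-hoc Tukey HSD test that identifies which class means differ significantly, under the same assumed DataFrame and column names as in the earlier sketch:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Pairwise comparison of mean Salary across Education groups at alpha = 0.05;
# rows with reject=True indicate pairs whose means differ significantly.
tukey = pairwise_tukeyhsd(endog=df["Salary"], groups=df["Education"], alpha=0.05)
print(tukey.summary())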
Problem 1. B)
Q1.5) What is the interaction between two treatments? Analyse the effects of
one variable on the other (Education and Occupation) with the help of an
interaction plot.
Assumptions
• The dependent variable is measured at a continuous level
I have performed a two-way ANOVA to check the interaction between the two treatments, which requires creating an interaction plot. A two-way ANOVA without interaction is performed first, taking Education as the first factor since Education has a p-value less than alpha.
If we take Education and Occupation together, we can see that Education's p-value is 0.00000001981539, which is less than alpha, while Occupation's p-value of 0.3545825 is more than alpha. Hence, even when Education and Occupation are taken together, Education alone is a significant factor in determining the salary of the sample population. A sketch of this two-way ANOVA and the interaction plot is shown below.
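A minimal sketch of the two-way ANOVA without interaction and the interaction plot, under the same assumed DataFrame and column names as in the earlier sketches:

import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.graphics.factorplots import interaction_plot

# Two-way ANOVA without the interaction term, Education as the first factor
model = ols("Salary ~ C(Education) + C(Occupation)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# Interaction plot: mean Salary by Education, one line per Occupation
interaction_plot(x=df["Education"], trace=df["Occupation"], response=df["Salary"])
plt.show()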
Figure 1 – Interaction Plot with Factor Education and Occupation together
Inference
From the above plot, there is no significant interaction between the variables. We can clearly see that the salary of High School graduates is the lowest regardless of Occupation, increases for those with a Bachelor's degree, and is highest for those with Doctorate degrees. From the Occupation perspective, Prof-specialty has the highest salary, while High School graduates working in Sales have the lowest. We can also see that even with a Doctorate degree, in the Sales and Adm-clerical occupations there is not much difference in salary compared with a Bachelor's degree.
Table 8 – Two-Way ANOVA with Interaction
• Degrees of Freedom = N-1. Here the Occupation and Education interaction has 7 categories, so the degrees of freedom are 7-1 = 6. The total number of observations consists of 40 line items, of which 5 are already accounted for through Occupation, leaving 29 as residual.
• Sums of Square – The Occupation and Education categories together have different means. Those sample means are compared to the overall mean, giving 36349090000. This is also called the between sums of square. The within sums of square appears in the residual output.
• Mean Sums of Square – Sums of Square / Degrees of Freedom = 36349090000 / 6 ≈ 6058182000.
• F-Statistic – It is the Occupation and Education together mean sums of square divided by the residual mean sums of square = 8.519815. This tells us that the variance between the Occupation and Education categories is about 8.5 times the variance within them.
• P-Value – Since the p-value of 0.000022325 is less than alpha (0.05), we reject the null hypothesis and can clearly state that the mean salaries of the different Occupation and Education groups taken together are different. Therefore, Occupation and Education have some kind of interaction. Due to the inclusion of the interaction effect term, we can see a slight change in the p-values of the first two treatments compared to the two-way ANOVA without the interaction effect term, and the p-value of the interaction effect term of 'Education' and 'Occupation' suggests that the null hypothesis is rejected in this case. A sketch of this model is shown below.
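For reference, a sketch of the same model with the interaction term included (same assumed names as in the earlier sketches):

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Two-way ANOVA including the Education x Occupation interaction term
model_int = ols("Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)",
                data=df).fit()
print(sm.stats.anova_lm(model_int, typ=2))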
EXPLORATORY DATA ANALYSIS AND PRINCIPAL COMPONENT ANALYSIS
Executive Summary
Introduction
The purpose of this exercise is to explore the dataset: perform exploratory data analysis using measures of central tendency and other parameters. The data consists of 777 different colleges with 17 unique features. We analyse the different attributes of the colleges and reduce the dimensionality of the data by scaling, finding eigenvalues and eigenvectors, and performing PCA.
If we go by the process of EDA:
Describe data
Inference -
• All values are non-null, so there are no missing values
• We have 3 data types, which corresponds with the requirements of the dataset: 16 integer, 1 object and 1 float
Data Dictionary
Data Pre-processing
A practical data set generally has a lot of "noise" and/or "undesired" data points which might impact the outcome; hence pre-processing is an important step. As these "noise" elements are well amalgamated with the complete dataset, the cleansing process is largely governed by the data scientist's ability. These noise elements come in the form of:
• Bad values
• Anomalies
• Missing values
• Not useful data
There are no anomalies or bad values. There are also no duplicate values, so there is no need for data pre-processing.
Data Visualization
Univariate Analysis
For univariate analysis, I have used distribution plots and checked the skewness of each variable to understand the distribution of the data; a sketch is shown below.
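A minimal sketch of this step, assuming the college data is loaded from a hypothetical 'College.csv' file into a DataFrame:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

college = pd.read_csv("College.csv")   # hypothetical file name
num_cols = college.select_dtypes(include="number").columns

# Skewness of each numerical variable (positive = right-skewed)
print(college[num_cols].skew())

# Distribution plot per variable (histplot with kde replaces the deprecated distplot)
for col in num_cols:
    sns.histplot(college[col], kde=True)
    plt.show()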
Figure 2 – Distplot for Univariate Analysis
Figure 3 – Boxplot Analysis
Inference –
From the skewness data above, the distribution plots, and the description table mentioned in Table 3, most of the colleges fall within a range of values for each variable, with a few colleges outside this range, nearly symmetrically placed on either side of it. It is clear that P.Undergrad, Apps, Books, Expend, Accept, Enroll, F.Undergrad, Personal, Top10perc, S.F.Ratio, perc.alumni, Outstate, Room.Board and Top25perc are positively (right) skewed, as their mean values are greater than their medians. This right skewness is observed because only a few colleges have high values for these variables while most other colleges have low values, so the tail falls on the right side of the distribution plot. Grad.Rate, PhD and Terminal are negatively (left) skewed, as their means are less than their medians. This is observed because only a few colleges have a low percentage of faculty with PhD or terminal degrees. The boxplots show that there are many outliers in the data.
Multivariate Analysis
Numerical Data
For bivariate analysis, I have used a correlation matrix as well as a scatter plot to understand the relationship between two variables; a sketch is shown below.
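A small sketch of the correlation heatmap, reusing the 'college' DataFrame and 'num_cols' names from the univariate sketch:

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numerical variables, drawn as a heatmap
corr = college[num_cols].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()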
Figure 4 – Correlation Matrix
Figure 5- Scatter Plot for Bivariate Analysis
Inferences:
[Boxplots: data set without outlier treatment, scaled data, and scaled data with outlier treatment]
Q2.2). Is scaling necessary for PCA in this case? Give justification and
perform scaling.
The dataset contains certain variables which are counts, like 'Apps' with a mean value of approximately 3001, and variables like 'Expend' which are expressed in currency units, with a mean value of approximately 9660, whereas certain other variables are ratios and percentages, like 'S.F.Ratio' and 'Top25perc', with much smaller magnitudes. Since we are going to perform PCA, which essentially captures the variance in different directions, using the dataset as-is would affect the analysis: the variables with higher magnitude, and hence higher variance, would dominate the results. Thus, to perform a fair and proper PCA, it is important to scale the variables. Typically, while performing PCA, we do mean centering and then scale by dividing by the standard deviation, so we will do z-score scaling.
Often the variables of a data set are on different scales, e.g. one variable is in millions and another only in hundreds. Since the data in these variables are on different scales, it is difficult to compare them. Feature scaling (also known as data normalization) is the method used to standardize the range of features of the data. Since the range of values may vary widely, it becomes a necessary pre-processing step when using machine learning algorithms. In this method, we convert variables with different scales of measurement into a single scale. StandardScaler standardizes the data using the formula (x - mean) / standard deviation. We will do this for the numerical variables only; a sketch is shown below.
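A minimal sketch of z-score scaling with scikit-learn's StandardScaler, reusing the names from the earlier sketches:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# z-score scaling: (x - mean) / standard deviation, numerical columns only
scaler = StandardScaler()
scaled = scaler.fit_transform(college[num_cols])
scaled_df = pd.DataFrame(scaled, columns=num_cols)

# After scaling, each column has mean ~0 and standard deviation ~1
print(scaled_df.describe().round(2))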
From the data below, it is clear that
• It may be observed from the description that the mean is nearly 0 and standard deviation
is nearly 1, which is the effect of z-score scaling.
• The values of the variables are now comparable and hence will give a better PCA.
Table 13 – Sample (above) and description (below) of scaled data
PCA Implementation Process
PCA works only on continuous data. As we have 17 features, we get 17 PCA components. After scaling the data, the covariance matrix and the correlation matrix are the same, but since the information is explained in terms of variance, we use the covariance matrix. We will use this matrix to compute the eigenvalues and eigenvectors, as sketched below.
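A small sketch of the eigen decomposition, applied to the scaled data from the previous sketch:

import numpy as np

# Covariance matrix of the scaled data (equal to the correlation matrix here)
cov_matrix = np.cov(scaled_df.T)

# Eigen decomposition: eigenvalues = variance captured, eigenvectors = directions
eig_values, eig_vectors = np.linalg.eig(cov_matrix)
print(eig_values)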
1st Method
Bartlett's test of sphericity tests the hypothesis that the variables are uncorrelated in the
population.
If the p-value is small, then we can reject the null hypothesis and agree that there is at least
one pair of variables in the data which are correlated hence PCA is recommended.
We get a p-value of 0, which means we reject the null hypothesis. Therefore, there is enough evidence that there is correlation in the data.
2nd Method
KMO Test
Generally, if the MSA is less than 0.5, PCA is not recommended, since no reduction is expected. On the other hand, MSA > 0.7 is expected to provide a considerable reduction in the dimension and the extraction of meaningful components.
The KMO test value came out to be 0.813, which is greater than 0.7; therefore PCA is appropriate for this data set. A sketch of both tests is shown below.
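A hedged sketch of both tests using the factor_analyzer package (assumed installed), applied to the scaled data from the earlier sketch:

from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Bartlett's test of sphericity: a small p-value means the variables are correlated
chi_square, p_value = calculate_bartlett_sphericity(scaled_df)
print("Bartlett p-value:", p_value)

# KMO measure of sampling adequacy: > 0.7 suggests PCA is worthwhile
kmo_per_variable, kmo_overall = calculate_kmo(scaled_df)
print("Overall KMO:", kmo_overall)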
Q2.3). Comment on the comparison between the covariance and the
correlation matrices from this data.
The covariance matrix gives information about the direction of the relationships in the dataset. By direction, it means whether the variables are directly or inversely proportional to each other (increasing the value of one variable might have a positive or a negative impact on the value of the other variable).
Inference-
If there is no relationship at all between two variables, then the correlation coefficient will
certainly be 0. However, if it is 0 then we can only say that there is no linear relationship.
There could exist other functional relationships between the variables.
Covariance and correlation are related to each other in the sense that covariance indicates the direction of the linear relationship between two variables, while correlation measures both the direction and the strength of the relationship between two variables.
Q2.4).
Check the dataset for outliers before and after scaling. What insight do you
derive here? [Please do not treat Outliers unless specifically asked to do so]
Figure 3 (repeat) – Boxplot without scaling and outlier treatment
Additional –
Figure 7 – Boxplot with scaling and outlier treatment (not required) [treatment of outliers is shown in Jupyter]
Inferences
The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the
“core” of a PCA: The eigenvectors (principal components) determine the directions of the
new feature space, and the eigenvalues determine their magnitude. In other words, the
eigenvalues explain the variance of the data along the new feature axes.
Table 16 – Eigenvectors
Q2.6).Perform PCA and export the data of the Principal Component
(eigenvectors) into a data frame with the original features.
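A minimal sketch of this step with scikit-learn, reusing 'scaled_df' and 'num_cols' from the earlier sketches:

import pandas as pd
from sklearn.decomposition import PCA

# Fit PCA on the scaled data, keeping all 17 components
pca = PCA(n_components=17)
pca.fit(scaled_df)

# Export the eigenvectors (loadings) into a DataFrame labelled
# with the original feature names
loadings = pd.DataFrame(pca.components_,
                        columns=num_cols,
                        index=[f"PC{i + 1}" for i in range(17)])
print(loadings.round(2))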
Q2.7). Write down the explicit form of the first PC (in terms of the
eigenvectors. Use values with two places of decimals only).
To obtain the explicit form, we take 6 components, which together capture about 83% of the variance.
The values referred to are the highest-magnitude loadings in each column. For example, Apps is loaded on PC1. As we can see, Apps, Accept, Enroll, F.Undergrad, P.Undergrad and perc.alumni have their maximum loadings on PC1 and describe a similar underlying structure. We can club them together and introduce a new dataset with PC1, or use their relationship, to interpret further results. A sketch of printing PC1 explicitly is shown below.
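A small sketch of printing PC1 in its explicit form, reusing the fitted 'pca' object and 'num_cols' from the previous sketch:

# Build the explicit form of PC1 as a linear combination of the
# original features, with loadings rounded to two decimals
pc1 = pca.components_[0]
terms = [f"{w:+.2f} * {feat}" for w, feat in zip(pc1, num_cols)]
print("PC1 =", " ".join(terms))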
Q2.8). Consider the cumulative values of the eigenvalues. How does it help you
to decide on the optimum number of principal components? What do the
eigenvectors indicate?
According to the eigenvalues, the first and second principal components capture more than half of the variability.
They are orthogonal, i.e. independent of each other.
• The above values are the eigenvalues of the covariance matrix, which show the variance captured by each principal component in decreasing order.
• Shows the individual explained variances of the principal components.
• We can observe a sudden decrease in slope from the third principal component onwards. This means the maximum variance is captured by the first two principal components. This point is also called the inflection point.
• The first two principal components capture approximately 62% of the total variance.
• There is a sudden drop in the variance captured after the second principal component.
• 90% of the total variance is captured by the first 8 principal components.
• After the first 11 principal components, there is less than a 1% increase in variance captured consecutively by the remaining principal components. (A sketch of the cumulative variance plot is shown after this list.)
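A minimal sketch of the cumulative explained variance plot, reusing the fitted 'pca' object from the earlier sketches:

import numpy as np
import matplotlib.pyplot as plt

# Cumulative explained variance against the number of components
explained = pca.explained_variance_ratio_
plt.plot(range(1, len(explained) + 1), np.cumsum(explained), marker="o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance")
plt.axhline(0.90, linestyle="--")   # reference line at 90% variance
plt.show()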
2.9 Explain the business implication of using the Principal Component Analysis
for this case study. How may PCs help in the further analysis?
[Hint: Write Interpretations of the Principal Components Obtained]
• I have selected 6 of the 17 new dimensions to explain 83.75% of the variance and reduce the dimensionality of the dataset accordingly.
• The new dimension variables are independent of each other, which also helps certain algorithms.
• The dimensionality reduction obtained from PCA requires less computing power, i.e. faster processing for further analysis.
• The dimensionality reduction also requires less storage space. It also helps address the overfitting issue, which mainly occurs when there are too many variables.
• In our case study, after performing multivariate analysis we observed that many of the variables are correlated. Thus we do not need all these variables for the analysis, but we are not sure which variables to drop and which to keep; hence we perform PCA, which captures the information (in the form of variance) from all these variables into new dimension variables. Based on the information required, we can then select the number of new dimension variables.
• The range of the values is very high; therefore it is important to scale the data.